Chris Sidebottom
114316f361
Optimize SBGEMM / BGEMM for NEOVERSEV1 further
This changes the kernels to pack full SVE vectors and reduces the
overall complexity of the inner GEMM loop.
1 month ago
Martin Kroeker
f1ee61ea30
Include NEON header for the bfloat conversion functions
1 month ago
Martin Kroeker
b3ffd5524a
Include NEON header for the bfloat conversion functions
1 month ago
Martin Kroeker
a5e7c0e3e0
Merge pull request #5396 from abhishek-iitmadras/abhishekk_bfloat16
ARM64: Enable bfloat16 kernels by default
2 months ago
abhishek-fujitsu
0bc79da587
add neon header
2 months ago
Chris Sidebottom
ea2faf0c9a
Add optimized BGEMM for NEOVERSEN2 target
This re-uses the existing NEOVERSEN2 8x4 `sbgemm` kernel to implement `bgemm`.
2 months ago
Chris Sidebottom
2c3cdaf74e
Optimized BGEMV for NEOVERSEV1 target
- Adds bgemv T based off of sbgemv T kernel
- Adds bgemv N which is slightly alterated to not use Y as an
accumulator due to the output being bf16 which results in loss of
precision
- Enables BGEMM_GEMV_FORWARD to proxy BGEMM to BGEMV with new kernels
2 months ago
Martin Kroeker
39c90f9859
Merge pull request #5380 from quic/topic/sgemm_direct_sme1_alpha_beta
SME1 based direct kernel (with alpha and beta) for cblas_sgemm level 3
2 months ago
Rajendra Prasad Matcha
eae0abfdb6
SME1 based direct kernel with alpha and beta for cblas_sgemm level 3 API.
2 months ago
Chris Sidebottom
740efd71c4
Add optimized BGEMM kernel for NEOVERSEV1 target
This also improves the testing and generic kernel by re-using the BF16
conversion functions.
Built on top of https://github.com/OpenMathLib/OpenBLAS/pull/5357 and derived from https://github.com/OpenMathLib/OpenBLAS/pull/5287
Co-authored-by: Ye Tao <ye.tao@arm.com>
2 months ago
Martin Kroeker
fd37406817
Merge branch 'develop' into optimized_gemv_n_1x3
2 months ago
Iha, Taisei
f7ad906b49
Performance improvements of [SD]DOT with loop-unrolling on A64FX
3 months ago
Martin Kroeker
ee26caffb3
Merge pull request #5309 from davidz-ampere/dev-ampereone
Add support for Ampere AmpereOne processors
3 months ago
davidz-ampere
aa90ab4142
Add support for Ampere AmpereOne processors
3 months ago
Ian McInerney
badef1d32e
Update sbgemm_tcopy_4_neoversev1 kernel to use standard C types
3 months ago
davidz-ampere
84730068af
reduce duplicate kernel code
3 months ago
davidz-ampere
be68ef03b4
Add support for Ampere processors
3 months ago
Martin Kroeker
58eeb9041c
fix handling of dummy2
3 months ago
Martin Kroeker
1589d0b21e
Merge pull request #5281 from martin-frbg/zscal_arm64
kernel/arm64: fixed cscal and zscal
3 months ago
Sharif Inamdar
8279e68805
Optimize gemv_n_sve_v1x3 kernel
- Calculate predicate outside the loop
- Divide matrix in blocks of 3
3 months ago
Arne Juul
5442aff218
Accumulate results in output register explicitly
3 months ago
Martin Kroeker
28f8fdaf0f
support flag for NaN/Inf handling and fix scaling of NaN/Inf values
4 months ago
Martin Kroeker
5141a90993
Fix ARMV9SME target in DYNAMIC_ARCH and add SME query code for MacOS ( #5222 )
* Fix ARMV9SME target and add support_sme1 code for MacOS
* make sgemm_direct unconditionally available on all arm64
* build a (dummy) sgemm_direct kernel on all arm64
* Update dynamic_arm64.c
4 months ago
Martin Kroeker
151b74284e
Merge pull request #5203 from quic/fix-sgemmdirect-sme1
Add vector registers to clobber list to prevent compiler optimization.
4 months ago
abhishek-fujitsu
9c02cdb073
optimise dot using thread throttling for NEOVERSE V1
6 months ago
Martin Kroeker
d0e8fd6d40
Merge pull request #5239 from annop-w/gemv_n_sve
Use SVE kernel for S/DGEMVN for SVE machines
5 months ago
Iha, Taisei
08b5c18d70
fixed a potential out-of-bounds on gemv.
5 months ago
Annop Wongwathanarat
e11744a411
Use SVE kernel for S/DGEMVN for SVE machines
5 months ago
Martin Kroeker
dd38b4e811
Merge pull request #5225 from annop-w/gemv_n
Improve performance for SGEMVN on NEONVERSEN1
5 months ago
Martin Kroeker
0241d516f6
Merge pull request #5220 from iha-taisei/sdgemv_n_unroll
Further performance improvements to non-transposed [SD]GEMV kernels for A64FX and Neoverse V1.
5 months ago
Annop Wongwathanarat
d535728803
Improve performance for SGEMVN on NEONVERSEN1
5 months ago
Usui, Tetsuzo
d711906e3e
Add symv kernels for arm64
5 months ago
Iha, Taisei
f1e628b889
Further performance improvements to [SD]GEMV.
5 months ago
Annop Wongwathanarat
ec146157d3
Use SVE kernel for S/DGEMVT for SVE machines
6 months ago
Vaisakh K V
04915be829
Add vector registers to clobber list to prevent compiler optimization.
SME based SGEMMDIRECT kernel uses the vector registers (z) and adding
clobber list informs compiler not to optimize these registers.
6 months ago
Ye Tao
f27ba5efd1
fix bugs in aarch64 sbgemv_n kernel
6 months ago
Annop Wongwathanarat
edef2e4441
Fix bug in ARM64 sbgemv_t
6 months ago
Martin Kroeker
b55ca71d5b
Merge pull request #5182 from annop-w/sgemm_ncopy
Optimize aarch64 sgemm_ncopy
6 months ago
Martin Kroeker
2f778554b8
Merge pull request #5181 from taoye9/change_sbgemn_cast_bf16
replace customize bf16_to_fp32 with arm neon vcvtah_f32_bf16
6 months ago
Annop Wongwathanarat
9807f56580
Optimize aarch64 sgemm_ncopy
6 months ago
Martin Kroeker
a3e7b16072
Merge pull request #5157 from manaalmj/feature
Optimize gemv_n_sve kernel
6 months ago
Ye Tao
4c00099ed6
replace customize bf16_to_fp32 with arm neon vcvtah_f32_bf16
6 months ago
Annop Wongwathanarat
a085b6c9ec
Fix aarch64 sbgemv_t compilation error for GCC < 13
6 months ago
manjam01
5c4e38ab17
Optimize gemv_n_sve kernel
7 months ago
Martin Kroeker
1d5ed5c46b
Merge pull request #5168 from taoye9/add_sbgemvn_on_neonversen2
Add dispatch of SBGEMVNKERNEL for NEOVERSEN2 and NEOVERSEV2
7 months ago
Ye Tao
6b8b35cdf2
fix minior issues of redeclaration of float x0,x1 in sbgemv_n_neon.c
7 months ago
Ye Tao
38ee7c9301
Add dispatch of SBGEMVNKERNEL for NEOVERSEN2 and NEOVERSEV2
7 months ago
Martin Kroeker
2b941c44b5
Merge branch 'develop' into sbgemv_n_neon
7 months ago
Ye Tao
35bdbca153
Add sbgemv_n_neon kernel for arm64.
7 months ago
Annop Wongwathanarat
edaf51dd99
Add sbgemv_t_bfdot kernel for ARM64
This improves performance for sbgemv_t by up to 100x on NEOVERSEV1.
The geometric mean speedup is ~61x for M=N=[2,512].
7 months ago