Martin Kroeker
fc8894dd98
Workaround miscompilation by NVIDIA nvc
2 years ago
Martin Kroeker
5720fa02c5
Merge pull request #4168 from Mousius/sve-zgemm-cgemm
Use SVE zgemm/cgemm on Arm(R) Neoverse(TM) V1 core
2 years ago
Chris Sidebottom
84a268b6ca
Use SVE zgemm/cgemm on Arm(R) Neoverse(TM) V1 core
This patch removes the prefetches from cgemm/zgemm which improves the performance similar to sgemm/dgemm did in #3868 , this means I'm happy to enable this on any applicable cores.
I also replicated the unrolling the copies from sgemm and dgemm.
2 years ago
Chris Sidebottom
730ca04b48
Fix ZHEMM copy for SVE
Whilst disambiguating whilelt, I inadvertantly used the wrong datatype
for offsets, which can be negative. This rectifies that.
2 years ago
Martin Kroeker
849c8806b8
Merge pull request #4161 from Mousius/non-sve-kernels
Use latest non-SVE kernels in ARMV8SVE
2 years ago
Chris Sidebottom
24586bc4ff
Disambiguate whilelt
2 years ago
Chris Sidebottom
aea2a4622b
Use latest non-SVE kernels in ARMV8SVE
These are generally better and, in some cases, include threading which helps in the cores we're targeting here.
2 years ago
martin-frbg
7976deff80
Fix file permissions (issue 4095)
2 years ago
Martin Kroeker
3d31191b0f
Work around Clang failing to disambiguate SVE intrinsics and add AppleClang crossbuild to MacOS/arm64 DYNAMIC_ARCH in AzureCI ( #4140 )
* Add AppleClang crossbuild to MacOS/arm64 DYNAMIC_ARCH
* add casts to disambiguate svwhilelt for clang
2 years ago
Martin Kroeker
72caceb324
Merge pull request #4009 from Mousius/sve-gemm
Use SVE kernel for SGEMM/DGEMM on Arm(R) Neoverse(TM) V1
2 years ago
Chris Sidebottom
ec334e69dc
Use SVE kernel for SGEMM/DGEMM on Arm(R) Neoverse(TM) V1
This re-spins #3869 with some additional copy unrolling which helps maintain SYRK performance.
After #3868 , the SVE kernels represent a pretty good boost.
This re-uses ARMV8SVE as a base and I'm going to incrementally move everything to use ARMV8SVE in additional patches (as well as fix up anything that's not already in ARMV8SVE).
2 years ago
Martin Kroeker
44164e3a3d
revert "move alpha out of register 18" (out of PR scope, no SVE on Apple hw)
2 years ago
Martin Kroeker
8be68fa7f4
move declaration of sca to really keep the compiler from throwing it out (for now)
2 years ago
Martin Kroeker
3727672a74
Improve workaround and keep compilers from optimizing it out
2 years ago
Martin Kroeker
108a21e47a
Move ALPHA out of register 18 (reserved on OSX)
2 years ago
Martin Kroeker
0b1acb0ba3
Move ALPHA_I out of register 18 (reserved on OSX)
2 years ago
Martin Kroeker
c7bbad09ad
Move ALPHA_I out of register 18 (reserved on OSX)
2 years ago
Martin Kroeker
cda29633a3
move ALPHA_I out of register 18 (reserved on OSX)
2 years ago
Martin Kroeker
09ace3cf23
Merge pull request #3846 from lilh9598/sbgemm_opt
Improve the performance of sbgemm_tcopy on neoversen2
2 years ago
Chris Sidebottom
1361229291
Remove prefetches from SVE kernels
This is a precursor to enabling the SVE kernels for Arm(R) Neoverse(TM)
V1 which has 256-bit SVE. Testing revealed that the SVE kernel was
actually worse in some cases than the existing kernel which seemed odd -
removing these prefetches the underlying architecture seems to do a better job
😸
2 years ago
lilianhuang
729af6406f
bugfix for sbgemm_ncopy_8_neoversen2
2 years ago
Chris Sidebottom
eea006a688
Wrap SVE header with __has_include check
2 years ago
Chris Sidebottom
fd4f52c797
Add SVE implementation for sdot/ddot
This adds an SVE implementation to sdot/ddot when available, falling back to the previous Advanced SIMD kernel where there's no SVE implementation for the kernel.
All the targets were essentially treating `dot_thunderx2t99.c` as the Advanced SIMD implementation so I've renamed it to better fit with the feature detection.
2 years ago
lilianhuang
fdac8a97c1
Add sbgemm_ncopy_8 and sbgemm_tcopy_4
2 years ago
lilianhuang
135718eafc
Improve the performance of sbgemm_tcopy on neoversen2
2 years ago
Chris Sidebottom
4f7b77e08a
Remove unnecessary instructions from Advanced SIMD dot
The existing kernel was issuing extra instructions to organise the arguments into the same registers they would usually be in and similarly to put the result into the appropriate register.
This has an impact on smaller sized dots and seemed like a quick fix
2 years ago
Martin Kroeker
1688c7da43
change line endings from CRLF to LF
2 years ago
Honglin Zhu
79066b6bf3
Change file name to match the norm and delete useless code.
2 years ago
Honglin Zhu
b00d5b9746
New sbgemm implementation for Neoverse N2
1. Use UZP instructions but not gather load and scatter store instructions to get lower latency.
2. Padding k to a power of 4.
3 years ago
Martin Kroeker
e12d474780
Eliminate uses of CREAL on left-hand side of assignments
3 years ago
Martin Kroeker
9e29598575
workaround fault with ssq=inf,scale=0
3 years ago
Honglin Zhu
123e0dfb62
Neoverse N2 sbgemm:
1. Modify the algorithm to resolve multithreading failures
2. No memory allocation in sbgemm kernel
3. Optimize when alpha == 1.0f
3 years ago
Honglin Zhu
bc3728475f
format code
3 years ago
Honglin Zhu
55d686d41e
neoverse n2 sbgemm:
implement ncopy tcopy kernel_8x4
3 years ago
Honglin Zhu
04593bb27c
neoverse n2 sbgemm: init file
3 years ago
Nursultan Zarlyk
1bb7993a97
Fix MSVC ARM64 build. Add generic kernel for ARM64
3 years ago
Martin Kroeker
115bc9b98f
CortexX1 is ARMV8 like A7x
3 years ago
Martin Kroeker
b3b4672c30
Add initial support for Phytium FT2000 series and ARMV9 Cortex 510/710/X1/X2
3 years ago
Martin Kroeker
c1c0d5ce1d
Merge pull request #3492 from binebrank/arm_sve_zgemm
SVE zgemm&cgemm (and other BLAS 3 complex)
3 years ago
Bine Brank
19d435b1b3
update armv8sve + contributors
3 years ago
Bine Brank
0fb6cc07bf
fix ztrsm lt/ut copy
3 years ago
Bine Brank
f1315288a8
add sve ztrsm
3 years ago
Bine Brank
aaa2b1a861
fix sve dtrsm kernels
3 years ago
Bine Brank
8071e179f1
add remaining sve trsm copy kernels
3 years ago
Bine Brank
f87468ac91
trsm_lncopy_sve
3 years ago
Bine Brank
e8939b3d30
sve trsmRN and trsmRT
3 years ago
Bine Brank
098672b51b
add trsm_kernel_LT_sve
3 years ago
Bine Brank
be7e55880c
sve trsm_kernel_LN
3 years ago
Sunita Nadampalli
19c8f615dc
OpenBLAS: aarch64: Add neoverse-v1/n2 architecture specifics
3 years ago
Bine Brank
f33543d029
combine zchemm into single file
3 years ago