Xianyi Zhang
57ed58cefe
Refs #2587 Add small matrix optimization reference kernel for c/zgemm.
5 years ago
Xianyi Zhang
17d32a4a82
Change a1b0 gemm to b0 gemm.
5 years ago
Xianyi Zhang
59cb5de46b
Refs #2587 Fix typos.
5 years ago
Xianyi Zhang
be3349405d
Add alpha=1.0 beta=0.0 for small gemm.
5 years ago
Xianyi Zhang
0a2077901c
Add small matrix optimization kernel interface.
make SMALL_MATRIX_OPT=1
5 years ago
gxw
0b8f7c8c10
Add cmake support for LOONGARCH64
4 years ago
gxw
af0a69f355
Add support for LOONGARCH64
4 years ago
Martin Kroeker
49bbf330ca
Empirical workaround for numpy SVD NaN problem from issue 3318
4 years ago
Martin Kroeker
5b4b385ecf
Temporarily disable the SkylakeX sgemv_t microkernel due to LAPACK testsuite failures
4 years ago
User User-User
39ef0880ae
copy conf
4 years ago
Martin Kroeker
c4b464cac6
Merge pull request #3273 from austinpagan/sbgemm_gcc10_fix
Power10: Fix for SBGEMM
4 years ago
Gordon Fossum
e6dd44d989
Power10: Fix for SBGEMM
While testing bfloat16 sbgemm kernel, there are some failures for odd value inputs due to updating result for
additional bytes.
4 years ago
Gilles Gouaillardet
9d292d37b2
arm64: add the missing d9 register to the clobber list
Refs. numpy/numpy#18422
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
4 years ago
Martin Kroeker
2e8ff4a781
Merge pull request #3266 from martin-frbg/powerparam
Remove spurious casts from PPC parameters and fix compilation for older targets
4 years ago
Martin Kroeker
dbba381dc3
Merge pull request #3260 from intelmy/sgemv_t_opt
Optimized sgemv_t for small N based on AVX512
4 years ago
Martin Kroeker
efdbdd8f82
Add prefetch values for power3
4 years ago
Martin Kroeker
3906ef3b0f
Add prefetch values for power3
4 years ago
Martin Kroeker
8adf0971d8
Add prefetch values for power3
4 years ago
Martin Kroeker
08e2e60762
Add prefetch values for power3
4 years ago
Martin Kroeker
fb9e678235
Fix caxpy/zaxpy for big-endian
4 years ago
Martin Kroeker
dc4fcb48df
Fix inverted conditional for caxpy/zaxpy
4 years ago
Martin Kroeker
7a48247761
fix c/zrot and sgemv for POWER5
4 years ago
Rajalakshmi Srinivasaraghavan
cbb70438df
POWER10: Fixes for sbgemm kernel
While testing bfloat16 sbgemm kernel, there are some failures
for odd value inputs due to array access beyond the boundary.
4 years ago
Ma, Yu
706a08d4a0
Optimized sgemv_t for small N based on AVX512
4 years ago
Zhaofeng Li
590be3fae3
riscv64: Add Makefile
4 years ago
Zhaofeng Li
3521cd48cb
RISCV64_GENERIC: Use generic kernel for DSDOT for better precision
The implementation in `riscv64/dot.c` fails the `test_dsdot` test, and
the generic kernel seems to have better precision. Tested on SiFive
FU740 (HiFive Unmatched) and QEMU.
Also see #1469 .
4 years ago
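The precision point in the DSDOT commit above can be illustrated with a minimal sketch. It assumes only the standard BLAS meaning of DSDOT (single-precision inputs, double-precision accumulation and result). The function names `dsdot_sketch` and `sdot_sketch` are hypothetical, strides are omitted for brevity, and this reproduces neither `riscv64/dot.c` nor the generic OpenBLAS kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration: dot product of two float vectors where every
 * product is widened to double and summed in a double accumulator. */
double dsdot_sketch(size_t n, const float *x, const float *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        acc += (double)x[i] * (double)y[i];
    }
    return acc;
}

/* Contrast case (also hypothetical): the same loop with a float accumulator.
 * Small terms can be lost when they are added to a large partial sum. */
float sdot_sketch(size_t n, const float *x, const float *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += x[i] * y[i];
    }
    return acc;
}
```

With inputs such as `{1e8f, 1.0f, -1e8f}` dotted against `{1.0f, 1.0f, 1.0f}`, the double accumulator keeps the middle term, while a float accumulator may drop it, depending on how the compiler evaluates intermediate results.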
Zhaofeng Li
1e0192a5cc
riscv64/imin: Fix wrong comparison
Same as #1990 .
4 years ago
Martin Kroeker
5f677e782e
Merge pull request #3196 from guowangy/skylakex-gemm-batch-k
GEMM: skylake: improve the performance when m is small
4 years ago
Martin Kroeker
02087a62e7
Merge pull request #3205 from intelmy/sgemv_n_opt
optimize on sgemv_n for small n
4 years ago
Martin Kroeker
4ecf631f95
Merge pull request #3228 from martin-frbg/issue3226
filter out -mavx flag on Sandybridge zgemm/ztrmm kernels
4 years ago
Martin Kroeker
310b76aad7
Merge pull request #3231 from martin-frbg/issue3227
Support compilation with pre-C99 versions of MSVC
4 years ago
Martin Kroeker
c4da892ba0
Only filter out -mavx on Sandybridge ZGEMM/ZTRMM kernels
4 years ago
Martin Kroeker
8b90e5f202
Drop redundant inclusion of complex.h
4 years ago
Martin Kroeker
bd60fb6ffc
filter out -mavx flag on zgemm kernels as it can cause problems with older gcc
4 years ago
Martin Kroeker
37ea8702ee
Merge pull request #3192 from damonyu1989/develop
Update the intrinsic api to the official name.
4 years ago
Martin Kroeker
c0ca63ea46
Fix missing conditionals for non-SKX kernels
4 years ago
pnp
3d4ccd2a13
fix for build error
4 years ago
pnp
c59652f0ce
optimize on sgemv_n for small n
4 years ago
Wangyang Guo
aa7b3dc3db
GEMM: skylake: improve the performance when m is small
4 years ago
damonyu
ceb44bef14
update the intrinsic api to the official name.
4 years ago
Martin Kroeker
3d511f0e66
replace spurious avx512 requirement with fma check
4 years ago
Rajalakshmi Srinivasaraghavan
2379abaa5e
POWER10: Improve dgemm performance
This patch uses vector pair pointer for input load operation
which helps to generate power10 lxvp instructions.
4 years ago
Rajalakshmi Srinivasaraghavan
55bb9f639a
POWER10: Optimized zgemv
This patch makes use of Matrix-Multiply Assist (MMA)
feature introduced in POWER ISA v3.1 for zgemv_n and zgemv_t.
4 years ago
Martin Kroeker
2dfb24730d
Use "old" compute(24) function with clang due to register limitations
4 years ago
Martin Kroeker
147e0a75fd
Merge pull request #3170 from CodesWithWolves/sgemm_tcopy_16-invalid-read
Remove Unnecessary/Erroneous Adds/Reads In sgemm_tcopy_16.S COPY1x8 Macro
4 years ago
Rajalakshmi Srinivasaraghavan
2dbcddd83d
POWER10: Adding check for little endian
This patch makes sure that recent POWER10 patches are used
only for little endian.
4 years ago
CodesWithWolves
d2bda3b56a
Remove Unnecessary/Erroneous Reads In sgemm_tcopy_16.S COPY1x8 Macro
There appears to have been some code leak when copying from the COPY2x8
macro above where we're reading 8 bytes into d4-d7 directly after
reading 4 bytes into s4-s7. These 32 bytes in d4-7 are unused and can
possibly overrun the boundary of allocated memory -- Valgrind detected
this which is what drew my attention to it for a 128,1 copy.
Additionally, there is no need to update the addresses stored in A0-A7
as the only possible paths after running this macro will overwrite A0-7
if looping to the next 8 rows, or overwrite A0-3 if moving to 4 rows --
in which case A4-7 are unused.
4 years ago
Martin Kroeker
bdd6e3a153
Merge pull request #3157 from martin-frbg/issue3020-final
Add workaround for LAPACK testsuite failures with the NVIDIA HPC compiler on PPC
4 years ago
Martin Kroeker
7b8f580941
Merge pull request #3156 from martin-frbg/omatcopy_d
Move x86_64 DOMATCOPY_RT back to the C implementation
4 years ago
Martin Kroeker
86c5a0013f
Add workaround for LAPACK testsuite failures with the NVIDIA HPC compiler
4 years ago