Martin Kroeker
3906ef3b0f
Add prefetch values for power3
4 years ago
Martin Kroeker
8adf0971d8
Add prefetch values for power3
4 years ago
Martin Kroeker
08e2e60762
Add prefetch values for power3
4 years ago
Martin Kroeker
fb9e678235
Fix caxpy/zaxpy for big-endian
4 years ago
Martin Kroeker
dc4fcb48df
Fix inverted conditional for caxpy/zaxpy
4 years ago
Martin Kroeker
7a48247761
fix c/zrot and sgemv for POWER5
4 years ago
Zhaofeng Li
590be3fae3
riscv64: Add Makefile
4 years ago
Zhaofeng Li
3521cd48cb
RISCV64_GENERIC: Use generic kernel for DSDOT for better precision
The implementation in `riscv64/dot.c` fails the `test_dsdot` test, and
the generic kernel seems to have better precision. Tested on SiFive
FU740 (HiFive Unmatched) and QEMU.
Also see #1469.
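For context, DSDOT takes single-precision inputs but returns a double-precision result, so accumulating in double is what preserves precision. The following is a minimal sketch of that idea only; `dsdot_sketch` is a hypothetical name, it assumes non-negative strides, and it is not the code from either the `riscv64/dot.c` kernel or the generic kernel.

```c
#include <stddef.h>

/* Hypothetical illustration, not an OpenBLAS kernel: dot product of two
   float vectors with every product and the running sum kept in double.
   Assumes non-negative element strides incx and incy. */
double dsdot_sketch(size_t n, const float *x, size_t incx,
                    const float *y, size_t incy)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Widen before multiplying so no rounding to float occurs. */
        acc += (double)x[i * incx] * (double)y[i * incy];
    }
    return acc;
}
```

With a float accumulator, adding 1.0f to 16777216.0f is lost to rounding; the double accumulator above keeps it.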
4 years ago
Zhaofeng Li
1e0192a5cc
riscv64/imin: Fix wrong comparison
Same as #1990.
4 years ago
Martin Kroeker
5f677e782e
Merge pull request #3196 from guowangy/skylakex-gemm-batch-k
GEMM: skylake: improve the performance when m is small
4 years ago
Martin Kroeker
02087a62e7
Merge pull request #3205 from intelmy/sgemv_n_opt
optimize on sgemv_n for small n
4 years ago
Martin Kroeker
4ecf631f95
Merge pull request #3228 from martin-frbg/issue3226
filter out -mavx flag on Sandybridge zgemm/ztrmm kernels
4 years ago
Martin Kroeker
310b76aad7
Merge pull request #3231 from martin-frbg/issue3227
Support compilation with pre-C99 versions of MSVC
4 years ago
Martin Kroeker
c4da892ba0
Only filter out -mavx on Sandybridge ZGEMM/ZTRMM kernels
4 years ago
Martin Kroeker
8b90e5f202
Drop redundant inclusion of complex.h
4 years ago
Martin Kroeker
bd60fb6ffc
filter out -mavx flag on zgemm kernels as it can cause problems with older gcc
4 years ago
Martin Kroeker
37ea8702ee
Merge pull request #3192 from damonyu1989/develop
Update the intrinsic API to the official name
4 years ago
Martin Kroeker
c0ca63ea46
Fix missing conditionals for non-SKX kernels
4 years ago
pnp
3d4ccd2a13
fix for build error
4 years ago
pnp
c59652f0ce
optimize on sgemv_n for small n
4 years ago
Wangyang Guo
aa7b3dc3db
GEMM: skylake: improve the performance when m is small
4 years ago
damonyu
ceb44bef14
Update the intrinsic API to the official name
4 years ago
Martin Kroeker
3d511f0e66
replace spurious avx512 requirement with fma check
4 years ago
Rajalakshmi Srinivasaraghavan
2379abaa5e
POWER10: Improve dgemm performance
This patch uses a vector pair pointer for the input load operations,
which helps the compiler generate POWER10 lxvp instructions.
4 years ago
Rajalakshmi Srinivasaraghavan
55bb9f639a
POWER10: Optimized zgemv
This patch makes use of the Matrix-Multiply Assist (MMA)
feature introduced in POWER ISA v3.1 for zgemv_n and zgemv_t.
4 years ago
Martin Kroeker
2dfb24730d
Use "old" compute(24) function with clang due to register limitations
4 years ago
Martin Kroeker
147e0a75fd
Merge pull request #3170 from CodesWithWolves/sgemm_tcopy_16-invalid-read
Remove Unnecessary/Erroneous Adds/Reads In sgemm_tcopy_16.S COPY1x8 Macro
4 years ago
Rajalakshmi Srinivasaraghavan
2dbcddd83d
POWER10: Adding check for little endian
This patch makes sure that the recent POWER10 patches are used
only on little-endian systems.
4 years ago
CodesWithWolves
d2bda3b56a
Remove Unnecessary/Erroneous Reads In sgemm_tcopy_16.S COPY1x8 Macro
Some code appears to have leaked in when copying from the COPY2x8
macro above: we read 8 bytes into d4-d7 directly after
reading 4 bytes into s4-s7. These 32 bytes in d4-d7 are unused and can
possibly overrun the boundary of allocated memory. Valgrind detected
this for a 128,1 copy, which is what drew my attention to it.
Additionally, there is no need to update the addresses stored in A0-A7,
as the only possible paths after running this macro either overwrite A0-A7
(when looping to the next 8 rows) or overwrite A0-A3 (when moving to 4 rows),
in which case A4-A7 are unused.
4 years ago
Martin Kroeker
bdd6e3a153
Merge pull request #3157 from martin-frbg/issue3020-final
Add workaround for LAPACK testsuite failures with the NVIDIA HPC compiler on PPC
4 years ago
Martin Kroeker
7b8f580941
Merge pull request #3156 from martin-frbg/omatcopy_d
Move x86_64 DOMATCOPY_RT back to the C implementation
4 years ago
Martin Kroeker
86c5a0013f
Add workaround for LAPACK testsuite failures with the NVIDIA HPC compiler
4 years ago
Martin Kroeker
ef85c22474
Add workaround for LAPACK test failures with the NVIDIA HPC compiler
4 years ago
Martin Kroeker
d3555d2e50
Add workaround for LAPACK test failures with the NVIDIA HPC compiler
4 years ago
Martin Kroeker
0f5e86a0d9
Remove premature entry for DOMATCOPY_RT
4 years ago
Martin Kroeker
7b294a99fd
Move common.h back to the top of the file so that SKYLAKEX (from config.h) is defined in time
4 years ago
Martin Kroeker
0934568d9c
Move includes under the ifdef for compilers w/o intrinsics support
4 years ago
Rajalakshmi Srinivasaraghavan
09d47af2c0
Optimize zscal function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
4 years ago
Martin Kroeker
ef0238ba2b
Merge pull request #3130 from martin-frbg/issue3128
Replace spurious AVX512 requirement in the Haswell srot microkernel with an AVX2/FMA3 guard
4 years ago
Martin Kroeker
a9f6f7ad39
Remove spurious AVX512 requirement and add AVX2/FMA3 guard
4 years ago
Rajalakshmi Srinivasaraghavan
41646ed006
Optimize s/dasum function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
4 years ago
Rajalakshmi Srinivasaraghavan
0571c3187b
POWER10: Rename mma builtins
The LLVM and GCC teams agreed to rename the __builtin_mma_assemble_pair and
__builtin_mma_disassemble_pair built-ins to __builtin_vsx_assemble_pair and
__builtin_vsx_disassemble_pair respectively. This patch makes the
corresponding changes in the dgemm kernel. It also changes the
inputs to those built-ins to avoid some potential typecasting issues.
Reference GCC commit id: 77ef995c1fbcab76a2a69b9f4700bcfd005d8e62
4 years ago
Martin Kroeker
292d1af1a0
Update omatcopy_rt.c
4 years ago
Martin Kroeker
325b398e3c
Update omatcopy_rt.c
4 years ago
Martin Kroeker
6f5667b4d4
Enable optimized S/D OMATCOPY_RT
4 years ago
Martin Kroeker
cceeee7806
Add optimized omatcopy_rt
4 years ago
Martin Kroeker
0a4546b742
Typo fix
4 years ago
Martin Kroeker
b1eed27a54
Replace naive omatcopy_rt with 4x4 blocked implementation
as suggested by MigMuc in issue 2532
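To illustrate what a 4x4 blocked transpose-copy looks like, here is a minimal sketch assuming row-major storage and the operation B = alpha * A^T. The name `omatcopy_rt_blocked` and its signature are hypothetical; this is not the kernel added by this commit, only the general tiling idea.

```c
#include <stddef.h>

/* Hypothetical sketch, not the OpenBLAS kernel: B = alpha * A^T, where A is
   a row-major rows x cols matrix (leading dimension lda) and B is a
   row-major cols x rows matrix (leading dimension ldb). Walking the
   matrix in 4x4 tiles keeps reads and writes within a small working set. */
void omatcopy_rt_blocked(size_t rows, size_t cols, double alpha,
                         const double *a, size_t lda,
                         double *b, size_t ldb)
{
    const size_t bs = 4; /* tile edge length */
    for (size_t i0 = 0; i0 < rows; i0 += bs) {
        size_t i1 = (i0 + bs < rows) ? i0 + bs : rows; /* clamp edge tile */
        for (size_t j0 = 0; j0 < cols; j0 += bs) {
            size_t j1 = (j0 + bs < cols) ? j0 + bs : cols;
            for (size_t i = i0; i < i1; i++) {
                for (size_t j = j0; j < j1; j++) {
                    b[j * ldb + i] = alpha * a[i * lda + j];
                }
            }
        }
    }
}
```

Edge tiles are clamped rather than handled by separate cleanup loops, which keeps the sketch short at the cost of a few extra comparisons.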
4 years ago
Martin Kroeker
47691c031f
Use Haswell optimizations for Zen as well
4 years ago
Martin Kroeker
ce7ddd8921
Use Haswell optimizations for Zen as well
4 years ago