OpenBLAS

Commit Graph

Author	SHA1	Message	Date
Martin Kroeker	3906ef3b0f	Add prefetch values for power3	4 years ago
Martin Kroeker	8adf0971d8	Add prefetch values for power3	4 years ago
Martin Kroeker	08e2e60762	Add prefetch values for power3	4 years ago
Martin Kroeker	fb9e678235	Fix caxpy/zaxpy for big-endian	4 years ago
Martin Kroeker	dc4fcb48df	Fix inverted conditional for caxpy/zaxpy	4 years ago
Martin Kroeker	7a48247761	fix c/zrot and sgemv for POWER5	4 years ago
Zhaofeng Li	590be3fae3	riscv64: Add Makefile	4 years ago
Zhaofeng Li	3521cd48cb	RISCV64_GENERIC: Use generic kernel for DSDOT for better precision. The implementation in `riscv64/dot.c` fails the `test_dsdot` test, and the generic kernel seems to have better precision. Tested on SiFive FU740 (HiFive Unmatched) and QEMU. Also see #1469.	4 years ago
Zhaofeng Li	1e0192a5cc	riscv64/imin: Fix wrong comparison. Same as #1990.	4 years ago
Martin Kroeker	5f677e782e	Merge pull request #3196 from guowangy/skylakex-gemm-batch-k GEMM: skylake: improve the performance when m is small	4 years ago
Martin Kroeker	02087a62e7	Merge pull request #3205 from intelmy/sgemv_n_opt optimize on sgemv_n for small n	4 years ago
Martin Kroeker	4ecf631f95	Merge pull request #3228 from martin-frbg/issue3226 filter out -mavx flag on Sandybridge zgemm/ztrmm kernels	4 years ago
Martin Kroeker	310b76aad7	Merge pull request #3231 from martin-frbg/issue3227 Support compilation with pre-C99 versions of MSVC	4 years ago
Martin Kroeker	c4da892ba0	Only filter out -mavx on Sandybridge ZGEMM/ZTRMM kernels	4 years ago
Martin Kroeker	8b90e5f202	Drop redundant inclusion of complex.h	4 years ago
Martin Kroeker	bd60fb6ffc	filter out -mavx flag on zgemm kernels as it can cause problems with older gcc	4 years ago
Martin Kroeker	37ea8702ee	Merge pull request #3192 from damonyu1989/develop Update the intrinsic api to the offical name.	4 years ago
Martin Kroeker	c0ca63ea46	Fix missing conditionals for non-SKX kernels	4 years ago
pnp	3d4ccd2a13	fix for build error	4 years ago
pnp	c59652f0ce	optimize on sgemv_n for small n	4 years ago
Wangyang Guo	aa7b3dc3db	GEMM: skylake: improve the performance when m is small	4 years ago
damonyu	ceb44bef14	update the intrinsic api to the offical name.	4 years ago
Martin Kroeker	3d511f0e66	replace spurious avx512 requirement with fma check	4 years ago
Rajalakshmi Srinivasaraghavan	2379abaa5e	POWER10: Improve dgemm performance. This patch uses vector pair pointer for input load operation which helps to generate power10 lxvp instructions.	4 years ago
Rajalakshmi Srinivasaraghavan	55bb9f639a	POWER10: Optimized zgemv. This patch makes use of Matrix-Multiply Assist (MMA) feature introduced in POWER ISA v3.1 for zgemv_n and zgemv_t.	4 years ago
Martin Kroeker	2dfb24730d	Use "old" compute(24) function with clang due to register limitations	4 years ago
Martin Kroeker	147e0a75fd	Merge pull request #3170 from CodesWithWolves/sgemm_tcopy_16-invalid-read Remove Unnecessary/Erroneous Adds/Reads In sgemm_tcopy_16.S COPY1x8 Macro	4 years ago
Rajalakshmi Srinivasaraghavan	2dbcddd83d	POWER10: Adding check for little endian. This patch makes sure that recent POWER10 patches are used only for little endian.	4 years ago
CodesWithWolves	d2bda3b56a	Remove Unnecessary/Erroneous Reads In sgemm_tcopy_16.S COPY1x8 Macro. There appears to have been some code leak when copying from the COPY2x8 macro above where we're reading 8 bytes into d4-d7 directly after reading 4 bytes into s4-s7. These 32 bytes in d4-7 are unused and can possibly overrun the boundary of allocated memory -- Valgrind detected this which is what dragged my attention to it for a 128,1 copy. Additionally, there is no need to update the addresses stored in A0-A7 as the only possible paths after running this macro will overwrite A0-7 if looping to the next 8 rows, or overwrite A0-3 if moving to 4 rows -- in which case A4-7 are unused.	4 years ago
Martin Kroeker	bdd6e3a153	Merge pull request #3157 from martin-frbg/issue3020-final Add workaround for LAPACK testsuite failures with the NVIDIA HPC compiler on PPC	4 years ago
Martin Kroeker	7b8f580941	Merge pull request #3156 from martin-frbg/omatcopy_d Move x86_64 DOMATCOPY_RT back to the C implementation	4 years ago
Martin Kroeker	86c5a0013f	Add workaround for LAPACK testsuite failures with the NVIDIA HPC compiler	4 years ago
Martin Kroeker	ef85c22474	Add workaround for LAPACK test failures with the NVIDIA HPC compiler	4 years ago
Martin Kroeker	d3555d2e50	Add workaround for LAPACK test failures with the NVIDIA HPC compiler	4 years ago
Martin Kroeker	0f5e86a0d9	Remove premature entry for DOMATCOPY_RT	4 years ago
Martin Kroeker	7b294a99fd	Move common.h back to the top of the file so that SKYLAKEX (from config.h) is defined in time	4 years ago
Martin Kroeker	0934568d9c	Move includes under the ifdef for compilers w/o intrinsics support	4 years ago
Rajalakshmi Srinivasaraghavan	09d47af2c0	Optimize zscal function for POWER10. This patch makes use of new POWER10 vector pair instructions for loads and stores.	4 years ago
Martin Kroeker	ef0238ba2b	Merge pull request #3130 from martin-frbg/issue3128 Replace spurious AVX512 requirement in the Haswell srot microkernel with an AVX2/FMA3 guard	4 years ago
Martin Kroeker	a9f6f7ad39	Remove spurious AVX512 requirement and add AVX2/FMA3 guard	4 years ago
Rajalakshmi Srinivasaraghavan	41646ed006	Optimize s/dasum function for POWER10. This patch makes use of new POWER10 vector pair instructions for loads and stores.	4 years ago
Rajalakshmi Srinivasaraghavan	0571c3187b	POWER10: Rename mma builtins. The LLVM and GCC teams agreed to rename the __builtin_mma_assemble_pair and __builtin_mma_disassemble_pair built-ins to __builtin_vsx_assemble_pair and __builtin_vsx_disassemble_pair respectively. This patch is to make corresponding changes in dgemm kernel. Also made changes in inputs to those builtins to avoid some potential typecasting issues. Reference gcc commit id:77ef995c1fbcab76a2a69b9f4700bcfd005d8e62	4 years ago
Martin Kroeker	292d1af1a0	Update omatcopy_rt.c	4 years ago
Martin Kroeker	325b398e3c	Update omatcopy_rt.c	4 years ago
Martin Kroeker	6f5667b4d4	Enable optimized S/D OMATCOPY_RT	4 years ago
Martin Kroeker	cceeee7806	Add optimized omatcopy_rt	4 years ago
Martin Kroeker	0a4546b742	Typo fix	4 years ago
Martin Kroeker	b1eed27a54	Replace naive omatcopy_rt with 4x4 blocked implementation as suggested by MigMuc in issue 2532	4 years ago
Martin Kroeker	47691c031f	Use Haswell optimizations for Zen as well	4 years ago
Martin Kroeker	ce7ddd8921	Use Haswell optimizations for Zen as well	4 years ago

1663 Commits (3906ef3b0fb19e7436f2b4cf6394b11f3466b1f3)