|
|
@@ -1,4 +1,99 @@ |
|
|
|
OpenBLAS ChangeLog |
|
|
|
==================================================================== |
|
|
|
Version 0.3.29 |
|
|
|
12-Jan-2025 |
|
|
|
|
|
|
|
general: |
|
|
|
- fixed a potential NULL pointer dereference in multithreaded builds |
|
|
|
- added function aliases for GEMMT using its new name GEMMTR adopted by Reference-BLAS |
|
|
|
- fixed a build failure when building without LAPACK_DEPRECATED functions |
|
|
|
- the minimum required CMake version for CMake-based builds was raised to 3.16.0 in order |
|
|
|
to remove many compatibility and deprecation warnings |
|
|
|
- added more detailed CMake rules for OpenMP builds (mainly to support recent LLVM) |
|
|
|
- fixed the behavior of the recently added CBLAS_?GEMMT functions with row-major data |
|
|
|
- improved thread scaling of multithreaded SBGEMV |
|
|
|
- improved thread scaling of multithreaded TRTRI |
|
|
|
- fixed compilation of the CBLAS testsuite with gcc14 (and no Fortran compiler) |
|
|
|
- added support for option handling changes in flang-new from LLVM18 onwards |
|
|
|
- added support for recent calling convention changes in Cray and NVIDIA compilers |
|
|
|
- added support for compilation with the NAG Fortran compiler |
|
|
|
- fixed placement of the -fopenmp flag and libsuffix in the generated pkgconfig file |
|
|
|
- improved the CMakeConfig file generated by the Makefile build |
|
|
|
- fixed const-correctness of cblas_?geadd in cblas.h |
|
|
|
- fixed a potential inaccuracy in multithreaded BLAS3 calls |
|
|
|
- fixed empty implementations of get/set_affinity that print a warning in OpenMP builds |
|
|
|
- fixed function signatures for TRTRS in the converted C version of LAPACK |
|
|
|
- fixed omission of several single-precision LAPACK symbols in the shared library |
|
|
|
- improved build instructions for the provided "pybench" benchmarks |
|
|
|
- improved documentation, including added build instructions for WoA and HarmonyOS |
|
|
|
as well as descriptions of environment variables that affect build and runtime behavior |
|
|
|
- added a separate "make install_tests" target for use with cross-compilations |
|
|
|
- integrated improvements and corrections from Reference-LAPACK: |
|
|
|
- removed a comparison in LAPACKE ?tpmqrt that is always false (LAPACK PR 1062) |
|
|
|
- fixed the leading dimension for B in tests for GGEV (LAPACK PR 1064) |
|
|
|
- replaced the ?LARFT functions with a recursive implementation (LAPACK PR 1080) |
|
|
|
|
|
|
|
arm: |
|
|
|
- fixed build with recent versions of the NDK (missing .type declaration of symbols) |
|
|
|
|
|
|
|
arm64: |
|
|
|
- fixed a long-standing bug in the (generic) c/zgemm_beta kernel that could lead to |
|
|
|
reads and writes outside the array bounds in some circumstances |
|
|
|
- rewrote cpu autodetection to scan all cores and return the highest performing type |
|
|
|
- improved the DGEMM performance for SVE targets and small matrix sizes |
|
|
|
- improved dimension criteria for forwarding from GEMM to GEMV kernels |
|
|
|
- added SVE kernels for ROT and SWAP |
|
|
|
- improved SVE kernels for SGEMV and DGEMV on A64FX and NEOVERSEV1 |
|
|
|
- added support for using the "small matrix" kernels with CMake as well |
|
|
|
- fixed compilation on Windows on Arm |
|
|
|
- improved compile-time detection of SVE capability |
|
|
|
- added cpu autodetection and initial support for Apple M4 |
|
|
|
- added support for compilation on systems running iOS |
|
|
|
- added support for compilation on NetBSD ("evbarm" architecture) |
|
|
|
- fixed NRM2 implementations for generic SVE targets and the Neoverse N2 |
|
|
|
- fixed compilation for SVE-capable targets with the NVIDIA compiler |
|
|
|
|
|
|
|
x86_64: |
|
|
|
- fixed a wrong storage size in the SBGEMV kernel for Cooper Lake |
|
|
|
- added cpu autodetection for Intel Granite Rapids |
|
|
|
- added cpu autodetection for AMD Ryzen 5 series |
|
|
|
- added optimized SOMATCOPY_CT for AVX-capable targets |
|
|
|
- fixed the fallback implementation of GEMM3M in GENERIC builds |
|
|
|
- tentatively re-enabled builds with the EXPRECISION option |
|
|
|
- worked around a miscompilation of tests with mingw32-gfortran14 |
|
|
|
- added support for compilation with the Intel oneAPI 2025.0 compiler on Windows |
|
|
|
|
|
|
|
power: |
|
|
|
- fixed multithreaded SBGEMM |
|
|
|
- fixed a CMake build problem on POWER10 |
|
|
|
- improved the performance of SGEMV |
|
|
|
- added vectorized implementations of SBGEMV and support for forwarding 1xN SBGEMM to them |
|
|
|
- fixed illegal instructions and potential memory overflow in SGEMM on PPCG4 |
|
|
|
- fixed handling of NaN and Inf arguments in SSCAL and DSCAL on PPC440, G4 and 970 |
|
|
|
- added improved CGEMM and ZGEMM kernels for POWER10 |
|
|
|
- added Makefile logic to remove all optimization flags in DEBUG builds |
|
|
|
|
|
|
|
mips64: |
|
|
|
- fixed compilation with gcc14 |
|
|
|
- fixed GEMM parameter selection for the MIPS64_GENERIC target |
|
|
|
- fixed a potential build failure when compiling with OpenMP |
|
|
|
|
|
|
|
loongarch64: |
|
|
|
- fixed compilation for Loongson3 with recent versions of gmake |
|
|
|
- fixed a potential loss of precision in Loongson3A GEMM |
|
|
|
- fixed a potential build failure when compiling with OpenMP |
|
|
|
- added optimized SOMATCOPY for LASX-capable targets |
|
|
|
- introduced a new cpu naming scheme while retaining compatibility |
|
|
|
- added support for cross-compiling LoongArch64 targets with CMake |
|
|
|
- added support for compilation with LLVM |
|
|
|
|
|
|
|
riscv64: |
|
|
|
- removed thread yielding overhead caused by sched_yield |
|
|
|
- replaced some non-standard intrinsics with their official names |
|
|
|
- fixed and sped up the implementations of CGEMM/ZGEMM TCOPY for vector lengths 128 and 256 |
|
|
|
- improved the performance of SNRM2/DNRM2 for RVV1.0 targets |
|
|
|
- added optimized ?OMATCOPY_CN kernels for RVV1.0 targets |
|
|
|
|
|
|
|
==================================================================== |
|
|
|
Version 0.3.28 |
|
|
|
8-Aug-2024 |
|
|
|