|
@@ -1,4 +1,127 @@ |
|
|
OpenBLAS ChangeLog |
|
|
|
|
|
|
|
==================================================================== |
|
|
|
|
|
Version 0.3.28 |
|
|
|
|
|
8-Aug-2024 |
|
|
|
|
|
|
|
|
|
|
|
general: |
|
|
|
|
|
- Reworked the unfinished implementation of HUGETLB from GotoBLAS |
|
|
|
|
|
for allocating huge memory pages as buffers on suitable systems |
|
|
|
|
|
- Changed the unfinished implementation of GEMM3M for the generic |
|
|
|
|
|
target on all architectures to at least forward to regular GEMM |
|
|
|
|
|
- Improved multithreaded GEMM performance for large non-skinny matrices |
|
|
|
|
|
- Improved BLAS3 performance on larger multicore systems through improved |
|
|
|
|
|
parallelism |
|
|
|
|
|
- Improved performance of the initial memory allocation by reducing |
|
|
|
|
|
locking overhead |
|
|
|
|
|
- Improved performance of GBMV at small problem sizes by introducing |
|
|
|
|
|
a size barrier for the switch to multithreading |
|
|
|
|
|
- Added an implementation of the CBLAS_GEMM_BATCH extension |
|
|
|
|
|
- Fixed miscompilation of CAXPYC and ZAXPYC on all architectures in |
|
|
|
|
|
CMAKE builds (error introduced in 0.3.27) |
|
|
|
|
|
- Fixed corner cases involving the handling of NAN and INFINITY |
|
|
|
|
|
arguments in ?SCAL on all architectures |
|
|
|
|
|
- Added support for cross-compiling to WebAssembly (WASM) with CMAKE (in addition |
|
|
|
|
|
to the already present makefile support) |
|
|
|
|
|
- Fixed NAN handling and potential accuracy issues in compilations with |
|
|
|
|
|
Intel ICX by supplying a suitable fp-model option by default |
|
|
|
|
|
- The contents of the GitHub project wiki have been converted into |
|
|
|
|
|
a new set of documentation included with the source code |
|
|
|
|
|
- It is now possible to register a callback function that replaces |
|
|
|
|
|
the built-in support for multithreading with an external backend |
|
|
|
|
|
like TBB (openblas_set_threads_callback_function) |
|
|
|
|
|
- Fixed potential duplication of suffixes in shared library naming |
|
|
|
|
|
- Improved C compiler detection by the build system to tolerate more |
|
|
|
|
|
naming variants for gcc builds |
|
|
|
|
|
- Fixed an unnecessary dependency of the utest on CBLAS |
|
|
|
|
|
- Fixed spurious error reports from the BLAS extensions utest |
|
|
|
|
|
- Fixed unwanted invocation of the GEMM3M tests in cross-compilation |
|
|
|
|
|
- Fixed a flaw in the makefile build that could lead to the pkgconfig |
|
|
|
|
|
file containing an entry of UNKNOWN for the target cpu after installing |
|
|
|
|
|
- Integrated fixes from the Reference-LAPACK project: |
|
|
|
|
|
- Fixed uninitialized variables in the LAPACK tests for ?QP3RK (PR 961) |
|
|
|
|
|
- Fixed potential bounds error in ?UNHR_COL/?ORHR_COL (PR 1018) |
|
|
|
|
|
- Fixed potential infinite loop in the LAPACK testsuite (PR 1024) |
|
|
|
|
|
- Made the variable type used for hidden length arguments configurable (PR 1025) |
|
|
|
|
|
- Fixed SYTRD workspace computation and various typos (PR 1030) |
|
|
|
|
|
- Prevented compiler use of FMA that could increase numerical error in ?GEEVX (PR 1033) |
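
The callback entry above (openblas_set_threads_callback_function) describes handing the library's parallel work to an external backend. The idea can be sketched conceptually as follows. This is a toy model in Python, not the OpenBLAS C API: the `Library` class, `set_threads_callback`, `scale_blocks` and `run_jobs_with_backend` are hypothetical names invented for illustration, and the real callback signature should be taken from the OpenBLAS headers.

```python
from concurrent.futures import ThreadPoolExecutor


def run_jobs_with_backend(job, num_jobs, job_data, max_workers=4):
    """Hypothetical external backend: run job(i, job_data[i]) for all i, then wait."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(job, i, job_data[i]) for i in range(num_jobs)]
        for future in futures:
            future.result()  # block until done and re-raise worker errors


class Library:
    """Toy stand-in for a library whose threading can be replaced (not OpenBLAS)."""

    def __init__(self):
        self._threads_callback = None

    def set_threads_callback(self, callback):
        # Register the function that will execute all parallel jobs from now on.
        self._threads_callback = callback

    def scale_blocks(self, blocks, alpha):
        # Each block is an independent job, scaled in place.
        def job(index, block):
            for k in range(len(block)):
                block[k] *= alpha

        if self._threads_callback is None:
            # Built-in path: run the jobs one after another.
            for i, block in enumerate(blocks):
                job(i, block)
        else:
            # External path: the registered backend decides how to run the jobs.
            self._threads_callback(job, len(blocks), blocks)


lib = Library()
lib.set_threads_callback(run_jobs_with_backend)
blocks = [[1.0, 2.0], [3.0, 4.0], [5.0]]
lib.scale_blocks(blocks, 2.0)
```

The design point is that the library only describes the jobs; scheduling and waiting are left entirely to whatever backend was registered.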
|
|
|
|
|
|
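
To illustrate why NAN and INFINITY arguments create corner cases in a scaling routine such as ?SCAL, the sketch below contrasts a plain multiplication with a zero-alpha shortcut. It is an assumption that this is the kind of case meant by the entry above; the two functions are hypothetical pure-Python models and make no claim about which behaviour a given OpenBLAS version implements.

```python
import math


def scal_multiply(alpha, x):
    """Scale by plain multiplication: IEEE-754 rules decide every result."""
    return [alpha * v for v in x]


def scal_zero_shortcut(alpha, x):
    """Scale with a shortcut that zero-fills the vector when alpha is zero."""
    if alpha == 0.0:
        return [0.0] * len(x)
    return [alpha * v for v in x]


x = [1.0, math.nan, math.inf]
plain = scal_multiply(0.0, x)          # 0*NaN and 0*Inf both yield NaN
shortcut = scal_zero_shortcut(0.0, x)  # NaN and Inf are silently overwritten
```

The two variants agree for finite inputs and differ only when the vector already contains NaN or Inf, which is why such cases are easy to miss in testing.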
|
|
|
|
|
x86-64: |
|
|
|
|
|
- reverted thread management under Windows to its state before 0.3.26 |
|
|
|
|
|
due to signs of race conditions in some circumstances now under study |
|
|
|
|
|
- fixed accidental selection of the unoptimized generic SBGEMM kernel |
|
|
|
|
|
in CMAKE builds for CooperLake and SapphireRapids targets |
|
|
|
|
|
- fixed a potential thread buffer overrun in SBSTOBF16 on small systems |
|
|
|
|
|
- fixed an accuracy issue in ZSCAL introduced in 0.3.26 |
|
|
|
|
|
- fixed compilation with CMAKE and recent releases of LLVM |
|
|
|
|
|
- added support for Intel Emerald Rapids and Meteor Lake cpus |
|
|
|
|
|
- added autodetection support for the Zhaoxin KX-7000 cpu |
|
|
|
|
|
- fixed autodetection of Intel Prescott (probably broken since 0.3.19) |
|
|
|
|
|
- fixed compilation for older targets with the Yocto SDK |
|
|
|
|
|
- fixed compilation of the converter-generated C versions |
|
|
|
|
|
of the LAPACK sources with gcc-14 |
|
|
|
|
|
- improved compiler options when building with CMAKE and LLVM for |
|
|
|
|
|
AVX512-capable targets |
|
|
|
|
|
- added support for supplying the L2 cache size via an environment |
|
|
|
|
|
variable (OPENBLAS_L2_SIZE) in case it is not correctly reported |
|
|
|
|
|
(as in some VM configurations) |
|
|
|
|
|
- improved the error message shown when thread creation fails on startup |
|
|
|
|
|
- fixed setting the rpath entry of the dylib in CMAKE builds on macOS |
|
|
|
|
|
|
|
|
|
|
|
arm: |
|
|
|
|
|
- fixed building for baremetal targets with make |
|
|
|
|
|
|
|
|
|
|
|
arm64: |
|
|
|
|
|
- added a fast path forwarding SGEMM and DGEMM calls with a 1xN or Mx1 |
|
|
|
|
|
matrix to the corresponding GEMV kernel |
|
|
|
|
|
- added optimized SGEMV and DGEMV kernels for A64FX |
|
|
|
|
|
- added optimized SVE kernels for small-matrix GEMM |
|
|
|
|
|
- added A64FX to the cpu list for DYNAMIC_ARCH |
|
|
|
|
|
- fixed building with support for cpu affinity |
|
|
|
|
|
- worked around accuracy problems with C/ZNRM2 on NeoverseN1 and |
|
|
|
|
|
Apple M targets |
|
|
|
|
|
- improved GEMM performance on Neoverse V1 |
|
|
|
|
|
- fixed compilation for NEOVERSEN2 with older compilers |
|
|
|
|
|
- fixed potential miscompilation of the SVE SDOT and DDOT kernels |
|
|
|
|
|
- fixed potential miscompilation of the non-SVE CDOT and ZDOT kernels |
|
|
|
|
|
- fixed a potential overflow when using very large user-defined BUFFERSIZE |
|
|
|
|
|
- fixed setting the rpath entry of the dylib in CMAKE builds on macOS |
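
The GEMM fast path listed above rests on a simple identity: when the result matrix has a single row or a single column, the matrix-matrix product reduces to a matrix-vector product. A minimal sketch of that identity, assuming row-major nested lists and hypothetical helper names (`gemm`, `gemv`, `gemm_with_fast_path`) that are not OpenBLAS functions:

```python
def gemm(alpha, A, B, beta, C):
    """Reference C := alpha*A*B + beta*C for row-major nested lists."""
    m, k, n = len(A), len(B), len(B[0])
    return [[alpha * sum(A[i][p] * B[p][j] for p in range(k)) + beta * C[i][j]
             for j in range(n)]
            for i in range(m)]


def gemv(alpha, A, x, beta, y, trans=False):
    """Reference y := alpha*op(A)*x + beta*y, with op(A) = A or its transpose."""
    if trans:
        A = [list(col) for col in zip(*A)]
    return [alpha * sum(a * v for a, v in zip(row, x)) + beta * yi
            for row, yi in zip(A, y)]


def gemm_with_fast_path(alpha, A, B, beta, C):
    """Forward single-column or single-row products to gemv."""
    m, n = len(A), len(B[0])
    if n == 1:
        # C is Mx1: c := alpha*A*b + beta*c
        y = gemv(alpha, A, [row[0] for row in B], beta, [row[0] for row in C])
        return [[v] for v in y]
    if m == 1:
        # C is 1xN: c^T := alpha*B^T*a^T + beta*c^T
        return [gemv(alpha, B, A[0], beta, C[0], trans=True)]
    return gemm(alpha, A, B, beta, C)


A = [[1.0, 2.0, 3.0]]                      # 1x3
B = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # 3x2
C = [[1.0, 1.0]]                           # 1x2
fast = gemm_with_fast_path(2.0, A, B, 3.0, C)
full = gemm(2.0, A, B, 3.0, C)
```

Both calls produce the same 1x2 result; the benefit in a real library would come from the GEMV kernel avoiding the packing and blocking overhead of a full GEMM.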
|
|
|
|
|
|
|
|
|
|
|
power: |
|
|
|
|
|
- added a fast path forwarding SGEMM and DGEMM calls with a 1xN or Mx1 |
|
|
|
|
|
matrix to the corresponding GEMV kernel |
|
|
|
|
|
- significantly improved performance of SBGEMM on POWER10 |
|
|
|
|
|
- fixed compilation with OpenMP and the XLF compiler |
|
|
|
|
|
- fixed building of the BLAS extension utests under AIX |
|
|
|
|
|
- fixed building of parts of the LAPACK testsuite with XLF |
|
|
|
|
|
- fixed CSWAP/ZSWAP on big-endian POWER10 targets |
|
|
|
|
|
- fixed a performance regression in SAXPY on POWER10 with OpenXL |
|
|
|
|
|
- fixed accuracy issues in CSCAL/ZSCAL when compiled with LLVM |
|
|
|
|
|
- fixed building for POWER9 under FreeBSD |
|
|
|
|
|
- fixed a potential overflow when using very large user-defined BUFFERSIZE |
|
|
|
|
|
- fixed an accuracy issue in the POWER6 kernels for GEMM and GEMV |
|
|
|
|
|
|
|
|
|
|
|
riscv64: |
|
|
|
|
|
- added a fast path forwarding SGEMM and DGEMM calls with a 1xN or Mx1 |
|
|
|
|
|
matrix to the corresponding GEMV kernel |
|
|
|
|
|
- fixed building for RISCV64_GENERIC with OpenMP enabled |
|
|
|
|
|
- added DYNAMIC_ARCH support (comprising RISCV64_GENERIC and the two |
|
|
|
|
|
RVV 1.0 targets with vector length of 128 and 256) |
|
|
|
|
|
- worked around the ZVL128B kernels for AXPBY mishandling the special |
|
|
|
|
|
case of zero Y increment |
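
The zero Y increment mentioned in the AXPBY entry above is a special case because every iteration then reads and writes the same element, so the result depends on the updates happening one after another. The sketch below shows this with a scalar reference loop and a deliberately simplified "load everything, then store everything" variant. Both functions are hypothetical models for nonnegative increments; the second one only mimics how a vectorized loop can go wrong and is not the ZVL128B kernel.

```python
def axpby_reference(n, alpha, x, incx, beta, y, incy):
    """y := alpha*x + beta*y as a scalar loop (nonnegative increments only)."""
    ix = iy = 0
    for _ in range(n):
        y[iy] = alpha * x[ix] + beta * y[iy]
        ix += incx
        iy += incy
    return y


def axpby_vector_style(n, alpha, x, incx, beta, y, incy):
    """Same operation, but all loads happen before any store."""
    xs = [x[i * incx] for i in range(n)]
    ys = [y[i * incy] for i in range(n)]
    results = [alpha * a + beta * b for a, b in zip(xs, ys)]
    for i, r in enumerate(results):
        y[i * incy] = r
    return y


x = [1.0, 2.0, 3.0]
ref = axpby_reference(3, 1.0, x, 1, 2.0, [1.0], 0)    # updates accumulate in y[0]
vec = axpby_vector_style(3, 1.0, x, 1, 2.0, [1.0], 0)  # only the last lane survives
```

With a unit Y increment the two variants give identical results, so the difference only shows up when the increment is zero.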
|
|
|
|
|
|
|
|
|
|
|
loongarch64: |
|
|
|
|
|
- improved GEMM performance on servers of the 3C5000 generation |
|
|
|
|
|
- improved performance and stability of DGEMM |
|
|
|
|
|
- improved GEMV and TRSM kernels for LSX and LASX vector ABIs |
|
|
|
|
|
- fixed CMAKE compilation with the INTERFACE64 option set |
|
|
|
|
|
- fixed compilation with CMAKE |
|
|
|
|
|
- worked around spurious errors flagged by the BLAS3 tests |
|
|
|
|
|
- worked around a miscompilation of the POTRS utest by gcc 14.1 |
|
|
|
|
|
|
|
|
|
|
|
mips64: |
|
|
|
|
|
- fixed ASUM and SUM kernels to accept negative step sizes in X |
|
|
|
|
|
- fixed complex GEMV kernels for MSA |
|
|
|
|
|
|
|
|
==================================================================== |
|
|
|
|
Version 0.3.27 |
|
|
|
|
4-Apr-2024 |
|
|
|
|