Update Changelog.txt for 0.3.28

1 year ago · 1df95bb23a
--- a/Changelog.txt
+++ b/Changelog.txt
@@ -1,4 +1,127 @@
 OpenBLAS ChangeLog
 ====================================================================
 Version 0.3.28
 8-Aug-2024

 general:
 - Reworked the unfinished implementation of HUGETLB from GotoBLAS
  for allocating huge memory pages as buffers on suitable systems
 - Changed the unfinished implementation of GEMM3M for the generic
  target on all architectures to at least forward to regular GEMM
 - Improved multithreaded GEMM performance for large non-skinny matrices
 - Improved BLAS3 performance on larger multicore systems through better
  parallelism
 - Improved performance of the initial memory allocation by reducing
  locking overhead
 - Improved performance of GBMV at small problem sizes by introducing
  a size barrier for the switch to multithreading
 - Added an implementation of the CBLAS_GEMM_BATCH extension
 - Fixed miscompilation of CAXPYC and ZAXPYC on all architectures in 
  CMAKE builds (error introduced in 0.3.27)
 - Fixed corner cases involving the handling of NAN and INFINITY
  arguments in ?SCAL on all architectures
 - Added support for cross-compiling to WASM with CMAKE (in addition
  to the already present makefile support)
 - Fixed NAN handling and potential accuracy issues in compilations with
  Intel ICX by supplying a suitable fp-model option by default
 - The contents of the GitHub project wiki have been converted into
  a new set of documentation included with the source code
 - It is now possible to register a callback function that replaces
  the built-in support for multithreading with an external backend
  like TBB (openblas_set_threads_callback_function)
 - Fixed potential duplication of suffixes in shared library naming
 - Improved C compiler detection by the build system to tolerate more
  naming variants for gcc builds
 - Fixed an unnecessary dependency of the utest on CBLAS
 - Fixed spurious error reports from the BLAS extensions utest
 - Fixed unwanted invocation of the GEMM3M tests in cross-compilation
 - Fixed a flaw in the makefile build that could lead to the pkgconfig
  file containing an entry of UNKNOWN for the target cpu after installing
 - Integrated fixes from the Reference-LAPACK project:
  - Fixed uninitialized variables in the LAPACK tests for ?QP3RK (PR 961)
  - Fixed potential bounds error in ?UNHR_COL/?ORHR_COL (PR 1018)
  - Fixed potential infinite loop in the LAPACK testsuite (PR 1024)
  - Made the variable type used for hidden length arguments configurable (PR 1025)
  - Fixed SYTRD workspace computation and various typos (PR 1030)
  - Prevented compiler use of FMA that could increase numerical error in ?GEEVX (PR 1033)

 x86-64:
 - reverted thread management under Windows to its state before 0.3.26
  because of signs of race conditions in some circumstances, which are
  still under investigation
 - fixed accidental selection of the unoptimized generic SBGEMM kernel
  in CMAKE builds for CooperLake and SapphireRapids targets
 - fixed a potential thread buffer overrun in SBSTOBF16 on small systems
 - fixed an accuracy issue in ZSCAL introduced in 0.3.26
 - fixed compilation with CMAKE and recent releases of LLVM
 - added support for Intel Emerald Rapids and Meteor Lake cpus
 - added autodetection support for the Zhaoxin KX-7000 cpu
 - fixed autodetection of Intel Prescott (probably broken since 0.3.19)
 - fixed compilation for older targets with the Yocto SDK
 - fixed compilation of the converter-generated C versions
  of the LAPACK sources with gcc-14
 - improved compiler options when building with CMAKE and LLVM for
  AVX512-capable targets
 - added support for supplying the L2 cache size via an environment
  variable (OPENBLAS_L2_SIZE) in case it is not correctly reported
  (as in some VM configurations)
 - improved the error message shown when thread creation fails on startup
 - fixed setting the rpath entry of the dylib in CMAKE builds on macOS

 arm:
 - fixed building for baremetal targets with make

 arm64:
 - added a fast path forwarding SGEMM and DGEMM calls with a 1xN or Mx1
  matrix to the corresponding GEMV kernel
 - added optimized SGEMV and DGEMV kernels for A64FX
 - added optimized SVE kernels for small-matrix GEMM
 - added A64FX to the cpu list for DYNAMIC_ARCH
 - fixed building with support for cpu affinity
 - worked around accuracy problems with C/ZNRM2 on NeoverseN1 and
  Apple M targets
 - improved GEMM performance on Neoverse V1
 - fixed compilation for NEOVERSEN2 with older compilers
 - fixed potential miscompilation of the SVE SDOT and DDOT kernels
 - fixed potential miscompilation of the non-SVE CDOT and ZDOT kernels
 - fixed a potential overflow when using very large user-defined BUFFERSIZE
 - fixed setting the rpath entry of the dylib in CMAKE builds on macOS

 power:
 - added a fast path forwarding SGEMM and DGEMM calls with a 1xN or Mx1
  matrix to the corresponding GEMV kernel
 - significantly improved performance of SBGEMM on POWER10
 - fixed compilation with OpenMP and the XLF compiler
 - fixed building of the BLAS extension utests under AIX
 - fixed building of parts of the LAPACK testsuite with XLF
 - fixed CSWAP/ZSWAP on big-endian POWER10 targets
 - fixed a performance regression in SAXPY on POWER10 with OpenXL
 - fixed accuracy issues in CSCAL/ZSCAL when compiled with LLVM
 - fixed building for POWER9 under FreeBSD
 - fixed a potential overflow when using very large user-defined BUFFERSIZE
 - fixed an accuracy issue in the POWER6 kernels for GEMM and GEMV

 riscv64:
 - added a fast path forwarding SGEMM and DGEMM calls with a 1xN or Mx1
  matrix to the corresponding GEMV kernel
 - fixed building for RISCV64_GENERIC with OpenMP enabled
 - added DYNAMIC_ARCH support (comprising RISCV64_GENERIC and the two
  RVV 1.0 targets with vector lengths of 128 and 256 bits)
 - worked around the ZVL128B kernels for AXPBY mishandling the special
  case of zero Y increment

 loongarch64:
 - improved GEMM performance on servers of the 3C5000 generation
 - improved performance and stability of DGEMM
 - improved GEMV and TRSM kernels for LSX and LASX vector ABIs
 - fixed CMAKE compilation with the INTERFACE64 option set
 - fixed compilation with CMAKE
 - worked around spurious errors flagged by the BLAS3 tests
 - worked around a miscompilation of the POTRS utest by gcc 14.1

 mips64:
 - fixed ASUM and SUM kernels to accept negative step sizes in X
 - fixed complex GEMV kernels for MSA

 ====================================================================
 Version 0.3.27
 4-Apr-2024