|
- OpenBLAS ChangeLog
- ====================================================================
- Version 0.3.11
- 17-Oct-2020
-
- common:
- * API change:
- the newly added BFLOAT16 functions were renamed to use the
- letter "B" instead of "H" to avoid potential confusion with
- the IEEE "half precision float" type, i.e. the 0.3.10
- SHGEMM is now SBGEMM and the corresponding build option
- was changed from "BUILD_HALF" to "BUILD_BFLOAT16".
- * Reduced the default BLAS3_MEM_ALLOC_THRESHOLD (used as an upper
- limit for placing temporary arrays on the stack) to be compatible
- with a stack size of 1MB (as imposed by the Java runtime library)
- * Added mixed-precision dot function SBDOT and utility functions
- sbstobf16, sbdtobf16, sbf16tos and dbf16tod to convert between
- single or double precision float arrays and bfloat16 arrays
- * Fixed prototypes of LAPACK_?ggsvp and LAPACK_?ggsvd functions
- in lapack.h
- * Fixed underflow and rounding errors in LAPACK SLANV2 and DLANV2
- (causing miscalculations in e.g. SHSEQR/DHSEQR, LAPACK issue #263)
- * Fixed workspace calculation in LAPACK ?GELQ (LAPACK issue #415)
- * Fixed several bugs in the LAPACK testsuite
- * Improved performance of TRMM and TRSM for certain problem sizes
- * Fixed infinite recursions and workspace miscalculations in ReLAPACK
- * CMAKE builds no longer require pkg-config for creating the .pc file
- * Makefile builds no longer misread NO_CBLAS=0 or NO_LAPACK=0 as
- enabling these options
- * Fixed detection of gfortran when invoked through an mpi wrapper
- * Improved thread reinitialization performance with OpenMP after a fork
- * Added support for building only the subset of the library required
- for a particular precision by specifying BUILD_SINGLE, BUILD_DOUBLE
- * Optional function name prefixes and suffixes are now correctly
- reflected in the generated cblas.h
- * Added CMAKE build support for the LAPACK and multithreading tests
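As an illustration of the bfloat16 conversions mentioned above, the following is a minimal sketch in Python, not the OpenBLAS implementation. It assumes the usual definition of bfloat16 as the upper 16 bits of an IEEE single precision value and assumes round-to-nearest-even on conversion; the function names float_to_bf16 and bf16_to_float are hypothetical.

```python
import struct


def float_to_bf16(x: float) -> int:
    """Return the 16-bit bfloat16 pattern for x (assumed round-to-nearest-even)."""
    # Reinterpret the value as its 32-bit IEEE single precision pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # NaN: keep the upper half and force a mantissa bit so it stays a NaN.
    if (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF):
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Round to nearest even on the 16 bits that are about to be dropped.
    bits += 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16) & 0xFFFF


def bf16_to_float(h: int) -> float:
    """Expand a bfloat16 pattern to a float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (h & 0xFFFF) << 16))[0]
```

Only 8 significant bits survive the round trip, so a value such as 3.14159265 comes back as 3.140625.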
-
- POWER:
- * Added optimized support for POWER10
- * Added support for compiling for POWER8 in 32bit mode
- * Added support for compilation with LLVM/clang
- * Added support for compilation with NVIDIA/PGI compilers
- * Fixed building on big-endian POWER8
- * Fixed miscompilation of ZDOTC by gcc10
- * Fixed alignment errors in the POWER8 SAXPY kernel
- * Improved CPU detection on AIX
- * Supported building with older compilers on POWER9
-
- x86_64:
- * Added support for Intel Cooperlake
- * Added autodetection of AMD Renoir/Matisse/Zen3 cpus
- * Added autodetection of Intel Comet Lake cpus
- * Reimplemented ?sum, ?dot and daxpy using universal intrinsics
- * Reset the fpu state before using the fpu on Windows as a workaround
- for a problem introduced in Windows 10 build 19041 (a.k.a. SDK 2004)
- * Fixed potentially undefined behaviour in the dot and gemv_t kernels
- * Fixed a potential segmentation fault in DYNAMIC_ARCH builds
- * Fixed building for ZEN with PGI/NVIDIA and AMD AOCC compilers
-
- ARMV7:
- * Fixed cpu detection on BSD-like systems
-
- ARMV8:
- * Added preliminary support for Apple Vortex cpus
- * Added support for the Cavium ThunderX3T110 cpu
- * Fixed cpu detection on BSD-like systems
- * Fixed compilation in -std=C18 mode
-
-
- IBM Z:
- * Added support for compiling with the clang compiler
- * Improved GEMM performance on Z14
-
- ====================================================================
- Version 0.3.10
- 14-Jun-2020
-
- common:
- * Improved thread locking behaviour in blas_server and parallel getrf
- * Imported bugfix 394 from LAPACK (spurious reference to "XERBL"
- due to overlong lines)
- * Imported bugfix 403 from LAPACK (compile option "recursive" required
- for correctness with Intel and PGI)
- * Imported bugfix 408 from LAPACK (wrong scaling in ZHEEQUB)
- * Imported bugfix 411 from LAPACK (infinite loop in LARGV/LARTG/LARTGP)
- * Fixed mismatches between BUFFERSIZE and GEMM_UNROLL parameters that
- could lead to crashes at large matrix sizes
- * Restored internal soname in dynamic libraries on FreeBSD and Dragonfly
- * Added API (openblas_setaffinity) to set the thread affinity on Linux
- * Added initial infrastructure for half-precision floating point
- (bfloat16) support with a generic implementation of SHGEMM
- * Added CMAKE build system support for building the cblas_Xgemm3m
- functions
- * Fixed CMAKE support for building in a path with embedded spaces
- * Fixed CMAKE (non)handling of NO_EXPRECISION and MAX_STACK_ALLOC
- * Fixed GCC version detection in the Makefiles
- * Allowed overriding the names of AR, AS and LD in Makefile builds
-
- POWER:
- * Fixed big-endian POWER8 ELFv2 builds on FreeBSD
- * Fixed GCC version checks and DYNAMIC_ARCH builds on POWER9
- * Fixed CMAKE build support for POWER9
- * Fixed a potential race condition in the thread buffer allocation
- * Worked around LAPACK test failures on PPC G4
-
- MIPS:
- * Fixed a potential race condition in the thread buffer allocation
- * Added support for MIPS 24K/24KE family based on P5600 kernels
-
- MIPS64:
- * Fixed a potential race condition in the thread buffer allocation
- * Added TARGET=GENERIC
-
- ARMV7:
- * Fixed a race condition in the thread buffer allocation
-
- ARMV8:
- * Fixed a race condition in the thread buffer allocation
- * Fixed zero initialisation in the assembly for SGEMM and DGEMM BETA
- * Improved performance of the ThunderX2 DAXPY kernel
- * Added an optimized SGEMM kernel for Cortex A53
- * Fixed Makefile support for INTERFACE64 (8-byte integer)
-
- x86_64:
- * Fixed a syntax error in the CMAKE setup for SkylakeX
- * Improved performance of STRSM on Haswell, SkylakeX and Ryzen
- * Improved SGEMM performance for workloads with ldc a
- multiple of 1024
- * Improved DGEMM performance on Skylake X
- * Fixed unwanted AVX512-dependency of SGEMM in DYNAMIC_ARCH
- builds created on SkylakeX
- * Removed data alignment requirement in the SSE2 copy kernels
- that could cause spurious crashes
- * Added a workaround for an optimizer bug in AppleClang 11.0.3
- * Fixed LAPACK test failures due to wrong options for Intel Fortran
- * Fixed compilation and LAPACK test results with recent Flang
- and AMD AOCC
- * Fixed DYNAMIC_ARCH builds with CMAKE on OS X
- * Fixed missing exports of cblas_i?amin, cblas_i?min, cblas_i?max,
- cblas_?sum, cblas_?gemm3m in the shared library on OS X
- * Fixed reporting of cpu name in DYNAMIC_ARCH builds (would sometimes
- show the name of an older generation chip supported by the same kernels)
-
- IBM Z:
- * Improved performance of SGEMM/STRMM and DGEMM/DTRMM on Z14
-
- ====================================================================
- Version 0.3.9
- 1-Mar-2020
-
- common:
- * Fixed a miscompilation of the GETRF functions with CMAKE
- * Imported bugfix 390 from LAPACK (missing NaN propagation in xCOMBSSQ)
- * The size of the memory buffer used for splitting GEMM tasks across
- multiple threads can now be configured in the build system.
-
- POWER:
- * Fixed several compilation problems related to endianness
- and ELF version on POWER8 and POWER9
- * Fixed use of the absolute value IAMIN/IAMAX instead of IMIN/IMAX
- * Fixed a race condition in the level3 blas code
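The difference between the absolute-value functions (IAMIN/IAMAX) and the signed ones (IMIN/IMAX) can be sketched as follows. This is an illustrative Python model rather than the kernel code; it assumes the Fortran BLAS convention of 1-based result indices, with 0 returned for an empty vector.

```python
def iamax(x):
    """1-based index of the first element with the largest absolute value."""
    return max(range(len(x)), key=lambda i: abs(x[i])) + 1 if x else 0


def iamin(x):
    """1-based index of the first element with the smallest absolute value."""
    return min(range(len(x)), key=lambda i: abs(x[i])) + 1 if x else 0


def imax(x):
    """1-based index of the first element with the largest signed value."""
    return max(range(len(x)), key=lambda i: x[i]) + 1 if x else 0


def imin(x):
    """1-based index of the first element with the smallest signed value."""
    return min(range(len(x)), key=lambda i: x[i]) + 1 if x else 0
```

For x = [1.0, -5.0, 3.0], IAMAX selects element 2 (magnitude 5) while IMAX selects element 3, so mixing up the two families gives wrong results whenever negative entries are present.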
-
- MIPS64:
- * Fixed use of the absolute value IAMIN/IAMAX instead of IMIN/IMAX
-
- ARMV7:
- * Fixed a race condition in the level3 blas code
- * Fixed compilation on Android
-
- ARMV8:
- * Added support for Ampere EMAG8180
- * Added support for Neoverse N1
- * Improved performance of the blas_lock function
- * Fixed a race condition in the level3 blas code
- * Fixed a performance regression on TSV110-based servers
-
- x86_64:
- * Fixed a long-standing error with undeclared register overwrites
- in the DSCAL microkernel for HASWELL, SKYLAKEX and ZEN
- * Fixed a long-standing bug in the SSE implementation of IAMAX
- * Fixed a CMAKE build failure with DYNAMIC_ARCH
- * Fixed cpu autodetection of Goldmont+, Cannon Lake and Ice Lake
- * Fixed a compilation failure on OSX with compiler name containing dash
- * Fixed compilation with MinGW on SkylakeX
- * Improved speed of the AVX512 GEMM3M kernel on SkylakeX
- * Added an AVX512 STRMM kernel for SkylakeX
- * Improved GEMM performance on Haswell and Zen
-
- zarch:
- * Fixed compilation of the DYNAMIC_ARCH code
-
- ====================================================================
- Version 0.3.8
- 9-Feb-2020
-
- common:
- * LAPACK has been updated to 3.9.0 (plus patches up to
- January 2nd, 2020)
- * CMAKE support has been improved in several areas including
- cross-compilation
- * a thread race condition in the GEMM3M kernels was resolved
- * the "generic" (plain C) gemm beta kernel used by many targets
- has been sped up
- * an optimized version of the LAPACK trtrs functions has been added
- * an incompatibility between the LAPACK tests and the OpenBLAS
- implementation of XERBLA was resolved, removing the numerous
- warnings about wrong error exits in the former
- * support for NetBSD has been added
- * support for compilation with g95 and non-GNU versions of ld
- has been improved
- * support for compilation with (upcoming) gcc 10 has been added
-
- POWER:
- * worked around miscompilation of several POWER8 and POWER9
- kernels by older versions of gcc
- * added support for big-endian POWER8 and for compilation on AIX
- * corrected bugs in the big-endian support for PPC440 and PPC970
- * DYNAMIC_ARCH support is now available in CMAKE builds as well
-
- ARMV8:
- * performance of DGEMM_BETA and SGEMM_NCOPY has been improved
- * compilation for 32bit works again
- * performance of the RPCC function has been improved
- * improved performance on small systems
- * DYNAMIC_ARCH support is now available in CMAKE builds as well
- * cross-compilation from OSX to IOS was simplified
-
- x86_64:
- * a new AVX512 DGEMM kernel was added and the AVX512 SGEMM kernel
- was significantly improved
- * optimized AVX512 kernels for CGEMM and ZGEMM have been added
- * AVX2 kernels for STRMM, SGEMM, and CGEMM have been significantly
- sped up and optimized CGEMM3M and ZGEMM3M kernels have been added
- * added support for QEMU virtual cpus
- * a compilation problem with PGI and SUN compilers was fixed
- * Intel "Goldmont plus" is now autodetected
- * a potential crash on program exit on MS Windows has been fixed
-
- x86:
- * an unwanted case sensitivity in the implementation of LSAME
- on older 32bit AMD cpus was fixed
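For reference, LSAME is expected to compare two option characters without regard to case. A minimal Python sketch of that contract, not the OpenBLAS code, assuming single-character ASCII arguments:

```python
def lsame(ca: str, cb: str) -> bool:
    """Return True if ca and cb are the same letter, ignoring case."""
    # Only the first character of each argument is significant.
    return ca[:1].upper() == cb[:1].upper()
```

Under this contract, an option such as UPLO may be passed as either 'U' or 'u'.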
-
- zarch:
- * Z15 is now supported as Z14
- * DYNAMIC_ARCH is now available on ZARCH as well
-
- ====================================================================
- Version 0.3.7
- 11-Aug-2019
-
- common:
- * having the gmake special variables TARGET_ARCH or TARGET_MACH
- defined no longer causes build failures in ctest or utest
- * defining NO_AFFINITY or USE_TLS to 0 in gmake builds no longer
- has the same effect as setting them to 1
- * a new test program was added to allow checking the library for
- thread safety
- * a new option USE_LOCKING was added to ensure thread safety when
- OpenBLAS itself is built without multithreading but will be
- called from multiple threads.
- * a build failure on Linux with glibc versions earlier than 2.5
- was fixed
- * a runtime error with CPU enumeration (and NO_AFFINITY not set)
- on glibc 2.6 was fixed
- * NO_AFFINITY was added to the CMAKE options (and defaults to being
- active on Linux, as in the gmake builds)
-
- x86_64:
- * the build-time logic for detection of AVX512 availability in
- the processor and compiler was fixed
- * gmake builds on OSX now set the internal name of the library to
- libopenblas.0.dylib (consistent with CMAKE)
- * the Haswell DGEMM kernel received a significant speedup through
- improved prefetch and load instructions
- * performance of DGEMM, DTRMM, DTRSM and ZDOT on Zen/Zen2 was markedly
- increased by avoiding vpermpd instructions
- * the SKYLAKEX (AVX512) DGEMM helper functions have now been disabled
- to fix remaining errors in DGEMM, DSYMM and DTRMM
-
- POWER:
- * added support for building on FreeBSD/powerpc64 and FreeBSD/ppc970
- * added optimized kernels for POWER9 SGEMM and STRMM
-
- ARMV7:
- * fixed the softfp implementations of xAMAX and IxAMAX
- * removed the predefined -march= flags on both ARMV5 and ARMV6 as
- they were appropriate for only a subset of platforms
-
- ====================================================================
- Version 0.3.6
- 29-Apr-2019
-
- common:
- * the build tools now check that a given cpu TARGET is actually valid
- * the build-time check of system features (c_check) has been made
- less dependent on particular perl features (this should mainly
- benefit building on Windows)
- * several problems with the ReLAPACK integration were fixed,
- including INTERFACE64 support and building a shared library
- * building with CMAKE on BSD systems was improved
- * a non-absolute SUM function was added based on the
- existing optimized code for ASUM
- * CBLAS interfaces to the IxMIN and IxMAX functions were added
- * a name clash between LAPACKE and BOOST headers was resolved
- * a failure of CMAKE builds with OpenMP to include the appropriate
- getrf_parallel kernels was fixed
- * a crash on thread (key) deletion with the USE_TLS=1 memory management
- option was fixed
- * restored several earlier fixes, in particular for OpenMP performance,
- building on BSD, and calling fork on CYGWIN, which had inadvertently
- been dropped in the 0.3.3 rewrite of the memory management code.
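The relation between ASUM and the new non-absolute SUM function can be sketched as follows. This is a plain Python model for real vectors, not the optimized kernels; the names asum and xsum are illustrative, and a positive increment is assumed.

```python
def asum(n, x, incx=1):
    """Sum of the absolute values of n elements of x, taken with stride incx."""
    return sum(abs(x[i * incx]) for i in range(n))


def xsum(n, x, incx=1):
    """Plain (signed) sum of n elements of x, taken with stride incx."""
    return sum(x[i * incx] for i in range(n))
```

For x = [1.0, -2.0, 3.0, -4.0], asum(4, x) gives 10.0 while xsum(4, x) gives -2.0.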
-
- x86_64:
- * the AVX512 DGEMM kernel has been disabled again due to unsolved problems
- * building with old versions of MSVC was fixed
- * it is now possible to build a static library on Windows with CMAKE
- * accessing environment variables on CYGWIN at run time was fixed
- * the CMAKE build system now recognizes 32bit userspace on 64bit hardware
- * Intel "Denverton" atom and Hygon "Dhyana" zen CPUs are now autodetected
- * building for DYNAMIC_ARCH with a DYNAMIC_LIST of targets is now supported
- with CMAKE as well
- * building for DYNAMIC_ARCH with GENERIC as the default target is now supported
- * a buffer overflow in the SSE GEMM kernel for Intel Nano targets was fixed
- * assembly bugs involving undeclared modification of input operands were fixed
- in the AXPY, DOT, GEMV, GER, SCAL, SYMV and TRSM microkernels for Nehalem,
- Sandybridge, Haswell, Bulldozer and Piledriver. These would typically cause
- test failures or segfaults when compiled with recent versions of gcc from 8 onward.
- * a similar bug was fixed in the blas_quickdivide code used to split workloads
- in most functions
- * a bug in the IxMIN implementation for the GENERIC target made it return the result of IxMAX
- * fixed building on SkylakeX systems when either the compiler or the (emulated) operating
- environment does not support AVX512
- * improved GEMM performance on ZEN targets
-
- x86:
- * build failures caused by the recently added checks for AVX512 were fixed
- * an inline assembly bug involving undeclared modification of an input argument was
- fixed in the blas_quickdivide code used to split workloads in most functions
- * a bug in the IMIN implementation for the GENERIC target made it return the result of IMAX
-
- MIPS32:
- * a bug in the IMIN implementation made it return the result of IMAX
-
- POWER:
- * single precision BLAS1/2 functions have received optimized POWER8 kernels
- * POWER9 is now a separate target, with an optimized DGEMM/DTRMM kernel
- * building on PPC970 systems under OSX Leopard or Tiger is now supported
- * out-of-bounds memory accesses in the gemm_beta microkernels were fixed
- * building a shared library on AIX is now supported for POWER6
- * DYNAMIC_ARCH support has been added for POWER6 and newer
-
- ARMv7:
- * corrected xDOT behaviour with zero INC_X or INC_Y
- * a bug in the IMIN implementation made it return the result of IMAX
-
- ARMv8:
- * added support for HiSilicon TSV110 cpus
- * the CMAKE build system now recognizes 32bit userspace on 64bit hardware
- * cross-compilation with CMAKE now works again
- * a bug in the IMIN implementation made it return the result of IMAX
- * ARMV8 builds with the BINARY=32 option are now automatically handled as ARMV7
-
- IBM Z:
- * optimized microkernels for single precision BLAS1/2 functions have been added
- for both Z13 and Z14
-
- ====================================================================
- Version 0.3.5
- 31-Dec-2018
-
- common:
- * loop unrolling in TRMV has been enabled again.
- * A domain error in the thread workload distribution for SYRK
- has been fixed.
- * gmake builds will now automatically add -fPIC to the build
- options if the platform requires it.
- * a pthreads key leakage (and associated crash on dlclose) in
- the USE_TLS codepath was fixed.
- * building of the utest cases on systems that do not provide
- an implementation of complex.h was fixed.
-
- x86_64:
- * the SkylakeX code was changed to compile on OSX.
- * unwanted application of the -march=skylake-avx512 option
- to the common code parts of a DYNAMIC_ARCH build was fixed.
- * improved performance of SGEMM for small workloads on Skylake X.
- * performance of SGEMM and DGEMM was improved on Haswell.
-
- ARMV8:
- * a configuration error that broke the CNRM2 kernel was corrected.
- * compilation of the GEMM kernels with CMAKE was fixed.
- * DYNAMIC_ARCH builds are now available with CMAKE as well.
- * using CMAKE for cross-compilation to the new cpu TARGETs
- introduced in 0.3.4 now works.
-
- POWER:
- * a problem in cpu autodetection for AIX has been corrected.
-
- ====================================================================
- Version 0.3.4
- 02-Dec-2018
-
- common:
- * the new, experimental thread-local memory allocation had
- inadvertently been left enabled for gmake builds in 0.3.3
- despite the announcement. It is now disabled by default, and
- single-threaded builds will keep using the old allocator even
- if the USE_TLS option is turned on.
- * OpenBLAS will now provide enough buffer space for at least 50
- threads by default.
- * The output of openblas_get_config() now contains the version
- number.
- * A serious thread safety bug in GEMV operation with small M and
- large N size has been fixed.
- * The code will now automatically call blas_thread_init after a
- fork if needed before handling a call to openblas_set_num_threads
- * Accesses to parallelized level3 functions from multiple callers
- are now serialized to avoid thread races (unless using OpenMP).
- This should provide better performance than the known-threadsafe
- (but non-default) USE_SIMPLE_THREADED_LEVEL3 option.
- * When building LAPACK with gfortran, -frecursive is now (again)
- enabled by default to ensure correct behaviour.
- * The OpenBLAS version of cblas.h now supports both CBLAS_ORDER and
- CBLAS_LAYOUT as the name of the matrix row/column order option.
- * Externally set LDFLAGS are now passed through to the final compile/link
- steps to facilitate setting platform-specific linker flags.
- * A potential race condition during the build of LAPACK (that would
- usually manifest itself as a failure to build TESTING/MATGEN) has been
- fixed.
- * xHEMV has been changed to stay single-threaded for small input sizes
- where the overhead of multithreading exceeds any possible gains
- * CSWAP and ZSWAP have been limited to a single thread except on ARMV8 or
- ThunderX hardware with sizable input.
- * Linker flags for the PGI compiler have been updated
- * Behaviour of AXPY with zero increments is now handled in the C interface,
- correcting the result on at least Intel Atom.
- * The result matrix from calling SGELSS with an all-zero input matrix is
- now zeroed completely.
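To illustrate what zero increments mean for AXPY, here is a minimal Python sketch that follows the loop structure of the reference (netlib) BLAS. It is not the OpenBLAS C interface code, and it is an assumption that the corrected OpenBLAS behaviour matches the reference loop.

```python
def axpy(n, alpha, x, incx, y, incy):
    """Compute y := alpha*x + y over n strided elements, in place."""
    if n <= 0 or alpha == 0:
        return y
    # Negative increments start from the far end, as in the reference BLAS.
    ix = (1 - n) * incx if incx < 0 else 0
    iy = (1 - n) * incy if incy < 0 else 0
    for _ in range(n):
        y[iy] += alpha * x[ix]
        ix += incx
        iy += incy
    return y
```

With incx = 0 the same element x[0] is added to every element of y, and with incy = 0 all n products accumulate into y[0]. A vectorized kernel that assumes distinct elements can get either case wrong.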
-
- x86_64:
- * Autodetection of AMD Ryzen2 has been fixed (again).
- * CMAKE builds now support labeling of an INTERFACE64=1 build of
- the library with the _64 suffix.
- * AVX512 version of DGEMM has been added and the AVX512 SGEMM kernel
- has been sped up by rewriting with C intrinsics
- * Fixed compilation on RHEL5/CENTOS5 (issue with typename __WAIT_STATUS)
-
- POWER:
- * added support for building on AIX (with gcc and GNU tools from AIX Toolbox).
- * CPU type detection has been implemented for AIX.
- * CPU type detection has been fixed for NETBSD.
-
- MIPS64:
- * AXPY on LOONGSON3A has been corrected to pass "zero increment" utest.
- * DSDOT on LOONGSON3A has been fixed.
- * the SGEMM microkernel has been hardened against potential data loss.
-
- ARMV8:
- * DYNAMIC_ARCH support is now available for 64bit ARM
- * cross-compiling for ARMV8 under iOS now works.
- * cpu-specific code has been rearranged to make better use of both
- hardware commonalities and model-specific compiler optimizations.
- * XGENE1 has been removed as a TARGET, superseded by the improved generic
- ARMV8 support.
-
- ARMV7:
- * Older assembly mnemonics have been converted to UAL form to allow
- building with clang 7.0
- * Cross compiling LAPACKE for Android has been fixed again (broken by
- update to LAPACK 3.7.0 some while ago).
-
- ====================================================================
- Version 0.3.3
- 31-Aug-2018
-
- common:
- * thread memory allocation has been switched back to the method
- used before version 0.3.1 due to unexpected problems caused by
- the new code under some circumstances. A new compile-time option
- USE_TLS has been added to enable the new code, and it is hoped
- that this can become the default again in the next version.
- * LAPACK PR272 has been integrated, which fixes spurious errors
- in DSYEVR and related functions caused by missing conversion
- from ILAENV to ILAENV_2STAGE in several _2stage routines.
- * the cmake-generated OpenBLASConfig.cmake now uses correct case
- for the name of the library
- * added support for Haiku OS
-
- x86_64:
- * added AVX512 implementations of SDOT, DDOT, SAXPY, DAXPY,
- DSCAL, DGEMVN and DSYMVL
- * added a workaround for a cygwin issue that prevented compilation
- of AVX512 code
-
- IBM Z:
- * added autodetection of Z14
- * fixed TRMM errors in the generic target
-
- ====================================================================
- Version 0.3.2
- 30-Jul-2018
-
- common:
- * fixes for regressions caused by the rewrite of the thread
- initialization code in 0.3.1
-
- POWER:
- * fixed cpu autodetection for the BSDs
-
- MIPS64:
- * fixed utest errors in AXPY, DSDOT, ROT and SWAP
-
- x86_64:
- * added autodetection of AMD Ryzen 2
- * fixed build with older versions of MSVC
-
- ====================================================================
- Version 0.3.1
- 01-Jul-2018
-
- common:
- * rewritten thread initialization code with significantly reduced overhead
- * added CBLAS interfaces to the IxAMIN BLAS extension functions
- * fixed the lapack-test target
- * CMAKE builds now create an OpenBLASConfig.cmake file
- * ZAXPY now uses a single thread for small input sizes
- * the LAPACK code was updated from Reference-LAPACK/lapack#253
- (fixing LAPACKE interfaces to Aasen's functions)
-
- POWER:
- * corrected CROT and ZROT behaviour with zero INC_X
-
- ARMV7:
- * corrected xDOT behaviour with zero INC_X or INC_Y
-
- x86_64:
- * retired some older targets of DYNAMIC_ARCH builds to a new option DYNAMIC_OLDER,
- this affects PENRYN,DUNNINGTON,OPTERON,OPTERON_SSE3,BOBCAT,ATOM and NANO
- (which will still be supported via the slower PRESCOTT kernels when this option is not set)
- * added an option DYNAMIC_LIST that (used in conjunction with DYNAMIC_ARCH) allows
- specifying the list of x86_64 targets to include. Any target not on the list will be supported
- by the Sandybridge or Nehalem kernels if available, or by Prescott.
- * improved SWITCH_RATIO on Haswell for increased GEMM throughput
- * added initial support for Intel Skylake X, including an AVX512 SGEMM kernel
- * added autodetection of Intel Cannon Lake series as Skylake X
- * added a default L2 cache size for hypervisors that return zero here (Chromebook)
- * fixed a name clash with recent Windows10 headers that broke the build with (at least)
- recent mingw from MSYS2
- * fixed a link error in mixed clang/gfortran builds with OpenMP
- * updated the OSX deployment target to 10.8
- * switched on parallel make for builds on MS Windows by default
-
- x86:
- * fixed SSWAP and DSWAP behaviour with zero INC_X and INC_Y
-
- ====================================================================
- Version 0.3.0
- 23-May-2018
-
- common:
- * fixed some more thread race and locking bugs
- * added preliminary support for calling an OpenMP build of the library from multiple threads
- * removed performance impact of thread locks added in 0.2.20 on OpenMP code
- * general code cleanup
- * optimized DSDOT implementation
- * improved thread distribution for GEMM
- * corrected IMATCOPY/OMATCOPY implementation
- * fixed out-of-bounds accesses in the multithreaded xBMV/xPMV and SYMV implementations
- * cmake build improvements
- * pkgconfig file now contains build options
- * openblas_get_config() now reports USE_OPENMP and NUM_THREADS settings used for the build
- * corrections and improvements for systems with more than 64 cpus
- * LAPACK code updated to 3.8.0 including later fixes
- * added ReLAPACK, a recursive implementation of several LAPACK functions
- * Rewrote ROTMG to handle cases that the netlib code failed to address
- * Disabled (broken) multithreading code for xTRMV
- * corrected prototypes of complex CBLAS functions to make our cblas.h match the generally accepted standard
- * shared memory access failures on startup are now handled more gracefully
- * restored utests from earlier releases (and made them pass on all affected systems)
-
- SPARC:
- * several fixes for cpu autodetection
-
- POWER:
- * corrected vector register overwriting in several Power8 kernels
- * optimized additional BLAS functions
-
- ARM:
- * added support for CortexA53 and A72
- * added autodetection for ThunderX2T99
- * made most optimized kernels the default for generic ARMv8 targets
-
- x86_64:
- * parallelized DDOT kernel for Haswell
- * changed alignment directives in assembly kernels to boost performance on OSX
- * fixed register handling in the GEMV microkernels (bug exposed by gcc7)
- * added support for building on OpenBSD and Dragonfly
- * updated compiler options to work with Intel release 2018
- * support fully optimized build with clang/flang on Microsoft Windows
- * fixed building on AIX
-
- IBM Z:
- * added optimized BLAS 1/2 functions
-
- MIPS:
- * fixed cpu autodetection helper code
- * added mips32 1004K cpu (Mediatek MT7621 and similar SoC)
- * added mips64 I6500 cpu
-
- ====================================================================
- Version 0.2.20
- 24-Jul-2017
-
- common:
- * Improved CMake support
- * Fixed several thread race and locking bugs
- * Fixed default LAPACK optimization level
- * Updated LAPACK to 3.7.0
- * Added ReLAPACK (https://github.com/HPAC/ReLAPACK, make BUILD_RELAPACK=1)
-
- POWER:
- * Optimizations for Power9
- * Fixed several Power8 assembly bugs
-
- ARM:
- * New optimized Vulcan and ThunderX2T99 targets
- * Support for ARMV7 SOFT_FP ABI (make ARM_SOFTFP_ABI=1)
- * Detect all cpu cores including offline ones
- * Fix compilation with CLANG
- * Support building a shared library for Android
-
- MIPS:
- * Fixed several threading issues
- * Fix compilation with CLANG
-
- x86_64:
- * Detect Intel Bay Trail and Apollo Lake
- * Detect Intel Sky Lake and Kaby Lake
- * Detect Intel Knights Landing
- * Detect AMD A8, A10, A12 and Ryzen
- * Support 64bit builds with Visual Studio
- * Fix building with Intel and PGI compilers
- * Fix building with MINGW and TDM-GCC
- * Fix cmake builds for Haswell and related cpus
- * Fix building for Sandybridge with CLANG 3.9
- * Add support for the FLANG compiler
-
- IBM Z:
- * New target z13 with BLAS3 optimizations
-
- ====================================================================
- Version 0.2.19
- 1-Sep-2016
- common:
- * Improved cross compiling.
- * Fix the bug on musl libc.
-
- POWER:
- * Optimize BLAS on Power8
- * Fixed Julia+OpenBLAS bugs on Power8
-
- MIPS:
- * Optimize BLAS on MIPS P5600 and I6400 (Thanks, Shivraj Patil, Kaustubh Raste)
-
- ARM:
- * Improved performance on ARM Cortex-A57. (Thanks, Ashwin Sekhar T K)
-
-
- ====================================================================
- Version 0.2.18
- 12-Apr-2016
- common:
- * If the MAKE_NB_JOBS flag is set to zero or less,
- make will run without -j.
-
- x86/x86_64:
- * Support building Visual Studio static library. (#813, Thanks, theoractice)
- * Fix bugs to pass buildbot CI tests (http://build.openblas.net)
-
- ARM:
- * Provide DGEMM 8x4 kernel for Cortex-A57 (Thanks, Ashwin Sekhar T K)
-
- POWER:
- * Optimize S and C BLAS3 on Power8
- * Optimize BLAS2/1 on Power8
-
- ====================================================================
- Version 0.2.17
- 20-Mar-2016
- common:
- * Enable BUILD_LAPACK_DEPRECATED=1 by default.
-
- ====================================================================
- Version 0.2.16
- 15-Mar-2016
- common:
- * Avoid potential getenv segfault. (#716)
- * Import LAPACK svn bugfix #142-#147,#150-#155
-
- x86/x86_64:
- * Optimize c/zgemv for AMD Bulldozer, Piledriver, Steamroller
- * Fix bug with scipy linalg test.
-
- ARM:
- * Improve DGEMM for ARM Cortex-A57. (Thanks, Ashwin Sekhar T K)
-
- POWER:
- * Optimize D and Z BLAS3 functions for Power8.
-
- ====================================================================
- Version 0.2.16.rc1
- 23-Feb-2016
- common:
- * Upgrade LAPACK to 3.6.0 version.
- Add BUILD_LAPACK_DEPRECATED option in Makefile.rule to build
- LAPACK deprecated functions.
- * Add MAKE_NB_JOBS option in Makefile.
- Force number of make jobs. This is particularly
- useful when using distcc. (#735. Thanks, Jerome Robert.)
- * Redesign unit test. Run unit/regression test at every build (Travis-CI and Appveyor).
- * Disable multi-threading for small size swap and ger. (#744. Thanks, Jerome Robert)
- * Improve small zger, zgemv, ztrmv using stack allocation (#727. Thanks, Jerome Robert)
- * Let openblas_get_num_threads return the number of active threads.
- (#760. Thanks, Jerome Robert)
- * Support illumos(OmniOS). (#749. Thanks, Lauri Tirkkonen)
- * Fix LAPACK Dormbr, Dormlq bug. (#711, #713. Thanks, Brendan Tracey)
- * Update scipy benchmark script. (#745. Thanks, John Kirkham)
-
- x86/x86_64:
- * Optimize trsm kernels for AMD Bulldozer, Piledriver, Steamroller.
- * Detect Intel Avoton.
- * Detect AMD Trinity, Richland, E2-3200.
- * Fix gemv performance bug on Mac OSX Intel Haswell.
- * Fix some bugs with CMake and Visual Studio
-
- ARM:
- * Support and optimize Cortex-A57 AArch64.
- (#686. Thanks, Ashwin Sekhar TK)
- * Fix Android build on ARMV7 (#778. Thanks, Paul Mustiere)
- * Update ARMV6 kernels.
-
- POWER:
- * Fix detection of POWER architecture
- (#684. Thanks, Sebastien Villemot)
-
- ====================================================================
- Version 0.2.15
- 27-Oct-2015
- common:
- * Support cmake on x86/x86-64. Natively compiling on MS Visual Studio.
- (experimental. Thanks to Hank Anderson for the initial cmake porting work.)
-
- On Linux and Mac OSX, OpenBLAS cmake supports assembly kernels.
- e.g. cmake .
- make
- make test (Optional)
-
- On Windows MS Visual Studio, OpenBLAS cmake only supports C kernels.
- (OpenBLAS uses AT&T style assembly, which is not supported by MSVC.)
- e.g. cmake -G "Visual Studio 12 Win64" .
- Open OpenBLAS.sln and build.
-
- * Enable MAX_STACK_ALLOC flags by default.
- Improve ger and gemv for small matrices.
- * Improve gemv parallel with small m and large n case.
- * Improve ?imatcopy when lda==ldb (#633. Thanks, Martin Koehler)
- * Add vecLib benchmarks (#565. Thanks, Andreas Noack.)
- * Fix LAPACK lantr for row major matrices (#634. Thanks, Dan Kortschak)
- * Fix LAPACKE lansy (#640. Thanks, Dan Kortschak)
- * Import bug fixes for LAPACKE s/dormlq, c/zunmlq
- * Raise the signal when pthread_create fails (#668. Thanks, James K. Lowden)
- * Remove g77 from compiler list.
- * Enable AppVeyor Windows CI.
-
- x86/x86-64:
- * Support pure C generic kernels for x86/x86-64.
- * Support Intel Broadwell and Skylake by Haswell kernels.
- * Support AMD Excavator by Steamroller kernels.
- * Optimize s/d/c/zdot for Intel SandyBridge and Haswell.
- * Optimize s/d/c/zdot for AMD Piledriver and Steamroller.
- * Optimize s/d/c/zaxpy for Intel SandyBridge and Haswell.
- * Optimize s/d/c/zaxpy for AMD Piledriver and Steamroller.
- * Optimize d/c/zscal for Intel Haswell, dscal for Intel SandyBridge.
- * Optimize d/c/zscal for AMD Bulldozer, Piledriver and Steamroller.
- * Optimize s/dger for Intel SandyBridge.
- * Optimize s/dsymv for Intel SandyBridge.
- * Optimize ssymv for Intel Haswell.
- * Optimize dgemv for Intel Nehalem and Haswell.
- * Optimize dtrmm for Intel Haswell.
-
- ARM:
- * Support Android NDK armeabi-v7a-hard ABI (-mfloat-abi=hard)
- e.g. make HOSTCC=gcc CC=arm-linux-androideabi-gcc NO_LAPACK=1 TARGET=ARMV7
- * Fix lock, rpcc bugs (#616, #617. Thanks, Grazvydas Ignotas)
- POWER:
- * Support ppc64le platform (ELF ABI v2. #612. Thanks, Matthew Brandyberry.)
- * Support POWER7/8 by POWER6 kernels. (#612. Thanks, Fábio Perez.)
-
- ====================================================================
- Version 0.2.14
- 24-Mar-2015
- common:
- * Improve OpenBLASConfig.cmake. (#474, #475. Thanks, xantares.)
- * Improve ger and gemv for small matrices by stack allocation.
- e.g. make -DMAX_STACK_ALLOC=2048 (#482. Thanks, Jerome Robert.)
- * Introduce openblas_get_num_threads and openblas_get_num_procs.
- (#497. Thanks, Erik Schnetter.)
- * Add ATLAS-style ?geadd function. (#509. Thanks, Martin Köhler.)
- * Fix c/zsyr bug with negative incx. (#492.)
- * Fix race condition during shutdown causing a crash in
- gotoblas_set_affinity(). (#508. Thanks, Ton van den Heuvel.)
-
- x86/x86-64:
- * Support AMD Steamroller.
-
- ARM:
- * Add Cortex-A9 and Cortex-A15 targets.
-
- ====================================================================
- Version 0.2.13
- 3-Dec-2014
- common:
- * Add SYMBOLPREFIX and SYMBOLSUFFIX makefile options
- for adding a prefix or suffix to all exported symbol names
- in the shared library. (#459, Thanks Tony Kelman)
- * Provide OpenBLASConfig.cmake at installation.
- * Fix Fortran compiler detection on FreeBSD.
- (#470, Thanks Mike Nolta)
-
-
- x86/x86-64:
- * Add generic kernel files for x86-64. make TARGET=GENERIC
- * Fix a bug of sgemm kernel on Intel Sandy Bridge.
- * Fix c_check bug on some amd64 systems. (#471, Thanks Mike Nolta)
-
- ARM:
- * Support APM's X-Gene 1 AArch64 processors.
- Optimize trmm and sgemm. (#465, Thanks Dave Nuechterlein)
-
- ====================================================================
- Version 0.2.12
- 13-Oct-2014
- common:
- * Added CBLAS interface for ?omatcopy and ?imatcopy.
- * Enable ?gemm3m functions.
- * Added benchmark for ?gemm3m.
- * Optimized multithreading lower limits.
- * Disabled SYMM3M and HEMM3M functions
- because of segmentation violations.
-
- x86/x86-64:
- * Improved axpy and symv performance on AMD Bulldozer.
- * Improved gemv performance on modern Intel and AMD CPUs.
-
- ====================================================================
- Version 0.2.11
- 18-Aug-2014
- common:
- * Added some benchmark codes.
- * Fix link error on Linux/musl. (Thanks Isaac Dunham)
-
- x86/x86-64:
- * Improved s/c/zgemm performance for Intel Haswell.
- * Improved s/d/c/zgemv performance.
- * Support large NUMA machines. (EXPERIMENTAL)
-
- ARM:
- * Fix detection when cpuinfo uses "Processor". (Thanks Isaiah)
-
- ====================================================================
- Version 0.2.10
- 16-Jul-2014
- common:
- * Added the following BLAS extensions:
- s/d/c/zaxpby, s/d/c/zimatcopy, s/d/c/zomatcopy.
- * Added OPENBLAS_CORETYPE environment for dynamic_arch. (a86d34)
- * Added NO_AVX2 flag for old binutils. (#401)
- * Support outputting the CPU core name at runtime. (#407)
- * Patched LAPACK to fix bugs 114, 117, 118.
- (http://www.netlib.org/lapack/bug_list.html)
- * Disabled ?gemm3m for a work-around fix. (#400)
- x86/x86-64:
- * Fixed many bugs in the optimized kernels for Sandy Bridge, Haswell,
- Bulldozer, and Piledriver.
- https://github.com/xianyi/OpenBLAS/wiki/Fixed-optimized-kernels-To-do-List
-
- ARM:
- * Improved LAPACK testing.
-
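For readers unfamiliar with the extensions added in 0.2.10, the following is a minimal sketch of what ?axpby and ?omatcopy compute, written in plain Python rather than against the library. The function names and argument order here are illustrative assumptions, not the OpenBLAS signatures; only non-negative increments and row-major lists are handled. ?imatcopy performs the same scaled copy or transpose as ?omatcopy, but in place.

```python
def axpby(n, alpha, x, incx, beta, y, incy):
    """Illustrative reference: y := alpha*x + beta*y over n strided elements.

    Assumes incx >= 1 and incy >= 1; updates y in place and returns it.
    """
    for i in range(n):
        y[i * incy] = alpha * x[i * incx] + beta * y[i * incy]
    return y


def omatcopy(rows, cols, alpha, a, trans=False):
    """Illustrative reference: out-of-place scaled copy of a rows x cols matrix.

    Returns alpha*A, or alpha*A^T when trans is True. `a` is a row-major
    list of lists; the input is left untouched.
    """
    if trans:
        return [[alpha * a[i][j] for i in range(rows)] for j in range(cols)]
    return [[alpha * a[i][j] for j in range(cols)] for i in range(rows)]
```

With beta set to 1, axpby reduces to the standard axpy update, which is why it is described as an extension of that routine.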
- ====================================================================
- Version 0.2.9
- 10-Jun-2014
- common:
- * Improved the result for LAPACK testing. (#372)
- * Installed DLL to prefix/bin instead of prefix/lib. (#366)
- * Build import library on Windows. (#374)
- x86/x86-64:
- * To improve LAPACK test results, some kernels now fall back to other implementations. (#372)
- https://github.com/xianyi/OpenBLAS/wiki/Fixed-optimized-kernels-To-do-List
-
- ====================================================================
- Version 0.2.9.rc2
- 06-Mar-2014
- common:
- * Added OPENBLAS_VERBOSE environment variable. (#338)
- * Make OpenBLAS thread-pool resilient to fork via pthread_atfork.
- (#294, Thanks, Olivier Grisel)
- * Rewrote rotmg
- * Fixed sdsdot bug.
- x86/x86-64:
- * Detect Intel Haswell for new Macbook.
-
- ====================================================================
- Version 0.2.9.rc1
- 13-Jan-2014
- common:
- * Update LAPACK to 3.5.0 version
- * Fixed compatibility issues with Clang and Pathscale compilers.
-
- x86/x86-64:
- * Optimization on Intel Haswell.
- * Enable optimization kernels on AMD Bulldozer and Piledriver.
-
- ARM:
- * Support ARMv6 and ARMv7 ISA.
- * Optimization on ARM Cortex-A9.
-
- ====================================================================
- Version 0.2.8
- 01-Aug-2013
- common:
- * Support Open64 5.0. (#266)
- * Add executable stack markings. (#262, Thanks, Sébastien Fabbro)
- * Respect user's LDFLAGS (Thanks, Sébastien Fabbro)
-
- x86/x86-64:
- * Roll back Bulldozer and Piledriver kernels to Barcelona kernels (#263)
- We will fix the computational error bug in the Bulldozer and Piledriver kernels.
-
- ====================================================================
- Version 0.2.7
- 20-Jul-2013
- common:
- * Support LSB (Linux Standard Base) 4.1.
- e.g. make CC=lsbcc
- * Include LAPACK 3.4.2 source code in the repo.
- Avoid downloading at compile time.
- * Add NO_PARALLEL_MAKE flag to disable parallel make.
- * Create openblas_get_parallel to retrieve information about which
- parallelization model is used by OpenBLAS. (Thanks, grisuthedragon)
- * Detect LLVM/Clang compiler. The default compiler is Clang on Mac OS X.
- * Change LIBSUFFIX from .lib to .a on Windows.
- * A work-around for dtrti_U single thread bug. Replace it with LAPACK code. (#191)
-
- x86/x86-64:
- * Optimize c/zgemm, trsm, dgemv_n, ddot, daxpy, dcopy on
- AMD Bulldozer. (Thanks, Werner Saar)
- * Add Intel Haswell support (using Sandybridge optimizations).
- (Thanks, Dan Luu)
- * Add AMD Piledriver support (using Bulldozer optimizations).
- * Fix the computational error in zgemm avx kernel on
- Sandybridge. (#237)
- * Fix the overflow bug in gemv.
- * Fix the overflow bug in multi-threaded BLAS3, getrf when NUM_THREADS
- is very large. (#214, #221, #246)
- MIPS64:
- * Support loongcc (Open64 based) compiler for ICT Loongson 3A/B.
-
- Power:
- * Support Power7 by old Power6 kernels. (#220)
-
- ====================================================================
- Version 0.2.6
- 2-Mar-2013
- common:
- * Improved OpenMP performance slightly. (d744c9)
- * Improved cblas.h compatibility with Intel MKL. (#185)
- * Fixed the overflow bug in single thread cholesky factorization.
- * Fixed the buffer overflow bug in multithreading hbmv and sbmv. (#174)
-
- x86/x86-64:
- * Added AMD Bulldozer x86-64 S/DGEMM AVX kernels. (Thanks, Werner Saar)
- We will tune the performance in future.
- * Auto-detect Intel Xeon E7540.
- * Fixed the buffer overflow bug in gemv. (#173)
- * Fixed the s/cdot bug of invalidly reading NaN on x86_64. (#189)
-
- MIPS64:
-
- ====================================================================
- Version 0.2.5
- 26-Nov-2012
- common:
- * Added NO_SHARED flag to disable generating the shared library.
- * Compile LAPACKE with ILP64 model when INTERFACE64=1 (#158)
- * Export LAPACK 3.4.2 symbols in shared library. (#147)
- * Only detect the number of physical CPU cores on Mac OSX. (#157)
- * Fixed NetBSD build. (#155)
- * Fixed compilation with TARGET=GENERIC. (#160)
- x86/x86-64:
- * Restore the original CPU affinity when calling
- openblas_set_num_threads(1) (#153)
- * Fixed a SEGFAULT bug in dgemv_t when m is very large. (#154)
- MIPS64:
-
- ====================================================================
- Version 0.2.4
- 8-Oct-2012
- common:
- * Upgraded LAPACK to 3.4.2 version. (#145)
- * Provided support for passing CFLAGS, FFLAGS, PFLAGS,
- FPFLAGS to make. (#137)
- * f77blas.h: compatibility for compilers without C99 complex
- number support. (#141)
- x86/x86-64:
- * Added NO_AVX flag. Check for OS support of AVX at runtime. (#139)
- * Fixed zdot ABI incompatibility issue with GCC 4.7 on
- Windows 32-bit. (#140)
- MIPS64:
- * Fixed the shared library generation bug.
- * Fixed the detection bug on the Loongson 3A server.
- ====================================================================
- Version 0.2.3
- 20-Aug-2012
- common:
- * Fixed the LAPACK instability bug in ?laswp. (#130)
- * Fixed the shared library bug about unloading the library on
- Linux (#132).
- * Fixed the compilation failure on BlueGene/P (TARGET=PPC440FP2)
- Please use gcc and IBM xlf. (#134)
- x86/x86-64:
- * Supported goto_set_num_threads and openblas_set_num_threads
- APIs in Windows. They can set the number of threads at runtime.
-
- ====================================================================
- Version 0.2.2
- 6-Jul-2012
- common:
- * Fixed the DLL function export bug on Windows/MinGW
- * Support GNU Hurd (Thanks, Sylvestre Ledru)
- * Support kfreebsd kernel (Thanks, Sylvestre Ledru)
- x86/x86-64:
- * Support Intel Sandy Bridge 22nm desktop/mobile CPU
- SPARC:
- * Improve the detection of SPARC (Thanks, Sylvestre Ledru)
-
- ====================================================================
- Version 0.2.1
- 30-Jun-2012
- common:
- x86/x86-64:
- * Fixed the SEGFAULT bug related to hyper-threading
- * Support AMD Bulldozer by using GotoBLAS2 AMD Barcelona codes
-
- ====================================================================
- Version 0.2.0
- 26-Jun-2012
- common:
- * Removed the limit of 64 on the number of CPU cores.
- Now, it supports at most 256 cores.
- * Supported clang compiler.
- * Fixed some build bugs on FreeBSD
- x86/x86-64:
- * Optimized Level-3 BLAS on Intel Sandy Bridge x86-64 by AVX instructions.
- Please use gcc >= 4.6 or clang >=3.1.
- * Support AMD Bobcat by using GotoBLAS2 AMD Barcelona codes.
-
- ====================================================================
- Version 0.1.1
- 29-Apr-2012
- common:
- * Upgraded LAPACK to 3.4.1 version. (Thanks, Zaheer Chothia)
- * Supported LAPACKE, a C interface to LAPACK. (Thanks, Zaheer Chothia)
- * Fixed the build bug (MD5 and download) on Mac OSX.
- * Auto download CUnit 2.1.2-2 from SF.net with UTEST_CHECK=1.
- * Fixed the compatibility issue for compilers without C99 complex number
- support (e.g. Visual Studio)
- x86/x86_64:
- * Auto-detect Intel Sandy Bridge Core i7-3xxx & Xeon E7 Westmere-EX.
- * Test alpha=NaN in dscal.
- * Fixed a SEGFAULT bug in samax on x86 windows.
-
- ====================================================================
- Version 0.1.0
- 23-Mar-2012
- common:
- * Set soname of shared library on Linux.
- * Added LIBNAMESUFFIX flag in Makefile.rule. The user can use
- this flag to control the library name, e.g. libopenblas.a,
- libopenblas_ifort.a or libopenblas_omp.a.
- * Added GEMM_MULTITHREAD_THRESHOLD flag in Makefile.rule.
- The library uses a single thread in GEMM functions for small matrices.
- x86/x86_64:
- * Used GEMV SSE/SSE2 kernels on x86 32-bit.
- * Exported CBLAS functions in Windows DLL.
- MIPS64:
- * Completed Level-3 BLAS optimization on Loongson 3A CPU.
- * Improved GEMV performance on Loongson 3A CPU.
- * Improved Level-3 BLAS performance on Loongson 3B CPU. (EXPERIMENT)
-
- ====================================================================
- Version 0.1 alpha2.5
- 19-Feb-2012
- common:
- * Fixed missing "#include <sched.h>" bug on Mac OS X.
- Thanks to Mike Nolta for the patch.
- * Upgraded LAPACK to 3.4.0 version
- * Fixed a bug on Mac OS X. Don't require SystemStubs on OS X.
- SystemStubs does not exist on Lion. Thanks, Stefan Karpinski.
- * Improved the README section on using OpenMP. Check that the internal
- thread count is less than or equal to omp_get_max_threads()
- x86/x86_64:
- * Auto-detect Intel Core i5/i7 (Sandy Bridge) CPU with Nehalem assembly kernels
- * Fixed some bugs on MinGW 64-bit including zgemv, cdot, zdot.
-
- ====================================================================
- Version 0.1 alpha2.4
- 18-Sep-2011
- common:
- * Fixed an installation bug. The header file "f77blas.h"
- works fine now.
- * Fixed #61 a build bug about setting TARGET and DYNAMIC_ARCH.
- * Try to handle absolute path of shared library in OSX. (#57)
- Thanks, Dr Kane O'Donnell.
- * Changed the installation folder layout to $(PREFIX)/include and
- $(PREFIX)/lib
-
- x86/x86_64:
- * Fixed #58 zdot/xdot SEGFAULT bug with GCC-4.6 on x86. According
- to the i386 calling convention, the callee should remove the first
- hidden parameter. Thanks to Mr. John for this patch.
-
- ====================================================================
- Version 0.1 alpha2.3
- 5-Sep-2011
-
- x86/x86_64:
- * Added DTB_ENTRIES into dynamic arch setting parameters. Now,
- it can read DTB_ENTRIES at runtime. (Refs issue #55 on github)
-
- ====================================================================
- Version 0.1 alpha2.2
- 14-Jul-2011
-
- common:
- * Fixed a build bug when DYNAMIC_ARCH=1 & INTERFACE64=1.
- (Refs issue #44 on github)
-
- ====================================================================
- Version 0.1 alpha2.1
- 28-Jun-2011
-
- common:
- * Stop the build and output an error message when Fortran
- compiler detection fails. (Refs issue #42 on github)
-
- ====================================================================
- Version 0.1 alpha2
- 23-Jun-2011
-
- common:
- * Fixed blasint undefined bug in <cblas.h> file. Other software
- can now include this header successfully. (Refs issue #13 on github)
- * Fixed the SEGFAULT bug on 64 cores. On an SMP server, the number
- of CPUs or cores should be less than or equal to 64. (Refs issue #14
- on github)
- * Support "void goto_set_num_threads(int num_threads)" and "void
- openblas_set_num_threads(int num_threads)" when USE_OPENMP=1
- * Added extern "C" to support C++. Thanks to Tasio for the patch. (Refs
- issue #21 on github)
- * Provided an error message when the arch is not supported. (Refs
- issue #19 on github)
- * Fixed issue #23. Fixed a bug of f_check script about generating link flags.
- * Added openblas_set_num_threads for Fortran.
- * Fixed #25 a wrong result of rotmg.
- * Fixed a bug about detecting underscore prefix in c_check.
- * Print the wall time (cycles) when FUNCTION_PROFILE is enabled
- * Fixed #35 a build bug with NO_LAPACK=1 & DYNAMIC_ARCH=1
- * Added install target. You can use "make install". (Refs #20)
-
-
- x86/x86_64:
- * Fixed #28 a wrong result of dsdot on x86_64.
- * Fixed #32 a SEGFAULT bug of zdotc with gcc-4.6.
- * Fixed #33 ztrmm bug on Nehalem.
- * Work-around #27 the low performance axpy issue with small input size & multithreads.
-
- MIPS64:
- * Fixed #28 a wrong result of dsdot on Loongson3A/MIPS64.
- * Optimized single/double precision BLAS Level3 on Loongson3A/MIPS64. (Refs #2)
- * Optimized single/double precision axpy function on Loongson3A/MIPS64. (Refs #3)
-
- ====================================================================
- Version 0.1 alpha1
- 20-Mar-2011
-
- common:
- * Support "make NO_LAPACK=1" to build the library without
- LAPACK functions.
- * Fixed a random SEGFAULT when nodemask==NULL on Linux kernels above 2.6.34.
- Thanks to Mr. Ei-ji Nakama for providing this patch. (Refs issue #12 on github)
- * Added DEBUG=1 rule in Makefile.rule to build debug version.
- * Disable compiling quad precision in reference BLAS library(netlib BLAS).
- * Added unit testcases in utest/ subdir. Used CUnit framework.
- * Supported OPENBLAS_* & GOTO_* environment variables (Please see README)
- * Imported GotoBLAS2 1.13 BSD version
-
- x86/x86_64:
- * On 32-bit x86, worked around a bug in zdot_sse2.S (line 191) that would cause
- zdotu & zdotc failures. (Refs issue #8 #9 on github)
- * Modified ?axpy functions to return the same results as netlib BLAS
- when incx==0 or incy==0 (Refs issue #7 on github)
- * Modified ?swap functions to return the same results as netlib BLAS
- when incx==0 or incy==0 (Refs issue #6 on github)
- * Modified ?rot functions to return the same results as netlib BLAS
- when incx==0 or incy==0 (Refs issue #4 on github)
- * Detect Intel Westmere, Intel Clarkdale and Intel Arrandale
- to use Nehalem codes.
- * Fixed a typo bug about compiling dynamic ARCH library.
- MIPS64:
- * Improve daxpy performance on ICT Loongson 3A.
- * Supported ICT Loongson 3A CPU (Refs issue #1 on github)
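
The incx==0 / incy==0 entries for ?axpy above refer to matching the element-by-element behaviour of the netlib reference loop. As a rough illustration, here is a plain-Python sketch of that strided loop. The name axpy_ref is hypothetical, and this is not OpenBLAS code; it only mirrors the reference order of operations, assuming 0-based indexing in place of Fortran's 1-based indexing.

```python
def axpy_ref(n, alpha, x, incx, y, incy):
    """Strided y := alpha*x + y, following the netlib reference loop order.

    A zero increment simply means the same element is revisited on
    every iteration; negative increments start from the far end.
    """
    if n <= 0 or alpha == 0:
        return y  # the reference routine returns early in these cases
    ix = (1 - n) * incx if incx < 0 else 0
    iy = (1 - n) * incy if incy < 0 else 0
    for _ in range(n):
        y[iy] += alpha * x[ix]
        ix += incx
        iy += incy
    return y
```

With incy==0, every iteration accumulates into the same y element, so the result depends on the loop being executed sequentially. With incx==0, the same x element is added to each y element. An optimized kernel has to reproduce both cases to agree with the reference.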
- ====================================================================
|