Arjan van de Ven
89372e0993
Use AVX512 also for DGEMM
this required switching to the generic gemm_beta code (which is faster anyway on SKX)
for both DGEMM and SGEMM
The performance gain for the not-yet-retuned version is in the 30% range
7 years ago
Martin Kroeker
0023515733
Typo fix (misplaced parenthesis)
7 years ago
Arjan van de Ven
99c7bba8e4
Initial support for SkylakeX / AVX512
This patch adds the basic infrastructure for adding the SkylakeX (Intel Skylake server)
target. The SkylakeX target will use the AVX512 (AVX512VL level) instruction set,
which brings 2 basic things:
1) 512 bit wide SIMD (2x width of AVX2)
2) 32 SIMD registers (2x the number on AVX2)
This initial patch only contains a trivial transformation of the Haswell SGEMM kernel
to AVX512VL; more will follow later but this patch aims to get the infrastructure
in place for this "later".
Full performance tuning has not been done yet; with more registers and wider SIMD
it's in theory possible to retune the kernels but even without that there's an
interesting enough performance increase (30-40% range) with just this change.
7 years ago
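Note on the "2x width" figure in the commit above: the lane counts follow from simple arithmetic on the register width. The helper below is a hypothetical illustration, not code from the patch, and it assumes the usual 4-byte float and 8-byte double.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the patch): number of elements of a
   given size that fit into one SIMD register of the given width. */
static size_t simd_lanes(size_t register_bits, size_t element_bytes)
{
    assert(element_bytes > 0);
    return register_bits / (8 * element_bytes);
}
```

With 4-byte floats, simd_lanes(256, 4) gives 8 lanes for AVX2 and simd_lanes(512, 4) gives 16 lanes for AVX512; for 8-byte doubles the counts are 4 and 8.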
Martin Kroeker
8562d5787a
Merge pull request #1583 from martin-frbg/issue1575
Handle INCX=0,INCY=0 case
7 years ago
Martin Kroeker
7df8c4f76f
typo fix
7 years ago
Martin Kroeker
2fc748bf72
Restore optimized swap kernel now that we have a proper fix
7 years ago
Martin Kroeker
d1b7be14aa
Handle INCX=0,INCY=0 case
Fixes #1575 (sswap/dswap failing the swap utest on x86) as suggested by atsampson.
7 years ago
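Note on the INCX=0,INCY=0 case: the commit message does not describe the fix itself. As a rough sketch of why the case is special, the plain strided loop below (a hypothetical helper, not the OpenBLAS kernel, restricted to non-negative strides for brevity) touches the same two elements on every iteration when both increments are zero, so the result depends on whether n is odd or even. An optimized kernel that assumes distinct elements may not reproduce this.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference-style swap: exchange n elements of x and y,
   stepping through them with BLAS-style increments. */
static void swap_strided(size_t n, double *x, size_t incx,
                         double *y, size_t incy)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++) {
        double tmp = x[i * incx];
        x[i * incx] = y[i * incy];
        y[i * incy] = tmp;
    }
}
```

With incx = incy = 0, an odd n leaves x[0] and y[0] exchanged once, while an even n leaves them unchanged.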
Martin Kroeker
961d25e9c7
Use the new zrot.c on POWER8 for crot as well
fixes #1571 (the old zrot.S assembly does not handle incx=0 correctly)
7 years ago
Martin Kroeker
f5959f2543
Merge pull request #1567 from martin-frbg/mipstrmm
Revert " Switch mips32 target to USE_TRMM to fix complex TRMM"
7 years ago
Martin Kroeker
82012b960b
Revert " Switch mips32 target to USE_TRMM to fix complex TRMM"
... as it was just a silly workaround for the issue seen in #1563 , caused by #1419
7 years ago
Martin Kroeker
8dd3515fa2
Merge pull request #1565 from martin-frbg/mipstypo
Remove extraneous brace from previous commit of mips dsdot fix
7 years ago
Martin Kroeker
95f7f0229c
Remove extraneous brace from previous commit
7 years ago
Martin Kroeker
5082fe4306
Merge pull request #1564 from martin-frbg/issue1563
Revert changes from PR#1419
7 years ago
Martin Kroeker
7a7619af6d
Revert changes from PR#1419
at least one of these changes apparently is an oversimplification, leading to TRMM breakage on some platforms as observed in #1563
7 years ago
Martin Kroeker
893b535540
Use correct data type for initializers of v2f64, v4f32
Fixes #1561
7 years ago
Martin Kroeker
018f2dad27
Switch mips32 target to USE_TRMM to fix complex TRMM
7 years ago
Martin Kroeker
9d5098dbc9
Add MIPS 1004K target (Mediatek MT7621 SOC)
7 years ago
Martin Kroeker
954f1832de
Merge pull request #1540 from martin-frbg/mips32-zasum
Fix typo in MIPS P5600 complex ASUM code selection
7 years ago
Martin Kroeker
941ad280a8
Fix typo in MIPS P5600 complex ASUM code selection
7 years ago
Martin Kroeker
1da365312a
Merge pull request #1538 from martin-frbg/arm7utest
Fix handling of zero INCX, INCY in ArmV7 AXPY and ROT
7 years ago
Martin Kroeker
2d0929fa7c
Move the test for zero incx,incy in ARMV7 ROT
to pass the related utest (see #1469 )
7 years ago
Martin Kroeker
125343cc88
Drop test for zero incx,incy in armv7 AXPY
...to pass the related utest (see #1469 )
7 years ago
Martin Kroeker
8a3b6fa108
Use generic zrot.c on ppc64/POWER6 to work around utest failure from … ( #1535 )
* Use generic C implementation of zrot on ppc64/POWER6 to work around utest failure from #1469
7 years ago
Martin Kroeker
9c5518319a
Revert "Fix 32bit HASWELL builds"
7 years ago
Martin Kroeker
2ca0faf495
Merge pull request #1515 from martin-frbg/mipsdot
Correct precision of mips dsdot
7 years ago
Martin Kroeker
0fe434598b
Fix precision of mips dsdot
7 years ago
Martin Kroeker
c7b55b6082
Merge pull request #1499 from quickwritereader/develop
Implemented missing vsx simd kernels for power8 blas1/2 double. z13 modifications
7 years ago
Martin Kroeker
840e01061f
Merge pull request #1491 from martin-frbg/ddot_mt
Add multithreading support for Haswell DDOT
7 years ago
QWR QWR
28ca97015d
power8:Added initial zgemv_(t|n) ,i(d|z)amax,i(d|z)amin,dgemv_t(transposed),zrot
z13: improved zgemv_(t|n)_4,zscal,zaxpy
7 years ago
Martin Kroeker
6a6ffaff1e
Merge pull request #1494 from martin-frbg/x86_dsdot
Use generic/dot.c instead of the inferior arm/dot.c for x86 DSDOT
7 years ago
Martin Kroeker
28ac9ea5a6
Use generic/dot.c instead of the inferior arm/dot.c for x86 DSDOT
to resolve dsdot utest failure seen in #1492
7 years ago
Martin Kroeker
a55694dd5b
Declare dot_compute static to avoid conflicts in multiarch builds
7 years ago
Martin Kroeker
85a41e9cdb
Add multithreading support for Haswell DDOT
copied from ashwinyes' implementation in dot_thunderx2t99.c
7 years ago
Martin Kroeker
81215711a2
Re-enable DAXPY microkernels for x86_64
as the inaccuracies seen in the original testcase for #1332 appear to be due to an artefact that amplifies the very small rounding differences between FMA and discrete multiply+add
7 years ago
Martin Kroeker
22167170b3
Merge pull request #1477 from quickwritereader/develop
Power8 blas3 copy-pack routines
7 years ago
Ashwin Sekhar T K
fa9ca65c0e
ARM64: Fix utest dsdot errors
7 years ago
Martin Kroeker
719b68f077
Merge pull request #1473 from martin-frbg/p2align
Replace .align with .p2align in dscal.c and the Nehalem microkernels as well
7 years ago
Martin Kroeker
fe9f15f2d8
Merge pull request #1472 from martin-frbg/utest-fixes
Fix limited DSDOT precision on arm,aarch64 and zarch
7 years ago
Martin Kroeker
497f0c3d8a
Replace .align with .p2align in the Nehalem microkernels
7 years ago
Martin Kroeker
ea37db828e
Convert .align to .p2align for OSX compatibility
7 years ago
Martin Kroeker
6e70287776
Use generic/dot.c for DSDOT on ARMV5 and above
The default arm/dot.c is less precise when used for DSDOT, as shown by utest
7 years ago
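Note on DSDOT precision: DSDOT takes single-precision inputs and returns a double-precision result. The commit does not say where arm/dot.c loses precision; one plausible way, shown here purely as an assumption with hypothetical function names, is forming each product in float before adding it to a double accumulator.

```c
#include <stddef.h>

/* Hypothetical variant A: widen both operands first, so the product is
   computed in double precision. */
static double dsdot_widened(size_t n, const float *x, const float *y)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += (double)x[i] * (double)y[i];
    return acc;
}

/* Hypothetical variant B: the product is rounded to float before it is
   accumulated. The volatile store forces that rounding to happen. */
static double dsdot_float_product(size_t n, const float *x, const float *y)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        volatile float p = x[i] * y[i];
        acc += (double)p;
    }
    return acc;
}
```

For x = 1 + 2^-12 the exact square is 1 + 2^-11 + 2^-24, which fits in a double but not in a float, so the two variants return different values even for n = 1.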
Martin Kroeker
58f236ad73
Use generic/dot.c for DSDOT on zarch
7 years ago
Martin Kroeker
e207107150
Use generic/dot.c for DSDOT on z13
The implementation in arm/dot.c has lower precision, as shown by the utest for dsdot.
7 years ago
Martin Kroeker
c9d408064a
Use dot.S also for DSDOT on CORTEXA57
7 years ago
Martin Kroeker
288d1a3f6e
Use dot.S also for DSDOT on ARMV8
7 years ago
Martin Kroeker
7c1925acec
Use .p2align instead of .align for compatibility on Sandybridge as well
7 years ago
Martin Kroeker
2359c7c1a9
Use .p2align instead of .align for portability
The OSX assembler apparently mishandles the argument to decimal .align, leading to a significant loss of performance
as observed in #730 , #901 and most recently #1470
7 years ago
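Note on .align versus .p2align: with the GNU assembler, the argument of .align is a byte count on some targets and a power-of-two exponent on others, whereas .p2align always takes the exponent. Whether this ambiguity is exactly what goes wrong on OSX is not stated in the commit; the sketch below, with hypothetical helper names, only shows how far apart the two readings of the same number are.

```c
#include <assert.h>
#include <stddef.h>

/* Boundary in bytes when the argument is read as a plain byte count. */
static size_t boundary_as_bytes(size_t arg)
{
    return arg;
}

/* Boundary in bytes when the argument is read as a power-of-two exponent,
   which is what .p2align always does. */
static size_t boundary_as_power_of_two(unsigned arg)
{
    assert(arg < 8 * sizeof(size_t));
    return (size_t)1 << arg;
}

/* Padding needed to move offset up to the next multiple of boundary. */
static size_t padding_to(size_t offset, size_t boundary)
{
    assert(boundary > 0);
    return (boundary - offset % boundary) % boundary;
}
```

An argument of 16 means a 16-byte boundary under the first reading and a 65536-byte boundary under the second, while .p2align 4 means 16 bytes on every target.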
Martin Kroeker
e7366a4161
Restore the remaining utests ( #1462 )
* Restore the remaining utests
* Try fork test on Cygwin and Linux only, it hangs on at least ARMv8/Android as well
* Use generic sswap/dswap kernels for NEHALEM 32bit to fix fault found by the restored swap utest
* Disable zdotu test for MS cl to work around runtime error -1073741819 on AppVeyor for now
(probably a coding error in the initialization of the complex numbers or the wrong choice of zdotu API)
7 years ago
the mslm
2c0a008281
dgemm_ncopy_4_ save/restore
7 years ago
the mslm
c5425daa6b
power8 ?gemm_tcopy save/restore
7 years ago