Matt Brown
4f09030fdc
Optimise cswap for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
8 years ago
Matt Brown
6f4eca5ea4
Optimise sswap for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
8 years ago
Matt Brown
be55f96cbd
Optimise scopy for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
8 years ago
Matt Brown
96dd0ef4f7
Optimise ccopy for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
8 years ago
Alan Modra
dc40bc7368
Power8 inline assembly tweaks
Further fixes on top of 9e2f316ed . Writing some doco for gcc on
inline assembly woke me up to some more errors.
- dgemv_kernel_4x4 asm did not mention *ap as a memory input, and
*y is both read and write.
- sasum_kernel_32 and casum_kernel_16 did not use %x for a vsx insn
operand, a problem if the "=f" sum output was ever allocated a vsx
reg in the altivec set. This might be possible with inlining and
future gcc optimisation.
8 years ago
Zhang Xianyi
b5c96fcfcd
Support ARM SOFTFP ABI for saxpy, sdot, snrm2, sscal, sgemv, sger.
8 years ago
Denis Steckelmacher
c9ff735da6
Add ZEN support (tested for auto-detected static backend)
8 years ago
Martin Kroeker
cd135e2b59
Merge pull request #1130 from quickwritereader/develop
Blas 3 for single precision
8 years ago
Martin Kroeker
a6efabf155
Replace gnu _real_ , _imag_ extensions in initializers
8 years ago
Abdurrauf
08786c4b95
strmm and ctrmm
8 years ago
Zhang Xianyi
90e02ccf68
Support ARM softfp ABI for sgemm on ARMV7.
make ARM_SOFTFP_ABI=1
8 years ago
Abdurrauf
82e80fa82b
initial strmm(sgemm). not tuned yet
8 years ago
Martin Kroeker
d1fe040d9b
Merge pull request #1110 from quickwritereader/develop
Conventional usage of the register save area.
8 years ago
Abdurrauf
411982715c
conventional usage of the register save area
8 years ago
Abdurrauf
e831d6924e
changed to conventional register save area
8 years ago
Martin Kroeker
ffc1d6c468
Merge pull request #1108 from ashwinyes/develop_20170203_thunderx2t99
Optimized Implementations for ThunderX2T99
8 years ago
Ashwin Sekhar T K
67473d09dd
THUNDERX2T99: Bug Fixes in D/Z NRM2 and ZGEMM
8 years ago
Ashwin Sekhar T K
19ba133383
THUNDERX2T99: Add Optimized ZGEMM Implementation
8 years ago
Abdurrauf
0d96b0e2a7
Merge branch 'z13' into develop
8 years ago
Abdurrauf
848cb27b1e
ztrmm kernel.
8 years ago
Martin Kroeker
dc34a0da96
Merge pull request #915 from mdong/small_fix_for_icc
remove input from clobbered list
8 years ago
Ashwin Sekhar T K
a3935f0dfb
THUNDERX2T99: Add Optimized D/Z NRM2 Implementation
8 years ago
Ashwin Sekhar T K
738628e9a8
ARM64: Remove unused code
8 years ago
Ashwin Sekhar T K
ab3ffab96a
THUNDERX2T99: Add Optimized C/Z DOT Implementation
8 years ago
Ashwin Sekhar T K
f036be9ce2
THUNDERX2T99: Add Optimized SDOT Implementation
8 years ago
Ashwin Sekhar T K
faba876fda
THUNDERX2T99: Bug fix in C/Z IAMAX
8 years ago
Ashwin Sekhar T K
172a62d73e
THUNDERX2T99: Add Optimized C/Z IAMAX Implementation
8 years ago
Ashwin Sekhar T K
228c75a69c
THUNDERX2T99: Add parallel SCNRM2 Implementation
8 years ago
Martin Kroeker
9e2f316ede
Power8 inline assembly fixes
Quoting patch author amodra from #1078
Lots of issues here.
- The vsx regs weren't listed as clobbered.
- Poor choice of vsx regs, which along with the lack of clobbers led to
trashing v0..v21 and fr14..fr23. Ideally you'd let gcc choose all
temp vsx regs, but asms currently have a limit of 30 i/o parms.
- Other regs were clobbered unnecessarily, seemingly in an attempt to
clobber inputs, with gcc-7 complaining about the clobber of r2.
(Changed inputs should be also listed as outputs or as an i/o.)
- "r" constraint used instead of "b" for gprs used in insns where the
r0 encoding means zero rather than r0.
- There were unused asm inputs too.
- All memory was clobbered rather than hooking up memory outputs with
proper memory constraints, and that and the lack of proper memory
input constraints meant the asms needed to be volatile and their
containing function noinline.
- Some parameters were being passed unnecessarily via memory.
- When a copy of a
8 years ago
Ashwin Sekhar T K
8e89668f62
THUNDERX2T99: Fix bug in SNRM2
8 years ago
Ashwin Sekhar T K
f63deae9de
THUNDERX2T99: Add Optimized S/D IAMAX Implementation
8 years ago
Martin Kroeker
60eea75409
Merge pull request #1076 from ashwinyes/develop_20170130_thunderx2t99
More optimized implementations for ThunderX2T99
8 years ago
Ashwin Sekhar T K
071a830e8b
THUNDERX2T99: Add optimized S/D/C/Z SWAP Implementations
8 years ago
Ashwin Sekhar T K
d09f88192c
THUNDERX2T99: Add optimized S/D/C/Z COPY Implementations
8 years ago
Ashwin Sekhar T K
e58233460a
THUDNERX2T99: Add optimized D/C/Z ASUM Implementations
8 years ago
Ashwin Sekhar T K
99bd2892bf
THUNDERX2T99: Add optimized CASUM Implementation
8 years ago
Ashwin Sekhar T K
ff6f572f2e
THUNDERX2T99: Rename labels in for DDOT and SNRM2
8 years ago
Ashwin Sekhar T K
e0dc5f58c5
THUNDERX2T99: Remove Duplicate Code
8 years ago
Ashwin Sekhar T K
2757b49767
THUNDERX2T99: Add Optimized CGEMM Implementation
8 years ago
Zhang Xianyi
ff41e13385
Merge pull request #1074 from ashwinyes/develop_20170116_thunderx2t99_sgemm
Add more THUNDERX2T99 Optimized APIs
8 years ago
Ashwin Sekhar T K
907e286eb6
THUNDERX2T99: Add threaded SNRM2 Implementation
8 years ago
Ashwin Sekhar T K
cde3aee08b
ARM64: Rename kernel files to have consistent naming
8 years ago
Ashwin Sekhar T K
ee6ea7e988
THUNDERX2T99: Add Optimized CNRM2 Implementation
8 years ago
Ashwin Sekhar T K
ca0b36b012
THUNDERX2T99: Add Optimized SNRM2 Implementation
8 years ago
Ashwin Sekhar T K
d0a79ca6e0
THUNDERX2T99: Add threaded DDOT Implementation
8 years ago
Ashwin Sekhar T K
0c07003ccf
THUNDERX2T99: Add Optimized DDOT Implementation
8 years ago
Ashwin Sekhar T K
f33fcedb30
THUNDERX2T99: Improve SGEMM
8 years ago
Ashwin Sekhar T K
0f1d6e8b39
THUNDERX2T99: Improve DGEMM
8 years ago
Ashwin Sekhar T K
981064acc6
THUNDERX2T99: Add Optimized DAXPY Implementation
8 years ago
Shivraj Patil
a4d97d980f
Added rot functions.
Signed-off-by: Shivraj Patil <shivraj.patil@imgtec.com>
8 years ago