param.h 101 kB

Simplifying ARMv8 build parameters

ARMv8 builds were a bit mixed up: ThunderX2 code was used in ARMv8 mode (which is not right, because TX2 is ARMv8.1), and the defines carried a few redundancies, which made them harder to maintain and made it harder to see which core has which settings. A few other minor issues were also fixed.

Tests were made on the following cores: A53, A57, A72, Falkor, ThunderX, ThunderX2, and XGene.

Tests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester.

A summary:

* Removed TX2 code from the ARMv8 build, to make sure it is compatible with all ARMv8 cores, not just v8.1. Also, the TX2 code has actually harmed performance on big cores.
* Commoned up the ARMv8 architectures' defines in param.h, to make sure that all cores benefit from the ARMv8 settings in addition to their own.
* Added a few more cores, using ARMv8's include strategy, to benefit from compiler optimisations using mtune. Also updated cache information from the manuals, making sure we set good conservative values by default. Removed Vulcan, as it's an alias to TX2.
* Auto-detecting most of those cores, but also updating the forced compilation in getarch.c, to make sure the parameters are the same whether compiled natively or with a forced arch.

Benefits:

* The ARMv8 build is now guaranteed to work on all ARMv8 cores.
* Improved performance for ARMv8 builds on some cores (A72, Falkor, ThunderX1 and 2: up to 11%) over current develop.
* Improved performance for *all* cores compared to the develop branch before TX2's patch (9% ~ 36%).
* ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than the current develop branch and 8% faster than develop before the TX2 patches.

Issues:

* Regression from the current develop branch for A53 (-12%) and A57 (-3%) with ARMv8 builds, but still faster than before TX2's commit (+15% and +24% respectively). This can be improved with a simplification of TX2's code, to be done in future patches. At least the code is guaranteed to be ARMv8.0 now.

Comments:

* CortexA57 builds are unchanged on A57 hardware from develop's branch, which makes sense, as it's untouched.
* CortexA72 builds improve over A57 on A72 hardware, even though they use the same includes, due to new compiler tuning in the makefile.
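
The "commoned up" defines described in the summary can be pictured as a two-step preprocessor layout: core-specific overrides first, then shared ARMv8 defaults that fill in whatever a core did not set. The following is only a minimal sketch of that idea, not the contents of param.h. Every `DEMO_*` macro, every `demo_*` function and every numeric value is a hypothetical placeholder chosen for illustration.

```c
/* Hypothetical sketch: names and values are illustrative, not taken from param.h. */
#include <assert.h>
#include <stdio.h>

/* Stand-in for a build flag such as -DDEMO_CORTEXA72 (assumed, not a real flag). */
#if !defined(DEMO_CORTEXA72) && !defined(DEMO_THUNDERX) && !defined(DEMO_GENERIC_ARMV8)
#define DEMO_CORTEXA72
#endif

/* Step 1: each core overrides only what differs from the shared defaults. */
#if defined(DEMO_CORTEXA72)
#define DEMO_CORE_NAME "cortex-a72"
#define DEMO_L2_SIZE (1024 * 1024)   /* placeholder value, not from a manual */
#elif defined(DEMO_THUNDERX)
#define DEMO_CORE_NAME "thunderx"
#define DEMO_GEMM_UNROLL_M 2         /* placeholder value */
#endif

/* Step 2: shared ARMv8 defaults apply to anything left undefined above. */
#ifndef DEMO_CORE_NAME
#define DEMO_CORE_NAME "generic-armv8"
#endif
#ifndef DEMO_L2_SIZE
#define DEMO_L2_SIZE (256 * 1024)    /* deliberately conservative placeholder */
#endif
#ifndef DEMO_GEMM_UNROLL_M
#define DEMO_GEMM_UNROLL_M 4         /* placeholder value */
#endif

/* Compile-time sanity checks on the merged result. */
static_assert(DEMO_L2_SIZE > 0, "L2 size must be positive");
static_assert(DEMO_GEMM_UNROLL_M > 0, "unroll factor must be positive");

/* Accessors so the merged parameters can be inspected at run time. */
const char *demo_core_name(void) { return DEMO_CORE_NAME; }
int demo_l2_size(void) { return DEMO_L2_SIZE; }
int demo_gemm_unroll_m(void) { return DEMO_GEMM_UNROLL_M; }

/* Prints the parameters that the selected core ends up with. */
void demo_print_params(void) {
    printf("%s: L2=%d bytes, unroll_m=%d\n",
           demo_core_name(), demo_l2_size(), demo_gemm_unroll_m());
}
```

With this layout, adding a core means adding one `#elif` branch that lists only its differences, and a forced-arch build and a natively detected build would see the same values as long as both define the same selector macro.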
6 years ago
12 years ago
3 years ago
Simplifying ARMv8 build parameters ARMv8 builds were a bit mixed up, with ThunderX2 code in ARMv8 mode (which is not right because TX2 is ARMv8.1) as well as requiring a few redundancies in the defines, making it harder to maintain and understand what core has what. A few other minor issues were also fixed. Tests were made on the following cores: A53, A57, A72, Falkor, ThunderX, ThunderX2, and XGene. Tests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester. A summary: * Removed TX2 code from ARMv8 build, to make sure it is compatible with all ARMv8 cores, not just v8.1. Also, the TX2 code has actually harmed performance on big cores. * Commoned up ARMv8 architectures' defines in params.h, to make sure that all will benefit from ARMv8 settings, in addition to their own. * Adding a few more cores, using ARMv8's include strategy, to benefit from compiler optimisations using mtune. Also updated cache information from the manuals, making sure we set good conservative values by default. Removed Vulcan, as it's an alias to TX2. * Auto-detecting most of those cores, but also updating the forced compilation in getarch.c, to make sure the parameters are the same whether compiled natively or forced arch. Benefits: * ARMv8 build is now guaranteed to work on all ARMv8 cores * Improved performance for ARMv8 builds on some cores (A72, Falkor, ThunderX1 and 2: up to 11%) over current develop * Improved performance for *all* cores comparing to develop branch before TX2's patch (9% ~ 36%) * ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than current develop's branch and 8% faster than deveop before tx2 patches Issues: * Regression from current develop branch for A53 (-12%) and A57 (-3%) with ARMv8 builds, but still faster than before TX2's commit (+15% and +24% respectively). This can be improved with a simplification of TX2's code, to be done in future patches. At least the code is guaranteed to be ARMv8.0 now. 
Comments: * CortexA57 builds are unchanged on A57 hardware from develop's branch, which makes sense, as it's untouched. * CortexA72 builds improve over A57 on A72 hardware, even if they're using the same includes due to new compiler tunning in the makefile.
6 years ago
Simplifying ARMv8 build parameters ARMv8 builds were a bit mixed up, with ThunderX2 code in ARMv8 mode (which is not right because TX2 is ARMv8.1) as well as requiring a few redundancies in the defines, making it harder to maintain and understand what core has what. A few other minor issues were also fixed. Tests were made on the following cores: A53, A57, A72, Falkor, ThunderX, ThunderX2, and XGene. Tests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester. A summary: * Removed TX2 code from ARMv8 build, to make sure it is compatible with all ARMv8 cores, not just v8.1. Also, the TX2 code has actually harmed performance on big cores. * Commoned up ARMv8 architectures' defines in params.h, to make sure that all will benefit from ARMv8 settings, in addition to their own. * Adding a few more cores, using ARMv8's include strategy, to benefit from compiler optimisations using mtune. Also updated cache information from the manuals, making sure we set good conservative values by default. Removed Vulcan, as it's an alias to TX2. * Auto-detecting most of those cores, but also updating the forced compilation in getarch.c, to make sure the parameters are the same whether compiled natively or forced arch. Benefits: * ARMv8 build is now guaranteed to work on all ARMv8 cores * Improved performance for ARMv8 builds on some cores (A72, Falkor, ThunderX1 and 2: up to 11%) over current develop * Improved performance for *all* cores comparing to develop branch before TX2's patch (9% ~ 36%) * ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than current develop's branch and 8% faster than deveop before tx2 patches Issues: * Regression from current develop branch for A53 (-12%) and A57 (-3%) with ARMv8 builds, but still faster than before TX2's commit (+15% and +24% respectively). This can be improved with a simplification of TX2's code, to be done in future patches. At least the code is guaranteed to be ARMv8.0 now. 
Comments: * CortexA57 builds are unchanged on A57 hardware from develop's branch, which makes sense, as it's untouched. * CortexA72 builds improve over A57 on A72 hardware, even if they're using the same includes due to new compiler tunning in the makefile.
6 years ago
12 years ago
12 years ago
Simplifying ARMv8 build parameters ARMv8 builds were a bit mixed up, with ThunderX2 code in ARMv8 mode (which is not right because TX2 is ARMv8.1) as well as requiring a few redundancies in the defines, making it harder to maintain and understand what core has what. A few other minor issues were also fixed. Tests were made on the following cores: A53, A57, A72, Falkor, ThunderX, ThunderX2, and XGene. Tests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester. A summary: * Removed TX2 code from ARMv8 build, to make sure it is compatible with all ARMv8 cores, not just v8.1. Also, the TX2 code has actually harmed performance on big cores. * Commoned up ARMv8 architectures' defines in params.h, to make sure that all will benefit from ARMv8 settings, in addition to their own. * Adding a few more cores, using ARMv8's include strategy, to benefit from compiler optimisations using mtune. Also updated cache information from the manuals, making sure we set good conservative values by default. Removed Vulcan, as it's an alias to TX2. * Auto-detecting most of those cores, but also updating the forced compilation in getarch.c, to make sure the parameters are the same whether compiled natively or forced arch. Benefits: * ARMv8 build is now guaranteed to work on all ARMv8 cores * Improved performance for ARMv8 builds on some cores (A72, Falkor, ThunderX1 and 2: up to 11%) over current develop * Improved performance for *all* cores comparing to develop branch before TX2's patch (9% ~ 36%) * ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than current develop's branch and 8% faster than deveop before tx2 patches Issues: * Regression from current develop branch for A53 (-12%) and A57 (-3%) with ARMv8 builds, but still faster than before TX2's commit (+15% and +24% respectively). This can be improved with a simplification of TX2's code, to be done in future patches. At least the code is guaranteed to be ARMv8.0 now. 
Comments: * CortexA57 builds are unchanged on A57 hardware from develop's branch, which makes sense, as it's untouched. * CortexA72 builds improve over A57 on A72 hardware, even if they're using the same includes due to new compiler tunning in the makefile.
6 years ago
12 years ago
Simplifying ARMv8 build parameters ARMv8 builds were a bit mixed up, with ThunderX2 code in ARMv8 mode (which is not right because TX2 is ARMv8.1) as well as requiring a few redundancies in the defines, making it harder to maintain and understand what core has what. A few other minor issues were also fixed. Tests were made on the following cores: A53, A57, A72, Falkor, ThunderX, ThunderX2, and XGene. Tests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester. A summary: * Removed TX2 code from ARMv8 build, to make sure it is compatible with all ARMv8 cores, not just v8.1. Also, the TX2 code has actually harmed performance on big cores. * Commoned up ARMv8 architectures' defines in params.h, to make sure that all will benefit from ARMv8 settings, in addition to their own. * Adding a few more cores, using ARMv8's include strategy, to benefit from compiler optimisations using mtune. Also updated cache information from the manuals, making sure we set good conservative values by default. Removed Vulcan, as it's an alias to TX2. * Auto-detecting most of those cores, but also updating the forced compilation in getarch.c, to make sure the parameters are the same whether compiled natively or forced arch. Benefits: * ARMv8 build is now guaranteed to work on all ARMv8 cores * Improved performance for ARMv8 builds on some cores (A72, Falkor, ThunderX1 and 2: up to 11%) over current develop * Improved performance for *all* cores comparing to develop branch before TX2's patch (9% ~ 36%) * ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than current develop's branch and 8% faster than deveop before tx2 patches Issues: * Regression from current develop branch for A53 (-12%) and A57 (-3%) with ARMv8 builds, but still faster than before TX2's commit (+15% and +24% respectively). This can be improved with a simplification of TX2's code, to be done in future patches. At least the code is guaranteed to be ARMv8.0 now. 
Comments: * CortexA57 builds are unchanged on A57 hardware from develop's branch, which makes sense, as it's untouched. * CortexA72 builds improve over A57 on A72 hardware, even if they're using the same includes due to new compiler tunning in the makefile.
6 years ago
Simplifying ARMv8 build parameters ARMv8 builds were a bit mixed up, with ThunderX2 code in ARMv8 mode (which is not right because TX2 is ARMv8.1) as well as requiring a few redundancies in the defines, making it harder to maintain and understand what core has what. A few other minor issues were also fixed. Tests were made on the following cores: A53, A57, A72, Falkor, ThunderX, ThunderX2, and XGene. Tests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester. A summary: * Removed TX2 code from ARMv8 build, to make sure it is compatible with all ARMv8 cores, not just v8.1. Also, the TX2 code has actually harmed performance on big cores. * Commoned up ARMv8 architectures' defines in params.h, to make sure that all will benefit from ARMv8 settings, in addition to their own. * Adding a few more cores, using ARMv8's include strategy, to benefit from compiler optimisations using mtune. Also updated cache information from the manuals, making sure we set good conservative values by default. Removed Vulcan, as it's an alias to TX2. * Auto-detecting most of those cores, but also updating the forced compilation in getarch.c, to make sure the parameters are the same whether compiled natively or forced arch. Benefits: * ARMv8 build is now guaranteed to work on all ARMv8 cores * Improved performance for ARMv8 builds on some cores (A72, Falkor, ThunderX1 and 2: up to 11%) over current develop * Improved performance for *all* cores comparing to develop branch before TX2's patch (9% ~ 36%) * ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than current develop's branch and 8% faster than deveop before tx2 patches Issues: * Regression from current develop branch for A53 (-12%) and A57 (-3%) with ARMv8 builds, but still faster than before TX2's commit (+15% and +24% respectively). This can be improved with a simplification of TX2's code, to be done in future patches. At least the code is guaranteed to be ARMv8.0 now. 
Comments: * CortexA57 builds are unchanged on A57 hardware from develop's branch, which makes sense, as it's untouched. * CortexA72 builds improve over A57 on A72 hardware, even if they're using the same includes due to new compiler tunning in the makefile.
6 years ago
Simplifying ARMv8 build parameters ARMv8 builds were a bit mixed up, with ThunderX2 code in ARMv8 mode (which is not right because TX2 is ARMv8.1) as well as requiring a few redundancies in the defines, making it harder to maintain and understand what core has what. A few other minor issues were also fixed. Tests were made on the following cores: A53, A57, A72, Falkor, ThunderX, ThunderX2, and XGene. Tests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester. A summary: * Removed TX2 code from ARMv8 build, to make sure it is compatible with all ARMv8 cores, not just v8.1. Also, the TX2 code has actually harmed performance on big cores. * Commoned up ARMv8 architectures' defines in params.h, to make sure that all will benefit from ARMv8 settings, in addition to their own. * Adding a few more cores, using ARMv8's include strategy, to benefit from compiler optimisations using mtune. Also updated cache information from the manuals, making sure we set good conservative values by default. Removed Vulcan, as it's an alias to TX2. * Auto-detecting most of those cores, but also updating the forced compilation in getarch.c, to make sure the parameters are the same whether compiled natively or forced arch. Benefits: * ARMv8 build is now guaranteed to work on all ARMv8 cores * Improved performance for ARMv8 builds on some cores (A72, Falkor, ThunderX1 and 2: up to 11%) over current develop * Improved performance for *all* cores comparing to develop branch before TX2's patch (9% ~ 36%) * ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than current develop's branch and 8% faster than deveop before tx2 patches Issues: * Regression from current develop branch for A53 (-12%) and A57 (-3%) with ARMv8 builds, but still faster than before TX2's commit (+15% and +24% respectively). This can be improved with a simplification of TX2's code, to be done in future patches. At least the code is guaranteed to be ARMv8.0 now. 
Comments: * CortexA57 builds are unchanged on A57 hardware from develop's branch, which makes sense, as it's untouched. * CortexA72 builds improve over A57 on A72 hardware, even if they're using the same includes due to new compiler tunning in the makefile.
6 years ago
Simplifying ARMv8 build parameters ARMv8 builds were a bit mixed up, with ThunderX2 code in ARMv8 mode (which is not right because TX2 is ARMv8.1) as well as requiring a few redundancies in the defines, making it harder to maintain and understand what core has what. A few other minor issues were also fixed. Tests were made on the following cores: A53, A57, A72, Falkor, ThunderX, ThunderX2, and XGene. Tests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester. A summary: * Removed TX2 code from ARMv8 build, to make sure it is compatible with all ARMv8 cores, not just v8.1. Also, the TX2 code has actually harmed performance on big cores. * Commoned up ARMv8 architectures' defines in params.h, to make sure that all will benefit from ARMv8 settings, in addition to their own. * Adding a few more cores, using ARMv8's include strategy, to benefit from compiler optimisations using mtune. Also updated cache information from the manuals, making sure we set good conservative values by default. Removed Vulcan, as it's an alias to TX2. * Auto-detecting most of those cores, but also updating the forced compilation in getarch.c, to make sure the parameters are the same whether compiled natively or forced arch. Benefits: * ARMv8 build is now guaranteed to work on all ARMv8 cores * Improved performance for ARMv8 builds on some cores (A72, Falkor, ThunderX1 and 2: up to 11%) over current develop * Improved performance for *all* cores comparing to develop branch before TX2's patch (9% ~ 36%) * ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than current develop's branch and 8% faster than deveop before tx2 patches Issues: * Regression from current develop branch for A53 (-12%) and A57 (-3%) with ARMv8 builds, but still faster than before TX2's commit (+15% and +24% respectively). This can be improved with a simplification of TX2's code, to be done in future patches. At least the code is guaranteed to be ARMv8.0 now. 
Comments: * CortexA57 builds are unchanged on A57 hardware from develop's branch, which makes sense, as it's untouched. * CortexA72 builds improve over A57 on A72 hardware, even if they're using the same includes due to new compiler tunning in the makefile.
6 years ago
Simplifying ARMv8 build parameters ARMv8 builds were a bit mixed up, with ThunderX2 code in ARMv8 mode (which is not right because TX2 is ARMv8.1) as well as requiring a few redundancies in the defines, making it harder to maintain and understand what core has what. A few other minor issues were also fixed. Tests were made on the following cores: A53, A57, A72, Falkor, ThunderX, ThunderX2, and XGene. Tests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester. A summary: * Removed TX2 code from ARMv8 build, to make sure it is compatible with all ARMv8 cores, not just v8.1. Also, the TX2 code has actually harmed performance on big cores. * Commoned up ARMv8 architectures' defines in params.h, to make sure that all will benefit from ARMv8 settings, in addition to their own. * Adding a few more cores, using ARMv8's include strategy, to benefit from compiler optimisations using mtune. Also updated cache information from the manuals, making sure we set good conservative values by default. Removed Vulcan, as it's an alias to TX2. * Auto-detecting most of those cores, but also updating the forced compilation in getarch.c, to make sure the parameters are the same whether compiled natively or forced arch. Benefits: * ARMv8 build is now guaranteed to work on all ARMv8 cores * Improved performance for ARMv8 builds on some cores (A72, Falkor, ThunderX1 and 2: up to 11%) over current develop * Improved performance for *all* cores comparing to develop branch before TX2's patch (9% ~ 36%) * ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than current develop's branch and 8% faster than deveop before tx2 patches Issues: * Regression from current develop branch for A53 (-12%) and A57 (-3%) with ARMv8 builds, but still faster than before TX2's commit (+15% and +24% respectively). This can be improved with a simplification of TX2's code, to be done in future patches. At least the code is guaranteed to be ARMv8.0 now. 
Comments: * CortexA57 builds are unchanged on A57 hardware from develop's branch, which makes sense, as it's untouched. * CortexA72 builds improve over A57 on A72 hardware, even if they're using the same includes due to new compiler tunning in the makefile.
6 years ago
Simplifying ARMv8 build parameters ARMv8 builds were a bit mixed up, with ThunderX2 code in ARMv8 mode (which is not right because TX2 is ARMv8.1) as well as requiring a few redundancies in the defines, making it harder to maintain and understand what core has what. A few other minor issues were also fixed. Tests were made on the following cores: A53, A57, A72, Falkor, ThunderX, ThunderX2, and XGene. Tests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester. A summary: * Removed TX2 code from ARMv8 build, to make sure it is compatible with all ARMv8 cores, not just v8.1. Also, the TX2 code has actually harmed performance on big cores. * Commoned up ARMv8 architectures' defines in params.h, to make sure that all will benefit from ARMv8 settings, in addition to their own. * Adding a few more cores, using ARMv8's include strategy, to benefit from compiler optimisations using mtune. Also updated cache information from the manuals, making sure we set good conservative values by default. Removed Vulcan, as it's an alias to TX2. * Auto-detecting most of those cores, but also updating the forced compilation in getarch.c, to make sure the parameters are the same whether compiled natively or forced arch. Benefits: * ARMv8 build is now guaranteed to work on all ARMv8 cores * Improved performance for ARMv8 builds on some cores (A72, Falkor, ThunderX1 and 2: up to 11%) over current develop * Improved performance for *all* cores comparing to develop branch before TX2's patch (9% ~ 36%) * ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than current develop's branch and 8% faster than deveop before tx2 patches Issues: * Regression from current develop branch for A53 (-12%) and A57 (-3%) with ARMv8 builds, but still faster than before TX2's commit (+15% and +24% respectively). This can be improved with a simplification of TX2's code, to be done in future patches. At least the code is guaranteed to be ARMv8.0 now. 
Comments: * CortexA57 builds are unchanged on A57 hardware from develop's branch, which makes sense, as it's untouched. * CortexA72 builds improve over A57 on A72 hardware, even if they're using the same includes due to new compiler tunning in the makefile.
6 years ago
Simplifying ARMv8 build parameters ARMv8 builds were a bit mixed up, with ThunderX2 code in ARMv8 mode (which is not right because TX2 is ARMv8.1) as well as requiring a few redundancies in the defines, making it harder to maintain and understand what core has what. A few other minor issues were also fixed. Tests were made on the following cores: A53, A57, A72, Falkor, ThunderX, ThunderX2, and XGene. Tests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester. A summary: * Removed TX2 code from ARMv8 build, to make sure it is compatible with all ARMv8 cores, not just v8.1. Also, the TX2 code has actually harmed performance on big cores. * Commoned up ARMv8 architectures' defines in params.h, to make sure that all will benefit from ARMv8 settings, in addition to their own. * Adding a few more cores, using ARMv8's include strategy, to benefit from compiler optimisations using mtune. Also updated cache information from the manuals, making sure we set good conservative values by default. Removed Vulcan, as it's an alias to TX2. * Auto-detecting most of those cores, but also updating the forced compilation in getarch.c, to make sure the parameters are the same whether compiled natively or forced arch. Benefits: * ARMv8 build is now guaranteed to work on all ARMv8 cores * Improved performance for ARMv8 builds on some cores (A72, Falkor, ThunderX1 and 2: up to 11%) over current develop * Improved performance for *all* cores comparing to develop branch before TX2's patch (9% ~ 36%) * ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than current develop's branch and 8% faster than deveop before tx2 patches Issues: * Regression from current develop branch for A53 (-12%) and A57 (-3%) with ARMv8 builds, but still faster than before TX2's commit (+15% and +24% respectively). This can be improved with a simplification of TX2's code, to be done in future patches. At least the code is guaranteed to be ARMv8.0 now. 
Comments: * CortexA57 builds are unchanged on A57 hardware from develop's branch, which makes sense, as it's untouched. * CortexA72 builds improve over A57 on A72 hardware, even if they're using the same includes due to new compiler tunning in the makefile.
6 years ago
Simplifying ARMv8 build parameters ARMv8 builds were a bit mixed up, with ThunderX2 code in ARMv8 mode (which is not right because TX2 is ARMv8.1) as well as requiring a few redundancies in the defines, making it harder to maintain and understand what core has what. A few other minor issues were also fixed. Tests were made on the following cores: A53, A57, A72, Falkor, ThunderX, ThunderX2, and XGene. Tests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester. A summary: * Removed TX2 code from ARMv8 build, to make sure it is compatible with all ARMv8 cores, not just v8.1. Also, the TX2 code has actually harmed performance on big cores. * Commoned up ARMv8 architectures' defines in params.h, to make sure that all will benefit from ARMv8 settings, in addition to their own. * Adding a few more cores, using ARMv8's include strategy, to benefit from compiler optimisations using mtune. Also updated cache information from the manuals, making sure we set good conservative values by default. Removed Vulcan, as it's an alias to TX2. * Auto-detecting most of those cores, but also updating the forced compilation in getarch.c, to make sure the parameters are the same whether compiled natively or forced arch. Benefits: * ARMv8 build is now guaranteed to work on all ARMv8 cores * Improved performance for ARMv8 builds on some cores (A72, Falkor, ThunderX1 and 2: up to 11%) over current develop * Improved performance for *all* cores comparing to develop branch before TX2's patch (9% ~ 36%) * ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than current develop's branch and 8% faster than deveop before tx2 patches Issues: * Regression from current develop branch for A53 (-12%) and A57 (-3%) with ARMv8 builds, but still faster than before TX2's commit (+15% and +24% respectively). This can be improved with a simplification of TX2's code, to be done in future patches. At least the code is guaranteed to be ARMv8.0 now. 
Comments: * CortexA57 builds are unchanged on A57 hardware from develop's branch, which makes sense, as it's untouched. * CortexA72 builds improve over A57 on A72 hardware, even if they're using the same includes due to new compiler tunning in the makefile.
6 years ago
Simplifying ARMv8 build parameters ARMv8 builds were a bit mixed up, with ThunderX2 code in ARMv8 mode (which is not right because TX2 is ARMv8.1) as well as requiring a few redundancies in the defines, making it harder to maintain and understand what core has what. A few other minor issues were also fixed. Tests were made on the following cores: A53, A57, A72, Falkor, ThunderX, ThunderX2, and XGene. Tests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester. A summary: * Removed TX2 code from ARMv8 build, to make sure it is compatible with all ARMv8 cores, not just v8.1. Also, the TX2 code has actually harmed performance on big cores. * Commoned up ARMv8 architectures' defines in params.h, to make sure that all will benefit from ARMv8 settings, in addition to their own. * Adding a few more cores, using ARMv8's include strategy, to benefit from compiler optimisations using mtune. Also updated cache information from the manuals, making sure we set good conservative values by default. Removed Vulcan, as it's an alias to TX2. * Auto-detecting most of those cores, but also updating the forced compilation in getarch.c, to make sure the parameters are the same whether compiled natively or forced arch. Benefits: * ARMv8 build is now guaranteed to work on all ARMv8 cores * Improved performance for ARMv8 builds on some cores (A72, Falkor, ThunderX1 and 2: up to 11%) over current develop * Improved performance for *all* cores comparing to develop branch before TX2's patch (9% ~ 36%) * ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than current develop's branch and 8% faster than deveop before tx2 patches Issues: * Regression from current develop branch for A53 (-12%) and A57 (-3%) with ARMv8 builds, but still faster than before TX2's commit (+15% and +24% respectively). This can be improved with a simplification of TX2's code, to be done in future patches. At least the code is guaranteed to be ARMv8.0 now. 
Comments: * CortexA57 builds are unchanged on A57 hardware from develop's branch, which makes sense, as it's untouched. * CortexA72 builds improve over A57 on A72 hardware, even if they're using the same includes due to new compiler tunning in the makefile.
6 years ago
Simplifying ARMv8 build parameters ARMv8 builds were a bit mixed up, with ThunderX2 code in ARMv8 mode (which is not right because TX2 is ARMv8.1) as well as requiring a few redundancies in the defines, making it harder to maintain and understand what core has what. A few other minor issues were also fixed. Tests were made on the following cores: A53, A57, A72, Falkor, ThunderX, ThunderX2, and XGene. Tests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester. A summary: * Removed TX2 code from ARMv8 build, to make sure it is compatible with all ARMv8 cores, not just v8.1. Also, the TX2 code has actually harmed performance on big cores. * Commoned up ARMv8 architectures' defines in params.h, to make sure that all will benefit from ARMv8 settings, in addition to their own. * Adding a few more cores, using ARMv8's include strategy, to benefit from compiler optimisations using mtune. Also updated cache information from the manuals, making sure we set good conservative values by default. Removed Vulcan, as it's an alias to TX2. * Auto-detecting most of those cores, but also updating the forced compilation in getarch.c, to make sure the parameters are the same whether compiled natively or forced arch. Benefits: * ARMv8 build is now guaranteed to work on all ARMv8 cores * Improved performance for ARMv8 builds on some cores (A72, Falkor, ThunderX1 and 2: up to 11%) over current develop * Improved performance for *all* cores comparing to develop branch before TX2's patch (9% ~ 36%) * ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than current develop's branch and 8% faster than deveop before tx2 patches Issues: * Regression from current develop branch for A53 (-12%) and A57 (-3%) with ARMv8 builds, but still faster than before TX2's commit (+15% and +24% respectively). This can be improved with a simplification of TX2's code, to be done in future patches. At least the code is guaranteed to be ARMv8.0 now. 
Comments: * CortexA57 builds are unchanged on A57 hardware from develop's branch, which makes sense, as it's untouched. * CortexA72 builds improve over A57 on A72 hardware, even if they're using the same includes due to new compiler tunning in the makefile.
/*****************************************************************************
Copyright (c) 2011-2023, 2025 The OpenBLAS Project
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in
the documentation and/or other materials provided with the
distribution.
3. Neither the name of the OpenBLAS project nor the names of
its contributors may be used to endorse or promote products
derived from this software without specific prior written
permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
**********************************************************************************/
/*********************************************************************/
/* Copyright 2009, 2010 The University of Texas at Austin. */
/* All rights reserved. */
/* */
/* Redistribution and use in source and binary forms, with or */
/* without modification, are permitted provided that the following */
/* conditions are met: */
/* */
/* 1. Redistributions of source code must retain the above */
/* copyright notice, this list of conditions and the following */
/* disclaimer. */
/* */
/* 2. Redistributions in binary form must reproduce the above */
/* copyright notice, this list of conditions and the following */
/* disclaimer in the documentation and/or other materials */
/* provided with the distribution. */
/* */
/* THIS SOFTWARE IS PROVIDED BY THE UNIVERSITY OF TEXAS AT */
/* AUSTIN ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, */
/* INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF */
/* MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE */
/* DISCLAIMED. IN NO EVENT SHALL THE UNIVERSITY OF TEXAS AT */
/* AUSTIN OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, */
/* INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES */
/* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE */
/* GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR */
/* BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF */
/* LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT */
/* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT */
/* OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE */
/* POSSIBILITY OF SUCH DAMAGE. */
/* */
/* The views and conclusions contained in the software and */
/* documentation are those of the authors and should not be */
/* interpreted as representing official policies, either expressed */
/* or implied, of The University of Texas at Austin. */
/*********************************************************************/
#ifndef PARAM_H
#define PARAM_H
#define SHGEMM_DEFAULT_UNROLL_N 8
#define SHGEMM_DEFAULT_UNROLL_M 8
#define SHGEMM_DEFAULT_UNROLL_MN 32
#define SHGEMM_DEFAULT_P 128
#define SHGEMM_DEFAULT_R 240
#define SHGEMM_DEFAULT_Q 12288
#define BGEMM_DEFAULT_UNROLL_N 4
#define BGEMM_DEFAULT_UNROLL_M 8
#define BGEMM_DEFAULT_UNROLL_MN 32
#define BGEMM_DEFAULT_P 256
#define BGEMM_DEFAULT_R 256
#define BGEMM_DEFAULT_Q 256
#define BGEMM_ALIGN_K 1 // must be 2^x
#define SBGEMM_DEFAULT_UNROLL_N 4
#define SBGEMM_DEFAULT_UNROLL_M 8
#define SBGEMM_DEFAULT_UNROLL_MN 32
#define SBGEMM_DEFAULT_P 256
#define SBGEMM_DEFAULT_R 256
#define SBGEMM_DEFAULT_Q 256
#define SBGEMM_ALIGN_K 1 // must be 2^x
#ifdef OPTERON
#define SNUMOPT 4
#define DNUMOPT 2
#define GEMM_DEFAULT_OFFSET_A 64
#define GEMM_DEFAULT_OFFSET_B 256
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x01ffffUL
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#ifdef ARCH_X86
#define SGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_M 2
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_M 1
#define XGEMM_DEFAULT_UNROLL_M 1
#else
#define SGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_M 4
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_M 2
#define XGEMM_DEFAULT_UNROLL_M 1
#endif
#define SGEMM_DEFAULT_P sgemm_p
#define DGEMM_DEFAULT_P dgemm_p
#define QGEMM_DEFAULT_P qgemm_p
#define CGEMM_DEFAULT_P cgemm_p
#define ZGEMM_DEFAULT_P zgemm_p
#define XGEMM_DEFAULT_P xgemm_p
#define SGEMM_DEFAULT_R sgemm_r
#define DGEMM_DEFAULT_R dgemm_r
#define QGEMM_DEFAULT_R qgemm_r
#define CGEMM_DEFAULT_R cgemm_r
#define ZGEMM_DEFAULT_R zgemm_r
#define XGEMM_DEFAULT_R xgemm_r
#ifdef ALLOC_HUGETLB
#define SGEMM_DEFAULT_Q 248
#define DGEMM_DEFAULT_Q 248
#define QGEMM_DEFAULT_Q 248
#define CGEMM_DEFAULT_Q 248
#define ZGEMM_DEFAULT_Q 248
#define XGEMM_DEFAULT_Q 248
#else
#define SGEMM_DEFAULT_Q 240
#define DGEMM_DEFAULT_Q 240
#define QGEMM_DEFAULT_Q 240
#define CGEMM_DEFAULT_Q 240
#define ZGEMM_DEFAULT_Q 240
#define XGEMM_DEFAULT_Q 240
#endif
#define SYMV_P 16
#define HAVE_EXCLUSIVE_CACHE
#endif
#if defined(BARCELONA) || defined(SHANGHAI) || defined(BOBCAT)
#define SNUMOPT 8
#define DNUMOPT 4
#define GEMM_DEFAULT_OFFSET_A 64
#define GEMM_DEFAULT_OFFSET_B 832
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x0fffUL
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#ifdef ARCH_X86
#define SGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_M 2
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_M 1
#define XGEMM_DEFAULT_UNROLL_M 1
#else
#define SGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_M 4
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_M 2
#define XGEMM_DEFAULT_UNROLL_M 1
#endif
#if 0
#define SGEMM_DEFAULT_P 496
#define DGEMM_DEFAULT_P 248
#define QGEMM_DEFAULT_P 124
#define CGEMM_DEFAULT_P 248
#define ZGEMM_DEFAULT_P 124
#define XGEMM_DEFAULT_P 62
#define SGEMM_DEFAULT_Q 248
#define DGEMM_DEFAULT_Q 248
#define QGEMM_DEFAULT_Q 248
#define CGEMM_DEFAULT_Q 248
#define ZGEMM_DEFAULT_Q 248
#define XGEMM_DEFAULT_Q 248
#else
#define SGEMM_DEFAULT_P 448
#define DGEMM_DEFAULT_P 224
#define QGEMM_DEFAULT_P 112
#define CGEMM_DEFAULT_P 224
#define ZGEMM_DEFAULT_P 112
#define XGEMM_DEFAULT_P 56
#define SGEMM_DEFAULT_Q 224
#define DGEMM_DEFAULT_Q 224
#define QGEMM_DEFAULT_Q 224
#define CGEMM_DEFAULT_Q 224
#define ZGEMM_DEFAULT_Q 224
#define XGEMM_DEFAULT_Q 224
#endif
#define SGEMM_DEFAULT_R sgemm_r
#define QGEMM_DEFAULT_R qgemm_r
#define DGEMM_DEFAULT_R dgemm_r
#define CGEMM_DEFAULT_R cgemm_r
#define ZGEMM_DEFAULT_R zgemm_r
#define XGEMM_DEFAULT_R xgemm_r
#define SYMV_P 16
#define HAVE_EXCLUSIVE_CACHE
#define GEMM_THREAD gemm_thread_mn
#endif
#ifdef BULLDOZER
#define SNUMOPT 8
#define DNUMOPT 4
#define GEMM_DEFAULT_OFFSET_A 64
#define GEMM_DEFAULT_OFFSET_B 832
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x0fffUL
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#ifdef ARCH_X86
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define SGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_M 2
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_M 1
#define XGEMM_DEFAULT_UNROLL_M 1
#else
#define SGEMM_DEFAULT_UNROLL_N 2
#define DGEMM_DEFAULT_UNROLL_N 2
#define SGEMM_DEFAULT_UNROLL_M 16
#define DGEMM_DEFAULT_UNROLL_M 8
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_M 2
#define XGEMM_DEFAULT_UNROLL_M 1
#define CGEMM3M_DEFAULT_UNROLL_N 4
#define CGEMM3M_DEFAULT_UNROLL_M 8
#define ZGEMM3M_DEFAULT_UNROLL_N 4
#define ZGEMM3M_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_MN 16
#define GEMV_UNROLL 8
#endif
#if defined(ARCH_X86_64)
#define SGEMM_DEFAULT_P 768
#define DGEMM_DEFAULT_P 384
#else
#define SGEMM_DEFAULT_P 448
#define DGEMM_DEFAULT_P 224
#endif
#define QGEMM_DEFAULT_P 112
#define CGEMM_DEFAULT_P 224
#define ZGEMM_DEFAULT_P 112
#define XGEMM_DEFAULT_P 56
#if defined(ARCH_X86_64)
#define SGEMM_DEFAULT_Q 168
#define DGEMM_DEFAULT_Q 168
#else
#define SGEMM_DEFAULT_Q 224
#define DGEMM_DEFAULT_Q 224
#endif
#define QGEMM_DEFAULT_Q 224
#define CGEMM_DEFAULT_Q 224
#define ZGEMM_DEFAULT_Q 224
#define XGEMM_DEFAULT_Q 224
#define CGEMM3M_DEFAULT_P 448
#define ZGEMM3M_DEFAULT_P 224
#define XGEMM3M_DEFAULT_P 112
#define CGEMM3M_DEFAULT_Q 224
#define ZGEMM3M_DEFAULT_Q 224
#define XGEMM3M_DEFAULT_Q 224
#define CGEMM3M_DEFAULT_R 12288
#define ZGEMM3M_DEFAULT_R 12288
#define XGEMM3M_DEFAULT_R 12288
#define SGEMM_DEFAULT_R sgemm_r
#define QGEMM_DEFAULT_R qgemm_r
#define DGEMM_DEFAULT_R dgemm_r
#define CGEMM_DEFAULT_R cgemm_r
#define ZGEMM_DEFAULT_R zgemm_r
#define XGEMM_DEFAULT_R xgemm_r
#define SYMV_P 16
#define HAVE_EXCLUSIVE_CACHE
#define GEMM_THREAD gemm_thread_mn
#endif
#ifdef PILEDRIVER
#define SNUMOPT 8
#define DNUMOPT 4
#define GEMM_DEFAULT_OFFSET_A 64
#define GEMM_DEFAULT_OFFSET_B 832
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x0fffUL
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#ifdef ARCH_X86
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define SGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_M 2
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_M 1
#define XGEMM_DEFAULT_UNROLL_M 1
#else
#define SGEMM_DEFAULT_UNROLL_N 2
#define DGEMM_DEFAULT_UNROLL_N 2
#define SGEMM_DEFAULT_UNROLL_M 16
#define DGEMM_DEFAULT_UNROLL_M 8
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_M 2
#define XGEMM_DEFAULT_UNROLL_M 1
#define CGEMM3M_DEFAULT_UNROLL_N 4
#define CGEMM3M_DEFAULT_UNROLL_M 8
#define ZGEMM3M_DEFAULT_UNROLL_N 4
#define ZGEMM3M_DEFAULT_UNROLL_M 4
#define GEMV_UNROLL 8
#endif
#if defined(ARCH_X86_64)
#define SGEMM_DEFAULT_P 768
#define DGEMM_DEFAULT_P 768
#define ZGEMM_DEFAULT_P 384
#define CGEMM_DEFAULT_P 768
#else
#define SGEMM_DEFAULT_P 448
#define DGEMM_DEFAULT_P 480
#define ZGEMM_DEFAULT_P 112
#define CGEMM_DEFAULT_P 224
#endif
#define QGEMM_DEFAULT_P 112
#define XGEMM_DEFAULT_P 56
#if defined(ARCH_X86_64)
#define SGEMM_DEFAULT_Q 192
#define DGEMM_DEFAULT_Q 168
#define ZGEMM_DEFAULT_Q 168
#define CGEMM_DEFAULT_Q 168
#else
#define SGEMM_DEFAULT_Q 224
#define DGEMM_DEFAULT_Q 224
#define ZGEMM_DEFAULT_Q 224
#define CGEMM_DEFAULT_Q 224
#endif
#define QGEMM_DEFAULT_Q 224
#define XGEMM_DEFAULT_Q 224
#define CGEMM3M_DEFAULT_P 448
#define ZGEMM3M_DEFAULT_P 224
#define XGEMM3M_DEFAULT_P 112
#define CGEMM3M_DEFAULT_Q 224
#define ZGEMM3M_DEFAULT_Q 224
#define XGEMM3M_DEFAULT_Q 224
#define CGEMM3M_DEFAULT_R 12288
#define ZGEMM3M_DEFAULT_R 12288
#define XGEMM3M_DEFAULT_R 12288
#define SGEMM_DEFAULT_R 12288
#define QGEMM_DEFAULT_R qgemm_r
#define DGEMM_DEFAULT_R 12288
#define CGEMM_DEFAULT_R cgemm_r
#define ZGEMM_DEFAULT_R zgemm_r
#define XGEMM_DEFAULT_R xgemm_r
#define SYMV_P 16
#define HAVE_EXCLUSIVE_CACHE
#define GEMM_THREAD gemm_thread_mn
#endif
#ifdef STEAMROLLER
#define SNUMOPT 8
#define DNUMOPT 4
#define GEMM_DEFAULT_OFFSET_A 64
#define GEMM_DEFAULT_OFFSET_B 832
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x0fffUL
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#ifdef ARCH_X86
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define SGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_M 2
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_M 1
#define XGEMM_DEFAULT_UNROLL_M 1
#else
#define SGEMM_DEFAULT_UNROLL_N 2
#define DGEMM_DEFAULT_UNROLL_N 2
#define SGEMM_DEFAULT_UNROLL_M 16
#define DGEMM_DEFAULT_UNROLL_M 8
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_M 2
#define XGEMM_DEFAULT_UNROLL_M 1
#define CGEMM3M_DEFAULT_UNROLL_N 4
#define CGEMM3M_DEFAULT_UNROLL_M 8
#define ZGEMM3M_DEFAULT_UNROLL_N 4
#define ZGEMM3M_DEFAULT_UNROLL_M 4
#define GEMV_UNROLL 8
#endif
#if defined(ARCH_X86_64)
#define SGEMM_DEFAULT_P 768
#define DGEMM_DEFAULT_P 576
#define ZGEMM_DEFAULT_P 288
#define CGEMM_DEFAULT_P 576
#else
#define SGEMM_DEFAULT_P 448
#define DGEMM_DEFAULT_P 480
#define ZGEMM_DEFAULT_P 112
#define CGEMM_DEFAULT_P 224
#endif
#define QGEMM_DEFAULT_P 112
#define XGEMM_DEFAULT_P 56
#if defined(ARCH_X86_64)
#define SGEMM_DEFAULT_Q 192
#define DGEMM_DEFAULT_Q 160
#define ZGEMM_DEFAULT_Q 160
#define CGEMM_DEFAULT_Q 160
#else
#define SGEMM_DEFAULT_Q 224
#define DGEMM_DEFAULT_Q 224
#define ZGEMM_DEFAULT_Q 224
#define CGEMM_DEFAULT_Q 224
#endif
#define QGEMM_DEFAULT_Q 224
#define XGEMM_DEFAULT_Q 224
#define CGEMM3M_DEFAULT_P 448
#define ZGEMM3M_DEFAULT_P 224
#define XGEMM3M_DEFAULT_P 112
#define CGEMM3M_DEFAULT_Q 224
#define ZGEMM3M_DEFAULT_Q 224
#define XGEMM3M_DEFAULT_Q 224
#define CGEMM3M_DEFAULT_R 12288
#define ZGEMM3M_DEFAULT_R 12288
#define XGEMM3M_DEFAULT_R 12288
#define SGEMM_DEFAULT_R 12288
#define QGEMM_DEFAULT_R qgemm_r
#define DGEMM_DEFAULT_R 12288
#define CGEMM_DEFAULT_R cgemm_r
#define ZGEMM_DEFAULT_R zgemm_r
#define XGEMM_DEFAULT_R xgemm_r
#define SYMV_P 16
#define HAVE_EXCLUSIVE_CACHE
#define GEMM_THREAD gemm_thread_mn
#endif
#ifdef EXCAVATOR
#define SNUMOPT 8
#define DNUMOPT 4
#define GEMM_DEFAULT_OFFSET_A 64
#define GEMM_DEFAULT_OFFSET_B 832
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x0fffUL
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#ifdef ARCH_X86
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define SGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_M 2
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_M 1
#define XGEMM_DEFAULT_UNROLL_M 1
#else
#define SGEMM_DEFAULT_UNROLL_N 2
#define DGEMM_DEFAULT_UNROLL_N 2
#define SGEMM_DEFAULT_UNROLL_M 16
#define DGEMM_DEFAULT_UNROLL_M 8
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_M 2
#define XGEMM_DEFAULT_UNROLL_M 1
#define CGEMM3M_DEFAULT_UNROLL_N 4
#define CGEMM3M_DEFAULT_UNROLL_M 8
#define ZGEMM3M_DEFAULT_UNROLL_N 4
#define ZGEMM3M_DEFAULT_UNROLL_M 4
#define GEMV_UNROLL 8
#endif
#if defined(ARCH_X86_64)
#define SGEMM_DEFAULT_P 768
#define DGEMM_DEFAULT_P 576
#define ZGEMM_DEFAULT_P 288
#define CGEMM_DEFAULT_P 576
#else
#define SGEMM_DEFAULT_P 448
#define DGEMM_DEFAULT_P 480
#define ZGEMM_DEFAULT_P 112
#define CGEMM_DEFAULT_P 224
#endif
#define QGEMM_DEFAULT_P 112
#define XGEMM_DEFAULT_P 56
#if defined(ARCH_X86_64)
#define SGEMM_DEFAULT_Q 192
#define DGEMM_DEFAULT_Q 160
#define ZGEMM_DEFAULT_Q 160
#define CGEMM_DEFAULT_Q 160
#else
#define SGEMM_DEFAULT_Q 224
#define DGEMM_DEFAULT_Q 224
#define ZGEMM_DEFAULT_Q 224
#define CGEMM_DEFAULT_Q 224
#endif
#define QGEMM_DEFAULT_Q 224
#define XGEMM_DEFAULT_Q 224
#define CGEMM3M_DEFAULT_P 448
#define ZGEMM3M_DEFAULT_P 224
#define XGEMM3M_DEFAULT_P 112
#define CGEMM3M_DEFAULT_Q 224
#define ZGEMM3M_DEFAULT_Q 224
#define XGEMM3M_DEFAULT_Q 224
#define CGEMM3M_DEFAULT_R 12288
#define ZGEMM3M_DEFAULT_R 12288
#define XGEMM3M_DEFAULT_R 12288
#define SGEMM_DEFAULT_R 12288
#define QGEMM_DEFAULT_R qgemm_r
#define DGEMM_DEFAULT_R 12288
#define CGEMM_DEFAULT_R cgemm_r
#define ZGEMM_DEFAULT_R zgemm_r
#define XGEMM_DEFAULT_R xgemm_r
#define SYMV_P 16
#define HAVE_EXCLUSIVE_CACHE
#define GEMM_THREAD gemm_thread_mn
#endif
#ifdef ZEN
#define SNUMOPT 16
#define DNUMOPT 8
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x03fffUL
#define SYMV_P 8
#if defined(XDOUBLE) || defined(DOUBLE)
#define SWITCH_RATIO 4
#define GEMM_PREFERED_SIZE 4
#else
#define SWITCH_RATIO 8
#define GEMM_PREFERED_SIZE 8
#endif
#ifdef ARCH_X86
#define SGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_M 2
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_M 1
#define XGEMM_DEFAULT_UNROLL_M 1
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#else
#define SGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_M 4
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 8
#define ZGEMM_DEFAULT_UNROLL_M 4
#define XGEMM_DEFAULT_UNROLL_M 1
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 8
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
/*
#define SGEMM_DEFAULT_UNROLL_MN 32
#define DGEMM_DEFAULT_UNROLL_MN 32
*/
#endif
#ifdef ARCH_X86
#define SGEMM_DEFAULT_P 512
#define SGEMM_DEFAULT_R sgemm_r
#define DGEMM_DEFAULT_P 512
#define DGEMM_DEFAULT_R dgemm_r
#define QGEMM_DEFAULT_P 504
#define QGEMM_DEFAULT_R qgemm_r
#define CGEMM_DEFAULT_P 128
#define CGEMM_DEFAULT_R 1024
#define ZGEMM_DEFAULT_P 512
#define ZGEMM_DEFAULT_R zgemm_r
#define XGEMM_DEFAULT_P 252
#define XGEMM_DEFAULT_R xgemm_r
#define SGEMM_DEFAULT_Q 256
#define DGEMM_DEFAULT_Q 256
#define QGEMM_DEFAULT_Q 128
#define CGEMM_DEFAULT_Q 256
#define ZGEMM_DEFAULT_Q 192
#define XGEMM_DEFAULT_Q 128
#else
#define SGEMM_DEFAULT_P 320
#define DGEMM_DEFAULT_P 512
#define CGEMM_DEFAULT_P 256
#define ZGEMM_DEFAULT_P 192
#ifdef WINDOWS_ABI
#define SGEMM_DEFAULT_Q 320
#define DGEMM_DEFAULT_Q 128
#else
#define SGEMM_DEFAULT_Q 320
#define DGEMM_DEFAULT_Q 256
#endif
#define CGEMM_DEFAULT_Q 256
#define ZGEMM_DEFAULT_Q 192
#define SGEMM_DEFAULT_R sgemm_r
#define DGEMM_DEFAULT_R 13824
#define CGEMM_DEFAULT_R cgemm_r
#define ZGEMM_DEFAULT_R zgemm_r
#define QGEMM_DEFAULT_Q 128
#define QGEMM_DEFAULT_P 504
#define QGEMM_DEFAULT_R qgemm_r
#define XGEMM_DEFAULT_P 252
#define XGEMM_DEFAULT_R xgemm_r
#define XGEMM_DEFAULT_Q 128
#define CGEMM3M_DEFAULT_UNROLL_N 4
#define CGEMM3M_DEFAULT_UNROLL_M 8
#define ZGEMM3M_DEFAULT_UNROLL_N 4
#define ZGEMM3M_DEFAULT_UNROLL_M 4
#define CGEMM3M_DEFAULT_P 320
#define ZGEMM3M_DEFAULT_P 256
#define XGEMM3M_DEFAULT_P 112
#define CGEMM3M_DEFAULT_Q 320
#define ZGEMM3M_DEFAULT_Q 256
#define XGEMM3M_DEFAULT_Q 224
#define CGEMM3M_DEFAULT_R 12288
#define ZGEMM3M_DEFAULT_R 12288
#define XGEMM3M_DEFAULT_R 12288
#endif
#endif
#ifdef ATHLON
#define SNUMOPT 4
#define DNUMOPT 2
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 384
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x0ffffUL
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#define SGEMM_DEFAULT_UNROLL_M 2
#define DGEMM_DEFAULT_UNROLL_M 1
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 1
#define ZGEMM_DEFAULT_UNROLL_M 1
#define XGEMM_DEFAULT_UNROLL_M 1
#define SGEMM_DEFAULT_R sgemm_r
#define DGEMM_DEFAULT_R dgemm_r
#define QGEMM_DEFAULT_R qgemm_r
#define CGEMM_DEFAULT_R cgemm_r
#define ZGEMM_DEFAULT_R zgemm_r
#define XGEMM_DEFAULT_R xgemm_r
#define SGEMM_DEFAULT_P 208
#define DGEMM_DEFAULT_P 104
#define QGEMM_DEFAULT_P 56
#define CGEMM_DEFAULT_P 104
#define ZGEMM_DEFAULT_P 56
#define XGEMM_DEFAULT_P 28
#define SGEMM_DEFAULT_Q 208
#define DGEMM_DEFAULT_Q 208
#define QGEMM_DEFAULT_Q 208
#define CGEMM_DEFAULT_Q 208
#define ZGEMM_DEFAULT_Q 208
#define XGEMM_DEFAULT_Q 208
#define SYMV_P 16
#define HAVE_EXCLUSIVE_CACHE
#endif
#ifdef VIAC3
#define SNUMOPT 2
#define DNUMOPT 1
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 256
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x0ffffUL
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#define SGEMM_DEFAULT_UNROLL_M 2
#define DGEMM_DEFAULT_UNROLL_M 1
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 1
#define ZGEMM_DEFAULT_UNROLL_M 1
#define XGEMM_DEFAULT_UNROLL_M 1
#define SGEMM_DEFAULT_R sgemm_r
#define DGEMM_DEFAULT_R dgemm_r
#define QGEMM_DEFAULT_R qgemm_r
#define CGEMM_DEFAULT_R cgemm_r
#define ZGEMM_DEFAULT_R zgemm_r
#define XGEMM_DEFAULT_R xgemm_r
#define SGEMM_DEFAULT_P 128
#define DGEMM_DEFAULT_P 128
#define QGEMM_DEFAULT_P 128
#define CGEMM_DEFAULT_P 128
#define ZGEMM_DEFAULT_P 128
#define XGEMM_DEFAULT_P 128
#define SGEMM_DEFAULT_Q 512
#define DGEMM_DEFAULT_Q 256
#define QGEMM_DEFAULT_Q 256
#define CGEMM_DEFAULT_Q 256
#define ZGEMM_DEFAULT_Q 128
#define XGEMM_DEFAULT_Q 128
#define SYMV_P 16
#endif
#ifdef NANO
#define SNUMOPT 4
#define DNUMOPT 2
#define GEMM_DEFAULT_OFFSET_A 64
#define GEMM_DEFAULT_OFFSET_B 256
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x01ffffUL
#ifdef ARCH_X86
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#define SGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_M 2
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_M 1
#define XGEMM_DEFAULT_UNROLL_M 1
#else
#define SGEMM_DEFAULT_UNROLL_N 8
#define DGEMM_DEFAULT_UNROLL_N 4
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#define SGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_M 4
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_M 2
#define XGEMM_DEFAULT_UNROLL_M 1
#endif
#define SGEMM_DEFAULT_P 288
#define DGEMM_DEFAULT_P 288
#define QGEMM_DEFAULT_P 288
#define CGEMM_DEFAULT_P 288
#define ZGEMM_DEFAULT_P 288
#define XGEMM_DEFAULT_P 288
#define SGEMM_DEFAULT_R sgemm_r
#define DGEMM_DEFAULT_R dgemm_r
#define QGEMM_DEFAULT_R qgemm_r
#define CGEMM_DEFAULT_R cgemm_r
#define ZGEMM_DEFAULT_R zgemm_r
#define XGEMM_DEFAULT_R xgemm_r
#define SGEMM_DEFAULT_Q 256
#define DGEMM_DEFAULT_Q 128
#define QGEMM_DEFAULT_Q 64
#define CGEMM_DEFAULT_Q 128
#define ZGEMM_DEFAULT_Q 64
#define XGEMM_DEFAULT_Q 32
#define SYMV_P 16
#define HAVE_EXCLUSIVE_CACHE
#endif
#if defined(PENTIUM) || defined(PENTIUM2) || defined(PENTIUM3)
#ifdef HAVE_SSE
#define SNUMOPT 2
#else
#define SNUMOPT 1
#endif
#define DNUMOPT 1
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x0ffffUL
#ifdef HAVE_SSE
#define SGEMM_DEFAULT_UNROLL_M 8
#define CGEMM_DEFAULT_UNROLL_M 4
#else
#define SGEMM_DEFAULT_UNROLL_M 4
#define CGEMM_DEFAULT_UNROLL_M 2
#endif
#define DGEMM_DEFAULT_UNROLL_M 2
#define SGEMM_DEFAULT_UNROLL_N 2
#define DGEMM_DEFAULT_UNROLL_N 2
#define QGEMM_DEFAULT_UNROLL_M 2
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 1
#define ZGEMM_DEFAULT_UNROLL_M 1
#define ZGEMM_DEFAULT_UNROLL_N 1
#define XGEMM_DEFAULT_UNROLL_M 1
#define XGEMM_DEFAULT_UNROLL_N 1
#define SGEMM_DEFAULT_P sgemm_p
#define SGEMM_DEFAULT_Q 256
#define SGEMM_DEFAULT_R sgemm_r
#define DGEMM_DEFAULT_P dgemm_p
#define DGEMM_DEFAULT_Q 256
#define DGEMM_DEFAULT_R dgemm_r
#define QGEMM_DEFAULT_P qgemm_p
#define QGEMM_DEFAULT_Q 256
#define QGEMM_DEFAULT_R qgemm_r
#define CGEMM_DEFAULT_P cgemm_p
#define CGEMM_DEFAULT_Q 256
#define CGEMM_DEFAULT_R cgemm_r
#define ZGEMM_DEFAULT_P zgemm_p
#define ZGEMM_DEFAULT_Q 256
#define ZGEMM_DEFAULT_R zgemm_r
#define XGEMM_DEFAULT_P xgemm_p
#define XGEMM_DEFAULT_Q 256
#define XGEMM_DEFAULT_R xgemm_r
#define SYMV_P 4
#endif
#ifdef PENTIUMM
#define SNUMOPT 2
#define DNUMOPT 1
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x0ffffUL
#ifdef CORE_YONAH
#define SGEMM_DEFAULT_UNROLL_M 4
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 2
#define DGEMM_DEFAULT_UNROLL_N 4
#define QGEMM_DEFAULT_UNROLL_M 2
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_M 1
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_M 1
#define XGEMM_DEFAULT_UNROLL_N 1
#else
#define SGEMM_DEFAULT_UNROLL_M 8
#define SGEMM_DEFAULT_UNROLL_N 2
#define DGEMM_DEFAULT_UNROLL_M 2
#define DGEMM_DEFAULT_UNROLL_N 2
#define QGEMM_DEFAULT_UNROLL_M 2
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_M 4
#define CGEMM_DEFAULT_UNROLL_N 1
#define ZGEMM_DEFAULT_UNROLL_M 1
#define ZGEMM_DEFAULT_UNROLL_N 1
#define XGEMM_DEFAULT_UNROLL_M 1
#define XGEMM_DEFAULT_UNROLL_N 1
#endif
#define SGEMM_DEFAULT_P sgemm_p
#define SGEMM_DEFAULT_Q 256
#define SGEMM_DEFAULT_R sgemm_r
#define DGEMM_DEFAULT_P dgemm_p
#define DGEMM_DEFAULT_Q 256
#define DGEMM_DEFAULT_R dgemm_r
#define QGEMM_DEFAULT_P qgemm_p
#define QGEMM_DEFAULT_Q 256
#define QGEMM_DEFAULT_R qgemm_r
#define CGEMM_DEFAULT_P cgemm_p
#define CGEMM_DEFAULT_Q 256
#define CGEMM_DEFAULT_R cgemm_r
#define ZGEMM_DEFAULT_P zgemm_p
#define ZGEMM_DEFAULT_Q 256
#define ZGEMM_DEFAULT_R zgemm_r
#define XGEMM_DEFAULT_P xgemm_p
#define XGEMM_DEFAULT_Q 256
#define XGEMM_DEFAULT_R xgemm_r
#define SYMV_P 4
#endif
#ifdef CORE_NORTHWOOD
#define SNUMOPT 4
#define DNUMOPT 2
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 32
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x0ffffUL
#define SYMV_P 8
#define SGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_M 4
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_M 2
#define XGEMM_DEFAULT_UNROLL_M 1
#define SGEMM_DEFAULT_UNROLL_N 2
#define DGEMM_DEFAULT_UNROLL_N 2
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 1
#define ZGEMM_DEFAULT_UNROLL_N 1
#define XGEMM_DEFAULT_UNROLL_N 1
#define SGEMM_DEFAULT_P sgemm_p
#define SGEMM_DEFAULT_R sgemm_r
#define DGEMM_DEFAULT_P dgemm_p
#define DGEMM_DEFAULT_R dgemm_r
#define QGEMM_DEFAULT_P qgemm_p
#define QGEMM_DEFAULT_R qgemm_r
#define CGEMM_DEFAULT_P cgemm_p
#define CGEMM_DEFAULT_R cgemm_r
#define ZGEMM_DEFAULT_P zgemm_p
#define ZGEMM_DEFAULT_R zgemm_r
#define XGEMM_DEFAULT_P xgemm_p
#define XGEMM_DEFAULT_R xgemm_r
#define SGEMM_DEFAULT_Q 128
#define DGEMM_DEFAULT_Q 128
#define QGEMM_DEFAULT_Q 128
#define CGEMM_DEFAULT_Q 128
#define ZGEMM_DEFAULT_Q 128
#define XGEMM_DEFAULT_Q 128
#endif
#ifdef CORE_PRESCOTT
#define SNUMOPT 4
#define DNUMOPT 2
#ifndef __64BIT__
#define GEMM_DEFAULT_OFFSET_A 128
#define GEMM_DEFAULT_OFFSET_B 192
#else
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 256
#endif
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x0ffffUL
#define SYMV_P 8
#ifdef ARCH_X86
#define SGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_M 2
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_M 1
#define XGEMM_DEFAULT_UNROLL_M 1
#else
#define SGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_M 4
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_M 2
#define XGEMM_DEFAULT_UNROLL_M 1
#endif
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#define SGEMM_DEFAULT_P sgemm_p
#define SGEMM_DEFAULT_R sgemm_r
#define DGEMM_DEFAULT_P dgemm_p
#define DGEMM_DEFAULT_R dgemm_r
#define QGEMM_DEFAULT_P qgemm_p
#define QGEMM_DEFAULT_R qgemm_r
#define CGEMM_DEFAULT_P cgemm_p
#define CGEMM_DEFAULT_R cgemm_r
#define ZGEMM_DEFAULT_P zgemm_p
#define ZGEMM_DEFAULT_R zgemm_r
#define XGEMM_DEFAULT_P xgemm_p
#define XGEMM_DEFAULT_R xgemm_r
#define SGEMM_DEFAULT_Q 128
#define DGEMM_DEFAULT_Q 128
#define QGEMM_DEFAULT_Q 128
#define CGEMM_DEFAULT_Q 128
#define ZGEMM_DEFAULT_Q 128
#define XGEMM_DEFAULT_Q 128
#endif
#ifdef CORE2
#define SNUMOPT 8
#define DNUMOPT 4
#define GEMM_DEFAULT_OFFSET_A 448
#define GEMM_DEFAULT_OFFSET_B 128
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x03fffUL
#define SYMV_P 8
#define SWITCH_RATIO 4
#ifdef ARCH_X86
#define SGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_M 4
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_M 2
#define XGEMM_DEFAULT_UNROLL_M 1
#define SGEMM_DEFAULT_UNROLL_N 2
#define DGEMM_DEFAULT_UNROLL_N 2
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 1
#define ZGEMM_DEFAULT_UNROLL_N 1
#define XGEMM_DEFAULT_UNROLL_N 1
#define MASK(a, b) ((((a) + (b) - 1) / (b)) * (b))
#else
#define SGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_M 4
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_M 2
#define XGEMM_DEFAULT_UNROLL_M 1
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#endif
#define SGEMM_DEFAULT_P sgemm_p
#define SGEMM_DEFAULT_R sgemm_r
#define DGEMM_DEFAULT_P dgemm_p
#define DGEMM_DEFAULT_R dgemm_r
#define QGEMM_DEFAULT_P qgemm_p
#define QGEMM_DEFAULT_R qgemm_r
#define CGEMM_DEFAULT_P cgemm_p
#define CGEMM_DEFAULT_R cgemm_r
#define ZGEMM_DEFAULT_P zgemm_p
#define ZGEMM_DEFAULT_R zgemm_r
#define XGEMM_DEFAULT_P xgemm_p
#define XGEMM_DEFAULT_R xgemm_r
#define SGEMM_DEFAULT_Q 256
#define DGEMM_DEFAULT_Q 256
#define QGEMM_DEFAULT_Q 256
#define CGEMM_DEFAULT_Q 256
#define ZGEMM_DEFAULT_Q 256
#define XGEMM_DEFAULT_Q 256
#endif
#ifdef PENRYN
#define SNUMOPT 8
#define DNUMOPT 4
#define GEMM_DEFAULT_OFFSET_A 128
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x03fffUL
#define SYMV_P 8
#define SWITCH_RATIO 4
#ifdef ARCH_X86
#define SGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_M 2
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_M 1
#define XGEMM_DEFAULT_UNROLL_M 1
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#else
#define SGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_M 4
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_M 2
#define XGEMM_DEFAULT_UNROLL_M 1
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#endif
#define SGEMM_DEFAULT_P sgemm_p
#define SGEMM_DEFAULT_R sgemm_r
#define DGEMM_DEFAULT_P dgemm_p
#define DGEMM_DEFAULT_R dgemm_r
#define QGEMM_DEFAULT_P qgemm_p
#define QGEMM_DEFAULT_R qgemm_r
#define CGEMM_DEFAULT_P cgemm_p
#define CGEMM_DEFAULT_R cgemm_r
#define ZGEMM_DEFAULT_P zgemm_p
#define ZGEMM_DEFAULT_R zgemm_r
#define XGEMM_DEFAULT_P xgemm_p
#define XGEMM_DEFAULT_R xgemm_r
#define SGEMM_DEFAULT_Q 512
#define DGEMM_DEFAULT_Q 256
#define QGEMM_DEFAULT_Q 128
#define CGEMM_DEFAULT_Q 512
#define ZGEMM_DEFAULT_Q 256
#define XGEMM_DEFAULT_Q 128
#define GETRF_FACTOR 0.75
#endif
#ifdef DUNNINGTON
#define SNUMOPT 8
#define DNUMOPT 4
#define GEMM_DEFAULT_OFFSET_A 128
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x03fffUL
#define SYMV_P 8
#define SWITCH_RATIO 4
#ifdef ARCH_X86
#define SGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_M 2
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_M 1
#define XGEMM_DEFAULT_UNROLL_M 1
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#else
#define SGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_M 4
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_M 2
#define XGEMM_DEFAULT_UNROLL_M 1
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#endif
#define SGEMM_DEFAULT_P sgemm_p
#define SGEMM_DEFAULT_R sgemm_r
#define DGEMM_DEFAULT_P dgemm_p
#define DGEMM_DEFAULT_R dgemm_r
#define QGEMM_DEFAULT_P qgemm_p
#define QGEMM_DEFAULT_R qgemm_r
#define CGEMM_DEFAULT_P cgemm_p
#define CGEMM_DEFAULT_R cgemm_r
#define ZGEMM_DEFAULT_P zgemm_p
#define ZGEMM_DEFAULT_R zgemm_r
#define XGEMM_DEFAULT_P xgemm_p
#define XGEMM_DEFAULT_R xgemm_r
#define SGEMM_DEFAULT_Q 768
#define DGEMM_DEFAULT_Q 384
#define QGEMM_DEFAULT_Q 192
#define CGEMM_DEFAULT_Q 768
#define ZGEMM_DEFAULT_Q 384
#define XGEMM_DEFAULT_Q 192
#define GETRF_FACTOR 0.75
#define GEMM_THREAD gemm_thread_mn
#endif
#ifdef NEHALEM
#define SNUMOPT 8
#define DNUMOPT 4
#define GEMM_DEFAULT_OFFSET_A 32
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x03fffUL
#define SYMV_P 8
#define SWITCH_RATIO 4
#ifdef ARCH_X86
#define SGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_M 2
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_M 1
#define XGEMM_DEFAULT_UNROLL_M 1
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#else
#define SGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_M 2
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_M 1
#define XGEMM_DEFAULT_UNROLL_M 1
#define SGEMM_DEFAULT_UNROLL_N 8
#define DGEMM_DEFAULT_UNROLL_N 8
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_N 4
#define XGEMM_DEFAULT_UNROLL_N 1
#endif
#define SGEMM_DEFAULT_P 504
#define SGEMM_DEFAULT_R sgemm_r
#define DGEMM_DEFAULT_P 504
#define DGEMM_DEFAULT_R dgemm_r
#define QGEMM_DEFAULT_P 504
#define QGEMM_DEFAULT_R qgemm_r
#define CGEMM_DEFAULT_P 252
#define CGEMM_DEFAULT_R cgemm_r
#define ZGEMM_DEFAULT_P 252
#define ZGEMM_DEFAULT_R zgemm_r
#define XGEMM_DEFAULT_P 252
#define XGEMM_DEFAULT_R xgemm_r
#define SGEMM_DEFAULT_Q 512
#define DGEMM_DEFAULT_Q 256
#define QGEMM_DEFAULT_Q 128
#define CGEMM_DEFAULT_Q 512
#define ZGEMM_DEFAULT_Q 256
#define XGEMM_DEFAULT_Q 128
#define GETRF_FACTOR 0.72
#endif
#ifdef SANDYBRIDGE
#define SNUMOPT 8
#define DNUMOPT 4
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x03fffUL
#define SYMV_P 8
#define SWITCH_RATIO 4
#ifdef ARCH_X86
#define SGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_M 2
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_M 1
#define XGEMM_DEFAULT_UNROLL_M 1
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#else
#define SGEMM_DEFAULT_UNROLL_M 16
#define DGEMM_DEFAULT_UNROLL_M 8
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 8
#define ZGEMM_DEFAULT_UNROLL_M 1
#define XGEMM_DEFAULT_UNROLL_M 1
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 4
#define XGEMM_DEFAULT_UNROLL_N 1
#endif
#define SGEMM_DEFAULT_P 768
#define SGEMM_DEFAULT_R sgemm_r
/*#define SGEMM_DEFAULT_R 1024*/
#define DGEMM_DEFAULT_P 512
#define DGEMM_DEFAULT_R dgemm_r
/*#define DGEMM_DEFAULT_R 1024*/
#define QGEMM_DEFAULT_P 504
#define QGEMM_DEFAULT_R qgemm_r
#define CGEMM_DEFAULT_P 768
#define CGEMM_DEFAULT_R cgemm_r
/*#define CGEMM_DEFAULT_R 1024*/
#define ZGEMM_DEFAULT_P 512
#define ZGEMM_DEFAULT_R zgemm_r
/*#define ZGEMM_DEFAULT_R 1024*/
#define XGEMM_DEFAULT_P 252
#define XGEMM_DEFAULT_R xgemm_r
#define SGEMM_DEFAULT_Q 384
#define DGEMM_DEFAULT_Q 256
#define QGEMM_DEFAULT_Q 128
#define CGEMM_DEFAULT_Q 512
#define ZGEMM_DEFAULT_Q 192
#define XGEMM_DEFAULT_Q 128
#define CGEMM3M_DEFAULT_UNROLL_N 8
#define CGEMM3M_DEFAULT_UNROLL_M 4
#define ZGEMM3M_DEFAULT_UNROLL_N 8
#define ZGEMM3M_DEFAULT_UNROLL_M 2
#define CGEMM3M_DEFAULT_P 448
#define ZGEMM3M_DEFAULT_P 224
#define XGEMM3M_DEFAULT_P 112
#define CGEMM3M_DEFAULT_Q 224
#define ZGEMM3M_DEFAULT_Q 224
#define XGEMM3M_DEFAULT_Q 224
#define CGEMM3M_DEFAULT_R 12288
#define ZGEMM3M_DEFAULT_R 12288
#define XGEMM3M_DEFAULT_R 12288
#define GETRF_FACTOR 0.72
#endif
#ifdef HASWELL
#define SNUMOPT 16
#define DNUMOPT 8
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x03fffUL
#define SYMV_P 8
#if defined(XDOUBLE) || defined(DOUBLE)
#define SWITCH_RATIO 4
#define GEMM_PREFERED_SIZE 4
#else
#define SWITCH_RATIO 8
#define GEMM_PREFERED_SIZE 8
#endif
#ifdef ARCH_X86
#define SGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_M 2
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_M 1
#define XGEMM_DEFAULT_UNROLL_M 1
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#else /* !ARCH_X86 */
#define SGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_M 4
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 8
#define ZGEMM_DEFAULT_UNROLL_M 4
#define XGEMM_DEFAULT_UNROLL_M 1
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 8
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
/*
#define SGEMM_DEFAULT_UNROLL_MN 32
#define DGEMM_DEFAULT_UNROLL_MN 32
*/
#endif /* ARCH_X86 */
#ifdef ARCH_X86
#define SGEMM_DEFAULT_P 512
#define SGEMM_DEFAULT_R sgemm_r
#define DGEMM_DEFAULT_P 512
#define DGEMM_DEFAULT_R dgemm_r
#define QGEMM_DEFAULT_P 504
#define QGEMM_DEFAULT_R qgemm_r
#define CGEMM_DEFAULT_P 128
#define CGEMM_DEFAULT_R 1024
#define ZGEMM_DEFAULT_P 512
#define ZGEMM_DEFAULT_R zgemm_r
#define XGEMM_DEFAULT_P 252
#define XGEMM_DEFAULT_R xgemm_r
#define SGEMM_DEFAULT_Q 256
#define DGEMM_DEFAULT_Q 256
#define QGEMM_DEFAULT_Q 128
#define CGEMM_DEFAULT_Q 256
#define ZGEMM_DEFAULT_Q 192
#define XGEMM_DEFAULT_Q 128
#else /* !ARCH_X86 */
#define SGEMM_DEFAULT_P 320
#define DGEMM_DEFAULT_P 512
#define CGEMM_DEFAULT_P 256
#define ZGEMM_DEFAULT_P 192
#ifdef WINDOWS_ABI
#define SGEMM_DEFAULT_Q 320
#define DGEMM_DEFAULT_Q 128
#else /* !WINDOWS_ABI */
#define SGEMM_DEFAULT_Q 320
#define DGEMM_DEFAULT_Q 256
#endif /* WINDOWS_ABI */
#define CGEMM_DEFAULT_Q 256
#define ZGEMM_DEFAULT_Q 192
#define SGEMM_DEFAULT_R sgemm_r
#define DGEMM_DEFAULT_R 13824
#define CGEMM_DEFAULT_R cgemm_r
#define ZGEMM_DEFAULT_R zgemm_r
#define QGEMM_DEFAULT_Q 128
#define QGEMM_DEFAULT_P 504
#define QGEMM_DEFAULT_R qgemm_r
#define XGEMM_DEFAULT_P 252
#define XGEMM_DEFAULT_R xgemm_r
#define XGEMM_DEFAULT_Q 128
#define CGEMM3M_DEFAULT_UNROLL_N 4
#define CGEMM3M_DEFAULT_UNROLL_M 8
#define ZGEMM3M_DEFAULT_UNROLL_N 4
#define ZGEMM3M_DEFAULT_UNROLL_M 4
#define CGEMM3M_DEFAULT_P 320
#define ZGEMM3M_DEFAULT_P 256
#define XGEMM3M_DEFAULT_P 112
#define CGEMM3M_DEFAULT_Q 320
#define ZGEMM3M_DEFAULT_Q 256
#define XGEMM3M_DEFAULT_Q 224
#define CGEMM3M_DEFAULT_R 12288
#define ZGEMM3M_DEFAULT_R 12288
#define XGEMM3M_DEFAULT_R 12288
#endif /* ARCH_X86 */
#endif /* HASWELL */
#ifdef SKYLAKEX
#define SNUMOPT 16
#define DNUMOPT 8
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x03fffUL
#define SYMV_P 8
#if defined(XDOUBLE) || defined(DOUBLE)
#define SWITCH_RATIO 8
#define GEMM_PREFERED_SIZE 8
#else
#define SWITCH_RATIO 16
#define GEMM_PREFERED_SIZE 16
#endif
#define USE_SGEMM_KERNEL_DIRECT 1
#ifdef ARCH_X86
#define SGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_M 2
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_M 1
#define XGEMM_DEFAULT_UNROLL_M 1
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#else /* !ARCH_X86 */
#define SGEMM_DEFAULT_UNROLL_M 16
#define DGEMM_DEFAULT_UNROLL_M 16
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 8
#define ZGEMM_DEFAULT_UNROLL_M 4
#define XGEMM_DEFAULT_UNROLL_M 1
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 2
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#define SGEMM_DEFAULT_UNROLL_MN 32
#define DGEMM_DEFAULT_UNROLL_MN 32
#endif /* ARCH_X86 */
#ifdef ARCH_X86
#define SGEMM_DEFAULT_P 512
#define SGEMM_DEFAULT_R sgemm_r
#define DGEMM_DEFAULT_P 512
#define DGEMM_DEFAULT_R dgemm_r
#define QGEMM_DEFAULT_P 504
#define QGEMM_DEFAULT_R qgemm_r
#define CGEMM_DEFAULT_P 128
#define CGEMM_DEFAULT_R 1024
#define ZGEMM_DEFAULT_P 512
#define ZGEMM_DEFAULT_R zgemm_r
#define XGEMM_DEFAULT_P 252
#define XGEMM_DEFAULT_R xgemm_r
#define SGEMM_DEFAULT_Q 256
#define DGEMM_DEFAULT_Q 256
#define QGEMM_DEFAULT_Q 128
#define CGEMM_DEFAULT_Q 256
#define ZGEMM_DEFAULT_Q 192
#define XGEMM_DEFAULT_Q 128
#else /* !ARCH_X86 */
#define SGEMM_DEFAULT_P 448
#define DGEMM_DEFAULT_P 192
#define CGEMM_DEFAULT_P 384
#define ZGEMM_DEFAULT_P 256
#define SGEMM_DEFAULT_Q 448
#define DGEMM_DEFAULT_Q 384
#define CGEMM_DEFAULT_Q 192
#define ZGEMM_DEFAULT_Q 128
#define SGEMM_DEFAULT_R sgemm_r
#define DGEMM_DEFAULT_R 8640
#define CGEMM_DEFAULT_R cgemm_r
#define ZGEMM_DEFAULT_R zgemm_r
#define QGEMM_DEFAULT_Q 128
#define QGEMM_DEFAULT_P 504
#define QGEMM_DEFAULT_R qgemm_r
#define XGEMM_DEFAULT_P 252
#define XGEMM_DEFAULT_R xgemm_r
#define XGEMM_DEFAULT_Q 128
#define CGEMM3M_DEFAULT_UNROLL_N 4
#define CGEMM3M_DEFAULT_UNROLL_M 8
#define ZGEMM3M_DEFAULT_UNROLL_N 4
#define ZGEMM3M_DEFAULT_UNROLL_M 4
#define CGEMM3M_DEFAULT_P 320
#define ZGEMM3M_DEFAULT_P 256
#define XGEMM3M_DEFAULT_P 112
#define CGEMM3M_DEFAULT_Q 320
#define ZGEMM3M_DEFAULT_Q 256
#define XGEMM3M_DEFAULT_Q 224
#define CGEMM3M_DEFAULT_R 12288
#define ZGEMM3M_DEFAULT_R 12288
#define XGEMM3M_DEFAULT_R 12288
#endif /* ARCH_X86 */
#endif /* SKYLAKEX */
#ifdef SAPPHIRERAPIDS
#define SNUMOPT 16
#define DNUMOPT 8
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN 0x03fffUL
#define SYMV_P 8
#if defined(XDOUBLE) || defined(DOUBLE)
#define SWITCH_RATIO 8
#define GEMM_PREFERED_SIZE 8
#else
#define SWITCH_RATIO 16
#define GEMM_PREFERED_SIZE 16
#endif
#define USE_SGEMM_KERNEL_DIRECT 1
#undef SBGEMM_DEFAULT_UNROLL_N
#undef SBGEMM_DEFAULT_UNROLL_M
#undef SBGEMM_DEFAULT_P
#undef SBGEMM_DEFAULT_R
#undef SBGEMM_DEFAULT_Q
// FIXME: the actual kernel shape is UNROLL_M = UNROLL_N = 16.
// When UNROLL_M and UNROLL_N are equal, OpenBLAS reuses OCOPY as ICOPY.
// For AMX the two copies are not the same, so UNROLL_M is set to 32 as a workaround.
#define SBGEMM_DEFAULT_UNROLL_N 16
#define SBGEMM_DEFAULT_UNROLL_M 32
#define SBGEMM_DEFAULT_P 256
#define SBGEMM_DEFAULT_Q 1024
#define SBGEMM_DEFAULT_R sbgemm_r
#ifdef ARCH_X86
#define SGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_M 2
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_M 1
#define XGEMM_DEFAULT_UNROLL_M 1
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#else /* !ARCH_X86 */
#define SGEMM_DEFAULT_UNROLL_M 16
#define DGEMM_DEFAULT_UNROLL_M 16
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 8
#define ZGEMM_DEFAULT_UNROLL_M 4
#define XGEMM_DEFAULT_UNROLL_M 1
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 2
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#define SGEMM_DEFAULT_UNROLL_MN 32
#define DGEMM_DEFAULT_UNROLL_MN 32
#endif /* ARCH_X86 */
#ifdef ARCH_X86
#define SGEMM_DEFAULT_P 512
#define SGEMM_DEFAULT_R sgemm_r
#define DGEMM_DEFAULT_P 512
#define DGEMM_DEFAULT_R dgemm_r
#define QGEMM_DEFAULT_P 504
#define QGEMM_DEFAULT_R qgemm_r
#define CGEMM_DEFAULT_P 128
#define CGEMM_DEFAULT_R 1024
#define ZGEMM_DEFAULT_P 512
#define ZGEMM_DEFAULT_R zgemm_r
#define XGEMM_DEFAULT_P 252
#define XGEMM_DEFAULT_R xgemm_r
#define SGEMM_DEFAULT_Q 256
#define DGEMM_DEFAULT_Q 256
#define QGEMM_DEFAULT_Q 128
#define CGEMM_DEFAULT_Q 256
#define ZGEMM_DEFAULT_Q 192
#define XGEMM_DEFAULT_Q 128
#else /* !ARCH_X86 */
#define SGEMM_DEFAULT_P 640
#define DGEMM_DEFAULT_P 192
#define CGEMM_DEFAULT_P 384
#define ZGEMM_DEFAULT_P 256
#define SGEMM_DEFAULT_Q 320
#define DGEMM_DEFAULT_Q 384
#define CGEMM_DEFAULT_Q 192
#define ZGEMM_DEFAULT_Q 128
#define SGEMM_DEFAULT_R sgemm_r
#define DGEMM_DEFAULT_R 8640
#define CGEMM_DEFAULT_R cgemm_r
#define ZGEMM_DEFAULT_R zgemm_r
#define QGEMM_DEFAULT_Q 128
#define QGEMM_DEFAULT_P 504
#define QGEMM_DEFAULT_R qgemm_r
#define XGEMM_DEFAULT_P 252
#define XGEMM_DEFAULT_R xgemm_r
#define XGEMM_DEFAULT_Q 128
#define CGEMM3M_DEFAULT_UNROLL_N 4
#define CGEMM3M_DEFAULT_UNROLL_M 8
#define ZGEMM3M_DEFAULT_UNROLL_N 4
#define ZGEMM3M_DEFAULT_UNROLL_M 4
#define CGEMM3M_DEFAULT_P 320
#define ZGEMM3M_DEFAULT_P 256
#define XGEMM3M_DEFAULT_P 112
#define CGEMM3M_DEFAULT_Q 320
#define ZGEMM3M_DEFAULT_Q 256
#define XGEMM3M_DEFAULT_Q 224
#define CGEMM3M_DEFAULT_R 12288
#define ZGEMM3M_DEFAULT_R 12288
#define XGEMM3M_DEFAULT_R 12288
#endif /* ARCH_X86 */
#endif /* SAPPHIRERAPIDS */
#ifdef COOPERLAKE
#define SNUMOPT 16
#define DNUMOPT 8
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN 0x03fffUL
#define SYMV_P 8
#if defined(XDOUBLE) || defined(DOUBLE)
#define SWITCH_RATIO 8
#define GEMM_PREFERED_SIZE 8
#else
#define SWITCH_RATIO 16
#define GEMM_PREFERED_SIZE 16
#endif
#define USE_SGEMM_KERNEL_DIRECT 1
#undef SBGEMM_DEFAULT_UNROLL_N
#undef SBGEMM_DEFAULT_UNROLL_M
#undef SBGEMM_DEFAULT_P
#undef SBGEMM_DEFAULT_R
#undef SBGEMM_DEFAULT_Q
#define SBGEMM_DEFAULT_UNROLL_N 4
#define SBGEMM_DEFAULT_UNROLL_M 16
#define SBGEMM_DEFAULT_P 384
#define SBGEMM_DEFAULT_Q 768
#define SBGEMM_DEFAULT_R sbgemm_r
#ifdef ARCH_X86
#define SGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_M 2
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_M 1
#define XGEMM_DEFAULT_UNROLL_M 1
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#else /* !ARCH_X86 */
#define SGEMM_DEFAULT_UNROLL_M 16
#define DGEMM_DEFAULT_UNROLL_M 16
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 8
#define ZGEMM_DEFAULT_UNROLL_M 4
#define XGEMM_DEFAULT_UNROLL_M 1
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 2
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#define SGEMM_DEFAULT_UNROLL_MN 32
#define DGEMM_DEFAULT_UNROLL_MN 32
#endif /* ARCH_X86 */
#ifdef ARCH_X86
#define SGEMM_DEFAULT_P 512
#define SGEMM_DEFAULT_R sgemm_r
#define DGEMM_DEFAULT_P 512
#define DGEMM_DEFAULT_R dgemm_r
#define QGEMM_DEFAULT_P 504
#define QGEMM_DEFAULT_R qgemm_r
#define CGEMM_DEFAULT_P 128
#define CGEMM_DEFAULT_R 1024
#define ZGEMM_DEFAULT_P 512
#define ZGEMM_DEFAULT_R zgemm_r
#define XGEMM_DEFAULT_P 252
#define XGEMM_DEFAULT_R xgemm_r
#define SGEMM_DEFAULT_Q 256
#define DGEMM_DEFAULT_Q 256
#define QGEMM_DEFAULT_Q 128
#define CGEMM_DEFAULT_Q 256
#define ZGEMM_DEFAULT_Q 192
#define XGEMM_DEFAULT_Q 128
#else /* !ARCH_X86 */
#define SGEMM_DEFAULT_P 640
#define DGEMM_DEFAULT_P 192
#define CGEMM_DEFAULT_P 384
#define ZGEMM_DEFAULT_P 256
#define SGEMM_DEFAULT_Q 320
#define DGEMM_DEFAULT_Q 384
#define CGEMM_DEFAULT_Q 192
#define ZGEMM_DEFAULT_Q 128
#define SGEMM_DEFAULT_R sgemm_r
#define DGEMM_DEFAULT_R 8640
#define CGEMM_DEFAULT_R cgemm_r
#define ZGEMM_DEFAULT_R zgemm_r
#define QGEMM_DEFAULT_Q 128
#define QGEMM_DEFAULT_P 504
#define QGEMM_DEFAULT_R qgemm_r
#define XGEMM_DEFAULT_P 252
#define XGEMM_DEFAULT_R xgemm_r
#define XGEMM_DEFAULT_Q 128
#define CGEMM3M_DEFAULT_UNROLL_N 4
#define CGEMM3M_DEFAULT_UNROLL_M 8
#define ZGEMM3M_DEFAULT_UNROLL_N 4
#define ZGEMM3M_DEFAULT_UNROLL_M 4
#define CGEMM3M_DEFAULT_P 320
#define ZGEMM3M_DEFAULT_P 256
#define XGEMM3M_DEFAULT_P 112
#define CGEMM3M_DEFAULT_Q 320
#define ZGEMM3M_DEFAULT_Q 256
#define XGEMM3M_DEFAULT_Q 224
#define CGEMM3M_DEFAULT_R 12288
#define ZGEMM3M_DEFAULT_R 12288
#define XGEMM3M_DEFAULT_R 12288
#endif /* ARCH_X86 */
#endif /* COOPERLAKE */
#ifdef ATOM
#define SNUMOPT 2
#define DNUMOPT 1
#define GEMM_DEFAULT_OFFSET_A 64
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x0ffffUL
#define SYMV_P 8
#ifdef ARCH_X86
#define SGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_M 2
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_M 1
#define XGEMM_DEFAULT_UNROLL_M 1
#else /* !ARCH_X86 */
#define SGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_M 4
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_M 2
#define XGEMM_DEFAULT_UNROLL_M 1
#endif /* ARCH_X86 */
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_N 2
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 1
#define XGEMM_DEFAULT_UNROLL_N 1
#define SGEMM_DEFAULT_P sgemm_p
#define SGEMM_DEFAULT_R sgemm_r
#define DGEMM_DEFAULT_P dgemm_p
#define DGEMM_DEFAULT_R dgemm_r
#define QGEMM_DEFAULT_P qgemm_p
#define QGEMM_DEFAULT_R qgemm_r
#define CGEMM_DEFAULT_P cgemm_p
#define CGEMM_DEFAULT_R cgemm_r
#define ZGEMM_DEFAULT_P zgemm_p
#define ZGEMM_DEFAULT_R zgemm_r
#define XGEMM_DEFAULT_P xgemm_p
#define XGEMM_DEFAULT_R xgemm_r
#define SGEMM_DEFAULT_Q 256
#define DGEMM_DEFAULT_Q 256
#define QGEMM_DEFAULT_Q 256
#define CGEMM_DEFAULT_Q 256
#define ZGEMM_DEFAULT_Q 256
#define XGEMM_DEFAULT_Q 256
#endif /* ATOM */
#ifdef ITANIUM2
#define SNUMOPT 4
#define DNUMOPT 4
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 128
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x03fffUL
#define SGEMM_DEFAULT_UNROLL_M 8
#define SGEMM_DEFAULT_UNROLL_N 8
#define DGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_N 8
#define QGEMM_DEFAULT_UNROLL_M 8
#define QGEMM_DEFAULT_UNROLL_N 8
#define CGEMM_DEFAULT_UNROLL_M 4
#define CGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_N 4
#define XGEMM_DEFAULT_UNROLL_M 4
#define XGEMM_DEFAULT_UNROLL_N 4
#define SGEMM_DEFAULT_P sgemm_p
#define DGEMM_DEFAULT_P dgemm_p
#define QGEMM_DEFAULT_P qgemm_p
#define CGEMM_DEFAULT_P cgemm_p
#define ZGEMM_DEFAULT_P zgemm_p
#define XGEMM_DEFAULT_P xgemm_p
#define SGEMM_DEFAULT_Q 1024
#define DGEMM_DEFAULT_Q 1024
#define QGEMM_DEFAULT_Q 1024
#define CGEMM_DEFAULT_Q 1024
#define ZGEMM_DEFAULT_Q 1024
#define XGEMM_DEFAULT_Q 1024
#define SGEMM_DEFAULT_R sgemm_r
#define DGEMM_DEFAULT_R dgemm_r
#define QGEMM_DEFAULT_R qgemm_r
#define CGEMM_DEFAULT_R cgemm_r
#define ZGEMM_DEFAULT_R zgemm_r
#define XGEMM_DEFAULT_R xgemm_r
#define SYMV_P 16
#define GETRF_FACTOR 0.65
#endif /* ITANIUM2 */
#if defined(EV4) || defined(EV5) || defined(EV6)
#ifdef EV4
#define SNUMOPT 1
#define DNUMOPT 1
#else
#define SNUMOPT 2
#define DNUMOPT 2
#endif
#define GEMM_DEFAULT_OFFSET_A 512
#define GEMM_DEFAULT_OFFSET_B 512
#define GEMM_DEFAULT_ALIGN 0x0ffffUL
#define SGEMM_DEFAULT_UNROLL_M 4
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define SYMV_P 8
#ifdef EV4
#define SGEMM_DEFAULT_P 32
#define SGEMM_DEFAULT_Q 112
#define SGEMM_DEFAULT_R 256
#define DGEMM_DEFAULT_P 32
#define DGEMM_DEFAULT_Q 56
#define DGEMM_DEFAULT_R 256
#define CGEMM_DEFAULT_P 32
#define CGEMM_DEFAULT_Q 64
#define CGEMM_DEFAULT_R 240
#define ZGEMM_DEFAULT_P 32
#define ZGEMM_DEFAULT_Q 32
#define ZGEMM_DEFAULT_R 240
#endif /* EV4 */
#ifdef EV5
#define SGEMM_DEFAULT_P 64
#define SGEMM_DEFAULT_Q 256
#define DGEMM_DEFAULT_P 64
#define DGEMM_DEFAULT_Q 128
#define CGEMM_DEFAULT_P 64
#define CGEMM_DEFAULT_Q 128
#define ZGEMM_DEFAULT_P 64
#define ZGEMM_DEFAULT_Q 64
#endif /* EV5 */
#ifdef EV6
#define SGEMM_DEFAULT_P 256
#define SGEMM_DEFAULT_Q 512
#define DGEMM_DEFAULT_P 256
#define DGEMM_DEFAULT_Q 256
#define CGEMM_DEFAULT_P 256
#define CGEMM_DEFAULT_Q 256
#define ZGEMM_DEFAULT_P 128
#define ZGEMM_DEFAULT_Q 256
#endif /* EV6 */
#endif /* EV4 || EV5 || EV6 */
#ifdef CELL
#define SNUMOPT 2
#define DNUMOPT 2
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 8192
#define GEMM_DEFAULT_ALIGN 0x0ffffUL
#define SGEMM_DEFAULT_UNROLL_M 16
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 8
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define SGEMM_DEFAULT_P 128
#define DGEMM_DEFAULT_P 128
#define CGEMM_DEFAULT_P 128
#define ZGEMM_DEFAULT_P 128
#define SGEMM_DEFAULT_Q 512
#define DGEMM_DEFAULT_Q 256
#define CGEMM_DEFAULT_Q 256
#define ZGEMM_DEFAULT_Q 128
#define SYMV_P 4
#endif /* CELL */
#ifdef PPCG4
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 1024
#define GEMM_DEFAULT_ALIGN 0x0ffffUL
#define SGEMM_DEFAULT_UNROLL_M 4
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define SGEMM_DEFAULT_P 256
#define DGEMM_DEFAULT_P 128
#define CGEMM_DEFAULT_P 128
#define ZGEMM_DEFAULT_P 64
#define SGEMM_DEFAULT_Q 256
#define DGEMM_DEFAULT_Q 256
#define CGEMM_DEFAULT_Q 256
#define ZGEMM_DEFAULT_Q 256
#define SYMV_P 4
#endif /* PPCG4 */
#ifdef PPC970
#define SNUMOPT 4
#define DNUMOPT 4
#define GEMM_DEFAULT_OFFSET_A 2688
#define GEMM_DEFAULT_OFFSET_B 3072
#define GEMM_DEFAULT_ALIGN 0x03fffUL
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
#define SGEMM_DEFAULT_UNROLL_M 4
#else /* little-endian */
#define SGEMM_DEFAULT_UNROLL_M 16
#endif
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_N 4
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
#define CGEMM_DEFAULT_UNROLL_M 2
#else /* little-endian */
#define CGEMM_DEFAULT_UNROLL_M 8
#endif
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#if defined(OS_LINUX) || defined(OS_DARWIN) || defined(OS_FREEBSD)
#if L2_SIZE == 1024976
#define SGEMM_DEFAULT_P 320
#define DGEMM_DEFAULT_P 256
#define CGEMM_DEFAULT_P 256
#define ZGEMM_DEFAULT_P 256
#else
#define SGEMM_DEFAULT_P 176
#define DGEMM_DEFAULT_P 176
#define CGEMM_DEFAULT_P 176
#define ZGEMM_DEFAULT_P 176
#endif /* L2_SIZE */
#endif /* OS_LINUX || OS_DARWIN || OS_FREEBSD */
#define SGEMM_DEFAULT_Q 512
#define DGEMM_DEFAULT_Q 256
#define CGEMM_DEFAULT_Q 256
#define ZGEMM_DEFAULT_Q 128
#define SYMV_P 4
#endif /* PPC970 */
#ifdef PPC440
#define SNUMOPT 2
#define DNUMOPT 2
#define GEMM_DEFAULT_OFFSET_A (32 * 0)
#define GEMM_DEFAULT_OFFSET_B (32 * 0)
#define GEMM_DEFAULT_ALIGN 0x0ffffUL
#define SGEMM_DEFAULT_UNROLL_M 4
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define SGEMM_DEFAULT_P 512
#define DGEMM_DEFAULT_P 512
#define CGEMM_DEFAULT_P 512
#define ZGEMM_DEFAULT_P 512
#define SGEMM_DEFAULT_Q 1024
#define DGEMM_DEFAULT_Q 512
#define CGEMM_DEFAULT_Q 512
#define ZGEMM_DEFAULT_Q 256
#define SGEMM_DEFAULT_R SGEMM_DEFAULT_P
#define DGEMM_DEFAULT_R DGEMM_DEFAULT_P
#define CGEMM_DEFAULT_R CGEMM_DEFAULT_P
#define ZGEMM_DEFAULT_R ZGEMM_DEFAULT_P
#define SYMV_P 4
#endif /* PPC440 */
#ifdef PPC440FP2
#define SNUMOPT 4
#define DNUMOPT 4
#define GEMM_DEFAULT_OFFSET_A (32 * 0)
#define GEMM_DEFAULT_OFFSET_B (32 * 0)
#define GEMM_DEFAULT_ALIGN 0x0ffffUL
#define SGEMM_DEFAULT_UNROLL_M 8
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 4
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_N 2
#define SGEMM_DEFAULT_P 128
#define DGEMM_DEFAULT_P 128
#define CGEMM_DEFAULT_P 128
#define ZGEMM_DEFAULT_P 128
#if 1
#define SGEMM_DEFAULT_Q 4096
#define DGEMM_DEFAULT_Q 3072
#define CGEMM_DEFAULT_Q 2048
#define ZGEMM_DEFAULT_Q 1024
#else /* alternative Q values, currently disabled */
#define SGEMM_DEFAULT_Q 512
#define DGEMM_DEFAULT_Q 256
#define CGEMM_DEFAULT_Q 256
#define ZGEMM_DEFAULT_Q 128
#endif
#define SYMV_P 4
#endif /* PPC440FP2 */
#if defined(POWER3) || defined(POWER4) || defined(POWER5)
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 2048
#define GEMM_DEFAULT_ALIGN 0x0ffffUL
#define SGEMM_DEFAULT_UNROLL_M 4
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#ifdef POWER3
#define SNUMOPT 4
#define DNUMOPT 4
#define SGEMM_DEFAULT_P 256
#define SGEMM_DEFAULT_Q 432
#define SGEMM_DEFAULT_R 1012
#define DGEMM_DEFAULT_P 256
#define DGEMM_DEFAULT_Q 216
#define DGEMM_DEFAULT_R 1012
#define CGEMM_DEFAULT_P 256
#define CGEMM_DEFAULT_Q 104
#define CGEMM_DEFAULT_R 1012
#define ZGEMM_DEFAULT_P 256
#define ZGEMM_DEFAULT_Q 104
#define ZGEMM_DEFAULT_R 1012
#endif /* POWER3 */
#if defined(POWER4)
#ifdef ALLOC_HUGETLB
#define SGEMM_DEFAULT_P 184
#define DGEMM_DEFAULT_P 184
#define CGEMM_DEFAULT_P 184
#define ZGEMM_DEFAULT_P 184
#else /* !ALLOC_HUGETLB */
#define SGEMM_DEFAULT_P 144
#define DGEMM_DEFAULT_P 144
#define CGEMM_DEFAULT_P 144
#define ZGEMM_DEFAULT_P 144
#endif /* ALLOC_HUGETLB */
#define SGEMM_DEFAULT_Q 256
#define CGEMM_DEFAULT_Q 256
#define DGEMM_DEFAULT_Q 256
#define ZGEMM_DEFAULT_Q 256
#endif /* POWER4 */
#if defined(POWER5)
#ifdef ALLOC_HUGETLB
#define SGEMM_DEFAULT_P 512
#define DGEMM_DEFAULT_P 256
#define CGEMM_DEFAULT_P 256
#define ZGEMM_DEFAULT_P 128
#else /* !ALLOC_HUGETLB */
#define SGEMM_DEFAULT_P 320
#define DGEMM_DEFAULT_P 160
#define CGEMM_DEFAULT_P 160
#define ZGEMM_DEFAULT_P 80
#endif /* ALLOC_HUGETLB */
#define SGEMM_DEFAULT_Q 256
#define CGEMM_DEFAULT_Q 256
#define DGEMM_DEFAULT_Q 256
#define ZGEMM_DEFAULT_Q 256
#endif /* POWER5 */
#define SYMV_P 8
#endif /* POWER3 || POWER4 || POWER5 */
#if defined(POWER6)
#define SNUMOPT 4
#define DNUMOPT 4
#define GEMM_DEFAULT_OFFSET_A 384
#define GEMM_DEFAULT_OFFSET_B 1024
#define GEMM_DEFAULT_ALIGN 0x03fffUL
#define SGEMM_DEFAULT_UNROLL_M 4
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_N 4
#define SGEMM_DEFAULT_P 992
#define DGEMM_DEFAULT_P 480
#define CGEMM_DEFAULT_P 488
#define ZGEMM_DEFAULT_P 248
#define SGEMM_DEFAULT_Q 504
#define DGEMM_DEFAULT_Q 504
#define CGEMM_DEFAULT_Q 400
#define ZGEMM_DEFAULT_Q 400
#define SYMV_P 8
#endif /* POWER6 */
#if defined(POWER8) || (defined(POWER9) && defined(OS_AIX))
#define SNUMOPT 16
#define DNUMOPT 8
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 65536
#define GEMM_DEFAULT_ALIGN 0x0ffffUL
#if defined(__32BIT__)
#warning using BINARY32==POWER6
#define SGEMM_DEFAULT_UNROLL_M 4
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_N 4
#else /* !__32BIT__ */
#define SGEMM_DEFAULT_UNROLL_M 16
#define SGEMM_DEFAULT_UNROLL_N 8
#define DGEMM_DEFAULT_UNROLL_M 16
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 8
#define CGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_M 8
#define ZGEMM_DEFAULT_UNROLL_N 2
#endif /* __32BIT__ */
#define SGEMM_DEFAULT_P 1280UL
#define DGEMM_DEFAULT_P 640UL
#define CGEMM_DEFAULT_P 640UL
#define ZGEMM_DEFAULT_P 320UL
#define SGEMM_DEFAULT_Q 640UL
#define DGEMM_DEFAULT_Q 720UL
#define CGEMM_DEFAULT_Q 640UL
#define ZGEMM_DEFAULT_Q 640UL
#if 0
#define SGEMM_DEFAULT_R SGEMM_DEFAULT_P
#define DGEMM_DEFAULT_R DGEMM_DEFAULT_P
#define CGEMM_DEFAULT_R CGEMM_DEFAULT_P
#define ZGEMM_DEFAULT_R ZGEMM_DEFAULT_P
#endif
#define SGEMM_DEFAULT_R 4096
#define DGEMM_DEFAULT_R 4096
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#define SYMV_P 8
#endif /* POWER8 || (POWER9 && OS_AIX) */
#if defined(POWER9) && (defined(OS_LINUX) || defined(OS_FREEBSD))
#define SNUMOPT 16
#define DNUMOPT 8
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 65536
#define GEMM_DEFAULT_ALIGN 0x0ffffUL
#define SWITCH_RATIO 16
#define GEMM_PREFERED_SIZE 16
#define SGEMM_DEFAULT_UNROLL_M 16
#define SGEMM_DEFAULT_UNROLL_N 8
#define DGEMM_DEFAULT_UNROLL_M 16
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 8
#define CGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_M 8
#define ZGEMM_DEFAULT_UNROLL_N 2
#define SGEMM_DEFAULT_P 832
#define DGEMM_DEFAULT_P 128
#define CGEMM_DEFAULT_P 512
#define ZGEMM_DEFAULT_P 256
#define SGEMM_DEFAULT_Q 1026
#define DGEMM_DEFAULT_Q 384
#define CGEMM_DEFAULT_Q 1026
#define ZGEMM_DEFAULT_Q 1026
#define SGEMM_DEFAULT_R 4096
#define DGEMM_DEFAULT_R 4096
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#define SYMV_P 8
#endif /* POWER9 && (OS_LINUX || OS_FREEBSD) */
#if defined(POWER10)
#define SNUMOPT 16
#define DNUMOPT 8
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 65536
#define GEMM_DEFAULT_ALIGN 0x0ffffUL
#define SWITCH_RATIO 16
#define GEMM_PREFERED_SIZE 16
#define SGEMM_DEFAULT_UNROLL_M 16
#define SGEMM_DEFAULT_UNROLL_N 8
#define DGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_N 8
#define CGEMM_DEFAULT_UNROLL_M 8
#define CGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_M 8
#define ZGEMM_DEFAULT_UNROLL_N 2
#define SGEMM_DEFAULT_P 512
#define DGEMM_DEFAULT_P 384
#define CGEMM_DEFAULT_P 512
#define ZGEMM_DEFAULT_P 256
#define SGEMM_DEFAULT_Q 512
#define DGEMM_DEFAULT_Q 512
#define CGEMM_DEFAULT_Q 384
#define ZGEMM_DEFAULT_Q 384
#define SGEMM_DEFAULT_R 4096
#define DGEMM_DEFAULT_R 4096
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#define SYMV_P 8
#undef SBGEMM_DEFAULT_UNROLL_N
#undef SBGEMM_DEFAULT_UNROLL_M
#undef SBGEMM_DEFAULT_P
#undef SBGEMM_DEFAULT_R
#undef SBGEMM_DEFAULT_Q
#define SBGEMM_DEFAULT_UNROLL_M 16
#define SBGEMM_DEFAULT_UNROLL_N 8
#define SBGEMM_DEFAULT_P 512
#define SBGEMM_DEFAULT_Q 1024
#define SBGEMM_DEFAULT_R 4096
#endif /* POWER10 */
#if defined(SPARC) && defined(V7)
#define SNUMOPT 4
#define DNUMOPT 4
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 2048
#define GEMM_DEFAULT_ALIGN 0x03fffUL
#define SGEMM_DEFAULT_UNROLL_M 2
#define SGEMM_DEFAULT_UNROLL_N 8
#define DGEMM_DEFAULT_UNROLL_M 2
#define DGEMM_DEFAULT_UNROLL_N 8
#define CGEMM_DEFAULT_UNROLL_M 1
#define CGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_M 1
#define ZGEMM_DEFAULT_UNROLL_N 4
#define SGEMM_DEFAULT_P 256
#define DGEMM_DEFAULT_P 256
#define CGEMM_DEFAULT_P 256
#define ZGEMM_DEFAULT_P 256
#define SGEMM_DEFAULT_Q 512
#define DGEMM_DEFAULT_Q 256
#define CGEMM_DEFAULT_Q 256
#define ZGEMM_DEFAULT_Q 128
#define SYMV_P 8
#define GEMM_THREAD gemm_thread_mn
#endif
#if (defined(SPARC) && defined(V9)) || defined(__sparc_v9__)
#define SNUMOPT 2
#define DNUMOPT 2
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 2048
#define GEMM_DEFAULT_ALIGN 0x03fffUL
#define SGEMM_DEFAULT_UNROLL_M 4
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define SGEMM_DEFAULT_P 512
#define DGEMM_DEFAULT_P 512
#define CGEMM_DEFAULT_P 512
#define ZGEMM_DEFAULT_P 512
#define SGEMM_DEFAULT_Q 1024
#define DGEMM_DEFAULT_Q 512
#define CGEMM_DEFAULT_Q 512
#define ZGEMM_DEFAULT_Q 256
#define SYMV_P 8
#endif
#ifdef SICORTEX
#define SNUMOPT 2
#define DNUMOPT 2
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x03fffUL
#define SGEMM_DEFAULT_UNROLL_M 2
#define SGEMM_DEFAULT_UNROLL_N 8
#define DGEMM_DEFAULT_UNROLL_M 2
#define DGEMM_DEFAULT_UNROLL_N 8
#define CGEMM_DEFAULT_UNROLL_M 1
#define CGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_M 1
#define ZGEMM_DEFAULT_UNROLL_N 4
#define SGEMM_DEFAULT_P 108
#define DGEMM_DEFAULT_P 112
#define CGEMM_DEFAULT_P 108
#define ZGEMM_DEFAULT_P 112
#define SGEMM_DEFAULT_Q 288
#define DGEMM_DEFAULT_Q 144
#define CGEMM_DEFAULT_Q 144
#define ZGEMM_DEFAULT_Q 72
#define SGEMM_DEFAULT_R 2000
#define DGEMM_DEFAULT_R 2000
#define CGEMM_DEFAULT_R 2000
#define ZGEMM_DEFAULT_R 2000
#define SYMV_P 16
#endif
#if defined(LOONGSON3R4)
#define SNUMOPT 2
#define DNUMOPT 2
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x03fffUL
#if defined(NO_MSA)
#define SGEMM_DEFAULT_UNROLL_M 8
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 4
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#else
#define SGEMM_DEFAULT_UNROLL_M 8
#define SGEMM_DEFAULT_UNROLL_N 8
#define DGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 8
#define CGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_N 4
#endif
#define SGEMM_DEFAULT_P 64
#define DGEMM_DEFAULT_P 44
#define CGEMM_DEFAULT_P 64
#define ZGEMM_DEFAULT_P 32
#define SGEMM_DEFAULT_Q 192
#define DGEMM_DEFAULT_Q 92
#define CGEMM_DEFAULT_Q 128
#define ZGEMM_DEFAULT_Q 80
#define SGEMM_DEFAULT_R 640
#define DGEMM_DEFAULT_R dgemm_r
#define CGEMM_DEFAULT_R 640
#define ZGEMM_DEFAULT_R 640
#define GEMM_OFFSET_A1 0x10000
#define GEMM_OFFSET_B1 0x100000
#define SYMV_P 16
#endif
#if defined(LOONGSON3R3)
/* Copied from SICORTEX */
#define SNUMOPT 2
#define DNUMOPT 2
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x03fffUL
#define SGEMM_DEFAULT_UNROLL_M 8
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 4
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define SGEMM_DEFAULT_P 64
#define DGEMM_DEFAULT_P 44
#define CGEMM_DEFAULT_P 64
#define ZGEMM_DEFAULT_P 32
#define SGEMM_DEFAULT_Q 192
#define DGEMM_DEFAULT_Q 92
#define CGEMM_DEFAULT_Q 128
#define ZGEMM_DEFAULT_Q 80
#define SGEMM_DEFAULT_R 640
#define DGEMM_DEFAULT_R dgemm_r
#define CGEMM_DEFAULT_R 640
#define ZGEMM_DEFAULT_R 640
#define GEMM_OFFSET_A1 0x10000
#define GEMM_OFFSET_B1 0x100000
#define SYMV_P 16
#endif
#if defined(LA464)
#define SNUMOPT 2
#define DNUMOPT 2
#define GEMM_DEFAULT_OFFSET_A 0x20000
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN 0x0ffffUL
#if defined(NO_LASX)
#define DGEMM_DEFAULT_UNROLL_N 8
#define DGEMM_DEFAULT_UNROLL_M 2
#define SGEMM_DEFAULT_UNROLL_N 8
#define SGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 1
#define ZGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_M 1
#else
#define DGEMM_DEFAULT_UNROLL_N 6
#define DGEMM_DEFAULT_UNROLL_M 16
#define SGEMM_DEFAULT_UNROLL_N 8
#define SGEMM_DEFAULT_UNROLL_M 16
#define CGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 16
#define ZGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_MN 96
#endif
#define QGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#define QGEMM_DEFAULT_UNROLL_M 2
#define XGEMM_DEFAULT_UNROLL_M 1
#define SGEMM_DEFAULT_P sgemm_p
#define DGEMM_DEFAULT_P dgemm_p
#define CGEMM_DEFAULT_P 128
#define ZGEMM_DEFAULT_P zgemm_p
#define SGEMM_DEFAULT_R sgemm_r
#define DGEMM_DEFAULT_R dgemm_r
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R zgemm_r
#define SGEMM_DEFAULT_Q sgemm_q
#define DGEMM_DEFAULT_Q dgemm_q
#define CGEMM_DEFAULT_Q 128
#define ZGEMM_DEFAULT_Q zgemm_q
#define SYMV_P 16
#endif
#ifdef LA264
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x03fffUL
#define SGEMM_DEFAULT_UNROLL_M 2
#define SGEMM_DEFAULT_UNROLL_N 8
#define DGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 8
#define CGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_N 4
#define SGEMM_DEFAULT_P 128
#define DGEMM_DEFAULT_P 128
#define CGEMM_DEFAULT_P 96
#define ZGEMM_DEFAULT_P 64
#define SGEMM_DEFAULT_Q 240
#define DGEMM_DEFAULT_Q 120
#define CGEMM_DEFAULT_Q 120
#define ZGEMM_DEFAULT_Q 120
#define SGEMM_DEFAULT_R 12288
#define DGEMM_DEFAULT_R 8192
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#define SYMV_P 16
#endif
#ifdef LA64_GENERIC
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x03fffUL
#define SGEMM_DEFAULT_UNROLL_M 2
#define SGEMM_DEFAULT_UNROLL_N 8
#define DGEMM_DEFAULT_UNROLL_M 2
#define DGEMM_DEFAULT_UNROLL_N 8
#define CGEMM_DEFAULT_UNROLL_M 1
#define CGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_M 1
#define ZGEMM_DEFAULT_UNROLL_N 4
#define SGEMM_DEFAULT_P 128
#define DGEMM_DEFAULT_P 128
#define CGEMM_DEFAULT_P 96
#define ZGEMM_DEFAULT_P 64
#define SGEMM_DEFAULT_Q 240
#define DGEMM_DEFAULT_Q 120
#define CGEMM_DEFAULT_Q 120
#define ZGEMM_DEFAULT_Q 120
#define SGEMM_DEFAULT_R 12288
#define DGEMM_DEFAULT_R 8192
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#define SYMV_P 16
#endif
#if defined(MIPS64_GENERIC) || defined(P5600) || defined(MIPS1004K) || defined(MIPS24K) || defined(I6400) || defined(P6600) || defined(I6500)
#define SNUMOPT 2
#define DNUMOPT 2
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG) 0x03fffUL
#if defined(NO_MSA) || defined(MIPS64_GENERIC)
#define SGEMM_DEFAULT_UNROLL_M 2
#define SGEMM_DEFAULT_UNROLL_N 2
#define DGEMM_DEFAULT_UNROLL_M 2
#define DGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#else
#define SGEMM_DEFAULT_UNROLL_M 8
#define SGEMM_DEFAULT_UNROLL_N 8
#define DGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 8
#define CGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_N 4
#endif
#define SGEMM_DEFAULT_P 128
#define DGEMM_DEFAULT_P 128
#define CGEMM_DEFAULT_P 96
#define ZGEMM_DEFAULT_P 64
#define SGEMM_DEFAULT_Q 240
#define DGEMM_DEFAULT_Q 120
#define CGEMM_DEFAULT_Q 120
#define ZGEMM_DEFAULT_Q 120
#define SGEMM_DEFAULT_R 12288
#define DGEMM_DEFAULT_R 8192
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#define SYMV_P 16
#endif
#ifdef RISCV64_GENERIC
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x03fffUL
#define SGEMM_DEFAULT_UNROLL_M 2
#define SGEMM_DEFAULT_UNROLL_N 2
#define DGEMM_DEFAULT_UNROLL_M 2
#define DGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define SGEMM_DEFAULT_P 128
#define DGEMM_DEFAULT_P 128
#define CGEMM_DEFAULT_P 96
#define ZGEMM_DEFAULT_P 64
#define SGEMM_DEFAULT_Q 240
#define DGEMM_DEFAULT_Q 120
#define CGEMM_DEFAULT_Q 120
#define ZGEMM_DEFAULT_Q 120
#define SGEMM_DEFAULT_R 12288
#define DGEMM_DEFAULT_R 8192
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#define SYMV_P 16
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#endif
#if defined(x280)
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN 0x03fffUL
#define SGEMM_DEFAULT_UNROLL_M 16 // 4 // 16 // 2
#define SGEMM_DEFAULT_UNROLL_N 8 // 4 // 4 // 2
/* SGEMM_UNROLL_MN is calculated as max(SGEMM_UNROLL_M, SGEMM_UNROLL_N).
 * Since we don't define SGEMM_UNROLL_M correctly, we have to set this macro manually.
 * If the VLMAX size is ever more than 1024, this should be increased as well. */
#define SGEMM_DEFAULT_UNROLL_MN 32
#define DGEMM_DEFAULT_UNROLL_M 16 // 2 // 8
#define DGEMM_DEFAULT_UNROLL_N 8 // 2 // 4
#define DGEMM_DEFAULT_UNROLL_MN 32
#define CGEMM_DEFAULT_UNROLL_M 8
#define CGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_MN 32
#define ZGEMM_DEFAULT_UNROLL_M 8
#define ZGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_MN 16
#define SGEMM_DEFAULT_P 160
#define DGEMM_DEFAULT_P 160
#define CGEMM_DEFAULT_P 96
#define ZGEMM_DEFAULT_P 64
#define SGEMM_DEFAULT_Q 240
#define DGEMM_DEFAULT_Q 128
#define CGEMM_DEFAULT_Q 120
#define ZGEMM_DEFAULT_Q 120
#define SGEMM_DEFAULT_R 12288
#define DGEMM_DEFAULT_R 8192
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#define SYMV_P 16
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#endif
#ifdef C910V
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN 0x03fffUL
#define SGEMM_DEFAULT_UNROLL_M 16
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define SGEMM_DEFAULT_P 160
#define DGEMM_DEFAULT_P 160
#define CGEMM_DEFAULT_P 96
#define ZGEMM_DEFAULT_P 64
#define SGEMM_DEFAULT_Q 240
#define DGEMM_DEFAULT_Q 128
#define CGEMM_DEFAULT_Q 120
#define ZGEMM_DEFAULT_Q 120
#define SGEMM_DEFAULT_R 12288
#define DGEMM_DEFAULT_R 8192
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#define SYMV_P 16
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#endif
#ifdef RISCV64_ZVL128B
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x03fffUL
#undef SHGEMM_DEFAULT_UNROLL_M
#undef SHGEMM_DEFAULT_UNROLL_N
#define SHGEMM_DEFAULT_UNROLL_M 8
#define SHGEMM_DEFAULT_UNROLL_N 8
#define SGEMM_DEFAULT_UNROLL_M 8
#define SGEMM_DEFAULT_UNROLL_N 8
#define DGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 8
#define CGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_N 4
#undef SHGEMM_DEFAULT_P
#define SHGEMM_DEFAULT_P 128
#define SGEMM_DEFAULT_P 128
#define DGEMM_DEFAULT_P 128
#define CGEMM_DEFAULT_P 96
#define ZGEMM_DEFAULT_P 64
#undef SHGEMM_DEFAULT_Q
#define SHGEMM_DEFAULT_Q 240
#define SGEMM_DEFAULT_Q 240
#define DGEMM_DEFAULT_Q 120
#define CGEMM_DEFAULT_Q 120
#define ZGEMM_DEFAULT_Q 120
#undef SHGEMM_DEFAULT_R
#define SHGEMM_DEFAULT_R 12288
#define SGEMM_DEFAULT_R 12288
#define DGEMM_DEFAULT_R 8192
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#define SYMV_P 16
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#endif
#ifdef RISCV64_ZVL256B
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN 0x03fffUL
#undef SHGEMM_DEFAULT_UNROLL_M
#undef SHGEMM_DEFAULT_UNROLL_N
#define SHGEMM_DEFAULT_UNROLL_M 16
#define SHGEMM_DEFAULT_UNROLL_N 8
#define SGEMM_DEFAULT_UNROLL_M 16
#define SGEMM_DEFAULT_UNROLL_N 8
#define DGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_N 8
#define CGEMM_DEFAULT_UNROLL_M 8
#define CGEMM_DEFAULT_UNROLL_N 8
#define ZGEMM_DEFAULT_UNROLL_M 8
#define ZGEMM_DEFAULT_UNROLL_N 4
#undef SHGEMM_DEFAULT_P
#define SHGEMM_DEFAULT_P 128
#define SGEMM_DEFAULT_P 128
#define DGEMM_DEFAULT_P 64
#define CGEMM_DEFAULT_P 64
#define ZGEMM_DEFAULT_P 64
#undef SHGEMM_DEFAULT_Q
#define SHGEMM_DEFAULT_Q 128
#define SGEMM_DEFAULT_Q 128
#define DGEMM_DEFAULT_Q 128
#define CGEMM_DEFAULT_Q 128
#define ZGEMM_DEFAULT_Q 64
#undef SHGEMM_DEFAULT_R
#define SHGEMM_DEFAULT_R 16384
#define SGEMM_DEFAULT_R 16384
#define DGEMM_DEFAULT_R 8192
#define CGEMM_DEFAULT_R 8192
#define ZGEMM_DEFAULT_R 4096
#define SYMV_P 16
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#endif
#ifdef ARMV7
#define SNUMOPT 2
#define DNUMOPT 2
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x03fffUL
#define SGEMM_DEFAULT_UNROLL_M 4
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define SGEMM_DEFAULT_P 128
#define DGEMM_DEFAULT_P 128
#define CGEMM_DEFAULT_P 96
#define ZGEMM_DEFAULT_P 64
#define SGEMM_DEFAULT_Q 240
#define DGEMM_DEFAULT_Q 120
#define CGEMM_DEFAULT_Q 120
#define ZGEMM_DEFAULT_Q 120
#define SGEMM_DEFAULT_R 12288
#define DGEMM_DEFAULT_R 8192
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#define SYMV_P 16
#endif
#if defined(ARMV6)
#define SNUMOPT 2
#define DNUMOPT 2
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x03fffUL
#define SGEMM_DEFAULT_UNROLL_M 4
#define SGEMM_DEFAULT_UNROLL_N 2
#define DGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define SGEMM_DEFAULT_P 128
#define DGEMM_DEFAULT_P 128
#define CGEMM_DEFAULT_P 96
#define ZGEMM_DEFAULT_P 64
#define SGEMM_DEFAULT_Q 240
#define DGEMM_DEFAULT_Q 120
#define CGEMM_DEFAULT_Q 120
#define ZGEMM_DEFAULT_Q 120
#define SGEMM_DEFAULT_R 12288
#define DGEMM_DEFAULT_R 8192
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#define SYMV_P 16
#endif
/* Common ARMv8 parameters */
#if defined(ARMV8)
#define SNUMOPT 2
#define DNUMOPT 2
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#ifdef _WIN64
/* Use an explicit cast on Win64, which uses the LLP64 data model */
#define GEMM_DEFAULT_ALIGN (BLASULONG)0x03fffUL
#else
#define GEMM_DEFAULT_ALIGN 0x03fffUL
#endif
#define SYMV_P 16
#if defined(CORTEXA57) || defined(CORTEXX1) || \
    defined(CORTEXA72) || defined(CORTEXA73) || \
    defined(FALKOR) || defined(TSV110) || defined(EMAG8180) || defined(VORTEX) || defined(FT2000)
#define SGEMM_DEFAULT_UNROLL_M 16
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 8
#define CGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_N 4
/* FIXME: this should use the cache size, but there is currently no easy way to
   query that on ARM. So if getarch counted more than 8 cores, we simply assume the host
   is a big desktop or server with abundant cache rather than a phone or embedded device. */
#if NUM_CORES > 8 || defined(TSV110) || defined(EMAG8180) || defined(VORTEX) || defined(CORTEXX1)
#define SGEMM_DEFAULT_P 512
#define DGEMM_DEFAULT_P 256
#define CGEMM_DEFAULT_P 256
#define ZGEMM_DEFAULT_P 128
#define SGEMM_DEFAULT_Q 1024
#define DGEMM_DEFAULT_Q 512
#define CGEMM_DEFAULT_Q 512
#define ZGEMM_DEFAULT_Q 512
#else
#define SGEMM_DEFAULT_P 128
#define DGEMM_DEFAULT_P 160
#define CGEMM_DEFAULT_P 128
#define ZGEMM_DEFAULT_P 128
#define SGEMM_DEFAULT_Q 352
#define DGEMM_DEFAULT_Q 128
#define CGEMM_DEFAULT_Q 224
#define ZGEMM_DEFAULT_Q 112
#endif
#define SGEMM_DEFAULT_R 4096
#define DGEMM_DEFAULT_R 4096
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 2048
#elif defined(CORTEXA76)
#define SGEMM_DEFAULT_UNROLL_M 16
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 8
#define CGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_N 4
#if defined(XDOUBLE) || defined(DOUBLE)
#define SWITCH_RATIO 8
#else
#define SWITCH_RATIO 16
#endif
#define SGEMM_DEFAULT_P 256
#define DGEMM_DEFAULT_P 128
#define CGEMM_DEFAULT_P 128
#define ZGEMM_DEFAULT_P 64
#define SGEMM_DEFAULT_Q 512
#define DGEMM_DEFAULT_Q 256
#define CGEMM_DEFAULT_Q 256
#define ZGEMM_DEFAULT_Q 256
#define SGEMM_DEFAULT_R 4096
#define DGEMM_DEFAULT_R 4096
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#elif defined(CORTEXA53) || defined(CORTEXA55)
#define SGEMM_DEFAULT_UNROLL_M 8
#define SGEMM_DEFAULT_UNROLL_N 8
#define DGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 8
#define CGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_N 4
#define SGEMM_DEFAULT_P 256
#define DGEMM_DEFAULT_P 160
#define CGEMM_DEFAULT_P 128
#define ZGEMM_DEFAULT_P 128
#define SGEMM_DEFAULT_Q 256
#define DGEMM_DEFAULT_Q 128
#define CGEMM_DEFAULT_Q 224
#define ZGEMM_DEFAULT_Q 112
#define SGEMM_DEFAULT_R 4096
#define DGEMM_DEFAULT_R 4096
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 2048
#elif defined(THUNDERX)
#define SGEMM_DEFAULT_UNROLL_M 4
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 2
#define DGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define SGEMM_DEFAULT_P 128
#define DGEMM_DEFAULT_P 128
#define CGEMM_DEFAULT_P 96
#define ZGEMM_DEFAULT_P 64
#define SGEMM_DEFAULT_Q 240
#define DGEMM_DEFAULT_Q 120
#define CGEMM_DEFAULT_Q 120
#define ZGEMM_DEFAULT_Q 120
#define SGEMM_DEFAULT_R 12288
#define DGEMM_DEFAULT_R 8192
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#elif defined(THUNDERX2T99)
#define SGEMM_DEFAULT_UNROLL_M 16
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 8
#define CGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_N 4
#define SGEMM_DEFAULT_P 128
#define DGEMM_DEFAULT_P 160
#define CGEMM_DEFAULT_P 128
#define ZGEMM_DEFAULT_P 128
#define SGEMM_DEFAULT_Q 352
#define DGEMM_DEFAULT_Q 128
#define CGEMM_DEFAULT_Q 224
#define ZGEMM_DEFAULT_Q 112
#define SGEMM_DEFAULT_R 4096
#define DGEMM_DEFAULT_R 4096
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#elif defined(THUNDERX3T110)
#define SGEMM_DEFAULT_UNROLL_M 16
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 8
#define CGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_N 4
#define SGEMM_DEFAULT_P 128
#define DGEMM_DEFAULT_P 320
#define CGEMM_DEFAULT_P 128
#define ZGEMM_DEFAULT_P 128
#define SGEMM_DEFAULT_Q 352
#define DGEMM_DEFAULT_Q 128
#define CGEMM_DEFAULT_Q 224
#define ZGEMM_DEFAULT_Q 112
#define SGEMM_DEFAULT_R 4096
#define DGEMM_DEFAULT_R 4096
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#elif defined(NEOVERSEN1)
#if defined(XDOUBLE) || defined(DOUBLE)
#define SWITCH_RATIO 8
#else
#define SWITCH_RATIO 16
#endif
#define SGEMM_DEFAULT_UNROLL_M 16
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 8
#define CGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_N 4
#define SGEMM_DEFAULT_P 240
#define DGEMM_DEFAULT_P 240
#define CGEMM_DEFAULT_P 128
#define ZGEMM_DEFAULT_P 128
#define SGEMM_DEFAULT_Q 640
#define DGEMM_DEFAULT_Q 320
#define CGEMM_DEFAULT_Q 224
#define ZGEMM_DEFAULT_Q 112
#define SGEMM_DEFAULT_R 4096
#define DGEMM_DEFAULT_R 4096
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#elif defined(NEOVERSEV1) // 256-bit SVE
#if defined(XDOUBLE) || defined(DOUBLE)
#define SWITCH_RATIO 8
#define GEMM_PREFERED_SIZE 4
#else
#define SWITCH_RATIO 16
#define GEMM_PREFERED_SIZE 8
#endif
#undef BGEMM_ALIGN_K
#undef BGEMM_DEFAULT_UNROLL_M
#undef BGEMM_DEFAULT_UNROLL_N
#define BGEMM_ALIGN_K 8
#define BGEMM_DEFAULT_UNROLL_N 4
#define BGEMM_DEFAULT_UNROLL_M 4
#undef SBGEMM_ALIGN_K
#undef SBGEMM_DEFAULT_UNROLL_M
#undef SBGEMM_DEFAULT_UNROLL_N
#define SBGEMM_ALIGN_K 8
#define SBGEMM_DEFAULT_UNROLL_M 4
#define SBGEMM_DEFAULT_UNROLL_N 4
#define SGEMM_DEFAULT_UNROLL_M 16
#define SGEMM_DEFAULT_UNROLL_N 8
#define DGEMM_DEFAULT_UNROLL_M 4 // Actually 2VL (8) but kept separate to keep copies separate
#define DGEMM_DEFAULT_UNROLL_N 8
#define CGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_MN 16
#define ZGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_MN 16
#define SGEMM_DEFAULT_P 240
#define DGEMM_DEFAULT_P 240
#define CGEMM_DEFAULT_P 128
#define ZGEMM_DEFAULT_P 128
#define SGEMM_DEFAULT_Q 640
#define DGEMM_DEFAULT_Q 320
#define CGEMM_DEFAULT_Q 224
#define ZGEMM_DEFAULT_Q 112
#define SGEMM_DEFAULT_R 4096
#define DGEMM_DEFAULT_R 4096
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#elif defined(NEOVERSEN2)
#if defined(XDOUBLE) || defined(DOUBLE)
#define SWITCH_RATIO 8
#else
#define SWITCH_RATIO 16
#endif
#undef SBGEMM_ALIGN_K
#define SBGEMM_ALIGN_K 4
#undef SBGEMM_DEFAULT_UNROLL_M
#undef SBGEMM_DEFAULT_UNROLL_N
#define SBGEMM_DEFAULT_UNROLL_M 8
#define SBGEMM_DEFAULT_UNROLL_N 4
#define SGEMM_DEFAULT_UNROLL_M 16
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 8
#define CGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_N 4
#define SGEMM_DEFAULT_P 128
#define DGEMM_DEFAULT_P 160
#define CGEMM_DEFAULT_P 128
#define ZGEMM_DEFAULT_P 128
#define SGEMM_DEFAULT_Q 352
#define DGEMM_DEFAULT_Q 128
#define CGEMM_DEFAULT_Q 224
#define ZGEMM_DEFAULT_Q 112
#define SGEMM_DEFAULT_R 4096
#define DGEMM_DEFAULT_R 4096
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#elif defined(AMPERE1)
#if defined(XDOUBLE) || defined(DOUBLE)
#define SWITCH_RATIO 8
#else
#define SWITCH_RATIO 16
#endif
#define SGEMM_DEFAULT_UNROLL_M 16
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 8
#define CGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_N 4
#define SGEMM_DEFAULT_P 240
#define DGEMM_DEFAULT_P 240
#define CGEMM_DEFAULT_P 128
#define ZGEMM_DEFAULT_P 128
#define SGEMM_DEFAULT_Q 640
#define DGEMM_DEFAULT_Q 320
#define CGEMM_DEFAULT_Q 224
#define ZGEMM_DEFAULT_Q 112
#define SGEMM_DEFAULT_R 4096
#define DGEMM_DEFAULT_R 4096
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#elif defined(A64FX) // 512-bit SVE
#define GEMM_DIVIDE_RATE 1
#if defined(XDOUBLE) || defined(DOUBLE)
#define GEMM_PREFERED_SIZE 8
#else
#define GEMM_PREFERED_SIZE 16
#endif
/* When all BLAS3 routines are implemented with SVE, SGEMM_DEFAULT_UNROLL_M should be "sve_vl".
   Until then, just keep it different from DGEMM_DEFAULT_UNROLL_N to keep copy routines in both directions separated. */
#define SGEMM_DEFAULT_UNROLL_M 4
#define SGEMM_DEFAULT_UNROLL_N 8
/* SGEMM_UNROLL_MN is calculated as max(SGEMM_UNROLL_M, SGEMM_UNROLL_N).
 * Since we don't define SGEMM_UNROLL_M correctly, we have to set this macro manually.
 * If the SVE size is ever more than 1024, this should be increased as well. */
#define SGEMM_DEFAULT_UNROLL_MN 32
/* When all BLAS3 routines are implemented with SVE, DGEMM_DEFAULT_UNROLL_M should be "sve_vl".
   Until then, just keep it different from DGEMM_DEFAULT_UNROLL_N to keep copy routines in both directions separated. */
#define DGEMM_DEFAULT_UNROLL_M 2
#define DGEMM_DEFAULT_UNROLL_N 8
#define DGEMM_DEFAULT_UNROLL_MN 32
#define CGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_MN 16
#define ZGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_MN 16
#define SGEMM_DEFAULT_P 128
#define DGEMM_DEFAULT_P 160
#define CGEMM_DEFAULT_P 128
#define ZGEMM_DEFAULT_P 128
#define SGEMM_DEFAULT_Q 352
#define DGEMM_DEFAULT_Q 128
#define CGEMM_DEFAULT_Q 224
#define ZGEMM_DEFAULT_Q 112
#define SGEMM_DEFAULT_R 4096
#define DGEMM_DEFAULT_R 4096
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#elif defined(ARMV8SVE) || defined(ARMV9SME) || defined(ARMV9) || defined(CORTEXA510) || defined(CORTEXA710) || defined(CORTEXX2) // 128-bit SVE
#if defined(XDOUBLE) || defined(DOUBLE)
#define SWITCH_RATIO 8
#else
#define SWITCH_RATIO 16
#endif
#define SGEMM_DEFAULT_UNROLL_M 4 // Actually 1VL (8), but kept different to keep the copy routines separate
#define SGEMM_DEFAULT_UNROLL_N 8
#define DGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_N 8
#define CGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_MN 16
#define ZGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_MN 16
#define SGEMM_DEFAULT_P 128
#define DGEMM_DEFAULT_P 160
#define CGEMM_DEFAULT_P 128
#define ZGEMM_DEFAULT_P 128
#define SGEMM_DEFAULT_Q 352
#define DGEMM_DEFAULT_Q 128
#define CGEMM_DEFAULT_Q 224
#define ZGEMM_DEFAULT_Q 112
#define SGEMM_DEFAULT_R 4096
#define DGEMM_DEFAULT_R 4096
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#else /* Other/undetected ARMv8 cores */
#define SGEMM_DEFAULT_UNROLL_M 16
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 8
#define CGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_N 4
#define SGEMM_DEFAULT_P 128
#define DGEMM_DEFAULT_P 160
#define CGEMM_DEFAULT_P 128
#define ZGEMM_DEFAULT_P 128
#define SGEMM_DEFAULT_Q 352
#define DGEMM_DEFAULT_Q 128
#define CGEMM_DEFAULT_Q 224
#define ZGEMM_DEFAULT_Q 112
#define SGEMM_DEFAULT_R 4096
#define DGEMM_DEFAULT_R 4096
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#endif /* Cores */
#endif /* ARMv8 */
#if defined(ARMV9SME) /* ARMv9 SME */
#define USE_SGEMM_KERNEL_DIRECT 1
#endif /* ARMv9 SME */
#if defined(ARMV5)
#define SNUMOPT 2
#define DNUMOPT 2
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x03fffUL
#define SGEMM_DEFAULT_UNROLL_M 2
#define SGEMM_DEFAULT_UNROLL_N 2
#define DGEMM_DEFAULT_UNROLL_M 2
#define DGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define SGEMM_DEFAULT_P 128
#define DGEMM_DEFAULT_P 128
#define CGEMM_DEFAULT_P 96
#define ZGEMM_DEFAULT_P 64
#define SGEMM_DEFAULT_Q 240
#define DGEMM_DEFAULT_Q 120
#define CGEMM_DEFAULT_Q 120
#define ZGEMM_DEFAULT_Q 120
#define SGEMM_DEFAULT_R 12288
#define DGEMM_DEFAULT_R 8192
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#define SYMV_P 16
#endif
#ifdef CORTEXA9
#define SNUMOPT 2
#define DNUMOPT 2
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x03fffUL
#define SGEMM_DEFAULT_UNROLL_M 4
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define SGEMM_DEFAULT_P 128
#define DGEMM_DEFAULT_P 128
#define CGEMM_DEFAULT_P 96
#define ZGEMM_DEFAULT_P 64
#define SGEMM_DEFAULT_Q 240
#define DGEMM_DEFAULT_Q 120
#define CGEMM_DEFAULT_Q 120
#define ZGEMM_DEFAULT_Q 120
#define SGEMM_DEFAULT_R 12288
#define DGEMM_DEFAULT_R 8192
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#define SYMV_P 16
#endif
#ifdef CORTEXA15
#define SNUMOPT 2
#define DNUMOPT 2
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x03fffUL
#define SGEMM_DEFAULT_UNROLL_M 4
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 4
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define SGEMM_DEFAULT_P 128
#define DGEMM_DEFAULT_P 128
#define CGEMM_DEFAULT_P 96
#define ZGEMM_DEFAULT_P 64
#define SGEMM_DEFAULT_Q 240
#define DGEMM_DEFAULT_Q 120
#define CGEMM_DEFAULT_Q 120
#define ZGEMM_DEFAULT_Q 120
#define SGEMM_DEFAULT_R 12288
#define DGEMM_DEFAULT_R 8192
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#define SYMV_P 16
#endif
#if defined(ZARCH_GENERIC)
#define SNUMOPT 2
#define DNUMOPT 2
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x03fffUL
#define SGEMM_DEFAULT_UNROLL_M 2
#define SGEMM_DEFAULT_UNROLL_N 2
#define DGEMM_DEFAULT_UNROLL_M 2
#define DGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define SGEMM_DEFAULT_P 128
#define DGEMM_DEFAULT_P 128
#define CGEMM_DEFAULT_P 96
#define ZGEMM_DEFAULT_P 64
#define SGEMM_DEFAULT_Q 240
#define DGEMM_DEFAULT_Q 120
#define CGEMM_DEFAULT_Q 120
#define ZGEMM_DEFAULT_Q 120
#define SGEMM_DEFAULT_R 12288
#define DGEMM_DEFAULT_R 8192
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#define SYMV_P 16
#endif
#if defined(Z13)
#define SNUMOPT 2
#define DNUMOPT 2
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x03fffUL
#define SGEMM_DEFAULT_UNROLL_M 8
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 4
#define CGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_N 4
#define SGEMM_DEFAULT_P 456
#define DGEMM_DEFAULT_P 320
#define CGEMM_DEFAULT_P 480
#define ZGEMM_DEFAULT_P 224
#define SGEMM_DEFAULT_Q 488
#define DGEMM_DEFAULT_Q 384
#define CGEMM_DEFAULT_Q 128
#define ZGEMM_DEFAULT_Q 352
#define SGEMM_DEFAULT_R 8192
#define DGEMM_DEFAULT_R 4096
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 2048
#define SYMV_P 16
#endif
#if defined(Z14)
#define SNUMOPT 2
#define DNUMOPT 2
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN 0x03fffUL
#define SGEMM_DEFAULT_UNROLL_M 16
#define SGEMM_DEFAULT_UNROLL_N 4
#define DGEMM_DEFAULT_UNROLL_M 8
#define DGEMM_DEFAULT_UNROLL_N 4
#define CGEMM_DEFAULT_UNROLL_M 4
#define CGEMM_DEFAULT_UNROLL_N 4
#define ZGEMM_DEFAULT_UNROLL_M 4
#define ZGEMM_DEFAULT_UNROLL_N 4
#define SGEMM_DEFAULT_P 480
#define DGEMM_DEFAULT_P 320
#define CGEMM_DEFAULT_P 480
#define ZGEMM_DEFAULT_P 224
#define SGEMM_DEFAULT_Q 512
#define DGEMM_DEFAULT_Q 384
#define CGEMM_DEFAULT_Q 128
#define ZGEMM_DEFAULT_Q 352
#define SGEMM_DEFAULT_R 8192
#define DGEMM_DEFAULT_R 4096
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 2048
#define SYMV_P 16
#endif
#if defined(CSKY) || defined(CK860FV)
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x03fffUL
#define SGEMM_DEFAULT_UNROLL_M 2
#define SGEMM_DEFAULT_UNROLL_N 2
#define DGEMM_DEFAULT_UNROLL_M 2
#define DGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define SGEMM_DEFAULT_P 128
#define DGEMM_DEFAULT_P 128
#define CGEMM_DEFAULT_P 96
#define ZGEMM_DEFAULT_P 64
#define SGEMM_DEFAULT_Q 240
#define DGEMM_DEFAULT_Q 120
#define CGEMM_DEFAULT_Q 120
#define ZGEMM_DEFAULT_Q 120
#define SGEMM_DEFAULT_R 12288
#define DGEMM_DEFAULT_R 8192
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#define SYMV_P 16
#endif
#ifdef GENERIC
#define SNUMOPT 2
#define DNUMOPT 2
#define GEMM_DEFAULT_OFFSET_A 0
#define GEMM_DEFAULT_OFFSET_B 0
#define GEMM_DEFAULT_ALIGN (BLASLONG)0x0ffffUL
#define SGEMM_DEFAULT_UNROLL_N 2
#define DGEMM_DEFAULT_UNROLL_N 2
#define QGEMM_DEFAULT_UNROLL_N 2
#define CGEMM_DEFAULT_UNROLL_N 2
#define ZGEMM_DEFAULT_UNROLL_N 2
#define XGEMM_DEFAULT_UNROLL_N 1
#define CGEMM3M_DEFAULT_UNROLL_N 2
#define ZGEMM3M_DEFAULT_UNROLL_N 2
#ifdef ARCH_X86
#define SGEMM_DEFAULT_UNROLL_M 2
#define DGEMM_DEFAULT_UNROLL_M 2
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_M 2
#define XGEMM_DEFAULT_UNROLL_M 1
#else
#define SGEMM_DEFAULT_UNROLL_M 2
#define DGEMM_DEFAULT_UNROLL_M 2
#define QGEMM_DEFAULT_UNROLL_M 2
#define CGEMM_DEFAULT_UNROLL_M 2
#define ZGEMM_DEFAULT_UNROLL_M 2
#define XGEMM_DEFAULT_UNROLL_M 1
#define CGEMM3M_DEFAULT_UNROLL_M 2
#define ZGEMM3M_DEFAULT_UNROLL_M 2
#define CGEMM3M_DEFAULT_P 448
#define ZGEMM3M_DEFAULT_P 224
#define XGEMM3M_DEFAULT_P 112
#define CGEMM3M_DEFAULT_Q 224
#define ZGEMM3M_DEFAULT_Q 224
#define XGEMM3M_DEFAULT_Q 224
#define CGEMM3M_DEFAULT_R 12288
#define ZGEMM3M_DEFAULT_R 12288
#define XGEMM3M_DEFAULT_R 12288
#endif
#ifdef ARCH_MIPS
#define SGEMM_DEFAULT_P 128
#define DGEMM_DEFAULT_P 128
#define CGEMM_DEFAULT_P 96
#define ZGEMM_DEFAULT_P 64
#define SGEMM_DEFAULT_Q 240
#define DGEMM_DEFAULT_Q 120
#define CGEMM_DEFAULT_Q 120
#define ZGEMM_DEFAULT_Q 120
#define SGEMM_DEFAULT_R 12288
#define DGEMM_DEFAULT_R 8192
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#elif defined(ARCH_LOONGARCH64)
#define SGEMM_DEFAULT_P 128
#define DGEMM_DEFAULT_P 128
#define CGEMM_DEFAULT_P 96
#define ZGEMM_DEFAULT_P 64
#define SGEMM_DEFAULT_Q 240
#define DGEMM_DEFAULT_Q 120
#define CGEMM_DEFAULT_Q 120
#define ZGEMM_DEFAULT_Q 120
#define SGEMM_DEFAULT_R 12288
#define DGEMM_DEFAULT_R 8192
#define CGEMM_DEFAULT_R 4096
#define ZGEMM_DEFAULT_R 4096
#else
#define SGEMM_DEFAULT_P sgemm_p
#define DGEMM_DEFAULT_P dgemm_p
#define QGEMM_DEFAULT_P qgemm_p
#define CGEMM_DEFAULT_P cgemm_p
#define ZGEMM_DEFAULT_P zgemm_p
#define XGEMM_DEFAULT_P xgemm_p
#define SGEMM_DEFAULT_R sgemm_r
#define DGEMM_DEFAULT_R dgemm_r
#define QGEMM_DEFAULT_R qgemm_r
#define CGEMM_DEFAULT_R cgemm_r
#define ZGEMM_DEFAULT_R zgemm_r
#define XGEMM_DEFAULT_R xgemm_r
#define SGEMM_DEFAULT_Q 128
#define DGEMM_DEFAULT_Q 128
#define QGEMM_DEFAULT_Q 128
#define CGEMM_DEFAULT_Q 128
#define ZGEMM_DEFAULT_Q 128
#define XGEMM_DEFAULT_Q 128
#endif
#define SYMV_P 16
#endif
#ifndef SWITCH_RATIO
#define SWITCH_RATIO 2
#endif
#ifndef QGEMM_DEFAULT_UNROLL_M
#define QGEMM_DEFAULT_UNROLL_M 2
#endif
#ifndef QGEMM_DEFAULT_UNROLL_N
#define QGEMM_DEFAULT_UNROLL_N 2
#endif
#ifndef XGEMM_DEFAULT_UNROLL_M
#define XGEMM_DEFAULT_UNROLL_M 2
#endif
#ifndef XGEMM_DEFAULT_UNROLL_N
#define XGEMM_DEFAULT_UNROLL_N 2
#endif
#ifndef HAVE_SSE2
#define SHUFPD_0 shufps $0x44,
#define SHUFPD_1 shufps $0x4e,
#define SHUFPD_2 shufps $0xe4,
#define SHUFPD_3 shufps $0xee,
#endif
#ifndef SHUFPD_0
#define SHUFPD_0 shufpd $0,
#endif
#ifndef SHUFPD_1
#define SHUFPD_1 shufpd $1,
#endif
#ifndef SHUFPD_2
#define SHUFPD_2 shufpd $2,
#endif
#ifndef SHUFPD_3
#define SHUFPD_3 shufpd $3,
#endif
#ifndef SHUFPS_39
#define SHUFPS_39 shufps $0x39,
#endif
#endif