Makefile.arm64 9.5 kB

Simplifying ARMv8 build parameters

ARMv8 builds were a bit mixed up, with ThunderX2 code in ARMv8 mode (which is not right because TX2 is ARMv8.1) as well as requiring a few redundancies in the defines, making it harder to maintain and to understand which core has what. A few other minor issues were also fixed.

Tests were made on the following cores: A53, A57, A72, Falkor, ThunderX, ThunderX2, and XGene.

Tests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester.

A summary:
* Removed TX2 code from the ARMv8 build, to make sure it is compatible with all ARMv8 cores, not just v8.1. Also, the TX2 code has actually harmed performance on big cores.
* Commoned up the ARMv8 architectures' defines in params.h, to make sure that all cores benefit from the ARMv8 settings in addition to their own.
* Added a few more cores, using ARMv8's include strategy, to benefit from compiler optimisations using mtune. Also updated cache information from the manuals, making sure we set good conservative values by default. Removed Vulcan, as it's an alias to TX2.
* Auto-detecting most of those cores, but also updating the forced compilation in getarch.c, to make sure the parameters are the same whether compiled natively or with a forced arch.

Benefits:
* The ARMv8 build is now guaranteed to work on all ARMv8 cores.
* Improved performance for ARMv8 builds on some cores (A72, Falkor, ThunderX1 and 2: up to 11%) over current develop.
* Improved performance for *all* cores compared to the develop branch before TX2's patch (9% to 36%).
* ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than the current develop branch and 8% faster than develop before the TX2 patches.

Issues:
* Regression from the current develop branch for A53 (-12%) and A57 (-3%) with ARMv8 builds, but still faster than before TX2's commit (+15% and +24% respectively). This can be improved with a simplification of TX2's code, to be done in future patches. At least the code is guaranteed to be ARMv8.0 now.

Comments:
* CortexA57 builds are unchanged on A57 hardware from the develop branch, which makes sense, as it's untouched.
* CortexA72 builds improve over A57 on A72 hardware, even though they're using the same includes, due to new compiler tuning in the makefile.
6 years ago
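
The "ARMv8 settings in addition to their own" idea from the commit message can be illustrated with a small stand-alone GNU make sketch. This is not how Makefile.arm64 itself is written (the file spells out the full flag set in every per-core block); the variable `ARCH_FLAGS`, the file name `sketch.mk`, and the `all` target are hypothetical names used only for illustration.

```makefile
# Hypothetical stand-alone sketch, not part of Makefile.arm64.
# Try: make -f sketch.mk CORE=CORTEXA72
CORE ?= ARMV8

# Baseline that every ARMv8.0 core can run.
ARCH_FLAGS := -march=armv8-a

# Cores that only add a tuning hint on top of the baseline.
ifeq ($(CORE), CORTEXA53)
ARCH_FLAGS += -mtune=cortex-a53
endif
ifeq ($(CORE), CORTEXA57)
ARCH_FLAGS += -mtune=cortex-a57
endif
ifeq ($(CORE), CORTEXA72)
ARCH_FLAGS += -mtune=cortex-a72
endif

# Print the selected flags while the makefile is parsed.
$(info CORE=$(CORE) flags: $(ARCH_FLAGS))

# Empty default target so make has something to resolve.
all: ;
```

Because `-mtune` only changes scheduling decisions and not the instruction set, a binary built for one of these cores should still run on any other ARMv8.0 core, which is the compatibility guarantee the commit message is after.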
ifneq ($(C_COMPILER), PGI)
ifeq ($(C_COMPILER), CLANG)
ISCLANG=1
endif
ifeq ($(C_COMPILER), FUJITSU)
ISCLANG=1
endif
ifneq (1, $(filter 1,$(GCCVERSIONGT4) $(ISCLANG)))
CCOMMON_OPT += -march=armv8-a
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a
endif
else
ifeq ($(CORE), ARMV8)
CCOMMON_OPT += -march=armv8-a
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a
endif
endif
ifeq ($(CORE), ARMV8SVE)
CCOMMON_OPT += -march=armv8-a+sve
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a+sve
endif
endif
ifeq ($(CORE), ARMV9SME)
CCOMMON_OPT += -march=armv9-a+sve2+sme
FCOMMON_OPT += -march=armv9-a+sve2
endif
ifeq ($(CORE), CORTEXA53)
CCOMMON_OPT += -march=armv8-a -mtune=cortex-a53
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a53
endif
endif
ifeq ($(CORE), CORTEXA57)
CCOMMON_OPT += -march=armv8-a -mtune=cortex-a57
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a57
endif
endif
ifeq ($(CORE), CORTEXA72)
CCOMMON_OPT += -march=armv8-a -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a72
endif
endif
ifeq ($(CORE), CORTEXA73)
CCOMMON_OPT += -march=armv8-a -mtune=cortex-a73
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a73
endif
endif
ifeq ($(CORE), CORTEXA76)
CCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a76
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a76
endif
endif
ifeq ($(CORE), FT2000)
CCOMMON_OPT += -march=armv8-a -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a72
endif
endif
# Use a72 tunings because Neoverse-N1 is only available
# in GCC>=9
ifeq ($(CORE), NEOVERSEN1)
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ7) $(ISCLANG)))
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ9) $(ISCLANG)))
CCOMMON_OPT += -march=armv8.2-a -mtune=neoverse-n1
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=neoverse-n1
endif
else
CCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a72
endif
endif
else
CCOMMON_OPT += -march=armv8-a -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a72
endif
endif
endif
# Use a72 tunings because Neoverse-V1 is only available
# in GCC>=10.4
ifeq ($(CORE), NEOVERSEV1)
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ7) $(ISCLANG)))
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ10) $(ISCLANG)))
ifeq (1, $(filter 1,$(GCCMINORVERSIONGTEQ4) $(GCCVERSIONGTEQ11) $(ISCLANG)))
CCOMMON_OPT += -march=armv8.4-a+sve+bf16
ifeq (1, $(ISCLANG))
CCOMMON_OPT += -mtune=cortex-x1
else
CCOMMON_OPT += -mtune=neoverse-v1
endif
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.4-a -mtune=neoverse-v1
endif
else
CCOMMON_OPT += -march=armv8.4-a+sve+bf16
ifneq ($(CROSS), 1)
CCOMMON_OPT += -mtune=native
endif
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.4-a
ifneq ($(CROSS), 1)
FCOMMON_OPT += -mtune=native
endif
endif
endif
else
CCOMMON_OPT += -march=armv8.2-a+sve -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a72
endif
endif
else
CCOMMON_OPT += -march=armv8-a+sve -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a72
endif
endif
endif
# Use a72 tunings because Neoverse-N2 is only available
# in GCC>=10.4
ifeq ($(CORE), NEOVERSEN2)
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ7) $(ISCLANG)))
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ10) $(ISCLANG)))
ifeq (1, $(filter 1,$(GCCMINORVERSIONGTEQ4) $(GCCVERSIONGTEQ11) $(ISCLANG)))
ifneq ($(OSNAME), Darwin)
CCOMMON_OPT += -march=armv8.5-a+sve+sve2+bf16 -mtune=neoverse-n2
else
CCOMMON_OPT += -march=armv8.2-a+sve+bf16 -mtune=cortex-a72
endif
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.5-a+sve+sve2+bf16 -mtune=neoverse-n2
endif
else
CCOMMON_OPT += -march=armv8.5-a+sve+bf16
ifneq ($(CROSS), 1)
CCOMMON_OPT += -mtune=native
endif
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.5-a
ifneq ($(CROSS), 1)
FCOMMON_OPT += -mtune=native
endif
endif
endif
else
CCOMMON_OPT += -march=armv8.2-a+sve+bf16 -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a72
endif
endif
else
CCOMMON_OPT += -march=armv8-a+sve+bf16 -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a72
endif
endif
endif
# Detect ARM Neoverse V2.
ifeq ($(CORE), NEOVERSEV2)
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ12) $(ISCLANG)))
CCOMMON_OPT += -march=armv9-a -mtune=neoverse-v2
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv9-a -mtune=neoverse-v2
endif
endif
endif
# Detect Ampere AmpereOne(ampere1,ampere1a) processors.
ifeq ($(CORE), AMPERE1)
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ12) $(ISCLANG)))
CCOMMON_OPT += -march=armv8.6-a+crypto+crc+fp16+sha3+rng
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.6-a+crypto+crc+fp16+sha3+rng
endif
endif
endif
# Use a53 tunings because a55 is only available in GCC>=8.1
ifeq ($(CORE), CORTEXA55)
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ7) $(ISCLANG)))
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ8) $(ISCLANG)))
CCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a55
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a55
endif
else
CCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a53
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a53
endif
endif
else
CCOMMON_OPT += -march=armv8-a -mtune=cortex-a53
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a53
endif
endif
endif
ifeq ($(CORE), THUNDERX)
CCOMMON_OPT += -march=armv8-a -mtune=thunderx
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=thunderx
endif
endif
ifeq ($(CORE), FALKOR)
CCOMMON_OPT += -march=armv8-a -mtune=falkor
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=falkor
endif
endif
ifeq ($(CORE), THUNDERX2T99)
CCOMMON_OPT += -march=armv8.1-a -mtune=thunderx2t99
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.1-a -mtune=thunderx2t99
endif
endif
ifeq ($(CORE), THUNDERX3T110)
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ10) $(ISCLANG)))
CCOMMON_OPT += -march=armv8.3-a
ifeq (0, $(ISCLANG))
CCOMMON_OPT += -mtune=thunderx3t110
else
CCOMMON_OPT += -mtune=thunderx2t99
endif
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.3-a -mtune=thunderx3t110
endif
else
CCOMMON_OPT += -march=armv8.1-a -mtune=thunderx2t99
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.1-a -mtune=thunderx2t99
endif
endif
endif
ifeq ($(CORE), VORTEX)
CCOMMON_OPT += -march=armv8.3-a
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.3-a
endif
endif
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ9) $(ISCLANG)))
ifeq ($(CORE), TSV110)
CCOMMON_OPT += -march=armv8.2-a -mtune=tsv110
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=tsv110
endif
endif
endif
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ9) $(ISCLANG)))
ifeq ($(CORE), EMAG8180)
CCOMMON_OPT += -march=armv8-a
ifeq ($(ISCLANG), 0)
CCOMMON_OPT += -mtune=emag
endif
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=emag
endif
endif
endif
ifeq ($(CORE), A64FX)
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ10) $(ISCLANG)))
ifeq (1, $(filter 1,$(GCCMINORVERSIONGTEQ3) $(GCCVERSIONGTEQ11) $(ISCLANG)))
CCOMMON_OPT += -march=armv8.2-a+sve -mtune=a64fx
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a+sve -mtune=a64fx
endif
else
CCOMMON_OPT += -march=armv8.4-a+sve -mtune=neoverse-n1
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.4-a -mtune=neoverse-n1
endif
endif
endif
endif
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ11) $(ISCLANG)))
ifeq ($(CORE), CORTEXX1)
CCOMMON_OPT += -march=armv8.2-a
ifeq (1, $(filter 1,$(GCCMINORVERSIONGTEQ4) $(GCCVERSIONGTEQ12) $(ISCLANG)))
CCOMMON_OPT += -mtune=cortex-x1
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=cortex-x1
endif
else
CCOMMON_OPT += -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a72
endif
endif
endif
endif
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ11) $(ISCLANG)))
ifeq ($(CORE), CORTEXX2)
CCOMMON_OPT += -march=armv8.4-a+sve
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.4-a+sve
endif
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ12) $(ISCLANG)))
CCOMMON_OPT += -mtune=cortex-x2
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -mtune=cortex-x2
endif
endif
endif
endif
#ifeq (1, $(filter 1,$(ISCLANG)))
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ11) $(ISCLANG)))
ifeq ($(CORE), CORTEXA510)
CCOMMON_OPT += -march=armv8.4-a+sve
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.4-a+sve
endif
endif
endif
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ11) $(ISCLANG)))
ifeq ($(CORE), CORTEXA710)
CCOMMON_OPT += -march=armv8.4-a+sve
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.4-a+sve
endif
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ12) $(ISCLANG)))
CCOMMON_OPT += -mtune=cortex-a710
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -mtune=cortex-a710
endif
endif
endif
endif
endif
else
# NVIDIA HPC options necessary to enable SVE in the compiler
ifeq ($(CORE), THUNDERX2T99)
CCOMMON_OPT += -tp=thunderx2t99
FCOMMON_OPT += -tp=thunderx2t99
endif
ifeq ($(CORE), NEOVERSEN1)
CCOMMON_OPT += -tp=neoverse-n1
FCOMMON_OPT += -tp=neoverse-n1
endif
ifeq ($(CORE), NEOVERSEV1)
CCOMMON_OPT += -tp=neoverse-v1
FCOMMON_OPT += -tp=neoverse-v1
endif
ifeq ($(CORE), NEOVERSEV2)
CCOMMON_OPT += -tp=neoverse-v2
FCOMMON_OPT += -tp=neoverse-v2
endif
ifeq ($(CORE), ARMV8SVE)
CCOMMON_OPT += -tp=neoverse-v2
FCOMMON_OPT += -tp=neoverse-v2
endif
ifeq ($(CORE), ARMV9SVE)
CCOMMON_OPT += -tp=neoverse-v2
FCOMMON_OPT += -tp=neoverse-v2
endif
endif
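
Most per-core blocks in the listing share one shape: a compiler-version gate built from `$(filter 1,...)`, with a fallback to an older `-mtune` value when the compiler is too old to know the core. The standalone sketch below isolates that pattern for the Neoverse-N1 case so it can be tried outside the OpenBLAS build. It assumes GNU make. `DEMO_OPT` and the hand-set defaults are hypothetical stand-ins; in the real build the version flags are computed elsewhere and the result goes into `CCOMMON_OPT`/`FCOMMON_OPT`.

```makefile
# sketch.mk: illustrative only, not part of Makefile.arm64.
# The defaults below are assumptions chosen for the demo; override them on
# the command line, e.g.  make -f sketch.mk GCCVERSIONGTEQ9=1
CORE ?= NEOVERSEN1
GCCVERSIONGTEQ7 ?= 1
GCCVERSIONGTEQ9 ?= 0
ISCLANG ?=

ifeq ($(CORE), NEOVERSEN1)
# $(filter 1,...) keeps only the words equal to "1", so this test passes
# when one of the two flags is 1 and the other is empty or 0.
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ7) $(ISCLANG)))
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ9) $(ISCLANG)))
# Compiler knows the core: use its own tuning.
DEMO_OPT += -march=armv8.2-a -mtune=neoverse-n1
else
# Compiler accepts armv8.2-a but not neoverse-n1: fall back to a72 tuning.
DEMO_OPT += -march=armv8.2-a -mtune=cortex-a72
endif
else
# Oldest compilers: baseline architecture and a72 tuning.
DEMO_OPT += -march=armv8-a -mtune=cortex-a72
endif
endif

# Print the selected flags while the makefile is parsed.
$(info flags: $(DEMO_OPT))

# Empty default target so that make has something to "build".
all: ;
```

With the defaults above, `make -f sketch.mk` takes the middle branch and prints `flags: -march=armv8.2-a -mtune=cortex-a72`. Passing `GCCVERSIONGTEQ9=1` selects the `neoverse-n1` tuning instead.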