Makefile.arm64 8.5 kB

Simplifying ARMv8 build parameters

ARMv8 builds were a bit mixed up: ThunderX2 code was used in ARMv8 mode (which is not right, because TX2 is ARMv8.1), and the defines carried a few redundancies, making it harder to maintain the code and to understand which core has what. A few other minor issues were also fixed.

Tests were made on the following cores: A53, A57, A72, Falkor, ThunderX, ThunderX2, and XGene.

Tests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester.

A summary:
* Removed TX2 code from the ARMv8 build, to make sure it is compatible with all ARMv8 cores, not just v8.1. Also, the TX2 code has actually harmed performance on big cores.
* Unified the ARMv8 architectures' defines in params.h, to make sure that all cores benefit from the ARMv8 settings in addition to their own.
* Added a few more cores, using ARMv8's include strategy, to benefit from compiler optimisations using mtune. Also updated cache information from the manuals, making sure we set good conservative values by default. Removed Vulcan, as it's an alias for TX2.
* Auto-detecting most of those cores, and also updated the forced compilation in getarch.c, to make sure the parameters are the same whether compiled natively or with a forced arch.

Benefits:
* The ARMv8 build is now guaranteed to work on all ARMv8 cores.
* Improved performance for ARMv8 builds on some cores (A72, Falkor, ThunderX1 and 2: up to 11%) over current develop.
* Improved performance for *all* cores compared to the develop branch before TX2's patch (9% to 36%).
* ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than the current develop branch and 8% faster than develop before the TX2 patches.

Issues:
* Regression from the current develop branch for A53 (-12%) and A57 (-3%) with ARMv8 builds, but still faster than before TX2's commit (+15% and +24% respectively). This can be improved with a simplification of TX2's code, to be done in future patches. At least the code is guaranteed to be ARMv8.0 now.

Comments:
* CortexA57 builds are unchanged on A57 hardware from the develop branch, which makes sense, as that code is untouched.
* CortexA72 builds improve over A57 on A72 hardware, even though they use the same includes, due to the new compiler tuning in the makefile.
6 years ago
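The Makefile fragment in this file guards its per-core flags with a test of the form `$(filter 1,$(GCCVERSIONGT4) $(ISCLANG))`, which acts as a logical OR over two 0/1 switches. The stand-alone sketch below illustrates that idiom only. It is not part of OpenBLAS: the default values and the `MODERN` variable are hypothetical, chosen to make the behaviour visible.

```makefile
# Hypothetical stand-alone demo of the "filter as OR" idiom.
# The variable names mirror the fragment; the default values are made up.
GCCVERSIONGT4 ?= 0
ISCLANG ?= 1

# $(filter 1,...) keeps only the words equal to 1, so the result is "1"
# when exactly one switch is set and empty when neither is set.
ifeq (1, $(filter 1,$(GCCVERSIONGT4) $(ISCLANG)))
MODERN := yes
else
MODERN := no
endif

# Printed while the makefile is parsed, so no recipe is needed.
$(info modern compiler: $(MODERN))

all: ;
```

Saved under any file name and run with `make -f <file>`, this reports `yes` with the defaults and `no` when both switches are overridden to 0 on the command line. In this sketch, setting both switches to 1 makes `filter` return `1 1`, which no longer compares equal to `1`, so the idiom relies on at most one switch being set at a time.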
ifneq ($(C_COMPILER), PGI)
ifeq ($(C_COMPILER), CLANG)
ISCLANG=1
endif
ifeq ($(C_COMPILER), FUJITSU)
ISCLANG=1
endif
ifneq (1, $(filter 1,$(GCCVERSIONGT4) $(ISCLANG)))
CCOMMON_OPT += -march=armv8-a
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a
endif
else
ifeq ($(CORE), ARMV8)
CCOMMON_OPT += -march=armv8-a
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a
endif
endif
ifeq ($(CORE), ARMV8SVE)
CCOMMON_OPT += -march=armv8-a+sve
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a+sve
endif
endif
ifeq ($(CORE), CORTEXA53)
CCOMMON_OPT += -march=armv8-a -mtune=cortex-a53
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a53
endif
endif
ifeq ($(CORE), CORTEXA57)
CCOMMON_OPT += -march=armv8-a -mtune=cortex-a57
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a57
endif
endif
ifeq ($(CORE), CORTEXA72)
CCOMMON_OPT += -march=armv8-a -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a72
endif
endif
ifeq ($(CORE), CORTEXA73)
CCOMMON_OPT += -march=armv8-a -mtune=cortex-a73
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a73
endif
endif
ifeq ($(CORE), CORTEXA76)
CCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a76
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a76
endif
endif
ifeq ($(CORE), FT2000)
CCOMMON_OPT += -march=armv8-a -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a72
endif
endif
# Use a72 tunings because Neoverse-N1 is only available
# in GCC>=9
ifeq ($(CORE), NEOVERSEN1)
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ7) $(ISCLANG)))
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ9) $(ISCLANG)))
CCOMMON_OPT += -march=armv8.2-a -mtune=neoverse-n1
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=neoverse-n1
endif
else
CCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a72
endif
endif
else
CCOMMON_OPT += -march=armv8-a -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a72
endif
endif
endif
# Use a72 tunings because Neoverse-V1 is only available
# in GCC>=10.4
ifeq ($(CORE), NEOVERSEV1)
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ7) $(ISCLANG)))
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ10) $(ISCLANG)))
ifeq (1, $(filter 1,$(GCCMINORVERSIONGTEQ4) $(GCCVERSIONGTEQ11) $(ISCLANG)))
CCOMMON_OPT += -march=armv8.4-a+sve
ifeq (1, $(ISCLANG))
CCOMMON_OPT += -mtune=cortex-x1
else
CCOMMON_OPT += -mtune=neoverse-v1
endif
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.4-a -mtune=neoverse-v1
endif
else
CCOMMON_OPT += -march=armv8.4-a+sve
ifneq ($(CROSS), 1)
CCOMMON_OPT += -mtune=native
endif
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.4-a
ifneq ($(CROSS), 1)
FCOMMON_OPT += -mtune=native
endif
endif
endif
else
CCOMMON_OPT += -march=armv8.2-a+sve -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a72
endif
endif
else
CCOMMON_OPT += -march=armv8-a+sve -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a72
endif
endif
endif
# Use a72 tunings because Neoverse-N2 is only available
# in GCC>=10.4
ifeq ($(CORE), NEOVERSEN2)
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ7) $(ISCLANG)))
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ10) $(ISCLANG)))
ifeq (1, $(filter 1,$(GCCMINORVERSIONGTEQ4) $(GCCVERSIONGTEQ11) $(ISCLANG)))
ifneq ($(OSNAME), Darwin)
CCOMMON_OPT += -march=armv8.5-a+sve+sve2+bf16 -mtune=neoverse-n2
else
CCOMMON_OPT += -march=armv8.2-a+sve+bf16 -mtune=cortex-a72
endif
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.5-a+sve+sve2+bf16 -mtune=neoverse-n2
endif
else
CCOMMON_OPT += -march=armv8.5-a+sve+bf16
ifneq ($(CROSS), 1)
CCOMMON_OPT += -mtune=native
endif
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.5-a
ifneq ($(CROSS), 1)
FCOMMON_OPT += -mtune=native
endif
endif
endif
else
CCOMMON_OPT += -march=armv8.2-a+sve+bf16 -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a72
endif
endif
else
CCOMMON_OPT += -march=armv8-a+sve+bf16 -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a72
endif
endif
endif
# Detect ARM Neoverse V2.
ifeq ($(CORE), NEOVERSEV2)
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ12) $(ISCLANG)))
CCOMMON_OPT += -march=armv9-a -mtune=neoverse-v2
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv9-a -mtune=neoverse-v2
endif
endif
endif
# Use a53 tunings because a55 is only available in GCC>=8.1
ifeq ($(CORE), CORTEXA55)
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ7) $(ISCLANG)))
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ8) $(ISCLANG)))
CCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a55
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a55
endif
else
CCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a53
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a53
endif
endif
else
CCOMMON_OPT += -march=armv8-a -mtune=cortex-a53
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a53
endif
endif
endif
ifeq ($(CORE), THUNDERX)
CCOMMON_OPT += -march=armv8-a -mtune=thunderx
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=thunderx
endif
endif
ifeq ($(CORE), FALKOR)
CCOMMON_OPT += -march=armv8-a -mtune=falkor
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=falkor
endif
endif
ifeq ($(CORE), THUNDERX2T99)
CCOMMON_OPT += -march=armv8.1-a -mtune=thunderx2t99
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.1-a -mtune=thunderx2t99
endif
endif
ifeq ($(CORE), THUNDERX3T110)
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ10) $(ISCLANG)))
CCOMMON_OPT += -march=armv8.3-a
ifeq (0, $(ISCLANG))
CCOMMON_OPT += -mtune=thunderx3t110
else
CCOMMON_OPT += -mtune=thunderx2t99
endif
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.3-a -mtune=thunderx3t110
endif
else
CCOMMON_OPT += -march=armv8.1-a -mtune=thunderx2t99
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.1-a -mtune=thunderx2t99
endif
endif
endif
ifeq ($(CORE), VORTEX)
CCOMMON_OPT += -march=armv8.3-a
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.3-a
endif
endif
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ9) $(ISCLANG)))
ifeq ($(CORE), TSV110)
CCOMMON_OPT += -march=armv8.2-a -mtune=tsv110
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=tsv110
endif
endif
endif
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ9) $(ISCLANG)))
ifeq ($(CORE), EMAG8180)
CCOMMON_OPT += -march=armv8-a
ifeq ($(ISCLANG), 0)
CCOMMON_OPT += -mtune=emag
endif
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=emag
endif
endif
endif
ifeq ($(CORE), A64FX)
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ10) $(ISCLANG)))
ifeq (1, $(filter 1,$(GCCMINORVERSIONGTEQ3) $(GCCVERSIONGTEQ11) $(ISCLANG)))
CCOMMON_OPT += -march=armv8.2-a+sve -mtune=a64fx
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a+sve -mtune=a64fx
endif
else
CCOMMON_OPT += -march=armv8.4-a+sve -mtune=neoverse-n1
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.4-a -mtune=neoverse-n1
endif
endif
endif
endif
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ11) $(ISCLANG)))
ifeq ($(CORE), CORTEXX1)
CCOMMON_OPT += -march=armv8.2-a
ifeq (1, $(filter 1,$(GCCMINORVERSIONGTEQ4) $(GCCVERSIONGTEQ12) $(ISCLANG)))
CCOMMON_OPT += -mtune=cortex-x1
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=cortex-x1
endif
else
CCOMMON_OPT += -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a72
endif
endif
endif
endif
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ11) $(ISCLANG)))
ifeq ($(CORE), CORTEXX2)
CCOMMON_OPT += -march=armv8.4-a+sve
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.4-a+sve
endif
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ12) $(ISCLANG)))
CCOMMON_OPT += -mtune=cortex-x2
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -mtune=cortex-x2
endif
endif
endif
endif
#ifeq (1, $(filter 1,$(ISCLANG)))
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ11) $(ISCLANG)))
ifeq ($(CORE), CORTEXA510)
CCOMMON_OPT += -march=armv8.4-a+sve
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.4-a+sve
endif
endif
endif
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ11) $(ISCLANG)))
ifeq ($(CORE), CORTEXA710)
CCOMMON_OPT += -march=armv8.4-a+sve
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.4-a+sve
endif
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ12) $(ISCLANG)))
CCOMMON_OPT += -mtune=cortex-a710
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -mtune=cortex-a710
endif
endif
endif
endif
endif
endif
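The compiler gate that recurs throughout this file, `ifeq (1, $(filter 1,$(FLAG_A) $(FLAG_B)))`, works as an "either flag is set" test: `$(filter 1,...)` keeps only the words equal to `1`, and the result is then compared with `1`. The stand-alone sketch below isolates that idiom. The file name `gate.mk` and the placeholder values are assumptions for illustration; in the real build the version flags are assumed to be computed elsewhere.

```makefile
# gate.mk: stand-alone illustration (hypothetical file, not part of OpenBLAS).
# Placeholder values; override them on the command line to experiment.
GCCVERSIONGTEQ9 ?= 0
ISCLANG ?=

# Form used in the file: true when the filtered list is exactly "1".
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ9) $(ISCLANG)))
TUNE := neoverse-n1
else
TUNE := cortex-a72
endif

# Variant: true when at least one flag is 1, even if several are.
ifneq (,$(filter 1,$(GCCVERSIONGTEQ9) $(ISCLANG)))
TUNE_ANY := neoverse-n1
else
TUNE_ANY := cortex-a72
endif

# Print both results while the makefile is parsed.
$(info file form: -mtune=$(TUNE))
$(info any-flag form: -mtune=$(TUNE_ANY))

all: ;
```

Try `make -f gate.mk GCCVERSIONGTEQ9=1`, then `make -f gate.mk GCCVERSIONGTEQ9=1 ISCLANG=1`. Because `$(filter ...)` returns every matching word, two set flags give `1 1`, which does not compare equal to `1`. The two forms can therefore disagree in that case, which may be worth checking for gates that combine a major-version flag with a minor-version flag.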