Makefile.arm64 7.8 kB

Simplifying ARMv8 build parameters

ARMv8 builds were a bit mixed up, with ThunderX2 code in ARMv8 mode (which is not right because TX2 is ARMv8.1) as well as requiring a few redundancies in the defines, making it harder to maintain and understand what core has what. A few other minor issues were also fixed.

Tests were made on the following cores: A53, A57, A72, Falkor, ThunderX, ThunderX2, and XGene.
Tests were: OpenBLAS/test, OpenBLAS/benchmark, BLAS-Tester.

A summary:
 * Removed TX2 code from the ARMv8 build, to make sure it is compatible with all ARMv8 cores, not just v8.1. Also, the TX2 code has actually harmed performance on big cores.
 * Consolidated the ARMv8 architectures' defines in params.h, to make sure that all cores benefit from the ARMv8 settings in addition to their own.
 * Added a few more cores, using ARMv8's include strategy, to benefit from compiler optimisations using mtune. Also updated cache information from the manuals, making sure we set good conservative values by default. Removed Vulcan, as it's an alias to TX2.
 * Auto-detecting most of those cores, but also updating the forced compilation in getarch.c, to make sure the parameters are the same whether compiled natively or with a forced arch.

Benefits:
 * The ARMv8 build is now guaranteed to work on all ARMv8 cores.
 * Improved performance for ARMv8 builds on some cores (A72, Falkor, ThunderX1 and 2: up to 11%) over current develop.
 * Improved performance for *all* cores compared to the develop branch before TX2's patch (9% to 36%).
 * ThunderX1 builds are 14% faster than ARMv8 on TX1, 9% faster than current develop's branch and 8% faster than develop before the TX2 patches.

Issues:
 * Regression from the current develop branch for A53 (-12%) and A57 (-3%) with ARMv8 builds, but still faster than before TX2's commit (+15% and +24% respectively). This can be improved with a simplification of TX2's code, to be done in future patches. At least the code is guaranteed to be ARMv8.0 now.

Comments:
 * CortexA57 builds are unchanged on A57 hardware from develop's branch, which makes sense, as it's untouched.
 * CortexA72 builds improve over A57 on A72 hardware, even though they use the same includes, due to the new compiler tuning in the makefile.
6 years ago
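
The per-core `-mtune` selection described in the commit message, and the `$(filter 1,...)` test this makefile uses to gate it on compiler support, can be tried in isolation. The following is only a minimal sketch under stated assumptions: `SKETCH_OPT` and the default values are hypothetical names chosen for illustration, and the real build sets `CORE`, `C_COMPILER` and `GCCVERSIONGT4` elsewhere.

```make
# Hypothetical standalone sketch, not part of the OpenBLAS build.
# Try it with: make -f sketch.mk CORE=CORTEXA72
# The defaults below are assumptions for illustration only.
CORE ?= ARMV8
C_COMPILER ?= GCC
GCCVERSIONGT4 ?= 1

ifeq ($(C_COMPILER), CLANG)
ISCLANG = 1
endif

# $(filter 1,A B) keeps every word equal to 1, so the comparison succeeds
# when exactly one of the two variables is 1. If both were 1, the expansion
# would be "1 1" and the comparison with "1" would fail.
ifeq (1, $(filter 1,$(GCCVERSIONGT4) $(ISCLANG)))
ifeq ($(CORE), CORTEXA72)
# Same ISA baseline as the generic build, plus scheduling hints for this core.
SKETCH_OPT += -march=armv8-a -mtune=cortex-a72
else
SKETCH_OPT += -march=armv8-a
endif
else
# Older compilers: fall back to the generic ARMv8 baseline.
SKETCH_OPT += -march=armv8-a
endif

# Print the selected flags while the makefile is parsed.
$(info CORE=$(CORE) flags: $(SKETCH_OPT))

# Empty default target so that make has something to build.
all: ;
```

Because `-mtune` only changes scheduling and not the instruction set, a binary built this way should still run on any ARMv8.0 core, which matches the compatibility goal stated in the commit message.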
ifneq ($(C_COMPILER), PGI)
ifeq ($(C_COMPILER), CLANG)
ISCLANG=1
endif
ifeq ($(C_COMPILER), FUJITSU)
ISCLANG=1
endif
ifneq (1, $(filter 1,$(GCCVERSIONGT4) $(ISCLANG)))
CCOMMON_OPT += -march=armv8-a
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a
endif
else
ifeq ($(CORE), ARMV8)
CCOMMON_OPT += -march=armv8-a
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a
endif
endif
ifeq ($(CORE), ARMV8SVE)
CCOMMON_OPT += -march=armv8-a+sve
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a+sve
endif
endif
ifeq ($(CORE), CORTEXA53)
CCOMMON_OPT += -march=armv8-a -mtune=cortex-a53
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a53
endif
endif
ifeq ($(CORE), CORTEXA57)
CCOMMON_OPT += -march=armv8-a -mtune=cortex-a57
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a57
endif
endif
ifeq ($(CORE), CORTEXA72)
CCOMMON_OPT += -march=armv8-a -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a72
endif
endif
ifeq ($(CORE), CORTEXA73)
CCOMMON_OPT += -march=armv8-a -mtune=cortex-a73
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a73
endif
endif
ifeq ($(CORE), FT2000)
CCOMMON_OPT += -march=armv8-a -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a72
endif
endif
# Use a72 tunings because Neoverse-N1 is only available
# in GCC>=9
ifeq ($(CORE), NEOVERSEN1)
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ7) $(ISCLANG)))
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ9) $(ISCLANG)))
CCOMMON_OPT += -march=armv8.2-a -mtune=neoverse-n1
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=neoverse-n1
endif
else
CCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a72
endif
endif
else
CCOMMON_OPT += -march=armv8-a -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a72
endif
endif
endif
# Use a72 tunings because Neoverse-V1 is only available
# in GCC>=10.4
ifeq ($(CORE), NEOVERSEV1)
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ7) $(ISCLANG)))
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ10) $(ISCLANG)))
ifeq (1, $(filter 1,$(GCCMINORVERSIONGTEQ4) $(GCCVERSIONGTEQ11) $(ISCLANG)))
CCOMMON_OPT += -march=armv8.4-a+sve
ifeq (1, $(ISCLANG))
CCOMMON_OPT += -mtune=cortex-x1
else
CCOMMON_OPT += -mtune=neoverse-v1
endif
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.4-a -mtune=neoverse-v1
endif
else
CCOMMON_OPT += -march=armv8.4-a+sve
ifneq ($(CROSS), 1)
CCOMMON_OPT += -mtune=native
endif
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.4-a
ifneq ($(CROSS), 1)
FCOMMON_OPT += -mtune=native
endif
endif
endif
else
CCOMMON_OPT += -march=armv8.2-a+sve -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a72
endif
endif
else
CCOMMON_OPT += -march=armv8-a+sve -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a72
endif
endif
endif
# Use a72 tunings because Neoverse-N2 is only available
# in GCC>=10.4
ifeq ($(CORE), NEOVERSEN2)
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ7) $(ISCLANG)))
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ10) $(ISCLANG)))
ifeq (1, $(filter 1,$(GCCMINORVERSIONGTEQ4) $(GCCVERSIONGTEQ11) $(ISCLANG)))
ifneq ($(OSNAME), Darwin)
CCOMMON_OPT += -march=armv8.5-a+sve+sve2+bf16 -mtune=neoverse-n2
else
CCOMMON_OPT += -march=armv8.2-a+sve -mtune=cortex-a72
endif
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.5-a+sve+sve2+bf16 -mtune=neoverse-n2
endif
else
CCOMMON_OPT += -march=armv8.5-a+sve
ifneq ($(CROSS), 1)
CCOMMON_OPT += -mtune=native
endif
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.5-a
ifneq ($(CROSS), 1)
FCOMMON_OPT += -mtune=native
endif
endif
endif
else
CCOMMON_OPT += -march=armv8.2-a+sve -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a72
endif
endif
else
CCOMMON_OPT += -march=armv8-a+sve -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a72
endif
endif
endif
# Use a53 tunings because a55 is only available in GCC>=8.1
ifeq ($(CORE), CORTEXA55)
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ7) $(ISCLANG)))
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ8) $(ISCLANG)))
CCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a55
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a55
endif
else
CCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a53
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a53
endif
endif
else
CCOMMON_OPT += -march=armv8-a -mtune=cortex-a53
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=cortex-a53
endif
endif
endif
ifeq ($(CORE), THUNDERX)
CCOMMON_OPT += -march=armv8-a -mtune=thunderx
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=thunderx
endif
endif
ifeq ($(CORE), FALKOR)
CCOMMON_OPT += -march=armv8-a -mtune=falkor
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=falkor
endif
endif
ifeq ($(CORE), THUNDERX2T99)
CCOMMON_OPT += -march=armv8.1-a -mtune=thunderx2t99
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.1-a -mtune=thunderx2t99
endif
endif
ifeq ($(CORE), THUNDERX3T110)
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ10) $(ISCLANG)))
CCOMMON_OPT += -march=armv8.3-a
ifeq (0, $(ISCLANG))
CCOMMON_OPT += -mtune=thunderx3t110
else
CCOMMON_OPT += -mtune=thunderx2t99
endif
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.3-a -mtune=thunderx3t110
endif
else
CCOMMON_OPT += -march=armv8.1-a -mtune=thunderx2t99
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.1-a -mtune=thunderx2t99
endif
endif
endif
ifeq ($(CORE), VORTEX)
CCOMMON_OPT += -march=armv8.3-a
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.3-a
endif
endif
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ9) $(ISCLANG)))
ifeq ($(CORE), TSV110)
CCOMMON_OPT += -march=armv8.2-a -mtune=tsv110
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=tsv110
endif
endif
endif
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ9) $(ISCLANG)))
ifeq ($(CORE), EMAG8180)
CCOMMON_OPT += -march=armv8-a
ifeq ($(ISCLANG), 0)
CCOMMON_OPT += -mtune=emag
endif
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8-a -mtune=emag
endif
endif
endif
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ11) $(ISCLANG)))
ifeq ($(CORE), A64FX)
CCOMMON_OPT += -march=armv8.2-a+sve -mtune=a64fx
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a+sve -mtune=a64fx
endif
endif
endif
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ11) $(ISCLANG)))
ifeq ($(CORE), CORTEXX1)
CCOMMON_OPT += -march=armv8.2-a
ifeq (1, $(filter 1,$(GCCMINORVERSIONGTEQ4) $(GCCVERSIONGTEQ12) $(ISCLANG)))
CCOMMON_OPT += -mtune=cortex-x1
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=cortex-x1
endif
else
CCOMMON_OPT += -mtune=cortex-a72
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.2-a -mtune=cortex-a72
endif
endif
endif
endif
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ11) $(ISCLANG)))
ifeq ($(CORE), CORTEXX2)
CCOMMON_OPT += -march=armv8.4-a+sve
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.4-a+sve
endif
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ12) $(ISCLANG)))
CCOMMON_OPT += -mtune=cortex-x2
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -mtune=cortex-x2
endif
endif
endif
endif
#ifeq (1, $(filter 1,$(ISCLANG)))
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ11) $(ISCLANG)))
ifeq ($(CORE), CORTEXA510)
CCOMMON_OPT += -march=armv8.4-a+sve
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.4-a+sve
endif
endif
endif
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ11) $(ISCLANG)))
ifeq ($(CORE), CORTEXA710)
CCOMMON_OPT += -march=armv8.4-a+sve
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -march=armv8.4-a+sve
endif
ifeq (1, $(filter 1,$(GCCVERSIONGTEQ12) $(ISCLANG)))
CCOMMON_OPT += -mtune=cortex-a710
ifneq ($(F_COMPILER), NAG)
FCOMMON_OPT += -mtune=cortex-a710
endif
endif
endif
endif
endif
endif