This paper had the side effect of comparing NEON and SVE auto-vectorization on the Neoverse V1 and V2.
V1 has 256-bit SVE, which groups two of the four 128-bit execution units together to execute the 256-bit instructions.
V2 has 128-bit SVE.
On V2 there was no speedup from SVE, which suggests the compiler either made no use of, or gained nothing from, the SVE instructions that have no NEON equivalent.
On V1, SVE was ~15% faster on average, even though SVE and NEON share the same execution resources on that core, the only difference being the vector length.
Abstract:
> Interleaving/Unrolling and Vectorization are two popular means to optimize applications. While the first one creates multiple copies of the loop body content, the second one focuses on operating on multiple data elements in parallel thanks to SIMD units available in the CPU. In theory, interleaving and vectorization are orthogonal optimizations, one relying on instruction-level parallelism/superscalarity, and the other on data-level parallelism within a single instruction. Modern CPU architectures provide both of these parallelism mechanisms at once, and the combination of vectorization and interleaving is complex, influencing each other due to instruction selection and complexity of underlying hardware, and the programmer often has to rely on the compiler's auto-vectorization.
> Based on a large evaluation of 642 loops coming from the literature, this paper demonstrates that significant gains (up to 20%) can be obtained by adapting the LLVM auto-vectorizer to better exploit interleaving and vectorization for a given AArch64 architecture. The proposed approach is flexible and can be easily applied at both loop and application level. Experiments on 5 mini-apps coming from the HPC realm show similar improvements and demonstrate the co-design potential of the presented approach.