[cfe-dev] Proposal for function vectorization and loop vectorization with function calls

Tian, Xinmin via cfe-dev cfe-dev at lists.llvm.org
Wed Mar 2 11:49:05 PST 2016


Proposal for function vectorization and loop vectorization with function calls

Intel Corporation (3/2/2016)

This is a proposal for initial work towards a Clang and LLVM implementation of vectorizing functions annotated with OpenMP 4.5's "#pragma omp declare simd" (called SIMD-enabled functions) and its associated clauses, based on the VectorABI [2]. On the caller side, we propose to improve LLVM's LoopVectorizer so that code calling a SIMD-enabled function can be vectorized. On the callee side, we propose to add Clang FE support for the "#pragma omp declare simd" syntax and a new pass that transforms the SIMD-enabled function body into a SIMD loop. This newly created loop can then be fed to LLVM's LoopVectorizer (or its future enhancement) for vectorization. This work leverages LLVM's existing LoopVectorizer.

Problem Statement

Currently, if a loop calls a user-defined function or a 3rd party library function, the loop can't be vectorized unless the function is inlined. In the example below the LoopVectorizer fails to vectorize the k loop because of its call to "dowork", which is an external function. Inlining "dowork" may enable vectorization in some cases, but it is not a generally applicable solution, and there may be reasons why the compiler does not (or cannot) inline the call. Therefore, there is value in being able to vectorize a loop that contains a call to the "dowork" function.

#include <stdio.h>

extern float dowork(float *a, int k);

float a[4096];

int main()
{
  int k;
#pragma clang loop vectorize(enable)
  for (k = 0; k < 4096; k++) {
    a[k] = k * 0.5;
    a[k] = dowork(a, k);
  }
  printf("passed %f\n", a[1024]);
}

sh-4.1$ clang -c -O2 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize loopvec.c
loopvec.c:15:12: remark: loop not vectorized: call instruction cannot be vectorized [-Rpass-analysis]
    a[k] = dowork(a, k);
           ^
loopvec.c:13:3: remark: loop not vectorized: use -Rpass-analysis=loop-vectorize for more info (Force=true) [-Rpass-missed=loop-vectorize]
  for (k = 0; k < 4096; k++) {
  ^
loopvec.c:13:3: warning: loop not vectorized: failed explicitly specified loop vectorization [-Wpass-failed]
1 warning generated.

New functionality of Vectorization

New functionality and enhancements are proposed to address the issues stated above: a) vectorize a function annotated by the programmer using OpenMP* SIMD extensions; b) enhance LLVM's LoopVectorizer to vectorize a loop containing a call to a SIMD-enabled function.

For example, when writing:

#include <stdio.h>

#pragma omp declare simd uniform(a) linear(k)
extern float dowork(float *a, int k);

float a[4096];

int main()
{
  int k;
#pragma clang loop vectorize(enable)
  for (k = 0; k < 4096; k++) {
    a[k] = k * 0.5;
    a[k] = dowork(a, k);
  }
  printf("passed %f\n", a[1024]);
}

the programmer asserts that a) a vector version of "dowork" will be available for the compiler to use when vectorizing the k loop (to link against, with an appropriate signature as explained below); and that b) the "dowork" call introduces no loop-carried backward dependencies that would prevent vectorization of the k loop.

The expected vector loop (shown as pseudo code, ignoring leftover iterations) resulting from LLVM's LoopVectorizer is

... ...
vectorized_for (k = 0; k < 4096; k += VL) {
  a[k:VL] = {k, k+1, ..., k+VL-1} * 0.5;
  a[k:VL] = _ZGVbN4ul_dowork(a, k);
}
... ...

In this example "_ZGVbN4ul_dowork" is a special mangled name where: "_ZGV" is a prefix based on the C/C++ name mangling rule suggested by the GCC community; 'b' indicates "xmm" (assume we vectorize here for 128-bit xmm vector registers); 'N' indicates that the function is vectorized without a mask ('M' would indicate that the function is vectorized with a mask); '4' is VL (assume we vectorize here for length 4); 'u' indicates that the first parameter has the "uniform" property; and 'l' indicates that the second parameter has the "linear" property.

More details (including the name mangling scheme) can be found in reference [2].
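The naming scheme described above can be sketched in C. This is only an illustration under assumptions: the helper name `mangle_vector_name` and its parameters are hypothetical and are not part of Clang, LLVM, or the VectorABI document; it handles just the pieces discussed in this example (ISA letter, mask letter, vector length, and one letter per parameter).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: builds a vector-variant name of the form
 *   _ZGV<isa><mask><vlen><params>_<scalar name>
 * e.g. _ZGVbN4ul_dowork. Returns 0 on success, -1 if `out` is too small. */
int mangle_vector_name(char *out, size_t out_size,
                       char isa,            /* 'b' = 128-bit xmm          */
                       int masked,          /* nonzero -> 'M', zero -> 'N' */
                       unsigned vlen,       /* vector length, e.g. 4       */
                       const char *params,  /* one letter per parameter    */
                       const char *scalar_name)
{
    assert(out != NULL && params != NULL && scalar_name != NULL);
    assert(strlen(scalar_name) > 0);

    int n = snprintf(out, out_size, "_ZGV%c%c%u%s_%s",
                     isa, masked ? 'M' : 'N', vlen, params, scalar_name);

    /* snprintf reports the length it wanted to write; a value that does
     * not fit in the buffer means the name was truncated. */
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

Calling it with 'b', unmasked, a vector length of 4, the parameter string "ul", and the name "dowork" would yield the unmasked name used in this example; passing a nonzero `masked` flag gives the corresponding 'M' variant.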

References

  1. OpenMP SIMD language extensions: http://www.openmp.org/mp-documents/openmp-4.5.pdf

  2. VectorABI Documentation:
     https://www.cilkplus.org/sites/default/files/open_specifications/Intel-ABI-Vector-Function-2012-v0.9.5.pdf
     https://sourceware.org/glibc/wiki/libmvec?action=AttachFile&do=view&target=VectorABI.txt

     [[Note: VectorABI was reviewed at X86-64 System V Application Binary Interface mailing list. The discussion was recorded at https://groups.google.com/forum/#!topic/x86-64-abi/LmppCfN1rZ4 ]]

  3. The first paper on SIMD extensions and implementations: "Compiling C/C++ SIMD Extensions for Function and Loop Vectorization on Multicore-SIMD Processors" by Xinmin Tian, Hideki Saito, Milind Girkar, Serguei Preis, Sergey Kozhukhov, et al., IPDPS Workshops 2012, pages 2349--2358 [[Note: the first implementation and the paper were done before VectorABI was finalized with the GCC community and Red Hat. The latest VectorABI version for OpenMP 4.5 is ready to be published]]

Proposed Implementation

  1. Clang FE parses "#pragma omp declare simd [clauses]" and generates mangled names, including these prefixes, as vector signatures. These mangled name prefixes are recorded as function attributes in the LLVM function attribute group. Note that several mangled names may be associated with the same function, corresponding to several desired vectorized versions. Clang FE generates function attributes for all vector variants that the back-end is expected to generate. E.g.,

    #pragma omp declare simd uniform(a) linear(k)
    float dowork(float *a, int k)
    {
      a[k] = sinf(a[k]) + 9.8f;
      return a[k];
    }

    define __stdcall f32 @_dowork(f32* %a, i32 %k) #0
    ... ...
    attributes #0 = { nounwind uwtable "ZGVbM4ul" "ZGVbN4ul" ...}

  2. A new vector function generation pass is introduced to generate vector variants of the original scalar function based on VectorABI (see [2, 3]). For example, one vector variant is generated for "ZGVbN4ul" attribute as follows (pseudo code):

    define __stdcall <4 x f32> @_ZGVbN4ul_dowork(f32* %a, i32 %k) #0 {
      #pragma clang loop vectorize(enable)
      for (int %t = %k; %t < %k + 4; %t++) {
        %a[%t] = sinf(%a[%t]) + 9.8f;
      }
      vec_load xmm0, %a[k:VL]
      return xmm0;
    }

    The body of the function is wrapped inside a loop having VL iterations, which correspond to the vector lanes.

    The LLVM LoopVectorizer will vectorize the generated %t loop and is expected to produce the following vectorized code, in which the loop is eliminated (pseudo code):

    define __stdcall <4 x f32> @_ZGVbN4ul_dowork(f32* %a, i32 %k) #0 {
      vec_load xmm1, %a[k:VL]
      xmm2 = call __svml_sinf(xmm1)
      xmm0 = vec_add xmm2, [9.8f, 9.8f, 9.8f, 9.8f]
      store %a[k:VL], xmm0
      return xmm0;
    }

    [[Note: Vectorizer support for the Short Vector Math Library (SVML) functions will be a separate proposal. ]]
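The idea of wrapping the scalar body in a loop of VL iterations can be sketched in plain C. This is a minimal illustration under assumptions, not the output of the proposed pass: `dowork_scalar` and `dowork_lanes` are hypothetical names, the result is returned through an array rather than in a vector register, and the `sinf` call of the example is replaced by stand-in arithmetic so the sketch needs no math library.

```c
#include <assert.h>
#include <stddef.h>

#define VL 4 /* assumed vector length, matching the "4" in ZGVbN4ul */

/* Stand-in for the scalar "dowork": updates a[k] and returns it.
 * (The arithmetic is an assumption replacing sinf(a[k]) + 9.8f.) */
float dowork_scalar(float *a, int k)
{
    a[k] = a[k] * 0.5f + 9.8f;
    return a[k];
}

/* Plain-C model of one vector variant. 'a' is uniform (the same pointer
 * for every lane) and 'k' is linear with stride 1, so lane i processes
 * element k + i. A vectorizer could turn this loop into straight-line
 * vector code. */
void dowork_lanes(float *a, int k, float result[VL])
{
    assert(a != NULL && result != NULL);
    for (int lane = 0; lane < VL; lane++)
        result[lane] = dowork_scalar(a, k + lane);
}
```

Because the uniform and linear properties fix how each lane's arguments are derived from the single `a` and `k` passed in, the variant needs only scalar parameters even though it computes VL results.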

  3. The LLVM LoopVectorizer is enhanced to

     a) identify loops with calls that have been annotated with "#pragma omp declare simd" by checking function attribute groups;

     b) analyze each call instruction and its parameters in the loop, to determine if each parameter has the following properties:
        * uniform
        * linear + stride
        * vector
        * aligned
        * called inside a conditional branch or not
        ... ...
        Based on these properties, the signature of the vectorized call is generated; and

     c) perform signature matching to obtain the suitable vector variant among the signatures available for the called function. If no such signature is found, the call cannot be vectorized.

    Note that a similar enhancement can and should be made also to LLVM's SLP vectorizer.

    For example:

    #pragma omp declare simd uniform(a) linear(k)
    extern float dowork(float *a, int k);

    ... ...
    #pragma clang loop vectorize(enable)
    for (k = 0; k < 4096; k++) {
      a[k] = k * 0.5;
      a[k] = dowork(a, k);
    }
    ... ...

    Step a: The "dowork" function is marked as a SIMD-enabled function:
            attributes #0 = { nounwind uwtable "ZGVbM4ul" "ZGVbN4ul" ...}

    Step b: 1) 'a' is uniform, as it is the base address of array 'a'
            2) 'k' is linear, as 'k' is the induction variable with stride=1
            3) SIMD "dowork" is called unconditionally in the candidate k loop.
            4) it is compiled for SSE4.1 with the Vector Length VL=4.
            Based on these properties, the signature is "ZGVbN4ul".

    [[Note: A conditional call in the loop needs masking support; the implementation details can be found in references [1][2][3] ]]

    Step c: Check if the signature "ZGVbN4ul" exists in function attribute #0; if yes, the suitable vectorized version is found and will be linked with.

    The loop below is expected to be produced by the LoopVectorizer:

    ... ...
    vectorized_for (k = 0; k < 4096; k += 4) {
      a[k:4] = {k, k+1, k+2, k+3} * 0.5;
      a[k:4] = _ZGVbN4ul_dowork(a, k);
    }
    ... ...
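The signature matching of step c can be pictured as a lookup of the signature derived at the call site among the variants recorded for the callee. The sketch below is an assumption-laden model, not LLVM code: `struct simd_variants` and `find_variant` are hypothetical names, and the attribute strings are treated as plain C strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the vector signatures recorded as string
 * attributes on a function, e.g. { "ZGVbM4ul", "ZGVbN4ul" }. */
struct simd_variants {
    const char *const *names;
    size_t count;
};

/* Returns the index of the variant whose signature equals `wanted`,
 * or -1 if none matches (in which case the call cannot be vectorized). */
int find_variant(const struct simd_variants *v, const char *wanted)
{
    assert(v != NULL && wanted != NULL);
    for (size_t i = 0; i < v->count; i++) {
        if (strcmp(v->names[i], wanted) == 0)
            return (int)i;
    }
    return -1;
}
```

An exact string comparison is the simplest possible policy; a real matcher might also accept a compatible variant (for example, a masked version called with an all-true mask), but that is outside what this sketch models.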


GCC and ICC Compatibility

With this proposal the callee function and the loop containing a call to it can each be compiled and vectorized by a different compiler, including Clang+LLVM with its LoopVectorizer as outlined above, GCC and ICC. The vectorized loop will then be linked with the vectorized callee function. Of course, each of these compilers can also be used to compile both the loop and the callee function.
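The split between a caller-side loop and a separately provided vector callee can be modeled in plain C as follows. Everything here is an assumption for illustration: `vfloat4` stands in for a 128-bit vector return value, `dowork_v4` stands in for the mangled vector variant, the body of `dowork` is stand-in arithmetic, and `run_loop` is a hypothetical name for the caller's loop. The sketch also shows the scalar loop for leftover iterations that the pseudo code in this proposal omits.

```c
#include <assert.h>
#include <stddef.h>

#define VL 4 /* assumed vector length */

/* Stand-in for a 128-bit vector of four floats. */
typedef struct { float lane[VL]; } vfloat4;

/* Callee side (could be built by a different compiler). */
float dowork(float *a, int k) { return a[k] + 1.0f; }

vfloat4 dowork_v4(float *a, int k) /* stand-in for the vector variant */
{
    vfloat4 r;
    for (int i = 0; i < VL; i++)
        r.lane[i] = dowork(a, k + i);
    return r;
}

/* Caller side: the k loop processed VL elements at a time. */
void run_loop(float *a, int n)
{
    assert(a != NULL && n >= 0);
    int k = 0;
    for (; k + VL <= n; k += VL) {
        for (int i = 0; i < VL; i++)
            a[k + i] = (float)(k + i) * 0.5f;
        vfloat4 r = dowork_v4(a, k);   /* one call covers VL iterations */
        for (int i = 0; i < VL; i++)
            a[k + i] = r.lane[i];
    }
    for (; k < n; k++) {               /* leftover iterations, scalar */
        a[k] = (float)k * 0.5f;
        a[k] = dowork(a, k);
    }
}
```

The caller only relies on the agreed name and calling convention of the vector variant, which is what allows the two sides to come from different compilers.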

Current Implementation Status and Plan

  1. The Clang FE work described in #1 is done by the Intel Clang FE team. Note: the Clang FE syntax processing patch is implemented and under community review (http://reviews.llvm.org/D10599). In general, the review feedback from the Clang community is very positive.

  2. A new pass for function vectorization is implemented to support #2 and is being prepared for LLVM community review.

  3. Work is in progress to teach LLVM's LoopVectorizer to vectorize a loop with user-defined function calls according to #3.

Call for Action

  1. Please review this proposal and provide constructive feedback on its direction and key ideas.

  2. Feel free to ask any technical questions related to this proposal and to read the associated references.

  3. Help is also highly welcome and appreciated in the development and upstreaming process.


