FunctionSpecificOpt - GCC Wiki

Target Specific Optimization

The target specific optimization work has several internal stages, which can be delivered in different GCC releases. The first two stages are geared towards people who need to build high performance libraries that must span several different underlying architectures, while the final stage is meant to be usable by the majority of programmers, since it will not require source code modifications. While the focus of this work is to allow ix86 programmers to code for various AMD and Intel platforms, other GCC backends will be able to use target specific optimization by adding the appropriate machine dependent parts. Stages 1 and 2 were checked into the GCC mainline in revision 138082.

The stages are:

1. Compile a single function with specific machine options using attributes.
2. Compile a single function with specific machine options using #pragmas.
3. Compile a single function multiple times with multiple different options.
4. Compile functions with multiple different options automatically.

Stage1: Compile single function with specific options using attributes

Stage1: Objective of compiling a single function with specific options

The objective of being able to compile a single function with specific options is to allow the user to control how an individual function in a compilation unit is compiled without having to move the function to a separate source file and specify different options on a Makefile.
This stage is targeted at power users who are willing to modify their code to achieve the benefits.
This stage is also needed to provide the infrastructure for the later stages, which need the ability to compile a function with specific options. It is expected that relatively few users will make the code modifications needed to use this enhancement. The bulk of the work, however, is adding the ability to change which options are used to compile a single function, and that makes this stage a convenient stepping stone to stage2.

Stage1: Details of compiling a single function with specific options

Most users are not willing to build their applications multiple times with the appropriate -march=xxx or -mtune=xxx options to achieve the best performance, but instead use generic options like -O2 to build their application. When you have multiple different platforms that implement the same basic instruction set, but have different additional instructions or timing characteristics, you can leave a lot of performance on the table.
This stage is to allow users who write performance critical libraries to code up an individual routine to use special features for a particular platform (for example, using the SSE3 instructions on newer AMD and Intel machines, or using SSE4.1/SSE5 instructions in future processors). A really motivated user can do this today by using different files and different compilation options. This stage would make these special functions easier to code.
In the future, a secondary benefit would be for memory limited environments, like embedded environments, where you can compile non-critical functions with -Os instead of -O3 to reduce the size of the code, but not impact the performance of critical routines.
In this stage, it is a non-goal to provide any automatic means of selecting the appropriate function. It is assumed that the application will select the appropriate function to call.
Recently within AMD, we discovered one major disadvantage of compiling whole files with special target options: static constructors that are declared inside of a module will get called, even though the machine might not support the instructions being compiled. Being more selective about which functions are compiled with special target options can help avoid the problem.
Using these options forces the developer to test his/her software on different platforms that have different instructions or instruction timings, because the compiler will use different code paths. Since this stage is targeted to power users that need to wring the most performance out of their software, it is assumed that they will already be testing their code for different environments.
The __builtin_ia32_ intrinsics must be modified so that an intrinsic can be used inside a function that is compiled with the function specific isa options the intrinsic needs.
During stage1, it is not a goal that the common intrinsics shared with other compilers work in functions compiled with different function specific options.
The inliner should be taught to query a hook provided by the backend to determine whether one function can inline another. In the case of the x86, the rules should be:
- A generic function (without function specific options) can inline a generic function;
- A generic function cannot inline a function with function specific options;
- A function with function specific options can inline generic functions;
- A function with function specific options can inline other functions with the same function specific options;
- A function with a stricter set of function specific options can inline other functions that have a subset of the caller's options. For example a sse5 or sse4.1 function can inline a sse2 function, since both sse5 and sse4.1 incorporate sse2.
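The rules above can be read as a subset test on ISA feature bits, if one assumes that a generic function carries an empty set of function specific options. The following is a minimal sketch of that reading, not the hook as implemented in GCC: the `ISA_*` bit values, `can_inline_p` as a plain function, and `check_inline_rules` are all hypothetical names chosen for illustration, and options other than ISA bits (such as tune= or arch=) are not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical feature bits. Each ISA level also carries the bits it
   implies; only the SSE2 implication stated in the text is modeled. */
#define ISA_GENERIC 0x0u
#define ISA_SSE2    0x1u
#define ISA_SSE4_1  (ISA_SSE2 | 0x2u)
#define ISA_SSE5    (ISA_SSE2 | 0x4u)

/* Return true if a caller compiled with CALLER_ISA may inline a callee
   compiled with CALLEE_ISA: the callee must need nothing the caller lacks. */
bool can_inline_p(unsigned caller_isa, unsigned callee_isa)
{
  return (callee_isa & ~caller_isa) == 0;
}

/* Self-check that walks through the rules in the order they are listed. */
void check_inline_rules(void)
{
  assert(can_inline_p(ISA_GENERIC, ISA_GENERIC));  /* generic into generic */
  assert(!can_inline_p(ISA_GENERIC, ISA_SSE2));    /* specific into generic */
  assert(can_inline_p(ISA_SSE5, ISA_GENERIC));     /* generic into specific */
  assert(can_inline_p(ISA_SSE5, ISA_SSE5));        /* same options */
  assert(can_inline_p(ISA_SSE4_1, ISA_SSE2));      /* subset of the caller */
  assert(!can_inline_p(ISA_SSE4_1, ISA_SSE5));     /* unrelated extension */
}
```

With this encoding all five rules collapse into the single mask comparison, which is why merging the ISA options into one flags word makes the check cheap.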
I am not sure we need to add the full capability to set all -O, -f, -W, and -m options. One problem with adding these options is certain optimizations depend on other optimization options, such as PRE depends on the -O2 optimizations. It is not in the scope of this project to add such support, but once the basic ability is added to compile some functions with different options, this functionality can be added by other people.
In general, we should not add attributes that change the basic ABI of the machine, so there should be no analogs of the -m96-bit-double, -malign-double, -m128-long-double, -mintel-syntax, -mpc, -m32, -m64 x86 switches.

Stage1: Syntax for target specific option using attributes

I propose we add a new attribute, option, that allows the user to use certain ix86 options. Other backends that wish to provide function specific options to their users can use the same syntax (but of course will have different options). The option attribute takes one or more strings that are parsed by the backend. In the case of the x86, each string holds one or more options separated by commas. Each option is equivalent to the corresponding -m option, written without the leading -m. The fpmath=sse,387 option must be passed as fpmath=sse+387 or fpmath=both, since the comma would otherwise be taken as an option separator. The options that would be provided are:

ISA options (both option and no-option are supported):
- __attribute__((option("abm")))
- __attribute__((option("aes")))
- __attribute__((option("mmx")))
- __attribute__((option("pclmul")))
- __attribute__((option("popcnt")))
- __attribute__((option("sse")))
- __attribute__((option("sse2")))
- __attribute__((option("sse3")))
- __attribute__((option("sse4")))
- __attribute__((option("sse4.1")))
- __attribute__((option("sse4.2")))
- __attribute__((option("sse4a")))
- __attribute__((option("sse5")))
- __attribute__((option("ssse3")))
Boolean options (both option and no-option are supported):
- __attribute__((option("cld")))
- __attribute__((option("fancy-math-387")))
- __attribute__((option("fused-madd")))
- __attribute__((option("ieee-fp")))
- __attribute__((option("inline-all-stringops")))
- __attribute__((option("inline-stringops-dynamically")))
- __attribute__((option("align-stringops")))
- __attribute__((option("recip")))
Options that take string arguments:
- __attribute__((option("arch=")))
- __attribute__((option("tune=")))
- __attribute__((option("fpmath=")))
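To make the string syntax concrete, here is a minimal sketch of how a backend might split an option string into entries. It only illustrates the comma separation, the no- prefix, and the name=value form described above; `struct opt_entry` and `parse_option_string` are hypothetical names, and the sketch does not check that an option name is one of those listed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* One parsed entry of an option string (hypothetical layout). */
struct opt_entry {
  char name[32];   /* option name without the "no-" prefix or "=value" */
  char value[32];  /* text after '=', or "" if there is none */
  bool enabled;    /* false when the option was written as "no-<name>" */
};

/* Split STR at commas and fill OUT with at most MAX entries.
   Returns the number of entries, or -1 on an empty field, an
   oversized field, or too many fields. */
int parse_option_string(const char *str, struct opt_entry *out, int max)
{
  int count = 0;

  assert(str != NULL && out != NULL);
  while (*str != '\0') {
    size_t len = strcspn(str, ",");   /* length of the current field */
    const char *p = str;
    size_t n = len;

    if (len == 0 || count == max)
      return -1;

    struct opt_entry *e = &out[count];
    e->enabled = true;
    if (n > 3 && strncmp(p, "no-", 3) == 0) {
      e->enabled = false;
      p += 3;
      n -= 3;
    }

    const char *eq = memchr(p, '=', n);
    size_t name_len = eq ? (size_t)(eq - p) : n;
    size_t val_len = eq ? n - name_len - 1 : 0;
    if (name_len == 0 || name_len >= sizeof e->name
        || val_len >= sizeof e->value)
      return -1;

    memcpy(e->name, p, name_len);
    e->name[name_len] = '\0';
    if (eq)
      memcpy(e->value, eq + 1, val_len);
    e->value[val_len] = '\0';

    count++;
    str += len;
    if (*str == ',')
      str++;                          /* skip the separator */
  }
  return count;
}
```

Because the comma is the field separator, a value such as fpmath=sse,387 would be split into two fields, which is the reason the text requires the sse+387 or both spelling.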

Stage1: Example using attribute

Here is an example of how you might use target specific functions using attributes. It uses the GCC intrinsics. The code calculates a minimum of a vector of 32-bit signed integers, using the pcomd and pcmov instructions under SSE5 and the pminsd instruction under SSE4.1.

    typedef int __v4si __attribute__ ((vector_size (16), may_alias));
    void sse5_min (__v4si *, __v4si *, __v4si *, int) __attribute__ ((option("sse5")));
    void sse4_1_min (__v4si *, __v4si *, __v4si *, int) __attribute__ ((option("sse4.1")));
    void generic_min (__v4si *, __v4si *, __v4si *, int);

    void sse5_min (__v4si *a, __v4si *b, __v4si *c, int n) {
        int i;
        for (i = 0; i < n; i++) {
            __v4si test = __builtin_ia32_pcomltd (b[i], c[i]);
            a[i] = __builtin_ia32_pcmov_v4si (b[i], c[i], test);
        }
    }

    void sse4_1_min (__v4si *a, __v4si *b, __v4si *c, int n) {
        int i;
        for (i = 0; i < n; i++) {
            a[i] = __builtin_ia32_pminsd (b[i], c[i]);
        }
    }

    void generic_min (__v4si *a, __v4si *b, __v4si *c, int n) {
        int i;
        int n_int = 4 * n;
        int *a_int = (int *) a;
        int *b_int = (int *) b;
        int *c_int = (int *) c;
        for (i = 0; i < n_int; i++) {
            a_int[i] = (b_int[i] < c_int[i]) ? b_int[i] : c_int[i];
        }
    }

    void do_min (__v4si *a, __v4si *b, __v4si *c, int n) {
        if (HAVE_SSE5) {
            sse5_min (a, b, c, n);
        } else if (HAVE_SSE4_1) {
            sse4_1_min (a, b, c, n);
        } else {
            generic_min (a, b, c, n);
        }
    }
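The example leaves HAVE_SSE5 and HAVE_SSE4_1 to the application. As a rough sketch of how the selection might look with a released compiler, the code below assumes two things that are not part of this proposal: that the attribute is spelled target (as it is in released GCC versions) rather than option, and that the GCC built-in `__builtin_cpu_supports` is available for the feature test. The function names are hypothetical, plain scalar loops are used instead of intrinsics, and on compilers or targets without these features the sketch falls back to the generic routine.

```c
#include <assert.h>
#include <stddef.h>

/* Use the per-function target attribute only on x86 with a GNU-compatible
   compiler; everywhere else only the generic routine is built. */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define X86_DISPATCH 1
#else
#define X86_DISPATCH 0
#endif

#if X86_DISPATCH
/* Compiled with SSE4.1 enabled regardless of the command line options. */
__attribute__((target("sse4.1")))
void min_sse4_1(int *a, const int *b, const int *c, size_t n)
{
  for (size_t i = 0; i < n; i++)
    a[i] = (b[i] < c[i]) ? b[i] : c[i];
}
#endif

/* Compiled with whatever options were given on the command line. */
void min_generic(int *a, const int *b, const int *c, size_t n)
{
  for (size_t i = 0; i < n; i++)
    a[i] = (b[i] < c[i]) ? b[i] : c[i];
}

/* The application, not the compiler, picks the routine at run time. */
void do_min(int *a, const int *b, const int *c, size_t n)
{
  assert(a != NULL && b != NULL && c != NULL);
#if X86_DISPATCH
  if (__builtin_cpu_supports("sse4.1")) {
    min_sse4_1(a, b, c, n);
    return;
  }
#endif
  min_generic(a, b, c, n);
}
```

Both routines compute the same result, so callers only ever need to call `do_min`; the dispatch matters for the instructions the compiler is allowed to use, not for the answer.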

Stage1: Syntax for optimization option using attributes

In addition to setting target options, users would like to be able to change the optimization level for functions. For example, you might want to use -O3 -funroll-loops for functions that are executed all of the time and -Os for functions that are rarely executed. I propose we add a new attribute, optimize, that allows the user to change the optimization options. This would be supported for all targets. The hot attribute would be modified to set the -O3 option and the cold attribute would be modified to set the -Os option. The optimize attribute takes one or more strings or a number. Commas can separate options within a string. Each string option is equivalent to the corresponding -f option, written without the leading -f, unless the string begins with 'O'. Numbers are equivalent to the appropriate -O level.
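A short sketch of how the optimize attribute might be used follows. It assumes the syntax described above as it is accepted by released GCC versions; the `OPTIMIZE` wrapper macro and the function names are hypothetical, and the macro expands to nothing on compilers that are not GCC so that the code still builds there.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical wrapper: apply the attribute on GCC only. */
#if defined(__GNUC__) && !defined(__clang__)
#define OPTIMIZE(...) __attribute__((optimize(__VA_ARGS__)))
#else
#define OPTIMIZE(...)
#endif

/* Frequently executed: request -O3 and -funroll-loops for this function. */
OPTIMIZE("O3", "unroll-loops")
long sum_hot(const int *v, size_t n)
{
  long total = 0;
  for (size_t i = 0; i < n; i++)
    total += v[i];
  return total;
}

/* Rarely executed: request -Os for this function. */
OPTIMIZE("Os")
int clamp_cold(int x, int lo, int hi)
{
  assert(lo <= hi);
  return x < lo ? lo : (x > hi ? hi : x);
}
```

The attribute changes only how each function is compiled; the results are the same as without it, so it can be added or removed without touching callers.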

Stage1: Work items

This section is an attempt to break down the stage1 work into smaller chunks, with separate deliverables. It has now been rewritten after the fact to describe the work that was done.

A subversion branch (function-specific-branch) will be created at the FSF to host this project. All work will be done in this branch. All people contributing to this branch must have the appropriate FSF paperwork so that their work can be incorporated into the mainstream GCC. All FSF coding guidelines will be used. Merges from the mainline will occur at least monthly. It will take 1 day to create the branch. It is anticipated that each merge, including any updates needed to the target specific work, will take 1 day.
Modify the opt*.awk scripts so that there is a new flag, Save, which indicates which variables need to be saved and restored. A structure, cl_option_attr, will be created to hold these options. Two functions, cl_options_save and cl_options_restore, will be created to save and restore the options.
Add support in c-common.c to add __attribute__((option(...))) and call a back end hook, valid_option_attribute_p, to validate the option.
Add a new field, function_specific, to the tree_function_decl node to hold the back end information for storing the information needed for each function with function specific options.
Use the set current function hook to change the target options when it is different than the previous function. Call target_reinit to reinitialize things like which registers are allowed to be used in the current ISA.
Change the inliner in ipa-inline.c to call tree_can_inline_p to validate each potential inline candidate. Add tree_can_inline_p to tree-inline.c to forward to the target hook. Add a new target hook, can_inline_p, which vets the inline. Add the hook to the x86 port.
Merge all of the ix86 isa options that use independent variables into the ix86_isa_flags word. Merge other boolean options into the target_flags word.
Modify the builtin function handling so that most builtin functions which map into x86 instructions are added to the list of declarations, and issue an error if the user tries to use the builtin function without having the proper isa.
Write tests.
Submit patches to the gcc-patches mailing list.
Deal with the comments and modify the code appropriately.
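The save/restore and set-current-function work items above can be pictured with a small model. This is only a sketch of the idea under stated assumptions, not GCC's code: `struct saved_options`, `options_save`, `options_restore`, and `set_current_function` are hypothetical stand-ins, and a counter takes the place of the real backend reinitialization.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the option state saved per function. */
struct saved_options {
  unsigned isa_flags;
  unsigned target_flags;
};

struct saved_options current_options;  /* zero-initialized: generic */
int reinit_count;                      /* how often the backend was reinitialized */

bool same_options(const struct saved_options *a, const struct saved_options *b)
{
  return a->isa_flags == b->isa_flags && a->target_flags == b->target_flags;
}

/* Copy the options currently in effect into DST. */
void options_save(struct saved_options *dst)
{
  *dst = current_options;
}

/* Make SRC the options in effect and reinitialize the backend. */
void options_restore(const struct saved_options *src)
{
  current_options = *src;
  reinit_count++;  /* stands in for the real reinitialization */
}

/* Called when compilation moves on to another function: switch options
   only if they differ from those of the previous function. */
void set_current_function(const struct saved_options *fn_options)
{
  assert(fn_options != NULL);
  if (!same_options(fn_options, &current_options))
    options_restore(fn_options);
}
```

Comparing before restoring keeps the common case cheap: consecutive functions with identical options cause no reinitialization at all.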

Stage2: Compile single function with specific options using pragmas

The attribute syntax is kind of clunky if you are defining multiple functions using the same function specific options. I would propose adding new #pragmas that change the default options for the functions defined after the #pragma. Internally, the #pragma would save the appropriate information and then add __attribute__((option(...))) to each function. Ideally, preprocessor macros such as __SSE__ should also be changed by the #pragma.

Stage2: pragma syntax

#pragma GCC option("string") -- Add "string" to the __attribute__((option(...))) of each function that appears after the #pragma. The string should be parsed at the time of the pragma.
#pragma GCC option(push) -- Save the current function specific options on a stack, so that a later option(pop) can recover the options. This is useful for include files that want to restore the state after having function specific pragmas.
#pragma GCC option(pop) -- Pop off the function specific options from the stack created by option(push).
#pragma GCC option(initial) -- Restore the options to those specified by the command line options.
#pragma GCC optimize("string") -- Add "string" to the __attribute__((optimize(...))) of each function that appears after the #pragma.
#pragma GCC optimize(push) -- Save the current function specific options on a stack, so that a later optimize(pop) can recover the options. This is useful for include files that want to restore the state after having function specific pragmas.
#pragma GCC optimize(pop) -- Pop off the function specific options from the stack created by optimize(push).
#pragma GCC optimize(initial) -- Restore the options to those specified by the command line options.
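The push, pop, and initial forms describe a simple stack discipline. The model below is a minimal sketch of those semantics, with options reduced to a bit mask; all names are hypothetical. It assumes that a string only adds bits (no- forms are not modeled) and that initial leaves the stack untouched, neither of which is specified in the text.

```c
#include <assert.h>
#include <stddef.h>

#define OPTION_STACK_MAX 16

/* Hypothetical model of the pragma state. */
struct option_state {
  unsigned initial;                  /* options from the command line */
  unsigned current;                  /* options applied to following functions */
  unsigned stack[OPTION_STACK_MAX];  /* values saved by push */
  int depth;
};

void option_init(struct option_state *s, unsigned cmdline)
{
  assert(s != NULL);
  s->initial = cmdline;
  s->current = cmdline;
  s->depth = 0;
}

/* #pragma GCC option("string"): add options for the functions that follow. */
void option_add(struct option_state *s, unsigned bits)
{
  s->current |= bits;
}

/* #pragma GCC option(push): returns 0, or -1 if the stack is full. */
int option_push(struct option_state *s)
{
  if (s->depth == OPTION_STACK_MAX)
    return -1;
  s->stack[s->depth++] = s->current;
  return 0;
}

/* #pragma GCC option(pop): returns 0, or -1 if nothing was pushed. */
int option_pop(struct option_state *s)
{
  if (s->depth == 0)
    return -1;
  s->current = s->stack[--s->depth];
  return 0;
}

/* #pragma GCC option(initial): back to the command line options. */
void option_initial(struct option_state *s)
{
  s->current = s->initial;
}
```

An include file would bracket its own pragmas with a push and a pop, so that the options seen by the including file are the same before and after the #include.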

Stage2: Example using #pragma

Here is an example of how you might use target specific functions using #pragma. It uses the common compiler intrinsics include files (and needs #pragma because bmmintrin.h and smmintrin.h check for __SSE5__ and __SSE4_1__ being defined). The code calculates a minimum of a vector of 32-bit signed integers, using the pcomd and pcmov instructions under SSE5 and the pminsd instruction under SSE4.1.

    #pragma GCC option(push)
    #pragma GCC option("sse5")
    #include <bmmintrin.h>

    void sse5_min (__m128i *a, __m128i *b, __m128i *c, int n) {
        int i;
        for (i = 0; i < n; i++) {
            __m128i test = _mm_comlt_epi32 (b[i], c[i]);
            a[i] = _mm_cmov_si128 (b[i], c[i], test);
        }
    }

    #pragma GCC option(pop)
    #pragma GCC option(push)
    #pragma GCC option("sse4.1")
    #include <smmintrin.h>

    void sse4_1_min (__m128i *a, __m128i *b, __m128i *c, int n) {
        int i;
        for (i = 0; i < n; i++) {
            a[i] = _mm_min_epi32 (b[i], c[i]);
        }
    }

    #pragma GCC option(pop)

    void generic_min (__m128i *a, __m128i *b, __m128i *c, int n) {
        int i;
        int n_int = 4 * n;
        int *a_int = (int *) a;
        int *b_int = (int *) b;
        int *c_int = (int *) c;
        for (i = 0; i < n_int; i++) {
            a_int[i] = (b_int[i] < c_int[i]) ? b_int[i] : c_int[i];
        }
    }

    void do_min (__m128i *a, __m128i *b, __m128i *c, int n) {
        if (HAVE_SSE5) {
            sse5_min (a, b, c, n);
        } else if (HAVE_SSE4_1) {
            sse4_1_min (a, b, c, n);
        } else {
            generic_min (a, b, c, n);
        }
    }

Stage3: Details of compiling a single function multiple times manually

If this is used all over the place, it can lead to massive code bloat.
Ideally the compiler should determine if two or more clone functions generate the same code, but at present, this is not part of the goals of this project.
Users that use this option really need to test their code on multiple platforms to ensure that the compiler generates the correct code for each target.
Functions that take variable arguments will not be allowed to be cloned, since the function that dispatches to the clones needs to pass all of the arguments to the clone functions.
The backend should determine what are the appropriate clone targets, while the user should just indicate that a function should be cloned. This allows for new clone targets to be added automatically without modifying the code.
To cut down on code bloat, the ix86 backend should not generate clones for each different machine, but instead compile code for feature bits (i.e., whether a machine has the SSE3, SSSE3, SSE4.1, or SSE5 instruction sets), and not a specific machine.
For 32-bit ix86 targets, it is important not to have too many clones, given the limited address space of user applications. I would expect the following clones to be provided:
- generic, use 387 floating point stack
- -msse2
For 64-bit ix86 targets, I would expect the following clones to be provided:
- generic (implies -msse2)
- -msse3
- -msse4.1
- -msse5
In generating the clone functions, the compiler will generate one function that dispatches to each of the clones based on feature tests. A function that runs as a static constructor will be responsible for doing the CPUID instruction(s) to determine what feature bits are supported.
It is highly desirable that the debugger know about all clones, so that if you put a breakpoint in a cloned function, it puts the same breakpoint at the same line in each cloned function.

Stage3: Example

If you have a function declared as a clone, such as:

    void my_min (int *, int *, int *, int) __attribute__((clone));
    void my_min (int *a, int *b, int *c, int n) {
        int i;
        for (i = 0; i < n; i++) {
            a[i] = (b[i] < c[i]) ? b[i] : c[i];
        }
    }

The compiler would logically generate code that would be equivalent to:

    static void __do_cpuid (void) __attribute__ ((constructor));
    static void my_min__clone_generic (int *, int *, int *, int);
    static void my_min__clone_sse5 (int *, int *, int *, int) __attribute__((option("sse5")));
    static void my_min__clone_sse4_1 (int *, int *, int *, int) __attribute__((option("sse4.1")));
    static void (*my_min__clone_ptr)(int *, int *, int *, int) = my_min__clone_generic;

    static void __do_cpuid (void) {
        int have_sse5;
        int have_sse4_1;

        /* Set have_sse5 and have_sse4_1 from the CPUID feature bits.  */

        if (have_sse5) {
            my_min__clone_ptr = my_min__clone_sse5;
        } else if (have_sse4_1) {
            my_min__clone_ptr = my_min__clone_sse4_1;
        } else {
            my_min__clone_ptr = my_min__clone_generic;
        }
    }

    void my_min (int *a, int *b, int *c, int n) {
        (*my_min__clone_ptr) (a, b, c, n);
    }

    static void my_min__clone_generic (int *a, int *b, int *c, int n) {
        int i;
        for (i = 0; i < n; i++) {
            a[i] = (b[i] < c[i]) ? b[i] : c[i];
        }
    }

    static void my_min__clone_sse5 (int *a, int *b, int *c, int n) {
        int i;
        for (i = 0; i < n; i++) {
            a[i] = (b[i] < c[i]) ? b[i] : c[i];
        }
    }

    static void my_min__clone_sse4_1 (int *a, int *b, int *c, int n) {
        int i;
        for (i = 0; i < n; i++) {
            a[i] = (b[i] < c[i]) ? b[i] : c[i];
        }
    }

Stage4: Compile functions with multiple different options automatically

Stage4: Objective of compiling a single function multiple times automatically

Once we have the ability to clone functions, the compiler should be able to use profile guided feedback to determine which functions are hotspot functions and automatically add the clone attribute.
It may be useful to add a -fhotspot=func1,func2,... switch as well.
The compiler should only do this cloning automatically when the user asks for it with a switch such as -fclone. Otherwise, if it is done via -O3, it will be incumbent on each user to test their code on multiple machines.

Branch

The svn development branch is svn://gcc.gnu.org/svn/gcc/branches/function-specific-branch
The svn tag branch is svn://gcc.gnu.org/svn/gcc/tag/function-specific-branch
The branch was created from the trunk, revision 130896.
gcc/ChangeLog-function is where ChangeLog entries for this branch should go.