[RFC] Floating-point accuracy control

Hi everyone,

I’ve been working on a number of problems related to the floating-point accuracy of math library calls, and I’d like to propose a set of new intrinsics that specify the required accuracy of math library calls and the corresponding LLVM intrinsics. I’m also going to suggest some clarifications to the semantics of existing IR constructs related to floating-point accuracy.

I think this has some potential applicability for traditional host-only optimization, but it is especially important for heterogeneous programming models like OpenCL, OpenMP, CUDA, and SYCL.

I’m going to spend quite a bit more time in this RFC describing the problem than my proposed solution. I think this is necessary to establish a common understanding of exactly what I am trying to do so that you can properly evaluate whether or not my proposed solution accomplishes what it intends to. Feel free to skip to the end and just look at the proposed intrinsics, but thanks in advance if you have time to read the rest.

Background

Historically, this has been an area where there wasn’t a whole lot to be done. If you call a math library function, you get whatever accuracy the library implementation provides, and there’s not much the compiler can do about it. Maybe you enable fast-math and in a few cases the compiler might substitute an approximation sequence, but things can get shaky really fast when you head down that path. There’s also a bit of a problem if the compiler decides to vectorize the call, which I’ll discuss more in a bit.

With the rise of heterogeneous computing, however, this is a much bigger issue. If I have an algorithm that calls cos(x), for instance, I might get one result if the code executes on the host computer and something different if the code executes on an offload device. As a developer, I need some way to get this under control.

Math Functions in a Heterogeneous Environment

This is a tricky problem. Ordinarily, you think of the math library as a single entity. I can call cosf(x) from anywhere in my program, and I can count on every such call being resolved to the same function. I can go to the documentation for the library (hopefully) and look up the accuracy of the cosf() function provided by that library. However, in a heterogeneous environment I don’t have just one math library. I potentially have a different math library for each device I’m using, and each of these libraries is likely to provide different accuracy for its functions. For example, the OpenCL specification requires cosf() to provide 4 ulp accuracy. Conforming implementations from different vendors are required to be at least this accurate, but they may be more accurate. The same function provided by CUDA will have 2 ulp accuracy. The x86_64 glibc implementation has 1 ulp accuracy. I’m doing a bit of hand-waving here with regard to the definition of the function. Namespaces and such distinguish which is which within the program, but as a developer I still think of it as a call to the single-precision cosine function.
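
For readers less familiar with ulp bounds, here’s a rough sketch of what an N ulp guarantee means in practice. This is a simplified measurement for illustration, not the formal definition the specifications use:

  #include <cmath>
  #include <limits>

  // Simplified illustration of measuring error in ulps: the difference
  // between a float result and a higher-precision reference, expressed
  // in units of the spacing between adjacent floats near the reference
  // value. A "4 ulp" function promises this never exceeds 4.0 for any
  // input.
  double errorInUlps(float result, double reference) {
    float ref = static_cast<float>(reference);
    float inf = std::numeric_limits<float>::infinity();
    double ulp = static_cast<double>(std::nextafter(ref, inf) - ref);
    return std::fabs(static_cast<double>(result) - reference) / ulp;
  }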

So the problem that this introduces is that I can use the same algorithm in code that executes on the host computer or various offload devices and get significantly different results. As a compiler vendor, I can patiently explain to my users why that happens, but my experience is that while they understand and accept my explanation they really don’t care why it happens. They want it not to happen.

This is the thing that’s motivating the RFC for me. I want to be able to provide a way for users to get the results they want. For example, if someone is developing a SYCL kernel, the default assumption is set by the accuracy requirements of the SYCL standard. These requirements are rather loose and geared toward performance. I want to be able to offer options that instead allow the user to specify that their kernels should be constrained to an accuracy consistent with what they get when executing equivalent CUDA code on a different device, or an accuracy consistent with the host math library. And I want to be able to define an interface that allows device vendors to provide multiple library implementations and select one at runtime based on options that were passed to the compiler when the main program was compiled. That’s a different problem, but it’s one I can’t solve without having an LLVM IR representation to describe function accuracy requirements.

Existing Constructs

Currently, we have two ways of describing floating-point accuracy in LLVM IR: the ‘afn’ fast-math flag, which allows the compiler to substitute an approximation for a math function, and the ‘fpmath’ metadata, which tells the compiler how much error is acceptable in the result of a floating-point operation. I think there are potential problems with both of these.

The ‘afn’ fast-math flag is defined in the LLVM Language Reference as allowing “substitution of approximate calculations for functions (sin, log, sqrt, etc).” The problem with this is that there is no indication of how “approximate” the calculation will be. In practice, we seem to only use it for transformations that are algebraically equivalent, like ‘pow(X, -0.5) → 1/sqrt(X)’, which is good. However, there are cases when you get into code generation where you might want to use instructions like RSQRT14PS plus a number of refinement iterations to approximate square root. How many iterations do you need? That depends on how accurate the answer has to be, and the ‘afn’ flag doesn’t say.
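
To make that concrete, here is a minimal sketch (ordinary C++ standing in for the vector instruction sequence) of the refinement involved. The starting estimate stands in for what RSQRT14PS would produce:

  // Illustration only: refine a low-precision reciprocal square root
  // estimate (like the ~14 correct bits RSQRT14PS provides) with
  // Newton-Raphson steps. Each step roughly doubles the number of
  // correct bits, so the right number of steps depends entirely on how
  // accurate the result needs to be, which is exactly what 'afn'
  // doesn't tell us.
  float refineRsqrt(float x, float estimate, int steps) {
    float y = estimate;
    for (int i = 0; i < steps; ++i)
      y = y * (1.5f - 0.5f * x * y * y); // Newton-Raphson for 1/sqrt(x)
    return y;
  }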

The ‘fpmath’ metadata has a different problem. I believe it was introduced to handle the case where OpenCL allows 2.5 ulp error on single-precision divide and square root. This is great. If the metadata is present, I can substitute an approximate square root or divide; if it isn’t present (including cases where it got dropped during optimization), I have to generate correctly rounded results.

One problem with ‘fpmath’ is that it only works in cases where the operation returns a correctly rounded result by default. What happens if I add “!fpmath !{float 2.5}” to a function call whose math library implementation only provides 4 ulp accuracy? Is that an error? What if the default implementation provides 4 ulp accuracy but another implementation is available that provides 2 ulp accuracy? In that case, I might select the more accurate implementation if the ‘fpmath’ metadata is present but select the less accurate implementation if it is dropped. This means I can’t reliably use ‘fpmath’ to require an implementation that is more accurate than what I’d get by default.

Another problem with ‘fpmath’ is that because it is metadata it can and will be dropped. It may seem like this is OK in cases like fdiv where the default implementation provides a correctly rounded result, but I think it’s actually problematic. Consider the OpenCL case. The OpenCL specification allows 2.5 ulp error for single-precision division. This isn’t a case like the fast-math flags where the user is signing off on a bit of wiggle room in the answer. It’s just a lower accuracy requirement. The generated code can use an implementation that returns 2.5 ulp error, and that’s OK. The generated code can instead use an implementation that returns a correctly rounded result, and that’s OK. However, it can’t do both in different places. Or at least, as a developer I wouldn’t be happy if it did. So if I have a function that contains “%z = fdiv float %x, %y, !fpmath !{float 2.5}” in multiple locations and the compiler drops the metadata from one of them, I’m potentially going to get two different answers for what should be the exact same operation.

Vectorization of Function Calls

One last thing I’d like to mention before I get to my proposed intrinsics is a problem I came across recently with function vectorization. Consider this loop:

  for (int i = 0; i < 4; ++i)
    x[i] = sinf(y[i]);

If I have turned off math errno and enabled a vector library, clang will compile that to this call instruction for x86 targets (Compiler Explorer):

  %4 = tail call <4 x float> @_ZGVbN4v_sinf(<4 x float> %3)

That’s nice. I like it. The problem is, this function isn’t as accurate as the default math library implementation of the scalar sinf function. Unless I’ve done something to relax the accuracy requirements, that shouldn’t happen. GCC won’t generate this vector call unless I use the -ffast-math option. That’s better, but it takes us back to the problem I described previously with the ‘afn’ flag not defining how accurate the substitute implementation has to be.

This problem could be solved just by looking for the ‘afn’ flag if we’re willing to assume that the vector library implementations are accurate enough to conform to the semantics of that flag (which, as I’ve said above, is insufficiently defined). Ideally, I’d like to have an interface that allows the vector library to define the accuracy of its implementations and allows the vectorizer to decide if that’s OK.

Proposal: fpaccuracy intrinsics

To allow control over the required accuracy of floating-point library calls, I’d like to introduce a new set of intrinsics of the form:

  {type} @llvm.fpaccuracy.{function}.{type-specifier}({type} %x, float {accuracy})

For example:

  float @llvm.fpaccuracy.cos.f32(float %x, float 2.5)
  <2 x float> @llvm.fpaccuracy.cos.v2f32(<2 x float> %x, float 2.5)
  double @llvm.fpaccuracy.cos.f64(double %x, float 2.5)

There are definitely problems with introducing yet another set of floating-point intrinsics. We already have constrained intrinsics and vector-predication intrinsics, and this is another set that isn’t mutually exclusive with either of the other two.

The other possibility I considered is defining a new type of operand bundle that can be attached to existing intrinsics. Something like this:

  %y = call float @llvm.cos.f32(float %x) [ "max-error"(float 2.5) ]
  %y = call float @llvm.experimental.constrained.cos.f32(float %x, metadata !"round.dynamic", metadata !"fpexcept.strict") [ "max-error"(float 2.5) ]
  %r = call <4 x float> @llvm.vp.fdiv.v4f32(<4 x float> %a, <4 x float> %b, <4 x i1> %mask, i32 %evl) [ "max-error"(float 2.5) ]

We’ve previously discussed combining the constrained floating-point intrinsics with the vector predication intrinsics using operand bundles, but I don’t think that has made much progress, and in general I’m not sure there’s consensus that it’s reasonable to use operand bundles this way.

For now, I’m going to proceed on the assumption that new intrinsics will be used, but if we decide operand bundles are better, I think most of the reasoning will apply in the same way.

TargetLibraryInfo interface additions

The semantics of the fpaccuracy intrinsics I am proposing introduce requirements that won’t be met in most cases. They will require the compiler to have some knowledge of the accuracy of the math library functions. If the accuracy provided by the function library is unknown, in most cases we would need to assume that it does not meet explicit accuracy requirements.

My expectation is that front ends would only generate calls to the fpaccuracy intrinsics if they know that there will be target library support for these intrinsics. This would be accomplished using the TargetLibraryInfo interface. Similarly, the vectorizer would use new TLI methods to determine if a function call could be vectorized with the required accuracy.

Here are the new TLI functions I am proposing:

  bool isFPAccuracyAvailable() const
  bool isFPAccuracyResolvedAtRuntime() const
  bool isFunctionAccuracyAvailable(StringRef F, float Accuracy) const
  StringRef getFunctionImplWithAccuracy(StringRef F, float Accuracy) const
  bool isFunctionVectorizableWithAccuracy(StringRef F, const ElementCount &VF, float Accuracy) const
  bool isFunctionVectorizableWithAccuracy(StringRef F, float Accuracy) const
  StringRef getVectorizedFunctionWithAccuracy(StringRef F, const ElementCount &VF, float Accuracy) const

The TargetLibraryInfo implementation may defer the decision on the availability of specific function accuracy implementations for cases like SYCL or OpenMP where the call will be resolved at runtime. In such cases, the TLI implementation should return true for all accuracy requests it wants to allow and provide some means of reporting an error at runtime if the required accuracy is not available.
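
To illustrate how I expect these to be used, here is a sketch of the check the vectorizer might perform before emitting a vector call. This is hypothetical code that assumes the proposed TLI methods above have been added; nothing here exists today:

  #include "llvm/Analysis/TargetLibraryInfo.h"
  #include "llvm/Support/TypeSize.h" // ElementCount
  using namespace llvm;

  // Only vectorize a call at the given width if some vector
  // implementation meets the accuracy bound carried by the fpaccuracy
  // intrinsic. Uses the proposed (not yet existing) TLI methods.
  bool canVectorizeWithAccuracy(const TargetLibraryInfo &TLI, StringRef Fn,
                                const ElementCount &VF,
                                float RequiredAccuracy) {
    if (!TLI.isFPAccuracyAvailable())
      return false; // no accuracy-aware math library on this target
    return TLI.isFunctionVectorizableWithAccuracy(Fn, VF, RequiredAccuracy);
  }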

Lowering of fpaccuracy intrinsics

I propose introducing a new pass in the CodeGen phase that lowers fpaccuracy intrinsics to specific function calls before instruction selection. If no function that is known to meet the accuracy requirement exists, this pass should report a fatal error. Otherwise, the intrinsic call will be replaced by a call to the function returned by TLI::getFunctionImplWithAccuracy().

For example:

  %y = call float @llvm.fpaccuracy.exp10.f32(float %x, float 1.0)

may become

  %y = call float @__exp10_ha(float %x)
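
The heart of such a pass might look something like this. This is a sketch under the assumption that the proposed TLI methods exist; the function names, unary-only handling, and error message are placeholders, not a final design:

  #include "llvm/Analysis/TargetLibraryInfo.h"
  #include "llvm/IR/Constants.h"
  #include "llvm/IR/IRBuilder.h"
  #include "llvm/IR/Module.h"
  #include "llvm/Support/ErrorHandling.h"
  using namespace llvm;

  // Lower one fpaccuracy intrinsic call. BaseName is the math function
  // name derived from the intrinsic ID, e.g. "exp10". Handles only the
  // unary case for brevity.
  static void lowerFPAccuracyCall(CallInst *CI, const TargetLibraryInfo &TLI,
                                  StringRef BaseName) {
    // The required accuracy is the trailing constant float operand.
    auto *AccC = cast<ConstantFP>(CI->getArgOperand(CI->arg_size() - 1));
    float Accuracy = AccC->getValueAPF().convertToFloat();

    if (!TLI.isFunctionAccuracyAvailable(BaseName, Accuracy))
      report_fatal_error("no implementation of '" + Twine(BaseName) +
                         "' meets the required accuracy");

    // Replace the intrinsic call with a call to the implementation TLI
    // selected, dropping the accuracy operand.
    StringRef ImplName = TLI.getFunctionImplWithAccuracy(BaseName, Accuracy);
    Module *M = CI->getModule();
    FunctionCallee Impl = M->getOrInsertFunction(
        ImplName, CI->getType(), CI->getArgOperand(0)->getType());
    IRBuilder<> Builder(CI);
    CallInst *NewCall = Builder.CreateCall(Impl, {CI->getArgOperand(0)});
    CI->replaceAllUsesWith(NewCall);
    CI->eraseFromParent();
  }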

For programming models like SYCL that might involve runtime resolution of these functions, the programming model will need a way to describe that in its intermediate representation. There are people working on this for SPIR-V.

Clang support

If this proposal gets enough traction to move forward, I will be proposing corresponding clang changes to enable the interface, including a new command-line option (-ffp-accuracy) and a new pragma (#pragma clang fp accuracy). However, I’d like to hold off on that until I see whether others agree that this is a useful capability to support in the backend.