[RFC] Removing default widening of OpenMP collapsed IVs

Hi,

The current clang implementation of the OpenMP collapse clause takes a conservative approach and emits the collapsed IV as i64 to prevent potential overflow, even when the original loop counters are i32 or narrower.

I understand why that’s done in the first place, but many compiler users actually suffer from it: they are not aware of the -fopenmp-optimistic-collapse option, while their loops have i32 counters at most and trip counts that are usually small enough not to overflow an i32 collapsed IV. So most of the time they are not even aware of the 64-bit arithmetic in their program, which they never asked for.
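To make the current behavior concrete, here is a minimal sketch (the `work` function is a placeholder of my own): both counters are i32 and the collapsed trip count plainly fits in i32, yet the default codegen still drives the collapsed nest with an i64 IV and 64-bit bounds arithmetic.

```c
void work(int i, int j);

// 1000 * 1000 = 10^6 logical iterations -- far below the i32 limit,
// but the collapsed IV is still emitted as i64 by default.
void kernel(void) {
#pragma omp parallel for collapse(2)
  for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; ++j)
      work(i, j);
}
```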

The spec also doesn’t impose the current conservative behavior; OpenMP 6.0, p. 204, ll. 26-30:

The iterations of some number of affected loops can be collapsed into one larger logical iteration space that is the collapsed iteration space. The particular integer type used to compute the iteration count for the collapsed loop is implementation defined, but its bit precision must be at least that of the widest type that the implementation would use for the iteration count of each loop if it was the only affected loop.

So I propose switching the default behavior to the following: use i64 only when at least one of the original loop counters has that type; otherwise use an i32 collapsed IV. We can add a -fopenmp-pessimistic-collapse option to keep using i64 conservatively even when all loops are i32. I guess that would also allow us to deprecate -fopenmp-optimistic-collapse.
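Concretely, the proposed rule would play out like this (a sketch; `n`, `m`, `n64`, and the array `a` are hypothetical):

```c
void proposed_rule(int n, int m, long long n64, float *a) {
  // All counters are i32: under the proposal the collapsed IV is i32.
#pragma omp parallel for collapse(2)
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < m; ++j)
      a[(long long)i * m + j] = 0.0f;

  // One counter is i64: the collapsed IV stays i64, which satisfies the
  // spec's requirement of being at least as wide as the widest type any
  // single affected loop would use.
#pragma omp parallel for collapse(2)
  for (long long i = 0; i < n64; ++i)
    for (int j = 0; j < m; ++j)
      a[i * m + j] = 0.0f;
}
```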

I believe @dreachem agreed that such a change would be in line with the spec. I’d like to hear from other people working on OpenMP in clang (@Meinersbur @jhuber6 @alexey-bataev @kparzysz) before uploading a patch.

The compiler always tries to implement a conservative but generally correct approach. The proposed approach may lead to incorrect processing of the code. I prefer to keep the original (conservative but safe) approach by default.

zuban32 March 5, 2025, 11:50am 3

My point is that the current approach is safe from a common-sense perspective, but from the spec’s perspective we’re allowed not to follow it. The spec doesn’t force us to be this conservative, so essentially it permits such failures. And I believe that in general we’re losing more than we’re gaining: when an option would be enabled in the majority of compiler invocations, it should be the default IMO. 64-bit arithmetic is just too heavy (especially when offloading) to be enabled by default when we’re free not to do so.

JCownie March 5, 2025, 1:50pm 4

Unfortunately, compiler users are not happy if code which used to work doesn’t after a compiler upgrade, even if the code is not standard conforming.

What is your use-case for preferring 32-bit loop counters? On any 64-bit architecture it will be promoted to a 64-bit register anyway.

IIRC, some loop normalization passes (LoopSimplify, IndVarSimplify, or SCEVExpander) also introduce 64-bit induction variables.

zuban32 March 5, 2025, 3:12pm 6

I’m mostly concerned about offloading: when a collapsed loop runs on a device, having 32-bit arithmetic instead of 64-bit really makes a difference.

But yeah, I see the sentiment, though I can’t say I 100% agree with it. In my opinion we could do whatever we want as long as it conforms to the spec, but I get the point about unwanted regressions.

jhuber6 March 5, 2025, 3:19pm 7

32-bit will save a handful of registers in exchange for being unable to handle anything bigger than 4 billion or so. Considering that GPUs do operations in the billions, I’m unsure that’s a viable tradeoff. You could probably make it an opt-in thing if you really needed it.
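For concreteness (numbers mine, `touch` is a placeholder): two loops of 10^5 iterations each already collapse into 10^10 logical iterations, well past the ~4.29 * 10^9 that an unsigned i32 can represent, so a nest like this genuinely needs the i64 IV.

```c
void touch(int i, int j);

// 100000 * 100000 = 10^10 collapsed iterations -- an i32 IV would wrap
// here even though each individual counter fits in i32 with room to spare.
void big_nest(void) {
#pragma omp parallel for collapse(2)
  for (int i = 0; i < 100000; ++i)
    for (int j = 0; j < 100000; ++j)
      touch(i, j);
}
```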

zuban32 March 5, 2025, 3:33pm 8

What do you mean by ‘an opt-in thing’?

After internal testing with our downstream compiler I’ve observed no failures in any apps or benchmarks from the HPC domain, which is the main user of OpenMP on GPUs; ML is hardly relevant there. So I’m not sure how many people really need the extended limits of i64.

Moreover, it’s not just about the registers; often it’s also about limited FP64 capabilities, which either reduce throughput or even require emulation.

jhuber6 March 5, 2025, 3:36pm 9

I forget whether you’d need to change the compiler codegen, the runtime, or both, but you could make those opt-in through a clang flag and a CMake flag respectively.

Non-integer induction variables? I’d need to double check what that’s supposed to do.

zuban32 March 5, 2025, 3:44pm 10

I forget whether you’d need to change the compiler codegen, the runtime, or both

It seems like using -fopenmp-optimistic-collapse is enough.
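For anyone following along, a minimal usage sketch (file name and kernel are hypothetical); note that the flag is opt-in and, as far as I know, the compiler does not verify that the collapsed trip count actually fits in 32 bits:

```c
// init.c -- compile with:
//   clang -fopenmp -fopenmp-optimistic-collapse init.c
// to get an i32 collapsed IV for this nest instead of the default i64.
void init(int n, int m, float *a) {
#pragma omp parallel for collapse(2)
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < m; ++j)
      a[(long long)i * m + j] = 0.0f;
}
```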

Non-integer induction variables?

Sorry, my bad, I just meant limited 64-bit capabilities even for integer arithmetic.

For example, some of our hardware simply doesn’t have native support for 64-bit integers, so when people try to run kernels containing collapsed loops they observe an unexpected perf drop, even though their code has nothing wider than 32 bits. And that is confusing. The aforementioned option helps, but app developers are seldom aware of it.

In this case you need to check the target capabilities and use i64 if the target natively supports it. If your target does not support i64, then you need to stick with i32.

zuban32 March 5, 2025, 4:12pm 12

In this case need to check the target capabilities

I’m not entirely familiar with the clang codebase; do we have access to that information in Sema?

If your target does not support i64, then you need to stick with i32

Yeah, that may be a problem on our end: our compiler uses spir64 as a universal target for offloading. Perhaps we’ll have to fix that.