Quadratic scaling of bufferization

I’m experiencing very slow behavior of the OneShotBufferization pass with the function-boundary bufferization option enabled.

In my IR I have many functions (several thousand), and profiling reveals that a SymbolTable is created multiple times during the analysis for each func::CallOp. This happens in the getCalledFunction helpers, which are invoked by the BufferizableOpInterface implementations for func::CallOp and CallOpInterface.
Creating a symbol table has linear complexity in the number of symbols (i.e., functions). Since each function is typically called at least once within the IR, this leads to quadratic scaling of the bufferization process.

The attached profiler screenshot demonstrates the problem (I zoomed in on the bufferization pass, but the whole execution of ~2 s is almost entirely spent there, and the rest wouldn’t really be of interest).

This bad scaling (which, at least in my scenario, is a blocking problem) can be solved by leveraging the SymbolTableCollection class, which introduces a level of caching for symbol tables. I’ve looked into how to exploit such caching within the bufferization process, but unfortunately I haven’t found any solution that doesn’t involve a change to the APIs.
In more detail, these are the two options that I have explored:

Both approaches would require the user to keep the SymbolTableCollection in a valid state (which is why I introduced an additional method for symbol table invalidation). This aspect may be annoying, but I feel the benefits would far outweigh it.
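For reference, here is a minimal sketch contrasting the two lookup patterns; the helper names are mine, not the ones in the upstream implementation:

```cpp
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/IR/SymbolTable.h"
#include "mlir/Interfaces/CallInterfaces.h"

using namespace mlir;

// Uncached pattern: every lookup walks the symbols of the nearest
// enclosing symbol table, i.e. O(#functions) work per call site. With
// roughly one call per function, the whole pass becomes quadratic.
static func::FuncOp lookupCalleeUncached(CallOpInterface callOp) {
  auto sym =
      llvm::dyn_cast_if_present<SymbolRefAttr>(callOp.getCallableForCallee());
  if (!sym)
    return nullptr;
  return dyn_cast_or_null<func::FuncOp>(
      SymbolTable::lookupNearestSymbolFrom(callOp, sym));
}

// Cached pattern: SymbolTableCollection builds the SymbolTable for the
// enclosing symbol table op once and reuses it for later queries.
static func::FuncOp lookupCalleeCached(CallOpInterface callOp,
                                       SymbolTableCollection &symbolTables) {
  auto sym =
      llvm::dyn_cast_if_present<SymbolRefAttr>(callOp.getCallableForCallee());
  if (!sym)
    return nullptr;
  return dyn_cast_or_null<func::FuncOp>(
      symbolTables.lookupNearestSymbolFrom(callOp, sym));
}
```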

I’d like to hear opinions or other possible approaches. I’m available to implement the required changes: this is blocking for my downstream project, but I also feel it would be very beneficial for the whole community.

[Screenshot: profiler trace, 2025-04-30]

I would add a SymbolTableCollection member or DenseMap<CallOp, FuncOp> cache to FuncAnalysisState. This class is a state “extension” object that can be retrieved from AnalysisState. See CallOpInterface::getAliasingValues for an example.

The field would have to be declared as mutable, as you said. I don’t see a problem with that. This is a kind of cache, and we use mutable for other caches in MLIR (e.g., the TypeConverter uses mutable to cache type conversions).
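Roughly like this, assuming the cache is added to the existing FuncAnalysisState extension (the symbolTables member is the hypothetical addition; everything else is elided):

```cpp
#include "mlir/Dialect/Bufferization/Transforms/OneShotAnalysis.h"
#include "mlir/IR/SymbolTable.h"

using namespace mlir;
using namespace mlir::bufferization;

struct FuncAnalysisState : public OneShotAnalysisState::Extension {
  FuncAnalysisState(OneShotAnalysisState &state) : Extension(state) {}

  // Hypothetical cache; mutable so that const analysis hooks
  // (getAliasingValues etc.) can populate it on first use.
  mutable SymbolTableCollection symbolTables;

  // ... existing analysis maps elided ...
};
```

The interface implementations would then fetch it with getExtension&lt;FuncAnalysisState&gt;(), as they already do for the other members.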

getBufferType and bufferize don’t have an AnalysisState object at the moment. It could be added. But it may not even be necessary. These two functions are not called often and I don’t see them on your trace. The analysis functions (getAliasingValues etc.) are the ones that are called often and must be fast.

Edit: Actually, adding an AnalysisState to the bufferize function would be problematic because we have already started modifying the IR at that point, so the AnalysisState may no longer be valid.

The bufferize method corresponds to the fourth “column”, even though it is indeed called less often and is therefore narrower in the graph. I am struggling to find getBufferType, but the underlying behavior is exactly the same. In general, however, I don’t think it is a matter of how many times those functions are called (that would only change the complexity by a constant factor), but rather of the IR sweep that is required every time a function has to be retrieved.
The trace I reported is actually for a very small source. The number of functions in the IR will grow considerably in other tests I aim to perform, and even if getBufferType is not visible now, it surely will be later.
I will look into FuncAnalysisState, but if AnalysisState can’t then be given to bufferize & co., I feel it would only be a palliative solution.

Maybe we can add a new BufferizationState object to the bufferize + getBufferType methods. That would be an API change, so I would first check how far you can get by caching things during the analysis.

I have tried caching the symbol tables during the analysis, but the quadratic behavior predictably still shows up within bufferize as soon as I increase the source size.
I can work on modifying the bufferize method (together with getBufferType and possibly others) so that it also receives a SymbolTableCollection object, but this would inevitably require the user to keep the cached symbol tables in a valid state. That is, if the user for some reason needs to delete an operation with the SymbolTable trait, or to introduce / replace / erase a Symbol operation, then these modifications are expected to be performed through the SymbolTableCollection object.
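For example, a hypothetical replacement helper would have to route every symbol change through the cached tables:

```cpp
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/IR/SymbolTable.h"

using namespace mlir;

// Sketch: replacing a function while keeping the cached tables in sync.
static void replaceFunctionSymbol(ModuleOp module, func::FuncOp oldFunc,
                                  func::FuncOp newFunc,
                                  SymbolTableCollection &symbolTables) {
  // getSymbolTable builds the table for `module` once and caches it.
  SymbolTable &table = symbolTables.getSymbolTable(module);
  table.insert(newFunc); // registers the new symbol (renames on clashes)
  table.erase(oldFunc);  // unregisters the old symbol and erases the op
}
```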

[Image]

Just swinging by to express that breaking the API feels reasonable here.

Ideally, I don’t want to tie the bufferize interface method signature to SymbolTable. I designed the bufferization infrastructure in such a way that all module-related functionality is separate and optional: i.e., you have One-Shot Bufferize and One-Shot Module Bufferize. That’s because many users only need function bufferization. It was also a way of keeping the core bufferization a bit simpler. Adding a SymbolTable to bufferize would break that design a bit.

What we could do: add a BufferizationState & parameter to bufferize, along with an extension mechanism in the same way as OneShotAnalysisState::Extension. The only method that BufferizationState would have for now is one for retrieving an extension. We could also move the BufferizationStatistics in there. Then there would be a FuncBufferizationState that carries the SymbolTableCollection. The bufferize functions in FuncBufferizableOpInterfaceImpl would be responsible for updating the symbol table in the bufferization state extension as needed.
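Roughly along these lines (all names are tentative):

```cpp
#include <memory>
#include <utility>

#include "llvm/ADT/DenseMap.h"
#include "mlir/IR/SymbolTable.h"
#include "mlir/Support/TypeID.h"

using namespace mlir;

// Tentative sketch of the proposed BufferizationState, mirroring the
// OneShotAnalysisState::Extension mechanism.
class BufferizationState {
public:
  struct Extension {
    virtual ~Extension() = default;
  };

  // Creates and registers an extension of the given type.
  template <typename Ty, typename... Args>
  Ty &addExtension(Args &&...args) {
    auto ext = std::make_unique<Ty>(std::forward<Args>(args)...);
    Ty &ref = *ext;
    extensions[TypeID::get<Ty>()] = std::move(ext);
    return ref;
  }

  // Returns the extension of the given type, or nullptr if absent.
  template <typename Ty>
  Ty *getExtension() {
    auto it = extensions.find(TypeID::get<Ty>());
    return it == extensions.end() ? nullptr
                                  : static_cast<Ty *>(it->second.get());
  }

private:
  llvm::DenseMap<TypeID, std::unique_ptr<Extension>> extensions;
};

// Module-bufferization extension carrying the symbol table cache.
struct FuncBufferizationState : public BufferizationState::Extension {
  SymbolTableCollection symbolTables;
};
```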

I’m currently implementing what you suggested. However, I see multiple possible problems with the extension approach.

Edit:
PR is here: [MLIR] Add bufferization state class to OneShotBufferization pass by mscuttari · Pull Request #138143 · llvm/llvm-project · GitHub
I think I found a compromise by adding a bufferization option to enable the caching of symbol tables.
One small note: I intentionally avoided having separate BufferizationState and OneShotBufferizationState classes. I don’t know if the separation of the analysis classes is just a remnant of progressive modifications, but the downcast to OneShotAnalysisState, performed without any dynamic check, doesn’t seem safe to me, so I avoided this additional level.
Please let me know if this is reasonable, either here or in the PR.

I agree with that. All BufferizableOpInterface implementations must update the symbol table if they modify symbols. The reason for the extension mechanism was to isolate module-related bufferization logic from the “core” bufferization. Given that we need to update the symbol table for arith.constant and potentially other globals, I’d say let’s just put the symbol table directly in BufferizationState.
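That is, roughly (accessor name tentative):

```cpp
#include "mlir/IR/SymbolTable.h"

using namespace mlir;

// Simplified design: the symbol tables live directly in the state,
// without the extension indirection.
class BufferizationState {
public:
  // Every BufferizableOpInterface implementation that creates, replaces,
  // or erases a symbol must go through this cache.
  SymbolTableCollection &getSymbolTables() { return symbolTables; }

private:
  SymbolTableCollection symbolTables;
};
```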

“One-Shot” is an analysis. The distinction between “One-Shot” and “normal” (i.e., “always copy”) does not matter after the analysis part is done. So your implementation sounds reasonable to me.

I’m wondering if it’s worth adding bool cacheSymbolTables = false;. It does not reduce complexity in the BufferizableOpInterface implementations. And maintaining a symbol table even when it is never used does not sound like a large overhead either.

True, it does not reduce complexity. However, it can make the transition to the new APIs somewhat smoother (first defaulting to false, later to true, and finally removing the flag). Anyway, I’m OK with removing the option if you feel there is no need for such a transition period.

Edit: on the other hand, it is also true that, without the option, the separate extension would bring back the “complications” I mentioned. I will restore the simplified implementation, as you suggested.

I’m trying to address the last TODO related to the caching of symbol tables.

The cached symbol tables also need to be accessible within the getBufferType function, and therefore I was trying to update its signature to also receive an AnalysisState object. This, in turn, requires updating the bufferization::getBufferType methods. Such methods are almost always called from the getBufferType interface method, but in at least one case from bufferize, thus requiring either an AnalysisState or a BufferizationState depending on the situation.
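To make the dilemma concrete, here is an abridged sketch of the signatures involved (the exact forms are in the PR):

```cpp
// Today: no state object is available here.
FailureOr<BaseMemRefType> getBufferType(Value value,
                                        const BufferizationOptions &options);

// After the change: a state parameter is threaded through, but callers
// reach this function both from the analysis (AnalysisState) and from
// bufferize (BufferizationState).
FailureOr<BaseMemRefType> getBufferType(Value value,
                                        const BufferizationOptions &options,
                                        const BufferizationState &state);
```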

How should we address this problem? The fact that getBufferType gets called from bufferize even seems at odds with the description of the function itself, which states that its purpose is limited to pre-bufferization analyses.

I thought getBufferType is always called from bufferize (or from another getBufferType implementation). Is that not the case? When is it called from a place where we don’t have a BufferizationState?

You’re right, it’s the other way around: the function is mostly called from bufferize. Sorry for the confusion, I kind of lost myself during the API change.

However, one problematic point is here: resolveTensorOpOperandConflicts calls allocateTensorForShapedValue, which calls getBufferType. The first method does not have access to the bufferization state (should it?).

Edit: My last question actually made me dig deeper into the bufferization infrastructure, and from my understanding resolveTensorOpOperandConflicts should still be called during the actual bufferization process. I followed the chain of calls and updated the interface methods that need to be aware of the bufferization state. Here is the PR.
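For reference, an abridged sketch of the updated call chain (parameter lists shortened to the relevant parts; the exact signatures are in the PR):

```cpp
// bufferizeOp(...)
//   -> resolveTensorOpOperandConflicts(rewriter, op, analysisState,
//                                      bufferizationState)
//     -> allocateTensorForShapedValue(b, loc, shapedValue, options,
//                                     bufferizationState)
//       -> bufferization::getBufferType(value, options, bufferizationState)
```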