Quadratic scaling of bufferization
I’m experiencing very slow behavior of the `OneShotBufferization` pass with the function-boundaries bufferization option enabled.
In my IR I have a lot of functions (some thousands), and profiling reveals that a `SymbolTable` is being created multiple times during the analysis for each `func::CallOp`. This happens in the `getCalledFunction` functions, which are invoked by the implementations of the `BufferizableOpInterface` interface for `func::CallOp` and `CallOpInterface`.
The creation of a symbol table has a linear complexity with respect to the number of symbols (i.e., functions). Each function is typically invoked at least once within the IR, and this leads to a quadratic scaling of the bufferization process.
The attached screenshot of the profiler demonstrates the problem (I zoomed in on the bufferization pass, but the whole ~2s execution is almost entirely spent there, and the rest wouldn’t really be of interest).
This bad scaling (which, at least in my scenario, is a blocking problem) can be solved by leveraging the `SymbolTableCollection` class, which introduces a level of caching for the symbol tables. I’ve looked into how to leverage such caching within the bufferization process, but unfortunately I haven’t found any solution that doesn’t involve a change in the APIs.
In more detail, these are the two options that I have explored:
- A `SymbolTableCollection` member can be added to the `AnalysisState` class. However, to keep the APIs untouched, this member would have to be marked as `mutable`, because the analysis state is passed as a const reference to the interface methods. I personally don’t like this workaround very much; it just doesn’t look clean. Moreover, other methods in the interface (e.g., `bufferize`) don’t have access to the analysis state. Do they all need it? I don’t know about all of them, but looking at the aforementioned interface implementations, I see that at least the `getBufferType` and `bufferize` methods would need it. Since this problem is tied to the operation semantics, I guess the safe choice would be to make the state available in all methods.
- Adding a `SymbolTableCollection&` argument to all the interface methods. This avoids the `mutable` qualifier, but the other drawbacks of the previous approach would still apply.
Both approaches would require the user to keep the `SymbolTableCollection` in a valid state (which is why I introduced an additional method for symbol table invalidation). This aspect may be annoying, but I feel the benefits would far outweigh it.
I’d like to hear opinions or other possible approaches. I’m available to implement the required changes, as this is blocking for my downstream project, but I also feel this would be very beneficial for the whole community.
I would add a `SymbolTableCollection` member or a `DenseMap<CallOp, FuncOp>` cache to `FuncAnalysisState`. This class is a state “extension” object that can be retrieved from `AnalysisState`. See `CallOpInterface::getAliasingValues` for an example.
The field would have to be declared `mutable`, as you said. I don’t see a problem with that. It is a kind of cache, and we use `mutable` for other caches in MLIR (e.g., the `TypeConverter` uses `mutable` to cache type conversions).
`getBufferType` and `bufferize` don’t have an `AnalysisState` object at the moment. It could be added, but it may not even be necessary. These two functions are not called often, and I don’t see them in your trace. The analysis functions (`getAliasingValues` etc.) are the ones that are called often and must be fast.
edit: Actually, adding an `AnalysisState` to the `bufferize` function would be problematic because we have already started modifying the IR at that point, so the `AnalysisState` may no longer be valid.
The `bufferize` method corresponds to the fourth “column”, even though it is indeed called less often and is therefore less wide in the graph. I am struggling to find `getBufferType`, but the underlying behavior is exactly the same. In general, however, I don’t think it is a matter of how many times those functions are called (that would only change the complexity by a constant factor), but rather of the IR sweep that is needed every time a function has to be retrieved.
The trace I reported is actually for a very small source. The number of functions in the IR will scale considerably in other tests I aim to perform, and even if `getBufferType` is not visible now, it surely will become so later.
I will look into `FuncAnalysisState`, but if `AnalysisState` can’t then be given to `bufferize` & co., I feel it would only be a palliative solution.
Maybe we can add a new `BufferizationState` object to the `bufferize` + `getBufferType` methods. That would be an API change, so I would first check how far you can get with caching things during the analysis.
I have tried caching the symbol tables during the analyses, but the quadratic behavior predictably still shows up within `bufferize` as soon as I increase the source size.
I can work on modifying the `bufferize` method (together with `getBufferType` and possibly others) to also receive a `SymbolTableCollection` object, but this would inevitably require the user to keep the cached symbol tables in a valid state. That is, if the user for some reason needs to delete an operation with the `SymbolTable` trait, or introduce / replace / erase a `Symbol` operation, then these modifications are expected to be performed through the `SymbolTableCollection` object.
Just swinging by to express that breaking the API feels reasonable here.
Ideally, I don’t want to tie the `bufferize` interface method signature to `SymbolTable`. I designed the bufferization infrastructure in such a way that all module-related functionality is separate and optional: i.e., you have One-Shot Bufferize and One-Shot Module Bufferize. That’s because many users only need function bufferization. It was also a way of keeping the core bufferization a bit simpler. Adding a `SymbolTable` to `bufferize` would break that design a bit.
What we could do: add a `BufferizationState &` parameter to `bufferize`, along with an extension mechanism in the same way as `OneShotAnalysisState::Extension`. The only method that we would have in `BufferizationState` for now: retrieving an extension. We can also move the `BufferizationStatistics` in there. Then there’s a `FuncBufferizationState` that carries the `SymbolTableCollection`. The `bufferize` functions in `FuncBufferizableOpInterfaceImpl` are responsible for updating the symbol table in the bufferization state extension as needed.
I’m currently implementing what you suggested. However, I see multiple possible problems in the extension approach.
- Not all symbols are necessarily functions. In the case of a non-function operation with symbol semantics that needs to be bufferized, the `SymbolTableCollection` would still have to be updated if the extension has been registered. However, one may think that `FuncBufferizationState` is relegated to functions and function-like operations, and thus needs no attention from apparently separate domains. The extension mechanism hides this possible interplay between the bufferization of different dialects. I would, at the very least, opt for a `SymbolBufferizationState` class rather than `FuncBufferizationState`.
- I don’t see a reason for not always registering the (now) `SymbolBufferizationState` extension.
- At this point, why not keep the `SymbolTableCollection` directly in the `BufferizationState` class? Even if one were aware of the previous interplay, it would still be necessary to check for a registered `SymbolBufferizationState` and update it. And if the extension is supposedly always registered, then why should we even check for it?
Edit: the PR is here: [MLIR] Add bufferization state class to OneShotBufferization pass by mscuttari · Pull Request #138143 · llvm/llvm-project · GitHub
I think I found a compromise by adding a bufferization option to enable the caching of symbol tables.
One small note: I intentionally avoided having separate `BufferizationState` and `OneShotBufferizationState` classes. I don’t know if the separation of the analysis classes is just a remnant of progressive modifications, but the downcast to `OneShotAnalysisState` without any dynamic cast doesn’t seem safe to me, and therefore I avoided this additional level.
Please let me know if this is reasonable, either here or in the PR.
I agree with that. All `BufferizableOpInterface` implementations must update the symbol table if they modify symbols. The reason for the extension mechanism was to isolate module-related bufferization logic from the “core” bufferization. Given that we need to update the symbol table for `arith.constant` and potentially other globals, I’d say let’s just put the symbol table directly in `BufferizationState`.
“One-Shot” is an analysis. The distinction between “One-Shot” and “normal” (i.e., “always copy”) does not matter after the analysis part is done. So your implementation sounds reasonable to me.
I’m wondering if it’s worth adding `bool cacheSymbolTables = false;`? It does not reduce complexity in the `BufferizableOpInterface` implementations. And maintaining a symbol table even when it is never used does not sound like a large overhead either.
True, it does not reduce complexity. However, it can make the transition to the new APIs somewhat smoother (first by defaulting to `false`, later to `true`, and finally by removing the flag). Anyway, I’m OK with removing the option if you feel there is no need for such a transition period.
Edit: on the other hand, it is also true that, without the option, the separate extension would bring back the “complications” I was mentioning. I will bring back the simplified implementation as you suggested.
I’m trying to address the last TODO related to the caching of symbol tables.
The cached symbol tables need to be accessible also within the `getBufferType` function, and therefore I was trying to update its signature to also receive an `AnalysisState` object. This, in turn, requires updating the `bufferization::getBufferType` methods. Such methods are almost always called from the `getBufferType` interface method, but in at least one case from `bufferize`, thus requiring either `AnalysisState` or `BufferizationState` depending on the situation.
How should we address this problem? The fact that `getBufferType` gets called from `bufferize` even seems to contradict the description of the function itself, which states that its purpose is limited to pre-bufferization analyses.
I thought `getBufferType` is always called from `bufferize` (or another `getBufferType` implementation). Is that not the case? When is it called from a place where we don’t have a `BufferizationState`?
You’re right, it’s the other way around, with the function being called mostly from `bufferize`. Sorry for the confusion; I kind of lost myself during the API change.
However, one problematic point is here: `resolveTensorOpOperandConflicts` calls `allocateTensorForShapedValue`, which calls `getBufferType`. The first method does not have access to the bufferization state (should it have?).
Edit: My last question actually made me dig deeper into the bufferization infrastructure, and from my understanding `resolveTensorOpOperandConflicts` should still be called during the actual bufferization process. I followed the chain of calls and updated the interface methods that need to be aware of the bufferization state. Here is the PR.