Can multiple constant folds be run in parallel?
December 4, 2025, 3:46am 1
For example, we have two Ops called TransposeOp and ConstantOp, and we defined a fold rule for TransposeOp (i.e. TransposeOp::fold). We have an IR graph with multiple TransposeOps, and one fold of a TransposeOp takes several seconds, so we would like multiple TransposeOp::fold calls to run in parallel. Is this possible?
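A minimal sketch of such a fold hook, for reference (the operand name behind adaptor.getInput(), the getPermutation() accessor, and the transposeElements() helper are simplified placeholders, not upstream MLIR):
// Constant-folding hook: if the operand already folds to a constant,
// compute the transposed elements eagerly and return them as an attribute,
// so the driver can replace the TransposeOp with a constant.
OpFoldResult TransposeOp::fold(FoldAdaptor adaptor) {
  auto input = dyn_cast_or_null<DenseElementsAttr>(adaptor.getInput());
  if (!input)
    return {};
  // This is the expensive part that takes several seconds per op.
  return transposeElements(input, getPermutation());
}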
jpienaar December 4, 2025, 4:28am 2
The default would only do this in parallel at pass granularity (e.g., if using a function pass, the folding would run in parallel on different functions). Folding is interleaved with other optimizations and, in the case of the greedy rewrite driver, is driven by a priority queue with a cost function. A pass that does the folding using something akin to a deferred folding walk could extract more parallelism, but that's not upstream.
One could also make the folders themselves parallel internally, although I think it would be difficult to reuse the outer threading pool when doing so (but speaking under correction).
Note that what I was referring to is the default Canonicalizer pass (MLIR: mlir Namespace Reference), which seems to only fold ops one by one.
jpienaar December 4, 2025, 7:33am 4
The Canonicalizer pass can be run at multiple granularities (e.g., -pass-pipeline='builtin.module(func.func(canonicalize))' runs it at func scope granularity), so it depends on how it is nested in your pipeline.
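When building the pipeline in C++ rather than on the command line, the equivalent is to nest the pass under func.func (a sketch, assuming a standard PassManager setup rooted at the module):
mlir::PassManager pm(&context);
// Nesting under func.func lets the pass manager canonicalize separate
// functions on separate threads.
pm.nest<mlir::func::FuncOp>().addPass(mlir::createCanonicalizerPass());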
This Canonicalizer pass is actually wrapped in another pass that is run only once:
class OuterPass : public impl::OuterBase<OuterPass> {
public:
OuterPass() noexcept {}
private:
void runOnOperation() final {
// this function runs only once on the entire graph
auto func = getOperation();
PassManager pre_pass(&getContext(), func::FuncOp::getOperationName());
// other passes omitted
(void)pre_pass.run(func);
func.walk(/* omitted */);
PassManager post_pass(&getContext(), func::FuncOp::getOperationName());
// other passes omitted
post_pass.addPass(createCanonicalizerPass()); // canonicalizer is called here
(void)post_pass.run(func);
}
};
How can I change the granularity of the Canonicalizer pass inside it?
jpienaar December 4, 2025, 12:29pm 6
PassManagerTest, ExecutionAction may be the most direct example of creating a nested PassManager.
Although from the setup here, it seems you are using a dynamic pass pipeline and that OuterPass is already at FuncOp granularity. So if you have two foldable ops in separate FuncOps, they should already be able to be scheduled in parallel. Next I'd check whether the context is threaded where OuterPass is created.
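If it isn't, something along these lines enables it (isMultithreadingEnabled and enableMultithreading are existing MLIRContext methods; where exactly to call this depends on how your context is created):
// The pass manager can only schedule func.func passes concurrently if the
// context has multithreading enabled.
if (!context.isMultithreadingEnabled())
  context.enableMultithreading();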
You can’t do it in general within a function (@jpienaar explained how it works across functions), because folding can modify operations in place, and with everything being linked together you run the risk of race conditions.
In some specific cases (like a graph-oriented IR), you could write a process / worklist where you introduce some carefully designed concurrency. But the coupling needed between the parallel scheme and the assumptions of your particular IR makes it impossible to include in a generic pass like canonicalization. So MLIR provides you all the tools (including thread pool abstractions), but you'll need to assemble it yourself.
Determinism is another concern with such a scheme (to ensure determinism, you need to exploit parallelism only across things that aren't ordered with respect to each other).
First, we have only one FuncOp in the entire graph, though the graph has multiple TransposeOps.
And this is how OuterPass is used:
auto createOuterPass() {
return std::make_unique<OuterPass>();
}
// in a different file
class MyContext : public MLIRContext {
public:
// constructor omitted
Data outputData;
};
Data run() noexcept {
MyContext &context = ...
::mlir::PassManager pm(&context);
pm.enableVerifier(false);
mlir::OpPassManager &FuncPM = pm.nest<mlir::func::FuncOp>(); // there is only one FuncOp in one Module
// other passes of FuncPM omitted
FuncPM.addPass(createOuterPass());
// other passes of FuncPM omitted
pm.addPass(createOutputPass()); // this pass writes to context.outputData
auto module = ... // an object of type mlir::ModuleOp
auto rst = pm.run(module);
// some sorts of cleanup omitted
return context.outputData;
}
Let me state the issue more clearly: in a "function" (actually a computation graph), there are a number of transpose operations on constant matrices. I defined the fold on that TransposeOp, which removes the op (replacing it with the transposed constant matrix) if the operand is a constant matrix. However, such a fold takes several seconds. I am looking for a way to run multiple such folds in parallel.
I think my answer gave you the direction for this: you need to define your own graph traversal and invoke fold yourself. The algorithm can depend on whether your graph has cycles, for example.
Otherwise it's pretty straightforward; I asked an AI to write such a parallel topological traversal with dependency enforcement:
// Parallel topological traversal of the operations in a block: each
// operation is visited only after all of its operands that are defined by
// other operations in the same block have been visited. Block arguments
// (and values defined outside the block) are treated as already available.
// Assumes mlir/IR/Threading.h, llvm/ADT/DenseMap.h, <atomic> and <deque>
// are included, and the usual `using namespace mlir;`.
void parallel_visit(Block *block, function_ref<void(Operation *)> callback) {
  // Step 1: build the dependency graph (single-threaded).
  // std::atomic is neither copyable nor movable, so the per-op dependency
  // counters live in a deque (which never relocates its elements) and the
  // map only stores pointers to them.
  std::deque<std::atomic<unsigned>> counters;
  llvm::DenseMap<Operation *, std::atomic<unsigned> *> pendingDeps;
  llvm::DenseMap<Operation *, llvm::SmallVector<Operation *, 2>> users;
  llvm::SmallVector<Operation *, 16> readyOps;
  for (Operation &op : *block) {
    unsigned localDeps = 0;
    for (Value operand : op.getOperands()) {
      // Only operands produced by another operation in this block count as
      // dependencies; block arguments are ready by definition.
      if (auto result = dyn_cast<OpResult>(operand)) {
        Operation *defOp = result.getOwner();
        if (defOp->getBlock() == block && defOp != &op) {
          ++localDeps;
          users[defOp].push_back(&op);
        }
      }
    }
    if (localDeps == 0)
      readyOps.push_back(&op);
    else
      pendingDeps[&op] = &counters.emplace_back(localDeps);
  }
  // Note: in a graph region with cycles, ops on a cycle never become ready
  // and are simply not visited; the SSA-dominance case has no cycles.

  // Step 2: process the initially-ready operations in parallel. Each worker
  // keeps a local worklist: when it finishes an op, it decrements the
  // counters of that op's users and picks up any user whose counter just
  // reached zero. Exactly one worker observes each transition to zero, so
  // every op is processed exactly once. The maps are only read here.
  mlir::parallelForEach(
      block->getParentOp()->getContext(), readyOps, [&](Operation *op) {
        llvm::SmallVector<Operation *, 8> localQueue{op};
        while (!localQueue.empty()) {
          Operation *cur = localQueue.pop_back_val();
          callback(cur);
          auto it = users.find(cur);
          if (it == users.end())
            continue;
          for (Operation *user : it->second) {
            std::atomic<unsigned> *left = pendingDeps.find(user)->second;
            if (left->fetch_sub(1, std::memory_order_acq_rel) == 1)
              localQueue.push_back(user);
          }
        }
      });
}
Now this won’t work out of the box, because you still need some locking around the updates to the SSA def-use chains (but maybe that can be done inside the callback)
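A minimal sketch of what that could look like, assuming the expensive part of the fold is pure computation (computeTransposedConstant is a hypothetical helper, and TransposeOp / ConstantOp are the ops from the question, not upstream MLIR). Only the IR mutation is serialized, and erasing is deferred so the traversal never touches a freed Operation:
std::mutex irMutex;
std::vector<Operation *> toErase;
parallel_visit(block, [&](Operation *op) {
  auto transpose = dyn_cast<TransposeOp>(op);
  if (!transpose)
    return;
  // Heavy, purely functional work happens outside the lock.
  DenseElementsAttr folded = computeTransposedConstant(transpose);
  if (!folded)
    return;
  // Only the def-use chain updates are serialized.
  std::lock_guard<std::mutex> lock(irMutex);
  OpBuilder builder(transpose);
  auto cst = builder.create<ConstantOp>(transpose.getLoc(), folded.getType(), folded);
  transpose.getResult().replaceAllUsesWith(cst.getResult());
  toErase.push_back(transpose);
});
for (Operation *op : toErase)
  op->erase();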
A workaround option you have is:
- Outline each of the "constant + transpose" pairs into a separate function, i.e., pull it into its own function and call it from the main one. (This only involves "moving" the heavy operations plus creating the func ops.)
- Run canonicalize at the func op granularity and it’ll happen in parallel across functions.
- Inline them back.
You’ll only need to implement (1), which should be straightforward.
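For (2) and (3), the pipeline could look roughly like this (createMyOutliningPass is the hypothetical pass you would write for (1); createCanonicalizerPass and createInlinerPass are upstream):
mlir::PassManager pm(&context);
pm.addPass(createMyOutliningPass());  // (1) user-written: outline constant + transpose
// (2) runs per-function, so the outlined functions canonicalize in parallel
pm.nest<mlir::func::FuncOp>().addPass(mlir::createCanonicalizerPass());
pm.addPass(mlir::createInlinerPass());  // (3) inline the outlined functions back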