[RFC] Grouping ops in TCP
January 28, 2023, 12:35am
Here is our proposed design for grouping ops in TCP: [Public] TCP Design - Groups
Please feel free to add your feedback in the document or in this thread.
Thanks,
Raghavan
(on behalf of the ML Compiler Team at Cruise)
rengolin January 28, 2023, 12:36pm
Once you have a tcp.group of tcp ops, what do you lower them to?
If you lower them to linalg (which doesn't have a fusion op), what guarantees that I can still fuse them later? What is the difference between lowering the same sequence of tcp ops outside of a tcp.group and inside it?
If the lowering is just a sequence of linalg.generic ops, they can still get reordered before the pass that packs them.
Once we pack and block our tensors, we create parallel loops with tiled linalg ops in the inner loop, and those ops can still be fused with ops outside the nested loops. If we could keep a tcp.isolated_group with linalg ops inside, we wouldn't need to look at loop iterations at all: we could pack, tile, and fuse inside the group before removing it.
So the last cleanup pass would just remove the tcp.isolated_group ops, because they've done their part and have no further semantics after lowering.
Is that the plan for those ops?
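The flow described above (transform only inside an isolated group, then strip the group in a final cleanup pass) can be illustrated with a toy IR. This is a minimal Python sketch under stated assumptions, not MLIR or TCP code: `Op`, `Group`, `transform_groups`, `remove_groups`, and `pack_tile_fuse` are hypothetical names, `Group` merely stands in for tcp.isolated_group, and the "transform" only tags ops instead of actually packing, tiling, or fusing them.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Union


@dataclass
class Op:
    """Hypothetical flat op, identified only by its name."""
    name: str
    tags: List[str] = field(default_factory=list)


@dataclass
class Group:
    """Hypothetical stand-in for tcp.isolated_group: a region of ops
    that transforms treat as one unit."""
    body: List["Node"]


Node = Union[Op, Group]


def transform_groups(module: List[Node],
                     transform: Callable[[List[Node]], List[Node]]) -> List[Node]:
    """Apply `transform` to each top-level group's body only; ops outside
    groups are left untouched and never mixed into a group."""
    out: List[Node] = []
    for node in module:
        if isinstance(node, Group):
            out.append(Group(transform(list(node.body))))
        else:
            out.append(node)
    return out


def remove_groups(module: List[Node]) -> List[Op]:
    """Final cleanup: inline every group's body in place (recursively).
    After this, the group boundaries carry no further semantics."""
    flat: List[Op] = []
    for node in module:
        if isinstance(node, Group):
            flat.extend(remove_groups(node.body))
        else:
            flat.append(node)
    return flat


def pack_tile_fuse(body: List[Node]) -> List[Node]:
    # Placeholder for pack/tile/fuse: tag each op to record that it was
    # processed as part of its group; nested groups pass through unchanged.
    return [Op(n.name, n.tags + ["tiled+fused"]) if isinstance(n, Op) else n
            for n in body]


module: List[Node] = [
    Op("linalg.fill"),
    Group([Op("linalg.matmul"), Op("linalg.generic")]),
    Op("linalg.generic"),
]
lowered = remove_groups(transform_groups(module, pack_tile_fuse))
for op in lowered:
    print(op.name, op.tags)
```

In this toy, only the two ops inside the group get tagged, and the cleanup leaves a flat list of four ops with no group wrappers. Because the transform only ever sees a group's body, it cannot pull in or interleave ops from outside the group.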
It is possible to lower the ops inside the groups in different ways, depending on the use case. Here are some examples of lowering ops in a tcp.isolated_group:
- If the group indicates how a graph is partitioned for execution across multiple accelerators, then the operations inside it can be lowered via a device-specific pipeline (e.g. linalg-based codegen for CPU vs. linalg-based codegen for GPU).
- If the group represents an elementwise fusion and needs to be lowered to linalg, we could lower it to a single linalg.generic. This should have a similar effect to running the pass --linalg-fuse-elementwise-ops on the linalg.generic ops corresponding to the ops inside the group.
- If the group represents a conv-relu fusion, it can be lowered to a call to the corresponding cuDNN API.
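As a rough illustration of the elementwise case, and of choosing a lowering per kind of group, here is a minimal Python sketch. It is not TCP or linalg code: plain per-element functions stand in for elementwise tcp ops, `fuse_elementwise` models collapsing them into the body of one linalg.generic, and the group-kind strings and strategy descriptions in `lower_group` are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-in for an elementwise tcp op: a per-element body.
ElemFn = Callable[[float], float]


def run_unfused(ops: Sequence[ElemFn], xs: List[float]) -> List[float]:
    """One loop per op, materializing an intermediate list each time,
    like a chain of separate linalg.generic ops."""
    for op in ops:
        xs = [op(x) for x in xs]
    return xs


def fuse_elementwise(ops: Sequence[ElemFn]) -> ElemFn:
    """Compose the per-element bodies into a single body, analogous to
    one linalg.generic whose region applies every op in order."""
    def body(x: float) -> float:
        for op in ops:
            x = op(x)
        return x
    return body


def run_fused(ops: Sequence[ElemFn], xs: List[float]) -> List[float]:
    """A single loop over the data using the fused body."""
    body = fuse_elementwise(ops)
    return [body(x) for x in xs]


def lower_group(kind: str, ops: Sequence[ElemFn]) -> str:
    """Choose a lowering strategy from what the group represents.
    Kind names and returned descriptions are hypothetical."""
    strategies: Dict[str, Callable[[], str]] = {
        "device_partition": lambda: "device-specific pipeline",
        "elementwise": lambda: f"single fused generic over {len(ops)} ops",
        "conv_relu": lambda: "vendor library call (e.g. cuDNN)",
    }
    if kind not in strategies:
        raise ValueError(f"unknown group kind: {kind}")
    return strategies[kind]()


def add_one(x: float) -> float:
    return x + 1.0


def relu(x: float) -> float:
    return max(x, 0.0)


def double(x: float) -> float:
    return 2.0 * x


group = [add_one, relu, double]
xs = [-3.0, -0.5, 2.0]
print(run_fused(group, xs))                            # [0.0, 1.0, 6.0]
print(run_fused(group, xs) == run_unfused(group, xs))  # True
print(lower_group("elementwise", group))               # single fused generic over 3 ops
```

Fusing is valid here because each elementwise op reads only its own element, so one pass with the composed body gives the same result as one pass per op while skipping the intermediate lists.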
That is possible too. We could lower only the region inside a group op to linalg, which would give you what you need, IIUC.