Why PyTorch does not need a new standardized operator set

with @gchanan

Background: There is no right answer, only trade-offs

There have been many past efforts to build a standardized operator set for PyTorch:

So what did we learn from all of these prior efforts? The main lesson is that there is no “best” IR, just many trade-offs. Different IRs are better (and worse) for different use cases.

Another key lesson is that decompositions destroy information that could be useful to downstream compilers. It is much harder to go from a lower-level IR to a higher-level IR than the other way around, so it is far better to leave things at a higher level and let the backend progressively lower them as needed. The high-level original Torch IR can work for everyone, because it can easily be lowered to all the IRs listed above.

The other lesson is that making everyone use the same IR dialect is a false requirement. Looking at the above list of IRs, many of them are actively used and working well. There is no practical downside to having multiple IRs, since all of them can be created from the original PyTorch program. If we were to pick one winner and impose it on all users, things would be much worse.

A better way: Configurable IR dialects

What we have today is actually a lot more powerful than picking a single IR. We have a configurable library of decompositions that allows every backend to rapidly create the perfect IR for its use case. Apply all decompositions: you get something close to PrimTorch. Apply no decompositions: you get something like TorchScript IR. Apply a medium amount of decompositions: you get close to the other IRs listed above (with a different set of decompositions for each one). It is a sliding scale, where tuning the IR to meet your needs is just a matter of changing a configuration in your call to AOTAutograd, as sketched below.
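For example, here is a minimal sketch of a toy torch.compile backend that picks its own dialect by handing a decomposition table to AOTAutograd. The helpers used here (`get_decompositions`, `aot_autograd`, `make_boxed_func`) are internal utilities, so exact names and signatures may shift between releases; treat this as an illustration of the knob, not a definitive recipe.

```python
import torch
from functorch.compile import make_boxed_func
from torch._decomp import get_decompositions
from torch._dynamo.backends.common import aot_autograd

# Choose the dialect: decompose only these ops; everything else stays high level.
decomps = get_decompositions([
    torch.ops.aten.addmm,               # e.g. expand addmm into mm + add
    torch.ops.aten.native_layer_norm,
])

def fw_compiler(gm: torch.fx.GraphModule, example_inputs):
    gm.graph.print_tabular()            # inspect the dialect this backend sees
    return make_boxed_func(gm.forward)  # "compile" by just running the FX graph

my_backend = aot_autograd(fw_compiler=fw_compiler, decompositions=decomps)

@torch.compile(backend=my_backend)
def f(x, w, b):
    return torch.nn.functional.layer_norm(torch.addmm(b, x, w), (8,))

f(torch.randn(4, 8), torch.randn(8, 8), torch.randn(8))
```

Moving along the sliding scale is just a matter of changing the list of ops passed to `get_decompositions`.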

For portable representations (before passing to a backend), we should keep things in Torch IR or in non-decomposed, non-functionalized pre-grad ATen IR. This maintains maximum flexibility, and we provide tools to easily lower this IR to whatever IR you want for your use case. These are the only IRs that work with all backends.
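As a sketch of what “lower as needed” looks like in practice (assuming the `make_fx` and `get_decompositions` utilities; behavior can vary across versions), the same function can be captured at a high level and then re-traced into whichever dialect a backend chooses:

```python
import torch
from torch._decomp import get_decompositions
from torch.fx.experimental.proxy_tensor import make_fx

def f(x):
    return torch.nn.functional.gelu(torch.nn.functional.softmax(x, dim=-1))

x = torch.randn(4, 8)

# No decompositions: the graph stays close to the original high-level ops.
high_level = make_fx(f)(x)
print(high_level.graph)   # aten._softmax / aten.gelu should appear as single nodes

# Backend-chosen decompositions: a lower-level dialect of the same program.
table = get_decompositions([torch.ops.aten._softmax, torch.ops.aten.gelu])
lowered = make_fx(f, decomposition_table=table)(x)
print(lowered.graph)      # the same ops expanded into more primitive operations
```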

Even though every set of selected decompositions defines a different IR dialect, there is still a lot of shared code between backends. The library of decompositions is shared. The tools for working with the IRs (FX) are shared. And perhaps most importantly, AOTAutograd for dealing with training (which is so hard that most backends don’t even try) is shared and easily reused across backends.

In this design, since decompositions run inside the backend (and backends can define custom decompositions), the mapping from operators to decompositions is not a backwards-compatibility surface. If you run a saved PyTorch model with a newer backend, you will use the newer set of decompositions. From a maintainability standpoint, this is far better because it allows backends to evolve their IR dialect over time while maintaining support for serialized models.
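A small sketch of why this is not a BC surface (hypothetical backend code): the decomposition table is just a dictionary owned by the backend, so a newer backend release can change or extend it without touching anything that was serialized.

```python
import torch
from torch._decomp import get_decompositions

# The backend owns its decomposition table; it is a dict of overload -> rule.
decomps = get_decompositions([torch.ops.aten.addmm])

# A newer release of this backend decides to lower silu itself. Previously
# serialized models pick this up automatically the next time they are run.
def silu_decomp(x):
    return x * torch.sigmoid(x)

decomps[torch.ops.aten.silu.default] = silu_decomp
# `decomps` is then passed to AOTAutograd / make_fx exactly as in the earlier sketch.
```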

To be precise, the BC contract is as follows:

We could also impose a requirement that all new operators have decompositions; however, we currently feel the friction here is not worth the benefit.

How can we make things even better?

Obviously, things are not perfect. There are two main areas of investment where we are looking for contributions:

The webpage might look something like:

Questions

Q: What about serialization?

Serialization should be done with no decompositions, pre-autograd, and without functionalization. Decompositions and functionalization should be applied after loading the model. This preserves the maximum amount of information and works with all backends. It also preserves flexibility, allowing backends to change their dialect without breaking backwards compatibility.
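As a rough illustration of the decompose-after-load half of this contract, here is a sketch using the torch.export APIs. Note that torch.export does apply functionalization, so this demonstrates only deferring decompositions to load time, not the full pre-autograd, non-functionalized form described above.

```python
import torch
from torch._decomp import get_decompositions

class M(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.gelu(x)

# Save the program without applying any decompositions.
ep = torch.export.export(M(), (torch.randn(4, 8),))
torch.export.save(ep, "model.pt2")

# Later, a (possibly newer) backend loads it and lowers to its own dialect.
ep = torch.export.load("model.pt2")
lowered = ep.run_decompositions(get_decompositions([torch.ops.aten.gelu]))
print(lowered.graph_module.graph)  # gelu expanded into more primitive ops
```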