RFC: Modular TensorFlow Graph C API by Solaryee · Pull Request #318 · tensorflow/community

Design review meeting notes

@annarev: Should structs populated by plug-in be prefixed with TP_ instead of just P, to be consistent with StreamExecutor C API? (Original comment)
@ShengYang1: Yes. Changed the names.

@annarev: Should the void* optimizer parameter to P_Optimize be a struct? (Original comment)
@ShengYang1: Yes, it is a struct, following the Compute function in the kernel C API RFC.

```c
// Struct for Optimizer. Plug-in authors must provide an optimize function.
// Creation and deletion functions are optional.
typedef struct TP_Optimizer {
  size_t struct_size;
  void* ext;  // reserved for future use
  void* (*create_func)();
  void (*optimize_func)(void*, TF_Buffer*, TF_Buffer*);
  void (*destroy_func)(void*);
} TP_Optimizer;
```
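For illustration, a minimal sketch of how a plug-in might populate this struct. Only the struct layout comes from the RFC; the plugin_* names are hypothetical.

```c
// Hypothetical plug-in callbacks matching the TP_Optimizer signatures above.
static void* plugin_create() {
  return NULL;  // allocate per-optimizer state here if needed
}

static void plugin_optimize(void* optimizer, TF_Buffer* graph_buf,
                            TF_Buffer* optimized_graph_buf) {
  // Deserialize the GraphDef from graph_buf, rewrite it (e.g., fuse a group
  // of ops into one larger op), and serialize the result into
  // optimized_graph_buf.
}

static void plugin_destroy(void* optimizer) {
  // Release anything allocated in plugin_create.
}

static void fill_optimizer(TP_Optimizer* opt) {
  opt->struct_size = sizeof(TP_Optimizer);  // lets TF check struct versioning
  opt->ext = NULL;
  opt->create_func = plugin_create;
  opt->optimize_func = plugin_optimize;
  opt->destroy_func = plugin_destroy;
}
```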

@kulinseth: Is the format for 'tf' dialect in the buffer TBD?
We have a graph optimizer pass that is already using MLIR. Do we have to go through GraphDef for this?
@aselle: MLIR is out of the scope of this RFC.
@rmlarsen: This is for GraphDef only. MLIR needs a separate effort.
@ematejska: How much migration effort would it take to move to MLIR-based graph optimizer?
@Jianhui-Li: Our optimizer mostly identifies a group of ops to fuse into one large op. We expect to make appropriate changes to match the new interface, but the optimizations should stay the same.
@ematejska: So we don’t know yet.
@aselle: Serialized protobufs are ABI-compatible. In theory, we could create an abstraction on top of both GraphDef and the tf dialect, but the benefits of doing so are dubious.
@rmlarsen: Could redefine TF_Buffer.
@sanjoy: If you only care about fusing small ops to a bigger op, we could create some other simpler representation as well.
@aselle: We could do this completely descriptively, but in practice it often ended up messy, with op attributes and other things.
@Jianhui-Li: We need a more flexible way. For example, depending on the accelerator type, we might want to offload a bigger chunk of ops.
@sanjoy: Rules should be able to handle attributes too, right?
@aselle: In principle, yes. This is similar to TFLite delegates where we write functions that take a graph, find a particular sequence of ops, and represent them in some other ways. The problem was when delegate writers assumed they could handle something, but did not cover it completely.
@ematejska: There could be a problem with new attributes too.
@aselle: Yes. Backward compatibility doesn’t prevent this from failing when there is a new attribute.
@penpornk: Optimizers might want shape inference too.
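For reference, the serialized GraphDef would cross the plug-in boundary in the existing C-API TF_Buffer struct (from tensorflow/c/c_api.h), which @rmlarsen notes could be redefined:

```c
// Existing TensorFlow C API struct that carries a serialized protobuf
// (here, the GraphDef) across the plug-in boundary.
typedef struct TF_Buffer {
  const void* data;
  size_t length;
  void (*data_deallocator)(void* data, size_t length);
} TF_Buffer;
```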

@aselle: Are there any version compatibility requirements for Grappler?
@rmlarsen: The C++ API surface has been stable for years. The order of the graph optimization passes changes over time, but these changes have never caused churn so far.
@penpornk: The area that could have versioning problems is the ops themselves.
@rmlarsen: Yes, for example, op foo has a v2, causing Grappler and XLA to no longer work.
It’s up to the plug-in authors to track and handle this.

@annarev: P_MessageToBuffer and P_BufferToMessage are not a part of the API surface, correct? (Original comment)
@ShengYang1: Yes, they are not part of the API surface.

@annarev: Can the configurations in TP_OptimizerConfigFns be TF_Bools instead of functions that return TF_Bool?
@ShengYang1: There are three possible values: true, false, and default. A TF_Bool cannot represent the default value.
@sanjoy: Can we just use a tristate?
@rmlarsen: Agreed. Explicit tristates are nicer than functions.
@yiqianglee: We can try that.
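A sketch of what the explicit tristate could look like; the names are illustrative rather than final.

```c
// Hypothetical tristate replacing the TF_Bool-returning config functions.
// "Default" defers to TensorFlow's existing setting for the pass.
typedef enum TF_TriState {
  TF_TriState_Default = 0,
  TF_TriState_Off,
  TF_TriState_On,
} TF_TriState;

// The config struct could then hold plain fields, one per Grappler pass:
typedef struct TP_OptimizerConfigs {
  size_t struct_size;
  void* ext;  // reserved for future use
  TF_TriState remapping;
  TF_TriState auto_mixed_precision;
  // ... one field per configurable pass ...
} TP_OptimizerConfigs;
```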

@annarev: Would existing optimizer configuration also be per-device-type? (Original comment)
@ShengYang1: Yes.

@ezhulenev: What does it mean in practice to have an optimizer per device? (Original comment)
@ShengYang1: Each PluggableDevice can register its own graph optimizer. We can also register a graph optimizer without registering a PluggableDevice with it. But a PluggableDevice cannot register multiple graph optimizers.
@ezhulenev: So if there are custom passes for “CPU”, “GPU”, and “XPU”, all three passes will be run after the main Grappler passes in a random order?
@yiqianglee: Yes.
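A hedged sketch of what per-device registration might look like, assuming illustrative struct and function names: TensorFlow would call an init function when loading the plug-in, and the plug-in fills in one device type, one optimizer, and its configs.

```c
// Illustrative registration parameters; names are not final API.
typedef struct TP_OptimizerRegistrationParams {
  size_t struct_size;
  void* ext;
  const char* device_type;       // e.g. "XPU"; ties the optimizer to a device
  TP_OptimizerConfigs* configs;  // per-pass settings (see tristate sketch)
  TP_Optimizer* optimizer;       // the callbacks defined earlier
} TP_OptimizerRegistrationParams;

// Implemented by the plug-in; called by TensorFlow at load time.
void TF_InitGraph(TP_OptimizerRegistrationParams* params, TF_Status* status) {
  params->device_type = "XPU";
  params->optimizer->optimize_func = plugin_optimize;
  // A graph optimizer can also be registered without a PluggableDevice,
  // but each PluggableDevice gets at most one optimizer.
}
```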

@sanjoy: Why do you have to tie a graph optimization pass to a PluggableDevice?
@yiqianglee: Pairing each PluggableDevice with its own optimizer makes it easier to limit the scope of graph transformation to the device.
@sanjoy: Even with them tied together, we’d still need to trust the plug-in to some extent.
@ezhulenev: One benefit with this approach is that you can just skip the pass if you don’t have the corresponding PluggableDevice.
@sanjoy: You could also skip the serialization/deserialization if there are no PluggableDevices loaded.
@rmlarsen: In multiple session.run() calls where some don’t contain plug-in code, we can avoid the serialization.

@ezhulenev: What about optimizer plug-ins with conflicting configurations?
@yiqianglee: This is exactly what we would like to get feedback on here.
@ezhulenev: Remapper only fuses ops placed on CPU and GPU. So it shouldn’t touch anything on those PluggableDevices in practice (unless they are registered as “GPU”).
@rmlarsen: We could do a logical AND, i.e., don’t run a pass if at least one plug-in sets enable=false for that pass.
@Jianhui-Li: We could also make plug-ins only able to turn a pass off (and not on). To turn on a pass that has previously been disabled, the user needs to make changes on the TensorFlow side.
@rmlarsen: Turning Grappler passes off could have negative performance implications.
@penpornk: How about having plug-ins print warnings whenever a pass is disabled?
@Jianhui-Li: Printing could help.
@rmlarsen: Each plug-in should also run a set of performance benchmarks to verify that the pass(es) it disables don’t hurt performance too much.
@penpornk: What if different plug-ins disable different passes? Each of them turns off only one pass and passes the performance benchmarks, but together they disable too many passes and everything becomes slow.
@ezhulenev: We could still run with remapper turned on for non-CPU/GPU device types.
@rmlarsen: Will need to add logic in the MetaOptimizer.
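A sketch of the logical-AND resolution in the MetaOptimizer, assuming the tristate from the earlier sketch (the helper name is hypothetical):

```c
// Resolve one pass's setting across all loaded plug-ins: any explicit Off
// wins (logical AND), an explicit On is honored otherwise, and Default
// falls through to TensorFlow's built-in setting for the pass.
TF_TriState ResolvePassSetting(const TF_TriState* plugin_settings,
                               int num_plugins) {
  TF_TriState result = TF_TriState_Default;
  for (int i = 0; i < num_plugins; ++i) {
    if (plugin_settings[i] == TF_TriState_Off) return TF_TriState_Off;
    if (plugin_settings[i] == TF_TriState_On) result = TF_TriState_On;
  }
  return result;
}
```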

@kulinseth: Do these plug-in passes run towards the end, after all the main Grappler passes?
@ShengYang1: Yes.

@annarev: Should max_values in TF_GetNodesToPreserveSize be num_values? (Original comment)
@ShengYang1: Yes.

@annarev: What does preserved nodes mean in the context of TF_GetNodesToPreserveSize? (Original comment)
@ShengYang1: It means nodes that cannot be transformed/removed. Will add more details.
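The name suggests a two-call query pattern: first ask for the count and storage size, then fetch the names. A hedged sketch of how a plug-in might use it; the signatures are illustrative, following the discussion above, not final API.

```c
#include <stdlib.h>

// Sketch only: fetch the names of nodes the optimizer must leave intact
// (e.g., fetch/feed nodes). `item` is the Grappler item handle exposed to
// the plug-in.
void list_preserved_nodes(const TF_GrapplerItem* item, TF_Status* status) {
  int num_values = 0;
  size_t storage_size = 0;
  TF_GetNodesToPreserveSize(item, &num_values, &storage_size, status);

  char** names = (char**)malloc(num_values * sizeof(char*));
  size_t* lengths = (size_t*)malloc(num_values * sizeof(size_t));
  void* storage = malloc(storage_size);
  TF_GetNodesToPreserve(item, names, lengths, num_values, storage,
                        storage_size, status);
  // names[i] (of length lengths[i]) points into storage; these nodes must
  // not be transformed or removed by the plug-in pass.
  free(names);
  free(lengths);
  free(storage);
}
```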

@kulinseth: Quick question. For custom ops that are exposed through an op kernel library and are visible in the TensorFlow namespace, would they work with the plug-in passes?
@annarev: Yes. The plug-in should be able to see the op names in the GraphDef.

Alexandre Rames: Maybe this is not the right venue for the question, but I’d like to learn more about the visibility of the tf dialect. Can plug-ins see it, or is it only visible inside TensorFlow?
For example, plug-ins may want to work on the MLIR tf dialect directly; we have partial lowerings to some other dialects.
@kulinseth: This is consistent with what we want.
@ezhulenev: We want to make the tf dialect visible to plug-ins too. It’s just that no one has looked into it yet, and as of now I don’t know how it will be done. Could you write an email explaining the requirements? We can start from there.

@ematejska: The RFC should be updated to reflect the conclusions on conflicting plug-in optimizer configurations. After that, we will see if we can conclude the RFC on the PR or if we will need another design review meeting.