RFC: Adding Pluggable Device For TensorFlow by jzhoulon · Pull Request #262 · tensorflow/community (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation115 Commits29 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

This RFC will be open for comment until Monday, July 20th, 2020.

Pluggable device for TensorFlow

Status	Accepted
RFC #	262
Author(s)	Zhoulong Jiang (zhoulong.jiang@intel.com), Yiqiang Li (yiqiang.li@intel.com), Eric Lin (eric.lin@intel.com), Jianhui Li (jian.hui.li@intel.com)
Sponsor	Anna Revinskaya (annarev@google.com)
Updated	2020-08-13

Objective

Implement a pluggable device mechanism which allows to run existing TensorFlow programs on a new device without user changing the code. Users only need to install a plugin in a specified directory, and the mechanism is able to discover and plug in the capabilities offered by the plugin.

This RFC is based on the Modular TensorFlow RFC, which aims to extend the TensorFlow design to plugin capabilities like adding a new device support. The modular device interface is based on StreamExecutor C API RFC.

Thank you for the RFC! I think the links to the images are broken?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for writing the RFC!

@jzhoulon "View file" works for me. Sorry I didn't try that before asking!

Now that I think about it, I don't think there is a good way to make the images display in the diff mode here, since the images are not in this repo yet. If you use absolute paths (URLs), you'll need to change them back to relative paths before the PR is merged. -- Not worth the hassle.

Thanks for the RFC. We're working on adding a new device for DirectML (#243) so this RFC along with #257 are of great interest to us. I have a few high-level questions as follow:

Is there a plan to encourage a refactor of the existing GPU-CUDA device as a pluggable device in the future? The DirectML device is designed to be a replacement of the existing GPU device. If the GPU device continues to be part of TensorFlow proper, the total payload size of the package with the DirectML device, which includes the rest of TensorFlow as its dependency will never be smaller than the package that it depends on. The current TensorFlow-GPU wheel size is over 450 MB while the current TensorFlow-DirectML wheel size (pre-release) is about 65 MB.
Is there a particular reason to require all pluggable devices to implement the stream and the stream executor interface (RFC: StreamExecutor C API #257)? Why not give the device implementer the freedom to implement the device as they see fit by defining a C API for the pluggable device interface instead? As an example, the DirectML device doesn't use stream or the stream executor. Since one of the main purposes of any device is to manage the state for all the kernels it supports, it seems there is little benefit to regulate the implementation's choices of how a device and its corresponding kernels may be implemented.
What is the runtime mechanism that ensures that the version of TensorFlow supporting the plug-in is compatible with the pluggable device implemented in the plug-in? One of the design benefits of a plug-in architecture is that it allows the host and the plug-in to evolve independently as long as there is a versioning policy in place to ensure their compatibility to one another. It's not clear to me from the device discovery code sample how that mechanism is achieved.

@wchao1115 Great questions!

Is there a plan to encourage a refactor of the existing GPU-CUDA device as a pluggable device in the future? The DirectML device is designed to be a replacement of the existing GPU device. If the GPU device continues to be part of TensorFlow proper, the total payload size of the package with the DirectML device, which includes the rest of TensorFlow as its dependency will never be smaller than the package that it depends on. The current TensorFlow-GPU wheel size is over 450 MB while the current TensorFlow-DirectML wheel size (pre-release) is about 65 MB.

For the current TensorFlow stack, no. Would linking to CPU-only TensorFlow package work for you (~110.MB for 1.15, ~144MB for 2.2.0)? Would love it if you could share your techniques to get the wheel size down to 65MB as well. Looping in @gunan and @annarev for more details.

Is there a particular reason to require all pluggable devices to implement the stream and the stream executor interface (RFC: StreamExecutor C API #257)? Why not give the device implementer the freedom to implement the device as they see fit by defining a C API for the pluggable device interface instead? As an example, the DirectML device doesn't use stream or the stream executor. Since one of the main purposes of any device is to manage the state for all the kernels it supports, it seems there is little benefit to regulate the implementation's choices of how a device and its corresponding kernels may be implemented.

Yes. It’s because we happen to have the StreamExecutor C API already in the works.

The new TFRT/MLIR TensorFlow stack will have much more flexible support for third-party devices. Since the new stack is fundamentally different from the current stack, we can’t guarantee that any code added to the current stack will work with the new stack right away. So the focus for the current stack is to enable basic device integration, for people that need it now, while minimizing throw-away work. We think StreamExecutor C API (covering a small subset of StreamExecutor routines) and PluggableDevice could be sufficient. (Or we’d like to hear why they might not be enough sooner rather than later. -- Hence the RFCs.)

“Stream” is just a keyword meant to represent an ordered sequence of kernels to run. Does DirectML have a kernel queue which would fit this definition? Would DirectML be able to use StreamExecutor C API and PluggableDevice? Is there anything you need that is missing? If so, we can look into adding the missing parts (and making the things you don’t need optional).

Why not give the device implementer the freedom to implement the device as they see fit by defining a C API for the pluggable device interface instead?

Do you mean C APIs for Device and DeviceFactory? Or do you mean different APIs for different devices?

What is the runtime mechanism that ensures that the version of TensorFlow supporting the plug-in is compatible with the pluggable device implemented in the plug-in? One of the design benefits of a plug-in architecture is that it allows the host and the plug-in to evolve independently as long as there is a versioning policy in place to ensure their compatibility to one another. It's not clear to me from the device discovery code sample how that mechanism is achieved.

Functionality can be extended by appending members to the structures in StreamExecutorInterface. TensorFlow will check for the existence of members before accessing them safely. Plugins just needs to initialize struct_size using the macros provided so that TensorFlow knows which members are accessible. @yisitu will give some examples later in a separate reply.

Thanks for the RFC. I have two questions regarding support for dataset prefetch_to_device in case of PluggableDevice.

When using prefetch to device dataset iterator is on "accelerator" device and resorts to a remote call to the CPU that does the data preparation. One of the obstacles is that remote call mechanism is guarded by ProcessFunctionLibraryRuntime::GetDeviceContext that, as of now, is a bunch of hardcoded device type strings. Do you envision a way to allow PluggableDevice to support remote calls?
Also a few operators (IteratorGetNext to mention one) must be supported on the accelerator and seemingly could reuse TF implementations. This would ask for an access to operator registry such that would allow duplicating registrations of some operators from one device (say "GPU") to another.

Such features would be very welcome. Are these topics in scope of Pluggable Device RFC?

ematejska changed the title~~Adding Pluggable Device For TensorFlow RFC~~ RFC: Adding Pluggable Device For TensorFlow

Jul 8, 2020

Thanks for the comment! these are good questions.

as of now, is a bunch of hardcoded device type strings. Do you envision a way to allow PluggableDevice to support remote calls?

Tensorflow seems use lots of hardcoded device string type to do branch, some are device name(front-end name), such as (ProcessFunctionLibraryRuntime::GetDeviceContext), some are device type(back-end name), such as (ProcessMemoryTypes), I think we can extend a new branch path, such as provide an util function (IsPluggableDevice(device_type)), this function maintains a global object that contains every device type/name registered from plugin. @annarev do you have comments here? Thanks

    if (device_type != DEVICE_GPU && device_type != DEVICE_SYCL && !IsPluggableDevice(device_type)) {
     // On non-GPU and non-SYCL devices, HOST_MEMORY and DEVICE_MEMORY are always
     // compatible.
     return Status::OK();
   }

Also a few operators (IteratorGetNext to mention one) must be supported on the accelerator and seemingly could reuse TF implementations.

I'm not sure whether Kernel and Op Implementation C API will support this, maybe Google can expose IteratorGetNext as a utility functions like (TF_NewKernelBuilder) or provide an special API to register an existing implementation with new device type. @annarev

Thanks for the RFC. I have two questions regarding support for dataset prefetch_to_device in case of PluggableDevice.

When using prefetch to device dataset iterator is on "accelerator" device and resorts to a remote call to the CPU that does the data preparation. One of the obstacles is that remote call mechanism is guarded by ProcessFunctionLibraryRuntime::GetDeviceContext that, as of now, is a bunch of hardcoded device type strings. Do you envision a way to allow PluggableDevice to support remote calls?
Also a few operators (IteratorGetNext to mention one) must be supported on the accelerator and seemingly could reuse TF implementations. This would ask for an access to operator registry such that would allow duplicating registrations of some operators from one device (say "GPU") to another.

Such features would be very welcome. Are these topics in scope of Pluggable Device RFC?

What is the runtime mechanism that ensures that the version of TensorFlow supporting the plug-in is compatible with the pluggable device implemented in the plug-in? One of the design benefits of a plug-in architecture is that it allows the host and the plug-in to evolve independently as long as there is a versioning policy in place to ensure their compatibility to one another. It's not clear to me from the device discovery code sample how that mechanism is achieved.

Functionality can be extended by appending members to the structures in StreamExecutorInterface. TensorFlow will check for the existence of members before accessing them safely. Plugins just needs to initialize struct_size using the macros provided so that TensorFlow knows which members are accessible. @yisitu will give some examples later in a separate reply.

Hi @wchao1115, @penpornk! Please find more details examples here: https://github.com/tensorflow/community/pull/257/files#diff-8312b2e074fd41855aa8a03d13e92b91

There will be a public design review meeting for this RFC.
Time: Thursday, July 23rd, 2020 from 3:00-3:45pm PT
Meeting link: Google Meet

Meeting participants are expected to read the RFC ahead of time as the meeting will focus on the remaining issues/questions.


This RFC shows an example of registering kernels for PluggableDevice. Kernel and op registration and implementation API is addressed in a separate [RFC](https://github.com/tensorflow/community/blob/master/rfcs/20190814-kernel-and-op-registration.md).

To avoid kernel registration conflict with existing GPU(CUDA) kernels, the backend device_type for kernel registration should be seperated from the front-end visible device type ("GPU"). Two Options:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@annarev do you have any suggestion here? I list two options to resolve the potential kernel registration conflict with existing CUDA kernels when the device type is "GPU", seems both options need StreamExecutor C API/Kernel Registration C API support.Thanks

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I like Option 1 a bit more since it allows backend name to be more descriptive.

There is also another option: specify kernel registration priority. Then, two kernels can be registered with same type.

I think there is a more important question to answer though. Both Option 1 and Option 2 have "GPU" specified as front-end name. So, an important question is: suppose user specifies "GPU" device, which device would be selected? May be we need to provide some way for user to specify that GPU==backend name.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@annarev I thought that only One GPU device will be registered successfully? Since multiple GPU device registration with same device name and priority will got conflict. Specifying the kernel registration priority will help the new GPU kernel register with the same device type as CUDA Kernels(GPU type), however, there are still lots of code that are specific for CUDA, so we may still need a new backend device type for those code within GPU device type scope that are specific for CUDA.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also originally thinking of having one device registered per type. At the same time, I remember a question in StreamExecutor C API design review if there would be a way to use two different GPU types. I don't know if we want to support it. I guess we can discuss more in the design review on Thursday.

Sounds good about keeping the extra name.

Thanks for the RFC. I have two questions regarding support for dataset prefetch_to_device in case of PluggableDevice.
When using prefetch to device dataset iterator is on "accelerator" device and resorts to a remote call to the CPU that does the data preparation. One of the obstacles is that remote call mechanism is guarded by ProcessFunctionLibraryRuntime::GetDeviceContext that, as of now, is a bunch of hardcoded device type strings. Do you envision a way to allow PluggableDevice to support remote calls?
Also a few operators (IteratorGetNext to mention one) must be supported on the accelerator and seemingly could reuse TF implementations. This would ask for an access to operator registry such that would allow duplicating registrations of some operators from one device (say "GPU") to another.
Such features would be very welcome. Are these topics in scope of Pluggable Device RFC?

@pawelpiskorski @gunan mentioned that the above issue should be solved by registering these kernels as "DEVICE_DEFAULT", following the example in queue_ops.cc (a link). For your case, the IteratorGetNext registration should be modified from "DEVICE_DEFAULT" (a link). For the kernels being registered as "DEVICE_DEFAULT", it can be used for any device. So when a device is plugged, these operations should be assigned to the pluggable device since it has highest priority and backed by the "DEVICE_DEFAULT" kernels.

Thanks for suggestion. I registered a group consisting of IteratorV2, MakeIterator, DeleteIterator, AnonymousIterator, AnonymousIteratorV2, IteratorGetNext, IteratorGetNextSync, IteratorGetNextAsOptional, IteratorToStringHandle, IteratorFromStringHandleV2 on DEVICE_DEFAULT. This somehow places the part of dataset that's designated for CPU on the device. It's a bit surprising because I haven't changed Dataset* registrations. . In short, seems not to be working out of the box for datasets, but I'll try to dig into that.
That said, I'd love to advertise my question on tf.developers regarding device type hardcodes. It so far has gone unnoticed but please have a look. Thanks!!

annarev added a commit to annarev/community that referenced this pull request

Sep 10, 2020

Updates based on recent changes and discussion on RFC PR tensorflow#262

@wchao1115 We have updated the versioning strategy. Changes include:

Restructuring the contents so the conventions/assumptions/procedures are clearer.
An example showing that the struct size check works with struct members in any position, not just the last member of the struct.
More details on major version bump conventions. Specifying that breaking changes require an RFC.
An explicit mention that nontrivial deprecations (that cannot use zero initialization) needs a major version bump.
Adding a requirement that fields that treat 0 and NULL as invalid values must have explicit comments saying so, to ensure all plug-ins have consistent behavior, e.g., none of them is using 0 or NULL for special cases.
Updating the deprecation examples to use 'producer' and 'consumer' keywords (instead of plug-in and core TensorFlow) to cover more cases.

Could you please help take another look and raise any remaining concerns? Thank you!

Please feel free to comment on the file directly on the StreamExecutor C API RFC PR's file changes review page. (Or we can discuss here as usual.)

thanks @annarev , I think unified_memory_allocate/unified_memory_deallocate is still need, AllocatorHandle is a higher level api, which allows plugin authors to implement a specific cached allocator, which is counter part of BFCAllocator. But for those plugin authors who want to reuse existing tensorflow BFC allocator, I think the low level allocation api in StreamExecutor(AllocateArray/UnifiedMemoryAllocate) is still need, We can make it: if plugin register a allocator handle, then proper will use allocate handle as its allocator, it not , BFC Allocator will be the default Allocator and it will be forward to StreamExecutor Allocation API through sub-allocator.

I wonder if it will get confusing with 2 separate ways to set allocator functionality. Need to think of a clear way to annotate which one to use when in the API. But I will try it out and see how it looks.

I added a proposed implementation here:
21be6e9
This adds 2 sets of functions to SP_PlatformFns create_allocator/destroy_allocator and create_custom_allocator/destroy_custom_allocator. I thought keeping the two options near each other would make it less confusing.
@kulinseth, @jzhoulon ptal.

I wonder if it will get confusing with 2 separate ways to set allocator functionality. Need to think of a clear way to annotate which one to use when in the API. But I will try it out and see how it looks.

I added a proposed implementation here:
21be6e9
This adds 2 sets of functions to SP_PlatformFns create_allocator/destroy_allocator and create_custom_allocator/destroy_custom_allocator. I thought keeping the two options near each other would make it less confusing.
@kulinseth, @jzhoulon ptal.

Thanks Anna, this looks good.!.

Questions we got offline (cc: @kulinseth, @wchao1115, @jzhoulon, @Jianhui-Li).

Q1: What are the next steps?

We are trying to wrap up both RFCs (StreamExecutor C API and this).
- Wrapping up doesn’t mean we can’t make changes anymore. It’s more of a confirmation that the proposals are really happening. We can continue to discuss here and make adjustments.
PluggableDevice implementation PRs will be submitted by @jzhoulon and his team. (The PRs can be submitted for review anytime, but they likely will be merged after the RFC itself is accepted.)
Interested parties can try the experimental API and give feedback.
Reiterate until the API is somewhat stable, then consider moving it out of experimental.

Q2: How will the API be released? Will we have a branch where the API will be released in the experimental folder?

StreamExecutor C API is already in the experimental folder in the master branch. It will be released with the next TensorFlow release.
PluggableDevice will be too once it is checked in. It will be released with the soonest TensorFlow release after that.

Q3: What is the high-level timeline?

The StreamExecutor C API is ready-to-use.
@jzhoulon @Jianhui-Li Would you mind giving a rough timeline of when you plan to submit PluggableDevice implementation PRs?

Q4: Will this be in TF 2.4?

StreamExecutor C API: Yes.
PluggableDevice: Depends on when the implementation gets in. If it’s before r2.4 branch cut, then yes. I’ll post the branch cut date here once it is announced on developers@tensorflow.org.

@wchao1115: Friendly ping. Have you had a chance to look at the updated versioning strategy yet? :)

cc: @jzhoulon @Jianhui-Li @wchao1115 @kulinseth

I'd like to wrap up this RFC (i.e., merge this PR) to pave the way for the upcoming implementation PRs.

Here are the changes since the last open design review meeting:

PluggableDevice:
- Renamed PluggableDeviceAllocator to PluggableDeviceMemAllocator and added PluggableDeviceProcessState. 492ca0b
StreamExecutor C API
- Added more explanation of the overall functionality.
- Added more clarifications on versioning strategy.
- Corrected typos. Minor functionality additions such as block_host_until_done, unified_memory_allocate/deallocate, etc.

If anyone has anything else that needs to be discussed/resolved before we can move forward (i.e., fundamental issues), please explicitly say so here by this Thursday, 9/24 at 11:59pm PT.

@wchao1115 I believe the versioning strategy is not a blocking issue. We can continue discussing and refining the strategy here throughout the experimental phase.

Q4: Will this be in TF 2.4?

TensorFlow 2.4 branch cut has been announced. If PluggableDevice implementation makes it into TF by 10/21/20 at 5pm PT, it will be in TF 2.4. Otherwise, it won't.

@penpornk Sorry for the delay. The revised versioning doc looks fine for now. We can iterate more when we start implementing the V1 plug-in.

A note on deprecation by the producer leaving an attribute zero-initialized: I'm not sure this is safe and can be done in a backward compatible way even when zero is defined as an invalid value for the attribute because the consumer may reject the zero value and fail the call instead.

If in the later minor version the producer setting the value to zero assumes the attribute will be safely ignored by the older plug-ins, the plug-in may fail.

I believe it is only safe to do so when the attribute is also considered optional by the producer. A required attribute must not be left zero initialized in the later version without a major version bump.

Questions we got offline (cc: @kulinseth, @wchao1115, @jzhoulon, @Jianhui-Li).

Q1: What are the next steps?

We are trying to wrap up both RFCs (StreamExecutor C API and this).

Wrapping up doesn’t mean we can’t make changes anymore. It’s more of a confirmation that the proposals are really happening. We can continue to discuss here and make adjustments.

PluggableDevice implementation PRs will be submitted by @jzhoulon and his team. (The PRs can be submitted for review anytime, but they likely will be merged after the RFC itself is accepted.)

Interested parties can try the experimental API and give feedback.

Reiterate until the API is somewhat stable, then consider moving it out of experimental.

Q2: How will the API be released? Will we have a branch where the API will be released in the experimental folder?

StreamExecutor C API is already in the experimental folder in the master branch. It will be released with the next TensorFlow release.

PluggableDevice will be too once it is checked in. It will be released with the soonest TensorFlow release after that.

Thanks! for clarifying on these questions. There is no outstanding design items from our side. There will be few iterations and adjustments as we get to the implementation PRs and start using the API.

@jzhoulon @Jianhui-Li Would you mind giving a rough timeline of when you plan to submit PluggableDevice implementation PRs?
If PluggableDevice implementation makes it into TF by 10/21/20 at 5pm PT, it will be in TF 2.4. Otherwise, it won't.

@jzhoulon, @Jianhui-Li it will be great if you can elaborate on it. We would like to know if you are also targeting TF 2.4. Also it will be great if there are any implementation PRs we can try out to adapt our implementation and provide any feedback.

@jzhoulon, @Jianhui-Li it will be great if you can elaborate on it. We would like to know if you are also targeting TF 2.4. Also it will be great if there are any implementation PRs we can try out to adapt our implementation and provide any feedback.

@kulinseth We are targeting initial PRs for TF2.4 as a stretch goal and incremental PRs on TF2.5. At experimental stage, the initial version wont' be able to run full model, but useful for us to test various plugin and further improve the interface.

@kulinseth We are targeting initial PRs for TF2.4 as a stretch goal and incremental PRs on TF2.5. At experimental stage, the initial version wont' be able to run full model, but useful for us to test various plugin and further improve the interface.

Thanks for the update. I was also hoping to iron out and test the interface in the experimental stage. It will become more clear after looking at the PR and testing with more plugins.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Hi,

@penpornk and others
I had a general question about how the plugin modules will be packaged for the user. Will it be a regular package or a namespace package ? Also I would like to understand the workflow for installing the pluggable module:

install the core tensorflow package
install the pluggable device package (which will install it at ../site-packages/tensorflow-plugins/M)

Please let me know if there are other ways and if I am missing something.

Hi @kulinseth ,
Pluggable device plugin installation is following modular TF python design (https://github.com/tensorflow/community/blob/master/rfcs/20190305-modular-tensorflow.md#python), it's backend's choice to distribute as regular package or namespace package, the only requirement is to follow the convention to install library under "../site-packages/tensorflow-plugins/M" so that plugin discovery can load it successfully.

install the core tensorflow package
install the pluggable device package (which will install it at ../site-packages/tensorflow-plugins/M)

Yes, that's expected installation process.

Hi all, quick updates:

The initial PluggableDevice implementation couldn't make it into TF 2.4, but it seems on track to be in TF 2.5.
Current PluggableDevice PRs under review, in case anyone would like a quick look: 1, 2, 3
The RFC for custom graph optimizer C API is now up: RFC: Modular TensorFlow Graph C API #318
- There will be an open design review on Thursday, November 19th, 2020.

Hi guys,

Is there any demo like a pseudo pluggable device to show what need to be done to add a device to tensorflow?

Thanks

Hi @kevint324 ,
We are working on a tutorial on how to implement TensorFlow plugin, will release in near future.

Sounds great. Looking forward to it.

Hi @kevint324 ,
We are working on a tutorial on how to implement TensorFlow plugin, will release in near future.
@yiqianglee

reiterating the question. is there a detailed mock pluggable device example or other document on this?
thanks

Hi @kevint324 ,
We are working on a tutorial on how to implement TensorFlow plugin, will release in near future.
@yiqianglee

reiterating the question. is there a detailed mock pluggable device example or other document on this?
thanks

Please see here for details tutorial and example for pluggable device development.
#352

Reviewers

kkimdev kkimdev left review comments

annarev annarev left review comments

yisitu yisitu left review comments

alextp alextp approved these changes

theadactyl theadactyl approved these changes

ematejska Awaiting requested review from ematejska

ewilderj Awaiting requested review from ewilderj

martinwicke Awaiting requested review from martinwicke