RFC: TensorFlow on DirectML by wchao1115 · Pull Request #243 · tensorflow/community
Conversation
This RFC will be open for comment until Monday, May 25th, 2020.
| Status | Proposed |
|---|---|
| RFC # | 243 |
| Author(s) | Chai Chaoweeraprasit (wchao@microsoft.com), Justin Stoecker (justoeck@microsoft.com), Adrian Tsai (adtsai@microsoft.com), Patrice Vignola (pavignol@microsoft.com) |
| Sponsor | Penporn Koanantakool (penporn@google.com) |
| Updated | 2020-06-08 |
Objective
Implement a new TensorFlow device type and a new set of kernels based on DirectML, a hardware-accelerated machine learning library built on the DirectX 12 Compute platform. This change broadens the reach of TensorFlow beyond its existing GPU footprint and enables high-performance training and inference on Windows devices with any DirectX 12-capable GPU.
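For readers less familiar with TensorFlow's device and kernel machinery, the sketch below illustrates the kind of in-tree kernel registration this objective implies, using TensorFlow's standard `REGISTER_KERNEL_BUILDER` macro. The `DEVICE_DML` constant and `DmlReluOp` class are illustrative assumptions for this example, not code from the actual fork.

```cpp
// Minimal sketch (not the actual fork's code): registering a hypothetical
// DirectML Relu kernel through TensorFlow's standard registration macro.
#include "tensorflow/core/framework/op_kernel.h"

namespace tensorflow {

constexpr char DEVICE_DML[] = "DML";  // assumed device type string

class DmlReluOp : public OpKernel {
 public:
  explicit DmlReluOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
  void Compute(OpKernelContext* ctx) override {
    const Tensor& input = ctx->input(0);
    Tensor* output = nullptr;
    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, input.shape(), &output));
    // A real kernel would record and dispatch a DirectML Relu operator here.
  }
};

// Wire the kernel to the "Relu" op for float tensors on the new device type,
// mirroring how the existing CPU and GPU kernels are registered.
REGISTER_KERNEL_BUILDER(
    Name("Relu").Device(DEVICE_DML).TypeConstraint<float>("T"), DmlReluOp);

}  // namespace tensorflow
```

Registering per-device kernels this way is how TensorFlow's existing CPU and GPU kernels are wired in, which is why the RFC frames the work as a new device type plus a new set of kernels.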
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).
📝 Please visit https://cla.developers.google.com/ to sign.
Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.
What to do if you already signed the CLA
Individual signers
- It's possible we don't have your GitHub username or you're using a different email address on your commit. Check your existing CLA data and verify that your email is set on your git commits.
Corporate signers
- Your company has a Point of Contact who decides which employees are authorized to participate. Ask your POC to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the Google project maintainer to go/cla#troubleshoot (Public version).
- The email used to register you as an authorized contributor must be the email used for the Git commit. Check your existing CLA data and verify that your email is set on your git commits.
- The email used to register you as an authorized contributor must also be attached to your GitHub account.
ℹ️ Googlers: Go here for more info.
CLAs look good, thanks!
ℹ️ Googlers: Go here for more info.
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it. If the bot doesn't comment, it means it doesn't think anything has changed.
ℹ️ Googlers: Go here for more info.
wchao1115 and others added 6 commits
…'s name and email alias.
Thank you for your interest!
For the current TensorFlow stack, we recommend adding the new device and kernels as external modules, connecting to TensorFlow through C APIs. Please see the Modular TensorFlow RFC for more context. Kernels and ops can be registered through the APIs in this ongoing RFC; a minimal sketch of that registration path follows the list below. We will share more updates on device-addition APIs next week. In addition, there are a few other uncertainties with this approach:
- Modular TF cannot guarantee compatibility with the new TensorFlow runtime (TFRT) (but we will try to maximize compatibility if possible). TFRT will be modular, but it is fundamentally different from the current runtime and many things are still evolving.
- The C APIs will initially be subject to change; we cannot yet guarantee API stability.
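To make the external-module suggestion concrete, below is a minimal sketch of kernel registration through the C API in `tensorflow/c/kernels.h`, under the assumptions that the plug-in registers for a "DML" device string and stubs out the kernel lifecycle; the RFCs linked above remain the authoritative shape of the API.

```cpp
// Minimal sketch of kernel registration through the Modular TensorFlow C API
// (tensorflow/c/kernels.h). The "DML" device string and empty stubs are
// assumptions for illustration only.
#include "tensorflow/c/kernels.h"
#include "tensorflow/c/tf_status.h"

static void* DmlReluCreate(TF_OpKernelConstruction* ctx) { return nullptr; }

static void DmlReluCompute(void* kernel, TF_OpKernelContext* ctx) {
  // Fetch inputs with TF_GetInput and dispatch the DirectML operator here.
}

static void DmlReluDelete(void* kernel) {}

void RegisterDmlReluKernel() {
  TF_Status* status = TF_NewStatus();
  TF_KernelBuilder* builder = TF_NewKernelBuilder(
      "Relu", /*device_name=*/"DML", &DmlReluCreate, &DmlReluCompute,
      &DmlReluDelete);
  TF_RegisterKernelBuilder("Relu_DML", builder, status);
  // Check TF_GetCode(status) for registration failures before discarding it.
  TF_DeleteStatus(status);
}
```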
What is your integration timeline? TensorFlow is transitioning to a new backend based on TFRT and MLIR, which will not be compatible with the current stack. If your goal is to get your integration productized sometime in 2021, we recommend waiting to integrate with TFRT and MLIR to avoid throw-away effort.
Thank you for your comment @penpornk. Our initial goal is to be compatible with the existing runtime behavior as that's what our audience is expecting. We aim to integrate a new backend to achieve broader hardware and platform reach while maintaining compatibility as a top priority.
We definitely want to learn more about TFRT. Based on the info in the blog, it does look like the refactoring will actually make it easier for us to integrate our device runtime and kernels, so we're definitely interested in being in the loop for future development. We have not finalized our initial release timeline, but we have made good progress in our fork to date.
Given the size of the change, I would also like to know your recommendation on how we should think about the code integration strategy. The code change is large but mostly additive.
@wchao1115 regarding CLA, you will have to sign with all associated email accounts
Just checked the internal data. Your email at ntdev.microsoft.com has not signed the CLA. I manually checked and don't see the CLA for wchao@ntdev.microsoft.com. Did you sign with this address or a different alias?
I noticed that both your personal emails are associated with the same GitHub account. Maybe you can add your corp email as the third one (in GitHub settings) too. (Or do we still require a separate CLA for a corp email address, @mihaimaruseac ?)
I think an easy way is to modify the commits and reset the author to the email that signed the CLA instead.
You can use `git commit -s --amend --reset-author` to reset the last commit's author info.
To modify all 9 commits, you can use:
```
git rebase -i HEAD~9 -x "git commit --amend --author 'Author Name <author.name@mail.com>' --no-edit"
```
Because this rewrites commits that are already pushed, the branch will then need a force push (e.g. `git push --force-with-lease`) for the PR to pick up the new author info.
CLAs look good, thanks!
ℹ️ Googlers: Go here for more info.
@wchao1115 Nice timing! I was planning to update you on this once TF 2.5 was released, too. It's the first release that supports PluggableDevices. :)
> We're nearing the completion of our work in the 1.15 fork
That's great news! Looking forward to it!
Do you recommend that we look at Intel's fork or just focus on what's already in master, given that a handful of PRs have already been merged? Also, what is the timeline for the pluggable graph change?
Please focus on the master branch. Most implementations are already in there, including the graph optimization C API.
PluggableDevice support in TF 2.5:
- StreamExecutor C API: Complete
- PluggableDevice: TensorFlow can run full end-to-end model inference on PluggableDevices (and also training, just not updating resources on the device). TensorFlow can't yet override manually placed nodes on first-party GPUs to run on PluggableDevices. (But nodes that aren't manually placed by users can use PluggableDevices just fine.)
- Graph optimization C API: Complete
Additional PluggableDevice support in master:
Upcoming PRs
- Let PluggableDevices override user-placed nodes (e.g., "GPU"). This will be an alternative to the original approach here.
- Updating warning messages for Grappler plug-in configs
Upcoming RFCs
- RFC: Kernel C API Extension for Variable Ops: To properly support training.
- An RFC to fix a buffer in-placement issue.
- Another RFC in a few weeks (can't say what it's about yet).
We also have a tutorial for PluggableDevice developers under construction here. (A rough sketch of the plug-in entry points involved appears after this summary.)
Disclaimer: PluggableDevice hasn't really been tested with Windows yet. We would love your help testing it. :)
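To give a rough picture of what the summary above means for a plug-in author, the sketch below shows the entry points a PluggableDevice plug-in's shared library exports, one per C API listed. The "DML" platform name and the omitted callback wiring are assumptions; the headers under `tensorflow/c/experimental/` are the authoritative reference.

```cpp
// Hedged sketch of the entry points a PluggableDevice plug-in exports.
// Struct and function names follow the StreamExecutor and Grappler C API
// headers; the "DML" platform/device name is an assumption for illustration.
#include "tensorflow/c/experimental/grappler/grappler.h"
#include "tensorflow/c/experimental/stream_executor/stream_executor.h"

extern "C" {

// Device registration: TensorFlow calls this when the plug-in library is
// loaded; the plug-in fills in platform info plus callbacks for memory,
// streams, and events.
void SE_InitPlugin(SE_PlatformRegistrationParams* params, TF_Status* status) {
  params->platform->name = "DML";  // assumed platform name
  params->platform->type = "DML";  // assumed device type seen by users
  // ... populate params->platform_fns and device/stream callbacks (omitted).
}

// Kernel registration entry point (tensorflow/c/kernels.h): register the
// plug-in's kernels for the "DML" device here.
void TF_InitKernel() {}

// Graph optimization entry point (Grappler C API): optionally register a
// custom graph optimizer for the plug-in's device.
void TF_InitGraph(TP_OptimizerRegistrationParams* params, TF_Status* status) {}

}  // extern "C"
```

When the plug-in is installed as a Python package, TensorFlow loads the shared library and resolves these symbols at start-up, which is what lets a plug-in ship independently of the TensorFlow wheel.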
Regarding the blog post, I'm sorry to say that there is still no update. (It's beyond my control.)
@penpornk Thanks for the update. We'll check this out. The tutorial would be awesome; once it's available, I'll help review it against our work. Since I have you here, is there a plan to support an ARM64 build on either Windows or macOS? I think we'll eventually need that on Windows.
> is there a plan to support an ARM64 build on either Windows or macOS? I think we'll eventually need that on Windows.
@wchao1115 There is no set plan that we can share yet. But please stay tuned. :)
In case this is useful: we have an aarch64 build config (`--config=mkl_aarch64`) recently added by @cfRod and @alenik01. This build uses oneDNN with Arm Compute Library support.
Thanks @penpornk for the reminder. I indeed plan to attend this call at 2 today. Someone else at MS has routed this meeting to me due to our current work on getting TensorFlow to run on WSL.
@penpornk You mentioned toward the end of the SIG call today that there is a conversation on the Windows build topic that I'm part of. That's true, but it seems I do not have permission to post in that group. Is there any step I need to take to request permission?
@wchao1115 I think you'll need to join the group first. If you go to the group page, there should be a button "Join group" after the group name.
Just wanted to drop a quick update for the community: the official release of TensorFlow-DirectML 1.15.5 is now available on PyPI. More updates can be found here.
We're moving right along on DirectML support for the TF2 code base, focusing on a TensorFlow pluggable device for DirectML that runs on the framework's latest official releases, both on Windows and on WSL.
> What is your integration timeline? TensorFlow is transitioning to a new backend based on TFRT and MLIR, which will not be compatible with the current stack. If your goal is to get your integration productized sometime in 2021, we recommend waiting to integrate with TFRT and MLIR to avoid throw-away effort.

Can we have a fresh overview on this claim? What is the status of TF/TFRT integration in 2022?
@ematejska Sorry for the late reply! We are taking a slightly different approach than initially proposed in this RFC, i.e., making a TF-DirectML PluggableDevice plug-in instead of directly adding a new device type `DmlDevice` in the TF source code. So far there has been no need for a design review, and this PR has mostly been used for communication and updates.
> Can we have a fresh overview on this claim? What is the status of TF/TFRT integration in 2022?

@bhack I believe @wchao1115 and his team are working on a TF-DirectML PluggableDevice plug-in. There is no device plug-in C API for TFRT yet.
@penpornk Thanks. So I suppose that, as with the original point, we don't know anything yet about compatibility between a future TFRT device plug-in C API and the PluggableDevice design, and it will probably be necessary to integrate twice. Right?
@bhack As @penpornk mentioned here, I just wanted to confirm that we have no plan in place as of now to work on TFRT. Our focus at the moment is on getting a DirectML-based plug-in working with official TF2 builds, both on Linux and on Windows.
@penpornk Given that we've taken your feedback on this RFC and pivoted towards a pluggable device on TF2, shall we move ahead and approve this RFC for the record? This is a long road that we started a while back, and we would love to close it out.
@wchao1115 Thanks, that is clear. I also see some related pending MS PRs in TF.
My point was just to understand whether, when TFRT is ready, it will require a brand-new integration effort like this one.
> @penpornk Thanks. So I suppose that, as with the original point, we don't know anything yet about compatibility between a future TFRT device plug-in C API and the PluggableDevice design, and it will probably be necessary to integrate twice. Right?
@bhack Yes. The future MLIR/TFRT stack will likely require a separate integration. If possible, we would like to make it so that it doesn't require too big of a migration effort, but there is no guarantee.
> @penpornk Given that we've taken your feedback on this RFC and pivoted towards a pluggable device on TF2, shall we move ahead and approve this RFC for the record? This is a long road that we started a while back, and we would love to close it out.
@wchao1115 Sorry about that! I originally kept the PR open in case we'd need a redesign, but didn't think of closing it once you decided to go with PluggableDevice. Let me double-check with the team whether we need to update the RFC content to reflect the new development.
@penpornk How about updating the design document with the current state, and then we can mark it as accepted per your recommendation?
@ematejska Sounds great! Thank you, Ewa!
@wchao1115 Could you please help update the RFC file to explicitly mention the plug-in approach?
@wchao1115 Can we have an update on PluggableDevice? Your last comment said it was something your team had been working on for TF 2.4. Since the PluggableDevice API has been official since TF 2.5, is there a preview build of some kind available? I've been waiting for DirectML on TF2 for a very long time.
Thanks.
We are working on it. It should be out sometime soon.
@penpornk Please take a look at the design change update in the latest commit.
Thank you very much for the update! And looking forward to the plug-in release! :D
Reviewers
- ematejska approved these changes
- penpornk approved these changes
- ewilderj: awaiting requested review
- martinwicke: awaiting requested review
- theadactyl: awaiting requested review (code owner)