
Welcome to unofficial TPU enabled PyTorch implementation #61

Open

shizhediao opened this issue May 2, 2020 · 5 comments

shizhediao commented May 2, 2020

Hi everyone,
I implemented three TPU-enabled PyTorch training repos for BigGAN-PyTorch, all of which are based on this repo.

BigGAN-PyTorch-TPU-Single: Training BigGAN with a single TPU.
BigGAN-PyTorch-TPU-Parallel: Parallel version (multi-threaded) for training BigGAN with TPU.
BigGAN-PyTorch-TPU-Distribute: Distributed version (multi-process) for training BigGAN with TPU (see the sketch at the end of this comment).

I have checked the training process, which seems to run normally.
There may still be some potential issues (sorry, I'm a novice at TPU training).
Pull requests fixing any of these issues would be appreciated, and discussion is welcome.
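
A minimal sketch of the multi-process (xmp.spawn) pattern that the Distribute variant builds on; the model, data, and hyperparameters below are placeholders for illustration, not code from the repos:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    # Each spawned process drives one TPU core.
    device = xm.xla_device()

    # Placeholder network/optimizer standing in for the G/D pair.
    model = nn.Linear(128, 10).to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)

    # Placeholder data; a real run shards ImageNet across the cores.
    data = TensorDataset(torch.randn(512, 128), torch.randint(0, 10, (512,)))
    loader = DataLoader(data, batch_size=64, shuffle=True)

    # MpDeviceLoader feeds batches to the TPU core in the background.
    for x, y in pl.MpDeviceLoader(loader, device):
        opt.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        # optimizer_step() all-reduces gradients across cores before stepping.
        xm.optimizer_step(opt)
    xm.master_print('epoch done')


if __name__ == '__main__':
    xmp.spawn(_mp_fn, nprocs=8)  # 8 processes for a v2-8 / v3-8 slice
```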

@raijinspecial

Did you have some issues in mind?

@shizhediao (Author)

> Did you have some issues in mind?

Actually, no.
I have fixed a lot of issues, and the training process in the current repo seems to be going well.
Due to financial constraints, I could not finish training to reproduce the exact results.
So I'm not entirely sure; discussion is welcome.


Leiwx52 commented Oct 25, 2020

Hi @shizhediao,

Have you run any of the TPU versions of the code so far? If so, could you please share some logs from your experiments?


gwern commented Oct 25, 2020

It's worth noting there's another TPU implementation whose authors claim to have trained models successfully: https://github.com/giannisdaras/smyrf/tree/master/examples/tpu_biggan (supporting code for "SMYRF: Efficient attention using asymmetric clustering", Daras et al 2020). Tensorfork has been considering training it, despite PyTorch requiring paying for far more VMs than a TensorFlow implementation would, to establish a baseline, given our difficulties getting the compare_gan BigGAN to reach high quality.


Leiwx52 commented Nov 7, 2020

> It's worth noting there's another TPU implementation whose authors claim to have trained models successfully: https://github.com/giannisdaras/smyrf/tree/master/examples/tpu_biggan (supporting code for "SMYRF: Efficient attention using asymmetric clustering", Daras et al 2020). Tensorfork has been considering training it, despite PyTorch requiring paying for far more VMs than a TensorFlow implementation would, to establish a baseline, given our difficulties getting the compare_gan BigGAN to reach high quality.

@gwern Thanks for sharing. However, I found that simply wrapping the original PyTorch BigGAN into a TPU-enabled version seems to be very slow, since some ops require context switching between CPU and TPU (e.g. 2D interpolation in torch-xla 1.6 and torch-xla 1.7). The repo at https://github.com/giannisdaras/smyrf/tree/master/examples/tpu_biggan trains CelebA with a small batch size (a global batch size of 64, whereas ImageNet training needs a global batch size >= 512). Also, when evaluating FID/IS, something goes wrong and the TPU sits idle.
BTW, what do you mean by 'Tensorfork'? If you know of other BigGAN-on-TPU implementations, please keep me posted. Thanks!
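
One way to check for such CPU fallbacks is the torch_xla debug metrics report (a rough sketch, assuming torch_xla is installed; the interpolate call is just an illustrative op, not taken from either repo):

```python
import torch
import torch.nn.functional as F

import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()
x = torch.randn(4, 3, 64, 64, device=device)

# Run an op suspected of falling back to CPU under torch-xla.
y = F.interpolate(x, scale_factor=2, mode='nearest')

# Force the pending XLA graph to compile and execute.
xm.mark_step()

# Counters prefixed with "aten::" mark ops that ran on the CPU
# instead of being lowered to the TPU.
print(met.metrics_report())
```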
