
Welcome to unofficial TPU enabled PyTorch implementation #61

Open

shizhediao opened this issue May 2, 2020 · 5 comments

shizhediao commented May 2, 2020

Hi everyone,
I implemented three TPU-enabled PyTorch training repos for BigGAN-PyTorch, all of which are based on this repo.

BigGAN-PyTorch-TPU-Single: Training BigGAN with a single TPU.
BigGAN-PyTorch-TPU-Parallel: Parallel version (multi-threaded) for training BigGAN with TPU.
BigGAN-PyTorch-TPU-Distribute: Distributed version (multi-process) for training BigGAN with TPU (see the sketch at the end of this comment).

I have checked the training process, which seems to run normally.
There may still be some potential issues (sorry, I'm a novice at TPU training).
Pull requests fixing any of these issues would be appreciated, and discussion is welcome.
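
A minimal sketch of the multi-process (xmp.spawn) pattern that the Distribute variant builds on; the model, data, and hyperparameters below are placeholders for illustration, not code from the repos:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    # Each spawned process drives one TPU core.
    device = xm.xla_device()

    # Placeholder network/optimizer standing in for the G/D pair.
    model = nn.Linear(128, 10).to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)

    # Placeholder data; a real run shards ImageNet across the cores.
    data = TensorDataset(torch.randn(512, 128), torch.randint(0, 10, (512,)))
    loader = DataLoader(data, batch_size=64, shuffle=True)

    # MpDeviceLoader feeds batches to the TPU core in the background.
    for x, y in pl.MpDeviceLoader(loader, device):
        opt.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        # optimizer_step() all-reduces gradients across cores before stepping.
        xm.optimizer_step(opt)
    xm.master_print('epoch done')


if __name__ == '__main__':
    xmp.spawn(_mp_fn, nprocs=8)  # 8 processes for a v2-8 / v3-8 slice
```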

@raijinspecial

Did you have some issues in mind?

@shizhediao (Author)

> Did you have some issues in mind?

Actually, no.
I have fixed a lot of issues, and the training process in the current repo seems to be going well.
Due to financial constraints, I could not finish training to reproduce the exact results.
So I'm not entirely sure; discussion is welcome.


Leiwx52 commented Oct 25, 2020

Hi @shizhediao,

Have you run any of the TPU versions of the code so far? If so, could you please share some logs from your experiments?


gwern commented Oct 25, 2020

It's worth noting there's another TPU implementation whose authors claim to have trained models successfully: https://github.com/giannisdaras/smyrf/tree/master/examples/tpu_biggan (supporting code for "SMYRF: Efficient attention using asymmetric clustering", Daras et al 2020). Tensorfork has been considering training it, despite PyTorch requiring paying for far more VMs than a TensorFlow implementation would, to establish a baseline, given our difficulties getting the compare_gan BigGAN to reach high quality.


Leiwx52 commented Nov 7, 2020

> It's worth noting there's another TPU implementation whose authors claim to have trained models successfully: https://github.com/giannisdaras/smyrf/tree/master/examples/tpu_biggan (supporting code for "SMYRF: Efficient attention using asymmetric clustering", Daras et al 2020). Tensorfork has been considering training it, despite PyTorch requiring paying for far more VMs than a TensorFlow implementation would, to establish a baseline, given our difficulties getting the compare_gan BigGAN to reach high quality.

@gwern Thanks for sharing. However, I found that simply wrapping the original PyTorch BigGAN into a TPU-enabled version seems to be very slow, since some ops require context switching between CPU and TPU (e.g. 2D interpolation in torch-xla 1.6 and torch-xla 1.7). The repo at https://github.com/giannisdaras/smyrf/tree/master/examples/tpu_biggan trains CelebA with a small batch size (a global batch size of 64, whereas ImageNet training needs a global batch size >= 512). Also, when evaluating FID/IS, something goes wrong and the TPU sits idle.
BTW, what do you mean by 'Tensorfork'? If you know of other BigGAN-on-TPU implementations, please keep me posted. Thanks!
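
One way to check for such CPU fallbacks is the torch_xla debug metrics report (a rough sketch, assuming torch_xla is installed; the interpolate call is just an illustrative op, not taken from either repo):

```python
import torch
import torch.nn.functional as F

import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()
x = torch.randn(4, 3, 64, 64, device=device)

# Run an op suspected of falling back to CPU under torch-xla.
y = F.interpolate(x, scale_factor=2, mode='nearest')

# Force the pending XLA graph to compile and execute.
xm.mark_step()

# Counters prefixed with "aten::" mark ops that ran on the CPU
# instead of being lowered to the TPU.
print(met.metrics_report())
```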
