data/benchmarks #422 (Draft)

msaroufim wants to merge 25 commits into main
Conversation

@msaroufim (Member) commented May 19, 2022

Please read through our contribution guide prior to
creating your pull request.

  • Note that there is a section on requirements related to adding a new DataPipe.

Fixes #416

Changes

  • Added a CLI that runs a suite of benchmarks across datasets and measures key metrics

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) May 19, 2022
@msaroufim marked this pull request as draft May 19, 2022
benchmarks/README.md (review thread resolved)
benchmarks/run_benchmark.py (review thread resolved)
@msaroufim (Member, Author) commented May 24, 2022

May 24 notes

Notes from discussion with Vitaly

May 31 notes

  • PyTorch profiler needs to report fewer things; opened an issue on the kineto repo: "datapipe profiler showing too much information" (kineto#609)
  • Modularize the code a bit better so we can create a baseline for datapipe/dataset vs DataLoader v1/v2; make train.py take a generic iterator
  • Check whether the large call stack is caused by the shuffling done by torchvision
  • Do scaling after this is done

@msaroufim (Member, Author) commented:

OK, we can now get a trace and have an end-to-end example working. Data loading is not the bottleneck here yet, so I will keep experimenting.

[screenshot: profiler trace]

@msaroufim (Member, Author) commented:

[screenshot: Screen Shot 2022-05-25 at 5:28 PM]

@msaroufim (Member, Author) commented May 26, 2022

The next thing I'd like to try is including these datasets (from torchtext.datasets import AmazonReviewFull), after which we can start warming up GPUs.
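For reference, a minimal sketch of pulling that datapipe, assuming the datapipe-backed dataset is exposed as AmazonReviewFull in torchtext:

```python
from torchtext.datasets import AmazonReviewFull

# Yields (label, review_text) tuples backed by a DataPipe
train_dp = AmazonReviewFull(split="train")
label, text = next(iter(train_dp))
```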

@msaroufim (Member, Author) commented May 27, 2022

The call graph for DataPipe construction is long, but data loading is not the bottleneck here since GPU utilization is just 7%. I need to fix the collation problems and bump up the batch size.

Also, I finally figured out how to build torchtext from source, so we can use those datapipes as well: pytorch/text#1743

[screenshots: profile call graph; Screen Shot 2022-05-26 at 5:14 PM]

@NicolasHug (Member) left a comment:

Gave it a quick look before our meeting!

benchmarks/utils.py (review thread resolved)

else:
    # No further preprocessing needed; this returns a tuple of images and labels as ints
    # Do I need to do batching and collation manually?
Member:
> Do I need to do batching and collation manually?

No, you just need to pass batch_size=... to the DataLoader
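A minimal sketch of what that looks like (dp is the datapipe from the snippet above; the batch size is illustrative):

```python
from torch.utils.data import DataLoader

# DataLoader batches and collates the samples for you
dl = DataLoader(dp, batch_size=32)
```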

Contributor:

Nicolas is right that you can do that with DataLoader.

I would think it is better to use .batch, since it will be necessary for DataLoaderV2. Then you can pass the same DataPipe to both versions of the DataLoader without more changes later.
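A sketch of that approach, assuming dp yields individual samples (the .collate() call is an assumption about how the batches get stacked):

```python
# Batch (and collate) inside the pipe so the same DataPipe works with DLv1 and DLv2
dp = dp.batch(batch_size).collate()

# For DLv1, disable its own batching since the pipe already produces batches
dl = DataLoader(dp, batch_size=None)
```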

dp = dp.map(lambda sample: torch.tensor(str_to_list(sample.to_categories())).to(torch.device(device)), input_col="label")

# Batch
dp = dp.batch(batch_size)
Member:

Not sure this is needed as long as you pass batch_size=... to the DataLoader.
(But not certain either)

if num_workers == 1:
    dl = DataLoader(dataset=data, batch_size=batch_size, shuffle=shuffle)

# Shuffle won't work in distributed yet
Member:

Shuffle and sharding won't work out of the box with DDP. There are some suggestions in pytorch/text#1755, but no definitive recommended practices yet.
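One of the patterns suggested in that issue (not an official recommendation at the time) is to shuffle before sharding so each replica sees a different subset per epoch:

```python
# Shuffle first, then shard across workers/ranks
dp = dp.shuffle().sharding_filter()
dl = DataLoader(dp, batch_size=batch_size)
```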

benchmarks/run_benchmark.py (review thread resolved)
facebook-github-bot pushed a commit that referenced this pull request Jun 7, 2022
Summary:
In working on #422 I realized we needed an easier way to run benchmarks on larger datasets that may not be available in the domain libraries. This PR is a prerequisite to any sort of scaling benchmarks and is generally a useful reader for the community.

https://huggingface.co/docs/datasets/how_to has about 10,000 datasets we can leverage out of the box; in particular, mc4 is one we need to prove out large-scale text benchmarks.

See test and docstring for usage instructions

### Changes

- Added a new `HuggingFaceHubReaderIterDataPipe` so we can load a large number of datasets for performance benchmarks
- Added a test which is skipped if `datasets` library does not exist
- pytest passes
- Got rid of `StreamWrapper`
- Is there any documentation update I should make?

Pull Request resolved: #490

Reviewed By: NivekT, ninginthecloud

Differential Revision: D36910175

Pulled By: msaroufim

fbshipit-source-id: 3ce2d5bc0ad46b626baa87b59930a3c6f5361425
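For context, a usage sketch of the reader added in #490 (the class and argument names here are assumptions based on how torchdata later exposed it; the PR's test and docstring are the authoritative usage instructions):

```python
from torchdata.datapipes.iter import HuggingFaceHubReader

# Streams samples from the Hugging Face Hub without downloading the full dataset;
# extra kwargs are assumed to be forwarded to datasets.load_dataset
dp = HuggingFaceHubReader("mc4", name="en", split="train")
sample = next(iter(dp))  # a dict mapping column names to values
```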
NivekT pushed a commit that referenced this pull request Jun 7, 2022, with the same summary as above.
NivekT added a commit that referenced this pull request Jun 7, 2022, with the same summary as above (Co-authored-by: Mark Saroufim).
@msaroufim (Member, Author) commented:

Discussion with Vitaly June 21

  • Will focus on running on mc4 with DataLoader v1 on various hardware configurations (SSD, HDD) and using a few starter CloudFormation templates to make this easier

@NivekT (Contributor) left a comment:

Overall, LGTM with a few comments. Let me know what additional features you plan to add.

nit: Need copyright headers for .py files

benchmarks/README.md (review threads resolved)
benchmarks/report.py (review thread resolved)
benchmarks/utils.py (review thread resolved)
benchmarks/datasets.py (review thread resolved)


@msaroufim (Member, Author) commented:

> Overall, LGTM with a few comments. Let me know what additional features you plan to add.
>
> nit: Need copyright headers for .py files

Thanks @NivekT, I'll address all your feedback. As far as new features to add for this PR, not much; I think there's a bunch of cleanup I need to do:

  • Clean up the report into its own dataclass, which you can then export to whatever format you want (HTML, md, CSV, etc.); see the sketch after this list
  • Address all your feedback
  • Some more cleanup
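A hypothetical shape for that report (field names and export formats are illustrative, not the PR's actual schema):

```python
from dataclasses import dataclass, asdict


@dataclass
class BenchmarkReport:
    dataset: str
    dataloader_version: int
    batch_size: int
    total_duration: int  # seconds

    def to_csv_row(self) -> str:
        # Flatten the dataclass fields into one CSV line
        return ",".join(str(v) for v in asdict(self).values())

    def to_md_row(self) -> str:
        # Same fields rendered as a markdown table row
        return "| " + " | ".join(str(v) for v in asdict(self).values()) + " |"
```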

And I think the next PR should be focused on integrating the AWS CLI into CI, where we can benchmark a distributed-systems setup per @NicolasHug's request.

And after that we can see which of the partner integrations should be added to this setup as well.

@VitalyFedyunin (Contributor) left a comment:

Need to drop usage of dataloader_experimental.

benchmarks/args.py (review thread resolved)
if dataloaderv == 1:
    from torch.utils.data import DataLoader
elif dataloaderv == 2:
    from torch.utils.data.dataloader_experimental import DataLoader2 as DataLoader
Contributor:

Please avoid using this one; perhaps you need to create your own wrapper that uses DLv2 from the torchdata repo and automatically creates a MultiProcessingReadingService.

Contributor:

This is up to @NivekT, whether he wants to have this update as a follow-up or within this PR.

Member (Author):

@NivekT do you have more context? I'm not sure I follow what change is being requested.

Member:

I think dataloader_experimental.DataLoader2 is bound to be removed eventually. The actual DataLoader2 is in torchdata.dataloader2.

I have an example in my benchmarking PR, if you're interested: https://github.com/pytorch/vision/pull/6196/files#diff-32b42103e815b96c670a0b5f0db055fe63f10fc8776ccbb6aa9b61a6940abba0R207-R211

Contributor:

Yep, I think using torchdata.dataloader2 with MultiProcessingReadingService is preferred.
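A minimal sketch of such a wrapper, using the DataLoader2 that lives in torchdata.dataloader2 (API as it stood at the time; exact names may have shifted since):

```python
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService


def make_dataloader2(dp, num_workers: int):
    # Attach a multiprocessing reading service only when workers are requested
    rs = MultiProcessingReadingService(num_workers=num_workers) if num_workers > 0 else None
    return DataLoader2(dp, reading_service=rs)
```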

duration = int


@dataclass
Member (Author):

@NivekT what do you think about this kind of reporting instead? There are probably still a few bugs around, but I just want to make sure I'm not overengineering things.

@NicolasHug (Member) commented Jul 21, 2022

@msaroufim @VitalyFedyunin @NivekT following up on my earlier comments in #416 (comment), I also have a separate PR (pytorch/vision#6196) that already provides support for the cross-product of:

  • Distributed Learning (DDP) vs 1-GPU training
  • Datapipes (with DataLoader or torchdata.dataloader2) vs Iterable datasets (non-DP) vs MapStyle Datasets
  • Full training procedure or Data-loading only (with or without transforms) or Model training only (generating fake datasets)
  • Timing of data-loading vs model training
  • any classification model from torchvision

(It also has FFCV support, but that's less relevant for us here).

Since it's directly adapted from torchvision recipes, it's also a bit closer to the kind of training that users would be doing in the wild.

Do you think it would make sense to join our benchmarking efforts here? I'm happy to provide support if you'd like to collaborate.

CC @nairbv

@NivekT (Contributor) commented Jul 21, 2022

@NicolasHug I am in the process of going through both setups, running them on our AWS cluster, and identifying the differences. I agree that combining the efforts is the right approach. Let me dig a bit deeper first and I can schedule a meeting for all of us to chat.

@msaroufim (Member, Author) commented:

@NicolasHug I think the right way to divide this up would be:

  • I work on the infra setup, the benchmark artifact and the benchmark export
  • I leverage your model training scripts since you're the domain expert

I would also eventually like to do something like pulling any of the HF datasets and benchmarking there, but I don't believe the datasets there give me sufficient information to automatically create a toy model with the right shapes.

But yeah, would love to talk.

@msaroufim msaroufim mentioned this pull request Jul 25, 2022
facebook-github-bot pushed a commit that referenced this pull request Jul 25, 2022
Summary:
Per our offline discussion, I am separating out the cloud part of #422, which will allow us to provision AWS infra from the command line for offline benchmarks and potentially integrate it into CI and release decisions.

In follow-up PRs, I will make the template more interesting and revisit the logger discussion so we can build this. (This PR is the middle node.)
![Screen Shot 2022-07-25 at 10 14 40 AM](https://user-images.githubusercontent.com/3282513/180836193-f75e25c4-c2ed-4d8d-8f14-67849e232a44.png)

Here's a test that the template works

![Screen Shot 2022-07-25 at 10 15 13 AM](https://user-images.githubusercontent.com/3282513/180836270-0d26c697-eb8f-46bb-9f76-db76761555d2.png)

Pull Request resolved: #680

Reviewed By: NivekT

Differential Revision: D38124731

Pulled By: msaroufim

fbshipit-source-id: 5b6749e291f06849a5a25b3b01fb7430c656e695
total_duration: int = 0


class MetricExporter(ABC):
Member:

Following up on our discussion from Monday, I took a look at this and I think it should be reasonably simple to plug it into the MetricLogger from the torchvision recipe (it's a "metric calculator" more than a "logger", and we can remove the "log" part in favour of this one).
