data/benchmarks #422 (Draft)

msaroufim wants to merge 25 commits into main
Conversation

@msaroufim (Member) commented May 19, 2022

Please read through our contribution guide prior to
creating your pull request.

  • Note that there is a section on requirements related to adding a new DataPipe.

Fixes #416

Changes

  • Added a CLI that runs a suite of benchmarks across datasets and measures key metrics

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) May 19, 2022
@msaroufim marked this pull request as draft May 19, 2022
benchmarks/README.md (review thread resolved)
benchmarks/run_benchmark.py (review thread resolved)
@msaroufim (Member, Author) commented May 24, 2022

May 24 notes

Notes from discussion with Vitaly

May 31 notes

  • PyTorch profiler needs to report fewer things; opened an issue on the kineto repo: "datapipe profiler showing too much information" (kineto#609)
  • Modularize the code a bit better so we can create a baseline for datapipe/dataset vs DataLoader v1/v2; make train.py take a generic iterator
  • Check whether the large call stack is caused by the shuffling done by torchvision
  • Do scaling after this is done

@msaroufim (Member, Author) commented:

OK, we can now get a trace and have an end-to-end example working. Data loading is not the bottleneck here yet, so I will keep experimenting.

[screenshot: profiler trace]

@msaroufim (Member, Author) commented:

[screenshot: Screen Shot 2022-05-25 at 5:28 PM]

@msaroufim (Member, Author) commented May 26, 2022

The next thing I'd like to try is including these datasets (from torchtext.datasets import AmazonReviewFull), after which we can start warming up GPUs.
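For reference, a minimal sketch of pulling that datapipe, assuming the datapipe-backed dataset is exposed as AmazonReviewFull in torchtext:

```python
from torchtext.datasets import AmazonReviewFull

# Yields (label, review_text) tuples backed by a DataPipe
train_dp = AmazonReviewFull(split="train")
label, text = next(iter(train_dp))
```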

@msaroufim (Member, Author) commented May 27, 2022

The call graph for DataPipe construction is long, but data loading is not the bottleneck here since GPU utilization is just 7%. I need to fix the collation problems and bump up the batch size.

Also, I finally figured out how to build torchtext from source, so we can use those datapipes as well: pytorch/text#1743

[screenshots: profile call graph; Screen Shot 2022-05-26 at 5:14 PM]

@NicolasHug (Member) left a comment:

Gave it a quick look before our meeting!

benchmarks/utils.py (review thread resolved)

else:
    # No further preprocessing needed; this returns a tuple of images and labels as ints
    # Do I need to do batching and collation manually?
Member:
> Do I need to do batching and collation manually?

No, you just need to pass batch_size=... to the DataLoader
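A minimal sketch of what that looks like (dp is the datapipe from the snippet above; the batch size is illustrative):

```python
from torch.utils.data import DataLoader

# DataLoader batches and collates the samples for you
dl = DataLoader(dp, batch_size=32)
```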

Contributor:

Nicolas is right that you can do that with DataLoader.

I would think it is better to use .batch, since it will be necessary for DataLoaderV2. Then you can pass the same DataPipe to both versions of the DataLoader without more changes later.
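A sketch of that approach, assuming dp yields individual samples (the .collate() call is an assumption about how the batches get stacked):

```python
# Batch (and collate) inside the pipe so the same DataPipe works with DLv1 and DLv2
dp = dp.batch(batch_size).collate()

# For DLv1, disable its own batching since the pipe already produces batches
dl = DataLoader(dp, batch_size=None)
```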

dp = dp.map(lambda sample: torch.tensor(str_to_list(sample.to_categories())).to(torch.device(device)), input_col="label")

# Batch
dp = dp.batch(batch_size)
Member:

Not sure this is needed as long as you pass batch_size=... to the DataLoader.
(But not certain either)

if num_workers == 1:
    dl = DataLoader(dataset=data, batch_size=batch_size, shuffle=shuffle)

# Shuffle won't work in distributed yet
Member:

Shuffle and sharding won't work out of the box with DDP. There are some suggestions in pytorch/text#1755, but no definitive recommended practices yet.
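One of the patterns suggested in that issue (not an official recommendation at the time) is to shuffle before sharding so each replica sees a different subset per epoch:

```python
# Shuffle first, then shard across workers/ranks
dp = dp.shuffle().sharding_filter()
dl = DataLoader(dp, batch_size=batch_size)
```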

benchmarks/run_benchmark.py (review thread resolved)
facebook-github-bot pushed a commit that referenced this pull request Jun 7, 2022
Summary:
In working on #422 I realized we needed an easier way to run benchmarks on larger datasets that may not be available in the domain libraries. This PR is a prerequisite to any sort of scaling benchmarks and is generally a useful reader for the community.

https://huggingface.co/docs/datasets/how_to has about 10,000 datasets we can leverage out of the box; in particular, mc4 is one we need to prove out large-scale text benchmarks.

See test and docstring for usage instructions

### Changes

- Added a new `HuggingFaceHubReaderIterDataPipe` so we can load a large number of datasets for performance benchmarks
- Added a test which is skipped if `datasets` library does not exist
- pytest passes
- Got rid of `StreamWrapper`
- Is there any documentation update I should make?

Pull Request resolved: #490

Reviewed By: NivekT, ninginthecloud

Differential Revision: D36910175

Pulled By: msaroufim

fbshipit-source-id: 3ce2d5bc0ad46b626baa87b59930a3c6f5361425
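For context, a usage sketch of the reader added in #490 (the class and argument names here are assumptions based on how torchdata later exposed it; the PR's test and docstring are the authoritative usage instructions):

```python
from torchdata.datapipes.iter import HuggingFaceHubReader

# Streams samples from the Hugging Face Hub without downloading the full dataset;
# extra kwargs are assumed to be forwarded to datasets.load_dataset
dp = HuggingFaceHubReader("mc4", name="en", split="train")
sample = next(iter(dp))  # a dict mapping column names to values
```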
NivekT pushed a commit that referenced this pull request Jun 7, 2022, with the same summary as above.
NivekT added a commit that referenced this pull request Jun 7, 2022, with the same summary as above (Co-authored-by: Mark Saroufim).
@msaroufim (Member, Author) commented:

Discussion with Vitaly June 21

  • Will focus on running on mc4 with DataLoader v1 on various hardware configurations (SSD, HDD) and using a few starter CloudFormation templates to make this easier

@NivekT (Contributor) left a comment:

Overall, LGTM with a few comments. Let me know what additional features you plan to add.

nit: Need copyright headers for .py files

benchmarks/README.md (review threads resolved)
benchmarks/report.py (review thread resolved)
benchmarks/utils.py (review thread resolved)
benchmarks/datasets.py (review thread resolved)


@msaroufim (Member, Author) commented:

> Overall, LGTM with a few comments. Let me know what additional features you plan to add.
>
> nit: Need copyright headers for .py files

Thanks @NivekT, I'll address all your feedback. As far as new features to add for this PR, not much; I think there's a bunch of cleanup I need to do:

  • Clean up the report into its own dataclass, which you can then export to whatever format you want (HTML, md, CSV, etc.); see the sketch after this list
  • Address all your feedback
  • Some more cleanup
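A hypothetical shape for that report (field names and export formats are illustrative, not the PR's actual schema):

```python
from dataclasses import dataclass, asdict


@dataclass
class BenchmarkReport:
    dataset: str
    dataloader_version: int
    batch_size: int
    total_duration: int  # seconds

    def to_csv_row(self) -> str:
        # Flatten the dataclass fields into one CSV line
        return ",".join(str(v) for v in asdict(self).values())

    def to_md_row(self) -> str:
        # Same fields rendered as a markdown table row
        return "| " + " | ".join(str(v) for v in asdict(self).values()) + " |"
```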

And I think the next PR should be focused on integrating the AWS CLI into CI, where we can benchmark a distributed-systems setup per @NicolasHug's request.

And after that we can see which of the partner integrations should be added to this setup as well.

@VitalyFedyunin (Contributor) left a comment:

Need to drop usage of dataloader_experimental.

benchmarks/args.py (review thread resolved)
if dataloaderv == 1:
    from torch.utils.data import DataLoader
elif dataloaderv == 2:
    from torch.utils.data.dataloader_experimental import DataLoader2 as DataLoader
Contributor:

Please avoid using this one; perhaps you need to create your own wrapper that uses DLv2 from the torchdata repo and automatically creates a MultiProcessingReadingService.

Contributor:

This is up to @NivekT, whether he wants to have this update as a follow-up or within this PR.

Member (Author):

@NivekT do you have more context? I'm not sure I follow what change is being requested.

Member:

I think dataloader_experimental.DataLoader2 is bound to be removed eventually. The actual DataLoader2 is in torchdata.dataloader2.

I have an example in my benchmarking PR, if you're interested: https://github.com/pytorch/vision/pull/6196/files#diff-32b42103e815b96c670a0b5f0db055fe63f10fc8776ccbb6aa9b61a6940abba0R207-R211

Contributor:

Yep, I think using torchdata.dataloader2 with MultiProcessingReadingService is preferred.
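A minimal sketch of such a wrapper, using the DataLoader2 that lives in torchdata.dataloader2 (API as it stood at the time; exact names may have shifted since):

```python
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService


def make_dataloader2(dp, num_workers: int):
    # Attach a multiprocessing reading service only when workers are requested
    rs = MultiProcessingReadingService(num_workers=num_workers) if num_workers > 0 else None
    return DataLoader2(dp, reading_service=rs)
```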

duration = int


@dataclass
Member (Author):

@NivekT what do you think about this kind of reporting instead? There are probably still a few bugs around, but I just want to make sure I'm not overengineering things.

@NicolasHug (Member) commented Jul 21, 2022

@msaroufim @VitalyFedyunin @NivekT following up on my earlier comments in #416 (comment), I also have a separate PR (pytorch/vision#6196) that already provides support for the cross-product of:

  • Distributed Learning (DDP) vs 1-GPU training
  • Datapipes (with DataLoader or torchdata.dataloader2) vs Iterable datasets (non-DP) vs MapStyle Datasets
  • Full training procedure or Data-loading only (with or without transforms) or Model training only (generating fake datasets)
  • Timing of data-loading vs model training
  • any classification model from torchvision

(It also has FFCV support, but that's less relevant for us here).

Since it's directly adapted from torchvision recipes, it's also a bit closer to the kind of training that users would be doing in the wild.

Do you think it would make sense to join our benchmarking efforts here? I'm happy to provide support if you'd like to collaborate.

CC @nairbv

@NivekT (Contributor) commented Jul 21, 2022

@NicolasHug I am in the process of going through both setups, running them on our AWS cluster, and identifying the differences. I agree that combining the efforts is the right approach. Let me dig a bit deeper first and I can schedule a meeting for all of us to chat.

@msaroufim (Member, Author) commented:

@NicolasHug I think the right way to divide this up would be:

  • I work on the infra setup, the benchmark artifact and the benchmark export
  • I leverage your model training scripts since you're the domain expert

I would also eventually like to do something like pulling any of the HF datasets and benchmarking there, but I don't believe the datasets there give me sufficient information to automatically create a toy model with the right shapes.

But yeah, would love to talk.

@msaroufim msaroufim mentioned this pull request Jul 25, 2022
facebook-github-bot pushed a commit that referenced this pull request Jul 25, 2022
Summary:
Per our offline discussion, I am separating out the cloud part of #422, which will allow us to provision AWS infra from the command line for offline benchmarks and potentially integrate it into CI and release decisions.

In follow-up PRs, I will make the template more interesting and revisit the logger discussion so we can build this. (This PR is the middle node.)
![Screen Shot 2022-07-25 at 10 14 40 AM](https://user-images.githubusercontent.com/3282513/180836193-f75e25c4-c2ed-4d8d-8f14-67849e232a44.png)

Here's a test that the template works

![Screen Shot 2022-07-25 at 10 15 13 AM](https://user-images.githubusercontent.com/3282513/180836270-0d26c697-eb8f-46bb-9f76-db76761555d2.png)

Pull Request resolved: #680

Reviewed By: NivekT

Differential Revision: D38124731

Pulled By: msaroufim

fbshipit-source-id: 5b6749e291f06849a5a25b3b01fb7430c656e695
total_duration: int = 0


class MetricExporter(ABC):
Member:

Following up on our discussion from Monday, I took a look at this and I think it should be reasonably simple to plug it into the MetricLogger from the torchvision recipe (it's a "metric calculator" more than a "logger", and we can remove the "log" part in favour of this one).
