
[docs] Add PyTorch loaders article release #1214

Merged: 13 commits merged into main on Jul 9, 2024

Conversation

@pablo-gar (Contributor) commented Jun 28, 2024

Images don't render in Markdown; they are encoded for the MyST parser.

@pablo-gar requested a review from ebezzi, June 28, 2024 20:16
pablo-gar and others added 2 commits June 28, 2024 13:48
Co-authored-by: Emanuele Bezzi <[email protected]>
docs/articles/2024/20240702-pytorch.md (review threads, outdated/resolved)

We have made improvements to the loaders to reduce the number of data transformations required between data fetching and model training. One important change is to encode the expression data as a dense matrix immediately after it is retrieved from disk/cloud.

In our benchmarks, we found that densifying the data increases training speed ~3X while keeping memory usage relatively constant (Figure 3). However, we still allow users to decide whether to process the expression data in sparse or dense format via the #TODO ask ebezzi to include name of parameter.
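To illustrate the densify-right-after-fetch idea in the paragraph above, here is a minimal, self-contained sketch. It is not the census loader's actual code: `fetch_sparse_chunk` is a hypothetical stand-in (using synthetic random data) for reading one sparse expression chunk from disk/cloud, and the generator simply converts each chunk to a dense tensor before anything downstream touches it.

```python
# Illustrative sketch only: densify each expression chunk immediately after
# retrieval, so that all later steps (shuffling, batching, collation, GPU
# transfer) operate on plain dense arrays.
import numpy as np
import scipy.sparse as sp
import torch


def fetch_sparse_chunk(n_cells: int = 64, n_genes: int = 1000) -> sp.csr_matrix:
    """Hypothetical stand-in for reading one sparse chunk from disk/cloud."""
    rng = np.random.default_rng(0)
    counts = rng.poisson(0.1, size=(n_cells, n_genes)).astype(np.float32)
    return sp.csr_matrix(counts)


def dense_chunks(n_chunks: int = 4):
    """Yield dense torch tensors, densifying right after each fetch."""
    for _ in range(n_chunks):
        chunk = fetch_sparse_chunk()
        yield torch.from_numpy(chunk.toarray())


for batch in dense_chunks():
    print(batch.shape)  # torch.Size([64, 1000])
```

The point of the sketch is only the ordering: the sparse-to-dense conversion happens once, immediately after the data is read, rather than repeatedly inside the training loop.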
Member commented:
The parameter is `method`, but I believe @ryan-williams wanted to change it?

pablo-gar and others added 3 commits June 28, 2024 14:13
codecov bot commented Jun 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.17%. Comparing base (fc0281b) to head (a33479a).
Report is 10 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1214      +/-   ##
==========================================
+ Coverage   91.11%   91.17%   +0.06%     
==========================================
  Files          77       77              
  Lines        5922     5963      +41     
==========================================
+ Hits         5396     5437      +41     
  Misses        526      526              
| Flag | Coverage Δ |
| --- | --- |
| unittests | 91.17% <ø> (+0.06%) ⬆️ |


@pablo-gar enabled auto-merge (squash), July 9, 2024 00:27
@pablo-gar requested a review from ebezzi, July 9, 2024 00:27

*Published:* *July 9th, 2024*

*By:* *[Emanuele Bezzi](mailto:[email protected]), [Pablo Garcia-Nieto](mailto:[email protected]), [Prathap Sridharan](mailto:[email protected]), [Ryan Williams](mailto:[email protected])*
Member commented:
Ryan's email is wrong. Worth checking if they're ok with adding the email here though?

@ryan-williams (Contributor) commented Jul 10, 2024:
Thanks, I saw that #1228 addresses this 🙏 (and yes, I'm ok / appreciate being listed!)

@pablo-gar merged commit b83055a into main on Jul 9, 2024
14 of 15 checks passed
@pablo-gar deleted the pablo-gar/add-loaders-article branch, July 9, 2024 01:03
@ryan-williams (Contributor) commented:
Thanks for this! One note: the wrong circle is labeled "default" in docs/articles/2024/20240709-pytorch-fig-benchmark.png:

[screenshot: docs/articles/2024/20240709-pytorch-fig-benchmark.png]

From the interactive plot, here's a gif showing the 2 data points corresponding to 2048 chunks of size 64 (one from a g4dn.4xlarge, one from a g4dn.8xlarge):

[gif: the two default-configuration data points, 2048 chunks of size 64]

The circle currently labeled "default" actually corresponds to 1024 chunks of size 128.

[screenshot: the circle currently labeled "default", corresponding to 1024 chunks of size 128]

I'm not sure how significant the speed difference between the g4dn.{4x,8x}large nodes is with the default configuration, given N=1. I can generate a few more samples if you like.
