[docs] Add PyTorch loaders article release #1214
Co-authored-by: Emanuele Bezzi <[email protected]>
We have made improvements to the loaders to reduce the amount of data transformation required between data fetching and model training. One important change is to encode the expression data as a dense matrix immediately after the data is retrieved from disk/cloud.

In our benchmarks, we found that densifying the data increases training speed ~3X while maintaining relatively constant memory usage (Figure 3). However, we still allow users to decide whether to process the expression data in sparse or dense format via the #TODO ask ebezzi to include name of parameter.
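As a rough illustration of the dense-vs-sparse choice described above, here is a hedged sketch: `prepare_batch` is a hypothetical helper (not the actual loader API), and only the `method` parameter name comes from the review thread below.

```python
# Hypothetical sketch of densifying an expression chunk right after retrieval.
# `prepare_batch` is illustrative, not the real loader API.
import numpy as np
from scipy import sparse

def prepare_batch(X_chunk, method="dense"):
    """Return a fetched expression chunk either densified or left sparse."""
    if method == "dense":
        # Densify immediately after retrieval, so every downstream transform
        # (shuffling, batching, tensor conversion) operates on one dense array.
        return np.asarray(X_chunk.todense(), dtype=np.float32)
    # Otherwise keep the data in CSR form for lower memory use.
    return X_chunk.tocsr()

# Example: a chunk of 64 cells x 2048 genes at 5% density.
X = sparse.random(64, 2048, density=0.05, format="csr", random_state=0)
dense_batch = prepare_batch(X, method="dense")
```

The trade-off sketched here matches the benchmark claim: densifying up front costs memory proportional to the chunk size but removes repeated sparse-to-dense conversions from the training loop.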
The parameter is `method`, but I believe @ryan-williams wanted to change it?
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@ Coverage Diff @@
##             main    #1214      +/-   ##
==========================================
+ Coverage   91.11%   91.17%   +0.06%
==========================================
  Files          77       77
  Lines        5922     5963      +41
==========================================
+ Hits         5396     5437      +41
  Misses        526      526
Co-authored-by: Emanuele Bezzi <[email protected]>
*Published:* *July 9th, 2024*

*By:* *[Emanuele Bezzi](mailto:[email protected]), [Pablo Garcia-Nieto](mailto:[email protected]), [Prathap Sridharan](mailto:[email protected]), [Ryan Williams](mailto:[email protected])*
Ryan's email is wrong. Worth checking if they're ok with adding the email here though?
Thanks, I saw that #1228 addresses this 🙏 (and yes, I'm ok / appreciate being listed!)
Thanks for this! One note: the wrong circle is labeled "default" in docs/articles/2024/20240709-pytorch-fig-benchmark.png. From the interactive plot, here's a gif showing the 2 data points corresponding to 2048 chunks of size 64 (one from a g4dn.4xlarge, one from a g4dn.8xlarge).

I'm not sure how significant the speed difference is between the g4dn.{4x,8x}large nodes, with the default configuration, given N=1. I can generate a few more samples, if you like.
Images don't render in markdown; they are encoded for the MyST parser.
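For context, a MyST-encoded figure would look something like the sketch below. The image path and caption are taken from this thread; the directive options shown are illustrative, not a transcription of the article source.

````markdown
```{figure} 20240709-pytorch-fig-benchmark.png
:alt: PyTorch loader benchmark
:name: fig-benchmark

Figure 3: densifying the expression data increases training speed ~3X
while maintaining relatively constant memory usage.
```
````

A plain-markdown renderer (e.g. the GitHub diff view) shows this as a literal fenced block, which is why the image does not render in the PR.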