Complete docstrings for PyTorch ExperimentDataPipe API #613
Conversation
Also:
- Updated docsite local build instructions
- Removed `__main__` from `pytorch.py`, which was for testing only
Codecov Report
@@ Coverage Diff @@
## main #613 +/- ##
==========================================
+ Coverage 88.12% 88.43% +0.30%
==========================================
Files 62 62
Lines 3757 3744 -13
==========================================
Hits 3311 3311
+ Misses 446 433 -13
the ``obs_tables_iter`` argument. For the specified ``obs`` rows, the corresponding ``X`` data is loaded and
joined together. It is returned from this iterator as 2-tuples of ``X`` and obs Tensors.

Internally manages the retrieval of data in SOMA-sized batches, fetching the next batch of SOMA data as needed.
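For orientation, here is a minimal usage sketch of the iterator described above. The parameter names (`measurement_name`, `X_name`, `obs_column_names`) and the `"homo_sapiens"` experiment are assumptions drawn from this conversation and the Census data model, not a definitive rendering of the released API.

```python
# Hedged usage sketch of ExperimentDataPipe, per the docstring excerpt above.
# Assumptions: open_soma() defaults, the "homo_sapiens" experiment, and the
# parameter names discussed in this PR.
import cellxgene_census
from cellxgene_census.experimental.ml.pytorch import ExperimentDataPipe

with cellxgene_census.open_soma() as census:
    experiment = census["census_data"]["homo_sapiens"]
    datapipe = ExperimentDataPipe(
        experiment,
        measurement_name="RNA",
        X_name="raw",
        obs_column_names=["cell_type"],
    )
    # Each element is a 2-tuple of (X tensor, encoded obs tensor), as described above.
    for X, obs in datapipe:
        print(X.shape, obs)
        break
```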
What is a SOMA-sized batch, and how do I control it? I'm guessing you mean the read buffer size controlled by the `soma.init_buffer_bytes` context config?
Or maybe it is a reference to the batch size controlled by the `soma_buffer_bytes` param?
Yes, they are related. The `soma_buffer_bytes` param sets both the `soma.init_buffer_bytes` and `py.init_buffer_bytes` config params. So this affects how many obs rows are read in per SOMA "batch", as determined by however many rows fit into a returned obs PyArrow table chunk. For better or for worse (likely the latter), the corresponding X data will be fetched using the same `init_buffer_bytes` setting, so it will necessarily require multiple reads. But overall, the `soma_buffer_bytes` user setting provides some control over maximum memory usage. A row-based setting might be easier for the user to comprehend, but that's beyond a documentation change for this PR. How much detail do you think is useful in the docs?
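For concreteness, here is a sketch of how a single `soma_buffer_bytes` value could fan out to the two config params named above. `tiledbsoma.SOMATileDBContext` and its `tiledb_config` argument are real TileDB-SOMA API; the wiring inside `ExperimentDataPipe` is an assumption based on this comment, not the library's actual internals.

```python
# Sketch: one soma_buffer_bytes value setting both buffer-size config params,
# as described in the comment above. Not the library's actual implementation.
import tiledbsoma

soma_buffer_bytes = 1 * 1024**3  # 1 GiB; larger => fewer, larger reads, more memory

context = tiledbsoma.SOMATileDBContext(
    tiledb_config={
        "soma.init_buffer_bytes": soma_buffer_bytes,
        "py.init_buffer_bytes": soma_buffer_bytes,
    }
)
# Reads performed with this context return obs PyArrow table chunks whose row
# count is bounded by how many rows fit in the configured buffer.
```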
> How much detail do you think is useful in the docs?
Given how much this will impact performance and memory footprint, it seems like it needs to be covered at least at a high level; it's important for anyone tuning. It could be covered in a "performance" notebook/doc, rather than in the docstrings.
I agree, and I like the notebook tuning example suggestion! Since there is clear work to be done on fixing memory usage, which may result in changes to how memory usage is configured, I'm going to defer any doc updates (and new notebook examples) for now and will circle back to this, armed with more confidence in the impact of this setting, or possibly new settings. I've updated the relevant bullet point in the epic covering this work. Thanks!
LGTM, a few totally optional nitpicks.
LGTM with small comments; address them at your discretion.
>>> (tensor([0., 0., 0., 0., 0., 1., 0., 0., 0.]),  # X data
     tensor([2415, 0, 0], dtype=torch.int64))       # obs data, encoded
Suggested change:
    tensor([2415, 0, 0], dtype=torch.int64)) # obs soma_joinid (first element) and obs data encoded

Somehow suggest that the first element is an id.
Note that the contents of the `obs` tensor are explained more fully later in the docstring, including the `soma_joinid`.
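To illustrate that layout, here is a hypothetical decoding sketch. Only the "first element is the `soma_joinid`" convention comes from this thread; the `obs_encoders` accessor mentioned in the comment below is an assumption.

```python
# Hypothetical sketch of the obs tensor layout: element 0 is the soma_joinid,
# the remaining elements are label-encoded obs column values.
import torch

obs = torch.tensor([2415, 0, 0], dtype=torch.int64)

soma_joinid = obs[0].item()   # row id within the SOMA experiment (assumed convention)
encoded_labels = obs[1:]      # one encoded value per requested obs column

print(f"soma_joinid={soma_joinid}, encoded obs columns={encoded_labels.tolist()}")
# If the pipe exposes its encoders (e.g., a datapipe.obs_encoders mapping, which is
# an assumption here), these integer codes could be mapped back to category labels.
```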
value. Maximum memory utilization is controlled by this parameter, with larger values providing better
read performance.
use_eager_fetch:
    Controls whether the returned iterator will eagerly fetch data from SOMA while client code is iterating
Do you think this explanation is clear for most Python users? I get it because we've used these concepts for a lot of SOMA/census operations, but I just want to make sure that others will get it too. Specifically, I wonder if there is a way to re-phrase "will eagerly fetch data from SOMA while client code is iterating".
I gave it another shot... let me know if you think it might be clearer to our users.
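To make "eagerly fetch data from SOMA while client code is iterating" concrete, here is a generic prefetching-iterator sketch. It is not the cellxgene_census implementation, just an illustration of the overlap between fetching and consumption that `use_eager_fetch` refers to.

```python
# Generic "eager fetch" illustration: a background thread pulls the next item from
# the source iterator while the consumer is still processing the current one.
import queue
import threading
from typing import Iterator, TypeVar

T = TypeVar("T")
_SENTINEL = object()

def eager_iter(source: Iterator[T], max_prefetch: int = 1) -> Iterator[T]:
    """Yield items from `source`, prefetching up to `max_prefetch` items ahead."""
    q: "queue.Queue" = queue.Queue(maxsize=max_prefetch)

    def producer() -> None:
        for item in source:
            q.put(item)  # blocks if the consumer falls behind
        q.put(_SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not _SENTINEL:
        yield item

# Usage: wrapping a slow batch iterator overlaps I/O with downstream computation.
for batch in eager_iter(iter(range(3))):
    print(batch)
```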
Resolves #500
Introduces #614 :)

Also:
- Removed `__main__` from `pytorch.py`
- Renamed `ObsAndXSOMABatch` to `_ObsAndXSOMABatch` to indicate it's private