
Complete docstrings for PyTorch ExperimentDataPipe API #613

Merged

atolopko-czi merged 8 commits into main from atol/500-pytorch-docs on Jul 18, 2023

Conversation

@atolopko-czi (Collaborator) commented Jul 10, 2023:

Resolves #500
Introduces #614 :)

  • Fleshes out docstrings for ExperimentDataPipe et al. (basic usage sketched below)
  • Fixes & re-runs pytorch notebook
  • Removes testing-only __main__ from pytorch.py
  • Renames {,_}ObsAndXSOMABatch to indicate it's private
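A minimal usage sketch of the ExperimentDataPipe API these docstrings cover (a hedged illustration only: the open_soma defaults, the value_filter, and the obs_column_names below are assumptions, not part of this PR):

```python
# Illustrative sketch only — the census query, filter, and column names are
# assumptions for this example, not taken from the PR.
import cellxgene_census
import tiledbsoma as soma
from cellxgene_census.experimental.ml.pytorch import ExperimentDataPipe

with cellxgene_census.open_soma() as census:
    experiment = census["census_data"]["homo_sapiens"]
    datapipe = ExperimentDataPipe(
        experiment,
        measurement_name="RNA",
        X_name="raw",
        obs_query=soma.AxisQuery(value_filter="tissue_general == 'tongue'"),
        obs_column_names=["cell_type"],
    )
    for X, obs in datapipe:  # 2-tuples of X and obs Tensors, per the docstrings
        break
```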

@atolopko-czi self-assigned this on Jul 10, 2023
Also:
- Updated docsite local build instructions
- Removed __main__ from pytorch.py, which was for testing only.
codecov bot commented Jul 10, 2023:

Codecov Report

Merging #613 (e5d8e25) into main (5ac97a0) will increase coverage by 0.30%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #613      +/-   ##
==========================================
+ Coverage   88.12%   88.43%   +0.30%     
==========================================
  Files          62       62              
  Lines        3757     3744      -13     
==========================================
  Hits         3311     3311              
+ Misses        446      433      -13     
Flag Coverage Δ
unittests 88.43% <100.00%> (+0.30%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...xgene_census/tests/experimental/ml/test_pytorch.py 94.25% <ø> (ø)
...us/src/cellxgene_census/experimental/ml/pytorch.py 92.15% <100.00%> (+4.47%) ⬆️


[docstring excerpt under review]
the ``obs_tables_iter`` argument. For the specified ``obs`` rows, the corresponding ``X`` data is loaded and
joined together. It is returned from this iterator as 2-tuples of ``X`` and obs Tensors.

Internally manages the retrieval of data in SOMA-sized batches, fetching the next batch of SOMA data as needed.
Contributor commented:

what is a SOMA-sized batch? and how do I control it?

I'm guessing you mean the read buffer size controlled by soma.init_buffer_bytes context config?

Contributor commented:

or maybe it is a reference to the batch size controlled by soma_buffer_bytes param?

@atolopko-czi (Collaborator, Author) commented Jul 10, 2023:

Yes, they are related. The soma_buffer_bytes param sets both the soma.init_buffer_bytes and py.init_buffer_bytes config params. So it affects how many obs rows are read in per SOMA "batch", as determined by however many rows fit into a returned obs PyArrow table chunk. For better or for worse (likely the latter), the corresponding X data will be fetched using the same init_buffer_bytes setting, so it will necessarily require multiple reads. But overall, the soma_buffer_bytes user setting provides some control over maximum memory usage. A row-based setting might be easier for the user to comprehend, but that's beyond a documentation change for this PR. How much detail do you think is useful in the docs?
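A sketch of the mapping just described, assuming the TileDB-SOMA context-config mechanism and an arbitrary 1 GiB value:

```python
# Sketch: soma_buffer_bytes maps onto both TileDB-SOMA read-buffer config
# keys, which is what bounds how many obs rows arrive per SOMA "batch".
import tiledbsoma as soma

ctx = soma.SOMATileDBContext(
    tiledb_config={
        "soma.init_buffer_bytes": 1 * 1024**3,  # 1 GiB — example value only
        "py.init_buffer_bytes": 1 * 1024**3,
    }
)
# Per this thread, the equivalent knob on the datapipe would be, e.g.:
#   ExperimentDataPipe(..., soma_buffer_bytes=1 * 1024**3)
```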

Contributor commented:

> How much detail do you think is useful in the docs?

Given how much this will impact performance and memory footprint, it seems like it needs to be covered at least at a high level. Important for anyone tuning.

It could be covered in a "performance" notebook/doc, rather than the docstrings.

@atolopko-czi (Collaborator, Author) commented:

I agree, and I like the notebook tuning example suggestion! Since there is clear work to be done on fixing memory usage, which may result in changes to how memory usage is configured, I'm going to defer any doc updates (and new notebook examples) for now and will circle back to this, armed with more confidence in the impact of this setting, or possibly new settings. I've updated the relevant bullet point in the epic covering this work. Thanks!

@ebezzi (Member) left a review:

LGTM, a few totally optional nitpicks.

@pablo-gar (Contributor) left a review:

LGTM with small comments, address them at your discretion.


[docstring excerpt under review]
>>> (tensor([0., 0., 0., 0., 0., 1., 0., 0., 0.]), # X data
tensor([2415, 0, 0], dtype=torch.int64)) # obs data, encoded
Contributor commented:

Suggested change:
- tensor([2415, 0, 0], dtype=torch.int64)) # obs data, encoded
+ tensor([2415, 0, 0], dtype=torch.int64)) # obs soma_joinid (first element) and obs data encoded

Somehow suggest that the first element is an id.

@atolopko-czi (Collaborator, Author) commented:

Note that the contents of the obs tensor are explained more fully later in the docstring, including the soma_joinid.
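A hedged sketch of that structure, continuing from the earlier usage sketch (the obs_encoders lookup and the "cell_type" column are assumptions for illustration, not a claim about this PR's exact API):

```python
# Continuing the hypothetical usage sketch above. In the docstring example,
# obs is tensor([2415, 0, 0]): element 0 is the row's soma_joinid, and the
# remaining elements are label-encoded obs columns.
X, obs = next(iter(datapipe))
soma_joinid = int(obs[0])   # the soma_joinid, e.g. 2415
code = int(obs[1])          # encoded value of the first requested obs column
# Assumed invertible via the datapipe's obs_encoders mapping:
label = datapipe.obs_encoders["cell_type"].inverse_transform([code])[0]
```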

[docstring excerpt under review]
value. Maximum memory utilization is controlled by this parameter, with larger values providing better
read performance.
use_eager_fetch:
    Controls whether the returned iterator will eagerly fetch data from SOMA while client code is iterating
@pablo-gar (Contributor) commented Jul 11, 2023:

Do you think this explanation is clear for most Python users? I get it because we've used these concepts for a lot of SOMA/census operations.

I just want to make sure that others will get it too. Specifically I wonder if there is a way to re-phrase "will eagerly fetch data from SOMA while client code is iterating"

@atolopko-czi (Collaborator, Author) commented:

I gave it another shot...lmk if you think it might be clearer to our users.
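For readers tuning this, a sketch of the trade-off under discussion (use_eager_fetch is assumed here to default to True; the datapipe continues from the earlier sketch):

```python
# Sketch of the eager-fetch behavior discussed above. With eager fetching, the
# datapipe starts retrieving the next SOMA batch in the background while the
# caller is still consuming the current one: better throughput, higher peak
# memory. With it disabled, each batch is fetched only when the iterator asks.
from cellxgene_census.experimental.ml.pytorch import experiment_dataloader

loader = experiment_dataloader(datapipe)  # thin wrapper around torch.utils.data.DataLoader
for X, obs in loader:
    ...  # while this step runs, I/O for the next batch may already be in flight
```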

@atolopko-czi merged commit d5e9e8c into main on Jul 18, 2023
@atolopko-czi deleted the atol/500-pytorch-docs branch on July 18, 2023 at 20:57
Successfully merging this pull request may close these issues: PyTorch DataLoader documentation.