
Complete docstrings for PyTorch ExperimentDataPipe API #613

Merged

atolopko-czi merged 8 commits into main from atol/500-pytorch-docs on Jul 18, 2023

Conversation

@atolopko-czi (Collaborator) commented Jul 10, 2023:

Resolves #500
Introduces #614 :)

  • Fleshes out docstrings for ExperimentDataPipe et al. (basic usage sketched below)
  • Fixes & re-runs pytorch notebook
  • Removes testing-only __main__ from pytorch.py
  • Renames {,_}ObsAndXSOMABatch to indicate it's private
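A minimal usage sketch of the ExperimentDataPipe API these docstrings cover (a hedged illustration only: the open_soma defaults, the value_filter, and the obs_column_names below are assumptions, not part of this PR):

```python
# Illustrative sketch only — the census query, filter, and column names are
# assumptions for this example, not taken from the PR.
import cellxgene_census
import tiledbsoma as soma
from cellxgene_census.experimental.ml.pytorch import ExperimentDataPipe

with cellxgene_census.open_soma() as census:
    experiment = census["census_data"]["homo_sapiens"]
    datapipe = ExperimentDataPipe(
        experiment,
        measurement_name="RNA",
        X_name="raw",
        obs_query=soma.AxisQuery(value_filter="tissue_general == 'tongue'"),
        obs_column_names=["cell_type"],
    )
    for X, obs in datapipe:  # 2-tuples of X and obs Tensors, per the docstrings
        break
```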

@atolopko-czi self-assigned this on Jul 10, 2023
Also:
- Updated docsite local build instructions
- Removed __main__ from pytorch.py, which was for testing only.
codecov bot commented Jul 10, 2023:

Codecov Report

Merging #613 (e5d8e25) into main (5ac97a0) will increase coverage by 0.30%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #613      +/-   ##
==========================================
+ Coverage   88.12%   88.43%   +0.30%     
==========================================
  Files          62       62              
  Lines        3757     3744      -13     
==========================================
  Hits         3311     3311              
+ Misses        446      433      -13     
Flag Coverage Δ
unittests 88.43% <100.00%> (+0.30%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...xgene_census/tests/experimental/ml/test_pytorch.py 94.25% <ø> (ø)
...us/src/cellxgene_census/experimental/ml/pytorch.py 92.15% <100.00%> (+4.47%) ⬆️


[docstring excerpt under review]
the ``obs_tables_iter`` argument. For the specified ``obs`` rows, the corresponding ``X`` data is loaded and
joined together. It is returned from this iterator as 2-tuples of ``X`` and obs Tensors.

Internally manages the retrieval of data in SOMA-sized batches, fetching the next batch of SOMA data as needed.
Contributor commented:

what is a SOMA-sized batch? and how do I control it?

I'm guessing you mean the read buffer size controlled by soma.init_buffer_bytes context config?

Contributor commented:

or maybe it is a reference to the batch size controlled by soma_buffer_bytes param?

@atolopko-czi (Collaborator, Author) commented Jul 10, 2023:

Yes, they are related. The soma_buffer_bytes param sets both the soma.init_buffer_bytes and py.init_buffer_bytes config params. So it affects how many obs rows are read in per SOMA "batch", as determined by however many rows fit into a returned obs PyArrow table chunk. For better or for worse (likely the latter), the corresponding X data will be fetched using the same init_buffer_bytes setting, so it will necessarily require multiple reads. But overall, the soma_buffer_bytes user setting provides some control over maximum memory usage. A row-based setting might be easier for the user to comprehend, but that's beyond a documentation change for this PR. How much detail do you think is useful in the docs?
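A sketch of the mapping just described, assuming the TileDB-SOMA context-config mechanism and an arbitrary 1 GiB value:

```python
# Sketch: soma_buffer_bytes maps onto both TileDB-SOMA read-buffer config
# keys, which is what bounds how many obs rows arrive per SOMA "batch".
import tiledbsoma as soma

ctx = soma.SOMATileDBContext(
    tiledb_config={
        "soma.init_buffer_bytes": 1 * 1024**3,  # 1 GiB — example value only
        "py.init_buffer_bytes": 1 * 1024**3,
    }
)
# Per this thread, the equivalent knob on the datapipe would be, e.g.:
#   ExperimentDataPipe(..., soma_buffer_bytes=1 * 1024**3)
```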

Contributor commented:

> How much detail do you think is useful in the docs?

Given how much this will impact performance and memory footprint, it seems like it needs to be covered at least at a high level. Important for anyone tuning.

It could be covered in a "performance" notebook/doc, rather than the docstrings.

@atolopko-czi (Collaborator, Author) commented:

I agree, and I like the notebook tuning example suggestion! Since there is clear work to be done on fixing memory usage, which may result in changes to how memory usage is configured, I'm going to defer any doc updates (and new notebook examples) for now and will circle back to this, armed with more confidence in the impact of this setting, or possibly new settings. I've updated the relevant bullet point in the epic covering this work. Thanks!

@ebezzi (Member) left a review:

LGTM, a few totally optional nitpicks.

@pablo-gar (Contributor) left a review:

LGTM with small comments, address them at your discretion.


[docstring excerpt under review]
>>> (tensor([0., 0., 0., 0., 0., 1., 0., 0., 0.]), # X data
tensor([2415, 0, 0], dtype=torch.int64)) # obs data, encoded
Contributor commented:

Suggested change:
- tensor([2415, 0, 0], dtype=torch.int64)) # obs data, encoded
+ tensor([2415, 0, 0], dtype=torch.int64)) # obs soma_joinid (first element) and obs data encoded

Somehow suggest that the first element is an id.

@atolopko-czi (Collaborator, Author) commented:

Note that the contents of the obs tensor are explained more fully later in the docstring, including the soma_joinid.
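A hedged sketch of that structure, continuing from the earlier usage sketch (the obs_encoders lookup and the "cell_type" column are assumptions for illustration, not a claim about this PR's exact API):

```python
# Continuing the hypothetical usage sketch above. In the docstring example,
# obs is tensor([2415, 0, 0]): element 0 is the row's soma_joinid, and the
# remaining elements are label-encoded obs columns.
X, obs = next(iter(datapipe))
soma_joinid = int(obs[0])   # the soma_joinid, e.g. 2415
code = int(obs[1])          # encoded value of the first requested obs column
# Assumed invertible via the datapipe's obs_encoders mapping:
label = datapipe.obs_encoders["cell_type"].inverse_transform([code])[0]
```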

[docstring excerpt under review]
value. Maximum memory utilization is controlled by this parameter, with larger values providing better
read performance.
use_eager_fetch:
    Controls whether the returned iterator will eagerly fetch data from SOMA while client code is iterating
@pablo-gar (Contributor) commented Jul 11, 2023:

Do you think this explanation is clear for most Python users? I get it because we've used these concepts for a lot of SOMA/census operations.

I just want to make sure that others will get it too. Specifically I wonder if there is a way to re-phrase "will eagerly fetch data from SOMA while client code is iterating"

@atolopko-czi (Collaborator, Author) commented:

I gave it another shot...lmk if you think it might be clearer to our users.
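For readers tuning this, a sketch of the trade-off under discussion (use_eager_fetch is assumed here to default to True; the datapipe continues from the earlier sketch):

```python
# Sketch of the eager-fetch behavior discussed above. With eager fetching, the
# datapipe starts retrieving the next SOMA batch in the background while the
# caller is still consuming the current one: better throughput, higher peak
# memory. With it disabled, each batch is fetched only when the iterator asks.
from cellxgene_census.experimental.ml.pytorch import experiment_dataloader

loader = experiment_dataloader(datapipe)  # thin wrapper around torch.utils.data.DataLoader
for X, obs in loader:
    ...  # while this step runs, I/O for the next batch may already be in flight
```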

@atolopko-czi merged commit d5e9e8c into main on Jul 18, 2023
@atolopko-czi deleted the atol/500-pytorch-docs branch on July 18, 2023 at 20:57
Successfully merging this pull request may close these issues: PyTorch DataLoader documentation.