PyTorch DataLoader #499

atolopko-czi · 2023-05-23T19:25:07Z

First draft of the PyTorch DataLoader support, sufficiently developed to warrant feedback from the community. Supports:

User-specified obs/var queries for slicing the Census
User-specified obs attributes to be returned in PyTorch Tensors (for training input or predicted labels)
User-specified PyTorch batch size for items returned by DataLoader
Automatic categorical encoding of obs attributes, with access to encoders for decoding, as needed
Option for sparse or dense Tensors for X data
Use of multiprocessing (multiple worker processes) to retrieve data in per-worker partitions
Support splitting data into train/test/etc sets using PyTorch Data Pipes (e.g. RandomSplitter)
Support shuffling data using PyTorch Data Pipes (e.g. Shuffler)
Introduces an experimental sub-package, under which the new pytorch.py module exists (further sub-pkgs could be used)
Demo code is in api/python/notebooks/ml_demo/pytorch_lr_classifier.py, which still needs to be converted into a proper notebook.
Supports optional installation of pip packages for "experimental" package use. Specify as pip install '.[experimental]'.

Addresses:

Partially addresses:

introduce this under a new "incubation" sub-package

user to specify as normal DataLoader param

"X, y" is common ordering in ML training usage

allows model training code access query dim sizes required refactoring to initialize ExperimentDataPipe when needed

cleans up experiment open logic, providing s3 region param

separate the ExperimentDataPipe instantation from the experiment_data_loader() method, to allow user to specify additional pipe operations, per her use case.

api/python/cellxgene_census/src/cellxgene_census/experimental/pytorch.py

Sparse Tensors are experimental and are currently known not to work with multiprocessing or LR models. So let's make use of dense Tensors the default. Use of sparse Tensors now requires the explicit option `sparse_X=True`. Also fix bug where dense, non-batched results were returned as 1-row batches, instead of being unwrapped.

api/python/cellxgene_census/src/cellxgene_census/experimental/ml/pytorch.py

* experimental tests only run with explicit pytest option * only run experimental tests with Python 3.10 on GHA, since not all experimental package dependencies are available for lower version (e.g. pytorch 2.0.1 not available for Python 3.7)

codecov · 2023-05-24T15:48:05Z

Codecov Report

Merging #499 (78040c5) into main (838ab41) will increase coverage by 0.19%.
The diff coverage is 90.21%.

@@            Coverage Diff             @@
##             main     #499      +/-   ##
==========================================
+ Coverage   88.57%   88.77%   +0.19%     
==========================================
  Files          50       52       +2     
  Lines        2784     3198     +414     
==========================================
+ Hits         2466     2839     +373     
- Misses        318      359      +41

Flag	Coverage Δ
unittests	`88.77% <90.21%> (+0.19%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
...us/src/cellxgene_census/experimental/ml/pytorch.py	`87.38% <87.38%> (ø)`
api/python/cellxgene_census/tests/conftest.py	`92.30% <90.00%> (-7.70%)`	⬇️
...xgene_census/tests/experimental/ml/test_pytorch.py	`93.15% <93.15%> (ø)`
...hon/cellxgene_census/src/cellxgene_census/_open.py	`100.00% <100.00%> (ø)`
api/python/cellxgene_census/tests/test_open.py	`100.00% <100.00%> (ø)`

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

api/python/cellxgene_census/pyproject.toml

api/python/cellxgene_census/src/cellxgene_census/experimental/ml/pytorch.py

atolopko-czi · 2023-05-25T13:34:05Z

api/python/notebooks/ml_demo/pytorch_lr_classifier.py

@anaistrate This code may look eerily familiar to you since I adapted your LR model for cell type prediction, paring it back to what I believe is a canonical example of ML training, but now using the Census PyTorch DataLoader. I intend to convert it to a proper notebook. For now, if you have any feedback on how to make this "more canonical", simpler, or correct any obvious problems, it would be much appreciated!

I'm likely going to merge this shortly, but if you have any feedback you can add to this issue: #476

rename buffer_bytes to soma_buffer_bytes for clarity

bkmartinjr

Just small comments to resolve - overall it looks like a solid foundation

- rename ms_name to measurement_name - remove ExperimentDataPipe num_workers param (unused except for an invalid usage check)

bkmartinjr · 2023-05-25T16:09:56Z

api/python/cellxgene_census/src/cellxgene_census/experimental/ml/pytorch.py

        pytorch_logger.debug(f"Retrieved batch: shape={self.X_batch.shape}, cum_stats: {self.stats}")
        return self.obs_batch_


 class ExperimentDataPipe(pipes.IterDataPipe[Dataset[ObsDatum]]):  # type: ignore
    """
-    An iterable-style data pipe.
+    An iterable-style PyTorch data pipe that reads obs and X data from a SOMA Experiment, and returns an iterator of


I know you consider them redundant, but I still think these need lifecycle tags. As Brian said, "belt and suspenders"

bkmartinjr

This is getting close enough for an experimental feature - approving. Left a few additional comments you may want to consider.

rename layer_name to X_name, to align with Census API get_anndata

* Pin tiledb version as work-around (#454) * temporarily pin required tiledb version * revert release.json URL while infra changes are fluid * Incorporate comms changes (#456) * Final edits to gget tutorial (#451) * revert R API release.json URL (#458) the new URL is not yet ready and Python API has already been reverted * symlink the new notebooks * Fix docsite version and disable searchbar (#460) * update release boostrap URL to public permalink (#459) * Remove pre-release from README.md (#462) * Enable anonymous access to S3 bucket (#275) * Enable anonymous access to S3 bucket * R + unit tests * Pin tiledbsoma==1.2.3 * R styler * Update api/python/cellxgene_census/tests/test_open.py Co-authored-by: Andrew Tolopko <[email protected]> * remove R, upgrade pyproject * remove R * add newline * add negative test --------- Co-authored-by: Andrew Tolopko <[email protected]> * Bump static census version in R tests (#472) * [r] update `get_presence_matrix()` and vignette to use zero-based matrix view (#475) * wip * wip * update census_dataset_presence.Rmd * add acceptance test run (#485) * rerun notebooks for census build 2023-05-15 ("stable") Census release (#484) * Updated gget cellxgene tutorial to reflect workflow updates [formatting corrected] (#490) * Created using Colaboratory * Created using Colaboratory * Created using Colaboratory * correct formating --------- Co-authored-by: Laura Luebbert <[email protected]> * Add docsite version number from the library version (#481) * Add docsite version number from the library version * revert odd commit * pin tiledbsoma==1.2.4 (#493) * [docs] Add autosummary (#492) * autosummary * Add module desc + fix links * [R] close census objects (#486) * Using the new stateful open/close in R tiledbsoma -- mainly docs/vignettes/tests, but a few of the helper implementations too. * Refactored `open_soma()` to facilitate sharing/reuse of `SOMATileDBContext`, and do that hroughout the tests. * Updated vignettes to reflect recent changes to the Python notebooks. * However, the vignettes are now using too much memory to build in GHA. Still troubleshooting this, but for: disabled building them in GHA in order to un-break our CI. * Fix runs-on to use matrix strategy in py-unittests (#494) * Fix runs-on to use matrix strategy in py-unittests * try ARM64 runner * roll back ARM64 runner * add if * Revert bad commit * Enable anonymous access in R (#471) * Enable anonymous access to S3 bucket * R + unit tests * Pin tiledbsoma==1.2.3 * R styler * Update api/python/cellxgene_census/tests/test_open.py Co-authored-by: Andrew Tolopko <[email protected]> * fix * fix * reset credentials --------- Co-authored-by: Andrew Tolopko <[email protected]> * Add Databricks install instructions to FAQ (#488) * [docs] Fix the Census link in navbar (#491) * [r] use `stable` by default & add alias resolution message (#502) Completes #482 For parity with python #435 Also adapts to several recent breaking API changes in tiledbsoma * PyTorch DataLoader (#499) * Add PyTorch DataLoader support * Introduce this code under a new "experimental" sub-package, with new pytest "experimental" marker for unit tests. * Add initial PyTorch example code for LR model training. Not a notebook yet, but under notebooks dir for now. * chore: update lifecycle tags (#509) * Update lifecycle tags for non-experimental Python API to "maturing" * Update lifecycle tags for experimental Python API to "experimental" * export public names for experimental ml package * bump python api tiledbsoma version (#510) bump tiledbsoma to 1.2.5, which includes updated api doc lifecycle tags * minor clarifications for the pypi.org release process (#512) * cache most R dependencies to speed up r-check CI (#517) #309 -- Cache most R dependencies instead of always building the latest versions of all of them. (Then immediately afterwards, still install the latest tiledb & tiledbsoma from r-universe) * [r] Add comp_bio_census_info.Rmd (#407) Also: - update all vignettes to recent tiledbsoma API evolution - temporarily move vignettes into `vignettes_wip/` pending a plan for how to build them outside of GitHub Actions Co-authored-by: Emanuele Bezzi <[email protected]> * fix pytorch multiprocessing result (#516) The first partition of data was being returned from each worker, apparently caused by use of a PyArrow array for passing the set of joinids to each worker's result iterator, possibly due to a bug in TileDB-SOMA. * Update release_process.md (#520) * highly variable gene annotation (#511) * initial implementation of highly_variable_genes * add test marks * add prebuffered iterator * lint * lint * docstrings * reduce expensive tests * fix typo * actually fix typo * add test for get_highly_variable_genes * lint * reduce memory use in tests * add example to docstring * fix anon access in small memory context * PR feedback * loess jitter * increase max loess noise max to 1e-6 * add tests * fix: pytorch unit test hangs (#522) * force use of multiprocessing spawn start method for pytorch * run experimental tests in all envs except 3.7 * Add support for Python 3.11 (#528) * Add support for Python 3.11 * Add 3.11 to test workflow * update notebook guidelines (#534) * update notebook guidelines * Update docs/census_notebook_guidelines.md Co-authored-by: Emanuele Bezzi <[email protected]> * Update docs/census_notebook_guidelines.md * Apply suggestions from code review Co-authored-by: Andrew Tolopko <[email protected]> --------- Co-authored-by: Emanuele Bezzi <[email protected]> Co-authored-by: Andrew Tolopko <[email protected]> * Add docs for experimental modules (#537) * Add modules to python-api.rst * Add README.md * Update docs/python-api.rst Co-authored-by: pablo-gar <[email protected]> * Update docs/python-api.rst Co-authored-by: pablo-gar <[email protected]> --------- Co-authored-by: pablo-gar <[email protected]> * demo notebook for HVG experimental API (#536) * reorg files * add HVG notebook * clean up notebook docs symlinks * lint * lint,try 2 * fix Ruff error * fix lint in thread regex * work around upstream lint * update target python version to match target container version * fix typing lint * update and fix config for pre-commit * PR feedback / fixes * Add experimental notebooks to docsite + fix headers (#541) * fix OOM on pytorch unit test (#542) skip failing test on 3.9 only * Add pytorch notebook (#551) * add pytorch notebook; minor pytorch api docstring updates * CR feedback * Update api/python/notebooks/experimental/pytorch.ipynb * Update api/python/notebooks/experimental/pytorch.ipynb * run full notebook --------- Co-authored-by: pablo-gar <[email protected]> Co-authored-by: Pablo E Garcia-Nieto <[email protected]> * fix incorrect pytorch obs soma_joinids (#555) - Use torch.from_numpy() instead of Torch.Tensor() to construct tensors. The latter created with dtype=float32, which caused the bug due to truncation of precision when casting to int64-to-float32-to-int64. In addition to fixing the bug, a data copy is eliminated by using this new method. - Rework the test fixture generation methods to allow for ranges of obs and var soma_joinids that may start at any arbitrary value. Necessary for testing the case that produced this bug. - Add asserts to ease paranoia. * Fix MacOS failing tests (#557) * Fix MacOS failing tests For the time being, we'll try to determine if the issue is specific to a Python version by removing 3.7. * try macos13 * comma * try if * exclude * typo --------- Co-authored-by: Bruce Martin <[email protected]> Co-authored-by: pablo-gar <[email protected]> Co-authored-by: Laura Luebbert <[email protected]> Co-authored-by: Andrew Tolopko <[email protected]> Co-authored-by: Emanuele Bezzi <[email protected]> Co-authored-by: Martin Kim <[email protected]> Co-authored-by: Pablo E Garcia-Nieto <[email protected]>

atolopko-czi added 20 commits May 18, 2023 14:03

add PyTorch DataLoader support

fa1024c

introduce this under a new "incubation" sub-package

add dataloader factory method

2080d40

tests for experiment_dataloader() and fixes

95ba2bb

instance var init fix

abcc380

rm explicit timeout param handling

cb37af4

user to specify as normal DataLoader param

minor fixes, comments

2e4b0bd

initial pytorch notebook for LR model training, using AnnData

895fcb5

swap order of obs and labels in DataLoader rows/batches

e33af14

"X, y" is common ordering in ML training usage

support access to ExperimentAccessQuery from ExperimentDataPipe

dd55c7d

allows model training code access query dim sizes required refactoring to initialize ExperimentDataPipe when needed

pytorch_lr_classifier example now functional

60e0b9d

refactoring

1084482

take Experiment obj instead of URI

d766785

cleans up experiment open logic, providing s3 region param

fix multiprocessing pickling

fdeb52c

pytorch example code tweaks

c66bdb3

use logging instead of prints

080e5ec

use real census in pytorch example

70aec84

add pip requirements for PyTorch DataLoader

e0968b1

support split and shuffle via PyTorch DataPipes

ad0356e

separate the ExperimentDataPipe instantation from the experiment_data_loader() method, to allow user to specify additional pipe operations, per her use case.

TODO

891684c

install experimental pip requirements optionally

503a7dc

atolopko-czi self-assigned this May 23, 2023

atolopko-czi requested review from pablo-gar, bkmartinjr, anaistrate and ebezzi May 23, 2023 19:26

move pytorch.py to experimental.ml sub-package

b51b4a9

ebezzi reviewed May 23, 2023

View reviewed changes

api/python/cellxgene_census/src/cellxgene_census/experimental/pytorch.py Outdated Show resolved Hide resolved

api/python/cellxgene_census/src/cellxgene_census/experimental/pytorch.py Outdated Show resolved Hide resolved

atolopko-czi added 3 commits May 23, 2023 16:07

refactor var name

94778ea

fix GHA builds

7f617f2

update example pytest code for sparse_X option

6313ee9

bkmartinjr reviewed May 24, 2023

View reviewed changes

api/python/cellxgene_census/src/cellxgene_census/experimental/ml/pytorch.py Outdated Show resolved Hide resolved

bkmartinjr reviewed May 24, 2023

View reviewed changes

api/python/cellxgene_census/src/cellxgene_census/experimental/ml/pytorch.py Show resolved Hide resolved

atarashansky reviewed May 24, 2023

View reviewed changes

api/python/cellxgene_census/src/cellxgene_census/experimental/ml/pytorch.py Show resolved Hide resolved

atolopko-czi and others added 2 commits May 24, 2023 11:32

run experimental tests conditionally (#501)

4bbb3b8

* experimental tests only run with explicit pytest option * only run experimental tests with Python 3.10 on GHA, since not all experimental package dependencies are available for lower version (e.g. pytorch 2.0.1 not available for Python 3.7)

Merge branch 'main' into atol/pytorch-dataloader

c027b7d

bkmartinjr reviewed May 24, 2023

View reviewed changes

api/python/cellxgene_census/pyproject.toml Show resolved Hide resolved

bkmartinjr reviewed May 24, 2023

View reviewed changes

api/python/cellxgene_census/src/cellxgene_census/experimental/ml/pytorch.py Show resolved Hide resolved

atolopko-czi commented May 25, 2023

View reviewed changes

atolopko-czi added 2 commits May 25, 2023 10:01

docs

1c3a736

use API's default buffer_bytes value

8cfc59b

rename buffer_bytes to soma_buffer_bytes for clarity

bkmartinjr suggested changes May 25, 2023

View reviewed changes

refactoring

8ce019f

- rename ms_name to measurement_name - remove ExperimentDataPipe num_workers param (unused except for an invalid usage check)

atolopko-czi requested a review from bkmartinjr May 25, 2023 15:13

bkmartinjr reviewed May 25, 2023

View reviewed changes

bkmartinjr approved these changes May 25, 2023

View reviewed changes

atolopko-czi added 5 commits May 25, 2023 12:52

refactoring

553423d

rename layer_name to X_name, to align with Census API get_anndata

add explicit scikit-learn pip dependency

51ba4bb

update TODO

ce3abe7

Merge branch 'main' into atol/pytorch-dataloader

067893b

Merge branch 'main' into atol/pytorch-dataloader

78040c5

atolopko-czi merged commit e5b59d5 into main May 30, 2023

atolopko-czi deleted the atol/pytorch-dataloader branch May 30, 2023 15:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PyTorch DataLoader #499

PyTorch DataLoader #499

atolopko-czi commented May 23, 2023 •

edited

Loading

codecov bot commented May 24, 2023 •

edited

Loading

atolopko-czi May 25, 2023

atolopko-czi May 25, 2023

bkmartinjr left a comment

bkmartinjr May 25, 2023

bkmartinjr left a comment

PyTorch DataLoader #499

PyTorch DataLoader #499

Conversation

atolopko-czi commented May 23, 2023 • edited Loading

codecov bot commented May 24, 2023 • edited Loading

Codecov Report

atolopko-czi May 25, 2023

Choose a reason for hiding this comment

atolopko-czi May 25, 2023

Choose a reason for hiding this comment

bkmartinjr left a comment

Choose a reason for hiding this comment

bkmartinjr May 25, 2023

Choose a reason for hiding this comment

bkmartinjr left a comment

Choose a reason for hiding this comment

atolopko-czi commented May 23, 2023 •

edited

Loading

codecov bot commented May 24, 2023 •

edited

Loading