Initial work toward PyTorch data loaders #1

bkmartinjr · 2024-09-19T00:13:38Z

Notes to reviewers:

APIs have changed from the CZI contribution. The demo notebook (tutorial_pytorch.ipynb) has been updated to run with all API modifications.
This PR is an initial drop that we plan to use for pre-release feedback. More work is required prior to release.
This is set up as a separate Python package, with its own config/build/etc.

This PR contains a PyTorch iterable-style DataSet/DataPipe for use with SOMA Experiment. Initial code contributed by the Chan Zuckerberg single-cell team. Modifications to the original code include:

stand-alone Python package tiledbsoma_ml
significant refactoring to improve performance
multi-worker/multi-GPU support - should work with DDP, DataLoader and Lightning
CI and other config for stand-alone Python package
improved context/config handling - pass through all user-specified configuration allowing use with other storage solutions, AWS multi-region, etc.
fix lint and typing issues identified by the CI pipeline in this repo
update docstrings and copyright
some minor enhancements to unit tests
added torch.utils.data.IterableDataset alongside the existing torchdata.datapipes.iter.IterDataPipe, since torchdata.datapipes is deprecated and slated for removal
add Python 3.8 support
API refinement to better align with ExperimentAxisQuery
fix RNG state handling which caused incorrect results when shuffle=True
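
To illustrate the multi-worker/multi-GPU item above, here is a minimal, stdlib-only sketch of how an iterable-style dataset can split work so that each (GPU rank, DataLoader worker) pair reads a disjoint share. It is not the tiledbsoma_ml implementation; `partition_chunks` and its parameters are hypothetical names chosen for illustration. In a real PyTorch setup the rank and worker values would typically come from torch.distributed and torch.utils.data.get_worker_info().

```python
from typing import List, Sequence


def partition_chunks(
    chunk_ids: Sequence[int],
    rank: int,
    world_size: int,
    worker_id: int,
    num_workers: int,
) -> List[int]:
    """Return the chunk IDs owned by one (rank, worker) pair.

    Hypothetical helper: first split across GPU ranks, then split each
    rank's share across that rank's DataLoader workers.
    """
    if not (0 <= rank < world_size and 0 <= worker_id < num_workers):
        raise ValueError("rank/worker_id out of range")
    per_rank = chunk_ids[rank::world_size]  # strided split across ranks
    return list(per_rank[worker_id::num_workers])  # then across workers


# Example: 10 chunks, 2 GPUs, 2 workers per GPU.
shares = {
    (r, w): partition_chunks(range(10), r, 2, w, 2)
    for r in range(2)
    for w in range(2)
}
```

Every chunk lands in exactly one share. Shares can differ in size (here 3 chunks versus 2), so a real loader also has to keep the number of batches consistent across ranks.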

API change summary:

constructor takes an ExperimentAxisQuery, rather than an Experiment
reworked I/O and shuffle chunk size param for better API UX
added "set_epoch" endpoint to allow distributed shuffle
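
The "set_epoch" item follows a pattern familiar from torch.utils.data.DistributedSampler.set_epoch: every process derives its shuffle from a shared seed plus the epoch number, so all ranks agree on one permutation per epoch while the order still changes between epochs. The sketch below is stdlib-only and illustrative; `EpochShuffle`, its constructor arguments, and `order` are hypothetical names and not the tiledbsoma_ml API.

```python
import random
from typing import List


class EpochShuffle:
    """Hypothetical epoch-aware shuffler, one instance per process (rank)."""

    def __init__(self, n_items: int, seed: int, rank: int, world_size: int) -> None:
        self.n_items = n_items
        self.seed = seed
        self.rank = rank
        self.world_size = world_size
        self.epoch = 0

    def set_epoch(self, epoch: int) -> None:
        # Call before each epoch, with the same value on every rank.
        self.epoch = epoch

    def order(self) -> List[int]:
        # A private generator keyed on seed + epoch: the result neither
        # depends on nor disturbs the global `random` module state.
        rng = random.Random(self.seed + self.epoch)
        perm = list(range(self.n_items))
        rng.shuffle(perm)
        # Each rank keeps a strided slice of the shared permutation.
        return perm[self.rank :: self.world_size]


rank0 = EpochShuffle(n_items=100, seed=42, rank=0, world_size=2)
rank1 = EpochShuffle(n_items=100, seed=42, rank=1, world_size=2)

for shuffler in (rank0, rank1):
    shuffler.set_epoch(1)
epoch1 = (rank0.order(), rank1.order())

for shuffler in (rank0, rank1):
    shuffler.set_epoch(2)
epoch2 = (rank0.order(), rank1.order())
```

Because both ranks slice the same permutation, their shares are disjoint and together cover every item. If `set_epoch` were never called, each epoch would replay the same order.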

add minimum version for several dependencies
add compat test for primary dependencies
fix typo in workflow
fix another typo
compat matrix refinement
fix quoting
refine compat test matrix
further simplify matrix
update changelog to use correct links (#3)

bkmartinjr added 30 commits September 18, 2024 18:58

initial commit of pytorch datapipe/loader

acf584f

update comments

12237b0

more lint

2fc9beb

fix typos

220a11b

rework for performance

2c870ea

tuning

1aedf3d

tweaks, checkpoint

e577ecd

lint

ee2929d

py 3.8 lint

013cea6

rework io and shuffle buffer size params

39dbab6

lint

98c4510

remove encoders; more perf work

933787d

reorganize into separate python package

d44fcae

fix name

8840983

add more paths to CI

bb70d6d

fix typo in ci

8e51344

fix a second typo in ci

6714a89

set working dir in CI

5505c28

make batched 3.12 compat

6090af2

debugging pre-commit failure

c9789f9

lint, lint, lint

33f3c2d

more CI debugging

467bb15

add build test to CI

52efb62

add code coverage

6ab8334

update GHA

4df5049

test TypeAlias

1b31d32

add missing dependencies

71be802

extend tests

a0a8344

remove coverage reporting from CI for now

13da9c7

docstrings

7074e99

bkmartinjr and others added 16 commits September 18, 2024 18:58

add further concurrency to CSR construction

723fa21

cleanup

ea38c5c

fix multi-gpu hang due to incorrect __len__ return value

f704c83

compat with Lightning

8e47320

PR review edits

70cc170

formatting

37bc9b1

add py.typed to package

b0c4547

add sparse support

8ae3992

start draft of Ligtning notebook

9809c5c

lint

656e2e8

update notebook for lightning

f9e13b0

run notebooks

eaeaab4

fix RNG state bug in shuffle; add multi-worker notebook

61face3

add rehome-census.sh, used to construct this repo's history

6ccad8e

update GHA, repo info

9804ee7

add .pre-commit-config.yaml, run lint

5347805

bkmartinjr mentioned this pull request Sep 19, 2024

[python] Initial work toward PyTorch data loaders single-cell-data/TileDB-SOMA#2823

Closed

bkmartinjr added 3 commits September 18, 2024 17:55

add .gitignore

da1389a

autoupdate pre-commit

672d306

remove tiledbsoma-specific ruff/isort rules

975dd12

bkmartinjr requested review from ryan-williams, aaronwolen, ebezzi, atolopko-czi and johnkerl September 19, 2024 01:13

bkmartinjr mentioned this pull request Sep 19, 2024

Package dependency pins & test #2

Merged

bkmartinjr marked this pull request as ready for review September 19, 2024 16:33
