
highly variable gene annotation #511

Merged 22 commits (Jun 7, 2023)
Conversation

@bkmartinjr (Contributor) commented May 31, 2023

Add SOMA-optimized equivalent of scanpy.pp.highly_variable_genes, flavor="seurat_v3"

Example:

In [1]: import cellxgene_census
   ...: from cellxgene_census.experimental.pp import get_highly_variable_genes
   ...: 
   ...: with cellxgene_census.open_soma(census_version="stable") as census:
   ...:     hvg = get_highly_variable_genes(
   ...:         census,
   ...:         organism="Mus musculus",
   ...:         obs_value_filter="is_primary_data == True and tissue_general == 'lung'",
   ...:         n_top_genes = 500
   ...:     )
   ...: 
   ...: hvg[hvg.highly_variable]
   ...: 
The "stable" release is currently 2023-05-15. Specify 'census_version="2023-05-15"' in future calls to open_soma() to ensure data consistency.
Out[1]: 
                means    variances  highly_variable_rank  variances_norm  highly_variable
soma_joinid                                                                              
85           0.104754    93.239460                 343.0        3.671225             True
102          0.772019  2061.330152                 332.0        3.688259             True
115          0.103565    94.853489                 167.0        4.244059             True
214          0.048323    32.914252                 281.0        3.795256             True
336          0.010261     1.912341                 360.0        3.641506             True
...               ...          ...                   ...             ...              ...
29230        0.067882    47.751253                 133.0        4.462335             True
35183        0.019692     6.483923                 471.0        3.448803             True
46714        0.033726    27.929340                 423.0        3.517277             True
48253        0.006745     0.804313                 275.0        3.805323             True
49751        0.064489    45.016323                 497.0        3.421169             True

[500 rows x 5 columns]

In [2]: var_df = census['census_data']['mus_musculus'].ms['RNA'].var.read().concat().to_pandas().set_index('soma_joinid')

In [3]: var_df
Out[3]: 
                     feature_id                       feature_name  feature_length
soma_joinid                                                                       
0            ENSMUSG00000109644                      0610005C13Rik            3583
1            ENSMUSG00000108652                      0610006L08Rik            2128
2            ENSMUSG00000007777                      0610009B22Rik             998
3            ENSMUSG00000086714                      0610009E02Rik            1803
4            ENSMUSG00000043644                      0610009L18Rik             619
...                         ...                                ...             ...
52387        ENSMUSG00000081591                           Btf3-ps9             496
52388        ENSMUSG00000118710  mmu-mir-467a-3_ENSMUSG00000118710              83
52389        ENSMUSG00000119584                              Rn18s            1849
52390        ENSMUSG00000118538                            Gm18218             970
52391        ENSMUSG00000084217                           Setd9-ps             670

[52392 rows x 3 columns]

In [4]: import pandas as pd
   ...: combined_df = pd.concat([var_df, hvg], axis=1)

In [5]: combined_df[combined_df.highly_variable]
Out[5]: 
                     feature_id   feature_name  feature_length     means    variances  highly_variable_rank  variances_norm  highly_variable
soma_joinid                                                                                                                                 
85           ENSMUSG00000075511  1700001L05Rik            6112  0.104754    93.239460                 343.0        3.671225             True
102          ENSMUSG00000026831  1700007K13Rik             965  0.772019  2061.330152                 332.0        3.688259             True
115          ENSMUSG00000023873  1700010I14Rik            2292  0.103565    94.853489                 167.0        4.244059             True
214          ENSMUSG00000044916  1700029I15Rik            1348  0.048323    32.914252                 281.0        3.795256             True
336          ENSMUSG00000097675  1700101I11Rik            1868  0.010261     1.912341                 360.0        3.641506             True
...                         ...            ...             ...       ...          ...                   ...             ...              ...
29230        ENSMUSG00000096323        Gm20767            1427  0.067882    47.751253                 133.0        4.462335             True
35183        ENSMUSG00000051537         Gm5124            2742  0.019692     6.483923                 471.0        3.448803             True
46714        ENSMUSG00000092300           Cdk3            1548  0.033726    27.929340                 423.0        3.517277             True
48253        ENSMUSG00000105428        Mir3068              79  0.006745     0.804313                 275.0        3.805323             True
49751        ENSMUSG00000023795       Pisd-ps2            2108  0.064489    45.016323                 497.0        3.421169             True

[500 rows x 8 columns]

@bkmartinjr bkmartinjr changed the title highly variable gene API highly variable gene annotation May 31, 2023
@bkmartinjr bkmartinjr marked this pull request as ready for review May 31, 2023 17:36
codecov bot commented May 31, 2023

Codecov Report

Merging #511 (ae0677d) into main (58cb475) will decrease coverage by 1.04%.
The diff coverage is 79.18%.

@@            Coverage Diff             @@
##             main     #511      +/-   ##
==========================================
- Coverage   88.67%   87.63%   -1.04%     
==========================================
  Files          53       59       +6     
  Lines        3213     3606     +393     
==========================================
+ Hits         2849     3160     +311     
- Misses        364      446      +82     
Flag Coverage Δ
unittests 87.63% <79.18%> (-1.04%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...lxgene_census/src/cellxgene_census/_get_anndata.py 100.00% <ø> (ø)
...rc/cellxgene_census/experimental/pp/_eager_iter.py 50.00% <50.00%> (ø)
...us/src/cellxgene_census/experimental/pp/_online.py 51.51% <51.51%> (ø)
...e_census/experimental/pp/_highly_variable_genes.py 98.31% <98.31%> (ø)
...s/src/cellxgene_census/experimental/pp/__init__.py 100.00% <100.00%> (ø)
...cellxgene_census/tests/experimental/pp/test_hvg.py 100.00% <100.00%> (ø)
...lxgene_census/tests/experimental/pp/test_online.py 100.00% <100.00%> (ø)


@atolopko-czi (Collaborator) left a comment

Made some minor suggestions and nits, but can approve as is. I did not review this deeply at an algorithmic level, though I did peruse all of the code.

Worth adding a notebook for this code?

Should experimental/pp/__init__.py include a comment to spell out that "pp" is an abbreviation for "preprocessing"? I understand that name is coming from scanpy, so if it's assumed knowledge, no change needed.

Convenience wrapper around ``soma.Experiment`` query and ``highly_variable_genes`` function, to build and
execute a query, and annotate the query result genes (``var`` dataframe) based upon variability.

See ``highly_variable_genes`` for more information on
Collaborator

finish sentence

super_del()


class EagerBufferedIterator(Iterator[_T]):
Collaborator

unused?

Contributor Author

Yes, but left in because I am considering it as a means of further performance enhancement.

'is_primary_data == True and tissue_general in ["heart", "lung"]',
marks=pytest.mark.expensive,
),
("mus_musculus", 'is_primary_data == True and tissue_general == "skin of body"'),
Collaborator

nit: move the cheap test to the front of the array

flavor: Literal["seurat_v3"] = "seurat_v3",
span: float = 0.3,
batch_key: Optional[str] = None,
) -> pd.DataFrame:
Collaborator

Would it be worth having this convenience wrapper return the combined_df, per the example in the PR summary?

Contributor Author

It could, but that would require the fetch of the entire var dataframe, and I did not want to presume that the user would want that extra overhead.

Most use cases I know of do not fetch the entire var.

I'll add an example to the docstrings to show that use case.

estimate the loess variance model fit.

batch_key:
If specified, gene selection will be done by batch and combined.
Collaborator

Personally, I could use further explanation for how and why to use this param. Maybe sufficient to again point the reader to the docs for batch_key. If it's okay to assume the user is already familiar with scanpy.pp.highly_variable_genes, no change necessary.

Contributor Author

I'll add the link.

Side note: any idea how to make giant URLs look better in docstrings?

Collaborator

Not really, other than using a URL shortener, which seems like the wrong thing to do in a docstring.

y = np.log10(v[not_const])
x = np.log10(u[not_const])
model = skmisc.loess.loess(x, y, span=span, degree=2)
# Can we resolve low entropy loess errors by adding jitter and retrying?
Contributor

I think this is a good idea. I retry manually with jitter every time I get a loess error, anyway, so it seems reasonable to automate that.

Contributor Author

OK, will do - it seems simple and very useful.

My feeling is that the parameterization of this should be a new argument called "max_jitter", which can be set to zero to get the old behavior. The code would retry adding jitter until either success or max_jitter is exceeded. I'll ping you for another review when it is added.

Contributor Author

@atarashansky - any thoughts on what would be a reasonable range of jitter values to try before bailing? It is a 64 bit float, so it can start small.

I'm going to start with

[ 10 ** -p for p in range(18, 8, -1) ]

i.e.,

[1e-18, 1e-17, 1e-16, 1e-15, 1e-14, 1e-13, 1e-12, 1e-11, 1e-10, 1e-09]
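The retry-with-jitter scheme discussed in this thread can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual implementation: `fit_loess_with_jitter` and `fit_fn` are hypothetical names, with `fit_fn` standing in for the real loess fit (e.g. `skmisc.loess.loess`), and the jitter schedule mirrors the `[10 ** -p for p in range(18, 8, -1)]` range proposed above.

```python
import numpy as np


def fit_loess_with_jitter(x, y, fit_fn, max_jitter=1e-9, rng=None):
    """Retry a loess-style fit, adding progressively larger jitter to x.

    ``fit_fn`` is a stand-in for the actual loess fit; ``max_jitter=0``
    recovers the old fail-fast behavior (no retries).
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    try:
        return fit_fn(x, y)  # first attempt: no jitter at all
    except ValueError:
        pass
    # Scan jitter magnitudes from 1e-18 up to 1e-9, stopping at max_jitter.
    for jitter in (10.0 ** -p for p in range(18, 8, -1)):
        if jitter > max_jitter:
            break
        try:
            return fit_fn(x + rng.normal(0.0, jitter, size=x.shape), y)
        except ValueError:
            continue  # still a low-entropy failure; try more jitter
    raise ValueError("loess fit failed even at the maximum allowed jitter")
```

Starting from the smallest magnitude means the fit succeeds with the least perturbation that resolves the error, at the cost of extra attempts, which matches the "minimum jitter that solves" trade-off described below.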

@bkmartinjr (Contributor Author)

@atolopko-czi - I believe I addressed your comments, so feel free to take another look. I don't plan on landing this until next week after Pablo has reviewed.

@atolopko-czi (Collaborator)

> @atolopko-czi - I believe I addressed your comments, so feel free to take another look. I don't plan on landing this until next week after Pablo has reviewed.

LGTM!

@pablo-gar (Contributor) left a comment

I don't see anything obviously wrong and the test with the scanpy comparison is solid!!

@bkmartinjr (Contributor Author)

@atarashansky - any last comments on the loess adaptive retry before I land this?

@@ -57,6 +57,7 @@ def _highly_variable_genes_seurat_v3(
n_top_genes: int = 1000,
layer: str = "raw",
span: float = 0.3,
max_loess_jitter: float = 1e-9,
Contributor

Should we allow users to specify jitter explicitly? If specified, we don't do the search across orders of magnitude and let the function fail with ValueError if there is a loess error.

I worry that the function may take a long time if the following conditions were met:

  • Loess errors are common
  • 1e-18 starting value for jitter is not enough to resolve loess

In your experimenting, what are the chances of the above conditions being met? If likely, then a jitter parameter would let users set a reasonably high value that is likely to work so they don't have to wait for the function to scan.

Otherwise, approving now as the current implementation looks good!

Contributor Author

The amount of jitter needed is going to be very dataset dependent, so I think there are edge cases for almost any default value. I picked a conservative range of values because I thought getting the minimum jitter that "solves" was more important than performance (performance being dominated by I/O - read - time in almost all cases).

The current API allows the user to control only the max jitter. We could expand it to allow the user to specify the entire range (and step) to try. But that still leaves the issue of what a good default is, which will be where most users start.

@pablo-gar - appreciate your thoughts.

@bkmartinjr bkmartinjr merged commit 379a8ca into main Jun 7, 2023
@bkmartinjr bkmartinjr deleted the bkmartinjr/hvg branch June 7, 2023 02:36
mlin added a commit that referenced this pull request Jun 22, 2023
* Pin tiledb version as work-around (#454)

* temporarily pin required tiledb version

* revert release.json URL while infra changes are fluid

* Incorporate comms changes (#456)

* Final edits to gget tutorial (#451)

* revert R API release.json URL (#458)

the new URL is not yet ready and Python API has already been reverted

* symlink the new notebooks

* Fix docsite version and disable searchbar (#460)

* update release bootstrap URL to public permalink (#459)

* Remove pre-release from README.md (#462)

* Enable anonymous access to S3 bucket (#275)

* Enable anonymous access to S3 bucket

* R + unit tests

* Pin tiledbsoma==1.2.3

* R styler

* Update api/python/cellxgene_census/tests/test_open.py

Co-authored-by: Andrew Tolopko <[email protected]>

* remove R, upgrade pyproject

* remove R

* add newline

* add negative test

---------

Co-authored-by: Andrew Tolopko <[email protected]>

* Bump static census version in R tests (#472)

* [r] update `get_presence_matrix()` and vignette to use zero-based matrix view (#475)

* wip

* wip

* update census_dataset_presence.Rmd

* add acceptance test run (#485)

* rerun notebooks for census build 2023-05-15 ("stable") Census release (#484)

* Updated gget cellxgene tutorial to reflect workflow updates [formatting corrected] (#490)

* Created using Colaboratory

* Created using Colaboratory

* Created using Colaboratory

* correct formating

---------

Co-authored-by: Laura Luebbert <[email protected]>

* Add docsite version number from the library version (#481)

* Add docsite version number from the library version

* revert odd commit

* pin tiledbsoma==1.2.4 (#493)

* [docs] Add autosummary (#492)

* autosummary

* Add module desc + fix links

* [R] close census objects (#486)

* Using the new stateful open/close in R tiledbsoma -- mainly docs/vignettes/tests, but a few of the helper implementations too.
* Refactored `open_soma()` to facilitate sharing/reuse of `SOMATileDBContext`, and do that throughout the tests.
* Updated vignettes to reflect recent changes to the Python notebooks.
* However, the vignettes are now using too much memory to build in GHA. Still troubleshooting this, but for now building them in GHA is disabled in order to un-break our CI.

* Fix runs-on to use matrix strategy in py-unittests (#494)

* Fix runs-on to use matrix strategy in py-unittests

* try ARM64 runner

* roll back ARM64 runner

* add if

* Revert bad commit

* Enable anonymous access in R (#471)

* Enable anonymous access to S3 bucket

* R + unit tests

* Pin tiledbsoma==1.2.3

* R styler

* Update api/python/cellxgene_census/tests/test_open.py

Co-authored-by: Andrew Tolopko <[email protected]>

* fix

* fix

* reset credentials

---------

Co-authored-by: Andrew Tolopko <[email protected]>

* Add Databricks install instructions to FAQ (#488)

* [docs] Fix the Census link in navbar (#491)

* [r] use `stable` by default & add alias resolution message (#502)

Completes #482

For parity with python #435

Also adapts to several recent breaking API changes in tiledbsoma

* PyTorch DataLoader (#499)

* Add PyTorch DataLoader support
* Introduce this code under a new "experimental" sub-package, with new pytest "experimental" marker for unit tests.
* Add initial PyTorch example code for LR model training. Not a notebook yet, but under notebooks dir for now.

* chore: update lifecycle tags (#509)

* Update lifecycle tags for non-experimental Python API to "maturing"
* Update lifecycle tags for experimental Python API to "experimental"
* export public names for experimental ml package

* bump python api tiledbsoma version (#510)

bump tiledbsoma to 1.2.5, which includes updated api doc lifecycle tags

* minor clarifications for the pypi.org release process (#512)

* cache most R dependencies to speed up r-check CI (#517)

#309 -- Cache most R dependencies instead of always building the latest versions of all of them. (Then immediately afterwards, still install the latest tiledb & tiledbsoma from r-universe)

* [r] Add comp_bio_census_info.Rmd (#407)

Also:

- update all vignettes to recent tiledbsoma API evolution
- temporarily move vignettes into `vignettes_wip/` pending a plan for how to build them outside of GitHub Actions

Co-authored-by: Emanuele Bezzi <[email protected]>

* fix pytorch multiprocessing result (#516)

The first partition of data was being returned from each worker, apparently
caused by use of a PyArrow array for passing the set of joinids to each worker's
result iterator, possibly due to a bug in TileDB-SOMA.

* Update release_process.md (#520)

* highly variable gene annotation (#511)

* initial implementation of highly_variable_genes

* add test marks

* add prebuffered iterator

* lint

* lint

* docstrings

* reduce expensive tests

* fix typo

* actually fix typo

* add test for get_highly_variable_genes

* lint

* reduce memory use in tests

* add example to docstring

* fix anon access in small memory context

* PR feedback

* loess jitter

* increase max loess noise max to 1e-6

* add tests

* fix: pytorch unit test hangs (#522)

* force use of multiprocessing spawn start method for pytorch
* run experimental tests in all envs except 3.7

* Add support for Python 3.11 (#528)

* Add support for Python 3.11

* Add 3.11 to test workflow

* update notebook guidelines (#534)

* update notebook guidelines

* Update docs/census_notebook_guidelines.md

Co-authored-by: Emanuele Bezzi <[email protected]>

* Update docs/census_notebook_guidelines.md

* Apply suggestions from code review

Co-authored-by: Andrew Tolopko <[email protected]>

---------

Co-authored-by: Emanuele Bezzi <[email protected]>
Co-authored-by: Andrew Tolopko <[email protected]>

* Add docs for experimental modules (#537)

* Add modules to python-api.rst

* Add README.md

* Update docs/python-api.rst

Co-authored-by: pablo-gar <[email protected]>

* Update docs/python-api.rst

Co-authored-by: pablo-gar <[email protected]>

---------

Co-authored-by: pablo-gar <[email protected]>

* demo notebook for HVG experimental API (#536)

* reorg files

* add HVG notebook

* clean up notebook docs symlinks

* lint

* lint,try 2

* fix Ruff error

* fix lint in thread regex

* work around upstream lint

* update target python version to match target container version

* fix typing lint

* update and fix config for pre-commit

* PR feedback / fixes

* Add experimental notebooks to docsite + fix headers (#541)

* fix OOM on pytorch unit test (#542)

skip failing test on 3.9 only

* Add pytorch notebook (#551)

* add pytorch notebook; minor pytorch api docstring updates

* CR feedback

* Update api/python/notebooks/experimental/pytorch.ipynb

* Update api/python/notebooks/experimental/pytorch.ipynb

* run full notebook

---------

Co-authored-by: pablo-gar <[email protected]>
Co-authored-by: Pablo E Garcia-Nieto <[email protected]>

* fix incorrect pytorch obs soma_joinids (#555)

- Use torch.from_numpy() instead of torch.Tensor() to construct tensors. The latter creates tensors with dtype=float32, which caused the bug: precision was truncated in the int64-to-float32-to-int64 round trip. In addition to fixing the bug, this approach eliminates a data copy.
- Rework the test fixture generation methods to allow for ranges of obs and var soma_joinids that may start at any arbitrary value. Necessary for testing the case that produced this bug.
- Add asserts to ease paranoia.
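The precision loss behind this fix can be demonstrated without PyTorch; here is a minimal NumPy illustration (not taken from the PR) showing that a large int64 joinid survives a round trip through float64 but not through float32:

```python
import numpy as np

# float32 has a 24-bit mantissa, so 2**31 + 1 cannot be represented
# exactly and rounds to 2**31; float64 (53-bit mantissa) is exact here.
joinids = np.array([2**31 + 1], dtype=np.int64)

via_f32 = joinids.astype(np.float32).astype(np.int64)  # lossy round trip
via_f64 = joinids.astype(np.float64).astype(np.int64)  # exact round trip

print(joinids[0], via_f32[0], via_f64[0])  # 2147483649 2147483648 2147483649
```

This is why a tensor constructor defaulting to float32 silently corrupts int64 ids above 2**24, while `torch.from_numpy()` preserves the int64 dtype.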

* Fix MacOS failing tests (#557)

* Fix MacOS failing tests

For the time being, we'll try to determine if the issue is specific to a Python version by removing 3.7.

* try macos13

* comma

* try if

* exclude

* typo

---------

Co-authored-by: Bruce Martin <[email protected]>
Co-authored-by: pablo-gar <[email protected]>
Co-authored-by: Laura Luebbert <[email protected]>
Co-authored-by: Andrew Tolopko <[email protected]>
Co-authored-by: Emanuele Bezzi <[email protected]>
Co-authored-by: Martin Kim <[email protected]>
Co-authored-by: Pablo E Garcia-Nieto <[email protected]>