
Allow dask arrays to propagate through _normalize_data() #1663

Merged: 7 commits merged into scverse:master on Mar 21, 2022

Conversation

@GenevieveBuckley (Contributor):

While I was looking through the scanpy source code, I found a comment that says "# dask doesn't do medians":

https://github.com/theislab/scanpy/blob/0c4ca5b21524c2972d514ddbd85834002ed623de/scanpy/preprocessing/_normalization.py#L17

Dask does in fact do medians, provided the median is taken along an axis: dask/dask#5575
But this feature was only merged in November 2019 (the same month the comment above was added), so I think it was too new at the time to be widely known and available.

This PR attempts to remove the coercion to numpy and allow dask arrays to propagate through the _normalize_data function.
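
For illustration (my example, not from the PR), dask's median works once an axis is given, while the no-axis form is still unsupported:

import dask.array as da
import numpy as np

x = da.from_array(np.arange(12).reshape(3, 4), chunks=(3, 2))

# da.median only works along an explicit axis (dask/dask#5575);
# da.median(x) without an axis raises NotImplementedError
print(da.median(x, axis=0).compute())  # [4. 5. 6. 7.]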

@GenevieveBuckley changed the title from "Dask median" to "Allow dask arrays to propagate through _normalize_data()" on Feb 19, 2021
@GenevieveBuckley (Contributor, Author) commented on this diff in scanpy/preprocessing/_normalization.py:

-    counts = np.asarray(counts) # dask doesn't do medians
-    after = np.median(counts[counts>0], axis=0) if after is None else after
+    try: # dask array
+        counts_greater_than_zero = counts[counts>0].compute_chunk_sizes()

This is not perfectly efficient: we have to compute the chunk sizes here because they are unknown after counts[counts>0]. There are probably still advantages compared with coercing the whole thing to a numpy array at the beginning.

See here for more details: https://docs.dask.org/en/latest/array-chunks.html

compute_chunk_sizes() immediately performs computation and modifies the array in-place.
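
To make the trade-off concrete, here is a small sketch (mine, not from the PR) of the unknown-chunks problem that compute_chunk_sizes() resolves:

import dask.array as da
import numpy as np

counts = da.from_array(np.array([0.0, 1.0, 2.0, 0.0, 3.0]), chunks=2)
masked = counts[counts > 0]
print(masked.chunks)          # ((nan, nan, nan),) -- unknown after boolean masking
masked.compute_chunk_sizes()  # eagerly computes and fixes the sizes in place
print(masked.chunks)          # ((1, 1, 1),)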

@ivirshup (Member) left a review:

Thanks for the PR, great to have this working! Bad timing on the last time we looked at it I guess 😄

Apart from the minor comments I've left inline, could this get a test that checks that the dask array is propagated through?
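
For example, a propagation test could look roughly like this (a sketch only; the test name is illustrative, and X is reassigned after construction for the reason discussed later in this thread):

import dask.array as da
import numpy as np
import scanpy as sc
from anndata import AnnData

def test_normalize_total_dask():
    adata = AnnData(np.array([[1.0, 0.0], [3.0, 0.0], [5.0, 6.0]]))
    adata.X = da.from_array(adata.X)
    sc.pp.normalize_total(adata)
    assert isinstance(adata.X, da.Array)  # still lazy, not coerced to numpy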

Inline review comments on scanpy/tests/test_normalization.py and scanpy/preprocessing/_normalization.py (outdated, resolved).
@GenevieveBuckley (Contributor, Author):

Huh, I don't think the tests are checking what I thought they were.

  1. It doesn't look like you can have an AnnData object with a dask array. What am I doing wrong here?
from anndata import AnnData
import numpy as np
import dask.array as da
from scipy.sparse import csr_matrix

X_total = [[1, 0], [3, 0], [5, 6]]

adata = AnnData(np.array(X_total), dtype='float32')
print(type(adata.X))  # is a numpy array, as expected

adata = AnnData(csr_matrix(X_total), dtype='float32')
print(type(adata.X))  # is a sparse matrix, as expected

adata = AnnData(da.from_array(X_total), dtype='float32')
print(type(adata.X))  # is a numpy array, NOT a dask array; not what I expected
  2. The change I made to _normalize_data() changed the coercion of counts, not X. When I stepped through this private function, things seemed to work the way I expected, but there's a lot of other stuff happening before and after in normalize_total() which I haven't looked at much.

What combinations of inputs to _normalize_data() need to be supported? (See the test sketch after this list.)

  • numpy X, numpy counts
  • dask X, dask counts
  • csr_matrix X, csr_matrix counts

Mixed combinations?

  • numpy X, dask counts
  • dask X, numpy counts
  • numpy X, csr_matrix counts
  • csr_matrix X, numpy counts
  • dask X, csr_matrix counts
  • csr_matrix X, dask counts
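
A hypothetical way to exercise the homogeneous combinations with pytest (the parametrization and the private-function signature here are my assumptions, not code from the PR):

import dask.array as da
import numpy as np
import pytest
from scipy.sparse import csr_matrix

from scanpy.preprocessing._normalization import _normalize_data

X_total = [[1, 0], [3, 0], [5, 6]]

@pytest.mark.parametrize(
    "array_type", [np.asarray, csr_matrix, da.from_array], ids=["numpy", "csr", "dask"]
)
def test_normalize_data_preserves_type(array_type):
    X = array_type(np.asarray(X_total, dtype="float64"))
    counts = np.asarray(X.sum(axis=1)).ravel()  # per-row totals as a numpy vector
    out = _normalize_data(X, counts, after=1.0)
    assert type(out) is type(X)  # the container type should survive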

@ivirshup (Member):

So, I'm not too surprised to see this, since I don't think much of the distributed stuff has good testing, and I'm not too familiar with it.

I believe the AnnData constructor is converting the array. You can get around this by assigning X to be a dask array, e.g.:

import anndata as ad
import dask.array as da
import numpy as np

a = ad.AnnData(np.ones((1000, 100)))
a.X = da.from_array(a.X)
type(a.X)
# dask.array.core.Array

Better support for dask arrays would be a great feature request and series of additions to anndata. I think this is the endemic numeric Python problem of "these things are all like arrays, so you can kinda use the same API, but in practice every type needs to be special-cased".

but there's a lot of other stuff happening before & afterwards in normalize_total() which I haven't looked at much

Yeah, I think this function has built up some cruft. I've opened a PR to streamline this (#1667), but I will need to check with people more familiar with the code. The private method should handle all of the computation, while the outer wrapper does the argument handling, getting data out of the AnnData, and assigning it back.

What combinations of inputs to _normalize_data() need to be supported

I believe counts should always be generated from X, so we don't need to worry about the combinations of types.
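
For illustration (not from the thread): a reduction over X keeps X's container type, so counts should track X:

import dask.array as da
import numpy as np

X = da.from_array(np.ones((100, 50)), chunks=(50, 50))
counts = X.sum(axis=1)  # a dask reduction yields another (lazy) dask array
print(type(counts))     # <class 'dask.array.core.Array'>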

@GenevieveBuckley (Contributor, Author):

Better support for dask arrays would be a great feature request and series of additions to anndata.

It'd be great to have a bigger-picture conversation about this with you. I've just sent you a Twitter DM with my contact details.

@ryan-williams:

Apologies for jumping in, but I wanted to mention that @sakoht and I have been working on Dask+AnnData a bit over in celsiustx/anndata, and we would love to discuss some of these issues or listen in.

To date, a lot of the work has been upstream in Dask (e.g. dask/dask#6661; also not quite a PR yet, but celsiustx/dask@sum2 improves support for dask arrays with scipy.sparse.spmatrix blocks), but the focus should be moving more into AnnData soon.

I'll try to put some more cogent thoughts into an issue (likely on AnnData) tomorrow, but I just wanted to mention this here since I've been following the thread with interest! Thanks.

@ivirshup (Member):

Great to hear from both of you. I'd really love to have better Dask integration with AnnData and am excited to see these efforts progress!

@ryan-williams, it'd be great if you could open an issue over on anndata about this! I think that'd be a good place to discuss design considerations.

@GenevieveBuckley (Contributor, Author):

So it seems that no matter what array type is given to adata.X, the counts_per_cell variable generated in normalize_total() is always created as a numpy array.

So I'm not sure why there was a note next to the line in _normalize_data() about not being able to use dask: the counts input here is always numpy, because it has already been created in normalize_total().

Presumably this is not intended?

@ivirshup (Member) commented Feb 24, 2021:

@GenevieveBuckley, could you provide an example? I'm not sure what you mean, as I see dask arrays being passed to _normalize_data.

I just checked out this PR, added:

print(type(X), type(counts))

as the first line of _normalize_data and ran this code:

import scanpy as sc
import dask.array as da

adata = sc.datasets.blobs(n_observations=100, n_variables=50)
adata.X = da.from_array(adata.X)
# WARNING: Some cells have total count of genes equal to zero
# <class 'dask.array.core.Array'>
# <class 'dask.array.core.Array'>
display(adata.X)
# dask.array<true_divide, shape=(100, 50), dtype=float32, chunksize=(100, 50), chunktype=numpy.ndarray>

Oh, and sorry about the conflicts! I've just removed all the ignored patterns from the pyproject.toml.

@Koncopd (Member) commented Feb 24, 2021:

Yes, if I remember correctly, np.ravel converts dask to numpy.

@ivirshup (Member):

@Koncopd, I thought that too, but this isn't the case currently:

import dask.array as da; import numpy as np

np.ravel(da.ones(10))
# dask.array<ones, shape=(10,), dtype=float64, chunksize=(10,), chunktype=numpy.ndarray>

@GenevieveBuckley (Contributor, Author):

I'd been adding a breakpoint just before the call to _normalize_data() and was printing out type(adata.X) and type(counts_per_cell).

I'll check it again the same way you did, and also check whether I have uncommitted code in my local repo.

@ivirshup (Member) commented Mar 3, 2021:

Mind if I merge #1667? It would cause more conflicts, but should clean up the flow control in normalize_total. I'd also be happy to add a commit fixing up the conflicts here.

Edit: I added that commit. Feel free to remove it if it's troublesome.

@codecov (bot) commented Mar 5, 2021:

Codecov Report

Merging #1663 (a074541) into master (1be0a68) will increase coverage by 0.00%.
The diff coverage is 80.00%.

@@           Coverage Diff           @@
##           master    #1663   +/-   ##
=======================================
  Coverage   71.38%   71.38%           
=======================================
  Files          92       92           
  Lines       11283    11291    +8     
=======================================
+ Hits         8054     8060    +6     
- Misses       3229     3231    +2     
Impacted file: scanpy/preprocessing/_normalization.py, coverage 87.05% <80.00%> (-1.26%) ⬇️

@ivirshup added this to the 1.8.1 milestone on Jun 23, 2021
@ivirshup modified the milestones: 1.8.1 → 1.8.2 on Jul 7, 2021
@ivirshup modified the milestones: 1.8.2 → 1.8.3 on Nov 3, 2021
@ivirshup modified the milestones: 1.8.3 → 1.9.0 on Mar 21, 2022
@ivirshup merged commit 9cb915b into scverse:master on Mar 21, 2022
@flying-sheep mentioned this pull request on Aug 4, 2023