
Cohort grouping for popgen #260

Merged

8 commits merged into sgkit-dev:master on Oct 1, 2020

Conversation

@tomwhite
Collaborator

This is an initial attempt at #240, which I'm posting as a draft for discussion.

The main idea (following @eric-czech's https://github.com/pystatgen/sgkit/issues/224#issuecomment-685061129) is to pass a single dataset to the diversity, divergence, Fst functions that includes a sample_cohort variable that indicates which cohort each sample belongs to (using an index, 0 = first cohort, 1 = second cohort, etc). Then, there is a new count_cohort_alleles function that counts the number of each allele for each variant and cohort, and which is used by diversity etc to perform computations within/between cohorts.
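As a concrete sketch of the intended usage (simulate_genotype_call_dataset is sgkit's test-data helper; the exact entry point for count_cohort_alleles is an assumption here):

import numpy as np
import sgkit as sg

# Four samples assigned to two cohorts by integer index
ds = sg.simulate_genotype_call_dataset(n_variant=10, n_sample=4)
ds["sample_cohort"] = ("samples", np.array([0, 0, 1, 1]))
ds = sg.count_cohort_alleles(ds)
ds["cohort_allele_count"].shape  # (10, 2, 2) = (variants, cohorts, alleles)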

A few questions for discussion:

  • What's a good implementation for count_cohort_alleles? I tried one using groupby, and one that should perform better using Dask's map_blocks (but that could also be much improved).
  • How does sample_cohort get specified? The tests just set it manually at the moment.
  • Do we need to consider more than two cohorts yet?

Also, there's still more to do to get some of the tests passing, and to convert Tajimas_D to use the same approach.

@jeromekelleher
Collaborator

Looks good, thanks @tomwhite. Pinging @aktech for input.

@eric-czech
Collaborator

eric-czech commented Sep 15, 2020

> What's a good implementation for count_cohort_alleles? I tried one using groupby, and one that should perform better using Dask's map_blocks (but that could also be much improved).

Hm I haven't tested this but it's what came to mind first as a possibility, borrowing some tricks from regenie and Tim's PR:

import dask.array as da
import numba
import numpy as np

# n = samples, c = cohorts, k = alleles
@numba.guvectorize(['void(int8[:, :], int32[:], int32[:], int32[:, :])'], '(n, k),(n),(c)->(c, k)')
def count_cohort_alleles(ac, cohorts, _, out):
    out[:, :] = 0  # This is (cohorts, alleles)
    for i in range(ac.shape[0]):
        out[cohorts[i]] += ac[i]

ds = count_call_alleles(ds)
AC, SC = ds.call_allele_count, ds.sample_cohort
# Dummy array whose only purpose is to bind the cohorts dimension (c) of the gufunc
C = da.empty(n_cohorts, dtype=np.int32)

# Call directly if there is no chunking in core dimensions
count_cohort_alleles(AC.chunk((None, -1, -1)), SC.chunk((-1,)), C)
# shape = (variants, cohorts, alleles)

# OR run on individual blocks for better scalability
AC = da.map_blocks(count_cohort_alleles, AC, SC, C, chunks=(n_cohorts, n_alleles), ...)
# shape = (variants, cohorts * AC.numblocks[0], alleles * AC.numblocks[1])

# Stack the blocks and sum across them
# (which will only work because each chunk is guaranteed to have the same size)
AC = da.stack([AC.blocks[:, i] for i in range(AC.numblocks[1])]).sum(axis=0)
# shape = (variants, cohorts, alleles)

I would guess that this is a good bit better than any pure dask/xarray method if it works.

@tomwhite
Collaborator Author

That's great @eric-czech - thank you. I've incorporated the vectorized function in the latest update.

@tomwhite
Collaborator Author

In today's call we discussed what should happen for the n > 2 case. One option would be to have a new variable for each combination (named like Fst_cohort0_cohort1, etc), or just have a data array indexed by n cohort dimensions. The latter was thought to be preferable since you can use standard indexing to manipulate it, and xarray coordinates should allow indexing using cohort (population) labels, rather than integers. We should explore this in this PR.

@jeromekelleher
Collaborator

Thanks @tomwhite - I've written up my thoughts on it here

@tomwhite
Collaborator Author

Thanks @jeromekelleher. I have updated this PR with approach 2 from your suggestion. Fst returns a square matrix, where the entry above the diagonal at (i, j) is the Fst value between cohorts i and j.

Using xarray coordinates, we can refer to the values using cohort names. In this example the names are co_0 and co_1, and the dimensions are cohorts_a and cohorts_b. (In general there could be any number of cohort names, but always just two dimensions for Fst.)

>>> cohort_names = ["co_0", "co_1"]
>>> ds = ds.assign_coords({"cohorts_a": cohort_names, "cohorts_b": cohort_names})
>>> ds = Fst(ds) # compute Fst and return a new dataset
>>> ds["stat_Fst"]
<xarray.DataArray 'stat_Fst' (cohorts_a: 2, cohorts_b: 2)>
array([[       nan, 0.02538071],
       [       nan,        nan]])
Coordinates:
  * cohorts_a  (cohorts_a) object 'co_0' 'co_1'
  * cohorts_b  (cohorts_b) object 'co_0' 'co_1'

You can use the cohort names to get a particular Fst value:

>>> ds["stat_Fst"].sel(cohorts_a="co_0", cohorts_b="co_1").values
0.02538071065989833

Similarly, diversity returns a 1D array, with dimension cohorts.
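Continuing the session above (assuming a cohorts coordinate is assigned analogously; the stat_diversity variable name follows the stat_Fst pattern and is a guess):

>>> ds = ds.assign_coords({"cohorts": cohort_names})
>>> ds = diversity(ds)
>>> ds["stat_diversity"].sel(cohorts="co_0").values  # diversity within cohort co_0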

assert_array_shape(AC, n_variants, n_cohorts, n_alleles)

new_ds = Dataset({"cohort_allele_count": (("variants", "cohorts", "alleles"), AC)})
return conditional_merge_datasets(ds, new_ds, merge)
@aktech
Contributor

@tomwhite A quick question: I am trying to understand what this function (count_cohort_alleles) is trying to do mathematically. Is there a reference implementation or formula I can look at to understand this better? Or is it just an attempt to do a better rewrite of an old implementation?

@tomwhite
Collaborator Author

@aktech Good question (and this function definitely needs documentation!). It's needed because count_call_alleles and count_variant_alleles return allele counts for each sample and for all samples (respectively), whereas in this PR we need something between the two since we now have the concept of "cohort" (a set of samples): so we need per-cohort counts.

The counts array computed here is of shape (n_variants, n_cohorts, n_alleles), so we have a count for each variant/cohort/allele combination. Have a look at the tests (e.g. test_count_cohort_alleles__multi_variant_multi_sample), and perhaps try calling the allele counting functions in a Python REPL to get a feel for what they are doing.
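For intuition, here is a pure-NumPy rendering of the same reduction (illustrative values, independent of this PR's implementation):

import numpy as np

# Per-call allele counts: 3 variants x 4 samples x 2 alleles
ac = np.array([[[2, 0], [1, 1], [1, 1], [0, 2]],
               [[2, 0], [2, 0], [0, 2], [0, 2]],
               [[1, 1], [1, 1], [2, 0], [1, 1]]])
cohorts = np.array([0, 0, 1, 1])  # sample_cohort: first two samples in cohort 0
# Sum the sample counts within each cohort -> (variants, cohorts, alleles)
cohort_ac = np.stack([ac[:, cohorts == c].sum(axis=1) for c in (0, 1)], axis=1)
cohort_ac.shape  # (3, 2, 2)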

@jeromekelleher
Collaborator

This is very beautiful @tomwhite, I don't really have any substantive input. My only question really is, should we fill the lower diagonal of the output matrix also? Fst[0, 1] and Fst[1, 0] should be equal, and we're not saving any memory by not filling out the lower diagonal. It's great that xarray also fulfils the need for labelled cohorts.

@tomwhite
Collaborator Author

Thanks @jeromekelleher. I agree on the point about filling the lower diagonal - the only reason I haven't is because I'm not sure if xarray/dask would do the computations twice.

@jeromekelleher
Collaborator

> Thanks @jeromekelleher. I agree on the point about filling the lower diagonal - the only reason I haven't is because I'm not sure if xarray/dask would do the computations twice.

What about using numpy tricks like these? That should avoid redoing the computations.
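For example, a minimal sketch of this kind of trick (illustrative values), mirroring the upper triangle into the lower without recomputing:

import numpy as np

fst = np.array([[np.nan, 0.02538071],
                [np.nan, np.nan]])
i, j = np.tril_indices(fst.shape[0], k=-1)
fst[i, j] = fst[j, i]  # copy each upper-triangle value to its mirror position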

I guess as a general pattern we should just run the calculations again though - for some functions, F[a, b] will not be equal to F[b, a], and so maybe it's simpler to just do all N^2 calculations, and make the optimisation for symmetric functions as a second pass?

@tomwhite tomwhite marked this pull request as ready for review September 24, 2020 14:45
@mergify
Contributor

mergify bot commented Sep 24, 2020

This PR has conflicts, @tomwhite please rebase and push updated version 🙏

@mergify mergify bot added the conflict PR conflict label Sep 24, 2020
@tomwhite
Collaborator Author

This latest update uses a generalized ufunc for computing divergence, which should be more efficient than using itertools.

I think this is ready for review - changing the version of Fst (to hudson) can be done separately, as can being able to specify subsets of cohorts.

We might want to discuss the naming convention for cohort dimensions:

  • cohorts_a, cohorts_b (current PR)
  • cohorts_1, cohorts_2 (natural alternative)
  • cohort_a, cohort_b (but other dimensions are plural, e.g. variants)
  • cohort_1, cohort_2 (ditto)

@jeromekelleher
Collaborator

Looks great!

n_alleles = ds.dims["alleles"]

ds = count_call_alleles(ds)
AC, SC = da.asarray(ds.call_allele_count), da.asarray(ds.sample_cohort)
@eric-czech
Collaborator

Should we standardize on using ds.call_allele_count vs ds['call_allele_count']? I've been trying to use the former in examples only and stick to the latter in code, but not for great reasons other than consistency and hedging against potentially having variable names that aren't legal attributes or conflict with something in the xarray namespace. Presumably not using attributes becomes easier to standardize on with https://github.com/pystatgen/sgkit/pull/276.

@tomwhite
Collaborator Author

I'm not sure. It's certainly shorter to say ds.call_allele_count, but this form can't be used in assignments:

ds.call_allele_count = ...  # fails: xarray raises an AttributeError for attribute-style assignment

I've changed the accessors in this PR, but we should probably discuss a bit more about our code standards, and what we recommend for user code.

-------
Dataset containing variable `cohort_allele_count` of allele counts with
shape (variants, cohorts, alleles) and values corresponding to
the number of non-missing occurrences of each allele.
@eric-czech
Collaborator

I'm wondering if this shouldn't also return a (variants, cohorts) array that contains the number of present samples used in each aggregation. It doesn't do it now, but if count_call_alleles returned -1 for all alleles when any (or potentially all) input calls were missing then I can see a count of the samples with non-negative allele counts being a useful downstream denominator.

Is that not important with the stats functions in popgen.py now? It may make sense to put a warning on those functions and log an issue related to supporting it. It seems like simply ignoring the missing values during the original allele count must introduce some kind of bias through all those aggregations.
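A hypothetical sketch of such a denominator, assuming missing calls were marked with negative allele counts:

import numpy as np

# 2 variants x 3 samples x 2 alleles; -1 marks a missing call (assumption)
ac = np.array([[[2, 0], [-1, -1], [1, 1]],
               [[0, 2], [1, 1], [2, 0]]])
cohorts = np.array([0, 0, 1])
present = (ac >= 0).all(axis=-1)  # (variants, samples) mask of non-missing calls
n_present = np.stack([present[:, cohorts == c].sum(axis=1) for c in (0, 1)], axis=1)
n_present  # shape (variants, cohorts): [[1, 1], [2, 1]]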

@tomwhite
Collaborator Author

Good point. I haven't thought about missingness with respect to the meaning of the functions though. Opened #290


# c = cohorts
@guvectorize( # type: ignore
["void(float64[:], float64[:,:])"],
@eric-czech
Collaborator

I think float16 and float32 overloads would make sense here. I haven't tested it, but it looks like it should be possible to get a float32 result out of diversity by feeding it allele counts that are float32 (or float16). That seems like a good way for users to control floating point precision so I think we should try to propagate those types like numpy does where we can, especially since np.sum does that and this function is similar.
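A sketch of what such an overload list might look like; the kernel body and layout here are illustrative, not the PR's actual function:

from numba import guvectorize

@guvectorize(
    ["void(float32[:], float32[:, :])",
     "void(float64[:], float64[:, :])"],
    "(c)->(c,c)",
)
def _pairwise_sketch(ac, out):
    # Result dtype follows the input dtype, numpy-style
    for i in range(ac.shape[0]):
        for j in range(ac.shape[0]):
            out[i, j] = ac[i] * ac[j]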

@tomwhite
Collaborator Author

Added a float32 overload. Numba doesn't have float16 as far as I can see: https://numba.pydata.org/numba-doc/dev/reference/types.html#numbers

@eric-czech
Collaborator

Ah right, ty

@@ -37,39 +40,70 @@ def ts_to_dataset(ts, samples=None):
def test_diversity(size):
@eric-czech
Collaborator

Given that map_blocks is underlying these stats, I think chunking on variants and samples dimensions would be useful parameters for these tests. I'm imagining something like:

@pytest.mark.parametrize("size", [2, 3, 10, 100])
@pytest.mark.parametrize("chunks", [(-1, -1), (10, 10)])
def test_diversity(size, chunks):
    ts = msprime.simulate(size, length=100, mutation_rate=0.05, random_seed=42)
    ds = ts_to_dataset(ts).chunk(dict(zip(["samples", "variants"], chunks)))

@tomwhite
Collaborator Author

Good call. Adding this test uncovered two bugs: one to do with the wrong numblocks index being used, and another where the guvectorize function fails if the input is chunked along the samples dimension. I haven't been able to figure out how to fix the latter yet, so I'm going to leave it for the moment. (Popgen doesn't need to chunk along the samples dimension just yet.)

@tomwhite
Collaborator Author

I've added warnings to the documentation about not supporting chunking in the samples dimension.

@eric-czech
Collaborator

Very nice @tomwhite! Those will be great examples to follow.

I had a few suggestions/questions so let me know when you'd like me to take a look again.

Commits:

  • Use guvectorize'd implementation of count_cohort_alleles suggested by Eric Czech.
  • Fix test_Tajimas_D
  • Avoid divide by zero error
  • Remove check for number of samples, since it is now handled by where clause
  • Return dataset for diversity
  • Return dataset for divergence
  • Return dataset for Fst
  • Return dataset for Tajimas_D
  • Cohort coordinates
  • Use xarray's combine_nested to build grid of cohort combinations
  • Use conditional_merge_datasets
  • Fix Fst implementation for n_cohorts > 2, and parameterize tests for n_cohorts. Return values in upper and lower diagonal array.
  • Vectorized divergence
  • Vectorized diversity pairs
  • Add doc for count_cohort_alleles
@tomwhite
Collaborator Author

tomwhite commented Oct 1, 2020

Thanks for the review @eric-czech. I have addressed your comments (and left some until later), so it should be ready for another look.

@jeromekelleher
Collaborator

Looks like we're ready to merge?

@jeromekelleher jeromekelleher added the auto-merge Auto merge label for mergify test flight label Oct 1, 2020
@mergify mergify bot merged commit 9516604 into sgkit-dev:master Oct 1, 2020