Extending Xarray for domain-specific toolkits #3959

Closed · eric-czech opened this issue Apr 9, 2020 · 10 comments

@eric-czech

Hi, I have a question about how to design an API over Xarray for a domain-specific use case (in genetics). Having now seen the following:

I wanted to reach out and seek some advice on what I'd like to do, given that I don't think any of the solutions there are what I'm looking for.

More specifically, I would like to model the datasets we work with as xr.Dataset subtypes but I'd like to enforce certain preconditions for those types as well as support conversions between them. An example would be that I may have a domain-specific type GenotypeDataset that should always contain 3 DataArrays and each of those arrays should meet different dtype and dimensionality constraints. That type may be converted to another type, say HaplotypeDataset, where the underlying data goes through some kind of transformation to produce a lower dimensional form more amenable to a specific class of algorithms.

One API I envision around these models consists of functions that enforce nominal typing on Xarray classes, so in that case I don't actually care if my subtypes are preserved by Xarray when operations are run. It would be nice if that subtyping wasn't lost but I can understand that it's a limitation for now. Here's an example of what I mean:

import xarray as xr
from genetics import api

arr1 = ??? # some 3D integer DataArray of allele indices
arr2 = ??? # A missing data boolean DataArray
arr3 = ??? # Some other domain-specific stuff like variant phasing
ds = api.GenotypeDataset(arr1, arr2, arr3)

# A function that would be in the API would look like:
def analyze_haplotype(ds: xr.Dataset) -> xr.Dataset:
    # Do stuff assuming that the user has supplied a dataset compliant with 
    # the "HaplotypeDataset" constraints
    pass 

analyze_haplotype(ds.to_haplotype_dataset())

I like the idea of trying to avoid requiring API-specific data structures for all functionality, in favor of conventions over Xarray data structures. I think conveniences like these subtypes would be great for enforcing those conventions (rather than checking at the beginning of each function), as well as for making it easier to move between representations, but I'm certainly open to suggestions. Something akin to structural subtyping that extends to which arrays a Dataset contains, how coordinates are named, what dtypes are used, etc. would be great, but I have no idea if that's possible.

All that said, is it still a bad idea to try to subclass Xarray data structures even if the intent was never to touch any part of the internal APIs? I noticed Xarray does some stuff like type(array)(...) internally but that's the only catch I've found so far (which I worked around by dispatching to constructors based on the arguments given).
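
For concreteness, here's a minimal sketch of the kind of dispatch workaround I mean (the class and the check are hypothetical, not part of any real API). Since xarray's internal type(array)(...) calls reuse the plain DataArray signature, the subclass keeps that signature and only layers validation on top:

import xarray as xr

class GenotypeArray(xr.DataArray):
    # xarray expects DataArray subclasses to declare __slots__
    __slots__ = ()

    def __init__(self, *args, **kwargs):
        # Both user code and xarray's internal type(array)(...) calls land
        # here, so accept the standard DataArray arguments unchanged and
        # validate after construction.
        super().__init__(*args, **kwargs)
        if self.dtype.kind not in ("i", "u"):
            raise TypeError("GenotypeArray expects integer allele indices")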

cc: @alimanfoo - Alistair raised some concerns about trying this to me, so he may have some thoughts here too

@TomNicholas (Contributor) commented Apr 9, 2020

All that said, is it still a bad idea to try to subclass Xarray data structures even if the intent was never to touch any part of the internal APIs?

One of the more immediate problems you'll find if you subclass is that xarray internally uses methods like self._construct_dataarray(dims, values, coords, attrs) to construct return values, so you will likely find that for a lot of methods you call you will only get back a bare DataArray, not the subclass you put in.

You could make custom accessors which perform checks on the input arrays when they get used?

@xr.register_dataset_accessor('haplo')
class HaploDatasetAccessor:
    def __init__(self, ds):
        check_conforms_to_haplo_requirements(ds)
        self.data = ds

    def analyse(self):
        ...

ds.haplo.analyse()

I'm also wondering: given that the only real difference (not just by convention) between your desired data structures and xarray's is the dtype, would something akin to pandas' ExtensionDtype solve your problem (if xarray actually offered it)?

@eric-czech (Author)

Thanks @TomNicholas, some thoughts on your points:

xarray internally uses methods like self._construct_dataarray(dims, values, coords, attrs) to construct return values

I'm OK with the subtype being lost after running some methods. I saw that, so I'm assuming all functions that do anything with the data structures should take and return plain Xarray objects.

You could make custom accessors which perform checks on the input arrays when they get used?

Accessors could work but the issues I see with them are:

  1. What's a natural way for a user to access documentation on how to build a compliant dataset? A docstring on a constructor is great -- is there a way to do something like that with accessors? The docs could hang off of the check_* type methods, but that seems very awkward.
  2. Is there a way to avoid running check_* methods multiple times? I think those checks could be expensive, and if Xarray often builds new instances as return values, this will be an issue:
ds.haplo.do_custom_analysis_1()

# Do something with coords/indexes that causes a new Dataset to be created
# e.g. ds.reset_index ultimately hits https://github.com/pydata/xarray/blob/1eedc5c146d9e6ebd46ab2cc8b271e51b3a25959/xarray/core/dataset.py#L882
# which creates a new Dataset
ds = ds.reset_index()

# The `check_conforms_to_haplo_requirements` function will run again even though
# I would know it's not necessary at this point:
ds.haplo.do_custom_analysis_2()

would something akin to pandas' ExtensionDtype solve your problem?

Ah, I can see how the issue title is misleading, but I don't actually have a need for dtypes beyond what's already available. (We do have that problem in trying to find some way to represent 2-bit integers with sub-byte data types, but I wasn't trying to get into that on this thread.) I'll make the title better.

@eric-czech changed the title from "Type extensions for domain-specific toolkits" to "Extending Xarray for domain-specific toolkits" on Apr 10, 2020
@TomNicholas (Contributor)

A docstring on a constructor is great -- is there a way to do something like that with accessors?

There surely must be some way to do that, but I'm afraid I'm not a docs wizard. However, the accessor is still just a class whose methods you want to document - would it be too unclear for the docs to hang off each HaploAccessor.specific_method()?
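
That said, the accessor class docstring does surface through the normal help machinery (help(ds.haplo), or ds.haplo? in IPython), so the construction requirements could at least live there. A minimal sketch, with a made-up expected structure:

import xarray as xr

@xr.register_dataset_accessor("haplo")
class HaploAccessor:
    """Analysis methods for haplotype-call datasets.

    Expected dataset structure (checked in __init__):
      * data variable "calls": integer dtype, dims ("variant", "sample")
      * data variable "mask": boolean dtype, same dims as "calls"
    """

    def __init__(self, ds):
        self._ds = ds  # validation of the structure above would go here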

Is there a way to avoid running check_* methods multiple times?

There is some caching, but you shouldn't rely on it. In #3268 @crusaderky said "The more high level discussion is that the statefulness of the accessor is something that is OK to use for caching and performance improvements, and not OK for storing functional information like yours."
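
As a sketch of that cache-for-performance pattern (the validator is hypothetical): xarray only instantiates an accessor once per Dataset object, so a flag on the accessor persists across calls on the same dataset, while a new Dataset (e.g. from reset_index) gets a fresh accessor and re-validates:

import xarray as xr

@xr.register_dataset_accessor("haplo")
class HaploDatasetAccessor:
    def __init__(self, ds):
        self.data = ds
        self._validated = False  # purely a performance cache, not functional state

    def _validate(self):
        if not self._validated:
            check_conforms_to_haplo_requirements(self.data)  # hypothetical
            self._validated = True

    def analyse(self):
        self._validate()
        ...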

I think those checks could be expensive

those arrays should meet different dtype and dimensionality constraints

Checking dtype and dimensions shouldn't be expensive though, or is it more than that?

Well, we do actually have that problem in trying to find some way to represent 2-bit integers with sub-byte data types but I wasn't trying to get into that on this thread. I'll make the title better.

If you have other questions about dtypes in xarray then please feel free to raise another issue about that.

@eric-czech (Author) commented Apr 10, 2020

would it be too unclear for them to hang off each HaploAccessor.specific_method()?

That works for documenting the methods but I'm more concerned with documenting how to build the Dataset in the first place. Specifically, this would mean describing how to construct several arrays relating to genotype calls, phasing information, variant call quality scores, individual pedigree info, blah blah etc. and all these domain-specific things can have some pretty nuanced relationships so I think describing how to create a sensible Dataset with them will be a big part of the learning curve for users. I want to essentially override the constructor docs for Dataset and make it more specific to our use cases. I can't see a good way to do that with accessors since the dataset would already need to have been created.

Checking dtype and dimensions shouldn't be expensive though, or is it more than that?

It is, or at least I'd like not to preclude the checks from doing things like validating min/max values and asserting conditions along axes (e.g. that values sum to 1).
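
For example, something like this (variable and dimension names are made up) has to scan the actual values, not just the metadata:

import numpy as np

def check_genotype_probabilities(ds):
    # Hypothetical: probabilities over genotypes for each (variant, sample)
    probs = ds["genotype_probability"]
    if float(probs.min()) < 0.0 or float(probs.max()) > 1.0:
        raise ValueError("probabilities must lie in [0, 1]")
    if not np.allclose(probs.sum(dim="genotype"), 1.0):
        raise ValueError("probabilities must sum to 1 across the genotype dim")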

If you have other questions about dtypes in xarray then please feel free to raise another issue about that.

Will do.

@keewis (Collaborator) commented Apr 10, 2020

Do you have any control over how the datasets are created? If so, you could provide a factory function (maybe pass in arrays via required kwargs?) that does the checks and describes the required dataset structure in its docstring.
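
A minimal sketch of what I mean (all names are hypothetical):

import xarray as xr

def create_haplotype_call_dataset(*, calls, variant_ids, sample_ids):
    """Create a haplotype-call dataset.

    Parameters
    ----------
    calls : array-like of int, shape (variants, samples)
        Haplotype calls, with -1 marking missing values.
    variant_ids, sample_ids : array-like
        Labels for the variant and sample dimensions.
    """
    ds = xr.Dataset(
        data_vars={"calls": (("variant", "sample"), calls)},
        coords={"variant": variant_ids, "sample": sample_ids},
    )
    check_haplotype_requirements(ds)  # hypothetical validator
    return ds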

If you have other questions about dtypes in xarray then please feel free to raise another issue about that.

Will do.

This probably won't happen in the near future, though, since the custom dtypes for numpy are still a work in progress (NEP-40, etc.)

@eric-czech (Author) commented Apr 10, 2020

Thanks @keewis, that would work though I think it leads to an awkward result if I'm understanding correctly. Here's what I'm imagining:

from genetics import api

# These are different types of data structures I originally wanted to model as classes
ds1 = api.create_genotype_call_dataset(...) 
ds2 = api.create_genotype_probability_dataset(...)
ds3 = api.create_haplotype_call_dataset(...)
# ds1, ds2, and ds3 are now just xr.Dataset instances

# For each of these different types of datasets I have separate accessors 
# that expose dataset-type-specific analytical methods:

@xr.register_dataset_accessor("genotype_calls")
class GenotypeCallAccessor:
    def __init__(self, ds):
        self.ds = ds

    def analyze(self):
        # Do something you can only do with genotype call data, not probabilities,
        # haplotype data, CNV data, or any other domain-specific kind of info
        pass

@xr.register_dataset_accessor("genotype_probabilities")
class GenotypeProbabilityAccessor: ...  # This also has some "analyze" method

@xr.register_dataset_accessor("haplotype_calls")
class HaplotypeCallAccessor: ...  # This also has some "analyze" method

#  ***** Now, how do I prevent this? *****
ds1.haplotype_calls.analyze()
# ds1 is really genotype call data so it shouldn't be possible to do a haplotype analysis on it

Is there a way to make accessors available on an xr.Dataset based on some conditions about the dataset itself? That still seems like a bad solution, but I think it would help me here.

I was trying to think of some way to use static structural subtyping but I don't see how that could ever work with accessors given that 1) they're attached at runtime and 2) all accessors are available on ALL Dataset instances, regardless of whether or not I know only certain things should be possible based on their content.

If accessors are the only way Xarray plans to facilitate extension, has anyone managed to enable static type analysis on their extensions? In my case, I'd be happy to have any kind of safety whether its static or monkey-patched in at runtime, but I'm curious if making static analysis impossible was a part of the discussion in deciding on accessors.

@keewis (Collaborator) commented Apr 10, 2020

You could emulate the availability of the accessors by checking your variables in the constructor of the accessor, using something like:

dataset_types = {
    frozenset(["variable1", "variable2"]): "type1",
    frozenset(["variable2", "variable3"]): "type2",
    frozenset(["variable1", "variable3"]): "type3",
}

def _dataset_type(ds):
    data_vars = frozenset(ds.data_vars.keys())
    return dataset_types[data_vars]

@xr.register_dataset_accessor("type1")
class Type1Accessor:
    def __init__(self, ds):
        if _dataset_type(ds) != "type1":
            raise AttributeError("not a type1 dataset")
        self.dataset = ds

though now that we have a "type" registry, we could also have one accessor, and pass a kind parameter to your analyze function:

def analyze(self, kind="auto"):
    analyzers = {
        "type1": _analyze_type1,
        "type2": _analyze_type2,
    }

    if kind == "auto":
        kind = _dataset_type(self.dataset)
    return analyzers[kind](self.dataset)

If you just wanted static code analysis using e.g. mypy, consider using TypedDict. I don't know much about mypy, though, so I wasn't able to get it to accept Dataset objects instead of dict. If someone actually gets this to work, we might be able to provide an xarray.typing module to allow something like the following (though depending on the amount of code needed, this could also fit in the Cookbook docs section):

from xarray.typing import DatasetType, Coordinate, ArrayType, Float64Type

class Dataset1(DatasetType):
    longitude: Coordinate[ArrayType[Float64Type]]
    latitude: Coordinate[ArrayType[Float64Type]]

    temperature: ArrayType[Float64Type]

def function(ds: Dataset1):
    # ...
    return ds

and have the type checker validate the structure of the dataset.

@eric-czech (Author) commented Apr 11, 2020

Thanks @keewis! I like those ideas so I experimented a bit and found a few things.

you could emulate the availability of the accessors by checking your variables in the constructor of the accessor using
... now that we have a "type" registry, we could also have one accessor, and pass a kind parameter to your analyze function:

Is there any reason not to put the name of the type into attrs and just switch on that rather than the keys in data_vars? Forcing unique data_vars keys across the different dataset types isn't a big deal, but I thought a single type name or something of the like in attrs would be simpler.

If someone actually gets this to work, we might be able to provide a xarray.typing module to allow something like (but depending on the amount of code needed, this could also fit in the Cookbook docs section)

I would love to try to use something like that. I couldn't get it to work either when trying to have a TypedDict that represents entire datasets, so I tried creating them for data_vars and coords separately. I think python/mypy#4976 is particularly problematic in either case though. The gist of that issue is that covariance for TypedDict types doesn't really exist (i.e. TypedDict -> Mapping is ok but not TypedDict -> Mapping[Hashable, Any]) and contravariance definitely isn't supported (at least not with Dict or Mapping). Some examples I played around with:

MyDict = TypedDict('MyDict', {'x': str})
v1: MyDict = MyDict(x='x')

# This is fine
v2: Mapping = v1

# But this doesn't work:
v2: Mapping[Hashable, Any] = v1 # A notable example since it's used in xr.Dataset
# error: Incompatible types in assignment (expression has type "MyDict", variable has type "Mapping[Hashable, Any]")

# And neither do any of these:
v2: dict = v1
# error: Incompatible types in assignment (expression has type "MyDict", variable has type "Dict[Any, Any]")
v2: Mapping[str, str] = v1
# error: Incompatible types in assignment (expression has type "MyDict", variable has type "Mapping[str, str]")

Going the other direction isn't possible at all (i.e. from Mapping -> TypedDict) since TypedDict acts like a subtype of Mapping. I think that's a big issue downstream: if xr.Dataset requires Mapping types for data_vars and coords, you could never do something like this:

ds = xr.Dataset(data_vars=MyTypedDict(data=...))
# Now assume a user wants to use data_vars/coords with type safety:
data_vars: MyTypedDict = ds.data_vars # This doesn't work 

Generics seem like a decent solution to all these problems, but it would obviously involve a lot of type annotation changes:

# Ideally, xarray.typing would help specify more specific constraints, 
# but this works with what exists today:
GenotypeDataVars = TypedDict('GenotypeDataVars', {'data': DataArray, 'mask': DataArray})
GenotypeCoords = TypedDict('GenotypeCoords', {'variant': DataArray, 'sample': DataArray})

D = TypeVar('D', bound=Mapping)
C = TypeVar('C', bound=Mapping)

# Assume xr.Dataset was written something like this instead:
class Dataset(Generic[D, C]):
    
    def __init__(self, data_vars: D, coords: C):
        self.data_vars = data_vars
        self.coords = coords
        
ds1: Dataset[GenotypeDataVars, GenotypeCoords] = Dataset(
    GenotypeDataVars(data=xr.DataArray(), mask=xr.DataArray()), 
    GenotypeCoords(variant=xr.DataArray(), sample=xr.DataArray())
)

# Types should then be preserved even if xarray is constantly redefining 
# new instances in internal functions:
ds2: Dataset[GenotypeDataVars, GenotypeCoords] = type(ds1)(ds1.data_vars, ds1.coords) # This is OK

Anyways, my takeaways from everything on the thread so far are:

  • Using accessors and some kind of runtime type analysis will cover what was in the scope of my original post, but it will prohibit any kind of static type safety.
  • I imagine supporting type safety on the structure of arrays, coordinates, and attributes in Xarray would be far easier than supporting polymorphism for Dataset/DataArray, so I'm curious if that has been discussed much. Do you think it's worth opening a separate issue to continue a conversation that builds on your idea, @keewis?

@keewis (Collaborator) commented Apr 12, 2020

Is there any reason not to put the name of the type into attrs and just switch on that rather than the keys in data_vars?

Not really, I just thought the variables in the dataset were a way to uniquely identify its variant (i.e. do the validation of the dataset's structure). If you have different means to do so, of course you can use that instead.
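
As a small sketch of that attrs-based variant (names hypothetical): the factory stamps a tag into attrs and the accessor switches on it. One caveat: many xarray operations drop attrs unless keep_attrs is used, so the tag won't survive every transformation:

import xarray as xr

@xr.register_dataset_accessor("haplotype_calls")
class HaplotypeCallAccessor:
    def __init__(self, ds):
        # the factory function would set ds.attrs["dataset_type"] at creation
        if ds.attrs.get("dataset_type") != "haplotype_call":
            raise AttributeError("not a haplotype call dataset")
        self._ds = ds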

Re TypedDict: the PEP introducing TypedDict (PEP 589) explicitly mentions that it is only intended for Dict[str, Any] (so no subclasses of Dict for TypedDict). However, looking at the code of TypedDict, we should be able to do something similar for Dataset.

Edit: we'd still need to convince mypy that the custom TypedDict is a type...

so I'm curious if that has been discussed much

I don't think so? There were a few discussions about subclassing, but I couldn't find anything about static type analysis. It's definitely worth having this discussion, either here (repurposing this issue) or in a new issue.

@eric-czech (Author)

Thanks again @keewis! I moved the static typing discussion to #3967.

This is closed out now as far as I'm concerned.
