I'd like to propose a change to the merge method.
I often run into cases where I'd like to merge subsets of the same dataset.
However, this currently requires renaming all dimensions, changing indices, and merging the subsets by hand.
As an example, please consider the following dataset:
There is at most one observation for each subtissue and individual.
Now, I'd like to plot all values in subtissue == "Whole_Blood" against subtissue == "Adipose_Subcutaneous". Therefore, I have to join all "Whole_Blood" observations with all "Adipose_Subcutaneous" observations by the "individual" coordinate.
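To make the intended join concrete, here is a minimal pure-Python sketch of it. It is not xarray code: the records, the "value" field, the helper names and all numbers are invented for illustration, and only the subtissue names and the "individual" key come from the example above. It relies on there being at most one observation per subtissue and individual.

```python
# Toy observations; every value here is made up for illustration.
observations = [
    {"individual": "ind1", "subtissue": "Whole_Blood", "value": 1.0},
    {"individual": "ind1", "subtissue": "Adipose_Subcutaneous", "value": 2.0},
    {"individual": "ind2", "subtissue": "Whole_Blood", "value": 3.0},
    {"individual": "ind3", "subtissue": "Adipose_Subcutaneous", "value": 4.0},
]


def select(records, subtissue):
    """Index the observations of one subtissue by individual."""
    return {
        r["individual"]: r["value"]
        for r in records
        if r["subtissue"] == subtissue
    }


def inner_join(left, right):
    """Pair up values for individuals that occur in both subsets."""
    return {key: (left[key], right[key]) for key in left if key in right}


blood = select(observations, "Whole_Blood")
adipose = select(observations, "Adipose_Subcutaneous")

# Only individuals observed in both subtissues survive the join.
paired = inner_join(blood, adipose)
```

Each entry of `paired` holds the two values that would be plotted against each other for one individual.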
To simplify this task, I'd like to have the following abstraction:
Add a parameter on: Union[str, List[str], Tuple[str, ...], Dict[str, str]]
This should specify one or multiple coordinates which should be merged.
Simple merge: string
=> merge by left[str] and right[str]
Merge of multiple coords: list or tuple of strings
=> merge by left[str1, str2, ...] and right[str1, str2, ...]
To merge differently named coords: dict, e.g. {"str_left": "str_right"}
=> merge by left[str_left] and right[str_right]
Add some parameter like newdim to specify the newly created index dimension.
If on specifies multiple coords, this new index dimension should be a multi-index of these coords.
Rename all duplicate coordinates not specified in on to some unique name
e.g. left["cdf"] => merged["cdf:1"] and right["cdf"] => merged["cdf:2"]
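The three accepted forms of on and the renaming rule can be sketched in plain Python as follows. This is only an illustration of the proposal under my reading of it, not an existing xarray API: `normalize_on` and `suffix_duplicates` are hypothetical helper names, and the ":1" / ":2" suffixes are taken from the example above.

```python
from typing import Dict, Iterable, List, Sequence, Tuple, Union

# Hypothetical alias for the proposed `on` parameter.
OnSpec = Union[str, Sequence[str], Dict[str, str]]


def normalize_on(on: OnSpec) -> List[Tuple[str, str]]:
    """Turn any accepted form of `on` into (left_name, right_name) pairs."""
    if isinstance(on, str):  # check str first: a str is also a sequence
        return [(on, on)]
    if isinstance(on, dict):
        return list(on.items())
    return [(name, name) for name in on]


def suffix_duplicates(
    left_names: Iterable[str],
    right_names: Iterable[str],
    on_pairs: List[Tuple[str, str]],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Build rename maps for names that occur on both sides but are not join keys."""
    left_keys = {left for left, _ in on_pairs}
    right_keys = {right for _, right in on_pairs}
    shared = (set(left_names) - left_keys) & (set(right_names) - right_keys)
    left_map = {name: f"{name}:1" for name in sorted(shared)}
    right_map = {name: f"{name}:2" for name in sorted(shared)}
    return left_map, right_map
```

For instance, `normalize_on({"str_left": "str_right"})` yields a single pair of differently named join keys, and `suffix_duplicates(["individual", "cdf"], ["individual", "cdf"], normalize_on("individual"))` maps "cdf" to "cdf:1" on the left and "cdf:2" on the right.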
If the coordinates given in on do not unambiguously identify each data point, the matching points should be combined in a cross-product manner. However, since this could cause quadratic runtime and memory requirements, I am not sure how this can be handled safely.
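One possible safeguard, sketched here in plain Python, is to count how many rows the join would produce before building it and to refuse if that exceeds a limit. The function names and the `max_rows` parameter are hypothetical and not part of xarray; the sketch assumes an inner join on a single key.

```python
from collections import Counter


def joined_size(left_keys, right_keys):
    """Number of rows an inner join would produce, without building it."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    # Each key contributes (occurrences on the left) * (occurrences on the right).
    return sum(n * right_counts[key] for key, n in left_counts.items())


def check_join_size(left_keys, right_keys, max_rows):
    """Raise before materializing a join that would exceed max_rows."""
    size = joined_size(left_keys, right_keys)
    if size > max_rows:
        raise ValueError(f"join would produce {size} rows, limit is {max_rows}")
    return size
```

Counting costs time linear in the number of keys, so the quadratic case is detected before any memory is allocated for it. A stricter alternative would be to reject any duplicated key outright, similar in spirit to the validate argument of pandas' DataFrame.merge.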
What do you think about this addition?
I appreciate how this could be convenient, but I am concerned about adding more complexity to xarray's merge code, which is already pretty complex and hard to maintain. My refactor in #3234 is the first time that code has been touched in quite a while and I don't think anyone (other than myself) has made contributions to that part of xarray.
To solve your use-case, what about either:
Converting observations into a MultiIndex over individual and subtissue, or
Creating separate individual/subtissue dimensions and storing the data in the form of a sparse array
Then you could do this sort of data munging with normal indexing/alignment/merging, e.g.,
tissue_1 = ds.sel(subtissue="Whole_Blood").rename({k: k + ':1' for k in ds})
tissue_2 = ds.sel(subtissue="Adipose_Subcutaneous").rename({k: k + ':2' for k in ds})
merged = tissue_1.merge(tissue_2)  # would have dimensions [gene, individual]