Add "on"-parameter to "merge" method #3224

Hoeze · 2019-08-17T02:44:46Z

I'd like to propose a change to the merge method.

I often run into cases where I'd like to merge subsets of the same dataset.
However, this currently requires renaming all dimensions, changing indices, and merging the subsets by hand.

As an example, please consider the following dataset:

Dimensions:          (genes: 8787, observations: 8166)
Coordinates:
  * observations     (observations) object 'GTEX-111CU-1826-SM-5GZYN' ... 'GTEX-ZXG5-0005-SM-57WCN'
  * genes            (genes) object 'ENSG00000227232' ... 'ENSG00000198727'
    individual       (observations) object 'GTEX-111CU' ... 'GTEX-ZXG5'
    subtissue        (observations) object 'Adipose_Subcutaneous' ... 'Whole_Blood'
Data variables:
    cdf              (observations, genes) float32 0.18883839 ... 0.4876754
    l2fc             (observations, genes) float32 -0.21032093 ... -0.032540113
    padj             (observations, genes) float32 1.0 1.0 1.0 ... 1.0 1.0 1.0

There is at most one observation for each combination of subtissue and individual.

Now, I'd like to plot all values with subtissue == "Whole_Blood" against those with subtissue == "Adipose_Subcutaneous". To do so, I have to join all "Whole_Blood" observations with all "Adipose_Subcutaneous" observations by the "individual" coordinate.

To simplify this task, I'd like to have the following abstraction:

# select tissues
tissue_1 = ds.sel(observations=(ds.subtissue == "Whole_Blood"))
tissue_2 = ds.sel(observations=(ds.subtissue == "Adipose_Subcutaneous"))

# inner join by individual
merged = tissue_1.merge(tissue_2, on="individual", newdim="merge_dim", join="inner")

print(merged)

The result should look like this:

Dimensions:          (genes: 8787, merge_dim: 286)
Coordinates:
  * genes            (genes) object 'ENSG00000227232' ... 'ENSG00000198727'
  * merge_dim        (merge_dim) object 'GTEX-111CU' ... 'GTEX-ZXG5'
    observations:1   (merge_dim) object 'GTEX-111CU-1826-SM-5GZYN' ... 'GTEX-ZXG5-1826-SM-5GZYN'
    observations:2   (merge_dim) object 'GTEX-111CU-0005-SM-57WCN' ... 'GTEX-ZXG5-0005-SM-57WCN'
    subtissue:1      (merge_dim) object 'Whole_Blood' ... 'Whole_Blood'
    subtissue:2      (merge_dim) object 'Adipose_Subcutaneous' ... 'Adipose_Subcutaneous'
Data variables:
    cdf:1            (merge_dim, genes) float32 0.18883839 ... 0.4876754
    cdf:2            (merge_dim, genes) float32 ...
    l2fc:1           (merge_dim, genes) float32 -0.21032093 ... -0.032540113
    l2fc:2           (merge_dim, genes) float32 ...
    padj:1           (merge_dim, genes) float32 1.0 1.0 1.0 ... 1.0 1.0 1.0
    padj:2           (merge_dim, genes) float32 ...
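
For comparison, the manual route mentioned at the top (changing the index, renaming, then merging) might look roughly like the sketch below. It runs on a small made-up dataset rather than the real one, and `tissue_by_individual` is a hypothetical helper, not part of xarray. It relies only on existing methods (`isel`, `swap_dims`, `reset_coords`, `rename`, `xr.merge`) and assumes each individual appears at most once per tissue.

```python
import numpy as np
import xarray as xr

# Toy stand-in for the dataset above; all labels and values are made up.
ds = xr.Dataset(
    {"cdf": (("observations", "genes"), np.arange(8, dtype="float32").reshape(4, 2))},
    coords={
        "observations": ["A-blood", "A-fat", "B-blood", "C-fat"],
        "genes": ["g1", "g2"],
        "individual": ("observations", ["A", "A", "B", "C"]),
        "subtissue": (
            "observations",
            ["Whole_Blood", "Adipose_Subcutaneous", "Whole_Blood", "Adipose_Subcutaneous"],
        ),
    },
)


def tissue_by_individual(ds, tissue, suffix):
    """Hypothetical helper: select one tissue, index it by individual, suffix names."""
    sub = ds.isel(observations=(ds["subtissue"] == tissue).values)
    # Make "individual" the indexed dimension instead of "observations".
    sub = sub.swap_dims({"observations": "individual"})
    # Demote the per-observation coordinates to data variables so they are
    # renamed together with the data instead of conflicting during the merge.
    sub = sub.reset_coords(["observations", "subtissue"])
    return sub.rename({name: f"{name}:{suffix}" for name in sub.data_vars})


left = tissue_by_individual(ds, "Whole_Blood", 1)
right = tissue_by_individual(ds, "Adipose_Subcutaneous", 2)

# Inner join on the shared "individual" index.
merged = xr.merge([left, right], join="inner")
print(merged)
```

In this sketch the joined dimension keeps the name `individual` rather than `merge_dim`, and `observations:1`, `subtissue:1`, etc. end up as data variables rather than coordinates; `set_coords` could promote them again if needed.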

To summarize, I'd propose the following changes:

- Add a parameter on: Union[str, List[str], Tuple[str], Dict[str, str]].
  This should specify one or more coordinates to merge on:
  - Simple merge: string
    => merge by left[str] and right[str]
  - Merge on multiple coords: list or tuple of strings
    => merge by left[str1, str2, ...] and right[str1, str2, ...]
  - Merge on differently named coords: dict, e.g. {"str_left": "str_right"}
    => merge by left[str_left] and right[str_right]
- Add a parameter such as newdim to name the newly created index dimension.
  If on specifies multiple coords, this new index dimension should be a multi-index of these coords.
- Rename all duplicate variables and coordinates not specified in on to unique names,
  e.g. left["cdf"] => merged["cdf:1"] and right["cdf"] => merged["cdf:2"]
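
To make the three accepted forms of `on` concrete, here is a minimal sketch of how they could be reduced to a single internal representation. `normalize_on` and `OnSpec` are hypothetical names used only for illustration; nothing like them exists in xarray.

```python
from typing import Dict, List, Sequence, Tuple, Union

OnSpec = Union[str, Sequence[str], Dict[str, str]]


def normalize_on(on: OnSpec) -> List[Tuple[str, str]]:
    """Hypothetical helper: reduce any `on` form to (left_name, right_name) pairs."""
    # A plain string must be checked first, since str is itself a sequence.
    if isinstance(on, str):
        return [(on, on)]
    # A dict maps left coordinate names to right coordinate names.
    if isinstance(on, dict):
        return list(on.items())
    # A list or tuple uses the same name on both sides.
    return [(name, name) for name in on]


print(normalize_on("individual"))
print(normalize_on(["individual", "subtissue"]))
print(normalize_on({"str_left": "str_right"}))
```

With more than one pair, the new index dimension would be a multi-index over the listed coordinates, as described above.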

If the coordinates given in on do not uniquely identify each data point, the matching entries should be combined as a cross product. However, since this can lead to quadratic runtime and memory use, I am not sure how to handle this safely.
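
One possible safeguard, sketched here under the assumption of inner-join semantics, is to compute the size of the result from the key counts before materializing anything, and to refuse or warn above some threshold. `joined_size` is a hypothetical helper name.

```python
from collections import Counter
from typing import Hashable, Iterable


def joined_size(left_keys: Iterable[Hashable], right_keys: Iterable[Hashable]) -> int:
    """Hypothetical helper: number of rows an inner join on these keys would produce."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    # Each key contributes (occurrences on the left) * (occurrences on the right).
    return sum(n * right_counts[key] for key, n in left_counts.items())


# Unique keys: one output row per shared key.
print(joined_size(["A", "B"], ["A", "B"]))
# Duplicated keys: "A" contributes 2 * 3 rows, "B" has no match on the right.
print(joined_size(["A", "A", "B"], ["A", "A", "A", "C"]))
```

In the worst case, where every entry on both sides shares the same key, the result has len(left) * len(right) rows, which is the quadratic growth mentioned above.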

What do you think about this addition?


shoyer · 2019-08-20T19:13:16Z

I appreciate how this could be convenient, but I am concerned about adding more complexity to xarray's merge code, which is already pretty complex and hard to maintain. My refactor in #3234 is the first time that code has been touched in quite a while and I don't think anyone (other than myself) has made contributions to that part of xarray.

To solve your use-case, what about either:

- Converting observations into a MultiIndex over individual and subtissue, or
- Creating separate individual/subtissue dimensions and storing the data in the form of a sparse array

Then you could do this sort of data munging with normal indexing/alignment/merging, e.g.,

tissue_1 = ds.sel(subtissue="Whole_Blood").rename({k: k + ':1' for k in ds})
tissue_2 = ds.sel(subtissue="Adipose_Subcutaneous").rename({k: k + ':2' for k in ds})
merged = tissue_1.merge(tissue_2)  # would have dimensions [gene, individual]
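
A fuller sketch of the second option, run on a small made-up dataset, could look like the following. It uses a dense unstacked array with NaN for missing combinations rather than a sparse one, which is a simplification. It also passes `drop=True` to `sel` on the assumption that the scalar subtissue coordinate left behind would otherwise differ between the two halves and conflict in the merge. `pick` is a hypothetical helper.

```python
import numpy as np
import xarray as xr

# Toy stand-in for the dataset in the issue; all labels and values are made up.
ds = xr.Dataset(
    {"cdf": (("observations", "genes"), np.arange(8, dtype="float32").reshape(4, 2))},
    coords={
        "observations": ["A-blood", "A-fat", "B-blood", "C-fat"],
        "genes": ["g1", "g2"],
        "individual": ("observations", ["A", "A", "B", "C"]),
        "subtissue": (
            "observations",
            ["Whole_Blood", "Adipose_Subcutaneous", "Whole_Blood", "Adipose_Subcutaneous"],
        ),
    },
)

# Index observations by (individual, subtissue), then unstack into two
# separate dimensions; combinations without an observation become NaN.
wide = ds.set_index(observations=["individual", "subtissue"]).unstack("observations")


def pick(tissue, suffix):
    """Hypothetical helper: one tissue slice with suffixed variable names."""
    # drop=True discards the scalar "subtissue" coordinate after selection.
    sub = wide.sel(subtissue=tissue, drop=True)
    return sub.rename({name: f"{name}:{suffix}" for name in sub.data_vars})


merged = pick("Whole_Blood", 1).merge(pick("Adipose_Subcutaneous", 2))

# Keep only individuals observed in both tissues (the inner-join case).
paired = merged.dropna("individual", how="any")
print(paired)
```

Note that replacing the observations index this way discards the original observation labels, so this variant only fits when those labels are not needed afterwards.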

stale · 2021-07-21T00:00:50Z

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.

stale bot added the stale label Jul 21, 2021

dcherian closed this as completed Apr 18, 2022
