Document data structures and design philosophy #87

hammer · 2020-08-03T17:19:06Z

Now that #51 is in, it would be good to have some documentation to describe the data structures at the heart of sgkit and the design philosophy used to formulate them.

I will pick this up and work with @alimanfoo and @eric-czech to ensure I capture their thinking as the intellectual forebears of sgkit.

hammer · 2020-08-03T17:23:39Z

@tomwhite notes http://xarray.pydata.org/en/stable/data-structures.html is a good example of this sort of documentation.

hammer · 2020-08-07T13:24:04Z

Some thoughts from @eric-czech at https://github.com/pystatgen/sgkit/pull/78#issuecomment-669878845. As he notes in that comment, the design philosophy of sgkit right now is to treat xarray as a container for genetics data and to only check its shape and content with the various check_ calls when invoking a method. Because our methods don’t all hang off a central data structure, and each method can take a subset or transformation of the central data structure, it doesn’t make sense to center a data structure in the docs.
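A minimal sketch of that check-on-call pattern, assuming xarray and NumPy are installed. The names `check_genotype_calls` and `count_alt_alleles`, and the `call_genotype` variable with its dimensions, are hypothetical illustrations and not sgkit's actual API:

```python
import numpy as np
import xarray as xr


def check_genotype_calls(ds: xr.Dataset) -> None:
    """Hypothetical check_ helper: validate only what the calling method needs."""
    if "call_genotype" not in ds:
        raise ValueError("Dataset is missing variable 'call_genotype'")
    expected_dims = ("variants", "samples", "ploidy")
    if ds["call_genotype"].dims != expected_dims:
        raise ValueError(f"'call_genotype' must have dims {expected_dims}")
    if not np.issubdtype(ds["call_genotype"].dtype, np.integer):
        raise ValueError("'call_genotype' must have an integer dtype")


def count_alt_alleles(ds: xr.Dataset) -> xr.DataArray:
    """Hypothetical method: a plain function over a Dataset, not a class method."""
    check_genotype_calls(ds)  # validation happens when the method is invoked
    return (ds["call_genotype"] > 0).sum(dim="ploidy")


# 2 variants x 3 samples x diploid calls (0 = reference, 1 = alternate)
calls = np.array(
    [[[0, 0], [0, 1], [1, 1]],
     [[0, 1], [0, 0], [1, 1]]],
    dtype="int8",
)
ds = xr.Dataset({"call_genotype": (("variants", "samples", "ploidy"), calls)})
alt_counts = count_alt_alleles(ds)
```

Because the function only inspects the variables it uses, a caller could pass a subset or a transformed copy of a larger Dataset and the same check would still apply.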

hammer · 2020-08-07T14:10:49Z

Some examples of documentation that centers the data structure with a diagram:

The latest Hail docs emphasize "input unification", but don't have a diagram for it.

alimanfoo · 2020-08-07T15:48:28Z

This is somewhat off-the-wall, but I spent some time thinking about the web a few years back and enjoyed reading Roy Fielding's PhD dissertation on the design of REST. I know we're talking about something quite different here, but the approach of thinking about design in terms of adding constraints was something I found novel and interesting. Chapter 5 is probably the most relevant.

eric-czech · 2020-08-11T19:43:36Z

I wanted to try to flesh this out a bit more so I started writing this description as if it was the sort of thing that would eventually live in our documentation somewhere. This is following up on https://github.com/pystatgen/sgkit/pull/78#issuecomment-669878845 and tries to explain some of that with much higher level context as well. I didn't want to go a lot further though without making sure we're all in agreement with at least this much. Here's what I've got so far:

Sgkit supports a variety of analytical methods for quantitative and population genetics using general-purpose frameworks such as Xarray, Dask, and Zarr. The intent of the sgkit API is to facilitate genetic analysis over large datasets while still offering seamless scaling down to smaller, experimental studies and new users. While traditional workflows of a similar nature often involve a heterogeneous mixture of algorithm implementations, programming languages, system dependencies and even hardware, sgkit strives to offer the same flexibility in a single distributed computing framework. This flexibility is largely a result of the capabilities already inherent to other Python libraries for scientific computing, and sgkit attempts to better adapt these capabilities to the genetics domain by formalizing conventions for common quantities, providing access to appropriate file formats, porting standard algorithms, and prioritizing documentation/examples that promote best practices.

The primary interface to sgkit functionality begins with the Xarray API. There are currently no data models in the library that attempt to capture the complexity of many (or even common) analyses and the data structures that would support them -- operations are applied solely to Xarray Dataset objects. Users are free to manipulate data within these objects as they see fit, but they must do so within the confines of a set of conventions for variable names, dimensions, and underlying data types. The example below illustrates a Dataset format that would result from an assay expressible as PLINK or BGEN. This is a guideline however, and a Dataset seen in practice might include many more or fewer variables and dimensions.
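A minimal sketch of what a Dataset following such conventions might look like, assuming xarray and NumPy are installed. The variable names, dimension names, and values here are illustrative assumptions and may not match the conventions sgkit settles on:

```python
import numpy as np
import xarray as xr

n_variants, n_samples, ploidy = 4, 3, 2
rng = np.random.default_rng(0)

ds = xr.Dataset(
    data_vars={
        # Dense genotype calls: one allele index per variant, sample, and chromosome copy
        "call_genotype": (
            ("variants", "samples", "ploidy"),
            rng.integers(0, 2, size=(n_variants, n_samples, ploidy), dtype=np.int8),
        ),
        # Per-variant (row) annotation
        "variant_position": (("variants",), np.array([101, 205, 310, 422])),
        # Per-sample (column) annotation
        "sample_id": (("samples",), np.array(["S0", "S1", "S2"])),
    }
)

# Users manipulate the Dataset with ordinary xarray operations,
# here selecting the first two variants across all variables that share that dimension.
subset = ds.isel(variants=slice(0, 2))
print(ds)
```

Naming variables by the dimension they annotate (`variant_`, `sample_`, `call_`) is one way to keep the convention readable; a real Dataset could carry more or fewer variables than this.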

Let me know if you all think that's on the right track and I'll keep going at some point.

jeromekelleher · 2020-08-12T08:20:58Z

Looks good @eric-czech. The model in the diagram more-or-less applies to VCF data too, right?

eric-czech · 2020-08-12T09:17:48Z

Yep. Perhaps it makes sense even at that introductory level to show an Xarray dataset for VCF, but just as the repr of an actual dataset and not a diagram.

jeromekelleher · 2020-08-12T09:29:10Z

Yeah. I think it's a good idea to say that, across all the formats, we work with a dense variant matrix that looks like your diagram, but the exact details of what goes in the cells and the information we have about the rows and columns differ a bit depending on the source.

hammer · 2020-09-10T13:15:56Z

It may be useful to point to some external documentation on migrating from working with NumPy to working with Xarray and Dask. The Satpy project has a dedicated page in their docs on this topic: Migrating to xarray and dask.

* Add usage and design documentation #87
* Update docs/index.rst
  Co-authored-by: Alistair Miles <[email protected]>
* Suggested changes
  Co-authored-by: Alistair Miles <[email protected]>
* Force push gh-pages branch in gh action
* Suggested changes
* Fix typo
  Co-authored-by: Alistair Miles <[email protected]>

tomwhite · 2020-10-01T14:29:30Z

Fixed in #278

hammer added the documentation Improvements or additions to documentation label Aug 3, 2020

hammer assigned hammer, alimanfoo and eric-czech Aug 3, 2020

hammer mentioned this issue Aug 3, 2020

[WIP] Docs describing the Genotype Call XArray #78

Closed

eric-czech mentioned this issue Aug 19, 2020

Append output variables from functions to input dataset #103

Closed

eric-czech added a commit to eric-czech/sgkit that referenced this issue Sep 23, 2020

Add user documentation sgkit-dev#87

60da34a

eric-czech added a commit to eric-czech/sgkit that referenced this issue Sep 23, 2020

Add user documentation sgkit-dev#87

2f1c01b

eric-czech added a commit to eric-czech/sgkit that referenced this issue Sep 23, 2020

Add user documentation sgkit-dev#87

d6e2c97

eric-czech added a commit to eric-czech/sgkit that referenced this issue Sep 24, 2020

Add usage and design documentation sgkit-dev#87

ab032c4

eric-czech added a commit to eric-czech/sgkit that referenced this issue Sep 24, 2020

Add usage and design documentation sgkit-dev#87

0650c61

eric-czech mentioned this issue Sep 24, 2020

Add usage and design documentation #278

Merged

eric-czech added a commit to eric-czech/sgkit that referenced this issue Sep 24, 2020

Add usage and design documentation sgkit-dev#87

8261456

eric-czech added a commit to eric-czech/sgkit that referenced this issue Sep 29, 2020

Add usage and design documentation sgkit-dev#87

1d1f055

tomwhite closed this as completed Oct 1, 2020
