Document data structures and design philosophy #87
Comments
@tomwhite notes http://xarray.pydata.org/en/stable/data-structures.html is a good example of this sort of documentation.
Some thoughts from @eric-czech at https://github.com/pystatgen/sgkit/pull/78#issuecomment-669878845. As he notes in that comment, the design philosophy of
Some examples of documentation that centers the data structure with a diagram: The latest Hail docs emphasize "input unification", but don't have a diagram for it.
This is somewhat off-the-wall, but I spent some time thinking about the web a few years back and enjoyed reading Roy Fielding's PhD dissertation on the design of REST. I know we're talking about something quite different here, but the approach of thinking about design in terms of adding constraints was something I found novel and interesting. Chapter 5 is probably the most relevant.
I wanted to try to flesh this out a bit more, so I started writing this description as if it were the sort of thing that would eventually live in our documentation somewhere. This follows up on https://github.com/pystatgen/sgkit/pull/78#issuecomment-669878845 and tries to explain some of that with much higher-level context as well. I didn't want to go a lot further, though, without making sure we're all in agreement with at least this much. Here's what I've got so far:
Sgkit supports a variety of analytical methods for quantitative and population genetics using general-purpose frameworks such as Xarray, Dask, and Zarr. The intent of the sgkit API is to facilitate genetic analysis over large datasets while still scaling down seamlessly to smaller, experimental studies and remaining approachable for new users. While traditional workflows of a similar nature often involve a heterogeneous mixture of algorithm implementations, programming languages, system dependencies and even hardware, sgkit strives to offer the same flexibility in a single distributed computing framework. This flexibility is largely a result of the capabilities already inherent to other Python libraries for scientific computing, and sgkit attempts to better adapt these capabilities to the genetics domain by formalizing conventions for common quantities, providing access to appropriate file formats, porting standard algorithms, and prioritizing documentation/examples that promote best practices.
The primary interface to sgkit functionality is the Xarray API. There are currently no data models in the library that attempt to capture the complexity of many (or even common) analyses and the data structures that would support them; operations are applied solely to Xarray Dataset objects. Users are free to manipulate data within these objects as they see fit, but they must do so within the confines of a set of conventions for variable names, dimensions, and underlying data types.
The example below illustrates a Dataset format that would result from an assay expressible as PLINK or BGEN. This is only a guideline, however, and a Dataset seen in practice might include many more or fewer variables and dimensions.
Let me know if you all think that's on the right track and I'll keep going at some point.
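To make the idea of conventions for variable names, dimensions, and data types a little more concrete, here is a minimal sketch of what such a Dataset might look like. This is not the diagram from the comment above: the variable names (`call_genotype`, `variant_position`, `sample_id`), the dimension names, the dtypes, and all of the values are assumptions made up for illustration.

```python
import numpy as np
import xarray as xr

# Hypothetical sizes for a toy dataset (not taken from the thread).
n_variants, n_samples, ploidy = 4, 3, 2

rng = np.random.default_rng(0)

ds = xr.Dataset(
    data_vars={
        # Assumed convention: one allele index per variant, sample, and chromosome copy.
        "call_genotype": (
            ("variants", "samples", "ploidy"),
            rng.integers(0, 2, size=(n_variants, n_samples, ploidy), dtype=np.int8),
        ),
        # Assumed convention: per-variant metadata lives on the "variants" dimension.
        "variant_position": (
            ("variants",),
            np.array([100, 250, 900, 1500], dtype=np.int32),
        ),
        # Assumed convention: per-sample metadata lives on the "samples" dimension.
        "sample_id": (("samples",), np.array(["S0", "S1", "S2"])),
    }
)

# Users work on the Dataset with ordinary Xarray operations, for example
# counting alternate alleles per call by summing over the "ploidy" dimension.
alt_count = ds["call_genotype"].sum(dim="ploidy")
```

Because every operation takes and returns plain Xarray objects, a function only needs the variables it actually reads to follow the naming and dimension conventions; the rest of the Dataset can hold whatever the user likes.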
Looks good @eric-czech. The model in the diagram more-or-less applies to VCF data too, right?
Yep. Perhaps it makes sense even at that introductory level to show an Xarray dataset for VCF, but just as the
Yeah. I think it's a good idea to say that, across all the formats, we work with a dense variant matrix that looks like your diagram, but that the exact details of what goes in the cells, and the information we have about the rows and columns, differ a bit depending on the source.
It may be useful to point to some external documentation on migrating from NumPy to Xarray and Dask. The Satpy project has a dedicated page in their docs on this topic: Migrating to xarray and dask.
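For readers coming from NumPy, the core of that migration can be sketched in a few lines. This is only an illustration under assumed names: the array values and the `variants`/`samples` dimension names are made up, and it is not drawn from the Satpy page.

```python
import numpy as np
import xarray as xr

# A toy (variants x samples) matrix of alternate-allele counts; values are made up.
counts = np.array([[0, 1, 2], [1, 1, 0]], dtype=np.int8)

# NumPy: the reader has to remember that axis 1 means "samples".
mean_np = counts.mean(axis=1)

# Xarray: the same reduction, addressed by dimension name instead of position.
da = xr.DataArray(counts, dims=("variants", "samples"))
mean_xr = da.mean(dim="samples")
```

The two results hold the same numbers, but the Xarray version keeps the `variants` label on the output. If Dask is installed, calling `da.chunk(...)` should give a lazily evaluated, chunked array that supports the same named-dimension operations.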
* Add usage and design documentation #87
* Update docs/index.rst (Co-authored-by: Alistair Miles)
* Suggested changes (Co-authored-by: Alistair Miles)
* Force push gh-pages branch in gh action
* Suggested changes
* Fix typo (Co-authored-by: Alistair Miles)
Fixed in #278
Now that #51 is in, it would be good to have some documentation to describe the data structures at the heart of sgkit and the design philosophy used to formulate them. I will pick this up and work with @alimanfoo and @eric-czech to ensure I capture their thinking as the intellectual forebears of sgkit.