Suggestion of an "/archives" endpoint #364

Open
ml-evs opened this issue Jun 11, 2021 · 1 comment
Labels
PR/requires-discussion type/proposal Proposal for addition/removal of features. May need broad discussion to reach consensus.

Comments

@ml-evs
Member

ml-evs commented Jun 11, 2021

At the 2021 workshop, we discussed including an OPTIONAL archives entry type and corresponding endpoint in the specification. Below is an incomplete summary of the ideas that were discussed (please feel free to add/edit).

Other promoters: @sauliusg @jacksund

The idea is that this endpoint would serve static snapshots of an entire (as in, all endpoints) OPTIMADE implementation, potentially over subsets of the data (e.g., a particular set of materials).

The archive MUST be equivalent, in terms of format, to what would be received by crawling an OPTIMADE API, and it could be represented as a hierarchical filesystem, e.g.:

$ tree dump
dump
└── optimade.example.org
    └── v1
        ├── calculations.json
        ├── info
        │   ├── archives.json
        │   ├── calculations.json
        │   ├── links.json
        │   ├── references.json
        │   └── structures.json
        ├── links.json
        ├── references.json
        └── structures.json

3 directories, 9 files
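As a rough illustration of how such a layout might be produced, here is a minimal sketch that writes already-crawled responses to disk, one JSON file per endpoint. The function name `write_dump` and its arguments are hypothetical and not part of any OPTIMADE package; the sketch assumes the crawl and any pagination have already been handled elsewhere.

```python
import json
import tempfile
from pathlib import Path


def write_dump(root, host, version, entries, info):
    """Write one JSON file per endpoint, mirroring the URL hierarchy.

    `entries` maps an entry type (e.g. "structures") to the list of resource
    objects collected by a crawler; `info` maps an entry type to the response
    of its /info/<type> endpoint. Both are assumed to be plain dicts/lists.
    """
    base = Path(root) / host / version
    (base / "info").mkdir(parents=True, exist_ok=True)
    for name, data in entries.items():
        (base / f"{name}.json").write_text(json.dumps({"data": data}), encoding="utf-8")
    for name, response in info.items():
        (base / "info" / f"{name}.json").write_text(json.dumps(response), encoding="utf-8")
    return base


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        base = write_dump(
            tmp,
            "optimade.example.org",
            "v1",
            entries={"structures": [], "references": []},
            info={"structures": {"data": {}}, "references": {"data": {}}},
        )
        for path in sorted(base.rglob("*.json")):
            print(path.relative_to(tmp))
```

Keeping the file paths identical to the URL paths means a local client could map a request to a file without any extra index.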

Potential attributes:

  • time_stamp/last_modified
  • checksum
  • description
  • version
  • size
  • compression_method
  • url
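To make the attribute list concrete, the sketch below builds one possible `archives` entry for a given archive file. Everything beyond the attribute names listed above is an assumption made for illustration: the choice of `last_modified` over `time_stamp`, SHA-256 as the checksum, size in bytes, and the `id`, URL, and helper name `describe_archive`.

```python
import hashlib
from datetime import datetime, timezone


def describe_archive(payload, url, description, version,
                     compression_method, last_modified):
    """Return a JSON:API-style resource object describing one archive.

    `payload` is the raw bytes of the archive file. The checksum algorithm
    (SHA-256) and the size unit (bytes) are assumptions; the proposal does
    not fix either.
    """
    stamp = last_modified.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "type": "archives",
        "id": hashlib.sha256(payload).hexdigest()[:12],  # hypothetical id scheme
        "attributes": {
            "last_modified": stamp,
            "checksum": hashlib.sha256(payload).hexdigest(),
            "description": description,
            "version": version,
            "size": len(payload),
            "compression_method": compression_method,
            "url": url,
        },
    }


if __name__ == "__main__":
    entry = describe_archive(
        b"example archive bytes",
        url="https://optimade.example.org/dumps/full.tar.gz",  # hypothetical
        description="Full snapshot of all endpoints",
        version="2021.06",
        compression_method="gzip",
        last_modified=datetime(2021, 6, 11, tzinfo=timezone.utc),
    )
    print(entry["attributes"])
```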

Issues discussed

  • Attribution: references endpoint is naturally included in the dump. Is this enough?
  • Licensing: do we need to provide a mechanism for licensing databases differently to filtered data? Do we need to worry about this more generally?
  • ACID: should it be an explicit requirement for serving archives?
  • Indexing: this is completely lost in an archive, alongside any context provided by additional endpoints. This may define the natural dividing line between databases that can be archived and those that cannot.
  • Implementation overhead: may require extra work to support, but for small databases it should be trivial. Should be no requirement on frequency of updates.

Enabling new use cases

  • For databases that already provide archives, in some format:
    • Improved findability, plus standardization of OPTIMADE
    • Could remove some database load, e.g. could even replace pagination of “empty” filtering
  • For smaller databases, easier to archive and easier to deal with for the end user
  • For non-existent databases, e.g. a dataset on figshare… if provided as an OPTIMADE archive, the data can then be explored with OPTIMADE clients and hybrid OPTIMADE local clients/servers
  • Archive-only databases, pointing to persistent long-term storage, indexed in the same way as the providers repository, e.g. GitHub repo that builds archives.optimade.org with defined prefixes.

Resources

@ml-evs ml-evs added PR/requires-discussion type/proposal Proposal for addition/removal of features. May need broad discussion to reach consensus. labels Jun 11, 2021
@jacksund

Thanks for the write-up @ml-evs.

For smaller databases, easier to archive and easier to deal with for the end user

We would still need to make it easier to convert someone's data to the OPTIMADE format. Right now, providers would have to read the OPTIMADE spec and convert from their current structure format (POSCAR, CIF, etc.). I think it's worth adding .to_optimade() methods to the pymatgen Structure and/or ase Atoms classes. That way providers can automate conversion, regardless of the initial structure format. These methods would also let OPTIMADE absorb non-standardized datasets easily.

Once we go beyond structures, though, this can be a lot of work (e.g. even BandStructure classes would need a to_optimade method)... This goes with your Implementation overhead bullet point.
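A conversion helper of the kind suggested here could look roughly like the sketch below. It is a standalone function on plain Python lists rather than an actual pymatgen or ase method (no such method is assumed to exist), it only fills in a subset of the OPTIMADE structure properties, and it assumes a fully ordered, fully periodic 3D structure with Cartesian coordinates.

```python
from collections import Counter
from functools import reduce
from math import gcd


def to_optimade(symbols, positions, lattice):
    """Build a partial OPTIMADE `structures` attributes dict.

    `symbols` is a list of element symbols (one per site), `positions` the
    Cartesian site positions, and `lattice` the three lattice vectors.
    Assumes no disorder, no vacancies, and periodicity in all three directions.
    """
    counts = Counter(symbols)
    elements = sorted(counts)  # alphabetical order
    divisor = reduce(gcd, counts.values())
    reduced = {el: counts[el] // divisor for el in elements}
    formula = "".join(f"{el}{n if n > 1 else ''}" for el, n in reduced.items())
    nsites = len(symbols)
    return {
        "elements": elements,
        "nelements": len(elements),
        "elements_ratios": [counts[el] / nsites for el in elements],
        "chemical_formula_reduced": formula,
        "dimension_types": [1, 1, 1],
        "nperiodic_dimensions": 3,
        "lattice_vectors": [list(v) for v in lattice],
        "cartesian_site_positions": [list(p) for p in positions],
        "nsites": nsites,
        "species": [
            {"name": el, "chemical_symbols": [el], "concentration": [1.0]}
            for el in elements
        ],
        "species_at_sites": list(symbols),
    }


if __name__ == "__main__":
    attrs = to_optimade(
        ["Na", "Cl"],
        [[0.0, 0.0, 0.0], [2.8, 2.8, 2.8]],
        [[5.6, 0.0, 0.0], [0.0, 5.6, 0.0], [0.0, 0.0, 5.6]],
    )
    print(attrs["chemical_formula_reduced"], attrs["elements"])
```

A wrapper on a pymatgen Structure or ase Atoms object would mostly need to extract these three inputs and pass them through.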

Attribution & Licensing

This is probably the biggest roadblock to archives. Making it optional should make things a lot easier though. I'd anticipate the larger and more well-known a database is, the less they'll want to participate in this endpoint.

Also, what if we add license to your list of attributes? That way, a specific license could be attached to each individual archive dump.

The url attribute could also (optionally) be provider-controlled, e.g. a CDN with authentication, a link to their own website, etc. This would leave download stats in their hands.

Another route is collecting usage statistics that can be sent back to providers (for them to use in future grant proposals). Users would have to agree to such data collection if they want to download an archive. I'm personally against data collection, but it might be a necessary compromise for some providers to participate. This would have to be implemented in the OPTIMADE client package too.

Could remove some database load

One potential issue is that the OPTIMADE spec doesn't aim to be a condensed format. Instead it aims to be robust, encompassing, and flexible. So we could actually end up with dump files that are larger than the ones providers make themselves. For example, I was able to get all MP structures into a dump file below 100MB -- but I don't think I can get anywhere close to that value using the OPTIMADE spec and JSON format.
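One way to put numbers on this concern for a given dataset is to compare the serialized JSON size before and after compression, which is also what a compression_method attribute would describe. The sketch below is only a measuring aid with a hypothetical helper name, `dump_sizes`; it uses gzip as an arbitrary choice and says nothing about what the result would be for MP or any other real database.

```python
import gzip
import json


def dump_sizes(entries):
    """Return (raw_bytes, gzip_bytes) for a list of resource objects.

    The entries are serialized as compact JSON under a top-level "data" key,
    then compressed with gzip at its default level.
    """
    raw = json.dumps({"data": entries}, separators=(",", ":")).encode("utf-8")
    return len(raw), len(gzip.compress(raw))


if __name__ == "__main__":
    # Toy, highly repetitive entries; real data will compress differently.
    entries = [
        {"type": "structures", "id": str(i),
         "attributes": {"elements": ["Cl", "Na"], "nsites": 2}}
        for i in range(1000)
    ]
    raw, packed = dump_sizes(entries)
    print(f"raw: {raw} bytes, gzip: {packed} bytes")
```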
