Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

brainstorming: alternative signature storage/loading/query formats #1262

Open
ctb opened this issue Dec 21, 2020 · 15 comments
Open

brainstorming: alternative signature storage/loading/query formats #1262

ctb opened this issue Dec 21, 2020 · 15 comments

Comments

@ctb
Copy link
Contributor

ctb commented Dec 21, 2020

from #1226 (comment), @luizirber says:

The challenge then becomes recalculating the reference sigs with smaller scaled values (100?), and efficiently storing it. JSON + gzip for sigs is at the limit for sizes, but not sure what would be a good format that maintains good archival/self-describing/easy to parse/small trade-offs.

a couple of thoughts here -

  • I feel like we've benefited a lot from using a really boring standard format like JSON which has lots of tools & language support
  • so binary formats are fine if they have said tool & language support, but I'm not "up" on what binary formats are good - maybe protocol buffers are an option?
  • alt, I wonder if we could have a database in format that supports the kind of queries we want to do? e.g. in consider ways to improve speed of LCA database #821 I suggest sqlite.
@luizirber
Copy link
Member

I feel like we've benefited a lot from using a really boring standard format like JSON which has lots of tools & language support

Yup, I agree.

so binary formats are fine if they have said tool & language support, but I'm not "up" on what binary formats are good - maybe protocol buffers are an option?

I would really like to avoid protobuf (eg https://twitter.com/fasterthanlime/status/1340944948582113282). On the Rust side, serde has support for a bunch of formats, but performance-wise it would be better to have something that doesn't require encoding/decoding for usage (zero-copy deserialization like cap'n proto, also used by mash, or rkyv, which is rust-only), but that is not as flexible as JSON...

(Tree-buf looks REALLY interesting, but still hasn't support for other languages)

alt, I wonder if we could have a database in format that supports the kind of queries we want to do? e.g. in #821 I suggest sqlite.

Mixed feelings. I think it is a good idea when compared to using Zip files for databases, but not so sure about single signatures...

Relevant read: https://www.sqlite.org/affcase1.html

@ctb
Copy link
Contributor Author

ctb commented Feb 11, 2021

what about AVRO? https://avro.apache.org/

@luizirber
Copy link
Member

what about AVRO? https://avro.apache.org/

This is probably very easy to test, considering that https://github.com/flavray/avro-rs supports serde, and so it is a drop-in replacement in the current codebase.

I was looking more into the Arrow/Parquet direction, which would also make it easier to work with more data-analysis-like workflows (loading into pandas, and so on).

Another direction to consider: in #1221 I was using the bitmagic serialization/deserialization for saving nodegraphs, but it might be also a good representation for scaled minhash sketches (save a "compressed bitmap" of the hashes, instead of a list). bitmagic is not a good portable format, but I wonder if any of the options mentioned here support something along the bitmap idea.
(this can make a GIGANTIC difference for very large sketches).

@luizirber
Copy link
Member

I started playing with the easy ones (the formats supported by serde) in https://github.com/luizirber/2021-02-11-sourmash-binary-format, will report when I have more results.

@ctb
Copy link
Contributor Author

ctb commented Apr 20, 2022

thoughts stemming from all the manifest work that has happened:

between the recent introduction of StandaloneManifestIndex #1891 and the hopefully-soon merge of SQLite manifests in #1808, we have an increasingly clean separation between metadata (manifests) and sketches (things containing actual hashvals). This separation would seem to make it easier to experiment with non-JSON formats in the primary code base.

there's also the idea of storing sketches in kProcessor kDataFrames or other k-mer-specialized formats.

@ctb
Copy link
Contributor Author

ctb commented Apr 21, 2022

side note: it would be neat to find ways of avoiding even reading or adding hashes (e.g. store them in bands #1578, or hierarchically at different scaled values).

@ctb
Copy link
Contributor Author

ctb commented Aug 28, 2022

briefly looked into Roaring Bitmaps,

https://roaringbitmap.org/about/

which has both rust and python bindings.

however, while the roaring library and roaring-rs both seem to support 64-bit numbers, pyroaring does not yet - Ezibenroc/PyRoaringBitMap#58

update - also see https://pypi.org/project/roroaring64/ which supports deserialization but not serialization.

and also https://pypi.org/project/pilosa-roaring/ which primarily (only?) supports serialization and deserialization. not clear if it supports 64 bits.

and also https://github.com/sunzhaoping/python-croaring/ which is a cffi wrapping? but does not support 64 bits.

@luizirber
Copy link
Member

which has both rust and python bindings.

I'll do a quick check on the rust one for mastiff, I really liked the API!

At the moment #2230 is using rkyv to serialize/deserialize the list of datasets containing a hash, and while that process is fast it is using a regular BTreeMap from the Rust stdlib, which doesn't save much space.

(rkyv is fast, but it has its own binary format, which precludes using it in other languages. roaring bitmaps are well supported in many languages)

@luizirber
Copy link
Member

Seems like roaring is smaller and faster than rkyv on a first test, will try more extensive benchmarks soon.

branch: lirber/mastiff...lirber/mastiff_roaring

@luizirber
Copy link
Member

Seems like roaring is smaller and faster than rkyv on a first test, will try more extensive benchmarks soon.

branch: lirber/mastiff...lirber/mastiff_roaring

Caveat on roaring: it only stores presence/absence, so it doesn't work as a replacement for abundance. But I think we can still use a Vec for storing abundances, and call mins.rank(hash) to get the position to get/set in the abundances Vec.

@ctb
Copy link
Contributor Author

ctb commented Jan 1, 2023

Using the plugin architecture in #2428, I put together a small, minimally function Avro reader/writer here.

It works! Which seems like a good start 😆

It's in Python so it's really more for demo and exploration than for production, of course, but it's a piece of the puzzle.

@ctb
Copy link
Contributor Author

ctb commented Jan 1, 2023

flatbuffers: https://google.github.io/flatbuffers/

@ctb
Copy link
Contributor Author

ctb commented Jan 5, 2023

I thought this was a nice 'splainer about parquet vs avro - https://stackoverflow.com/questions/28957291/avro-vs-parquet - tl;dr avro is row-based, like CSV (but with more complicated rows); parquet is column based.

I think that means that parquet would be a better choice for manifests, where you might want to select on a few specific columns? While avro is essentially a replacement for JSON in the way we do things internally.

@ctb
Copy link
Contributor Author

ctb commented Mar 5, 2023

maybe relevant?

pandas 2.0 and the Arrow revolution (part I) - Marc Garcia

Pandas 2.0 is going to have an Apache Arrow backend for data. This is going to eventually be a pretty big deal for large or complex data analyses - and not just because it’ll be faster, and has better data-type and missing-value handling. It will mean the in-memory data representation is now compatible (and can be used in place) by a wide range of other tools - databases (duckDB), analysis and plotting tools, file handling tools… Garcia goes much deeper into this.

(from RCT #158)

@ctb
Copy link
Contributor Author

ctb commented Sep 12, 2024

bincode? https://github.com/bincode-org/bincode

note: sourmash-bio/sourmash_plugin_branchwater#455

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants