brainstorming: alternative signature storage/loading/query formats #1262

ctb · 2020-12-21T16:42:36Z

The challenge then becomes recalculating the reference sigs with smaller scaled values (100?), and efficiently storing it. JSON + gzip for sigs is at the limit for sizes, but not sure what would be a good format that maintains good archival/self-describing/easy to parse/small trade-offs.

a couple of thoughts here -

I feel like we've benefited a lot from using a really boring standard format like JSON which has lots of tools & language support
so binary formats are fine if they have said tool & language support, but I'm not "up" on what binary formats are good - maybe protocol buffers are an option?
alt, I wonder if we could have a database in a format that supports the kind of queries we want to do? e.g. in "consider ways to improve speed of LCA database" (#821) I suggest sqlite.
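A minimal sketch of the sqlite idea, assuming a hypothetical two-table layout (`sketches` and `hashes`) that is not taken from #821 or from sourmash itself. It shows one query of the kind a database format could answer directly: how many hashes each stored sketch shares with a query. SQLite integers are signed 64-bit, so unsigned 64-bit hash values are shifted into the signed range before storage.

```python
import sqlite3


def to_signed(hashval):
    """Map an unsigned 64-bit hash into SQLite's signed 64-bit INTEGER range."""
    return hashval - 2**64 if hashval >= 2**63 else hashval


def build_db(sketches):
    """Build an in-memory database from {sketch name: iterable of hashes}.

    The table and column names are illustrative assumptions.
    """
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sketches (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
    db.execute(
        "CREATE TABLE hashes (hashval INTEGER NOT NULL, sketch_id INTEGER NOT NULL)"
    )
    for name, hashvals in sketches.items():
        sketch_id = db.execute(
            "INSERT INTO sketches (name) VALUES (?)", (name,)
        ).lastrowid
        db.executemany(
            "INSERT INTO hashes (hashval, sketch_id) VALUES (?, ?)",
            ((to_signed(h), sketch_id) for h in set(hashvals)),
        )
    # The index is what makes lookups by hash value cheap.
    db.execute("CREATE INDEX hashes_by_value ON hashes (hashval)")
    db.commit()
    return db


def count_overlaps(db, query_hashes):
    """Return {sketch name: number of hashes shared with the query}."""
    db.execute("CREATE TEMP TABLE query_hashes (hashval INTEGER PRIMARY KEY)")
    db.executemany(
        "INSERT OR IGNORE INTO query_hashes (hashval) VALUES (?)",
        ((to_signed(h),) for h in query_hashes),
    )
    rows = db.execute(
        """
        SELECT sketches.name, COUNT(*)
        FROM hashes
        JOIN query_hashes ON hashes.hashval = query_hashes.hashval
        JOIN sketches ON sketches.id = hashes.sketch_id
        GROUP BY sketches.name
        """
    ).fetchall()
    db.execute("DROP TABLE query_hashes")
    return dict(rows)
```

Sketches with no shared hashes simply do not appear in the result, so the caller never has to load them.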

luizirber · 2020-12-21T17:43:16Z

> I feel like we've benefited a lot from using a really boring standard format like JSON which has lots of tools & language support

Yup, I agree.

> so binary formats are fine if they have said tool & language support, but I'm not "up" on what binary formats are good - maybe protocol buffers are an option?

I would really like to avoid protobuf (eg https://twitter.com/fasterthanlime/status/1340944948582113282). On the Rust side, serde has support for a bunch of formats, but performance-wise it would be better to have something that doesn't require encoding/decoding for usage (zero-copy deserialization like cap'n proto, also used by mash, or rkyv, which is rust-only), but that is not as flexible as JSON...

(Tree-buf looks REALLY interesting, but still has no support for other languages)

> alt, I wonder if we could have a database in a format that supports the kind of queries we want to do? e.g. in #821 I suggest sqlite.

Mixed feelings. I think it is a good idea when compared to using Zip files for databases, but not so sure about single signatures...

Relevant read: https://www.sqlite.org/affcase1.html

ctb · 2021-02-11T16:24:21Z

what about AVRO? https://avro.apache.org/

luizirber · 2021-02-11T19:42:21Z

> what about AVRO? https://avro.apache.org/

This is probably very easy to test, considering that https://github.com/flavray/avro-rs supports serde, and so it is a drop-in replacement in the current codebase.

I was looking more into the Arrow/Parquet direction, which would also make it easier to work with more data-analysis-like workflows (loading into pandas, and so on).

Another direction to consider: in #1221 I was using the bitmagic serialization/deserialization for saving nodegraphs, but it might be also a good representation for scaled minhash sketches (save a "compressed bitmap" of the hashes, instead of a list). bitmagic is not a good portable format, but I wonder if any of the options mentioned here support something along the bitmap idea.
(this can make a GIGANTIC difference for very large sketches).
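To get a rough feel for why a non-list encoding of the hashes can matter, here is a small stand-in that stores sorted hashes as variable-length-encoded deltas. This is plain delta + varint encoding, not bitmagic and not a compressed bitmap; the function names are hypothetical and the only point is to compare encoded sizes against a JSON list.

```python
def encode_hashes(hashvals):
    """Encode hashes as varint-encoded gaps between sorted, unique values."""
    out = bytearray()
    prev = 0
    for h in sorted(set(hashvals)):
        delta = h - prev
        prev = h
        # Emit 7 bits per byte, low bits first; the high bit marks "more to come".
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)
            delta >>= 7
        out.append(delta)
    return bytes(out)


def decode_hashes(data):
    """Invert encode_hashes, returning the sorted list of hashes."""
    hashvals = []
    prev = 0
    current = 0
    shift = 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            # Last byte of this varint: add the gap to the running value.
            prev += current
            hashvals.append(prev)
            current = 0
            shift = 0
    return hashvals
```

Comparing `len(encode_hashes(hashes))` with `len(json.dumps(sorted(hashes)))` on a real sketch gives a quick estimate of the possible savings before trying any of the formats discussed here.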

luizirber · 2021-02-12T02:22:13Z

I started playing with the easy ones (the formats supported by serde) in https://github.com/luizirber/2021-02-11-sourmash-binary-format, will report when I have more results.

ctb · 2022-04-20T14:53:57Z

thoughts stemming from all the manifest work that has happened:

between the recent introduction of StandaloneManifestIndex #1891 and the hopefully-soon merge of SQLite manifests in #1808, we have an increasingly clean separation between metadata (manifests) and sketches (things containing actual hashvals). This separation would seem to make it easier to experiment with non-JSON formats in the primary code base.

there's also the idea of storing sketches in kProcessor kDataFrames or other k-mer-specialized formats.

ctb · 2022-04-21T13:30:20Z

side note: it would be neat to find ways of avoiding even reading or adding hashes (e.g. store them in bands #1578, or hierarchically at different scaled values).
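One way to picture the hierarchical idea, as a sketch only: split the hashes into bands by the scaled value at which they first appear, so a reader that wants a coarse scaled value never touches the remaining bands. The threshold formula below is a simplified assumption (a sketch at a given scaled keeps hashes up to about 2**64 / scaled), and the function names and band layout are hypothetical rather than the design in #1578.

```python
from bisect import bisect_left

MAX_HASH = 2**64 - 1


def max_hash_for_scaled(scaled):
    """Largest hash kept at this scaled value (simplified assumption)."""
    return MAX_HASH // scaled


def split_into_bands(hashvals, scaled_levels):
    """Group hashes by the coarsest scaled level at which each one is kept.

    Returns {scaled: sorted hashes that first appear at that level}.
    Hashes above the threshold of the smallest scaled level are dropped.
    """
    levels = sorted(scaled_levels, reverse=True)  # coarsest level first
    thresholds = [max_hash_for_scaled(s) for s in levels]  # ascending
    bands = {s: [] for s in levels}
    for h in sorted(set(hashvals)):
        idx = bisect_left(thresholds, h)  # first level whose threshold is >= h
        if idx < len(levels):
            bands[levels[idx]].append(h)
    return bands


def load_at_scaled(bands, scaled):
    """Read only the bands needed for the requested scaled value."""
    result = []
    for s in sorted(bands, reverse=True):
        if s >= scaled:
            result.extend(bands[s])
    return result
```

With levels such as `[10000, 1000, 100]`, loading at scaled=10000 reads a single band, and each finer level adds one more band on top of the ones already read.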

ctb · 2022-08-28T14:48:27Z

briefly looked into Roaring Bitmaps (https://roaringbitmap.org/about/), which has both rust and python bindings.

however, while the roaring library and roaring-rs both seem to support 64-bit numbers, pyroaring does not yet - Ezibenroc/PyRoaringBitMap#58

update - also see https://pypi.org/project/roroaring64/ which supports deserialization but not serialization.

and also https://pypi.org/project/pilosa-roaring/ which primarily (only?) supports serialization and deserialization. not clear if it supports 64 bits.

and also https://github.com/sunzhaoping/python-croaring/ which is a cffi wrapping? but does not support 64 bits.

luizirber · 2022-09-03T14:41:14Z

> which has both rust and python bindings.

I'll do a quick check on the rust one for mastiff, I really liked the API!

At the moment #2230 is using rkyv to serialize/deserialize the list of datasets containing a hash, and while that process is fast it is using a regular BTreeMap from the Rust stdlib, which doesn't save much space.

(rkyv is fast, but it has its own binary format, which precludes using it in other languages. roaring bitmaps are well supported in many languages)

luizirber · 2022-09-04T00:36:59Z

Seems like roaring is smaller and faster than rkyv on a first test, will try more extensive benchmarks soon.

branch: lirber/mastiff...lirber/mastiff_roaring

luizirber · 2022-09-05T16:24:50Z

> Seems like roaring is smaller and faster than rkyv on a first test, will try more extensive benchmarks soon.
>
> branch: lirber/mastiff...lirber/mastiff_roaring

Caveat on roaring: it only stores presence/absence, so it doesn't work as a replacement for abundance. But I think we can still use a Vec for storing abundances, and call mins.rank(hash) to get the position to get/set in the abundances Vec.
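The rank trick can be sketched in Python with a sorted list standing in for the roaring bitmap. This is an assumption-laden illustration, not the mastiff code: `AbundanceSketch` is a hypothetical class, and `rank` here is implemented with `bisect` to mimic the "number of stored values less than or equal to x" semantics that roaring's rank is understood to have.

```python
from bisect import bisect_right


class AbundanceSketch:
    """Presence/absence set plus a parallel list of abundances."""

    def __init__(self, hashvals):
        # Stand-in for the bitmap: sorted, unique hash values.
        self.mins = sorted(set(hashvals))
        # One abundance slot per stored hash, in the same order.
        self.abunds = [0] * len(self.mins)

    def rank(self, hashval):
        """Number of stored hashes less than or equal to hashval."""
        return bisect_right(self.mins, hashval)

    def _position(self, hashval):
        # For a stored hash, rank - 1 is its index in the abundance list.
        pos = self.rank(hashval) - 1
        if pos < 0 or self.mins[pos] != hashval:
            raise KeyError(hashval)
        return pos

    def set_abundance(self, hashval, abundance):
        self.abunds[self._position(hashval)] = abundance

    def get_abundance(self, hashval):
        return self.abunds[self._position(hashval)]
```

The positions stay valid only while the set of hashes is fixed; adding a hash would shift every later slot, so the abundance list would need an insert at the same position.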

ctb · 2023-01-01T01:50:45Z

Using the plugin architecture in #2428, I put together a small, minimally functional Avro reader/writer here.

It works! Which seems like a good start 😆

It's in Python so it's really more for demo and exploration than for production, of course, but it's a piece of the puzzle.

ctb · 2023-01-01T15:43:10Z

flatbuffers: https://google.github.io/flatbuffers/

ctb · 2023-01-05T14:51:20Z

I thought this was a nice 'splainer about parquet vs avro - https://stackoverflow.com/questions/28957291/avro-vs-parquet - tl;dr avro is row-based, like CSV (but with more complicated rows); parquet is column based.

I think that means that parquet would be a better choice for manifests, where you might want to select on a few specific columns? While avro is essentially a replacement for JSON in the way we do things internally.
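The row versus column distinction can be illustrated without either library. The sketch below uses plain Python structures, with illustrative manifest-like fields (`name`, `ksize`, `scaled`) that are assumptions rather than the real manifest schema, to show why a column layout suits "select on a few specific columns" while a row layout suits reading whole records.

```python
def to_columns(rows):
    """Turn a list of row dicts into {column name: list of values}.

    Assumes every row has the same keys.
    """
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def names_with_ksize(columns, ksize):
    """Select names by ksize while touching only two columns."""
    return [
        name
        for name, k in zip(columns["name"], columns["ksize"])
        if k == ksize
    ]


# Row-oriented: each record is stored whole, like an Avro or JSON record.
rows = [
    {"name": "genome_a", "ksize": 21, "scaled": 1000},
    {"name": "genome_b", "ksize": 31, "scaled": 1000},
    {"name": "genome_c", "ksize": 31, "scaled": 100},
]

# Column-oriented: each field is stored contiguously, like a Parquet column.
columns = to_columns(rows)
```

In the row layout, the same selection has to walk every full record; in the column layout, the `scaled` values are never read at all.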

ctb · 2023-03-05T15:03:44Z

maybe relevant?

pandas 2.0 and the Arrow revolution (part I) - Marc Garcia

Pandas 2.0 is going to have an Apache Arrow backend for data. This is going to eventually be a pretty big deal for large or complex data analyses - and not just because it’ll be faster, and has better data-type and missing-value handling. It will mean the in-memory data representation is now compatible with (and can be used in place by) a wide range of other tools - databases (duckDB), analysis and plotting tools, file handling tools… Garcia goes much deeper into this.

(from RCT #158)

ctb · 2024-09-12T15:04:22Z

bincode? https://github.com/bincode-org/bincode

note: sourmash-bio/sourmash_plugin_branchwater#455

ctb mentioned this issue Aug 3, 2022: adjust JSON signature format to better support streaming and progressive I/O? #1507 (Closed)

ctb mentioned this issue Sep 3, 2022: [EXP] provide signature file loading function via HTTP #2256 (Open)

ctb mentioned this issue Dec 31, 2022: [MRG] provide an initial plugin architecture for sourmash that supports new signature saving & loading mechanisms #2428 (Merged)
