Draft ZEP 0007: Strings #47

Open · wants to merge 5 commits into main

Conversation

@ivirshup (Contributor) commented Sep 7, 2023

Finally getting around to posting this proposal that was initially put out on the zulip. You can see the initial conversation on the linked hackmd: https://hackmd.io/aSz4DAYnRRaoFPMQXrml3w

I've tried to be quite conservative in the definition here. The overall idea is: "use arrow's string type".

I would like to get more feedback on this, especially from implementers.

cc: @normanrz

@jbms commented Sep 7, 2023

I like the overall intent of this proposal. I think we need a new "array -> bytes" codec that will support the new string and binary types, since the existing bytes codec was intended for fixed-size types. Not sure what the best name for that codec would be, but "vlen" might be a reasonable choice.

Then the vlen codec could support various configuration options:

{"name": "vlen",
 "configuration": {
    "data_codecs": [{"name": "bytes"}, {"name": "blosc", "configuration": {"cname": "zstd", "clevel":5,"shuffle": "bitshuffle", "typesize":1,"blocksize":0}}],
    "index_codecs": [{"name": "bytes"}, {"name": "blosc", "configuration":{"cname": "zstd", "clevel":5,"shuffle": "shuffle", "typesize":4,"blocksize":0}}],
    "index_data_type": "uint32"
  }
}

Having separate data and index codecs allows different compression options to be used --- e.g. in the example above we use bit-wise shuffling for the data but byte-wise shuffling for the index.

One caveat is that if, as in the example above, the size of the encoded index is variable, then we would need to separately store the size of the index. Some compression formats may be self-delimiting and therefore not require that the size is stored, but we may not want to deal with that complexity.
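
To make the shape of this concrete, here is a minimal sketch of how such a vlen codec might lay out a chunk. The plain uint32 offsets and the uint64 length prefix below simply stand in for wherever the index_codecs/data_codecs chains and the index-size bookkeeping would actually go; none of the names here come from a spec:

import struct
import numpy as np

def encode_vlen_chunk(strings):
    # "data": the concatenated UTF-8 bytes of all elements
    data = b"".join(s.encode("utf-8") for s in strings)
    # "index": n+1 offsets, Arrow-style, so element i spans offsets[i]:offsets[i+1]
    lengths = [len(s.encode("utf-8")) for s in strings]
    offsets = np.concatenate([[0], np.cumsum(lengths)]).astype("uint32")
    encoded_index = offsets.tobytes()  # the index_codecs chain would run here
    encoded_data = data                # the data_codecs chain would run here
    # prefix the encoded index with its size so the decoder can find the data,
    # covering the case where the encoded index itself is variable-sized
    return struct.pack("<Q", len(encoded_index)) + encoded_index + encoded_data

def decode_vlen_chunk(buf):
    (index_size,) = struct.unpack_from("<Q", buf, 0)
    offsets = np.frombuffer(buf[8:8 + index_size], dtype="uint32")
    data = buf[8 + index_size:]
    return [data[offsets[i]:offsets[i + 1]].decode("utf-8")
            for i in range(len(offsets) - 1)]

print(decode_vlen_chunk(encode_vlen_chunk(["hello", "", "world"])))  # ['hello', '', 'world']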

@ivirshup (Contributor, Author)

@jbms, you raise a very good point. I was able to talk to Joris about this at the numfocus summit last week and got a lot of insight into how arrow does this.

An arrow RecordBatch contains a flattened set of buffers corresponding to the columns (docs). To use an example from the docs, consider the schema:

col1: Struct<a: Int32, b: List<item: Int64>, c: Float64>
col2: Utf8

We'd have the fields:

FieldNode 0: Struct name='col1'
FieldNode 1: Int32 name='a'
FieldNode 2: List name='b'
FieldNode 3: Int64 name='item'
FieldNode 4: Float64 name='c'
FieldNode 5: Utf8 name='col2'

Which corresponds to the buffers:

buffer 0: field 0 validity
buffer 1: field 1 validity
buffer 2: field 1 values
buffer 3: field 2 validity
buffer 4: field 2 offsets
buffer 5: field 3 validity
buffer 6: field 3 values
buffer 7: field 4 validity
buffer 8: field 4 values
buffer 9: field 5 validity
buffer 10: field 5 offsets
buffer 11: field 5 data
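
For the Utf8 column (field 5) specifically, the validity/offsets/data split is easy to see directly with pyarrow (just an illustration of the layout, not something the proposal depends on):

import numpy as np
import pyarrow as pa

col2 = pa.array(["hello", None, "zarr"], type=pa.utf8())
validity, offsets, data = col2.buffers()

# int32 offsets with one more entry than there are elements; element i spans
# data[offsets[i]:offsets[i+1]], and the null element has a zero-length span
print(np.frombuffer(offsets, dtype="int32")[:len(col2) + 1])  # [0 5 5 9]
print(data.to_pybytes())                                      # b'hellozarr'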

While how compression works with the IPC format isn't super well documented (apache/arrow#37756), we can find a description of it in the flatbuffer definitions. AFAICT, each buffer is compressed separately, but I believe you cannot specify different compressors for different buffers. There is also room in the specification for compressing the entire message, instead of the buffers individually.


So, where does that leave us?

Allowing separate compression of underlying buffers may be useful, and I think it becomes much more useful if more variable-length types are allowed. I would also like to keep the goal of very low-cost interoperability with Arrow.

I don't know that I love the idea of this codec:

  • Why make it specific to vlen? Surely multiple arrays of fixed lengths could be stored in a buffer using separate compression schemes as well.
  • How do we handle chunk-level metadata to know where component buffers start? Probably inline as a header for each chunk?

To me, the arrow RecordBatch IPC format rhymes with zarr's sharding format. Instead of including multiple chunks of one array inside a shard, it's storing related chunks across multiple arrays inside a shard.

There may be a more parsimonious solution here that shares more with sharding + variable chunk sizes, instead of defining a new codec.


since the existing bytes codec was intended for fixed-size types.

Where is the bytes codec defined? I believe this would be a basic array->bytes codec, but I do not see this defined in the v3 spec: https://zarr-specs.readthedocs.io/en/latest/v3/codecs.html

@jbms commented Sep 17, 2023

The bytes codec is the proposed renaming of the endian codec: zarr-developers/zarr-specs#263
Sorry for the confusion about that.

I think arrow compatibility, as far as the format of the offsets and data buffers is concerned, is relatively easy to achieve. I am not so keen on trying to use the RecordBatch flatbuffers message format itself, since that adds a lot of baggage and complexity for what we could also accomplish with just a single 64-bit number.

It is true that there is some similarity with the sharding_indexed codec: in fact the sharding codec is just storing an array of variable-length byte strings. It differs from your proposed arrow-compatible format in that it stores both an offset and length for each entry to allow arbitrary ordering of sub-chunks.

But it is not clear to me how useful it is to try to unify the sharding format with the vlen string format, since the use cases and expected access patterns are very different.

Can you explain the connection to variable-size chunks (i.e. rectilinear grid)? Are you thinking more about sparse arrays?

I agree that if we had a different case where we are storing multiple arrays in one chunk, such as storing a chunk using a sparse array encoding, we would probably also want to allow separate codecs for each array, and these could be specified as part of the json configuration for this "sparse" codec. As far as the binary format, I suppose it could make sense to try to unify the sparse array format and the vlen string format in some way, but I'm not sure there is really that much benefit and it would bring in a lot of added complexity.

@MSanKeys963 (Member)

Hi @ivirshup. I've fixed the RTD build issue in #51.

The PR preview is https://zeps--47.org.readthedocs.build/en/47/draft/ZEP0007.html.

@joshmoore (Member)

@ivirshup, thinking about your comment in the description:

I would like to get more feedback on this, especially from implementers.

what would you like to see happen here on this PR before ZEP7 gets listed under https://zarr.dev/zeps/draft_zeps/?

@joshmoore (Member)

@ivirshup: in light of the renewed interest in zarr-developers/zarr-specs#83 (comment), do you see coming back to this or are you interested in passing it off? (Some discussion during the ZEP meeting today)

@rabernat (Contributor)

We discussed this today at the zarr-python meeting.

The above ideas are all good ones. The arrow approach of storing an offsets buffer and a data buffer seems to be the way most data formats today do it.

However, it may also be valuable to have a V3 codec that is backwards compatible with the existing Zarr V2 VLen codecs: https://github.com/zarr-developers/numcodecs/blob/main/numcodecs/vlen.pyx

These codecs use an "interleaved" format:

| header | item 0 len | item 0 data | item 1 len | item 1 data | ...
| uint32 | uint32     | variable    | uint32     | variable    | ...

where the header stores the number of items.

You can see this in how Zarr V2 encodes data. Here's an example:

import zarr  # V2

strings = ["hello", "world", "my", "name", "is", "Ryan"]
store = zarr.MemoryStore()
array = zarr.array(strings, dtype=str, store=store, compressor=None)

buffer = store['0']  # the raw encoded chunk (VLenUTF8, no compressor)
nitems = int.from_bytes(buffer[:4], byteorder="little")  # 4-byte header: item count
offset = 4
for _ in range(nitems):
    # each item: a 4-byte little-endian length, then that many bytes of UTF-8 data
    next_len = int.from_bytes(buffer[offset:offset+4], byteorder="little")
    offset += 4
    data = buffer[offset:offset+next_len]
    offset += next_len
    print(next_len, data)

which prints:

5 b'hello'
5 b'world'
2 b'my'
4 b'name'
2 b'is'
4 b'Ryan'
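
For comparison, a stand-alone encoder for the same interleaved layout takes only a few lines; this is just an illustration of the byte layout described above (that it matches the V2 chunk byte-for-byte is my expectation for uncompressed VLenUTF8, not something verified here):

import struct

def encode_interleaved(strings):
    # header: uint32 item count, then per item a uint32 length followed by UTF-8 bytes
    body = b"".join(
        struct.pack("<I", len(s.encode("utf-8"))) + s.encode("utf-8")
        for s in strings
    )
    return struct.pack("<I", len(strings)) + body

print(encode_interleaved(strings) == bytes(buffer))  # expected: True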

@ivirshup (Contributor, Author)

I'm not sure I've got the time to follow this one up in the immediate future, so if someone else is interested in picking it up that would be great.

@LDeakin commented Jul 19, 2024

Thanks for renewing interest in this @rabernat. I've since experimented with variable-length data types in zarrs. My thoughts:

  • I like the vlen proposal @jbms made here for separate data and index codecs.
    • Simple to implement and suited to partial decoding.
    • Aligns well with the Zarr codec model and can benefit from current and future codecs.
    • I structured the underlying index using the Apache arrow variable-size binary layout with the validity bitmap elided.
  • The numcodecs vlen-utf8/vlen-bytes/vlen-array codecs seem to be effectively the same thing, just for different data types. I think they could be standardised as one Zarr V3 codec, something like vlen_interleaved or vlen_v2.
    • Transitioning Zarr V2 -> V3 means just changing the codec name
    • Interleaved encoding is not suitable for partial decoding, so it should not be recommended for new data

Unanswered questions:

  • How best to handle fill_value metadata for string data types? It seems that the fill_value metadata cannot be an arbitrary JSON string in Zarr V3 without a spec version bump.
  • Should the string data type encompass fixed-length strings?

@rabernat (Contributor)

Thanks for doing this work @LDeakin! Super helpful! I think your plan sounds great.

  • How best to handle fill_value metadata for string data types? It seems that the fill_value metadata cannot be an arbitrary JSON string in Zarr V3 without a spec version bump.

Having spent more time with Arrow, I find myself wishing we had the concept of "missing data" or "null values" more deeply integrated into Zarr. Do you have any thoughts on that?

@jbms commented Jul 19, 2024

  • How best to handle fill_value metadata for string data types? It seems that the fill_value metadata cannot be an arbitrary JSON string in Zarr V3 without a spec version bump.

The interpretation of the json fill value depends on the data type so there is no problem here, since we are also introducing a new data type. It is okay and expected that old implementations return an error when parsing zarr metadata that specifies unsupported features. It is only a problem if the old implementation does not return an error, but interprets the data incorrectly.

  • Should the string data type encompass fixed-length strings?

I think fixed length strings introduce some additional questions and could be deferred.

@LDeakin commented Jul 20, 2024

I realise now that the vlen codec could more efficiently store chunks containing fill values. Instead of using the Apache arrow variable-size binary layout as is, negative offsets could be used to represent fill values, and then their bytes would not need to be stored. EDIT: this would have to use 1-based indexing. For example:

data:  ["fill value", "ab", "fill value", "cde"]
index: [-1, 1, -3, 3, 6]
bytes: [97, 98, 99, 100, 101]

It might be better to keep it simple, though.
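
For what it's worth, decoding under that scheme could look roughly like the following; this is only a sketch of the 1-based, negative-offset idea above, not anything specified:

def decode_with_fill(index, data, fill_value="fill value"):
    # 1-based offsets so position 0 never needs a sign; a negative offset at
    # entry i marks element i as the fill value (no bytes are stored for it)
    out = []
    for i in range(len(index) - 1):
        if index[i] < 0:
            out.append(fill_value)
        else:
            start, end = index[i] - 1, abs(index[i + 1]) - 1
            out.append(bytes(data[start:end]).decode("utf-8"))
    return out

print(decode_with_fill([-1, 1, -3, 3, 6], [97, 98, 99, 100, 101]))
# ['fill value', 'ab', 'fill value', 'cde']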

Having spent more time with Arrow, I find myself wishing we had the concept of "missing data" or "null values" more deeply integrated into Zarr. Do you have any thoughts on that?

This is probably better suited to discussion in a new issue, but here are my thoughts. A ZEP0004 metadata convention for null/missing/mask values would be a step in the right direction, no support would be needed from Zarr implementations. But I think first-class support would be better:

  • A new optional array metadata field: "null_value": ... (same valid inputs as fill_value)
    • Implementations could support storing/retrieving optional values (e.g. Option<T> in Rust) if something like this existed
  • An array->bytes codec could exist to more efficiently encode data that contains special values, such as a fill value, null value, or just common values. For example:
    • Create an index or bitfields indicating which elements in a chunk are a special value (and which one)
    • Linearise the chunk into a 1D array with special values removed and pass it through to an array->bytes codec
{
    "name": "bikeshed",
    "configuration": {
        "values": [0.0, "NaN"],
        "index_codecs": [...],
        "array_to_bytes_codec": { "name": "bytes", "configuration": { "endian": "little" } }
    }
}
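
As a rough sketch of what the encode step of such a codec could do (the names here, like "bikeshed" itself, are placeholders; this only illustrates the index-plus-remaining-values idea from the bullets above):

import numpy as np

def encode_special(chunk, special_values=(0.0, float("nan"))):
    flat = np.asarray(chunk, dtype="float64").ravel()
    # index[i] = 0 for an ordinary value, k + 1 for the k-th special value
    index = np.zeros(flat.shape, dtype="uint8")
    for k, v in enumerate(special_values):
        mask = np.isnan(flat) if np.isnan(v) else (flat == v)
        index[mask] = k + 1
    remaining = flat[index == 0]
    # in a real codec, `index` would go through the index_codecs and
    # `remaining` through the configured array->bytes codec
    return index, remaining

index, remaining = encode_special(np.array([[1.5, 0.0], [float("nan"), 2.5]]))
print(index)      # [0 1 2 0]
print(remaining)  # [1.5 2.5]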

@jbms commented Jul 20, 2024

I realise now that the vlen codec could more efficiently store chunks containing fill values. Instead of using the Apache arrow variable-size binary layout as is, negative offsets could be used to represent fill values, and then their bytes would not need to be stored. EDIT: this would have to use 1-based indexing. For example:

data:  ["fill value", "ab", "fill value", "cde"]
index: [-1, 1, -3, 3, 6]
bytes: [97, 98, 99, 100, 101]

It might be better to keep it simple, though.

Potentially this sort of compression could be handled by an additional codec layered on top.

Having spent more time with Arrow, I find myself wishing we had the concept of "missing data" or "null values" more deeply integrated into Zarr. Do you have any thoughts on that?

This is probably better suited to discussion in a new issue, but here are my thoughts. A ZEP0004 metadata convention for null/missing/mask values would be a step in the right direction, no support would be needed from Zarr implementations. But I think first-class support would be better:

  • A new optional array metadata field: "null_value": ... (same valid inputs as fill_value)

In arrow, a missing value is always a distinct value from any value within the domain of the data type, which is important if you need to preserve the full domain for non-missing values. In zarr I think it would most naturally be represented by some sort of separate mask array associated with the main array.

