
Change-only version selection (PyInf#10524) #264

Closed
rahasurana opened this issue Aug 28, 2023 · 7 comments · Fixed by #293

@rahasurana
Contributor

Background:

As part of an internal evaluation, we explored TileDB's time travel feature. Based on our benchmarking, computing the change in data between two timestamps/versions was generally up to 6x slower and consumed 3x more memory with versioned-hdf5 than with our toy TileDB wrapper.

Findings:

TileDB is faster because it stores newly added data in a new TileDB fragment, which can later be retrieved by timestamp. When the changes between two timestamps are requested, it simply merges the fragments between those timestamps and returns them.

OTOH, versioned-hdf5 creates new chunks for the data, which are then added to the global chunk hash table, and the virtual dataset for that version reflects old chunks + new chunks shadowing older chunks for the same data. When a user reads data for a given version/timestamp, all the data at that timestamp is read and needs to be filtered, which is the major bottleneck.

Requirement:
Is there an API already available to fetch only the modified data/chunks? I couldn't find anything on the API doc page.
Can you please add an API that allows accessing data of only the chunks that were added in a given version?
This would reduce the amount of data being read/filtered and speed up the change-based selection.

Having a mapping from each version to all the chunks added in that version should help here(?).

Also, addressing Store timestamps in array (#171) would further help speed up identifying the version corresponding to a timestamp for a couple of internal use cases.

Internal ticket: PyInf#10524

CC: @ArvidJB

@rahasurana rahasurana changed the title Return only change for a version selection Change-only version selection Aug 28, 2023
@quarl
Member

quarl commented Aug 31, 2023

Can you think about this and estimate?

@peytondmurray peytondmurray self-assigned this Aug 31, 2023
@peytondmurray
Collaborator

Is there an API already available to fetch only the modified data/chunks? Couldn’t find anything on API doc page.
Can you please add an API that allows accessing data of only the chunks that were added in that version.
This would reduce the amount of data being read/filtered and speed up the change-based selection.

I like this as a feature, and I'm trying to think of a sensible way to implement it. A couple of options come to mind off the top of my head:

  1. Since we don't record which version wrote each chunk to the raw dataset, one option would be to iterate through all versions of a dataset to find the chunks written by each version. We'd need to compute the hashes and then check the hash table for every chunk, making this option probably quite slow, especially for large files.
  2. We could record the version when we write a new chunk for the first time. We'd store the version in the hash table for the dataset, and we'd add a function get_chunks(vf: VersionedHDF5File, version: str, dataset: str) which would iterate through all the entries in the hash table and return the chunks originally written by the given version. This could be quite a bit faster than the first option, but it wouldn't be backwards compatible with previous releases.

Both options have strengths and weaknesses. I'll keep thinking on this - there may be another way we can do this.
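The second option could be sketched with an in-memory stand-in for the hash table. Everything here is hypothetical: the entry layout, the `written_by` field, and the `get_chunks` helper are illustrations of the idea, not versioned-hdf5's actual internals.

```python
from dataclasses import dataclass

# Hypothetical hash-table entry: versioned-hdf5's table maps a chunk's hash to
# its slice in the raw dataset; option 2 would add the name of the version
# that first wrote the chunk.
@dataclass
class HashTableEntry:
    chunk_hash: bytes
    raw_slice: slice
    written_by: str  # new field: version that first wrote this chunk

def get_chunks(hash_table, version):
    """Return the raw-dataset slices of chunks first written by `version`."""
    return [e.raw_slice for e in hash_table if e.written_by == version]

table = [
    HashTableEntry(b"aa", slice(0, 1024), "r0"),
    HashTableEntry(b"bb", slice(1024, 2048), "r1"),
    HashTableEntry(b"cc", slice(2048, 3072), "r1"),
]
print(get_chunks(table, "r1"))  # chunks added in version r1
```

With the version stored at write time, fetching the chunks for one version is a single scan of the table rather than a rehash of every chunk in every version.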

@peytondmurray
Collaborator

peytondmurray commented Oct 31, 2023

I have a plan for this, and have finished with a number of other more pressing bug reports. I think we can just compare the slices that have changed between versions, which should just be n_chunks*3 comparisons between versions, making it quite a bit faster than directly comparing the datasets. I estimate 15h to test and benchmark the results.

@rahasurana The API would be something like

import h5py
from versioned_hdf5 import VersionedHDF5File

with h5py.File('data.hdf5', 'r') as f:
    vf = VersionedHDF5File(f)

    diff = vf.get_diff(dataset_name, 'v0', 'v1')

Here, get_diff would take two version names and a dataset name and return a 2-tuple containing a list of the slices that differ and a list of the corresponding arrays as they appear in 'v1' of the dataset. What are your thoughts on this approach?
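One way the slice comparison could work is shown below with plain dicts standing in for each version's virtual-chunk-to-raw-data map. The mapping shape, the `get_diff` signature, and the data layout are all assumptions for illustration:

```python
import numpy as np

def get_diff(chunk_map_a, chunk_map_b, raw_data, chunk_size):
    """Toy diff: compare which raw chunk each virtual chunk index points to.

    chunk_map_*: dict mapping virtual chunk index -> offset into raw_data.
    Returns (slices, arrays) for chunks that differ, as they appear in b.
    """
    slices, arrays = [], []
    for i, off_b in chunk_map_b.items():
        if chunk_map_a.get(i) != off_b:  # chunk replaced or newly appended
            s = slice(i * chunk_size, (i + 1) * chunk_size)
            slices.append(s)
            arrays.append(raw_data[off_b:off_b + chunk_size])
    return slices, arrays

raw = np.arange(40)
v0 = {0: 0, 1: 10}             # both chunks point at the original raw data
v1 = {0: 0, 1: 20, 2: 30}      # chunk 1 rewritten, chunk 2 appended
slices, arrays = get_diff(v0, v1, raw, chunk_size=10)
print(slices)  # [slice(10, 20, None), slice(20, 30, None)]
```

Because only the chunk maps are compared, the raw data is touched only for the chunks that actually changed.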

@rahasurana
Contributor Author

Thanks for looking into this @peytondmurray!

That sounds good and is in line with the original requirement of

  • an "API available to fetch only the modified data/chunks" and
  • supporting "accessing data of only the chunks that were added in that version" with some extra handling,

and it would provide a substantial performance improvement.

But I think it still won't match TileDB's performance, since TileDB stores the diff already and therefore needs only a single lookup to fetch the diff info.

Forgive me as I don't understand the internals well enough and might sound over-ambitious with this suggestion, but could we additionally store the slice info while a new version is being written? That would avoid the need to pass the version written before the "interested version" and also avoid computing this info on demand. We could just refer to the stored slice info and return a list of the corresponding arrays that only appear in the version of interest.

Please let me know your comments on the above.

@peytondmurray
Collaborator

But I think it still won't match up to TileDB's performance which stores the diff already and therefore needs a single lookup to fetch the diff info.

You're absolutely right, the proposed method wouldn't be a single lookup. So it wouldn't be as fast as if we stored the slice info additionally during a new version write.

That would avoid the need to pass the version written before "interested version" and also avoid computing this info on-demand.

True, although if you'd like we can make the default behavior of get_diff(dataset_name, version) simply compute the diff with the previous version. The API would then have the flexibility to compare across v1 and v3 if desired, while passing only v3 would compare v3 to v2, for example.

But with regards to computing this information on the fly, this would be something of a departure from the way information is currently stored. It's certainly possible, but I'd be interested in the opinion of @asmeurer about this.


The proposed method compares slices of two virtual datasets, so n_elements/chunk_size comparisons to compute the diff. Assuming a typical chunk_size of 1024, we're already looking at ~1000x fewer comparisons than the element-by-element comparison we would have to do as-is. For that reason I'd propose trying this first, and if that still isn't fast enough we can think about how we store diffs if we want more performance.

Either way, I think it should be possible to make this work with old data versions - we just recompute the diffs for each version when opening up old data.
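The comparison-count argument above can be checked with simple arithmetic; the dataset size here is just illustrative:

```python
n_elements = 10_000_000
chunk_size = 1024

elementwise = n_elements              # one comparison per element, as-is
chunkwise = n_elements // chunk_size  # one slice comparison per chunk
print(chunkwise)                # 9765 chunk-level comparisons
print(elementwise // chunkwise) # ~1024x fewer than element-by-element
```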

@ArvidJB
Collaborator

ArvidJB commented Nov 6, 2023

I like the slices comparison: it's fast and matches what's actually going on under the hood. With slices comparison you can see whether chunks are getting reused between versions and where the extra storage is spent. The goal would be for each chunk to find the earliest version it was mapped to the same raw data chunk. I had actually written something like this for our own diagnostic purposes, it would be nice to get this fully supported!
Note that our typical chunk sizes are around 100,000, matching the recommendations in the h5py documentation: https://docs.h5py.org/en/stable/high/dataset.html#:~:text=Chunking%20has%20performance%20implications.,chunk%20is%20read%20from%20disk. So this should be very fast.

@peytondmurray
Collaborator

Great, this sounds good to me!

I should also add that if we are comparing slices, the following chunks would be considered different:

v1 = [0, 0, 0, 0, 0]
v2 = [1, 0, 0, 0, 0]

So even though v1[1:] == v2[1:] element-wise, get_diff would report these as differing slices. As long as that's okay for your use cases, we can proceed.

If you want more granular diffs, we can of course do element-by-element comparisons on the chunks that differ. That's still much faster than diffing the two full versions, but slower and more granular than the chunk-level comparison alone.
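That element-level refinement could be layered on top of the chunk comparison along these lines. This is a sketch over plain NumPy arrays of equal length; `refine_diff` is a hypothetical helper, not versioned-hdf5's actual API:

```python
import numpy as np

def refine_diff(a, b, chunk_size):
    """Within chunks that differ, report the individual differing element indices."""
    diffs = {}
    for start in range(0, len(b), chunk_size):
        ca = a[start:start + chunk_size]
        cb = b[start:start + chunk_size]
        if not np.array_equal(ca, cb):             # cheap chunk-level check first
            idx = np.nonzero(ca != cb)[0] + start  # element-level only when needed
            diffs[start] = idx
    return diffs

v1 = np.array([0, 0, 0, 0, 0, 7, 7, 7])
v2 = np.array([1, 0, 0, 0, 0, 7, 7, 7])
print(refine_diff(v1, v2, chunk_size=4))  # {0: array([0])}
```

The element-wise `!=` runs only inside chunks the cheap check already flagged, which is what keeps this much faster than diffing the full versions.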

@ArvidJB ArvidJB changed the title Change-only version selection Change-only version selection (PyInf#10524) Nov 30, 2023