[DNM] Redesign InMemoryDataset data model (libhdf5 direct access) #386

Draft · wants to merge 8 commits into master from inmemorydataset_v3
Conversation

@crusaderky (Collaborator) commented Oct 17, 2024

This PR is still missing some components - see checklist below.

versioned_hdf5.wrappers.InMemoryDataset is plagued by slow control code: the control logic itself is very slow, and it calls h5py.Dataset.__getitem__ on the raw_data independently for every single chunk.

This PR completely replaces the internal implementation of InMemoryDataset: it now performs a fast inner loop of H5Dread calls directly against the libhdf5 C library. The control logic - the generation of the index information needed to perform the transfers, previously handled by build_data_dict and as_subchunk_map - has also been completely rewritten and is now much more aggressively cythonized.

This new design introduces a helper class, StagedChangesArray, around which InMemoryDataset is now a fairly thin wrapper and which holds the modified chunks. This class has a minimal external API and is completely abstracted from h5py. Internally, all the control logic is encapsulated in a series of *Plan classes (GetItemPlan, SetItemPlan, etc.), each formulating what actions need to be performed in order to fulfil a user request (__getitem__, __setitem__, etc.). On each user request,

  1. StagedChangesArray formulates a plan;
  2. executes it, possibly mutating its internal state;
  3. discards the plan.
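The plan/execute/discard cycle above can be sketched as follows. This is an illustrative reconstruction, not the actual versioned-hdf5 code: the method names and the `transfers` layout are assumptions, and the real control logic is cythonized rather than plain Python.

```python
# Hypothetical sketch of the plan/execute/discard pattern.
# GetItemPlan mirrors the *Plan classes mentioned above; the internals of
# _getitem_plan and _execute are placeholders for the cythonized control logic.
from dataclasses import dataclass, field


@dataclass
class GetItemPlan:
    """Index information for the chunks a __getitem__ call must touch."""
    # Assumed layout: (chunk_id, source_slice, destination_slice) triples.
    transfers: list = field(default_factory=list)


class StagedChangesArray:
    def __init__(self):
        # chunk_id -> in-memory ndarray, holding only user-modified chunks
        self.staged_chunks = {}

    def __getitem__(self, index):
        plan = self._getitem_plan(index)  # 1. formulate a plan
        out = self._execute(plan)         # 2. execute it (may mutate state)
        del plan                          # 3. discard the plan
        return out

    def _getitem_plan(self, index):
        plan = GetItemPlan()
        # ...control logic would populate plan.transfers here...
        return plan

    def _execute(self, plan):
        # ...fast inner loop over plan.transfers, serving staged chunks from
        # memory and reading unstaged chunks straight from storage...
        return None
```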

Other major design changes

In master, __setitem__ first loads all impacted chunks into memory, on the off chance that they are not wholly covered by the parameter index, and then updates them. In this PR, only the chunks that are not wholly covered by the __setitem__ index are loaded.
In other words, a[:] = 1 in master loads the whole dataset into memory if it's not in memory already, whereas in this PR it never touches the disk.
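The coverage test can be illustrated with a minimal 1-D sketch. The function below is a hypothetical helper, not part of the PR; it only shows the decision of whether a chunk needs a read-modify-write cycle.

```python
def chunk_wholly_covered(chunk_start, chunk_stop, sel_start, sel_stop):
    """True if the basic slice [sel_start:sel_stop] (step 1) covers the whole
    chunk [chunk_start:chunk_stop), so __setitem__ may overwrite the chunk
    without first loading it from disk."""
    return sel_start <= chunk_start and sel_stop >= chunk_stop


# a[:] = 1 on a dataset of length 100 with chunk size 30: every chunk,
# including the partial edge chunk [90:100), is wholly covered, so nothing
# is read from disk.
assert all(
    chunk_wholly_covered(start, min(start + 30, 100), 0, 100)
    for start in range(0, 100, 30)
)

# a[5:95] = 1: the first and last chunks are only partially covered and must
# be loaded before being updated; the middle chunks are overwritten blindly.
assert not chunk_wholly_covered(0, 30, 5, 95)
assert chunk_wholly_covered(30, 60, 5, 95)
```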

Analogously, in master resize() reads the whole dataset from disk. In this PR it only loads the edge chunks, and only when the shape is not exactly divisible by the chunk size.
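A 1-D sketch of the edge-chunk rule, with a hypothetical helper (not the PR's actual code): on resize, at most one chunk per axis straddles the resize boundary and needs to be read, and none at all when the boundary falls exactly on a chunk edge.

```python
def edge_chunk_needs_load(old_size, new_size, chunk_size):
    """1-D sketch: return the index of the single chunk that a resize must
    read from disk, or None if the resize boundary falls exactly on a chunk
    edge (shape divisible by chunk_size: no disk access at all)."""
    boundary = min(old_size, new_size)
    if boundary % chunk_size == 0:
        return None
    return boundary // chunk_size


# Shrinking 100 -> 85 with chunks of 30: chunk 2, i.e. [60:90), becomes
# partial at the new boundary and must be loaded before truncation.
assert edge_chunk_needs_load(100, 85, 30) == 2

# Growing 100 -> 130: the old partial edge chunk [90:100) must be loaded
# so it can be padded out to the new shape.
assert edge_chunk_needs_load(100, 130, 30) == 3

# Growing 90 -> 120: both shapes are exact multiples of 30, so resize never
# touches the disk.
assert edge_chunk_needs_load(90, 120, 30) is None
```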

There is no longer any direct __getitem__ call to the virtual dataset. It was found to be exceptionally slow, due to deficiencies in the algorithm inside the libhdf5 C library itself, and has been ditched.
A previous redesign, #370, had to be scrapped because of this limitation.
It remains possible to obtain a plain-h5py virtual dataset at the path
_version_data/versions/<version name>/<dataset name>. Notably, this can be done from any language that offers libhdf5 bindings, not just Python.

There is no longer cache-on-read: any and all caching is performed by libhdf5; the only data held in memory by the StagedChangesArray are the chunks that the user intentionally modified.

Status

  • implementation
  • all pre-existing unit tests pass
  • additional unit tests for the various subsystems (slicetools.pyx, subchunk_map.py)
  • additional unit tests for staged_changes.py
  • high level documentation
  • break PR up into stages to simplify code review
  • clean up legacy code
  • ad-hoc performance benchmarks
  • review published benchmarks
  • demo notebooks

Demo and benchmarks

See jupyter notebooks (not to be merged into master)

Subordinate PRs

This is a gigantic change, in excess of 6000 lines.
For the sake of sanity, it has been broken down into multiple PRs:

Known issues

Slices with step>1, as well as fancy indices that can be locally represented as such, can be slower than in master. The reason is that they are very slow in libhdf5, whereas master was (inadvertently?) side-stepping the issue by loading whole chunks and then applying the strided slice to the numpy chunks in memory.

Technically, this is a regression only when calling __getitem__ after __setitem__. In master, calling __getitem__ on an unmodified dataset goes through the virtual dataset, which is even slower: the step>1 issue interplays with the libhdf5 cache size, and the whole virtual dataset is always larger than the hdf5 cache.

This issue is mitigated, but not solved, by making sure that the chunk cache is larger than a single chunk.

Compared to a single contiguous read (potentially followed by strided copy in pure numpy), the performance of reading with step>1 directly from hdf5 is:

  • up to 12x slower, if a chunk fits in the cache (1MB by default)
  • up to 150x slower, for larger chunks

Master shows a 120x slowdown, regardless of chunk size, but only when reading from the virtual dataset (no changes).
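The workaround that master was inadvertently applying can be shown with plain numpy standing in for the on-disk chunk (in versioned-hdf5 the contiguous read would go through libhdf5, not numpy). The point is that the two paths produce identical results, so the strided selection can be done cheaply in memory after one contiguous bulk transfer.

```python
import numpy as np

# Stand-in for a 1 M-element chunk stored on disk.
chunk = np.arange(1_000_000, dtype=np.float64)

# Slow path: push the strided selection down into hdf5 (up to 12-150x slower
# in libhdf5, depending on chunk size vs cache size, per the numbers above).
strided_direct = chunk[::7].copy()

# Fast path (what master effectively did): one contiguous read of the whole
# chunk, then a strided copy in pure numpy.
whole = chunk[:].copy()          # contiguous bulk transfer
strided_in_memory = whole[::7]   # cheap strided selection in memory

# Both paths yield the same data.
assert np.array_equal(strided_direct, strided_in_memory)
```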

Implementing cache-on-read would effectively work around this issue, but it's not trivial given the current design (the easiest way to do it is to first move hashing into the StagedChangesArray, which makes a lot of sense in its own right anyway).

@crusaderky force-pushed the inmemorydataset_v3 branch 5 times, most recently from ae0d678 to 5d2f268 on October 18, 2024 16:05
@crusaderky crusaderky changed the title Redesign InMemoryDataset data model (take 2) Redesign InMemoryDataset data model (libhdf5 direct access) Oct 18, 2024
@crusaderky crusaderky self-assigned this Oct 20, 2024
@crusaderky crusaderky changed the title Redesign InMemoryDataset data model (libhdf5 direct access) [DNM] Redesign InMemoryDataset data model (libhdf5 direct access) Oct 20, 2024
@crusaderky force-pushed the inmemorydataset_v3 branch 7 times, most recently from a1fa1e6 to 12423e4 on October 22, 2024 09:44
@crusaderky (Collaborator, Author) commented:
@ArvidJB @peytondmurray This is ready to be played with.

The state is the same as it was in #370 when we abandoned it - the new PR passes all pre-existing tests, but more thorough tests still need to be added, so do expect bugs particularly on edge cases.

@crusaderky force-pushed the inmemorydataset_v3 branch 10 times, most recently from 8b42f49 to d526d72 on October 26, 2024 15:58
@crusaderky force-pushed the inmemorydataset_v3 branch 5 times, most recently from 2113ed6 to b1c56d8 on October 26, 2024 16:30