[DNM] Redesign InMemoryDataset data model (libhdf5 direct access) #386

Draft · wants to merge 8 commits into master from inmemorydataset_v3
Conversation

@crusaderky (Collaborator) commented Oct 17, 2024

This PR is still missing some components - see checklist below.

versioned_hdf5.wrappers.InMemoryDataset is plagued by slow control code: the control logic itself is very slow, and it calls h5py.Dataset.__getitem__ on the raw_data independently for every single chunk.

This PR completely replaces the internal implementation of InMemoryDataset: it now performs a fast inner loop of H5Dread calls directly against the libhdf5 C library. The control logic - the generation of the index information needed to perform the transfers, previously handled by build_data_dict and as_subchunk_map - has also been completely rewritten and is now much more aggressively cythonized.

This new design introduces a helper class, StagedChangesArray, around which InMemoryDataset is now a fairly thin wrapper and which holds the modified chunks. This class has a minimal external API and is completely abstracted from h5py. Internally, all the control logic is encapsulated in a series of *Plan classes (GetItemPlan, SetItemPlan, etc.), each formulating what actions need to be performed in order to fulfil a user request (__getitem__, __setitem__, etc.). On each user request,

  1. StagedChangesArray formulates a plan;
  2. executes it, possibly mutating its internal state;
  3. discards the plan.
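The plan/execute/discard cycle above can be sketched as follows. This is an illustrative reconstruction, not the actual versioned-hdf5 code: the method names and the `transfers` layout are assumptions, and the real control logic is cythonized rather than plain Python.

```python
# Hypothetical sketch of the plan/execute/discard pattern.
# GetItemPlan mirrors the *Plan classes mentioned above; the internals of
# _getitem_plan and _execute are placeholders for the cythonized control logic.
from dataclasses import dataclass, field


@dataclass
class GetItemPlan:
    """Index information for the chunks a __getitem__ call must touch."""
    # Assumed layout: (chunk_id, source_slice, destination_slice) triples.
    transfers: list = field(default_factory=list)


class StagedChangesArray:
    def __init__(self):
        # chunk_id -> in-memory ndarray, holding only user-modified chunks
        self.staged_chunks = {}

    def __getitem__(self, index):
        plan = self._getitem_plan(index)  # 1. formulate a plan
        out = self._execute(plan)         # 2. execute it (may mutate state)
        del plan                          # 3. discard the plan
        return out

    def _getitem_plan(self, index):
        plan = GetItemPlan()
        # ...control logic would populate plan.transfers here...
        return plan

    def _execute(self, plan):
        # ...fast inner loop over plan.transfers, serving staged chunks from
        # memory and reading unstaged chunks straight from storage...
        return None
```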

Other major design changes

In master, __setitem__ first loads all impacted chunks into memory, on the off chance that they are not wholly covered by the parameter index, and then updates them. In this PR, only the chunks that are not wholly covered by the __setitem__ index are loaded.
In other words, a[:] = 1 in master loads the whole dataset into memory if it's not in memory already, whereas in this PR it never touches the disk.
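The coverage test can be illustrated with a minimal 1-D sketch. The function below is a hypothetical helper, not part of the PR; it only shows the decision of whether a chunk needs a read-modify-write cycle.

```python
def chunk_wholly_covered(chunk_start, chunk_stop, sel_start, sel_stop):
    """True if the basic slice [sel_start:sel_stop] (step 1) covers the whole
    chunk [chunk_start:chunk_stop), so __setitem__ may overwrite the chunk
    without first loading it from disk."""
    return sel_start <= chunk_start and sel_stop >= chunk_stop


# a[:] = 1 on a dataset of length 100 with chunk size 30: every chunk,
# including the partial edge chunk [90:100), is wholly covered, so nothing
# is read from disk.
assert all(
    chunk_wholly_covered(start, min(start + 30, 100), 0, 100)
    for start in range(0, 100, 30)
)

# a[5:95] = 1: the first and last chunks are only partially covered and must
# be loaded before being updated; the middle chunks are overwritten blindly.
assert not chunk_wholly_covered(0, 30, 5, 95)
assert chunk_wholly_covered(30, 60, 5, 95)
```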

Analogously, in master resize() reads the whole dataset from disk. In this PR it only loads the edge chunks, and only when the shape is not exactly divisible by the chunk size.
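A 1-D sketch of the edge-chunk rule, with a hypothetical helper (not the PR's actual code): on resize, at most one chunk per axis straddles the resize boundary and needs to be read, and none at all when the boundary falls exactly on a chunk edge.

```python
def edge_chunk_needs_load(old_size, new_size, chunk_size):
    """1-D sketch: return the index of the single chunk that a resize must
    read from disk, or None if the resize boundary falls exactly on a chunk
    edge (shape divisible by chunk_size: no disk access at all)."""
    boundary = min(old_size, new_size)
    if boundary % chunk_size == 0:
        return None
    return boundary // chunk_size


# Shrinking 100 -> 85 with chunks of 30: chunk 2, i.e. [60:90), becomes
# partial at the new boundary and must be loaded before truncation.
assert edge_chunk_needs_load(100, 85, 30) == 2

# Growing 100 -> 130: the old partial edge chunk [90:100) must be loaded
# so it can be padded out to the new shape.
assert edge_chunk_needs_load(100, 130, 30) == 3

# Growing 90 -> 120: both shapes are exact multiples of 30, so resize never
# touches the disk.
assert edge_chunk_needs_load(90, 120, 30) is None
```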

There is no longer any direct __getitem__ call to the virtual dataset. It was found to be exceptionally slow, due to deficiencies in the algorithm inside the libhdf5 C library itself, and has been ditched.
A previous redesign, #370, had to be scrapped because of this limitation.
It remains possible to obtain a plain-h5py virtual dataset at the path
_version_data/versions/<version name>/<dataset name>. Notably, this can be done from any language that offers libhdf5 bindings, not just Python.

There is no longer cache-on-read: any and all caching is performed by libhdf5; the only data held in memory by the StagedChangesArray are the chunks that the user intentionally modified.

Status

  • implementation
  • all pre-existing unit tests pass
  • additional unit tests for the various subsystems (slicetools.pyx, subchunk_map.py)
  • additional unit tests for staged_changes.py
  • high level documentation
  • break PR up into stages to simplify code review
  • clean up legacy code
  • ad-hoc performance benchmarks
  • review published benchmarks
  • demo notebooks

Demo and benchmarks

See jupyter notebooks (not to be merged into master)

Subordinate PRs

This is a gigantic change, in excess of 6000 lines.
For the sake of sanity, it has been broken down into multiple PRs:

Known issues

Slices with step>1, as well as fancy indices that can be locally represented as such, can be slower than in master. The reason is that they are very slow in libhdf5, whereas master was (inadvertently?) side-stepping the issue by loading whole chunks and then applying the strided slice to the numpy chunks in memory.

Technically, this is a regression only when calling __getitem__ after __setitem__. In master, calling __getitem__ on an unmodified dataset goes through the virtual dataset, which is even slower: the step>1 issue interplays with the libhdf5 cache size, and the whole virtual dataset is always larger than the hdf5 cache.

This issue is mitigated, but not solved, by making sure that the chunk cache is larger than a single chunk.

Compared to a single contiguous read (potentially followed by strided copy in pure numpy), the performance of reading with step>1 directly from hdf5 is:

  • up to 12x slower, if a chunk fits in the cache (1MB by default)
  • up to 150x slower, for larger chunks

Master shows a 120x slowdown, regardless of chunk size, but only when reading from the virtual dataset (no changes).
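The workaround that master was inadvertently applying can be shown with plain numpy standing in for the on-disk chunk (in versioned-hdf5 the contiguous read would go through libhdf5, not numpy). The point is that the two paths produce identical results, so the strided selection can be done cheaply in memory after one contiguous bulk transfer.

```python
import numpy as np

# Stand-in for a 1 M-element chunk stored on disk.
chunk = np.arange(1_000_000, dtype=np.float64)

# Slow path: push the strided selection down into hdf5 (up to 12-150x slower
# in libhdf5, depending on chunk size vs cache size, per the numbers above).
strided_direct = chunk[::7].copy()

# Fast path (what master effectively did): one contiguous read of the whole
# chunk, then a strided copy in pure numpy.
whole = chunk[:].copy()          # contiguous bulk transfer
strided_in_memory = whole[::7]   # cheap strided selection in memory

# Both paths yield the same data.
assert np.array_equal(strided_direct, strided_in_memory)
```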

Implementing cache-on-read would effectively work around this issue, but it's not trivial given the current design (the easiest way to do it is to first move hashing into the StagedChangesArray, which makes a lot of sense in its own right anyway).

@crusaderky force-pushed the inmemorydataset_v3 branch 5 times, most recently from ae0d678 to 5d2f268 on October 18, 2024 16:05
@crusaderky crusaderky changed the title Redesign InMemoryDataset data model (take 2) Redesign InMemoryDataset data model (libhdf5 direct access) Oct 18, 2024
@crusaderky crusaderky self-assigned this Oct 20, 2024
@crusaderky crusaderky changed the title Redesign InMemoryDataset data model (libhdf5 direct access) [DNM] Redesign InMemoryDataset data model (libhdf5 direct access) Oct 20, 2024
@crusaderky force-pushed the inmemorydataset_v3 branch 7 times, most recently from a1fa1e6 to 12423e4 on October 22, 2024 09:44
@crusaderky (Collaborator, Author) commented:
@ArvidJB @peytondmurray This is ready to be played with.

The state is the same as it was in #370 when we abandoned it - the new PR passes all pre-existing tests, but more thorough tests still need to be added, so do expect bugs particularly on edge cases.

@crusaderky force-pushed the inmemorydataset_v3 branch 10 times, most recently from 8b42f49 to d526d72 on October 26, 2024 15:58
@crusaderky force-pushed the inmemorydataset_v3 branch 5 times, most recently from 2113ed6 to b1c56d8 on October 26, 2024 16:30