[DNM] Redesign InMemoryDataset data model (libhdf5 direct access) #386
Draft
crusaderky wants to merge 8 commits into deshaw:master from crusaderky:inmemorydataset_v3
Conversation
crusaderky force-pushed the inmemorydataset_v3 branch 5 times, most recently from ae0d678 to 5d2f268 on October 18, 2024 16:05
crusaderky changed the title from "Redesign InMemoryDataset data model (take 2)" to "Redesign InMemoryDataset data model (libhdf5 direct access)" on Oct 18, 2024
This was referenced Oct 18, 2024
crusaderky force-pushed the inmemorydataset_v3 branch from 4553a66 to 05c5aa7 on October 20, 2024 16:11
crusaderky force-pushed the inmemorydataset_v3 branch from 05c5aa7 to 018413b on October 20, 2024 17:03
crusaderky force-pushed the inmemorydataset_v3 branch from 018413b to e2bdbde on October 20, 2024 17:14
crusaderky force-pushed the inmemorydataset_v3 branch from e2bdbde to cabe4bd on October 20, 2024 17:30
crusaderky changed the title from "Redesign InMemoryDataset data model (libhdf5 direct access)" to "[DNM] Redesign InMemoryDataset data model (libhdf5 direct access)" on Oct 20, 2024
crusaderky force-pushed the inmemorydataset_v3 branch 7 times, most recently from a1fa1e6 to 12423e4 on October 22, 2024 09:44
@ArvidJB @peytondmurray This is ready to be played with. The state is the same as it was in #370 when we abandoned it - the new PR passes all pre-existing tests, but more thorough tests still need to be added, so do expect bugs, particularly on edge cases.
crusaderky force-pushed the inmemorydataset_v3 branch from 12423e4 to 3c30121 on October 22, 2024 16:02
crusaderky force-pushed the inmemorydataset_v3 branch 10 times, most recently from 8b42f49 to d526d72 on October 26, 2024 15:58
crusaderky force-pushed the inmemorydataset_v3 branch 5 times, most recently from 2113ed6 to b1c56d8 on October 26, 2024 16:30
crusaderky force-pushed the inmemorydataset_v3 branch from b1c56d8 to 9d411cd on October 26, 2024 16:36
crusaderky force-pushed the inmemorydataset_v3 branch 4 times, most recently from 6caaa87 to d47e9dd on October 26, 2024 23:59
crusaderky force-pushed the inmemorydataset_v3 branch from d47e9dd to 075f2b3 on October 27, 2024 00:59
This PR is still missing some components - see checklist below.

versioned_hdf5.wrappers.InMemoryDataset is plagued by slowness caused by the control code, both because the control code itself is very slow and because it calls h5py.Dataset.__getitem__ on the raw_data independently for every single chunk.

This PR completely replaces the internal implementation of InMemoryDataset by instead performing a fast inner loop of calls to H5Dread, directly against the libhdf5 C library. The control logic - the generation of the index information needed to perform the transfers, which was previously performed by build_data_dict and as_subchunk_map - has also been completely rewritten and is now much more aggressively cythonized.
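As a rough illustration of what one such direct transfer looks like through h5py's low-level bindings (file and dataset names here are hypothetical, and the actual PR drives the equivalent calls from Cython):

```python
import numpy as np
import h5py

# Hypothetical file/dataset names. One hyperslab transfer issued as a single
# H5Dread call, bypassing h5py.Dataset.__getitem__ entirely.
with h5py.File("data.h5", "r") as f:
    ds = f["raw_data"]                       # chunked 1-D dataset
    out = np.empty((4,), dtype=ds.dtype)

    fspace = ds.id.get_space()               # file dataspace
    fspace.select_hyperslab((8,), (4,))      # 4 elements starting at offset 8
    mspace = h5py.h5s.create_simple((4,))    # memory dataspace matching `out`

    ds.id.read(mspace, fspace, out)          # one H5Dread straight into the buffer
```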
This new design introduces a helper class, StagedChangesArray, around which InMemoryDataset is now a fairly thin wrapper, and which holds the modified chunks. This class has a minimal external API and is completely abstracted from h5py. Internally, all the control logic is encapsulated in a series of *Plan classes (GetItemPlan, SetItemPlan, etc.), each formulating what actions need to be performed in order to satisfy a user request (__getitem__, __setitem__, etc.). On each user request, StagedChangesArray formulates a plan and then executes it.
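To make the plan/execute split concrete, here is a heavily simplified, hypothetical sketch (1-D array, basic slice indexing only); the class and method names below are illustrative stand-ins, not the actual versioned_hdf5 implementation:

```python
import numpy as np


class GetItemPlan:
    """Pure control logic: which chunks to touch and which slices to transfer."""

    def __init__(self, length: int, chunk_size: int, start: int, stop: int):
        self.out_shape = max(stop - start, 0)
        # (chunk index, slice within the chunk, slice within the output)
        self.transfers: list[tuple[int, slice, slice]] = []
        for c in range(start // chunk_size, max((stop - 1) // chunk_size + 1, 0)):
            lo = max(start, c * chunk_size)
            hi = min(stop, (c + 1) * chunk_size, length)
            src = slice(lo - c * chunk_size, hi - c * chunk_size)
            dst = slice(lo - start, hi - start)
            self.transfers.append((c, src, dst))


class StagedArray1D:
    """Staged (modified) chunks live in memory; everything else stays on 'disk'."""

    def __init__(self, base: np.ndarray, chunk_size: int):
        self.base = base                         # stands in for the on-disk raw data
        self.chunk_size = chunk_size
        self.staged: dict[int, np.ndarray] = {}  # chunk index -> in-memory chunk

    def __getitem__(self, idx: slice) -> np.ndarray:
        start, stop, _ = idx.indices(len(self.base))
        plan = GetItemPlan(len(self.base), self.chunk_size, start, stop)  # formulate
        out = np.empty(plan.out_shape, dtype=self.base.dtype)
        for c, src, dst in plan.transfers:                                # execute
            chunk = self.staged.get(c)
            if chunk is None:  # unmodified chunk: read from the base storage
                chunk = self.base[c * self.chunk_size:(c + 1) * self.chunk_size]
            out[dst] = chunk[src]
        return out


a = StagedArray1D(np.arange(10), chunk_size=4)
a.staged[1] = np.full(4, -1)   # pretend the user modified chunk 1
print(a[2:9])                  # [ 2  3 -1 -1 -1 -1  8]
```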
Other major design changes

In master, __setitem__ first loads all impacted chunks into memory, on the off chance that they are not wholly covered by the parameter index, and then updates them. In this PR, this happens exclusively for the chunks that are not wholly covered by the index of __setitem__. In other words, a[:] = 1 in master loads the whole dataset into memory if it's not in memory already, whereas in this PR it never touches the disk.
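For example, a whole-dataset overwrite staged through the public versioned_hdf5 API might look like this (file, version, and dataset names are hypothetical):

```python
import h5py
from versioned_hdf5 import VersionedHDF5File

# Hypothetical file/version/dataset names. Overwriting chunks in full stages
# fresh chunks directly, without reading the old ones back from disk first.
with h5py.File("data.h5", "r+") as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version("r2") as sv:
        sv["values"][:] = 1   # wholly-covered chunks never touch the disk
```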
Analogously, in master resize() reads the whole dataset from disk. This no longer happens in this PR, which only loads the edge chunks if the shape is not exactly divisible by chunk_size.

There is no longer any direct __getitem__ call to the virtual dataset. It was found to be exceptionally slow, due to deficiencies in the algorithm in the libhdf5 C library itself, and has been ditched. A previous redesign, #370, had to be scrapped due to this limitation. It remains possible to obtain a plain-h5py virtual dataset by accessing the address _version_data/versions/<version name>/<dataset name>. Notably, this can be done in any language that offers libhdf5 bindings, not just Python.

There is no longer cache-on-read: any and all caching is performed by libhdf5; the only data held in memory within the StagedChangesArray are the chunks that the user intentionally modified.
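As an illustration of the plain-h5py access path mentioned above (file, version, and dataset names are hypothetical):

```python
import h5py

# Hypothetical file/version/dataset names. The committed version is read back
# through libhdf5's virtual dataset layer, with no versioned_hdf5 code involved.
with h5py.File("data.h5", "r") as f:
    vds = f["_version_data/versions/r1/values"]
    data = vds[:]
```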
Status
Demo and benchmarks
See the Jupyter notebooks (not to be merged into master).
Subordinate PRs
This is a gigantic change, in excess of 6000 lines.
For the sake of sanity, it has been broken down into multiple PRs:
build_data_dict and as_subchunk_map
Known issues
Slices with step>1, as well as fancy indices which can be locally represented as such, can be slower than in master. The reason is that they're very slow in libhdf5, whereas master was (perhaps inadvertently) side-stepping the issue by loading whole chunks and then applying the strided slice to the numpy chunks in memory.
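For reference, a minimal sketch of the two read strategies (dataset name is hypothetical); master effectively relied on the second one by accident:

```python
import numpy as np
import h5py

# Hypothetical chunked 1-D dataset. Both reads return the same values, but the
# first pushes the strided selection down into libhdf5, while the second does a
# contiguous read followed by a strided slice in numpy.
with h5py.File("data.h5", "r") as f:
    ds = f["values"]
    strided = ds[::2]        # strided hyperslab evaluated by libhdf5
    contiguous = ds[:][::2]  # contiguous read, then strided slice in numpy
    assert np.array_equal(strided, contiguous)
```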
This is, technically, a regression only when calling __getitem__ after __setitem__. In master, if you call __getitem__ on an unmodified dataset, you access the virtual dataset, which is even slower, as the issue with step>1 interplays with the libhdf5 cache size and the whole virtual dataset is always larger than the hdf5 cache.

This issue is mitigated, but not solved, by making sure that the chunk cache is larger than a single chunk.
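With h5py, this amounts to raising the raw-data chunk cache when opening the file; the numbers below are purely illustrative:

```python
import h5py

# Illustrative values: raise the raw-data chunk cache well above one chunk.
f = h5py.File(
    "data.h5",
    "r",
    rdcc_nbytes=64 * 1024**2,  # 64 MiB chunk cache (h5py's default is 1 MiB)
    rdcc_nslots=100_003,       # hash-table slots; HDF5 docs suggest a prime
)
```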
Compared to a single contiguous read (potentially followed by strided copy in pure numpy), the performance of reading with step>1 directly from hdf5 is:
Master shows a 120x slowdown, regardless of chunk size, but only when reading from the virtual dataset (no changes).
Implementing cache-on-read would effectively work around this issue, but it's not trivial given the current design (the easiest way to do it is to first move hashing into the StagedChangesArray, which makes a lot of sense in its own right anyway).