-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDF5-backed data structure #145
Comments
@teresamg, you've wrestled quite a bit with the DWI object -- what is your experience? |
@jhlegarreta & @esavary, do you have any experience with or opinion on this problem? |
I don't have much experience and no strong opinion, but is there a reason to not encapsulate the HDF5 backing inside the DWI object? |
Not really. This is why I mentioned in #98 that I was not sure anymore of how this should be implemented. We can merge #98 and roll that bit back afterward when we have a plan for this issue. I'd like to know Chris' and Ariel's opinions on #98 before we merge it though. |
I see the memory fingerprint concern that you mention, and I do see the HDF5 data manipulation bit that you point to as being a burden for users. I have used HDF5 files in the past inside some wrapper class to actually provide the data.
My fear there is to propose yet another format that has is difficult to maintain, difficult to use across all Also I do not know enough about memaps as another possibility. As for the PRs I do not know the code well enough so as to weigh whether they should be merged or held now: whatever is easiest to make progress, so probably merging and leaving the HDF5 issue to be fixed for whenever it really becomes an issue. If that time has come 🙃, then I'd need to have a better grasp of the whole picture to weigh in. Apologies for that. |
@effigies - we would like to pick your brain on this one. The design idea behind this is described in the "nipreps book": https://www.nipreps.org/nipreps-book/tutorial/intro.html#step-1-identify-an-i-o-inputs-outputs-specification We basically wanted to have a data structure that could hold all the relevant aspects of a DWI dataset in the context of the problem. That meant it should have easy ways of storing and accessing:
The rationale to use HDF5 was to allow Python to clean up hefty objects in memory while having an easy interface to the object. In my imagination, this should work like an overlay built-in in the data structure so that the user does not need to bother about HDF5 at all (it is just used in the background). Maybe there were better solutions like the ArrayProxy @jhlegarreta mentions above, or maybe we should be thinking about Zarr instead. I'm happy to go in any direction that makes eddymotion work better. Let us know what you think. |
Just adding that I've had some recent interesting positive experiences with Apache Arrow (see this for some examples/details). I am not sure that it's a great match for this use-case, but thought I'd mention this here as well, to add to the options we could consider. |
Eddymotion uses large datasets and continuously needs to update them in memory, keep a large list of transforms, etc.
For that reason, we created the DWI object (which we are now generalizing so the same approach can be applied to PET or to BOLD). The idea is to have HDF5 to help minimize the memory fingerprint without damaging I/O too much.
I guess we want to get some sort of "glorified"
ArrayProxy
(using nibabel's jargon) to the details around a DWI dataset (data themselves, NIfTI headers readily available to generate NIfTIs for third-party tools, gradients' table, brain mask, head-motion realignment matrices, fieldmap, etc.)Giving a common interface to all those aspects of the DWI dataset would definitely make development faster.
I would like to pick @effigies' and @arokem's brains and see how this would be implemented.
At the moment, what we have does not seem to achieve the flexibility and seamless operation that we want.
EDIT:
In #98, we are removing data splitting from the data structure object (which I think makes a lot of sense) and this problem has become apparent:
eddymotion/src/eddymotion/data/dmri.py
Lines 77 to 129 in c073fdf
In particular, my hesitations emerge from having to do this bit:
eddymotion/src/eddymotion/data/dmri.py
Lines 101 to 105 in c073fdf
every time from outside methods if we want to take advantage of the HDF5 backing.
EDIT 2:
In the long term, and if this vision works well, I can see this kind of "internal format" be propagated across nipreps and even fitlins.
WDYT?
The text was updated successfully, but these errors were encountered: