Store Julia objects as compound types #27

Closed
simonster opened this issue Jul 2, 2013 · 8 comments

Comments

@simonster
Member

At the moment, we can write, but not read immutables from JLD. While it would be pretty trivial to copy the code for creating new immutables from serialize.jl, I wonder if we can use compound types instead. It seems like there would be massive performance and disk space advantages to storing arrays of immutables contiguously on disk as opposed to using HDF5 references for each field, even if the on-disk representation isn't necessarily the same as the in-memory representation because of padding.
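To make the padding point concrete, here is a minimal sketch in current Julia syntax (not the 2013 syntax used elsewhere in this thread). `Sample`, `packed_size`, and `field_layout` are hypothetical names made up for illustration and are not part of JLD or HDF5.jl. It compares the size of an immutable's fields laid end to end, as a packed compound type could store them, with the padded size Julia uses in memory.

```julia
# Hypothetical immutable, used only for illustration (not from JLD).
struct Sample
    a::Int8
    b::Float64
    c::Int16
end

# Bytes needed if the fields are stored back to back with no padding.
packed_size(T) = sum(sizeof, fieldtypes(T))

# (name, in-memory byte offset, field size) for each field of T.
field_layout(T) = [(fieldname(T, i), Int(fieldoffset(T, i)), sizeof(fieldtype(T, i)))
                   for i in 1:fieldcount(T)]

println("packed:    ", packed_size(Sample), " bytes")
println("in memory: ", sizeof(Sample), " bytes")
for (name, offset, size) in field_layout(Sample)
    println("  ", name, " at offset ", offset, ", ", size, " bytes")
end
```

The in-memory size is at least the packed size. On a typical 64-bit system the difference is alignment padding after `a` and `c`, which is the gap between the on-disk and in-memory representations mentioned above.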

@ghost ghost assigned simonster Jul 2, 2013
@timholy
Member

timholy commented Jul 2, 2013

Agreed 100%. I was basically waiting for immutables to land before getting serious about Compound support, and never got back to it.

Because I don't need this right away I probably won't get to this immediately; feel free to tackle it, or I'll tackle it myself in a week or two.

@timholy
Member

timholy commented Jul 2, 2013

(I have some julia/Profile.jl bugs that need fixing first.)

@simonster
Member Author

I've started on this, but I'm still thinking of the best way to handle things. The first decision to be made is which types we store as compound types: only immutable bits types, all immutable types, or all Julia types.

There is an undeniable appeal to storing all Julia types as compound types. Reading/writing objects that contain bits type fields would be significantly faster, since those fields wouldn't need to be references. The differences between immutables and ordinary objects would just be in the way arrays are handled (i.e., as arrays of references or arrays of values). I think we could even reconstruct missing/changed types by dynamically generating a new type based on the compound type definition. The major downside is that we'd need to break compatibility with existing JLD files, leave around a method to read them, or create a converter.

I've also been thinking about how to efficiently convert HDF5 compound types to Julia types. "Efficiently" ideally means that, once the compound type is read into memory, we convert the compound type to a Julia type in place and avoid additional allocations. For immutable bits types and arrays thereof, this is easy, since we just need to add padding in the right places. It might even be possible to get the HDF5 library to perform this conversion for us. For normal Julia types, where arrays are stored as references, there isn't necessarily a big advantage to in-place conversion, since we'll never be converting very much data at a time, although we would avoid an allocation for each object. For arrays of immutable types with pointers, we would need to convert HDF5 object references to pointers in-place to avoid allocating a second buffer.

Allocating a second buffer isn't that bad, and to start with I'll probably just do this, but it limits the maximum size of an array of non-bits immutables to half of the system's available memory. It might be possible to avoid this in one of two ways. One is to give the HDF5 library custom conversion functions using H5Tregister that would perform in-place conversion of objects with object references. The other is to read from the HDF5 file directly from Julia (although I'm not sure how to make this work for chunked datasets).
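As a rough illustration of "adding padding in the right places" for a bits type, the sketch below copies one packed record into a buffer with Julia's in-memory layout and loads the value from it. It is written in current Julia syntax, uses a second buffer (the simple approach, not the in-place one), and assumes every field is a primitive bits type in native byte order. `Reading` and `unpack_record` are hypothetical names, not JLD or HDF5.jl API.

```julia
# Hypothetical bits type, used only for illustration.
struct Reading
    id::Int8
    value::Float64
end

# Copy one packed record (fields back to back, no padding) into a buffer
# laid out the way Julia stores T in memory, then load a T from that buffer.
# Assumes T is a bits type whose fields are all primitive bits types.
function unpack_record(::Type{T}, packed::AbstractVector{UInt8}) where {T}
    isbitstype(T) || throw(ArgumentError("$T is not a bits type"))
    padded = zeros(UInt8, sizeof(T))
    src = 1
    for i in 1:fieldcount(T)
        n = sizeof(fieldtype(T, i))
        dst = Int(fieldoffset(T, i)) + 1   # fieldoffset is 0-based
        copyto!(padded, dst, packed, src, n)
        src += n
    end
    # Keep `padded` alive while reading through the raw pointer.
    return GC.@preserve padded unsafe_load(convert(Ptr{T}, pointer(padded)))
end

# Build a packed record by hand: 1 byte for `id`, then 8 bytes for `value`.
io = IOBuffer()
write(io, Int8(7))
write(io, 2.5)
packed = take!(io)

r = unpack_record(Reading, packed)
println(r)
```

An in-place variant would need to move fields within a single buffer, working from the last field back to the first so that no bytes are overwritten before they are copied.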

@timholy
Member

timholy commented Jul 9, 2013

I haven't thought about this in ages, but how would one handle a type declaration like

type MyType
    x::Real
end

and for an array of them, some are Float64 and others Int16?

@simonster
Member Author

We'd store anything that's not a bits type as a reference in the compound type, effectively mirroring the way Julia stores types in memory. If we have to reconstruct the type from the compound type definition because the Julia type changed or no longer exists, we'd just leave reference fields untyped.
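The rule described here can be sketched as a per-field classification. This uses current Julia syntax, where the `type MyType` declaration from the question is spelled `mutable struct MyType`. `Mixed` and `storage_plan` are hypothetical names for illustration and do not come from JLD.

```julia
# The example from the question, in current syntax.
mutable struct MyType
    x::Real
end

# Hypothetical immutable mixing a bits-type field with non-bits fields.
struct Mixed
    n::Int16
    r::Real
    v::Vector{Float64}
end

# Bits-type fields could be stored inline in the compound type;
# every other field would be stored as a reference.
storage_plan(T) = [fieldname(T, i) => (isbitstype(fieldtype(T, i)) ? :inline : :reference)
                   for i in 1:fieldcount(T)]

println(storage_plan(MyType))
println(storage_plan(Mixed))
```

Because `x::Real` is abstractly typed, it is classified as a reference, so an array of `MyType` can hold `Float64` and `Int16` values side by side, each behind its own reference.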

@timholy
Member

timholy commented Jul 10, 2013

That seems very reasonable.

Overall I think this sounds like a great plan. As much as it pains me to break JLD compatibility, I think the reality is that these files are not yet in heavy use, but probably will be some day (I'm just starting to make use of them in practice in my own work). So now is the time for breakage if there ever is. Moreover, since there is a version number, in principle we have all the information we need. The "converter," if we need one, could even be a current snapshot of jld.jl, with a different module name (e.g., JLD01).

As far as efficiency goes, presumably there are places where it matters and places where it doesn't (since IO is expected to be somewhat limiting). To me it seems that you have a great plan. I agree that, ultimately, we could probably get the HDF5 library to insert padding etc. for us, but also that such optimizations can come second if they're nontrivial to get working.

@simonster
Member Author

Making progress: https://gist.github.com/simonster/50d282a533a76eaebbb3

Next step: reading it out.

@timholy
Member

timholy commented Aug 10, 2014

Oo ooh! Very nice! I'm really excited about this.

Aside from my Vector{Vector{T}} blunder (#123), the major changes that I expect you must be making here were another motivation for the JLDArchives package. Always nice to have an easy way to run tests to see what has been broken.
