serialization / deserialization performance #7893

Closed · jakebolewski opened this issue Aug 7, 2014 · 77 comments
Labels: performance (Must go faster)

Comments

@jakebolewski

Julia's serialization / deserialization performance is pretty worrying. Deserializing a ~500 MB DataFrame object takes about 5-6 minutes. An 8 GB DataFrame object takes 3+ hours (it has not finished yet). It looks like almost all of the time is spent in GC. This makes it impractical to do any sort of parallel operations on DataFrame objects without dumping the DataFrame out to a CSV file and reading the contents back into each process.

julia> labevents = open(deserialize, "labevents.jld");

shell> du -h labevents.jld
475M    labevents.jld

julia> @time labevents = open(deserialize, "labevents.jld");
elapsed time: 337.729652605 seconds (4366175664 bytes allocated, 85.11% gc time)

@jiahao said serializing the 8GB file took ~5 hours.

@Keno commented Aug 7, 2014

You don't want to use serialize for this. Try the HDF5 package.
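
A minimal round-trip sketch (assuming the JLD layer bundled with the HDF5 package at the time; labevents and the file name are reused from the report above for illustration):

using HDF5, JLD   # JLD is the Julia-aware layer on top of HDF5

# Write the DataFrame under a named dataset, then read it back.
jldopen("labevents.h5", "w") do file
    write(file, "labevents", labevents)
end

labevents2 = jldopen("labevents.h5", "r") do file
    read(file, "labevents")
end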

@JeffBezanson

Yes, this is bad, but operating on something in parallel by copying the entire dataset around is not going to work anyway. Even if the serializer works at the speed of the network, you're not likely to recoup the cost.

@jiahao commented Aug 7, 2014

Looks like you can't save a DataFrame directly to HDF5; see JuliaData/DataFrames.jl#64

Seriously, guys, what is the point of having a serialization format that is hundreds of times slower than CSV? That's the real issue here.

@Keno commented Aug 7, 2014

I'm pretty sure that is false. I have saved DataFrames to HDF5 plenty of times. Please have a look at the HDF5 package. The serializer is also not intended as a storage format, and it is not stable between versions of Julia.

@JeffBezanson

The point of it is to support arbitrary julia objects.

When we find something that's very slow, we investigate and try to speed it up. Assuming that somebody maliciously intends for it to be slow is totally inappropriate and makes no sense.

@timholy commented Aug 7, 2014

Yes, HDF5 (really JLD) even includes a test for DataFrames. (It only runs if DataFrames is installed; I just added a test/REQUIRE file to make sure Iain runs it.)

That said, I certainly support improving serializer performance.

@JeffBezanson

Also, it would help a lot to know what is in this DataFrame. Could we have the types of the columns?

@quinnj commented Aug 7, 2014

@jiahao commented Aug 7, 2014

It was not my intent to be accusatory. It's just rather frustrating that, having tried a few different ways to read datasets into Julia, it seems like the fastest way is to read the whole thing in as a CSV file. (@jakebolewski just tried the HDF5 option; while it works, the speed is comparable to the serializer. Thanks for the correction, @Keno and @timholy.)

@JeffBezanson the performance we're seeing doesn't seem to vary too much across the various tables in this dataset - they're all DataArrays of one sort or another. This particular dataset's SQL signature is
int(11), int(11), int(11), int(11), datetime, varchar(100), valuenum, varchar(10), varchar(10).

@jakebolewski

@quinnj I have not tried protobufs. I've only skimmed the package; is there a way to easily generate a .proto file from a DataFrame object? Some potential problems are handling missing data and re-encoding all string data to UTF-8; ODBC returns UTF-16 encoded strings by default.

@jakebolewski

@JeffBezanson the serialized dataframe files are in the folder /home/jake/projects/mimic/db_dataframes on julia.mit.edu if you want to test them out.

@JeffBezanson

Thanks, I will take a look.

@JeffBezanson

OK, I have a strong lead here: reading ioevents.jld took 91 seconds, 48% GC time. Reading a_chartdurations.jld, which is nearly 1/10th the size, took 0.35 seconds. This is basically O(n²) scaling. Then I removed all string columns from ioevents, and saving and loading that took 2 seconds or less, eliminating the scaling problem.

So the entire problem appears to be strings.
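
A rough way to reproduce the scaling (hypothetical micro-benchmark, not the files above; serialize/deserialize were in Base at the time, while on modern Julia you would add `using Serialization` first):

# Time deserialization of string vectors at a few sizes; O(n^2)
# behavior shows up as ~100x the time for 10x the data.
function roundtrip_time(n)
    v = [string("value ", i) for i in 1:n]
    io = IOBuffer()
    serialize(io, v)
    seekstart(io)
    return @elapsed deserialize(io)
end

for n in (10^4, 10^5, 10^6)
    println(n, " strings => ", roundtrip_time(n), " s")
end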

@jakebolewski

Makes sense that it is O(n²).
Some more performance numbers for completeness.

julia> @time writetable("labevents.csv", labevents)
elapsed time: 382.450579558 seconds (5702565828 bytes allocated, 92.74% gc time)

shell> head labevents.csv
"subject_id","hadm_id","icustay_id","itemid","charttime","value","valuenum","flag","valueuom"
3,NA,NA,50112,"2682-08-22T16:24:00","121",121.0,"abnormal","mg/dL"
3,NA,NA,50428,"2682-08-22T16:24:00","261",261.0,"NA","K/uL"
3,NA,NA,50443,"2682-08-22T16:24:00","NORMAL",NA,"NA","NA"
3,NA,NA,50415,"2682-08-22T16:24:00","NORMAL",NA,"NA","NA"
3,NA,NA,50409,"2682-08-22T16:24:00","NORMAL",NA,"NA","NA"
3,NA,NA,50431,"2682-08-22T16:24:00","NORMAL",NA,"NA","NA"
3,NA,NA,50326,"2682-08-22T16:24:00","NORMAL",NA,"NA","NA"
3,NA,NA,50396,"2682-08-22T16:24:00","NORMAL",NA,"NA","NA"
3,NA,NA,50333,"2682-08-22T16:24:00","0.5",0.5,"NA","%"

julia> @time readtable("labevents.csv")
elapsed time: 46.019874084 seconds (5823425276 bytes allocated, 40.02% gc time)

Timing the same IO code in Python using pandas:

In [17]: %time df = pandas.io.parsers.read_csv("labevents.csv")
CPU times: user 5.88 s, sys: 430 ms, total: 6.31 s
Wall time: 6.33 s

In [18]: %time df.to_pickle("labevents.pkl")
CPU times: user 4.28 s, sys: 508 ms, total: 4.79 s
Wall time: 4.79 s

In [25]: %time pandas.read_pickle("labevents.pkl")
CPU times: user 4.02 s, sys: 740 ms, total: 4.76 s
Wall time: 4.76 s

# Using the pickler written in pure Python (no C extensions)
In [37]: %time pickle.dump(df, fh)
CPU times: user 1min 10s, sys: 1.97 s, total: 1min 12s
Wall time: 1min 12s

In [43]: %time df2 = pickle.load(fh)
CPU times: user 30.9 s, sys: 2.16 s, total: 33.1 s
Wall time: 33.1 s

Comparing file sizes:

-rw-r--r-- 1  jake jake 243M Aug  7 15:02 labevents.csv
-rw-r--r-- 1  jake jake 475M Aug  7 10:33 labevents.jld
-rw-rw-r-- 1 jake jake 259M Aug  7 15:19 labevents.pkl

@jakebolewski

The original 8 GB file finished deserializing:

julia> @time chartevents = open("chartevents.jld", "r") do io
           deserialize(io)
       end
elapsed time: 20390.763780769 seconds (80046322888 bytes allocated, 95.33% gc time)

@timholy I'm finding that saving some of the smaller DataFrames to HDF5 takes much longer than just serializing the object directly to disk.

@timholy commented Aug 7, 2014

It's quite possible: HDF5 has to write a lot of annotation to store the type in a "readable" way, whereas the serializer just uses a couple of bytes to indicate the type.

In my experience JLD is ~5x faster than CSV. I bet it depends a lot on the contents, though. If you can post the dataframe somewhere, I can take a look.

@JeffBezanson

For numeric data, that would be expected since we write binary. Here, it's all about the strings.

@simonster

I've played around a little and it looks like some of the bottlenecks in writetable are actually in Base. One is show(io, Int), which calls dec, which allocates a new string. Another is print_escaped. Together these two functions account for more than half of the time on my test data set with similar data types.
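
A rough illustration of the show(io, Int) cost (hypothetical micro-benchmark, reflecting the 0.3-era behavior described above):

io = IOBuffer()

# Each print of an Int goes through dec(), which builds a fresh String
# per call, so this loop allocates on the order of 10^6 short strings.
@time for i in 1:10^6
    print(io, i)
end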

@JeffBezanson

I'm working on some more small but worthwhile improvements.

This is a thorn in my side:
https://github.com/JuliaLang/julia/blob/master/base/serialize.jl#L296

That integer is totally redundant and AFAICT not used at all. It's obviously not a huge bottleneck, but it would be nice to eliminate given that it's totally unused. However it doesn't appear possible to remove it in a way that would allow old files to be read correctly.

@catawbasam

0.3 release seems like a reasonable time to update the serialize format.
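
One conventional way to allow that, sketched here purely as a hypothetical (Base's serializer had no version header at the time, and the names below are invented), is a leading format-version byte that readers dispatch on:

# serialize/deserialize were in Base in 0.3; on modern Julia,
# add `using Serialization` first.
const SER_FORMAT_VERSION = 0x01

function write_versioned(io::IO, x)
    write(io, SER_FORMAT_VERSION)
    serialize(io, x)
end

function read_versioned(io::IO)
    v = read(io, UInt8)
    v == SER_FORMAT_VERSION || error("unsupported serializer format version $v")
    return deserialize(io)
end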

JeffBezanson added a commit that referenced this issue Aug 7, 2014:

> mostly this avoids memory allocation from Module fullname()s, and
> boxing Ints.
>
> this is backwards-compatible for *reading* old files, but written
> data will not be usable by older julias.
@StefanKarpinski

> This is a thorn in my side:
> https://github.com/JuliaLang/julia/blob/master/base/serialize.jl#L296
>
> That integer is totally redundant and AFAICT not used at all. It's obviously not a huge bottleneck, but it would be nice to eliminate given that it's totally unused. However it doesn't appear possible to remove it in a way that would allow old files to be read correctly.

Dare I ask why it's there in the first place?

@JeffBezanson

I have no idea. It looks like @timholy added it over 2 years ago.

@Keno commented Aug 7, 2014

@JeffBezanson When linking to lines, please make sure you include the commit hash. You can simply press y after highlighting the line and it will change the link appropriately, e.g.:

Look here

@timholy commented Aug 7, 2014

I have no idea either. Looks like it was motivated by Matlab compatibility back when I was interested in using Julia via Matlab, perhaps because Julia has intrinsic information about types in Base, but Matlab doesn't? Not sure. Anyway, sorry about the thorn. Do what you gotta do.

@JeffBezanson

Here's where we are. Timings when this issue was filed:

julia> @time d = open(deserialize,"ioevents.jld");
elapsed time: 89.205890773 seconds (3427567732 bytes allocated, 49.20% gc time)

julia> @time serialize(f, d);
elapsed time: 89.89731217 seconds (1770653244 bytes allocated, 55.54% gc time)
# file size = 450727091 b

After my improvements plus removing the extra integer:

# same file, but re-written by new serializer
julia> @time d = open(deserialize,"/home/jeffb/ioe.jld");
elapsed time: 35.348819813 seconds (2320426896 bytes allocated, 4.91% gc time)

julia> @time serialize(f, d);
elapsed time: 51.078912958 seconds  (978248412 bytes allocated, 52.07% gc time)
# file size = 430956854 b

However, I feel bad about breaking read-compatibility. @timholy, as somebody who uses JLD, do you really think this is OK?

@Keno commented Aug 7, 2014

I think there is some confusion going on here. .jld is the extension of the HDF5 package dumper, isn't it?

@timholy commented Aug 8, 2014

If it's an immutable, we may want JuliaIO/HDF5.jl#27 sooner rather than later. CC @simonster.

@simonster

I'll work on that this weekend.

@timholy commented Aug 8, 2014

That would be pretty awesome.

@IainNZ commented Aug 8, 2014

Did something break?
burrowsa/RunTests.jl#2

Just found this

@jakebolewski

Yes, it looks like the performance changes were backwards-incompatible. If I attempt to open any of the data from yesterday, it does not work.

julia> open(deserialize, "a_chartdurations.jls")
ERROR: type: setfield!: expected Array{Any,1}, got Type{()}
 in deserialize at serialize.jl:565
 in deserialize at serialize.jl:520
 in handle_deserialize at serialize.jl:351
 in deserialize at serialize.jl:334
 in open at ./iostream.jl:137

@simonster

@jakebolewski and I were playing around with this some more this afternoon, and reached the conclusion that the current state of garbage collection greatly magnifies the effect of any allocation in this benchmark. After the dataset is loaded into memory, each GC takes 2 seconds, as @timholy noted above. So given the current state of garbage collection, optimizing serialization/deserialization of that dataset comes down to minimizing GC. But I suspect these very long GC times will also make it hard to do anything with the data. With 100x longer GC pauses, even code that spent only 10% of its time in GC before the dataset was in memory will take about 10x longer to run after the dataset is in memory (0.9 + 0.1 × 100 = 10.9x the original runtime).

@jakebolewski

For some hard numbers: allocating 1,000,000 UTF-16 strings takes about 0.5 seconds on the Julia test machine in a fresh environment. After loading in a DataFrame like labevents, allocating the same 1,000,000 UTF-16 strings takes 5.07 seconds. GC times are on the order of 1.6 seconds. This obviously gets much worse as you load in bigger datasets. labevents is 450 MB, containing almost 10,000,000 short UTF-16 strings.
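
Roughly, the measurement looked like this (hypothetical driver in 0.3-era syntax; alloc_strings is an invented name):

# Allocate n short UTF-16 strings, as in the numbers above.
function alloc_strings(n)
    v = Array(UTF16String, n)        # 0.3-era uninitialized-array syntax
    for i = 1:n
        v[i] = utf16("short value")  # small strings, like the dataset's
    end
    return v
end

@time alloc_strings(1_000_000)                   # fresh session: ~0.5 s
labevents = open(deserialize, "labevents.jld")   # load the 450 MB DataFrame
@time alloc_strings(1_000_000)                   # now ~5 s, dominated by GC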

@JeffBezanson

Yes, we know.

@StefanKarpinski

Yeah, strings are a total disaster currently – that's why they're slated for an overhaul in 0.4.

@jakebolewski

I tried to search the issues; how do you see the representation of strings changing? Allocating so many objects, be they strings, arrays, or immutables, gives the same behavior. I think this is more of a GC problem than a string problem. How will the new representation get around the issue that you are just allocating many small objects?

@JeffBezanson

We need to work both sides of the issue: better GC and fewer, smaller objects. Strings use 2 objects each now, plus Array has a lot of overhead due to supporting multiple dimensions and resizing.

@StefanKarpinski

@jakebolewski, here's a mockup of a Bytes type in Julia that could serve as a basis for storing the bytes of immutable strings: https://github.com/StefanKarpinski/tmp/blob/master/bytes.jl. It can store up to 15 bytes inline, or more bytes through a pointer. The real version will likely be written in C, not Julia.

@StefanKarpinski

Note that short strings can be stored inline and compared using Int128 operations. When I benchmarked this prototype, sorting an array of inline strings was about 2x faster than sorting an array of slightly longer, non-inline strings, so there does seem to be some advantage to be had.
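
A toy version of the inline idea (hypothetical type, not the mockup linked above): pack up to 15 payload bytes plus a length byte into one Int128, and equality becomes a single integer compare.

import Base: ==

immutable InlineStr      # `immutable` is 0.3-era syntax (now `struct`)
    bits::Int128         # 15 payload bytes + 1 length/tag byte
end

# Equality of two short strings is one integer comparison. Lexicographic
# ordering would additionally require packing the bytes big-endian so
# that integer < matches string <.
==(a::InlineStr, b::InlineStr) = a.bits == b.bits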

@nalimilan

Also, you usually don't need to work with millions of distinct string values: you have millions of values drawn from a relatively small pool. In this case PooledDataArrays will be dramatically more efficient.
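
For example (a sketch assuming the DataArrays API of the time):

using DataArrays

# Each unique string is stored once in a pool; the column itself holds
# small integer references into that pool.
vals = ["NORMAL", "abnormal", "NORMAL", "NORMAL", "NORMAL"]
pda = PooledDataArray(vals)
length(levels(pda))   # 2 unique strings backing 5 entries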

@StefanKarpinski

The approach we're considering will basically make that optimization obsolete, for strings at least. We may also make it "autopooling" – i.e. whenever two strings are equal or can share memory, we may be able to transparently use a single object.

@alanedelman

Boy am i glad we're starting to work on big datasets :-)

skumagai pushed a commit to skumagai/julia that referenced this issue Aug 18, 2014:

> mostly this avoids memory allocation from Module fullname()s, and
> boxing Ints.
>
> this is backwards-compatible for *reading* old files, but written
> data will not be usable by older julias.
@timofeytt

I don't know whether the optimization techniques described in the blog post are applicable here, but I'll share it for the purposes of discussion:
http://www.serpentine.com/blog/2013/03/20/whats-good-for-c-is-good-for-haskell/

@ScottPJones

It's not clear from this issue: has anything been done to specifically improve the performance of serialization/deserialization (beyond the general gains from GC improvements)? Has the file extension been changed to not be .jld? Has file header information been added? Is the speed of UTF-16 conversions still a problem?

@JeffBezanson

Yes, there was #11125, and some optimizations in #10170 as well. I don't know how we would change the file extension, since we don't use or enforce that extension anywhere in Base, AFAIK. We don't really have a file format yet (#7909).

@stevengj commented Oct 7, 2016

I added a performance benchmark of this to nanosoldier, so we can track this going forward.

@JeffBezanson

I believe this is largely fixed. New issues can be opened if anybody encounters an especially bad case for serialize/deserialize.
