serialization / deserialization performance #7893

Closed · jakebolewski opened this issue Aug 7, 2014 · 77 comments
Labels: performance (Must go faster)

Comments

@jakebolewski

Julia's serialization / deserialization performance is pretty worrying. Deserializing a ~500 MB DataFrame object takes about 5-6 minutes. An 8 GB DataFrame object takes 3+ hours (it has not finished yet). It looks like almost all of the time is spent in GC. This makes it impractical to do any sort of parallel operations on DataFrame objects without dumping the DataFrame out to a CSV file and reading the contents back into each process.

julia> labevents = open(deserialize, "labevents.jld");

shell> du -h labevents.jld
475M    labevents.jld

julia> @time labevents = open(deserialize, "labevents.jld");
elapsed time: 337.729652605 seconds (4366175664 bytes allocated, 85.11% gc time)

@jiahao said serializing the 8GB file took ~5 hours.

@Keno commented Aug 7, 2014

You don't want to use serialize for this. Try the HDF5 package.
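
A minimal round-trip sketch (assuming the JLD layer bundled with the HDF5 package at the time; labevents and the file name are reused from the report above for illustration):

using HDF5, JLD   # JLD is the Julia-aware layer on top of HDF5

# Write the DataFrame under a named dataset, then read it back.
jldopen("labevents.h5", "w") do file
    write(file, "labevents", labevents)
end

labevents2 = jldopen("labevents.h5", "r") do file
    read(file, "labevents")
end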

@JeffBezanson

Yes, this is bad, but operating on something in parallel by copying the entire dataset around is not going to work anyway. Even if the serializer works at the speed of the network, you're not likely to recoup the cost.

@jiahao commented Aug 7, 2014

Looks like you can't save a DataFrame directly to HDF5; see JuliaData/DataFrames.jl#64

Seriously, guys, what is the point of having a serialization format that is hundreds of times slower than CSV? That's the real issue here.

@Keno commented Aug 7, 2014

I'm pretty sure that is false. I have saved DataFrames to HDF5 plenty of times. Please have a look at the HDF5 package. The serializer is also not intended as a storage format, and it is not stable between versions of Julia.

@JeffBezanson

The point of it is to support arbitrary julia objects.

When we find something that's very slow, we investigate and try to speed it up. Assuming that somebody maliciously intends for it to be slow is totally inappropriate and makes no sense.

@timholy commented Aug 7, 2014

Yes, HDF5 (really JLD) even includes a test for DataFrames. (It only runs if DataFrames is installed; I just added a test/REQUIRE file to make sure Iain runs it.)

That said, I certainly support improving serializer performance.

@JeffBezanson

Also, it would help a lot to know what is in this DataFrame. Could we have the types of the columns?

@quinnj commented Aug 7, 2014

@jiahao commented Aug 7, 2014

It was not my intent to be accusatory. It's just rather frustrating that, having tried a few different ways to read datasets into Julia, it seems like the fastest way is to read the whole thing in as a CSV file. (@jakebolewski just tried the HDF5 option; while it works, the speed is comparable to the serializer. Thanks for the correction, @Keno and @timholy.)

@JeffBezanson the performance we're seeing doesn't seem to vary too much across the various tables in this dataset - they're all DataArrays of one sort or another. This particular dataset's SQL signature is
int(11), int(11), int(11), int(11), datetime, varchar(100), valuenum, varchar(10), varchar(10).

@jakebolewski

@quinnj I have not tried protobufs. I've only skimmed the package; is there a way to easily generate a .proto file from a DataFrame object? Some potential problems are handling missing data and re-encoding all string data to UTF-8; ODBC returns UTF-16 encoded strings by default.

@jakebolewski

@JeffBezanson the serialized dataframe files are in the folder /home/jake/projects/mimic/db_dataframes on julia.mit.edu if you want to test them out.

@JeffBezanson

Thanks, I will take a look.

@JeffBezanson

OK, I have a strong lead here: reading ioevents.jld took 91 seconds, 48% GC time. Reading a_chartdurations.jld, which is nearly 1/10th the size, took 0.35 seconds. This is basically O(n²) scaling. Then I removed all string columns from ioevents, and saving and loading that took 2 seconds or less, eliminating the scaling problem.

So the entire problem appears to be strings.
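
A rough way to reproduce the scaling (hypothetical micro-benchmark, not the files above; serialize/deserialize were in Base at the time, while on modern Julia you would add `using Serialization` first):

# Time deserialization of string vectors at a few sizes; O(n^2)
# behavior shows up as ~100x the time for 10x the data.
function roundtrip_time(n)
    v = [string("value ", i) for i in 1:n]
    io = IOBuffer()
    serialize(io, v)
    seekstart(io)
    return @elapsed deserialize(io)
end

for n in (10^4, 10^5, 10^6)
    println(n, " strings => ", roundtrip_time(n), " s")
end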

@jakebolewski

Makes sense that it is O(n²).
Some more performance numbers for completeness.

julia> @time writetable("labevents.csv", labevents)
elapsed time: 382.450579558 seconds (5702565828 bytes allocated, 92.74% gc time)

shell> head labevents.csv
"subject_id","hadm_id","icustay_id","itemid","charttime","value","valuenum","flag","valueuom"
3,NA,NA,50112,"2682-08-22T16:24:00","121",121.0,"abnormal","mg/dL"
3,NA,NA,50428,"2682-08-22T16:24:00","261",261.0,"NA","K/uL"
3,NA,NA,50443,"2682-08-22T16:24:00","NORMAL",NA,"NA","NA"
3,NA,NA,50415,"2682-08-22T16:24:00","NORMAL",NA,"NA","NA"
3,NA,NA,50409,"2682-08-22T16:24:00","NORMAL",NA,"NA","NA"
3,NA,NA,50431,"2682-08-22T16:24:00","NORMAL",NA,"NA","NA"
3,NA,NA,50326,"2682-08-22T16:24:00","NORMAL",NA,"NA","NA"
3,NA,NA,50396,"2682-08-22T16:24:00","NORMAL",NA,"NA","NA"
3,NA,NA,50333,"2682-08-22T16:24:00","0.5",0.5,"NA","%"

julia> @time readtable("labevents.csv")
elapsed time: 46.019874084 seconds (5823425276 bytes allocated, 40.02% gc time)

Timing the same IO code in Python using pandas:

In [17]: %time df = pandas.io.parsers.read_csv("labevents.csv")
CPU times: user 5.88 s, sys: 430 ms, total: 6.31 s
Wall time: 6.33 s

In [18]: %time df.to_pickle("labevents.pkl")
CPU times: user 4.28 s, sys: 508 ms, total: 4.79 s
Wall time: 4.79 s

In [25]: %time pandas.read_pickle("labevents.pkl")
CPU times: user 4.02 s, sys: 740 ms, total: 4.76 s
Wall time: 4.76 s

# Using the pickler written in pure Python (no C extensions)
In [37]: %time pickle.dump(df, fh)
CPU times: user 1min 10s, sys: 1.97 s, total: 1min 12s
Wall time: 1min 12s

In [43]: %time df2 = pickle.load(fh)
CPU times: user 30.9 s, sys: 2.16 s, total: 33.1 s
Wall time: 33.1 s

Comparing file sizes:

-rw-r--r-- 1  jake jake 243M Aug  7 15:02 labevents.csv
-rw-r--r-- 1  jake jake 475M Aug  7 10:33 labevents.jld
-rw-rw-r-- 1 jake jake 259M Aug  7 15:19 labevents.pkl

@jakebolewski

The original 8 GB file finished deserializing:

julia> @time chartevents = open("chartevents.jld", "r") do io
           deserialize(io)
       end
elapsed time: 20390.763780769 seconds (80046322888 bytes allocated, 95.33% gc time)

@timholy I'm finding that saving some of the smaller DataFrames to HDF5 takes much longer than just serializing the object directly to disk.

@timholy commented Aug 7, 2014

It's quite possible: HDF5 has to write a lot of annotation to store the type in a "readable" way, whereas the serializer just uses a couple of bytes to indicate the type.

In my experience JLD is ~5x faster than CSV. I bet it depends a lot on the contents, though. If you can post the dataframe somewhere, I can take a look.

@JeffBezanson

For numeric data, that would be expected since we write binary. Here, it's all about the strings.

@simonster

I've played around a little and it looks like some of the bottlenecks in writetable are actually in Base. One is show(io, Int), which calls dec, which allocates a new string. Another is print_escaped. Together these two functions account for more than half of the time on my test data set with similar data types.
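
A rough illustration of the show(io, Int) cost (hypothetical micro-benchmark, reflecting the 0.3-era behavior described above):

io = IOBuffer()

# Each print of an Int goes through dec(), which builds a fresh String
# per call, so this loop allocates on the order of 10^6 short strings.
@time for i in 1:10^6
    print(io, i)
end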

@JeffBezanson

I'm working on some more small but worthwhile improvements.

This is a thorn in my side:
https://github.com/JuliaLang/julia/blob/master/base/serialize.jl#L296

That integer is totally redundant and AFAICT not used at all. It's obviously not a huge bottleneck, but it would be nice to eliminate given that it's totally unused. However it doesn't appear possible to remove it in a way that would allow old files to be read correctly.

@catawbasam

0.3 release seems like a reasonable time to update the serialize format.
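
One conventional way to allow that, sketched here purely as a hypothetical (Base's serializer had no version header at the time, and the names below are invented), is a leading format-version byte that readers dispatch on:

# serialize/deserialize were in Base in 0.3; on modern Julia,
# add `using Serialization` first.
const SER_FORMAT_VERSION = 0x01

function write_versioned(io::IO, x)
    write(io, SER_FORMAT_VERSION)
    serialize(io, x)
end

function read_versioned(io::IO)
    v = read(io, UInt8)
    v == SER_FORMAT_VERSION || error("unsupported serializer format version $v")
    return deserialize(io)
end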

JeffBezanson added a commit that referenced this issue Aug 7, 2014:

> mostly this avoids memory allocation from Module fullname()s, and
> boxing Ints.
>
> this is backwards-compatible for *reading* old files, but written
> data will not be usable by older julias.
@StefanKarpinski

> This is a thorn in my side:
> https://github.com/JuliaLang/julia/blob/master/base/serialize.jl#L296
>
> That integer is totally redundant and AFAICT not used at all. It's obviously not a huge bottleneck, but it would be nice to eliminate given that it's totally unused. However it doesn't appear possible to remove it in a way that would allow old files to be read correctly.

Dare I ask why it's there in the first place?

@JeffBezanson

I have no idea. It looks like @timholy added it over 2 years ago.

@Keno commented Aug 7, 2014

@JeffBezanson When linking to lines, please make sure you include the commit hash. You can simply press y after highlighting the line and it will change the link appropriately, e.g.:

Look here

@timholy commented Aug 7, 2014

I have no idea either. Looks like it was motivated by Matlab compatibility back when I was interested in using Julia via Matlab, perhaps because Julia has intrinsic information about types in Base, but Matlab doesn't? Not sure. Anyway, sorry about the thorn. Do what you gotta do.

@JeffBezanson

Here's where we are. Timings when this issue was filed:

julia> @time d = open(deserialize,"ioevents.jld");
elapsed time: 89.205890773 seconds (3427567732 bytes allocated, 49.20% gc time)

julia> @time serialize(f, d);
elapsed time: 89.89731217 seconds (1770653244 bytes allocated, 55.54% gc time)
# file size = 450727091 b

After my improvements plus removing the extra integer:

# same file, but re-written by new serializer
julia> @time d = open(deserialize,"/home/jeffb/ioe.jld");
elapsed time: 35.348819813 seconds (2320426896 bytes allocated, 4.91% gc time)

julia> @time serialize(f, d);
elapsed time: 51.078912958 seconds  (978248412 bytes allocated, 52.07% gc time)
# file size = 430956854 b

However, I feel bad about breaking read-compatibility. @timholy, as somebody who uses JLD, do you really think this is OK?

@Keno commented Aug 7, 2014

I think there is some confusion going on here. .jld is the extension of the HDF5 package dumper, isn't it?

@timholy commented Aug 8, 2014

If it's an immutable, we may want JuliaIO/HDF5.jl#27 sooner rather than later. CC @simonster.

@simonster

I'll work on that this weekend.

@timholy commented Aug 8, 2014

That would be pretty awesome.

@IainNZ commented Aug 8, 2014

Did something break?
burrowsa/RunTests.jl#2

Just found this

@jakebolewski

Yes, it looks like the performance changes were backwards-incompatible. If I attempt to open any of the data from yesterday, it does not work.

julia> open(deserialize, "a_chartdurations.jls")
ERROR: type: setfield!: expected Array{Any,1}, got Type{()}
 in deserialize at serialize.jl:565
 in deserialize at serialize.jl:520
 in handle_deserialize at serialize.jl:351
 in deserialize at serialize.jl:334
 in open at ./iostream.jl:137

@simonster

@jakebolewski and I were playing around with this some more this afternoon, and reached the conclusion that the current state of garbage collection greatly magnifies the effect of any allocation in this benchmark. After the dataset is loaded into memory, each GC takes 2 seconds, as @timholy noted above. So given the current state of garbage collection, optimizing serialization/deserialization of that dataset comes down to minimizing GC. But I suspect these very long GC times will also make it hard to do anything with the data. With 100x longer GC pauses, even code that spent only 10% of its time in GC before the dataset was in memory will take about 10x longer to run after the dataset is in memory (0.9 + 0.1 × 100 = 10.9x the original runtime).

@jakebolewski

For some hard numbers: allocating 1,000,000 UTF-16 strings takes about 0.5 seconds on the Julia test machine in a fresh environment. After loading in a DataFrame like labevents, allocating the same 1,000,000 UTF-16 strings takes 5.07 seconds. GC times are on the order of 1.6 seconds. This obviously gets much worse as you load in bigger datasets. labevents is 450 MB, containing almost 10,000,000 short UTF-16 strings.
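
Roughly, the measurement looked like this (hypothetical driver in 0.3-era syntax; alloc_strings is an invented name):

# Allocate n short UTF-16 strings, as in the numbers above.
function alloc_strings(n)
    v = Array(UTF16String, n)        # 0.3-era uninitialized-array syntax
    for i = 1:n
        v[i] = utf16("short value")  # small strings, like the dataset's
    end
    return v
end

@time alloc_strings(1_000_000)                   # fresh session: ~0.5 s
labevents = open(deserialize, "labevents.jld")   # load the 450 MB DataFrame
@time alloc_strings(1_000_000)                   # now ~5 s, dominated by GC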

@JeffBezanson

Yes, we know.

@StefanKarpinski

Yeah, strings are a total disaster currently – that's why they're slated for an overhaul in 0.4.

@jakebolewski

I tried to search the issues; how do you see the representation of strings changing? Allocating so many objects, be they strings, arrays, or immutables, gives the same behavior. I think this is more of a GC problem than a string problem. How will the new representation get around the issue that you are just allocating many small objects?

@JeffBezanson

We need to work both sides of the issue: better GC and fewer, smaller objects. Strings use 2 objects each now, plus Array has a lot of overhead due to supporting multiple dimensions and resizing.

@StefanKarpinski

@jakebolewski, here's a mockup of a Bytes type in Julia that could serve as a basis for storing the bytes of immutable strings: https://github.com/StefanKarpinski/tmp/blob/master/bytes.jl. It can store up to 15 bytes inline, or more bytes through a pointer. The real version will likely be written in C, not Julia.

@StefanKarpinski

Note that short strings can be stored inline and compared using Int128 operations. When I benchmarked this prototype, sorting an array of inline strings was about 2x faster than sorting an array of slightly longer, non-inline strings, so there does seem to be some advantage to be had.
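
A toy version of the inline idea (hypothetical type, not the mockup linked above): pack up to 15 payload bytes plus a length byte into one Int128, and equality becomes a single integer compare.

import Base: ==

immutable InlineStr      # `immutable` is 0.3-era syntax (now `struct`)
    bits::Int128         # 15 payload bytes + 1 length/tag byte
end

# Equality of two short strings is one integer comparison. Lexicographic
# ordering would additionally require packing the bytes big-endian so
# that integer < matches string <.
==(a::InlineStr, b::InlineStr) = a.bits == b.bits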

@nalimilan

Also, you usually don't need to work with millions of distinct string values: you have millions of values drawn from a relatively small pool. In this case PooledDataArrays will be dramatically more efficient.
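
For example (a sketch assuming the DataArrays API of the time):

using DataArrays

# Each unique string is stored once in a pool; the column itself holds
# small integer references into that pool.
vals = ["NORMAL", "abnormal", "NORMAL", "NORMAL", "NORMAL"]
pda = PooledDataArray(vals)
length(levels(pda))   # 2 unique strings backing 5 entries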

@StefanKarpinski

The approach we're considering will basically make that optimization obsolete, for strings at least. We may also make it "autopooling" – i.e. whenever two strings are equal or can share memory, we may be able to transparently use a single object.

@alanedelman

Boy am i glad we're starting to work on big datasets :-)

skumagai pushed a commit to skumagai/julia that referenced this issue Aug 18, 2014:

> mostly this avoids memory allocation from Module fullname()s, and
> boxing Ints.
>
> this is backwards-compatible for *reading* old files, but written
> data will not be usable by older julias.
@timofeytt

I don't know whether the optimization techniques described in the blog post are applicable here, but I'll share it for the purposes of discussion:
http://www.serpentine.com/blog/2013/03/20/whats-good-for-c-is-good-for-haskell/

@ScottPJones

It's not clear from this issue: has anything been done to specifically improve the performance of serialization/deserialization (beyond the general gains from GC improvements)? Has the file extension been changed to not be .jld? Has file header information been added? Is the speed of UTF-16 conversions still a problem?

@JeffBezanson

Yes, there was #11125, and some optimizations in #10170 as well. I don't know how we would change the file extension, since we don't use or enforce that extension anywhere in Base, AFAIK. We don't really have a file format yet (#7909).

@stevengj commented Oct 7, 2016

I added a performance benchmark of this to nanosoldier, so we can track this going forward.

@JeffBezanson

I believe this is largely fixed. New issues can be opened if anybody encounters an especially bad case for serialize/deserialize.
