
Poor performance of readdlm on master #10428

Closed
jiahao opened this issue Mar 7, 2015 · 60 comments
Labels
performance Must go faster
Comments

@jiahao
Member

jiahao commented Mar 7, 2015

Here is a simple benchmark for reading in real-world data in TSV (tab-separated values) format. The data are financial time series from a proprietary source with 47 fields. The smaller samples are constructed from taking the first n rows of the original file, which has 100,654,985 rows and is 21.879 GiB in size.

Timings reported in seconds. Timings under 60s are best of 5. Timings over 60s were run once.

| number of rows | `wc -l` | `grep -c` | data.table `fread` | R `read.table` | `pandas.read_csv` | `readdlm` | `DataFrames.readtable` | timeit `pandas.read_csv` | gc_disabled `readdlm` | gc_disabled `DataFrames.readtable` |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1,000 | 0.003 | 0.003 | 0.013* | 0.035 | 0.023 | 0.028 | 0.033 | 0.013 | 0.029 | 0.020 |
| 10,000 | 0.004 | 0.007 | 0.052* | 0.316 | 0.100 | 0.353 | 0.357 | 0.080 | 0.316 | 0.202 |
| 100,000 | 0.017 | 0.035 | 0.461* | 5.190 | 0.666 | 6.177 | 20.090 | 0.749 | 3.424 | 2.643 |
| 1,000,000 | 0.132 | 0.278 | 4.604* | 48.302 | 7.846 | 268.673 | 47.400 | 6.704 | 35.731 | 24.813 |
| 10,000,000 | 1.292 | 2.732 | 47.503* | 549.017 | 93.413 | 25298.222 | 916.866 | 71.286 | 348.457 | 264.402 |
| 100,654,985 | 12.691 | 25.779 | 725.966* | 5147.951 | 799.608* | | | 825.816* | | |

*timings obtained from files with NUL characters stripped.

Note that the pandas timings obtained with timeit have the Python garbage collector disabled, and so are fair comparisons with the gc_disabled numbers in Julia. The first pandas column is timed with simple time.time() wrappers and is analogous to Julia's @time reports.

It's quite clear from the timings that garbage collection causes the scaling to become superlinear, with the result that readdlm and readtable become unusable for large data, even on a machine large enough to read all the data into memory. However, the baseline performance of pandas is still far superior to what we have to offer in Julia, even with garbage collection turned off. This is perhaps unsurprising, since the main pandas parser is written in Cython and not pure Python.

We could probably do much better if readdlm generated less garbage.


Actual commands run:

wc -l datafile.tsv
grep -c AAPL datafile.tsv
#Python 2.7.9 |Anaconda 2.1.0
t=time(); A=pandas.read_csv("datafile.tsv", sep="\t", header=None); time()-t
timeit.timeit('pandas.read_csv("datafile.tsv", sep="\t", header=None)', setup='import pandas', number=5)/5
#Julia 0.4-dev
gc(); readdlm("datafile.tsv", '\t');
gc(); DataFrames.readtable("datafile.tsv", separator = '\t', header = false);
gc_disable(); #repeat readdlm, readtable, etc.
#R 3.1.2
> system.time(read.table("datafile.tsv", sep="\t", header=FALSE, skipNul=TRUE))                                                                                                   
   user  system elapsed 
  0.034   0.000   0.035
... similarly
  0.316   0.000   0.316 
  5.148   0.040   5.190 
 47.191   0.802  48.302 
535.361  10.463 549.017
4867.606  229.902 5147.951

> system.time(fread("datafile.tsv", sep="\t", header=FALSE)) #Chokes on NUL characters
   user  system elapsed 
  0.008   0.004   0.013 
... similarly
  0.052   0.000   0.052 
  0.457   0.004   0.461 
  4.430   0.173   4.604 
 46.230   1.051  47.503 
703.847  18.296 725.966 

Julia @time info

readdlm (returns Array{Any,2})
10^3 elapsed time: 0.027763177 seconds (7 MB allocated)
10^4 elapsed time: 0.353573064 seconds (80 MB allocated, 29.06% gc time in 3 pauses with 0 full sweep)
10^5 elapsed time: 6.177088453 seconds (809 MB allocated, 52.92% gc time in 19 pauses with 4 full sweep)
10^6 elapsed time: 268.672912983 seconds (8097 MB allocated, 86.89% gc time in 160 pauses with 17 full sweep
10^7 elapsed time: 25298.222635691 seconds (80990 MB allocated, 98.63% gc time in 2424 pauses with 177 full sweep)

DataFrames.readtable
10^3 elapsed time: 0.033111172 seconds (4 MB allocated)
10^4 elapsed time: 0.356714838 seconds (41 MB allocated, 11.54% gc time in 1 pauses with 0 full sweep)
10^5 elapsed time: 20.089927057 seconds (433 MB allocated, 79.65% gc time in 5 pauses with 3 full sweep)
10^6 elapsed time: 47.399711163 seconds (4176 MB allocated, 13.33% gc time in 40 pauses with 8 full sweep)
10^7 elapsed time: 916.86617322 seconds (40755 MB allocated, 52.70% gc time in 523 pauses with 39 full sweep)
@johnmyleswhite
Member

There's no doubt that Pandas is doing much better than we are.

That said, I'd be interested to see how CSVReaders compares. For very large data sets, it seems to do much better than readtable because it doesn't produce an intermediate representation of all strings.

@johnmyleswhite
Member

Actually, I take that back. CSVReaders is slower: it just uses 50% as much memory, but takes noticeably longer to run.

@johnmyleswhite
Member

One of the things I suspect is problematic for the readtable parser is the way that type inference interprets it. Lines like https://github.com/JuliaStats/DataFrames.jl/blob/3320b2b7d160be66bb6a4bd3f95f78b8be6e0d23/src/dataframe/io.jl#L510 are, I fear, a major problem. (And this is endemic to DataFrames because of the Vector{Any} representation of the columns.)

@JeffBezanson JeffBezanson added the performance Must go faster label Mar 7, 2015
@JeffBezanson
Member

I tried the following code to get a sense of things:

function rd(f)
    nr = countlines(f)                    # first pass: count the rows
    fh = open(f)
    nc = length(split(readline(fh),','))  # infer column count from the first line
    seekstart(fh)
    d = Array(Float64, nr, nc)            # preallocate the full result
    for i = 1:nr
        r = split(readline(fh),',')
        for j = 1:nc
            d[i,j] = float64(r[j])
        end
    end
    d
end

I compared this to readdlm on a simple delimited file of 5000000x4 random numbers:

julia> @time readdlm("rand3.csv",',');
elapsed time: 10.446378342 seconds (2791 MB allocated, 2.32% gc time in 34 pauses with 7 full sweep)

julia> @time rd("rand3.csv");
elapsed time: 9.385646328 seconds (2784 MB allocated, 2.10% gc time in 121 pauses with 1 full sweep)

"Only" a 10% speedup, but fairly shocking given how trivial my code is.

@johnmyleswhite
Member

Are you shocked because it's a 10% speedup or because it's only a 10% speedup?

@JeffBezanson
Member

Because there's any speedup at all. A 500-line implementation ought to be much faster.

@johnmyleswhite
Member

FWIW, I've done a bunch of profiling while working on CSVReaders. The places where you get killed are often surprising. For example, I got a huge speedup by hardcoding the sentinel values that you check for missing values and Booleans. Having those sentinel values stored in arrays that get looped through was roughly 50% slower than hardcoding them.

I'm not sure I agree with the heuristic that a larger implementation should be much faster. For example, a lot of the code in readtable is spent checking branches to make sure that type inference is still correct.

All that said, I would be ecstatic if we managed to figure out how to write a fast CSV parser in pure Julia, so I'm really happy to have you thinking about this.

@JeffBezanson
Member

Part of the key seems to be separating CSV, which is a very complex format, from simple TSV. If you happen to have a dead-simple TSV file, we should be able to read it super fast.

@johnmyleswhite
Member

While that's clearly true in the abstract, I think it's important that the pandas benchmark is a reader for the full CSV format.


@JeffBezanson
Member

With 20 more minutes of work, I can beat readdlm in this case by a full factor of 2 (https://gist.github.com/JeffBezanson/83fb1d79deefa2316067). I guess my point is that if you're going to have a large implementation, it might as well include special cases like this. Of course full CSV is what matters, but it's even worse to be slow on simple numeric TSV files.

@tanmaykm
Member

tanmaykm commented Mar 7, 2015

Much of the complexity of readdlm comes from handling nuances of the CSV format, like:

  • quoted columns that allow both row and column separators and quotes within them
  • trimming spaces
  • empty lines
  • missing columns, which necessitates parsing the whole file before determining output dimensions

Some of these can be disabled with appropriate flags, but they do have some residual impact.

Also, the approach of reading/mapping the whole file is probably not giving enough benefit anymore, because of speedups in other parts of Julia; it is now proving to be a bottleneck with large files. A way to handle files in chunks would probably help here.

It would be good to have a simple and fast default method, with the ability to plug in one or more routines to handle the complexity, but without a large impact on performance.
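A chunked reader along those lines might look roughly like this (a minimal sketch in current Julia syntax, not the actual readdlm internals; `process_lines!` is a hypothetical callback for whatever parses complete lines):

```julia
# Sketch: read a delimited file in fixed-size chunks instead of mapping the
# whole file at once, carrying any partial trailing line over to the next
# chunk so the parser only ever sees complete lines.
function read_in_chunks(io::IO, process_lines!::Function; chunksize::Int = 1 << 20)
    carry = ""                          # unfinished line from the previous chunk
    while !eof(io)
        chunk = carry * String(read(io, chunksize))
        lastnl = findlast(==('\n'), chunk)
        if lastnl === nothing
            carry = chunk               # no complete line yet; keep accumulating
        else
            process_lines!(SubString(chunk, 1, lastnl))
            carry = chunk[nextind(chunk, lastnl):end]
        end
    end
    isempty(carry) || process_lines!(carry)   # final line without a trailing '\n'
    return nothing
end
```

Chunk boundaries only work this simply when the row separator cannot appear inside a quoted field; handling quoted newlines would require carrying the quote state across chunks as well.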

@tanmaykm
Member

tanmaykm commented Mar 7, 2015

I think the approach of parsing the entire file before determining output dimensions has a large impact. With that disabled (by specifying dims), here's what I got on a 10GB file with 60000000 x 10 floats on julia.mit.edu:

julia> @time a = readdlm("randf2.csv", ';', Float64; dims=(60000000, 10));
elapsed time: 401.356118774 seconds (33393 MB allocated, 0.58% gc time in 841 pauses with 1 full sweep)

julia> size(a)
(60000000,10)

But I was not able to read a 40GB file because it ran into memory/gc issues.
gc was enabled during the test.

@jiahao
Member Author

jiahao commented Mar 8, 2015

@tanmaykm here are the timings for the original example when the dimensions are pre-specified:

julia> for i=3:7; @time A=readdlm("testfile.tsv", '\t', dims=(10^i,46)); end
elapsed time: 0.017998925 seconds (3 MB allocated)
elapsed time: 0.309120058 seconds (41 MB allocated, 41.21% gc time in 2 pauses with 1 full sweep)
elapsed time: 4.848010965 seconds (423 MB allocated, 57.23% gc time in 18 pauses with 2 full sweep)
elapsed time: 184.121671697 seconds (4234 MB allocated, 88.44% gc time in 154 pauses with 16 full sweep)
elapsed time: 15579.129143684 seconds (42359 MB allocated, 98.69% gc time in 1512 pauses with 154 full sweep)

With gc disabled:

julia> for i=3:7; @time A=readdlm("testfile.tsv", '\t', dims=(10^i,46)); end
elapsed time: 0.019123992 seconds (3 MB allocated)
elapsed time: 0.199007951 seconds (41 MB allocated)
elapsed time: 2.031139609 seconds (423 MB allocated)
elapsed time: 19.807757219 seconds (4234 MB allocated)
elapsed time: 175.472120788 seconds (42359 MB allocated)

It does look like pre-specifying the dimensions helps cut execution time almost in half. So much the better, since the dimensions can be determined with UNIX shell commands first for a fraction of the cost in Julia ;-)

@ViralBShah ViralBShah added this to the 0.4 milestone Mar 8, 2015
@jiahao
Member Author

jiahao commented Mar 8, 2015

Updated OP with timings for R's read.table.

@nalimilan
Member

@jiahao If you're looking for a fast function to benchmark, fread() in the data.table R package claims to be much faster than the standard read.table().

@tanmaykm
Member

tanmaykm commented Mar 8, 2015

@jiahao I'll try to profile the code to figure out memory allocation hotspots. Also, the data is being read as an Any array, which is much slower than reading into a numeric/bitstype array. And parsing ASCII will be faster than UTF-8.

In this particular case, cleaning/converting these to numeric values/factors or separating them to be read from a different file may help:

  • columns 1,2,8,13 look like date/time stamps
  • column 15 looks like a category type
  • columns 23,24,25 seem to have nulls, and probably some non-ASCII bytes also...

@garborg
Contributor

garborg commented Mar 8, 2015

Seconding @nalimilan, fread is the R functionality to test against. Never used read.table when I was using R every day.

@jiahao
Member Author

jiahao commented Mar 8, 2015

I'll have to re-devise the test then. This test file contains NUL characters, which break fread.

@jiahao
Member Author

jiahao commented Mar 8, 2015

Updated OP with timings for data.table fread(). The timings for the other commands did not change significantly when rerunning with files stripped of NULs.

In case anyone is interested, stripping the original file of NULs with tr -d '\000' took all of 1 minute.

@tanmaykm
Member

tanmaykm commented Mar 9, 2015

Looks like the SubString created here https://github.com/JuliaLang/julia/blob/master/base/datafmt.jl#L163 contributes most of the garbage generated in readdlm.
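A sketch of the alternative: instead of allocating a SubString per field, parse a number straight out of the raw byte buffer (hypothetical helper, assuming plain ASCII numeric fields with bounds already checked by the caller):

```julia
# Sketch: copy one field into a reusable NUL-terminated scratch buffer and
# call C's strtod on it, avoiding a fresh SubString allocation per field.
const SCRATCH = Vector{UInt8}(undef, 64)

function parse_float_field(buf::Vector{UInt8}, lo::Int, hi::Int)
    n = hi - lo + 1
    n < length(SCRATCH) || error("field too long for scratch buffer")
    copyto!(SCRATCH, 1, buf, lo, n)
    SCRATCH[n + 1] = 0x00                        # NUL terminator for strtod
    endptr = Ref{Ptr{UInt8}}(C_NULL)
    ccall(:strtod, Float64, (Ptr{UInt8}, Ptr{Ptr{UInt8}}),
          pointer(SCRATCH), endptr)
end

buf = Vector{UInt8}(codeunits("3.25\t1.5\n"))
parse_float_field(buf, 1, 4)    # parses "3.25" without a per-field string
```

The commits referenced in this issue take a similar direct-`ccall` route inside Base; this standalone helper is only illustrative.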

tanmaykm added a commit to tanmaykm/julia that referenced this issue Mar 9, 2015
Using direct `ccall` wherever possible instead of creating `SubString` for every column during parsing.

ref: JuliaLang#10428
tanmaykm added a commit to tanmaykm/julia that referenced this issue Mar 10, 2015
Using direct `ccall` wherever possible instead of creating `SubString` for every column during parsing.

ref: JuliaLang#10428
@jiahao
Member Author

jiahao commented Apr 22, 2015

Here is how Base.readdlm now performs on master after @tanmaykm's improvements:

| number of rows | elapsed (s) | % gc time | elapsed (s) with gc_disable |
|---:|---:|:--|---:|
| 10^3 | 0.020 | - | 0.025 |
| 10^4 | 0.323 | 35.93% gc time in 3 pauses with 0 full sweep | 0.244 |
| 10^5 | 4.246 | 39.30% gc time in 16 pauses with 4 full sweep | 2.419 |
| 10^6 | 25.186 | 6.06% gc time in 11 pauses with 7 full sweep | 21.251 |
| 10^7 | 9667.191 | 96.87% gc time in 1946 pauses with 110 full sweep | 283.008 |

Unfortunately it looks like the quadratic scaling behavior still persists for n = 10^7 rows. At >200x slower than R's fread, there is still a lot of work to be done.

@JeffBezanson
Member

@carnaval can we turn a knob in the GC to make it back off faster?

@carnaval
Contributor

We can; the current heuristics are nowhere near perfect, but changing this requires at least a bit of regression testing w.r.t. memory usage, which is a pain.
TBH, in most cases where I've seen this, the best fix is to disable gc and re-enable it afterwards, since the vast majority of the memory is useful. I know it's not very satisfactory, but the opposite problem is huge memory usage, which Julia is already pretty lenient about, so I'm not sure what the right thing to do is.

@JeffBezanson
Member

Manually disabling GC is not acceptable. Surely the jump from 11 to 1946 pauses in the last table row can be avoided. I'm even ok with 39% GC time, just not 96%.

@carnaval
Contributor

carnaval commented May 7, 2015

Ok, I still haven't gotten around to doing this, sorry.

The first-order improvement would probably be to introduce a custom StringArray which stores every String linearly in an IOBuffer and the offsets as ranges in a separate array. You would avoid most gc/allocation traffic, have much better cache behavior, etc. Mutation becomes a problem, though, and pushing this to maximum performance would probably make you implement a specialized "depth-1" compacting gc for strings. I'm almost certain that even doing it naively at first would be a huge improvement compared to allocating each string separately.

It's also easier for the gc because you only materialize the individual strings on getindex, and they are likely to stay in the young gen and be cheap to free.

Since a lot of the columns seem to be strings, I think it would be very worthwhile to try.
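A first cut at that idea might look like the following sketch (`StringPool` is a made-up name; mutation and the compacting-gc refinement mentioned above are ignored):

```julia
# Sketch: store all string bytes back-to-back in one buffer and keep only a
# byte range per element; a String is materialized lazily on getindex, so
# there is one large allocation instead of one small allocation per string.
struct StringPool <: AbstractVector{String}
    data::Vector{UInt8}              # concatenated bytes of every element
    offsets::Vector{UnitRange{Int}}  # byte range of each element in data
end

StringPool() = StringPool(UInt8[], UnitRange{Int}[])

Base.size(p::StringPool) = (length(p.offsets),)
Base.getindex(p::StringPool, i::Int) = String(p.data[p.offsets[i]])

function Base.push!(p::StringPool, s::AbstractString)
    start = length(p.data) + 1
    append!(p.data, codeunits(s))
    push!(p.offsets, start:length(p.data))
    return p
end

pool = StringPool()
push!(pool, "AAPL"); push!(pool, "MSFT")
pool[2]    # materializes "MSFT" only at access time
```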

@ViralBShah
Member

Bump. Does something have to be done on the GC front here, or on reorganizing readdlm?

@RaviMohan

Some more GC related weirdness.

I have a simple test: I create files containing n comma-separated integers, use readall to read them in as a single string, split it with "," as the separator, print the length of the read-in string, and exit, using the unix time command to get timing data. Here are the timings for Julia vs Python 2.7 on my machine (Ubuntu 14.04, 32 GB RAM):

| N | Python 2.7 | Julia 0.4-dev | Julia 0.4-dev, gc off |
|---:|---:|---:|---:|
| 10k | 0.027 seconds | 0.207 seconds | 0.217 seconds |
| 100k | 0.036 seconds | 0.20 seconds | 0.229 seconds |
| 1 million | 0.053 seconds | 0.288 seconds | 0.257 seconds |
| 10 million | 0.475 seconds | 0.823 seconds | 0.622 seconds |
| 100 million | 3.812 seconds | 51.273 seconds | 3.053 seconds |
| 300 million | 13.15 seconds | 25.61 seconds | 10.98 seconds |
| 600 million | 33 seconds | didn't complete after twenty minutes | 36 seconds |

The Python code

#just read contents of a file so I can time it
import string
def read_my_file(file_name):
    f = open(file_name)
    s = f.read()
    a = string.split(s,",")
    print(len(s))
    f.close()

read_my_file("sixhundredmillionints.txt")

The Julia code with GC on

#just read a file so I can time it
f = open("sixhundredmillionints.txt")
s = readall(f)
a = split(s,",")
print(length(s));
close(f)

Julia code without GC

# gc off to avoid gc flakiness
gc_disable();
f = open("sixhundredmillionints.txt")
s = readall(f)
a = split(s,",")
print(length(s));
close(f)

@carnaval
Contributor

probably the same thing that is solved on ob/gctune

@RaviMohan

So (sorry newbie here, excuse the dumb question) are you saying the GC is fixed now and can handle large inputs without flailing? If so any idea when that branch will get merged into main?

@carnaval
Contributor

Well, it is not merged yet, but at least on this branch it should not do what I think it's doing in your example, which is: believing it is doing well when it's not, and because of that not widening the collection interval.

There is no "general fix" for this, only improvements; this one is very much needed, though, because otherwise in cases like yours the heuristics are actively making things worse.

@JeffBezanson JeffBezanson modified the milestones: 0.4.x, 0.4.0 Jun 2, 2015
@RaviMohan

Something I've been working on

Trying to load up to a 100-million-row, 24-column dataset consisting of random integers, floats, and strings (limits: no other datatypes, no nulls, only comma as separator, doesn't infer schema) into a data structure which is essentially this:

ColumnType = Union(Array{Int64,1},Array{Float64,1},Array{AbstractString,1})
#this should probably be a list, but array works for now
DataFrameType = Array{ColumnType,1}

i.e., a dataframe is an array of columns, each column being a Union of Int array, Float array, or String array (this is not an original idea; it is how the pandas dataframe works, modulo the types)

I'm getting the following loading times on my 4 core 32 GB Ubuntu 14.04 machine.

All times in seconds.

| Number of rows | R `read.table` | R `fread` | Julia 0.4-dev, gc on | gc % | Julia 0.4-dev, gc off |
|---:|---:|---:|---:|---:|---:|
| 1000 | 0.02 | 0.003 | 0.06 | 0% | 0.06 |
| 10000 | 0.33 | 0.043 | 0.12 | 12.19% | 0.13 |
| 100000 | 3.489 | 0.237 | 0.79 | 23.54% | 0.6 |
| 1000000 | 19.459 | 1.939 | 13.3 | 57.85% | 6.2 |
| 10000000 | 163 | 16.389 | 635 | 91.65% | 59.03 |
| 100000000 | killed after 30 minutes | 277 | killed after 30 minutes | NA | memory exhausted |

The code is here: https://github.com/RaviMohan/loaddataframe. The Python code generates the dataset; the Julia code loads the data from the file into the above data structure.

If the code is basically correct and not doing anything stupid (errors are always possible; all fixes appreciated in advance :-) ), it seems that without gc, plain Julia comes within a constant factor of R's fread, and handily beats R's read.table. Of course R's routines do much more, like inferring column datatypes, handling nulls, handling multiple formats, etc.

Still it seems that fast loading of large csv files is possible, once the gc issues are fixed and/or enough memory is available.

I don't know how to write idiomatic Julia, so the code might be too imperative etc. Any style pointers (and of course bug reports/fixes) greatly appreciated.

@jiahao
Member Author

jiahao commented Jun 29, 2015

@RaviMohan sounds like progress! How does loaddataframe perform on the test data set in this issue?

Perhaps you and @quinnj could discuss how to further combine efforts between https://github.com/quinnj/CSV.jl and https://github.com/RaviMohan/loaddataframe. @quinnj has also implemented his own CSV reader and has shown me some tests, also with good performance.

(Also a quick note, the syntax f(x) is preferred over f (x))

@RaviMohan

@jiahao I haven't yet tested on that data set, primarily because I haven't implemented Date/DateTime parsing, and I don't yet have the file on my local machine. But the code is very generic, and you should see similar speedups (not to mention that a 100-million-row csv actually loads completely, unlike in your original tests, provided enough memory is available; you'll need a lot of it. On my 32 GB machine, 100 million rows with 24 columns, even with very short strings, completely overwhelms the RAM, and it doesn't work with gc on at all.)

My focus here was on getting the underlying data structure right, avoiding Array{Any, ...} etc. That is the primary lesson from this effort: we can avoid Array{Any} and still load 100 million lines of data (though schema inference becomes harder; see below). The other lesson is the array-of-columns design versus a 2D Array as the underlying data structure.

Also, I have hardcoded comma as the separator, which is probably not very conducive to the original data set. This is the last of many iterations of code explicitly trying to see whether we can come close to, or beat, R's reading speed; it is very hacky and not really production-quality code, imo.

(I must say I am very impressed that fairly straightforward Julia code can come close to R's fread. That is some very impressive language design/compiler work.)

The next important things to do are to handle nulls, and to generate the schema by examining/sampling the dataset (vs. hardcoding it as of now). I'm not quite sure yet how to do this, since it seems to involve the possibility of the ColumnType changing dynamically: a column can have its first 10 million values look like Ints and then have the rest be floating-point values, say. If we are not using Anys, then the column's type has to change when we encounter the first float. (Of course, this is not a problem for dynamic languages.)
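One way to get that without resorting to Any is to widen the column in place the first time a wider value shows up (a sketch with made-up names; the surrounding parse loop and the string case are omitted):

```julia
# Sketch: a column starts as Vector{Int}; on the first value that does not
# fit, convert the whole column to Float64 once and keep going. The one-time
# convert is O(n) but happens at most once per column.
function store!(col::Vector, v::Real)
    if col isa Vector{Int} && !(v isa Int)
        col = convert(Vector{Float64}, col)   # widen all existing values
    end
    push!(col, v)
    return col          # caller must rebind, since the array may be replaced
end

col = Int[]
col = store!(col, 1)
col = store!(col, 2)
col = store!(col, 3.5)     # first float triggers the one-time promotion
# col is now a Vector{Float64} holding [1.0, 2.0, 3.5]
```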

To handle nulls I tried using Union(Array{Nullable{Int64},1}, ...) etc., but the memory usage seems to blow up pretty badly on this and I can't figure out why. Something to pursue further.

I don't think @quinnj's CSV reader has much to learn from this on the parsing side, as I am just using string split with '\n' and ',' respectively :-P. I don't deal with any of the complexities of CSV parsing, which can get pretty hairy and full of gnarly details. That said, of course I'd be glad to help @quinnj in any way I can.

String splitting also interacts weirdly with the gc when memory gets tight. I probably need to investigate and file a separate issue (though it might just be an instance of #6103).

Another thing I'm looking at in the coming weeks is to see if Spark's "Dataframe-as-API + multiple conforming components + a generic query compiler" pattern can be adapted to our needs. If so, users can use the API without really knowing or caring about implementation details, and multiple sources of data can be trivially added. We'll have simpler datastores than RDDs (of course!) but the basic architectural pattern might be valuable to decouple data sources and implementations.

@ViralBShah was mentioning something called "JuliaDB". As and when it takes concrete form, there might be commonalities/common design problem etc.

Thanks for the syntax pointer. Appreciated!

carnaval added a commit that referenced this issue Aug 15, 2015
Take the number of pointers transgressing the generational frontier
into account when deciding whether to do a young or an old gen
collection. Helps #10428. Didn't run the bench suite yet.
@ViralBShah ViralBShah modified the milestones: 0.4.x, 0.6.0 Sep 10, 2016
@ViralBShah
Member

With all the recent successes of CSV.jl, I think it's fair to close this issue.

@tkelman
Contributor

tkelman commented Sep 10, 2016

We should remove the base readdlm if we're going to tell people to use something different.

@ViralBShah
Member

As I recollect, we probably should leave readdlm to read homogeneous data - all of the same type.

@skanskan

Julia should have been built to use a database transparently for the user, and to be able to do any operation streaming from it (including mixed-effects regression or Bayesian statistics).

@nalimilan
Member

@skanskan There's no question of whether "Julia should have been built" that way or not. readdlm is here to cater to simpler use cases, but more complex scenarios are better handled via packages. See e.g. the DataStreams.jl and OnlineStats.jl packages.

@johnmyleswhite
Member

This line of comments does not belong on the issue tracker. Please ask these things on julia-users.
