
TSV parsing is extremely slow #661

Closed
philippbayer opened this issue Apr 2, 2012 · 24 comments
Labels
performance Must go faster

Comments

@philippbayer

I was just playing around comparing Julia to Python for iterating over tab-delimited files, and tried it on a 90 MB table and on an 800 MB table:

On the 90 MB file:
$ python parse.py
0.797344923019

$ julia parse.jl
16789.66212272644

On the 800 MB file:
$ python parse.py
6.59492301941

$ julia parse.jl
129588.43898773193

Here's the code for both the Python and the Julia implementation, based on the benchmarks on the main page:

import time

def parse():
    file_handle = open("./2.txt")
    for line in file_handle:
        line = line.split("\t")

tmin = float("inf")
for i in range(5):
    t = time.time()
    parse()
    t = time.time()-t
    if t < tmin:
        tmin = t

print tmin

Julia:

macro timeit(ex,name)
    quote
        t = Inf
        for i=1:5
            t = min(t, @elapsed $ex)
        end
        println(t*1000)
    end
end

function parse()
    file = LineIterator(open("./2.txt"))
    for line in file
        split(line, "\t")
    end
end

@timeit parse() "parse"

Why is this?

@StefanKarpinski
Member

This is definitely a performance bug. We'll look into it. Thanks for reporting.

@JeffBezanson
Member

What's happening is that our split with a string delimiter uses a regex search to do the splitting, and the regex is recompiled for every line. Our bad. Try split(line, '\t'); with a single character it should be faster.
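
For comparison, here is how the character-based variant of the benchmark loop would look (a minimal sketch, reusing the same ./2.txt file, LineIterator, and @timeit macro from the script above; parse_char is just an illustrative name):

function parse_char()
    file = LineIterator(open("./2.txt"))
    for line in file
        # a Char delimiter avoids constructing a Regex on every call
        split(line, '\t')
    end
end

@timeit parse_char() "parse_char"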

@philippbayer
Author

True, that cut the time needed in half:

$ julia parse.jl
8891.906023025513

It's still quite slow compared to Python, I guess because Python's library is written in C? I'm not sure about that.

@StefanKarpinski
Member

Python's I/O and split functions are in C and have been heavily optimized. I don't think anyone has looked at the performance of this particular aspect of Julia just yet. Clearly needs some work.

StefanKarpinski added a commit that referenced this issue Apr 2, 2012
Helps with #661. We're now 6x faster than before.
@philippbayer
Author

I can confirm this; after a git pull and make, on the 90 MB file:

$julia parse.jl
2445.7290172576904

Still slow compared to "the rest" but you're getting there :)

@StefanKarpinski
Member

Thanks. We still have a lot of work to do on I/O performance, clearly :-/. But this is the way to get there...

@timholy
Member

timholy commented Apr 8, 2012

Stefan, one of your fixes causes a bug on my system, running Version 0.0.0+1333781226.r28fd Commit 28fd168 (2012-04-07 01:47:06).

I defined two functions:

function splitnoregex(filename::String,splitstr::String)
    file = LineIterator(open(filename))
    for line in file
        split(line,splitstr)
    end
end

function splitregex(filename::String,splitstr::String)
    r = r"$splitstr"
    file = LineIterator(open(filename))
    for line in file
        split(line,r)
    end
end

Currently "splitnoregex" gives the following error (on two of my machines):

dlsym: julia: undefined symbol: u8_strwidth
 in splitnoregex at /home/tim/juliafunc/splittest.jl:4

For me, on a 130 MB XML file, I get:

julia> @time splitregex(fl,"\t")
elapsed time: 2.401522159576416 seconds

which seems more commensurate with @DrSnuggles's "fast" timings in other languages. I wonder if it's regex compilation that's taking up all the time here.

One solution, which admittedly has disadvantages as well as advantages, is to require a regex input for split (i.e., delete the string "convenience syntax").

@JeffBezanson
Member

Wait, both our @elapsed and Python's time.time return seconds, so there's no need to multiply the Julia time by 1000. We are actually 3x slower, not 3000x.

@Keno
Member

Keno commented Apr 11, 2012

Hmm, I'm wondering whether libuv gives any performance improvements here. What files did people use for these benchmarks (just randomly generated files delimited by \t?)

@JeffBezanson
Member

I just did csvwrite of a 100000x4 random matrix.

@StefanKarpinski
Member

Uh, that's nice. 3x slower is not bad — still ought to be improved, but it's not terrible. I've been using some 80 MB TSV file from my old job — a bunch of spell correction data.

@Keno
Member

Keno commented Apr 12, 2012

Unfortunately, libuv does not yet help much with single-file reading performance (it's being worked on upstream; I hadn't really looked at the file API until now). It is quite useful if you want to read multiple files at the same time (or do socket stuff, etc., concurrently), though. The original fdio-based methods work fine on Windows, so I'll store the fs stuff away in a separate branch and revisit it once the rest is completely done.

@JeffBezanson
Member

OK, after my latest commit I can get within 36% of Python by disabling the encoding check and making the tokens SubStrings instead of copying them out.
We could deal with the first by adding the ability to set the encoding of a stream (via, say, a TextStream{ASCIIString} type). We could deal with the second by making sure SubString is fast enough in general, or by adding slicing ability to all string types.
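
For illustration, a rough sketch of the SubString idea in split-like code (modern Julia syntax; split_views is a made-up name, not the actual implementation):

function split_views(line::String, delim::Char)
    tokens = SubString{String}[]    # views into line, no copies
    start = 1
    for i in eachindex(line)
        if line[i] == delim
            push!(tokens, SubString(line, start, prevind(line, i)))
            start = nextind(line, i)
        end
    end
    push!(tokens, SubString(line, start, lastindex(line)))
    return tokens
end

Each token shares the underlying bytes of line, so no per-token allocation and copy is needed.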

@StefanKarpinski
Member

The biggest problem I have with making all strings substrings is that now all string types need to have an offset field. Although I guess that's not the end of the world, it complicates things. The nice thing about the SubString type is that it doesn't care what kind of String it represents a substring of.

@JeffBezanson
Member

I think we can almost close this, but we should remove the performance trap of constructing a regex on every call when the string delimiter is longer than one character.

@StefanKarpinski
Member

Yeah, ok, I'll figure out how to do that. The viable options at this point seem to be:

  1. disallow literal strings for splitting, requiring the user to use a regex
  2. write code that looks for the literal string using memchr and then checks the following characters
  3. cache Regex objects the way Patrick's code for struct.jl is caching Struct types.

I'm going to take a crack at #2, but I think that #3 might also be a good idea. The cache can be visible, giving the programmer the ability to easily clear it if they want.
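
For concreteness, a minimal sketch of what option 3 could look like (hypothetical names in modern Julia syntax; not the code that eventually landed):

# hypothetical cache mapping string delimiters to compiled Regex objects
const REGEX_CACHE = Dict{String,Regex}()

function cached_regex(pattern::String)
    get!(REGEX_CACHE, pattern) do
        Regex(pattern)    # compile only the first time this pattern is seen
    end
end

split_cached(line, delim::String) = split(line, cached_regex(delim))

A real version would also need to escape regex metacharacters in the delimiter before compiling it, and since the cache is a visible const global it can still be emptied by anyone who wants to clear it.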

@pao
Member

pao commented Apr 14, 2012

strpack.jl 😄 You'll probably want to const the regex cache (I still need to do this in strpack). Python caches regexes just as it does Structs; I do believe the cache is a bounded-size LRU, though. That would be a good data structure to have.
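
For reference, a minimal sketch of a bounded-size LRU cache (illustrative only, in modern Julia syntax; not strpack's code, and the vector-based bookkeeping is deliberately simple rather than efficient):

# evicts the least recently used entry once maxsize is exceeded
mutable struct LRU{K,V}
    maxsize::Int
    data::Dict{K,V}
    order::Vector{K}    # keys ordered from least to most recently used
end

LRU{K,V}(maxsize::Int) where {K,V} = LRU{K,V}(maxsize, Dict{K,V}(), K[])

function Base.get!(f, lru::LRU{K,V}, key::K) where {K,V}
    if haskey(lru.data, key)
        # promote the key to most recently used
        deleteat!(lru.order, findfirst(==(key), lru.order))
        push!(lru.order, key)
        return lru.data[key]
    end
    if length(lru.order) >= lru.maxsize
        delete!(lru.data, popfirst!(lru.order))    # evict the oldest entry
    end
    val = f()
    lru.data[key] = val
    push!(lru.order, key)
    return val
end

Combined with the regex-cache sketch above, get!(() -> Regex(pattern), cache, pattern) would give bounded caching instead of an ever-growing Dict.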

@StefanKarpinski
Member

Ah, yes. That's a good idea for both. Making it const is no problem; you can still clear the data structure. Although if it's a bounded-size LRU, that probably won't be necessary unless you intentionally want to pessimize some code.

@pao
Member

pao commented Apr 14, 2012

Yeah, the only reason I hadn't const'd it yet is that it was easier to clear it by reloading during development, since I was reloading all the time anyway.

@StefanKarpinski
Member

For development, you can always stick a clear immediately after the const definition.

@StefanKarpinski
Member

By clear I mean del_all.

@JeffBezanson
Member

Yes, no caching here unless there's some LRU eviction.

@pao
Member

pao commented Apr 14, 2012

There goes my Saturday.

(I wrote an LRU with MATLAB Objects on top of containers.Map(). I don't think this will be as complicated.)

EDIT: I was wrong.

@StefanKarpinski
Member

Splitting on literal strings is now just as fast as splitting on a single character (except possibly in some pathological cases). The new split is based on the search() function, which can search a string for various kinds of patterns: a single character, a list of characters, a literal string, or a regex.
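
For example, with this change the following calls all go through the same search-based code path (an illustrative snippet; the search function was later renamed in Julia, but the split calls read the same):

line = "a\tb\tc"
split(line, '\t')     # single character
split(line, "\t")     # literal string, no per-call regex compilation
split(line, r"\t")    # explicit regex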
