
TSV parsing is extremely slow #661

Closed
philippbayer opened this issue Apr 2, 2012 · 24 comments
Labels
performance Must go faster

Comments

@philippbayer

I was just playing around comparing Julia to Python for iterating over tab-delimited files, and tried it on a 90 MB table and on an 800 MB table:

On the 90 MB file:
$ python parse.py
0.797344923019

$ julia parse.jl
16789.66212272644

On the 800 MB file:
$ python parse.py
6.59492301941

$ julia parse.jl
129588.43898773193

Here's the code for both the Python and the Julia implementation, based on the benchmarks on the main page:

import time

def parse():
    file_handle = open("./2.txt")
    for line in file_handle:
        line = line.split("\t")

tmin = float("inf")
for i in range(5):
    t = time.time()
    parse()
    t = time.time()-t
    if t < tmin:
        tmin = t

print tmin

Julia:

macro timeit(ex,name)
    quote
        t = Inf
        for i=1:5
            t = min(t, @elapsed $ex)
        end
        println(t*1000)
    end
end

function parse()
    file = LineIterator(open("./2.txt"))
    for line in file
        split(line, "\t")
    end
end

@timeit parse() "parse"

Why is this?

@StefanKarpinski
Member

This is definitely a performance bug. We'll look into it. Thanks for reporting.

@JeffBezanson
Member

What's happening is that our split with a string delimiter uses a regex search to do the splitting, and the regex is recompiled for every line. Our bad. Try split(line, '\t'); with a single character it should be faster.
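
For comparison, here is how the character-based variant of the benchmark loop would look (a minimal sketch, reusing the same ./2.txt file, LineIterator, and @timeit macro from the script above; parse_char is just an illustrative name):

function parse_char()
    file = LineIterator(open("./2.txt"))
    for line in file
        # a Char delimiter avoids constructing a Regex on every call
        split(line, '\t')
    end
end

@timeit parse_char() "parse_char"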

@philippbayer
Author

True, that cut the time needed in half:

$ julia parse.jl
8891.906023025513

It's still quite slow compared to Python, I guess because Python's library is written in C? I'm not sure about that.

@StefanKarpinski
Member

Python's I/O and split functions are in C and have been heavily optimized. I don't think anyone has looked at the performance of this particular aspect of Julia just yet. Clearly needs some work.

StefanKarpinski added a commit that referenced this issue Apr 2, 2012
Helps with #661. We're now 6x faster than before.
@philippbayer
Author

I can confirm this; after a git pull and make, on the 90 MB file:

$julia parse.jl
2445.7290172576904

Still slow compared to "the rest" but you're getting there :)

@StefanKarpinski
Member

Thanks. We still have a lot of work to do on I/O performance, clearly :-/. But this is the way to get there...

@timholy
Member

timholy commented Apr 8, 2012

Stefan, one of your fixes causes a bug on my system, running Version 0.0.0+1333781226.r28fd Commit 28fd168 (2012-04-07 01:47:06).

I defined two functions:

function splitnoregex(filename::String,splitstr::String)
    file = LineIterator(open(filename))
    for line in file
        split(line,splitstr)
    end
end

function splitregex(filename::String,splitstr::String)
    r = r"$splitstr"
    file = LineIterator(open(filename))
    for line in file
        split(line,r)
    end
end

Currently "splitnoregex" gives the following error (on two of my machines):

dlsym: julia: undefined symbol: u8_strwidth
 in splitnoregex at /home/tim/juliafunc/splittest.jl:4

For me, on a 130 MB XML file, I get:

julia> @time splitregex(fl,"\t")
elapsed time: 2.401522159576416 seconds

which seems more commensurate with @DrSnuggles's "fast" timings in other languages. I wonder if it's regex compilation that's taking up all the time here.

One solution, which admittedly has disadvantages as well as advantages, is to require a regex input for split (i.e., delete the string "convenience syntax").

@JeffBezanson
Member

Wait, both our @elapsed and Python's time.time return seconds, so there's no need to multiply the Julia time by 1000. We are actually 3x slower, not 3000x.

@Keno
Member

Keno commented Apr 11, 2012

Hmm, I'm wondering whether libuv gives any performance improvements here. What files did people use for these benchmarks (just randomly generated files delimited by \t?)

@JeffBezanson
Member

I just did csvwrite of a 100000x4 random matrix.

@StefanKarpinski
Member

Uh, that's nice. 3x slower is not bad — still ought to be improved, but it's not terrible. I've been using some 80 MB TSV file from my old job — a bunch of spell correction data.

@Keno
Member

Keno commented Apr 12, 2012

Unfortunately, libuv does not yet help much with single-file reading performance (it's being worked on upstream; I hadn't really looked at the file API until now). It is quite useful if you want to read multiple files at the same time (or do socket stuff, etc., concurrently), though. The original fdio-based methods work fine on Windows, so I'll store the fs stuff away in a separate branch and revisit it once the rest is completely done.

@JeffBezanson
Member

OK, after my latest commit I can get within 36% of Python by disabling the encoding check and making the tokens SubStrings instead of copying them out.
We could deal with the first by adding the ability to set the encoding of a stream (via, say, a TextStream{ASCIIString} type). We could deal with the second by making sure SubString is fast enough in general, or by adding slicing ability to all string types.
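
For illustration, a rough sketch of the SubString idea in split-like code (modern Julia syntax; split_views is a made-up name, not the actual implementation):

function split_views(line::String, delim::Char)
    tokens = SubString{String}[]    # views into line, no copies
    start = 1
    for i in eachindex(line)
        if line[i] == delim
            push!(tokens, SubString(line, start, prevind(line, i)))
            start = nextind(line, i)
        end
    end
    push!(tokens, SubString(line, start, lastindex(line)))
    return tokens
end

Each token shares the underlying bytes of line, so no per-token allocation and copy is needed.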

@StefanKarpinski
Member

The biggest problem I have with making all strings substrings is that now all string types need to have an offset field. Although I guess that's not the end of the world, it complicates things. The nice thing about the SubString type is that it doesn't care what kind of String it represents a substring of.

@JeffBezanson
Member

I think we can almost close this, but we should remove the performance trap of constructing a regex on every call when the string delimiter is longer than one character.

@StefanKarpinski
Member

Yeah, ok, I'll figure out how to do that. The viable options at this point seem to be:

  1. disallow literal strings for splitting, requiring the user to use a regex
  2. write code that looks for the literal string using memchr and then checks the following characters
  3. cache Regex objects the way Patrick's code for struct.jl is caching Struct types.

I'm going to take a crack at #2, but I think that #3 might also be a good idea. The cache can be visible, giving the programmer the ability to easily clear it if they want.
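
For concreteness, a minimal sketch of what option 3 could look like (hypothetical names in modern Julia syntax; not the code that eventually landed):

# hypothetical cache mapping string delimiters to compiled Regex objects
const REGEX_CACHE = Dict{String,Regex}()

function cached_regex(pattern::String)
    get!(REGEX_CACHE, pattern) do
        Regex(pattern)    # compile only the first time this pattern is seen
    end
end

split_cached(line, delim::String) = split(line, cached_regex(delim))

A real version would also need to escape regex metacharacters in the delimiter before compiling it, and since the cache is a visible const global it can still be emptied by anyone who wants to clear it.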

@pao
Member

pao commented Apr 14, 2012

strpack.jl 😄 You'll probably want to const the regex cache (I still need to do this in strpack). Python caches regexes just as it does Structs; I do believe the cache is a bounded-size LRU, though. That would be a good data structure to have.
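
For reference, a minimal sketch of a bounded-size LRU cache (illustrative only, in modern Julia syntax; not strpack's code, and the vector-based bookkeeping is deliberately simple rather than efficient):

# evicts the least recently used entry once maxsize is exceeded
mutable struct LRU{K,V}
    maxsize::Int
    data::Dict{K,V}
    order::Vector{K}    # keys ordered from least to most recently used
end

LRU{K,V}(maxsize::Int) where {K,V} = LRU{K,V}(maxsize, Dict{K,V}(), K[])

function Base.get!(f, lru::LRU{K,V}, key::K) where {K,V}
    if haskey(lru.data, key)
        # promote the key to most recently used
        deleteat!(lru.order, findfirst(==(key), lru.order))
        push!(lru.order, key)
        return lru.data[key]
    end
    if length(lru.order) >= lru.maxsize
        delete!(lru.data, popfirst!(lru.order))    # evict the oldest entry
    end
    val = f()
    lru.data[key] = val
    push!(lru.order, key)
    return val
end

Combined with the regex-cache sketch above, get!(() -> Regex(pattern), cache, pattern) would give bounded caching instead of an ever-growing Dict.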

@StefanKarpinski
Member

Ah, yes. That's a good idea for both. Making it const is no problem; you can still clear the data structure. Although if it's a bounded-size LRU, that probably won't be necessary unless you intentionally want to pessimize some code.

@pao
Member

pao commented Apr 14, 2012

Yeah, the only reason I hadn't const'd it yet is that it was easier to clear it by reloading during development, since I was reloading all the time anyway.

@StefanKarpinski
Member

For development, you can always stick a clear immediately after the const definition.

@StefanKarpinski
Member

By clear I mean del_all.

@JeffBezanson
Member

Yes, no caching here unless there's some LRU eviction.

@pao
Member

pao commented Apr 14, 2012

There goes my Saturday.

(I wrote an LRU with MATLAB Objects on top of containers.Map(). I don't think this will be as complicated.)

EDIT: I was wrong.

@StefanKarpinski
Member

Splitting on literal strings is now just as fast as splitting on a single character (except possibly in some pathological cases). The new split is based on the search() function, which can search a string for various kinds of patterns: a single character, a list of characters, a literal string, or a regex.
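
For example, with this change the following calls all go through the same search-based code path (an illustrative snippet; the search function was later renamed in Julia, but the split calls read the same):

line = "a\tb\tc"
split(line, '\t')     # single character
split(line, "\t")     # literal string, no per-call regex compilation
split(line, r"\t")    # explicit regex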
