TSV parsing is extremely slow #661
This is definitely a performance bug. We'll look into it. Thanks for reporting.
What's happening is that our
True, that cut the time needed in half (`$ julia parse.jl`). It's still quite slow in comparison to Python; I guess that's because Python's library is written in C? Unsure about that.
Python's I/O and split functions are in C and have been heavily optimized. I don't think anyone has looked at the performance of this particular aspect of Julia just yet. Clearly needs some work.
Helps with #661. We're now 6x faster than before.
I can confirm this after a `git pull` and `make`, on the 90 MB file (`$ julia parse.jl`). Still slow compared to "the rest", but you're getting there :)
Thanks. We still have a lot of work to do on I/O performance, clearly :-/. But this is the way to get there...
Stefan, one of your fixes causes a bug on my system, running Version 0.0.0+1333781226.r28fd Commit 28fd168 (2012-04-07 01:47:06). I defined two functions:
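The definitions presumably looked something like this; the bodies below are guesses in current syntax, and only the names come from the comment:

```julia
# Hypothetical reconstructions -- only the function names are from the
# comment above, the bodies are assumed.

# Split every line of a file on a regex delimiter; each call goes
# through the regex engine.
function splitregex(filename::String)
    for line in eachline(filename)
        split(line, r"\t")
    end
end

# Split every line on a plain character delimiter; no regex involved.
function splitnoregex(filename::String)
    for line in eachline(filename)
        split(line, '\t')
    end
end
```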
Currently "splitnoregex" gives the following error (on two of my machines):
For me, on a 130 MB XML file, I get:
which seems more commensurate with @DrSnuggles's "fast" timings using other languages. I wonder if it's regex compilation that is taking up all the time here. One solution, which admittedly has disadvantages as well as advantages, is to require a regex input for split (i.e., drop the string convenience syntax).
Wait, both our
Hmm, I'm wondering whether libuv gives any performance improvements here. What files did people use to get these benchmarks (just randomly generated files delimited by `\t`)?
I just did `csvwrite` of a 100000x4 random matrix.
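For anyone reproducing that today: `csvwrite` is long gone, but `writedlm` from the `DelimitedFiles` stdlib writes the same kind of file, with tab as its default delimiter (the filename here is made up):

```julia
using DelimitedFiles

# A 100000x4 random matrix written as tab-delimited text, roughly what
# the csvwrite call above produced (modulo the delimiter).
writedlm("bench.tsv", rand(100_000, 4))
```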
Uh, that's nice. 3x slower is not bad; it still ought to be improved, but it's not terrible. I've been using some 80 MB TSV file from my old job, a bunch of spell-correction data.
Unfortunately libuv does not help much with single-file reading performance yet (it's being worked on upstream; I didn't really look at the file API until now). If you want to read multiple files at the same time (or do socket stuff, etc. at the same time), it's quite useful though. The original fdio-based methods work fine on Windows, so I'll store the fs stuff away in a separate branch and will revisit it once the rest is completely done.
OK, after my latest commit I can get within 36% of Python by disabling the encoding check and making the tokens `SubString`s instead of copying them out.
The biggest problem I have with making all strings substrings is that now all string types need to have an offset field. Although I guess that's not the end of the world, it complicates things. The nice thing about the SubString type is that it doesn't care what kind of String it represents a substring of.
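Roughly what that looks like, as a minimal sketch under assumed details (not the actual patch): tokens become `SubString` views into the parent line instead of fresh copies.

```julia
# Sketch: copy-free tokenization with SubString views. Each token
# shares the parent line's bytes rather than copying them out.
function tokenize(line::String, delim::Char)
    tokens = SubString{String}[]
    start = 1
    for i in eachindex(line)
        if line[i] == delim
            push!(tokens, SubString(line, start, prevind(line, i)))
            start = nextind(line, i)
        end
    end
    push!(tokens, SubString(line, start, lastindex(line)))
    return tokens
end
```

Current Julia's `split` returns `SubString`s for exactly this reason.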
I think we can almost close this, but we should remove the performance trap of constructing a regex on every call with a multi-character string delimiter.
Yeah, ok, I'll figure out how to do that. The viable options at this point seem to be:
I'm going to take a crack at #2, but I think that #3 might also be a good idea. The cache can be visible, giving the programmer the ability to easily clear it if they want. |
Ah, yes. That's a good idea for both. Making it `const` is no problem; you can still clear the data structure. Although if it's a bounded-size LRU, that probably won't be necessary unless you intentionally want to pessimize some code.
Yeah, the only reason I hadn't const'd it yet is because it was easier to clear it by reloading during development, since I was reloading all the time anyways. |
For development, you can always stick a clear immediately after the `const` definition.
By clear I mean `del_all`.
Yes, no caching here unless there's some LRU eviction. |
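A sketch of the cache shape being discussed, with invented names; note this version is unbounded, i.e. exactly the case that wants the LRU eviction mentioned above:

```julia
# Hypothetical compiled-regex cache along the lines discussed above.
# `const` fixes the binding; the Dict's contents can still be mutated.
const REGEX_CACHE = Dict{String,Regex}()

# Fetch the regex for a literal delimiter, compiling it on first use.
# \Q...\E makes PCRE treat the delimiter as a literal string.
cached_regex(delim::String) =
    get!(() -> Regex("\\Q" * delim * "\\E"), REGEX_CACHE, delim)

split_cached(s::AbstractString, delim::String) = split(s, cached_regex(delim))

# Clearing it is one call (the era's del_all is today's empty!):
# empty!(REGEX_CACHE)
```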
There goes my Saturday. (I wrote an LRU with MATLAB objects on top of `containers.Map()`. I don't think this will be as complicated.) EDIT: I was wrong.
Splitting on literal strings is now just as fast as splitting on a single character (except possibly in some pathological cases). The new split is based on the
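For illustration, a regex-free literal split in current Julia; a sketch under assumed details, not the actual commit:

```julia
# Sketch: split on a literal, non-empty delimiter using plain substring
# search (findnext) -- no regex machinery involved.
function split_literal(s::String, delim::String)
    parts = SubString{String}[]
    start = 1
    while (r = findnext(delim, s, start)) !== nothing
        push!(parts, SubString(s, start, prevind(s, first(r))))
        start = nextind(s, last(r))
    end
    push!(parts, SubString(s, start, lastindex(s)))
    return parts
end
```

(Modern `split` with a string delimiter already takes a plain-search path like this.)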
Was just playing around comparing Julia to Python when it comes to iterating over tab-delimited files; tried it on a 90 MB table and on an 800 MB table:
On the 90 MB file:

```
$ python parse.py
0.797344923019
$ julia parse.jl
16789.66212272644
```
On the 800 MB file:

```
$ python parse.py
6.59492301941
$ julia parse.jl
129588.43898773193
```
Here's the code for both the Python and Julia implementations, based on the benchmarks on the main page.

Julia:
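(Only the Julia side is sketched below, as a guessed reconstruction; the filename, the field handling, and the milliseconds assumption are all illustrative, not the original script.)

```julia
# Guessed reconstruction of parse.jl: iterate a tab-delimited file
# line by line and split each line on tabs.
function parsefile(filename::String)
    n = 0
    for line in eachline(filename)
        n += length(split(line, '\t'))
    end
    return n
end

parsefile("table.tsv")  # warm-up run; "table.tsv" is a placeholder name
t = @elapsed parsefile("table.tsv")
println(t * 1000)       # milliseconds, presumably the unit in the numbers above
```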
Why is this?