Tokenizer speed: 2.0 << 1.9 ?! #1371
Thanks for the analysis! There are some open questions about this on the TODO list for spaCy 2 stable: https://github.com/explosion/spaCy/projects/4. There are a few sources of potential problems that could be to blame for the regression here: …

The hope is that it's 1-4; 5-6 wouldn't be so bad either. If it's 7, that will take some more work and might force some hard decisions. We can mostly exclude 5 by setting … We can investigate 1-4 by assigning different functions to those attributes of the tokenizer. I think …
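As a minimal sketch of that investigation (assuming spaCy v2 with the `en` model installed; `sample.txt` is a hypothetical input file), one could swap a no-op function into one of the tokenizer's regex attributes and compare timings:

```python
import time

import spacy

def avg_time(tokenize, text, n=5):
    # Average wall-clock time over n runs of the tokenizer on one text.
    start = time.time()
    for _ in range(n):
        tokenize(text)
    return (time.time() - start) / n

text = open('sample.txt').read()  # hypothetical sample document

nlp = spacy.load('en')
baseline = avg_time(nlp.tokenizer, text)

# Reload to reset the tokenizer's internal cache between measurements,
# then replace the infix regex with a function that never matches, so
# infix splitting is skipped entirely.
nlp = spacy.load('en')
nlp.tokenizer.infix_finditer = lambda string: iter([])
no_infix = avg_time(nlp.tokenizer, text)

print('baseline: %.4fs  without infixes: %.4fs' % (baseline, no_infix))
```

The same pattern would apply to `prefix_search`, `suffix_search`, and `token_match`, replacing each in turn to see how much it contributes.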
I made some more experiments, on the first 5k texts of the 20newsgroups corpus, averaged over 10 iterations. Here's my script, btw: https://gist.github.com/thomasopsomer/5b044f86b9e8f1a327e409631360cc99

I wanted to test v2 with the change from #1411, but didn't manage to build the develop branch ^^
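For reference, a rough re-creation of that kind of benchmark (not the linked gist itself; it assumes scikit-learn for the 20newsgroups data and times whichever spaCy version is installed):

```python
import time

import spacy
from sklearn.datasets import fetch_20newsgroups

texts = fetch_20newsgroups(subset='train').data[:5000]
nlp = spacy.load('en')

n_iter = 10
start = time.time()
for _ in range(n_iter):
    for text in texts:
        nlp.tokenizer(text)
print('%.2fs per pass over %d texts'
      % ((time.time() - start) / n_iter, len(texts)))
```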
There has been a problem with the cache in the tokenizer. But even with the fix, the v2 tokenizer is still very slow. Working on this.
Is this still a known issue? It seems like the tokenizer is quite slow by default, even when called with pipe(). Should I be adding my own multiprocessing around it?
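For what it's worth, a sketch of one way to do that (illustrative only: the chunk size and worker count are made-up values, and each worker loads its own copy of the model):

```python
import multiprocessing

import spacy

nlp = None  # loaded once per worker process

def init_worker():
    global nlp
    nlp = spacy.load('en')

def tokenize_chunk(chunk):
    # Return plain strings: Doc objects are expensive to send back
    # across process boundaries.
    return [[token.text for token in doc]
            for doc in nlp.tokenizer.pipe(chunk)]

if __name__ == '__main__':
    texts = ['This is a placeholder document.'] * 10000  # your corpus here
    chunks = [texts[i:i + 1000] for i in range(0, len(texts), 1000)]
    pool = multiprocessing.Pool(4, initializer=init_worker)
    tokenized = [tokens for chunk in pool.map(tokenize_chunk, chunks)
                 for tokens in chunk]
    pool.close()
    pool.join()
```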
I did some experiments today to test the performance of the tokenizer only. It looks like spaCy 2.x is still somewhat slower than spaCy 1.x. Also, surprisingly, spaCy 2.x under Python 3.6 is even twice as slow as spaCy 2.x under Python 2.7. @honnibal can you help look into why performance under Python 3.6 is not so good?

Environment: …
Hi @rulai-huajunzeng, I'm currently looking into improving the compilation of regular expressions in the tokenizer, with a focus on speed. We're definitely aiming to substantially improve upon the WPS stats. Which corpus did you do the above tests on?
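For anyone reproducing these numbers, a small illustrative helper (not from the thread) for computing a words-per-second figure:

```python
import time

def words_per_second(tokenizer, texts):
    # Tokens produced per wall-clock second over the whole corpus.
    start = time.time()
    n_words = sum(len(doc) for doc in tokenizer.pipe(texts))
    return n_words / (time.time() - start)

# e.g. print(words_per_second(nlp.tokenizer, texts))
```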
Merging this thread with the master thread in #1642!
@svlandeg glad to know that you are working on that. I used a personal corpus which I can't share. It has more than 300K lines of text, and each line contains one or several sentences.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
Hi,
I wanted to tokenize a dataset such as 20newsgroups, and I found spaCy 2.0 to be quite slow. To be sure, I also tried spaCy 1.9, and it was twice as fast! I did some speed analysis comparing v1 and v2 by document length (in characters). It seems that v2 is more sensitive to document length, and its processing time is more volatile. Is this expected, due to some new tokenizer features or the new machinery of v2?
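A hedged sketch of the per-length analysis described above (the bucket width is an arbitrary choice): bucket documents by character count, then compare mean tokenization time per bucket across the two versions:

```python
import time
from collections import defaultdict

def time_by_length(tokenizer, texts, bucket_size=500):
    # Mean tokenization time per character-length bucket.
    buckets = defaultdict(list)
    for text in texts:
        start = time.time()
        tokenizer(text)
        buckets[(len(text) // bucket_size) * bucket_size].append(
            time.time() - start)
    return {length: sum(times) / len(times)
            for length, times in sorted(buckets.items())}
```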