
Tokenizer speed: 2.0 << 1.9 ?! #1371

Closed
thomasopsomer opened this issue Sep 27, 2017 · 9 comments
Labels
feat / tokenizer (Feature: Tokenizer), help wanted (Contributions welcome!), perf / speed (Performance: speed)

Comments

@thomasopsomer
Contributor

Hi,

I wanted to tokenize a dataset such as 20newsgroups, and I found spacy 2.0 to be quite slow. To be sure, I also tried spacy 1.9, and it was twice as fast! Actually, I did some speed analysis between v1 and v2 according to document length (in characters). It seems that v2 is more sensitive to document length, and its processing time is more volatile... Is this expected, due to new tokenizer features or the new machinery of v2?

[Screenshot: processing time per document vs. document length (in characters) for spaCy v1.9 and v2.0]

@honnibal
Member

Thanks for the analysis! There are some open questions about this on the TODO list for spaCy 2 stable: https://github.com/explosion/spaCy/projects/4

There are a few potential sources of problems that could be to blame for the regression here:

  1. More complex prefix_re?
  2. More complex suffix_re?
  3. More complex infix_finditer?
  4. token_match?
  5. Bad/incorrect lexeme caching?
  6. Less/incorrect tokenizer caching?
  7. Less efficient Vocab or StringStore performance?

The hope is that it's 1-4; 5 or 6 wouldn't be so bad either. If it's 7, that'll take some more work and might force some hard decisions.

We can mostly exclude 5 by setting nlp.tokenizer.vocab.lex_attr_getters = {}. This way we don't compute any of the string features. If the caching isn't working well, this will make a big difference. If it doesn't make much difference, it's unlikely to be about the lexeme caching.
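
A minimal sketch of that check, assuming a v2.x install (spacy.blank and the sample text below are just for illustration):

```python
import spacy

# Sketch: rule out lexeme-attribute computation/caching (point 5).
# Emptying lex_attr_getters means no string features are computed per lexeme.
nlp = spacy.blank("en")
nlp.tokenizer.vocab.lex_attr_getters = {}

# Call the tokenizer directly so nothing else in the pipeline runs.
doc = nlp.tokenizer("Some sample text to tokenize without lexeme attributes.")
print(len(doc))
```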

We can investigate 1-4 by assigning different functions to those attributes of the tokenizer. I think token_match is a very likely culprit, given the non-linearity you've identified.
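
As a sketch of how 1-4 could be isolated, one can rebuild the tokenizer with individual pieces swapped out (the spacy.util regex helpers and the Tokenizer constructor arguments below assume the v2 API; disabling token_match is just one example):

```python
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import (
    compile_infix_regex,
    compile_prefix_regex,
    compile_suffix_regex,
)

nlp = spacy.blank("en")

# Compile the default prefix/suffix/infix rules so they can be swapped
# individually for cheaper alternatives (or for the v1.9 regexes).
prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
infix_re = compile_infix_regex(nlp.Defaults.infixes)

nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules=nlp.Defaults.tokenizer_exceptions,
    prefix_search=prefix_re.search,
    suffix_search=suffix_re.search,
    infix_finditer=infix_re.finditer,
    token_match=None,  # e.g. drop token_match entirely to test hypothesis 4
)
```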

@ines ines added the performance and 🌙 nightly (Discussion and contributions related to nightly builds) labels Sep 27, 2017
@ines ines added the help wanted (Contributions welcome!) label Oct 13, 2017
@thomasopsomer
Contributor Author

thomasopsomer commented Oct 19, 2017

I made some more experiments on the first 5k texts of the 20newsgroups corpus, averaged over 10 iterations. Here's my script, btw: https://gist.github.com/thomasopsomer/5b044f86b9e8f1a327e409631360cc99
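
(For reference, a minimal sketch of this kind of measurement; the gist above is the actual script, and loading 20newsgroups via scikit-learn below is only an assumption for illustration.)

```python
import time

import spacy
from sklearn.datasets import fetch_20newsgroups

# Time the tokenizer alone on the first 5k 20newsgroups texts.
texts = fetch_20newsgroups(subset="train").data[:5000]
nlp = spacy.blank("en")  # v2-style setup; v1.9 is loaded differently

per_doc = []
start = time.time()
for text in texts:
    t0 = time.time()
    nlp.tokenizer(text)
    per_doc.append(time.time() - t0)
total = time.time() - start

print("Processing time: %.2f s" % total)
print("Avg time per doc: %.4f s" % (sum(per_doc) / len(per_doc)))
print("Max time per doc: %.2f s" % max(per_doc))
```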

  • The default tokenizers of both versions give the following timings:
| Metric | 2.0 | 1.9 |
| --- | --- | --- |
| Avg processing time (s) | 12.28 | 8.56 |
| Avg time per doc (s) | 0.0024 | 0.0017 |
| Avg max time per doc (s) | 0.75 | 0.59 |
  • Setting exactly the same prefix_re, suffix_re and infix_finditer for both versions, using the regexes from spacy 1.9 and removing exceptions (rules={}). It's not conclusive, but maybe the v2 regexes hurt performance a bit...
| Metric | 2.0 | 1.9 |
| --- | --- | --- |
| Avg processing time (s) | 10.81 | 7.53 |
| Avg time per doc (s) | 0.0021 | 0.0015 |
| Avg max time per doc (s) | 0.70 | 0.58 |
  • As suggested, I tested 5. using nlp.tokenizer.vocab.lex_attr_getters. It seems that some of the performance loss might be related to caching, as setting lex_attr_getters = {} decreases the time by ~2s in 1.9 but by ~4s in v2! (see below):

    • with: nlp.tokenizer.vocab.lex_attr_getters = {}
| Metric | 2.0 | 1.9 |
| --- | --- | --- |
| Avg processing time (s) | 8.07 | 6.35 |
| Avg time per doc (s) | 0.0016 | 0.0013 |
| Avg max time per doc (s) | 0.49 | 0.47 |

I wanted to test v2 with the change from #1411, but I didn't manage to build the develop branch ^^

@ines ines removed the 🌙 nightly (Discussion and contributions related to nightly builds) label Nov 9, 2017
@honnibal
Member

There has been a problem with the cache in the tokenizer. But even with the fix, the v2 tokenizer is still very slow. Working on this.

@phdowling

Is this still a known issue? It seems like the tokenizer is quite slow by default, even when called with pipe(). Should I be adding my own multiprocessing around it?
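
(For context, one way to isolate the tokenizer from the rest of the pipeline is to call it directly; whether that matches the setup described above is an assumption.)

```python
import spacy

nlp = spacy.blank("en")
texts = ["First document.", "Second document."]

# Tokenizer.pipe streams Docs without running the tagger/parser/NER,
# which separates pure tokenization cost from the rest of nlp.pipe().
docs = list(nlp.tokenizer.pipe(texts, batch_size=1000))
```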

@rulai-huajunzeng

rulai-huajunzeng commented Jan 8, 2019

I did some experiments today to test the performance of the tokenizer only. It looks like spacy 2.x is still somewhat slower than spacy 1.x. Also, surprisingly, spacy 2.x under Python 3.6 is even twice as slow as spacy 2.x under Python 2.7. @honnibal can you help look into why performance under Python 3.6 is not so good?

py27_spacy1: 4189739 tokens, 404279.399319 WPS
py27_spacy2: 4191479 tokens, 297504.391077 WPS
py36_spacy1: 4189739 tokens, 416148.866741 WPS
py36_spacy2: 4191479 tokens, 149588.291103 WPS

Environment:
Machine: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz x 8
OS: Ubuntu 16.04.1
Python version: Python 2.7.15 or Python 3.6.8 :: Anaconda
spaCy version: 1.10.1 or 2.0.18
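
(A sketch of how such a tokens-per-second number can be computed; the corpus file below is hypothetical, since the actual corpus is private.)

```python
import io
import time

import spacy

nlp = spacy.blank("en")

# Hypothetical corpus file: one or more sentences per line.
with io.open("corpus.txt", encoding="utf8") as f:
    lines = f.read().splitlines()

n_tokens = 0
start = time.time()
for doc in nlp.tokenizer.pipe(lines, batch_size=1000):
    n_tokens += len(doc)
elapsed = time.time() - start

print("%d tokens, %.1f WPS" % (n_tokens, n_tokens / elapsed))
```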

@svlandeg
Member

svlandeg commented Jan 9, 2019

Hi @rulai-huajunzeng, I'm currently looking into improving the compilation of regular expressions in the tokenizer, with a focus on speed. We're definitely aiming to substantially improve upon the WPS stats. Which corpus did you do the above tests on?

@ines
Member

ines commented Jan 9, 2019

Merging this thread with the master thread in #1642!

@ines ines closed this as completed Jan 9, 2019
@rulai-huajunzeng

@svlandeg glad to know that you are working on that. I used a personal corpus which I cannot share. It has more than 300K lines of text, and each line contains one or several sentences.

@lock

lock bot commented Feb 8, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Feb 8, 2019