
Tokenizer speed: 2.0 << 1.9 ?! #1371

Closed
thomasopsomer opened this issue Sep 27, 2017 · 9 comments
Labels
feat / tokenizer (Feature: Tokenizer), help wanted (Contributions welcome!), perf / speed (Performance: speed)

Comments

@thomasopsomer
Contributor

Hi,

I wanted to tokenize a dataset such as 20newsgroups, and I found spacy 2.0 to be quite slow. To be sure, I also tried spacy 1.9, and it was twice as fast! Actually, I did some speed analysis between v1 and v2 according to document length (in characters). It seems that v2 is more sensitive to document length, and its processing time is more volatile... Is this expected, due to new tokenizer features or the new machinery of v2?

[Screenshot: processing time per document vs. document length (in characters) for spaCy v1.9 and v2.0]

@honnibal
Member

Thanks for the analysis! There are some open questions about this on the TODO list for spaCy 2 stable: https://github.com/explosion/spaCy/projects/4

There are a few potential sources of problems that could be to blame for the regression here:

  1. More complex prefix_re?
  2. More complex suffix_re?
  3. More complex infix_finditer?
  4. token_match?
  5. Bad/incorrect lexeme caching?
  6. Less/incorrect tokenizer caching?
  7. Less efficient Vocab or StringStore performance?

The hope is that it's 1-4; 5 or 6 wouldn't be so bad either. If it's 7, that'll take some more work and might force some hard decisions.

We can mostly exclude 5 by setting nlp.tokenizer.vocab.lex_attr_getters = {}. This way we don't compute any of the string features. If the caching isn't working well, this will make a big difference. If it doesn't make much difference, it's unlikely to be about the lexeme caching.
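
A minimal sketch of that check, assuming a v2.x install (spacy.blank and the sample text below are just for illustration):

```python
import spacy

# Sketch: rule out lexeme-attribute computation/caching (point 5).
# Emptying lex_attr_getters means no string features are computed per lexeme.
nlp = spacy.blank("en")
nlp.tokenizer.vocab.lex_attr_getters = {}

# Call the tokenizer directly so nothing else in the pipeline runs.
doc = nlp.tokenizer("Some sample text to tokenize without lexeme attributes.")
print(len(doc))
```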

We can investigate 1-4 by assigning different functions to those attributes of the tokenizer. I think token_match is a very likely culprit, given the non-linearity you've identified.
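
As a sketch of how 1-4 could be isolated, one can rebuild the tokenizer with individual pieces swapped out (the spacy.util regex helpers and the Tokenizer constructor arguments below assume the v2 API; disabling token_match is just one example):

```python
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import (
    compile_infix_regex,
    compile_prefix_regex,
    compile_suffix_regex,
)

nlp = spacy.blank("en")

# Compile the default prefix/suffix/infix rules so they can be swapped
# individually for cheaper alternatives (or for the v1.9 regexes).
prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
infix_re = compile_infix_regex(nlp.Defaults.infixes)

nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules=nlp.Defaults.tokenizer_exceptions,
    prefix_search=prefix_re.search,
    suffix_search=suffix_re.search,
    infix_finditer=infix_re.finditer,
    token_match=None,  # e.g. drop token_match entirely to test hypothesis 4
)
```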

@ines ines added the performance and 🌙 nightly (Discussion and contributions related to nightly builds) labels Sep 27, 2017
@ines ines added the help wanted (Contributions welcome!) label Oct 13, 2017
@thomasopsomer
Contributor Author

thomasopsomer commented Oct 19, 2017

I made some more experiments on the first 5k texts of the 20newsgroups corpus, averaged over 10 iterations. Here's my script, btw: https://gist.github.com/thomasopsomer/5b044f86b9e8f1a327e409631360cc99
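
(For reference, a minimal sketch of this kind of measurement; the gist above is the actual script, and loading 20newsgroups via scikit-learn below is only an assumption for illustration.)

```python
import time

import spacy
from sklearn.datasets import fetch_20newsgroups

# Time the tokenizer alone on the first 5k 20newsgroups texts.
texts = fetch_20newsgroups(subset="train").data[:5000]
nlp = spacy.blank("en")  # v2-style setup; v1.9 is loaded differently

per_doc = []
start = time.time()
for text in texts:
    t0 = time.time()
    nlp.tokenizer(text)
    per_doc.append(time.time() - t0)
total = time.time() - start

print("Processing time: %.2f s" % total)
print("Avg time per doc: %.4f s" % (sum(per_doc) / len(per_doc)))
print("Max time per doc: %.2f s" % max(per_doc))
```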

  • The default tokenizers of both versions give the following timings:
| Metric | 2.0 | 1.9 |
| --- | --- | --- |
| Avg processing time (s) | 12.28 | 8.56 |
| Avg time per doc (s) | 0.0024 | 0.0017 |
| Avg max time per doc (s) | 0.75 | 0.59 |
  • Setting exactly the same prefix_re, suffix_re and infix_finditer for both versions, using the regexes from spacy 1.9 and removing exceptions (rules={}). It's not conclusive, but maybe the v2 regexes hurt performance a bit...
| Metric | 2.0 | 1.9 |
| --- | --- | --- |
| Avg processing time (s) | 10.81 | 7.53 |
| Avg time per doc (s) | 0.0021 | 0.0015 |
| Avg max time per doc (s) | 0.70 | 0.58 |
  • As suggested, I tested 5. using nlp.tokenizer.vocab.lex_attr_getters. It seems that some of the performance loss might be related to caching, as setting lex_attr_getters = {} decreases the time by ~2s in 1.9 but by ~4s in v2! (see below):

    • with: nlp.tokenizer.vocab.lex_attr_getters = {}
| Metric | 2.0 | 1.9 |
| --- | --- | --- |
| Avg processing time (s) | 8.07 | 6.35 |
| Avg time per doc (s) | 0.0016 | 0.0013 |
| Avg max time per doc (s) | 0.49 | 0.47 |

I wanted to test v2 with the change from #1411, but I didn't manage to build the develop branch ^^

@ines ines removed the 🌙 nightly (Discussion and contributions related to nightly builds) label Nov 9, 2017
@honnibal
Member

There has been a problem with the cache in the tokenizer. But even with the fix, the v2 tokenizer is still very slow. Working on this.

@phdowling

Is this still a known issue? It seems like the tokenizer is quite slow by default, even when called with pipe(). Should I be adding my own multiprocessing around it?
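
(For context, one way to isolate the tokenizer from the rest of the pipeline is to call it directly; whether that matches the setup described above is an assumption.)

```python
import spacy

nlp = spacy.blank("en")
texts = ["First document.", "Second document."]

# Tokenizer.pipe streams Docs without running the tagger/parser/NER,
# which separates pure tokenization cost from the rest of nlp.pipe().
docs = list(nlp.tokenizer.pipe(texts, batch_size=1000))
```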

@rulai-huajunzeng

rulai-huajunzeng commented Jan 8, 2019

I did some experiments today to test the performance of the tokenizer only. It looks like spacy 2.x is still somewhat slower than spacy 1.x. Also, surprisingly, spacy 2.x under Python 3.6 is even twice as slow as spacy 2.x under Python 2.7. @honnibal can you help look into why performance under Python 3.6 is not so good?

py27_spacy1: 4189739 tokens, 404279.399319 WPS
py27_spacy2: 4191479 tokens, 297504.391077 WPS
py36_spacy1: 4189739 tokens, 416148.866741 WPS
py36_spacy2: 4191479 tokens, 149588.291103 WPS

Environment:
Machine: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz x 8
OS: Ubuntu 16.04.1
Python version: Python 2.7.15 or Python 3.6.8 :: Anaconda
spaCy version: 1.10.1 or 2.0.18
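
(A sketch of how such a tokens-per-second number can be computed; the corpus file below is hypothetical, since the actual corpus is private.)

```python
import io
import time

import spacy

nlp = spacy.blank("en")

# Hypothetical corpus file: one or more sentences per line.
with io.open("corpus.txt", encoding="utf8") as f:
    lines = f.read().splitlines()

n_tokens = 0
start = time.time()
for doc in nlp.tokenizer.pipe(lines, batch_size=1000):
    n_tokens += len(doc)
elapsed = time.time() - start

print("%d tokens, %.1f WPS" % (n_tokens, n_tokens / elapsed))
```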

@svlandeg
Member

svlandeg commented Jan 9, 2019

Hi @rulai-huajunzeng, I'm currently looking into improving the compilation of regular expressions in the tokenizer, with a focus on speed. We're definitely aiming to substantially improve upon the WPS stats. Which corpus did you do the above tests on?

@ines
Member

ines commented Jan 9, 2019

Merging this thread with the master thread in #1642!

@ines ines closed this as completed Jan 9, 2019
@rulai-huajunzeng

@svlandeg glad to know that you are working on that. I used a personal corpus which I cannot share. It has more than 300K lines of text, and each line contains one or several sentences.

@lock

lock bot commented Feb 8, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Feb 8, 2019