v2 standard pipeline running 10x slower #1508
If you switch to nlp.pipe() the throughput should be better. Overall the v2 models are cheaper to run, because you can use them on smaller instances. However, it's true that the speed is much worse if you call it document-by-document. To make using it easier … |
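For context, here is a minimal sketch of the two calling patterns being compared; the texts list is a made-up stand-in for the real data:

import spacy

nlp = spacy.load('en')
texts = [u'A very short document.'] * 1000  # stand-in for the real data

# Document-by-document: each call runs the neural models on a batch of one
docs = [nlp(text) for text in texts]

# Batched: nlp.pipe() buffers the texts so the models see larger batches
docs = list(nlp.pipe(texts, batch_size=1000))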
I just ran an experiment with it, and I am now really missing the previous version... Is it different data structures, or improvements in parallelization? It would be great to have the option of the faster, more memory-intensive processing. For reference, I dropped back to 1.9 and tried the same experiment. I was curious as to whether this might have something to do with short documents, so I tried passing in documents that were 10x longer (50 words). I see a big improvement in the v2 processing (twice as fast), but it is still 7-8x slower than v1.9. Memory ratios seem approximately equivalent under these scenarios. |
4-5 words per document is probably a problem --- I hadn't thought about that use-case. There are a couple of potential problems here. Could you disable the tagger and NER to make sure it's the parser that's slower? It probably is, but just in case. The most convenient way to do that is … (one option is sketched below). Your parse times are really bad, though. Do you know what version of BLAS your numpy is linked against? If you're on a Skylake processor you might be using a version of OpenBLAS that has failed to detect your architecture correctly --- there's a bug in the version that ships with most recent versions of Ubuntu and Debian. You can avoid this bug by using numpy via conda, which is linked against MKL. Unfortunately we're still waiting to publish v2 on conda --- we rely on CI servers to build the wheels, and there's currently a backlog :(. |
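The exact command was lost from the comment above; one way to disable those components in spaCy v2 is the disable keyword, sketched here (this may not be the method the author had in mind):

import spacy

# Load the English model with the tagger and NER disabled, leaving only the parser
nlp = spacy.load('en', disable=['tagger', 'ner'])
doc = nlp(u'This is a short document.')
print([(w.text, w.dep_) for w in doc])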
The best way to tell which BLAS your numpy is linked against is to do:
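The command itself is missing from this copy of the thread; it was presumably along the lines of numpy's standard configuration check:

import numpy
numpy.show_config()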
Example output:
That installation has numpy linked against MKL. When numpy is linked against OpenBLAS, it looks like this:
Of note here is that the OpenBLAS version is 2.18, which is the version all my machines get by default... But this version has a significant bug that causes quite poor performance :(. To check whether your numpy's BLAS is basically decent, you can benchmark it against a wrapper I made for the Blis linear algebra library:
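The benchmark command is also missing here. As a rough stand-in (not the author's Blis wrapper), a plain numpy matrix-multiplication timing exposes the same BLAS difference:

import time
import numpy

# Time a large single-precision matrix multiplication, which is where the parser spends most of its time
A = numpy.random.uniform(-1, 1, (1000, 1000)).astype('float32')
B = numpy.random.uniform(-1, 1, (1000, 1000)).astype('float32')
start = time.time()
for _ in range(100):
    A.dot(B)
print('100 x (1000x1000) float32 matmuls: %.2fs' % (time.time() - start))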
When I'm linked against the bad OpenBLAS, I get this:
When I'm linked against MKL, I get this:
Correctly compiled versions of OpenBLAS give similar performance --- but compiling it correctly is a total pain in the ass. Unfortunately, running conda is often hugely inconvenient too! I've still not managed to get my servers automatically deploying into conda environments. You might be able to tell I've spent a lot of time on this... I wish I could say I've come to a better conclusion. The good news is that it won't be long before the bug in OpenBLAS 2.18 rolls past, and we're all getting better CPU neural network performance by default. I believe that this unfortunate problem has contributed to a lot of NLP researchers greatly over-estimating the performance advantage of GPUs for their use-case. For now, my best advice is as follows: …
|
Awesome -- thanks for the detailed response. I had no idea about the issue with OpenBLAS 2.18. I quickly checked and found that v1.9 was running in a conda env with MKL, and v2.0 was running with OpenBLAS 2.18. Indeed, I thought this would be the answer, so I updated numpy to use OpenBLAS 2.19. v2.0 with MKL benchmark:
v2.0 with OpenBLAS 2.19:
Not a huge difference. Timings are still the same as before for the respective pipelines. Disabled the entire pipeline for v2.0 (…) |
What was the performance with OpenBLAS 2.18 on your machine? But! It sounds like the problem is much more likely to be this:
Wat. Okay... Next investigation. Could you install … |
Hmm. I wonder whether the culprit is my fix to this: #285. The fix only kicks in after many documents have been processed. I didn't take much care to make the process cheap, because I thought if it only runs every N documents, there's plenty of time to amortize the cost. But if the documents are short... maybe not! Here's the relevant bit of code: https://github.com/explosion/spaCy/blob/master/spacy/language.py#L544 You could reach inside your copy of spacy and comment out those lines to see if that's the problem. You can find the path to the file with … (see the sketch below). |
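The command was cut off above; the standard way to locate an installed module's source (likely what was meant, though that's an assumption) is:

import spacy
print(spacy.__file__)  # points at spacy/__init__.py; language.py lives in the same directory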
Reducing the language.py pipe to only yielding documents after line 544 does not improve the speed. Vocabulary size grows at the expected rate (eventually around 10% of documents contain a new noun). Not yielding the documents (i.e. removing everything after line 544) grows the vocabulary to a constant 78,930 and runs through a million documents in 3 seconds, but does not provide any output. Will continue investigating tomorrow. The OpenBLAS 2.18 results were only slightly worse than 2.19, but getting them now would require reinstalling everything. |
Just to note, I'm experiencing the exact same issue. Large numbers of very small documents passed to nlp.pipe are much (more than 10x) slower than the same documents passed to nlp individually. |
Thanks, looking into it. |
Hello all. Python version: 2.7.13. numpy is linked against OpenBLAS 2.18; I will investigate this further. But here are some stats. As a benchmark we use the Enron email dataset (measuring the processing time of ~1200 emails). In both cases the pipeline is without the syntactic parser (NER and tagger only). 1.6 batch mode: … 2.0 batch mode (batch size 50; increasing it leads to no improvements): … There are improvements in batch mode, however it is still ~5x slower in comparison with 1.6. UPD: The benchmark with OpenBLAS 2.18 leads to: … Blis: … So I guess it's not the issue. |
Apologies if this is a little off track, but I just ran your Blis benchmark and got the following results:
Which of course looks great for numpy... but the Blis numbers I'm seeing seem crazy relative to others in this thread. Any idea what might be going on there? |
Ran some small benchmarks (script attached). I still suspect the problem is coming in from the change to the StringStore. Here's the summary:
Script attached below.
# With bad blas
$ python bin/benchmark_small_docs.py
components= ['tagger', 'parser', 'ner'] disabled= [] use_pipe= 0 wps= 1182.9 words= 60307 seconds= 50.98254521697527
components= ['tagger', 'parser', 'ner'] disabled= [] use_pipe= 1 wps= 3444.612476996155 words= 60307 seconds= 17.50762978498824
$ python bin/benchmark_small_docs.py tagger -p
components= ['tagger'] disabled= ['parser', 'ner'] use_pipe= 1 wps= 16969.68281227715 words= 60307 seconds= 3.5538083220017143
$ python bin/benchmark_small_docs.py tagger
components= ['tagger'] disabled= ['parser', 'ner'] use_pipe= 0 wps= 3746.429137875223 words= 60307 seconds= 16.097194896952715
# So: Tagger is 5x slower without .pipe()
$ python bin/benchmark_small_docs.py tagger,ner
components= ['tagger', 'ner'] disabled= ['parser'] use_pipe= 0 wps= 1927.4320057496022 words= 60307 seconds= 31.288782079005614
$ python bin/benchmark_small_docs.py tagger,ner -p
components= ['tagger', 'ner'] disabled= ['parser'] use_pipe= 1 wps= 6041.756207845903 words= 60307 seconds= 9.98170034098439
# NER is 3x slower without .pipe()
# Now the numbers with numpy linked against MKL
$ python bin/benchmark_small_docs.py tagger,ner -p
('components=', ['tagger', 'ner'], 'disabled=', [u'parser'], 'use_pipe=', 1, 'wps=', 8058.4194224254325, 'words=', 55889, 'seconds=', 6.935479164123535)
('components=', ['tagger', 'ner'], 'disabled=', [u'parser'], 'use_pipe=', 0, 'wps=', 2843.937503013295, 'words=', 55889, 'seconds=', 19.651978969573975)
from timeit import default_timer as timer
import spacy
import thinc.extra.datasets
import plac
def iter_phrases(texts):
for text in texts:
for i in range(0, len(text), 50):
yield text[i : i + 50]
@plac.annotations(
components=("Pipeline components", "positional"),
use_pipe=("Whether to use nlp.pipe()", "flag", "p")
)
def main(components='tagger,parser,ner', use_pipe=False):
components = components.split(',')
use_pipe = int(use_pipe)
train, dev = thinc.extra.datasets.imdb()
texts, labels = zip(*train)
texts = texts[:200]
nlp = spacy.load('en_core_web_sm')
start = timer()
disabled = [name for name in nlp.pipe_names if name not in components]
with nlp.disable_pipes(*disabled):
n_words = 0
if use_pipe:
for doc in nlp.pipe(iter_phrases(texts)):
n_words += len(doc)
else:
for phrase in iter_phrases(texts):
doc = nlp(phrase)
n_words += len(doc)
seconds = timer() - start
wps = n_words / seconds
print('components=', components, 'disabled=', disabled, 'use_pipe=', use_pipe, 'wps=', wps,
'words=', n_words, 'seconds=', seconds)
if __name__ == '__main__':
plac.call(main) |
@ashaffer What sort of CPU do you have? |
@honnibal Intel Core i7, 3.1 GHz (MacBook Pro, 2017). |
While it's running, does it spawn some crazy number of threads? Most likely I stuffed up the compilation settings for clang. |
No, actually. Seems to have just one thread. |
Btw...For some reassurance here: If we can't get this sorted out, I could update the linear model code and just train new linear models for v2. It would just be a different parser class. So even if the problem proves difficult, the worst-case solution isn't so bad. I'd much rather figure out why the neural networks are getting uneven performance though. Too much choice is bad. If we add more options, people will have to run experiments to figure out what to use. It's also really hard to make good trade-offs if everyone's using different stuff. Probably the best solution will involve writing more of the forward-pass in Cython, and training models with narrower hidden layers. At the moment the hidden layer sizes have been configured with the current overheads from making a bunch of Python calls in mind. If we eliminate those overheads, performing less computation in the matrix multiplications starts to become better value. |
I have run the benchmark on my system with both v1.9 and v2.0, both with the changes in language.py and without. It doesn't look like the StringStore changes are to blame (nor MKL vs. OpenBLAS, though that is certainly a good point to note for future experiments):
v1.9
All components (no pipe)
All components (pipe)
Only Tagger (no pipe)
Only Tagger (pipe)
v2.0
Without StringStore changes
All components (no pipe)
Only Tagger (no pipe)
All components (pipe)
Only Tagger (pipe)
With changes
All components (no pipe)
Only Tagger (no pipe)
All components (pipe)
Only Tagger (pipe)
Modified Benchmark Script for both v1.9 and v2.0 (hacky):
from timeit import default_timer as timer
import spacy
import thinc.extra.datasets
import plac
def iter_phrases(texts):
for text in texts:
for i in range(0, len(text), 50):
yield text[i : i + 50]
@plac.annotations(
components=("Pipeline components", "positional"),
use_pipe=("Whether to use nlp.pipe()", "flag", "p"),
version=("Is this version 1.9?","flag","v")
)
def main(components='tagger,parser,ner', use_pipe=False, version=False):
components = components.split(',')
use_pipe = int(use_pipe)
train, dev = thinc.extra.datasets.imdb()
texts, labels = zip(*train)
texts = texts[:200]
nlp = spacy.load('en_core_web_sm')
start = timer()
n_words = 0
v = "2.0"
if version:
v = "1.9"
disabled = [name for name in ['tagger','ner','parser']
if name not in components]
nlp.pipeline = []
if 'tagger' in components:
nlp.pipeline.append(nlp.tagger)
if 'ner' in components:
nlp.pipeline.append(nlp.entity)
if 'parser' in components:
nlp.pipeline.append(nlp.parser)
else:
disabled = [name for name in nlp.pipe_names if name not in components]
nlp.disable_pipes(*disabled)
if use_pipe:
for doc in nlp.pipe(iter_phrases(texts)):
n_words += len(doc)
else:
for phrase in iter_phrases(texts):
doc = nlp(phrase)
n_words += len(doc)
seconds = timer() - start
wps = n_words / seconds
print('version=', v, 'components=', components, 'disabled=', disabled,
'use_pipe=', use_pipe, 'wps=', wps, 'words=', n_words,
'seconds=', seconds)
if __name__ == '__main__':
plac.call(main) |
Ah, I don't know how I missed the fact that you switched over all models from averaged perceptron to neural. In that case, yes, I'd be happy with an extra set of linear models for en, de, fr that can be loaded. This would very likely solve the speed issue. What was the relative performance boost that you achieved with the neural models? Any results you can post? I'm using this as part of a feature selection framework, so not very sensitive to F1 unless the models are very noisy in my domains. |
Thanks! If we do a few more iterations of this we'll get to the bottom of this. Here's the accuracy comparison: https://spacy.io/usage/facts-figures#spacy-models |
Could you run this benchmark next? This exercises the machine learning library directly. I think this is a good next step of the binary search: if this is slower we're pretty close to our answer. More likely the problem will be in spaCy, but there aren't many more steps to search through. With MKL I get:
This shows the huge difference that the batch size can make for the neural network: with batch size 2, it takes 46s, with 128 it takes 4.8s. spaCy uses all words in the batch, so the effective batch size is higher. But with short documents, the batch size can be quite low still.
from __future__ import print_function
import plac
from thinc.v2v import Model, ReLu, Softmax
from thinc.api import clone, chain
from thinc.neural.util import to_categorical
from thinc.extra import datasets
@plac.annotations(
batch_size=("Batch size", "positional", None, int)
)
def main(batch_size=128, width=512, depth=2):
with Model.define_operators({'**': clone, '>>': chain}):
model = ReLu(width) >> ReLu(width) >> Softmax()
train_data, dev_data, _ = datasets.mnist()
train_X, train_y = model.ops.unzip(train_data)
dev_X, dev_y = model.ops.unzip(dev_data)
dev_y = to_categorical(dev_y)
train_y = to_categorical(train_y)
with model.begin_training(train_X[:100], train_y[:100]) as (trainer, _):
trainer.nb_epoch = 3
trainer.batch_size = batch_size
for X, y in trainer.iterate(train_X, train_y):
model.begin_update(X)
if __name__ == '__main__':
plac.call(main) |
Timings for batch size = 2: 1m27.127s
Timings for batch size = 128: 7.018s
Seems similar to what you were getting. |
I had a closer look at your numbers, and yeah, your tagger performance is similar to mine. So okay, it must be in the parser. Could you attach the … |
Sure. I'm using GCC 5.4.0. |
Can't find anything wrong with the linking or compilation here. Hmm! |
Hi, I'm having a similar issue, though I haven't yet tried all the variations of OpenBLAS, etc. Unlike the OP, my documents are not that short. I'm running on ~42k documents, each ~1 page long (~3M words in total), and have run benchmarks with different thread counts/batch sizes on v1.8. My best results for v1.8 (on a 4-core Skylake i7) are:
The v2.0 performance by comparison is 10x worse:
I am running all this in Docker with a standard numpy/OpenBLAS install (I checked that it's 2.18) on a Skylake processor, but your Blis module seems to generate good numbers. Here's the Blis output:
|
Okay new plan:
The models fit well in memory anyway, so one copy of the model per process isn't so bad. And now that wheels and conda are more standard, it's easier to bring our own gemm. If necessary we can just statically link OpenBLAS. It's pretty small when compiled without LAPACK or Fortran, for single-threaded use. I think this approach has the best chance of always doing sensible things. It means the same story applies to all the units in the library, rather than just the ones we happen to multi-thread. Multi-threading is also only really attractive when the threads would otherwise sit idle. In production settings that's really not the case. Even if your machine is large, you'd rather just run more containers. |
Here's a wheel for Thinc that statically links single-threaded OpenBLAS. On my 4-core laptop, the single process is training about 70% as fast as an MKL-linked numpy that uses all cores. It's also much faster than an unoptimised BLAS. thinc-6.10.2-cp36-cp36m-linux_x86_64.whl.rename.zip Try it out and see how you go when you parse with multiple processes on your machine, e.g. using joblib or just starting two processes running. The wheel is only compatible with Linux Python 3.6, using a reasonably modern Intel processor (e.g. Haswell, Sandy Bridge, Ivy Bridge, etc.). To try it, rename it so that it ends in .whl (dropping the .rename.zip suffix) and install it with pip. It's currently taking about 2 minutes per epoch to train the parser on the English Universal Dependencies corpus. We would be able to train the parser, POS tagger and entity recognizer on different processes given the current architecture. Asynchronous SGD should also work, for larger training jobs. |
Initially, being 10x slower was the problem... I'm curious how much closer we are to v1 again! Looking forward to seeing some benchmarks. |
I've attached a benchmark script for running the pipeline on some IMDB text using multi-processing with joblib (a rough reconstruction is sketched below). With …
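The attached script itself is missing from this copy of the thread. A minimal sketch of what such a joblib benchmark might look like (the chunking scheme, text count and model name are assumptions, not the original script):

from timeit import default_timer as timer
from joblib import Parallel, delayed
import spacy
import thinc.extra.datasets

def count_words(texts):
    # Each worker process loads its own copy of the model
    nlp = spacy.load('en_core_web_sm')
    return sum(len(doc) for doc in nlp.pipe(texts))

def main(n_jobs=2):
    train, dev = thinc.extra.datasets.imdb()
    texts, labels = zip(*train)
    texts = texts[:2000]
    # One chunk of texts per worker process
    chunks = [texts[i::n_jobs] for i in range(n_jobs)]
    start = timer()
    counts = Parallel(n_jobs=n_jobs)(delayed(count_words)(chunk) for chunk in chunks)
    seconds = timer() - start
    print('n_jobs=', n_jobs, 'words=', sum(counts), 'wps=', sum(counts) / seconds)

if __name__ == '__main__':
    main()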
My machine is allegedly 4 core, but as you can see performance is pretty terrible with more than two active processes. I wonder whether I have power management throttling me or something. I've confirmed the same behaviour applies if I simply run the command twice...
|
Any update on this? It looks like nlp.pipe is not doing a good job of parallelizing. Using the batch and threads options is not improving speed at all. |
I tried to modify the Rasa NLU code to use the pipe command rather than passing in docs one at a time and I am getting much worse performance. Even with the number of threads set to 8, it appears to be trying to use all cores on a 16 core VM... I'm still working through all of the above trying to see if I can benefit from any of the actions specifically. Is the thinc install the recommended fix? |
I believe I'm having this issue as well. Running 100,000 documents with an average length of about 8500 took 1hr 28m. The notes are stored in a pandas DataFrame, and I just iterate through them and yield them to nlp.pipe. I was also writing the parsed notes to disk, so I'm curious whether that I/O contributed to the slow performance. EDIT: … EDIT 2: Number of documents, spaCy 2, spaCy 1.9 (all times are in seconds): … This doesn't provide a great incentive to use spaCy 2.0, which is unfortunate because I'd really like to use the neural models present in spaCy 2.0. Is there any fix for this yet? |
Try running it on a single core instance and comparing the speed. You may actually find the performance improves, because the default numpy configuration is pretty terrible at figuring out how many threads to launch to do the matrix multiplications. If you install via conda you'll get numpy linked against MKL, which is better at detecting this. The strategy going forward is to use multi-processing rather than multi-threading. The current performance target is 10,000 words per second per CPU core for the full pipeline. If you only run the parser, you should be seeing around 8k wps/core with v2.0.11. On v2.1.0, that should increase to around 12k wps/core. |
I'm guessing that this is as easy as setting …
Does it mean that we don't need to worry about releasing the GIL? |
At the moment, not quite. You'll need to set the environment variables for the BLAS library numpy is calling into (a sketch follows below). Unfortunately, because we delegate to numpy, and numpy multi-threads the matrix multiplications by default (by delegating them to a BLAS library), the …
If we use multi-processing, we won't need to worry about releasing the GIL. |
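The variables in question are the standard BLAS threading controls. A sketch of pinning each process to a single BLAS thread (they must be set before numpy is imported; which one matters depends on the BLAS your numpy links against):

import os

# Must be set before numpy (and therefore spacy/thinc) is imported
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'

import numpy
import spacy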
While some models are small and it's OK to put them in memory for each process separately, what about models with word vectors? They are big, and duplicating them in memory sounds expensive. Maybe we could store the vectors in a memmap or something, to share them across different processes? |
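A minimal illustration of that idea with a plain numpy memmap (generic numpy, not a spaCy API; the file path and vector shape are made up):

import numpy

# Write the vector table to disk once...
vectors = numpy.zeros((100000, 300), dtype='float32')  # stand-in for real word vectors
writer = numpy.memmap('vectors.dat', dtype='float32', mode='w+', shape=vectors.shape)
writer[:] = vectors
writer.flush()

# ...then each worker process maps the same file read-only, so the OS shares
# the pages instead of every process holding its own copy
shared = numpy.memmap('vectors.dat', dtype='float32', mode='r', shape=(100000, 300))
print(shared[0, :5])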
FWIW, we have been running spaCy v1 on a server behind uWSGI/Flask, and on our (Linux/Mac) systems copy-on-write seems to avoid duplicating memory when uWSGI forks the app (including the spaCy models). I.e. we run 2 or 4 forked processes but only incur the memory cost of one. We haven't done any detailed profiling, but I would expect that memory use eventually does increase somewhat, because of the growing vocabulary (from what I understand, this means affected memory pages will be duplicated across processes). But it would be nice if this was a "feature" also of spaCy v2 (we haven't tested it yet, but this thread seems to suggest it may not be that easy). |
Previously, @honnibal suggested that one solution to closing the gap between v2 and v1.9 speeds would be retraining the linear models (#1508 (comment)). Is this still a viable option? Would we lose a good chunk of accuracy? I've been running some of the 2.1.0a nightlies and it looks like we're not likely to close much of a gap to 1.9 but perhaps I'm missing something. |
@cwharland It's not terribly difficult. I did some work a while ago on that, and got it running in a hacky way. I've just tried to find the branch and pushed the state I found, which doesn't seem to be working. I guess I started trying to unhack it. If you're on Windows, I think I've tracked down another source of slowness. It seems the Windows Subsystem for Linux doesn't expose AVX2 instructions, which forces the BLAS libraries to use old kernels. This is making things about 2x slower on my new laptop. The issue about the AVX2 instructions can be found here: microsoft/WSL#2234 |
Gotcha, so it seems like we should just be sure that the BLAS libraries in whatever environment we run (for us, usually Ubuntu-based Docker images) are linked correctly. |
I think we can finally close this 🎉
You should be seeing between 7000 and 9000 words per second per thread. With the pipeline properly single-threaded, you should now find it easy to use multi-processing to speed up your execution. |
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
Your Environment
Info about spaCy
I just updated to v2.0. Not sure what changed, but the exact same pipeline of documents called in the standard
nlp = spacy.load('en'); nlp(u"string")
way is now 10x slower.