
💫 spaCy v2.0.0 alpha – details, feedback & questions (plus stickers!) #1105

Closed
ines opened this issue Jun 5, 2017 · 109 comments

Labels
help wanted (Contributions welcome!) · meta (Meta topics, e.g. repo organisation and issue management) · 🌙 nightly (Discussion and contributions related to nightly builds)

Comments

@ines
Member

ines commented Jun 5, 2017

We're very excited to finally publish the first alpha pre-release of spaCy v2.0. It's still an early release and (obviously) not intended for production use. You might come across a NotImplementedError – see the release notes for the implementation details that are still missing.

This thread is intended for general discussion, feedback and all questions related to v2.0. If you come across more complex bugs, feel free to open a separate issue.

Quickstart & overview

The most important new features

  • New neural network models for English (15 MB) and multi-language NER (12 MB), plus GPU support via Chainer's CuPy.
  • Strings mapped to hash values instead of integer IDs. This means they will always match – even across models.
  • Improved saving and loading, consistent serialization API across objects, plus Pickle support – see the short sketch after this list.
  • Built-in displaCy visualizers with Jupyter notebook support.
  • Improved language data with support for lazy loading and multi-language models. Alpha tokenization for Norwegian Bokmål, Japanese, Danish and Polish. Lookup-based lemmatization for English, German, French, Spanish, Italian, Hungarian, Portuguese and Swedish.
  • Revised API for Matcher and language processing pipelines.
  • Trainable document vectors and contextual similarity via convolutional neural networks.
  • Various bug fixes and almost completely re-written documentation.
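As a rough illustration of the new serialization API (a minimal sketch, assuming the en_core_web_sm alpha model is installed and ./my_model is a writable path):

import spacy

nlp = spacy.load('en_core_web_sm')
nlp.to_disk('./my_model')            # save the whole pipeline to a directory
nlp2 = spacy.load('./my_model')      # load it back from that path

doc = nlp2(u'This is a sentence.')
doc_bytes = doc.to_bytes()           # most objects also expose to_bytes()/from_bytes()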

Installation

spaCy v2.0.0-alpha is available on pip as spacy-nightly. If you want to test the new version, we recommend setting up a clean environment first. To install the new model, you'll have to download it with its full name, using the --direct flag.

pip install spacy-nightly
python -m spacy download en_core_web_sm-2.0.0-alpha --direct   # English
python -m spacy download xx_ent_wiki_sm-2.0.0-alpha --direct   # Multi-language NER

# load via the model's shortcut name...
import spacy
nlp = spacy.load('en_core_web_sm')

# ...or import the model package directly
import en_core_web_sm
nlp = en_core_web_sm.load()

Alpha models for German, French and Spanish are coming soon!

Now on to the fun part – stickers!

[image: spaCy stickers]

We just got our first delivery of spaCy stickers and want to share them with you! There's only one small favour we'd like to ask. The part we're currently behind on is the tests – this includes our test suite as well as in-depth testing of the new features and usage examples. So here's the idea:

  • Find something that's currently not covered in the test suite and doesn't require the models, and write a test for it - for example, language-specific tokenization tests.
  • Alternatively, find examples from the docs that haven't been added to the tests yet and add them. Plus points if the examples don't actually work – this means you've either discovered a bug in spaCy, or a bug in the docs! 🎉

Submit a PR with your test to the develop branch – if the test covers a bug and currently fails, mark it with @pytest.mark.xfail. For more info, see the test suite docs. Once your pull request is accepted, send us your address via email or private message on Gitter and we'll mail you stickers.
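Concretely, a language-specific tokenization test could look something like this (a sketch only – the en_tokenizer fixture is provided by the test suite's conftest, and the exact expectation here is illustrative):

import pytest

def test_en_tokenizer_splits_contraction(en_tokenizer):
    # the English tokenizer should split "don't" into two tokens
    tokens = en_tokenizer("don't")
    assert [token.text for token in tokens] == ["do", "n't"]

# If the test documents a bug and currently fails, mark it with @pytest.mark.xfail.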

If you can't find anything, don't have time or can't be bothered, that's fine too. Posting your feedback on spaCy v2.0 here counts as well. To be honest, we really just want to mail out stickers 😉

@ines added the help wanted, meta and 🌙 nightly labels and removed the help wanted (easy) label on Jun 5, 2017
@kootenpv
Contributor

kootenpv commented Jun 5, 2017

I honestly just wanted it to work :-)

I had spacy 1.5 installed on my other machine, I removed it and installed spacy-nightly v2.0.0a0.
It imports fine, but then I tried to download a model with both:

python -m spacy download en
python -m spacy download en_core_web_sm

Both give:

Compatibility error
No compatible models found for v2.0.0a0 of spaCy.

It's most likely that I'm missing something?

EDIT: Indeed, on the release page it says: en_core_web_sm-2.0.0-alpha. You also need to give the --direct flag.

python -m spacy download en_core_web_sm-2.0.0-alpha --direct

Perhaps it is possible to temporarily update the docs for it?

Otherwise: it works!

@ines
Member Author

ines commented Jun 5, 2017

Sorry about that! We originally decided against adding the alpha models to the compatibility table and shortcuts just yet to avoid confusion – but maybe it actually ended up causing more confusion. Just added the models and shortcuts, so in about 5 minutes (which is roughly how long it takes GitHub to clear its cache for raw files), the following commands should work as well:

python -m spacy download en
python -m spacy download xx

@kootenpv
Contributor

kootenpv commented Jun 5, 2017

Another update: I tried parsing 16k headlines. I can parse all of them* and access some common attributes of each, including vectors :)

I did notice that an empty string (*one of the headlines) now throws an exception – this was not the case in v1.8.2. Probably better to fix that :)

I wanted to do a benchmark against v1.8.2, but the machines aren't comparable :( It did feel a lot slower, though...

@honnibal
Member

honnibal commented Jun 5, 2017

Thanks!

Try doc.similarity() if you have a use case for it? I'm not sure how well this works yet. It's using the tensors learned for the parser, NER and tagger (but no external data). It seems to have some interesting context sensitivity, and in theory it might give useful results --- but it hasn't been optimised for that. So, I'm curious to hear how it does.

http://alpha.spacy.io/docs/usage/word-vectors-similarities
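A quick way to try it out (assuming the en_core_web_sm alpha model is installed – the absolute scores may not mean much yet):

import spacy

nlp = spacy.load('en_core_web_sm')
doc1 = nlp(u'the fries were gross')
doc2 = nlp(u'worst fries ever')
print(doc1.similarity(doc2))   # a float; higher should mean more similar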

@buhrmann

buhrmann commented Jun 6, 2017

Hi, very excited about the new, better, smaller and potentially faster(?) spaCy 2.0. I hope to give it a try in the next few days. Just one question: according to the (new) docs, the embeddings seem to work just as they did before, i.e. external word vectors and their averages for spans and docs. But you also mention the use of tensors for similarity calculations. Is it correct that the vectors are essentially the same, but are not used as such in the similarity calculations anymore? Or are they somehow combined with the internal tensor representations of the documents? In any case, thanks for the great work, and I hope to be able to give some useful feedback soon about the Spanish model etc.

@honnibal
Member

honnibal commented Jun 6, 2017

@buhrmann This is inherently a bit confusing, because there are two types of vector representations:

  1. You can import word vectors, as before. The assumption is that you'll want to leave these static, with perhaps a trainable projection layer to reduce the dimensionality.

  2. The parser, NER, tagger etc. learn a small embedding table and a depth-4 convolutional layer, to assign the document a tensor with a row for each token in context.

We're calling type 1 "vectors" and type 2 "tensors". I've designed the neural network models to use a very small embedding table, shared between the parser, tagger and NER. I've also avoided using pre-trained vectors as features. I didn't want the models to depend on, say, the GloVe vectors, because I want to make sure you can load in any arbitrary word vectors without messing up the pipeline.

@buhrmann

buhrmann commented Jun 6, 2017

Thanks, that's clear now. I still had doubts about how the (type 1) vectors and (type 2) tensors are used in similarity calculations, since you mention above that the tensors could have interesting properties in this context (something I'm keen to try). I've cleared this up by looking at the code: for now, the tensors are only used in similarity calculations when there are no word vectors available (which could of course easily be changed with user hooks – see the sketch below).
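For anyone who wants to experiment with that, here's a rough sketch of a user hook that bases similarity on the averaged context tensor rows (purely illustrative – doc.tensor and doc.user_hooks exist in v2, but the metric itself is just an assumption):

import numpy
import spacy

def tensor_similarity(doc, other):
    # average the per-token context tensors and compare by cosine
    v1 = doc.tensor.mean(axis=0)
    v2 = other.tensor.mean(axis=0)
    return numpy.dot(v1, v2) / (numpy.linalg.norm(v1) * numpy.linalg.norm(v2))

def add_similarity_hook(doc):
    doc.user_hooks['similarity'] = tensor_similarity
    return doc

nlp = spacy.load('en_core_web_sm')
nlp.pipeline.append(add_similarity_hook)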

@unography

Hi,

I wanted to do a quick benchmark between spaCy v1.8.2 and v2.0.0. First of all, memory usage is dramatically lower in the new version! The old version's model took approximately 1GB of memory, while the new one takes about 200MB.

However, I noticed that the latest release is using all 8 cores of my machine (100% usage), yet it is remarkably slow!

I made two separate virtualenvs to make sure the installations were clean.
This is the small script I wrote to test its speed:

import time
import spacy
nlp = spacy.load('en')

def do_lemma(text):
	doc = nlp(text.decode('utf-8'))
	lemma = []
	for token in doc:
		lemma.append(token.lemma_)
	return ' '.join(lemma)

def time_lemma():
	text = 'mangoes bought were nice this time'  # just a stupid sentence
	start = time.time()
	for i in range(1000):
		do_lemma(text)
	end = time.time()
	print end - start

time_lemma()

And for the latest release, the same code with only the model loading changed:


import time
import spacy
nlp = spacy.load('en_core_web_sm')

def do_lemma(text):
	doc = nlp(text.decode('utf-8'))
	lemma = []
	for token in doc:
		lemma.append(token.lemma_)
	return ' '.join(lemma)

def time_lemma():
	text = 'mangoes bought were nice this time'  # just a stupid sentence
	start = time.time()
	for i in range(1000):
		do_lemma(text)
	end = time.time()
	print end - start

time_lemma()

The first (v1.8.2) runs in 0.15 seconds while the latest (v2.0.0) took 11.77 seconds to run!
Is there something I'm doing wrong in the way I'm using the new model?

@honnibal
Member

honnibal commented Jun 6, 2017

Hm! That's a lot worse than in my tests, but my tests used the .pipe() method, which lets the model minibatch. This helps to mask the Python overhead a bit. I still think the result you're seeing is much slower than I'd expect, though.

A thought: could you try setting export OPENBLAS_NUM_THREADS=1 and running it again? If your machine has lots of cores, it could be that the stupid thing tries to load up something like 40 threads to do this tiny amount of work per document, and that kills the performance.
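For reference, the two suggestions above look roughly like this in code (a sketch – the batch size is arbitrary, and the environment variable has to be set before numpy is first imported):

import os
os.environ['OPENBLAS_NUM_THREADS'] = '1'   # cap BLAS threads before numpy loads

import spacy

nlp = spacy.load('en_core_web_sm')
texts = [u'mangoes bought were nice this time'] * 1000
lemmas = [u' '.join(token.lemma_ for token in doc)
          for doc in nlp.pipe(texts, batch_size=64)]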

@unography

@honnibal Hi, setting export OPENBLAS_NUM_THREADS=1 surely helped! It avoided that 100% usage but it is still slower than the old guy. Now it takes about 4 seconds to run, way faster than before but still slow.

@slavaGanzin

I just finished reading the documentation for v2.0 and it's way better than for v1.*.

But this export OPENBLAS_NUM_THREADS=1 is new to me – I thought numpy only used BLAS to train vectors.
Could this be documented?

@honnibal
Member

honnibal commented Jun 6, 2017

@slavaGanzin The neural network model makes lots of calls to numpy.tensordot, which uses BLAS -- both for training and at runtime. I'd like to set this within the code --- even for my own usage I don't want to micromanage this stupid environment variable! The behaviour of "spin up 40 threads to compute this tiny matrix multiplication" is one that nobody could want. So, we should figure out how to stop it from happening.

@eldor4do What happens if you use .pipe() as well?

@unography

@honnibal I'll try with .pipe() once, but in my actual use case I won't be able to use pipe(), it would be more like repeated calls.

@honnibal
Member

honnibal commented Jun 6, 2017

Hopefully the ability to hold more workers in memory compensates a bit.

Btw, the changes to the StringStore are also very useful for multi-processing. The annotations from each worker are now easy to reconcile, because they're stored as hash IDs -- so the annotation encoding no longer depends on the worker's state.
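A small illustration of what that means in practice (assuming the en_core_web_sm alpha model is installed): the hash for a given string is always the same, no matter which process or model computed it.

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I love coffee')

coffee_hash = nlp.vocab.strings[u'coffee']    # string -> hash
coffee_text = nlp.vocab.strings[coffee_hash]  # hash -> string
assert doc[2].orth == coffee_hash             # token.orth is the hash ID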

@unography

@honnibal Yes, that is a plus. Also, I tested the OPENBLAS setting on two machines: on one it reduced the thread count, but on the other, a 4-core machine, it didn't – all cores are still at 100% usage. Any idea what the problem could be?

@honnibal
Member

honnibal commented Jun 6, 2017

@eldor4do That's annoying. I think it depends on which BLAS numpy is linked against. Is the second machine a Mac? If so, the relevant library will be Accelerate, not OpenBLAS. Maybe there's a numpy API for limiting the thread count?

Appreciate the feedback -- this is good alpha testing :)

@kootenpv
Contributor

kootenpv commented Jun 6, 2017

@honnibal What are the expected differences for your test cases?

@anna-hope

#1021 is still an issue with this alpha release -- all of the sentences I gave as examples fail to be parsed correctly.

@alfonsomhc

I get some errors when I run this example from the documentation (https://alpha.spacy.io/docs/usage/lightning-tour#examples-tokens-sentences):

doc = nlp(u"Peach emoji is where it has always been. Peach is the superior "
          u"emoji. It's outranking eggplant 🍑 ")

assert doc[0].text == u'Peach'
assert doc[1].text == u'emoji'
assert doc[-1].text == u'🍑'
assert doc[17:19].text == u'outranking eggplant'
assert doc.noun_chunks[0].text == u'Peach emoji'

sentences = list(doc.sents)
assert len(sentences) == 3
assert sentences[0].text == u'Peach is the superior emoji.'

There are two problems:

  1. This expression
    doc.noun_chunks[0].text
    raises
    TypeError: 'generator' object is not subscriptable

  2. This expression
    sentences[0].text
    returns
    'Peach emoji is where it has always been.'
    and therefore the last assertion fails

I'm using python 3.6 (and spacy 2.0 alpha)
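(For what it's worth, the first error is expected in the sense that doc.noun_chunks is a generator, so the docs example presumably needs to materialise it first – something like the two lines below. The second problem looks like a genuine mismatch between the example and the sentence boundaries spaCy actually produces.)

noun_chunks = list(doc.noun_chunks)
assert noun_chunks[0].text == u'Peach emoji'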

@christian-storm

I appreciate the examples, they surfaced some very valid issues. The issues you raise fall into a few camps: preventing folks from shooting themselves in the foot, defining expected behavior/model selection, and explicitly capturing static dependencies. Let's see if I can motivate some solutions.

What I defined before was an api between components. Much like how a bunch of microservices would all agree on what protobuf definitions and versions they'll use to communicate with each other. It is a communication contract full stop. You are absolutely correct, it does nothing to define the expected behavior of the computational graph in turning inputs to outputs. Nor does it define what dependencies are required by each component. We have the technology. We can rebuild him.

What's the diff between pure software packages and ML based packages?
In programming it is incumbent on the developer to pick the appropriate package for a given task by looking at the documented behavior of that component. Furthermore, unit and integration tests are written to ensure the expected behavior, or a representative sample of it, remains intact as package versions are bumped and underlying code is modified. If you are publishing a package like spacy you ensure proper behavior for the user by explicitly listing each required package and version number(s) in requirements.txt.

Riffing off your example, if a developer is selecting an appropriate tokenizer from, say, the available NLTK tokenizers, they could look at the docs to see if it splits contractions or not. Even if it was unspecified, they could figure out which one is best through some testing or by looking at the source if it is open source. If a developer chooses the wrong tokenizer for their task and doesn't have a test suite to alert them to this fact, I would have to say the bug is on them, no? What if the tokenizer is closed source? Isn't this essentially the same black box as an ML-based tokenizer? Well, not exactly. As an example, when I used SAP's NLP software the docs detailed the rule set used for tokenization. If the tokenization is learned, the rules are inferred from the data and can't be written down. With the "don't" tokenization example, how would one know that "don't" is going to be properly handled without explicitly testing it?

Expected behavior
So how does one fully specify the expected behavior of a machine learning component? As you well know I don't think anyone has a good answer for this. In academia one details the algorithm, releases the code, specifies the hyper-parameters, the data set used to train and validate, and the summary metric scores found with the test set. This information allows one to intuit how well it may do on another data set but there is no substitute for trying it out.

Imagine if one had access to an inventory of trained models. To select the best model for a given data set/task, one would compare the summary statistics of each model run against the test set. Likely one might even have a human inspect individual predictions to ensure the right model is being selected (for example). If the model seems like it would benefit from domain adaptation, further training in a manner that avoids catastrophic forgetting might prove effective.

As alluded to by your examples, what if the developer doesn't have a labeled test set to aid in model selection? My knee-jerk reaction is that they are ill equipped to stray from the default setup. They should use Prodigy to create a test set first. To me it is equivalent to someone picking NLTK's moses over casual tokenizer for twitter data without running tests to see which is better. This may be a bit far afield but a solution could be to ship a model with a set of sufficient statistics that describes the distribution of the training corpus, a program to generate the statistics for a new corpus, and a way of comparing the statistics for fit/transferability (KL divergence and outliers?). For tokenization and tagging, a first approximation would be the distribution of tokens and POS tags. So if the training set didn't have "didn't" but the user's corpus does, it would alert them to that fact and they could build a test to make sure it behaves as expected and possibly give them a motivation to further train the model. It might prevent some from shooting themselves in the foot by aiding them in the model selection process.

Versioning and dependencies
In devops one has to specify the required libraries, packages, configuration files, os services, etc. required to turn a bare metal box into the working environment needed to run a certain piece of software. This is notoriously hard to do as evidenced by the sheer number of configuration tools that exist (Puppet, cfengine, chef, etc.) and next generation tools (Docker VE, VM, ...) that give up on trying to turn the full configuration of an environment into source code. I've been in dependency hell and it sucks.

So how do we ensure that each ML model is reproducible? Much in the same way versions of spacy depend on certain versions of the model as defined in compatibility.json. A model would specify what Glove vectors were used, the corpus it was trained on (or a pointer to it, e.g., ontonotes 5.0), the hyper-parameter settings, etc. Anything and everything needed to recreate that particular version of the model from scratch. Something along the lines of dataversioncontrol. To cordon off model-specific data in spacy, the data would be stored in a private namespace for that model/factory. Better yet, to allow for shared data amongst models, like Glove vectors, vocab, etc., the data would be stored in a private global namespace with the model instances having pointers from their private namespaces. Much like how a doc.vocab points to a global Vocab instance. The difference being that everything would be versioned (hash over data with a human readable version number for good measure).

Now let me walk through each of your examples to see how this further refined concept might address each situation.

Now, we can obviously ask for the POS tags to reference a particular scheme.

Yes, exactly. It would point to a versioned file.

But actually our needs are much more specific. If I go and train the parser with tags produced by process A, and then send you the weights, and you go and produce tags using process B, you might get unexpectedly bad results. It doesn't have to be a simple story of "process A was more accurate than process B".

I'm not entirely sure what you mean by process A and B. Corpus A and corpus B? You shouldn't ever pass weights around. You'd load model A's weights, tags, etc. and never change them. If you did, it would be a new model.

Another example: If you want two tokenizers, one which has "don't", "isn't", etc as one token and another which has it as two tokens, you probably want two copies of the pipeline.

Maybe I'm playing antics with semantics but I wouldn't say "copies" here. One may have two separate tokenizers or possibly one with a switch if it is rule-based. With the learned tokenizers the run of the pipeline would specify which tokenizer to use. The pipeline is defined, compiled, and run. If a different pipeline is required, e.g., a different tokenizer, the pipeline is redefined and recompiled before running. The components would be cached so it would be fast and, to your point, perhaps there is a cache of predefined pipelines as well if compilation proves expensive.

If the models haven't been trained with "don't" as one word, well, that word will be completely unseen --- so the model will probably guess it's a proper noun. All the next steps will go badly from there.

Agreed. That is why model selection is so important and needs to be surfaced as a step in the development process. The situation you describe is applicable to current spacy users. Without access to OntoNotes, how does one know if it is close enough to their domain to be effective at, say, parsing? Even if one did have access to OntoNotes, how does one judge how transferable the models are? One could compare vocabulary, tag, dependency overlap and their frequencies. But nothing trumps a run against the test set, right?

The problem is more acute with neural networks, if you're composing models that should communicate by tensor. If you train a tagger with the GloVe common crawl vectors, and then you swap out those vectors for some other set of vectors, your results will probably be around the random chance baseline.

Yes, that would be disastrous and shouldn't be allowed or, with the limitations of python, discouraged. The vectors are part of the model definition and when loaded would reside in a private namespace. Of course, nothing can be made private in python so someone could blow through a few stop signs and still shoot themselves in the foot.

So if you chain together pretrained statistical models, there's not really any way to declare the "required input" of some step in a way that gives you a pluggable architecture. The "expected input" is "What I saw during training, as exactly as possible".

I believe what you are saying is that there is no way to ensure each model is trained on the same data set, no? In other words, to get the reported results, the "expected input" needs to be distributionally similar to the training data. If this is what you mean, one could have an optional check in the compilation step that checks to make sure the datasources are the same across the pipeline. This would prevent some noob, only looking at reported accuracies when creating a pipeline, from chaining together a Twitter-based model with a model trained on arXiv.

That's also why we've been trying to get this update() workflow across, and trying to explain the catastrophic forgetting problem etc. The pipeline doesn't have to be entirely static, but you might have to make updates after modifying the pipeline. For instance, it could be okay to change the tokenization of "don't" --- but only if you fine-tune the pipeline after doing so.

Agreed. However, if you decide to domain-adapt the model, i.e., online learning with prodigy, this should produce a new version of the model with a model definition that points to new parameters and a new data source listing that includes the original data source and the new data source.

Despite the length of this response, what I'm talking about really isn't that complicated in concept and from what I can tell not too far afield from where spacy 2.0 is now. I'd be willing to chip in if that is helpful. It'll be much more difficult once the ship leaves port.

I'm curious to hear what you think?

@honnibal
Member

honnibal commented Sep 29, 2017

What's the diff between pure software packages and ML based packages?

The difference I'm pointing to is there's no API abstraction possible with ML. We're in a continuous space of better/closer, instead of a discrete space of match/no match.

If you imagine each component as versioned, there's no room for a range of versions --- you have to specify an exact version to get the right results, every time. Once the weights are trained nothing is interchangeable and ideally nothing should be reconfigured.

This also means you can't really usefully cache and compose the pipeline components. There's no point in registering a component like "spaCy parsing model v1.0.0a4" on its own. The minimum versionable unit is something like "spaCy pipeline v1.0.0a4", because to get the dependency parse, you should run exactly the fully-configured tokenizer and tagger used during training, with no change of configuration whatsoever.

We can version and release a component that provides a single function nlp(text) -> doc with tags and deps. We can also version and release a component that provides a function train(examples, config) -> nlp. But we can't version and release a component that provides functions like parse(doc_with_tags) -> doc_with_deps.

I've been trying to keep the pipelines shorter in v2 to mitigate this issue, so things are more composable. The v2 parser doesn't use POS tag features anymore, and the next release will also do away with the multi-task CNN, instead giving each component its own CNN pre-process. This might all change though. If results become better with longer pipelines, maybe we want longer pipelines.

@christian-storm

Keeping the conversation going...I really hope this isn't coming across as adversarial or grating in any way. I actually think we are getting somewhere and agree on most things.

The difference I'm pointing to is there's no API abstraction possible with ML. We're in a continuous space of better/closer, instead of a discrete space of match/no match.

Well put and agree in principle with the caveat that code is rarely so boolean. Take a complex signal processing algorithm where the function F is either learned or is programmed with an analytic/closed form solution. How is the test and verify process of either really any different? Sure, each component of the latter can be tested individually. That certainly makes it easier to debug when things go south. However, a test set of Y = F(X) is as important in either case, right?

If you imagine each component as versioned, there's no room for a range of versions --- you have to specify an exact version to get the right results, every time. Once the weights are trained nothing is interchangeable and ideally nothing should be reconfigured.

Once again, on the same page, although I think there is another way to look at it. The key is defining exactly what the "right results" are. In building an ML model one uses the validation set to make sure the model is learning but not overfitting the training set. Then the test set is used as the ultimate test to ensure the model is transferable. If one were to pull two models off the shelf and plug them together as I've been suggesting, you'd judge the effectiveness of each using a task-specific test set, and the two together using a test set that encompasses the whole pipeline, no? This happens all the time in ML, e.g., a speech-to-text system that uses spectrograms, KenLM, and DL. Even though the first two aren't learned, though they could be, there are a bunch of hyper-parameters that need to be "learned."

This also means you can't really usefully cache and compose the pipeline components. There's no point in registering a component like "spaCy parsing model v1.0.0a4" on its own. The minimum versionable unit is something like "spaCy pipeline v1.0.0a4", because to get the dependency parse, you should run exactly the fully-configured tokenizer and tagger used during training, with no change of configuration whatsoever.

I would agree that training end to end and freezing the models in the pipeline afterwards leads to the most reproducible results. If this is the intended design, one will only ever be able to disable or append purely additive components, e.g., sentiment.

Just to play a little devil's advocate here, spacy promotes the ability to swap out the tokenizer that feeds the pipeline without a warning or a mention that one should retrain. Isn't this contrary to your end-to-end abstraction? What if someone blindly decides to implement that whitespace tokenizer described in the docs? To use your example, spacy might start labeling "don't" as a proper noun, no? The same could be said about adding special cases for tokenization. You are performing operations that weren't performed on the training data!

If the dependencies remain a constant across the pipeline, I still think plugging trained models into the pipeline makes sense if one knows what they are doing- an appropriate test harness at each step of the pipeline. On the other hand, I agree it is easy to go off the rails when components are tightly coupled, e.g., setting sent_start and making the trained parser obey them even though it wasn't trained with those sentence boundaries. However, there are many valid cases where it makes sense, e.g., training a sbd, freezing it, and then training the remaining pipeline.

Another idea
With the pipeline versioning idea in mind, why not at least allow for pluggable un-trained models that, once trained, get frozen into a versioned pipeline? Ultimately, I'm looking for a tool that plays well with experimentation, e.g., a new parser design from the literature, and devops. The difference is spacy being part of the NLP pipeline versus running the entire pipeline.

We can version and release a component that provides a single function nlp(text) -> doc with tags and deps. We can also version and release a component that provides a function train(examples, config) -> nlp. But we can't version and release a component that provides functions like parse(doc_with_tags) -> doc_with_deps.

Okay, I got it :), too much configurability can lead to bad things. But, really, why can't one version and release a component like parse(doc_with_tags) -> doc_with_deps? How is it any different than training each stage of the pipeline, freezing it, and then training the next stage of the pipeline using the same dependencies: data, tag sets, glove vectors, etc.? If trained end-to-end with errors back propagated from component to component, then yes I would agree, these tightly coupled components should be thought of as one unit and domain adapted as one unit.

Right now, if I wanted to change anything beyond the tokenizer in the pipeline, it is non-trivial. However, I'm starting to realize that I may be barking up the wrong tree here. Looking at prodigy and the usage docs for spacy, only downstream classification models (sentiment, intent, ...) are ever referenced. What if I want to add a semantic role labeler that requires a constituency parse? Or better yet, what if someone publishes a parser that is much more accurate and I really could use that extra accuracy? I guess I'm back to building my own NLP pipeline.

I've been trying to keep the pipelines shorter in v2 to mitigate this issue, so things are more composable.

Yes! That is the word- composable. "A highly composable system provides components that can be selected and assembled in various combinations to satisfy specific user requirements."

That's it! I would love a world where I can truly compose a NLP pipeline. Analogous to how Keras allows you easily build, train, and use a NN; just one level of abstraction higher.

I don't see how "shorter" pipelines are more composable though. Forgive me if I'm wrong but I don't really see any composability in spacy at the moment. Maybe configurability? Though, one gets the impression by reading the docs, "you can mix and match pipeline components," that the vision is to be able to compose pipelines that deliver different behaviors (specific user requirements).

The v2 parser doesn't use POS tag features anymore, and the next release will also do away with the multi-task CNN, instead giving each component its own CNN pre-process. This might all change though. If results become better with longer pipelines, maybe we want longer pipelines.

I wish I knew the code better to react.

I'm already in a fairly precarious position needing different tokenizers and sentence boundary detectors and there isn't a clear way to add these components. With your previously proposed solution of breaking and merging the dependency tree to allow for new sentence boundaries, what would that do to accuracy? Isn't this the exact tinkering of a trained model you are trying to avoid?

Once again, thanks for engaging Matthew.

@honnibal
Member

honnibal commented Sep 30, 2017

Keeping the conversation going...I really hope this isn't coming across as adversarial or grating in any way. I actually think we are getting somewhere and agree on most things.

No, not at all -- I hope I'm not coming across as intransigent :)

Just to play a little devil's advocate here, spacy promotes the ability to swap out the tokenizer that feeds the pipeline without a warning or a mention that one should retrain. Isn't this contrary to your end-to-end abstraction? What if someone blindly decides to implement that whitespace tokenizer described in the docs? To use your example, spacy might start labeling "don't" as a proper noun, no? The same could be said about adding special cases for tokenization. You are performing operations that weren't performed on the training data!

I do think this is a potential problem, and maybe we should be clearer about the problem in the docs. The trade-off is sort of like having internals prefixed with an underscore in Python: it can be useful to play with these things, but you don't really get safety guarantees.

Right now, if I wanted to change anything beyond the tokenizer in the pipeline, it is non-trivial. However, I'm starting to realize that I may be barking up the wrong tree here. Looking at prodigy and the usage docs for spacy, only downstream classification models (sentiment, intent, ...) are ever referenced. What if I want to add a semantic role labeler that requires a constituency parse?

We don't really have a data structure for constituency parses at the moment, or for semantic roles. You could add the data into user_data. More generally though:

Or better yet what if someone publishes a parser that is much more accurate and I really could use that extra accuracy? I guess I'm back to building my own NLP pipeline.

Well, not really? You could subclass NeuralDependencyParser and override the predict or set_annotations methods. Or you could do neither, and add some function like this to the pipeline:

from spacy.attrs import HEAD, DEP

def my_dependency_parser(doc):
    parse = doc.to_array([HEAD, DEP])
    # Set every word to depend on the next word
    for i in range(len(doc)):
        parse[i, 0] = i+1
    doc.from_array([HEAD, DEP], parse)
    return doc   # pipeline components should return the Doc

nlp.pipeline is literally just a list. Currently the only assumption is that the list entries are callable. You can set up your own list however you like, with any or all of your own components. You could have a pipeline component that predicts the sentence boundaries, creates a sequence of Doc objects using slices of the Doc.c pointer for the sentences, and parses each sentence independently:

# From within Cython

class SentenceParser(object):
    def __init__(self, segmenter, parser):
        self.segment = segmenter
        self.parse = parser

    def __call__(self, doc):
        sentences = self.segment(doc)
        cdef Doc subdoc
        for sent in sentences:
            subdoc = Doc(doc.vocab)
            subdoc.c = &doc.c[sent.start]
            subdoc.length = sent.end-sent.start
            self.parse(subdoc)
        return doc

I haven't tested this, but in theory it should work?

@christian-storm

nlp.pipeline is literally just a list. Currently the only assumption is that the list entries are callable. You can set up your own list however you like, with any or all of your own components.

I totally get how pipelines work under the hood now. But it isn't as simple as that, right? Which brings me back to what started all this for me. If it was that easy, set_factory would be as trivial as adding a callable function to the pipeline list (#1357), and I would be able to set sentence boundaries without new ones "magically" being created.

I appreciate you sharing the recipes of how you would do it. However, this is exactly what I was trying to avoid. As part of this exercise, I am now more familiar with the code and it is a more tenable solution. I fear you are going to leave behind a lot of talented people who could contribute to spacy, and box out people who find spacy unfit for their task. Most researchers won't crack the hood open and take the time to learn Cython and the inner workings of the spacy engine just so they can add or modify a part. I think there is an opportunity for spacy to create an ecosystem much like scikit-learn's, which currently has 932 contributors and a clear path for becoming one.

At any rate, I'll get off my soapbox now. I'm anxiously awaiting how, or whether, you'll solve the sbd issue. As of right now I'm dead in the water with spacy because of it, and I'm trying to decide whether to move on or hang tight.

@honnibal
Member

honnibal commented Sep 30, 2017

Well, I think there's a mix of a couple of issues here. One is that the SBD stuff is legit broken at the moment --- it's one of the tickets blocking spaCy 2 stable. Similarly the set_factory thing doesn't work as advertised at the moment either.

But the more interesting things are these deeper design questions, about how the pipeline works, and to what extent we should expect components to be "hot swappable", how versioning should work, whether we can have a pluggable architecture, etc.

I agree that having me suggest Cython code isn't a scalable approach to community development :p. On the other hand, some of the problems aren't scalable/general here --- there are specific bugs, for which I'm trying to give specific mitigations.

About the more general questions: I think we should probably switch to using entry points to give a more explicit plugin infrastructure, for both the languages and the components. We also plan to have wrapper components for the common machine learning libraries, to make it easy to write a model with say PyTorch and use it to power a POS tagger. The next release of the spaCy 2 docs will also have more details about the Pipe abstract base class.

I probably don't want something like the declarative approach to pipelines that you mentioned above, though. I think if you want that sort of workflow, the best thing to do would be to wrap each spaCy component you're interested in as a pip package, and then use Luigi or Airflow as the data pipeline layer.

The components you wrap this way can take a Doc object instead of text if you like --- you just have to supply a different tokenizer or make_doc function. So, you don't need to repeat any work this way. You can make the steps you're presenting as spaCy pipelines as small or as big as you like. I think this will be better than designing our own pipeline management solution.

@honnibal
Member

There's also some relevant discussion about extensibility in #1085 that might be interesting.

@christian-storm

Yeah, I had read #1085 as part of my due diligence trying to wrap my head around all this.

I'm heartened to hear sbd is on the radar and some thought is being given to entry points/pluggable architecture and a pipe abstract class. It is hard to arrive at the right abstraction but it'll be well worth it in the long run. On the same page with respect to the vision and using the right tool for the job, e.g., pipeline management. I'll stop bugging you so you and I can get back to being productive. :)

@cbrew

cbrew commented Oct 9, 2017

I think this little fragment ought to work. But it doesn't. Something seems to be wrong with the saving of the added pipeline component.

I have spacy 2.0.0a16 installed in a fresh conda environment with python 3.6.2 from conda-forge.

import spacy
import spacy.lang.en
from spacy.pipeline import TextCategorizer

nlp = spacy.lang.en.English()
tokenizer = nlp.tokenizer
textcat = TextCategorizer(tokenizer.vocab, labels=['ENTITY', 'ACTION', 'MODIFIER'])
nlp.pipeline.append(textcat)
nlp.to_disk('matter')

The error is:

Traceback (most recent call last):
  File "loadsave.py", line 10, in <module>
    nlp.to_disk('matter')
  File "/Users/cbrew/anaconda3/envs/spacy2/lib/python3.6/site-packages/spacy/language.py", line 507, in to_disk
    util.to_disk(path, serializers, {p: False for p in disable})
  File "/Users/cbrew/anaconda3/envs/spacy2/lib/python3.6/site-packages/spacy/util.py", line 478, in to_disk
    writer(path / key)
  File "/Users/cbrew/anaconda3/envs/spacy2/lib/python3.6/site-packages/spacy/language.py", line 505, in <lambda>
    serializers[proc.name] = lambda p, proc=proc: proc.to_disk(p, vocab=False)
  File "pipeline.pyx", line 190, in spacy.pipeline.BaseThincComponent.to_disk
  File "/Users/cbrew/anaconda3/envs/spacy2/lib/python3.6/site-packages/spacy/util.py", line 478, in to_disk
    writer(path / key)
  File "pipeline.pyx", line 188, in spacy.pipeline.BaseThincComponent.to_disk.lambda7
TypeError: Required argument 'length' (pos 1) not found

@honnibal
Member

honnibal commented Oct 10, 2017

@cbrew Thanks. Seems to be a bug in .to_bytes() --- the same happens even without adding the model to the pipeline.

Edit: Okay I think I see the issue. After __init__() the component's .model attribute won't be created yet. It's added in a second step, after you call either begin_training() or load with from_bytes() or from_disk().

I think this is leading to incorrect behaviour when you immediately try to serialize the class.

Edit2:

>>> a = True
>>> a.to_bytes()
Traceback (most recent call last):
  File "<stdin>", line 1 in <module>
TypeError: Required argument 'length' (pos 1) not found

So to_bytes() happens to clash with a method on the bool type. Sometimes dynamic typing feels like a terrible bad no good idea...

@jamesrharwood

jamesrharwood commented Oct 11, 2017

Is it possible to run Spacy functions on a redis backed worker? I'm finding that my jobs disappear as soon as they reach the nlp() command. For instance:

#### worker.py

import redis
from rq import Worker, Queue, Connection

conn = redis.from_url("redis://localhost:6379")
with Connection(conn):
    worker = Worker(list(map(Queue, ['default'])))
    worker.work()
### test.py

import spacy
nlp=spacy.load('en_core_web_sm')

def test_nlp():
    print "before NLP call"
    r = nlp(u"this is a test")
    print "after NLP call"
    return r

Running python worker.py and then the following:

from rq import Queue
from worker import conn
from test import test_nlp

q = Queue(connection=conn)
q.enqueue(test_nlp)

Results in the worker printing:

21:58:45 *** Listening on default...
21:59:07 default: test_nlp() (61b65f56-4a02-42b2-bdfd-07a5bc7bceb6)
before NLP call
21:59:08
21:59:08 *** Listening on default...

The second print statement never appears, and if I query the job status it confirms that it's started, but not finished and not failed.

Am I missing something obvious?

spacy-nightly: 2.0.0a16
rq: 0.6.0
redis: 2.10.5

UPDATE USING CELERY INSTEAD OF RQ

Using Celery instead of RQ, I now get this error:

[2017-10-12 11:12:18,412: INFO/MainProcess] Received task: test_nlp[1a48e949-b1ba-4820-aed9-b7b44a1fed1f]
[2017-10-12 11:12:18,417: WARNING/ForkPoolWorker-3] before NLP call
[2017-10-12 11:12:19,701: ERROR/MainProcess] Process 'ForkPoolWorker-3' pid:11820 exited with 'signal 11 (SIGSEGV)'
[2017-10-12 11:12:19,718: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 11 (SIGSEGV).',)

This Celery thread suggests it may be a problem with Spacy not being fork safe:
celery/celery#2964 (comment)

I tried the workaround suggested in the linked comment (importing the spacy model inside the function) but the import causes the same error.

PROBLEM SOLVED?

I tried pip install eventlet and then running the celery worker with -P eventlet -c 1000 and now the task runs successfully!

I'm not sure whether this means it's a bug within prefork or Spacy, so I'm leaving this comment here in the hope that it helps someone!

@nathanathan
Contributor

Sentence span similarity isn't working for me in spacy-nightly 2.0.0a16:

import en_core_web_sm as spacy_model
spacy_nlp = spacy_model.load()
sent_list = list(spacy_nlp(u'I saw a duck at the park. Duck under the limbo stick.').sents)
sent_list[0].similarity(sent_list[1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "span.pyx", line 134, in spacy.tokens.span.Span.similarity
  File "span.pyx", line 231, in spacy.tokens.span.Span.vector_norm.__get__
  File "span.pyx", line 216, in spacy.tokens.span.Span.vector.__get__
  File "span.pyx", line 112, in __iter__
  File "token.pyx", line 259, in spacy.tokens.token.Token.vector.__get__
IndexError: index 0 is out of bounds for axis 0 with size 0

@jesushd12

Hi @nathanathan, were you able to resolve the problem? I'm getting the same error with the similarity function; I'm using the Spanish model.

@ines
Member Author

ines commented Oct 27, 2017

@nathanathan @jesushd12 Sorry about that – we're still finalising the vector support on the current models (see #1457). We're currently training a new family of models for the next version, which includes a lot of fixes and updates currently on develop. (Unless there are new bugs or significant problems, this is likely also going to be the version we're promoting to the release candidate 🎉)

@chaturv3di

I'm trying to install spaCy 2.0 alpha in a new conda environment, and I'm getting an undefined symbol: PyFPE_jbuf error. Afaik, this is due to two versions of numpy. However, I have made sure that my packages numpy, scipy, msgpack-numpy, and Cython are all installed solely via pip. In fact, I even tried the flavour where all of these packages are installed solely via conda. No luck.

Would anyone be able to offer any advice?

@honnibal
Member

@chaturv3di That error tends to occur when pip uses a cached binary package. I find this happens a lot for me with the cytoolz package --- somehow its metadata is incorrect and pip thinks it can be compatible across both Python 2 and 3.

Try pip uninstall cytoolz && pip install cytoolz --no-cache-dir

@chaturv3di

Thanks @honnibal.

For the record, after following your advice, I received the same error, but this time from the preshed package. I did the same with it, i.e. pip uninstall preshed && pip install preshed --no-cache-dir, and it worked.

@chaturv3di

Hi All,

This is related to dependency parsing. Where can I find the exact logic for merging Spans when the "merge phrases" option is chosen on https://demos.explosion.ai?

Thanks in advance.

@ines
Member Author

ines commented Nov 3, 2017

@chaturv3di See here in the spacy-services:

if collapse_phrases:
    for np in list(self.doc.noun_chunks):
        np.merge(np.root.tag_, np.root.lemma_, np.root.ent_type_)

Essentially, all you need to do is iterate over the noun phrases in doc.noun_chunks, merge them and make sure to re-assign the tags and labels. In spaCy v2.0, you can also specify the arguments as keyword arguments, e.g. span.merge(tag=tag, lemma=lemma, ent_type=ent_type).
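Put together, a minimal v2-style sketch (assuming an installed English model; the example text is arbitrary):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Autonomous cars shift insurance liability toward manufacturers')

for np in list(doc.noun_chunks):
    np.merge(tag=np.root.tag_, lemma=np.root.lemma_, ent_type=np.root.ent_type_)

print([token.text for token in doc])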

@ines
Member Author

ines commented Nov 8, 2017

Thanks everyone for your feedback! 💙
spaCy v2.0 is now live: https://github.com/explosion/spaCy/releases/tag/v2.0.0

@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018