💫 spaCy v2.0.0 alpha – details, feedback & questions (plus stickers!) #1105
I honestly just wanted it to work :-) I had spacy 1.5 installed on my other machine, I removed it and installed spacy-nightly v2.0.0a0.
Both give:
It's most likely that I'm missing something? EDIT: Indeed, on the release page it says:
Perhaps it is possible to temporarily update the docs for it? Otherwise: it works! |
Sorry about that! We originally decided against adding the alpha models to the compatibility table and shortcuts just yet to avoid confusion, but maybe it actually ended up causing more confusion. Just added the models and shortcuts, so in about 5 minutes (which is roughly how long it takes GitHub to clear its cache for raw files), the following commands should work as well: python -m spacy download en
python -m spacy download xx |
Another update: I tried parsing 16k headlines. I can parse all of them* and access some common attributes of each of them, including vectors :) I did notice that on an empty string (1 of the headlines*), it now throws an exception; this was not the case in v1.8.2. Probably better to fix that :) I wanted to do a benchmark against v1.8.2, but the machines are not comparable :( It did feel a lot slower though... |
Thanks! Try the |
Hi, very excited about the new, better, smaller and potentially faster(?) spaCy 2.0. I hope to give it a try in the next few days. Just one question. According to the (new) docs, the embeddings seem to work just as they did before, i.e. external word vectors and their averages for spans and docs. But you also mention the use of tensors for similarity calculations. Is it correct that the vectors are essentially the same, but are not used as such in the similarity calculations anymore? Or are they somehow combined with the internal tensor representations of the documents? In any case thanks for the great work, and I hope to be able to give some useful feedback soon about the Spanish model etc. |
@buhrmann This is inherently a bit confusing, because there are two types of vector representations:
We're calling type 1 |
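For illustration, a minimal sketch of accessing both kinds of representation in the alpha release (this assumes an alpha English model is installed and linked as 'en'; the attribute names follow the v2.0 docs):

import spacy

nlp = spacy.load('en')  # assumes the alpha English model is linked as 'en'
doc = nlp(u"Apples and oranges are similar.")

print(doc.vector.shape)  # type 1: average of the external word vectors
print(doc.tensor.shape)  # type 2: one context-sensitive row per token, from the shared CNN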
Thanks, that's clear now. I still had doubts about how the (type 1) vectors and (type 2) tensors are used in similarity calculations, since you mention above the tensors could have interesting properties in this context (something I'm keen to try). I've cleared this up looking at the code and it seems that the tensors for now are only used in similarity calculations when there are no word vectors available (which of course could easily be changed with user hooks). |
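As a concrete reference for the user-hooks point above, here is a hedged sketch of overriding similarity so that it uses the document tensor instead of the averaged word vectors (the cosine implementation is my own; only doc.user_hooks and doc.tensor come from the docs and discussion above):

import numpy
import spacy

def tensor_similarity(doc1, doc2):
    # Average the per-token tensor rows and compare with cosine similarity
    v1 = doc1.tensor.mean(axis=0)
    v2 = doc2.tensor.mean(axis=0)
    return numpy.dot(v1, v2) / (numpy.linalg.norm(v1) * numpy.linalg.norm(v2))

nlp = spacy.load('en')
doc1 = nlp(u"I like apples.")
doc2 = nlp(u"I like oranges.")
doc1.user_hooks['similarity'] = tensor_similarity
print(doc1.similarity(doc2))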
Hi, I wanted to do a quick benchmark between spaCy v1.8.2 and v2.0.0. First of all, memory usage is dramatically lower in the new version! The old version's model took approximately 1 GB of memory, while the new one takes about 200 MB. However, I noticed that the latest release uses all 8 cores of my machine (100% usage), yet it is remarkably slow! I made two separate virtualenvs to make sure the installation was clean.
And for the latest release, same code only the model imports changed -
The first (v1.8.2) runs in |
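For context, the timing loop was roughly of this shape (my own reconstruction for illustration, not the poster's exact script; the text list and model shortcut are placeholders):

import time
import spacy

nlp = spacy.load('en')  # v1.8.2 model in one virtualenv, the spacy-nightly model in the other
texts = [u"Some headline text goes here."] * 16000  # placeholder corpus

start = time.time()
for text in texts:
    doc = nlp(text)
print("Processed %d docs in %.1fs" % (len(texts), time.time() - start))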
Hm! That's a lot worse than my tests, but in my tests I used the
A thought: Could you try setting |
@honnibal Hi, setting |
I just finished reading the documentation for v2.0 and it's way better than for v1.x. But this |
@slavaGanzin The neural network model makes lots of calls to
@eldor4do What happens if you use |
@honnibal I'll try with |
Hopefully the ability to hold more workers in memory compensates a bit. Btw, the changes to the |
@honnibal Yes, that is a plus. Also, I tested the |
@eldor4do That's annoying. I think it depends on what BLAS numpy is linked to. Is the second machine a mac? If so the relevant library will be Accelerate, not openblas. Maybe there's a numpy API for limiting the thread count? Appreciate the feedback -- this is good alpha testing :) |
@honnibal What are the expected differences for your test cases? |
#1021 is still an issue with this alpha release -- all of the sentences I gave as examples fail to be parsed correctly. |
I get some errors when I run this example from the documentation (https://alpha.spacy.io/docs/usage/lightning-tour#examples-tokens-sentences):
There are two problems
I'm using python 3.6 (and spacy 2.0 alpha) |
I appreciate the examples, they surfaced some very valid issues. The issues you raise fall into a few camps: preventing folks from shooting themselves in the foot, defining expected behavior/model selection, and explicitly capturing static dependencies. Let's see if I can motivate some solutions. What I defined before was an API between components, much like how a bunch of microservices would all agree on what protobuf definitions and versions they'll use to communicate with each other. It is a communication contract, full stop. You are absolutely correct, it does nothing to define the expected behavior of the computational graph in turning inputs into outputs. Nor does it define what dependencies are required by each component. We have the technology. We can rebuild him.
What's the diff between pure software packages and ML based packages?
Riffing off your example, if a developer is selecting an appropriate tokenizer from, say, the available NLTK tokenizers, they could look at the docs to see if it splits contractions or not. Even if it was unspecified, they could figure out which one is best through some testing, or by looking at the source if it is open source. If a developer chooses the wrong tokenizer for their task and doesn't have a test suite to alert them to this fact, I would have to say the bug is on them, no? What if the tokenizer is closed source? Isn't this essentially the same black box as an ML based tokenizer? Well, not exactly. As an example, when I used SAP's NLP software the docs detailed the rule set used for tokenization. If the tokenization is learned, the rules are inferred from the data and can't be written down. With the "don't" tokenization example, how would one know that "don't" is going to be properly handled without explicitly testing it?
Expected behavior
Imagine if one had access to an inventory of trained models. To select the best model for a given data set/task, one would compare the summary statistics of each model run against the test set. Likely one might even have a human inspect individual predictions to ensure the right model is being selected (for example). If the model seems like it would benefit from domain adaptation, further training in a manner that avoids catastrophic forgetting might prove effective. As alluded to by your examples, what if the developer doesn't have a labeled test set to aid in model selection? My knee-jerk reaction is that they are ill equipped to stray from the default setup. They should use Prodigy to create a test set first. To me it is equivalent to someone picking NLTK's Moses tokenizer over the casual tokenizer for Twitter data without running tests to see which is better. This may be a bit far afield, but a solution could be to ship a model with a set of sufficient statistics that describes the distribution of the training corpus, a program to generate the statistics for a new corpus, and a way of comparing the statistics for fit/transferability (KL divergence and outliers?) - a rough sketch of this idea follows at the end of this comment. For tokenization and tagging, a first approximation would be the distribution of tokens and POS tags. So if the training set didn't have "didn't" but the user's corpus does, it would alert them to that fact, and they could build a test to make sure it behaves as expected and possibly gain a motivation to further train the model. It might prevent some from shooting themselves in the foot by aiding them in the model selection process.
Versioning and dependencies
So how do we ensure that each ML model is reproducible?
Much in the same way versions of spaCy depend on certain versions of the model as defined in compatibility.json. A model would specify what GloVe vectors were used, the corpus it was trained on (or a pointer to it, e.g., OntoNotes 5.0), the hyper-parameter settings, etc. Anything and everything needed to recreate that particular version of the model from scratch. Something along the lines of dataversioncontrol. To cordon off model-specific data in spaCy, the data would be stored in a private namespace for that model/factory. Better yet, to allow for shared data amongst models, like GloVe vectors, vocab, etc., the data would be stored in a private global namespace, with the model instances having pointers to it from their private namespaces. Much like how doc.vocab points to a global Vocab instance. The difference being that everything would be versioned (a hash over the data, with a human readable version number for good measure). Now let me walk through each of your examples to see how this further refined concept might address each situation.
Yes, exactly. It would point to a versioned file.
I'm not entirely sure what you mean by process A and B. Corpus A and corpus B? You shouldn't ever pass weights around. You'd load model A's weights, tags, etc. and never change them. If you did, it would be a new model.
Maybe I'm playing antics with semantics, but I wouldn't say "copies" here. One may have two separate tokenizers, or possibly one with a switch if it is rule based. With the learned tokenizers, the run of the pipeline would specify which tokenizer to use. The pipeline is defined, compiled, and run. If a different pipeline is required, e.g., a different tokenizer, the pipeline is redefined and recompiled before running. The components would be cached so it would be fast and, to your point, perhaps there is a cache of predefined pipelines as well if compilation proves expensive.
Agreed. That is why model selection is so important and needs to be surfaced as a step in the development process. The situation you describe is applicable to current spaCy users. Without access to OntoNotes, how does one know if it is close enough to their domain to be effective at, say, parsing? Even if one did have access to OntoNotes, how does one judge how transferable the models are? One could compare vocabulary, tag, and dependency overlap and their frequencies. But nothing trumps a run against the test set, right?
Yes, that would be disastrous and shouldn't be allowed or, with the limitations of python, discouraged. The vectors are part of the model definition and when loaded would reside in a private namespace. Of course, nothing can be made private in python so someone could blow through a few stop signs and still shoot themselves in the foot.
I believe what you are saying is that there is no way to ensure each model is trained on the same data set, no? In other words, to get the reported results, the "expected input" needs to be distributionally similar to the training data. If this is what you mean, one could have an optional check in the compilation step that makes sure the data sources are the same across the pipeline. This would prevent some noob, only looking at reported accuracies when creating a pipeline, from chaining together a Twitter-based model with a model trained on arXiv.
Agreed. However, if you decide to domain adapt the model, i.e., online learning with Prodigy, this should produce a new version of the model, with a model definition that points to new parameters and a new data source listing that includes the original data source and the new data source. Despite the length of this response, what I'm talking about really isn't that complicated in concept, and from what I can tell not too far afield from where spaCy 2.0 is now. I'd be willing to chip in if that is helpful. It'll be much more difficult once the ship leaves port. I'm curious to hear what you think? |
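As promised above, a rough sketch of the corpus-statistics idea (all function and variable names here are hypothetical, and the smoothing scheme is my own assumption): build a token distribution for the training corpus and for a new corpus, then compare them with KL divergence.

from collections import Counter

import numpy

def token_distribution(docs, vocab_size=10000):
    # Relative frequency of the most common lowercased tokens
    counts = Counter(token.lower_ for doc in docs for token in doc)
    total = float(sum(counts.values()))
    return {word: count / total for word, count in counts.most_common(vocab_size)}

def kl_divergence(p, q, epsilon=1e-9):
    # D_KL(P || Q) over the union of both vocabularies, smoothing missing words
    words = sorted(set(p) | set(q))
    p_vec = numpy.array([p.get(w, epsilon) for w in words])
    q_vec = numpy.array([q.get(w, epsilon) for w in words])
    p_vec /= p_vec.sum()
    q_vec /= q_vec.sum()
    return float(numpy.sum(p_vec * numpy.log(p_vec / q_vec)))

A large divergence, or frequent out-of-vocabulary tokens like "didn't", would be the trigger for writing a test or doing further training.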
The difference I'm pointing to is there's no API abstraction possible with ML. We're in a continuous space of better/closer, instead of a discrete space of match/no match. If you imagine each component as versioned, there's no room for a range of versions --- you have to specify an exact version to get the right results, every time. Once the weights are trained nothing is interchangeable and ideally nothing should be reconfigured. This also means you can't really usefully cache and compose the pipeline components. There's no point in registering a component like "spaCy parsing model v1.0.0a4" on its own. The minimum versionable unit is something like "spaCy pipeline v1.0.0a4", because to get the dependency parse, you should run exactly the fully-configured tokenizer and tagger used during training, with no change of configuration whatsoever. We can version and release a component that provides a single function
I've been trying to keep the pipelines shorter in v2 to mitigate this issue, so things are more composable. The v2 parser doesn't use POS tag features anymore, and the next release will also do away with the multi-task CNN, instead giving each component its own CNN pre-process. This might all change though. If results become better with longer pipelines, maybe we want longer pipelines. |
Keeping the conversation going... I really hope this isn't coming across as adversarial or grating in any way. I actually think we are getting somewhere and agree on most things.
Well put, and I agree in principle, with the caveat that code is rarely so boolean. Take a complex signal processing algorithm where the function F is either learned or programmed with an analytic/closed-form solution. How is the test and verify process of either really any different? Sure, each component of the latter can be tested individually. That certainly makes it easier to debug when things go south. However, a test set of Y = F(X) is as important in either case, right?
Once again, on the same page, although I think there is another way to look at it. The key is defining exactly what the "right results" are. In building an ML model, one uses the validation set to make sure the model is learning but not overfitting the training set. Then the test set is used as the ultimate test to ensure the model is transferable. If one were to pull two models off the shelf and plug them together as I've been suggesting, you'd judge the effectiveness of each using a task-specific test set, and the two together using a test set that encompasses the whole pipeline, no? This happens all the time in ML, e.g., a speech-to-text system that uses spectrograms, KenLM, and DL. Even though the first two aren't learned, though they could be, there are a bunch of hyper-parameters that need to be "learned."
I would agree that training end to end and freezing the models in the pipeline afterwards leads to the most reproducible results. If this is the intended design, one will only ever be able to disable or append purely additive components, e.g., sentiment. Just to play a little devil's advocate here: spaCy promotes the ability to swap out the tokenizer that feeds the pipeline, without a warning or a mention that one should retrain. Isn't this contrary to your end-to-end abstraction? What if someone blindly decides to implement that whitespace tokenizer described in the docs (see the sketch after this comment)? To use your example, spaCy might start labeling "don't" as a proper noun, no? The same could be said about adding special cases for tokenization - you are performing operations that weren't performed on the training data! If the dependencies remain constant across the pipeline, I still think plugging trained models into the pipeline makes sense if one knows what they are doing - an appropriate test harness at each step of the pipeline. On the other hand, I agree it is easy to go off the rails when components are tightly coupled, e.g., setting sent_start and making the trained parser obey them even though it wasn't trained with those sentence boundaries. However, there are many valid cases where it makes sense, e.g., training an SBD, freezing it, and then training the remaining pipeline.
Another idea
Okay, I got it :), too much configurability can lead to bad things. But, really, why can't one version and release a component like
Right now, if I wanted to change anything beyond the tokenizer in the pipeline it is non-trivial. However, I'm starting to realize that I may be barking up the wrong tree here. Looking at Prodigy and the usage docs for spaCy, only downstream classification models (sentiment, intent, ...) are ever referenced. What if I want to add a semantic role labeler that requires a constituency parse? Or better yet, what if someone publishes a parser that is much more accurate and I really could use that extra accuracy? I guess I'm back to building my own NLP pipeline.
Yes! That is the word - composable. "A highly composable system provides components that can be selected and assembled in various combinations to satisfy specific user requirements." That's it! I would love a world where I can truly compose an NLP pipeline, analogous to how Keras allows you to easily build, train, and use a NN; just one level of abstraction higher. I don't see how "shorter" pipelines are more composable though. Forgive me if I'm wrong, but I don't really see any composability in spaCy at the moment. Maybe configurability? Though, one gets the impression by reading the docs, "you can mix and match pipeline components," that the vision is to be able to compose pipelines that deliver different behaviors (specific user requirements).
I wish I knew the code better to react. I'm already in a fairly precarious position needing different tokenizers and sentence boundary detectors, and there isn't a clear way to add these components. With your previously proposed solution of breaking and merging the dependency tree to allow for new sentence boundaries, what would that do to accuracy? Isn't this the exact tinkering with a trained model you are trying to avoid? Once again, thanks for engaging, Matthew. |
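For reference on the tokenizer-swapping point above, here is roughly the whitespace tokenizer example from the v2.0 docs, plugged into a loaded pipeline (a sketch; the 'en' shortcut is assumed to point to an installed alpha model):

import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer(object):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(' ')
        # every token is assumed to be followed by a space
        spaces = [True] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.load('en')
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp(u"I don't understand.")
print([t.text for t in doc])  # "don't" and "understand." stay as single tokens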
No, not at all -- I hope I'm not coming across as intransigent :)
I do think this is a potential problem, and maybe we should be clearer about the problem in the docs. The trade-off is sort of like having internals prefixed with an underscore in Python: it can be useful to play with these things, but you don't really get safety guarantees. |
We don't really have a data structure for constituency parses at the moment, or for semantic roles. You could add the data into
Well, not really? You could subclass

from spacy.attrs import HEAD, DEP

def my_dependency_parser(doc):
    parse = doc.to_array([HEAD, DEP])
    # Set every word to depend on the next word
    for i in range(len(doc)):
        parse[i, 0] = i + 1
    doc.from_array([HEAD, DEP], parse)

# From within Cython
from spacy.tokens.doc cimport Doc

class SentenceParser(object):
    def __init__(self, segmenter, parser):
        self.segment = segmenter
        self.parse = parser

    def __call__(self, doc):
        sentences = self.segment(doc)
        cdef Doc subdoc
        for sent in sentences:
            subdoc = Doc(doc.vocab)
            subdoc.c = &doc.c[sent.start]
            subdoc.length = sent.end - sent.start
            self.parse(subdoc)
        return doc

I haven't tested this, but in theory it should work? |
I totally get how pipelines work under the hood now. But it isn't as simple as that, right? Which brings me back to what started all this for me. If it was that easy, set_factory would be as trivial as adding a callable function to the pipeline list (#1357), and I would be able to set sentence boundaries without new ones "magically" being created. I appreciate you sharing the recipes of how you would do it. However, this is exactly what I was trying to avoid. As part of this exercise, I am now more familiar with the code, and it is a more tenable solution. I fear you are going to leave behind a lot of talented people who could contribute to spaCy, and box out people who find spaCy unfit for their task. Most researchers won't crack the hood open and take the time to learn Cython and the inner workings of the spaCy engine just so they can add or modify a part. I think there is an opportunity for spaCy to create an ecosystem much like scikit-learn's, which currently has 932 contributors and a clear path for becoming one. At any rate, I'll get off my soapbox now. I'm anxiously awaiting how, or if, you'll solve the SBD issue. As of right now I'm dead in the water with spaCy because of it. Trying to decide if I move on or hang tight. |
Well, I think there's a mix of a couple of issues here. One is that the SBD stuff is legit broken at the moment --- it's one of the tickets blocking spaCy 2 stable. Similarly the
But the more interesting things are these deeper design questions, about how the pipeline works, to what extent we should expect components to be "hot swappable", how versioning should work, whether we can have a pluggable architecture, etc. I agree that having me suggest Cython code isn't a scalable approach to community development :p. On the other hand, some of the problems aren't scalable/general here --- there are specific bugs, for which I'm trying to give specific mitigations. About the more general questions: I think we should probably switch to using entry points to give a more explicit plugin infrastructure, for both the languages and the components. We also plan to have wrapper components for the common machine learning libraries, to make it easy to write a model with, say, PyTorch and use it to power a POS tagger. The next release of the spaCy 2 docs will also have more details about the Pipe abstract base class. I probably don't want something like the declarative approach to pipelines that you mentioned above, though. I think if you want that sort of workflow, the best thing to do would be to wrap each spaCy component you're interested in as a pip package, and then use Luigi or Airflow as the data pipeline layer. The components you wrap this way can take a |
There's also some relevant discussion about extensibility in #1085 that might be interesting. |
Yeah, I had read #1085 as part of my due diligence trying to wrap my head around all this. I'm heartened to hear sbd is on the radar and some thought is being given to entry points/pluggable architecture and a pipe abstract class. It is hard to arrive at the right abstraction but it'll be well worth it in the long run. On the same page with respect to the vision and using the right tool for the job, e.g., pipeline management. I'll stop bugging you so you and I can get back to being productive. :) |
I think this little fragment ought to work. But it doesn't. Something seems to be wrong with the
I have spacy 2.0.0a16 installed in a fresh
error is
|
@cbrew Thanks. Seems to be a bug in
Edit: Okay, I think I see the issue. After
I think this is leading to incorrect behaviour when you immediately try to serialize the class.
Edit2:

>>> a = True
>>> a.to_bytes()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Required argument 'length' (pos 1) not found

So |
Is it possible to run spaCy functions on a Redis-backed worker? I'm finding that my jobs disappear as soon as they reach the
Running
Results in the worker printing:
The second print statement never appears, and if I query the job status it confirms that it's started, but not finished and not failed. Am I missing something obvious? spacy-nightly: 2.0.0a16
UPDATE: USING CELERY INSTEAD OF RQ
Using Celery instead of RQ, I now get this error:
This Celery thread suggests it may be a problem with spaCy not being fork-safe: I tried the workaround suggested in the linked comment (importing the spaCy model inside the function), but the import causes the same error.
PROBLEM SOLVED?
I tried
I'm not sure whether this means it's a bug within prefork or spaCy, so I'm leaving this comment here in the hope that it helps someone! |
Sentence span similarity isn't working for me in spacy-nightly 2.0.0a16:
|
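For reference, the kind of call being discussed is roughly the following (my own reconstruction for illustration, not the poster's snippet; it assumes an alpha model linked as 'en'):

import spacy

nlp = spacy.load('en')
doc1 = nlp(u"The cat sat on the mat.")
doc2 = nlp(u"A dog slept on the rug.")
sent1 = list(doc1.sents)[0]
sent2 = list(doc2.sents)[0]
print(sent1.similarity(sent2))  # reported to misbehave in 2.0.0a16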
Hi @nathanathan, were you able to resolve the problem? I'm getting the same problem with the similarity function; I'm using the Spanish model. |
@nathanathan @jesushd12 Sorry about that - we're still finalising the vector support on the current models (see #1457). We're currently training a new family of models for the next version, which includes a lot of fixes and updates currently on |
I'm trying to install spaCy 2.0 alpha in a new conda environment, and I'm receiving
Would anyone be able to offer any advice? |
@chaturv3di That error tends to occur when
Try |
Thanks @honnibal. For the record, after following your advice, I received the same error but this time from the |
Hi All, this is related to dependency parsing. Where can I find the exact logic for merging
Thanks in advance. |
@chaturv3di See here in the

if collapse_phrases:
    for np in list(self.doc.noun_chunks):
        np.merge(np.root.tag_, np.root.lemma_, np.root.ent_type_)

Essentially, all you need to do is iterate over the noun phrases in
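A short usage sketch of the same idea applied to your own Doc (the 'en' shortcut and example sentence are placeholders; the merge signature mirrors the snippet above):

import spacy

nlp = spacy.load('en')
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")
for np in list(doc.noun_chunks):
    # collapse each noun phrase into a single token, keeping the root's annotations
    np.merge(np.root.tag_, np.root.lemma_, np.root.ent_type_)
print([t.text for t in doc])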
Thanks everyone for your feedback! |
We're very excited to finally publish the first alpha pre-release of spaCy v2.0. It's still an early release and (obviously) not intended for production use. You might come across a NotImplementedError - see the release notes for the implementation details that are still missing. This thread is intended for general discussion, feedback and all questions related to v2.0. If you come across more complex bugs, feel free to open a separate issue.
Quickstart & overview
The most important new features
Matcher and language processing pipelines.
Installation
spaCy v2.0.0-alpha is available on pip as spacy-nightly. If you want to test the new version, we recommend setting up a clean environment first. To install the new model, you'll have to download it with its full name, using the --direct flag.
Alpha models for German, French and Spanish are coming soon!
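As a quick sanity check after installing, something like the following should run (a sketch; the 'en' shortcut assumes you've downloaded and linked an alpha English model):

import spacy

nlp = spacy.load('en')
doc = nlp(u"This is a sentence about Berlin.")
print([(token.text, token.pos_, token.dep_) for token in doc])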
Now on to the fun part: stickers!
We just got our first delivery of spaCy stickers and want to share them with you! There's only one small favour we'd like to ask. The part we're currently behind on is the tests - this includes our test suite as well as in-depth testing of the new features and usage examples. So here's the idea:
Submit a PR with your test to the develop branch - if the test covers a bug and currently fails, mark it with @pytest.mark.xfail. For more info, see the test suite docs. Once your pull request is accepted, send us your address via email or private message on Gitter and we'll mail you stickers.
If you can't find anything, don't have time or can't be bothered, that's fine too. Posting your feedback on spaCy v2.0 here counts as well. To be honest, we really just want to mail out stickers.
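For illustration, a minimal sketch of an xfail-marked regression test (a hypothetical test, not taken from the actual test suite, which uses its own fixtures; it assumes an alpha English model linked as 'en'):

import pytest
import spacy

@pytest.mark.xfail
def test_empty_string_does_not_raise():
    # Covers the empty-string exception reported earlier in this thread
    nlp = spacy.load('en')
    doc = nlp(u"")
    assert len(doc) == 0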