Installing models using pip: improve documentation #1099

Closed
danielhers opened this issue Jun 4, 2017 · 9 comments
Labels
docs (Documentation and website), models (Issues related to the statistical models)

Comments

@danielhers
Contributor

danielhers commented Jun 4, 2017

Right now, my package has spaCy as a requirement in requirements.txt, and python -m spacy download en is run as part of the installation process.
According to the documentation, models can be listed in requirements.txt, but no example is given. How can I add a requirements.txt entry to just install the default English models?
And is this enough, or will I then also need to run python -m spacy link or something?

ines added the docs and models labels Jun 4, 2017
@ines
Member

ines commented Jun 4, 2017

Thanks – and sorry about the confusion. I agree, this should definitely be more clear!

The standard way of installing packages specified in the requirements.txt assumes they're downloadable via a PyPi server (usually pypi.python.org). While model packages are valid pip packages, they can't be uploaded to the official PyPi directory, as they don't meet the requirements (they're too large and consist of mostly binary data). However, a lot of companies run their own internal installations of PyPi – in that case, you can simply upload the model there and point your pip at the internal server.

Alternatively, pip also lets you specify URLs and other sources in the requirements – see here for more info and examples. So instead of only the package name, you can add the URLs of the models you want to install.
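
As a rough illustration of the URL approach, the sketch below builds a requirements.txt line for a model archive. The URL pattern and the 1.2.0 version are taken from release links quoted elsewhere in this thread; the helper name `model_requirement` is hypothetical, and the version you need may differ, so check the spacy-models releases before copying it.

```python
# Base URL pattern as seen in release links on the spacy-models repository.
RELEASES = "https://github.com/explosion/spacy-models/releases/download"


def model_requirement(name: str, version: str) -> str:
    """Return a requirements.txt line pointing pip at a model archive.

    Hypothetical helper: pip accepts a plain archive URL as a requirement.
    """
    tag = f"{name}-{version}"
    return f"{RELEASES}/{tag}/{tag}.tar.gz"


if __name__ == "__main__":
    # Paste the printed line into requirements.txt next to "spacy".
    print(model_requirement("en_core_web_sm", "1.2.0"))
```

The printed URL goes on its own line in requirements.txt, alongside the usual `spacy` entry.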

This won't run any spaCy internals like download (which is mostly a convenience wrapper for pip's installer) or link. So you'll either have to create the symlink yourself afterwards, or load the model by importing the package and calling its load() method with no arguments:

import en_core_web_sm
nlp = en_core_web_sm.load()

In general, we do recommend this syntax for larger code bases because it doesn't depend on symlinks, and is cleaner and more "native" – for example, if a model package is not installed, Python will raise an ImportError immediately, instead of failing somewhere down the line when calling spacy.load().
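
To make the "fail early" point concrete, here is a minimal sketch of a startup helper that imports a model package by name and surfaces a missing install right away. The function name `load_model` and the wording of the error message are assumptions for illustration; the only behaviour relied on from the thread is that a model package exposes a load() method.

```python
import importlib


def load_model(package_name: str):
    """Import an installed model package and call its load() method.

    Hypothetical helper: re-raises ImportError with an install hint so a
    missing model fails at startup rather than later in the pipeline.
    """
    try:
        package = importlib.import_module(package_name)
    except ImportError as err:
        raise ImportError(
            f"Model package '{package_name}' is not installed; "
            "add its URL to requirements.txt and reinstall."
        ) from err
    return package.load()
```

Calling `load_model("en_core_web_sm")` once at application startup would then behave like the two-line import above, with a clearer message when the package is absent.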

So if specifying models in your requirements.txt is useful for your project, there's a high chance that native model imports will actually be more convenient as well. I hope this helps – will definitely add a section about this to the docs as well 👍

TL;DR Adding the model URL instead of the package name to your requirements.txt and importing the model as a package in your code should do the trick.

@danielhers
Contributor Author

Thank you, this is very clear!
So is en_core_web_sm the same package I get when I run python -m spacy download en?

@ines
Member

ines commented Jun 4, 2017

Yes, en and all other shortcuts download the default models, usually the most compact ones – in this case en_core_web_sm. (In the list of available models, the default models are the ones marked with a star. Internally, spaCy resolves the shortcuts by looking them up in this table.)
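
The shortcut lookup can be pictured as a plain dictionary lookup. This is only an illustrative sketch and not spaCy's actual implementation: the table name `SHORTCUTS` and the function `resolve_model_name` are hypothetical, and only the "en" entry is confirmed by this thread.

```python
# Illustrative shortcut table; only the "en" entry is confirmed in this thread.
SHORTCUTS = {"en": "en_core_web_sm"}


def resolve_model_name(name: str) -> str:
    """Map a shortcut to its full package name; pass full names through."""
    return SHORTCUTS.get(name, name)
```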

@lalvarezguillen

Very clear indeed! Now I'm wondering if there's a simple equivalent for setup.py

We used a call to spacy.en.download in our setup.py to install the required modules, I believe the practice is deprecated or frowned upon.

@ines
Member

ines commented Jun 6, 2017

@lalvarezguillen I think you might be looking for a solution like this: https://stackoverflow.com/a/3481388/6400719

> We used a call to spacy.en.download in our setup.py to install the required modules, I believe the practice is deprecated or frowned upon.

In theory, you could still use spacy.cli.download for this (spacy.en.download is deprecated since v1.7). I wouldn't say that this practice is frowned upon, but we definitely wouldn't recommend it for production use. If you know which model your application needs, you shouldn't have to do an additional roundtrip and depend on spaCy's downloader just to fetch and pip install a package from a URL. (This was also part of the reason we decided to publish the models on GitHub and not just route all requests via our server. Especially since there's not just one "the model" anymore, but several different ones for different languages and use cases.)
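
If you do know the model URL up front, skipping the downloader could look roughly like the sketch below, which calls pip directly from a setup or deployment script. The function names are hypothetical, and this is one possible approach rather than a recommended spaCy API.

```python
import subprocess
import sys


def pip_install_command(url: str) -> list:
    """Build the pip command that installs a model archive from a URL."""
    # Use the current interpreter so the model lands in the same environment.
    return [sys.executable, "-m", "pip", "install", url]


def install_model(url: str) -> None:
    """Run pip directly; raises CalledProcessError if the install fails."""
    subprocess.check_call(pip_install_command(url))
```

Keeping the command construction separate from the call makes it easy to log or inspect what will be run before anything is installed.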

Btw, in spaCy v2.x, another option could be to simply package the models with your application. The new alpha models are only 12 and 15 MB – about the size of the spaCy package, and probably smaller than many other random pip packages.

Edit: Just to clarify, this approach would be mostly for internal production use – not if you're actually distributing your package on PyPi or GitHub. While the model licenses (CC BY-SA) allow redistribution, we don't want to encourage people to reupload and mirror the official spaCy models. After all, they're just binary data and we want to make sure that there's only one official distribution. This makes things safer and less confusing for everyone.

@ines
Member

ines commented Jul 22, 2017

Addressed in 7c4bf99 and live here!

@shuhei

shuhei commented Jan 15, 2018

For Python newbies like me, to add a model to a Pipfile:

[packages]

spacy = "*"
de_core_news_sm = { file = 'https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-2.0.0/de_core_news_sm-2.0.0.tar.gz' }

@msmedes

msmedes commented Apr 22, 2018

Not sure why but I added the model to my Pipfile, updated the lock file, but spacy doesn't appear to be working. Right now my Pipfile looks like this:

[packages]
spacy = "*"
gunicorn = "*"
flask = "*"
"en_core_web_sm" = {file = "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-1.2.0/en_core_web_sm-1.2.0.tar.gz"}

and my import and package loading looks like this:

import en_core_web_sm
print("Loading spacy...")
nlp = en_core_web_sm.load()
print(nlp)
print("Spacy loaded.")

my print statements look like this:
11:45:02 web.1 | Loading spacy something...
11:45:03 web.1 | <spacy.lang.en.English object at 0x10d676e80>
11:45:03 web.1 | Spacy loaded.

but when I actually process text or do anything with the nlp object... nothing happens. It might be tokenizing the text, but not much else. If I pass text in with doc = nlp(text) and run print(doc), I get the text back. But so far any attempts at looking at doc.ents have failed: printing doc.ents returns an empty set.

I should mention that this whole thing works when it is not run through Heroku. If I run it in the local environment using python app.py, it fires up with no problem and processes text. However, when I run heroku local web or git push heroku master I get diddly, despite the fact that it appears to be loading the spaCy model. Any ideas as to what I'm doing wrong?

(Apologies if this is in the wrong place or I should have made a new issue. If so let me know and I'll do so.)

@lock

lock bot commented May 22, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 22, 2018