Using downloaded resources shouldn't require internet access #23

Closed
piskvorky opened this issue Mar 24, 2018 · 5 comments

@piskvorky
Owner

piskvorky commented Mar 24, 2018

As seen during our workshop yesterday, various network issues can appear during live or even offline events.

Once a user has downloaded a dataset onto their machine (~/gensim-data), they shouldn't need any internet access to use it. If the API needs to do some "online checking", this checking should be optional.

@piskvorky piskvorky changed the title Using downloaded resources shouldn't need internet access Using downloaded resources shouldn't require internet access Mar 24, 2018
@menshikh-iv
Contributor

menshikh-iv commented Mar 24, 2018

Let me clarify: if the user has already downloaded a model, the internet connection is used to

  • Retrieve the path to the file that will be used (yes, the "structure" is defined in lists.json too; we retrieve part of the path to the local file from it). Without a connection we can only "guess" the path (but the structure is very "regular", so in this case guessing should work fine).
  • Check that the file is correct (MD5 checksum).

I agree about the check: it should be optional (but True by default, since we must be sure that the data is correct; the user should be able to disable this check at their own risk).
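
To make the "optional, but True by default" idea concrete, here is a minimal sketch of a local checksum check. The names `md5_of_file`, `check_local_file`, and the `verify` parameter are hypothetical and are not part of the gensim API; the expected checksum is assumed to have been stored at download time.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_local_file(path, expected_md5=None, verify=True):
    """Return the path of a usable local file (hypothetical helper).

    verify defaults to True, so the data is checked unless the caller
    explicitly opts out at their own risk.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"no local copy at {path}")
    if verify:
        if expected_md5 is None:
            raise ValueError("verify=True needs an expected MD5 checksum")
        actual = md5_of_file(path)
        if actual != expected_md5:
            raise ValueError(f"checksum mismatch for {path}: got {actual}")
    return path
```

Calling `check_local_file(path, verify=False)` skips the hash entirely, so nothing in this sketch touches the network once the file and its expected checksum are on disk.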

@DSamuylov

I also encountered this problem. I was going on a trip where I would have no or only a very weak internet connection. I preloaded all the models before the trip, hoping to keep working on my project. I got a big surprise when I realised I couldn't work without internet!! My Easter holidays are over before they even started... I have to find something to do without my laptop :)

I agree that consistency is important, but a possible solution would be: 1) check whether there is an internet connection; 2) if 1 fails, try to load from the default location with some default model name; 3) if 2 fails, throw an exception saying that the model cannot be found. I am very new to this package, but I guess the default location doesn't change for most users?

It would also be great to have some custom exceptions telling what went wrong. Otherwise it is not really obvious why it fails. If you need help, I could look into the source code and try to fix it when I am back.
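
The three-step fallback and the custom exceptions proposed above could look roughly like the sketch below. Everything here is an assumption for illustration: the exception classes, `resolve_model_path`, the injected `fetch_online` callable, and the guessed layout `<base_dir>/<name>/<name>.gz` are hypothetical and not taken from the gensim source.

```python
import os


class ModelLookupError(Exception):
    """Base class for the hypothetical errors in this sketch."""


class ModelNotFoundError(ModelLookupError):
    """Raised when neither the network nor the local cache has the model."""


# Default download location mentioned in this thread.
DEFAULT_DIR = os.path.expanduser("~/gensim-data")


def resolve_model_path(name, fetch_online, base_dir=DEFAULT_DIR):
    """Return a path to the model, preferring the online lookup.

    fetch_online(name) is a caller-supplied function that returns a path
    and raises OSError (for example urllib.error.URLError) when offline.
    """
    # Step 1: try the online lookup.
    try:
        return fetch_online(name)
    except OSError as err:
        network_error = err

    # Step 2: guess the local path (assumed regular layout).
    guessed = os.path.join(base_dir, name, name + ".gz")
    if os.path.isfile(guessed):
        return guessed

    # Step 3: fail with a message that says what was tried.
    raise ModelNotFoundError(
        f"model {name!r} is unavailable: online lookup failed "
        f"({network_error}) and no local copy exists at {guessed}"
    ) from network_error
```

Passing the online lookup in as a function keeps the sketch testable without a network, and the chained exception preserves the original network error for anyone debugging the failure.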

@menshikh-iv
Contributor

menshikh-iv commented Mar 30, 2018

@DSamuylov I agree, we definitely need to add a special flag for this case. Feel free to contribute (we need to add a "persistence" flag to https://github.com/RaRe-Technologies/gensim/blob/10a3dab8d00c0523ff871af75fb0badcff14848b/gensim/downloader.py#L357)

@piskvorky
Owner Author

piskvorky commented Mar 30, 2018

I agree with @DSamuylov. I didn't realize gensim-data depends on an internet connection; that's bad design. The way I see it, we need two things:

  1. Fix the design so that internet is not mandatory for already-downloaded models.

  2. Better, clearer progress/error messages, so users know what's going on. The errors we saw during the workshop were really terrible. Nobody knew what was going on.

@mpenkov
Collaborator

mpenkov commented Sep 10, 2019

Fixed via piskvorky/gensim#2545

@mpenkov mpenkov closed this as completed Sep 10, 2019