back gensim.downloader.load_info function by a cache #2545

Merged
merged 8 commits into from
Jul 7, 2019


mpenkov
Collaborator

@mpenkov mpenkov commented Jul 2, 2019

Currently, it's not possible to use the gensim.downloader submodule without a network connection, even for datasets that have already been downloaded.

This PR removes the above restriction by keeping the dataset information in a local cache. This cache gets updated transparently.

You can confirm it works via:

$ python -m gensim.downloader --info | head
{
    "corpora": {
        "semeval-2016-2017-task3-subtaskBC": {
            "num_records": -1,
            "record_format": "dict",
            "file_size": 6344358,
            "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/semeval-2016-2017-task3-subtaskB-eng/__init__.py",
            "license": "All files released for the task are free for general research use",
            "fields": {
                "2016-train": [
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
BrokenPipeError: [Errno 32] Broken pipe

and then repeating the above command after disconnecting from the network.
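For readers who want the idea in code, here is a minimal sketch of the fetch-then-fall-back-to-cache pattern described above. It is not the code from this PR: the function name `load_info_with_cache` and the injected `fetch` callable are assumptions made for illustration, and only the error wording is borrowed from this thread.

```python
import json
import os


def load_info_with_cache(fetch, cache_path, encoding="utf-8"):
    """Return dataset info, refreshing a local cache when the fetch succeeds.

    `fetch` is a hypothetical zero-argument callable returning raw JSON bytes,
    for example ``lambda: urlopen(url).read()``. Injecting it keeps the sketch
    usable without a network connection.
    """
    try:
        info_bytes = fetch()
    except OSError as err:
        # urllib.error.URLError subclasses OSError, so DNS and connection
        # failures land here. Fall back to whatever was cached earlier.
        try:
            with open(cache_path, "r", encoding=encoding) as fin:
                return json.load(fin)
        except (OSError, ValueError):
            # Missing or corrupt cache: give the user one actionable message.
            raise ValueError(
                "unable to read local cache %r during fallback, "
                "connect to the Internet and retry" % cache_path
            ) from err

    info = json.loads(info_bytes.decode(encoding))

    # The fetch worked, so refresh the cache for future offline use.
    cache_dir = os.path.dirname(cache_path)
    if cache_dir:
        os.makedirs(cache_dir, exist_ok=True)
    with open(cache_path, "w", encoding=encoding) as fout:
        json.dump(info, fout)
    return info
```

With this shape, the first online call populates the cache, and later offline calls return the cached copy instead of failing.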

@mpenkov mpenkov requested a review from piskvorky July 2, 2019 01:54
@piskvorky
Owner

Fixes piskvorky/gensim-data#23.

@piskvorky
Owner

piskvorky commented Jul 2, 2019

@mpenkov this is my output on mpenkov/offline after disconnecting internet:

$ python -m gensim.downloader --info | head

2019-07-02 10:56:16,761 : __main__ : ERROR : caught non-fatal exception, see trace below
2019-07-02 10:56:16,761 : __main__ : ERROR : <urlopen error [Errno 8] nodename nor servname provided, or not known>
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1234, in endheaders
--- Logging error ---
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 964, in send
    self.connect()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1392, in connect
    super().connect()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 936, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/socket.py", line 704, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 8] nodename nor servname provided, or not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Volumes/work/workspace/gensim/trunk/gensim/downloader.py", line 176, in _load_info
    info_bytes = urlopen(DATA_LIST_URL).read()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 526, in open
    response = self._open(req, data)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 544, in _open
    '_open', req)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 1361, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 1320, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/logging/__init__.py", line 996, in emit
    self.flush()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/logging/__init__.py", line 976, in flush
    self.stream.flush()
BrokenPipeError: [Errno 32] Broken pipe
Call stack:
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Volumes/work/workspace/gensim/trunk/gensim/downloader.py", line 504, in <module>
    output = info() if (args.info == full_information) else info(name=args.info)
  File "/Volumes/work/workspace/gensim/trunk/gensim/downloader.py", line 237, in info
    information = _load_info()
  File "/Volumes/work/workspace/gensim/trunk/gensim/downloader.py", line 186, in _load_info
    logger.error('attempting to recover from local cache (%r)', cache_path)
Message: 'attempting to recover from local cache (%r)'
Arguments: ('/Users/kofola3/gensim-data/information.json',)
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Volumes/work/workspace/gensim/trunk/gensim/downloader.py", line 505, in <module>
    print(json.dumps(output, indent=4))
BrokenPipeError: [Errno 32] Broken pipe
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
BrokenPipeError: [Errno 32] Broken pipe

This is way too wild for user-facing functionality; we need simpler and more actionable output.
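Part of the noise above is the `BrokenPipeError` raised because `head` closes the pipe after ten lines. A minimal sketch of one way to tolerate that, following the approach in the Python `signal` documentation's note on SIGPIPE, is shown below. The helper name `print_tolerating_closed_pipe` is hypothetical and is not part of gensim.

```python
import os
import sys


def print_tolerating_closed_pipe(text, stream=None):
    """Write `text` plus a newline; return False if the reader closed the pipe."""
    stream = sys.stdout if stream is None else stream
    try:
        stream.write(text + "\n")
        # Flush inside the try block so a closed pipe is detected here,
        # not later during interpreter shutdown.
        stream.flush()
        return True
    except BrokenPipeError:
        if stream is sys.stdout:
            # Python flushes stdout again on exit; point it at devnull so
            # that final flush does not raise a second BrokenPipeError.
            devnull = os.open(os.devnull, os.O_WRONLY)
            os.dup2(devnull, sys.stdout.fileno())
        return False
```

A command-line entry point could call this instead of a bare `print` and exit quietly when it returns `False`.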

Co-Authored-By: Radim Řehůřek <[email protected]>
@mpenkov
Collaborator Author

mpenkov commented Jul 2, 2019

OK, I've improved the exception handling. This is what you should see now:

(gensim) misha@cabron:~/git/gensim/docs/src$ python -m gensim.downloader --info | head
2019-07-02 19:15:34,947 : __main__ : ERROR : caught non-fatal exception while trying to update gensim-data cache from 'https://raw.githubusercontent.com/RaRe-Technologies/gensim-data/master/list.json'; using local cache at '/home/misha/gensim-data/information.json' instead
Traceback (most recent call last):
  File "/usr/lib/python3.7/urllib/request.py", line 1317, in do_open
Traceback (most recent call last):
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/misha/git/gensim/gensim/downloader.py", line 193, in _load_info
  File "/usr/lib/python3.7/http/client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.7/http/client.py", line 1224, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
    with open(cache_path, 'r', encoding=encoding) as fin:
FileNotFoundError: [Errno 2] No such file or directory: '/home/misha/gensim-data/information.json'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/misha/git/gensim/gensim/downloader.py", line 507, in <module>
    output = info() if (args.info == full_information) else info(name=args.info)
  File "/home/misha/git/gensim/gensim/downloader.py", line 240, in info
    information = _load_info()
  File "/home/misha/git/gensim/gensim/downloader.py", line 196, in _load_info
    raise ValueError('unable to read local cache %r during fallback' % cache_path)
ValueError: unable to read local cache '/home/misha/gensim-data/information.json' during fallback

@piskvorky
Owner

piskvorky commented Jul 2, 2019

I'm seeing this now:

$ python -m gensim.downloader --info
2019-07-02 12:37:10,898 : __main__ : ERROR : caught non-fatal exception while trying to update gensim-data cache from 'https://raw.githubusercontent.com/RaRe-Technologies/gensim-data/master/list.json'; using local cache at '/Users/kofola3/gensim-data/information.json' instead
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 964, in send
    self.connect()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1392, in connect
    super().connect()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 936, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/socket.py", line 704, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 8] nodename nor servname provided, or not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Volumes/work/workspace/gensim/trunk/gensim/downloader.py", line 176, in _load_info
    info_bytes = urlopen(url).read()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 526, in open
    response = self._open(req, data)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 544, in _open
    '_open', req)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 1361, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 1320, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>
Traceback (most recent call last):
  File "/Volumes/work/workspace/gensim/trunk/gensim/downloader.py", line 193, in _load_info
    with open(cache_path, 'r', encoding=encoding) as fin:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/kofola3/gensim-data/information.json'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Volumes/work/workspace/gensim/trunk/gensim/downloader.py", line 509, in <module>
    output = info() if (args.info == full_information) else info(name=args.info)
  File "/Volumes/work/workspace/gensim/trunk/gensim/downloader.py", line 242, in info
    information = _load_info()
  File "/Volumes/work/workspace/gensim/trunk/gensim/downloader.py", line 198, in _load_info
    'connect to the Internet and retry' % cache_path
ValueError: unable to read local cache '/Users/kofola3/gensim-data/information.json' during fallback, connect to the Internet and retry

The opening is verbose, but not a problem. I like the clear actionable message at the end!
