
Replace all urllib / urllib2 imports and uses with requests #2852

Open
hornc opened this issue Jan 13, 2020 · 15 comments · Fixed by #3666, #3696, #3709 or #3710
Labels
Lead: @cclauss · Needs: Breakdown · Needs: Help · Priority: 3 · Type: Feature Request

Comments

hornc (Collaborator) commented Jan 13, 2020

urllib2 is old and a bit clunky -- I've recently found it can't do some of the things I needed it to do, it causes problems for moving to Python 3, and it generally seems to make the code more complex.

Instead of re-using it and adding new features that depend on the older mechanisms, can we consider refactoring to use requests instead?

My PR where I stripped out urllib2, StringIO, etc. to enable HEAD requests: #2838

A similar PR that went with requests to avoid old bugs: #2779

By using requests (rather than adding imports from six) we get a better tool for making HTTP requests while reducing code complexity and the number of imports.

Relates to #2308, since a common use of simplejson is to parse urllib2 responses; requests has a built-in json() method, so the extra import is not required.
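
A minimal before/after sketch of the pattern (the URL here is just an illustration):

# Before (Python 2): two imports, manual read + parse
import urllib2
import simplejson

data = simplejson.loads(urllib2.urlopen("https://openlibrary.org/search.json?q=dune").read())

# After: one import, JSON decoding built in
import requests

data = requests.get("https://openlibrary.org/search.json?q=dune").json()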


hornc added the Type: Feature Request, Needs: Triage, and State: Backlogged labels Jan 13, 2020
tfmorris (Contributor) commented

I'm strongly supportive of switching to requests, since it has built-in (or easy to add) support for retries, timeouts, caching, etc.

Having said that, there's so much critical work pending that I'm not sure we can afford to be doing "nice to have" work on existing code that works. Refactoring modules as they get worked on seems fine, though.

hornc (Collaborator, Author) commented Jan 14, 2020

Refactoring modules as they get worked on seems fine though.

Agreed. Refactoring is why I wanted to publicise this as a common solution: when devs work on a module and find themselves banging their heads against urllib2 for any reason, the first thought should be to remove it and use requests instead. I just had this happen a couple of times in a row and wanted to share it as a common time-saving way forward; perhaps we can also standardise some of the retry mechanisms.
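
As a sketch of what a standardised retry setup might look like (the helper name is illustrative, not an existing Open Library API):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def session_with_retries(total=3, backoff_factor=1.0):
    """Return a requests.Session that retries transient failures with backoff."""
    session = requests.Session()
    retry = Retry(
        total=total,
        backoff_factor=backoff_factor,
        status_forcelist=(429, 500, 502, 503, 504),  # retry rate limits and 5xx errors
    )
    session.mount("http://", HTTPAdapter(max_retries=retry))
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session

# usage: session_with_retries().get(url, timeout=10)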

hornc (Collaborator, Author) commented Jan 14, 2020

Possibly unused urllib + urllib2 imports in openlibrary/plugins/upstream/addbook.py?

cdrini (Collaborator) commented Jan 14, 2020

PyCharm identifies them as unused (amongst other things):
[screenshot: PyCharm flagging the unused imports]

xayhewalo added the Lead: @cdrini, Needs: Breakdown, Priority: 3, and Theme: Upgrade to Python 3 labels and removed the Needs: Triage label Jan 14, 2020
tfmorris (Contributor) commented Jan 21, 2020

#2894 shows some code smells which need to be avoided when doing any upgrade of this type.

If you still have an import urllib2, it's probably wrong.

cclauss (Contributor) commented Jul 23, 2020

% grep urlopen **/*.py

  • openlibrary/admin/numbers.py: sqlite_contents = urllib.request.urlopen(url).read()
  • openlibrary/catalog/amazon/add_covers.py: ret = simplejson.load(urlopen(url))
  • openlibrary/catalog/amazon/other_editions.py: return urllib.request.urlopen(url).read()
  • openlibrary/catalog/utils/edit.py: prev = unmarshal(json.load(urllib.request.urlopen(url)))
  • openlibrary/core/lending.py: content = urllib.request.urlopen(url=url, timeout=config_http_request_timeout).read()
  • openlibrary/core/lending.py: content = urllib.request.urlopen(request, timeout=config_http_request_timeout).read()
  • openlibrary/core/lending.py: content = urllib.request.urlopen(url=url, timeout=config_http_request_timeout).read()
  • openlibrary/core/lending.py: response = simplejson.loads(urllib.request.urlopen(url).read())
  • openlibrary/core/lending.py: return simplejson.loads(urllib.request.urlopen(url).read())
  • openlibrary/core/lending.py: jsontext = urllib.request.urlopen(config_ia_loan_api_url, payload,
  • openlibrary/core/models.py: d = simplejson.loads(urllib.request.urlopen(url).read())
  • openlibrary/coverstore/code.py: text = urlopen("https://archive.org/metadata/" + item).read()
  • openlibrary/coverstore/code.py: jsontext = urlopen(url).read()
  • openlibrary/plugins/admin/code.py: response = urllib.request.urlopen(s).read()
  • openlibrary/plugins/admin/code.py: json = urllib.request.urlopen("http://www.archive.org/download/stats/numUniqueIPsOL.json").read()
  • openlibrary/plugins/admin/services.py: self.data = BeautifulSoup(urllib.request.urlopen(url).read(), "lxml")
  • openlibrary/plugins/books/readlinks.py: json_data = urllib.request.urlopen(solr_select).read()
  • openlibrary/plugins/books/readlinks.py: json_data = urllib.request.urlopen(solr_select).read()
  • openlibrary/plugins/books/readlinks.py: json_data = urllib.request.urlopen(solr_select).read()
  • openlibrary/plugins/books/readlinks.py: json_data = urllib.request.urlopen(solr_select).read()
  • openlibrary/plugins/openlibrary/code.py: return urllib.request.urlopen(url).read()
  • openlibrary/plugins/upstream/borrow.py: response = simplejson.loads(urllib.request.urlopen(url).read())
  • openlibrary/plugins/upstream/borrow.py: response = simplejson.loads(urllib.request.urlopen(url).read())
  • openlibrary/plugins/upstream/covers.py: response = urllib.request.urlopen(upload_url, urllib.parse.urlencode(params))
  • openlibrary/plugins/upstream/data.py: return urllib.request.urlopen(url).read()
  • openlibrary/plugins/upstream/models.py: data = urllib.request.urlopen(url).read()
  • openlibrary/plugins/worksearch/subjects.py: response = json.load(urllib.request.urlopen(solr_url))['response']
  • openlibrary/plugins/worksearch/subjects.py: response = json.load(urllib.request.urlopen(solr_url))['response']
  • openlibrary/solr/find_modified_works.py: ret.append(extract_works(json.load(urllib.request.urlopen(url))))
  • openlibrary/solr/find_modified_works.py: changes = list(json.load(urllib.request.urlopen(url)))
  • openlibrary/utils/solr.py: data = urllib.request.urlopen(url, timeout=10).read()
  • openlibrary/utils/solr.py: data = urllib.request.urlopen(request, timeout=10).read()
  • openlibrary/accounts/model.py: f = urllib.request.urlopen(req)
  • openlibrary/api.py: return urllib.request.urlopen(req)
  • openlibrary/catalog/add_book/init.py: res = urllib.request.urlopen(upload_url, urllib.parse.urlencode(params))
  • openlibrary/catalog/amazon/crawl.py: #page = urlopen('http://amazon.com/dp/' + asin).read()
  • openlibrary/catalog/amazon/load_merge.py: result = urllib.request.urlopen(ureq).read(100000)
  • openlibrary/catalog/get_ia.py:def urlopen_keep_trying(url):
  • openlibrary/catalog/get_ia.py: f = urllib.request.urlopen(url)
  • openlibrary/catalog/get_ia.py: return '<!--' in urlopen_keep_trying(IA_DOWNLOAD_URL + loc).read()
  • openlibrary/catalog/get_ia.py: data = urlopen_keep_trying(item_base + marc_xml_filename).read()
  • openlibrary/catalog/get_ia.py: data = urlopen_keep_trying(item_base + marc_bin_filename).read()
  • openlibrary/catalog/get_ia.py: tree = etree.parse(urlopen_keep_trying(url))
  • openlibrary/catalog/get_ia.py: tree = etree.parse(urlopen_keep_trying(url))
  • openlibrary/catalog/get_ia.py: f = urlopen_keep_trying(ureq)
  • openlibrary/catalog/get_ia.py: f = urlopen_keep_trying(url)
  • openlibrary/catalog/get_ia.py: f = urlopen_keep_trying(url)
  • openlibrary/catalog/importer/db_read.py: data = urlopen(url).read()
  • openlibrary/catalog/marc/download.py: f = urllib.request.urlopen("http://archive.org/download/bpl_marc/" + file)
  • openlibrary/catalog/marc/download.py: f = urllib.request.urlopen(ureq)
  • openlibrary/catalog/marc/marc_subject.py: f = urlopen_keep_trying(url)
  • openlibrary/catalog/marc/marc_subject.py: f = urlopen_keep_trying(url)
  • openlibrary/catalog/utils/query.py:def urlopen(url, data=None):
  • openlibrary/catalog/utils/query.py: return urllib.request.urlopen(req)
  • openlibrary/catalog/utils/query.py: return json.load(urlopen(url))
  • openlibrary/catalog/utils/query.py: return urlopen(url).read()
  • openlibrary/core/fulltext.py: json_data = urllib.request.urlopen(search_select, timeout=30).read()
  • openlibrary/core/loanstats.py: logger.info("urlopen %s", url)
  • openlibrary/core/loanstats.py: response = urllib.request.urlopen(url).read()
  • openlibrary/coverstore/utils.py: r = urlopen(req)
  • openlibrary/plugins/akismet/akismet.py: h = urllib.request.urlopen(req)
  • openlibrary/plugins/ol_infobase.py: response = urllib.request.urlopen(url, json).read()
  • openlibrary/plugins/openlibrary/tests/test_listapi.py: def urlopen(self, path, data=None, method=None, headers={}):
  • openlibrary/plugins/openlibrary/tests/test_listapi.py: self.urlopen("/account/login", data=urllib.parse.urlencode(data), method="POST")
  • openlibrary/plugins/openlibrary/tests/test_listapi.py: response = self.urlopen(
  • openlibrary/plugins/openlibrary/tests/test_listapi.py: data = self.urlopen("/people/" + self.username + "/lists.json").read()
  • openlibrary/plugins/openlibrary/tests/test_listapi.py: data = self.urlopen(key + ".json").read()
  • openlibrary/plugins/openlibrary/tests/test_listapi.py: data = self.urlopen(key + "/seeds.json").read()
  • openlibrary/plugins/openlibrary/tests/test_listapi.py: response = self.urlopen(key + "/seeds.json", json)
  • openlibrary/plugins/openlibrary/tests/test_ratingsapi.py: def urlopen(self, path, data=None, method=None, headers={}):
  • openlibrary/plugins/openlibrary/tests/test_ratingsapi.py: self.urlopen("/account/login", data=urllib.parse.urlencode(data), method="POST")
  • openlibrary/plugins/openlibrary/tests/test_ratingsapi.py: r = self.urlopen(
  • openlibrary/plugins/search/solr_client.py: ru = urlopen(query_url)
  • openlibrary/plugins/search/solr_client.py: ru = urlopen(query_url)
  • openlibrary/plugins/upstream/adapter.py: response = urllib.request.urlopen(req)
  • openlibrary/plugins/upstream/utils.py: tree = etree.parse(urllib.request.urlopen(url))
  • openlibrary/plugins/worksearch/code.py: solr_result = urllib.request.urlopen(url, timeout=10)
  • openlibrary/solr/update_work.py:def urlopen(url, data=None):
  • openlibrary/solr/update_work.py: return urllib.request.urlopen(req)
  • openlibrary/solr/update_work.py: result = json.load(urlopen(url))
  • openlibrary/solr/update_work.py: logger.info("urlopen %s", url)
  • openlibrary/solr/update_work.py: reply = json.load(urlopen(url))
  • openlibrary/solr/update_work.py: reply = json.load(urlopen(url))
  • openlibrary/tests/catalog/test_get_ia.py: monkeypatch.setattr(get_ia, 'urlopen_keep_trying', return_test_marc_xml)
  • openlibrary/tests/catalog/test_get_ia.py: monkeypatch.setattr(get_ia, 'urlopen_keep_trying', return_test_marc_bin)
  • openlibrary/tests/catalog/test_get_ia.py: monkeypatch.setattr(get_ia, 'urlopen_keep_trying', return_test_marc_bin)
  • openlibrary/tests/solr/test_update_work.py: monkeypatch.setattr(update_work, 'urlopen', lambda url: StringIO(solr_response))
  • openlibrary/utils/httpserver.py: >>> response = urllib.request.urlopen('http://0.0.0.0:8090/hello/world')
  • openlibrary/utils/httpserver.py: >>> urllib.request.urlopen('http://0.0.0.0:8090/hello/world')
  • openlibrary/views/showmarc.py: data = urllib.request.urlopen(url).read()
  • openlibrary/views/showmarc.py: data = urllib.request.urlopen(url).read()
  • openlibrary/views/showmarc.py: result = urllib.request.urlopen(ureq).read(100000)
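
Most of these calls follow a handful of patterns, so the translation is mechanical. A sketch using the covers.py upload call above as the shape (upload_url and params stand in for the real values); note that in Python 3, urlopen wants bytes for its data argument, so the old line as written would raise a TypeError anyway:

# Before: POST with a manually urlencoded body
# response = urllib.request.urlopen(upload_url, urllib.parse.urlencode(params))

# After: requests form-encodes the dict itself, and the timeout is explicit
import requests

response = requests.post(upload_url, data=params, timeout=30)
response.raise_for_status()  # surface HTTP errors instead of reading an error body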

imskr (Contributor) commented Jul 30, 2020

Thanks, @cclauss, for the guidance. I'm on it!

cclauss (Contributor) commented Aug 15, 2020

cclauss (Contributor) commented Aug 21, 2020

This is going to need SEVERAL PRs before it is finally closed.

cclauss reopened this Aug 21, 2020
cclauss (Contributor) commented Feb 1, 2021

grep urlopen **/*.py > files.txt  # shell: one "path:match" line per occurrence

with open("files.txt") as in_file:
    files = {line.split(":")[0] for line in in_file}  # keep just the filename

print("\n".join(sorted(files)))  # the affected files, deduplicated and sorted
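
(grep's -l flag would print just the matching filenames directly. Here is an equivalent pure-Python sketch that skips the temporary file:)

from pathlib import Path

files = sorted(
    str(path)
    for path in Path(".").rglob("*.py")
    if "urlopen" in path.read_text(errors="ignore")
)
print("\n".join(files))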

dherbst (Contributor) commented Feb 2, 2021

I don't think scripts/z3950_view.py works any more, because PyZ3950 is no longer installed in the requirements. I can't find PyZ3950 on pypi.org, and I can't find any reference to z3950_view in the codebase. Given this, should we remove z3950_view.py?

cclauss (Contributor) commented Feb 2, 2021

Last released in 2004: https://pypi.org/project/PyZ3950 --> http://www.panix.com/~asl2/software/PyZ3950 --> https://github.com/asl2/PyZ3950 (see the PRs), which is far from Python 3 compatible.

cclauss changed the title from "Replace all urllib / urllib2 + StringIO + simplejson imports and uses with requests" to "Replace all urllib / urllib2 imports and uses with requests" Feb 3, 2021
dherbst (Contributor) commented Feb 4, 2021

I don't think openlibrary/catalog/marc/download.py is used any more: it tries to use web.run, which is no longer in the installed version of web.py, and it calls http://wiki-beta.us.archive.org:9090, which is not accessible.

Should openlibrary/catalog/marc/download.py be deleted?

cclauss added the Lead: @cclauss label and removed the Lead: @cdrini label Mar 9, 2021
mekarpeles added the Needs: Help label Sep 7, 2023
RayBB (Collaborator) commented Apr 19, 2024

I went ahead and did a little grepping to see all the places urllib is still used (excluding infogami).

I'm not sure which of these we should still try to migrate away from. At least some of the urlencode calls could be replaced by just passing the dict to requests (see the sketch after the table). @cclauss, what do you think?

Filename Count Functions used
./plugins/upstream/utils.py 8 urlencode quote parse_qs urlparse urlunparse quote_plus
./plugins/upstream/borrow.py 8 urlencode quote
./coverstore/utils.py 8 urlencode urlsplit unquote urlunsplit unquote_plus parse_qsl
./utils/solr.py 6 urlencode urlsplit
./plugins/openlibrary/code.py 8 urlencode parse_qs urlparse urlunparse
./plugins/openlibrary/tests/test_listapi.py 8 urlencode Request build_opener HTTPCookieProcessor
./plugins/openlibrary/tests/test_ratingsapi.py 7 urlencode Request build_opener HTTPCookieProcessor
./plugins/upstream/adapter.py 4 urlencode Request urlopen
./core/fulltext.py 2 urlencode
./core/ia.py 2 urlencode
./core/models.py 2 urlencode
./core/sponsorships.py 2 urlencode
./coverstore/tests/test_webapp.py 2 urlencode
./plugins/admin/graphs.py 2 urlencode
./plugins/importapi/code.py 2 urlencode
./plugins/worksearch/code.py 3 urlencode
./plugins/upstream/addbook.py 4 quote urlparse
./core/processors/readableurls.py 3 quote quote_plus
./catalog/utils/query.py 3 quote
./plugins/books/code.py 3 urlsplit unquote
./core/helpers.py 2 urlsplit
./plugins/openlibrary/lists.py 1 parse_qs
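
For instance, a hedged sketch of the urlencode-to-params change (solr_base and the query values are placeholders):

# Before: build the query string by hand
# url = solr_base + "/select?" + urllib.parse.urlencode({"q": "dune", "rows": 10})
# data = urllib.request.urlopen(url, timeout=10).read()

# After: requests builds and encodes the query string from the dict
import requests

data = requests.get(solr_base + "/select", params={"q": "dune", "rows": 10}, timeout=10).text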
