
0673 spider chi ohare noise #972

Open
wants to merge 13 commits into base: main

Conversation


@stmfunk stmfunk commented Sep 9, 2020

Summary

Issue: #673

Finished creating spider for scraping data from the Chicago O'Hare Noise Compatibility Commission.

Checklist

All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.

  • Tests are implemented
  • All tests are passing
  • Style checks run (see documentation for more details)
  • Style checks are passing
  • Code comments from template removed

@pjsier (Collaborator) left a comment:

Thanks for the PR! This one has been challenging for a while, so I appreciate you taking it on. I left some initial comments here; let me know what you think or if any of the comments aren't clear.

agency = "Chicago O'Hare Noise Compatibility Commission"
timezone = "America/Chicago"

class ChiOhareNoiseSubSpider1:

Thanks for picking this one up! Looks like it was one of the more difficult ones. I think we'll want to keep it limited to one class here for readability, though, and split out methods for each section (something like _parse_events_start, _parse_docs_start).

I could be missing some complexity here, though; is there a reason for not doing that?
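The suggested structure can be sketched in plain Python (no Scrapy dependency; the class name, method names, and URLs below are placeholders illustrating the reviewer's suggestion, not the real spider):

```python
# Structural sketch of the suggestion: one spider class with a separate
# entry method per section. In the real spider, start_requests would
# yield scrapy.Request objects with callback=...; here we just pair each
# placeholder URL with its handler so the shape is visible.
class ChiOhareNoiseSpiderSketch:
    name = "chi_ohare_noise"
    agency = "Chicago O'Hare Noise Compatibility Commission"
    timezone = "America/Chicago"

    def start_requests(self):
        yield ("https://example.com/meetings", self._parse_events_start)
        yield ("https://example.com/documents", self._parse_docs_start)

    def _parse_events_start(self, response):
        """Parse the meetings calendar page."""

    def _parse_docs_start(self, response):
        """Parse the agendas/minutes documents page."""


pairs = list(ChiOhareNoiseSpiderSketch().start_requests())
print([fn.__name__ for _, fn in pairs])
```

Keeping both flows behind one class this way also means the spider keeps a single `name`/`agency`/`timezone` definition instead of duplicating them.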

"""
for item in response.css(".ev_td_li"):
surl = self._parse_url(item)
yield Request(url=response.urljoin(surl), callback=self._parse_details)
Might be simpler to use the response.follow shortcut method here like you did below

yield Request(url=response.urljoin(surl), callback=self._parse_details)
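For context, `response.follow` resolves a relative link against the response URL and builds the Request in one step, which is exactly what the manual `urljoin` + `Request` pair does. The join behavior itself can be checked with the stdlib (the base URL below matches the source URL asserted later in the tests):

```python
from urllib.parse import urljoin

# response.urljoin(surl) is urljoin(response.url, surl) under the hood;
# response.follow(surl, callback=...) performs this join and returns the
# Request for you, so the explicit urljoin call can be dropped.
base = "https://www.oharenoise.org/about-oncc/oncc-meetings/"
relative = "month.calendar/2019/02/03/-"
print(urljoin(base, relative))
# https://www.oharenoise.org/about-oncc/oncc-meetings/month.calendar/2019/02/03/-
```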

next_page = response.xpath("//div[@class='previousmonth']/a/@href").get()
if next_page is not None:
We might want to limit this to only calling the next page once so that we don't hammer the site with a ton of requests every time.
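One way to cap pagination, sketched as a plain helper (the function and `max_pages` names are assumptions for illustration, not project conventions):

```python
# Sketch: track how many extra calendar pages have been followed and stop
# after a fixed budget, so each crawl requests only one previous month
# instead of walking the whole calendar history.
def should_follow_next(next_page, pages_followed, max_pages=1):
    """Return True only while a next-page link exists and budget remains."""
    return next_page is not None and pages_followed < max_pages


print(should_follow_next("/prev-month", 0))  # True: first extra page
print(should_follow_next("/prev-month", 1))  # False: budget spent
print(should_follow_next(None, 0))           # False: no link at all
```

In the spider, `parse` could carry the count via `Request.cb_kwargs` (or a simple counter attribute) and guard the next-page Request with a check like this.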

source=response.url,
)

meeting = self._get_status(meeting)
We'll want to pass the text kwarg with the raw title here so that we can check whether it contains "cancelled", which is handled for you in get_status.

Also, we should make sure that get_status is assigning meeting['status'] rather than overwriting the meeting variable entirely. The same goes for _parse_classification.
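The check in question can be sketched as a plain function (the real implementation lives in `city_scrapers_core` and also considers the meeting date; the keyword list below is illustrative). Note the result is assigned to `meeting["status"]`, not to `meeting` itself:

```python
# Sketch of the cancellation check get_status performs on the text kwarg.
# In the real spider this would be:
#     meeting["status"] = self._get_status(meeting, text=raw_title)
def get_status_sketch(meeting, text=""):
    """Derive a status string from the raw title text."""
    if any(word in text.lower() for word in ("cancel", "postpone")):
        return "cancelled"
    return "tentative"


meeting = {"title": "ONCC Meeting"}
# Pass the raw title BEFORE "CANCELLED" is stripped out of it:
meeting["status"] = get_status_sketch(meeting, text="CANCELLED - ONCC Meeting")
print(meeting["status"])  # cancelled
```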

title=self._parse_title(response).replace("CANCELLED ", "").strip("- "),
description=self._parse_description(response),
start=stime,
end=stime + timedelta(hours=1),
We can leave this as None if we aren't parsing it and potentially pass a string to the time_notes kwarg with any details about scheduling
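The suggested kwargs, sketched as a plain dict (field names follow the Meeting item used by City Scrapers spiders; the values are illustrative):

```python
from datetime import datetime

# Sketch: leave "end" as None when the page doesn't list it, and put any
# scheduling caveats in "time_notes" instead of guessing a duration.
meeting = {
    "title": "ONCC Meeting",
    "start": datetime(2019, 2, 3, 9, 0),
    "end": None,
    "time_notes": "End time is not listed on the calendar page",
}
```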

elif "commission" in meeting["title"].lower():
meeting["classification"] = COMMISSION
else:
meeting["classification"] = NOT_CLASSIFIED
Since the overall organization is a commission, it's safe to default to COMMISSION instead of NOT_CLASSIFIED.
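The suggested default, sketched as a standalone function (the constant values mirror `city_scrapers_core.constants`, inlined here so the sketch is self-contained):

```python
# Sketch: since ONCC is itself a commission, fall back to COMMISSION
# rather than NOT_CLASSIFIED when no more specific match is found.
COMMITTEE = "Committee"
COMMISSION = "Commission"


def parse_classification_sketch(title):
    if "committee" in title.lower():
        return COMMITTEE
    return COMMISSION  # default: the overall body is a commission


print(parse_classification_sketch("Technical Committee"))  # Committee
print(parse_classification_sketch("ONCC Meeting"))         # Commission
```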

@@ -0,0 +1,890 @@
2020-08-28 21:56:35 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: city_scrapers)
This can be removed

@@ -0,0 +1,20 @@
[]
This can be removed as well



parsed_sub_items = []
for i in range(5):
Ideally we should try not to have more than 2 test files here. Running some of the logic from parse in a separate method that doesn't yield a Request could make that easier: we could call something like _parse_detail directly without needing to worry about testing pagination.
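The suggested split, sketched with a stand-in for the response object (`_parse_detail` is the hypothetical name from the comment; real tests would build a Response from a saved fixture file):

```python
# Sketch: keep item parsing in a method that yields no Requests, so tests
# can call it directly on a fixture without exercising pagination.
class SpiderSketch:
    def parse(self, response):
        yield from self._parse_detail(response)
        # The next-page Request would be yielded here, in parse() only.

    def _parse_detail(self, response):
        # Stand-in for CSS selection against a real Response/fixture.
        for title in response["titles"]:
            yield {"title": title, "source": response["url"]}


fixture = {"url": "https://example.com/calendar", "titles": ["ONCC Meeting"]}
items = list(SpiderSketch()._parse_detail(fixture))
print(items[0]["source"])  # https://example.com/calendar
```

Because `_parse_detail` never yields a Request, a test can assert on its output from a single fixture page, with no need for extra fixture files covering the pagination path.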

def test_source():
src = "https://www.oharenoise.org/about-oncc"
src += "/oncc-meetings/month.calendar/2019/02/03/-"
assert parsed_sub_items[0]["source"] == src
This one is pretty safe, and we can shorten it a bit and get at what we're testing with == test_response.url
