914 spider il finance authority #995

Open · solisedwin wants to merge 2 commits into main

Conversation


@solisedwin commented Dec 20, 2020

Summary

Issue: #914

Checklist

All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.

  • Tests are implemented
  • All tests are passing
  • Style checks run (see documentation for more details)
  • Style checks are passing
  • Code comments from template removed

Comments

Glad to say that after months I'm finally working on this spider as my first ever pull request. Thanks for the opportunity; I learned a lot. I used scrapy meta arguments for each crawl since setting values on self didn't keep each item unique and values would repeat. The code for the tests might seem "weird", but scraped information can vary when working with PDFs, so I tried a general approach.
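Roughly, the meta approach looks like this (a sketch; the variable and callback names here are illustrative, not the exact code in this PR):

meeting_dict = {"title": title, "start": start}
yield scrapy.Request(
    response.urljoin(pdf_link),
    callback=self._parse_pdf,
    # The partial meeting travels with the request instead of living on self,
    # so concurrent requests can't overwrite each other's values
    meta={"meeting": meeting_dict},
)

In the callback, response.meta["meeting"] hands back the same dict for that specific request.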

…or scraping. Deleted/cleaned spider file code and moved functions around for better modularity in testing.
Collaborator

@pjsier left a comment

Thanks for the PR, and sorry for the delay! This is looking great so far, especially since the PDF scrapers tend to be pretty tricky. I left some comments here; let me know if anything is unclear.

@@ -22,6 +22,7 @@ parts/
sdist/
var/
wheels/
city_scrapers/get-pip.py
Collaborator

Usually we'll want to keep unrelated .gitignore changes out of PRs. If you want to ignore something locally, you can add it to .git/info/exclude in your repo.
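.git/info/exclude uses the same pattern syntax as .gitignore, but it's local to your clone and never committed, e.g.:

# in .git/info/exclude — ignored locally, not shared with the repo
city_scrapers/get-pip.py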

timezone = "America/Chicago"
start_urls = ["https://www.il-fa.com/public-access/board-documents/"]

def __init__(self, *args, **kwargs):
Collaborator

Is there a reason for leaving this in? Right now it has no effect, but I'm not sure if it was used earlier.

address = address_match.group(0)
name = re.findall(r"(?:in\s*the|at\s*the).*?,", pdf_content)[0]
except Exception:
address = "Address Not Found"
Collaborator

We can just return blank strings if that's the case, but it looks like the location is pretty consistent, so it could be easier to use _validate_location and raise an exception if the location isn't a set address, like we do here:

def _validate_location(self, response):
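A minimal sketch of that pattern (the address substring comes from the test further down; the exact check is an assumption):

def _validate_location(self, text):
    # Fail loudly if the expected address disappears from the scraped PDF text
    if "160 North LaSalle" not in text:
        raise ValueError("Meeting location has changed")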

clean_text = re.sub(r"_*", "", clean_text)
return clean_text

except PDFSyntaxError as e:
Collaborator

It looks like this can return None; does this happen often? If so, we should allow it to raise the error normally.

"""Parse or generate all-day status. Defaults to False."""
return False

def _parse_links(self, link, title):
Collaborator

It looks like there are multiple links here, so ideally we would want to pull the notice, minutes, and any other information as links even if we aren't parsing them.

Author

Are you talking about the other PDF links on the website? Could you elaborate more? Is there an example you could provide? Thanks

Collaborator

Sure, I'm seeing up to 4 PDF links in each row for the Agenda, Board Book, Minutes, and Voting Record. Ideally we'll want to include all of those in the links list
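A rough sketch of what that could look like (the selectors are assumptions; the actual row markup may differ):

def _parse_links(self, item, response):
    # Collect every PDF link in the table row: Agenda, Board Book,
    # Minutes, Voting Record, or whatever subset is present
    links = []
    for link in item.css("a"):
        href = link.attrib.get("href")
        if href:
            links.append({
                "href": response.urljoin(href),
                "title": " ".join(link.css("*::text").getall()).strip(),
            })
    return links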

meeting_dict["location"] = location
meeting_dict["time"] = time

yield scrapy.Request(
Collaborator

Is this last request necessary? It looks like we're trying to yield the meeting after parsing the PDF, but is there other information here that we need to get from _parse_meeting?

Author

The last request is necessary for the code to work.

Collaborator

Could you explain more about what you mean? When you create a new request to response.url, it looks like you're submitting a second request to the URL you parsed the response from rather than going to a new page
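In other words, if everything the meeting needs is already in meta plus the PDF text, it can be yielded straight from the PDF callback, roughly (a sketch; _parse_meeting's exact signature is assumed):

def _parse_pdf(self, response):
    meeting_dict = response.meta["meeting"]
    # ...parse the location and time out of the PDF here...
    yield self._parse_meeting(meeting_dict)

with no second scrapy.Request back to response.url.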

"""Parse or generate classification from allowed options."""
if "Comm" in title:
return COMMITTEE
if "Board" in title:
Collaborator

We can default to BOARD instead of Commission since it seems like that's the majority of these
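i.e. something like:

def _parse_classification(self, title):
    """Parse or generate classification from allowed options."""
    if "Comm" in title:
        return COMMITTEE
    # Board meetings are the majority here, so fall back to BOARD
    return BOARD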

def test_location():
location = spider._parse_location(pdf_text)
assert (
"in the Authority’s Chicago Office, 160 North LaSalle Street, Suite S-1000, Chicago, Illinois 60601"
Collaborator

We'll want to simplify this so that only the street address is included in the address value of the location dict. This could be fixed by using a static location checked by the _validate_location approach mentioned in another comment here.
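For example, a static location along these lines (the "name" value is a guess; the address follows the PDF text):

location = {
    "name": "IFA Chicago Office",
    "address": "160 North LaSalle Street, Suite S-1000, Chicago, IL 60601",
}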


def _parse_title(self, item):
"""Parse or generate meeting title."""
try:
Collaborator

Small thing, but if possible it would be good to replace "Comm" with "Committee" for committee meetings. To be safe, we'd want to replace it only when it's at the end of the title, so something like this could work:

title = re.sub(r"Comm$", "Committee", title)



def test_time():
time = spider._parse_start(pdf_text)
Collaborator

This should work, but is there a reason for not testing against a Meeting item? Usually that's the way we've done it to make sure the whole process works

Author

To be honest, I'm not sure how to test the id since it doesn't appear in parsed_items. Perhaps I'm missing the obvious given my novice coding skills, but I can't seem to think of a way to get all the information needed to recreate/test the id, since the process of yielding a meeting object is very order-specific and 'tricky' for this spider.

Collaborator

Gotcha, we aren't typically using the meta attribute on requests, but it seems like the right approach here, and testing it can be tricky.

Here's an example of how we're doing that for the chi_buildings scraper where we're using it in a similar way. Let me know if that helps!

class MockRequest(object):
    meta = {}

    def __getitem__(self, key):
        return self.meta["meeting"].get(key)


def mock_request(*args, **kwargs):
    mock = MockRequest()
    mock.meta = {"meeting": {}}
    return mock
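A hedged sketch of wiring that into a test (the module path is an assumption based on the spider name):

from unittest.mock import patch

with patch("city_scrapers.spiders.il_finance_authority.scrapy.Request", mock_request):
    parsed_items = [item for item in spider.parse(test_response)]
# Each parsed item is now a MockRequest, so item["title"] etc. read from meta["meeting"]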

Collaborator

@pjsier left a comment

Thanks for the responses! I batched some replies here

"""Parse or generate all-day status. Defaults to False."""
return False

def _parse_links(self, link, title):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'm seeing up to 4 PDF links in each row for the Agenda, Board Book, Minutes, and Voting Record. Ideally we'll want to include all of those in the links list

super().__init__(*args, **kwargs)

def parse(self, response):
for item in response.css("tr:nth-child(1n+2)"):
Collaborator

Sure! We want to be careful about spamming sites with a ton of requests at once, and typically when we scrape a site we're only interested in the last few past meetings and the next few upcoming ones. To reduce the number of requests we make, as well as to simplify the output for anyone using the feeds directly, we try to set ranges of time relative to the current date that we're interested in, like everything in the past year in that example.

Scrapy's settings are a way of managing configuration across spiders, like where the output is written or how quickly requests should be made. You can find more info on them in the scrapy documentation on settings.

Could you explain more about what isn't working? It's hard for me to debug without an example, but in general all the CITY_SCRAPERS_ARCHIVE setting is doing is giving us a boolean that can be put inside a conditional, so it doesn't have to work the same way as the example.
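A sketch of how that boolean can gate old rows (the _parse_row_date helper is hypothetical; the PR parses dates elsewhere):

from datetime import datetime, timedelta

def parse(self, response):
    one_year_ago = datetime.now() - timedelta(days=365)
    archive = self.settings.getbool("CITY_SCRAPERS_ARCHIVE")
    for item in response.css("tr:nth-child(1n+2)"):
        meeting_date = self._parse_row_date(item)  # hypothetical: date read from the row
        if meeting_date < one_year_ago and not archive:
            continue  # skip PDF requests for old meetings unless archiving
        # ...otherwise build the meeting dict and request the PDF as before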

meeting_dict["location"] = location
meeting_dict["time"] = time

yield scrapy.Request(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain more about what you mean? When you create a new request to response.url, it looks like you're submitting a second request to the URL you parsed the response from rather than going to a new page


