914 spider il finance authority #995

Open · wants to merge 2 commits into main
1 change: 1 addition & 0 deletions .gitignore
@@ -22,6 +22,7 @@ parts/
sdist/
var/
wheels/
city_scrapers/get-pip.py
Collaborator:

Usually we'll want to keep unrelated `.gitignore` changes out of PRs. If you want to ignore something locally, you can add it to `.git/info/exclude` in your repo.

*.egg-info/
.installed.cfg
*.egg
171 changes: 171 additions & 0 deletions city_scrapers/spiders/il_finance_authority.py
@@ -0,0 +1,171 @@
import re
from datetime import datetime
from io import BytesIO, StringIO

import scrapy
from city_scrapers_core.constants import BOARD, COMMISSION, COMMITTEE
from city_scrapers_core.items import Meeting
from city_scrapers_core.spiders import CityScrapersSpider
from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams
from pdfminer.pdfparser import PDFSyntaxError


class IlFinanceAuthoritySpider(CityScrapersSpider):
name = "il_finance_authority"
agency = "Illinois Finance Authority"
timezone = "America/Chicago"
start_urls = ["https://www.il-fa.com/public-access/board-documents/"]

def __init__(self, *args, **kwargs):
Collaborator:

Is there a reason for leaving this in? Right now it has no effect, but I'm not sure if it was used earlier.

super().__init__(*args, **kwargs)

def parse(self, response):
for item in response.css("tr:nth-child(1n+2)"):
Collaborator:

We don't have to use this much, but because there are so many meetings here it would be good to use our CITY_SCRAPERS_ARCHIVE setting so that we don't have to pull the entire meeting list every time. Here's an example:

`if start < last_year and not self.settings.getbool("CITY_SCRAPERS_ARCHIVE"):`

Here, we would probably want to check within 6 months to a year from the current date.

Author:

Can you explain more about the meaning behind this idea? I added it to my code and now it doesn't work. Could you also explain more about the settings object? Thank you

Collaborator:

Sure! We want to be careful about spamming sites with a ton of requests at once, and typically when we scrape a site we're only interested in the last few past meetings and the next few upcoming ones. To reduce the number of requests we make, as well as to simplify the output for anyone using the feeds directly, we try to set ranges of time relative to the current date that we're interested in, like everything in the past year in that example.

Scrapy's settings are a way of managing configuration across spiders, like where the output is written or how quickly requests should be made. You can find more info on them in the Scrapy documentation on settings.

Could you explain more about what isn't working? It's hard for me to debug without an example, but in general all the CITY_SCRAPERS_ARCHIVE setting is doing is giving us a boolean that can be put inside a conditional, so it doesn't have to work the same way as the example.
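A minimal sketch of how that check could look at the top of this spider's `parse` loop; the one-year cutoff and the date format (taken from `_meeting_datetime` below) are assumptions, not project requirements:

```python
from datetime import datetime, timedelta

def parse(self, response):
    one_year_ago = datetime.now() - timedelta(days=365)
    for item in response.css("tr:nth-child(1n+2)"):
        # Dates in the second cell look like "Jan 12, 2021" (the format
        # _meeting_datetime expects), so parse them the same way here.
        date_str = self._parse_date(item)
        start = datetime.strptime(date_str.strip(), "%b %d, %Y")
        # Skip meetings over a year old unless a full archive scrape
        # was requested via the CITY_SCRAPERS_ARCHIVE setting.
        if start < one_year_ago and not self.settings.getbool(
            "CITY_SCRAPERS_ARCHIVE"
        ):
            continue
        # ... continue with the existing PDF request logic
```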

pdf_link = self._get_pdf_link(item)
if pdf_link is None or not pdf_link.endswith(".pdf"):
continue
title = self._parse_title(item)
date = self._parse_date(item)

yield scrapy.Request(
response.urljoin(pdf_link),
callback=self._parse_schedule,
dont_filter=True,
meta={"title": title, "date": date},
)

def _parse_schedule(self, response):
"""Parse PDF and then yield to meeting items"""
pdf_text = self._parse_agenda_pdf(response)
location = self._parse_location(pdf_text)
time = self._parse_start(pdf_text)
meeting_dict = dict()
meeting_dict["title"] = response.meta["title"]
meeting_dict["date"] = response.meta["date"]
meeting_dict["location"] = location
meeting_dict["time"] = time

yield scrapy.Request(
Collaborator:

Is this last request necessary? It looks like we're trying to yield the meeting after parsing the PDF, but is there other information here that we need to get from `_parse_meeting`?

Author:

The last request is necessary for the code to work.

Collaborator:

Could you explain more about what you mean? When you create a new request to `response.url`, it looks like you're submitting a second request to the URL you just parsed the response from rather than going to a new page.
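For reference, here's a sketch of what yielding the meeting directly could look like, with `_parse_meeting`'s body folded into `_parse_schedule`; this is a hypothetical restructuring using only helpers already defined in this spider:

```python
def _parse_schedule(self, response):
    """Parse the agenda PDF and yield the Meeting directly,
    without a second request back to the same URL."""
    pdf_text = self._parse_agenda_pdf(response)
    title = response.meta["title"]
    meeting = Meeting(
        title=title,
        description="",
        classification=self._parse_classification(title),
        start=self._meeting_datetime(
            response.meta["date"], self._parse_start(pdf_text)
        ),
        end=None,
        all_day=False,
        time_notes="",
        location=self._parse_location(pdf_text),
        links=self._parse_links(response.url, title),
        source=response.url,
    )
    meeting["status"] = self._get_status(meeting)
    meeting["id"] = self._get_id(meeting)
    yield meeting
```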

response.url,
callback=self._parse_meeting,
dont_filter=True,
meta={"meeting_dict": meeting_dict},
)

def _parse_agenda_pdf(self, response):
try:
lp = LAParams(line_margin=0.1)
out_str = StringIO()
extract_text_to_fp(
inf=BytesIO(response.body),
outfp=out_str,
maxpages=1,
laparams=lp,
codec="utf-8",
)

pdf_content = out_str.getvalue().replace("\n", "")
# Remove duplicate spaces
clean_text = re.sub(r"\s+", " ", pdf_content)
# Remove underscores
clean_text = re.sub(r"_*", "", clean_text)
return clean_text

except PDFSyntaxError as e:
Collaborator:

It looks like this can return None, does this happen often? If so we should allow it to raise the error normally
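As a sketch, logging through the spider's built-in logger and re-raising would surface the failure instead of silently returning None (whether to fail hard here is a judgment call, not project policy):

```python
def _parse_agenda_pdf(self, response):
    try:
        ...  # pdfminer extraction as above
    except PDFSyntaxError:
        # Log with context, then let the error propagate so a bad PDF
        # fails loudly instead of producing a None location and time.
        self.logger.exception("Could not parse agenda PDF: %s", response.url)
        raise
```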

print("~~Error: " + str(e))

def _parse_meeting(self, response):
meeting_dict = response.meta["meeting_dict"]
title = meeting_dict["title"]
date = meeting_dict["date"]
time = meeting_dict["time"]
location = meeting_dict["location"]

meeting = Meeting(
title=title,
description="",
classification=self._parse_classification(title),
start=self._meeting_datetime(date, time),
end=None,
all_day=False,
time_notes="",
location=location,
links=self._parse_links(response.url, title),
source=self._parse_source(response),
)
meeting["status"] = self._get_status(meeting)
meeting["id"] = self._get_id(meeting)
yield meeting

def _meeting_datetime(self, date, time):
meeting_start = date + " " + time
meeting_start = meeting_start.replace(", ", ",").strip()
return datetime.strptime(meeting_start, "%b %d,%Y %I:%M %p")

def _get_pdf_link(self, item):
pdf_tag = item.css("td:nth-child(4) > a")
if not pdf_tag:
return None
pdf_link = pdf_tag[0].attrib["href"]
return pdf_link

def _parse_title(self, item):
"""Parse or generate meeting title."""
try:
Collaborator:

Small thing, but if possible it would be good to replace "Comm" with "Committee" for committee meetings. To be safe we'd want to replace it only when it's at the end of the title, so something like this could work:

title = re.sub(r"Comm$", "Committee", title)

title = item.css("td:nth-child(3)::text").extract_first()
return title
except TypeError:
return ""

def _parse_classification(self, title):
"""Parse or generate classification from allowed options."""
if "Comm" in title:
return COMMITTEE
if "Board" in title:
Collaborator:

We can default to BOARD instead of COMMISSION since it seems like board meetings are the majority of these.
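A sketch of the reordered classification, checking "Commission" before "Comm" so commission titles don't match the shorter substring (whether COMMISSION should remain a case at all is an assumption):

```python
def _parse_classification(self, title):
    """Classify from the meeting title, defaulting to BOARD."""
    if "Commission" in title:
        return COMMISSION
    if "Comm" in title:
        return COMMITTEE
    return BOARD
```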

return BOARD
return COMMISSION

def _parse_location(self, pdf_content):
"""Parse or generate location."""
try:
address_match = re.search(
r"(?:in\s*the|at\s*the) .*(\. | \d{5})", pdf_content
)
address = address_match.group(0)
name = re.findall(r"(?:in\s*the|at\s*the).*?,", pdf_content)[0]
except Exception:
address = "Address Not Found"
Collaborator:

We can just return blank strings if that's the case, but it looks like the location is pretty consistent, so it could be easier to use `_validate_location` and raise an exception if the location isn't a set address, like we do here:

def _validate_location(self, response):
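As a sketch of that pattern adapted to this spider's PDF text (the street string is a placeholder for illustration, not the authority's confirmed meeting address):

```python
def _validate_location(self, pdf_content):
    # Raise if the agenda no longer mentions the expected address so a
    # venue change is noticed instead of silently scraping a stale location.
    if "160 North LaSalle" not in pdf_content:
        raise ValueError("Meeting location has changed")
```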

name = "Name Not Found"
return {"address": address, "name": name}

def _parse_date(self, item):
"""Parse start datetime as a naive datetime object."""
try:
date_str = item.css("td:nth-child(2)::text").extract_first()
return date_str
except TypeError:
return ""

def _parse_start(self, pdf_content):
try:
time = re.findall(r"\d{1,2}:\d{2}\s?(?:A\.M\.|P\.M\.|PM|AM)", pdf_content)[0]
return time
except Exception:
return "12:00 AM"

def _parse_end(self, item):
"""Parse end datetime as a naive datetime object. Added by pipeline if None"""
return None

def _parse_all_day(self, item):
"""Parse or generate all-day status. Defaults to False."""
return False

def _parse_links(self, link, title):
Collaborator:

It looks like there are multiple links here, so ideally we would want to pull the notice, minutes, and any other information as links even if we aren't parsing them.

Author:

Are you talking about the other PDF links on the website? Could you elaborate more?
Is there an example you could provide? Thanks

Collaborator:

Sure, I'm seeing up to 4 PDF links in each row for the Agenda, Board Book, Minutes, and Voting Record. Ideally we'll want to include all of those in the links list
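A sketch of collecting every document link from a row (the cell/anchor structure and the `response` argument are assumptions about the markup; this would replace the current single-link `_parse_links`):

```python
def _parse_links(self, item, response):
    # Collect every PDF in the row: Agenda, Board Book, Minutes,
    # and Voting Record, keeping the anchor text as the link title.
    links = []
    for anchor in item.css("td a"):
        href = anchor.attrib.get("href")
        if not href:
            continue
        links.append({
            "href": response.urljoin(href),
            "title": anchor.css("::text").extract_first(default="").strip(),
        })
    return links
```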

"""parse or generate links."""
return [{"href": link, "title": title}]

def _parse_source(self, response):
"""parse or generate source."""
return response.url