0672 spider chi northwest home equity #966

Open · wants to merge 6 commits into main

Conversation

@SubtleHyperbole commented Jul 17, 2020

Summary

Issue: #672

Replace "ISSUE_NUMBER" with the number of your issue so that GitHub will link this pull request with the issue and make review easier.

Checklist

All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.

  • Tests are implemented
  • [X**] All tests are passing
  • Style checks run (see documentation for more details)
  • Style checks are passing
  • Code comments from template removed

Questions

I am such a newb when it comes to using GitHub. After royally screwing things up not once but twice, and having to literally delete the entire repository, start over, and re-add my files, I think I've got it down this time.

One thing is odd, though, and it's why I included the asterisks ** next to "All tests are passing": for some reason, this final time that I re-cloned and forked the repository onto my machine, it started throwing errors for three spiders (unrelated to the one I'm working on), which kept raising exceptions when trying to import a library called pdfminer. All three spiders use these two import lines:

from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams

These lines raised exceptions saying the library doesn't exist, even though I do have pdfminer installed in my main Python environment as well as in the pipenv shell. For some reason this was suddenly an issue. I actually had to comment out the two lines in the three spiders to get the terminal to finish setting up my fork. I then restored the lines, but later, when it came time to run pytest, it threw errors for those three spiders again.

Anyhow, that's a long-winded way of saying that what I mean by the ** in the checklist is: yes, all MY tests for the new spider are passing. Someone else's spiders that use pdfminer, however, are suddenly having trouble. The three spiders are:

il_pollution_control
chi_human_relations
cook_emergency_telephone

(None of those three spider files or tests were added to my commit/push anyway; I'm just bringing it up because I thought it was an odd error, especially since it didn't happen when I did this a day or so ago. It seems to have nothing to do with what I am working on.)

@SubtleHyperbole (Author)

I am confused about the "status" and "id" tests in the template -- no parse functions for these exist in the template like they do for every other test. Are they necessary? And where would they even come from? "status" sounds like the same information that's contained in "description" -- and this particular site doesn't give each meeting a unique ID other than the date, time, and title...

@pjsier (Collaborator) left a comment


Thanks for the PR! I left a few comments here, but before I go further it looks like it could be useful to take a look at some of the Scrapy documentation.

In general, Scrapy's workflow is to have a set of start_urls (or a start_requests method) that tells a spider which URLs to go to first. By default, the parse method is called on each of those responses, with the response argument containing the data you need to scrape. From there, you either yield items (dictionaries, or Meeting objects in our case) or yield additional requests to pages that Scrapy should follow. We have a few edge cases where we're using the requests library, but in the overwhelming majority of spiders you won't need it, so the Scrapy docs might be helpful before updating this.
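
Roughly, a minimal sketch of that flow (placeholder names, selector, and start time, assuming the city_scrapers_core Meeting item and CityScrapersSpider base class used across this repo):

    from datetime import datetime

    from city_scrapers_core.items import Meeting
    from city_scrapers_core.spiders import CityScrapersSpider


    class ExampleSpider(CityScrapersSpider):
        name = "example"
        agency = "Example Agency"
        timezone = "America/Chicago"
        # Scrapy requests each URL in start_urls first and passes each
        # response to parse() below
        start_urls = ["https://nwheap.com/category/meet-minutes-and-agendas/"]

        def parse(self, response):
            # Yield one Meeting per item found on the page; Scrapy handles
            # the HTTP request, so no call to the requests library is needed
            for item in response.css("article"):  # placeholder selector
                yield Meeting(
                    title=item.css("a::text").get(),
                    start=datetime(2020, 7, 17, 18, 0),  # placeholder start time
                    source=response.url,
                )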

For a setup like this, where we have one page we need to scrape for documents that are later associated with meetings, we usually follow a pattern of starting from the docs page in start_urls (which you did here), parsing those links and storing them on self, and then yielding separate requests for each additional page we need to scrape. That's your overall setup here, but using Scrapy directly can cut out a lot of the work. Here's an example from cook_housing.
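
A rough sketch of that pattern (not the actual cook_housing code; the selectors, helper callback, and second URL are placeholders):

    import scrapy

    from city_scrapers_core.spiders import CityScrapersSpider


    class DocsFirstSpider(CityScrapersSpider):
        name = "docs_first_example"
        # Start from the documents page so its links can be collected first
        start_urls = ["https://nwheap.com/category/meet-minutes-and-agendas/"]

        def parse(self, response):
            # Store the document links on self, keyed by a date string that
            # can be matched against meetings later
            self.link_date_map = {}
            for link in response.css("article a"):  # placeholder selector
                date_str = link.css("::text").get()
                self.link_date_map[date_str] = link.attrib.get("href")
            # Then yield a request for the page that actually lists the
            # meetings, handled by a separate callback
            yield scrapy.Request(
                "https://nwheap.com/",  # placeholder URL for the meetings page
                callback=self._parse_meetings,
            )

        def _parse_meetings(self, response):
            # Build Meeting objects here, attaching links from
            # self.link_date_map where the dates match
            ...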

Once those updates are in I can take a look at the rest, and I'm happy to send over additional resources, thanks!

start_urls = ["https://nwheap.com/category/meet-minutes-and-agendas/"]

def parse(self, response):
    self.location_dict = {
@pjsier (Collaborator):

These can be set above, alongside start_urls, instead of being rebuilt each time parse is called

    }
    # Before we begin, need to collect meeting minutes data -- which is
    # contained on multiple iterative pages (/page/2 /page/3 etc)
    r = requests.get("https://nwheap.com/category/meet-minutes-and-agendas/")
@pjsier (Collaborator):

We don't need to use requests here since the response from this page should be the first thing Scrapy loads in start_urls
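
Putting those two suggestions together, the top of the spider could look roughly like this (the class name is a guess from the PR title, and the dict contents are elided):

    from city_scrapers_core.spiders import CityScrapersSpider


    class ChiNorthwestHomeEquitySpider(CityScrapersSpider):
        name = "chi_northwest_home_equity"
        start_urls = ["https://nwheap.com/category/meet-minutes-and-agendas/"]
        # Defined once as a class attribute rather than rebuilt in parse()
        location_dict = {
            # ... the same location entries as before ...
        }

        def parse(self, response):
            # `response` already holds the page from start_urls, so no
            # requests.get() call is needed here
            ...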

@pjsier (Collaborator) commented Jul 17, 2020

@SubtleHyperbole to your question about ID and status, we've got more information on those in our docs, but in general they're a part of the Open Civic Data Event specification that we're following, and status in particular is a more structured way of tracking whether meetings are cancelled, upcoming, or in the past
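
They aren't scraped from the site directly; if the spider inherits from CityScrapersSpider like the others here, the base class helpers derive them from fields you've already set. A sketch of the usual pattern inside parse(), with the other fields elided:

    meeting = Meeting(
        title=title,
        start=start,
        # ... the rest of the scraped fields ...
    )
    # Derived rather than scraped: status is chosen from the start time and
    # any cancellation wording, and id is a slug built from the spider name,
    # start time, and title
    meeting["status"] = self._get_status(meeting)
    meeting["id"] = self._get_id(meeting)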

@SubtleHyperbole (Author)

So the successive requests are actually only needed to get links for files posted AFTER meetings that have already happened (like meeting minutes).

Following the restructuring of the website itself (which my email to their admin unintentionally caused), the only place it lists future meeting dates is in the sidebar on the right, which appears on several different pages.

This sidebar contains (at least right now) the next three future meetings and 5-6 past meetings. While each meeting in the sidebar links to its own page, that page contains no new information that isn't already in the sidebar.

The only information I figured would be of interest to scrape is the meeting minutes and other uploaded files for those 5-6 PAST meetings in the sidebar. For some reason, the link for each meeting in the sidebar doesn't actually lead to those files. Instead, separate pages for each meeting are listed on that nwheap meeting-minutes-and-agenda main page, along with multiple additional listing pages in groups of ten.

So here's the thing: since there are only 5-6 meetings listed in that right sidebar now, all of their meeting-minutes pages appear on the first page of results. As it stands, I iterate through all the existing back pages, but those contain files for past meetings that go way, way further back. I thought maybe I should use that list to generate the list of meetings, but unfortunately it doesn't give any information other than the date and the associated links.

So should I just ditch the iterative page scroll, given that, at least for right now, nothing except the first page holds any data that ends up in a yielded meeting?

@SubtleHyperbole (Author)

Also, I'm still not sure where the _id and _status come from, but as long as it's not something I'm supposed to be scraping from the website, that's fine by me.

@SubtleHyperbole (Author)

Also, I think my first comment might be confusing

    # Now moving onto the main parse of the meetings list

should really say something more like

    # First create a list of all potential meetings minutes/associated files pages. Later in the main parse, we will cycle through this list to see if there is an associated files page containing stuff like meeting minutes files.  Only past meetings have these, and even then not every time.

@pjsier (Collaborator) commented Jul 20, 2020

@SubtleHyperbole thanks for the clarification, but I think we'll still want to use Scrapy here, even if we're chaining together requests that aren't pulling links directly from a page. Here's an example of where we're doing this on il_elections
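
A rough sketch of chaining requests that aren't pulled from a page (the page range, URL pattern, and _collect_links helper are placeholders, not the actual il_elections code):

    import scrapy

    from city_scrapers_core.spiders import CityScrapersSpider


    class ChainedPagesSpider(CityScrapersSpider):
        name = "chained_pages_example"
        start_urls = ["https://nwheap.com/category/meet-minutes-and-agendas/"]

        def parse(self, response):
            # First listing page: collect its links, then chain requests for
            # the later pages by building their URLs directly
            self._collect_links(response)
            for page in range(2, 4):  # placeholder page range
                yield scrapy.Request(
                    "https://nwheap.com/category/meet-minutes-and-agendas/page/{}/".format(page),
                    callback=self._parse_links_page,
                )

        def _parse_links_page(self, response):
            # Later listing pages only need their links collected
            self._collect_links(response)

        def _collect_links(self, response):
            # Hypothetical helper: pull minutes/agenda links from one listing
            # page and stash them on self for matching to meetings later
            ...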
