0672 spider chi northwest home equity #966

Open · wants to merge 6 commits into main

Conversation

@SubtleHyperbole commented Jul 17, 2020

Summary

Issue: #672

Replace "ISSUE_NUMBER" with the number of your issue so that GitHub will link this pull request with the issue and make review easier.

Checklist

All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.

  • Tests are implemented
  • [X**] All tests are passing
  • Style checks run (see documentation for more details)
  • Style checks are passing
  • Code comments from template removed

Questions

I am such a newb when it comes to using GitHub. After royally screwing things up not once but twice, and having to literally delete the entire repository, start over, and re-add my files, I think I've got it down this time.

One thing is odd, though, and it's why I included the asterisks ** next to "All tests are passing": for some reason, this final time that I re-cloned and forked the repository onto my machine, it started throwing errors for three spiders (unrelated to the one I'm working on), which kept raising exceptions when trying to import a library called pdfminer. All three spiders use these two import lines:

from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams

These lines raised exceptions saying the library doesn't exist, even though I do have pdfminer installed in my main Python environment as well as in the pipenv shell. For some reason this was suddenly an issue. I actually had to comment out the two lines in the three spiders to get the terminal to finish setting up my fork. I then restored the lines, but later, when it came time to run pytest, it threw errors for those three spiders again.

Anyhow, that's a long-winded way of saying that what I mean by the ** in the checklist is: yes, all MY tests for the new spider are passing. Someone else's spiders that use pdfminer, however, are suddenly having trouble. The three spiders are:

il_pollution_control
chi_human_relations
cook_emergency_telephone

(None of those three spider files or tests were added to my commit/push anyway; I'm just bringing it up because I thought it was an odd error, especially since it didn't happen when I did this a day or so ago. It seems to have nothing to do with what I am working on.)

@SubtleHyperbole (Author)

I am confused about the "status" and "id" tests in the template -- no parse functions for these exist in the template like they do for every other test. Are they necessary? And where would they even come from? "status" sounds like the same information that's contained in "description" -- and this particular site doesn't give each meeting a unique ID other than the date, time, and title...

@pjsier (Collaborator) left a comment


Thanks for the PR! I left a few comments here, but before I go further it looks like it could be useful to take a look at some of the Scrapy documentation.

In general, Scrapy's workflow is to have a set of start_urls (or a start_requests method) that tells a spider which URLs to go to first. By default, the parse method is called on each of those responses, with the response argument containing the data you need to scrape. From there, you either yield items (dictionaries, or Meeting objects in our case) or yield additional requests to pages that Scrapy should follow. We have a few edge cases where we're using the requests library, but in the overwhelming majority of spiders you won't need it, so the Scrapy docs might be helpful before updating this.
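
Roughly, a minimal sketch of that flow (placeholder names, selector, and start time, assuming the city_scrapers_core Meeting item and CityScrapersSpider base class used across this repo):

    from datetime import datetime

    from city_scrapers_core.items import Meeting
    from city_scrapers_core.spiders import CityScrapersSpider


    class ExampleSpider(CityScrapersSpider):
        name = "example"
        agency = "Example Agency"
        timezone = "America/Chicago"
        # Scrapy requests each URL in start_urls first and passes each
        # response to parse() below
        start_urls = ["https://nwheap.com/category/meet-minutes-and-agendas/"]

        def parse(self, response):
            # Yield one Meeting per item found on the page; Scrapy handles
            # the HTTP request, so no call to the requests library is needed
            for item in response.css("article"):  # placeholder selector
                yield Meeting(
                    title=item.css("a::text").get(),
                    start=datetime(2020, 7, 17, 18, 0),  # placeholder start time
                    source=response.url,
                )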

For a setup like this, where we have one page we need to scrape for documents that are later associated with meetings, we usually follow a pattern of starting from the docs page in start_urls (which you did here), parsing those links and storing them on self, and then yielding separate requests for each additional page we need to scrape. That's your overall setup here, but using Scrapy directly can cut out a lot of the work. Here's an example from cook_housing.
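
A rough sketch of that pattern (not the actual cook_housing code; the selectors, helper callback, and second URL are placeholders):

    import scrapy

    from city_scrapers_core.spiders import CityScrapersSpider


    class DocsFirstSpider(CityScrapersSpider):
        name = "docs_first_example"
        # Start from the documents page so its links can be collected first
        start_urls = ["https://nwheap.com/category/meet-minutes-and-agendas/"]

        def parse(self, response):
            # Store the document links on self, keyed by a date string that
            # can be matched against meetings later
            self.link_date_map = {}
            for link in response.css("article a"):  # placeholder selector
                date_str = link.css("::text").get()
                self.link_date_map[date_str] = link.attrib.get("href")
            # Then yield a request for the page that actually lists the
            # meetings, handled by a separate callback
            yield scrapy.Request(
                "https://nwheap.com/",  # placeholder URL for the meetings page
                callback=self._parse_meetings,
            )

        def _parse_meetings(self, response):
            # Build Meeting objects here, attaching links from
            # self.link_date_map where the dates match
            ...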

Once those updates are in I can take a look at the rest, and I'm happy to send over additional resources, thanks!

start_urls = ["https://nwheap.com/category/meet-minutes-and-agendas/"]

def parse(self, response):
    self.location_dict = {
@pjsier (Collaborator):

These can be set above, alongside start_urls, instead of being rebuilt each time parse is called

    }
    # Before we begin, need to collect meeting minutes data -- which is
    # contained on multiple iterative pages (/page/2 /page/3 etc)
    r = requests.get("https://nwheap.com/category/meet-minutes-and-agendas/")
@pjsier (Collaborator):

We don't need to use requests here since the response from this page should be the first thing Scrapy loads in start_urls
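
Putting those two suggestions together, the top of the spider could look roughly like this (the class name is a guess from the PR title, and the dict contents are elided):

    from city_scrapers_core.spiders import CityScrapersSpider


    class ChiNorthwestHomeEquitySpider(CityScrapersSpider):
        name = "chi_northwest_home_equity"
        start_urls = ["https://nwheap.com/category/meet-minutes-and-agendas/"]
        # Defined once as a class attribute rather than rebuilt in parse()
        location_dict = {
            # ... the same location entries as before ...
        }

        def parse(self, response):
            # `response` already holds the page from start_urls, so no
            # requests.get() call is needed here
            ...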

@pjsier (Collaborator) commented Jul 17, 2020

@SubtleHyperbole to your question about ID and status, we've got more information on those in our docs, but in general they're a part of the Open Civic Data Event specification that we're following, and status in particular is a more structured way of tracking whether meetings are cancelled, upcoming, or in the past
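
They aren't scraped from the site directly; if the spider inherits from CityScrapersSpider like the others here, the base class helpers derive them from fields you've already set. A sketch of the usual pattern inside parse(), with the other fields elided:

    meeting = Meeting(
        title=title,
        start=start,
        # ... the rest of the scraped fields ...
    )
    # Derived rather than scraped: status is chosen from the start time and
    # any cancellation wording, and id is a slug built from the spider name,
    # start time, and title
    meeting["status"] = self._get_status(meeting)
    meeting["id"] = self._get_id(meeting)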

@SubtleHyperbole (Author)

So the successive requests are actually only needed to get links for files posted AFTER meetings that have already happened (like meeting minutes).

Following the restructuring of the website itself (which my email to their admin unintentionally caused), the only place it lists future meeting dates is in the sidebar on the right, which appears on several different pages.

This sidebar contains (at least right now) the next three future meetings and 5-6 past meetings. While each meeting in the sidebar links to its own page, that page contains no new information that isn't already in the sidebar.

The only information I figured would be of interest to scrape is the meeting minutes and other uploaded files for those 5-6 PAST meetings in the sidebar. For some reason, the link for each meeting in the sidebar doesn't actually lead to those files. Instead, separate pages for each meeting are listed on that nwheap meeting-minutes-and-agenda main page, along with multiple additional listing pages in groups of ten.

So here's the thing: since there are only 5-6 meetings listed in that right sidebar now, all of their meeting-minutes pages appear on the first page of results. As it stands, I iterate through all the existing back pages, but those contain files for past meetings that go way, way further back. I thought maybe I should use that list to generate the list of meetings, but unfortunately it doesn't give any information other than the date and the associated links.

So should I just ditch the iterative page scroll, given that, at least for right now, nothing except the first page holds any data that ends up in a yielded meeting?

@SubtleHyperbole (Author)

Also, I'm still not sure where the _id and _status come from, but as long as it's not something I'm supposed to be scraping from the website, that's fine by me.

@SubtleHyperbole (Author)

Also, I think my first comment might be confusing

    # Now moving onto the main parse of the meetings list

should really say something more like

    # First create a list of all potential meetings minutes/associated files pages. Later in the main parse, we will cycle through this list to see if there is an associated files page containing stuff like meeting minutes files.  Only past meetings have these, and even then not every time.

@pjsier (Collaborator) commented Jul 20, 2020

@SubtleHyperbole thanks for the clarification, but I think we'll still want to use Scrapy here, even if we're chaining together requests that aren't pulling links directly from a page. Here's an example of where we're doing this on il_elections
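
A rough sketch of chaining requests that aren't pulled from a page (the page range, URL pattern, and _collect_links helper are placeholders, not the actual il_elections code):

    import scrapy

    from city_scrapers_core.spiders import CityScrapersSpider


    class ChainedPagesSpider(CityScrapersSpider):
        name = "chained_pages_example"
        start_urls = ["https://nwheap.com/category/meet-minutes-and-agendas/"]

        def parse(self, response):
            # First listing page: collect its links, then chain requests for
            # the later pages by building their URLs directly
            self._collect_links(response)
            for page in range(2, 4):  # placeholder page range
                yield scrapy.Request(
                    "https://nwheap.com/category/meet-minutes-and-agendas/page/{}/".format(page),
                    callback=self._parse_links_page,
                )

        def _parse_links_page(self, response):
            # Later listing pages only need their links collected
            self._collect_links(response)

        def _collect_links(self, response):
            # Hypothetical helper: pull minutes/agenda links from one listing
            # page and stash them on self for matching to meetings later
            ...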
