
Spider: Chicago Northwest Home Equity Assurance Program #672

Open
pjsier opened this issue Feb 3, 2019 · 28 comments

@pjsier
Collaborator

pjsier commented Feb 3, 2019

URL: https://nwheap.com/category/meet-minutes-and-agendas/
Spider Name: chi_northwest_home_equity
Agency Name: Chicago Northwest Home Equity Assurance Program

See the contribution guide for information on how to get started

@GeorgeDubuque

I would like to take this one!

@pjsier
Collaborator Author

pjsier commented Jun 21, 2019

@GeorgeDubuque sorry, I missed this initially. Right now our policy is for people to take on one issue at a time, so feel free to start with this or the O'Hare scraper and move on to the other once you're done.

@mingchan96

mingchan96 commented Oct 30, 2019

Hi. For a class, my partner and I are looking for an issue to contribute to. I was wondering if this issue is still up for grabs? If this isn't available then is there an issue that is still open that we can look at?

@pjsier
Collaborator Author

pjsier commented Oct 30, 2019

@mingchan96 this is open, all yours if you're interested!

@erikkristoferanderson

I'd like to claim this one, please.

@mingchan96

@Ekand you can have it. My partner and I are currently busy with other projects.

@pjsier
Collaborator Author

pjsier commented Apr 2, 2020

@Ekand all yours!

@pjsier pjsier added claimed and removed help wanted labels Apr 2, 2020
@erikkristoferanderson

@pjsier Thanks!
I'll start by studying the contributors guide and try to have something in a pull request in two weeks.

@erikkristoferanderson

@pjsier Well, I'm sorry to do this again, but I'm going to bow out and release this task. I just got a job (yay!) and I'm going to prioritize that for now.

@pjsier
Collaborator Author

pjsier commented Apr 21, 2020

@Ekand no problem, and congrats on the job!

@SubtleHyperbole

Hey Pj, so I am working on this one (because the Illinois Department of Corrections seems not to have been doing what they're supposed to in terms of posting info about public meetings for the last couple of years), and I have a question.

It looks like, in general, the response variable used in the test .py file comes from a method called file_response, which pulls up a saved offline version of the webpage created (I think?) when the spider was generated on the command line, leaving no way to pull additional pages that might be needed to completely parse all the meetings.

For the spider in this issue (chi_northwest_home_equity, I think), the meetings are listed in pages of 10, with each additional page at /page/2/, /page/3/, and so on. Normally when scraping a site like this, I would use requests to fetch each page, checking its status code for a 4xx and stopping the scraper once I hit one.

However, because the parser seems to be pulling from offline files that were saved when the spider was generated, I'm not sure what to do. I figure that when I create the spider on the command line, I could probably put in a list of URLs, but on the command line I can't (or at least don't know how to) check a URL's response code to know how many /page/#/ entries to include in the list.

There are a few methods in the CityScrapersSpider class that sound promising, like .make_requests_from_url(), but from what little documentation I can see, that one is deprecated. Besides, I imagine there must be a general best practice for how this should be accomplished. I've looked at the contribution guidelines page and couldn't find it, though if I missed it, I apologize in advance.

@pjsier
Collaborator Author

pjsier commented Jun 8, 2020

Hi @SubtleHyperbole, I commented on the other issue but we are still interested in agencies that aren't updating as often as they should be. If you'd like to do this one instead let me know.

For your question on file_response, we've generally been saving the HTML files for additional pages manually with something like wget or curl, since it's needed on a case-by-case basis and the template is focused on the most common cases. You can see an example of a spider with multiple pages for tests in the tests for chi_ssa_42.
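For example, saving an additional results page for tests might look something like this (the output path is an assumption based on the usual tests/files layout, not a required name):

```shell
# Save the second page of results for use in tests; the output filename
# here is an assumption following the common tests/files convention.
wget -O tests/files/chi_northwest_home_equity_page_2.html \
  "https://nwheap.com/category/meet-minutes-and-agendas/page/2/"
```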

For this spider, the simplest way to handle pagination is to scrape the "Older posts" link each time it appears on the page rather than list all of the pages up front. Because the first page already goes well back into 2019, though, it may be fine to just pull the first page of results.
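The "follow the link if it's there" idea can be sketched as a small helper; in a real spider you'd use Scrapy's response.css()/xpath() selectors and response.follow() rather than a regex, and the link markup matched here is an assumption about the page:

```python
import re


def find_older_posts_link(html):
    """Return the href of an 'Older posts' link, or None if absent.

    A rough sketch only: a real city-scrapers spider would use Scrapy
    selectors (response.css/xpath) and response.follow instead of a regex.
    """
    match = re.search(
        r'<a[^>]+href="([^"]+)"[^>]*>\s*Older posts', html, re.IGNORECASE
    )
    return match.group(1) if match else None


# In a Scrapy parse method, the idea is: yield meetings from the current
# page, then if an "Older posts" URL is found, yield a new request for it
# with the same parse method as the callback, repeating until no link
# remains instead of guessing how many /page/N/ URLs exist.
```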

@SubtleHyperbole

great thank you!

@SubtleHyperbole

Actually, are you sure it's ssa_42? I'm looking at both https://ssa42.org/ssa-42-meeting-dates/ and https://ssa42.org/minutes-of-meetings/ on their website, but I don't see any additional pages of meeting info.

@pjsier
Collaborator Author

pjsier commented Jun 9, 2020

That scraper is just one example of including an additional page. il_commerce is another example that might be more similar, but either one follows the same overall idea of downloading separate pages to HTML for tests.

@SubtleHyperbole

Okay so I think I have the spider for this finished and (at least from what I can see) have the tests page also finished.

Unfortunately, because I bounced around on a couple of other issues before finally landing on this one, there are files in my working directory that aren't correct (default spiders and test pages for il_corrections, chi_housing, and cook_human_rights), so I don't want to submit a pull request because I'm pretty sure it would try to submit those as well.

Should I just start a whole new clone of the project (fork? I'm not sure of the nomenclature), start a new branch for this issue, copy the spider and test file over, and then submit the pull? Er... why isn't it called a push? It seems like I'm requesting that the changes I've made locally on my laptop get PUSHED to the main project repository. Why is this called a pull request?

@pjsier
Collaborator Author

pjsier commented Jun 25, 2020

@SubtleHyperbole glad to hear it! You should be able to only stage the files that are relevant and then commit those. So it could be something like this:

git add city_scrapers/spiders/chi_northwest_home_equity.py
git add tests/test_chi_northwest_home_equity.py
git commit -m "Add chi_northwest_home_equity"

And "pull request" is a GitHub-specific term (GitLab uses "merge request"), but my understanding is that it's called that because you're requesting that the project maintainer "pull" in your changes.

@SubtleHyperbole

oh, duh. lol that makes sense. I have a tendency to only think about things from my own perspective sometimes hah!

@SubtleHyperbole

Crap. I just submitted the request and realized that I never ran the code cleaners the FAQ says to run beforehand. Lint, I think?

@pjsier
Collaborator Author

pjsier commented Jun 26, 2020

@SubtleHyperbole No problem! I'm not seeing the request, but it's fine to make commits to a branch after you've opened up a pull request, and that's usually the case when we review them. You can run the style checks with these commands in the docs

@SubtleHyperbole

Hmm, I ran those three commands you listed in your last post in my terminal, inside the pipenv shell, from the main city-scrapers directory (so that the relative file paths in them would resolve).

@SubtleHyperbole

(git) bash-3.2$ git add city_scrapers/spiders/chi_northwest_home_equity.py
(git) bash-3.2$ git add tests/test_chi_northwest_home_equity.py
(git) bash-3.2$ git commit -m "Add chi_northwest_home_equity"
[0672-spider-chi_northwest_home_equity 64dfa3d] Add chi_northwest_home_equity
2 files changed, 190 insertions(+)
create mode 100644 city_scrapers/spiders/chi_northwest_home_equity.py
create mode 100644 tests/test_chi_northwest_home_equity.py
(git) bash-3.2$

@pjsier
Collaborator Author

pjsier commented Jun 26, 2020

Gotcha, that was to create a commit, but you'll need to push that and submit a pull request separately. It's usually called the "GitHub Flow" and there's more information on it here
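Concretely, the remaining step would look something like this (the branch name is taken from the terminal output above; the remote name "origin" pointing at your fork is an assumption):

```shell
# Push the local branch to your fork, then open a pull request
# from that branch on GitHub ("origin" as the fork remote is assumed).
git push -u origin 0672-spider-chi_northwest_home_equity
```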

@SubtleHyperbole

Just as an update: I literally had the spider completed, but in an effort at completeness I emailed the admin of the site to ask about what seemed like a small discrepancy between the lists of events (yes, the page seems to have multiple sources of meeting list data), and to my chagrin I got a reply that they've decided to revamp how the site provides info on the meetings.

In other words, my spider is now entirely broken LMAO. Right now I am waiting for their new system to work out a last kink before I get back to reworking the spider. Just wanted to note that I haven't given up on this or anything.

Oh, also, the main events page (nwheap.com/events/) now returns a 404. It might come back, though; that's what I'm waiting to find out.

@pjsier
Collaborator Author

pjsier commented Jul 14, 2020

Thanks for the update! I think it's fine to submit as is for now if it's still working

@KevivJaknap
Contributor

Hey, I would like to tackle this issue.

@haileyhoyat
Collaborator

@KevivJaknap Hello! Thanks so much for checking out our project. Go for it.

@KevivJaknap
Contributor

@haileyhoyat Just wanted to let you know that I've submitted a pull request.
