Spider: Chicago Northwest Home Equity Assurance Program #672
Comments
I would like to take this one!
@GeorgeDubuque sorry, I missed this initially. Right now our policy is for people to take on one at a time, so feel free to start this or the O'Hare scraper and move on to the other once you're done
Hi. For a class, my partner and I are looking for an issue to contribute to. I was wondering if this issue is still up for grabs? If it isn't available, is there another open issue we could look at?
@mingchan96 this is open, all yours if you're interested!
I'd like to claim this one, please.
@Ekand you can have it. My partner and I are currently busy with other projects.
@Ekand all yours!
@pjsier Thanks!
@pjsier Well, I'm sorry to do this again, but I'm going to bow out and release this task. I just got a job (yay!) and I'm going to prioritize that for now.
@Ekand no problem, and congrats on the job!
Hey Pj, so I am working on this one (because the Illinois Department of Corrections seems not to have been posting info about public meetings like they're supposed to for the last couple of years), and I have a question.

It looks like, in general, the response variable used in the test .py file comes from a method called file_response, which pulls a saved offline copy of the webpage that was created (I think?) when the spider was generated on the command line, leaving no way of pulling additional pages that might be needed to completely parse all meetings. For the example on this issue (chi_northwest_home_equity, I think), the meetings are listed in pages of 10, with each additional page at /page/2/, /page/3/, and so on.

Normally when scraping a site like this, I would use requests to fetch each page, check its status code for a 4xx, and stop the scraper once I hit one. However, because the parser seems to pull from offline files that were saved when the spider was generated, I'm not sure what to do. I figure that when I create the spider on the command line I could probably pass in a list of URLs, but on the command line I can't (or at least don't know how to) check a URL's response code to know how many /page/#/'s to include in that list.

There are a few methods in the CityScrapersSpider class that sound promising, like .make_requests_from_url(), but from what little documentation I can find, that one is deprecated. Besides, I imagine there must be a general best practice for how this should be accomplished. I've looked at the contribution guidelines page and couldn't find it, though if I missed it, I apologize in advance.
Hi @SubtleHyperbole, I commented on the other issue, but we are still interested in agencies that aren't updating as often as they should be. If you'd like to do this one instead, let me know. For your question on pagination: for this spider, the simplest way to handle it is to scrape the "Older posts" link each time it appears on the page rather than listing all of the pages up front. Because the first page already goes well back into 2019, though, it may be fine to just pull the first page of results.
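A minimal sketch of that "follow the Older posts link" approach. The `fetch` callable here is a stand-in for illustration only; in the real spider this loop would instead be Scrapy yielding a `response.follow` on the "Older posts" link from each `parse` call, and all names below are hypothetical:

```python
def crawl_all_pages(start_url, fetch):
    """Collect items across paginated listing pages.

    `fetch` returns (status_code, items, next_url_or_None). Crawling
    stops on a 4xx/5xx status (e.g. a 404 past the last /page/N/) or
    when no "Older posts" link is present (next_url is None).
    """
    url, items = start_url, []
    while url is not None:
        status, page_items, next_url = fetch(url)
        if status >= 400:
            break
        items.extend(page_items)
        url = next_url
    return items

# Fake three-page site for demonstration
PAGES = {
    "/": (200, ["m1", "m2"], "/page/2/"),
    "/page/2/": (200, ["m3"], "/page/3/"),
    "/page/3/": (200, ["m4"], None),
}

def fake_fetch(url):
    # Unknown pages return a 404, mirroring the real site's behavior
    return PAGES.get(url, (404, [], None))

print(crawl_all_pages("/", fake_fetch))  # -> ['m1', 'm2', 'm3', 'm4']
```

The advantage of following the link rather than enumerating `/page/N/` URLs is that the crawl naturally ends when the link disappears, so no status-code probing is needed.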
Great, thank you!
Actually, are you sure it's ssa_42? I'm looking on their website at both https://ssa42.org/ssa-42-meeting-dates/ and https://ssa42.org/minutes-of-meetings/ but I don't see any additional pages of meeting info.
That scraper is just one example of including an additional page. |
Okay, so I think I have the spider for this finished and (at least from what I can see) the test file finished as well. Unfortunately, because I bounced around on a couple of other issues before finally landing on this one, there are files in my directory that aren't correct (they have default spiders and test pages for il_corrections, chi_housing, and cook_human_rights), so I don't want to submit a pull request because I'm pretty sure it will try to submit those as well. Should I just start a whole new clone of the project (fork? not sure of the nomenclature) and a new branch for this issue, then copy the spider and test file over and submit the pull?

Er... why isn't it called a push? It seems like I'm requesting that the changes I've made locally on my laptop get PUSHED to the main project directory. Why is this called a pull request?
@SubtleHyperbole glad to hear it! You should be able to stage only the files that are relevant and then commit those. So it could be something like this:

git add city_scrapers/spiders/chi_northwest_home_equity.py
git add tests/test_chi_northwest_home_equity.py
git commit -m "Add chi_northwest_home_equity"

And "pull request" is a GitHub-specific term (GitLab uses "merge request"), but my understanding is that it's because you're requesting that the project maintainer "pull" in your changes.
oh, duh. lol that makes sense. I have a tendency to only think about things from my own perspective sometimes hah! |
Crap. I just submitted the request and realized that I never ran those code cleaners the FAQ says to run beforehand. Lint, I think?
@SubtleHyperbole No problem! I'm not seeing the request, but it's fine to make commits to a branch after you've opened up a pull request, and that's usually the case when we review them. You can run the style checks with these commands in the docs |
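The exact commands live in the project docs. As an illustration only, assuming the standard Python style toolchain of isort, black, and flake8 (the project's docs are the authority on which tools and arguments it actually uses), a typical run from the project root inside the pipenv shell might look like:

```shell
# Re-order imports, format code, then report remaining style issues
isort city_scrapers/ tests/
black city_scrapers/ tests/
flake8 city_scrapers/ tests/
```

Since isort and black rewrite files in place, any changes they make need to be committed and pushed again for the pull request to pick them up.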
Hmmm, I ran those three commands you listed in the last post in my terminal, inside the pipenv shell, from the main city-scrapers directory (so that the relative file paths in the commands would resolve).
|
Gotcha, that was to create a commit, but you'll need to push that and submit a pull request separately. It's usually called the "GitHub Flow" and there's more information on it here |
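A sketch of those remaining GitHub Flow steps; the branch name below is illustrative, and the pull request itself is opened either in the GitHub web UI (from the banner shown after the push) or with the optional GitHub CLI:

```shell
# Publish the local branch (with its commit) to your fork on GitHub
git push -u origin chi-northwest-home-equity

# Optionally open the pull request from the terminal with the GitHub CLI
gh pr create --title "Add chi_northwest_home_equity spider"
```

The commit only updates local history; the push is what makes it visible on GitHub, and the pull request is the separate step that asks maintainers to review and merge it.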
Just as an update: I literally had the spider completed, but in an effort at completeness I emailed the admin of the site to ask about what seemed like a small discrepancy between the lists of events (yes, the page seems to have multiple sources of meeting list data), and to my chagrin I got a reply that they've decided to revamp how the site provides info on the meetings. In other words, my spider is now entirely broken, LMAO. Right now I'm waiting for their new system to work out a last kink before I get back to reworking the spider. Just wanted to note that I haven't given up on this or anything. Oh, also, the main events page (nwheap.com/events/) is now a 404; it might come back, though, which is what I'm waiting to find out.
Thanks for the update! I think it's fine to submit as is for now if it's still working |
Hey, I would like to tackle this issue. |
@KevivJaknap Hello! Thanks so much for checking out our project. Go for it. |
@haileyhoyat Just wanted to let you know that I've submitted a pull request
URL: https://nwheap.com/category/meet-minutes-and-agendas/
Spider Name: chi_northwest_home_equity
Agency Name: Chicago Northwest Home Equity Assurance Program
See the contribution guide for information on how to get started