Potential way to speed scraping for LA Metro #23

Open
fgregg opened this issue Jul 19, 2018 · 2 comments

Comments

@fgregg
Member

fgregg commented Jul 19, 2018

Right now we aggressively scrape all events and all bills on Friday afternoons and evenings. We do this because changes to events and bills do not always modify the fields that a windowed search relies on, so a windowed search alone would miss updated information.

The bill scrape takes 22 minutes, which means that the maximum latency between a bill being updated by LA Metro and appearing on the councilmatic site is 22 minutes + the polling interval of import_data + the time for import_data to run.
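
To make that sum concrete, here is a rough sketch of the worst-case arithmetic. Only the 22-minute scrape time comes from this issue; the polling interval and import duration below are placeholder values assumed for illustration, and `worst_case_latency` is a hypothetical helper, not something in the codebase.

```python
from datetime import timedelta


def worst_case_latency(scrape_duration, polling_interval, import_duration):
    """Upper bound on the time between an upstream change and its appearance on the site."""
    return scrape_duration + polling_interval + import_duration


bill_scrape = timedelta(minutes=22)       # from this issue
polling_interval = timedelta(minutes=15)  # assumed placeholder, not measured
import_duration = timedelta(minutes=5)    # assumed placeholder, not measured

print(worst_case_latency(bill_scrape, polling_interval, import_duration))
```

Since the other two terms are fixed by how import_data is scheduled, shortening the scrape is the part of the sum this proposal targets.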

Since what LA Metro really cares about on Friday is that the agendas are accurate, we could take a somewhat different strategy that should decrease that latency.

  1. Go back to windowed search for updated bills
  2. Capture the unresolved bills from event scrapes. Direct a bill scraper to only try to scrape those unresolved bills.
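
A minimal sketch of the second step, assuming scraped events are available as plain dicts with an `agenda` list whose items may carry a `bill_identifier`. That shape, the function name, and `known_identifiers` are all hypothetical and only meant to illustrate the idea, not how the scrapers actually represent events.

```python
def unresolved_bill_identifiers(events, known_identifiers):
    """Return sorted identifiers referenced on agendas that we have not scraped yet.

    `events` is an iterable of dicts (hypothetical shape); `known_identifiers`
    is the set of bill identifiers already present in the database.
    """
    unresolved = set()
    for event in events:
        for item in event.get("agenda", []):
            identifier = item.get("bill_identifier")
            if identifier and identifier not in known_identifiers:
                unresolved.add(identifier)
    return sorted(unresolved)


events = [
    {"name": "Board Meeting", "agenda": [
        {"bill_identifier": "2018-0001"},
        {"bill_identifier": "2018-0002"},
        {"description": "Roll call"},  # agenda item with no bill attached
    ]},
    {"name": "Committee Meeting", "agenda": [{"bill_identifier": "2018-0002"}]},
]
print(unresolved_bill_identifiers(events, known_identifiers={"2018-0001"}))
```

The resulting list would then be handed to the bill scraper instead of running a full bill scrape.
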
@fgregg
Member Author

fgregg commented Jul 19, 2018

For your consideration, @reginafcompton. Not time sensitive.

@reginafcompton
Contributor

reginafcompton commented Jul 20, 2018

I like the second proposal. A couple details that we need to think about:

  1. how the scraper can ingest the bill identifiers - right now, it uses matter_ids: https://github.com/opencivicdata/scrapers-us-municipal/blob/master/lametro/bills.py#L97
  2. what the scraper should do if it cannot find a bill. Raising a unique error seems like it would put us back at issue #24 (Procedure for handling "cannot resolve" Sentry errors). Maybe just log it and skip it.
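
To illustrate the "log it and skip it" option, here is a rough sketch. `scrape_unresolved` and the `lookup` callable are hypothetical: `lookup` stands in for whatever ends up translating a bill identifier into a scraped bill (for example by first mapping it to a matter id), and is assumed to return `None` when nothing is found.

```python
import logging

logger = logging.getLogger(__name__)


def scrape_unresolved(identifiers, lookup):
    """Yield a bill for each identifier; log and skip any that cannot be found."""
    for identifier in identifiers:
        bill = lookup(identifier)  # hypothetical: returns None when the bill is missing
        if bill is None:
            logger.warning("Could not find bill %s; skipping", identifier)
            continue
        yield bill


# Stand-in for the real lookup, backed by a dict for illustration only.
fake_source = {"2018-0002": {"identifier": "2018-0002", "title": "Example bill"}}
bills = list(scrape_unresolved(["2018-0002", "2018-9999"], fake_source.get))
print(bills)
```

Logging at warning level keeps a record of the miss without raising, so a single missing bill does not abort the rest of the targeted scrape.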

I know there's more to consider, but just noting some challenges that immediately come to mind.
