One option: Submit a data scraper and a sample data file. For example, you could use Python to turn a PDF of police training records into JSON, make recurring API calls to pull down files, or toss the resulting data into a SQLite database.
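To make that first example concrete, here is a minimal sketch of turning a PDF into JSON. The library choice (pdfplumber) and all file names are illustrative assumptions, and real parsing would depend on the PDF's layout:

```python
import json

import pdfplumber  # third-party PDF text extraction; one option among several

records = []
with pdfplumber.open("police_training_records.pdf") as pdf:  # hypothetical file
    for page in pdf.pages:
        text = page.extract_text() or ""
        for line in text.splitlines():
            # A real scraper would parse fields here; we keep raw lines for brevity.
            records.append({"page": page.page_number, "raw": line})

with open("police_training_records.json", "w") as f:
    json.dump(records, f, indent=2)
```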
Other options: Improve or fix existing scrapers, add tests, extend helpers and utilities, submit a version of an existing scraper in a new language, or write scraper examples or templates to cover use cases we haven't yet gotten to.
- It's legal. Collecting public records from the internet is not problematic in itself, but respecting data publishers wherever possible is a good way to ensure data stays accessible.
- Scrapers are self-contained; when they're run, data should be saved locally or in their own GitHub repo.
- Populate the README for your scraper with as much helpful information as you can, including steps to set up and run the code.
- Include a truncated version of some sample data so we understand what is generated.
You can read more about our philosophy and priorities around scraping here.
What question are you trying to answer? What kind of data are you trying to help people use or preserve?
You can browse our Data Sources and find a source to scrape. If the source you want to scrape isn't there, please let us know as part of your submission, or add it yourself; filling in that form usually takes less than 5 minutes.
You can add a scraper to our collection, or create your own within your personal GitHub space.
If you'd like PDAP to host it, there are two options.
You can contribute it here, following our conventions, and people can find and use it as they please. Please stick to the format `scrapers_library/$STATE/$COUNTY/$MUNICIPALITY/$RECORD_TYPE`. If there is no specific county or municipality, it can live in the most relevant higher-level directory.
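As a purely hypothetical illustration of that layout (the directory names here are invented), a scraper for Pittsburgh arrest records might live at:

```
scrapers_library/pennsylvania/allegheny/pittsburgh/arrest_records/
```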
Or, if there's a compelling reason to have it running regularly (say, a specific request to use the data, or data that gets overwritten periodically), help us understand that. We will consider hosting it in a separate PDAP repo that also includes automated data collection, such as via GitHub Actions.
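As a rough sketch of what that automated collection can look like, here is a generic GitHub Actions workflow that runs a scraper on a daily schedule. It is illustrative only, not PDAP's actual configuration; the file paths and scraper name are invented:

```yaml
# Hypothetical workflow sketch, e.g. .github/workflows/scrape.yml
name: scheduled-scrape
on:
  schedule:
    - cron: "0 6 * * *"  # daily at 06:00 UTC
jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python scraper.py  # invented entry point
```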
Regardless of which way you'd like to go, we'll add it to our list so people can find and use it.
A few ways to think about whether PDAP is the right home for your scraper:
| Scrapers repo | Standalone repo |
| --- | --- |
| No specific or immediate need to capture the data repeatedly or on a rolling basis | The data disappears often or is overwritten, or there's a specific request to use the data collected |
| Best for people choosing to do things "the PDAP way" | Best if you have strong opinions about your project's structure, license, or usage |
| May reference common utilities | Does not reference common utilities |
| Best for simple scrapers and common data portals | Best for complicated projects involving multiple Data Sources |
| Scrapers only: no analysis, aggregation, messaging | Whatever you want |
| Easier to find when people look for tools around police data | Less visible, but more control |
| Best for Data Sources which people may want to scrape at any time | Best for creating a complete package of useful data which may not be updated further |
- Clone this repository. Don't know how?
Why start from scratch if we have a useful library? Keep in mind that you -- or we! -- can always refactor your work later if necessary, so if you're not sure, we still want you to submit!
Not sure where to start with a page you want to scrape? Check our examples and templates (`/examples_templates`) to see if we have that covered. If you see a use case we're missing, open an issue or (please and thank you) contribute it yourself.
Example Scraper 1 - This scraper collects arrest records, firearm seizures, incident blotter, non-traffic citations, and officer training data from the Pittsburgh Bureau of Police by using the OpenData scraper.
Example Scraper 2 - This scraper collects the daily crime/fire log from the Cal Poly Humboldt Police Department by using a CrimeGraphics scraper.
Utilities are scripts provided in the `/utils` folder of the project directory. These can assist in scraping data from websites; for example, `list_pdf_scrapers` can retrieve one or multiple PDF documents from a given webpage. Many of these utilities have READMEs that tell you what they do and how they work.
You can see how utilities can be imported and used in your own scraper by looking at the examples in the `/examples_templates` folder.
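For a feel of the shape that usage takes, here is a hypothetical sketch; the real module paths and function names are documented in each utility's README, and the ones below are invented:

```python
# Hypothetical: assumed import path and function name, for illustration only.
from utils.list_pdf_scrapers import save_pdfs

# Download every PDF linked from a (made-up) records page into a local folder.
save_pdfs(url="https://example.gov/police/reports", save_dir="data/")
```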
Data portal scrapers are scripts provided in the `/scrapers_library/data_portals` folder. These can assist in scraping data from common data portals where police data is often available, normally by using the site's API. No need to reinvent the wheel: if you want to scrape data from a common data portal, you can find usage instructions in the README files accompanying each data portal scraper.
You can see how data portal scrapers can be imported and used in your own scraper by looking at the examples in the `/examples_templates` folder.
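Again as a hypothetical sketch (check the scraper's own README for its real entry point; the names below are invented):

```python
# Hypothetical: assumed module and function, for illustration only.
from scrapers_library.data_portals.opendata import fetch_dataset

# Pull one dataset from an agency's open-data portal via its API.
fetch_dataset(base_url="https://data.example.gov", dataset_id="arrest-records")
```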
The most important thing here is that your scraper grabs public criminal legal records, and does so legally.
Beyond that, a PR for a new scraper should:
- Be based on a new branch
- Contain a detailed README which includes steps for setup and for running the code, in addition to helpful information about the data being collected
- Include recommended steps for testing (we'll poke at it in other ways, too, but it's always nice to have a place to start)
What kind of data are we scraping?
Police data that's already made public by a government agency. The Data Sources listed here can serve as a to-do list. If you're not sure where to start, read more here.
What languages are allowed?
To this point we've been working in Python. If you'd like to contribute and prefer to work in another language, that's fine. Just be sure to note that in your README, along with any setup steps to get your contribution running.
Are there any specific formatting guidelines I should adhere to?
For now, if you use Python: try to stick with PEP 8 formatting. A good formatter for this is Black.
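For example, assuming a standard pip environment (the file name is a placeholder):

```
pip install black
black your_scraper.py
```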