This repo contains materials for a half-day workshop at the IRE 2024 conference in Anaheim on using Python to scrape data from websites.
The session is scheduled for Sunday, June 23, from 9 a.m. to 12:30 p.m. in room Orange County Ballroom 3.
Open the `cmd` application. Copy and paste this text and hit Enter:

```
cd Desktop\hands_on_classes\20240623-sunday-web-scraping-with-python-pre-registered-attendees-only && .\env\Scripts\activate && jupyter lab
```
- Do you really need to scrape this?
- Process overview:
  - Fetch, parse, write data to file (see the fetch/parse/write sketch after this list)
- Some best practices:
  - Make sure you feel OK about whether your scraping project is (legally, ethically, etc.) allowable
  - Don't DDoS your target server
  - When feasible, save copies of pages locally, then scrape from those files
  - Rotate user-agent strings and other headers if necessary to avoid bot detection (these last three practices are combined in the polite-scraping sketch after this list)
- Using your favorite browser's inspection tools to deconstruct the target page(s):
  - See if the data is delivered to the page via an undocumented API in a ready-to-use format, such as JSON (example 1, example 2) -- Postman or similar software is handy for testing out API calls (see the JSON API sketch after this list)
  - Is the HTML part of the actual page structure, or is it built on the fly when the page loads? (example)
  - Can you open the URL directly in an incognito window and get to the same content, or does the page require a specific state to deliver the content (via search navigation, etc.)? (example)
  - Are there URL query parameters that you can tweak to get different results? (example)
- Choose tools that make the most sense for your target page(s) -- a few popular options:
  - `requests` and `BeautifulSoup`
  - `playwright` (optionally using `BeautifulSoup` for the HTML parsing) -- handy when the HTML is built on the fly (see the playwright sketch after this list)
  - `scrapy` for larger spidering/crawling tasks (see the scrapy sketch after this list)
- Overview of our Python setup today:
  - Activating the virtual environment
  - Jupyter notebooks
  - Running `.py` files from the command line
- Projects in this repo:
- Try GitHub Actions if you need to put your scraper on a timer (you could also drop your script on a remote server, such as DigitalOcean, PythonAnywhere or Heroku, with a `crontab` configuration)
- Tipsheet on inspecting web elements
- Tipsheet on saving HTML files before scraping them
- Tipsheet with some miscellaneous scraping tips
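
A minimal sketch of the fetch, parse, write process outlined above, using `requests` and `BeautifulSoup`. The URL, the CSS selector and the output filename are hypothetical placeholders -- swap in your actual target page and its real HTML structure:

```python
import csv

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/some-data-page"  # hypothetical target page

# Fetch the page
response = requests.get(URL, timeout=30)
response.raise_for_status()

# Parse the HTML
soup = BeautifulSoup(response.text, "html.parser")
rows = []
for tr in soup.select("table tr"):  # placeholder selector
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

# Write the data to file
with open("data.csv", "w", newline="", encoding="utf-8") as outfile:
    writer = csv.writer(outfile)
    writer.writerows(rows)
```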
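A polite-scraping sketch combining several of the best practices above: send an identifying User-Agent header, pause between requests, and save a local copy of each page so you can re-run your parser later without hitting the server again. The URLs and the header value are hypothetical:

```python
import time
from pathlib import Path

import requests
from bs4 import BeautifulSoup

# Identify yourself (or, if necessary, rotate through several values)
HEADERS = {"User-Agent": "Jane Reporter - news research - jane@example.com"}

URLS = [  # hypothetical list of target pages
    "https://example.com/records/1",
    "https://example.com/records/2",
]

for url in URLS:
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()

    # Save a local copy, named after the last piece of the URL
    Path(f"page_{url.rstrip('/').split('/')[-1]}.html").write_text(
        response.text, encoding="utf-8"
    )

    time.sleep(2)  # be polite: pause between requests

# Later: scrape from the saved files instead of re-fetching
for html_file in sorted(Path(".").glob("page_*.html")):
    soup = BeautifulSoup(html_file.read_text(encoding="utf-8"), "html.parser")
    print(html_file.name, soup.title.get_text() if soup.title else "(no title)")
```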
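A JSON API sketch: fetching data from an undocumented endpoint directly, with query parameters you can tweak to get different results. The endpoint and parameter names are made up -- find the real ones in your browser's network tab (or by experimenting in Postman):

```python
import requests

API_URL = "https://example.com/api/search"  # hypothetical endpoint

# Hypothetical query parameters -- tweak values to page through results
params = {"q": "inspections", "page": 1, "per_page": 100}

response = requests.get(API_URL, params=params, timeout=30)
response.raise_for_status()

# The data arrives ready to use -- no HTML parsing required
data = response.json()
print(data)
```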
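A playwright sketch: rendering a page that builds its HTML on the fly with JavaScript, then handing the rendered HTML to `BeautifulSoup` for parsing. The URL is a placeholder:

```python
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/js-heavy-page")  # hypothetical target
    page.wait_for_load_state("networkidle")  # let the JavaScript finish
    html = page.content()  # the fully rendered HTML
    browser.close()

# From here it's ordinary HTML parsing
soup = BeautifulSoup(html, "html.parser")
print(soup.title.get_text() if soup.title else "(no title)")
```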
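A scrapy sketch: a minimal spider for larger spidering/crawling jobs, pointed at the quotes.toscrape.com practice site (a sandbox built for scraping exercises). Save it as, say, `quotes_spider.py` and run it with `scrapy runspider quotes_spider.py -o quotes.json`:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Pull each quote block off the current page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        # Follow the pagination link until there isn't one
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```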
- Install Python, if you haven't already (here's our guide)
- Clone or download this repo
- `cd` into the repo directory and install the requirements, preferably into a virtual environment using your tooling of choice: `pip install -r requirements.txt`
- Run `playwright install` to download the browsers that playwright drives
- Run `jupyter lab` to launch the notebook server