This repo contains materials for a half-day workshop at the IRE 2024 conference in Anaheim on using Python to scrape data from websites.
The session is scheduled for Sunday, June 23, from 9 a.m. to 12:30 p.m. in room Orange County Ballroom 3.
Open the `cmd` application. Copy and paste this text and hit Enter:

```
cd Desktop\hands_on_classes\20240623-sunday-web-scraping-with-python-pre-registered-attendees-only && .\env\Scripts\activate && jupyter lab
```
- Do you really need to scrape this?
- Process overview:
  - Fetch, parse, write data to file (see the fetch/parse/write sketch after this list)
- Some best practices:
  - Make sure you feel OK about whether your scraping project is (legally, ethically, etc.) allowable
  - Don't DDoS your target server
  - When feasible, save copies of pages locally, then scrape from those files
  - Rotate user-agent strings and other headers if necessary to avoid bot detection (these last three practices are combined in the polite-scraping sketch after this list)
- Using your favorite browser's inspection tools to deconstruct the target page(s):
  - See if the data is delivered to the page via an undocumented API in a ready-to-use format, such as JSON (example 1, example 2) -- Postman or similar software is handy for testing out API calls (see the JSON API sketch after this list)
  - Is the HTML part of the actual page structure, or is it built on the fly when the page loads? (example)
  - Can you open the URL directly in an incognito window and get to the same content, or does the page require a specific state to deliver the content (via search navigation, etc.)? (example)
  - Are there URL query parameters that you can tweak to get different results? (example)
- Choose tools that make the most sense for your target page(s) -- a few popular options:
  - `requests` and `BeautifulSoup`
  - `playwright` (optionally using `BeautifulSoup` for the HTML parsing) -- handy when the HTML is built on the fly (see the playwright sketch after this list)
  - `scrapy` for larger spidering/crawling tasks (see the scrapy sketch after this list)
- Overview of our Python setup today:
  - Activating the virtual environment
  - Jupyter notebooks
  - Running `.py` files from the command line
- Projects in this repo:
- Try GitHub Actions if you need to put your scraper on a timer (you could also drop your script on a remote server, such as DigitalOcean, PythonAnywhere or Heroku, with a `crontab` configuration)
- Tipsheet on inspecting web elements
- Tipsheet on saving HTML files before scraping them
- Tipsheet with some miscellaneous scraping tips
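
A minimal sketch of the fetch, parse, write process outlined above, using `requests` and `BeautifulSoup`. The URL, the CSS selector and the output filename are hypothetical placeholders -- swap in your actual target page and its real HTML structure:

```python
import csv

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/some-data-page"  # hypothetical target page

# Fetch the page
response = requests.get(URL, timeout=30)
response.raise_for_status()

# Parse the HTML
soup = BeautifulSoup(response.text, "html.parser")
rows = []
for tr in soup.select("table tr"):  # placeholder selector
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

# Write the data to file
with open("data.csv", "w", newline="", encoding="utf-8") as outfile:
    writer = csv.writer(outfile)
    writer.writerows(rows)
```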
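A polite-scraping sketch combining several of the best practices above: send an identifying User-Agent header, pause between requests, and save a local copy of each page so you can re-run your parser later without hitting the server again. The URLs and the header value are hypothetical:

```python
import time
from pathlib import Path

import requests
from bs4 import BeautifulSoup

# Identify yourself (or, if necessary, rotate through several values)
HEADERS = {"User-Agent": "Jane Reporter - news research - jane@example.com"}

URLS = [  # hypothetical list of target pages
    "https://example.com/records/1",
    "https://example.com/records/2",
]

for url in URLS:
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()

    # Save a local copy, named after the last piece of the URL
    Path(f"page_{url.rstrip('/').split('/')[-1]}.html").write_text(
        response.text, encoding="utf-8"
    )

    time.sleep(2)  # be polite: pause between requests

# Later: scrape from the saved files instead of re-fetching
for html_file in sorted(Path(".").glob("page_*.html")):
    soup = BeautifulSoup(html_file.read_text(encoding="utf-8"), "html.parser")
    print(html_file.name, soup.title.get_text() if soup.title else "(no title)")
```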
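A JSON API sketch: fetching data from an undocumented endpoint directly, with query parameters you can tweak to get different results. The endpoint and parameter names are made up -- find the real ones in your browser's network tab (or by experimenting in Postman):

```python
import requests

API_URL = "https://example.com/api/search"  # hypothetical endpoint

# Hypothetical query parameters -- tweak values to page through results
params = {"q": "inspections", "page": 1, "per_page": 100}

response = requests.get(API_URL, params=params, timeout=30)
response.raise_for_status()

# The data arrives ready to use -- no HTML parsing required
data = response.json()
print(data)
```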
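A playwright sketch: rendering a page that builds its HTML on the fly with JavaScript, then handing the rendered HTML to `BeautifulSoup` for parsing. The URL is a placeholder:

```python
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/js-heavy-page")  # hypothetical target
    page.wait_for_load_state("networkidle")  # let the JavaScript finish
    html = page.content()  # the fully rendered HTML
    browser.close()

# From here it's ordinary HTML parsing
soup = BeautifulSoup(html, "html.parser")
print(soup.title.get_text() if soup.title else "(no title)")
```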
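A scrapy sketch: a minimal spider for larger spidering/crawling jobs, pointed at the quotes.toscrape.com practice site (a sandbox built for scraping exercises). Save it as, say, `quotes_spider.py` and run it with `scrapy runspider quotes_spider.py -o quotes.json`:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Pull each quote block off the current page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        # Follow the pagination link until there isn't one
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```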
- Install Python, if you haven't already (here's our guide)
- Clone or download this repo
- `cd` into the repo directory and install the requirements, preferably into a virtual environment using your tooling of choice: `pip install -r requirements.txt`
- Run `playwright install` to download the browsers that playwright drives
- Run `jupyter lab` to launch the notebook server