Web Scraping in R -- IRE2024

The basics of scraping web pages in R using rvest

Requirements for the class

R and RStudio installed
tidyverse and rvest installed: install.packages(c("tidyverse","rvest"))
A browser with development tools (such as Chrome Inspect)

Basics of HTML structure

Get to know the structure of an HTML element - https://developer.mozilla.org/en-US/docs/Glossary/Element

tags ex: <p> opens and </p> closes
attributes ex: id="shazam" inside the tag 
text ex: The text between opening and closing tags
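
To make the three parts concrete, here is a minimal sketch that parses a one-line snippet and pulls out each part. The snippet and its contents are made up for illustration, and rvest's minimal_html() is used in place of a real page so the example runs without a network connection.

```r
library(rvest)

# A made-up snippet: one <p> element with an id attribute and some text
page <- minimal_html('<p id="shazam">Hello, IRE!</p>')

p <- page |> html_element("p")

tag_name <- html_name(p)        # the tag name
id_value <- html_attr(p, "id")  # the value of the id attribute
text     <- html_text2(p)       # the text between the opening and closing tags
```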

A table in HTML is built with a <table> tag. Each row is a <tr>; inside a row, <th> marks a header cell and <td> marks a data cell.
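
As a rough illustration of that structure, the sketch below writes a small table as a string and converts it to a data frame with html_table(). The company names and numbers are invented, and minimal_html() stands in for a real page.

```r
library(rvest)

# A made-up table: one header row (<th> cells) and two data rows (<td> cells)
page <- minimal_html('
  <table>
    <tr><th>Company</th><th>Employees</th></tr>
    <tr><td>Acme</td><td>120</td></tr>
    <tr><td>Globex</td><td>45</td></tr>
  </table>
')

# html_table() turns the <table> element into a tibble,
# using the <th> cells as column names
notices <- page |> html_element("table") |> html_table()
```

The same html_element("table") |> html_table() pattern is a reasonable first thing to try on any page that stores its data in a real <table>.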

Basic usage of functions in rvest

Step 1: read the html from a webpage into the RStudio environment using the read_html() function:

ex. html <- read_html("url")

Step 2: pull a specific element from that html using html_element() or html_elements():

ex. everything_inside_a_table_tag <- html |> html_element("table")
ex. everything_inside_a_p_tag <- html |> html_element("p")
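
The difference between the two functions is easy to miss: html_element() returns only the first match, while html_elements() returns every match. A small sketch on a made-up page (built with minimal_html() so it runs offline):

```r
library(rvest)

# A made-up page with two paragraphs and no table
page <- minimal_html('
  <p>First paragraph</p>
  <p>Second paragraph</p>
')

first_p <- page |> html_element("p")    # first <p> only
all_p   <- page |> html_elements("p")   # every <p> on the page

n_paragraphs <- length(all_p)                          # how many <p> elements matched
n_tables     <- length(page |> html_elements("table")) # zero-length result when nothing matches
```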

Step 3: pull the text or contents from an html element using html_text2():

ex. everything_inside_a_p_tag |> html_text2()
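
Put together, the three steps can be chained into one pipeline. This is only a sketch: in class the first step would be read_html() on a real URL, but here a made-up inline page is used so the code runs without a network connection.

```r
library(rvest)

# Step 1: get parsed HTML (stand-in for read_html("url"))
html <- minimal_html('
  <h1>Notices</h1>
  <p>Updated weekly.</p>
  <p>Contact the labor department.</p>
')

# Steps 2 and 3: select every <p> element, then extract its text
paragraph_text <- html |>
  html_elements("p") |>
  html_text2()
```

Because html_elements() matched two paragraphs, paragraph_text is a character vector with one entry per paragraph.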

Websites we'll scrape in this class (we'll see how far we can get)

1 https://www.dllr.state.md.us/employment/warn.shtml

2 https://dlr.sd.gov/workforce_services/businesses/warn_notices.aspx

3 https://www.billboard.com/charts/hot-100/

You'll find the finished scripts in the finished_scripts folder.

Resources for help

check out Hadley Wickham's tutorial on web scraping
here's an IRE tipsheet on using browser development tools (such as Chrome Inspect)
