Abby Caffas | Spring 2021
In this project, I set out to gather linguistic data by analyzing podcast transcripts and to compare stylistic differences across several human-inferred attributes (such as genre, rating, and format). Using Scrapy, I extracted episode transcripts from 20 podcasts:
- This American Life - 731 episodes
- Radiolab - 192 episodes
- Welcome to Night Vale - 168 episodes
- Move Your DNA - 104 episodes
- The Allusionist - 97 episodes ***
- Bullseye with Jesse Thorn - 63 episodes
- My Brother, My Brother, and Me - 32 episodes
- Sawbones - 30 episodes
- One Bad Mother - 29 episodes
- Wonderful - 29 episodes
- Friendly Fire - 28 episodes
- The Greatest Generation - 28 episodes
- Judge John Hodgman - 28 episodes
- Shmanners - 28 episodes
- NeoScum - 20 episodes
- The Adventure Zone - 19 episodes
- The Flophouse - 16 episodes
- Switchblade Sisters - 14 episodes
- You're Wrong About - 13 episodes
- Unlocking Us - 12 episodes
Access my spiders here. Although I wrote working Scrapy spiders for the following podcasts, I could not use the scraped data for copyright reasons (I did not receive a response to my request to use it):
- 99% Invisible
- Freakonomics
- Lore
- On Being
- StoryCorps
Access my main dataframe here, along with two others: one containing only host speech, which I was able to parse out thanks to consistent text formatting, and one intended for a later linear regression measuring language change over time.
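To illustrate what parsing host speech out of consistently formatted text can involve, here is a minimal sketch. It assumes each speaker turn begins a line with a `Name: ` label, which may not match the formatting of the actual transcripts. The function names `split_turns` and `extract_host_speech`, the pattern, and the sample transcript are all hypothetical and are not taken from the project's code or data.

```python
import re
from collections import defaultdict

# Assumed format (hypothetical): each speaker turn starts a line with "Name: ".
# Lines without a label are treated as continuations of the previous turn.
TURN_PATTERN = re.compile(r"^(?P<speaker>[A-Z][\w .'-]{0,40}):\s+(?P<text>.*)$")


def split_turns(transcript):
    """Group transcript text by speaker label, keeping turns in order."""
    turns = defaultdict(list)
    current = None
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        match = TURN_PATTERN.match(line)
        if match:
            current = match.group("speaker")
            turns[current].append(match.group("text"))
        elif current is not None:
            # Unlabeled line: append it to the current speaker's latest turn.
            turns[current][-1] += " " + line
    return dict(turns)


def extract_host_speech(transcript, hosts):
    """Return only the turns spoken by the given host names."""
    turns = split_turns(transcript)
    return {host: turns.get(host, []) for host in hosts}


# Made-up sample transcript, for illustration only.
sample = """\
Host A: Welcome to the show.
Guest B: Thanks for having me.
Host A: Let's get started.
It's going to be a long one.
"""

host_speech = extract_host_speech(sample, hosts=["Host A"])
print(host_speech)
```

A pattern this simple would misread any sentence that opens with a short capitalized phrase followed by a colon as a speaker label, so a real transcript would likely need a known list of speaker names or stricter formatting checks.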
Access my machine learning models here, as well as the resulting figures.
NOTE: This is my first foray into machine learning, data science, web scraping, and text processing. All findings and analysis are for exploratory purposes only. In the future, I intend to use this data for sociolinguistic discourse analysis and syntax parsing.
Thank you for visiting my term project! Feel free to leave feedback in my guestbook.