Linguistic-Styles-of-Podcasts

Abby Caffas | [email protected] | Spring 2021

Welcome to my Data Science for Linguists Term Project!

In this project, I set out to gather linguistic data by analyzing podcast transcripts and to compare stylistic differences across several human-inferred attributes (such as genre, rating, and format). Using Scrapy, I was able to extract episode transcripts from 20 different podcasts:

This American Life - 731 episodes
Radiolab - 192 episodes
Welcome to Night Vale - 168 episodes
Move Your DNA - 104 episodes
The Allusionist - 97 episodes ***
Bullseye with Jesse Thorn - 63 episodes
My Brother, My Brother, and Me - 32 episodes
Sawbones - 30 episodes
One Bad Mother - 29 episodes
Wonderful - 29 episodes
Friendly Fire - 28 episodes
The Greatest Generation - 28 episodes
Judge John Hodgman - 28 episodes
Shmanners - 28 episodes
NeoScum - 20 episodes
The Adventure Zone - 19 episodes
The Flop House - 16 episodes
Switchblade Sisters - 14 episodes
You're Wrong About - 13 episodes
Unlocking Us - 12 episodes

Access my spiders here. Though I developed working Scrapy spiders for the following podcasts, I was unable to use the scraped data for copyright reasons (I did not receive a response to my request to use it):

99% Invisible
Freakonomics
Lore
On Being
StoryCorps

Access my main dataframe here, as well as a second dataframe in which I was able to parse out host speech thanks to consistent text formatting, and a third intended for a later linear regression measuring language change over time.
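To illustrate what parsing out host speech can look like when transcripts are formatted consistently, here is a minimal sketch. It is not the code used in this project: the `Speaker: text` line format, the `SAMPLE` transcript, and the function names `parse_turns` and `host_speech` are all assumptions made up for the example.

```python
import re

# ASSUMPTION: each speaker turn starts with a capitalized label and a colon,
# e.g. "Host: Hello". Real transcripts may use a different convention.
SPEAKER_LINE = re.compile(r"^([A-Z][A-Za-z .'\-]{0,40}):\s*(.*)$")

# Hypothetical sample transcript, invented for illustration only.
SAMPLE = """Host: Hello and welcome.
Guest: Thanks for having me.
It's great to be here.
Host: Let's get started."""


def parse_turns(transcript):
    """Split a transcript into (speaker, text) turns in chronological order."""
    turns = []
    for raw in transcript.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        match = SPEAKER_LINE.match(line)
        if match:
            # A new speaker label starts a new turn.
            turns.append([match.group(1), match.group(2)])
        elif turns:
            # An unlabeled line continues the previous speaker's turn.
            turns[-1][1] = (turns[-1][1] + " " + line).strip()
    return [(speaker, text) for speaker, text in turns]


def host_speech(transcript, hosts):
    """Join the text of every turn whose speaker label is in `hosts`."""
    return " ".join(
        text for speaker, text in parse_turns(transcript)
        if speaker in hosts and text
    )


print(host_speech(SAMPLE, {"Host"}))
```

A label-based pattern like this only works when speaker labels are applied uniformly, which is why host speech could be separated for some podcasts but not others. Lines that happen to begin with a capitalized word and a colon would be misread as speaker labels, so the pattern would need tightening for real data.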

Access my machine learning models here, as well as the resulting figures.

NOTE: This is my first foray into machine learning, data science, web scraping, and text processing. All findings and analysis are for exploratory purposes only. In the future, I intend to use this data for sociolinguistic discourse analysis and syntax parsing.

Thank you for visiting my term project! Feel free to leave feedback in my guestbook.

