Abby's term project: Parse and annotate podcast transcripts with syntactic and semantic information, then compare results with a focus on genre and rating.

Data-Science-for-Linguists-2021/Linguistic-Styles-of-Podcasts

Linguistic-Styles-of-Podcasts

Abby Caffas | [email protected] | Spring 2021

Welcome to my Data Science for Linguists Term Project!

In this project, I set out to gather linguistic data by analyzing podcast transcripts and to compare stylistic differences across several human-inferred attributes (such as genre, rating, and format). Using Scrapy, I was able to extract episode transcripts from 20 different podcasts.

Access my spiders here. Though I developed a working Scrapy module, I could not use the data scraped from the following podcasts for copyright reasons (I did not receive a response to my request to use the data):

  • 99% Invisible
  • Freakonomics
  • Lore
  • On Being
  • StoryCorps

Access my main dataframe here, as well as a second dataframe containing only host speech (which I could parse out thanks to consistent text formatting) and a third intended for a later linear regression measuring language change over time.
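Parsing out host speech depends on each transcript marking speaker turns in a predictable way. As a minimal sketch of the idea (not necessarily the code in this repository), suppose each turn begins with a speaker label followed by a colon, such as `HOST:`. The label format, the function names `split_by_speaker` and `host_speech`, and the sample transcript are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed (hypothetical) format: a turn starts with "SPEAKER NAME:" at the
# beginning of a line; lines without such a label continue the previous turn.
# Limitation: a continuation line that itself starts with a capitalized
# phrase followed by a colon would be misread as a new speaker.
SPEAKER_LINE = re.compile(r"^([A-Z][A-Za-z .'-]*):\s*(.*)$")


def split_by_speaker(transcript):
    """Group the text of a transcript by speaker label."""
    turns = defaultdict(list)
    current = None
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        match = SPEAKER_LINE.match(line)
        if match:
            current = match.group(1).strip()
            text = match.group(2)
        else:
            text = line
        # Text before the first speaker label is skipped.
        if current is not None and text:
            turns[current].append(text)
    return {speaker: " ".join(parts) for speaker, parts in turns.items()}


def host_speech(transcript, host_names):
    """Return only the speech attributed to the given host labels."""
    by_speaker = split_by_speaker(transcript)
    return " ".join(by_speaker[name] for name in host_names if name in by_speaker)


# Made-up sample transcript, for illustration only.
sample = """HOST: Welcome back to the show.
GUEST: Thanks for having me.
HOST: Let's get started.
We have a lot to cover."""

print(host_speech(sample, ["HOST"]))
```

A transcript without consistent speaker labels would need a different approach, which is why this kind of extraction works only for podcasts with regular formatting.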

Access my machine learning models here, as well as the resulting figures.

NOTE: This is my first foray into machine learning, data science, web scraping, and text processing. All findings and analysis are for exploratory purposes only. In the future, I intend to use this data for sociolinguistic discourse analysis and syntax parsing.

Thank you for visiting my term project! Feel free to leave feedback in my guestbook.
