Tidytext analysis of podcast transcripts

I listen to podcasts fo several hours each day. When I wake up, I launch Overcast and listen to a podcast while I have coffee. They’re playing my drive to work, sometimes while I work,1 and when I’m running my shutdown routine at night to get ready for the next day.

I love my podcasts.

Earlier this year, _DavidSmith released a side project, Podcast Search. His description of this project:

“I take a few podcasts and run them through automated speech-to-text, which is useless for reading, but works out to be just fine for keyword searching.”

It’s a really cool site - particualry since it produces searchable quasi-transcripts for a few of my favorite podcasts:

They aren’t perfect, but it’s a pretty cool tool, and it got me thinking: wouldn’t it be cool to take that text and run it through some tidytext functions?

So I did.

I started by scraping the raw HTML for each episode of ATP, Cortex, Hypercritical, and The Talk Show. I then did a little data cleaning to get the text from Podcast Search into a tidy format. There are still a few episodes of ATP and Hypercritical processing through _DavidSmith’s speech-to-text algorithm, but there’s enough data from each podcast to do some cursory text anaysis using tidytext.

As of 6 June 2017, I have about 1.1 million lines of text from these four podcasts to analyze. Once I loaded the data, it was pretty easy to parse out the words spoken in each podcast using the unnest_tokens() function from tidytext.

# load libraries
library(tidyverse)
library(tidytext)

# load transcript data
atp <- read_rds("data/atp.rda")
cortex <- read_rds("data/cortex.rda")
hypercritical <- read_rds("data/hypercritical.rda")
the_talk_show <- read_rds("data/tts.rda")

# join podcast data together
podcasts <- bind_rows(atp, cortex, hypercritical, the_talk_show)

# get count of words by podcast
podcast_words <- podcasts %>% 
  unnest_tokens(word, line) %>%
  anti_join(stop_words) %>%
  count(podcast, word) %>%
  ungroup()

Next, I wanted to use tf-idf to determine which words were most important to each podcast. Not familiar with this approach? Here’s a description from the authors of the tidytext package:

The idea of tf-idf is to find the important words for the content of each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents…

Calculating tf-idf attempts to find the words that are important (i.e., common) in a text, but not too common.

In other words, if one of these podcasts uses an uncommon word more than others, this approach will help identify it. Here is the code I used to do this:

# count total words per podcast
total_words <- podcast_words %>%
  group_by(podcast) %>%
  summarise(total = sum(n))

# join individual word count and total word count
# get tf_idf by podcast
podcast_words <- left_join(podcast_words, total_words) %>% 
  bind_tf_idf(word, podcast, n)
  
# plot top 10 tf_idf words by podcast
podcast_words %>%  
  group_by(podcast) %>% 
  top_n(10) %>% 
  ungroup() %>% 
  ggplot(aes(reorder(word, tf_idf), tf_idf, fill = podcast)) +
  geom_col(show.legend=FALSE) +
  facet_wrap(~podcast, scales = "free") +
  coord_flip() +
  labs(x = "tf-idf",
       title = "Most Important Words by Podcast",
       subtitle = "Measured by tf-idf") +
  scale_fill_manual(values = c("blue4", "grey18",
                               "palegreen4", "lightsteelblue4"))+
  theme(legend.position = "none",
        axis.title.y = element_blank())

podcast plot

This is a pretty interesting plot, but there’s definitely some noise here. It’s clear that there are some words that slipped through the stop_words filter (like “dont”) and some words that got mixed up during the speech-to-text process like (“fd”).

There’s also a lot of noise coming from each podcast’s ad reads. I’m sure this makes the folks at Casper, Lynda, Betterment, etc. happy, but I’d like to know a little more about the actual content of these podcasts. Let’s filter those words and plot again using this code:

podcast_plot_clean <- podcast_words %>% 
  arrange(desc(tf_idf)) %>% 
  filter(word != "mattress") %>%
  filter(word != "casper") %>%
  filter(word != "betterment") %>%
  filter(word != "lynda") %>%
  filter(word != "5x5") %>%
  filter(word != "rackspace") %>%
  filter(word != "mailchimp.com") %>%
  filter(word != "dont") %>%
  filter(word != "tts") %>%
  filter(word != "afm") %>%
  filter(word != "fd") %>%
  filter(word != "wealthfront") %>%
  filter(word != "apron") %>%
  filter(word != "cgpgrey") %>%
  mutate(word = factor(word, levels = rev(unique(word))))

podcast_plot_clean %>%  
  group_by(podcast) %>% 
  top_n(10) %>% 
  ungroup() %>% 
  ggplot(aes(word, tf_idf, fill = podcast)) +
  geom_col(show.legend=FALSE) +
  facet_wrap(~podcast, scales = "free") +
  coord_flip() +
  labs(x = "tf-idf",
       title = "Most Important Words by Podcast",
       subtitle = "Measured by tf-idf") +
  scale_fill_manual(values = c("blue4", "grey18",
                               "palegreen4", "lightsteelblue4"))+
  theme(legend.position = "none",
        axis.title.y = element_blank())

clean podcast plot

This plot provides a much better understanding of what makes each of these podcasts unique:

  • ATP is clearly a very technical show. From API’s to PHP, Casey, Marco, and John spend a lot of time in the weeds of technical issues. If it were one word, “file system” (🛎) would have been on this plot.
  • Cortext, a show about the working habits of Myke Hurley and CGP Grey, is well-represented on this plot. Productivity topics come through clearly (coworking, Trello, Todoist, Amsterdam workcations), as well as their involvement in the YouTube community (Pewdiepie, vlog(s), vidcon, patreon). Grey’s frequent “mhm”-ing also came through clearly via speech-to-text.
  • Hypercritical’s most important words are the most complex, a reflection of the preparation and intentionality John Siracusa brought to this show. “Rumination” is the perfect word to emerge as the most important for Hypercritical.2
  • The Talk Show’s words are too perfect. John Gruber’s son Jonas takes the top spot, followed by ⚾️ (his beloved Yankees take the 10 spot). Two of the words (Vesper and bourbon) can be represented by cocktail 🍸🥃 emoji, which I’m sure would make him proud. I’m glad “dingus” and Han Solo made the list too. “Iowa” and “bomber” made me scratch my head, but once I looked at the transcripts, it was obvious this is how the speech-to-text interpreted how Gruber pronounces “iOS” and (Steve) “Ballmer”.

This was a pretty fun side project and there’s still a ton of data to analyze. I’ll likely write some additional posts on this front, but unti then, follow this repo to see what I’m working on.

  1. Never while I code. I can’t code and listen to vocal music. 

  2. 28 of Hypercritical’s 100 episodes are still processing through the speech-to-text algorithm, so they were unavailable for this analysis.