Over the last several years, the podcast space has grown to encompass a wide and varied range of content. On iTunes, there are nearly 300,000 shows; it’s easy for the casual listener to get lost in all the options. Here at Audiosear.ch we’ve made it our mission to chart and map out the world of podcasts by looking at different ways of connecting, organizing, and uncovering good shows.
To do this, we need to know what topics they cover, but how can we automatically determine what a show or episode is actually about? Show names and descriptions provide some clues, but can be uninformative and incomplete. What if we could actually look inside the episode to see what is being discussed? The most useful piece of data to determine what a podcast is about is the episode transcript, which we’re able to create using our speech-to-text engine.
Extracting topics from transcripts using machine learning
A useful way to organize transcripts is by topic. A topic can be as broad as “Sports” or as specific as “Mobile Tech.” Exploring transcripts by topic provides episode-level context about what is discussed and enables us to make connections between related content.
To extract topics from podcasts, I used a machine-learning technique named LDA (Latent Dirichlet Allocation) topic modeling. Given a set of text documents, LDA automatically discovers the topics in each one. Each topic can be thought of as having a set of words that are most likely associated with it, and each document is a mixture of multiple topics. Here’s an example:
“I’m going to cook steak and eggs for dinner and then watch the Dallas Cowboys game.”
LDA might represent this sentence as 60% about “Food” (cook, steak, eggs, dinner) and 40% about “NFL Football” (watch, Dallas Cowboys, game).
We trained our LDA model on 40,000 podcast transcripts, and LDA gives us the freedom to determine how many topics to generate from that data. We experimented with the number of topics and discovered that there is a sweet spot. Generating 10 topics led to topics that were too broad and uninformative, but generating 1,000 topics led to topics being too specific (most shows ended up in their own topic). Eventually, we settled on ~80 useful topics that felt cohesive and informative.
LDA groups related words together into clusters, but it’s up to us to label (name) them. Here’s a group of words that LDA clustered together (the number next to each word is representative of how ‘important’ and ‘central’ that word is to the specific topic):
0.059*”song” + 0.056*”music” + 0.030*”record” + 0.021*”band” + 0.020*”album” + 0.011*”rock” + 0.009*”playing” + 0.009*”artist” + 0.007*”guitar” + 0.007*”played” + 0.006*”pop” + 0.006*”musician” + 0.006*”musical” + 0.006*”singer” + 0.005*”studio” + 0.005*”singing” + 0.005*”track” + 0.005*”sing” + 0.005*”voice” + 0.005*”recording” + 0.004*”lyric” + 0.004*”fan” + 0.004*”jazz” + 0.003*”favorite” + 0.003*”opinion”
Any human can look at this and immediately know that the topic is Music.
The LDA model generated the topics Music and Photography, but we manually created the parent topic Arts to cluster them together.
Some topics address tone more than content. We noticed certain types of language getting clustered together. Here’s some examples:
- Colloquial: top words were 0.010*”crazy” + 0.009*”weird” + 0.007*”anyway” + 0.007*”cool” + 0.006*”literally” + 0.006*”totally” + 0.006*”bunch” + 0.006*”funny” + 0.005*”fine” + 0.005*”super” + 0.005*”hate” + 0.005*”sorry” + 0.005*”amazing” + 0.005*”wow” + 0.004*”terrible” + 0.004*”awesome” + 0.004*”stupid” + 0.003*”worst”
- NSFW: top words were a lot of f-bombs and four-letter words
Episode topics, show topics, and topic variance
After training our LDA model with ~80 topics, we can feed each episode transcript through LDA to determine its respective topic scores. Each episode gets assigned multiple topics, and we adjust each topic scores to range from 0 to 1.
For example, for episode one of Serial’s first season (Serial S01: Episode 01: The Alibi) LDA extracts the following episode topics.
- Legal 1.0
- Crime 1.0
- Dating 0.861680079464
- Relationships 0.723518333942
- Colloquial 0.575297943928
Here’s another example from a TEDTalks episode (TEDTalks Help for kids the education system ignores | Victor Rios) where LDA extracted the following episode topics:
- Education 1.0
- Communication 0.919646209148
- Judicial 0.784040993871
- Relationships 0.765608877988
- Legal 0.712575344157
- Race 0.351881683509
- Self-Improvement 0.340409492789
If we average the topic scores over all the episodes in the TEDTalks show, we end up with the topic scores for the entire show.
- Communication 0.499088435931
- Relationships 0.493147301275
- Science 0.410203534659
- News / Politics 0.276019845137
- Tech 0.217786957268
- Outdoors 0.163537074365
- Family 0.148669689489
- Scientific Research 0.148416108352
- History 0.14799191756
- International Politics 0.131679947746
- Judicial 0.1214698759
- Opinion 0.120062884806
- Nature 0.11503662601
- Brain Science 0.1125597721
- Legal 0.111055982075
- Anatomy 0.104492181047
Having the topic distributions for each episode allows us to look at the how much a show’s topic varies across each episode. As you can see, the TEDTalks series covers a huge variety of topics, so there is significant topic variance between each episode. There are plenty of shows that fit into this category of talking about different things; listen if you want to be pleasantly surprised by what you hear.
Some other shows with high topic variance include:
- CBS Radio News — Average Scores (Social Media: 0.470558 | Tech: 0.411105 | )
- Comedy Central — Average Scores (Colloquial: 0.474701 | Comedy: 0.456002 | Entertainment: 0.410230 | Relationships: 0.331285 | )
- NPR — Average Scores (News / Politics: 0.533575 | International Politics: 0.326380 | Radio: 0.324077 | )
- TEDTalks — Average Scores (Communication: 0.499088 | Relationships: 0.493147 | Science: 0.410204 | )
- The New Yorker — Average Scores (Opinion: 0.375808 | Literature: 0.326533 | )
- BrainStuff — Average Scores (Science: 0.773600 | Scientific Research: 0.495239 | Anatomy: 0.406278 | Brain Science: 0.382706 | )
Shows with low topic variance always talk about the same topics — these are shows to listen to if you know exactly what you want to listen to. For example:
- CBSSports.com Fantasy Football Today Podcast — Average Scores (NFL: 1.000000 | Sports: 1.000000 | )
- Sex With Emily — Average Scores (Relationships: 1.000000 | Sex: 0.988531 | Dating: 0.839643 | Colloquial: 0.373146 | )
- Car Talk — Average Scores (Automotive: 0.747499 | Colloquial: 0.553018 | )
We’re looking at building this topic variance metric into our API to be used as a feature for podcast recommendations.
Finding related episodes and related shows
Topics are a critical tool in determining what shows or episodes are related to each other, and therefore driving recommendations. We can use topics to recommend similar content by relying on the fact that similar episodes have similar topic scores. Here’s a map that places similar episodes close to each other. Each episode is colored by its top topic, where each topic has its own color.
Hover around different areas of the map to see which episodes are related.
Episodes clustered by topic
To create this visualization I used t-SNE, which is a dimensionality reduction technique that can shrink our 80-dimensional topic space into a 2-dimensional representation. t-SNE groups similar episodes together by considering the scores of all the topics. On the top left is Sportsballs Island (as I call it — true category name: Sports), and News/Politics is represented in the bottom left. Also check out Eating in pink on the right, and Arts in purple above it, or maybe you wanna get slammed with some Wrestling in light blue at the very very top.
In the same manner, here’s a map that places similar shows together. Hover around and discover new shows you might want to check out.
Shows clustered by topic
Topics don’t just have to be used to uncover similarity between podcasts; we can also look at the relationship between topics themselves. How often do shows that talk about Sex discuss Dating? Which sports podcasts cover the NBA versus the NFL? How often is Race discussed along with Guns/Police? How Colloquial is the speech in certain Sports podcasts?
By graphing one topic against another, we can visualize how each show scores on both topics and determine how correlated two topics are.
Use the drop-down menus at the top to change topics and explore how shows balance the discussion of different topics. Each dot is a different show — hover over it to get show info. You can also use your mouse to zoom in by drawing a window over a section you want to see in more detail.
Using topics for recommendations
If a listener only listens to sports podcasts, does that mean they are only interested in sports, or do they just not know what other types of content is out there? We don’t always want to be recommended content that is nearly the same as what we were just listening to — variety in recommendations is key in keeping a listener’s attention. Topic modeling provides useful data for making more informed podcast recommendations.
Using episode topic and show topic metadata, you could build a podcast recommendation engine that varies some topics while keeping others the same. If a listener likes several episodes that discuss International Politics, for example, we could recommend episodes that discuss both International Politics and Domestic Politics. Or perhaps episode topic metadata could reveal that a listener cares more about tone (NSFW or Colloquial language) more than content. Imagine keeping one topic dimension that a listener loves (Relationships) while shifting another to a related topic (Fitness -> Self-Improvement).
Topic modeling provides an intuitive way of making sense of the content and tone of podcasts. By wrapping this data into our API, Audiosear.ch is starting to generate more intuitive podcast recommendations that can surprise listeners by providing content they didn’t know they wanted to hear and will soon love.