When I tell people that I work with speech recognition, they sometimes ask, “Like Siri?” Or they tell me a phone call horror story with interactive voice response. (“Reservations.” “I’m sorry, I didn’t hear you.” “RESERVATIONS!!”) But as one of my professors was fond of saying, “Speech recognition is just the preprocessing.” What does that mean? We must turn sounds into information and meaning.

In the case of Siri or a call center, after speech is converted to text, the system must then figure out (1) what the user means, and (2) how to respond appropriately. Here at Audiosear.ch, we use speech recognition output to peer inside podcasts and find out all kinds of information about them, from the topics they cover to the people and places they mention. This post, though, is all about the preprocessing, and how we get from abstract sounds to something more tangible from which we can derive meaning and information.

The speech recognition pipeline

As with most tasks in natural language processing, speech recognition comes down to probabilities.

We start out with just an audio file. Given the information contained inside that file, our task is to generate the transcript that most likely corresponds to the words being spoken. In this post, we’ll take a peek inside the black box to see what happens in that process from audio input to text output, including the basic components of speech recognition models and how the numbers are crunched. (No math background necessary to read on!)


All of our speech recognition magic happens thanks to a fantastic open-source toolkit called Kaldi. (Because I know you’re wondering, “According to legend, Kaldi was the Ethiopian goatherder who discovered the coffee plant.” So says the Kaldi website, although unfortunately their documentation does not include information about the connection between coffee and speech recognition. For that, you’ll have to wait until my next post. But back to the matter at hand.) Not only does Kaldi provide access to numerous pre-trained speech recognition models, but it also has all the tools you need to train your own model and automatically transcribe audio files.

As an audio file is processed, it’s broken down into short segments, called utterances, based on silences. For each utterance Kaldi processes, it maintains an “n-best” list, a running tally of the n most likely sentences that correspond to the audio it has processed up to that point. As utterances are processed, the potential hypotheses at the bottom of the list get “pruned out”. At the beginning of the audio clip when there is little information, it’s possible that the hypotheses will look very different from each other. As more of the audio is processed, however, the model decides not to pursue less likely avenues. By the end, the hypotheses may look fairly similar to each other. This example comes directly from one of my experiments.

. . . good pitching i had a life . . .

. . . the teaching i had a life . . .

. . . good pitching i had a wife . . .

. . . good pitching i get a life . . .

. . . the teaching i had a wife . . .

. . . good pitching i get a life . . .

. . . good pitching i am life . . .

. . . the teaching at life . . .

. . . good teaching and life . . .
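To make the n-best bookkeeping concrete, here’s a minimal sketch in plain Python (with invented scores, and nothing like Kaldi’s real decoder): at each step, every hypothesis is extended with every candidate word, and the list is pruned back to the n highest-scoring hypotheses.

```python
import heapq

def extend_nbest(hypotheses, word_scores, n=3):
    """Extend each hypothesis with each candidate word, keep the n best.

    hypotheses: list of (log_prob, text) pairs
    word_scores: dict mapping candidate words to log probabilities
    """
    candidates = [
        (score + word_score, f"{text} {word}".strip())
        for score, text in hypotheses
        for word, word_score in word_scores.items()
    ]
    # Prune: keep only the n highest-scoring hypotheses
    return heapq.nlargest(n, candidates)

# Toy example: two decoding steps with made-up word scores
nbest = [(0.0, "")]
nbest = extend_nbest(nbest, {"good": -0.5, "the": -1.0}, n=2)
nbest = extend_nbest(nbest, {"pitching": -0.4, "teaching": -0.9}, n=2)
for score, text in nbest:
    print(f"{score:.1f}  {text}")
```

Because log probabilities are added at every step, a hypothesis that falls behind early rarely recovers, which is why the surviving candidates converge as more audio is processed.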


Now let’s take a look at where that list comes from. The speech recognition model has two main components: the acoustic model and the language model.

Acoustic model

The first step in the speech recognition process is to numerically represent the audio. Speech sounds are a complex mixture of different frequencies, kind of like musical chords. Different sounds in a language have different combinations of pitches, which can be visually represented with what are called spectrograms.

Here’s what it looks like when I say the words “Pop Up Archive”. The x-axis represents time and the y-axis represents frequency.

The numeric representation is generated by chopping each audio segment into tiny overlapping frames (typically around 25 milliseconds long) and performing loads of calculations on each one. Of course, we don’t all produce sounds the same exact way, nor does each individual pronounce the same sound identically each time they say it, so translating numbers into speech sounds like “P” isn’t a trivial task. The job of the acoustic model is to calculate which sequences of sounds most likely correspond to the frequencies of the audio.
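For a rough feel of that first step, here’s a sketch in pure Python (with a naive DFT; real systems use FFT libraries and further transforms such as mel filtering) of how frame-by-frame frequency magnitudes, the raw material of a spectrogram, can be computed:

```python
import cmath
import math

def spectrogram(samples, frame_size=64, hop=32):
    """Return one row of |DFT| magnitudes per frame of the signal."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # Naive discrete Fourier transform of one frame
        magnitudes = [
            abs(sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                    for n, x in enumerate(frame)))
            for k in range(frame_size // 2)  # keep non-redundant bins
        ]
        frames.append(magnitudes)
    return frames

# A pure tone with 4 cycles per frame: its energy lands in one bin
tone = [math.sin(2 * math.pi * 4 * n / 64) for n in range(256)]
spec = spectrogram(tone)
peak_bin = max(range(32), key=lambda k: spec[0][k])
print(peak_bin)  # the sinusoid's frequency shows up as bin 4
```

Real speech mixes many frequencies at once, so each frame produces a whole column of magnitudes rather than a single spike; stacking those columns over time gives the spectrogram picture above.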

Language model

Ultimately, we need more than sounds — we need words. The language model represents which sequences of words (called n-grams) are more likely than others. For example, the phrase “Lakshmi Singh” is more likely to occur in our language model than “Lakshmi discusses”.

We have built our own language model, tailored to the podcast domain, generated by combing through millions of words of podcast and public radio data. The counts of words and phrases are ultimately turned into probabilities, like this:

-2.00101        i       -2.33212
-3.41548        thought -1.44315
-3.69715        knew    -1.36077
-2.43248        what    -1.66517
-4.57976        cheese  -0.767902
-2.15657        was     -1.6919

-0.601123       rooseveltian things
-0.601792       rooseveltian past
-0.857688       proffering and
-0.865686       proffering a
-0.901096       proffering those
-0.90307        proffering genuine
-0.299807       meningoencephalitis an
-0.829036       wcpn’s david
-0.653185       wcpn’s sarah
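The first column in those listings is a base-10 log probability (in the unigram table, the third column is a back-off weight, as in the standard ARPA language-model format). Here’s a toy sketch of how raw counts become those probabilities, using a six-word corpus and a maximum-likelihood bigram estimate (illustrative only; real toolkits add smoothing and back-off):

```python
import math
from collections import Counter

corpus = "the teaching the pitching the teaching".split()

# Count bigrams and their one-word histories
bigrams = Counter(zip(corpus, corpus[1:]))
histories = Counter(corpus[:-1])

# Maximum-likelihood estimate: P(w2 | w1) = count(w1 w2) / count(w1)
def log10_prob(w1, w2):
    return math.log10(bigrams[(w1, w2)] / histories[w1])

# "the teaching" occurs 2 of the 3 times "the" appears
print(round(log10_prob("the", "teaching"), 4))
```

The same idea scales up to the millions of words of podcast and public radio text in our corpus; storing log probabilities rather than raw probabilities keeps the repeated multiplications during decoding numerically stable.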


Another axiom that I heard repeated in grad school was “There is no data like more data.” Consider this the Google approach: amass as much data as possible, on the assumption that more data will yield better results. But there’s a potentially competing principle, that of obtaining data that fits your domain (in our case, podcasting). In fact, in our experiments our smaller, domain-specific language model performed better than the larger, more generic Google model.


The final probability calculations for the transcript hypotheses of an utterance are based on a combination of the results from the acoustic model and the language model. The sounds and words are linked together by what’s called the lexicon, which lists the words in the vocabulary along with their phonetic pronunciations. That means that the Kaldi output is made up entirely (and exclusively) of words from the lexicon. Our lexicon starts out as a plain text file, with entries like the ones you see below that specify both spelling and pronunciation.

archive AA R K AY V

hello HH AX L OW

hello HH EH L OW

pop P AA P

world W ER L D

up AH P
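A file like this is straightforward to parse. Here’s a minimal sketch (plain Python, not Kaldi’s own tooling) that maps each word to its list of pronunciations, allowing multiple entries per word, as with “hello” above:

```python
from collections import defaultdict

LEXICON_TEXT = """\
archive AA R K AY V
hello HH AX L OW
hello HH EH L OW
pop P AA P
world W ER L D
up AH P
"""

def parse_lexicon(text):
    """Map each word to a list of pronunciations (phone sequences)."""
    lexicon = defaultdict(list)
    for line in text.splitlines():
        word, *phones = line.split()
        lexicon[word].append(phones)
    return dict(lexicon)

lexicon = parse_lexicon(LEXICON_TEXT)
print(lexicon["hello"])  # two pronunciations, one per lexicon line
```

Allowing several pronunciations per word matters in practice: speakers reduce vowels, drop consonants, and vary by dialect, and the decoder should be able to match any of those realizations to the same written word.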

A lexicon is a living, breathing thing that changes based on current events, culture, and more. For example, before 2015 we didn’t need the word “Ivanka” in our lexicon — but as our political landscape changed, it became essential to add it.

Each month we collect new text for the language model, searching through the text for new words that are not yet in our dictionary. Recent additions include: autocorrect, crowdsourced, cis, fomo, gamergate, hinglish, islamophobic, livestreaming, mcboatface, microcephalic, quvenzhane.

Kaldi pipeline, continued

We use two models in our Kaldi pipeline. The baseline model pairs a bigram language model (LM), meaning one based on sequences of one or two words, with the acoustic model (AM). That gives us our initial n-best lists. After that, we use a 5-gram language model (without the AM) to re-score our n-best lists. That is, the probability of each sentence in the n-best list is recalculated based on sequences of one to five words, rather than just one or two.

After re-scoring, the top hypothesis for each utterance is chosen to give us the final transcript. By using this re-scoring method rather than one big model, we cut down on the initial processing time while still maintaining accuracy.
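Schematically, re-scoring looks something like this sketch (invented scores and a stand-in LM function, not Kaldi’s actual interface): each hypothesis keeps its acoustic score, swaps in a score from the stronger language model, and the list is re-ranked.

```python
def rescore(nbest, better_lm, lm_weight=1.0):
    """Re-rank an n-best list using a stronger language model.

    nbest: list of (acoustic_score, old_lm_score, text)
    better_lm: function mapping text to a new LM log probability
    """
    rescored = [
        (am + lm_weight * better_lm(text), text)
        for am, _old_lm, text in nbest
    ]
    return sorted(rescored, reverse=True)

# Stand-in 5-gram LM: prefers "good pitching" (scores are made up)
def toy_lm(text):
    return -0.2 if "good pitching" in text else -0.8

nbest = [
    (-10.0, -0.5, "the teaching i had a life"),
    (-10.1, -0.6, "good pitching i had a life"),
]
best_score, best_text = rescore(nbest, toy_lm)[0]
print(best_text)
```

The point of the two-pass design is visible here: the expensive model only ever scores the handful of sentences that survived the first pass, not every path the decoder could have taken.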

Neural nets: the wave of the future?

For a long time it was standard to use what are called Gaussian Mixture Models for the acoustic model, and n-grams for the language model. All of that is now being shaken up by the rise of neural networks. Like other kinds of models, neural nets can calculate the most probable output given an input. With neural nets, the input (which must be represented numerically) travels through one or more layers of nodes. Calculations are performed between each layer until we arrive at the final output layer.  Kaldi supports neural net acoustic models, which vastly improve accuracy.
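By way of illustration, here is what “calculations between each layer” amounts to, in a toy two-layer network with made-up weights (vastly smaller than any real acoustic model): each layer multiplies its inputs by a weight matrix, adds a bias, and applies a nonlinearity.

```python
import math

def layer(inputs, weights, biases):
    """One layer: weighted sums followed by a sigmoid nonlinearity."""
    return [
        1 / (1 + math.exp(-(sum(w * x for w, x in zip(row, inputs)) + b)))
        for row, b in zip(weights, biases)
    ]

# Tiny network: 3 inputs -> 2 hidden nodes -> 2 outputs
features = [0.2, -0.1, 0.4]           # e.g. one frame's acoustic features
hidden = layer(features, [[0.5, -0.3, 0.8], [0.1, 0.9, -0.2]], [0.0, 0.1])
output = layer(hidden, [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0])
print([round(o, 3) for o in output])  # scores over (toy) sound classes
```

Training consists of nudging those weights so that, over many example frames, the output scores line up with the correct sounds; the layered structure is what lets the network pick up patterns that a single linear transformation can’t.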

For LMs, the advantages of neural nets aren’t as clear. Baseline Kaldi LMs must still be n-gram based. However, it is now possible to do the re-scoring process with what’s called a recurrent neural network language model (RNNLM). Without getting too technical, RNNLMs are theoretically way more intelligent than traditional n-gram models because they build up an understanding of word similarities. If you’re inclined to dive deeper into the reasons why, I highly recommend this paper by Mikolov et al. Indeed, they show that RNNLMs improve word error rate (WER), the standard metric for speech recognition accuracy. However, there’s a cost: it takes significantly more time and computing power to train RNNLMs, while the improvements are only slight.

Our language is ever-evolving, and podcasts capture how people actually speak. Using n-gram models allows us to update frequently and keep up with changing styles and the latest newsmakers. Once we’ve generated the output, we can move on to the work of identifying the people and places mentioned, thereby allowing listeners to discover podcasts they care about.