Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer
Author: Slava M. Katz
“Some changes look negative on the surface but you will soon realize that space is being created in your life for something new to emerge”
~Eckhart Tolle
Abstract
m-grams are word sequences of length m which are used in language modeling and speech recognition. Since many possible sequences appear rarely, it is difficult to estimate their probabilities accurately using standard methods like Maximum Likelihood Estimation (MLE), which assigns zero probability to unseen sequences. Katz introduces a novel, computation- and memory-efficient language model that uses a nonlinear recursive procedure to estimate probabilities from sparse data.[1]
Introduction
MLE cannot assign sensible probabilities to m-grams that do not appear in the training set. Likewise, many sequences, known as singletons, appear only once, which makes their estimates unreliable. Under MLE, the probability of an event is its observed relative frequency, so an event that never appears in the dataset receives a probability of zero, which is obviously problematic.
Katz takes some probability mass from observed events (especially singletons) and reallocates it to unseen events using renormalized Turing's estimates. Instead of directly using training frequency counts, Turing's estimate adjusts the counts based on patterns observed in low-frequency words. If an m-gram only appears once or a few times, its frequency is reduced and the freed probability mass is given to unseen m-grams.
Mathematical Basis
Let's start with a few variable definitions, where,
- \( N \) is the total number of words (or m-grams) in a sample text
- \( n_r \) is the number of distinct m-grams that occur exactly \( r \) times
The total count is obtained by summing each frequency \( r \) weighted by the number of m-grams that occur that often,
\[ N = \sum_r r n_r \]
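As a concrete illustration, here is a minimal Python sketch (with a made-up list of bigrams) of how the count-of-counts \( n_r \) and the total \( N \) could be tallied:

```python
from collections import Counter

# A toy list of bigrams (m = 2); in practice these would come from training text.
bigrams = [("the", "cat"), ("the", "dog"), ("the", "cat"),
           ("a", "dog"), ("a", "cat"), ("the", "mat")]

c = Counter(bigrams)                       # c(x): count of each observed m-gram
n = Counter(c.values())                    # n_r: number of distinct m-grams seen exactly r times
N = sum(r * n_r for r, n_r in n.items())   # N = sum_r r * n_r

print(n)   # Counter({1: 4, 2: 1}) -> four singletons, one m-gram seen twice
print(N)   # 6, which equals len(bigrams)
```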
We now want to adjust the observed counts to compensate for underrepresented words using Turing discounting. First, define the ratio \( \alpha \) of adjacent count-of-counts, which reflects how likely an m-gram seen \( r \) times is to be seen one more time in new data,
\[ \alpha = \frac{n_{r+1}}{n_r} \]
The Turing-adjusted frequency \( r^* \) applies this ratio \( \alpha \) to the next count \( r + 1 \),
\[ r^* = \alpha(r + 1) = \frac{n_{r+1}}{n_r}(r + 1) \]
To find the probability \( P_T \) of an m-gram appearing \( r \) times, divide this frequency count \( r^* \) by the total count of m-grams \( N \),
\[ P_T = \frac{r^*}{N} = \frac{n_{r+1}}{n_r N}(r + 1) \]
Now, we are concerned with discounting—the process of replacing raw frequency counts \( r \) with adjusted counts \( r^* \). The discount coefficient \( d_r \) measures how much each observed count is scaled down, freeing probability mass that can be redistributed to unseen m-grams,
\[ d_r = \frac{r^*}{r} \]
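To make these quantities concrete, the sketch below computes \( r^* \), \( P_T \), and \( d_r \) from a hypothetical count-of-counts table; the values of \( n_r \) are illustrative, not taken from the paper:

```python
# Hypothetical count-of-counts n_r for a training set (illustrative values only).
n = {1: 4000, 2: 1200, 3: 600, 4: 350, 5: 230, 6: 160}
N = sum(r * n_r for r, n_r in n.items())

def turing_adjusted(r, n):
    """r* = (r + 1) * n_{r+1} / n_r, the Turing-adjusted frequency."""
    return (r + 1) * n[r + 1] / n[r]

for r in range(1, 6):
    r_star = turing_adjusted(r, n)
    P_T = r_star / N   # Turing probability estimate for an m-gram seen r times
    d_r = r_star / r   # discount coefficient d_r = r*/r
    print(f"r={r}  r*={r_star:.3f}  P_T={P_T:.6f}  d_r={d_r:.3f}")
```

Note how \( d_r < 1 \) for small \( r \): the rare m-grams give up part of their probability mass.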
The MLE formula for estimating the probability of an m-gram \( w_1^m \) is the count of that specific m-gram \( c(w_1^m) \) normalized by the total number of observed m-grams,
\[ P_{ML} = \frac{c(w_1^m)}{N} \]
It is more useful to work with the adjusted frequency \( r^* \) directly to produce a better estimate. So far \( r \) has denoted a generic frequency; without loss of generality, Katz writes \( c(x) \) for the count of a specific m-gram \( x \). By the same reasoning as above, the ratio \( \alpha_c \) and the Turing-adjusted count \( c^*(x) \) are,
\[ \alpha_c = \frac{n_{c(x)+1}}{n_{c(x)}} \]
\[ c^*(x) = \alpha_c(c(x) + 1) = \frac{n_{c(x)+1}}{n_{c(x)}}(c(x) + 1) \]
The total Turing probability assigned to observed m-grams is slightly less than 1; the freed probability mass \( \frac{n_1}{N} \) is intentionally left unassigned so that it can cover unseen m-grams. Here, \( n_1 \) is the number of singletons,
\[ \sum_{w_1^m : c(w_1^m) > 0} P_T(w_1^m) = 1 - \frac{n_1}{N} \]
Conversely, the estimated probabilities of all unseen m-grams sum to exactly this freed probability mass,
\[ \sum_{w_1^m : c(w_1^m) = 0} P_T(w_1^m) = \frac{n_1}{N} \]
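A quick numerical check of this identity, reusing the illustrative count-of-counts from above and assuming (as a toy simplification) that no m-gram occurs more than six times:

```python
# Illustrative count-of-counts; n_r is taken as zero for r > 6.
n = {1: 4000, 2: 1200, 3: 600, 4: 350, 5: 230, 6: 160}
N = sum(r * n_r for r, n_r in n.items())

# Total Turing mass of observed m-grams: sum over r of n_r * r* / N,
# with r* = (r + 1) * n_{r+1} / n_r, so each term reduces to (r + 1) * n_{r+1}.
observed_mass = sum((r + 1) * n.get(r + 1, 0) for r in n) / N

print(observed_mass)   # 7710 / 11710 ~= 0.6584
print(1 - n[1] / N)    # identical: the missing mass is n_1 / N
```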
This can also be shown through the difference between MLE and Turing's estimate for observed m-grams. The discount correction \( \delta_c \) defines how much probability is removed from each observed m-gram,
\[ \delta_c = \frac{c}{N} - \frac{c^*}{N} = (1 - d_c) \frac{c}{N} \]
Katz notes that the total difference between the two estimates is,
\[ \sum_{w_1^m : c(w_1^m) > 0} \left[ P_{ML}(w_1^m) - P_T(w_1^m) \right] = \sum_{w_1^m : c(w_1^m) > 0} \delta_c = \sum_{r > 0} n_r \left(1 - d_r\right) \frac{r}{N} = \frac{n_1}{N} \]
\[ \sum_{w_1^m : c(w_1^m) > 0} \delta_c = \sum_{w_1^m : c(w_1^m) = 0} P_T(w_1^m) \]
Katz notes that the conditional case holds a similar form,
\[ \delta^{(cond)}(c(w_1^m)) = \left(1 - d_c^{(w_1^m)}\right) \frac{c(w_1^m)}{c(w_1^{m-1})} \]
The conditional probability of a word \( w_m \) given a preceding context \( w_1^{m-1} \), when the full m-gram has been observed in the training data (i.e. \( c(w_1^m) > 0 \)), is the conditional MLE ratio scaled down by the discount coefficient \( d_{c(w_1^m)} \),
\[ P_{ML}(w_m | w_1^{m-1}) = \frac{c(w_1^m)}{c(w_1^{m-1})} \]
\[ P_s(w_m | w_1^{m-1}) = d_{c(w_1^m)} \frac{c(w_1^m)}{c(w_1^{m-1})} \]
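As an illustration, here is a small sketch of the discounted estimate for an observed m-gram, using hypothetical trigram and bigram counts and a made-up table `d` of discount coefficients keyed by count:

```python
# Hypothetical counts and discount coefficients (all values illustrative).
trigram_counts = {("i", "want", "to"): 3, ("i", "want", "a"): 1}
bigram_counts = {("i", "want"): 5}
d = {1: 0.45, 2: 0.62, 3: 0.75}   # d_r for small r

def p_s_observed(w1, w2, w3):
    """Discounted conditional estimate d_c * c(w1 w2 w3) / c(w1 w2) for a seen trigram."""
    c_full = trigram_counts[(w1, w2, w3)]
    c_ctx = bigram_counts[(w1, w2)]
    return d[c_full] * c_full / c_ctx

print(p_s_observed("i", "want", "to"))   # 0.75 * 3/5 = 0.45
print(p_s_observed("i", "want", "a"))    # 0.45 * 1/5 = 0.09
```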
Katz then defines \( \beta \), the leftover conditional probability mass reserved for words \( w_m \) that have not been observed after the context \( w_1^{m-1} \), as,
\[ \beta(w_1^{m-1}) = \sum_{w_m : c(w_1^m) > 0} \delta^{(cond)}(w_1^m) = 1 - \sum_{w_m : c(w_1^m) > 0} d_{c(w_1^m)} \frac{c(w_1^m)}{c(w_1^{m-1})} \]
The conditional probability of \( w_m \) when the full m-gram \( w_1^m \) has not been observed is estimated as the product of a scaling factor \( \alpha \) and the conditional probability given the shorter context \( w_2^{m-1} \). Mathematically, this relationship is,
\[ P_s(w_m | w_1^{m-1}) = \alpha P_s(w_m | w_2^{m-1}) \]
The scaling factor must distribute exactly the leftover mass \( \beta(w_1^{m-1}) \) over the lower-order probabilities of the unseen continuations. The scaling factor \( \alpha \) is then,
\[ \alpha(w_1^{m-1}) = \frac{\beta(w_1^{m-1})}{\sum_{w_m : c(w_1^m) = 0} P_s(w_m | w_2^{m-1})} = \frac{1 - \sum_{w_m : c(w_1^m) > 0} P_s(w_m | w_1^{m-1})}{1 - \sum_{w_m : c(w_1^m) > 0} P_s(w_m | w_2^{m-1})} \]
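A sketch of how \( \beta \) and \( \alpha \) could be computed for a single context, using placeholder values for the discounted higher-order and lower-order estimates of the observed continuations:

```python
# Discounted estimates for the words observed after the full context w_1^{m-1}
# and, for the same words, after the shorter context w_2^{m-1} (placeholder values).
p_s_high = {"to": 0.45, "a": 0.09}   # P_s(w_m | w_1^{m-1}) for observed m-grams
p_s_low = {"to": 0.30, "a": 0.10}    # P_s(w_m | w_2^{m-1}) for the same words

# Leftover mass for the full context.
beta = 1.0 - sum(p_s_high.values())

# Back-off weight: beta divided by the lower-order mass left for the unseen words.
alpha = beta / (1.0 - sum(p_s_low.values()))

print(beta, alpha)   # 0.46 and 0.46 / 0.6 ~= 0.767
```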
If the higher-order context \( w_1^{m-1} \) has never been observed, the model uses the lower-order estimate \( P_s(w_m | w_2^{m-1}) \) without scaling,
\[ P_s(w_m | w_1^{m-1}) = P_s(w_m | w_2^{m-1}) \]
Consider an indicator function \( \theta \) that switches the back-off term on only when the higher-order estimate contributes nothing,
\[ \theta(x) = \begin{cases} 1, & \text{if } x = 0 \\ 0, & \text{otherwise.} \end{cases} \]
Writing \( \hat{P}(w_m | w_1^{m-1}) = d_{c(w_1^m)} \frac{c(w_1^m)}{c(w_1^{m-1})} \) for the discounted higher-order estimate (zero when \( c(w_1^m) = 0 \)), the complete recursive estimate becomes,
\[ P_s(w_m | w_1^{m-1}) = \hat{P}(w_m | w_1^{m-1}) + \theta(\hat{P}(w_m | w_1^{m-1})) \alpha(w_1^{m-1}) P_s(w_m | w_2^{m-1}) \]
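Putting the pieces together, here is a minimal sketch of the recursion for a bigram model that backs off to raw unigram relative frequencies; the toy corpus, the unigram fallback, and the handling of missing count-of-counts are simplifications for illustration, not Katz's exact procedure:

```python
from collections import Counter

# Toy corpus used only to exercise the recursion.
words = "the cat sat on the mat the cat ate the fish".split()

unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))
N = len(words)

def discount(r, counts):
    """Turing-style discount d_r = r*/r from count-of-counts; treated as 1 (no discount)
    when n_r or n_{r+1} is unavailable."""
    n = Counter(counts.values())
    if n[r] == 0 or n[r + 1] == 0:
        return 1.0
    return (r + 1) * n[r + 1] / (r * n[r])

def p_katz(w, prev):
    """P_s(w | prev): discounted MLE for seen bigrams, scaled unigram back-off otherwise."""
    c_big = bigrams[(prev, w)]
    c_ctx = unigrams[prev]
    if c_big > 0:
        return discount(c_big, bigrams) * c_big / c_ctx
    # Unseen bigram: leftover mass beta(prev) and back-off weight alpha(prev).
    seen = [v for (u, v) in bigrams if u == prev]
    beta = 1.0 - sum(discount(bigrams[(prev, v)], bigrams) * bigrams[(prev, v)] / c_ctx
                     for v in seen)
    alpha = beta / (1.0 - sum(unigrams[v] / N for v in seen))
    return alpha * unigrams[w] / N

print(p_katz("cat", "the"))   # seen bigram ("the", "cat"): discounted MLE
print(p_katz("sat", "the"))   # unseen bigram, backed off to the unigram estimate
```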
In the case where the context itself has never been observed (\( c(w_1^{m-1}) = 0 \)), the discounted higher-order estimate is defined to be zero and the entire conditional mass is left for backing off,
\[ \hat{P}(w_m | w_1^{m-1}) = 0 \]
\[ \beta(w_1^{m-1}) = 1 \]
Katz then extends the discussion to include scenarios in which m-grams with higher counts \( c > k \) are reliable and do not need to be discounted,
\[ d_r = 1, \text{ for } r > k \]
In this case, the requirement that the discounts free the same total mass \( \frac{n_1}{N} \) now involves only the counts \( 1 \leq r \leq k \),
\[ \sum_{w_1^m : c(w_1^m) > 0} \delta_c(w_1^m) = \sum_{1 \leq r \leq k} n_r \left(1 - d_r\right) \frac{r}{N} = \frac{n_1}{N} \]
Katz requires the discounts to be proportional to the Turing discounts, with a constant of proportionality \( \mu \),
\[ (1 - d_r) = \mu \left(1 - \frac{r^*}{r}\right) \]
Here \( k \) is the cutoff that separates reliable counts from discounted ones. Solving the constraint above for the discount coefficients gives,
\[ d_r = \frac{\frac{r^*}{r} - \frac{(k + 1)n_{k+1}}{n_1}}{1 - \frac{(k + 1)n_{k+1}}{n_1}}, \text{ for } 1 \leq r \leq k. \]
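A sketch of this discount with the cutoff, reusing the illustrative count-of-counts from earlier and an assumed \( k = 5 \):

```python
# Illustrative count-of-counts and cutoff; counts above k are left undiscounted.
n = {1: 4000, 2: 1200, 3: 600, 4: 350, 5: 230, 6: 160}
k = 5

def katz_discount(r, n, k):
    """d_r with cutoff k: Turing discounts rescaled so the freed mass is still n_1 / N."""
    if r > k:
        return 1.0
    r_star = (r + 1) * n[r + 1] / n[r]
    A = (k + 1) * n[k + 1] / n[1]   # the constant (k + 1) n_{k+1} / n_1
    return (r_star / r - A) / (1 - A)

for r in range(1, k + 1):
    print(r, round(katz_discount(r, n, k), 3))
```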
The normalization constants \( \alpha \) can be precomputed. Using these procedures, a 3-gram model was built as the language model component of a real-time speech recognizer. The experiments showed that setting \( d_1 = 0 \), which discards all singletons, did not degrade model performance.
Katz also built a compact language model for PC-based speech recognizers. Compute and memory were extremely limited at the time, so this compactness was a significant practical milestone for language modeling from sparse data.
Perplexity is a metric used to evaluate the quality of LMs, with lower scores indicating better performance. Consider a model where,
- \( L \) is the length of the test sequence,
- \( m \) is the size of an m-gram,
- \( w_t \) is the \( t \)-th word in the test sequence, and,
- \( P(w_t | w_{t-m+1}^{t-1}) \) is the conditional probability of the word given context.
Perplexity is \( 2^H \), where \( H \) is the average negative log-probability per predicted word,
\[ H = -\frac{1}{L-m+1} \sum_{t=m}^{L} \log_2 P(w_t | w_{t-m+1}^{t-1}) \]
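For example, a short sketch computing \( H \) and the corresponding perplexity from a handful of hypothetical per-word probabilities:

```python
import math

# Hypothetical conditional probabilities assigned by a model to the predicted words
# of a short test sequence (positions t = m .. L); the values are illustrative.
probs = [0.10, 0.02, 0.30, 0.05, 0.20]

H = -sum(math.log2(p) for p in probs) / len(probs)   # average negative log-probability
perplexity = 2 ** H

print(round(H, 3), round(perplexity, 1))   # ~3.469 and ~11.1
```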
Katz trained on a corpus of 750,000 words and tested on 100 sentences. The results show that the back-off model was on par with the state of the art at the time while being simpler to construct and use.
| Model | 1 | 2 | 3 |
|---|---|---|---|
| 2-gram | 118 | 119 | 117 |
| 3-gram | 89 | 91 | 88 |
References
- [1] Katz, Slava M. "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer." IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, no. 3, Mar. 1987, pp. 400–401.