Word2Vec
Annotated by: Noah Schliesman
Abstract
With help from this paper, I explain N-grams, Maximum Likelihood Estimation (MLE), Latent Semantic Analysis (LSA), Term Frequency-Inverse Document Frequency (TF-IDF), Singular Value Decomposition (SVD), Dimensionality Reduction, Latent Dirichlet Allocation (LDA), Variational Inference, Kullback-Leibler (KL) Divergence, Evidence Lower Bound (ELBO), Gibbs Sampling, Huffman Encoding, DistBelief, AdaGrad, Continuous Bag-of-Words (CBOW), Continuous Skip-gram, Negative Sampling, Cosine Distance, Microsoft Sentence Completion Challenge, and Spearman’s Rank Correlation.
The authors propose two methods for computing vector representations of words: continuous bag-of-words and continuous skip-gram. The quality of these representations is measured in a word similarity task.[1]
Introduction
Treating words as atomic units provides no notion of similarity between them and is often done for its simplicity. Considering words as atomic units means that each word in a corpus is treated as a single token in the vocabulary, without any internal structure.
An N-gram is a sequence of ‘N’ letters, syllables, words, or other segmentation from a given text. We call a one word sequence (N=1) a unigram, a two word sequence (N=2) a bigram, a three word sequence (N=3) a trigram, and so forth. Thus, N-grams are used to capture the context of words in a text.[2]
Formally, given a sequence of words \(w_1, w_2, ..., w_T\), an N-gram model estimates the probability of a word \(w_t\) based on the preceding \(N-1\) words:
\(P(w_t | w_{t-(N-1)}, ..., w_{t-1})\)
Aside: Maximum Likelihood Estimation (MLE)
Maximum Likelihood Estimation (MLE) is a tool for estimating the parameters of a statistical model (e.g., the coefficients of a line of best fit). The main idea is to find the parameter values that make the observed data most likely to have occurred.[3]
Suppose we have a dataset \(X = {x_1, x_2, ..., x_n}\) and a parametric model with parameter \(\theta\). The likelihood function is the probability of observing the data given the parameter \(\theta\), \(L(\theta; X) = P(X|\theta) = \prod_{i=1}^{n} P(x_i|\theta)\).
When we estimate probabilities from data, we often multiply many probabilities together, which results in very small numbers. Taking the log of a product turns it into a sum of logs, which is much easier to handle. Thus, the log-likelihood function for the parameter \(\theta\) is \(l(\theta; X) = \log L(\theta; X) = \sum_{i=1}^{n} \log P(x_i|\theta)\).
Maximum Likelihood Estimation seeks the parameter value that maximizes the log-likelihood function, \(\hat{\theta} = \arg \max_{\theta} l(\theta; X)\). To find such a maximum, we take the derivative of the log-likelihood with respect to \(\theta\) and set it to zero, \(\frac{\partial l(\theta; X)}{\partial \theta} = 0\). From here, we can solve for \(\theta\).
MLE estimates the probability of an N-gram by looking at how often it appears in a large corpus: we count occurrences of the N-gram and divide by the number of times its prefix (the preceding N-1 words) appears. In other words,
\(P(w_t | w_{t-(N-1)}, ..., w_{t-1}) = \frac{Count(w_{t-(N-1)}, ..., w_{t-1}, w_t)}{Count(w_{t-(N-1)}, ..., w_{t-1})}\)
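As a rough illustration (not from the original papers), the snippet below computes bigram MLE estimates by counting; the toy corpus and the helper name `bigram_mle` are invented for the example.

```python
from collections import Counter

def bigram_mle(tokens):
    """Estimate P(w_t | w_{t-1}) by MLE: Count(w_{t-1}, w_t) / Count(w_{t-1})."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return {(prev, cur): count / unigram_counts[prev]
            for (prev, cur), count in bigram_counts.items()}

corpus = "the cat sat on the mat the cat slept".split()
probs = bigram_mle(corpus)
print(probs[("the", "cat")])   # 2 occurrences of "the cat" / 3 occurrences of "the" ≈ 0.667
```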
The authors explain that N-gram models are limited by their scalability and oversimplified nature, and they mention that neural models outperform N-grams.
The authors aim to efficiently convert words to vectors that provide a meaningful measure of similarity between word representations. They refer to some previous works on the matter, specifically "A Neural Probabilistic Language Model" by Bengio et al. (2003).[4]
Model Architecture
They mention Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA), noting that distributed representations of words preserve linear regularities better than LSA and are less computationally expensive than LDA.
Aside: Latent Semantic Analysis (LSA) vs. Latent Dirichlet Allocation (LDA)
Latent Semantic Analysis (LSA) is a technique that helps you look at the patterns of words that appear together in data. Given a corpus of text, we first build a term-document matrix \(X\), where \(X_{ij}\) indicates the frequency of term \(i\) in document \(j\). This results in a very large and sparse matrix.[5]
Often we measure frequency using Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF combines a measure of frequency within a document and frequency across all documents to score how important a word is to a document.[6] The term frequency (TF) measures the frequency of a term \(t\) within a document \(d\). The inverse document frequency (IDF) measures the rarity of a term across the corpus. Let \(|D|\) be the total number of documents and \(|\{d \in D : t \in d\}|\) be the number of documents containing term \(t\). We take the log to weaken the effect of high document frequencies:
\(IDF(t, D) = \log \frac{|D|}{1 + |\{d \in D : t \in d\}|}\)
Thus, the TF-IDF score combines term frequency and inverse document frequency to ensure terms that are frequent in a document but rare in the corpus receive higher scores:
\(\text{TF-IDF}(t, d, D) = TF(t, d) \cdot IDF(t, D) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} \cdot \log \frac{|D|}{1 + |\{d \in D : t \in d\}|}\)
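A minimal Python sketch of the score above, assuming a tiny invented corpus of tokenized documents; the helper name `tf_idf` is ours.

```python
import math
from collections import Counter

def tf_idf(term, doc, docs):
    """TF-IDF as defined above; doc is a list of tokens, docs is the list of all documents."""
    counts = Counter(doc)
    tf = counts[term] / sum(counts.values())         # f_{t,d} / sum_{t'} f_{t',d}
    df = sum(1 for d in docs if term in d)           # |{d in D : t in d}|
    idf = math.log(len(docs) / (1 + df))             # smoothed inverse document frequency
    return tf * idf

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["cat", "and", "dog"]]
print(tf_idf("cat", docs[0], docs))   # "cat" appears in 2 of 3 documents, so its score is 0 here
```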
Now that we have a way of measuring frequency within our term-document matrix, we perform singular value decomposition (SVD) to our matrix. Let \(U\) be a matrix representing the term eigenvectors, \(\Sigma\) be a diagonal matrix of singular values, and \(V^T\) be the matrix of document eigenvectors:
\(X = U \Sigma V^T\)[7]
By keeping only the top \(k\) singular values (truncated SVD), the reduced matrices represent terms and documents in a lower-dimensional semantic space where we can capture synonymy (different words / similar meanings) and polysemy (same word / different meanings).[8]
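The following is a hedged NumPy sketch of LSA via truncated SVD on a toy term-document matrix; the matrix values and the choice of \(k = 2\) are arbitrary for illustration.

```python
import numpy as np

# Toy term-document matrix X (rows = terms, columns = documents); entries could be TF-IDF scores.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U Sigma V^T

k = 2                                              # keep only the top-k singular values (truncation)
term_vectors = U[:, :k] * s[:k]                    # term representations in the latent semantic space
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T       # document representations in the same space
print(term_vectors.shape, doc_vectors.shape)       # (4, 2) (3, 2)
```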
Latent Dirichlet Allocation (LDA)
LDA assumes the distribution of topics is based on a Dirichlet distribution. LDA uses two parameters as Dirichlet priors on per-document topic distributions and per-topic word distributions.[9] When new data is observed, a Dirichlet prior can be updated to form a posterior distribution.[10]
Given a probability vector \(p = (p_1, p_2, ..., p_k)\) with \(\sum_{i=1}^{k} p_i = 1\), concentration parameters \(\alpha = (\alpha_1, ..., \alpha_k)\), and the multivariate Beta function \(B(\alpha)\) that normalizes the Dirichlet distribution, we can define the distribution as:
\(Dir(p; \alpha) = \frac{1}{B(\alpha)} \prod_{i=1}^{k} p_i^{\alpha_i - 1}\)
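As a quick illustration of the Dirichlet prior in LDA's generative story, the snippet below samples one per-document topic distribution with NumPy; the three-topic setup and \(\alpha_i = 0.5\) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = [0.5, 0.5, 0.5]               # symmetric Dirichlet prior over 3 topics
theta = rng.dirichlet(alpha)          # one sampled per-document topic distribution
print(theta, theta.sum())             # components are non-negative and sum to 1
```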
Variational Inference and Kullback-Leibler (KL) Divergence
Variational Inference is a method for estimating complex probability distributions: an intractable posterior is approximated by a simpler family of distributions whose parameters are optimized, typically by minimizing the Kullback-Leibler (KL) divergence to the true posterior, which is equivalent to maximizing the Evidence Lower Bound (ELBO).[11]
Alternatively, we can use Gibbs sampling, iteratively sampling each variable from its conditional distribution given the current values of the other variables.[12]
Huffman Encoding and AdaGrad
Huffman encoding leverages the frequency of words, giving common words shorter codes and less common words longer codes. The process begins by calculating the frequency of each word in the vocabulary. From here, it constructs a binary tree called a Huffman tree, where each leaf node represents a word, and the path from the root to the leaf node represents the binary code for that word. We often use a min-heap, a binary tree where the value of each node is less than or equal to its children. When inserting a new node into the heap, place it at the next available position and then “heapify up”. This process entails iteratively comparing the inserted node with its parent and swapping if the node is smaller until the heap is a min-heap.[13]
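Below is a rough Python sketch of Huffman code construction with a min-heap (`heapq`); the word frequencies and the helper name `huffman_codes` are invented for illustration.

```python
import heapq

def huffman_codes(freqs):
    """Build binary codes from word frequencies by repeatedly merging the two rarest subtrees."""
    # Each heap entry: (total frequency, tie-breaker, {word: code-so-far}).
    heap = [(f, i, {w: ""}) for i, (w, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)          # two least-frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in left.items()}
        merged.update({w: "1" + c for w, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

print(huffman_codes({"the": 50, "cat": 20, "sat": 20, "zygote": 10}))
# Frequent words ("the") receive short codes; rarer words receive longer ones.
```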
In the feedforward neural net language model (NNLM), the complexity per training example is \(Q = N \cdot D + N \cdot D \cdot H + H \cdot V\). With a hierarchical softmax over a Huffman tree, the output term drops from \(H \cdot V\) to roughly \(H \cdot \log_2(V)\), so the bottleneck is generally the hidden-layer term \(N \cdot D \cdot H\).
Another approach to neural word-vector parametrization is the recurrent neural net language model (RNNLM). Its recurrent connections give the model a form of short-term memory, so it can represent more complex patterns. Again, the computational complexity per training example is:
\(Q = H \cdot H + H \cdot V\)
The authors leverage the parallelism of DistBelief, in which workers perform forward and backward passes and compute gradients independently, while parameter servers store the parameters on dedicated nodes. This asynchronous gradient workflow makes it efficient to train models with millions (and sometimes billions) of parameters.
They also mention the usage of AdaGrad, an algorithm that dynamically adjusts the learning rate for each parameter: parameters that frequently receive large gradients get smaller learning rates, while parameters with small gradients get larger ones. Let \(\theta\) be the parameters of the model, \(\eta\) the learning rate, \(g_t\) the gradient at time \(t\) with \(G_t = \sum_{i=1}^{t} g_i^2\), and \(\epsilon\) a small constant to prevent division by zero. The update rule is then:
\(\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} g_t\)[14]
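A minimal NumPy sketch of this update rule; the toy gradients and the helper name `adagrad_step` are invented. Note how two coordinates with very different raw gradient magnitudes end up taking similarly sized effective steps.

```python
import numpy as np

def adagrad_step(theta, grad, G, lr=0.025, eps=1e-8):
    """One AdaGrad update: per-parameter step sizes shrink as squared gradients accumulate."""
    G += grad ** 2                            # G_t = elementwise sum of squared gradients
    theta -= lr / np.sqrt(G + eps) * grad     # theta_{t+1} = theta_t - eta / sqrt(G_t + eps) * g_t
    return theta, G

theta, G = np.zeros(3), np.zeros(3)
for grad in [np.array([1.0, 0.1, 0.0]), np.array([1.0, 0.1, 0.0])]:
    theta, G = adagrad_step(theta, grad, G)
print(theta)   # the first two coordinates have moved by roughly the same amount
```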
Continuous Bag-of-Words (CBOW) and Continuous Skip-gram
The first proposed architecture, continuous bag-of-words (CBOW), is based on the feedforward NNLM, with the non-linear hidden layer removed and the projection layer shared for all words. The bag-of-words name reflects the fact that the order of the context words does not influence the projection. Again, the crucial idea is that the weight matrix between the input and projection layers is shared for all word positions.
CBOW predicts a target word based on the context words surrounding it. By taking a fixed-size window of words surrounding the target word, CBOW converts these words into vectors and averages them into a single context vector to predict the target word.
Each context word is represented as a one-hot encoded vector, which are in turn projected into a continuous vector space using a shared weight matrix \(W\). These embeddings are then averaged to produce a single context vector \(v_c\):
\(v_c = \frac{1}{2k} \sum_{-k \le j \le k,\, j \ne 0} W \cdot \text{one-hot}(w_{t+j})\)[15]
The context vector \(v_c\) is then used to predict the target word \(w_t\). Let there be another shared weight matrix \(W'\) which is used to convert the averaged context vector back into a probability distribution over the vocabulary to predict the target word. The target word is then computed as:
\(w_t = \arg \max_w P(w | context) = \arg \max_w \frac{\exp(v_c \cdot W_w')}{\sum_{j=1}^{V} \exp(v_c \cdot W_j')}\)
We can then train these weight matrices using stochastic gradient descent (SGD) and backpropagation, with a negative log-likelihood of the target word as choice of loss function:
\(L = - \log P(w_t | context)\)
The complexity per training example of CBOW is then:
\(Q = N \cdot D + D \cdot \log_2(V)\)
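The following is a toy NumPy sketch of a CBOW forward pass, written with a full softmax for clarity (the paper instead uses a hierarchical softmax over a Huffman tree); the sizes, the initialization, and the helper name `cbow_loss` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 10, 4                                  # toy vocabulary size and embedding dimension
W_in = rng.normal(scale=0.1, size=(V, D))     # shared input-to-projection matrix (word embeddings)
W_out = rng.normal(scale=0.1, size=(D, V))    # projection-to-output matrix

def cbow_loss(context_ids, target_id):
    """Average the context embeddings, score every word, and return -log P(target | context)."""
    v_c = W_in[context_ids].mean(axis=0)      # averaged context vector (word order is ignored)
    scores = v_c @ W_out                      # one score per vocabulary word
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                      # softmax over the vocabulary
    return -np.log(probs[target_id])

print(cbow_loss(context_ids=[1, 2, 4, 5], target_id=3))   # roughly log(V) ≈ 2.3 before training
```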
The second proposed architecture, known as Continuous Skip-gram, tries to maximize classification of a word based on another word in the same sentence: given a target word, it predicts context words within a certain window before and after the target. During training, the model learns word embeddings that capture these contextual relationships.
The Skip-gram model consists again of an input layer, a hidden projection layer, and an output layer. The input is a one-hot encoded vector of the target word, and the output is the probability distribution over the vocabulary for each context word position.
To make computation more efficient for larger vocabularies, skip-gram can use negative sampling, which maximizes the probability of the context word given the target word and minimizes it for negative samples. The modified objective formula can then be written as:
\(\max_{\theta} \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \left[ \log \sigma(v_{w_{t+j}}^T v_{w_t}) + \sum_{n=1}^{K} \mathbb{E}_{w_n \sim P_n(w)} \left[ \log \sigma(-v_{w_n}^T v_{w_t}) \right] \right]\)
In traditional Skip-gram, the complexity per training example is denoted for a maximum distance of words \(C\) as:
\(Q = C \cdot D + D \cdot \log_2(V)\)
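Below is a rough NumPy sketch of a single Skip-gram update with negative sampling, for one (target, context) pair and \(K = 2\) noise words; the helper name `sgns_step`, the sizes, and the learning rate are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W_in, W_out, target, context, neg_ids, lr=0.025):
    """Raise the score of the true context word and lower the scores of K sampled noise words."""
    v_t = W_in[target]                          # input vector of the target word
    grad_in = np.zeros_like(v_t)
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in neg_ids]:
        u = W_out[word]                         # output (context) vector, one row per word
        g = sigmoid(np.dot(v_t, u)) - label     # gradient of the logistic loss w.r.t. the score
        grad_in += g * u                        # accumulate the gradient for the input vector
        W_out[word] = u - lr * g * v_t          # update the output vector
    W_in[target] = v_t - lr * grad_in           # apply the accumulated input-vector update
    return W_in, W_out

rng = np.random.default_rng(0)
V, D = 10, 4
W_in = rng.normal(scale=0.1, size=(V, D))
W_out = rng.normal(scale=0.1, size=(V, D))
W_in, W_out = sgns_step(W_in, W_out, target=3, context=5, neg_ids=[0, 7])
```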
Succinctly, “the CBOW architecture predicts the current word based on the context, and the Skip-gram predicts surrounding words given the current word”.[1]
Results
Previous methods use a table of similar words to measure similarity, but the authors note that there are many different types of similarity within a vocabulary. Cosine similarity, derived from cosine distance, can measure the similarity of word-vector embeddings. Given two embeddings \(A\) and \(B\), a value of 1 indicates that the vectors point in the same direction (i.e. the words are similar), a value of 0 indicates that the vectors are orthogonal (i.e. the words are unrelated), and a value of -1 indicates that the vectors point in opposite directions (i.e. the words are dissimilar). It is measured as:
\(cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}\)
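A short NumPy sketch of this measure; the two example vectors are arbitrary.

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| ||B||)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([0.8, 0.1, 0.3])
b = np.array([0.7, 0.2, 0.4])
print(cosine_similarity(a, b))   # ≈ 0.98: the vectors point in nearly the same direction
```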
The test set the authors used contained five types of semantic questions: common capital city, all capital cities, currency, city-in-state, and man-woman (8869 questions). They also used nine types of syntactic questions: adjective to adverb, opposite, comparative, superlative, present participle, nationality adjective, past tense, plural nouns, plural verbs (10675 questions). They note the lack of accuracy due to morphology not being integrated into the model.
They used a Google News corpus containing about 6B tokens for training the word vectors, with the vocabulary restricted to the 1 million most frequent words. They expect that more data and higher-dimensional word vectors will improve the model, but note that after some point adding dimensions alone or adding training data alone yields diminishing improvements; both must be increased together.
They used three training epochs with stochastic gradient descent (SGD) and backpropagation, with a starting learning rate of 0.025 that decreases linearly so it approaches zero at the end of the last training epoch.
To compare models, they use the same training data and the same word-vector dimensionality of 640; they later provide examples beyond the 30k training vocabulary. They compare the RNNLM, the NNLM, CBOW, and Skip-gram on these tasks.
On syntactic questions, CBOW outperforms the NNLM vectors, which in turn outperform the RNN. Skip-gram, however, performs much better on semantic tasks than the other models. They then compare to state-of-the-art models and significantly outperform them (from 24.6% to 53.3% in total accuracy with Skip-gram).
For large-scale training, the authors use mini-batch asynchronous gradient descent with AdaGrad and 50 to 100 model replicas.
They also mention the Microsoft Sentence Completion Challenge, which consists of 1040 sentences in which one word is missing and the goal is to select the most coherent word from five reasonable choices. The authors explored Skip-gram on this task; when weighted in an ensemble with RNNLMs, it achieved a new state of the art of 58.9% accuracy.
Examples of the Learned Relationships
Relationships can be expressed by subtracting two word vectors and adding the result to a third. The example provided is:
\(\text{vector}(\text{Paris}) - \text{vector}(\text{France}) + \text{vector}(\text{Italy}) \approx \text{vector}(\text{Rome})\)
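As a toy illustration (not from the paper), the snippet below runs this arithmetic on hand-picked 2-D vectors and retrieves the nearest remaining word by cosine similarity; the vectors and the tiny vocabulary are invented so the analogy holds by construction.

```python
import numpy as np

# Hand-picked toy embeddings; real vectors would come from a trained model.
vec = {
    "Paris":  np.array([1.0, 1.0]),
    "France": np.array([1.0, 0.0]),
    "Italy":  np.array([2.0, 0.0]),
    "Rome":   np.array([2.0, 1.0]),
    "Berlin": np.array([3.0, 1.1]),
}

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query = vec["Paris"] - vec["France"] + vec["Italy"]     # should land near "Rome"

# Search the vocabulary, excluding the three query words, for the nearest vector.
best = max((w for w in vec if w not in {"Paris", "France", "Italy"}),
           key=lambda w: cosine(vec[w], query))
print(best)   # Rome
```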
Again, they mention that word vectors trained on larger data sets with higher dimensionality will perform significantly better, and that vector operations on word representations hold a lot of promise.
Conclusion
The authors managed to train high-quality word vectors using very simple model architectures. Due to the low computational complexity, it is possible to compute very accurate high-dimensional word vectors from a much larger data set. The authors outperform the previous state of the art on SemEval-2012 Task 2, achieving more than a 50% increase in Spearman's rank correlation.[16]
Spearman’s rank correlation coefficient, often denoted as \(\rho\) or \(r_s\), is a non-parametric measure of the strength and direction of the association between two variables, measuring how well they can be described by a monotonic function. First, convert the data \(X, Y\) into ranks \(R(X), R(Y)\) and calculate the difference for each observation \(i\):
\(d_i = R(X_i) - R(Y_i)\)
From here, Spearman’s rank correlation coefficient is:
\(\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}\)
Alternatively, we can use the Pearson correlation formula to compute Spearman’s rank correlation:
\(r_s = \frac{cov(R(X), R(Y))}{\sigma_{R(X)} \sigma_{R(Y)}}\)
Note that \(\rho \in [-1, 1]\), where \(\rho = 1\) indicates a perfect positive monotonic relationship (i.e. positive correlation), \(\rho = 0\) indicates no monotonic relationship (i.e. no correlation), and \(\rho = -1\) indicates a perfect negative monotonic relationship (i.e. negative correlation).
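A small Python check of the formula on invented toy data, using SciPy's ranking utilities; `scipy.stats.spearmanr` is included only to confirm the hand computation.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

x = np.array([3.0, 1.0, 4.0, 1.5, 5.0])
y = np.array([2.0, 0.5, 5.0, 1.0, 4.5])

d = rankdata(x) - rankdata(y)                    # d_i = R(X_i) - R(Y_i)
n = len(x)
rho = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))
print(rho, spearmanr(x, y).correlation)          # both ≈ 0.9 for this toy data
```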
The authors expect that the quality of word vectors will largely contribute to NLP applications.
Follow-up Work
The authors mention that after this paper was written, they published multi-threaded C++ code whose training speed is significantly higher.
For code, please reach out:
Citations
- [1] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Retrieved from arXiv.
- [2] Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3), 400-401. doi:10.1109/TASSP.1987.1165125
- [3] Brown, P. F., Della Pietra, V. J., Della Pietra, S. A., & Mercer, R. L. (1992). An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1), 31-40.
- [4] Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3, 1137-1155. Retrieved from JMLR.
- [5] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407. doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
- [6] Aizawa, A. (2003). An information-theoretic perspective of tf–idf measures. Information Processing & Management, 39(1), 45-65. doi:10.1016/S0306-4573(02)00021-3
- [7] Manning, C. D., Raghavan, P., & Schütze, H. (2009). Term-document matrices and singular value decompositions. In An Introduction to Information Retrieval (pp. 403-410). Cambridge University Press. Retrieved from Stanford University.
- [8] Thi, H. D., Manh, K. H., Anh, V. T., Quynh, T. P. T., & Viet, N. T. (2023). Dimensionality Reduction with Truncated Singular Value Decomposition and K-Nearest Neighbors Regression for Indoor Localization. International Journal of Advanced Computer Science and Applications, 14(10), 314-321.
- [9] Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022. Retrieved from JMLR.
- [10] Ferguson, T. S. (1973). A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics, 1(2), 209-230.
- [11] Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 112(518), 859-877. Retrieved from arXiv.
- [12] Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1), 5228-5235. doi:10.1073/pnas.0307752101
- [13] Moffat, A. (2016). Huffman Coding. In Encyclopedia of Algorithms (pp. 933-936). Springer, New York, NY. doi:10.1007/978-1-4939-2864-4_633
- [14] Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121-2159. Retrieved from JMLR.
- [15] Rong, X. (2014). Word2Vec Parameter Learning Explained. arXiv preprint arXiv:1411.2738. Retrieved from arXiv.
- [16] Spearman, C. (1904). The Proof and Measurement of Association between Two Things. American Journal of Psychology, 15, 72-101.