Neural Machine Translation by Jointly Learning to Align and Translate

Annotated by: Noah Schliesman

Please download the annotated PDF to view it.

Abstract

Prior to this paper, encoder-decoder models encoded a source sentence into a single fixed-length vector from which the decoder generated a translation. Instead, Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio propose a soft-search attention mechanism that lets the decoder attend to the parts of a source sentence most relevant to predicting each target word. The resulting system produces state-of-the-art English-to-French translation and forms the basis for the modern attention networks prevalent in research.[1]

Introduction

Traditional machine translation systems consist of many sub-components that are tuned separately. Neural machine translation instead builds a single network, most commonly an encoder-decoder architecture, that is jointly trained to maximize the probability of a correct translation given a source sentence.

As shown in the earlier work of Cho and Bahdanau[2], encoding into a fixed-length vector cripples performance as the length of the input sentence increases. Soft-search attention instead searches for a set of positions in the source sentence where the most relevant information is concentrated, and the decoder predicts each target word based on the context vector associated with these attention positions.

Rather than compressing the source into a single fixed-length context vector, the encoder generates a sequence of vectors, and the model adaptively chooses a subset of these vectors while decoding the translation. This idea of alignment allows the model to focus on meaningful relationships in the source sentence and significantly improves machine translation performance.

Background: Neural Machine Translation

Neural Machine Translation (NMT) involves maximizing the conditional probability of a correct translation given a source sentence, i.e.,

$$ \arg \max_y p(y|x) $$

Encoder-decoder architectures learn this conditional distribution, typically using Recurrent Neural Network (RNN) or Long Short-Term Memory (LSTM) units to encode and decode.

Let \( f \) and \( q \) be nonlinear functions and let \( X = \{x_1, ..., x_{T_x}\} \) be a sequence of input vectors. Generally, the hidden state is computed as,

$$ h_t = f(x_t, h_{t-1}) $$

Notably, the context vector is computed from the hidden states by a nonlinear function,

$$ c = q(\{h_1, ..., h_{T_x}\}) $$

In turn, the decoder defines a probability over the translation by decomposing the joint probability into ordered conditionals. Given the previously predicted words \(\{y_1, ..., y_{t-1}\}\), the probability of the translation is,

$$ p(y) = \prod_{t=1}^{T} p(y_t \mid \{y_1, ..., y_{t-1}\}, c) $$

With an RNN, each conditional probability is modeled as a nonlinear function \( g \) of the previously predicted word, the decoder’s hidden state \( s_t \), and the context vector,

$$ p(y_t \mid \{y_1, ..., y_{t-1}\}, c) = g(y_{t-1}, s_t, c) $$

Thus, traditional methods encode into, and decode from, a single context vector derived from the hidden states of a neural network.
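To make this background concrete, here is a minimal NumPy sketch of the fixed-context encoder, assuming a plain tanh RNN for \( f \) and the common choice \( q(\{h_1, ..., h_{T_x}\}) = h_{T_x} \); the dimensions and random weights are invented purely for illustration and are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 8, 16   # toy input and hidden sizes

# Encoder parameters for h_t = tanh(W x_t + U h_{t-1})
W = rng.normal(size=(d_h, d_in))
U = rng.normal(size=(d_h, d_h))

def encode(xs):
    """Run a plain tanh RNN over the input sequence and return all hidden states."""
    h = np.zeros(d_h)
    hidden_states = []
    for x in xs:
        h = np.tanh(W @ x + U @ h)
        hidden_states.append(h)
    return hidden_states

# q({h_1, ..., h_Tx}) = h_Tx: the whole sentence is squeezed into one vector,
# and every decoding step would have to condition on this single context c.
xs = [rng.normal(size=d_in) for _ in range(5)]
c = encode(xs)[-1]
print(c.shape)   # (16,)
```

This single vector \( c \) is precisely the bottleneck that the attention mechanism below removes.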

Learning to Align and Translate

Before diving into attention, let’s first review bidirectional RNNs. If you haven’t already, check out my primer on RNNs, GRUs, and LSTMs. I also discuss some of these topics in my analysis of Sequence to Sequence Learning with Neural Networks by Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, so feel free to check that out. I also have a primer on Attention and Transformer Architectures, but that field has evolved well beyond the additive attention discussed below.

Rather than an LSTM, the authors use a gated recurrent unit (GRU) as their nonlinear function (i.e., \( f \equiv \) Gated Recurrent Unit). The paper itself refers to this unit as a gated hidden unit, but for consistency with the broader literature I will use the more popular term GRU. Let’s recall the formulation of a GRU as it is used inside a bidirectional recurrent neural network (BiRNN).

The GRU involves an update gate, a reset gate, a candidate hidden state, and a final hidden state.[2] This gating allows the GRU to maintain long-term dependencies, mitigating many of the vanishing-gradient issues of traditional RNNs. Here we examine the forward states of a BiRNN encoder built from GRUs.

Given a word embedding matrix \( E \), we can precompute our embedded input such that,

$$ x_t = E w_t $$

for some one-hot encoded word vector \( w_t \). The forward hidden states are computed from the beginning of the sequence up to token \( t \).
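A quick illustration of this lookup, with a toy vocabulary size and a randomly initialized embedding matrix invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_emb = 10, 4
E = rng.normal(size=(d_emb, vocab_size))   # embedding matrix, one column per word

w_t = np.zeros(vocab_size)
w_t[3] = 1.0                               # one-hot vector for word index 3

x_t = E @ w_t                              # x_t = E w_t ...
assert np.allclose(x_t, E[:, 3])           # ... which is simply column 3 of E
```

In practice the multiplication is never performed; the embedding is read out directly by index.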

The update gate determines the extent to which the previous hidden state is updated with the new candidate hidden state. Let \( x_t \) be the input at time step \( t \) and \( W_z, U_z \) be update gate weight matrices, \( b_z \) be the update gate bias, and \( \sigma \) be the sigmoid activation function:

$$ z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z) $$

The reset gate determines how much of the previous hidden state should be forgotten. Let \( W_r, U_r \) be reset gate weight matrices, and \( b_r \) be the reset gate bias:

$$ r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r) $$

The candidate hidden state \( \tilde{h}_t \) represents new information that may be written into the final hidden state. Let \( W_h, U_h \) be candidate-state weight matrices, \( b_h \) be the candidate-state bias, and \( \tanh \) be the hyperbolic tangent activation function:

$$ \tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) $$

The final hidden state is a linear interpolation between the previous hidden state and the candidate hidden state, controlled by the update gate:

$$ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t $$

When expanded, we obtain,

$$ h_t = (1 - \sigma(W_z x_t + U_z h_{t-1} + b_z)) \odot h_{t-1} + \sigma(W_z x_t + U_z h_{t-1} + b_z) \odot \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) $$
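The following NumPy sketch implements one forward GRU step exactly as these equations describe; the weight names mirror the notation above, but the toy sizes and random initialization are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 4, 6   # toy input and hidden sizes

# Parameters named after the equations above (random values for illustration)
W_z, U_z, b_z = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h)), np.zeros(d_h)
W_r, U_r, b_r = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h)), np.zeros(d_h)
W_h, U_h, b_h = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h)), np.zeros(d_h)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev):
    """One GRU step: update gate, reset gate, candidate state, interpolated final state."""
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)                   # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)                   # reset gate
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)       # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                     # final hidden state

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_x)):   # run over a toy 5-token embedded sequence
    h = gru_step(x, h)
print(h.shape)   # (6,)
```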

Quite similarly, the backward states are computed by reading the sequence in reverse, from the last token back to the first. We concatenate the forward and backward states to obtain the annotation for each token, \( h_j \); stacking the annotations for all tokens gives the annotation matrix \( H \). While this roughly doubles the encoder’s computation, it allows each token’s annotation to capture meaningful dependencies in both directions.
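The sketch below shows how the two passes are combined into annotations. For brevity it uses a plain tanh recurrence, shared across directions, as a stand-in for the GRU step above; the real model uses a GRU with separate forward and backward parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
d_x, d_h = 4, 6   # toy input and hidden sizes

# Stand-in recurrence so the snippet runs on its own; swap in gru_step for the real thing.
W = rng.normal(size=(d_h, d_x))
U = rng.normal(size=(d_h, d_h))
step = lambda x, h: np.tanh(W @ x + U @ h)

def encode_bidirectional(xs, step):
    """Return annotations h_j = [forward_state ; backward_state] for each token."""
    fwd, h = [], np.zeros(d_h)
    for x in xs:                      # read left to right
        h = step(x, h)
        fwd.append(h)
    bwd, h = [], np.zeros(d_h)
    for x in reversed(xs):            # read right to left
        h = step(x, h)
        bwd.append(h)
    bwd.reverse()                     # realign so index j matches token j
    return np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])

xs = list(rng.normal(size=(5, d_x)))  # a toy 5-token embedded sentence
H = encode_bidirectional(xs, step)
print(H.shape)                        # (5, 12): T_x annotations of length 2 * d_h
```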

Each annotation summarizes the input sentence with a focus on the region around its token. The encoder that produces these annotations is trained jointly with the decoder, which uses them through a learnable attention mechanism. The result is a single system that learns to produce and weight annotations (align) and decode meaning with an attention mechanism (translate) at the same time.

Now that we’ve discussed how the GRU-based BiRNN computes an annotation matrix, let’s address attention. Let \( s_{i-1} \) be the decoder’s previous state, \( v_a \) be a trainable alignment score vector, \( W_a \) be a trainable weight matrix that transforms the decoder’s previous state \( s_{i-1} \), and \( U_a \) be a trainable weight matrix that transforms the encoder’s annotation \( h_j \). The energy assigned to annotation \( h_j \) when predicting target position \( i \) is,

$$ e_{ij} = v_a^T \tanh(W_a s_{i-1} + U_a h_j) $$

The associated attention weight is obtained by applying a softmax over these energies,

$$ \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} $$

The context vector for target position \( i \) is the weighted sum of the annotations, with weights \( \alpha_{ij} \),

$$ c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j $$
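Here is a NumPy sketch of this additive attention step; the toy dimensions and random parameters are my own, and it consumes any annotation matrix of shape (T_x, 2 * d_h), such as the H computed above.

```python
import numpy as np

rng = np.random.default_rng(2)
d_s, d_ann, d_a = 6, 12, 8           # decoder-state, annotation, and alignment sizes (toy)

W_a = rng.normal(size=(d_a, d_s))    # transforms the decoder's previous state
U_a = rng.normal(size=(d_a, d_ann))  # transforms each annotation
v_a = rng.normal(size=d_a)           # alignment score vector

def attend(s_prev, H):
    """Return the context vector c_i and attention weights alpha_i over annotations H."""
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])  # energies e_ij
    e = e - e.max()                              # stabilize the softmax numerically
    alpha = np.exp(e) / np.exp(e).sum()          # weights alpha_ij
    c = alpha @ H                                # c_i = sum_j alpha_ij * h_j
    return c, alpha

H = rng.normal(size=(5, d_ann))      # stand-in annotations for 5 source tokens
c_i, alpha_i = attend(np.zeros(d_s), H)
print(c_i.shape, alpha_i.sum())      # (12,) 1.0
```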

We’re now ready to discuss the decoder. Let the previous embedded output be denoted as,

$$ e_{i-1} = E y_{i-1} $$

where \( E \) is now the target-side word embedding matrix and \( y_{i-1} \) is the one-hot vector of the previously emitted word. The decoder works much like a unidirectional GRU, except that the context vector \( c_i \) enters each gate through its own weight matrix.

The update gate is,

$$ z_i = \sigma(W_z e_{i-1} + U_z s_{i-1} + C_z c_i) $$

The reset gate is described as,

$$ r_i = \sigma(W_r e_{i-1} + U_r s_{i-1} + C_r c_i) $$

The candidate hidden state is,

$$ \tilde{s}_i = \tanh(W_h e_{i-1} + U_h (r_i \odot s_{i-1}) + C_h c_i) $$

Lastly, the final hidden state is given by,

$$ s_i = (1 - z_i) \odot s_{i-1} + z_i \odot \tilde{s}_i $$

Which in turn expands to,

$$ s_i = (1 - \sigma(W_z e_{i-1} + U_z s_{i-1} + C_z c_i)) \odot s_{i-1} + \sigma(W_z e_{i-1} + U_z s_{i-1} + C_z c_i) \odot \tanh(W_h e_{i-1} + U_h (r_i \odot s_{i-1}) + C_h c_i) $$
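A self-contained sketch of one decoder step under these equations; the dimensions and random parameters are invented, and in the full model the context \( c_i \) would come from the attend function above.

```python
import numpy as np

rng = np.random.default_rng(3)
d_e, d_s, d_c = 4, 6, 12   # toy embedding, decoder-state, and context sizes

# One weight matrix per input stream and per gate, as in the equations above
W_z, U_z, C_z = (rng.normal(size=(d_s, d)) for d in (d_e, d_s, d_c))
W_r, U_r, C_r = (rng.normal(size=(d_s, d)) for d in (d_e, d_s, d_c))
W_h, U_h, C_h = (rng.normal(size=(d_s, d)) for d in (d_e, d_s, d_c))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def decoder_step(e_prev, s_prev, c_i):
    """One GRU decoder step given the previous word embedding, state, and current context."""
    z_i = sigmoid(W_z @ e_prev + U_z @ s_prev + C_z @ c_i)              # update gate
    r_i = sigmoid(W_r @ e_prev + U_r @ s_prev + C_r @ c_i)              # reset gate
    s_tilde = np.tanh(W_h @ e_prev + U_h @ (r_i * s_prev) + C_h @ c_i)  # candidate state
    return (1.0 - z_i) * s_prev + z_i * s_tilde                         # new state s_i

s_i = decoder_step(rng.normal(size=d_e), np.zeros(d_s), rng.normal(size=d_c))
print(s_i.shape)   # (6,)
```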

The authors compute the initial decoder hidden state from the first backward encoder state,

$$ s_0 = \tanh(W_s \overleftarrow{h}_1) $$

The intermediate vector combines the decoder’s previous hidden state \( s_{i-1} \), the previous embedded output \( e_{i-1} \), and the context vector \( c_i \) through trainable weight matrices \( U_o, V_o \), and \( C_o \) to form a representation that captures the context needed for the next prediction,

$$ \tilde{t}_i = U_o s_{i-1} + V_o e_{i-1} + C_o c_i $$

The maxout operation is an activation function that takes the maximum over consecutive pairs of elements of the intermediate vector \( \tilde{t}_i \). This increases the representational capacity of the model while halving the dimensionality of the intermediate vector. Mathematically, if \( \tilde{t}_i \) has \( 2l \) components,

$$ t_{i,j} = \max\{\tilde{t}_{i,2j-1}, \tilde{t}_{i,2j}\} \quad \text{for} \quad j = 1, ..., l $$
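A small sketch of the intermediate vector and the maxout over consecutive pairs; the dimensions and random matrices are again invented for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
d_s, d_e, d_c, two_l = 6, 4, 12, 10   # 2l = 10, so the maxout output has l = 5 units

U_o = rng.normal(size=(two_l, d_s))
V_o = rng.normal(size=(two_l, d_e))
C_o = rng.normal(size=(two_l, d_c))

def maxout_intermediate(s_prev, e_prev, c_i):
    """Compute the intermediate vector, then take the max over consecutive pairs."""
    t_tilde = U_o @ s_prev + V_o @ e_prev + C_o @ c_i   # intermediate vector, length 2l
    return t_tilde.reshape(-1, 2).max(axis=1)           # maxout output, length l

t_i = maxout_intermediate(rng.normal(size=d_s), rng.normal(size=d_e), rng.normal(size=d_c))
print(t_i.shape)   # (5,)
```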

From here, we compute the logits (unnormalized scores) for the target words by multiplying the activated intermediate vector \( t_i \) by a weight matrix \( W_o \),

$$ z_i = W_o t_i $$

We apply a softmax over these logits to obtain a probability distribution over the target vocabulary of size \( K \),

$$ p(y_i = k \mid s_i, y_{i-1}, c_i) = \frac{\exp(z_{i,k})}{\sum_{k'=1}^{K} \exp(z_{i,k'})} $$
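A sketch of this output layer, assuming the length-\( l \) maxout output from the previous snippet and a toy vocabulary size:

```python
import numpy as np

rng = np.random.default_rng(5)
l, K = 5, 30                      # maxout output size and toy target vocabulary size

W_o = rng.normal(size=(K, l))     # output projection

def output_distribution(t_i):
    """Project the maxout output to logits, then softmax over the vocabulary."""
    z = W_o @ t_i                 # one logit per vocabulary word
    z = z - z.max()               # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return p

p = output_distribution(rng.normal(size=l))
print(p.shape, round(p.sum(), 6), p.argmax())   # (30,) 1.0 <index of most likely word>
```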

To briefly recap: we embed the tokenized sentence with an embedding matrix \( E \); compute an annotation for each token by concatenating the forward and backward states of a GRU-based BiRNN; score each annotation with a learnable additive energy \( e_{ij} \) and normalize the scores into attention weights; form the context vector \( c_i \) for each target position as the weighted sum of the annotations; feed these context vectors into a GRU decoder to produce the decoder states \( s_i \); build an intermediate vector from the decoder state, previous output, and context and apply maxout to obtain \( t_i \); project \( t_i \) to logits \( z_i \); and apply a softmax to those logits to obtain a probability distribution over the target vocabulary.

Experiment Settings

The authors use the English-to-French translation task from the ACL WMT ’14 corpora, totaling 850M words. They reduce the corpus to 348M words and note that it may be possible to use a much larger monolingual corpus to pretrain the encoder.

They tokenize with a vocabulary of 30k words plus an [UNK] token, without lowercasing or stemming. They compare their model, RNNsearch, with the RNN Encoder-Decoder (which relies on a fixed-length context vector) and observe a large improvement. Training uses minibatch stochastic gradient descent (SGD) with Adadelta, computing each update from 80 sentences. After training, they use beam search to find the translation that approximately maximizes the conditional probability.
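For readers unfamiliar with Adadelta, here is a minimal NumPy sketch of a single Adadelta update; the hyperparameters shown are the common defaults from Zeiler’s paper, used here only as an illustration.

```python
import numpy as np

class Adadelta:
    """Minimal Adadelta optimizer (Zeiler, 2012) for a single parameter array."""
    def __init__(self, shape, rho=0.95, eps=1e-6):
        self.rho, self.eps = rho, eps
        self.acc_grad = np.zeros(shape)    # running average of squared gradients
        self.acc_update = np.zeros(shape)  # running average of squared updates

    def step(self, param, grad):
        self.acc_grad = self.rho * self.acc_grad + (1 - self.rho) * grad ** 2
        update = -np.sqrt(self.acc_update + self.eps) / np.sqrt(self.acc_grad + self.eps) * grad
        self.acc_update = self.rho * self.acc_update + (1 - self.rho) * update ** 2
        return param + update

# Toy usage: one update on a random parameter matrix with a random gradient
rng = np.random.default_rng(6)
W = rng.normal(size=(3, 3))
opt = Adadelta(W.shape)
W = opt.step(W, grad=rng.normal(size=(3, 3)))
```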

Results and Related Work

This novel network outperforms the conventional phrase-based system Moses, as measured by BLEU on sentences without unknown words (36.15 for RNNsearch, 35.63 for Moses, 26.71 for RNNenc-dec). The authors attribute much of the gap over RNNenc-dec to its fixed-length context vector, which struggles with long sentences and long-term dependencies. Soft-alignment (attention) naturally handles source and target phrases of different lengths, though it requires computing an annotation weight for every word in the source sentence for each word in the translation.

In related work, the authors point to an alignment mechanism used for handwriting synthesis, noting that there the modes of the annotation weights move in only one direction, whereas translation often requires reordering, so their attention must be free to move in both directions. They also acknowledge the computational cost of their method, especially when applied to longer inputs and outputs.

Citations