Pay Attention

Exploring and annotating the latest research in Generative AI

The Perceptron

Published in:

Frank Rosenblatt plants the seeds for neural networks with his perceptron, drawing inspiration from Hebb's hypothesis in connectionist neurology. The perceptron computes a sum (or mean) of input stimuli and uses statistical separability to handle both supervised and unsupervised learning via binomial theory.
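
To make the mechanism concrete, here is a minimal sketch (my own illustration, not Rosenblatt's original formulation) of a perceptron trained with the classic error-driven update rule on a toy linearly separable problem; the data and values are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable data: label is the sign of a fixed linear function.
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 2 * X[:, 1] > 0, 1, -1)

w = np.zeros(2)   # weights
b = 0.0           # bias (threshold)

# Classic perceptron rule: on a mistake, nudge the weights toward the example.
for _ in range(20):
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:      # misclassified (or on the boundary)
            w += yi * xi
            b += yi

accuracy = np.mean(np.sign(X @ w + b) == y)
print(f"training accuracy: {accuracy:.2f}")
```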

Read more

Kalman Filtering

Published in:

R. E. Kalman tackles the Wiener problem, namely how to recover an original signal corrupted by additive Gaussian noise, through the framework of state spaces in optimal control theory. This formulation forms the basis of modern state-space models, and my own research interests involve using Kalman filtering to improve neural networks.
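
As a concrete illustration, below is a minimal sketch of the predict/update recursion for a 1-D constant-velocity tracking model; the matrices and noise levels are toy choices of mine, not anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition (position, velocity)
H = np.array([[1.0, 0.0]])              # we only observe position
Q = 0.01 * np.eye(2)                    # process noise covariance
R = np.array([[1.0]])                   # measurement noise covariance

x = np.array([0.0, 1.0])                # state estimate
P = np.eye(2)                           # estimate covariance

true_x = np.array([0.0, 1.0])
for _ in range(50):
    # simulate the true system and a noisy measurement
    true_x = F @ true_x
    z = H @ true_x + rng.normal(scale=1.0, size=1)

    # predict
    x = F @ x
    P = F @ P @ F.T + Q

    # update
    S = H @ P @ H.T + R                 # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(2) - K @ H) @ P

print("estimated position/velocity:", x)
```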

Read more

Learning Representations by Back-propagating Errors

Published on:

David Rumelhart, Geoffrey Hinton, and Ronald Williams establish the heart of backpropagation: minimizing the difference between the actual output vector and the desired output vector by accumulating gradients through repeated application of the chain rule.
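
Here is a minimal numpy sketch of the idea, minimizing squared error in a one-hidden-layer network by pushing the error backwards with the chain rule; the architecture, data, and learning rate are toy choices of mine, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data and a tiny 2-3-1 network.
X = rng.normal(size=(64, 2))
t = (X[:, :1] * X[:, 1:]) + 0.1 * rng.normal(size=(64, 1))   # desired output vector

W1, b1 = rng.normal(scale=0.5, size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(scale=0.5, size=(3, 1)), np.zeros(1)
lr = 0.05

for _ in range(500):
    # forward pass
    h = np.tanh(X @ W1 + b1)
    y = h @ W2 + b2
    err = y - t                          # actual minus desired output

    # backward pass: chain rule, layer by layer
    dW2 = h.T @ err / len(X)
    db2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1 - h ** 2)     # gradient through tanh
    dW1 = X.T @ dh / len(X)
    db1 = dh.mean(axis=0)

    # gradient descent step
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final mean squared error:", float((err ** 2).mean()))
```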

Read more

Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer

Published on:

Slava M. Katz develops a novel nonlinear recursive procedure for redistributing the probability mass of unseen m-grams using Turing discounting. This work was a huge advancement in sparse, memory-efficient language modeling before the prominence of cross-entropy and neural networks.
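
The sketch below shows the back-off idea on bigrams, but with a fixed absolute discount standing in for the paper's Turing estimates, so it is only a simplified approximation of Katz's actual procedure; the corpus and discount value are invented.

```python
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
D = 0.5                                   # fixed discount (Katz uses Turing discounting)
N = len(corpus)

def p_unigram(w):
    return unigrams[w] / N

def p_backoff(prev, w):
    """Discounted bigram probability, backing off to unigrams for unseen pairs."""
    if bigrams[(prev, w)] > 0:
        return (bigrams[(prev, w)] - D) / unigrams[prev]
    # mass freed by discounting seen bigrams, redistributed over unseen continuations
    seen_types = sum(1 for (a, _) in bigrams if a == prev)
    left_over = D * seen_types / unigrams[prev]
    unseen_mass = sum(p_unigram(w2) for w2 in unigrams
                      if bigrams[(prev, w2)] == 0)
    return left_over * p_unigram(w) / unseen_mass

print(p_backoff("the", "cat"), p_backoff("the", "dog"), p_backoff("sat", "rug"))
```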

Read more

A Theoretical Framework for Back-Propagation

Published in:

Yann LeCun formalizes backpropagation in neural networks using Lagrangian/Hamiltonian analysis from optimal control theory and generalizes the framework to the cases of weight functions and recurrent networks.
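
For readers who want the flavor of the derivation, here is a compact adjoint/Lagrangian sketch in my own notation (signs and symbols may differ from the paper's): the forward pass is treated as a constraint, and the backward recursion falls out of the stationarity conditions.

```latex
\begin{aligned}
&\text{minimize } C(x_N) \quad \text{s.t. } x_k = F_k(x_{k-1}, w_k),\; k = 1,\dots,N,\\[4pt]
&\mathcal{L} = C(x_N) + \sum_{k=1}^{N} b_k^{\top}\bigl(x_k - F_k(x_{k-1}, w_k)\bigr),\\[4pt]
&\frac{\partial \mathcal{L}}{\partial x_N} = 0 \;\Rightarrow\; b_N = -\nabla C(x_N),\qquad
\frac{\partial \mathcal{L}}{\partial x_k} = 0 \;\Rightarrow\;
b_k = \Bigl(\tfrac{\partial F_{k+1}}{\partial x_k}\Bigr)^{\!\top} b_{k+1},\\[4pt]
&\frac{\partial \mathcal{L}}{\partial w_k} = -\Bigl(\tfrac{\partial F_k}{\partial w_k}\Bigr)^{\!\top} b_k
\quad\text{(the usual backpropagated gradient, up to sign).}
\end{aligned}
```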

Read more

Optimal Brain Damage

Published on:

Yann LeCun, John Denker, and Sara Solla examine pruning unimportant weights in this historic paper, computing parameter saliency via a second-derivative (diagonal Hessian) analysis and thereby reducing the number of redundant parameters by a factor of four.
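
Below is a minimal sketch of the saliency-and-prune step, assuming the diagonal Hessian entries are already available (in the paper they come from a backprop-like pass; here they are just placeholder values, as are the weights).

```python
import numpy as np

rng = np.random.default_rng(0)

w = rng.normal(size=100)                  # trained weights (toy values)
h_diag = np.abs(rng.normal(size=100))     # placeholder diagonal Hessian entries

# Optimal Brain Damage saliency: s_k = h_kk * w_k^2 / 2
saliency = 0.5 * h_diag * w ** 2

# prune the 75% of weights with the lowest saliency (a factor-of-four reduction)
k = len(w) // 4
keep = np.argsort(saliency)[-k:]
mask = np.zeros_like(w, dtype=bool)
mask[keep] = True
w_pruned = np.where(mask, w, 0.0)

print("nonzero weights after pruning:", int(mask.sum()))
```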

Read more

Learning Long-Term Dependencies with Gradient Descent is Difficult

Published in:

Yoshua Bengio, Patrice Simard, and Paolo Frasconi formalize the vanishing-gradient problem using hyperbolic attractor theory for a single recurrent neuron. After extending the analysis to general dynamical systems, they offer alternative approaches: simulated annealing, multi-grid random search, time-weighted pseudo-Newton optimization, and discrete error propagation.
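
Here is a small numerical illustration of the effect (my own toy example, not the paper's analysis): for a tanh RNN, the gradient with respect to an early hidden state is a product of Jacobians whose norm shrinks rapidly when the recurrent weights are contractive.

```python
import numpy as np

rng = np.random.default_rng(0)

n, T = 16, 60
W = rng.normal(scale=0.3, size=(n, n))     # contractive recurrent weights
h = rng.normal(scale=0.1, size=n)

grad = np.eye(n)                           # running product of Jacobians, d h_t / d h_0
norms = []
for t in range(T):
    h = np.tanh(W @ h)
    J = (1 - h ** 2)[:, None] * W          # Jacobian d h_{t+1} / d h_t
    grad = J @ grad                        # left-multiply: product now spans steps 0..t+1
    norms.append(np.linalg.norm(grad))

print("gradient norm after 1, 10, 60 steps:", norms[0], norms[9], norms[-1])
```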

Read more

Support-Vector Networks

Published on:

Corinna Cortes and Vladimir Vapnik revolutionize two-group classification with support vectors, derived from a convolution of the dot product (a kernel). The kernel lets a nonlinear decision surface be treated as a linear one in a high-dimensional feature space, which in turn can be optimized for classification using Lagrangian (dual) methods.
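
For reference, here is the soft-margin dual problem in standard textbook notation (not copied from the paper): the kernel K replaces the dot product, and the decision surface is the sign of a kernel expansion over the support vectors.

```latex
\begin{aligned}
\max_{\alpha}\;& \sum_{i=1}^{\ell} \alpha_i
  - \tfrac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j)\\
\text{s.t. }\;& 0 \le \alpha_i \le C,\qquad \sum_{i=1}^{\ell} \alpha_i y_i = 0,\\[4pt]
f(x) =\;& \operatorname{sign}\Bigl(\sum_{i=1}^{\ell} \alpha_i y_i\, K(x_i, x) + b\Bigr).
\end{aligned}
```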

Read more

Bidirectional Recurrent Neural Networks

Published in:

Mike Schuster and Kuldip K. Paliwal improve recurrent neural networks (RNNs) by considering forward and backward states simultaneously during training. These states are treated independently and allow the RNN to better grasp bidirectional context. Using this new architecture, they achieve state-of-the-art phoneme classification results on the TIMIT speech dataset.
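
A minimal numpy sketch of the bidirectional idea (toy dimensions and plain tanh cells of my choosing, not the paper's exact architecture): one pass reads the sequence forwards, an independent pass reads it backwards, and their states are concatenated at each time step.

```python
import numpy as np

rng = np.random.default_rng(0)

T, d_in, d_h = 8, 4, 5
X = rng.normal(size=(T, d_in))                 # one input sequence

def rnn_pass(X, Wx, Wh):
    """Simple tanh RNN; returns the hidden state at every time step."""
    h = np.zeros(Wh.shape[0])
    states = []
    for x in X:
        h = np.tanh(Wx @ x + Wh @ h)
        states.append(h)
    return np.stack(states)

# two independent sets of parameters: forward and backward directions
Wx_f, Wh_f = rng.normal(scale=0.3, size=(d_h, d_in)), rng.normal(scale=0.3, size=(d_h, d_h))
Wx_b, Wh_b = rng.normal(scale=0.3, size=(d_h, d_in)), rng.normal(scale=0.3, size=(d_h, d_h))

h_fwd = rnn_pass(X, Wx_f, Wh_f)                # reads t = 1 .. T
h_bwd = rnn_pass(X[::-1], Wx_b, Wh_b)[::-1]    # reads t = T .. 1, then re-aligned

H = np.concatenate([h_fwd, h_bwd], axis=1)     # bidirectional context per step
print(H.shape)                                 # (T, 2 * d_h)
```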

Read more

Greedy Layer-Wise Training of Deep Networks

Published in:

Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle investigate a greedy layer-wise training approach to Deep Belief Networks on the Cotton and Abalone datasets and MNIST. By using this algorithm for pretraining, weights are better initialized and therefore less likely to converge to poor local minima.
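
As a rough sketch of the greedy recipe (using simple tied-weight sigmoid autoencoders as the per-layer learner instead of the paper's RBM-based modules; every dimension and hyperparameter here is made up): each layer is trained on the codes produced by the layers below it, and the learned weights then initialize a deep network for supervised fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_autoencoder(X, n_hidden, lr=0.1, epochs=50):
    """Train a one-hidden-layer tied-weight autoencoder on X; return encoder weights."""
    n_vis = X.shape[1]
    W = rng.normal(0, 0.1, (n_vis, n_hidden))
    b = np.zeros(n_hidden)                 # hidden bias
    c = np.zeros(n_vis)                    # reconstruction bias
    for _ in range(epochs):
        H = sigmoid(X @ W + b)             # encode
        R = sigmoid(H @ W.T + c)           # decode with tied weights
        # gradients of squared reconstruction error
        dR = (R - X) * R * (1 - R)
        dH = (dR @ W) * H * (1 - H)
        gW = X.T @ dH + dR.T @ H           # tied weights: sum encoder and decoder paths
        W -= lr * gW / len(X)
        b -= lr * dH.mean(axis=0)
        c -= lr * dR.mean(axis=0)
    return W, b

# Greedy layer-wise pretraining: each layer is fit on the previous layer's codes.
X = rng.random((256, 20))
sizes = [16, 8]
reps, weights = X, []
for h in sizes:
    W, b = pretrain_autoencoder(reps, h)
    weights.append((W, b))
    reps = sigmoid(reps @ W + b)
# `weights` would then initialize a deep network before supervised fine-tuning.
```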

Read more

Reducing the Dimensionality of Data with Neural Networks

Published on:

G. E. Hinton and R. R. Salakhutdinov challenge traditional PCA-based dimensionality reduction. Instead, they build an encoder from stacked Restricted Boltzmann Machines (RBMs) trained greedily layer by layer; the decoder mirrors the encoder, and the resulting deep autoencoder is fine-tuned with a cross-entropy objective.
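
To make the RBM building block concrete, here is a minimal contrastive-divergence (CD-1) update for a binary RBM; this is my own toy sketch, and the paper's layer sizes, data, and training schedule differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(V, W, b_h, b_v, lr=0.05):
    """One contrastive-divergence (CD-1) update for a binary RBM."""
    # positive phase: hidden probabilities given the data
    ph = sigmoid(V @ W + b_h)
    h = (rng.random(ph.shape) < ph).astype(float)     # sample hidden units
    # negative phase: reconstruct visibles, then recompute hidden probabilities
    pv = sigmoid(h @ W.T + b_v)
    ph2 = sigmoid(pv @ W + b_h)
    # gradient approximation: data correlations minus reconstruction correlations
    W += lr * (V.T @ ph - pv.T @ ph2) / len(V)
    b_h += lr * (ph - ph2).mean(axis=0)
    b_v += lr * (V - pv).mean(axis=0)
    return W, b_h, b_v

# toy binary data
V = (rng.random((128, 20)) < 0.3).astype(float)
W = rng.normal(0, 0.01, (20, 8))
b_h, b_v = np.zeros(8), np.zeros(20)
for _ in range(100):
    W, b_h, b_v = cd1_step(V, W, b_h, b_v)
```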

Read more

Curriculum Learning

Published in:

Yoshua Bengio, Jerome Louradour, Ronan Collobert, and Jason Weston enact curriculum learning by first teaching a model the easier aspects of a task and gradually increasing the difficulty of training examples until the full task is covered. This procedure results in better local minima, quicker convergence, and a regularization effect.
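
Here is a minimal sketch of the scheduling idea (the difficulty measure, schedule, and the commented-out `train_step` are invented for illustration): training starts on the easiest examples and the pool gradually expands until the whole training set is in play.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy dataset with a per-example difficulty score
X = rng.normal(size=(1000, 10))
y = (X.sum(axis=1) > 0).astype(int)
difficulty = np.abs(X.sum(axis=1))          # far from the boundary = easier here
order = np.argsort(-difficulty)             # easiest first under this toy measure

def curriculum_batches(order, n_stages=5, batch_size=32, steps_per_stage=100):
    """Yield minibatches drawn from a gradually growing pool of examples."""
    for stage in range(1, n_stages + 1):
        pool = order[: int(len(order) * stage / n_stages)]   # expand the pool
        for _ in range(steps_per_stage):
            yield rng.choice(pool, size=batch_size, replace=False)

for batch_idx in curriculum_batches(order):
    pass  # train_step(X[batch_idx], y[batch_idx]) would go here
```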

Read more

ImageNet Classification with Deep Convolutional Neural Networks

Published in:

Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton tackle multiclass image recognition by developing a revolutionary deep CNN that utilizes multi-GPU training, ReLU activations, local response normalization, overlapping pooling, patch-based augmentation, and PCA-based color augmentation of the RGB channels.
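
Of those ingredients, local response normalization is the least familiar today; here is a small numpy sketch of the cross-channel formula using the constants reported in the paper (the input tensor itself is random toy data).

```python
import numpy as np

rng = np.random.default_rng(0)

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Cross-channel LRN: each activation is divided by a term summing the
    squares of activations at the same position in nearby channels."""
    C = a.shape[0]                       # a has shape (channels, height, width)
    out = np.empty_like(a)
    for i in range(C):
        lo, hi = max(0, i - n // 2), min(C, i + n // 2 + 1)
        denom = (k + alpha * (a[lo:hi] ** 2).sum(axis=0)) ** beta
        out[i] = a[i] / denom
    return out

a = rng.normal(size=(16, 8, 8))          # toy feature maps
print(local_response_norm(a).shape)
```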

Read more

Efficient Estimation of Word Representations in Vector Space

Published in:

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean examine the lack of meaningful and efficient methods for estimating word vector representations. Here I explain some of the previous methods (LDA/LSA) and how their approach (the CBOW and Skip-gram models) differs from the previous state of the art.

Read more

Sequence to Sequence Learning with Neural Networks

Published on:

Ilya Sutskever, Oriol Vinyals, and Quoc Le utilize a four-layer encoder-decoder LSTM network for machine translation. A key insight is reversing the order of the source tokens, which helps the model achieve state-of-the-art (for the time) BLEU scores.

Read more

Neural Machine Translation by Jointly Learning to Align and Translate

Published on:

Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio address English-to-French translation using an encoder-decoder, GRU-based bidirectional RNN with a novel attention mechanism over bidirectional annotations, matching the phrase-based Moses system on BLEU and planting the seeds for modern attention mechanisms.
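
Here is a small numpy sketch of the additive attention score at one decoder step, in the spirit of the paper's alignment model (the dimensions and random parameters are my own): the previous decoder state is compared against every encoder annotation, and the softmax weights form a context vector.

```python
import numpy as np

rng = np.random.default_rng(0)

T, d_h, d_s, d_a = 10, 6, 8, 7
H = rng.normal(size=(T, d_h))            # bidirectional encoder annotations h_1..h_T
s_prev = rng.normal(size=d_s)            # previous decoder hidden state

# alignment model parameters (randomly initialized for the sketch)
W_a = rng.normal(scale=0.3, size=(d_a, d_s))
U_a = rng.normal(scale=0.3, size=(d_a, d_h))
v_a = rng.normal(scale=0.3, size=d_a)

# e_j = v_a^T tanh(W_a s_{i-1} + U_a h_j): additive / MLP scoring
e = np.tanh(W_a @ s_prev + H @ U_a.T) @ v_a
alpha = np.exp(e - e.max()); alpha /= alpha.sum()   # softmax over source positions

context = alpha @ H                       # weighted sum of annotations
print(alpha.round(3), context.shape)
```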

Read more

Neural GPUs Learn Algorithms

Published in:

Lukasz Kaiser and Ilya Sutskever tackle superlinear sequential tasks such as long addition and long multiplication using a convolutional GRU (CGRU), combining grid search, curriculum learning with minibatches, gradient noise, gate cutoff, dropout, and relaxation pull to achieve 100% accuracy.

Read more

Attention is All You Need

Published in:

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin tackle sequence transduction with a multi-headed, scaled dot-product attention mechanism (the Transformer) that compares queries, keys, and values, obtaining a state-of-the-art BLEU score of 41.0 with an autoregressive encoder-decoder architecture.
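
To close, here is a minimal numpy sketch of a single scaled dot-product attention head, following the well-known formula softmax(QK^T / sqrt(d_k)) V; the shapes and random inputs are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (n_q, n_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V

n, d_k, d_v = 5, 8, 8
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
print(scaled_dot_product_attention(Q, K, V).shape)      # (n, d_v)
```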

Read more