Pay Attention

Exploring and annotating the latest research in Generative AI

The Perceptron

Published in:

Frank Rosenblatt plants the seeds for neural networks with his perceptron, drawing inspiration from Hebb's hypothesis in connectionist neurology. The perceptron computes a sum (or mean) of input stimuli and uses statistical separability to handle both supervised and unsupervised learning via binomial theory.
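
To make the mechanism concrete, here is a minimal sketch (my own illustration, not Rosenblatt's original formulation) of a perceptron trained with the classic error-driven update rule on a toy linearly separable problem; the data and values are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable data: label is the sign of a fixed linear function.
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 2 * X[:, 1] > 0, 1, -1)

w = np.zeros(2)   # weights
b = 0.0           # bias (threshold)

# Classic perceptron rule: on a mistake, nudge the weights toward the example.
for _ in range(20):
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:      # misclassified (or on the boundary)
            w += yi * xi
            b += yi

accuracy = np.mean(np.sign(X @ w + b) == y)
print(f"training accuracy: {accuracy:.2f}")
```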

Read more

Kalman Filtering

Published in:

R. E. Kalman tackles the Wiener problem, namely how to recover an original signal corrupted by additive Gaussian noise, through the framework of state spaces in optimal control theory. This formulation forms the basis of modern state-space models, and my own research interests involve using Kalman filtering to improve neural networks.
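
As a concrete illustration, below is a minimal sketch of the predict/update recursion for a 1-D constant-velocity tracking model; the matrices and noise levels are toy choices of mine, not anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition (position, velocity)
H = np.array([[1.0, 0.0]])              # we only observe position
Q = 0.01 * np.eye(2)                    # process noise covariance
R = np.array([[1.0]])                   # measurement noise covariance

x = np.array([0.0, 1.0])                # state estimate
P = np.eye(2)                           # estimate covariance

true_x = np.array([0.0, 1.0])
for _ in range(50):
    # simulate the true system and a noisy measurement
    true_x = F @ true_x
    z = H @ true_x + rng.normal(scale=1.0, size=1)

    # predict
    x = F @ x
    P = F @ P @ F.T + Q

    # update
    S = H @ P @ H.T + R                 # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(2) - K @ H) @ P

print("estimated position/velocity:", x)
```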

Read more

Learning Representations by Back-propagating Errors

Published on:

David Rumelhart, Geoffrey Hinton, and Ronald Williams establish the heart of backpropagation: minimizing the difference between the actual output vector and the desired output vector by accumulating gradients through repeated application of the chain rule.
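
Here is a minimal numpy sketch of the idea, minimizing squared error in a one-hidden-layer network by pushing the error backwards with the chain rule; the architecture, data, and learning rate are toy choices of mine, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data and a tiny 2-3-1 network.
X = rng.normal(size=(64, 2))
t = (X[:, :1] * X[:, 1:]) + 0.1 * rng.normal(size=(64, 1))   # desired output vector

W1, b1 = rng.normal(scale=0.5, size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(scale=0.5, size=(3, 1)), np.zeros(1)
lr = 0.05

for _ in range(500):
    # forward pass
    h = np.tanh(X @ W1 + b1)
    y = h @ W2 + b2
    err = y - t                          # actual minus desired output

    # backward pass: chain rule, layer by layer
    dW2 = h.T @ err / len(X)
    db2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1 - h ** 2)     # gradient through tanh
    dW1 = X.T @ dh / len(X)
    db1 = dh.mean(axis=0)

    # gradient descent step
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final mean squared error:", float((err ** 2).mean()))
```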

Read more

Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer

Published on:

Slava M. Katz develops a novel nonlinear recursive procedure for redistributing the probability mass of unseen m-grams using Turing discounting. This work was a huge advancement in sparse, memory-efficient language modeling before the prominence of cross-entropy and neural networks.
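
The sketch below shows the back-off idea on bigrams, but with a fixed absolute discount standing in for the paper's Turing estimates, so it is only a simplified approximation of Katz's actual procedure; the corpus and discount value are invented.

```python
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
D = 0.5                                   # fixed discount (Katz uses Turing discounting)
N = len(corpus)

def p_unigram(w):
    return unigrams[w] / N

def p_backoff(prev, w):
    """Discounted bigram probability, backing off to unigrams for unseen pairs."""
    if bigrams[(prev, w)] > 0:
        return (bigrams[(prev, w)] - D) / unigrams[prev]
    # mass freed by discounting seen bigrams, redistributed over unseen continuations
    seen_types = sum(1 for (a, _) in bigrams if a == prev)
    left_over = D * seen_types / unigrams[prev]
    unseen_mass = sum(p_unigram(w2) for w2 in unigrams
                      if bigrams[(prev, w2)] == 0)
    return left_over * p_unigram(w) / unseen_mass

print(p_backoff("the", "cat"), p_backoff("the", "dog"), p_backoff("sat", "rug"))
```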

Read more

A Theoretical Framework for Back-Propagation

Published in:

Yann LeCun formalizes backpropagation in neural networks using Lagrangian/Hamiltonian analysis from optimal control theory and generalizes the framework to the cases of weight functions and recurrent networks.
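
For readers who want the flavor of the derivation, here is a compact adjoint/Lagrangian sketch in my own notation (signs and symbols may differ from the paper's): the forward pass is treated as a constraint, and the backward recursion falls out of the stationarity conditions.

```latex
\begin{aligned}
&\text{minimize } C(x_N) \quad \text{s.t. } x_k = F_k(x_{k-1}, w_k),\; k = 1,\dots,N,\\[4pt]
&\mathcal{L} = C(x_N) + \sum_{k=1}^{N} b_k^{\top}\bigl(x_k - F_k(x_{k-1}, w_k)\bigr),\\[4pt]
&\frac{\partial \mathcal{L}}{\partial x_N} = 0 \;\Rightarrow\; b_N = -\nabla C(x_N),\qquad
\frac{\partial \mathcal{L}}{\partial x_k} = 0 \;\Rightarrow\;
b_k = \Bigl(\tfrac{\partial F_{k+1}}{\partial x_k}\Bigr)^{\!\top} b_{k+1},\\[4pt]
&\frac{\partial \mathcal{L}}{\partial w_k} = -\Bigl(\tfrac{\partial F_k}{\partial w_k}\Bigr)^{\!\top} b_k
\quad\text{(the usual backpropagated gradient, up to sign).}
\end{aligned}
```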

Read more

Optimal Brain Damage

Published on:

Yann LeCun, John Denker, and Sara Solla examine pruning unimportant weights in this historic paper, computing parameter saliency via a second-derivative (diagonal Hessian) analysis and thereby reducing the number of redundant parameters by a factor of four.
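
Below is a minimal sketch of the saliency-and-prune step, assuming the diagonal Hessian entries are already available (in the paper they come from a backprop-like pass; here they are just placeholder values, as are the weights).

```python
import numpy as np

rng = np.random.default_rng(0)

w = rng.normal(size=100)                  # trained weights (toy values)
h_diag = np.abs(rng.normal(size=100))     # placeholder diagonal Hessian entries

# Optimal Brain Damage saliency: s_k = h_kk * w_k^2 / 2
saliency = 0.5 * h_diag * w ** 2

# prune the 75% of weights with the lowest saliency (a factor-of-four reduction)
k = len(w) // 4
keep = np.argsort(saliency)[-k:]
mask = np.zeros_like(w, dtype=bool)
mask[keep] = True
w_pruned = np.where(mask, w, 0.0)

print("nonzero weights after pruning:", int(mask.sum()))
```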

Read more

Learning Long-Term Dependencies with Gradient Descent is Difficult

Published in:

Yoshua Bengio, Patrice Simard, and Paolo Frasconi formalize the vanishing-gradient problem using hyperbolic attractor theory for a single recurrent neuron. After extending the analysis to general dynamical systems, they offer alternative approaches: simulated annealing, multi-grid random search, time-weighted pseudo-Newton optimization, and discrete error propagation.
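
Here is a small numerical illustration of the effect (my own toy example, not the paper's analysis): for a tanh RNN, the gradient with respect to an early hidden state is a product of Jacobians whose norm shrinks rapidly when the recurrent weights are contractive.

```python
import numpy as np

rng = np.random.default_rng(0)

n, T = 16, 60
W = rng.normal(scale=0.3, size=(n, n))     # contractive recurrent weights
h = rng.normal(scale=0.1, size=n)

grad = np.eye(n)                           # running product of Jacobians, d h_t / d h_0
norms = []
for t in range(T):
    h = np.tanh(W @ h)
    J = (1 - h ** 2)[:, None] * W          # Jacobian d h_{t+1} / d h_t
    grad = J @ grad                        # left-multiply: product now spans steps 0..t+1
    norms.append(np.linalg.norm(grad))

print("gradient norm after 1, 10, 60 steps:", norms[0], norms[9], norms[-1])
```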

Read more

Support-Vector Networks

Published on:

Corinna Cortes and Vladimir Vapnik revolutionize two-group classification with support vectors, derived from a convolution of the dot product (a kernel). The kernel lets a nonlinear decision surface be treated as a linear one in a high-dimensional feature space, which in turn can be optimized for classification using Lagrangian (dual) methods.
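
For reference, here is the soft-margin dual problem in standard textbook notation (not copied from the paper): the kernel K replaces the dot product, and the decision surface is the sign of a kernel expansion over the support vectors.

```latex
\begin{aligned}
\max_{\alpha}\;& \sum_{i=1}^{\ell} \alpha_i
  - \tfrac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j)\\
\text{s.t. }\;& 0 \le \alpha_i \le C,\qquad \sum_{i=1}^{\ell} \alpha_i y_i = 0,\\[4pt]
f(x) =\;& \operatorname{sign}\Bigl(\sum_{i=1}^{\ell} \alpha_i y_i\, K(x_i, x) + b\Bigr).
\end{aligned}
```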

Read more

Bidirectional Recurrent Neural Networks

Published in:

Mike Schuster and Kuldip K. Paliwal improve recurrent neural networks (RNNs) by considering forward and backward states simultaneously during training. These states are treated independently and allow the RNN to better grasp bidirectional context. Using this new architecture, they achieve state-of-the-art phoneme classification results on the TIMIT speech dataset.
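
A minimal numpy sketch of the bidirectional idea (toy dimensions and plain tanh cells of my choosing, not the paper's exact architecture): one pass reads the sequence forwards, an independent pass reads it backwards, and their states are concatenated at each time step.

```python
import numpy as np

rng = np.random.default_rng(0)

T, d_in, d_h = 8, 4, 5
X = rng.normal(size=(T, d_in))                 # one input sequence

def rnn_pass(X, Wx, Wh):
    """Simple tanh RNN; returns the hidden state at every time step."""
    h = np.zeros(Wh.shape[0])
    states = []
    for x in X:
        h = np.tanh(Wx @ x + Wh @ h)
        states.append(h)
    return np.stack(states)

# two independent sets of parameters: forward and backward directions
Wx_f, Wh_f = rng.normal(scale=0.3, size=(d_h, d_in)), rng.normal(scale=0.3, size=(d_h, d_h))
Wx_b, Wh_b = rng.normal(scale=0.3, size=(d_h, d_in)), rng.normal(scale=0.3, size=(d_h, d_h))

h_fwd = rnn_pass(X, Wx_f, Wh_f)                # reads t = 1 .. T
h_bwd = rnn_pass(X[::-1], Wx_b, Wh_b)[::-1]    # reads t = T .. 1, then re-aligned

H = np.concatenate([h_fwd, h_bwd], axis=1)     # bidirectional context per step
print(H.shape)                                 # (T, 2 * d_h)
```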

Read more

Greedy Layer-Wise Training of Deep Networks

Published in:

Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle investigate a greedy layer-wise training approach to Deep Belief Networks on the Cotton and Abalone datasets and MNIST. By using this algorithm for pretraining, weights are better initialized and therefore less likely to converge to poor local minima.
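
As a rough sketch of the greedy recipe (using simple tied-weight sigmoid autoencoders as the per-layer learner instead of the paper's RBM-based modules; every dimension and hyperparameter here is made up): each layer is trained on the codes produced by the layers below it, and the learned weights then initialize a deep network for supervised fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_autoencoder(X, n_hidden, lr=0.1, epochs=50):
    """Train a one-hidden-layer tied-weight autoencoder on X; return encoder weights."""
    n_vis = X.shape[1]
    W = rng.normal(0, 0.1, (n_vis, n_hidden))
    b = np.zeros(n_hidden)                 # hidden bias
    c = np.zeros(n_vis)                    # reconstruction bias
    for _ in range(epochs):
        H = sigmoid(X @ W + b)             # encode
        R = sigmoid(H @ W.T + c)           # decode with tied weights
        # gradients of squared reconstruction error
        dR = (R - X) * R * (1 - R)
        dH = (dR @ W) * H * (1 - H)
        gW = X.T @ dH + dR.T @ H           # tied weights: sum encoder and decoder paths
        W -= lr * gW / len(X)
        b -= lr * dH.mean(axis=0)
        c -= lr * dR.mean(axis=0)
    return W, b

# Greedy layer-wise pretraining: each layer is fit on the previous layer's codes.
X = rng.random((256, 20))
sizes = [16, 8]
reps, weights = X, []
for h in sizes:
    W, b = pretrain_autoencoder(reps, h)
    weights.append((W, b))
    reps = sigmoid(reps @ W + b)
# `weights` would then initialize a deep network before supervised fine-tuning.
```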

Read more

Reducing the Dimensionality of Data with Neural Networks

Published on:

G. E. Hinton and R. R. Salakhutdinov challenge traditional PCA-based dimensionality reduction. Instead, they build an encoder from stacked Restricted Boltzmann Machines (RBMs) trained greedily layer by layer; the decoder mirrors the encoder, and the resulting deep autoencoder is fine-tuned with a cross-entropy objective.
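
To make the RBM building block concrete, here is a minimal contrastive-divergence (CD-1) update for a binary RBM; this is my own toy sketch, and the paper's layer sizes, data, and training schedule differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(V, W, b_h, b_v, lr=0.05):
    """One contrastive-divergence (CD-1) update for a binary RBM."""
    # positive phase: hidden probabilities given the data
    ph = sigmoid(V @ W + b_h)
    h = (rng.random(ph.shape) < ph).astype(float)     # sample hidden units
    # negative phase: reconstruct visibles, then recompute hidden probabilities
    pv = sigmoid(h @ W.T + b_v)
    ph2 = sigmoid(pv @ W + b_h)
    # gradient approximation: data correlations minus reconstruction correlations
    W += lr * (V.T @ ph - pv.T @ ph2) / len(V)
    b_h += lr * (ph - ph2).mean(axis=0)
    b_v += lr * (V - pv).mean(axis=0)
    return W, b_h, b_v

# toy binary data
V = (rng.random((128, 20)) < 0.3).astype(float)
W = rng.normal(0, 0.01, (20, 8))
b_h, b_v = np.zeros(8), np.zeros(20)
for _ in range(100):
    W, b_h, b_v = cd1_step(V, W, b_h, b_v)
```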

Read more

Curriculum Learning

Published in:

Yoshua Bengio, Jerome Louradour, Ronan Collobert, and Jason Weston enact curriculum learning by first teaching a model the easier aspects of a task and gradually increasing the difficulty of training examples until the full task is covered. This procedure results in better local minima, quicker convergence, and a regularization effect.
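
Here is a minimal sketch of the scheduling idea (the difficulty measure, schedule, and the commented-out `train_step` are invented for illustration): training starts on the easiest examples and the pool gradually expands until the whole training set is in play.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy dataset with a per-example difficulty score
X = rng.normal(size=(1000, 10))
y = (X.sum(axis=1) > 0).astype(int)
difficulty = np.abs(X.sum(axis=1))          # far from the boundary = easier here
order = np.argsort(-difficulty)             # easiest first under this toy measure

def curriculum_batches(order, n_stages=5, batch_size=32, steps_per_stage=100):
    """Yield minibatches drawn from a gradually growing pool of examples."""
    for stage in range(1, n_stages + 1):
        pool = order[: int(len(order) * stage / n_stages)]   # expand the pool
        for _ in range(steps_per_stage):
            yield rng.choice(pool, size=batch_size, replace=False)

for batch_idx in curriculum_batches(order):
    pass  # train_step(X[batch_idx], y[batch_idx]) would go here
```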

Read more

ImageNet Classification with Deep Convolutional Neural Networks

Published in:

Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton tackle multiclass image recognition by developing a revolutionary deep CNN that utilizes multi-GPU training, ReLU activations, local response normalization, overlapping pooling, patch-based augmentation, and PCA-based color augmentation of the RGB channels.
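
Of those ingredients, local response normalization is the least familiar today; here is a small numpy sketch of the cross-channel formula using the constants reported in the paper (the input tensor itself is random toy data).

```python
import numpy as np

rng = np.random.default_rng(0)

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Cross-channel LRN: each activation is divided by a term summing the
    squares of activations at the same position in nearby channels."""
    C = a.shape[0]                       # a has shape (channels, height, width)
    out = np.empty_like(a)
    for i in range(C):
        lo, hi = max(0, i - n // 2), min(C, i + n // 2 + 1)
        denom = (k + alpha * (a[lo:hi] ** 2).sum(axis=0)) ** beta
        out[i] = a[i] / denom
    return out

a = rng.normal(size=(16, 8, 8))          # toy feature maps
print(local_response_norm(a).shape)
```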

Read more

Efficient Estimation of Word Representations in Vector Space

Published in:

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean examine the lack of meaningful and efficient methods for estimating word vector representations. Here I explain some of the previous methods (LDA/LSA) and how their approach (the CBOW and Skip-gram models) differs from the previous state of the art.

Read more

Sequence to Sequence Learning with Neural Networks

Published on:

Ilya Sutskever, Oriol Vinyals, and Quoc Le utilize a four-layer encoder-decoder LSTM network for machine translation. A key insight is reversing the order of the source tokens, which helps the model achieve state-of-the-art (for the time) BLEU scores.

Read more

Neural Machine Translation by Jointly Learning to Align and Translate

Published on:

Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio address English-to-French translation using an encoder-decoder, GRU-based bidirectional RNN with a novel attention mechanism over bidirectional annotations, matching the phrase-based Moses system on BLEU and planting the seeds for modern attention mechanisms.
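
Here is a small numpy sketch of the additive attention score at one decoder step, in the spirit of the paper's alignment model (the dimensions and random parameters are my own): the previous decoder state is compared against every encoder annotation, and the softmax weights form a context vector.

```python
import numpy as np

rng = np.random.default_rng(0)

T, d_h, d_s, d_a = 10, 6, 8, 7
H = rng.normal(size=(T, d_h))            # bidirectional encoder annotations h_1..h_T
s_prev = rng.normal(size=d_s)            # previous decoder hidden state

# alignment model parameters (randomly initialized for the sketch)
W_a = rng.normal(scale=0.3, size=(d_a, d_s))
U_a = rng.normal(scale=0.3, size=(d_a, d_h))
v_a = rng.normal(scale=0.3, size=d_a)

# e_j = v_a^T tanh(W_a s_{i-1} + U_a h_j): additive / MLP scoring
e = np.tanh(W_a @ s_prev + H @ U_a.T) @ v_a
alpha = np.exp(e - e.max()); alpha /= alpha.sum()   # softmax over source positions

context = alpha @ H                       # weighted sum of annotations
print(alpha.round(3), context.shape)
```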

Read more

Neural GPUs Learn Algorithms

Published in:

Lukasz Kaiser and Ilya Sutskever tackle superlinear sequential tasks such as long addition and long multiplication using a convolutional GRU (CGRU), combining grid search, curriculum learning with minibatches, gradient noise, gate cutoff, dropout, and relaxation pull to achieve 100% accuracy.

Read more

Attention is All You Need

Published in:

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin tackle sequence transduction with a multi-headed, scaled dot-product attention mechanism (the Transformer) that compares queries, keys, and values, obtaining a state-of-the-art BLEU score of 41.0 with an autoregressive encoder-decoder architecture.
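
To close, here is a minimal numpy sketch of a single scaled dot-product attention head, following the well-known formula softmax(QK^T / sqrt(d_k)) V; the shapes and random inputs are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (n_q, n_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V

n, d_k, d_v = 5, 8, 8
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
print(scaled_dot_product_attention(Q, K, V).shape)      # (n, d_v)
```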

Read more