Exploring and annotating the latest research in Generative AI
The Perceptron
Published in:
Frank Rosenblatt plants the seeds for neural networks with his perceptron, taking inspiration from Hebb's hypothesis in connectionist neurology. The perceptron takes a sum (or mean) of input stimuli and uses statistical separability to handle both supervised and unsupervised learning via binomial theory.
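A minimal sketch of the perceptron idea as it is usually taught today: a thresholded weighted sum of inputs, updated whenever the prediction is wrong. The toy AND data, learning rate, and zero initialization are my own illustrative choices, not Rosenblatt's original probabilistic formulation.

```python
import numpy as np

# Minimal perceptron sketch: weighted sum of inputs, thresholded output,
# and the classic error-driven update. Toy AND data, illustrative only.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # inputs
y = np.array([0, 0, 0, 1])                               # AND labels

w = np.zeros(2)
b = 0.0
lr = 0.1

for epoch in range(20):
    for xi, yi in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0   # threshold on the weighted sum
        w += lr * (yi - pred) * xi          # update only when wrong
        b += lr * (yi - pred)

print([1 if xi @ w + b > 0 else 0 for xi in X])   # expected: [0, 0, 0, 1]
```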
R. E. Kalman tackles the Wiener problem (how does one recover an original signal corrupted by additive Gaussian noise?) through the framework of state spaces in optimal control theory. Such a formulation lays the basis for modern state space models; my own research interests involve using Kalman filtering to improve neural networks.
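A quick sketch of a one-dimensional Kalman filter under the standard predict/update recursion; the random-walk model and all noise variances below are assumed for illustration, not tied to any particular dataset.

```python
import numpy as np

# Minimal 1-D Kalman filter sketch: estimate a scalar state from noisy
# observations. All model parameters here are illustrative assumptions.
rng = np.random.default_rng(0)

F, H = 1.0, 1.0          # state transition and observation models (random walk)
Q, R = 0.01, 0.25        # process and observation noise variances (assumed)

true_x = np.cumsum(rng.normal(0, np.sqrt(Q), 100))   # latent signal
z = true_x + rng.normal(0, np.sqrt(R), 100)          # noisy measurements

x_hat, P = 0.0, 1.0      # initial state estimate and covariance
estimates = []
for zk in z:
    # Predict
    x_pred = F * x_hat
    P_pred = F * P * F + Q
    # Update
    K = P_pred * H / (H * P_pred * H + R)            # Kalman gain
    x_hat = x_pred + K * (zk - H * x_pred)
    P = (1 - K * H) * P_pred
    estimates.append(x_hat)

# Filter error vs. raw measurement error: the filtered estimate should be closer
print(np.mean((np.array(estimates) - true_x) ** 2), np.mean((z - true_x) ** 2))
```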
Learning Representations by Back-propagating Errors
Published on:
David Rumelhart, Geoffrey Hinton, and Ronald Williams establish the heart of backpropagation: minimizing the difference between the actual output vector and the desired output vector by accumulating gradients backward through the network via the chain rule.
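A small sketch of backpropagation on a two-layer network with a squared-error objective: forward pass, chain-rule backward pass, gradient step. The toy data, layer sizes, and learning rate are arbitrary choices of mine.

```python
import numpy as np

# Tiny two-layer network trained by backpropagating the squared error.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))                   # inputs
Y = np.sin(X.sum(axis=1, keepdims=True))       # desired outputs

W1, b1 = rng.normal(scale=0.5, size=(3, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)
lr = 0.1

for step in range(500):
    h = np.tanh(X @ W1 + b1)                   # forward pass
    out = h @ W2 + b2
    err = out - Y                              # dE/d(out) for E = 0.5*||out - Y||^2
    # Backward pass: chain rule, layer by layer
    dW2 = h.T @ err / len(X)
    db2 = err.mean(axis=0)
    dh = err @ W2.T * (1 - h ** 2)             # tanh'(a) = 1 - tanh(a)^2
    dW1 = X.T @ dh / len(X)
    db1 = dh.mean(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

pred = np.tanh(X @ W1 + b1) @ W2 + b2
print(float(np.mean((pred - Y) ** 2)))         # the loss should have shrunk
```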
Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer
Published on:
Slava M. Katz develops a novel nonlinear recursive procedure for redistributing the probability mass of unseen m-grams using Good-Turing discounting. This work was a major advance in sparse, memory-efficient language modeling before the prominence of cross-entropy methods and neural networks.
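A deliberately simplified back-off sketch in the spirit of Katz's procedure: seen bigram counts are discounted and the freed-up probability mass is redistributed to unseen continuations via a unigram back-off. A fixed absolute discount stands in for Katz's Good-Turing estimate, and the tiny corpus is made up, so treat this as an illustration of the redistribution idea rather than the paper's algorithm.

```python
from collections import Counter

# Simplified back-off sketch: discount seen bigram counts and hand the leftover
# mass to a unigram back-off. The fixed discount D is an assumption, not
# Katz's Good-Turing estimate.
corpus = "the cat sat on the mat the cat ate".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
D = 0.5

def p_backoff(w_prev, w):
    total = sum(unigrams.values())
    p_uni = unigrams[w] / total
    c_bi, c_prev = bigrams[(w_prev, w)], unigrams[w_prev]
    if c_bi > 0:
        return (c_bi - D) / c_prev                      # discounted ML estimate
    # leftover mass from discounting, spread over unseen continuations
    seen = [v for (a, _), v in bigrams.items() if a == w_prev]
    leftover = D * len(seen) / c_prev
    unseen_mass = sum(unigrams[x] / total for x in unigrams
                      if (w_prev, x) not in bigrams)
    return leftover * p_uni / unseen_mass if unseen_mass else p_uni

print(p_backoff("the", "cat"), p_backoff("the", "ate"))   # seen vs. unseen bigram
```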
Yann Le Cun formalizes backpropagation in neural networks using Lagrangian/Hamiltonian analysis from optimal control theory and generalizes the framework to the cases of weight functions and recurrent networks.
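A sketch of the Lagrangian view of backpropagation in the standard optimal-control notation (the symbols are mine, not necessarily the paper's): layer dynamics become constraints, the multipliers become the backward pass, and the weight gradients fall out of stationarity.

```latex
% Sketch of the Lagrangian view of backpropagation (standard optimal-control
% notation; not necessarily the paper's exact symbols).
\begin{align*}
  &\text{Layer dynamics:} \quad x_{k+1} = f_k(x_k, w_k), \qquad k = 0, \dots, N-1\\
  &\text{Lagrangian:} \quad
    \mathcal{L} = C(x_N) + \sum_{k=0}^{N-1} \lambda_{k+1}^{\top}
      \bigl(x_{k+1} - f_k(x_k, w_k)\bigr)\\
  &\partial \mathcal{L}/\partial x_N = 0
    \;\Rightarrow\; \lambda_N = -\nabla C(x_N)\\
  &\partial \mathcal{L}/\partial x_k = 0
    \;\Rightarrow\; \lambda_k = \Bigl(\tfrac{\partial f_k}{\partial x_k}\Bigr)^{\!\top} \lambda_{k+1}
    \quad \text{(the backward pass)}\\
  &\nabla_{w_k} \mathcal{L} = -\Bigl(\tfrac{\partial f_k}{\partial w_k}\Bigr)^{\!\top} \lambda_{k+1}
    \quad \text{(the weight gradient)}
\end{align*}
```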
In this historic paper, Yann Le Cun, John Denker, and Sara Solla prune unimportant weights by computing parameter saliency via second-derivative analysis, thereby reducing the number of redundant parameters by a factor of four.
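A tiny sketch of the saliency-based pruning idea: under a diagonal Hessian approximation, the saliency of a weight is roughly half its squared value times the corresponding second derivative, and the least salient weights are zeroed. The weights and Hessian diagonal below are random stand-ins, not values from a trained network.

```python
import numpy as np

# Saliency-based pruning sketch with a diagonal Hessian approximation:
# saliency_i ~ h_ii * w_i^2 / 2 estimates the loss increase if w_i is removed.
rng = np.random.default_rng(0)
w = rng.normal(size=10)                 # trained weights (illustrative stand-in)
h_diag = np.abs(rng.normal(size=10))    # diagonal second derivatives (illustrative)

saliency = 0.5 * h_diag * w ** 2
prune = np.argsort(saliency)[: len(w) // 2]   # drop the least salient half

w_pruned = w.copy()
w_pruned[prune] = 0.0
print(sorted(prune.tolist()))           # indices of the pruned weights
```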
Learning Long-Term Dependencies with Gradient Descent is Difficult
Published in:
Yoshua Bengio, Patrice Simard, and Paolo Frasconi formalize the vanishing gradient problem using hyperbolic attractor theory in a recurrent neuron. After extending this to a dynamical system, they offer alternative approaches: simulated annealing, multi-grid random search, time-weighted pseudo-Newton optimization, and discrete error propagation.
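A small numerical sketch of why gradients vanish in a recurrent network: the gradient through T steps is a product of T Jacobians, and for a tanh recurrence with modest weights its norm decays geometrically. The recurrent weight matrix and inputs are random and purely illustrative.

```python
import numpy as np

# Vanishing-gradient sketch: accumulate the Jacobian of h_T w.r.t. h_0 for a
# tanh RNN and watch its norm shrink. Weights are small and random.
rng = np.random.default_rng(0)
hidden = 16
W = rng.normal(scale=0.08, size=(hidden, hidden))   # small recurrent weights

h = np.zeros(hidden)
grad = np.eye(hidden)                   # d h_T / d h_0, built up step by step
for t in range(50):
    a = W @ h + rng.normal(scale=0.1, size=hidden)
    h = np.tanh(a)
    J = (1 - h ** 2)[:, None] * W       # Jacobian of tanh(W h + u) w.r.t. h
    grad = J @ grad
    if t % 10 == 9:
        print(t + 1, np.linalg.norm(grad))   # norm decays toward zero
```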
Corinna Cortes and Vladimir Vapnik revolutionize two-group classification with support vectors, derived from the convolution of the dot product (the kernel). This kernel allows a nonlinear decision surface to be expressed linearly in a high-dimensional feature space, which in turn can be optimized using Lagrangians for classification.
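A quick kernel-SVM sketch using scikit-learn as a modern stand-in for the paper's solver: an RBF kernel turns a circularly separable toy problem into one a linear decision surface can handle in feature space, with the dual (Lagrangian) optimization handled inside SVC.

```python
import numpy as np
from sklearn.svm import SVC

# Kernel SVM sketch on toy data with a circular decision boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)   # not linearly separable in 2-D

clf = SVC(kernel="rbf", C=1.0)          # dual (Lagrangian) optimization inside
clf.fit(X, y)
print(clf.score(X, y), len(clf.support_))   # accuracy and number of support vectors
```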
Mike Schuster and Kuldip K. Paliwal improve recurrent neural networks (RNNs) by considering the forward and backward states at the same time during training. These states are treated independently and allow the RNN to better grasp bidirectional context. Using this new architecture, they achieve state-of-the-art phoneme classification results on the TIMIT dataset.
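A bidirectional RNN sketch in PyTorch as a convenient stand-in for the paper's setup: forward and backward hidden states are computed independently and concatenated at each time step, so every output sees both left and right context. The input shapes are made up (loosely MFCC-like), not the actual TIMIT pipeline.

```python
import torch
import torch.nn as nn

# Bidirectional RNN sketch: forward and backward passes run independently and
# their hidden states are concatenated at every time step.
rnn = nn.RNN(input_size=13, hidden_size=32, batch_first=True, bidirectional=True)

x = torch.randn(4, 100, 13)        # (batch, frames, features), MFCC-like toy input
out, h_n = rnn(x)

print(out.shape)                   # (4, 100, 64): forward + backward states concatenated
print(h_n.shape)                   # (2, 4, 32): final hidden state per direction
```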
Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle investigate a greedy layer-wise training approach to Deep Belief Networks on the Cotton, Abalone, and MNIST datasets. By using this algorithm for pretraining, weights are better initialized and are therefore less likely to converge to bad extrema.
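A greedy layer-wise pretraining sketch in which each layer is trained to reconstruct the frozen representation beneath it before the next layer is stacked on top. A simple sigmoid autoencoder criterion stands in for the paper's RBM training, and the data and sizes are toy values.

```python
import torch
import torch.nn as nn

# Greedy layer-wise pretraining sketch: train one layer at a time to reconstruct
# the (frozen) features below it, then stack. Autoencoder criterion stands in
# for RBM training; data and sizes are illustrative.
torch.manual_seed(0)
X = torch.rand(256, 20)                      # toy data in [0, 1]
sizes = [20, 64, 32]

layers, inputs = [], X
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    enc, dec = nn.Linear(d_in, d_out), nn.Linear(d_out, d_in)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-2)
    for _ in range(200):                     # train this layer only
        h = torch.sigmoid(enc(inputs))
        loss = nn.functional.mse_loss(torch.sigmoid(dec(h)), inputs)
        opt.zero_grad(); loss.backward(); opt.step()
    layers.append(enc)
    inputs = torch.sigmoid(enc(inputs)).detach()   # frozen features feed the next layer

pretrained = nn.Sequential(*[nn.Sequential(l, nn.Sigmoid()) for l in layers])
print(pretrained(X).shape)                   # (256, 32): ready for supervised fine-tuning
```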
Reducing the Dimensionality of Data with Neural Networks
Published on:
G. E. Hinton and R. R. Salakhutdinov challenge traditional PCA-based dimensionality reduction. Instead, they opt for a Restricted Boltzmann Machine (RBM) based encoder that is trained greedily, layer by layer. The decoder mirrors the encoder, and the full autoencoder is fine-tuned using cross-entropy.
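A deep autoencoder sketch with a mirrored decoder and a cross-entropy reconstruction loss. The RBM pretraining stage is omitted here, so the weights start random, which is precisely the failure mode the paper's layer-wise pretraining is designed to avoid; the data is a random stand-in for flattened MNIST images.

```python
import torch
import torch.nn as nn

# Deep autoencoder sketch: mirrored decoder, pixel-wise cross-entropy loss.
# RBM pretraining omitted; data is a random stand-in for flattened MNIST.
torch.manual_seed(0)
X = torch.rand(512, 784)

encoder = nn.Sequential(nn.Linear(784, 256), nn.Sigmoid(),
                        nn.Linear(256, 30))                   # low-dimensional code
decoder = nn.Sequential(nn.Linear(30, 256), nn.Sigmoid(),
                        nn.Linear(256, 784))                  # mirror of the encoder

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for step in range(200):
    logits = decoder(encoder(X))
    loss = nn.functional.binary_cross_entropy_with_logits(logits, X)
    opt.zero_grad(); loss.backward(); opt.step()

print(encoder(X).shape, float(loss))          # (512, 30) codes, final reconstruction loss
```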
Yoshua Bengio, Jerome Louradour, Ronan Collobert, and Jason Weston enact curriculum learning by teaching a model the easier aspects of a task first and gradually increasing the difficulty until the model is training on the full task. This procedure results in better local minima, quicker convergence, and a regularization effect.
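A curriculum-learning sketch: train on the easiest fraction of the data first and gradually mix in harder examples. The difficulty measure (input magnitude), the polynomial model, and the four-stage schedule are all illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

# Curriculum-learning sketch: expand from "easy" to "hard" training examples.
# Difficulty here is (arbitrarily) the magnitude of the input.
rng = np.random.default_rng(0)
X_all = rng.uniform(-1, 1, size=2000)
y_all = np.sin(3 * X_all)
difficulty = np.abs(X_all)                    # assumed difficulty measure
order = np.argsort(difficulty)                # easiest examples first

w = np.zeros(6)                               # polynomial regression weights
def features(x): return np.stack([x ** k for k in range(6)], axis=1)

for stage, frac in enumerate([0.25, 0.5, 0.75, 1.0]):   # growing curriculum
    idx = order[: int(frac * len(order))]
    X, y = X_all[idx], y_all[idx]
    for _ in range(500):                      # plain gradient descent on this stage
        pred = features(X) @ w
        grad = features(X).T @ (pred - y) / len(X)
        w -= 0.1 * grad
    mse_all = float(np.mean((features(X_all) @ w - y_all) ** 2))
    print(f"stage {stage}: mse on all data = {mse_all:.4f}")
```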
ImageNet Classification with Deep Convolutional Neural Networks
Published in:
Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton tackle multiclass image recognition by developing a revolutionary deep CNN that utilizes GPU parallelism, ReLU activations, local response normalization, overlapping pooling, patch augmentation, and PCA-based augmentation of the RGB channels.
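A scaled-down sketch of the architectural ingredients named above: ReLU activations, local response normalization, and overlapping max pooling (kernel 3, stride 2). Layer sizes are toy values; this is not the full two-GPU ImageNet network.

```python
import torch
import torch.nn as nn

# Scaled-down CNN sketch with AlexNet-style ingredients: ReLU, local response
# normalization, and overlapping pooling. Sizes are toy values.
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2),
    nn.ReLU(),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    nn.MaxPool2d(kernel_size=3, stride=2),          # overlapping pooling
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 10),                      # 10-way classifier for the toy setup
)

x = torch.randn(2, 3, 64, 64)                       # a couple of fake RGB images
print(model(x).shape)                               # (2, 10)
```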
Efficient Estimation of Word Representations in Vector Space
Published in:
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean address the lack of efficient methods for estimating meaningful word vector representations. Here I cover some of the previous methods (LDA/LSA) and how their approach differs from the prior state-of-the-art.
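A tiny skip-gram sketch with a full softmax (no hierarchical softmax or negative sampling, which the word2vec work relies on for efficiency): predict context words from a center word on a made-up corpus.

```python
import numpy as np

# Tiny skip-gram sketch (full softmax): predict context words from a center
# word on a toy corpus. Dimensions and data are illustrative only.
rng = np.random.default_rng(0)
corpus = "the king rules the land the queen rules the land".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8

W_in = rng.normal(scale=0.1, size=(V, D))      # input (center-word) vectors
W_out = rng.normal(scale=0.1, size=(D, V))     # output (context) vectors
pairs = [(idx[corpus[i]], idx[corpus[j]])
         for i in range(len(corpus))
         for j in (i - 1, i + 1) if 0 <= j < len(corpus)]

lr = 0.1
for epoch in range(200):
    for center, context in pairs:
        v = W_in[center]
        scores = v @ W_out
        p = np.exp(scores - scores.max()); p /= p.sum()
        p[context] -= 1.0                      # gradient of cross-entropy w.r.t. scores
        grad_out = np.outer(v, p)
        grad_in = W_out @ p
        W_out -= lr * grad_out
        W_in[center] -= lr * grad_in

print(vocab, W_in.shape)                       # (V, 8) learned word vectors
```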
Sequence to Sequence Learning with Neural Networks
Published on:
Ilya Sutskever, Oriol Vinyals, and Quoc Le utilize a four-layer encoder-decoder LSTM network for language translation. Key insights include reversing the order of the source tokens, resulting in state-of-the-art (for the time) BLEU scores.
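An encoder-decoder LSTM sketch showing the source-reversal trick: the source token order is flipped before encoding, which shortens the early source-target dependencies the decoder must bridge. Vocabulary sizes, dimensions, and the random token ids are all made up.

```python
import torch
import torch.nn as nn

# Encoder-decoder LSTM sketch with the source-reversal trick. All sizes and
# token ids are made up; layers are shallower than the paper's four.
torch.manual_seed(0)
SRC_V, TGT_V, EMB, HID = 1000, 1000, 64, 128

src_emb, tgt_emb = nn.Embedding(SRC_V, EMB), nn.Embedding(TGT_V, EMB)
encoder = nn.LSTM(EMB, HID, num_layers=2, batch_first=True)
decoder = nn.LSTM(EMB, HID, num_layers=2, batch_first=True)
proj = nn.Linear(HID, TGT_V)

src = torch.randint(0, SRC_V, (4, 12))            # (batch, src_len) token ids
tgt = torch.randint(0, TGT_V, (4, 10))            # (batch, tgt_len) token ids

src_reversed = torch.flip(src, dims=[1])          # the key insight: reverse the source
_, (h, c) = encoder(src_emb(src_reversed))        # final state summarizes the source
dec_out, _ = decoder(tgt_emb(tgt), (h, c))        # teacher-forced decoding
logits = proj(dec_out)                            # (4, 10, TGT_V) next-token scores
print(logits.shape)
```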
Neural Machine Translation by Jointly Learning to Align and Translate
Published on:
Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio address English-to-French translation using an encoder-decoder GRU-based BiRNN with a novel attention mechanism over bidirectional annotations, achieving BLEU scores comparable to the phrase-based Moses system and planting the seeds for modern attention mechanisms.
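A one-step additive attention sketch in the spirit of the paper: each bidirectional annotation is scored against the previous decoder state, the scores are softmaxed, and the context vector is the weighted sum of annotations. All shapes and parameters are random illustrative values.

```python
import numpy as np

# Additive attention sketch for one decoder step: score annotations against the
# previous decoder state, softmax, and take the weighted sum as the context.
rng = np.random.default_rng(0)
T, enc_dim, dec_dim, attn_dim = 12, 16, 20, 10

annotations = rng.normal(size=(T, enc_dim))   # h_1..h_T from the bidirectional encoder
s_prev = rng.normal(size=dec_dim)             # previous decoder hidden state

W_a = rng.normal(scale=0.1, size=(attn_dim, dec_dim))
U_a = rng.normal(scale=0.1, size=(attn_dim, enc_dim))
v_a = rng.normal(scale=0.1, size=attn_dim)

scores = np.tanh(annotations @ U_a.T + s_prev @ W_a.T) @ v_a   # one score per position
alpha = np.exp(scores - scores.max()); alpha /= alpha.sum()    # attention weights
context = alpha @ annotations                                  # weighted sum of annotations
print(alpha.shape, context.shape)                              # (12,), (16,)
```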
Lukasz Kaiser and Ilya Sutskever tackle sequential algorithmic tasks such as long addition and (superlinear) long multiplication using a convolutional GRU (CGRU), combining grid search, curriculum learning with minibatches, gradient noise, gate cutoff, dropout, and relaxation pull to achieve 100% accuracy.
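A convolutional GRU cell sketch: the usual GRU gating, but with every linear map replaced by a convolution over a 2-D state. The kernel size and state shape are illustrative, and none of the training tricks listed above (curriculum, gradient noise, gate cutoff, relaxation pull) appear here.

```python
import torch
import torch.nn as nn

# Convolutional GRU (CGRU) cell sketch: GRU-style gating where every linear map
# is a convolution over a 2-D state. Shapes and kernel size are illustrative.
class CGRUCell(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.update = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.reset = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.cand = nn.Conv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, s):
        u = torch.sigmoid(self.update(s))            # update gate
        r = torch.sigmoid(self.reset(s))             # reset gate
        c = torch.tanh(self.cand(r * s))             # candidate state
        return u * s + (1 - u) * c                   # gated state update

cell = CGRUCell(channels=24)
state = torch.randn(1, 24, 4, 16)                    # (batch, channels, width, length)
for _ in range(16):                                  # iterate the cell over "time"
    state = cell(state)
print(state.shape)                                   # (1, 24, 4, 16), shape preserved
```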
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin tackle the problem of sequence transduction with a multi-headed dot-product attention mechanism (the Transformer) that compares queries, keys, and values, obtaining a state-of-the-art BLEU score of 41.0 with an autoregressive encoder-decoder architecture.
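A single-head scaled dot-product attention sketch, the core operation of the Transformer: compare queries with keys, softmax the scaled scores, and mix the values; multi-head attention runs several of these in parallel over learned projections. The tensors below are random toy values.

```python
import numpy as np

# Scaled dot-product attention sketch: softmax(Q K^T / sqrt(d_k)) V for a
# single head on random toy tensors.
rng = np.random.default_rng(0)
seq_len, d_k, d_v = 6, 8, 8

Q = rng.normal(size=(seq_len, d_k))     # queries
K = rng.normal(size=(seq_len, d_k))     # keys
V = rng.normal(size=(seq_len, d_v))     # values

scores = Q @ K.T / np.sqrt(d_k)                         # scaled compatibility scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)          # row-wise softmax
attended = weights @ V                                  # each position mixes all values
print(weights.shape, attended.shape)                    # (6, 6), (6, 8)
```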