ImageNet Classification
Annotated by: Noah Schliesman
Paper by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012)
Abstract
“Our beauty lies in this extended capacity for convolution”
~Thomas Pynchon
Here, with aid from the paper, I discuss Convolutional Neural Networks (CNNs), the Fast Fourier Transform (FFT), Zero-padding, the Discrete Fourier Transform (DFT), the Cooley-Tukey Algorithm, the ImageNet dataset, Nearest Neighbor Interpolation, Bilinear Interpolation, Bicubic Interpolation, Rectified Linear Units (ReLUs), Cross Validation, Local Response Normalization, Overlapping Pooling, Image Patching, Principal Component Analysis (PCA), and Dropout.
The authors achieve top-1 and top-5 error rates of 37.5% and 17.0% on the ImageNet LSVRC-2010 contest of 1000 classes using a convolutional neural network with five convolutional layers, 60 million parameters, and 650,000 neurons. Some of the convolutional layers are followed by max-pooling layers, and the network ends with three fully-connected layers and a 1000-way softmax. They use non-saturating neurons and an efficient GPU implementation of convolution. They also introduce dropout as a form of regularization to reduce overfitting (KSH12) [1].
Introduction
They begin by emphasizing the importance of larger datasets, more powerful models, and better regularization techniques. Since the capacity of CNNs is controlled by varying their depth and breadth, and since CNNs make strong, largely correct assumptions about images (stationarity of statistics and locality of pixel dependencies), they require far fewer connections and parameters than comparable feedforward networks. By depth and breadth, the authors mean the number of layers and the number of filters per layer (more of each suits larger and more complex datasets).
They mention that the improvement in GPU power and optimizations in 2D convolution can facilitate training very strong deep CNNs without overfitting.
Recall that convolution in Convolutional Neural Networks (CNNs) involves sliding a filter (kernel) over the image and performing elementwise multiplications and summations to produce feature maps. The parallel processing capabilities of GPUs, together with algorithmic optimizations such as the Fast Fourier Transform, make this operation fast enough to train deep networks.
Let \(I\) be an input image of size \(P \times Q\), let \(K\) be a convolution kernel of size \(M \times N\), and let \((x, y)\) denote coordinates in the output feature map. 2D convolution is then described as [2],
\[I * K(x, y) = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} I(x+i, y+j) \cdot K(i, j)\]
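To make the sliding-window picture concrete, here is a minimal NumPy sketch of this direct 2D convolution (written in the cross-correlation form of the formula above, as CNNs typically use, with a "valid" output of size \((P - M + 1) \times (Q - N + 1)\)); the function name and example values are my own, not code from the paper.

```python
import numpy as np

def conv2d_direct(image, kernel):
    """Direct 2D convolution (cross-correlation form, matching the formula above).

    image:  (P, Q) array
    kernel: (M, N) array
    returns the (P - M + 1, Q - N + 1) "valid" feature map
    """
    P, Q = image.shape
    M, N = kernel.shape
    out = np.zeros((P - M + 1, Q - N + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            out[x, y] = np.sum(image[x:x + M, y:y + N] * kernel)
    return out

# Tiny usage example: a 3x3 edge-like filter over a random 8x8 "image".
img = np.random.rand(8, 8)
k = np.array([[1., 0., -1.], [1., 0., -1.], [1., 0., -1.]])
print(conv2d_direct(img, k).shape)  # (6, 6)
```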
The Fourier transformation diagonalizes the convolution operation, thereby turning it into a simple multiplication problem in the frequency domain. This makes computing convolution significantly simpler, leading to motivations for an efficient Fast Fourier Transform (FFT) algorithm.
We aim to compute the FFT of both the image and kernel. To do so, scholars often first apply zero-padding to the image and kernel to avoid circular convolution artifacts and preserve image size.
Without padding, the convolution operation outputs a feature map that is smaller than the original image, and repeated convolutional layers can progressively lose boundary information, which becomes crucial as we add more layers. Zero-padding ensures the spatial dimensions of the output feature maps remain the same as the input.
We zero pad the image and kernel to a size of \((M + P - 1) \times (N + Q - 1)\).
The 2D Discrete Fourier Transform (DFT) of an image \(I\) with frequency domain coordinates \(u\) and \(v\) can be described as [3],
\[F(u,v) = \sum_{x=0}^{P-1} \sum_{y=0}^{Q-1} I(x,y)\, e^{-i 2\pi (ux/P + vy/Q)}\]
Similarly, the inverse 2D DFT is given by,
\[I(x,y) = \frac{1}{PQ} \sum_{u=0}^{P-1} \sum_{v=0}^{Q-1} F(u,v)\, e^{i 2\pi (ux/P + vy/Q)}\]
Due to the direct computation of two nested summations for each of the \(PQ\) output frequencies, the computational complexity of the DFT is \(O(P^2 Q^2)\).
To reduce this complexity, we can apply the Cooley-Tukey Algorithm, which works by recursively breaking down the DFT into smaller DFTs. The algorithm divides the sequence (or in this case matrix) into even and odd indexed parts, computes the DFT of the even-indexed part and the odd-indexed part, and uses the results of the smaller DFTs to construct the DFT of the original sequence.
Due to this divide-and-conquer approach, the computational complexity of the Cooley-Tukey FFT is reduced to \(O(PQ \log(PQ))\). For an image of size \(1024 \times 1024\), this is roughly the difference between \(10^{12}\) and \(10^7\) operations.
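As a sketch of the FFT route just described, the snippet below zero-pads the image and kernel to \((P + M - 1) \times (Q + N - 1)\), multiplies pointwise in the frequency domain, and inverts the transform. Note that this computes a true linear convolution (with a flipped kernel), which differs from the cross-correlation form above only by that flip; the function name and test sizes are illustrative assumptions.

```python
import numpy as np

def conv2d_fft(image, kernel):
    """Linear 2D convolution via the FFT.

    Both arrays are zero-padded to (P + M - 1, Q + N - 1) so that the
    circular convolution implied by the DFT equals the linear convolution.
    """
    P, Q = image.shape
    M, N = kernel.shape
    shape = (P + M - 1, Q + N - 1)
    F_image = np.fft.fft2(image, s=shape)    # fft2 zero-pads to `shape`
    F_kernel = np.fft.fft2(kernel, s=shape)
    # Convolution theorem: pointwise product in the frequency domain.
    return np.real(np.fft.ifft2(F_image * F_kernel))

img = np.random.rand(64, 64)
k = np.random.rand(5, 5)
full = conv2d_fft(img, k)        # (68, 68) "full" convolution
valid = full[4:-4, 4:-4]         # crop to the (60, 60) "valid" region
```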
GPUs consist of many smaller cores that can execute thousands of threads simultaneously. The cuFFT library from NVIDIA's CUDA toolkit efficiently leverages these threads using the Cooley-Tukey algorithm (which is well suited to parallelization). The authors' network takes only five to six days to train on two GTX 580 3GB GPUs.
Aside: Image Resizing Algorithms
To resize an image, we require interpolation to estimate the pixel values at new positions. We often use nearest neighbor, bilinear interpolation, or bicubic interpolation [4].
In nearest neighbor interpolation for image downsampling, we reduce the size of an image by selecting the nearest pixel’s value without considering any other pixel values. To determine the color of each pixel in the downsampled image, you find the closest corresponding pixel in the original image and use its value.
Let the original image be \(I_{H \times W}\) and the downsampled image be \(I'_{H' \times W'}\). The source pixel \((i, j)\) in the original image corresponding to each pixel \((i', j')\) in the downsampled image is found as,
\(i = \text{floor}\left(\frac{i' \cdot H}{H'}\right)\)
\(j = \text{floor}\left(\frac{j' \cdot W}{W'}\right)\)
This is a very lossy form of downsampling that can lose many fine details, despite costing only \(O(1)\) per output pixel. Thus, a better resizing algorithm is often needed.
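A minimal sketch of nearest-neighbor downsampling following the floor mapping above (the function name and example are my own):

```python
import numpy as np

def downsample_nearest(image, new_h, new_w):
    """Nearest-neighbor downsampling using the floor mapping above."""
    H, W = image.shape[:2]
    rows = (np.arange(new_h) * H) // new_h   # i = floor(i' * H / H')
    cols = (np.arange(new_w) * W) // new_w   # j = floor(j' * W / W')
    return image[rows[:, None], cols]        # fancy indexing picks nearest pixels

img = np.arange(16.0).reshape(4, 4)
print(downsample_nearest(img, 2, 2))         # [[0., 2.], [8., 10.]]
```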
In bilinear interpolation, we locate the four closest pixels in the original image that form a square around the target position. We compute weights from how far the target position is from each of the four surrounding pixels, and then take a weighted average of those four pixel values, so the result accounts for multiple neighboring pixels.
We can then calculate our corresponding position in the original image as,
\(x = \frac{i' \cdot (H - 1)}{H' - 1}\)
\(y = \frac{j' \cdot (W - 1)}{W' - 1}\)
After determining the four nearest integer coordinates \((x_0, y_0), (x_1, y_0), (x_0, y_1), (x_1, y_1)\), we can compute the horizontal and vertical distances as,
\(d_x = x - x_0\)
\(d_y = y - y_0\)
And in turn interpolate the color value as,
\(I'(i', j') = (1 - d_x)(1 - d_y)I(x_0, y_0) + d_x(1 - d_y)I(x_1, y_0) + (1 - d_x)d_yI(x_0, y_1) + d_xd_yI(x_1, y_1)\)
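Here is a straightforward (unoptimized) sketch of bilinear resizing that follows the mapping and weighting formulas above for a single-channel image; the clamping of \(x_1, y_1\) at the image border is an implementation choice of mine, not part of the derivation.

```python
import numpy as np

def resize_bilinear(image, new_h, new_w):
    """Bilinear resize following the weighting formula above (grayscale image)."""
    H, W = image.shape
    out = np.zeros((new_h, new_w))
    for i2 in range(new_h):
        for j2 in range(new_w):
            # Corresponding (fractional) position in the original image.
            x = i2 * (H - 1) / (new_h - 1) if new_h > 1 else 0.0
            y = j2 * (W - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0, y0 = int(np.floor(x)), int(np.floor(y))
            x1, y1 = min(x0 + 1, H - 1), min(y0 + 1, W - 1)  # clamp at the border
            dx, dy = x - x0, y - y0
            out[i2, j2] = ((1 - dx) * (1 - dy) * image[x0, y0]
                           + dx * (1 - dy) * image[x1, y0]
                           + (1 - dx) * dy * image[x0, y1]
                           + dx * dy * image[x1, y1])
    return out

img = np.arange(16.0).reshape(4, 4)
print(resize_bilinear(img, 8, 8).shape)  # (8, 8)
```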
While bilinear interpolation requires only four lookups per output pixel (still constant time), it remains somewhat lossy. Bicubic interpolation considers a larger neighborhood: instead of the 2x2 grid around the target position, it uses the surrounding 4x4 grid.
Given the sixteen nearest integer coordinates, we define the cubic convolution kernel as,
\(f(t) = \begin{cases} (\alpha + 2)|t|^3 - (\alpha + 3)|t|^2 + 1 & |t| \leq 1 \\ \alpha|t|^3 - 5\alpha|t|^2 + 8\alpha|t| - 4\alpha & 1 < |t| < 2 \\ 0 & \text{otherwise} \end{cases}\)
With \(\alpha\) as a constant (typically around -0.5).
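For reference, a direct transcription of this piecewise kernel as a small function (with the conventional \(\alpha = -0.5\) as a default):

```python
def cubic_kernel(t, alpha=-0.5):
    """Cubic convolution kernel f(t) from the piecewise definition above."""
    t = abs(t)
    if t <= 1:
        return (alpha + 2) * t**3 - (alpha + 3) * t**2 + 1
    if t < 2:
        return alpha * t**3 - 5 * alpha * t**2 + 8 * alpha * t - 4 * alpha
    return 0.0

print(cubic_kernel(0.0), cubic_kernel(0.5), cubic_kernel(1.5))  # 1.0 at the center
```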
After rescaling the training images to \(256 \times 256\), the authors computed the mean value of each pixel position \((x, y)\) across all images in the training set. Since the images are in RGB format, this is done separately for each color channel (red, green, blue). These means are calculated as,
\(\mu_R(x, y) = \frac{1}{N} \sum_{i=1}^{N} I_i'(x, y, R)\)
\(\mu_G(x, y) = \frac{1}{N} \sum_{i=1}^{N} I_i'(x, y, G)\)
\(\mu_B(x, y) = \frac{1}{N} \sum_{i=1}^{N} I_i'(x, y, B)\)
Then, they subtract the computed mean value from each corresponding pixel in the image. This is done for every pixel position in every image in the dataset. This process shifts the data distribution to center around zero.
\(I_i''(x, y, R) = I_i'(x, y, R) - \mu_R(x, y)\)
\(I_i''(x, y, G) = I_i'(x, y, G) - \mu_G(x, y)\)
\(I_i''(x, y, B) = I_i'(x, y, B) - \mu_B(x, y)\)
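A small sketch of this per-pixel, per-channel mean subtraction, assuming the resized training images are stacked into an array of shape \((N, 256, 256, 3)\) (the random array below is just a stand-in for the real dataset):

```python
import numpy as np

images = np.random.rand(10, 256, 256, 3)   # stand-in for the resized training set
mean_image = images.mean(axis=0)           # mu(x, y) for each channel
centered = images - mean_image             # I'' = I' - mu
print(centered.mean(axis=0).max())         # ~0: the data is now centered
```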
The Architecture
Prior to this paper, the primary activation functions used were the hyperbolic tangent \(f(x) = \tanh(x)\) and the sigmoid \(f(x) = \frac{1}{1 + e^{-x}}\). In their case, using Rectified Linear Units (ReLUs), \(f(x) = \max(0, x)\), greatly accelerated CNN training. Such an activation function mitigates the vanishing gradient problem and results in naturally sparse activations.
They also mention local average pooling as a useful technique for reducing the size of the data while preserving important information. Local average pooling involves sliding a window over the input feature map and computing the average value within each window to produce a downsampled output.
The network's kernels are split equally across the two GPUs. To minimize the latency and bandwidth issues associated with inter-GPU communication, data transfer is restricted to specific layers: layer 3 kernels receive inputs from all of layer 2 (both GPUs), whereas layer 4 kernels receive inputs only from layer 3 kernels on the same GPU.
They also mention the importance of cross-validation. In K-fold Cross-Validation, scientists divide the data into folds, train the model on some parts and test it on the remaining part, repeat, and average the results.
Patterns of connectivity refer to how neurons in different layers interact. In dense layers, every neuron in one layer is connected to every neuron in the next. In convolutional layers, neurons are connected only to local regions of the input, thereby reducing the number of parameters. In sparsely connected layers, only a subset of a layer's neurons is connected to the next layer. Skip connections allow connections to bypass one or more layers, facilitating gradient flow and mitigating the vanishing gradient problem.
In biology, lateral inhibition mutes a neuron's response to a stimulus when a neighboring neuron is excited. A familiar example is the Hermann grid illusion, in which ghostly dark spots appear at the intersections of a white grid on a dark background.
Let the output of a neuron at position \((x, y)\) using kernel \(i\) be denoted as \(a_{x,y,i}\). The number of kernels in the layer is expressed as \(N\), whereas \(n\) is the number of adjacent kernel maps. In local response normalization (LRN), we have three hyperparameters:
- \(k\): baseline constant (typically ~2)
- \(\alpha\): impact of neighboring activities (typically \(10^{-4}\))
- \(\beta\): the scaling exponent of the normalization (typically \(0.75\))
The authors express the activity after normalization \(b_{x,y,i}\) as,
\[b_{x,y,i} = \frac{a_{x,y,i}}{\left(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} a_{x,y,j}^2\right)^{\beta}}\]
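A minimal NumPy sketch of this normalization across adjacent kernel maps, using the paper's reported hyperparameters \(k = 2\), \(n = 5\), \(\alpha = 10^{-4}\), \(\beta = 0.75\) as defaults; the array layout \((\text{maps}, H, W)\) and the function name are my own choices.

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Local response normalization across kernel maps.

    a: activations of shape (N_maps, H, W); each a[i] is normalized by the
    squared activations of its n neighboring maps, per the formula above.
    """
    N_maps = a.shape[0]
    b = np.empty_like(a)
    for i in range(N_maps):
        lo = max(0, i - n // 2)
        hi = min(N_maps - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

acts = np.random.rand(96, 55, 55)   # e.g., first-layer activations
print(local_response_norm(acts).shape)
```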
The authors shift to a discussion of overlapping pooling. Traditionally, each pool (group of neurons) does not share neurons with neighboring groups. When we consider using a stride (spacing between pools) \(s\) that is smaller than the pool size \(z\), overlapping regions lead to redundant summarization of features, enhancing the model’s ability to generalize by preserving more contextual information.
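As an illustration, the sketch below implements max pooling with a window of \(z = 3\) and stride of \(s = 2\) (the values the authors report using), so that neighboring pools overlap; on a \(55 \times 55\) feature map it produces a \(27 \times 27\) output. The function itself is my own sketch, not the authors' code.

```python
import numpy as np

def max_pool2d(x, z=3, s=2):
    """Max pooling with window z and stride s; z > s gives overlapping pools."""
    H, W = x.shape
    out_h = (H - z) // s + 1
    out_w = (W - z) // s + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * s:i * s + z, j * s:j * s + z].max()
    return out

fm = np.random.rand(55, 55)
print(max_pool2d(fm, z=3, s=2).shape)  # (27, 27)
```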
After reviewing the CNN architecture, they describe the first convolutional layer as having 96 kernels of size \(11 \times 11 \times 3\) with a stride of 4. The second convolutional layer filters using 256 kernels of size \(5 \times 5 \times 48\). The third convolutional layer has 384 kernels of size \(3 \times 3 \times 256\). The fourth convolutional layer has 384 kernels of size \(3 \times 3 \times 192\), and the fifth convolutional layer has 256 kernels of size \(3 \times 3 \times 192\). Additionally, the fully-connected layers have 4096 neurons each.
Reduce Overfitting
Due to the 60 million parameters and 1000 classes of ILSVRC, the authors emphasize methods of reducing overfitting.
One method of reducing overfitting involves extracting random \(224 \times 224\) patches from the \(256 \times 256\) images, flipping them horizontally, and adding them as additional training examples. This increases the number of training examples by a factor of 2048.
At test time, given an image \(I\) of size \(256 \times 256\), we extract patches \(P_i\) of size \(224 \times 224\) from the four corners and the center. For each patch \(i = 1, 2, 3, 4, 5\), we also create a horizontally flipped version \(P_i'\). In total we have \(P = \{P_1, P_2, P_3, P_4, P_5, P_1', P_2', P_3', P_4', P_5'\}\). The final prediction \(F(I)\) for the entire image is the average of the softmax outputs, thereby reducing the variance of the prediction. If \(f(P)\) denotes the neural network's softmax output for a patch, then,
\[F(I) = \frac{1}{10} \sum_{i=1}^{5} \left(f(P_i) + f(P_i')\right)\]
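A sketch of this ten-patch test-time averaging; `predict` stands in for the trained network's softmax function, and `dummy_net` below is only a placeholder so the example runs.

```python
import numpy as np

def ten_crop_predict(image, predict, crop=224):
    """Average softmax outputs over the 4 corner crops, the center crop,
    and their horizontal flips. `predict` maps a (crop, crop, 3) patch to
    a probability vector; here it is a placeholder for the trained network."""
    H, W, _ = image.shape
    center = ((H - crop) // 2, (W - crop) // 2)
    offsets = [(0, 0), (0, W - crop), (H - crop, 0), (H - crop, W - crop), center]
    preds = []
    for (r, c) in offsets:
        patch = image[r:r + crop, c:c + crop]
        preds.append(predict(patch))
        preds.append(predict(patch[:, ::-1]))   # horizontal flip
    return np.mean(preds, axis=0)

img = np.random.rand(256, 256, 3)
dummy_net = lambda p: np.full(1000, 1.0 / 1000)  # placeholder softmax output
print(ten_crop_predict(img, dummy_net).shape)    # (1000,)
```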
The authors also enhance the training images by changing the brightness and colors using Principal Component Analysis (PCA) on the pixel values of all training images. For each training image, they add a small, random variation to the colors based on PCA results, which in turn helps the model understand that objects can look different under various lighting conditions and still be the same object.
After gathering RGB values from all training images, treat each value as a vector \([R, G, B]\). Calculate the mean RGB value across all pixels in the training set and subtract this mean from each RGB vector to center the data around the origin. Form the data matrix \(X\), where each row represents an RGB vector of a pixel from the training set. If \(n\) is the number of pixels, calculate the covariance matrix \(C\) of the data matrix \(X\),
\[C = \frac{1}{n-1} X^TX\]
From here, perform eigenvalue decomposition on the covariance matrix \(C\) to determine the eigenvectors \(p_i\), which represent the directions of maximum RGB variance, and the eigenvalues \(\lambda_i\), which indicate the amount of variance captured by each principal component. For a matrix of principal components (eigenvectors) \(P\) and diagonal matrix of eigenvalues \(\Lambda\),
\[C = P \Lambda P^T\]
For each image, add multiples of the principal components proportional to the eigenvalues scaled by random variables \(\alpha_i\) drawn from a Gaussian distribution (in this case, mean of 0 and standard deviation of 0.1). The adjusted RGB values are then,
\[\left[I_{xy}^R, I_{xy}^G, I_{xy}^B\right]^T \leftarrow \left[I_{xy}^R, I_{xy}^G, I_{xy}^B\right]^T + \left[p_1, p_2, p_3\right] \left[\alpha_1 \lambda_1, \alpha_2 \lambda_2, \alpha_3 \lambda_3\right]^T\]
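The following sketch of this PCA color augmentation uses randomly generated pixels as a stand-in for the pooled, mean-centered RGB values of the training set; the function and variable names are my own.

```python
import numpy as np

pixels = np.random.rand(100000, 3) - 0.5   # stand-in for centered training-set RGB values
cov = np.cov(pixels, rowvar=False)         # 3x3 RGB covariance matrix C
lam, p = np.linalg.eigh(cov)               # eigenvalues lambda_i, eigenvectors p_i (columns)

def augment_color(image, lam=lam, p=p, sigma=0.1):
    """Add [p1 p2 p3][a1*l1, a2*l2, a3*l3]^T to every pixel of an (H, W, 3) image."""
    alphas = np.random.normal(0.0, sigma, size=3)  # alpha_i ~ N(0, 0.1), drawn once per image
    delta = p @ (alphas * lam)                     # a single RGB offset vector
    return image + delta                           # broadcast over all pixels

img = np.random.rand(224, 224, 3)
print(augment_color(img).shape)                    # (224, 224, 3)
```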
Dropout can improve network performance and reduce errors by randomly turning off neurons (setting their output to zero) during training, so each training pass effectively uses a slightly different sub-network. At test time, they use all the neurons but multiply their outputs by 0.5 to compensate. Mathematically, during training, the output of each neuron is,
\[h_i' = h_i \cdot \text{mask}_i\]
where \(\text{mask}_i\) is drawn so that each neuron is zeroed with probability 0.5. Dropout makes the model more robust to overfitting by encouraging the network to learn redundant representations, while acting as a form of regularization by adding noise to the training process.
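A minimal sketch of dropout with \(p = 0.5\), matching the train-time masking and the test-time halving described above; the function names are my own.

```python
import numpy as np

def dropout_train(h, p=0.5):
    """Training time: zero each activation independently with probability p."""
    mask = (np.random.rand(*h.shape) >= p).astype(h.dtype)
    return h * mask

def dropout_test(h, p=0.5):
    """Test time: keep all neurons but scale their outputs by (1 - p)."""
    return h * (1.0 - p)

h = np.random.rand(4096)
print(dropout_train(h).shape, dropout_test(h).mean())
```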
Details of Learning
The authors trained using stochastic gradient descent (SGD), a batch size of 128, a momentum of 0.9, and weight decay of 0.0005. Thus, the update rule was described as:
\[v_{i+1} := 0.9 \cdot v_i - 0.0005 \cdot \epsilon \cdot w_i - \epsilon \cdot \left\langle \left.\frac{\partial L}{\partial w}\right|_{w_i} \right\rangle_{D_i}\]
\[w_{i+1} := w_i + v_{i+1}\]
Here, \(i\) is the iteration index, \(v\) is the momentum variable, \(\epsilon\) is the learning rate, and \(\left\langle \left.\frac{\partial L}{\partial w}\right|_{w_i} \right\rangle_{D_i}\) is the average over the \(i\)th batch \(D_i\) of the derivative of the objective with respect to the weights, evaluated at \(w_i\).
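A small sketch of one step of this update rule, with the paper's momentum (0.9), weight decay (0.0005), and Gaussian weight initialization (standard deviation 0.01); the gradient here is a random stand-in rather than a real batch gradient.

```python
import numpy as np

def sgd_update(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One step of the update rule above:
    v <- 0.9*v - 0.0005*lr*w - lr*grad;  w <- w + v."""
    v = momentum * v - weight_decay * lr * w - lr * grad
    return w + v, v

w = np.random.normal(0.0, 0.01, size=1000)   # zero-mean Gaussian init, std 0.01
v = np.zeros_like(w)
grad = np.random.rand(1000)                  # stand-in for the averaged batch gradient
w, v = sgd_update(w, v, grad, lr=0.01)
```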
The authors chose to initialize the weights in each layer from a zero-mean Gaussian distribution with a standard deviation of 0.01. Neuron biases in the second, fourth, and fifth convolutional layers and in the fully-connected layers were initialized to 1; the remaining biases were initialized to 0. They used an equal learning rate for all layers, which was divided by 10 whenever the validation error stopped improving. Training took five to six days over the 1.2 million training images.
Results
The results here were incredible for when this paper was released. Note that the top-1 error rate measures how often the model's single highest-confidence prediction is wrong, whereas the top-5 error rate measures how often the correct label is missing from the model's five highest-confidence predictions. The network achieved a top-1 error rate of 37.5% (compared to the previous state of the art of 45.7%) and a top-5 error rate of 17.0% (versus the previous 25.7%). These numbers improved further using an ensemble of CNNs.
The authors also note that the kernels on GPU 1 are largely color-agnostic, whereas the kernels on GPU 2 are largely color-specific. Notably, “this kind of specialization occurs during every run” and is independent of the particular random weight initialization.
They also note that computing similarity with Euclidean distance between the network's 4096-dimensional feature vectors is inefficient, but could be made efficient by training an auto-encoder to compress these vectors into short binary codes.
Discussion
These are mind-blowingly cool results. The authors mention unsupervised pre-training as a direction for future work, an idea that later proved foundational to transformer architectures, particularly in natural language processing.
For code, please reach out.
Citations
- [1] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 25, 1097-1105. Retrieved from https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
- [2] Dumoulin, V., & Visin, F. (2018). A guide to convolution arithmetic for deep learning. Retrieved from https://arxiv.org/pdf/1603.07285.pdf.
- [3] Delgutte, B., & Greenberg, J. (1999). The Discrete Fourier Transform. HST582J/6.555J/16.456J Biomedical Signal and Image Processing, Spring 2005. Retrieved from MIT OpenCourseWare.
- [4] Diana Earshia, V., & Sumathi, M. (2019). A Comprehensive Study of 1D and 2D Image Interpolation Techniques. In A. Kumar & S. Mozar (Eds.), ICCCE 2018. Lecture Notes in Electrical Engineering, vol 500. Springer, Singapore. https://doi.org/10.1007/978-981-13-0212-1_40.