A Theoretical Framework for Back-Propagation
By: Yann Le Cun
Abstract
Yann Le Cun tackles back-propagation through a Lagrangian framework. Using principles from optimal control theory and variational calculus, Le Cun poses back-propagation as an optimization problem with nonlinear constraints. He then generalizes the framework to weight spaces constrained through shared parameters (weight sharing) and to continuous-time recurrent neural networks (CTRNNs), for which a Hamiltonian treatment yields the Pineda-Almeida equation for computing the gradient at a fixed point with respect to the weights.
Introduction
The author begins by noting that back-propagation is simply the chain rule combined with gradient descent. In optimal control theory, Lagrange multipliers are used to find optimal values of control variables by minimizing an objective function subject to constraints. Although written several years later, Cortes and Vapnik’s Support Vector Networks uses the same framework to optimize separating hyperplanes in high-dimensional feature spaces.
The difficult part of back-propagation is tracing the influence of a parameter on a function whose computation involves many steps, as reflected by the connectionist nature of the problem. Optimal control theory studies how the state of a system at time T is modified by a change in control variables at time T - t. Le Cun notes that we can leverage this machinery by substituting layer indices for time.
He defines back-propagation as computing the partial derivatives of the states with respect to the parameters in order to minimize an objective function. He also looks beyond the Lagrangian formalism to the single Hamiltonian function, which satisfies an optimality criterion as a consequence of the Pontryagin principle and Bellman’s dynamic programming. Although these richer formalisms exist, the standard Lagrange multiplier method is all that is needed here.
Formally, the back-propagated vector is called the co-state or adjoint state, and the backward system is the adjoint system. The optimal control solution entails a cascade of elementary systems whose performance index (the objective function) is minimized. Such problems are known as two-point boundary value problems, since conditions are imposed at both ends of the cascade: at the input (forward pass) and at the output (backward pass).
Deriving Back-Propagation Using the Hamiltonian/Lagrangian Formalism
Assume the network consists of a number of layers arranged in a feed-forward manner, with no skip connections. The layers are indexed from 0 to N, with the state of layer k for pattern p denoted X_p^{(k)}. Note that X_p is the concatenation of all the X_p^{(k)}’s, and W^{(k)} is the weight matrix between layer k-1 and layer k. The vector of weighted sums is given as,
\[ A_p^{(k)} = W^{(k)}X_p^{(k-1)} \]
Next, let F_k represent some non-linear activation function (here, a sigmoid applied elementwise). The state X_p^{(k)} can then be written as,
\[ X_p^{(k)} = F_k(A_p^{(k)}) = F_k(W^{(k)}X_p^{(k-1)}) \quad \forall k \in [1, N] \]
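To make the forward recursion concrete, here is a minimal NumPy sketch of these two equations; the function names are illustrative, and a sigmoid is assumed for every F_k.

```python
import numpy as np

def sigmoid(a):
    """Elementwise logistic activation, our stand-in for F_k."""
    return 1.0 / (1.0 + np.exp(-a))

def forward(weights, x0):
    """Forward pass: A^(k) = W^(k) X^(k-1), then X^(k) = F_k(A^(k)).

    weights -- list of N matrices; weights[k] maps layer k to layer k+1
    x0      -- the input pattern X^(0)
    Returns the pre-activations A^(k) and states X^(k) of every layer.
    """
    activations, xs = [], [x0]
    for W in weights:
        a = W @ xs[-1]            # A^(k) = W^(k) X^(k-1)
        activations.append(a)
        xs.append(sigmoid(a))     # X^(k) = F_k(A^(k))
    return activations, xs
```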
Training requires a set of labeled desired outputs D_p, reinforcing the familiar point that models are only as good as their training data. So far we’ve covered the simple equations of the forward pass; let’s move on to the backward pass.
Since X_p^{(N)} depends on all previous layers through the chain of intermediate states, it suffices to minimize the cost of the final state, C(X_p^{(N)}). We thus obtain the following objective function,
\[ \text{Minimize } C(X_p^{(N)}) \]
As mentioned previously, we need a constraint that reflects the layered dynamics of the network.
\[ \text{Subject to: } X_p^{(k)} = F_k(W^{(k)}X_p^{(k-1)}) \]
This constraint can be rewritten to equate to zero,
\[ \text{Subject to: } X_p^{(k)} - F_k(W^{(k)}X_p^{(k-1)}) = 0 \]
For a Lagrange multiplier vector B_p^{(k)}, the per-pattern Lagrangian is,
\[ L_p(W, X_p, B_p) = C(X_p^{(N)}) + \sum_{k=1}^{N} B_p^{(k)^T}(X_p^{(k)} - F_k(W^{(k)}X_p^{(k-1)})) \]
The total Lagrangian is then the sum of the per-pattern terms over the P training examples,
\[ L(W, X, B) = \sum_{p=1}^{P} L_p(W, X_p, B_p) = \sum_{p=1}^{P} \left[ C(X_p^{(N)}) + \sum_{k=1}^{N} B_p^{(k)^T}\left(X_p^{(k)} - F_k(W^{(k)}X_p^{(k-1)})\right) \right] \]
Consider the case where the objective function is the squared error,
\[ C(X_p^{(N)}) = (D_p - X_p^{(N)})^T (D_p - X_p^{(N)}) \]
The Lagrangian expands to,
\[ L(W, X, B) = \sum_{p=1}^{P} \left[ (D_p - X_p^{(N)})^T (D_p - X_p^{(N)}) + \sum_{k=1}^{N} B_p^{(k)^T}\left(X_p^{(k)} - F_k(W^{(k)}X_p^{(k-1)})\right) \right] \]
At a constrained local minimum of the objective function, the gradient of the Lagrangian must vanish,
\[ \nabla L(W, X, B) = 0 \]
This gradient can then be expanded into the respective partial derivatives,
\[ \frac{\partial L(W,X,B)}{\partial B} = 0 \]
\[ \frac{\partial L(W,X,B)}{\partial X} = 0 \]
\[ \frac{\partial L(W,X,B)}{\partial W} = 0 \]
The partial with respect to the multipliers B recovers the forward pass, the partial with respect to the states X yields the backward pass, and the partial with respect to the weights W supplies the optimality condition. Indeed, the first condition gives back the forward dynamics equation,
\[ X_p^{(k)} = F_k(A_p^{(k)}) = F_k(W^{(k)}X_p^{(k-1)}) \quad \forall k \in [1, N],\ p \in [1, P] \]
The second condition, evaluated at the output layer with the squared-error cost, supplies the boundary condition for k = N,
\[ B_p^{(N)} = -\frac{\partial C}{\partial X_p^{(N)}} = 2 (D_p - X_p^{(N)}) \]
The second condition also yields the recursive backward dynamics for k ≠ N,
\[ B_p^{(k)} = W^{(k+1)^T}\nabla F_{k+1}(A_p^{(k+1)})B_p^{(k+1)} \]
Le Cun takes special note that, for an elementwise activation, the Jacobian matrix of F_{k+1} at the point A_p^{(k+1)} is diagonal, with i-th diagonal term,
\[ \left[\nabla F_{k+1}(A_p^{(k+1)})\right]_{ii} = f'(a_i^{(k+1)}) \]
He then introduces variables for the product of the activation Jacobian with the back-propagated error. Let Y_p^{(N)} be the output-layer boundary condition and Y_p^{(k)} carry the back-propagation dynamics,
\[ Y_p^{(N)} = 2\nabla F_N(A_p^{(N)}) (D_p - X_p^{(N)}) \]
\[ Y_p^{(k)} = \nabla F_k(A_p^{(k)})B_p^{(k)} = \nabla F_k(A_p^{(k)})W^{(k+1)^T}Y_p^{(k+1)} \quad \forall k \in [1, N - 1] \]
Intuitively, we can define the initial state as the input pattern, which serves as our other boundary condition,
\[ X_p^{(0)} = I_p \]
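The backward recursion translates just as directly into code. In the sketch below (reusing the sigmoid helper from the forward-pass sketch), the diagonal Jacobian turns the matrix product into an elementwise multiply by f'(a_i).

```python
def backward(weights, activations, xs, d):
    """Backward pass for one pattern under the squared-error cost.

    Computes Y^(N) = 2 grad_F(A^(N)) (D - X^(N)), then recurses with
    Y^(k) = grad_F(A^(k)) W^(k+1)^T Y^(k+1) down to the first layer.
    """
    def fprime(a):                 # diagonal entries of the Jacobian of F
        s = sigmoid(a)
        return s * (1.0 - s)

    ys = [None] * len(weights)
    ys[-1] = 2.0 * fprime(activations[-1]) * (d - xs[-1])   # boundary at k = N
    for k in range(len(weights) - 2, -1, -1):
        # diagonal grad_F: elementwise multiply instead of a matrix product
        ys[k] = fprime(activations[k]) * (weights[k + 1].T @ ys[k + 1])
    return ys
```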
The third gradient condition of the Lagrangian formulation ensures W is a local extremum,
\[ \frac{\partial L(W,X,B)}{\partial W^{(k)}} = -\sum_{p=1}^{P} \nabla F_k(A_p^{(k)})B_p^{(k)}X_p^{(k-1)^T} = 0 \quad \forall k \in [1, N] \]
Such an equation gives way to gradient descent, which for a step size λ can be formulated as,
\[ W^{(k)} \leftarrow W^{(k)} - \lambda \frac{\partial L(W,X,B)}{\partial W^{(k)}} \]
Which, substituting Y_p^{(k)} = \nabla F_k(A_p^{(k)})B_p^{(k)} and the sign of the gradient above, can be expanded as,
\[ W^{(k)} \leftarrow W^{(k)} + \lambda \sum_{p=1}^{P} Y_p^{(k)}X_p^{(k-1)^T} \]
Or, unrolling one more step of the recursion,
\[ W^{(k)} \leftarrow W^{(k)} + \lambda \sum_{p=1}^{P} \nabla F_k(A_p^{(k)})W^{(k+1)^T}Y_p^{(k+1)}X_p^{(k-1)^T} \]
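With this sign convention, the descent step adds λ·Y·X^T to the weights. Continuing the sketches above, with all names still illustrative:

```python
def update_weights(weights, xs, ys, lam=0.1):
    """One gradient-descent step: W^(k) <- W^(k) + lam * Y^(k) X^(k-1)^T."""
    for k in range(len(weights)):
        weights[k] += lam * np.outer(ys[k], xs[k])   # xs[k] is X^(k-1) here
    return weights

# Tiny usage sketch: one pattern through a 4-3-2 network.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 4)), rng.standard_normal((2, 3))]
x0, d = rng.standard_normal(4), rng.standard_normal(2)
activations, xs = forward(weights, x0)
ys = backward(weights, activations, xs, d)
weights = update_weights(weights, xs, ys, lam=0.05)
```

A finite-difference check of C(X_p^{(N)}) before and after the step is a cheap way to confirm the signs.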
Le Cun notes that, following Bryson and Ho [2], the same result can be phrased in terms of a layerwise Hamiltonian,
\[ H_p^{(k)} = B_p^{(k)^T}F_k(W^{(k)}X_p^{(k-1)}) \]
A Few Extensions
A priori knowledge refers to information that is independent of empirical observation, often taking the form of assumptions about the data. The weights can be treated as functions W(U, k) of shared parameters U to integrate such a priori knowledge directly into the network. With this change, the Lagrangian becomes,
\[ L_p(W, X_p, B_p) = C(X_p^{(N)}) + \sum_{k=1}^{N} B_p^{(k)^T}(X_p^{(k)} - F_k(W(U, k)X_p^{(k-1)})) \]
Again, by definition of the Lagrangian, we can attain a local minimum for the objective function C at,
\[ \nabla L(W, X, B) = 0 \]
The analysis of the first two conditions is unchanged; the third condition, now taken with respect to the shared parameters U, is the one of interest,
\[ \frac{\partial L(W,X,B)}{\partial U} = 0 \]
The update rule for each shared parameter u_q then becomes,
\[ u_q \leftarrow u_q - \lambda \frac{\partial L}{\partial u_q} \]
With this in mind, the partial of the Lagrangian with respect to u_q can be expanded using the chain rule, summing over every weight that depends on it,
\[ \frac{\partial L}{\partial u_q} = \sum_{ijk} \frac{\partial L}{\partial w_{ij}^{(k)}} \cdot \frac{\partial w_{ij}^{(k)}}{\partial u_q} \]
Similarly, from the weight condition derived earlier, the partial of the Lagrangian with respect to an individual weight is,
\[ \frac{\partial L}{\partial w_{ij}^{(k)}} = -y_i^{(k)}x_j^{(k-1)} \]
So the new update rule can be expanded as,
\[ u_q \leftarrow u_q + \lambda \sum_{ijk} y_i^{(k)}x_j^{(k-1)}\frac{\partial w_{ij}^{(k)}}{\partial u_q} \]
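As a sketch of this shared-parameter update, suppose every weight w_ij of a single layer is tied to one entry of a parameter vector u, so that ∂w_ij/∂u_q is a 0/1 indicator (the sum over layers k is dropped for brevity). The tie array and function name are hypothetical.

```python
def shared_update(u, tie, y, x, lam=0.1):
    """u_q <- u_q + lam * sum_ij y_i x_j dw_ij/du_q for hard weight tying.

    tie[i, j] holds the index q of the parameter that w_ij is tied to,
    so dw_ij/du_q is 1 exactly when q == tie[i, j] and 0 otherwise.
    """
    grad = np.zeros_like(u)
    for i in range(tie.shape[0]):
        for j in range(tie.shape[1]):
            grad[tie[i, j]] += y[i] * x[j]   # accumulate y_i x_j into slot q
    return u + lam * grad                    # descent, since dL/du_q = -grad[q]
```

The layer’s weight matrix is then rebuilt as W = u[tie], which is one way convolutional-style weight sharing can be expressed.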
Le Cun mentions the utility of weight sharing as a component of Recurrent Neural Networks (RNNs). The paper then transitions to continuous-time recurrent neural networks (CTRNNs), which obey the following differential equation,
\[ \tau_x \frac{dX}{dt}(t) = F(WX(t)) - X(t) \]
In other words, the rate of change of the state (scaled by a time constant \(\tau_x\)) equals the activation of the weighted input minus the state itself. The equation reads intuitively if we abbreviate,
\[ X_f = F(WX(t)), \quad X_i = X(t), \quad \Delta X = X_f - X_i \]
\[ \tau_x \frac{dX}{dt}(t) = \Delta X \]
The author also mentions the case where each unit has its own time constant, so that τ_x is a diagonal matrix,
\[ \tau_x = \text{diag}(\tau_1, \tau_2, ..., \tau_n) \]
The conversation then shifts to the Hamiltonian treatment of recurrent neural networks. A fixed point is a state that exhibits no change with respect to time,
\[ \frac{dX}{dt} = 0 \]
At such a fixed point, the nonlinear activation of the weighted input exactly matches the current state; this is the definition of equilibrium, a state with no change over time. The residual between the two becomes our constraint for the Hamiltonian optimization,
\[ X - F(WX) = 0 \]
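In practice, one way to reach such a fixed point is to Euler-integrate the CTRNN dynamics until the state stops moving; a minimal sketch, again assuming a sigmoid F and illustrative names:

```python
def relax_to_fixed_point(W, x0, tau=1.0, dt=0.1, tol=1e-6, max_steps=10_000):
    """Integrate tau dX/dt = F(W X) - X until dX/dt is numerically zero."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(max_steps):
        dx = (sigmoid(W @ x) - x) / tau
        x += dt * dx
        if np.linalg.norm(dx) < tol:   # dX/dt = 0: we sit at X = F(W X)
            break
    return x
```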
As is formalized in optimal control theory, the Hamiltonian combines the objective function C(X) and a constraint term B^T(X - F(WX)). The Hamiltonian measures the total cost of the system through this formulation,
\[ H = C(X) + B^T(X - F(WX)) \]
It is now of interest to examine the change in cost with respect to weights while staying at such a fixed point.
The total time variation of the Hamiltonian decomposes into a contribution from the change in state and a contribution from the change in weights,
\[ \frac{dH}{dt} = \frac{\partial H}{\partial X} \frac{dX}{dt} + \frac{\partial H}{\partial W} \frac{dW}{dt} \]
Intuitively, this means that the change in the Hamiltonian depends on changes in both the state space and the weight space. We can compute each intermediate partial, with respect to the state X and the weights W, as,
\[ \frac{\partial H}{\partial X} = \frac{\partial C}{\partial X} + B^T\frac{\partial (X - F(WX))}{\partial X} \]
\[ \frac{\partial H}{\partial W} = \frac{\partial C}{\partial W} + B^T\frac{\partial (X - F(WX))}{\partial W} \]
Le Cun notes that we are interested in the case where the variation in X does not affect the variation in Hamiltonian (so that we just need to update our weights). This motivates setting the change in Hamiltonian with respect to the state to zero,
\[ \frac{\partial H}{\partial X} = \frac{\partial C}{\partial X} + B^T(I - \frac{\partial F}{\partial (WX)}W) = 0 \]
Here we used the expansion of the constraint’s derivative with respect to the state, where I is the identity matrix,
\[ \frac{\partial (X - F(WX))}{\partial X} = I - \frac{\partial F}{\partial (WX)}W \]
Transposing, our stationarity condition on the state becomes,
\[ \frac{\partial H}{\partial X}^T = \frac{\partial C}{\partial X}^T + B - W^T\frac{\partial F}{\partial (WX)}^T B = 0 \]
We can characterize such a fixed-point dynamical system as,
\[ - \frac{\partial H}{\partial X}^T = \tau_b\frac{dB}{dt} \]
Or to expand,
\[ - \frac{\partial C}{\partial X}^T - B + W^T\frac{\partial F}{\partial (WX)}^T B = \tau_b\frac{dB}{dt} \]
Le Cun uses a variable Y to represent the sensitivity of the activation to small changes in WX, weighted by the vector of Lagrange multipliers B,
\[ Y = \frac{\partial F}{\partial (WX)}^T B \]
We can multiply both sides of the previous equation on the left by the activation Jacobian ∂F/∂(WX),
\[ - \frac{\partial F}{\partial (WX)} \frac{\partial C}{\partial X}^T - \frac{\partial F}{\partial (WX)} B + \frac{\partial F}{\partial (WX)} W^T\frac{\partial F}{\partial (WX)}^T B = \tau_b \frac{\partial F}{\partial (WX)} \frac{dB}{dt} \]
Which can be rewritten in terms of Y (absorbing the remaining factors on the right into a time constant τ_y) to yield the Pineda-Almeida equation for computing the gradient at the fixed point with respect to the weights,
\[ \tau_y \frac{dY}{dt} = \frac{\partial F}{\partial (WX)}\left(W^T Y - \frac{\partial C}{\partial X}^T\right) - Y \]
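This adjoint equation can be relaxed by the same kind of integration used for the state, with the Jacobian frozen at the fixed point. A sketch under the same assumptions, where x_star comes from the state relaxation above and dC_dx is supplied by the caller:

```python
def relax_adjoint(W, x_star, dC_dx, tau=1.0, dt=0.1, tol=1e-6, max_steps=10_000):
    """Relax tau dY/dt = grad_F (W^T Y - dC/dX^T) - Y at the fixed point."""
    s = sigmoid(W @ x_star)
    g = s * (1.0 - s)                  # diagonal of dF/d(WX), frozen at x_star
    y = np.zeros_like(x_star)
    for _ in range(max_steps):
        dy = (g * (W.T @ y - dC_dx) - y) / tau
        y += dt * dy
        if np.linalg.norm(dy) < tol:
            break
    return y
```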
Since we set \(\frac{\partial H}{\partial X} = 0\), the variation in the Hamiltonian is simply,
\[ \frac{\partial H}{\partial W} = \frac{\partial C}{\partial W} + B^T\frac{\partial (X - F(WX))}{\partial W} \]
Because the constraint is satisfied at the fixed point, the total derivative of the cost coincides with that of the Hamiltonian,
\[ \frac{dH}{dW} = \frac{dC}{dW} = \frac{\partial C}{\partial W} + B^T\frac{\partial (X - F(WX))}{\partial W} \]
If the cost function is not a direct function of the weights, the expression reduces to,
\[ \frac{dC}{dW} = B^T\frac{\partial (X - F(WX))}{\partial W} \]
Componentwise, the derivative of the cost with respect to an individual weight w_ij is obtained as,
\[ \frac{dC}{dw_{ij}} = - y_ix_j \]
The update rule then becomes,
\[ w_{ij} \leftarrow w_{ij} + \lambda y_ix_j \]
And the continuous version becomes,
\[ \tau_w \frac{dw_{ij}}{dt} = y_ix_j \]
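Discretizing this gradient flow recovers the step w_ij ← w_ij + λ y_i x_j. A hypothetical end-to-end training step combining the earlier sketches, with the squared-error cost applied to the whole state for simplicity:

```python
def recurrent_backprop_step(W, x0, d, lam=0.05):
    """Relax the state, relax the adjoint, then take one weight step."""
    x = relax_to_fixed_point(W, x0)    # forward: settle to X = F(W X)
    dC_dx = -2.0 * (d - x)             # dC/dX for C = (D - X)^T (D - X)
    y = relax_adjoint(W, x, dC_dx)     # backward: Pineda-Almeida relaxation
    return W + lam * np.outer(y, x)    # w_ij <- w_ij + lam * y_i * x_j
```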
Le Cun notes that such a Lagrangian/Hamiltonian approach to neural networks is only the tip of the iceberg, and that there is much meaningful work to be done in modeling connectionist systems.
Citations
- [1] Le Cun, Y. (1988). A theoretical framework for back-propagation. Proceedings of the 1988 Connectionist Models Summer School, 21-28. San Mateo, CA: Morgan Kaufmann.
- [2] Bryson, A. E., & Ho, Y.-C. (1969). Applied optimal control: Optimization, estimation, and control. Blaisdell Publishing Company.