Kalman Filtering

Annotated by: Noah Schliesman

Published: 1960 by R. E. Kalman

Note: The following is still a work in progress. There may be MathJax (LaTeX) errors throughout and I am still in the process of editing this.

Key Equations (derived below):

Optimal Estimate: \( x^*(t+1|t) = \Phi^*(t+1;t)x^*(t|t-1) + \Delta^*(t)y(t) \)

Estimation Error: \( \tilde{x}(t+1|t) = \Phi^*(t+1;t)\tilde{x}(t|t-1) + u(t) \), where \( \tilde{x}(t|t-1) = x(t) - x^*(t|t-1) \)

Covariance Matrix of the Estimation Error: \( P^*(t) = \text{cov}\{ \tilde{x}(t|t-1) \} = \mathbb{E}\{ \tilde{x}(t|t-1)\tilde{x}'(t|t-1) \} \)

Expected Quadratic Loss:

\[ \sum_{i=1}^{n} \mathbb{E}[ \tilde{x}_i^2(t|t-1) ] = \text{trace}(P^*(t)) \]

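Kalman Gain: \( \Delta^*(t) = \Phi(t+1;t)P^*(t)M'(t)\left[M(t)P^*(t)M'(t)\right]^{-1} \)

For a scalar observation, \( M(t)P^*(t)M'(t) \) is a scalar and the inverse reduces to a division:

\[ \Delta^*(t) = \frac{\Phi(t+1;t)P^*(t)M'(t)}{M(t)P^*(t)M'(t)} \]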

Modified State Transition Matrix: \( \Phi^*(t+1;t) = \Phi(t+1;t) - \Delta^*(t)M(t) \)

Future Covariance Matrix of Estimated Error: \( P^*(t+1) = \Phi^*(t+1;t)P^*(t)\Phi'(t+1;t) + Q(t) \)
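As a rough illustration of how the gain, modified transition matrix, and covariance equations chain together, the sketch below iterates them for a hypothetical two-state model (position and velocity, with only position observed). The values of `PHI`, `M`, `Q`, and the initial `p_star` are made up for the example and do not come from Kalman's paper; plain Python lists are used so that nothing beyond the standard library is needed.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def gain_and_covariance_step(phi, m, q, p):
    """One pass of the gain, modified-transition, and covariance equations.

    phi: n x n transition matrix, m: 1 x n observation row (scalar output),
    q: n x n noise covariance, p: n x n error covariance P*(t).
    """
    n = len(phi)
    m_t = transpose(m)
    mpm = matmul(matmul(m, p), m_t)[0][0]                  # M P* M' (scalar)
    gain = [[row[0] / mpm]                                  # Delta*(t), n x 1
            for row in matmul(matmul(phi, p), m_t)]
    gain_m = matmul(gain, m)                                # Delta*(t) M
    phi_star = [[phi[i][j] - gain_m[i][j] for j in range(n)]
                for i in range(n)]                          # Phi*(t+1;t)
    propagated = matmul(matmul(phi_star, p), transpose(phi))
    p_next = [[propagated[i][j] + q[i][j] for j in range(n)]
              for i in range(n)]                            # P*(t+1)
    return gain, phi_star, p_next


# Hypothetical model: state = (position, velocity), only position observed.
PHI = [[1.0, 1.0], [0.0, 1.0]]
M = [[1.0, 0.0]]
Q = [[0.0, 0.0], [0.0, 1.0]]

p_star = [[1.0, 0.0], [0.0, 1.0]]
for _ in range(5):
    gain, phi_star, p_star = gain_and_covariance_step(PHI, M, Q, p_star)

expected_loss = p_star[0][0] + p_star[1][1]   # trace(P*)
```

For these particular values the recursion settles after the first step, with \( P^* \) having rows (1, 1) and (1, 2) and gain \( (2, 1)' \), so the expected quadratic loss is the trace, 3.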

Abstract Notes:

Kalman begins by building on two ideas, the Bode-Shannon representation of random processes and the state-transition method of analyzing dynamic systems:

Random Processes: Sequences of random variables that illustrate a system’s state over time.

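Bode-Shannon Representation: Models a random process as the output of a linear dynamic system driven by white noise. Hendrik Bode and Claude Shannon used this representation to simplify the derivation of linear least-squares smoothing and prediction. Bode is also known for showing how systems respond to various inputs in the frequency domain, and Shannon for the theoretical tools that quantify and measure uncertainty.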

State Transitions: Involve modeling the evolution of a system’s state from one time step to the next. Here, a state is a set of variables that encapsulate all necessary information about a system’s condition. The State Transition Equation describes how the state of the system changes with time.

Dynamic Systems: Evolve over time according to certain rules. The Kalman filter models these systems using state-space representations.

Kalman provides a filter for stationary, nonstationary, growing memory, and infinite memory processes:

Stationary Processes: Statistical properties, such as mean and variance, do not change over time. The Kalman filter can leverage these constant properties to make accurate predictions and filter out noise. Such processes include white noise and temperature fluctuations. In such an environment, the statistical behavior of the process can be modeled using a fixed set of parameters.
Nonstationary Processes: Statistical properties change over time. The Kalman filter can update its parameters dynamically as it processes new data, making it suitable for a wider range of transient applications. Examples include GDP, population growth, and speech signals.
Growing Memory: A filter that takes into account an increasing amount of past data over time. Because the Kalman filter is recursive, each state estimate depends on the previous estimate and the newest measurement, so all earlier measurements are carried forward without being stored.
Infinite Memory: A filter that considers all past data indefinitely, as opposed to a fixed window. Such a system is crucial in systems that must robustly latch on to information. This latching task closely models the experiment performed using recurrent neural networks (RNNs) shown by Bengio et al.^[2]
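
One simple way to picture a growing-memory filter is the recursive sample mean: each new measurement updates the estimate with a gain of 1/n, so the estimate always reflects every measurement seen so far without storing any of them. This is only an illustrative sketch (the function name and data are hypothetical), but it has the same "previous estimate plus gain times correction" shape as the Kalman update.

```python
def growing_memory_mean(measurements):
    """Recursively average all measurements seen so far."""
    estimate = 0.0
    history = []
    for n, y in enumerate(measurements, start=1):
        gain = 1.0 / n                      # weight on the newest measurement
        estimate += gain * (y - estimate)   # previous estimate + correction
        history.append(estimate)
    return history


# Hypothetical measurements; each entry of `estimates` is the mean so far.
estimates = growing_memory_mean([2.0, 4.0, 6.0, 8.0])
```

The gain shrinks as more data arrives, which is the typical behavior when the quantity being estimated is constant and every past measurement stays relevant.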

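Kalman derives a nonlinear difference (or differential) equation for the covariance matrix of the optimal estimation error, and the coefficients of the optimal filter follow from its solution. The filtering problem is the dual of the noise-free regulator problem:

Noise-free Regulator Problem: The deterministic control problem of choosing inputs that hold a system’s state at a desired value at minimum cost, with no random disturbances acting on the system.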

Introduction Notes:

Prediction of random signals, separation of random signals from random noise, and detection of signals of known form in the presence of noise are all statistical problems that a Kalman filter can address. The foundations were laid by Wiener, who showed that these problems lead to the Wiener-Hopf equation.

Wiener-Hopf Equation: The condition that an optimal linear filter must satisfy in order to minimize the mean squared error when separating a signal from random noise.

Consider a signal \( X(t) \) corrupted by additive noise \( N(t) \), so that the observed process is \( Y(t) = X(t) + N(t) \). We want to find a linear filter with impulse response \( h(\tau) \) that gives the estimate:

\[ \hat{X}(t) = \int_{-\infty}^{\infty} h(\tau)Y(t-\tau)d\tau \]

The goal is to minimize the mean squared error (MSE) between \( X(t) \) and \( \hat{X}(t) \):

\[ MSE = \mathbb{E}[(X(t) - \hat{X}(t))^2] \]

The MSE can be expanded as:

\[ MSE = \mathbb{E}[X(t)^2] - 2\mathbb{E}[X(t)\hat{X}(t)] + \mathbb{E}[\hat{X}(t)^2] \]

The first term can be defined by the following autocorrelation function of the signal \( X(t) \), \( R_X(\tau) \):

\[ R_X(\tau) = \mathbb{E}[X(t)X(t+\tau)] \]

For \( \tau = 0 \) (zero-lag):

\[ \mathbb{E}[X(t)^2] = R_X(0) \]

If \( X(t) \) is zero-mean:

\[ \mathbb{E}[X(t)^2] = \text{Var}(X(t)) = R_X(0) \]

Alternatively, in the frequency domain, \( \mathbb{E}[X(t)^2] \) can be computed by integrating the power spectral density (PSD) of \( X(t) \) over all frequencies, \( S_X(f) \):

\[ \mathbb{E}[X(t)^2] = \int_{-\infty}^{\infty} S_X(f)df \]

The cross-correlation function \( R_{XY}(\tau) \) is defined here with the lag applied to the observation:

\[ R_{XY}(\tau) = \mathbb{E}[X(t)Y(t-\tau)] \]

Writing the estimate \( \hat{X}(t) \) as the filter integral over \( Y(t-\tau) \), the middle term can be expressed using this cross-correlation between \( X(t) \) and \( Y(t) \):

\[ \mathbb{E}[X(t)\hat{X}(t)] = \int_{-\infty}^{\infty} h(\tau)R_{XY}(\tau)d\tau \]

The autocorrelation function measures how the value of the process at one time relates to its value at another time:

\[ R_Y(\tau_1 - \tau_2) = \mathbb{E}[Y(t-\tau_1)Y(t-\tau_2)] \]

Thus, the last term, the power of the estimate \( \hat{X}(t) \), can be expressed with impulse responses as:

\[ \mathbb{E}[\hat{X}(t)^2] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(\tau_1)h(\tau_2)R_Y(\tau_1 - \tau_2)d\tau_1d\tau_2 \]

Using the PSD approach to \( \mathbb{E}[X(t)^2] \), the equation becomes:

\[ MSE = \int_{-\infty}^{\infty} S_X(f)df - 2\int_{-\infty}^{\infty} h(\tau)R_{XY}(\tau)d\tau + \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(\tau_1)h(\tau_2)R_Y(\tau_1 - \tau_2)d\tau_1d\tau_2 \]

Setting the functional derivative of the MSE with respect to the impulse response \( h(\tau) \) to zero gives:

\[ \frac{\delta MSE}{\delta h(\tau)} = -2R_{XY}(\tau) + 2\int_{-\infty}^{\infty} h(\tau')R_Y(\tau - \tau')d\tau' = 0 \]

Rearranging yields the Wiener-Hopf integral equation:

\[ R_{XY}(\tau) = \int_{-\infty}^{\infty} h(\tau')R_Y(\tau - \tau')d\tau' \]
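
In discrete time with a finite number of filter taps, the Wiener-Hopf equation becomes a small linear system, which makes it easy to try out. The sketch below solves the two-tap case under assumptions that are not in the notes above: the signal has autocorrelation \( R_X(k) = \rho^{|k|} \) with \( \rho = 0.5 \), the noise is white with unit power and uncorrelated with the signal (so \( R_{XY} = R_X \) and \( R_Y = R_X + R_N \)), and all function names are hypothetical.

```python
def solve_2x2(a, b):
    """Solve the 2x2 linear system a h = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


SIGNAL_POWER = 1.0   # assumed R_X(0)
RHO = 0.5            # assumed decay of the signal autocorrelation
NOISE_POWER = 1.0    # assumed white-noise power


def r_x(lag):
    """Assumed signal autocorrelation R_X(k) = power * rho^|k|."""
    return SIGNAL_POWER * RHO ** abs(lag)


def r_y(lag):
    """Observation autocorrelation: signal plus uncorrelated white noise."""
    return r_x(lag) + (NOISE_POWER if lag == 0 else 0.0)


# Discrete Wiener-Hopf: sum_j h[j] R_Y(i - j) = R_XY(i) for i = 0, 1.
lhs = [[r_y(0), r_y(-1)],
       [r_y(1), r_y(0)]]
rhs = [r_x(0), r_x(1)]          # R_XY = R_X under the assumptions above

h = solve_2x2(lhs, rhs)

# At the optimum the MSE simplifies to R_X(0) - sum_j h[j] R_XY(j).
mmse = r_x(0) - sum(h_j * r_j for h_j, r_j in zip(h, rhs))
```

With these numbers the taps come out to 7/15 and 2/15, and the residual MSE is 7/15, down from the noise power of 1 that the raw observation would carry.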

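For processes with means \( \mu_X \) and \( \mu_Y \), the means are subtracted before correlating (for zero-mean processes this changes nothing):

\[ R_{XX}(\tau) = \mathbb{E}[(X(t) - \mu_X)(X(t + \tau) - \mu_X)] \]

\[ R_{YY}(\tau) = \mathbb{E}[(Y(t) - \mu_Y)(Y(t + \tau) - \mu_Y)] \]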

Taking Fourier transforms of \( h(\tau) \), \( R_{XX}(\tau) \), \( R_{YY}(\tau) \), and \( R_{XY}(\tau) \) gives the transfer function and the spectral densities:

\[ H(f) = \int_{-\infty}^{\infty} h(\tau)e^{-j2\pi f\tau}d\tau \]

\[ S_X(f) = \int_{-\infty}^{\infty} R_{XX}(\tau)e^{-j2\pi f\tau}d\tau = \int_{-\infty}^{\infty} \mathbb{E}[(X(t) - \mu_X)(X(t + \tau) - \mu_X)]e^{-j2\pi f\tau}d\tau \]

\[ S_Y(f) = \int_{-\infty}^{\infty} R_{YY}(\tau)e^{-j2\pi f\tau}d\tau = \int_{-\infty}^{\infty} \mathbb{E}[(Y(t) - \mu_Y)(Y(t + \tau) - \mu_Y)]e^{-j2\pi f\tau}d\tau \]

\[ S_{XY}(f) = \int_{-\infty}^{\infty} R_{XY}(\tau)e^{-j2\pi f\tau}d\tau = \int_{-\infty}^{\infty} \mathbb{E}[(X(t) - \mu_X)(Y(t - \tau) - \mu_Y)]e^{-j2\pi f\tau}d\tau \]

In the frequency domain, the convolution in the Wiener-Hopf equation becomes a product:

\[ S_{XY}(f) = \int_{-\infty}^{\infty} R_{XY}(\tau)e^{-j2\pi f\tau}d\tau = H(f)S_Y(f) \]

The optimal filter can then be solved as:

\[ H(f) = \frac{S_{XY}(f)}{S_Y(f)} \]

Written out in terms of the expectations, this is:

\[ H(f) = \frac{\int_{-\infty}^{\infty} \mathbb{E}[(X(t) - \mu_X)(Y(t - \tau) - \mu_Y)]e^{-j2\pi f\tau}d\tau}{\int_{-\infty}^{\infty} \mathbb{E}[(Y(t) - \mu_Y)(Y(t + \tau) - \mu_Y)]e^{-j2\pi f\tau}d\tau} \]

The Wiener-Hopf equation allows us to find the optimal filter that minimizes the mean square error between the estimated signal and the actual signal \( X(t) \).
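
As a worked special case (an assumption added here for illustration, not one made in the notes above), suppose the signal and the noise are zero-mean and uncorrelated with each other. Since \( Y(t) = X(t) + N(t) \), the correlations split as \( R_{XY}(\tau) = R_X(\tau) \) and \( R_Y(\tau) = R_X(\tau) + R_N(\tau) \), so with \( S_N(f) \) denoting the noise PSD the optimal filter becomes:

\[ H(f) = \frac{S_X(f)}{S_X(f) + S_N(f)} \]

The filter passes frequencies where the signal power dominates (\( H(f) \approx 1 \)) and suppresses frequencies where the noise dominates (\( H(f) \approx 0 \)).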

The objective is to obtain the specification of a linear dynamic system (the Wiener filter) that accomplishes the prediction, separation, or detection of a random signal. The Kalman filter addresses several limitations of this approach:

It is not easy to synthesize an optimal filter given only its impulse response.
Computing the optimal impulse response is involved, and the difficulty grows rapidly as the problem becomes more complex.
Transient cases such as growing-memory filters and nonstationary prediction require new derivations.
The underlying assumptions and their consequences tend to be obscured by the mathematics, which can lead to errors.

Conditional distributions and expectations are utilized to form means (first-order) and variances (second-order). The Wiener filter finds the best estimate of a signal by projecting it onto the space spanned by observed data, as shown by Doob^[3] and Loève^[4]:

Orthogonal Projections: Refer to the process of finding the best estimate of a random variable by projecting it onto a subspace of other random variables. The best approximation (i.e., minimize MSE) is achieved via the conditional expectation.

Joseph Doob: Formalized the concept of conditional expectation \( \mathbb{E}[X|Y] \), the expected value of a random variable given another random variable. He developed the theory of martingales, a class of stochastic processes that model fair games. For a martingale \( X(t) \):

\[ X(t) = \mathbb{E}[X(t + 1)|X(t), X(t - 1),...] \]

Michel Loève: Studied orthogonal projections in Hilbert spaces. In such a Hilbert space, the estimate is the orthogonal projection of the signal onto the subspace spanned by the observations \( Y \), written with the projection operator \( P_Y \) as:

\[ \hat{X}(t) = P_Y[X(t)] \]
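
The defining property of an orthogonal projection is that the estimation error is orthogonal to (uncorrelated with) the data. The sketch below checks this for the simplest case, projecting one variable onto a single observed variable using sample second moments. The data values and function name are hypothetical, and only the linear least-squares estimate is computed, which coincides with the conditional expectation in the zero-mean jointly Gaussian case.

```python
def project_onto(x_samples, y_samples):
    """Project X onto the span of Y using sample second moments."""
    n = len(x_samples)
    e_xy = sum(x * y for x, y in zip(x_samples, y_samples)) / n
    e_yy = sum(y * y for y in y_samples) / n
    coeff = e_xy / e_yy                                # E[XY] / E[Y^2]
    estimate = [coeff * y for y in y_samples]          # projection of X
    residual = [x - x_hat for x, x_hat in zip(x_samples, estimate)]
    return coeff, estimate, residual


# Hypothetical paired samples of the signal X and the observation Y.
X = [1.0, 2.0, 3.0, 4.0]
Y = [1.0, 1.0, 2.0, 3.0]

coeff, estimate, residual = project_onto(X, Y)

# Orthogonality check: the error has zero inner product with the data.
error_dot_data = sum(e * y for e, y in zip(residual, Y))
```

Any other choice of coefficient would leave a residual that still correlates with `Y`, meaning some usable information would remain unexploited.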

For random signals in white noise, linear dynamic systems can be represented in state form, where the state contains all the information needed to describe the system’s future behavior given its current condition. While Wiener filtering works with correlation functions and impulse responses to minimize the MSE, the state approach is recursive and aligns with the state-space models of control theory.

Suppose \( X(t) \) represents the state of the system at time \( t \) and \( Y(t) \) represents the observed, noisy signal. Let \( A(t) \) and \( B(t) \) be matrices describing the system dynamics, \( U(t) \) be a control input, \( W(t) \) be process noise, \( C(t) \) be a matrix describing how the state is observed, and \( V(t) \) be the observation noise. Let \( Q(t) \) be the covariance of the process noise \( W(t) \) and \( R(t) \) be the covariance of the observation noise \( V(t) \).

The state-transition equation can be written as:

\[ X(t + 1) = A(t)X(t) + B(t)U(t) + W(t) \]

The observation equation is then:

\[ Y(t) = C(t)X(t) + V(t) \]
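
To make the roles of the symbols concrete, here is a minimal simulation sketch of a scalar, time-invariant version of this model, with the control entering through the product `b * u`. The parameter values, the function name, and the use of Gaussian noise are assumptions made for the example.

```python
import random


def simulate(a, b, c, q, r, controls, x0=0.0, seed=0):
    """Simulate x(t+1) = a x(t) + b u(t) + w(t), y(t) = c x(t) + v(t).

    q and r are the variances of the process noise w and observation noise v.
    """
    rng = random.Random(seed)            # seeded for reproducibility
    x = x0
    states, observations = [], []
    for u in controls:
        states.append(x)
        observations.append(c * x + rng.gauss(0.0, r ** 0.5))
        x = a * x + b * u + rng.gauss(0.0, q ** 0.5)
    return states, observations


# Hypothetical parameters: slowly decaying state, constant unit control.
states, observations = simulate(a=0.9, b=1.0, c=1.0, q=0.1, r=0.5,
                                controls=[1.0] * 20)
```

Only `observations` would be available to a filter; `states` is kept here so that an estimate can later be compared against the truth.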

The covariance matrix \( P(t) \) represents the uncertainty in the state estimate \( X(t) \) and can be described using a Riccati equation:

Riccati Equation: A type of nonlinear differential equation that governs the evolution of the covariance matrix of the estimation error. For matrices \( A(t) \), \( C(t) \), \( Q(t) \), and \( R(t) \), the Riccati differential equation of the filter is:

\[ \frac{dP(t)}{dt} = A(t)P(t) + P(t)A^T(t) - P(t)C^T(t)R^{-1}(t)C(t)P(t) + Q(t) \]

The first two terms \( A(t)P(t) + P(t)A^T(t) \) describe how the covariance matrix evolves due to the system’s dynamics. The third term \( -P(t)C^T(t)R^{-1}(t)C(t)P(t) \) involves feedback from the measurement through the observation matrix \( C(t) \), scaled by the inverse of the observation noise covariance. The last term incorporates the covariance of the process noise \( Q(t) \). The Riccati equation of the dual regulator problem has the same structure, with the input matrix \( B(t) \) playing the role of \( C^T(t) \).
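
To see how the growth and feedback terms balance, the sketch below integrates a scalar version of the filter's Riccati differential equation, \( \dot{P} = 2aP - c^2P^2/r + q \), with forward Euler steps. Here `c` is a scalar observation gain; the scalar form, the parameter values, and the step size are illustrative assumptions.

```python
def riccati_rhs(p, a, c, q, r):
    """Right-hand side of the scalar Riccati ODE: 2 a P - P c (1/r) c P + q."""
    return 2.0 * a * p - (p * c) * (c * p) / r + q


def integrate_riccati(p0, a, c, q, r, dt=1e-3, steps=10000):
    """Integrate the scalar Riccati ODE with forward Euler steps."""
    p = p0
    for _ in range(steps):
        p += dt * riccati_rhs(p, a, c, q, r)
    return p


# Hypothetical case a = 0, c = q = r = 1, so dP/dt = 1 - P^2 starting at P = 0.
p_final = integrate_riccati(p0=0.0, a=0.0, c=1.0, q=1.0, r=1.0)
```

In this case the exact solution is \( P(t) = \tanh(t) \), so the covariance rises from zero and levels off near 1, the point where the process noise added per unit time equals the uncertainty removed by the measurements.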

In discrete time, the Riccati difference equation similarly becomes:

\[ P(t + 1) = A(t)P(t)A^T(t) - \left[A(t)P(t)C^T(t)\left(C(t)P(t)C^T(t) + R(t)\right)^{-1}C(t)P(t)A^T(t)\right] + Q(t) \]
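
A scalar version of this difference equation is easy to iterate by hand or in code. The sketch below uses the hypothetical values a = c = q = r = 1, for which the recursion reduces to P(t+1) = P/(P+1) + 1 and converges to the golden ratio, the positive root of P^2 - P - 1 = 0. The function name and starting value are assumptions made for the example.

```python
def riccati_step(p, a, c, q, r):
    """Scalar form of P(t+1) = A P A' - A P C' (C P C' + R)^-1 C P A' + Q."""
    return a * p * a - (a * p * c) * (c * p * a) / (c * p * c + r) + q


p = 1.0                         # assumed initial error variance
trajectory = [p]
for _ in range(50):
    p = riccati_step(p, a=1.0, c=1.0, q=1.0, r=1.0)
    trajectory.append(p)

steady_state = (1.0 + 5.0 ** 0.5) / 2.0   # golden ratio, about 1.618
```

The fact that `p` stops changing illustrates why, for time-invariant systems, the covariance and hence the gain can often be precomputed rather than updated online.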

The optimal filter gain \( K(t) \) (the Kalman gain) can be calculated from the covariance matrix \( P(t) \) as:

\[ K(t) = P(t)C^T(t)\left[C(t)P(t)C^T(t) + R(t)\right]^{-1} \]
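
Putting the gain and the covariance recursion together gives a complete filter. The following is a minimal scalar sketch with no control input, not a definitive implementation: `p_pred` plays the role of \( P(t) \), the gain is computed from it, the estimate is corrected with the new observation, and both are then propagated one step ahead. The parameter values, initial conditions, and observations are hypothetical.

```python
def kalman_filter(observations, a, c, q, r, x0, p0):
    """Scalar Kalman filter: gain, measurement update, then time update."""
    x_pred, p_pred = x0, p0
    filtered, gains = [], []
    for y in observations:
        k = p_pred * c / (c * p_pred * c + r)      # K = P C' (C P C' + R)^-1
        x_filt = x_pred + k * (y - c * x_pred)     # correct with the new data
        p_filt = (1.0 - k * c) * p_pred            # uncertainty after update
        filtered.append(x_filt)
        gains.append(k)
        x_pred = a * x_filt                        # predict the next state
        p_pred = a * p_filt * a + q                # scalar Riccati recursion
    return filtered, gains


# Hypothetical run: a = c = q = r = 1 with constant observations of 1.0.
filtered, gains = kalman_filter([1.0] * 30, a=1.0, c=1.0, q=1.0, r=1.0,
                                x0=0.0, p0=1.0)
```

With these values the gain starts at 0.5 and settles near 0.618, which is P/(P+1) evaluated at the steady-state covariance of the scalar Riccati recursion.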

As mentioned, this new state-based formulation of the Wiener problem is the dual of the noise-free regulator problem. Such a method has clear applications in complex systems and nonstationary processes.