Kalman filtering is a state estimation technique used in many application areas such as spacecraft navigation, motion planning in robotics, signal processing, and wireless sensor networks because of its ability to extract useful information from noisy data and its small computational and memory requirements.^{12,20,27,28,29} Recent work has used Kalman filtering in controllers for computer systems.^{5,13,14,23}

### Key Insights

Although many introductions to Kalman filtering are available in the literature,^{1,2,3,4,6,7,8,9,10,11,17,21,25,29} they are usually focused on particular applications such as robot motion or state estimation in linear systems, making it difficult to see how to apply Kalman filtering to other problems. Other presentations derive Kalman filtering as an application of Bayesian inference, assuming that noise is Gaussian. This leads to the common misconception that Kalman filtering can be applied only if noise is Gaussian.^{15}

Abstractly, Kalman filtering can be seen as a particular approach to combining approximations of an unknown value to produce a better approximation. Suppose we use two devices of different designs to measure the temperature of a CPU core. Because devices are usually noisy, the measurements are likely to differ from the actual temperature of the core. As the devices are of different designs, let us assume that noise affects the two devices in unrelated ways (this is formalized here using the notion of correlation). Therefore, the measurements *x*_{1} and *x*_{2} are likely to be different from each other and from the actual core temperature *x*_{c}. A natural question is the following: is there a way to combine the information in the noisy measurements *x*_{1} and *x*_{2} to obtain a good approximation of the actual temperature *x*_{c}?

One ad hoc solution is to use the formula 0.5**x*_{1}+0.5**x*_{2} to take the average of the two measurements, giving them equal weight. Formulas of this sort are called *linear estimators* because they use a weighted sum to fuse values; for our temperature problem, their general form is α**x*_{1}+β**x*_{2}. In this presentation, we use the term *estimate* to refer to both a noisy measurement and a value computed by an estimator, as both are approximations of unknown values of interest.

Suppose we have additional information about the two devices, say the second one uses more advanced temperature sensing. Because we would have more confidence in the second measurement, it seems reasonable that we should discard the first one, which is equivalent to using the linear estimator 0.0**x*_{1} + 1.0**x*_{2}. Kalman filtering tells us that in general, this intuitively reasonable linear estimator is not "optimal;" paradoxically, there is useful information even in the measurement from the lower quality device, and the optimal estimator is one in which the weight given to each measurement is proportional to the confidence we have in the device producing that measurement. Only if we have no confidence whatever in the first device should we discard its measurement.

The goal of this article^{a} is to present the abstract concepts behind Kalman filtering in a way that is accessible to most computer scientists while clarifying the key assumptions, and then show how the problem of state estimation in linear systems can be solved as an application of these general concepts. First, the informal ideas discussed here are formalized using the notions of distributions and random samples from distributions. Confidence in estimates is quantified using the variances and covariances of these distributions.^{b} Two algorithms are described next. The first one shows how to fuse estimates (such as core temperature measurements) optimally, given a reasonable definition of optimality. The second algorithm addresses a problem that arises frequently in practice: estimates are vectors (for example, the position and velocity of a robot), but only a part of the vector can be measured directly; in such a situation, how can an estimate of the entire vector be obtained from an estimate of just a part of that vector? The best linear unbiased estimator (BLUE) is used to solve this problem.^{16,19,26} It is shown that the Kalman filter can be derived in a straightforward way by using these two algorithms to solve the problem of state estimation in linear systems. The extended Kalman filter and unscented Kalman filter, which extend Kalman filtering to nonlinear systems, are described briefly at the end of the article.

### Formalizing Estimates

**Scalar estimates.** To model the behavior of devices producing noisy temperature measurements, we associate each device *i* with a *random variable* that has a *probability density function* (pdf) *p*_{i}(*x*) such as the ones shown in Figure 1 (the x-axis in this figure represents temperature). Random variables need not be Gaussian.^{c} Obtaining a measurement from device *i* corresponds to drawing a random sample from the distribution for that device. We write *x*_{i}~*p*_{i}(μ_{i}, σ²_{i}) to denote that *x*_{i} is a random variable with pdf *p*_{i} whose mean and variance are μ_{i} and σ²_{i}, respectively; following convention, we use *x*_{i} to represent a random sample from this distribution as well.

**Figure 1. Using pdfs to model devices with systematic and random errors. Ground truth is 60°C. Dashed lines are means of pdfs.**

Means and variances of distributions model different kinds of inaccuracies in measurements. Device *i* is said to have a *systematic error* or *bias* in its measurements if the mean μ_{i} of its distribution is not equal to the actual temperature *x*_{c} (in general, to the value being estimated, which is known as *ground truth*); otherwise, the instrument is unbiased. Figure 1 shows pdfs for two devices that have different amounts of systematic error. The variance σ²_{i}, on the other hand, is a measure of the *random error* in the measurements. The impact of random errors can be mitigated by taking many measurements with a given device and averaging their values, but this approach will not reduce systematic error.
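To make this distinction concrete, here is a minimal sketch (with a made-up ground truth of 60 °C, a noise standard deviation of 2 °C, and a hypothetical bias of 3 °C) showing that averaging many measurements mitigates random error but not systematic error:

```python
import random

# Hypothetical device: unbiased (mean 60 °C) with random error (std dev 2 °C).
# Averaging many measurements mitigates random error; systematic bias remains.
random.seed(0)

def measure(bias=0.0, noise_std=2.0, truth=60.0):
    return truth + bias + random.gauss(0.0, noise_std)

single = measure()
averaged = sum(measure() for _ in range(10000)) / 10000

print(abs(single - 60.0))    # random error of one sample, often around noise_std
print(abs(averaged - 60.0))  # much smaller: std of the mean is noise_std/sqrt(n)

biased_avg = sum(measure(bias=3.0) for _ in range(10000)) / 10000
print(abs(biased_avg - 60.0))  # stays near 3: averaging cannot remove the bias
```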

In the formulation of Kalman filtering, it is assumed that measuring devices do not have systematic errors. However, we do not have the luxury of taking many measurements of a given state, so we must take into account the impact of random error on a single measurement. Therefore, confidence in a device is modeled formally by the variance of the distribution associated with that device; the smaller the variance, the higher our confidence in the measurements made by the device. In Figure 1, the fact we have less confidence in the first device has been illustrated by making *p*_{1} more spread out than *p*_{2}, giving it a larger variance.

The informal notion that noise should affect the two devices in "unrelated ways" is formalized by requiring that the corresponding random variables be *uncorrelated.* This is a weaker condition than requiring them to be *independent*, as explained in our online appendix (http://dl.acm.org/citation.cfm?doid=3363294&picked=formats). Suppose we are given the measurement made by one of the devices (say *x*_{1}) and we have to guess what the other measurement (*x*_{2}) might be. If knowing *x*_{1} does not give us any new information about what *x*_{2} might be, the random variables are independent. This is expressed formally by the equation *p*(*x*_{2}|*x*_{1}) = *p*(*x*_{2}); intuitively, knowing the value of *x*_{1} does not change the pdf for the possible values of *x*_{2}. If the random variables are only uncorrelated, knowing *x*_{1} might give us new information about *x*_{2}, such as restricting its possible values, but the mean of *x*_{2}|*x*_{1} will still be μ_{2}. Using expectations, this can be written as *E*[*x*_{2}|*x*_{1}] = *E*[*x*_{2}], which is equivalent to requiring that *E*[(*x*_{1} − μ_{1})(*x*_{2} − μ_{2})], the covariance between the two variables, be equal to zero. This is obviously a weaker condition than independence.
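A tiny sketch of the distinction: in the toy distribution below (an assumption chosen for illustration, with *x* uniform over {−1, 0, 1} and *y* = *x*²), the two variables are uncorrelated even though *y* is completely determined by *x*:

```python
# Uncorrelated does not imply independent: take x uniform over {-1, 0, 1}
# and y = x^2. Knowing x fixes y exactly, yet their covariance is zero.
xs = [-1, 0, 1]
ys = [x * x for x in xs]

mean_x = sum(xs) / len(xs)  # 0
mean_y = sum(ys) / len(ys)  # 2/3
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)

print(cov)  # 0.0: uncorrelated, yet y is a deterministic function of x
```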

Although the discussion in this section has focused on measurements, the same formalization can be used for estimates produced by an estimator. Lemma 1 shows how the mean and variance of a linear combination of pairwise uncorrelated random variables can be computed from the means and variances of the random variables.^{18} The mean and variance can be used to quantify bias and random errors for the estimator as in the case of measurements.

An *unbiased estimator* is one whose mean is equal to the unknown value being estimated and it is preferable to a biased estimator with the same variance. Only unbiased estimators are considered in this article. Furthermore, an unbiased estimator with a smaller variance is preferable to one with a larger variance as we would have more confidence in the estimates it produces. As a step toward generalizing this discussion to estimators that produce vector estimates, we refer to the variance of an unbiased scalar estimator as the *mean square error* of that estimator or *MSE* for short.

Lemma 1(ii) asserts that if a random variable is pairwise uncorrelated with a set of random variables, it is uncorrelated with any linear combination of those variables.

**Lemma 1.** *Let* *x*_{i}~*p*_{i}(μ_{i}, σ²_{i}) *for* (1≤*i*≤*n*) *be a set of pairwise uncorrelated random variables. Let* *y* = ∑_{i} α_{i}**x*_{i} *be a random variable that is a linear combination of the x*_{i}*'s.*

*The mean and variance of y are:*

- i. μ_{y} = ∑_{i} α_{i}μ_{i} *and* σ²_{y} = ∑_{i} α²_{i}σ²_{i}.
- ii. *If random variable x*_{n+1} *is pairwise uncorrelated with* *x*_{1}, ..., *x*_{n}, *it is uncorrelated with y.*

**Vector estimates.** In some applications, estimates are vectors. For example, the state of a mobile robot might be represented by a vector containing its position and velocity. Similarly, the vital signs of a person might be represented by a vector containing his temperature, pulse rate, and blood pressure. Here, we denote a vector by a boldfaced lowercase letter, and a matrix by an uppercase letter.

The covariance matrix Σ_{xx} of a random variable **x** is the matrix *E*[(**x** − μ_{x})(**x** − μ_{x})^{T}], where μ_{x} is the mean of **x**. Intuitively, entry (*i,j*) of this matrix is the covariance between the *i*^{th} and *j*^{th} components of vector **x**; in particular, entry (*i,i*) is the variance of the *i*^{th} component of **x**. A random variable **x** with a pdf *p* whose mean is μ_{x} and covariance matrix is Σ_{xx} is written as **x**~*p*(μ_{x}, Σ_{xx}). The inverse of the covariance matrix is called the precision or *information* matrix.

*Uncorrelated random variables.* The cross-covariance matrix Σ_{vw} of two random variables **v** and **w** is the matrix *E*[(**v** − μ_{v})(**w** − μ_{w})^{T}]. Intuitively, element (*i,j*) of this matrix is the covariance between elements **v**(*i*) and **w**(*j*). If the random variables are uncorrelated, all entries in this matrix are zero, which is equivalent to saying that every component of **v** is uncorrelated with every component of **w**. Lemma 2 generalizes Lemma 1.

**Lemma 2.** *Let* **x**_{1}~*p*_{1}(μ_{1}, Σ_{1}), ..., **x**_{n}~*p*_{n}(μ_{n}, Σ_{n}) *be a set of pairwise uncorrelated random variables of length m. Let* **y** = ∑_{i} *A*_{i}**x**_{i}.

*The mean and covariance matrix of* **y** *are the following:*

- i. μ_{y} = ∑_{i} *A*_{i}μ_{i} *and* Σ_{yy} = ∑_{i} *A*_{i}Σ_{i}*A*_{i}^{T}.
- ii. *If random variable* **x**_{n+1} *is pairwise uncorrelated with* **x**_{1}, .., **x**_{n}, *it is uncorrelated with* **y**.

The *MSE* of an unbiased estimator **y** is *E*[(**y** − μ_{y})^{T}(**y** − μ_{y})], which is the sum of the variances of the components of **y**; if **y** has length 1, this reduces to variance as expected. The *MSE* is also the sum of the diagonal elements of Σ_{yy} (this is called the *trace* of Σ_{yy}).
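This identity can be checked empirically; the sketch below uses an arbitrary 2×2 covariance matrix and compares the sampled MSE against the trace:

```python
import numpy as np

# Sketch: for an unbiased vector estimator, MSE = E[(y - mu)^T (y - mu)]
# equals the trace of the covariance matrix (illustrated empirically here
# with an arbitrary Gaussian; the claim itself does not require Gaussians).
rng = np.random.default_rng(0)
cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])
mu = np.array([1.0, -1.0])

samples = rng.multivariate_normal(mu, cov, size=200_000)
mse = np.mean(np.sum((samples - mu) ** 2, axis=1))

print(mse, np.trace(cov))  # both close to 3.0
```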

### Fusing Scalar Estimates

We now consider the problem of choosing the optimal values of the parameters α and β in the linear estimator α**x*_{1} + β**x*_{2} for fusing two estimates *x*_{1} and *x*_{2} from uncorrelated scalar-valued random variables.


The first reasonable requirement is that if the two estimates *x*_{1} and *x*_{2} are equal, fusing them should produce the same value. This implies that α + β = 1. Therefore, the linear estimators of interest are of the form

*y*_{α}(*x*_{1}, *x*_{2}) = (1−α)**x*_{1} + α**x*_{2}   (5)

If *x*_{1} and *x*_{2} in Equation 5 are considered to be unbiased estimators of some quantity of interest, then *y*_{α} is an unbiased estimator for any value of α. How should optimality of such an estimator be defined? One reasonable definition is that the optimal value of α *minimizes the variance of y*_{α}, as this will produce the highest-confidence fused estimates.

**Theorem 1.** *Let x*_{1}~*p*_{1}(μ_{1}, σ²_{1}) *and x*_{2}~*p*_{2}(μ_{2}, σ²_{2}) *be uncorrelated random variables. Consider the linear estimator y*_{α}(*x*_{1}, *x*_{2}) = (1−α)**x*_{1} + α**x*_{2}. *The variance of the estimator is minimized for*

α = σ²_{1} / (σ²_{1} + σ²_{2})

The proof is straightforward and is given in the online appendix. The variance (*MSE*) of *y*_{α} can be determined from Lemma 1:

σ²_{y_{α}} = (1−α)²σ²_{1} + α²σ²_{2}   (6)

Setting the derivative of σ²_{y_{α}} with respect to α to zero and solving the resulting equation yields the required result.

In the literature, the optimal value of α is called the *Kalman gain K.* Substituting *K* into the linear fusion model, we get the optimal linear estimator *y*(*x*_{1}, *x*_{2}):

*y*(*x*_{1}, *x*_{2}) = (σ²_{2}/(σ²_{1}+σ²_{2}))**x*_{1} + (σ²_{1}/(σ²_{1}+σ²_{2}))**x*_{2}

As a step toward fusion of *n*>2 estimates, it is useful to rewrite this as follows:

*y*(*x*_{1}, *x*_{2}) = ((1/σ²_{1})**x*_{1} + (1/σ²_{2})**x*_{2}) / (1/σ²_{1} + 1/σ²_{2})

Substituting the optimal value of α into Equation 6, we get the variance of the optimal estimator:

σ²_{y} = σ²_{1}σ²_{2} / (σ²_{1} + σ²_{2})

The expressions for *y* and σ²_{y} are complicated because they contain the reciprocals of variances. If we let ν_{1} and ν_{2} denote the precisions of the two distributions (ν_{i} = 1/σ²_{i}), the expressions for *y* and ν_{y} can be written more simply as follows:

*y*(*x*_{1}, *x*_{2}) = (ν_{1}**x*_{1} + ν_{2}**x*_{2}) / (ν_{1} + ν_{2})   (10)

ν_{y} = ν_{1} + ν_{2}   (11)

These results say the weight we should give to an estimate is proportional to the confidence we have in that estimate, and that we have more confidence in the fused estimate than in the individual estimates, which is intuitively reasonable. To use these results, we need only the variances of the distributions. In particular, the pdfs *p _{i}*, which are usually not available in applications, are not needed, and the proof of Theorem 1 does not require these pdfs to have the same mean.
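The fusion rule can be sketched in a few lines; the temperatures and variances below are illustrative values, not data from the article:

```python
# Minimal sketch of optimal scalar fusion (Theorem 1): weights are proportional
# to precision (inverse variance). Measurement values and variances are made up.
def fuse(x1, var1, x2, var2):
    K = var1 / (var1 + var2)             # Kalman gain: weight given to x2
    y = (1 - K) * x1 + K * x2
    var_y = var1 * var2 / (var1 + var2)  # variance of the fused estimate
    return y, var_y

# Device 2 is more trustworthy (smaller variance), so it gets more weight,
# but device 1 still contributes.
y, var_y = fuse(62.0, 4.0, 59.0, 1.0)
print(y)      # 59.6: closer to x2 than to x1
print(var_y)  # 0.8: smaller than either input variance
```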


**Fusing multiple scalar estimates.** These results can be generalized to optimally fuse multiple pairwise uncorrelated estimates *x*_{1}, *x*_{2}, ..., *x*_{n}. Let *y*_{n,α}(*x*_{1}, .., *x*_{n}) denote the linear estimator for fusing the *n* estimates given parameters α_{1}, .., α_{n}, which we denote by α (the notation *y*_{α}(*x*_{1}, *x*_{2}) introduced previously can be considered to be an abbreviation of *y*_{2,α}(*x*_{1}, *x*_{2})).

**Theorem 2.** *Let x*_{i}~*p*_{i}(μ_{i}, σ²_{i}) *for* (1≤*i*≤*n*) *be a set of pairwise uncorrelated random variables. Consider the linear estimator y*_{n,α}(*x*_{1}, .., *x*_{n}) = ∑_{i} α_{i}**x*_{i} *where* ∑_{i} α_{i} = 1. *The variance of the estimator is minimized for*

α_{i} = (1/σ²_{i}) / (∑_{j} 1/σ²_{j})

The minimal variance is given by the following expression:

σ²_{y} = 1 / (∑_{j} 1/σ²_{j})

As before, these expressions are more intuitive if the variance is replaced with precision: the contribution of *x*_{i} to the value of *y*_{n}(*x*_{1}, .., *x*_{n}) is proportional to *x*_{i}'s confidence. Equations 13 and 14 generalize Equations 10 and 11:

*y*_{n}(*x*_{1}, .., *x*_{n}) = (∑_{i} ν_{i}**x*_{i}) / (∑_{i} ν_{i})   (13)

ν_{y} = ∑_{i} ν_{i}   (14)

**Incremental fusing is optimal.** In many applications, the estimates *x*_{1}, *x*_{2}, ..., *x*_{n} become available successively over a period of time. Although it is possible to store all the estimates and use Equations 13 and 14 to fuse all the estimates from scratch whenever a new estimate becomes available, it is possible to save both time and storage if one can do this fusion incrementally. We show that just as a sequence of numbers can be added by keeping a running sum and adding the numbers to this running sum one at a time, a sequence of *n*>2 estimates can be fused by keeping a "running estimate" and fusing estimates from the sequence one at a time into this running estimate without any loss in the quality of the final estimate. In short, we want to show that *y*_{n}(*x*_{1}, .., *x*_{n}) = *y*_{2}(*y*_{2}(.. *y*_{2}(*x*_{1}, *x*_{2})...), *x*_{n}). A little bit of algebra shows that if *n*>2, Equations 13 and 14 for the optimal linear estimator and its precision can be expressed as shown in Equations 15 and 16:

*y*_{n}(*x*_{1}, .., *x*_{n}) = (ν_{y_{n−1}}**y*_{n−1}(*x*_{1}, .., *x*_{n−1}) + ν_{n}**x*_{n}) / (ν_{y_{n−1}} + ν_{n})   (15)

ν_{y_{n}} = ν_{y_{n−1}} + ν_{n}   (16)

This shows that *y*_{n}(*x*_{1}, ..,*x*_{n}) = *y*_{2}(*y*_{n-1} (*x*_{1}, .., *x*_{n-1}), *x*_{n}). Using this argument recursively gives the required result.^{d}

To make the connection to Kalman filtering, it is useful to derive the same result using a pictorial argument. Figure 2 shows the process of incrementally fusing the *n* estimates. In this picture, time progresses from left to right, the precision of each estimate is shown in parentheses next to it, and the weights on the edges are the weights from Equation 10. The contribution of each *x*_{i} to the final value *y*_{2}(*y*_{2}(...), *x*_{n}) is given by the product of the weights on the path from *x*_{i} to the final value, and this product is obviously equal to the weight of *x*_{i} in Equation 13, showing that incremental fusion is optimal.

**Figure 2. Dataflow graph for incremental fusion.**

**Summary.** The results in this section can be summarized informally as follows. *When using a linear estimator to fuse uncertain scalar estimates, the weight given to each estimate should be inversely proportional to the variance of the random variable from which that estimate is obtained. Furthermore, when fusing n*>2 *estimates, estimates can be fused incrementally without any loss in the quality of the final result.* These results are often expressed formally in terms of the Kalman gain *K*, as shown in Figure 3; the equations can be applied recursively to fuse multiple estimates. Note that if ν_{1} ≫ ν_{2}, then *K* ≈ 0 and *y*(*x*_{1}, *x*_{2}) ≈ *x*_{1}; conversely, if ν_{1} ≪ ν_{2}, then *K* ≈ 1 and *y*(*x*_{1}, *x*_{2}) ≈ *x*_{2}.

**Figure 3. Optimal fusion of scalar estimates.**
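The claim that incremental fusion loses nothing relative to batch fusion is easy to check numerically; the estimates and precisions below are made-up values:

```python
# Sketch checking that incremental fusion matches batch fusion.
# The estimates and precisions are illustrative values.
def fuse2(x1, v1, x2, v2):
    """Fuse two estimates given their precisions v1, v2 (precision form)."""
    return (v1 * x1 + v2 * x2) / (v1 + v2), v1 + v2

xs = [60.5, 59.2, 61.0, 60.1]
precisions = [1.0, 2.0, 0.5, 4.0]

# Batch fusion: precision-weighted average of all estimates at once.
batch = sum(v * x for x, v in zip(xs, precisions)) / sum(precisions)

# Incremental fusion: fold estimates one at a time into a running estimate.
y, v = xs[0], precisions[0]
for x, vx in zip(xs[1:], precisions[1:]):
    y, v = fuse2(y, v, x, vx)

print(batch, y)  # identical up to rounding
```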

### Fusing Vector Estimates

The results for fusing scalar estimates can be extended to vectors by replacing *variances with covariance matrices.*

For vectors, the linear estimator is **y**_{n,A}(**x**_{1}, .., **x**_{n}) = ∑_{i} *A*_{i}**x**_{i}, where ∑_{i} *A*_{i} = *I*. Here *A* stands for the matrix parameters (*A*_{1}, ..., *A*_{n}). All the vectors **x**_{i} are assumed to be of the same length. To simplify notation, we omit the subscript *n* in **y**_{n,A} in the discussion here as it is obvious from the context.

*Optimality.* The parameters *A*_{1}, ..., *A*_{n} in the linear data fusion model are chosen to minimize *MSE*(**y**_{A}), which is *E*[(**y**_{A} − μ_{y_{A}})^{T}(**y**_{A} − μ_{y_{A}})].

Theorem 3 generalizes Theorem 2 to the vector case. The proof of this theorem is given in the appendix. Comparing Theorems 2 and 3, we see that the expressions are similar, the main difference being that the role of variance in the scalar case is played by the covariance matrix in the vector case.

**Theorem 3.** *Let* **x**_{i}~*p*_{i}(μ_{i}, Σ_{i}) *for* (1≤*i*≤*n*) *be a set of pairwise uncorrelated random variables. Consider the linear estimator* **y**_{A}(**x**_{1}, .., **x**_{n}) = ∑_{i} *A*_{i}**x**_{i}, *where* ∑_{i} *A*_{i} = *I*. *The value of MSE*(**y**_{A}) *is minimized for*

*A*_{i} = (∑_{j} Σ_{j}^{−1})^{−1} Σ_{i}^{−1}

Therefore the optimal estimator is

**y**(**x**_{1}, .., **x**_{n}) = (∑_{j} Σ_{j}^{−1})^{−1} ∑_{i} Σ_{i}^{−1}**x**_{i}

The covariance matrix of **y** can be computed by using Lemma 2.

In the vector case, precision is the inverse of a covariance matrix, denoted by *N*. Equations 26 and 27 use precision to express the optimal estimator and its covariance, and generalize Equations 13 and 14 to the vector case:

**y**(**x**_{1}, .., **x**_{n}) = (∑_{j} *N*_{j})^{−1} ∑_{i} *N*_{i}**x**_{i}   (26)

*N*_{y} = ∑_{i} *N*_{i}   (27)

As in the scalar case, fusion of *n*>2 vector estimates can be done incrementally without loss of precision. The proof is similar to the scalar case and is omitted.

There are several equivalent expressions for the Kalman gain for the fusion of two estimates. The following one, which is easily derived from Equation 23, is the vector analog of Equation 17:

*K* = Σ_{1}(Σ_{1} + Σ_{2})^{−1}

The covariance matrix of the optimal estimator **y**(**x**_{1}, **x**_{2}) is the following:

Σ_{yy} = (*I* − *K*)Σ_{1}

**Summary.** The results in this section can be summarized in terms of the Kalman gain *K* as shown in Figure 4.

**Figure 4. Optimal fusion of vector estimates.**
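A minimal sketch of the vector fusion rule in Figure 4, using made-up estimates and diagonal covariance matrices so the result is easy to verify by hand:

```python
import numpy as np

# Sketch of vector fusion: K = S1 (S1 + S2)^-1, y = x1 + K (x2 - x1),
# Syy = (I - K) S1. The estimates and covariance matrices are made up.
def fuse_vec(x1, S1, x2, S2):
    K = S1 @ np.linalg.inv(S1 + S2)
    y = x1 + K @ (x2 - x1)
    Syy = (np.eye(len(x1)) - K) @ S1
    return y, Syy

x1 = np.array([1.0, 2.0])
S1 = np.array([[2.0, 0.0], [0.0, 2.0]])
x2 = np.array([2.0, 1.0])
S2 = np.array([[1.0, 0.0], [0.0, 1.0]])

y, Syy = fuse_vec(x1, S1, x2, S2)
print(y)    # weighted toward x2, which has the smaller covariance
print(Syy)  # "smaller" than either S1 or S2
```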

### Best Linear Unbiased Estimator (BLUE)

In some applications, the state of the system is represented by a vector but only part of the state can be measured directly, so it is necessary to estimate the hidden portion of the state corresponding to a measured value of the visible state. This section describes an estimator called the *best linear unbiased estimator* (BLUE)^{16,19,26} for doing this.

Consider the general problem of determining a value for vector **y** given a value for a vector **x.** If there is a functional relationship between **x** and **y** (say **y**=*F*(**x**) and *F* is given), it is easy to compute **y** given a value for **x** (say **x**_{1}).

In our context, however, **x** and **y** are random variables, so such a precise functional relationship will not hold. Figure 5 shows an example in which *x* and *y* are scalar-valued random variables. The gray ellipse in this figure, called a *confidence ellipse*, is a projection of the joint distribution of *x* and *y* onto the (*x, y*) plane that shows where some large proportion of the (*x, y*) values are likely to be. Suppose *x* takes the value *x*_{1}. Even within the confidence ellipse, there are many points (*x*_{1}, *y*), so we cannot associate a single value of *y* with *x*_{1}. One possibility is to compute the mean of the *y* values associated with *x*_{1} (that is, the expectation *E*[*y*|*x*=*x*_{1}]) and return this as the estimate for *y* if *x*=*x*_{1}. This requires knowing the joint distribution of *x* and *y*, which may not always be available.

**Figure 5. BLUE line corresponding to Equation (31).**

In some problems, we can assume that there is an unknown linear relationship between **x** and **y** and that uncertainty comes from noise. Therefore, we can use a technique similar to the ordinary least squares (OLS) method to estimate this linear relationship, and use it to return the best estimate of *y* for any given value of *x.* In Figure 5, we see that although there are many points (*x*_{1}, *y*), the *y* values are clustered around the line shown in the figure, so the value ŷ_{1} on this line is a reasonable estimate for the value of *y* corresponding to *x*_{1}. This line, called the *best linear unbiased estimator* (BLUE), is the analog of ordinary least squares for distributions.

**Computing BLUE.** Consider the estimator **ŷ**_{A,b}(**x**) = *A***x** + **b**. We choose *A* and **b** so that this is an unbiased estimator with minimal *MSE.* The "ˆ" over the **y** indicates that we are computing an estimate for **y**.

**Theorem 4.** *Let* **x**~*p*_{x}(μ_{x}, Σ_{xx}) *and* **y**~*p*_{y}(μ_{y}, Σ_{yy}) *be random variables with cross-covariance matrix* Σ_{yx}. *The estimator* **ŷ**_{A,b}(**x**) = *A***x** + **b** *for estimating the value of* **y** *for a given value of* **x** *is an unbiased estimator with minimal MSE if* *A* = Σ_{yx}Σ_{xx}^{−1} *and* **b** = μ_{y} − *A*μ_{x}.

The proof of Theorem 4 is straightforward. For an unbiased estimator, *E*[**ŷ**] = *E*[**y**]. This implies that **b** = μ_{y} − *A*μ_{x}, so an unbiased estimator is of the form

**ŷ**_{A}(**x**) = μ_{y} + *A*(**x** − μ_{x})

Note this is equivalent to asserting that the BLUE line must pass through the point (μ_{x}, μ_{y}). Setting the derivative of *MSE*(**ŷ**_{A}) with respect to *A* to zero^{22} and solving for *A*, we find that the best linear unbiased estimator is

**ŷ**(**x**) = μ_{y} + Σ_{yx}Σ_{xx}^{−1}(**x** − μ_{x})   (31)

This equation can be understood intuitively as follows. If we have no information about **x** and **y**, the best we can do is the estimate (μ_{x}, μ_{y}), which lies on the BLUE line. However, if we know that **x** has a particular value **x**_{1}, we can use the correlation between **y** and **x** to estimate a better value for **y** from the difference (**x**_{1} − μ_{x}). Note that if Σ_{yx} = 0 (that is, **x** and **y** are uncorrelated), the best estimate of **y** is just μ_{y}, so knowing the value of **x** does not give us any additional information about **y**, as one would expect. In Figure 5, this corresponds to the case when the BLUE line is parallel to the x-axis. At the other extreme, suppose that **y** and **x** are functionally related so that **y** = *C***x**. In that case, it is easy to see that Σ_{yx} = *C*Σ_{xx}, so **ŷ** = *C***x** as expected. In Figure 5, this corresponds to the case when the confidence ellipse shrinks down to the BLUE line.

Equation 31 is a generalization of ordinary least squares in the sense that if we compute the relevant means and variances of a set of discrete data (*x*_{i}, *y*_{i}) and substitute them into Equation 31, we get the same line that is obtained by using OLS.
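This correspondence is easy to verify numerically; the sketch below computes the BLUE line from sample moments of synthetic data (an assumed linear relationship y = 3x + 1 plus noise) and compares it with `np.polyfit`:

```python
import numpy as np

# Sketch: the BLUE line computed from sample moments of made-up data matches
# the ordinary least squares fit of the same points.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 2.0, size=500)
y = 3.0 * x + 1.0 + rng.normal(0.0, 1.0, size=500)

mu_x, mu_y = x.mean(), y.mean()
cov_yx = np.mean((y - mu_y) * (x - mu_x))
var_x = np.mean((x - mu_x) ** 2)

# BLUE: y_hat = mu_y + (cov_yx / var_x) * (x - mu_x)
blue_slope = cov_yx / var_x
blue_intercept = mu_y - blue_slope * mu_x

ols_slope, ols_intercept = np.polyfit(x, y, 1)
print(blue_slope, ols_slope)          # agree
print(blue_intercept, ols_intercept)  # agree
```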

### Kalman Filters for Linear Systems

We now apply the algorithms for optimal fusion of vector estimates (Figure 4) and the BLUE estimator (Theorem 4) to the particular problem of state estimation in linear systems, which is the classical application of Kalman filtering.

Figure 6a shows how the evolution of the state of such a system over time can be computed if the initial state **x**_{0} and the model of the system dynamics are known precisely. Time advances in discrete steps. The state of the system at any time step is a function of the state of the system at the previous time step and the control inputs applied to the system during that interval. This is usually expressed by an equation of the form **x**_{t} = *f*_{t}(**x**_{t−1}, **u**_{t}), where **u**_{t} is the control input. The function *f*_{t} is nonlinear in the general case, and can be different for different steps. If the system is linear, the relation for state evolution over time can be written as **x**_{t} = *F*_{t}**x**_{t−1} + *B*_{t}**u**_{t}, where *F*_{t} and *B*_{t} are time-dependent matrices that can be determined from the physics of the system. Therefore, if the initial state **x**_{0} is known exactly and the system dynamics are modeled perfectly by the *F*_{t} and *B*_{t} matrices, the evolution of the state over time can be computed precisely, as shown in Figure 6a.

**Figure 6. State estimation using Kalman filtering.**

In general, however, we may not know the initial state exactly, and the system dynamics and control inputs may not be known precisely. These inaccuracies may cause the state computed by the model to diverge unacceptably from the actual state over time. To avoid this, we can make measurements of the state after each time step. If these measurements were exact, there would of course be no need to model the system dynamics. However, in general, the measurements themselves are imprecise.

Kalman filtering was invented to solve the problem of state estimation in such systems. Figure 6b shows the dataflow of the computation, and we use it to introduce standard terminology. An estimate of the initial state, denoted by **x̂**_{0|0}, is assumed to be available. At each time step *t*=1, 2, .., the system model is used to provide an estimate of the state at time *t* using information from time *t*−1. This step is called *prediction*, and the estimate that it provides is called the *a priori* estimate and denoted by **x̂**_{t|t−1}. The *a priori* estimate is then fused with **z**_{t}, the state estimate obtained from the measurement at time *t*, and the result is the *a posteriori* state estimate at time *t*, denoted by **x̂**_{t|t}. This *a posteriori* estimate is used by the model to produce the *a priori* estimate for the next time step, and so on. As described here, the *a priori* and *a posteriori* estimates are the means of certain random variables; the covariance matrices of these random variables are shown within parentheses next to each estimate in Figure 6b, and these are used to weight estimates when fusing them.

We first present the state evolution model and *a priori* state estimation. Then we discuss how state estimates are fused if an estimate of the entire state can be obtained by measurement. Finally, we discuss how to address this problem when only a portion of the state can be measured directly.

**State evolution model and prediction.** The evolution of the state over time is described by a series of random variables **x**_{0}, **x**_{1}, **x**_{2},...

- The random variable **x**_{0} captures the likelihood of different initial states.
- The random variables at successive time steps are related by the following linear model:

  **x**_{t} = *F*_{t}**x**_{t−1} + *B*_{t}**u**_{t} + **w**_{t}   (32)

  Here, **u**_{t} is the control input, which is assumed to be deterministic, and **w**_{t} is a zero-mean noise term that models all the uncertainty in the system. The covariance matrix of **w**_{t} is denoted by *Q*_{t}, and the noise terms in different time steps are assumed to be uncorrelated with each other (that is, *E*[**w**_{i}**w**_{j}^{T}] = 0 if *i*≠*j*) and with **x**_{0}.

For estimation, we have a random variable **x**_{0|0} that captures our belief about the likelihood of different states at time *t*=0, and two random variables **x**_{t|t−1} and **x**_{t|t} at each time step *t* = 1, 2, ... that capture our beliefs about the likelihood of different states at time *t* before and after fusion with the measurement, respectively. The mean and covariance matrix of a random variable **x**_{i|j} are denoted by **x̂**_{i|j} and Σ_{i|j}, respectively. We assume *E*[**x**_{0|0}] = *E*[**x**_{0}] (no bias).

Prediction essentially uses **x**_{t−1|t−1} as a proxy for **x**_{t−1} in Equation 32 to determine **x**_{t|t−1}, as shown in Equation 33:

**x**_{t|t−1} = *F*_{t}**x**_{t−1|t−1} + *B*_{t}**u**_{t} + **w**_{t}   (33)

For state estimation, we need only the mean and covariance matrix of **x**_{t|t−1}. The predictor box in Figure 6 computes these values:

**x̂**_{t|t−1} = *F*_{t}**x̂**_{t−1|t−1} + *B*_{t}**u**_{t}

Σ_{t|t−1} = *F*_{t}Σ_{t−1|t−1}*F*_{t}^{T} + *Q*_{t}

The covariance matrix is obtained from Lemma 2 under the assumption that **w**_{t} is uncorrelated with **x**_{t−1|t−1}, which is justified here.
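The predictor step can be sketched directly from these formulas; the matrices below (a position/velocity model with an assumed time step of 0.1) are illustrative:

```python
import numpy as np

# Sketch of the prediction step: propagate the mean through the linear model
# and the covariance via Lemma 2 (Sigma' = F Sigma F^T + Q). Matrices made up.
F = np.array([[1.0, 0.1],   # e.g., position/velocity with time step 0.1
              [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = 0.01 * np.eye(2)        # covariance of the process noise w_t

def predict(x_post, Sigma_post, u):
    x_prior = F @ x_post + B @ u            # a priori mean
    Sigma_prior = F @ Sigma_post @ F.T + Q  # a priori covariance
    return x_prior, Sigma_prior

x_prior, Sigma_prior = predict(np.array([0.0, 1.0]), 0.5 * np.eye(2),
                               np.array([9.8]))
print(x_prior)      # [0.1, 1.98]
print(Sigma_prior)  # grows relative to Sigma_post: prediction adds uncertainty
```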

**Fusing complete observations of the state.** If the entire state can be measured at each time step, the imprecise measurement at time *t* is modeled as follows:

**z**_{t} = **x**_{t} + **v**_{t}   (34)

where **v**_{t} is a zero-mean noise term with covariance matrix *R*_{t}. The noise terms in different time steps are assumed to be uncorrelated with each other (that is, *E*[**v**_{i}**v**_{j}^{T}] is zero if *i*≠*j*) as well as with **x**_{0|0} and all **w**_{k}. A subtle point here is that **x**_{t} in this equation is the actual state of the system at time *t* (that is, a particular realization of the random variable **x**_{t}), so variability in **z**_{t} comes only from **v**_{t}, and its covariance matrix is *R*_{t}.
_{t}Therefore, we have two imprecise estimates for the state at each time step *t* = 1, 2, ..., the *a priori* estimate from the predictor and the one from the measurement (**z**_{t}). If **z**_{t} is uncorrelated to **x**_{t|t1}, we can use Equations 2022 to fuse the estimates as shown in Figure 6c.

The assumptions that (i) **x**_{t−1|t−1} is uncorrelated with **w**_{t}, which is used in prediction, and (ii) **x**_{t|t−1} is uncorrelated with **z**_{t}, which is used in fusion, are easily proved to be correct by induction on *t*, using Lemma 2(ii). Figure 6b gives the intuition: **x**_{t|t−1}, for example, is an affine function of the random variables **x**_{0|0}, **w**_{1}, **v**_{1}, **w**_{2}, **v**_{2}, ..., **w**_{t}, and is therefore uncorrelated with **v**_{t} (by the assumption about **v**_{t} and Lemma 2(ii)) and hence with **z**_{t}.

Figure 7 shows the computation pictorially using confidence ellipses to illustrate uncertainty. The dotted arrows at the bottom of the figure show the evolution of the state, and the solid arrows show the computation of the *a priori* estimates and their fusion with measurements.

**Figure 7. Illustration of Kalman filtering.**

**Fusing partial observations of the state.** In some problems, only a portion of the state can be measured directly. The observable portion of the state is specified formally using a full row-rank matrix *H*_{t} called the *observation matrix*: if the state is **x**, what is observable is *H*_{t}**x**. For example, if the state vector has two components and only the first component is observable, *H*_{t} can be [1 0]. In general, the *H*_{t} matrix can specify a linear relationship between the state and the observation, and it can be time-dependent. The imprecise measurement model introduced in Equation 34 becomes **z**_{t} = *H*_{t}**x**_{t} + **v**_{t}.

The hidden portion of the state can be specified using a matrix *C*_{t}, which is an orthogonal complement of *H*_{t}. For example, if *H*_{t} = [1 0], one choice for *C*_{t} is [0 1].

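The split of the state into observable and hidden portions can be illustrated numerically; this is a toy sketch with our own values, not the article's:

```python
import numpy as np

H = np.array([[1.0, 0.0]])   # observation matrix: first component observable
C = np.array([[0.0, 1.0]])   # orthogonal complement: second component hidden
x = np.array([3.0, 7.0])

observable = H @ x           # → [3.0]
hidden = C @ x               # → [7.0]
# Because the rows of H and C are orthonormal here, the full state
# can be recomposed exactly from the two portions:
x_recomposed = H.T @ observable + C.T @ hidden   # → [3.0, 7.0]
```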
_{t}Figure 6d shows the computation for this case. The fusion phase can be understood intuitively in terms of the following steps.

- The observable part of the *a priori* estimate of the state, *H*_{t}**x**_{t|t-1}, is fused with the measurement (**z**_{t}) using Equations 20-22. The quantity **z**_{t} - *H*_{t}**x**_{t|t-1} is called the *innovation*. The result is the *a posteriori* estimate of the observable state.
- The BLUE of Theorem 4 is used to obtain the *a posteriori* estimate of the hidden state by adding to the *a priori* estimate of the hidden state a value obtained from the product of the covariance between *H*_{t}**x**_{t|t-1} and *C*_{t}**x**_{t|t-1} and the difference between the *a posteriori* and *a priori* estimates of the observable state.
- The *a posteriori* estimates of the observable and hidden portions of the state are composed to produce the *a posteriori* estimate of the entire state.

The actual implementation, shown in Figure 6d, produces the final result directly without going through these steps, but the incremental steps are useful for understanding how the computation works; their implementation is shown in more detail in Figure 8.

**Figure 8. Computation of a posteriori estimate.**
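The full predict-then-fuse dataflow of Figure 6d can be sketched directly; the helper name `kalman_step` and the sample values in the test below are ours, not the article's implementation:

```python
import numpy as np

def kalman_step(x_post, P_post, z, F, Q, H, R):
    """One predict-and-fuse step of the dataflow in Figure 6d (sketch)."""
    # Predictor: a priori estimate and covariance for time t.
    x_prior = F @ x_post
    P_prior = F @ P_post @ F.T + Q
    # Fusion: combine the a priori estimate with the partial observation z.
    innovation = z - H @ x_prior              # z_t - H_t x_{t|t-1}
    S = H @ P_prior @ H.T + R                 # covariance of the innovation
    K = P_prior @ H.T @ np.linalg.inv(S)      # Kalman gain (a matrix here)
    x_new = x_prior + K @ innovation          # a posteriori estimate
    P_new = (np.eye(len(x_post)) - K @ H) @ P_prior
    return x_new, P_new
```

When *H* = *I* and *Q* = 0, a single step reduces to the direct fusion of two estimates of the full state, mirroring Figure 6c.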

Figure 6d puts all this together. In the literature, this dataflow is referred to as Kalman filtering. Unlike in Equations 18 and 21, the Kalman gain here is not a dimensionless value. If *H*_{t} = *I*, the computations in Figure 6d reduce to those of Figure 6c, as expected.

Equation 39 shows that the *a posteriori* state estimate is a linear combination of the *a priori* state estimate and the measurement (**z**_{t}). The optimality of this linear unbiased estimator is shown in the Appendix. It was shown earlier that incremental fusion of scalar estimates is optimal. The dataflow of Figures 6(c,d) computes the *a posteriori* state estimate at time *t* by incrementally fusing measurements from the previous time steps, and this incremental fusion can be shown to be optimal using a similar argument.

**Example: falling body.** To demonstrate the effectiveness of the Kalman filter, we consider an example in which an object falls from the origin at time *t* = 0 with an initial speed of 0 m/s and an expected constant acceleration of 9.8 m/s^{2} due to gravity. Note that acceleration in reality may not be constant due to factors such as wind and air friction.

The state vector of the object contains two components, one for the distance from the origin *s*(*t*) and one for the velocity *v*(*t*). We assume that only the velocity state can be measured at each time step. If time is discretized in steps of 0.25 seconds, the difference equation for the dynamics of the system is easily shown to be the following:

$$\begin{bmatrix} s_t \\ v_t \end{bmatrix} = \begin{bmatrix} 1 & 0.25 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} s_{t-1} \\ v_{t-1} \end{bmatrix} + \begin{bmatrix} \tfrac{1}{2}\,(9.8)(0.25)^2 \\ (9.8)(0.25) \end{bmatrix} + \mathbf{w}_t$$

where **w**_{t} is a zero-mean Gaussian process noise term and the velocity measurement noise is zero-mean Gaussian with variance 8.

The gray lines in Figure 9 show the evolution of velocity and distance with time according to this model. Because of uncertainty in modeling the system dynamics, the actual evolution of the velocity and position will be different in practice. The red lines in Figure 9 show one trajectory for this evolution, corresponding to a sample of the Gaussian noise term **w**_{t} in Equation 32 (because this noise term is random, there are many possible trajectories for the evolution, and we are just showing one of them). The red lines correspond to "ground truth" in our example.

**Figure 9. Estimates of the object's state over time.**

The green points in Figure 9b show the noisy measurements of velocity at different time steps, assuming the noise is modeled by a Gaussian with variance 8. The blue lines show the *a posteriori* estimates of the velocity and position. It can be seen that the *a posteriori* estimates track the ground truth quite well even when the ideal system model (the gray lines) is inaccurate and the measurements are noisy. The cyan bars in the right figure show the variance of the velocity at different time steps. Although the initial variance is quite large, Kalman filtering is able to reduce it rapidly within a few time steps.
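This experiment can be sketched in a few lines of Python. The process noise covariance, initial covariance, and random seed below are our own assumptions (the article's exact values are not reproduced here); the measurement variance of 8 is from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
dt, g = 0.25, 9.8
F = np.array([[1.0, dt], [0.0, 1.0]])       # dynamics for [distance, velocity]
u = np.array([0.5 * g * dt**2, g * dt])     # effect of constant gravity per step
H = np.array([[0.0, 1.0]])                  # only velocity is observable
Q = np.diag([0.1, 0.1])                     # assumed process noise covariance
R = np.array([[8.0]])                       # measurement variance (from the text)

x_true = np.zeros(2)                        # ground truth trajectory
x_est, P = np.zeros(2), np.diag([100.0, 100.0])  # assumed large initial variance
for t in range(40):                         # 10 seconds of simulated fall
    x_true = F @ x_true + u + rng.multivariate_normal(np.zeros(2), Q)
    z = H @ x_true + rng.normal(0.0, np.sqrt(R[0, 0]), 1)
    # Predict, then fuse with the noisy velocity measurement.
    x_est, P = F @ x_est + u, F @ P @ F.T + Q
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    x_est = x_est + (K @ (z - H @ x_est)).ravel()
    P = (np.eye(2) - K @ H) @ P
```

As in Figure 9, the velocity variance `P[1, 1]` drops from its large initial value to a small steady-state value within a few steps, even though each individual measurement is quite noisy.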

**Discussion.** We have shown that Kalman filtering for state estimation in linear systems can be derived from two elementary ideas: optimal linear estimators for fusing uncorrelated estimates and best linear unbiased estimators for correlated variables. This is a different approach to the subject than the standard presentations in the literature. One standard approach is to use Bayesian inference. The other approach is to assume that the *a posteriori* state estimator is a linear combination of the form *A*_{t}**x**_{t|t-1} + *B*_{t}**z**_{t}, and then find the values of *A*_{t} and *B*_{t} that produce an unbiased estimator with minimum *MSE*. We believe that the advantage of the presentation given here is that it exposes the concepts and assumptions that underlie Kalman filtering.

Most presentations in the literature also begin by assuming that the noise terms **w**_{t} in the state evolution equation and **v**_{t} in the measurement are Gaussian. Although some presentations^{1,10} use properties of Gaussians to derive the results in Figure 3, these results do not depend on the distributions being Gaussian. Gaussians, however, enter the picture in a deeper way if one considers *nonlinear* estimators. It can be shown that if the noise terms are not Gaussian, there may be nonlinear estimators whose *MSE* is lower than that of the linear estimator presented in Figure 6d. However, if the noise is Gaussian, this linear estimator is as good as any unbiased nonlinear estimator (that is, the linear estimator is a *minimum variance unbiased estimator* (MVUE)). This result is proved using the Cramér-Rao lower bound.^{24}

### Extension to Nonlinear Systems

The *extended Kalman filter* (EKF) and *unscented Kalman filter* (UKF) are heuristic approaches to using Kalman filtering for nonlinear systems. The state evolution and measurement equations for nonlinear systems with additive noise can be written as follows; in these equations, *f* and *h* are nonlinear functions:

$$\mathbf{x}_t = f(\mathbf{x}_{t-1}) + \mathbf{w}_t \qquad (42)$$

$$\mathbf{z}_t = h(\mathbf{x}_t) + \mathbf{v}_t \qquad (43)$$

Intuitively, the EKF constructs linear approximations to the nonlinear functions *f* and *h* and applies the Kalman filter equations, whereas the UKF constructs approximations to probability distributions and propagates these through the nonlinear functions to construct approximations to the posterior distributions.

**EKF.** Examining Figure 6d, we see that the *a priori* state estimate in the predictor can be computed using the system model: **x**_{t|t-1} = *f*(**x**_{t-1|t-1}). However, as the system dynamics and measurement equations are nonlinear, it is not clear how to compute the covariance matrices for the *a priori* estimate and the measurement. In the EKF, these matrices are computed by linearizing Equations 42 and 43 using the Taylor series expansions for the nonlinear functions *f* and *h*. This requires computing the following *Jacobians*^{e}, which play the role of *F*_{t} and *H*_{t} in Figure 6d:

$$F_t = \left.\frac{\partial f}{\partial \mathbf{x}}\right|_{\mathbf{x}_{t-1|t-1}} \qquad H_t = \left.\frac{\partial h}{\partial \mathbf{x}}\right|_{\mathbf{x}_{t|t-1}}$$

The EKF performs well in some applications such as navigation systems and GPS.^{28}

**UKF.** When the system dynamics and observation models are highly nonlinear, the UKF^{15} can be an improvement over the EKF. The UKF is based on the *unscented transformation*, which is a method for computing the statistics of a random variable **x** that undergoes a nonlinear transformation (**y** = *g*(**x**)). The random variable **x** is sampled using a carefully chosen set of *sigma points*, and these sample points are propagated through the nonlinear function *g*. The statistics of **y** are estimated using a weighted sample mean and covariance of the posterior sigma points. The UKF tends to be more robust and accurate than the EKF but has higher computational overhead due to the sampling process.
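The unscented transformation itself can be sketched as follows; the sigma-point weights shown use a single spread parameter `kappa`, one common choice among several parameterizations in the literature:

```python
import numpy as np

def unscented_transform(x, P, g, kappa=1.0):
    """Sketch of the unscented transformation: propagate sigma points of
    (x, P) through a nonlinear g and recover the mean and covariance of g(x)."""
    n = len(x)
    L = np.linalg.cholesky((n + kappa) * P)   # scaled matrix square root of P
    sigma = [x] + [x + L[:, i] for i in range(n)] + [x - L[:, i] for i in range(n)]
    w = np.array([kappa / (n + kappa)] + [0.5 / (n + kappa)] * (2 * n))
    ys = np.array([g(s) for s in sigma])      # sigma points pushed through g
    y_mean = w @ ys                           # weighted sample mean
    y_cov = sum(wi * np.outer(yi - y_mean, yi - y_mean)
                for wi, yi in zip(w, ys))     # weighted sample covariance
    return y_mean, y_cov

# Example: push a 2-D Gaussian through a nonlinear map.
y_mean, y_cov = unscented_transform(
    np.array([0.0, 1.0]), 0.1 * np.eye(2),
    lambda s: np.array([np.sin(s[0]), s[1] ** 2]))
```

For a *linear* g, the transformation recovers the mean and covariance exactly; for nonlinear g it captures the posterior statistics to higher order than the EKF's single-point linearization, which is the source of the UKF's robustness.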

### Conclusion

In this article, we have shown that two concepts, optimal linear estimators for fusing uncorrelated estimates and best linear unbiased estimators for correlated variables, provide the underpinnings for Kalman filtering. By combining these ideas, standard results on Kalman filtering for linear systems can be derived in an intuitive and straightforward way that is simpler than other presentations of this material in the literature. This approach makes clear the assumptions that underlie the optimality results associated with Kalman filtering and should make it easier to apply Kalman filtering to problems in computer systems.

### Acknowledgments

This research was supported by NSF grants 1337281, 1406355, and 1618425, and by DARPA contracts FA8750-16-2-0004 and FA8650-15-C-7563. We are indebted to K. Mani Chandy (Caltech), Ivo Babuska (UT Austin), and Augusto Ferrante (Padova, Italy) for their valuable feedback.

### References

1. Babb, T. How a Kalman filter works, in pictures | bzarg. 2018. https://www.bzarg.com/p/how-a-kalman-filter-works-in-pictures/. Accessed: 2018-11-30.

2. Balakrishnan, A.V. *Kalman Filtering Theory.* Optimization Software, Inc., Los Angeles, CA, USA, 1987.

3. Barker, A.L., Brown, D.E., Martin, W.N. *Bayesian Estimation and the Kalman Filter.* Technical Report. Charlottesville, VA, USA, 1994.

4. Becker, A. Kalman filter overview. https://www.kalmanfilter.net/default.aspx. 2018. Accessed: 2018-11-08.

5. Bergman, K. Nanophotonic interconnection networks in multicore embedded computing. In *2009 IEEE/LEOS Winter Topicals Meeting Series* (2009), 67.

6. Cao, L., Schwartz, H.M. Analysis of the Kalman filter based estimation algorithm: An orthogonal decomposition approach. *Automatica 1*, 40 (2004), 5-19.

7. Chui, C.K., Chen, G. *Kalman Filtering: With Real-Time Applications*, 5th edn. Springer Publishing Company, Incorporated, 2017.

8. Eubank, R.L. *A Kalman Filter Primer (Statistics: Textbooks and Monographs).* Chapman & Hall/CRC, 2005.

9. Evensen, G. *Data Assimilation: The Ensemble Kalman Filter.* Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

10. Faragher, R. Understanding the basis of the Kalman filter via a simple and intuitive derivation. *IEEE Signal Process. Mag. 5*, 29 (2012), 128-132.

11. Grewal, M.S., Andrews, A.P. *Kalman Filtering: Theory and Practice with MATLAB*, 4th edn. Wiley-IEEE Press, 2014.

12. Hess, A.-K., Rantzer, A. Distributed Kalman filter algorithms for self-localization of mobile devices. In *Proceedings of the 13th ACM International Conference on Hybrid Systems: Computation and Control*, HSCC '10, 2010, 191-200.

13. Imes, C., Kim, D.H.K., Maggio, M., Hoffmann, H. POET: A portable approach to minimizing energy under soft real-time constraints. In *21st IEEE Real-Time and Embedded Technology and Applications Symposium*, April 2015, 75-86.

14. Imes, C., Hoffmann, H. Bard: A unified framework for managing soft timing and power constraints. In *International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS)*, 2016.

15. Julier, S.J., Uhlmann, J.K. Unscented filtering and nonlinear estimation. *Proc. IEEE 3*, 92 (2004), 401-422.

16. Kitanidis, P.K. Unbiased minimum-variance linear state estimation. *Automatica 6*, 23 (1987), 775-778.

17. Lindquist, A., Picci, G. *Linear Stochastic Systems.* Springer-Verlag, 2017.

18. Maybeck, P.S. *Stochastic Models, Estimation, and Control*, volume 3. Academic Press, 1982.

19. Mendel, J.M. *Lessons in Estimation Theory for Signal Processing, Communications, and Control.* Pearson Education, 1995.

20. Nagarajan, K., Gans, N., Jafari, R. Modeling human gait using a Kalman filter to measure walking distance. In *Proceedings of the 2nd Conference on Wireless Health*, WH '11 (New York, NY, USA, 2011). ACM, 34:1-34:2.

21. Nakamura, E.F., Loureiro, A.A.F., Frery, A.C. Information fusion for wireless sensor networks: methods, models, and classifications. *ACM Comput. Surv. 3*, 39 (2007).

22. Petersen, K.B., Pedersen, M.S. The Matrix Cookbook. http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=3274. November 2012. Version 20121115.

23. Pothukuchi, R.P., Ansari, A., Voulgaris, P., Torrellas, J. Using multiple input, multiple output formal control to maximize resource efficiency in architectures. In *Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)* (2016), IEEE, 658-670.

24. Rao, C.R. Information and the accuracy attainable in the estimation of statistical parameters. *Bull. Calcutta Math. Soc.*, 37 (1945), 81-89.

25. Rhudy, M.B., Salguero, R.A., Holappa, K. A Kalman filtering tutorial for undergraduate students. *Int. J. Comp. Sci. Eng. Surv.* (1), 8 (2017).

26. Sengupta, S.K. Fundamentals of statistical signal processing: Estimation theory. *Technometrics* (4), 37 (1995), 465-466.

27. Souza, É.L., Nakamura, E.F., Pazzi, R.W. Target tracking for sensor networks: A survey. *ACM Comput. Surv.* (*2*), 49 (2016), 30:1-30:31.

28. Thrun, S., Burgard, W., Fox, D. *Probabilistic Robotics (Intelligent Robotics and Autonomous Agents).* The MIT Press, 2005.

29. Welch, G., Bishop, G. *An Introduction to the Kalman Filter.* Technical Report. Chapel Hill, NC, USA, 1995.

30. Pei, Y., Biswas, S., Fussell, D.S., Pingali, K. An Elementary Introduction to Kalman Filtering. *ArXiv e-prints.* October 2017.

### Footnotes

a. An extended version of this article that includes additional background material and proofs is available.^{30}

b. Basic concepts such as probability density function, mean, expectation, variance and covariance are introduced in the online appendix.

c. The role of Gaussians in Kalman filtering is discussed later in the article.

d. We thank Mani Chandy for showing us this approach to proving the result.

e. The Jacobian matrix is the matrix of all first order partial derivatives of a vector-valued function.

The online appendix for this article can be found at http://dl.acm.org/citation.cfm?doid=3363294&picked=formats

**©2019 ACM 0001-0782/19/11**

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from [email protected] or fax (212) 869-0481.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2019 ACM, Inc.