Bridging the ensemble Kalman and particle filter
Abstract.
In many applications of Monte Carlo nonlinear filtering, the propagation step is computationally expensive, and hence, the sample size is limited. With small sample sizes, the update step becomes crucial. Particle filtering suffers from the wellknown problem of sample degeneracy. Ensemble Kalman filtering avoids this, at the expense of treating nonGaussian features of the forecast distribution incorrectly. Here we introduce a procedure which makes a continuous transition indexed by between the ensemble and the particle filter update. We propose automatic choices of the parameter such that the update stays as close as possible to the particle filter update subject to avoiding degeneracy. In various examples, we show that this procedure leads to updates which are able to handle nonGaussian features of the prediction sample even in highdimensional situations.
1. Introduction
State space models consist of a (discrete or continuous time) Markov process which is partially observed at discrete time points and subject to independent random errors. Estimation of the state at time given observations up to the same time is called filtering or data assimilation. Since exact computations are possible essentially only in linear Gaussian situations, mostly Monte Carlo methods are used for filtering. In many environmental applications, in particular in atmospheric physics, oceanography and reservoir modelling, the dimension of the state is, however, very large and the computational costs to propagate the state forward in time are huge, which severely limits the potential sample size for Monte Carlo filtering methods. Particle filters (Gordon et al., 1993; Pitt and Shephard, 1999; Doucet et al., 2000) suffer from the wellknown problem of sample degeneracy (Snyder et al., 2008). In contrast, the ensemble Kalman filter (Evensen, 1994; Burgers et al., 1998; Houtekamer and Mitchell, 1998) can handle some problems where the dimensions of states and observations are large, and the number of replicates is small, but at the expense of incorrectly treating nonGaussian features of the forecast distribution that arise in nonlinear systems.
To relax the Gaussian assumption, two paradigms are predominant: mixture filters that approximate the forecast distribution as a mixture of Gaussians (Bengtsson et al., 2003; Sun et al., 2009; Dovera and Della Rossa, 2011; Frei and Künsch, 2013; Hoteit et al., 2012; Rezaie and Eidsvik, 2012), and sequential importance samplers that use the ensemble Kalman filter as a proposal distribution (Mandel and Beezley, 2009; Papadakis et al., 2010). In this article, we introduce an update scheme that blends these two flavours: A Gaussian mixture proposal obtained from an ensemble Kalman filter update based on a tempered likelihood is corrected by a particle filter update. In this way we do not have to fit a Gaussian mixture to the forecast sample nor do we have to approximate the ratio of the predictive density to the proposal density. A further advantage of our procedure is that we can implement these two steps in such a way that the particle weights do not depend on artificial observation noise variables and the resampling avoids ties.
Our procedure depends on a single tuning parameter , which allows continuous interpolation between the ensemble Kalman filter () and the particle filter (). Hence, the parameter controls the biasvariance tradeoff between a correct update and maintaining the diversity of the sample. It can be chosen without prior knowledge based on a suitable measure of diversity like effective sample size ess (Liu, 1996), or the expected number of Gaussian components which are represented in the resample.
The rest of the article is organized as follows. In Section 2, we detail the problem setting, introduce some notation, and provide background material. In Section 3, we present our new method and discuss implementational aspects. In Section 4, we discuss the choice of the tuning parameter . In Section 5, we consider numerical examples that involve single updates only; based on different prior specifications, we examine the differences of our method in comparison to the ensemble Kalman filter and the particle filter. In Section 6, we consider examples that involve many update cycles in two common test beds. Section 7 contains an outlook to possible generalizations.
2. Problem setting, notation and background material
We consider a dynamical system with state variable and observations . The state follows a deterministic or stochastic Markovian evolution, that is where the system noise is independent of all past values and all , . There is no need to know the function in explicit form, we only assume that for given we are able to simulate from the distribution of . In particular, the evolution can be in continuous time, given by an ordinary or stochastic differential equation.
In all cases we assume linear observations with Gaussian noise: . This means that the likelihood for the state given the observation is . Here and in the following denotes the (in general multivariate) normal density with mean and covariance at . In the final section, we will discuss briefly how to adjust the method for nonGaussian likelihoods.
We denote all observations up to time , by . The forecast distribution at time is the conditional distribution of given , and the filter distribution at time is the conditional distribution of given . In principle, these distributions can be computed recursively, alternating between propagation and update steps. The propagation step leads from to : is the distribution of where and is independent of and has the distribution given by the evolution of the system. The update step leads from to and is nothing else than Bayes formula: . However, analytical computations are possible (essentially) only if the system evolution is also linear with additive Gaussian noise. Hence one resorts to Monte Carlo approximations, i.e., one represents and by ensembles (samples) and respectively. The members of these ensembles are called particles.
The propagation step just lets the particles evolve according to the dynamics of the state, that is we simulate according to the time evolution starting at at time , . However, the computational complexity of this step limits the number of particles, that is the size of the sample.
The (bootstrap) particle filter (Gordon et al., 1993) updates the forecast particles by weighting with weights proportional to the likelihood and converts this into an unweighted sample by resampling, i.e., is obtained by sampling from
(1) 
Thus some of the forecast particles disappear and others are replicated. If the likelihood is quite peaked, which is the case in high dimensions with many independent observations, the weights will be heavily unbalanced, and the filter sample eventually degenerates since it concentrates on a single or a few particles; see Snyder et al. (2008). Auxiliary particle filters (Pitt and Shephard, 1999) can attenuate this behaviour to some extent, but they require good proposal distributions for the propagation, and an analytical expression for the transition densities.
The ensemble Kalman filter (Burgers et al., 1998; Houtekamer and Mitchell, 1998) makes an affine correction of the forecast particles based on the new observation and artificial observation noise variables :
where is an estimate of the forecast covariance at time , typically a regularized version of the sample covariance of , and is the Kalman gain . This update formula is (asymptotically) correct under the assumption that the forecast distribution is Gaussian (Le Gland et al., 2010). Although this is usually not valid, the update nevertheless has been found to work well in a variety of situations; see Evensen (2007) and references therein. For later use, we note that conditional on the forecast sample,
Therefore the filter sample can be considered as a balanced sample from the (conditional) Gaussian mixture
(2) 
Here, “balanced sample” simply means that we draw exactly one realization from each of the equally weighted Gaussian components.
3. A bridge between ensemble and particle updates: The ensemble Kalman particle filter
3.1. The new method
We consider here the update at a single fixed time and thus suppress in the notation. We follow the “progressive correction” idea (Musso et al., 2001) and write
where is arbitrary. Our approach is to use an ensemble Kalman filter update to go from to , and a particle filter update to go from to . The rationale behind this twostage procedure is to achieve a compromise between sample diversity and systematic error due to nonGaussian features of . The former is large if is close to one because the ensemble Kalman filter update draws the particles closer to the observation , and the exponent dampens the ratio of any two resampling probabilities. The latter is small if is small.
Since , and
(3) 
the ensemble filter update is straightforward: We only need to compute the gain with the reduced covariance . The particle update then resamples with weights proportional to . However, there are two immediate drawbacks to such an algorithm: the particle weights depend on the artificial observation noise variables which are needed for the ensemble Kalman filter update, and the resampling introduces tied values. We show next how to address both points. By (2), we can write
(4) 
where
(5)  
(6) 
Instead of sampling from (4) and applying a particle correction, we delay the ensemble Kalman filter sampling step, and update (4) analytically. This is easy because the update of a Gaussian mixture by a Gaussian likelihood is again a Gaussian mixture whose parameters can be computed easily (Alspach and Sorenson, 1972). We obtain
(7) 
where EnKPF stands for ensemble Kalman particle filter and
(8)  
(9)  
(10) 
The update consists now in sampling from (7). The mixture proportions do not depend on the artificial observation noise variables, and even if one dominates, there is still some diversity in the filter sample because the covariance is not zero if .
Sampling from the th component of (7) can be done as follows: Let and be two independent random variables. Then
(11) 
clearly has distribution , and thus by standard arguments
is a sample from . Hence there is no need to compute a square root of .
To summarize, given a forecast ensemble and an observation , the ensemble Kalman particle filter consists of the following steps:
1. Compute the estimated forecast covariance . 
2. Choose and compute according to (3) and according to (5). 
3. Compute according to (6) and the weights according to (8). 
4. Choose indices by sampling from the weights with some 
balanced sampling scheme; e.g., equation (12) in Künsch (2005). 
5. Generate and set . 
6. Compute , generate and set 
. 
Because matrix inversion is continuous, it is easy to check that as , , , , and . Hence in the limit , we obtain the particle filter update. Similarly, in the limit we obtain the ensemble Kalman filter update because for converges to zero and thus . The ensemble Kalman particle filter therefore provides a continuous interpolation between the particle and the ensemble Kalman filter.
3.2. Modifications in high dimensions
If the dimension of the state space is larger than the number of particles, then the variability of the usual sample covariance is huge, and one should use a regularized version as estimate . In the context of the ensemble Kalman filter, the standard regularization technique is the use of a tapered estimate, that is we multiply the sample covariance matrix by a correlation matrix which is zero as soon as the distance between the two components of the state is larger than some threshold, see e.g., Houtekamer and Mitchell (2001); Furrer and Bengtsson (2007). If also the error covariance matrix and the observation matrix are sparse, then this tapering has the additional benefit that the computation of in step 2 and of in step 5 is much faster because we do not need to compute for this. It is sufficient to solve equations of the form which is fast for sparse matrices. However, this advantage is lost because for we need , which is in general a full matrix. We could multiply the gain matrix by another taper in order to facilitate the computation of and to turn into a sparse matrix. This would then make steps 4 and 6 in the algorithm above faster because again all we need to do is to solve equations of the form . Alternatively, one can could also replace the gain by the optimal matrix with a given sparsity pattern, i.e., using a localized update in grid space, see Sakov and Bertino (2010).
Because is only used to compute , a simpler alternative which avoids computing gain matrices is to generate the values needed in step 5 before step 3 and 4, and then to replace by a sparse regularized version of the sample covariance of these values. If this approach is taken, it is usually feasible to generate more than such values in order to reduce the Monte Carlo error in the regularized sample covariance matrix.
3.3. Consistency of the ensemble Kalman particle filter in the Gaussian case
We establish consistency of the ensemble Kalman particle filter for any as the ensemble size tends to infinity, provided that the forecast sample is iid normal. We assume that all random quantities are defined on some given probability space and “almost surely” is short for “almost surely”. The observation is considered to be fixed (nonrandom). The superscript is added to any random quantity that depends on the ensemble size . We use the following notion of convergence: A sequence of random probability measures converges almost surely weakly to the probability measure if converges almost surely to for any continuous and bounded function .
Theorem 1.
Suppose that is an iid sample from . Then, for any , the sequence as defined in (7) converges almost surely weakly to the true posterior . Additionally, if is a conditionally iid sample from , then also converges almost surely weakly to .
A proof is given in the appendix. Notice that if balanced sampling is used to sample from , the particles are no longer conditionally iid, and the arguments become more complicated; see Künsch (2005) for a discussion in the context of auxiliary particle filters. Inspection of the proof of Theorem 1 shows that if is nonGaussian with finite second moments, then still converges almost surely weakly to a nonrandom limit distribution . The limit distribution depends on and generally differs from the correct posterior for . The limit cannot be easily identified, and in particular it is difficult to quantify the systematic error as a function of . Using similar arguments as in Randles (1982), it is also possible to show that
weakly, where the asymptotic covariance depends on , and . In general, is analytically intractable, and we cannot verify if decreases as a function of , as we expect.
4. Choice of
4.1. Asymptotics of weights
Recall that for the method is exactly the particle filter, and for it is exactly the ensemble Kalman filter. Hence it is clear that there is a range of values where we obtain an interesting compromise between the two methods in the sense that the weights are neither uniform nor degenerate. We try to provide some theoretical insight where this range of values is, and later we develop a criterion which chooses a good value automatically.
We want to see how the weights in (7) behave as a function of when the dimension of the observations is large. By definition
where is the prediction mean,
and and stand for and .
The following lemma gives an approximate formula for the variance of the .
Lemma 1.
Define approximate weights by
where and are as defined above, but with the true forecast covariance instead of . If the forecast sample is iid , then
As , we have
(12) 
where .
A proof is given in the appendix. The matrix is positive definite and also the trace in formula (12) is positive. Therefore we expect that the variance of the weights is of the order : This is true if , and are all multiples of the identity, and there is no reason for a different behavior in other cases. This suggests that for highdimensional observations, we need to choose close to one in order to avoid degeneracy. Note however that the final update can still differ from the ensemble Kalman filter update, even if the largest part of the update occurs with the ensemble Kalman filter.
4.2. Criteria for the selection of
Because the Kalman filter update uses only the first two moments of the forecast distribution, it seems plausible that for nonGaussian forecast distributions, the Kalman update will be less informative than the correct update. Hence, as long as the spread is a meaningful measure of uncertainty, we expect the Kalman update to have a larger spread than the correct update. This is not always true though: The variance of the correct posterior is only on average smaller than the variance of the ensemble Kalman filter update. In particular, this may fail to hold for some values of if the prior is multimodal.
Still, this heuristic suggests that in some cases the spread of the update ensemble will increase monotonically in and we could choose such that the spread of the update is not smaller than a factor times the spread of an ensemble Kalman filter update (where is maybe between 0.5 and 0.8). This means that we compare the variances of the Gaussian mixture , where denotes the number of times component in (7) has been selected, with the variances of the Kalman filter update, that is the diagonal elements of . However, this is computationally demanding, because we have to compute among other things ; compare the discussion in Section 3.2 above.
A simpler procedure is based on the standard deviations of the updated ensembles. Denote by the standard deviation of the th component of the final update sample (which depends on ) and by the standard deviation of the update sample using the ensemble Kalman filter. Then we could choose the smallest such that or (in order to control not only the total spread but all marginal spreads) such that . This has the advantage that it is easy to compute also in high dimensions. The disadvantage is that it depends also on the generated noises. This can be reduced somehow by taking the same noises and for all values of under consideration.
A third possibility is to look only at the properties of the weights . We can take as the measure of the sampling diversity the socalled effective sample size ess (Liu, 1996), which is defined as
or the quantity
which is the expected number of components that are chosen when generating according to (7) with balanced sampling. Both measures are related to a distance between the and uniform weights. Although there is no universal relation between the two criteria, in typical cases , i.e., ess is a more conservative measure of diversity. Both criteria do not take the spread in into account, which also increases if increases. Therefore, they give only a lower bound for the diversity, but they are easy to compute. We then choose as the smallest value for which or . In order to avoid excessive computations, we considered in the examples below only multiples of as values for and used a binary search tree (with at most 4 search steps), assuming that the diversity is increasing in . We did not try to prove this assumption since the calculation is expected to be extremely tedious; the assumption is safe to make, though, since at the worst we end up with a too large . Alternatively one could use an approximation of ess based on (12).
5. Examples of single updates
5.1. Description of the setup
We consider a single update for 4 situations. In all cases and . There are 2 forecast samples combined with 2 values and for the observations. The construction of the forecast sample starts with a sample , which is then modified to introduce nonGaussian features. More precisely, we consider the following situations:

A Gaussian prior: We set , , , and . This means that the second observation contradicts the prior, although not excessively.

A bimodal prior: We set , for and for , and . In the former case, the true posterior is unimodal and in the latter case it is bimodal.
We take and . We generate one sample in dimension and use the first components. In all cases, we use a triangular taper with range 10, assuming that the states are values along a line (the optimal taper has range 0, since the true covariance is diagonal). Without a taper, all procedures break down: They become overconfident as the dimension increases, and the estimated sampling diversity increases with .
5.2. Variation of
We compute the update for and take as the measure of the sampling diversity the quantity introduced above. Some results are shown in Figure 1. The diversity increases with and decreases with the dimension as expected. In the bimodal case, even the particle filter does not degenerate, and typically small or moderate values of apparently give sufficient diversity.
5.3. The updates of the first two coordinates
We concentrate on the first two coordinates of which contain the nonGaussian features (if present). We show the contours of the true update density (7) for the ensemble Kalman filter and for the filter with chosen such that the diversity is approximately 40%. Figure 2 shows the results for bimodal prior with . In case of the Gaussian prior, the two plots (which are not shown here) are virtually identical. In the nonGaussian situation, the combined filter is able to pick up some nonGaussian features. In particular, the shape and not only the location depends on the observation, and the bimodality of the posterior is captured.
6. Examples of filtering with many cycles
6.1. The Lorenz 96 model
The 40variable configuration of the Lorenz 96 model (Lorenz and Emanuel, 1998) is governed by the ordinary differential equation
where the boundary conditions are assumed to be cyclic, i.e., . The model is chaotic and mimics the timeevolution of a scalar meteorological quantity on a latitude circle. We adopt the same experimental setup as in Bengtsson et al. (2003) and Frei and Künsch (2013): Measurements of odd components with uncorrelated additive noise at observation times , , are taken. The large lead time produces a strongly nonlinear propagation step. The system is integrated using Euler’s method with step size . Both the ensemble Kalman filter and ensemble Kalman particle filter are run with ensemble members. The true initial state and the initial ensemble members are randomly drawn from . All sample covariance matrices are replaced by tapered estimates; for the sake of simplicity, we used the same taper matrix throughout, namely the GC taper constructed from the correlation function given in (Gaspari and Cohn, 1999, equation (4.10)) with support halflength . For the ensemble Kalman particle filter, the parameter is chosen adaptively to ensure that the diversity stays within a prespecified interval if possible. We also ran the filter proposed in Papadakis et al. (2010). For the given ensemble size, we were not able to obtain a nondivergent run; the filter collapsed after just a few cycles.
The filter performance is assessed via a scoring rule evaluated at observation times. Here, we use the root mean square error of the ensemble mean, and the continuous ranked probability score (Gneiting et al., 2007) for the first two state variables. More precisely, if is the true solution at time , and the mean of the updated ensemble, and the marginal empirical cumulative distribution function of the updated ensemble, then the root mean square error of the ensemble mean at time is
(13) 
and the continuous ranked probability score for the th variable at time is
(14) 
where . Notice that for reasons of symmetry, we only consider the crps of the first two state variables (i.e., one observed and one unobserved variable).
Tables 1 and 2 compile summaries (first and ninth decile, mean and median) of the 2000 rmse and crps values.
mean  

EnKF  0.56  0.81  0.87  1.25  
EnKPF  0.52  0.75  0.83  1.21  
EnKPF  0.51  0.73  0.80  1.18  
EnKPF  0.50  0.71  0.79  1.17  
EnKPF  0.49  0.70  0.78  1.16  
EnKPF  0.49  0.71  0.79  1.17 
mean  mean  

EnKF  0.12  0.22  0.32  0.65  0.14  0.38  0.57  1.18  
EnKPF  0.11  0.21  0.30  0.62  0.13  0.33  0.54  1.13  
EnKPF  0.11  0.21  0.29  0.61  0.12  0.32  0.51  1.10  
EnKPF  0.11  0.20  0.29  0.59  0.12  0.32  0.49  1.02  
EnKPF  0.10  0.20  0.28  0.58  0.11  0.31  0.48  1.00  
EnKPF  0.10  0.21  0.29  0.59  0.11  0.31  0.50  1.05 
The gain over the ensemble Kalman filter achieved by the ensemble Kalman particle filter is substantial. In particular, as the crps values show, the ensemble Kalman particle filter is able to track the unobserved states much more accurately than the ensemble Kalman filter. Overall, the results for the ensemble Kalman particle filter are comparable with those reported for the XEnKF in Frei and Künsch (2013). Arguably, the best performance of the ensemble Kalman particle filter is achieved with diversity constrained to , but the scores are surprisingly robust. Finally, we note that for smaller ensemble sizes, e.g., , the ensemble Kalman particle filter still improves over the ensemble Kalman filter, but the results (which are not shown here) are less impressive. Nevertheless, one should keep in mind that with a particle filter, even with judicious tuning, much more particles are required to compete with the ensemble Kalman filter, as illustrated in Bocquet et al. (2010) (for a slightly different configuration of the Lorenz model).
6.2. The Kortewegde Vries equation
We consider the Kortewegde Vries equation on the circle (Drazin and Johnson, 1989):
with domain and periodic boundary conditions, . Versions of this equation have been used as test beds for data assimilation in, e.g., van Leeuwen (2003), Lawson and Hansen (2005), or Zupanski and Zupanski (2006). The spatial domain is discretized using an equispaced grid with grid points. The spectral split step method is used to solve the equation numerically (with an explicit 4th order RungeKutta time step for the nonlinear part of the equation). As initial prior we take the random field
For the truth, we use , and the initial ensemble is a (quasirandom) sample from . The ensemble size is , and thus . Six irregularly spaced observations with uncorrelated additive noise at observation times , , are taken. For illustration, Figure 3 displays the initial 16member ensemble and the predictive ensemble at the first observation time (together with the observations).
The particle filter, the ensemble Kalman filter and the ensemble Kalman particle filter are run (with no tapering applied). For the ensemble Kalman particle filter, we fix , which ensures that lies roughly in the interval . Since the particle filter degenerates very quickly for such a small ensemble, a benchmark run with particles is carried out. Figure 4 displays ensemble deviations from the true solution after 1 and 10 update cycles.
Apparently, both the ensemble Kalman filter and ensemble Kalman particle filter track the true solution reasonably well, and the state uncertainty is well represented. In terms of error of the ensemble mean, there is not much difference between the ensemble Kalman filter and the ensemble Kalman particle filter. However, the ensemble Kalman particle filter produces particles that exhibit less dynamical inconsistencies. In Figure 5, for each filter, the particle with the most curvature after 10 update cycles is shown, where the curvature of a solution is defined by , and a finite difference approximation is used for the discretized solutions. For reference, we note that the true solution (not shown in the plots) is virtually identical to the particle shown in the rightmost plot.
The particle filter (which conserves dynamical constraints) yields very smooth particles, whereas the ensemble Kalman filter may produce wiggly, “unphysical” particles. The ensemble Kalman particle filter lies inbetween: some particles still show slight dynamical imbalances, but these are much less pronounced than for the ensemble Kalman filter. Also, if a taper is applied to the forecast covariance matrix (which is not the case in the example here), the ensemble Kalman filter suffers even more from these imbalances.
7. Outlook to possible generalizations
In the spirit of the “progressive correction” idea (Musso et al., 2001), the ensemble Kalman particle filter update could also be split up in several steps. We fix constants and with . Then, for , we apply an ensemble Kalman filter update with likelihood , followed by a particle filter update with likelihood , followed by the resampling step. It is only necessary to estimate the predictive covariance for the first step; for the subsequent steps, , we can compute the covariance analytically from the mixture representation (7) (for large , this is numerically delicate, but the same remedies as discussed in Section 3.2 can be applied). We expect that the bias of such an iterative ensemble Kalman particle filter update is similar as for a single ensemble Kalman particle filter update with , but the variance will decrease with increasing since the likelihoods become flatter. In the limiting case , which corresponds to a full particle filter update, we conjecture that is sufficient to retain the sampling diversity. This claim is supported by Beskos et al. (2012), who analyze the “tempering” idea in simpler but related situations.
A potential drawback of the ensemble Kalman particle filter (in comparison to nonGaussian ensemble filters akin to Lei and Bickel (2011)) is its restriction to Gaussian linear observations. However, the idea of combining an ensemble Kalman filter and a particle filter update could also be used for arbitrary observation densities. Let be a matrix that selects those components of the state variable that influence the observation, and assume that we have an approximation of the likelihood of the form . Then we can use this approximation for an ensemble Kalman filter update, and correct by a particle filter update with weights proportional to . In order to construct an approximation of the likelihood of the above form, we can use a Taylor approximation
where and are the gradient and the Hessian, respectively, of the log likelihood. Then and . Alternatively, one could center the expansion at the mode of likelihood. Such an approximation is expected to work well in cases where the likelihood is logconcave, e.g., when given is Poisson with parameter .
Acknowledgments
The authors thank Jo Eidsvik for fruitful discussions.
Appendix
Proof of Theorem 1
Convergence of to implies convergence of to , see Lemma 7 in Frei and Künsch (2013). It remains to establish the former convergence. To begin with, we introduce some notation used throughout the remainder. For ease of legibility, the dependence on is dropped. An overbar is added to any quantity to denote its population counterpart, in which has been replaced by . For a measure and a function , we write . Expressions of the form are shorthand for converges to almost surely as goes to . Straightforward application of the strong law of large numbers shows that the population version converges to some nonrandom limit, and it is clear by construction that this limit equals . Hence, it remains to prove that the population version of the ensemble Kalman particle filter has the same limit as the ensemble Kalman particle filter, i.e., we need to prove that
(15) 
for any continuous and bounded function . In addition, we may assume that is compactly supported on, say, , since these functions are still convergence determining for the weak topology. Write .
Observe that
(16) 
In the following, we show that both terms on the righthand side of (16) converge to , which proves (15). For later use, we note that , and hence by continuity,
(17) 
The first term in (16) can be bounded by . Let and denote the unnormalized weights in (8), i.e.,
Observe that
where the last inequality follows from for arbitrary positive definite and positive semidefinite . The same bound holds true for . Notice that
and and . The are iid, hence, converges almost surely, and we conclude that if , where
(18) 
To show (18), we fix a compact set . Then we have
(19) 