Plug-in Bandwidth Selectors for Bivariate Kernel Density Estimation

Tarn Duong
Supervisor: Martin Hazelton


Date: May 2002

This talk is about my PhD research, supervised by Dr Martin Hazelton. My topic is bandwidth selection for multivariate kernel density estimators. Today I will be talking about some research carried out late last year/ early this year on a particular type of bandwidth selector, namely plug-in bandwidth selectors.

The outline for today's talk is as follows. (Read slide.)

This is the `Old Faithful' geyser data - it is most commonly analysed in a reduced univariate form. Here then is the full bivariate dataset. In the scatterplot, the horizontal axis is the duration time of an eruption and the vertical axis is the time till the next eruption. What we wish to know is: what is a plausible probability density function which describes this dataset i.e. how to recover all the essential features of this dataset.

The approach that we take is to use kernel density estimation. The appeal of kernel density estimators is that they are easy to visualise conceptually. All that is involved is that we place a kernel function, with probability mass 1/n centred at each of the n data points and then we simply add all the kernel functions together to form a kernel density estimate. To demonstrate, I will use a small dataset with just 5 points in it so I can show you the explicit process without any obscuring details.

So here we have five data points - and we have centred a kernel function, of mass 1/5, at each of the data points. Here the kernel functions are represented by the dotted contour curves.

(This goes on top of the previous slide.) If we then add all these kernels, we have then a kernel density estimate, shown here in the solid contour curves.

The equation which represents what I have been just showing is on the slide: (read out the equation conditions.) It expresses that to form a kernel density estimate we take the mean of the n normal kernels, each with mean Xi and variance H.

We have seen that kernel density estimation is a simple procedure but it also has other advantages including

* * *

The single most factor in determining the performance of a kernel density estimator and the shape of the resulting kernel density estimate is the choice of bandwidth matrix. Since the bandwidth matrix can be thought of as the variance of the kernel function, it controls both the spread and the orientation of the kernel.

So how do we choose the most appropriate bandwidth? Implicit in this question is the criterion that we use to measure appropriateness. The most commonly used criterion is the Mean Integrated Squared Error or MISE for short. The formula for the MISE is given on the slide.

So this is the integral of the MSE of $ \hat{f}$ or equivalently is the expected squared distance between a kernel density estimate $ \hat{f}$ and the target density f integrated over $ \mathbb{R}^2.$. Note that the MISE doesn't depend on the actual dataset as we take expectation so this is a measure of the `average' performance of the kernel density estimator.

Unfortunately this MISE does not have a closed form for an arbitrary density f and so we need to resort to using an asymptotic approximation via a Taylor's series expansion. This is known as the AMISE, the A is for asymptotic. The formula for it is here.

The first term is the asymptotic integrated variance and the second is the asymptotic integrated squared bias. Now vech is the `vector half' operator which takes a matrix and returns a vector of all the elements in the lower triangular half i.e if H is this matrix then vech H is this vector on the slide; and $ \boldsymbol{\Psi}$ is a matrix of functionals that depend on the target density f.

So it now seems straight forward to find the optimal bandwidth matrix by simply minimising the above AMISE criterion. However since the $ \boldsymbol{\Psi}$ matrix depends on the target density f we can't use the AMISE yet. We need to first estimate this matrix. This is where the plug-in methods derive their name - we produce an estimate and which we then plug into the AMISE and then proceed with the minimisation.

The general algorithm for a plug-in selector consists of four main steps. First we decide whether we will use diagonal or a full bandwidth matrix selector. Then we estimate the $ \boldsymbol{\Psi}$ matrix. We then plug-in this estimate into the AMISE criterion to form the PI (for plug-in) criterion. We minimise this to find the optimal bandwidth which we denote as $ \hat{\mathbf{H}}_\mathrm{PI}.$

The existing methods use a diagonal bandwidth matrix and estimate the $ \boldsymbol{\Psi}$ element-wise. Our proposed method makes the following two modifications to the existing methods. We use a full bandwidth matrix rather than the restricted diagonal matrix; and we estimate the $ \boldsymbol{\Psi}$ matrix matrix-wise rather than element-wise. The final two steps of the algorithm remain the same for both methods.

The current method simplifies the problem by restricting H to be a diagonal matrix. However it now means that we are restricted to using kernels that are oriented parallel to the co-ordinate axes. On the left are kernels with diagonal bandwidth matrices. On the right are kernels with full bandwidth matrices. The trade-off here for increased flexibility is that the problem becomes more computationally intensive and no longer has an analytical solution.

* * *

For the second step, if we introduce some more notation we can write down an explicit expression for the $ \boldsymbol{\Psi}$ matrix which will help us greatly in trying to estimate it. We use the following notation to indicate partial derivatives and the $ \psi$ functional (indexed by r1, r2) is defined on the slide. Note that the $ \psi$ functionals are scalars. Then the explicit expression is given on the slide. An important property of this matrix is that it is positive definite.

Now we are in a position to estimate this $ \boldsymbol{\Psi}$ matrix. As $ \psi_\vec{r}= \mathbb{E}\hspace*{2pt}f^{(\vec{r})} (\vec{X})$ where X has density f then an estimator is given on the slide. This equation says that we take the mean of the n (r1, r2)-th partial derivatives of the kernel density estimates evaluated at each of the data points Xi. Note that we use a scalar pilot bandwidth g here because if we try to use a full pilot bandwidth matrix then the problem becomes much more mathematically involved. The trade off between simplicity and flexibility can be justified to be valid if we pre-transform the data into a suitable form.

So the problem is now to optimally select this pilot bandwidth gThis will be done by minimising the MSE of this estimator.

The AMSE of the $ \hat{\psi}_{r_1, r_2}$ is given on the slide. The first term is the asymptotic integrated variance and the second term is the asymptotic integrated squared bias.

There two different formulas, one for the case where r1 and r2 are both even and and one for the case where r1 and r2 are both odd. Int the first case, arises from a bias annihilating calculation i.e we set the second term of the AMSE to be zero and solve forg. In the second case, the first term of the bias is 0 and so instead of a bias annihilating calculation, we find the minimum of the AMSE by setting the derivative to 0 and solving for g These two approaches give rise to the two different expressions for the optimal pilot bandwidths.

The formulas for deriving the pilot bandwidths are given on the slide here.


The current method of estimating the $ \boldsymbol{\Psi}$ matrix is to estimate each of its elements with its own pilot bandwidth so we require 5 pilot bandwidths and the estimate looks like this.

A problem with this method of element-wise estimation of a matrix is that it does not try to reproduce the properties of the matrix. In particular, this method does not guarantee the $ \hat{\boldsymbol{\Psi}}$ to be positive definite. This is important since if $ \hat{\boldsymbol{\Psi}}$ is not positive definite then PI(H) will not have a finite minimum and even if $ \hat{\boldsymbol{\Psi}}$ is positive definite but nearly singular then this may lead to numerical instability in the minimisation of PI(H).

This problem has not been studied hitherto now because if we use a diagonal bandwidth matrix then analytical expressions are available for the optimal bandwidths and so no numerical minimisation is required.

Our second modification that we propose is to use a single pilot bandwidth to estimate all the $ \psi$ functionals. If we sacrifice flexibility and optimality of using separate pilot bandwidths at this stage then we have a parsimonious selector as we only have to find one pilot bandwidth. This is done via summing all 5 of the AMSE expressions to form the SAMSE (sum of AMSE).

The expression for SAMSE is on the slide. We derive it multiplying out the squared bias term from the AMSE, then removing any lower order terms and then summing.
So the optimal g is this case can be found by differentiating it. The derivative just so happens to be a quadratic in n-1g-8 and can be solved with little difficulty and the result is on the slide.

So the estimate of $ \boldsymbol{\Psi}$ looks like this.

Most importantly the $ \hat{\boldsymbol{\Psi}}$ estimated in this way is guaranteed to be positive definite. So in this sense we are estimating the $ \boldsymbol{\Psi}$ matrix matrix-wise by reproducing its matrix properties. The question we now ask is how much efficiency do we sacrifice to ensure positive definiteness? The quick answer is not much; however the calculations to show this are quite involved and won't be reproduced here today.

So those were the first two steps for a plug-in selector. For the final two steps in the algorithm i.e. forming the plug-in criterion and minimising it to find the optimal bandwidth are straight forward, and they are the same for the existing and proposed methods.

* * *

We now look at some simulation results. The target densities we look at are normal mixtures because they a) exhibit a wide range of features that we wish to recover using kernel density estimation and b) have a closed form for the ISE or Integrated Squared Error. Note that the ISE depends on the data so we will have a different ISE for each sample. We use the log(ISE) to compare the performance between the bandwidth selectors. The three selectors we compare are the existing diagonal with different pilot bandwidths, a selector wtih the first modification we propose i.e. a full selector with different pilot bandwidths and a selector with the two modifications we have proposed i.e. a full matrix selector with a single pilot bandwidth. We run these simulations for 400 trials each with a sample size of 1000.

The first density we look at is a highly correlated normal density, with correlation 0.9. This is the contour plot of this density. We expect that a full bandwidth selector will perform better here than a diagonal bandwidth selector as most of the probability mass is aligned 45 degrees to the axes.

If we have a look at the boxplot of the log(ISE) then this is indeed the case. The log(ISE) for the diagonal selector, here on the right, is higher than that of the full selectors. For this density, if we use a full bandwidth matrix with different pilot bandwidths, the estimated $ \boldsymbol{\Psi}$ matrix fails to be positive definite 3.25 % of the time or in 13 trials out of 400.

The next density we consider is a more complicated density with three modes. The contour plot shows that this density has probability mass oriented to the axes as well as obliquely. So it not clear whether a diagonal or full bandwidth selector will have better performance.

If we have a look at the boxplot of the log(ISE) then we see in this case, that the performance for the three selectors are comparable (with perhaps the full selectors a little better.) Though this time, even if we use the full bandwidth matrix with different pilot bandwidths, there is no problem with lack of positive definiteness of the estimated $ \boldsymbol{\Psi}$ matrix.

So for the two densities we have considered today, we see that full bandwidth matrix selectors performs at least as well as the existing diagonal selectors in terms of ISE and the full bandwidth selector with a single pilot performs better than the full bandwidth selector with different pilots in terms of positive definiteness. Buoyed by these simulation results, we now turn to a real data set, the `Old Faithful' dataset that I have shown you previously.

This is the kernel density estimate of geyser data using the existing diagonal selector. The yellow and green areas are relatively lower then the blue and red regions. From this plot, we see that the contours are rather wavy - this is a result from the using a kernel that is aligned to the axes whilst the data are not.

If we use the proposed full bandwidth matrix and apply a kernel density estimate, we have the following plot. Notice now that the contours are smoother and follow the main orientation of the dataset. I believe that we have a better description of the data than before.

* * *

In summary, the comparison between the existing and proposed bandwidth selectors are ...


These notes are an edited version of a seminar given by the author on 30 May 2002 as part of the Statistics Seminar Series for the Department of Mathematics and Statistics, at the University of Western Australia. Please feel free to contact the author at duongt(at)maths(dot)uwa(dot)edu(dot)au if you have any questions.




Tarn Duong 2002-06-07