Tarn Duong
Supervisor: Martin Hazelton
Date: May 2002
This talk is about my PhD research, supervised by Dr Martin Hazelton. My topic is bandwidth selection for multivariate kernel density estimators. Today I will be talking about some research carried out late last year and early this year on a particular type of bandwidth selector, namely plug-in bandwidth selectors.
The outline for today's talk is as follows. (Read slide.)



The single most important factor in determining the performance of a kernel density estimator, and the shape of the resulting kernel density estimate, is the choice of bandwidth matrix. Since the bandwidth matrix can be thought of as the variance of the kernel function, it controls both the spread and the orientation of the kernel.
So how do we choose the most appropriate bandwidth? Implicit in this question is the criterion that we use to measure appropriateness. The most commonly used criterion is the Mean Integrated Squared Error or MISE for short. The formula for the MISE is given on the slide.

So this is the integral of the MSE of the kernel density estimate $\hat{f}(x; H)$, or equivalently the expected squared distance between the kernel density estimate $\hat{f}(\cdot; H)$ and the target density f, integrated over $\mathbb{R}^d$.
Note that the MISE doesn't depend on the actual dataset, as we take the expectation, so this is a measure of the 'average' performance of the kernel density estimator.
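To make the 'average performance' idea concrete, here is a minimal Python sketch that approximates the MISE by averaging the integrated squared error (ISE) over simulated samples. It assumes a bivariate standard normal target and a Gaussian kernel, chosen only because the ISE then has an exact expression through normal convolutions; the function names and the two bandwidth matrices are illustrative assumptions, not part of the talk.

```python
import math
import random

def mvn_pdf(x, S):
    """Zero-mean bivariate normal density with 2x2 covariance S, evaluated at x."""
    a, b, d = S[0][0], S[0][1], S[1][1]
    det = a * d - b * b
    # Quadratic form x' S^{-1} x via the explicit 2x2 inverse.
    q = (d * x[0] ** 2 - 2.0 * b * x[0] * x[1] + a * x[1] ** 2) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

def ise_normal_target(data, H):
    """Exact ISE of a Gaussian KDE with bandwidth matrix H when f is N(0, I)."""
    n = len(data)
    two_H = [[2.0 * H[i][j] for j in range(2)] for i in range(2)]
    H_plus_I = [[H[0][0] + 1.0, H[0][1]], [H[1][0], H[1][1] + 1.0]]
    # Integral of fhat^2: two kernels convolve to a normal with covariance 2H.
    int_fhat_sq = sum(
        mvn_pdf((xi[0] - xj[0], xi[1] - xj[1]), two_H)
        for xi in data for xj in data
    ) / n ** 2
    # Integral of fhat * f: kernel convolved with the target has covariance H + I.
    int_fhat_f = sum(mvn_pdf(xi, H_plus_I) for xi in data) / n
    # Integral of f^2 for the N(0, I) target.
    int_f_sq = 1.0 / (4.0 * math.pi)
    return int_fhat_sq - 2.0 * int_fhat_f + int_f_sq

def mise_monte_carlo(H, n, reps, seed=0):
    """Average the ISE over `reps` simulated samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
        total += ise_normal_target(data, H)
    return total / reps

# Illustrative bandwidths only: a moderate one and a heavily oversmoothed one.
mise_moderate = mise_monte_carlo([[0.3, 0.0], [0.0, 0.3]], n=50, reps=20)
mise_oversmoothed = mise_monte_carlo([[25.0, 0.0], [0.0, 25.0]], n=50, reps=20)
print(mise_moderate, mise_oversmoothed)
```

Each ISE depends on the particular sample, while the average over many samples does not, which is the sense in which the MISE measures average performance.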
Unfortunately this MISE does not have a closed form for an arbitrary density f, and so we need to resort to using an asymptotic approximation via a Taylor series expansion. This is known as the AMISE; the A is for asymptotic. The formula for it is here.

In this formula, $\Psi_4$ is a matrix of functionals that depend on the target density f.
So it now seems straightforward to find the optimal bandwidth matrix by simply minimising the above AMISE criterion. However, since the matrix of functionals $\Psi_4$ depends on the target density f, we can't use the AMISE yet. We first need to estimate this matrix. This is where the plug-in methods derive their name: we produce an estimate which we then plug into the AMISE, and then proceed with the minimisation.

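The first step is to estimate the $\Psi_4$ matrix. We then plug this estimate into the AMISE criterion to form the PI (for plug-in) criterion. We minimise this to find the optimal bandwidth matrix. The existing methods restrict H to be a diagonal matrix and estimate the $\Psi_4$ matrix element-wise.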
Our proposed method makes the following two modifications to the existing methods. We use a full bandwidth matrix rather than the restricted diagonal matrix; and we estimate the $\Psi_4$ matrix matrix-wise rather than element-wise. The final two steps of the algorithm remain the same for both methods.
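The final two steps (plug in, then minimise) can be sketched in Python as follows. This is a rough illustration only. It assumes the usual bivariate AMISE form $n^{-1} R(K) |H|^{-1/2} + \frac{1}{4} \mu_2(K)^2 (\mathrm{vech}\, H)^T \hat{\Psi}_4 (\mathrm{vech}\, H)$ with a Gaussian kernel, it uses the exact $\Psi_4$ of a standard normal density as a stand-in for an estimate, and it searches only over bandwidths of the form $h^2 I$ rather than over all full matrices. The function name `pi_criterion` and all the numbers are assumptions made for this example.

```python
import math

R_K = 1.0 / (4.0 * math.pi)  # roughness of the bivariate standard normal kernel
MU2_K = 1.0                  # its second moment

def pi_criterion(H, psi4_hat, n):
    """Plug-in criterion PI(H) for a symmetric 2x2 bandwidth matrix H."""
    h11, h12, h22 = H[0][0], H[0][1], H[1][1]
    vech_H = [h11, h12, h22]
    det_H = h11 * h22 - h12 * h12
    # Quadratic form (vech H)' Psi4_hat (vech H).
    quad = sum(
        vech_H[i] * psi4_hat[i][j] * vech_H[j]
        for i in range(3) for j in range(3)
    )
    return R_K / (n * math.sqrt(det_H)) + 0.25 * MU2_K ** 2 * quad

# Stand-in for an estimate: the exact functionals of the N(0, I) density,
# psi_40 = psi_04 = 3/(16 pi), psi_22 = 1/(16 pi), psi_31 = psi_13 = 0.
c = 1.0 / (16.0 * math.pi)
psi4_normal = [
    [3.0 * c, 0.0, c],
    [0.0, 4.0 * c, 0.0],
    [c, 0.0, 3.0 * c],
]

n = 1000
# Crude grid search over H = h2 * I (a real selector would use a numerical
# optimiser over all positive definite H).
grid = [k / 100.0 for k in range(1, 101)]
best_h2 = min(grid, key=lambda h2: pi_criterion([[h2, 0.0], [0.0, h2]], psi4_normal, n))
print(best_h2, n ** (-1.0 / 3.0))
```

For this normal stand-in the minimiser over $h^2 I$ can be worked out by hand as $h^2 = n^{-1/3}$, which gives a simple check on the grid search.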

The current method simplifies the problem by restricting H to be a diagonal matrix. However, this means that we are restricted to using kernels that are oriented parallel to the co-ordinate axes. On the left are kernels with diagonal bandwidth matrices. On the right are kernels with full bandwidth matrices. The trade-off for the increased flexibility is that the problem becomes more computationally intensive and no longer has an analytical solution.
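A small Python sketch may help show what the off-diagonal entry buys us. It evaluates a Gaussian kernel at two points that mirror each other across the horizontal axis, once with a diagonal and once with a full bandwidth matrix. The matrices and points are made-up values for illustration.

```python
import math

def gaussian_kernel(x, H):
    """Bivariate Gaussian kernel with symmetric 2x2 bandwidth matrix H at point x."""
    h11, h12, h22 = H[0][0], H[0][1], H[1][1]
    det = h11 * h22 - h12 * h12
    # Quadratic form x' H^{-1} x via the explicit 2x2 inverse.
    q = (h22 * x[0] ** 2 - 2.0 * h12 * x[0] * x[1] + h11 * x[1] ** 2) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

H_diag = [[1.0, 0.0], [0.0, 0.25]]  # axis-aligned kernel
H_full = [[1.0, 0.4], [0.4, 0.25]]  # same spreads, tilted orientation

up, down = (1.0, 0.5), (1.0, -0.5)

# Diagonal H: the kernel is symmetric about each co-ordinate axis.
diag_up, diag_down = gaussian_kernel(up, H_diag), gaussian_kernel(down, H_diag)
# Full H: mass is pulled along the tilted direction, so the two values differ.
full_up, full_down = gaussian_kernel(up, H_full), gaussian_kernel(down, H_full)
print(diag_up, diag_down, full_up, full_down)
```

With the diagonal matrix the two values agree, while the positive off-diagonal entry in the full matrix favours the point lying along the tilted direction.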

We now look at the structure of the $\Psi_4$ matrix, which will help us greatly in trying to estimate it. We use the following notation to indicate partial derivatives, and the functional $\psi_{r_1 r_2}$ (indexed by r1, r2) is defined on the slide. Note that the $\psi_{r_1 r_2}$ functionals are scalars. Then the explicit expression is given on the slide. An important property of this matrix is that it is positive definite.

We now turn to estimating the elements of the $\Psi_4$ matrix. As $\psi_{r_1 r_2} = E f^{(r_1, r_2)}(X)$ where X has density f, an estimator is given on the slide. This equation says that we take the mean of the n (r1, r2)-th partial derivatives of the kernel density estimate evaluated at each of the data points Xi.
Note that we use a scalar pilot bandwidth g here because if we try to use a full pilot bandwidth matrix then the problem becomes much more mathematically involved. This trade-off between simplicity and flexibility can be justified if we pre-transform the data into a suitable form.
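As a minimal sketch of this estimator, the Python below averages the (r1, r2)-th partial derivative of a pilot kernel density estimate over the data points. It assumes a Gaussian pilot kernel with scalar bandwidth g, so the derivatives can be written with Hermite polynomials; the function names, the simulated data and the value of g are illustrative assumptions, and no attempt is made here to choose g optimally.

```python
import math
import random

def hermite(r, u):
    """Probabilists' Hermite polynomial He_r(u), for r = 0, ..., 4."""
    return [1.0, u, u * u - 1.0, u ** 3 - 3.0 * u, u ** 4 - 6.0 * u * u + 3.0][r]

def phi_deriv(r, u):
    """r-th derivative of the standard normal density: (-1)^r He_r(u) phi(u)."""
    return (-1) ** r * hermite(r, u) * math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_deriv(r1, r2, x, g):
    """(r1, r2)-th partial derivative of the Gaussian kernel with bandwidth g at x."""
    return phi_deriv(r1, x[0] / g) * phi_deriv(r2, x[1] / g) / g ** (2 + r1 + r2)

def psi_hat(r1, r2, data, g):
    """Mean over the data points X_i of the derivative of the pilot estimate at X_i."""
    n = len(data)
    total = 0.0
    for xi in data:
        for xj in data:
            total += kernel_deriv(r1, r2, (xi[0] - xj[0], xi[1] - xj[1]), g)
    return total / n ** 2

rng = random.Random(1)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
g = 0.6  # arbitrary pilot bandwidth, for illustration only

psi40 = psi_hat(4, 0, data, g)
psi22 = psi_hat(2, 2, data, g)
print(psi40, psi22)
```

The double loop costs of the order of n squared kernel evaluations, which is fine for a sketch but would need binning or vectorisation for large samples.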
So the problem is now to optimally select this pilot bandwidth g. This will be done by minimising the MSE of this estimator. The asymptotic MSE (AMSE) is given on the slide. The first term is the asymptotic integrated variance and the second term is the asymptotic integrated squared bias.
There are two different formulas, one for the case where r1 and r2 are both even and one for the case where r1 and r2 are both odd. In the first case, the optimal pilot bandwidth arises from a bias annihilating calculation, i.e. we set the second term of the AMSE to zero and solve for g. In the second case, the first term of the bias is 0 and so, instead of a bias annihilating calculation, we find the minimum of the AMSE by setting the derivative to 0 and solving for g. These two approaches give rise to the two different expressions for the optimal pilot bandwidths.
The element-wise approach to estimating the $\Psi_4$ matrix is to estimate each of its elements with its own pilot bandwidth, so we require 5 pilot bandwidths, and the estimate looks like this.
However, estimating each element with its own pilot bandwidth does not guarantee the estimate $\hat{\Psi}_4$ to be positive definite. This is important since if $\hat{\Psi}_4$ is not positive definite then PI(H) will not have a finite minimum, and even if $\hat{\Psi}_4$ is positive definite but nearly singular then this may lead to numerical instability in the minimisation of PI(H). This problem has not been studied until now because if we use a diagonal bandwidth matrix then analytical expressions are available for the optimal bandwidths and so no numerical minimisation is required.
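To see how element-wise estimates can go wrong, here is a small Python sketch that assembles a 3 by 3 matrix from five functional values and tests positive definiteness with Sylvester's criterion. The layout of the matrix (matching vech H = (h11, h12, h22)) is the standard bivariate form and is an assumption here, since the slide is not reproduced. The 'inconsistent' values are hypothetical numbers invented for the example, not results from the talk.

```python
import math

def psi4_matrix(p40, p31, p22, p13, p04):
    """Assemble the 3x3 matrix from five functionals (assumed standard layout)."""
    return [
        [p40, 2.0 * p31, p22],
        [2.0 * p31, 4.0 * p22, 2.0 * p13],
        [p22, 2.0 * p13, p04],
    ]

def is_positive_definite(M):
    """Sylvester's criterion for a symmetric 3x3 matrix: all leading minors > 0."""
    m1 = M[0][0]
    m2 = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    m3 = (
        M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
        - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
        + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0])
    )
    return m1 > 0 and m2 > 0 and m3 > 0

c = 1.0 / (16.0 * math.pi)

# Exact functionals of the N(0, I) density: a mutually consistent set.
consistent = psi4_matrix(3.0 * c, 0.0, c, 0.0, 3.0 * c)

# Hypothetical element-wise estimates where psi_22 alone is badly overestimated.
inconsistent = psi4_matrix(3.0 * c, 0.0, 4.0 * c, 0.0, 3.0 * c)

print(is_positive_definite(consistent), is_positive_definite(inconsistent))
```

Each entry of the second matrix looks plausible on its own, but together they no longer behave like functionals of a single density, and positive definiteness is lost.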
The second modification that we propose is to use a single pilot bandwidth to estimate all the $\psi_{r_1 r_2}$ functionals. If we sacrifice the flexibility and optimality of using separate pilot bandwidths at this stage then we have a parsimonious selector, as we only have to find one pilot bandwidth. This is done by summing all 5 of the AMSE expressions to form the SAMSE (sum of AMSE). So the estimate of $\Psi_4$ looks like this.
The matrix $\hat{\Psi}_4$ estimated in this way is guaranteed to be positive definite. So in this sense we are estimating the $\Psi_4$ matrix matrix-wise by reproducing its matrix properties.
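The positive definiteness claim can be checked numerically with a short Python sketch. It estimates all five functionals with the same scalar pilot bandwidth g, using a Gaussian pilot kernel, assembles the matrix in the standard bivariate layout and inspects its leading principal minors. The kernel choice, the matrix layout, the simulated correlated data and the value of g are all assumptions made for illustration; g is not chosen by SAMSE here.

```python
import math
import random

HERMITE = [
    lambda u: 1.0,
    lambda u: u,
    lambda u: u * u - 1.0,
    lambda u: u ** 3 - 3.0 * u,
    lambda u: u ** 4 - 6.0 * u * u + 3.0,
]

def phi_deriv(r, u):
    """r-th derivative of the standard normal density."""
    return (-1) ** r * HERMITE[r](u) * math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def psi_hat(r1, r2, data, g):
    """Estimate of psi_{r1 r2} with a Gaussian pilot kernel and scalar bandwidth g."""
    n = len(data)
    total = sum(
        phi_deriv(r1, (xi[0] - xj[0]) / g) * phi_deriv(r2, (xi[1] - xj[1]) / g)
        for xi in data for xj in data
    )
    return total / (n ** 2 * g ** (2 + r1 + r2))

def psi4_single_pilot(data, g):
    """Assemble the 3x3 matrix using the SAME pilot bandwidth for every element."""
    p = {r: psi_hat(r[0], r[1], data, g) for r in [(4, 0), (3, 1), (2, 2), (1, 3), (0, 4)]}
    return [
        [p[4, 0], 2.0 * p[3, 1], p[2, 2]],
        [2.0 * p[3, 1], 4.0 * p[2, 2], 2.0 * p[1, 3]],
        [p[2, 2], 2.0 * p[1, 3], p[0, 4]],
    ]

def leading_minors(M):
    """Leading principal minors of a 3x3 matrix (all positive means positive definite)."""
    m1 = M[0][0]
    m2 = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    m3 = (
        M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
        - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
        + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0])
    )
    return m1, m2, m3

rng = random.Random(2)
data = []
for _ in range(60):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    data.append((z1, 0.7 * z1 + 0.5 * z2))  # correlated sample

minors = leading_minors(psi4_single_pilot(data, g=0.5))
print(minors)
```

One way to see why this works for a Gaussian pilot kernel is that, with a common g, the assembled matrix is the Gram matrix of the second derivatives of a single kernel density estimate with bandwidth g divided by the square root of 2.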
The question we now ask is how much efficiency do we sacrifice to ensure
positive definiteness? The quick answer is not much; however the
calculations to show this are quite involved and won't be reproduced
here today.


With different pilot bandwidths, the estimated $\Psi_4$ matrix fails to be positive definite 3.25% of the time, or in 13 trials out of 400.

So for the two densities we have considered today, we see that the full bandwidth matrix selectors perform at least as well as the existing diagonal selectors in terms of ISE, and the full bandwidth selector with a single pilot performs better than the full bandwidth selector with different pilots in terms of positive definiteness. Buoyed by these simulation results, we now turn to a real data set, the 'Old Faithful' dataset that I have shown you previously.
In summary, the comparison between the existing and proposed bandwidth selectors is ...
These notes are an edited version of a seminar given by the author on 30 May 2002 as part of the Statistics Seminar Series for the Department of Mathematics and Statistics, at the University of Western Australia. Please feel free to contact the author at duongt(at)maths(dot)uwa(dot)edu(dot)au if you have any questions.