Convergence to the density rather than its integral over a region can occur if, as the amount of data grows, we reduce the binwidth δ and correspondingly increase the number of bins. However, if we choose the binwidth too small for the amount of available data, few bins contain data and the estimate is inaccurate. Letting x denote the midpoint of a bin, a Taylor expansion about this point reveals that the mean-squared error between the histogram and the density at that point is [Thompson and Tapia: 44-59]

E[(p̂(x) − p(x))²] ≈ p(x)/(Nδ) + (δ²·p″(x)/24)²

This mean-squared error becomes zero only if δ → 0, Nδ → ∞, and N → ∞. Thus, the binwidth must decrease more slowly than the rate of increase of the number of observations. Minimizing the mean-squared error over δ, we find the "optimum" compromise between the decreasing binwidth and the increasing amount of data to be

δ_opt = (144·p(x)/[p″(x)]²)^(1/5) · N^(−1/5)

This result assumes that the second derivative of the density is nonzero. If it is not, either the Taylor series expansion brings higher-order terms into play or, if all the derivatives are zero, no optimum binwidth can be defined for minimizing the mean-squared error. Using this binwidth, we find the mean-squared error to be proportional to N^(−4/5). We have thus discovered the famous "4/5" rule of density estimation; this is one of the few cases where the variance of a convergent statistic decreases more slowly than the reciprocal of the number of observations. In practice, this optimal binwidth cannot be used because the proportionality constant depends on the unknown density being estimated. Roughly speaking, wider bins should be employed where the density is changing slowly. How the optimal binwidth varies with N can be used to adjust the histogram estimate as more data becomes available.
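The N^(−1/5) scaling of the optimal binwidth can be checked numerically. The following is a minimal sketch (the function name and the Gaussian-peak values p(0) ≈ 0.3989, p″(0) ≈ −0.3989 are illustrative choices, not from the original): increasing the number of observations by a factor of 32 halves the optimal binwidth.

```python
def optimal_binwidth(N, p, p2):
    """Binwidth minimizing p/(N*delta) + (delta**2 * p2 / 24)**2,
    the midpoint mean-squared error of the histogram estimate;
    valid only when the second derivative p2 is nonzero."""
    return (144.0 * p / (N * p2 ** 2)) ** 0.2

# Standard Gaussian at the origin: p(0) = 0.3989, p''(0) = -0.3989.
d_small = optimal_binwidth(1000, 0.3989, -0.3989)
d_large = optimal_binwidth(32000, 0.3989, -0.3989)
print(d_large / d_small)  # ratio is 32**(-1/5) = 0.5
```

Because only the N-dependence matters for the scaling, the same halving occurs regardless of the density values plugged in.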
Once a density estimate is produced, the class of density that best coincides with the estimate remains an issue: Is the density just estimated statistically similar to a Gaussian? The histogram estimate can be used directly in a hypothesis test to determine similarity with any proposed density. Assume that the observations are obtained from a white, stationary, stochastic sequence. Let H0 denote the hypothesis that the data have an amplitude distribution equal to the presumed density and H1 the dissimilarity hypothesis. If H0 is true, the estimate for each bin should not deviate greatly from the probability of a randomly chosen datum lying in the bin. We determine this probability from the presumed density by integrating it over the bin. Summing these deviations over the entire estimate, the result should not exceed a threshold. The theory of standard hypothesis testing requires us to produce a specific density for the alternative hypothesis H1. We cannot rationally assign such a density; consistency is being tested, not whether either of two densities provides the best fit. However, taking inspiration from the Neyman-Pearson approach to hypothesis testing [see Neyman-Pearson Criterion], we can develop a test statistic and require its statistical characteristics only under H0. The typically used, but ad hoc, test statistic is Pearson's chi-squared statistic: the sum over the bins of (observed count − expected count)²/(expected count), which under H0 is approximately distributed as a chi-squared random variable with one fewer degree of freedom than the number of bins.
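A minimal sketch of this consistency test in Python (the function name, bin edges, and random seed are illustrative; the merging of sparsely populated bins that a careful test requires is omitted for brevity): it compares observed bin counts with the counts expected under the presumed density.

```python
import numpy as np
from math import erf, sqrt

def chi_squared_statistic(data, edges, cdf):
    """Pearson statistic: sum over bins of
    (observed - expected)**2 / expected, with expected counts
    obtained by integrating the presumed density over each bin."""
    observed, _ = np.histogram(data, bins=edges)
    probs = np.diff([cdf(e) for e in edges])
    expected = len(data) * probs
    return float(np.sum((observed - expected) ** 2 / expected))

# Presumed density: standard Gaussian.
gauss_cdf = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))

rng = np.random.default_rng(0)
data = rng.standard_normal(2000)       # 2000 samples, as in the example below
edges = np.linspace(-4.0, 4.0, 21)     # 20 fixed-width bins
stat = chi_squared_statistic(data, edges, gauss_cdf)
print(stat)  # under H0, roughly chi-squared with 19 degrees of freedom
```

The statistic is then compared against a chi-squared threshold chosen for the desired false-alarm probability.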
In many circumstances, the formula for the density is known but not some of its parameters. In the Gaussian case, for example, the mean or variance are usually unknown. These parameters must be determined from the same data used in the consistency test before the test can be used. Doesn't the fact that we use estimates rather than actual values affect the similarity test? The answer is "yes," but in an interesting way: The similarity test changes only in that the number of degrees of freedom of the random variable used to establish the threshold is reduced by one for each estimated parameter. If a Gaussian density is being tested, for example, the mean and variance usually need to be found. The threshold should then be determined according to the distribution of a chi-squared random variable having two fewer degrees of freedom.
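Setting the threshold requires chi-squared quantiles. A small sketch using the Wilson-Hilferty closed-form approximation (swapped in here to avoid library dependencies; the 95% significance level and the 20-bin count are my illustrative choices) shows how estimating the mean and variance lowers the threshold:

```python
def chi2_threshold(dof, z=1.6449):
    """Approximate chi-squared quantile via the Wilson-Hilferty
    transformation; z = 1.6449 is the standard-normal 95% quantile."""
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * c ** 0.5) ** 3

bins = 20  # illustrative bin count
# Fully known density: dof = bins - 1.
# Mean and variance estimated from the same data: dof = bins - 3.
print(chi2_threshold(bins - 1), chi2_threshold(bins - 3))
```

For 19 degrees of freedom the approximation gives about 30.1, matching the exact 95% chi-squared quantile to a few hundredths.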
Three sets of observations are considered: Two are drawn from a Gaussian distribution and the other not. The first Gaussian example is white noise, a signal whose characteristics match the assumptions of this section. The second is non-Gaussian, which should not pass the test. Finally, the last test consists of colored Gaussian noise that, because of dependent samples, does not have as many degrees of freedom as would be expected. The number of data available in each case is 2000. The histogram estimator uses fixed-width bins, and the test demands at least ten observations per merged bin. The mean and variance estimates are used in constructing the nominal Gaussian density. The histogram estimates and their approximation by the nominal density, whose mean and variance were computed from the data, are shown in the figure. The chi-squared test yielded the following results.
Density | Degrees of freedom | Threshold | Test statistic
---|---|---|---
White Gaussian | 70 | 82.2 | 78.4
White Sech | 65 | 76.6 | 232.6
Colored Gaussian | 65 | 76.6 | 77.8
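The three data sets can be generated along these lines. This is a sketch under assumptions: the seed, the inverse-CDF transform for the hyperbolic-secant samples, and the first-order coloring coefficient 0.9 are illustrative choices, since the original filtering is not specified here.

```python
import numpy as np

rng = np.random.default_rng(1)  # illustrative seed
N = 2000

# White Gaussian noise: matches the section's assumptions.
white_gauss = rng.standard_normal(N)

# White hyperbolic-secant noise via the inverse-CDF transform;
# the standard sech density (1/2)*sech(pi*x/2) has unit variance.
u = rng.random(N)
white_sech = (2.0 / np.pi) * np.log(np.tan(np.pi * u / 2.0))

# Colored Gaussian noise: first-order recursive filter (coefficient
# 0.9 chosen for illustration), scaled for unit stationary variance.
a = 0.9
colored = np.empty(N)
colored[0] = rng.standard_normal()
for n in range(1, N):
    colored[n] = a * colored[n - 1] + np.sqrt(1 - a**2) * rng.standard_normal()
```

The sech samples have the same mean and variance as the Gaussian ones, so only the shape of the amplitude distribution distinguishes them; the colored samples are marginally Gaussian but strongly correlated from one sample to the next.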