
Convergence to the density rather than its integral over a region can occur if, as the amount of data grows, we reduce the binwidth δᵢ and increase N, the number of bins. However, if we choose the binwidth too small for the amount of available data, few bins contain data and the estimate is inaccurate. Letting r denote the midpoint of a bin, a Taylor expansion about this point reveals that the mean-squared error between the histogram and the density at that point is [Thompson and Tapia:44-59]

$$ E\left[\left(\hat{p}(r) - p(r)\right)^2\right] = \frac{p(r)}{2L\delta_i} + \frac{\delta_i^4}{36}\left(\frac{\partial^2 p(r)}{\partial r^2}\right)^2 + O\!\left(\frac{1}{L}\right) + O\!\left(\delta_i^5\right) $$

This mean-squared error becomes zero only if $L \to \infty$, $L\delta_i \to \infty$, and $\delta_i \to 0$. Thus, the binwidth must decrease more slowly than the rate of increase of the number of observations. We find the "optimum" compromise between the decreasing binwidth and the increasing amount of data to be

$$ \delta_i = \left(\frac{9\,p(r)}{2\left(\frac{\partial^2 p(r)}{\partial r^2}\right)^2}\right)^{1/5} L^{-1/5} $$

This result assumes that the second derivative of the density is nonzero. If it is not, either the Taylor series expansion brings higher order terms into play or, if all the derivatives are zero, no optimum binwidth can be defined for minimizing the mean-squared error. Using this binwidth, we find the mean-squared error to be proportional to $L^{-4/5}$. We have thus discovered the famous "4/5" rule of density estimation; this is one of the few cases where the variance of a convergent statistic decreases more slowly than the reciprocal of the number of observations. In practice, this optimal binwidth cannot be used because the proportionality constant depends on the unknown density being estimated. Roughly speaking, wider bins should be employed where the density is changing slowly. Knowing how the optimal binwidth varies with L lets us adjust the histogram estimate as more data become available.
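As a concrete sketch (not part of the original text), the $L^{-1/5}$ binwidth rule can be implemented with NumPy; the constant `c` below is a hypothetical stand-in for the unknown factor $(9p(r)/(2\,p''(r)^2))^{1/5}$, which depends on the very density being estimated and so must be guessed in practice:

```python
import numpy as np

def histogram_density(data, c=1.0):
    """Histogram density estimate using the L**(-1/5) binwidth rule.

    `c` stands in for the unknown proportionality constant
    (9 p(r) / (2 p''(r)**2))**(1/5); only the L**(-1/5) scaling
    comes from the mean-squared-error analysis.
    """
    L = len(data)
    delta = c * L ** (-1 / 5)                  # binwidth shrinks as L grows
    edges = np.arange(data.min(), data.max() + delta, delta)
    counts, edges = np.histogram(data, bins=edges)
    p_hat = counts / (L * delta)               # normalize so the estimate integrates to one
    centers = (edges[:-1] + edges[1:]) / 2     # bin midpoints r
    return centers, p_hat

# 2000 white Gaussian samples, as in the examples later in this section
rng = np.random.default_rng(0)
centers, p_hat = histogram_density(rng.standard_normal(2000))
```

Because `c` cannot be computed without knowing the density, practical estimators choose it by rule of thumb or cross-validation; the analysis above only fixes how the binwidth must scale with the amount of data.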

Density verification

Once a density estimate is produced, the class of density that best coincides with the estimate remains an issue: Is the density just estimated statistically similar to a Gaussian? The histogram estimate can be used directly in a hypothesis test to determine similarity with any proposed density. Assume that the observations are obtained from a white, stationary, stochastic sequence. Let $\mathcal{H}_0$ denote the hypothesis that the data have an amplitude distribution equal to the presumed density and $\mathcal{H}_1$ the dissimilarity hypothesis. If $\mathcal{H}_0$ is true, the estimate for each bin should not deviate greatly from the probability of a randomly chosen datum lying in the bin. We determine this probability from the presumed density by integrating over the bin. Summing these deviations over the entire estimate, the result should not exceed a threshold. The theory of standard hypothesis testing requires us to produce a specific density for the alternative hypothesis $\mathcal{H}_1$. We cannot rationally assign such a density; consistency is being tested, not whether either of two densities provides the best fit. However, taking inspiration from the Neyman-Pearson approach to hypothesis testing [See Neyman-Pearson Criterion], we can develop a test statistic and require its statistical characteristics only under $\mathcal{H}_0$. The typically used, but ad hoc, test statistic $S(L, N)$ is related to the histogram estimate's mean-squared error [Cramer:416-41].

$$ S(L, N) = \sum_{i=1}^{N} \frac{\left(L_i - L P_i\right)^2}{L P_i} = \sum_{i=1}^{N} \frac{L_i^2}{L P_i} - L $$

Here $L_i$ is the number of observations falling in bin $i$ and $P_i$ is the probability, under $\mathcal{H}_0$, of an observation lying in that bin. This statistic sums over the various bins the squared error of the number of observations relative to the expected number. For large $L$, $S(L, N)$ has a $\chi^2$ probability distribution with $N - 1$ degrees of freedom [Cramer:417]. Thus, for a given number of observations $L$ we establish a threshold $\eta_N$ by picking a false-alarm probability $P_F$ and using tables to solve $\Pr[\chi^2_{N-1} > \eta_N] = P_F$. To enhance the validity of this approximation, statisticians recommend selecting the binwidth so that each bin contains at least ten observations.
In practice, we fulfill this criterion by merging adjacent bins until a sufficient number of observations occur in the new bin and defining its binwidth as the sum of the merged bins' widths. Thus, the number of bins is reduced to some number N, which determines the degrees of freedom in the hypothesis test. The similarity test between the histogram estimate of a probability density function and an assumed ideal form becomes

$$ S(L, N) \underset{\mathcal{H}_0}{\overset{\mathcal{H}_1}{\gtrless}} \eta_N $$
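The whole procedure — binning, merging adjacent bins until each expects at least ten observations, computing $S(L,N)$, and thresholding at the $\chi^2_{N-1}$ quantile — can be sketched in Python with NumPy and SciPy. This is an illustration under the stated assumptions, not code from the source:

```python
import numpy as np
from scipy.stats import chi2, norm

def chi_squared_fit(data, cdf, n_bins=100, min_expected=10, P_F=0.1):
    """Consistency test between a histogram and a presumed density.

    `cdf` is the cumulative distribution of the presumed density; the
    bin probabilities P_i come from differencing it at the bin edges.
    """
    L = len(data)
    counts, edges = np.histogram(data, bins=n_bins)
    edges[0], edges[-1] = -np.inf, np.inf       # fold the tails into the end bins
    P = np.diff(cdf(edges))                     # presumed probability of each bin
    # merge adjacent bins until each merged bin expects >= min_expected data
    merged = []
    c_acc, p_acc = 0, 0.0
    for c, p in zip(counts, P):
        c_acc, p_acc = c_acc + c, p_acc + p
        if L * p_acc >= min_expected:
            merged.append((c_acc, p_acc))
            c_acc, p_acc = 0, 0.0
    merged[-1] = (merged[-1][0] + c_acc, merged[-1][1] + p_acc)  # leftover tail
    Li = np.array([m[0] for m in merged])
    Pi = np.array([m[1] for m in merged])
    N = len(merged)
    S = np.sum((Li - L * Pi) ** 2 / (L * Pi))   # the statistic S(L, N)
    eta = chi2.ppf(1 - P_F, N - 1)              # Pr[chi2_{N-1} > eta] = P_F
    return S, eta, N

rng = np.random.default_rng(1)
S, eta, N = chi_squared_fit(rng.standard_normal(2000), norm.cdf)
```

By construction of the false-alarm probability, white Gaussian data tested against the standard normal CDF should fall below the threshold about 90% of the time when $P_F = 0.1$.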

In many circumstances, the formula for the density is known but not some of its parameters. In the Gaussian case, for example, the mean or variance are usually unknown. These parameters must be determined from the same data used in the consistency test before the test can be used. Doesn't the fact that we use estimates rather than actual values affect the similarity test? The answer is "yes," but in an interesting way: The similarity test changes only in that the number of degrees of freedom of the $\chi^2$ random variable used to establish the threshold is reduced by one for each estimated parameter. If a Gaussian density is being tested, for example, the mean and variance usually need to be found. The threshold should then be determined according to the distribution of a $\chi^2_{N-3}$ random variable.
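Numerically, the reduction in degrees of freedom only shifts the threshold. A short SciPy illustration using values that appear in the examples of this section ($N = 70$, $P_F = 0.1$):

```python
from scipy.stats import chi2

N, P_F = 70, 0.1
eta_known = chi2.ppf(1 - P_F, N - 1)  # all parameters of the nominal density known
eta_est = chi2.ppf(1 - P_F, N - 3)    # mean and variance estimated from the data
```

With two parameters estimated, the threshold drops from roughly 84.4 ($\chi^2_{69}$) to roughly 82.2 ($\chi^2_{67}$); the latter value matches the white-Gaussian row of the results table in this section.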

Three sets of observations are considered: Two are drawn from a Gaussian distribution and the other not. The first Gaussian example is white noise, a signal whose characteristics match the assumptions of this section. The second is non-Gaussian, which should not pass the test. Finally, the last test consists of colored Gaussian noise that, because of dependent samples, does not have as many degrees of freedom as would be expected. The number of data available in each case is 2000. The histogram estimator uses fixed-width bins and the $\chi^2$ test demands at least ten observations per merged bin. The mean and variance estimates are used in constructing the nominal Gaussian density. The histogram estimates and their approximation by the nominal density whose mean and variance were computed from the data are shown in the figure below.

Three histogram density estimates are shown and compared with Gaussian densities having the same mean and variance. The histogram on the top is obtained from Gaussian data that are presumed to be white. The middle one is obtained from a non-Gaussian distribution related to the hyperbolic secant: $p(r) = \frac{1}{2\sigma}\,\mathrm{sech}\!\left(\frac{\pi r}{2\sigma}\right)$. This density resembles a Gaussian about the origin but decreases exponentially in the tails. The bottom histogram is taken from a first-order autoregressive Gaussian signal. Thus, these data are correlated, but yield a histogram resembling the true amplitude distribution. In each case, 2000 data points were used and the histogram contained 100 bins.
The chi-squared test with $P_F = 0.1$ yielded the following results.

Density             N     η_N from χ²_{N−3}    S(2000, N)
White Gaussian      70    82.2                 78.4
White Sech          65    76.6                 232.6
Colored Gaussian    65    76.6                 77.8
The white Gaussian noise example clearly passes the $\chi^2$ test. The test correctly evaluated the non-Gaussian example, but declared the colored Gaussian data to be non-Gaussian, yielding a value near the threshold. Failing in the latter case to correctly determine the data's Gaussianity, we see that the $\chi^2$ test is sensitive to the statistical independence of the observations.
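The colored-Gaussian failure mode is easy to reproduce. The sketch below generates first-order autoregressive Gaussian noise with unit marginal variance; the correlation coefficient 0.9 is an assumption, as the source does not specify the AR parameter it used:

```python
import numpy as np

def ar1_gaussian(L, a=0.9, rng=None):
    """First-order autoregressive Gaussian noise: x[n] = a*x[n-1] + w[n].

    The innovation variance is scaled to 1 - a**2 so the marginal
    variance is one: each sample is Gaussian, but the samples are
    correlated, violating the whiteness assumption behind the
    chi-squared threshold.
    """
    rng = rng or np.random.default_rng()
    w = rng.standard_normal(L) * np.sqrt(1 - a ** 2)   # innovations
    x = np.empty(L)
    x[0] = rng.standard_normal()                       # stationary start, N(0, 1)
    for n in range(1, L):
        x[n] = a * x[n - 1] + w[n]
    return x

x = ar1_gaussian(2000, a=0.9, rng=np.random.default_rng(2))
rho1 = np.corrcoef(x[:-1], x[1:])[0, 1]   # lag-one correlation, near a
```

Each sample is marginally Gaussian, so the histogram resembles a Gaussian density, yet the lag-one correlation means the merged bin counts lack the independence that the $\chi^2$ distribution of $S(L, N)$ relies on.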






Source:  OpenStax, Statistical signal processing. OpenStax CNX. Dec 05, 2011 Download for free at http://cnx.org/content/col11382/1.1
