Year (x) | CPI (y) | Year (x) | CPI (y) |
---|---|---|---|
1915 | 10.1 | 1969 | 36.7 |
1926 | 17.7 | 1975 | 49.3 |
1935 | 13.7 | 1979 | 72.6 |
1940 | 14.7 | 1980 | 82.4 |
1947 | 24.1 | 1986 | 109.6 |
1952 | 26.5 | 1991 | 130.7 |
1964 | 31.0 | 1999 | 166.6 |
In the example, notice the pattern of the points compared to the line. Although the correlation coefficient is significant, the pattern in the scatterplot indicates that a curve would be a more appropriate model to use than a line. In this example, a statistician should prefer to use other methods to fit a curve to this data, rather than model the data with the line we found. In addition to doing the calculations, it is always important to look at the scatterplot when deciding whether a linear model is appropriate.
If you are interested in seeing more years of data, visit the Bureau of Labor Statistics CPI website at ftp://ftp.bls.gov/pub/special.requests/cpi/cpiai.txt; our data is taken from the column entitled "Annual Avg." (third column from the right). For example, you could add more recent years of data: 2004: CPI = 188.9; 2008: CPI = 215.3; 2011: CPI = 224.9. See how they affect the model. (Check: ŷ = –4436 + 2.295x; r = 0.9018. Is r significant? Is the fit better with the addition of the new points?)
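The check above can be reproduced with a short least-squares sketch. The function name `fit_line` and the variable names are illustrative choices rather than anything from the text; the data are the CPI table plus the three added years.

```python
import math

# CPI data from the table above, followed by the three added years (2004, 2008, 2011).
years = [1915, 1926, 1935, 1940, 1947, 1952, 1964,
         1969, 1975, 1979, 1980, 1986, 1991, 1999,
         2004, 2008, 2011]
cpi = [10.1, 17.7, 13.7, 14.7, 24.1, 26.5, 31.0,
       36.7, 49.3, 72.6, 82.4, 109.6, 130.7, 166.6,
       188.9, 215.3, 224.9]


def fit_line(xs, ys):
    """Return (a, b, r) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                    # slope
    a = mean_y - b * mean_x          # intercept
    r = sxy / math.sqrt(sxx * syy)   # correlation coefficient
    return a, b, r


a, b, r = fit_line(years, cpi)
print(f"y-hat = {a:.0f} + {b:.3f} x, r = {r:.4f}")
```

With 17 points, df = n – 2 = 15, so the printed r can be compared with the critical value for 15 degrees of freedom. A significant r still does not make a line the right model, so the scatterplot should be checked as well.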
The following table shows economic development measured in per capita income (PCINC).
Year | PCINC | Year | PCINC |
---|---|---|---|
1870 | 340 | 1920 | 1050 |
1880 | 499 | 1930 | 1170 |
1890 | 592 | 1940 | 1364 |
1900 | 757 | 1950 | 1836 |
1910 | 927 | 1960 | 2132 |
a. The independent variable ( x ) is the year and the dependent variable ( y ) is the per capita income.
b. (Scatter plot of PCINC against year; figure not shown.)
c. ŷ = 18.61x – 34574; r = 0.9732
d. At df = 8, the critical value is 0.632. The r value is significant because it is greater than the critical value.
e. There does appear to be a linear relationship between the variables.
f. The coefficient of determination is 0.947, which means that 94.7% of the variation in PCINC is explained by the variation in the years.
g. and h. The slope of the regression equation is 18.61, and it means that per capita income increases by $18.61 for each passing year. ŷ = 785 when the year is 1900, and ŷ = 2,646 when the year is 2000.
i. There do not appear to be any outliers.
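As a rough check on answers c, d, and f, the sketch below recomputes the slope, intercept, r, and r² from the PCINC table. The variable names are illustrative, and the critical value 0.632 is the df = 8 figure quoted in answer d.

```python
import math

# Per capita income data from the table above.
years = [1870, 1880, 1890, 1900, 1910, 1920, 1930, 1940, 1950, 1960]
pcinc = [340, 499, 592, 757, 927, 1050, 1170, 1364, 1836, 2132]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(pcinc) / n
sxx = sum((x - mean_x) ** 2 for x in years)
syy = sum((y - mean_y) ** 2 for y in pcinc)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, pcinc))

slope = sxy / sxx
intercept = mean_y - slope * mean_x
r = sxy / math.sqrt(sxx * syy)
r_squared = r ** 2

critical_value = 0.632  # df = n - 2 = 8, as quoted in answer d
significant = abs(r) > critical_value

print(f"y-hat = {slope:.2f}x - {abs(intercept):.0f}")
print(f"r = {r:.4f}, r^2 = {r_squared:.3f}, significant: {significant}")
```

The predictions in answers g and h use the rounded equation ŷ = 18.61x – 34574. Plugging the unrounded coefficients into the same years gives slightly different values.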
Degrees of Freedom: n – 2 | Critical Values: (+ and –) |
---|---|
1 | 0.997 |
2 | 0.950 |
3 | 0.878 |
4 | 0.811 |
5 | 0.754 |
6 | 0.707 |
7 | 0.666 |
8 | 0.632 |
9 | 0.602 |
10 | 0.576 |
11 | 0.555 |
12 | 0.532 |
13 | 0.514 |
14 | 0.497 |
15 | 0.482 |
16 | 0.468 |
17 | 0.456 |
18 | 0.444 |
19 | 0.433 |
20 | 0.423 |
21 | 0.413 |
22 | 0.404 |
23 | 0.396 |
24 | 0.388 |
25 | 0.381 |
26 | 0.374 |
27 | 0.367 |
28 | 0.361 |
29 | 0.355 |
30 | 0.349 |
40 | 0.304 |
50 | 0.273 |
60 | 0.250 |
70 | 0.232 |
80 | 0.217 |
90 | 0.205 |
100 | 0.195 |
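Tables of this kind are usually derived from Student's t distribution. The sketch below assumes the table holds two-tailed critical values at the 0.05 level, which is not stated here, and uses the standard relation r = t / sqrt(t² + df). The t values are hard-coded from a standard t table and are assumed inputs, not part of this document.

```python
import math

# Two-tailed 5% critical values of Student's t from a standard t table
# (assumed inputs; they do not appear in the table above).
T_CRIT = {8: 2.306, 10: 2.228, 20: 2.086}


def critical_r(df):
    """Convert a critical t value into a critical correlation coefficient."""
    t = T_CRIT[df]
    return t / math.sqrt(t * t + df)


for df in sorted(T_CRIT):
    print(df, round(critical_r(df), 3))
```

For these three rows the computed values agree with the table entries to three decimal places.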
Data from the House Ways and Means Committee, the Health and Human Services Department.
Data from Microsoft Bookshelf.
Data from the United States Department of Labor, the Bureau of Labor Statistics.
Data from the Physician’s Handbook, 1990.
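To determine if a point is an outlier, do one of the following:
Graph the line of best fit y1 = a + bx together with the lines y2 = (a + bx) + 2s and y3 = (a + bx) – 2s, where s is the standard deviation of the residuals. If any point is above y2 or below y3, then the point is considered to be an outlier.
Compare the absolute value of each residual with 2s. If the absolute value of a residual is greater than or equal to 2s, then the corresponding point is an outlier.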
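The outlier check can be sketched in a few lines. This assumes the two-standard-deviation rule, with y2 and y3 taken as the line of best fit shifted up and down by 2s, and s computed as sqrt(SSE / (n – 2)). The function name `find_outliers` and the sample data are hypothetical, chosen only for illustration.

```python
import math


def find_outliers(xs, ys):
    """Return the points whose residual is at least 2s from the fitted line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope of the line of best fit
    a = mean_y - b * mean_x    # intercept

    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    sse = sum(e * e for e in residuals)
    s = math.sqrt(sse / (n - 2))   # standard deviation of the residuals

    # A point above y2 = (a + bx) + 2s or below y3 = (a + bx) - 2s is flagged.
    return [(x, y) for x, y, e in zip(xs, ys, residuals) if abs(e) >= 2 * s]


# Hypothetical data: points on y = 2x + 1 except for one value at x = 5.
xs = list(range(1, 11))
ys = [3, 5, 7, 9, 26, 13, 15, 17, 19, 21]
outliers = find_outliers(xs, ys)
print(outliers)
```

A flagged point should be examined before it is removed, because it may be a valid observation rather than an error.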
Use the following information to answer the next four exercises. The scatter plot shows the relationship between hours spent studying and exam scores. The line shown is the calculated line of best fit. The correlation coefficient is 0.69.
Do there appear to be any outliers?
Yes, there appears to be an outlier at (6, 58).
A point is removed, and the line of best fit is recalculated. The new correlation coefficient is 0.98. Does the point appear to have been an outlier? Why?
What effect did the potential outlier have on the line of best fit?
The potential outlier flattened the slope of the line of best fit because it was below the data set. It made the line of best fit less accurate as a predictor for the data.
Are you more or less confident in the predictive ability of the new line of best fit?
The Sum of Squared Errors for a data set of 18 numbers is 49. What is the standard deviation?
s = 1.75
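The arithmetic behind this answer, assuming the usual definition s = sqrt(SSE / (n – 2)) for the standard deviation of the residuals, can be written as:

```python
import math

sse = 49   # sum of squared errors given in the exercise
n = 18     # number of data points

# Standard deviation of the residuals, using n - 2 degrees of freedom.
s = math.sqrt(sse / (n - 2))
print(s)  # 1.75
```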
The standard deviation of the residuals for a data set is 9.8. What is the cutoff for the vertical distance that a point can be from the line of best fit before it is considered an outlier?