and
are independent of
The selectivity bias arises because the error terms of the two equations are correlated. In effect, the residual of the wage equation (1) includes the same unobserved characteristics as does the residual of the participation equation (2), which is the source of this correlation. OLS estimation of equation (1) would therefore suffer from an omitted variable: the bias term created by the missing observations (wage data are not available for women who are not in the work force). As in other cases of omitted variables, the estimates of the parameters of the model would be biased. Heckman (1979) notes in his seminal article on selectivity bias:
One can also show that the least squares estimator of the population variance is downward biased. Second, a symptom of selection bias is that variables that do not belong in the true structural equation (variables in [the selection equation] not in [the outcome equation]) may appear to be statistically significant determinants of [the dependent variable] when regressions are fit on selected samples. Third, the model just outlined contains a variety of previous models as special cases. ...For a more complete development of the relationship between the model developed here and previous models for limited dependent variables, censored samples and truncated samples, see Heckman (1976). Fourth, multivariate extensions of the preceding analysis, while mathematically straightforward, are of considerable substantive interest. One example is offered. Consider migrants choosing among K possible regions of residence. If the self selection rule is to choose to migrate to that region with the highest income, both the self selection rule and the subsample regression functions can be simply characterized by a direct extension of the previous analysis. (Notation has been altered to match the notation used in this module; see Heckman, 1979: 155.)
Estimation strategy
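Heckman (1979) suggests a two-step estimation strategy. In the first step, a probit estimate of equation (2) is used to construct a variable that measures the bias. This variable is known as the “inverse Mills ratio.” Heckman and others demonstrate that the expected value of the residual of the wage equation (1), conditional on a woman being in the work force, is proportional to a ratio of the standard normal density to the standard normal cumulative probability (equation (7)). Here $\phi$ and $\Phi$ are the standard normal probability density function and cumulative distribution function, respectively, evaluated at the index of the participation equation (2). The ratio in the brackets in equation (7) is known as the inverse Mills ratio. We will use an estimate of the inverse Mills ratio in the estimation of equation (5) to measure the sample selectivity bias.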
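For reference, the result behind the inverse Mills ratio can be written out in one common textbook notation. The symbols below are assumptions made for illustration and may differ from those used elsewhere in this module: $d_i^{*}$ is the latent participation index, $z_i$ and $x_i$ are the regressors of the participation and wage equations, and $(u_i, \varepsilon_i)$ are assumed bivariate normal with zero means, $\mathrm{Var}(u_i) = 1$, $\mathrm{Var}(\varepsilon_i) = \sigma_{\varepsilon}^{2}$, and correlation $\rho$.

```latex
\begin{align*}
d_i^{*} &= z_i'\gamma + u_i, \qquad d_i = 1 \text{ if } d_i^{*} > 0 \text{ and } 0 \text{ otherwise},\\
y_i &= x_i'\beta + \varepsilon_i \quad \text{(observed only when } d_i = 1\text{)},\\
E[\,y_i \mid d_i = 1\,] &= x_i'\beta + E[\,\varepsilon_i \mid u_i > -z_i'\gamma\,]
  = x_i'\beta + \rho\,\sigma_{\varepsilon}\left[\frac{\phi(z_i'\gamma)}{\Phi(z_i'\gamma)}\right].
\end{align*}
```

Under these assumptions the bracketed term is the inverse Mills ratio, and the selection term vanishes when $\rho = 0$, which is why OLS on the selected sample is unbiased only when the two error terms are uncorrelated.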
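The Heckman two-step estimator is relatively easy to implement. In the first step, we use a maximum likelihood probit regression on the whole sample to estimate the coefficients of equation (2). We then use these estimated coefficients to compute, for each observation, an estimate of the inverse Mills ratio: the standard normal density divided by the standard normal cumulative distribution function, both evaluated at the fitted probit index.
In the second step, we estimate equation (5) on the selected sample using OLS, with the estimated inverse Mills ratio included as an additional regressor. The coefficient on the inverse Mills ratio equals the correlation between the two error terms multiplied by the standard deviation of the wage-equation residual (the probit step normalizes the variance of the participation-equation residual to one). Thus, a t-ratio test of the null hypothesis that this coefficient is zero is equivalent to testing the null hypothesis that the two error terms are uncorrelated, and is a test of the existence of sample selectivity bias.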
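The two steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it uses only the standard library, the data are simulated, and every name (`simulate`, `probit`, `ols`, `heckman_two_step`) and every "true" parameter value is a hypothetical choice made for this example. It returns point estimates only; the second-step OLS standard errors would still need the usual correction before the t-ratio test is applied.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def inverse_mills(index):
    """phi(index) / Phi(index); not guarded against extremely negative indexes."""
    return STD_NORMAL.pdf(index) / STD_NORMAL.cdf(index)


def solve(a, b):
    """Solve the small system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols(rows, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def probit(rows, d, iterations=25):
    """Maximum likelihood probit coefficients via Newton-Raphson, started at zero."""
    k = len(rows[0])
    gamma = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]  # negative Hessian of the log-likelihood
        for z, di in zip(rows, d):
            c = sum(g * v for g, v in zip(gamma, z))
            q = 1.0 if di else -1.0
            lam = q * STD_NORMAL.pdf(c) / STD_NORMAL.cdf(q * c)
            for i in range(k):
                grad[i] += lam * z[i]
                for j in range(k):
                    info[i][j] += lam * (lam + c) * z[i] * z[j]
        gamma = [g + s for g, s in zip(gamma, solve(info, grad))]
    return gamma


def simulate(n, rho, seed=0):
    """Simulated data; all parameter values here are assumptions for illustration."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x, z = rng.gauss(0, 1), rng.gauss(0, 1)  # z enters the selection equation only
        u = rng.gauss(0, 1)                      # selection error, variance 1
        e = rho * u + (1 - rho**2) ** 0.5 * rng.gauss(0, 1)  # outcome error, corr rho with u
        selected = 0.5 + x + z + u > 0
        y = 1.0 + 2.0 * x + e                    # treated as observed only if selected
        data.append((x, z, selected, y))
    return data


def heckman_two_step(data):
    # Step 1: probit on the whole sample.
    z_rows = [[1.0, x, z] for x, z, _, _ in data]
    gamma = probit(z_rows, [sel for _, _, sel, _ in data])
    # Step 2: OLS on the selected sample, adding the estimated inverse Mills ratio.
    x_rows, y = [], []
    for row, (x, _, sel, yi) in zip(z_rows, data):
        if sel:
            index = sum(g * v for g, v in zip(gamma, row))
            x_rows.append([1.0, x, inverse_mills(index)])
            y.append(yi)
    return gamma, ols(x_rows, y)


gamma, beta = heckman_two_step(simulate(n=4000, rho=0.7))
print("probit coefficients (const, x, z):", [round(g, 2) for g in gamma])
print("second-step coefficients (const, x, inverse Mills):", [round(b, 2) for b in beta])
```

In this simulated setup the outcome error has standard deviation one, so the coefficient on the inverse Mills ratio should be close to the assumed correlation of 0.7; an estimate near zero would instead suggest no selectivity bias.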