This module contains a brief introduction to the econometric problem of sample selectivity bias using Stata.
Sample selection bias
Introduction
These notes discuss how to handle one of the more common problems that arise in economic analyses: sample selection bias. Essentially, sample selection bias can arise whenever some potential observations cannot be observed. For instance, the students enrolled in an intermediate microeconomics course are not a random sample of all undergraduates. Students self-select when they enroll in any class or choose a major. While we do not know all of the reasons for this self-selection, we suspect that students choosing to take advanced economics courses have stronger quantitative skills than students choosing courses in the humanities. Because the students who did not enroll in the intermediate microeconomics class never took it, we can never observe the grades they would have earned had they enrolled. Under certain circumstances the omission of potential members of a sample will cause ordinary least squares (OLS) to give biased estimates of the parameters of a model.
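The claim that omitted observations bias OLS only "under certain circumstances" can be illustrated with a small simulation. The sketch below is written in Python rather than Stata and uses only the standard library. All parameter values (a true slope of 2, a selection rule that depends on the regressor, and the correlation rho between the two unobservables) are hypothetical choices made for illustration; they do not come from these notes.

```python
import random


def ols_slope(xs, ys):
    """Slope from a simple OLS regression of ys on xs (with an intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate(rho, n=20_000, seed=12345):
    """Return (slope on the full sample, slope on the self-selected subsample).

    rho is the assumed correlation between the unobservable that drives
    selection (v) and the unobservable that drives the outcome (e).
    """
    rng = random.Random(seed)
    beta0, beta1 = 1.0, 2.0    # hypothetical outcome parameters
    gamma0, gamma1 = 0.0, 1.0  # hypothetical selection parameters
    xs, ys, selected = [], [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        v = rng.gauss(0, 1)
        # Build e so that corr(e, v) = rho and var(e) = 1.
        e = rho * v + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        xs.append(x)
        ys.append(beta0 + beta1 * x + e)
        # The outcome is kept in the sample only when the index is positive.
        selected.append(gamma0 + gamma1 * x + v > 0)
    sel_x = [x for x, d in zip(xs, selected) if d]
    sel_y = [y for y, d in zip(ys, selected) if d]
    return ols_slope(xs, ys), ols_slope(sel_x, sel_y)


if __name__ == "__main__":
    for rho in (0.0, 0.8):
        full, sel = simulate(rho)
        print(f"rho={rho}: full-sample slope={full:.3f}, selected-sample slope={sel:.3f}")
```

With rho = 0 the unobservables behind selection are unrelated to the outcome, so the slope estimated on the selected subsample should stay close to the true value of 2. With rho = 0.8 the selected-sample slope should fall clearly below 2, even though the full-sample slope remains close to it.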
In the 1970s James Heckman developed techniques that correct the bias introduced by sample selection. Since then, most econometric computer programs have included a command that automatically applies Heckman’s method. However, blind use of these commands can lead to errors that would be avoided by a better understanding of his correction technique. This module is intended to provide this understanding.
In the first section I discuss the sources of sample selection bias by examining the basic economic model used to understand the problem. In the second section I present the estimation strategy first developed by Heckman. In the third section I discuss how to estimate the Heckman model in Stata. In the final section I examine an extended example of the technique. An exercise is included at the end of the discussion.
The model
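Assume that there is an unobserved latent variable, $y_i^*$, and an unobserved latent index, $d_i^*$, such that:
$$y_i^* = x_i'\beta + \varepsilon_i, \quad i = 1, \dots, N \tag{1}$$
$$d_i^* = z_i'\gamma + v_i, \quad i = 1, \dots, N \tag{2}$$
$$d_i = \begin{cases} 1 & \text{if } d_i^* > 0 \\ 0 & \text{otherwise} \end{cases} \tag{3}$$
$$y_i = y_i^* d_i \tag{4}$$
The matrix notation above means that
- $x_i'\beta = \beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki}$
- $z_i'\gamma = \gamma_0 + \gamma_1 z_{1i} + \dots + \gamma_m z_{mi}$
Substituting (1), (2) and (3) into (4) gives:
$$y_i = \begin{cases} x_i'\beta + \varepsilon_i & \text{if } z_i'\gamma + v_i > 0 \\ 0 & \text{otherwise} \end{cases} \tag{5}$$
Note that $N$ is the total sample size and $n$ is the number of observations for which $d_i = 1$. Since $y_i$ is not observed for the $N - n$ observations with $d_i = 0$,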
the question becomes why these observations are missing. A concrete example of such a model is a model of female wage determination. Equation (1) would model the wage rate earned by women in the labor force and Equation (2) would model a woman's decision to enter the labor force. In this case, $y_i$, the wage rate woman $i$ receives, is a function of the variables in $x_i$; however, women not in the labor force are not included in the sample. If these missing observations are drawn randomly from the population, there is no need for concern. Selectivity bias arises if the omitted observations have unobserved characteristics that affect the likelihood that $d_i = 1$ and are correlated with the wage the woman would receive had she entered the labor force. For instance, a mentally unstable woman is likely to earn relatively low wages and may be less likely to enter the labor force. In this case, the error terms, $\varepsilon_i$ and $v_i$,
would be independent and identically distributed
where