Abstract
In early 2010, the US Department of Health and Human Services released the findings from a large, experimental evaluation of the Head Start program. A common interpretation of the findings is that they show “small” effects, which has led to, among other things, calls to improve the efficacy of Head Start. However, it is not clear that Head Start is performing worse than should be reasonably expected. To provide a frame of reference for evaluating the program, we compare the performance of Head Start childcare centers to the performance of non-Head Start childcare centers, the latter being the preferred childcare option of wealthier families. We find that, on average, Head Start centers perform similarly to non-Head Start centers. Our results suggest that expectations for the Head Start program may be too high.
INTRODUCTION
Head Start is a federal matching grant program that provides disadvantaged children with access to pre-kindergarten education. The program began as part of the War on Poverty in the 1960s, and today serves roughly 900,000 children each year at a cost of $6.9 billion.
There is a large and robust literature that evaluates the effects of Head Start on program participants. A recent and particularly influential study is the Head Start Impact Study [US Department of Health and Human Services 2010], which uses an experimental design and a nationally representative sample of children to evaluate the program. The Impact Study shows that Head Start has positive immediate-term effects, but these effects have been described as “disappointingly small” [Besharov 2005].Footnote 1 It also shows that the effects of Head Start fade out quickly.Footnote 2 The Impact Study has rekindled an intense policy debate over the future of Head Start, and has led to increased calls for change.
Prior research provides some insight into why the initially positive effects of Head Start fade out over time — Head Start children go on to attend inferior K–12 schools [Lee et al. 1990; Lee and Loeb 1995; Currie and Thomas 2000].Footnote 3 However, this does not address the commonly voiced concern that even the immediate-term effects of the program are too small. It is often presumed, sometimes implicitly, that Head Start's small program impacts are attributable to the poor performance of the program itself. Taking this presumption as a point of departure, one logical course of action would be to intervene to improve the program. However, it is also possible that Head Start is performing well, but the program's margin for effect is smaller than is typically thought. These two explanations are substantively different, and they have different policy implications. For example, if Head Start's small program impacts are an artifact of the nature of the intervention itself, then adjustments to the program are less likely to meaningfully improve children's outcomes.
It is difficult to know what types of program impacts are reasonable to expect from Head Start because we have limited evidence on the effects of preschool interventions in general, particularly large-scale interventions.Footnote 4 In our study, we use the performance of non-Head Start childcare centers as a benchmark for evaluating the Head Start program. Although there is no other program exactly like Head Start that can be used for comparison, non-Head Start childcare centers offer a similar service, and are a viable, if imperfect, option.
A direct comparison between Head Start and non-Head Start childcare centers is difficult because the programs serve different populations of children. The key to our empirical strategy is that we have data from children who enrolled in Head Start and non-Head Start centers but were treated for only a short time. We use these children to control for selection into each childcare type, and compare their outcomes to outcomes from children who were treated for a much longer duration. This approach produces estimates of program impacts for Head Start and non-Head Start centers that correspond to differences in treatment duration within each program. We compare these estimates in a difference-in-difference framework.
Motivated largely by the Impact Study findings and subsequent policy discussions, we focus our evaluation on how Head Start and non-Head Start centers affect children's immediate-term test scores. We find that, on average, both childcare types perform similarly. Our findings are at odds with the common perception that Head Start is performing poorly, but this perception appears to be driven more by lofty expectations for the program than by the program's actual performance. Implicit in calls for improvements to the delivery of Head Start is the notion that the program is somehow “broken” and must be fixed, but this notion is not supported by the evidence presented here. We additionally note that there is a growing body of research showing that Head Start confers positive and economically meaningful long-term benefits to program participants [e.g., see Garces et al. 2002; Ludwig and Miller 2007; Deming 2009]. Combined with this prior research, our study suggests that the negative reaction to the Impact Study findings may be overblown.Footnote 5
MOTIVATION AND RESEARCH DESIGN
The Head Start literature to date has focused primarily on estimating treatment impacts for treated children. The typical comparison in the literature is between Head Start and a composite of the childcare arrangements available to children who are similar to Head Start attendees, but who do not attend (including non-Head Start, non-relative care; relative care; and parent care). While this type of comparison provides an answer to exactly the right question when we are interested in knowing whether Head Start participants are better off for having participated (a clear first-order issue), it is only partly informative about Head Start's performance in the broader sense. A key issue is that the non-Head Start childcare arrangements available to socioeconomically disadvantaged families are likely to be of lower quality than the childcare arrangements available to other families. This means that from the current literature we only know how well Head Start performs relative to what are likely to be inferior childcare alternatives.
Even so, the Impact Study, like many other studies in the literature, finds that Head Start has small treatment impacts. Prompted by the Impact Study results, Whitehurst [2010] writes: “Head Start isn’t doing the job the families it serves and the nation need. It must be improved. There are many proposals for doing so. Let's try them, test them, and do what works.” Implicit in Whitehurst's argument, and the many others like it, is the notion that Head Start can perform better if only we can figure out how to improve its efficacy. But this is far from certain. In fact, we only know that Head Start seems to marginally improve upon the available alternatives for program participants. An important question remains: How much better could we reasonably expect Head Start to perform? There are very few studies in the literature that attempt to answer this question.
A notable exception is Gormley et al. [2010], who compare Head Start to an early-childhood program in Tulsa, Oklahoma. They find that the Tulsa program generally outperforms Head Start (with the exception that Head Start has larger effects on health outcomes). On the one hand, Gormley et al. [2010] establish that early-childhood interventions can have large effects on children's outcomes, which is important. However, on the other hand, comparability is an issue because the childcare inputs in the Tulsa program appear to be superior to those in Head Start. For example, Gormley and Gayer [2005] report that care providers in the Tulsa program are required to have 4-year college degrees and are paid like public elementary-school teachers.Footnote 6 This raises concerns about the fiscal scalability of such a program.Footnote 7
Instead of comparing Head Start to the composite of alternatives that are available to socioeconomically disadvantaged children, or to a program that will be difficult to bring to scale for fiscal or other reasons, the contribution of the present study is to compare Head Start to the larger non-Head Start, center-based childcare sector. This comparison is appealing because over half of all children in the United States attend some form of non-Head Start, center-based care, making it the most common type of care in the country.Footnote 8 Furthermore, childcare costs in the non-Head Start, center-based childcare sector are more closely aligned with those of Head Start than are the costs of previously studied, higher-quality childcare programs.Footnote 9
One way to view the contribution of our study is as a complement to earlier work, like that of Gormley et al. [2010] and others, which shows that high-quality pre-kindergarten interventions can be quite successful. The prior literature provides insight into what we could expect to see if we were to dramatically scale up the quality of the Head Start program, and our fiscal commitment to its success. At the other end of the spectrum is the question of how well Head Start performs compared to more typical childcare arrangements in the United States — we are not aware of any prior work that attempts to shed light on this question.
Our approach is conceptually straightforward: we estimate average program impacts on children's test scores for Head Start and non-Head Start childcare centers, and compare these estimates using a difference-in-difference framework.Footnote 10 We estimate the effects of the programs using test-score data collected at the time of treatment. Although much of the recent controversy surrounding Head Start is related to the lack of persistent program impacts, we cannot perform our comparison using later-year outcomes because these outcomes are also affected by the different post-treatment experiences of Head Start and non-Head Start children [Lee et al. 1990; Lee and Loeb 1995; Currie and Thomas 1995; 2000]. For our analysis, where interest is in the performance of Head Start itself, these post-treatment experiences are confounding factors.
The key empirical issue that we face is that children and their families are non-randomly selected into childcare treatments. To minimize the influence of selection bias in our study, we rely on an unusual feature of our data set, the ECLS-B. In the survey, each caregiver for an age-4 child is asked to indicate how long the child has been receiving care. We use caregiver responses to this question to categorize children as either “treatments” or “controls,” within care type, based on treatment duration. Specifically, some children had been enrolled in care for only a few months at the time of the survey, while others had been enrolled for much longer. The children who had been enrolled for only a short time received very little treatment, but they are useful for our analysis because they selected into treatment. We use these “control” children to capture selection effects, and compare their outcomes to outcomes from children who were in care for a longer duration, who we define as “treated.” Although Head Start children differ markedly from non-Head Start children across virtually every observable measure, the within-program samples of treatments and controls are observationally very similar.Footnote 11
Our research design greatly improves observational comparability in our study, which we show below, but we rely on variation across children in the timing-of-entry into childcare to identify program impacts.Footnote 12 For our within-program estimates of treatment effects to be unbiased, timing-of-entry must be independent of potential outcomes conditional on observable information in each program. That is, for each program:

(Y0, Y1) ⊥ T | X (1)

In (1), Y0 and Y1 are outcomes that correspond to the control and treatment states; T indicates treatment status, which is determined by duration in care; and X is a vector of observable information about children and their families.
The conditional independence assumption will be violated if timing-of-entry is correlated with unobserved factors that are not controlled for by the components of X, and if these factors also influence children's outcomes. Unfortunately, we cannot directly test for the biasing effects of unobservables, but we do use the rich set of observable characteristics available in the ECLS-B data to show that within the same program, children who differ in terms of treatment duration over the range of durations that we consider are observationally very similar (see below). Additionally, we provide some validation of our empirical approach by comparing the treatment effects that we estimate for Head Start to analogous experimental estimates from the Impact Study. Our findings from this exercise are reported in Appendix B, which shows that our Head Start-specific estimates are similar to the experimental results, particularly in math. Our interpretation of this result is that even if conditional independence is not mechanically satisfied in our analysis, any bias generated by its failure is likely to be small.Footnote 13
We also use a difference-in-difference framework to further limit the impact of bias in our estimates. The difference-in-difference models estimate Head Start's relative effect as the difference between the within-program estimates for Head Start and non-Head Start centers. Taking this difference will remove any bias generated by endogenous timing-of-entry decisions that is consistent across both programs. So, for example, if early entrants into both types of care are more disadvantaged, any resulting bias in the program-specific estimates will be reduced in our difference-in-difference models.Footnote 14
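The difference-in-difference logic can be sketched with a toy calculation. The mean scores below are hypothetical, chosen only for illustration, and are not estimates from the paper:

```python
# Hypothetical mean test scores (standard-deviation units) for the four
# cells of the design; these numbers are illustrative only.
scores = {
    ("head_start", "treatment"): 0.05,   # in care 6-12 months
    ("head_start", "control"):  -0.02,   # in care 3 months or less
    ("non_hs",     "treatment"): 0.30,
    ("non_hs",     "control"):   0.21,
}

# Within-program impact: treated minus control, which nets out
# selection into each childcare type.
hs_effect = scores[("head_start", "treatment")] - scores[("head_start", "control")]
nhs_effect = scores[("non_hs", "treatment")] - scores[("non_hs", "control")]

# Difference-in-difference: Head Start's effect relative to
# non-Head Start, center-based care.
did = hs_effect - nhs_effect
print(round(hs_effect, 2), round(nhs_effect, 2), round(did, 2))  # 0.07 0.09 -0.02
```

Any selection effect common to early entrants in both programs enters both within-program differences and cancels in the final subtraction.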
DATA
Our data are from the ECLS-B, provided by the National Center for Education Statistics (NCES). The ECLS-B began tracking children at birth and administered follow-up surveys when the children were aged 9 months, 2 years, and 4 years (the survey is ongoing). During each wave of the survey, information is collected from parents and children, and for the age-2 and age-4 follow-ups, childcare providers were also surveyed. We use the data file from the age-4 follow-up in our analysis.
The ECLS-B contains detailed information about children and their families. Two of the most important variables in the data are family income and parental-education levels, which are highly correlated with children's outcomes. Also, because the survey began at birth, it provides access to information that is not commonly available, like birth weight. Childcare arrangements, past and present, are reported for each child. The data on childcare arrangements are collected contemporaneously. Therefore, unlike much of the prior work in this area, we do not need to rely on retrospective questionnaires given to parents to assign childcare treatments.
Assessment data are collected from children during each wave of the survey. During the 9-month and 2-year follow-ups, children were given motor-skills and mental-skills tests, and during the age-4 follow-up they were given cognitive tests in math and literacy. We estimate childcare effects on the age-4 math and literacy scores, and condition on the history of children's scores in our models. We standardize all test scores using the universe of test takers in the data.
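The standardization step is a simple z-score transformation against the full universe of test takers. A minimal sketch with simulated scores (the raw-score scale is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical raw age-4 math scores for the full universe of test takers.
raw = rng.normal(loc=50.0, scale=10.0, size=5000)

# Standardize against the full sample: every child's score becomes a
# z-score relative to all age-4 test takers, so estimated effects are
# expressed in standard-deviation units.
z = (raw - raw.mean()) / raw.std()
# By construction, z has mean 0 and standard deviation 1.
```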
Our empirical strategy requires that we impose several restrictions on the data set. First, to facilitate a clean comparison across programs, we restrict our analysis to include only children who were reported to be in care, exclusively, either at a Head Start facility or a non-Head Start, center-based care facility at the time of the age-4 survey. That is, we omit all children who received some other form of care, or split time between multiple care arrangements.Footnote 15 Among the children who meet this criterion, we further restrict the sample based on caregiver responses to the time-in-care question. In our primary models we identify children who were reported to be in care for 3 months or less as controls, and children who were reported to be in care for 6−12 months as treatments. In a robustness exercise in Section “Robustness and other issues”, we show that our findings are not sensitive to reasonable adjustments to these treatment and control definitions.Footnote 16
We impose three additional restrictions on the data. The first two restrictions are imposed because the caregiver question about time-in-care is specific to the current center, raising concerns about misclassification error. For example, we may misclassify some treated children as controls, and understate the amount of treatment that some treated children actually received, simply because they changed childcare centers. To minimize these occurrences in the data, we omit all children whose parents reported changing residences between the age-2 and age-4 surveys because we expect that residence changes are likely to correspond to changes in childcare arrangements. We also exclude all children who were reported to attend any non-relative childcare during the age-2 survey. Neither treatments nor controls should have been in care at that time, and if they were, it suggests that a change in childcare arrangements occurred. We also omit children who were reported to be in care for more than 40 hours per week because Head Start centers do not care for children for over 40 hours.Footnote 17 Again, we consider the robustness of our findings to relaxing these restrictions in Section “Robustness and other issues”.
Moving forward, when we reference our preferred data sample, we are referring to the sample where treatments were in care for 6−12 months, controls were in care for 3 months or less, and the above-described data restrictions are in place.
Table 1 shows descriptive statistics for the full ECLS-B data file (age-4 survey), and within childcare type, for program participants, exclusive program participants and our preferred data samples. We highlight two aspects of the table. First, across programs, Head Start and non-Head Start children differ substantially along virtually every observable dimension. Second, within programs, our preferred data samples differ from the exclusive-care samples in some ways, and are much smaller.Footnote 18 However, given the sharp reduction in sample sizes when we move to our preferred samples, and the number of restrictions we impose on the data, the differences seem generally modest. That is, with a handful of exceptions, our preferred data samples do not seem to be meaningfully different from their respective exclusive-care samples in the full ECLS-B data file.Footnote 19
Next, in Table 2, we split our preferred samples based on treatment status and compare treatments and controls within and across programs. We also further restrict the data to include only children for whom either a math or literacy score is reported during the age-4 survey (or both), which is required for inclusion in our analysis. The first row of the table shows the average difference in treatment duration between treatments and controls. The difference is roughly 6 months for both programs (although it is slightly larger in the non-Head Start sample — by about 5 days, or 0.16 months). The treatment effects that we report below correspond to these differences in treatment duration, which we consider to be equivalent across programs.
The second row highlights a large and statistically significant difference in age (in months) between treatments and controls, which is a mechanical artifact of our research design — treatment children, by definition, had been in care longer at the time of the age-4 ECLS-B survey, and this is correlated with age (note that there is no discrepancy in age across programs conditional on treatment status). To ensure that the age differences between treatments and controls do not bias our findings, we include age-in-months indicator variables in all of our models — that is, we estimate within-age treatment effects. We allow the age trend to differ by program type in our difference-in-difference models.Footnote 20
The remainder of Table 2 reports average values for the demographic, socioeconomic, and lagged test-score variables. Within both programs there are some differences between treatments and controls. The differences are more often statistically significant in the non-Head Start sample; however, as we show in Table 3, the nominal differences are slightly larger in the Head Start sample (the differences in the t-tests are driven by differences in sample size). We also note that the within-program differences between treatments and controls are similar across childcare types. For example, treatments from both the Head Start and non-Head Start samples are more likely to be non-White, and generally come from lower-income and less-educated families than controls. This suggests that any bias in our childcare-specific estimates owing to these differences will be reduced when we compare the programs.
Are the differences that we document in Table 2 large? We answer this question by calculating the standardized differences in observables between three groups: (1) Head Start treatments and non-Head Start treatments; (2) Head Start treatments and Head Start controls; and (3) non-Head Start treatments and non-Head Start controls. We calculate the standardized difference in observable characteristic Xk between groups A and B as:

SD(Xk) = |X̄kA − X̄kB| / √[(S²kA + S²kB)/2] (2)

where X̄kA and S²kA are the sample mean and variance of Xk in group A (and analogously for group B). Equation (2) is motivated by Rosenbaum and Rubin [1985], who suggest using a similar metric to evaluate whether matching methods are effective in producing observationally comparable treatment and control units. Analogously to the matching literature, we expect our treatment and control observations to be similar, in which case the within-program standardized differences should be small.
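The metric is straightforward to compute. In this sketch the two samples are hypothetical draws, not ECLS-B data:

```python
import numpy as np

def standardized_difference(a, b):
    """Absolute standardized difference between two samples, in the
    spirit of Rosenbaum and Rubin [1985]: the gap in means scaled by
    the square root of the average of the two group variances."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2.0)
    return abs(a.mean() - b.mean()) / pooled_sd

# Hypothetical family-income draws (in $1,000s) for treatments and controls.
rng = np.random.default_rng(1)
treatments = rng.normal(22.0, 8.0, size=108)
controls = rng.normal(23.0, 8.0, size=96)
print(round(standardized_difference(treatments, controls), 3))
```

Because the measure scales the mean gap by within-group dispersion, it is comparable across variables measured in different units.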
Table 3 reports standardized differences that are averaged across all of the demographic, socioeconomic and lagged test-score variables in Table 2, by comparison group.Footnote 21 Not surprisingly, the average standardized differences are much smaller in the within-program comparisons than in the across-program comparison. As the treatment and control definitions are chosen independently of the observable information in the data, the improvement in observed comparability generated by our research design suggests that unobserved comparability is also improved. We cannot directly control for unobserved differences between children in our models, meaning that the reduction in these differences implied by Table 3 is particularly important.
It is easy to see that our within-program samples are more comparable than our between-program samples, but whether the within-program standardized differences are small or large is not obvious. To provide some insight, for each comparison we construct the empirical distribution for the average standardized difference under random assignment. Because the standardized difference is an absolute measure, even with random assignment and large (but finite) samples, the center of the distribution will be above zero.
We illustrate our approach for constructing the empirical distributions using the comparison between Head Start treatments and controls. We begin with the 852 children in the full ECLS-B data file who were reported to exclusively attend Head Start (see Table 1). From this sample, we randomly draw 96 observations and calculate the mean and variance of each variable, X k , and then randomly draw 108 observations and calculate the mean and variance of each variable. The first group corresponds to our primary control sample from Table 2, where N=96, and the second group to our primary treatment sample, where N=108. We calculate the average standardized difference between the two random samples across the X's, and then repeat this procedure 500 times to construct the empirical distribution of the average standardized difference under random sampling. For each comparison, Table 3 reports the average and 95 percent confidence interval of this distribution.Footnote 22,Footnote 23
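The resampling procedure can be sketched as follows. The pool standing in for the 852 exclusive Head Start children is simulated here, and the control and treatment draws are taken without overlap, which is one reading of the procedure described above:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated stand-in for the 852 exclusive Head Start children;
# rows are children, columns are observable characteristics Xk.
pool = rng.normal(size=(852, 5))

def avg_std_diff(a, b):
    # Average absolute standardized difference across the columns of X.
    num = np.abs(a.mean(axis=0) - b.mean(axis=0))
    den = np.sqrt((a.var(axis=0, ddof=1) + b.var(axis=0, ddof=1)) / 2.0)
    return (num / den).mean()

# Under random assignment, repeatedly draw a "control" sample (N=96) and
# a "treatment" sample (N=108) from the pool and record the statistic.
draws = []
for _ in range(500):
    idx = rng.permutation(852)
    controls, treatments = pool[idx[:96]], pool[idx[96:204]]
    draws.append(avg_std_diff(controls, treatments))
draws = np.array(draws)

# Center and 95 percent interval of the random-sampling distribution.
lo, hi = np.percentile(draws, [2.5, 97.5])
print(round(draws.mean(), 3), round(lo, 3), round(hi, 3))
```

Note that the distribution is centered above zero even under pure random sampling, since the statistic averages absolute differences.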
Table 3 shows that for both programs, the observed standardized difference is comfortably within the 95 percent confidence interval of the random-sampling distribution (although, consistent with Table 2, the standardized difference is larger relative to the empirical distribution in the non-Head Start sample). Row (4) of the table shows that our “difference-in-difference” standardized difference is also well within the 95 percent confidence interval of the random-sampling distribution. Although we lack a true mechanism for random assignment in our study, at least observationally our treatment and control samples look very similar to what we would expect to see if such a mechanism were actually in place.
METHODOLOGY
We estimate program-specific regression models where we only include observations that are identified as either treatments or controls. These models take the following form:

Yi = Aiλ + Xiβ + πTi + εi (3)

In (3), Yi indicates an outcome for child i, Ai is a vector of age-in-months dummies, Xi is a vector of observable information about child i and her family (the child-specific variables in Xi include the demographic, socioeconomic-status and lagged test-score variables shown in Tables 1 and 2), and Ti indicates treatment status, where treated children belong to the group that spent more time in care. π̂ estimates the treatment effect.
The identifying variation in the models comes from differences in treatment status among children who were the same age at the time of their assessments (because we include age fixed effects). This means that we identify the program impacts by comparing children who differ by their timing-of-entry into childcare, as discussed above.Footnote 24 More specifically, we compare relatively late-entering control children to relatively early-entering treatment children. Noting that the ECLS-B assessments were administered mostly in the fall — in September, October, and November — the control children in our analysis primarily entered care in the late summer and early fall immediately preceding the tests. Treatment children entered care at some point during the prior year — some enrolled at the beginning of the year and others enrolled mid-year.Footnote 25
We compare the program-specific estimates from equation (3) using a difference-in-difference framework:

Yi = Aiλ + Xiβ + πTi + δ1HSi + θHSiT + (HSi × Ai)δ2 + (HSi × Xi)δ3 + εi (4)

In (4), Yi, Ai and Xi are defined as above. Ti also maintains its definition from equation (3), but now applies to treated individuals in either program. HSi is an indicator for Head Start attendance, both treatment and control, and HSiT further indicates Head Start treatment. The covariates and age indicators are interacted with the Head Start indicator to allow them to differentially affect Head Start and non-Head Start children. F-tests reject the null hypothesis that δ2=δ3=0.Footnote 26 If our empirical strategy effectively mitigates selection bias, θ can be interpreted as the average causal effect of Head Start treatment relative to treatment at a non-Head Start childcare center. δ1 is also informative — it indicates how selection into Head Start compares to selection into non-Head Start, center-based care. Noting that HSiT = HSi × Ti, equation (4) is a basic difference-in-difference model.Footnote 27
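A stripped-down version of this regression, with simulated data and without the covariates and age dummies of the full model, illustrates how θ is read off the interaction term. The simulated relative Head Start effect of −0.05 is an arbitrary choice for the example:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400

# Simulated data: HS = Head Start indicator, T = treatment (long duration).
HS = rng.integers(0, 2, n)
T = rng.integers(0, 2, n)
# Outcome with a built-in relative Head Start effect of -0.05 (arbitrary).
y = 0.2 - 0.3 * HS + 0.10 * T - 0.05 * HS * T + rng.normal(0, 0.5, n)

# Regressors: intercept, T, HS, HS*T. The paper's model also includes
# age dummies and covariates, all interacted with HS; those are omitted
# here for clarity.
X = np.column_stack([np.ones(n), T, HS, HS * T])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
theta = coef[3]  # difference-in-difference estimate

# Without covariates, theta equals the simple 2x2 difference of means.
did = (y[(HS == 1) & (T == 1)].mean() - y[(HS == 1) & (T == 0)].mean()) \
    - (y[(HS == 0) & (T == 1)].mean() - y[(HS == 0) & (T == 0)].mean())
print(abs(theta - did) < 1e-8)  # True: the two computations coincide
```

With covariates and interactions added, θ is no longer a simple difference of cell means, but the interpretation as the relative Head Start effect is unchanged.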
RESULTS
We estimate program impacts on children's age-4 math and literacy scores, which we standardize using the entire sample of age-4 test takers in the ECLS-B data. Table 4 reports estimates from the childcare-specific models. The first column in each panel shows raw differences in outcomes between treatments and controls, and the second column is covariate adjusted. For non-Head Start centers, treatment is positively associated with outcomes. For Head Start centers, the estimates are nominally positive but statistically indistinguishable from zero.
Table 5 reports several different sets of estimates. The first two columns show estimates from OLS regressions that directly compare Head Start and non-Head Start treatments. Column (1) reports unconditional differences in outcomes, and column (2) reports covariate-adjusted differences. Columns (3) and (4) report estimates from difference-in-difference models where we impose the constraint that δ2=δ3=0. Columns (5) and (6) report estimates from difference-in-difference models where we relax this constraint, which is our preferred specification (these estimates are equivalent to subtracting the non-Head Start effects from the Head Start effects in Table 4). The covariates greatly increase predictive power in all of the models, as would be expected. However, in the difference-in-difference models they negligibly affect our estimates, suggesting that the control observations largely capture the effects of selection.
The unconditional differences in column (1) show that Head Start children perform considerably worse on the cognitive tests. The regression-adjusted estimates in column (2) are near zero, indicating that the differences in column (1) reflect selection. The difference-in-difference estimates are generally small, and none can be statistically distinguished from zero. Overall, Table 5 shows that Head Start centers, on average, perform similarly to non-Head Start centers in terms of raising children's cognitive scores.Footnote 28
It is notable that the estimates from the model in column (2), which we refer to as the regression-adjusted model, are similar to the difference-in-difference estimates. If there is selection into childcare type based on unobservables, the estimates from the regression-adjusted model will be biased by this selection, while in the difference-in-difference models bias from unobserved selection is likely to be reduced. The most reasonable explanation for the similarity in the findings is that selection into childcare is largely captured by the observable information in the data. Although this is somewhat surprising, it is a testament to the exceptional quality of the ECLS-B survey. Ex post, our findings would have been qualitatively similar had we simply assumed selection on observables from the outset. However, we would be less confident in the regression-adjusted estimates without the confirmation provided by the difference-in-difference models.
ROBUSTNESS AND OTHER ISSUES
Robustness of findings
We first evaluate the robustness of our findings to adjustments to the treatment and control definitions. We consider defining treatments as being in care for 7−12 months and for 9–15 months, and compare these treatments with controls who were reported to be in care for 0–2 and 0–3 months (we also compare the 6–12-month treatment group with the new control group). We then return to our preferred treatment and control samples, and relax some of the other data restrictions. Specifically, we include children whose parents reported moving no more than once, and no more than twice, between the age-2 and age-4 surveys; and children who were reported to be in care for more than 40 hours per week. For the models that include movers, we note that treatment misclassification errors will be more common. However, in practice, any bias from the misclassification errors appears to roughly cancel out in the difference-in-difference models (of note, approximately half of the movers are classified as treatments, and half as controls, in each program). Including the movers increases our sample size by over 50 percent.
Our results are reported in Table 6. For brevity, we only report estimates from the covariate-adjusted, unrestricted difference-in-difference models (the first row of the table shows the baseline estimates from Table 5 for comparison). In math, Table 6 provides no evidence to overturn our primary finding that Head Start and non-Head Start centers perform similarly. In literacy, while none of the estimates can be statistically distinguished from zero, it is noteworthy that the point estimates are consistently negative. This raises the possibility that Head Start underperforms in literacy, but the evidence is not strong enough to scientifically support this claim.
Next, we consider an alternative specification analogous to the model in (4) where we replace the treatment indicator with a quadratic function of months-in-care for each child.
In (5), MIC_i and MIC_i^2 replace T_i from equation (4), where MIC indicates months-in-care. The coefficients of interest in (5) are θ1 and θ2, which characterize the effects of months-in-care in Head Start relative to months-in-care in non-Head Start centers. A benefit of defining treatment duration as in (5) is that we can include children who were reported to be in care for 4 or 5 months at the time of the age-4 survey — these children were not identified as either treatments or controls in the previous models.Footnote 29
Table 7 reports estimates that are analogous to those in Table 5, but based on the new estimation sample and the model in (5). The first column of the table shows the unconditional differences in outcomes between Head Start and non-Head Start children, and the second column shows the regression-adjusted differences. The unconditional differences again show that Head Start children perform considerably worse on the cognitive tests. The regression-adjusted differences in math and literacy are both nominally negative, but statistically insignificant. Moving to the difference-in-difference estimates, the results in Table 7 are qualitatively consistent with those in Table 5 — they do not indicate any differences in performance between Head Start and non-Head Start childcare centers.
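The structure of the months-in-care specification in (5) can be sketched on simulated data (the coefficients and sample below are hypothetical, not estimates from the paper): when the months-in-care profile is identical across programs, the program interactions corresponding to θ1 and θ2 should be near zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3000
program = rng.integers(0, 2, n)              # 1 = Head Start analogue
mic = rng.integers(0, 13, n).astype(float)   # months-in-care, 0-12

# Same concave months-in-care profile in both programs by construction,
# so the true relative-effect parameters (theta1, theta2) are zero.
y = 0.05 * mic - 0.002 * mic**2 + rng.normal(scale=0.4, size=n)

# y ~ const + program + mic + mic^2 + program*mic + program*mic^2
X = np.column_stack([np.ones(n), program, mic, mic**2,
                     program * mic, program * mic**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
theta1, theta2 = beta[4], beta[5]   # relative months-in-care effects
```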
Finally, using a much larger sample, we also considered an IV strategy where we instrumented for months-in-care using assessment dates from the ECLS-B survey as a source of exogenous variation [following Fitzpatrick et al. 2011].Footnote 30 However, the IV approach was unsuccessful, and therefore, we omit the results for brevity.Footnote 31 The IV estimates were clearly too large, which was obvious when we compared the Head Start IV estimates to the Impact Study estimates. The primary problem with the instrument is that we do not have access to a pre-test, and we suspect that we could not adequately separate the test-date variation from variation in age (we tried to deal with this issue in several ways without success). Of interest, however, is that even in the IV models the estimated Head Start and non-Head Start program impacts are very similar, which would occur under two conditions: (1) Head Start and non-Head Start centers have similar impacts; and (2) the biases in the IV estimates are similar for Head Start and non-Head Start children.
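The 2SLS logic behind the attempted IV strategy can be illustrated generically. This is not the authors' specification — the instrument, coefficients, and data below are invented for illustration: a valid instrument isolates variation in months-in-care that is unrelated to the selection channel that biases OLS.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

z = rng.normal(size=n)   # instrument: e.g., assessment-date variation
u = rng.normal(size=n)   # unobserved selection into longer care

# Months-in-care is endogenous: it depends on both z and u.
mic = 6 + 1.5 * z + u + rng.normal(size=n)
# The outcome depends on u directly, so OLS on mic is biased upward.
y = 0.03 * mic + 0.5 * u + rng.normal(size=n)

ols = np.polyfit(mic, y, 1)[0]   # biased slope estimate

# 2SLS by hand: first stage projects mic on z, second stage
# regresses y on the fitted values.
Z = np.column_stack([np.ones(n), z])
mic_hat = Z @ np.linalg.lstsq(Z, mic, rcond=None)[0]
iv = np.polyfit(mic_hat, y, 1)[0]  # consistent for the true effect (0.03)
```

In the paper's setting the instrument was contaminated by age variation, so the analogue of `iv` was itself biased — the sketch only shows the mechanics, not a fix for that problem.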
Does our analysis favor Head Start?
Our analysis may favor Head Start because of the relative disadvantage of Head Start children. Although we would typically be concerned about the reverse, we cannot preclude this possibility. One of the most likely ways that our results would be biased in favor of Head Start would be if the testing instruments in the ECLS-B had strong ceiling effects. If this were the case, the test scores for the higher-achieving non-Head Start children would be mechanically restricted relative to the Head Start children, which would give Head Start an advantage in our analysis. Following Koedel and Betts [2010], we test for ceiling effects in the ECLS-B testing instruments and find no evidence of a test-score ceiling in either test. In fact, inferring from Koedel and Betts [2010], there are mild floor effects in the literacy test, which would work against Head Start in our comparison (scores for the lowest performers are mechanically inflated). The test-score floor in the literacy test is consistent with our estimates, which show that Head Start has a nominally smaller relative effect in literacy.Footnote 32
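The skewness-based check for ceiling and floor effects can be sketched as follows (simulated scores, not the ECLS-B instruments): censoring a score distribution from above produces negative skewness, while a floor produces positive skewness, matching the signs discussed above.

```python
import numpy as np

def sample_skewness(x):
    # Third standardized moment -- the measure used to detect score ceilings
    x = np.asarray(x, dtype=float)
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

rng = np.random.default_rng(3)
latent = rng.normal(size=20000)          # hypothetical underlying ability

ceiling = np.minimum(latent, 1.0)        # test tops out: high scores compressed
floor = np.maximum(latent, -1.0)         # test bottoms out: low scores inflated

skew_ceiling = sample_skewness(ceiling)  # negative -> ceiling effect
skew_floor = sample_skewness(floor)      # positive -> floor effect
```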
It is also possible that disadvantaged children gain more from childcare relative to non-attendance. For example, it could be that advantaged children gain little from attending childcare relative to staying at home, perhaps because they have more positive home environments. Alternatively, disadvantaged children could benefit more from going to childcare, if for no other reason than that they are not staying at home. If the marginal benefit of childcare attendance over non-attendance is higher for disadvantaged children, this would favor Head Start in our analysis. We use the heterogeneity in family income within the non-Head Start sample to evaluate whether this explanation is likely to be driving our findings (see Table 1).Footnote 33 Specifically, using the income categories reported in Tables 1 and 2, we divide the non-Head Start children into bins and estimate income-specific childcare effects. Separating the income controls from the other X's for illustration, and defining the vector of income indicators for child i by INC_i, we estimate:
The estimates of λ2 are of interest in (6), and we report our findings in Table 8. The omitted group is the highest income category. Although the estimates are noisy, they are not consistent with disadvantaged children gaining more from childcare treatment, at least within the non-Head Start sample.Footnote 34
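The structure of model (6) can be sketched with simulated data (the income bins, effect sizes, and sample sizes below are hypothetical): with a homogeneous childcare effect across income groups, the treatment-by-income interactions corresponding to λ2 should be near zero.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3500
treat = rng.integers(0, 2, n)        # childcare treatment indicator
income_bin = rng.integers(0, 4, n)   # 0 = lowest, 3 = highest (omitted group)

# Homogeneous childcare effect (0.2) across income bins by construction
y = 0.2 * treat + 0.1 * income_bin + rng.normal(scale=0.5, size=n)

# Dummies for the three non-omitted income bins and their treat interactions
inc = np.column_stack([(income_bin == k).astype(float) for k in range(3)])
X = np.column_stack([np.ones(n), treat, inc, inc * treat[:, None]])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
lam2 = beta[5:8]   # treat-by-income interactions (the lambda_2 analogue)
```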
CONCLUSION
For Head Start and non-Head Start childcare centers, we estimate average program impacts by comparing outcomes from children who received treatment to outcomes from children who selected into treatment but received very little care. We use a difference-in-difference framework to evaluate the relative performance of Head Start. We find that Head Start centers, on average, perform similarly to non-Head Start centers.
Our analysis suggests that the common perception that Head Start is performing poorly is driven more by lofty expectations for the program than by the program's actual performance. Implicit in calls for improvements to the delivery of Head Start is the notion that the program is somehow “broken” and must be fixed, but our analysis does not support this contention. Policymakers and other interested parties may or may not be satisfied with the outcomes generated by Head Start, but speaking comparatively, the program does not appear to be underperforming.
Two qualifications to our study merit attention. First, our findings do not preclude the possibility that Head Start can perform better. For example, Gormley and Gayer [2005] and Gormley et al. [2010] evaluate a high-quality pre-kindergarten program in Tulsa, Oklahoma, and find that it greatly outperforms Head Start in terms of affecting cognitive achievement. In addition, Currie [2001] discusses several other studies that document large effects for programs that were funded at higher levels than Head Start. However, these previously studied programs are unlikely to be scalable to the level of Head Start without considerable increases in funding, and even then there would be challenges. In this modern era of fiscal constraints, this is an important consideration. A unique contribution of our study is that we compare Head Start to the larger non-Head Start, center-based childcare sector, which provides care to the majority of children in the United States and where childcare costs are more closely aligned with the costs of Head Start.Footnote 35
Second, our study does not address the growing body of evidence showing that Head Start has large effects on children's longer-term outcomes [e.g., Garces et al. 2002; Ludwig and Miller 2007; Deming 2009]. The longer-term benefits of Head Start participation may be sufficient to justify current expenditures even with only small immediate-term effects on test scores and other outcomes [Ludwig and Phillips 2007]. Our findings, in conjunction with those from the literature on Head Start's longer-term impacts, suggest that we should be cautious about putting too much emphasis on the immediate-term impacts of Head Start on test scores. If Head Start is performing within expectations in this regard (which our study suggests should be low), but participants meaningfully benefit in the long term, overreacting to the test-score results from the Impact Study could do more harm than good.
Notes
These findings are consistent with a large non-experimental literature showing that Head Start improves children's cognitive, social and health outcomes during and immediately after treatment [e.g., see Currie and Thomas 1995; Currie 2001; Frisvold and Lumeng 2011].
Prior non-experimental work has also documented the fade out of program impacts [e.g., see Currie and Thomas 1995].
The literature on K–12 education shows that it is not uncommon for educational interventions to have effects that fade out over time [see, e.g., Jacob et al. 2008; Bhatt and Koedel 2010].
As noted by Barnett [1992], many of the earlier studies that evaluate preschool interventions, including Head Start, suffer from serious design limitations. In a more recent and rigorous study, Currie and Thomas [1995] consider the effects of non-Head Start preschool interventions. They use a sibling-fixed-effects approach to estimate non-Head Start preschool effects on cognitive outcomes and estimate effects that are statistically indistinguishable from zero.
Examples of negative responses to the Impact Study findings are abundant, and include Besharov [2005], Coulson [2010], Wetzstein [2010], Whitehurst [2010]. There are also other potential benefits of the provision of Head Start — for example, Lemke et al. [2007] find that the availability of Head Start is associated with an increase in single mothers on welfare transitioning directly into work.
Head Start has been slowly ramping up teacher degree requirements, but even by 2013 Head Start will only require half of its teachers to hold a bachelor's degree. Furthermore, teacher pay in Head Start is far below the level of the typical elementary-school teacher, particularly when benefits are included in the calculation.
Also see Currie [2001], who documents evidence from several other studies of pre-kindergarten interventions that are funded at a considerably higher level than Head Start.
In the Early Childhood Longitudinal Survey, Birth Cohort (ECLS-B) data set, which includes a nationally representative sample of children (see Section “Methodology”), over half of the sample attended some form of center-based care (non-Head Start) at the time of the age-4 survey (≈55 percent). The next most common childcare arrangement reported in the data was non-parent, relative care (≈20 percent).
If anything, the ECLS-B data suggest that per-pupil expenditures in Head Start are higher, on average, than in the larger non-Head Start sector, although we note that accurate data on non-Head Start per-pupil expenditures are difficult to obtain. That said, the per-pupil costs in the non-Head Start sector appear to be much closer to Head Start's per-pupil costs than are the costs of other pre-kindergarten programs that have been evaluated in prior work (see Currie 2001, for example). Also, Resnick and Zill (undated) show that in terms of measurable inputs, Head Start looks similar to preschools from the National Child Care Staffing Study.
Head Start also has non-cognitive objectives, like improving children's social and health outcomes, but we focus on cognitive outcomes here.
To the best of our knowledge, this approach has been used twice in prior research, first by Hebbeler [1985] and more recently by Behrman et al. [2004]. We note that our use of the term “control” is non-standard because the control children in our analysis did receive some treatment. Nonetheless, for ease of presentation we refer to the children who were in care for only a short duration as “controls” throughout our analysis.
Variation in time-in-care, which determines treatment or control status in our analysis, can come from two sources in the data: (1) timing-of-entry into childcare; and (2) variation in assessment dates among children. Following Fitzpatrick et al. [2011], we attempted to isolate and exploit the latter source of variation in our study because it is more likely to be exogenous. We were unsuccessful because we could not disentangle the test-date variation from variation in age (unlike in Fitzpatrick et al. we do not have access to a pre-test). Nonetheless, we note that our estimates from the IV analysis are consistent with the findings we present throughout the text — Head Start and non-Head Start centers have similar effects — if we assume that the age bias is constant across Head Start and non-Head Start children. More details are available in Section “Robustness and other issues”.
The difference in treatment duration between treatment and control children in our primary analysis is roughly 6 months in each program, similar to the treatment duration for Impact Study children, which facilitates the comparison (see Appendix B).
The difference-in-difference estimates could increase bias if unobserved, timing-of-entry selection effects move in opposite directions for Head Start and non-Head Start children. Although this possibility is difficult to evaluate directly, we note that the small observable differences between treatment and control children within programs are consistent with any such bias moving in the same direction (see Table 2).
The majority of children in each program receive exclusive care. The exclusive enrollment rate is higher for the non-Head Start sample, with the difference driven primarily by the fact that more Head Start children also receive care from a non-parent relative (see Table 1).
We cap treatment durations in the neighborhood of 1 year because children do not generally enroll in Head Start prior to age-3. Therefore, at the time of the age-4 survey, few Head Start children were reported to have been in care for over a year. As we extend the time-in-care window for treatments beyond 12 months, we observe longer care durations, on average, for non-Head Start children relative to Head Start children, reducing comparability.
Although we note that in the data there are a handful of exclusive-care Head Start children who are reported to be in care for over 40 hours per week (see Appendix Table A1).
Appendix Table A1 shows how we arrive at our final data sample in each program.
As our samples are small and purposefully selected we do not use the NCES-provided sample weights; instead, we control for the oversampled populations directly in the covariate-adjusted models as in Rose and Betts [2004]. The survey weights are designed to allow for generalizations about the survey population — because we have already drawn a non-random subsample of the data, it is not clear what weighting would imply for our estimates. In any case, for completeness, in an omitted analysis we also estimate models that use the survey weights. Our primary findings are not qualitatively sensitive to using the weights, but the weighted estimates are less precise.
Most of the sample is between 48 and 59 months of age during the age-4 survey. However, some children are younger, and some older. Children are grouped by exact age, in months, for ages 48–59 months. The remaining children are categorized as either age<48 or age>59. Our results are robust to excluding children outside of the primary age range of 48–59 months. If anything, excluding these children modestly favors Head Start in our comparison.
We exclude differences in age in our calculations because these differences are generated by our research design. We address the age discrepancies between treatments and controls in our analysis by estimating within-age treatment effects.
To generate the random-sampling distribution for our comparison across programs, we randomly draw from the combined exclusive-care sample in Table 1 (N=4,403). For the non-Head Start comparison, we draw from the sample of 3,551 non-Head Start, exclusive-care children. Finally, for the difference-in-difference, we simply subtract the non-Head Start difference from the Head Start difference at each draw.
The re-sampling is done without replacement. We also estimate the empirical distributions by re-sampling with replacement, which yields very similar results. However, we prefer re-sampling without replacement because when we re-sample with replacement, the same observations can be designated as treatments and controls. Of course, in practice this can never occur.
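The re-sampling procedure described in these notes can be sketched as a standard permutation test (illustrative data, not the ECLS-B samples): drawing without replacement splits the pooled sample into disjoint pseudo-treatment and pseudo-control groups on each draw, so no observation is ever designated as both a treatment and a control.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical scores: treatments vs controls within one program
treat_scores = rng.normal(0.1, 1.0, 400)
ctrl_scores = rng.normal(0.0, 1.0, 400)
observed = treat_scores.mean() - ctrl_scores.mean()

pooled = np.concatenate([treat_scores, ctrl_scores])
n_t = len(treat_scores)

# Re-sample WITHOUT replacement: each permutation partitions the pooled
# sample into disjoint pseudo-treatment and pseudo-control groups.
draws = []
for _ in range(2000):
    perm = rng.permutation(pooled)
    draws.append(perm[:n_t].mean() - perm[n_t:].mean())
draws = np.array(draws)

# Two-sided p-value: share of permuted differences at least as extreme
p_value = np.mean(np.abs(draws) >= np.abs(observed))
```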
Children who were assessed at the same age cannot have different treatment durations if they also entered childcare at the same age (treatment duration={assessment age–entry age}). Therefore, our estimates are identified from variation in timing-of-entry into childcare across children.
Mid-year enrollment policies at non-Head Start centers are likely to be heterogeneous. The policy for Head Start centers is that mid-year enrollments are permitted subject to space constraints. The ECLS-B data show that mid-year enrollments occur regularly in both programs, although the most common enrollment period is in the late summer.
When the interaction terms involving the covariates are included, the difference-in-difference models return estimates that are equivalent to the differences between the program-specific estimates from model (3). Imposing the constraint δ2=δ3=0 changes the estimates slightly, and in a way that tends to favor Head Start, although we empirically reject this model (see Section “Results”).
Bertrand et al. [2004] show that difference-in-difference estimators can overstate statistical significance when outcomes are evaluated for several years before and after an intervention, and there is no correction for serial correlation. However, this concern is not relevant to our analysis.
Appendix Table A2 reports estimates for all coefficients from the models in columns (2), (4) and (6) in Table 5.
Adding these children, net of the other data restrictions, increases our sample size by approximately 12 percent. Also, similarly to the analysis above, our findings from this model are not qualitatively sensitive to relaxing the other data restrictions.
This strategy does not require carefully defined treatment and control groups based on treatment duration, so we were able to use most of the observations from the ECLS-B for children who were in one of the two care types.
Results are available from the authors upon request.
Koedel and Betts [2010] use skewness to measure ceiling effects. The skewness in the distribution of math scores is approximately −0.04, and in literacy it is 0.83.
There is not enough heterogeneity in income, and there are too few observations, to reasonably estimate the parameters in (6) using the Head Start sample.
This test is imperfect, but it was the best we could do with the available data. The primary problem with the test, of course, is that there is likely to be selection into childcare centers within the non-Head Start sector. If lower-income children attend lower-quality centers, our estimates from equation (6) will understate their benefits from childcare inputs.
Data from the ECLS-B suggest that childcare costs at non-Head Start centers are lower than in Head Start, but similar. However, we note that measuring non-Head Start childcare costs is difficult because they are paid by various sources (parents, non-profit groups, government, etc.). Details are available from the authors upon request.
The Impact Study does not report estimates that are statistically insignificant but, of course, these estimates must be included in the comparison. They are available in supplementary tables from the study, obtainable from the Department of Health and Human Services. Also, the language-arts ITT includes one estimate based on parent-reported literacy that is much larger than the other ITT's — this may be related to parents’ perceptions of the benefits of their children winning the lottery, rather than actual program benefits. Excluding the parent-reported outcome, the average language-arts ITT is 0.148.
Taking our finding that Head Start and non-Head Start centers have similar effects at face value, an argument could be made that the Impact Study scaling factors should be adjusted up to roughly 2.1 (some of the treatments and controls attend non-Head Start centers, which can be viewed as equivalent to attending Head Start). We have some reservations about this adjustment, however, because as noted in the text, our findings are relevant for the entire center-care sector, and the control children in the Impact Study likely attended inferior centers. But, in the spirit of these rough comparisons, it is worth noting that a scaling factor of 2.1 could be viewed as an upper bound.
References
Barnett, W. Steven. 1992. Benefits of Compensatory Preschool Education. Journal of Human Resources, 27 (2): 279–312.
Behrman, Jere R., Yingmei Cheng, and Petra E. Todd. 2004. Evaluating Preschool Programs when the Length of Exposure to the Program Varies: A Non-parametric Approach. The Review of Economics and Statistics, 86 (1): 108–132.
Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. 2004. How Much Should We Trust Differences-in-differences Estimates? Quarterly Journal of Economics, 119 (1): 249–275.
Besharov, Douglas J. 2005. Head Start's Broken Promise, American Enterprise Institute, On the Issues (October).
Bhatt, Rachana, and Cory Koedel. 2010. A Non-experimental Evaluation of Curricular Effectiveness in Math, Working Paper 09–13, University of Missouri.
Coulson, Andrew J. 2010. Head Start: A Tragic Waste of Money. New York Post, 7 (January 28).
Currie, Janet. 2001. Early Childhood Intervention Programs: What Do We Know? Journal of Economic Perspectives, 15 (2): 213–238.
Currie, Janet, and Duncan Thomas. 1995. Does Head Start Make a Difference? American Economic Review, 85 (3): 341–364.
Currie, Janet, and Duncan Thomas. 2000. School Quality and the Longer Term Effects of Head Start. Journal of Human Resources, 35 (4): 755–774.
Deming, David. 2009. Early Childhood Intervention and Life-cycle Skill Development: Evidence from Head Start. American Economic Journal: Applied Economics, 1 (3): 111–134.
Fitzpatrick, Maria D., David Grissmer, and Sarah Hastedt. 2011. What a Difference a Day Makes: Estimating Daily Learning Gains During Kindergarten and First Grade Using a Natural Experiment. Economics of Education Review, 30 (2): 269–279.
Frisvold, David E., and Julie C. Lumeng. 2011. Expanding Exposure: Can Increasing the Daily Duration of Head Start Reduce Childhood Obesity? Journal of Human Resources, 46 (2): 373–402.
Garces, Eliana, Duncan Thomas, and Janet Currie. 2002. Longer-term Effects of Head Start. American Economic Review, 92 (4): 999–1012.
Gormley, William T., Deborah Phillips, Shirley Adelstein, and Catherine Shaw. 2010. Head Start's Comparative Advantage: Myth or Reality. Policy Studies Journal, 38 (3): 397–418.
Gormley, William T., and Ted Gayer. 2005. An Evaluation of Tulsa's Pre-K Program. Journal of Human Resources, 40 (3): 533–558.
Hebbeler, Kathleen. 1985. An Old and a New Question on the Effects of Early Education for Children from Low Income Families. Educational Evaluation and Policy Analysis, 7 (3): 207–216.
Jacob, Brian, Lars Lefgren, and David Sims. 2008. The Persistence of Teacher-induced Learning Gains, NBER Working Paper No. 14065.
Koedel, Cory, and Julian R. Betts. 2010. Value-added to What? How a Ceiling in the Testing Instrument Influences Value-added Estimation. Education Finance and Policy, 5 (1): 54–81.
Lee, Valerie E., and Susanna Loeb. 1995. Where do Head Start Attendees End Up? One Reason Why Preschool Effects Fade Out. Educational Evaluation and Policy Analysis, 17 (1): 62–82.
Lee, Valerie E., Jeanne Brooks-Gunn, Elizabeth Schnur, and Fong-Ruey Liaw. 1990. Are Head Start Effects Sustained? A Longitudinal Follow-up Comparison of Disadvantaged Children Attending Head Start, No Preschool, and Other Preschool Programs. Child Development, 61 (2): 495–507.
Lemke, Robert J., Robert J. Witt, and Ann Dryden Witte. 2007. The Transition from Welfare to Work. Eastern Economic Journal, 33: 359–373.
Ludwig, Jens, and Deborah A. Phillips. 2007. The Benefits and Costs of Head Start, NBER Working Paper 12973.
Ludwig, Jens, and Douglas Miller. 2007. Does Head Start Improve Children's Life Chances? Evidence from a Regression Discontinuity Design. Quarterly Journal of Economics, 122 (1): 159–208.
Resnick, Gary, and Nicholas Zill. Undated. Is Head Start Providing High-quality Educational Services? Unpacking Classroom Processes. Unpublished report, Westat, Inc.
Rose, Heather, and Julian R. Betts. 2004. The Effect of High School Courses on Earnings. Review of Economics and Statistics, 86 (2): 497–512.
Rosenbaum, Paul R., and Donald B. Rubin. 1985. The Bias Due to Incomplete Matching. Biometrics, 41 (1): 103–116.
US Department of Health and Human Services. 2010. Administration for Children and Families, Head Start Impact Study. Final Report, Washington, DC.
Wetzstein, Cheryl. 2010. Is Head Start a “Sacred Cow”? The Washington Times (March 30).
Whitehurst, Grover J. 2010. Is Head Start Working for American Students? Brookings Institution Up Front Blog Entry (January 21).
Acknowledgements
We thank Emek Basker, Daphna Bassok, and Jeff Milyo for useful comments and suggestions, and the Economic & Policy Analysis Research Center at the University of Missouri for research support.
Appendices
APPENDIX A
APPENDIX B
Comparison of our findings to the Impact Study findings
We briefly compare our Head Start-specific estimates from Table 4 to analogous experimental estimates from the Impact Study. Unfortunately, there are several important differences between the studies that limit inference from this comparative exercise. We highlight four issues prior to performing the comparison:
1. The duration of treatment in the Impact Study may have been somewhat longer, on average, than in our preferred estimation sample, although any differences are likely to be small. The Impact Study compares children's fall to spring test scores. The report indicates that the fall data collection began in October and was “mostly” complete by mid-November, but we were unable to find documentation of the spring data-collection timeframe. Assuming that the spring data collection occurred around May of the following year, the average treatment duration would have been roughly 6 months for children who were fall-tested in mid-November in the Impact Study. Therefore, we expect the treatment durations across studies to be quite similar.
2. The childcare arrangements for control children in the Impact Study differ from those of the control children in our Head Start-specific analysis. Our controls are selected based on not receiving any care, while many of the controls from the Impact Study attended some alternative care. Also, the controls from the Impact Study who did not seek out any form of alternative care (roughly 38 percent) differ from the controls in our study in that they initially sought to be placed in Head Start, while our controls may not have. The implications of these differences for the comparison of the estimates are not immediately clear.
3. Testing instrument issues: the testing instruments are different across studies, and the Impact Study reports findings from numerous instruments. There are surely differences in the content of the exams. Also, while both studies report estimates in standard deviations of the tests, facilitating some comparison, we standardize scores across the entire age-4 sample in the Early Childhood Longitudinal Survey, Birth Cohort (ECLS-B) while the Impact Study uses only Head Start eligible children (of course). The variance of test scores among Head Start eligible children is smaller than the variance of scores for all children in our data. For the purposes of this comparative exercise, we re-standardize children's outcomes in our analysis using only children from the ECLS-B data who attended some Head Start (N=1,482, see Table 1). This scales up our math and reading estimates by roughly 10 and 20 percent, respectively.
4. The Impact Study estimates intention-to-treat effects (ITT), while we estimate treatment effects. Under fairly modest assumptions the ITT effects can be scaled to be comparable to our estimates, although the scaling likely adds some noise to the comparison.
To make the Impact Study results comparable to our own, we begin by taking the within-subject averages of the ITT effects from the Impact Study in math and language arts. These averages serve as rough summary measures of the Impact Study findings by subject, which can be compared to our results. Across assessments, the average ITT effect in math for the age-3 cohort is 0.105. For language arts, the across-assessment average is 0.168 (the age-3 cohort is the correct comparison cohort, although the comparison is very similar if we use the age-4 cohort from the Impact Study instead).Footnote 36 The appropriate scaling factor to convert the ITT's to treatment effects is roughly 1.5 (as noted by Ludwig and Phillips [2007], and verified here). Therefore, the Impact Study estimates that are most comparable to our estimates are 0.159 and 0.252 for math and language arts (which we label as literacy), respectively.Footnote 37
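The scaling arithmetic can be checked directly. Note that the factor is only "roughly 1.5," so the math figure differs slightly from the 0.159 reported above:

```python
# ITT averages from the Impact Study (age-3 cohort) and the rough
# scaling factor discussed in the text.
itt_math, itt_lang = 0.105, 0.168
scale = 1.5  # "roughly 1.5"; the reported 0.159 implies a slightly larger factor

te_math = itt_math * scale   # approximately 0.158, vs. 0.159 reported
te_lang = itt_lang * scale   # 0.252, matching the reported figure
```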
Table B1 compares our estimates to these rough summary measures of the Impact Study findings. We do not attempt to estimate the noise in the summary measures from the Impact Study, but these measures are based on estimates that are themselves quite noisy (see the Impact Study's Main Tables supplement). The table shows that while the Impact Study estimates are nominally larger than our program-specific estimates in both subjects, particularly in reading, one set of estimates does not rule out the other.
Despite inference from this comparative exercise being limited by several key differences between the studies, we do not uncover a conflict between the experimental Impact Study findings and our Head Start-specific estimates. This is consistent with our analysis in the text, which uncovers little evidence to suggest that our findings are driven by selection bias. Finally, we also note that this across-study comparison does not take into account the difference-in-difference aspect of our empirical strategy — that is, even if there is some bias in our Head Start-specific estimates in Table 4, it may be partly mitigated when we compare the Head Start estimates to the non-Head Start estimates.
Koedel, C., Techapaisarnjaroenkit, T. The Relative Performance of Head Start. Eastern Econ J 38, 251–275 (2012). https://doi.org/10.1057/eej.2011.9