INTRODUCTION

Head Start is a federal matching grant program that provides disadvantaged children with access to pre-kindergarten education. The program began as part of the War on Poverty in the 1960s, and today serves roughly 900,000 children each year at a cost of $6.9 billion.

There is a large and robust literature that evaluates the effects of Head Start on program participants. A recent and particularly influential study is the Head Start Impact Study [US Department of Health and Human Services 2010], which uses an experimental design and a nationally representative sample of children to evaluate the program. The Impact Study shows that Head Start has positive immediate-term effects, but these effects have been described as “disappointingly small” [Besharov 2005].Footnote 1 It also shows that the effects of Head Start fade out quickly.Footnote 2 The Impact Study has rekindled an intense policy debate over the future of Head Start, and has led to increased calls for change.

Prior research provides some insight as to why the initially positive effects of Head Start fade out over time — Head Start children go on to attend inferior K–12 schools [Lee et al. 1990; Lee and Loeb 1995; Currie and Thomas 2000].Footnote 3 However, this does not address the commonly voiced concern that even the immediate-term effects of the program are too small. It is often presumed, sometimes implicitly, that Head Start's small program impacts are attributable to the poor performance of the program itself. Taking this presumption as a point of departure, one logical course of action would be to intervene to improve the program. However, it is also possible that Head Start is performing well, but that the program's margin for effect is smaller than is typically thought. These two explanations are substantively different, and they have different policy implications. For example, if Head Start's small program impacts are an artifact of the nature of the intervention itself, then adjustments to the program are less likely to meaningfully improve children's outcomes.

It is difficult to know what types of program impacts are reasonable to expect from Head Start because we have limited evidence on the effects of preschool interventions in general, particularly large-scale interventions.Footnote 4 In our study, we use the performance of non-Head Start childcare centers as a benchmark for evaluating the Head Start program. Although there is no other program exactly like Head Start that can be used for comparison, non-Head Start childcare centers offer a similar service, and are a viable, if imperfect, option.

A direct comparison between Head Start and non-Head Start childcare centers is difficult because the programs serve different populations of children. The key to our empirical strategy is that we have data from children who enrolled in Head Start and non-Head Start centers but were treated for only a short time. We use these children to control for selection into each childcare type, and compare their outcomes to outcomes from children who were treated for a much longer duration. This approach produces estimates of program impacts for Head Start and non-Head Start centers that correspond to differences in treatment duration within each program. We compare these estimates in a difference-in-difference framework.

Motivated largely by the Impact Study findings and subsequent policy discussions, we focus our evaluation on how Head Start and non-Head Start centers affect children's immediate-term test scores. We find that, on average, both childcare types perform similarly. Our findings are at odds with the common perception that Head Start is performing poorly, but this perception appears to be driven more by lofty expectations for the program than by the program's actual performance. Implicit in calls for improvements to the delivery of Head Start is the notion that the program is somehow “broken” and must be fixed, but this notion is not supported by the evidence presented here. We additionally note that there is a growing body of research showing that Head Start confers positive and economically meaningful long-term benefits to program participants [e.g., see Garces et al. 2002; Ludwig and Miller 2007; Deming 2009]. Combined with this prior research, our study suggests that the negative reaction to the Impact Study findings may be overblown.Footnote 5

MOTIVATION AND RESEARCH DESIGN

The Head Start literature to date has focused primarily on estimating treatment impacts for treated children. The typical comparison in the literature is between Head Start and a composite of the childcare arrangements available to children who are similar to Head Start attendees, but who do not attend (including non-Head Start, non-relative care; relative care; and parent care). While this type of comparison provides an answer to exactly the right question when we are interested in knowing whether Head Start participants are better off for having participated (a clear first-order issue), it is only partly informative about Head Start's performance in the broader sense. A key issue is that the non-Head Start childcare arrangements available to socioeconomically disadvantaged families are likely to be of lower quality than the childcare arrangements available to other families. This means that from the current literature we only know how well Head Start performs relative to what are likely to be inferior childcare alternatives.

Even so, the Impact Study, like many other studies in the literature, finds that Head Start has small treatment impacts. Prompted by the Impact Study results, Whitehurst [2010] writes: “Head Start isn’t doing the job the families it serves and the nation need. It must be improved. There are many proposals for doing so. Let's try them, test them, and do what works.” Implicit in Whitehurst's argument, and in many others like it, is the notion that Head Start can perform better if only we can figure out how to improve its efficacy. But this is far from certain. In fact, we only know that Head Start seems to marginally improve upon the available alternatives for program participants. An important question remains: How much better could we reasonably expect Head Start to perform? There are very few studies in the literature that attempt to answer this question.

A notable exception is Gormley et al. [2010], who compare Head Start to an early-childhood program in Tulsa, Oklahoma. They find that the Tulsa program generally outperforms Head Start (with the exception that Head Start has larger effects on health outcomes). On the one hand, Gormley et al. [2010] establish that early-childhood interventions can have large effects on children's outcomes, which is important. However, on the other hand, comparability is an issue because the childcare inputs in the Tulsa program appear to be superior to those in Head Start. For example, Gormley and Gayer [2005] report that care providers in the Tulsa program are required to have 4-year college degrees and are paid like public elementary-school teachers.Footnote 6 This raises concerns about the fiscal scalability of such a program.Footnote 7

Instead of comparing Head Start to the composite of alternatives that are available to socioeconomically disadvantaged children, or to a program that will be difficult to bring to scale for fiscal or other reasons, the contribution of the present study is to compare Head Start to the larger non-Head Start, center-based childcare sector. This comparison is appealing because over half of all children in the United States attend some form of non-Head Start, center-based care, making it the most common type of care in the country.Footnote 8 Furthermore, childcare costs in the non-Head Start, center-based childcare sector are more closely aligned with those of Head Start than are the costs of previously studied, higher-quality childcare programs.Footnote 9

One way to view the contribution of our study is as a complement to earlier work, like that of Gormley et al. [2010] and others, which shows that high-quality pre-kindergarten interventions can be quite successful. The prior literature provides insight into what we could expect to see if we were to dramatically scale up the quality of the Head Start program, and our fiscal commitment to its success. At the other end of the spectrum is the question of how well Head Start performs compared to more typical childcare arrangements in the United States — we are not aware of any prior work that attempts to shed light on this question.

Our approach is conceptually straightforward: we estimate average program impacts on children's test scores for Head Start and non-Head Start childcare centers, and compare these estimates using a difference-in-difference framework.Footnote 10 We estimate the effects of the programs using test-score data collected at the time of treatment. Although much of the recent controversy surrounding Head Start is related to the lack of persistent program impacts, we cannot perform our comparison using later-year outcomes because these outcomes are also affected by the different post-treatment experiences of Head Start and non-Head Start children [Lee et al. 1990; Lee and Loeb 1995; Currie and Thomas 1995; 2000]. For our analysis, where interest is in the performance of Head Start itself, these post-treatment experiences are confounding factors.

The key empirical issue that we face is that children and their families are non-randomly selected into childcare treatments. To minimize the influence of selection bias in our study, we rely on a distinctive feature of our data set, the ECLS-B. In the survey, each caregiver for an age-4 child is asked to indicate how long the child has been receiving care. We use caregiver responses to this question to categorize children as either “treatments” or “controls,” within care type, based on treatment duration. Specifically, some children had been enrolled in care for only a few months at the time of the survey, while others had been enrolled for much longer. The children who had been enrolled for only a short time received very little treatment, but they are useful for our analysis because they selected into treatment. We use these “control” children to capture selection effects, and compare their outcomes to outcomes from children who were in care for a longer duration, whom we define as “treated.” Although Head Start children differ markedly from non-Head Start children across virtually every observable measure, the within-program samples of treatments and controls are observationally very similar.Footnote 11

Our research design greatly improves observational comparability in our study, which we show below, but we rely on variation across children in the timing-of-entry into childcare to identify program impacts.Footnote 12 For our within-program estimates of treatment effects to be unbiased, timing-of-entry must be independent of potential outcomes conditional on observable information in each program. That is, for each program:

$$(Y_0, Y_1) \perp T \mid X \qquad (1)$$

In (1), Y_0 and Y_1 are outcomes that correspond to the control and treatment states; T indicates treatment status, which is determined by duration in care; and X is a vector of observable information about children and their families.

The conditional independence assumption will be violated if timing-of-entry is correlated with unobserved factors that are not controlled for by the components of X, and if these factors also influence children's outcomes. Unfortunately, we cannot directly test for the biasing effects of unobservables, but we do use the rich set of observable characteristics available in the ECLS-B data to show that, within the same program, children who differ in treatment duration over the range of durations that we consider are observationally very similar (see below). Additionally, we provide some validation of our empirical approach by comparing the treatment effects that we estimate for Head Start to analogous experimental estimates from the Impact Study. Our findings from this exercise are reported in Appendix B, which shows that our Head Start-specific estimates are similar to the experimental results, particularly in math. Our interpretation of this result is that even if conditional independence is not strictly satisfied in our analysis, any bias generated by its failure is likely to be small.Footnote 13

We also use a difference-in-difference framework to further limit the impact of bias in our estimates. The difference-in-difference models estimate Head Start's relative effect as the difference between the within-program estimates for Head Start and non-Head Start centers. Taking this difference will remove any bias generated by endogenous timing-of-entry decisions that is consistent across both programs. So, for example, if early entrants into both types of care are more disadvantaged, any resulting bias in the program-specific estimates will be reduced in our difference-in-difference models.Footnote 14

DATA

Our data are from the ECLS-B, provided by the National Center for Education Statistics (NCES). The ECLS-B began tracking children at birth and administered follow-up surveys when the children were aged 9 months, 2 years, and 4 years (the survey is ongoing). During each wave of the survey, information is collected from parents and children; for the age-2 and age-4 follow-ups, childcare providers were also surveyed. We use the data file from the age-4 follow-up in our analysis.

The ECLS-B contains detailed information about children and their families. Two of the most important variables in the data are family income and parental-education levels, which are highly correlated with children's outcomes. Also, because the survey began at birth, it provides access to information that is not commonly available, like birth weight. Childcare arrangements, past and present, are reported for each child. The data on childcare arrangements are collected contemporaneously. Therefore, unlike much of the prior work in this area, we do not need to rely on retrospective questionnaires given to parents to assign childcare treatments.

Assessment data are collected from children during each wave of the survey. During the 9-month and 2-year follow-ups, children were given motor-skills and mental-skills tests, and during the age-4 follow-up they were given cognitive tests in math and literacy. We estimate childcare effects on the age-4 math and literacy scores, and condition on the history of children's scores in our models. We standardize all test scores using the universe of test takers in the data.

Our empirical strategy requires that we impose several restrictions on the data set. First, to facilitate a clean comparison across programs, we restrict our analysis to include only children who were reported to be in care, exclusively, either at a Head Start facility or a non-Head Start, center-based care facility at the time of the age-4 survey. That is, we omit all children who received some other form of care, or split time between multiple care arrangements.Footnote 15 Among the children who meet this criterion, we further restrict the sample based on caregiver responses to the time-in-care question. In our primary models we identify children who were reported to be in care for 3 months or less as controls, and children who were reported to be in care for 6−12 months as treatments. In a robustness exercise in Section “Robustness and other issues”, we show that our findings are not sensitive to reasonable adjustments to these treatment and control definitions.Footnote 16

We impose three additional restrictions on the data. The first two restrictions are imposed because the caregiver question about time-in-care is specific to the current center, raising concerns about misclassification error. For example, we may misclassify some treated children as controls, and understate the amount of treatment that some treated children actually received, simply because they changed childcare centers. To minimize these occurrences in the data, we omit all children whose parents reported changing residences between the age-2 and age-4 surveys because we expect that residence changes are likely to correspond to changes in childcare arrangements. We also exclude all children who were reported to attend any non-relative childcare during the age-2 survey. Neither treatments nor controls should have been in care at that time, and if they were, it suggests that a change in childcare arrangements occurred. Finally, we omit children who were reported to be in care for more than 40 hours per week because Head Start centers do not provide care for more than 40 hours per week.Footnote 17 Again, we consider the robustness of our findings to relaxing these restrictions in Section “Robustness and other issues”.

Moving forward, when we reference our preferred data sample, we are referring to the sample where treatments were in care for 6−12 months, controls were in care for 3 months or less, and the above-described data restrictions are in place.
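To fix ideas, the sketch below shows how the sample construction described above might be implemented in pandas. It is purely illustrative: the column names (exclusive_center_care, months_in_care, and so on) are placeholders rather than actual ECLS-B variable names, and the test-score standardization uses the universe of age-4 test takers before any restrictions are imposed, as in our analysis.

    import pandas as pd

    def standardize_scores(all_testtakers: pd.DataFrame,
                           cols=("math_age4", "literacy_age4")) -> pd.DataFrame:
        # Standardize test scores over the universe of age-4 test takers,
        # before any sample restrictions are applied.
        out = all_testtakers.copy()
        for c in cols:
            out[c + "_std"] = (out[c] - out[c].mean()) / out[c].std()
        return out

    def build_preferred_sample(df: pd.DataFrame) -> pd.DataFrame:
        # Keep children who were exclusively in Head Start or non-Head Start
        # center-based care at the time of the age-4 survey.
        df = df[df["exclusive_center_care"] == 1]

        # Treatment/control definitions based on reported time in care:
        # controls were in care for 3 months or less, treatments for 6-12 months.
        control = df["months_in_care"] <= 3
        treated = df["months_in_care"].between(6, 12)
        df = df[control | treated].copy()
        df["treated"] = treated.loc[df.index].astype(int)

        # Additional restrictions: no residence change between the age-2 and
        # age-4 surveys, no non-relative care at age 2, and at most 40 hours
        # of care per week.
        return df[(df["moved_age2_to_age4"] == 0)
                  & (df["nonrelative_care_age2"] == 0)
                  & (df["hours_per_week"] <= 40)]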

Table 1 shows descriptive statistics for the full ECLS-B data file (age-4 survey) and, within childcare type, for program participants, exclusive program participants, and our preferred data samples. We highlight two aspects of the table. First, across programs, Head Start and non-Head Start children differ substantially along virtually every observable dimension. Second, within programs, our preferred data samples differ from the exclusive-care samples in some ways, and are much smaller.Footnote 18 However, given the sharp reduction in sample sizes when we move to our preferred samples, and the number of restrictions we impose on the data, the differences seem generally modest. That is, with a handful of exceptions, our preferred data samples do not seem to be meaningfully different from their respective exclusive-care samples in the full ECLS-B data file.Footnote 19

Table 1 Average characteristics of various subsamples of the data based on the age-4 survey

Next, in Table 2, we split our preferred samples based on treatment status and compare treatments and controls within and across programs. We also further restrict the data to include only children for whom either a math or literacy score is reported during the age-4 survey (or both), which is required for inclusion in our analysis. The first row of the table shows the average difference in treatment duration between treatments and controls. The difference is roughly 6 months for both programs (although it is slightly larger in the non-Head Start sample — by about 5 days, or 0.16 months). The treatment effects that we report below correspond to these differences in treatment duration, which we consider to be equivalent across programs.

Table 2 Average characteristics for treatment and control samples within childcare type, based on the age-4 survey

The second row highlights a large and statistically significant difference in age (in months) between treatments and controls, which is a mechanical artifact of our research design — treatment children, by definition, had been in care longer at the time of the age-4 ECLS-B survey, and this is correlated with age (note that there is no discrepancy in age across programs conditional on treatment status). To ensure that the age differences between treatments and controls do not bias our findings, we include age-in-months indicator variables in all of our models — that is, we estimate within-age treatment effects. We allow the age trend to differ by program type in our difference-in-difference models.Footnote 20

The remainder of Table 2 reports average values for the demographic, socioeconomic, and lagged test-score variables. Within both programs there are some differences between treatments and controls. The differences are more often statistically significant in the non-Head Start sample; however, as we show in Table 3, the nominal differences are slightly larger in the Head Start sample (the differences in the t-tests are driven by differences in sample size). We also note that the within-program differences between treatments and controls are similar across childcare types. For example, treatments from both the Head Start and non-Head Start samples are more likely to be non-White, and generally come from lower-income and less-educated families than controls. This suggests that any bias in our childcare-specific estimates owing to these differences will be reduced when we compare the programs.

Table 3 Standardized differences in observables across subgroups, and distributions of standardized differences under random assignment to treatment

Are the differences that we document in Table 2 large? We answer this question by calculating the standardized differences in observables between three groups: (1) Head Start treatments and non-Head Start treatments; (2) Head Start treatments and Head Start controls; and (3) non-Head Start treatments and non-Head Start controls. We calculate the standardized difference in observable characteristic X_k between groups A and B as:

$$SD_{X_k}^{AB} = \frac{\left|\bar{X}_k^{A} - \bar{X}_k^{B}\right|}{\sqrt{\left(s_{k,A}^{2} + s_{k,B}^{2}\right)/2}} \qquad (2)$$

where the numerator is the absolute difference in the sample means of X_k between the two groups and the denominator is the square root of the average of the two group variances. Equation (2) is motivated by Rosenbaum and Rubin [1985], who suggest using a similar metric to evaluate whether matching methods are effective in producing observationally comparable treatment and control units. Analogously to the matching literature, we expect our treatment and control observations to be similar, in which case the within-program standardized differences should be small.
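For concreteness, a minimal implementation of (2) is sketched below; the covariates argument stands in for the demographic, socioeconomic, and lagged test-score variables in Table 2, and the variable names are placeholders rather than ECLS-B names.

    import numpy as np
    import pandas as pd

    def standardized_difference(a: pd.Series, b: pd.Series) -> float:
        # Equation (2): absolute difference in group means, scaled by the square
        # root of the average of the two group variances.
        pooled_sd = np.sqrt((a.var() + b.var()) / 2.0)
        return abs(a.mean() - b.mean()) / pooled_sd

    def average_standardized_difference(group_a: pd.DataFrame,
                                        group_b: pd.DataFrame,
                                        covariates: list) -> float:
        # Average of the variable-by-variable standardized differences, as
        # reported in Table 3.
        return float(np.mean([standardized_difference(group_a[x], group_b[x])
                              for x in covariates]))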

Table 3 reports standardized differences that are averaged across all of the demographic, socioeconomic and lagged test-score variables in Table 2, by comparison group.Footnote 21 Not surprisingly, the average standardized differences are much smaller in the within-program comparisons than in the across-program comparison. As the treatment and control definitions are chosen independently of the observable information in the data, the improvement in observed comparability generated by our research design suggests that unobserved comparability is also improved. We cannot directly control for unobserved differences between children in our models, meaning that the reduction in these differences implied by Table 3 is particularly important.

It is easy to see that our within-program samples are more comparable than our between-program samples, but whether the within-program standardized differences are small or large is not obvious. To provide some insight, for each comparison we construct the empirical distribution for the average standardized difference under random assignment. Because the standardized difference is an absolute measure, even with random assignment and large (but finite) samples, the center of the distribution will be above zero.

We illustrate our approach for constructing the empirical distributions using the comparison between Head Start treatments and controls. We begin with the 852 children in the full ECLS-B data file who were reported to exclusively attend Head Start (see Table 1). From this sample, we randomly draw 96 observations and calculate the mean and variance of each variable, X_k, and then randomly draw 108 observations and calculate the mean and variance of each variable. The first group corresponds to our primary control sample from Table 2, where N=96, and the second group to our primary treatment sample, where N=108. We calculate the average standardized difference between the two random samples across the X's, and then repeat this procedure 500 times to construct the empirical distribution of the average standardized difference under random sampling. For each comparison, Table 3 reports the average and 95 percent confidence interval of this distribution.Footnote 22,Footnote 23
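The sketch below illustrates this resampling procedure, reusing average_standardized_difference from the previous snippet. We draw the two groups without overlap here, which is one reasonable reading of the procedure; the substantive point is simply that the group sizes match the actual treatment and control samples.

    import numpy as np
    import pandas as pd

    def random_assignment_distribution(exclusive_sample: pd.DataFrame,
                                       covariates: list,
                                       n_control: int = 96,
                                       n_treated: int = 108,
                                       reps: int = 500,
                                       seed: int = 0) -> np.ndarray:
        # Repeatedly split the exclusive-care sample (e.g., the 852 exclusive
        # Head Start children) into random "control" and "treatment" groups of
        # the same sizes as the actual samples, recomputing the average
        # standardized difference each time.
        rng = np.random.default_rng(seed)
        draws = np.empty(reps)
        for r in range(reps):
            idx = rng.permutation(len(exclusive_sample))
            fake_control = exclusive_sample.iloc[idx[:n_control]]
            fake_treated = exclusive_sample.iloc[idx[n_control:n_control + n_treated]]
            draws[r] = average_standardized_difference(fake_control, fake_treated,
                                                       covariates)
        return draws

    # The mean and 2.5th/97.5th percentiles of `draws`, e.g.
    # np.percentile(draws, [2.5, 97.5]), give the benchmark reported in Table 3.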

Table 3 shows that for both programs, the observed standardized difference is comfortably within the 95 percent confidence interval of the random-sampling distribution (although, consistent with Table 2, the standardized difference is larger relative to the empirical distribution in the non-Head Start sample). Row (4) of the table shows that our “difference-in-difference” standardized difference is also well within the 95 percent confidence interval of the random-sampling distribution. Although we lack a true mechanism for random assignment in our study, at least observationally our treatment and control samples look very similar to what we would expect to see if such a mechanism were actually in place.

METHODOLOGY

We estimate program-specific regression models where we only include observations that are identified as either treatments or controls. These models take the following form:

$$Y_i = \alpha + A_i\beta + X_i\gamma + \pi T_i + \varepsilon_i \qquad (3)$$

In (3), Y_i indicates an outcome for child i, A_i is a vector of age-in-months dummies, X_i is a vector of observable information about child i and her family (the child-specific variables in X_i include the demographic, socioeconomic-status and lagged test-score variables shown in Tables 1 and 2), and T_i indicates treatment status, where treated children belong to the group that spent more time in care. π̂ estimates the treatment effect.
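A sketch of how (3) might be estimated, run separately on the Head Start and non-Head Start samples, is below (statsmodels formula interface; variable names are placeholders, and standard-error choices are omitted for brevity).

    import pandas as pd
    import statsmodels.formula.api as smf

    def program_specific_effect(sample: pd.DataFrame, outcome: str,
                                covariates: list) -> float:
        # Equation (3): outcome on age-in-months dummies, the covariates, and
        # the treatment indicator; returns the estimated treatment effect.
        formula = f"{outcome} ~ C(age_months) + " + " + ".join(covariates) + " + treated"
        return smf.ols(formula, data=sample).fit().params["treated"]

    # Usage on the two program-specific samples (hypothetical objects):
    #   pi_hs = program_specific_effect(hs_sample, "math_age4_std", covariates)
    #   pi_center = program_specific_effect(center_sample, "math_age4_std", covariates)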

The identifying variation in the models comes from differences in treatment status among children who were the same age at the time of their assessments (because we include age-fixed effects). This means that we identify the program impacts by comparing children who differ by their timing-of-entry into childcare, as discussed above.Footnote 24 More specifically, we compare relatively late-entering control children to relatively early-entering treatment children. Noting that the ECLS-B assessments were administered mostly in the fall — in September, October, and November — the control children in our analysis primarily entered care in the late summer and early fall immediately preceding the tests. Treatment children entered care at some point during the prior year — some enrolled at the beginning of the year and others enrolled mid-year.Footnote 25

We compare the program-specific estimates from equation (3) using a difference-in-difference framework:

$$Y_i = \alpha + A_i\beta + X_i\gamma + \delta_1 HS_i + (A_i \times HS_i)\delta_2 + (X_i \times HS_i)\delta_3 + \pi T_i + \theta HS_iT_i + \varepsilon_i \qquad (4)$$

In (4), Y_i, A_i, and X_i are defined as above. T_i also maintains its definition from equation (3), but now applies to treated individuals in either program. HS_i is an indicator for Head Start attendance, both treatment and control, and HS_iT_i further indicates Head Start treatment. The covariates and age indicators are interacted with the Head Start indicator to allow them to differentially affect Head Start and non-Head Start children. F-tests reject the null hypothesis that δ2=δ3=0.Footnote 26 If our empirical strategy effectively mitigates selection bias, θ can be interpreted as the average causal effect of Head Start treatment relative to treatment at a non-Head Start childcare center. δ1 is also informative — it indicates how selection into Head Start compares to selection into non-Head Start, center-based care. Noting that HS_iT_i=HS_i×T_i, equation (4) is a basic difference-in-difference model.Footnote 27
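In the same framework, the unrestricted model in (4) can be sketched as below; comparing it to the restricted model with a nested-model F-test is one way to implement the test of δ2=δ3=0 noted above (placeholder variable names, not ECLS-B names).

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    def did_relative_effect(pooled: pd.DataFrame, outcome: str, covariates: list):
        # Equation (4): fully interacted difference-in-difference model. The
        # coefficient on head_start:treated is theta, Head Start's effect
        # relative to non-Head Start centers; the coefficient on head_start
        # captures differential selection (delta_1).
        controls = "C(age_months) + " + " + ".join(covariates)
        unrestricted = smf.ols(
            f"{outcome} ~ ({controls}) * head_start + treated + head_start:treated",
            data=pooled).fit()
        restricted = smf.ols(
            f"{outcome} ~ {controls} + head_start + treated + head_start:treated",
            data=pooled).fit()
        # Nested-model F-test of delta_2 = delta_3 = 0 (no age/covariate
        # interactions with the Head Start indicator).
        f_table = sm.stats.anova_lm(restricted, unrestricted)
        return unrestricted.params["head_start:treated"], f_table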

RESULTS

We estimate program impacts on children's age-4 math and literacy scores, which we standardize using the entire sample of age-4 test takers in the ECLS-B data. Table 4 reports estimates from the childcare-specific models. The first column in each panel shows raw differences in outcomes between treatments and controls, and the second column is covariate adjusted. For non-Head Start centers, treatment is positively associated with outcomes. For Head Start centers, the estimates are nominally positive but statistically indistinguishable from zero.

Table 4 Program-specific estimates

Table 5 reports several different sets of estimates. The first two columns show estimates from OLS regressions that directly compare Head Start and non-Head Start treatments. Column (1) reports unconditional differences in outcomes, and column (2) reports covariate-adjusted differences. Columns (3) and (4) report estimates from difference-in-difference models where we impose the constraint that δ2=δ3=0. Columns (5) and (6) report estimates from difference-in-difference models where we relax this constraint, which is our preferred specification (these estimates are equivalent to subtracting the non-Head Start effects from the Head Start effects in Table 4). The covariates greatly increase predictive power in all of the models, as would be expected. However, in the difference-in-difference models they negligibly affect our estimates, suggesting that the control observations largely capture the effects of selection.

Table 5 Relative effects of Head Start

The unconditional differences in column (1) show that Head Start children perform markedly worse on the cognitive tests. The regression-adjusted estimates in column (2) are near zero, indicating that the differences in column (1) reflect selection. The difference-in-difference estimates are generally small, and none can be statistically distinguished from zero. Overall, Table 5 shows that Head Start centers, on average, perform similarly to non-Head Start centers in terms of raising children's cognitive scores.Footnote 28

It is notable that the estimates from the model in column (2), which we refer to as the regression-adjusted model, are similar to the difference-in-difference estimates. If there is selection into childcare type based on unobservables, the estimates from the regression-adjusted model will be biased by this selection, while in the difference-in-difference models bias from unobserved selection is likely to be reduced. The most reasonable explanation for the similarity in the findings is that selection into childcare is largely captured by the observable information in the data. Although this is somewhat surprising, it is a testament to the exceptional quality of the ECLS-B survey. Ex post, our findings would have been qualitatively similar had we simply assumed selection on observables from the outset. However, we would be less confident in the regression-adjusted estimates without the confirmation provided by the difference-in-difference models.

ROBUSTNESS AND OTHER ISSUES

Robustness of findings

We first evaluate the robustness of our findings to adjustments to the treatment and control definitions. We consider defining treatments as being in care for 7−12 months and for 9–15 months, and compare these treatments with controls who were reported to be in care for 0–2 and 0–3 months (we also compare the 6–12-month treatment group with the new control group). We then return to our preferred treatment and control samples, and relax some of the other data restrictions. Specifically, we include children whose parents reported moving no more than once, and no more than twice, between the age-2 and age-4 surveys; and children who were reported to be in care for more than 40 hours per week. For the models that include movers, we note that treatment misclassification errors will be more common. However, in practice, any bias from the misclassification errors appears to roughly cancel out in the difference-in-difference models (of note, approximately half of the movers are classified as treatments, and half as controls, in each program). Including the movers increases our sample size by over 50 percent.
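The part of this exercise that varies the treatment and control definitions can be summarized as a loop over alternative windows, reusing did_relative_effect from the earlier sketch; the window bounds are months-in-care, and the variable names remain placeholders.

    import pandas as pd

    def robustness_over_windows(exclusive_sample: pd.DataFrame, outcome: str,
                                covariates: list) -> dict:
        # Re-estimate the covariate-adjusted, unrestricted difference-in-difference
        # model under alternative control/treatment definitions.
        windows = [((0, 3), (6, 12)),   # baseline definitions
                   ((0, 2), (6, 12)),
                   ((0, 2), (7, 12)),
                   ((0, 3), (7, 12)),
                   ((0, 2), (9, 15)),
                   ((0, 3), (9, 15))]
        estimates = {}
        for (c_lo, c_hi), (t_lo, t_hi) in windows:
            in_control = exclusive_sample["months_in_care"].between(c_lo, c_hi)
            in_treated = exclusive_sample["months_in_care"].between(t_lo, t_hi)
            sub = exclusive_sample[in_control | in_treated].copy()
            sub["treated"] = in_treated.loc[sub.index].astype(int)
            estimates[((c_lo, c_hi), (t_lo, t_hi))] = did_relative_effect(
                sub, outcome, covariates)[0]
        return estimates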

Our results are reported in Table 6. For brevity, we only report estimates from the covariate-adjusted, unrestricted difference-in-difference models (the first row of the table shows the baseline estimates from Table 5 for comparison). In math, Table 6 provides no evidence to overturn our primary finding that Head Start and non-Head Start centers perform similarly. In literacy, while none of the estimates can be statistically distinguished from zero, it is noteworthy that the point estimates are consistently negative. This raises the possibility that Head Start underperforms in literacy, but the evidence is not strong enough to support this claim.

Table 6 Robustness exercise 1: Difference-in-difference estimates from covariate-adjusted, unrestricted models using alternative definitions of treatments and controls, and relaxing key data restrictions from the primary analysis

Next, we consider an alternative specification, analogous to the model in (4), in which we replace the treatment indicator with a quadratic function of months-in-care for each child:

$$Y_i = \alpha + A_i\beta + X_i\gamma + \delta_1 HS_i + (A_i \times HS_i)\delta_2 + (X_i \times HS_i)\delta_3 + \pi_1 MIC_i + \pi_2 MIC_i^2 + \theta_1(HS_i \times MIC_i) + \theta_2(HS_i \times MIC_i^2) + \varepsilon_i \qquad (5)$$

In (5), MIC_i and MIC_i^2 replace T_i from equation (4), where MIC indicates months-in-care. The coefficients of interest in (5) are θ1 and θ2, which characterize the effects of months-in-care in Head Start relative to months-in-care in non-Head Start centers. A benefit of defining treatment duration as in (5) is that we can include children who were reported to be in care for 4 or 5 months at the time of the age-4 survey — these children were not identified as either treatments or controls in the previous models.Footnote 29
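A corresponding sketch of (5) is below; the Head Start interaction coefficients are recovered by name matching because the exact labels patsy assigns to interaction terms depend on term ordering (variable names are placeholders).

    import pandas as pd
    import statsmodels.formula.api as smf

    def mic_relative_effect(pooled: pd.DataFrame, outcome: str, covariates: list) -> dict:
        # Equation (5): the treatment indicator is replaced by a quadratic in
        # months-in-care; the Head Start interactions give theta_1 and theta_2.
        controls = "C(age_months) + " + " + ".join(covariates)
        fit = smf.ols(
            f"{outcome} ~ ({controls}) * head_start"
            " + months_in_care + I(months_in_care**2)"
            " + head_start:months_in_care + head_start:I(months_in_care**2)",
            data=pooled).fit()
        # Return the Head Start x months-in-care interaction coefficients.
        return {name: est for name, est in fit.params.items()
                if "head_start" in name and "months_in_care" in name}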

Table 7 reports estimates that are analogous to those in Table 5, but based on the new estimation sample and the model in (5). The first column of the table shows the unconditional differences in outcomes between Head Start and non-Head Start children, and the second column shows the regression-adjusted differences. The unconditional differences again show that Head Start children perform considerably worse on the cognitive tests. The regression-adjusted differences in math and literacy are both nominally negative, but statistically insignificant. Moving to the difference-in-difference estimates, the results in Table 7 are qualitatively consistent with those in Table 5 — they do not indicate any differences in performance between Head Start and non-Head Start childcare centers.

Table 7 Robustness exercise 2: Relative effects of months-in-care in Head Start

Finally, using a much larger sample, we also considered an IV strategy where we instrumented for months-in-care using assessment dates from the ECLS-B survey as a source of exogenous variation [following Fitzpatrick et al. 2011].Footnote 30 However, the IV approach was unsuccessful, and we therefore omit the results for brevity.Footnote 31 The IV estimates were clearly too large, which was apparent when we compared the Head Start IV estimates to the Impact Study estimates. The primary problem with the instrument is that we do not have access to a pre-test, and we suspect that we could not adequately separate the test-date variation from variation in age (we tried to deal with this issue in several ways without success). Of interest, however, is that even in the IV models the estimated Head Start and non-Head Start program impacts are very similar, which would occur under two conditions: (1) Head Start and non-Head Start centers have similar impacts; and (2) the biases in the IV estimates are similar for Head Start and non-Head Start children.

Does our analysis favor Head Start?

Our analysis may favor Head Start because of the relative disadvantage of Head Start children. Although we would typically be concerned about the reverse, we cannot preclude this possibility. One of the most likely ways that our results would be biased in favor of Head Start would be if the testing instruments in the ECLS-B had strong ceiling effects. If this were the case, the test scores for the higher-achieving non-Head Start children would be mechanically restricted relative to the Head Start children, which would give Head Start an advantage in our analysis. Following Koedel and Betts [2010], we test for ceiling effects in the ECLS-B testing instruments and find no evidence of a test-score ceiling in either test. In fact, inferring from Koedel and Betts [2010], there are mild floor effects in the literacy test, which would work against Head Start in our comparison (scores for the lowest performers are mechanically inflated). The test-score floor in the literacy test is consistent with our estimates, which show that Head Start has a nominally smaller relative effect in literacy.Footnote 32
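Koedel and Betts [2010] develop a formal procedure for detecting score ceilings, which we follow; the snippet below is not that procedure, only a crude diagnostic of the same idea: checking how much mass piles up at the boundaries of the observed score distribution.

    import pandas as pd

    def share_at_extremes(scores: pd.Series, tol: float = 1e-9) -> tuple:
        # Rough diagnostic only: the share of test takers at the lowest and
        # highest observed scores. Large shares bunched at either boundary are
        # suggestive of floor or ceiling effects, respectively.
        at_floor = float((scores <= scores.min() + tol).mean())
        at_ceiling = float((scores >= scores.max() - tol).mean())
        return at_floor, at_ceiling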

It is also possible that disadvantaged children gain more from childcare relative to non-attendance. For example, it could be that advantaged children gain little from attending childcare relative to staying at home, perhaps because they have more positive home environments. Alternatively, disadvantaged children could benefit more from going to childcare, if for no other reason than that they are not staying at home. If the marginal benefit of childcare attendance over non-attendance is higher for disadvantaged children, this would favor Head Start in our analysis. We use the heterogeneity in family income within the non-Head Start sample to evaluate whether this explanation is likely to be driving our findings (see Table 1).Footnote 33 Specifically, using the income categories reported in Tables 1 and 2, we divide the non-Head Start children into bins and estimate income-specific childcare effects. Separating the income controls from the other X's for illustration, and defining the vector of income indicators for child i by INC_i, we estimate:

$$Y_i = \alpha + A_i\beta + X_i\gamma + INC_i\lambda_1 + \pi T_i + (INC_i \times T_i)\lambda_2 + \varepsilon_i \qquad (6)$$

The estimates of λ2 are of interest in (6), and we report our findings in Table 8. The omitted group is the highest income category. Although the estimates are noisy, they are not consistent with disadvantaged children gaining more from childcare treatment, at least within the non-Head Start sample.Footnote 34
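A sketch of (6) is below, estimated on the non-Head Start sample only; in practice the income factor's reference level would be set to the highest income category, and the variable names are placeholders.

    import pandas as pd
    import statsmodels.formula.api as smf

    def income_specific_effects(non_hs_sample: pd.DataFrame, outcome: str,
                                other_covariates: list) -> dict:
        # Equation (6): income-category dummies interacted with the treatment
        # indicator. The interaction coefficients (lambda_2) give deviations in
        # the treatment effect relative to the omitted income category.
        controls = "C(age_months) + " + " + ".join(other_covariates)
        fit = smf.ols(
            f"{outcome} ~ {controls} + C(income_cat) + treated"
            " + C(income_cat):treated",
            data=non_hs_sample).fit()
        return {name: est for name, est in fit.params.items()
                if "income_cat" in name and "treated" in name}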

Table 8 Income-specific childcare effects for non-Head Start, center care facilities

CONCLUSION

For Head Start and non-Head Start childcare centers, we estimate average program impacts by comparing outcomes from children who received treatment to outcomes from children who selected into treatment but received very little care. We use a difference-in-difference framework to evaluate the relative performance of Head Start. We find that Head Start centers, on average, perform similarly to non-Head Start centers.

Our analysis suggests that the common perception that Head Start is performing poorly is driven more by lofty expectations for the program than by the program's actual performance. Implicit in calls for improvements to the delivery of Head Start is the notion that the program is somehow “broken” and must be fixed, but our analysis does not support this contention. Policymakers and other interested parties may or may not be satisfied with the outcomes generated by Head Start, but speaking comparatively, the program does not appear to be underperforming.

Two qualifications to our study merit attention. First, our findings do not preclude the possibility that Head Start can perform better. For example, Gormley and Gayer [2005] and Gormley et al. [2010] evaluate a high-quality pre-kindergarten program in Tulsa, Oklahoma, and they find that it greatly outperforms Head Start in terms of affecting cognitive achievement. In addition, Currie [2001] discusses several other studies that document large effects for programs that were funded at higher levels than Head Start. However, these previously studied programs are unlikely to be scalable to the level of Head Start without considerable increases in funding, and even then there would be challenges. In this modern era of fiscal constraints, this is an important consideration. A unique contribution of our study is that we compare Head Start to the larger non-Head Start, center-based childcare sector, which provides care to the majority of children in the United States and where childcare costs are more closely aligned with the costs of Head Start.Footnote 35

Second, our study does not address the growing body of evidence showing that Head Start has large effects on children's longer-term outcomes [e.g., Garces et al. 2002; Ludwig and Miller 2007; Deming 2009]. The longer-term benefits of Head Start participation may be sufficient to justify current expenditures even with only small immediate-term effects on test scores and other outcomes [Ludwig and Phillips 2007]. Our findings, in conjunction with those from the literature on Head Start's longer-term impacts, suggest that we should be cautious about putting too much emphasis on the immediate-term impacts of Head Start on test scores. If Head Start is performing within expectations in this regard (which our study suggests should be low), but participants meaningfully benefit in the long term, overreacting to the test-score results from the Impact Study could do more harm than good.