- David McKenzie

In ancient Greece, important decisions were never made without consulting the high priestess at the Oracle of Delphi. She would deliver wisdom from the gods, although this advice was sometimes vague or confusing, and was often misinterpreted by mortals. Today I bring word that the high priestess and priests (Athey, Abadie, Imbens and Wooldridge) have delivered new wisdom from the god of econometrics on the important decision of **when should you cluster standard errors?** This is definitely one of life's most important questions, as any keen player of seminar bingo can surely attest. In case their paper is all Greek to you (half of it literally is), I will attempt to summarize their recommendations, so that your standard errors may be heavenly.

The authors argue that there are two reasons for clustering standard errors: a *sampling design* reason, which arises because you have sampled data from a population using clustered sampling and want to say something about the broader population; and an *experimental design* reason, where the assignment mechanism for some causal treatment of interest is clustered. Let me go through each in turn, by way of examples, and end with some of their takeaways.

**The sampling design rationale for clustering**

Consider running a simple Mincer earnings regression of the form:

Log(wages) = a + b*years of schooling + c*experience + d*experience^2 + e

You present this model and are deciding whether to cluster the standard errors. Referee 1 tells you, "the wage residual is likely to be correlated within local labor markets, so you should cluster your standard errors by state or village." But referee 2 argues, "the wage residual is likely to be correlated for people working in the same industry, so you should cluster your standard errors by industry", and referee 3 argues that "the wage residual is likely to be correlated by age cohort, so you should cluster your standard errors by cohort". What should you do?

You could try estimating your model using these three different clustering approaches and see what difference it makes.

*Their advice:* Whether or not clustering makes a difference to the standard errors should not be the basis for deciding whether or not to cluster. They note there is a misconception that if clustering matters, one should cluster.

Instead, under the sampling perspective, what matters for clustering is **how the sample was selected** and whether there are clusters in the population of interest that are not represented in the sample. So we can imagine different scenarios here:

- You want to say something about the association between schooling and wages in a particular population, and you are using a random sample of workers from that population. Then there is no need to adjust the standard errors for clustering at all, even if clustering would change the standard errors.
- The sample was selected by randomly sampling 100 towns and villages from within the country and then randomly sampling people in each, and your goal is to say something about the return to education in the overall population. Here you should cluster standard errors by village, since there are villages in the population of interest beyond those seen in the sample.
- The same logic makes clear why you generally wouldn't cluster by age cohort (it seems unlikely that we would randomly sample some age cohorts and not others and then try to say something about all ages), and why we would only want to cluster by industry if the sample was drawn by randomly selecting a sample of industries and then sampling individuals from within each.

Even in the second case, Abadie et al. note that the usual robust (Eicker-Huber-White or EHW) standard errors and the clustered standard errors (which they call Liang-Zeger or LZ standard errors) can *both* be correct; it is just that they are correct for different estimands. That is, if you are content to just say something about the particular sample of individuals you have, without trying to generalize to the population, the EHW standard errors are all you need; but if you want to say something about the broader population, the LZ standard errors are necessary.
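To make the two estimators concrete, here is a minimal sketch of how EHW and LZ standard errors can be computed by hand for an OLS regression. Everything in it is an illustrative assumption rather than anything taken from the paper: the function name `ols_with_ses`, the simulated data (100 villages, 30 people each, a village-level shock in both schooling and wages), the stripped-down regression with schooling only, and the choice to omit small-sample corrections.

```python
import numpy as np

def ols_with_ses(X, y, clusters):
    """OLS coefficients with EHW (robust) and LZ (cluster-robust) standard
    errors. No small-sample corrections are applied (a simplification)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    scores = X * resid[:, None]              # row i is x_i * e_i
    # EHW "meat": sum over observations of (x_i e_i)(x_i e_i)'
    meat_ehw = scores.T @ scores
    # LZ "meat": sum the scores within each cluster first, then take outer products
    _, idx = np.unique(clusters, return_inverse=True)
    cluster_scores = np.zeros((idx.max() + 1, X.shape[1]))
    np.add.at(cluster_scores, idx, scores)
    meat_lz = cluster_scores.T @ cluster_scores
    se_ehw = np.sqrt(np.diag(XtX_inv @ meat_ehw @ XtX_inv))
    se_lz = np.sqrt(np.diag(XtX_inv @ meat_lz @ XtX_inv))
    return beta, se_ehw, se_lz

# Hypothetical cluster sample: 100 villages, 30 people sampled in each.
rng = np.random.default_rng(0)
n_villages, n_per = 100, 30
village = np.repeat(np.arange(n_villages), n_per)
# Schooling and the wage residual both contain a village-level component.
school = 8 + rng.normal(0, 2, n_villages)[village] + rng.normal(0, 2, village.size)
e = rng.normal(0, 0.3, n_villages)[village] + rng.normal(0, 0.5, village.size)
log_wage = 1.0 + 0.08 * school + e

X = np.column_stack([np.ones(village.size), school])
beta, se_ehw, se_lz = ols_with_ses(X, log_wage, village)
print(f"return to schooling: {beta[1]:.3f}")
print(f"EHW se: {se_ehw[1]:.4f}   LZ se: {se_lz[1]:.4f}")
```

In this simulated setup the LZ standard error comes out larger than the EHW one, but per the advice above that gap is not what should drive the decision: which number is appropriate depends on whether you want to speak only about these sampled villages or about the wider population of villages.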

Special case: Even when the sample is clustered, the EHW and LZ standard errors are equal when there is no heterogeneity in treatment effects.

Side note 1: This also reminds me of the matching command *nnmatch* by Abadie (with a different et al.), where you can get the narrower SATE standard errors for the sample or the wider PATE standard errors for the population.

Side note 2: This reason is rarely the rationale for clustering in an impact evaluation. But Rosenzweig and Udry's paper on external validity points out that we only observe treatment effects at particular points in *time*, and that if we want to say something more general about how our treatment behaves at other points in time, we need wider standard errors than we would use to say something only about our specific sample. This is closely related to the point here about being very clear about what your estimand is.

**The experimental design rationale for clustering**

The second reason for clustering, which we are probably more familiar with, is when clusters of units, rather than individual units, are assigned to a treatment. Let's take the same setting as above, but assume we have a binary treatment that assigns more schooling to people. So now we have:

Log(wages) = a + b*treatment + e

If treatment is assigned at the individual level, there is no need to cluster (*). There has been much confusion about this, as Chris Blattman explored in two earlier posts on this issue (the fabulously titled "Clusterjerk" and "Clusterjerk, the sequel"), and I still occasionally get referees suggesting I try clustering by industry or something similar in an individually randomized experiment. This Abadie et al. paper is now finally a good reference to explain why this is not necessary.

(*) unless you are using multiple time periods, in which case you will want to cluster by individual, since the unit of randomization is the individual, not the individual-time period.

What if your treatment is assigned at the village level? Then cluster by village. This is also why you want to cluster difference-in-differences at the state level when your source of variation comes from differences across states, and why a "treatment" like being on one side of a border versus the other is problematic (because you only have 2 clusters).
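A small simulation can illustrate why village-level assignment calls for village-level clustering. The sketch below is only an illustration under assumed numbers (40 villages of 25 people, a village-level wage shock, a constant true effect of 0.1, and a hypothetical helper `effect_and_ses` with no small-sample corrections). It holds the population fixed, re-randomizes which villages are treated 500 times, and compares the actual spread of the estimates with the average EHW and LZ standard errors.

```python
import numpy as np

def effect_and_ses(y, d, clusters):
    """OLS of y on a constant and d. Returns the coefficient on d with its
    EHW and LZ standard errors (no small-sample corrections).
    `clusters` must hold integer codes 0..G-1."""
    X = np.column_stack([np.ones(d.size), d])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    scores = X * (y - X @ beta)[:, None]     # row i is x_i * e_i
    meat_ehw = scores.T @ scores
    cluster_scores = np.zeros((clusters.max() + 1, 2))
    np.add.at(cluster_scores, clusters, scores)
    meat_lz = cluster_scores.T @ cluster_scores
    se = lambda meat: np.sqrt((XtX_inv @ meat @ XtX_inv)[1, 1])
    return beta[1], se(meat_ehw), se(meat_lz)

rng = np.random.default_rng(1)
n_villages, n_per = 40, 25
village = np.repeat(np.arange(n_villages), n_per)
# Fixed population: untreated log wages with a shared village-level shock.
y0 = 2.0 + rng.normal(0, 1.0, n_villages)[village] + rng.normal(0, 1.0, village.size)
true_effect = 0.1

results = []
for _ in range(500):
    # Assign exactly half of the villages to treatment, at random.
    treated = rng.permutation(n_villages) < n_villages // 2
    d = treated[village].astype(float)
    y = y0 + true_effect * d
    results.append(effect_and_ses(y, d, village))

est, se_ehw, se_lz = np.array(results).T
sd_est = est.std(ddof=1)                     # actual spread across assignments
mean_se_ehw, mean_se_lz = se_ehw.mean(), se_lz.mean()
print(f"sd of estimates across assignments: {sd_est:.3f}")
print(f"mean EHW se: {mean_se_ehw:.3f}   mean LZ se: {mean_se_lz:.3f}")
```

Under these assumptions the EHW standard errors substantially understate how much the estimate moves from one village-level assignment to the next, while the LZ standard errors track it much more closely.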

**Adding fixed effects**

What if we sample at the city level, but then add city fixed effects to our Mincer regression? Or we randomize at the city level, but add city fixed effects? Do we still need to cluster at the city level?

The authors note that there is a lot of confusion about using clustering with fixed effects. The general rule is that you still need to cluster if either the sampling or the assignment to treatment was clustered. However, the authors show that the cluster adjustment only makes a difference once fixed effects are included if there is heterogeneity in treatment effects.

**How to cluster?**

This is largely a paper about *when* to cluster, not *how* to cluster. There is, of course, a whole separate debate about when to rely on asymptotics, bootstrapping, or randomization inference approaches. They show, using asymptotic approximations, that the standard Liang-Zeger cluster adjustment is generally conservative, and they offer an alternative cluster-adjusted variance estimator that can be used if there is variation in treatment assignment within clusters and you know the fraction of clusters sampled. However, given the concern that asymptotic standard errors may not be conservative enough at the sample sizes used in many experiments, you should be cautious about using such an adjustment with typical sample sizes.

## Authors

### David McKenzie

Lead Economist, Development Research Group, World Bank

Helio

October 16, 2017

a very useful source

Dimitri Masterow

October 17, 2017

This is an excellent summary of this paper. I have a follow-up question about DDD. In Jeff Wooldridge's Econometric Analysis (2nd edition), on page 151, he gives an example of a DDD (difference-in-difference-in-differences) estimator for the two-period case in which state B implements a change in health policy aimed at the elderly. If I want to extrapolate what this would mean for older people in other states, how should I cluster?

Munez

October 18, 2017

This is a fantastic blog. (When to) Cluster made easy. Thank you!

max

February 13, 2018

This is incredibly useful - and the first paragraph is a work of art! Many Thanks.

Dilhan

March 17, 2018

Thanks for this and all your other posts - they've been a huge help!

BJ

November 21, 2018

Thank you for the very clear summary. It's so rare to find something that non-econometricians can understand.

Inkyu

May 21, 2019

such an intuitive summary!

Reza

May 22, 2019

Such a useful summary. Many Thanks

Chris

December 02, 2019

Thanks for sharing this, Dr. McKenzie. This blog helps me a lot when I am struggling with the clustered standard error in my work.

Imam

December 12, 2019

Hi David, this may be a very stupid question. Do you need to use clustered standard errors when running a regression on census data? Many thanks

David McKenzie

December 12, 2019

Hello Imam,

It depends on what kind of regression equation you are trying to estimate. See this paper by the same co-authors, https://www.nber.org/papers/w20325, which explains how to interpret standard errors in census data. Essentially, if you are trying to estimate a causal effect and the source of the variation is at a clustered level, you still need to cluster. For example, if you are using US Census data on wages and examining state-level minimum wage policies, your hypothetical experiment is one in which treatment varies at the state level, and you would therefore still cluster standard errors by state.

Carlo

May 03, 2021

Dear David,

Thanks for all your explanations, a great service to the community. Following your example, I assume the case you had in mind (in the answer above) was a cross section? Should we still cluster at the state level if this example were in a panel format?

David McKenzie

May 03, 2021

Hello Carlo. The classic "How Much Should We Trust Differences-in-Differences Estimates?" paper (https://economics.mit.edu/files/750) illustrates the need to cluster at the state level rather than the state-period level for panel data. This is because policies are typically not randomly reassigned every period, but are instead correlated over time within a state. If you somehow found yourself in a setting where states randomly chose their policy each year and totally ignored what their policy was in previous years, then clustering at the state-period level would make sense. I can't think of any applications where this would be the case.

Bis

February 23, 2021

Hello David,

Thanks for this summary!

I was wondering what happens when both cases happen at the same time:

I work with micro data where the sample was chosen by randomly selecting clusters (= villages) and then selecting respondents within those clusters. I would therefore cluster standard errors at the village/cluster level.

But I'm doing a diff-in-diff and evaluating a policy that has been implemented at the district level, which would suggest clustering at the district level.

Is there a suggestion how to proceed in such a case?

David McKenzie

February 23, 2021

I would use the sample weights to re-weight the data if you want it to be representative of a broader population and not just the sample you have. But since your question of interest is a causal question about a policy implemented at the district level, cluster at the district level.

Bis

February 24, 2021

Great, thanks for your answer!

Jakob

June 01, 2021

Hello David, thanks for the very nice summary.

You write about clustering at the village level:

"This is also why you want to cluster difference-in-differences at the state level when you have a source of variation that comes from differences across states, and why a 'treatment' like being on one side of a border versus the other is problematic (because you only have 2 clusters)."

Are you implying that you wouldn't cluster at the state level if you only have 2 clusters? I'm doing a policy evaluation and trying to run my regression with only 2 states (one treated, one untreated), and as of now I'm getting really small standard errors.

Lu

August 18, 2021

Very helpful! Thank you for sharing!

Martin

March 14, 2022

Dear David! Thank you for your contribution. I use a DiD on European Social Survey data, in which I build a treatment and control group from (external) country characteristics. However, when I cluster my standard errors by country, my estimates become less significant and the standard errors increase by up to a factor of 40. Can you recommend a good paper on how to interpret such a difference? I have never seen a case where the clustered standard errors turn out like this. Thank you and keep up the interesting blog!

David McKenzie

March 14, 2022

Hello Martin,

This indicates a really high intra-cluster correlation. You could check if this is the case, see e.g. https://blogs.worldbank.org/impactevaluations/tools-of-the-trade-intra-…

Zoey

March 14, 2022

This is such a great summary! Thank you. I have a question about clustering for the DiD model. I'm running a two-period, two-state DiD regression, but I'm using ZIP code-level data. The policy differs at the state level. Should I cluster standard errors at the ZIP code level? Thank you!

David McKenzie

March 14, 2022

Hi Zoey, this is a tricky case, as your policy variation is actually at the state level, which would indicate clustering at that level, but with only 2 states this won't work. You are in a similar situation to the original minimum wage work. See the discussion here https://blogs.worldbank.org/impactevaluations/explaining-why-we-should-… and Roth (2022). Basically, a design-based approach won't work, and you have to make model assumptions that there are no state-level shocks apart from the policy change, and just be clear about what you are assuming.