How does diff-in-diff work for multiple datapoints?

Question

Most textbooks and blog posts like to illustrate diff-in-diff using 4 datapoints. It's almost always an example with two cities and an intervention in one of them, and the effect is then calculated as:

Effect = (Treated after - Treated before) - (Control after - Control before)

I think it's intuitive: 'Treated after' is the treated city after the intervention, and 'Treated before' is that same city before the intervention. So let's say we want to evaluate the effect of a policy on the murder rate. Let's say the city with the new policy started with 10 deaths, and after the policy was implemented the deaths decreased to 5. The city with no intervention started with 15 and ended the same period with 10 deaths. The calculation would be something like this:

Effect = (5-10)-(10-15) = (-5)-(-5) = 0

The policy didn't have any effect. I know we usually control for other covariates and run a regression, but let's not consider this at first. Now, let's say we have not 2 cities but 10, where 3 had an intervention and 7 didn't. How would the calculation work? I mean, we can't do the same thing because we have more cities in the control group. Would we take the mean of the change in deaths? What if we had some seasonality? Do we account for this by including something in a regression? I'm kind of confused about how to go from the examples with 4 datapoints to lots of datapoints.

If you want to do this by hand, then yes, you should look at the group averages. But once we obtain more groups and more time periods, then I see no reason to work this out manually. Once we move beyond the two-by-two case, then you should run this in software using a regression formulation; it will give you your standard errors for free.
– Thomas Bilach, Commented Mar 23, 2021 at 17:16

Thomas Bilach · Accepted Answer · 2021-03-25 04:06:33Z

In settings with more than two cities, we can still compute the difference-in-differences coefficient. I will show you how it works using a very simple illustration in R, though I highly recommend @DimitriyV.Masterov's answer linked in the comments. Stata's margins command is very robust, and the margins package in R is able to replicate Stata's output.

Let's experiment with a simple illustration in R.

```r
library(dplyr)  # provides tibble(), %>%, group_by(), summarize(), filter(), pull()

set.seed(123)

df <- tibble(
  city  = rep(1:10, each = 2),                  # 10 cities
  time  = rep(c(1, 2), times = 10),             # 2 time periods
  y     = rnorm(20, mean = 1000, sd = 50),      # random outcome
  treat = ifelse(city %in% c(8, 9, 10), 1, 0),  # treatment group (3 cities)
  post  = ifelse(time == 2, 1, 0)               # post-treatment (time 2)
)

df_grouped <- df %>%
  group_by(treat, post) %>%  # group by the treatment dummy AND the pre- versus post-treatment indicator
  summarize(outcome = mean(y), .groups = "drop")  # calculate the four group means

# Manually extract the means
pre_treatment  <- df_grouped %>% filter(post == 0, treat == 1) %>% pull(outcome)
post_treatment <- df_grouped %>% filter(post == 1, treat == 1) %>% pull(outcome)
pre_control    <- df_grouped %>% filter(post == 0, treat == 0) %>% pull(outcome)
post_control   <- df_grouped %>% filter(post == 1, treat == 0) %>% pull(outcome)

# Calculate the difference-in-differences estimate manually
(dd <- (post_treatment - pre_treatment) - (post_control - pre_control))
#> [1] -5.802908
```

This approach is a bit tedious but it illustrates the point. Now let's estimate the interaction model in lm() which produces the difference-in-differences coefficient.

```r
# Pull out the coefficients (see 'treat:post')
lm(y ~ treat*post, data = df)$coefficients
#> (Intercept)       treat        post  treat:post
#> 1018.045979   -7.323225  -15.794770   -5.802908
```
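To see why the `treat:post` coefficient coincides with the manual double difference, it may help to write the interaction model out. The notation here ($\beta_0,\dots,\beta_3$ for the coefficients, $\bar{y}_{g,t}$ for the mean outcome of group $g \in \{T, C\}$ in period $t$) is introduced for illustration only. Because the model has one parameter per group-period cell, its fitted values are exactly the four cell means:

$$
\begin{aligned}
y_{it} &= \beta_0 + \beta_1\,\mathrm{treat}_i + \beta_2\,\mathrm{post}_t + \beta_3\,(\mathrm{treat}_i \times \mathrm{post}_t) + \varepsilon_{it} \\
\bar{y}_{C,\mathrm{pre}} &= \hat\beta_0 \\
\bar{y}_{C,\mathrm{post}} &= \hat\beta_0 + \hat\beta_2 \\
\bar{y}_{T,\mathrm{pre}} &= \hat\beta_0 + \hat\beta_1 \\
\bar{y}_{T,\mathrm{post}} &= \hat\beta_0 + \hat\beta_1 + \hat\beta_2 + \hat\beta_3 \\
(\bar{y}_{T,\mathrm{post}} - \bar{y}_{T,\mathrm{pre}}) - (\bar{y}_{C,\mathrm{post}} - \bar{y}_{C,\mathrm{pre}}) &= (\hat\beta_2 + \hat\beta_3) - \hat\beta_2 = \hat\beta_3
\end{aligned}
$$

Unequal group sizes (3 treated versus 7 control cities) change nothing in this argument: each mean is simply taken over however many cities the group contains.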

The difference-in-differences estimate is $-5.803$, which is equivalent to a 'double difference' using the group means. Note that we're only comparing group averages across cities here. Once you start adjusting for other covariates, then I recommend using the margins() function from the margins package. It will allow you to calculate the marginal effects at specified values.

```r
library(margins)

model <- lm(y ~ treat*post, data = df)
margins(model, at = list(post = 0:1, treat = 0:1))
#> Average marginal effects at specified values
#> lm(formula = y ~ treat * post, data = df)
#>
#>  at(post) at(treat)   treat   post
#>         0         0  -7.323 -15.79
#>         1         0 -13.126 -15.79
#>         0         1  -7.323 -21.60
#>         1         1 -13.126 -21.60
```
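The question also asks about seasonality and more time periods. One common way to handle this, sketched here under assumptions that are not part of the original example, is to add period dummies so that anything shared by all cities in a given period is absorbed. The monthly panel, the sinusoidal seasonal pattern, the treatment start in month 7, and the assumed effect of -5 are all hypothetical and simulated purely for illustration:

```r
# Hypothetical panel: 10 cities observed for 12 months; cities 8-10 are
# treated from month 7 onward. All numbers are simulated, not real data.
set.seed(42)
panel <- expand.grid(month = 1:12, city = 1:10)
panel$treat <- as.integer(panel$city %in% 8:10)   # treatment group (3 cities)
panel$post  <- as.integer(panel$month >= 7)       # post-treatment months

season      <- 20 * sin(2 * pi * panel$month / 12)  # seasonal swing shared by all cities
true_effect <- -5                                   # assumed treatment effect
panel$y <- 1000 + 10 * panel$treat + season +
  true_effect * panel$treat * panel$post +
  rnorm(nrow(panel), sd = 2)

# Month dummies absorb anything common to all cities in a given month,
# seasonality included. The 'post' main effect is left out because it is
# collinear with the month dummies.
fit <- lm(y ~ treat + factor(month) + treat:post, data = panel)
did <- coef(fit)[["treat:post"]]
did
```

In a balanced panel like this one, the `treat:post` coefficient still equals the double difference of the four group means; the month dummies simply soak up the common seasonal pattern. With real data you would probably also want standard errors that account for repeated observations on the same city.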
