In this vignette, we want to illustrate the dangers of relying on traditional two-way fixed-effects (TWFE) regression models with leads and lags to conduct causal inference about treatment effect dynamics. In fact, we will show that *even* when treatment effect dynamics are homogeneous across different treatment groups/cohorts, some of these TWFE regression models can be severely biased.

In order to effectively communicate these dangers, we consider a stylized simulation example where covariates do not play any major role. This example draws from several discussions we have had with Andrew Baker. We consider three setups. In the first one all units are eventually treated, and treatment effect dynamics are homogeneous across treatment groups. In the second, we have a group of units that remain untreated at the end of our sample, and treatment effect dynamics are homogeneous across treatment groups. The third one is like the second, but with heterogeneous treatment effect dynamics across treatment groups.

Consider a staggered treatment adoption setup where 40 ‘states’ (\(state = \{1,2,\dots,40\}\)) are randomly assigned into 4 treatment groups depending on the treatment starting year (1986, 1992, 1998, and 2004). We denote the treatment starting period as \(g\) with (\(g \in \{ 1986, 1992, 1998, 2004\}\)). The 1000 units (e.g., ‘counties’ or ‘firms’) \(i\) in the sample are randomly assigned to one of the 40 states. Let \(G_i\) indicates the group/cohort unit \(i\) belongs to, i.e., \(G \subseteq \{ 1986, 1992, 1998, 2004\}\).

The data generating process (DGP) for the outcome \(Y\) we consider is \[Y_{i,t} = (2010-g) + \alpha_i + \alpha_t + \tau_{i,t} + \epsilon_{i,t}\] where \(\alpha_i\) are unit fixed effects drawn from \(\sim N(\mu_{state}, 1)\) with state-specific mean \(\mu_{state} = state/5\), \(\alpha_t\) are time fixed effects (cohort specific parallel time-trends) generated as \[\alpha_t = 0.1 * (t - g) + \epsilon^{time FE}_t, \] with \(\epsilon^{time FE}_t \sim N(0, 1)\), \(\epsilon_{i,t} \sim N\left(0, \left(\frac{1}{2}\right)^2\right)\) is an idiosyncratic error term, and \(\tau_{i,t}\) are the unit-specific treatment effects at time \(t\) generated as \[ \tau_{i,t} = \mu \times (t - g + 1)\times 1\{t \ge g \},\] where \(\mu\) is the the instantaneous treatment effect; we set \(\mu=1\).

From the above DGP, it is evident that the ATT at the time a unit is treated is equal to 1 for all groups, it is equal to 2 one period after treatment started for all groups, etc. Thus, the evolution of the treatment effects is homogeneous across treatment groups. This is an *a priori* strong restriction that should make things “easier” to estimate (but is by no means required!).

Let’s start the discussion by first visualizing how the DGP looks like.

```
# Load libraries and set baseline parameters
library(tidyverse)
library(lfe)
library(fastDummies)
library(ggthemes)
library(did)
theme_set(theme_clean() + theme(plot.background = element_blank()))
#----------------------------------------------------------------------------
iseed = 20201221
nrep <- 100
true_mu <- 1
set.seed(iseed)
#----------------------------------------------------------------------------
## Generate data - treated cohorts consist of 250 obs each, with the treatment effect still = true_mu on average
make_data <- function(nobs = 1000,
nstates = 40) {
# unit fixed effects (unobservd heterogeneity)
unit <- tibble(
unit = 1:nobs,
# generate state
state = sample(1:nstates, nobs, replace = TRUE),
unit_fe = rnorm(nobs, state/5, 1),
# generate instantaneous treatment effect
#mu = rnorm(nobs, true_mu, 0.2)
mu = true_mu
)
# year fixed effects (first part)
year <- tibble(
year = 1980:2010,
year_fe = rnorm(length(year), 0, 1)
)
# Put the states into treatment groups
treat_taus <- tibble(
# sample the states randomly
state = sample(1:nstates, nstates, replace = FALSE),
# place the randomly sampled states into four treatment groups G_g
cohort_year = sort(rep(c(1986, 1992, 1998, 2004), 10))
)
# make main dataset
# full interaction of unit X year
expand_grid(unit = 1:nobs, year = 1980:2010) %>%
left_join(., unit) %>%
left_join(., year) %>%
left_join(., treat_taus) %>%
# make error term and get treatment indicators and treatment effects
# Also get cohort specific trends (modify time FE)
mutate(error = rnorm(nobs*31, 0, 1),
treat = ifelse(year >= cohort_year, 1, 0),
tau = ifelse(treat == 1, mu, 0),
year_fe = year_fe + 0.1*(year - cohort_year)
) %>%
# calculate cumulative treatment effects
group_by(unit) %>%
mutate(tau_cum = cumsum(tau)) %>%
ungroup() %>%
# calculate the dep variable
mutate(dep_var = (2010 - cohort_year) + unit_fe + year_fe + tau_cum + error)
}
#----------------------------------------------------------------------------
# make data
data <- make_data()
# plot
plot1 <- data %>%
ggplot(aes(x = year, y = dep_var, group = unit)) +
geom_line(alpha = 1/8, color = "grey") +
geom_line(data = data %>%
group_by(cohort_year, year) %>%
summarize(dep_var = mean(dep_var)),
aes(x = year, y = dep_var, group = factor(cohort_year),
color = factor(cohort_year)),
size = 2) +
labs(x = "", y = "Value", color = "Treatment group ") +
geom_vline(xintercept = 1986, color = '#E41A1C', size = 2) +
geom_vline(xintercept = 1992, color = '#377EB8', size = 2) +
geom_vline(xintercept = 1998, color = '#4DAF4A', size = 2) +
geom_vline(xintercept = 2004, color = '#984EA3', size = 2) +
scale_color_brewer(palette = 'Set1') +
theme(legend.position = 'bottom',
#legend.title = element_blank(),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)) +
ggtitle("One draw of the DGP with homogeneous effects across cohorts \n and with all groups being eventually treated")+
theme(plot.title = element_text(hjust = 0.5, size=12))
plot1
#ggsave("plot_dgp1.png", plot1, width = 10, height = 5, dpi = 100)
```

The above plot shows the 1,000 individual values of \(Y_{i,t}\), as well as the average by treatment-group (the thicker colorful lines), and the vertical lines show the period when treatment begins for each group. Because treatment effects here grows linearly with time elapsed since treatment started, we can see that the earliest treated units end up with the highest value of Y, and that the difference grows by the gap between treatment years.

Given that we are interested in treatment effect dynamics, we then proceed to consider a classical two-way fixed-effects (TWFE) event study specification \[Y_{i,t} = \alpha_i + \alpha_t + \gamma_k^{-K}~ D_{i,t}^{<-K} + \sum_{k=-K}^{-2} \gamma_k^{lead} D_{i,t}^{k} +\sum_{k=0}^{L} \gamma_k^{lag} D_{i,t}^{k} + \gamma_k^{L+} D_{i,t}^{>L} + \varepsilon_{i,t}\] where \(D_{i,t}^{k}=1\{t-G_i=k\}\) is an “event-study” dummy variable that takes value one if a unit \(i\) is \(k\) periods away from initial treatment at time \(t\) and zero otherwise, \(D_{i,t}^{<-K}=1\{t-G_i<-K\}\) and \(D_{i,t}^{>L}=1\{t-G_i>L\}\) are defined analogously. For instance, \(D_{i,t}^{0}\) is equal to one if the unit \(i\) is first treated at time \(t\), \(D_{i,t}^{1}\) is equal to one if a one period has passed since treatment started (treatment lags), etc. Alternatively we have that \(D_{i,t}^{-2}\) is equal to one if a unit \(i\) will be treated in two periods from \(t\) (treatment leads). In this exercise we set \(K\) and \(L\) to be equal to 5.

Up to today, it is customary to interpret estimates of \(\gamma_k^{lags}\) as ``good’’ measures of the average treatment effect for being exposed to treatment for \(k\) periods, and estimates of \(\gamma_k^{leads}\) as measures of pre-trends. Our first exercise here is to assess if this is OK-ish.

Towards this end, we will a perform Monte Carlo exercise where we generate data according to the DGP described above and estimate all the parameters of the TWFE event-study regression above. We repeat this 100 times (same holds true if we do 1,000) and plot the Monte Carlo mean and 1.96 times the Monte-Carlo standard-deviation of the relative time indicator dummies, as well as the true treatment effect below.

```
# function to run ES DID
# variables we will use
keepvars <- c("`rel_year_-5`", "`rel_year_-4`", "`rel_year_-3`", "`rel_year_-2`",
"rel_year_0", "rel_year_1", "rel_year_2", "rel_year_3", "rel_year_4", "rel_year_5")
run_ES_DiD <- function(...) {
# resimulate the data
data <- make_data()
# make dummy columns
data <- data %>%
# make dummies
mutate(rel_year = year - cohort_year) %>%
dummy_cols(select_columns = "rel_year") %>%
# generate pre and post dummies
mutate(Pre = ifelse(rel_year < -5, 1, 0),
Post = ifelse(rel_year > 5, 1, 0))
# estimate the model
mod <- lfe::felm(dep_var ~ Pre + `rel_year_-5` + `rel_year_-4` + `rel_year_-3` + `rel_year_-2` +
`rel_year_0` + `rel_year_1` + `rel_year_2` + `rel_year_3` + `rel_year_4` +
`rel_year_5` + Post | unit + year | 0 | state, data = data, exactDOF = TRUE)
# grab the obs we need
mod2 <- tibble(
estimate = mod$coefficients,
term1 = rownames(mod$coefficients)
)
es <-
mod2 %>%
filter(term1 %in% keepvars) %>%
mutate(t = c(-5:-2, 0:5)) %>%
select(t, estimate)
es
}
data_classical <- map_dfr(1:nrep, run_ES_DiD)
colors <- c("True Effect" = "red", "Estimated Effect" = "blue")
ES_plot_classical <- data_classical %>%
group_by(t) %>%
summarize(avg = mean(estimate),
sd = sd(estimate),
lower.ci = avg - 1.96*sd,
upper.ci = avg + 1.96*sd) %>%
bind_rows(tibble(t = -1, avg = 0, sd = 0, lower.ci = 0, upper.ci = 0)) %>%
mutate(true_tau = ifelse(t >= 0, (t + 1)* true_mu, 0)) %>%
ggplot(aes(x = t, y = avg)) +
#geom_linerange(aes(ymin = lower.ci, ymax = upper.ci), color = 'darkgrey', size = 2) +
geom_ribbon(aes(ymin = lower.ci, ymax = upper.ci), color = "lightgrey", alpha = 0.2) +
geom_point(color = 'blue', size = 3) +
geom_line(aes(color = 'Estimated Effect'), size = 1) +
geom_line(aes(x = t, y = true_tau, color = 'True Effect'), linetype = "dashed", size = 2) +
geom_hline(yintercept = 0, linetype = "dashed") +
scale_x_continuous(breaks = -5:5) +
labs(x = "Relative Time", y = "Estimate") +
theme(axis.title = element_text(size = 14),
axis.text = element_text(size = 12)) +
ggtitle("TWFE event-study regression with binned end-points")+
scale_color_manual(values = colors) +
theme(plot.title = element_text(hjust = 0.5, size=12),
legend.position = "bottom",
legend.title = element_blank())
ES_plot_classical
#ggsave("es_plot_classical.png", ES_plot_classical, width = 10, height = 5, dpi = 150)
```

The results above show that these TWFE event-study type estimates are *severely biased* for the true treatment effects. Furthermore, using the estimates of coefficient of treatment leads as a way to find evidence of “pre-trends” is very problematic, as illustrated above; see Sun and Abraham (2021) for formal results! Putting it simply, the results above highlight that such TWFE linear regression should not be used to highlight treatment effect dynamics!

Now, you may wonder: What happens if I include all possible leads and lags in the TWFE event study specification, i.e., to set K and L to the maximum allowable in the data, making inclusion of \(D_{i,t}^{<-K}\) and of \(D_{i,t}^{>L}\) unnecessary. In what follows, we do exactly that, though we need to omit one extra period apart from -1 to avoid problems of multicollinearity identified by Borusyak and Jaravel (2018). We drop the most negative time period, which is perhaps the most natural thing to do. In order words, we set \(K=23\) and \(L=24\), and drop the \(D_{i,t}^{<-K}\) and \(D_{i,t}^{>L}\) from the TWFE specification.

The result of this exercise is displayed below. And, as you can see, this still leads to *very biased* estimates for the causal effects of interest!

```
# function to run ES DID
run_ES_DiD_sat <- function(...) {
# resimulate the data
data <- make_data()
# make dummy columns
data <- data %>%
# make relative year indicator
mutate(rel_year = year - cohort_year)
# get the minimum relative year - we need this to reindex
min_year <- min(data$rel_year)
# reindex the relative years
data <- data %>%
mutate(rel_year = rel_year - min_year) %>%
dummy_cols(select_columns = "rel_year")
# make regression formula
indics <- paste("rel_year", (1:max(data$rel_year))[-(-1 - min_year)], sep = "_", collapse = " + ")
keepvars <- paste("rel_year", c(-5:-2, 0:5) - min_year, sep = "_")
formula <- as.formula(paste("dep_var ~", indics, "| unit + year | 0 | state"))
# run mod
mod <- felm(formula, data = data, exactDOF = TRUE)
# grab the obs we need
mod2 <- tibble(
estimate = mod$coefficients,
term1 = rownames(mod$coefficients)
)
es <-
mod2 %>%
filter(term1 %in% keepvars) %>%
mutate(t = c(-5:-2, 0:5)) %>%
select(t, estimate)
es
}
data_sat <- map_dfr(1:nrep, run_ES_DiD_sat)
ES_plot_sat <- data_sat %>%
group_by(t) %>%
summarize(avg = mean(estimate),
sd = sd(estimate),
lower.ci = avg - 1.96*sd,
upper.ci = avg + 1.96*sd) %>%
bind_rows(tibble(t = -1, avg = 0, sd = 0, lower.ci = 0, upper.ci = 0)) %>%
mutate(true_tau = ifelse(t >= 0, (t + 1)* true_mu, 0)) %>%
ggplot(aes(x = t, y = avg)) +
#geom_linerange(aes(ymin = lower.ci, ymax = upper.ci), color = 'darkgrey', size = 2) +
geom_ribbon(aes(ymin = lower.ci, ymax = upper.ci), color = "lightgrey", alpha = 0.2) +
geom_point(color = 'blue', size = 3) +
geom_line(aes(color = 'Estimated Effect'), size = 1) +
geom_line(aes(x = t, y = true_tau, color = 'True Effect'), linetype = "dashed", size = 2) +
geom_hline(yintercept = 0, linetype = "dashed") +
scale_x_continuous(breaks = -5:5) +
labs(x = "Relative Time", y = "Estimate") +
theme(axis.title = element_text(size = 14),
axis.text = element_text(size = 12)) +
ggtitle("TWFE event-study regression with 'all' leads and lags")+
scale_color_manual(values = colors) +
theme(plot.title = element_text(hjust = 0.5, size=12),
legend.position = "bottom",
legend.title = element_blank())
ES_plot_sat
#ggsave("es_plot_sat.png", ES_plot_sat, width = 10, height = 5, dpi = 150)
```

Instead of using variants of the TWFE specification above, we advocate you to use the procedure put forward by Callaway and Sant’Anna (2021) and implemented in our package **did**.

In this specific DGP, as there is no “never-treated” units, we can use the “not-yet-treated” units as a valid comparison group for each cohort. Furthermore, since covariates are not available (and do not play any major role into the analysis), we do not need to worry about the underlying “est_method”.

Below we show the results associated with using the Callaway and Sant’Anna (2021) procedure. As you can see, you get reliable and precise estimates for treatment effect dynamics. This is good news!

```
# function to run ES DID
run_CS <- function(...) {
# resimulate the data
data <- make_data()
mod <- did::att_gt(yname = "dep_var",
tname = "year",
idname = "unit",
gname = "cohort_year",
control_group= "notyettreated",
bstrap = FALSE,
data = data,
print_details = FALSE)
event_std <- did::aggte(mod, type = "dynamic")
att.egt <- event_std$att.egt
names(att.egt) <- event_std$egt
# grab the obs we need
broom::tidy(att.egt) %>%
filter(names %in% -5:5) %>%
mutate(t = -5:5, estimate = x) %>%
select(t, estimate)
}
data_CS <- map_dfr(1:nrep, run_CS)
ES_plot_CS <- data_CS %>%
group_by(t) %>%
summarize(avg = mean(estimate),
sd = sd(estimate),
lower.ci = avg - 1.96*sd,
upper.ci = avg + 1.96*sd) %>%
mutate(true_tau = ifelse(t >= 0, (t + 1)* true_mu, 0)) %>%
ggplot(aes(x = t, y = avg)) +
#geom_linerange(aes(ymin = lower.ci, ymax = upper.ci), color = 'darkgrey', size = 2) +
geom_ribbon(aes(ymin = lower.ci, ymax = upper.ci), color = "lightgrey", alpha = 0.2) +
geom_point(color = 'blue', size = 3) +
geom_line(aes(color = 'Estimated Effect'), size = 1) +
geom_line(aes(x = t, y = true_tau, color = 'True Effect'), linetype = "dashed", size = 2) +
geom_hline(yintercept = 0, linetype = "dashed") +
scale_x_continuous(breaks = -5:5) +
labs(x = "Relative Time", y = "Estimate") +
theme(axis.title = element_text(size = 14),
axis.text = element_text(size = 12)) +
ggtitle("Event-study-parameters estimated using Callaway and Sant'Anna (2021)\nComparison group: Not-yet-treated")+
scale_color_manual(values = colors) +
theme(plot.title = element_text(hjust = 0.5, size=12),
legend.position = "bottom",
legend.title = element_blank())
ES_plot_CS
#ggsave("es_plot_CS.png", ES_plot_CS, width = 10, height = 5, dpi = 150)
```

One of the features of the above DGP is that all units eventually get treated. In some applications, we have that a group of units remain untreated at the end of the sample. We call this group of units the “never-treated” group. To assess if the problems associated with TWFE specification highlighted above persist if such a group of units is available, we modify a little bit the DGP and repeat the same exercise as above.

The DGP we consider is very similar to the one we have already considered. The only difference is that now we suppress treatment for the last cohort. Put it simply, the cohort with \(g=2004\) is now “never-treated”, implying that, in this DGP the unit-specific treatment effects at time \(t\) \(\tau_{i,t}\) are generated as \[ \tau_{i,t} = \mu \times (t - g + 1)\times 1\{t \ge g \}1\{g\not=2004 \},\] where \(\mu\) is the the instantaneous treatment effect; we set \(\mu=1\).

Like before, the ATT at the time a unit is treated is 1, it is equal to 2 one period after treatment started, etc.

Let’s start the discussion by first visualizing how the DGP looks like.

```
## Generate data - treated cohorts consist of 250 obs each, with the treatment effect still = true_mu on average
make_data2 <- function(nobs = 1000,
nstates = 40) {
# unit fixed effects (unobservd heterogeneity)
unit <- tibble(
unit = 1:nobs,
# generate state
state = sample(1:nstates, nobs, replace = TRUE),
unit_fe = rnorm(nobs, state/5, 1),
# generate instantaneous treatment effect
#mu = rnorm(nobs, true_mu, 0.2)
mu = true_mu
)
# year fixed effects (first part)
year <- tibble(
year = 1980:2010,
year_fe = rnorm(length(year), 0, 1)
)
# Put the states into treatment groups
treat_taus <- tibble(
# sample the states randomly
state = sample(1:nstates, nstates, replace = FALSE),
# place the randomly sampled states into 1\{t \ge g \}G_g
cohort_year = sort(rep(c(1986, 1992, 1998, 2004), 10))
)
# make main dataset
# full interaction of unit X year
expand_grid(unit = 1:nobs, year = 1980:2010) %>%
left_join(., unit) %>%
left_join(., year) %>%
left_join(., treat_taus) %>%
# make error term and get treatment indicators and treatment effects
# Also get cohort specific trends (modify time FE)
mutate(error = rnorm(nobs*31, 0, 1),
treat = ifelse((year >= cohort_year)* (cohort_year != 2004), 1, 0),
tau = ifelse(treat == 1, mu, 0),
year_fe = year_fe + 0.1*(year - cohort_year)
) %>%
# calculate cumulative treatment effects
group_by(unit) %>%
mutate(tau_cum = cumsum(tau)) %>%
ungroup() %>%
# calculate the dep variable
mutate(dep_var = (2010 - cohort_year) + unit_fe + year_fe + tau_cum + error) %>%
# Relabel 2004 cohort as never-treated
mutate(cohort_year = ifelse(cohort_year == 2004, Inf, cohort_year))
}
#----------------------------------------------------------------------------
# make data
data <- make_data2()
# plot
plot2 <- data %>%
ggplot(aes(x = year, y = dep_var, group = unit)) +
geom_line(alpha = 1/8, color = "grey") +
geom_line(data = data %>%
group_by(cohort_year, year) %>%
summarize(dep_var = mean(dep_var)),
aes(x = year, y = dep_var, group = factor(cohort_year),
color = factor(cohort_year)),
size = 2) +
labs(x = "", y = "Value", color = "Treatment group ") +
geom_vline(xintercept = 1986, color = '#E41A1C', size = 2) +
geom_vline(xintercept = 1992, color = '#377EB8', size = 2) +
geom_vline(xintercept = 1998, color = '#4DAF4A', size = 2) +
#geom_vline(xintercept = 2004, color = '#984EA3', size = 2) +
scale_color_brewer(palette = 'Set1') +
theme(legend.position = 'bottom',
#legend.title = element_blank(),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)) +
scale_color_manual(labels = c("1986", "1992", "1998", "Never-treated"),
values = c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3"))+
ggtitle("One draw of the DGP with homogeneous effects across cohorts \n and with a never-treated group")+
theme(plot.title = element_text(hjust = 0.5, size=12))
plot2
#ggsave("plot_dgp_never_treated.png",plot2, width = 10, height = 5, dpi = 100)
```