In Difference in Differences, we effectively act as if we knew both the average treated outcome $\frac{\sum_{i=1}^n Y_i(1)}{n}$ and the average no-treatment outcome $\frac{\sum_{i=1}^n Y_i(0)}{n}$ of a first group of units (by assuming its counterfactual trend is parallel to that of a second group of units). We can therefore directly invoke the fact that the sample ATE is an unbiased estimator of the true ATE, and estimate the true ATE by $\frac{\sum_{i=1}^n \left(Y_i(1)-Y_i(0)\right)}{n} = \frac{\sum_{i=1}^n Y_i(1)}{n} - \frac{\sum_{i=1}^n Y_i(0)}{n}$.
This is how I rationalize the Difference in Differences result. The parallel trends assumption spares us the effort of dividing the units into treatment and control groups, and it means there is no selection bias to worry about (the Wikipedia article appears to confirm this).
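To make the imputation explicit in standard two-group, two-period notation (the group/period means $\bar{Y}_{T,\text{pre}}, \bar{Y}_{T,\text{post}}, \bar{Y}_{C,\text{pre}}, \bar{Y}_{C,\text{post}}$ are my own labels, not part of the original argument): under parallel trends, the first group's missing no-treatment mean equals its pre-period mean plus the second group's observed change, so

$$\frac{\sum_{i=1}^n Y_i(1)}{n} - \frac{\sum_{i=1}^n Y_i(0)}{n} \;=\; \bar{Y}_{T,\text{post}} - \big(\bar{Y}_{T,\text{pre}} + \bar{Y}_{C,\text{post}} - \bar{Y}_{C,\text{pre}}\big) \;=\; \big(\bar{Y}_{T,\text{post}} - \bar{Y}_{T,\text{pre}}\big) - \big(\bar{Y}_{C,\text{post}} - \bar{Y}_{C,\text{pre}}\big),$$

which is exactly the difference in differences.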
Later edit: To clarify my reasoning, I drew this graph.
Under the parallel trends assumption, observing that the second group reaches point a tells us that the first group would have reached point c had it not been treated. We thus have everything we need to know about the first group:
- the average (observed) treated outcome: $\frac{\sum_{i=1}^n Y_i(1)}{n} = d$
- the average (assumed) no-treatment outcome: $\frac{\sum_{i=1}^n Y_i(0)}{n} = c$
Therefore, the sample ATE is $d-c$, which is an unbiased estimate of the true ATE. There are no treatment and control groups here, because we never make an assignment: we know (or assume) both average potential outcomes of the first group.
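Here is a minimal simulation sketch of that logic, assuming a simple two-period setup with a shared trend (the group levels, trend, effect size, and variable names below are all hypothetical, chosen only for illustration): we impute point c from the second group's change and check that $d - c$ recovers the effect we planted.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000       # units per group; large so that sampling noise is small
true_ate = 2.0    # the effect the estimator should recover (hypothetical value)

# Group-specific levels plus a COMMON trend: this is the parallel trends assumption.
level_first, level_second = 5.0, 1.0
common_trend = 3.0

# Pre- and post-period outcomes with idiosyncratic noise.
pre_second  = level_second + rng.normal(0, 1, n)
post_second = level_second + common_trend + rng.normal(0, 1, n)
pre_first   = level_first + rng.normal(0, 1, n)
post_first  = level_first + common_trend + true_ate + rng.normal(0, 1, n)

# Point c: the first group's imputed no-treatment mean (its pre-period mean
# shifted by the second group's observed change). Point d: its observed mean.
c = pre_first.mean() + (post_second.mean() - pre_second.mean())
d = post_first.mean()

# The "d - c" view coincides with the usual difference-in-differences formula.
did = (post_first.mean() - pre_first.mean()) - (post_second.mean() - pre_second.mean())
print(f"d - c = {d - c:.3f}, DiD = {did:.3f}, true effect = {true_ate}")
```

Because the two groups share the trend by construction, the imputed counterfactual is valid, and both computations agree with the planted effect up to sampling noise.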
