Back to data science. What is an excellent use case for generative AI models? Creating synthetic data for causal identification.
Synthetic controls are already popular, but there is another use case: realistically scaling a small dataset up to a larger one. Say we have a limited dataset (a new product launch, a new or niche market, or early startup data) and need to expand it for analysis while preserving its properties. Training a generative model on that seed data is a sensible way to do it.
In a new Causal Book chapter, we use a Gaussian Copula synthesizer to generate data for a regression discontinuity design. We start with a small seed dataset with a causal jump around the cutoff (ideally, that seed data is a real dataset). The challenge? Generative models are "smoothers" by nature; they tend to blur discontinuities.
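To make the setup concrete, here is a minimal sketch of what such a seed dataset might look like. This is simulated rather than real data, and the cutoff, slope, and jump size are illustrative assumptions, not values from the chapter:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n, cutoff, tau = 200, 0.0, 2.0  # small sample, cutoff, jump size (all assumed)

running = rng.uniform(-1, 1, n)            # running variable around the cutoff
treated = (running >= cutoff).astype(int)  # sharp RD: treatment flips at the cutoff
outcome = 1.0 + 1.5 * running + tau * treated + rng.normal(0, 0.5, n)

seed = pd.DataFrame({"running": running, "treated": treated, "outcome": outcome})
```

Fit a copula synthesizer directly on `seed` and the jump of `tau` at the cutoff is exactly what tends to get smoothed away.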
Our fix was to move the jump outside the black box in a semi-parametric design: we trained the model on residualized data, effectively teaching the model everything except the causal break, and then re-introduced the break during reconstruction:
- Residualization: We strip the causal effect pre-training. By subtracting the deterministic part from the outcome, we isolate a residual without the jump.
- Training: We train the Gaussian Copula on the residualized data. This allows the generative model to capture the correlation structure between covariates without getting derailed by the discontinuity.
- Reconstruction: We don’t ask the model to generate the outcome directly. Instead, we reconstruct it by applying the functional form (the "ground truth," which we happen to know in this special case) to the synthetic covariates and the generated residual.
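The three steps above can be sketched end to end. The chapter uses SDV's `GaussianCopulaSynthesizer`; to keep this sketch self-contained, we substitute a tiny hand-rolled Gaussian copula, and all numbers (cutoff, jump size, functional form) are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

CUTOFF, TAU = 0.0, 2.0  # cutoff and jump size: illustrative assumptions

def f(running, treated):
    """Deterministic part of the outcome, including the causal jump TAU."""
    return 1.0 + 1.5 * running + TAU * treated

class TinyGaussianCopula:
    """Minimal stand-in for SDV's GaussianCopulaSynthesizer (illustration only)."""

    def fit(self, df):
        self.df = df.reset_index(drop=True)
        n = len(self.df)
        # Empirical CDF -> standard-normal scores, column by column
        z = norm.ppf((self.df.rank(method="average") / (n + 1)).to_numpy())
        self.corr = np.corrcoef(z, rowvar=False)

    def sample(self, num_rows, seed=0):
        rng = np.random.default_rng(seed)
        z = rng.multivariate_normal(np.zeros(self.corr.shape[0]), self.corr, num_rows)
        u = norm.cdf(z)
        # Map uniforms back through each column's empirical quantiles
        return pd.DataFrame(
            {c: np.quantile(self.df[c], u[:, i]) for i, c in enumerate(self.df.columns)}
        )

# Simulated seed data with a jump at the cutoff (stands in for a real dataset)
rng = np.random.default_rng(1)
running = rng.uniform(-1, 1, 300)
treated = (running >= CUTOFF).astype(int)
outcome = f(running, treated) + rng.normal(0, 0.5, 300)

# 1) Residualization: subtract the deterministic part, removing the jump
residual = outcome - f(running, treated)

# 2) Training: the copula learns the joint structure of covariates + residual
copula = TinyGaussianCopula()
copula.fit(pd.DataFrame({"running": running, "residual": residual}))

# 3) Reconstruction: re-apply the functional form to the synthetic draws
syn = copula.sample(2000)
syn["treated"] = (syn["running"] >= CUTOFF).astype(int)
syn["outcome"] = f(syn["running"], syn["treated"]) + syn["residual"]
```

With SDV, the `TinyGaussianCopula` piece would be replaced by a `GaussianCopulaSynthesizer` fit on the residualized table; the residualize-then-reconstruct wrapping stays the same.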
This process forces the treatment effect back into the otherwise smoothed-out data, preserving the causal structure we needed. We used the Synthetic Data Vault library. You can find the Python code and Colab notebook in Chapter 2 of the Causal Book.
Our use case here was specialized, but the same approach suggests other applications. We are working on a Data Duets article to discuss the broader business implications.