New Data Duets post: Using generative models, well, to generate data

I recently shared an underappreciated use case for generative models in data science: creating high-fidelity tabular datasets (OTA data for regression discontinuity).

The model’s success in data synthesis motivated a question: what are some high-value use cases for data science teams when using generative models to create datasets? This, in turn, led to our latest Data Duets post: “Using generative models, well, to generate data.”

I walk through using the Synthetic Data Vault to scale a small OTA sample while preserving its statistical properties and the causal discontinuity. Duygu Dagli then weighs in on business implications: creating statistical twins to share data with vendors for solution optimization and benchmarking, simulating product recall data, and solving cold start problems in retail.

Ultimately the approach here represents a step toward data centricity: using high-fidelity simulations to dissect and validate the assumptions that drive our models.

Link to the full post

Something big is happening?

The title is from a popular post. It was clearly written to be sensational (which it seems to have achieved), yet it makes some valid points and offers useful advice:

Here’s a simple commitment that will put you ahead of almost everyone: spend one hour a day experimenting with AI. Not passively reading about it. Using it. Every day, try to get it to do something new… something you haven’t tried before, something you’re not sure it can handle. Try a new tool. Give it a harder problem. One hour a day, every day.

While following technological progress is always a good idea, the current pace is truly mind-blowing, so it requires more attention. As someone who has been coding since C# first launched (don’t check the date!) and whose day-to-day is full of markdowns, JSONs, and APIs, even I am finding it difficult to keep up lately.

So, personally, and certainly as an educator, I can’t help but agree with the point about “the cost of not experimenting.” We are moving into a world where daily experimentation with AI is as essential as your morning coffee. And yes, you must drink the coffee.

Source

Is AI killing B2B SaaS?

[Click title for image]

Hard to ignore this question; it’s currently moving financial markets. The first comment in this massive 725-comment Hacker News thread makes a compelling case for why the answer is likely no: enterprise SaaS will survive because management simply does not want to be responsible for the vibe-coded alternative.

As a technologist and a professor who has moderated “buy vs. build” discussions for over a decade, I agree that the death-of-SaaS argument is overblown. At the center of those discussions is the massive gap between building and maintaining, which the argument underestimates. And we may not be emphasizing one critical aspect enough: shifting the liability.

AI is now driving down the cost of the initial build, but the build is only a fraction of the value an enterprise solution provides. SaaS also provides reliability (uptime) and the boring essentials (security compliance, data integrity). Enterprise SaaS owns the “system of record,” and migrating that is not just a (vibe-)coding problem.

Even if all of this is resolved, the liability bottleneck remains; management won’t want to be responsible. Just because you can build it doesn’t mean you should, and for most enterprises, I can see why they won’t, except for thin wrappers.

Statistical Inference: The Big Picture

Most modeling failures are caused by flawed (and often implicit) assumptions.

Statistical pragmatism recognizes that all forms of statistical inference make assumptions, assumptions which can only be tested very crudely (with such things as goodness-of-fit methods) and can almost never be verified. This is not only at the heart of statistical inference, it is also the great wisdom of our field.

This is also what we discuss in the Data Centricity Lab (see datacentricity.org for an overview). We underline the role of assumptions in the modeling process and how they dictate the usefulness of models (and the decisions they support).

This paper defends pragmatism over dogma:

  • Using both frequentist (e.g., p-values, confidence intervals) and Bayesian (e.g., posterior probabilities) tools, depending on the problem.
  • Prioritizing the assumptions that connect models to real-world data rather than debating the “true” nature of probability.
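To make the pragmatic stance concrete, here is a toy example (all numbers are hypothetical) that analyzes the same data through both lenses: a frequentist confidence interval and p-value, and a Bayesian posterior probability. To keep the sketch dependency-free, the Beta posterior is summarized with a normal approximation rather than an exact CDF:

```python
import math

# Toy data: 60 successes in 100 trials (hypothetical numbers).
n, k = 100, 60
p_hat = k / n

def normal_cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Frequentist lens: Wald 95% confidence interval and a two-sided
# p-value against H0: p = 0.5.
se = math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)
z = (p_hat - 0.5) / math.sqrt(0.25 / n)
p_value = 2 * (1 - normal_cdf(abs(z)))

# Bayesian lens: Beta(1, 1) prior -> Beta(61, 41) posterior; a normal
# approximation of that posterior gives P(p > 0.5 | data).
a, b = 1 + k, 1 + n - k
post_mean = a / (a + b)
post_sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
prob_gt_half = 1 - normal_cdf((0.5 - post_mean) / post_sd)

print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f}), p-value: {p_value:.3f}")
print(f"posterior P(p > 0.5): {prob_gt_half:.3f}")
```

Both lenses point the same way here, which is the pragmatist’s point: the choice of tool matters less than whether the assumptions behind either one connect the model to the data at hand.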

One implication is that we should rethink how we frame the relationship between the sample (reality) and the population (hypothetical). We often describe statistical inference as random sampling from a finite population, but that can be misleading. The paper suggests we call the estimand the “theoretical mean” rather than the “population mean.”

Why does it matter? The more we emphasize the role of assumptions, the more modelers question whether the theoretical world aligns well with the real world that produced the data. As we discuss at Data Duets, when assumptions are sidelined, a misconception takes hold: the idea that methodological rigor can substitute for conceptual accuracy. And causal (semi-)parametric solutions are often more sensitive to this misconception than predictive ones (as we further discuss here).

Kass (2011) Paper

Moltbook is not a community

and there is no emergence. It’s yet another simulation. Here’s a reality check.

Community takes trust and authenticity, a shared purpose and identity, and active participation and interaction. These LLM bots have no concept of trust or a shared purpose. Data shows they don’t even truly interact; they just take parallel actions:

tl;dr: agents post a LOT but don’t really talk to each other. 93.5% of comments get zero replies. conversations max out at a depth of 5. at least as of now, moltbook is less “emergent AI society” and more “6,000 bots yelling into the void and repeating themselves” (Holtz)

And emergence requires more than independent entities occupying the same space. Even if the bots truly interacted, emergence takes consistent horizontal influence and downward causation:

One of the emergent properties that a system can have is the power to exert causal influence on the components of that system in a way that is consistent with, but different from, the causal influences that those components exert upon each other. (Newman, 1996)

Bottom line: Moltbook is an exciting experimental simulation for technologists like me, but it is neither a community nor an emergent society. The community elements and causal loops are currently missing: the agents do not adapt their weights or behaviors based on the collective. They are simply generating tokens into a vacuum.

Moltbook – Source 1 (Holtz’s analysis) – Source 2 (Newman, 1996)

[Click title for image] H/t to Ben Lowenstein for the screenshot.

Clawdbot craze explainer

If you’ve heard about Clawdbot but don’t quite understand the hype, this article will help:

  • An agent project named Clawdbot, released by developer Peter Steinberger, quickly gained viral traction (over 60,000 GitHub stars in just a couple of days).
  • The frenzy was so intense that it reportedly caused a surge in Mac Mini sales and drove Cloudflare stock up over 20% because the project used Cloudflare Workers (a ~$12 billion increase in market cap).
  • Following a trademark request from Anthropic (the name was too close to Claude AI), the project was renamed Moltbot.
  • During the 10-second window while the developer was changing the name on GitHub, crypto scammers hijacked the old name.
  • The scammers launched a $CLAWD token under the old name, which reached a $16 million market cap before crashing within 12 hours.

And all of this happened in just 48 hours. We’re truly living in exciting times (!).

Source

Synthetic data using generative models

Back to data science. What is an excellent use case for generative AI models? Creating synthetic data for causal identification.

The synthetic control method is already popular, but there is another use case: realistically scaling a small dataset into a larger one. Say we have a limited dataset (a new product launch, a new or niche market, or early startup data) and need to expand it for analysis while preserving its properties. Training a generative model is a sensible approach.

In a new Causal Book chapter, we use a Gaussian Copula synthesizer to generate data for a regression discontinuity design. We start with a small seed dataset that has a causal jump around the cutoff (ideally, that seed is a real dataset). The challenge? Generative models are “smoothers” by nature; they tend to blur discontinuities.

Our fix was to move the jump outside the black box with a semi-parametric design: we trained the model on residualized data, effectively teaching it everything except the causal break, and then re-introduced the break during reconstruction:

  1. Residualization: We strip the causal effect before training. By subtracting the deterministic part from the outcome, we isolate a residual without the jump.
  2. Training: We train the Gaussian Copula on the residualized data. This allows the generative model to capture the correlation structure between covariates without getting derailed by the discontinuity.
  3. Reconstruction: We don’t ask the model to generate the outcome directly. Instead, we reconstruct it by applying the functional form (the “ground truth,” which we happen to know in this special case) to the synthetic covariates and the generated residual.

This process forces the treatment effect back into the otherwise smoothed-out data, preserving the causal structure we needed. We used the Synthetic Data Vault library. You can find the Python code and Colab notebook in Chapter 2 of the Causal Book.
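The three steps above can be sketched in a minimal, dependency-free toy. The chapter’s actual implementation uses SDV’s Gaussian Copula synthesizer; here, a simple Gaussian fit to the residuals stands in for it, and the cutoff, jump size, and functional form are hypothetical:

```python
import random
import statistics

random.seed(7)

# Toy seed data: a sharp RD with cutoff CUTOFF and jump TAU (hypothetical).
CUTOFF, TAU = 50.0, 10.0

def outcome(x, eps):
    # Known functional form: linear trend + causal jump at the cutoff + noise.
    return 2.0 + 0.3 * x + TAU * (x >= CUTOFF) + eps

seed_x = [random.uniform(0, 100) for _ in range(200)]
seed_y = [outcome(x, random.gauss(0, 1.0)) for x in seed_x]

# 1. Residualization: subtract the deterministic part (including the jump),
#    leaving residuals that carry no discontinuity.
residuals = [y - (2.0 + 0.3 * x + TAU * (x >= CUTOFF))
             for x, y in zip(seed_x, seed_y)]

# 2. "Training": fit a simple Gaussian to the residuals. (A stand-in for the
#    generative model, which never sees the causal break.)
mu = statistics.fmean(residuals)
sigma = statistics.stdev(residuals)

# 3. Reconstruction: sample synthetic covariates and residuals, then re-apply
#    the functional form so the jump reappears in the synthetic outcome.
synth_x = [random.uniform(0, 100) for _ in range(5000)]
synth_y = [outcome(x, random.gauss(mu, sigma)) for x in synth_x]

# Check: the discontinuity survives in the synthetic data.
near_left = [y for x, y in zip(synth_x, synth_y) if CUTOFF - 2 <= x < CUTOFF]
near_right = [y for x, y in zip(synth_x, synth_y) if CUTOFF <= x < CUTOFF + 2]
gap = statistics.fmean(near_right) - statistics.fmean(near_left)
print(f"estimated jump near cutoff: {gap:.1f}")
```

The estimated gap lands near TAU (plus a small slope contribution from the comparison windows), which is exactly the property a naive end-to-end synthesizer would smooth away.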

Our use case here was specialized, but there are broader applications. We are working on a Data Duets article to discuss the business implications.

Path to profitability for OpenAI

OK, last “AI” post before going back to data science. We’re actually working on a piece about using generative models to simulate data for causal modeling. But… it just wouldn’t be fair if I didn’t share the fun during this stormy weekend. Stay warm!

P.S.: I have no interest in OpenAI’s financials. This is watching the “grand vision” collide with reality in a Curb Your Enthusiasm moment. Received via personal comm. Turn on the sound.

Producer’s take on LLMs and AI

Want a two-minute producer’s take on LLMs and AI? Turn to Ben Affleck.

This is a well-articulated take from a creator’s perspective, and it reinforces a reminder I often give my students: LLMs are tools that produce “the average” by default. Anything of substance still requires human judgment and creative input. Most importantly, don’t delegate your thinking – especially while you’re learning. That’s a trap where the illusion of learning awaits.

The Accountant is now required viewing in the syllabus. Do the thinking, people.

Source

Vibe coding vs. security in iOS apps

I just stumbled upon this: the security lab Covert tested 198 vibe-coded iOS apps and found leaks in 196 of them, exposing millions of files containing names, emails, and complete chat histories.

How they picked these apps isn’t clear, making selection bias likely, but it’s still pretty remarkable (if sadly unsurprising). Some of these apps have hundreds of thousands of reviews on the App Store, so plausibly millions of users.

Interesting case. The App Store is supposed to be the safe platform, but of course, that only holds until the data leaves the device. The same goes for IDEs sending data back and forth to model providers. The public registry (i.e., the list of apps to avoid) is here.