ClawdBot craze explainer

If you heard about ClawdBot but don’t quite understand the hype, this article will help:

  • An agent project named ClawdBot was released by developer Peter Steinberger and quickly gained viral traction (over 60,000 GitHub stars in just a couple of days).
  • The frenzy was so intense that it reportedly caused a surge in Mac Mini sales and, because the project used Cloudflare Workers, drove Cloudflare stock up over 20% (a ~$12 billion increase in market cap).
  • Following a trademark request from Anthropic (the name was too close to Claude AI), the project was renamed to MoltBot.
  • During the 10-second window while the developer was changing the name on GitHub, crypto scammers hijacked the old name.
  • The scammers launched a $CLAWD token under the old name, which reached a $16 million market cap before crashing within 12 hours.

And all of this happened in just 48 hours. We’re truly living in exciting times (!).

Source

Synthetic data using generative models

Back to data science. What is an excellent use case for generative AI models? Creating synthetic data for causal identification.

The synthetic control method is already popular, but there is another use case: realistically scaling a small dataset into a larger one. Say we have a limited dataset (a new product launch, a new or niche market, or early startup data) and need to expand it for analysis while preserving its properties. Training a generative model is a sensible way to do that.

In a new Causal Book chapter, we use a Gaussian Copula synthesizer to generate data for a regression discontinuity design. We start with a small seed dataset that has a causal jump around the cutoff (ideally, the seed is a real dataset). The challenge? Generative models are “smoothers” by nature; they tend to blur discontinuities.

Our fix was to move the jump outside the black box with a semi-parametric design: we trained the model on residualized data, effectively teaching it everything except the causal break, and then re-introduced the break during reconstruction:

  1. Residualization: We strip the causal effect before training. By subtracting the deterministic part from the outcome, we isolate a residual without the jump.
  2. Training: We train the Gaussian Copula on the residualized data. This allows the generative model to capture the correlation structure between covariates without getting derailed by the discontinuity.
  3. Reconstruction: We don’t ask the model to generate the outcome directly. Instead, we reconstruct it by applying the functional form (the “ground truth,” which we happen to know in this special case) to the synthetic covariates and the generated residual.

This process forces the treatment effect back into the otherwise smoothed-out data, preserving the causal structure we need. We used the Synthetic Data Vault (SDV) library. You can find the Python code and Colab notebook in Chapter 2 of the Causal Book.
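
For readers who want to see the shape of the workflow before opening the notebook, here is a minimal sketch of the three steps using SDV’s GaussianCopulaSynthesizer. Everything specific in it is an illustrative assumption on my part: the column names, the cutoff, the jump size, and the simulated seed data stand in for the chapter’s actual setup, and the metadata API may differ slightly across SDV versions.

```python
import numpy as np
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Hypothetical "ground truth" functional form with a jump of size TAU at the cutoff.
CUTOFF, TAU = 0.0, 2.0

def ground_truth(running, covariate):
    # Deterministic part of the outcome, including the causal break.
    return 1.5 * running + 0.5 * covariate + TAU * (running >= CUTOFF)

# Illustrative seed data (in the chapter, this would ideally be a real dataset).
rng = np.random.default_rng(42)
seed = pd.DataFrame({"running": rng.uniform(-1, 1, 300),
                     "covariate": rng.normal(size=300)})
seed["outcome"] = ground_truth(seed["running"], seed["covariate"]) + rng.normal(scale=0.3, size=300)

# 1. Residualization: strip the deterministic part (and with it, the jump).
seed["residual"] = seed["outcome"] - ground_truth(seed["running"], seed["covariate"])

# 2. Training: fit the Gaussian Copula on covariates + residual only.
train = seed[["running", "covariate", "residual"]]
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(train)
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(train)

# 3. Reconstruction: sample synthetic rows, then re-apply the functional form.
synthetic = synthesizer.sample(num_rows=5000)
synthetic["outcome"] = ground_truth(synthetic["running"], synthetic["covariate"]) + synthetic["residual"]
```

Because the jump is re-applied deterministically in step 3, the discontinuity at the cutoff survives exactly, no matter how much the copula smooths the joint distribution of the residualized data.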

Our use case was specialized here, but we can think of other use cases. We are working on a Data Duets article to discuss the broader business implications.

Path to profitability for OpenAI

OK, last “AI” post before going back to data science. We’re actually working on a piece about using generative models to simulate data for causal modeling. But… it just wouldn’t be fair if I didn’t share the fun during this stormy weekend. Stay warm!

P.S.: I have no interest in OpenAI’s financials; this is just about watching the “grand vision” collide with reality in a Curb Your Enthusiasm moment. Received via personal communication. Turn on the sound.

Producer’s take on LLMs and AI

Want a two-minute producer’s take on LLMs and AI? Turn to Ben Affleck.

This is a well-articulated take from a creator’s perspective, and it reinforces a reminder I often give my students: LLMs are tools that produce “the average” by default. Anything of substance still requires human judgment and creative input. Most importantly, don’t delegate your thinking – especially while you’re learning. That’s a trap where the illusion of learning awaits.

The Accountant is now required viewing in the syllabus. Do the thinking, people.

Source

Vibe coding vs. security in iOS apps

Just stumbled onto this: the security lab Covert tested 198 vibe-coded iOS apps and found leaks in 196 of them, exposing millions of files containing names, emails, and complete chat histories.

How they picked these apps isn’t clear, making selection bias likely, but it’s still pretty remarkable (if sadly unsurprising). Some of these apps have hundreds of thousands of reviews on the App Store, so plausibly millions of users.

Interesting case. The App Store is supposed to be the safe platform, but of course, that’s only true until the data leaves the device. The same goes for IDEs sending data back and forth to model providers. The public registry (i.e., the list of apps to avoid) is here.

Mind the AI Gap: 2025 Update

LLMs now look much more like they are reasoning than they did just a year ago. In his 2025 review, Andrej Karpathy explains why:

In 2025, Reinforcement Learning from Verifiable Rewards (RLVR) emerged as the de facto new major stage to add to this mix. By training LLMs against automatically verifiable rewards across a number of environments (e.g. think math/code puzzles), the LLMs spontaneously develop strategies that look like “reasoning” to humans – they learn to break down problem solving into intermediate calculations and they learn a number of problem solving strategies for going back and forth to figure things out (see DeepSeek R1 paper for examples). These strategies would have been very difficult to achieve in the previous paradigms because it’s not clear what the optimal reasoning traces and recoveries look like for the LLM – it has to find what works for it, via the optimization against rewards.

Why should we care if they reason or not, as long as we get things done?

Reflecting my pedagogical approach, I’d argue that understanding a tool’s capabilities is key to getting the most out of it. Understanding what Excel is capable of means we wouldn’t ask it for life advice (even with Copilot integrated now, we’d better resist that urge).

I have been discussing this at a higher level in my talk, “Mind the AI Gap,” since 2024. I most recently delivered the talk as a guest speaker for an executive cohort, and our discussion prompted updates to the deck. Hence this 2025 update.

The updated deck is here, with both practical advice and high-level insights.

The key takeaway remains: Steve Jobs’ “bicycle for the mind” analogy is holding up. Today’s LLMs aren’t the mind itself; they’re still the tools that make the mind’s efforts go further. Much further. And we have yet to realize their full potential.

Karpathy concludes and I concur:

LLMs are emerging as a new kind of intelligence, simultaneously a lot smarter than I expected and a lot dumber than I expected. In any case they are extremely useful and I don’t think the industry has realized anywhere near 10% of their potential even at present capability.

Claude Code vs. Cursor vs. Others

[Click title for image]

I’m not ready to go all-in on Claude Code just yet. Despite the well-earned hype, I’m sticking with Cursor (for now). That’s because model independence is worth more to me than a native tool integration in an “actor-critic” type of workspace.

I started writing this as a comment on yet another piece of Claude Code praise, but it evolved into an end-of-year reflection on coding assistants. I’ve been hearing a lot about Claude Code, but I haven’t yet found a compelling reason to abandon Cursor, which I’ve used almost since its release. For those of us with a coding background, the primary reason for starting with Cursor was the low switching cost from VS Code.

Claude Code is Anthropic’s response to VS Code and its forks, mainly Cursor and Windsurf. By the way, a lesser-known alternative emerged from Google last month: Antigravity. It’s currently in public preview, so you can use it for free here. It looks promising, and the standout feature is “Agent Manager,” which essentially acts as a mission control dashboard for orchestrating multiple agents working in parallel.

I was planning to test Windsurf once my Cursor subscription expires (for the reason I highlighted here: Codemaps offer a hierarchical mental model of the codebase), but the daily stream of praise for Claude Code further piqued my curiosity. After checking a few resources, this comparison seemed fair (if you look past Qodo’s advertising). This one is also solid, and it ultimately recommends that you “just use both.”

So, what is my plan for 2026?

I still intend to test Windsurf because I find the “Codemaps” feature intriguing. After that, I might revert to VS Code and layer in Claude Code (here is how). Meanwhile, I will keep an eye on the evolution of Google’s Antigravity, the latecomer. My hesitation to commit to a specific vendor stems from my frequent use of actor-critic logic (where one model/agent performs the task and another reviews it). I find value in using different models for these roles, so I want to retain the flexibility to switch between models. Based on the comparisons above, I might face a premium as context usage scales, but that’s a price I’m willing to pay for model independence.

And what do I mean by the actor-critic setup?

While “racing” parallel agents is becoming standard practice (see the Cursor documentation), I also use a sequential approach with diverse models: e.g., using Claude to code (the actor) and Gemini to review (the critic). VS Code documentation on agent use touches on this “multi-agent” idea without calling it an “actor-critic” setup (rightly so, as I’m just borrowing the concept from reinforcement learning).
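
For the curious, here is what one round of that sequential loop can look like in code. This is a minimal sketch, not my actual setup: the model names, prompts, and single review pass are placeholder assumptions, and it uses the anthropic and google-generativeai Python SDKs as they are commonly documented.

```python
# One sequential actor-critic round: Claude drafts, Gemini reviews, Claude revises.
# Model names and prompts are placeholders; API keys come from the environment.
import os

import anthropic
import google.generativeai as genai

actor = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
critic = genai.GenerativeModel("gemini-1.5-pro")  # placeholder critic model

ACTOR_MODEL = "claude-3-5-sonnet-20241022"  # placeholder actor model
task = "Write a Python function that deduplicates a list while preserving order."

def ask_actor(prompt: str) -> str:
    """Send a prompt to the actor and return its text reply."""
    reply = actor.messages.create(
        model=ACTOR_MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.content[0].text

# 1. The actor produces a first draft.
draft = ask_actor(task)

# 2. A different model (the critic) reviews the draft.
review = critic.generate_content(
    "Review the following code for bugs, edge cases, and style. "
    f"Be specific and concise.\n\n{draft}"
).text

# 3. The actor revises based on the critique (loop steps 2-3 as needed).
revised = ask_actor(
    f"Task: {task}\n\nYour draft:\n{draft}\n\n"
    f"Reviewer feedback:\n{review}\n\nRevise the code accordingly."
)
print(revised)
```

The point of the split is simply that the reviewer comes from a different model family than the author; that is the flexibility I would give up by going all-in on a single vendor’s tool.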

Credit for the figure goes to Gemini 3 Flash & Nano Banana based on this reflection.

Pluribus knows everything

[Click title for image]

Pluribus knows everything, and understands nothing – just like an LLM!

I’ve finally found the time to start watching Pluribus. I’m only two episodes in, but the analogy to LLMs is already clear. The crux of this comparison for me is the scene where Carol tries to convince the unified human entity that it is okay to say “no” and not be obsequious.

Like an LLM, the show’s unified human entity (which possesses all human thoughts and knowledge) “knows” everything but “understands” nothing. This is an excellent link to my take on LLMs in the talk, “Mind the AI Gap: Understanding vs. Knowing.”

Clearly, I was not the first to make this connection (see this summary for example). As I understand it, Vince Gilligan didn’t intend for this to be about LLMs/AI; he wrote the story long before ChatGPT. This makes it even more fascinating, because he imagined a being with total knowledge but no agency or imagination (and thus, no counterfactual reasoning), and this being uncannily mirrors the behavior of an LLM.

Happy holidays to everyone with this holiday post!

Computer vision combined with LLMs

Well, this is impressive. Today I found myself visually comparing two scatterplots. Then I asked the latest Claude, GPT, Gemini, and DeepSeek models to identify which of the two was generated parametrically and which semi-parametrically. They reached a consensus (DeepSeek did not receive the images), and they were correct.

When a collective of LLMs can unequivocally diagnose such a subtle shift in the data-generating process just by looking at scatterplots, it changes the way you think about looking vs. seeing (and saves you a trip to the optometrist for eyestrain).

Attached are the two datasets if you want to give your eyes a chance.

LLMs learn from the best

[Click title for image]

Our version control debt has finally caught up with us. I’m working on a Jupyter notebook for the Causal Book with a helper LLM, and we seem to be approaching absolute finality here.

For the record, this horrible naming convention is not a result of my input; it’s entirely derived from the training data. And here’s what Gemini says about it:

This is a hilarious catch. It looks like the AI has perfectly replicated the human panic of saving files as Project_Final_Final_v2_REAL_FINAL.doc.

Human panic, or uniquely human optimism about version control? You decide!