How to use LLMs as coding assistants

[Click title for image]

One way is hitting that accept button until the code works. This is like gambling: it works until it doesn’t, and you never know why in either case.

Another way is intentional use. Intentional use is:

  • Telling LLMs exactly what to do.
  • Inspecting their code, line by line, before accepting.
  • Unit testing the solution before it ever sees production (see the sketch below).
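
As a minimal, hypothetical sketch of that last point: if the assistant wrote a small pricing helper for you, pin its expected behavior down in tests before the code ships. The function and the cases below are made up for illustration.

```python
# test_discount.py: a minimal, hypothetical example of unit testing
# LLM-generated code before it ships. apply_discount() stands in for
# whatever function the assistant wrote for you.

import pytest


def apply_discount(price: float, rate: float) -> float:
    """Return the price after applying a discount rate in [0, 1]."""
    if not 0 <= rate <= 1:
        raise ValueError("rate must be between 0 and 1")
    return round(price * (1 - rate), 2)


def test_typical_case():
    assert apply_discount(100.0, 0.2) == 80.0


def test_zero_discount_is_identity():
    assert apply_discount(59.99, 0.0) == 59.99


def test_invalid_rate_raises():
    with pytest.raises(ValueError):
        apply_discount(100.0, 1.5)
```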

This means treating LLMs as over-confident, lightning-fast assistants. They are incredibly helpful for boosting productivity, quickly looking things up and delivering operational code. They can search 100 pages of API documentation and find a solution to your problem in seconds.

Unless we see a structural breakthrough in how language models are built, this is also the best we can have: an over-confident assistant. LLMs don’t think or reason; they achieve (shallow) deductive closure at best. While the debate over whether LLMs “think” can be unproductive, there is a practical implication: LLMs make profoundly inhuman coding mistakes.

The errors LLMs make aren’t the errors a human assistant would make, so working with LLMs requires a perspective shift. Understanding this distinction is key to using them effectively: our coding assistant is deeply inhuman.

Otherwise, LLM-driven coding will inevitably lead to more failures in data science. Expect to hear more stories about models breaking unexpectedly.

This post was inspired by the write-up “Here’s how I use LLMs to help me write code,” which nails many other crucial points. It’s worth checking out.

Using LLMs for IV discovery and data

LLMs excel at search and discovery. Why not use them to find instrumental variables (IVs) for causal models?

In a new section in Causal Book, Using LLMs for IV discovery and data, we offer a prompt template to help discover candidate IVs and their actual data. We tested it with the latest Gemini (2.5 Pro Preview 06-05-2025) and the results are promising.
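
The template in the book is more detailed, but as a rough, hypothetical illustration of the idea (the wording, function name, and example variables below are mine, not the chapter’s):

```python
# A rough, hypothetical illustration of an IV-discovery prompt. The actual
# template in Causal Book is more detailed; the wording here is mine.

def iv_discovery_prompt(treatment: str, outcome: str, context: str) -> str:
    return f"""
You are assisting with instrumental variable (IV) discovery.

Setting: {context}
Treatment: {treatment}
Outcome: {outcome}

Tasks:
1. Propose candidate instruments that plausibly shift the treatment but
   affect the outcome only through the treatment (exclusion restriction).
2. For each candidate, state the relevance and exclusion arguments and the
   most likely violations.
3. Point to publicly available datasets where the candidate instrument is
   actually measured, with links.
"""


print(iv_discovery_prompt(
    treatment="class size",
    outcome="student test scores",
    context="public primary schools",
))
```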

This section is the latest addition to the IV design pattern chapter of Causal Book. The book itself aims to:

  1. provide solution patterns and their code implementations in R and Python,
  2. discuss different approaches to the same pattern on the same data (Statistics, Machine Learning, Bayesian),
  3. demystify some surprising (or seemingly surprising) challenges in applying the causal design patterns.

See the full table of contents here.

We’ll next dive into the regression discontinuity design pattern, which I hope will be even more fun with the newly added support in DoubleML.

How do LLMs report scientific text?

In short, by exaggerating.

It seems so convenient: when you are short of time, asking ChatGPT or another chatbot to summarise a scientific paper to quickly get a gist of it. But in up to 73 per cent of the cases, these large language models (LLMs) produce inaccurate conclusions…

Over a year, we collected 4,900 summaries. When we analysed them, we found that six of ten models systematically exaggerated claims they found in the original texts.

This prompts the question: why? My initial thought was that since these models are built to generalize, and exaggeration is a form of faulty generalization, they might actually be working as intended. This aligns with the study’s finding that LLMs are nearly five times more likely to produce broad generalizations than humans.

More interestingly, attempting to prompt LLMs for accuracy often backfires. They tend to double down on generalizations, becoming twice as likely to produce overgeneralized answers when asked to stick to the facts.

On the bright side, LLMs can quickly replace the media when it comes to reporting science.

Source | Paper

What an A/B test is not

[Click title for image]

The founder of this Shark Tank-backed company (thinks he) did an A/B test on the impact of tariffs on customer behavior (demand for a showerhead): “Made in USA” vs. “Made in Asia”.

There’s so much wrong here that I’m just going to share it without comment. But one thing is clear: Outside of tech and other companies that are invested in data science, we’re still in the early days of business analytics education. When it comes to causal modeling, inference, and experimental design, we seem to be just getting started.

Source

AI as a disguised customer service agent

[Click title for image]

This is ironic and offers a valuable lesson.

Cursor, an AI-powered integrated development environment (IDE), started kicking users out when they logged in from multiple machines.

I use Cursor on a daily basis, and I know how frustrating and disruptive a limitation like this can be for users.

Many Cursor users rushed to email the support team to ask whether this was a new policy. In response, a support agent named Sam explained that this was “expected behavior”, part of a new security feature.

But in reality, there was no support team: “Sam” is a bot designed to “mimic human responses.” Its answer, which was completely made up, quickly went viral, and users started canceling their subscriptions.

By the time Cursor’s “real humans” stepped in, the damage was done. Here on Reddit, Cursor is doing damage control.

Pretty remarkable that an AI company got hit by its own AI, and no one noticed until users were canceling their subscriptions in droves.

And this could have been largely avoided if Cursor had disclosed that Sam was a bot.

Agent2Agent Protocol for LLMs

Google has just announced the Agent2Agent Protocol (A2A). A2A is open source and aims to enable AI agents to work together seamlessly, potentially multiplying productivity gains in end-to-end business processes.

As I understand it, A2A is to agent communication what MCP is to tool use. At the time, I saw MCP as an opportunity to reduce friction in agent deployment while maintaining a level of security (see here), and it has taken off since then. Google’s A2A seems to take this to the next level, providing more security in the cloud for multiple agents to communicate and collaborate:

A2A focuses on enabling agents to collaborate in their natural, unstructured modalities, even when they don’t share memory, tools and context. We are enabling true multi-agent scenarios without limiting an agent to a “tool.”
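
For a concrete feel of how agents find each other under A2A, each agent advertises its capabilities in an “agent card” that other agents can discover and call. The sketch below is illustrative only; the field names are simplified approximations, so see the documentation linked below for the actual schema.

```python
# Illustrative only: a simplified "agent card" in the spirit of A2A.
# Field names are approximations of the schema; see the official A2A
# documentation for the real thing.

import json

agent_card = {
    "name": "invoice-reconciliation-agent",
    "description": "Matches incoming invoices against purchase orders.",
    "url": "https://agents.example.com/invoice-reconciliation",
    "version": "0.1.0",
    "capabilities": {"streaming": True, "pushNotifications": False},
    "skills": [
        {
            "id": "match-invoice",
            "name": "Match invoice to purchase order",
            "description": "Given an invoice, return the best-matching PO.",
        }
    ],
}

# An agent typically serves its card at a well-known URL so that other
# agents can discover it and then exchange tasks with it.
print(json.dumps(agent_card, indent=2))
```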

Source | Documentation

Collapse of trust in digitized evidence

[Click title for image]

How much longer will we have non-zero trust in what we see on a computer screen?

Generative models are eroding trust in the digital world at an astonishing rate with each new model released. Soon, pictures and videos of events will no longer be accepted as evidence.

Insurance companies won’t accept pictures and videos of damage after accidents, and accounting departments will no longer accept pictures of receipts. For businesses, this may be the easier problem to solve: we’ll likely develop more ways to authenticate digital files, more algorithms will verify authenticity, and companies may simply ask customers to use dedicated apps.
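
One direction such verification could take is plain digital signatures applied at capture time: the device signs a hash of the file the moment it is created, and the insurer or accounting department verifies the signature before accepting it. A minimal sketch, assuming the Python cryptography package and leaving out the hard parts (secure key storage on the device, tamper-resistant capture):

```python
# Minimal sketch of signing and verifying a digital file with Ed25519,
# assuming the `cryptography` package (pip install cryptography). Secure key
# storage on the capture device and tamper-resistant capture are omitted.

import hashlib

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# On the capture device: sign the hash of the file the moment it is created.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

photo_bytes = b"...raw image bytes..."  # stand-in for the actual file
signature = private_key.sign(hashlib.sha256(photo_bytes).digest())

# At the insurer or accounting department: verify before accepting the file.
try:
    public_key.verify(signature, hashlib.sha256(photo_bytes).digest())
    print("Signature valid: file unchanged since capture.")
except InvalidSignature:
    print("Signature invalid: file altered or not from a trusted device.")
```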

But the shift in public trust in digital files is less easily repaired and may even be permanent. We may be leaving behind “pics or it didn’t happen” for “I only believe what I physically see.”

No-code as a cure for understanding

[Click title for image]

Some tasks require understanding, not just knowing how to do them. Tools can’t fill gaps in understanding; for these tasks, time is better spent learning. No-code development is useful for building without understanding, but understanding is most critical when things fail. And things fail while building products, be they data products or otherwise.

Here the user switches from Cursor (automated coding) to Bubble (a no-code tool) to address the lack of understanding, not realizing that switching tools is solving the wrong problem.

We often make the same mistake in data science, especially in predictive modeling, where a new off-the-shelf library or method is treated as a prophet (pun intended), only to find out later that it was solving the wrong problem.

Source

Coding vs. understanding the code

[Click title for image]

Doing is not understanding. Even LLMs seem to know the difference.

I’ve written and spoken a lot about this (link to the talk). Naturally, the exchange here was too good not to share. Here is Claude in Cursor lecturing a user on the difference between having something coded by an LLM vs. coding it yourself so you learn and understand.

The better we separate things we need to understand from things we just need to do, the more effectively we will benefit from LLMs. We certainly can’t understand everything (nor do we need to), but it’s a good idea to avoid the illusion of understanding just because we can do it.

To paraphrase Feynman, we can only understand the code we can create.

Sources of technological progress

[Click title for image]

If you woke up this morning running to the coffee pot even more aggressively because of the start of Daylight Saving Time, just remember that you’re not alone, and that’s how innovation and technological progress begin.

The world’s first webcam was invented in 1991 to monitor a coffee pot in a computer lab at the University of Cambridge, England:

To save people working in the building the disappointment of finding the coffee machine empty after making the trip to the room, a camera was set up providing a live picture of the coffee pot to all desktop computers on the office network. After the camera was connected to the Internet a few years later, the coffee pot gained international renown as a feature of the fledgling World Wide Web, until being retired in 2001.

See the Wiki here.

Deep, Deeper, Deepest Research

[Click title for image]

You must be Platinum, Diamond, or Elite Plus somewhere. Maybe Premier Plus?

Since LLM developers discovered the idea of using multiple models (or agents?) that interact with each other to produce richer output, we have seen another round of semantic reduction by overusing “deep” and “research” (as we did with “intelligence”, “thinking”, and “reasoning”).

In this post “The Differences between Deep Research, Deep Research, and Deep Research”, Han Lee tries to make sense of the deep research mania and offers a quadrant to classify different models.

Is the “depth” of research just the number of iterations in the search for information? That’s another story.

AI as a substitute or complement

This is a much-needed perspective on the new generation of tools in language modeling, object recognition, robotics, and others. The time and effort spent pitting algorithms against human intelligence is truly mind-boggling, when algorithms have been complementing us in so many tasks for decades. The new generation of tools simply offers more opportunities.

In data science, for example, humans excel at conceptual modeling of causal problems because they are creative and imaginative, and algorithms excel at complementing effect identification by collecting, structuring, computing, and optimizing high-dimensional, high-volume data in nonlinear, nonparametric space. Maybe we just need to get over the obsession with benchmarks that pit machine against human and create tests of complementarity.

Causal inference is not about methods

The price elasticity of demand doesn’t magically become causal by using DoubleML instead of regression. Similarly, we can’t estimate the causal effect of a treatment if units are always treated or never treated; we need variation, treating some units at some times and not others.

Causal modeling starts with bespoke data and continues with assumptions. The methods follow the data and assumptions and are useful only if the right data and assumptions are available. This is different from predictive modeling, where brute force bias reduction using the most complex method can be successful.
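
A quick simulation with made-up numbers illustrates the elasticity example above: when price responds to a demand shock the analyst does not observe, a log-log regression recovers a badly biased elasticity no matter how flexible the estimator is, because the data contain no exogenous price variation.

```python
# Sketch with made-up numbers: the observational data are confounded because
# price responds to an unobserved demand shock. The true elasticity is -2.0,
# but the naive log-log slope is biased toward zero; no estimator fixes this
# without data that contain exogenous price variation.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_elasticity = -2.0

demand_shock = rng.normal(size=n)  # unobserved by the analyst
log_price = 0.5 * demand_shock + rng.normal(scale=0.1, size=n)  # pricing reacts to demand
log_quantity = (
    true_elasticity * log_price + demand_shock + rng.normal(scale=0.1, size=n)
)

# Naive OLS slope of log quantity on log price
slope = np.cov(log_price, log_quantity)[0, 1] / np.var(log_price, ddof=1)
print(f"true elasticity:        {true_elasticity:.2f}")
print(f"naive log-log estimate: {slope:.2f}")  # close to zero, not -2.0
```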

We offer a reminder in this solo piece at Data Duets. You can read or listen (just scroll to the end).

New AI feature. Great, how do I disable it?

[Click title for image]

I just received this email from Google. If you have a Google Workspace account, you may have received it as well. As soon as I saw the email, I remembered a Reddit thread from yesterday (where the meme is from).

I can’t turn off “Gemini Apps Activity” in my account (the account admin can’t turn it off either). Why is this? Why would I use a tool that is forced on me and gives me little to no control over my data?

See the Reddit thread here for more frustration with the haphazard rollout of “AI” tools (not limited to Google or privacy issues).

From data models to world models

[Click title for image]

Sentence completion is a predictive task for the language model, not a causal one. It works as just another data model – it doesn’t need a world model, that is, unless a revolution is at stake.

World models are causal representations of the environment to the extent required by the tasks to be performed (as discussed here and there).

World models guide actions by making predictions based on this causal representation. So while not all data models need to be causal, all world models do.

LLM agents as world modelers?

LLMs are data models, so they are useful simplifications of the world. How well LLM agents can move from one useful simplification to another will determine the business use cases for which the agents will be useful. We’re about to find out.

* Image courtesy of xkcd.com.

Data Duets relaunch and new article

Duygu Dagli and I have relaunched Data Duets to focus on business cases.

Data Duets started as an experiment to discuss the same topic from two perspectives: academic and practical. The idea is still the same, but we are shifting our focus to publicly reported business cases where data centricity plays a role in decision making (causal or predictive modeling, optimization, or generative models).

In this first post, we discuss the smart fridge door installations at Walgreens, how the failure could have been avoided by getting the data collection and modeling right, and the many missed opportunities due to lack of business process alignment. When I commented on this story earlier, it reached 14K readers here. This article expands on the case.

This one is a failure, but we will also discuss success stories. I think the next one will be an interesting success story about data collection and optimization at Chick-fil-A, unless we change the order in the pipeline. Stay tuned!

New book: DevOps for Data Science

Just came across this new, free book on the basics of enterprise data management. The book does a great job of covering the most important aspects of data management from a data science perspective.

I have a hard time fitting tool-focused topics like Git, CI/CD, Docker, SSH, and OAuth into my course syllabi. These are things to know rather than understand, and we focus on the latter. But they are absolute must-knows for data workers: this is how models are deployed and optimized for production, which is the only way to make a real impact. So this book will be a great quick reference to list in the syllabus.

The book is a very practical introduction to (almost) everything that comes after modeling in data science. Here’s the intro to data connectors (no offense to data engineers):

Your job as a data scientist is to sift through a massive pile of data to extract nuggets of real information – and then use that information. Working at the end of an external process, you must meet the data where it lives.
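
As a small, hedged illustration of “meeting the data where it lives” (my own sketch, not from the book; the connection string and table are made up):

```python
# My own sketch, not from the book: pulling a table from a warehouse into a
# DataFrame with pandas and SQLAlchemy. The connection string and table name
# are made up.

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://analyst:secret@warehouse.internal:5432/sales"
)

orders = pd.read_sql(
    "SELECT order_id, customer_id, order_total, ordered_at "
    "FROM orders "
    "WHERE ordered_at >= '2025-01-01'",
    con=engine,
)
print(orders.head())
```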

CI/CD and Git:

The role of Git is to make your code promotion process happen. Git allows you to configure requirements for whatever approvals and testing you need. Your CI/CD tool sits on top of that so that all this merging and branching does something.

with the footnote:

Strictly speaking, this is not true. There are a lot of different ways to kick off CI/CD jobs. But, the right way to do it is to base it on Git operations.

I like the tone here (and the drawings). You can check out the book here.

A plane crashes, what is the cause?

In my earlier work, I applied systems modeling to business problems using agent-based and system dynamics models. My background in systems thinking prompts me to question the claims we make about causal relationships in data with many moving and interacting parts, as in most business analytics problems.

In business, when we look for cause-and-effect relationships, we typically measure the impact of a single action. The stakes are measured in financial terms, not human lives. Even if we get it wrong and attribute the increase in sales to a new ad rather than a competitor’s price hike, we can only lose money. But if we fail to identify the causes of one accident, we may fail to prevent another one, and the cost is human life.

So with accidents, we need a different framework to answer causal modeling questions, the deadliest of which is: why did the plane crash? We can no longer downplay interactions, interference, interdependencies, and feedback loops, as we tend to do when measuring the causal effect of a new ad or coupon. Systems thinking should be part of the modeling process.

Causal Analysis based on System Theory (CAST) is an accident analysis framework that aims to maximize causal learning from accidents and incidents. This is clearly a difficult and vital task. Here’s a nice summary of CAST (credit to Joel Parker):

  • Causal: Don’t assume accidents are due to one “root cause” or a few “probable causes”, because it turns out that most accidents are actually due to many interacting causes.

  • Analysis: Don’t blame people, because it turns out you learn more by doing a blame-free examination of why a loss occurred and how it occurred, i.e., “ask why and how, not who”.

  • System: Don’t fix just the one thing that broke, because it turns out it’s smarter to discover multiple causes, then consider multiple ways to improve the whole system.

  • Theory: Don’t wait until something breaks, because it turns out it’s wiser to plan ahead by using scientific control theory and process model theory.

One of the basic tenets of CAST is that human error is a symptom of a system that needs to be fixed: redesigned or reengineered. Indeed, this is often the case. It’s easier to blame a person for pushing or not pushing a button, but why was that button needed in the first place? Well, that’s a systems question.

Root cause analysis and other approaches that are motivated by finding a single cause (e.g., average treatment effect of X) tend to miss the bigger picture by oversimplifying, and lead to a false sense of accomplishment in “solving the problem”. Here’s an example from the CAST handbook by Nancy G. Leveson:

In the crash of an American Airlines DC-10 at Chicago O’Hare Airport in 1979, the U.S. National Transportation Safety Board (NTSB) blamed only a “maintenance-induced crack” and not also a design error that allowed the slats to retract if the wing was punctured. Because of this omission, McDonnell Douglas was not required to change the design, leading to future accidents related to the same design error.

Takeaway?

Every time we think we’ve identified and estimated a causal effect, we may be better off pausing for a moment and taking a systems view. A systems mindset will lead to more questions than answers, but asking those questions is better than jumping to conclusions. This is a way to increase our chances of actually “solving” the problem.

Causal effect ordering without estimation

[Click title for image]

What if we could rank order causal effects without having to estimate them? A creative question, but would it work?

Why would we want to rank rather than estimate in the first place? First, estimating causal effects is difficult and expensive. Second, in a case like the following, we may not need the estimate at all: the decision to intervene has already been made (say, there will be a promotion), and we want to maximize the return on it.

A missing piece, as I also discussed with one of the authors, is the estimation of the financial impact, which usually precedes the decision to intervene (and how to intervene). Let’s skip this part for now and assume that an intervention (a specific promotion) has already been decided. So the conditional question we are answering is: which customers should we target, given the promotion? Can we decide this without estimating the causal effect for each customer?

The paper explores two cases where causal effect ordering may be a viable solution:

  • Intervention data is not available. In other words, we only have predictions of the treatment effect, not a direct estimate as we would have in an experiment. Let’s say we only have predicted propensity scores for conversion.
  • Data on the actual outcome are not available, and we have to rely on a surrogate. Let’s say we can observe a customer’s short-term revenue, but that’s only a surrogate for the actual outcome we’re interested in: customer lifetime value.

The authors use discrete choice modeling to show that in such cases where causal effect estimation is not feasible, causal effect ordering is possible if there exists a latent variable that satisfies the following two conditions:

  1. Latent monotonicity: The latent variable is monotonically (positively or negatively) related to both the treatment effect and the outcome.
  2. Full moderation: All (customer) features are informative only through the moderator.*

That is, following the example in the opening slide, customer features (demographics, historical and behavioral purchase patterns…) are relevant to the effect of the promotion and to the customer’s decision to leave the company only through, let’s say, the customer’s price sensitivity (a perfect mediation).
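
A toy simulation (my own sketch, not the authors’ discrete choice model) shows why monotonicity is enough for ranking: if both the true treatment effect and an observable proxy are monotone in the same latent price sensitivity, ranking customers by the proxy reproduces the ranking by the unknown effect, even though the proxy says nothing about the effect’s magnitude.

```python
# Toy sketch (not the authors' model): when both the true treatment effect
# and an observable proxy are monotone in the same latent variable, ranking
# by the proxy recovers the ranking by the effect without estimating it.

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)
n_customers = 10_000
top_k = n_customers // 10  # target the top decile

price_sensitivity = rng.normal(size=n_customers)  # latent, unobserved

# True (unknown) treatment effect of the promotion: monotone in the latent.
true_effect = 1 / (1 + np.exp(-price_sensitivity))

# Observable proxy (say, a predicted conversion propensity): a different
# monotone transform of the same latent variable.
proxy = np.tanh(0.7 * price_sensitivity)

rho, _ = spearmanr(proxy, true_effect)
print(f"Rank correlation between proxy and true effect: {rho:.3f}")  # 1.000

# Targeting the top decile by the proxy selects the same customers as
# targeting the top decile by the true (unknown) effect.
top_by_proxy = set(np.argsort(proxy)[-top_k:])
top_by_effect = set(np.argsort(true_effect)[-top_k:])
print(f"Overlap in top-decile targeting: {len(top_by_proxy & top_by_effect) / top_k:.0%}")
```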

Even with such strong assumptions, this looks like a useful and promising method. I came across this paper while attending a conference in December and finally found the time to take a look. Of course, there is more to the story. For example, what happens when you have multiple latent variables? If those latent variables are positively correlated, it shouldn’t be a problem, but what if they’re not? Also, the potentially different relationship between the latent variable and the outcome versus its surrogate is a concern. The authors address these boundary conditions and provide two decision trees in the paper that show when it is a good time to use causal effect ordering. Check it out here.

* The latent variable is actually a moderator of the relationship between the treatment and the outcome. The authors show how it can also be defined as a mediator between the features and the outcome (and its surrogate) when the treatment is removed from the DAG. See Figure 10 in the paper.