AI agents and failing projects

Nearly half of all AI agent projects are set to fail (as Gartner predicts here). Why? Unclear business value, inadequate risk controls, and escalating costs.

As I see it, much of this is fueled by hype, leading to existing solutions being relabeled as “Agentic AI” without any rethinking of business processes.

Human creativity is missing in this picture. It’s this creative thinking that should move agent use beyond just automating or augmenting individual tasks with LLMs, leading instead to the redesign of business processes and a vision for how humans and AI can truly complement each other.

The risks and costs are more straightforward to resolve:

– Managers who are most excited about AI agents often do not fully understand the risks and limitations of LLMs. They should invest as much in understanding these models as they do in using them.

– The true cost of scaling proof-of-concept GenAI solutions is often underestimated. Much of this comes down to selecting the right vendor: Gartner estimates that only about 130 of the thousands of agentic AI vendors are real.

Everybody lies, but why?

Andrew Gelman’s latest “rant” is worth a read. Everybody lies, but why do people lie even when the data clearly refutes the lie?

It’s worth thinking a bit more about why and how people lie, especially scientists, medical doctors, and law enforcement officials. Spoiler: the answer is not always money.

LLM training and fair use

[Click title for image]

Turns out Anthropic literally purchased and scanned millions of print books to train its language model Claude.

The judge finds that the scanning of these purchased books is “fair use” even without licensing, and goes on to say:

Everyone reads texts, too, then writes new texts. They may need to pay for getting their hands on a text in the first instance. But to make anyone pay specifically for the use of a book each time they read it, each time they recall it from memory, each time they later draw upon it when writing new things in new ways would be unthinkable. For centuries, we have read and re-read books. We have admired, memorized, and internalized their sweeping themes, their substantive points, and their stylistic solutions to recurring writing problems.

This is quite interesting, as it draws an analogy between how humans read, comprehend, and write text and how a language model operates.

Source

How to use LLMs as coding assistants

[Click title for image]

One way is hitting that accept button until the code works. This is like gambling: it works until it doesn’t, and you never know why in either case.

Another way is intentional use. Intentional use is:

  • Telling LLMs exactly what to do.
  • Inspecting their code, line by line, before accepting.
  • Unit testing the solution before it ever sees production (a minimal sketch follows below).
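
To make the last point concrete, here is a minimal sketch of what unit testing an LLM-written function can look like. The module, function, and contract below are hypothetical; the point is that the generated code has to pass tests you wrote yourself before you hit accept.

```python
# test_pricing.py -- hypothetical tests guarding an LLM-generated helper.
# apply_discount() is assumed to be the function the assistant wrote for us.
import pytest

from pricing import apply_discount  # hypothetical module produced by the LLM


def test_typical_discount():
    # 10% off a 100.00 order should be 90.00.
    assert apply_discount(price=100.0, rate=0.10) == pytest.approx(90.0)


def test_zero_discount_is_identity():
    assert apply_discount(price=59.99, rate=0.0) == pytest.approx(59.99)


def test_invalid_rate_is_rejected():
    # We define the contract, not the LLM: rates outside [0, 1] must raise.
    with pytest.raises(ValueError):
        apply_discount(price=100.0, rate=1.5)
```

If the generated code fails a test, the fix is a new, precise instruction to the assistant, not another spin of the accept button.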

This means treating LLMs as over-confident, lightning-fast assistants. They are incredibly helpful for boosting productivity, quickly looking things up, and delivering operational code. They can search 100 pages of API documentation and find a solution to your problem in seconds.

Unless we see a structural breakthrough in how language models are built, this is also the best we can have: an over-confident assistant. LLMs don’t think or reason; they achieve (shallow) deductive closure at best. While the debate over whether LLMs “think” can be unproductive, there is a practical implication: LLMs make profoundly inhuman coding mistakes.

The errors LLMs make aren’t the errors a human assistant would make, so working with LLMs takes another perspective shift. Understanding this distinction is key to effectively using them: our coding assistant is deeply inhuman.

Otherwise, LLM-driven coding will inevitably lead to more failures in data science. Expect to hear more stories about models breaking unexpectedly.

This post was inspired by the write-up “Here’s how I use LLMs to help me write code,” which nails many other crucial points. It’s worth checking out.

Using LLMs for IV discovery and data

LLMs excel at search and discovery. Why not use them to find instrumental variables (IVs) for causal models?

In a new section of Causal Book, Using LLMs for IV discovery and data, we offer a prompt template to help discover candidate IVs along with actual data for them. We tested it with the latest Gemini (2.5 Pro Preview 06-05-2025), and the results are promising.
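
For flavor, here is a rough sketch of how such a prompt could be sent to Gemini from Python. The prompt text is an illustrative stand-in, not the template from the book, and the model name is the preview version mentioned above.

```python
# Sketch: asking Gemini for candidate instrumental variables (IVs).
# The prompt below is illustrative only; the actual template is in the
# "Using LLMs for IV discovery and data" section of Causal Book.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumes an API key is configured
model = genai.GenerativeModel("gemini-2.5-pro-preview-06-05")

prompt = """
Treatment: weekly price of a retail product.
Outcome: weekly units sold.
Task: propose candidate instrumental variables that shift price but plausibly
affect demand only through price. For each candidate, explain the relevance
and exclusion arguments and name a publicly available data source.
"""

response = model.generate_content(prompt)
print(response.text)  # candidate IVs, arguments, and data sources to vet by hand
```

Whatever the model returns are only candidates: the relevance and exclusion restrictions still have to be argued, and where possible tested, before estimation.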

This section is the latest addition to the IV design pattern chapter of Causal Book. The book itself aims to:

  1. provide solution patterns and their code implementations in R and Python,
  2. discuss different approaches to the same pattern on the same data (Statistics, Machine Learning, Bayesian),
  3. demystify some surprising (or seemingly surprising) challenges in applying the causal design patterns.

See the full table of contents here.

We’ll next dive into the regression discontinuity design pattern, which I hope will be even more fun with the newly added support in DoubleML.

How do LLMs report scientific text?

In short, by exaggerating.

It seems so convenient: when you are short of time, asking ChatGPT or another chatbot to summarise a scientific paper to quickly get a gist of it. But in up to 73 per cent of the cases, these large language models (LLMs) produce inaccurate conclusions…

Over a year, we collected 4,900 summaries. When we analysed them, we found that six of ten models systematically exaggerated claims they found in the original texts.

This prompts the question: why? My initial thought was that since these models are built to generalize, and exaggeration is a form of faulty generalization, they might actually be working as intended. This aligns with the study’s finding that LLMs are nearly five times more likely to produce broad generalizations than humans.

More interestingly, attempting to prompt LLMs for accuracy often backfires. They tend to double down on generalizations, becoming twice as likely to produce overgeneralized answers when asked to stick to the facts.

On the bright side, LLMs can quickly replace the media for reporting of science.

Source · Paper

What an A/B test is not

[Click title for image]

The founder of this Shark Tank-backed company (thinks he) did an A/B test on the impact of tariffs on customer behavior (demand for a showerhead): “Made in USA” vs. “Made in Asia”.

There’s so much wrong here that I’m just going to share it without comment. But one thing is clear: Outside of tech and other companies that are invested in data science, we’re still in the early days of business analytics education. When it comes to causal modeling, inference, and experimental design, we seem to be just getting started.

Source

AI as a disguised customer service agent

[Click title for image]

This is ironic and offers a valuable lesson.

Cursor, an AI-powered integrated development environment (IDE), started kicking users out when they logged in from multiple machines.

I use Cursor on a daily basis, and I know how frustrating and disruptive this limitation can be for most users.

So, many Cursor users rushed to email the support team asking whether this was a new policy. In response, a support agent named Sam explained that this was “expected behavior,” part of a new security feature.

But in reality, there was no support team: Sam is a bot designed to “mimic human responses.” The answer, made up entirely by the bot, quickly went viral, and users started canceling their subscriptions.

By the time Cursor’s “real humans” stepped in, the damage was done. Here on Reddit, Cursor is doing damage control.

It’s pretty remarkable that an AI company got hit by its own AI, and no one noticed until users canceled their subscriptions in droves.

And this could have been largely avoided if Cursor had disclosed that Sam was a bot.

Agent2Agent Protocol for LLMs

Google has just announced the Agent2Agent Protocol (A2A). A2A is open source and aims to enable AI agents to work together seamlessly, potentially multiplying productivity gains in end-to-end business processes.

As I understand it, A2A is to agent communication what MCP is to tool use. At the time, I saw MCP as an opportunity to reduce frictions in agent deployment while maintaining a level of security (see here), and it has taken off since then. Google’s A2A seems to take it to the next level, providing more security in the cloud for multiple agents to communicate and collaborate:

A2A focuses on enabling agents to collaborate in their natural, unstructured modalities, even when they don’t share memory, tools and context. We are enabling true multi-agent scenarios without limiting an agent to a “tool.”
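
To make the comparison concrete, here is a rough sketch of the kind of metadata an A2A agent advertises so other agents can discover and call it. The field names follow my reading of the draft spec and the values are invented; see the documentation linked below for the authoritative schema.

```python
# Sketch of an A2A "agent card": the public description an agent publishes
# (at a well-known URL, per the spec) so other agents can discover its skills.
# Field names follow my reading of the draft spec; all values are made up.
agent_card = {
    "name": "invoice-reconciliation-agent",
    "description": "Matches incoming invoices against purchase orders.",
    "url": "https://agents.example.com/invoice-reconciliation",
    "version": "0.1.0",
    "capabilities": {"streaming": True, "pushNotifications": False},
    "skills": [
        {
            "id": "match-invoice",
            "name": "Match invoice to purchase order",
            "description": "Given an invoice, return the matching PO and any discrepancies.",
        }
    ],
}
```

A client agent reads this card, decides whether the skills fit its task, and then hands the task over HTTP, without the two agents ever sharing memory, tools, or context.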

Source · Documentation

Collapse of trust in digitized evidence

[Click title for image]

How much longer will we have non-zero trust in what we see on a computer screen?

Generative models are eroding trust in the digital world at an astonishing rate with each new model released. Soon, pictures and videos of events will no longer be accepted as evidence.

Insurance companies won’t accept pictures and videos of damage after accidents, and accounting departments will no longer accept pictures of receipts. This may be an easier problem to solve. We’ll likely develop more ways to authenticate digital files. More algorithms will verify authenticity, and companies may simply ask customers to use dedicated apps.
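
One concrete version of such verification is a digital signature attached at capture time: a dedicated app signs the photo on the device, and the insurer or accounting department verifies the signature before trusting the file. Here is a minimal sketch with an Ed25519 key pair; real provenance schemes (C2PA, for example) add metadata, certificates, and key management on top of this.

```python
# Minimal sketch: sign an image at capture time, verify it on receipt.
# File name and keys are illustrative; a real scheme would also bind
# metadata (time, device, location) and manage keys properly.
from cryptography.hazmat.primitives.asymmetric import ed25519

# On the capture device (e.g., inside a dedicated claims app):
device_key = ed25519.Ed25519PrivateKey.generate()
image_bytes = open("damage_photo.jpg", "rb").read()
signature = device_key.sign(image_bytes)

# On the insurer's side, with the device's public key on file:
public_key = device_key.public_key()
public_key.verify(signature, image_bytes)  # raises InvalidSignature if tampered
```

Anything that fails verification simply isn’t accepted as evidence.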

But the shift in public trust in digital files is less easily repaired and may even be permanent. We may be trading “pics or it didn’t happen” for “I only believe what I physically see.”

No-code as a cure for understanding

[Click title for image]

Some tasks require understanding, not just knowing how to do them. Tools can’t fill gaps in understanding. For these tasks, time is better spent learning and understanding. No-code development is useful for building without understanding, but understanding is most critical when things fail. And things fail while building products, be they data products or otherwise.

Here the user switches from Cursor (automated coding) to Bubble (a no-code tool) to address the lack of understanding, not realizing that switching tools is solving the wrong problem.

We often make the same mistake in data science, especially in predictive modeling, where a new off-the-shelf library or method is treated as a prophet (pun intended), only to find out later that it was solving the wrong problem.

Source

Coding vs. understanding the code

[Click title for image]

Doing is not understanding. Even LLMs seem to know the difference.

I’ve written and spoken a lot about this (link to the talk). Naturally, the exchange here was too good not to share. Here is Claude in Cursor lecturing a user on the difference between having something coded by an LLM vs. coding it yourself so you learn and understand.

The better we separate things we need to understand from things we just need to do, the more effectively we will benefit from LLMs. We certainly can’t understand everything (nor do we need to), but it’s a good idea to avoid the illusion of understanding just because we can do it.

To paraphrase Feynman, we can only understand the code we can create.

Sources of technological progress

[Click title for image]

If you woke up this morning running to the coffee pot even more aggressively because of the start of Daylight Saving Time, just remember that you’re not alone, and that’s how innovation and technological progress begin.

The world’s first webcam was invented in 1991 to monitor a coffee pot in a computer lab at the University of Cambridge, England:

To save people working in the building the disappointment of finding the coffee machine empty after making the trip to the room, a camera was set up providing a live picture of the coffee pot to all desktop computers on the office network. After the camera was connected to the Internet a few years later, the coffee pot gained international renown as a feature of the fledgling World Wide Web, until being retired in 2001.

See the Wikipedia entry here.

Deep, Deeper, Deepest Research

[Click title for image]

You must be Platinum, Diamond, or Elite Plus somewhere. Maybe Premier Plus?

Since LLM developers discovered the idea of using multiple models (or agents?) that interact with each other to produce richer output, we have seen another round of semantic reduction by overusing “deep” and “research” (as we did with “intelligence”, “thinking”, and “reasoning”).

In this post “The Differences between Deep Research, Deep Research, and Deep Research”, Han Lee tries to make sense of the deep research mania and offers a quadrant to classify different models.

Is the “depth” of research just the number of iterations in the search for information? That’s another story.

AI as a substitute or complement

This is a much-needed perspective on the new generation of tools in language modeling, object recognition, robotics, and others. The time and effort spent pitting algorithms against human intelligence is truly mind-boggling, when algorithms have been complementing us in so many tasks for decades. The new generation of tools simply offers more opportunities.

In data science, for example, humans excel at conceptual modeling of causal problems because they are creative and imaginative, and algorithms excel at complementing effect identification by collecting, structuring, computing, and optimizing high-dimensional, high-volume data in nonlinear, nonparametric space. Maybe we just need to get over the obsession with benchmarks that pit machine against human and create tests of complementarity.

Causal inference is not about methods

The price elasticity of demand doesn’t magically become causal by using DoubleML instead of regression. Similarly, we can’t estimate the causal effect of a treatment if a condition is always treated or never treated. We need to treat sometimes and not treat other times.

Causal modeling starts with bespoke data and continues with assumptions. The methods follow the data and assumptions and are useful only if the right data and assumptions are available. This is different from predictive modeling, where brute force bias reduction using the most complex method can be successful.
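
To make the elasticity point above concrete, here is a sketch of estimating a log-log price elasticity with DoubleML in Python. Nothing in the code makes the estimate causal: that burden falls entirely on the data (prices must vary for reasons unrelated to demand shocks) and on the assumption that the controls close the relevant back doors. The file and column names are placeholders.

```python
# Sketch: partially linear model for a price elasticity with DoubleML.
# The machinery reduces regularization and overfitting bias in the nuisance
# models; it does not supply identification. Data and columns are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from doubleml import DoubleMLData, DoubleMLPLR

df = pd.read_csv("sales.csv")  # hypothetical weekly sales data

dml_data = DoubleMLData(
    df,
    y_col="log_q",             # log units sold
    d_cols="log_p",            # log price (the "treatment")
    x_cols=["x1", "x2", "x3"]  # confounders we assume are sufficient
)

# Flexible nuisance learners, passed positionally.
dml_plr = DoubleMLPLR(dml_data, RandomForestRegressor(), RandomForestRegressor())
dml_plr.fit()
print(dml_plr.summary)  # the log_p coefficient is the elasticity only if the
                        # exogeneity assumption holds; DoubleML cannot check that.
```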

We offer a reminder in this solo piece at Data Duets. You can read or listen (just scroll to the end).

New AI feature. Great, how do I disable it?

[Click title for image]

I just received this email from Google. If you have a Google Workspace account, you may have received it as well. As soon as I saw the email, I remembered a Reddit thread from yesterday (where the meme is from).

I can’t turn off “Gemini Apps Activity” in my account (the account admin can’t turn it off either). Why is this? Why would I use a tool that is forced on me and gives me little to no control over my data?

See the Reddit thread here for more frustration with the haphazard rollout of “AI” tools (not limited to Google or privacy issues).

From data models to world models

[Click title for image]

Sentence completion is a predictive task for the language model, not a causal one. It works as just another data model – it doesn’t need a world model, that is, unless a revolution is at stake.

World models are causal representations of the environment to the extent required by the tasks to be performed (as discussed here and there).

World models guide actions by making predictions based on this causal representation. So while not all data models need to be causal, all world models do.
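
A toy contrast, entirely illustrative and not from the post, may help: a data model only maps inputs to predictions, while a world model also answers “what happens if I act?” and can therefore guide actions.

```python
# Toy illustration of the distinction (not a real architecture).
class DataModel:
    """Predicts the next observation from past observations; no notion of actions."""
    def predict(self, history: list[float]) -> float:
        return history[-1]  # e.g., naive "tomorrow looks like today"


class WorldModel:
    """Encodes an assumed causal effect of actions and uses it to act."""
    EFFECT = 0.8  # assumed causal effect of one unit of action on the state

    def predict_if(self, state: float, action: float) -> float:
        return state + self.EFFECT * action  # prediction under an intervention

    def choose_action(self, state: float, target: float) -> float:
        return (target - state) / self.EFFECT  # invert the causal model to act
```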

LLM agents as world modelers?

LLMs are data models, so they are useful simplifications of the world. How well LLM agents can move from one useful simplification to another will determine the business use cases for which the agents will be useful. We’re about to find out.

* Image courtesy of xkcd.com.

Data Duets relaunch and new article

Duygu Dagli and I have relaunched Data Duets to focus on business cases.

Data Duets started as an experiment to discuss the same topic from two perspectives: academic and practical. The idea is still the same, but we are shifting our focus to publicly reported business cases where data centricity plays a role in decision making (causal or predictive modeling, optimization, or generative models).

In this first post, we discuss the smart fridge door installations at Walgreens, how the failure could have been avoided by getting the data collection and modeling right, and the many missed opportunities due to lack of business process alignment. When I commented on this story earlier, it reached 14K readers here. This article expands on the case.

This one is a failure, but we will also discuss success stories. I think the next one will be an interesting success story about data collection and optimization at Chick-fil-A, unless we change the order in the pipeline. Stay tuned!