How a causal modeling failure can cost $200 million

I just read this Bloomberg article: “Walgreens Replaced Fridge Doors With Smart Screens. It’s Now a $200 Million Fiasco”. In summary, a startup promised Walgreens that its high-tech fridges would track shoppers and spark an in-store advertising revolution. Then the project failed miserably, for a number of reasons.

It’s probably a terrible idea to block the view of soft drinks, which customers can literally reach out and touch anyway, with a digital screen. If the screens are offline for any reason, the contents become completely invisible (that’s why they had to put signs on the doors explaining what’s inside!).

But why was this idea executed at scale in the first place? Apparently, Walgreens signed a 10-year contract and initially had 10,000 smart doors installed. Why go beyond a limited experiment at all?

My answer is bad data, bad analysis: a poor understanding of causal modeling and data centricity.

Here is a short summary, in the article’s own words:

Expectation vs. reality 1:

“Pilot data showed the screens resulting in more than a 5% incremental sales jump, and Walgreens committed to installing them in an additional 50 stores the next year as part of a decade-long deal.”

“Walgreens says each smart door ended up bringing in just $215 that year, or a mere 59¢ a day, about half the contractual minimum and a pittance when measured against the thousands of dollars each door cost to build and install.”

Expectation vs. reality 2:

“Cooler Screens had outsourced sales of available advertising slots for its fridges to Yahoo, then a subsidiary of its investor Verizon. But Yahoo barely topped $3 million in sales for the fridges in 2021, 91% lower than projected, a Cooler Screens court filing said.”

Grand finale:

The startup “claimed that its displays garnered almost 100 million monthly impressions and gave brands a healthy sales bounce, but these people doubted the math, which was tracked in spreadsheets.”
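Somewhere between the pilot and the rollout, the counterfactual went missing. Here is a hypothetical illustration with simulated data (nothing below reflects the actual pilot design): a naive before/after comparison on pilot stores can show a healthy “incremental” jump that vanishes once you compare against control stores.

```python
# Hypothetical illustration (simulated data): how a pilot "lift" can appear
# without a proper counterfactual. Nothing here reflects the actual pilot design.
import numpy as np

rng = np.random.default_rng(0)
n_stores = 50

# Baseline weekly cooler sales per store, plus a seasonal bump that affects
# ALL stores in the post-period (holidays, promotions, etc.)
baseline = rng.normal(10_000, 1_000, size=2 * n_stores)
seasonal_bump = 1.05          # 5% market-wide lift, unrelated to the screens
true_screen_effect = 1.00     # assume the screens do nothing

pilot_pre = baseline[:n_stores]
pilot_post = pilot_pre * seasonal_bump * true_screen_effect
control_pre = baseline[n_stores:]
control_post = control_pre * seasonal_bump

# Naive before/after on pilot stores only: looks like a ~5% "incremental" jump
naive_lift = pilot_post.mean() / pilot_pre.mean() - 1

# Difference-in-differences: compare the pilot change to the control change
did_lift = (pilot_post.mean() / pilot_pre.mean()) - (control_post.mean() / control_pre.mean())

print(f"Naive pilot lift:      {naive_lift:+.1%}")   # ~ +5%
print(f"Diff-in-diff estimate: {did_lift:+.1%}")     # ~ 0%
```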

Explaining the unexplainable Part II: SHAP and SAGE

In the latest post at Data Duets, we discuss SHAP and SAGE, two explainable AI methods. I focus on what these methods actually do and how they work intuitively, their lesser-known but serious limitations, and how and why they relate to counterfactual thinking and causality.
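To make the objects concrete, here is a minimal sketch (synthetic data, not the post’s example) of computing local SHAP attributions, with a generic global importance score added only for contrast; SAGE is also global, but it is built on Shapley values of a predictive-performance value function rather than on permutation.

```python
# Minimal sketch (synthetic data, not the post's example): local SHAP
# attributions per prediction vs. a generic global importance score for contrast.
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# SHAP: additive, per-prediction attributions relative to the model's expected output
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:100])
print(shap_values.shape)        # (100, 5): one attribution per feature per prediction

# For contrast, a global importance score (SAGE is also global, but it uses
# Shapley values of a value function measuring predictive performance)
perm = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(perm.importances_mean.round(3))
```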

If you’d rather skip all the reading and listen to the article as a podcast discussion, here’s the link. But don’t skip reading, because:

“Outside of a dog, a book is a man’s best friend. Inside of a dog, it’s too dark to read.” —Groucho Marx

How to use LLMs for learning in 2025

We can use LLMs to:

  1. Do things
  2. Learn things

When just doing things, LLMs feel like magic. Not so much when learning.

LLMs are excellent tools for getting things done: writing, rewriting, coding, code reviewing, or just figuring things out. The interaction is straightforward, but it can be improved if the goal is not just to get things done, but to get them done right. For example:

  • You can use an LLM that reports sources in addition to answers. Click on some of the sources to understand the context of the answer. This helps you verify that the answer is within the bounds of what you’d expect, and validate it against the source.
  • Pause and review the autocompleted code to make sure it does what it is supposed to do. If it doesn’t look familiar, just copy and paste the main function and use good old Google.

When it comes to learning, things get more complicated. With the latest round of updates (Claude 3.5, OpenAI’s o1, etc.), LLMs have taken over the chain of reasoning for many tasks.

This means you don’t have to think about the question and formulate the steps of a solution yourself; the model does that for you. The model gives you a fish, but you don’t really learn where the fish came from. Instead, you can:

  • Embrace your own chain of thought: For topics and tasks where your goal is not just to do things but to learn how to do them, keep your own chain of thought going. This means proactively thinking of answers to the question at hand before you ask the LLM your question.
  • Treat post-LLM agents as assistants that need guidance in thinking and reasoning. Think of a solution first, and ask the agent to help you through the steps of the solution. The agent may come up with a different solution, and that’s okay. Just try to understand why.
  • A quick tip: search- and discovery-focused LLM tools like Perplexity can help this process. Perplexity’s “Pro Search” and “Focus” features motivate the learner to be more proactive.

I gave another talk in December and updated my main deck on Knowing vs. Understanding. You can find it here. For my December talk, I also put together a prologue deck for this discussion, which I will post after optimizing it for the web. Stay tuned.

Modeling data to win an argument or solve a problem

Modeling data to win an argument motivates us to make assumptions that are often baked into the modeling process.

There is a better way: focus on solving the problem. It starts with “I don’t know”, and it takes creativity and an open mind to find out. The data may or may not be there. We may need an experiment to get the data. The method we use to model the data doesn’t matter anymore. Methods become tools. More importantly, focusing on solving the problem limits our assumptions to those we have to make to get from the data to a model for decision making. So we focus on data centricity.

The pleasure of winning an argument will always be there, but perhaps we can resist it in favor of better decision making and problem solving. And even if we can’t, we’re probably better off making an argument to learn, not to win.

Model Context Protocol for LLMs

LLM news never stops these days, but this could be a big one from a development perspective. MCP is an open standard protocol for connecting LLMs to any data source, removing the custom development barrier for LLMs to securely work with private data as directed.

For example, Claude Desktop can now use MCP to connect to, query, and analyze data in a local SQL database, keeping the data private and secure without integration barriers.
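To give a flavor of what that looks like on the developer side, here is a rough sketch of an MCP server exposing a read-only SQLite query tool. I’m assuming the MCP Python SDK’s FastMCP helper; the server name, tool, and database path are hypothetical, and the exact API may differ across SDK versions.

```python
# Rough sketch of an MCP server exposing a read-only SQLite query tool.
# Assumes the MCP Python SDK's FastMCP helper; names and paths are hypothetical,
# and the exact API may differ from the SDK version you install.
import sqlite3
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("local-sqlite")   # hypothetical server name
DB_PATH = "sales.db"            # hypothetical local database

@mcp.tool()
def run_query(sql: str) -> list[dict]:
    """Run a read-only SQL query against the local database and return rows."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("Only SELECT statements are allowed.")
    with sqlite3.connect(DB_PATH) as conn:
        conn.row_factory = sqlite3.Row
        rows = conn.execute(sql).fetchall()
    return [dict(r) for r in rows]

if __name__ == "__main__":
    mcp.run()   # an MCP client such as Claude Desktop connects over stdio
```

The point is that defining the tool, not building the integration, becomes the main work.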

In the video, Claude is asked to develop an HTML page, create a GitHub repo to push the page to, push an update, create an issue, push changes, and create a pull request.

The protocol won’t be as visible to end users, but it will open up many possibilities for LLM agents, essentially lowering the cost of agent creation and data access.

Cool, here.

1 dataset 100 visualizations

Nice thought experiment and execution on many visualizations of the same data: change in the number of World Heritage sites from 2004 to 2022 in three Nordic countries.

Clearly, the data is better presented here as a table with a third row/column showing percentages, as shown on the About page, but getting to 100 certainly takes some creativity.
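For what it’s worth, the “boring” version takes a few lines. The counts below are placeholders (I don’t have the exact figures at hand), so only the layout matters.

```python
# Placeholder counts (not the real figures), just to show the table layout:
# sites in 2004, sites in 2022, and a percentage-change column.
import pandas as pd

df = pd.DataFrame(
    {"2004": [13, 5, 4], "2022": [15, 8, 10]},   # hypothetical values
    index=["Sweden", "Norway", "Denmark"],
)
df["change_%"] = ((df["2022"] - df["2004"]) / df["2004"] * 100).round(1)
print(df)
```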

Source

Modern macro recording

Remember the Excel macro “recording” we were promised back in the ’90s that never quite worked? Autotab now does that job as a standalone browser.

It’s basically automation on steroids, making it easier and more accessible to train and run a mini-model, and eliminating the tedious manual steps in everyday tasks.

This is a great use case for the post-LLM world of AI agents, with a potentially direct positive impact on employee productivity and net value creation. Check it out here.

Quantification bias in decisions

When making decisions, people are systematically biased to favor options that dominate on quantified dimensions.*

The figures show the extent of bias in different contexts. Depending on what information is quantified, our decisions change even though the information content remains about the same. In other words, quantification has a distorting effect on decision making.

This made me think about the implications for data centricity. By prioritizing quantitative over qualitative information, are we failing to stay true to the data?

The study provides some evidence: we overweight salary and benefits and overlook work-life balance and workplace culture in our decisions. We check product ratings but miss the fact that the product lacks that one little feature we really need. It’s discussed in product reviews, but not quantified.

That sounds right. Clearly, we often base our decision to stay at a hotel on the rating rather than the sentiment in the reviews. But will this tendency change? Quite possibly. We have LLMs everywhere. LLMs can help resolve the trade-off between quantification and data centricity.

Using text data for decision making is easier than ever. We can now search product reviews more effectively instead of relying solely on ratings (e.g., Amazon Rufus). Information about work-life balance and workplace culture contained in employee reviews can be quantified more effectively. Currently, Glassdoor applies sentiment analysis to a subset of work-life balance reviews via keyword matching, but it will get better. Comparably.com already does better.
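Here is a minimal sketch of the idea: turning qualitative review text into a number you can weigh against the ratings. It uses a generic pretrained sentiment model via Hugging Face’s transformers pipeline, and the review snippets are made up.

```python
# Minimal sketch: turning qualitative review text into a number you can weigh
# against ratings. Generic pretrained sentiment model; the reviews are made up.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")

reviews = [
    "Pay is great, but I answer emails at midnight and weekends are not my own.",
    "Average salary, yet the team culture and flexible hours keep me here.",
]

for text, result in zip(reviews, sentiment(reviews)):
    # result looks like {"label": "NEGATIVE", "score": 0.99}
    print(f"{result['label']:<8} ({result['score']:.2f})  {text[:60]}...")
```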

It’s time to do better. LLMs offer the opportunity to use qualitative information for more effective, higher quality decisions by staying true to data, or data centricity.

* From the article Does counting change what counts? Quantification fixation biases decision-making.

H/T Philip Rocco for sharing the article. You can learn more about data centricity at datacentricity.org.

TinyTroupe from Microsoft

A new Microsoft Research project comes with a Python library to create AI agents “for imagination enhancement and business insights”. Ha! This follows Stanford and Google’s Interactive Simulacra work from last year.

TinyTroupe is an experimental Python library that allows the simulation of people with specific personalities, interests, and goals. These artificial agents – TinyPersons – can listen to us and one another, reply back, and go about their lives in simulated TinyWorld environments. […] The focus is thus on understanding human behavior…

So it’s like a little SimCity where AI agents “think” and act (talk). The product recommendation notebook asks the agents to brainstorm AI features for MS Word. It’s a GPT-4 wrapper after all, so the ideas are mediocre at best, focusing on some kind of train/test logic: learn the behavior of the Word user and… (blame the predictive modeling work that dominates the training data).

Are these the most valuable business insights? This project attempts to “understand human behavior”, but can we even run experiments with these agents to simulate the causal links needed for business insights in a counterfactual design? The answer is no: the process, including agent creation and deployment, screams unknown confounders and interference.

It still looks like fun and is worth a try, even though I honestly thought it was a joke at first. That’s because the project, coming from Microsoft Research, has a surprising number of typos and errors in its Jupyter notebooks (and a borderline funny description):

One common source of confusion is to think all such AI agents are meant for assiting humans. How narrow, fellow homosapiens! Have you not considered that perhaps we can simulate artificial people to understand real people? Truly, this is our aim here — TinyTroup is meant to simulate and help understand people! To further clarify this point, consider the following differences:

Source

AI outlines in Scholar PDF Reader

Google’s Scholar PDF Reader seems to have gotten a new feature:

An AI outline is an extended table of contents for the paper. It includes a few bullets for each key section. Skim the outline for a quick overview. Click on a bullet to deep read where it gets interesting – be it methods, results, discussion, or specific details.

Clearly it’s not an alternative to reading (well, I hope not), but it makes search and discovery a breeze. Sure, one could feed the PDF into another LLM to generate a table of contents and outline, but the value here is the convenience of having them generated right when you open the PDF (not just in Google Scholar, but anywhere on the web). Highly recommended.

If you’re not already using it, I shared this very, very helpful tool when it came out earlier this year.

New chapter in the Causal Book: IV the Bayesian way

A new chapter in the Causal Book is out: IV the Bayesian Way. In this chapter, we examine the price elasticity of demand for cigarettes and identify the causal treatment effect using state taxes as an instrument. We’ll streamline the conceptual model and data across chapters later.

Basically, the example question here is: what is the effect of a price increase on smoking? As always, the solution includes complete code and data. This chapter uses the powerful RStan and CmdStanR via brms and ulam and, unlike the other chapters, doesn’t replicate the solution in Python (due to the added computational cost of the sampling process).
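The chapter solves this the Bayesian way in R. As a quick frequentist point of comparison, this is roughly what the same instrumental-variable setup looks like as 2SLS in Python’s linearmodels; the file and column names below are hypothetical stand-ins for the chapter’s data.

```python
# Roughly the same IV setup as a frequentist 2SLS in Python (not the chapter's
# Bayesian solution). File and column names are hypothetical stand-ins.
import numpy as np
import pandas as pd
from linearmodels.iv import IV2SLS

df = pd.read_csv("cigarettes.csv")   # hypothetical file
df["const"] = 1.0

# log(packs) ~ log(price), with log(price) instrumented by the state tax
model = IV2SLS(
    dependent=np.log(df["packs"]),
    exog=df[["const"]],
    endog=np.log(df["price"]),
    instruments=df[["tax"]],
)
result = model.fit(cov_type="robust")
print(result.summary)   # the coefficient on log(price) is the price elasticity
```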

Causal Book is an interactive resource that presents a network of concepts and methods for causal inference. Due to the nonlinear, network structure of the book, each new chapter comes with a number of other linked sections and pages. All of this added content can be viewed in the graph view (desktop only, in the upper right corner).

This book aims to be a curated set of design patterns for causal inference, with each pattern applied using a variety of methods in three approaches: Statistics (narrowly defined), Machine Learning, and Bayesian. Each design pattern is supported by business cases that use the pattern. The three approaches are compared using the same data and model. The book discusses the lesser-known and less-understood details of the modeling process in each pattern.

Ongoing debate: LLMs reasoning or not

There are now so many papers testing the capabilities of LLMs that I increasingly rely on thoughtful summaries like this one.

The word ‘reasoning’ is an umbrella term that includes abilities for deduction, induction, abduction, analogy, common sense, and other ‘rational’ or systematic methods for solving problems. Reasoning is often a process that involves composing multiple steps of inference. Reasoning is typically thought to require abstraction—that is, the capacity to reason is not limited to a particular example, but is more general. If I can reason about addition, I can not only solve 23+37, but any addition problem that comes my way. If I learn to add in base 10 and also learn about other number bases, my reasoning abilities allow me to quickly learn to add in any other base.

Abstraction is key to imagination and counterfactual reasoning, and thus to establishing causal relationships. We don’t have it (yet) in LLMs, as the three papers summarized here and others show (assuming robustness is a necessary condition).

Is that a deal breaker? Clearly not. LLMs are excellent assistants for many tasks, and productivity gains are already documented.

Perhaps if LLMs weren’t marketed as thinking machines, we could have focused more of our attention on how best to use them to solve problems in business and society.

Nonetheless, the discussion around reasoning seems to be advancing our understanding of our thinking and learning process vis-à-vis machine learning, and that’s a good thing.

The illusion of information adequacy

A new PLOS One study coined this term to describe people’s strong tendency to believe they always have enough data to make an informed decision – regardless of what information they actually have.

In the study, participants responded to a hypothetical scenario about a water issue involving a school: control participants were given full information, while treatment participants were given about half of it. The study found that treatment participants believed they had comparably adequate information and felt equally competent to make thoughtful decisions based on it.

In essence, the study shows that people assume they have enough information, even when they lack half of the relevant information. This extends to data science, where the data at hand is often assumed to be sufficient for decision making, even though assumptions, implicit or explicit, fill the gaps between the data and the models that lead to decisions. We briefly discuss this idea of data centricity at datacentricity.org (and more to come).

Image courtesy of learningrabbithole.com.

Programming is solved by LLMs, isn’t it?

AI should virtually eliminate coding and debugging.

This is a direct quote from an IBM report published in 1954 (here, page 2), if you replace AI with Fortran. It didn’t happen, and not because Fortran wasn’t revolutionary at the time: it was the first commercial compiler and took 18 person-years to develop.

Compilers didn’t “solve” programming, and neither do LLMs. LLMs help solve (part of) the problem. They don’t solve exception handling. I wrote before about exception handling (or the lack thereof) in most machine learning applications. We need to pay more attention to it.

Exception handling is difficult, if not impossible, to automate away because of the complexity and unintended consequences of human-machine (user-model) interactions. LLMs can certainly be useful for generating alternative scenarios and building solutions for them.
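As a hypothetical sketch of what that means in practice (not a recipe), here is what the exception handling around a single model call can look like; the happy path is one line, and everything around it is the work. The model interface below is an assumed sklearn-style predict_proba.

```python
# Hypothetical sketch: the model call is one line; the exceptions around it
# (bad input, missing features, low confidence, service failure) are the work.
# Assumes an sklearn-style model with predict_proba; names are illustrative.
def score_customer(model, features: dict, required=("age", "tenure", "spend")):
    # 1. Input exceptions: missing or nonsensical features
    missing = [f for f in required if features.get(f) is None]
    if missing:
        return {"status": "needs_review", "reason": f"missing: {missing}"}
    if features["age"] < 0 or features["spend"] < 0:
        return {"status": "needs_review", "reason": "implausible values"}

    # 2. Model exceptions: the happy path is a single line
    try:
        proba = model.predict_proba([[features[f] for f in required]])[0][1]
    except Exception as exc:          # model or service failure
        return {"status": "fallback", "reason": repr(exc)}

    # 3. Decision exceptions: don't automate low-confidence calls
    if 0.4 < proba < 0.6:
        return {"status": "needs_review", "reason": f"low confidence ({proba:.2f})"}
    return {"status": "ok", "score": float(proba)}
```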

We will continue to benefit from the models that are increasingly available to us, including LLMs. Just remembering that the problem is not just pattern recognition, but also exception handling, should help us think about how best to use these models to solve problems.

This essay takes more of a software development perspective. From the essay:

You’d think 15 years into the smart phone revolution most people could operate an order kiosk or self-checkout without help. That’s certainly what stores had hoped. But as these are rolling out you can see how these systems are now staffed by people there to handle the exception. Amazon Go will surely be seen as ahead of its time, but those are now staffed full time and your order is checked on the way out. And special orders at McDonalds? Head to the counter 🙂

Mathematical Methods in Data Science (with Python)

Just came across this neat resource while looking for an MCMC / Gibbs sampling code example in object recognition. Self-description of the book:

This textbook on the mathematics of data has two intended audiences:

  • For students majoring in math or other quantitative fields like physics, economics, engineering, etc.: it is meant as an invitation to data science and AI from a rigorous mathematical perspective.
  • For mathematically-inclined students in data science related fields (at the undergraduate or graduate level): it can serve as a mathematical companion to machine learning, AI, and statistics courses.

Not yet published, but you can check it out here.
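Since what led me here was a search for a Gibbs sampling example, here is a minimal one for a bivariate normal; a standard textbook case, not taken from the book.

```python
# Minimal Gibbs sampler for a bivariate normal with correlation rho
# (a standard textbook example, not taken from the book).
import numpy as np

rng = np.random.default_rng(42)
rho, n_iter = 0.8, 10_000
x, y = 0.0, 0.0
samples = np.empty((n_iter, 2))

for i in range(n_iter):
    # Full conditionals of a standard bivariate normal:
    # x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y | x
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    samples[i] = (x, y)

print(np.corrcoef(samples[1000:].T)[0, 1])   # ~ 0.8 after burn-in
```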

Podcast-style discussions on Data Duets

You should not add 1 before log-transforming zeros. If you don’t believe me, listen to these two experts on how to make better decisions using log-transformed data.

This conversation was produced by NotebookLM based on our discussion about the Log of Zero problem at Data Duets. Duygu Dagli and I have now added a podcast-style conversation to each of our articles. All audio is raw/unedited.

The conversations are usually fun (sometimes for odd reasons). The model adds (1) examples we don’t have in the original content and (2) light banter and some jokes. The examples are hit or miss.

So, besides the usual deep and reinforcement learning backend, what does NotebookLM do? (based on Steven Johnson’s description on the Vergecast)

  1. Start with a draft and revise it
  2. Generate a detailed script of the podcast
  3. Critique the script and create a revised version
  4. Add disfluencies (um, uh, like, you know, c-c-can, sssssee…) to sound convincingly human
  5. Apply Google’s latest text-to-speech Gemini model to add intonation, emphasis, and pacing

Have fun, and don’t add 1 to your variables before applying the log transformation.
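If you want a quick taste of why (a hedged sketch, not the full argument from our article): the log(y + 1) fix is not invariant to the units of y, while a Poisson regression with a log link is.

```python
# Hedged sketch of one issue with log(y + 1): the estimate depends on the units
# of y, while a Poisson (log-link) regression is invariant to rescaling.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5_000
x = rng.normal(size=n)
y = rng.poisson(np.exp(0.5 + 0.8 * x))   # counts with plenty of zeros
X = sm.add_constant(x)

for scale in (1, 1_000):                  # same outcome, measured in different units
    y_s = y * scale
    ols = sm.OLS(np.log1p(y_s), X).fit()
    poisson = sm.GLM(y_s, X, family=sm.families.Poisson()).fit()
    print(f"scale={scale:>5}  log1p slope={ols.params[1]:.3f}  "
          f"Poisson slope={poisson.params[1]:.3f}")
# The log1p slope moves with the units; the Poisson slope stays put.
```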

Good data science, bad data science

…and why the difference matters.

We can call data science the practice of making (high-quality) decisions using data.

The order is (1) decision making, (2) using data, not (1) decision-driven, (2) data. So, ideally, it’s not stirring the data pile for evidence to support a decision already made.

That’s a good place to start. We also need to:

  1. Make the business case really well in advance. Bringing in a half-baked problem or asking the wrong question won’t lead to the best insights.
  2. Understand what the models can and cannot do. We certainly need more of this in the LLM land. A Gen AI project is cool, but is it what the problem needs?
  3. Stick to the data. Data is real. Models add assumptions. Whether it’s experimental or observational, predictive or causal, the data must rule.
  4. Divide, focus, and conquer. Prioritize the most important needs. You can measure and track all metrics, but that’s probably not what you really need.
  5. Align the problem and available data with the assumptions embedded in the modeling solution. Testing the assumptions is the only way to know what’s real and what’s not.
  6. Choose the better solution over the faster one, and the simple solution over the complicated one for long-term value creation. This needs no explanation.

Some rules of good (vs. bad) data science, based on insights from projects I’ve been involved with in one way or another. #3 and #5 are most closely related to a framework we are working on: data centricity.

Image courtesy of xkcd.com

Concept drift and data centricity

This plot shows how coefficients in a linear model can change (not only in effect size, but also in sign) as new data is added to the training set (as a result of data or concept drift). Think of it as new retail sales data being added to the set over time.

In the plot, b is the coefficient of interest and z is the proportion of new data (Population 2) gradually added to the existing training data (Population 1). First, all the data is from P1 (so z is 0), then it’s 75% P1 and 25% P2 (z is 0.25), and so on.

As we add more of the new data, we observe how the estimated effect changes. It starts out negative, becomes positive, then negative again. When the old and new data are equally mixed (z is 0.50), the previously negative effect disappears.
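Here is a toy version of the same phenomenon in a few lines of Python (not John’s exact setup or numbers): two populations share the same within-group slope of -1 but have different means, and the pooled estimate flips sign as the mixture changes.

```python
# Toy version of the same phenomenon (not John's exact numbers): the fitted
# slope b flips sign as the share z of new data (Population 2) grows.
import numpy as np

rng = np.random.default_rng(7)
n = 20_000

def simulate(mean_x, intercept, size):
    x = rng.normal(mean_x, 1.0, size)
    y = intercept - 1.0 * x + rng.normal(0, 0.5, size)   # true within-group slope: -1
    return x, y

for z in (0.0, 0.25, 0.5, 0.75, 1.0):
    n2 = int(z * n)
    x1, y1 = simulate(mean_x=0.0, intercept=0.0, size=n - n2)   # Population 1
    x2, y2 = simulate(mean_x=3.0, intercept=6.0, size=n2)       # Population 2
    x = np.concatenate([x1, x2])
    y = np.concatenate([y1, y2])
    b = np.polyfit(x, y, 1)[0]
    print(f"z={z:.2f}  fitted b={b:+.2f}")
# Within each population the slope is -1, yet the pooled estimate turns
# positive for intermediate mixtures: the model is the same, the data drifted.
```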

This thought experiment (by John Mount) reminds me of Lord’s Paradox (John calls it a continuous version of Simpson’s Paradox and that’s another way of putting it).

The data changes, but the model assumptions remain the same, and that’s a problem. This is another example of why staying true to the data, or data centricity, is critical to getting the right insights from models for decision making.

You can find the Python code walkthrough and Jupyter notebook here. If you want to learn more about data centricity, here is a one-pager.

Discrepancies in standard errors R vs. Python

You may have modeled (or asked your data science team to model) the same data in R and Python. Why? Most data science teams use both R and Python, with team members specializing in one or the other. So, this could be a model changing hands. Or maybe you wanted to make sure the package implementation behaved as intended. You may also have needed better computational efficiency (R fixest can be much faster than Python linearmodels on panel data).

For whatever reason, when you run the same model on the same data in R and Python, you may have run into the following situation: the parameter estimates are identical, but the standard errors (and p-values) differ. You can’t explain why, and you don’t know which standard error (and significance test) to trust and report to the business.

If you’re curious about the most common reason, check out another previously missing section now published in the Causal Book, here. We now discuss this as part of our exercise on applying the same instrumental variable model in R vs. Python.
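The chapter covers the actual reason in the IV context. As a flavor of how easily this happens, here is one generic culprit on the Python side alone: different flavors of (and small-sample corrections for) robust standard errors. Which flavor a given R or Python routine uses by default often explains the entire discrepancy; whether it explains yours depends on the packages and options involved.

```python
# Illustrative sketch (Python side only): the same OLS fit, identical point
# estimates, different standard errors purely from the robust-SE flavor chosen.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=np.abs(x) + 0.5, size=n)   # heteroskedastic
X = sm.add_constant(x)

for cov in ("nonrobust", "HC0", "HC1", "HC3"):
    res = sm.OLS(y, X).fit(cov_type=cov)
    print(f"{cov:>9}:  beta={res.params[1]:.4f}   se={res.bse[1]:.4f}")
# Same coefficient every time; the standard error shifts with the correction.
```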