Causal inference is not about methods

The price elasticity of demand doesn’t magically become causal by using DoubleML instead of regression. Similarly, we can’t estimate the causal effect of a treatment if it is always applied or never applied: we need to treat sometimes and not treat other times.

Causal modeling starts with bespoke data and continues with assumptions. The methods follow the data and assumptions and are useful only if the right data and assumptions are available. This is different from predictive modeling, where brute force bias reduction using the most complex method can be successful.
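
To make this concrete, here is a minimal simulated sketch (numpy and scikit-learn; the data-generating process and the numbers are made up for illustration): a confounder drives both price and demand, and a DoubleML-style partialling-out estimate that never sees the confounder is just as wrong as plain regression.

```python
# Hypothetical illustration: a fancier method cannot fix missing data and assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 2_000
u = rng.normal(size=n)                                # unobserved confounder (e.g., demand shocks)
price = u + rng.normal(size=n)                        # price responds to the confounder
demand = -2.0 * price + 3.0 * u + rng.normal(size=n)  # true price effect: -2.0

# Naive regression of demand on price (confounder not in the data)
naive = LinearRegression().fit(price.reshape(-1, 1), demand).coef_[0]

# DoubleML-style partialling-out with cross-fitting, still without the confounder:
# residualize demand and price on the observed covariates, then regress residual on residual.
x_obs = rng.normal(size=(n, 3))                       # observed covariates, unrelated to u
ml = RandomForestRegressor(n_estimators=50, random_state=0)
res_y = demand - cross_val_predict(ml, x_obs, demand, cv=5)
res_t = price - cross_val_predict(ml, x_obs, price, cv=5)
dml_style = LinearRegression().fit(res_t.reshape(-1, 1), res_y).coef_[0]

print(f"true: -2.0 | naive OLS: {naive:.2f} | partialling-out without confounder: {dml_style:.2f}")
# Both estimates land around -0.5: the method follows the data and assumptions, not the other way around.
```

The fix here is not a better learner; it is better data (observe the confounder) or a design that creates the variation.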

We offer a reminder in this solo piece at Data Duets. You can read or listen (just scroll to the end).

New AI feature. Great, how do I disable it?

I just received this email from Google. If you have a Google Workspace account, you may have received it as well. As soon as I saw the email, I remembered a Reddit thread from yesterday (where the meme is from).

I can’t turn off “Gemini Apps Activity” in my account (the account admin can’t turn it off either). Why is this? Why would I use a tool that is forced upon me while giving me little to no control over my data?

See the Reddit thread here for more frustration with the haphazard rollout of “AI” tools (not limited to Google or privacy issues).

From data models to world models

Sentence completion is a predictive task for the language model, not a causal one. It works as just another data model – it doesn’t need a world model, that is, unless a revolution is at stake.

World models are causal representations of the environment to the extent required by the tasks to be performed (as discussed here and there).

World models guide actions by making predictions based on this causal representation. So while not all data models need to be causal, all world models do.

LLM agents as world modelers?

LLMs are data models, so they are useful simplifications of the world. How well LLM agents can move from one useful simplification to another will determine the business use cases for which the agents will be useful. We’re about to find out.

* Image courtesy of xkcd.com.

Data Duets relaunch and new article

Duygu Dagli and I have relaunched Data Duets to focus on business cases.

Data Duets started as an experiment to discuss the same topic from two perspectives: academic and practical. The idea is still the same, but we are shifting our focus to publicly reported business cases where data centricity plays a role in decision making (causal or predictive modeling, optimization, or generative models).

In this first post, we discuss the smart fridge door installations at Walgreens, how the failure could have been avoided by getting the data collection and modeling right, and the many missed opportunities due to lack of business process alignment. When I commented on this story earlier, it reached 14K readers here. This article expands on the case.

This one is a failure, but we will also discuss success stories. I think the next one will be an interesting success story about data collection and optimization at Chick-fil-A, unless we change the order in the pipeline. Stay tuned!

New book: DevOps for Data Science

Just came across this new, free book on the basics of enterprise data management. The book does a great job of covering the most important aspects of data management from a data science perspective.

I have a hard time fitting all of these tool-focused topics (Git, CI/CD, Docker, SSH, OAuth) into my course syllabi. These are things to know vs. understand, and we focus on the latter. But they are absolute must-knows for data workers: this is how models are deployed and optimized for production, which is the only way to make a real impact. So this will be a great quick reference to list on the syllabus.

The book is a very practical introduction to (almost) everything that comes after modeling in data science. Here’s the intro to data connectors (no offense to data engineers):

Your job as a data scientist is to sift through a massive pile of data to extract nuggets of real information – and then use that information. Working at the end of an external process, you must meet the data where it lives.

CI/CD and Git:

The role of Git is to make your code promotion process happen. Git allows you to configure requirements for whatever approvals and testing you need. Your CI/CD tool sits on top of that so that all this merging and branching does something.

with the footnote:

Strictly speaking, this is not true. There are a lot of different ways to kick off CI/CD jobs. But, the right way to do it is to base it on Git operations.

I like the tone here (and the drawings). You can check out the book here.

A plane crashes, what is the cause?

In my earlier work, I applied systems modeling to business problems using agent-based and system dynamics models. My background in systems thinking prompts me to question the claims we make about causal relationships in data with many moving and interacting parts, as in most business analytics problems.

In business, when we look for cause-and-effect relationships, we typically measure the impact of a single action. The stakes are measured in financial terms, not human lives. Even if we get it wrong and attribute the increase in sales to a new ad rather than a competitor’s price hike, we can only lose money. But if we fail to identify the causes of one accident, we may fail to prevent another one, and the cost is human life.

So with accidents, we need a different framework to answer causal modeling questions, a deadly one of which is: why did the plane crash? We can no longer downplay interactions, interference, interdependencies, and feedback loops, as we tend to do when measuring the causal effect of a new ad or coupon. Systems thinking should be part of the modeling process.

Causal Analysis based on System Theory (CAST) is an accident analysis framework that aims to maximize causal learning from accidents and incidents. This is clearly a difficult and vital task. Here’s a nice summary of CAST (credit to Joel Parker):

  • Causal: Don’t assume accidents are due to one “root cause” or a few “probable causes”, because it turns out that most accidents are actually due to many interacting causes.

  • Analysis: Don’t blame people, because it turns out you learn more by doing a blame-free examination of why a loss occurred and how it occurred, i.e., “ask why and how, not who”.

  • System: Don’t fix just the one thing that broke, because it turns out it’s smarter to discover multiple causes, then consider multiple ways to improve the whole system.

  • Theory: Don’t wait until something breaks, because it turns out it’s wiser to plan ahead by using scientific control theory and process model theory.

One of the basic tenets of CAST is that human error is a symptom of a system that needs to be fixed: redesigned or reengineered. Indeed, this is often the case. It’s easier to blame a person for pushing or not pushing a button, but why was that button needed in the first place? Well, that’s a systems question.

Root cause analysis and other approaches that are motivated by finding a single cause (e.g., average treatment effect of X) tend to miss the bigger picture by oversimplifying, and lead to a false sense of accomplishment in “solving the problem”. Here’s an example from the CAST handbook by Nancy G. Leveson:

In the crash of an American Airlines DC-10 at Chicago O’Hare Airport in 1979, the U.S. National Transportation Safety Board (NTSB) blamed only a “maintenance-induced crack” and not also a design error that allowed the slats to retract if the wing was punctured. Because of this omission, McDonnell Douglas was not required to change the design, leading to future accidents related to the same design error.

Takeaway?

Every time we think we’ve identified and estimated a causal effect, we may be better off pausing for a moment and taking a systems view. A systems mindset will lead to more questions than answers, but asking those questions is better than jumping to conclusions. This is a way to increase our chances of actually “solving” the problem.

Causal effect ordering without estimation

What if we could rank order causal effects without having to estimate them? A creative question, but would it work?

Why would we want to rank rather than estimate in the first place? First, estimating causal effects is difficult and expensive. Also, in a case like the following, we may not need to estimate the effect: the decision to intervene has already been made (say, there will be a promotion), and we want to maximize the return on the promotion.

A missing piece, as I also discussed with one of the authors, is the estimation of the financial impact, which usually precedes the decision of whether (and how) to intervene. Let’s skip this part for now and assume that an intervention (a specific promotion) has already been decided. So the conditional question we are answering is: which customers should we target, given the promotion? Can we decide this without estimating the causal effect for each customer?

The paper explores two cases where causal effect ordering may be a viable solution:

  • Intervention data is not available. In other words, we only have predictions of the treatment effect, not a direct estimate as we would have in an experiment. Let’s say we only have predicted propensity scores for conversion.
  • Data on the actual outcome are not available, and we have to rely on a surrogate. Let’s say we can observe a customer’s short-term revenue, but that’s only a surrogate for the actual outcome we’re interested in: customer lifetime value.

The authors use discrete choice modeling to show that in such cases where causal effect estimation is not feasible, causal effect ordering is possible if there exists a latent variable that satisfies the following two conditions:

  1. Latent monotonicity: The latent variable is monotonically (positively or negatively) related to both the treatment effect and the outcome.
  2. Full moderation: All (customer) features are informative only through the moderator.*

That is, following the example in the opening slide, customer features (demographics, historical and behavioral purchase patterns…) are only relevant to the effect of the promotion and the customer’s decision to leave the company through, let’s say, the customer’s price sensitivity (a perfect mediation).

Even with such strong assumptions, this looks like a useful and promising method. I came across this paper while attending a conference in December and finally found the time to take a look. Of course, there is more to the story. For example, what happens when you have multiple latent variables? If those latent variables are positively correlated, it shouldn’t be a problem, but what if they’re not? Also, the potentially different relationship between the latent variable and the outcome versus its surrogate is a concern. The authors address these boundary conditions and provide two decision trees in the paper that show when it is a good time to use causal effect ordering. Check it out here.
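
As a toy illustration of the ordering idea (my own simulation, not the authors’ model): a single latent variable, say price sensitivity, monotonically drives both the true treatment effect and an observable proxy, so ranking by the proxy targets the same customers as ranking by the effect we never estimate. The functional forms below are made up.

```python
# Hypothetical illustration of causal effect ordering under latent monotonicity.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n = 10_000
sensitivity = rng.normal(size=n)  # latent variable, e.g., price sensitivity

# Both quantities are monotone transforms of the latent (latent monotonicity), and
# customer features are assumed to matter only through it (full moderation).
true_effect = 1.0 / (1.0 + np.exp(-sensitivity))              # treatment effect we never observe
proxy_score = 0.2 + 0.6 / (1.0 + np.exp(-2.0 * sensitivity))  # e.g., a predicted conversion score

rho, _ = spearmanr(proxy_score, true_effect)
print(f"rank correlation between proxy and true effect: {rho:.2f}")  # 1.00 by construction (no noise)

# Targeting the top decile by the proxy selects the same customers as targeting by the true effect.
top_by_proxy = set(np.argsort(-proxy_score)[: n // 10])
top_by_effect = set(np.argsort(-true_effect)[: n // 10])
print(f"overlap in the targeted top 10%: {len(top_by_proxy & top_by_effect) / (n // 10):.0%}")
```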

* The latent variable is actually a moderator of the relationship between the treatment and the outcome. The authors show how it can also be defined as a mediator between the features and the outcome (and its surrogate) when the treatment is removed from the DAG. See Figure 10 in the paper.

DeepSeek R1 and Goodhart’s law

I am impressed by the huge cost advantage of DeepSeek’s R1. DeepSeek R1 is about 30x cheaper than OpenAI’s o1 for both input and output tokens:

  • DeepSeek’s API costs $0.55 per million input tokens, $0.14 for cached inputs, and $2.19 per million output tokens.
  • OpenAI’s o1 costs $15 per million input tokens, $7.50 for cached inputs, and $60 per million output tokens.
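
A quick back-of-the-envelope check of those ratios, using the list prices above (per million tokens):

```python
# Prices per million tokens, from the list above.
deepseek_r1 = {"input": 0.55, "cached input": 0.14, "output": 2.19}
openai_o1 = {"input": 15.00, "cached input": 7.50, "output": 60.00}

for k in deepseek_r1:
    print(f"{k}: {openai_o1[k] / deepseek_r1[k]:.0f}x cheaper")
# input: 27x cheaper, cached input: 54x cheaper, output: 27x cheaper
```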

This is despite the fact that DeepSeek R1 performs on par with, if not better than, OpenAI o1 on most benchmarks. What gives me pause is the obsession with benchmarks. This obsession seems to make training and fine-tuning these models even more complex, but is it for the right reason?

For example, DeepSeek R1 appears to be trained on a synthetic dataset of thousands of long-form CoT examples generated by prompting its own base model, with the responses then reviewed and refined by human annotators, combined with the base model’s responses to reasoning prompts, and followed by multiple rounds of RL.

How about reverse causation?

As long as the model works and performs well on the benchmarks, we don’t seem to care about the complexity at this point, but I am increasingly curious about the future use of these models. Once any kind of directionality in the way data and assumptions flow is lost, reverse causation runs rampant.

Will there be a pause to try to simplify at some point?

In the business process reengineering work of the 1990s, the key challenge was to obliterate non-value-added work rather than automate it with technology. So far, each new version of an LLM seems to be obliterating for the sake of doing better on the benchmarks, which brings me to Goodhart’s Law:

 “When a measure becomes a target, it ceases to be a good measure.”

How a causal modeling failure can cost $200 million

I just read this Bloomberg article: “Walgreens Replaced Fridge Doors With Smart Screens. It’s Now a $200 Million Fiasco”. In summary, a startup promised Walgreens that its high-tech fridges would track shoppers and spark an in-store advertising revolution. Then the project failed miserably for a number of reasons.

It’s probably a terrible idea to block the view of soft drinks, which customers can literally touch anyway, with a digital screen. If the digital screens are offline for any reason, the contents are completely invisible (that’s why they had to put signs on the doors explaining what’s inside!).

But why was this idea even executed in the first place? Apparently, Walgreens signed a 10-year contract and initially had 10,000 smart doors installed. So why go beyond a limited experiment?

My answer is bad data, bad analysis: a poor understanding of causal modeling and data centricity.

Here is a four-sentence summary from the article:

Expectation vs. reality 1:

“Pilot data showed the screens resulting in more than a 5% incremental sales jump, and Walgreens committed to installing them in an additional 50 stores the next year as part of a decade-long deal.”

“Walgreens says each smart door ended up bringing in just $215 that year, or a mere 59¢ a day, about half the contractual minimum and a pittance when measured against the thousands of dollars each door cost to build and install.”

Expectation vs. reality 2:

“Cooler Screens had outsourced sales of available advertising slots for its fridges to Yahoo, then a subsidiary of its investor Verizon. But Yahoo barely topped $3 million in sales for the fridges in 2021, 91% lower than projected, a Cooler Screens court filing said.”

Grand finale:

The startup “claimed that its displays garnered almost 100 million monthly impressions and gave brands a healthy sales bounce, but these people doubted the math, which was tracked in spreadsheets.”

Explaining the unexplainable Part II: SHAP and SAGE

In this latest post at Data Duets, we discuss SHAP and SAGE, two explainable AI methods. I focus on what these methods actually do and how they work intuitively, their lesser-known but serious limitations, and how and why they relate to counterfactual thinking and causality.
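
If you have never run these methods, here is a minimal, generic SHAP sketch (synthetic data, the shap and scikit-learn packages), just to show the kind of per-prediction, per-feature attributions the post discusses; SAGE, by contrast, produces global importances and is not shown here.

```python
# A generic SHAP example on synthetic data, not the models discussed in the post.
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=5, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)        # model-specific explainer for tree ensembles
shap_values = explainer.shap_values(X[:10])  # one additive contribution per feature, per row
print(shap_values.shape)                     # (10, 5): local attributions, not global importance
```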

If you’d rather skip all the reading and listen to the article as a podcast discussion, here’s the link. But don’t skip reading, because:

“Outside of a dog, a book is a man’s best friend. Inside of a dog, it’s too dark to read.” —Groucho Marx

Passionate Amazon customer service agent (?)

You might think Amazon is faking, but another idea is that Ankur is taking the day off after connecting the GPT API to the chat… Or Ankur is a passionate programmer who is willing to help the customer no matter what…

From Reddit.

How to use LLMs for learning in 2025

We can use LLMs to:

  1. Do things
  2. Learn things

When just doing things, LLMs feel like magic. Not so much when learning.

LLMs are excellent tools for getting things done: writing, rewriting, coding, code reviewing, or just figuring things out. The interaction to get things done is straightforward, but it can be improved if the goal is not just to get things done, but to get things done right. For example:

  • You can use an LLM that reports sources in addition to answers. Click on some of the sources to understand the context of the answer. This will help you verify that the answer is within the bounds of what you’d expect. It will also help you validate the answer against the source.
  • Pause and review the autocompleted code to make sure it does what it is supposed to do. If it doesn’t look familiar, just copy and paste the main function and use good old Google.

When it comes to learning, things get more complicated. With the latest round of updates (Claude 3.5, OpenAI o1, etc.), LLMs have taken over the chain of reasoning for many tasks.

This means that you don’t have to think about the question and formulate the steps of a solution yourself; the model does that for you. The model gives you a fish, but you don’t really learn where the fish came from. Instead, you can:

  • Embrace your own chain of thought: For topics and tasks where your goal is not just to do things, but to learn how to do them, keep your train of thought to yourself. This means proactively thinking of answers to the question at hand before you ask the LLM your question.
  • Treat post-LLM agents as assistants that need guidance in thinking and reasoning. Think of a solution first, and ask the agent to help you through the steps of the solution. The agent may come up with a different solution, and that’s okay. Just try to understand why.
  • A quick tip: search-and-discovery-focused LLM tools like Perplexity can help this process. Perplexity’s “Pro Search” and “Focus” motivate the learner to be more proactive.

I gave another talk in December and updated my main deck on Knowing vs. Understanding. You can find it here. For my December talk, I also put together a prologue deck for this discussion, which I will post after optimizing it for the web. Stay tuned.

Modeling data to win an argument or solve a problem

Modeling data to win an argument motivates us to make assumptions that are often baked into the modeling process.

There is a better way: focus on solving the problem. It starts with “I don’t know”, and it takes creativity and an open mind to find out. The data may or may not be there. We may need an experiment to get the data. The method we use to model the data doesn’t matter anymore. Methods become tools. More importantly, focusing on solving the problem limits our assumptions to those we have to make to get from the data to a model for decision making. So we focus on data centricity.

The pleasure of winning an argument will always be there, but perhaps we can avoid it in favor of better decision making and problem solving. And even if we can’t avoid it, we’re probably better off making an argument to learn, not to win.

Model Context Protocol for LLMs

LLM news never stops these days, but this could be a big one from a development perspective. MCP is an open standard protocol for connecting LLMs to any data source, removing the custom development barrier for LLMs to securely work with private data as directed.

For example, Claude Desktop can now use MCP to connect to, query, and analyze data in a local SQL database, keeping the data private and secure without integration barriers.
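
For a sense of what this looks like on the developer side, here is a minimal sketch of an MCP server that exposes a local SQLite database as a tool. It assumes the official Python SDK’s FastMCP interface; the database file and tool name are hypothetical.

```python
# Hypothetical sketch of an MCP tool server; assumes the mcp Python SDK's FastMCP interface.
import sqlite3

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("local-sqlite")  # server name shown to the connecting client

@mcp.tool()
def query_db(sql: str) -> list:
    """Run a read-only SQL query against a local SQLite database (hypothetical analytics.db)."""
    conn = sqlite3.connect("file:analytics.db?mode=ro", uri=True)  # read-only connection
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; a client like Claude Desktop connects to this process
```

The data stays on the local machine; the client only sees what the tool returns.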

In the video, Claude is asked to develop an HTML page, create a GitHub repo to push the page to, push an update, create an issue, push changes, and create a pull request.

The protocol won’t be as visible to end users, but it will open up many possibilities for LLM agents, essentially lowering the cost of agent creation and data access.

Cool, here.

1 dataset 100 visualizations

Nice thought experiment and execution on many visualizations of the same data: change in the number of World Heritage sites from 2004 to 2022 in three Nordic countries.

Clearly, the data is better presented here as a table with a third row/column showing percentages, as shown on the About page, but getting to 100 certainly takes some creativity.

Source

Modern macro recording

Remember the ability to “record” Excel macros we were promised back in the 90s that never quite worked? Autotab now does that job as a standalone browser.

It’s basically automation on steroids, making the training and execution of a mini-model easier and more accessible, eliminating the tedious process for everyday tasks.

This is a great use case for the post-LLM world of AI agents, with a potentially direct positive impact on employee productivity and net value creation. Check it out here.

Quantification bias in decisions

When making decisions, people are systematically biased to favor options that dominate on quantified dimensions.*

The figures in the article show the extent of the bias in different contexts. Depending on what information is quantified, our decisions change even though the information content remains about the same. In other words, quantification has a distorting effect on decision making.

This made me think about the implications for data centricity. By prioritizing quantitative over qualitative information, are we failing to stay true to the data?

The study provides some evidence: we overweight salary and benefits and overlook work-life balance and workplace culture in our decisions. We check product ratings but miss the fact that the product lacks that one little feature we really need. It’s discussed in product reviews, but not quantified.

That sounds right. Clearly, we often base our decision to stay at a hotel on the rating rather than the sentiment in the reviews. But will this tendency change? Quite possibly. We have LLMs everywhere. LLMs can help resolve the trade-off between quantification and data centricity.

Using text data for decision making is easier than ever. We can now more effectively search in product reviews instead of relying solely on ratings (e.g. Amazon Rufus). Information about work-life balance and workplace culture contained in employee reviews can be more effectively quantified. Currently, Glassdoor applies sentiment analysis to a subset of work-life balance reviews by keyword matching, but it’ll get better. Comparably.com already does better.
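
As a toy illustration of that kind of quantification, here is a hedged sketch using a generic sentiment pipeline from Hugging Face transformers; the reviews and the keyword filter are made up, and an LLM could of course do the aspect extraction far more flexibly.

```python
# Made-up reviews and a crude keyword filter, only to show the idea of scoring qualitative text.
from transformers import pipeline

reviews = [
    "Great pay, but the on-call schedule destroyed my work-life balance.",
    "Flexible hours and a supportive team; work-life balance is excellent.",
    "Salary is average, culture is toxic, and overtime is expected.",
]

sentiment = pipeline("sentiment-analysis")  # default pretrained sentiment model

# Filter for the aspect we care about, then score it: a rough version of what
# review platforms do today with keyword matching.
aspect_reviews = [r for r in reviews if "work-life balance" in r.lower() or "overtime" in r.lower()]
for r in aspect_reviews:
    result = sentiment(r)[0]
    print(f"{result['label']:>8} ({result['score']:.2f})  {r}")
```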

It’s time to do better. LLMs offer the opportunity to use qualitative information for more effective, higher quality decisions by staying true to data, or data centricity.

* From the article Does counting change what counts? Quantification fixation biases decision-making.

H/T Philip Rocco for sharing the article. You can learn more about data centricity at datacentricity.org.

TinyTroupe from Microsoft

A new Microsoft Research project comes with a Python library to create AI agents “for imagination enhancement and business insights”. Ha! This follows Google’s Interactive Simulacra from last year.

TinyTroupe is an experimental Python library that allows the simulation of people with specific personalities, interests, and goals. These artificial agents – TinyPersons – can listen to us and one another, reply back, and go about their lives in simulated TinyWorld environments. […] The focus is thus on understanding human behavior…

So it’s like a little SimCity where AI agents “think” and act (talk). The product recommendation notebook asks the agents to brainstorm AI features for MS Word. It’s a GPT-4 wrapper after all, so the ideas are mediocre at best, focusing on some kind of train/test logic: learn the behavior of the Word user and… (blame the predictive modeling work that dominates the training data).

Are these the most valuable business insights? This project attempts to “understand human behavior”, but can we even run experiments with these agents to simulate the causal links needed for business insights in a counterfactual design? The answer is no: the process, including agent creation and deployment, screams unknown confounders and interference.

It still looks like fun and is worth a try, even though I honestly thought it was a joke at first. That’s because the project, coming from Microsoft Research, has a surprising number of typos everywhere and errors in the Jupyter notebooks (and a borderline funny description):

One common source of confusion is to think all such AI agents are meant for assiting humans. How narrow, fellow homosapiens! Have you not considered that perhaps we can simulate artificial people to understand real people? Truly, this is our aim here — TinyTroup is meant to simulate and help understand people! To further clarify this point, consider the following differences:

Source

AI outlines in Scholar PDF Reader

Google’s Scholar PDF Reader seems to have gotten a new feature:

An AI outline is an extended table of contents for the paper. It includes a few bullets for each key section. Skim the outline for a quick overview. Click on a bullet to deep read where it gets interesting – be it methods, results, discussion, or specific details.

Clearly it’s not an alternative to reading (well, I hope not), but it makes search and discovery a breeze. Sure, one could feed the PDF into another LLM to generate a table of contents and outline, but the value here is the convenience of having them generated right when you open the PDF (not just in Google Scholar, but anywhere on the web). Highly recommended.

If you’re not already using it: I shared this very, very helpful tool when it came out earlier this year.