On data, causality, and AI - Page 3 - Log of insights and news

Mathematical Methods in Data Science (with Python)

Just came across this neat resource while looking for an MCMC / Gibbs sampling code example in object recognition. Self-description of the book:

This textbook on the mathematics of data has two intended audiences:

For students majoring in math or other quantitative fields like physics, economics, engineering, etc.: it is meant as an invitation to data science and AI from a rigorous mathematical perspective.

For mathematically-inclined students in data science related fields (at the undergraduate or graduate level): it can serve as a mathematical companion to machine learning, AI, and statistics courses.

Not yet published, but you can check it out here.

Podcast-style discussions on Data Duets

Uncategorized

You should not add 1 before log-transforming zeros. If you don’t believe me, listen to these two experts on how to make better decisions using log-transformed data.

This conversation was produced by NotebookLM based on our discussion about the Log of Zero problem at Data Duets. Duygu Dagli and I have now added a podcast-style conversation to each of our articles. All audio is raw/unedited.

The conversations are usually fun (sometimes for odd reasons). The model adds (1) examples we don’t have in the original content and (2) light banter and some jokes. The examples are hit or miss.

So, besides the usual deep and reinforcement learning backend, what does NotebookLM do? (based on Steven Johnson’s description on the Vergecast)

Start with a draft and revise it
Generate a detailed script of the podcast
Critique the script and create a revised version
Add disfluencies (um, uh, like, you know, c-c-can, sssssee…) to sound convincingly human
Apply Google’s latest text-to-speech Gemini model to add intonation, emphasis, and pacing

Have fun, and don’t add 1 to your variables before applying the log transformation.

Good data science, bad data science

Uncategorized

…and why the difference matters.

We can call data science the practice of making (high-quality) decisions using data.

The order is (1) decision making (2) using data, not (1) decision driven (2) data. So, ideally, it’s not stirring the data pile for evidence to support a decision.

That’s a good place to start. We also need to:

Make the business case really well in advance. Bringing in a half-baked problem or asking the wrong question won’t lead to the best insights.
Understand what the models can and cannot do. We certainly need more of this in the LLM land. A Gen AI project is cool, but is it what the problem needs?
Stick to the data. Data is real. Models add assumptions. Whether it’s experimental or observational, predictive or causal, the data must rule.
Divide, focus, and conquer. Prioritize the most important needs. You can measure and track all metrics, but that’s probably not what you really need.
Align the problem and available data with the assumptions embedded in the modeling solution. Testing the assumptions is the only way to know what’s real and what’s not.
Choose the better solution over the faster one, and the simple solution over the complicated one for long-term value creation. This needs no explanation.

Some rules of good (vs. bad) data science, based on insights from projects I’ve been involved with in one way or another. #3 and #5 are most closely related to a framework we are working on: data centricity.

Image courtesy of xkcd.com

Concept drift and data centricity

Uncategorized

This plot shows how coefficients in a linear model can change (not only in effect size, but also in sign) as new data is added to the training set (as a result of data or concept drift). Think of it as new retail sales data being added to the set over time.

In the plot, b is the coefficient of interest and z is the proportion of new data (Population 2) gradually added to the existing training data (Population 1). First, all the data is from P1 (so z is 0), then it’s 75% P1 and 25% P2 (z is 0.25), and so on.

As we add more of new data, we observe how the estimated effect changes. It starts out negative, becomes positive, then negative again. When the old and new data are equally mixed (z is 0.50), the previously negative effect disappears.

This thought experiment (by John Mount) reminds me of Lord’s Paradox (John calls it a continuous version of Simpson’s Paradox and that’s another way of putting it).

The data changes, but the model assumptions remain the same, and that’s a problem. This is another example of why staying true to the data, or data centricity, is critical to getting the right insights from models for decision making.

You can find the Python code walkthrough and Jupyter notebook here. If you want to learn more about data centricity, here is a one-pager.

Discrepancies in standard errors R vs. Python

Uncategorized

You may have modeled (or asked your data science team to model) the same data in R and Python. Why? Most data science teams use both R and Python, with team members specializing in one or the other. So, this could be a model changing hands. Or maybe you wanted to make sure the package implementation behaved as intended. You may also have needed better computational efficiency (R fixest can be much faster than Python linearmodels on panel data).

For whatever reason, when you run models in R and Python, you may have run into the following situation: The parameter estimates are the same, but the standard errors (and p-values) are different. The data and the model are exactly the same. So you can’t explain why, and you don’t know which standard error / statistical significance test to trust and report to the business.

If you’re curious about the most common reason, check out another previously missing section now published in the Causal Book, here. We now discuss this as part of our exercise on applying the same instrumental variable model in R vs. Python.

Good assumptions make good decisions

Uncategorized

As algorithms get better at processing data (and as we have “thinking” LLMs), we need to focus on better thinking for decision making.

Good decisions combine available information with good thinking, sound reasoning. Then come assumptions to fill in the blanks left by incomplete information. The more reasonable the assumptions, the better the decision.

The same is true when analyzing data to support decision making. Modeling data involves assumptions, both method-specific and model-specific. If the assumptions are sound, a decision based on a model’s insights is more likely to be a good one.

Staying true to the actual data at hand while making decisions based on the data is data centricity. One way to achieve data centricity is to look for model-free (i.e., assumption-free) evidence before spending any red ink to connect the dots.

Original image courtesy of xkcd.com

Update on LLM capabilities

Uncategorized

With OpenAI’s release of ChatGPT vo1, I revisited my talk on learning with LLMs. In this talk, I focus on the advantages and disadvantages of using LLMs for professional learning. The discussion distinguishes between knowing and understanding, and underlines that identifying causality is central to our understanding. The link in between is the ability to reason (counterfactual reasoning in particular).

Since yesterday we seem to have a “thinking” and “reasoning” LLM. So I asked OpenAI o1 the same question I asked ChatGPT 4o before. What an improvement: OpenAI’s model went from failing to reason to talking nonsense to hide its failure to reason. These slides are from the original talk (next to be presented in December). You can see the entire deck here.

While I can only naively wish that this was intentional, I must still congratulate OpenAI for creating a model that masters fallacies like equivocation and red herring.

Explaining the unexplainable Part I: LIME

Uncategorized

After a long break, Duygu Dagli and I have written a new article at Data Duets: Explaining the unexplainable Part I: LIME. This post is about the interpretability of predictive models, explains LIME and discusses its pros and cons.

Why the break? We started this project as an experiment. There were already resources out there on the topics we were discussing. We started by offering two perspectives on one topic: Academic vs. Director.

That was well received but it was not enough for us to focus. After getting some feedback, we scoped the project to focus on what we call data centricity: How can we use models to make data-informed decisions while staying faithful to the actual data?

Now we have two goals: 1) provide academic and practitioner perspectives on the same data science and AI topic/concept, and 2) discuss the implications for business decision making and data centricity.

We have added a section on data centricity to each of our previous posts. You can see an example for causal inference using synthetic controls here. We are excited about this new direction and have more to come. See our latest post on LIME here.

Causal Book: Design Patterns in Causal Inference

Uncategorized

After an unexpected hiatus, I’m pleased to announce the early release of a long overdue project: Causal Book: Design Patterns in Causal Inference.

I started this project some time ago, but never had a chance to devote time to scoping it. I finally got around to it, and the first chapter is almost done. I keep going back to it, so it might change a little more along the way.

This is an accessible, interactive book for the data science / causal inference audience. Some chapters should also read well to the business audience.

The book is not meant to substitute for the already great accessible books out there. The two that come to mind are The Effect and The Mixtape. Kudos to Nick and Scott for these great resources.

Our goal here is to complement what’s out there by using the idea of design patterns:

(1) focus on solutions to problem patterns and their code implementations in R and Python,

(2) discuss the implications of different approaches to the same problem solved by modeling the same data,

(3) explain some of the surprising (or seemingly surprising) challenges in applying the causal design patterns.

It’s a work in progress, but now that it’s scoped, more is on the way. Versioning and references are up next. I will post updates along the way.

Finally, why design patterns? Early in my career, I was a programmer using C# and then Java. Our most valuable resources back then were design patterns. I still have a copy of the book Head First Java Design Patterns on my bookshelf from 20 years ago. It was a lifesaver when I moved from C# to Java. This is a tribute to those days.

Linear Algebra for Data Science

Uncategorized

Linear algebra concepts are underappreciated in data science (frankly, like many other math concepts). On the other hand, understanding some concepts, such as orthogonality, is critical to understanding methods like Double Machine Learning (and of course OLS and many other methods, but Double ML is the cool one).

There are several reasons for this lack of appreciation. The availability of ready-to-use, off-the-shelf libraries and packages is one reason. Another important reason is the lack of field-specific coverage of linear algebra with examples and applications for data science / modeling data.

I’ve discovered a new (free) book that addresses the second issue: “Linear Algebra for Data Science”. The book looks like a practical introduction to linear algebra, but at least each chapter ends with a subchapter called “Application to Data Science”.

In the words of the authors:

“We (Prof. Wanmo Kang and Prof. Kyunghyun Cho) have been discussing over the past few years how we should teach linear algebra to students in this new era of data science and artificial intelligence.”

Worth checking out:

Source

Finding no effect is valuable information

Uncategorized

This article from Nature discussing the null result (“file drawer”) problem reminds me of a note I posted apparently four years ago: Analyzing data to do nothing

The article focuses on academic publishing in the natural sciences, but the problem is widespread, from business schools to small and large corporations. Positive, statistically significant results with a large effect size (?) are perceived and rewarded as superior to inconclusive (and apparently negative!) results.

While absence of evidence is not always evidence of absence, seeking an intervention (e.g., a new promotion, change in ad placement, revision in return policy) and finding no effect of the intervention is valuable information that should be appreciated as much as finding an effect. As I seem to have noted four years ago, “Rarity (of any effect) is expected simply because the probability of noise is often disproportionately higher.” To remember this is to recognize unintended consequences.

Source

Revisiting learning with LLMs (Understanding vs. knowing)

Uncategorized

Right after giving a short talk at the International Business Pedagogy Workshop on how best to use large language models for learning and discussing the topic with a great panel, I took a short break and never got a chance to revisit and reflect on our discussion.

In this talk, I originally focused on asking a few questions, but we did not have much time during the panel to discuss the answers to the questions. I’ve revisited my slides from that panel and added a slide with some answers (last slide). You can find the updated deck for “Mind the AI Gap: Understanding vs. Knowing” at https://ozer.gt/talks/mind-the-ai-gap.html

Bottom line:

To promote understanding, it’s better to treat LLMs as knowledge aggregators (not oracles with answers). This is what tools like Perplexity aim to do. Instead of asking simple and straightforward questions and expecting a correct answer, deconstructing questions (offline) before interacting with LLMs is likely to trigger reasoning and facilitate understanding (vs. just knowing).

In the classroom, this can be accomplished by encouraging students to think and reflect on the question offline before going online and interacting with LLMs. For example, a homework assignment can begin in the classroom with a task to deconstruct the problem before students go online to find a solution to the problem later.

Source

Podcast Episode 3: Matching and Causal Inference

Uncategorized

We just released another belated episode of our Data Spanners podcast (with Courtney Paulson). In this episode, we host the inimitable Sean Taylor and talk about matching (and re-matching), causal inference, and challenges in modeling different types of data (including “sequence data”). It’s an episode we had a lot of fun recording, and I bet you’ll enjoy listening to it (Spotify only).

We touch on big data, optimization, continued value of theory, System 1 and 2 loops, and modeling decisions in high stakes vs. low stakes problems. We also tackle tough questions like “What are the most important inputs to modeling data?” Data itself, creativity, domain expertise, or algorithms? I think we even mention AI at some point (pretty sure Sean brings it up!).

On a related note, but unrelated to the people involved in the making of this podcast episode, I’ll be posting some updates soon on our concept-in-progress “data centricity” and how assumptions play a critical but underappreciated role in modeling data and making models work. Stay tuned.

Source

Performance of large language models on counterfactual tasks

Uncategorized

I came across a post by Melanie Mitchell summarizing their recent research on understanding the capabilities of large language models (GPT in particular). LLMs seem to do relatively well at basic analogy (zero-generalization) problems, performing about 20% worse than humans in their replication study. However, the latest and supposedly best LLMs continue to fail at counterfactual tasks (which require reasoning beyond the content available in the training set), performing about 50% worse than humans. This is another study showing that the fundamental prerequisite for causal understanding is missing from the language models:

When tested on our counterfactual tasks, the accuracy of humans stays relatively high, while the accuracy of the LLMs drops substantially. The plot above shows the average accuracy of humans (blue dots, with error bars) and the accuracies of the LLMs, on problems using alphabets with different numbers of permuted letters, and on symbol alphabets (“Symb”). While LLMs do relatively well on problems with the alphabet seen in their training data, their abilities decrease dramatically on problems that use a new “fictional” alphabet. Humans, however, are able to adapt their concepts to these novel situations. Another research study, by University of Washington’s Damian Hodel and Jevin West, found similar results.

Our paper concluded with this: “These results imply that GPT models are still lacking the kind of abstract reasoning needed for human-like fluid intelligence.”

The post also refers to contradictory studies, but I agree with the comment about what counterfactual (abstract) thinking means, and thus why the results above make more sense:

I disagree that the letter-string problems with permuted alphabets “require that letters be converted into the corresponding indices.” I don’t believe that’s how humans solve them—you don’t have to figure out that, say, m is the 5^th letter and p is the 16^th letter to solve the problem I gave as an example above. You just have to understand general abstract concepts such as successorship and predecessorship, and what these means in the context of the permuted alphabet. Indeed, this was the point of the counterfactual tasks—to test this general abstract understanding.

Source

Business models for Gen AI

Uncategorized

You’ll get the semicolon joke only if you’ve coded in Java or the C family.

Here’s the part that’s not a joke:

Venture capital firm Sequoia estimates that in 2023, the AI industry spent $50 billion on the Nvidia chips used to train the generative AI models, but generated only $3 billion in revenue. My understanding is that the spending figure doesn’t even include the rest of the costs, just the chips.

It will take a serious creative leap to find a business model and close the gap between cost and value. Until business use cases move beyond better search and answer generation, the gap will continue to widen as the willingness to pay for existing services does not appear to be anywhere near the cost of development (user base growth for LLMs has already stalled).

Credit for the Venn goes to Forrest Brazeal.

The AI revolution is already losing steam

Uncategorized

These models work by digesting huge volumes of text, and it’s undeniable that up to now, simply adding more has led to better capabilities. But a major barrier to continuing down this path is that companies have already trained their AIs on more or less the entire internet, and are running out of additional data to hoover up. There aren’t 10 more internets’ worth of human-generated content for today’s AIs to inhale.

We need a difference in kind, not in degree. Existing language models are incapable of learning cause-effect relationships, and adding more data won’t change that.

Source

Reducing “understanding” to the ability to create a map of associations

Uncategorized

Reducing “understanding” to the ability to create a map of associations (even a highly successful map) is not helpful for business use cases. This leads to the illusion that existing large language models can “understand”.

The first image is an excerpt from the latest Anthropic article claiming that LLMs can understand (otherwise a very useful article, here). OpenAI also often refers to AGI or strong AI in its product releases.

The following screenshots from Reddit are one of many illustrations of why such a reductionist approach is neither accurate nor helpful. Without the ability to map causal relationships, knowledge doesn’t translate into understanding.

We will have the best business use cases for LLMs only if we define the capabilities of these models correctly. Let’s say a business analyst wants to take a quick look at some sales numbers in an exploratory analysis. They would interact with an LLM very differently if they were told that the model understands versus just knows more (and potentially better).

Student’s t-test during a study leave at Guinness

Uncategorized

You may or may not know that the Student’s t-test was named after William Sealy Gosset, head experimental brewer at Guinness, who published under the pseudonym “Student”. This is because Guinness preferred its employees to use pseudonyms when publishing scientific papers.

The part I didn’t know is that Guinness had a policy of granting study leave to technical staff, and Gosset took advantage of this during the first two terms of the 1906-1907 academic year. This sounds like a great idea to encourage boundary spanning.

This article is a very nice account of the story, with nice visuals (which will definitely make it into the beer preference example in my predictive analytics course).

Source

Mind the AI Gap: Understanding vs. Knowing

Uncategorized

Next week I will be speaking at the 2024 International Business Pedagogy Workshop on the use of AI in education. This gave me the opportunity to put together some ideas about using LLMs as learning tools.

Some key points:
– Knowing is not the same as understanding, but can easily be confused.
– Understanding requires the ability to reason and identify causality (using counterfactual thinking).
– This is completely lacking in LLMs at the moment.
– Framing LLMs as magical thinking machines or creative minds is not helpful because it can easily mislead us to lend our cognition.
– The best way to benefit from LLMs is to recognize them for what they are: Master memorization models.
– Their power lies in their sheer processing power and capacity, which can make them excellent learning companions when used properly.
– How can LLMs best be used as learning companions? That’ll be part of our discussion.

Source

Are the two images the same?

Uncategorized

To humans, the answer is undoubtedly yes. To algorithms, they could easily be two completely different images, if not mistaken in their characteristics. The image on the right is the 𝘨𝘭𝘢𝘻𝘦𝘥 version of the original image on the left.

Glazed is a product of the SAND Lab at the University of Chicago that helps artists protect their art from generative AI companies. Glaze adds noise to artwork that is invisible to the human eye but misleading to the algorithm.

Glaze is free to use, but understandably not open source, so as not to give art thieves an advantage in adaptive responses in this cat-and-mouse game.

The idea is similar to the adversarial attack famously discussed in Goodfellow et al. (2015), where a panda predicted with low confidence becomes a sure gibbon to the algorithm after adversarial noise is added to the image.

I heard about this cool and useful project a while ago and have been meaning to help spread the word. In the words of the researchers:

Glaze is a system designed to protect human artists by disrupting style mimicry. At a high level, Glaze works by understanding the AI models that are training on human art, and using machine learning algorithms, computing a set of minimal changes to artworks, such that it appears unchanged to human eyes, but appears to AI models like a dramatically different art style. For example, human eyes might find a glazed charcoal portrait with a realism style to be unchanged, but an AI model might see the glazed version as a modern abstract style, a la Jackson Pollock. So when someone then prompts the model to generate art mimicking the charcoal artist, they will get something quite different from what they expected.

The sample artwork is by Jingna Zhang.

Source