Ongoing debate: are LLMs reasoning or not?

There are now so many papers testing the capabilities of LLMs that I increasingly rely on thoughtful summaries like this one.

The word ‘reasoning’ is an umbrella term that includes abilities for deduction, induction, abduction, analogy, common sense, and other ‘rational’ or systematic methods for solving problems. Reasoning is often a process that involves composing multiple steps of inference. Reasoning is typically thought to require abstraction—that is, the capacity to reason is not limited to a particular example, but is more general. If I can reason about addition, I can not only solve 23+37, but any addition problem that comes my way. If I learn to add in base 10 and also learn about other number bases, my reasoning abilities allow me to quickly learn to add in any other base.
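As a toy illustration of that kind of abstraction (my own example, not from the papers), here is a base-agnostic addition sketch in Python: the same procedure that solves 23+37 in base 10 works unchanged in base 8 or base 2.

```python
def add_in_base(a_digits, b_digits, base):
    """Add two numbers given as equal-length digit lists (most significant digit first)."""
    result, carry = [], 0
    # Walk the digits from least significant to most significant, carrying as we go
    for da, db in zip(reversed(a_digits), reversed(b_digits)):
        total = da + db + carry
        result.append(total % base)
        carry = total // base
    if carry:
        result.append(carry)
    return list(reversed(result))

print(add_in_base([2, 3], [3, 7], base=10))  # [6, 0]  -> 23 + 37 = 60 in base 10
print(add_in_base([2, 3], [3, 7], base=8))   # [6, 2]  -> 23 + 37 = 62 in base 8
```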

Abstraction is key to imagination and counterfactual reasoning, and thus to establishing causal relationships. We don’t have it (yet) in LLMs, as the three papers summarized here and others show (assuming robustness is a necessary condition).

Is that a deal breaker? Clearly not. LLMs are excellent assistants for many tasks, and productivity gains are already documented.

Perhaps if LLMs weren’t marketed as thinking machines, we could have focused more of our attention on how best to use them to solve problems in business and society.

Nonetheless, the discussion around reasoning seems to be advancing our understanding of our thinking and learning process vis-à-vis machine learning, and that’s a good thing.

The illusion of information adequacy

A new PLOS ONE study coined this term to describe people’s strong tendency to believe they have enough information to make an informed decision – regardless of how much of the relevant information they actually have.

In the study, participants responded to a hypothetical scenario about a water issue involving a school: control participants were given full information, while treatment participants were given about half of it. The treatment participants nevertheless believed they had comparably adequate information and felt equally competent to make thoughtful decisions based on it.

In essence, the study shows that people assume they have enough information – even when they lack half of the relevant information. This extends to data science, where the data at hand is often assumed to be sufficient for decision making, even though assumptions (implicit or explicit) fill the gaps between the data, the model, and the resulting decision. We briefly discuss this idea of data centricity at datacentricity.org (and more to come).

Image courtesy of learningrabbithole.com.

Programming is solved by LLMs, isn’t it?

AI should virtually eliminate coding and debugging.

This is a direct quote from an IBM report published in 1954 (here, page 2), if you replace AI with Fortran. It didn’t happen, and not because Fortran wasn’t revolutionary at the time: it was the first commercial compiler and took 18 person-years to develop.

Compilers didn’t “solve” programming, and neither do LLMs. LLMs help solve (part of) the problem. They don’t solve exception handling. I wrote before about exception handling (or the lack thereof) in most machine learning applications. We need to pay more attention to it.

Exception handling is difficult, if not impossible, to automate away because of the complexity and unintended consequences of human-machine (user-model) interactions. LLMs can certainly be useful for generating alternative scenarios and building solutions for them.
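To make that concrete, here is a purely hypothetical sketch (the feature names, thresholds, and fallback rule are mine) of what exception handling around a model, rather than inside it, can look like:

```python
import math

def predict_price(model, features: dict) -> float:
    """Wrap a model call with the exception handling the model itself cannot provide."""
    required = {"sqft", "bedrooms", "zipcode"}
    missing = required - features.keys()
    if missing:
        # Exception: the caller did not supply what the model was trained on
        raise ValueError(f"Missing features: {sorted(missing)}")

    if not (100 <= features["sqft"] <= 20_000):
        # Exception: input is far outside the training range; do not trust the model
        return fallback_estimate(features)

    prediction = model.predict(features)  # hypothetical model interface
    if prediction is None or math.isnan(prediction) or prediction <= 0:
        # Exception: the model returned a nonsensical value; fall back instead of failing silently
        return fallback_estimate(features)
    return prediction

def fallback_estimate(features: dict) -> float:
    # Hypothetical rule of thumb; a real system might route to a simpler model or a human
    return 150.0 * features["sqft"]
```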

We will continue to benefit from the models that are increasingly available to us, including LLMs. Just remembering that the problem is not just pattern recognition, but also exception handling, should help us think about how best to use these models to solve problems.

This essay looks at the same issue from a software development perspective. From the essay:

You’d think 15 years into the smartphone revolution most people could operate an order kiosk or self-checkout without help. That’s certainly what stores had hoped. But as these roll out, you can see how these systems are now staffed by people there to handle the exceptions. Amazon Go will surely be seen as ahead of its time, but those stores are now staffed full time and your order is checked on the way out. And special orders at McDonald’s? Head to the counter 🙂

Mathematical Methods in Data Science (with Python)

Just came across this neat resource while looking for an MCMC / Gibbs sampling code example in object recognition. Self-description of the book:

This textbook on the mathematics of data has two intended audiences:

  • For students majoring in math or other quantitative fields like physics, economics, engineering, etc.: it is meant as an invitation to data science and AI from a rigorous mathematical perspective.
  • For mathematically-inclined students in data science related fields (at the undergraduate or graduate level): it can serve as a mathematical companion to machine learning, AI, and statistics courses.

Not yet published, but you can check it out here.

Podcast-style discussions on Data Duets

You should not add 1 before log-transforming zeros. If you don’t believe me, listen to these two experts on how to make better decisions using log-transformed data.

This conversation was produced by NotebookLM based on our discussion about the Log of Zero problem at Data Duets. Duygu Dagli and I have now added a podcast-style conversation to each of our articles. All audio is raw/unedited.

The conversations are usually fun (sometimes for odd reasons). The model adds (1) examples we don’t have in the original content and (2) light banter and some jokes. The examples are hit or miss.

So, besides the usual deep and reinforcement learning backend, what does NotebookLM do? (based on Steven Johnson’s description on the Vergecast)

  1. Start with a draft and revise it
  2. Generate a detailed script of the podcast
  3. Critique the script and create a revised version
  4. Add disfluencies (um, uh, like, you know, c-c-can, sssssee…) to sound convincingly human
  5. Apply Google’s latest text-to-speech Gemini model to add intonation, emphasis, and pacing

Have fun, and don’t add 1 to your variables before applying the log transformation.
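Why not add 1? A minimal sketch of one well-known pitfall, assuming statsmodels (my toy example; the Data Duets article has the full discussion): the coefficient from a log(1+y) regression depends on the units of y, while a Poisson regression on the untransformed outcome recovers the same effect regardless of units.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 10_000
x = rng.binomial(1, 0.5, n)                      # e.g., treated vs. control
y = rng.poisson(np.exp(0.5 + 0.3 * x))           # outcome with many zeros, true effect 0.3
X = sm.add_constant(x)

for scale, label in [(1, "original units"), (100, "rescaled by 100")]:
    ys = y * scale
    log1p_fit = sm.OLS(np.log1p(ys), X).fit()                        # log(1 + y)
    poisson_fit = sm.GLM(ys, X, family=sm.families.Poisson()).fit()  # untransformed y
    print(f"{label:16s}  log1p effect: {log1p_fit.params[1]:.3f}   "
          f"Poisson effect: {poisson_fit.params[1]:.3f}")
```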

Good data science, bad data science

…and why the difference matters.

We can call data science the practice of making (high-quality) decisions using data.

The order is (1) decision making (2) using data, not (1) decision-driven (2) data. So, ideally, it’s not stirring the data pile for evidence to support a decision that has already been made.

That’s a good place to start. We also need to:

  1. Make the business case really well in advance. Bringing in a half-baked problem or asking the wrong question won’t lead to the best insights.
  2. Understand what the models can and cannot do. We certainly need more of this in the LLM land. A Gen AI project is cool, but is it what the problem needs?
  3. Stick to the data. Data is real. Models add assumptions. Whether it’s experimental or observational, predictive or causal, the data must rule.
  4. Divide, focus, and conquer. Prioritize the most important needs. You can measure and track all metrics, but that’s probably not what you really need.
  5. Align the problem and available data with the assumptions embedded in the modeling solution. Testing the assumptions is the only way to know what’s real and what’s not.
  6. Choose the better solution over the faster one, and the simple solution over the complicated one for long-term value creation. This needs no explanation.

Some rules of good (vs. bad) data science, based on insights from projects I’ve been involved with in one way or another. #3 and #5 are most closely related to a framework we are working on: data centricity.

Image courtesy of xkcd.com

Concept drift and data centricity

This plot shows how coefficients in a linear model can change (not only in effect size, but also in sign) as new data is added to the training set (as a result of data or concept drift). Think of it as new retail sales data being added to the set over time.

In the plot, b is the coefficient of interest and z is the proportion of new data (Population 2) gradually added to the existing training data (Population 1). First, all the data is from P1 (so z is 0), then it’s 75% P1 and 25% P2 (z is 0.25), and so on.

As we add more of the new data, we observe how the estimated effect changes: it starts out negative, becomes positive, then turns negative again. When the old and new data are equally mixed (z is 0.50), the previously negative effect disappears.

This thought experiment (by John Mount) reminds me of Lord’s Paradox (John calls it a continuous version of Simpson’s Paradox and that’s another way of putting it).
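Here is a stripped-down simulation of the same idea (my own toy setup, not John’s notebook, which is linked below): both populations share the same within-population slope but sit in different locations, so the pooled coefficient flips sign as the mixing proportion changes.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

def population(mean_x, slope, intercept):
    x = rng.normal(mean_x, 1, n)
    y = intercept + slope * x + rng.normal(0, 1, n)
    return x, y

# Two populations with the same within-population slope (-1) but different locations
x1, y1 = population(mean_x=0, slope=-1.0, intercept=0.0)   # Population 1 (old data)
x2, y2 = population(mean_x=4, slope=-1.0, intercept=8.0)   # Population 2 (new data)

for z in [0.0, 0.25, 0.5, 0.75, 1.0]:
    k = int(z * n)                                  # share of Population 2 mixed in
    x = np.concatenate([x1[: n - k], x2[:k]])
    y = np.concatenate([y1[: n - k], y2[:k]])
    b = np.polyfit(x, y, 1)[0]                      # slope of the pooled linear fit
    print(f"z = {z:.2f}  ->  b = {b:+.2f}")         # sign flips as z grows, then flips back
```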

The data changes, but the model assumptions remain the same, and that’s a problem. This is another example of why staying true to the data, or data centricity, is critical to getting the right insights from models for decision making.

You can find the Python code walkthrough and Jupyter notebook here. If you want to learn more about data centricity, here is a one-pager.

Discrepancies in standard errors: R vs. Python

You may have modeled (or asked your data science team to model) the same data in R and Python. Why? Most data science teams use both R and Python, with team members specializing in one or the other. So, this could be a model changing hands. Or maybe you wanted to make sure a package implementation behaved as intended. You may also have needed better computational efficiency (R’s fixest can be much faster than Python’s linearmodels on panel data).

For whatever reason, when you run models in R and Python, you may have run into the following situation: The parameter estimates are the same, but the standard errors (and p-values) are different. The data and the model are exactly the same. So you can’t explain why, and you don’t know which standard error / statistical significance test to trust and report to the business.

If you’re curious about the most common reason, check out another previously missing section now published in the Causal Book, here. We now discuss this as part of our exercise on applying the same instrumental variable model in R vs. Python.
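Without giving away the explanation in the Causal Book, here is a generic Python illustration (my example, not theirs) of the underlying phenomenon: on identical data, the point estimate stays the same while the standard error depends entirely on which covariance estimator and small-sample correction you ask for, which is exactly the kind of default that can differ between R and Python packages.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 200
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n) * (1 + 0.5 * np.abs(x))  # heteroskedastic noise
X = sm.add_constant(x)

# Same data, same OLS coefficients, different covariance estimators
for cov in ["nonrobust", "HC0", "HC1", "HC3"]:
    res = sm.OLS(y, X).fit(cov_type=cov)
    print(f"{cov:9s}  beta = {res.params[1]:.4f}   se = {res.bse[1]:.4f}")
```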

Good assumptions make good decisions


As algorithms get better at processing data (and as we have “thinking” LLMs), we need to focus on better thinking for decision making.

Good decisions combine available information with good thinking and sound reasoning. Assumptions then fill in the blanks left by incomplete information. The more reasonable the assumptions, the better the decision.

The same is true when analyzing data to support decision making. Modeling data involves assumptions, both method-specific and model-specific. If the assumptions are sound, a decision based on a model’s insights is more likely to be a good one.

Staying true to the actual data at hand while making decisions based on that data is data centricity. One way to achieve data centricity is to look for model-free (i.e., assumption-free) evidence before spending any red ink to connect the dots.
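As a small, hypothetical illustration of model-free evidence (variable names are mine): before fitting any model, tabulate raw group means and check whether the pattern you intend to model is already visible in the raw data.

```python
import numpy as np
import pandas as pd

# Hypothetical question: does a promotion lift weekly sales?
rng = np.random.default_rng(7)
n = 1_000
df = pd.DataFrame({
    "promo": rng.binomial(1, 0.4, n),
    "store_size": rng.choice(["small", "large"], n),
})
df["sales"] = 100 + 20 * df["promo"] + 50 * (df["store_size"] == "large") + rng.normal(0, 10, n)

# Model-free look: raw means by group, before any regression (and its assumptions)
print(df.groupby(["store_size", "promo"])["sales"].agg(["mean", "count"]).round(1))
```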

Original image courtesy of xkcd.com

Update on LLM capabilities

With OpenAI’s release of ChatGPT o1, I revisited my talk on learning with LLMs. In this talk, I focus on the advantages and disadvantages of using LLMs for professional learning. The discussion distinguishes between knowing and understanding, and underlines that identifying causality is central to our understanding. The link between the two is the ability to reason (counterfactual reasoning in particular).

Since yesterday we seem to have a “thinking” and “reasoning” LLM. So I asked OpenAI o1 the same question I asked ChatGPT 4o before. What an improvement: OpenAI’s model went from failing to reason to talking nonsense to hide its failure to reason. These slides are from the original talk (next to be presented in December). You can see the entire deck here.

While I can only naively wish that this was intentional, I must still congratulate OpenAI for creating a model that masters fallacies like equivocation and red herring.