You’re absolutely right!

[Click title for image]

This is so hilarious I had to share. A major issue with using LLMs is their overly obsequious behavior. Being told I’m right isn’t much help; I don’t want to be right, I want to be corrected.

This project uses a Python script to count how often Claude Code says you’re “absolutely right.” The script doesn’t seem to normalize the counts by usage, which might be a good next step.
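
For a sense of what such a counter could look like, here is a minimal sketch (not the project’s actual script): it scans a folder of plain-text transcripts, counts the phrase, and divides by the number of transcripts as a rough first stab at normalizing by usage. The directory name and file layout are assumptions.

```python
import re
from pathlib import Path

# Minimal sketch, not the project's actual script: count occurrences of the
# phrase "absolutely right" across a folder of plain-text transcripts and
# normalize by the number of transcripts as a crude proxy for usage.
PHRASE = re.compile(r"absolutely right", re.IGNORECASE)

def count_phrase(transcript_dir: str) -> None:
    files = list(Path(transcript_dir).glob("*.txt"))  # assumed file layout
    hits = sum(len(PHRASE.findall(f.read_text(errors="ignore"))) for f in files)
    if files:
        print(f"{hits} occurrences in {len(files)} transcripts "
              f"({hits / len(files):.2f} per transcript)")
    else:
        print("No transcripts found.")

if __name__ == "__main__":
    count_phrase("transcripts")  # hypothetical directory of exported sessions
```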

Source | Script

Student learning with LLMs

In January, I wrote a short note based on one of my talks: “How to use LLMs for learning in 2025.” In that note, I differentiated between using LLMs (1) to learn and (2) to do. With the new semester now underway, I’ve checked some usage numbers and read the Ammari et al. (2025) paper on how students use ChatGPT. I was particularly interested in the second RQ: “Which usage patterns correlate with continued or increased reliance on ChatGPT over time?”

Over-reliance on any tool, whatever it is, is a potential red flag for persistent learning, especially when the goal is comprehension. For example, understanding derivatives and calculating them with a computer are two distinct learning objectives. If reliance on a tool substitutes for understanding, the long-term implications may not be a net positive.

The article does not really answer the reliance part of the question. It does, however, report some interesting correlations between LLM behavior and student engagement. Notably, when ChatGPT asks for clarifications, provides unintended or inconsistent answers, or communicates its limitations, students are less likely to continue using it.

Plausible, but what these correlations mean for learning and comprehension is unclear. What is the next step after disengagement? Do students switch to another LLM to get a direct answer without having to answer follow-up questions, or do they go back to figuring things out on their own?

Class of 2029, I guess the answer lies with you. Welcome!

Source | Paper

Using AI at work: Hype vs. reality

A recent New York Times story on how people actually use AI at work offers some useful insights.

Using LLMs as a “second pair of eyes” or as a fallible assistant seems to work well. Automation also works effectively when the instructions are clear and the objectives are defined unambiguously. In both cases, human agency remains central.

Use case #15 in the article, “Review medical literature,” reminded me of a study I shared earlier (How do LLMs report scientific text?). The study showed that LLMs systematically exaggerate claims they found in the original text. The user in this case is a medical imaging scientist and is aware of the danger. When a tool isn’t foolproof, the user’s expertise and awareness make all the difference.

The high-demand use cases are quickly scaling into independent businesses with more standardized output, often little more than wrappers around an LLM core. I suspect some are marketed as “magic,” and to resist that hype, users will need a combination of expertise and awareness.

AI in 64 pictures: A visual journey

If you’re a visual learner looking to deepen your understanding of AI and language models:

I’ve just made the deck from my recent talk, “AI in 64 Pictures,” available. It’s a visual journey through language processing: from word embeddings to RNNs, attention, and transformers.

Understanding AI models better helps us discover more use cases and navigate their limitations. And if you’re looking to dive deeper, you can follow the links in the slides.

AI agents and failing projects

Nearly half of all AI agent projects are set to fail (as Gartner predicts here). Why? Unclear business value, inadequate risk controls, and escalating costs.

As I see it, much of this is fueled by hype, leading to existing solutions being relabeled as “Agentic AI” without any rethinking of business processes.

Human creativity is missing from this picture. It’s this creative thinking that should move agent use beyond automating or augmenting individual tasks with LLMs, toward redesigned business processes and a vision of how humans and AI can truly complement each other.

The risks and costs are more straightforward to resolve:

– Managers who are most excited about AI agents often do not fully understand the risks and limitations of LLMs. They should invest as much in understanding these models as they do in using them.

– The true cost of scaling proof-of-concept GenAI solutions is often underestimated. Part of this comes down to selecting the right vendor: Gartner estimates that only about 130 of the thousands of agentic AI vendors are real.

Everybody lies, but why?

Andrew Gelman’s latest “rant” is worth a read. Everybody lies, but why do they, even when the data clearly refutes the lie?

It’s worth thinking a little more about why and how people lie, especially scientists, medical doctors, and law enforcement officials. Spoiler: the answer is not always money.

LLM training and fair use

[Click title for image]

Turns out Anthropic literally purchased and scanned millions of print books to train its language model Claude.

The judge finds that the scanning of these purchased books is “fair use” even without licensing, and goes on to say:

Everyone reads texts, too, then writes new texts. They may need to pay for getting their hands on a text in the first instance. But to make anyone pay specifically for the use of a book each time they read it, each time they recall it from memory, each time they later draw upon it when writing new things in new ways would be unthinkable. For centuries, we have read and re-read books. We have admired, memorized, and internalized their sweeping themes, their substantive points, and their stylistic solutions to recurring writing problems.

This is quite interesting, as it draws an analogy between how humans read, comprehend, and write text and how a language model operates.

Source

How to use LLMs as coding assistants

[Click title for image]

One way is hitting that accept button until the code works. This is like gambling: it works until it doesn’t, and you never know why in either case.

Another way is intentional use. Intentional use is:

  • Telling LLMs exactly what to do.
  • Inspecting their code, line by line, before accepting.
  • Unit testing the solution before it ever sees production (see the sketch below).
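
For the last point, here is a minimal pytest sketch. The helper parse_price is hypothetical, a stand-in for some LLM-generated function you are about to accept; the point is that the tests encode what you expect before the code goes anywhere near production.

```python
# Hypothetical example: parse_price stands in for an LLM-generated helper
# under review. The tests pin down expected behaviour before acceptance.
import pytest

def parse_price(text: str) -> float:
    """Candidate implementation pasted in from the LLM for inspection."""
    cleaned = text.strip().lstrip("$").replace(",", "")
    return float(cleaned)

def test_plain_number():
    assert parse_price("1234.5") == 1234.5

def test_dollar_sign_and_commas():
    assert parse_price("$1,234.56") == 1234.56

def test_surrounding_whitespace():
    assert parse_price("  $99 ") == 99.0

def test_rejects_garbage():
    with pytest.raises(ValueError):
        parse_price("not a price")
```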

This means treating LLMs as over-confident, lightning-fast assistants. They are incredibly helpful for boosting productivity, quickly looking things up, and delivering working code. They can search 100 pages of API documentation and find a solution to your problem in seconds.

Unless we see a structural breakthrough in how language models are built, this is also the best we can have: an over-confident assistant. LLMs don’t think or reason; they achieve (shallow) deductive closure at best. While the debate over whether LLMs “think” can be unproductive, there is a practical implication: LLMs make profoundly inhuman coding mistakes.

The errors LLMs make aren’t the errors a human assistant would make, so working with LLMs requires a perspective shift. Understanding this distinction is key to using them effectively: our coding assistant is deeply inhuman.

Otherwise, LLM-driven coding will inevitably lead to more failures in data science. Expect to hear more stories about models breaking unexpectedly.

This post was inspired by the write-up “Here’s how I use LLMs to help me write code,” which nails many other crucial points. It’s worth checking out.

Using LLMs for IV discovery and data

LLMs excel at search and discovery. Why not use them to find instrumental variables (IVs) for causal models?

In a new section in Causal Book, Using LLMs for IV discovery and data, we offer a prompt template to help discover candidate IVs and their actual data. We tested it with the latest Gemini (2.5 Pro Preview 06-05-2025) and the results are promising.
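
To make the idea concrete, here is a hedged sketch of how such a query could be sent to Gemini. This is not the Causal Book template; the prompt wording, model id, and google-generativeai calls are illustrative assumptions, so see the section itself for the actual template.

```python
# Illustrative sketch only, not the Causal Book template. Uses the
# google-generativeai SDK; the model id and prompt wording are assumptions.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-2.5-pro-preview-06-05")

PROMPT = """You are helping design an instrumental-variable (IV) study.
Treatment: {treatment}
Outcome: {outcome}

1. Propose candidate instruments and explain why each plausibly satisfies
   relevance and the exclusion restriction.
2. For each candidate, point to publicly available datasets that contain
   the instrument, the treatment, and the outcome, with links.
"""

response = model.generate_content(
    PROMPT.format(treatment="years of schooling", outcome="earnings")
)
print(response.text)  # candidate IVs plus pointers to data sources
```

The schooling/earnings pair is just the textbook example; swap in your own treatment and outcome.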

This section is the latest addition to the IV design pattern chapter of Causal Book. The book itself aims to:

  1. provide solution patterns and their code implementations in R and Python,
  2. discuss different approaches to the same pattern on the same data (Statistics, Machine Learning, Bayesian),
  3. demystify some surprising (or seemingly surprising) challenges in applying the causal design patterns.

See the full table of contents here.

We’ll next dive into the regression discontinuity design pattern, which I hope will be even more fun with the newly added support in DoubleML.

How do LLMs report scientific text?

In short, by exaggerating.

It seems so convenient: when you are short of time, asking ChatGPT or another chatbot to summarise a scientific paper to quickly get a gist of it. But in up to 73 per cent of the cases, these large language models (LLMs) produce inaccurate conclusions…

Over a year, we collected 4,900 summaries. When we analysed them, we found that six of ten models systematically exaggerated claims they found in the original texts.

This prompts the question: why? My initial thought was that since these models are built to generalize, and exaggeration is a form of faulty generalization, they might actually be working as intended. This aligns with the study’s finding that LLMs are nearly five times more likely to produce broad generalizations than humans.

More interestingly, attempting to prompt LLMs for accuracy often backfires. They tend to double down on generalizations, becoming twice as likely to produce overgeneralized answers when asked to stick to the facts.

On the bright side, LLMs can quickly replace the media when it comes to science reporting.

Source | Paper