Podcast Episode 3: Matching and Causal Inference

We just released another belated episode of our Data Spanners podcast (with Courtney Paulson). In this episode, we host the inimitable Sean Taylor and talk about matching (and re-matching), causal inference, and challenges in modeling different types of data (including “sequence data”). It’s an episode we had a lot of fun recording, and I bet you’ll enjoy listening to it (Spotify only).

We touch on big data, optimization, the continued value of theory, System 1 and 2 loops, and modeling decisions in high-stakes vs. low-stakes problems. We also tackle tough questions like “What is the most important input to modeling data: the data itself, creativity, domain expertise, or algorithms?” I think we even mention AI at some point (pretty sure Sean brings it up!).

On a related note, but unrelated to the people involved in the making of this podcast episode, I’ll be posting some updates soon on our concept-in-progress “data centricity” and how assumptions play a critical but underappreciated role in modeling data and making models work. Stay tuned.

Source

Performance of large language models on counterfactual tasks

I came across a post by Melanie Mitchell summarizing her recent research on understanding the capabilities of large language models (GPT in particular). LLMs seem to do relatively well at basic analogy (zero-generalization) problems, performing about 20% worse than humans in their replication study. However, the latest and supposedly best LLMs continue to fail at counterfactual tasks (which require reasoning beyond the content available in the training set), performing about 50% worse than humans. This is another study showing that the fundamental prerequisite for causal understanding is missing from these language models:

When tested on our counterfactual tasks, the accuracy of humans stays relatively high, while the accuracy of the LLMs drops substantially.  The plot above shows the average accuracy of humans (blue dots, with error bars) and the accuracies of the LLMs, on problems using alphabets with different numbers of permuted letters, and on symbol alphabets (“Symb”).  While LLMs do relatively well on problems with the alphabet seen in their training data, their abilities decrease dramatically on problems that use a new “fictional” alphabet. Humans, however, are able to adapt their concepts to these novel situations. Another research study, by University of Washington’s Damian Hodel and Jevin West, found similar results.

The post quotes the paper’s conclusion: “These results imply that GPT models are still lacking the kind of abstract reasoning needed for human-like fluid intelligence.”

The post also refers to contradictory studies, but I agree with Mitchell’s comment on what counterfactual (abstract) thinking means, and therefore on why the results above make more sense:

I disagree that the letter-string problems with permuted alphabets “require that letters be converted into the corresponding indices.” I don’t believe that’s how humans solve them—you don’t have to figure out that, say, m is the 5th letter and p is the 16th letter to solve the problem I gave as an example above. You just have to understand general abstract concepts such as successorship and predecessorship, and what these mean in the context of the permuted alphabet. Indeed, this was the point of the counterfactual tasks—to test this general abstract understanding.
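To make the task concrete, here is a small sketch of a letter-string analogy over a hypothetical permuted alphabet (my own illustration, not the authors' test items); solving it requires only the abstract notion of successorship within the given alphabet, not letter indices:

```python
# Illustrative sketch (not the authors' code): a letter-string analogy posed over
# a made-up permuted, "fictional" alphabet.
permuted = list("dwmbpknqrstuvaxyzcefghijlo")   # hypothetical permuted alphabet

def successor(ch, alphabet):
    """Return the next letter in the (possibly permuted) alphabet."""
    return alphabet[(alphabet.index(ch) + 1) % len(alphabet)]

def apply_same_rule(target, alphabet):
    """The analogy 'd w m' -> 'd w b' replaces the last letter with its successor
    in the permuted alphabet; apply the same abstract rule to a new string."""
    return target[:-1] + successor(target[-1], alphabet)

print(apply_same_rule("pkn", permuted))   # 'p k n' -> 'p k q'
```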

Source

Business models for Gen AI

You’ll get the semicolon joke only if you’ve coded in Java or the C family.

Here’s the part that’s not a joke:

Venture capital firm Sequoia estimates that in 2023, the AI industry spent $50 billion on the Nvidia chips used to train the generative AI models, but generated only $3 billion in revenue. My understanding is that the spending figure doesn’t even include the rest of the costs, just the chips.

It will take a serious creative leap to find a business model and close the gap between cost and value. Until business use cases move beyond better search and answer generation, the gap will continue to widen: the willingness to pay for existing services does not appear to be anywhere near the cost of development, and user-base growth for LLMs has already stalled.

Credit for the Venn goes to Forrest Brazeal.

The AI revolution is already losing steam

These models work by digesting huge volumes of text, and it’s undeniable that up to now, simply adding more has led to better capabilities. But a major barrier to continuing down this path is that companies have already trained their AIs on more or less the entire internet, and are running out of additional data to hoover up. There aren’t 10 more internets’ worth of human-generated content for today’s AIs to inhale.

We need a difference in kind, not in degree. Existing language models are incapable of learning cause-effect relationships, and adding more data won’t change that.

Source

Reducing “understanding” to the ability to create a map of associations

Reducing “understanding” to the ability to create a map of associations (even a highly successful map) is not helpful for business use cases. This leads to the illusion that existing large language models can “understand”.

The first image is an excerpt from the latest Anthropic article claiming that LLMs can understand (otherwise a very useful article, here). OpenAI also often refers to AGI or strong AI in its product releases.

The following screenshots from Reddit offer one of many illustrations of why such a reductionist approach is neither accurate nor helpful. Without the ability to map causal relationships, knowledge doesn’t translate into understanding.

We will have the best business use cases for LLMs only if we define the capabilities of these models correctly. Let’s say a business analyst wants to take a quick look at some sales numbers in an exploratory analysis. They would interact with an LLM very differently if they were told that the model understands versus just knows more (and potentially better).

Student’s t-test during a study leave at Guinness

You may or may not know that the Student’s t-test was named after William Sealy Gosset, head experimental brewer at Guinness, who published under the pseudonym “Student”. This is because Guinness preferred its employees to use pseudonyms when publishing scientific papers.

The part I didn’t know is that Guinness had a policy of granting study leave to technical staff, and Gosset took advantage of this during the first two terms of the 1906-1907 academic year. This sounds like a great idea to encourage boundary spanning.

This article gives a very nice account of the story, with great visuals (which will definitely make it into the beer preference example in my predictive analytics course).

Source

Mind the AI Gap: Understanding vs. Knowing

Next week I will be speaking at the 2024 International Business Pedagogy Workshop on the use of AI in education. This gave me the opportunity to put together some ideas about using LLMs as learning tools.

Some key points:
– Knowing is not the same as understanding, but can easily be confused.
– Understanding requires the ability to reason and identify causality (using counterfactual thinking).
– This is completely lacking in LLMs at the moment.
– Framing LLMs as magical thinking machines or creative minds is not helpful because it can easily mislead us into lending them our cognition.
– The best way to benefit from LLMs is to recognize them for what they are: Master memorization models.
– Their power lies in their sheer processing power and capacity, which can make them excellent learning companions when used properly.
– How can LLMs best be used as learning companions? That’ll be part of our discussion.

Source

Are the two images the same?

To humans, the answer is undoubtedly yes. To algorithms, they could easily be two completely different images, or at least images whose characteristics are badly misread. The image on the right is the 𝘨𝘭𝘢𝘻𝘦𝘥 version of the original image on the left.

Glaze is a product of the SAND Lab at the University of Chicago that helps artists protect their art from generative AI companies. It adds noise to artwork that is invisible to the human eye but misleading to the algorithm.

Glaze is free to use, but understandably not open source, so as not to give art thieves an advantage in adaptive responses in this cat-and-mouse game.

The idea is similar to the adversarial attack famously discussed in Goodfellow et al. (2015), where a panda predicted with low confidence becomes a sure gibbon to the algorithm after adversarial noise is added to the image.
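For readers who haven’t seen that example, here is a minimal sketch of the fast gradient sign method from that paper, written in PyTorch. The model, image, label, and epsilon below are placeholders, and this is the generic attack idea, not Glaze’s own (unreleased) procedure:

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, label, epsilon=0.007):
    """Fast gradient sign method (Goodfellow et al., 2015): nudge each pixel a
    tiny step in whichever direction increases the classification loss."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # The perturbation is too small for humans to notice, yet it can flip the
    # model's prediction (panda -> gibbon in the paper's famous example).
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()
```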

I heard about this cool and useful project a while ago and have been meaning to help spread the word. In the words of the researchers:

Glaze is a system designed to protect human artists by disrupting style mimicry. At a high level, Glaze works by understanding the AI models that are training on human art, and using machine learning algorithms, computing a set of minimal changes to artworks, such that it appears unchanged to human eyes, but appears to AI models like a dramatically different art style. For example, human eyes might find a glazed charcoal portrait with a realism style to be unchanged, but an AI model might see the glazed version as a modern abstract style, a la Jackson Pollock. So when someone then prompts the model to generate art mimicking the charcoal artist, they will get something quite different from what they expected.

The sample artwork is by Jingna Zhang.

Source

Hardest problem in Computer Science: Centering things

This is a must-read/see article full of joy (and pain) for visually obsessed people. It’s a tribute to symmetry and a rebuke to non-random, unexplained errors in achieving it.

Centering is more than a computer science problem. We struggle with centering all the time, from hanging frames on the wall to landscaping. In yet another domain, centering is also central to data science, as in standardized scores and other rescaling operations. Centering gives us a baseline against which to compare everything else. Our brains love this symmetry (as explained here and elsewhere).
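For the data science sense of centering mentioned above, here is a minimal sketch (with made-up numbers) of centering and standardizing a variable:

```python
import numpy as np

# Centering in the data-science sense: subtract the mean so the baseline is zero,
# then (optionally) divide by the standard deviation to get standardized z-scores.
x = np.array([12.0, 15.0, 9.0, 22.0, 17.0])   # made-up values for illustration

centered = x - x.mean()                 # mean becomes 0: a common baseline
z_scores = centered / x.std(ddof=1)     # unit variance: comparable across variables

print(centered.mean().round(10), z_scores.std(ddof=1).round(10))  # ~0.0 and 1.0
```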

Source

“Medium is for human storytelling, not AI-generated writing.”

Medium appears to be the first major publishing platform to adopt a policy banning the monetization of articles written by AI, effective May 1, 2024.

Enforcing this policy will be a real challenge and will likely require human moderators to win what is otherwise a cat-and-mouse game. This is another area where AI may, ironically, create jobs to clean up the mess it has made.

Source

Why do people use LLMs?

Apparently for anything and everything, including advice of all kinds (medical, career, business), therapy, and Dungeons & Dragons (to create storylines, characters, and quests for players).

The list is based on a crawl of the web (Quora, Reddit, etc.).

Source

How do language models represent relations between entities?

This work shows that the complex nonlinear computation of LLMs for attribute extraction can be well-approximated with a simple linear function…

and more importantly, without a conceptual model.

The study has two main findings:
1. Some of the implicit knowledge is represented in a simple, interpretable, and structured format.
2. This representation is not universally used, and superficially similar facts can be encoded and extracted in very different ways.

This is an interesting study that highlights the simplistic and associative nature of language models and the resulting randomness in their output.
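As a toy illustration of the first finding (my own sketch, not the paper’s procedure; the arrays below are synthetic stand-ins for hidden states that would actually be extracted from an LLM), approximating attribute extraction with an affine map looks like this:

```python
import numpy as np

# Toy sketch of the core claim: for some relations, the map from a subject's
# hidden state s to the representation of its attribute o is well-approximated
# by an affine function o ≈ W s + b.
rng = np.random.default_rng(0)
d, n = 64, 200                          # hypothetical hidden size and number of pairs

S = rng.normal(size=(n, d))             # stand-in subject representations
W_true = rng.normal(size=(d, d)) / np.sqrt(d)
b_true = rng.normal(size=d)
O = S @ W_true + b_true + 0.05 * rng.normal(size=(n, d))   # noisy attribute reps

# Fit the affine approximation by least squares and check the fit.
S1 = np.hstack([S, np.ones((n, 1))])    # column of ones for the bias term
coef, *_ = np.linalg.lstsq(S1, O, rcond=None)
rel_err = np.linalg.norm(S1 @ coef - O) / np.linalg.norm(O)
print(f"relative reconstruction error: {rel_err:.3f}")   # small => linear fit works
```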

Source

Google’s new PDF parser

In less sensational but more useful AI news, I’ve just discovered Google’s release of a new PDF parser.

The product was pushed by the Google Scholar team as a Chrome extension, but once installed, it parses any PDF opened in Chrome (it doesn’t have to be an academic article). It creates an interactive table of contents and shows the in-text references, tables, and figures on the spot, without having to go back and forth from top to bottom of the paper. It also has rich citation features.

I love it, but my natural reaction was, why didn’t we have this already?

Source

World’s first fully autonomous AI engineer?

Meet Devin, the world’s first fully autonomous AI software engineer.

We are an applied AI lab focused on reasoning.

We’re building AI teammates with capabilities far beyond today’s existing AI tools. By solving reasoning, we can unlock new possibilities in a wide range of disciplines—code is just the beginning.

Cognition Labs makes some big claims. The demos are impressive, but it is not clear what they mean by “solving reasoning”. There is good reasoning and there is bad reasoning. The latter may be easier to solve. Let’s see what’s left after the smoke clears.

At least they do not claim that Devin is a creative thinker.

Source

When do neural nets outperform boosted trees on tabular data?

The short answer is in the figure: the decision tree shows which of the top five methods wins under which conditions. Otherwise, tree ensembles continue to outperform neural networks.

Now, the background:

I explored the why of this question before, but didn’t get very far. This may be expected, given the black-box and data-driven nature of these methods.

This is another study, this time testing larger tabular datasets. By comparing 19 methods on 176 datasets, this paper shows that 𝗳𝗼𝗿 𝗮 𝗹𝗮𝗿𝗴𝗲 𝗻𝘂𝗺𝗯𝗲𝗿 𝗼𝗳 𝗱𝗮𝘁𝗮𝘀𝗲𝘁𝘀, 𝗲𝗶𝘁𝗵𝗲𝗿 𝗮 𝘀𝗶𝗺𝗽𝗹𝗲 𝗯𝗮𝘀𝗲𝗹𝗶𝗻𝗲 𝗺𝗲𝘁𝗵𝗼𝗱 𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝘀 𝗮𝘀 𝘄𝗲𝗹𝗹 𝗮𝘀 𝗮𝗻𝘆 𝗼𝘁𝗵𝗲𝗿 𝗺𝗲𝘁𝗵𝗼𝗱, 𝗼𝗿 𝗯𝗮𝘀𝗶𝗰 𝗵𝘆𝗽𝗲𝗿𝗽𝗮𝗿𝗮𝗺𝗲𝘁𝗲𝗿 𝘁𝘂𝗻𝗶𝗻𝗴 𝗼𝗻 𝗮 𝘁𝗿𝗲𝗲-𝗯𝗮𝘀𝗲𝗱 𝗲𝗻𝘀𝗲𝗺𝗯𝗹𝗲 𝗺𝗲𝘁𝗵𝗼𝗱 𝗶𝗺𝗽𝗿𝗼𝘃𝗲𝘀 𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲 𝗺𝗼𝗿𝗲 𝘁𝗵𝗮𝗻 𝗰𝗵𝗼𝗼𝘀𝗶𝗻𝗴 𝘁𝗵𝗲 𝗯𝗲𝘀𝘁 𝗮𝗹𝗴𝗼𝗿𝗶𝘁𝗵𝗺.
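As a rough illustration of that finding (not the paper’s actual testbed; the dataset and model choices below are mine), here is a minimal scikit-learn sketch comparing a few default models against one tree ensemble with basic hyperparameter tuning:

```python
# Sketch only: on many tabular datasets, a simple baseline is competitive, and
# lightly tuning one tree ensemble often helps more than switching algorithms.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

X, y = load_breast_cancer(return_X_y=True)   # stand-in tabular dataset

defaults = {
    "logistic (baseline)": make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
    "MLP": make_pipeline(StandardScaler(), MLPClassifier(max_iter=2000, random_state=0)),
    "boosted trees": HistGradientBoostingClassifier(random_state=0),
}
for name, model in defaults.items():
    print(f"{name:22s} {cross_val_score(model, X, y, cv=5).mean():.3f}")

# Basic hyperparameter tuning on the tree ensemble only.
search = RandomizedSearchCV(
    HistGradientBoostingClassifier(random_state=0),
    {"learning_rate": [0.03, 0.1, 0.3], "max_leaf_nodes": [15, 31, 63]},
    n_iter=6, cv=5, random_state=0,
)
print(f"{'tuned boosted trees':22s} {cross_val_score(search, X, y, cv=5).mean():.3f}")
```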

This project also comes with a great resource: a ready-to-use codebase and testbed accompany the paper.

Source

Why do tree-based models outperform deep learning on tabular data?

“The man who knows how will always have a job. The man who knows why will always be his boss.” – Ralph Waldo Emerson

The study shows that tree-based methods consistently outperform neural networks on tabular data with about 10K observations, both in prediction error and computational efficiency, with and without hyperparameter tuning. The benchmark covers 45 datasets from different domains.

The paper then goes on to explain why. The “why” part offers some experiments but looks quite empirically driven, so I can’t say I’m convinced there. The Hugging Face repo, with the paper, datasets, code, and a detailed description, is a great resource though.

Source

Project Euler and the SQL Murder Mystery

If you’re like me and love coding, but your daily work can go long stretches without it, you’ll like Project Euler, where you can solve math problems using any programming language you like (as a long-time user, I solve them in Python, since I use R more often when modeling data).

The project now has nearly 900 problems, with a new one added about once a week. The problems vary in difficulty, but each can be solved in less than a minute of CPU time using an efficient algorithm on an average computer.
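To give a flavor of the easy end of that difficulty range, here is Problem 1 (the sum of all the multiples of 3 or 5 below 1000) in Python:

```python
# Project Euler, Problem 1: sum of all the multiples of 3 or 5 below 1000.
print(sum(n for n in range(1000) if n % 3 == 0 or n % 5 == 0))
```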

Also, my recommendation engine says that if you like Project Euler, you might also like this SQL Murder Mystery I just discovered. This one is not really that difficult, but it does require you to pay close attention to the clues and prompts.

Unexpected spillover effect of the AI boom

Anguilla will generate over 10% of its GDP from .ai domain sales this year. Based on a population of 15,899, .ai will generate a net gain of over $8K per year for a family of four on an island with a GDP per capita of $20K.
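A quick back-of-the-envelope check of the per-family figure, taking the 10%-of-GDP and $20K per-capita numbers above as given:

```python
# Back-of-the-envelope check of the $8K-per-family figure, using the numbers above.
gdp_per_capita = 20_000          # USD
domain_share_of_gdp = 0.10       # ".ai" sales at just over 10% of GDP

per_person = gdp_per_capita * domain_share_of_gdp   # ≈ $2,000 per resident
per_family_of_four = 4 * per_person                 # ≈ $8,000 per family
print(f"${per_family_of_four:,.0f} per family of four per year")
```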

𝘈𝘯𝘥 𝘪𝘵’𝘴 𝘫𝘶𝘴𝘵 𝘱𝘢𝘳𝘵 𝘰𝘧 𝘵𝘩𝘦 𝘨𝘦𝘯𝘦𝘳𝘢𝘭 𝘣𝘶𝘥𝘨𝘦𝘵—𝘵𝘩𝘦 𝘨𝘰𝘷𝘦𝘳𝘯𝘮𝘦𝘯𝘵 𝘤𝘢𝘯 𝘶𝘴𝘦 𝘪𝘵 𝘩𝘰𝘸𝘦𝘷𝘦𝘳 𝘵𝘩𝘦𝘺 𝘸𝘢𝘯𝘵. 𝘉𝘶𝘵 𝘐’𝘷𝘦 𝘯𝘰𝘵𝘪𝘤𝘦𝘥 𝘵𝘩𝘢𝘵 𝘵𝘩𝘦𝘺’𝘷𝘦 𝘱𝘢𝘪𝘥 𝘥𝘰𝘸𝘯 𝘴𝘰𝘮𝘦 𝘰𝘧 𝘵𝘩𝘦𝘪𝘳 𝘥𝘦𝘣𝘵, 𝘸𝘩𝘪𝘤𝘩 𝘪𝘴 𝘱𝘳𝘦𝘵𝘵𝘺 𝘶𝘯𝘶𝘴𝘶𝘢𝘭. 𝘛𝘩𝘦𝘺’𝘷𝘦 𝘦𝘭𝘪𝘮𝘪𝘯𝘢𝘵𝘦𝘥 𝘱𝘳𝘰𝘱𝘦𝘳𝘵𝘺 𝘵𝘢𝘹𝘦𝘴 𝘰𝘯 𝘳𝘦𝘴𝘪𝘥𝘦𝘯𝘵𝘪𝘢𝘭 𝘣𝘶𝘪𝘭𝘥𝘪𝘯𝘨𝘴. 𝘚𝘰 𝘸𝘦’𝘳𝘦 𝘥𝘰𝘪𝘯𝘨 𝘸𝘦𝘭𝘭, 𝘐 𝘸𝘰𝘶𝘭𝘥 𝘴𝘢𝘺.

So AI stands for Asset Increase in Anguilla.

Source

Environmental costs of the AI boom

This is a bit personal. As a technologist, there’s probably never been a better time to be alive. As an environmentalist, it’s probably just the opposite.

As usual, we largely ignore the environmental impact and sustainability of large language models compared to the use cases and value they create. This whitepaper uses some descriptive data to provide a contrarian yet realistic view. TL;DR – It’s not a crisis per se yet, but it could be soon.

The comparisons need to be refined though. For example, the trend is more important than the snapshot (there is no kettle boom). We also probably need to use the kettle and the oven more than we need language models to “write a biblical verse in the style of the King James Bible explaining how to remove a peanut butter sandwich from a VCR” (from the article).

The article goes on to offer another positive: Responsible AI can spur efforts toward environmental sustainability, “from optimizing model-training efficiency to sourcing cleaner energy and beyond.” We will see about that.

Source