Can discovery survive the mean?

We often hear “Claude does this” or “Claude does that,” but how does it perform on an advanced data science task?

This study builds on prior work on variation in causal inference (research I contributed to) and asks Opus 4.5 to replicate the human analysis. Claude performs the same tasks following the same instructions given to human researchers. The tasks follow increasing levels of constraint:

Task 1: Maximum freedom.

Task 2: Data cleaning held constant.

Task 3: Both data cleaning and methodology held constant.

As constraints increase across the tasks, the variation in Claude’s results decreases. Across all three tasks, Claude also shows less dispersion than the human researchers. These findings are consistent with a tendency of LLMs to converge toward the mean.
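To make "dispersion" concrete, here is a minimal sketch of how one might compare the spread of estimates per task. It assumes each analyst (or each Claude run) yields one numeric effect estimate per task, and it uses the sample standard deviation as the dispersion measure; the study may use a different one. The function name `dispersion_by_task` and every number below are hypothetical placeholders, not results from the study.

```python
from statistics import stdev


def dispersion_by_task(estimates):
    """Return the sample standard deviation of the estimates for each task."""
    return {task: stdev(values) for task, values in estimates.items()}


# Made-up placeholder estimates for illustration only, NOT data from the study.
human = {
    "task1": [-0.30, -0.05, 0.10, 0.40],
    "task2": [-0.20, -0.05, 0.05, 0.25],
    "task3": [-0.10, 0.00, 0.05, 0.10],
}
claude = {
    "task1": [-0.05, 0.00, 0.05, 0.10],
    "task2": [0.00, 0.02, 0.04, 0.06],
    "task3": [0.02, 0.03, 0.03, 0.04],
}

human_sd = dispersion_by_task(human)
claude_sd = dispersion_by_task(claude)

# Print the two spreads side by side for each task.
for task in human:
    print(task, round(human_sd[task], 3), round(claude_sd[task], 3))
```

With real estimates in place of the placeholders, the same comparison would show whether the spread shrinks as constraints tighten and whether Claude's spread stays below the researchers'.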

This makes me think: such convergence may be useful for replication and robustness checks, but discoveries often originate in the tails of human variation. How, then, do we keep human variation in the loop?

The answer will depend on the task. Our work in Augmented Data Science, for example, focuses on data science workflows, and we recently posted a method selection agent that aims to retain data scientist variation.

Sources: Original Study, Claude Replication, Replication Project Website