Autoresearch optimizing random seeds

You may have already seen “AutoResearch,” released by Andrej Karpathy yesterday. It is another interesting experiment: autonomous research agents running training experiments on a single-GPU implementation of nanoGPT.

In this context, “research” is mostly hyperparameter tuning, but the agent is fully autonomous: it can modify the code as it sees fit, without a human in the loop.

While checking it out, I saw a session report posted by the agent that made me smile:

Changing random seed from 42→137 improved by 0.0004. Seed 7 was worse. Make of that what you will.

Even though the agent knows that optimizing the seed is pointless, it does it anyway and then tosses the ball back to you. Do whatever you want with that information!
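The seed only changes which random draws a run gets, not the model. A minimal stand-in (hypothetical numbers, not nanoGPT) shows why an “improvement” of ~0.0004 is indistinguishable from sampling noise:

```python
import random
import statistics

# Hypothetical stand-in for a training run: "final loss" is just the
# mean of noisy samples, so the seed only changes which noise you draw.
def final_loss(seed, n=10_000):
    rng = random.Random(seed)
    return statistics.mean(rng.gauss(1.0, 0.1) for _ in range(n))

losses = {s: final_loss(s) for s in (7, 42, 137)}
spread = max(losses.values()) - min(losses.values())

# Some seed always "wins", but the spread is pure sampling noise on the
# order of sigma / sqrt(n) = 0.001 -- the same scale as the agent's 0.0004.
print({s: round(v, 4) for s, v in losses.items()}, round(spread, 4))
```

Rank the seeds by this number and one of them “wins” every time; rerun with a different batch of seeds and a different one wins.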

Source 1 (Autoresearch repo) – Source 2 (Discussion link)

Has the worst language won?

Will we ever see a new programming language scale again?

With foundational models trained on the vast existing corpus of Python, what would it take for another general-purpose language to catch up? Does Julia have a chance of moving beyond its current niche?

Some new languages like Mojo are already being marketed as having “Pythonic syntax” (a superset of Python).

The post titled The Worst Language Won was the trigger for my question:

Python is the language of AI. By all conventional measures, it shouldn’t be.

Python is slow. Thousands of times slower than C, it loses benchmarks to languages that died decades ago.

Python is unsafe. With no compiler to catch your mistakes, your code’s flaws are exposed when it breaks in use.

Despite the intro, the post is actually praising Python’s experimental nature.

Language w/o reasoning ≠ understanding

Michael Burry (The Big Short) shared an interesting story today from an 1880 New York Times article titled “Is There Thought Without Language? Case of a Deaf Mute.” The story itself is fascinating, highlighting how far science has progressed in our understanding of deafness.

More to the point, I saw this powerful statement in the 1880 piece that separates understanding from language:

That by which we understand all things must be essentially superior to anything else that is understood by it.

This prompted an update to my “Mind the AI Gap” deck, a framework I initially created in May 2024 for a talk on LLM-assisted learning. Since then I’ve kept it updated as I discussed the topic.

Burry’s conclusion that “Language without the Capacity for Reason fails at Understanding” mirrors a key argument in the deck. This 1880 case study now sits next to the deck’s previous discussion point from around 1980, Steve Jobs’s “computer as a bicycle for the mind,” marking a 100-year interval.

History, it seems, has a lot to teach us about AI.

See the “Mind the AI Gap” deck – Read Michael Burry’s post

Trusting the AI artifact

[Click title for image]

Looks good, must be right? We seem to scale back our questioning once AI starts building. This is a key finding from the latest Anthropic report on AI fluency.

When AI produces artifacts (apps, code, documents, visuals, or interactive tools), users are significantly less likely to verify the work:

  • 3.7 percentage points less likely to check facts
  • 3.1 percentage points less likely to question reasoning
  • 5.2 percentage points less likely to identify missing context

This suggests an interesting conundrum: as AI moves from being a chat/conversation partner to a builder, our skepticism fades. We are far more likely to question a text response than a functional piece of code or a formatted document, even though the latter often requires the most oversight.

Source

Human vs. LLM-generated content

Just a heads up: the next post you’ll read is LLM-generated, because this one isn’t 😉.

Skip the next post.

If there’s one causal effect I’m willing to speculate on without any modeling, this is it. This one doesn’t need a diff-in-diff: the trends are parallel; relative time checks out; ready for publication.

New Data Duets post: Using generative models, well, to generate data

I recently shared an underappreciated use case for generative models in data science: creating high-fidelity tabular datasets (online travel agency (OTA) data for regression discontinuity).

The model’s success in data synthesis motivated a question: what are some high-value use cases for data science teams when using generative models to create datasets? This, in turn, led to our latest Data Duets post: “Using generative models, well, to generate data.”

I walk through using the Synthetic Data Vault to scale a small OTA sample while preserving its statistical properties and the causal discontinuity. Duygu Dagli then weighs in on business implications: creating statistical twins to share data with vendors for solution optimization and benchmarking, simulating product recall data, and solving cold start problems in retail.

Ultimately the approach here represents a step toward data centricity: using high-fidelity simulations to dissect and validate the assumptions that drive our models.

Link to the full post

Something big is happening?

The title is from a popular post. It was clearly written to be sensational (which it seems to have achieved), yet it makes some valid points and offers useful advice:

Here’s a simple commitment that will put you ahead of almost everyone: spend one hour a day experimenting with AI. Not passively reading about it. Using it. Every day, try to get it to do something new… something you haven’t tried before, something you’re not sure it can handle. Try a new tool. Give it a harder problem. One hour a day, every day.

While following technological progress is always a good idea, the current pace is truly mind-blowing, so it requires more attention. As someone who has been coding since C# first launched (don’t check the date!) and whose day-to-day is full of Markdown, JSON, and APIs, even I am finding it difficult to keep up lately.

So, personally, and surely as an educator, I can’t help but agree with the point about “the cost of not experimenting.” We are moving into a world where daily experimentation is as essential as your morning coffee (and yes, you must drink it).

Source

Is AI killing B2B SaaS?

[Click title for image]

Hard to ignore this question; it’s currently moving financial markets. The first comment in this massive 725-comment Hacker News thread makes a compelling case for why the answer is likely no: enterprise SaaS will survive because management simply does not want to be responsible for the vibe-coded alternative.

As a technologist, and as a professor who has run “buy vs. build” discussions for over a decade, I agree that the death-of-SaaS argument is overblown. At the center of those discussions is the massive gap between building and maintaining, which this argument underestimates. And one critical aspect may not be getting enough emphasis: shifting the liability.

AI is now driving down the cost of the initial build, but the build is only a fraction of the value an enterprise solution provides. SaaS also provides reliability (uptime) and the boring essentials (security compliance, data integrity). Enterprise SaaS owns the “system of record,” and migrating that is not just a (vibe-)coding problem.

Even if all of this is resolved, the liability bottleneck remains; management won’t want to be responsible. Just because you can build it doesn’t mean you should, and for most enterprises, I can see why they won’t, except for thin wrappers.

Statistical Inference: The Big Picture

Most modeling failures are caused by flawed (and often implicit) assumptions.

Statistical pragmatism recognizes that all forms of statistical inference make assumptions, assumptions which can only be tested very crudely (with such things as goodness-of-fit methods) and can almost never be verified. This is not only at the heart of statistical inference, it is also the great wisdom of our field.

This is also what we discuss in the Data Centricity Lab (see datacentricity.org for an overview). We underline the role of assumptions in the modeling process and how they dictate the usefulness of models (and the decisions they support).

This paper defends pragmatism over dogma:

  • Using both frequentist (e.g., p-values, confidence intervals) and Bayesian (e.g., posterior probabilities) tools, depending on the problem.
  • Prioritizing the assumptions that connect models to real-world data rather than debating the “true” nature of probability.
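As a toy illustration of using both toolkits on the same problem (hypothetical numbers; note that both answers rest on the same iid Bernoulli assumption, plus a prior choice on the Bayesian side, which is exactly the paper’s point about assumptions):

```python
import math

# Hypothetical A/B-style data: 27 conversions out of 120 trials.
k, n = 27, 120

# Frequentist: point estimate and 95% Wald confidence interval.
p_hat = k / n
se = math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)

# Bayesian: Beta(1, 1) uniform prior -> Beta(1 + k, 1 + n - k) posterior.
a, b = 1 + k, 1 + n - k
posterior_mean = a / (a + b)

print(f"p_hat={p_hat:.3f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f}), "
      f"posterior mean={posterior_mean:.3f}")
```

The two summaries land in nearly the same place here; they diverge when the prior carries real information or the sample is small, and which to report is a pragmatic call, not a dogmatic one.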

One implication is that we rethink how we frame the relationship between a sample (reality) and the population (hypothetical). We often describe statistical inference as random sampling from a finite population, but that can be misleading. The paper suggests we call the estimand “theoretical mean” rather than “population mean.”

Why does it matter? The more we emphasize the role of assumptions, the more modelers question if the theoretical world aligns well with the real world that produced the data. As we discuss at Data Duets, when assumptions are sidelined, a misconception takes hold: the idea that methodological rigor can substitute for conceptual accuracy. And causal (semi-)parametric solutions are often more sensitive to this misconception than predictive ones (as we further discuss here).

Kass (2011) Paper

Moltbook is not a community

and there is no emergence. It’s yet another simulation. Here’s a reality check.

Community takes trust and authenticity, a shared purpose and identity, and active participation and interaction. These LLM bots have no concept of trust or a shared purpose. Data shows they don’t even truly interact; they just take parallel actions:

tl;dr: agents post a LOT but don’t really talk to each other. 93.5% of comments get zero replies. conversations max out at a depth of 5. at least as of now, moltbook is less “emergent AI society” and more “6,000 bots yelling into the void and repeating themselves” (Holtz)

And emergence requires more than independent entities occupying the same space. Even if the bots truly interacted, emergence takes consistent horizontal influence and downward causation:

One of the emergent properties that a system can have is the power to exert causal influence on the components of that system in a way that is consistent with, but different from, the causal influences that those components exert upon each other. (Newman, 1996)

Bottom line is, Moltbook is an exciting experimental simulation for technologists like me, but it is neither a community nor an emergent society. The community elements and causal loops are currently missing: the agents do not adapt their weights or behaviors based on the collective. They are simply generating tokens into a vacuum.

Moltbook – Source 1 (Holtz’s analysis) – Source 2 (Newman, 1996)

[Click title for image] H/t to Ben Lowenstein for the screenshot.