Claude Code vs. Cursor vs. Others

[Click title for image]

I’m not ready to go all-in on Claude Code just yet. Despite the well-earned hype, I’m sticking with Cursor (for now). That’s because model independence is worth more to me than native tool integration in an “actor-critic” type of workspace.

I started writing this as a comment on Scott Cunningham’s post here (and Substack), but it evolved into a full end-of-year reflection on the coding assistants. While my focus is primarily on Claude Code vs. Cursor, Scott’s write-up provides excellent context, particularly in how it compares standard chatbots with Claude Code as a context-aware tool. I can see why Claude Code won him over in that regard.

I’ve been hearing a lot about Claude Code lately, but I haven’t yet found a compelling reason to abandon Cursor, which I’ve used since shortly after its release. For those of us with a coding background, the primary reason for starting with Cursor was the incredibly low switching cost from VS Code.

Claude Code is Anthropic’s response to VS Code and its forks: mainly Cursor and Windsurf. By the way, a lesser-known alternative emerged from Google last month: Antigravity. It’s currently in public preview, so you can use it for free here. It looks promising, and the standout feature is “Agent Manager,” which essentially acts as a mission control dashboard for orchestrating multiple agents working in parallel.

I was actually planning to test Windsurf once my Cursor subscription expires (for the reason I highlighted here: Codemaps offer a hierarchical mental model of the codebase), but Scott’s post piqued my curiosity about Claude Code. After checking a few resources, this comparison seemed fair (if you look past Qodo’s advertising). This one is also solid, and ultimately recommends that you “just use both.”

So, what is my plan for 2026?

I still intend to test Windsurf because I find the “Codemaps” feature intriguing. After that, I might revert to VS Code and layer in Claude Code (here is how). Meanwhile, I will keep an eye on the evolution of Google’s Antigravity, the latecomer. My hesitation to commit to a specific vendor stems from my frequent use of actor-critic logic (where one model/agent performs the task and another reviews it). I find value in using different models for these roles, so I want to retain the flexibility to switch between models. Based on the comparisons above, I might face a premium as context usage scales, but that’s a price I’m willing to pay for model independence.

And what do I mean by the actor-critic setup?

While “racing” parallel agents is becoming standard practice (see the Cursor documentation), I also use a sequential approach with diverse models: e.g., using Claude to code (the actor) and Gemini to review (the critic). VS Code documentation on agent use touches on this “multi-agent” idea without calling it an “actor-critic” setup (rightly so, as I’m just borrowing the concept from reinforcement learning).
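For concreteness, here is a minimal sketch of that sequential handoff, assuming the Anthropic and Google Generative AI Python SDKs; the task prompt and model names are placeholders, not a recommendation:

```python
# Minimal actor-critic sketch: one model drafts the code, another reviews it.
# Assumes the `anthropic` and `google-generativeai` SDKs are installed and that
# ANTHROPIC_API_KEY / GOOGLE_API_KEY are set; model names are placeholders.
import os

import anthropic
import google.generativeai as genai

TASK = "Write a Python function that deduplicates a list while preserving order."

# Actor: Claude drafts a solution.
actor = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
draft = actor.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=1024,
    messages=[{"role": "user", "content": TASK}],
).content[0].text

# Critic: Gemini reviews the draft and suggests fixes.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
critic = genai.GenerativeModel("gemini-2.5-pro")  # placeholder model name
review = critic.generate_content(
    "Review the following solution for correctness, edge cases, and style. "
    "List concrete fixes.\n\n" + draft
).text

print(draft)
print("--- critic review ---")
print(review)
```

In practice, the critic’s review can be fed back to the actor for another pass; the point is simply that the two roles do not need to come from the same vendor.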

Credit for the figure goes to Gemini 3 Flash & Nano Banana based on this reflection.

Pluribus knows everything

[Click title for image]

Pluribus knows everything, but understands nothing – just like an LLM!

I’ve finally found the time to start watching Pluribus. I’m only two episodes in, but the analogy to LLMs is already clear. The crux of this comparison for me is the scene where Carol tries to convince the unified human entity that it is okay to say “no” and not be obsequious.

Like an LLM, the show’s unified human entity (which possesses all human thoughts and knowledge) “knows” everything but “understands” nothing. This is an excellent link to my take on LLMs in the talk, “Mind the AI Gap: Understanding vs. Knowing.”

Clearly, I was not the first to make this connection (see this summary for example). As I understand it, Vince Gilligan didn’t intend for this to be about LLMs/AI; he wrote the story long before ChatGPT. This makes it even more fascinating, because he imagined a being with total knowledge but no agency or imagination (and thus, no counterfactual reasoning), and this being uncannily mirrors the behavior of an LLM.

Happy holidays to everyone with this holiday post!

Computer vision combined with LLMs

Well, this is impressive. Today I found myself visually comparing two scatterplots. Then I asked the latest models of Claude, GPT, Gemini, and DeepSeek to identify which of the two was generated parametrically and which was semi-parametrically. They reached a consensus (DeepSeek did not receive the images), and they were correct.

When a collective of LLMs can unequivocally diagnose a subtle shift in the data-generating process just by looking at scatterplots, it changes the way you think about looking vs. seeing (and saves you a trip to the optometrist for eyestrain).

Attached are the two datasets if you want to give your eyes a chance.
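If you’d rather simulate than squint, here is a stand-in sketch (not the attached datasets) of the kind of comparison I mean: one dataset drawn from a fully parametric linear model, the other from a semi-parametric process with an unspecified smooth component added on top:

```python
# Illustrative stand-in, not the attached data: two data-generating
# processes plotted side by side as scatterplots.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n = 500
x = rng.uniform(-3, 3, n)

# Parametric DGP: fully specified linear model with Gaussian noise.
y_parametric = 1.0 + 0.8 * x + rng.normal(0, 1, n)

# Semi-parametric DGP: same linear part plus an unspecified smooth term.
y_semiparametric = 1.0 + 0.8 * x + np.sin(2 * x) + rng.normal(0, 1, n)

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
axes[0].scatter(x, y_parametric, s=8, alpha=0.6)
axes[0].set_title("Dataset A (parametric)")
axes[1].scatter(x, y_semiparametric, s=8, alpha=0.6)
axes[1].set_title("Dataset B (semi-parametric)")
plt.show()
```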

LLMs learn from the best

[Click title for image]

Our version control debt has finally caught up with us. I’m working on a Jupyter notebook for the Causal Book using a helper LLM, and we seem to be approaching the absolute finality here.

For the record, this horrible naming convention is not a result of my input; it’s entirely derived from the training data. And here’s what Gemini says about it:

This is a hilarious catch. It looks like the AI has perfectly replicated the human panic of saving files as Project_Final_Final_v2_REAL_FINAL.doc.

Human panic, or uniquely human optimism for version control, you decide!

Reimagining in-class learning with LLMs

I commend Andrej Karpathy’s pedagogical work, e.g., Eureka Labs’ vision and his instructional videos. His insight that students must be proficient in AI but should be able to exist without it is spot on. He also suggests leaning on in-class evaluation to ensure academic integrity. While a shift to in-class work is clearly necessary, basing it solely on grading implications sounds too narrow.

In-class time will play a bigger role.

This is one of the reasons we have introduced a number of in-class teaching innovations recently, including hackathons for predictive modeling and reinforcement learning (multi-armed bandits), and LLM-assisted development deployed to HuggingFace.

LLMs can help make learning fun and engaging, starting in the classroom.

The most effective teaching fosters student ownership of learning. This involves showing that learning is fun (and surely challenging). LLMs offer an opportunity to strengthen this message. Learning is even more fun now, and somewhat less challenging: LLMs make it much easier to access material, test understanding, iterate on solutions, experiment, and get quick feedback.

That’s why we will next dedicate more of the in-class time to demonstrating how to use LLMs as lifelong learning companions without mindlessly delegating our understanding. To read more about this difference, you can see the slides from my talk Mind the AI Gap: Understanding vs. Knowing here.

In all, yes, in-class time needs to be more strategically used, but making grading the sole driver represents a missed opportunity. Using more of the in-class time to model the joy of discovery and learning with LLMs (“the pleasure of finding things out”) can be a better primary driver.

Causal evidence in the headlines

It’s not every day that causal evidence is quoted in the headlines. Incidentally, we had a similar (unpublished) study on Instagram looking at the effects of “Instagram perfect” on users’ prosocial behavior (also through social comparison as the mechanism), with somewhat parallel results.

So, I am not surprised at all by this finding:

To the company’s disappointment, “people who stopped using Facebook for a week reported lower feelings of depression, anxiety, loneliness and social comparison,” internal documents said.

Source

Algorithm that doesn’t rot your brain?

This is slightly off-track, but I felt compelled to share this opinion piece. The NYT published an opinion video featuring Jack Conte, musician and CEO of Patreon. The message is simple: algorithms should serve people instead of people serving algorithms.

The piece reminded me of the times when you could reliably follow someone. These days, I see all kinds of content that I didn’t sign up for, and I miss the content from the people I thought I followed. I don’t even see the updates from my connections.

As a workaround, LinkedIn wants you to “double follow” if you want to really follow someone. You need to visit a person’s profile and click on the unlabeled, literally hidden bell in the upper right to get notified when that person shares something.

Isn’t that a little preposterous?

The opinion piece suggests that we must:

  1. Prioritize long-term relationships
  2. Fund art, not ads
  3. Put humans in control

As a technologist, I agree. This may sound like a rant, but it really is not. I think Jack is doing an excellent job making people question the existing design (and offering an alternative?).

I’ve created a gift link so you can access the content without a NYT membership; see here.

Learning, insight, and causality

If the goal of teaching is learning, then how exactly does the brain make a difficult concept instantly clear?

I’ve been a student of how the human brain works for as long as I can remember, particularly since the early days of my teaching. Teaching is moot if actual learning lags. Learning is difficult by definition, and making it sticky is even more challenging.

This article provides a status update on research into what “insight” is, how it is formed, and how it aids learning and long-term memory. Worth a read.

In the age of generative models, a better understanding of how insight is formed and the role of cause-effect triggers (water rises – Eureka!) is increasingly valuable.

Is AI the bicycle or the mind?

[Click title for image]

Is AI the bicycle for the mind (following Steve Jobs), or is it the mind riding the bicycle (quite literally, like the 20-year-old robot here, built even before the Transformer)?

In this article, Tim O’Reilly, countering Jensen Huang’s keynote remarks, frames this as a question of function: Is AI a tool or a worker using other tools? He explores a number of premises and concludes the LLM is “a tool that knows it’s a tool.”

This may actually be an apt way to describe an agent: a tool that knows it’s a tool -and- can use other tools.

Credit for the picture goes to Koji Sasahara / AP.

Bias-variance tradeoff in matching for diff-in-diff

In matching for causal inference, we often focus too much on reducing bias and too little on variance. This has generalizability implications. This paper, while not focused on external validity, tackles the bias-variance trade-off in matching for diff-in-diff:

While matching on covariates may reduce bias by creating a more comparable control group, this often comes at the cost of higher variance. Matching discards non-comparable control units, limiting the sample and, in turn, jeopardizing the precision of the estimate. That’s a good reminder.

How about matching also on pre-treatment outcomes?

Here, the win is clear: it’s a guaranteed reduction in variance because the sample-size trade-off no longer applies once matching is performed. So, while a reduction in bias isn’t a mathematical certainty, this makes additionally matching on pre-treatment outcomes a potentially optimal strategy when both bias and variance are a concern.

The generalizability implications will be part of the matching chapter of the Causal Book.

PS: Yes, matching on pre-treatment outcomes reduces the diff-in-diff estimator to diff-in-means and may introduce bias, but that’s a discussion for another day (and chapter).
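The reduction itself is just algebra: with exact matching on the pre-treatment outcome, the matched treated and control groups share the same pre-period mean, so

```latex
% Exact matching on Y_pre implies \bar{Y}^{pre}_{T} = \bar{Y}^{pre}_{C}, hence
\widehat{\text{DiD}}
  = \left(\bar{Y}^{post}_{T} - \bar{Y}^{pre}_{T}\right)
  - \left(\bar{Y}^{post}_{C} - \bar{Y}^{pre}_{C}\right)
  = \bar{Y}^{post}_{T} - \bar{Y}^{post}_{C},
```

which is exactly the post-period difference in means.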

Understand Code Before You Vibe It?

[Click title for image]

That tagline with the made-up graph instantly raises a red flag, but the core idea is surprisingly cool. Windsurf’s new owner, Cognition (following the failed OpenAI acquisition), has shipped a new feature called Codemaps.

The idea is to significantly ease codebase understanding. This actually looks incredibly useful, especially when tackling an existing codebase, say, an open-source project, and it might get me to switch over from Cursor.

Source

LLMs vs. Stack Overflow

Did you know about stackoverflow.ai? I must’ve completely missed this. It looks like a great alternative to the search function on the site (or using Google to search it). We seem to have come full circle from LLMs killing StackOverflow to LLMs powering StackOverflow for search and discovery. Recommended.

Back to Causal Book: Regression Discontinuity

The intro sections and DAGs for the RD chapter are in. More to come.

I’m looking for interesting datasets for the RD design. I have some candidates, but I’m eager to find more compelling, real data. Ideally, I’d like a business case (rather than policy), such as one on customer loyalty status. The IV chapter already uses policy data (tax on cigarette prices vs. smoking). Please comment with a link if you have ideas beyond the Kaggle datasets.

As a reminder, Causal Book is an accessible, interactive resource for the data science and causal inference audience. It is not meant to substitute for the excellent texts already available, such as The Effect by Nick Huntington-Klein and The Mixtape by Scott Cunningham. This book aims to complement them by focusing on the idea of solution patterns, with code in R and Python, exploring different approaches (Freq. Statistics, Machine Learning, and Bayesian), and clarifying some of the counterintuitive (or seemingly surprising) challenges faced in practice.

Causal Book

Is college old school now?

This is interesting: Palantir has launched a “Meritocracy Fellowship” to hire high-achieving high school graduates right out of school, offering a paid internship with a chance at full-time employment. The company presents this as an alternative to college.

This is a very limited, transactional view of college. College is more than just training for employment; it is where students gain knowledge and broaden their horizons, learning how to think and ask questions, in addition to acquiring practical skills. I doubt a four-week history seminar will make up for all that.

Source

Neo the household robot

I’m terribly sad to learn that the first consumer humanoid robot marketed to load the dishwasher (finally!) is essentially a proxy operated remotely by a human (oh no). The automation it offers is akin to hiring a teenager to mow your lawn remotely, yet it introduces privacy and latency nightmares.

How much longer must we keep loading the dishes? Until Nvidia’s valuation hits $10 trillion? Let’s buy more stonks. I’m losing my patience here.

Source – Neo Order Page

Is our society increasingly rewarding conformity?

Is our society increasingly rewarding conformity? Is AI accelerating this process, and is it simultaneously stripping work of deep meaning? After all, thanks to LLMs, many now define creativity as merely a probabilistic recombination of matrices derived from a training set.

Is science also contributing to this potential lack of deviation in culture, education, arts, architecture, and business? Here’s a take on it (not mine):

You can spot this scientific bland-ification right away when you read older scientific writing. As Roger’s Bacon points out, scientific papers used to have style. Now they all sound the same, and they’re all boring.

Maybe science is supposed to be boring? What happened to the style though?

For definitive answers to these questions, look elsewhere. For a compelling set of data (plus plenty of causal speculation) on many aspects of contemporary society and scientific style, check out this compilation and the essay here.

The greatest thinkers in science (and business) are often prolific authors. They write books, blogs, and copious emails to sharpen ideas. Richard Lewontin, E.O. Wilson, and Paul Graham are but three examples. Dorothy Hodgkin’s scientific correspondence and papers, stacked together, extend 25.85 meters in length. Great thinkers, in other words, write all the time.

Researchers are evaluated by simple measures of productivity or influence — number of papers published, citation count, and grant dollars. In such an environment, it has become exceedingly difficult for scientists to take stylistic risks in their academic writing or to devote significant amounts of time to other forms of creative writing.

When is TSLS Actually LATE?

I first came across this paper while writing the Machine Learning Using IV chapter of the Causal Book. Revisiting it today, I remain struck by its central finding: about 95% of the empirical TSLS (Two-Stage Least Squares) models surveyed here claim to estimate the Local Average Treatment Effect (LATE), but they fail to meet the necessary conditions to do so.

The failure is mainly due to not controlling for covariates nonparametrically. That is to say, in attempting to correct for selection bias (endogeneity) using IVs, causal modelers inadvertently introduce significant specification bias, thereby theoretically nullifying the LATE interpretation.
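To make the “nonparametric covariate control” point concrete, here is a hedged sketch on simulated data (hypothetical variable names, a single binary covariate): TSLS with the covariate entered linearly in both stages versus a saturated version that computes stratum-specific Wald estimates and aggregates them with stratum-share-times-complier-share weights. It illustrates the two specifications; it is not the paper’s own analysis.

```python
# Sketch: linear vs. saturated covariate control in TSLS (simulated data).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.binomial(1, 0.5, n)                    # binary covariate
z = rng.binomial(1, 0.3 + 0.4 * x, n)          # instrument propensity depends on x
u = rng.normal(0, 1, n)                        # unobserved confounder
d = ((0.5 * z + 0.3 * x + u) > 0.6).astype(float)      # treatment take-up
tau = 1.0 + 2.0 * x                            # heterogeneous treatment effect
y = 1.0 + tau * d + 1.5 * x + u + rng.normal(0, 1, n)

def tsls_linear(y, d, z, x):
    """TSLS with the covariate entered linearly in both stages."""
    Z = np.column_stack([np.ones(len(z)), z, x])
    d_hat = Z @ np.linalg.lstsq(Z, d, rcond=None)[0]     # first stage
    X2 = np.column_stack([np.ones(len(z)), d_hat, x])
    return np.linalg.lstsq(X2, y, rcond=None)[0][1]      # coefficient on d_hat

def wald(y, d, z):
    """Unconditional Wald/IV ratio within a covariate stratum."""
    return (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())

# Saturated version: stratum-specific Wald estimates, aggregated with
# weights proportional to stratum share times the stratum first-stage effect.
estimates, weights = [], []
for s in np.unique(x):
    m = x == s
    first_stage = d[m][z[m] == 1].mean() - d[m][z[m] == 0].mean()
    estimates.append(wald(y[m], d[m], z[m]))
    weights.append(m.mean() * first_stage)
weights = np.array(weights) / np.sum(weights)

print("TSLS, linear covariate control:", round(float(tsls_linear(y, d, z, x)), 3))
print("Saturated (weighted stratum Wald):", round(float(np.dot(weights, estimates)), 3))
```

As I read the paper, most applied work runs something like the former, while the LATE interpretation formally requires something closer to the latter (or covariate controls rich enough to make the distinction moot).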

On a different note, I’ve resumed work on Causal Book. Updates are on the way!

Is AI also innocent until proven guilty?

Today, my feed is full of speculation linking the recent AWS layoffs, driven by increased AI automation, to yesterday’s outage. In reality, we don’t really know if AI caused any of it.

What do we know? I read two articles this morning, and one thing that struck me is that AWS was reportedly not able to diagnose the core issue for 75–90 mins. That’s an absurdly long time.

If this timeline is accurate, the extended delay is compelling evidence that critical expertise was either absent or inaccessible when it was most needed, for whatever reason.

Credit for the image goes to Emil Lendof/WSJ.

Source 1 – Source 2

Update on using LLMs for OCR

Here’s an update on using LLMs for OCR without having to use the same hammer (generic model) for all nails. DeepSeek has released an OCR-focused model: https://github.com/deepseek-ai/DeepSeek-OCR

Check out the deep parsing mode, which parses images embedded within documents through secondary model calls. Very useful for data extraction. The results are pretty impressive too:

Our work represents an initial exploration into the boundaries of vision-text compression, investigating how many vision tokens are required to decode 𝑁 text tokens. The preliminary results are encouraging: DeepSeek-OCR achieves near-lossless OCR compression at approximately 10× ratios, while 20× compression still retains 60% accuracy. These findings suggest promising directions for future applications, such as implementing optical processing for dialogue histories beyond 𝑘 rounds in multi-turn conversations to achieve 10× compression efficiency.

Education, AI, and standards

The data on education call for attention:

– 33% of eighth graders are reading at a level that is “below basic”—meaning that they struggle to follow the order of events in a passage or to even summarize its main idea.

– 40% of fourth graders are below basic in reading, the highest share since 2000.

– In 2024, the average score on the ACT, a popular college-admissions standardized test that is graded on a scale of 1 to 36, was 19.4—the worst average performance since the test was redesigned in 1990.

The article speculates on several causal links to explain the declining trend in the metrics, ranging from the effects of COVID to the influence of smartphones and social media.

The point that truly resonates with me as an educator, though, is this: a pervasive refusal to hold children to high standards. Standards are about values, not technology or tools. No tool causes the fading emphasis on rigor.

The article discusses other important aspects, such as the disparity between school districts, the heterogeneity in outcomes based on affluence, and the potential role of AI as a democratizer, but keeps returning to the same line: declining standards and low expectations. And that’s for a good reason:

“Roughly 40 percent of middle-school teachers work in schools where there are no late penalties for coursework, no zeroes for missing coursework, and unlimited redos of tests.”

This is potentially the most important problem facing our society today, and it warrants far more attention.

Source