Recent Reading

links3 min read

Three days ago I left autoresearch tuning nanochat for ~2 days... It found ~20 changes that improved the validation loss. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement), this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference.
Andrej Karpathy

There's a lot of reason to think that agents will be very good at anything that can be graded objectively. Tasks that are graded subjectively are still anyone's guess.

In contrast to some of the public discussion, I’d argue that in a world of 100X more AI agents than people in an enterprise, the value of the systems of record and tools agents will use will go up, not down. Because in this new world, software provides the guardrails on which agents can operate successfully within an enterprise, and gives them the underlying tools to use to be more productive themselves and work alongside people.
The future of enterprise software

It might be cheaper to reinvent the wheel, but that doesn't mean it's always a good idea.

Recently at LangChain we’ve been building skills to help coding agents like Codex, Claude Code, and Deep Agents CLI work with our ecosystem... A key part of building these skills is making sure they actually work. In this blog, we cover some learnings and best practices for how to evaluate skills as you create them.
Evaluating Skills

You wouldn't ship code without tests, but why ship skills without evals? This is a practical guide to fixing that. You will learn how to define success criteria, build a lightweight eval harness, and iterate on it.
Practical Guide to Evaluating and Testing Agent Skills

Evals strike me as important, but I hate how they make testing so expensive.

sent this to the team today
everything great comes from being able to delay gratification for as long as possible
and it feels like we're collectively losing our ability to do that
dax

Sometimes you have to go slow to go fast.

A very comprehensive and precise spec (CommitStrip.com) (source)

...agentic coding advocates claim to have found a way to defy gravity and generate code purely from specification documents. Moreover, they've also muddied the waters enough that I believe the above comic strip warrants additional commentary for why their claims are misleading.
A sufficiently detailed spec is code

I've seen a lot of discussion about this recently and I largely agree - but we need to distinguish between the implementation details (spec-as-code) and the surrounding context. Code documents the "what" and "how," but not the "who" or "why." We still need artifacts to connect the two.

When you optimise a step that is not the bottleneck, you don't get a faster system. You get a more broken one.
If you thought the speed of writing code was your problem - you have bigger problems

Despite widespread discussion of AGI, there is no clear framework for measuring progress toward it. This ambiguity fuels subjective claims, makes it difficult to track progress, and risks hindering responsible governance. As a starting point to address this gap, we present a framework for understanding system capabilities in relation to human cognitive abilities.
Measuring progress toward AGI: A Cognitive Framework (pdf)

Bookmarking this for later reference in discussions around AI's capabilities.

Related Posts

links4 min read

Recent Reading

A collection of interesting posts and articles I've run across recently.

links1 min read

Recent Reading

A collection of interesting posts and articles I've run across recently.

links1 min read

Assembly #1

A collection of helpful articles and links from this week