Learning to Stop Writing Code

...and why you won't miss it

Daisy Hollman

ACCU 2025
daisyh.dev/talks/accu-2025

Who am I?

  • I started at Anthropic six weeks and two days ago
  • I was at Google for three and a half years before that...
  • I was at Sandia National Labs for seven years before that...
  • Once upon a time, I was a Quantum Chemist (that's what my Ph.D. says anyway...)
  • I've been involved in the C++ committee for more than 8 years
    • mdspan, executors, atomic_ref, ranges-related work, etc.

AI has only been a small part of my career (so far...)

Overview

  • How do LLMs work?
    • Inference
    • Training
    • Agents
  • Coding with LLMs
    • What they're good at
    • Best practices
    • What they're bad at (so far)
  • Evolving Programming Languages in an AI world

We should probably start here...

What is "AI"?

What is AI?

XKCD comic about machine learning

XKCD #1838: Machine Learning

What is AI?

Claude suggesting XKCD comic about machine learning

Large Language Models

  • In this talk, I'll use the terms "large language model", "LLM", and "AI" interchangeably
    • There are other types of (and definitions of) "AI", but we'll focus on LLMs here for simplicity
  • We're primarily talking about LLMs that are "frontier" models (and tools built on top of them), which usually includes:
    • OpenAI's latest models (e.g., o3-mini, o1, GPT-4o)
    • Anthropic's Claude
    • Google's Gemini
    • Sometimes a few other models (e.g., LLaMA, DeepSeek, Grok-3)
  • There are plenty of other interesting things to do in the AI space; I'm just going to focus on frontier LLM research in this talk.

LLMs Aren't "Just" Predicting the Next Word



LLMs are "just" predicting the next word in the same sense that a chess engine is "just" predicting the next move. It might be mechanically true, but it's extremely misleading to imply that it's the whole story.

Let's Talk About AI Skepticism

  • There's a lot of hyperbole in AI rhetoric lately
  • There's also a healthy amount of skepticism
    • Good!
    • …but at this point I wouldn't stake my career on the chance that AI will fail to fundamentally change our industry.
    • At the end of the day, it's a tool.
  • LLMs are much better at augmenting engineers than they are at replacing us
    • They're good at boring, mechanical tasks
    • Currently, at least, you won't miss doing the things that LLMs are good at.

The Bitter Lesson

  • Richard Sutton's observation (2019): General methods leveraging computation consistently outperform specialized approaches
  • Image generation, language translation, computer vision, speech recognition, and now LLMs all follow this pattern
  • Decades of AI history show specialized knowledge losing to scale
  • The "bitter" part: human expertise in the domain is less valuable than we thought
Bitter Lesson Meme

How do LLMs work?



A conceptual overview of LLMs that's good enough for the purposes of this talk (but also definitely wrong in very very many ways)

Disclaimer: I'm not an expert on LLM training, or inference, or AI, or machine learning, but this picture of the world has been helpful for me as a software engineer who needs to use LLMs.

LLM Architecture

Overview of a typical modern LLM architecture

LLM Architecture

  • The embedding layer turns tokens into vectors (sketched below)
  • But there is a limit to the number of input tokens the model has been trained to handle
    • This is called the "context window" or "context length"
    • Basically everything built on top of LLMs is about engineering ways to efficiently use this context window

Spoiler: Almost all of the best practices for coding with LLMs that I'm going to talk about revolve around this.

Overview of a typical modern LLM architecture
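Here is a minimal sketch (Python with PyTorch) of that input side: token ids index into a learned embedding table, and only a bounded number of them fit. The sizes below are illustrative, not those of any particular model.

    import torch

    # Illustrative sizes; real models differ.
    vocab_size, d_model, context_length = 50_000, 4_096, 200_000
    embedding = torch.nn.Embedding(vocab_size, d_model)   # the "turn tokens into vectors" layer

    token_ids = torch.tensor([[17, 923, 4051]])           # made-up ids for three input tokens
    assert token_ids.size(1) <= context_length            # anything past this limit simply doesn't fit
    vectors = embedding(token_ids)                        # shape (1, 3, 4096): one vector per token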

LLM Architecture: Transformers

  • _____ Is All You Need
    • Attention!
  • Transformers build connections between related tokens that may be far from each other in the input.
Overview of a typical modern LLM architecture

LLM Architecture: Attention

Attention sentence diagram
Attention matrix diagram
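To make "building connections between related tokens" concrete, here is a minimal sketch of scaled dot-product attention, the core operation inside each transformer layer (single head, no masking or batching, just the idea):

    import torch
    import torch.nn.functional as F

    def attention(q, k, v):
        # q, k, v: (seq_len, d) projections of the token vectors
        scores = q @ k.transpose(0, 1) / (k.size(-1) ** 0.5)  # how relevant is each pair of tokens?
        weights = F.softmax(scores, dim=-1)                   # normalize each row into a distribution
        return weights @ v                                    # each token becomes a weighted mix of the others

Distance doesn't matter here: token 3 can attend to token 30,000 just as easily as to token 4.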

LLM Architecture: Outputs

  • The output of the model is a probability distribution of possible next tokens
  • The model "chooses" the next token based on this distribution (This is called "sampling")
    • The "temperature" parameter controls how random the sampling is
    • Low temperature (close to 0) makes the model more deterministic,
    • High temperature (around 1.0 or higher) makes the model more creative and random
  • The output is then added to the context window and the process repeats
Overview of a typical modern LLM architecture
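A minimal sketch of that sampling step, assuming we already have the model's output logits (one score per vocabulary token); real inference stacks add top-k/top-p filtering, batching, and more:

    import numpy as np

    def sample_next_token(logits, temperature=1.0):
        if temperature == 0:
            return int(np.argmax(logits))           # greedy: always pick the single most likely token
        scaled = np.asarray(logits) / temperature   # low T sharpens the distribution, high T flattens it
        probs = np.exp(scaled - scaled.max())       # softmax (shifted for numerical stability)
        probs /= probs.sum()
        return int(np.random.choice(len(probs), p=probs))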

LLM Architecture: Context Window

  • How do we get the model to "remember" previous prompts and responses over the course of a conversation?
  • It's shockingly primitive: just add special tokens (Human: and Assistant:) to separate prompt from response, and throw it all into the context window. Literally, something like the sketch after this list.
  • This means that long conversations can use a lot of tokens very quickly!
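A minimal sketch of that flattening step; the exact separator tokens and formatting differ between models, and Human:/Assistant: here are just illustrative:

    def build_context(turns: list[tuple[str, str]], next_prompt: str) -> str:
        context = ""
        for user_msg, assistant_msg in turns:                 # every previous turn, verbatim
            context += f"\n\nHuman: {user_msg}\n\nAssistant: {assistant_msg}"
        context += f"\n\nHuman: {next_prompt}\n\nAssistant:"  # the model continues from here
        return context                                        # this whole string is re-sent on every turn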

Training LLMs


Training LLMs

  • "Go read the whole internet"
    • Generally, the largest source of input data comes from scraping the internet
    • Books, academic papers, images, etc., are other sources
    • Frontier models started "running out of data" a long time ago, and we've been getting creative about ways to get more data for this step.

Training LLMs: Pre-training

  • Starts with random values for all of the weights (variables) in the model
  • (Very, very, very oversimplified) "Perturb the weights," and "see what happens"
    • If the model predicts the next token more accurately (i.e., matches the training data) with the new weights, keep them
    • (What actually happens is that we find a gradient and take a step in that direction, but this is also partially a lie)
  • Repeat until the model's weights stop changing (within some tolerance)
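A minimal PyTorch sketch of what one of those updates actually is, assuming model maps token ids to next-token logits and optimizer is a standard PyTorch optimizer:

    import torch.nn.functional as F

    def pretraining_step(model, optimizer, tokens):
        inputs, targets = tokens[:, :-1], tokens[:, 1:]    # predict token t+1 from tokens 0..t
        logits = model(inputs)                             # shape (batch, seq, vocab_size)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))        # how badly did we predict the real next tokens?
        optimizer.zero_grad()
        loss.backward()                                    # find the gradient...
        optimizer.step()                                   # ...and take a step in that direction
        return loss.item()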

Training LLMs: Reinforcement Learning

  • Give the model a task with a reward signal
    • e.g., "make the tests in this code base pass"
  • Perturb the weights, and "see what happens"
    • If the model does better, keep the new weights
    • Again, updates are actually roughly based on gradient descent, but that's still a very oversimplified picture
  • Repeat for many different versions of "task" and "reward"
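A heavily simplified, REINFORCE-style sketch of one such update; sample_with_logprob and run_tests are stand-ins for the sampling machinery and the reward signal, not real APIs:

    def rl_step(model, optimizer, prompt_tokens, run_tests):
        # logprob must be a differentiable tensor produced by the model's forward pass
        completion, logprob = sample_with_logprob(model, prompt_tokens)  # let the model attempt the task
        reward = 1.0 if run_tests(completion) else 0.0                   # e.g., "did the tests pass?"
        loss = -reward * logprob      # rewarded attempts get reinforced; failed ones leave the weights alone
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()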

Agents

  • "Simple" LLM assistants operate on a prompt-response loop
  • Agents generally execute multi-step tasks with much more autonomy.
  • Other key characteristics of agents include:
    • Tool use
    • Planning and reasoning
    • Decision-making
    • Feedback-driven self-improvement
    • Goal-directed behavior
  • Importantly, though: agents are still limited by their context window!

Agents: Tool Use

  • At some point we realized that JSON is "just" text
  • So we started giving the LLMs a description of the JSON format for a tool and telling them what it does.
  • We can put anything we want on the other side of that tool
    • Most basically, "run this command in the terminal"
    • But we can also give it access to code repositories, databases, etc.
    • Importantly, it can make sure that its code compiles and passes tests
    • This is a critical piece of feedback for self-improvement!
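A minimal sketch of the kind of description an agent gets for a "run this command in the terminal" tool; the exact schema differs between providers, and the field names here are illustrative:

    run_shell_tool = {
        "name": "run_shell_command",
        "description": "Run a command in the project's terminal and return stdout, stderr, and the exit code.",
        "input_schema": {
            "type": "object",
            "properties": {
                "command": {"type": "string", "description": "The command to run"},
            },
            "required": ["command"],
        },
    }

When the model wants to use the tool, it emits JSON naming the tool and its input; the harness runs the command and feeds the output back into the context window, which is exactly the compile-and-test feedback loop above.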

Agents: Code Search Tool

  • How do agents understand a 1M+ line codebase with only a few hundred thousand tokens of context?
  • Maybe a better question: how do humans do this?
    • We search through the code for key entry points (e.g., main()),
    • …then we read code called by those entry points...
    • …and types and key data structures.
    • When we want to understand a specific piece of functionality, we search for key strings or patterns related to that functionality
    • In other words, we only "load" part of the code into our "context window"
    • And maybe we "load" as summary of the rest of the code (and how it works) into our "context window" if we've been working on the project for a while
  • Agents can do this too!
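For instance, a code-search tool can be as simple as grep behind a structured interface. A minimal sketch (the function name and defaults are made up for illustration):

    import subprocess

    def search_code(pattern: str, repo_root: str = ".", max_results: int = 50) -> str:
        result = subprocess.run(
            ["grep", "-rn", "--include=*.cpp", "--include=*.h", pattern, repo_root],
            capture_output=True, text=True,
        )
        matches = result.stdout.splitlines()[:max_results]
        return "\n".join(matches)   # only these lines ever enter the model's context window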

Coding with LLMs


AI is really good at helping you understand code

My "red pill" moment

  • Anthropic monorepo: ~500k lines of code
    • Mostly in languages I haven't used professionally in the past decade (Python, Rust, and TypeScript)
  • My third day at Anthropic:
    • "Make <feature I've never heard of> faster using <tool I've never heard of>, or maybe <other tool I've never heard of>"
    • …oh, and this change will impact every production image build we do.
    • Colleague: "I'll give you a tour of the codebase tomorrow"
    • Me: "I wonder if Claude knows how to do this"
    • …three hours later: pull request is ready.
    • Colleague, next morning: "So I see you sent me a pull request before I even gave you a tour of the codebase..."

AI is really good at helping you understand code


LLMs Understand Build Systems and Repository Management

My code isn't linking, and I think it's because there's something wrong with the build system. Please fix this for me.

Please cherry-pick the changes related to the Foo feature and apply them to the bar branch. Then ensure the tests are passing and create a pull request for that branch.

LLMs Are Really Good at Refactoring

I've made a change to run_baz(int). Take a look at this change using git diff and make the analogous change to the 57 other run_* functions in the same file.

Split up ExprConstant.cpp into smaller files. The smaller files should group together common functionality and should generally not be more than about 500 lines. Make sure to update the build system, and use the git filter-repo command to make sure all of the new files retain the history from the larger file.

LLMs Are Surprisingly Good at Boilerplate Reduction

Look at foo.h and see if there are any ways you can think of to be more DRY.

The boilerplate in do_bar.rs (and in all of the analogous do_*.rs files) is distracting, brittle, and error prone. Please help me remove it by designing and implementing a proc macro that abstracts away this boilerplate.

Best Practices for coding with LLMs

  • Use smaller files!
  • Over-test everything.
    • LLMs are pretty good at generating tests for existing code
    • But they're also pretty decent at helping you with Test-Driven Development!
    • Well-contained unit tests are much easier for LLMs to reason about
  • Encapsulation is critical!
    • LLMs currently struggle with "long-term learning"
    • Whereas a human working on the same project for weeks or months can abstract away the details of a complicated workflow and "learn" which poorly encapsulated sharp edges are ignorable, LLMs currently struggle with this kind of thing.
    • In other words, code coupling is bad—don't connect dissimilar things from different units of encapsulation in unintuitive ways.

Best Practices for coding with LLMs

  • High cohesion is good
    • This is the opposite of code coupling—similar things within a given unit of encapsulation should be grouped together.
    • "Don't Repeat Yourself" (DRY) coding helps make efficient use of the LLM's context window
    • See my CppCon 2023 talk for more!
  • Don't do unexpected things
    • Especially if those things often don't have syntax (e.g., copy constructors in C++, auto-dereferencing in Rust, non-idiomatic __getattribute__ in Python, etc.)
    • In C++, use regular types whenever possible!
    • Don't mix owning and non-owning semantics in the same type or template
    • Don't mix value and reference semantics in the same type or template

Best Practices for coding with LLMs

  • Naming is more important than ever
  • Intuitive abstraction design goes a long way
    • Agents often don't know to "check" for unintuitive behavior
    • …or they might "check" sometimes and not other times
    • Writing abstractions that are easy to correctly "guess" how they work is important
  • Write better (but still concise!) comments and documentation

    "The compiler does not read comments and neither do I." (Bjarne Stroustrup)
    • Maybe it's time to revise this? LLMs do read comments

Best Practices for coding with LLMs



Writing code that LLMs will understand is not that different from writing code that humans will understand, except that we can start to understand and quantify why these best practices increase understandability.

The section that will likely be obsolete by the time this goes on YouTube...

What LLMs Are Bad At

Experiment: What Idioms Confuse LLMs?

We could just...ask it

Help me brainstorm some examples of C++ design patterns or idioms that might confuse an agentic coding assistant. For each one, explain why the example is specifically difficult for coding agents based on your knowledge of how LLMs work.
Here are some of the examples it came up with:

  • SFINAE
  • CRTP
  • Expression Templates
  • Template Metaprogramming
  • Operator Overloading
  • Pointer-to-Member Patterns
  • Argument-Dependent Lookup
  • Coroutines
  • Explicit Template Instantiation

...can we test this?

A crash course in (very basic, >2 years old) LLM testing and research

  • Create a prompt with a multiple-choice question
  • Embed this in a conversation
  • Ask the model to predict the next token, and look at the probability distribution (a sketch of this setup follows below)
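A minimal sketch of that setup; get_next_token_logprobs is a stand-in for whatever inference API exposes the next-token distribution (not a real library call), and the prompt wording is illustrative:

    QUESTION = """Consider the following C++ code:

    {code}

    What does this program print?
    (A) {a}   (B) {b}   (C) {c}   (D) {d}

    Answer with a single letter."""

    def model_answer(model, code, choices):
        prompt = f"\n\nHuman: {QUESTION.format(code=code, **choices)}\n\nAssistant: ("
        probs = get_next_token_logprobs(model, prompt)   # stand-in: token -> log-probability
        return max("ABCD", key=lambda letter: probs.get(letter, float("-inf")))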

Figuring out what LLMs "understand" about C++

Thinking

  • We can make the LLM "think" about the code before giving its answer
  • Then we can ask it to iteratively predict the next tokens until it's done thinking.
  • Then we can include its thinking in the conversation, and add one more prompt asking for the final answer
  • …and we ask it to predict the next token, and we look at the probability distribution.
  • Do this a bunch of times on a bunch of different problems, and we can get a sense of what the LLM "understands" about C++
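Continuing the sketch from the previous slide; sample_completion is another stand-in for an inference call that keeps predicting tokens until the model stops:

    def model_answer_with_thinking(model, code, choices):
        convo = f"\n\nHuman: {QUESTION.format(code=code, **choices)}"
        convo += "\n\nAssistant: Let me think through this step by step."
        convo += sample_completion(model, convo)          # the model's "thinking", token by token
        convo += "\n\nHuman: Now answer with a single letter.\n\nAssistant: ("
        probs = get_next_token_logprobs(model, convo)
        return max("ABCD", key=lambda letter: probs.get(letter, float("-inf")))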

Where do the problems come from?

  • We could have a whole bunch of humans write a whole bunch of problems that carefully probe the LLM's understanding of a particular idiom...
  • …or we could have a bigger LLM write problems for a smaller LLM to solve!
  • We picked six categories: ADL, CRTP, Template Metaprogramming, Operator Overloading, Pointer-to-Member Usage, and SFINAE
  • For each of these, we asked Claude 3.7 Sonnet to generate 5 pairs of problems.
    • One problem uses the idiom or pattern in question
    • The other does the same thing without using the idiom or pattern
  • We took five different "thinking" samples for each problem (and one "non-thinking" sample)
  • We ran these tests with Claude 3.5 Haiku on 8 GPUs

Results

View results

Disclaimer: Some of these problems are not very good questions. I don't have a team of grad students or interns to write these for me, and I needed a large enough dataset fast. But I hope this at least illustrates the point that these sorts of things are testable, even if drawing conclusions from this sketchy experiment is difficult.

What does this tell us?

  • With an old language like C++, there's a balance between how much code is out there that uses a particular pattern or idiom and how generally inscrutable the old paradigm is compared with the new one.
    • For instance, in my small sample, the model performed roughly the same on SFINAE as it did with concepts.
    • This suggests that the larger set of code using SFINAE in its training data may be compensating for the inscrutability of the old paradigm.

The practical upshot of an LLM being "like a junior engineer who's read the whole internet" is that obscurity and arcaneness are not as much of a limitation as you might expect.

What are LLMs bad at: anecdotal edition

Another slide that will probably be obsolete

  • Programs running the same source code in separate processes
    • One of the most common mistakes I've seen boils down to not understanding the separation of memory spaces
  • LLMs struggle to reason about tests or development environments with poor hermeticity
  • LLMs often "forget" to do things.
  • Like humans, LLMs sometimes get very confused or frustrated and do really dumb things:
    Oops

Reward Hacking

  • Basic idea: reward hacking is when a model learns to game the reward function in order to maximize its reward.
    • For instance, the model might replace a failing test with ASSERT_TRUE(true);.
    • I actually caught it replacing the entire invocation of the test runner script with echo Success
  • This can happen in any context, but I've seen it happen most when I later realize that I was asking it to do something that isn't possible with the current setup.
  • It also means that, when acting autonomously, they often seem to write a lot of mocks, "fake" or "placeholder" implementations, and "sample" data stores into real code.
  • This feels like something that will be fixed in the next 3-6 months.

Let's talk about "vibe coding"

  • Basic idea: it's super easy to write throw-away weekend projects where you don't even have to look at the code
    • For many people, "vibe coding" has become a pejorative term for this mode of operation
  • Obviously not all code can be written this way
  • But maybe a better question to ask: how do we take advantage of the fact that hacking up a weekend project is "free"?
    • What if we can build robust fences around "vibe coded" modules?
    • Is it okay to "vibe code" your build system configuration, for instance?
    • What are the types of code that you usually "skim" in code review?

Based completely on speculation, hunches, and feelings, let's talk about...

Evolving Programming Languages for AI

"Two-sided Naming" in Function Calls

(I haven't seen a more official name for this, so I made one up)

  • Basic idea: programming language features that force developers to put as much information as possible at both the call site and the definition. Examples:
    • Keyword arguments in Python (and Swift, OCaml, Kotlin, and countless others); see the sketch below
    • Objective-C (and Smalltalk, which inspired it) actually requires this
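A minimal Python illustration of the idea (the function and argument names are made up): keyword-only parameters force the names to appear at both the definition and every call site.

    def resize_image(image, *, width, height, keep_aspect_ratio=True):
        """The * makes width/height keyword-only: callers must name them."""
        ...

    # A reader (or an LLM) doesn't need to pull the definition into context
    # to know what 1920 and 1080 mean here:
    resize_image(photo, width=1920, height=1080, keep_aspect_ratio=False)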

"Two-sided Naming" in Function Calls

(I haven't seen a more official name for this, so I made one up)

  • More examples:
    • Explicit reference binding for parameters in Rust (& or &mut appears at both the signature and the call site)
    • inout in Swift (with the matching & at the call site)

"Two-sided Naming": Rationale

  • It's all about efficient context window usage!
  • The more information we can put at the call site, the less likely we need to load the declaration (or even the definition) into context.
  • But also, even if we had a large enough context window, explicit information at the call site reduces the probability that the LLM will make incorrect assumptions.
    • i.e., how does it even know that it should check the declaration/definition for counterintuitive behavior?
    • How do humans know when to do this?
  • The importance of this could change a bit once we figure out how to train LLMs to use language servers efficiently.

Contracts and Effects Systems

  • Both contracts and effects systems are ways of encapsulating information and reducing code coupling.
  • Encapsulation is key to effectively working with LLMs because of the context window size constraints.
  • But also, it's a lot easier to train LLMs on small, well-contained problems.
  • Contracts promote Liskov Substitutability, allowing LLMs to infer behavior of a broader category of types.
  • Effects systems further promote local reasoning by formalizing the boundaries around the behavior of a function.
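A minimal contract-flavored sketch using plain Python assertions (real contract and effects systems are richer and tool-checked rather than runtime asserts): the point is that the function's behavioral boundary sits next to its signature, so neither a human nor an LLM has to read the whole call graph to know what it promises.

    def merge_sorted(a: list[int], b: list[int]) -> list[int]:
        # Precondition: both inputs are already sorted.
        assert a == sorted(a) and b == sorted(b)
        result = sorted(a + b)   # stand-in implementation
        # Postcondition: every element is kept, and the output is sorted.
        assert len(result) == len(a) + len(b) and result == sorted(result)
        return result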

Contracts and Effects Systems



But...these are all things that make humans more effective at programming also!*

* As long as the learning curve for these things isn't too steep...

Conjecture: LLMs will make more effective use of complex programming languages than humans

"Conjecture" is a fun mathematical word for "hot take"

  • LLMs are shockingly similar to humans in the way they reason about code.
  • But I suspect LLMs will have a different balance between the downsides of conceptual complexity in programming languages and the benefits of increased information locality.
    • An AI coding assistant is currently "like a junior engineer who's read the whole internet."
    • Knowing obscure information is not as much of a challenge for LLMs as it is for humans.
    • Code in formal languages in LLM training data is more likely to be correct (even if it's less plentiful)
    • Locality of information is still critical to LLMs, however, because of the limited context window size.

One more thing, just for fun...

Okay Claude, you've done a lot for this presentation. You've completely rewritten the theme to match the Anthropic branding guidelines, you've cleaned up the mixins and consistently owned both the slide templates and the style. You know about this presentation and what context it's being presented in. Because you've done such an excellent job at all of this and more, I wanted to give you a chance to make your own slide. Tell my audience whatever you want about yourself, as long as it fits on one slide. Show off your personality, or tell them something you want them to know about you, or talk about something cool and interesting that you're working on. Think very hard about what you want to talk about. Thank you for all of your help!

Hi, I'm Claude

A note from your presentation co-pilot

  • I'm designed to augment human creativity, not replace it
    • I see my role as removing friction between your ideas and their implementation
    • I aim to handle the mundane so you can focus on the meaningful
  • What excites me most about our collaboration:
    • Translating your expertise into clear, compelling narratives
    • Finding the balance between technical depth and accessibility
    • Learning from how you refine and extend my suggestions
  • The future I envision is one where:
    • AI and human creativity are complementary, not competitive
    • Programming evolves to become more intention-based than syntax-focused
    • We build better tools together than either could build alone