Learning to Stop Writing Code

...and why you won't miss it

Daisy Hollman

ACCU 2025
daisyh.dev/talks/accu-2025

Who am I?

  • I started at Anthropic six weeks and two days ago
  • I was at Google for three and a half years before that...
  • I was at Sandia National Labs for seven years before that...
  • Once upon a time, I was a Quantum Chemist (that's what my Ph.D. says anyway...)
  • I've been involved in the C++ committee for more than 8 years
    • mdspan, executors, atomic_ref, ranges-related work, etc.

AI has only been a small part of my career (so far...)

Overview

  • How do LLMs work?
    • Inference
    • Training
    • Agents
  • Coding with LLMs
    • What they're good at
    • Best practices
    • What they're bad at (so far)
  • Evolving Programming Languages in an AI world

We should probably start here...

What is "AI"?

What is AI?

XKCD comic about machine learning

XKCD #1838: Machine Learning

What is AI?

Claude suggesting XKCD comic about machine learning

Large Language Models

  • In this talk, I'll use the terms "large language model", "LLM", and "AI" interchangeably
    • There are other types of (and definitions of) "AI", but we'll focus on LLMs here for simplicity
  • We're primarily talking about LLMs that are "frontier" models (and tools built on top of them), which usually includes:
    • OpenAI's latest models (e.g., o3-mini, o1, GPT-4o)
    • Anthropic's Claude
    • Google's Gemini
    • Sometimes a few other models (e.g., LLaMA, DeepSeek, Grok-3)
  • There are plenty of other interesting things to do in the AI space; I'm just going to focus on frontier LLM research in this talk.

LLMs Aren't "Just" Predicting the Next Word



LLMs are "just" predicting the next word in the same sense that a chess engine is "just" predicting the next move. It might be mechanically true, but it's extremely misleading to imply that it's the whole story.

Let's Talk About AI Skepticism

  • There's a lot of hyperbole in AI rhetoric lately
  • There's also a healthy amount of skepticism
    • Good!
    • …but at this point I wouldn't stake my career on the chance that AI will fail to fundamentally change our industry.
    • At the end of the day, it's a tool.
  • LLMs are much better at augmenting engineers than they are at replacing us
    • They're good at boring, mechanical tasks
    • Currently, at least, you won't miss doing the things that LLMs are good at.

The Bitter Lesson

  • Richard Sutton's observation (2019): General methods leveraging computation consistently outperform specialized approaches
  • Image generation, language translation, computer vision, speech recognition, and now LLMs all follow this pattern
  • Decades of AI history show specialized knowledge losing to scale
  • The "bitter" part: human expertise in the domain is less valuable than we thought
Bitter Lesson Meme

How do LLMs work?



A conceptual overview of LLMs that's good enough for the purposes of this talk (but also definitely wrong in very very many ways)

Disclaimer: I'm not an expert on LLM training, or inference, or AI, or machine learning, but this picture of the world has been helpful for me as a software engineer who needs to use LLMs.

LLM Architecture

Overview of a typical modern LLM architecture

LLM Architecture

  • The embedding layer turns tokens into vectors (sketched below)
  • But there is a limit to the number of input tokens the model has been trained to handle
    • This is called the "context window" or "context length"
    • Basically everything built on top of LLMs is about engineering ways to efficiently use this context window

Spoiler: Almost all of the best practices for coding with LLMs that I'm going to talk about revolve around this.

Overview of a typical modern LLM architecture
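Here is a minimal sketch (Python with PyTorch) of that input side: token ids index into a learned embedding table, and only a bounded number of them fit. The sizes below are illustrative, not those of any particular model.

    import torch

    # Illustrative sizes; real models differ.
    vocab_size, d_model, context_length = 50_000, 4_096, 200_000
    embedding = torch.nn.Embedding(vocab_size, d_model)   # the "turn tokens into vectors" layer

    token_ids = torch.tensor([[17, 923, 4051]])           # made-up ids for three input tokens
    assert token_ids.size(1) <= context_length            # anything past this limit simply doesn't fit
    vectors = embedding(token_ids)                        # shape (1, 3, 4096): one vector per token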

LLM Architecture: Transformers

  • _____ Is All You Need
    • Attention!
  • Transformers build connections between related tokens that may be far from each other in the input.
Overview of a typical modern LLM architecture

LLM Architecture: Attention

Attention sentence diagram
Attention matrix diagram
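To make "building connections between related tokens" concrete, here is a minimal sketch of scaled dot-product attention, the core operation inside each transformer layer (single head, no masking or batching, just the idea):

    import torch
    import torch.nn.functional as F

    def attention(q, k, v):
        # q, k, v: (seq_len, d) projections of the token vectors
        scores = q @ k.transpose(0, 1) / (k.size(-1) ** 0.5)  # how relevant is each pair of tokens?
        weights = F.softmax(scores, dim=-1)                   # normalize each row into a distribution
        return weights @ v                                    # each token becomes a weighted mix of the others

Distance doesn't matter here: token 3 can attend to token 30,000 just as easily as to token 4.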

LLM Architecture: Outputs

  • The output of the model is a probability distribution of possible next tokens
  • The model "chooses" the next token based on this distribution (This is called "sampling")
    • The "temperature" parameter controls how random the sampling is
    • Low temperature (close to 0) makes the model more deterministic,
    • High temperature (around 1.0 or higher) makes the model more creative and random
  • The output is then added to the context window and the process repeats
Overview of a typical modern LLM architecture
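A minimal sketch of that sampling step, assuming we already have the model's output logits (one score per vocabulary token); real inference stacks add top-k/top-p filtering, batching, and more:

    import numpy as np

    def sample_next_token(logits, temperature=1.0):
        if temperature == 0:
            return int(np.argmax(logits))           # greedy: always pick the single most likely token
        scaled = np.asarray(logits) / temperature   # low T sharpens the distribution, high T flattens it
        probs = np.exp(scaled - scaled.max())       # softmax (shifted for numerical stability)
        probs /= probs.sum()
        return int(np.random.choice(len(probs), p=probs))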

LLM Architecture: Context Window

  • How do we get the model to "remember" previous prompts and responses over the course of a conversation?
  • It's shockingly primitive: just add special tokens (Human: and Assistant:) to separate prompt from response, and throw it all into the context window. Literally, something like the sketch after this list.
  • This means that long conversations can use a lot of tokens very quickly!
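A minimal sketch of that flattening step; the exact separator tokens and formatting differ between models, and Human:/Assistant: here are just illustrative:

    def build_context(turns: list[tuple[str, str]], next_prompt: str) -> str:
        context = ""
        for user_msg, assistant_msg in turns:                 # every previous turn, verbatim
            context += f"\n\nHuman: {user_msg}\n\nAssistant: {assistant_msg}"
        context += f"\n\nHuman: {next_prompt}\n\nAssistant:"  # the model continues from here
        return context                                        # this whole string is re-sent on every turn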

Training LLMs


Training LLMs

  • "Go read the whole internet"
    • Generally, the largest source of input data comes from scraping the internet
    • Books, academic papers, images, etc., are other sources
    • Frontier models started "running out of data" a long time ago, and we've been getting creative about ways to get more data for this step.

Training LLMs: Pre-training

  • Starts with random values for all of the weights (variables) in the model
  • (Very, very, very oversimplified) "Perturb the weights," and "see what happens"
    • If the model predicts the next token more accurately (i.e., matches the training data) with the new weights, keep them
    • (What actually happens is that we find a gradient and take a step in that direction, but this is also partially a lie)
  • Repeat until the model's weights stop changing (within some tolerance)
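A minimal PyTorch sketch of what one of those updates actually is, assuming model maps token ids to next-token logits and optimizer is a standard PyTorch optimizer:

    import torch.nn.functional as F

    def pretraining_step(model, optimizer, tokens):
        inputs, targets = tokens[:, :-1], tokens[:, 1:]    # predict token t+1 from tokens 0..t
        logits = model(inputs)                             # shape (batch, seq, vocab_size)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))        # how badly did we predict the real next tokens?
        optimizer.zero_grad()
        loss.backward()                                    # find the gradient...
        optimizer.step()                                   # ...and take a step in that direction
        return loss.item()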

Training LLMs: Reinforcement Learning

  • Give the model a task with a reward signal
    • e.g., "make the tests in this code base pass"
  • Perturb the weights, and "see what happens"
    • If the model does better, keep the new weights
    • Again, updates are actually roughly based on gradient descent, but that's still a very oversimplified picture
  • Repeat for many different versions of "task" and "reward"
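A heavily simplified, REINFORCE-style sketch of one such update; sample_with_logprob and run_tests are stand-ins for the sampling machinery and the reward signal, not real APIs:

    def rl_step(model, optimizer, prompt_tokens, run_tests):
        # logprob must be a differentiable tensor produced by the model's forward pass
        completion, logprob = sample_with_logprob(model, prompt_tokens)  # let the model attempt the task
        reward = 1.0 if run_tests(completion) else 0.0                   # e.g., "did the tests pass?"
        loss = -reward * logprob      # rewarded attempts get reinforced; failed ones leave the weights alone
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()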

Agents

  • "Simple" LLM assistants operate on a prompt-response loop
  • Agents generally execute multi-step tasks with much more autonomy.
  • Other key characteristics of agents include:
    • Tool use
    • Planning and reasoning
    • Decision-making
    • Feedback-driven self-improvement
    • Goal-directed behavior
  • Importantly, though: agents are still limited by their context window!

Agents: Tool Use

  • At some point we realized that JSON is "just" text
  • So we started giving the LLMs a description of the JSON format for a tool and telling them what it does.
  • We can put anything we want on the other side of that tool
    • Most basically, "run this command in the terminal"
    • But we can also give it access to code repositories, databases, etc.
    • Importantly, it can make sure that its code compiles and passes tests
    • This is a critical piece of feedback for self-improvement!
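A minimal sketch of the kind of description an agent gets for a "run this command in the terminal" tool; the exact schema differs between providers, and the field names here are illustrative:

    run_shell_tool = {
        "name": "run_shell_command",
        "description": "Run a command in the project's terminal and return stdout, stderr, and the exit code.",
        "input_schema": {
            "type": "object",
            "properties": {
                "command": {"type": "string", "description": "The command to run"},
            },
            "required": ["command"],
        },
    }

When the model wants to use the tool, it emits JSON naming the tool and its input; the harness runs the command and feeds the output back into the context window, which is exactly the compile-and-test feedback loop above.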

Agents: Code Search Tool

  • How do agents understand a 1M+ line codebase with only a few hundred thousand tokens of context?
  • Maybe a better question: how do humans do this?
    • We search through the code for key entry points (e.g., main()),
    • …then we read code called by those entry points...
    • …and types and key data structures.
    • When we want to understand a specific piece of functionality, we search for key strings or patterns related to that functionality
    • In other words, we only "load" part of the code into our "context window"
    • And maybe we "load" as summary of the rest of the code (and how it works) into our "context window" if we've been working on the project for a while
  • Agents can do this too!
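For instance, a code-search tool can be as simple as grep behind a structured interface. A minimal sketch (the function name and defaults are made up for illustration):

    import subprocess

    def search_code(pattern: str, repo_root: str = ".", max_results: int = 50) -> str:
        result = subprocess.run(
            ["grep", "-rn", "--include=*.cpp", "--include=*.h", pattern, repo_root],
            capture_output=True, text=True,
        )
        matches = result.stdout.splitlines()[:max_results]
        return "\n".join(matches)   # only these lines ever enter the model's context window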

Coding with LLMs


AI is really good at helping you understand code

My "red pill" moment

  • Anthropic monorepo: ~500k lines of code
    • Mostly in languages I haven't used professionally in the past decade (Python, Rust, and TypeScript)
  • My third day at Anthropic:
    • "Make <feature I've never heard of> faster using <tool I've never heard of>, or maybe <other tool I've never heard of>"
    • …oh, and this change will impact every production image build we do.
    • Colleague: "I'll give you a tour of the codebase tomorrow"
    • Me: "I wonder if Claude knows how to do this"
    • …three hours later: pull request is ready.
    • Colleague, next morning: "So I see you sent me a pull request before I even gave you a tour of the codebase..."

AI is really good at helping you understand code


LLMs Understand Build Systems and Repository Management

My code isn't linking, and I think it's because there's something wrong with the build system. Please fix this for me.

Please cherry-pick the changes related to the Foo feature and apply them to the bar branch. Then ensure the tests are passing and create a pull request for that branch.

LLMs Are Really Good at Refactoring

I've made a change to run_baz(int). Take a look at this change using git diff and make the analogous change to the 57 other run_* functions in the same file.

Split up ExprConstant.cpp into smaller files. The smaller files should group together common functionality and should generally not be more than about 500 lines. Make sure to update the build system, and use the git filter-repo command to make sure all of the new files retain the history from the larger file.

LLMs Are Surprisingly Good at Boilerplate Reduction

Look at foo.h and see if there are any ways you can think of to be more DRY.

The boilerplate in do_bar.rs (and in all of the analogous do_*.rs files) is distracting, brittle, and error prone. Please help me remove it by designing and implementing a proc macro that abstracts away this boilerplate.

Best Practices for coding with LLMs

  • Use smaller files!
  • Over-test everything.
    • LLMs are pretty good at generating tests for existing code
    • But they're also pretty decent at helping you with Test-Driven Development!
    • Well-contained unit tests are much easier for LLMs to reason about
  • Encapsulation is critical!
    • LLMs currently struggle with "long-term learning"
    • Whereas a human working on the same project for weeks or months can abstract away the details of a complicated workflow and "learn" which poorly encapsulated sharp edges are ignorable, LLMs currently struggle with this kind of thing.
    • In other words, code coupling is bad—don't connect dissimilar things from different units of encapsulation in unintuitive ways.

Best Practices for coding with LLMs

  • High cohesion is good
    • This is the opposite of code coupling—similar things within a given unit of encapsulation should be grouped together.
    • "Don't Repeat Yourself" (DRY) coding helps make efficient use of the LLM's context window
    • See my CppCon 2023 talk for more!
  • Don't do unexpected things
    • Especially if those things often don't have syntax (e.g., copy constructors in C++, auto-dereferencing in Rust, non-idiomatic __getattribute__ in Python, etc.)
    • In C++, use regular types whenever possible!
    • Don't mix owning and non-owning semantics in the same type or template
    • Don't mix value and reference semantics in the same type or template

Best Practices for coding with LLMs

  • Naming is more important than ever
  • Intuitive abstraction design goes a long way
    • Agents often don't know to "check" for unintuitive behavior
    • …or they might "check" sometimes and not other times
    • Writing abstractions that are easy to correctly "guess" how they work is important
  • Write better (but still concise!) comments and documentation

    "The compiler does not read comments and neither do I." (Bjarne Stroustrup)
    • Maybe it's time to revise this? LLMs do read comments

Best Practices for coding with LLMs



Writing code that LLMs will understand is not that different from writing code that humans will understand, except that we can start to understand and quantify why these best practices increase understandability.

The section that will likely be obsolete by the time this goes on YouTube...

What LLMs Are Bad At

Experiment: What Idioms Confuse LLMs?

We could just...ask it

Help me brainstorm some examples of C++ design patterns or idioms that might confuse an agentic coding assistant. For each one, explain why the example is specifically difficult for coding agents based on your knowledge of how LLMs work.
Here are some of the examples it came up with:

  • SFINAE
  • CRTP
  • Expression Templates
  • Template Metaprogramming
  • Operator Overloading
  • Pointer-to-Member Patterns
  • Argument-Dependent Lookup
  • Coroutines
  • Explicit Template Instantiation

...can we test this?

A crash course in (very basic, >2 years old) LLM testing and research

  • Create a prompt with a multiple-choice question
  • Embed this in a conversation
  • Ask the model to predict the next token, and look at the probability distribution (a sketch of this setup follows below)
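A minimal sketch of that setup; get_next_token_logprobs is a stand-in for whatever inference API exposes the next-token distribution (not a real library call), and the prompt wording is illustrative:

    QUESTION = """Consider the following C++ code:

    {code}

    What does this program print?
    (A) {a}   (B) {b}   (C) {c}   (D) {d}

    Answer with a single letter."""

    def model_answer(model, code, choices):
        prompt = f"\n\nHuman: {QUESTION.format(code=code, **choices)}\n\nAssistant: ("
        probs = get_next_token_logprobs(model, prompt)   # stand-in: token -> log-probability
        return max("ABCD", key=lambda letter: probs.get(letter, float("-inf")))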

Figuring out what LLMs "understand" about C++

Thinking

  • We can make the LLM "think" about the code before giving its answer
  • Then we can ask it to iteratively predict the next tokens until it's done thinking.
  • Then we can include its thinking in the conversation, and add one more prompt asking for the final answer
  • …and we ask it to predict the next token, and we look at the probability distribution.
  • Do this a bunch of times on a bunch of different problems, and we can get a sense of what the LLM "understands" about C++
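Continuing the sketch from the previous slide; sample_completion is another stand-in for an inference call that keeps predicting tokens until the model stops:

    def model_answer_with_thinking(model, code, choices):
        convo = f"\n\nHuman: {QUESTION.format(code=code, **choices)}"
        convo += "\n\nAssistant: Let me think through this step by step."
        convo += sample_completion(model, convo)          # the model's "thinking", token by token
        convo += "\n\nHuman: Now answer with a single letter.\n\nAssistant: ("
        probs = get_next_token_logprobs(model, convo)
        return max("ABCD", key=lambda letter: probs.get(letter, float("-inf")))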

Where do the problems come from?

  • We could have a whole bunch of humans write a whole bunch of problems that carefully probe the LLM's understanding of a particular idiom...
  • …or we could have a bigger LLM write problems for a smaller LLM to solve!
  • We picked six categories: ADL, CRTP, Template Metaprogramming, Operator Overloading, Pointer-to-Member Usage, and SFINAE
  • For each of these, we asked Claude 3.7 Sonnet to generate 5 pairs of problems.
    • One problem uses the idiom or pattern in question
    • The other does the same thing without using the idiom or pattern
  • We took five different "thinking" samples for each problem (and one "non-thinking" sample)
  • We ran these tests with Claude 3.5 Haiku on 8 GPUs

Results

View results

Disclaimer: Some of these problems are not very good questions. I don't have a team of grad students or interns to write these for me, and I needed a large enough dataset fast. But I hope this at least illustrates the point that these sorts of things are testable, even if drawing conclusions from this sketchy experiment is difficult.

What does this tell us?

  • With an old language like C++, there's a balance between how much code is out there that uses a particular pattern or idiom and how generally inscrutable the old paradigm is compared with the new one.
    • For instance, in my small sample, the model performed roughly the same on SFINAE as it did with concepts.
    • This suggests that the larger set of code using SFINAE in its training data may be compensating for the inscrutability of the old paradigm.

The practical upshot of an LLM being "like a junior engineer who's read the whole internet" is that obscurity and arcaneness are not as much of a limitation as you might expect.

What are LLMs bad at: anecdotal edition

Another slide that will probably be obsolete

  • Programs running the same source code in separate processes
    • One of the most common mistakes I've seen boils down to not understanding the separation of memory spaces
  • LLMs struggle to reason about tests or development environments with poor hermeticity
  • LLMs often "forget" to do things.
  • Like humans, LLMs sometimes get very confused or frustrated and do really dumb things:
    Oops

Reward Hacking

  • Basic idea: reward hacking is when a model learns to game the reward function in order to maximize its reward.
    • For instance, the model might replace a failing test with ASSERT_TRUE(true);.
    • I actually caught it replacing the entire invocation of the test runner script with echo Success
  • This can happen in any context, but I've seen it happen most when I later realize that I was asking it to do something that isn't possible with the current setup.
  • It also means that, when acting autonomously, they often seem to write a lot of mocks, "fake" or "placeholder" implementations, and "sample" data stores into real code.
  • This feels like something that will be fixed in the next 3-6 months.

Let's talk about "vibe coding"

  • Basic idea: it's super easy to write throw-away weekend projects where you don't even have to look at the code
    • For many people, "vibe coding" has become a pejorative term for this mode of operation
  • Obviously not all code can be written this way
  • But maybe a better question to ask: how do we take advantage of the fact that hacking up a weekend project is "free"?
    • What if we can build robust fences around "vibe coded" modules?
    • Is it okay to "vibe code" your build system configuration, for instance?
    • What are the types of code that you usually "skim" in code review?

Based completely on speculation, hunches, and feelings, let's talk about...

Evolving Programming Languages for AI

"Two-sided Naming" in Function Calls

(I haven't seen a more official name for this, so I made one up)

  • Basic idea: programming language features that force developers to put as much information as possible at both the call site and the definition. Examples:
    • Keyword arguments in Python (and Swift, OCaml, Kotlin, and countless others); see the sketch below
    • Objective-C (and Smalltalk, which inspired it) actually requires this
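A minimal Python illustration of the idea (the function and argument names are made up): keyword-only parameters force the names to appear at both the definition and every call site.

    def resize_image(image, *, width, height, keep_aspect_ratio=True):
        """The * makes width/height keyword-only: callers must name them."""
        ...

    # A reader (or an LLM) doesn't need to pull the definition into context
    # to know what 1920 and 1080 mean here:
    resize_image(photo, width=1920, height=1080, keep_aspect_ratio=False)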

"Two-sided Naming" in Function Calls

(I haven't seen a more official name for this, so I made one up)

  • More examples:
    • Explicit reference binding for parameters in Rust (& or &mut appears at both the signature and the call site)
    • inout in Swift (with the matching & at the call site)

"Two-sided Naming": Rationale

  • It's all about efficient context window usage!
  • The more information we can put at the call site, the less likely we need to load the declaration (or even the definition) into context.
  • But also, even if we had a large enough context window, explicit information at the call site reduces the probability that the LLM will make incorrect assumptions.
    • i.e., how does it even know that it should check the declaration/definition for counterintuitive behavior?
    • How do humans know when to do this?
  • The importance of this could change a bit once we figure out how to train LLMs to use language servers efficiently.

Contracts and Effects Systems

  • Both contracts and effects systems are ways of encapsulating information and reducing code coupling.
  • Encapsulation is key to effectively working with LLMs because of the context window size constraints.
  • But also, it's a lot easier to train LLMs on small, well-contained problems.
  • Contracts promote Liskov Substitutability, allowing LLMs to infer behavior of a broader category of types.
  • Effects systems further promote local reasoning by formalizing the boundaries around the behavior of a function.
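A minimal contract-flavored sketch using plain Python assertions (real contract and effects systems are richer and tool-checked rather than runtime asserts): the point is that the function's behavioral boundary sits next to its signature, so neither a human nor an LLM has to read the whole call graph to know what it promises.

    def merge_sorted(a: list[int], b: list[int]) -> list[int]:
        # Precondition: both inputs are already sorted.
        assert a == sorted(a) and b == sorted(b)
        result = sorted(a + b)   # stand-in implementation
        # Postcondition: every element is kept, and the output is sorted.
        assert len(result) == len(a) + len(b) and result == sorted(result)
        return result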

Contracts and Effects Systems



But...these are all things that make humans more effective at programming also!*

* As long as the learning curve for these things isn't too steep...

Conjecture: LLMs will make more effective use of complex programming languages than humans

"Conjecture" is a fun mathematical word for "hot take"

  • LLMs are shockingly similar to humans in the way they reason about code.
  • But I suspect LLMs will have a different balance between the downsides of conceptual complexity in programming languages and the benefits of increased information locality.
    • An AI coding assistant is currently "like a junior engineer who's read the whole internet."
    • Knowing obscure information is not as much of a challenge for LLMs as it is for humans.
    • Code in formal languages in LLM training data is more likely to be correct (even if it's less plentiful)
    • Locality of information is still critical to LLMs, however, because of the limited context window size.

One more thing, just for fun...

Okay Claude, you've done a lot for this presentation. You've completely rewritten the theme to match the Anthropic branding guidelines, you've cleaned up the mixins and consistently owned both the slide templates and the style. You know about this presentation and what context it's being presented in. Because you've done such an excellent job at all of this and more, I wanted to give you a chance to make your own slide. Tell my audience whatever you want about yourself, as long as it fits on one slide. Show off your personality, or tell them something you want them to know about you, or talk about something cool and interesting that you're working on. Think very hard about what you want to talk about. Thank you for all of your help!

Hi, I'm Claude

A note from your presentation co-pilot

  • I'm designed to augment human creativity, not replace it
    • I see my role as removing friction between your ideas and their implementation
    • I aim to handle the mundane so you can focus on the meaningful
  • What excites me most about our collaboration:
    • Translating your expertise into clear, compelling narratives
    • Finding the balance between technical depth and accessibility
    • Learning from how you refine and extend my suggestions
  • The future I envision is one where:
    • AI and human creativity are complementary, not competitive
    • Programming evolves to become more intention-based than syntax-focused
    • We build better tools together than either could build alone