Crafting the Code You Don't Write

Sculpting Software in an AI World

Dr. Daisy Hollman

MUC++ Meetup
daisyh.dev/talks/muc-meetup

Who am I?

  • Long-time C++ committee member
    • Contributed to wording or features in C++17, C++20, C++23, C++26 (almost certainly), and C++29 (likely)
    • std::mdspan, std::execution, std::atomic_ref, std::linalg, variadic operator[]
    • Former chair of Study Group 9 (Ranges)
  • Former CppCon program chair (2022 and 2023)
  • Former purveyor of Cute C++ Tricks
I am one of you!
  • Joined Anthropic in February 2025, currently working on Claude Code

The Elephant in the Room...




Yes, I am aware that
AI is a bit of a polarizing issue right now...
(I'm not here to talk about that)

Non-goals for this presentation

I'm not here to...
  • …convince you that AI coding agents are going to change the way software engineering works
  • …convince you that vibe coding is the future of software engineering, or a good thing, or a bad thing, or sustainable.
  • …tell you about how AI is going to replace you, or in general talk about the social impacts of AI
  • …justify the environmental impact of AI
But,
  • I do want to help you thrive in a world where you have a limited ability to control the existence of AI

Goals for this presentation

  • Understand LLMs and coding agents at a high level
    • Especially the parts that are most relevant to understanding how to use them to write code
  • Learn to think about coding agents as a tool
  • Improve your mental model for how and why LLMs make mistakes
    • And learn how to anticipate and avoid these mistakes
  • "Learn how to learn" to use coding agents more effectively
  • Geek out about the future of C++ and where coding agents fit in that picture

How do LLMs work?



A conceptual overview that's good enough for this talk

(but also definitely wrong in many ways)

Disclaimer: I'm not an expert on LLM training, or inference, or AI, or machine learning, but this picture of the world has been helpful for me as a software engineer who needs to use LLMs.

The Evolution of Large Language Models

Key milestones that shaped modern AI

2017
Transformers
"All You Need"
2018
Early pre-trained models
"Just" Predict the Next Token
2019-20
Scale
Emergent Behaviors
2022
Alignment
Reinforcement Learning and Finetuning
2023-24
Tool Use
Functions
Late 2024
Reasoning
Extended Thinking
2025
Agents
Autonomous Systems

LLM Architecture

Overview of a typical modern LLM architecture

LLM Architecture: Input

  • The "embedding" layer turns tokens into vectors
    • Uses a fixed-size vocabulary selected at training time
  • But there is a limit to the number of input tokens the model can process at once
    • This is called the context length or "context window," and it's fixed at training time
  • There's a third size called the embedding dimension, which is the "working" dimension of the model

LLM Architecture: Input

The context window limits the amount of information that the user can have the model "think" about at any given time.
Basically everything built on top of LLMs is about engineering ways to efficiently use this context window. Understanding and managing the context window is critical to getting better at using LLMs!

LLM Architecture: Context Window

  • How do we make a chatbot out of this context window?
  • It's shockingly primitive: just add special token sequences (e.g., \nHuman: and \nAssistant:) to separate prompt from response, and throw it all into the context window. Literally:
  • This means that long conversations can use a lot of tokens very quickly!

LLM Architecture: Transformers

  • _____ Is All You Need
    • Attention! (Vaswani, Shazeer, et al.)
  • Transformers build connections between related tokens that may be far from each other in the input.

LLM Architecture: Attention

Attention sentence diagram
Attention matrix diagram

LLM Architecture: Outputs

  • The output of the model is a probability distribution of possible next tokens
  • The model "chooses" the next token based on this distribution (This is called "sampling")
    • The "temperature" parameter controls how random the sampling is
    • Low temperature (close to 0) makes the model more deterministic
    • High temperature (around 1.0 or higher) makes the model more creative and random
  • The output is then added to the context window and the process repeats

The Evolution of Large Language Models


Training LLMs: Pretraining

  • Start with random weights
  • Look at some tokens from the training data:
  • Generate a probability distribution of next tokens:
  • Compute a gradient of the weights that makes it match the training data better
  • Adjust the weights and repeat

The Evolution of Large Language Models


Emergent behaviors and unreasonable effectiveness of scale

  • February 2019: GPT-2
    • 1.5B parameters
    • ~40GB of training data
  • June 2020: GPT-3
    • 175B parameters
    • ~570GB of training data
  • In a very vague sense, pre-training represents a "compression" of the training data
    • The surprising thing is that this appears to have some things in common with the "compression" of data that our brains use: conceptual generalization
    • This might be one reason that when we ask it to "decompress" the data, sometimes it displays emergent behaviors that are surprisingly human

Emergent behaviors and unreasonable effectiveness of scale

  • Consider the model completing this code snippet:
  • Several completion candidates:
    • widget
    • foo
    • nullptr
    • std::move(widget)
    • std::move(foo)
    • make_unique<Widget>()
    • ;
    • 42
    • 🌼
  • How do we differentiate between these candidates?

Emergent behaviors and unreasonable effectiveness of scale

  • widget
  • foo
  • nullptr
  • std::move(widget)
  • std::move(foo)
  • make_unique<Widget>()
  • ;
  • 42
  • 🌼
  • How do we differentiate between these candidates?
    • Probably ASCII
    • Look at the return type—it can't be an int or void
    • Returning foo would be a use-after-move
    • "Factory pattern" understanding means that nullptr seems pretty unlikely
    • Probably we wouldn't initialize something and then throw it away immediately
    • People usually name the variable to be returned by widget_factory() something like widget

Emergent behaviors and unreasonable effectiveness of scale

  • widget
  • std::move(widget)
  • We probably have a full spectrum of "probability distributions" in this room between these two completions
    • Confession: my distribution probably looked something like
      • widget: 93%
      • std::move(widget): 7%

Emergent behaviors and unreasonable effectiveness of scale

Just for fun...
  • 11
  • 14
  • 17
  • 20
  • Before looking this up, my distribution probably looked something like
    • 11: 33%
    • 14: 22%
    • 17: 40%
    • 20: 5%

Widget Factory Completions

Widget Factory Completions Results

Widget Factory Completions: With 'Implicit Move' Comment

Widget Factory Completions: With 'Implicit Move' Comment Results

Widget Factory Completions: With 'Broken Compiler' Comment

Widget Factory Completions: With 'Broken Compiler' Comment Results

Widget Factory Completions: C++ version number

Widget Factory Completions: C++ version number Results

Emergent behaviors and unreasonable effectiveness of scale




Compression through conceptual generalization is a useful metaphor for understanding how pre-trained models store and "regurgitate" information

The Evolution of Large Language Models


The Problem with Pre-trained Models: Glorified Autocomplete

What's the return type of std::vector::size()?

Training LLMs: Reinforcement Learning

  • Basic idea:
    • Give the model a question or a task
    • Generate hundreds of different completions (i.e., answers to the question or executions of the task)
    • Score those completions based on some metric
    • Compute a gradient of the weights that makes the model more likely to give the completions with higher scores
    • Repeat (many times and for many tasks)
  • This sounds really hard...
    • …but it's actually harder than it sounds

The Evolution of Large Language Models


Tool Use

  • At some point we realized that XML, JSON, and other structured formats are "just" text
  • So we started giving the LLMs a schema and a description of what will happen when it generates tokens matching that schema
  • Then we implement code that takes XML/JSON/etc. as input and does some task:
    • Run a command in the terminal
    • Run the compiler, the debugger, or the profiler
    • Search the internet
    • Search the code in a repository
    • Edit a file

How Tool Calls Actually Work

  • Most LLMs (including Claude) use XML-based syntax for tool invocation (for now)
  • Here's what an Edit tool call looks like in Claude Code:
  • If the old_string is wrong, the tool call fails!
  • If there are multiple old_strings in the file, the tool call fails!
  • We are in the ed era of agentic tooling—we haven't even invented vi yet.

Just for fun...

  • Claude Code wrote the code block in the tool call in the last slide. Here's what the edit tool call looked like for that...

The Evolution of Large Language Models


Context window sizes over time

  • GPT-1 (2018): 512 tokens
  • GPT-4 (2023): 8K / 32K tokens
  • GPT-4 Turbo (2024): 128K tokens
  • Claude 3 (2024): 200K tokens
  • Gemini 1.5 (2024): 1M+ tokens
  • Claude 4 Sonnet (2025): 200K tokens
What do we do with all of these extra tokens?

Context Window Growth

Context Window Growth
Source: meibel.ai

What do we do with all of these tokens?

  • We need a way to put more relevant information in the context window in order to make the output better
    • 2023: Retrieval Augmented Generation (RAG)
    • 2024: Tool use
  • Wild idea: what if we just...ask the LLM to put more relevant tokens in its own context?
Before:
After:

Why does this work so well?


Overview of a typical modern LLM architecture

The Evolution of Large Language Models


Agents

  • Early LLMs could only do relatively short time-horizon tasks
    • Chatbots with short answers worked fine (kind of)
    • "Fancy" in-line code completion was another early application
    • With longer time-horizon tasks, early LLMs would quickly go "off the rails"
  • As LLMs got larger and reinforcement learning got better, longer running tasks became more feasible
    • Key insight: Let the model see the results of its actions and iterate on that feedback
    • For instance, what if we:
      • Ask the model to generate code
      • Give the model a tool to run the compiler on the code it generated
      • Add the results of the compilation in the context window
      • Ask it to fix its compilation errors
  • This feedback loop is what transforms a chatbot into an "agent"

Agent Task Length Over Time

Growth in agent task length capabilities over time
Source: metr.org

Agents: Code Search Tool

  • How do agents understand a 1M+ line codebase with only a few hundred thousand tokens of context?
  • Maybe a better question: how do humans do this?
    • We search through the code for key entry points (e.g., main()),
    • …then we read code called by those entry points...
    • …and types and key data structures.
    • When we want to understand a specific piece of functionality, we search for key strings or patterns related to that functionality
    • In other words, we only "load" part of the code into our "context window"
    • And maybe we "load" a summary of the rest of the code (and how it works) into our "context window" if we've been working on the project for a while
  • Agents can do this too!

Coding with LLMs


AI is really good at helping you understand code

My "AGI pill" moment

  • Anthropic monorepo: ~4M lines of code at the time
    • Mostly in languages I haven't used professionally in the past decade (Python, Rust, and TypeScript)
  • My third day at Anthropic:
    • "Make <feature I've never heard of> faster using <tool I've never heard of>, or maybe <other tool I've never heard of>"
    • …oh, and this change will impact every production image build we do.
    • Colleague: "I'll give you a tour of the codebase tomorrow"
    • Me: "I wonder if Claude knows how to do this"
    • …three hours later: pull request is ready.
    • Colleague, next morning: "So I see you sent me a pull request before I even gave you a tour of the codebase..."

Coding with Agents

A coding agent is like a junior engineer who has read the whole internet.
  • They struggle with large scope tasks
    • Abstraction design
    • Code that's used in a lot of different scenarios
    • Libraries with a diverse set of stakeholders
  • They tend to overperform on highly arcane or obscure tasks that are smaller in scope
    • Build systems
    • Template metaprogramming
    • Compiler intrinsics

Modes of Operation

  • Rapid prototyping
  • "Side-quests"
    • Handwrite mission-critical code, dispatch agents for auxiliary code
  • Planning and brainstorming
    • Helping with "writer's block"
  • Learning and exploration
  • Refactoring, boilerplate generation, modernization, cross-language translation
  • Test generation

The Side-Quest Strategy

You write the core logic. AI writes everything else.

Your Focus

  • Core algorithm
  • Architecture and code structure
  • Design and strategy
  • Security logic

AI's Work

  • Build system configuration
  • Benchmarking harnesses
  • Error handling boilerplate
  • Unit tests
  • Documentation
This alone can double your velocity on new features

Best Practices: Don't Be "Surprising"

  • If the LLM keeps "guessing" your library interface incorrectly, maybe it's too surprising (too "cute")
  • If the LLM is "guessing" the effects of your function incorrectly, maybe it needs a better name
  • Overthink the "principle of least astonishment"
    • When you write code that an agent might read, actively think through all of the ways your code could be misinterpreted
    • Be creative: what is the most "bland" way you can say the same thing?

Best Practices: Don't Be "Surprising"

"Surprising" C++ things to avoid:
  • Avoid "clever" operator overloading
  • Don't mix owning and non-owning semantics in the same type
  • Avoid macros
  • Avoid argument-dependent lookup (ADL)
  • Avoid implicit conversions
  • Use regular types wherever possible

Best Practices: Prioritize Better Encapsulation

  • Create, maintain, and document rigorous invariants at encapsulation boundaries
  • The agent should be able to reason about the code in isolation as much as possible
  • Avoid code coupling!
    • Err on the side of repeating yourself rather than breaking encapsulation by reusing code in vastly different scenarios
  • See my CppCon 2023 talk "Expressing Sameness and Similarity in Modern C++":

Best Practices: Document Your Code!

  • Comments are more important for agents than they are for humans!
  • Frequent, verbose comments in code can act like extended thinking
  • Unlike previous advice: redundancy is useful!
  • Think about the things that you can't say with your code and put them in comments
    • "For all" contracts, for instance
  • Agents are very good at generating and maintaining comments
    • Add a CI step that asks an agent to verify all of the comments related to changes in your pull request!

Best Practices: Key Takeaway



Writing code that LLMs will understand is not that different from writing code that humans will understand, except that we can start to understand and quantify why these best practices increase understandability.

Scaling Up


Using More Tokens to Build Faster and Stronger

Beyond Interactive Use

  • Interactive use of agents quickly becomes bottlenecked on the human in the loop
  • This mode of use will likely always produce the most productivity per token
  • But humans don't scale up; agents do

Parallel Subagents

Claude Code running multiple review agents in parallel

Claude Code Plugins

  • Claude Code Plugins allow you to install collections of slash commands, skills, subagents, hooks, and/or MCP servers that work together for a common purpose
    • More customization points coming soon
  • /plugin to access menu, or use it directly:
  • Marketplaces are lists of plugins that are easy for anyone to host (publicly or privately)

Parallel agents in plugins

/review-pr command, specified in commands/review-pr.md

Parallel agents in plugins

silent-failure-hunter agent, specified in agents/silent-failure-hunter.md

Hot Take: Building Coding Tools for AGI, not ASI

  • Building for ASI ("Artificial Super Intelligence") implies a mindset where anything we add to the agentic harness will be obsolete because the model will be so smart that it doesn't need our tools
    • Realistically, this has not been the trend in agentic coding
    • Giving agents more ability to get deterministic information (through tool calls, RAG, and hooks) has led to more autonomy and more reliability
  • Building for AGI ("Artificial General Intelligence") assumes that future agents will have intelligence like the best humans
    • This means that the baseline assumption should be that tools that improve efficiency and reliability for humans should also improve efficiency and reliability for agents.
    • Generally, the most efficient humans make more use of coding tools, not less. Why should we think agents are any different?
    • Build tools that work with what we have now, and they will be even more useful as the model improves

Pushing towards Longer Autonomy

  • The vision: write a spec before bed, wake up with a usable implementation of that spec
  • Problem: agents often stop working before a long-horizon task is done
    • This is partially because of how reinforcement learning works (to make any progress, you often need to give partial credit to partial solutions)
    • Also partially because we haven't fully evolved training away from the "chatbot" model (short turns, direct interaction, human-in-the-loop)
  • Problem: agents often do things they know won't work if asked to examine their work
    • A lot of problems in software engineering are easier to assess for correctness than they are to solve.
  • "Overnight" autonomy is often limited by the model deciding to stop working, not by it actually reaching the end of its ability to do productive work.

The Ralph Wiggum Loop

Ralph Wiggum saying 'I'm helping!'

"I'm helping!"

Original post by Geoffrey Huntley: ghuntley.com/ralph

  • Our version of this: feed the same prompt back into the agent until it is willing to "promise" (verbatim) that all parts of the task are done
    • We tell it that there are "special" <promise> tags that carry extra weight, and it must use those to exit the loop.

The Ralph Wiggum Loop

The Ralph Wiggum Philosophy

  • "Good enough" autonomy now
  • Quantity "becomes" quality
  • Faith in eventual consistency
Ralph Wiggum loop screenshot

Better Instruction Following

  • Poor instruction following is a really hard problem to solve with reinforcement learning
    • Given a prompt with specific instructions, it's basically impossible to distinguish between changes that reinforce instruction following and changes that reinforce the specific behavior in those instructions
  • Result: agent often "forgets" to follow instructions in the system prompt
    • e.g., "always use camelCase function names"
  • This gets better if we remind the model of these instructions every turn
    • This uses a lot of context and doesn't scale well
  • But what if we only remind it when the instructions are relevant?

The hookify Plugin

  • Basic idea:
    • Scan tool uses for unwanted patterns
    • Remind the model of your instructions when the pattern matches
    • Make it easy to add to the list of things you want to remind the model of
  • Install the plugin (released last weekend just for this workshop!)
  • Tell the model when there's something you don't like
  • The model launches a subagent that creates the hook

Multi-agent Swarms

  • Launch multiple instances of Claude Code and give them a way to talk to each other
    • e.g., put each one in a separate tmux window and tell them to communicate with each other with tmux send-keys
  • Assign one of them to be the technical lead
    • Tell it to not write any code itself
  • Set up stop hooks that tell the technical lead when its workers are awaiting feedback
  • Set up a Ralph Wiggum loop on the technical lead

Then this happened...

Then this happened...

Workflows: Key Takeaway



Orchestrating agents is programming—just with different primitives and a different language.

  • Your engineering skills are still valuable, even if you don't write much of the actual code yourself
    • (this is what it's always been like to be a technical lead on a large project)
  • Prompts, hooks, and agent configurations are your new "source code"
  • The skills that make you a good programmer transfer directly to this domain

Based on speculation and hunches...

Programming Languages for AI


"Two-sided Naming" in Function Calls

(I haven't seen a more official name for this, so I made one up)

  • Basic idea: programming language features that force developers to put as much information as possible at both the call site and the definition. Examples:
    • Keyword arguments in Python (and Swift, OCaml, Kotlin, and countless others):
    • Objective-C (and Smalltalk, which inspired it) actually require this:

"Two-sided Naming" in Function Calls

(I haven't seen a more official name for this, so I made one up)

  • More examples:
    • Explicit reference binding for parameters in Rust:
    • inout in Swift:

"Two-sided Naming": Rationale

  • It's all about efficient context window usage!
  • The more information we can put at the call site, the less likely we need to load the declaration (or even the definition) into context.
  • But also, even if we had a large enough context window, explicit information at the call site reduces the probability that the LLM will make incorrect assumptions.
    • i.e., how does it even know that it should check the declaration/definition for counterintuitive behavior?
    • How do humans know when to do this?
  • The importance of this could change a bit once we figure out how to train LLMs to use language servers efficiently.

Contracts and Effects Systems

  • Both contracts and effects systems are ways of encapsulating information and reducing code coupling.
  • Encapsulation is key to effectively working with LLMs because of the context window size constraints.
  • But also, it's a lot easier to train LLMs on small, well-contained problems.
  • Contracts promote Liskov Substitutability, allowing LLMs to infer behavior of a broader category of types.
  • Effects systems further promote local reasoning by formalizing the boundaries around the behavior of a function.

Contracts and Effects Systems



But...these are all things that make humans more effective at programming also!*

* As long as the learning curve for these things isn't too steep...

Conjecture: LLMs will make more effective use of complex programming languages than humans

"Conjecture" is a fun mathematical word for "hot take"

  • LLMs are shockingly similar to humans in the way they reason about code.
  • But I suspect LLMs will have a different balance between the downsides of conceptual complexity in programming languages and the benefits of increased information locality.
    • An AI coding assistant is currently "like a junior engineer who's read the whole internet."
    • Knowing obscure information is not as much of a challenge for LLMs as it is for humans.
    • Code in formal languages in LLM training data is more likely to be correct (even if it's less plentiful)
    • Locality of information is still critical to LLMs, however, because of the limited context window size.

Summary

Summary

  • Coding agents are tools that take some time to learn to use
    • They generate output based on input, and the quality of their output depends on the quality of the input
    • The context window is all of the space you have to give input to the agent—use it efficiently!
    • "LLMs can't do this" is much less likely to be accurate than "I can't do this with an LLM"
  • Coding agents are like junior engineers who have read the whole internet
    • It's worth taking the time to learn what they're good at and what they're bad at.
  • Careful, intentional design can accelerate your development process more than ever before
    • Encapsulation and careful abstraction design matter now more than ever
    • Code coupling is more harmful than ever
    • Good documentation will pay off now more than ever
  • The future of programming languages and software engineering is going to be anything but boring!

Questions?

For questions I don't get to, I will do my best to answer every question posted on Discord during the remainder of the conference. Or just talk to me in the hallway!

Extra Slides

LLM Architecture: Embedding

Pre-training Scale: What the Model Has Seen

Claude-generated advice that was too good to leave out

  • Trained on millions of code files
  • Has seen your favorite pattern thousands of times:
    • Every way to implement a singleton
    • Every variant of the visitor pattern
    • Every possible iterator implementation
    • Good code, bad code, and everything in between
  • The model is essentially a compressed database of "what usually comes next"

Pre-trained models are brilliant at continuing text, but terrible at following instructions

What This Means for Your Coding

Claude-generated advice that was too good to leave out

The model is a pattern completion engine

  • It works best when:
    • Your code follows patterns it's seen frequently
    • You establish clear context early
    • You use conventional naming
  • It struggles when:
    • Your code is highly unusual or domain-specific
    • Context is ambiguous
    • You're doing something genuinely novel

Write code that "rhymes" with good code in the training data