Crafting the Code You Don't Write

Sculpting Software in an AI World

Dr. Daisy Hollman

CppCon 2025 Keynote
daisyh.dev/talks/cppcon-2025

Who am I?

  • Long-time C++ committee member
    • Contributed to wording or features in C++17, C++20, C++23, C++26 (almost certainly), and C++29 (likely)
    • std::mdspan, std::execution, std::atomic_ref, std::linalg, variadic operator[]
    • Former chair of Study Group 9 (Ranges)
  • Former CppCon program chair (2022 and 2023)
  • Former purveyor of Cute C++ Tricks
I am one of you!
  • Joined Anthropic in February 2025, currently working on Claude Code

The Elephant in the Room...




Yes, I am aware that
AI is a bit of a polarizing issue right now...
(I'm not here to talk about that)

Non-goals for this presentation

I'm not here to...
  • …convince you that AI coding agents are going to change the way software engineering works
  • …convince you that vibe coding is the future of software engineering, or a good thing, or a bad thing, or sustainable.
  • …tell you about how AI is going to replace you, or in general talk about the social impacts of AI
  • …justify the environmental impact of AI
But,
  • I do want to help you thrive in a world where you have a limited ability to control the existence of AI

Goals for this presentation

  • Understand LLMs and coding agents at a high level
    • Especially the parts that are most relevant to understanding how to use them to write code
  • Learn to think about coding agents as a tool
  • Improve your mental model for how and why LLMs make mistakes
    • And learn how to anticipate and avoid these mistakes
  • "Learn how to learn" to use coding agents more effectively
  • Geek out about the future of C++ and where coding agents fit in that picture

How do LLMs work?



A conceptual overview of LLMs that's good enough for the purposes of this talk (but also definitely wrong in very very many ways)

Disclaimer: I'm not an expert on LLM training, or inference, or AI, or machine learning, but this picture of the world has been helpful for me as a software engineer who needs to use LLMs.

The Evolution of Large Language Models

Key milestones that shaped modern AI

2017
Transformers
"All You Need"
2018
Early pre-trained models
"Just" Predict the Next Token
2019-20
Scale
Emergent Behaviors
2022
Alignment
Reinforcement Learning and Finetuning
2023-24
Tool Use
Functions
Late 2024
Reasoning
Extended Thinking
2025
Agents
Autonomous Systems

LLM Architecture

Overview of a typical modern LLM architecture

LLM Architecture: Input

  • The "embedding" layer turns tokens into vectors
    • Uses a fixed-size vocabulary selected at training time
  • But there is a limit to the number of input tokens that the model has been trained on
    • This is called the context length or "context window," and it's fixed at training time
  • There's a third size called the embedding dimension, which is the "working" dimension of the model
Overview of a typical modern LLM architecture
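
To make the "tokens become vectors" step concrete, here is a toy sketch of an embedding lookup (my illustration; the names and sizes are made up, and real tokenizers work on subword pieces rather than whole words):

    #include <array>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Conceptually, the embedding layer is just a lookup table from token id
    // to a fixed-size vector of numbers.
    constexpr int kEmbeddingDim = 8;   // real models use thousands of dimensions
    using Embedding = std::array<float, kEmbeddingDim>;

    std::vector<Embedding> embed(const std::vector<std::string>& tokens,
                                 const std::unordered_map<std::string, int>& vocabulary,
                                 const std::vector<Embedding>& embedding_table) {
        std::vector<Embedding> result;
        result.reserve(tokens.size());
        for (const auto& token : tokens) {
            int id = vocabulary.at(token);           // fixed vocabulary chosen at training time
            result.push_back(embedding_table[id]);   // one embedding vector per input token
        }
        return result;
    }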

LLM Architecture: Input

The context window limits the amount of information that the user can have the model "think" about at any given time.
Basically everything built on top of LLMs is about engineering ways to efficiently use this context window. Understanding and managing the context window is critical to getting better at using LLMs!
Overview of a typical modern LLM architecture

LLM Architecture: Context Window

  • How do we make a chat bot out of this context window?
  • It's shockingly primitive: just add special token sequences (e.g., \nHuman: and \nAssistant:) to separate prompt from response, and throw it all into the context window. Literally (see the sketch below):
  • This means that long conversations can use a lot of tokens very quickly!
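
A minimal sketch (my example, not the one from the slide) of what the raw context literally looks like: the whole conversation is one flat token stream, and the model's only job is to keep appending tokens after the final \n\nAssistant: marker.

    \n\nHuman: What does std::move actually do?
    \n\nAssistant: It's just a cast to an rvalue reference; by itself it doesn't move anything.
    \n\nHuman: Then why is it called "move"?
    \n\nAssistant: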

LLM Architecture: Transformers

  • _____ Is All You Need
    • Attention! (Vaswani, Puth, et al.)
  • Transformers build connections between related tokens that may be far from each other in the input.
Overview of a typical modern LLM architecture

LLM Architecture: Attention

Attention sentence diagram
Attention matrix diagram

LLM Architecture: Outputs

  • The output of the model is a probability distribution over possible next tokens
  • The model "chooses" the next token by drawing from this distribution (this is called "sampling")
    • The "temperature" parameter controls how random the sampling is
    • Low temperature (close to 0) makes the model more deterministic
    • High temperature (around 1.0 or higher) makes the model more creative and random
  • The output is then added to the context window and the process repeats
Overview of a typical modern LLM architecture
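
Here is a short sketch of temperature sampling (my illustration, not any particular model's inference code, which does this in a numerically safer way):

    #include <cmath>
    #include <cstddef>
    #include <random>
    #include <vector>

    // `logits` are the raw scores the model produces for each token in the vocabulary.
    int sample_next_token(const std::vector<float>& logits, double temperature,
                          std::mt19937& rng) {
        std::vector<double> weights(logits.size());
        for (std::size_t i = 0; i < logits.size(); ++i)
            weights[i] = std::exp(logits[i] / temperature);  // temperature near 0: top token dominates
                                                             // temperature around 1+: distribution flattens
        std::discrete_distribution<int> choose(weights.begin(), weights.end());
        return choose(rng);                                  // index of the sampled token
    }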

The Evolution of Large Language Models

2017
Transformers
Attention is All You Need
2018
Early pre-trained models
"Just" Predict the Next Token
2019-20
Scale
Emergent Behaviors
2022
Alignment
Reinforcement Learning and Finetuning
2023-24
Tool Use
Functions
Late 2024
Reasoning
Extended Thinking
2025
Agents
Autonomous Systems

Training LLMs: Pretraining

  • Start with random weights
  • Look at some tokens from the training data:
  • Generate a probability distribution of next tokens:
  • Compute a gradient of the weights that makes it match the training data better
  • Adjust the weights and repeat
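
A toy illustration of this loop (mine; real pretraining runs token streams through a neural network at enormous scale, not a bare table of weights): a "model" with one weight per vocabulary entry learns the next-token distribution for a single fixed context by gradient descent.

    #include <array>
    #include <cmath>
    #include <cstdio>

    int main() {
        constexpr int kVocab = 4;              // pretend vocabulary: {"widget", "foo", "nullptr", ";"}
        constexpr int kObservedNextToken = 0;  // training data says "widget" comes next
        constexpr double kLearningRate = 0.5;
        std::array<double, kVocab> weights{};  // start with (here: zero) weights

        for (int step = 0; step < 200; ++step) {
            // Forward pass: softmax turns the weights into a probability distribution.
            std::array<double, kVocab> probs{};
            double denom = 0.0;
            for (double w : weights) denom += std::exp(w);
            for (int i = 0; i < kVocab; ++i) probs[i] = std::exp(weights[i]) / denom;

            // Gradient of the cross-entropy loss w.r.t. each weight is
            // probs[i] - 1 for the observed token, and probs[i] for everything else.
            for (int i = 0; i < kVocab; ++i) {
                double gradient = probs[i] - (i == kObservedNextToken ? 1.0 : 0.0);
                weights[i] -= kLearningRate * gradient;   // adjust the weights and repeat
            }
        }

        double denom = 0.0;
        for (double w : weights) denom += std::exp(w);
        std::printf("P(\"widget\" | context) after training: %.3f\n",
                    std::exp(weights[kObservedNextToken]) / denom);
    }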

The Evolution of Large Language Models

2017
Transformers
Attention is All You Need
2018
Early pre-trained models
"Just" Predict the Next Token
2019-20
Scale
Emergent Behaviors
2022
Alignment
Reinforcement Learning and Finetuning
2023-24
Tool Use
Functions
Late 2024
Reasoning
Extended Thinking
2025
Agents
Autonomous Systems

Emergent behaviors and unreasonable effectiveness of scale

  • February 2019: GPT-2
    • 1.5B parameters
    • ~40GB of training data
  • June 2020: GPT-3
    • 175B parameters
    • ~570GB of training data
  • In a very vague sense, pre-training represents a "compression" of the training data
    • The surprising thing is that this appears to have some things in common with the "compression" of data that our brains use: conceptual generalization
    • This might be one reason that when we ask it to "decompress" the data, sometimes it displays emergent behaviors that are surprisingly human

Emergent behaviors and unreasonable effectiveness of scale

  • Consider the model completing this code snippet (a stand-in version is sketched after this list):
  • Several completion candidates:
    • widget
    • foo
    • nullptr
    • std::move(widget)
    • std::move(foo)
    • make_unique<Widget>()
    • ;
    • 42
    • 🌼
  • How do we differentiate between these candidates?
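
The snippet itself isn't reproduced in these notes; a hypothetical stand-in, consistent with the reasoning on the next slide, might look something like this (Widget, set_child, and the overall structure are my guesses):

    std::unique_ptr<Widget> widget_factory() {
        auto foo = std::make_unique<Widget>();
        auto widget = std::make_unique<Widget>();
        widget->set_child(std::move(foo));   // foo has been moved from
        return ____                          // <- the completion being predicted
    }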

Emergent behaviors and unreasonable effectiveness of scale

  • widget
  • foo
  • nullptr
  • std::move(widget)
  • std::move(foo)
  • make_unique<Widget>()
  • ;
  • 42
  • 🌼
  • How do we differentiate between these candidates?
    • Probably ASCII
    • Look at the return type—it can't be an int or void
    • Returning foo would be a use-after-move
    • "Factory pattern" understanding means that nullptr seems pretty unlikely
    • Probably we wouldn't initialize something and then throw it away immediately
    • People usually name the variable to be returned by widget_factory() something like widget

Emergent behaviors and unreasonable effectiveness of scale

  • widget
  • std::move(widget)
  • We probably have a full spectrum of "probability distributions" in this room between these two completions
    • Confession: my distribution probably looked something like
      • widget: 93%
      • std::move(widget): 7%

Emergent behaviors and unreasonable effectiveness of scale

Just for fun...
  • 11
  • 14
  • 17
  • 20
  • Before looking this up, my distribution probably looked something like
    • 11: 33%
    • 14: 22%
    • 17: 40%
    • 20: 5%

Widget Factory Completions

[Results charts: model completion distributions for the widget factory snippet, shown for the baseline snippet, with an "implicit move" comment added, with a "broken compiler" comment added, and with a C++ version number added.]

Emergent behaviors and unreasonable effectiveness of scale




Compression through conceptual generalization is a useful metaphor for understanding how pre-trained models store and "regurgitate" information

The Evolution of Large Language Models

2017
Transformers
Attention is All You Need
2018
Early pre-trained models
"Just" Predict the Next Token
2019-20
Scale
Emergent Behaviors
2022
Alignment
Reinforcement Learning and Finetuning
2023-24
Tool Use
Functions
Late 2024
Reasoning
Extended Thinking
2025
Agents
Autonomous Systems

The Problem with Pre-trained Models: Glorified Autocomplete

What's the return type of std::vector::size()?

Training LLMs: Reinforcement Learning

  • Basic idea:
    • Give the model a question or a task
    • Generate hundreds of different completions (i.e., answers to the question or executions of the task)
    • Score those completions based on some metric
    • Compute a gradient of the weights that makes the model more likely to give the completions with higher scores
    • Repeat (many times and for many tasks)
  • This sounds really hard...
    • …but it's actually harder than it sounds

The Evolution of Large Language Models

2017
Transformers
Attention is All You Need
2018
Early pre-trained models
"Just" Predict the Next Token
2019-20
Scale
Emergent Behaviors
2022
Alignment
Reinforcement Learning and Finetuning
2023-24
Tool Use
Functions
Late 2024
Reasoning
Extended Thinking
2025
Agents
Autonomous Systems

Tool Use

  • At some point we realized that XML, JSON, and other structured formats are "just" text
  • So we started giving the LLMs a schema and a description of what will happen when it generates tokens matching that schema
  • Then we implement code that takes XML/JSON/etc. as input and does some task:
    • Run a command in the terminal
    • Run the compiler, the debugger, or the profiler
    • Search the internet
    • Search the code in a repository
    • Edit a file

How Tool Calls Actually Work

  • Most LLMs (including Claude) use XML-based syntax for tool invocation (for now)
  • Here's what an Edit tool call looks like in Claude Code (sketched below):
  • If the old_string is wrong, the tool call fails!
  • If there are multiple old_strings in the file, the tool call fails!
  • We are in the ed era of agentic tooling—we haven't even invented vi yet.
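
Schematically, an Edit tool call carries a file path plus the exact text to find and replace (the precise wire format is an internal detail, so treat this as an approximation):

    <tool_use name="Edit">
      <file_path>src/widget_factory.cpp</file_path>
      <old_string>  return std::move(widget);</old_string>
      <new_string>  return widget;  // implicit move on return</new_string>
    </tool_use>

The agent has to reproduce old_string exactly as it appears in the file, which is why the two failure modes above exist.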

Just for fun...

  • Claude Code wrote the code block in the tool call in the last slide. Here's what the edit tool call looked like for that...

The Evolution of Large Language Models

2017
Transformers
Attention is All You Need
2018
Early pre-trained models
"Just" Predict the Next Token
2019-20
Scale
Emergent Behaviors
2022
Alignment
Reinforcement Learning and Finetuning
2023-24
Tool Use
Functions
Late 2024
Reasoning
Extended Thinking
2025
Agents
Autonomous Systems

Context window sizes over time

  • GPT-1 (2018): 512 tokens
  • GPT-4 (2023): 8K / 32K tokens
  • GPT-4 Turbo (2024): 128K tokens
  • Claude 3 (2024): 200K tokens
  • Gemini 1.5 (2024): 1M+ tokens
  • Claude 4 Sonnet (2025): 200K tokens
What do we do with all of these extra tokens?

Context Window Growth

[Chart: context window growth over time. Source: meibel.ai]

What do we do with all of these tokens?

  • We need a way to put more relevant information in the context window in order to make the output better
    • 2023: Retrieval Augmented Generation (RAG)
    • 2024: Tool use
  • Wild idea: what if we just...ask the LLM to put more relevant tokens in its own context?
Before/after examples of the context window contents (shown on slide).

Why does this work so well?


Overview of a typical modern LLM architecture

The Evolution of Large Language Models

2017
Transformers
Attention is All You Need
2018
Early pre-trained models
"Just" Predict the Next Token
2019-20
Scale
Emergent Behaviors
2022
Alignment
Reinforcement Learning and Finetuning
2023-24
Tool Use
Functions
Late 2024
Reasoning
Extended Thinking
2025
Agents
Autonomous Systems

Agents

  • Early LLMs could only do relatively short time-horizon tasks
    • Chatbots with short answers worked fine (kind of)
    • "Fancy" in-line code completion was another early application
    • With longer time-horizon tasks, early LLMs would quickly go "off the rails"
  • As LLMs got larger and reinforcement learning got better, longer running tasks became more feasible
    • Key insight: Let the model see the results of its actions and iterate on that feedback
    • For instance, what if we:
      • Ask the model to generate code
      • Give the model a tool to run the compiler on the code it generated
      • Add the results of the compilation to the context window
      • Ask it to fix its compilation errors
  • This feedback loop is what transforms a chatbot into an "agent"
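
A minimal sketch of that feedback loop, assuming hypothetical llm_complete() and run_compiler() helpers (this is not a real API, just the shape of the idea):

    #include <string>
    #include <vector>

    struct Message { std::string role; std::string content; };

    std::string llm_complete(const std::vector<Message>& context);   // returns the model's next reply
    std::string run_compiler(const std::string& source);             // returns diagnostics ("" if it compiles)

    std::string agent_write_code(const std::string& task) {
        std::vector<Message> context;
        context.push_back({"user", task});
        for (int attempt = 0; attempt < 5; ++attempt) {
            std::string code = llm_complete(context);                // model proposes code
            context.push_back({"assistant", code});
            std::string diagnostics = run_compiler(code);            // tool call
            if (diagnostics.empty()) return code;                    // it compiles: we're done
            context.push_back({"user",                               // tool result goes back into context
                "Compiler output:\n" + diagnostics + "\nPlease fix the errors."});
        }
        return "";                                                   // give up after a few rounds
    }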

Agent Task Length Over Time

Growth in agent task length capabilities over time
Source: metr.org

Agents: Code Search Tool

  • How do agents understand a 1M+ line codebase with only a few hundred thousand tokens of context?
  • Maybe a better question: how do humans do this?
    • We search through the code for key entry points (e.g., main()),
    • …then we read code called by those entry points...
    • …and types and key data structures.
    • When we want to understand a specific piece of functionality, we search for key strings or patterns related to that functionality
    • In other words, we only "load" part of the code into our "context window"
    • And maybe we "load" a summary of the rest of the code (and how it works) into our "context window" if we've been working on the project for a while
  • Agents can do this too!

Coding with LLMs


AI is really good at helping you understand code

My "AGI pill" moment

  • Anthropic monorepo: ~4M lines of code at the time
    • Mostly in languages I haven't used professionally in the past decade (Python, Rust, and TypeScript)
  • My third day at Anthropic:
    • "Make <feature I've never heard of> faster using <tool I've never heard of>, or maybe <other tool I've never heard of>"
    • …oh, and this change will impact every production image build we do.
    • Colleague: "I'll give you a tour of the codebase tomorrow"
    • Me: "I wonder if Claude knows how to do this"
    • …three hours later: pull request is ready.
    • Colleague, next morning: "So I see you sent me a pull request before I even gave you a tour of the codebase..."

Coding with Agents

A coding agent is like a junior engineer who has read the whole internet.
  • They struggle with large scope tasks
    • Abstraction design
    • Code that's used in a lot of different scenarios
    • Libraries with a diverse set of stakeholders
  • They tend to overperform on highly arcane or obscure tasks that are smaller in scope
    • Build systems
    • Template metaprogramming
    • Compiler intrinsics

Modes of Operation

  • Rapid prototyping
  • "Side-quests"
    • Handwrite mission-critical code, dispatch agents for auxiliary code
  • Planning and brainstorming
    • Helping with "writer's block"
  • Learning and exploration
  • Refactoring, boilerplate generation, modernization, cross-language translation
  • Test generation

Best Practices: Don't Be "Surprising"

  • If the LLM keeps "guessing" your library interface incorrectly, maybe it's too surprising (too "cute")
  • If the LLM is "guessing" the effects of your function incorrectly, maybe it needs a better name
  • Overthink the "principle of least astonishment"
    • When you write code that an agent might read, actively think through all of the ways your code could be misinterpreted
    • Be creative: what is the most "bland" way you can say the same thing?

Best Practices: Don't Be "Surprising"

"Surprising" C++ things to avoid:
  • Avoid "clever" operator overloading
  • Don't mix owning and non-owning semantics in the same type
  • Avoid macros
  • Avoid argument-dependent lookup (ADL)
  • Avoid implicit conversions
  • Use regular types wherever possible
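
A made-up example of the same interface written a "surprising" way and a "bland" way; nothing here is from the talk, it just illustrates the list above:

    #include <string>

    struct SurprisingTable {
        SurprisingTable& operator+=(const std::string& csv_row);  // appends? parses? replaces? a reader has to guess
        operator bool() const;                                     // "is valid"? "is non-empty"? also a guess
    };

    struct BoringTable {
        void append_row_from_csv(const std::string& csv_row);      // nothing left to guess at the call site
        bool is_empty() const;
    };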

Best Practices: Prioritize Better Encapsulation

  • Create, maintain, and document rigorous invariants at encapsulation boundaries
  • The agent should be able to reason about the code in isolation as much as possible
  • Avoid code coupling!
    • Err on the side of repeating yourself rather than breaking encapsulation by reusing code in vastly different scenarios
  • See my CppCon 2023 talk "Expressing Sameness and Similarity in Modern C++":

Best Practices: Document Your Code!

  • Comments are more important for agents than they are for humans!
  • Frequent, verbose comments in code can act like extended thinking
  • Unlike previous advice: redundancy is useful!
  • Think about the things that you can't say with your code and put them in comments
    • "For all" contracts, for instance
  • Agents are very good at generating and maintaining comments
    • Add a CI step that asks an agent to verify all of the comments related to changes in your pull request!
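
For example (mine, not from the slides), a comment carrying a "for all" contract that the code and type signatures alone cannot express:

    #include <vector>

    // Precondition: `samples` is sorted in ascending timestamp order and
    // contains no duplicate timestamps.
    // Postcondition: for all i, the returned vector's element i corresponds to
    // samples[i]; the relative order of the input is preserved.
    std::vector<double> smooth(const std::vector<double>& samples);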

Best Practices: Key Takeaway



Writing code that LLMs will understand is not that different from writing code that humans will understand, except that we can start to understand and quantify why these best practices increase understandability.

Based completely on speculation, hunches, and feelings, let's talk about...

Evolving Programming Languages for AI


"Two-sided Naming" in Function Calls

(I haven't seen a more official name for this, so I made one up)

  • Basic idea: programming language features that force developers to put as much information as possible at both the call site and the definition. Examples:
    • Keyword arguments in Python (and Swift, OCaml, Kotlin, and countless others):
    • Objective-C (and Smalltalk, which inspired it) actually require this:

"Two-sided Naming" in Function Calls

(I haven't seen a more official name for this, so I made one up)

  • More examples:
    • Explicit reference binding for parameters in Rust:
    • inout in Swift:

"Two-sided Naming": Rationale

  • It's all about efficient context window usage!
  • The more information we can put at the call site, the less likely we need to load the declaration (or even the definition) into context.
  • But also, even if we had a large enough context window, explicit information at the call site reduces the probability that the LLM will make incorrect assumptions.
    • i.e., how does it even know that it should check the declaration/definition for counterintuitive behavior?
    • How do humans know when to do this?
  • The importance of this could change a bit once we figure out how to train LLMs to use language servers efficiently.
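
Not from the talk, but as a C++-flavored illustration of the same idea: an aggregate options struct plus C++20 designated initializers already lets a call site carry most of this information.

    struct FetchOptions {
        bool async = false;
        int  max_attempts = 1;
    };

    void fetch_widget(int id, FetchOptions opts);

    void example() {
        // The call site carries information an agent would otherwise have to
        // load the declaration (or definition) into context to recover:
        fetch_widget(42, {.async = true, .max_attempts = 3});
        // versus the "one-sided" call: fetch_widget(42, {true, 3});
    }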

Contracts and Effects Systems

  • Both contracts and effects systems are ways of encapsulating information and reducing code coupling.
  • Encapsulation is key to effectively working with LLMs because of the context window size constraints.
  • But also, it's a lot easier to train LLMs on small, well-contained problems.
  • Contracts promote Liskov Substitutability, allowing LLMs to infer behavior of a broader category of types.
  • Effects systems further promote local reasoning by formalizing the boundaries around the behavior of a function.
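
As a sketch, using the contract syntax adopted for C++26 from P2900 (details may still shift before publication): the pre/post conditions give a reader, human or LLM, a local summary of behavior without loading the definition into context.

    #include <span>

    double average(std::span<const double> xs)
        pre (!xs.empty())      // the caller's obligation, stated at the boundary
        post (r : r == r);     // "the result is not NaN," readable without the body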

Contracts and Effects Systems



But...these are all things that make humans more effective at programming also!*

* As long as the learning curve for these things isn't too steep...

Conjecture: LLMs will make more effective use of complex programming languages than humans

"Conjecture" is a fun mathematical word for "hot take"

  • LLMs are shockingly similar to humans in the way they reason about code.
  • But I suspect LLMs will have a different balance between the downsides of conceptual complexity in programming languages and the benefits of increased information locality.
    • An AI coding assistant is currently "like a junior engineer who's read the whole internet."
    • Knowing obscure information is not as much of a challenge for LLMs as it is for humans.
    • Code in formal languages in LLM training data is more likely to be correct (even if it's less plentiful)
    • Locality of information is still critical to LLMs, however, because of the limited context window size.

Summary

Summary

  • Coding agents are tools that take some time to learn to use
    • They generate output based on input, and the quality of their output depends on the quality of the input
    • The context window is all of the space you have to give input to the agent—use it efficiently!
    • "LLMs can't do this" is much less likely to be accurate than "I can't do this with an LLM"
  • Coding agents are like junior engineers who have read the whole internet
    • It's worth taking the time to learn what they're good at and what they're bad at.
  • Careful, intentional design can accelerate your development process more than ever before
    • Encapsulation and careful abstraction design matters now more than ever
    • Code coupling is more harmful than ever
    • Good documentation will pay off now more than ever
  • The future of programming languages and software engineering is going to be anything but boring!

Questions?

For questions I don't get to, I will do my best to answer every question posted on Discord during the remainder of the conference. Or just talk to me in the hallway!

Extra Slides

LLM Architecture: Embedding

Pre-training Scale: What the Model Has Seen

Claude-generated advice that was too good to leave out

  • Trained on millions of code files
  • Has seen your favorite pattern thousands of times:
    • Every way to implement a singleton
    • Every variant of the visitor pattern
    • Every possible iterator implementation
    • Good code, bad code, and everything in between
  • The model is essentially a compressed database of "what usually comes next"

Pre-trained models are brilliant at continuing text, but terrible at following instructions

What This Means for Your Coding

Claude-generated advice that was too good to leave out

The model is a pattern completion engine

  • It works best when:
    • Your code follows patterns it's seen frequently
    • You establish clear context early
    • You use conventional naming
  • It struggles when:
    • Your code is highly unusual or domain-specific
    • Context is ambiguous
    • You're doing something genuinely novel

Write code that "rhymes" with good code in the training data