Crafting the Code You Don't Write

Sculpting Software in an AI World

Dr. Daisy Hollman

MUC++ Meetup
daisyh.dev/talks/muc-meetup

Who am I?

  • Long-time C++ committee member
    • Contributed to wording or features in C++17, C++20, C++23, C++26 (almost certainly), and C++29 (likely)
    • std::mdspan, std::execution, std::atomic_ref, std::linalg, variadic operator[]
    • Former chair of Study Group 9 (Ranges)
  • Former CppCon program chair (2022 and 2023)
  • Former purveyor of Cute C++ Tricks
I am one of you!
  • Joined Anthropic in February 2025, currently working on Claude Code

The Elephant in the Room...




Yes, I am aware that
AI is a bit of a polarizing issue right now...
(I'm not here to talk about that)

Non-goals for this presentation

I'm not here to...
  • …convince you that AI coding agents are going to change the way software engineering works
  • …convince you that vibe coding is the future of software engineering, or a good thing, or a bad thing, or sustainable.
  • …tell you about how AI is going to replace you, or in general talk about the social impacts of AI
  • …justify the environmental impact of AI
But,
  • I do want to help you thrive in a world where you have a limited ability to control the existence of AI

Goals for this presentation

  • Understand LLMs and coding agents at a high level
    • Especially the parts that are most relevant to understanding how to use them to write code
  • Learn to think about coding agents as a tool
  • Improve your mental model for how and why LLMs make mistakes
    • And learn how to anticipate and avoid these mistakes
  • "Learn how to learn" to use coding agents more effectively
  • Geek out about the future of C++ and where coding agents fit in that picture

How do LLMs work?



A conceptual overview that's good enough for this talk

(but also definitely wrong in many ways)

Disclaimer: I'm not an expert on LLM training, or inference, or AI, or machine learning, but this picture of the world has been helpful for me as a software engineer who needs to use LLMs.

The Evolution of Large Language Models

Key milestones that shaped modern AI

2017
Transformers
"All You Need"
2018
Early pre-trained models
"Just" Predict the Next Token
2019-20
Scale
Emergent Behaviors
2022
Alignment
Reinforcement Learning and Finetuning
2023-24
Tool Use
Functions
Late 2024
Reasoning
Extended Thinking
2025
Agents
Autonomous Systems

LLM Architecture

Overview of a typical modern LLM architecture

LLM Architecture: Input

  • The "embedding" layer turns tokens into vectors
    • Uses a fixed-size vocabulary selected at training time
  • But there is a limit to the number of input tokens the model can process at once
    • This is called the context length or "context window," and it's fixed at training time
  • There's a third size called the embedding dimension, which is the "working" dimension of the model

LLM Architecture: Input

The context window limits the amount of information that the user can have the model "think" about at any given time.
Basically everything built on top of LLMs is about engineering ways to efficiently use this context window. Understanding and managing the context window is critical to getting better at using LLMs!

LLM Architecture: Context Window

  • How do we make a chatbot out of this context window?
  • It's shockingly primitive: just add special token sequences (e.g., \nHuman: and \nAssistant:) to separate prompt from response, and throw it all into the context window. Literally:
  • This means that long conversations can use a lot of tokens very quickly!

LLM Architecture: Transformers

  • _____ Is All You Need
    • Attention! (Vaswani, Shazeer, et al.)
  • Transformers build connections between related tokens that may be far from each other in the input.

LLM Architecture: Attention

Attention sentence diagram
Attention matrix diagram

LLM Architecture: Outputs

  • The output of the model is a probability distribution of possible next tokens
  • The model "chooses" the next token based on this distribution (This is called "sampling")
    • The "temperature" parameter controls how random the sampling is
    • Low temperature (close to 0) makes the model more deterministic
    • High temperature (around 1.0 or higher) makes the model more creative and random
  • The output is then added to the context window and the process repeats

The Evolution of Large Language Models


Training LLMs: Pretraining

  • Start with random weights
  • Look at some tokens from the training data:
  • Generate a probability distribution of next tokens:
  • Compute a gradient of the weights that makes it match the training data better
  • Adjust the weights and repeat

The Evolution of Large Language Models


Emergent behaviors and unreasonable effectiveness of scale

  • February 2019: GPT-2
    • 1.5B parameters
    • ~40GB of training data
  • June 2020: GPT-3
    • 175B parameters
    • ~570GB of training data
  • In a very vague sense, pre-training represents a "compression" of the training data
    • The surprising thing is that this appears to have some things in common with the "compression" of data that our brains use: conceptual generalization
    • This might be one reason that when we ask it to "decompress" the data, sometimes it displays emergent behaviors that are surprisingly human

Emergent behaviors and unreasonable effectiveness of scale

  • Consider the model completing this code snippet:
  • Several completion candidates:
    • widget
    • foo
    • nullptr
    • std::move(widget)
    • std::move(foo)
    • make_unique<Widget>()
    • ;
    • 42
    • 🌼
  • How do we differentiate between these candidates?

Emergent behaviors and unreasonable effectiveness of scale

  • widget
  • foo
  • nullptr
  • std::move(widget)
  • std::move(foo)
  • make_unique<Widget>()
  • ;
  • 42
  • 🌼
  • How do we differentiate between these candidates?
    • Probably ASCII
    • Look at the return type—it can't be an int or void
    • Returning foo would be a use-after-move
    • "Factory pattern" understanding means that nullptr seems pretty unlikely
    • Probably we wouldn't initialize something and then throw it away immediately
    • People usually name the variable to be returned by widget_factory() something like widget

Emergent behaviors and unreasonable effectiveness of scale

  • widget
  • std::move(widget)
  • We probably have a full spectrum of "probability distributions" in this room between these two completions
    • Confession: my distribution probably looked something like
      • widget: 93%
      • std::move(widget): 7%

Emergent behaviors and unreasonable effectiveness of scale

Just for fun...
  • 11
  • 14
  • 17
  • 20
  • Before looking this up, my distribution probably looked something like
    • 11: 33%
    • 14: 22%
    • 17: 40%
    • 20: 5%

Widget Factory Completions

Widget Factory Completions Results

Widget Factory Completions: With 'Implicit Move' Comment

Widget Factory Completions: With 'Implicit Move' Comment Results

Widget Factory Completions: With 'Broken Compiler' Comment

Widget Factory Completions: With 'Broken Compiler' Comment Results

Widget Factory Completions: C++ version number

Widget Factory Completions: C++ version number Results

Emergent behaviors and unreasonable effectiveness of scale




Compression through conceptual generalization is a useful metaphor for understanding how pre-trained models store and "regurgitate" information

The Evolution of Large Language Models


The Problem with Pre-trained Models: Glorified Autocomplete

What's the return type of std::vector::size()?

Training LLMs: Reinforcement Learning

  • Basic idea:
    • Give the model a question or a task
    • Generate hundreds of different completions (i.e., answers to the question or executions of the task)
    • Score those completions based on some metric
    • Compute a gradient of the weights that makes the model more likely to give the completions with higher scores
    • Repeat (many times and for many tasks)
  • This sounds really hard...
    • …but it's actually harder than it sounds

The Evolution of Large Language Models


Tool Use

  • At some point we realized that XML, JSON, and other structured formats are "just" text
  • So we started giving the LLMs a schema and a description of what will happen when it generates tokens matching that schema
  • Then we implement code that takes XML/JSON/etc. as input and does some task:
    • Run a command in the terminal
    • Run the compiler, the debugger, or the profiler
    • Search the internet
    • Search the code in a repository
    • Edit a file

How Tool Calls Actually Work

  • Most LLMs (including Claude) use XML-based syntax for tool invocation (for now)
  • Here's what an Edit tool call looks like in Claude Code:
  • If the old_string is wrong, the tool call fails!
  • If there are multiple old_strings in the file, the tool call fails!
  • We are in the ed era of agentic tooling—we haven't even invented vi yet.

Just for fun...

  • Claude Code wrote the code block in the tool call in the last slide. Here's what the edit tool call looked like for that...

The Evolution of Large Language Models


Context window sizes over time

  • GPT-1 (2018): 512 tokens
  • GPT-4 (2023): 8K / 32K tokens
  • GPT-4 Turbo (2024): 128K tokens
  • Claude 3 (2024): 200K tokens
  • Gemini 1.5 (2024): 1M+ tokens
  • Claude 4 Sonnet (2025): 200K tokens
What do we do with all of these extra tokens?

Context Window Growth

Context Window Growth
Source: meibel.ai

What do we do with all of these tokens?

  • We need a way to put more relevant information in the context window in order to make the output better
    • 2023: Retrieval Augmented Generation (RAG)
    • 2024: Tool use
  • Wild idea: what if we just...ask the LLM to put more relevant tokens in its own context?
Before:
After:

Why does this work so well?


Overview of a typical modern LLM architecture

The Evolution of Large Language Models


Agents

  • Early LLMs could only do relatively short time-horizon tasks
    • Chatbots with short answers worked fine (kind of)
    • "Fancy" in-line code completion was another early application
    • With longer time-horizon tasks, early LLMs would quickly go "off the rails"
  • As LLMs got larger and reinforcement learning got better, longer running tasks became more feasible
    • Key insight: Let the model see the results of its actions and iterate on that feedback
    • For instance, what if we:
      • Ask the model to generate code
      • Give the model a tool to run the compiler on the code it generated
      • Add the results of the compilation in the context window
      • Ask it to fix its compilation errors
  • This feedback loop is what transforms a chatbot into an "agent"

Agent Task Length Over Time

Growth in agent task length capabilities over time
Source: metr.org

Agents: Code Search Tool

  • How do agents understand a 1M+ line codebase with only a few hundred thousand tokens of context?
  • Maybe a better question: how do humans do this?
    • We search through the code for key entry points (e.g., main()),
    • …then we read code called by those entry points...
    • …and types and key data structures.
    • When we want to understand a specific piece of functionality, we search for key strings or patterns related to that functionality
    • In other words, we only "load" part of the code into our "context window"
    • And maybe we "load" a summary of the rest of the code (and how it works) into our "context window" if we've been working on the project for a while
  • Agents can do this too!

Coding with LLMs


AI is really good at helping you understand code

My "AGI pill" moment

  • Anthropic monorepo: ~4M lines of code at the time
    • Mostly in languages I haven't used professionally in the past decade (Python, Rust, and TypeScript)
  • My third day at Anthropic:
    • "Make <feature I've never heard of> faster using <tool I've never heard of>, or maybe <other tool I've never heard of>"
    • …oh, and this change will impact every production image build we do.
    • Colleague: "I'll give you a tour of the codebase tomorrow"
    • Me: "I wonder if Claude knows how to do this"
    • …three hours later: pull request is ready.
    • Colleague, next morning: "So I see you sent me a pull request before I even gave you a tour of the codebase..."

Coding with Agents

A coding agent is like a junior engineer who has read the whole internet.
  • They struggle with large scope tasks
    • Abstraction design
    • Code that's used in a lot of different scenarios
    • Libraries with a diverse set of stakeholders
  • They tend to overperform on highly arcane or obscure tasks that are smaller in scope
    • Build systems
    • Template metaprogramming
    • Compiler intrinsics

Modes of Operation

  • Rapid prototyping
  • "Side-quests"
    • Handwrite mission-critical code, dispatch agents for auxiliary code
  • Planning and brainstorming
    • Helping with "writer's block"
  • Learning and exploration
  • Refactoring, boilerplate generation, modernization, cross-language translation
  • Test generation

The Side-Quest Strategy

You write the core logic. AI writes everything else.

Your Focus

  • Core algorithm
  • Architecture and code structure
  • Design and strategy
  • Security logic

AI's Work

  • Build system configuration
  • Benchmarking harnesses
  • Error handling boilerplate
  • Unit tests
  • Documentation
This alone can double your velocity on new features

Best Practices: Don't Be "Surprising"

  • If the LLM keeps "guessing" your library interface incorrectly, maybe it's too surprising (too "cute")
  • If the LLM is "guessing" the effects of your function incorrectly, maybe it needs a better name
  • Overthink the "principle of least astonishment"
    • When you write code that an agent might read, actively think through all of the ways your code could be misinterpreted
    • Be creative: what is the most "bland" way you can say the same thing?

Best Practices: Don't Be "Surprising"

"Surprising" C++ things to avoid:
  • Avoid "clever" operator overloading
  • Don't mix owning and non-owning semantics in the same type
  • Avoid macros
  • Avoid argument-dependent lookup (ADL)
  • Avoid implicit conversions
  • Use regular types wherever possible

Best Practices: Prioritize Better Encapsulation

  • Create, maintain, and document rigorous invariants at encapsulation boundaries
  • The agent should be able to reason about the code in isolation as much as possible
  • Avoid code coupling!
    • Err on the side of repeating yourself rather than breaking encapsulation by reusing code in vastly different scenarios
  • See my CppCon 2023 talk "Expressing Sameness and Similarity in Modern C++":

Best Practices: Document Your Code!

  • Comments are more important for agents than they are for humans!
  • Frequent, verbose comments in code can act like extended thinking
  • Unlike previous advice: redundancy is useful!
  • Think about the things that you can't say with your code and put them in comments
    • "For all" contracts, for instance
  • Agents are very good at generating and maintaining comments
    • Add a CI step that asks an agent to verify all of the comments related to changes in your pull request!

Best Practices: Key Takeaway



Writing code that LLMs will understand is not that different from writing code that humans will understand, except that we can start to understand and quantify why these best practices increase understandability.

Scaling Up


Using More Tokens to Build Faster and Stronger

Beyond Interactive Use

  • Interactive use of agents quickly becomes bottlenecked on the human in the loop
  • This mode of use will likely always produce the most productivity per token
  • But humans don't scale up; agents do

Parallel Subagents

Claude Code running multiple review agents in parallel

Claude Code Plugins

  • Claude Code Plugins allow you to install collections of slash commands, skills, subagents, hooks, and/or MCP servers that work together for a common purpose
    • More customization points coming soon
  • /plugin to access menu, or use it directly:
  • Marketplaces are lists of plugins that are easy for anyone to host (publicly or privately)

Parallel agents in plugins

/review-pr command, specified in commands/review-pr.md

Parallel agents in plugins

silent-failure-hunter agent, specified in agents/silent-failure-hunter.md

Hot Take: Building Coding Tools for AGI, not ASI

  • Building for ASI ("Artificial Super Intelligence") implies a mindset where anything we add to the agentic harness will be obsolete because the model will be so smart that it doesn't need our tools
    • Realistically, this has not been the trend in agentic coding
    • Giving agents more ability to get deterministic information (through tool calls, RAG, and hooks) has led to more autonomy and more reliability
  • Building for AGI ("Artificial General Intelligence") assumes that future agents will have intelligence like the best humans
    • This means that the baseline assumption should be that tools that improve efficiency and reliability for humans should also improve efficiency and reliability for agents.
    • Generally, the most efficient humans make more use of coding tools, not less. Why should we think agents are any different?
    • Build tools that work with what we have now, and they will be even more useful as the model improves

Pushing towards Longer Autonomy

  • The vision: write a spec before bed, wake up with a usable implementation of that spec
  • Problem: agents often stop working before a long-horizon task is done
    • This is partially because of how reinforcement learning works (to make any progress, you often need to give partial credit to partial solutions)
    • Also partially because we haven't fully evolved training away from the "chatbot" model (short turns, direct interaction, human-in-the-loop)
  • Problem: agents often do things they know won't work if asked to examine their work
    • A lot of problems in software engineering are easier to assess for correctness than they are to solve.
  • "Overnight" autonomy is often limited by the model deciding to stop working, not by it actually reaching the end of its ability to do productive work.

The Ralph Wiggum Loop

Ralph Wiggum saying 'I'm helping!'

"I'm helping!"

Original post by Geoffrey Huntley: ghuntley.com/ralph

  • Our version of this: feed the same prompt back into the agent until it is willing to "promise" (verbatim) that all parts of the task are done
    • We tell it that there are "special" <promise> tags that carry extra weight, and it must use those to exit the loop.

The Ralph Wiggum Loop

The Ralph Wiggum Philosophy

  • "Good enough" autonomy now
  • Quantity "becomes" quality
  • Faith in eventual consistency
Ralph Wiggum loop screenshot

Better Instruction Following

  • Poor instruction following is a really hard problem to solve with reinforcement learning
    • Given a prompt with specific instructions, it's basically impossible to distinguish between changes that reinforce instruction following and changes that reinforce the specific behavior in those instructions
  • Result: agent often "forgets" to follow instructions in the system prompt
    • e.g., "always use camelCase function names"
  • This gets better if we remind the model of these instructions every turn
    • This uses a lot of context and doesn't scale well
  • But what if we only remind it when the instructions are relevant?

The hookify Plugin

  • Basic idea:
    • Scan tool uses for unwanted patterns
    • Remind the model of your instructions when the pattern matches
    • Make it easy to add to the list of things you want to remind the model of
  • Install the plugin (released last weekend just for this workshop!)
  • Tell the model when there's something you don't like
  • The model launches a subagent that creates the hook

Multi-agent Swarms

  • Launch multiple instances of Claude Code and give them a way to talk to each other
    • e.g., put each one in a separate tmux window and tell them to communicate with each other with tmux send-keys
  • Assign one of them to be the technical lead
    • Tell it to not write any code itself
  • Set up stop hooks that tell the technical lead when its workers are awaiting feedback
  • Set up a Ralph Wiggum loop on the technical lead

Then this happened...

Then this happened...

Workflows: Key Takeaway



Orchestrating agents is programming—just with different primitives and a different language.

  • Your engineering skills are still valuable, even if you don't write much of the actual code yourself
    • (this is what it's always been like to be a technical lead on a large project)
  • Prompts, hooks, and agent configurations are your new "source code"
  • The skills that make you a good programmer transfer directly to this domain

Based on speculation and hunches...

Programming Languages for AI


"Two-sided Naming" in Function Calls

(I haven't seen a more official name for this, so I made one up)

  • Basic idea: programming language features that force developers to put as much information as possible at both the call site and the definition. Examples:
    • Keyword arguments in Python (and Swift, OCaml, Kotlin, and countless others):
    • Objective-C (and Smalltalk, which inspired it) actually require this:

"Two-sided Naming" in Function Calls

(I haven't seen a more official name for this, so I made one up)

  • More examples:
    • Explicit reference binding for parameters in Rust:
    • inout in Swift:

"Two-sided Naming": Rationale

  • It's all about efficient context window usage!
  • The more information we can put at the call site, the less likely we need to load the declaration (or even the definition) into context.
  • But also, even if we had a large enough context window, explicit information at the call site reduces the probability that the LLM will make incorrect assumptions.
    • i.e., how does it even know that it should check the declaration/definition for counterintuitive behavior?
    • How do humans know when to do this?
  • The importance of this could change a bit once we figure out how to train LLMs to use language servers efficiently.

Contracts and Effects Systems

  • Both contracts and effects systems are ways of encapsulating information and reducing code coupling.
  • Encapsulation is key to effectively working with LLMs because of the context window size constraints.
  • But also, it's a lot easier to train LLMs on small, well-contained problems.
  • Contracts promote Liskov Substitutability, allowing LLMs to infer behavior of a broader category of types.
  • Effects systems further promote local reasoning by formalizing the boundaries around the behavior of a function.

Contracts and Effects Systems



But...these are all things that make humans more effective at programming also!*

* As long as the learning curve for these things isn't too steep...

Conjecture: LLMs will make more effective use of complex programming languages than humans

"Conjecture" is a fun mathematical word for "hot take"

  • LLMs are shockingly similar to humans in the way they reason about code.
  • But I suspect LLMs will have a different balance between the downsides of conceptual complexity in programming languages and the benefits of increased information locality.
    • An AI coding assistant is currently "like a junior engineer who's read the whole internet."
    • Knowing obscure information is not as much of a challenge for LLMs as it is for humans.
    • Code in formal languages in LLM training data is more likely to be correct (even if it's less plentiful)
    • Locality of information is still critical to LLMs, however, because of the limited context window size.

Summary

Summary

  • Coding agents are tools that take some time to learn to use
    • They generate output based on input, and the quality of their output depends on the quality of the input
    • The context window is all of the space you have to give input to the agent—use it efficiently!
    • "LLMs can't do this" is much less likely to be accurate than "I can't do this with an LLM"
  • Coding agents are like junior engineers who have read the whole internet
    • It's worth taking the time to learn what they're good at and what they're bad at.
  • Careful, intentional design can accelerate your development process more than ever before
    • Encapsulation and careful abstraction design matter now more than ever
    • Code coupling is more harmful than ever
    • Good documentation will pay off now more than ever
  • The future of programming languages and software engineering is going to be anything but boring!

Questions?

For questions I don't get to, I will do my best to answer every question posted on Discord during the remainder of the conference. Or just talk to me in the hallway!

Extra Slides

LLM Architecture: Embedding

Pre-training Scale: What the Model Has Seen

Claude-generated advice that was too good to leave out

  • Trained on millions of code files
  • Has seen your favorite pattern thousands of times:
    • Every way to implement a singleton
    • Every variant of the visitor pattern
    • Every possible iterator implementation
    • Good code, bad code, and everything in between
  • The model is essentially a compressed database of "what usually comes next"

Pre-trained models are brilliant at continuing text, but terrible at following instructions

What This Means for Your Coding

Claude-generated advice that was too good to leave out

The model is a pattern completion engine

  • It works best when:
    • Your code follows patterns it's seen frequently
    • You establish clear context early
    • You use conventional naming
  • It struggles when:
    • Your code is highly unusual or domain-specific
    • Context is ambiguous
    • You're doing something genuinely novel

Write code that "rhymes" with good code in the training data