Crafting the Code You Don't Write

AI Tools for Startup Founders

Dr. Daisy Hollman

Slush 2025 Workshop
November 19-20, 2025 • Helsinki, Finland

Who am I?

  • Engineer on the Claude Code team at Anthropic
    • Claude Code "Power User" Lead
    • Also: major Claude Code power user
  • Previously: programming language designer and engineer
    • C++ standards committee since 2016 (std::mdspan, executors, linear algebra)
    • Former CppCon program chair (2022-2023)
    • PhD in computational quantum chemistry
  • These two are more similar than you think
I've spent a decade thinking about how to make code understandable—that's exactly what matters when working with AI tools

A Thought Exercise

Imagine you could use twice as many tokens to go from idea to product 20% faster.

Imagine that you could repeat this process any number of times.

Question: How many times would you repeat this?

A Thought Exercise

Iterations   Efficiency   Token Usage   $/Engineer   Category
1            1.2×         2×            $400         Hobbyist
2            1.4×         4×            $800
3            1.7×         8×            $1,600       Pro-sumer
4            2.1×         16×           $3,200       Start-up
5            2.5×         32×           $6,400
6            3.0×         64×           $12,800
8            4.3×         256×         $51,200      Scale-up
10           6.2×         1024×        $204,800     Anthropic
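The table is just compounding arithmetic: each doubling of token usage buys a further ~20% efficiency, so after n iterations efficiency is 1.2ⁿ and token usage is 2ⁿ. A quick sketch (values rounded as in the table):

```python
# Each iteration doubles token usage in exchange for a further 20% efficiency gain.
BASE_COST = 400  # $/engineer at 2x tokens (iteration 1), as in the table

def iteration_stats(n: int) -> tuple[float, int, int]:
    """Return (efficiency, token multiple, $/engineer) after n iterations."""
    efficiency = 1.2 ** n
    tokens = 2 ** n
    cost = BASE_COST * 2 ** (n - 1)
    return efficiency, tokens, cost

for n in (1, 4, 10):
    eff, tok, cost = iteration_stats(n)
    print(f"{n:2d} iterations: {eff:.1f}x efficiency, {tok}x tokens, ${cost:,}")
```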

Usage Patterns by Diminishing-Return Tolerance

Efficiency: 1.2× · 1.4× · 1.7× · 2.1× · 2.5× · 3.0× · 3.6× · 4.3× · 5.2× · 6.2×
Token Cost: 2× · 4× · 8× · 16× · 32× · 64× · 128× · 256× · 512× · 1024×
Category: Hobbyist (2×) · Pro-sumer (8×) · Start-up (16×) · Scale-up (256×) · Anthropic (1024×)

  • Direct Interaction
  • Parallel Subagents
  • CI Integration
  • AI Code Review
  • Overnight Autonomy
  • Ambient Agents
  • Best of N Models
  • Agent Swarms

Understanding AI Coding Tools



A practical overview of how AI coding tools work

What you need to know to use them effectively

LLM Architecture

Overview of a typical modern LLM architecture

LLM Architecture: Input

  • The "embedding" layer turns tokens into vectors
    • Uses a fixed-size vocabulary selected at training time
  • But there is a limit to how many input tokens the model can handle at once
    • This is called the context length or "context window," and it's fixed at training time
  • There's a third size called the embedding dimension, which is the "working" dimension of the model
Overview of a typical modern LLM architecture

LLM Architecture: Input

The context window limits the amount of information that the user can have the model "think" about at any given time.
Basically everything built on top of LLMs is about engineering ways to efficiently use this context window. Understanding and managing the context window is critical to getting better at using LLMs!
Overview of a typical modern LLM architecture

LLM Architecture: Context Window

  • How do we make a chat bot out of this context window?
  • It's shockingly primitive: just add special token sequences (e.g., \nHuman: and \nAssistant:) to separate prompt from response, and throw it all into the context window. Literally:
  • This means that long conversations can use a lot of tokens very quickly!
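A minimal sketch of that serialization (the exact special tokens vary by model; \nHuman: / \nAssistant: is the convention shown on the slide):

```python
def serialize_conversation(turns: list[tuple[str, str]]) -> str:
    """Flatten (role, text) turns into the single string the model actually sees."""
    markers = {"human": "\nHuman: ", "assistant": "\nAssistant: "}
    prompt = "".join(markers[role] + text for role, text in turns)
    # The model is simply asked to continue the text after the final marker.
    return prompt + "\nAssistant:"

print(serialize_conversation([
    ("human", "What is a context window?"),
    ("assistant", "The maximum number of tokens the model can attend to."),
    ("human", "Why does it fill up so fast?"),
]))
```

Note that every new turn re-sends the entire history, which is exactly why long conversations burn tokens so quickly.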

Tool Use

  • At some point we realized that XML, JSON, and other structured markup languages are "just" text
  • So we started giving the LLMs a schema and a description of what will happen when it generates tokens matching that schema
  • Then we implement code that takes XML/JSON/etc. as input and does some task:
    • Run a command in the terminal
    • Run the compiler, the debugger, or the profiler
    • Search the internet
    • Search the code in a repository
    • Edit a file
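For illustration, a tool is just a schema plus a description handed to the model; dispatching on the model's structured output is ordinary code. A sketch (the schema shape here is illustrative, not any provider's exact wire format):

```python
import json
import subprocess

# Hypothetical tool definition given to the model alongside the prompt.
BASH_TOOL = {
    "name": "bash",
    "description": "Run a shell command and return its output.",
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}

def dispatch_tool_call(call_json: str) -> str:
    """Parse the model's tool-call JSON and actually perform the action."""
    call = json.loads(call_json)
    if call["name"] == "bash":
        result = subprocess.run(
            call["input"]["command"], shell=True,
            capture_output=True, text=True,
        )
        return result.stdout + result.stderr
    raise ValueError(f"unknown tool: {call['name']}")

# The model emits JSON matching the schema; we run it and feed the output back.
print(dispatch_tool_call('{"name": "bash", "input": {"command": "echo hi"}}'))
```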

How Tool Calls Actually Work

  • Most LLMs (including Claude) use XML-based syntax for tool invocation (for now)
  • Here's what an Edit tool call looks like in Claude Code:
  • If the old_string is wrong, the tool call fails!
  • If there are multiple old_strings in the file, the tool call fails!
  • We are in the ed era of agentic tooling—we haven't even invented vi yet.
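Those failure modes are easy to make concrete with a toy reimplementation of the Edit semantics (a sketch, not Claude Code's actual source):

```python
def edit_file_text(text: str, old_string: str, new_string: str) -> str:
    """Replace old_string with new_string, with Edit-tool failure semantics."""
    count = text.count(old_string)
    if count == 0:
        raise ValueError("old_string not found in file -- tool call fails")
    if count > 1:
        raise ValueError(
            f"old_string matches {count} times -- ambiguous, tool call fails"
        )
    return text.replace(old_string, new_string)

src = "int x = 1;\nint y = 1;\n"
print(edit_file_text(src, "int x = 1;", "int x = 2;"))
# edit_file_text(src, "= 1;", "= 2;") would raise: it matches both lines.
```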

Just for fun...

  • Claude Code wrote the code block in the tool call in the last slide. Here's what the edit tool call looked like for that...

Context window sizes over time

  • GPT-1 (2018): 512 tokens
  • GPT-4 (2023): 8K / 32K tokens
  • GPT-4 Turbo (2024): 128K tokens
  • Claude 3 (2024): 200K tokens
  • Gemini 1.5 (2024): 1M+ tokens
  • Claude 4 Sonnet (2025): 200K tokens
What do we do with all of these extra tokens?

Context Window Growth

Source: meibel.ai

What do we do with all of these tokens?

  • We need a way to put more relevant information in the context window in order to make the output better
    • 2023: Retrieval Augmented Generation (RAG)
    • 2024: Tool use
  • Wild idea: what if we just...ask the LLM to put more relevant tokens in its own context?
Before:
After:

Agents

  • Early LLMs could only do relatively short time-horizon tasks
    • Chatbots with short answers worked fine (kind of)
    • "Fancy" in-line code completion was another early application
    • With longer time-horizon tasks, early LLMs would quickly go "off the rails"
  • As LLMs got larger and reinforcement learning got better, longer running tasks became more feasible
    • Key insight: Let the model see the results of its actions and iterate on that feedback
    • For instance, what if we:
      • Ask the model to generate code
      • Give the model a tool to run the compiler on the code it generated
      • Add the results of the compilation in the context window
      • Ask it to fix its compilation errors
  • This feedback loop is what transforms a chatbot into an "agent"
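That loop can be sketched in a few lines. Here a stub stands in for the model, and Python's built-in compile() stands in for the real compiler; the shape of the feedback loop is the point:

```python
def compile_check(code: str) -> str:
    """Stand-in 'compiler': try to compile the source, return error text."""
    try:
        compile(code, "<generated>", "exec")
        return ""  # success: no diagnostics
    except SyntaxError as e:
        return f"SyntaxError: {e.msg} (line {e.lineno})"

def agent_loop(model, task: str, max_turns: int = 5) -> str:
    """Generate -> check -> feed errors back -> regenerate, until clean."""
    context = task
    for _ in range(max_turns):
        code = model(context)
        errors = compile_check(code)
        if not errors:
            return code  # compiles cleanly: the feedback loop converged
        context += f"\n\nCompiler output:\n{errors}\nPlease fix the errors."
    raise RuntimeError("agent gave up")

# Stub model: the first attempt has a syntax error, the second fixes it.
attempts = iter(["def f(:\n    return 1", "def f():\n    return 1"])
print(agent_loop(lambda context: next(attempts), "Write a function f."))
```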

Agent Task Length Over Time

Growth in agent task length capabilities over time
Source: metr.org

Agents: Code Search Tool

  • How do agents understand a 1M+ line codebase with only a few hundred thousand tokens of context?
  • Maybe a better question: how do humans do this?
    • We search through the code for key entry points (e.g., main()),
    • …then we read code called by those entry points...
    • …and types and key data structures.
    • When we want to understand a specific piece of functionality, we search for key strings or patterns related to that functionality
    • In other words, we only "load" part of the code into our "context window"
    • And maybe we "load" a summary of the rest of the code (and how it works) into our "context window" if we've been working on the project for a while
  • Agents can do this too!
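A code-search tool can be as simple as grep with a result cap, so a 1M+ line repo never floods the context window. A minimal sketch (not Claude Code's actual implementation):

```python
import pathlib
import re

def search_code(root: str, pattern: str, max_results: int = 20) -> list[str]:
    """Grep-style search: return at most max_results matching lines as
    path:line:text, so the agent 'loads' only what it actually needs."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(pathlib.Path(root).rglob("*.py")):
        for lineno, line in enumerate(
                path.read_text(errors="ignore").splitlines(), 1):
            if regex.search(line):
                hits.append(f"{path}:{lineno}:{line.strip()}")
                if len(hits) >= max_results:
                    return hits  # cap output: context is the scarce resource
    return hits
```

The agent chains such calls, finding main() first, then searching for the symbols it just saw, exactly like a human reading unfamiliar code.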

How Do We Scale From Here?


Using More Tokens to Build Faster and Stronger

Beyond Interactive Use

  • Interactive use of agents quickly becomes bottlenecked on the human in the loop
  • This mode of use will likely always produce the most productivity per token
  • But humans don't scale up; agents do

Parallel Subagents

Claude Code running multiple review agents in parallel

Claude Code Plugins

  • Claude Code Plugins allow you to install collections of slash commands, skills, subagents, hooks, and/or MCP servers that work together for a common purpose
    • More customization points coming soon
  • /plugin to access menu, or use it directly:
  • Marketplaces are lists of plugins that are easy for anyone to host (publicly or privately)

Parallel agents in plugins

/review-pr command, specified in commands/review-pr.md
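The file itself is just markdown, with the body serving as the prompt. A minimal sketch of what a commands/review-pr.md might contain (the frontmatter field is illustrative; check the plugin docs for the exact schema):

```markdown
---
description: Review the current PR with parallel specialist subagents
---

Review the pull request in this repository. Launch the specialist
code-review subagents in parallel, collect their findings, and
summarize the results, ordered by severity.
```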

Parallel agents in plugins

silent-failure-hunter agent, specified in agents/silent-failure-hunter.md

Building Custom Agents

The /agents command

Building Custom Agents

Creating an agent for my Slush 2025 presentation

Building Custom Agents

Configuring the agent's system prompt

Building Custom Agents

Running the review with multiple agents

Hot Take: Building Coding Tools for AGI, not ASI

  • Building for ASI ("Artificial Super Intelligence") implies a mindset where anything we add to the agentic harness will be obsolete because the model will be so smart that it doesn't need our tools
    • Realistically, this has not been the trend in agentic coding
    • Giving agents more ability to get deterministic information (through tool calls, RAG, and hooks) has led to more autonomy and more reliability
  • Building for AGI ("Artificial General Intelligence") assumes that future agents will have intelligence like the best humans
    • This means that the baseline assumption should be that tools that improve efficiency and reliability for humans should also improve efficiency and reliability for agents.
    • Generally, the most efficient humans make more use of coding tools, not less. Why should we think agents are any different?
    • Build tools that work with what we have now, and they will be even more useful as the model improves

Pushing towards Longer Autonomy

  • The vision: write a spec before bed, wake up with a usable implementation of that spec
  • Problem: agents often stop working before a long-horizon task is done
    • This is partially because of how reinforcement learning works (to make any progress, you often need to give partial credit to partial solutions)
    • Also partially because we haven't fully evolved training away from the "chatbot" model (short turns, direct interaction, human-in-the-loop)
  • Problem: agents often do things they know won't work if asked to examine their work
    • A lot of problems in software engineering are easier to assess for correctness than they are to solve.
  • "Overnight" autonomy is often limited by the model deciding to stop working, not by it actually reaching the end of its ability to do productive work.

The Ralph Wiggum Loop

Ralph Wiggum saying 'I'm helping!'

"I'm helping!"

Original post by Geoffrey Huntley: ghuntley.com/ralph

  • Our version of this: feed the same prompt back into the agent until it is willing to "promise" (verbatim) that all parts of the task are done
    • We tell it that there are "special" <promise> tags that carry extra weight, and it must use those to exit the loop.
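A sketch of that loop, with a stub in place of the real agent invocation (the <promise> convention is from the slide; run_agent is a hypothetical callable):

```python
def ralph_wiggum_loop(run_agent, prompt: str, max_iterations: int = 50) -> str:
    """Feed the same prompt back until the agent emits the exact promise tag."""
    promise = "<promise>All parts of the task are complete.</promise>"
    looped_prompt = (
        prompt
        + "\n\nWhen, and only when, every part of the task is done, output "
        + f"exactly: {promise}"
    )
    for i in range(max_iterations):
        output = run_agent(looped_prompt)
        if promise in output:  # verbatim check: no promise, no exit
            return f"done after {i + 1} iterations"
    return "max iterations reached without a promise"

# Stub agent: claims partial progress twice, then promises completion.
outputs = iter(["still working", "almost there",
                "<promise>All parts of the task are complete.</promise>"])
print(ralph_wiggum_loop(lambda p: next(outputs), "Implement the spec in SPEC.md"))
# -> done after 3 iterations
```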

The Ralph Wiggum Loop

The Ralph Wiggum Philosophy

  • "Good enough" autonomy now
  • Quantity "becomes" quality
  • Faith in eventual consistency
Efficiency: 1.2× · 1.4× · 1.7× · 2.1× · 2.5× · 3.0× · 3.6× · 4.3× · 5.2× · 6.2×
Token Cost: 2× · 4× · 8× · 16× · 32× · 64× · 128× · 256× · 512× · 1024×
Category: Hobbyist (2×) · Pro-sumer (8×) · Start-up (16×) · Scale-up (256×) · Anthropic (1024×)
Overnight Autonomy

Better Instruction Following

  • Poor instruction following is a really hard problem to solve with reinforcement learning
    • Given a prompt with specific instructions, it's basically impossible to distinguish between changes that reinforce instruction following and changes that reinforce the specific behavior in those instructions
  • Result: agent often "forgets" to follow instructions in the system prompt
    • e.g., "always use camelCase function names"
  • This gets better if we remind the model of these instructions every turn
    • This uses a lot of context and doesn't scale well
  • But what if we only remind it when the instructions are relevant?

The hookify Plugin

  • Basic idea:
    • Scan tool uses for unwanted patterns
    • Remind the model of your instructions when the pattern matches
    • Make it easy to add to the list of things you want to remind the model of
  • Install the plugin (released last weekend just for this workshop!)
  • Tell the model when there's something you don't like
  • The model launches a subagent that creates the hook

Multi-agent Swarms

  • Launch multiple instances of Claude Code and give them a way to talk to each other
    • e.g., put each one in a separate tmux window and tell them to communicate with each other with tmux send-keys
  • Assign one of them to be the technical lead
    • Tell it to not write any code itself
  • Set up stop hooks that tell the technical lead when its workers are awaiting feedback
  • Set up a Ralph Wiggum loop on the technical lead
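The plumbing is just tmux. A sketch that builds the commands for such a swarm without running them (session names and the lead prompt are illustrative):

```python
def swarm_commands(n_workers: int) -> list[list[str]]:
    """Build tmux commands for one tech-lead session plus n worker sessions.
    Each agent runs in its own detached session; the lead is told to
    delegate rather than write code itself."""
    cmds = [["tmux", "new-session", "-d", "-s", "lead", "claude"]]
    for i in range(n_workers):
        cmds.append(["tmux", "new-session", "-d", "-s", f"worker-{i}", "claude"])
    # The lead coordinates by typing into the workers' sessions via send-keys.
    lead_prompt = (
        "You are the technical lead. Do not write code yourself; delegate "
        f"tasks to worker-0..worker-{n_workers - 1} via `tmux send-keys`."
    )
    cmds.append(["tmux", "send-keys", "-t", "lead", lead_prompt, "Enter"])
    return cmds

for cmd in swarm_commands(2):
    print(" ".join(cmd))
```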

Then this happened...

Then this happened...

Caveat: Cost

Efficiency: 1.2× · 1.4× · 1.7× · 2.1× · 2.5× · 3.0× · 3.6× · 4.3× · 5.2× · 6.2×
Token Cost: 2× · 4× · 8× · 16× · 32× · 64× · 128× · 256× · 512× · 1024×
Category: Hobbyist (2×) · Pro-sumer (8×) · Start-up (16×) · Scale-up (256×) · Anthropic (1024×)
Agent Swarms

Summary

Summary

  • Usage patterns for coding agents will vary significantly depending on your tolerance for diminishing returns
    • Most startups will be in a position to trade at least some productivity-per-token for increased autonomy
  • The context window is the most important part of LLM architecture to keep in mind when working with coding agents
  • Plugins enable building workflows for increased autonomy and for addressing shortfalls that are difficult or impractical to fix in training
    • Consider building plugins for common internal workflows
  • Build for AGI, not ASI
    • Practical tooling that accelerates workflows now will continue to be useful as models improve

Questions?

Thank you! Feel free to reach out with questions or to chat about AI coding tools.

Extra Slides

Appendix: Probability Distribution Experiments

These slides show detailed experimental results about how models complete code

Widget Factory Completions



Widget Factory Completions

Widget Factory Completions Results

Widget Factory Completions

Widget Factory Completions Results

Widget Factory Completion



Widget Factory Completions: With 'Implicit Move' Comment

Widget Factory Completions: With 'Implicit Move' Comment Results

Widget Factory Completions: With 'Implicit Move' Comment

Widget Factory Completions: With 'Implicit Move' Comment Results

Widget Factory Completions



Widget Factory Completions: With 'Broken Compiler' Comment

Widget Factory Completions: With 'Broken Compiler' Comment Results

Widget Factory Completions: With 'Broken Compiler' Comment

Widget Factory Completions: With 'Broken Compiler' Comment Results

Widget Factory Completions



Widget Factory Completions: C++ version number

Widget Factory Completions: C++ version number Results