Crafting the Code You Don't Write

AI Tools for Startup Founders

Dr. Daisy Hollman

Slush 2025 Workshop
November 19-20, 2025 • Helsinki, Finland

Who am I?

  • Engineer on the Claude Code team at Anthropic
    • Claude Code "Power User" Lead
    • Also: major Claude Code power user
  • Previously: programming language designer and engineer
    • C++ standards committee since 2016 (std::mdspan, executors, linear algebra)
    • Former CppCon program chair (2022-2023)
    • PhD in computational quantum chemistry
  • These two are more similar than you think
I've spent a decade thinking about how to make code understandable—that's exactly what matters when working with AI tools

A Thought Exercise

Imagine you could use twice as many tokens to go from idea to product 20% faster.

Imagine that you could repeat this process any number of times.

Question: How many times would you repeat this?

A Thought Exercise

Iterations   Efficiency   Token Usage   $/Engineer   Category
1            1.2×         2×            $400         Hobbyist
2            1.4×         4×            $800
3            1.7×         8×            $1,600       Pro-sumer
4            2.1×         16×           $3,200       Start-up
5            2.5×         32×           $6,400
6            3.0×         64×           $12,800
8            4.3×         256×         $51,200      Scale-up
10           6.2×         1024×        $204,800     Anthropic
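The table is just compounding arithmetic: each doubling of token usage buys a further ~20% efficiency, so after n iterations efficiency is 1.2ⁿ and token usage is 2ⁿ. A quick sketch (values rounded as in the table):

```python
# Each iteration doubles token usage in exchange for a further 20% efficiency gain.
BASE_COST = 400  # $/engineer at 2x tokens (iteration 1), as in the table

def iteration_stats(n: int) -> tuple[float, int, int]:
    """Return (efficiency, token multiple, $/engineer) after n iterations."""
    efficiency = 1.2 ** n
    tokens = 2 ** n
    cost = BASE_COST * 2 ** (n - 1)
    return efficiency, tokens, cost

for n in (1, 4, 10):
    eff, tok, cost = iteration_stats(n)
    print(f"{n:2d} iterations: {eff:.1f}x efficiency, {tok}x tokens, ${cost:,}")
```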

Usage Patterns by Diminishing-Return Tolerance

Efficiency: 1.2× · 1.4× · 1.7× · 2.1× · 2.5× · 3.0× · 3.6× · 4.3× · 5.2× · 6.2×
Token Cost: 2× · 4× · 8× · 16× · 32× · 64× · 128× · 256× · 512× · 1024×
Category: Hobbyist (2×) · Pro-sumer (8×) · Start-up (16×) · Scale-up (256×) · Anthropic (1024×)

  • Direct Interaction
  • Parallel Subagents
  • CI Integration
  • AI Code Review
  • Overnight Autonomy
  • Ambient Agents
  • Best of N Models
  • Agent Swarms

Understanding AI Coding Tools



A practical overview of how AI coding tools work

What you need to know to use them effectively

LLM Architecture

Overview of a typical modern LLM architecture

LLM Architecture: Input

  • The "embedding" layer turns tokens into vectors
    • Uses a fixed-size vocabulary selected at training time
  • But there is a limit to how many input tokens the model can handle at once
    • This is called the context length or "context window," and it's fixed at training time
  • There's a third size called the embedding dimension, which is the "working" dimension of the model
Overview of a typical modern LLM architecture

LLM Architecture: Input

The context window limits the amount of information that the user can have the model "think" about at any given time.
Basically everything built on top of LLMs is about engineering ways to efficiently use this context window. Understanding and managing the context window is critical to getting better at using LLMs!
Overview of a typical modern LLM architecture

LLM Architecture: Context Window

  • How do we make a chat bot out of this context window?
  • It's shockingly primitive: just add special token sequences (e.g., \nHuman: and \nAssistant:) to separate prompt from response, and throw it all into the context window. Literally:
  • This means that long conversations can use a lot of tokens very quickly!
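A minimal sketch of that serialization (the exact special tokens vary by model; \nHuman: / \nAssistant: is the convention shown on the slide):

```python
def serialize_conversation(turns: list[tuple[str, str]]) -> str:
    """Flatten (role, text) turns into the single string the model actually sees."""
    markers = {"human": "\nHuman: ", "assistant": "\nAssistant: "}
    prompt = "".join(markers[role] + text for role, text in turns)
    # The model is simply asked to continue the text after the final marker.
    return prompt + "\nAssistant:"

print(serialize_conversation([
    ("human", "What is a context window?"),
    ("assistant", "The maximum number of tokens the model can attend to."),
    ("human", "Why does it fill up so fast?"),
]))
```

Note that every new turn re-sends the entire history, which is exactly why long conversations burn tokens so quickly.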

Tool Use

  • At some point we realized that XML, JSON, and other structured markup languages are "just" text
  • So we started giving the LLMs a schema and a description of what will happen when it generates tokens matching that schema
  • Then we implement code that takes XML/JSON/etc. as input and does some task:
    • Run a command in the terminal
    • Run the compiler, the debugger, or the profiler
    • Search the internet
    • Search the code in a repository
    • Edit a file
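For illustration, a tool is just a schema plus a description handed to the model; dispatching on the model's structured output is ordinary code. A sketch (the schema shape here is illustrative, not any provider's exact wire format):

```python
import json
import subprocess

# Hypothetical tool definition given to the model alongside the prompt.
BASH_TOOL = {
    "name": "bash",
    "description": "Run a shell command and return its output.",
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}

def dispatch_tool_call(call_json: str) -> str:
    """Parse the model's tool-call JSON and actually perform the action."""
    call = json.loads(call_json)
    if call["name"] == "bash":
        result = subprocess.run(
            call["input"]["command"], shell=True,
            capture_output=True, text=True,
        )
        return result.stdout + result.stderr
    raise ValueError(f"unknown tool: {call['name']}")

# The model emits JSON matching the schema; we run it and feed the output back.
print(dispatch_tool_call('{"name": "bash", "input": {"command": "echo hi"}}'))
```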

How Tool Calls Actually Work

  • Most LLMs (including Claude) use XML-based syntax for tool invocation (for now)
  • Here's what an Edit tool call looks like in Claude Code:
  • If the old_string is wrong, the tool call fails!
  • If there are multiple old_strings in the file, the tool call fails!
  • We are in the ed era of agentic tooling—we haven't even invented vi yet.
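Those failure modes are easy to make concrete with a toy reimplementation of the Edit semantics (a sketch, not Claude Code's actual source):

```python
def edit_file_text(text: str, old_string: str, new_string: str) -> str:
    """Replace old_string with new_string, with Edit-tool failure semantics."""
    count = text.count(old_string)
    if count == 0:
        raise ValueError("old_string not found in file -- tool call fails")
    if count > 1:
        raise ValueError(
            f"old_string matches {count} times -- ambiguous, tool call fails"
        )
    return text.replace(old_string, new_string)

src = "int x = 1;\nint y = 1;\n"
print(edit_file_text(src, "int x = 1;", "int x = 2;"))
# edit_file_text(src, "= 1;", "= 2;") would raise: it matches both lines.
```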

Just for fun...

  • Claude Code wrote the code block in the tool call in the last slide. Here's what the edit tool call looked like for that...

Context window sizes over time

  • GPT-1 (2018): 512 tokens
  • GPT-4 (2023): 8K / 32K tokens
  • GPT-4 Turbo (2024): 128K tokens
  • Claude 3 (2024): 200K tokens
  • Gemini 1.5 (2024): 1M+ tokens
  • Claude 4 Sonnet (2025): 200K tokens
What do we do with all of these extra tokens?

Context Window Growth

Source: meibel.ai

What do we do with all of these tokens?

  • We need a way to put more relevant information in the context window in order to make the output better
    • 2023: Retrieval Augmented Generation (RAG)
    • 2024: Tool use
  • Wild idea: what if we just...ask the LLM to put more relevant tokens in its own context?
Before:
After:

Agents

  • Early LLMs could only do relatively short time-horizon tasks
    • Chatbots with short answers worked fine (kind of)
    • "Fancy" in-line code completion was another early application
    • With longer time-horizon tasks, early LLMs would quickly go "off the rails"
  • As LLMs got larger and reinforcement learning got better, longer running tasks became more feasible
    • Key insight: Let the model see the results of its actions and iterate on that feedback
    • For instance, what if we:
      • Ask the model to generate code
      • Give the model a tool to run the compiler on the code it generated
      • Add the results of the compilation in the context window
      • Ask it to fix its compilation errors
  • This feedback loop is what transforms a chatbot into an "agent"
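That loop can be sketched in a few lines. Here a stub stands in for the model, and Python's built-in compile() stands in for the real compiler; the shape of the feedback loop is the point:

```python
def compile_check(code: str) -> str:
    """Stand-in 'compiler': try to compile the source, return error text."""
    try:
        compile(code, "<generated>", "exec")
        return ""  # success: no diagnostics
    except SyntaxError as e:
        return f"SyntaxError: {e.msg} (line {e.lineno})"

def agent_loop(model, task: str, max_turns: int = 5) -> str:
    """Generate -> check -> feed errors back -> regenerate, until clean."""
    context = task
    for _ in range(max_turns):
        code = model(context)
        errors = compile_check(code)
        if not errors:
            return code  # compiles cleanly: the feedback loop converged
        context += f"\n\nCompiler output:\n{errors}\nPlease fix the errors."
    raise RuntimeError("agent gave up")

# Stub model: the first attempt has a syntax error, the second fixes it.
attempts = iter(["def f(:\n    return 1", "def f():\n    return 1"])
print(agent_loop(lambda context: next(attempts), "Write a function f."))
```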

Agent Task Length Over Time

Growth in agent task length capabilities over time
Source: metr.org

Agents: Code Search Tool

  • How do agents understand a 1M+ line codebase with only a few hundred thousand tokens of context?
  • Maybe a better question: how do humans do this?
    • We search through the code for key entry points (e.g., main()),
    • …then we read code called by those entry points...
    • …and types and key data structures.
    • When we want to understand a specific piece of functionality, we search for key strings or patterns related to that functionality
    • In other words, we only "load" part of the code into our "context window"
    • And maybe we "load" a summary of the rest of the code (and how it works) into our "context window" if we've been working on the project for a while
  • Agents can do this too!
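A code-search tool can be as simple as grep with a result cap, so a 1M+ line repo never floods the context window. A minimal sketch (not Claude Code's actual implementation):

```python
import pathlib
import re

def search_code(root: str, pattern: str, max_results: int = 20) -> list[str]:
    """Grep-style search: return at most max_results matching lines as
    path:line:text, so the agent 'loads' only what it actually needs."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(pathlib.Path(root).rglob("*.py")):
        for lineno, line in enumerate(
                path.read_text(errors="ignore").splitlines(), 1):
            if regex.search(line):
                hits.append(f"{path}:{lineno}:{line.strip()}")
                if len(hits) >= max_results:
                    return hits  # cap output: context is the scarce resource
    return hits
```

The agent chains such calls, finding main() first, then searching for the symbols it just saw, exactly like a human reading unfamiliar code.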

How Do We Scale From Here?


Using More Tokens to Build Faster and Stronger

Beyond Interactive Use

  • Interactive use of agents quickly becomes bottlenecked on the human in the loop
  • This mode of use will likely always produce the most productivity per token
  • But humans don't scale up; agents do

Parallel Subagents

Claude Code running multiple review agents in parallel

Claude Code Plugins

  • Claude Code Plugins allow you to install collections of slash commands, skills, subagents, hooks, and/or MCP servers that work together for a common purpose
    • More customization points coming soon
  • /plugin to access menu, or use it directly:
  • Marketplaces are lists of plugins that are easy for anyone to host (publicly or privately)

Parallel agents in plugins

/review-pr command, specified in commands/review-pr.md
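The file itself is just markdown, with the body serving as the prompt. A minimal sketch of what a commands/review-pr.md might contain (the frontmatter field is illustrative; check the plugin docs for the exact schema):

```markdown
---
description: Review the current PR with parallel specialist subagents
---

Review the pull request in this repository. Launch the specialist
code-review subagents in parallel, collect their findings, and
summarize the results, ordered by severity.
```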

Parallel agents in plugins

silent-failure-hunter agent, specified in agents/silent-failure-hunter.md

Building Custom Agents

The /agents command

Building Custom Agents

Creating an agent for my Slush 2025 presentation

Building Custom Agents

Configuring the agent's system prompt

Building Custom Agents

Running the review with multiple agents

Hot Take: Building Coding Tools for AGI, not ASI

  • Building for ASI ("Artificial Super Intelligence") implies a mindset where anything we add to the agentic harness will be obsolete because the model will be so smart that it doesn't need our tools
    • Realistically, this has not been the trend in agentic coding
    • Giving agents more ability to get deterministic information (through tool calls, RAG, and hooks) has led to more autonomy and more reliability
  • Building for AGI ("Artificial General Intelligence") assumes that future agents will have intelligence like the best humans
    • This means that the baseline assumption should be that tools that improve efficiency and reliability for humans should also improve efficiency and reliability for agents.
    • Generally, the most efficient humans make more use of coding tools, not less. Why should we think agents are any different?
    • Build tools that work with what we have now, and they will be even more useful as the model improves

Pushing towards Longer Autonomy

  • The vision: write a spec before bed, wake up with a usable implementation of that spec
  • Problem: agents often stop working before a long-horizon task is done
    • This is partially because of how reinforcement learning works (to make any progress, you often need to give partial credit to partial solutions)
    • Also partially because we haven't fully evolved training away from the "chatbot" model (short turns, direct interaction, human-in-the-loop)
  • Problem: agents often do things they know won't work if asked to examine their work
    • A lot of problems in software engineering are easier to assess for correctness than they are to solve.
  • "Overnight" autonomy is often limited by the model deciding to stop working, not by it actually reaching the end of its ability to do productive work.

The Ralph Wiggum Loop

Ralph Wiggum saying 'I'm helping!'

"I'm helping!"

Original post by Geoffrey Huntley: ghuntley.com/ralph

  • Our version of this: feed the same prompt back into the agent until it is willing to "promise" (verbatim) that all parts of the task are done
    • We tell it that there are "special" <promise> tags that carry extra weight, and it must use those to exit the loop.
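A sketch of that loop, with a stub in place of the real agent invocation (the <promise> convention is from the slide; run_agent is a hypothetical callable):

```python
def ralph_wiggum_loop(run_agent, prompt: str, max_iterations: int = 50) -> str:
    """Feed the same prompt back until the agent emits the exact promise tag."""
    promise = "<promise>All parts of the task are complete.</promise>"
    looped_prompt = (
        prompt
        + "\n\nWhen, and only when, every part of the task is done, output "
        + f"exactly: {promise}"
    )
    for i in range(max_iterations):
        output = run_agent(looped_prompt)
        if promise in output:  # verbatim check: no promise, no exit
            return f"done after {i + 1} iterations"
    return "max iterations reached without a promise"

# Stub agent: claims partial progress twice, then promises completion.
outputs = iter(["still working", "almost there",
                "<promise>All parts of the task are complete.</promise>"])
print(ralph_wiggum_loop(lambda p: next(outputs), "Implement the spec in SPEC.md"))
# -> done after 3 iterations
```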

The Ralph Wiggum Loop

The Ralph Wiggum Philosophy

  • "Good enough" autonomy now
  • Quantity "becomes" quality
  • Faith in eventual consistency
Efficiency: 1.2× · 1.4× · 1.7× · 2.1× · 2.5× · 3.0× · 3.6× · 4.3× · 5.2× · 6.2×
Token Cost: 2× · 4× · 8× · 16× · 32× · 64× · 128× · 256× · 512× · 1024×
Category: Hobbyist (2×) · Pro-sumer (8×) · Start-up (16×) · Scale-up (256×) · Anthropic (1024×)
Overnight Autonomy

Better Instruction Following

  • Poor instruction following is a really hard problem to solve with reinforcement learning
    • Given a prompt with specific instructions, it's basically impossible to distinguish between changes that reinforce instruction following and changes that reinforce the specific behavior in those instructions
  • Result: agent often "forgets" to follow instructions in the system prompt
    • e.g., "always use camelCase function names"
  • This gets better if we remind the model of these instructions every turn
    • This uses a lot of context and doesn't scale well
  • But what if we only remind it when the instructions are relevant?

The hookify Plugin

  • Basic idea:
    • Scan tool uses for unwanted patterns
    • Remind the model of your instructions when the pattern matches
    • Make it easy to add to the list of things you want to remind the model of
  • Install the plugin (released last weekend just for this workshop!)
  • Tell the model when there's something you don't like
  • The model launches a subagent that creates the hook

Multi-agent Swarms

  • Launch multiple instances of Claude Code and give them a way to talk to each other
    • e.g., put each one in a separate tmux window and tell them to communicate with each other with tmux send-keys
  • Assign one of them to be the technical lead
    • Tell it to not write any code itself
  • Set up stop hooks that tell the technical lead when its workers are awaiting feedback
  • Set up a Ralph Wiggum loop on the technical lead
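The plumbing is just tmux. A sketch that builds the commands for such a swarm without running them (session names and the lead prompt are illustrative):

```python
def swarm_commands(n_workers: int) -> list[list[str]]:
    """Build tmux commands for one tech-lead session plus n worker sessions.
    Each agent runs in its own detached session; the lead is told to
    delegate rather than write code itself."""
    cmds = [["tmux", "new-session", "-d", "-s", "lead", "claude"]]
    for i in range(n_workers):
        cmds.append(["tmux", "new-session", "-d", "-s", f"worker-{i}", "claude"])
    # The lead coordinates by typing into the workers' sessions via send-keys.
    lead_prompt = (
        "You are the technical lead. Do not write code yourself; delegate "
        f"tasks to worker-0..worker-{n_workers - 1} via `tmux send-keys`."
    )
    cmds.append(["tmux", "send-keys", "-t", "lead", lead_prompt, "Enter"])
    return cmds

for cmd in swarm_commands(2):
    print(" ".join(cmd))
```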

Then this happened...

Then this happened...

Caveat: Cost

Efficiency: 1.2× · 1.4× · 1.7× · 2.1× · 2.5× · 3.0× · 3.6× · 4.3× · 5.2× · 6.2×
Token Cost: 2× · 4× · 8× · 16× · 32× · 64× · 128× · 256× · 512× · 1024×
Category: Hobbyist (2×) · Pro-sumer (8×) · Start-up (16×) · Scale-up (256×) · Anthropic (1024×)
Agent Swarms

Summary

Summary

  • Usage patterns for coding agents will vary significantly depending on your tolerance for diminishing returns
    • Most startups will be in a position to trade at least some productivity-per-token for increased autonomy
  • The context window is the most important part of LLM architecture to keep in mind when working with coding agents
  • Plugins enable building workflows for increased autonomy and for addressing shortfalls that are difficult or impractical to fix in training
    • Consider building plugins for common internal workflows
  • Build for AGI, not ASI
    • Practical tooling that accelerates workflows now will continue to be useful as models improve

Questions?

Thank you! Feel free to reach out with questions or to chat about AI coding tools.

Extra Slides

Appendix: Probability Distribution Experiments

These slides show detailed experimental results about how models complete code

Widget Factory Completions



Widget Factory Completions

Widget Factory Completions Results

Widget Factory Completions

Widget Factory Completions Results

Widget Factory Completion



Widget Factory Completions: With 'Implicit Move' Comment

Widget Factory Completions: With 'Implicit Move' Comment Results

Widget Factory Completions: With 'Implicit Move' Comment

Widget Factory Completions: With 'Implicit Move' Comment Results

Widget Factory Completions



Widget Factory Completions: With 'Broken Compiler' Comment

Widget Factory Completions: With 'Broken Compiler' Comment Results

Widget Factory Completions: With 'Broken Compiler' Comment

Widget Factory Completions: With 'Broken Compiler' Comment Results

Widget Factory Completions



Widget Factory Completions: C++ version number

Widget Factory Completions: C++ version number Results