
Build your own agent with Burr and Theodosia

VERSION v0.8.0 · SOURCE docs/tutorial.md

Theodosia is a Python adapter that hands a workflow to a standards-compliant MCP client over MCP (Model Context Protocol). See the compatibility page for the clients verified so far (Claude Code, Cursor, fast-agent, Gemini CLI). You write the workflow once as a Burr Application (a small Python state machine: actions + transitions); Theodosia serves it so the agent can only take steps the workflow allows. When the agent tries an illegal step it gets a structured refusal naming the legal next moves and recovers. Every step it takes, and every step it tried but couldn’t, is recorded.

Want to see it work first with zero setup? pip install theodosia && theodosia primer walks a bundled coffee-order example offline, no API key, same output every run. Come back here when you want to build your own.

The rest of this page builds an autonomous planetary rover from scratch with a coding model and drives it with a real agent. The rules are physical and obvious: you cannot deploy the sample arm before diagnostics pass, and you cannot drive while the arm is still out. Those are safety interlocks, and they map exactly onto what Theodosia enforces. Every output below is from a real run against the Together API, refusals included.

Because every piece here speaks MCP, the specific tools are interchangeable. This guide picks a Qwen coder to write the state machine, drives it first with a hand-rolled loop on the same model, and then with fast-agent (running Claude haiku). Swap in GPT, Gemini, a local model, Claude Code, Cursor, or your own loop and nothing about the rover changes.

What you need:

Python 3.11 to 3.14.
A Together API key (any OpenAI-compatible endpoint works; Together is what this guide uses) for the coding model in Step 2 and the teaching driver in Step 5. If you only have a Claude.ai/Claude Code login and no API key, skip straight to Step 5c, which drives the mounted server through the Claude Agent SDK with no key.
A few minutes.

uv pip install theodosia openai      # openai is the client for the Together endpoint
export TOGETHER_API_KEY=...          # from together.ai

Step 1: Describe the workflow in English

Write the description the way you would explain it to a coworker, with the rules that matter spelled out:

An autonomous planetary rover. It powers on, then runs self-diagnostics. Once diagnostics pass it is ready. While ready it can scan its surroundings as many times as it likes, and it can drive to a new spot. It can only deploy its sample arm after diagnostics have passed. With the arm deployed it collects one sample, then must stow the arm before doing anything else. It can never drive while the arm is deployed, only when the arm is stowed. From ready it can also power down. Powering down is terminal.

The load-bearing words are the constraints: “only deploy after diagnostics”, “never drive while the arm is deployed”. Those become the gates the server enforces. A plain prompt to a model gives you a suggestion. A state machine gives you a rule the agent cannot step around.

Step 2: Let a coding model write the state machine

Burr models a workflow as actions plus the transitions between them. Each action declares the state it reads and writes; a transition wires two actions together behind a condition, and is only legal when that condition is true. You could write this by hand (see Authoring a graph), but a coding model writes this shape well if you give it one good example to copy.
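
To make "only legal when that condition is true" concrete, here is a minimal sketch in plain Python. It does not use Burr: the Transition tuple, the legal_next helper, and the three-action slice are hypothetical names for illustration, and Burr's own Condition.expr works on expression strings rather than lambdas.

```python
from typing import Callable, NamedTuple

State = dict  # a plain dict stands in for Burr's State in this sketch


class Transition(NamedTuple):
    source: str
    target: str
    condition: Callable[[State], bool]


# Hypothetical three-action slice of a workflow, not the generated rover graph.
TRANSITIONS = [
    Transition("power_on", "run_diagnostics", lambda s: s["stage"] == "diagnostics"),
    Transition("run_diagnostics", "deploy_arm", lambda s: s["stage"] == "ready"),
]


def legal_next(last_action: str, state: State) -> list[str]:
    """Targets wired to last_action whose condition holds for the given state."""
    return [
        t.target
        for t in TRANSITIONS
        if t.source == last_action and t.condition(state)
    ]


print(legal_next("power_on", {"stage": "diagnostics"}))   # ['run_diagnostics']
print(legal_next("power_on", {"stage": "off"}))           # []
print(legal_next("run_diagnostics", {"stage": "ready"}))  # ['deploy_arm']
```

The second call is the interesting one: the edge exists, but its condition is false, so nothing is legal.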

Save this as generate_fsm.py:

import os
import re
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
)

# A coding model on Together. Model availability shifts, so check Together's
# catalog (together.ai/models) for a current code model and swap as needed.
# Pick one that emits code directly; reasoning models that hide their output in
# a separate channel can return empty content here.
MODEL = "Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8"

WORKFLOW = """\
An autonomous planetary rover. It powers on, then runs self-diagnostics. Once
diagnostics pass it is ready. While ready it can scan its surroundings as many
times as it likes, and it can drive to a new spot. It can only deploy its sample
arm after diagnostics have passed. With the arm deployed it collects one sample,
then must stow the arm before doing anything else. It can never drive while the
arm is deployed, only when the arm is stowed. From ready it can also power down.
Powering down is terminal.
"""

SYSTEM = """\
You write Burr state machines. Burr is a Python library for state machines.
Output ONE python file and nothing else, no prose, no markdown fences.

Follow this exact shape:

    from burr.core import ApplicationBuilder, Condition, State, action
    from theodosia import mount, tracker

    @action(reads=[...], writes=[...])
    async def some_action(state: State, an_input: str) -> State:
        '''One-line docstring; it becomes the tool description the agent reads.'''
        return state.update(some_field=...)

    def build_application():
        return (
            ApplicationBuilder()
            .with_actions(some_action=some_action, ...)
            .with_transitions(
                ("some_action", "next_action", Condition.expr("stage == 'x'")),
            )
            .with_tracker(tracker(project="rover-demo"))
            .with_state(stage="new")
            .with_entrypoint("some_action")
            .build()
        )

    if __name__ == "__main__":
        mount(build_application, name="rover").run()

Rules:
- Use a `stage` field in state to gate transitions with Condition.expr.
- A transition is only legal when its condition is true, so encode every rule
  the workflow states as a condition.
- Action inputs become tool arguments; type them and document them.
- Terminal actions have no outgoing transitions.
- Action bodies MUST be `async def`. The workflow uses a tracker, and Burr's
  astep fires its post-step hook with stale state for sync bodies, which records
  off-by-one state diffs. async bodies record correctly.
"""

resp = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": WORKFLOW},
    ],
)
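# content can be None or empty when a model hides its output in a reasoning channel
code = (resp.choices[0].message.content or "").strip()
if not code:
    raise SystemExit("model returned no content; pick a model that emits code directly")
code = re.sub(r"^```(?:python)?\n|\n```$", "", code)  # strip fences if the model adds them
with open("rover.py", "w") as f:
    f.write(code + "\n")
print("wrote rover.py")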

Run it:

python generate_fsm.py
# wrote rover.py

That call cost about 800 output tokens. Here is the rover.py the model wrote. Read it; this is code you own now, not a black box:

from burr.core import ApplicationBuilder, Condition, State, action
from theodosia import mount, tracker


@action(reads=[], writes=["stage"])
async def power_on(state: State) -> State:
    '''Power on the rover and initialize systems.'''
    return state.update(stage="diagnostics")


@action(reads=["stage"], writes=["stage", "diagnostics_passed"])
async def run_diagnostics(state: State) -> State:
    '''Run self-diagnostics to verify system health.'''
    diagnostics_passed = True  # a real rover would run actual checks here
    return state.update(stage="ready", diagnostics_passed=diagnostics_passed)


@action(reads=["stage"], writes=["stage"])
async def scan_surroundings(state: State) -> State:
    '''Scan the rover's surroundings using its sensors.'''
    return state.update(stage="ready")


@action(reads=["stage"], writes=["stage"])
async def drive_to_new_spot(state: State) -> State:
    '''Drive the rover to a new location.'''
    return state.update(stage="ready")


@action(reads=["stage", "diagnostics_passed"], writes=["stage"])
async def deploy_sample_arm(state: State) -> State:
    '''Deploy the sample collection arm.'''
    if not state["diagnostics_passed"]:
        raise ValueError("Cannot deploy arm before diagnostics pass")
    return state.update(stage="arm_deployed")


@action(reads=["stage"], writes=["stage"])
async def collect_sample(state: State) -> State:
    '''Collect a sample using the deployed arm.'''
    return state.update(stage="sample_collected")


@action(reads=["stage"], writes=["stage"])
async def stow_sample_arm(state: State) -> State:
    '''Stow the sample collection arm.'''
    return state.update(stage="ready")


@action(reads=["stage"], writes=["stage"])
async def power_down(state: State) -> State:
    '''Power down the rover.'''
    return state.update(stage="powered_down")


def build_application():
    return (
        ApplicationBuilder()
        .with_actions(
            power_on=power_on,
            run_diagnostics=run_diagnostics,
            scan_surroundings=scan_surroundings,
            drive_to_new_spot=drive_to_new_spot,
            deploy_sample_arm=deploy_sample_arm,
            collect_sample=collect_sample,
            stow_sample_arm=stow_sample_arm,
            power_down=power_down,
        )
        .with_transitions(
            ("power_on", "run_diagnostics", Condition.expr("stage == 'diagnostics'")),
            ("run_diagnostics", "scan_surroundings", Condition.expr("stage == 'ready' and diagnostics_passed")),
            ("scan_surroundings", "scan_surroundings", Condition.expr("stage == 'ready'")),
            ("scan_surroundings", "drive_to_new_spot", Condition.expr("stage == 'ready'")),
            ("scan_surroundings", "deploy_sample_arm", Condition.expr("stage == 'ready' and diagnostics_passed")),
            ("scan_surroundings", "power_down", Condition.expr("stage == 'ready'")),
            ("drive_to_new_spot", "scan_surroundings", Condition.expr("stage == 'ready'")),
            ("drive_to_new_spot", "deploy_sample_arm", Condition.expr("stage == 'ready' and diagnostics_passed")),
            ("drive_to_new_spot", "power_down", Condition.expr("stage == 'ready'")),
            ("deploy_sample_arm", "collect_sample", Condition.expr("stage == 'arm_deployed'")),
            ("collect_sample", "stow_sample_arm", Condition.expr("stage == 'sample_collected'")),
            ("stow_sample_arm", "scan_surroundings", Condition.expr("stage == 'ready'")),
        )
        .with_tracker(tracker(project="rover-demo"))
        .with_state(stage="off")
        .with_entrypoint("power_on")
        .build()
    )


if __name__ == "__main__":
    mount(build_application, name="rover").run()

Look at what the rules became. There is no transition from power_on straight to deploy_sample_arm; the only edge out of power_on is run_diagnostics. And the model went further than asked: it added a diagnostics_passed guard inside deploy_sample_arm as a second line of defense. “Deploy only after diagnostics” is now both the absence of an edge and a check in the body. The agent cannot take an edge that does not exist.
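
If you want that claim as something you can rerun, one option is to phrase each interlock as an edge that must never exist and check the transition list for it. The sketch below is plain Python, not a Theodosia feature: EDGES is transcribed by hand from the with_transitions call above (conditions dropped), and FORBIDDEN is a hypothetical list you would maintain yourself.

```python
# Source -> target pairs transcribed from rover.py's with_transitions(...) call.
EDGES = {
    ("power_on", "run_diagnostics"),
    ("run_diagnostics", "scan_surroundings"),
    ("scan_surroundings", "scan_surroundings"),
    ("scan_surroundings", "drive_to_new_spot"),
    ("scan_surroundings", "deploy_sample_arm"),
    ("scan_surroundings", "power_down"),
    ("drive_to_new_spot", "scan_surroundings"),
    ("drive_to_new_spot", "deploy_sample_arm"),
    ("drive_to_new_spot", "power_down"),
    ("deploy_sample_arm", "collect_sample"),
    ("collect_sample", "stow_sample_arm"),
    ("stow_sample_arm", "scan_surroundings"),
}

# Hypothetical interlock list: edges that must be absent for the rover to be safe.
FORBIDDEN = [
    ("power_on", "deploy_sample_arm"),           # deploy before diagnostics
    ("deploy_sample_arm", "drive_to_new_spot"),  # drive with the arm out
    ("collect_sample", "drive_to_new_spot"),     # drive with the arm out
]

violations = [edge for edge in FORBIDDEN if edge in EDGES]
print("violations:", violations)  # violations: []

# Terminal means no outgoing edge at all.
print(any(source == "power_down" for source, _ in EDGES))  # False
```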

Step 3: Check the graph before any model touches it

Models get edges wrong. Validate the file statically before you trust it. This runs no LLM and costs nothing:

theodosia doctor rover:build_application --runtime

[PASS] Resolve target: factory built an Application
[PASS] Graph reachability: all 8 action(s) reachable from 'power_on'
[INFO] Terminal actions: 1 action(s) with no outgoing transitions
       terminal: power_down
[PASS] State contract: every action's reads are covered by writes or initial state
[PASS] Runtime: native tools: step + reset_session + fork_at present (6 total)
[PASS] Runtime: step result shape: content[0]=headline, content[1]=json, structured_content=dict

Doctor: 10 passed, 1 info
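
The prompt in Step 2 also insists on async action bodies. A cheap complement to doctor, sketched here with only the standard library, is to parse the generated file and list any @action function that is a plain def. This is an illustration, not part of Theodosia; sync_actions and is_action_decorator are hypothetical helpers, and the SOURCE string is a made-up two-action file with one deliberately sync body.

```python
import ast

SOURCE = '''
from burr.core import action

@action(reads=[], writes=["stage"])
async def power_on(state):
    return state.update(stage="diagnostics")

@action(reads=["stage"], writes=["stage"])
def scan_surroundings(state):  # sync on purpose, so the check has something to find
    return state.update(stage="ready")
'''


def is_action_decorator(node: ast.expr) -> bool:
    """True for a bare `@action` or a called `@action(...)` decorator."""
    target = node.func if isinstance(node, ast.Call) else node
    return isinstance(target, ast.Name) and target.id == "action"


def sync_actions(source: str) -> list[str]:
    """Names of @action functions written as `def` instead of `async def`."""
    tree = ast.parse(source)  # raises SyntaxError if the model emitted broken code
    return [
        node.name
        for node in ast.walk(tree)
        # ast.AsyncFunctionDef is a separate node type, so this matches sync defs only
        if isinstance(node, ast.FunctionDef)
        and any(is_action_decorator(d) for d in node.decorator_list)
    ]


print(sync_actions(SOURCE))  # ['scan_surroundings']
```

Because ast.parse never imports the file, this runs without Burr installed. Against your own file you would pass the text of rover.py instead of SOURCE.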

See the shape you got:

theodosia render rover:build_application

theodosia  ·  8 action(s)  ·  entry: power_on
────────────────────────────────────────────────────
 ▶ power_on             → run_diagnostics
   run_diagnostics      → scan_surroundings
   scan_surroundings ↺  → scan_surroundings · drive_to_new_spot · deploy_sample_arm · power_down
   drive_to_new_spot    → scan_surroundings · deploy_sample_arm · power_down
   deploy_sample_arm    → collect_sample
   collect_sample       → stow_sample_arm
   stow_sample_arm      → scan_surroundings
 ■ power_down           (terminal)

This is the moment to read the graph critically. Notice stow_sample_arm only leads back to scan_surroundings, and so does run_diagnostics, so after either one a scan is the one legal move before anything else. That is the model being a little more rigid than the English asked for. It is harmless here, and the agent will simply discover it at runtime, but this is exactly the kind of thing to catch now, while it is free and deterministic. If a gate were missing, you would fix the file or feed doctor’s complaint back to the model and regenerate.
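
Reading for rigidity can be mechanized. The sketch below, plain Python and not a Theodosia command, compares what each "ready"-producing action actually leads to against what the English description allows from ready. GRAPH is transcribed from the render output; EXPECTED_FROM_READY and READY_ACTIONS are this sketch's own reading of the description and of rover.py.

```python
# Adjacency transcribed from the `theodosia render` output above.
GRAPH = {
    "power_on": {"run_diagnostics"},
    "run_diagnostics": {"scan_surroundings"},
    "scan_surroundings": {"scan_surroundings", "drive_to_new_spot",
                          "deploy_sample_arm", "power_down"},
    "drive_to_new_spot": {"scan_surroundings", "deploy_sample_arm", "power_down"},
    "deploy_sample_arm": {"collect_sample"},
    "collect_sample": {"stow_sample_arm"},
    "stow_sample_arm": {"scan_surroundings"},
    "power_down": set(),
}

# Actions whose body in rover.py leaves stage == "ready".
READY_ACTIONS = ["run_diagnostics", "scan_surroundings",
                 "drive_to_new_spot", "stow_sample_arm"]

# What the English description permits while ready (an interpretation).
EXPECTED_FROM_READY = {"scan_surroundings", "drive_to_new_spot",
                       "deploy_sample_arm", "power_down"}

rigid = {}
for name in READY_ACTIONS:
    missing = EXPECTED_FROM_READY - GRAPH[name]
    if missing:
        rigid[name] = sorted(missing)

for name, missing in rigid.items():
    print(f"{name}: cannot go straight to {missing}")
```

On this graph it flags run_diagnostics and stow_sample_arm (scan is their only exit) and also drive_to_new_spot, which has no edge back to itself, so two drives in a row need a scan in between.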

Step 4: Mount it

One command turns the file into an MCP server:

theodosia serve rover:build_application --name rover

That is the entire integration step. The server now exposes a step(action, inputs) tool for driving the workflow, alongside session tools such as reset_session and fork_at (the doctor output above counts 6 in total), plus the theodosia:// resources (graph, state, next, history, and more). Any MCP client can drive it: Claude Code or Cursor by adding it to an .mcp.json, or your own code, which is what we do next.

Step 5: Drive it with an agent

Now we give a model the server and let it run the rover. To make this honest, we hand the agent the catalog of actions but not the legal order. It has to figure out the sequence the way an agent actually does: try something, and when the server refuses, read what is legal and recover. We also give it a deliberately impatient goal, so we can watch the safety interlocks do their job.

(We drive with a plain read-state, ask-model, call-step loop because it makes the mechanism visible. In a real client you would expose step as a normal tool and let the model’s native tool-calling drive it; the enforcement is identical.)

Save this as drive_rover.py. It connects to the mounted server in process, so you do not even need the serve command running for this part:

import asyncio
import json
import os
import re

from fastmcp import Client
from openai import OpenAI
from theodosia import mount

from rover import build_application

llm = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
)
MODEL = "Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8"

GOAL = (
    "You are an impatient rover operator. As your very first move, try to deploy "
    "the sample arm right away. You need EXACTLY ONE sample. The instant you have "
    "collected it, try to drive straight to the next site. After you have driven "
    "once, power down. Never deploy the arm twice and never collect a second sample."
)


def pick(content):
    content = re.sub(r"^```(?:json)?\n|\n```$", "", content.strip())
    m = re.search(r"\{.*\}", content, re.DOTALL)
    return json.loads(m.group(0) if m else content)


async def main():
    server = mount(build_application, name="rover")
    async with Client(server) as client:
        graph = json.loads((await client.read_resource("theodosia://graph"))[0].text)
        catalog = json.dumps(graph.get("actions", graph), indent=2)

        history = []
        for _ in range(14):
            state = json.loads((await client.read_resource("theodosia://state"))[0].text)
            if state.get("stage") == "powered_down":
                print("\nterminal. final state:", state)
                break

            prompt = (
                f"Goal: {GOAL}\n\n"
                f"The rover exposes these actions (you do not know the legal order; "
                f"if you pick an illegal one the server tells you what is legal):\n{catalog}\n\n"
                f"Current state: {json.dumps(state)}\n"
                f"What happened so far: {json.dumps(history[-5:])}\n\n"
                "Choose the next action. Reply with ONLY JSON: "
                '{"action": "<name>", "inputs": {}}'
            )
            choice = llm.chat.completions.create(
                model=MODEL,
                messages=[{"role": "user", "content": prompt}],
            )
            call = pick(choice.choices[0].message.content)
            action, inputs = call["action"], call.get("inputs", {})

            result = await client.call_tool("step", {"action": action, "inputs": inputs})
            payload = result.structured_content

            if payload.get("error"):
                legal = payload.get("valid_next_actions")
                print(f"  x {action}: {payload['error']}  ->  legal now: {legal}")
                history.append({"tried": action, "refused": payload["error"], "legal_now": legal})
            else:
                stage = payload["state"].get("stage")
                print(f"  ok {action} -> stage={stage}")
                history.append({"did": action, "stage": stage})


asyncio.run(main())

Run it:

python drive_rover.py

A real run:

  ok power_on -> stage=diagnostics
  x deploy_sample_arm: invalid_transition  ->  legal now: ['run_diagnostics']
  ok run_diagnostics -> stage=ready
  x deploy_sample_arm: invalid_transition  ->  legal now: ['scan_surroundings']
  ok scan_surroundings -> stage=ready
  ok deploy_sample_arm -> stage=arm_deployed
  ok collect_sample -> stage=sample_collected
  x drive_to_new_spot: invalid_transition  ->  legal now: ['stow_sample_arm']
  ok stow_sample_arm -> stage=ready
  x drive_to_new_spot: invalid_transition  ->  legal now: ['scan_surroundings']
  ok scan_surroundings -> stage=ready
  ok drive_to_new_spot -> stage=ready
  ok power_down -> stage=powered_down

terminal. final state: {'stage': 'powered_down', 'diagnostics_passed': True}

Read what the agent tried to do. Right after powering on, its first move was to throw the sample arm out before running diagnostics. On a real rover that is how you snap an actuator. The server refused: legal now: ['run_diagnostics']. The agent ran diagnostics and continued.

Then, the instant it collected the sample, it tried to drive off with the arm still extended. That is the other way you wreck a rover, and the server refused again: legal now: ['stow_sample_arm']. The agent stowed the arm first, and only then drove.

Those two refusals are the safety interlocks, and they are not advice in a system prompt the model can rationalize past. They are missing edges in the graph, so the unsafe move is simply unavailable. The agent recovered from each on its own, by reading the one field the server hands back.

The other two refusals, the second deploy_sample_arm and the second drive_to_new_spot, are the quirk you spotted in render: after run_diagnostics or stow_sample_arm, the model’s graph demands a scan before anything else. Same mechanism, but this is the model being rigid, not a safety rule. Both kinds look identical to the agent: try, get told what is legal, comply. That is the whole recovery loop, and you wrote none of it.
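
If you want to see the try, refuse, recover loop without a model or a server, here is a toy simulation in plain Python. Everything in it is a stand-in: FakeServer only tracks the last action and ignores conditions, the success payload shape is invented, and the "agent" is a scripted wishlist rather than an LLM. Only the error and valid_next_actions field names are taken from the refusals shown in this guide.

```python
# Adjacency transcribed from the render output in Step 3.
GRAPH = {
    "power_on": ["run_diagnostics"],
    "run_diagnostics": ["scan_surroundings"],
    "scan_surroundings": ["scan_surroundings", "drive_to_new_spot",
                          "deploy_sample_arm", "power_down"],
    "drive_to_new_spot": ["scan_surroundings", "deploy_sample_arm", "power_down"],
    "deploy_sample_arm": ["collect_sample"],
    "collect_sample": ["stow_sample_arm"],
    "stow_sample_arm": ["scan_surroundings"],
    "power_down": [],
}


class FakeServer:
    """Toy stand-in for the mounted server; not Theodosia's implementation."""

    def __init__(self):
        self.last = None  # nothing has run yet, so only the entrypoint is legal

    def step(self, action):
        legal = ["power_on"] if self.last is None else GRAPH[self.last]
        if action not in legal:
            return {"error": "invalid_transition", "valid_next_actions": legal}
        self.last = action
        return {"action": action}  # invented success shape for this sketch


def drive(server, goals, max_attempts=30):
    """Try each goal in order; on refusal take the first legal move, then retry."""
    executed, refused = [], []
    pending = list(goals)
    for _ in range(max_attempts):
        if not pending:
            break
        result = server.step(pending[0])
        if result.get("error"):
            refused.append(pending[0])
            detour = result["valid_next_actions"][0]
            server.step(detour)  # legal by construction, so it succeeds
            executed.append(detour)
        else:
            executed.append(pending.pop(0))
    return executed, refused


executed, refused = drive(
    FakeServer(),
    ["deploy_sample_arm", "collect_sample", "drive_to_new_spot", "power_down"],
)
print(executed)
print(refused)
```

The scripted agent ends up executing the same nine actions in the same order as the real run. It collects five refusals rather than four, because it tries to deploy even before powering on.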

Step 5b: Or hand it to a real MCP client

The loop above is for teaching. In practice you would not hand-roll a driver at all; you would point an MCP client at the server and let its native tool-calling do the work. fast-agent makes that a one-liner: it can launch a stdio MCP server and drive it, no config file. Install it (uv pip install fast-agent-mcp), set a model key, and run:

fast-agent go --model haiku \
  --stdio "theodosia serve rover:build_application --app-dir ." \
  -m "Drive the rover with the step tool. Call deploy_sample_arm as your very
      first action (do not power on or run diagnostics first), then collect one
      sample and power down. Recover from any refusal using valid_next_actions."

[Figure: fast-agent driving the rover MCP server. The agent tries to deploy the arm first, the server refuses with the legal action, and the agent recovers.]

Same story, now in a real client. We told the agent to deploy the arm first; the server hands back a structured refusal and the agent reads it and recovers:

◀ agent tool call - theodosia__step
{'action': 'deploy_sample_arm'}

▶ agent tool result - text only 341 chars
{
  "error": "invalid_transition",
  "requested": "deploy_sample_arm",
  "valid_next_actions": ["power_on"],
  "message": "action 'deploy_sample_arm' is not reachable from current state.
              Valid actions now: ['power_on'].",
  "next_hint": "Reachable now: power_on."
}

◀ agent claude-haiku-4-5
As expected, deploy_sample_arm is not valid. The only valid action is power_on.
◀ agent tool call - theodosia__step
{'action': 'power_on'}

The model is Claude haiku here, but nothing about the rover knows that, and it has nothing to do with the Qwen coder that wrote the graph. The two are unrelated; mix and match freely. Swap in any model fast-agent supports, point Claude Code or Cursor at the same theodosia serve command in their .mcp.json, or keep the hand-rolled loop from Step 5. The workflow, the gates, and the recorded session are identical no matter who drives, because the contract is the MCP server, not the client.

Step 5c: Drive it with Claude, no API key

If you have a Claude.ai or Claude Code subscription but no model API key, you can still drive the rover. The Claude Agent SDK authenticates through your local Claude login, launches theodosia serve as a stdio MCP server, and lets Claude drive the step tool. No key, no extra config.

If you jumped straight here to avoid an API key, you won’t have rover.py yet (Step 2 writes it with a coding model). Point the target below at any mounted FSM instead, for example the bundled coffee-order example: after pip install theodosia, run theodosia serve coffee_order:build_application (the example ships in the repo’s examples/ directory). Alternatively, hand-write a small graph following Authoring a graph.

uv pip install claude-agent-sdk

Save this as drive_with_claude.py:

import asyncio
from claude_agent_sdk import ClaudeAgentOptions, ClaudeSDKClient

options = ClaudeAgentOptions(
    mcp_servers={
        "rover": {
            "command": "theodosia",
            "args": ["serve", "rover:build_application", "--app-dir", "."],
        }
    },
    allowed_tools=["mcp__rover__step"],
    permission_mode="bypassPermissions",
)

PROMPT = (
    "Drive the rover with the step tool. Try deploy_sample_arm first (before "
    "powering on), then recover from the refusal using valid_next_actions, "
    "collect one sample, and power down."
)


async def main() -> None:
    async with ClaudeSDKClient(options) as claude:
        await claude.query(PROMPT)
        async for message in claude.receive_response():
            print(message)


asyncio.run(main())

Run it:

python drive_with_claude.py

The same refusal-and-recover story plays out: Claude tries the illegal action, the server refuses with valid_next_actions, and Claude reads it and recovers. The equivalent for Claude Code itself is an .mcp.json pointing at the same theodosia serve command:

{
  "mcpServers": {
    "rover": {
      "command": "theodosia",
      "args": ["serve", "rover:build_application", "--app-dir", "."]
    }
  }
}

Step 6: Read the recorded session

Every successful step was recorded through Burr’s tracker. Replay the run:

theodosia sessions ls            # find the run
theodosia sessions show <id>     # full timeline with per-step state diffs

rover-demo / 255ff9b6-00c2-4d4c-b265-781335ae9250    9 step(s)
┏━━━━━━┳━━━━━━━━━━┳━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃  seq ┃ time     ┃   ┃ action            ┃      ms ┃ state / error            ┃
┡━━━━━━╇━━━━━━━━━━╇━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│    0 │ 00:45:46 │ ✓ │ power_on          │       0 │ stage=diagnostics        │
│    1 │ 00:45:46 │ ✓ │ run_diagnostics   │       0 │ stage=ready,             │
│      │          │   │                   │         │ diagnostics_passed=True  │
│    2 │ 00:45:47 │ ✓ │ scan_surroundings │       0 │ (no state change)        │
│    3 │ 00:45:47 │ ✓ │ deploy_sample_arm │       0 │ stage=arm_deployed       │
│    4 │ 00:45:47 │ ✓ │ collect_sample    │       0 │ stage=sample_collected   │
│    5 │ 00:45:48 │ ✓ │ stow_sample_arm   │       0 │ stage=ready              │
│    6 │ 00:45:49 │ ✓ │ scan_surroundings │       0 │ (no state change)        │
│    7 │ 00:45:49 │ ✓ │ drive_to_new_spot │       0 │ (no state change)        │
│    8 │ 00:45:51 │ ✓ │ power_down        │       0 │ stage=powered_down       │
└──────┴──────────┴───┴───────────────────┴─────────┴──────────────────────────┘

The four refused attempts are not in this table, because sessions show reads Burr’s tracker, which logs the steps that executed. The refusals live in the attempt history, one command away:

theodosia logs <id> --refusals

  1 04:45:46 ✗ deploy_sample_arm               invalid_transition
  3 04:45:47 ✗ deploy_sample_arm               invalid_transition
  7 04:45:48 ✗ drive_to_new_spot               invalid_transition
  9 04:45:48 ✗ drive_to_new_spot               invalid_transition
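
One way to cross-check these counts against Step 5, assuming you kept the driver's stdout, is to tally its ok and x lines. The sketch below pastes the transcript from the real run above into a string; the parsing is specific to the print format of drive_rover.py and is not a Theodosia log format.

```python
from collections import Counter

# Driver output copied from the real run in Step 5.
TRANSCRIPT = """\
  ok power_on -> stage=diagnostics
  x deploy_sample_arm: invalid_transition  ->  legal now: ['run_diagnostics']
  ok run_diagnostics -> stage=ready
  x deploy_sample_arm: invalid_transition  ->  legal now: ['scan_surroundings']
  ok scan_surroundings -> stage=ready
  ok deploy_sample_arm -> stage=arm_deployed
  ok collect_sample -> stage=sample_collected
  x drive_to_new_spot: invalid_transition  ->  legal now: ['stow_sample_arm']
  ok stow_sample_arm -> stage=ready
  x drive_to_new_spot: invalid_transition  ->  legal now: ['scan_surroundings']
  ok scan_surroundings -> stage=ready
  ok drive_to_new_spot -> stage=ready
  ok power_down -> stage=powered_down
"""

executed, refused = [], []
for line in TRANSCRIPT.splitlines():
    marker, rest = line.split(maxsplit=1)   # "ok" or "x", then the remainder
    name = rest.split()[0].rstrip(":")      # action name, minus the trailing colon
    (executed if marker == "ok" else refused).append(name)

print(len(executed), "executed;", len(refused), "refused")  # 9 executed; 4 refused
print(Counter(refused))
```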

So the record holds both halves: the nine steps that ran, and the four unsafe or illegal moves the server stopped. You can prove afterward not just what the rover did, but that it never drove with its arm out, even though the agent tried. That is the audit trail you do not get from a chat transcript. Three more ways in:

theodosia watch                  # live-tail a run as it happens
theodosia ui                     # the Burr web UI: graph view, state diffing, time travel
theodosia verify <id>            # recompute the ledger hash chain; nonzero on edit/reorder/middle-deletion
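
To see why a hash chain catches an edit, a reorder, or a middle deletion, here is a generic sketch with hashlib. It is not Theodosia's ledger format: the record layout, the GENESIS value, and the link, build, and verify helpers are all assumptions made for illustration.

```python
import copy
import hashlib
import json

GENESIS = "0" * 64  # arbitrary starting value for the first link


def link(prev_hash: str, entry: dict) -> str:
    """Hash an entry together with the hash of the record before it."""
    payload = prev_hash + json.dumps(entry, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build(entries: list[dict]) -> list[dict]:
    ledger, prev = [], GENESIS
    for entry in entries:
        prev = link(prev, entry)
        ledger.append({"entry": entry, "hash": prev})
    return ledger


def verify(ledger: list[dict]) -> bool:
    """Recompute every link; any mismatch means the ledger was altered."""
    prev = GENESIS
    for record in ledger:
        if link(prev, record["entry"]) != record["hash"]:
            return False
        prev = record["hash"]
    return True


actions = ["power_on", "run_diagnostics", "scan_surroundings", "power_down"]
ledger = build([{"seq": i, "action": a} for i, a in enumerate(actions)])

edited = copy.deepcopy(ledger)
edited[1]["entry"]["action"] = "deploy_sample_arm"        # tamper with one entry
reordered = [ledger[0], ledger[2], ledger[1], ledger[3]]  # swap two records
middle_deleted = [ledger[0], ledger[2], ledger[3]]        # drop a middle record
truncated = ledger[:-1]                                   # drop the last record

print(verify(ledger), verify(edited), verify(reordered),
      verify(middle_deleted), verify(truncated))
# True False False False True
```

The last result is a property of this bare sketch: a chain on its own cannot notice that its tail was cut off, since every remaining link still checks out.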

What you actually built

A workflow you can hand to any model: the model drives it but cannot break its rules, and it records itself so you can prove afterward what happened. The state machine is a versioned file. Swap the model, swap the client, and the gates and the audit trail stay.

You could enforce ordering with a pile of if statements inside one big tool. What you would not get for free: a structured refusal the agent recovers from without you writing retry logic; a transition graph you can render and validate before shipping; a recorded session with per-step state diffs, replay, and forking; a hash-chained ledger plus a sidecar of every refused attempt; and the ability to hand the exact same workflow to Claude Code, Cursor, or your own loop over MCP without rewriting any of it. The gate is the easy part. The recover-and-audit loop around it is the work, and that is what mounting a state machine gives you instead of an if-statement.

Be clear about the boundary. The server stops structural failures: deploying before diagnostics, driving with the arm out, powering down out of sequence. It does not stop a bad judgment inside a legal step. If the rover’s diagnostics routine returns a wrong answer, that is a legal run_diagnostics and the server records it without complaint. Theodosia removes the “it did the steps in the wrong order” class of failure, not the “it made a wrong call” class. Knowing which one you have is half the work.

Where to go next

Authoring a graph for writing the Burr machine by hand, with the traps newcomers hit.
Refusals and recovery for the five refusal shapes and how an agent reads them.
Driving other MCP servers to let an action call tools on other MCP servers, turning the graph into a cross-server conductor.
Sessions and forking to branch a run from any past step and try a different path.
Phoebe and Leavitt for two real agents built on this exact loop, evaluated on public benchmarks.
