Shipping AI agents end-to-end: what I learned building ToonyStory

A lot of what I do, both at my day job at Marq and on the side at ToonyStory, is ship AI agents into the hands of real users and then own what happens next. The job isn't research, and it isn't a demo. It's forward deployed: pick a user outcome, build the agent that gets there, deploy it, watch it break, fix it, and keep going.

ToonyStory is a good case to write up because the failure modes are unforgiving. The agent has to produce a physical, printed book. There's no "click regenerate" once it ships to the printer.

Here's what I've learned doing that work.

The agent loop, not the prompt

The most common mistake I see in early AI products — and I've made it plenty of times — is treating the LLM call as the product. It isn't. The product is the loop around the call.

The ToonyStory agent does roughly this:

Intake — pull structured input from the user (subject, age, traits, photo references)
Story generation — an LLM produces a narrative with explicit scene breaks and per-scene character grounding
Character consistency — a separate pass extracts visual descriptors and reuses them across every illustration prompt
Illustration — a multimodal model generates each page from a constrained prompt + reference imagery
Layout — text and image get composed into a print-ready spread, respecting bleed and trim
Print fulfillment — the file goes to a print-on-demand partner with strict color and resolution requirements
Telemetry & feedback — every stage emits structured events back into the warehouse for evals
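
As a rough sketch of what "the loop around the call" can look like in code, assume each stage is a plain function over a shared job dictionary. Everything here (`run_loop`, `LoopResult`, the stub stages, and the 300 DPI check) is hypothetical and only illustrates the shape, not ToonyStory's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types for illustration only.
Job = Dict[str, object]
Stage = Callable[[Job], Job]


@dataclass
class LoopResult:
    job: Job
    events: List[dict] = field(default_factory=list)  # one record per stage run
    failed_stage: Optional[str] = None


def run_loop(job: Job, stages: List[Tuple[str, Stage]]) -> LoopResult:
    """Run stages in order, record an event per stage, stop at the first failure."""
    result = LoopResult(job=dict(job))
    for name, stage in stages:
        try:
            result.job = stage(result.job)
            result.events.append({"stage": name, "ok": True})
        except Exception as exc:
            result.events.append({"stage": name, "ok": False, "error": str(exc)})
            result.failed_stage = name
            break
    return result


# Stub stages standing in for the real model and printer calls.
def intake(job: Job) -> Job:
    return {**job, "validated": True}


def story(job: Job) -> Job:
    return {**job, "scenes": ["scene 1", "scene 2"]}


def layout(job: Job) -> Job:
    # Assumed print threshold, chosen for the example only.
    if job.get("dpi", 0) < 300:
        raise ValueError("resolution below print minimum")
    return {**job, "print_ready": True}


STAGES = [("intake", intake), ("story", story), ("layout", layout)]
```

Calling `run_loop({"dpi": 72}, STAGES)` stops at the layout stage and reports it in `failed_stage`, which is the point: a good story upstream does not rescue a file the printer would reject.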

None of those stages is the product. The product is whether the loop, end-to-end, produces a book the user wants to hand to a kid.

That framing changes what I optimize. A better prompt is worth nothing if the printer rejects the file. A perfect illustration is worth nothing if the character doesn't look like the same character on page two.

Character consistency: the hard part

The single hardest problem in ToonyStory isn't story quality. It's keeping the same character recognizable across twelve illustrated pages.

The shape of the solution is what I'd recognize as standard agent design now, but it took several iterations to land on it:

Treat the character as a persistent piece of state, not a prompt-time afterthought
Extract a structured "character sheet" (descriptors, palette, pose tendencies, distinctive features) once, then inject it into every page-level generation
Constrain the generation prompt heavily — give the model less room to drift
Sample, score, and keep a running "best image so far" per page; only regenerate if the score is below threshold
Treat reference imagery as a tool the agent has access to, not a one-shot input
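
A minimal sketch of the "character sheet as persistent state" idea, assuming the sheet is extracted once and then rendered into every page prompt. The class name, field names, and prompt wording are hypothetical; the real sheet likely carries more (pose tendencies, reference images).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)  # frozen: the sheet must not drift between pages
class CharacterSheet:
    """Hypothetical structure; fields follow the list above."""
    name: str
    descriptors: Tuple[str, ...]
    palette: Tuple[str, ...]
    distinctive_features: Tuple[str, ...]

    def as_prompt_block(self) -> str:
        """Render the sheet as a fixed block of text reused on every page."""
        return (
            f"Character: {self.name}. "
            f"Appearance: {', '.join(self.descriptors)}. "
            f"Palette: {', '.join(self.palette)}. "
            f"Always show: {', '.join(self.distinctive_features)}."
        )


def page_prompt(sheet: CharacterSheet, scene: str) -> str:
    """Build a constrained page prompt: identical character block, then the scene."""
    return f"{sheet.as_prompt_block()} Scene: {scene}"
```

Because the character block is generated from one immutable object, page two cannot quietly describe a different character than page one; only the scene text varies.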

A lot of the work isn't in any single piece. It's in the regeneration policy: when to retry, how many times, what to swap in the prompt on retry, and when to give up and flag the page for human review.
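
One way to make a regeneration policy explicit is to write it down as data plus a small loop. This is a sketch under assumptions: the threshold of 0.8, the budget of three attempts, and the names `RegenPolicy`, `PageOutcome`, and `generate_page` are all hypothetical, and `generate` and `score` stand in for the real model call and scorer.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RegenPolicy:
    threshold: float = 0.8  # assumed "good enough" score
    max_attempts: int = 3   # assumed retry budget per page


@dataclass
class PageOutcome:
    best_image: Optional[str]
    best_score: float
    attempts: int
    needs_review: bool  # True when the budget ran out below threshold


def generate_page(
    generate: Callable[[int], str],
    score: Callable[[str], float],
    policy: RegenPolicy,
) -> PageOutcome:
    """Sample, score, keep the best image so far; retry only while below threshold."""
    best_image, best_score = None, float("-inf")
    attempts = 0
    for attempt in range(policy.max_attempts):
        attempts += 1
        # The attempt index lets the caller vary the prompt on each retry.
        image = generate(attempt)
        current = score(image)
        if current > best_score:
            best_image, best_score = image, current
        if best_score >= policy.threshold:
            break
    return PageOutcome(best_image, best_score, attempts, best_score < policy.threshold)
```

The useful property is that "good enough" and "give up" become two named numbers someone has to own, rather than behavior buried in retry code.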

That regeneration policy is where the forward-deployed PM job actually lives. Engineers can build the pieces. The PM decides what counts as good enough, when to ship vs hold, and what the user experiences when the agent fails.

Evals are the product spec

I used to write product specs the normal way: user stories, acceptance criteria, screenshots.

I don't anymore. For AI products, the evals are the spec.

An eval, for me, is a structured test case with:

An input (the kind of request a real user actually sends)
A passing rubric (what "right" looks like, ideally automated; if not, a written rubric a human grader uses)
A weight (how much this case matters relative to the rest)
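
The three parts above map onto a small data structure. This is a sketch, assuming the rubric can be automated as a function over the agent's output; `EvalCase` and `weighted_pass_rate` are hypothetical names, and a human-graded rubric would need a different shape.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class EvalCase:
    case_id: str
    input: dict                     # the kind of request a real user sends
    rubric: Callable[[dict], bool]  # automated check on the agent's output
    weight: float = 1.0             # how much this case matters


def weighted_pass_rate(cases: Iterable[EvalCase], agent: Callable[[dict], dict]) -> float:
    """Run the agent on every case and return the weight-adjusted pass rate."""
    cases = list(cases)
    total = sum(case.weight for case in cases)
    if total == 0:
        return 0.0
    passed = sum(case.weight for case in cases if case.rubric(agent(case.input)))
    return passed / total
```

Weighting matters because a failure on a case that blocks printing should cost more than a failure on a stylistic nicety.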

A good eval set tells you, before any user sees the agent, whether the latest change made things better or worse. A great eval set tells you for whom and in which scenario.

For ToonyStory, the eval set covers:

Story quality (narrative coherence, age-appropriateness, voice)
Character consistency across pages (visual + descriptive)
Print-quality compliance (bleed, resolution, color profile, no banned content)
Latency budgets per stage
Failure recovery (does the retry produce a better result, on average, than the original?)

When I make a prompt change, I rerun the eval set before it hits production. When I see a new failure mode in production, the first thing I do is add it to the eval set as a regression test — that case can never silently break again.

This is also the cleanest way I've found to manage prompt versioning. The prompt isn't the change. The change is prompt + eval delta. If the prompt moves but no eval moved with it, something is suspicious.
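
A sketch of what an "eval delta" can look like in practice, assuming each eval run is stored as a mapping from case ID to pass/fail. The function names and the ship rule (no regressions allowed) are hypothetical choices for illustration.

```python
from typing import Dict, List


def eval_delta(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> Dict[str, List[str]]:
    """Compare per-case pass/fail results from two eval runs."""
    shared = baseline.keys() & candidate.keys()
    return {
        # Passed before the change, fails after it.
        "regressed": sorted(c for c in shared if baseline[c] and not candidate[c]),
        # Failed before the change, passes after it.
        "fixed": sorted(c for c in shared if not baseline[c] and candidate[c]),
        # Cases added alongside the change, such as new regression tests.
        "new_cases": sorted(candidate.keys() - baseline.keys()),
    }


def safe_to_ship(delta: Dict[str, List[str]]) -> bool:
    """Assumed rule: block the change if any previously passing case now fails."""
    return not delta["regressed"]
```

A prompt change whose delta is empty in all three lists is the suspicious case described above: the prompt moved and nothing measurable moved with it.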

Telemetry is what makes ambiguity tractable

The other thing that forward-deployed AI work has taught me: instrument before you ship.

Every stage of the ToonyStory agent emits a structured event:

Inputs received (hashed)
Prompt template + version used
Tools called and their outcomes
Tokens in and out, latency, and cost
Output, scored by automated rubric where possible
Final user-facing outcome (kept book? regenerated? abandoned?)
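
A sketch of one such event, assuming inputs are hashed with SHA-256 over a canonical JSON encoding so raw user data never lands in the warehouse. The field names and the `build_event` helper are hypothetical; only the list above comes from the real system.

```python
import hashlib
import json
import time
from typing import List, Optional


def hash_inputs(inputs: dict) -> str:
    """Hash a canonical JSON encoding so key order does not change the digest."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_event(
    stage: str,
    inputs: dict,
    prompt_version: str,
    tokens_in: int,
    tokens_out: int,
    latency_ms: float,
    cost_usd: float,
    tool_calls: Optional[List[dict]] = None,
    score: Optional[float] = None,
    outcome: Optional[str] = None,
) -> dict:
    """Assemble one structured, JSON-serializable event for a single stage run."""
    return {
        "ts": time.time(),
        "stage": stage,
        "inputs_hash": hash_inputs(inputs),  # raw inputs are never stored
        "prompt_version": prompt_version,
        "tool_calls": tool_calls or [],
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
        "score": score,      # automated rubric score, if one exists
        "outcome": outcome,  # e.g. "kept", "regenerated", "abandoned"
    }
```

Hashing still lets you group events that share identical inputs, which is often enough to spot a repeated failure without exposing what the user typed.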

The events go to a warehouse and a small set of dashboards. When something feels off, I can answer "where in the loop is this going wrong?" in minutes, not days.
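
Answering "where in the loop is this going wrong?" is mostly a group-by. A minimal sketch, assuming each event carries a `stage` name and an `ok` flag (both hypothetical field names); in practice this would more likely be a warehouse query than Python.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def failure_rate_by_stage(events: Iterable[dict]) -> Dict[str, float]:
    """Return the fraction of failed runs for each stage seen in the events."""
    totals, failures = Counter(), Counter()
    for event in events:
        totals[event["stage"]] += 1
        if not event["ok"]:
            failures[event["stage"]] += 1
    return {stage: failures[stage] / totals[stage] for stage in totals}


def worst_stage(events: Iterable[dict]) -> Optional[str]:
    """Name the stage with the highest failure rate, or None if there are no events."""
    rates = failure_rate_by_stage(events)
    return max(rates, key=rates.get) if rates else None
```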

That's the difference between iterating on an AI product and guessing at one. Without telemetry you're working from screenshots and gut feel. With it you're working from the actual distribution of user inputs and where the agent is dropping them.

The forward-deployed move: ship narrow, then widen

The temptation when you have a powerful general-purpose LLM is to ship a general-purpose product. Resist it.

Every time I've narrowed the input domain — fewer story types, fewer age bands, fewer art styles — the agent has gotten dramatically better and the user experience has gotten cleaner. The narrow version is what the eval set covers. The narrow version is what the regeneration policy is tuned for. The narrow version is what the printer can reliably produce.
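
Narrowing can be enforced at intake with an explicit allow-list. This is a sketch only: the specific story types, age bands, and art style below are made-up placeholders, not ToonyStory's actual supported set.

```python
from typing import Dict, List, Set

# Placeholder values for illustration; the real supported set is not stated here.
SUPPORTED: Dict[str, Set[str]] = {
    "story_type": {"bedtime", "adventure"},
    "age_band": {"3-5", "6-8"},
    "art_style": {"watercolor"},
}


def out_of_scope(request: dict) -> List[str]:
    """Return the request fields whose values fall outside the supported domain."""
    return sorted(
        name for name, allowed in SUPPORTED.items()
        if request.get(name) not in allowed
    )
```

Widening then becomes a deliberate change to `SUPPORTED`, made alongside new eval cases, rather than something users discover by typing an unexpected request.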

You can always widen later. Narrowing after you've widened almost never works, because users have come to expect the broad surface area.

This is the same lesson I've watched land at Marq with enterprise AI: ship a narrow agent against a specific customer workflow, prove the outcome, then generalize. The reverse — ship a general agent and hope it works for everyone — fails almost every time.

What I take from it

The thing I'd say to anyone shipping AI agents in production:

The product is the loop, not the prompt.
The spec is the eval set, not the doc.
The job is forward deployed — sit with the user, instrument the path, fix what's actually breaking.
Ship narrow. Widen only after the narrow version is genuinely good.
Own the outcome metric, not the feature shipped. If the metric didn't move, the work isn't done.

Most of what I do — at Marq with enterprise customers, at ToonyStory with consumer users — is some variation of this loop. The customer and the agent change. The job doesn't.

If you're building something in this shape and want to compare notes, say hello.
