Temper

Continuous Measurement for Local AI Agents.

You run AI coding agents every day. Do you know which ones are getting better? Which configurations produce higher-quality work? Where quality is drifting before it becomes a problem? Temper gives you honest, continuous answers — not guesses.

Discover agents → Archive traces → Measure quality → Detect drift → Run experiments → Decide
Measurement is different from testing. Tests grade pass/fail in a single moment. Measurement grades quality over time and triggers controlled experiments when quality drops — so you can act on evidence, not intuition.

Tests check a moment.
Measurement tracks a trend.

CI tests tell you whether code compiles and whether assertions pass. They don't tell you whether your AI agent's output quality is improving, degrading, or holding steady — because quality isn't binary, and it isn't static.

Testing alone

Pass or fail, right now

A test suite runs on demand. It grades the current output against a fixed expectation. When your agent changes — different model, updated prompt, new tool — the tests don't notice until something breaks hard enough to fail an assertion. Drift accumulates invisibly until it becomes a failure.

Temper's approach

Quality over time, with experiments

Temper archives every agent trace and grades it continuously against criteria you define. When quality drops, even subtly, Temper surfaces it before it compounds. You form a hypothesis, fork a variant, test it under the same criteria, and compare. You decide on evidence, not instinct.


Three pillars.
One research platform.

Temper runs locally, discovers agents on your machine, and builds an immutable truth record of what they do. From there it measures, detects when something changes, and helps you run a controlled experiment to improve it.

01 — DISCOVER

Discover local agents and archive their traces

Temper scans your machine for running AI agents, pulls their configurations, tools, and conversation history, and stores everything in an append-only archive. No cloud sync. No manual export. Just an honest record of what happened, preserved immutably.
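To make "append-only, preserved immutably" concrete, here is a minimal sketch of a hash-chained trace archive. The record fields and class names are illustrative assumptions, not Temper's actual schema; the point is that each record commits to the digest of the one before it, so any rewrite of history is detectable.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from typing import List

# Hypothetical shape of an archived trace record (illustrative fields).
@dataclass
class TraceRecord:
    agent: str
    config: dict
    messages: list
    prev_hash: str  # links each record to the digest of the previous one

    def digest(self) -> str:
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

class Archive:
    """Append-only: records are added, never rewritten."""

    def __init__(self):
        self._records: List[TraceRecord] = []

    def append(self, agent: str, config: dict, messages: list) -> None:
        prev = self._records[-1].digest() if self._records else "genesis"
        self._records.append(TraceRecord(agent, config, messages, prev))

    def verify(self) -> bool:
        # Walk the chain; any tampered record breaks the link after it.
        prev = "genesis"
        for rec in self._records:
            if rec.prev_hash != prev:
                return False
            prev = rec.digest()
        return True
```

The hash chain is what turns a log into a truth record: editing an old trace changes its digest, which no longer matches the `prev_hash` stored in the next record.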

02 — MEASURE

Track quality over time

Define what good looks like for your agents: write evaluation criteria in plain language, and Temper grades each agent's output using a panel of independent judges, combining deterministic checks with AI-assisted scoring. Quality trends are tracked continuously, not just on demand.

03 — EXPERIMENT

Detect drift and run controlled experiments

When quality drops, Temper surfaces it automatically. You form a hypothesis — change a configuration, update a prompt, adjust a tool — fork a variant, run it against the same evaluation criteria, and compare results. Temper presents the recommendation; you make the call.
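The detect-then-compare loop above can be sketched in a few lines: watch a rolling mean of quality scores for a drop below a threshold, then compare a baseline and a variant under the same criteria. Window size, threshold, and function names are assumptions for illustration, not Temper's internals.

```python
from statistics import mean
from typing import List, Optional

def detect_drift(scores: List[float], window: int = 5,
                 threshold: float = 0.75) -> Optional[int]:
    """Return the index where the rolling mean first falls below the
    threshold, or None if quality holds. Parameters are illustrative."""
    for i in range(window, len(scores) + 1):
        if mean(scores[i - window:i]) < threshold:
            return i - 1
    return None

def compare_variants(baseline: List[float], variant: List[float]) -> str:
    # Same evaluation criteria, two configurations: compare mean scores.
    return "variant" if mean(variant) > mean(baseline) else "baseline"
```

A rolling mean catches the gradual slide that a single failing assertion never would, which is the core of the measurement-over-testing argument.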

[Chart: quality score (0-100%) tracked from day 1 to week 8; drift detected at the threshold, an experiment variant tested, and the improvement deployed. Quality tracked continuously, drift caught before it compounds, experiment resolves it.]

Temper is in early access.

We're working with technical founders and engineering leaders who run local AI agents and want measurable proof their improvements work. If that's you, we'd like to hear from you.