Temper

Continuous Measurement for Local AI Agents.

You run AI coding agents every day. Do you know which ones are getting better? Where quality is drifting before it becomes a problem? Temper gives you continuous, honest answers.

Discover agents → Archive traces → Measure quality → Detect drift → Run experiment → Decide

Measurement is different from testing. Tests grade pass/fail in a single moment. Measurement grades quality over time and prompts a controlled experiment when quality drops, so you can act on evidence, not intuition.

Tests check a moment.
Measurement tracks a trend.

CI tests tell you whether code compiles and whether assertions pass. They don't tell you whether your AI agent's output quality is improving, degrading, or holding steady — because quality isn't binary, and it isn't static.

Testing alone

Pass or fail, right now

Tests grade the current output against a fixed expectation. When your agent changes, tests don't notice until something breaks hard enough to fail an assertion.

Why this matters

Drift accumulates invisibly. A test suite runs on demand, so it misses subtle quality degradation that builds up over weeks. When your agent's model, prompt, or tools change, the gap widens before anyone sees it. By the time a test fails, the problem has already compounded.

Temper's approach

Quality over time, with experiments

Temper archives every agent trace and grades it continuously. When quality drops, even subtly, Temper surfaces it before it compounds.
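
To make "subtle" concrete, here is a minimal sketch of one way to flag drift: compare a rolling mean of recent scores against an earlier baseline. This is an illustration, not Temper's actual detection method. The function name `detect_drift`, its parameters, and the sample scores are all assumptions.

```python
from statistics import mean

def detect_drift(scores, baseline_n=10, window=5, tolerance=5):
    """Return the index of the first score at which the rolling mean has
    fallen more than `tolerance` points below the baseline mean, else None.

    Hypothetical helper for illustration; scores are percentages (0-100).
    """
    if len(scores) < baseline_n + window:
        return None  # not enough history to compare against a baseline
    baseline = mean(scores[:baseline_n])
    for end in range(baseline_n + window, len(scores) + 1):
        recent = mean(scores[end - window:end])
        if baseline - recent > tolerance:
            return end - 1
    return None

# Made-up scores: ten stable traces, then a slow one-point-per-trace decline.
scores = [90] * 10 + [89, 88, 87, 86, 85, 84, 83, 82]
PASS_THRESHOLD = 80  # a pass/fail bar that no single trace ever drops below

drift_index = detect_drift(scores)
print(drift_index, min(scores) >= PASS_THRESHOLD)
```

In this made-up series every trace would still pass a fixed 80-point test, yet the rolling mean ends up more than five points below the baseline, so the trend is flagged where a single-moment check would stay silent.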

How it works in practice

You form a hypothesis, fork a variant, run it against the same evaluation criteria, and compare results. Decisions rest on evidence, not instinct. The experiment record lives in Lore, so the next similar problem starts from what you already learned.
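
As a rough sketch of the compare step, assume the baseline and the variant are scored on the same set of tasks, so results can be paired. The `compare` function, the `ExperimentResult` type, and the scores below are hypothetical and are not part of Temper or Lore. The sign test is one simple choice of statistic among many.

```python
from dataclasses import dataclass
from math import comb
from statistics import mean

@dataclass
class ExperimentResult:
    mean_delta: float  # average score change (variant minus baseline)
    wins: int          # tasks where the variant scored higher
    losses: int        # tasks where the variant scored lower
    p_value: float     # one-sided exact sign test

def compare(baseline, variant):
    """Paired comparison of two score lists measured on the same tasks."""
    if len(baseline) != len(variant):
        raise ValueError("baseline and variant must cover the same tasks")
    deltas = [v - b for b, v in zip(baseline, variant)]
    wins = sum(d > 0 for d in deltas)
    losses = sum(d < 0 for d in deltas)
    n = wins + losses  # ties carry no information for a sign test
    # Probability of at least `wins` wins if each task were a coin flip.
    p = sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n if n else 1.0
    return ExperimentResult(mean(deltas), wins, losses, p)

# Made-up percentage scores for eight shared evaluation tasks.
baseline_scores = [82, 80, 85, 78, 81, 83, 79, 84]
variant_scores = [88, 86, 87, 85, 84, 89, 83, 86]

result = compare(baseline_scores, variant_scores)
print(result)
```

Pairing by task matters here: because both runs face identical criteria, a difference in scores can be attributed to the variant rather than to an easier or harder set of inputs.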

Three pillars.
One research platform.

Temper runs locally, discovers agents on your machine, and builds an immutable truth record of what they do. From there it measures, detects when something changes, and helps you run a controlled experiment to improve it.

01 — DISCOVER

Discover local agents and archive their traces

Temper scans your machine for running AI agents, pulls their configurations and conversation history, and stores everything in an append-only archive. No cloud sync. An honest record, preserved immutably.
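
One common way to make an append-only record tamper-evident is to chain each entry to the hash of the one before it. The sketch below illustrates that general technique in memory. It is not a description of Temper's storage format, and the `TraceArchive` class and trace fields are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

class TraceArchive:
    """Hypothetical in-memory, hash-chained, append-only log."""

    def __init__(self):
        self._entries = []

    def append(self, trace: dict) -> str:
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        payload = json.dumps(trace, sort_keys=True)  # stable serialization
        digest = hashlib.sha256((prev + payload).encode("utf-8")).hexdigest()
        self._entries.append({"prev": prev, "payload": payload, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = GENESIS
        for entry in self._entries:
            expected = hashlib.sha256(
                (prev + entry["payload"]).encode("utf-8")
            ).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True

archive = TraceArchive()
archive.append({"agent": "example-agent", "prompt": "fix bug", "output": "patch A"})
archive.append({"agent": "example-agent", "prompt": "add test", "output": "patch B"})
print(archive.verify())
```

Because each hash depends on everything before it, rewriting an old trace changes every later hash, which is what lets a reader check that the record has not been quietly edited.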

02 — MEASURE

Run quality tests over time

Define what good looks like for your agents. Temper grades each agent's output using a panel of independent judges that combines deterministic checks with AI-assisted scoring. Quality is tracked continuously, not just on demand.
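
A judge panel can be pictured as a list of independent scoring functions whose results are combined. The sketch below is an assumption about the general shape, not Temper's interface. The judge names are hypothetical, and `stub_model_judge` is a stand-in where a real AI-assisted scorer would call a model.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Judge = Callable[[str], float]  # each judge maps an output to a score in [0, 1]

def non_empty(output: str) -> float:
    """Deterministic check: the agent produced something."""
    return 1.0 if output.strip() else 0.0

def no_todo_markers(output: str) -> float:
    """Deterministic check: no unfinished TODO markers were left behind."""
    return 0.0 if "TODO" in output else 1.0

def stub_model_judge(output: str) -> float:
    """Placeholder for AI-assisted scoring; a word count stands in for a model."""
    return min(len(output.split()) / 10, 1.0)

def grade(output: str, judges: List[Judge]) -> Tuple[float, Dict[str, float]]:
    """Run every judge independently and average their scores."""
    scores = {judge.__name__: judge(output) for judge in judges}
    return mean(list(scores.values())), scores

panel = [non_empty, no_todo_markers, stub_model_judge]
overall, per_judge = grade("def add(a, b): return a + b", panel)
print(round(overall, 2), per_judge)
```

Keeping the per-judge scores alongside the average makes it possible to see which criterion moved when the overall number changes.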

03 — EXPERIMENT

Detect drift and run controlled experiments

When quality drops, Temper surfaces it automatically. You form a hypothesis, fork a variant, run it against the same criteria, and compare results. Temper presents the recommendation; you make the call.

Stable: Quality holds steady. Temper tracks traces continuously.
Drifting: Subtle decline begins. No single trace fails outright.
Detected: Temper surfaces the drift. You form a hypothesis and fork a variant.
Resolved: Experiment confirms the improvement. Deploy with evidence.

[Figure: quality score (0% to 100%) from day 1 to week 8, marking the threshold, drift detected, experiment variant, and improvement deployed. Quality tracked continuously; drift caught before it compounds; experiment resolves it.]

Temper is in early access.

We're working with technical founders and engineering leaders who run local AI agents and want measurable proof their improvements work. If that's you, we'd like to hear from you.