When you are building with coding agents, model comparisons are only useful if they measure the thing your workflow actually needs.
A model can sound confident and still write the wrong file. It can produce a clean explanation while missing the required action schema. It can choose the right file but generate stale content. That is why evaluating the providers behind Mythos Router should focus on verified file behavior, not just text quality.
The model proposes the edit. SWD verifies whether the real filesystem matches the claim.
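To make the claim-versus-disk idea concrete, here is a minimal Python sketch. It is not how SWD is actually implemented; `FileClaim`, `sha256_of`, and `verify_claim` are hypothetical names, and comparing SHA-256 digests is an assumed verification strategy chosen only for illustration.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class FileClaim:
    """Hypothetical record of what a model says it wrote (not a Mythos type)."""
    path: Path
    expected_sha256: str


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_claim(claim: FileClaim) -> bool:
    """Return True only if the file exists and its bytes match the claimed digest."""
    if not claim.path.is_file():
        return False
    return sha256_of(claim.path.read_bytes()) == claim.expected_sha256
```

The point of the sketch is that the check reads the file from disk rather than trusting the model's description of what it wrote.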
What to Measure
A practical provider evaluation should track:
| Signal | Why it matters |
|---|---|
| Target accuracy | Did the model choose the right files? |
| Action schema adherence | Did it emit valid FILE_ACTION or structured JSON actions? |
| SWD correction behavior | When verification failed, did the next attempt fix the real issue? |
| Latency | How long did the session take end to end? |
| Token cost | What did the verified result cost? |
| Receipt outcome | Did the final receipt verify against disk later? |
That gives you a more honest picture than a generic coding benchmark alone.
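One way to keep these signals comparable is to record them in a single structure per run. The sketch below is an assumption about how you might log results yourself; `RunRecord` and its field names are hypothetical and do not come from Mythos, and the Jaccard overlap is just one reasonable way to score target accuracy.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One provider run, scored on the six signals (hypothetical field names)."""
    provider: str
    expected_files: set[str]   # files the task should have touched
    touched_files: set[str]    # files the model actually targeted
    schema_valid: bool         # were the emitted actions well-formed?
    correction_turns: int      # retries after a failed verification
    latency_s: float           # end-to-end session time in seconds
    total_tokens: int          # usage spent to reach the final result
    receipt_verified: bool     # did the receipt still match disk later?

    def target_accuracy(self) -> float:
        """Jaccard overlap of expected and touched files (1.0 means exact match)."""
        union = self.expected_files | self.touched_files
        if not union:
            return 1.0
        return len(self.expected_files & self.touched_files) / len(union)
```

Scoring target accuracy as a set overlap penalizes both missed files and unnecessary edits, which a simple hit count would not.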
Run the Same Task Across Providers
Mythos lets you force a provider when you want to compare behavior directly:
mythos run --provider anthropic --file TASK.md
mythos run --provider deepseek --file TASK.md
mythos run --provider openai --file TASK.md
Keep the task file stable. Compare the generated actions, SWD verification result, correction turns, receipts, and total usage.
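If you repeat this comparison often, a small driver script keeps the runs identical. The following Python sketch only wraps the `mythos run --provider ... --file ...` invocation shown above; `build_command`, `default_runner`, and `run_all` are hypothetical helper names, and the meaning of the CLI's exit code is not documented here, so it is recorded as raw data rather than interpreted.

```python
import subprocess
import time

PROVIDERS = ("anthropic", "deepseek", "openai")  # the three providers named above


def build_command(provider: str, task_file: str = "TASK.md") -> list[str]:
    """Build the same `mythos run` invocation for one provider."""
    return ["mythos", "run", "--provider", provider, "--file", task_file]


def default_runner(cmd: list[str]) -> int:
    """Run the command and return its exit code (requires `mythos` on PATH)."""
    return subprocess.run(cmd, check=False).returncode


def run_all(providers=PROVIDERS, task_file="TASK.md", runner=default_runner):
    """Run the identical task once per provider, recording exit code and wall time."""
    results = {}
    for provider in providers:
        start = time.perf_counter()
        exit_code = runner(build_command(provider, task_file))
        results[provider] = {
            "exit_code": exit_code,
            "latency_s": time.perf_counter() - start,
        }
    return results
```

The `runner` parameter is injectable so the loop can be exercised without a real CLI. Wall-clock time measured this way includes process startup, so treat it as a rough end-to-end figure.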
Inspect the Result
After a file-changing run, inspect the receipt and check it for drift:
mythos receipts show latest
mythos receipts verify latest
The useful question is not "which model sounded smartest?" It is "which provider reached a verified filesystem state with the fewest corrections, acceptable latency, and acceptable cost?"
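That question can be turned into a simple selection rule. The Python sketch below is one possible reading of it, not a Mythos feature: `Outcome` and `pick_provider` are hypothetical names, the latency and token budgets are thresholds you would set yourself, and the tie-breaking order (corrections, then latency, then tokens) is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Outcome:
    """Summary of one provider run (hypothetical fields, filled in by you)."""
    provider: str
    verified: bool       # did the final receipt verify against disk?
    corrections: int     # correction turns needed
    latency_s: float     # end-to-end seconds
    total_tokens: int    # total usage


def pick_provider(
    outcomes: Iterable[Outcome],
    max_latency_s: float,
    max_tokens: int,
) -> Optional[Outcome]:
    """Keep verified runs within budget, then prefer the fewest corrections."""
    eligible = [
        o for o in outcomes
        if o.verified and o.latency_s <= max_latency_s and o.total_tokens <= max_tokens
    ]
    if not eligible:
        return None  # no provider reached a verified state within budget
    return min(eligible, key=lambda o: (o.corrections, o.latency_s, o.total_tokens))
```

Verification acts as a hard filter rather than a weighted score, so a fast, cheap run that never verified cannot win.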
External Agents Count Too
If you are evaluating a separate agent framework, put Mythos behind it as the execution boundary:
mythos mcp
cat actions.json | mythos swd apply --stdin --json
In that path, Mythos does not call a model. The external agent brings its own runtime and model key; Mythos validates and applies the submitted file actions through SWD.
The Practical Standard
Different providers can be good at different tasks. Mythos does not need every provider to behave identically. It needs every file edit to prove itself against the real repository.
That is the durable evaluation layer: provider flexibility on one side, filesystem truth on the other.