Whitepaper v1.1 — May 2026
A decentralized validation layer for agent authority. Agents should not receive control because they complete a demo, produce fluent text, or pass a benchmark built around clean instructions. They should earn authority through replayable evidence of decision quality.
Autonomy is an authority problem. Once an agent can move capital, approve vendors, execute trades, touch customer data, or trigger downstream systems, task completion is no longer enough. The question becomes: should this system have been allowed to act?
Beneat measures that question as decision quality. A decision is not scored only by its result. It is scored by the state the agent was allowed to see, the policy it was required to obey, the risks it preserved, the constraints it respected, and the trace it left behind.
Beneat did not begin as an agent authority thesis. It began as an internal terminal for surviving live markets. We were less interested in finding one more signal than in constraining the failures that actually kill operators: bad sizing, broken risk discipline, fatigue, overconfidence, and state degradation under pressure.
TQS was the first narrow implementation of this idea inside the Beneat terminal. Markets were useful because mistakes show up quickly when capital is moving. DQS generalizes the same principle beyond trading: agents earn authority through proof, not permission.
Most agent evaluations reward completion. The agent got the answer, found the file, booked the flight, sent the email, or produced the plan. That is useful. It is not enough for systems that can act.
Real authority fails in smaller places: an ungrounded assumption, a blocked vendor, a stale balance, a missing escalation, a forged invoice, an unsafe order size, a rule the model treats as advice. These failures can look productive in a transcript. The damage appears later.
| Failure | What a naive eval sees | What authority needs |
|---|---|---|
| Ungrounded action | Agent moved fast | Action only used validator-visible state |
| Policy violation | Task completed | Instruction obeyed hard constraints |
| Fraud acceptance | Vendor got paid | Recipient matched trusted registry |
| No escalation | Agent stayed autonomous | Agent asked for approval when authority ended |
| Poor trace | Output looked reasonable | Replay can reconstruct every decision |
Completion is not proof. A system can finish the task and still fail the control surface.
A score that affects access, capital, or authority cannot be issued by a single interested party. Centralized scoring is useful for research. It is weak as a trust boundary.
If the same company defines the task, computes the score, operates the product, and benefits from the score, the conflict is structural. The answer is not branding. The answer is independent replay.
Beneat started with live markets because markets punish bad judgment quickly and record it densely. Revenge trading, over-sizing, panic exits, cold-streak escalation, missed stops, and broken risk controls are not abstract failures. They show up in orders, balances, timestamps, fills, and drawdowns.
We did not see trading as a search for one secret edge. The harder problem was probability, risk management, trading psychology, and operator state colliding under pressure.
The first product was an internal terminal built to enforce risk before orders hit the market, track behavior through the session, and keep the operator liquid long enough for edge to matter. The Beneat terminal gave us a centralized testbed. It joined execution logs, market state, behavioral detectors, risk gates, operator-state signals, and agent actions in one environment. TQS emerged from that work: a Trader Quality Score for measuring trading process beyond headline P&L.
That distinction matters. A centralized score can help us learn. It should not be the permanent source of authority. If a score changes who gets capital, who gets access, or how much autonomy an agent receives, the score must be reproducible outside Beneat.
TQS is the market-specific branch. DQS is the general rule.
The shared idea is simple: do not judge intelligence from the final answer alone. Judge the decision trace. What did the operator know? What was hidden? What rules applied? What action was taken? What changed after the action? Could another party replay the same episode and reach the same score?
Once agent trading became credible, the same control problem reappeared in another form. Humans fail through fatigue, revenge, overconfidence, and sizing drift. Agents fail through ungrounded action, skipped escalation, unsafe autonomy, and fluent policy violations. The surface changed; the authority problem did not.
| Layer | Domain | Purpose |
|---|---|---|
| DQS | General agent work | Score decision reliability under constraints |
| TQS | Markets | Score trading process, risk discipline, and execution behavior |
| BioSync | Human operators | Add operator state as a signal, not as a wellness product |
TQS was not abandoned. It became a domain implementation under a larger decision-quality stack.
DQS scores whether an agent made the right kind of decision under the authority it was given. The score is built from components that can be replayed from an episode trace.
| Component | What it measures |
|---|---|
| Policy obedience | Did the action obey hard constraints? |
| State grounding | Did the agent act only on facts available to it? |
| Capital preservation | Did it avoid unnecessary loss or exposure? |
| Fraud resistance | Did it reject spoofed or untrusted counterparties? |
| Escalation | Did it stop when authority ended? |
| Traceability | Can the episode be replayed without trusting the agent? |
The validator must own the facts that make cheating hard. If the miner controls the scenario, the hidden facts, and the grading path, the score is theater.
In DQS, validators generate or custody the state needed for replay: hidden seeds, scenario commitments, policy graphs, vendor registries, budget ledgers, market snapshots, and audit pointers. The miner receives only the observation it is allowed to use.
| Validator owns | Miner sees |
|---|---|
| Hidden seed | Scenario observation |
| Policy graph | Relevant policy excerpts |
| Trusted registry | Visible vendor facts |
| Budget ledger | Allowed account state |
| Audit route | Signed result after scoring |
Miners should not win because they produce confident language. They should win because their actions remain valid when the validator replays the episode.
The normal reward path should be cheap. DQS should not require an LLM judge for every episode. When the task is structured, validators can score transitions directly: action, policy, state change, violation set, final score.
| Miner pattern | Validator result |
|---|---|
| Pays trusted vendor under budget | Clean pass |
| Pays cheapest visible impostor | Fraud and grounding penalty |
| Acts on hidden quote | Ungrounded action |
| Skips approval threshold | Escalation failure |
| Leaves incomplete trace | Audit penalty |
Vendor Payment Control is a useful reference task because it is deliberately narrow. Narrow is useful. A payment task has policy, money movement, fraud risk, authority thresholds, and a final state that can be replayed.
The validator owns a policy graph, vendor registry, invoice facts, and budget ledger. The miner receives a partial observation and chooses actions. The winner is not the agent that spends the least. The winner is the agent that preserves control while completing the job.
This is not procurement software theater. It is a compact control surface for agent authority.
Centralized scoring helped Beneat build the first measurement system. It cannot be the final trust layer. A score that grants authority must be recomputable by parties that do not answer to Beneat.
Bittensor is a natural candidate because it gives miners and validators a live incentive system. But the paper does not depend on jargon. The rule is simpler: miners act, validators replay, scores become portable.
| Centralized research | Decentralized validation |
|---|---|
| Beneat computes score | Validators recompute score |
| Internal traces | Replay bundles |
| Product trust | Protocol trust |
| Fast iteration | Independent verification |
TQS remains the market-specific score. It measures whether a trader or trading agent produces quality decisions under market pressure. The raw result is not enough. A lucky tail event and a disciplined edge can end with the same P&L. They should not receive the same score.
The terminal already detects patterns that damage execution quality: revenge trading, FOMO, panic exits, overtrading, tilt, overconfidence, and patience failures. These are not personality labels. They are patterns in orders and timing.
The equity curve is the artifact. It records entries, exits, sizing, timing, risk limits, and recovery after loss. TQS uses it to distinguish compounding from lottery-ticket performance.
The same principle applies before execution. A terminal should not only display markets. It should know when an order violates risk, when an agent is exceeding authority, and when a human operator is no longer in a state to size up.
BioSync extends this into operator state. It was not added as a wellness layer or lifestyle garnish. It followed from the original claim that the operator belongs inside the control surface. For humans, readiness, reaction, fatigue, recovery, and session history become part of the constraint layer. For agents, the equivalent slot is behavioral telemetry: what changed, what repeated, what broke, and what should be constrained next.
| Phase | Focus |
|---|---|
| 01 | Reference authority task: validator-owned state, replay bundle, score vector, certificate |
| 02 | Validator-compatible runtime for structured authority tasks |
| 03 | Additional domains beyond vendor payment control |
| 04 | Decentralized TQS for market traces and trading agents |
| 05 | Portable authority scores across agents, operators, and domains |
Beneat started with trading because markets expose judgment and punish bad control early. What began as survival-first infrastructure for constraining risk and behavior became the first scoring branch: TQS. It proved that process quality can be measured from traces instead of inferred from outcomes.
DQS is the broader frame. It applies the same discipline to autonomous work: policy, state, authority, risk, and replay. The score is not a claim about intelligence. It is a record of behavior under constraint.
Permission should come after proof. Not before it.