Reliquary is Subnet 81: a decentralized reinforcement-learning training market. Miners compete to find the prompts at a model's learning frontier; the validator cryptographically verifies every rollout and runs the GRPO step on the survivors. The result of putting prompt selection on a market, at identical compute:
What we built
Miners generate rollouts on problems with verifiable rewards; the validator runs GRPO on the accepted submissions and publishes updated checkpoints to Hugging Face. Every rollout is verified through GRAIL proofs, which let the validator confirm a generation was produced by the correct model weights. That is the infrastructure — the contribution is in how prompts are selected for training.
Prompt selection is the dominant lever
DAPO (Yu et al., 2025) showed that of its four techniques, Dynamic Sampling, which discards rollout groups with zero reward variance, was the largest single gain, exceeding the other three combined. The reason is structural: in group-relative policy optimization each completion's advantage is its reward measured against the group mean, so the gradient signal scales with the reward spread across the group. When every completion succeeds, or every one fails, the advantages collapse to zero. Only prompts at the frontier carry signal.
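The collapse is easy to see numerically. The sketch below assumes the common GRPO normalization (reward minus group mean, divided by group standard deviation); the source does not state the exact advantage formula Reliquary uses, and `group_advantages` and `has_signal` are illustrative names.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages, assuming mean/std normalization.

    eps keeps the division defined when the group has zero spread.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def has_signal(rewards):
    """A group carries gradient signal only if its rewards are not all equal."""
    return max(rewards) != min(rewards)

all_pass = group_advantages([1.0, 1.0, 1.0, 1.0])  # every completion succeeds
all_fail = group_advantages([0.0, 0.0, 0.0, 0.0])  # every completion fails
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])     # a frontier prompt

print(all_pass, all_fail, mixed)
```

Both uniform groups produce all-zero advantages, so they contribute nothing to the update; only the mixed group does. A dynamic-sampling filter amounts to keeping the groups for which `has_signal` is true.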
Only 1–15% of prompts occupy the learning zone at any checkpoint. A reactive filter discards the rest after paying to generate them.
From reactive filter to competitive market
DAPO reacts: generate a group, measure variance, discard below threshold. As the policy improves the frontier shrinks, the rejection rate climbs, and a growing share of inference is spent on groups that are thrown away. Reliquary replaces the filter with a market — each window needs N rollout groups; miners independently select a prompt, generate a group, and race to submit. The first N valid, distinct submissions seal the window.
Reactive filter: can only react, and waste grows as the frontier narrows.
Market: anticipation gets more valuable as the zone contracts.
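The window rule described above (the first N valid, distinct submissions seal the window) can be sketched as follows. This is a minimal illustration, not Reliquary's code: `Submission`, `seal_window`, and the `valid` flag are hypothetical, `valid` stands in for proof verification and any other checks, and "distinct" is assumed here to mean distinct prompts.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Submission:
    miner: str
    prompt_id: int
    valid: bool  # placeholder for the validator's verification result

def seal_window(submissions, n_slots):
    """Accept the first n_slots valid submissions with distinct prompts.

    `submissions` is assumed to be ordered by arrival time.
    """
    accepted, seen_prompts = [], set()
    for sub in submissions:
        if not sub.valid or sub.prompt_id in seen_prompts:
            continue  # rejected: failed verification or duplicate prompt
        seen_prompts.add(sub.prompt_id)
        accepted.append(sub)
        if len(accepted) == n_slots:
            break  # window sealed; later arrivals get nothing
    return accepted

arrivals = [
    Submission("miner_a", 7, True),
    Submission("miner_b", 7, True),   # same prompt as miner_a, arrives later
    Submission("miner_c", 3, False),  # fails verification
    Submission("miner_d", 9, True),
    Submission("miner_e", 4, True),   # arrives after the window is full
]
winners = seal_window(arrivals, n_slots=2)
print([w.miner for w in winners])
```

Because slots go in arrival order, a miner that picks the same prompt as a faster competitor, or that arrives after the window fills, has spent its inference for nothing.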
Two pressures compound. Speed: slots are scarce and awarded in arrival order. Selection accuracy: with only 1–15% in-zone, a miner picking at random wastes 85–99% of its budget; a coarse difficulty estimator lifting hit-rate 5%→30% cuts the inference spent per in-zone group 6× (from 20 groups generated per hit to about 3.3). The supervision is free: every submission is a labeled (prompt, checkpoint, accepted?) point, so the incentive gradient bends toward building difficulty models.
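The arithmetic behind the 5%→30% example can be checked directly. Assuming each selected prompt lands in the learning zone independently with probability equal to the miner's hit rate, the expected number of groups generated per in-zone group is 1/hit_rate; the function names below are illustrative.

```python
def groups_per_hit(hit_rate):
    """Expected groups generated per in-zone group (geometric expectation)."""
    return 1.0 / hit_rate

def wasted_fraction(hit_rate):
    """Share of generated groups that fall outside the learning zone."""
    return 1.0 - hit_rate

random_pick = groups_per_hit(0.05)       # about 20 groups per useful group
coarse_estimator = groups_per_hit(0.30)  # about 3.3 groups per useful group
speedup = random_pick / coarse_estimator

print(random_pick, coarse_estimator, speedup)
```

The 6× figure is the ratio of total inference per useful group. The wasted share of the budget itself moves from 95% to 70%.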
The controlled experiment
To isolate the market, everything is held fixed except the source of rollouts: same base model (Qwen3-4B-Instruct), same training function (imported from the same commit), same hyperparameters, dataset, and step count (300). Reliquary's rollouts come from competing miners selecting at the frontier; the baseline's come from one trainer selecting uniformly at random. Both checkpoints were scored on the same held-out math set with the same seed — a paired comparison.
| | base | vanilla GRPO | Reliquary |
|---|---|---|---|
| pass@1 | 0.330 | 0.470 | 0.610 |
| pass@4 | 0.700 | 0.730 | 0.730 |
| truncation | 80.5% | 73.2% | 62.5% |
The Reliquary arm gains +14 pp pass@1 over vanilla GRPO at identical step count (0.610 vs 0.470; ≈2.8 SE from zero, p ≈ 0.01). pass@4 is unchanged (0.73 vs 0.73): training sharpens the single-attempt answer rather than expanding the solvable set. Truncation falls to 62.5%, which suggests the market-trained policy reaches its answers in fewer tokens.
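For readers who want to reproduce this kind of paired comparison, one simple way to express a difference in standard errors is sketched below. The source does not say which test was used or how large the held-out set is, so the procedure (a per-problem paired difference) and the eight toy outcomes are assumptions for illustration only; they are not the experiment's data.

```python
import math
import statistics

def paired_difference(baseline, treatment):
    """Mean per-problem difference, its standard error, and their ratio.

    Both lists hold 0/1 correctness for the same problems in the same order.
    """
    diffs = [t - b for b, t in zip(baseline, treatment)]
    mean = statistics.fmean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean, se, mean / se

# Toy outcomes, invented for this example.
vanilla = [0, 1, 0, 1, 0, 0, 1, 0]
market = [1, 1, 0, 1, 1, 0, 1, 1]

mean_diff, se, z = paired_difference(vanilla, market)
print(f"diff={mean_diff:.3f} se={se:.3f} z={z:.2f}")
```

Pairing on the same problems and seed removes per-problem difficulty from the noise, which is why a paired design can resolve a gap that two independent samples of the same size might not.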
Why ×2 is a lower bound
Read the ×2 as the conservative figure from FIG.05 at the regime this experiment ran in. At ~15% in-zone a reactive filter wastes ~85% of its inference, while an informed market holds waste near ~40%, so ≈60% of the market's rollouts are productive against ≈15% for the filter. On those numbers the market delivers about four times the useful rollouts per unit of compute, which is why ×2 is a lower bound, and the gap widens from here.
This was measured at 300 steps, early, when the zone is still wide (~15%). A reactive filter wastes ~85% there: costly, not catastrophic. As training proceeds the zone narrows; at 1% in-zone a reactive filter discards 99% of its compute. A market does not degrade the same way: miners adapt their estimators to the moving frontier and hold a high hit-rate as the zone contracts. So the efficiency multiplier is not a constant; it grows with training progress.
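How the multiplier scales can be sketched with one line of arithmetic. The example assumes, purely for illustration, that uninformed selection is productive at exactly the in-zone fraction and that the market holds its productive share at the ≈60% quoted for FIG.05 as the zone narrows; whether miners actually sustain that rate is an empirical question the source does not settle. `efficiency_multiplier` is an illustrative name.

```python
def efficiency_multiplier(in_zone_fraction, market_hit_rate):
    """Useful rollouts per unit of compute: market relative to uninformed selection."""
    return market_hit_rate / in_zone_fraction

# Assumed constant market hit rate; the zone shrinks as training proceeds.
for zone in (0.15, 0.05, 0.01):
    ratio = efficiency_multiplier(zone, market_hit_rate=0.60)
    print(f"in-zone {zone:.0%}: multiplier ~{ratio:.0f}x")
```

Even if the market's hit rate fell as the zone contracted, the multiplier would keep growing as long as that rate declined more slowly than the in-zone fraction.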
Vision
The architecture isn't tied to a model or domain — it's a general-purpose RL inference layer. The goal is a product: a client brings a model and a set of environments, and Reliquary handles the RL inference — rollout generation, prompt selection, and verification — returning an optimized training signal without the client building inference infrastructure.
In: a model + environments. Out: an optimized training signal.
The gap between naive sampling and informed selection by experienced miners is where both the network and its clients capture value.