pass@1 by training arm

Reliquary is Subnet 81: a decentralized reinforcement-learning training market. Miners compete to find the prompts at a model's learning frontier; the validator cryptographically verifies every rollout and runs the GRPO step on the survivors. The result of putting prompt selection on a market, at identical compute:

FIG.01 · held-out pass@1 · 300 steps · paired eval · higher is better

+14 pp · pass@1 vs vanilla GRPO

p ≈ 0.01 · paired test · 2.8σ

0.61 · Reliquary pass@1

62.5% · truncation (was 80.5%)

What we built

Miners generate rollouts on problems with verifiable rewards; the validator runs GRPO on the accepted submissions and publishes updated checkpoints to Hugging Face. Every rollout is verified through GRAIL proofs, which let the validator confirm a generation was produced by the correct model weights. That is the infrastructure — the contribution is in how prompts are selected for training.

FIG.02 · the verified training loop

Prompt selection is the dominant lever

DAPO (Yu et al., 2025) showed that of its four techniques, Dynamic Sampling (discarding rollout groups with zero reward variance) was the largest single gain, exceeding the other three combined. The reason is structural: in group-relative policy optimization each rollout's advantage is its reward measured against the group mean, so the gradient signal scales with the reward variance across the group. When every completion succeeds, or every one fails, the advantages collapse to zero. Only prompts at the frontier carry signal.
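The collapse is easy to see numerically. The sketch below standardizes rewards within one group in the usual group-relative way; the function name `group_advantages` and the `eps` stabilizer are illustrative assumptions, not taken from the Reliquary codebase.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one rollout group (group-relative advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when the group has zero variance
    return [(r - mu) / (sigma + eps) for r in rewards]

all_pass = group_advantages([1.0, 1.0, 1.0, 1.0])  # too easy: every advantage is 0.0
all_fail = group_advantages([0.0, 0.0, 0.0, 0.0])  # too hard: every advantage is 0.0
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])     # frontier: signs follow the reward

print(all_pass, all_fail, mixed)
```

Only the mixed group produces nonzero advantages, so only it contributes to the policy gradient.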

FIG.03 · difficulty distribution · gradient signal ∝ reward variance

Only 1–15% of prompts occupy the learning zone at any checkpoint. A reactive filter discards the rest after paying to generate them.
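A minimal simulation of a reactive filter makes the cost visible. Everything here is a toy under stated assumptions: the prompt pool, its per-prompt pass rates, the group size of 8, and the name `reactive_filter` are hypothetical, and rollouts are replaced by coin flips.

```python
import random

def reactive_filter(pass_rates, group_size=8, seed=0):
    """Generate one rollout group per prompt, then keep only groups with mixed rewards."""
    rng = random.Random(seed)
    kept, generated = [], 0
    for idx, p in enumerate(pass_rates):
        rewards = [1 if rng.random() < p else 0 for _ in range(group_size)]
        generated += group_size            # inference is paid before the filter runs
        if 0 < sum(rewards) < group_size:  # nonzero reward variance
            kept.append(idx)
    wasted = 1 - len(kept) * group_size / generated
    return kept, wasted

# Hypothetical pool: 90 prompts the policy always fails or always solves, 10 at the frontier.
pool = [0.0] * 45 + [1.0] * 45 + [0.5] * 10
kept, wasted = reactive_filter(pool)
print(f"kept {len(kept)} of {len(pool)} groups, wasted {wasted:.0%} of inference")
```

With 10% of the pool in-zone, at least 90% of the generated rollouts are discarded no matter how the coin flips fall.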

From reactive filter to competitive market

DAPO reacts: generate a group, measure variance, discard below threshold. As the policy improves the frontier shrinks, the rejection rate climbs, and a growing share of inference is spent on groups that are thrown away. Reliquary replaces the filter with a market — each window needs N rollout groups; miners independently select a prompt, generate a group, and race to submit. The first N valid, distinct submissions seal the window.
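The sealing rule can be sketched as a first-come filter over arrivals. This is an illustration only: the `Submission` type, the `valid` flag (standing in for proof verification and any acceptance checks), and the reading of "distinct" as "distinct prompts" are assumptions, not the validator's actual implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Submission:
    miner: str
    prompt_id: int
    valid: bool  # assumed stand-in for verification and acceptance checks

def seal_window(arrivals, n_slots):
    """Accept the first n_slots valid submissions on distinct prompts, in arrival order."""
    accepted, seen_prompts = [], set()
    for sub in arrivals:
        if len(accepted) == n_slots:
            break  # window is sealed; later arrivals earn nothing
        if not sub.valid or sub.prompt_id in seen_prompts:
            continue
        seen_prompts.add(sub.prompt_id)
        accepted.append(sub)
    return accepted

arrivals = [
    Submission("a", 7, True),
    Submission("b", 7, True),   # duplicate prompt, arrives second
    Submission("c", 3, False),  # fails validation
    Submission("d", 9, True),
    Submission("e", 4, True),   # valid, but the window is already full
]
window = seal_window(arrivals, n_slots=2)
print([s.miner for s in window])
```

Miners "a" and "d" fill the two slots; "b" loses on arrival order, "c" on validity, and "e" on speed.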

FIG.04 · one node that reacts vs a market that anticipates

reactive filter · DAPO
1. generate a rollout group
2. measure reward variance
3. discard if below threshold
85–99% of inference discarded
can only react, and waste grows as the frontier narrows.

competitive market · Reliquary
1. miners anticipate the frontier
2. race on speed + selection accuracy
3. first N valid submissions seal the window
6× less waste at 5%→30% hit rate
anticipation gets more valuable as the zone contracts.

Two pressures compound. Speed: slots are scarce and awarded in arrival order. Selection accuracy: with only 1–15% of prompts in-zone, a miner picking at random wastes 85–99% of its budget, while a coarse difficulty estimator that lifts hit-rate from 5% to 30% cuts the inference spent per in-zone group 6× (from 20 groups per hit to about 3.3). The supervision is free: every submission is a labeled (prompt, checkpoint, accepted?) point, so the incentive gradient bends toward building difficulty models.
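One way to read the 6× figure is as the ratio of groups a miner must generate per in-zone hit. The sketch assumes each pick lands in-zone independently with probability equal to the hit rate, so the expected number of groups per hit is 1/p; the function names are illustrative.

```python
def groups_per_hit(hit_rate):
    """Expected groups generated per in-zone group, assuming independent picks."""
    return 1.0 / hit_rate

def discarded_per_hit(hit_rate):
    """Expected groups thrown away for each group that lands in-zone."""
    return groups_per_hit(hit_rate) - 1.0

random_pick = groups_per_hit(0.05)   # about 20 groups per hit
estimator = groups_per_hit(0.30)     # about 3.3 groups per hit
compute_ratio = random_pick / estimator
discard_ratio = discarded_per_hit(0.05) / discarded_per_hit(0.30)

print(f"compute per hit: {compute_ratio:.1f}x lower")
print(f"discarded groups per hit: {discard_ratio:.1f}x lower")
```

On total compute per hit the improvement is 6×; counting only the discarded groups it is somewhat larger, about 8×.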

The controlled experiment

To isolate the market, everything is held fixed except the source of rollouts: same base model (Qwen3-4B-Instruct), same training function (imported from the same commit), same hyperparameters, dataset, and step count (300). Reliquary's rollouts come from competing miners selecting at the frontier; the baseline's come from one trainer selecting uniformly at random. Both checkpoints were scored on the same held-out math set with the same seed — a paired comparison.

metric        base     vanilla GRPO   Reliquary
pass@1        0.330    0.470          0.610
pass@4        0.700    0.730          0.730
truncation    80.5%    73.2%          62.5%

The Reliquary arm gains +14 pp pass@1 over vanilla GRPO at identical step count (0.61 vs 0.47; ≈2.8 SE from zero, p ≈ 0.01). pass@4 is unchanged (0.73 vs 0.73): training sharpens the single-attempt answer rather than expanding the solvable set. Truncation falls to 62.5%, from 80.5% at base and 73.2% under vanilla GRPO: the market-trained policy converges on answers more efficiently.
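A +14 pp gain at ≈2.8 SE implies a standard error of about 5 pp on the paired difference. The sketch below shows how such a paired statistic is typically computed from per-problem 0/1 outcomes. The outcome vectors are toy values, not the experiment's data, and `paired_gain` is an illustrative name; the source does not state which paired test was used.

```python
from math import sqrt
from statistics import mean, stdev

def paired_gain(correct_a, correct_b):
    """Mean per-problem difference (b - a) and its z-score, for 0/1 outcomes on the same problems."""
    diffs = [b - a for a, b in zip(correct_a, correct_b)]
    gain = mean(diffs)
    se = stdev(diffs) / sqrt(len(diffs))  # standard error of the mean difference
    return gain, gain / se

# Toy outcomes for 10 held-out problems (illustrative only).
baseline = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
market = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]
gain, z = paired_gain(baseline, market)
print(f"gain = {gain:+.2f}, z = {z:.2f}")
```

Pairing matters because both arms see the same problems with the same seed: problems both arms solve or both miss contribute zero difference, which shrinks the standard error relative to an unpaired comparison.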

Why ×2 is a lower bound

Read the ×2 straight off FIG.05 at the regime this experiment ran in. At ~15% in-zone a reactive filter wastes ~85% of its inference, while an informed market holds waste near ~40%. On those figures the productive share is ≈60% against ≈15%, a ratio nearer 4× than 2×, so ×2 is the conservative reading. It is also the ratio the experiment measured directly: Reliquary gained +28 pp pass@1 over base against +14 pp for vanilla GRPO. That is the lower bound, and it widens from here.

This was measured at 300 steps, early, when the zone is still wide (~15%). A reactive filter wastes ~85% there — costly, not catastrophic. As training proceeds the zone narrows; at 1% in-zone a reactive filter discards 99% of its compute. A market does not degrade the same way — miners adapt their estimators to the moving frontier and hold a high hit-rate as the zone contracts. So the efficiency multiplier is a function of training progress.
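The dependence on training progress can be sketched by holding the market's hit-rate fixed while the in-zone rate shrinks. The fixed 30% hit-rate is an assumption, borrowed from the coarse-estimator figure earlier in the text. Whether miners actually hold it as the zone contracts is the claim being made, not something this code demonstrates.

```python
def reactive_waste(in_zone):
    """Share of inference discarded when prompts are sampled uniformly at random."""
    return 1.0 - in_zone

def efficiency_multiplier(in_zone, market_hit_rate):
    """Useful rollouts per unit of compute: market relative to a reactive filter."""
    return market_hit_rate / in_zone

ASSUMED_MARKET_HIT_RATE = 0.30  # hypothetical, held constant across checkpoints

for in_zone in (0.15, 0.10, 0.05, 0.01):
    m = efficiency_multiplier(in_zone, ASSUMED_MARKET_HIT_RATE)
    print(f"in-zone {in_zone:.0%}: reactive waste {reactive_waste(in_zone):.0%}, multiplier {m:.0f}x")
```

Under this assumption the ratio is 2 at 15% in-zone and 30 at 1%: the multiplier is set by how far the frontier has narrowed, not by a fixed property of the market.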

FIG.05 · wasted inference vs in-zone rate · reactive degrades; market holds

Vision

The architecture isn't tied to a model or domain — it's a general-purpose RL inference layer. The goal is a product: a client brings a model and a set of environments, and Reliquary handles the RL inference — rollout generation, prompt selection, and verification — returning an optimized training signal without the client building inference infrastructure.

FIG.06 · Reliquary as an RL inference layer

client brings a model + environments → Reliquary RL inference layer (rollout generation · prompt selection · GRAIL verification) → client gets an optimized training signal

The gap between naive sampling and informed selection by experienced miners is where both the network and its clients capture value.

links
code · github.com/reliquadotai/reliquary
mechanism · docs/concepts
production checkpoints (Qwen3.5-4B) · huggingface.co/ReliquaryForge/qwen3.5-4b-reliquary

Reliquary: a market for the learning frontier

What we built

Prompt selection is the dominant lever

From reactive filter to competitive market

The controlled experiment

Why ×2 is a lower bound

Vision
