roadmap · reliquary

The learning-frontier market, opened, then generalized.

Three moves: a live training market on Subnet 81 today, opened to outside workloads next, generalized into an RL inference layer for all of Bittensor.

now · phase 1
now · the learning-frontier market · live now
Miners find the prompts the model learns from
The market is live on Subnet 81. Miners compete to select prompts at the policy's learning frontier: prompts whose group-σ (the reward spread within a prompt's rollout group) sits in the trainable band and that are not yet in cooldown. The validator verifies every rollout with GRAIL, runs a GRPO step on the rollouts that survive verification, and publishes the updated checkpoint to Hugging Face every ten trained windows. The live policy is Qwen3.5-4B, trained across mixed OpenMath + OpenCode environments.
- Ex-ante prompt market — miners bet their own compute on the σ-trainable band, replacing DAPO's reactive generate-then-discard filter
- GRAIL anti-fabrication proof on every rollout — the validator recomputes the forward pass and pays fabricated work zero
- Live GRPO loop — PPO-clipped, KL-penalized; checkpoint published to HF (ReliquaryForge/qwen3.5-4b-reliquary) every ten windows
- Controlled result — pass@1 0.470 → 0.610 (+14pp), about 2× the training efficiency of vanilla GRPO
shipped result · held-out pass@1
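The frontier test and the GRPO update described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the subnet's code: the band limits, the cooldown length, and every function name below are hypothetical, since the roadmap names a trainable band and a cooldown without publishing their values.

```python
from statistics import mean, pstdev

# Hypothetical values: the source names a "trainable band" and a cooldown
# but does not publish the thresholds.
SIGMA_MIN = 0.2
SIGMA_MAX = 0.5
COOLDOWN_WINDOWS = 5


def group_sigma(rewards):
    """Population standard deviation of rewards within one prompt's rollout group."""
    return pstdev(rewards)


def on_frontier(rewards, last_trained_window, current_window):
    """True if the group's sigma is inside the band and the prompt is out of cooldown."""
    in_cooldown = (
        last_trained_window is not None
        and current_window - last_trained_window < COOLDOWN_WINDOWS
    )
    return not in_cooldown and SIGMA_MIN <= group_sigma(rewards) <= SIGMA_MAX


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: each reward standardized against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio, advantage, clip=0.2):
    """PPO-style clipped surrogate for one token (KL penalty omitted here)."""
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

A group in which every rollout earns the same reward has σ = 0, so all of its advantages are zero and it contributes no gradient. That is why such prompts fall outside the band, and why selecting prompts before generation saves the compute a generate-then-discard filter would spend on them.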
next · phase 2
next · open the market · design in flight
Bring your model and environments
Open the market to external workloads. A client brings a model and a set of environments; the miner network handles rollout generation, frontier selection, and verification; the client receives an optimized GRPO training signal — verified, advantage-weighted, no infra to run.
- Client-supplied model + environments — point a training run at the network and get verified rollouts back
- Sandboxed reward functions — deterministic exec, resource caps, no network
- Per-job isolation — external workloads never collide with the canonical training run
- GRAIL proofs returned per rollout — re-verify offline before feeding your trainer
- Pricing — per-accepted-rollout TAO with credit packs for sustained workloads
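As one illustration of what a sandboxed reward function could look like, the sketch below runs client-supplied reward code in a child Python process with CPU, memory, and wall-clock caps. It is an assumption-laden sketch, not the planned design: the limits, the `run_reward` name, and the stdin/stdout contract are all hypothetical. It is POSIX-only, and it does not block network access, which would need OS-level isolation such as a network namespace or container.

```python
import os
import resource
import subprocess
import sys
from typing import Optional

# Hypothetical caps; the roadmap does not specify values.
CPU_SECONDS = 2
MEMORY_BYTES = 1 << 30  # 1 GiB of address space
WALL_SECONDS = 5


def _cap(kind, value):
    """Lower a resource limit, never exceeding the hard limit already in force."""
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(kind, (value, value))


def _apply_limits():
    # Runs in the child process just before the interpreter is exec'd.
    _cap(resource.RLIMIT_CPU, CPU_SECONDS)
    _cap(resource.RLIMIT_AS, MEMORY_BYTES)


def run_reward(reward_source: str, completion: str) -> Optional[float]:
    """Run reward code that reads a completion on stdin and prints one float.

    Returns None if the code times out, crashes, or prints something else.
    """
    # A fixed hash seed removes one common source of run-to-run variation.
    env = dict(os.environ, PYTHONHASHSEED="0")
    try:
        proc = subprocess.run(
            [sys.executable, "-c", reward_source],
            input=completion,
            capture_output=True,
            text=True,
            timeout=WALL_SECONDS,
            preexec_fn=_apply_limits,
            env=env,
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        return None
    try:
        return float(proc.stdout.strip())
    except ValueError:
        return None
```

For example, `run_reward("import sys; print(1.0 if sys.stdin.read().strip() == '4' else 0.0)", "4")` scores a completion against a fixed answer. Treating every failure mode as `None` keeps a misbehaving grader from stalling the job it belongs to.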
later · phase 3
later · a general-purpose RL inference layer · open design
Decentralized RLHF as shared infrastructure
Generalize Reliquary into a general-purpose RL inference layer on Bittensor: many teams point training runs at the network at once and get verified, frontier-selected rollouts at scale. The market becomes shared infrastructure for decentralized RLHF — selection intelligence priced as a first-class commodity.
- Multi-tenant trainers — many concurrent training runs sharing one miner network
- Open environment registry — community-contributed environments and graders
- Cross-run miner reputation — selection skill that transfers across workloads
- Economics — emission split across the canonical training run and external jobs

Find the frontier · prove the rollout · train the checkpoint.
