Miners find the prompts the model learns from
The market is live on Subnet 81. Miners compete to select prompts at the policy's learning frontier: prompts whose group reward standard deviation (group-σ) falls in the trainable band and that are not yet in cooldown. The validator verifies every rollout with GRAIL, runs a GRPO step on what survives, and publishes the updated checkpoint to Hugging Face every ten trained windows. The live policy is Qwen3.5-4B, trained across mixed OpenMath and OpenCode environments.
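To make the selection criterion concrete, here is a minimal sketch of a group-σ check. It assumes scalar rewards per rollout; the band limits `SIGMA_MIN` and `SIGMA_MAX`, the `cooldown` set, and the function names are hypothetical, not the subnet's actual values or API.

```python
from statistics import pstdev

# Hypothetical band limits, chosen only for illustration.
SIGMA_MIN = 0.2
SIGMA_MAX = 0.5


def group_sigma(rewards):
    """Population standard deviation of the rewards in one rollout group."""
    return pstdev(rewards)


def is_trainable(prompt_id, rewards, cooldown,
                 sigma_min=SIGMA_MIN, sigma_max=SIGMA_MAX):
    """True if the prompt is outside cooldown and its group-σ is in the band."""
    if prompt_id in cooldown:
        return False
    return sigma_min <= group_sigma(rewards) <= sigma_max


# A prompt the policy always solves has σ = 0, so it is not selected.
solved = is_trainable("p1", [1, 1, 1, 1], cooldown=set())
# A prompt solved in one of four rollouts has σ ≈ 0.43, inside the example band.
frontier = is_trainable("p2", [1, 0, 0, 0], cooldown=set())
# The same prompt is skipped while it sits in cooldown.
cooling = is_trainable("p2", [1, 0, 0, 0], cooldown={"p2"})
```

The band matters because GRPO normalizes rewards within each group: when every rollout in a group earns the same reward, all advantages are zero and the prompt contributes no gradient.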
- Ex-ante prompt market — miners bet their own compute on the σ-trainable band, replacing DAPO's reactive generate-then-discard filter
- GRAIL anti-fabrication proof on every rollout — the validator recomputes the forward pass and pays fabricated work zero
- Live GRPO loop — PPO-clipped, KL-penalized; checkpoint published to HF (ReliquaryForge/qwen3.5-4b-reliquary) every ten windows
- Controlled result — pass@1 0.470 → 0.610 (+14pp), about 2× the training efficiency of vanilla GRPO
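For readers unfamiliar with the GRPO step named in the list above, the following is a rough sketch of a PPO-clipped, KL-penalized objective with group-normalized advantages. It is not the validator's code: it works on per-sequence log-probabilities rather than per-token ones, the `clip_eps` and `beta` values are illustrative, and the KL term uses the unbiased estimator commonly paired with GRPO, which is an assumption here.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def grpo_loss(logp_new, logp_old, logp_ref, rewards,
              clip_eps=0.2, beta=0.04):
    """Loss for one group; all hyperparameters are illustrative assumptions."""
    advantages = group_advantages(rewards)
    total = 0.0
    for ln, lo, lr, adv in zip(logp_new, logp_old, logp_ref, advantages):
        ratio = math.exp(ln - lo)  # importance ratio new / old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        surrogate = min(ratio * adv, clipped * adv)  # PPO clipping
        d = lr - ln
        kl = math.exp(d) - d - 1.0  # non-negative KL estimate vs. reference
        total += surrogate - beta * kl
    return -total / len(rewards)  # negate: optimizers minimize


# With identical rewards the advantages vanish and only the KL penalty remains.
flat_loss = grpo_loss([-1.0, -1.0], [-1.0, -1.0], [-1.5, -1.5], [1, 1])
# With a large ratio on the rewarded sample, the surrogate is capped at 1 + clip_eps.
clipped_loss = grpo_loss([0.0, -1.0], [-1.0, -1.0], [0.0, -1.0], [1, 0])
```

In the first call the group carries no learning signal, which is the case the prompt market is meant to avoid; in the second, clipping limits how far one rewarded rollout can move the policy in a single step.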