Reproducibility & Leaderboard Submissions

QueryGym ships with a reproducibility pipeline that powers leaderboard.querygym.com and the SIGIR 2026 reproducibility paper. This page explains how to submit a result and how to verify a published one.

The full schema lives at reproducibility/schema.md (human-readable) and reproducibility/schema.json (machine-readable). All submitted JSONs are validated against it three times: at emit time, at submit time, and at aggregate time in CI.

Trusted contributor flow

If you have commit access:

# 1. Run the example pipeline.
python examples/querygym_pyserini/pipeline.py \
    --dataset msmarco-v1-passage.trecdl2019 \
    --method query2e \
    --model gpt-4.1-mini \
    --output-dir outputs/dl19_query2e_zs

# 2. Copy the output into the canonical layout.
python -m reproducibility.scripts.submit_run --from-dir outputs/dl19_query2e_zs

# 3. Regenerate the aggregate CSV + manifest.
make repro-aggregate

# 4. Commit and open a PR.
git add reproducibility/data/
git commit -m "add query2e/gpt-4.1-mini result on dl19-passage"
git push
gh pr create

CI runs the schema/validator tests and aggregate_runs.py --check. If everything is green, the leaderboard rebuilds on merge.
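The idea behind an "is the aggregate out of date?" check can be sketched as follows: render the aggregate again from the run records and compare it with the committed text. This is only an illustration of the pattern, not the actual aggregate_runs.py; the function names, the column names, and the toy records are all hypothetical.

```python
import csv
import io

# Hypothetical column layout; the real results.csv has its own columns.
FIELDS = ["dataset_id", "method_id", "model", "ndcg_cut_10"]


def render_results_csv(runs):
    """Render run records to CSV text in a deterministic row order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    ordered = sorted(runs, key=lambda r: (r["dataset_id"], r["method_id"], r["model"]))
    for run in ordered:
        writer.writerow({name: run[name] for name in FIELDS})
    return buf.getvalue()


def is_up_to_date(committed_csv_text, runs):
    """True when the committed CSV equals a fresh render of the runs."""
    return committed_csv_text == render_results_csv(runs)


# Toy records with placeholder scores (not real results).
runs = [
    {"dataset_id": "toy-dataset", "method_id": "toy-method",
     "model": "toy-model-a", "ndcg_cut_10": 0.5},
]
committed = render_results_csv(runs)   # stands in for the file tracked in git
print(is_up_to_date(committed, runs))  # True

# A new run is added but the CSV is not regenerated.
runs.append({"dataset_id": "toy-dataset", "method_id": "toy-method",
             "model": "toy-model-b", "ndcg_cut_10": 0.25})
print(is_up_to_date(committed, runs))  # False
```

Sorting the rows before writing keeps the output independent of the order in which runs are discovered, so a stale file is the only reason the comparison can fail.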

Common failure modes

Symptom	Cause	Fix
`aggregator --check failed: results.csv is out of date`	You forgot step 3.	Run `make repro-aggregate`, commit the diff.
`dataset_id 'foo' not in dataset_registry.yaml`	Typo or new dataset not registered.	Add the dataset to `dataset_registry.yaml` first, then re-submit.
`method_id 'foo' not in registered methods`	Method not registered or name typo'd.	Register via `@register_method("foo")` in `querygym/methods/`.
`params_hash mismatch`	The JSON was hand-edited.	Don't hand-edit run JSONs — re-run the emitter or use `submit_run` instead.
`metric(s) ['bleu'] not in eval_metrics for dataset 'X'`	Unsupported metric for that dataset.	Either drop the metric or add it to the dataset's `output.eval_metrics` in the registry.

External (fork) contributor flow

If you don't have commit access:

1. Fork ls3-lab/QueryGym on GitHub and clone your fork.
2. Run steps 1–3 from the trusted flow above.
3. Commit the changes under reproducibility/data/, push to your fork, and open a PR against ls3-lab/QueryGym:main.

CI runs the same schema/validator/aggregator checks against your PR — no LLM keys or Pyserini are needed for these checks, so fork PRs get fast feedback.

A maintainer will additionally re-verify your numbers locally before merging:

Cheap pre-check (~30s): the maintainer runs pytrec_eval against your submitted run.txt using the dataset's qrels and confirms the reported metrics match.
Full re-run (only if needed): if the cheap check shows a mismatch, the maintainer runs the example pipeline with your config block as inputs and compares the reformulated queries and the run file against yours.

This is why every submission must include run.txt and reformulated_queries.tsv alongside the JSON — they make verification cheap.
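Before opening a PR, it may help to confirm that all three artifacts are present. The sketch below is a hypothetical pre-flight helper, not part of QueryGym. It matches files by suffix, which is an assumption based on the `*.run.txt` glob used in the verification example on this page; adjust the suffixes if the canonical layout names files differently.

```python
import tempfile
from pathlib import Path

# Suffixes taken from the text above. Matching by suffix (rather than by
# exact file name) is an assumption, not a documented rule.
REQUIRED_SUFFIXES = (".json", "run.txt", "reformulated_queries.tsv")


def missing_artifacts(run_dir):
    """Return the required suffixes that no file in run_dir ends with."""
    names = [p.name for p in Path(run_dir).iterdir() if p.is_file()]
    return [s for s in REQUIRED_SUFFIXES if not any(n.endswith(s) for n in names)]


# Demo on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "result.json").write_text("{}")
    (Path(tmp) / "result.run.txt").write_text("")
    print(missing_artifacts(tmp))  # ['reformulated_queries.tsv']
```

An empty list means the directory has at least one file for each required artifact; it says nothing about whether their contents are valid.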

Verifying a published number (paper readers)

Each leaderboard row links to the canonical files at a paper-release tag. To verify independently:

git clone --depth=1 --branch=paper-sigir2026 https://github.com/ls3-lab/QueryGym.git
cd QueryGym

# Pick a run.
RUN_DIR=reproducibility/data/runs/msmarco-v1-passage.trecdl2019/query2e/gpt-4.1-mini

# Re-run trec_eval against the public qrels (Pyserini ships them).
python -m pyserini.eval.trec_eval -m ndcg_cut.10 dl19-passage "${RUN_DIR}"/*.run.txt

The number from pyserini.eval.trec_eval should match metrics.ndcg_cut_10 in the corresponding JSON.
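trec_eval prints scores rounded to four decimal places, while the JSON may store more digits, so "match" is best read as "equal after rounding". A minimal sketch of that comparison follows; the helper names are hypothetical, the numbers are made up for illustration, and the only field assumed from the JSON is `metrics.ndcg_cut_10` as named above.

```python
import json
import math


def parse_trec_eval_line(line):
    """Split one 'metric  scope  value' line as printed by trec_eval."""
    metric, scope, value = line.split()
    return metric, scope, float(value)


def matches_reported(reported, recomputed, places=4):
    """Compare two scores, allowing for output rounded to `places` decimals."""
    tolerance = 0.5 * 10 ** -places + 1e-12  # half a unit in the last printed digit
    return math.isclose(reported, recomputed, rel_tol=0.0, abs_tol=tolerance)


# Made-up values for illustration only (not real leaderboard numbers).
result_json = json.loads('{"metrics": {"ndcg_cut_10": 0.12345678}}')
metric, scope, recomputed = parse_trec_eval_line("ndcg_cut_10\tall\t0.1235")

print(metric, scope)                                                       # ndcg_cut_10 all
print(matches_reported(result_json["metrics"]["ndcg_cut_10"], recomputed)) # True
print(matches_reported(0.1240, recomputed))                                # False
```

If the command prints extra lines, pass only the line that starts with the metric name to `parse_trec_eval_line`.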

External tools (dashboard, third parties)

The contract is reproducibility/schema.json — a Draft 2020-12 JSON Schema document. Any tool that emits a conformant JSON can submit (subject to the trusted vs. fork flows above). You don't need to import any Python from QueryGym; just read the schema file and validate locally with whatever JSON Schema library your stack provides (Ajv for JS, jsonschema for Python, everit-org/json-schema for Java).

schema_version is currently pinned to 1 (declared as "const": 1 in the schema). Bumping it to 2 will be a breaking change and will be announced ahead of time.
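An external tool can guard against a future version bump with a small check before it emits or submits anything. The sketch below uses only the standard library and a toy stand-in for the schema; it assumes the pinned value sits at `properties.schema_version.const`, which should be confirmed against the real reproducibility/schema.json. The helper names are hypothetical, and this check complements full validation with a JSON Schema library rather than replacing it.

```python
# The version this (hypothetical) tool was written against.
SUPPORTED_SCHEMA_VERSION = 1


def pinned_version(schema):
    """Read the pinned version, assuming properties.schema_version.const."""
    return schema["properties"]["schema_version"]["const"]


def check_versions(schema, result):
    """Raise if the schema moved past this tool, or the result disagrees with it."""
    pinned = pinned_version(schema)
    if pinned != SUPPORTED_SCHEMA_VERSION:
        raise RuntimeError(
            f"schema pins version {pinned}, tool supports {SUPPORTED_SCHEMA_VERSION}"
        )
    if result.get("schema_version") != pinned:
        raise ValueError(
            f"result has schema_version {result.get('schema_version')!r}, expected {pinned}"
        )


# Toy stand-in for the real schema file; only the pinned field is modelled.
toy_schema = {"properties": {"schema_version": {"const": 1}}}

check_versions(toy_schema, {"schema_version": 1})  # passes silently

try:
    check_versions(toy_schema, {"schema_version": 2})
except ValueError as err:
    print(err)  # result has schema_version 2, expected 1
```

Failing loudly on an unexpected version is a deliberate choice here: a tool that keeps emitting version 1 documents after a breaking change would only find out at validation time in CI.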