QueryGym¶
A lightweight, reproducible toolkit for LLM-based query reformulation.
Features¶
- Single Prompt Bank (YAML) with metadata
- Simple DataLoader: Dependency-free file loading for queries, qrels, and contexts
- Format Loaders: Optional BEIR and MS MARCO format loaders
- OpenAI-compatible LLM client (works with any OpenAI API–compatible endpoint)
- Pyserini optional: either pass contexts (JSONL) or pass a retriever instance to build contexts
- Export-only: emits reformulated queries; optionally generates a bash script for Pyserini +
trec_eval
Quick Example¶
import querygym as qg
# Load data
queries = qg.load_queries("queries.tsv")
qrels = qg.load_qrels("qrels.txt")
# Create reformulator
reformulator = qg.create_reformulator("genqr_ensemble", model="gpt-4")
# Reformulate
results = reformulator.reformulate_batch(queries)
# Save
qg.DataLoader.save_queries(
[qg.QueryItem(r.qid, r.reformulated) for r in results],
"reformulated.tsv"
)
Installation¶
Install from PyPI¶
Use Docker (Quick Start)¶
# Pull pre-built image
docker pull ghcr.io/ls3-lab/querygym:latest
# Run with Docker Compose
docker compose run --rm querygym
See the Docker Guide for detailed setup and usage.
For optional features:
# With HuggingFace datasets support
pip install querygym[hf]
# With BEIR format support
pip install querygym[beir]
# With Pyserini adapter
pip install querygym[pyserini]
# All optional features
pip install querygym[all]
# Development dependencies
pip install querygym[dev]
Documentation¶
📊 Looking for benchmarks? Visit the Leaderboard.
Citation¶
If you use QueryGym in your research, please cite:
@misc{bigdeli2025querygymtoolkitreproduciblellmbased,
title={QueryGym: A Toolkit for Reproducible LLM-Based Query Reformulation},
author={Amin Bigdeli and Radin Hamidi Rad and Mert Incesu and Negar Arabzadeh and Charles L. A. Clarke and Ebrahim Bagheri},
year={2025},
eprint={2511.15996},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2511.15996},
}
License¶
Apache License 2.0 - see LICENSE for details.