Benchmarking#
For the official leaderboard results, see the Benchmark Leaderboard.
This page describes how to benchmark your own model on CHURRO-DS. Please open a pull request if you would like to add your model to the official leaderboard.
Smallest Useful Run#
The benchmark runner lives in this repo at `tooling.benchmarking.benchmark`:
pixi run python -m tooling.benchmarking.benchmark \
--backend litellm \
--dataset-split test \
--model vertex_ai/gemini-2.5-pro
By default, results are written under `workdir/results/<split>/`.
The evaluation pipeline strips the default OCR wrapper tag, flattens supported XML-like OCR output, normalizes whitespace and punctuation, and applies additional Arabic normalization for languages with Arabic script (Arabic and Persian).
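The exact rules live in the repository's evaluation code; the Python sketch below only illustrates the shape of those steps, assuming a hypothetical `<ocr>` wrapper tag and illustrative Arabic-script rules:

import re
import unicodedata

def normalize_prediction(text: str, language: str) -> str:
    # Strip the default OCR wrapper tag (the tag name here is an assumption).
    text = re.sub(r"</?ocr>", "", text)
    # Flatten XML-like OCR markup: drop remaining tags, keep their inner text.
    text = re.sub(r"<[^>]+>", " ", text)
    # Normalize whitespace and punctuation variants (NFKC folds many of them).
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Additional normalization for Arabic-script languages (Arabic and Persian);
    # the rules below (alef unification, tatweel removal) are illustrative.
    if language in {"Arabic", "Persian"}:
        text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)
        text = text.replace("\u0640", "")
    return text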
Common Flags#
- `--dataset-split dev|test`: choose the CHURRO-DS split
- `--input-size N`: benchmark only the first `N` selected pages
- `--offset N`: skip the first `N` selected pages
- `--language` and `--document-type`: filter the benchmark subset before slicing
- `--output-dir PATH`: override the default results directory
- `--max-concurrency N`: cap the number of in-flight OCR requests
- `--reasoning-effort VALUE`: forward the LiteLLM/OpenAI `reasoning_effort` setting for the `litellm` and `openai-compatible` backends
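For example, a throttled run on the dev split (the concurrency value and output directory are illustrative):
pixi run python -m tooling.benchmarking.benchmark \
--backend litellm \
--dataset-split dev \
--model vertex_ai/gemini-2.5-pro \
--max-concurrency 8 \
--output-dir workdir/results/dev-gemini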
Output Files#
Each benchmark run writes one result directory. The directory contains two JSON files:
- `outputs.json`: one row per evaluated page with the raw predicted text, gold text, and page-level metrics
- `all_metrics.json`: aggregate metrics grouped across the full run, by main language, by document type, and by the language/type combination
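Both files are plain JSON, so they are easy to inspect. The sketch below assumes `outputs.json` is a JSON array and uses an illustrative run directory; the exact row fields may differ, so check a real output file from your run:

import json
from pathlib import Path

run_dir = Path("workdir/results/test")  # adjust to your actual run directory

# One row per evaluated page (assumed to be a JSON array here).
outputs = json.loads((run_dir / "outputs.json").read_text(encoding="utf-8"))
print(f"{len(outputs)} pages evaluated")

# Aggregate metrics for the full run and per language/document type.
metrics = json.loads((run_dir / "all_metrics.json").read_text(encoding="utf-8"))
print(json.dumps(metrics, indent=2, ensure_ascii=False))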
Filtering And Slicing#
You can run benchmarks on subsets of the data by combining `--language`, `--document-type`, `--offset`, and `--input-size`. The filters are applied in the following order:
1. `--language` filters on `main_language`
2. `--document-type` filters on `document_type`
3. `--offset` skips rows after filtering
4. `--input-size` limits rows after filtering and offset
For example, `--language Arabic --offset 100 --input-size 50` selects rows 101 through 150 of the Arabic-only subset, not of the full split.
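As a full command, that slice looks like this (reusing the model from the run above):
pixi run python -m tooling.benchmarking.benchmark \
--backend litellm \
--dataset-split test \
--model vertex_ai/gemini-2.5-pro \
--language Arabic \
--offset 100 \
--input-size 50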
Example Commands#
If you want to benchmark a model using vLLM or llama.cpp, run the server separately and point `--backend openai-compatible` at its OpenAI-compatible endpoint. See the official vLLM serving docs or the official llama.cpp serving docs.
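As a minimal sketch, a vLLM server can be started like this (the model name and port are illustrative, not a recommendation):
vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000
The server then exposes an OpenAI-compatible endpoint at `http://localhost:8000/v1` for the `openai-compatible` backend to use.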
| Model | Model ID | Backend | Full command |
|---|---|---|---|
| Gemini 2.5 Pro | `vertex_ai/gemini-2.5-pro` | `litellm` | `pixi run python -m tooling.benchmarking.benchmark --backend litellm --dataset-split test --model vertex_ai/gemini-2.5-pro` |
| GPT-5.4 | | | |
| Qwen 3.5-0.8B | | | |