Benchmarking

For the official leaderboard results, see the Benchmark Leaderboard.

This page describes how to benchmark your own model on CHURRO-DS. Please open a pull request if you would like to add your model to the official leaderboard.

Smallest Useful Run

The benchmark runner lives in this repo at tooling.benchmarking.benchmark.

pixi run python -m tooling.benchmarking.benchmark \
  --backend litellm \
  --dataset-split test \
  --model vertex_ai/gemini-2.5-pro

By default, results are written under workdir/results/<split>/, where <split> is the value passed to --dataset-split. The evaluation pipeline strips the default OCR wrapper tag, flattens supported XML-like OCR output, normalizes whitespace and punctuation, and applies additional Arabic normalization for Arabic-script languages (Arabic and Persian).
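To make the normalization steps above concrete, here is a minimal sketch of what such a pipeline could look like. It is not the runner's actual implementation: the wrapper tag name (ocr), the punctuation mapping, and the specific Arabic rules (removing diacritics and tatweel, unifying alef variants) are all assumptions chosen for illustration.

```python
import re

# Assumption: the wrapper tag is called "ocr"; the real tag name may differ.
WRAPPER_TAG = "ocr"

# Assumption: punctuation normalization maps typographic quotes to ASCII ones.
PUNCT_MAP = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})

# Assumption: Arabic normalization drops diacritics (U+064B..U+0652) and
# tatweel (U+0640), and maps alef variants to the bare alef (U+0627).
ARABIC_MARKS = re.compile("[\u064b-\u0652\u0640]")
ALEF_MAP = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"})


def normalize(text: str, arabic_script: bool = False) -> str:
    """Illustrative normalization applied to text before scoring."""
    # 1. Strip the outer wrapper tag if the whole text is wrapped in it.
    wrapped = re.match(
        rf"^\s*<{WRAPPER_TAG}>(.*)</{WRAPPER_TAG}>\s*$", text, flags=re.DOTALL
    )
    if wrapped:
        text = wrapped.group(1)
    # 2. Flatten XML-like markup by replacing every tag with a space.
    text = re.sub(r"<[^>]+>", " ", text)
    # 3. Normalize punctuation, then collapse runs of whitespace.
    text = text.translate(PUNCT_MAP)
    text = " ".join(text.split())
    # 4. Extra normalization for Arabic-script languages only.
    if arabic_script:
        text = ARABIC_MARKS.sub("", text).translate(ALEF_MAP)
    return text


example = normalize("<ocr><p>Hello,   \u201cworld\u201d</p></ocr>")
```

Applying the same function to both the prediction and the gold text keeps the comparison symmetric, which is the usual reason for normalizing before computing page-level metrics.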

Common Flags

  • --dataset-split dev|test: choose the CHURRO-DS split

  • --input-size N: benchmark only the first N selected pages

  • --offset N: skip the first N selected pages

  • --language and --document-type: filter the benchmark subset before slicing

  • --output-dir PATH: override the default results directory

  • --max-concurrency N: cap the number of in-flight OCR requests

  • --reasoning-effort VALUE: forward LiteLLM/OpenAI reasoning_effort for litellm and openai-compatible backends

Output Files

Each benchmark run writes one result directory. The directory contains two JSON files:

  • outputs.json: one row per evaluated page with the raw predicted text, gold text, and page-level metrics

  • all_metrics.json: aggregate metrics grouped across the full run, by main language, by document type, and by the language/type combination
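For post-processing, both files can be read with the standard json module. The sketch below assumes only the two file names listed above; load_run is a hypothetical helper, and the internal schema of each file is not specified here, so inspect the loaded objects before relying on particular keys.

```python
import json
from pathlib import Path


def load_run(result_dir):
    """Load outputs.json and all_metrics.json from one result directory.

    Hypothetical helper: it returns the parsed JSON as-is and makes no
    assumption about the keys inside either file.
    """
    result_dir = Path(result_dir)
    with open(result_dir / "outputs.json", encoding="utf-8") as f:
        outputs = json.load(f)
    with open(result_dir / "all_metrics.json", encoding="utf-8") as f:
        all_metrics = json.load(f)
    return outputs, all_metrics
```

For example, load_run("workdir/results/test/litellm_gpt-5.4") would return the per-page rows and the aggregate metrics for the GPT-5.4 run, provided that run was written to that directory.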

Filtering And Slicing

You can run benchmarks on subsets of the data by combining --language, --document-type, --offset, and --input-size. The filters are applied in the following order:

  • --language filters on main_language

  • --document-type filters on document_type

  • --offset skips rows after filtering

  • --input-size limits rows after filtering and offset

This means that, for example, --language Arabic --offset 100 --input-size 50 selects rows 101 to 150 of the Arabic-only subset, not of the full split.
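The ordering can be sketched in a few lines of Python. This is an illustration of the documented order, not the runner's code: select_rows and the page_id field are hypothetical, while main_language and document_type are the field names mentioned above.

```python
def select_rows(rows, language=None, document_type=None, offset=0, input_size=None):
    """Apply filters first, then offset, then input size (illustrative only)."""
    selected = list(rows)
    # Filters run first, so offset and input size count filtered rows only.
    if language is not None:
        selected = [r for r in selected if r["main_language"] == language]
    if document_type is not None:
        selected = [r for r in selected if r["document_type"] == document_type]
    # Skip the first `offset` rows of the filtered subset.
    selected = selected[offset:]
    # Keep at most `input_size` of the remaining rows.
    if input_size is not None:
        selected = selected[:input_size]
    return selected


# Toy split: even page_id values are Arabic, odd ones are English.
rows = [
    {
        "page_id": i,  # hypothetical field, used here only to track positions
        "main_language": "Arabic" if i % 2 == 0 else "English",
        "document_type": "print",  # placeholder value
    }
    for i in range(600)
]

subset = select_rows(rows, language="Arabic", offset=100, input_size=50)
```

In this toy split, subset holds the 101st through 150th Arabic rows, which sit at positions 200 through 298 of the unfiltered list. Slicing the full split first would have returned a mix of languages instead.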

Example Commands

If you want to benchmark a model using vLLM or llama.cpp, run the server separately and point --backend openai-compatible at its OpenAI-compatible endpoint. See the official vLLM serving docs or the official llama.cpp serving docs.

| Model | Model ID | Backend | Full command |
| --- | --- | --- | --- |
| Gemini 2.5 Pro | vertex_ai/gemini-2.5-pro | litellm | pixi run python -m tooling.benchmarking.benchmark --backend litellm --dataset-split test --model vertex_ai/gemini-2.5-pro --output-dir workdir/results/test/litellm_vertex_ai_gemini-2.5-pro |
| GPT-5.4 | gpt-5.4 | litellm | pixi run python -m tooling.benchmarking.benchmark --backend litellm --dataset-split test --model gpt-5.4 --api-key "$OPENAI_API_KEY" --max-concurrency 16 --output-dir workdir/results/test/litellm_gpt-5.4 |
| Qwen 3.5-0.8B | Qwen/Qwen3.5-0.8B | openai-compatible | pixi run python -m tooling.benchmarking.benchmark --backend openai-compatible --dataset-split test --model Qwen/Qwen3.5-0.8B --base-url http://127.0.0.1:8000/v1 --output-dir workdir/results/test/openai-compatible_Qwen_Qwen3.5-0.8B |
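Before starting a long run against a local server, it can help to confirm that the endpoint is reachable and serves the model ID you intend to pass to --model. The sketch below is not part of the benchmark runner. It assumes the server exposes the OpenAI-style GET /models route under the base URL and returns a JSON body with a data list of objects carrying an id; the helper names are hypothetical.

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL from the value passed to --base-url."""
    return base_url.rstrip("/") + "/models"


def parse_model_ids(payload: dict) -> list:
    """Extract model IDs, assuming an OpenAI-style {"data": [{"id": ...}]} body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url: str, timeout: float = 5.0) -> list:
    """Query the running server and return the model IDs it reports."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as response:
        return parse_model_ids(json.load(response))
```

With a server listening on the address from the table, list_served_models("http://127.0.0.1:8000/v1") should return a list that includes the served model ID. If the call raises a connection error, the server is not running or is bound to a different host or port.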