churro_ocr.document

Document-level OCR pipeline built on the page detection and OCR APIs.

class churro_ocr.document.DocumentOCRResult

Bases: object

Document OCR output across all detected pages.

Parameters:
  • pages – OCR-enriched pages in output order.

  • source_type – Input source type, typically "image" or "pdf".

  • metadata – Document-level metadata carried forward from page detection.

texts()

Return OCR text for each page in order.

Returns:

Plain OCR text for each page. Missing page text is normalized to "".

Return type:

list[str]
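
The normalization described above can be sketched without the library. This is a minimal illustration, not the package's implementation: `FakePage` and its `text` attribute are hypothetical stand-ins, and the sketch assumes missing text is represented as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakePage:
    """Hypothetical stand-in for an OCR-enriched page (not a churro_ocr type)."""

    text: Optional[str] = None


def page_texts(pages: list[FakePage]) -> list[str]:
    # Preserve page order and replace missing text (None) with "".
    return [page.text if page.text is not None else "" for page in pages]


pages = [FakePage("first page"), FakePage(None), FakePage("third page")]
print(page_texts(pages))
```

Because the result always has one string per page, callers can join or index it without checking for `None`.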

as_ocr_results()

Return plain OCR results in page order.

Returns:

OCRResult objects derived from the current pages.

Return type:

list[OCRResult]

__init__(pages, source_type, metadata=<factory>)

Return type:

None

class churro_ocr.document.DocumentOCRPipeline

Bases: object

Run page detection and OCR as one document-level pipeline.

The pipeline is the highest-level API in the package. It detects pages from an image or PDF, runs OCR on each detected page, and preserves the page objects in the final result.

Create a document OCR pipeline.

Parameters:
  • ocr_backend – OCR backend or async OCR callable used for each page.

  • page_detector – Optional fully constructed page detector to reuse.

  • detection_backend – Optional low-level detection backend used when page_detector is not provided.

  • max_concurrency – Maximum number of page OCR jobs run at once.

Raises:

ConfigurationError – If max_concurrency is less than 1.
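
One common way to cap the number of concurrent page jobs is an `asyncio.Semaphore`. The sketch below shows that pattern under stated assumptions: `run_bounded`, `fake_ocr`, and the local `ConfigurationError` class are hypothetical stand-ins, and whether the pipeline uses a semaphore internally is not confirmed by this page.

```python
import asyncio


class ConfigurationError(ValueError):
    """Hypothetical stand-in for the package's ConfigurationError."""


async def run_bounded(jobs, max_concurrency=8):
    """Run zero-argument coroutine functions, at most max_concurrency at once."""
    if max_concurrency < 1:
        raise ConfigurationError("max_concurrency must be at least 1")
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(job):
        # Each job waits here until one of the slots is free.
        async with semaphore:
            return await job()

    # gather returns results in input order, not completion order.
    return await asyncio.gather(*(guarded(job) for job in jobs))


async def demo():
    active = 0
    peak = 0

    async def fake_ocr(index):
        # Hypothetical OCR job that tracks how many jobs run at the same time.
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return f"page {index}"

    jobs = [lambda i=i: fake_ocr(i) for i in range(6)]
    results = await run_bounded(jobs, max_concurrency=2)
    return results, peak


results, peak = asyncio.run(demo())
print(results, peak)
```

A lower limit reduces load on the OCR backend at the cost of throughput, while the output order stays tied to the input order either way.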

__init__(ocr_backend, *, page_detector=None, detection_backend=None, max_concurrency=8)

Create a document OCR pipeline.

Raises:

ConfigurationError – If max_concurrency is less than 1.

Return type:

None

async process_image(request, *, ocr_metadata=None)

Detect pages and OCR a single input image.

Parameters:
  • request (PageDetectionRequest) – Image detection request describing the source image.

  • ocr_metadata (dict[str, Any] | None) – Optional caller-side metadata merged into each page before OCR runs.

Returns:

Document OCR result preserving page order and page images.

Return type:

DocumentOCRResult
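
The effect of `ocr_metadata` can be pictured as a dictionary merge applied to every page. The sketch below is an assumption-laden illustration: `merge_page_metadata` is a hypothetical helper, and the choice that caller-supplied keys win on conflict is assumed here, not stated by this page.

```python
from typing import Any, Optional


def merge_page_metadata(
    page_metadata: dict[str, Any],
    ocr_metadata: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    # Copy first so the page's original metadata dict is left untouched.
    merged = dict(page_metadata)
    # Assumption: caller-supplied keys override page keys on conflict.
    merged.update(ocr_metadata or {})
    return merged


page_meta = {"page_index": 0, "language": "unknown"}
caller_meta = {"language": "es", "collection": "archive-a"}
print(merge_page_metadata(page_meta, caller_meta))
print(merge_page_metadata(page_meta, None))
```

Passing `None` leaves the page metadata unchanged, which matches the parameter being optional.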

process_image_sync(request, *, ocr_metadata=None)

Synchronously detect pages and OCR a single input image.

Parameters:
  • request (PageDetectionRequest) – Image detection request describing the source image.

  • ocr_metadata (dict[str, Any] | None) – Optional caller-side metadata merged into each page before OCR runs.

Returns:

Document OCR result preserving page order and page images.

Return type:

DocumentOCRResult
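
A synchronous variant of an async method is often a thin wrapper that drives the coroutine to completion. The sketch below shows that pattern with `asyncio.run`. `TinyPipeline` is a hypothetical stand-in, and whether the library wraps its async methods this way is an assumption.

```python
import asyncio


class TinyPipeline:
    """Hypothetical stand-in, not the churro_ocr pipeline class."""

    async def process(self, name: str) -> str:
        await asyncio.sleep(0)  # stands in for page detection and OCR work
        return f"processed {name}"

    def process_sync(self, name: str) -> str:
        # asyncio.run creates an event loop, runs the coroutine, then closes
        # the loop. It raises RuntimeError if a loop is already running in
        # this thread, so async callers should await process() directly.
        return asyncio.run(self.process(name))


print(TinyPipeline().process_sync("scan.png"))
```

Under this pattern, the sync method suits scripts and notebooks without a running loop, while code already inside an event loop should use the async method.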

async process_pdf(path, *, dpi=300, trim_margin=30, ocr_metadata=None)

Rasterize, detect pages, and OCR a PDF.

Parameters:
  • path (str | Path) – PDF path to rasterize and process.

  • dpi (int) – Rasterization DPI used before page detection.

  • trim_margin (int) – Pixel margin added around detected crops.

  • ocr_metadata (dict[str, Any] | None) – Optional caller-side metadata merged into each page before OCR runs.

Returns:

Document OCR result across the rasterized PDF pages.

Return type:

DocumentOCRResult
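
To get a feel for the `dpi` and `trim_margin` defaults, the arithmetic can be worked through by hand. PDF page sizes are expressed in points (1/72 inch by default), so a US Letter page of 612 x 792 points rasterizes to 2550 x 3300 pixels at 300 DPI. The helpers below are hypothetical, and clamping the padded box to the image bounds is an assumption about how a margin would reasonably be applied.

```python
def rasterized_size(width_pt: float, height_pt: float, dpi: int = 300) -> tuple[int, int]:
    # PDF default user space uses units of 1/72 inch.
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


def pad_box(
    box: tuple[int, int, int, int],
    margin: int,
    image_size: tuple[int, int],
) -> tuple[int, int, int, int]:
    # box is (left, top, right, bottom) in pixels.
    left, top, right, bottom = box
    width, height = image_size
    # Assumption: the padded crop is clamped so it never leaves the image.
    return (
        max(0, left - margin),
        max(0, top - margin),
        min(width, right + margin),
        min(height, bottom + margin),
    )


size = rasterized_size(612, 792, dpi=300)
padded = pad_box((10, 100, 2500, 3290), margin=30, image_size=size)
print(size, padded)
```

At 300 DPI, a 30-pixel margin corresponds to 0.1 inch of extra context around each detected crop.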

process_pdf_sync(path, *, dpi=300, trim_margin=30, ocr_metadata=None)

Synchronously rasterize, detect pages, and OCR a PDF.

Parameters:
  • path (str | Path) – PDF path to rasterize and process.

  • dpi (int) – Rasterization DPI used before page detection.

  • trim_margin (int) – Pixel margin added around detected crops.

  • ocr_metadata (dict[str, Any] | None) – Optional caller-side metadata merged into each page before OCR runs.

Returns:

Document OCR result across the rasterized PDF pages.

Return type:

DocumentOCRResult