`churro_ocr.document`

Document-level OCR pipeline built on the page detection and OCR APIs.

class churro_ocr.document.DocumentOCRResult

Bases: object

Document OCR output across all detected pages.

Parameters:

pages – OCR-enriched pages in output order.
source_type – Input source type, typically "image" or "pdf".
metadata – Document-level metadata carried forward from page detection.

texts()

Return OCR text for each page in order.

Returns: Plain OCR text for each page. Missing page text is normalized to "".
Return type: list[str]
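
The normalization described above can be illustrated with a minimal sketch. `StandInPage` and `page_texts` are hypothetical names used only for illustration; they are not the library's `DocumentPage` or its implementation of `texts()`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StandInPage:
    """Hypothetical stand-in for a page object; not the library's DocumentPage."""

    text: Optional[str] = None


def page_texts(pages: list[StandInPage]) -> list[str]:
    # Keep one entry per page so indices still line up with the page list,
    # replacing missing text with an empty string.
    return [page.text or "" for page in pages]


pages = [StandInPage("first page"), StandInPage(None), StandInPage("third page")]
texts = page_texts(pages)
```

Because every page contributes exactly one string, the returned list can be indexed or zipped against the page list without checking for `None`.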

as_ocr_results()

Return plain OCR results in page order.

Returns: OCRResult objects derived from the current pages.
Return type: list[OCRResult]

__init__(pages, source_type, metadata=<factory>)

Parameters:

pages (list[DocumentPage])
source_type (str)
metadata (MetadataDict)

Return type:

None

class churro_ocr.document.DocumentOCRPipeline

Bases: object

Run page detection and OCR as one document-level pipeline.

The pipeline is the highest-level API in the package. It detects pages from an image or PDF, runs OCR on each detected page, and preserves the page objects in the final result.

__init__(ocr_backend, *, page_detector=None, detection_backend=None, max_concurrency=8)

Create a document OCR pipeline.

Parameters:

ocr_backend (OCRBackend | Callable[[DocumentPage], Awaitable[OCRResult]]) – OCR backend or async OCR callable used for each page.
page_detector (DocumentPageDetector | None) – Optional fully constructed page detector to reuse.
detection_backend (PageDetectionBackend | Callable[[Image], Awaitable[list[PageCandidate]]] | None) – Optional low-level detection backend used when page_detector is not provided.
max_concurrency (int) – Maximum number of page OCR jobs run at once.

Raises:

ConfigurationError – If max_concurrency is less than 1.

Return type:

None
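
One common way to cap the number of page OCR jobs in flight, as `max_concurrency` describes, is an `asyncio.Semaphore` around each call. The sketch below is an assumption about the general technique, not the library's implementation; `ocr_pages_bounded` and `fake_ocr` are hypothetical names, and `ValueError` stands in for `ConfigurationError` because the sketch does not import the package.

```python
import asyncio


async def ocr_pages_bounded(pages, ocr_page, max_concurrency=8):
    """Run ocr_page over pages with at most max_concurrency calls in flight."""
    if max_concurrency < 1:
        # The documented pipeline raises ConfigurationError here; ValueError
        # is used only because this sketch does not import the library.
        raise ValueError("max_concurrency must be at least 1")

    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(page):
        async with semaphore:
            return await ocr_page(page)

    # gather returns results in the order the awaitables were passed,
    # so the output order matches the input page order.
    return await asyncio.gather(*(run_one(page) for page in pages))


in_flight = 0
peak_in_flight = 0


async def fake_ocr(page):
    """Hypothetical async OCR callable that records how many calls overlap."""
    global in_flight, peak_in_flight
    in_flight += 1
    peak_in_flight = max(peak_in_flight, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    return f"text for {page}"


results = asyncio.run(
    ocr_pages_bounded(["p1", "p2", "p3", "p4", "p5"], fake_ocr, max_concurrency=2)
)
```

A lower limit trades throughput for less pressure on the OCR backend; results still come back in page order regardless of which job finishes first.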

async process_image(request, *, ocr_metadata=None)

Detect pages and OCR a single input image.

Parameters:

request (PageDetectionRequest) – Image detection request describing the source image.
ocr_metadata (MetadataDict | None) – Optional caller-side metadata merged into each page before OCR runs.

Returns:

Document OCR result preserving page order and page images.

Return type:

DocumentOCRResult
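
As a rough illustration of what merging `ocr_metadata` into each page could look like, the sketch below combines two plain dictionaries. It assumes caller-supplied keys override page keys on conflict; the reference above does not state the precedence, and `merge_page_metadata` is a hypothetical helper rather than part of the package.

```python
def merge_page_metadata(page_metadata, ocr_metadata=None):
    """Return a new dict combining page metadata with caller metadata.

    Assumption for this sketch: caller-supplied keys win on conflict.
    """
    merged = dict(page_metadata)       # copy so the page's own dict is untouched
    merged.update(ocr_metadata or {})  # None is treated as "nothing to merge"
    return merged


page_metadata = {"page_index": 0, "source": "scan.png"}
merged = merge_page_metadata(
    page_metadata, {"batch": "archive-a", "source": "override"}
)
unchanged = merge_page_metadata(page_metadata)
```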

process_image_sync(request, *, ocr_metadata=None)

Synchronously detect pages and OCR a single input image.

Parameters:

request (PageDetectionRequest) – Image detection request describing the source image.
ocr_metadata (MetadataDict | None) – Optional caller-side metadata merged into each page before OCR runs.

Returns:

Document OCR result preserving page order and page images.

Return type:

DocumentOCRResult
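
A synchronous variant of an async method is often a thin wrapper that drives the coroutine to completion. The sketch below shows that pattern with `asyncio.run`; whether the library does it this way is an assumption, and `StandInPipeline` is a hypothetical class that returns a plain dict instead of a `DocumentOCRResult`.

```python
import asyncio


class StandInPipeline:
    """Hypothetical stand-in; not the library's DocumentOCRPipeline."""

    async def process_image(self, request, *, ocr_metadata=None):
        await asyncio.sleep(0)  # placeholder for detection and OCR work
        return {"request": request, "ocr_metadata": ocr_metadata or {}}

    def process_image_sync(self, request, *, ocr_metadata=None):
        # asyncio.run starts a fresh event loop, so it raises RuntimeError
        # when called from code that is already running inside a loop.
        return asyncio.run(self.process_image(request, ocr_metadata=ocr_metadata))


result = StandInPipeline().process_image_sync("scan.png", ocr_metadata={"batch": "a"})
```

With a wrapper of this kind, code that already runs inside an event loop (for example, an async web handler) should await the async method instead of calling the synchronous one.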

async process_pdf(path, *, dpi=300, trim_margin=30, ocr_metadata=None)

Rasterize, detect pages, and OCR a PDF.

Parameters:

path (str | Path) – PDF path to rasterize and process.
dpi (int) – Rasterization DPI used before page detection.
trim_margin (int) – Pixel margin added around detected crops.
ocr_metadata (MetadataDict | None) – Optional caller-side metadata merged into each page before OCR runs.

Returns:

Document OCR result across the rasterized PDF pages.

Return type:

DocumentOCRResult
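
To give a feel for the `dpi` and `trim_margin` defaults, the sketch below works through the arithmetic: the pixel size of a rasterized page, and a crop box grown by a pixel margin. Both helpers are hypothetical, and clamping the padded box to the image bounds is an assumption of this sketch rather than documented behavior.

```python
def raster_size(width_in, height_in, dpi=300):
    """Pixel dimensions of a page of the given size in inches at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def expand_box(box, margin, image_size):
    """Grow (left, top, right, bottom) by margin pixels, clamped to the image."""
    left, top, right, bottom = box
    width, height = image_size
    return (
        max(0, left - margin),
        max(0, top - margin),
        min(width, right + margin),
        min(height, bottom + margin),
    )


# A US Letter page (8.5 x 11 inches) at the default 300 DPI.
letter_px = raster_size(8.5, 11, dpi=300)

# A detected crop padded by the default 30-pixel margin.
padded = expand_box((100, 100, 500, 700), 30, letter_px)

# A crop near the page edge: the margin stops at the image border.
clamped = expand_box((10, 10, 2540, 3290), 30, letter_px)
```

Pixel dimensions scale linearly with `dpi`, so doubling it quadruples the pixel count of each rasterized page, while a fixed `trim_margin` in pixels covers half as much physical area.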

process_pdf_sync(path, *, dpi=300, trim_margin=30, ocr_metadata=None)

Synchronously rasterize, detect pages, and OCR a PDF.

Parameters:

path (str | Path) – PDF path to rasterize and process.
dpi (int) – Rasterization DPI used before page detection.
trim_margin (int) – Pixel margin added around detected crops.
ocr_metadata (MetadataDict | None) – Optional caller-side metadata merged into each page before OCR runs.

Returns:

Document OCR result across the rasterized PDF pages.

Return type:

DocumentOCRResult
