churro_ocr.document
Document-level OCR pipeline built on the page detection and OCR APIs.
- class churro_ocr.document.DocumentOCRResult
Bases: object
Document OCR output across all detected pages.
- Parameters:
pages – OCR-enriched pages in output order.
source_type – Input source type, typically "image" or "pdf".
metadata – Document-level metadata carried forward from page detection.
- class churro_ocr.document.DocumentOCRPipeline
Bases: object
Run page detection and OCR as one document-level pipeline.
The pipeline is the highest-level API in the package. It detects pages from an image or PDF, runs OCR on each detected page, and preserves the page objects in the final result.
- __init__(ocr_backend, *, page_detector=None, detection_backend=None, max_concurrency=8)
Create a document OCR pipeline.
- Parameters:
ocr_backend (OCRBackend | Callable[[DocumentPage], Awaitable[OCRResult]]) – OCR backend or async OCR callable used for each page.
page_detector (DocumentPageDetector | None) – Optional fully constructed page detector to reuse.
detection_backend (PageDetectionBackend | Callable[[Image], Awaitable[list[PageCandidate]]] | None) – Optional low-level detection backend used when page_detector is not provided.
max_concurrency (int) – Maximum number of page OCR jobs run at once.
- Raises:
ConfigurationError – If max_concurrency is less than 1.
- Return type:
None
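The max_concurrency cap and its validation can be illustrated with a small standalone sketch. This is not the library's implementation: `ocr_pages`, `fake_ocr`, and the local `ConfigurationError` class are hypothetical stand-ins, and a semaphore is only one plausible way to bound concurrent page OCR jobs.

```python
import asyncio


class ConfigurationError(ValueError):
    """Hypothetical stand-in for the library's ConfigurationError."""


async def ocr_pages(pages, ocr_one, max_concurrency=8):
    """Run ocr_one on every page, with at most max_concurrency jobs in flight."""
    if max_concurrency < 1:
        raise ConfigurationError("max_concurrency must be at least 1")
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(page):
        # Each job waits here until one of the max_concurrency slots is free.
        async with semaphore:
            return await ocr_one(page)

    # gather returns results in argument order, so page order is preserved
    # even when jobs finish out of order.
    return await asyncio.gather(*(bounded(page) for page in pages))


async def fake_ocr(page):
    """Hypothetical async OCR callable standing in for a real backend."""
    await asyncio.sleep(0)
    return f"text of {page}"


results = asyncio.run(ocr_pages(["p1", "p2", "p3"], fake_ocr, max_concurrency=2))
```

A lower cap reduces the load placed on the OCR backend at the cost of longer total run time; the right value depends on the backend in use.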
- async process_image(request, *, ocr_metadata=None)
Detect pages and OCR a single input image.
- Parameters:
request (PageDetectionRequest) – Image detection request describing the source image.
ocr_metadata (dict[str, Any] | None) – Optional caller-side metadata merged into each page before OCR runs.
- Returns:
Document OCR result preserving page order and page images.
- Return type:
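The ocr_metadata parameter is described as caller-side metadata merged into each page before OCR runs. A minimal sketch of such a merge follows, assuming page metadata behaves like a plain dict. `merge_page_metadata` is a hypothetical helper, and the choice that caller-side keys win on collision is an assumption, not documented behavior.

```python
def merge_page_metadata(page_metadata, ocr_metadata=None):
    """Return a new dict; neither input is modified."""
    merged = dict(page_metadata)
    # Assumption: caller-side keys override page keys on collision.
    merged.update(ocr_metadata or {})
    return merged


# Hypothetical per-page metadata, as page detection might produce it.
pages = [{"page_index": 0}, {"page_index": 1}]
enriched = [merge_page_metadata(page, {"job_id": "demo"}) for page in pages]
```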
- process_image_sync(request, *, ocr_metadata=None)
Synchronously detect pages and OCR a single input image.
- Parameters:
request (PageDetectionRequest) – Image detection request describing the source image.
ocr_metadata (dict[str, Any] | None) – Optional caller-side metadata merged into each page before OCR runs.
- Returns:
Document OCR result preserving page order and page images.
- Return type:
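A synchronous method that mirrors an async one is commonly a thin wrapper that drives the coroutine to completion. The sketch below shows that pattern with a hypothetical `Pipeline` class. Whether the library uses `asyncio.run` internally is an assumption.

```python
import asyncio


class Pipeline:
    """Hypothetical stand-in, not the real DocumentOCRPipeline."""

    async def process_image(self, request, *, ocr_metadata=None):
        await asyncio.sleep(0)  # placeholder for detection and OCR work
        return {"request": request, "ocr_metadata": ocr_metadata}

    def process_image_sync(self, request, *, ocr_metadata=None):
        # asyncio.run starts a fresh event loop, so this wrapper must not be
        # called from code that is already running inside an event loop.
        return asyncio.run(self.process_image(request, ocr_metadata=ocr_metadata))


result = Pipeline().process_image_sync("scan.png", ocr_metadata={"batch": "demo"})
```

With a wrapper of this kind, scripts and other non-async callers would use the sync variant, while code already running in an event loop would await the async method directly.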
- async process_pdf(path, *, dpi=300, trim_margin=30, ocr_metadata=None)
Rasterize, detect pages, and OCR a PDF.
- Parameters:
- Returns:
Document OCR result across the rasterized PDF pages.
- Return type:
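The dpi argument sets the resolution at which PDF pages are rasterized. As a rough guide to the resulting image size, the standard conversion from PDF points (1 point = 1/72 inch) to pixels can be sketched as below. `rasterized_size` is a hypothetical helper. It says nothing about how the library renders pages or how trim_margin is applied.

```python
def rasterized_size(width_pt, height_pt, dpi=300):
    """Pixel size of a page given its size in PDF points (1 pt = 1/72 inch)."""
    scale = dpi / 72
    return round(width_pt * scale), round(height_pt * scale)


# US Letter is 612 x 792 points (8.5 x 11 inches).
letter = rasterized_size(612, 792)
```

At the default dpi of 300 a US Letter page comes out at 2550 x 3300 pixels, so halving dpi cuts the pixel count per page to a quarter.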