churro_ocr

Library-first public API for churro-ocr.

class churro_ocr.BatchOCRBackend

Bases: Protocol

Async batch OCR backend interface.

__init__(*args, **kwargs)

async ocr_batch(pages)

Run OCR for multiple pages in one batch.

Parameters:

pages (list[DocumentPage]) – Pages to transcribe in batch order.

Returns:

OCR results in the same order as pages.

Return type:

list[OCRResult]
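
A minimal sketch of a class that satisfies this protocol. The two dataclasses are local stand-ins that mirror only a few documented fields so the block runs without the package installed; real code would import DocumentPage and OCRResult from churro_ocr. EchoBatchBackend and its provider and model strings are hypothetical.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any


# Local stand-ins mirroring a subset of the documented fields; real code
# would use `from churro_ocr import DocumentPage, OCRResult` instead.
@dataclass
class DocumentPage:
    page_index: int
    image: Any
    source_index: int
    metadata: dict = field(default_factory=dict)


@dataclass
class OCRResult:
    text: str
    provider_name: str
    model_name: str
    metadata: dict = field(default_factory=dict)


class EchoBatchBackend:
    """Hypothetical backend that 'transcribes' each page as a fixed label."""

    async def ocr_batch(self, pages):
        # One result per input page, emitted in the same order as `pages`.
        return [
            OCRResult(
                text=f"page {page.page_index}",
                provider_name="echo",
                model_name="echo-v0",
            )
            for page in pages
        ]


pages = [DocumentPage(page_index=i, image=None, source_index=0) for i in (2, 0, 1)]
results = asyncio.run(EchoBatchBackend().ocr_batch(pages))
batch_texts = [result.text for result in results]
```

Because the interface is a Protocol, a class like this needs only a matching ocr_batch method and does not have to inherit from BatchOCRBackend.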

exception churro_ocr.ChurroError

Bases: RuntimeError

Base exception for package-level failures.

exception churro_ocr.ConfigurationError

Bases: ChurroError

Raised when a backend is missing required runtime configuration.

class churro_ocr.DocumentPage

Bases: object

A document page image with optional OCR output attached.

Parameters:
  • page_index – Page position in the current output sequence.

  • image – Page image.

  • source_index – Index of the original source item that produced the page.

  • bbox – Bounding box in source-image coordinates when available.

  • polygon – Polygon in source-image coordinates when available.

  • metadata – Caller-side or detector-side metadata for the page.

  • text – OCR text attached to the page when OCR has been run.

  • provider_name – Provider identifier attached by OCR.

  • model_name – Model name attached by OCR.

  • ocr_metadata – Provider-returned OCR metadata for this page.

__init__(page_index, image, source_index, bbox=None, polygon=(), metadata=<factory>, text=None, provider_name=None, model_name=None, ocr_metadata=<factory>)

Return type:

None

classmethod from_image(image, *, page_index=0, source_index=0, metadata=None)

Create a document page from an in-memory image.

Parameters:
  • image (Image) – Source page image.

  • page_index (int) – Page position to attach to the page.

  • source_index (int) – Source index to attach to the page.

  • metadata (dict[str, Any] | None) – Optional caller-side metadata for the page.

Returns:

New page object with a copied image.

Return type:

DocumentPage

classmethod from_image_path(path, *, page_index=0, source_index=0, metadata=None)

Create a document page from an image path.

Parameters:
  • path (str | Path) – Path to the page image on disk.

  • page_index (int) – Page position to attach to the page.

  • source_index (int) – Source index to attach to the page.

  • metadata (dict[str, Any] | None) – Optional caller-side metadata for the page.

Returns:

New page object loaded from path.

Return type:

DocumentPage

property height: int

Return the current page image height in pixels.

property width: int

Return the current page image width in pixels.

with_ocr(*, text, provider_name, model_name, ocr_metadata=None)

Return a copy of the page with OCR output attached.

Parameters:
  • text (str) – OCR text for the page.

  • provider_name (str) – Provider identifier to attach.

  • model_name (str) – Model name to attach.

  • ocr_metadata (dict[str, Any] | None) – Provider-returned OCR metadata.

Returns:

Copy of the current page with OCR fields filled in.

Return type:

DocumentPage
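
The copy semantics of with_ocr can be illustrated with a small stand-in. This is a sketch of the documented behaviour only: the Page class below is hypothetical, carries a subset of the documented fields, and uses dataclasses.replace, which may differ from how DocumentPage is actually implemented.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Dict, Optional


# Hypothetical stand-in with a subset of the documented fields.
@dataclass(frozen=True)
class Page:
    page_index: int
    text: Optional[str] = None
    provider_name: Optional[str] = None
    model_name: Optional[str] = None
    ocr_metadata: Dict[str, Any] = field(default_factory=dict)

    def with_ocr(self, *, text, provider_name, model_name, ocr_metadata=None):
        # Build a new object; the page this is called on is left unchanged.
        return replace(
            self,
            text=text,
            provider_name=provider_name,
            model_name=model_name,
            ocr_metadata=dict(ocr_metadata or {}),
        )


raw = Page(page_index=0)
done = raw.with_ocr(text="hello", provider_name="demo", model_name="demo-1")
```

After the call, raw still has no text attached, while done carries the OCR fields.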

class churro_ocr.DocumentOCRPipeline

Bases: object

Run page detection and OCR as one document-level pipeline.

The pipeline is the highest-level API in the package. It detects pages from an image or PDF, runs OCR on each detected page, and preserves the page objects in the final result.

__init__(ocr_backend, *, page_detector=None, detection_backend=None, max_concurrency=8)

Create a document OCR pipeline.

Parameters:
  • ocr_backend – OCR backend or async OCR callable used for each page.

  • page_detector – Optional fully constructed page detector to reuse.

  • detection_backend – Optional low-level detection backend used when page_detector is not provided.

  • max_concurrency – Maximum number of page OCR jobs run at once.

Raises:

ConfigurationError – If max_concurrency is less than 1.

Return type:

None

async process_image(request, *, ocr_metadata=None)

Detect pages and OCR a single input image.

Parameters:
  • request (PageDetectionRequest) – Image detection request describing the source image.

  • ocr_metadata (dict[str, Any] | None) – Optional caller-side metadata merged into each page before OCR runs.

Returns:

Document OCR result preserving page order and page images.

Return type:

DocumentOCRResult

process_image_sync(request, *, ocr_metadata=None)

Synchronously detect pages and OCR a single input image.

Parameters:
  • request (PageDetectionRequest) – Image detection request describing the source image.

  • ocr_metadata (dict[str, Any] | None) – Optional caller-side metadata merged into each page before OCR runs.

Returns:

Document OCR result preserving page order and page images.

Return type:

DocumentOCRResult

async process_pdf(path, *, dpi=300, trim_margin=30, ocr_metadata=None)

Rasterize, detect pages, and OCR a PDF.

Parameters:
  • path (str | Path) – PDF path to rasterize and process.

  • dpi (int) – Rasterization DPI used before page detection.

  • trim_margin (int) – Pixel margin added around detected crops.

  • ocr_metadata (dict[str, Any] | None) – Optional caller-side metadata merged into each page before OCR runs.

Returns:

Document OCR result across the rasterized PDF pages.

Return type:

DocumentOCRResult

process_pdf_sync(path, *, dpi=300, trim_margin=30, ocr_metadata=None)

Synchronously rasterize, detect pages, and OCR a PDF.

Parameters:
  • path (str | Path) – PDF path to rasterize and process.

  • dpi (int) – Rasterization DPI used before page detection.

  • trim_margin (int) – Pixel margin added around detected crops.

  • ocr_metadata (dict[str, Any] | None) – Optional caller-side metadata merged into each page before OCR runs.

Returns:

Document OCR result across the rasterized PDF pages.

Return type:

DocumentOCRResult
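
The max_concurrency limit and the order-preserving results described for this pipeline can be sketched with standard asyncio tools. This is one common way to obtain that behaviour, not a description of the library's internals; ocr_pages and fake_ocr are hypothetical names, and ValueError stands in for the documented ConfigurationError.

```python
import asyncio


async def ocr_pages(pages, ocr_one, max_concurrency=8):
    """Run `ocr_one` over `pages` with at most `max_concurrency` jobs in flight."""
    if max_concurrency < 1:
        # The library documents ConfigurationError here; ValueError is a stand-in.
        raise ValueError("max_concurrency must be at least 1")
    gate = asyncio.Semaphore(max_concurrency)

    async def guarded(page):
        async with gate:
            return await ocr_one(page)

    # gather() returns results in argument order, so page order is preserved
    # even when individual jobs finish out of order.
    return await asyncio.gather(*(guarded(page) for page in pages))


stats = {"in_flight": 0, "peak": 0}


async def fake_ocr(page):
    # Hypothetical OCR job that records how many jobs overlap.
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1
    return f"text-{page}"


ordered = asyncio.run(ocr_pages(list(range(6)), fake_ocr, max_concurrency=2))
```

The semaphore caps how many OCR calls overlap, while gather keeps the output aligned with the input page order.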

class churro_ocr.DocumentOCRResult

Bases: object

Document OCR output across all detected pages.

Parameters:
  • pages – OCR-enriched pages in output order.

  • source_type – Input source type, typically "image" or "pdf".

  • metadata – Document-level metadata carried forward from page detection.

__init__(pages, source_type, metadata=<factory>)

Return type:

None

as_ocr_results()

Return plain OCR results in page order.

Returns:

OCRResult objects derived from the current pages.

Return type:

list[OCRResult]

texts()

Return OCR text for each page in order.

Returns:

Plain OCR text for each page. Missing page text is normalized to "".

Return type:

list[str]
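
The normalization described for texts() amounts to replacing a missing page text with an empty string. A minimal sketch, using a hypothetical Page stand-in that exposes only the documented text field:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:  # hypothetical stand-in exposing only the documented `text` field
    text: Optional[str] = None


def texts(pages):
    # Pages without OCR output (text is None) become empty strings,
    # so the returned list always has one str per page.
    return [page.text or "" for page in pages]


page_texts = texts([Page("first"), Page(None), Page("third")])
```

Callers can therefore join or index the returned list without checking for None.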

class churro_ocr.DocumentPageDetector

Bases: object

Detect pages from raw images or PDFs.

__init__(*, backend=None)

Create a document page detector.

Parameters:

backend (PageDetectionBackend | Callable[[Image], Awaitable[list[PageCandidate]]] | None) – Optional low-level detection backend or async callable.

Return type:

None

async detect_image(request)

Detect pages in a single image.

Parameters:

request (PageDetectionRequest) – Detection request describing the source image.

Returns:

Detection result for one image input.

Return type:

PageDetectionResult

detect_image_sync(request)

Synchronously detect pages in a single image.

Parameters:

request (PageDetectionRequest) – Detection request describing the source image.

Returns:

Detection result for one image input.

Return type:

PageDetectionResult

async detect_pdf(path, *, dpi=300, trim_margin=30)

Rasterize a PDF and detect pages on each image.

Parameters:
  • path (str | Path) – PDF path to rasterize.

  • dpi (int) – Rasterization DPI used before detection.

  • trim_margin (int) – Pixel margin added around detected crops.

Returns:

Detection result containing all detected pages from the PDF.

Return type:

PageDetectionResult

detect_pdf_sync(path, *, dpi=300, trim_margin=30)

Synchronously rasterize a PDF and detect pages on each image.

Parameters:
  • path (str | Path) – PDF path to rasterize.

  • dpi (int) – Rasterization DPI used before detection.

  • trim_margin (int) – Pixel margin added around detected crops.

Returns:

Detection result containing all detected pages from the PDF.

Return type:

PageDetectionResult
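
Each async method in this class has a synchronous twin with a _sync suffix. A common way to build such a pair is to drive the coroutine with asyncio.run; whether churro-ocr does exactly this is an assumption, and the Detector class below is a hypothetical stand-in.

```python
import asyncio


class Detector:
    """Hypothetical class pairing an async method with a sync twin."""

    async def detect_image(self, request):
        await asyncio.sleep(0)  # placeholder for real async work
        return f"detected:{request}"

    def detect_image_sync(self, request):
        # asyncio.run() starts a fresh event loop, so a wrapper built this
        # way cannot be called from code already running inside a loop.
        return asyncio.run(self.detect_image(request))


sync_result = Detector().detect_image_sync("scan.png")
```

Under this pattern, scripts and notebooks without an event loop call the _sync variant, while async applications await the coroutine directly.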

class churro_ocr.OCRBackend

Bases: Protocol

Async OCR backend interface.

__init__(*args, **kwargs)

async ocr(page)

Run OCR for one page.

Parameters:

page (DocumentPage) – Page image and page metadata to transcribe.

Returns:

Provider-agnostic OCR result for the page.

Return type:

OCRResult

class churro_ocr.OCRClient

Bases: object

User-facing OCR client with page-first sync and async entrypoints.

__init__(backend)

Create an OCR client.

Parameters:

backend (OCRBackend | Callable[[DocumentPage], Awaitable[OCRResult]]) – OCR backend or async callable used for page OCR.

Return type:

None
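
The backend argument accepts either an object with an async ocr method or a bare async callable. One way to treat both forms uniformly is sketched below; as_ocr_callable, UpperBackend, and lower_ocr are hypothetical names, and the library's own dispatch may differ.

```python
import asyncio


def as_ocr_callable(backend):
    """Return an async callable `page -> result` for either accepted form."""
    method = getattr(backend, "ocr", None)
    if callable(method):
        # Backend object: use its bound async `ocr` method.
        return method
    if callable(backend):
        # Bare async callable: use it as-is.
        return backend
    raise TypeError("backend must expose ocr() or be an async callable")


class UpperBackend:
    """Hypothetical backend object."""

    async def ocr(self, page):
        return str(page).upper()


async def lower_ocr(page):
    """Hypothetical bare async callable."""
    return str(page).lower()


from_object = asyncio.run(as_ocr_callable(UpperBackend())("Page"))
from_function = asyncio.run(as_ocr_callable(lower_ocr)("Page"))
```

Normalizing once at construction time keeps the rest of a client free of type checks.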

async aocr(page)

Run OCR asynchronously for one page.

Parameters:

page (DocumentPage) – Page to transcribe.

Returns:

Copy of page with OCR output attached.

Return type:

DocumentPage

async aocr_image(*, image=None, image_path=None, page_index=0, source_index=0, metadata=None)

Create a single page from an image input and OCR it.

Parameters:
  • image (Image | None) – In-memory page image. Mutually exclusive with image_path.

  • image_path (str | Path | None) – Path to a page image on disk. Mutually exclusive with image.

  • page_index (int) – Page position to attach to the generated page.

  • source_index (int) – Original source index to attach to the generated page.

  • metadata (dict[str, Any] | None) – Optional caller-side metadata attached before OCR runs.

Returns:

OCR-enriched page object.

Raises:

ConfigurationError – If both or neither of image and image_path are provided.

Return type:

DocumentPage

ocr(page)

Run OCR synchronously for one page.

Parameters:

page (DocumentPage) – Page to transcribe.

Returns:

Copy of page with OCR output attached.

Return type:

DocumentPage

ocr_image(*, image=None, image_path=None, page_index=0, source_index=0, metadata=None)

Create a single page from an image input and OCR it synchronously.

Parameters:
  • image (Image | None) – In-memory page image. Mutually exclusive with image_path.

  • image_path (str | Path | None) – Path to a page image on disk. Mutually exclusive with image.

  • page_index (int) – Page position to attach to the generated page.

  • source_index (int) – Original source index to attach to the generated page.

  • metadata (dict[str, Any] | None) – Optional caller-side metadata attached before OCR runs.

Returns:

OCR-enriched page object.

Raises:

ConfigurationError – If both or neither of image and image_path are provided.

Return type:

DocumentPage
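
The "both or neither" rule documented for image and image_path is an exactly-one-of check. A minimal sketch of that validation, with a local ConfigurationError stand-in and a hypothetical pick_source helper:

```python
class ConfigurationError(RuntimeError):
    """Local stand-in for churro_ocr.ConfigurationError."""


def pick_source(image=None, image_path=None):
    # The two `is None` tests are equal when both inputs are set or both
    # are missing, which are exactly the two invalid cases.
    if (image is None) == (image_path is None):
        raise ConfigurationError("Provide exactly one of image or image_path.")
    return "image" if image is not None else "path"


chosen = pick_source(image_path="scan.png")
```

Comparing against None rather than testing truthiness avoids rejecting a valid but falsy argument.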

class churro_ocr.OCRResult

Bases: object

Provider-agnostic OCR result.

Parameters:
  • text – OCR text after any backend-specific postprocessing.

  • provider_name – Stable provider identifier attached to the result.

  • model_name – Human-readable model name attached to the result.

  • metadata – Provider-returned metadata for this OCR call.

__init__(text, provider_name, model_name, metadata=<factory>)

Return type:

None

class churro_ocr.PageDetectionBackend

Bases: Protocol

Async interface for page detection.

__init__(*args, **kwargs)

async detect(image)

Detect page candidates from one image.

Parameters:

image (Image) – Source image to analyze.

Returns:

Page candidates in reading order.

Return type:

list[PageCandidate]

class churro_ocr.PageDetector

Bases: object

Detect one or more page crops from an input image.

__init__(backend=None)

Create a page detector.

Parameters:

backend (PageDetectionBackend | Callable[[Image], Awaitable[list[PageCandidate]]] | None) – Optional low-level backend or async callable. When not provided, the full input image is treated as a single page.

Return type:

None

async adetect(request)

Asynchronously detect pages for a single image.

Parameters:

request (PageDetectionRequest) – Detection request describing the source image.

Returns:

Detected page crops in reading order.

Return type:

list[DocumentPage]

detect(request)

Synchronously detect pages for a single image.

Parameters:

request (PageDetectionRequest) – Detection request describing the source image.

Returns:

Detected page crops in reading order.

Return type:

list[DocumentPage]

class churro_ocr.PageCandidate

Bases: object

Intermediate page candidate returned by a page detector.

Parameters:
  • bbox – Bounding box in source-image coordinates.

  • image – Optional already-cropped page image. When provided, detection callers use this image directly instead of cropping from bbox or polygon.

  • polygon – Optional polygon in source-image coordinates.

  • metadata – Detector-side metadata attached to the candidate.

__init__(bbox=None, image=None, polygon=(), metadata=<factory>)

Return type:

None

class churro_ocr.PageDetectionRequest

Bases: object

Request payload for image page detection.

Parameters:
  • image – In-memory image to detect pages from. Mutually exclusive with image_path.

  • image_path – Path to an image on disk. Mutually exclusive with image.

  • trim_margin – Margin in pixels to add around detected crops.

__init__(image=None, image_path=None, trim_margin=30)

Return type:

None

require_image()

Return the input image, loading it from disk when needed.

Returns:

Copy of the requested image.

Raises:

ConfigurationError – If both or neither of image and image_path are provided.

Return type:

Image
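
To make the trim_margin parameter of this request concrete, the sketch below grows a detected box by a pixel margin. It rests on two assumptions the reference does not state: that a bbox is a (left, top, right, bottom) tuple, and that the grown box is clamped to the image bounds. expand_bbox is a hypothetical helper.

```python
def expand_bbox(bbox, trim_margin, width, height):
    """Grow a (left, top, right, bottom) box by trim_margin, clamped to the image.

    The tuple layout and the clamping are assumptions made for illustration.
    """
    left, top, right, bottom = bbox
    return (
        max(0, left - trim_margin),
        max(0, top - trim_margin),
        min(width, right + trim_margin),
        min(height, bottom + trim_margin),
    )


expanded = expand_bbox((100, 20, 400, 580), trim_margin=30, width=420, height=600)
# → (70, 0, 420, 600)
```

With the default margin of 30, a box that sits close to an image edge is limited by that edge rather than extending past it.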

class churro_ocr.PageDetectionResult

Bases: object

Page detection output for an image or PDF.

Parameters:
  • pages – Detected pages in output order.

  • source_type – Input source type, typically "image" or "pdf".

  • metadata – Detection-level metadata, such as PDF rasterization settings.

__init__(pages, source_type, metadata=<factory>)

Return type:

None

exception churro_ocr.ProviderError

Bases: ChurroError

Raised when an OCR or page detection provider returns an unusable response.
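
The documented hierarchy (ConfigurationError and ProviderError both derive from ChurroError, which derives from RuntimeError) lets callers handle specific failures first and fall back to the base class. The classes below are local stand-ins with the same bases so the sketch runs without the package; classify is a hypothetical helper.

```python
# Local stand-ins that reproduce the documented base classes.
class ChurroError(RuntimeError):
    pass


class ConfigurationError(ChurroError):
    pass


class ProviderError(ChurroError):
    pass


def classify(exc):
    """Map an exception to a handling decision, most specific type first."""
    try:
        raise exc
    except ConfigurationError:
        return "fix configuration"
    except ProviderError:
        return "inspect provider response"
    except ChurroError:
        return "other package failure"


decisions = [
    classify(ConfigurationError("missing key")),
    classify(ProviderError("empty response")),
    classify(ChurroError("unexpected")),
]
```

A single except ChurroError clause is enough when the caller does not need to distinguish the cause.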