API Reference

hocr_tools_lib.tools.hocr_check

Check the given file for conformance with the hOCR format spec.

class hocr_tools_lib.tools.hocr_check.Checker(hocr_file: PathLike[str], no_overlap: bool = False)

Container holding all the checks.

Parameters:
  • hocr_file – hOCR file to check.

  • no_overlap – Disable the overlap checks.

check() None

Top-level check method executing all checks.

check_geometry() None

Check geometry-related aspects.

check_xml_structure() None

Check the XML structure.

test_counter: int = 0

Number of checks performed.

test_ok(v: bool, msg: str) None

Report the status of the current check to stderr.

Parameters:
  • v – The test result.

  • msg – The message to display.

hocr_tools_lib.tools.hocr_combine

Combine multiple hOCR documents into one.

hocr_tools_lib.tools.hocr_combine.combine(filenames: list[str]) str

Combine the given hOCR documents into one.

Parameters:

filenames – hOCR documents to combine.

Returns:

The combined hOCR document content.

hocr_tools_lib.tools.hocr_cut

Cut a page (horizontally) into two pages in the middle such that most of the bounding boxes are separated nicely, e.g. for cutting double pages or double columns.

hocr_tools_lib.tools.hocr_cut.cut(hocr: PathLike[str], debug: bool = False) None

Cut the given hOCR file.

Generates one image file per column. Each file has the same basename as the input file, with the suffix .left or .right added before the extension.

Parameters:
  • hocr – hOCR file to cut.

  • debug – Also create a third image file with the suffix .cut that contains debugging output.

hocr_tools_lib.tools.hocr_eval_geom

Compute statistics about the quality of the geometric segmentation at the level of the given hOCR element.

class hocr_tools_lib.tools.hocr_eval_geom.Boxstats(multiple: int = 0, missing: int = 0, error: float = 0.0, count: int = 0)

count: int = 0

Total number of boxes.

error: float = 0.0

Aggregated relative overlap for close matches.

missing: int = 0

Number of boxes with no close match.

multiple: int = 0

Number of boxes with more than one significant overlap.

to_tuple() tuple[int, int, float, int]

Convert to a tuple.

Returns:

The values (multiple, missing, error, count).
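
The documented fields and tuple order can be mirrored in a small stand-in dataclass. This is only a sketch for illustration: BoxstatsSketch is a hypothetical name, not the library class, and it reproduces nothing beyond the field names, defaults and tuple order listed above.

```python
from dataclasses import astuple, dataclass


@dataclass
class BoxstatsSketch:
    """Hypothetical stand-in mirroring the documented Boxstats fields."""

    multiple: int = 0   # boxes with more than one significant overlap
    missing: int = 0    # boxes with no close match
    error: float = 0.0  # aggregated relative overlap for close matches
    count: int = 0      # total number of boxes

    def to_tuple(self):
        # Field declaration order matches (multiple, missing, error, count).
        return astuple(self)


stats = BoxstatsSketch(multiple=1, missing=2, error=0.25, count=10)
multiple, missing, error, count = stats.to_tuple()
```

Unpacking the tuple as shown is convenient when the four values are written to a report line in a fixed order.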

hocr_tools_lib.tools.hocr_eval_geom.boxstats(truths: list[tuple[float, float, float, float] | None], actuals: list[tuple[float, float, float, float] | None], significant_overlap: float = 0.1, close_match: float = 0.9) Boxstats

Determine the box statistics for the given set of boxes.

Parameters:
  • truths – Ground truth boxes.

  • actuals – Actual boxes.

  • significant_overlap – Lower bound for a significant overlap.

  • close_match – Lower bound for an overlap to count as a close match.

Returns:

The corresponding statistics.

hocr_tools_lib.tools.hocr_eval_geom.check_bad_partition(boxes: list[tuple[float, float, float, float] | None], significant_overlap: float = 0.1) bool

Check if the given boxes are badly partitioned as they overlap too much.

Parameters:
  • boxes – The boxes to check.

  • significant_overlap – Lower bound for a significant overlap.

Returns:

Whether the partitioning is badly done.

hocr_tools_lib.tools.hocr_eval_geom.evaluate_geometries(truth: PathLike[str], actual: PathLike[str], element: str = 'ocr_line', significant_overlap: float = 0.1, close_match: float = 0.9) Generator[tuple[Boxstats, Boxstats]]

Evaluate the geometries for the given files.

Parameters:
  • truth – hOCR file with ground truth.

  • actual – hOCR file with actual data.

  • element – hOCR element to look at.

  • significant_overlap – Lower bound for a significant overlap.

  • close_match – Lower bound for an overlap to count as a close match.

Returns:

For each set of pages, a tuple of the statistics checking the actual values against the truth values and vice versa.

hocr_tools_lib.tools.hocr_eval_lines

Compute statistics about the quality of the geometric segmentation.

hocr_tools_lib.tools.hocr_eval_lines.evaluate_lines(tfile: SupportsRead[str], hfile: PathLike[str], verbose: bool = False) tuple[int, int]

Run the evaluation.

Parameters:
  • tfile – Text file with the true lines.

  • hfile – hOCR file with the actually recognized lines.

  • verbose – Whether to log additional information for each line.

Returns:

The number of segmentation and OCR errors.

hocr_tools_lib.tools.hocr_eval

Compute statistics about the general quality of the hOCR data.

hocr_tools_lib.tools.hocr_eval.evaluate(truth: PathLike[str], actual: PathLike[str], img_file: SupportsRead[bytes] | None = None, debug: bool = False, verbose: bool = False) tuple[Image | None, int, int, int]

Perform the evaluation.

Parameters:
  • truth – hOCR file with ground truth.

  • actual – hOCR file with actual data.

  • img_file – Optional image file. If set, draw the bboxes of the lines onto it and save it to errors.png.

  • debug – Log additional debug information.

  • verbose – Log additional error data.

Returns:

The image with the bboxes, the number of segmentation errors (expected and actual bboxes not similar enough), the number of OCR segmentation errors (number of differing characters due to segmentation) and the number of OCR errors (number of differing characters).

hocr_tools_lib.tools.hocr_extract_images

Extract the images and text within all the requested elements within the hOCR file.

hocr_tools_lib.tools.hocr_extract_images.extract_images(hocr: SupportsReadClose[str], basename: str, pattern: str = 'line-%03d.png', element: str = 'ocr_line', pad: str | None = None, unicode_dammit: bool = False) None

Extract the images from the given document.

Parameters:
  • hocr – hOCR file to use.

  • basename – Image directory.

  • pattern – Output file pattern to use.

  • element – hOCR element to look into.

  • pad – Extra padding for bounding box. If set, either one number for all four sides or four numbers separated by a comma.

  • unicode_dammit – Attempt to use BeautifulSoup.UnicodeDammit for fixing encoding issues.
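
The two accepted forms of the pad value can be illustrated with a small parser. This is a sketch under assumptions: parse_pad is a hypothetical helper, not part of the library, and it assumes integer padding values and that a missing pad means no padding. Which side each of the four numbers refers to is not stated above, so the sketch only preserves their order.

```python
from typing import Optional


def parse_pad(pad: Optional[str]):
    """Hypothetical helper: expand a pad string into four values."""
    if pad is None:
        # Assumption: no pad given means no extra padding.
        return (0, 0, 0, 0)
    parts = [int(p) for p in pad.split(",")]
    if len(parts) == 1:
        # One number applies to all four sides.
        return (parts[0],) * 4
    if len(parts) == 4:
        # Four comma-separated numbers, kept in the given order.
        return tuple(parts)
    raise ValueError(f"expected 1 or 4 comma-separated numbers, got {pad!r}")


uniform = parse_pad("5")
per_side = parse_pad("1,2,3,4")
```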

hocr_tools_lib.tools.hocr_lines

Extract the text within all the ocr_line elements within the hOCR file.

hocr_tools_lib.tools.hocr_lines.lines(hocr: PathLike[str]) Generator[str]

Extract the lines from the given document.

Parameters:

hocr – hOCR file to extract from.

Returns:

The corresponding lines.
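
To show what extracting the text of ocr_line elements amounts to, here is a self-contained sketch using only the standard library. It is not the library's implementation: iter_line_texts is a hypothetical function, it takes the document text instead of a file path, it requires well-formed XHTML, and the whitespace normalization is an assumption.

```python
import xml.etree.ElementTree as ET

HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<div class="ocr_page" title="bbox 0 0 600 800">
<span class="ocr_line" title="bbox 10 10 590 40"><span class="ocrx_word">Hello</span> <span class="ocrx_word">world</span></span>
<span class="ocr_line" title="bbox 10 50 590 80"><span class="ocrx_word">Second</span> <span class="ocrx_word">line</span></span>
</div>
</body>
</html>"""


def iter_line_texts(hocr_text):
    """Yield the text of every element whose class list contains ocr_line."""
    root = ET.fromstring(hocr_text)
    for node in root.iter():
        if "ocr_line" in node.get("class", "").split():
            # Join all nested text and collapse runs of whitespace.
            yield " ".join("".join(node.itertext()).split())


line_texts = list(iter_line_texts(HOCR))
```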

hocr_tools_lib.tools.hocr_merge_dc

Merge Dublin Core metadata into hOCR header files.

hocr_tools_lib.tools.hocr_merge_dc.merge_dc(dc: PathLike[str], hocr: PathLike[str]) bytes

Merge the metadata into the hOCR file.

Parameters:
  • dc – The Dublin Core metadata file.

  • hocr – The hOCR input file.

Returns:

The generated hOCR data.

hocr_tools_lib.tools.hocr_pdf

Create a searchable PDF from a pile of hOCR + JPEG. Tested with Tesseract.

exception hocr_tools_lib.tools.hocr_pdf.NoImagesFoundError

Custom error class when no images could be found.

hocr_tools_lib.tools.hocr_pdf.add_text_layer(pdf: Canvas, image: str, height: float, dpi: int) None

Draw an invisible text layer for OCR data.

Parameters:
  • pdf – The PDF canvas to add the layer to.

  • image – The image path to determine the hOCR file from.

  • height – The page height to use for positioning/scaling.

  • dpi – The resolution to use for positioning/scaling.
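
The role of height and dpi in positioning can be illustrated with the usual pixel-to-point conversion. PDF coordinates are measured in points (72 per inch) from the bottom-left corner, while hOCR bounding boxes are in pixels from the top-left corner. The helpers px_to_pt and bbox_to_pdf below are hypothetical, and the sketch assumes the page height is given in points; the library may handle this differently.

```python
def px_to_pt(pixels, dpi):
    """Convert a pixel length to PDF points (1 point = 1/72 inch)."""
    return pixels * 72.0 / dpi


def bbox_to_pdf(bbox, page_height_pt, dpi):
    """Map an hOCR bbox (x0, y0, x1, y1) to a PDF (x, y, width, height)."""
    x0, y0, x1, y1 = bbox
    x = px_to_pt(x0, dpi)
    # Flip the vertical axis: the bottom edge of the box (y1) becomes
    # the distance from the bottom of the page.
    y = page_height_pt - px_to_pt(y1, dpi)
    return (x, y, px_to_pt(x1 - x0, dpi), px_to_pt(y1 - y0, dpi))


# An 11 inch tall page scanned at 300 dpi is 3300 px, i.e. 792 pt.
page_height_pt = px_to_pt(3300, 300)
rect = bbox_to_pdf((300, 600, 1500, 750), page_height_pt, 300)
```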

hocr_tools_lib.tools.hocr_pdf.export_pdf(directory: str | Path, savefile: str | Path, default_dpi: int = 300) None

Create a searchable PDF from a pile of hOCR + JPEG.

Parameters:
  • directory – The input directory to use.

  • savefile – Save the PDF file to this file.

  • default_dpi – The image resolution to use.

hocr_tools_lib.tools.hocr_pdf.load_invisible_font() None

Load the invisible font to use for rendering into reportlab.

hocr_tools_lib.tools.hocr_split

Split an hOCR file into individual pages.

hocr_tools_lib.tools.hocr_split.split(hocr: PathLike[str] | str, pattern: str = 'base-%03d.html') None

Split the given hOCR file into multiple pages.

Parameters:
  • hocr – hOCR file to split.

  • pattern – Naming pattern for the output files.
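
The default pattern is a printf-style template with one integer placeholder. The following sketch shows how such a pattern expands, assuming it is filled with the page index via the % operator and that numbering starts at 1; neither detail is confirmed above.

```python
pattern = "base-%03d.html"

# %03d pads the page index with zeros to a width of three digits.
names = [pattern % index for index in range(1, 4)]
```

A pattern without an integer placeholder would raise a TypeError when formatted this way, so custom patterns should keep exactly one such placeholder.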

hocr_tools_lib.tools.hocr_wordfreq

Calculate the word frequency inside an hOCR file.

hocr_tools_lib.tools.hocr_wordfreq.word_frequencies(hocr_in: PathLike[str] | str, case_insensitive: bool = False, spaces: bool = False, dehyphenate: bool = False, max_hits: int = 10) Generator[str]

Determine the word frequencies.

Parameters:
  • hocr_in – hOCR file to analyze.

  • case_insensitive – Ignore the casing of the words.

  • spaces – Split on spaces only.

  • dehyphenate – Try to dehyphenate the text.

  • max_hits – Number of hits to return.

Returns:

Up to max_hits of the most used words.
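
The effect of case_insensitive, spaces and max_hits can be sketched on plain text with collections.Counter. This is an illustration, not the library's code: top_words is a hypothetical function, it works on a string instead of an hOCR file, it leaves out dehyphenation, and it returns (word, count) pairs while the documented generator yields strings of an unspecified format. The word-splitting regular expression is likewise an assumption.

```python
import re
from collections import Counter


def top_words(text, case_insensitive=False, spaces=False, max_hits=10):
    """Hypothetical sketch: return the max_hits most frequent words."""
    if case_insensitive:
        text = text.lower()
    # Either split on whitespace only, or pick out runs of word characters.
    words = text.split() if spaces else re.findall(r"\w+", text)
    return Counter(words).most_common(max_hits)


sample = "The cat saw the dog. The dog ran."
folded = top_words(sample, case_insensitive=True, max_hits=2)
raw = top_words(sample, spaces=True, max_hits=1)
```

Splitting on spaces only keeps punctuation attached to words, so "dog." and "dog" are counted separately in the second call.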

hocr_tools_lib.utils.edit_utils

hocr_tools_lib.utils.edit_utils.edit_distance(a: str, b: str, threshold: int = 99999) int

Determine the editing distance between the two strings.

Parameters:
  • a – The first string.

  • b – The second string.

  • threshold – Threshold on which to perform an early return.

Returns:

The editing distance.
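
A threshold-based early return can be sketched with the classic Levenshtein dynamic program. The sketch assumes the library computes the Levenshtein distance, which is not confirmed above, and edit_distance_sketch is a hypothetical name. When the threshold is exceeded, this version returns a lower bound of the true distance; the value the library returns in that case may differ.

```python
def edit_distance_sketch(a, b, threshold=99999):
    """Levenshtein distance with an early exit above the threshold."""
    if a == b:
        return 0
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        # The row minimum never decreases, so it bounds the final result.
        if min(current) > threshold:
            return min(current)
        previous = current
    return previous[-1]


distance = edit_distance_sketch("kitten", "sitting")
bounded = edit_distance_sketch("aaaa", "bbbb", threshold=1)
```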

hocr_tools_lib.utils.edit_utils.remove_tex(text: str) str

Remove TeX from the given string.

Currently not implemented, thus always returning the original input.

Parameters:

text – The string to clean.

Returns:

The cleaned string.

hocr_tools_lib.utils.node_utils

hocr_tools_lib.utils.node_utils.get_bbox(node: HtmlElement) tuple[float, float, float, float] | None

Get the bounding box declared for the given node.

Parameters:

node – The node to run on.

Returns:

The bounding box, or None if not found.

hocr_tools_lib.utils.node_utils.get_prop(node: HtmlElement, name: str, strip_value: bool = False) str | None

Get the requested property from the node title.

Parameters:
  • node – The node to work on.

  • name – The property to retrieve.

  • strip_value – Whether to strip single quotation marks.

Returns:

The requested property, or None if not found.
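
In hOCR, properties are stored in the title attribute as semicolon-separated "name value" entries, for example "bbox 0 0 600 800". The sketch below parses such a string directly. get_prop_sketch and get_bbox_sketch are hypothetical functions that take the title text instead of an HtmlElement, and the simple split on ";" would fail for quoted values that themselves contain a semicolon.

```python
def get_prop_sketch(title, name, strip_value=False):
    """Return the value of the named property, or None if it is absent."""
    for part in title.split(";"):
        key, _, value = part.strip().partition(" ")
        if key == name:
            value = value.strip()
            # Optionally drop surrounding single quotation marks.
            return value.strip("'") if strip_value else value
    return None


def get_bbox_sketch(title):
    """Return the bbox as four floats, or None if absent or malformed."""
    value = get_prop_sketch(title, "bbox")
    if value is None:
        return None
    coords = tuple(float(v) for v in value.split())
    return coords if len(coords) == 4 else None


title = "image 'page-001.png'; bbox 0 0 600 800; ppageno 0"
image = get_prop_sketch(title, "image", strip_value=True)
bbox = get_bbox_sketch(title)
```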

hocr_tools_lib.utils.node_utils.get_text(node: HtmlElement) str

Get the text from the given node.

Parameters:

node – The node to run on.

Returns:

The text of the given node.

hocr_tools_lib.utils.rectangle_utils

hocr_tools_lib.utils.rectangle_utils.RectangleType

Custom type for wrapping a simple rectangle.

alias of tuple[float, float, float, float]

hocr_tools_lib.utils.rectangle_utils.area(u: tuple[float, float, float, float] | None) float

Area of a rectangle.

hocr_tools_lib.utils.rectangle_utils.erode(u: tuple[float, float, float, float] | None, tx: float, ty: float) tuple[float, float, float, float] | None

Erode the given rectangle with the given transformation factors.

hocr_tools_lib.utils.rectangle_utils.height(u: tuple[float, float, float, float] | None) float

Height of a rectangle.

hocr_tools_lib.utils.rectangle_utils.intersect(u: tuple[float, float, float, float] | None, v: tuple[float, float, float, float] | None) tuple[float, float, float, float] | None

Intersection of two rectangles.

hocr_tools_lib.utils.rectangle_utils.mostly_non_overlapping(boxes: list[tuple[float, float, float, float] | None], significant_overlap: float = 0.2) bool

Check if the given boxes do not overlap more than the given threshold.

hocr_tools_lib.utils.rectangle_utils.overlaps(u: tuple[float, float, float, float] | None, v: tuple[float, float, float, float] | None) bool

Predicate: Do the two rectangles overlap?

hocr_tools_lib.utils.rectangle_utils.relative_overlap(u: tuple[float, float, float, float] | None, v: tuple[float, float, float, float] | None) float

Relative overlap of the two rectangles, i.e. the area of their intersection relative to the area of the larger rectangle.

hocr_tools_lib.utils.rectangle_utils.width(u: tuple[float, float, float, float] | None) float

Width of a rectangle.
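
The relationship between area, intersect and relative_overlap can be sketched as follows. These are stand-in implementations under assumptions, not the library's code: rectangles are taken to be (x0, y0, x1, y1) tuples, None is treated as an empty rectangle, and disjoint rectangles intersect to None.

```python
def area(u):
    """Area of a rectangle; None counts as empty."""
    if u is None:
        return 0.0
    return max(0.0, u[2] - u[0]) * max(0.0, u[3] - u[1])


def intersect(u, v):
    """Intersection of two rectangles, or None if they do not overlap."""
    if u is None or v is None:
        return None
    r = (max(u[0], v[0]), max(u[1], v[1]), min(u[2], v[2]), min(u[3], v[3]))
    if r[0] >= r[2] or r[1] >= r[3]:
        return None
    return r


def relative_overlap(u, v):
    """Intersection area divided by the area of the larger rectangle."""
    larger = max(area(u), area(v))
    if larger == 0:
        return 0.0
    return area(intersect(u, v)) / larger


a = (0.0, 0.0, 10.0, 10.0)
b = (5.0, 0.0, 15.0, 10.0)
c = (20.0, 20.0, 30.0, 30.0)
half = relative_overlap(a, b)
none = relative_overlap(a, c)
```

With this definition, a value close to 1 means the two boxes nearly coincide, which is how thresholds such as significant_overlap and close_match can be read.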

hocr_tools_lib.utils.text_utils

hocr_tools_lib.utils.text_utils.normalize(s: str) str

Normalize the given string.

hocr_tools_lib.utils.typing_utils

class hocr_tools_lib.utils.typing_utils.SupportsRead(*args, **kwargs)

Type of file that supports reading.

class hocr_tools_lib.utils.typing_utils.SupportsReadClose(*args, **kwargs)

Type of file that supports reading and closing.

Mostly relevant for lxml support.
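
Structural types like these are commonly written as typing.Protocol classes. The following sketch shows one possible shape; the names SupportsReadSketch and SupportsReadCloseSketch are hypothetical, and the exact method signatures used by the library are an assumption.

```python
import io
from typing import Protocol, TypeVar, runtime_checkable

T_co = TypeVar("T_co", covariant=True)


@runtime_checkable
class SupportsReadSketch(Protocol[T_co]):
    """Anything with a read() method returning str or bytes."""

    def read(self, size: int = -1) -> T_co: ...


@runtime_checkable
class SupportsReadCloseSketch(SupportsReadSketch[T_co], Protocol[T_co]):
    """Readable objects that can also be closed."""

    def close(self) -> None: ...


class ReadOnlySource:
    """Minimal object that can be read but not closed."""

    def read(self, size: int = -1) -> str:
        return "data"


buffer = io.StringIO("<html></html>")
readable = isinstance(buffer, SupportsReadCloseSketch)
closable = isinstance(ReadOnlySource(), SupportsReadCloseSketch)
```

Because the check is structural, open file objects and in-memory buffers both qualify without inheriting from either class.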