API Reference¶

hocr_tools_lib.tools.hocr_check¶

Check the given file for conformance with the hOCR format spec.

class hocr_tools_lib.tools.hocr_check.Checker(hocr_file: PathLike[str], no_overlap: bool = False)¶

Container holding all the checks.

Parameters:

hocr_file – hOCR file to check.
no_overlap – Disable the overlap checks.

check() → None¶: Top-level check method executing all checks.

check_geometry() → None¶: Check geometry-related aspects.

check_xml_structure() → None¶: Check the XML structure.

test_counter: int = 0¶: Number of checks performed.

test_ok(v: bool, msg: str) → None¶

Report the status of the current check to stderr.

Parameters:

v – The test result.
msg – The message to display.

hocr_tools_lib.tools.hocr_combine¶

Combine multiple hOCR documents into one.

hocr_tools_lib.tools.hocr_combine.combine(filenames: list[str]) → str¶

Combine the given hOCR documents into one.

Parameters:

filenames – hOCR documents to combine.

Returns:

The combined hOCR document content.

hocr_tools_lib.tools.hocr_cut¶

Cut a page (horizontally) into two pages in the middle such that most of the bounding boxes are separated cleanly, e.g. for cutting double pages or double columns.

hocr_tools_lib.tools.hocr_cut.cut(hocr: PathLike[str], debug: bool = False) → None¶

Cut the given hOCR file.

Generates one image file per column. Each file has the same basename as the input file, with the suffix .left or .right inserted before the extension.

Parameters:

hocr – hOCR file to cut.
debug – Also create a third image file, with the suffix .cut, containing some debugging output.

hocr_tools_lib.tools.hocr_eval_geom¶

Compute statistics about the quality of the geometric segmentation at the level of the given hOCR element.

class hocr_tools_lib.tools.hocr_eval_geom.Boxstats(multiple: int = 0, missing: int = 0, error: float = 0.0, count: int = 0)¶

count: int = 0¶: Total number of boxes.

error: float = 0.0¶: Aggregated relative overlap for close matches.

missing: int = 0¶: Number of boxes with no close match.

multiple: int = 0¶: Number of boxes with more than one significant overlap.

to_tuple() → tuple[int, int, float, int]¶

Convert to a tuple.

Returns:

The values (multiple, missing, error, count).

hocr_tools_lib.tools.hocr_eval_geom.boxstats(truths: list[tuple[float, float, float, float] | None], actuals: list[tuple[float, float, float, float] | None], significant_overlap: float = 0.1, close_match: float = 0.9) → Boxstats¶

Determine the box statistics for the given set of boxes.

Parameters:

truths – Ground truth boxes.
actuals – Actual boxes.
significant_overlap – Lower bound for a significant overlap.
close_match – Lower bound for an overlap to count as a close match.

Returns:

The corresponding statistics.
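
The sketch below illustrates how the four Boxstats counters could be derived from the two thresholds. It is not the library's implementation: the names Stats and box_statistics are hypothetical, the boxes are assumed to be (x0, y0, x1, y1) tuples, and treating error as the sum of 1 minus the best relative overlap is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


@dataclass
class Stats:
    """Hypothetical stand-in for the documented Boxstats fields."""
    multiple: int = 0
    missing: int = 0
    error: float = 0.0
    count: int = 0


def _area(u: Rect) -> float:
    return max(0.0, u[2] - u[0]) * max(0.0, u[3] - u[1])


def _relative_overlap(u: Rect, v: Rect) -> float:
    # Intersection area divided by the area of the larger rectangle.
    w = min(u[2], v[2]) - max(u[0], v[0])
    h = min(u[3], v[3]) - max(u[1], v[1])
    largest = max(_area(u), _area(v))
    if w <= 0 or h <= 0 or largest == 0:
        return 0.0
    return (w * h) / largest


def box_statistics(
    truths: List[Optional[Rect]],
    actuals: List[Optional[Rect]],
    significant_overlap: float = 0.1,
    close_match: float = 0.9,
) -> Stats:
    stats = Stats()
    candidates = [a for a in actuals if a is not None]
    for truth in truths:
        if truth is None:
            continue
        stats.count += 1
        overlaps = [_relative_overlap(truth, a) for a in candidates]
        # More than one significant overlap suggests a split or merged box.
        if sum(o > significant_overlap for o in overlaps) > 1:
            stats.multiple += 1
        best = max(overlaps, default=0.0)
        if best < close_match:
            stats.missing += 1
        else:
            # Assumption: the error grows as the best match gets worse.
            stats.error += 1.0 - best
    return stats
```

With one exactly matching box and one unmatched box, this sketch reports count 2, missing 1 and error 0.0.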

hocr_tools_lib.tools.hocr_eval_geom.check_bad_partition(boxes: list[tuple[float, float, float, float] | None], significant_overlap: float = 0.1) → bool¶

Check whether the given boxes are badly partitioned, i.e. whether they overlap too much.

Parameters:

boxes – The boxes to check.
significant_overlap – Lower bound for a significant overlap.

Returns:

Whether the boxes are badly partitioned.

hocr_tools_lib.tools.hocr_eval_geom.evaluate_geometries(truth: PathLike[str], actual: PathLike[str], element: str = 'ocr_line', significant_overlap: float = 0.1, close_match: float = 0.9) → Generator[tuple[Boxstats, Boxstats]]¶

Evaluate the geometries for the given files.

Parameters:

truth – hOCR file with ground truth.
actual – hOCR file with actual data.
element – hOCR element to look at.
significant_overlap – Lower bound for a significant overlap.
close_match – Lower bound for an overlap to count as a close match.

Returns:

For each set of pages, a tuple of the statistics checking the actual values against the truth values and vice versa.

hocr_tools_lib.tools.hocr_eval_lines¶

Compute statistics about the quality of the geometric segmentation.

hocr_tools_lib.tools.hocr_eval_lines.evaluate_lines(tfile: SupportsRead[str], hfile: PathLike[str], verbose: bool = False) → tuple[int, int]¶

Run the evaluation.

Parameters:

tfile – Text file with the true lines.
hfile – hOCR file with the actually recognized lines.
verbose – Whether to log additional information for each line.

Returns:

The number of segmentation and OCR errors.

hocr_tools_lib.tools.hocr_eval¶

Compute statistics about the general quality of the hOCR data.

hocr_tools_lib.tools.hocr_eval.evaluate(truth: PathLike[str], actual: PathLike[str], img_file: SupportsRead[bytes] | None = None, debug: bool = False, verbose: bool = False) → tuple[Image | None, int, int, int]¶

Perform the evaluation.

Parameters:

truth – hOCR file with ground truth.
actual – hOCR file with actual data.
img_file – Optional image file. If set, draw the bboxes of the lines onto it and save it to errors.png.
debug – Log additional debug information.
verbose – Log additional error data.

Returns:

The image with the bboxes, the number of segmentation errors (expected and actual bboxes not similar enough), the number of OCR segmentation errors (number of differing characters due to segmentation) and the number of OCR errors (number of differing characters).

hocr_tools_lib.tools.hocr_extract_images¶

Extract the images and text within all the requested elements within the hOCR file.

hocr_tools_lib.tools.hocr_extract_images.extract_images(hocr: SupportsReadClose[str], basename: str, pattern: str = 'line-%03d.png', element: str = 'ocr_line', pad: str | None = None, unicode_dammit: bool = False) → None¶

Extract the images from the given document.

Parameters:

hocr – hOCR file to use.
basename – Image directory.
pattern – Output file pattern to use.
element – hOCR element to look into.
pad – Extra padding for the bounding box. If set, either one number for all four sides or four numbers separated by commas.
unicode_dammit – Attempt to use BeautifulSoup.UnicodeDammit for fixing encoding issues.
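
As an illustration of the pad format described above, the following sketch parses such a string into four values. The helper name parse_pad is hypothetical, the use of integers is an assumption, and the order of the four sides is not stated in this reference, so the tuple is returned in the order given.

```python
from typing import Optional, Tuple


def parse_pad(pad: Optional[str]) -> Tuple[int, int, int, int]:
    """Parse "5" or "1,2,3,4" into four padding values (hypothetical helper)."""
    if pad is None:
        # No padding requested.
        return (0, 0, 0, 0)
    values = [int(part) for part in pad.split(",")]
    if len(values) == 1:
        # One number applies to all four sides.
        return (values[0], values[0], values[0], values[0])
    if len(values) == 4:
        # Four numbers are kept in the order given; which side each
        # one refers to is not specified here.
        return (values[0], values[1], values[2], values[3])
    raise ValueError("pad must contain either one or four numbers")
```

A call such as parse_pad("5") yields the same padding on every side, while any count other than one or four raises a ValueError.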

hocr_tools_lib.tools.hocr_lines¶

Extract the text within all the ocr_line elements within the hOCR file.

hocr_tools_lib.tools.hocr_lines.lines(hocr: PathLike[str]) → Generator[str]¶

Extract the lines from the given document.

Parameters:

hocr – hOCR file to extract from.

Returns:

The corresponding lines.

hocr_tools_lib.tools.hocr_merge_dc¶

Merge Dublin Core metadata into hOCR header files.

hocr_tools_lib.tools.hocr_merge_dc.merge_dc(dc: PathLike[str], hocr: PathLike[str]) → bytes¶

Merge the metadata into the hOCR file.

Parameters:

dc – The Dublin Core metadata file.
hocr – The hOCR input file.

Returns:

The generated hOCR data.

hocr_tools_lib.tools.hocr_pdf¶

Create a searchable PDF from a pile of hOCR + JPEG. Tested with Tesseract.

exception hocr_tools_lib.tools.hocr_pdf.NoImagesFoundError¶: Custom error raised when no images could be found.

hocr_tools_lib.tools.hocr_pdf.add_text_layer(pdf: Canvas, image: str, height: float, dpi: int) → None¶

Draw an invisible text layer for OCR data.

Parameters:

pdf – The PDF canvas to add the layer to.
image – The image path to determine the hOCR file from.
height – The page height to use for positioning/scaling.
dpi – The resolution to use for positioning/scaling.
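
The height and dpi parameters suggest a conversion from image pixels to PDF points. The sketch below shows one common way to do this, assuming hOCR boxes are (x0, y0, x1, y1) in pixels with the origin at the top left and the PDF origin at the bottom left. The function name to_pdf_points is hypothetical and this is not necessarily how add_text_layer positions its text.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]


def to_pdf_points(bbox: Rect, page_height_pt: float, dpi: int) -> Rect:
    """Convert a pixel bounding box to PDF points (hypothetical helper)."""
    x0, y0, x1, y1 = bbox

    def pt(px: float) -> float:
        # One PDF point is 1/72 inch, so pixels scale by 72 / dpi.
        return px * 72.0 / dpi

    # Flip the y axis: the bottom edge of the box (y1 in image
    # coordinates) becomes the lower y value on the PDF page.
    return (pt(x0), page_height_pt - pt(y1), pt(x1), page_height_pt - pt(y0))
```

At 300 dpi, a box spanning pixels 300 to 600 in both directions covers points 72 to 144 horizontally, measured down from the top of the page vertically.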

hocr_tools_lib.tools.hocr_pdf.export_pdf(directory: str | Path, savefile: str | Path, default_dpi: int = 300) → None¶

Create a searchable PDF from a pile of hOCR + JPEG.

Parameters:

directory – The input directory to use.
savefile – Save the PDF file to this file.
default_dpi – The image resolution to use.

hocr_tools_lib.tools.hocr_pdf.load_invisible_font() → None¶: Load the invisible font to use for rendering into reportlab.

hocr_tools_lib.tools.hocr_split¶

Split an hOCR file into individual pages.

hocr_tools_lib.tools.hocr_split.split(hocr: PathLike[str] | str, pattern: str = 'base-%03d.html') → None¶

Split the given hOCR file into multiple pages.

Parameters:

hocr – hOCR file to split.
pattern – Naming pattern for the output files.
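
The default pattern is a printf-style format string with one integer placeholder. The sketch below shows how such a pattern expands into file names. The helper output_names is hypothetical, and starting the page index at 1 is an assumption, since the reference does not state the first index.

```python
from typing import List


def output_names(pattern: str, page_count: int, start: int = 1) -> List[str]:
    """Expand a printf-style pattern for each page (hypothetical helper)."""
    # "%03d" pads the page index with zeros to three digits.
    return [pattern % index for index in range(start, start + page_count)]


print(output_names("base-%03d.html", 3))
```

Zero padding keeps the generated files in page order when they are sorted by name.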

hocr_tools_lib.tools.hocr_wordfreq¶

Calculate the word frequency inside an hOCR file.

hocr_tools_lib.tools.hocr_wordfreq.word_frequencies(hocr_in: PathLike[str] | str, case_insensitive: bool = False, spaces: bool = False, dehyphenate: bool = False, max_hits: int = 10) → Generator[str]¶

Determine the word frequencies.

Parameters:

hocr_in – hOCR file to analyze.
case_insensitive – Ignore the casing of the words.
spaces – Split on spaces only.
dehyphenate – Try to dehyphenate the text.
max_hits – Number of hits to return.

Returns:

Up to max_hits of the most used words.
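
To illustrate the effect of the options, here is a minimal sketch of word counting on plain text. It is not the library's implementation: top_words is a hypothetical name, it works on a string rather than an hOCR file, it interprets spaces as splitting on whitespace, and it omits dehyphenation.

```python
import re
from collections import Counter
from typing import List, Tuple


def top_words(
    text: str,
    case_insensitive: bool = False,
    spaces: bool = False,
    max_hits: int = 10,
) -> List[Tuple[str, int]]:
    """Return up to max_hits (word, count) pairs, most frequent first."""
    if case_insensitive:
        text = text.lower()
    if spaces:
        # Split on whitespace only, so "well-known" stays one word.
        words = text.split()
    else:
        # Otherwise treat every run of word characters as a word.
        words = re.findall(r"\w+", text)
    return Counter(words).most_common(max_hits)
```

With spaces enabled, punctuation stays attached to the words, which can matter for hyphenated terms.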

hocr_tools_lib.utils.edit_utils¶

hocr_tools_lib.utils.edit_utils.edit_distance(a: str, b: str, threshold: int = 99999) → int¶

Determine the edit distance between the two strings.

Parameters:

a – The first string.
b – The second string.
threshold – Distance threshold at which to return early.

Returns:

The edit distance.
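
For readers unfamiliar with the technique, the sketch below computes a Levenshtein distance row by row and stops once every entry of the current row exceeds the threshold. The name levenshtein is hypothetical, and both the choice of Levenshtein distance and the value returned on early exit are assumptions, not the documented behavior of edit_distance.

```python
def levenshtein(a: str, b: str, threshold: int = 99999) -> int:
    """Levenshtein distance with an early exit (illustrative sketch)."""
    # previous[j] is the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        # The row minimum never decreases from one row to the next, so it
        # is a lower bound on the final distance.
        if min(current) > threshold:
            return min(current)
        previous = current
    return previous[-1]
```

The early exit avoids filling the rest of the table when two strings are clearly too different to be of interest.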

hocr_tools_lib.utils.edit_utils.remove_tex(text: str) → str¶

Remove TeX from the given string.

Currently not implemented, thus always returning the original input.

Parameters:

text – The string to clean.

Returns:

The cleaned string.

hocr_tools_lib.utils.node_utils¶

hocr_tools_lib.utils.node_utils.get_bbox(node: HtmlElement) → tuple[float, float, float, float] | None¶

Get the bounding box declared for the given node.

Parameters:

node – The node to run on.

Returns:

The bounding box, or None if not found.

hocr_tools_lib.utils.node_utils.get_prop(node: HtmlElement, name: str, strip_value: bool = False) → str | None¶

Get the requested property from the node title.

Parameters:

node – The node to work on.
name – The property to retrieve.
strip_value – Whether to strip single quotation marks.

Returns:

The requested property, or None if not found.
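
In hOCR, properties live in an element's title attribute as semicolon-separated entries, each consisting of a name followed by its values. The sketch below parses such a title string directly. It is a simplified illustration, not the library's code: prop_from_title is a hypothetical name, it takes the title string instead of an HtmlElement, and it does not handle semicolons inside quoted values.

```python
from typing import Optional


def prop_from_title(
    title: str, name: str, strip_value: bool = False
) -> Optional[str]:
    """Look up one property in an hOCR title string (hypothetical helper)."""
    for entry in title.split(";"):
        # Each entry looks like "bbox 0 0 100 50": name, space, values.
        key, _, value = entry.strip().partition(" ")
        if key == name:
            value = value.strip()
            if strip_value:
                # Drop surrounding single quotation marks.
                value = value.strip("'")
            return value
    return None


title = "image 'page.png'; bbox 0 0 100 50"
bbox_text = prop_from_title(title, "bbox")
# A bounding box can then be read as four numbers.
bbox = tuple(float(part) for part in bbox_text.split()) if bbox_text else None
```

Returning None for a missing property lets callers distinguish an absent entry from an empty value.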

hocr_tools_lib.utils.node_utils.get_text(node: HtmlElement) → str¶

Get the text from the given node.

Parameters:

node – The node to run on.

Returns:

The text of the given node.

hocr_tools_lib.utils.rectangle_utils¶

hocr_tools_lib.utils.rectangle_utils.RectangleType¶

Custom type for wrapping a simple rectangle.

alias of tuple[float, float, float, float]

hocr_tools_lib.utils.rectangle_utils.area(u: tuple[float, float, float, float] | None) → float¶: Area of a rectangle.

hocr_tools_lib.utils.rectangle_utils.erode(u: tuple[float, float, float, float] | None, tx: float, ty: float) → tuple[float, float, float, float] | None¶: Erode the given rectangle with the given transformation factors.

hocr_tools_lib.utils.rectangle_utils.height(u: tuple[float, float, float, float] | None) → float¶: Height of a rectangle.

hocr_tools_lib.utils.rectangle_utils.intersect(u: tuple[float, float, float, float] | None, v: tuple[float, float, float, float] | None) → tuple[float, float, float, float] | None¶: Intersection of two rectangles.

hocr_tools_lib.utils.rectangle_utils.mostly_non_overlapping(boxes: list[tuple[float, float, float, float] | None], significant_overlap: float = 0.2) → bool¶: Check if the given boxes do not overlap more than the given threshold.

hocr_tools_lib.utils.rectangle_utils.overlaps(u: tuple[float, float, float, float] | None, v: tuple[float, float, float, float] | None) → bool¶: Predicate: Do the two rectangles overlap?

hocr_tools_lib.utils.rectangle_utils.relative_overlap(u: tuple[float, float, float, float] | None, v: tuple[float, float, float, float] | None) → float¶: Relative overlap of the two rectangles, i.e. the overlap area relative to the area of the larger rectangle.

hocr_tools_lib.utils.rectangle_utils.width(u: tuple[float, float, float, float] | None) → float¶: Width of a rectangle.
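
The following sketch shows how area, intersection and relative overlap fit together for rectangles of this shape. It is an independent illustration rather than the module's source: the (x0, y0, x1, y1) coordinate order and the handling of None as an empty rectangle are assumptions.

```python
from typing import Optional, Tuple

Rect = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def area(u: Optional[Rect]) -> float:
    """Area of a rectangle; None is treated as empty (assumption)."""
    if u is None:
        return 0.0
    return max(0.0, u[2] - u[0]) * max(0.0, u[3] - u[1])


def intersect(u: Optional[Rect], v: Optional[Rect]) -> Optional[Rect]:
    """Intersection of two rectangles, or None if they do not overlap."""
    if u is None or v is None:
        return None
    r = (max(u[0], v[0]), max(u[1], v[1]), min(u[2], v[2]), min(u[3], v[3]))
    if r[0] >= r[2] or r[1] >= r[3]:
        return None
    return r


def relative_overlap(u: Optional[Rect], v: Optional[Rect]) -> float:
    """Intersection area divided by the area of the larger rectangle."""
    largest = max(area(u), area(v))
    if largest == 0:
        return 0.0
    return area(intersect(u, v)) / largest
```

Dividing by the larger area means a small box fully inside a large one still scores low, so only boxes of similar size and position count as close matches.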

hocr_tools_lib.utils.text_utils¶

hocr_tools_lib.utils.text_utils.normalize(s: str) → str¶: Normalize the given string.

hocr_tools_lib.utils.typing_utils¶

class hocr_tools_lib.utils.typing_utils.SupportsRead(*args, **kwargs)¶: Type of file that supports reading.

class hocr_tools_lib.utils.typing_utils.SupportsReadClose(*args, **kwargs)¶

Type of file that supports reading and closing.

Mostly relevant for lxml support.
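
Types like these are typically written as structural protocols, so that any object with a matching read method is accepted. The sketch below shows the general pattern with typing.Protocol. The names Readable and first_line are hypothetical, and the actual definitions in typing_utils may differ.

```python
import io
from typing import Protocol, TypeVar, runtime_checkable

T_co = TypeVar("T_co", covariant=True)


@runtime_checkable
class Readable(Protocol[T_co]):
    """Anything with a read() method returning T_co (illustrative)."""

    def read(self, size: int = -1) -> T_co: ...


def first_line(source: Readable[str]) -> str:
    """Return the first line of a readable text source."""
    lines = source.read().splitlines()
    return lines[0] if lines else ""


# Both real files and in-memory buffers satisfy the protocol.
print(first_line(io.StringIO("ocr_page\nocr_line")))
```

A protocol avoids tying function signatures to a concrete file class, which is useful when callers pass in-memory buffers in tests.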