API Reference¶
hocr_tools_lib.tools.hocr_check¶
Check the given file for conformance with the hOCR format spec.
hocr_tools_lib.tools.hocr_combine¶
Combine multiple hOCR documents into one.
hocr_tools_lib.tools.hocr_cut¶
Cut a page (horizontally) into two pages in the middle such that most of the bounding boxes are separated nicely, e.g. for cutting double pages or double columns.
- hocr_tools_lib.tools.hocr_cut.cut(hocr: PathLike[str], debug: bool = False) None¶
Cut the given hOCR file.
Generates an image file for each column with the same basename as the input file, adding the suffix .left or .right before the extension.
- Parameters:
hocr – hOCR file to cut.
debug – Create a third image file with the suffix .cut with some debugging output.
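The naming scheme described above can be illustrated with a small stand-alone sketch. The helper `with_side_suffix` is hypothetical and not part of hocr_tools_lib; the sketch assumes that the suffix is inserted before the final extension of an image file name, which this reference does not spell out.

```python
from pathlib import PurePath


def with_side_suffix(path: str, side: str) -> str:
    """Insert ``.<side>`` before the final extension of ``path``.

    Hypothetical helper for illustration only; the library may derive
    its output names differently.
    """
    p = PurePath(path)
    return str(p.with_name(f"{p.stem}.{side}{p.suffix}"))


# The two column images, plus the extra image written when debug=True.
for side in ("left", "right", "cut"):
    print(with_side_suffix("scan_0001.png", side))
```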
hocr_tools_lib.tools.hocr_eval_geom¶
Compute statistics about the quality of the geometric segmentation at the level of the given hOCR element.
- class hocr_tools_lib.tools.hocr_eval_geom.Boxstats(multiple: 'int' = 0, missing: 'int' = 0, error: 'float' = 0.0, count: 'int' = 0)¶
- hocr_tools_lib.tools.hocr_eval_geom.boxstats(truths: list[tuple[float, float, float, float] | None], actuals: list[tuple[float, float, float, float] | None], significant_overlap: float = 0.1, close_match: float = 0.9) Boxstats¶
Determine the box statistics for the given set of boxes.
- Parameters:
truths – Ground truth boxes.
actuals – Actual boxes.
significant_overlap – Lower bound for a significant overlap.
close_match – Lower bound for an overlap to count as a close match.
- Returns:
The corresponding statistics.
- hocr_tools_lib.tools.hocr_eval_geom.check_bad_partition(boxes: list[tuple[float, float, float, float] | None], significant_overlap: float = 0.1) bool¶
Check if the given boxes are badly partitioned as they overlap too much.
- Parameters:
boxes – The boxes to check.
significant_overlap – Lower bound for a significant overlap.
- Returns:
Whether the partitioning is badly done.
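As a rough illustration of what a "significant overlap" check can look like, here is a self-contained sketch. It is not the library's implementation: the box layout `(x0, y0, x1, y1)`, the overlap measure (intersection area divided by the smaller box's area) and the helper names are all assumptions made for this example.

```python
from itertools import combinations
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def overlap_fraction(u: Box, v: Box) -> float:
    """Intersection area relative to the smaller box (assumed measure)."""
    w = min(u[2], v[2]) - max(u[0], v[0])
    h = min(u[3], v[3]) - max(u[1], v[1])
    if w <= 0 or h <= 0:
        return 0.0  # the boxes do not intersect
    smaller = min(
        (u[2] - u[0]) * (u[3] - u[1]),
        (v[2] - v[0]) * (v[3] - v[1]),
    )
    return (w * h) / smaller if smaller > 0 else 0.0


def has_significant_overlap(
    boxes: List[Optional[Box]], significant_overlap: float = 0.1
) -> bool:
    """Return True if any pair of boxes overlaps more than the threshold."""
    present = [box for box in boxes if box is not None]
    return any(
        overlap_fraction(u, v) > significant_overlap
        for u, v in combinations(present, 2)
    )


# Two side-by-side boxes are fine; two half-overlapping boxes are not.
print(has_significant_overlap([(0, 0, 10, 10), (20, 0, 30, 10)]))
print(has_significant_overlap([(0, 0, 10, 10), None, (5, 0, 15, 10)]))
```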
- hocr_tools_lib.tools.hocr_eval_geom.evaluate_geometries(truth: PathLike[str], actual: PathLike[str], element: str = 'ocr_line', significant_overlap: float = 0.1, close_match: float = 0.9) Generator[tuple[Boxstats, Boxstats]]¶
Evaluate the geometries for the given files.
- Parameters:
truth – hOCR file with ground truth.
actual – hOCR file with actual data.
element – hOCR element to look at.
significant_overlap – Lower bound for a significant overlap.
close_match – Lower bound for an overlap to count as a close match.
- Returns:
For each set of pages, a tuple of the statistics checking the actual values against the truth values and vice versa.
hocr_tools_lib.tools.hocr_eval_lines¶
Compute statistics about the quality of the geometric segmentation.
- hocr_tools_lib.tools.hocr_eval_lines.evaluate_lines(tfile: SupportsRead[str], hfile: PathLike[str], verbose: bool = False) tuple[int, int]¶
Run the evaluation.
- Parameters:
tfile – Text file with the true lines.
hfile – hOCR file with the actually recognized lines.
verbose – Whether to log additional information for each line.
- Returns:
The number of segmentation and OCR errors.
hocr_tools_lib.tools.hocr_eval¶
Compute statistics about the general quality of the hOCR data.
- hocr_tools_lib.tools.hocr_eval.evaluate(truth: PathLike[str], actual: PathLike[str], img_file: SupportsRead[bytes] | None = None, debug: bool = False, verbose: bool = False) tuple[Image | None, int, int, int]¶
Perform the evaluation.
- Parameters:
truth – hOCR file with ground truth.
actual – hOCR file with actual data.
img_file – Optional image file. If set, draw the bboxes of the lines onto it and save it to errors.png.
debug – Log additional debug information.
verbose – Log additional error data.
- Returns:
The image with the bboxes, the number of segmentation errors (expected and actual bboxes not similar enough), the number of OCR segmentation errors (number of differing characters due to segmentation) and the number of OCR errors (number of differing characters).
hocr_tools_lib.tools.hocr_extract_images¶
Extract the images and text within all the requested elements within the hOCR file.
- hocr_tools_lib.tools.hocr_extract_images.extract_images(hocr: SupportsReadClose[str], basename: str, pattern: str = 'line-%03d.png', element: str = 'ocr_line', pad: str | None = None, unicode_dammit: bool = False) None¶
Extract the images from the given document.
- Parameters:
hocr – hOCR file to use.
basename – Image directory.
pattern – Output file pattern to use.
element – hOCR element to look into.
pad – Extra padding for bounding box. If set, either one number for all four sides or four numbers separated by a comma.
unicode_dammit – Attempt to use BeautifulSoup.UnicodeDammit for fixing encoding issues.
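The `pattern` and `pad` arguments are plain strings, so their formats can be shown without the library. The sketch below is an assumption-laden illustration: `parse_pad` is a hypothetical helper, padding values are assumed to be integers, and the order in which four values map to the sides is not specified in this reference, so the sketch simply keeps the given order.

```python
from typing import Optional, Tuple


def parse_pad(pad: Optional[str]) -> Tuple[int, int, int, int]:
    """Parse a pad string into four values (hypothetical helper).

    Accepts one number (applied to all four sides) or four
    comma-separated numbers, as described for the ``pad`` parameter.
    """
    if pad is None:
        return (0, 0, 0, 0)
    values = [int(part) for part in pad.split(",")]
    if len(values) == 1:
        return (values[0], values[0], values[0], values[0])
    if len(values) == 4:
        return (values[0], values[1], values[2], values[3])
    raise ValueError("pad must contain one or four comma-separated numbers")


# The default pattern is a printf-style template with a zero-padded index.
print("line-%03d.png" % 7)
print(parse_pad("5"))
print(parse_pad("1,2,3,4"))
```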
hocr_tools_lib.tools.hocr_lines¶
Extract the text within all the ocr_line elements within the hOCR file.
hocr_tools_lib.tools.hocr_merge_dc¶
Merge Dublin Core metadata into hOCR header files.
hocr_tools_lib.tools.hocr_pdf¶
Create a searchable PDF from a pile of hOCR + JPEG. Tested with Tesseract.
- exception hocr_tools_lib.tools.hocr_pdf.NoImagesFoundError¶
Custom error class when no images could be found.
- hocr_tools_lib.tools.hocr_pdf.add_text_layer(pdf: Canvas, image: str, height: float, dpi: int) None¶
Draw an invisible text layer for OCR data.
- Parameters:
pdf – The PDF canvas to add the layer to.
image – The image path to determine the hOCR file from.
height – The page height to use for positioning/scaling.
dpi – The resolution to use for positioning/scaling.
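To show why both the page height and the resolution are needed for positioning, here is a minimal sketch of the usual pixel-to-PDF conversion. It is not taken from the library: it assumes hOCR boxes are given in pixels with the origin at the top-left corner, that the page height is given in PDF points, and that the PDF origin is at the bottom-left corner. The function names are hypothetical.

```python
from typing import Tuple


def px_to_pt(pixels: float, dpi: int) -> float:
    """Convert pixels to PDF points (one point is 1/72 inch)."""
    return pixels * 72.0 / dpi


def bbox_to_pdf(
    bbox: Tuple[float, float, float, float], page_height_pt: float, dpi: int
) -> Tuple[float, float, float, float]:
    """Map a pixel bbox (x0, y0, x1, y1) to (left, bottom, width, height) in points.

    The vertical axis is flipped because the PDF origin is assumed to be
    at the bottom-left, while image coordinates start at the top-left.
    """
    x0, y0, x1, y1 = bbox
    left = px_to_pt(x0, dpi)
    bottom = page_height_pt - px_to_pt(y1, dpi)
    return left, bottom, px_to_pt(x1 - x0, dpi), px_to_pt(y1 - y0, dpi)


# A word box on a US Letter page (792 pt high) scanned at 300 dpi.
print(bbox_to_pdf((300, 600, 900, 750), page_height_pt=792.0, dpi=300))
```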
- hocr_tools_lib.tools.hocr_pdf.export_pdf(directory: str | Path, savefile: str | Path, default_dpi: int = 300) None¶
Create a searchable PDF from a pile of hOCR + JPEG.
- Parameters:
directory – The input directory to use.
savefile – Save the PDF file to this file.
default_dpi – The image resolution to use.
hocr_tools_lib.tools.hocr_split¶
Split an hOCR file into individual pages.
hocr_tools_lib.tools.hocr_wordfreq¶
Calculate the word frequency inside an hOCR file.
- hocr_tools_lib.tools.hocr_wordfreq.word_frequencies(hocr_in: PathLike[str] | str | TextIO, case_insensitive: bool = False, spaces: bool = False, dehyphenate: bool = False, max_hits: int = 10) Generator[str]¶
Determine the word frequencies.
- Parameters:
hocr_in – hOCR file to analyze.
case_insensitive – Ignore the casing of the words.
spaces – Split on spaces only.
dehyphenate – Try to dehyphenate the text.
max_hits – Number of hits to return.
- Returns:
Up to max_hits of the most used words.
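The effect of `case_insensitive`, `spaces` and `max_hits` can be sketched on plain text with `collections.Counter`. This is only an illustration under assumptions: the real function reads an hOCR file and yields strings, whereas the hypothetical `top_words` below takes already extracted text, returns `(word, count)` pairs, and omits dehyphenation. The word-splitting rule used when `spaces` is false is also an assumption.

```python
import re
from collections import Counter
from typing import List, Tuple


def top_words(
    text: str,
    case_insensitive: bool = False,
    spaces: bool = False,
    max_hits: int = 10,
) -> List[Tuple[str, int]]:
    """Return up to ``max_hits`` of the most frequent words in ``text``."""
    if case_insensitive:
        text = text.lower()
    if spaces:
        # Split on spaces only, so punctuation and hyphens stay attached.
        words = [word for word in text.split(" ") if word]
    else:
        # Assumed default: treat every run of word characters as a word.
        words = re.findall(r"\w+", text)
    return Counter(words).most_common(max_hits)


print(top_words("The cat and the hat", case_insensitive=True, max_hits=1))
print(top_words("well-known well-known fact.", spaces=True))
```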
hocr_tools_lib.utils.edit_utils¶
hocr_tools_lib.utils.node_utils¶
- hocr_tools_lib.utils.node_utils.get_bbox(node: HtmlElement) tuple[float, float, float, float] | None¶
Get the bounding box declared for the given node.
- Parameters:
node – The node to run on.
- Returns:
The bounding box, or None if not found.
- hocr_tools_lib.utils.node_utils.get_prop(node: HtmlElement, name: str, strip_value: bool = False) str | None¶
Get the requested property from the node title.
- Parameters:
node – The node to work on.
name – The property to retrieve.
strip_value – Whether to strip single quotation marks.
- Returns:
The requested property, or None if not found.
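hOCR stores properties such as `bbox` in the `title` attribute as semicolon-separated `name value` entries. The sketch below shows how such a string can be parsed. It works on a plain string rather than an `HtmlElement`, the function names are hypothetical, and the naive split on `;` is an assumption that would break for values containing semicolons.

```python
from typing import Optional, Tuple


def get_prop_from_title(
    title: str, name: str, strip_value: bool = False
) -> Optional[str]:
    """Look up one property in an hOCR title string (illustrative only)."""
    for part in title.split(";"):
        fields = part.strip().split(None, 1)  # property name, then the rest
        if fields and fields[0] == name:
            value = fields[1] if len(fields) > 1 else ""
            return value.strip("'") if strip_value else value
    return None


def get_bbox_from_title(title: str) -> Optional[Tuple[float, float, float, float]]:
    """Parse the ``bbox`` property into four floats, or None if absent."""
    value = get_prop_from_title(title, "bbox")
    if value is None:
        return None
    coords = [float(c) for c in value.split()]
    if len(coords) != 4:
        return None
    return (coords[0], coords[1], coords[2], coords[3])


# Example title string, made up for this sketch.
title = "bbox 10 20 110 60; x_wconf 93; x_font 'Arial'"
print(get_bbox_from_title(title))
print(get_prop_from_title(title, "x_font", strip_value=True))
```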
- hocr_tools_lib.utils.node_utils.get_text(node: HtmlElement) str¶
Get the text from the given node.
- Parameters:
node – The node to run on.
- Returns:
The text of the given node.
hocr_tools_lib.utils.rectangle_utils¶
- hocr_tools_lib.utils.rectangle_utils.RectangleType¶
Custom type for wrapping a simple rectangle.
- hocr_tools_lib.utils.rectangle_utils.area(u: tuple[float, float, float, float] | None) float¶
Area of a rectangle.
- hocr_tools_lib.utils.rectangle_utils.erode(u: tuple[float, float, float, float] | None, tx: float, ty: float) tuple[float, float, float, float] | None¶
Erode the given rectangle with the given transformation factors.
- hocr_tools_lib.utils.rectangle_utils.height(u: tuple[float, float, float, float] | None) float¶
Height of a rectangle.
- hocr_tools_lib.utils.rectangle_utils.intersect(u: tuple[float, float, float, float] | None, v: tuple[float, float, float, float] | None) tuple[float, float, float, float] | None¶
Intersection of two rectangles.
- hocr_tools_lib.utils.rectangle_utils.mostly_non_overlapping(boxes: list[tuple[float, float, float, float] | None], significant_overlap: float = 0.2) bool¶
Check if the given boxes do not overlap more than the given threshold.
- hocr_tools_lib.utils.rectangle_utils.overlaps(u: tuple[float, float, float, float] | None, v: tuple[float, float, float, float] | None) bool¶
Predicate: Do the two rectangles overlap?
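The rectangle helpers all accept `None`, which suggests that a missing rectangle is a valid input. The following stand-alone sketch shows one consistent way to define a few of them. It is not the library's code: the `(x0, y0, x1, y1)` layout, treating `None` as an empty rectangle, and treating rectangles that merely touch as non-overlapping are all assumptions.

```python
from typing import Optional, Tuple

Rect = Optional[Tuple[float, float, float, float]]  # assumed (x0, y0, x1, y1)


def height(u: Rect) -> float:
    """Height of a rectangle; a missing rectangle has height 0."""
    return 0.0 if u is None else max(0.0, u[3] - u[1])


def area(u: Rect) -> float:
    """Area of a rectangle; a missing rectangle has area 0."""
    if u is None:
        return 0.0
    return max(0.0, u[2] - u[0]) * max(0.0, u[3] - u[1])


def intersect(u: Rect, v: Rect) -> Rect:
    """Intersection of two rectangles, or None if they do not intersect."""
    if u is None or v is None:
        return None
    r = (max(u[0], v[0]), max(u[1], v[1]), min(u[2], v[2]), min(u[3], v[3]))
    if r[0] >= r[2] or r[1] >= r[3]:
        return None
    return r


def overlaps(u: Rect, v: Rect) -> bool:
    """Predicate: do the two rectangles share a region of positive area?"""
    return intersect(u, v) is not None


a, b = (0.0, 0.0, 10.0, 10.0), (5.0, 5.0, 20.0, 20.0)
print(intersect(a, b), area(intersect(a, b)), overlaps(a, b))
```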
hocr_tools_lib.utils.text_utils¶
hocr_tools_lib.utils.typing_utils¶
- class hocr_tools_lib.utils.typing_utils.SupportsRead(*args, **kwargs)¶
Type of file that supports reading.
- class hocr_tools_lib.utils.typing_utils.SupportsReadClose(*args, **kwargs)¶
Type of file that supports reading and closing.
Mostly relevant for lxml support.
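The `(*args, **kwargs)` signatures above are typical of how `typing.Protocol` classes are rendered, so these two types are probably structural protocols. Under that assumption, the sketch below defines local stand-ins with the same names to show how such protocols behave. The method signatures are guesses and may differ from the library's definitions.

```python
import io
from typing import Protocol, TypeVar, runtime_checkable

T_co = TypeVar("T_co", covariant=True)


@runtime_checkable
class SupportsRead(Protocol[T_co]):
    """Local stand-in: anything with a ``read`` method."""

    def read(self, size: int = -1, /) -> T_co: ...


@runtime_checkable
class SupportsReadClose(SupportsRead[T_co], Protocol[T_co]):
    """Local stand-in: anything with ``read`` and ``close`` methods."""

    def close(self) -> None: ...


def first_line(source: SupportsRead[str]) -> str:
    """Accept any readable text source, not only real files."""
    lines = source.read().splitlines()
    return lines[0] if lines else ""


buffer = io.StringIO("first\nsecond\n")
print(isinstance(buffer, SupportsReadClose))
print(first_line(buffer))
```

Because the check is structural, in-memory buffers such as `io.StringIO` satisfy these protocols without inheriting from them.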