@sliedes commented Jul 2, 2024

This change adds rudimentary hOCR output support. Notes:

  • Currently it only writes bounding boxes to the hOCR output, not baselines (which hOCR also supports)

  • It doesn't add any semantic layout structure; instead, it simply represents each word as an ocrx_word

  • Some of the metadata could be improved, for example by adding the real image name and perhaps the EasyOCR version number

  • I didn't check whether EasyOCR supports multipage inputs; if it does, this will certainly break on them

  • I left this comment in the source code; I'm not sure what to do with it (probably shouldn't be enabled by default):

# In order to get a browser-renderable HTML file, you can add this before the closing </body> tag:
#
# <script src="https://unpkg.com/hocrjs"></script>

Other than that, I validated the output with hocr-check from https://github.com/ocropus/hocr-tools and also checked that it validates as XHTML.
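
For readers unfamiliar with hOCR, the approach described in the notes above (one ocrx_word per detected word, bounding boxes only, a single page) can be sketched roughly as follows. This is not the code from this change; it is a minimal illustration that assumes results shaped like the `(quad, text, confidence)` tuples returned by EasyOCR's `readtext`. The helper name `results_to_hocr`, the element ids, and the `ocr-system` value are hypothetical.

```python
from html import escape


def results_to_hocr(results, page_width, page_height, image_name="unknown"):
    """Hypothetical helper: render (quad, text, confidence) tuples as hOCR.

    Each quad is four [x, y] corner points. Only an axis-aligned bounding
    box is emitted per word; baselines and layout structure are omitted.
    """
    spans = []
    for i, (quad, text, _confidence) in enumerate(results, start=1):
        xs = [int(round(point[0])) for point in quad]
        ys = [int(round(point[1])) for point in quad]
        bbox = f"bbox {min(xs)} {min(ys)} {max(xs)} {max(ys)}"
        spans.append(
            f'<span class="ocrx_word" id="word_1_{i}" title="{bbox}">'
            f"{escape(text)}</span>"
        )

    # escape() also converts the double quotes around the image name,
    # so the title stays valid inside a double-quoted attribute.
    page_title = escape(
        f'image "{image_name}"; bbox 0 0 {page_width} {page_height}'
    )
    body = "\n".join(spans)
    return f"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="ocr-system" content="easyocr" />
<meta name="ocr-capabilities" content="ocr_page ocrx_word" />
</head>
<body>
<div class="ocr_page" id="page_1" title="{page_title}">
{body}
</div>
</body>
</html>
"""


# Made-up sample data in the assumed result shape.
sample = [
    ([[10, 20], [110, 20], [110, 50], [10, 50]], "Hello", 0.98),
    ([[120, 22], [200, 22], [200, 52], [120, 52]], "A&B", 0.75),
]
hocr = results_to_hocr(sample, page_width=640, page_height=480, image_name="scan.png")
```

Taking the min and max of the corner coordinates collapses a rotated quadrilateral into its enclosing rectangle, which is all the hOCR `bbox` property can express.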
