Commit d358ca5

DARREN OBERST authored and committed
shifting ocr dependencies to optional
1 parent 8de8d44 commit d358ca5

4 files changed, with 29 additions and 8 deletions

‎examples/Parsing/ocr_embedded_doc_images.py‎

Lines changed: 9 additions & 2 deletions
@@ -7,13 +7,15 @@
 B. run an OCR against the images to derive the text from the image using the OCR
 C. insert the text into the database library collection for subsequent retrieval.

+Note: this example uses additional python dependencies:
+
+-- pip3 install pytesseract
+
 Note: this example uses an OCR engine, which is outside of the core llmware package. To install on Ubuntu:

 -- sudo apt install tesseract-ocr
 -- sudo apt install libtesseract-dev

--- pip3 install pytesseract [should already be installed with llmware requirements.txt]
-
 [Other platforms:
 -- Mac: brew install tesseract
 -- Windows: GUI download installer - see UB-Mannheim @ www.github.com/UB-Mannheim/tesseract/wiki
@@ -47,6 +49,11 @@
 from llmware.resources import CollectionRetrieval, CollectionWriter
 from llmware.parsers import ImageParser

+from importlib import util
+if not util.find_spec("pytesseract"):
+    print("\nto run this example requires additional dependencies, including pytesseract - see comments above in "
+          "this script. to install pytesseract: pip3 install pytesseract.")
+

 def ocr_images_in_library(library_name, add_new_text_block=False, chunk_size=400, min_chars=10):
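
Note on the check added in this hunk: it prints a message when pytesseract is absent, and the script then carries on. The following is a minimal sketch of a stricter variant, assuming one would rather stop early with a single error that names every missing module. `missing_modules` and `require_modules` are hypothetical helper names for illustration, not part of llmware; only `importlib.util.find_spec` comes from the commit itself.

```python
from importlib import util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be located."""
    return [name for name in names if util.find_spec(name) is None]


def require_modules(names, hint=""):
    """Raise ImportError naming every missing module, rather than only printing."""
    missing = missing_modules(names)
    if missing:
        message = "missing optional dependencies: " + ", ".join(missing)
        if hint:
            message += " - " + hint
        raise ImportError(message)


if __name__ == "__main__":
    # Report, without raising, which OCR-related modules are absent here.
    print(missing_modules(["pytesseract", "pdf2image"]))
```

The sketch only handles top-level module names; `find_spec` on a dotted name imports the parent package first, which may itself raise if the parent is missing.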

‎examples/Parsing/parse_pdf_by_ocr.py‎

Lines changed: 20 additions & 2 deletions
@@ -1,7 +1,21 @@

 """ This example demonstrates how to parse PDF documents consisting of scanned pages using OCR
-1. Note: uses pdf2image library - requires separate install locally of lib tesseract and poppler
-2. This is a useful fall-back for scanned documents, if not possible to parse digitally
+
+Parsing a PDF-by-OCR is much slower and loses metadata, compared with a digital parse - but this is a
+necessary fall-back for many 'paper-scanned' PDFs, or in the relatively rare cases in which
+digital parsing is not successful
+
+NOTE: there are several dependencies that must be installed to run this example:
+
+pip install:
+-- pip3 install pytesseract
+-- pip3 install pdf2image
+
+core libraries:
+-- tesseract: e.g., (Mac OS) - brew install tesseract or (Linux) - sudo apt install tesseract
+-- poppler: e.g., (Mac OS) - brew install poppler or (Linux) - sudo apt-get install -y poppler-utils
+for Windows download see - https://poppler.freedesktop.org/
+
 """

 import os
@@ -10,6 +24,10 @@
 from llmware.parsers import Parser
 from llmware.setup import Setup

+from importlib import util
+if not util.find_spec("pytesseract") or not util.find_spec("pdf2image"):
+    print("\nto run this example, please install pytesseract and pdf2image - and there may be core libraries "
+          "that need to be installed as well - see comments above more details.")

 def parsing_pdf_by_ocr ():
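
Note on the "core libraries" listed in the docstring of this file: `find_spec` only detects the pip packages, so a machine with pytesseract and pdf2image installed but no tesseract or poppler would pass the check added here. The following sketches a complementary check, assuming the command-line tools are expected on PATH. `missing_binaries` is a hypothetical helper, and the executable names `tesseract` and `pdftoppm` (one of the poppler utilities) are assumptions about a typical install rather than anything this commit specifies.

```python
import shutil


def missing_binaries(names):
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# Assumption: "tesseract" is the Tesseract CLI and "pdftoppm" is one of the
# command-line tools shipped with poppler.
OCR_BINARIES = ["tesseract", "pdftoppm"]

if __name__ == "__main__":
    absent = missing_binaries(OCR_BINARIES)
    if absent:
        print("not found on PATH: " + ", ".join(absent))
```

On Ubuntu the first example in this commit installs the engine with `sudo apt install tesseract-ocr`, so that is likely the package name to try if `tesseract` is reported missing there.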

‎llmware/requirements.txt‎

Lines changed: 0 additions & 2 deletions
@@ -3,10 +3,8 @@ datasets==2.15.0
 huggingface-hub==0.19.4
 numpy>=1.23.2
 openai>=1.0
-pdf2image==1.16.0
 pymilvus>=2.3.0
 pymongo>=4.7.0
-pytesseract==0.3.10
 sentence-transformers==2.2.2
 tabulate==0.9.0
 tokenizers>=0.15.0

‎setup.py‎

Lines changed: 0 additions & 2 deletions
@@ -58,10 +58,8 @@ def glob_fix(package_name, glob):
 'huggingface-hub==0.19.4',
 'numpy>=1.23.2',
 'openai>=1.0.0',
-'pdf2image==1.16.0',
 'pymilvus>=2.3.0',
 'pymongo>=4.7.0',
-'pytesseract==0.3.10',
 'sentence-transformers==2.2.2',
 'tabulate==0.9.0',
 'tokenizers>=0.15.0',
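
Note on the two pins removed here: after this change the OCR packages are installed by hand, as the updated example docstrings describe. A common alternative, which this commit does not adopt, is to declare them as a setuptools extra. The sketch below assumes a hypothetical extra named `ocr` and reuses the two version pins removed in this diff; `EXTRAS_REQUIRE` and `install_hint` are illustrative names only.

```python
# Hypothetical: this commit does not define any extras.
# The pins below are the ones removed from install_requires in this diff.
EXTRAS_REQUIRE = {
    "ocr": ["pdf2image==1.16.0", "pytesseract==0.3.10"],
}


def install_hint(package, extra):
    """Build the pip command a user would run to pull in an optional extra."""
    return f'pip3 install "{package}[{extra}]"'


# In setup.py this would be passed as: setup(..., extras_require=EXTRAS_REQUIRE)
if __name__ == "__main__":
    print(install_hint("llmware", "ocr"))
```

The trade-off is that an extra keeps the pinned versions in one place, while the approach taken in this commit leaves version choice to the user.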
