A Python CLI tool to extract text and structured data from PDF files using Azure's Document Intelligence service (formerly Form Recognizer).
- Extract plain text from PDF files
- Web Interface - User-friendly web UI for easy uploading and extraction
- Extract and format as Markdown - Get beautifully formatted markdown with headings, tables, and proper structure
- Extract structured data (tables, key-value pairs, paragraphs)
- Save extracted content to files
- Secure credential management using .env files
- Simple command-line interface
# Create virtual environment
python -m venv .venv
# Activate virtual environment
# On Windows:
.venv\Scripts\activate
# On macOS/Linux:
source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
Copy your .env file from your other project or create a new one:
# Copy .env.example to .env
# On Windows:
copy .env.example .env
# On macOS/Linux:
cp .env.example .env
Edit the .env file and add your Azure Document Intelligence credentials:
AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT=https://your-resource-name.cognitiveservices.azure.com/
AZURE_DOCUMENT_INTELLIGENCE_KEY=your-api-key-here
Where to find these values:
- Log in to Azure Portal
- Navigate to your Document Intelligence resource
- Go to "Keys and Endpoint" section
- Copy the endpoint URL and one of the keys
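As an illustration of how the two settings above might be checked at startup, here is a minimal sketch. The helper name `load_credentials` is hypothetical and not taken from this project; in the real tool, python-dotenv's `load_dotenv()` would presumably copy the values from .env into `os.environ` before a check like this runs.

```python
import os

REQUIRED_VARS = (
    "AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT",
    "AZURE_DOCUMENT_INTELLIGENCE_KEY",
)


def load_credentials(env=None):
    """Return (endpoint, key) or raise ValueError naming the missing variables.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    # Strip stray whitespace and surrounding quotes, a common .env mistake.
    values = {
        name: (env.get(name) or "").strip().strip("'\"")
        for name in REQUIRED_VARS
    }
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return values[REQUIRED_VARS[0]], values[REQUIRED_VARS[1]]
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching the real environment.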
- Start the web server:
python app.py
- Open your browser and navigate to http://localhost:5000
- Upload a PDF file and select your desired output formats (Text, Markdown, JSON)
- View the results directly in the browser or download the extracted files
Extract text from a PDF and display it:
python extract_pdf.py path/to/your/document.pdf
Extract text and save it to a file:
python extract_pdf.py input.pdf -o output.txt
Extract with proper markdown formatting including headings, tables, and structure:
python extract_pdf.py document.pdf --markdown -o output.md
This will create a markdown file with:
- Document title and metadata
- Page-by-page content with headings
- Tables formatted as markdown tables
- Key-value pairs formatted as bullet lists
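To make the last two items concrete, the sketch below shows one way tables and key-value pairs could be rendered as Markdown. The function names `table_to_markdown` and `key_values_to_markdown` are hypothetical, and the input shapes (a list of rows, a list of pairs) are assumptions; the project's actual formatter may differ.

```python
def table_to_markdown(rows):
    """Render a list of rows (first row is the header) as a Markdown table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Escape pipes so a cell cannot break the table, then pad short rows.
        cells = [str(cell).replace("|", "\\|") for cell in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)


def key_values_to_markdown(pairs):
    """Render (key, value) pairs as a Markdown bullet list."""
    return "\n".join(f"- **{key}**: {value}" for key, value in pairs)
```

For example, `table_to_markdown([["Item", "Qty"], ["Pen", 2]])` yields a header row, a `---` separator row, and one body row.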
Extract tables, key-value pairs, and paragraphs:
python extract_pdf.py document.pdf --structured
Extract structured data and save it as JSON:
python extract_pdf.py document.pdf --structured -o output.json
positional arguments:
  pdf_file             Path to the PDF file to process
optional arguments:
  -h, --help           Show help message and exit
  -o, --output OUTPUT  Output file path for extracted text
  --markdown           Format output as markdown (with headings, tables, and formatting)
  --structured         Extract structured data (tables, key-value pairs)
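The options listed above map naturally onto Python's standard `argparse` module. The following is a sketch of a parser that accepts the same arguments; the function name `build_parser` is hypothetical, and the project's actual parser may be organized differently.

```python
import argparse


def build_parser():
    """Build a parser accepting the arguments shown in the help text."""
    parser = argparse.ArgumentParser(
        prog="extract_pdf.py",
        description="Extract text and structured data from a PDF file.",
    )
    parser.add_argument("pdf_file", help="Path to the PDF file to process")
    parser.add_argument("-o", "--output", help="Output file path for extracted text")
    parser.add_argument(
        "--markdown",
        action="store_true",
        help="Format output as markdown (with headings, tables, and formatting)",
    )
    parser.add_argument(
        "--structured",
        action="store_true",
        help="Extract structured data (tables, key-value pairs)",
    )
    return parser
```

Note that Python 3.10 and later print the heading "options:" where older versions print "optional arguments:".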
Processing PDF: sample.pdf
Analyzing document...
Processing page 1 of 3
Processing page 2 of 3
Processing page 3 of 3
Extraction complete! Total pages: 3
Total characters extracted: 5432
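The page-by-page progress above could come from a loop like the one sketched below. This is an assumption about how the tool works, not its actual source: the function names are hypothetical, and the choice of the `prebuilt-read` model is a guess. The SDK calls used (`DocumentAnalysisClient`, `begin_analyze_document`, and the `pages`, `lines`, and `content` attributes of the result) are from the `azure-ai-formrecognizer` 3.3 package.

```python
def collect_page_text(pages):
    """Join each page's lines into text, printing progress for each page.

    `pages` is any sequence of objects with a `lines` attribute whose items
    expose `.content`, which is the shape of the SDK's result pages.
    """
    total = len(pages)
    chunks = []
    for index, page in enumerate(pages, start=1):
        print(f"Processing page {index} of {total}")
        chunks.append("\n".join(line.content for line in page.lines))
    return "\n\n".join(chunks)


def extract_text(pdf_path, endpoint, key):
    """Send a PDF to Document Intelligence and return its plain text."""
    # Imported lazily so collect_page_text works without the SDK installed.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(
        endpoint=endpoint, credential=AzureKeyCredential(key)
    )
    with open(pdf_path, "rb") as pdf:
        # "prebuilt-read" is an assumption; the project may use another model.
        poller = client.begin_analyze_document("prebuilt-read", document=pdf)
        result = poller.result()  # blocks until the analysis finishes
    return collect_page_text(result.pages)
```

Keeping the page loop separate from the network call means the text assembly can be tested with simple stand-in objects.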
Extracting structured data from: invoice.pdf
Found 2 table(s)
Found 15 key-value pair(s)
Found 8 paragraph(s)
==================================================
STRUCTURED DATA SUMMARY
==================================================
Total Pages: 2
Tables: 2
Key-Value Pairs: 15
Paragraphs: 8
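A summary like the one above can be derived by counting the collections on the analysis result. The sketch below assumes the result exposes `pages`, `tables`, `key_value_pairs`, and `paragraphs` attributes, as the SDK's `AnalyzeResult` does; the function names `summarize_result` and `format_summary` are hypothetical.

```python
def summarize_result(result):
    """Count the elements reported in the structured-data summary.

    Any attribute may be missing or None when the chosen model does not
    return that kind of element, so each count falls back to zero.
    """
    def count(name):
        return len(getattr(result, name, None) or [])

    return {
        "Total Pages": count("pages"),
        "Tables": count("tables"),
        "Key-Value Pairs": count("key_value_pairs"),
        "Paragraphs": count("paragraphs"),
    }


def format_summary(summary):
    """Format the counts under a banner of '=' characters."""
    rule = "=" * 50
    lines = [rule, "STRUCTURED DATA SUMMARY", rule]
    lines.extend(f"{label}: {value}" for label, value in summary.items())
    return "\n".join(lines)
```

Calling `print(format_summary(summarize_result(result)))` would print the banner followed by one line per count.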
If you see an error about missing environment variables, ensure:
- Your .env file exists in the project root
- It contains both AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT and AZURE_DOCUMENT_INTELLIGENCE_KEY
- The values are correct (no extra spaces or quotes)
If authentication with Azure fails:
- Verify your API key is correct
- Ensure your endpoint URL is properly formatted
- Check that your Azure resource is active and not expired
If the PDF file cannot be found or opened:
- Ensure the PDF file path is correct
- Use absolute paths if relative paths don't work
- Check file permissions
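The file checks in this list can be automated before any API call is made. The following sketch uses only the standard library; the function name `resolve_pdf_path` is hypothetical and not part of the project.

```python
import os
from pathlib import Path


def resolve_pdf_path(raw_path):
    """Return an absolute Path to a readable file or raise a descriptive error."""
    # Expand "~" and convert to an absolute path so errors show the real location.
    path = Path(raw_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"PDF not found: {path}")
    if not os.access(path, os.R_OK):
        raise PermissionError(f"PDF is not readable: {path}")
    return path
```

Reporting the resolved absolute path in the error message makes it easier to spot a relative path that was interpreted from the wrong working directory.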
- Python 3.7+
- Azure Document Intelligence resource (with API key and endpoint)
- Internet connection for API calls
- azure-ai-formrecognizer==3.3.3 - Azure Document Intelligence SDK
- python-dotenv==1.0.0 - Environment variable management
This project is provided as-is for educational and development purposes.