A lightweight web application that extracts text, markdown, and structured data from PDF files using Azure Document Intelligence. Features a modern drag-and-drop UI, real-time extraction progress, interactive result viewer, and multi-format export (Text, Markdown, JSON). Secure, fast, and ideal for document processing workflows.

zentverse/doc-intel

PDF Content Extractor using Microsoft Document Intelligence

A Python tool, with both a command-line interface and a web UI, that extracts text and structured data from PDF files using Azure's Document Intelligence service (formerly Form Recognizer).

Features

  • 📄 Extract plain text from PDF files
  • 🌐 Web Interface - User-friendly web UI for easy uploading and extraction
  • 📝 Extract and format as Markdown - Get beautifully formatted markdown with headings, tables, and proper structure
  • 📊 Extract structured data (tables, key-value pairs, paragraphs)
  • 💾 Save extracted content to files
  • 🔐 Secure credential management using .env files
  • 🎯 Simple command-line interface

Setup

1. Create Virtual Environment

# Create virtual environment
python -m venv .venv

# Activate virtual environment
# On Windows:
.venv\Scripts\activate
# On macOS/Linux:
source .venv/bin/activate

2. Install Dependencies

pip install -r requirements.txt

3. Configure Environment Variables

Reuse a .env file from another project, or create a new one from the provided template:

# Copy .env.example to .env
# On Windows:
copy .env.example .env
# On macOS/Linux:
cp .env.example .env

Edit the .env file and add your Azure Document Intelligence credentials:

AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT=https://your-resource-name.cognitiveservices.azure.com/
AZURE_DOCUMENT_INTELLIGENCE_KEY=your-api-key-here

Where to find these values:

  • Log in to Azure Portal
  • Navigate to your Document Intelligence resource
  • Go to "Keys and Endpoint" section
  • Copy the endpoint URL and one of the keys
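
The two variables above could be read and checked at startup along these lines. This is a minimal sketch, not the project's actual code: `load_credentials` is a hypothetical helper, and it assumes the values from .env have already been copied into the process environment (for example by `load_dotenv()` from python-dotenv).

```python
import os
from typing import Mapping, Optional, Tuple

# Variable names taken from the .env example above.
REQUIRED_VARS = (
    "AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT",
    "AZURE_DOCUMENT_INTELLIGENCE_KEY",
)


def load_credentials(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Return (endpoint, key), or raise if either variable is missing or empty.

    Hypothetical helper. In a real script, call dotenv.load_dotenv() first so
    that values from .env are visible in os.environ.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    endpoint, key = (env[name].strip() for name in REQUIRED_VARS)
    return endpoint, key
```

Failing early with the names of the missing variables tends to be easier to debug than an authentication error returned later by the service.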

Usage

Web Interface

  1. Start the web server:
    python app.py
  2. Open your browser and navigate to http://localhost:5000
  3. Upload a PDF file and select your desired output formats (Text, Markdown, JSON)
  4. View the results directly in the browser or download the extracted files

Basic Text Extraction

Extract text from a PDF and display it:

python extract_pdf.py path/to/your/document.pdf
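
For readers curious what a plain-text extraction might look like with the pinned azure-ai-formrecognizer SDK, here is a rough sketch. It is not the contents of extract_pdf.py: the function names are hypothetical, and the choice of the `prebuilt-read` model is an assumption, since this README does not say which model the tool uses.

```python
from typing import Iterable


def pages_to_text(pages: Iterable) -> str:
    """Join the lines of each analyzed page, separating pages with a blank line."""
    chunks = []
    for page in pages:
        # Each page exposes .lines; each line exposes its text as .content.
        chunks.append("\n".join(line.content for line in (page.lines or [])))
    return "\n\n".join(chunks)


def extract_text(pdf_path: str, endpoint: str, key: str) -> str:
    """Send a PDF to Document Intelligence and return its plain text."""
    # Imported here so pages_to_text can be used without the SDK installed.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(
        endpoint=endpoint, credential=AzureKeyCredential(key)
    )
    with open(pdf_path, "rb") as f:
        # "prebuilt-read" is an assumption; the project may use another model.
        poller = client.begin_analyze_document("prebuilt-read", document=f)
        result = poller.result()  # blocks until the analysis finishes
    return pages_to_text(result.pages)
```

Keeping the page-joining logic separate from the network call makes it possible to test the formatting without an Azure resource.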

Save Extracted Text to File

python extract_pdf.py input.pdf -o output.txt

Extract as Markdown Format

Extract with proper markdown formatting including headings, tables, and structure:

python extract_pdf.py document.pdf --markdown -o output.md

This will create a markdown file with:

  • Document title and metadata
  • Page-by-page content with headings
  • Tables formatted as markdown tables
  • Key-value pairs formatted as bullet lists
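
To illustrate the table and key-value formatting described above, the sketch below renders cells as a Markdown table and pairs as a bullet list. The exact layout the tool produces is not shown in this README, so treat this as one plausible approach: both function names are hypothetical, row 0 is assumed to be the header row, and cells are passed as `(row_index, column_index, content)` tuples.

```python
from typing import Iterable, List, Tuple

Cell = Tuple[int, int, str]  # (row_index, column_index, content)


def table_to_markdown(row_count: int, column_count: int, cells: Iterable[Cell]) -> str:
    """Render table cells as a Markdown table, treating row 0 as the header."""
    if row_count == 0 or column_count == 0:
        return ""
    grid: List[List[str]] = [[""] * column_count for _ in range(row_count)]
    for row, col, content in cells:
        # Escape pipes and flatten newlines so a cell cannot break its row.
        grid[row][col] = content.replace("|", "\\|").replace("\n", " ")
    lines = ["| " + " | ".join(grid[0]) + " |"]
    lines.append("| " + " | ".join(["---"] * column_count) + " |")
    lines.extend("| " + " | ".join(r) + " |" for r in grid[1:])
    return "\n".join(lines)


def key_values_to_markdown(pairs: Iterable[Tuple[str, str]]) -> str:
    """Render key-value pairs as a bullet list with bold keys."""
    return "\n".join(f"- **{key}**: {value}" for key, value in pairs)
```

Cells missing from the input are left empty, which keeps every row at the same width even when the analysis result is sparse.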

Extract Structured Data

Extract tables, key-value pairs, and paragraphs:

python extract_pdf.py document.pdf --structured

Save Structured Data as JSON

python extract_pdf.py document.pdf --structured -o output.json
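
The JSON layout written by `--structured -o` is not documented here. As a sketch only, structured results could be bundled and saved as follows; the key names (`total_pages`, `tables`, `key_value_pairs`, `paragraphs`) and both helper functions are assumptions made for illustration.

```python
import json
from typing import Any, Dict, List


def build_structured_payload(
    page_count: int,
    tables: List[List[List[str]]],
    key_value_pairs: Dict[str, str],
    paragraphs: List[str],
) -> Dict[str, Any]:
    """Bundle extracted pieces into one JSON-serializable dict (hypothetical layout)."""
    return {
        "total_pages": page_count,
        # Each table is a list of rows; each row is a list of cell strings.
        "tables": tables,
        "key_value_pairs": key_value_pairs,
        "paragraphs": paragraphs,
    }


def save_json(payload: Dict[str, Any], path: str) -> None:
    """Write the payload as UTF-8 JSON, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False, indent=2)
```

Passing `ensure_ascii=False` keeps accented or non-Latin text from the PDF legible in the output file instead of escaping it.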

Command Line Options

positional arguments:
  pdf_file             Path to the PDF file to process

optional arguments:
  -h, --help           Show help message and exit
  -o, --output OUTPUT  Output file path for extracted text
  --markdown           Format output as markdown (with headings, tables, and formatting)
  --structured         Extract structured data (tables, key-value pairs)
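
The options above map naturally onto Python's standard `argparse` module. The following is a sketch of a parser that would accept the same arguments; `build_parser` is a hypothetical name and the real script may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the documented command-line options."""
    parser = argparse.ArgumentParser(
        description="Extract text and structured data from a PDF."
    )
    parser.add_argument("pdf_file", help="Path to the PDF file to process")
    parser.add_argument("-o", "--output", help="Output file path for extracted text")
    parser.add_argument(
        "--markdown",
        action="store_true",
        help="Format output as markdown (with headings, tables, and formatting)",
    )
    parser.add_argument(
        "--structured",
        action="store_true",
        help="Extract structured data (tables, key-value pairs)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With this parser, `--markdown` and `--structured` default to `False` and `--output` defaults to `None`, so the script can fall back to printing results when no output path is given.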

Example Output

Plain Text Extraction

📄 Processing PDF: sample.pdf
⏳ Analyzing document...
📑 Processing page 1 of 3
📑 Processing page 2 of 3
📑 Processing page 3 of 3
✅ Extraction complete! Total pages: 3
📊 Total characters extracted: 5432

Structured Data Extraction

📄 Extracting structured data from: invoice.pdf
📊 Found 2 table(s)
🔑 Found 15 key-value pair(s)
📝 Found 8 paragraph(s)

==================================================
STRUCTURED DATA SUMMARY
==================================================
Total Pages: 2
Tables: 2
Key-Value Pairs: 15
Paragraphs: 8

Troubleshooting

Missing Environment Variables

If you see an error about missing environment variables, ensure:

  1. Your .env file exists in the project root
  2. It contains both AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT and AZURE_DOCUMENT_INTELLIGENCE_KEY
  3. The values are correct (no extra spaces or quotes)
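
The third check can be automated. The sketch below inspects a value as the running process sees it and reports stray whitespace or wrapping quotes; `find_value_problems` is a hypothetical helper, not part of this project.

```python
from typing import List


def find_value_problems(name: str, value: str) -> List[str]:
    """Return human-readable problems found in one environment variable value."""
    if not value:
        return [f"{name} is empty"]
    problems = []
    if value != value.strip():
        problems.append(f"{name} has leading or trailing whitespace")
    stripped = value.strip()
    # A value such as "abc" or 'abc' still carries its quote characters.
    if len(stripped) >= 2 and stripped[0] == stripped[-1] and stripped[0] in "\"'":
        problems.append(f"{name} is wrapped in quotes")
    return problems
```

For example, calling it with `os.environ.get("AZURE_DOCUMENT_INTELLIGENCE_KEY", "")` returns an empty list when the value looks clean.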

Authentication Errors

  • Verify your API key is correct
  • Ensure your endpoint URL is properly formatted
  • Check that your Azure resource is active and not expired
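
A quick shape check on the endpoint can rule out the most common formatting mistakes before any request is sent. This is a rough sketch under the assumption that a valid endpoint is an `https` URL with a host name, as in the .env example; `endpoint_looks_valid` is a hypothetical helper, and it cannot tell whether the resource actually exists.

```python
from urllib.parse import urlparse


def endpoint_looks_valid(endpoint: str) -> bool:
    """Rough shape check: the URL must use https and include a host name."""
    parsed = urlparse(endpoint.strip())
    return parsed.scheme == "https" and bool(parsed.hostname)
```

A value pasted without its `https://` prefix fails this check, because `urlparse` then treats the whole string as a path rather than a host.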

File Not Found

  • Ensure the PDF file path is correct
  • Use absolute paths if relative paths don't work
  • Check file permissions
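
These three checks can be combined into a small pre-flight function. The sketch below uses only the standard library; `check_pdf_path` is a hypothetical name and is not taken from extract_pdf.py.

```python
import os
from pathlib import Path


def check_pdf_path(raw_path: str) -> Path:
    """Resolve a path to absolute form and confirm it is a readable file."""
    path = Path(raw_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    if not os.access(path, os.R_OK):
        raise PermissionError(f"File is not readable: {path}")
    return path
```

Because the error message includes the resolved absolute path, it shows exactly where the script looked, which helps when a relative path is interpreted from an unexpected working directory.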

Requirements

  • Python 3.7+
  • Azure Document Intelligence resource (with API key and endpoint)
  • Internet connection for API calls

Dependencies

  • azure-ai-formrecognizer==3.3.3 - Azure Document Intelligence SDK
  • python-dotenv==1.0.0 - Environment variable management

License

This project is provided as-is for educational and development purposes.
