Transform unstructured data into structured LLM assets for RAG and automation.
unsiloed-parser is an open-source Python library for intelligent document chunking and AI-powered text extraction.
Perfect for building RAG pipelines, AI chatbots, knowledge bases, and automated document processing workflows.
Keywords:
semantic chunking · AI RAG tools · Python LLM preprocessing · PDF parser · OCR library · document AI
- Features
- Configuration
- Constraints & Limitations
- Request Parameters
- Installation
- Environment Setup
- Usage
- Development Setup
- Contributing
- License
- Community and Support
- Connect with Us
Supported File Types: PDF, DOCX, PPTX, HTML, Markdown, Images, Webpages, Spreadsheets
Chunking Strategies:
- Fixed Size: Splits text into chunks of a specified size, with optional overlap
- Page-based: Splits PDFs by page (PDF only; falls back to paragraph chunking for other file types)
- Semantic: Uses YOLO for layout segmentation and VLM + OCR to extract text, images, and tables, followed by semantic grouping for clean, contextual output
- Paragraph: Splits text by paragraphs
- Heading: Splits text at identified headings
- Hierarchical: Advanced multi-level chunking with parent-child relationships
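For example, here is a minimal sketch of fixed-size chunking using the `chunkSize` and `overlap` options shown in the Usage section below. Omitting `credentials` here is an assumption based on the strategy table, which marks only the semantic strategy as requiring an API key:

```python
import Unsiloed

# Fixed-size chunking: uniform chunks with overlap.
# Assumption: credentials may be omitted for non-semantic strategies.
result = Unsiloed.process_sync({
    "filePath": "./test.pdf",
    "strategy": "fixed",
    "chunkSize": 1000,  # target chunk size (compare avg_chunk_size in the output)
    "overlap": 100,     # amount shared between consecutive chunks
})
print(result["total_chunks"], result["avg_chunk_size"])
```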
- Modular LLM Selection: Choose from multiple LLM providers and models
- Local Model Integration: Support for locally hosted models (Ollama)
- Provider Options: OpenAI, Anthropic, Google, Cohere, and custom endpoints
- Model Flexibility: Switch between different models for different chunking strategies
- Mathematical Equations: Full LaTeX rendering and processing
- Scientific Documents: Optimized for academic and technical papers
- Formula Extraction: Intelligent extraction and preservation of mathematical formulas
- Equation Chunking: Maintains mathematical context across chunks
- Language Detection: Automatic language identification
- Parameterized Processing: Language-specific chunking strategies
- Unicode Support: Full support for non-Latin scripts
- Localized Chunking: Language-aware paragraph and sentence boundaries
- Images: JPG, PNG, TIFF, BMP, with OCR capabilities
- Webpages: Direct URL processing with content extraction
- Spreadsheets: Excel, CSV, with structured data extraction
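As a quick illustration, a scanned image can be processed with the same `process_sync` call used throughout the Usage section (the file name here is a placeholder):

```python
import os
import Unsiloed

# OCR a scanned image and chunk the extracted text semantically
result = Unsiloed.process_sync({
    "filePath": "./scanned_invoice.png",  # placeholder path
    "credentials": {"apiKey": os.environ.get("OPENAI_API_KEY")},
    "strategy": "semantic",
})
for chunk in result["chunks"]:
    print(chunk["text"][:80])
```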
OPENAI_API_KEY: Your OpenAI API key for semantic chunking
💡 Tip: We recommend installing in a virtual environment for project isolation.
# Create a virtual environment (optional but recommended)
python -m venv venv
# Activate the virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate
# Install unsiloed-parser
pip install unsiloed-parser

unsiloed-parser requires Python 3.8 or higher and has the following dependencies:
Core Dependencies:
- `openai` - OpenAI API integration
- `PyPDF2` - PDF processing
- `python-docx` - Word document processing
- `python-pptx` - PowerPoint processing
- `Pillow` - Image processing
- `pytesseract` - OCR capabilities
- `aiohttp` - Async HTTP client
- `requests` - HTTP library
- `beautifulsoup4` - HTML parsing
- `validators` - URL validation

AI & ML:
- `ultralytics` - YOLO model integration
- `opencv-python-headless` - Computer vision
- `numpy` - Numerical computing

Utilities:
- `python-dotenv` - Environment variable management
- `markdown` - Markdown processing
- `lxml` - XML/HTML parsing
- `html2text` - HTML to text conversion
- `pdf2image` - PDF to image conversion
Before using unsiloed-parser, set up your OpenAI API key for semantic chunking:
# Linux/macOS
export OPENAI_API_KEY="your-api-key-here"
# Windows (Command Prompt)
set OPENAI_API_KEY=your-api-key-here
# Windows (PowerShell)
$env:OPENAI_API_KEY="your-api-key-here"

Create a .env file in your project directory:
OPENAI_API_KEY=your-api-key-here

Then in your Python code:
from dotenv import load_dotenv
load_dotenv()  # This loads the variables from .env

Basic usage:

import os
import Unsiloed
result = Unsiloed.process_sync({
"filePath": "./test.pdf",
"credentials": {
"apiKey": os.environ.get("OPENAI_API_KEY")
},
"strategy": "semantic",
"chunkSize": 1000,
"overlap": 100
})
print(result)

Example Output (Semantic Chunking):
{
"file_type": "pdf",
"strategy": "semantic",
"total_chunks": 98,
"avg_chunk_size": 305.05,
"chunks": [
{
"text": "Introduction to Machine Learning",
"metadata": {
"page_number": 1,
"semantic_group_index": 0,
"element_count": 1,
"primary_element_type": "Section-header",
"avg_confidence": 0.92,
"combined_bbox": [100, 50, 500, 120],
"strategy": "semantic_openai_boundary_detection",
"split_confidence": 0.95,
"reading_order_start": 0,
"reading_order_end": 0,
"constituent_elements": [
{
"element_type": "Section-header",
"bbox": [100, 50, 500, 120],
"reading_order": "page_1_element_0"
}
]
}
},
{
"text": "Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience...",
"metadata": {
"page_number": 1,
"semantic_group_index": 1,
"element_count": 2,
"primary_element_type": "Text",
"avg_confidence": 0.89,
"combined_bbox": [100, 150, 800, 400]
}
},
{
"text": "The image shows a neural network architecture with multiple layers...",
"metadata": {
"page_number": 2,
"semantic_group_index": 2,
"primary_element_type": "Picture",
"avg_confidence": 0.94,
"combined_bbox": [100, 200, 700, 600]
}
},
{
"text": "```markdown\n| Model | Accuracy | Speed |\n|-------|----------|-------|\n| CNN | 95% | Fast |\n```",
"metadata": {
"page_number": 3,
"semantic_group_index": 3,
"primary_element_type": "Table",
"primary_content_type": "table",
"avg_confidence": 0.91,
"combined_bbox": [150, 100, 850, 500]
}
}
]
}
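Each chunk's `text` and `metadata` can be consumed directly, for example to prepare documents for an embedding or indexing step in a RAG pipeline. A minimal sketch based on the output schema above:

```python
import os
import Unsiloed

result = Unsiloed.process_sync({
    "filePath": "./test.pdf",
    "credentials": {"apiKey": os.environ.get("OPENAI_API_KEY")},
    "strategy": "semantic",
})

# Collect chunk texts and page numbers for downstream embedding/indexing
docs = [
    {"text": chunk["text"], "page": chunk["metadata"].get("page_number")}
    for chunk in result["chunks"]
]
print(f"Prepared {len(docs)} chunks from a {result['file_type']} "
      f"via the {result['strategy']} strategy")
```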
Process an HTML file:

import os
import Unsiloed

html_result = Unsiloed.process_sync({
"filePath": "./webpage.html",
"credentials": {
"apiKey": os.environ.get("OPENAI_API_KEY")
},
"strategy": "paragraph"
})

Process a Markdown file:

import os
import Unsiloed
markdown_result = Unsiloed.process_sync({
"filePath": "./README.md",
"credentials": {
"apiKey": os.environ.get("OPENAI_API_KEY")
},
"strategy": "heading"
})

Process a URL directly:

import os
import Unsiloed
url_result = Unsiloed.process_sync({
"filePath": "https://example.com",
"credentials": {
"apiKey": os.environ.get("OPENAI_API_KEY")
},
"strategy": "paragraph"
})

Process documents asynchronously:

import asyncio
import os
import Unsiloed
async def async_processing():
    result = await Unsiloed.process({
        "filePath": "./test.pdf",
        "credentials": {
            "apiKey": os.environ.get("OPENAI_API_KEY")
        },
        "strategy": "semantic"
    })
    return result
# Run async processing
async_result = asyncio.run(async_processing())
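Because `Unsiloed.process` is awaitable, multiple documents can be processed concurrently with `asyncio.gather`. A minimal sketch (the file names are placeholders):

```python
import asyncio
import os
import Unsiloed

async def process_many(paths):
    # Launch one processing task per file and await them together
    tasks = [
        Unsiloed.process({
            "filePath": path,
            "credentials": {"apiKey": os.environ.get("OPENAI_API_KEY")},
            "strategy": "semantic",
        })
        for path in paths
    ]
    return await asyncio.gather(*tasks)

results = asyncio.run(process_many(["./a.pdf", "./b.pdf"]))
print(f"Processed {len(results)} documents")
```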
Error handling:

import Unsiloed
import os
try:
    result = Unsiloed.process_sync({
        "filePath": "./document.pdf",
        "credentials": {
            "apiKey": os.environ.get("OPENAI_API_KEY")
        },
        "strategy": "semantic"
    })
    print(f"Successfully processed {len(result['chunks'])} chunks")
except FileNotFoundError:
    print("Error: File not found. Please check the file path.")
except ValueError as e:
    print(f"Error: Invalid configuration - {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
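Since only the semantic strategy requires an API key (see the strategy table below), one pattern is to fall back to a key-free strategy when no key is configured. This is a sketch, not part of the library's documented API; whether `credentials` may be omitted entirely is an assumption based on the "API Key Required" column:

```python
import os
import Unsiloed

api_key = os.environ.get("OPENAI_API_KEY")

# Prefer semantic chunking when a key is available; otherwise use
# paragraph chunking, which the strategy table marks as key-free.
options = {
    "filePath": "./document.pdf",
    "strategy": "semantic" if api_key else "paragraph",
}
if api_key:
    options["credentials"] = {"apiKey": api_key}  # assumption: optional

result = Unsiloed.process_sync(options)
print(result["strategy"], result["total_chunks"])
```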
Transform any document format into LLM-ready chunks with intelligent parsing and extraction.

| File Type | Extensions | Supported Strategies | Key Features | Use Cases |
|---|---|---|---|---|
| PDF Documents | .pdf | All strategies (semantic, fixed, page, paragraph, heading, hierarchical) | PDF chunking for RAG, page-level extraction, text and image parsing | Research papers, reports, ebooks, invoices |
| Word Documents | .docx | All except page-based | Document parsing, style-aware chunking, table extraction | Business documents, contracts, articles |
| PowerPoint | .pptx | All except page-based | Slide-by-slide processing, text and image extraction | Presentations, training materials, pitch decks |
| HTML Files | .html, .htm | All except page-based | Web content extraction, semantic HTML parsing | Web pages, documentation, blog posts |
| Markdown | .md, .markdown | All except page-based | Heading-based structure, code block preservation | Technical docs, READMEs, wikis |
| Web URLs | http://, https:// | All except page-based | Live webpage scraping, dynamic content extraction | Real-time content processing, web monitoring |
| Images | .jpg, .png, .tiff, .bmp | Semantic, fixed, paragraph | OCR for images, handwriting recognition, visual text extraction | Scanned documents, photos, screenshots |
| Spreadsheets | .xlsx, .csv | Semantic, fixed, paragraph | Structured data extraction, table parsing, cell-level analysis | Data tables, reports, inventories |
SEO Keywords: PDF chunking for RAG, OCR for images, document parsing for LLM, semantic document chunking, AI-powered text extraction, webpage to text conversion, DOCX parsing, structured data extraction
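For example, a CSV can be chunked with the same call used elsewhere in this README (the file name is a placeholder; the supported strategies for spreadsheets are listed in the table above):

```python
import os
import Unsiloed

# Extract structured data from a spreadsheet and chunk it by paragraph
result = Unsiloed.process_sync({
    "filePath": "./inventory.csv",  # placeholder path
    "credentials": {"apiKey": os.environ.get("OPENAI_API_KEY")},
    "strategy": "paragraph",
})
print(f"{result['total_chunks']} chunks from a {result['file_type']} file")
```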
Choose the optimal strategy for your document processing needs and RAG pipeline.
| Strategy | Best For | How It Works | API Key Required | Output Format |
|---|---|---|---|---|
| Semantic | RAG pipelines, AI applications, context-aware chunking | Uses YOLO segmentation + VLM + OCR to intelligently identify and group related content (text, images, tables) | ✅ Yes (OpenAI) | Structured chunks with semantic context, metadata, and type classification |
| Fixed | Token-limited LLMs, consistent chunk sizes, embeddings | Splits text into uniform chunks with configurable size and overlap | ❌ No | Fixed-size text chunks with character/word count control |
| Page | PDF documents, page-level processing | Extracts content page-by-page, preserving document structure | ❌ No | One chunk per page with page numbers |
| Paragraph | Natural text breaks, readability | Splits on paragraph boundaries using natural language structure | ❌ No | Paragraph-level chunks maintaining context |
| Heading | Hierarchical documents, documentation | Organizes content by heading structure (H1, H2, H3, etc.) | ❌ No | Section-based chunks with heading hierarchy |
| Hierarchical | Complex documents, parent-child relationships | Advanced multi-level chunking with nested structure and relationships | ❌ No | Nested chunks with parent-child metadata |
💡 Performance Tips:
- Use semantic for best RAG results and AI-powered content understanding
- Use fixed for consistent embedding sizes and token management
- Use heading for technical documentation and structured content
- Use hierarchical for complex documents requiring context preservation
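To see how the strategies differ on a given document, they can be compared side by side. A minimal sketch using the `total_chunks` and `avg_chunk_size` fields shown in the example output:

```python
import os
import Unsiloed

strategies = ["semantic", "fixed", "page", "paragraph", "heading", "hierarchical"]

for strategy in strategies:
    result = Unsiloed.process_sync({
        "filePath": "./test.pdf",
        "credentials": {"apiKey": os.environ.get("OPENAI_API_KEY")},
        "strategy": strategy,
    })
    print(f"{strategy:>12}: {result['total_chunks']} chunks, "
          f"avg size {result['avg_chunk_size']:.1f}")
```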
You can provide credentials in three ways:
- In the options (recommended):
result = Unsiloed.process_sync({
"filePath": "./test.pdf",
"credentials": {
"apiKey": "your-openai-api-key"
},
"strategy": "semantic"
})

- Environment variable:
export OPENAI_API_KEY="your-openai-api-key"

- Using .env file:
OPENAI_API_KEY=your-openai-api-key

Prerequisites:

- Python 3.8 or higher
- pip (Python package installer)
- git
- Clone the repository:
git clone https://github.com/Unsiloed-AI/Unsiloed-Parser.git
cd Unsiloed-Parser

- Create a virtual environment:
# Using venv
python -m venv venv
# Activate the virtual environment
# On Windows
venv\Scripts\activate
# On macOS/Linux
source venv/bin/activate

- Install dependencies:
pip install -r requirements.txt

- Set up your environment variables:
# Create a .env file
echo "OPENAI_API_KEY=your-api-key-here" > .env

- Run the FastAPI server locally (if applicable):
uvicorn Unsiloed.app:app --reload

- Access the API documentation:
Open your browser and go to http://localhost:8000/docs
We welcome contributions to unsiloed-parser!
Here's how you can help:
- Fork the repository and clone your fork:
git clone https://github.com/YOUR_USERNAME/Unsiloed-Parser.git
cd Unsiloed-Parser

- Install development dependencies:
pip install -r requirements.txt

- Create a new branch for your feature:
git checkout -b feature/your-feature-name

- Make your changes and write tests if applicable
- Commit your changes:
git commit -m "Add your meaningful commit message here"

- Push to your fork:
git push origin feature/your-feature-name

- Create a Pull Request from your fork to the main repository
- We follow PEP 8 for Python code style
- Use type hints where appropriate
- Document functions and classes with docstrings
- Write tests for new features
This project is licensed under the Apache-2.0 License - see the LICENSE file for details.
- GitHub Discussions 💬: For questions, ideas, and discussions
- Issues 🐛: For bug reports and feature requests
- Pull Requests 🔧: For contributing to the codebase
- Star ⭐ the repository to show support
- Watch 👀 for notifications on new releases
- Get in touch directly
- Book a discovery session
- Explore more features