Unsiloed Parser

AI-Powered Document Processing for LLMs

Transform unstructured data into structured LLM assets for RAG and automation.

unsiloed-parser is an open-source Python library for intelligent document chunking and AI-powered text extraction.

Perfect for building RAG pipelines, AI chatbots, knowledge bases, and automated document processing workflows.

🔑 Keywords: semantic chunking · AI RAG tools · Python LLM preprocessing · PDF parser · OCR library · document AI


✨ Features

📄 Document Chunking

Supported File Types: PDF, DOCX, PPTX, HTML, Markdown, Images, Webpages

Chunking Strategies:

  • Fixed Size : Splits text into chunks of specified size with optional overlap
  • Page-based : Splits PDF by pages (PDF only, falls back to paragraph for other file types)
  • Semantic : Uses YOLO for segmentation and VLM + OCR for intelligent extraction of text, images, and tables, followed by semantic grouping for clean, contextual output
  • Paragraph : Splits text by paragraphs
  • Heading : Splits text by identified headings
  • Hierarchical : Advanced multi-level chunking with parent-child relationships
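To make the two simplest strategies above concrete, here is a minimal sketch of fixed-size chunking with overlap and of paragraph splitting. It is written in plain Python for illustration and is not taken from the unsiloed-parser source; the function names `fixed_size_chunks` and `paragraph_chunks` are hypothetical, and sizes are assumed to be measured in characters.

```python
import re
from typing import List


def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into windows of chunk_size characters.

    Each window starts (chunk_size - overlap) characters after the previous
    one, so consecutive chunks share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def paragraph_chunks(text: str) -> List[str]:
    """Split text on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


fixed = fixed_size_chunks("abcdefghij", chunk_size=4, overlap=1)
paragraphs = paragraph_chunks("First paragraph.\n\nSecond paragraph.\n \nThird.")
print(fixed)
print(paragraphs)
```

A larger overlap repeats more context between neighbouring chunks at the cost of producing more chunks for the same input.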

🤖 Local LLM Model Support

  • Modular LLM Selection: Choose from multiple LLM providers and models
  • Local Model Integration: Support for locally hosted models (Ollama)
  • Provider Options: OpenAI, Anthropic, Google, Cohere, and custom endpoints
  • Model Flexibility: Switch between different models for different chunking strategies

🔢 LaTeX Support

  • Mathematical Equations: Full LaTeX rendering and processing
  • Scientific Documents: Optimized for academic and technical papers
  • Formula Extraction: Intelligent extraction and preservation of mathematical formulas
  • Equation Chunking: Maintains mathematical context across chunks

🌍 Multi-lingual Support

  • Language Detection: Automatic language identification
  • Parameterized Processing: Language-specific chunking strategies
  • Unicode Support: Full support for non-Latin scripts
  • Localized Chunking: Language-aware paragraph and sentence boundaries

📁 Extended File Format Support

  • Images : JPG, PNG, TIFF, BMP with OCR capabilities
  • Webpages : Direct URL processing with content extraction
  • Spreadsheets : Excel, CSV with structured data extraction

⚙️ Configuration

Environment Variables

  • OPENAI_API_KEY: Your OpenAI API key for semantic chunking

📦 Installation

Using pip

💡 Tip: We recommend installing in a virtual environment for project isolation.

# Create a virtual environment (optional but recommended)
python -m venv venv

# Activate the virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate

# Install unsiloed-parser
pip install unsiloed-parser

Requirements

unsiloed-parser requires Python 3.8 or higher and has the following dependencies:

Core Dependencies:

  • openai - OpenAI API integration
  • PyPDF2 - PDF processing
  • python-docx - Word document processing
  • python-pptx - PowerPoint processing
  • Pillow - Image processing
  • pytesseract - OCR capabilities
  • aiohttp - Async HTTP client
  • requests - HTTP library
  • beautifulsoup4 - HTML parsing
  • validators - URL validation

AI & ML:

  • ultralytics - YOLO model integration
  • opencv-python-headless - Computer vision
  • numpy - Numerical computing

Utilities:

  • python-dotenv - Environment variable management
  • markdown - Markdown processing
  • lxml - XML/HTML parsing
  • html2text - HTML to text conversion
  • pdf2image - PDF to image conversion

🔐 Environment Setup

Before using unsiloed-parser, set up your OpenAI API key for semantic chunking:

Using environment variables

# Linux/macOS
export OPENAI_API_KEY="your-api-key-here"

# Windows (Command Prompt)
set OPENAI_API_KEY=your-api-key-here

# Windows (PowerShell)
$env:OPENAI_API_KEY="your-api-key-here"

Using a .env file

Create a .env file in your project directory:

OPENAI_API_KEY=your-api-key-here

Then in your Python code:

from dotenv import load_dotenv
load_dotenv()  # This loads the variables from .env

💻 Usage

Example 1: Semantic Chunking

import os
import Unsiloed

result = Unsiloed.process_sync({
    "filePath": "./test.pdf",
    "credentials": {
        "apiKey": os.environ.get("OPENAI_API_KEY")
    },
    "strategy": "semantic",
    "chunkSize": 1000,
    "overlap": 100
})

print(result)

Example Output (Semantic Chunking):

{
  "file_type": "pdf",
  "strategy": "semantic",
  "total_chunks": 98,
  "avg_chunk_size": 305.05,
  "chunks": [
    {
      "text": "Introduction to Machine Learning",
      "metadata": {
        "page_number": 1,
        "semantic_group_index": 0,
        "element_count": 1,
        "primary_element_type": "Section-header",
        "avg_confidence": 0.92,
        "combined_bbox": [100, 50, 500, 120],
        "strategy": "semantic_openai_boundary_detection",
        "split_confidence": 0.95,
        "reading_order_start": 0,
        "reading_order_end": 0,
        "constituent_elements": [
          {
            "element_type": "Section-header",
            "bbox": [100, 50, 500, 120],
            "reading_order": "page_1_element_0"
          }
        ]
      }
    },
    {
      "text": "Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience...",
      "metadata": {
        "page_number": 1,
        "semantic_group_index": 1,
        "element_count": 2,
        "primary_element_type": "Text",
        "avg_confidence": 0.89,
        "combined_bbox": [100, 150, 800, 400]
      }
    },
    {
      "text": "The image shows a neural network architecture with multiple layers...",
      "metadata": {
        "page_number": 2,
        "semantic_group_index": 2,
        "primary_element_type": "Picture",
        "avg_confidence": 0.94,
        "combined_bbox": [100, 200, 700, 600]
      }
    },
    {
      "text": "```markdown\n| Model | Accuracy | Speed |\n|-------|----------|-------|\n| CNN   | 95%      | Fast  |\n```",
      "metadata": {
        "page_number": 3,
        "semantic_group_index": 3,
        "primary_element_type": "Table",
        "primary_content_type": "table",
        "avg_confidence": 0.91,
        "combined_bbox": [150, 100, 850, 500]
      }
    }
  ]
}
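The result shown above is a plain dictionary, so it can be post-processed with ordinary Python. The sketch below groups chunk text by `primary_element_type` and filters by `avg_confidence`. The helper names are hypothetical, the sample data is a trimmed copy of the output above, and missing metadata keys are handled defensively because not every chunk in the example carries every field.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_by_element_type(result: Dict[str, Any]) -> Dict[str, List[str]]:
    """Collect chunk texts under their primary element type."""
    grouped = defaultdict(list)
    for chunk in result.get("chunks", []):
        metadata = chunk.get("metadata", {})
        element_type = metadata.get("primary_element_type", "Unknown")
        grouped[element_type].append(chunk["text"])
    return dict(grouped)


def confident_chunks(result: Dict[str, Any], min_confidence: float = 0.9) -> List[Dict[str, Any]]:
    """Keep only chunks whose avg_confidence meets the threshold."""
    return [
        chunk
        for chunk in result.get("chunks", [])
        if chunk.get("metadata", {}).get("avg_confidence", 0.0) >= min_confidence
    ]


# Trimmed copy of the example output above, used only for illustration.
sample_result = {
    "chunks": [
        {"text": "Introduction to Machine Learning",
         "metadata": {"primary_element_type": "Section-header", "avg_confidence": 0.92}},
        {"text": "Machine learning is a subset of artificial intelligence...",
         "metadata": {"primary_element_type": "Text", "avg_confidence": 0.89}},
        {"text": "| Model | Accuracy | Speed |",
         "metadata": {"primary_element_type": "Table", "avg_confidence": 0.91}},
    ]
}

grouped = group_by_element_type(sample_result)
kept = confident_chunks(sample_result, min_confidence=0.9)
print(sorted(grouped))
print(len(kept))
```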

Example 2: Processing HTML Files 🌐

import os
import Unsiloed

html_result = Unsiloed.process_sync({
    "filePath": "./webpage.html",
    "credentials": {
        "apiKey": os.environ.get("OPENAI_API_KEY")
    },
    "strategy": "paragraph"
})

Example 3: Processing Markdown Files 📝

import os
import Unsiloed

markdown_result = Unsiloed.process_sync({
    "filePath": "./README.md",
    "credentials": {
        "apiKey": os.environ.get("OPENAI_API_KEY")
    },
    "strategy": "heading"
})

Example 4: Processing Website URLs 🔗

import os
import Unsiloed

url_result = Unsiloed.process_sync({
    "filePath": "https://example.com",
    "credentials": {
        "apiKey": os.environ.get("OPENAI_API_KEY")
    },
    "strategy": "paragraph"
})

Example 5: Using Async Version ⚡

import asyncio
import os
import Unsiloed

async def async_processing():
    result = await Unsiloed.process({
        "filePath": "./test.pdf",
        "credentials": {
            "apiKey": os.environ.get("OPENAI_API_KEY")
        },
        "strategy": "semantic"
    })
    return result

# Run async processing
async_result = asyncio.run(async_processing())
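An async entry point is most useful when several documents are processed at once. The following sketch runs many option dictionaries through an awaitable with bounded concurrency. The helper `process_many` and its `max_concurrency` parameter are hypothetical; the processing function is passed in as an argument, on the assumption that `Unsiloed.process` can be awaited as in Example 5. A stand-in coroutine is used here so the snippet runs without an API key.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def process_many(
    process: Callable[[Dict[str, Any]], Awaitable[Any]],
    option_sets: List[Dict[str, Any]],
    max_concurrency: int = 4,
) -> List[Any]:
    """Await process(options) for every entry, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(options: Dict[str, Any]) -> Any:
        async with semaphore:
            return await process(options)

    # gather returns results in the same order as option_sets.
    return await asyncio.gather(*(run_one(options) for options in option_sets))


async def fake_process(options: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for the real call; echoes back what it was asked to do."""
    await asyncio.sleep(0)
    return {"file": options["filePath"], "strategy": options["strategy"]}


results = asyncio.run(
    process_many(
        fake_process,
        [
            {"filePath": "./a.pdf", "strategy": "page"},
            {"filePath": "./b.md", "strategy": "heading"},
        ],
        max_concurrency=2,
    )
)
print(results)
```

With the library installed, `Unsiloed.process` would take the place of `fake_process`. Keeping `max_concurrency` low is a cautious default when each call may hit a rate-limited API.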

Example 6: Error Handling 🛡️

import Unsiloed
import os

try:
    result = Unsiloed.process_sync({
        "filePath": "./document.pdf",
        "credentials": {
            "apiKey": os.environ.get("OPENAI_API_KEY")
        },
        "strategy": "semantic"
    })
    print(f"Successfully processed {len(result['chunks'])} chunks")
    
except FileNotFoundError:
    print("Error: File not found. Please check the file path.")
    
except ValueError as e:
    print(f"Error: Invalid configuration - {e}")
    
except Exception as e:
    print(f"An unexpected error occurred: {e}")

📂 Supported File Types

Transform any document format into LLM-ready chunks with intelligent parsing and extraction.

| File Type | Extensions | Supported Strategies | Key Features | Use Cases |
|-----------|------------|----------------------|--------------|-----------|
| PDF Documents | .pdf | All strategies (semantic, fixed, page, paragraph, heading, hierarchical) | PDF chunking for RAG, page-level extraction, text and image parsing | Research papers, reports, ebooks, invoices |
| Word Documents | .docx | All except page-based | Document parsing, style-aware chunking, table extraction | Business documents, contracts, articles |
| PowerPoint | .pptx | All except page-based | Slide-by-slide processing, text and image extraction | Presentations, training materials, pitch decks |
| HTML Files | .html, .htm | All except page-based | Web content extraction, semantic HTML parsing | Web pages, documentation, blog posts |
| Markdown | .md, .markdown | All except page-based | Heading-based structure, code block preservation | Technical docs, READMEs, wikis |
| Web URLs | http://, https:// | All except page-based | Live webpage scraping, dynamic content extraction | Real-time content processing, web monitoring |
| Images | .jpg, .png, .tiff, .bmp | Semantic, fixed, paragraph | OCR for images, handwriting recognition, visual text extraction | Scanned documents, photos, screenshots |
| Spreadsheets | .xlsx, .csv | Semantic, fixed, paragraph | Structured data extraction, table parsing, cell-level analysis | Data tables, reports, inventories |
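The table above can be encoded as a small lookup for validating a strategy before a call is made. This is a sketch built only from the rows of the table; the names `STRATEGIES_BY_EXTENSION` and `supported_strategies` are hypothetical and are not part of the library.

```python
import os
from typing import Dict, FrozenSet

ALL_STRATEGIES = frozenset(
    {"semantic", "fixed", "page", "paragraph", "heading", "hierarchical"}
)
ALL_EXCEPT_PAGE = ALL_STRATEGIES - {"page"}
OCR_AND_TABULAR = frozenset({"semantic", "fixed", "paragraph"})

# One entry per extension listed in the table above.
STRATEGIES_BY_EXTENSION: Dict[str, FrozenSet[str]] = {
    ".pdf": ALL_STRATEGIES,
    ".docx": ALL_EXCEPT_PAGE,
    ".pptx": ALL_EXCEPT_PAGE,
    ".html": ALL_EXCEPT_PAGE,
    ".htm": ALL_EXCEPT_PAGE,
    ".md": ALL_EXCEPT_PAGE,
    ".markdown": ALL_EXCEPT_PAGE,
    ".jpg": OCR_AND_TABULAR,
    ".png": OCR_AND_TABULAR,
    ".tiff": OCR_AND_TABULAR,
    ".bmp": OCR_AND_TABULAR,
    ".xlsx": OCR_AND_TABULAR,
    ".csv": OCR_AND_TABULAR,
}


def supported_strategies(source: str) -> FrozenSet[str]:
    """Return the strategies the table lists for a file path or URL."""
    if source.lower().startswith(("http://", "https://")):
        return ALL_EXCEPT_PAGE
    extension = os.path.splitext(source)[1].lower()
    if extension not in STRATEGIES_BY_EXTENSION:
        raise ValueError(f"Unsupported file type: {extension or source}")
    return STRATEGIES_BY_EXTENSION[extension]


print(sorted(supported_strategies("report.pdf")))
print("page" in supported_strategies("https://example.com"))
```

The Features section notes that the page strategy falls back to paragraph chunking for non-PDF files, so a check like this is a convenience for catching surprises early rather than a strict requirement.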

SEO Keywords: PDF chunking for RAG, OCR for images, document parsing for LLM, semantic document chunking, AI-powered text extraction, webpage to text conversion, DOCX parsing, structured data extraction

🎯 Chunking Strategies

Choose the optimal strategy for your document processing needs and RAG pipeline.

| Strategy | Best For | How It Works | API Key Required | Output Format |
|----------|----------|--------------|------------------|---------------|
| Semantic | RAG pipelines, AI applications, context-aware chunking | Uses YOLO segmentation + VLM + OCR to intelligently identify and group related content (text, images, tables) | ✅ Yes (OpenAI) | Structured chunks with semantic context, metadata, and type classification |
| Fixed | Token-limited LLMs, consistent chunk sizes, embeddings | Splits text into uniform chunks with configurable size and overlap | ❌ No | Fixed-size text chunks with character/word count control |
| Page | PDF documents, page-level processing | Extracts content page-by-page, preserving document structure | ❌ No | One chunk per page with page numbers |
| Paragraph | Natural text breaks, readability | Splits on paragraph boundaries using natural language structure | ❌ No | Paragraph-level chunks maintaining context |
| Heading | Hierarchical documents, documentation | Organizes content by heading structure (H1, H2, H3, etc.) | ❌ No | Section-based chunks with heading hierarchy |
| Hierarchical | Complex documents, parent-child relationships | Advanced multi-level chunking with nested structure and relationships | ❌ No | Nested chunks with parent-child metadata |

💡 Performance Tips:

  • Use semantic for best RAG results and AI-powered content understanding
  • Use fixed for consistent embedding sizes and token management
  • Use heading for technical documentation and structured content
  • Use hierarchical for complex documents requiring context preservation
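Since only the semantic strategy is listed as requiring an API key, a small helper can assemble the options dictionary and attach credentials only when they are needed. This is a sketch under assumptions: `build_options` is a hypothetical name, the dictionary keys mirror the usage examples in this README, and it is assumed (from the "API Key Required" column, not from the library source) that the other strategies accept options without a `credentials` entry.

```python
import os
from typing import Any, Dict, Mapping, Optional


def build_options(
    file_path: str,
    strategy: str,
    chunk_size: Optional[int] = None,
    overlap: Optional[int] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Build an options dict, adding credentials only for the semantic strategy."""
    env = os.environ if env is None else env
    options: Dict[str, Any] = {"filePath": file_path, "strategy": strategy}

    if strategy == "semantic":
        api_key = env.get("OPENAI_API_KEY")
        if not api_key:
            raise RuntimeError("OPENAI_API_KEY must be set for the semantic strategy")
        options["credentials"] = {"apiKey": api_key}

    # Size settings are optional; leave them out unless explicitly given.
    if chunk_size is not None:
        options["chunkSize"] = chunk_size
    if overlap is not None:
        options["overlap"] = overlap
    return options


heading_options = build_options("./README.md", "heading", env={})
fixed_options = build_options("./notes.md", "fixed", chunk_size=500, overlap=50, env={})
print(heading_options)
print(fixed_options)
```

Failing early on a missing key gives a clearer error than letting the request reach the API without credentials.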

Credential Options

You can provide credentials in three ways:

  1. In the options (recommended):
result = Unsiloed.process_sync({
    "filePath": "./test.pdf",
    "credentials": {
        "apiKey": "your-openai-api-key"
    },
    "strategy": "semantic"
})
  2. Environment variable:
export OPENAI_API_KEY="your-openai-api-key"
  3. Using .env file:
OPENAI_API_KEY=your-openai-api-key

🛠️ Development Setup

Prerequisites

  • Python 3.8 or higher
  • pip (Python package installer)
  • git

Setting Up Local Development Environment

  1. Clone the repository:
git clone https://github.com/Unsiloed-AI/Unsiloed-Parser.git
cd Unsiloed-Parser
  2. Create a virtual environment:
# Using venv
python -m venv venv

# Activate the virtual environment
# On Windows
venv\Scripts\activate
# On macOS/Linux
source venv/bin/activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Set up your environment variables:
# Create a .env file
echo "OPENAI_API_KEY=your-api-key-here" > .env
  5. Run the FastAPI server locally (if applicable):
uvicorn Unsiloed.app:app --reload
  6. Access the API documentation: Open your browser and go to http://localhost:8000/docs

🤝 Contributing

We welcome contributions to unsiloed-parser!

Here's how you can help:

Setting Up Development Environment

  1. Fork the repository and clone your fork:
git clone https://github.com/YOUR_USERNAME/Unsiloed-Parser.git
cd Unsiloed-Parser
  2. Install development dependencies:
pip install -r requirements.txt

Making Changes

  1. Create a new branch for your feature:
git checkout -b feature/your-feature-name
  2. Make your changes and write tests if applicable
  3. Commit your changes:
git commit -m "Add your meaningful commit message here"
  4. Push to your fork:
git push origin feature/your-feature-name
  5. Create a Pull Request from your fork to the main repository

Code Style and Standards

  • We follow PEP 8 for Python code style
  • Use type hints where appropriate
  • Document functions and classes with docstrings
  • Write tests for new features

📄 License

This project is licensed under the Apache-2.0 License - see the LICENSE file for details.

🌟 Community and Support

Join the Community

  • GitHub Discussions 💬: For questions, ideas, and discussions
  • Issues 🐛: For bug reports and feature requests
  • Pull Requests 🔧: For contributing to the codebase

Staying Updated

  • Star ⭐ the repository to show support
  • Watch 👀 for notifications of new releases

📞 Connect with Us

Ready to Transform Your Data? Let's Connect! 🚀

📧 Email Us

Get in touch directly

hello@unsiloed-ai.com

📅 Schedule a Call

Book a discovery session

Schedule with our team

🌐 Visit Our Website

Explore more features

www.unsiloed-ai.com

Made with ❤️ by the Unsiloed AI Team
