Transform unstructured data into structured LLM assets for RAG and automation.
unsiloed-parser is an open-source Python library for intelligent document chunking and AI-powered text extraction.
Perfect for building RAG pipelines, AI chatbots, knowledge bases, and automated document processing workflows.
Keywords:
semantic chunking · AI RAG tools · Python LLM preprocessing · PDF parser · OCR library · document AI
- Features
- Configuration
- Constraints & Limitations
- Request Parameters
- Installation
- Environment Setup
- Usage
- Development Setup
- Contributing
- License
- Community and Support
- Connect with Us
Supported File Types: PDF, DOCX, PPTX, HTML, Markdown, Images, Webpages, Spreadsheets
Chunking Strategies:
- Fixed Size: Splits text into chunks of a specified size, with optional overlap
- Page-based: Splits PDFs by page (PDF only; falls back to paragraph chunking for other file types)
- Semantic: Uses YOLO for layout segmentation and VLM + OCR to extract text, images, and tables, followed by semantic grouping for clean, contextual output
- Paragraph: Splits text by paragraphs
- Heading: Splits text at identified headings
- Hierarchical: Advanced multi-level chunking with parent-child relationships
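For example, here is a minimal sketch of fixed-size chunking using the `chunkSize` and `overlap` options shown in the Usage section below. Omitting `credentials` here is an assumption based on the strategy table, which marks only the semantic strategy as requiring an API key:

```python
import Unsiloed

# Fixed-size chunking: uniform chunks with overlap.
# Assumption: credentials may be omitted for non-semantic strategies.
result = Unsiloed.process_sync({
    "filePath": "./test.pdf",
    "strategy": "fixed",
    "chunkSize": 1000,  # target chunk size (compare avg_chunk_size in the output)
    "overlap": 100,     # amount shared between consecutive chunks
})
print(result["total_chunks"], result["avg_chunk_size"])
```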
- Modular LLM Selection: Choose from multiple LLM providers and models
- Local Model Integration: Support for locally hosted models (Ollama)
- Provider Options: OpenAI, Anthropic, Google, Cohere, and custom endpoints
- Model Flexibility: Switch between different models for different chunking strategies
- Mathematical Equations: Full LaTeX rendering and processing
- Scientific Documents: Optimized for academic and technical papers
- Formula Extraction: Intelligent extraction and preservation of mathematical formulas
- Equation Chunking: Maintains mathematical context across chunks
- Language Detection: Automatic language identification
- Parameterized Processing: Language-specific chunking strategies
- Unicode Support: Full support for non-Latin scripts
- Localized Chunking: Language-aware paragraph and sentence boundaries
- Images: JPG, PNG, TIFF, BMP, with OCR capabilities
- Webpages: Direct URL processing with content extraction
- Spreadsheets: Excel, CSV, with structured data extraction
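As a quick illustration, a scanned image can be processed with the same `process_sync` call used throughout the Usage section (the file name here is a placeholder):

```python
import os
import Unsiloed

# OCR a scanned image and chunk the extracted text semantically
result = Unsiloed.process_sync({
    "filePath": "./scanned_invoice.png",  # placeholder path
    "credentials": {"apiKey": os.environ.get("OPENAI_API_KEY")},
    "strategy": "semantic",
})
for chunk in result["chunks"]:
    print(chunk["text"][:80])
```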
OPENAI_API_KEY: Your OpenAI API key for semantic chunking
💡 Tip: We recommend installing in a virtual environment for project isolation.
# Create a virtual environment (optional but recommended)
python -m venv venv
# Activate the virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate
# Install unsiloed-parser
pip install unsiloed-parser

unsiloed-parser requires Python 3.8 or higher and has the following dependencies:
Core Dependencies:
- `openai` - OpenAI API integration
- `PyPDF2` - PDF processing
- `python-docx` - Word document processing
- `python-pptx` - PowerPoint processing
- `Pillow` - Image processing
- `pytesseract` - OCR capabilities
- `aiohttp` - Async HTTP client
- `requests` - HTTP library
- `beautifulsoup4` - HTML parsing
- `validators` - URL validation

AI & ML:
- `ultralytics` - YOLO model integration
- `opencv-python-headless` - Computer vision
- `numpy` - Numerical computing

Utilities:
- `python-dotenv` - Environment variable management
- `markdown` - Markdown processing
- `lxml` - XML/HTML parsing
- `html2text` - HTML to text conversion
- `pdf2image` - PDF to image conversion
Before using unsiloed-parser, set up your OpenAI API key for semantic chunking:
# Linux/macOS
export OPENAI_API_KEY="your-api-key-here"
# Windows (Command Prompt)
set OPENAI_API_KEY=your-api-key-here
# Windows (PowerShell)
$env:OPENAI_API_KEY="your-api-key-here"

Create a .env file in your project directory:
OPENAI_API_KEY=your-api-key-here

Then in your Python code:
from dotenv import load_dotenv
load_dotenv()  # This loads the variables from .env

Basic usage:

import os
import Unsiloed
result = Unsiloed.process_sync({
"filePath": "./test.pdf",
"credentials": {
"apiKey": os.environ.get("OPENAI_API_KEY")
},
"strategy": "semantic",
"chunkSize": 1000,
"overlap": 100
})
print(result)

Example Output (Semantic Chunking):
{
"file_type": "pdf",
"strategy": "semantic",
"total_chunks": 98,
"avg_chunk_size": 305.05,
"chunks": [
{
"text": "Introduction to Machine Learning",
"metadata": {
"page_number": 1,
"semantic_group_index": 0,
"element_count": 1,
"primary_element_type": "Section-header",
"avg_confidence": 0.92,
"combined_bbox": [100, 50, 500, 120],
"strategy": "semantic_openai_boundary_detection",
"split_confidence": 0.95,
"reading_order_start": 0,
"reading_order_end": 0,
"constituent_elements": [
{
"element_type": "Section-header",
"bbox": [100, 50, 500, 120],
"reading_order": "page_1_element_0"
}
]
}
},
{
"text": "Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience...",
"metadata": {
"page_number": 1,
"semantic_group_index": 1,
"element_count": 2,
"primary_element_type": "Text",
"avg_confidence": 0.89,
"combined_bbox": [100, 150, 800, 400]
}
},
{
"text": "The image shows a neural network architecture with multiple layers...",
"metadata": {
"page_number": 2,
"semantic_group_index": 2,
"primary_element_type": "Picture",
"avg_confidence": 0.94,
"combined_bbox": [100, 200, 700, 600]
}
},
{
"text": "```markdown\n| Model | Accuracy | Speed |\n|-------|----------|-------|\n| CNN | 95% | Fast |\n```",
"metadata": {
"page_number": 3,
"semantic_group_index": 3,
"primary_element_type": "Table",
"primary_content_type": "table",
"avg_confidence": 0.91,
"combined_bbox": [150, 100, 850, 500]
}
}
]
}
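Each chunk's `text` and `metadata` can be consumed directly, for example to prepare documents for an embedding or indexing step in a RAG pipeline. A minimal sketch based on the output schema above:

```python
import os
import Unsiloed

result = Unsiloed.process_sync({
    "filePath": "./test.pdf",
    "credentials": {"apiKey": os.environ.get("OPENAI_API_KEY")},
    "strategy": "semantic",
})

# Collect chunk texts and page numbers for downstream embedding/indexing
docs = [
    {"text": chunk["text"], "page": chunk["metadata"].get("page_number")}
    for chunk in result["chunks"]
]
print(f"Prepared {len(docs)} chunks from a {result['file_type']} "
      f"via the {result['strategy']} strategy")
```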
Process an HTML file:

import os
import Unsiloed

html_result = Unsiloed.process_sync({
"filePath": "./webpage.html",
"credentials": {
"apiKey": os.environ.get("OPENAI_API_KEY")
},
"strategy": "paragraph"
})

Process a Markdown file:

import os
import Unsiloed
markdown_result = Unsiloed.process_sync({
"filePath": "./README.md",
"credentials": {
"apiKey": os.environ.get("OPENAI_API_KEY")
},
"strategy": "heading"
})

Process a URL directly:

import os
import Unsiloed
url_result = Unsiloed.process_sync({
"filePath": "https://example.com",
"credentials": {
"apiKey": os.environ.get("OPENAI_API_KEY")
},
"strategy": "paragraph"
})

Process documents asynchronously:

import asyncio
import os
import Unsiloed
async def async_processing():
    result = await Unsiloed.process({
        "filePath": "./test.pdf",
        "credentials": {
            "apiKey": os.environ.get("OPENAI_API_KEY")
        },
        "strategy": "semantic"
    })
    return result
# Run async processing
async_result = asyncio.run(async_processing())
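Because `Unsiloed.process` is awaitable, multiple documents can be processed concurrently with `asyncio.gather`. A minimal sketch (the file names are placeholders):

```python
import asyncio
import os
import Unsiloed

async def process_many(paths):
    # Launch one processing task per file and await them together
    tasks = [
        Unsiloed.process({
            "filePath": path,
            "credentials": {"apiKey": os.environ.get("OPENAI_API_KEY")},
            "strategy": "semantic",
        })
        for path in paths
    ]
    return await asyncio.gather(*tasks)

results = asyncio.run(process_many(["./a.pdf", "./b.pdf"]))
print(f"Processed {len(results)} documents")
```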
Error handling:

import Unsiloed
import os
try:
    result = Unsiloed.process_sync({
        "filePath": "./document.pdf",
        "credentials": {
            "apiKey": os.environ.get("OPENAI_API_KEY")
        },
        "strategy": "semantic"
    })
    print(f"Successfully processed {len(result['chunks'])} chunks")
except FileNotFoundError:
    print("Error: File not found. Please check the file path.")
except ValueError as e:
    print(f"Error: Invalid configuration - {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
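Since only the semantic strategy requires an API key (see the strategy table below), one pattern is to fall back to a key-free strategy when no key is configured. This is a sketch, not part of the library's documented API; whether `credentials` may be omitted entirely is an assumption based on the "API Key Required" column:

```python
import os
import Unsiloed

api_key = os.environ.get("OPENAI_API_KEY")

# Prefer semantic chunking when a key is available; otherwise use
# paragraph chunking, which the strategy table marks as key-free.
options = {
    "filePath": "./document.pdf",
    "strategy": "semantic" if api_key else "paragraph",
}
if api_key:
    options["credentials"] = {"apiKey": api_key}  # assumption: optional

result = Unsiloed.process_sync(options)
print(result["strategy"], result["total_chunks"])
```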
Transform any document format into LLM-ready chunks with intelligent parsing and extraction.

| File Type | Extensions | Supported Strategies | Key Features | Use Cases |
|---|---|---|---|---|
| PDF Documents | .pdf | All strategies (semantic, fixed, page, paragraph, heading, hierarchical) | PDF chunking for RAG, page-level extraction, text and image parsing | Research papers, reports, ebooks, invoices |
| Word Documents | .docx | All except page-based | Document parsing, style-aware chunking, table extraction | Business documents, contracts, articles |
| PowerPoint | .pptx | All except page-based | Slide-by-slide processing, text and image extraction | Presentations, training materials, pitch decks |
| HTML Files | .html, .htm | All except page-based | Web content extraction, semantic HTML parsing | Web pages, documentation, blog posts |
| Markdown | .md, .markdown | All except page-based | Heading-based structure, code block preservation | Technical docs, READMEs, wikis |
| Web URLs | http://, https:// | All except page-based | Live webpage scraping, dynamic content extraction | Real-time content processing, web monitoring |
| Images | .jpg, .png, .tiff, .bmp | Semantic, fixed, paragraph | OCR for images, handwriting recognition, visual text extraction | Scanned documents, photos, screenshots |
| Spreadsheets | .xlsx, .csv | Semantic, fixed, paragraph | Structured data extraction, table parsing, cell-level analysis | Data tables, reports, inventories |
SEO Keywords: PDF chunking for RAG, OCR for images, document parsing for LLM, semantic document chunking, AI-powered text extraction, webpage to text conversion, DOCX parsing, structured data extraction
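For example, a CSV can be chunked with the same call used elsewhere in this README (the file name is a placeholder; the supported strategies for spreadsheets are listed in the table above):

```python
import os
import Unsiloed

# Extract structured data from a spreadsheet and chunk it by paragraph
result = Unsiloed.process_sync({
    "filePath": "./inventory.csv",  # placeholder path
    "credentials": {"apiKey": os.environ.get("OPENAI_API_KEY")},
    "strategy": "paragraph",
})
print(f"{result['total_chunks']} chunks from a {result['file_type']} file")
```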
Choose the optimal strategy for your document processing needs and RAG pipeline.
| Strategy | Best For | How It Works | API Key Required | Output Format |
|---|---|---|---|---|
| Semantic | RAG pipelines, AI applications, context-aware chunking | Uses YOLO segmentation + VLM + OCR to intelligently identify and group related content (text, images, tables) | ✅ Yes (OpenAI) | Structured chunks with semantic context, metadata, and type classification |
| Fixed | Token-limited LLMs, consistent chunk sizes, embeddings | Splits text into uniform chunks with configurable size and overlap | ❌ No | Fixed-size text chunks with character/word count control |
| Page | PDF documents, page-level processing | Extracts content page-by-page, preserving document structure | ❌ No | One chunk per page with page numbers |
| Paragraph | Natural text breaks, readability | Splits on paragraph boundaries using natural language structure | ❌ No | Paragraph-level chunks maintaining context |
| Heading | Hierarchical documents, documentation | Organizes content by heading structure (H1, H2, H3, etc.) | ❌ No | Section-based chunks with heading hierarchy |
| Hierarchical | Complex documents, parent-child relationships | Advanced multi-level chunking with nested structure and relationships | ❌ No | Nested chunks with parent-child metadata |
💡 Performance Tips:
- Use semantic for best RAG results and AI-powered content understanding
- Use fixed for consistent embedding sizes and token management
- Use heading for technical documentation and structured content
- Use hierarchical for complex documents requiring context preservation
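To see how the strategies differ on a given document, they can be compared side by side. A minimal sketch using the `total_chunks` and `avg_chunk_size` fields shown in the example output:

```python
import os
import Unsiloed

strategies = ["semantic", "fixed", "page", "paragraph", "heading", "hierarchical"]

for strategy in strategies:
    result = Unsiloed.process_sync({
        "filePath": "./test.pdf",
        "credentials": {"apiKey": os.environ.get("OPENAI_API_KEY")},
        "strategy": strategy,
    })
    print(f"{strategy:>12}: {result['total_chunks']} chunks, "
          f"avg size {result['avg_chunk_size']:.1f}")
```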
You can provide credentials in three ways:
- In the options (recommended):
result = Unsiloed.process_sync({
"filePath": "./test.pdf",
"credentials": {
"apiKey": "your-openai-api-key"
},
"strategy": "semantic"
})

- Environment variable:
export OPENAI_API_KEY="your-openai-api-key"

- Using .env file:
OPENAI_API_KEY=your-openai-api-key

Prerequisites:

- Python 3.8 or higher
- pip (Python package installer)
- git
- Clone the repository:
git clone https://github.com/Unsiloed-AI/Unsiloed-Parser.git
cd Unsiloed-Parser

- Create a virtual environment:
# Using venv
python -m venv venv
# Activate the virtual environment
# On Windows
venv\Scripts\activate
# On macOS/Linux
source venv/bin/activate

- Install dependencies:
pip install -r requirements.txt

- Set up your environment variables:
# Create a .env file
echo "OPENAI_API_KEY=your-api-key-here" > .env

- Run the FastAPI server locally (if applicable):
uvicorn Unsiloed.app:app --reload

- Access the API documentation:
Open your browser and go to http://localhost:8000/docs
We welcome contributions to unsiloed-parser!
Here's how you can help:
- Fork the repository and clone your fork:
git clone https://github.com/YOUR_USERNAME/Unsiloed-Parser.git
cd Unsiloed-Parser

- Install development dependencies:
pip install -r requirements.txt

- Create a new branch for your feature:
git checkout -b feature/your-feature-name

- Make your changes and write tests if applicable
- Commit your changes:
git commit -m "Add your meaningful commit message here"

- Push to your fork:
git push origin feature/your-feature-name

- Create a Pull Request from your fork to the main repository
- We follow PEP 8 for Python code style
- Use type hints where appropriate
- Document functions and classes with docstrings
- Write tests for new features
This project is licensed under the Apache-2.0 License - see the LICENSE file for details.
- GitHub Discussions 💬: For questions, ideas, and discussions
- Issues 🐛: For bug reports and feature requests
- Pull Requests 🔧: For contributing to the codebase
- Star ⭐ the repository to show support
- Watch 👀 for notifications on new releases
- Get in touch directly
- Book a discovery session
- Explore more features