🎓 Production-grade RAG Tutor for handwritten notes. Features Hybrid Search (BGE-M3 + BM25), Cross-Encoder Reranking, and an LLM-based OCR cleaning pipeline.

apooorv19/QEDS-RAG-Project

🎓 QED-Scribe - Hybrid RAG

This project is a production-grade Retrieval-Augmented Generation (RAG) tutor that answers questions over handwritten QEDS (Quantitative Economics & Data Science) notes, even when the OCR text is noisy.

It uses:

  • 🔍 Hybrid Retrieval (BM25 + BGE-M3)
  • 🔄 RAG Fusion using FLAN-T5 paraphraser
  • 🔥 Cross-Encoder Reranking
  • ✂️ Contextual Compression
  • 🧹 OCR Noise Sanitization
  • 🧠 LLaMA3 (Ollama) for answering
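The README does not say how the BM25 and BGE-M3 result lists (or the lists produced by the paraphrased queries) are merged. One common choice is reciprocal rank fusion (RRF); the sketch below assumes that method. The chunk ids and the constant `k=60` are illustrative assumptions, not values taken from this project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default, not a value
    taken from this project.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (highest first); break ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical chunk ids, as if returned by a BM25 index and a dense index.
bm25_hits = ["m1_c3", "m2_c1", "m1_c7"]
dense_hits = ["m2_c1", "m4_c2", "m1_c3"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

RRF uses only rank positions, so BM25 scores and embedding similarities never need to be put on a common scale. The fused list would then be the input to the cross-encoder reranker.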

Perfect for academic notes, handwritten documents, mathematical derivations, and noisy OCR text.


📸 Demo

Asking about the Gini coefficient:

[Screenshot: App Demo]

Asking about homogeneous differential equations:

[Screenshot: App Demo] The AI retrieves the correct handwritten module, fixes the math symbols, and explains the concept.


🔍 What Problem Does This Solve?

Students often struggle with:

  • Handwritten lecture notes
  • OCR errors in scanned material
  • Fragmented definitions and formulas
  • Losing context while asking follow-up questions

QEDS-GPT solves this by:

  • Indexing cleaned OCR notes into a semantic vector database
  • Retrieving the most relevant modules and topics
  • Correcting equations and notations
  • Allowing multi-turn, memory-aware academic conversations

🚀 Features

🧠 Conversational RAG with Memory

  • Multi-turn chat interface
  • Context retained across refreshes and sessions
  • No login required

📚 Semantic Retrieval over Notes

  • ChromaDB vector store
  • Hugging Face bge-m3 embeddings
  • Module-aware retrieval with metadata:
    • Semester
    • Subject
    • Module
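Module-aware retrieval can be sketched as a metadata filter passed alongside the query. The helper below builds a filter in ChromaDB's `where` syntax. The function name `build_where` and the metadata keys `semester`, `subject`, and `module` are assumptions, since the README lists the fields but not how they are stored.

```python
def build_where(semester=None, subject=None, module=None):
    """Build a Chroma-style metadata filter from optional fields.

    The keys "semester", "subject" and "module" are assumed names.
    """
    clauses = []
    for key, value in (("semester", semester), ("subject", subject), ("module", module)):
        if value is not None:
            clauses.append({key: {"$eq": value}})
    if not clauses:
        return None          # no filter: search all notes
    if len(clauses) == 1:
        return clauses[0]    # a single condition needs no $and wrapper
    return {"$and": clauses}


where = build_where(semester=4, module=1)
print(where)
```

The resulting dictionary could be passed as the `where` argument of a ChromaDB collection query. The single-clause branch exists because `$and` expects at least two conditions.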

✍️ OCR Noise Correction

  • Fixes broken equations and symbols
  • Rewrites math in clean LaTeX
  • Repairs fragmented sentences
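The project describes its OCR cleaning as LLM-based. A cheap deterministic pass before the LLM can still remove the most mechanical noise. The sketch below is an assumed pre-cleaning step, not the project's actual pipeline, and the function name `preclean_ocr` is hypothetical.

```python
import re


def preclean_ocr(text):
    """Apply cheap, deterministic fixes before any LLM-based cleanup."""
    # Rejoin words hyphenated across a line break: "deriva-\ntion" -> "derivation".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat remaining single line breaks as spaces; keep blank lines as paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


raw = "The Gini coeffi-\ncient  measures\ninequality.\n\nIt lies in [0, 1]."
print(preclean_ocr(raw))
```

The hyphen rule also merges genuine compounds that happen to break at a line end, so equation repair and LaTeX rewriting are better left to the LLM stage.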

🚫 Hallucination Controls

  • Vague-query detection
  • Relevance filtering
  • Falls back on general academic knowledge only to reconstruct context missing from the retrieved notes
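The first two controls can be sketched with simple heuristics. The stopword list, the `min_content_words` default, and the `min_score` threshold below are illustrative assumptions, because the README does not describe how vagueness or relevance is actually measured.

```python
STOPWORDS = {"a", "an", "the", "is", "it", "this", "that", "what", "how",
             "why", "explain", "tell", "me", "about", "of", "to", "in", "and"}


def is_vague(query, min_content_words=1):
    """Flag a query that has too few content words to retrieve on."""
    words = [w.strip("?.,!").lower() for w in query.split()]
    content = [w for w in words if w and w not in STOPWORDS]
    return len(content) < min_content_words


def filter_relevant(hits, min_score=0.3):
    """Keep (chunk, score) pairs whose score clears the threshold."""
    return [(chunk, score) for chunk, score in hits if score >= min_score]


vague = is_vague("Explain this")
specific = is_vague("Explain the Slutsky substitution effect.")
kept = filter_relevant([("slutsky_def", 0.82), ("unrelated", 0.11)])
print(vague, specific, kept)
```

A vague query would prompt the user for clarification instead of triggering retrieval. If the relevance filter leaves no chunks, the app can say so instead of letting the LLM answer without supporting notes.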

📦 Production Deployment

  • Dockerized application
  • Deployed on Hugging Face Spaces
  • Git LFS support for vector index files
  • Secure API key management

🧠 Architecture

RAG Pipeline

[Figure: Basic RAG Pipeline]

Detailed Retrieval & Embedding Flow

[Figure: Detailed Architecture]

User
↓
Streamlit Chat UI
↓
Semantic Retrieval (ChromaDB)
↓
OCR Cleanup & Context Sanitization
↓
LLM Reasoning (Groq)
↓
Answer + Updated Memory

🧪 Example Capabilities

Query: "Explain the Slutsky substitution effect."

System Action: Retrieves Economics notes from Semester 3, fixes OCR typos in the definition, and presents the derivation.

Query: "Solve the Bernoulli differential equation from Module 1."

System Action: Filters for "Semester 4 - Diff Eq", finds the specific raw formula, converts it to clean LaTeX, and explains the solution steps.

🔒 Notes

  • Handwritten notes are private and are not included in the repository.
  • The vector database is stored using Git LFS.
  • Conversation memory is stored locally per user in SQLite.
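Per-user memory in SQLite can be sketched with the standard-library `sqlite3` module. The table layout, the `session_id` column, and the function names are assumptions for illustration. The README states only that memory is stored locally per user in SQLite.

```python
import sqlite3


def open_memory(path=":memory:"):
    """Open (or create) the chat-history database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " session_id TEXT NOT NULL,"
        " role TEXT NOT NULL,"
        " content TEXT NOT NULL)"
    )
    return conn


def add_message(conn, session_id, role, content):
    """Append one chat message to a session's history."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
            (session_id, role, content),
        )


def load_history(conn, session_id, limit=20):
    """Return the most recent messages for a session, oldest first."""
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE session_id = ?"
        " ORDER BY id DESC LIMIT ?",
        (session_id, limit),
    ).fetchall()
    return rows[::-1]


conn = open_memory()
add_message(conn, "s1", "user", "What is the Gini coefficient?")
add_message(conn, "s1", "assistant", "A measure of inequality.")
add_message(conn, "s2", "user", "Unrelated session")
history = load_history(conn, "s1")
print(history)
```

Passing a file path instead of `":memory:"` lets the history survive page refreshes and app restarts. The `limit` keeps the prompt from growing without bound.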


📁 Project Structure

QEDS-RAG-Project/
│
├── chroma_db/              # Vector database
│
├── src/
│   └── streamlit_app.py    # Main conversational RAG app
│
├── Dockerfile
├── requirements.txt
└── README.md

⚙️ Tech Stack

- Language: Python
- UI: Streamlit
- Vector DB: ChromaDB
- Embeddings: Hugging Face `BAAI/bge-m3`
- LLM: Groq (LLaMA-3.1-8B-Instant)
- Deployment: Docker + Hugging Face Spaces

🛠️ Installation & Usage

1. Prerequisites

Make sure Ollama is installed and running with Llama 3:

ollama run llama3

2. Setup Environment

git clone https://github.com/apooorv19/QEDS-RAG-Project.git
cd QEDS-RAG-Project
pip install -r requirements.txt

3. Set Environment Variable

export GROQ_API_KEY=your_api_key_here

4. Run the App

streamlit run src/streamlit_app.py
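Inside the app, the `GROQ_API_KEY` exported above has to be read from the environment. A minimal sketch that fails fast with a clear message when the key is missing follows. The helper name `get_groq_api_key` is hypothetical and may not match what `src/streamlit_app.py` actually does.

```python
import os


def get_groq_api_key(env=os.environ):
    """Return GROQ_API_KEY or raise a clear error if it is missing."""
    key = env.get("GROQ_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GROQ_API_KEY is not set. Export it before starting the app."
        )
    return key


# Demonstration with an explicit mapping instead of the real environment.
demo_key = get_groq_api_key({"GROQ_API_KEY": "your_api_key_here"})
print(demo_key)
```

Reading the key from the environment keeps it out of the source code. The same variable is supplied with `-e GROQ_API_KEY=...` when the app runs in Docker.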

🐳 Docker Deployment

docker build -t qeds-gpt .
docker run -p 8501:8501 -e GROQ_API_KEY=your_api_key qeds-gpt

🌐 Live Demo

https://huggingface.co/spaces/Apooorv69/QEDS-RAG-Project


👤 Author

Apurva Mishra
IMSc Quantitative Economics & Data Science
Birla Institute of Technology, Mesra

GitHub: https://github.com/apooorv19
LinkedIn: https://www.linkedin.com/in/apooorv/


📜 Citations & Credits

@misc{paruchuri2025surya,
  author       = {Vikas Paruchuri and Datalab Team},
  title        = {Surya: A lightweight document OCR and analysis toolkit},
  year         = {2025},
  howpublished = {\url{https://github.com/VikParuchuri/surya}},
  note         = {GitHub repository},
}
