This project is a production-grade Retrieval-Augmented Generation (RAG) Tutor built to understand and explain handwritten QEDS (Quantitative Economics & Data Science) notes, even when the source text comes from noisy OCR.
It uses:
- Hybrid Retrieval (BM25 + BGE-M3), sketched below
- RAG Fusion using a FLAN-T5 paraphraser
- Cross-Encoder Reranking
- Contextual Compression
- OCR Noise Sanitization
- LLaMA 3 (Ollama) for answering
Perfect for academic notes, handwritten documents, mathematical derivations, and noisy OCR text.
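A minimal sketch of how the retrieval bullets above can fit together: sparse BM25 and dense BGE-M3 rankings are fused with Reciprocal Rank Fusion, then reranked by a cross-encoder. The toy corpus, the `cross-encoder/ms-marco-MiniLM-L-6-v2` checkpoint, and the RRF constant are assumptions for illustration, not the exact code in `src/streamlit_app.py`.

```python
# Illustrative sketch: hybrid retrieval (BM25 + BGE-M3), Reciprocal Rank
# Fusion of the two rankings, and cross-encoder reranking of the fused list.
# The sample corpus, reranker checkpoint, and RRF constant (60) are assumptions.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, SentenceTransformer

docs = [
    "The Gini coefficient measures income inequality on a 0-1 scale.",
    "A homogeneous differential equation can be solved with the substitution y = vx.",
    "The Slutsky equation splits a price change into substitution and income effects.",
]

bm25 = BM25Okapi([d.lower().split() for d in docs])            # sparse retriever
encoder = SentenceTransformer("BAAI/bge-m3")                   # dense retriever
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
doc_vecs = encoder.encode(docs, normalize_embeddings=True)

def hybrid_retrieve(query: str, k: int = 3) -> list[str]:
    # Rank the corpus independently with each retriever.
    bm25_rank = np.argsort(-bm25.get_scores(query.lower().split()))
    q_vec = encoder.encode([query], normalize_embeddings=True)[0]
    dense_rank = np.argsort(-(doc_vecs @ q_vec))

    # Reciprocal Rank Fusion: score(doc) = sum over retrievers of 1 / (60 + rank).
    rrf = np.zeros(len(docs))
    for ranking in (bm25_rank, dense_rank):
        for pos, idx in enumerate(ranking):
            rrf[idx] += 1.0 / (60 + pos)

    # Cross-encoder rerank of the fused top-k candidates.
    candidates = [docs[i] for i in np.argsort(-rrf)[:k]]
    scores = reranker.predict([(query, c) for c in candidates])
    return [c for _, c in sorted(zip(scores, candidates), reverse=True)]

print(hybrid_retrieve("What does the Gini coefficient measure?", k=2))
```

In the full RAG-Fusion setup, the same fusion step would run over rankings produced for several FLAN-T5 paraphrases of the query rather than the single query shown here.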
Demo examples:
- Asking about the Gini Coefficient
- Asking about Homogeneous Differential Equations

(The AI retrieves the correct handwritten module, fixes the math symbols, and explains the concept.)
Students often struggle with:
- Handwritten lecture notes
- OCR errors in scanned material
- Fragmented definitions and formulas
- Losing context while asking follow-up questions
QEDS-GPT solves this by:
- Indexing cleaned OCR notes into a semantic vector database
- Retrieving the most relevant modules and topics
- Correcting equations and notations
- Allowing multi-turn, memory-aware academic conversations
- Multi-turn chat interface
- Context retained across refreshes and sessions
- No login required
- ChromaDB vector store
- Hugging Face `bge-m3` embeddings
- Module-aware retrieval with metadata (see the indexing sketch below):
- Semester
- Subject
- Module
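For illustration, indexing a cleaned chunk with this metadata and querying it back could look like the sketch below. The collection name, IDs, and example records are assumptions; only the ChromaDB store and `BAAI/bge-m3` embeddings come from the project description.

```python
# Illustrative sketch: indexing cleaned OCR chunks into ChromaDB with
# semester/subject/module metadata, then a metadata-filtered query.
# Collection name, IDs, and example records are assumptions.
import chromadb
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-m3")
chroma_client = chromadb.PersistentClient(path="chroma_db")
notes = chroma_client.get_or_create_collection(name="qeds_notes")

chunks = [
    "A homogeneous ODE dy/dx = f(y/x) is solved with the substitution y = vx.",
    "The Gini coefficient is the ratio of the area between the Lorenz curve and the line of equality.",
]
metadatas = [
    {"semester": "Semester 4", "subject": "Differential Equations", "module": "Module 1"},
    {"semester": "Semester 3", "subject": "Economics", "module": "Module 2"},
]
notes.add(
    ids=["sem4-diffeq-m1-001", "sem3-econ-m2-001"],
    documents=chunks,
    embeddings=encoder.encode(chunks, normalize_embeddings=True).tolist(),
    metadatas=metadatas,
)

# Retrieve only from the requested module.
query = "How do I solve a homogeneous differential equation?"
hits = notes.query(
    query_embeddings=[encoder.encode(query, normalize_embeddings=True).tolist()],
    n_results=1,
    where={"module": "Module 1"},
)
print(hits["documents"][0])
```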
- Fixes broken equations and symbols
- Rewrites math in clean LaTeX
- Repairs fragmented sentences
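A minimal sketch of this repair step, assuming it is done with a prompt to the Groq-hosted model listed in the tech stack; the prompt wording and the helper name `sanitize_ocr` are illustrative, not the project's exact implementation.

```python
# Illustrative sketch: asking the LLM to repair OCR-damaged math before it is
# shown to the student. Prompt wording and helper name are assumptions.
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def sanitize_ocr(raw_chunk: str) -> str:
    prompt = (
        "Fix OCR errors in the following lecture-note excerpt. "
        "Rewrite any mathematics as clean LaTeX, repair fragmented sentences, "
        "and do not add facts that are not in the text:\n\n" + raw_chunk
    )
    reply = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return reply.choices[0].message.content

print(sanitize_ocr("dy/dx = (x2 + y2)/ xy  is homogen0us of degree O"))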
- Vague-query detection
- Relevance filtering
- Uses academic knowledge only to reconstruct missing context
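One simple way to implement these guardrails is a similarity threshold on the retrieved chunks: anything below the cut-off is discarded, and a query that matches nothing is flagged as vague. The sketch below is illustrative and the threshold value is an assumption.

```python
# Illustrative sketch: relevance filtering and vague-query detection via a
# cosine-similarity threshold. The threshold is an assumption, tuned per corpus.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-m3")

def filter_context(query: str, chunks: list[str],
                   min_sim: float = 0.45) -> tuple[list[str], bool]:
    q = encoder.encode([query], normalize_embeddings=True)[0]
    c = encoder.encode(chunks, normalize_embeddings=True)
    sims = c @ q                       # cosine similarity of unit vectors
    kept = [ch for ch, s in zip(chunks, sims) if s >= min_sim]
    is_vague = len(kept) == 0          # nothing relevant: ask the user to clarify
    return kept, is_vague
```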
- Dockerized application
- Deployed on Hugging Face Spaces
- Git LFS support for vector index files
- Secure API key management
User
↓
Streamlit Chat UI
↓
Semantic Retrieval (ChromaDB)
↓
OCR Cleanup & Context Sanitization
↓
LLM Reasoning (Groq)
↓
Answer + Updated Memory
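Put together, one turn through this pipeline looks roughly like the sketch below. It reuses the hypothetical helpers from the earlier sketches (`hybrid_retrieve`, `filter_context`, `sanitize_ocr`, and the Groq `client`); the system prompt wording and fallback message are assumptions.

```python
# Illustrative sketch of one turn through the pipeline above, reusing the
# hypothetical helpers sketched earlier (hybrid_retrieve, filter_context,
# sanitize_ocr, and the Groq client). Prompt wording is an assumption.
def answer_turn(query: str, history: list[dict]) -> str:
    chunks, is_vague = filter_context(query, hybrid_retrieve(query))
    if is_vague:
        return "Could you point me to the semester, subject, or module you mean?"

    context = "\n\n".join(sanitize_ocr(c) for c in chunks)
    messages = (
        [{"role": "system",
          "content": "You are a tutor for QEDS notes. Answer only from the context:\n" + context}]
        + history
        + [{"role": "user", "content": query}]
    )
    reply = client.chat.completions.create(
        model="llama-3.1-8b-instant", messages=messages
    )
    answer = reply.choices[0].message.content
    history += [{"role": "user", "content": query},
                {"role": "assistant", "content": answer}]
    return answer
```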
Query: "Explain the Slutsky substitution effect."
System Action: Retrieves Economics notes from Semester 3, fixes OCR typos in the definition, and presents the derivation.
Query: "Solve the Bernoulli differential equation from Module 1."
System Action: Filters for "Semester 4 - Diff Eq", finds the specific raw formula, converts it to clean LaTeX, and explains the solution steps.
- Handwritten notes are private and not included
- The vector database is stored using Git LFS
- Memory is stored per user locally via SQLite
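A minimal sketch of per-user memory in SQLite; the file name, table name, and schema are assumptions.

```python
# Illustrative sketch: persisting chat turns per user in a local SQLite file
# so context survives page refreshes. File, table, and schema are assumptions.
import sqlite3

conn = sqlite3.connect("chat_memory.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS turns (
           user_id TEXT, role TEXT, content TEXT,
           created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)"""
)

def save_turn(user_id: str, role: str, content: str) -> None:
    conn.execute("INSERT INTO turns (user_id, role, content) VALUES (?, ?, ?)",
                 (user_id, role, content))
    conn.commit()

def load_history(user_id: str) -> list[dict]:
    rows = conn.execute(
        "SELECT role, content FROM turns WHERE user_id = ? ORDER BY rowid",
        (user_id,)).fetchall()
    return [{"role": r, "content": c} for r, c in rows]
```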
```
QEDS-RAG-Project/
│
├── chroma_db/               # Vector database
│
├── src/
│   └── streamlit_app.py     # Main conversational RAG app
│
├── Dockerfile
├── requirements.txt
└── README.md
```
- Language: Python
- UI: Streamlit
- Vector DB: ChromaDB
- Embeddings: Hugging Face `BAAI/bge-m3`
- LLM: Groq (LLaMA-3.1-8B-Instant)
- Deployment: Docker + Hugging Face Spaces
Make sure Ollama is installed and running with Llama 3:

```bash
ollama run llama3
```

Clone the repository and install dependencies:

```bash
git clone https://github.com/apooorv19/QEDS-RAG-Project.git
cd QEDS-RAG-Project
pip install -r requirements.txt
```

- Set environment variable

```bash
export GROQ_API_KEY=your_api_key_here
```

- Run the app

```bash
streamlit run src/streamlit_app.py
```

To run with Docker instead:

```bash
docker build -t qeds-gpt .
docker run -p 8501:8501 -e GROQ_API_KEY=your_api_key qeds-gpt
```

Live demo: https://huggingface.co/spaces/Apooorv69/QEDS-RAG-Project
Apurva Mishra
IMSc Quantitative Economics & Data Science
Birla Institute of Technology, Mesra
GitHub: https://github.com/apooorv19
LinkedIn: https://www.linkedin.com/in/apooorv/
```bibtex
@misc{paruchuri2025surya,
  author       = {Vikas Paruchuri and Datalab Team},
  title        = {Surya: A lightweight document OCR and analysis toolkit},
  year         = {2025},
  howpublished = {\url{https://github.com/VikParuchuri/surya}},
  note         = {GitHub repository},
}
```


