🧠 From Clueless to Clarity: Building a Document Intelligence System (and Growing Through It)
“You don’t need to know everything to start. You need to start to know everything.” – that’s what this project taught me.
Over the past few days, I built a full-stack Document Categorization System using Django REST Framework and React — a project that pushed me, challenged me, and ultimately helped me grow in ways no tutorial ever could.
🔍 What the System Does
✅ Secure Uploads – Users can upload documents (PDFs or images).
✅ Text Extraction – Using OCR (via pytesseract) and PDF parsers, the system extracts raw text.
✅ Auto Categorization – A rule-based keyword engine analyzes content and classifies documents as:
- Tax
- Identity
- Medical
- Real Estate
- Other
✅ Confidence Score – Each classification comes with a score to indicate reliability.
✅ JWT Authentication – Built secure access using JWT via SimpleJWT.
✅ Live Filtering + Search – React frontend allows users to search and filter documents dynamically.
✅ Manual Category Override – Users can correct the category if needed.
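For readers wondering what the JWT piece usually involves, here is a minimal sketch of a `settings.py` fragment for SimpleJWT with Django REST Framework. The token lifetimes and the default permission class are illustrative assumptions, not the project's actual configuration.

```python
from datetime import timedelta

# settings.py fragment (illustrative values, not the project's real configuration)
REST_FRAMEWORK = {
    # Authenticate every API request with a JWT sent in the Authorization header.
    "DEFAULT_AUTHENTICATION_CLASSES": (
        "rest_framework_simplejwt.authentication.JWTAuthentication",
    ),
    # Assumption: all endpoints require a logged-in user unless overridden per view.
    "DEFAULT_PERMISSION_CLASSES": (
        "rest_framework.permissions.IsAuthenticated",
    ),
}

SIMPLE_JWT = {
    # Assumed lifetimes; tune them to your own security requirements.
    "ACCESS_TOKEN_LIFETIME": timedelta(minutes=15),
    "REFRESH_TOKEN_LIFETIME": timedelta(days=1),
}
```

SimpleJWT also ships `TokenObtainPairView` and `TokenRefreshView`, which are typically wired into `urls.py` so the React client can log in and refresh its access token.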
🔧 Tech Stack
- Backend: Django, DRF, PostgreSQL, JWT Auth, OCR (Tesseract), PDFMiner
- Frontend: React, Axios, Tailwind
- Dev tooling and hosting: Postman, VS Code, Railway (for the database)
🌱 What I Learned (The Real Win)
🧠 Real-world debugging – Nothing teaches better than a 404 (Not Found) or a 500 (Internal Server Error) at 2 AM.
🧠 React hooks & state management – From total confusion to clarity.
🧠 API design – Including custom endpoints like /update_category/ and understanding router behavior.
🧠 Django internals – Signals, model methods, serializers, request/response cycles.
🧠 Confidence – More than the score I attached to documents, I built some for myself.
💬 A Word to Anyone Starting Out
I didn’t feel “ready” when I started. I felt like an impostor midway. I finished it anyway — because growth doesn’t come from knowing, it comes from doing.
🧩 Core Logic – Behind the Scenes
At the heart of the project lies a simple but effective rule-based classification engine:
- Text Extraction – PDFs are parsed using pdfminer. Images (JPEG, PNG) are converted to text via OCR using pytesseract.
- Keyword-Based Categorization – Each document is scanned for predefined keywords. For example:
  - "income tax", "gst" → Tax
  - "aadhar", "passport" → Identity
  - "diagnosis", "prescription" → Medical
- Model Save Logic – On saving a document, the backend:
  1. Extracts the text
  2. Predicts the category
  3. Stores it with metadata like upload time, user, and confidence
- Frontend Integration – React handles:
  1. Upload
  2. Real-time feedback (category and confidence)
  3. Filtering documents by category, name, or upload time
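The backend steps above can be sketched in a few functions. This is a minimal illustration, not the project's actual code. The function names, the record layout, and the confidence formula (the winning category's share of all keyword hits) are assumptions. Only the keywords quoted above are used, so Real Estate is left out and such documents fall through to Other.

```python
from datetime import datetime, timezone
from pathlib import Path

# Keywords quoted in the article; a real table would be longer.
CATEGORY_KEYWORDS = {
    "Tax": ["income tax", "gst"],
    "Identity": ["aadhar", "passport"],
    "Medical": ["diagnosis", "prescription"],
}


def extract_text(path):
    """Choose an extractor from the file extension (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        # pdfminer.six provides a high-level whole-file extraction helper.
        from pdfminer.high_level import extract_text as pdf_extract_text
        return pdf_extract_text(str(path))
    if suffix in {".jpg", ".jpeg", ".png"}:
        # pytesseract needs Pillow and a local Tesseract install.
        import pytesseract
        from PIL import Image
        with Image.open(path) as image:
            return pytesseract.image_to_string(image)
    raise ValueError(f"Unsupported file type: {suffix!r}")


def classify(text):
    """Return (category, confidence) based on keyword hits."""
    lowered = text.lower()
    hits = {
        category: sum(1 for keyword in keywords if keyword in lowered)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    total = sum(hits.values())
    if total == 0:
        return "Other", 0.0
    best = max(hits, key=hits.get)
    # Assumed scoring: share of all matched keywords that belong to the winner.
    return best, round(hits[best] / total, 2)


def build_record(path, user, text=None):
    """Extract, classify, and bundle the metadata that would be saved."""
    if text is None:
        text = extract_text(path)
    category, confidence = classify(text)
    return {
        "file": str(path),
        "user": user,
        "text": text,
        "category": category,
        "confidence": confidence,
        "uploaded_at": datetime.now(timezone.utc),
    }
```

In a Django project, logic like `build_record` would typically run from the model's `save()` method or from the serializer before the row is written. Because every decision traces back to a matched keyword, the result stays easy to explain to the user.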
This simple rule engine can later evolve into a full-fledged ML-based classifier, but it does the job today: fast, clean, and explainable.
🙌 Open to Feedback, Opportunities & Collaboration
If you're working in AI, OCR, or backend-heavy systems — I’d love to connect, contribute, and grow. And if you’re hiring for backend/full-stack roles — I’m actively looking.