Educational Content Processing & Retrieval-Augmented Generation System A comprehensive research platform for multimodal lecture processing, intelligent retrieval, and RAG pipeline development.
This capstone builds an educational content processing and Retrieval-Augmented Generation (RAG) system: ingest multimodal lecture materials, align and structure them, index them for text and visual retrieval, and support question answering with citations, lecture-aware summaries, and personalized learning features behind a modern web UI and production-style deployment options.
The authoritative requirements baseline is docs/requirements.md (Software Requirements Specification): 37 requirements in total: 22 functional (FR-001–FR-022 and extended FRs in that doc), 8 non-functional (NFR-001–NFR-008), and 7 technical (TR-001–TR-007). Highlights from the SRS scope:
- Content processing: ASR and timed exports (FR-001); documents, OCR, dual outputs (FR-002); spreadsheet merged cells and Markdown (FR-003, FR-004); images / VLM (FR-005); deduplication (FR-006); audio–slide alignment and temporal navigation (FR-007, FR-008).
- Retrieval & QA: BM25, dense, hybrid (FR-009); vision–language retrieval (FR-010); query handling (FR-011); grounded answers (FR-012, FR-013); chat decomposition, strategy, and multi-search aggregation (FR-014).
- Product features: file management and search UI (FR-021, FR-022); automated summaries and summary navigation (FR-023, FR-024); learning paths, assessment, and analytics (FR-025–FR-027).
- Non-functional: latency and scale targets (NFR-001–NFR-002); availability, integrity, UX, accessibility, security, and privacy (NFR-003–NFR-008).
- Technical: FastAPI + async APIs, React 18 + Vite + Tailwind, vector and metadata stores, external LLM/embedding services, Docker, and cloud-ready infrastructure (TR-001–TR-007).
Research-week folders (Week03*, Week05*, Week07*) map to these requirements incrementally; Phase_2_FE_AI_Merge is the maintained integrated app (Firebase UI, Qdrant/S3, optional SageMaker, Terraform for ECS/ALB/ECR).
BK-MInD follows a six-tier Clean Architecture pattern that separates concerns across distinct layers, enabling maintainability, testability, and independent scaling. The system is designed to achieve: (1) multimodal data ingestion from diverse educational materials, (2) asynchronous processing to support concurrent operations without bottlenecks, and (3) production-grade security and scalability on AWS infrastructure.
The following diagram shows the complete system topology aligning with the SRS: multimodal ingest → process → index → retrieve → generate, organized across six architectural tiers plus cross-cutting concerns for auth, security, and persistence.
Layer summary
| Layer | Role | SRS touchpoints |
|---|---|---|
| Client | Uploads, search, summaries, dashboards, auth | FR-021–FR-022, FR-023–FR-027, NFR-005–NFR-008 |
| API | Orchestration, RBAC hooks, integration | TR-001, NFR-003–NFR-004 |
| Processing | ASR, OCR/VLM, spreadsheets, sync, corpus | FR-001–FR-008 |
| Storage | Vectors, sparse index, blobs, metadata | TR-003, TR-004, NFR-004 |
| Retrieval & generation | Hybrid + visual search, RAG, chat, LLM | FR-009–FR-014, TR-004 |
| Deployment | Containers, cloud LB TLS, optional managed GPU | TR-006–TR-007, NFR-002–NFR-003 |
For HTTPS and custom domains on AWS, see docs/technical/DOCS_deployment-alb-acm-custom-domain.md.
The latest deployment architecture (v4) shows production-grade cloud infrastructure on AWS with ECS Fargate, ALB, ElastiCache, vector databases, and auto-scaling:
Additional Diagrams:
- docs/diagram/: Complete diagram collection, including document processing flows and system documentation
The project includes a complete, publication-ready academic manuscript for submission to top-tier conferences:
Folder: Phase_2_Manuscript/
What's Included:
- main.pdf (804 KB, 14 pages) - 2-column IEEE/ACM format paper with BibTeX references
- main.tex (418 lines) - LaTeX source with proper `\cite{}` commands and all elements
- references.bib (23 academic sources) - Comprehensive BibTeX bibliography
- Figures (3 professional diagrams) - System architecture, technology rationale, related work
- Tables (5 comprehensive tables) - RAG alternatives, parsing, retrieval, end-to-end eval, appendix comparison
- Complete Documentation - Submission guides, compilation instructions, writing standards
- Fact-Checked Metrics - All 40+ performance metrics verified against Phase_2_Report
Manuscript Title: BK-MInD: Multimodal Retrieval-Augmented Generation for Institutional Educational Content
Key Contributions:
- Dual-pathway multimodal architecture with reciprocal rank fusion
- 7-stage document processing pipeline with conditional routing
- Multi-tier security architecture (FERPA-compliant)
- Production deployment validation (50 concurrent users, $683.72/month)
Evaluation Results:
- Document parsing: 58.91% OmniDocBench score
- Retrieval effectiveness: 84.84% nDCG@10 for text, 67.14% for images
- System accuracy: 72.7% correctness, 99.5% faithfulness (zero hallucinations)
- Production ready: Stable 30-45 second response times at 50 concurrent users
Target Conferences:
- ACL 2027 (Deadline: January 2027) - EXCELLENT FIT
- EMNLP 2027 (Deadline: May 2027) - EXCELLENT FIT
- Learning@Scale 2027 (Deadline: October 2026) - EXCELLENT FIT
Quick Start: Download main.pdf from Phase_2_Manuscript/ folder and submit to target conference!
See Phase_2_Manuscript/README.md for detailed submission instructions.
A robust batch downloader for academic PDFs from major venues (arXiv, ACL, CVPR, AAAI, ACM). Features intelligent metadata extraction, automatic retries, and comprehensive logging.
Key Features:
- Multi-venue support with site-specific heuristics
- Semantic filename generation from paper metadata
- PDF validation and deduplication
- Exponential backoff retry mechanism
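The retry behaviour listed above can be sketched as follows. This is a minimal illustration, not the downloader's actual code: the function names, the base delay, the cap, and the use of random jitter are all assumptions made for the example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Un-jittered delay before each retry: base * 2**attempt, limited to cap."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(max_retries)]


def retry_with_backoff(
    fetch: Callable[[], T],
    max_retries: int = 4,
    base: float = 1.0,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(); after each failure wait longer before trying again."""
    delays = backoff_delays(max_retries, base, cap)
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            # Random jitter (an assumption here) keeps many clients from
            # retrying at exactly the same moment.
            sleep(random.uniform(0, delays[attempt]))
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a production downloader would typically also restrict retries to transient errors such as timeouts.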
Baseline implementation for extracting text from lecture videos and slides.
Technologies:
- ASR: PhoWhisper (OpenAI Whisper variant optimized for Vietnamese)
- OCR: Tesseract with adaptive preprocessing
- Audio Processing: FFmpeg extraction, 16kHz WAV conversion
- Batch Processing: Multi-file support with structured outputs
Output: Timestamped transcripts (TXT/JSON) + extracted slide text
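As a rough sketch of the audio step (FFmpeg extraction to 16 kHz WAV), the command could be assembled like this. The helper names, the mono downmix, and the 16-bit PCM codec are assumptions for illustration; only the 16 kHz WAV target comes from the description above.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video: Path, wav: Path, sample_rate: int = 16000) -> list[str]:
    """Build an FFmpeg invocation that extracts audio as a WAV file."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", str(video),         # input video
        "-vn",                    # drop the video stream
        "-ac", "1",               # mono downmix (assumption)
        "-ar", str(sample_rate),  # resample to 16 kHz by default
        "-c:a", "pcm_s16le",      # 16-bit PCM WAV (assumption)
        str(wav),
    ]


def extract_audio(video: Path, out_dir: Path) -> Path:
    """Run FFmpeg and return the path of the extracted WAV file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    wav = out_dir / (video.stem + ".wav")
    subprocess.run(build_ffmpeg_command(video, wav), check=True)
    return wav
```

Keeping the command builder separate from the `subprocess.run` call makes the flags easy to inspect and test without FFmpeg installed.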
Comprehensive comparison of retrieval methods on MS MARCO dataset.
Methods Evaluated:
- BM25: Sparse keyword-based retrieval (baseline)
- Dense: Sentence-BERT embeddings with cosine similarity
- Hybrid: Weighted Sum + Reciprocal Rank Fusion (RRF)
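The two hybrid strategies named above can be illustrated with a short, self-contained sketch. It is not taken from the project notebooks: the function names, the constant `k = 60`, min-max normalization, and the `alpha` weight are assumptions chosen for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: each document scores sum(1 / (k + rank))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def _min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so BM25 and dense scores are comparable."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    span = high - low
    return {doc: ((s - low) / span if span else 0.0) for doc, s in scores.items()}


def weighted_sum(
    bm25: dict[str, float], dense: dict[str, float], alpha: float = 0.5
) -> list[tuple[str, float]]:
    """Combine normalized scores: alpha * BM25 + (1 - alpha) * dense."""
    nb, nd = _min_max(bm25), _min_max(dense)
    fused = {
        doc: alpha * nb.get(doc, 0.0) + (1 - alpha) * nd.get(doc, 0.0)
        for doc in set(nb) | set(nd)
    }
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))
```

RRF only needs rank positions, so it avoids the score-scale mismatch between BM25 and cosine similarity; the weighted sum needs a normalization step and a tuned `alpha`.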
Key Findings:
- Dense retrieval achieves 3.6× higher nDCG@10 than BM25 on MS MARCO
- Hybrid methods provide marginal improvements but add complexity
- Vocabulary mismatch severely impacts BM25 on natural language queries
Metrics: nDCG@10, Recall@10, latency analysis
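For reference, the two ranking metrics can be computed as in the sketch below. It uses the linear-gain form of DCG (which coincides with the exponential form for binary relevance); the function names and argument layout are assumptions, and the project's evaluation code may differ.

```python
import math


def dcg_at_k(relevances: list[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k relevance grades."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(ranked_relevances: list[float], all_relevances: list[float], k: int = 10) -> float:
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_relevances: grades of the returned documents, in ranked order.
    all_relevances: grades of every judged document for the query.
    """
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)
```

For example, a ranking that places the two relevant documents at positions 1 and 3 has DCG = 1 + 1/log2(4) = 1.5, against an ideal of 1 + 1/log2(3), giving nDCG of roughly 0.92.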
Systematic evaluation of three RAG implementation approaches.
Frameworks:
- LangChain: High-level abstractions, extensive integrations
- LlamaIndex: Python-native, data-centric design
- Manual: Custom implementation for full control
Configuration Options:
- Vector Stores: FAISS (in-memory), Chroma (persistent)
- LLMs: OpenAI GPT-4o-mini, Azure OpenAI, Google Gemini, Ollama
- Benchmarking: Automated metrics collection and reporting
Use Case: Comparative analysis for selecting optimal RAG stack
Expanded processing pipeline with multiple AI backends and detailed benchmarking.
ASR Models:
- OpenAI Whisper: Variants from `tiny` to `large-v3`
- Google Gemini: API-based with 2.0/2.5 Flash models
- DeepSeek: Alternative API provider
OCR Enhancements:
- Advanced preprocessing (OTSU, adaptive thresholding)
- Multi-language support (Vietnamese + English)
- PDF batch processing with Poppler integration
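To make the Otsu step concrete, here is a dependency-free sketch of the thresholding idea on a flat list of 8-bit grayscale values. In practice a library routine is the usual choice (for example OpenCV's `cv2.threshold` with `cv2.THRESH_OTSU`); whether this project uses it is not stated here, and the function names below are illustrative only.

```python
def otsu_threshold(pixels: list[int]) -> int:
    """Return the threshold t (0-255) that maximizes between-class variance.

    Pixels with value <= t form one class, pixels > t the other.
    """
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_low = 0  # number of pixels <= current level
    sum_low = 0     # intensity sum of pixels <= current level
    for level in range(256):
        weight_low += hist[level]
        sum_low += level * hist[level]
        if weight_low == 0:
            continue  # no pixels in the low class yet
        weight_high = total - weight_low
        if weight_high == 0:
            break  # every pixel is in the low class; nothing left to split
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(pixels: list[int], threshold: int) -> list[int]:
    """Map pixels above the threshold to white (255), the rest to black (0)."""
    return [255 if value > threshold else 0 for value in pixels]
```

Otsu picks one global threshold per image; adaptive thresholding instead computes a local threshold per neighbourhood, which tends to cope better with unevenly lit slides.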
Deliverables: Model comparison reports (asr rank.md, ocr rank.md, model comparison.md)
Industrial-grade retrieval implementations using specialized tools.
Upgrades:
- Milvus: Vector database for billion-scale dense retrieval
- Pyserini: Lucene-based BM25 with advanced linguistic processing
- ColPali: Vision-language retrieval for document images (no OCR needed)
Performance Improvements:
- 44 minutes → ~10 seconds for BM25 (Pyserini)
- 6 seconds → <1 second for Dense (Milvus)
- Better tokenization, stemming, and query optimization
Novel Approach: ColPali for end-to-end visual retrieval (bypassing OCR errors)
Complete overhaul into production-ready 4-stage pipeline with enterprise features and intelligent processing.
Architecture Overview:
- Stage 1 (Normalizer): Format conversion with consistent filename truncation for Windows compatibility
- Stage 2 (Media Processor): Audio/video transcription with multiple export formats (JSON/SRT/VTT/MD)
- Stage 3 (Docling Processor): Smart deduplication avoiding duplicate processing, VLM-powered understanding
- Stage 4 (Consolidator): RAG-ready unified structure with dual-mode outputs
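As an illustration of the Stage 2 subtitle export, the sketch below turns timed transcript segments into SRT text. The segment layout (dicts with `start`, `end`, and `text` keys, times in seconds) and the function names are assumptions for the example, not the pipeline's actual data model.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: list[dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = srt_timestamp(segment["start"])
        end = srt_timestamp(segment["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}\n")
    return "\n".join(blocks)
```

WebVTT output is close to this: it starts with a `WEBVTT` header line and uses a period instead of a comma before the milliseconds.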
Core Features:
- Smart Deduplication: Process each file only once, optimal quality source selection
- Dual RAG Outputs: Normalized PDFs for image retrieval + Markdown for semantic search
- Universal Format Support: 15+ formats (DOCX, PPTX, HTML, Images, Video, Audio, PDF, Excel, CSV, AsciiDoc, WebVTT)
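One simple way to process each file only once is to key files by a content hash, as sketched below. This is an assumption-laden illustration: it uses MD5 over file bytes (the README mentions MD5 for the caching system, not necessarily for deduplication), the function names are hypothetical, and it does not model the quality-based source selection.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large videos do not need to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def deduplicate(paths: list[Path]) -> list[Path]:
    """Keep the first file seen for each distinct content hash."""
    seen: set[str] = set()
    unique: list[Path] = []
    for path in paths:
        key = file_md5(path)
        if key not in seen:
            seen.add(key)
            unique.append(path)
    return unique
```

The same hash can double as a cache key: if a stored hash for an output matches the current input, the stage can be skipped unless a bypass flag is set.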
Advanced Capabilities:
- Visual Understanding: SmolVLM-256M integration for image descriptions and layout analysis
- Processing Modes:
  - Full Mode (default): VLM-enabled, highest quality, ~1× speed
  - Balanced Mode (`--no-vlm`): OCR-only with exports, ~2× faster
  - Fast Mode (`--fast-mode`): OCR-only minimal exports, 3-5× faster
- Intelligent Caching: MD5-based skip system with `--force` flag to bypass
- Windows Optimization: Automatic filename truncation (50 chars + MD5 hash) for 260-char path limit
- Multi-OCR Support: RapidOCR (primary), Tesseract, EasyOCR
- ASR Integration: Whisper-based transcription for audio/video with configurable models
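The Windows filename truncation (50 characters plus an MD5 hash) could look roughly like this. The 50-character limit comes from the list above; the 8-character hash suffix, the underscore separator, and hashing the original stem are assumptions made for the sketch.

```python
import hashlib
from pathlib import Path

MAX_STEM_CHARS = 50  # limit stated in the feature list
HASH_CHARS = 8       # assumption: a short MD5 prefix is enough to avoid clashes


def truncate_filename(name: str, max_stem: int = MAX_STEM_CHARS, hash_chars: int = HASH_CHARS) -> str:
    """Shorten a bare filename to max_stem chars plus a hash of the full stem.

    The hash is derived from the original stem, so two long names that share
    their first max_stem characters still map to different outputs.
    """
    path = Path(name)
    stem, suffix = path.stem, path.suffix
    if len(stem) <= max_stem:
        return name  # already short enough; keep the name unchanged
    digest = hashlib.md5(stem.encode("utf-8")).hexdigest()[:hash_chars]
    return f"{stem[:max_stem]}_{digest}{suffix}"
```

Because the result is deterministic, every pipeline stage can recompute the same shortened name from the original filename without a lookup table.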
Performance Optimizations:
- GPU acceleration (CUDA support)
- Batch processing with progress tracking
- Exponential backoff retry mechanism
- Comprehensive error handling and logging
- Graceful degradation for unsupported formats
Output Structure:
stage4_rag_ready/
├── document_name.pdf                    # Image-based RAG (preserved layout)
├── document_name.md                     # Text-based RAG (semantic search)
└── document_name_docling_additional/    # Extracted images/tables
    ├── images/
    └── tables/
Single tree that combines the production-style FastAPI backend (Qdrant, S3, optional SageMaker inference), the React + Firebase frontend from the FE track, SageMaker hosting packs (unified Docling + Whisper + ColQwen container and optional split endpoints), and Terraform for AWS: ECR, ECS Fargate, Application Load Balancer with optional HTTPS (ACM), auto scaling, and an optional SageMaker endpoint aligned with sagemaker/unified.
| Area | Path | Documentation |
|---|---|---|
| Folder overview | Phase_2_FE_AI_Merge/ | Phase_2_FE_AI_Merge/README.md |
| Integration log | Phase_2_FE_AI_Merge/MERGE_SUMMARY.md | Merge checklist and features |
| Terraform (ALB, ECS, ECR, SageMaker) | Phase_2_FE_AI_Merge/terraform/ | Phase_2_FE_AI_Merge/terraform/README.md |
| SageMaker build / deploy | Phase_2_FE_AI_Merge/sagemaker/ | Phase_2_FE_AI_Merge/sagemaker/README.md |
| HTTPS + custom domain runbook | docs/technical/DOCS_deployment-alb-acm-custom-domain.md | ACM validation, DNS, ALB listeners |
Use Phase_2_FE_AI_Merge as the maintained application tree for local development, technical review, deployment, and testing.
For capstone presentations / documentation review:
Start with docs/README.md (documentation hub) → docs/report/ folder for Phase 2 reports and presentation guides.
For development setup:
Prerequisites follow docs/requirements.md (TR-001–TR-005, NFR-005–NFR-006): Python 3.9+, FastAPI backend; React 18+, Vite, Tailwind frontend; FFmpeg, Tesseract, Poppler for media; GPU optional locally if you offload heavy inference to APIs or SageMaker (Phase_2_FE_AI_Merge/sagemaker/README.md). Docker and Terraform are for packaging and cloud layout (TR-006–TR-007).
Shell: All commands below are Windows PowerShell (5.1 or 7+). From another shell, translate Set-Location/Copy-Item/.\venv\Scripts\Activate.ps1 as needed. If script activation is blocked, run once: Set-ExecutionPolicy -Scope CurrentUser RemoteSigned (or start Python via .\venv\Scripts\python.exe without activating).
git clone https://github.com/pdz1804/capstone-project.git
Set-Location capstone-project
python -m venv venv
.\venv\Scripts\Activate.ps1

Full UI (Firebase), Qdrant/S3-aware API, tests, Terraform, and SageMaker docs: see Phase_2_FE_AI_Merge/README.md.
# Backend (see Phase_2_FE_AI_Merge/backend/README.md for uvicorn/install scripts)
Set-Location Phase_2_FE_AI_Merge\backend
pip install -r requirements.txt
Copy-Item .env.example .env
# Edit .env: keys, Qdrant, S3, SageMaker flags
# Frontend: open a new PowerShell window at the repository root, then:
Set-Location Phase_2_FE_AI_Merge\frontend
npm install
Copy-Item .env.example .env
npm run dev

URLs (typical): UI http://localhost:5173 (or Vite default), API http://localhost:8000, docs http://localhost:8000/docs. Run the API with the command in backend/README.md (e.g. uvicorn on app.main:app).
Terraform (local validation only, no apply):
Set-Location Phase_2_FE_AI_Merge\terraform
terraform init -backend=false
terraform fmt -recursive
terraform validate

Set-Location Week0506_Mkhoi_OCR_ASR\src
python main.py asr --output-dir results\asr @(Get-ChildItem -Path "data\videos\*.mp4" | ForEach-Object { $_.FullName })
Set-Location ..\..\Week0304_NKhoi_Retrieval
jupyter notebook manual_bm25_dense_hybrid.ipynb
Set-Location ..\Week0304_QPhu_RAG_Pipeline
python setup_and_run.py
Set-Location ..\Week070809_QPhu_Processor
python src\pipeline.py input\ output\
# Optional: add --fast-mode where that script supports it

Use Set-Location <repoRoot> first if you are not already at the repository root (replace <repoRoot> with your clone path, e.g. D:\PDZ\BKU\Learning\LVTN\GD1\Code).
- Course: CS252 - Capstone Project
- Institution: Ho Chi Minh City University of Technology (HCMUT)
- Focus: Applied AI for Educational Content Processing
- Domain: Information Retrieval, NLP, Multimodal Learning, RAG Systems
Research Contributions:
- Vietnamese-optimized ASR/OCR pipeline for lecture processing
- Comprehensive retrieval method comparison on MS MARCO
- RAG framework selection guide for educational Q&A
- Production-grade retrieval system implementations
- Multimodal document understanding with Docling
- Dual-mode RAG processing pipeline (text + image retrieval)
- Intelligent document deduplication and caching system
- Performance-quality tradeoff framework (Fast vs Full modes)
Start Here:
- `docs/README.md`: Documentation hub and overview.
- `docs/requirements.md`: Software Requirements Specification covering functional, non-functional, and technical constraints (37 requirements total).
Authoritative Technical Documents
- `docs/technical/APPLICATION_OVERVIEW.md`: Product scope, user workflows, architecture summary, features, quality attributes, and engineering assessment.
- `docs/technical/API_REFERENCE.md`: Maintainer-level API reference covering authentication, files, processing, indexing, search, chat, insights, feedback, and operational guidance.
- `docs/technical/DOCS_TECHNICAL_GUARDRAIL_CONFIGURATION.md`: AWS Bedrock guardrails configuration, content safety filters, PII protection, implementation details.
Testing and Performance Evidence
- `docs/report/FRESH_EVALUATION_REPORT_2026_05_07.md`: Final evaluation report with component testing, performance benchmarks, and production readiness assessment.
- `docs/jmeter-capacity-tests/runs/README_MAIN_APIS.md`: JMeter runbook and result exports for Process, Index, and Search APIs.
- `docs/jmeter-capacity-tests/runs/README_NON_MAIN_APIS.md`: JMeter runbook and result exports for Auth, User, Stats, Upload, Chat, and Insights APIs.
Architecture and Deployment
- `docs/technical/DOCS_deployment-alb-acm-custom-domain.md`: ACM certificates, DNS validation, ALB HTTP→HTTPS, and custom domain setup.
- `docs/technical/DOCS_search-cache-redis-setup.md`: Redis/ElastiCache search cache setup and operational notes.
- `docs/technical/DOCS_REDIS_ASYNC_JOB_SYSTEM_GUIDE.md`: Async job tracking system (Redis-based), job lifecycle, and monitoring.
Security and WAF Configuration
- `docs/technical/DOCS_TECHNICAL_WAF_CONFIGURATION.md`: AWS WAF rules, IP whitelisting, DDoS protection, and security group configuration.
- `docs/technical/SECURITY_SECTION_CAPSTONE_REPORT.md`: Security architecture, threat modeling, and compliance considerations.
Cost Estimation
- `docs/others/AWS_Cost_Estimation_50_Users_Professional.xlsx`: Detailed cost analysis and scalability projections for 50 concurrent users.
Merged Production Application (Phase_2_FE_AI_Merge/)
- `Phase_2_FE_AI_Merge/README.md`: Top-level map: frontend, backend, SageMaker pack, Terraform; local quick paths.
- `Phase_2_FE_AI_Merge/MERGE_SUMMARY.md`: What was integrated from FE and AI service tracks.
- `Phase_2_FE_AI_Merge/backend/README.md`: FastAPI layout, Qdrant/BM25/hybrid/image retrieval, S3 vs local storage.
- `Phase_2_FE_AI_Merge/terraform/README.md`: AWS resources (ECR, ECS, ALB, optional HTTPS, optional SageMaker) and Terraform checks.
- `Phase_2_FE_AI_Merge/sagemaker/README.md`: Unified container, ECR push, deploy/delete scripts, and backend environment variables.
Research Milestones and Utilities
- READMEs inside `Week0304_*`, `Week0506_*`, `Week070809_QPhu_Processor/`, and `downloads/` directories (datasets and paper references).
- `Phase_1/Week0304_QPhu_RAG_Pipeline/DETAILED_PIPELINE_FLOWS.md`: Detailed RAG pipeline flow diagrams and explanations.
The downloads/ directory contains a curated collection of research papers covering:
- Retrieval-Augmented Generation (RAG) architectures
- Dense retrieval methods (DPR, ColBERT, ANCE)
- Multimodal learning (CLIP, LayoutLM, Docling)
- Speech recognition (Whisper, Wav2Vec 2.0)
- OCR and document understanding
This is an academic capstone project. For collaboration or questions:
- Repository: github.com/pdz1804/capstone-project
- Issues: Use GitHub Issues for bug reports or feature requests
- Contact: See individual weekly READMEs for team member information
This project is licensed under the MIT License - see the LICENSE file for details.
Copyright (c) 2025 Quang Phu, Ngoc Khoi, and Minh Khoi
Open-source models, APIs, and platforms that this codebase builds on (see also TR-004–TR-005 and integration notes in docs/requirements.md):
- OpenAI: Whisper and LLM APIs used in ASR and generation experiments.
- Google: Gemini (multimodal/API), Firebase (authentication in the merged frontend stack), and embedding-related tooling referenced in weekly work.
- Hugging Face: `transformers`, model hubs, and pretrained checkpoints (e.g. ColQwen, sentence encoders).
- IBM Docling and related document-understanding components.
- Qdrant: Vector database used in the Phase 2 AI service and merge backend.
- Amazon Web Services: S3, SageMaker real-time inference, and (via Terraform) ECS, ECR, ALB, ACM for optional cloud deployment.
- HashiCorp Terraform: Infrastructure as code in `Phase_2_FE_AI_Merge/terraform/`.
- Pyserini / Anserini & Milvus: Retrieval stacks explored in research-week milestones.
- LangChain & LlamaIndex: RAG framework comparisons (early-phase notebooks and prototypes).
- FFmpeg, Tesseract, Poppler: Media, OCR, and PDF tooling (TR-005).
- React, Vite, Tailwind CSS: Frontend stack (TR-002).
Version: 1.0 Last Updated: May 10, 2026
Team: MKhoi, NKhoi, QPhu.

