Capstone Project - HCMUT CS252

Educational Content Processing & Retrieval-Augmented Generation System

A comprehensive research platform for multimodal lecture processing, intelligent retrieval, and RAG pipeline development.

🎯 Project Overview

This capstone builds an educational content processing and Retrieval-Augmented Generation (RAG) system: ingest multimodal lecture materials, align and structure them, index them for text and visual retrieval, and support question answering with citations, lecture-aware summaries, and personalized learning features behind a modern web UI and production-style deployment options.

The authoritative requirements baseline is docs/requirements.md (Software Requirements Specification): 37 requirements in total, comprising 22 functional (FR-001–FR-022, plus extended FRs in that document), 8 non-functional (NFR-001–NFR-008), and 7 technical (TR-001–TR-007). Highlights from the SRS scope:

Content processing: ASR and timed exports (FR-001); documents, OCR, dual outputs (FR-002); spreadsheet merged cells and Markdown (FR-003, FR-004); images / VLM (FR-005); deduplication (FR-006); audio–slide alignment and temporal navigation (FR-007, FR-008).
Retrieval & QA: BM25, dense, hybrid (FR-009); vision–language retrieval (FR-010); query handling (FR-011); grounded answers (FR-012, FR-013); chat decomposition, strategy, and multi-search aggregation (FR-014).
Product features: file management and search UI (FR-021, FR-022); automated summaries and summary navigation (FR-023, FR-024); learning paths, assessment, and analytics (FR-025–FR-027).
Non-functional: latency and scale targets (NFR-001–NFR-002); availability, integrity, UX, accessibility, security, and privacy (NFR-003–NFR-008).
Technical: FastAPI + async APIs, React 18 + Vite + Tailwind, vector and metadata stores, external LLM/embedding services, Docker, and cloud-ready infrastructure (TR-001–TR-007).

Research-week folders (Week03*, Week05*, Week07*) map to these requirements incrementally; Phase_2_FE_AI_Merge is the maintained integrated app (Firebase UI, Qdrant/S3, optional SageMaker, Terraform for ECS/ALB/ECR).

🏗️ System Architecture

BK-MInD, the system built in this capstone, follows a six-tier Clean Architecture pattern that separates concerns across distinct layers, enabling maintainability, testability, and independent scaling. The system is designed to achieve: (1) multimodal data ingestion from diverse educational materials, (2) asynchronous processing to support concurrent operations without bottlenecks, and (3) production-grade security and scalability on AWS infrastructure.

High-Level System Architecture

The following diagram shows the complete system topology aligning with the SRS: multimodal ingest → process → index → retrieve → generate, organized across six architectural tiers plus cross-cutting concerns for auth, security, and persistence.

Layer summary

| Layer | Role | SRS touchpoints |
| --- | --- | --- |
| Client | Uploads, search, summaries, dashboards, auth | FR-021–FR-022, FR-023–FR-027, NFR-005–NFR-008 |
| API | Orchestration, RBAC hooks, integration | TR-001, NFR-003–NFR-004 |
| Processing | ASR, OCR/VLM, spreadsheets, sync, corpus | FR-001–FR-008 |
| Storage | Vectors, sparse index, blobs, metadata | TR-003, TR-004, NFR-004 |
| Retrieval & generation | Hybrid + visual search, RAG, chat, LLM | FR-009–FR-014, TR-004 |
| Deployment | Containers, cloud LB TLS, optional managed GPU | TR-006–TR-007, NFR-002–NFR-003 |

For HTTPS and custom domains on AWS, see docs/technical/DOCS_deployment-alb-acm-custom-domain.md.

AWS Deployment Architecture

The latest deployment architecture (v4) shows production-grade cloud infrastructure on AWS with ECS Fargate, ALB, ElastiCache, vector databases, and auto-scaling:

Additional Diagrams:

docs/diagram/: Complete diagram collection, including document processing flows and system documentation

📄 Academic Publication - Phase_2_Manuscript

BK-MInD Academic Manuscript (Ready for Conference Submission)

The project includes a complete, publication-ready academic manuscript for submission to top-tier conferences:

📜 Folder: Phase_2_Manuscript/

What's Included:

✅ main.pdf (804 KB, 14 pages) - 2-column IEEE/ACM format paper with BibTeX references
✅ main.tex (418 lines) - LaTeX source with proper \cite{} commands and all elements
✅ references.bib (23 academic sources) - Comprehensive BibTeX bibliography
✅ Figures (3 professional diagrams) - System architecture, technology rationale, related work
✅ Tables (5 comprehensive tables) - RAG alternatives, parsing, retrieval, end-to-end eval, appendix comparison
✅ Complete Documentation - Submission guides, compilation instructions, writing standards
✅ Fact-Checked Metrics - All 40+ performance metrics verified against Phase_2_Report

Manuscript Title: BK-MInD: Multimodal Retrieval-Augmented Generation for Institutional Educational Content

Key Contributions:

Dual-pathway multimodal architecture with reciprocal rank fusion
7-stage document processing pipeline with conditional routing
Multi-tier security architecture (FERPA-compliant)
Production deployment validation (50 concurrent users, $683.72/month)

Evaluation Results:

Document parsing: 58.91% OmniDocBench score
Retrieval effectiveness: 84.84% nDCG@10 for text, 67.14% for images
System accuracy: 72.7% correctness, 99.5% faithfulness (zero hallucinations)
Production ready: Stable 30-45 second response times at 50 concurrent users

Target Conferences:

ACL 2027 (Deadline: January 2027) - EXCELLENT FIT
EMNLP 2027 (Deadline: May 2027) - EXCELLENT FIT
Learning@Scale 2027 (Deadline: October 2026) - EXCELLENT FIT

Quick Start: Download main.pdf from the Phase_2_Manuscript/ folder and submit it to the target conference!

See Phase_2_Manuscript/README.md for detailed submission instructions.

📦 Project Components

🔧 Utility: Research Paper Downloader (`downloads/`)

A robust batch downloader for academic PDFs from major venues (arXiv, ACL, CVPR, AAAI, ACM). Features intelligent metadata extraction, automatic retries, and comprehensive logging.

Key Features:

Multi-venue support with site-specific heuristics
Semantic filename generation from paper metadata
PDF validation and deduplication
Exponential backoff retry mechanism
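
The retry behaviour listed above can be illustrated with a minimal sketch. This is not the downloader's actual code: the function name `retry_with_backoff`, its parameters, and the 10% jitter are assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds or max_attempts calls have failed.

    Hypothetical helper: names and defaults are illustrative only.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:  # a real downloader would catch narrower network errors
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay doubles on each failure (base, 2x, 4x, ...) up to max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # A small random jitter keeps parallel downloads from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

A caller would wrap a single download attempt, for example `retry_with_backoff(lambda: fetch(url))`, where `fetch` stands for whatever function performs one HTTP request. The injectable `sleep` argument exists only so the helper can be tested without waiting.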

📅 Week 03-04: Foundation Development

MKhoi: ASR & OCR Pipeline (`Week0304_MKhoi_OCR_ASR/`)

Baseline implementation for extracting text from lecture videos and slides.

Technologies:

ASR: PhoWhisper (VinAI's fine-tune of OpenAI Whisper for Vietnamese)
OCR: Tesseract with adaptive preprocessing
Audio Processing: FFmpeg extraction, 16 kHz WAV conversion
Batch Processing: Multi-file support with structured outputs

Output: Timestamped transcripts (TXT/JSON) + extracted slide text

NKhoi: Retrieval Systems Evaluation (`Week0304_NKhoi_Retrieval/`)

Comprehensive comparison of retrieval methods on the MS MARCO dataset.

Methods Evaluated:

BM25: Sparse keyword-based retrieval (baseline)
Dense: Sentence-BERT embeddings with cosine similarity
Hybrid: Weighted Sum + Reciprocal Rank Fusion (RRF)
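
The two hybrid strategies named above can be sketched in a few lines. This is an illustrative version, not the notebook's code: the function names, the constant `k = 60` (a commonly used default for RRF), the min-max normalisation, and the `alpha` weight are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document IDs: each list adds 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def _min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so sparse and dense scores become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {d: ((s - lo) / span if span else 1.0) for d, s in scores.items()}


def weighted_sum_fusion(
    sparse: Dict[str, float], dense: Dict[str, float], alpha: float = 0.5
) -> List[str]:
    """Rank documents by alpha * sparse + (1 - alpha) * dense after normalisation."""
    s_norm, d_norm = _min_max(sparse), _min_max(dense)
    combined = {
        doc: alpha * s_norm.get(doc, 0.0) + (1 - alpha) * d_norm.get(doc, 0.0)
        for doc in set(s_norm) | set(d_norm)
    }
    return sorted(combined, key=lambda d: (-combined[d], d))
```

RRF only needs the rank positions, so it sidesteps the score-scale mismatch between BM25 and cosine similarity; the weighted sum needs the normalisation step and a tuned `alpha`.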

Key Findings:

Dense retrieval achieves 3.6× higher nDCG@10 than BM25 on MS MARCO
Hybrid methods provide marginal improvements but add complexity
Vocabulary mismatch severely impacts BM25 on natural language queries
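
The vocabulary-mismatch finding can be reproduced with a toy BM25 scorer. The sketch below is a plain Okapi BM25 under assumed defaults (`k1 = 1.2`, `b = 0.75`, whitespace-tokenised input, non-empty corpus); it is not the evaluation code from the notebooks.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str], docs: List[List[str]], k1: float = 1.2, b: float = 0.75
) -> List[float]:
    """Score each tokenised document against the tokenised query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # a query term absent from the document contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

With the documents `["the", "car", "is", "fast"]` and `["the", "automobile", "is", "quick"]`, the query `["car"]` scores only the first document, and a paraphrased query such as `["vehicle", "speed"]` scores neither, even though both documents are relevant. Dense embeddings do not depend on exact term overlap, which is consistent with the gap reported above.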

Metrics: nDCG@10, Recall@10, latency analysis
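
For reference, the two ranking metrics can be computed as follows. This is a minimal sketch with linear gain (the relevance grade itself, rather than 2^rel - 1), which is an assumption; the notebooks may rely on an evaluation library with different conventions.

```python
import math
from typing import Dict, List, Set


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain: the gain at position i is divided by log2(i + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


def recall_at_k(ranked_ids: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)
```

nDCG rewards placing relevant documents early, while Recall@10 only checks whether they appear in the top ten, so the two can disagree when a system finds the right documents but ranks them low.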

QPhu: RAG Framework Comparison (`Week0304_QPhu_RAG_Pipeline/`)

Systematic evaluation of three RAG implementation approaches.

Frameworks:

LangChain: High-level abstractions, extensive integrations
LlamaIndex: Python-native, data-centric design
Manual: Custom implementation for full control

Configuration Options:

Vector Stores: FAISS (in-memory), Chroma (persistent)
LLMs: OpenAI GPT-4o-mini, Azure OpenAI, Google Gemini, Ollama
Benchmarking: Automated metrics collection and reporting

Use Case: Comparative analysis for selecting optimal RAG stack

📅 Week 05-06: Advanced Enhancements

MKhoi: Multi-Model ASR/OCR (`Week0506_Mkhoi_OCR_ASR/`)

Expanded processing pipeline with multiple AI backends and detailed benchmarking.

ASR Models:

OpenAI Whisper: Variants from tiny to large-v3
Google Gemini: API-based with 2.0/2.5 Flash models
DeepSeek: Alternative API provider

OCR Enhancements:

Advanced preprocessing (Otsu and adaptive thresholding)
Multi-language support (Vietnamese + English)
PDF batch processing with Poppler integration
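
To show what Otsu thresholding does before OCR, here is a dependency-free sketch that operates on a flat list of 8-bit grayscale values. The project most likely calls an imaging library for this step; the function names below are hypothetical and the code is for illustration only.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the threshold in [0, 255] that maximises between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of intensities at or below the candidate threshold
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no pixels in the dark class yet
        w1 = total - w0
        if w1 == 0:
            break  # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels: Sequence[int], threshold: int) -> List[int]:
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    return [255 if p > threshold else 0 for p in pixels]
```

Otsu picks one global threshold per image, which works well on evenly lit slides. Adaptive thresholding instead computes a local threshold per neighbourhood, which is the better choice for photographed pages with uneven lighting.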

Deliverables: Model comparison reports (asr rank.md, ocr rank.md, model comparison.md)

NKhoi: Production Retrieval Systems (`Week0506_NKhoi_Retrieval/`)

Industrial-grade retrieval implementations using specialized tools.

Upgrades:

Milvus: Vector database for billion-scale dense retrieval
Pyserini: Lucene-based BM25 with advanced linguistic processing
ColPali: Vision-language retrieval for document images (no OCR needed)

Performance Improvements:

44 minutes → ~10 seconds for BM25 (Pyserini)
6 seconds → <1 second for Dense (Milvus)
Better tokenization, stemming, and query optimization

Novel Approach: ColPali for end-to-end visual retrieval (bypassing OCR errors)

📅 Week 07-09: Production Pipeline

QPhu: Unified Processing Pipeline (`Week070809_QPhu_Processor/`)

Complete overhaul into a production-ready 4-stage pipeline with enterprise features and intelligent processing.

Architecture Overview:

Stage 1 (Normalizer): Format conversion with consistent filename truncation for Windows compatibility
Stage 2 (Media Processor): Audio/video transcription with multiple export formats (JSON/SRT/VTT/MD)
Stage 3 (Docling Processor): Smart deduplication avoiding duplicate processing, VLM-powered understanding
Stage 4 (Consolidator): RAG-ready unified structure with dual-mode outputs
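
As an illustration of the Stage 2 subtitle exports, the sketch below converts timed transcript segments into SRT and WebVTT text. It assumes each segment is a dict with `start` and `end` in seconds plus a `text` field (the shape Whisper-style transcribers commonly return); the function names are hypothetical and this is not the pipeline's own exporter.

```python
from typing import Dict, List


def format_timestamp(seconds: float, sep: str = ",") -> str:
    """Format seconds as HH:MM:SS<sep>mmm (SRT uses ',', WebVTT uses '.')."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{sep}{millis:03d}"


def to_srt(segments: List[Dict]) -> str:
    """Build SRT text: a numbered cue, a time range, the text, then a blank line."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


def to_vtt(segments: List[Dict]) -> str:
    """Build WebVTT text: a 'WEBVTT' header followed by unnumbered cues."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        start = format_timestamp(seg["start"], sep=".")
        end = format_timestamp(seg["end"], sep=".")
        lines += [f"{start} --> {end}", seg["text"].strip(), ""]
    return "\n".join(lines)
```

Keeping the segment list as the single source and deriving every export from it means the JSON, SRT, VTT, and Markdown outputs cannot drift apart in their timings.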

Core Features:

Smart Deduplication: Process each file only once, optimal quality source selection
Dual RAG Outputs: Normalized PDFs for image retrieval + Markdown for semantic search
Universal Format Support: 15+ formats (DOCX, PPTX, HTML, Images, Video, Audio, PDF, Excel, CSV, AsciiDoc, WebVTT)
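
One simple way to process each file only once is to group inputs by a content digest. The sketch below is an assumption about how such deduplication could work, not the pipeline's actual logic: it keeps the first path per digest, whereas the real pipeline is described as selecting the best-quality source.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, reading it in chunks to bound memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def deduplicate(paths: Iterable[Path]) -> List[Path]:
    """Keep one path per distinct file content (the first in sorted order)."""
    seen: Dict[str, Path] = {}
    for path in sorted(paths):
        # setdefault stores the path only if this digest has not been seen before.
        seen.setdefault(file_digest(path), path)
    return list(seen.values())
```

MD5 is used here only as a fast content fingerprint, not for security; byte-identical copies of a lecture file under different names collapse to a single entry.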

Advanced Capabilities:

Visual Understanding: SmolVLM-256M integration for image descriptions and layout analysis
Processing Modes:
- Full Mode (default): VLM-enabled, highest quality, ~1× speed
- Balanced Mode (--no-vlm): OCR-only with exports, ~2× faster
- Fast Mode (--fast-mode): OCR-only minimal exports, 3-5× faster
Intelligent Caching: MD5-based skip system with --force flag to bypass
Windows Optimization: Automatic filename truncation (50 chars + MD5 hash) for 260-char path limit
Multi-OCR Support: RapidOCR (primary), Tesseract, EasyOCR
ASR Integration: Whisper-based transcription for audio/video with configurable models
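
The filename truncation and the MD5-based skip check can be sketched as below. The 50-character stem comes from the feature list; the 8-character hash suffix, the function names, and the in-memory `cache` dict are assumptions made for illustration.

```python
import hashlib
import os
from typing import Dict

MAX_STEM_CHARS = 50  # stem length quoted in the feature list


def truncate_filename(name: str, max_stem: int = MAX_STEM_CHARS, hash_chars: int = 8) -> str:
    """Shorten a bare filename to max_stem characters plus a short MD5 suffix.

    The suffix is derived from the full original stem, so two long names that
    share their first max_stem characters still map to different outputs.
    """
    stem, suffix = os.path.splitext(name)
    if len(stem) <= max_stem:
        return name
    digest = hashlib.md5(stem.encode("utf-8")).hexdigest()[:hash_chars]
    return f"{stem[:max_stem]}_{digest}{suffix}"


def should_skip(key: str, content_md5: str, cache: Dict[str, str], force: bool = False) -> bool:
    """Skip a file when its digest matches the cached one, unless force is set."""
    if force:
        return False  # mirrors the idea of a --force flag: always reprocess
    return cache.get(key) == content_md5
```

Truncating the stem rather than the whole name preserves the extension, which later stages need for format detection, and the hash suffix keeps the mapping deterministic across runs.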

Performance Optimizations:

GPU acceleration (CUDA support)
Batch processing with progress tracking
Exponential backoff retry mechanism
Comprehensive error handling and logging
Graceful degradation for unsupported formats

Output Structure:

stage4_rag_ready/
├── document_name.pdf                    # Image-based RAG (preserved layout)
├── document_name.md                     # Text-based RAG (semantic search)
└── document_name_docling_additional/    # Extracted images/tables
    ├── images/
    └── tables/

📅 Phase 2 Integrated Application: FE + AI + AWS (`Phase_2_FE_AI_Merge/`)

Single tree that combines the production-style FastAPI backend (Qdrant, S3, optional SageMaker inference), the React + Firebase frontend from the FE track, SageMaker hosting packs (unified Docling + Whisper + ColQwen container and optional split endpoints), and Terraform for AWS: ECR, ECS Fargate, Application Load Balancer with optional HTTPS (ACM), auto scaling, and an optional SageMaker endpoint aligned with sagemaker/unified.

| Area | Path | Documentation |
| --- | --- | --- |
| Folder overview | `Phase_2_FE_AI_Merge/` | `Phase_2_FE_AI_Merge/README.md` |
| Integration log | `Phase_2_FE_AI_Merge/MERGE_SUMMARY.md` | Merge checklist and features |
| Terraform (ALB, ECS, ECR, SageMaker) | `Phase_2_FE_AI_Merge/terraform/` | `Phase_2_FE_AI_Merge/terraform/README.md` |
| SageMaker build / deploy | `Phase_2_FE_AI_Merge/sagemaker/` | `Phase_2_FE_AI_Merge/sagemaker/README.md` |
| HTTPS + custom domain runbook | `docs/technical/DOCS_deployment-alb-acm-custom-domain.md` | ACM validation, DNS, ALB listeners |

Use Phase_2_FE_AI_Merge as the maintained application tree for local development, technical review, deployment, and testing.

🚀 Quick Start

📚 For capstone presentations / documentation review: Start with docs/README.md (documentation hub) → docs/report/ folder for Phase 2 reports and presentation guides.

👨‍💻 For development setup: Prerequisites follow docs/requirements.md (TR-001–TR-005, NFR-005–NFR-006): Python 3.9+, FastAPI backend; React 18+, Vite, Tailwind frontend; FFmpeg, Tesseract, Poppler for media; GPU optional locally if you offload heavy inference to APIs or SageMaker (Phase_2_FE_AI_Merge/sagemaker/README.md). Docker and Terraform are for packaging and cloud layout (TR-006–TR-007).

Shell: All commands below are Windows PowerShell (5.1 or 7+). From another shell, translate Set-Location/Copy-Item/.\venv\Scripts\Activate.ps1 as needed. If script activation is blocked, run once: Set-ExecutionPolicy -Scope CurrentUser RemoteSigned (or start Python via .\venv\Scripts\python.exe without activating).

Clone and base setup

git clone https://github.com/pdz1804/capstone-project.git
Set-Location capstone-project
python -m venv venv
.\venv\Scripts\Activate.ps1

Recommended: merged app (`Phase_2_FE_AI_Merge/`)

Full UI (Firebase), Qdrant/S3-aware API, tests, and Terraform and SageMaker docs: see Phase_2_FE_AI_Merge/README.md.

# Backend (see Phase_2_FE_AI_Merge/backend/README.md for uvicorn/install scripts)
Set-Location Phase_2_FE_AI_Merge\backend
pip install -r requirements.txt
Copy-Item .env.example .env
# Edit .env: keys, Qdrant, S3, SageMaker flags

# Frontend: open a new PowerShell window at the repository root, then:
Set-Location Phase_2_FE_AI_Merge\frontend
npm install
Copy-Item .env.example .env
npm run dev

URLs (typical): UI http://localhost:5173 (or Vite default), API http://localhost:8000, docs http://localhost:8000/docs. Run the API with the command in backend/README.md (e.g. uvicorn on app.main:app).

Terraform (local validation only; no apply):

Set-Location Phase_2_FE_AI_Merge\terraform
terraform init -backend=false
terraform fmt -recursive
terraform validate

Research and pipeline folders (optional)

Set-Location Week0506_Mkhoi_OCR_ASR\src
python main.py asr --output-dir results\asr @(Get-ChildItem -Path "data\videos\*.mp4" | ForEach-Object { $_.FullName })

Set-Location ..\..\Week0304_NKhoi_Retrieval
jupyter notebook manual_bm25_dense_hybrid.ipynb

Set-Location ..\Week0304_QPhu_RAG_Pipeline
python setup_and_run.py

Set-Location ..\Week070809_QPhu_Processor
python src\pipeline.py input\ output\
# Optional: add --fast-mode where that script supports it

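Use Set-Location <repoRoot> first if you are not already at the repository root (replace <repoRoot> with your clone path, e.g. D:\PDZ\BKU\Learning\LVTN\GD1\Code). If the research-week folders sit under Phase_1\ in your checkout, as the Phase_1/Week0304_QPhu_RAG_Pipeline/ path listed under Documentation suggests, prefix the paths above accordingly.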

🎓 Academic Context

Course: CS252 - Capstone Project
Institution: Ho Chi Minh City University of Technology (HCMUT)
Focus: Applied AI for Educational Content Processing
Domain: Information Retrieval, NLP, Multimodal Learning, RAG Systems

Research Contributions:

Vietnamese-optimized ASR/OCR pipeline for lecture processing
Comprehensive retrieval method comparison on MS MARCO
RAG framework selection guide for educational Q&A
Production-grade retrieval system implementations
Multimodal document understanding with Docling
Dual-mode RAG processing pipeline (text + image retrieval)
Intelligent document deduplication and caching system
Performance-quality tradeoff framework (Fast vs Full modes)

📚 Documentation

📖 Start Here:

docs/README.md: Documentation hub and overview.
docs/requirements.md ⭐: Software Requirements Specification covering functional, non-functional, and technical constraints (37 requirements total).

Authoritative Technical Documents

docs/technical/APPLICATION_OVERVIEW.md ⭐: Product scope, user workflows, architecture summary, features, quality attributes, and engineering assessment.
docs/technical/API_REFERENCE.md: Maintainer-level API reference covering authentication, files, processing, indexing, search, chat, insights, feedback, and operational guidance.
docs/technical/DOCS_TECHNICAL_GUARDRAIL_CONFIGURATION.md: AWS Bedrock guardrails configuration, content safety filters, PII protection, and implementation details.

Testing and Performance Evidence

docs/report/FRESH_EVALUATION_REPORT_2026_05_07.md: Final evaluation report with component testing, performance benchmarks, and production readiness assessment.
docs/jmeter-capacity-tests/runs/README_MAIN_APIS.md: JMeter runbook and result exports for the Process, Index, and Search APIs.
docs/jmeter-capacity-tests/runs/README_NON_MAIN_APIS.md: JMeter runbook and result exports for the Auth, User, Stats, Upload, Chat, and Insights APIs.

Architecture and Deployment

docs/technical/DOCS_deployment-alb-acm-custom-domain.md: ACM certificates, DNS validation, ALB HTTP→HTTPS, and custom domain setup.
docs/technical/DOCS_search-cache-redis-setup.md: Redis/ElastiCache search cache setup and operational notes.
docs/technical/DOCS_REDIS_ASYNC_JOB_SYSTEM_GUIDE.md: Async job tracking system (Redis-based), job lifecycle, and monitoring.

Security and WAF Configuration

docs/technical/DOCS_TECHNICAL_WAF_CONFIGURATION.md: AWS WAF rules, IP whitelisting, DDoS protection, and security group configuration.
docs/technical/SECURITY_SECTION_CAPSTONE_REPORT.md: Security architecture, threat modeling, and compliance considerations.

Cost Estimation

docs/others/AWS_Cost_Estimation_50_Users_Professional.xlsx: Detailed cost analysis and scalability projections for 50 concurrent users.

Merged Production Application (Phase_2_FE_AI_Merge/)

Phase_2_FE_AI_Merge/README.md: Top-level map of frontend, backend, SageMaker pack, and Terraform; local quick paths.
Phase_2_FE_AI_Merge/MERGE_SUMMARY.md: What was integrated from the FE and AI service tracks.
Phase_2_FE_AI_Merge/backend/README.md: FastAPI layout, Qdrant/BM25/hybrid/image retrieval, S3 vs local storage.
Phase_2_FE_AI_Merge/terraform/README.md: AWS resources (ECR, ECS, ALB, optional HTTPS, optional SageMaker) and Terraform checks.
Phase_2_FE_AI_Merge/sagemaker/README.md: Unified container, ECR push, deploy/delete scripts, and backend environment variables.

Research Milestones and Utilities

READMEs inside Week0304_*, Week0506_*, Week070809_QPhu_Processor/, and downloads/ directories (datasets and paper references).
Phase_1/Week0304_QPhu_RAG_Pipeline/DETAILED_PIPELINE_FLOWS.md: Detailed RAG pipeline flow diagrams and explanations.

🔬 Research Papers & References

The downloads/ directory contains a curated collection of research papers covering:

Retrieval-Augmented Generation (RAG) architectures
Dense retrieval methods (DPR, ColBERT, ANCE)
Multimodal learning (CLIP, LayoutLM, Docling)
Speech recognition (Whisper, Wav2Vec 2.0)
OCR and document understanding

🤝 Contributing

This is an academic capstone project. For collaboration or questions:

Repository: github.com/pdz1804/capstone-project
Issues: Use GitHub Issues for bug reports or feature requests
Contact: See individual weekly READMEs for team member information

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Open-source models, APIs, and platforms that this codebase builds on (see also TR-004–TR-005 and integration notes in docs/requirements.md):

OpenAI: Whisper and LLM APIs used in ASR and generation experiments.
Google: Gemini (multimodal/API), Firebase (authentication in the merged frontend stack), and embedding-related tooling referenced in weekly work.
Hugging Face: transformers, model hubs, and pretrained checkpoints (e.g. ColQwen, sentence encoders).
IBM: Docling and related document-understanding components.
Qdrant: Vector database used in the Phase 2 AI service and merge backend.
Amazon Web Services: S3, SageMaker real-time inference, and (via Terraform) ECS, ECR, ALB, ACM for optional cloud deployment.
HashiCorp: Terraform for infrastructure as code in Phase_2_FE_AI_Merge/terraform/.
Pyserini / Anserini & Milvus: retrieval stacks explored in research-week milestones.
LangChain & LlamaIndex: RAG framework comparisons (early-phase notebooks and prototypes).
FFmpeg, Tesseract, Poppler: media, OCR, and PDF tooling (TR-005).
React, Vite, Tailwind CSS: frontend stack (TR-002).

Version: 1.0
Last Updated: May 10, 2026

Team: MKhoi, NKhoi, QPhu.
