- Project Overview
- Group Information
- Project Features
- System Architecture
- Technologies
- Deployment Guide
This system was developed to track and analyze public data related to Thammasat University (TU) in academic contexts, using real-time data extraction and natural language processing (NLP) techniques to:
- Monitor comments and articles that mention TU
- Analyze the sentiment and topics of the content
- Alert the PR department when important information is found
- Adjust the communication strategy to suit the situation
- Assess public opinion toward the university
- ✅ Real-time scraping from Twitter (X) every 15 minutes
- ✅ Data converted to Parquet and stored in lakeFS
- ✅ New data detected through hashing and comparison
- ✅ NLP analysis (sentiment, topic modeling); see the sketch after this list
- ✅ Streamlit dashboard that presents the results in an easy-to-understand format
- ✅ Orchestrated with Prefect and deployable in multiple ways
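The NLP step is only summarized above; below is a minimal sketch of what a sentiment call to Gemini could look like. It assumes the `google-generativeai` package and a `GEMINI_API_KEY` environment variable; the model name, prompt, and `classify_sentiment` helper are illustrative, not this repository's actual code.

```python
# Hypothetical sketch: classify the sentiment of a scraped post with Gemini.
# Assumes the google-generativeai package and a GEMINI_API_KEY env variable;
# the model name and prompt are illustrative, not taken from this repository.
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

def classify_sentiment(text: str) -> str:
    """Return 'positive', 'negative', or 'neutral' for a single post."""
    prompt = (
        "Classify the sentiment of this post about Thammasat University "
        "as exactly one word: positive, negative, or neutral.\n\n" + text
    )
    response = model.generate_content(prompt)
    return response.text.strip().lower()

if __name__ == "__main__":
    print(classify_sentiment("Registration day at TU was smooth this year!"))
```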
Flow Summary:
- A user posts a message on X → the system scrapes it
- Each record's hash is checked against the existing data in lakeFS (see the sketch after this list)
- If new data is found → save it and update the repository
- Data is loaded to lakeFS and displayed on the dashboard via Streamlit
- The entire pipeline is orchestrated with Prefect
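How the hash check might work in practice: a minimal sketch, assuming each scraped record is a plain dict and the previously seen hashes have already been loaded from lakeFS into a set. The `record_hash` and `filter_new` helpers are hypothetical, not the repository's actual code.

```python
# Hypothetical sketch of hash-based new-data detection.
# Assumes records are dicts and `known_hashes` was loaded from lakeFS;
# loading/saving lakeFS state is out of scope for this snippet.
import hashlib
import json

def record_hash(record: dict) -> str:
    """Stable SHA-256 over a record's sorted JSON representation."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def filter_new(records: list[dict], known_hashes: set[str]) -> list[dict]:
    """Keep only records whose hash has not been seen before."""
    fresh = []
    for rec in records:
        h = record_hash(rec)
        if h not in known_hashes:
            known_hashes.add(h)
            fresh.append(rec)
    return fresh
```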
| Component | Technology |
|---|---|
| Backend | Python |
| Scraping | Playwright |
| Storage | lakeFS, Parquet |
| Data Validation | Pydantic |
| NLP (word cloud) | Gemini 2 Flash |
| Dashboard | Streamlit |
| Orchestration | Prefect |
| Deployment | Docker, GitHub Actions |
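To show how the validation and storage layers could connect, here is a minimal sketch combining Pydantic and Parquet. The `Post` schema and its fields are hypothetical (the repository's real model may differ), and it assumes Pydantic v2 and pandas with pyarrow installed.

```python
# Hypothetical sketch: validate scraped rows with Pydantic, then write Parquet.
# The Post schema is illustrative; the real model in this repo may differ.
from datetime import datetime

import pandas as pd
from pydantic import BaseModel, Field

class Post(BaseModel):
    post_id: str
    author: str
    text: str = Field(min_length=1)
    created_at: datetime

def to_parquet(raw_rows: list[dict], path: str) -> None:
    """Validate each row, then persist the clean batch as Parquet."""
    validated = [Post(**row).model_dump() for row in raw_rows]
    pd.DataFrame(validated).to_parquet(path, index=False)
    # The resulting file could then be committed to lakeFS, e.g. through
    # its S3-compatible gateway (not shown here).
```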
- Python 3.9+
- Docker + Docker Compose
- Git
- Dependencies listed in requirements.txt
- Clone the repository:
  ```bash
  git clone https://github.com/SirapopChu/dsi321_2025
  cd dsi321_2025
  ```
- Create a virtual environment:
  ```bash
  python -m venv .venv
  ```
- Activate the virtual environment:

  **Windows**
  ```bash
  source .venv/Scripts/activate
  ```
  **macOS & Linux**
  ```bash
  source .venv/bin/activate
  ```
- Run the start script:
  ```bash
  bash start.sh
  ```
Then start the pipeline:
1. Start the Prefect server:
   ```bash
   docker compose --profile server up -d
   ```
2. Connect to the CLI container:
   ```bash
   docker compose run cli
   ```
3. Run the initial scrape flow (to build the base data):
   ```bash
   python src/backend/pipeline/initial_scrape_flow.py
   ```
4. Schedule scraping every 15 minutes (incremental):
   ```bash
   python src/backend/pipeline/incremental_scrape_flow.py
   ```
5. Check the pipeline status in the Prefect flow UI by opening http://localhost:4200 in your browser.
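For reference, the 15-minute incremental schedule could be expressed with Prefect's `serve` API as sketched below. This is not the repository's actual deployment code: the `scrape_new_posts` task is a placeholder, and the sketch assumes a Prefect version (2.10.17+ or 3.x) that supports `Flow.serve`.

```python
# Hypothetical sketch of a Prefect flow served on a 15-minute cron schedule.
# The scrape_new_posts task is a placeholder for the project's real scraper.
from prefect import flow, task

@task
def scrape_new_posts() -> list[dict]:
    # Placeholder: the real project scrapes X with Playwright here.
    return []

@flow(log_prints=True)
def incremental_scrape_flow():
    posts = scrape_new_posts()
    print(f"Fetched {len(posts)} new posts")

if __name__ == "__main__":
    # Serve the flow with a cron schedule: every 15 minutes.
    incremental_scrape_flow.serve(name="incremental-scrape", cron="*/15 * * * *")
```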
