
DSI321_2025

πŸ“‘ Realtime x Scraping for Adjusting TU Academic Public Relations

Table of Contents

  • 📘 Project Overview
  • ✨ Project Features
  • 🏗️ System Architecture
  • 🧪 Technologies
  • 🚀 Deployment Guide

πŸ“˜ Project Overview

This system was developed to track and analyze public data related to Thammasat University (TU) in an academic context, using real-time data extraction and natural language processing (NLP) techniques to:

  • Monitor comments and articles mentioning TU
  • Analyze the sentiment and topic of the content (see the sketch after this list)
  • Alert the PR department when important information is found
  • Adjust the communication strategy to suit the situation
  • Track public opinion about the university
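
The README does not reproduce the project's prompt or model code. Purely as a hypothetical illustration of how the sentiment step could call Gemini 2 Flash through the google-generativeai client, the sketch below uses an assumed model name, prompt, and `classify_sentiment` function that are not the project's actual code:

```python
import google.generativeai as genai

# Placeholder credentials; a real deployment would load these from the environment.
genai.configure(api_key="YOUR_GEMINI_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")


def classify_sentiment(text: str) -> str:
    """Ask Gemini to label one post as positive, neutral, or negative."""
    prompt = (
        "Classify the sentiment of the following post about Thammasat University "
        "as positive, neutral, or negative. Reply with a single word.\n\n" + text
    )
    response = model.generate_content(prompt)
    return response.text.strip().lower()


if __name__ == "__main__":
    print(classify_sentiment("TU just announced a new data science scholarship!"))
```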

✨ Project Features

  • βœ… Real-time Scraping from Twitter (X) every 15 minutes
  • βœ… Convert data to Parquet and store in lakeFS
  • βœ… Detect new data through hashing and comparison
  • βœ… Analyze with NLP (Sentiment, Topic Modeling)
  • βœ… Streamlit Dashboard shows the results in an easy-to-understand format
  • βœ… Orchestrated with Prefect and can be deployed in multiple ways

πŸ—οΈ System Architecture

[System architecture diagram]

Flow Summary:

  1. A user posts a message on X → the system scrapes it
  2. The content hash is checked against the existing data in lakeFS
  3. If new data is found → save the data and update the repository
  4. Load the data into lakeFS and display it on the Streamlit dashboard
  5. The entire pipeline is orchestrated with Prefect (see the sketch after this list)
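
The real flows live in src/backend/pipeline/initial_scrape_flow.py and incremental_scrape_flow.py. Purely as an illustration of how Prefect can tie the steps above together, the sketch below wires hypothetical scrape / filter / store tasks into one flow (the task bodies are stubs, not the project's code):

```python
from prefect import flow, task


@task(retries=2, retry_delay_seconds=30)
def scrape_posts() -> list[dict]:
    # Stub for the Playwright-based scraper.
    return [{"text": "example post about TU", "author": "user", "posted_at": "2025-01-01"}]


@task
def filter_new_posts(posts: list[dict]) -> list[dict]:
    # Stub for the hash comparison against data already in lakeFS.
    return posts


@task
def store_posts(posts: list[dict]) -> None:
    # Stub for the Parquet write to lakeFS.
    print(f"stored {len(posts)} new posts")


@flow(name="incremental-scrape")
def incremental_scrape_flow() -> None:
    store_posts(filter_new_posts(scrape_posts()))


if __name__ == "__main__":
    incremental_scrape_flow()
```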

πŸ§ͺ Technologies

| Component        | Technology             |
|------------------|------------------------|
| Backend          | Python                 |
| Scraping         | Playwright             |
| Storage          | lakeFS, Parquet        |
| Data Validation  | Pydantic               |
| NLP (word cloud) | Gemini 2 Flash         |
| Dashboard        | Streamlit              |
| Orchestration    | Prefect                |
| Deployment       | Docker, GitHub Actions |
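
The project's actual Pydantic schema is not shown in this README; as a hypothetical example of how a scraped post might be validated before it is written to Parquet (the field names are assumptions):

```python
from datetime import datetime

from pydantic import BaseModel, Field, ValidationError


class ScrapedPost(BaseModel):
    """Hypothetical schema for one scraped X (Twitter) post."""

    text: str = Field(min_length=1)
    author: str
    posted_at: datetime
    url: str
    hash: str  # SHA-256 of the post content, used for deduplication


try:
    post = ScrapedPost(
        text="TU opens a new data science program",
        author="example_user",
        posted_at="2025-01-01T12:00:00",
        url="https://x.com/example_user/status/1",
        hash="0" * 64,
    )
except ValidationError as err:
    # Invalid rows can be dropped or logged instead of polluting the dataset.
    print(err)
```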

πŸš€ Deployment Guide

🧰 Prerequisites

  • Python 3.9+
  • Docker + Docker Compose
  • Git
  • Dependencies listed in requirements.txt

πŸ› οΈ Prepare Project

  1. Clone the repository:
git clone https://github.com/SirapopChu/dsi321_2025
cd dsi321_2025
  2. Create a virtual environment:
python -m venv .venv
  3. Activate the virtual environment

Windows (Git Bash):

source .venv/Scripts/activate

macOS & Linux:

source .venv/bin/activate
  4. Run the start script:
bash start.sh

Prepare Prefect

  1. Start the Prefect server:
docker compose --profile server up -d
  2. Connect to the CLI container:
docker compose run cli
  3. Run the initial scrape flow (a one-time run that builds the base dataset):
python src/backend/pipeline/initial_scrape_flow.py
  4. Schedule incremental scraping every 15 minutes (see the sketch after these steps):
python src/backend/pipeline/incremental_scrape_flow.py
  5. Check pipeline status in the Prefect flow UI: open http://localhost:4200 in your browser
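
incremental_scrape_flow.py handles its own scheduling; as one illustrative way to have Prefect trigger a flow every 15 minutes (a minimal sketch, assuming Prefect 2.x's serve API; the flow body is a stub):

```python
from prefect import flow


@flow(log_prints=True)
def incremental_scrape() -> None:
    # Stub body; the real flow scrapes X, deduplicates, and writes to lakeFS.
    print("scrape cycle finished")


if __name__ == "__main__":
    # Serve the flow so the Prefect server at http://localhost:4200
    # schedules a run every 15 minutes (900 seconds).
    incremental_scrape.serve(name="incremental-scrape-every-15-min", interval=900)
```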
