
DSI321_2025

πŸ“‘ Realtime x Scraping for Adjusting TU Academic Public Relations

Table of Contents

  • 📘 Project Overview
  • ✨ Project Features
  • 🏗️ System Architecture
  • 🧪 Technologies
  • 🚀 Deployment Guide

πŸ“˜ Project Overview

This system was developed to track and analyze public data related to Thammasat University (TU) in an academic context, using real-time data extraction and natural language processing (NLP) techniques to:

  • Monitor comments and articles mentioning TU
  • Analyze the sentiment and topic of the content (see the sketch after this list)
  • Alert the PR department when important information is found
  • Adjust the communication strategy to suit the situation
  • Track public opinion about the university
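
The README does not reproduce the project's prompt or model code. Purely as a hypothetical illustration of how the sentiment step could call Gemini 2 Flash through the google-generativeai client, the sketch below uses an assumed model name, prompt, and `classify_sentiment` function that are not the project's actual code:

```python
import google.generativeai as genai

# Placeholder credentials; a real deployment would load these from the environment.
genai.configure(api_key="YOUR_GEMINI_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")


def classify_sentiment(text: str) -> str:
    """Ask Gemini to label one post as positive, neutral, or negative."""
    prompt = (
        "Classify the sentiment of the following post about Thammasat University "
        "as positive, neutral, or negative. Reply with a single word.\n\n" + text
    )
    response = model.generate_content(prompt)
    return response.text.strip().lower()


if __name__ == "__main__":
    print(classify_sentiment("TU just announced a new data science scholarship!"))
```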

✨ Project Features

  • βœ… Real-time Scraping from Twitter (X) every 15 minutes
  • βœ… Convert data to Parquet and store in lakeFS
  • βœ… Detect new data through hashing and comparison
  • βœ… Analyze with NLP (Sentiment, Topic Modeling)
  • βœ… Streamlit Dashboard shows the results in an easy-to-understand format
  • βœ… Orchestrated with Prefect and can be deployed in multiple ways

πŸ—οΈ System Architecture

[System architecture diagram]

Flow Summary:

  1. A user posts a message on X → the system scrapes it
  2. The content hash is checked against the existing data in lakeFS
  3. If new data is found → save the data and update the repository
  4. Load the data into lakeFS and display it on the Streamlit dashboard
  5. The entire pipeline is orchestrated with Prefect (see the sketch after this list)
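
The real flows live in src/backend/pipeline/initial_scrape_flow.py and incremental_scrape_flow.py. Purely as an illustration of how Prefect can tie the steps above together, the sketch below wires hypothetical scrape / filter / store tasks into one flow (the task bodies are stubs, not the project's code):

```python
from prefect import flow, task


@task(retries=2, retry_delay_seconds=30)
def scrape_posts() -> list[dict]:
    # Stub for the Playwright-based scraper.
    return [{"text": "example post about TU", "author": "user", "posted_at": "2025-01-01"}]


@task
def filter_new_posts(posts: list[dict]) -> list[dict]:
    # Stub for the hash comparison against data already in lakeFS.
    return posts


@task
def store_posts(posts: list[dict]) -> None:
    # Stub for the Parquet write to lakeFS.
    print(f"stored {len(posts)} new posts")


@flow(name="incremental-scrape")
def incremental_scrape_flow() -> None:
    store_posts(filter_new_posts(scrape_posts()))


if __name__ == "__main__":
    incremental_scrape_flow()
```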

πŸ§ͺ Technologies

| Component        | Technology             |
|------------------|------------------------|
| Backend          | Python                 |
| Scraping         | Playwright             |
| Storage          | lakeFS, Parquet        |
| Data Validation  | Pydantic               |
| NLP (word cloud) | Gemini 2 Flash         |
| Dashboard        | Streamlit              |
| Orchestration    | Prefect                |
| Deployment       | Docker, GitHub Actions |
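
The project's actual Pydantic schema is not shown in this README; as a hypothetical example of how a scraped post might be validated before it is written to Parquet (the field names are assumptions):

```python
from datetime import datetime

from pydantic import BaseModel, Field, ValidationError


class ScrapedPost(BaseModel):
    """Hypothetical schema for one scraped X (Twitter) post."""

    text: str = Field(min_length=1)
    author: str
    posted_at: datetime
    url: str
    hash: str  # SHA-256 of the post content, used for deduplication


try:
    post = ScrapedPost(
        text="TU opens a new data science program",
        author="example_user",
        posted_at="2025-01-01T12:00:00",
        url="https://x.com/example_user/status/1",
        hash="0" * 64,
    )
except ValidationError as err:
    # Invalid rows can be dropped or logged instead of polluting the dataset.
    print(err)
```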

πŸš€ Deployment Guide

🧰 Prerequisites

  • Python 3.9+
  • Docker + Docker Compose
  • Git
  • Dependencies listed in requirements.txt

πŸ› οΈ Prepare Project

  1. Clone the repository:
git clone https://github.com/SirapopChu/dsi321_2025
cd dsi321_2025
  2. Create a virtual environment:
python -m venv .venv
  3. Activate the virtual environment

Windows (Git Bash):

source .venv/Scripts/activate

macOS & Linux:

source .venv/bin/activate
  4. Run the start script:
bash start.sh

Prepare Prefect

  1. Start the Prefect server:
docker compose --profile server up -d
  2. Connect to the CLI container:
docker compose run cli
  3. Run the initial scrape flow (a one-time run that builds the base dataset):
python src/backend/pipeline/initial_scrape_flow.py
  4. Schedule incremental scraping every 15 minutes (see the sketch after these steps):
python src/backend/pipeline/incremental_scrape_flow.py
  5. Check pipeline status in the Prefect flow UI: open http://localhost:4200 in your browser
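
incremental_scrape_flow.py handles its own scheduling; as one illustrative way to have Prefect trigger a flow every 15 minutes (a minimal sketch, assuming Prefect 2.x's serve API; the flow body is a stub):

```python
from prefect import flow


@flow(log_prints=True)
def incremental_scrape() -> None:
    # Stub body; the real flow scrapes X, deduplicates, and writes to lakeFS.
    print("scrape cycle finished")


if __name__ == "__main__":
    # Serve the flow so the Prefect server at http://localhost:4200
    # schedules a run every 15 minutes (900 seconds).
    incremental_scrape.serve(name="incremental-scrape-every-15-min", interval=900)
```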
