lawun330/Customized-BBC-Crawler

BBC Burmese Telegram Web App

This project is still under development. It aims to create a Telegram web app bot that crawls the BBC Burmese website and displays news content, so Telegram users can read the news without leaving the app.

Manual

You need a Telegram account to use this bot.

  • Chat with the bot directly from your account.

OR

  • Find the bot through Telegram's search bar under the following username.
@presenter_sannie_bot

Once the bot is started, it automatically greets you with a direct link to the website.

Since the whole purpose of the project is to read news without leaving Telegram, I recommend opening the app from inside Telegram rather than through the direct link. The bot supports the following launch methods:

  • inline button
  • keyboard button
  • inline mode
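
The button-based launch methods above can be sketched as Telegram Bot API payloads. This is a minimal illustration, not the code in the telegram-bot folder: it assumes the Bot API `web_app` field on inline and reply keyboard buttons, and the helper names and the example URL are hypothetical. Inline mode is configured separately and is not covered here.

```python
import json


def inline_button_markup(url: str) -> dict:
    """reply_markup for an inline button that opens the web app."""
    return {"inline_keyboard": [[{"text": "Open app", "web_app": {"url": url}}]]}


def keyboard_button_markup(url: str) -> dict:
    """reply_markup for a reply-keyboard button that opens the web app."""
    return {
        "keyboard": [[{"text": "Open app", "web_app": {"url": url}}]],
        "resize_keyboard": True,
    }


def send_message_payload(chat_id: int, text: str, markup: dict) -> dict:
    """Body for the Bot API sendMessage method; reply_markup is JSON-serialized."""
    return {"chat_id": chat_id, "text": text, "reply_markup": json.dumps(markup)}
```

For example, `send_message_payload(chat_id, "Welcome!", inline_button_markup("https://example.com/"))` builds the body of a `sendMessage` request; the URL must be the HTTPS address where the web app is actually hosted.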

Once the app is launched, you can enter a single link to read news content. If you do not have a particular link, choose a topic to get more links, copy a link, and then insert it.

Current Features

At the moment, the bot has three commands.

  1. /start - greet and return the main web app
  2. /help - describe how to use this bot
  3. /keyboard - return the keyboard button

Scraper's Potential

The webscraper can

  • scrape all topics (all pages of each topic) from BBC Burmese,
  • scrape Burmese contents with a filter,
  • export scraped data to a spreadsheet,
  • write spreadsheet data to a local DynamoDB instance,
  • store the current URL in a Redis cache so that scraping can resume after the internet connection is lost.
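
The recovery idea in the last item can be sketched as follows. This is an illustration under stated assumptions rather than the scraper's actual code: a plain dict stands in for the Redis cache (with a Redis client you would use its `get`, `set`, and `delete` calls instead), and `CHECKPOINT_KEY`, `crawl_with_checkpoint`, and `fetch` are hypothetical names.

```python
from typing import Callable, Iterable, List, MutableMapping

CHECKPOINT_KEY = "crawler:current_url"  # hypothetical key name


def crawl_with_checkpoint(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    cache: MutableMapping[str, str],
) -> List[str]:
    """Fetch URLs in order, resuming from the cached URL if one exists."""
    url_list = list(urls)
    resume_from = cache.get(CHECKPOINT_KEY)
    # Start at the cached URL when it is known, otherwise from the beginning.
    start = url_list.index(resume_from) if resume_from in url_list else 0
    pages = []
    for url in url_list[start:]:
        cache[CHECKPOINT_KEY] = url  # record the URL before fetching it
        pages.append(fetch(url))     # may raise if the connection drops
    cache.pop(CHECKPOINT_KEY, None)  # clean finish: clear the checkpoint
    return pages
```

If `fetch` raises partway through, the checkpoint stays in the cache, so calling the function again with the same URL list retries the interrupted URL and continues from there.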

Future Improvements for Scraper

  • To write spreadsheet data to cloud DynamoDB
  • To use a modular approach rather than a single scraper script

Building the Web App

Files and Directories

There are several folders and files in this repository:

  • The notebooks folder contains the Jupyter notebooks used to develop the project and document its progress.
  • The spreadsheets folder contains the exported spreadsheets. This directory is ignored onwards.
  • The db folder contains the DynamoDBLocal database and related files.
  • The webscraper folder contains the main Python script to perform the webscraping.
    • During development, I recommend checking the notebooks instead.
  • The pyproject.toml file contains the project dependencies and settings for a ruff check.
  • The doc folder contains files related to front-end and the UI design.
  • The img folder contains images to use in the project.
  • The telegram-bot folder contains scripts to manage and run the telegram bot.
    • You need to create a .env file in this folder that holds your Telegram bot token and username.
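
As a rough illustration of the .env file mentioned above, the sketch below parses simple KEY=VALUE lines using only the standard library. The variable names `BOT_TOKEN` and `BOT_USERNAME` and the `load_env` helper are assumptions made for this example; check the scripts in the telegram-bot folder for the names and the loader they actually expect.

```python
from pathlib import Path
from typing import Dict

# Example .env contents (hypothetical variable names):
#   BOT_TOKEN="123456:replace-with-your-token"
#   BOT_USERNAME=your_bot_username


def load_env(path: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values
```

Keep the .env file out of version control, since the bot token grants full control of the bot.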

Project Development

Check the notebooks in the notebooks folder for detailed documentation. They show how the customized web crawler for this project evolved from scratch. The crawler is combined with two other parts: the front-end and the Telegram bot. The necessary scripts can be found in the folders listed above.

Project Requirements

To install DynamoDB locally, check here.

  • (Optional Config) If you download the compressed file, extract it and move it to "C:".

To run DynamoDB locally, JDK 17 is recommended.

  • (Optional Config) Place it in the specific directory such that java.exe can be used as follows:
"C:\Program Files\Java\jdk-17\bin\java.exe"

To install Redis for client, check here. To install Ubuntu on Windows with WSL, check here.

You may also have to install additional libraries and modules.

A. Simple but Slow Installation

  1. Create a virtual environment with Python
python -m venv <env-name>

OR with conda.

conda create --name <env-name>
  2. Activate the virtual environment.
  • For the virtual environment created with Python
<env-name>\Scripts\activate
  • For the virtual environment created with conda
conda activate "C:\Users\<your-pc-username>\anaconda3\envs\<env-name>"
  3. Install dependencies with
pip install -r requirements.txt

B. Fast but Complicated Installation

  1. Install uv, an extremely fast Python package and project manager written in Rust.
pip install uv
  2. Go to your working directory.
  3. Create a virtual environment in your working directory with uv.
uv venv
  4. Activate the virtual environment.
.venv\Scripts\activate
  5. Install dependencies with
uv pip install -r requirements.txt

Hosting Servers

  • Local HTTP Server: Test your web application locally during development. The local HTTP server serves files from the current directory at port 9000 (arbitrary) with
python -m http.server 9000
  • Telegram: You have to navigate to the directory /telegram-bot and host the Telegram bot with
python app.py
  • Redis: You need the Redis-client installed on your device. It caches the current link to continue fetching if the connection is lost. Open the Ubuntu Terminal and run
redis-cli
  • DynamoDB: You need to host the DynamoDB to store data. Navigate to the directory /db and run
DynamoDB_init.bat
  • FastAPI: Run the following to work with the website requests
uvicorn api:app --reload
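
If you prefer to start the local HTTP server from a script instead of the command line, the standard library offers a programmatic equivalent. This is a minimal sketch, not part of the repository: `make_server` is a hypothetical helper, and unlike `python -m http.server` it binds to 127.0.0.1 only.

```python
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_server(directory: str = ".", port: int = 9000) -> ThreadingHTTPServer:
    """Create a static file server for `directory` on localhost."""
    # Bind the handler to a directory instead of the process working directory.
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


# Usage (blocks until interrupted with Ctrl+C):
#     server = make_server()
#     server.serve_forever()
```

Port 9000 is as arbitrary here as in the command above; passing `port=0` lets the operating system pick a free port, which is then available as `server.server_port`.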

About

Developing a telegram web app to display BBC News without leaving the app
