ssokz provides an interactive dashboard for data management and analysis that lets users efficiently discover, explore, and compare datasets. Registered users can search for datasets, which triggers a smart web-scraping pipeline that collects relevant metadata such as title, domain, size, and download link from multiple online sources. The results are displayed in a dashboard that supports dataset comparison and personalized views, with statistical visualizations and user-contributed dataset entries planned for a later stage. The system will thus serve as an intelligent, user-centric hub for dataset exploration and organization.
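As an illustration of the kind of record the pipeline collects, a single search result might look roughly like the sketch below; the field names follow the metadata listed above, but the values and exact schema are hypothetical, not the app's actual models.

```python
# Illustrative only -- the real fields are defined by the Django models/serializers.
example_result = {
    "title": "Global Air Quality Measurements",  # dataset title
    "domain": "environment",                     # subject area
    "size": "1.2 GB",                            # reported download size
    "download_link": "https://example.org/air-quality.zip",
}
```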
This project uses Codecov for tracking test coverage. Public reports are available at: https://app.codecov.io/gh/syntaxsavr/scrapper
- Docker
- Docker Compose
- Clone the repository
git clone <your-repo-url>
cd scrapper
- Create the environment file: copy .env.example to .env and adjust the values to match your setup.
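For example, on Linux or macOS the copy step is simply:
cp .env.example .env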
- Start the application
docker compose up --build
- Access the application
- Web App: http://localhost:8000
- Admin Panel: http://localhost:8000/admin
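A quick way to check that the stack is serving requests (assuming curl is available) is to request the landing page from another terminal:
curl -I http://localhost:8000/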
To stop the application, press CTRL+C in the terminal, then:
docker compose down
To also remove the named volumes (this deletes any stored data):
docker compose down -v
Regarding load tests:
Locust was used; however, be VERY careful when doing this yourself, because you might get your IP rate-limited by the Hugging Face Hub, meaning you would need to make an account or pay.
Avoid tasks that hit scraper-backed paths (e.g. self.client.get("/api/search/?q=world")) when simulating more than 10 users or when leaving the test running for longer. Also note that the first reply may take longer than subsequent replies, because later requests are served from results already stored in the db (the first one usually takes closer to ~100ms).
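A minimal locustfile sketch for a gentle run, assuming the app serves a landing page at /; the file name, class name, and task are illustrative and deliberately avoid the scraper-backed /api/search/ path:

```python
# locustfile.py -- illustrative sketch, not part of the repository
from locust import HttpUser, task, between


class DashboardUser(HttpUser):
    # Wait 1-3 seconds between requests to keep the load gentle.
    wait_time = between(1, 3)

    @task
    def landing_page(self):
        # Only hits the landing page; skips /api/search/ so the scraping
        # pipeline (and the Hugging Face Hub) is not hammered.
        self.client.get("/")
```

With a locustfile.py like this in the working directory, Locust picks it up automatically when running the commands below.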
A light run that stays under the 10-user threshold mentioned above:

locust --headless --users 8 --spawn-rate 2 -H http://127.0.0.1:8000/

A heavy run; only do this with tasks that avoid the scraper-backed paths:

locust --headless --users 1000 --spawn-rate 50 -H http://127.0.0.1:8000/
