Sanfoundry Web Scraper

A robust, production-ready web scraper specifically designed for scraping educational content from Sanfoundry and similar websites. Built with Python, this scraper features comprehensive error handling, retry logic, rate limiting, and multiple export formats.

🌟 Features

  • Robust Scraping Engine

    • Advanced error handling and retry logic
    • Random user agent rotation
    • Rate limiting to prevent server overload
    • Request throttling with configurable delays
    • Automatic session recovery
  • Smart Parsing

    • Custom parser for Sanfoundry's structure (a parsing sketch follows this feature list)
    • Extracts questions, multiple-choice options, answers, and explanations
    • Handles pagination automatically
    • Category-based organization
  • Data Storage

    • SQLite database with optimized schema
    • Automatic deduplication
    • Session tracking
    • Failed URL logging for retry
  • Multiple Export Formats

    • JSON (structured data)
    • CSV (spreadsheet compatible)
    • Excel (.xlsx) with auto-formatted columns
    • PDF (formatted document)
    • Export by category option
  • Progress Tracking

    • Real-time progress bars
    • Detailed logging (console + file)
    • Session statistics
    • Success rate monitoring
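
A minimal sketch of the kind of extraction sanfoundry_parser.py performs is shown below. The container class and the question/option patterns here are assumptions about Sanfoundry's markup, not the exact selectors used in the project:

# parsing sketch -- illustrative only; the real selectors live in sanfoundry_parser.py
import re
from bs4 import BeautifulSoup

def parse_questions(html):
    """Extract question/option blocks from a page of MCQs (assumed layout)."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", class_="entry-content")  # assumed content container
    questions = []
    if content is None:
        return questions
    for para in content.find_all("p"):
        text = para.get_text("\n", strip=True)
        match = re.match(r"(\d+)\.\s+(.*)", text)  # lines like "1. What does ... ?"
        if not match:
            continue
        lines = text.split("\n")
        options = [line for line in lines if re.match(r"[a-d]\)", line)]
        questions.append({
            "question_number": int(match.group(1)),
            "question_text": match.group(2),
            "options": options,
        })
    return questions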

📋 Requirements

  • Python 3.8+
  • Internet connection
  • ~100MB disk space (for dependencies and data)

🚀 Installation

1. Clone or Download the Project

git clone https://github.com/realshubhamraut/web-scraper.git
cd web-scraper

2. Create Virtual Environment (Recommended)

python3 -m venv venv
source venv/bin/activate  # On macOS/Linux
# OR
venv\Scripts\activate  # On Windows

3. Install Dependencies

pip install -r requirements.txt

4. Configure Environment (Optional)

cp .env.example .env
# Edit .env to customize settings

💻 Usage

Basic Usage - Scrape Python Questions

python main.py

This will scrape the Python questions from Sanfoundry by default.

Scrape Specific Categories

# Scrape Python questions
python main.py --categories python

# Scrape multiple categories
python main.py --categories python java cpp

# Scrape all available categories
python main.py --categories all

Scrape a Single URL

python main.py --url "https://www.sanfoundry.com/1000-python-questions-answers/"

Export Options

# Export to JSON and CSV (default)
python main.py --categories python --export json csv

# Export to all formats
python main.py --categories python --export json csv xlsx pdf

# Export by category (separate files for each category)
python main.py --categories python java --export json csv --export-by-category

# Only export existing data without scraping
python main.py --no-scrape --export json csv xlsx pdf

Advanced Examples

# Scrape Data Structures questions and export to Excel and PDF
python main.py --categories data_structures --export xlsx pdf

# Scrape C and C++ questions, export by category to all formats
python main.py --categories c cpp --export json csv xlsx pdf --export-by-category

# Scrape a custom URL directly (here, the Java questions listing)
python main.py --url "https://www.sanfoundry.com/1000-java-questions-answers/"

πŸ“ Project Structure

web-scraper/
├── main.py                 # Main orchestrator and CLI entry point
├── scraper.py              # Core scraping engine with retry logic
├── sanfoundry_parser.py    # Sanfoundry-specific HTML parser
├── database.py             # SQLite database management
├── exporter.py             # Data export to multiple formats
├── utils.py                # Utility functions and logging
├── config.py               # Configuration and constants
├── requirements.txt        # Python dependencies
├── .env.example            # Environment variables template
├── .gitignore              # Git ignore rules
├── README.md               # This file
├── data/                   # Database and cached data
│   └── scraper.db          # SQLite database
├── logs/                   # Log files
│   └── scraper.log         # Application logs
└── exports/                # Exported data files
    ├── *.json
    ├── *.csv
    ├── *.xlsx
    └── *.pdf

βš™οΈ Configuration

Edit the .env file to customize settings:

# Scraper Configuration
SCRAPER_DELAY_MIN=2          # Minimum delay between requests (seconds)
SCRAPER_DELAY_MAX=5          # Maximum delay between requests (seconds)
MAX_RETRIES=3                # Maximum retry attempts for failed requests
TIMEOUT=30                   # Request timeout (seconds)
CONCURRENT_REQUESTS=3        # Number of concurrent requests

# Database
DB_PATH=data/scraper.db      # SQLite database path

# Logging
LOG_LEVEL=INFO               # Logging level (DEBUG, INFO, WARNING, ERROR)
LOG_FILE=logs/scraper.log    # Log file path

# Export
EXPORT_DIR=exports           # Export directory

# Browser (if using Selenium/Playwright in future)
HEADLESS=True                # Run browser in headless mode
BROWSER=chrome               # Browser to use
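
For reference, if config.py reads these values with python-dotenv (an assumption; the project may load them differently), the wiring could look roughly like this:

# config.py style sketch -- assumes python-dotenv; keys mirror the .env block above
import os
from dotenv import load_dotenv

load_dotenv()  # read .env from the project root, if present

SCRAPER_DELAY_MIN = float(os.getenv("SCRAPER_DELAY_MIN", 2))
SCRAPER_DELAY_MAX = float(os.getenv("SCRAPER_DELAY_MAX", 5))
MAX_RETRIES = int(os.getenv("MAX_RETRIES", 3))
TIMEOUT = int(os.getenv("TIMEOUT", 30))
CONCURRENT_REQUESTS = int(os.getenv("CONCURRENT_REQUESTS", 3))
DB_PATH = os.getenv("DB_PATH", "data/scraper.db")
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
LOG_FILE = os.getenv("LOG_FILE", "logs/scraper.log")
EXPORT_DIR = os.getenv("EXPORT_DIR", "exports")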

📊 Database Schema

Questions Table

  • id: Primary key
  • question_number: Question number
  • category: Category/topic name
  • source_url: Source page URL
  • question_text: Question content
  • options: JSON string of multiple-choice options
  • correct_answer: Correct answer
  • explanation: Answer explanation
  • difficulty: Difficulty level
  • scraped_at: Timestamp

Sessions Table

  • Tracks scraping sessions
  • Statistics (pages scraped, success rate, etc.)
  • Session metadata

Failed URLs Table

  • Logs failed URLs for retry
  • Error messages
  • Attempt counts
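
As a rough illustration, the questions table implied by the fields above could be created like this in SQLite (a sketch only; the actual DDL lives in database.py, and the UNIQUE constraint shown for deduplication is an assumption):

# schema sketch -- illustrative; database.py defines the real schema
import os
import sqlite3

os.makedirs("data", exist_ok=True)
conn = sqlite3.connect("data/scraper.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS questions (
        id              INTEGER PRIMARY KEY AUTOINCREMENT,
        question_number INTEGER,
        category        TEXT,
        source_url      TEXT,
        question_text   TEXT,
        options         TEXT,   -- JSON string of multiple-choice options
        correct_answer  TEXT,
        explanation     TEXT,
        difficulty      TEXT,
        scraped_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        UNIQUE (category, question_text)   -- assumed deduplication key
    )
""")
conn.commit()
conn.close()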

🎯 Available Categories

The scraper comes pre-configured with popular Sanfoundry categories:

  • python - 1000 Python Questions & Answers
  • java - 1000 Java Questions & Answers
  • c - 1000 C Questions & Answers
  • cpp - 1000 C++ Questions & Answers
  • data_structures - 1000 Data Structure Questions & Answers

You can easily add more categories in config.py.
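
The category keys presumably map to listing URLs somewhere in config.py; adding a new category would then be a matter of adding an entry along these lines (the variable name CATEGORIES and the exact shape of the mapping are assumptions):

# config.py excerpt-style sketch -- category key to listing URL (assumed shape)
CATEGORIES = {
    "python": "https://www.sanfoundry.com/1000-python-questions-answers/",
    "java": "https://www.sanfoundry.com/1000-java-questions-answers/",
    # add further categories here, e.g. "my_topic": "<listing URL>",
}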

πŸ›‘οΈ Error Handling

The scraper includes comprehensive error handling:

  • Automatic Retries: Failed requests are retried up to 3 times with exponential backoff
  • Rate Limiting: Prevents overwhelming the server (10 requests per 60 seconds)
  • Random Delays: Random delays between requests (2-5 seconds)
  • User Agent Rotation: Rotates user agents to avoid detection
  • Failed URL Logging: Logs failed URLs in database for manual retry
  • Session Recovery: Can resume from where it left off
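
Taken together, the retry, delay, and user-agent behaviour boils down to a pattern like this simplified sketch (using requests; not the exact code in scraper.py):

# retry sketch -- simplified version of the behaviour described above
import random
import time

import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]  # rotated per request; the real list is larger

def fetch(url, max_retries=3, timeout=30):
    """GET a page with random delays, rotating user agents, and exponential backoff."""
    for attempt in range(1, max_retries + 1):
        try:
            time.sleep(random.uniform(2, 5))  # random delay between requests
            headers = {"User-Agent": random.choice(USER_AGENTS)}
            response = requests.get(url, headers=headers, timeout=timeout)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            if attempt == max_retries:
                raise  # caller records the URL in the failed-URLs table
            time.sleep(2 ** attempt)  # exponential backoff: 2s, 4s, 8s, ...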

📈 Logging

Logs are written to both console and file:

  • Console: Colored, formatted output with progress bars
  • File: Detailed logs in logs/scraper.log
  • Levels: DEBUG, INFO, WARNING, ERROR
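
A dual console/file setup of this kind is usually wired with the standard logging module, roughly as follows (a generic sketch, not necessarily the exact code in utils.py):

# logging sketch -- console + file handlers with independent levels
import logging
import os

os.makedirs("logs", exist_ok=True)

logger = logging.getLogger("scraper")
logger.setLevel(logging.DEBUG)  # handlers filter from here

console = logging.StreamHandler()
console.setLevel(logging.INFO)

file_handler = logging.FileHandler("logs/scraper.log")
file_handler.setLevel(logging.DEBUG)
file_handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))

logger.addHandler(console)
logger.addHandler(file_handler)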

🔧 Troubleshooting

Import Errors

If you get import errors, make sure all dependencies are installed:

pip install -r requirements.txt --upgrade

Connection Errors

If you encounter connection errors:

  • Check your internet connection
  • Increase timeout in .env: TIMEOUT=60
  • Increase delays: SCRAPER_DELAY_MIN=5, SCRAPER_DELAY_MAX=10

No Questions Extracted

If no questions are extracted:

  • The page structure might have changed
  • Check logs for detailed error messages
  • Try with --url option to test a single page
  • Set LOG_LEVEL=DEBUG in .env for detailed parsing logs

Rate Limited

If you're being rate limited:

  • Increase delays in .env
  • Reduce CONCURRENT_REQUESTS
  • Add longer delays: SCRAPER_DELAY_MIN=5, SCRAPER_DELAY_MAX=10

🧪 Testing

To test the scraper with a single page:

python main.py --url "https://www.sanfoundry.com/1000-python-questions-answers/" --export json

πŸ“ License

This project is for educational purposes only. Please respect the terms of service of the websites you scrape.

⚠️ Disclaimer

This scraper is designed for educational purposes. When using this tool:

  • Respect the website's robots.txt file
  • Don't overload servers with too many requests
  • Use reasonable delays between requests
  • Comply with the website's terms of service
  • Use scraped data responsibly and ethically

🤝 Contributing

Contributions are welcome! Feel free to:

  • Report bugs
  • Suggest new features
  • Submit pull requests
  • Improve documentation

📧 Support

For issues, questions, or suggestions, please:

  1. Check the troubleshooting section
  2. Review the logs in logs/scraper.log
  3. Open an issue with detailed information

🎉 Credits

Built with Python and open-source libraries; see requirements.txt for the full dependency list.

Happy Scraping! 🚀
