A robust, production-ready web scraper for educational content on Sanfoundry and similar websites. Built with Python, it features comprehensive error handling, retry logic, rate limiting, and multiple export formats.
## Features

### Robust Scraping Engine
- Advanced error handling and retry logic
- Random user agent rotation
- Rate limiting to prevent server overload
- Request throttling with configurable delays
- Automatic session recovery
### Smart Parsing
- Custom parser for Sanfoundry's structure
- Extracts questions, multiple-choice options, answers, and explanations
- Handles pagination automatically
- Category-based organization
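
The real selectors live in `sanfoundry_parser.py` and track Sanfoundry's current markup; as a rough sketch of the parsing approach (the `entry-content` container and numbering heuristic are assumptions, not the actual implementation):

```python
from __future__ import annotations

from bs4 import BeautifulSoup

def parse_questions(html: str) -> list[dict]:
    """Sketch: pull numbered question blocks out of a page.

    Assumes a WordPress-style "entry-content" container; the real
    selectors in sanfoundry_parser.py may differ.
    """
    soup = BeautifulSoup(html, "html.parser")
    questions = []
    content = soup.find("div", class_="entry-content")
    if content is None:
        return questions
    for para in content.find_all("p"):
        text = para.get_text(" ", strip=True)
        # Questions are typically numbered like "1. What is ...?"
        if text[:1].isdigit() and "." in text[:4]:
            questions.append({"question_text": text, "options": [], "answer": None})
    return questions
```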
### Data Storage
- SQLite database with optimized schema
- Automatic deduplication
- Session tracking
- Failed URL logging for retry
### Multiple Export Formats
- JSON (structured data)
- CSV (spreadsheet compatible)
- Excel (.xlsx) with auto-formatted columns
- PDF (formatted document)
- Export by category option
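
`exporter.py` implements the fan-out to these formats. A minimal sketch of the idea using pandas (the function name and signature here are assumptions, not the actual API):

```python
from __future__ import annotations

import pandas as pd

def export_questions(questions: list[dict], out_stem: str, formats: list[str]) -> None:
    """Sketch: write one DataFrame of questions to several formats."""
    df = pd.DataFrame(questions)
    if "json" in formats:
        df.to_json(f"{out_stem}.json", orient="records", indent=2)
    if "csv" in formats:
        df.to_csv(f"{out_stem}.csv", index=False)
    if "xlsx" in formats:
        # Requires openpyxl; the real exporter also auto-formats columns.
        df.to_excel(f"{out_stem}.xlsx", index=False)
    # PDF export goes through ReportLab and is more involved; see exporter.py.
```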
### Progress Tracking
- Real-time progress bars
- Detailed logging (console + file)
- Session statistics
- Success rate monitoring
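
Progress reporting follows the usual tqdm pattern; a minimal sketch with a stand-in page function (names and URLs are placeholders):

```python
from tqdm import tqdm

def scrape_page(url: str) -> bool:
    """Placeholder for the real per-page scrape; True on success."""
    return True

urls = [f"https://example.com/page/{i}" for i in range(1, 6)]
succeeded = sum(scrape_page(url) for url in tqdm(urls, desc="Scraping", unit="page"))
print(f"Success rate: {succeeded / len(urls):.0%}")
```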
## Requirements

- Python 3.8+
- Internet connection
- ~100MB disk space (for dependencies and data)
## Installation

```bash
# Enter the project directory
cd /Users/proxim/PROXIM/PROJECTS/web-scraper

# Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate   # On macOS/Linux
# OR
venv\Scripts\activate      # On Windows

# Install dependencies
pip install -r requirements.txt

# Create your configuration file
cp .env.example .env
# Edit .env to customize settings
```

## Quick Start

```bash
python main.py
```

This will scrape the Python questions from Sanfoundry by default.
## Usage

### Scrape by Category

```bash
# Scrape Python questions
python main.py --categories python

# Scrape multiple categories
python main.py --categories python java cpp

# Scrape all available categories
python main.py --categories all
```

### Scrape a Custom URL

```bash
python main.py --url "https://www.sanfoundry.com/1000-python-questions-answers/"
```

### Export Options

```bash
# Export to JSON and CSV (default)
python main.py --categories python --export json csv
# Export to all formats
python main.py --categories python --export json csv xlsx pdf
# Export by category (separate files for each category)
python main.py --categories python java --export json csv --export-by-category
# Only export existing data without scraping
python main.py --no-scrape --export json csv xlsx pdf
```

### Examples

```bash
# Scrape Data Structures questions and export to Excel and PDF
python main.py --categories data_structures --export xlsx pdf
# Scrape C and C++ questions, export by category to all formats
python main.py --categories c cpp --export json csv xlsx pdf --export-by-category
# Scrape custom URL with specific category name
python main.py --url "https://www.sanfoundry.com/1000-java-questions-answers/"
```

## Project Structure

```
web-scraper/
├── main.py                  # Main orchestrator and CLI entry point
├── scraper.py               # Core scraping engine with retry logic
├── sanfoundry_parser.py     # Sanfoundry-specific HTML parser
├── database.py              # SQLite database management
├── exporter.py              # Data export to multiple formats
├── utils.py                 # Utility functions and logging
├── config.py                # Configuration and constants
├── requirements.txt         # Python dependencies
├── .env.example             # Environment variables template
├── .gitignore               # Git ignore rules
├── README.md                # This file
├── data/                    # Database and cached data
│   └── scraper.db           # SQLite database
├── logs/                    # Log files
│   └── scraper.log          # Application logs
└── exports/                 # Exported data files
    ├── *.json
    ├── *.csv
    ├── *.xlsx
    └── *.pdf
```
## Configuration

Edit the `.env` file to customize settings:

```bash
# Scraper Configuration
SCRAPER_DELAY_MIN=2 # Minimum delay between requests (seconds)
SCRAPER_DELAY_MAX=5 # Maximum delay between requests (seconds)
MAX_RETRIES=3 # Maximum retry attempts for failed requests
TIMEOUT=30 # Request timeout (seconds)
CONCURRENT_REQUESTS=3 # Number of concurrent requests
# Database
DB_PATH=data/scraper.db # SQLite database path
# Logging
LOG_LEVEL=INFO # Logging level (DEBUG, INFO, WARNING, ERROR)
LOG_FILE=logs/scraper.log # Log file path
# Export
EXPORT_DIR=exports # Export directory
# Browser (if using Selenium/Playwright in future)
HEADLESS=True # Run browser in headless mode
BROWSER=chrome # Browser to use
```

## Database Schema

### Questions Table

- `id`: Primary key
- `question_number`: Question number
- `category`: Category/topic name
- `source_url`: Source page URL
- `question_text`: Question content
- `options`: JSON string of multiple-choice options
- `correct_answer`: Correct answer
- `explanation`: Answer explanation
- `difficulty`: Difficulty level
- `scraped_at`: Timestamp
### Sessions Table

- Tracks scraping sessions
- Statistics (pages scraped, success rate, etc.)
- Session metadata
### Failed URLs Table

- Logs failed URLs for retry
- Error messages
- Attempt counts
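
For concreteness, the questions table could be declared as below. This is a sketch, not the schema actually shipped in `database.py`, and the UNIQUE constraint is just one plausible way the automatic deduplication might be enforced:

```python
import os
import sqlite3

# Illustrative schema only -- the actual definition lives in database.py.
SCHEMA = """
CREATE TABLE IF NOT EXISTS questions (
    id              INTEGER PRIMARY KEY,
    question_number INTEGER,
    category        TEXT,
    source_url      TEXT,
    question_text   TEXT NOT NULL,
    options         TEXT,       -- JSON string of multiple-choice options
    correct_answer  TEXT,
    explanation     TEXT,
    difficulty      TEXT,
    scraped_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (category, question_text)  -- one plausible deduplication key
);
"""

os.makedirs("data", exist_ok=True)
with sqlite3.connect("data/scraper.db") as conn:
    conn.executescript(SCHEMA)
```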
## Supported Categories

The scraper comes pre-configured with popular Sanfoundry categories:
- `python` - 1000 Python Questions & Answers
- `java` - 1000 Java Questions & Answers
- `c` - 1000 C Questions & Answers
- `cpp` - 1000 C++ Questions & Answers
- `data_structures` - 1000 Data Structure Questions & Answers

You can easily add more categories in `config.py`.
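
A plausible shape for that mapping (the dict name is an assumption about the real module; only the python and java URLs below appear elsewhere in this README):

```python
# config.py (sketch): map category keys to their Sanfoundry start pages.
# The variable name CATEGORIES is an assumption, not the actual identifier.
CATEGORIES = {
    "python": "https://www.sanfoundry.com/1000-python-questions-answers/",
    "java": "https://www.sanfoundry.com/1000-java-questions-answers/",
    # Add new entries here as "key": "start URL".
}
```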
## Error Handling

The scraper includes comprehensive error handling:
- Automatic Retries: Failed requests are retried up to 3 times with exponential backoff
- Rate Limiting: Prevents overwhelming the server (10 requests per 60 seconds)
- Random Delays: Random delays between requests (2-5 seconds)
- User Agent Rotation: Rotates user agents to avoid detection
- Failed URL Logging: Logs failed URLs in database for manual retry
- Session Recovery: Can resume from where it left off
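
Tenacity (listed under Acknowledgments) expresses the retry-with-backoff pattern directly. A minimal sketch combining it with the random 2-5 second delay and user-agent rotation described above (the user-agent strings and helper name are illustrative):

```python
import random
import time

import requests
from tenacity import retry, stop_after_attempt, wait_exponential

# Illustrative pool; the real rotation list is larger.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=30))
def fetch(url: str, timeout: int = 30) -> str:
    """Fetch a page; retried up to 3 times with exponential backoff."""
    time.sleep(random.uniform(2, 5))  # random delay between requests
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    resp = requests.get(url, headers=headers, timeout=timeout)
    resp.raise_for_status()  # treat HTTP errors as failures worth retrying
    return resp.text
```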
## Logging

Logs are written to both console and file:
- Console: Colored, formatted output with progress bars
- File: Detailed logs in `logs/scraper.log`
- Levels: DEBUG, INFO, WARNING, ERROR
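
The dual console/file output is standard `logging` configuration; a sketch of how `utils.py` might wire it up (details in the real module may differ):

```python
import logging
import os

def setup_logging(log_file: str = "logs/scraper.log", level: str = "INFO") -> logging.Logger:
    """Sketch: log to both the console and a file."""
    os.makedirs(os.path.dirname(log_file), exist_ok=True)
    logger = logging.getLogger("scraper")
    logger.setLevel(getattr(logging, level))
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    for handler in (logging.StreamHandler(), logging.FileHandler(log_file)):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger
```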
## Troubleshooting

### Import Errors

If you get import errors, make sure all dependencies are installed:

```bash
pip install -r requirements.txt --upgrade
```

### Connection Errors

If you encounter connection errors:
- Check your internet connection
- Increase the timeout in `.env`: `TIMEOUT=60`
- Increase delays: `SCRAPER_DELAY_MIN=5`, `SCRAPER_DELAY_MAX=10`
### No Questions Extracted

If no questions are extracted:
- The page structure might have changed
- Check logs for detailed error messages
- Try the `--url` option to test a single page
- Set `LOG_LEVEL=DEBUG` in `.env` for detailed parsing logs
### Rate Limiting

If you're being rate limited:
- Increase the delays in `.env`
- Reduce `CONCURRENT_REQUESTS`
- Add longer delays: `SCRAPER_DELAY_MIN=5`, `SCRAPER_DELAY_MAX=10`
## Testing

To test the scraper with a single page:

```bash
python main.py --url "https://www.sanfoundry.com/1000-python-questions-answers/" --export json
```

## Disclaimer

This project is for educational purposes only. Please respect the terms of service of the websites you scrape.
## Ethical Scraping

This scraper is designed for educational purposes. When using this tool:

- Respect the website's `robots.txt` file
- Don't overload servers with too many requests
- Use reasonable delays between requests
- Comply with the website's terms of service
- Use scraped data responsibly and ethically
## Contributing

Contributions are welcome! Feel free to:
- Report bugs
- Suggest new features
- Submit pull requests
- Improve documentation
## Support

For issues, questions, or suggestions, please:
- Check the troubleshooting section
- Review the logs in `logs/scraper.log`
- Open an issue with detailed information
## Acknowledgments

Built with:
- Requests - HTTP library
- BeautifulSoup4 - HTML parsing
- Pandas - Data manipulation
- ReportLab - PDF generation
- tqdm - Progress bars
- Tenacity - Retry logic
Happy Scraping!