Sanfoundry Web Scraper

A robust, production-ready web scraper specifically designed for scraping educational content from Sanfoundry and similar websites. Built with Python, this scraper features comprehensive error handling, retry logic, rate limiting, and multiple export formats.

🌟 Features

  • Robust Scraping Engine

    • Advanced error handling and retry logic
    • Random user agent rotation
    • Rate limiting to prevent server overload
    • Request throttling with configurable delays
    • Automatic session recovery
  • Smart Parsing

    • Custom parser for Sanfoundry's structure (a parsing sketch follows this feature list)
    • Extracts questions, multiple-choice options, answers, and explanations
    • Handles pagination automatically
    • Category-based organization
  • Data Storage

    • SQLite database with optimized schema
    • Automatic deduplication
    • Session tracking
    • Failed URL logging for retry
  • Multiple Export Formats

    • JSON (structured data)
    • CSV (spreadsheet compatible)
    • Excel (.xlsx) with auto-formatted columns
    • PDF (formatted document)
    • Export by category option
  • Progress Tracking

    • Real-time progress bars
    • Detailed logging (console + file)
    • Session statistics
    • Success rate monitoring
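
A minimal sketch of the kind of extraction sanfoundry_parser.py performs is shown below. The container class and the question/option patterns here are assumptions about Sanfoundry's markup, not the exact selectors used in the project:

# parsing sketch -- illustrative only; the real selectors live in sanfoundry_parser.py
import re
from bs4 import BeautifulSoup

def parse_questions(html):
    """Extract question/option blocks from a page of MCQs (assumed layout)."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", class_="entry-content")  # assumed content container
    questions = []
    if content is None:
        return questions
    for para in content.find_all("p"):
        text = para.get_text("\n", strip=True)
        match = re.match(r"(\d+)\.\s+(.*)", text)  # lines like "1. What does ... ?"
        if not match:
            continue
        lines = text.split("\n")
        options = [line for line in lines if re.match(r"[a-d]\)", line)]
        questions.append({
            "question_number": int(match.group(1)),
            "question_text": match.group(2),
            "options": options,
        })
    return questions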

📋 Requirements

  • Python 3.8+
  • Internet connection
  • ~100MB disk space (for dependencies and data)

🚀 Installation

1. Clone or Download the Project

git clone https://github.com/realshubhamraut/web-scraper.git
cd web-scraper

2. Create Virtual Environment (Recommended)

python3 -m venv venv
source venv/bin/activate  # On macOS/Linux
# OR
venv\Scripts\activate  # On Windows

3. Install Dependencies

pip install -r requirements.txt

4. Configure Environment (Optional)

cp .env.example .env
# Edit .env to customize settings

💻 Usage

Basic Usage - Scrape Python Questions

python main.py

This will scrape the Python questions from Sanfoundry by default.

Scrape Specific Categories

# Scrape Python questions
python main.py --categories python

# Scrape multiple categories
python main.py --categories python java cpp

# Scrape all available categories
python main.py --categories all

Scrape a Single URL

python main.py --url "https://www.sanfoundry.com/1000-python-questions-answers/"

Export Options

# Export to JSON and CSV (default)
python main.py --categories python --export json csv

# Export to all formats
python main.py --categories python --export json csv xlsx pdf

# Export by category (separate files for each category)
python main.py --categories python java --export json csv --export-by-category

# Only export existing data without scraping
python main.py --no-scrape --export json csv xlsx pdf

Advanced Examples

# Scrape Data Structures questions and export to Excel and PDF
python main.py --categories data_structures --export xlsx pdf

# Scrape C and C++ questions, export by category to all formats
python main.py --categories c cpp --export json csv xlsx pdf --export-by-category

# Scrape a custom URL directly (here, the Java questions listing)
python main.py --url "https://www.sanfoundry.com/1000-java-questions-answers/"

πŸ“ Project Structure

web-scraper/
├── main.py                 # Main orchestrator and CLI entry point
├── scraper.py              # Core scraping engine with retry logic
├── sanfoundry_parser.py    # Sanfoundry-specific HTML parser
├── database.py             # SQLite database management
├── exporter.py             # Data export to multiple formats
├── utils.py                # Utility functions and logging
├── config.py               # Configuration and constants
├── requirements.txt        # Python dependencies
├── .env.example            # Environment variables template
├── .gitignore              # Git ignore rules
├── README.md               # This file
├── data/                   # Database and cached data
│   └── scraper.db          # SQLite database
├── logs/                   # Log files
│   └── scraper.log         # Application logs
└── exports/                # Exported data files
    ├── *.json
    ├── *.csv
    ├── *.xlsx
    └── *.pdf

βš™οΈ Configuration

Edit the .env file to customize settings:

# Scraper Configuration
SCRAPER_DELAY_MIN=2          # Minimum delay between requests (seconds)
SCRAPER_DELAY_MAX=5          # Maximum delay between requests (seconds)
MAX_RETRIES=3                # Maximum retry attempts for failed requests
TIMEOUT=30                   # Request timeout (seconds)
CONCURRENT_REQUESTS=3        # Number of concurrent requests

# Database
DB_PATH=data/scraper.db      # SQLite database path

# Logging
LOG_LEVEL=INFO               # Logging level (DEBUG, INFO, WARNING, ERROR)
LOG_FILE=logs/scraper.log    # Log file path

# Export
EXPORT_DIR=exports           # Export directory

# Browser (if using Selenium/Playwright in future)
HEADLESS=True                # Run browser in headless mode
BROWSER=chrome               # Browser to use
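
For reference, if config.py reads these values with python-dotenv (an assumption; the project may load them differently), the wiring could look roughly like this:

# config.py style sketch -- assumes python-dotenv; keys mirror the .env block above
import os
from dotenv import load_dotenv

load_dotenv()  # read .env from the project root, if present

SCRAPER_DELAY_MIN = float(os.getenv("SCRAPER_DELAY_MIN", 2))
SCRAPER_DELAY_MAX = float(os.getenv("SCRAPER_DELAY_MAX", 5))
MAX_RETRIES = int(os.getenv("MAX_RETRIES", 3))
TIMEOUT = int(os.getenv("TIMEOUT", 30))
CONCURRENT_REQUESTS = int(os.getenv("CONCURRENT_REQUESTS", 3))
DB_PATH = os.getenv("DB_PATH", "data/scraper.db")
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
LOG_FILE = os.getenv("LOG_FILE", "logs/scraper.log")
EXPORT_DIR = os.getenv("EXPORT_DIR", "exports")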

📊 Database Schema

Questions Table

  • id: Primary key
  • question_number: Question number
  • category: Category/topic name
  • source_url: Source page URL
  • question_text: Question content
  • options: JSON string of multiple-choice options
  • correct_answer: Correct answer
  • explanation: Answer explanation
  • difficulty: Difficulty level
  • scraped_at: Timestamp

Sessions Table

  • Tracks scraping sessions
  • Statistics (pages scraped, success rate, etc.)
  • Session metadata

Failed URLs Table

  • Logs failed URLs for retry
  • Error messages
  • Attempt counts
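
As a rough illustration, the questions table implied by the fields above could be created like this in SQLite (a sketch only; the actual DDL lives in database.py, and the UNIQUE constraint shown for deduplication is an assumption):

# schema sketch -- illustrative; database.py defines the real schema
import os
import sqlite3

os.makedirs("data", exist_ok=True)
conn = sqlite3.connect("data/scraper.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS questions (
        id              INTEGER PRIMARY KEY AUTOINCREMENT,
        question_number INTEGER,
        category        TEXT,
        source_url      TEXT,
        question_text   TEXT,
        options         TEXT,   -- JSON string of multiple-choice options
        correct_answer  TEXT,
        explanation     TEXT,
        difficulty      TEXT,
        scraped_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        UNIQUE (category, question_text)   -- assumed deduplication key
    )
""")
conn.commit()
conn.close()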

🎯 Available Categories

The scraper comes pre-configured with popular Sanfoundry categories:

  • python - 1000 Python Questions & Answers
  • java - 1000 Java Questions & Answers
  • c - 1000 C Questions & Answers
  • cpp - 1000 C++ Questions & Answers
  • data_structures - 1000 Data Structure Questions & Answers

You can easily add more categories in config.py.
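
The category keys presumably map to listing URLs somewhere in config.py; adding a new category would then be a matter of adding an entry along these lines (the variable name CATEGORIES and the exact shape of the mapping are assumptions):

# config.py excerpt-style sketch -- category key to listing URL (assumed shape)
CATEGORIES = {
    "python": "https://www.sanfoundry.com/1000-python-questions-answers/",
    "java": "https://www.sanfoundry.com/1000-java-questions-answers/",
    # add further categories here, e.g. "my_topic": "<listing URL>",
}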

πŸ›‘οΈ Error Handling

The scraper includes comprehensive error handling:

  • Automatic Retries: Failed requests are retried up to 3 times with exponential backoff
  • Rate Limiting: Prevents overwhelming the server (10 requests per 60 seconds)
  • Random Delays: Random delays between requests (2-5 seconds)
  • User Agent Rotation: Rotates user agents to avoid detection
  • Failed URL Logging: Logs failed URLs in database for manual retry
  • Session Recovery: Can resume from where it left off
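
Taken together, the retry, delay, and user-agent behaviour boils down to a pattern like this simplified sketch (using requests; not the exact code in scraper.py):

# retry sketch -- simplified version of the behaviour described above
import random
import time

import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]  # rotated per request; the real list is larger

def fetch(url, max_retries=3, timeout=30):
    """GET a page with random delays, rotating user agents, and exponential backoff."""
    for attempt in range(1, max_retries + 1):
        try:
            time.sleep(random.uniform(2, 5))  # random delay between requests
            headers = {"User-Agent": random.choice(USER_AGENTS)}
            response = requests.get(url, headers=headers, timeout=timeout)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            if attempt == max_retries:
                raise  # caller records the URL in the failed-URLs table
            time.sleep(2 ** attempt)  # exponential backoff: 2s, 4s, 8s, ...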

📈 Logging

Logs are written to both console and file:

  • Console: Colored, formatted output with progress bars
  • File: Detailed logs in logs/scraper.log
  • Levels: DEBUG, INFO, WARNING, ERROR
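
A dual console/file setup of this kind is usually wired with the standard logging module, roughly as follows (a generic sketch, not necessarily the exact code in utils.py):

# logging sketch -- console + file handlers with independent levels
import logging
import os

os.makedirs("logs", exist_ok=True)

logger = logging.getLogger("scraper")
logger.setLevel(logging.DEBUG)  # handlers filter from here

console = logging.StreamHandler()
console.setLevel(logging.INFO)

file_handler = logging.FileHandler("logs/scraper.log")
file_handler.setLevel(logging.DEBUG)
file_handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))

logger.addHandler(console)
logger.addHandler(file_handler)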

🔧 Troubleshooting

Import Errors

If you get import errors, make sure all dependencies are installed:

pip install -r requirements.txt --upgrade

Connection Errors

If you encounter connection errors:

  • Check your internet connection
  • Increase timeout in .env: TIMEOUT=60
  • Increase delays: SCRAPER_DELAY_MIN=5, SCRAPER_DELAY_MAX=10

No Questions Extracted

If no questions are extracted:

  • The page structure might have changed
  • Check logs for detailed error messages
  • Try with --url option to test a single page
  • Set LOG_LEVEL=DEBUG in .env for detailed parsing logs

Rate Limited

If you're being rate limited:

  • Increase delays in .env
  • Reduce CONCURRENT_REQUESTS
  • Add longer delays: SCRAPER_DELAY_MIN=5, SCRAPER_DELAY_MAX=10

🧪 Testing

To test the scraper with a single page:

python main.py --url "https://www.sanfoundry.com/1000-python-questions-answers/" --export json

πŸ“ License

This project is for educational purposes only. Please respect the terms of service of the websites you scrape.

⚠️ Disclaimer

This scraper is designed for educational purposes. When using this tool:

  • Respect the website's robots.txt file
  • Don't overload servers with too many requests
  • Use reasonable delays between requests
  • Comply with the website's terms of service
  • Use scraped data responsibly and ethically

🤝 Contributing

Contributions are welcome! Feel free to:

  • Report bugs
  • Suggest new features
  • Submit pull requests
  • Improve documentation

📧 Support

For issues, questions, or suggestions, please:

  1. Check the troubleshooting section
  2. Review the logs in logs/scraper.log
  3. Open an issue with detailed information

🎉 Credits

Built with Python and open-source libraries; see requirements.txt for the full dependency list.

Happy Scraping! 🚀
