
amolsr/web-scrapping

Web Scraping Collection 🕷️

A comprehensive collection of web scraping scripts for extracting data from popular websites. This project demonstrates various web scraping techniques using Python and provides ready-to-use scripts for data extraction.

🌟 Features

  • Multiple Website Support: Scrape data from 10+ popular websites
  • CSV Output: All scrapers export data in CSV format for easy analysis
  • Easy to Use: Simple Python scripts with clear documentation
  • Educational: Perfect for learning web scraping techniques
  • Open Source: Contribute and improve the collection

🚀 Quick Start

Prerequisites

pip install requests beautifulsoup4 lxml

Installation

  1. Clone the repository

    git clone https://github.com/amolsr/web-scrapping.git
    cd web-scrapping
  2. Run any scraper

    python scrapers/ecommerce/flipkart.py
  3. Check the output

    ls output/

📊 Sample Output

IMDB Top Movies

Rank,Name,Year,Rating,Link,Director
1,The Shawshank Redemption,1994,9.2,https://www.imdb.com/title/tt0111161/,Frank Darabont
2,The Godfather,1972,9.2,https://www.imdb.com/title/tt0068646/,Francis Ford Coppola

Flipkart Smartphones

Mobile Name,Ratings,Pricing,Description
Nokia 8.1,4.3,"₹15,999",6GB RAM | 128GB Storage
Nokia 6.1 Plus,4.2,"₹12,999",4GB RAM | 64GB Storage
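
Prices such as ₹15,999 contain a comma, so the field has to be quoted for each row to keep the four columns named in the header. The following is a minimal sketch, assuming rows are written with Python's built-in csv module (the actual scripts may do this differently); the values are copied from the sample above.

```python
import csv
import io

# Header and rows copied from the Flipkart sample output.
header = ["Mobile Name", "Ratings", "Pricing", "Description"]
rows = [
    ["Nokia 8.1", "4.3", "₹15,999", "6GB RAM | 128GB Storage"],
    ["Nokia 6.1 Plus", "4.2", "₹12,999", "4GB RAM | 64GB Storage"],
]

# Write to an in-memory buffer; csv.writer quotes any field that
# contains the delimiter, so the price survives as one column.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()

# Reading the text back yields four fields per row.
parsed = list(csv.reader(io.StringIO(csv_text, newline="")))
print(csv_text)
```

Building rows by joining strings with commas would split the price into two columns, which is why the sketch goes through csv.writer.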

πŸ› οΈ Usage Examples

Basic Usage

# Run a specific scraper
python scrapers/content/imdb.py

# The script will automatically:
# 1. Fetch data from the website
# 2. Parse the HTML content
# 3. Extract relevant information
# 4. Save to CSV file in the output/ directory
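
The four steps can be sketched roughly as follows. This is an illustrative outline, not the code of any script in the repository: the function names, the CSS selectors (div.item, .name, .rating), the column names, and the inline sample markup are all assumptions made for the example.

```python
import csv

import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Step 1: download the page and return its HTML text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def extract_rows(html):
    """Steps 2 and 3: parse the HTML and build one dict per item."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select("div.item"):  # hypothetical selector
        name = item.select_one(".name")
        rating = item.select_one(".rating")
        if name is None or rating is None:
            continue  # skip items with missing fields
        rows.append(
            {"Name": name.get_text(strip=True), "Rating": rating.get_text(strip=True)}
        )
    return rows


def save_csv(rows, path):
    """Step 4: write the rows to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["Name", "Rating"])
        writer.writeheader()
        writer.writerows(rows)


# Inline sample markup (made up) keeps the sketch runnable offline.
SAMPLE_HTML = """
<div class="item"><span class="name">Example A</span><span class="rating">9.2</span></div>
<div class="item"><span class="name">Example B</span></div>
<div class="item"><span class="name">Example C</span><span class="rating">8.7</span></div>
"""

rows = extract_rows(SAMPLE_HTML)
print(rows)
```

For a live page, the same functions would chain as save_csv(extract_rows(fetch_html(url)), path), with selectors adjusted to the target site's markup.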

Customization

Each script can be easily modified to:

  • Change the target URL
  • Extract different data fields
  • Modify the output format
  • Add error handling
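
As one example of the last item, error handling around the HTTP request might look like the sketch below. The helper name fetch_with_retries and its parameters are hypothetical and do not come from the repository; the sketch only relies on the Requests library's timeout argument, raise_for_status(), and RequestException.

```python
import time

import requests


def fetch_with_retries(url, retries=3, backoff_seconds=1.0, session=None):
    """Return the page text, retrying on network or HTTP errors.

    Re-raises the last requests.RequestException if every attempt fails.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    http = session or requests.Session()
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            response = http.get(url, timeout=10)
            response.raise_for_status()  # turn 4xx/5xx responses into exceptions
            return response.text
        except requests.RequestException as error:
            last_error = error
            if attempt < retries:
                # Wait a little longer after each failed attempt.
                time.sleep(backoff_seconds * attempt)
    raise last_error
```

Passing a session is optional; it mainly makes the helper easy to exercise with a stand-in object instead of a real network connection.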

πŸ“ Project Structure

web-scrapping/
├── scrapers/               # All scraper scripts organized by category
│   ├── ecommerce/          # E-commerce website scrapers
│   │   ├── flipkart.py     # Flipkart smartphone scraper
│   │   ├── amazon.py       # Amazon product scraper
│   │   └── olx.py          # OLX listings scraper
│   ├── job_boards/         # Job board scrapers
│   │   ├── indeed.py       # Indeed job listings
│   │   ├── naukri_jobs.py  # Naukri job listings
│   │   ├── apnajob.py      # ApnaJob listings
│   │   ├── jobhai.py       # JobHai listings
│   │   ├── welcome_to_the_jungle.py  # Welcome to the Jungle jobs
│   │   └── craigslist_jobs.py  # Craigslist jobs
│   ├── educational/        # Educational platform scrapers
│   │   ├── udemy.py        # Udemy course scraper
│   │   ├── sanfoundry.py   # Sanfoundry educational content
│   │   ├── college_notice_scraper.py  # College notices scraper
│   │   ├── javaguide.py    # Java Guide content
│   │   └── indiabix_networking.py  # IndiaBix networking Q&A
│   ├── social_media/       # Social media and developer platforms
│   │   ├── youtube.py      # YouTube video scraper
│   │   ├── youtube_links.py # YouTube links extractor
│   │   ├── reddit.py       # Reddit posts scraper
│   │   ├── hackernews.py   # Hacker News posts
│   │   ├── stack_overflow.py  # Stack Overflow questions
│   │   └── github.py       # GitHub repository scraper
│   ├── content/            # Content and media scrapers
│   │   ├── imdb.py         # IMDB top movies scraper
│   │   ├── books_toscrape.py  # Books.toscrape.com scraper
│   │   ├── quotes_toscrape.py  # Quotes to scrape
│   │   ├── wikipedia.py    # Wikipedia table scraper
│   │   └── openlibrary_books.py  # Open Library books
│   ├── misc/               # Miscellaneous scrapers
│   │   ├── coinmarketcap.py  # Cryptocurrency market data
│   │   ├── weather.py      # Weather information scraper
│   │   ├── craigslist_housing.py  # Craigslist housing
│   │   └── syntaxminds.py  # SyntaxMinds content
│   └── utils/              # Utility functions
│       └── __init__.py     # Helper functions for scrapers
├── output/                 # Generated CSV files
│   ├── flipkart_latest_smartphones.csv
│   ├── imdb.csv
│   ├── github.csv
│   └── ...
├── main.py                 # Main entry point
└── README.md               # This file

🔧 Dependencies

  • requests: HTTP library for making web requests
  • beautifulsoup4: HTML/XML parsing library
  • lxml: XML and HTML processing library
  • csv: Python standard-library module for CSV export (no installation needed)

🤝 Contributing

We welcome contributions! Here's how you can help:

  1. Fork the repository
  2. Create a new scraper or improve existing ones
  3. Add proper documentation and comments
  4. Test your changes
  5. Submit a pull request

Contribution Ideas

  • Add new website scrapers
  • Improve error handling
  • Add data validation
  • Create a web interface
  • Add support for different output formats (JSON, XML)
  • Implement rate limiting and respect robots.txt
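
As a starting point for the output-format idea, a CSV file produced by a scraper could be converted to JSON with the standard library alone. The sketch below is only an illustration: the function name csv_text_to_json is hypothetical, and the sample rows are taken from the IMDB sample output (first four columns).

```python
import csv
import io
import json


def csv_text_to_json(csv_text):
    """Convert CSV text with a header row into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    # ensure_ascii=False keeps characters such as the rupee sign readable.
    return json.dumps(list(reader), ensure_ascii=False, indent=2)


sample = (
    "Rank,Name,Year,Rating\n"
    "1,The Shawshank Redemption,1994,9.2\n"
    "2,The Godfather,1972,9.2\n"
)
records = json.loads(csv_text_to_json(sample))
print(records[0]["Name"])
```

All values stay strings, as they are in the CSV; converting ranks or ratings to numbers would be a separate validation step.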

⚠️ Important Notes

  • Respect robots.txt: Always check the website's robots.txt file
  • Rate Limiting: Add delays between requests to avoid overloading the target server
  • Terms of Service: Ensure you comply with each website's terms
  • Data Usage: Use scraped data responsibly and ethically
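
The first two notes can be combined in a small helper. The following is a sketch using only the standard library (urllib.robotparser and time); the user agent name, the inline robots.txt rules, and the function allowed_urls are assumptions made for illustration, not part of the repository.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-scraper"  # hypothetical user agent name

# Inline rules keep the sketch offline. For a live site, call
# parser.set_url(<site>/robots.txt) and parser.read() instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed_urls(urls, delay_seconds=None):
    """Yield only the URLs robots.txt permits, pausing between them."""
    if delay_seconds is None:
        # Use the site's Crawl-delay if it declares one, else 1 second.
        delay_seconds = parser.crawl_delay(USER_AGENT) or 1.0
    first = True
    for url in urls:
        if not parser.can_fetch(USER_AGENT, url):
            continue  # disallowed by robots.txt
        if not first:
            time.sleep(delay_seconds)
        first = False
        yield url
```

A scraper would iterate over allowed_urls(...) and fetch each yielded URL, so disallowed paths are skipped and consecutive requests are spaced out.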

πŸ“ License

This project is open source and available under the MIT License.

πŸ™ Acknowledgments

  • Beautiful Soup for HTML parsing
  • Requests library for HTTP handling
  • All contributors who help improve this collection

📞 Support

If you have questions or need help:

  • Open an issue on GitHub
  • Check the code comments for implementation details
  • Review the output files for the expected data format

Happy Scraping! 🕷️✨
