VISHNU VARDHAN REDDY

Posted on Jun 2 • Edited on Jun 4

ScrapeSome: Effortless Web Scraping for JavaScript Heavy Sites — The Developer Friendly Scraper That Just Works

#opensource #python #playwright #devtools

Tired of 403s and blank pages when scraping JavaScript-heavy websites?
Looking for one library that handles 403 errors and JavaScript rendering automatically?

You're not alone — and that's exactly why I built ScrapeSome.

🚀 What Is ScrapeSome?

ScrapeSome is a developer-friendly Python library that makes scraping modern websites simple — even the ones loaded with dynamic JavaScript or tough anti-bot protections.

It’s fast, lightweight, and requires zero boilerplate.

🔧 Why I Built It

I kept hitting walls on scraping projects:

Pages rendered everything with JavaScript
APIs were locked down or undocumented
requests and Scrapy calls failed or returned 403 errors
Setting up full browser automation felt too heavy for small jobs

So I built ScrapeSome — to fill the gap between requests and full-on headless scraping frameworks.

💡 Why Use ScrapeSome?

Handles both static and JS-heavy pages out of the box
Supports both sync and async scraping
Converts raw HTML into clean text, JSON, or Markdown
Works with minimal configuration (pip install scrapesome)
Handles timeouts, retries, redirects, user agents
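
To make the "clean text" conversion concrete, here is a minimal standard-library sketch of turning raw HTML into plain text. It is not ScrapeSome's implementation; the `html_to_text` helper and the `_TextExtractor` class are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    _SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Hypothetical helper: return the visible text of an HTML string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser._chunks)


sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))
```

A real converter would also handle whitespace between inline elements, entities in attributes, and Markdown or JSON output, which this sketch leaves out.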

🚀 Features

🔁 Sync + Async scraping support
🔄 Automatic retries and intelligent fallbacks
🧪 Playwright rendering fallback for JS-heavy pages
📝 Format responses as raw HTML, plain text, Markdown, or structured JSON
⚙️ Configurable: timeouts, redirects, user agents, and logging
🧪 Test coverage with pytest and pytest-asyncio
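
The retry-then-fallback idea from the list above can be sketched in a few lines. This is an illustration of the general pattern, not ScrapeSome's actual code: `fetch_with_fallback`, `fetch_static`, and `fetch_rendered` are hypothetical names, and the two fetchers are passed in as plain callables so the sketch runs without a network or a browser.

```python
import time
from typing import Callable


def fetch_with_fallback(
    url: str,
    fetch_static: Callable[[str], str],
    fetch_rendered: Callable[[str], str],
    retries: int = 2,
    delay: float = 0.0,
) -> str:
    """Try the cheap static fetcher first, then fall back to the rendering one.

    Both fetchers are hypothetical callables supplied by the caller. The
    static fetcher gets one attempt plus `retries` retries; it is retried
    when it raises or returns an empty body.
    """
    for attempt in range(retries + 1):
        try:
            body = fetch_static(url)
            if body.strip():
                return body
        except Exception:
            pass  # treat any failure (for example a 403) as a reason to retry
        if attempt < retries:
            time.sleep(delay * (2 ** attempt))  # exponential backoff
    return fetch_rendered(url)


# Offline demonstration with stand-in fetchers
calls = []


def blocked_fetch(url: str) -> str:
    calls.append("static")
    raise PermissionError("403 Forbidden")


def rendered_fetch(url: str) -> str:
    calls.append("rendered")
    return "<html><body>rendered</body></html>"


result = fetch_with_fallback("https://example.com", blocked_fetch, rendered_fetch)
print(calls)
print(result)
```

Injecting the fetchers keeps the control flow easy to test; in practice the static one would wrap an HTTP client and the rendered one a headless browser.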

⚖ Comparison with Alternatives

Feature	ScrapeSome ✅	Scrapy	Selenium/UC	Playwright (Raw)
✅ Sync + Async Scraping	✅ Built-in	❌ Async only*	❌ Manual	❌ Manual
🧠 JS Rendering (Fallback)	✅ Seamless	❌ Plugin setup	✅ Full	✅ Full
📝 Output as JSON/Markdown/HTML	✅ Built-in	❌ Requires custom	❌ Manual parsing	❌ Manual parsing
🔁 Retry & Timeout Handling	✅ Built-in	⚠️ Requires config	❌ Manual	❌ Manual
⚡ Minimal Setup (Boilerplate)	✅ Near zero	❌ Needs project	❌ Driver setup	❌ Browser install
🧪 Testable out-of-the-box	✅ Pytest-ready	⚠️ Complex	❌	❌
🛠️ Config via .env or inline	✅ Simple	⚠️ Complex	❌	❌
📦 Install & Run in <1 Min	✅ Yes	❌	❌	❌

📦 Installation

pip install scrapesome

Playwright Setup

ScrapeSome uses Playwright for JavaScript rendering fallback. To enable this, you need to install Playwright and its dependencies.

1. Install the Playwright Python package (if it is not already installed)

pip install playwright

2. Install Playwright browsers

playwright install

3. Install system dependencies

Playwright requires some system libraries to run browsers, which vary by operating system.

For Windows
Playwright installs everything you need automatically with playwright install, so no additional setup is usually required.

For Linux (Ubuntu/Debian)
Run the following command to install required system libraries:

playwright install-deps

If the playwright CLI is not available, you can install the dependencies manually:

sudo apt-get update
sudo apt-get install -y libwoff1 libopus0 libwebp6 libharfbuzz-icu0 libwebpmux3 \
                        libenchant-2-2 libhyphen0 libegl1 libglx0 libgudev-1.0-0 \
                        libevdev2 libgles2 libx264-160

Note: Package names may vary depending on your distribution and version.

For macOS
You can install required libraries using Homebrew:

brew install harfbuzz enchant

After this setup, you should be able to use ScrapeSome with full Playwright rendering support!
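
As a quick sanity check after the steps above, a small script along these lines can confirm that the Playwright package and its CLI are visible to your Python environment. The `playwright_status` function is a hypothetical helper, and the check is deliberately shallow: it does not verify that the browsers themselves were downloaded.

```python
import importlib.util
import shutil


def playwright_status() -> dict:
    """Report whether the Playwright package and CLI can be found."""
    return {
        # True when `import playwright` would succeed in this environment
        "package_installed": importlib.util.find_spec("playwright") is not None,
        # True when a `playwright` executable is on PATH
        "cli_on_path": shutil.which("playwright") is not None,
    }


status = playwright_status()
print(status)
```

If `package_installed` is False, rerun step 1; if `cli_on_path` is False, the CLI may live in a virtual environment that is not currently activated.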

⚡ Quick Start

Synchronous Example

from scrapesome import sync_scraper

# Fetch the page and print the result
html = sync_scraper("https://example.com")
print(html)

Asynchronous Example

import asyncio
from scrapesome import async_scraper

# Run the coroutine to completion and print the result
html = asyncio.run(async_scraper("https://example.com"))
print(html)
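
The async entry point is most useful when scraping several pages at once. The sketch below shows one way to do that with `asyncio.gather` and a semaphore that caps concurrency. `scrape_many` and `fake_scraper` are hypothetical names; the scraper is passed in as an argument so the example runs offline without ScrapeSome installed.

```python
import asyncio
from typing import Awaitable, Callable


async def scrape_many(
    urls: list[str],
    scraper: Callable[[str], Awaitable[str]],
    max_concurrency: int = 5,
) -> dict[str, str]:
    """Scrape several URLs concurrently, at most `max_concurrency` at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def scrape_one(url: str) -> tuple[str, str]:
        async with semaphore:
            return url, await scraper(url)

    pairs = await asyncio.gather(*(scrape_one(url) for url in urls))
    return dict(pairs)


async def fake_scraper(url: str) -> str:
    """Stand-in for a real scraper so the sketch runs offline."""
    await asyncio.sleep(0)
    return f"<html>{url}</html>"


results = asyncio.run(
    scrape_many(["https://example.com/a", "https://example.com/b"], fake_scraper)
)
print(results)
```

With ScrapeSome installed, passing `async_scraper` in place of `fake_scraper` should work the same way, assuming it is called with just a URL as in the asynchronous example above.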

🖥️ CLI Usage

ScrapeSome also includes a powerful CLI for quick and easy scraping from the command line.

📦 Installation with CLI Support

To use the CLI, install with the optional cli extras:

pip install "scrapesome[cli]"

🔧 Basic Usage

scrapesome scrape --url https://example.com

This performs a synchronous scrape and prints the result in the default output format (see the options below).

⚙️ Available Options

Option	Description	Default
`--async-mode`	Use asynchronous scraping	False
`--force-playwright`	Force JavaScript rendering using Playwright	False
`--output-format`	Choose `text`, `json`, `markdown`, or `html`	html

Examples

Basic scrape

scrapesome scrape --url https://example.com

Force Playwright rendering

scrapesome scrape --url https://example.com --force-playwright

Get JSON output

scrapesome scrape --url https://example.com --output-format json

Async scrape with markdown output

scrapesome scrape --url https://example.com --async-mode --output-format markdown
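
If you want to drive the CLI from a Python script, one option is to build the argument list programmatically. The sketch below uses only the flags listed in the options table; `build_scrape_command` is a hypothetical helper, not part of ScrapeSome.

```python
import shlex


def build_scrape_command(url, output_format=None, async_mode=False, force_playwright=False):
    """Build the argv list for `scrapesome scrape` from the documented options."""
    allowed = {"text", "json", "markdown", "html"}
    if output_format is not None and output_format not in allowed:
        raise ValueError(f"unsupported output format: {output_format!r}")
    argv = ["scrapesome", "scrape", "--url", url]
    if async_mode:
        argv.append("--async-mode")
    if force_playwright:
        argv.append("--force-playwright")
    if output_format is not None:
        argv += ["--output-format", output_format]
    return argv


command = build_scrape_command(
    "https://example.com", output_format="markdown", async_mode=True
)
print(shlex.join(command))
# prints: scrapesome scrape --url https://example.com --async-mode --output-format markdown
```

The resulting list can be handed to `subprocess.run(command, capture_output=True, text=True, check=True)` once the CLI is installed.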

🧪 Try it out on PyPI:

👉 https://pypi.org/project/scrapesome/

🔗 Links

🔧 GitHub: github.com/scrapesome/scrapesome
📚 Docs: scrapesome.github.io

🙌 Feedback Welcome

This is an early release, and I’d love to hear your thoughts.

Try it, break it, file issues, suggest features — or just ⭐ the repo if you like the idea!

Happy scraping! 🕷️
