Intelligent web scraping with automatic strategy selection and TypeScript-first Apify Actor development.
This skill provides:
- Interactive reconnaissance - Hands-on site exploration using Playwright MCP & Chrome DevTools
- Proactive strategy discovery - Automatically checks for sitemaps and APIs
- Intelligent recommendations - Suggests optimal approach (sitemap/API/Playwright/hybrid)
- Iterative implementation - Starts simple, adds complexity only if needed
- Production-ready guidance - TypeScript-first Apify Actor development
Add this skill to Claude Code by placing this directory in the skills folder.
User: "Scrape https://example.com"
Claude will automatically:
1. Open site in browser (Playwright MCP) - observe loading behavior
2. Monitor network traffic (DevTools) - discover API endpoints
3. Test interactions - pagination, filters, dynamic content
4. Assess protections - Cloudflare, rate limits, fingerprinting
5. Check for sitemaps (/sitemap.xml, robots.txt)
6. Generate intelligence report with optimal strategy
7. Implement recommended approach iteratively
8. Test with small batch (5-10 items)
9. Scale to full dataset
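As a rough illustration of step 2, discovering API endpoints comes down to filtering the captured traffic for JSON responses to XHR/fetch calls. The sketch below assumes the traffic has already been exported as plain records; the `CapturedRequest` shape and the `findApiCandidates` helper are hypothetical and not part of Playwright MCP or Chrome DevTools.

```typescript
// Hypothetical record shape for one captured network request.
interface CapturedRequest {
  url: string;
  method: string;
  resourceType: string; // e.g. "xhr", "fetch", "document", "image"
  contentType?: string; // response Content-Type header, if known
}

// Keep XHR/fetch requests that returned JSON, deduplicated by method + path.
function findApiCandidates(requests: CapturedRequest[]): string[] {
  const seen = new Set<string>();
  for (const req of requests) {
    const isDataRequest =
      req.resourceType === "xhr" || req.resourceType === "fetch";
    const isJson = (req.contentType ?? "").toLowerCase().includes("json");
    if (!isDataRequest || !isJson) continue;
    // Ignore query strings so paginated calls collapse into one endpoint.
    const { pathname } = new URL(req.url);
    seen.add(`${req.method.toUpperCase()} ${pathname}`);
  }
  return Array.from(seen);
}
```

Query strings are dropped on purpose, so `?page=1` and `?page=2` are reported as one candidate endpoint.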
User: "Make this an Apify Actor"
Claude will:
1. Recommend TypeScript (strongly)
2. Guide through `apify create` command
3. Help choose appropriate template (Cheerio vs Playwright)
4. Port scraping logic to Actor format
5. Configure input schema
6. Test and deploy
web-scraping/
├── SKILL.md # Main entry point (proactive workflow)
├── workflows/ # Implementation patterns
│ ├── reconnaissance.md # Phase 1 interactive reconnaissance (CRITICAL)
│ ├── implementation.md # Phase 4 iterative implementation
│ └── productionization.md # Phase 5 Actor creation
├── strategies/ # Deep-dive guides
│ ├── sitemap-discovery.md # 60x faster URL discovery
│ ├── api-discovery.md # 10-100x faster than scraping
│ ├── playwright-scraping.md # Browser-based scraping
│ ├── cheerio-scraping.md # HTTP-only (5x faster)
│ └── hybrid-approaches.md # Combining strategies
├── examples/ # Runnable code
│ ├── sitemap-basic.js
│ ├── api-scraper.js
│ ├── hybrid-sitemap-api.js
│ ├── playwright-basic.js
│ └── iterative-fallback.js
├── reference/ # Quick lookup
│ ├── regex-patterns.md
│ ├── selector-guide.md
│ └── anti-patterns.md
├── apify/ # Production deployment
│ ├── typescript-first.md # Why TypeScript
│ ├── cli-workflow.md # apify create (CRITICAL)
│ ├── templates/ # TypeScript boilerplate
│ └── examples/ # Working actors
└── README.md # This file
This skill follows Anthropic's official best practices for skill development:
Pattern: Three-level loading system to manage context efficiently
- Level 1: YAML frontmatter (~85 tokens) - Always loaded
- Level 2: Main SKILL.md (~356 lines) - Loaded when skill invoked
- Level 3: Subdirectories - Loaded on-demand as needed
Result: 70-80% token reduction vs monolithic documentation
Source: skill-creator/SKILL.md
Pattern: Write instructions using verb-first commands, not second-person language
Examples:
- ✅ "Load this workflow when user requests"
- ✅ "Check for sitemaps automatically"
- ❌ "You should load this workflow"
- ❌ "You need to check for sitemaps"
Exception: Second-person is acceptable in user-facing prompts, code comments, and tutorial examples
Source: skill-creator/SKILL.md
Pattern: Concise, specific name and description that determine when Claude invokes the skill
Applied:
- `name: web-scraping` - Clear, hyphen-case identifier
- `description:` - Specific about activation triggers and capabilities (189 chars, optimized from 244)
Source: agent_skills_spec.md
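A minimal sketch of how the two metadata fields could be checked before publishing a skill. The `validateFrontmatter` helper is hypothetical, and the length limits are assumptions that should be verified against the current Agent Skills Specification.

```typescript
interface SkillFrontmatter {
  name: string;
  description: string;
}

// Assumed limits; confirm against the specification before relying on them.
const MAX_NAME_LENGTH = 64;
const MAX_DESCRIPTION_LENGTH = 1024;

// Returns a list of problems; an empty list means the metadata looks valid.
function validateFrontmatter(fm: SkillFrontmatter): string[] {
  const problems: string[] = [];
  // Hyphen-case: lowercase alphanumeric words joined by single hyphens.
  if (!/^[a-z0-9]+(-[a-z0-9]+)*$/.test(fm.name)) {
    problems.push("name must be hyphen-case");
  }
  if (fm.name.length > MAX_NAME_LENGTH) {
    problems.push(`name exceeds ${MAX_NAME_LENGTH} characters`);
  }
  if (fm.description.trim().length === 0) {
    problems.push("description must not be empty");
  }
  if (fm.description.length > MAX_DESCRIPTION_LENGTH) {
    problems.push(`description exceeds ${MAX_DESCRIPTION_LENGTH} characters`);
  }
  return problems;
}
```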
Pattern: Keep only essential procedural instructions in SKILL.md; move detailed information to subdirectories
Applied:
- SKILL.md: Core 4-phase workflow (~356 lines)
- workflows/: Detailed implementation patterns
- strategies/: Deep-dive guides
- examples/: Runnable code
- reference/: Quick lookup patterns
- apify/: Production deployment guides
Source: skill-creator/SKILL.md
Pattern: Separate executable code, documentation, and output resources
Applied:
- `examples/` - Executable JavaScript learning examples (like scripts/)
- `workflows/`, `strategies/`, `reference/`, `apify/` - Documentation loaded as needed (like references/)
- `apify/templates/`, `apify/examples/` - Boilerplate code and templates (like assets/)
Source: skill-creator/SKILL.md
Pattern: Create focused skills for specific purposes rather than one skill that does everything
Applied: This skill focuses specifically on web scraping and Apify Actor development, not general web development
Source: Anthropic Skills Best Practices
Pattern: Use clear, technical language focused on "what" and "how" rather than persuasive or promotional tone
Applied: Direct technical guidance throughout ("Check for sitemaps", "Implement iteratively") vs. marketing language
Source: skill-creator/SKILL.md
Before any implementation:
- Playwright MCP: Open site in real browser, observe loading behavior, test interactions
- Chrome DevTools MCP: Monitor network traffic, discover hidden APIs, analyze request patterns
- Protection Analysis: Detect Cloudflare, CAPTCHA, rate limiting, fingerprinting
- Intelligence Report: Generate structured findings with optimal strategy recommendation
Why this matters: Discovers hidden APIs (10-100x faster than HTML scraping), identifies blockers before coding, provides intelligence for informed strategy selection.
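One way to picture the intelligence report is as a small typed record that feeds a strategy decision. The sketch below is an assumption about how such a report could be structured: the `ReconReport` fields and the decision rules in `recommendStrategy` are simplified for illustration and are not the skill's actual logic.

```typescript
type Strategy =
  | "sitemap+api"
  | "api"
  | "sitemap+cheerio"
  | "sitemap+playwright"
  | "cheerio"
  | "playwright";

// Hypothetical summary of what reconnaissance found.
interface ReconReport {
  apiEndpoints: string[];   // e.g. ["GET /api/products"]
  sitemapUrlCount: number;  // 0 when no sitemap was found
  needsJavaScript: boolean; // content only appears after client-side rendering
  protections: string[];    // e.g. ["cloudflare", "rate-limit"]
}

interface Recommendation {
  strategy: Strategy;
  useProxies: boolean;
}

// Prefer an API when one exists, use a sitemap for URL discovery when
// available, and fall back to a browser only for JavaScript-heavy pages.
function recommendStrategy(report: ReconReport): Recommendation {
  const hasApi = report.apiEndpoints.length > 0;
  const hasSitemap = report.sitemapUrlCount > 0;

  let strategy: Strategy;
  if (hasApi) {
    strategy = hasSitemap ? "sitemap+api" : "api";
  } else if (hasSitemap) {
    strategy = report.needsJavaScript ? "sitemap+playwright" : "sitemap+cheerio";
  } else {
    strategy = report.needsJavaScript ? "playwright" : "cheerio";
  }
  // Simplifying assumption: any detected protection warrants proxies.
  return { strategy, useProxies: report.protections.length > 0 };
}
```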
Automatically validates reconnaissance findings:
- Sitemaps (`/sitemap.xml`, `robots.txt`)
- API endpoints (confirmed from DevTools analysis)
- Site structure (JavaScript-heavy? Authentication?)
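The sitemap check can be sketched as two small parsers: one reads `Sitemap:` lines from a `robots.txt` body, the other pulls `<loc>` values out of a sitemap document. Both helper names are hypothetical, and fetching the files is left out so the sketch stays self-contained.

```typescript
// Collect sitemap URLs declared in a robots.txt body ("Sitemap: <url>" lines).
function sitemapsFromRobots(robotsTxt: string): string[] {
  return robotsTxt
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => /^sitemap:/i.test(line))
    .map((line) => line.slice("sitemap:".length).trim())
    .filter((url) => url.length > 0);
}

// Extract <loc> values from a sitemap or sitemap index document.
// A regex is enough for a quick check; a real XML parser is the safer
// choice for production use (CDATA sections are not handled here).
function locsFromSitemap(xml: string): string[] {
  const locs: string[] = [];
  const pattern = /<loc>\s*([^<\s]+)\s*<\/loc>/gi;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xml)) !== null) {
    const loc = match[1];
    // Sitemaps XML-escape ampersands in URLs; undo that for later requests.
    if (loc) locs.push(loc.replace(/&amp;/g, "&"));
  }
  return locs;
}
```

If `locsFromSitemap` returns URLs that themselves end in `.xml`, the document is likely a sitemap index and each entry needs the same treatment.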
Presents 2-3 options with:
- Time estimates
- Complexity rating
- Pros/cons
- Clear reasoning
- Start with simplest approach
- Test small batch (5-10 items)
- Scale or fallback based on results
- Add robustness last
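The iterative steps above can be sketched as a loop over candidate approaches, ordered from simplest to most complex. This is a minimal illustration, not the code in `examples/iterative-fallback.js`; the `NamedScraper` shape and the "one result per sample URL" success rule are assumptions.

```typescript
interface NamedScraper<T> {
  name: string;
  run: (urls: string[]) => Promise<T[]>;
}

// Try each approach on a small sample first. The first approach that
// returns a result for every sample URL is used for the remaining URLs.
async function scrapeIteratively<T>(
  urls: string[],
  approaches: NamedScraper<T>[],
  sampleSize = 5,
): Promise<{ approach: string; items: T[] }> {
  const sample = urls.slice(0, sampleSize);
  for (const approach of approaches) {
    try {
      const trial = await approach.run(sample);
      if (trial.length < sample.length) continue; // incomplete: fall back
      const rest = await approach.run(urls.slice(sampleSize));
      return { approach: approach.name, items: trial.concat(rest) };
    } catch {
      // This approach failed; move on to the next, more complex one.
    }
  }
  throw new Error("No approach succeeded on the sample batch");
}
```

Robustness features such as retries or proxies would wrap the individual `run` functions once a working approach is known.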
For production actors:
- Strongly recommend TypeScript
- Always use the `apify create` command
- Choose template based on site type (Cheerio for static, Playwright for JS-heavy)
- Type-safe input/output
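Type-safe input can be illustrated with an interface plus a runtime check that narrows untyped input before the scraper uses it. The `ActorInput` fields below are hypothetical and would need to mirror the Actor's own input schema; in a real Actor the raw value would come from the SDK's input getter rather than a literal.

```typescript
// Hypothetical input shape for a product scraper Actor.
interface ActorInput {
  startUrls: { url: string }[];
  maxItems: number;
}

// Narrow untyped input to ActorInput, applying a default for maxItems.
function parseInput(raw: unknown): ActorInput {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("Input must be an object");
  }
  const { startUrls, maxItems = 10 } = raw as Record<string, unknown>;
  if (!Array.isArray(startUrls) || startUrls.length === 0) {
    throw new Error("startUrls must be a non-empty array");
  }
  const urls = startUrls.map((entry: unknown) => {
    const url = (entry as { url?: unknown } | null)?.url;
    if (typeof url !== "string") {
      throw new Error("Each start URL needs a string `url`");
    }
    return { url };
  });
  if (typeof maxItems !== "number" || !Number.isInteger(maxItems) || maxItems < 1) {
    throw new Error("maxItems must be a positive integer");
  }
  return { startUrls: urls, maxItems };
}
```

Validating once at the entry point means the rest of the Actor can rely on the compiler instead of repeating defensive checks.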
1. User: "Scrape example.com"
2. Claude opens site with Playwright MCP (Phase 1 reconnaissance)
3. Claude monitors DevTools, finds API endpoint GET /api/products
4. Claude tests pagination, detects Cloudflare protection
5. Claude checks sitemap (validates Phase 1 findings - 1,234 URLs)
6. Claude generates intelligence report
7. Claude recommends: Hybrid (Sitemap + API + Proxies)
8. Implements with discovered API endpoints
9. Tests with 10 items
10. Scales to full dataset
11. Result: 1000 products in 5 minutes, no blocks
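The hybrid step in this workflow (sitemap for discovery, API for data) can be sketched as mapping sitemap URLs to API URLs. The URL layout `/products/<numeric id>` and the per-item endpoint `/api/products/<id>` are assumptions made for illustration; the real patterns come from reconnaissance.

```typescript
// Hypothetical layout: product pages live at /products/<numeric id>.
const PRODUCT_PATH = /^\/products\/(\d+)\/?$/;

// Turn sitemap page URLs into API URLs, skipping non-product pages
// and collapsing duplicates that differ only by query string.
// Expects absolute URLs, as sitemaps provide; `new URL` throws otherwise.
function toApiUrls(sitemapUrls: string[], apiBase: string): string[] {
  const ids = new Set<string>();
  for (const url of sitemapUrls) {
    const match = PRODUCT_PATH.exec(new URL(url).pathname);
    if (match && match[1]) ids.add(match[1]);
  }
  return Array.from(ids, (id) => `${apiBase}/${id}`);
}
```

The resulting list can then be requested directly, which avoids rendering any product page in a browser.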
1. User: "Make this an Apify Actor"
2. Claude loads apify/ module
3. Recommends TypeScript; user confirms
4. Guides through: apify create
5. Analyzes site: Static HTML → Selects Cheerio template
6. Ports scraping logic to TypeScript
7. Adds input schema
8. Tests: apify run
9. Deploys: apify push
10. Result: Production-ready actor
| Approach | Time (1000 pages) | vs Crawling |
|---|---|---|
| Sitemap + API | 5 minutes | 60x faster |
| Sitemap + Playwright | 20 minutes | 15x faster |
| API only | 8 minutes | 40x faster |
| Playwright crawl | 45 minutes | Baseline |
✅ Always start with Playwright MCP + DevTools exploration
✅ Discover APIs before attempting HTML scraping
✅ Test site interactions to understand behavior
✅ Assess protections early (Cloudflare, CAPTCHA, rate limits)
✅ Generate intelligence report with findings
✅ Validate reconnaissance with automated sitemap checks
✅ Confirm API endpoints discovered in Phase 1
✅ Analyze site structure based on observations
✅ Start simple (sitemap → API → Playwright)
✅ Test small batch first
✅ Handle errors gracefully
✅ Respect rate limits
✅ Use TypeScript for Apify Actors
✅ Always use apify create command
✅ Choose template based on Phase 1 findings (Cheerio vs Playwright)
✅ Test locally with apify run
✅ Deploy with apify push
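For the "handle errors gracefully" and "respect rate limits" practices above, a common pattern is retrying with exponential backoff. The sketch below is one possible shape, not code from this skill; `withRetry` and its options are hypothetical names, and the `sleep` hook exists only so the delay can be replaced in tests.

```typescript
interface RetryOptions {
  retries: number;     // additional attempts after the first
  baseDelayMs: number; // delay before the first retry; doubles each time
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Run a task, retrying on failure with exponentially growing pauses.
// Rethrows the last error once all attempts are used up.
async function withRetry<T>(
  task: () => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  const sleep = options.sleep ?? defaultSleep;
  let lastError: unknown;
  for (let attempt = 0; attempt <= options.retries; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      if (attempt < options.retries) {
        await sleep(options.baseDelayMs * 2 ** attempt);
      }
    }
  }
  throw lastError;
}
```

Growing the delay between attempts gives a rate-limited server time to recover instead of hitting it again immediately.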
→ See strategies/sitemap-discovery.md troubleshooting section
→ See strategies/api-discovery.md authentication section
→ See strategies/playwright-scraping.md performance optimization
→ See apify/cli-workflow.md common issues section
- Main skill: Read `SKILL.md` for complete workflow
- Workflows: Implementation patterns in `workflows/`
- Strategies: Browse `strategies/` for detailed guides
- Examples: Run code in `examples/` directory
- Reference: Quick lookups in `reference/`
- Apify: Production deployment in `apify/`
Intelligence first, implementation second!
This skill prioritizes:
- Reconnaissance - Understand before coding (APIs > Sitemaps > Scraping)
- Speed - Fastest approach that works (API 10-100x faster than HTML)
- Reliability - Structured data > HTML parsing
- Maintainability - TypeScript, proper tooling
- Best practices - Industry standards
4.0.0 - Intelligence-driven scraping:
- NEW: Interactive reconnaissance phase (Playwright MCP + Chrome DevTools)
- NEW: API discovery before HTML scraping
- NEW: Protection analysis and countermeasures
- Progressive disclosure architecture
- Proactive strategy discovery
- TypeScript-first Apify guidance
- Comprehensive examples
- Modular organization
All best practices sourced from official Anthropic documentation:
- Anthropic Skills Repository
- Agent Skills Specification
- skill-creator/SKILL.md
- Anthropic Skills Announcement
Start here: Read SKILL.md for the complete proactive workflow.