OpenClaw Scrapling
Advanced web scraping with anti-bot bypass, JavaScript support, and adaptive selectors. Use when scraping websites with Cloudflare protection, dynamic content, or frequent UI changes.
Scrapling Web Scraping Skill
Use Scrapling to scrape modern websites, including those with anti-bot protection, JavaScript-rendered content, and adaptive element tracking.
When to Use This Skill
- User asks to scrape a website or extract data from a URL
- Need to bypass Cloudflare, bot detection, or anti-scraping measures
- Need to handle JavaScript-rendered/dynamic content (React, Vue, etc.)
- Website requires login or session management
- Website structure changes frequently (adaptive selectors)
- Need to scrape multiple pages with rate limiting
Commands
All commands use the scrape.py script in this skill's directory.
Basic HTTP Scraping (Fast)
python scrape.py \
--url "https://example.com" \
--selector ".product" \
--output products.json
Use when: Static HTML, no JavaScript, no bot protection
Stealth Mode (Bypass Anti-Bot)
python scrape.py \
--url "https://nopecha.com/demo/cloudflare" \
--stealth \
--selector "#content" \
--output data.json
Use when: Cloudflare protection, bot detection, fingerprinting
Features:
- Bypasses Cloudflare Turnstile automatically
- Browser fingerprint spoofing
- Headless browser mode
Dynamic/JavaScript Content
python scrape.py \
--url "https://spa-website.com" \
--dynamic \
--selector ".loaded-content" \
--wait-for ".loaded-content" \
--output data.json
Use when: React/Vue/Angular apps, lazy-loaded content, AJAX
Features:
- Full Playwright browser automation
- Wait for elements to load
- Network idle detection
Adaptive Selectors (Survives Website Changes)
# First time - save the selector pattern
python scrape.py \
--url "https://example.com" \
--selector ".product-card" \
--adaptive-save \
--output products.json
# Later, if website structure changes
python scrape.py \
--url "https://example.com" \
--adaptive \
--output products.json
Use when: Website frequently redesigns, need robust scraping
How it works:
- First run: Saves element patterns/structure
- Later runs: Uses similarity algorithms to relocate moved elements
- Auto-updates selector cache
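The relocation step can be sketched roughly as follows. This is a simplified illustration, not Scrapling's actual algorithm: a hypothetical `find_similar` that scores candidate elements against a saved fingerprint of tag, class, and text using stdlib `difflib`:

```python
from difflib import SequenceMatcher

def similarity(saved: dict, candidate: dict) -> float:
    """Score a candidate element against a saved fingerprint (0.0 - 1.0)."""
    tag_score = 1.0 if saved["tag"] == candidate["tag"] else 0.0
    class_score = SequenceMatcher(None, saved["class"], candidate["class"]).ratio()
    text_score = SequenceMatcher(None, saved["text"], candidate["text"]).ratio()
    return (tag_score + class_score + text_score) / 3

def find_similar(saved: dict, candidates: list[dict]) -> dict:
    """Return the candidate most similar to the saved element pattern."""
    return max(candidates, key=lambda c: similarity(saved, c))

saved = {"tag": "div", "class": "product-card", "text": "Widget $9.99"}
candidates = [
    {"tag": "nav", "class": "site-nav", "text": "Home"},
    {"tag": "div", "class": "card product", "text": "Widget $9.99"},  # class renamed
]
best = find_similar(saved, candidates)
print(best["class"])  # the renamed product card still wins
```

The real implementation uses richer structural features (ancestors, siblings, attributes), but the principle is the same: pick the element with the highest similarity to the cached pattern.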
Session Management (Login Required)
# Login and save session
python scrape.py \
--url "https://example.com/dashboard" \
--stealth \
--login \
--username "user@example.com" \
--password "password123" \
--session-name "my-session" \
--selector ".protected-data" \
--output data.json
# Reuse saved session (no login needed)
python scrape.py \
--url "https://example.com/another-page" \
--stealth \
--session-name "my-session" \
--selector ".more-data" \
--output more_data.json
Use when: Content requires authentication, multi-step scraping
Extract Specific Data Types
Text only:
python scrape.py \
--url "https://example.com" \
--selector ".content" \
--extract text \
--output content.txt
Markdown:
python scrape.py \
--url "https://docs.example.com" \
--selector "article" \
--extract markdown \
--output article.md
Attributes:
# Extract href links
python scrape.py \
--url "https://example.com" \
--selector "a.product-link" \
--extract attr:href \
--output links.json
Multiple fields:
python scrape.py \
--url "https://example.com/products" \
--selector ".product" \
--fields "title:.title::text,price:.price::text,link:a::attr(href)" \
--output products.json
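The `--fields` value is a comma-separated list of `name:selector` pairs, where each selector may itself contain `::text` or `::attr(...)`. A hypothetical parser showing how such a spec splits apart (not the script's actual code):

```python
def parse_fields(spec: str) -> dict[str, str]:
    """Split 'name:selector,name:selector' into {name: selector}."""
    fields = {}
    for part in spec.split(","):
        # Split on the FIRST colon only: selectors may contain '::'
        name, selector = part.split(":", 1)
        fields[name] = selector
    return fields

spec = "title:.title::text,price:.price::text,link:a::attr(href)"
print(parse_fields(spec))
# {'title': '.title::text', 'price': '.price::text', 'link': 'a::attr(href)'}
```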
Advanced Options
Proxy support:
python scrape.py \
--url "https://example.com" \
--proxy "http://user:pass@proxy.com:8080" \
--selector ".content"
Rate limiting:
python scrape.py \
--url "https://example.com" \
--selector ".content" \
--delay 2 # 2 seconds between requests
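The `--delay` flag corresponds to a simple sleep between consecutive requests. A sketch of the equivalent loop — the `fetch` parameter here is a stand-in for whatever fetcher you use:

```python
import time

def scrape_all(urls, delay=2.0, fetch=lambda u: u):
    """Fetch each URL, sleeping `delay` seconds between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # no delay before the very first request
        results.append(fetch(url))
    return results

pages = scrape_all(["https://example.com/1", "https://example.com/2"], delay=0)
```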
Custom headers:
python scrape.py \
--url "https://api.example.com" \
--headers '{"Authorization": "Bearer token123"}' \
--selector "body"
Screenshot (for debugging):
python scrape.py \
--url "https://example.com" \
--stealth \
--screenshot debug.png
Python API (For Custom Scripts)
You can also use Scrapling directly in Python scripts:
from scrapling.fetchers import Fetcher, StealthyFetcher, DynamicFetcher

# Basic HTTP request
page = Fetcher.get('https://example.com')
products = page.css('.product')
for product in products:
    title = product.css('.title::text').get()
    price = product.css('.price::text').get()
    print(f"{title}: {price}")

# Stealth mode (bypass anti-bot)
page = StealthyFetcher.fetch('https://protected-site.com', headless=True)
data = page.css('.content').getall()

# Dynamic content (full browser)
page = DynamicFetcher.fetch('https://spa-app.com', network_idle=True)
items = page.css('.loaded-item').getall()

# Sessions (login)
from scrapling.fetchers import StealthySession

with StealthySession(headless=True) as session:
    # Login
    login_page = session.fetch('https://example.com/login')
    login_page.fill('#username', 'user@example.com')
    login_page.fill('#password', 'password123')
    login_page.click('#submit')

    # Access protected content
    protected_page = session.fetch('https://example.com/dashboard')
    data = protected_page.css('.private-data').getall()
Output Formats
- JSON (default): --output data.json
- JSONL (streaming): --output data.jsonl
- CSV: --output data.csv
- TXT (text only): --output data.txt
- MD (markdown): --output data.md
- HTML (raw): --output data.html
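The format choice matters most for large scrapes: JSON buffers everything into one array written at the end, while JSONL writes one record per line and can be flushed as results arrive. A stdlib sketch of the three most common writers (assumed behavior, not the script's exact code):

```python
import csv
import json

records = [{"title": "Widget", "price": "$9.99"},
           {"title": "Gadget", "price": "$19.99"}]

# JSON: one array, written once at the end of the run
with open("data.json", "w") as f:
    json.dump(records, f, indent=2)

# JSONL: one object per line, streamable after every record
with open("data.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# CSV: header taken from the first record's keys
with open("data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)
```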
Selector Types
Scrapling supports multiple selector formats:
CSS selectors:
--selector ".product"
--selector "div.container > p.text"
--selector "a[href*='product']"
XPath selectors:
--selector "//div[@class='product']"
--selector "//a[contains(@href, 'product')]"
Pseudo-elements (like Scrapy):
--selector ".product::text" # Text content
--selector "a::attr(href)" # Attribute value
--selector ".price::text::strip" # Text with whitespace removed
Combined selectors:
--selector ".product .title::text" # Nested elements
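A pseudo-element selector can be read as a plain CSS selector plus a chain of post-processing steps. A hypothetical helper showing the split (for illustration only):

```python
def split_selector(selector: str) -> tuple[str, list[str]]:
    """Separate the CSS part from trailing ::text / ::attr(...) / ::strip steps."""
    css, *steps = selector.split("::")
    return css, steps

print(split_selector(".price::text::strip"))    # ('.price', ['text', 'strip'])
print(split_selector("a::attr(href)"))          # ('a', ['attr(href)'])
print(split_selector(".product .title::text"))  # ('.product .title', ['text'])
```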
Troubleshooting
Issue: "Element not found"
- Try --dynamic if content is JavaScript-loaded
- Use --wait-for SELECTOR to wait for the element
- Use --screenshot to debug what's visible
Issue: "Cloudflare blocking"
- Use --stealth mode
- Add the --solve-cloudflare flag (enabled by default in stealth)
- Try --delay 2 to slow down requests
Issue: "Login not working"
- Use --headless false to see the browser interaction
- Check that the credentials are correct
- The website might use a CAPTCHA (manual intervention needed)
Issue: "Selector broke after website update"
- Use --adaptive mode to auto-relocate elements
- Re-run with --adaptive-save to update the saved patterns
Examples
Scrape Hacker News Front Page
python scrape.py \
--url "https://news.ycombinator.com" \
--selector ".athing" \
--fields "title:.titleline>a::text,link:.titleline>a::attr(href)" \
--output hn_stories.json
Scrape Protected Site with Login
python scrape.py \
--url "https://example.com/data" \
--stealth \
--login \
--username "user@example.com" \
--password "secret" \
--session-name "example-session" \
--selector ".data-table tr" \
--output protected_data.json
Monitor Price Changes
# Save initial selector pattern
python scrape.py \
--url "https://store.com/product/123" \
--selector ".price" \
--adaptive-save \
--output price.txt
# Later, check price (even if page redesigned)
python scrape.py \
--url "https://store.com/product/123" \
--adaptive \
--output price_new.txt
Scrape Dynamic JavaScript App
python scrape.py \
--url "https://react-app.com/data" \
--dynamic \
--wait-for ".loaded-content" \
--selector ".item" \
--fields "name:.name::text,value:.value::text" \
--output app_data.json
Notes
- First run: Scrapling downloads browsers (~500MB). This is automatic.
- Sessions: Saved in the sessions/ directory, reusable across runs
- Adaptive cache: Saved in selector_cache.json, auto-updated
- Rate limiting: Always respect robots.txt and add delays for ethical scraping
- Legal: Use only on sites you have permission to scrape
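Checking robots.txt takes only a few lines with the standard library. Here the rules are parsed from an inline string for illustration; a real check would fetch the site's live /robots.txt:

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://example.com/products"))   # True
print(rp.can_fetch("*", "https://example.com/private/x"))  # False
print(rp.crawl_delay("*"))  # 2
```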
Dependencies
Installed automatically when skill is installed:
- scrapling[all] - Main library with all features
- pyyaml - For config file support
Skill Structure
scrapling/
├── SKILL.md             # This file
├── scrape.py            # Main CLI script
├── requirements.txt     # Python dependencies
├── sessions/            # Saved browser sessions
├── selector_cache.json  # Adaptive selector patterns
└── examples/            # Example scripts
    ├── basic.py
    ├── stealth.py
    ├── dynamic.py
    └── adaptive.py
Advanced: Custom Python Scripts
For complex scraping tasks, you can create custom Python scripts in this directory:
# custom_scraper.py
from scrapling.spiders import Spider, Response
import json

class MySpider(Spider):
    name = "custom"
    start_urls = ["https://example.com/page1"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {
                "title": item.css('.title::text').get(),
                "price": item.css('.price::text').get(),
            }
        # Follow pagination
        next_page = response.css('.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

# Run spider
result = MySpider().start()
with open('output.json', 'w') as f:
    json.dump(result.items, f, indent=2)
Run with:
python custom_scraper.py
Questions? Check Scrapling docs: https://scrapling.readthedocs.io