Scrapling

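Scrapling is an undetectable, powerful, flexible, and high-performance Python web scraping library designed to make web scraping simple. It describes itself as the first adaptive scraping library that learns from website changes and evolves with them. When other libraries break after a site's structure is updated, Scrapling automatically relocates the elements and keeps your scraper running.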

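Key features:

  • Adaptive scraping: learns from website changes and adjusts automatically. When a site's structure is updated, Scrapling relocates the elements so the scraper keeps running.
  • Browser fingerprint spoofing: supports TLS fingerprint matching and real browser header emulation.
  • Stealth scraping: StealthyFetcher can get past advanced anti-bot systems such as Cloudflare Turnstile.
  • Persistent session support: offers several session types, including FetcherSession, DynamicSession, and StealthySession, for reliable and efficient scraping.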

Learn more in the [official documentation].

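Why use Scrapeless with Scrapling?

Scrapling excels at high-performance web data extraction, adaptive scraping, and AI integration. It ships with several Fetcher classes (Fetcher, DynamicFetcher, and StealthyFetcher) to cover different scenarios. Even so, some challenges can remain when you face advanced anti-bot mechanisms or scrape at high concurrency, for example:

  • Local browsers are easily blocked by Cloudflare, AWS WAF, or reCAPTCHA.
  • Browsers consume a lot of resources during large-scale concurrent scraping, which limits performance.
  • Although StealthyFetcher includes stealth features, extreme anti-bot scenarios still call for stronger infrastructure support.
  • Debugging is complex, which makes it hard to pin down the root cause of a failed scrape.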

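The Scrapeless cloud browser addresses these pain points:

  • One-click anti-bot bypass: Automatically handles reCAPTCHA, Cloudflare Turnstile/Challenge, AWS WAF, and other verifications. Combined with Scrapling's adaptive extraction, this substantially raises the success rate.
  • Unlimited concurrent scaling: Each task can launch 50–1000+ browser instances within seconds, removing local performance bottlenecks and letting Scrapling's high-performance design reach its potential.
  • 40–80% lower cost: Compared with similar cloud solutions, Scrapeless's overall cost is only 20–60% of theirs, and it supports pay-as-you-go billing, so even small projects can afford it.
  • Visual debugging tools: With Session Replay and Live URL, you can watch Scrapling's execution in real time, quickly identify scraping failures, and reduce debugging costs.
  • Flexible integration: Scrapling's DynamicFetcher/PlayWrightFetcher (built on Playwright) can connect to the Scrapeless cloud browser through configuration alone, with no need to rewrite existing logic.
  • Edge service nodes: With data centers worldwide, Scrapeless reports startup speed and stability 2–3× better than other cloud browsers, and offers more than 90 million trusted residential IPs across 195+ countries and regions to speed up Scrapling's execution.
  • Isolated environments and persistent sessions: Each Scrapeless profile runs in an isolated environment and supports persistent logins, preventing session interference and keeping large-scale scraping stable.
  • Flexible fingerprint configuration: Scrapeless can generate browser fingerprints randomly or let you customize them fully. Paired with Scrapling's StealthyFetcher, this further lowers detection risk and improves the scraping success rate.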

Getting Started

Log in to Scrapeless and get your API key

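Prerequisites

  • Python 3.10+
  • A registered Scrapeless account with a valid API key
  • Scrapling installed (or use the official Docker image)

pip install scrapling
# If you need the dynamic or stealthy fetchers:
pip install "scrapling[fetchers]"
# Install the browser dependencies
scrapling install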
 

Or use the official Docker image:

docker pull pyd4vinci/scrapling
# or
docker pull ghcr.io/d4vinci/scrapling:latest
 

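Quick Start

Here is a simple example: it uses DynamicSession (provided by Scrapling) to connect to the Scrapeless cloud browser through its WebSocket endpoint, fetches a page, and prints the response.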

from urllib.parse import urlencode
 
from scrapling.fetchers import DynamicSession
 
# Configure your browser session
config = {
    "token": "YOUR_API_KEY",
    "sessionName": "scrapling-session",
    "sessionTTL": "300",  # 5 minutes
    "proxyCountry": "ANY",
    "sessionRecording": "false",
}
 
# Build the WebSocket URL
ws_endpoint = f"wss://browser.scrapeless.com/api/v2/browser?{urlencode(config)}"
print('Connecting to Scrapeless...')
 
with DynamicSession(cdp_url=ws_endpoint, disable_resources=True) as s:
    print("Connected!")
    page = s.fetch("https://httpbin.org/headers", network_idle=True)
    print(f"Page loaded, content length: {len(page.body)}")
    print(page.json())
 

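Note: The Scrapeless cloud browser supports advanced options such as proxy configuration, custom fingerprints, and CAPTCHA solving.

For more details, see the Scrapeless Browser documentation.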
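
The endpoint construction from the quick start can be wrapped in a small helper so the API key does not have to be hard-coded. The following is a minimal sketch, not part of either library: `build_ws_endpoint` and the `SCRAPELESS_API_KEY` environment variable name are assumptions made for illustration, and only the query parameters already shown in the quick start are used.

```python
import os
from urllib.parse import parse_qs, urlencode, urlsplit

# Base endpoint as used in the quick-start example.
SCRAPELESS_WS_BASE = "wss://browser.scrapeless.com/api/v2/browser"


def build_ws_endpoint(token, session_name="scrapling-session", ttl_seconds=300,
                      proxy_country="ANY", recording=False):
    """Build the WebSocket URL from the same query parameters as the quick start."""
    if not token:
        raise ValueError("A Scrapeless API key is required")
    params = {
        "token": token,
        "sessionName": session_name,
        "sessionTTL": str(ttl_seconds),  # seconds, sent as a string
        "proxyCountry": proxy_country,
        "sessionRecording": "true" if recording else "false",
    }
    return f"{SCRAPELESS_WS_BASE}?{urlencode(params)}"


if __name__ == "__main__":
    # SCRAPELESS_API_KEY is a hypothetical variable name chosen for this sketch.
    endpoint = build_ws_endpoint(os.environ.get("SCRAPELESS_API_KEY", "YOUR_API_KEY"))
    # Print only the parameter names so the key does not end up in logs.
    print("Query parameters:", sorted(parse_qs(urlsplit(endpoint).query)))
```

The returned string can be passed as `cdp_url` to DynamicSession in the same way as `ws_endpoint` in the quick start.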


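Common Use Cases (with Complete Examples)

Before you start, make sure that:

  • You have run pip install "scrapling[fetchers]"
  • You have run scrapling install to download the browser dependencies
  • You have a valid Scrapeless API key
  • You are using Python 3.10+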
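
Part of this checklist can be verified from Python before running the examples. The sketch below is illustrative only: `check_environment` is a hypothetical helper, and it covers just the Python version and whether the `scrapling` package is importable. It does not check the browser dependencies or the API key.

```python
import importlib.util
import sys


def check_environment(min_version=(3, 10), package="scrapling"):
    """Return a list of human-readable problems; an empty list means both checks passed."""
    problems = []
    if sys.version_info < min_version:
        found = ".".join(map(str, sys.version_info[:3]))
        problems.append(
            f"Python {min_version[0]}.{min_version[1]}+ is required, found {found}"
        )
    # find_spec returns None when a top-level package cannot be located.
    if importlib.util.find_spec(package) is None:
        problems.append(f"The '{package}' package is not importable in this environment")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    for issue in issues:
        print("[check]", issue)
    if not issues:
        print("[check] Python version and scrapling import look fine")
```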

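Scraping Amazon with Scrapling + Scrapeless

Below is a complete example that scrapes Amazon product details.

The script connects to the Scrapeless cloud browser, loads the target page, bypasses anti-bot checks, and extracts key product information such as title, price, stock status, rating, review count, features, images, ASIN, merchant, and categories.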

# amazon_scraper_response_only.py
from urllib.parse import urlencode
import json
import time
import re
from scrapling.fetchers import DynamicSession
 
# ---------------- Configuration ----------------
CONFIG = {
    "token": "YOUR_SCRAPELESS_API_KEY",  
    "sessionName": "Data Scraping",
    "sessionTTL": "900",
    "proxyCountry": "ANY",
    "sessionRecording": "true",
}
DISABLE_RESOURCES = True   # False -> load JS/resources (more stable for JS-heavy sites)
WAIT_FOR_SELECTOR_TIMEOUT = 60
MAX_RETRIES = 3
 
TARGET_URL = "https://www.amazon.com/ESR-Compatible-Military-Grade-Protection-Scratch-Resistant/dp/B0CC1F4V7Q"
WS_ENDPOINT = f"wss://browser.scrapeless.com/api/v2/browser?{urlencode(CONFIG)}"
 
 
# ---------------- Helper functions (response only) ----------------
def retry(func, retries=2, wait=2):
    for i in range(retries + 1):
        try:
            return func()
        except Exception as e:
            print(f"[retry] Attempt {i+1} failed: {e}")
            if i == retries:
                raise
            time.sleep(wait * (i + 1))
 
 
def _resp_css_first_text(resp, selector):
    """Try response.css_first('selector::text') or resp.query_selector_text(selector); return a string or None."""
    try:
        if hasattr(resp, "css_first"):
            # prefer unified ::text pseudo API
            val = resp.css_first(f"{selector}::text")
            if val:
                return val.strip()
    except Exception:
        pass
    try:
        if hasattr(resp, "query_selector_text"):
            val = resp.query_selector_text(selector)
            if val:
                return val.strip()
    except Exception:
        pass
    return None
 
 
def _resp_css_texts(resp, selector):
    """Return a list of text values for the selector, using response.css('selector::text') or query_selector_all_text."""
    out = []
    try:
        if hasattr(resp, "css"):
            vals = resp.css(f"{selector}::text") or []
            for v in vals:
                if isinstance(v, str) and v.strip():
                    out.append(v.strip())
            if out:
                return out
    except Exception:
        pass
    try:
        if hasattr(resp, "query_selector_all_text"):
            vals = resp.query_selector_all_text(selector) or []
            for v in vals:
                if v and v.strip():
                    out.append(v.strip())
            if out:
                return out
    except Exception:
        pass
    # some fetchers provide query_selector_all and elements with .text() method
    try:
        if hasattr(resp, "query_selector_all"):
            els = resp.query_selector_all(selector) or []
            for el in els:
                try:
                    if hasattr(el, "text") and callable(el.text):
                        t = el.text()
                        if t and t.strip():
                            out.append(t.strip())
                            continue
                except Exception:
                    pass
                try:
                    if hasattr(el, "get_text"):
                        t = el.get_text(strip=True)
                        if t:
                            out.append(t)
                            continue
                except Exception:
                    pass
    except Exception:
        pass
    return out
 
 
def _resp_css_first_attr(resp, selector, attr):
    """Try to read an attribute via the ::attr(...) CSS pseudo-element on the response, or from the queried element's attributes."""
    try:
        if hasattr(resp, "css_first"):
            val = resp.css_first(f"{selector}::attr({attr})")
            if val:
                return val.strip()
    except Exception:
        pass
    try:
        # try element and get_attribute / get
        if hasattr(resp, "query_selector"):
            el = resp.query_selector(selector)
            if el:
                if hasattr(el, "get_attribute"):
                    try:
                        v = el.get_attribute(attr)
                        if v:
                            return v
                    except Exception:
                        pass
                try:
                    v = el.get(attr) if hasattr(el, "get") else None
                    if v:
                        return v
                except Exception:
                    pass
                try:
                    attrs = getattr(el, "attrs", None)
                    if isinstance(attrs, dict) and attr in attrs:
                        return attrs.get(attr)
                except Exception:
                    pass
    except Exception:
        pass
    return None
 
 
def detect_bot_via_resp(resp):
    """Detect typical bot/CAPTCHA signals using only text selectors on the response."""
    # First, scan the body text for common challenge phrases
    try:
        body_text = _resp_css_first_text(resp, "body")
        if body_text:
            txt = body_text.lower()
            for k in ("captcha", "are you a human", "verify you are human", "access to this page has been denied", "bot detection", "please enable javascript", "checking your browser"):
                if k in txt:
                    return True
    except Exception:
        pass
    # Then try selectors that commonly mark a challenge page
    suspects = ["#captcha", "#cf-hcaptcha-container", "#challenge-form"]
    for s in suspects:
        try:
            if _resp_css_first_text(resp, s):
                return True
        except Exception:
            pass
    return False
 
 
def parse_price_from_text(price_raw):
    if not price_raw:
        return None, None
    m = re.search(r"([^\d.,\s]+)?\s*([\d,]+\.\d{1,2}|[\d,]+)", price_raw)
    if m:
        currency = m.group(1).strip() if m.group(1) else None
        num = m.group(2).replace(",", "")
        try:
            price = float(num)
        except Exception:
            price = None
        return currency, price
    return None, None
 
 
def parse_int_from_text(text):
    if not text:
        return None
    digits = "".join(filter(str.isdigit, text))
    try:
        return int(digits) if digits else None
    except ValueError:
        return None
 
 
# ---------------- Main function (response only) ----------------
def scrape_amazon_using_response_only(url):
    with DynamicSession(cdp_url=WS_ENDPOINT, disable_resources=DISABLE_RESOURCES) as s:
        # Fetch with retries
        resp = retry(lambda: s.fetch(url, network_idle=True, timeout=120000), retries=MAX_RETRIES - 1)
 
        if detect_bot_via_resp(resp):
            print("[warn] Bot check/CAPTCHA detected via response selectors.")
            try:
                resp.screenshot(path="captcha_detected.png")
            except Exception:
                pass
            # Retry once more
            time.sleep(2)
            resp = retry(lambda: s.fetch(url, network_idle=True, timeout=120000), retries=1)
 
        # Wait for productTitle to appear (polling with response selectors only)
        title = _resp_css_first_text(resp, "#productTitle") or _resp_css_first_text(resp, "#title")
        waited = 0
        while not title and waited < WAIT_FOR_SELECTOR_TIMEOUT:
            print("[info] Waiting for #productTitle to appear (response selectors)...")
            time.sleep(3)
            waited += 3
            resp = s.fetch(url, network_idle=True, timeout=120000)
            title = _resp_css_first_text(resp, "#productTitle") or _resp_css_first_text(resp, "#title")
 
        title = title.strip() if title else None
 
        # Extract fields using the response-only helpers
        def get_text(selectors, multiple=False):
            if multiple:
                out = []
                for sel in selectors:
                    out.extend(_resp_css_texts(resp, sel) or [])
                return out
            for sel in selectors:
                v = _resp_css_first_text(resp, sel)
                if v:
                    return v
            return None
 
        price_raw = get_text([
            "#priceblock_ourprice",
            "#priceblock_dealprice",
            "#priceblock_saleprice",
            "#price_inside_buybox",
            ".a-price .a-offscreen"
        ])
        rating_text = get_text(["span.a-icon-alt", "#acrPopover"])
        review_count_text = get_text(["#acrCustomerReviewText", "[data-hook='total-review-count']"])
        availability = get_text([
            "#availability .a-color-state",
            "#availability .a-color-success",
            "#outOfStock",
            "#availability"
        ])
        features = get_text(["#feature-bullets ul li"], multiple=True) or []
        description = get_text([
            "#productDescription",
            "#bookDescription_feature_div .a-expander-content",
            "#productOverview_feature_div"
        ])
 
        # Images (attribute extraction from the response)
        images = []
        seen = set()
        main_src = _resp_css_first_attr(resp, "#imgTagWrapperId img", "data-old-hires") \
                   or _resp_css_first_attr(resp, "#landingImage", "src") \
                   or _resp_css_first_attr(resp, "#imgTagWrapperId img", "src")
        if main_src and main_src not in seen:
            images.append(main_src)
            seen.add(main_src)
 
        dyn = _resp_css_first_attr(resp, "#imgTagWrapperId img", "data-a-dynamic-image") \
              or _resp_css_first_attr(resp, "#landingImage", "data-a-dynamic-image")
        if dyn:
            try:
                obj = json.loads(dyn)
                for k in obj.keys():
                    if k not in seen:
                        images.append(k)
                        seen.add(k)
            except Exception:
                pass
 
        # Thumbnail URLs: query ::attr(src) directly (the _resp_css_texts helper appends ::text to its selector)
        thumbs = [str(v) for v in (resp.css("#altImages img::attr(src)") or resp.css(".imageThumbnail img::attr(src)") or [])]
        for src in thumbs:
            if not src:
                continue
            src_clean = re.sub(r"\._[A-Z0-9,]+_\.", ".", src)
            if src_clean not in seen:
                images.append(src_clean)
                seen.add(src_clean)
 
        # ASIN (attribute, with a text fallback)
        asin = _resp_css_first_attr(resp, "input#ASIN", "value")
        if asin:
            asin = asin.strip()
        else:
            detail_texts = _resp_css_texts(resp, "#detailBullets_feature_div li") or []
            combined = " ".join([t for t in detail_texts if t])
            m = re.search(r"ASIN[:\s]*([A-Z0-9-]+)", combined, re.I)
            if m:
                asin = m.group(1).strip()
 
        merchant = _resp_css_first_text(resp, "#sellerProfileTriggerId") \
                   or _resp_css_first_text(resp, "#merchant-info") \
                   or _resp_css_first_text(resp, "#bylineInfo")
        categories = _resp_css_texts(resp, "#wayfinding-breadcrumbs_container ul li a") or _resp_css_texts(resp, "#wayfinding-breadcrumbs_feature_div ul li a") or []
        categories = [c.strip() for c in categories if c and c.strip()]
 
        currency, price = parse_price_from_text(price_raw)
        rating_val = None
        if rating_text:
            try:
                rating_val = float(rating_text.split()[0].replace(",", ""))
            except Exception:
                rating_val = None
        review_count = parse_int_from_text(review_count_text)
 
        data = {
            "title": title,
            "price_raw": price_raw,
            "price": price,
            "currency": currency,
            "rating": rating_val,
            "review_count": review_count,
            "availability": availability,
            "features": features,
            "description": description,
            "images": images,
            "asin": asin,
            "merchant": merchant,
            "categories": categories,
            "url": url,
            "scrapedAt": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        }
 
        return data
 
 
# ---------------- Run ----------------
if __name__ == "__main__":
    try:
        result = scrape_amazon_using_response_only(TARGET_URL)
        print(json.dumps(result, indent=2, ensure_ascii=False))
        with open("scrapeless-amazon-product.json", "w", encoding="utf-8") as f:
            json.dump(result, f, ensure_ascii=False, indent=2)
    except Exception as e:
        print("[error] Scraping failed:", e)

Sample output:

{
  "title": "ESR for iPhone 15 Pro Max Case, Compatible with MagSafe, Military-Grade Protection, Yellowing Resistant, Scratch-Resistant Back, Magnetic Phone Case for iPhone 15 Pro Max, Classic Series, Clear",
  "price_raw": "$12.99",
  "price": 12.99,
  "currency": "$",
  "rating": 4.6,
  "review_count": 133714,
  "availability": "In Stock",
  "features": [
    "Compatibility: only for iPhone 15 Pro Max; full functionality maintained via precise speaker and port cutouts and easy-press buttons",
    "Stronger Magnetic Lock: powerful built-in magnets with 1,500 g of holding force enable faster, easier place-and-go wireless charging and a secure lock on any MagSafe accessory",
    "Military-Grade Drop Protection: rigorously tested to ensure total protection on all sides, with specially designed Air Guard corners that absorb shock so your phone doesn\u2019t have to",
    "Raised-Edge Protection: raised screen edges and Camera Guard lens frame provide enhanced scratch protection where it really counts",
    "Stay Original: scratch-resistant, crystal-clear acrylic back lets you show off your iPhone 15 Pro Max\u2019s true style in stunning clarity that lasts",
    "Complete Customer Support: detailed setup videos and FAQs, comprehensive 12-month protection plan, lifetime support, and personalized help."
  ],
  "description": "BrandESRCompatible Phone ModelsiPhone 15 Pro MaxColorA-ClearCompatible DevicesiPhone 15 Pro MaxMaterialAcrylic",
  "images": [
    "https://m.media-amazon.com/images/I/71-ishbNM+L._AC_SL1500_.jpg",
    "https://m.media-amazon.com/images/I/71-ishbNM+L._AC_SX342_.jpg",
    "https://m.media-amazon.com/images/I/71-ishbNM+L._AC_SX679_.jpg",
    "https://m.media-amazon.com/images/I/71-ishbNM+L._AC_SX522_.jpg",
    "https://m.media-amazon.com/images/I/71-ishbNM+L._AC_SX385_.jpg",
    "https://m.media-amazon.com/images/I/71-ishbNM+L._AC_SX466_.jpg",
    "https://m.media-amazon.com/images/I/71-ishbNM+L._AC_SX425_.jpg",
    "https://m.media-amazon.com/images/I/71-ishbNM+L._AC_SX569_.jpg",
    "https://m.media-amazon.com/images/I/41Ajq9jnx9L._AC_SR38,50_.jpg",
    "https://m.media-amazon.com/images/I/51RkuGXBMVL._AC_SR38,50_.jpg",
    "https://m.media-amazon.com/images/I/516RCbMo5tL._AC_SR38,50_.jpg",
    "https://m.media-amazon.com/images/I/51DdOFdiQQL._AC_SR38,50_.jpg",
    "https://m.media-amazon.com/images/I/514qvXYcYOL._AC_SR38,50_.jpg",
    "https://m.media-amazon.com/images/I/518CS81EFXL._AC_SR38,50_.jpg",
    "https://m.media-amazon.com/images/I/413EWAtny9L.SX38_SY50_CR,0,0,38,50_BG85,85,85_BR-120_PKdp-play-icon-overlay__.jpg",
    "https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/transparent-pixel._V192234675_.gif"
  ],
  "asin": "B0CC1F4V7Q",
  "merchant": "Minghutech-US",
  "categories": [
    "Cell Phones & Accessories",
    "Cases, Holsters & Sleeves",
    "Basic Cases"
  ],
  "url": "https://www.amazon.com/ESR-Compatible-Military-Grade-Protection-Scratch-Resistant/dp/B0CC1F4V7Q",
  "scrapedAt": "2025-10-30T10:20:16Z"
}
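
The `price` and `currency` fields in the output come from the price-parsing step of the script above. To make its behavior concrete, the sketch below repeats the same regular expression in a standalone function and runs it on a few strings. Apart from "$12.99", the sample inputs are invented for illustration.

```python
import re

# Same pattern as parse_price_from_text in the script: an optional leading
# currency token, then a number with optional thousands separators and cents.
PRICE_RE = re.compile(r"([^\d.,\s]+)?\s*([\d,]+\.\d{1,2}|[\d,]+)")


def parse_price(price_raw):
    """Return (currency, price) parsed from a raw price string, or (None, None)."""
    if not price_raw:
        return None, None
    m = PRICE_RE.search(price_raw)
    if not m:
        return None, None
    currency = m.group(1).strip() if m.group(1) else None
    try:
        price = float(m.group(2).replace(",", ""))  # commas are treated as thousands separators
    except ValueError:
        price = None
    return currency, price


if __name__ == "__main__":
    for raw in ("$12.99", "EUR 1,299", "12,99 €", "In stock"):
        print(repr(raw), "->", parse_price(raw))
```

Because commas are always treated as thousands separators and the currency is only looked for before the number, a comma-decimal string such as "12,99 €" parses as 1299.0 with no currency. The pattern fits US-style prices like the one in the sample output; other locales would need a different rule.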

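This example shows how DynamicSession and Scrapeless work together to create a stable, reusable, long-lived session environment.

Within the same session, you can request multiple pages without restarting the browser, keep login state, cookies, and local storage, and get profile isolation and session persistence.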
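
As a rough illustration of reusing one session for several pages, the sketch below fetches a list of URLs through a single open DynamicSession. It uses only calls that already appear in the examples above (`DynamicSession(cdp_url=..., disable_resources=...)`, `fetch(..., network_idle=True)`, and `page.body`). The `fetch_all` and `run_demo` helpers and the second httpbin URL are assumptions added for illustration.

```python
from urllib.parse import urlencode


def fetch_all(session, urls, **fetch_kwargs):
    """Fetch each URL through the same open session and return {url: body length}."""
    sizes = {}
    for url in urls:
        page = session.fetch(url, **fetch_kwargs)
        sizes[url] = len(page.body)
    return sizes


def run_demo():
    # Imported here so fetch_all can be reused or tested without Scrapling installed.
    from scrapling.fetchers import DynamicSession

    config = {
        "token": "YOUR_API_KEY",
        "sessionName": "scrapling-session",
        "sessionTTL": "300",
        "proxyCountry": "ANY",
        "sessionRecording": "false",
    }
    ws_endpoint = f"wss://browser.scrapeless.com/api/v2/browser?{urlencode(config)}"
    urls = ["https://httpbin.org/headers", "https://httpbin.org/cookies"]  # illustrative targets

    # One session and one cloud browser connection serve every request in the loop.
    with DynamicSession(cdp_url=ws_endpoint, disable_resources=True) as s:
        for url, size in fetch_all(s, urls, network_idle=True).items():
            print(f"{url}: content length {size}")


# run_demo()  # uncomment to run against the live service with a real API key
```

Keeping the fetch loop in a function that takes the session as an argument means the browser connection is opened once by the `with` block and shared by every request.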