OpenSource-Hub

Scrapling

Framework

D4Vinci/Scrapling

An adaptive web scraping framework that automatically bypasses anti-bot protections and relocates elements.

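Project Overview

Scrapling is an adaptive web scraping framework that handles everything from a single request to a large-scale crawl. Its parser learns from website changes and automatically relocates elements when pages update. It ships with built-in bypasses for anti-bot systems such as Cloudflare Turnstile, and it supports concurrent multi-session crawling and proxy rotation.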
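To make the idea of relocating an element after a redesign more concrete, here is a minimal standard-library sketch. It is not Scrapling's algorithm or API: the `Element`, `parse`, `similarity`, and `relocate` names, the equal-weight scoring, and the sample HTML are all assumptions made for illustration. It saves a fingerprint of an element (tag, attributes, ancestor path, text) and later picks the most similar element on the changed page.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never get an end tag


@dataclass
class Element:
    """Fingerprint of one HTML element (hypothetical structure for this sketch)."""
    tag: str
    attrs: dict
    path: tuple  # ancestor tag names, outermost first
    text: str = ""


class _Collector(HTMLParser):
    """Records every element together with its ancestor path and direct text."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._stack = []

    def handle_starttag(self, tag, attrs):
        el = Element(tag, dict(attrs), tuple(e.tag for e in self._stack))
        self.elements.append(el)
        if tag not in VOID_TAGS:
            self._stack.append(el)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerate unclosed children.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        if self._stack and data.strip():
            self._stack[-1].text += data.strip()


def parse(html):
    collector = _Collector()
    collector.feed(html)
    return collector.elements


def _ratio(a, b):
    return SequenceMatcher(None, a, b).ratio()


def _attr_string(el):
    return " ".join(f"{k}={v}" for k, v in sorted(el.attrs.items()))


def similarity(saved, candidate):
    """Unweighted mean of tag, attribute, path, and text similarity (each 0..1)."""
    scores = [
        1.0 if saved.tag == candidate.tag else 0.0,
        _ratio(_attr_string(saved), _attr_string(candidate)),
        _ratio(saved.path, candidate.path),
        _ratio(saved.text, candidate.text),
    ]
    return sum(scores) / len(scores)


def relocate(saved, html):
    """Return the element in `html` that best matches the saved fingerprint."""
    return max(parse(html), key=lambda el: similarity(saved, el))


# Made-up pages: the same product before and after a redesign.
OLD_HTML = ('<div class="product"><h2 class="title">Blue Mug</h2>'
            '<span class="price">$10</span></div>')
NEW_HTML = ('<section class="card"><h3 class="card-title">Blue Mug</h3>'
            '<p class="card-price">$10</p>'
            '<a class="buy" href="/cart">Buy</a></section>')

# "Save" the price element from the old page, then find it on the new one.
saved = next(el for el in parse(OLD_HTML) if el.attrs.get("class") == "price")
found = relocate(saved, NEW_HTML)
print(found.tag, found.attrs, found.text)
```

In this toy case the old `.price` selector no longer matches anything, yet the saved fingerprint still leads to the renamed price element because its text and most of its class name survived. A real implementation would need thresholds and better weighting so that it can also report that no good match exists.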

README Preview

Effortless Web Scraping for the Modern Web

Selection methods · Fetchers · Spiders · Proxy Rotation · CLI · MCP

Scrapling is an adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl.

Its parser learns from website changes and automatically relocates your elements when pages update. Its fetchers bypass anti-bot systems like Cloudflare Turnstile out of the box. And its spider framework lets you scale up to concurrent, multi-session crawls with pause/resume and automatic proxy rotation - all in a few lines of Python. One library, zero compromises.

Blazing fast crawls with real-time stats and streaming. Built by Web Scrapers for Web Scrapers and regular users, there's something for everyone.

```python
from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
StealthyFetcher.adaptive = True
p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)  # Fetch website under the radar!
products = p.css('.product', auto_save=True)                                        # Scrape data that survives website design changes!
products = p.css('.product', adaptive=True)                                         # Later, if the website structure changes, pass `adaptive=True` to find them!
```
Or scale up to full crawls
```python
from scrapling.spiders import Spider, Response

class MySpider(Spider):
  name = "demo"
  start_urls = ["https://example.com/"]

  async def parse(self, response: Response):
      for item in response.css('.product'):
          ...  # the README preview is cut off at this point
```
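The README credits the spider framework with automatic proxy rotation. As an illustration of that general idea only, and not of Scrapling's implementation or API, the sketch below shows a round-robin proxy pool that retires a proxy after repeated failures. The `ProxyRotator` class, its method names, the `max_failures` threshold, and the proxy URLs are all hypothetical.

```python
from collections import deque


class ProxyRotator:
    """Round-robin proxy pool that drops proxies which keep failing.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self, proxies, max_failures=2):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = deque(proxies)
        self._failures = {proxy: 0 for proxy in proxies}
        self._max_failures = max_failures

    def next(self):
        """Return the next healthy proxy and move it to the back of the queue."""
        if not self._pool:
            raise RuntimeError("no healthy proxies left")
        proxy = self._pool[0]
        self._pool.rotate(-1)
        return proxy

    def report_failure(self, proxy):
        """Count a failed request; retire the proxy once it hits the limit."""
        self._failures[proxy] += 1
        if self._failures[proxy] >= self._max_failures and proxy in self._pool:
            self._pool.remove(proxy)

    def report_success(self, proxy):
        """A successful request resets the proxy's failure count."""
        self._failures[proxy] = 0


# Placeholder proxy addresses on a reserved example domain.
rotator = ProxyRotator([
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
    "http://proxy-c.example:8080",
])

for _ in range(4):
    print(rotator.next())
```

A crawler would call `next()` before each request and report the outcome afterwards, so that a blocked or dead proxy stops receiving traffic while the rest keep rotating.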