Every year someone declares web scraping is dead. And every year, there's more data to extract, more APIs to reverse engineer, and more anti-bot systems to navigate. The tools have evolved - here's an honest assessment of what works in 2026, when to use each, and what I actually reach for in production.
The Landscape
| Tool | Type | Best For | Anti-Bot Bypass |
|---|---|---|---|
| requests | HTTP client | Simple APIs, no protection | None |
| HTTPX | Async HTTP client | High-volume async scraping | None |
| curl_cffi | TLS-spoofing HTTP client | Sites with TLS fingerprinting | TLS/JA3 |
| Scrapy | Framework | Large-scale structured crawling | Basic |
| Playwright | Browser automation | JS-heavy sites, SPAs | Moderate |
| BeautifulSoup | HTML parser | Parsing static HTML | N/A (parser only) |
| lxml | HTML/XML parser | High-performance parsing | N/A (parser only) |
No single tool covers everything. The right choice depends on what you're scraping and what's protecting it.
requests: Still the Starting Point
Seven years after its creator stepped back, requests is still the first thing I reach for. Not because it's the best - but because it's the simplest:
```python
import requests

response = requests.get("https://api.example.com/data")
data = response.json()
```

Use when: The target has no bot protection, you need a quick script, or you're hitting an API with an API key.
Don't use when: You need async, the site has TLS fingerprinting, or you need HTTP/2. requests is HTTP/1.1 only and its TLS fingerprint is immediately recognizable.
HTTPX: The Modern requests
HTTPX is what requests would be if it were designed today. Same clean API, plus async support and HTTP/2:
```python
import httpx
import asyncio

async def scrape():
    async with httpx.AsyncClient(http2=True) as client:
        response = await client.get("https://example.com")
        return response.text

data = asyncio.run(scrape())
```

Use when: You need async (high-volume scraping), HTTP/2 support, or you want a modern requests replacement.
Don't use when: The target uses TLS fingerprinting. HTTPX's TLS signature is as recognizable as requests'. For protected sites, you still need curl_cffi.
The async support is HTTPX's killer feature. With semaphore-based concurrency, I regularly hit 100-200 requests/second for targets that allow it.
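That pattern looks roughly like this - a minimal sketch where a simulated fetch stands in for the real `client.get()` call, and the URL list and 20-slot limit are arbitrary placeholders:

```python
import asyncio

async def fetch(slot: asyncio.Semaphore, url: str) -> str:
    # Acquire a slot before starting, so at most `limit`
    # requests are in flight at any moment.
    async with slot:
        # Real code would `await client.get(url)` on a shared
        # httpx.AsyncClient here; we simulate the round-trip.
        await asyncio.sleep(0.01)
        return f"body of {url}"

async def scrape_all(urls: list[str], limit: int = 20) -> list[str]:
    sem = asyncio.Semaphore(limit)
    return await asyncio.gather(*(fetch(sem, u) for u in urls))

results = asyncio.run(scrape_all([f"https://example.com/{i}" for i in range(50)]))
```

The semaphore is the whole trick: `asyncio.gather` launches everything at once, and the semaphore throttles how many actually run, which is what keeps you fast without hammering the target.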
curl_cffi: The Anti-Bot Weapon
curl_cffi wraps curl-impersonate, a patched version of curl that mimics real browser TLS fingerprints. This is the tool that changed the game for scraping protected sites:
```python
from curl_cffi import requests

response = requests.get(
    "https://protected-site.com",
    impersonate="chrome"
)
```

That single impersonate="chrome" parameter makes your request's TLS handshake identical to Chrome's - matching JA3 fingerprint, HTTP/2 SETTINGS, header order, and cipher suites.
Use when: The target uses Cloudflare, Akamai, PerimeterX, or any system that does TLS fingerprinting. This handles 80% of "why am I getting 403'd" situations.
Don't use when: The target requires JavaScript execution (Turnstile challenges, SPAs). curl_cffi is an HTTP client - it doesn't run JavaScript.
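One practical pattern worth sketching: curl_cffi ships multiple impersonation targets, and when one profile starts drawing 403s, retrying with another often gets through. The helper below is generic - it takes any fetch callable, so the curl_cffi call itself (e.g. `lambda u, p: requests.get(u, impersonate=p).status_code`) is an assumption you'd supply, and the profile names are illustrative:

```python
from typing import Callable, Optional

def fetch_with_fallback(
    fetch: Callable[[str, str], int],
    url: str,
    profiles: tuple[str, ...] = ("chrome", "safari", "firefox"),
) -> Optional[str]:
    """Try each impersonation profile until one gets past a 403.

    `fetch(url, profile)` should return the HTTP status code.
    Returns the first profile that worked, or None if all were blocked.
    """
    for profile in profiles:
        if fetch(url, profile) != 403:
            return profile
    return None
```

Keeping the fetch callable separate also makes the fallback logic trivially testable without touching the network.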
Async support:
```python
from curl_cffi import requests

async def scrape():
    async with requests.AsyncSession(impersonate="chrome") as session:
        response = await session.get("https://protected-site.com")
        return response.text
```

Scrapy: The Framework
Scrapy isn't a library - it's a framework. It handles request scheduling, concurrency, retries, data pipelines, and output formatting. For large-scale crawling jobs, nothing else comes close:
```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://shop.example.com/products']

    def parse(self, response):
        for product in response.css('.product-card'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
                'url': product.css('a::attr(href)').get(),
            }

        # Follow pagination
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
```

Use when: You're crawling thousands of pages with a structured pattern, need built-in request queuing and deduplication, or need to export data to multiple formats (JSON, CSV, database).
Don't use when: You need to scrape a single API endpoint, the site requires browser automation, or the project is small enough that async HTTPX covers it. Scrapy's overhead isn't worth it for simple jobs.
The Scrapy ecosystem is also valuable - scrapy-splash for JavaScript rendering, scrapy-rotating-proxies for proxy rotation, and scrapy-fake-useragent for UA rotation.
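The multi-format export mentioned above needs no extra code at all. As a sketch (the FEEDS setting is standard Scrapy feed-export configuration; the file names are placeholders), the settings file can declare output targets directly:

```python
# settings.py - Scrapy feed exports (Scrapy 2.1+).
# Items are written to both files as they are scraped.
FEEDS = {
    "products.json": {"format": "json", "overwrite": True},
    "products.csv": {"format": "csv"},
}
```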
Playwright: When You Need a Real Browser
Some sites can't be scraped with HTTP clients alone. SPAs that render content with JavaScript, sites with Turnstile or reCAPTCHA challenges, or targets that do heavy browser fingerprinting. Playwright automates a real Chromium, Firefox, or WebKit browser:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://spa-app.com/products")

    # Wait for data to render
    page.wait_for_selector('.product-list')

    # Extract data
    products = page.query_selector_all('.product-card')
    for product in products:
        name = product.query_selector('h2').inner_text()
        price = product.query_selector('.price').inner_text()
        print(name, price)

    browser.close()
```

Use when: Content is rendered by JavaScript, you need to interact with the page (click, scroll, fill forms), or the site requires a full browser environment to pass bot detection.
Don't use when: The data is available via API (check the Network tab first), or you need high throughput. Browser automation is 10-100x slower than HTTP-level scraping and uses significantly more memory.
The Network Tab First Rule
Before reaching for Playwright, always check the browser's Network tab. Most "JavaScript-rendered" content actually comes from an API endpoint that returns JSON. If you can find that endpoint, scrape it with HTTPX or curl_cffi instead - it's faster, more reliable, and uses fewer resources.
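A related case: sometimes there's no separate endpoint because the JSON is embedded in the initial HTML inside a script tag (Next.js's `__NEXT_DATA__` is the classic example). A minimal standard-library sketch of pulling it out - the tag id and the sample page here are assumptions, and for messy HTML you'd parse with BeautifulSoup instead of a regex:

```python
import json
import re

def extract_embedded_json(page_html: str, script_id: str = "__NEXT_DATA__") -> dict:
    """Pull a JSON blob out of a <script id=...> tag."""
    pattern = rf'<script[^>]*id="{re.escape(script_id)}"[^>]*>(.*?)</script>'
    match = re.search(pattern, page_html, re.DOTALL)
    if match is None:
        raise ValueError(f"no <script id={script_id!r}> tag found")
    return json.loads(match.group(1))

sample = ('<html><script id="__NEXT_DATA__" type="application/json">'
          '{"props": {"items": [1, 2]}}</script></html>')
data = extract_embedded_json(sample)
```

If that blob is there, you get structured data with zero browser overhead - same win as finding a hidden API.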
BeautifulSoup vs. lxml
Both are HTML parsers - they don't fetch data, they parse it. The choice:
BeautifulSoup - Forgiving parser, handles malformed HTML well, Pythonic API:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
titles = [h2.text for h2 in soup.find_all('h2', class_='title')]
```

lxml - 10-50x faster, stricter parsing, supports XPath:
```python
from lxml import html

tree = html.fromstring(html_content)
titles = tree.xpath('//h2[@class="title"]/text()')
```

For small pages, it doesn't matter. For parsing millions of pages, lxml's speed advantage is significant.
My Decision Tree
```
Does the site have bot protection?
├── No → requests or HTTPX (async)
└── Yes → What kind?
    ├── TLS fingerprinting (403s) → curl_cffi
    ├── JavaScript challenges → Playwright
    └── Both → curl_cffi first, Playwright if needed

Is the content rendered by JavaScript?
├── No → HTTP client + BeautifulSoup/lxml
└── Yes → Check Network tab for API
    ├── API found → HTTP client + JSON parsing
    └── No API → Playwright

Scale?
├── <1,000 pages → Simple async script
├── 1,000-100,000 → Async scraper with semaphores
└── 100,000+ → Scrapy with pipelines
```

What I Actually Use (Daily)
On a typical work week:
- 60% curl_cffi - Most targets I work with have some level of bot protection
- 25% HTTPX - Clean APIs, internal tools, targets without protection
- 10% Playwright - SPAs and targets requiring full browser interaction
- 5% Scrapy - Large-scale structured crawling jobs
I rarely use plain requests anymore. HTTPX is a drop-in upgrade with async and HTTP/2 support, and curl_cffi handles the protected sites.
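The decision tree above condenses into a few lines of code. A toy sketch - the function and its labels are mine, not a real API:

```python
def pick_tool(tls_fingerprinting: bool, js_challenge: bool,
              js_rendered: bool, api_found: bool) -> str:
    """Map the decision tree onto a first tool to try."""
    if js_challenge or (js_rendered and not api_found):
        return "playwright"   # genuinely needs a real browser
    if tls_fingerprinting:
        return "curl_cffi"    # HTTP client with a browser TLS fingerprint
    return "httpx"            # no protection: plain async HTTP

# A JS-heavy SPA whose data turns out to come from a JSON endpoint:
tool = pick_tool(tls_fingerprinting=False, js_challenge=False,
                 js_rendered=True, api_found=True)
```

Note the ordering: a found API short-circuits the browser branch, which is the Network-tab-first rule expressed as code.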
Key Takeaways
- Always check the Network tab before choosing a tool - most sites have hidden APIs
- curl_cffi with impersonate="chrome" solves 80% of bot detection issues
- HTTPX is the modern default for unprotected targets
- Playwright is the last resort, not the first choice - it's slow and resource-heavy
- Scrapy is for large-scale crawling, not quick scripts
- Combine tools: curl_cffi for fetching, BeautifulSoup for parsing
- The right tool depends on what protects the target, not what framework is trending
The best scraper is the simplest one that works. Start with an HTTP client, escalate only when you need to.
Always respect robots.txt and terms of service. Use these tools responsibly and with authorization.