Every year someone declares web scraping is dead. And every year, there's more data to extract, more APIs to reverse engineer, and more anti-bot systems to navigate. The tools have evolved - here's an honest assessment of what works in 2026, when to use each, and what I actually reach for in production.
The Landscape
| Tool | Type | Best For | Anti-Bot Bypass |
|---|---|---|---|
| requests | HTTP client | Simple APIs, no protection | None |
| HTTPX | Async HTTP client | High-volume async scraping | None |
| curl_cffi | TLS-spoofing HTTP client | Sites with TLS fingerprinting | TLS/JA3 |
| Scrapy | Framework | Large-scale structured crawling | Basic |
| Playwright | Browser automation | JS-heavy sites, SPAs | Moderate |
| BeautifulSoup | HTML parser | Parsing static HTML | N/A (parser only) |
| lxml | HTML/XML parser | High-performance parsing | N/A (parser only) |
No single tool covers everything. The right choice depends on what you're scraping and what's protecting it.
requests: Still the Starting Point
Seven years after its creator stepped back, requests is still the first thing I reach for. Not because it's the best - but because it's the simplest:
```python
import requests

response = requests.get("https://api.example.com/data")
data = response.json()
```

Use when: The target has no bot protection, you need a quick script, or you're hitting an API with an API key.
Don't use when: You need async, the site has TLS fingerprinting, or you need HTTP/2. requests is HTTP/1.1 only and its TLS fingerprint is immediately recognizable.
HTTPX: The Modern requests
HTTPX is what requests would be if it were designed today. Same clean API, plus async support and HTTP/2:
```python
import httpx
import asyncio

async def scrape():
    async with httpx.AsyncClient(http2=True) as client:
        response = await client.get("https://example.com")
        return response.text

data = asyncio.run(scrape())
```

Use when: You need async (high-volume scraping), HTTP/2 support, or you want a modern requests replacement.
Don't use when: The target uses TLS fingerprinting. HTTPX's TLS signature is as recognizable as requests'. For protected sites, you still need curl_cffi.
The async support is HTTPX's killer feature. With semaphore-based concurrency, I regularly hit 100-200 requests/second for targets that allow it.
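That pattern looks roughly like this - a minimal sketch where a simulated fetch stands in for the real `client.get()` call, and the URL list and 20-slot limit are arbitrary placeholders:

```python
import asyncio

async def fetch(slot: asyncio.Semaphore, url: str) -> str:
    # Acquire a slot before starting, so at most `limit`
    # requests are in flight at any moment.
    async with slot:
        # Real code would `await client.get(url)` on a shared
        # httpx.AsyncClient here; we simulate the round-trip.
        await asyncio.sleep(0.01)
        return f"body of {url}"

async def scrape_all(urls: list[str], limit: int = 20) -> list[str]:
    sem = asyncio.Semaphore(limit)
    return await asyncio.gather(*(fetch(sem, u) for u in urls))

results = asyncio.run(scrape_all([f"https://example.com/{i}" for i in range(50)]))
```

The semaphore is the whole trick: `asyncio.gather` launches everything at once, and the semaphore throttles how many actually run, which is what keeps you fast without hammering the target.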
curl_cffi: The Anti-Bot Weapon
curl_cffi wraps curl-impersonate, a patched version of curl that mimics real browser TLS fingerprints. This is the tool that changed the game for scraping protected sites:
```python
from curl_cffi import requests

response = requests.get(
    "https://protected-site.com",
    impersonate="chrome"
)
```

That single impersonate="chrome" parameter makes your request's TLS handshake identical to Chrome's - matching JA3 fingerprint, HTTP/2 SETTINGS, header order, and cipher suites.
Use when: The target uses Cloudflare, Akamai, PerimeterX, or any system that does TLS fingerprinting. This handles 80% of "why am I getting 403'd" situations.
Don't use when: The target requires JavaScript execution (Turnstile challenges, SPAs). curl_cffi is an HTTP client - it doesn't run JavaScript.
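One practical pattern worth sketching: curl_cffi ships multiple impersonation targets, and when one profile starts drawing 403s, retrying with another often gets through. The helper below is generic - it takes any fetch callable, so the curl_cffi call itself (e.g. `lambda u, p: requests.get(u, impersonate=p).status_code`) is an assumption you'd supply, and the profile names are illustrative:

```python
from typing import Callable, Optional

def fetch_with_fallback(
    fetch: Callable[[str, str], int],
    url: str,
    profiles: tuple[str, ...] = ("chrome", "safari", "firefox"),
) -> Optional[str]:
    """Try each impersonation profile until one gets past a 403.

    `fetch(url, profile)` should return the HTTP status code.
    Returns the first profile that worked, or None if all were blocked.
    """
    for profile in profiles:
        if fetch(url, profile) != 403:
            return profile
    return None
```

Keeping the fetch callable separate also makes the fallback logic trivially testable without touching the network.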
Async support:
```python
from curl_cffi import requests

async def scrape():
    async with requests.AsyncSession(impersonate="chrome") as session:
        response = await session.get("https://protected-site.com")
        return response.text
```

Scrapy: The Framework
Scrapy isn't a library - it's a framework. It handles request scheduling, concurrency, retries, data pipelines, and output formatting. For large-scale crawling jobs, nothing else comes close:
```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://shop.example.com/products']

    def parse(self, response):
        for product in response.css('.product-card'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
                'url': product.css('a::attr(href)').get(),
            }

        # Follow pagination
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
```

Use when: You're crawling thousands of pages with a structured pattern, need built-in request queuing and deduplication, or need to export data to multiple formats (JSON, CSV, database).
Don't use when: You need to scrape a single API endpoint, the site requires browser automation, or the project is small enough that async HTTPX covers it. Scrapy's overhead isn't worth it for simple jobs.
The Scrapy ecosystem is also valuable - scrapy-splash for JavaScript rendering, scrapy-rotating-proxies for proxy rotation, and scrapy-fake-useragent for UA rotation.
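The multi-format export mentioned above needs no extra code at all. As a sketch (the FEEDS setting is standard Scrapy feed-export configuration; the file names are placeholders), the settings file can declare output targets directly:

```python
# settings.py - Scrapy feed exports (Scrapy 2.1+).
# Items are written to both files as they are scraped.
FEEDS = {
    "products.json": {"format": "json", "overwrite": True},
    "products.csv": {"format": "csv"},
}
```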
Playwright: When You Need a Real Browser
Some sites can't be scraped with HTTP clients alone. SPAs that render content with JavaScript, sites with Turnstile or reCAPTCHA challenges, or targets that do heavy browser fingerprinting. Playwright automates a real Chromium, Firefox, or WebKit browser:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://spa-app.com/products")

    # Wait for data to render
    page.wait_for_selector('.product-list')

    # Extract data
    products = page.query_selector_all('.product-card')
    for product in products:
        name = product.query_selector('h2').inner_text()
        price = product.query_selector('.price').inner_text()
        print(name, price)

    browser.close()
```

Use when: Content is rendered by JavaScript, you need to interact with the page (click, scroll, fill forms), or the site requires a full browser environment to pass bot detection.
Don't use when: The data is available via API (check the Network tab first), or you need high throughput. Browser automation is 10-100x slower than HTTP-level scraping and uses significantly more memory.
The Network Tab First Rule
Before reaching for Playwright, always check the browser's Network tab. Most "JavaScript-rendered" content actually comes from an API endpoint that returns JSON. If you can find that endpoint, scrape it with HTTPX or curl_cffi instead - it's faster, more reliable, and uses fewer resources.
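A related case: sometimes there's no separate endpoint because the JSON is embedded in the initial HTML inside a script tag (Next.js's `__NEXT_DATA__` is the classic example). A minimal standard-library sketch of pulling it out - the tag id and the sample page here are assumptions, and for messy HTML you'd parse with BeautifulSoup instead of a regex:

```python
import json
import re

def extract_embedded_json(page_html: str, script_id: str = "__NEXT_DATA__") -> dict:
    """Pull a JSON blob out of a <script id=...> tag."""
    pattern = rf'<script[^>]*id="{re.escape(script_id)}"[^>]*>(.*?)</script>'
    match = re.search(pattern, page_html, re.DOTALL)
    if match is None:
        raise ValueError(f"no <script id={script_id!r}> tag found")
    return json.loads(match.group(1))

sample = ('<html><script id="__NEXT_DATA__" type="application/json">'
          '{"props": {"items": [1, 2]}}</script></html>')
data = extract_embedded_json(sample)
```

If that blob is there, you get structured data with zero browser overhead - same win as finding a hidden API.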
BeautifulSoup vs. lxml
Both are HTML parsers - they don't fetch data, they parse it. The choice:
BeautifulSoup - Forgiving parser, handles malformed HTML well, Pythonic API:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
titles = [h2.text for h2 in soup.find_all('h2', class_='title')]
```

lxml - 10-50x faster, stricter parsing, supports XPath:
```python
from lxml import html

tree = html.fromstring(html_content)
titles = tree.xpath('//h2[@class="title"]/text()')
```

For small pages, it doesn't matter. For parsing millions of pages, lxml's speed advantage is significant.
My Decision Tree
```
Does the site have bot protection?
├── No → requests or HTTPX (async)
└── Yes → What kind?
    ├── TLS fingerprinting (403s) → curl_cffi
    ├── JavaScript challenges → Playwright
    └── Both → curl_cffi first, Playwright if needed

Is the content rendered by JavaScript?
├── No → HTTP client + BeautifulSoup/lxml
└── Yes → Check Network tab for API
    ├── API found → HTTP client + JSON parsing
    └── No API → Playwright

Scale?
├── <1,000 pages → Simple async script
├── 1,000-100,000 → Async scraper with semaphores
└── 100,000+ → Scrapy with pipelines
```

What I Actually Use (Daily)
On a typical work week:
- 60% curl_cffi - Most targets I work with have some level of bot protection
- 25% HTTPX - Clean APIs, internal tools, targets without protection
- 10% Playwright - SPAs and targets requiring full browser interaction
- 5% Scrapy - Large-scale structured crawling jobs
I rarely use plain requests anymore. HTTPX is a drop-in upgrade with async and HTTP/2 support, and curl_cffi handles the protected sites.
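The decision tree above condenses into a few lines of code. A toy sketch - the function and its labels are mine, not a real API:

```python
def pick_tool(tls_fingerprinting: bool, js_challenge: bool,
              js_rendered: bool, api_found: bool) -> str:
    """Map the decision tree onto a first tool to try."""
    if js_challenge or (js_rendered and not api_found):
        return "playwright"   # genuinely needs a real browser
    if tls_fingerprinting:
        return "curl_cffi"    # HTTP client with a browser TLS fingerprint
    return "httpx"            # no protection: plain async HTTP

# A JS-heavy SPA whose data turns out to come from a JSON endpoint:
tool = pick_tool(tls_fingerprinting=False, js_challenge=False,
                 js_rendered=True, api_found=True)
```

Note the ordering: a found API short-circuits the browser branch, which is the Network-tab-first rule expressed as code.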
Key Takeaways
- Always check the Network tab before choosing a tool - most sites have hidden APIs
- curl_cffi with impersonate="chrome" solves 80% of bot detection issues
- HTTPX is the modern default for unprotected targets
- Playwright is the last resort, not the first choice - it's slow and resource-heavy
- Scrapy is for large-scale crawling, not quick scripts
- Combine tools: curl_cffi for fetching, BeautifulSoup for parsing
- The right tool depends on what protects the target, not what framework is trending
The best scraper is the simplest one that works. Start with an HTTP client, escalate only when you need to.
Always respect robots.txt and terms of service. Use these tools responsibly and with authorization.