Scrapy Playwright Tutorial: Scraping JavaScript-Heavy Sites the Right Way

By Jesse Lewis · 2/18/2026 · 5 min read

Modern sites lean on JavaScript rendering, background API calls, and client-side routing. If you scrape them using HTTP-only crawlers, you often get empty HTML shells, missing data, or endless anti-bot friction.

This Scrapy Playwright tutorial shows how to scrape JavaScript-heavy sites correctly in 2026 by combining Scrapy’s crawl engine with Playwright’s browser rendering. You will learn production patterns for rendering, pagination, proxy routing, performance tuning, and stability.

Why Scrapy alone fails on JavaScript-heavy sites

Scrapy is excellent at fetching server-rendered HTML and parsing it quickly. The problem is that many modern sites:

  • Render content after page load using React, Vue, or Angular
  • Fetch the real data via XHR or fetch calls
  • Use infinite scroll instead of server pagination
  • Gate key elements behind client-side checks

In these cases, Scrapy sees the initial HTML skeleton and none of the final DOM.

If you are still deciding whether you need a browser at all, start by comparing a browser renderer to an async request pipeline using async scraping patterns in Python automation. If HTTP-only collectors can reach the same data through an API or embedded JSON, you will save significant compute.

What Scrapy Playwright is

Scrapy Playwright is a Scrapy integration that renders pages using Playwright before Scrapy parses them.

High-level flow:

  • Scrapy schedules a request
  • Playwright opens the page and runs JavaScript
  • Scrapy receives the rendered DOM as the response
  • Your spider parses and yields items

This gives you Scrapy’s crawling, pipelines, and retries, while gaining browser-based rendering.

Installing Scrapy Playwright

Install packages:

pip install scrapy playwright scrapy-playwright
playwright install

Enable Playwright in settings.py:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

PLAYWRIGHT_BROWSER_TYPE = "chromium"  # or firefox, webkit

A minimal working spider

This example renders a page, waits for a selector, then parses the DOM.

import scrapy
from scrapy.selector import Selector

class JsSpider(scrapy.Spider):
    name = "js_spider"
    start_urls = ["https://example-js-site.com"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={
                    "playwright": True,
                    "playwright_include_page": True,
                },
            )

    async def parse(self, response):
        page = response.meta["playwright_page"]

        try:
            # Wait until the JS-rendered content exists
            await page.wait_for_selector(".product-card")

            # Re-read the DOM: the response body was captured
            # before the wait ran, so parse the live page content
            selector = Selector(text=await page.content())

            for card in selector.css(".product-card"):
                yield {
                    "title": card.css(".title::text").get(),
                    "price": card.css(".price::text").get(),
                }
        finally:
            # Always close the page to prevent memory leaks
            await page.close()

Practical notes:

  • Use wait_for_selector to avoid parsing too early
  • Re-read the DOM with page.content() after waiting, since the response body is captured before your waits run
  • Always close the page, even on errors, to prevent memory leaks
  • Keep selectors stable and defensive
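To keep extraction defensive, it helps to normalize whatever .get() returns instead of assuming clean text. A minimal helper along these lines (illustrative, not part of scrapy-playwright):

```python
def clean_text(value):
    """Collapse whitespace from a selector's .get() result; pass None through."""
    if value is None:
        return None
    text = " ".join(value.split())
    return text or None

# Usage inside the parse loop:
#   "title": clean_text(card.css(".title::text").get()),
```

This turns missing elements, empty strings, and newline-padded text into predictable values before they reach your pipelines.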

Rendering strategy that stays stable in production

Browser rendering can be expensive. The goal is to use Playwright only where it is needed.

Use Playwright when:

  • Data appears only after JS execution
  • Pagination is client-side or infinite scroll
  • Key pages require clicks, tabs, or lazy loading

Avoid Playwright when:

  • The data is available via JSON endpoints
  • Pages are server-rendered
  • Costs or throughput are the top priority

If you need a framework for deciding, use the headless browser vs HTTP client comparison and then validate against your targets.
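One quick validation step is to check whether the raw HTML already carries server-embedded JSON, which means a browser is unnecessary. A sketch that probes for a Next.js-style __NEXT_DATA__ blob (the marker name and sample HTML are illustrative):

```python
import json
import re

def extract_embedded_json(html, marker="__NEXT_DATA__"):
    """Return the JSON embedded in a <script id=marker> tag, or None."""
    match = re.search(
        rf'<script[^>]*id="{re.escape(marker)}"[^>]*>(.*?)</script>',
        html,
        re.DOTALL,
    )
    return json.loads(match.group(1)) if match else None

sample = (
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"items": [1, 2]}}'
    "</script>"
)
data = extract_embedded_json(sample)
```

If this kind of probe finds structured data, an HTTP-only collector with a JSON parser beats a rendering tier on both cost and stability.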

Handling infinite scroll and lazy-loaded lists

Many ecommerce and directory sites append results only after scrolling.

A simple scroll loop:

# Scroll down and wait for more content
await page.evaluate("window.scrollBy(0, document.body.scrollHeight)")
await page.wait_for_timeout(1500)

Production tips:

  • Scroll in controlled steps, not huge jumps
  • Stop when you detect no new items
  • Treat scroll loops as a risk factor for 429 and 403
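The "stop when no new items" rule is easy to get wrong in an open-ended loop. One way to make it explicit is a small helper that watches the item count across scroll rounds (a sketch; the patience value and the locator shown in the comment are assumptions):

```python
def should_stop_scrolling(counts, patience=2):
    """Return True once the item count has not grown for `patience` rounds."""
    if len(counts) < patience + 1:
        return False
    recent = counts[-(patience + 1):]
    return len(set(recent)) == 1

# Inside the Playwright scroll loop (sketch):
#   counts.append(await page.locator(".product-card").count())
#   if should_stop_scrolling(counts):
#       break
```

Tracking counts this way also gives you a natural hook for logging how many scroll rounds each page needed, which feeds directly into the performance metrics discussed later.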

If you frequently hit blocks during scroll loading, you will want a structured debugging approach such as debugging scraper blocks.

Using proxies with Scrapy Playwright

Most real-world Playwright scrapers need proxy routing.

Configure a proxy at launch:

PLAYWRIGHT_LAUNCH_OPTIONS = {
    "proxy": {
        "server": "http://proxy-server:port",
        "username": "username",
        "password": "password",
    }
}

Note that Playwright expects the credentials as separate username and password keys rather than embedded in the server URL.

Operational guidance:

  • Avoid using one IP for everything
  • Separate aggressive crawls from login or session flows
  • Plan your IP pool size around concurrency and target sensitivity
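To keep aggressive crawls and login or session flows on separate IPs, scrapy-playwright can run multiple browser contexts, each with its own proxy, via the PLAYWRIGHT_CONTEXTS setting. A settings.py sketch (hostnames and credentials are placeholders):

```python
# settings.py: one browser context per traffic profile, each on its own proxy
PLAYWRIGHT_CONTEXTS = {
    "crawl": {
        "proxy": {
            "server": "http://crawl-proxy:8000",
            "username": "user",
            "password": "pass",
        },
    },
    "session": {
        "proxy": {
            "server": "http://session-proxy:8000",
            "username": "user",
            "password": "pass",
        },
    },
}

# In the spider, pick the context per request:
#   yield scrapy.Request(url, meta={"playwright": True, "playwright_context": "crawl"})
```

This keeps fingerprint and cookie state separated along the same lines as your IP pools, which is exactly the separation the guidance above calls for.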

If your workload is high-volume and datacenter-friendly, review why datacenter proxies excel in high-volume automation and then validate capacity against the pricing plans.

Performance tuning that actually moves the needle

Block heavy assets early

Blocking images, fonts, and media reduces bandwidth and speeds up rendering.

async def intercept(route):
    if route.request.resource_type in ["image", "media", "font"]:
        await route.abort()
    else:
        await route.continue_()

await page.route("**/*", intercept)
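With scrapy-playwright specifically, the same blocking can also live in settings.py through the PLAYWRIGHT_ABORT_REQUEST option, which takes a predicate called for each in-page request. A sketch:

```python
# settings.py: abort heavy asset requests before they download
BLOCKED_TYPES = {"image", "media", "font"}

def should_abort_request(request):
    """Predicate for PLAYWRIGHT_ABORT_REQUEST: True means abort the request."""
    return request.resource_type in BLOCKED_TYPES

PLAYWRIGHT_ABORT_REQUEST = should_abort_request
```

Keeping the rule in settings means every rendered page gets it without per-spider route handlers.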

Control concurrency

Scrapy defaults that work for HTTP-only crawls will overload browser rendering.

Start conservative:

  • 5 to 10 concurrent pages per worker
  • Increase gradually while watching memory and p95 latency

Prefer domain batching

Group requests by domain so that:

  • Connection reuse improves performance
  • Fingerprint and session behavior stays coherent
  • Proxy rotation is easier to reason about

For large crawls, planning the number of IPs and concurrency together prevents runaway retries. Use how many proxies for large crawls as a sizing reference.
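As a rough back-of-the-envelope, pool size has to cover both simultaneous connections and per-IP rate limits. This heuristic is an assumption-laden sketch, not a formula from the linked sizing guide:

```python
import math

def estimate_pool_size(concurrency, target_rpm, safe_rpm_per_ip):
    """Pool must cover concurrent connections AND the per-IP request budget."""
    by_rate = math.ceil(target_rpm / safe_rpm_per_ip)
    return max(concurrency, by_rate)

# e.g. 10 concurrent pages, 600 requests/min overall, ~20 req/min safe per IP
estimate_pool_size(10, 600, 20)  # 30
```

Whichever constraint dominates, the estimate gives you a floor to validate against real block rates before scaling up.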

Anti-bot friction and how to reduce it responsibly

Common failure patterns include:

  • Soft blocks that return partial HTML
  • Interstitial challenges
  • Rate limiting disguised as slow responses
  • Fingerprint mismatch between locale, headers, and IP geo

Stability tactics:

  • Keep sessions stable for multi-step flows
  • Use realistic user agents and locale headers
  • Align IP geo with language and timezone
  • Reduce burst concurrency on sensitive endpoints

If you suspect stealth and fingerprinting issues are dominating, use fingerprinting vs proxying to separate what proxies can solve from what browser behavior must solve.

Scaling Scrapy Playwright in production

For large jobs, treat browser rendering as a specialized workload.

A practical scaling approach:

  1. Put JS-heavy targets into dedicated queues
  2. Run separate workers for browser rendering
  3. Use container limits to prevent memory exhaustion
  4. Instrument outcomes per domain and proxy pool
  5. Implement fallbacks when rendering fails
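Step 1, routing JS-heavy targets into dedicated queues, can be as simple as deciding at request-build time which hosts get Playwright meta. A sketch with hypothetical domains:

```python
from urllib.parse import urlparse

# Hypothetical allowlist of hosts known to require browser rendering
JS_HEAVY_HOSTS = {"app.example-js-site.com", "shop.example-js-site.com"}

def build_meta(url):
    """Attach Playwright meta only for hosts that need rendering."""
    host = urlparse(url).hostname
    if host in JS_HEAVY_HOSTS:
        return {"playwright": True, "playwright_include_page": True}
    return {}
```

Centralizing the decision in one function makes it easy to move a domain between the cheap HTTP tier and the rendering tier as its behavior changes.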

Metrics to track:

  • Render time until the selector is present
  • Status code distribution including soft blocks
  • Block and challenge rate per domain
  • Timeouts and retries per proxy range
  • Cost per successful rendered page

Frequently asked questions

Is Scrapy Playwright slower than normal Scrapy?

Yes. Rendering in a browser is heavier than HTTP-only fetching. Use Playwright only when the data genuinely requires JavaScript execution.

Can you scrape at scale with Playwright?

Yes, but you must limit concurrency, block unnecessary assets, and separate JS-heavy workloads into dedicated workers.

Should you run headless or headed mode?

Headless is standard for production. Headed mode is best for debugging selectors, scroll behavior, and challenge flows.

Does Playwright support authenticated proxies?

Yes. You can configure authenticated proxies in PLAYWRIGHT_LAUNCH_OPTIONS and route browser traffic through your provider.

What is the fastest way to reduce Playwright cost?

Reduce rendering scope. Block heavy assets, avoid browser rendering when HTTP endpoints are sufficient, and lower concurrency until retries and timeouts stabilize.

Final thoughts

Scrapy Playwright is the bridge between modern JS-heavy sites and production-grade crawling. It unlocks content that HTTP-only collectors cannot see, but it requires discipline: conservative concurrency, smart proxy routing, and consistent sessions.

If you want the fastest path to a stable production setup, validate whether browser rendering is necessary for each target, then implement a dedicated rendering tier with strict limits and measurable success metrics. When you do this well, Scrapy Playwright becomes an asset instead of an expense.

About the Author

Jesse Lewis

Jesse Lewis is a researcher and content contributor for ProxiesThatWork, covering compliance trends, data governance, and the evolving relationship between AI and proxy technologies. He focuses on helping businesses stay compliant while deploying efficient, scalable data-collection pipelines.

© 2026 ProxiesThatWork LLC. All Rights Reserved.