Scrapy Playwright Tutorial: Scraping JavaScript-Heavy Sites the Right Way

By Jesse Lewis · 2/18/2026 · 5 min read

Modern sites lean on JavaScript rendering, background API calls, and client-side routing. If you scrape them using HTTP-only crawlers, you often get empty HTML shells, missing data, or endless anti-bot friction.

This Scrapy Playwright tutorial shows how to scrape JavaScript-heavy sites correctly in 2026 by combining Scrapy’s crawl engine with Playwright’s browser rendering. You will learn production patterns for rendering, pagination, proxy routing, performance tuning, and stability.

Why Scrapy alone fails on JavaScript-heavy sites

Scrapy is excellent at fetching server-rendered HTML and parsing it quickly. The problem is that many modern sites:

  • Render content after page load using React, Vue, or Angular
  • Fetch the real data via XHR or fetch calls
  • Use infinite scroll instead of server pagination
  • Gate key elements behind client-side checks

In these cases, Scrapy sees the initial HTML skeleton and none of the final DOM.

If you are still deciding whether you need a browser at all, start by comparing a browser renderer to an async request pipeline using async scraping patterns in Python automation. If HTTP-only collectors can reach the same data through an API or embedded JSON, you will save significant compute.

What Scrapy Playwright is

Scrapy Playwright is a Scrapy integration that renders pages using Playwright before Scrapy parses them.

High-level flow:

  • Scrapy schedules a request
  • Playwright opens the page and runs JavaScript
  • Scrapy receives the rendered DOM as the response
  • Your spider parses and yields items

This gives you Scrapy’s crawling, pipelines, and retries, while gaining browser-based rendering.

Installing Scrapy Playwright

Install packages:

pip install scrapy playwright scrapy-playwright
playwright install

Enable Playwright in settings.py:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

PLAYWRIGHT_BROWSER_TYPE = "chromium"  # or firefox, webkit

A minimal working spider

This example renders a page, waits for a selector, then parses the DOM.

import scrapy
from scrapy.selector import Selector

class JsSpider(scrapy.Spider):
    name = "js_spider"
    start_urls = ["https://example-js-site.com"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={
                    "playwright": True,
                    "playwright_include_page": True,
                },
            )

    async def parse(self, response):
        page = response.meta["playwright_page"]

        try:
            # Wait until the JS-rendered content exists
            await page.wait_for_selector(".product-card")

            # Re-read the DOM: the response body was captured
            # before the wait ran, so parse the live page content
            selector = Selector(text=await page.content())

            for card in selector.css(".product-card"):
                yield {
                    "title": card.css(".title::text").get(),
                    "price": card.css(".price::text").get(),
                }
        finally:
            # Always close the page to prevent memory leaks
            await page.close()

Practical notes:

  • Use wait_for_selector to avoid parsing too early
  • Re-read the DOM with page.content() after waiting, since the response body is captured before your waits run
  • Always close the page, even on errors, to prevent memory leaks
  • Keep selectors stable and defensive
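To keep extraction defensive, it helps to normalize whatever .get() returns instead of assuming clean text. A minimal helper along these lines (illustrative, not part of scrapy-playwright):

```python
def clean_text(value):
    """Collapse whitespace from a selector's .get() result; pass None through."""
    if value is None:
        return None
    text = " ".join(value.split())
    return text or None

# Usage inside the parse loop:
#   "title": clean_text(card.css(".title::text").get()),
```

This turns missing elements, empty strings, and newline-padded text into predictable values before they reach your pipelines.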

Rendering strategy that stays stable in production

Browser rendering can be expensive. The goal is to use Playwright only where it is needed.

Use Playwright when:

  • Data appears only after JS execution
  • Pagination is client-side or infinite scroll
  • Key pages require clicks, tabs, or lazy loading

Avoid Playwright when:

  • The data is available via JSON endpoints
  • Pages are server-rendered
  • Costs or throughput are the top priority

If you need a framework for deciding, use the headless browser vs HTTP client comparison and then validate against your targets.
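One quick validation step is to check whether the raw HTML already carries server-embedded JSON, which means a browser is unnecessary. A sketch that probes for a Next.js-style __NEXT_DATA__ blob (the marker name and sample HTML are illustrative):

```python
import json
import re

def extract_embedded_json(html, marker="__NEXT_DATA__"):
    """Return the JSON embedded in a <script id=marker> tag, or None."""
    match = re.search(
        rf'<script[^>]*id="{re.escape(marker)}"[^>]*>(.*?)</script>',
        html,
        re.DOTALL,
    )
    return json.loads(match.group(1)) if match else None

sample = (
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"items": [1, 2]}}'
    "</script>"
)
data = extract_embedded_json(sample)
```

If this kind of probe finds structured data, an HTTP-only collector with a JSON parser beats a rendering tier on both cost and stability.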

Handling infinite scroll and lazy-loaded lists

Many ecommerce and directory sites append results only after scrolling.

A simple scroll loop:

# Scroll down and wait for more content
await page.evaluate("window.scrollBy(0, document.body.scrollHeight)")
await page.wait_for_timeout(1500)

Production tips:

  • Scroll in controlled steps, not huge jumps
  • Stop when you detect no new items
  • Treat scroll loops as a risk factor for 429 and 403
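The "stop when no new items" rule is easy to get wrong in an open-ended loop. One way to make it explicit is a small helper that watches the item count across scroll rounds (a sketch; the patience value and the locator shown in the comment are assumptions):

```python
def should_stop_scrolling(counts, patience=2):
    """Return True once the item count has not grown for `patience` rounds."""
    if len(counts) < patience + 1:
        return False
    recent = counts[-(patience + 1):]
    return len(set(recent)) == 1

# Inside the Playwright scroll loop (sketch):
#   counts.append(await page.locator(".product-card").count())
#   if should_stop_scrolling(counts):
#       break
```

Tracking counts this way also gives you a natural hook for logging how many scroll rounds each page needed, which feeds directly into the performance metrics discussed later.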

If you frequently hit blocks during scroll loading, you will want a structured debugging approach such as debugging scraper blocks.

Using proxies with Scrapy Playwright

Most real-world Playwright scrapers need proxy routing.

Configure a proxy at launch:

PLAYWRIGHT_LAUNCH_OPTIONS = {
    "proxy": {
        "server": "http://proxy-server:port",
        "username": "username",
        "password": "password",
    }
}

Note that Playwright expects the credentials as separate username and password keys rather than embedded in the server URL.

Operational guidance:

  • Avoid using one IP for everything
  • Separate aggressive crawls from login or session flows
  • Plan your IP pool size around concurrency and target sensitivity
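To keep aggressive crawls and login or session flows on separate IPs, scrapy-playwright can run multiple browser contexts, each with its own proxy, via the PLAYWRIGHT_CONTEXTS setting. A settings.py sketch (hostnames and credentials are placeholders):

```python
# settings.py: one browser context per traffic profile, each on its own proxy
PLAYWRIGHT_CONTEXTS = {
    "crawl": {
        "proxy": {
            "server": "http://crawl-proxy:8000",
            "username": "user",
            "password": "pass",
        },
    },
    "session": {
        "proxy": {
            "server": "http://session-proxy:8000",
            "username": "user",
            "password": "pass",
        },
    },
}

# In the spider, pick the context per request:
#   yield scrapy.Request(url, meta={"playwright": True, "playwright_context": "crawl"})
```

This keeps fingerprint and cookie state separated along the same lines as your IP pools, which is exactly the separation the guidance above calls for.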

If your workload is high-volume and datacenter-friendly, review why datacenter proxies excel in high-volume automation and then validate capacity against the pricing plans.

Performance tuning that actually moves the needle

Block heavy assets early

Blocking images, fonts, and media reduces bandwidth and speeds up rendering.

async def intercept(route):
    if route.request.resource_type in ["image", "media", "font"]:
        await route.abort()
    else:
        await route.continue_()

await page.route("**/*", intercept)
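With scrapy-playwright specifically, the same blocking can also live in settings.py through the PLAYWRIGHT_ABORT_REQUEST option, which takes a predicate called for each in-page request. A sketch:

```python
# settings.py: abort heavy asset requests before they download
BLOCKED_TYPES = {"image", "media", "font"}

def should_abort_request(request):
    """Predicate for PLAYWRIGHT_ABORT_REQUEST: True means abort the request."""
    return request.resource_type in BLOCKED_TYPES

PLAYWRIGHT_ABORT_REQUEST = should_abort_request
```

Keeping the rule in settings means every rendered page gets it without per-spider route handlers.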

Control concurrency

Scrapy defaults that work for HTTP-only crawls will overload browser rendering.

Start conservative:

  • 5 to 10 concurrent pages per worker
  • Increase gradually while watching memory and p95 latency

Prefer domain batching

Group requests by domain so that:

  • Connection reuse improves performance
  • Fingerprint and session behavior stays coherent
  • Proxy rotation is easier to reason about

For large crawls, planning the number of IPs and concurrency together prevents runaway retries. Use how many proxies for large crawls as a sizing reference.
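As a rough back-of-the-envelope, pool size has to cover both simultaneous connections and per-IP rate limits. This heuristic is an assumption-laden sketch, not a formula from the linked sizing guide:

```python
import math

def estimate_pool_size(concurrency, target_rpm, safe_rpm_per_ip):
    """Pool must cover concurrent connections AND the per-IP request budget."""
    by_rate = math.ceil(target_rpm / safe_rpm_per_ip)
    return max(concurrency, by_rate)

# e.g. 10 concurrent pages, 600 requests/min overall, ~20 req/min safe per IP
estimate_pool_size(10, 600, 20)  # 30
```

Whichever constraint dominates, the estimate gives you a floor to validate against real block rates before scaling up.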

Anti-bot friction and how to reduce it responsibly

Common failure patterns include:

  • Soft blocks that return partial HTML
  • Interstitial challenges
  • Rate limiting disguised as slow responses
  • Fingerprint mismatch between locale, headers, and IP geo

Stability tactics:

  • Keep sessions stable for multi-step flows
  • Use realistic user agents and locale headers
  • Align IP geo with language and timezone
  • Reduce burst concurrency on sensitive endpoints

If you suspect stealth and fingerprinting issues are dominating, use fingerprinting vs proxying to separate what proxies can solve from what browser behavior must solve.

Scaling Scrapy Playwright in production

For large jobs, treat browser rendering as a specialized workload.

A practical scaling approach:

  1. Put JS-heavy targets into dedicated queues
  2. Run separate workers for browser rendering
  3. Use container limits to prevent memory exhaustion
  4. Instrument outcomes per domain and proxy pool
  5. Implement fallbacks when rendering fails
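Step 1, routing JS-heavy targets into dedicated queues, can be as simple as deciding at request-build time which hosts get Playwright meta. A sketch with hypothetical domains:

```python
from urllib.parse import urlparse

# Hypothetical allowlist of hosts known to require browser rendering
JS_HEAVY_HOSTS = {"app.example-js-site.com", "shop.example-js-site.com"}

def build_meta(url):
    """Attach Playwright meta only for hosts that need rendering."""
    host = urlparse(url).hostname
    if host in JS_HEAVY_HOSTS:
        return {"playwright": True, "playwright_include_page": True}
    return {}
```

Centralizing the decision in one function makes it easy to move a domain between the cheap HTTP tier and the rendering tier as its behavior changes.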

Metrics to track:

  • Render time until the selector is present
  • Status code distribution including soft blocks
  • Block and challenge rate per domain
  • Timeouts and retries per proxy range
  • Cost per successful rendered page

Frequently asked questions

Is Scrapy Playwright slower than normal Scrapy?

Yes. Rendering in a browser is heavier than HTTP-only fetching. Use Playwright only when the data genuinely requires JavaScript execution.

Can you scrape at scale with Playwright?

Yes, but you must limit concurrency, block unnecessary assets, and separate JS-heavy workloads into dedicated workers.

Should you run headless or headed mode?

Headless is standard for production. Headed mode is best for debugging selectors, scroll behavior, and challenge flows.

Does Playwright support authenticated proxies?

Yes. You can configure authenticated proxies in PLAYWRIGHT_LAUNCH_OPTIONS and route browser traffic through your provider.

What is the fastest way to reduce Playwright cost?

Reduce rendering scope. Block heavy assets, avoid browser rendering when HTTP endpoints are sufficient, and lower concurrency until retries and timeouts stabilize.

Final thoughts

Scrapy Playwright is the bridge between modern JS-heavy sites and production-grade crawling. It unlocks content that HTTP-only collectors cannot see, but it requires discipline: conservative concurrency, smart proxy routing, and consistent sessions.

If you want the fastest path to a stable production setup, validate whether browser rendering is necessary for each target, then implement a dedicated rendering tier with strict limits and measurable success metrics. When you do this well, Scrapy Playwright becomes an asset instead of an expense.

About the Author

Jesse Lewis

Jesse Lewis is a researcher and content contributor for ProxiesThatWork, covering compliance trends, data governance, and the evolving relationship between AI and proxy technologies. He focuses on helping businesses stay compliant while deploying efficient, scalable data-collection pipelines.

© 2026 ProxiesThatWork LLC. All Rights Reserved.