Modern sites lean on JavaScript rendering, background API calls, and client-side routing. If you scrape them using HTTP-only crawlers, you often get empty HTML shells, missing data, or endless anti-bot friction.
This Scrapy Playwright tutorial shows how to scrape JavaScript-heavy sites correctly in 2026 by combining Scrapy’s crawl engine with Playwright’s browser rendering. You will learn production patterns for rendering, pagination, proxy routing, performance tuning, and stability.
Scrapy is excellent at fetching server-rendered HTML and parsing it quickly. The problem is that many modern sites:

- render their content client-side with JavaScript frameworks
- load data through background API calls after the initial page load
- use client-side routing, so content changes without full page loads

In these cases, Scrapy sees the initial HTML skeleton and none of the final DOM.
If you are still deciding whether you need a browser at all, start by comparing a browser renderer to an async request pipeline using async scraping patterns in Python automation. If HTTP-only collectors can reach the same data through an API or embedded JSON, you will save significant compute.
Scrapy Playwright is a Scrapy integration that renders pages using Playwright before Scrapy parses them.
High-level flow:

- Scrapy schedules a request and marks it for Playwright via request meta
- The scrapy-playwright download handler opens the URL in a real browser page
- Playwright executes the page's JavaScript and waits for it to settle
- The rendered HTML comes back to Scrapy as a normal response for parsing
This gives you Scrapy’s crawling, pipelines, and retries, while gaining browser-based rendering.
Install packages:
pip install scrapy playwright scrapy-playwright
playwright install
Enable Playwright in settings.py:
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

PLAYWRIGHT_BROWSER_TYPE = "chromium"  # or "firefox", "webkit"
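Beyond the handler wiring, a few optional scrapy-playwright settings are worth setting early. The values below are illustrative starting points, not recommendations:

```python
# settings.py — optional scrapy-playwright knobs (illustrative values)
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}

# Abort navigations that hang instead of tying up a browser page (ms)
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 30_000

# Bound how many browser contexts / pages can be open at once
PLAYWRIGHT_MAX_CONTEXTS = 4
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4
```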
This example renders a page, waits for a selector, then parses the DOM.
import scrapy


class JsSpider(scrapy.Spider):
    name = "js_spider"
    start_urls = ["https://example-js-site.com"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={
                    "playwright": True,
                    "playwright_include_page": True,
                },
            )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        try:
            # Wait until the JS-rendered content exists
            await page.wait_for_selector(".product-card")
            # Re-read the DOM after waiting: `response` holds the HTML
            # snapshot taken before the wait, so parse the live page instead
            selector = scrapy.Selector(text=await page.content())
            for card in selector.css(".product-card"):
                yield {
                    "title": card.css(".title::text").get(),
                    "price": card.css(".price::text").get(),
                }
        finally:
            await page.close()
Practical notes:

- Use wait_for_selector to avoid parsing too early
- Close the page to prevent memory leaks

Browser rendering can be expensive. The goal is to use Playwright only where it is needed.
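If you do not need direct access to the page object, scrapy-playwright can run the wait for you through `playwright_page_methods`, so no page has to be held open or closed in the spider. A sketch, assuming the same `.product-card` markup as above:

```python
import scrapy
from scrapy_playwright.page import PageMethod


def product_request(url):
    # The download handler runs wait_for_selector inside the browser,
    # then hands Scrapy the fully rendered HTML as a normal response
    return scrapy.Request(
        url,
        meta={
            "playwright": True,
            "playwright_page_methods": [
                PageMethod("wait_for_selector", ".product-card"),
            ],
        },
    )
```

With this pattern, `parse` can stay a plain synchronous callback and use `response.css` directly.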
Use Playwright when:

- the content only appears after JavaScript execution
- data loads through infinite scroll or in-page interaction
- the site uses client-side routing with no server-rendered fallback

Avoid Playwright when:

- the data is already present in the server-rendered HTML
- an API or embedded JSON in the initial response exposes the same data
If you need a framework for deciding, use headless browser vs HTTP client and then validate against your targets.
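When the "avoid" column wins, the data is often sitting in a script tag in the initial HTML. A minimal sketch of extracting embedded JSON without any browser; the HTML and the `__DATA__` id here are stand-ins for whatever the real page uses:

```python
import json
import re

# Stand-in for a real server response containing embedded JSON
html = """
<html><body>
<script id="__DATA__" type="application/json">
{"products": [{"title": "Widget", "price": "9.99"}]}
</script>
</body></html>
"""

# Pull the JSON payload out of the script tag and parse it directly
match = re.search(
    r'<script id="__DATA__" type="application/json">\s*(\{.*\})\s*</script>',
    html,
    re.DOTALL,
)
data = json.loads(match.group(1))
print(data["products"][0]["title"])  # Widget
```

One HTTP request and a `json.loads` call replaces an entire browser render.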
Many ecommerce and directory sites append results only after scrolling.
A simple scroll loop:

# Scroll until the page height stops growing (no more lazy-loaded items)
prev = -1
while (height := await page.evaluate("document.body.scrollHeight")) != prev:
    prev = height
    await page.evaluate("window.scrollBy(0, document.body.scrollHeight)")
    await page.wait_for_timeout(1500)
Production tips:

- cap the number of scroll iterations so a broken page cannot loop forever
- prefer waiting for a new batch of items (selector or item count) over fixed timeouts
- add small random pauses between scrolls so the pattern is less mechanical
If you frequently hit blocks during scroll loading, you will want a structured debugging approach such as debugging scraper blocks.
Most real-world Playwright scrapers need proxy routing.
Configure a proxy at launch:
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "proxy": {
        "server": "http://proxy-server:port",
        "username": "username",
        "password": "password",
    }
}

Playwright expects proxy credentials as separate username and password fields rather than embedded in the server URL.
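Launch-level proxies apply to every page the browser opens. scrapy-playwright can also scope a proxy to a named browser context via the PLAYWRIGHT_CONTEXTS setting; the pool names and endpoints below are placeholders:

```python
# settings.py — one proxy per named browser context (placeholder endpoints)
PLAYWRIGHT_CONTEXTS = {
    "us_pool": {
        "proxy": {
            "server": "http://us-proxy:port",
            "username": "username",
            "password": "password",
        },
    },
    "eu_pool": {
        "proxy": {"server": "http://eu-proxy:port"},
    },
}

# In a spider, pick the context per request:
# meta={"playwright": True, "playwright_context": "eu_pool"}
```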
Operational guidance:

- keep one proxy and session consistent for the duration of a crawl path
- rotate IPs between sessions or contexts, not in the middle of one
- match the proxy type (datacenter vs residential) to how sensitive the target is
If your workload is high-volume and datacenter-friendly, review why datacenter proxies excel in high-volume automation and then validate capacity against the available pricing plans.
Blocking images, fonts, and media reduces bandwidth and speeds up rendering.
async def intercept(route):
    # Abort requests for heavy assets; let everything else continue
    if route.request.resource_type in {"image", "media", "font"}:
        await route.abort()
    else:
        await route.continue_()

await page.route("**/*", intercept)
Scrapy defaults that work for HTTP-only crawls will overload browser rendering.
Start conservative:

- CONCURRENT_REQUESTS = 8 or lower until you have measured render cost
- bound open browsers with PLAYWRIGHT_MAX_CONTEXTS and PLAYWRIGHT_MAX_PAGES_PER_CONTEXT
- set explicit navigation timeouts so stuck pages fail fast

Group requests by domain so that:

- cookies and session state can be reused within one browser context
- per-domain rate limits are respected
- one slow or blocking target does not stall the entire crawl
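The grouping step can be sketched with the standard library; the URLs here are placeholders:

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_domain(urls):
    """Bucket URLs by host so each bucket can share one browser context."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlsplit(url).netloc].append(url)
    return dict(groups)


urls = [
    "https://shop.example.com/p/1",
    "https://shop.example.com/p/2",
    "https://news.example.org/a/1",
]
print(group_by_domain(urls))
```

Each resulting bucket maps cleanly onto a named playwright_context, keeping cookies and proxy assignment consistent per site.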
For large crawls, planning the number of IPs and concurrency together prevents runaway retries. Use how many proxies for large crawls as a sizing reference.
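A back-of-envelope sizing calculation can anchor that planning. The function below is a sketch; the "safe" per-IP request rate is an assumption you must calibrate against your actual targets:

```python
import math


def proxies_needed(requests_per_day, safe_req_per_ip_per_hour, active_hours=24):
    """Estimate proxy pool size from daily volume and a per-IP rate you trust."""
    per_ip_per_day = safe_req_per_ip_per_hour * active_hours
    return math.ceil(requests_per_day / per_ip_per_day)


# e.g. 500k requests/day at a cautious 300 requests per IP per hour
print(proxies_needed(500_000, 300))  # 70
```

Re-run the estimate whenever retries climb: every retry is extra volume the pool must absorb.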
Common failure patterns include:

- navigation timeouts on slow, script-heavy pages
- memory growth from pages that are never closed
- browser crashes under excessive concurrency
- anti-bot challenges that stall rendering indefinitely

Stability tactics:

- set explicit navigation and selector timeouts on every request
- close every page in a finally block
- retry failed renders with backoff instead of immediately re-requesting
- recycle browser contexts periodically during long crawls
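The backoff tactic can be sketched as a small helper; the base delay, cap, and jitter range are illustrative choices, not tuned values:

```python
import random


def backoff_delays(retries, base=2.0, cap=60.0):
    """Exponential backoff with jitter: delay doubles per attempt, capped."""
    delays = []
    for attempt in range(retries):
        delay = min(cap, base * (2 ** attempt))
        delays.append(delay * random.uniform(0.5, 1.0))  # jitter
    return delays


print(backoff_delays(4))
```

Sleeping for these durations between render retries spreads load and avoids hammering a target that is already pushing back.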
If you suspect stealth and fingerprinting issues are dominating, use fingerprinting vs proxying to separate what proxies can solve from what browser behavior must solve.
For large jobs, treat browser rendering as a specialized workload.
A practical scaling approach:

- keep an HTTP-only tier for pages that do not need JavaScript
- route only render-required URLs to a dedicated pool of browser workers
- enforce strict concurrency caps per worker
- recycle browsers on a schedule to contain memory growth

Metrics to track:

- render success rate per domain
- average and p95 render time
- memory per browser worker
- retry, timeout, and block rates
Is Scrapy Playwright slower than plain Scrapy?

Yes. Rendering in a browser is heavier than HTTP-only fetching. Use Playwright only when the data genuinely requires JavaScript execution.

Can it handle large-scale crawls?

Yes, but you must limit concurrency, block unnecessary assets, and separate JS-heavy workloads into dedicated workers.

Should I run headless or headed?

Headless is standard for production. Headed mode is best for debugging selectors, scroll behavior, and challenge flows.

Does it work with proxies?

Yes. You can configure authenticated proxies in PLAYWRIGHT_LAUNCH_OPTIONS and route browser traffic through your provider.

What if resource usage is too high?

Reduce rendering scope. Block heavy assets, avoid browser rendering when HTTP endpoints are sufficient, and lower concurrency until retries and timeouts stabilize.
Scrapy Playwright is the bridge between modern JS-heavy sites and production-grade crawling. It unlocks content that HTTP-only collectors cannot see, but it requires discipline: conservative concurrency, smart proxy routing, and consistent sessions.
If you want the fastest path to a stable production setup, validate whether browser rendering is necessary for each target, then implement a dedicated rendering tier with strict limits and measurable success metrics. When you do this well, Scrapy Playwright becomes an asset instead of an expense.
Jesse Lewis is a researcher and content contributor for ProxiesThatWork, covering compliance trends, data governance, and the evolving relationship between AI and proxy technologies. He focuses on helping businesses stay compliant while deploying efficient, scalable data-collection pipelines.