
Build Orchestrated Scrapers Using Shared Proxy Routing

By Jesse Lewis · 12/8/2025 · 5 min read

The arms race between scrapers and anti-bot systems has pushed teams to move beyond lone scripts and ad hoc proxy lists. Today, scale and survivability come from orchestration: distributed workers coordinated by a scheduler and funneled through a shared proxy routing layer that centralizes IP policy, session control, and observability.

This article lays out a pragmatic blueprint for building orchestrated scrapers using shared proxy routing. We will cover architecture, routing strategies, anti-bot evasion without crossing ethical lines, and the operational guardrails that keep your program reliable and compliant.

What is shared proxy routing?

Shared proxy routing is a centralized egress layer that all scraper workers use to reach the public web. Instead of each worker managing its own proxies, a shared router owns:

  • Pooled IP resources (datacenter, residential, mobile)
  • Rotation and stickiness logic
  • Per-domain and per-tenant rate limits
  • Geo and ASN selection for locality and reputation
  • Health checks, failover, and telemetry

The benefits are immediate:

  • Consistency: One place to enforce identity rotation, headers, and quotas.
  • Efficiency: Reuse scarce IPs across many jobs with policy guarantees.
  • Observability: Unified metrics on success, blocks, latency, and egress spend.
  • Safety: Centralized compliance controls, audit trails, and access management.

A reference architecture

Think of the system as a set of loosely coupled services:

  • Scraper workers: Stateless processes (HTTP clients or headless browsers) that extract content and never store credentials or proxy lists locally.
  • Task queue: Kafka, RabbitMQ, SQS—buffers URLs, priorities, retries.
  • Scheduler: Assigns work, enforces domain-level rate limits, and chooses routing policies.
  • Shared proxy router: A forward-proxy gateway (managed provider endpoint, HAProxy/Envoy in forward-proxy mode, or a custom Golang/Node gateway) that selects the egress IP and applies rotation/stickiness.
  • IP pools: Mixed types with metadata (geo, ASN, historical reputation, price per GB).
  • Storage and aggregator: Writes results, normalizes responses, and emits metrics.
  • Observability: Logs, traces, and dashboards (OpenTelemetry + Prometheus/Grafana).

The core loop is straightforward:

  1. Worker pulls a URL from the queue with a domain-level policy.
  2. Scheduler returns a route profile (e.g., residential, US, sticky for 10 minutes, max 2 rps); a sketch of this shape follows the list.
  3. Worker connects via the shared proxy, including a session key if sticky.
  4. Router selects or reuses an IP, applies headers, and forwards the request.
  5. Worker reports outcome; router updates health and reputation scores.
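
For concreteness, here is a minimal sketch of the route profile from step 2. The shape and field names are illustrative assumptions, not a fixed schema:

# Illustrative route profile (Python; field names are assumptions)
from dataclasses import dataclass

@dataclass
class RouteProfile:
    pool: str          # e.g., "residential_us"
    geo: str           # e.g., "US"
    sticky: bool       # pin requests to one egress IP
    sticky_ttl_s: int  # e.g., 600 for a 10-minute session
    max_rps: float     # per-domain pacing ceiling, e.g., 2.0
    max_retries: int
    timeout_s: float

profile = RouteProfile(pool="residential_us", geo="US", sticky=True,
                       sticky_ttl_s=600, max_rps=2.0, max_retries=4,
                       timeout_s=15.0)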

Routing strategies that actually work

  • Round-robin with health bias: Start with even distribution, but prefer healthy IPs and demote those recently hit by 403/429 spikes (see the sketch after this list).
  • Sticky sessions: Maintain affinity between a worker-group and an IP for the lifetime of a site session (login, cart, or CSRF flow). Use a session identifier to pin requests to the same egress.
  • Geo- and ASN-aware routing: Some targets expect local presence or specific broadband ASNs. Tag pools and match targets accordingly.
  • Adaptive rotation: Rotate on suspicion (HTTP 403/429, sudden latency jumps, captcha pages). Increase rotation frequency during block storms, then cool down.
  • Per-domain concurrency caps: Set strict ceilings per target. Shape concurrency and pace with jitter to avoid bursty patterns that trip detectors.
  • Connection reuse: Favor HTTP/2 keep-alives and multiplexing where it helps blend in, but be ready to downgrade if a target discriminates by TLS or ALPN.
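
As a concrete sketch of the first strategy, health-biased selection can be a weighted draw over per-IP scores that get demoted on 403/429 and slowly recover on success. The class and weighting constants here are illustrative:

# Health-biased round-robin sketch (Python; names and weights are assumptions)
import random

class HealthBiasedPool:
    def __init__(self, ips):
        self.scores = {ip: 1.0 for ip in ips}  # score in (0, 1], optimistic start

    def pick(self):
        # Weighted draw: healthy IPs are chosen often, sick ones rarely.
        ips = list(self.scores)
        weights = [self.scores[ip] for ip in ips]
        return random.choices(ips, weights=weights, k=1)[0]

    def report(self, ip, status):
        if status in (403, 429):
            self.scores[ip] = max(0.05, self.scores[ip] * 0.5)  # demote hard
        elif 200 <= status < 300:
            self.scores[ip] = min(1.0, self.scores[ip] + 0.05)  # recover slowly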

Taming anti-bot controls without going rogue

  • Fingerprints that match your client: If you use headless browsers, align navigator hints, WebGL, fonts, and TLS ciphers. If you use HTTP clients, ensure TLS JA3/JA4 and header order match common browsers. Avoid exotic stacks that stand out.
  • State continuity: Store and reuse cookies, tokens, and local storage via the scheduler. Sticky sessions at the router let you keep the same IP across those flows.
  • Captcha-aware backoff: Detect captcha challenge pages quickly. Back off, rotate identity, and consider human-in-the-loop or solver integrations only when policies allow.
  • Retry discipline: Exponential backoff with jitter, capped retries per domain. Use a circuit breaker to pause a domain when block rates exceed a threshold (sketched after this list).
  • Respectful pacing: Honor robots.txt and crawl-delay. Set user agents appropriately, and offer opt-outs when you scrape your own partner ecosystems.
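
The retry discipline above pairs naturally with a per-domain circuit breaker. A minimal sketch, with window size and thresholds as illustrative assumptions:

# Per-domain circuit breaker sketch (Python; thresholds are assumptions)
import time
from collections import deque

class DomainBreaker:
    def __init__(self, window=50, block_threshold=0.3, cooldown_s=300):
        self.outcomes = deque(maxlen=window)  # True = blocked response
        self.block_threshold = block_threshold
        self.cooldown_s = cooldown_s
        self.paused_until = 0.0

    def record(self, blocked):
        self.outcomes.append(blocked)
        if len(self.outcomes) == self.outcomes.maxlen:
            if sum(self.outcomes) / len(self.outcomes) >= self.block_threshold:
                self.paused_until = time.time() + self.cooldown_s  # trip open
                self.outcomes.clear()

    def allow(self):
        return time.time() >= self.paused_until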

Implementation blueprint

You can build the router in-house or rely on a managed proxy network. In-house gives control and cost transparency; managed services shrink time-to-value and supply cleaner IPs. Many teams land somewhere in between: a thin gateway that brokers to multiple upstream providers while projecting a single endpoint to workers.

Key design choices:

  • Protocol support: HTTP CONNECT for HTTPS, first-party TLS if terminating, SOCKS5 when needed. Keep it simple unless you truly need man-in-the-middle.
  • Policy expressions: Per-domain rules (rate, pool, stickiness, geo) expressed in a small DSL or config. Hot-reload them without redeploying workers.
  • Session tokens: Encode stickiness in credentials (e.g., username fields) or a header, whichever your provider or router supports consistently (example after this list).
  • Health model: Track per-IP success rate, time to first byte, and recent block rate. Decay scores over time. Auto-rotate sick IPs out of the pool.
  • Cost guardrails: Enforce egress budgets per job. Residential and mobile can be expensive—switch to datacenter where targets permit.
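
One common convention is to carry the session key in the proxy username. The exact field syntax varies by provider, so treat this format as an assumption:

# Sticky-session credential sketch (Python; username format varies by provider)
import uuid

session_id = uuid.uuid4().hex[:12]
proxy_url = f"http://user-session-{session_id}:PASSWORD@gateway.example.com:8000"

# Reusing the same session_id keeps requests pinned to one egress IP;
# minting a new one rotates to a fresh IP.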

A minimal worker loop might look like this:

# Worker loop (Python). scheduler, router_client, metrics, parse,
# is_captcha, and is_transient are assumed to exist elsewhere.
import random
import time

import requests

def jittered_backoff(attempt, base=1.0, cap=60.0):
    # Exponential backoff with full jitter, capped.
    return random.uniform(0, min(cap, base * 2 ** attempt))

def fetch(url):
    policy = scheduler.get_policy(url)
    route = router_client.get_route(policy)  # returns proxy_url, session_id

    resp = None
    for attempt in range(1, policy.max_retries + 1):
        resp = requests.get(
            url,
            proxies={'http': route.proxy_url, 'https': route.proxy_url},
            headers={'x-scrape-session': route.session_id},
            timeout=policy.timeout,
        )

        if resp.status_code in (200, 201):
            metrics.ok(url, route.meta)
            return parse(resp)

        if resp.status_code in (403, 429) or is_captcha(resp):
            router_client.rotate(route)  # new IP, possibly new ASN/geo
            time.sleep(jittered_backoff(attempt))
            continue

        if is_transient(resp.status_code):
            time.sleep(jittered_backoff(attempt))
            continue

        break  # hard failure; retrying will not help

    metrics.fail(url, route.meta, resp.status_code if resp else None)
    return None

If you operate your own router, expose a simple control plane so the scheduler or SREs can push policies safely:

# Example policy patch (YAML)
domain: example.com
pool: residential_us
rate_limit: 2 rps
sticky: true
sticky_ttl: 10m
max_concurrency: 20
rotate_on_status: [403, 429]
max_retries: 4
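
On the router side, a patch like the one above can be parsed and hot-swapped into an in-memory table without redeploying anything. A minimal sketch using PyYAML; the table name and loader are illustrative:

# Policy hot-reload sketch (Python + PyYAML; names are illustrative)
import yaml

POLICIES = {}  # domain -> policy dict, consulted on the routing hot path

def apply_policy_patch(raw_yaml):
    patch = yaml.safe_load(raw_yaml)
    POLICIES[patch["domain"]] = patch  # single dict assignment, no redeploy

apply_policy_patch("""
domain: example.com
pool: residential_us
sticky: true
sticky_ttl: 10m
rotate_on_status: [403, 429]
max_retries: 4
""")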

Scheduling and fairness

  • Token buckets per domain: One bucket for global capacity, another per ASN or pool to avoid hammering a single cohort (a minimal sketch follows this list).
  • Priority tiers: Critical jobs preempt opportunistic crawls. Cap their footprint so they cannot starve the rest.
  • Jitter everywhere: Randomize delays, connection reuse, and navigation patterns.
  • Blast-radius isolation: Partition pools by tenant or job so a bad actor or a broken parser does not poison reputation for everyone.
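
A per-domain token bucket is only a few lines: refill continuously, spend one token per request. The rate and burst values below are illustrative:

# Token bucket sketch (Python; rate and burst are assumptions)
import time

class TokenBucket:
    def __init__(self, rate=2.0, burst=4):
        self.rate, self.capacity = rate, float(burst)
        self.tokens, self.last = float(burst), time.monotonic()

    def try_acquire(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should requeue or sleep with jitter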

Observability you cannot skip

  • Golden metrics: Success rate, block rate, median/p95 latency, bytes per response, proxy utilization, and cost per 1k successful pages.
  • Per-domain views: Anti-bot defenses vary wildly; tune policies where they matter.
  • Tracing: Propagate a request-id from queue to worker to router. Export spans with tags for domain, pool, ASN, and outcome (sketched after this list).
  • Canaries: A small recurring probe for each target to catch changes in layout or defenses before production jobs fail.
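
Since the stack above already names OpenTelemetry, a span per fetch might look like this; the attribute keys are our own convention, not a standard:

# Tracing sketch (Python + OpenTelemetry; attribute keys are our convention)
from urllib.parse import urlparse

from opentelemetry import trace

tracer = trace.get_tracer("scraper.worker")

def traced_fetch(url, route, request_id):
    with tracer.start_as_current_span(
        "fetch",
        attributes={
            "scrape.request_id": request_id,
            "scrape.domain": urlparse(url).hostname or "unknown",
            "scrape.pool": route.pool,
        },
    ) as span:
        result = fetch(url)  # the worker loop from the blueprint above
        span.set_attribute("scrape.outcome", "ok" if result else "fail")
        return result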

Security and compliance

  • Legal review: Terms of service, permitted use, and data rights. Do not collect or retain personal data without a lawful basis.
  • Source of IPs: Use providers that verify consent for residential/mobile peers. Maintain vendor attestations.
  • Robots.txt and rate limits: Honor publisher preferences where required; for your own properties, offer documented opt-outs.
  • Secrets hygiene: Route credentials via a vault. Rotate often. Least-privilege IAM.
  • Audit logs: Who changed what policy, when, and why.

Shared vs. dedicated proxies

Shared routing cuts cost and simplifies management, but you inherit the reputation of every neighbor on the pool. Mitigations include:

  • Sharding: Separate pools for sensitive targets or VIP jobs.
  • Sticky windows: Keep an IP long enough to look natural, but not so long that it accumulates suspicion.
  • Escalation paths: Promote difficult domains to dedicated IPs or private ASNs.

A quick-start checklist

  • Define domains and classify them by difficulty (static, dynamic, authenticated).
  • Pick two pools to start (datacenter and residential) and tag by geo.
  • Stand up a simple router endpoint or choose a managed provider with session support.
  • Implement per-domain token buckets and a backoff-aware retry policy.
  • Add request-id propagation and a minimal dashboard for success and block rates.
  • Pilot a small set of targets, tune rotation and stickiness, then scale out workers.

Final word

Orchestrated scrapers with shared proxy routing turn fragile scripts into a service you can operate with confidence. Centralizing identity and policy yields fewer bans, better spend control, and clearer telemetry. Most teams do not need exotic tricks—disciplined pacing, adaptive rotation, and ethical boundaries deliver results that last.


About the Author


Jesse Lewis

Jesse Lewis is a researcher and content contributor for ProxiesThatWork, covering compliance trends, data governance, and the evolving relationship between AI and proxy technologies. He focuses on helping businesses stay compliant while deploying efficient, scalable data-collection pipelines.
