
HTTP 403 Forbidden in Web Scraping: Every Cause Explained

Complete breakdown of why web scrapers get 403 errors. Covers header analysis, TLS detection, IP blocks, and anti-bot systems.

By ProxyOps Team

HTTP 403 Forbidden: Why Websites Reject Your Scraper

The 403 Forbidden response is the most common error in web scraping. Unlike a 404 (page doesn’t exist) or a 500 (server broke), a 403 means the server understood your request perfectly and deliberately refused to serve it.

The critical question isn’t “how do I fix 403?” — it’s “which detection layer caught me?” Because a 403 from a bare-metal Nginx server and a 403 from a Cloudflare-protected e-commerce site are completely different problems with completely different solutions.


The 7 Layers of 403 Detection

Modern websites can reject scrapers at any of these layers, from simplest to most sophisticated:

Layer 1: Missing or Suspicious User-Agent

The lowest-hanging fruit. Default HTTP libraries announce themselves honestly:

# Python requests default
python-requests/2.31.0

# Node axios default
axios/1.6.0

# Go net/http default
Go-http-client/2.0

# curl default
curl/8.4.0

Any server can block these with a single Nginx rule. But simply changing User-Agent to a browser string isn’t enough anymore — anti-bot systems check consistency.

# ❌ This gets caught by modern anti-bot systems
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/121..."}

# Why? Because the TLS fingerprint still says "Python requests"
# and the header set is missing sec-ch-ua, Sec-Fetch-* headers
# that real Chrome always sends

Layer 2: Incomplete Header Set

Real browsers send 12-18 headers with every request. Scrapers typically send 2-4. This gap is easily detectable:

Header               Chrome Sends?      Scrapers Send?
User-Agent           ✅                 ✅ (usually)
Accept               ✅                 ❌ often missing
Accept-Language      ✅                 ❌ often missing
Accept-Encoding      ✅                 ✅ (auto)
sec-ch-ua            ✅                 ❌
sec-ch-ua-mobile     ✅                 ❌
sec-ch-ua-platform   ✅                 ❌
Sec-Fetch-Dest       ✅                 ❌
Sec-Fetch-Mode       ✅                 ❌
Sec-Fetch-Site       ✅                 ❌
Sec-Fetch-User       ✅                 ❌
Referer              ✅ (navigation)    ❌
Cookie               ✅ (if set)        ❌

The sec-ch-ua-* headers are Client Hints and the Sec-Fetch-* headers are Fetch Metadata. Chrome has sent both by default for years, and most anti-bot systems now expect them. Their absence is a strong bot signal.
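A minimal sketch of a complete header set. The values approximate what Chrome 121 on Windows sends for a top-level navigation — exact values drift between Chrome releases, so treat these as illustrative, not canonical; `chrome_like_headers` is a name invented here:

```python
def chrome_like_headers(referer=None):
    """Build a Chrome-like header set for a top-level navigation.

    Values approximate Chrome 121 on Windows; real anti-bot systems also
    check header *order*, which plain dicts don't control.
    """
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
        ),
        "Accept": (
            "text/html,application/xhtml+xml,application/xml;q=0.9,"
            "image/avif,image/webp,*/*;q=0.8"
        ),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "sec-ch-ua": '"Not A(Brand";v="99", "Google Chrome";v="121", "Chromium";v="121"',
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": '"Windows"',
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        # "none" means the user typed the URL; a referred navigation differs.
        "Sec-Fetch-Site": "none" if referer is None else "same-origin",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1",
    }
    if referer:
        headers["Referer"] = referer
    return headers
```

Passing this dict to `requests.get(url, headers=...)` closes the header gap — though, as Layer 6 explains, headers alone don't fix the TLS fingerprint.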

Layer 3: IP Address Reputation

Web servers and CDNs maintain IP reputation databases. Your IP gets flagged based on:

  • ASN type: Datacenter ASNs (AWS, Hetzner, DigitalOcean) vs residential ISPs
  • Historical activity: IPs previously used for scraping, spam, or attacks
  • Proxy/VPN detection: Known exit nodes from VPN services and proxy providers
  • Request volume: Unusual request patterns from a single IP
IP Classification Impact:

Residential ISP IP     → Trust score: High (8-10/10)
Mobile carrier IP      → Trust score: Very high (9-10/10)
Business/ISP IP        → Trust score: Medium (5-7/10)
Cloud provider IP      → Trust score: Low (1-3/10)
Known proxy exit node  → Trust score: Very low (0-2/10)

Layer 4: Rate-Based Blocking

Even with perfect headers and a clean IP, requesting too fast triggers 403:

  • Most sites allow 1-5 requests per second from a single IP
  • Some apply daily limits (e.g., 1000 pages/day per IP)
  • Many implement burst detection — 10 fast requests followed by a pause still gets caught

The challenge: rate limits are per-site, per-page, and often undocumented. You discover them by getting blocked.
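A minimal sketch of client-side pacing under the assumptions above (a 1 req/s baseline plus random jitter so request timing doesn't form a machine-regular pattern; `PoliteThrottle` is a name invented here):

```python
import random
import time


class PoliteThrottle:
    """Enforce a minimum gap between requests, with random jitter.

    The 1 req/s default is a conservative guess; tune per site, and
    back off further the first time you see a 403 or 429.
    """

    def __init__(self, min_delay=1.0, jitter=0.5):
        self.min_delay = min_delay
        self.jitter = jitter
        self._last = 0.0

    def wait(self):
        # Randomized gap: min_delay .. min_delay + jitter seconds.
        gap = self.min_delay + random.uniform(0, self.jitter)
        sleep_for = self._last + gap - time.monotonic()
        if sleep_for > 0:
            time.sleep(sleep_for)
        self._last = time.monotonic()
```

Call `throttle.wait()` before each request. Because limits are per-site and undocumented, the honest workflow is: start slow, watch for 403/429, and adapt.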

Layer 5: Missing Session Cookies

Many sites require a valid session to serve content:

  1. First visit sets cookies via JavaScript
  2. Subsequent requests must include those cookies
  3. Missing or expired cookies → 403

This is especially common on e-commerce sites that use bot management platforms (Datadome, PerimeterX, Akamai). The initial page load runs JavaScript that generates a bot-detection cookie. Without it, every subsequent request returns 403.
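A stdlib-only sketch of the session flow, using `urllib` with a cookie jar so cookies from the first visit are replayed automatically. The cookie name `datadome` is an illustrative assumption — each bot management platform uses its own names, and cookies generated by JavaScript won't appear at all without a browser engine:

```python
import http.cookiejar
import urllib.request

# A shared jar: cookies set by one response are sent on later requests.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def has_required_cookies(jar, required=("datadome",)):
    """Check whether the anti-bot cookies exist before hitting deeper pages.

    Cookie names here are assumptions; inspect a real browser session
    to learn what your target actually sets.
    """
    names = {cookie.name for cookie in jar}
    return all(name in names for name in required)


# Usage (live network calls, shown but not run here):
# opener.open("https://example.com/")          # first visit sets cookies
# if has_required_cookies(jar):
#     opener.open("https://example.com/data")  # jar replays them automatically
```

If `has_required_cookies` stays false after the first page load, the cookie is probably generated by JavaScript — which is exactly the Layer 7 scenario.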

Layer 6: TLS Fingerprint Mismatch

As covered in our Cloudflare Error 1020 guide, the TLS handshake reveals your client identity before any HTTP data is exchanged.

The JA3/JA4 fingerprint of Python requests is known and cataloged. Anti-bot services maintain databases of TLS fingerprints for every major HTTP library, automation tool, and browser version.

This is why “just add headers” doesn’t fix 403 on protected sites. The server already knows you’re not a browser from the TLS handshake.
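To see why the fingerprint is so stable, it helps to look at how one is computed. JA3, for example, is an MD5 over five ClientHello fields (TLS version, cipher suites, extensions, elliptic curves, point formats, as decimal values). Every run of the same HTTP library produces the same fields, hence the same hash — no header can change it. A sketch with made-up field values:

```python
import hashlib


def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    """Compute a JA3 fingerprint from ClientHello fields.

    JA3 joins each list with '-', joins the five fields with ',',
    and MD5-hashes the result. The fields come from your TLS stack
    (e.g. OpenSSL as configured by python-requests), not your code.
    """
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return hashlib.md5(",".join(fields).encode()).hexdigest()
```

Since the inputs are fixed by the TLS library, evading this layer means using a different TLS stack entirely — a real browser, or a client that impersonates one.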

Layer 7: JavaScript Challenge Failure

The most sophisticated layer. The server serves a JavaScript challenge that must execute before the real content loads. If your client doesn’t run JavaScript, the challenge fails silently — and you get one of three outcomes:

  • A 403 response
  • A 200 with an empty/challenge HTML page (not the actual content)
  • A redirect loop
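The second outcome is the treacherous one: a 200 that is really a challenge page. A hedged heuristic for catching it — the marker strings below are assumptions drawn from commonly observed challenge pages, not any vendor-documented contract, and the length cutoff is a rough guess:

```python
def looks_like_challenge(status, body):
    """Flag responses that are really a JS challenge, even when status is 200.

    Marker strings and the 500-byte cutoff are heuristics; verify against
    your target's actual challenge pages before relying on them.
    """
    if status == 403:
        return True
    markers = (
        "cf-challenge", "challenge-platform",   # Cloudflare
        "geo.captcha-delivery.com",             # Datadome
        "window._pxappid",                      # PerimeterX / HUMAN
    )
    text = body.lower()
    # Challenge pages are also typically tiny compared to real content.
    return any(m in text for m in markers) or len(body.strip()) < 500
```

Running a check like this after every fetch prevents the worst failure mode: silently storing thousands of challenge pages as if they were data.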

Bot management systems using this approach:

  • Cloudflare — Managed Challenges, Turnstile
  • Datadome — JavaScript tag + behavioral analysis
  • PerimeterX (HUMAN) — Sensor data collection via JS
  • Akamai Bot Manager — Browser fingerprinting via JS
  • Kasada — Proof-of-work JavaScript challenges

Know Your Opponent: Bot Protection Systems

Understanding what you’re up against helps you plan your infrastructure:

Protection System     Detection Approach                                   Examples of Sites Using It
Cloudflare            WAF rules, TLS fingerprinting, Managed Challenges    ~20% of all websites
Datadome              JS fingerprinting, Picasso challenge, behavioral ML  E-commerce, ticketing
Akamai Bot Manager    Sensor data, device fingerprinting                   Airlines, banks, large retail
PerimeterX (HUMAN)    Client-side sensors, behavioral analysis             Streaming, SaaS platforms
Kasada                Proof-of-work, WASM challenges                       Sports betting, gaming
AWS WAF               Rate rules, IP reputation, geo-blocking              AWS-hosted applications
Imperva (Incapsula)   Cookie challenges, JS fingerprinting                 Enterprise, government

Diagnostic Decision Tree

Got a 403? Start here:

1. Is response from a CDN/bot protection service?
   ├─ Yes → Skip to step 4
   └─ No → Continue

2. Did you set User-Agent?
   ├─ No → Add realistic browser UA. Retry.
   └─ Yes → Continue

3. Did you include full header set (sec-ch-ua, Sec-Fetch-*, etc.)?
   ├─ No → Add all Client Hints headers. Retry.
   └─ Yes → You're being blocked by IP or rate limit.

4. What bot protection system is it?
   ├─ Check response headers for: server, x-datadome, x-px-*
   ├─ Check page source for: challenge scripts, cf-mitigated
   └─ Identified? → Research that specific system's detection method.

5. Is your TLS fingerprint realistic?
   ├─ Using requests/axios → No. Switch to browser or tls-client.
   └─ Using Playwright → Check stealth plugin configuration.

6. Does the site require JavaScript execution?
   ├─ Yes → You need a browser engine, not an HTTP library.
   └─ No → Focus on headers + IP quality.
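Step 4 of the tree can be partly automated. A sketch that maps response-header signatures to a likely vendor — the signatures are assumptions from commonly observed responses, and vendors change them, so treat a match as a hint rather than proof:

```python
def identify_protection(headers):
    """Guess the bot-protection vendor from response headers.

    Header signatures here are heuristic assumptions, not a stable API;
    always confirm by inspecting the page source as well.
    """
    h = {k.lower(): v.lower() for k, v in headers.items()}
    server = h.get("server", "")
    if "cloudflare" in server or "cf-ray" in h or "cf-mitigated" in h:
        return "Cloudflare"
    if any(k.startswith("x-datadome") for k in h):
        return "Datadome"
    if any(k.startswith("x-px") for k in h):
        return "PerimeterX (HUMAN)"
    if any(k.startswith("x-akamai") for k in h) or "akamaighost" in server:
        return "Akamai"
    if "x-amzn-waf-action" in h or "awselb" in server:
        return "AWS WAF"
    return None  # bare origin server, or an unrecognized system
```

Feed it `response.headers` from the blocked request; a `None` result pushes you back to steps 2-3 (headers, IP, rate limits) rather than a vendor-specific fight.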

Choosing the Right Architecture

Your Situation                                 Recommended Approach
Simple site, no bot protection                 requests + full headers + delays
Cloudflare Free plan                           Browser automation + stealth plugin
Enterprise bot protection (Datadome, Akamai)   Managed scraping API or premium proxy with unblocking
High volume (>100k pages/day)                  Dedicated proxy infrastructure + distributed architecture
Compliance-sensitive data collection           Licensed data provider or official API

The key insight: the cheapest fix depends on what’s blocking you. A $0 header fix solves Layer 1-2 problems. Layer 5-7 problems require real infrastructure investment.


Key Takeaways

  1. A 403 is not one problem — it’s at least 7 different problems with different solutions.
  2. Headers alone won’t save you on modern protected sites. TLS fingerprinting catches you before HTTP.
  3. Know your target’s protection system before writing any code. The detection method determines the solution.
  4. IP quality matters more than IP quantity. One clean residential IP outperforms 1000 burned datacenter IPs.
  5. JavaScript execution is now table stakes for most commercial sites. HTTP-only scrapers hit 403 on over 40% of the web.

ProxyOps Team

Independent infrastructure reviews from engineers who've deployed at scale. No vendor bias, just data.