HTTP 403 Forbidden in Web Scraping: Every Cause Explained
Complete breakdown of why web scrapers get 403 errors. Covers header analysis, TLS detection, IP blocks, and anti-bot systems.
HTTP 403 Forbidden: Why Websites Reject Your Scraper
The 403 Forbidden response is the most common error in web scraping. Unlike a 404 (page doesn’t exist) or a 500 (server broke), a 403 means the server understood your request perfectly and deliberately refused to serve it.
The critical question isn’t “how do I fix 403?” — it’s “which detection layer caught me?” Because a 403 from a bare-metal Nginx server and a 403 from a Cloudflare-protected e-commerce site are completely different problems with completely different solutions.
The 7 Layers of 403 Detection
Modern websites can reject scrapers at any of these layers, from simplest to most sophisticated:
Layer 1: Missing or Suspicious User-Agent
The lowest-hanging fruit. Default HTTP libraries announce themselves honestly:
# Python requests default
python-requests/2.31.0
# Node axios default
axios/1.6.0
# Go net/http default
Go-http-client/2.0
# curl default
curl/8.4.0
Any server can block these with a single Nginx rule. But simply changing User-Agent to a browser string isn’t enough anymore — anti-bot systems check consistency.
# ❌ This gets caught by modern anti-bot systems
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/121..."}
# Why? Because the TLS fingerprint still says "Python requests"
# and the header set is missing sec-ch-ua, Sec-Fetch-* headers
# that real Chrome always sends
Layer 2: Incomplete Header Set
Real browsers send 12-18 headers with every request. Scrapers typically send 2-4. This gap is easily detectable:
| Header | Chrome Sends? | Scrapers Send? |
|---|---|---|
| User-Agent | ✅ | ✅ (usually) |
| Accept | ✅ | ❌ often missing |
| Accept-Language | ✅ | ❌ often missing |
| Accept-Encoding | ✅ | ✅ (auto) |
| sec-ch-ua | ✅ | ❌ |
| sec-ch-ua-mobile | ✅ | ❌ |
| sec-ch-ua-platform | ✅ | ❌ |
| Sec-Fetch-Dest | ✅ | ❌ |
| Sec-Fetch-Mode | ✅ | ❌ |
| Sec-Fetch-Site | ✅ | ❌ |
| Sec-Fetch-User | ✅ | ❌ |
| Referer | ✅ (navigation) | ❌ |
| Cookie | ✅ (if set) | ❌ |
The sec-ch-ua-* headers are Client Hints, shipped in Chrome in 2021; the Sec-Fetch-* headers are Fetch Metadata request headers, shipped in Chrome back in 2019. Both are now expected by most anti-bot systems, and their absence is a strong bot signal.
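The gap is easy to audit programmatically. A minimal sketch: the header values below are illustrative examples, not canonical strings from any specific Chrome build.

```python
# Illustrative Chrome-like header set; exact values vary by Chrome
# version and platform, so treat these as placeholders.
CHROME_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
              "image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "sec-ch-ua": '"Chromium";v="121", "Google Chrome";v="121", "Not A(Brand";v="99"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"Windows"',
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
}

def missing_headers(sent: dict) -> list:
    """List the browser headers a request failed to include."""
    sent_lower = {k.lower() for k in sent}
    return [h for h in CHROME_HEADERS if h.lower() not in sent_lower]
```

With requests, `session.headers.update(CHROME_HEADERS)` applies the full set to every request in a session, which also keeps header order consistent across requests.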
Layer 3: IP Address Reputation
Web servers and CDNs maintain IP reputation databases. Your IP gets flagged based on:
- ASN type: Datacenter ASNs (AWS, Hetzner, DigitalOcean) vs residential ISPs
- Historical activity: IPs previously used for scraping, spam, or attacks
- Proxy/VPN detection: Known exit nodes from VPN services and proxy providers
- Request volume: Unusual request patterns from a single IP
IP Classification Impact:
Residential ISP IP → Trust score: High (8-10/10)
Mobile carrier IP → Trust score: Very high (9-10/10)
Business/ISP IP → Trust score: Medium (5-7/10)
Cloud provider IP → Trust score: Low (1-3/10)
Known proxy exit node → Trust score: Very low (0-2/10)
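The classification above can be sketched as a lookup. The scores and the ASN list are hypothetical, chosen only to mirror the table; real reputation services use proprietary data and full ASN databases.

```python
# Hypothetical trust scores mirroring the classification above.
TRUST_BY_CATEGORY = {
    "mobile": 10,
    "residential": 9,
    "business": 6,
    "datacenter": 2,
    "proxy_exit": 1,
}

# Example datacenter ASNs: AS16509 (Amazon), AS14061 (DigitalOcean),
# AS24940 (Hetzner). A real system would query a full ASN database.
DATACENTER_ASNS = {16509, 14061, 24940}

def trust_score(asn: int, category: str = None) -> int:
    """Score an IP by its ASN. Unknown ASNs default to 'residential'
    here purely for illustration; real systems never assume trust."""
    if category is None:
        category = "datacenter" if asn in DATACENTER_ASNS else "residential"
    return TRUST_BY_CATEGORY[category]
```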
Layer 4: Rate-Based Blocking
Even with perfect headers and a clean IP, requesting too fast triggers 403:
- Most sites allow 1-5 requests per second from a single IP
- Some apply daily limits (e.g., 1000 pages/day per IP)
- Many implement burst detection — 10 fast requests followed by a pause still gets caught
The challenge: rate limits are per-site, per-page, and often undocumented. You discover them by getting blocked.
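Since limits are undocumented, the practical defense is a conservative client-side throttle. A minimal sketch, assuming a budget you tune per site; the jitter avoids the machine-regular cadence that burst detectors flag.

```python
import random
import time

class Throttle:
    """Keep the request rate under a per-second cap, with random
    jitter so intervals are never perfectly regular."""

    def __init__(self, max_per_second: float = 1.0):
        self.min_interval = 1.0 / max_per_second
        self.last_request = None

    def wait(self) -> None:
        """Block until it is safe to send the next request."""
        now = time.monotonic()
        if self.last_request is not None:
            # 0-50% jitter on top of the base interval.
            interval = self.min_interval * (1 + random.uniform(0.0, 0.5))
            elapsed = now - self.last_request
            if elapsed < interval:
                time.sleep(interval - elapsed)
        self.last_request = time.monotonic()
```

Call `throttle.wait()` before each request; start well under the suspected limit and back off further on any 403 or 429.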
Layer 5: Cookie and Session Validation
Many sites require a valid session to serve content:
- First visit sets cookies via JavaScript
- Subsequent requests must include those cookies
- Missing or expired cookies → 403
This is especially common on e-commerce sites that use bot management platforms (Datadome, PerimeterX, Akamai). The initial page load runs JavaScript that generates a bot-detection cookie. Without it, every subsequent request returns 403.
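For sites that set session cookies through plain Set-Cookie headers, reusing one cookie jar across requests is enough; cookies generated by JavaScript (the bot-management case above) still require a browser engine. A stdlib sketch with a placeholder URL:

```python
import urllib.request
from http.cookiejar import CookieJar

# One shared jar: cookies set by the first response are replayed
# automatically on every later request made through this opener.
jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

def fetch(url: str) -> bytes:
    """Fetch a URL, sending any cookies accumulated so far."""
    with opener.open(url) as response:
        return response.read()

# fetch("https://example.com/") would store any Set-Cookie values
# in `jar`; subsequent fetch() calls send them back to the server.
```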
Layer 6: TLS Fingerprint Mismatch
As covered in our Cloudflare Error 1020 guide, the TLS handshake reveals your client identity before any HTTP data is exchanged.
The JA3/JA4 fingerprint of Python requests is known and cataloged. Anti-bot services maintain databases of TLS fingerprints for every major HTTP library, automation tool, and browser version.
This is why “just add headers” doesn’t fix 403 on protected sites. The server already knows you’re not a browser from the TLS handshake.
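The JA3 fingerprint itself is simple to compute: it is the MD5 of five comma-separated ClientHello fields, each a dash-joined list of decimal values, in the order the client sent them. The field values below are illustrative, not taken from a real capture.

```python
import hashlib

def ja3_hash(tls_version, ciphers, extensions, curves, point_formats) -> str:
    """JA3: MD5 over 'version,ciphers,extensions,curves,point_formats',
    with each list dash-joined in client order."""
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return hashlib.md5(",".join(fields).encode()).hexdigest()

# Reordering or changing any single field changes the hash, which is
# why two clients sending identical headers can still be told apart.
```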
Layer 7: JavaScript Challenge Failure
The most sophisticated layer. The server serves a JavaScript challenge that must execute before the real content loads. If your client doesn't run JavaScript, the challenge fails silently, and you get one of:
- A 403 response
- A 200 with an empty/challenge HTML page (not the actual content)
- A redirect loop
Bot management systems using this approach:
- Cloudflare — Managed Challenges, Turnstile
- Datadome — JavaScript tag + behavioral analysis
- PerimeterX (HUMAN) — Sensor data collection via JS
- Akamai Bot Manager — Browser fingerprinting via JS
- Kasada — Proof-of-work JavaScript challenges
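Because these systems often return a 200 with challenge HTML rather than a clean 403, it pays to sanity-check response bodies before parsing. A heuristic sketch; the marker strings are examples drawn from publicly observed challenge pages, not an exhaustive or stable list.

```python
# Example markers of common challenge pages; vendors change these
# over time, so maintain your own list per target site.
CHALLENGE_MARKERS = [
    "just a moment",         # Cloudflare interstitial title
    "cf-chl",                # Cloudflare challenge script prefix
    "captcha-delivery.com",  # Datadome challenge resources
    "px-captcha",            # PerimeterX / HUMAN
]

def looks_like_challenge(html: str) -> bool:
    """True if a response body resembles a challenge page, not content."""
    lower = html.lower()
    return any(marker in lower for marker in CHALLENGE_MARKERS)
```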
Which Protection Do Popular Sites Use?
Understanding what you’re up against helps you plan your infrastructure:
| Protection System | Detection Approach | Examples of Sites Using It |
|---|---|---|
| Cloudflare | WAF rules, TLS fingerprinting, Managed Challenges | ~20% of all websites |
| Datadome | JS fingerprinting, Picasso challenge, behavioral ML | E-commerce, ticketing |
| Akamai Bot Manager | Sensor data, device fingerprinting | Airlines, banks, large retail |
| PerimeterX (HUMAN) | Client-side sensors, behavioral analysis | Streaming, SaaS platforms |
| Kasada | Proof-of-work, WASM challenges | Sports betting, gaming |
| AWS WAF | Rate rules, IP reputation, geo-blocking | AWS-hosted applications |
| Imperva (Incapsula) | Cookie challenges, JS fingerprinting | Enterprise, government |
Diagnostic Decision Tree
Got a 403? Start here:
1. Is response from a CDN/bot protection service?
├─ Yes → Skip to step 4
└─ No → Continue
2. Did you set User-Agent?
├─ No → Add realistic browser UA. Retry.
└─ Yes → Continue
3. Did you include full header set (sec-ch-ua, Sec-Fetch-*, etc.)?
├─ No → Add all Client Hints headers. Retry.
└─ Yes → You're being blocked by IP or rate limit.
4. What bot protection system is it?
├─ Check response headers for: server, x-datadome, x-px-*
├─ Check page source for: challenge scripts, cf-mitigated
└─ Identified? → Research that specific system's detection method.
5. Is your TLS fingerprint realistic?
├─ Using requests/axios → No. Switch to browser or tls-client.
└─ Using Playwright → Check stealth plugin configuration.
6. Does the site require JavaScript execution?
├─ Yes → You need a browser engine, not an HTTP library.
└─ No → Focus on headers + IP quality.
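Step 4 of the tree can be partly automated by scanning response headers for vendor signatures. The signature table below is illustrative and incomplete; vendors rename headers over time, so verify against live responses.

```python
from typing import Optional

# Header names (lowercased) that hint at a specific vendor.
SIGNATURES = {
    "cloudflare": ("cf-ray", "cf-mitigated"),
    "datadome": ("x-datadome",),
    "perimeterx": ("x-px-authorization",),
}

def identify_protection(headers: dict) -> Optional[str]:
    """Guess the bot-protection vendor from response headers, or None."""
    lower = {k.lower(): v for k, v in headers.items()}
    for vendor, markers in SIGNATURES.items():
        if any(m in lower for m in markers):
            return vendor
    # Fall back to the Server header.
    if "cloudflare" in lower.get("server", "").lower():
        return "cloudflare"
    return None
```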
Choosing the Right Architecture
| Your Situation | Recommended Approach |
|---|---|
| Simple site, no bot protection | requests + full headers + delays |
| Cloudflare Free plan | Browser automation + stealth plugin |
| Enterprise bot protection (Datadome, Akamai) | Managed scraping API or premium proxy with unblocking |
| High volume (>100k pages/day) | Dedicated proxy infrastructure + distributed architecture |
| Compliance-sensitive data collection | Licensed data provider or official API |
The key insight: the cheapest fix depends on what’s blocking you. A $0 header fix solves Layer 1-2 problems. Layer 5-7 problems require real infrastructure investment.
Key Takeaways
- A 403 is not one problem — it’s at least 7 different problems with different solutions.
- Headers alone won’t save you on modern protected sites. TLS fingerprinting catches you before HTTP.
- Know your target’s protection system before writing any code. The detection method determines the solution.
- IP quality matters more than IP quantity. One clean residential IP outperforms 1000 burned datacenter IPs.
- JavaScript execution is now table stakes for most commercial sites. HTTP-only scrapers hit 403 on over 40% of the web.
ProxyOps Team
Independent infrastructure reviews from engineers who've deployed at scale. No vendor bias, just data.