NGINX Upstream Keepalive Explained: Why Missing It Causes 502 Errors
Missing keepalive in your NGINX upstream block silently kills connections under load. Here's exactly what keepalive does, how TCP connection reuse works, and the production-ready config that stops 502s before they start.
Introduction
If you've read the NGINX 502 Bad Gateway debugging guide, you know that a missing keepalive directive in the upstream block is the single most common cause of 502 errors under load. But why does one missing line cause thousands of connection failures? And what exactly does adding it do?
This article is the deep dive. We'll go from TCP fundamentals to production-ready NGINX configuration, covering every mistake engineers make along the way.
What Is Keepalive?
Keepalive is a mechanism that allows a single TCP connection to be reused for multiple HTTP requests, instead of opening and closing a new connection for every request.
Without keepalive, every HTTP request follows this lifecycle:
Client → [TCP SYN] → Server
Client ← [TCP SYN-ACK] ← Server
Client → [TCP ACK] → Server # 3-way handshake complete
Client → [HTTP Request] → Server
Client ← [HTTP Response] ← Server
Client → [TCP FIN] → Server # teardown begins
Client ← [TCP FIN-ACK] ← Server
Client → [TCP ACK] → Server # connection fully closed
That is 7 network round trips just to send one HTTP request and receive one response. At low traffic, the overhead is invisible. At 200 req/s, you are opening and closing 200 TCP connections every second.
With keepalive:
Client → [TCP SYN] → Server
Client ← [TCP SYN-ACK] ← Server
Client → [TCP ACK] → Server # handshake — done once
Client → [HTTP Request 1] → Server
Client ← [HTTP Response 1] ← Server
Client → [HTTP Request 2] → Server # same connection reused
Client ← [HTTP Response 2] ← Server
Client → [HTTP Request N] → Server # still the same connection
Client ← [HTTP Response N] ← Server
Client → [TCP FIN] → Server # teardown — done once
The TCP handshake happens once. All subsequent requests ride the same connection. This is the difference between 7 round trips per request and 2.
HTTP/1.0 vs HTTP/1.1: Why the Version Matters
This is the part most documentation skips over, and it is why NGINX keepalive configuration has two required lines that are easy to get wrong.
HTTP/1.0 — Connection per Request
HTTP/1.0 (1996) has no keepalive by default. Every request requires a new TCP connection. There was an unofficial Connection: Keep-Alive header that some implementations supported, but it was not standardized.
HTTP/1.1 — Persistent Connections by Default
HTTP/1.1 (1997) made persistent connections the default. If you do not explicitly send Connection: close, the connection stays open after the response. This is now how every browser, curl, and HTTP client works.
Why This Matters for NGINX
When NGINX proxies a request to an upstream, it acts as an HTTP client making a request to your backend. By default, NGINX uses HTTP/1.0 for upstream connections — even if the original client used HTTP/1.1.
This means: even if you configure keepalive 64 in the upstream block, NGINX will not actually reuse connections unless you explicitly tell it to use HTTP/1.1:
proxy_http_version 1.1;
And there is a second trap. HTTP/1.1 keeps connections open by default — but if the original client sent a Connection: close header, NGINX will forward that header to the upstream, which tells the backend to close the connection immediately after responding. You must clear it:
proxy_set_header Connection "";
Without both lines, keepalive in the upstream block does nothing.
The TIME_WAIT Problem: What Happens Without Keepalive
Every time a TCP connection closes, it does not disappear immediately. The closing side enters TIME_WAIT state and holds the socket for 2 × MSL (Maximum Segment Lifetime) — typically 60 seconds on Linux.
This exists for a good reason: to ensure any delayed packets from the old connection do not corrupt a new connection on the same port. But it creates a real problem at scale.
The Math
Linux allocates ephemeral ports for outgoing connections from a range defined by:
cat /proc/sys/net/ipv4/ip_local_port_range
# 32768 60999
That is ~28,000 available ports. Each port can only be in TIME_WAIT once at a time.
Now do the math:
- 200 req/s with no keepalive = 200 new TCP connections per second
- Each connection stays in TIME_WAIT for 60 seconds
- After 60 seconds: 200 × 60 = 12,000 ports consumed
- After 140 seconds: 200 × 140 = 28,000 ports — port pool exhausted
New connection attempts get ECONNREFUSED. NGINX logs:
connect() failed (111: Connection refused) while connecting to upstream
Your app is fine. Your NGINX is fine. The kernel ran out of ports.
Watching It Happen
Run this during a load test with no upstream keepalive:
watch -n 0.5 'ss -s'
You will see output like:
Total: 31847
TCP: 29203 (estab 187, closed 28901, orphaned 0, timewait 28893)
When timewait approaches your port range limit, 502s start appearing. This is the smoking gun.
The Complete Production-Ready Configuration
Here is the full upstream keepalive configuration with every directive explained:
upstream app_backend {
server 127.0.0.1:8080;
# Maximum number of idle keepalive connections per worker process.
# These are kept open and reused for new requests.
# Rule of thumb: (expected_rps / worker_processes) * 0.5
keepalive 64;
# Number of requests served through one keepalive connection
# before it is closed and recycled. Prevents memory leaks in
# long-running upstream connections.
keepalive_requests 1000;
# How long an idle keepalive connection is kept open.
# Should be slightly less than the upstream's own keepalive timeout
# to avoid race conditions where NGINX sends on a connection
# the upstream has already decided to close.
keepalive_timeout 65s;
}
server {
listen 443 ssl http2;
server_name api.example.com;
location /api/ {
proxy_pass http://app_backend;
# REQUIRED: Use HTTP/1.1 — keepalive does not work with HTTP/1.0
proxy_http_version 1.1;
# REQUIRED: Clear the Connection header so it is not forwarded
# to the upstream. Forwarding "Connection: close" would
# immediately close the keepalive connection.
proxy_set_header Connection "";
# Standard proxy headers
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# Timeouts — tighten from the 60s defaults
proxy_connect_timeout 5s;
proxy_send_timeout 30s;
proxy_read_timeout 30s;
}
}
How keepalive 64 Actually Works
The keepalive 64 directive does not set a maximum connection limit per se. It sets the maximum number of idle connections that NGINX will keep open per worker process, waiting to be reused.
Under load, NGINX may have far more than 64 active connections to the upstream — that is fine. The 64 only controls how many idle ones are kept in the pool after requests complete. If all 64 idle connections are in use when a new request arrives, NGINX opens a new connection temporarily. If the pool is not full when a connection becomes idle, it goes back into the pool.
Think of it like a connection pool in a database driver.
Common Mistakes
Mistake 1: Setting keepalive but forgetting proxy_http_version
# BROKEN — keepalive silently does nothing
upstream app_backend {
server 127.0.0.1:8080;
keepalive 64;
}
location /api/ {
proxy_pass http://app_backend;
# Missing: proxy_http_version 1.1;
# Missing: proxy_set_header Connection "";
}
NGINX will open new connections for every request. No error, no warning. The keepalive pool sits empty.
Mistake 2: keepalive value too low
# Too conservative for 200 req/s across 4 workers
keepalive 4;
With 4 worker processes and only 4 idle connections per worker, the pool is almost always empty under load. NGINX keeps opening new connections. TIME_WAIT builds up.
A reasonable starting point for a busy API: keepalive 64 per worker. Tune upward if you still see TIME_WAIT climbing.
Mistake 3: keepalive_timeout higher than upstream's timeout
keepalive_timeout 120s; # NGINX will hold connections for 2 minutes
If your Node.js server has a default keepAliveTimeout of 5 seconds (the Node.js default before v19), NGINX will try to reuse connections the upstream has already closed. Result: sporadic 502s that are hard to reproduce.
Always set NGINX's keepalive_timeout slightly lower than the upstream's keepalive timeout.
For Node.js (which defaults to keepAliveTimeout: 5000ms in older versions):
// In your Node.js server — set this explicitly
server.keepAliveTimeout = 75000; // 75 seconds
server.headersTimeout = 76000; // must be > keepAliveTimeout
Then in NGINX:
keepalive_timeout 65s; # lower than Node's 75s
Mistake 4: Not setting keepalive on multiple upstream blocks
It is easy to add keepalive to one upstream and forget others:
upstream api_backend {
server 127.0.0.1:8080;
keepalive 64; # ✓ configured
}
upstream static_backend {
server 127.0.0.1:9000;
# ✗ missing — will exhaust ports under load
}
Audit every upstream block. Any high-traffic upstream without keepalive is a future incident.
Mistake 5: Applying keepalive to upstreams with a single short-lived request type
If your upstream handles very long-running requests (30+ seconds each), a small keepalive pool fills up quickly. Tune keepalive higher or use a dedicated upstream block for those endpoints with different settings.
Verifying Keepalive Is Actually Working
After configuring keepalive, verify it is active — do not assume.
Method 1: Watch TIME_WAIT Under Load
# Run load test in background, watch socket state
ab -n 50000 -c 100 http://localhost/api/health &
watch -n 0.5 'ss -s | grep -E "timewait|estab"'
Before keepalive: TIME_WAIT climbs steadily. At 200 req/s it reaches thousands within minutes.
After keepalive: TIME_WAIT stays low (under 200). ESTABLISHED count stays roughly stable — those are the reused keepalive connections.
Method 2: Check NGINX Status Module
If you have ngx_http_stub_status_module enabled:
location /nginx_status {
stub_status;
allow 127.0.0.1;
deny all;
}
curl http://localhost/nginx_status
# Active connections: 47
# server accepts handled requests
# 182340 182340 891023
# Reading: 0 Writing: 12 Waiting: 35
The Waiting count represents idle keepalive connections being held. Under load with keepalive working, this number stays healthy rather than dropping to zero constantly.
Method 3: strace a Worker Process
# Find an NGINX worker PID
ps aux | grep "nginx: worker"
# Watch its connect() calls
strace -p <worker_pid> -e trace=connect 2>&1 | head -50
Without keepalive: you see a constant stream of connect() calls.
With keepalive working: connect() calls are rare — connections are being reused.
Impact: Before and After
Here are real numbers from the incident described in the NGINX 502 debugging post, measured with ab at 200 req/s for 5 minutes:
| Metric | Without keepalive | With keepalive |
|---|---|---|
| TIME_WAIT sockets (peak) | 14,800 | 87 |
| New TCP connections/sec | ~200 | ~3 |
| 502 error rate | 0.3% | 0% |
| Avg response time | 42ms | 31ms |
| P99 response time | 310ms | 89ms |
The response time improvement is a bonus — fewer TCP handshakes means less latency for every request.
When Keepalive Is Not Enough
Keepalive solves the connection exhaustion problem, but it does not fix everything. If you are still seeing 502s after configuring keepalive correctly, check:
- Worker connections limit —
worker_connectionsmay be too low. See NGINX 502 Bad Gateway Under Load: Causes, Debugging, and Fixes - Upstream pool limits — PHP-FPM
pm.max_childrenor Gunicorn worker count - Upstream response time — if the app is slow,
proxy_read_timeoutmay be firing - Kernel socket buffers — under extreme load, check
net.core.somaxconnandnet.ipv4.tcp_max_syn_backlog
For the full diagnostic flow, see the NGINX 502 debugging guide.
TL;DR — The Minimum You Need
Add this to every NGINX upstream block that handles significant traffic:
upstream your_backend {
server 127.0.0.1:8080;
keepalive 64;
keepalive_requests 1000;
keepalive_timeout 65s;
}
Add this to every location block that proxies to it:
proxy_http_version 1.1;
proxy_set_header Connection "";
Without all five lines working together, keepalive does not function. With them, you eliminate the most common cause of NGINX 502 errors under load — permanently.