reachlin

reachlin's development notes

Summary

Cloudflare logs showed recurring spikes of HTTP 499 errors on web-api.bigbus.com/health-check. Investigation confirmed the issue is a stale connection race condition between Cloudflare and AWS ALB — not a problem with the web app, AWS infrastructure, or web devices.


Symptoms


Investigation

Cloudflare GraphQL Analysis

AWS ALB Access Logs (20:25 UTC spike window)

App Logs (web-firelens, 636 files scanned)

CloudWatch (ALB metrics)


Root Cause

Stale keep-alive connection race condition between Cloudflare edge and AWS ALB.

web iPad → CF Edge → [HTTP/2 persistent connection pool] → AWS ALB → ECS pod

When a pooled connection sits idle for >65s, ALB silently closes it. CF doesn’t know and tries to reuse the dead connection for the next health-check request → receives TCP RST → logs 499 with origin_status=0. The web device never sees the failure as CF retries or the next request succeeds.

The spike pattern (large burst, then back to normal) indicates multiple CF connections went stale simultaneously — consistent with a brief CF edge POP event causing connections to pile up idle and expire together.


Fix

Increase ALB idle timeout from 65s to 120s (must exceed CF’s ~90s keepalive).

AWS Console → EC2 → Load Balancers → vpc-production → Attributes → Connection idle timeout → 120

This ensures ALB never closes a connection before CF does, eliminating the race condition.

Setting Current Target
ALB idle timeout 65s 120s
CF proxy_read_timeout 100s no change
CF http2 to origin on no change

Impact: None. Read-only attribute change, no downtime, takes effect immediately.


Non-Issues Ruled Out