Incident Report · April 14, 2026
10-second TTFB on tradeit.gg — Root cause analysis via Datadog APM
ACTIVE INCIDENT — 2 Datadog Monitors Alerting
System load high, anomalous RDS database connections
Tested three hypotheses — all three eliminated; the degradation traced to the database tier
CPU Starvation? NO — backend hosts at 45-55% CPU, headroom available.
Nginx Throttling? NO — ~2,300 connections is normal; steady 1,300 req/s.
DDoS Attack? NO — traffic flat over 4h at ~600 MB/s.
OpenSearch + MySQL RDS both degraded simultaneously
RDS "steamarbitrage"
Main database saturated: 46% average CPU over 24h, peaking at 96% during business hours. The connection count (~3,477) is extreme — a typical Aurora deployment holds a few hundred.
OpenSearch Cluster
Inventory queries make 3 sequential OpenSearch calls per request. Single inventory/data request = 1.9s just from OpenSearch.
Trace ID: 69de43e5...054b2fcb — GET /api/v2/inventory/data — 1,915ms
Request Waterfall — tradeit-backend-lb-001
99.5% of request time (1,906ms of the 1,915ms total) is spent in OpenSearch, across three partially sequential queries.
SSR Amplification Effect
Nuxt SSR renders the page server-side by calling multiple backend APIs before sending HTML. If each API call takes 1-2s and they run in a waterfall, the cumulative TTFB easily reaches 10s.
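A minimal sketch of the amplification effect, using simulated 50ms latencies and hypothetical endpoint names (not the real tradeit.gg APIs): sequential awaits add their latencies together, while `Promise.all` bounds the render by the slowest single call.

```javascript
// Illustrative only: stand-ins for the backend API calls Nuxt makes
// during a server-side render. Names and timings are invented.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const fetchInventory = () => delay(50).then(() => ({ items: [] }));
const fetchPricing   = () => delay(50).then(() => ({ prices: [] }));
const fetchProfile   = () => delay(50).then(() => ({ user: null }));

async function renderWaterfall() {
  const start = Date.now();
  await fetchInventory(); // each await blocks until the previous response
  await fetchPricing();
  await fetchProfile();
  return Date.now() - start; // ~150ms: the three latencies add up
}

async function renderParallel() {
  const start = Date.now();
  await Promise.all([fetchInventory(), fetchPricing(), fetchProfile()]);
  return Date.now() - start; // ~50ms: bounded by the slowest call
}
```

With the measured ~1.9s per inventory/data call, the same waterfall over a handful of endpoints is exactly how a 10s TTFB accumulates.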
Last hour — 58 slow inventory/data requests alone
Sustained saturation, not a spike — follows traffic pattern
RDS CPU follows daily traffic pattern: 22% overnight → 96% peak hours. This is structural, not a spike.
Current resource utilization across the stack
| Component | Host / Instance | CPU | Connections | Status |
|---|---|---|---|---|
| RDS steamarbitrage | Aurora MySQL | 60-96% | 3,477 | ALERT |
| RDS pricing-test | Aurora MySQL | 30-71% | 714 | ELEVATED |
| Backend LB-001 | c7a.2xlarge | 47.5% | — | BUSY |
| Backend LB-002 | c7a.2xlarge | 44.8% | — | BUSY |
| Nginx | Reverse proxy | — | 2,300 | OK |
| Nuxt SSR | new-tradeit | ? | ? | NO APM |
SHOW PROCESSLIST mapped to EC2 instances via AWS API — ~3,400 connections traced
| Source | Instance Type | Connections | Pool Limit | Notes |
|---|---|---|---|---|
| tradeit-backend-lb-001 | c7a.2xlarge | 125 | 40+40 × 16 workers | ~8 active per worker |
| tradeit-backend-lb-002 | c7a.2xlarge | 114 | 40+40 × 16 workers | ~7 active per worker |
| tradeit-prod-cron-server | m6a.large | 68 | — | No pool config found in repos |
| old.tradeit.gg | c5.xlarge | 44 | 35 | Over pool limit — leaking? |
| Steam Value Tracker | t2.xlarge | 40 | — | Hitting exact limit |
| Inventory service | c5.4xlarge | 39 | — | — |
| Metabase | t3.large | 14 | — | Analytics queries |
| Socket server | c7a.large | 10 | — | — |
Each bot is a single Node.js process with its own mysql.createPool({ connectionLimit: 40 }).
Most bots use only 4-5 connections, but 4 bots are maxing out their pool at 40 connections each.
No shared proxy — every bot maintains direct TCP connections to Aurora.
Bot fleet total: ~2,132 connections — roughly 61% of the ~3,477 on steamarbitrage, and by far the largest single source.
| Service | Package | Pool Limit | Clustering | Max Theoretical |
|---|---|---|---|---|
| tradeit-backend | mysql2 | 40 main + 40 pricing | PM2 cluster, 16/host × 2 | 2,560 |
| tradeit-tradebot-server | mysql (v1) | 40 per bot | Single process, ~400 bots | 16,000 |
| old-tradeit | mysql (v1) | 35 | Single instance | 35 |
| tradeit-admin-backend | Prisma | ~5 (default) | PM2 cluster max | varies |
| new-tradeit (Nuxt SSR) | — | — | No MySQL | 0 |
| tradeit-login-server | — | — | Redis only | 0 |
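The "Max Theoretical" column is straight multiplication of pool limits by process counts; a quick sanity check using the numbers from the tables above:

```javascript
// Sanity check of the Max Theoretical column.
const backendMax  = 2 * 16 * (40 + 40); // 2 hosts × 16 PM2 workers × (40 main + 40 pricing)
const botFleetMax = 400 * 40;           // ~400 bots × connectionLimit 40

console.log(backendMax, botFleetMax); // 2560 16000
```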
Conclusion: Not a Leak, Not a Misconfiguration
Connection count has been stable at 2,700-3,500 for 7 days. The 3,400 connections are structural:
~400 bots each maintaining direct TCP connections to Aurora with no shared proxy.
The highest-leverage fix is RDS Proxy for the bot fleet, which would multiplex ~2,000 bot connections through ~50-100 proxy connections.
Alternative: reduce bot connectionLimit from 40 to 5 (matching actual usage) to prevent spike risk.
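A sketch of that alternative, assuming the bots keep using the `mysql` package's `createPool` as described above — the host and credential fields are placeholders, and this is an untested proposal, not a reviewed config:

```javascript
// Proposed per-bot pool (sketch): size to observed usage (~4-5 connections)
// instead of 40. Placeholder env vars; database name from this report.
const mysql = require('mysql');

const pool = mysql.createPool({
  host: process.env.DB_HOST,       // Aurora endpoint (placeholder)
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  database: 'steamarbitrage',
  connectionLimit: 5,              // was 40; most bots use only 4-5
  waitForConnections: true,        // queue briefly instead of erroring at the cap
});
```

`waitForConnections: true` (the package default) makes a bot queue a query during a short burst rather than fail, so the lower cap degrades gracefully.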
Remediation, ranked by impact — immediate, short-term, medium-term:
Immediate: reduce the per-bot connectionLimit from 40 to ~5. Four bots are already maxing out; lowering the limit prevents spikes and frees ~14,000 theoretical max connections.
Short-term: add dd-trace to the Nuxt SSR server to make the render waterfall visible.
Medium-term: parallelize the sequential backend and OpenSearch calls with Promise.all().
Critical Observability Gap
The Nuxt SSR frontend (new-tradeit / old-tradeit) has zero APM traces in Datadog.
We can prove the backend is slow, but we cannot see the SSR waterfall that multiplies these delays.
Adding dd-trace to Nuxt is the single highest-leverage observability improvement.
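A minimal sketch of that instrumentation, assuming the Nuxt server is a standard Node process (the service name below is a placeholder). dd-trace must be initialized before any other module loads so its built-in integrations can patch http, mysql, and friends:

```javascript
// Sketch: Datadog APM for the Nuxt SSR process. Must be required
// before anything else in the server entry point.
const tracer = require('dd-trace').init({
  service: 'new-tradeit-ssr', // placeholder service name
  env: 'production',
});

module.exports = tracer;
```

Alternatively, start the process with `NODE_OPTIONS="--require dd-trace/init"` so no code change is needed; the exact hook point depends on the Nuxt server entry.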
Generated April 14, 2026 · Datadog MCP Investigation · tradeit.gg