Incident Report · April 14, 2026

Slow Site Load Investigation

10-second TTFB on tradeit.gg — Root cause analysis via Datadog APM

ACTIVE INCIDENT — 2 Datadog Monitors Alerting

System load high, anomalous RDS database connections

Hypothesis Elimination

Tested three hypotheses — all three ruled out, isolating the root cause in the data layer

CPU Starvation?

NO

Backend at 45-55% CPU
Headroom available

RULED OUT

Nginx Throttling?

NO

~2,300 connections normal
1,300 req/s steady

RULED OUT

DDoS Attack?

NO

Flat traffic over 4h
~600 MB/s steady

RULED OUT

Root Cause: Dual Data Layer Saturation

OpenSearch + MySQL RDS both degraded simultaneously

RDS "steamarbitrage"

95.9%
Peak CPU
3,477
Connections

Main database saturated: 46% average CPU over 24h, peaking at 96% during business hours. The connection count is extreme — a typical Aurora instance holds a few hundred connections, not ~3,500.

OpenSearch Cluster

772ms
Slowest Query
3x
Sequential

Inventory queries make three sequential OpenSearch calls per request; a single inventory/data request spends ~1.9s in OpenSearch alone.


Proof: Full Trace Flamegraph

Trace ID: 69de43e5...054b2fcb — GET /api/v2/inventory/data — 1,915ms

Request Waterfall — tradeit-backend-lb-001

express.request — GET /api/v2/inventory/data
1,915ms
opensearch: group_internal_730/_search
772ms
opensearch: inventory_730/_search
622ms
opensearch: inventory_730/_search
512ms
redis: LRANGE
2ms

99.5% of request time is spent in OpenSearch: the three queries run partially in sequence, accounting for 1,906ms of the 1,915ms request.

SSR Amplification Effect

Nuxt SSR renders the page server-side by calling multiple backend APIs before sending HTML. If each API call takes 1-2s, and they run in a waterfall, the cumulative TTFB easily reaches 10s:

Nuxt SSR render starts...
inventory/data → 1.9s (OpenSearch x3)
meta/type → 1.2s (MySQL at 95% CPU)
categories/:gameId → 0.8s (MySQL)
giveaway → 1.5s (MySQL: tickets SUM)
configurations → 0.3s
exchange-rate, user/data, ... → ~0.5s each
Total TTFB: ~6-10s depending on waterfall depth
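The waterfall math above can be sketched as a quick model. The latencies are the per-endpoint figures from this report; the "others" entry lumps the remaining ~0.5s calls into two assumed calls. The gap between the sequential and parallel totals is why the waterfall, not any single endpoint, dominates TTFB:

```javascript
// Modeled SSR API latencies (ms) from this report.
const latenciesMs = {
  inventoryData: 1900,  // OpenSearch x3
  metaType: 1200,       // MySQL at 95% CPU
  categories: 800,      // MySQL
  giveaway: 1500,       // MySQL: tickets SUM
  configurations: 300,
  others: 1000,         // assumed: ~2 more calls at ~500ms each
};

const values = Object.values(latenciesMs);

// Full waterfall: every call waits for the previous one.
const sequentialTtfbMs = values.reduce((sum, ms) => sum + ms, 0); // 6700

// Fully parallel: TTFB is bounded by the slowest single call.
const parallelTtfbMs = Math.max(...values); // 1900

console.log({ sequentialTtfbMs, parallelTtfbMs });
```

A ~6.7s modeled sequential TTFB lands inside the observed 6-10s range; real renders are usually only partially sequential, which is why observed TTFB varies with waterfall depth.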

Slow Backend Endpoints (>1s)

Last hour — 1,058 slow inventory/data requests alone

inventory/data
1,058 req · 1.9s avg
seo/page-contents
187 req
meta/top-items
165 req
user/data
165 req
configurations
151 req
categories/:gameId
144 req
inventory/my/data
109 req

RDS "steamarbitrage" — 24h CPU Trend

Sustained saturation, not a spike — follows traffic pattern

[Chart: hourly RDS CPU samples, 14:00 yesterday through now]

RDS CPU follows daily traffic pattern: 22% overnight → 96% peak hours. This is structural, not a spike.


Infrastructure State

Current resource utilization across the stack

Component | Host / Instance | CPU | Connections | Status
RDS steamarbitrage | Aurora MySQL | 60-96% | 3,477 | ALERT
RDS pricing-test | Aurora MySQL | 30-71% | 714 | ELEVATED
Backend LB-001 | c7a.2xlarge | 47.5% | — | BUSY
Backend LB-002 | c7a.2xlarge | 44.8% | — | BUSY
Nginx | Reverse proxy | — | 2,300 | OK
Nuxt SSR | new-tradeit | ? | ? | NO APM

RDS Connection Audit

SHOW PROCESSLIST mapped to EC2 instances via the AWS API — ~2,600 of the ~3,400 connections traced to a source

82%
Bot Fleet Share
~400
Unique Bot IPs
5
Avg Conns/Bot
40
Pool Limit/Bot

Top Connection Sources (from MySQL PROCESSLIST)

Source | Instance Type | Connections | Pool Limit | Notes
tradeit-backend-lb-001 | c7a.2xlarge | 125 | 40+40 × 16 workers | ~8 active per worker
tradeit-backend-lb-002 | c7a.2xlarge | 114 | 40+40 × 16 workers | ~7 active per worker
tradeit-prod-cron-server | m6a.large | 68 | — | No pool config found in repos
old.tradeit.gg | c5.xlarge | 44 | 35 | Over pool limit — leaking?
Steam Value Tracker | t2.xlarge | 40 | — | Hitting exact limit
Inventory service | c5.4xlarge | 39 | — | —
Metabase | t3.large | 14 | — | Analytics queries
Socket server | c7a.large | 10 | — | —

Bot Fleet (~400 bots, each t3.micro)

Each bot is a single Node.js process with its own mysql.createPool({ connectionLimit: 40 }). Most bots use only 4-5 connections, but 4 bots are maxing out their pool at 40 connections each. No shared proxy — every bot maintains direct TCP connections to Aurora.

~200 bots at 5 conns
~1,000 conns
~180 bots at 4 conns
~720 conns
18 hot bots (10-40)
~412 conns

Bot fleet total: ~2,132 connections (82% of connections traced to a source)
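The fleet total reconciles as simple arithmetic. One assumption is labeled here: the 82% share appears to be measured against connections traced to a source (~2,600, i.e. the fleet plus the 454 connections in the named-source table above), not against the 3,477 total:

```javascript
// Bot-fleet connection math from the audit.
const groups = [
  { bots: 200, connsEach: 5 },  // ~200 bots at 5 conns
  { bots: 180, connsEach: 4 },  // ~180 bots at 4 conns
];
const hotBotConns = 412;        // 18 hot bots at 10-40 conns each

const fleetTotal =
  groups.reduce((sum, g) => sum + g.bots * g.connsEach, 0) + hotBotConns; // 2,132

// Named sources from the PROCESSLIST table: 125+114+68+44+40+39+14+10.
const namedSourceConns = 454;
const tracedTotal = fleetTotal + namedSourceConns; // 2,586 ≈ ~2,600

const fleetShare = fleetTotal / tracedTotal; // ≈ 0.82
console.log({ fleetTotal, fleetShare: fleetShare.toFixed(2) });
```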

Pool Configuration (from source code)

Service | Package | Pool Limit | Clustering | Max Theoretical
tradeit-backend | mysql2 | 40 main + 40 pricing | PM2 cluster, 16/host × 2 hosts | 2,560
tradeit-tradebot-server | mysql (v1) | 40 per bot | Single process, ~400 bots | 16,000
old-tradeit | mysql (v1) | 35 | Single instance | 35
tradeit-admin-backend | Prisma | ~5 (default) | PM2 cluster | varies
new-tradeit (Nuxt SSR) | No MySQL | — | — | 0
tradeit-login-server | Redis only | — | — | 0

Conclusion: Not a Leak, Not a Misconfiguration

Connection count has been stable at 2,700-3,500 for 7 days. The 3,400 connections are structural: ~400 bots each maintaining direct TCP connections to Aurora with no shared proxy. The highest-leverage fix is RDS Proxy for the bot fleet, which would multiplex ~2,000 bot connections through ~50-100 proxy connections. Alternative: reduce bot connectionLimit from 40 to 5 (matching actual usage) to prevent spike risk.
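Both options in the conclusion reduce pressure in quantifiable ways. A sketch of the numbers (the 50-100 proxy-connection figure is this report's estimate, not a measured value):

```javascript
const bots = 400;

// Option A: shrink per-bot pool limit from 40 to 5.
const currentTheoreticalMax = bots * 40;  // 16,000
const proposedTheoreticalMax = bots * 5;  //  2,000
const spikeRiskRemoved = currentTheoreticalMax - proposedTheoreticalMax; // 14,000

// Option B: RDS Proxy multiplexing ~2,132 direct bot connections
// through an estimated 50-100 proxy connections to Aurora.
const botConns = 2132;
const proxyConnsHigh = 100; // conservative end of the estimate
const reductionFactor = Math.round(botConns / proxyConnsHigh); // ~21x fewer

console.log({ spikeRiskRemoved, reductionFactor });
```

The options are not exclusive: lowering the pool limit caps spike risk today, while RDS Proxy removes the per-connection overhead on Aurora itself.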


Recommended Actions

Ranked by impact — immediate, short-term, medium-term

Immediate (Today)

Reduce bot pool limit from 40 to 5 — Bots use only 4-5 connections but have connectionLimit: 40, and four bots are already maxing out. Reducing the limit prevents spikes and cuts the theoretical connection ceiling by ~14,000 (400 bots × 35).
Evaluate RDS Proxy for bot fleet — ~400 bots hold 2,132 direct TCP connections (82% of traced connections). RDS Proxy would multiplex these through ~50-100 proxy connections, dramatically reducing Aurora overhead.
Optimize inventory/data OpenSearch queries — Three sequential queries take 1.9s. Investigate whether they can run in parallel or be consolidated into fewer queries.
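For the OpenSearch item above, a hedged sketch of the change, assuming the three queries are independent (if queries 2 and 3 depend on query 1's results, only those two can be parallelized; `client.search` and the query-body names are stand-ins, not the actual backend code):

```javascript
// Observed per-query latencies from the trace (ms).
const queryMs = [772, 622, 512];
const sequentialMs = queryMs.reduce((a, b) => a + b, 0); // 1,906
const parallelMs = Math.max(...queryMs);                 //   772

// Before: three awaited calls in a row (~1,906ms of the 1,915ms request).
// After: fire all three at once and pay only the slowest (~772ms).
async function fetchInventoryData(client, q) {
  const [groups, items, extras] = await Promise.all([
    client.search({ index: 'group_internal_730', body: q.groupQuery }),
    client.search({ index: 'inventory_730', body: q.itemQuery }),
    client.search({ index: 'inventory_730', body: q.extraQuery }),
  ]);
  return { groups, items, extras };
}

console.log({ sequentialMs, parallelMs });
```

Even if full parallelism is not possible, an `_msearch` bulk request would at least collapse three round trips into one.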

Short-Term (This Week)

Scale up RDS instance — The instance is CPU-saturated at peak. Move to a larger instance class, or add read replicas to offload read-heavy queries.
Add APM to Nuxt SSR — Zero visibility into frontend rendering. Add dd-trace to the Nuxt server to see the SSR waterfall.
Parallelize SSR API calls — If the page makes sequential API calls during SSR, batch or parallelize them using Promise.all().

Medium-Term

Connection pooling audit — ~3,400 connections suggest that every worker opens its own pool. Consider ProxySQL or RDS Proxy for centralized MySQL connection pooling.
Cache hot inventory data — Inventory/meta data that doesn't change per-request should be cached in Redis. Avoid hitting OpenSearch/DB on every SSR render.
SSR output caching — For pages that don't vary per-user, cache the SSR HTML at the Nginx/CDN layer (stale-while-revalidate pattern).
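The two caching items above share one pattern: check a cache, fall back to the expensive call, store the result with a TTL. A self-contained cache-aside sketch using a Map as a stand-in for Redis (in production the get/set would be Redis GET and SET with an EX expiry; the key and TTL are illustrative):

```javascript
const cache = new Map();
const TTL_MS = 30_000; // assumed TTL; tune per data type

// Cache-aside: return a fresh-enough cached value, otherwise call
// loadFn (the expensive OpenSearch/MySQL fetch) and cache the result.
function cachedFetch(key, loadFn, now = Date.now()) {
  const hit = cache.get(key);
  if (hit && now - hit.at < TTL_MS) {
    return { value: hit.value, fromCache: true };
  }
  const value = loadFn();
  cache.set(key, { value, at: now });
  return { value, fromCache: false };
}

// First call misses and populates; second call within the TTL hits.
let loads = 0;
const load = () => { loads += 1; return { items: ['ak47'] }; };
const first = cachedFetch('inventory:730', load);
const second = cachedFetch('inventory:730', load);
console.log(first.fromCache, second.fromCache, loads); // false true 1
```

Applied to inventory/data, even a short TTL would convert most of the 1,058 slow requests per hour into sub-millisecond cache hits.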

Critical Observability Gap

The Nuxt SSR frontend (new-tradeit / old-tradeit) has zero APM traces in Datadog. We can prove the backend is slow, but we cannot see the SSR waterfall that multiplies these delays. Adding dd-trace to Nuxt is the single highest-leverage observability improvement.


Generated April 14, 2026 · Datadog MCP Investigation · tradeit.gg