Incident Report · April 14, 2026

Slow Site Load Investigation

10-second TTFB on tradeit.gg — Root cause analysis via Datadog APM

ACTIVE INCIDENT — 2 Datadog Monitors Alerting

System load high, anomalous RDS database connections

Hypothesis Elimination

Tested three hypotheses — all three ruled out, isolating the root cause in the data layer

CPU Starvation?

NO

Backend at 45-55% CPU
Headroom available

RULED OUT

Nginx Throttling?

NO

~2,300 connections normal
1,300 req/s steady

RULED OUT

DDoS Attack?

NO

Flat traffic over 4h
~600 MB/s steady

RULED OUT

Root Cause: Dual Data Layer Saturation

OpenSearch + MySQL RDS both degraded simultaneously

RDS "steamarbitrage"

95.9%
Peak CPU
3,477
Connections

Main database saturated: 46% average CPU over 24h, peaking at 96% during business hours. The connection count is extreme — a typical Aurora instance holds a few hundred connections, not ~3,500.

OpenSearch Cluster

772ms
Slowest Query
3x
Sequential

Inventory queries make three sequential OpenSearch calls per request; a single inventory/data request spends ~1.9s in OpenSearch alone.


Proof: Full Trace Flamegraph

Trace ID: 69de43e5...054b2fcb — GET /api/v2/inventory/data — 1,915ms

Request Waterfall — tradeit-backend-lb-001

express.request — GET /api/v2/inventory/data
1,915ms
opensearch: group_internal_730/_search
772ms
opensearch: inventory_730/_search
622ms
opensearch: inventory_730/_search
512ms
redis: LRANGE
2ms

99.5% of request time is spent in OpenSearch: the three queries run partially in sequence, accounting for 1,906ms of the 1,915ms request.

SSR Amplification Effect

Nuxt SSR renders the page server-side by calling multiple backend APIs before sending HTML. If each API call takes 1-2s, and they run in a waterfall, the cumulative TTFB easily reaches 10s:

Nuxt SSR render starts...
inventory/data → 1.9s (OpenSearch x3)
meta/type → 1.2s (MySQL at 95% CPU)
categories/:gameId → 0.8s (MySQL)
giveaway → 1.5s (MySQL: tickets SUM)
configurations → 0.3s
exchange-rate, user/data, ... → ~0.5s each
Total TTFB: ~6-10s depending on waterfall depth
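The waterfall math above can be sketched as a quick model. The latencies are the per-endpoint figures from this report; the "others" entry lumps the remaining ~0.5s calls into two assumed calls. The gap between the sequential and parallel totals is why the waterfall, not any single endpoint, dominates TTFB:

```javascript
// Modeled SSR API latencies (ms) from this report.
const latenciesMs = {
  inventoryData: 1900,  // OpenSearch x3
  metaType: 1200,       // MySQL at 95% CPU
  categories: 800,      // MySQL
  giveaway: 1500,       // MySQL: tickets SUM
  configurations: 300,
  others: 1000,         // assumed: ~2 more calls at ~500ms each
};

const values = Object.values(latenciesMs);

// Full waterfall: every call waits for the previous one.
const sequentialTtfbMs = values.reduce((sum, ms) => sum + ms, 0); // 6700

// Fully parallel: TTFB is bounded by the slowest single call.
const parallelTtfbMs = Math.max(...values); // 1900

console.log({ sequentialTtfbMs, parallelTtfbMs });
```

A ~6.7s modeled sequential TTFB lands inside the observed 6-10s range; real renders are usually only partially sequential, which is why observed TTFB varies with waterfall depth.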

Slow Backend Endpoints (>1s)

Last hour — 1,058 slow inventory/data requests alone

inventory/data
1,058 req · 1.9s avg
seo/page-contents
187 req
meta/top-items
165 req
user/data
165 req
configurations
151 req
categories/:gameId
144 req
inventory/my/data
109 req

RDS "steamarbitrage" — 24h CPU Trend

Sustained saturation, not a spike — follows traffic pattern

[Chart: hourly RDS CPU samples, 14:00 yesterday through now]

RDS CPU follows daily traffic pattern: 22% overnight → 96% peak hours. This is structural, not a spike.


Infrastructure State

Current resource utilization across the stack

Component | Host / Instance | CPU | Connections | Status
RDS steamarbitrage | Aurora MySQL | 60-96% | 3,477 | ALERT
RDS pricing-test | Aurora MySQL | 30-71% | 714 | ELEVATED
Backend LB-001 | c7a.2xlarge | 47.5% | — | BUSY
Backend LB-002 | c7a.2xlarge | 44.8% | — | BUSY
Nginx | Reverse proxy | — | 2,300 | OK
Nuxt SSR | new-tradeit | ? | ? | NO APM

RDS Connection Audit

SHOW PROCESSLIST mapped to EC2 instances via the AWS API — ~2,600 of the ~3,400 connections traced to a source

82%
Bot Fleet Share
~400
Unique Bot IPs
5
Avg Conns/Bot
40
Pool Limit/Bot

Top Connection Sources (from MySQL PROCESSLIST)

Source | Instance Type | Connections | Pool Limit | Notes
tradeit-backend-lb-001 | c7a.2xlarge | 125 | 40+40 × 16 workers | ~8 active per worker
tradeit-backend-lb-002 | c7a.2xlarge | 114 | 40+40 × 16 workers | ~7 active per worker
tradeit-prod-cron-server | m6a.large | 68 | — | No pool config found in repos
old.tradeit.gg | c5.xlarge | 44 | 35 | Over pool limit — leaking?
Steam Value Tracker | t2.xlarge | 40 | — | Hitting exact limit
Inventory service | c5.4xlarge | 39 | — | —
Metabase | t3.large | 14 | — | Analytics queries
Socket server | c7a.large | 10 | — | —

Bot Fleet (~400 bots, each t3.micro)

Each bot is a single Node.js process with its own mysql.createPool({ connectionLimit: 40 }). Most bots use only 4-5 connections, but 4 bots are maxing out their pool at 40 connections each. No shared proxy — every bot maintains direct TCP connections to Aurora.

~200 bots at 5 conns
~1,000 conns
~180 bots at 4 conns
~720 conns
18 hot bots (10-40)
~412 conns

Bot fleet total: ~2,132 connections (82% of connections traced to a source)
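The fleet total reconciles as simple arithmetic. One assumption is labeled here: the 82% share appears to be measured against connections traced to a source (~2,600, i.e. the fleet plus the 454 connections in the named-source table above), not against the 3,477 total:

```javascript
// Bot-fleet connection math from the audit.
const groups = [
  { bots: 200, connsEach: 5 },  // ~200 bots at 5 conns
  { bots: 180, connsEach: 4 },  // ~180 bots at 4 conns
];
const hotBotConns = 412;        // 18 hot bots at 10-40 conns each

const fleetTotal =
  groups.reduce((sum, g) => sum + g.bots * g.connsEach, 0) + hotBotConns; // 2,132

// Named sources from the PROCESSLIST table: 125+114+68+44+40+39+14+10.
const namedSourceConns = 454;
const tracedTotal = fleetTotal + namedSourceConns; // 2,586 ≈ ~2,600

const fleetShare = fleetTotal / tracedTotal; // ≈ 0.82
console.log({ fleetTotal, fleetShare: fleetShare.toFixed(2) });
```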

Pool Configuration (from source code)

Service | Package | Pool Limit | Clustering | Max Theoretical
tradeit-backend | mysql2 | 40 main + 40 pricing | PM2 cluster, 16/host × 2 hosts | 2,560
tradeit-tradebot-server | mysql (v1) | 40 per bot | Single process, ~400 bots | 16,000
old-tradeit | mysql (v1) | 35 | Single instance | 35
tradeit-admin-backend | Prisma | ~5 (default) | PM2 cluster | varies
new-tradeit (Nuxt SSR) | No MySQL | — | — | 0
tradeit-login-server | Redis only | — | — | 0

Conclusion: Not a Leak, Not a Misconfiguration

Connection count has been stable at 2,700-3,500 for 7 days. The 3,400 connections are structural: ~400 bots each maintaining direct TCP connections to Aurora with no shared proxy. The highest-leverage fix is RDS Proxy for the bot fleet, which would multiplex ~2,000 bot connections through ~50-100 proxy connections. Alternative: reduce bot connectionLimit from 40 to 5 (matching actual usage) to prevent spike risk.
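Both options in the conclusion reduce pressure in quantifiable ways. A sketch of the numbers (the 50-100 proxy-connection figure is this report's estimate, not a measured value):

```javascript
const bots = 400;

// Option A: shrink per-bot pool limit from 40 to 5.
const currentTheoreticalMax = bots * 40;  // 16,000
const proposedTheoreticalMax = bots * 5;  //  2,000
const spikeRiskRemoved = currentTheoreticalMax - proposedTheoreticalMax; // 14,000

// Option B: RDS Proxy multiplexing ~2,132 direct bot connections
// through an estimated 50-100 proxy connections to Aurora.
const botConns = 2132;
const proxyConnsHigh = 100; // conservative end of the estimate
const reductionFactor = Math.round(botConns / proxyConnsHigh); // ~21x fewer

console.log({ spikeRiskRemoved, reductionFactor });
```

The options are not exclusive: lowering the pool limit caps spike risk today, while RDS Proxy removes the per-connection overhead on Aurora itself.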


Recommended Actions

Ranked by impact — immediate, short-term, medium-term

Immediate (Today)

Reduce bot pool limit from 40 to 5 — Bots use only 4-5 connections but have connectionLimit: 40, and four bots are already maxing out. Reducing the limit prevents spikes and cuts the theoretical connection ceiling by ~14,000 (400 bots × 35).
Evaluate RDS Proxy for bot fleet — ~400 bots hold 2,132 direct TCP connections (82% of traced connections). RDS Proxy would multiplex these through ~50-100 proxy connections, dramatically reducing Aurora overhead.
Optimize inventory/data OpenSearch queries — Three sequential queries take 1.9s. Investigate whether they can run in parallel or be consolidated into fewer queries.
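For the OpenSearch item above, a hedged sketch of the change, assuming the three queries are independent (if queries 2 and 3 depend on query 1's results, only those two can be parallelized; `client.search` and the query-body names are stand-ins, not the actual backend code):

```javascript
// Observed per-query latencies from the trace (ms).
const queryMs = [772, 622, 512];
const sequentialMs = queryMs.reduce((a, b) => a + b, 0); // 1,906
const parallelMs = Math.max(...queryMs);                 //   772

// Before: three awaited calls in a row (~1,906ms of the 1,915ms request).
// After: fire all three at once and pay only the slowest (~772ms).
async function fetchInventoryData(client, q) {
  const [groups, items, extras] = await Promise.all([
    client.search({ index: 'group_internal_730', body: q.groupQuery }),
    client.search({ index: 'inventory_730', body: q.itemQuery }),
    client.search({ index: 'inventory_730', body: q.extraQuery }),
  ]);
  return { groups, items, extras };
}

console.log({ sequentialMs, parallelMs });
```

Even if full parallelism is not possible, an `_msearch` bulk request would at least collapse three round trips into one.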

Short-Term (This Week)

Scale up RDS instance — The instance is CPU-saturated at peak. Move to a larger instance class, or add read replicas to offload read-heavy queries.
Add APM to Nuxt SSR — Zero visibility into frontend rendering. Add dd-trace to the Nuxt server to see the SSR waterfall.
Parallelize SSR API calls — If the page makes sequential API calls during SSR, batch or parallelize them using Promise.all().

Medium-Term

Connection pooling audit — ~3,400 connections suggest that every worker opens its own pool. Consider ProxySQL or RDS Proxy for centralized MySQL connection pooling.
Cache hot inventory data — Inventory/meta data that doesn't change per-request should be cached in Redis. Avoid hitting OpenSearch/DB on every SSR render.
SSR output caching — For pages that don't vary per-user, cache the SSR HTML at the Nginx/CDN layer (stale-while-revalidate pattern).
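The two caching items above share one pattern: check a cache, fall back to the expensive call, store the result with a TTL. A self-contained cache-aside sketch using a Map as a stand-in for Redis (in production the get/set would be Redis GET and SET with an EX expiry; the key and TTL are illustrative):

```javascript
const cache = new Map();
const TTL_MS = 30_000; // assumed TTL; tune per data type

// Cache-aside: return a fresh-enough cached value, otherwise call
// loadFn (the expensive OpenSearch/MySQL fetch) and cache the result.
function cachedFetch(key, loadFn, now = Date.now()) {
  const hit = cache.get(key);
  if (hit && now - hit.at < TTL_MS) {
    return { value: hit.value, fromCache: true };
  }
  const value = loadFn();
  cache.set(key, { value, at: now });
  return { value, fromCache: false };
}

// First call misses and populates; second call within the TTL hits.
let loads = 0;
const load = () => { loads += 1; return { items: ['ak47'] }; };
const first = cachedFetch('inventory:730', load);
const second = cachedFetch('inventory:730', load);
console.log(first.fromCache, second.fromCache, loads); // false true 1
```

Applied to inventory/data, even a short TTL would convert most of the 1,058 slow requests per hour into sub-millisecond cache hits.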

Critical Observability Gap

The Nuxt SSR frontend (new-tradeit / old-tradeit) has zero APM traces in Datadog. We can prove the backend is slow, but we cannot see the SSR waterfall that multiplies these delays. Adding dd-trace to Nuxt is the single highest-leverage observability improvement.


Generated April 14, 2026 · Datadog MCP Investigation · tradeit.gg