P1 Incident · Apr 5, 2026 · 00:23–01:42 UTC (03:23–04:42 Israel Time) · Auto-recovered

CPU & Load Spike — Two-Wave Incident

Both Nuxt frontend nodes hit 92% CPU simultaneously at 00:23 UTC, followed ~50 min later by both OpenSearch data nodes reaching 95%+ CPU with one crashing and restarting.

92.5% — Peak CPU, frontend nodes
95%+ — Peak CPU, OpenSearch nodes
~7 min — Frontend wave duration
~40 min — OpenSearch wave duration
6 — P1 monitors triggered
Both — Waves auto-recovered

Monitors Triggered

 Wave 1 — Frontend Layer  00:27–00:34 UTC
[P1] System load is high — tradeit-alb00-prod.tradeit.gg02 (i-0521bba8a5fa7b821)
system.load.norm.1 > 1.0  ·  Triggered 00:27:17 UTC  ·  Recovered 00:34:17 UTC  ·  m6a.2xlarge
[P1] System load is high — tradeit-alb00-prod.tradeit.gg01 (i-09c1ad0aa82395e99)
system.load.norm.1 > 1.0  ·  Triggered 00:27:17 UTC  ·  Recovered 00:34:17 UTC  ·  m6a.2xlarge
[P1] CPU usage is high — tradeit-alb00-prod.tradeit.gg01 & gg02
100 - system.cpu.idle > 90  ·  Triggered 00:29:14 UTC  ·  Recovered 00:31:14 UTC
 Wave 2 — OpenSearch Layer  01:12–01:22 UTC triggers (recovered by ~01:42 UTC after node_4 restart)
[P1] System load is high — opensearch_cluster_data_node_4 (i-07e02b117e9dbd0a9)
system.load.norm.1 > 1.0 reached 2.24  ·  Triggered 01:12:17 UTC  ·  Node crashed ~01:41 UTC  ·  c6a.xlarge
[P1] System load is high — opensearch_cluster_data_node_1 (i-0e3ae1adac2380052)
system.load.norm.1 > 1.0 reached 1.98  ·  Triggered 01:20:17 UTC  ·  c6a.xlarge
[P1] CPU usage is high — data_node_1 & data_node_4
100 - system.cpu.idle > 90  ·  Triggered 01:22:14 UTC  ·  Both nodes reached 95–96% CPU
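The load monitors above fire on `system.load.norm.1`, the 1-minute load average divided by vCPU count, so each alert value can be translated back into raw runnable threads. A minimal sketch of that arithmetic, using the instance types listed in the monitor details (the helper name is ours, not Datadog's):

```python
# system.load.norm.1 = 1-min load average / vCPU count.
# The raw load average implied by an alert is therefore norm * vCPUs.
VCPUS = {"m6a.2xlarge": 8, "c6a.xlarge": 4}

def raw_load(norm: float, instance_type: str) -> float:
    """Recover the raw 1-min load average from a normalized reading."""
    return norm * VCPUS[instance_type]

# Wave 1 frontend peak: norm 1.78 on an 8-vCPU node
print(raw_load(1.78, "m6a.2xlarge"))  # 14.24 runnable threads vs 8 cores
# Wave 2 data_node_4 peak: norm 2.24 on a 4-vCPU node
print(raw_load(2.24, "c6a.xlarge"))   # 8.96 runnable threads vs 4 cores
```

Both peaks correspond to roughly 2x oversubscription of the available cores, which is why the load monitors fired minutes before the CPU monitors did.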

Incident Timeline

 Wave 1 — Frontend

00:05–00:23 UTC
Normal operation
Both ALB nodes at ~25–31% user CPU, load norm ~0.30. Nothing unusual.
00:23 UTC — 03:23 IL
CPU spikes to 92.5% on both nodes
Both tradeit-alb00 gg01 and gg02 spike simultaneously. Load norm hits 1.78. Both nodes, same second — not a single runaway process.
00:27:17 UTC — 03:27 IL
Load monitors fire (P1)
Both load monitors trigger at the same timestamp.
00:29:14 UTC — 03:29 IL
CPU >90% monitors fire (P1)
100 - idle > 90 confirmed — 92.5% is real.
00:31–00:34 UTC
Self-recovered
CPU drops back to ~18–25%. Both monitors recover. Total duration: ~7 minutes. No manual intervention.

 Wave 2 — OpenSearch

00:41–01:05 UTC
data_node_4 starts climbing
Dips to ~9% at 00:41 (possible momentary recovery), then climbs steadily: 35% → 44% → 63% → 68%.
00:30–01:05 UTC
data_node_1 also elevated
Steadily ramps from 31% → 56% user CPU. Was already elevated when Wave 1 hit the frontend.
01:12–01:22 UTC — 04:12–04:22 IL
Both OS nodes cross load > 1.0 → P1 fires
data_node_4 load norm: 2.24. data_node_1 load norm: 1.98. CPU on both: 93–95%.
01:35–01:41 UTC
data_node_4 crashes / restarts
CPU drops from 90% to ~2% (near-zero). Shard reallocation begins. Then recovers to ~27–47%.
01:41+ UTC
data_node_1 remains elevated
Still at 40–88% CPU post-peak. Likely absorbing rebalanced shards from the node_4 crash.

CPU Profiles

 Frontend Nodes — m6a.2xlarge (8 vCPU)

Both tradeit-alb00 gg01 & gg02 — identical profile

00:05–00:22 (baseline): ~27%
00:23 (peak): 92.5%
00:31+ (recovered): ~22%
This is a sharp spike — normal → 92% → normal in under 10 minutes. Not a sustained load.
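The CPU monitors in this incident evaluate `100 - system.cpu.idle > 90`, i.e. total non-idle CPU rather than user CPU alone. A minimal sketch of that check (the idle values are implied by the readings above, not pulled from Datadog):

```python
def cpu_monitor_fires(idle_pct: float, threshold: float = 90.0) -> bool:
    """Datadog-style check: total busy CPU = 100 - idle."""
    return 100.0 - idle_pct > threshold

# At the 92.5% peak, idle was ~7.5% -> monitor fires
print(cpu_monitor_fires(7.5))    # True
# At the ~27% baseline, idle was ~73% -> quiet
print(cpu_monitor_fires(73.0))   # False
# Apr 4's sustained ~60-65% total never crossed the line
print(cpu_monitor_fires(37.5))   # False
```

This is also why the 13-hour Apr 4 load described in the context section below generated no alerts: the threshold only catches saturation, not sustained elevation.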
 OpenSearch Data Nodes — c6a.xlarge (4 vCPU)

data_node_4 (i-07e02b117e9dbd0a9)

00:30–00:53 (ramp start): 21–42%
01:05–01:11 (climbing fast): 63–68%
01:17–01:29 (peak): 93–95%
01:41 (crash/restart): ~2%

data_node_1 (i-0e3ae1adac2380052) — similar ramp, no crash

00:30–01:05 (ramp): 31–56%
01:17–01:29 (peak): 92–95%
This is a slow ramp over ~60 minutes — consistent with index merge, JVM GC, or shard relocation, not a traffic spike.

Root Cause Analysis

 Wave 1: Frontend Spike — Unknown Trigger, Sharp Shape

Both tradeit-alb00 nodes (Nuxt.js SSR, type:frontend) spiked from ~27% to 92.5% CPU simultaneously at exactly 00:23 UTC. No error logs were found for these hosts during the window, which argues against an error-handling storm.

The simultaneous exact-second spike on both nodes strongly suggests a shared trigger:

  • Most likely: A scheduled job or cron task running at 00:23 UTC on both nodes (e.g. SSR cache warming, sitemap generation, report job)
  • Possible: A traffic burst routed across both nodes simultaneously (e.g. bot crawl, automated test, external monitoring)
  • Less likely: A deployment or config reload (no evidence of restart)

 Action needed: Check what runs on Nuxt nodes at ~00:23 UTC. Likely a cron or scheduled job. Live Processes snapshot would confirm the process name.

 Wave 2: OpenSearch Nodes — Slow Ramp, One Crash

data_node_1 was already at 31–40% CPU from 00:30 UTC (while Wave 1 was resolving on the frontend). Both nodes then ramped together over ~60 minutes to 92–95%, with data_node_4 crashing at 01:41 UTC.

The slow linear ramp is characteristic of:

  • Most likely: Lucene segment merge or index optimization triggered by the preceding backend load (from Apr 4 sustained high query volume)
  • Possible: JVM GC pressure accumulating — heap fills, GC storms, threads stall
  • Possible: Shard relocation or recovery task in progress (check cluster state)

The abrupt drop of both CPU and load to ~2% when data_node_4 went down strongly suggests a JVM/process crash rather than a graceful restart, which would typically show a drain period first.

 Action needed: Check OpenSearch cluster health, data_node_4 JVM heap % and GC log around 01:35–01:41 UTC. Check for unassigned shards post-crash.

 Are the Two Waves Connected?

Timing suggests the waves are likely related but not directly causal:

  • data_node_1 was already elevated at 00:30 UTC — before the frontend recovered — suggesting OS was under independent pressure
  • The Apr 4 sustained backend load (OpenSearch rejection storm, ~13 hours) likely left the OS cluster in a degraded state entering Apr 5
  • The frontend Wave 1 spike may have added a burst of inventory queries to an already-stressed OS cluster, accelerating the OS ramp
  • But Wave 2 would likely have occurred regardless — the 60-min slow ramp was already in motion

Context: Apr 4 Background

Yesterday (Apr 4) the two tradeit-backend nodes (i-0843e536600ca8982, i-0e7d0f846e38926a8) sustained 13+ hours of elevated CPU (50–60% user), driven by an OpenSearch thread-pool rejection storm. POST /inventory_internal_?/_search query volume surged 61–67% above baseline, flooding data_node_1's search thread pool (7 threads, queue at 1000/1000). No monitor fired because total CPU peaked at ~60–65%, below the 90% threshold.

That sustained load on OpenSearch across Apr 4 is the likely precondition for the OS node degradation that triggered Wave 2 this morning.
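The rejection mechanics described above (a fixed search pool of 7 threads with a bounded queue of 1000) can be illustrated with a toy model. The pool and queue sizes come from this report; the request counts below are hypothetical, chosen only to show how a +61% surge tips a near-capacity queue into rejections:

```python
def rejected(arrivals: int, pool_threads: int = 7,
             queue_size: int = 1000) -> int:
    """Toy model: requests beyond busy threads + a full queue are rejected."""
    return max(0, arrivals - (pool_threads + queue_size))

# A burst that fits within threads + queue: no rejections
print(rejected(900))               # 0
# The same hypothetical burst scaled by the reported +61% surge
print(rejected(int(900 * 1.61)))   # 442 rejected
```

The point of the sketch is the cliff: below 1007 concurrent requests nothing is lost, and every request above that is rejected outright, which is why the backend saw a storm rather than gradual slowdown.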

Next Steps

 Identify the Nuxt cron/job trigger (P0)
SSH into tradeit-alb00 gg01/gg02 and check crontabs, PM2 scheduled tasks, and any job running at 00:23 UTC. Compare with APM live processes snapshot from monitor alert link.
 Check OpenSearch cluster health NOW (P0)
Verify data_node_4 rejoined the cluster cleanly after crash. Check for unassigned shards, shard recovery status, and whether data_node_1 CPU has stabilized.
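This check can be scripted against OpenSearch's `GET _cluster/health` API. The response fields below are the standard ones; the values are illustrative, not from this cluster:

```python
import json

# Shape of GET _cluster/health (illustrative values, not real data).
health = json.loads("""
{"status": "yellow", "number_of_nodes": 6,
 "unassigned_shards": 12, "relocating_shards": 3,
 "initializing_shards": 1}
""")

def cluster_ok(h: dict) -> bool:
    """Healthy iff status is green and no shards are unassigned."""
    return h["status"] == "green" and h["unassigned_shards"] == 0

print(cluster_ok(health))  # False: unassigned shards after a node loss
```

A non-zero `unassigned_shards` count after data_node_4 rejoins would mean recovery is still in progress and data_node_1's elevated CPU is expected.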
 Check OS JVM heap & GC logs (P1)
Pull jvm.mem.heap_used_percent for both data nodes across 00:30–01:41 UTC. Heap sustained above ~85% would strongly implicate GC pressure. Check opensearch.log on node_4 for an OOM or other crash signal.
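The heap check can also be scripted against `GET _nodes/stats/jvm`, which reports `heap_used_percent` per node. The payload below follows that response shape, but the node IDs and values are illustrative:

```python
import json

# Illustrative payload in the shape returned by GET _nodes/stats/jvm
# (node IDs and heap values are made up, not real cluster data).
sample = json.loads("""
{
  "nodes": {
    "abc123": {"name": "data_node_4",
               "jvm": {"mem": {"heap_used_percent": 91}}},
    "def456": {"name": "data_node_1",
               "jvm": {"mem": {"heap_used_percent": 78}}}
  }
}
""")

def nodes_over_heap(stats: dict, limit: int = 85) -> list[str]:
    """Return node names whose heap usage exceeds `limit` percent."""
    return [n["name"] for n in stats["nodes"].values()
            if n["jvm"]["mem"]["heap_used_percent"] > limit]

print(nodes_over_heap(sample))  # ['data_node_4']
```

Wiring this into a periodic check would cover the monitoring gap noted in the next step: heap crosses 85% long before CPU crosses 90%.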
 Add circuit breaker + OS node monitoring (P1)
Backend needs a circuit breaker on queryInventoryOnOpenSearch to stop retry storms. OS cluster needs JVM heap % monitor and a shard health monitor — CPU alone is insufficient signal.
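The circuit breaker called for above could start from the minimal sketch below. `queryInventoryOnOpenSearch` is the backend function named in this report; the thresholds, class, and wiring are illustrative assumptions, not the existing backend code:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `max_failures` consecutive
    errors, rejects calls while open, and allows a trial call after
    `reset_after` seconds (half-open)."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the breaker opened

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: skipping OpenSearch call")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the breaker
        return result

# Hypothetical usage in the backend:
# breaker = CircuitBreaker()
# breaker.call(queryInventoryOnOpenSearch, params)
```

Failing fast while the breaker is open is what stops the retry storm: rejected searches return immediately instead of re-queuing against an already-saturated thread pool.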
Generated by Claude Code · Datadog MCP · Apr 5, 2026 · tradeit.gg · eu-west-1 · prod