P1 Incident · Apr 5, 2026 · 00:23–01:42 UTC (03:23–04:42 Israel Time) · Auto-recovered

CPU & Load Spike — Two-Wave Incident

Both Nuxt frontend nodes hit 92% CPU simultaneously at 00:23 UTC, followed ~50 min later by both OpenSearch data nodes reaching 95%+ CPU with one crashing and restarting.

92.5% — Peak CPU, frontend nodes
95%+ — Peak CPU, OpenSearch nodes
~7 min — Frontend wave duration
~40 min — OpenSearch wave duration
6 — P1 monitors triggered
Both — Waves auto-recovered

Monitors Triggered

 Wave 1 — Frontend Layer  00:27–00:34 UTC
[P1] System load is high — tradeit-alb00-prod.tradeit.gg02 (i-0521bba8a5fa7b821)
system.load.norm.1 > 1.0  ·  Triggered 00:27:17 UTC  ·  Recovered 00:34:17 UTC  ·  m6a.2xlarge
[P1] System load is high — tradeit-alb00-prod.tradeit.gg01 (i-09c1ad0aa82395e99)
system.load.norm.1 > 1.0  ·  Triggered 00:27:17 UTC  ·  Recovered 00:34:17 UTC  ·  m6a.2xlarge
[P1] CPU usage is high — tradeit-alb00-prod.tradeit.gg01 & gg02
100 - system.cpu.idle > 90  ·  Triggered 00:29:14 UTC  ·  Recovered 00:31:14 UTC
 Wave 2 — OpenSearch Layer  01:12–01:22 UTC triggers (recovered by ~01:42 UTC after node_4 restart)
[P1] System load is high — opensearch_cluster_data_node_4 (i-07e02b117e9dbd0a9)
system.load.norm.1 > 1.0 reached 2.24  ·  Triggered 01:12:17 UTC  ·  Node crashed ~01:41 UTC  ·  c6a.xlarge
[P1] System load is high — opensearch_cluster_data_node_1 (i-0e3ae1adac2380052)
system.load.norm.1 > 1.0 reached 1.98  ·  Triggered 01:20:17 UTC  ·  c6a.xlarge
[P1] CPU usage is high — data_node_1 & data_node_4
100 - system.cpu.idle > 90  ·  Triggered 01:22:14 UTC  ·  Both nodes reached 95–96% CPU
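The load monitors above fire on `system.load.norm.1`, the 1-minute load average divided by vCPU count, so each alert value can be translated back into raw runnable threads. A minimal sketch of that arithmetic, using the instance types listed in the monitor details (the helper name is ours, not Datadog's):

```python
# system.load.norm.1 = 1-min load average / vCPU count.
# The raw load average implied by an alert is therefore norm * vCPUs.
VCPUS = {"m6a.2xlarge": 8, "c6a.xlarge": 4}

def raw_load(norm: float, instance_type: str) -> float:
    """Recover the raw 1-min load average from a normalized reading."""
    return norm * VCPUS[instance_type]

# Wave 1 frontend peak: norm 1.78 on an 8-vCPU node
print(raw_load(1.78, "m6a.2xlarge"))  # 14.24 runnable threads vs 8 cores
# Wave 2 data_node_4 peak: norm 2.24 on a 4-vCPU node
print(raw_load(2.24, "c6a.xlarge"))   # 8.96 runnable threads vs 4 cores
```

Both peaks correspond to roughly 2x oversubscription of the available cores, which is why the load monitors fired minutes before the CPU monitors did.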

Incident Timeline

 Wave 1 — Frontend

00:05–00:23 UTC
Normal operation
Both ALB nodes at ~25–31% user CPU, load norm ~0.30. Nothing unusual.
00:23 UTC — 03:23 IL
CPU spikes to 92.5% on both nodes
Both tradeit-alb00 gg01 and gg02 spike simultaneously. Load norm hits 1.78. Both nodes, same second — not a single runaway process.
00:27:17 UTC — 03:27 IL
Load monitors fire (P1)
Both load monitors trigger at the same timestamp.
00:29:14 UTC — 03:29 IL
CPU >90% monitors fire (P1)
100 - idle > 90 confirmed — 92.5% is real.
00:31–00:34 UTC
Self-recovered
CPU drops back to ~18–25%. Both monitors recover. Total duration: ~7 minutes. No manual intervention.

 Wave 2 — OpenSearch

00:41–01:05 UTC
data_node_4 starts climbing
Dips to ~9% at 00:41 (possible momentary recovery), then climbs steadily: 35% → 44% → 63% → 68%.
00:30–01:05 UTC
data_node_1 also elevated
Steadily ramps from 31% → 56% user CPU. Was already elevated when Wave 1 hit the frontend.
01:12–01:22 UTC — 04:12–04:22 IL
Both OS nodes cross load > 1.0 → P1 fires
data_node_4 load norm: 2.24. data_node_1 load norm: 1.98. CPU on both: 93–95%.
01:35–01:41 UTC
data_node_4 crashes / restarts
CPU drops from 90% to ~2% (near-zero). Shard reallocation begins. Then recovers to ~27–47%.
01:41+ UTC
data_node_1 remains elevated
Still at 40–88% CPU post-peak. Likely absorbing rebalanced shards from the node_4 crash.

CPU Profiles

 Frontend Nodes — m6a.2xlarge (8 vCPU)

Both tradeit-alb00 gg01 & gg02 — identical profile

00:05–00:22 (baseline): ~27%
00:23 (peak): 92.5%
00:31+ (recovered): ~22%
This is a sharp spike — normal → 92% → normal in under 10 minutes. Not a sustained load.
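The CPU monitors in this incident evaluate `100 - system.cpu.idle > 90`, i.e. total non-idle CPU rather than user CPU alone. A minimal sketch of that check (the idle values are implied by the readings above, not pulled from Datadog):

```python
def cpu_monitor_fires(idle_pct: float, threshold: float = 90.0) -> bool:
    """Datadog-style check: total busy CPU = 100 - idle."""
    return 100.0 - idle_pct > threshold

# At the 92.5% peak, idle was ~7.5% -> monitor fires
print(cpu_monitor_fires(7.5))    # True
# At the ~27% baseline, idle was ~73% -> quiet
print(cpu_monitor_fires(73.0))   # False
# Apr 4's sustained ~60-65% total never crossed the line
print(cpu_monitor_fires(37.5))   # False
```

This is also why the 13-hour Apr 4 load described in the context section below generated no alerts: the threshold only catches saturation, not sustained elevation.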
 OpenSearch Data Nodes — c6a.xlarge (4 vCPU)

data_node_4 (i-07e02b117e9dbd0a9)

00:30–00:53 (ramp start): 21–42%
01:05–01:11 (climbing fast): 63–68%
01:17–01:29 (peak): 93–95%
01:41 (crash/restart): ~2%

data_node_1 (i-0e3ae1adac2380052) — similar ramp, no crash

00:30–01:05 (ramp): 31–56%
01:17–01:29 (peak): 92–95%
This is a slow ramp over ~60 minutes — consistent with index merge, JVM GC, or shard relocation, not a traffic spike.

Root Cause Analysis

 Wave 1: Frontend Spike — Unknown Trigger, Sharp Shape

Both tradeit-alb00 nodes (Nuxt.js SSR, type:frontend) spiked from ~27% to 92.5% CPU simultaneously at exactly 00:23 UTC. No error logs were found for these hosts during the window, which argues against an error-handling storm.

The simultaneous exact-second spike on both nodes strongly suggests a shared trigger:

  • Most likely: A scheduled job or cron task running at 00:23 UTC on both nodes (e.g. SSR cache warming, sitemap generation, report job)
  • Possible: A traffic burst routed across both nodes simultaneously (e.g. bot crawl, automated test, external monitoring)
  • Less likely: A deployment or config reload (no evidence of restart)

 Action needed: Check what runs on Nuxt nodes at ~00:23 UTC. Likely a cron or scheduled job. Live Processes snapshot would confirm the process name.

 Wave 2: OpenSearch Nodes — Slow Ramp, One Crash

data_node_1 was already at 31–40% CPU from 00:30 UTC (while Wave 1 was resolving on the frontend). Both nodes then ramped together over ~60 minutes to 92–95%, with data_node_4 crashing at 01:41 UTC.

The slow linear ramp is characteristic of:

  • Most likely: Lucene segment merge or index optimization triggered by the preceding backend load (from Apr 4 sustained high query volume)
  • Possible: JVM GC pressure accumulating — heap fills, GC storms, threads stall
  • Possible: Shard relocation or recovery task in progress (check cluster state)

The abrupt drop of both CPU and load to ~2% when data_node_4 went down strongly suggests a JVM/process crash rather than a graceful restart, which would typically show a drain period first.

 Action needed: Check OpenSearch cluster health, data_node_4 JVM heap % and GC log around 01:35–01:41 UTC. Check for unassigned shards post-crash.

 Are the Two Waves Connected?

Timing suggests the waves are likely related but not directly causal:

  • data_node_1 was already elevated at 00:30 UTC — before the frontend recovered — suggesting OS was under independent pressure
  • The Apr 4 sustained backend load (OpenSearch rejection storm, ~13 hours) likely left the OS cluster in a degraded state entering Apr 5
  • The frontend Wave 1 spike may have added a burst of inventory queries to an already-stressed OS cluster, accelerating the OS ramp
  • But Wave 2 would likely have occurred regardless — the 60-min slow ramp was already in motion

Context: Apr 4 Background

Yesterday (Apr 4) the two tradeit-backend nodes (i-0843e536600ca8982, i-0e7d0f846e38926a8) sustained 13+ hours of elevated CPU (50–60% user), driven by an OpenSearch thread-pool rejection storm. POST /inventory_internal_?/_search query volume surged 61–67% above baseline, flooding data_node_1's search thread pool (7 threads, queue at 1000/1000). No monitor fired because total CPU peaked at ~60–65%, below the 90% threshold.

That sustained load on OpenSearch across Apr 4 is the likely precondition for the OS node degradation that triggered Wave 2 this morning.
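The rejection mechanics described above (a fixed search pool of 7 threads with a bounded queue of 1000) can be illustrated with a toy model. The pool and queue sizes come from this report; the request counts below are hypothetical, chosen only to show how a +61% surge tips a near-capacity queue into rejections:

```python
def rejected(arrivals: int, pool_threads: int = 7,
             queue_size: int = 1000) -> int:
    """Toy model: requests beyond busy threads + a full queue are rejected."""
    return max(0, arrivals - (pool_threads + queue_size))

# A burst that fits within threads + queue: no rejections
print(rejected(900))               # 0
# The same hypothetical burst scaled by the reported +61% surge
print(rejected(int(900 * 1.61)))   # 442 rejected
```

The point of the sketch is the cliff: below 1007 concurrent requests nothing is lost, and every request above that is rejected outright, which is why the backend saw a storm rather than gradual slowdown.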

Next Steps

 Identify the Nuxt cron/job trigger (P0)
SSH into tradeit-alb00 gg01/gg02 and check crontabs, PM2 scheduled tasks, and any job running at 00:23 UTC. Compare with APM live processes snapshot from monitor alert link.
 Check OpenSearch cluster health NOW (P0)
Verify data_node_4 rejoined the cluster cleanly after crash. Check for unassigned shards, shard recovery status, and whether data_node_1 CPU has stabilized.
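This check can be scripted against OpenSearch's `GET _cluster/health` API. The response fields below are the standard ones; the values are illustrative, not from this cluster:

```python
import json

# Shape of GET _cluster/health (illustrative values, not real data).
health = json.loads("""
{"status": "yellow", "number_of_nodes": 6,
 "unassigned_shards": 12, "relocating_shards": 3,
 "initializing_shards": 1}
""")

def cluster_ok(h: dict) -> bool:
    """Healthy iff status is green and no shards are unassigned."""
    return h["status"] == "green" and h["unassigned_shards"] == 0

print(cluster_ok(health))  # False: unassigned shards after a node loss
```

A non-zero `unassigned_shards` count after data_node_4 rejoins would mean recovery is still in progress and data_node_1's elevated CPU is expected.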
 Check OS JVM heap & GC logs (P1)
Pull jvm.mem.heap_used_percent for both data nodes across 00:30–01:41 UTC. Heap sustained above ~85% would strongly implicate GC pressure. Check opensearch.log on node_4 for an OOM or other crash signal.
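The heap check can also be scripted against `GET _nodes/stats/jvm`, which reports `heap_used_percent` per node. The payload below follows that response shape, but the node IDs and values are illustrative:

```python
import json

# Illustrative payload in the shape returned by GET _nodes/stats/jvm
# (node IDs and heap values are made up, not real cluster data).
sample = json.loads("""
{
  "nodes": {
    "abc123": {"name": "data_node_4",
               "jvm": {"mem": {"heap_used_percent": 91}}},
    "def456": {"name": "data_node_1",
               "jvm": {"mem": {"heap_used_percent": 78}}}
  }
}
""")

def nodes_over_heap(stats: dict, limit: int = 85) -> list[str]:
    """Return node names whose heap usage exceeds `limit` percent."""
    return [n["name"] for n in stats["nodes"].values()
            if n["jvm"]["mem"]["heap_used_percent"] > limit]

print(nodes_over_heap(sample))  # ['data_node_4']
```

Wiring this into a periodic check would cover the monitoring gap noted in the next step: heap crosses 85% long before CPU crosses 90%.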
 Add circuit breaker + OS node monitoring (P1)
Backend needs a circuit breaker on queryInventoryOnOpenSearch to stop retry storms. OS cluster needs JVM heap % monitor and a shard health monitor — CPU alone is insufficient signal.
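The circuit breaker called for above could start from the minimal sketch below. `queryInventoryOnOpenSearch` is the backend function named in this report; the thresholds, class, and wiring are illustrative assumptions, not the existing backend code:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `max_failures` consecutive
    errors, rejects calls while open, and allows a trial call after
    `reset_after` seconds (half-open)."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the breaker opened

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: skipping OpenSearch call")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the breaker
        return result

# Hypothetical usage in the backend:
# breaker = CircuitBreaker()
# breaker.call(queryInventoryOnOpenSearch, params)
```

Failing fast while the breaker is open is what stops the retry storm: rejected searches return immediately instead of re-queuing against an already-saturated thread pool.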
Generated by Claude Code · Datadog MCP · Apr 5, 2026 · tradeit.gg · eu-west-1 · prod