Both Nuxt frontend nodes hit 92% CPU simultaneously at 00:23 UTC, followed ~50 min later by both OpenSearch data nodes reaching 95%+ CPU with one crashing and restarting.
Wave 1 — Frontend
- tradeit-alb00 gg01 · system.load.norm.1 > 1.0 · Triggered 00:27:17 UTC · Recovered 00:34:17 UTC · m6a.2xlarge
- tradeit-alb00 gg02 · system.load.norm.1 > 1.0 · Triggered 00:27:17 UTC · Recovered 00:34:17 UTC · m6a.2xlarge
- Both nodes · 100 - system.cpu.idle > 90 · Triggered 00:29:14 UTC · Recovered 00:31:14 UTC

Both tradeit-alb00 nodes (gg01 and gg02) spiked simultaneously with an identical profile. Load norm hit 1.78 on both nodes in the same second — not a single runaway process. 100 - idle > 90 confirmed the reading — 92.5% CPU is real.

Wave 2 — OpenSearch
- data_node_4 (i-07e02b117e9dbd0a9) · system.load.norm.1 > 1.0, reached 2.24 · Triggered 01:12:17 UTC · Node crashed ~01:41 UTC · c6a.xlarge
- data_node_1 (i-0e3ae1adac2380052) · system.load.norm.1 > 1.0, reached 1.98 · Triggered 01:20:17 UTC · Similar ramp, no crash · c6a.xlarge
- Both nodes · 100 - system.cpu.idle > 90 · Triggered 01:22:14 UTC · Both reached 95–96% CPU
Both tradeit-alb00 nodes (Nuxt.js SSR, type:frontend) spiked from ~27% to 92.5% CPU simultaneously at exactly 00:23 UTC. No error logs were found for these hosts during the window, ruling out an error-handling storm.
The simultaneous exact-second spike on both nodes strongly suggests a shared trigger (e.g. a scheduled job) rather than organic traffic.
Action needed: Check what runs on Nuxt nodes at ~00:23 UTC. Likely a cron or scheduled job. Live Processes snapshot would confirm the process name.
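To narrow down the suspect, anything scheduled at minute 23 of hour 0 can be flagged mechanically. A minimal sketch, assuming standard five-field cron syntax; the crontab entries below are invented for illustration, and the real ones would come from `crontab -l` / /etc/cron.d on the Nuxt hosts:

```python
# Hypothetical sketch: flag crontab lines that would fire at 00:23 UTC.
def field_matches(field: str, value: int) -> bool:
    """Match one cron field (minute or hour) against a concrete value."""
    for part in field.split(","):
        if part == "*":
            return True
        if "/" in part:
            base, step = part.split("/")
            start = 0 if base == "*" else int(base)
            if value >= start and (value - start) % int(step) == 0:
                return True
        elif "-" in part:
            lo, hi = map(int, part.split("-"))
            if lo <= value <= hi:
                return True
        elif int(part) == value:
            return True
    return False

def fires_at(entry: str, hour: int, minute: int) -> bool:
    minute_field, hour_field = entry.split()[:2]
    return field_matches(minute_field, minute) and field_matches(hour_field, hour)

crontab = [
    "23 0 * * * /opt/scripts/rebuild-sitemap.sh",        # invented: fires at 00:23
    "*/5 * * * * /opt/scripts/health-check.sh",          # 23 is not a multiple of 5
    "0 3 * * * /usr/sbin/logrotate /etc/logrotate.conf", # fires at 03:00
]
suspects = [e for e in crontab if fires_at(e, hour=0, minute=23)]
```

Any line that survives the filter is a candidate for the 00:23 trigger; the same check applies to PM2 cron-restart settings and systemd timers.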
data_node_1 was already at 31–40% CPU from 00:30 UTC (while Wave 1 was resolving on the frontend). Both nodes then ramped together over ~60 minutes to 92–95%, with data_node_4 crashing at 01:41 UTC.
The slow linear ramp is characteristic of gradually accumulating pressure (for example, rising JVM heap usage driving progressively heavier GC) rather than a sudden traffic spike.
After data_node_4 went down, both CPU and load immediately dropped to ~2% with no drain-down period, consistent with an abrupt JVM/process crash rather than a graceful restart.
Action needed: Check OpenSearch cluster health, data_node_4 JVM heap % and GC log around 01:35–01:41 UTC. Check for unassigned shards post-crash.
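The post-crash checks above can be scripted against the cluster APIs. A minimal sketch: the two payloads are invented examples shaped like `GET /_cluster/health` and the `jvm.mem.heap_used_percent` stat from `GET /_nodes/stats/jvm`; real values would come from those endpoints around 00:30–01:41 UTC.

```python
import json

# Invented example response from GET /_cluster/health.
health = json.loads('{"status": "yellow", "unassigned_shards": 4, "relocating_shards": 0}')
# Invented heap readings; real ones come from GET /_nodes/stats/jvm.
heap_used_percent = {"data_node_4": 91, "data_node_1": 78}

findings = []
if health["status"] != "green" or health["unassigned_shards"] > 0:
    findings.append(
        f"cluster {health['status']}: {health['unassigned_shards']} unassigned shards, recovery pending"
    )
for node, heap in heap_used_percent.items():
    if heap > 85:  # the >85% heuristic from the action item above
        findings.append(f"{node} heap at {heap}% -- GC pressure is a plausible crash cause")
```

With these example values the script would flag both the pending shard recovery and data_node_4's heap, matching the two hypotheses in the action item.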
Timing suggests the waves are likely related but not directly causal:
- data_node_1 was already elevated at 00:30 UTC, before the frontend had recovered, suggesting OpenSearch was under independent pressure.

Yesterday (Apr 4) the two tradeit-backend nodes (i-0843e536600ca8982, i-0e7d0f846e38926a8) sustained 13+ hours of elevated CPU (50–60% user) driven by an OpenSearch thread pool rejection storm. POST /inventory_internal_?/_search query volume surged +61–67% above baseline, flooding data_node_1's search thread pool (7 threads, queue at 1000/1000). No monitor fired because total CPU peaked at ~60–65%, below the 90% threshold.
That sustained load on OpenSearch across Apr 4 is the likely precondition for the OpenSearch-node degradation that triggered Wave 2 this morning.
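The rejection-storm mechanics follow from the fixed pool geometry. A back-of-envelope sketch, using the 7 threads and 1000-slot queue from the incident; the latency and arrival rates are assumptions for illustration, not measured values:

```python
# Once arrivals exceed what 7 threads can complete, the 1000-slot queue
# fills, and every further query is rejected (OpenSearch thread pool rejection).
POOL_THREADS = 7
QUEUE_CAPACITY = 1000
AVG_QUERY_SECONDS = 0.175                      # assumed mean search latency
SERVICE_RATE = POOL_THREADS / AVG_QUERY_SECONDS  # ~40 queries/sec completed
BASELINE_RATE = 38.0                           # assumed arrival rate pre-surge
SURGED_RATE = BASELINE_RATE * 1.64             # +64%, midpoint of the +61-67% surge

def seconds_until_queue_full(arrival_rate: float) -> float:
    """How long the queue absorbs the excess before rejections start."""
    excess = arrival_rate - SERVICE_RATE
    return float("inf") if excess <= 0 else QUEUE_CAPACITY / excess

def steady_rejections_per_min(arrival_rate: float) -> float:
    """Rejections per minute once the queue is saturated."""
    excess = arrival_rate - SERVICE_RATE
    return 0.0 if excess <= 0 else excess * 60
```

Under these assumptions the baseline just fits under capacity, while the surged rate fills the queue in under a minute and then rejects queries continuously, which is consistent with the queue being pinned at 1000/1000 for hours.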
Action needed:
- Inspect tradeit-alb00 gg01/gg02: check crontabs, PM2 scheduled tasks, and any job running at 00:23 UTC. Compare with the APM live processes snapshot from the monitor alert link.
- Verify data_node_4 rejoined the cluster cleanly after the crash. Check for unassigned shards, shard recovery status, and whether data_node_1 CPU has stabilized.
- Review jvm.mem.heap_used_percent for both data nodes around 00:30–01:41 UTC. If heap was >85%, GC pressure is confirmed as the cause. Check opensearch.log on data_node_4 for an OOM or crash signal.
- Add backoff/rate limiting to queryInventoryOnOpenSearch to stop retry storms. The OpenSearch cluster needs a JVM heap % monitor and a shard health monitor — CPU alone is insufficient signal.
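The retry-storm fix for the inventory search path can be sketched as capped exponential backoff with jitter, so thread pool rejections are absorbed instead of amplified. A minimal sketch: `SearchRejected` and the stub client are invented stand-ins for the real queryInventoryOnOpenSearch call and its rejection error (HTTP 429 from the cluster).

```python
import random
import time

class SearchRejected(Exception):
    """Stand-in for an OpenSearch thread-pool rejection (HTTP 429)."""

def query_with_backoff(client, query, attempts=4, base_delay=0.25, sleep=time.sleep):
    """Retry a rejected search with exponentially growing, jittered delays."""
    for attempt in range(attempts):
        try:
            return client(query)
        except SearchRejected:
            if attempt == attempts - 1:
                raise  # give up: surface the rejection instead of hammering the cluster
            # Full jitter spreads retries so concurrent callers don't re-synchronize.
            sleep(random.uniform(0, base_delay * 2 ** attempt))

# Stub client that rejects twice, then succeeds (simulates a loaded cluster).
calls = {"n": 0}
def stub_client(query):
    calls["n"] += 1
    if calls["n"] < 3:
        raise SearchRejected()
    return {"hits": {"total": 1}}

result = query_with_backoff(stub_client, {"match_all": {}}, sleep=lambda s: None)
```

Bounding `attempts` is the key design choice: each caller adds at most a fixed number of extra queries, so a saturated search queue sheds load instead of feeding a rejection storm like the Apr 4 one.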