Signals & Alerts
Deployment — 04:00 UTC (07:00 IL)
All 3 backend containers restarted within the same second.
lb-001 (i-0e7d0f846e38926a8), lb-002 (i-0843e536600ca8982), cron-server (i-0c6f9ecf4865a8c62)
New image: sha256:37b08fee…
Instance types: c7a.2xlarge (lb), m6a.large (cron)
Watchdog — GET /api/v2/inventory/stickers
First fired: 08:10 UTC (11:10 IL)
Fired 5 more times through 08:49 UTC
Expanded to: GET /api/v2/insights/stock-data/:gameid/:itemid at 08:41 UTC
Story key: d46ce6b8-5476-54e0-a9ac-47ca506194f6
Error Burst #1 — 08:55–09:05 UTC (11:55 IL)
~179–180 errors / 5 min (vs a normal baseline of ~10–25)
Dominant errors: Execute Query error, getTags failed, getMyInventoryData failed
Cascading: trade → create failed, monnect → processPayment, sell → listingSell
Error Burst #2 — 09:55–10:05 UTC (12:55 IL)
~192–193 errors / 5 min — slightly worse than burst #1
Same error pattern: OS query failures cascading to trade/payment/inventory
Self-cleared by 10:10 UTC. Same cyclic re-saturation pattern as Apr 3/4 OS incident.
Incident Timeline
04:00 UTC · 07:00 IL
Scheduled deployment
All 3 backend containers emit Docker KILL/DIE/STOP events and restart with the new image. CPU dips to ~2–4% during the restart, then recovers to the 34–38% baseline by 04:20.
07:00–07:15 UTC · 10:00–10:15 IL
OpenSearch already degraded
data_node_4 (i-07e02b117e9dbd0a9) running at 60%+ avg CPU across the entire 07:00–10:00 window. Peak 96%. The cluster was not healthy when the European morning began.
07:15 UTC · 10:15 IL
First backend errors
getMyInventoryData failed, meta → getCollectionMeta, meta → getContainerMeta — all OpenSearch-backed endpoints beginning to fail. Traffic still lower than earlier morning.
07:29 UTC · 10:29 IL
Backend CPU starts climbing
lb-001: 31% → 42%, lb-002: 29% → 39%. No traffic increase — traffic was at ~840–900 req/5min, lower than during 06:00–07:00. The CPU rise is pure queue backlog from slow OS queries.
07:55 UTC · 10:55 IL
CPU plateaus at 48–53%
lb-001 peaks at 55.6%, lb-002 at 51.3%. Node.js worker threads saturated with in-flight OS requests waiting on timeouts.
08:00 UTC · 11:00 IL
European morning traffic ramp
Requests jump from ~843 to ~1005/5min (~+19%). This adds fresh OS query load on top of the already-saturated cluster. CPU holds at 49–52%.
08:10 UTC · 11:10 IL
Watchdog fires — latency on /inventory/stickers
Fires repeatedly through 08:49 UTC. Expands to /insights/stock-data at 08:41. Both endpoints perform OpenSearch queries for item/sticker data.
08:55–09:05 UTC · 11:55 IL
Error burst #1 — OS thread pool saturated
~180 errors/5min. A burst of Execute Query errors indicates the OS search thread pool is full. Failures cascade to trades, payments, and inventory lookups.
09:10 UTC · 12:10 IL
Partial recovery
Error rate drops from 180 to ~15/5min. OS briefly drains its queue. Backend still running at 46–50% CPU — cluster not fully recovered.
09:55–10:05 UTC · 12:55 IL
Error burst #2 — re-saturation
~193 errors/5min — slightly worse. Same OS query failure cascade. Self-clears by 10:10 UTC. Cyclic pattern matches the Apr 3 OS saturation incident.
CPU Profiles (07:00–10:00 UTC)
Backend Nodes (c7a.2xlarge, 8 vCPU)
lb-001 (i-0e7d0f846e38926a8): avg 43% · peak 57%
lb-002 (i-0843e536600ca8982): avg 40% · peak 52%
cron-server (i-0c6f9ecf4865a8c62): avg 2.5% · peak 7%
Note: CPU climb on lb-001/002 happened at 07:29 UTC with traffic at its lowest of the day (~840 req/5min). Confirms the driver was OS latency, not user load.
OpenSearch Nodes (c6a.xlarge, 4 vCPU)
data_node_4 (i-07e02b117e9dbd0a9): avg 61% · peak 96%
Precondition: data_node_4 was already running at 60%+ average over the entire 07:00–10:00 window, meaning it entered Friday's morning peak in a degraded state — likely carryover load from the multi-day OpenSearch saturation incident that started Apr 3.
Apr 5 connection: The same node (data_node_4) crashed at 01:41 UTC on Apr 5, reaching 95%+ CPU. Friday was day 2 of progressive cluster failure.
Error Rate Profile (5-min buckets, 06:00–11:00 UTC)
06:10–07:25
~3–10 / bucket
07:30–08:50
~10–36 / bucket
08:55–09:05 ⚡
~179–180 errors
09:10–09:50
~14–32 / bucket
09:55–10:05 ⚡
~192–193 errors
06:00 spike: ~185 errors at window start may be carryover from overnight or startup effects from the 04:00 deployment. Clears immediately by 06:10.
Failure Cascade
OpenSearch data_node_4 saturated
Running at 60%+ CPU avg on a 4-vCPU c6a.xlarge.
The search thread pool fills up — new queries are queued, then rejected.
Likely caused by sustained shard rebalancing / merge operations from multi-day load.
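The thread-pool-saturation hypothesis is directly checkable against the cluster's standard `_cat` API (`GET /_cat/thread_pool/search?v&h=node_name,active,queue,rejected`): a non-zero `rejected` count confirms queries were dropped. A minimal sketch for parsing that text output and flagging hot nodes — the cluster URL, credentials, and the `queueLimit` default are assumptions, not taken from this incident's config:

```javascript
// Parse the text output of `GET /_cat/thread_pool/search?v&h=node_name,active,queue,rejected`
// into objects, then flag nodes that are rejecting queries or nearly out of queue.
function parseThreadPool(catText) {
  const [header, ...rows] = catText.trim().split('\n');
  const cols = header.trim().split(/\s+/);
  return rows.map((row) => {
    const vals = row.trim().split(/\s+/);
    // Numeric columns (active/queue/rejected) become numbers; node names stay strings.
    return Object.fromEntries(cols.map((c, i) => [c, isNaN(vals[i]) ? vals[i] : Number(vals[i])]));
  });
}

function saturatedNodes(stats, queueLimit = 1000) {
  // rejected > 0 means the pool already dropped queries;
  // queue near the configured limit means it is about to.
  return stats.filter((n) => n.rejected > 0 || n.queue >= queueLimit * 0.8);
}
```

Against output resembling Friday's state (queue pinned at its limit on data_node_4, rejections accumulating), `saturatedNodes` would return only that node.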
Inventory / tag queries fail
getMyInventoryData, getTags, getCollectionMeta, getContainerMeta —
all OS-backed. Start timing out or returning errors from 07:15 UTC.
Node.js request queue backs up
In-flight requests waiting on OS timeouts hold Express worker slots.
Backend CPU climbs from ~36% to 52%+ as GC and timeout handling accumulate.
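The backlog forms because OS calls with no client-side deadline hold their request slot for the cluster's full timeout. A per-call deadline bounds that hold time — a minimal sketch using `Promise.race`; `withTimeout` and the call sites are illustrative, not the service's actual client wrapper:

```javascript
// Bound an OpenSearch call with a client-side deadline so a slow cluster
// releases Express request slots instead of holding them until the server-side
// timeout fires.
function withTimeout(promise, ms, label = 'OS query') {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer to avoid leaks.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}
```

A call site would then look like `withTimeout(osClient.search(query), 2000, 'getTags')`, where `osClient` is a placeholder for the real client.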
Morning traffic amplifies load
European traffic adds ~19% more request volume from 08:00 UTC,
compounding OS query pressure on an already-backed-up cluster.
Watchdog fires on /inventory/stickers and /insights/stock-data.
OS thread pool saturates (×2)
Full saturation at 08:55 and 09:55 UTC. All search queries rejected.
Cascades to: trade → create failed, monnect → processPayment,
sell → listingSell, getFullImage failed.
Self-recovery
OS drains briefly between bursts. Backend CPU stays elevated at 46–54% through end of observed window.
Full recovery unclear — cluster entered Apr 5 in degraded state.
Deployment at 04:00 UTC
Image sha256:37b08fee… — simultaneous restart on all 3 nodes
The 04:00 UTC deployment is confirmed by Docker KILL/DIE/STOP events on lb-001, lb-002, and cron-server within 1 second of each other.
This is consistent with a coordinated CircleCI/GitHub Actions release pipeline.
Open question: Did this release increase OpenSearch query volume or introduce a less-efficient query?
The OS cluster was already stressed before deployment, but the new code ran for ~3 hours before errors appeared —
which could mean either (a) a query change plus a warming effect, or (b) the deployment is coincidental and the OS degradation was independent.
Check the git tag / release diff for any OpenSearch query changes.
Timeline note: OS errors begin at 07:15 UTC — 3h15m after the 04:00 deployment. The 4-hour interval between OpenSearch load surges seen in the Apr 3/4 data suggests a recurring background job (bot fleet scan, pricing job, or scheduled reindex) rather than a deployment regression.
Next Steps
Immediate
- Check current OpenSearch cluster health — data_node_4 crashed Apr 5 at 01:41 UTC. Are all shards reallocated and green?
- Review OS JVM heap for data_node_4 across Apr 3–5 — confirm whether this was a long-running GC/merge event
- Check OS search thread pool queue config — consider increasing or adding circuit-breaker timeout on the backend client
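The circuit-breaker idea from the list above can be sketched as a small backend-side state machine: after N consecutive OS failures, fail fast for a cooldown instead of queueing more requests onto a saturated cluster. Class name, threshold, and cooldown are illustrative assumptions:

```javascript
// Minimal client-side circuit breaker for the OS client.
// closed -> open after `threshold` consecutive failures;
// open -> half-open after `cooldownMs`; a success closes it again.
class CircuitBreaker {
  constructor({ threshold = 5, cooldownMs = 30_000, now = Date.now } = {}) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now; // injectable clock, for testing
    this.failures = 0;
    this.openedAt = null;
  }
  state() {
    if (this.openedAt === null) return 'closed';
    return this.now() - this.openedAt >= this.cooldownMs ? 'half-open' : 'open';
  }
  allow() {
    // Only the 'open' state rejects outright; 'half-open' lets a probe through.
    return this.state() !== 'open';
  }
  recordSuccess() {
    this.failures = 0;
    this.openedAt = null;
  }
  recordFailure() {
    if (++this.failures >= this.threshold) this.openedAt = this.now();
  }
}
```

Wrapping each OS call in `if (cb.allow()) { … }` would have turned Friday's minutes of piled-up timeouts into immediate fast failures, keeping backend CPU from climbing on queue backlog.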
Investigation
- Identify the ~4-hour-interval background job triggering OS query surges — pricing job, bot fleet inventory scan, or scheduled reindex are prime suspects
- Review Apr 4 release diff for any new or modified OpenSearch queries in
inventory_internal or sticker endpoints
- Add OpenSearch thread pool saturation monitor (alert before queue fills, not after)
- Consider read replica or query timeout + fallback for
/inventory/stickers and /insights/stock-data
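The "timeout + fallback" item above could take the shape of a stale-on-error cache: when the live OS query fails, serve the last good response for that key instead of returning a 500. The cache shape, key names, and staleness window are assumptions for illustration:

```javascript
// Stale-on-error fallback for OS-backed endpoints such as
// /inventory/stickers: on query failure, reuse the last good result
// if it is not older than `maxStaleMs`.
const cache = new Map(); // key -> { value, storedAt }

async function staleOnError(key, fetchFresh, maxStaleMs = 10 * 60_000) {
  try {
    const value = await fetchFresh();
    cache.set(key, { value, storedAt: Date.now() });
    return { value, stale: false };
  } catch (err) {
    const hit = cache.get(key);
    if (hit && Date.now() - hit.storedAt <= maxStaleMs) {
      return { value: hit.value, stale: true }; // degrade gracefully
    }
    throw err; // nothing usable cached — surface the failure
  }
}
```

A `stale: true` flag in the response lets the frontend show slightly outdated sticker data during an OS burst rather than an error page.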
Related Incidents
Apr 5, 2026 — Two-Wave CPU Spike:
Nuxt frontend nodes hit 92% CPU + OpenSearch data_node_4 reached 95% and crashed.
That incident was the direct continuation of Friday's degraded cluster state.