Signals & Alerts
Deployment — 04:00 UTC (07:00 IL)
All 3 backend containers restarted within the same second.
lb-001 (i-0e7d0f846e38926a8), lb-002 (i-0843e536600ca8982), cron-server (i-0c6f9ecf4865a8c62)
New image: sha256:37b08fee…
Instance types: c7a.2xlarge (lb), m6a.large (cron)
Watchdog — GET /api/v2/inventory/stickers
First fired: 08:10 UTC (11:10 IL)
Fired 5 more times through 08:49 UTC
Expanded to: GET /api/v2/insights/stock-data/:gameid/:itemid at 08:41 UTC
Story key: d46ce6b8-5476-54e0-a9ac-47ca506194f6
Error Burst #1 — 08:55–09:05 UTC (11:55 IL)
~179–180 errors / 5 min (vs a normal baseline of ~10–25)
Dominant errors: Execute Query error, getTags failed, getMyInventoryData failed
Cascading: trade → create failed, monnect → processPayment, sell → listingSell
Error Burst #2 — 09:55–10:05 UTC (12:55 IL)
~192–193 errors / 5 min — slightly worse than burst #1
Same error pattern: OS query failures cascading to trade/payment/inventory
Self-cleared by 10:10 UTC. Same cyclic re-saturation pattern as Apr 3/4 OS incident.
Incident Timeline
04:00 UTC · 07:00 IL
Scheduled deployment
All 3 backend containers emit Docker KILL/DIE/STOP events and restart with the new image. CPU dips to ~2–4% during the restart, then recovers to the 34–38% baseline by 04:20.
07:00–07:15 UTC · 10:00–10:15 IL
OpenSearch already degraded
data_node_4 (i-07e02b117e9dbd0a9) running at 60%+ avg CPU across the entire 07:00–10:00 window. Peak 96%. The cluster was not healthy when the European morning began.
07:15 UTC · 10:15 IL
First backend errors
getMyInventoryData failed, meta → getCollectionMeta, meta → getContainerMeta — all OpenSearch-backed endpoints beginning to fail. Traffic still lower than earlier morning.
07:29 UTC · 10:29 IL
Backend CPU starts climbing
lb-001: 31% → 42%, lb-002: 29% → 39%. No traffic increase — traffic was at ~840–900 req/5min, lower than during 06:00–07:00. The CPU rise is pure queue backlog from slow OS queries.
07:55 UTC · 10:55 IL
CPU plateaus at 48–53%
lb-001 peaks at 55.6%, lb-002 at 51.3%. Node.js worker threads saturated with in-flight OS requests waiting on timeouts.
08:00 UTC · 11:00 IL
European morning traffic ramp
Requests jump from ~843 to ~1005/5min (~+19%). This adds fresh OS query load on top of the already-saturated cluster. CPU holds at 49–52%.
08:10 UTC · 11:10 IL
Watchdog fires — latency on /inventory/stickers
Fires repeatedly through 08:49 UTC. Expands to /insights/stock-data at 08:41. Both endpoints perform OpenSearch queries for item/sticker data.
08:55–09:05 UTC · 11:55 IL
Error burst #1 — OS thread pool saturated
~180 errors/5min. A burst of Execute Query errors indicates the OS search thread pool is full. Failures cascade to trades, payments, and inventory lookups.
09:10 UTC · 12:10 IL
Partial recovery
Error rate drops from 180 to ~15/5min. OS briefly drains its queue. Backend still running at 46–50% CPU — cluster not fully recovered.
09:55–10:05 UTC · 12:55 IL
Error burst #2 — re-saturation
~193 errors/5min — slightly worse. Same OS query failure cascade. Self-clears by 10:10 UTC. Cyclic pattern matches the Apr 3 OS saturation incident.
CPU Profiles (07:00–10:00 UTC)
Backend Nodes (c7a.2xlarge, 8 vCPU)
lb-001 (i-0e7d0f846e38926a8): avg 43% · peak 57%
lb-002 (i-0843e536600ca8982): avg 40% · peak 52%
cron-server (i-0c6f9ecf4865a8c62): avg 2.5% · peak 7%
Note: CPU climb on lb-001/002 happened at 07:29 UTC with traffic at its lowest of the day (~840 req/5min). Confirms the driver was OS latency, not user load.
OpenSearch Nodes (c6a.xlarge, 4 vCPU)
data_node_4 (i-07e02b117e9dbd0a9): avg 61% · peak 96%
Precondition: data_node_4 was already running at 60%+ average over the entire 07:00–10:00 window, meaning it entered Friday's morning peak in a degraded state — likely carryover load from the multi-day OpenSearch saturation incident that started Apr 3.
Apr 5 connection: The same node (data_node_4) crashed at 01:41 UTC on Apr 5, reaching 95%+ CPU. Friday was day 2 of progressive cluster failure.
Error Rate Profile (5-min buckets, 06:00–11:00 UTC)
06:10–07:25
~3–10 / bucket
07:30–08:50
~10–36 / bucket
08:55–09:05 ⚡
~179–180 errors
09:10–09:50
~14–32 / bucket
09:55–10:05 ⚡
~192–193 errors
06:00 spike: ~185 errors at window start may be carryover from overnight or startup effects from the 04:00 deployment. Clears immediately by 06:10.
Failure Cascade
OpenSearch data_node_4 saturated
Running at 60%+ CPU avg on a 4-vCPU c6a.xlarge.
The search thread pool fills up — new queries are queued, then rejected.
Likely caused by sustained shard rebalancing / merge operations from multi-day load.
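The thread-pool-saturation hypothesis is directly checkable against the cluster's standard `_cat` API (`GET /_cat/thread_pool/search?v&h=node_name,active,queue,rejected`): a non-zero `rejected` count confirms queries were dropped. A minimal sketch for parsing that text output and flagging hot nodes — the cluster URL, credentials, and the `queueLimit` default are assumptions, not taken from this incident's config:

```javascript
// Parse the text output of `GET /_cat/thread_pool/search?v&h=node_name,active,queue,rejected`
// into objects, then flag nodes that are rejecting queries or nearly out of queue.
function parseThreadPool(catText) {
  const [header, ...rows] = catText.trim().split('\n');
  const cols = header.trim().split(/\s+/);
  return rows.map((row) => {
    const vals = row.trim().split(/\s+/);
    // Numeric columns (active/queue/rejected) become numbers; node names stay strings.
    return Object.fromEntries(cols.map((c, i) => [c, isNaN(vals[i]) ? vals[i] : Number(vals[i])]));
  });
}

function saturatedNodes(stats, queueLimit = 1000) {
  // rejected > 0 means the pool already dropped queries;
  // queue near the configured limit means it is about to.
  return stats.filter((n) => n.rejected > 0 || n.queue >= queueLimit * 0.8);
}
```

Against output resembling Friday's state (queue pinned at its limit on data_node_4, rejections accumulating), `saturatedNodes` would return only that node.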
Inventory / tag queries fail
getMyInventoryData, getTags, getCollectionMeta, getContainerMeta —
all OS-backed. Start timing out or returning errors from 07:15 UTC.
Node.js request queue backs up
In-flight requests waiting on OS timeouts hold Express worker slots.
Backend CPU climbs from ~36% to 52%+ as GC and timeout handling accumulate.
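The backlog forms because OS calls with no client-side deadline hold their request slot for the cluster's full timeout. A per-call deadline bounds that hold time — a minimal sketch using `Promise.race`; `withTimeout` and the call sites are illustrative, not the service's actual client wrapper:

```javascript
// Bound an OpenSearch call with a client-side deadline so a slow cluster
// releases Express request slots instead of holding them until the server-side
// timeout fires.
function withTimeout(promise, ms, label = 'OS query') {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer to avoid leaks.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}
```

A call site would then look like `withTimeout(osClient.search(query), 2000, 'getTags')`, where `osClient` is a placeholder for the real client.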
Morning traffic amplifies load
European traffic adds ~19% more request volume from 08:00 UTC,
compounding OS query pressure on an already-backed-up cluster.
Watchdog fires on /inventory/stickers and /insights/stock-data.
OS thread pool saturates (×2)
Full saturation at 08:55 and 09:55 UTC. All search queries rejected.
Cascades to: trade → create failed, monnect → processPayment,
sell → listingSell, getFullImage failed.
Self-recovery
OS drains briefly between bursts. Backend CPU stays elevated at 46–54% through end of observed window.
Full recovery unclear — cluster entered Apr 5 in degraded state.
Deployment at 04:00 UTC
Image sha256:37b08fee… — simultaneous restart on all 3 nodes
The 04:00 UTC deployment is confirmed by Docker KILL/DIE/STOP events on lb-001, lb-002, and cron-server within 1 second of each other.
This is consistent with a coordinated CircleCI/GitHub Actions release pipeline.
Open question: Did this release increase OpenSearch query volume or introduce a less-efficient query?
The OS cluster was already stressed before deployment, but the new code ran for ~3 hours before errors appeared —
which could mean either (a) a query change plus a warming effect, or (b) the deployment is coincidental and the OS degradation was independent.
Check the git tag / release diff for any OpenSearch query changes.
Timeline note: OS errors begin at 07:15 UTC — 3h15m after the 04:00 deployment. The 4-hour interval between OpenSearch load surges seen in the Apr 3/4 data suggests a recurring background job (bot fleet scan, pricing job, or scheduled reindex) rather than a deployment regression.
Next Steps
Immediate
- Check current OpenSearch cluster health — data_node_4 crashed Apr 5 at 01:41 UTC. Are all shards reallocated and green?
- Review OS JVM heap for data_node_4 across Apr 3–5 — confirm whether this was a long-running GC/merge event
- Check OS search thread pool queue config — consider increasing or adding circuit-breaker timeout on the backend client
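The circuit-breaker idea from the list above can be sketched as a small backend-side state machine: after N consecutive OS failures, fail fast for a cooldown instead of queueing more requests onto a saturated cluster. Class name, threshold, and cooldown are illustrative assumptions:

```javascript
// Minimal client-side circuit breaker for the OS client.
// closed -> open after `threshold` consecutive failures;
// open -> half-open after `cooldownMs`; a success closes it again.
class CircuitBreaker {
  constructor({ threshold = 5, cooldownMs = 30_000, now = Date.now } = {}) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now; // injectable clock, for testing
    this.failures = 0;
    this.openedAt = null;
  }
  state() {
    if (this.openedAt === null) return 'closed';
    return this.now() - this.openedAt >= this.cooldownMs ? 'half-open' : 'open';
  }
  allow() {
    // Only the 'open' state rejects outright; 'half-open' lets a probe through.
    return this.state() !== 'open';
  }
  recordSuccess() {
    this.failures = 0;
    this.openedAt = null;
  }
  recordFailure() {
    if (++this.failures >= this.threshold) this.openedAt = this.now();
  }
}
```

Wrapping each OS call in `if (cb.allow()) { … }` would have turned Friday's minutes of piled-up timeouts into immediate fast failures, keeping backend CPU from climbing on queue backlog.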
Investigation
- Identify the ~4-hour-interval background job triggering OS query surges — pricing job, bot fleet inventory scan, or scheduled reindex are prime suspects
- Review Apr 4 release diff for any new or modified OpenSearch queries in
inventory_internal or sticker endpoints
- Add OpenSearch thread pool saturation monitor (alert before queue fills, not after)
- Consider read replica or query timeout + fallback for
/inventory/stickers and /insights/stock-data
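The "timeout + fallback" item above could take the shape of a stale-on-error cache: when the live OS query fails, serve the last good response for that key instead of returning a 500. The cache shape, key names, and staleness window are assumptions for illustration:

```javascript
// Stale-on-error fallback for OS-backed endpoints such as
// /inventory/stickers: on query failure, reuse the last good result
// if it is not older than `maxStaleMs`.
const cache = new Map(); // key -> { value, storedAt }

async function staleOnError(key, fetchFresh, maxStaleMs = 10 * 60_000) {
  try {
    const value = await fetchFresh();
    cache.set(key, { value, storedAt: Date.now() });
    return { value, stale: false };
  } catch (err) {
    const hit = cache.get(key);
    if (hit && Date.now() - hit.storedAt <= maxStaleMs) {
      return { value: hit.value, stale: true }; // degrade gracefully
    }
    throw err; // nothing usable cached — surface the failure
  }
}
```

A `stale: true` flag in the response lets the frontend show slightly outdated sticker data during an OS burst rather than an error page.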
Related Incidents
Apr 5, 2026 — Two-Wave CPU Spike:
Nuxt frontend nodes hit 92% CPU + OpenSearch data_node_4 reached 95% and crashed.
That incident was the direct continuation of Friday's degraded cluster state.