Incident Report · April 15, 2026

OS Load Spike + Backend Event-Loop Stalls

Two correlated problems; causation NOT proven. A user reported a slow unstack on the store page (~8s), and a data-node load alert fired at 17:38Z (20:38 IDT).

P2 OPEN — Performance Degradation, No Outage

User-visible 8s page load reported · OS data_node_1 alert auto-recovered · Investigation ongoing — missing observability prevents causal proof

TL;DR

One-paragraph summary

Two distinct performance problems were observed on 2026-04-15 within the same hour, and they correlate but causation is NOT proven in either direction. Problem (A): opensearch_cluster_data_node_1 hit sustained CPU 60–87% with load.1 peaking at 5.3 for ~10 minutes (17:30–17:40 UTC), triggering a Datadog monitor alert at 17:38Z. data_node_4 spiked in lockstep; data_node_3 spiked later (17:39–17:43); data_node_2 stayed idle (by design). Problem (B): tradeit-backend Node.js event loop repeatedly stalled for 1–3 SECONDS during the same window AND in unrelated windows earlier in the day. Two captured slow OpenSearch queries during the window were inventory-browse calls with 500+ asset IDs in must_not filters — expensive on Lucene and a likely OS contributor. Backend traffic volume was flat (no traffic surge). Earlier 14:41Z slow trace shows event-loop block of 1.82s while OS was healthy — proves Problem (B) exists independently. The fix path is two parallel tracks, not one.

  • 5.3 · Peak load.1 (data_node_1)
  • 87% · Peak CPU user
  • 3.2s · Worst event_loop.delay.max
  • 2 / 4 · OS nodes > load 4.0
  • ~10min · Spike duration
  • 8s · User-reported page load

Timeline (UTC)

Both problems on the same day — only one was inside the alert window

13:25 — DEV-4815 backend PR #1244 merged to master (5ee632c8) DEPLOY
14:30 — DEV-4815 backend PR #1245 merged (18f3f1bd) — tradePurchase / store / imageScreenshot DEPLOY
14:41:56 — Problem (B) first evidence: 3.1s slow trace on GET /api/v2/inventory/data. OS healthy (queries 110ms). 1.82s silent gap = JS event-loop block on backend. Trace 69dfa3b4…. EVIDENCE
15:14 — DEV-4815 prod deploy (master image) to 3 backend hosts DEPLOY
16:55 — Hotfix PR #1246 (enable DD_RUNTIME_METRICS_ENABLED) prod deploy — e7a45685 DEPLOY
17:00–17:28 — Backend traffic baseline 49–77 traces/min sampled. OS data nodes baseline 1.5–2.2 load.1.
17:30–17:31 — OS load climbs — data_node_1 + data_node_4 both at 3.0
17:33–17:38 — Sustained spike: data_node_1 + data_node_4 at load.1 4.6–5.3, CPU 60–87%. Backend event_loop.delay.max hits 2.7s, 3.1s, 2.5s. PEAK
17:38 — Datadog Monitor 238873337 fires for opensearch_cluster_data_node_1 (load > 1.0 normalized) ALERT
17:39–17:43 — data_node_1 + data_node_4 cool down to ~1.0. data_node_3 takes over — load climbs to 3.3 then drops.
17:42, 17:45, 17:48 — Backend event_loop.delay.max spikes to 3.2s, 3.1s, 1.2s — continues AFTER OS recovered EVIDENCE
~17:50 — All metrics return to baseline RECOVERED

OS Cluster State During Spike (17:33–17:38 UTC)

4-node cluster, c6a.xlarge (4 vCPU), all in eu-west-1c

Peak load.1 per node. Threshold: 4.0 on a 4-vCPU box (= full saturation).

  • data_node_1 · 5.29 · SATURATED
  • data_node_4 · 5.16 · SATURATED (lockstep with node_1)
  • data_node_3 · 3.33 · HOT (took over after 1+4 cooled)
  • data_node_2 · 0.02 · IDLE (by design)

The pattern of node_1 and node_4 spiking together, followed by node_3, is consistent with search-side replicated load on the active replica set, and less consistent with bulk indexing (which would pin a primary). However, lacking opensearch.indexing.* metrics, indexing cannot be ruled out as a contributor.


Two Problems — Same Day, Possibly Independent

Both real. Causation between them not proven by available data.

Problem (A): OpenSearch CPU Saturation

  • data_node_1 sustained load.1 4.6–5.3 for ~7 min (17:33–17:40)
  • CPU user 60–87% during the spike
  • data_node_4 spiked in lockstep; data_node_3 lagged by ~5 min
  • 2 captured slow queries (567ms, 664ms server-side) — both inventory browse with 500+ must_not:assetId
  • data_node_2 idle throughout (matches "by-design idle node")
  • Backend traffic flat — not driven by request volume burst

Problem (B): Backend Event-Loop Stalls

  • 14:41Z trace: 3.1s total, of which 1.82s was a silent gap with no async work — OS queries that trace took 110ms (healthy)
  • Same trace had 797ms redis.command GET marketplace:configs:enabled:trade on a 2-byte payload — only explainable by event-loop block delaying the reply callback
  • 17:30–17:48Z: runtime.node.event_loop.delay.max repeatedly hit 1–3 SECONDS
  • Spikes continued after OS recovered (17:42, 17:45, 17:48) — not solely OS-driven
  • Heap stable 300–500 MB — not GC pressure
  • Multiple Redis command types showed elevated client-side latency (79 GET + 55 LRANGE + 18 EXPIRE >100ms in 6h) — uniform across types is a signature of event-loop blocking, not Redis slowness

Evidence — The 14:41Z Trace (Problem B Independent of A)

Datadog APM trace 69dfa3b4000000001ba14667da4ff95c · 3.1s on GET /api/v2/inventory/data

URL: ?context=trade&gameId=730&groupId=117628&isForStore=0&limit=500&offset=100 — an unstack/page-2 browse of a popular CSGO group. Span tree:

| Time (UTC) | Span | Duration | Notes |
|---|---|---|---|
| 14:41:56.033 | Request start (Express) | | Trace begins |
| 14:41:56.033 | redis.command GET sess:… | 18 ms | Session lookup — healthy |
| 14:41:56.053 | redis.command GET marketplace:configs:enabled:trade | 797 ms | 2-byte payload per STRLEN — impossible over network. Client-side wait. |
| 14:41:56.85 → 14:41:58.672 | SILENT GAP — no spans | 1,820 ms | JS event loop blocked. CPU work, no async I/O. |
| 14:41:58.672 | redis.command LRANGE reserved-assets_730 | 56 ms | Reserved-assets fetch |
| 14:41:58.730 | HTTP POST /inventory_730/_search | 82 ms | OS query — fast |
| 14:41:58.820 | HTTP POST /inventory_730/_search (count agg) | 110 ms | OS count query — fast |
| 14:41:58.932 | redis.command EXPIRE sess:… | 198 ms | Session expire write — also slow client-side |
| 14:41:59.137 | Request end | 3,104 ms total | OS contributed only 192 ms — rest is client/CPU |

Why this is the cleanest proof of Problem (B)

OS data nodes were healthy at 14:41Z (no alert window). The 1.82s silent gap and the 797ms Redis GET on a 2-byte value cannot be explained by network or remote-service slowness — the only mechanism is Node.js event-loop block delaying async callbacks. Whatever causes Problem (B) is internal to the backend and exists independently of OS load.
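The mechanism is easy to reproduce in isolation (illustrative sketch, not the production handler): any in-flight async reply, however fast the remote side answers, inherits the full duration of a synchronous block because its callback queues behind it.

```javascript
// Sketch of the 797ms-GET-on-2-bytes signature: a 5 ms async "reply"
// (standing in for the Redis GET) is measured client-side at roughly the
// length of a sync CPU burst, because the callback cannot run until the
// event loop yields.
function blockLoop(ms) {
  const start = Date.now();
  while (Date.now() - start < ms) { /* sync CPU work, e.g. a big JSON.parse */ }
}

// Resolves with the client-observed latency of a reply that "arrives" in 5 ms.
function timedFastReply(blockMs) {
  return new Promise((resolve) => {
    const t0 = Date.now();
    setTimeout(() => resolve(Date.now() - t0), 5); // fast remote answer
    blockLoop(blockMs); // the callback above is now stuck behind this
  });
}

timedFastReply(500).then((ms) => {
  console.log(`client-side latency: ${ms} ms`); // ~500 ms, not 5 ms
});
```

This is also why the slowness was uniform across Redis command types: GET, LRANGE, and EXPIRE all queue behind the same blocked loop.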


Evidence — The Slow OS Queries (Problem A)

Both queries 17:37Z + 17:43Z · same shape · expensive Lucene work

Body shape (truncated — the actual must_not.assetId arrays each had 500+ items):

{
  "sort": [{ "price": "asc" }, { "tradeLockDay": "asc" }, { "botIndex": "asc" }],
  "from": "0", "size": 100,
  "query": {
    "bool": {
      "filter": [{ "bool": { "must": [
        { "term": { "groupId": { "value": 35141 } } },
        { "range": { "priceForTrade": { "lt": 4741729 } } },
        { "term": { "onlyStore": false } }
      ]}}],
      "must_not": [
        { "terms": { "assetId": [ /* 500+ asset IDs */ ] } },
        { "terms": { "botIndex": [46,70,108,109,119,418,419,286] } }
      ]
    }
  }
}

Why this query shape spikes data-node CPU

Lucene evaluates each terms entry against the inverted index. A must_not.terms.assetId with 500+ values means each scored doc is checked against 500+ exclusions. Combined with sort + score + range filter on a hot group, it’s the most expensive shape we send to OS. Same shape as the 14:41Z slow trace (different group, different IDs).
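One possible shape of the mitigation (hypothetical helper names and cap value; the real call site is queryCountItemsOnGroup and its controllers): cap the OS-side exclusion list and filter the remainder in the app layer after the search returns.

```javascript
// Hypothetical mitigation sketch — NOT the current backend code.
// Cap the must_not exclusion list sent to OpenSearch and post-filter the
// overflow in the app layer, trading a little over-fetch for cheaper queries.
const MAX_OS_EXCLUSIONS = 100; // assumed cap; tune against real traffic

function buildBrowseQuery(groupId, maxPrice, excludedAssetIds) {
  const osSide = excludedAssetIds.slice(0, MAX_OS_EXCLUSIONS);
  return {
    query: {
      bool: {
        filter: [
          { term: { groupId } },
          { range: { priceForTrade: { lt: maxPrice } } },
        ],
        // Lucene now checks at most 100 exclusions per doc instead of 500+.
        must_not: [{ terms: { assetId: osSide } }],
      },
    },
  };
}

// The overflow (IDs beyond the cap) is dropped after the response, with
// O(1) Set lookups per hit instead of per-doc Lucene work.
function postFilter(hits, excludedAssetIds) {
  const overflow = new Set(excludedAssetIds.slice(MAX_OS_EXCLUSIONS));
  return hits.filter((hit) => !overflow.has(hit.assetId));
}
```

The trade-off: post-filtering can shrink a page below the requested size, so the caller would need to over-fetch by roughly the overflow count to keep pagination stable.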

APM sampling rate is 1% in prod (DD_TRACE_SAMPLE_RATE: '0.01'), so the 2 captured slow queries in 15 minutes likely represent ~200 actual slow queries during that window.
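The extrapolation is straightforward head-based-sampling arithmetic: each captured trace stands in for roughly 1/rate real requests.

```javascript
// Back-of-envelope extrapolation from sampled traces to actual volume.
const SAMPLE_RATE = 0.01;      // DD_TRACE_SAMPLE_RATE in prod
const capturedSlowQueries = 2; // seen at 17:37Z and 17:43Z
const estimatedActual = capturedSlowQueries / SAMPLE_RATE;
console.log(`~${estimatedActual} slow queries in the window`); // ~200
```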


What Was Ruled Out

Confirmed against the data — not contributing causes

| Hypothesis | Verdict | Evidence | Status |
|---|---|---|---|
| DEV-4815 deploy? | NO | Inventory browse path doesn't use queryByAssetIds (the only thing DEV-4815 touched). | RULED OUT |
| Traffic surge? | NO | Backend trace volume flat: 47–77 spans/min through 17:00–17:50. | RULED OUT |
| Redis (server) slowness? | NO | redis.ping.latency_ms 0.2–0.4ms, redis.cpu.user_main_thread 0.1–0.5%, no RDB fork. | RULED OUT |
| Marketplace config payload size? | NO | STRLEN marketplace:configs:enabled:trade = 2 bytes. Payload size is not the reason for the 797ms GET. | RULED OUT |
| JVM heap pressure / GC? | N/A | Backend Node.js heap stable 300–500MB. OS-side JVM heap not observable (no opensearch.* metrics). | PARTIAL |
| data_node_2 idle by accident? | NO | Intentional — concentrated shard placement is a deliberate trade-off (faster bulk inserts). | RULED OUT |

What We Cannot Yet Rule Out

Open questions blocking causal proof

| Question | Why It Matters | What Would Answer It |
|---|---|---|
| Was parser indexing running during 17:30–17:40Z? | If a 12-min lane indexing cycle happened to start at 17:28–17:30Z, it would explain the OS load on the primary shards (data_node_1, data_node_4) without involving query traffic. | Parser PM2 logs / CircleCI run history / direct check on parser host. No parser logs in Datadog. |
| Did OS query latency to backend coincide with the load spike? | If yes, OS slowness → longer awaits on backend → more queued JSON.parse calls → event-loop blocked. If OS replies were still fast despite host CPU, that chain breaks. | Server-side OS slow log (index.search.slowlog.threshold.query.warn: 1s) for the 17:30–17:40Z window. |
| What is the specific JS code blocking the event loop? | The 1.82s silent gap at 14:41Z is the smoking gun for Problem (B). We need to identify which sync code path is responsible. Suspects: JSON.parse of large Redis/OS responses, ramda indexBy on big arrays, sync crypto in middleware. | CPU profile via DD_PROFILING_ENABLED=true for ~1 hour during peak traffic, OR clinic.js flame reproducing the slow path on staging. |
| Is the 500+ must_not:assetIds filter avoidable? | If we can shrink it (e.g., paginate exclusions, post-filter in app, or use a different OS query pattern), Problem (A) gets cheaper per query. | Code review of queryCountItemsOnGroup + the calling controllers. Where does the 500-item exclusion list come from? Reserved + cart items? All bot inventory? |

Observability Gaps Exposed (DEV-4638 still relevant)

Same gaps as the 2026-04-14 incident — and one new one we can fix today

No opensearch.* metrics in Datadog

No bulk thread-pool queue depth, search rejections, JVM heap, GC, segment count, slow query logs. Datadog Agent is running on OS hosts (we get system.* metrics), but the OpenSearch integration is not configured. Setup is per-node opensearch.d/conf.yaml + cluster URL + read-only auth user. Tracked separately.

No parser logs in Datadog

Cannot directly correlate parser indexing cycles with OS spikes from log timestamps — have to infer from metric shapes alone. Same gap as 2026-04-14 incident.

Closed today: runtime.node.* metrics on backend

PR #1246 added DD_RUNTIME_METRICS_ENABLED=true to env_production in ecosystem.config.js. Deployed at 16:55Z. runtime.node.event_loop.delay.max is now flowing — this is the metric that surfaced the 1–3 SECOND event-loop stalls during the alert window. Without it, Problem (B) would have remained invisible.


Recommendations — Two Tracks

Both problems get worked in parallel; neither blocks the other

| Rank | Track | Action | Why |
|---|---|---|---|
| P0 | Both | Configure Datadog Agent OpenSearch integration on all 4 data nodes + coord node | Without opensearch.thread_pool.search.queue, opensearch.indices.search.query_time, slow logs, etc., we cannot prove or disprove parser indexing as a contributor — the same blind spot the 2026-04-14 incident hit. Setup is one opensearch.d/conf.yaml per host pointing at https://localhost:9200 with a read-only user. |
| P0 | B | Enable continuous profiler (DD_PROFILING_ENABLED=true) for ~24h on prod | Now that event_loop.delay.max spikes are visible, the profiler will correlate spikes to specific JS frames (JSON.parse, ramda transforms, sync crypto). 2–5% CPU overhead. One-line change, same file as PR #1246. |
| P1 | A | Audit the 500+ must_not:assetId filter source — can it be capped or moved to app-layer post-filter? | Same query shape recurred in slow traces at 14:41Z and 17:37/17:43Z. Likely the "exclude already-reserved + already-in-cart" set. If we can cap at, e.g., 100 IDs per query and post-filter the rest, OS CPU per query drops materially. |
| P1 | B | Identify owner of the 1.82s silent gap in the 14:41Z trace path (getInventoryData / fetchItems) | Read the handler for any sync JSON.parse on Redis cache hits, sync ramda transforms, or sync crypto. The 500-item must_not array build itself may be a hotspot. Combine with profiler results from the row above. |
| P2 | A | Enable parser log shipping to Datadog (open since 2026-04-14 incident) | Closes the gap that blocks correlating parser bulk cycles with OS spikes. Already a P2 in the 2026-04-14 action plan — restated here because it surfaced again. |
| P2 | B | Once OS metrics are available, build a Datadog dashboard overlaying runtime.node.event_loop.delay.max with opensearch.thread_pool.search.active, opensearch.indices.search.query_time, and backend p99 latency | Visual correlation between OS load, OS query latency, backend event loop, and user-facing latency. Will resolve the causation question on the next incident. |
| P2 | Both | Bump DD_TRACE_SAMPLE_RATE from 0.01 to 0.1 temporarily during active investigation | 1% sampling means most slow traces are invisible. 10% during active investigation makes evidence-gathering 10x faster. Revert after. |

References

Incident owner: Ehud Shahak · Generated 2026-04-15 · Status: OPEN — investigation continuing