Incident Report · April 15, 2026

OS Load Spike + Backend Event-Loop Stalls

Two correlated problems; causation NOT proven. A user reported a slow unstack on the store page (~8s), and a data-node load alert fired at 17:38Z (20:38 IDT).

P2 OPEN — Performance Degradation, No Outage

User-visible 8s page load reported · OS data_node_1 alert auto-recovered · Investigation ongoing — missing observability prevents causal proof

TL;DR

One-paragraph summary

Two distinct performance problems were observed on 2026-04-15 within the same hour, and they correlate but causation is NOT proven in either direction. Problem (A): opensearch_cluster_data_node_1 hit sustained CPU 60–87% with load.1 peaking at 5.3 for ~10 minutes (17:30–17:40 UTC), triggering a Datadog monitor alert at 17:38Z. data_node_4 spiked in lockstep; data_node_3 spiked later (17:39–17:43); data_node_2 stayed idle (by design). Problem (B): tradeit-backend Node.js event loop repeatedly stalled for 1–3 SECONDS during the same window AND in unrelated windows earlier in the day. Two captured slow OpenSearch queries during the window were inventory-browse calls with 500+ asset IDs in must_not filters — expensive on Lucene and a likely OS contributor. Backend traffic volume was flat (no traffic surge). Earlier 14:41Z slow trace shows event-loop block of 1.82s while OS was healthy — proves Problem (B) exists independently. The fix path is two parallel tracks, not one.

  • 5.3 · Peak load.1 (data_node_1)
  • 87% · Peak CPU user
  • 3.2s · Worst event_loop.delay.max
  • 2 / 4 · OS nodes > load 4.0
  • ~10min · Spike duration
  • 8s · User-reported page load

Timeline (UTC)

Both problems on the same day — only one was inside the alert window

13:25 — DEV-4815 backend PR #1244 merged to master (5ee632c8) DEPLOY
14:30 — DEV-4815 backend PR #1245 merged (18f3f1bd) — tradePurchase / store / imageScreenshot DEPLOY
14:41:56 — Problem (B) first evidence: 3.1s slow trace on GET /api/v2/inventory/data. OS healthy (queries 110ms). 1.82s silent gap = JS event-loop block on backend. Trace 69dfa3b4…. EVIDENCE
15:14 — DEV-4815 prod deploy (master image) to 3 backend hosts DEPLOY
16:55 — Hotfix PR #1246 (enable DD_RUNTIME_METRICS_ENABLED) prod deploy — e7a45685 DEPLOY
17:00–17:28 — Backend traffic baseline 49–77 traces/min sampled. OS data nodes baseline 1.5–2.2 load.1.
17:30–17:31 — OS load climbs — data_node_1 + data_node_4 both at 3.0
17:33–17:38 — Sustained spike: data_node_1 + data_node_4 at load.1 4.6–5.3, CPU 60–87%. Backend event_loop.delay.max hits 2.7s, 3.1s, 2.5s. PEAK
17:38 — Datadog Monitor 238873337 fires for opensearch_cluster_data_node_1 (load > 1.0 normalized) ALERT
17:39–17:43 — data_node_1 + data_node_4 cool down to ~1.0. data_node_3 takes over — load climbs to 3.3 then drops.
17:42, 17:45, 17:48 — Backend event_loop.delay.max spikes to 3.2s, 3.1s, 1.2s — continues AFTER OS recovered EVIDENCE
~17:50 — All metrics return to baseline RECOVERED

OS Cluster State During Spike (17:33–17:38 UTC)

4-node cluster, c6a.xlarge (4 vCPU), all in eu-west-1c

Peak load.1 per node. Threshold: 4.0 on a 4-vCPU box (= full saturation).

  • data_node_1 · 5.29 · SATURATED
  • data_node_4 · 5.16 · SATURATED (lockstep with node_1)
  • data_node_3 · 3.33 · HOT (took over after 1+4 cooled)
  • data_node_2 · 0.02 · IDLE (by design)

The pattern of node_1 and node_4 spiking together, followed by node_3, is consistent with search-side replicated load on the active replica set, and less consistent with bulk indexing (which would pin a primary). However, lacking opensearch.indexing.* metrics, indexing cannot be ruled out as a contributor.


Two Problems — Same Day, Possibly Independent

Both real. Causation between them not proven by available data.

Problem (A): OpenSearch CPU Saturation

  • data_node_1 sustained load.1 4.6–5.3 for ~7 min (17:33–17:40)
  • CPU user 60–87% during the spike
  • data_node_4 spiked in lockstep; data_node_3 lagged by ~5 min
  • 2 captured slow queries (567ms, 664ms server-side) — both inventory browse with 500+ must_not:assetId
  • data_node_2 idle throughout (matches "by-design idle node")
  • Backend traffic flat — not driven by request volume burst

Problem (B): Backend Event-Loop Stalls

  • 14:41Z trace: 3.1s total, of which 1.82s was a silent gap with no async work — OS queries that trace took 110ms (healthy)
  • Same trace had 797ms redis.command GET marketplace:configs:enabled:trade on a 2-byte payload — only explainable by event-loop block delaying the reply callback
  • 17:30–17:48Z: runtime.node.event_loop.delay.max repeatedly hit 1–3 SECONDS
  • Spikes continued after OS recovered (17:42, 17:45, 17:48) — not solely OS-driven
  • Heap stable 300–500 MB — not GC pressure
  • Multiple Redis command types showed elevated client-side latency (79 GET + 55 LRANGE + 18 EXPIRE >100ms in 6h) — uniform across types is a signature of event-loop blocking, not Redis slowness

Evidence — The 14:41Z Trace (Problem B Independent of A)

Datadog APM trace 69dfa3b4000000001ba14667da4ff95c · 3.1s on GET /api/v2/inventory/data

URL: ?context=trade&gameId=730&groupId=117628&isForStore=0&limit=500&offset=100 — an unstack/page-2 browse of a popular CSGO group. Span tree:

| Time (UTC) | Span | Duration | Notes |
|---|---|---|---|
| 14:41:56.033 | Request start (Express) | | Trace begins |
| 14:41:56.033 | redis.command GET sess:… | 18 ms | Session lookup — healthy |
| 14:41:56.053 | redis.command GET marketplace:configs:enabled:trade | 797 ms | 2-byte payload per STRLEN — impossible over network. Client-side wait. |
| 14:41:56.85 → 14:41:58.672 | SILENT GAP — no spans | 1,820 ms | JS event loop blocked. CPU work, no async I/O. |
| 14:41:58.672 | redis.command LRANGE reserved-assets_730 | 56 ms | Reserved-assets fetch |
| 14:41:58.730 | HTTP POST /inventory_730/_search | 82 ms | OS query — fast |
| 14:41:58.820 | HTTP POST /inventory_730/_search (count agg) | 110 ms | OS count query — fast |
| 14:41:58.932 | redis.command EXPIRE sess:… | 198 ms | Session expire write — also slow client-side |
| 14:41:59.137 | Request end | 3,104 ms total | OS contributed only 192 ms — rest is client/CPU |

Why this is the cleanest proof of Problem (B)

OS data nodes were healthy at 14:41Z (no alert window). The 1.82s silent gap and the 797ms Redis GET on a 2-byte value cannot be explained by network or remote-service slowness — the only mechanism is Node.js event-loop block delaying async callbacks. Whatever causes Problem (B) is internal to the backend and exists independently of OS load.
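The mechanism is easy to reproduce in isolation (illustrative sketch, not the production handler): any in-flight async reply, however fast the remote side answers, inherits the full duration of a synchronous block because its callback queues behind it.

```javascript
// Sketch of the 797ms-GET-on-2-bytes signature: a 5 ms async "reply"
// (standing in for the Redis GET) is measured client-side at roughly the
// length of a sync CPU burst, because the callback cannot run until the
// event loop yields.
function blockLoop(ms) {
  const start = Date.now();
  while (Date.now() - start < ms) { /* sync CPU work, e.g. a big JSON.parse */ }
}

// Resolves with the client-observed latency of a reply that "arrives" in 5 ms.
function timedFastReply(blockMs) {
  return new Promise((resolve) => {
    const t0 = Date.now();
    setTimeout(() => resolve(Date.now() - t0), 5); // fast remote answer
    blockLoop(blockMs); // the callback above is now stuck behind this
  });
}

timedFastReply(500).then((ms) => {
  console.log(`client-side latency: ${ms} ms`); // ~500 ms, not 5 ms
});
```

This is also why the slowness was uniform across Redis command types: GET, LRANGE, and EXPIRE all queue behind the same blocked loop.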


Evidence — The Slow OS Queries (Problem A)

Both queries 17:37Z + 17:43Z · same shape · expensive Lucene work

Body shape (truncated — the actual must_not.assetId arrays each had 500+ items):

{
  "sort": [{ "price": "asc" }, { "tradeLockDay": "asc" }, { "botIndex": "asc" }],
  "from": "0", "size": 100,
  "query": {
    "bool": {
      "filter": [{ "bool": { "must": [
        { "term": { "groupId": { "value": 35141 } } },
        { "range": { "priceForTrade": { "lt": 4741729 } } },
        { "term": { "onlyStore": false } }
      ]}}],
      "must_not": [
        { "terms": { "assetId": [ /* 500+ asset IDs */ ] } },
        { "terms": { "botIndex": [46,70,108,109,119,418,419,286] } }
      ]
    }
  }
}

Why this query shape spikes data-node CPU

Lucene evaluates each terms entry against the inverted index. A must_not.terms.assetId with 500+ values means each scored doc is checked against 500+ exclusions. Combined with sort + score + range filter on a hot group, it’s the most expensive shape we send to OS. Same shape as the 14:41Z slow trace (different group, different IDs).
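One possible shape of the mitigation (hypothetical helper names and cap value; the real call site is queryCountItemsOnGroup and its controllers): cap the OS-side exclusion list and filter the remainder in the app layer after the search returns.

```javascript
// Hypothetical mitigation sketch — NOT the current backend code.
// Cap the must_not exclusion list sent to OpenSearch and post-filter the
// overflow in the app layer, trading a little over-fetch for cheaper queries.
const MAX_OS_EXCLUSIONS = 100; // assumed cap; tune against real traffic

function buildBrowseQuery(groupId, maxPrice, excludedAssetIds) {
  const osSide = excludedAssetIds.slice(0, MAX_OS_EXCLUSIONS);
  return {
    query: {
      bool: {
        filter: [
          { term: { groupId } },
          { range: { priceForTrade: { lt: maxPrice } } },
        ],
        // Lucene now checks at most 100 exclusions per doc instead of 500+.
        must_not: [{ terms: { assetId: osSide } }],
      },
    },
  };
}

// The overflow (IDs beyond the cap) is dropped after the response, with
// O(1) Set lookups per hit instead of per-doc Lucene work.
function postFilter(hits, excludedAssetIds) {
  const overflow = new Set(excludedAssetIds.slice(MAX_OS_EXCLUSIONS));
  return hits.filter((hit) => !overflow.has(hit.assetId));
}
```

The trade-off: post-filtering can shrink a page below the requested size, so the caller would need to over-fetch by roughly the overflow count to keep pagination stable.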

APM sampling rate is 1% in prod (DD_TRACE_SAMPLE_RATE: '0.01'), so the 2 captured slow queries in 15 minutes likely represent ~200 actual slow queries during that window.
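The extrapolation is straightforward head-based-sampling arithmetic: each captured trace stands in for roughly 1/rate real requests.

```javascript
// Back-of-envelope extrapolation from sampled traces to actual volume.
const SAMPLE_RATE = 0.01;      // DD_TRACE_SAMPLE_RATE in prod
const capturedSlowQueries = 2; // seen at 17:37Z and 17:43Z
const estimatedActual = capturedSlowQueries / SAMPLE_RATE;
console.log(`~${estimatedActual} slow queries in the window`); // ~200
```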


What Was Ruled Out

Confirmed against the data — not contributing causes

| Hypothesis | Verdict | Evidence | Status |
|---|---|---|---|
| DEV-4815 deploy? | NO | Inventory browse path doesn't use queryByAssetIds (the only thing DEV-4815 touched). | RULED OUT |
| Traffic surge? | NO | Backend trace volume flat: 47–77 spans/min through 17:00–17:50. | RULED OUT |
| Redis (server) slowness? | NO | redis.ping.latency_ms 0.2–0.4ms, redis.cpu.user_main_thread 0.1–0.5%, no RDB fork. | RULED OUT |
| Marketplace config payload size? | NO | STRLEN marketplace:configs:enabled:trade = 2 bytes. Payload size is not the reason for the 797ms GET. | RULED OUT |
| JVM heap pressure / GC? | N/A | Backend Node.js heap stable 300–500MB. OS-side JVM heap not observable (no opensearch.* metrics). | PARTIAL |
| data_node_2 idle by accident? | NO | Intentional — concentrated shard placement is a deliberate trade-off (faster bulk inserts). | RULED OUT |

What We Cannot Yet Rule Out

Open questions blocking causal proof

| Question | Why It Matters | What Would Answer It |
|---|---|---|
| Was parser indexing running during 17:30–17:40Z? | If a 12-min lane indexing cycle happened to start at 17:28–17:30Z, it would explain the OS load on the primary shards (data_node_1, data_node_4) without involving query traffic. | Parser PM2 logs / CircleCI run history / direct check on parser host. No parser logs in Datadog. |
| Did OS query latency to backend coincide with the load spike? | If yes, OS slowness → longer awaits on backend → more queued JSON.parse calls → event-loop blocked. If OS replies were still fast despite host CPU, that chain breaks. | Server-side OS slow log (index.search.slowlog.threshold.query.warn: 1s) for the 17:30–17:40Z window. |
| What is the specific JS code blocking the event loop? | The 1.82s silent gap at 14:41Z is the smoking gun for Problem (B). We need to identify which sync code path is responsible. Suspects: JSON.parse of large Redis/OS responses, ramda indexBy on big arrays, sync crypto in middleware. | CPU profile via DD_PROFILING_ENABLED=true for ~1 hour during peak traffic, OR clinic.js flame reproducing the slow path on staging. |
| Is the 500+ must_not:assetIds filter avoidable? | If we can shrink it (e.g., paginate exclusions, post-filter in app, or use a different OS query pattern), Problem (A) gets cheaper per query. | Code review of queryCountItemsOnGroup + the calling controllers. Where does the 500-item exclusion list come from? Reserved + cart items? All bot inventory? |

Observability Gaps Exposed (DEV-4638 still relevant)

Same gaps as the 2026-04-14 incident — and one new one we can fix today

No opensearch.* metrics in Datadog

No bulk thread-pool queue depth, search rejections, JVM heap, GC, segment count, slow query logs. Datadog Agent is running on OS hosts (we get system.* metrics), but the OpenSearch integration is not configured. Setup is per-node opensearch.d/conf.yaml + cluster URL + read-only auth user. Tracked separately.

No parser logs in Datadog

Cannot directly correlate parser indexing cycles with OS spikes from log timestamps — have to infer from metric shapes alone. Same gap as 2026-04-14 incident.

Closed today: runtime.node.* metrics on backend

PR #1246 added DD_RUNTIME_METRICS_ENABLED=true to env_production in ecosystem.config.js. Deployed at 16:55Z. runtime.node.event_loop.delay.max is now flowing — this is the metric that surfaced the 1–3 SECOND event-loop stalls during the alert window. Without it, Problem (B) would have remained invisible.


Recommendations — Two Tracks

Both problems get worked in parallel; neither blocks the other

| Rank | Track | Action | Why |
|---|---|---|---|
| P0 | Both | Configure Datadog Agent OpenSearch integration on all 4 data nodes + coord node | Without opensearch.thread_pool.search.queue, opensearch.indices.search.query_time, slow logs, etc., we cannot prove or disprove parser indexing as a contributor — the same blind spot the 2026-04-14 incident hit. Setup is one opensearch.d/conf.yaml per host pointing at https://localhost:9200 with a read-only user. |
| P0 | B | Enable continuous profiler (DD_PROFILING_ENABLED=true) for ~24h on prod | Now that event_loop.delay.max spikes are visible, the profiler will correlate spikes to specific JS frames (JSON.parse, ramda transforms, sync crypto). 2–5% CPU overhead. One-line change, same file as PR #1246. |
| P1 | A | Audit the 500+ must_not:assetId filter source — can it be capped or moved to app-layer post-filter? | Same query shape recurred in slow traces at 14:41Z and 17:37/17:43Z. Likely the "exclude already-reserved + already-in-cart" set. If we can cap at, e.g., 100 IDs per query and post-filter the rest, OS CPU per query drops materially. |
| P1 | B | Identify owner of the 1.82s silent gap in the 14:41Z trace path (getInventoryData / fetchItems) | Read the handler for any sync JSON.parse on Redis cache hits, sync ramda transforms, or sync crypto. The 500-item must_not array build itself may be a hotspot. Combine with profiler results from the row above. |
| P2 | A | Enable parser log shipping to Datadog (open since 2026-04-14 incident) | Closes the gap that blocks correlating parser bulk cycles with OS spikes. Already a P2 in the 2026-04-14 action plan — restated here because it surfaced again. |
| P2 | B | Once OS metrics are available, build a Datadog dashboard overlaying runtime.node.event_loop.delay.max with opensearch.thread_pool.search.active, opensearch.indices.search.query_time, and backend p99 latency | Visual correlation between OS load, OS query latency, backend event loop, and user-facing latency. Will resolve the causation question on the next incident. |
| P2 | Both | Bump DD_TRACE_SAMPLE_RATE from 0.01 to 0.1 temporarily during active investigation | 1% sampling means most slow traces are invisible. 10% during active investigation makes evidence-gathering 10x faster. Revert after. |

References

Incident owner: Ehud Shahak · Generated 2026-04-15 · Status: OPEN — investigation continuing