Incident Report · April 15, 2026
Two correlated problems — causation NOT proven. User-reported slow unstack on the store page (~8s) and a data-node load alert at 17:38Z (20:38 IDT)
P2 OPEN — Performance Degradation, No Outage
User-visible 8s page load reported · OS data_node_1 alert auto-recovered · Investigation ongoing — missing observability prevents causal proof
One-paragraph summary
Two distinct performance problems were observed on 2026-04-15 within the same hour, and they correlate but causation is NOT proven in either direction. Problem (A): opensearch_cluster_data_node_1 hit sustained CPU 60–87% with load.1 peaking at 5.3 for ~10 minutes (17:30–17:40 UTC), triggering a Datadog monitor alert at 17:38Z. data_node_4 spiked in lockstep; data_node_3 spiked later (17:39–17:43); data_node_2 stayed idle (by design). Problem (B): tradeit-backend Node.js event loop repeatedly stalled for 1–3 SECONDS during the same window AND in unrelated windows earlier in the day. Two captured slow OpenSearch queries during the window were inventory-browse calls with 500+ asset IDs in must_not filters — expensive on Lucene and a likely OS contributor. Backend traffic volume was flat (no traffic surge). Earlier 14:41Z slow trace shows event-loop block of 1.82s while OS was healthy — proves Problem (B) exists independently. The fix path is two parallel tracks, not one.
Both problems on the same day — only one was inside the alert window

| Time (UTC) | Type | Event |
|---|---|---|
| — | DEPLOY | Deploy 5ee632c8 |
| — | DEPLOY | Deploy 18f3f1bd — tradePurchase / store / imageScreenshot |
| 14:41Z | EVIDENCE | GET /api/v2/inventory/data. OS healthy (queries 110ms). 1.82s silent gap = JS event-loop block on backend. Trace 69dfa3b4…. |
| 16:55Z | DEPLOY | Runtime metrics (DD_RUNTIME_METRICS_ENABLED) prod deploy — e7a45685 |
| 17:30–17:40Z | PEAK | load.1 4.6–5.3, CPU 60–87%. Backend event_loop.delay.max hits 2.7s, 3.1s, 2.5s. |
| 17:38Z | ALERT | Datadog monitor fired: opensearch_cluster_data_node_1 (load > 1.0 normalized) |
| after 17:40Z | EVIDENCE | event_loop.delay.max spikes to 3.2s, 3.1s, 1.2s — continues AFTER OS recovered |
4-node cluster, c6a.xlarge (4 vCPU), all in eu-west-1c
Peak load.1 per node. Threshold: 4.0 on a 4-vCPU box (= full saturation).
Pattern of node_1 + node_4 spike together, then node_3 is consistent with search-side replicated load on the active replica set. Less consistent with bulk indexing (which would pin a primary). However, lacking opensearch.indexing.* metrics, indexing as a contributor cannot be ruled out.
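For reference, the relationship between the 4.0 raw-load threshold and the normalized monitor that fired, as a minimal arithmetic sketch (values taken from the chart data above):

```javascript
// Normalized load = raw load.1 divided by vCPU count; the Datadog
// monitor (system.load.norm.1 > 1) fires above 1.0, which on a
// 4-vCPU box is the same condition as raw load.1 > 4.0.
const vcpus = 4;        // c6a.xlarge data nodes
const peakLoad1 = 5.3;  // data_node_1 peak during 17:30–17:40Z
const normalized = peakLoad1 / vcpus; // ≈ 1.33
console.log(normalized > 1.0); // true: monitor fires
```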
Both real. Causation between them not proven by available data.
Problem (A): OpenSearch CPU Saturation

- load.1 4.6–5.3 for ~7 min (17:33–17:40)
- Query shape implicated: must_not:assetId (see the captured slow queries below)

Problem (B): Backend Event-Loop Stalls

- redis.command GET marketplace:configs:enabled:trade slow on a 2-byte payload — only explainable by an event-loop block delaying the reply callback
- runtime.node.event_loop.delay.max repeatedly hit 1–3 SECONDS
- Datadog APM trace 69dfa3b4000000001ba14667da4ff95c · 3.1s on GET /api/v2/inventory/data
URL: ?context=trade&gameId=730&groupId=117628&isForStore=0&limit=500&offset=100 — an unstack/page-2 browse of a popular CSGO group. Span tree:
| Time (UTC) | Span | Duration | Notes |
|---|---|---|---|
| 14:41:56.033 | Request start (Express) | — | Trace begins |
| 14:41:56.033 | redis.command GET sess:… | 18 ms | Session lookup — healthy |
| 14:41:56.053 | redis.command GET marketplace:configs:enabled:trade | 797 ms | 2-byte payload per STRLEN — impossible over network. Client-side wait. |
| 14:41:56.850 → 14:41:58.672 | SILENT GAP — no spans | 1,820 ms | JS event loop blocked. CPU work, no async I/O. |
| 14:41:58.672 | redis.command LRANGE reserved-assets_730 | 56 ms | Reserved-assets fetch |
| 14:41:58.730 | HTTP POST /inventory_730/_search | 82 ms | OS query — fast |
| 14:41:58.820 | HTTP POST /inventory_730/_search (count agg) | 110 ms | OS count query — fast |
| 14:41:58.932 | redis.command EXPIRE sess:… | 198 ms | Session expire write — also slow client-side |
| 14:41:59.137 | Request end | 3,104 ms total | OS contributed only 192 ms — rest is client/CPU |
Why this is the cleanest proof of Problem (B)
OS data nodes were healthy at 14:41Z (no alert window). The 1.82s silent gap and the 797ms Redis GET on a 2-byte value cannot be explained by network or remote-service slowness — the only mechanism is Node.js event-loop block delaying async callbacks. Whatever causes Problem (B) is internal to the backend and exists independently of OS load.
Both queries 17:37Z + 17:43Z · same shape · expensive Lucene work
Body shape (truncated — the actual must_not.assetId arrays each had 500+ items):
```json
{
  "sort": [{ "price": "asc" }, { "tradeLockDay": "asc" }, { "botIndex": "asc" }],
  "from": "0", "size": 100,
  "query": {
    "bool": {
      "filter": [{ "bool": { "must": [
        { "term": { "groupId": { "value": 35141 } } },
        { "range": { "priceForTrade": { "lt": 4741729 } } },
        { "term": { "onlyStore": false } }
      ] } }],
      "must_not": [
        { "terms": { "assetId": [ /* 500+ asset IDs */ ] } },
        { "terms": { "botIndex": [46, 70, 108, 109, 119, 418, 419, 286] } }
      ]
    }
  }
}
```
Why this query shape spikes data-node CPU
Lucene evaluates each terms entry against the inverted index. A must_not.terms.assetId with 500+ values means each scored doc is checked against 500+ exclusions. Combined with sort + score + range filter on a hot group, it’s the most expensive shape we send to OS. Same shape as the 14:41Z slow trace (different group, different IDs).
APM sampling rate is 1% in prod (DD_TRACE_SAMPLE_RATE: '0.01'), so 2 captured slow queries in 15 minutes likely represents ~200 actual slow queries during that window.
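A hedged sketch of what the proposed "cap and post-filter" mitigation could look like. Function, parameter, and field names are illustrative, not the real queryCountItemsOnGroup, and the 100-ID cap is the figure floated in the action plan below, not a tuned value:

```javascript
// Cap the must_not exclusion list sent to OpenSearch and filter the
// remainder app-side: trades a slightly larger fetch for much cheaper
// per-query Lucene work on the data nodes.
const MAX_EXCLUSIONS_IN_QUERY = 100; // illustrative cap

function buildBrowseQuery({ groupId, maxPrice, excludedAssetIds, size = 100 }) {
  const inQuery = excludedAssetIds.slice(0, MAX_EXCLUSIONS_IN_QUERY);
  const appSide = new Set(excludedAssetIds.slice(MAX_EXCLUSIONS_IN_QUERY));
  return {
    body: {
      size: size + appSide.size, // over-fetch to cover app-side removals
      query: {
        bool: {
          filter: [
            { term: { groupId: { value: groupId } } },
            { range: { priceForTrade: { lt: maxPrice } } },
          ],
          must_not: [{ terms: { assetId: inQuery } }],
        },
      },
    },
    // Remove the uncapped exclusions after the response comes back.
    postFilter: (hits) => hits.filter((h) => !appSide.has(h._source.assetId)),
  };
}

// 500 exclusions in, only 100 reach OpenSearch.
const q = buildBrowseQuery({
  groupId: 35141,
  maxPrice: 4741729,
  excludedAssetIds: Array.from({ length: 500 }, (_, i) => String(i)),
});
console.log(q.body.query.bool.must_not[0].terms.assetId.length); // 100
```

Note the over-fetch caveat: pagination with app-side filtering shifts offsets between pages, so this only sketches the capping idea, not a drop-in fix.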
Confirmed against the data — not contributing causes (RULED OUT)

| Hypothesis | Verdict | Evidence |
|---|---|---|
| DEV-4815 deploy? | NO | Inventory browse path doesn’t use queryByAssetIds (the only thing DEV-4815 touched). |
| Traffic surge? | NO | Backend trace volume flat: 47–77 spans/min through 17:00–17:50. |
| Redis (server) slowness? | NO | redis.ping.latency_ms 0.2–0.4ms, redis.cpu.user_main_thread 0.1–0.5%, no RDB fork. |
| Marketplace config payload size? | NO | STRLEN marketplace:configs:enabled:trade = 2 bytes. Payload size is not the reason for the 797ms GET. |
| JVM heap pressure / GC? | N/A | Backend Node.js heap stable 300–500MB. OS-side JVM heap not observable (no opensearch.* metrics). |
| Data-node-2 idle by accident? | NO | Intentional — concentrated shard placement is a deliberate trade-off (faster bulk inserts). |
Open questions blocking causal proof
| Question | Why It Matters | What Would Answer It |
|---|---|---|
| Was parser indexing running during 17:30–17:40Z? | If a 12-min lane indexing cycle happened to start at 17:28–17:30Z, it would explain the OS load on the primary shards (data_node_1, data_node_4) without involving query traffic. | Parser PM2 logs / CircleCI run history / direct check on parser host. No parser logs in Datadog. |
| Did OS query latency to backend coincide with the load spike? | If yes, OS-slowness → longer awaits on backend → more queued JSON.parse calls → event-loop blocked. If OS replies were still fast despite host CPU, that chain breaks. | Server-side OS slow log (index.search.slowlog.threshold.query.warn: 1s) for the 17:30–17:40Z window. |
| What is the specific JS code blocking the event loop? | The 1.82s silent gap at 14:41Z is the smoking gun for Problem (B). We need to identify which sync code path is responsible — suspects: JSON.parse of large Redis/OS responses, ramda indexBy on big arrays, sync crypto in middleware. | CPU profile via DD_PROFILING_ENABLED=true for ~1 hour during peak traffic, OR a clinic.js flame graph reproducing the slow path on staging. |
| Is the 500+ must_not:assetIds filter avoidable? | If we can shrink it (e.g., paginate exclusions, post-filter in app, or use a different OS query pattern), Problem (A) gets cheaper per query. | Code review of queryCountItemsOnGroup + the calling controllers. Where does the 500-item exclusion list come from? Reserved+cart items? All bot inventory? |
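The slow-log question could be answered with a one-time dynamic settings change. A sketch, assuming admin-capable credentials and local access on a node; the endpoint and auth details are illustrative, while the setting name and 1s threshold are the ones quoted in the table:

```shell
# Enable the search slow log on the inventory index so server-side query
# latency during a future spike window becomes observable.
curl -s -X PUT "https://localhost:9200/inventory_730/_settings" \
  -H "Content-Type: application/json" \
  -d '{"index.search.slowlog.threshold.query.warn": "1s"}'
```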
Same gaps as the 2026-04-14 incident — and one new one we can fix today
No opensearch.* metrics in Datadog
No bulk thread-pool queue depth, search rejections, JVM heap, GC, segment count, slow query logs. Datadog Agent is running on OS hosts (we get system.* metrics), but the OpenSearch integration is not configured. Setup is per-node opensearch.d/conf.yaml + cluster URL + read-only auth user. Tracked separately.
No parser logs in Datadog
Cannot directly correlate parser indexing cycles with OS spikes from log timestamps — have to infer from metric shapes alone. Same gap as 2026-04-14 incident.
Closed today: runtime.node.* metrics on backend
PR #1246 added DD_RUNTIME_METRICS_ENABLED=true to env_production in ecosystem.config.js. Deployed at 16:55Z. runtime.node.event_loop.delay.max is now flowing — this is the metric that surfaced the 1–3 SECOND event-loop stalls during the alert window. Without it, Problem (B) would have remained invisible.
Both problems get worked in parallel; neither blocks the other
| Rank | Track | Action | Why |
|---|---|---|---|
| P0 | Both | Configure Datadog Agent OpenSearch integration on all 4 data nodes + coord node | Without opensearch.thread_pool.search.queue, opensearch.indices.search.query_time, slow logs, etc., we cannot prove or disprove parser-indexing as a contributor — the same blind spot the 2026-04-14 incident hit. Setup is one opensearch.d/conf.yaml per host pointing at https://localhost:9200 with a read-only user. |
| P0 | B | Enable continuous profiler (DD_PROFILING_ENABLED=true) for ~24h on prod | Now that event_loop.delay.max spikes are visible, the profiler will correlate spikes to specific JS frames (JSON.parse, ramda transforms, sync crypto). 2–5% CPU overhead. One-line change, same file as PR #1246. |
| P1 | A | Audit the 500+ must_not:assetId filter source — can it be capped or moved to app-layer post-filter? | Same query shape recurred in slow traces at 14:41Z and 17:37/17:43Z. Likely the "exclude already-reserved + already-in-cart" set. If we can cap at, e.g., 100 IDs per query and post-filter the rest, OS CPU per query drops materially. |
| P1 | B | Identify owner of the 1.82s silent gap in the 14:41Z trace path (getInventoryData / fetchItems) | Read the handler for any sync JSON.parse on Redis cache hits, sync ramda transforms, or sync crypto. The 500-item must_not array build itself may be a hotspot. Combine with profiler results from the row above. |
| P2 | A | Enable parser log shipping to Datadog (open since 2026-04-14 incident) | Closes the gap that blocks correlating parser bulk cycles with OS spikes. Already a P2 in the 2026-04-14 action plan — restated here because it surfaced again. |
| P2 | B | Once OS metrics are available, build a Datadog dashboard overlaying runtime.node.event_loop.delay.max with opensearch.thread_pool.search.active, opensearch.indices.search.query_time, and backend p99 latency | Visual correlation between OS load, OS query latency, backend event loop, and user-facing latency. Will resolve the causation question on the next incident. |
| P2 | Both | Bump DD_TRACE_SAMPLE_RATE from 0.01 to 0.1 temporarily during active investigation | 1% sampling means most slow traces are invisible. 10% during active investigation makes evidence-gathering 10x faster. Revert after. |
Key artifacts

- Monitor: system.load.norm.1 > 1 (the alert that fired at 17:38Z)
- Trace: 69dfa3b4000000001ba14667da4ff95c — the 14:41Z 3.1s slow inventory browse
- Config: DD_RUNTIME_METRICS_ENABLED (the change that made Problem (B) visible)

Incident owner: Ehud Shahak · Generated 2026-04-15 · Status: OPEN — investigation continuing