INCIDENT · 2026-05-04 · 07:13–07:56 UTC
data-node-4 overflowed · 545 rejections · P3 · follow-up to 2026-04-15
What happened, in three sentences
Search threads on data-node-4 saturated at 7/7 active during the morning peak; the queue hit its 1,000-entry cap at 07:46–47Z and emitted ~545 rejections. data-node-1 ran hot but did not overflow.
Root cause: a ~950-item must_not.assetId filter
on the inventory/group search — every shard of every fan-in index runs the terms filter for every page.
Compounded by JS-side gzip decompression of the responses (libuv pool overload) and JSON.parse
of the large decompressed bodies blocking the event loop.
The fix already exists as
PR #1284 (Apr 27, OPEN, BLOCKED 7 days);
it disables OS-client gzip, raises UV_THREADPOOL_SIZE, spreads coord traffic across nodes,
and caps public-endpoint amplification. See Pre-Existing Fix below.
Per-node search thread-pool, 07:30–08:00 UTC
| Node | Peak Queue | Peak Active | Rejections | State |
|---|---|---|---|---|
| data-node-4 | 1000 (cap) | 7 / 7 | 545.65 | overflow |
| data-node-1 | 755 | 7 / 7 | 0 | hot, no overflow |
| data-node-3 | 3 | ~1 | 0 | nominal |
| data-node-2 | 0 | 0 | 0 | idle by design |
| coord-node-1 | 3 | <1 | 0 | nominal |
| master ×3 / index ×1 | 0 | 0 | 0 | nominal |
Source: elasticsearch.thread_pool.search.{queue,active,rejected} by node_name
Representative spans during the saturation window
| Time (UTC) | Duration | Resource | must_not.assetId |
|---|---|---|---|
| 07:57:12 | 1.70s | POST /group_730,group_730_uuskins/_search | ~950 |
| 07:57:03 | 2.50s | POST /inventory_730,inventory_730_uuskins/_search | ~950 |
Both queries hit coord-node-1 with multi-index fan-in across the uuSkins lane and the own-inventory index. Body shape:
```jsonc
// inventory search · 2.50s · ~950 must_not terms
{
  "from": "0",
  "size": 100,
  "query": {
    "bool": {
      "filter": [
        { "bool": { "must": [{ "term": { "groupId": 43315 } }] } }
      ],
      "must_not": [
        { "terms": { "assetId": [ /* ~950 ids */ ] } }
      ],
      "should": [
        { "range": { "floatValue": { "gte": 0, "lt": 1 } } }
      ],
      "must": [ /* saleOfferOwner / onlyStore branches */ ]
    }
  }
}
```
Cost multipliers:
- The terms.assetId filter runs once per shard, per index, per page.
- Multi-index fan-in (inventory_730 + inventory_730_uuskins — own-inventory + uuSkins lane) multiplies shard-level filter evaluations per request.
- JSON.parse of the response runs on the main thread — a synchronous, all-or-nothing call that blocks the event loop while it runs. DD Profiler: Event Loop Blocking, 1.59s longest run.
- The cluster shares an AZ with the backend (eu-west-1c), so gzip saves zero meaningful bandwidth — the cost is paid every request for no payoff.

Apr 16 10:33Z → Apr 17 09:14Z · DD_PROFILING_ENABLED on for 22.7h via PR #1250 → reverted by PR #1251
Flame graphs / hot-frame attribution from the Datadog Profiling product are not exposed
through APM/Metrics MCP tools — they live behind app.datadoghq.com/profiling.
Asking Nguyen, or checking the UI directly for service:tradeit-backend env:prod over that
window, is the only way to recover the actual hot frames.
That said, runtime metrics flowed for the entire window
(PR #1246 enabled DD_RUNTIME_METRICS_ENABLED the day before).
These give a quantitative reconstruction of what the backend was doing — even without flame graphs.
| GC type | Avg pause | Worst pause | Notes |
|---|---|---|---|
| mark-sweep-compact | 22.2ms | 68.5ms | Stop-the-world; main blocker on the loop |
| scavenge (young-gen) | 3.7ms | 4.2ms | Frequent, small, expected |
| incremental-marking | 2.8ms | 4.9ms | Background, low-cost |
| Hour bin (UTC) | opensearch.query spans >1s | p95 | Max |
|---|---|---|---|
| Apr 16 13:00 | 9 | 1.31s | 1.69s |
| Apr 16 14:00 | 16 | 1.87s | 2.20s |
| Apr 17 07:00 | 9 | 4.79s | 4.82s |
All other hourly buckets in the 22.7h window had zero spans >1s.
Hypothesis: the must_not filter array build/serialization is itself the JS hot frame (that would tie OS saturation and event-loop stalls into one root cause).
Open the Datadog Profiling page for service:tradeit-backend env:prod
and filter to Apr 16 10:33Z → Apr 17 09:14Z. Specifically:
- Hot frames in JSON.stringify, ramda indexBy, or array spread / filter on a ~1k-item array — that's the must_not.assetId build.

9 slow OS spans on Apr 17 06:55–07:10Z, all firing within ~500ms
Pulled the slowest opensearch.query spans from inside the profiler-on window. They tell the same story
as May 4 — at roughly ¼ the scale:
| Time (UTC) | Duration | groupId | PM2 worker | must_not.assetId |
|---|---|---|---|---|
| 07:04:37.220 | 4.82s | 37212 | b07a8320 | ~210 |
| 07:04:36.779 | 4.78s | 312563 | f2f87e58 | ~210 |
| 07:04:36.942 | 4.68s | 63840 | ee92caf6 | ~210 |
| 07:04:36.702 | 4.58s | 35569 | 9dc5d3a7 | ~210 |
| 07:04:36.744 | 4.49s | 316852 | 2f8f08b7 | ~210 |
| 07:04:36.739 | 4.46s | 117932 | 2f8f08b7 | ~210 |
| + 3 more in the same window | | | | |
What the Apr 17 spans show:
- All nine spans landed on coord-node-1 — the single-coordinator bottleneck. Exactly what PR #1284's OPENSEARCH_COORDINATION_NODE_URLS round-robin solves.
- They queried only inventory_730 (commit 678e105c, pre-DEV-4815). May 4 is post-DEV-4815 — multi-index fan-in (inventory_730 + inventory_730_uuskins) layered on top of the same growth.
- Multiple PM2 workers fired near-simultaneous inventory/data?groupId=X requests (one per visible group on the page), each carrying the same user's reserved-set. Thundering herd: parallel page-fetch × ~210 must_not terms × gzipped response × blocking JSON.parse → libuv overflow; see the back-of-envelope below.
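A minimal back-of-envelope on the shard-level amplification. The shard count below is an assumption for illustration (it was not pulled from the cluster); only the index count and term count come from the incident data:

```js
// Rough cost model for one May 4 page request. shardsPerIndex is ASSUMED;
// fanInIndexes and mustNotTerms come from the incident data above.
const fanInIndexes = 2;   // inventory_730 + inventory_730_uuskins (post-DEV-4815)
const shardsPerIndex = 5; // assumption — substitute the real primary-shard count
const mustNotTerms = 950; // observed size of the must_not.assetId array on May 4

// The ~950-term filter is shipped to, and rebuilt on, every shard of every
// fan-in index for every page, so per-page work scales as:
const termFilterBuilds = fanInIndexes * shardsPerIndex;   // 10 shard-level builds
const termsEvaluated = termFilterBuilds * mustNotTerms;   // 9,500 terms handled
console.log({ termFilterBuilds, termsEvaluated });
```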
Datadog correlates traces↔profiles by runtime-id + time. Opening a slow Apr 17 trace
in the UI may surface an embedded "Profiling" tab with the flame graph for that exact request,
if the profile event for the worker's runtime-id survived retention:
In the trace detail, look for the "Profiling" tab on the right rail or the "Code Hotspots" view. If neither appears, the linked profile has aged out.
DD APM Flow Map · 49.3k traces sampled
coord-node-1.dns.a-record handles 0.7 hits/s of tradeit-backend's 1.03 hits/s total —
~68% of all backend egress hits one node.
This is the visual case for PR #1284's multi-node OS client in one number. Every other downstream
(api.steampowered.com, graph.facebook.com, api.nowpayments.io,
api2.amplitude.com, api-02.moengage.com, cdp.faceit.com,
openapi.cs2dt.com) is below 0.04 hits/s — rounding error compared to the OS coord traffic.
2.10 hits/s of unattributed callers — likely internal cron/bots/admin-backend/old-tradeit/parser running without the APM tracer or with mis-tagged service names. Their error rate (0.12%) is ~100× that of named-service traffic. Possible retry storm or repeated 4xx from one of these clients amplifying load.
0.03 hits/s from a private-subnet caller. Same VPC, named only by IP because the source isn't running the DD tracer. Worth identifying — the high error rate could mean a misconfigured client retrying in a tight loop against the backend.
Two marketplaces, two different roles in the data:
- openapi.cs2dt.com appears in the service map at <0.01 hits/s exactly because no traffic is being sent to its API.
- uuSkins: inventory_730_uuskins / group_730_uuskins are the multi-index fan-in target the May 4 saturation queries hit alongside the own-inventory index.
So the uuSkins lane is the one currently amplifying OS query cost via fan-in — every page request fans out to inventory_730 + inventory_730_uuskins. Turning off uuSkins-lane reads in OS (without disabling the marketplace product-side) would halve the per-request shard-level filter evaluations. Worth noting as a separately tunable dial; a sketch of the dial follows.
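Illustratively, the dial is just the index list handed to the search. The index names are real (from the spans above); the flag, function, and call shape below are hypothetical:

```js
// Hypothetical read-side dial — index names are real, everything else illustrative.
const UUSKINS_READS_ENABLED = process.env.UUSKINS_OS_READS !== 'false';

function inventorySearch(osClient, query) {
  const index = UUSKINS_READS_ENABLED
    ? 'inventory_730,inventory_730_uuskins' // current: fan-in across both lanes
    : 'inventory_730';                      // reads off: half the shard-level filter builds
  return osClient.search({ index, body: query });
}
```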
Findings from the Apr 16–17 profiler diagnostic — captured Apr 27, blocked ever since
On 2026-04-27, the Apr 16–17 profiler findings were turned into
PR #1284 — fix(api): cap public limits, disable OS gzip, multi-node OS client
on branch hotfix/os-perf-and-limits from master. The PR is
OPEN, BLOCKED, never merged — 2 failing checks
(claude-review, renovate-check), 0 human reviews.
The same patterns this PR is designed to mitigate hit prod again on May 4 — 7 days later.
"data-node-4 has been pegged at ~100% CPU during peaks; profiler showed
event-loop blocking on JSON.parse of large gzipped OS responses
+ libuv pool overload from gzip decompression. coord-node-1 was also at 100%
(single coordinator bottleneck). Several public endpoints accepted unbounded limit/array
sizes letting any client amplify pressure."
| Insight | Severity | Worst case | Last seen |
|---|---|---|---|
| Event Loop Blocking | MEDIUM | 1.59s longest run | ~17 days ago |
| Burst of High CPU Usage | MEDIUM | 16s burst · 24.54% avg | ~18 days ago |
| Libuv Thread Pool Overload | LOW | 83.25ms · 5 tasks / 4 threads | ~17 days ago |
All three insights "Last seen" line up with the Apr 16–17 profiler-on window.
| Area | Change | Purpose |
|---|---|---|
| OS client | compression: false + suggestCompression: false | Frees libuv pool from gzip decompression on internal VPC traffic; cuts coord-node-1 CPU |
| OS client | OPENSEARCH_COORDINATION_NODE_URLS (CSV) for client-side round-robin | Spreads coord traffic across coord+data nodes; relieves single-coord bottleneck |
| ecosystem.config.js | UV_THREADPOOL_SIZE: '16' | Default 4 was saturating on gzip + DNS + crypto + fs |
| queryByItemId | Clamp limit to [1, 100] | Was unbounded from public /search-trade |
| middleware | New validateLimit + validateArrayBody | Public-endpoint limit guardrails |
| /api/v2/inventory/search-trade | items[] ≤ 50 | Cap user amplification |
| /api/v2/inventory/items-prices | assetIds[] / groupIds[] ≤ 500 | Cap user amplification |
| /api/v2/sell/soldItems & /payoutHistories | limit ≤ 100, offset ≤ 30 | Cap user amplification |
| MAX_LIMIT_INVENTORY_API | 500 → 150 | Frontend already defaults to 150; the 500 ceiling let any client request 3× more than ever renders |
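The middleware rows translate to something like the sketch below. The names validateLimit / validateArrayBody and the caps come from the PR table; the implementation itself is illustrative (assumes Express), not the PR's actual code:

```js
// Illustrative Express-style guardrails — NOT the literal PR #1284 diff.
function validateLimit({ min = 1, max = 100 } = {}) {
  return (req, res, next) => {
    const limit = Number(req.query.limit ?? max);
    if (!Number.isInteger(limit) || limit < min || limit > max) {
      return res.status(400).json({ error: `limit must be an integer in [${min}, ${max}]` });
    }
    req.query.limit = limit; // clamped value flows through to the handler
    next();
  };
}

function validateArrayBody(field, maxLen) {
  return (req, res, next) => {
    const arr = req.body && req.body[field];
    if (!Array.isArray(arr) || arr.length > maxLen) {
      return res.status(400).json({ error: `${field} must be an array of ≤ ${maxLen} items` });
    }
    next();
  };
}

// e.g. router.post('/api/v2/inventory/search-trade', validateArrayBody('items', 50), handler);
```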
The OpenSearch cluster is in the same VPC and AZ as the backend (eu-west-1c, private addresses).
Bandwidth between EC2 instances inside an AZ is effectively free; gzip's only payoff is wire-bytes
savings, which we don't need. The cost — JS-side JSON.parse of the freshly decompressed
response (event-loop blocking) and zlib decompression on libuv worker threads (pool overload) —
is paid on every single request. Disabling gzip on the internal client is a pure-positive trade; a sketch of the resulting client shape follows.
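For reference, the client side might look like the sketch below. The compression / suggestCompression options and the env var name come from the PR table above; the surrounding code assumes the @opensearch-project/opensearch client and is illustrative, not the PR's literal diff:

```js
// Sketch mirroring PR #1284's described OS-client changes.
const { Client } = require('@opensearch-project/opensearch');

// CSV of coord + data node URLs → client-side round-robin instead of one coordinator.
const nodes = (process.env.OPENSEARCH_COORDINATION_NODE_URLS || '')
  .split(',')
  .map((u) => u.trim())
  .filter(Boolean);

const osClient = new Client({
  nodes,
  compression: false,        // same-AZ VPC: gzip buys nothing, zlib costs libuv threads
  suggestCompression: false, // don't invite OS to gzip responses either
});

// Plus, per the PR, in ecosystem.config.js (PM2): env: { UV_THREADPOOL_SIZE: '16' } —
// the default of 4 was saturating on gzip + DNS + crypto + fs.
```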
Remediation options
Land PR #1284 — the baseline; Options A–C below are in addition
How: investigate the claude-review + renovate-check failures (likely fixable / soft); rebase onto current master; re-run CI; merge; deploy. The PR's own test plan includes verifying "Libuv Thread Pool Overload profiler insight clears" after deploy.
Pros: already written, already reviewed-by-Claude, already targets the exact symptoms seen May 4. No new design.
Cons: 7 days stale — needs rebase + a fresh CI run. Doesn't directly shrink the 950-item must_not filter (Option A below); it makes the existing filter cheaper to ship over the wire and parse.
Option A · Replace the ~950-term must_not with an indexed reservation flag
How: add an indexed boolean (isReserved / isLocked) maintained by the parser on reservation events. Replace the 950-term filter with a single term predicate. Alternative: terms-lookup against a small per-user reservations_<userId> doc.
Pros: kills root cost. No OpenSearch restart. Benefits every node including the idle one indirectly via lower coordinator latency.
Cons: requires parser + backend change + deploy. Reservation event plumbing.
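In query terms, a sketch of Option A. Field and doc names follow the option text (isReserved, reservations_<userId>); the terms-lookup shape is standard OpenSearch DSL, but nothing here is the final design:

```js
const reservedAssetIds = [/* ~950 ids, e.g. from Redis */];
const userId = 42; // example

// Before (today): ~950-entry array rebuilt, serialized, and shipped per request.
const before = {
  must_not: [{ terms: { assetId: reservedAssetIds } }],
};

// Option A: one indexed boolean maintained by the parser on reservation events.
const flagged = {
  must_not: [{ term: { isReserved: true } }],
};

// Alternative: terms lookup — the id list lives in a small per-user OS doc
// (index and path names illustrative) instead of in every request body.
const lookup = {
  must_not: [{
    terms: {
      assetId: { index: 'reservations', id: `reservations_${userId}`, path: 'assetIds' },
    },
  }],
};
```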
Option B · Raise thread_pool.search.queue_size on data-node-1 + data-node-4
How: edit opensearch.yml on the affected nodes, 1000 → 2000, rolling restart.
Pros: no app deploy.
Cons: static setting → rolling restart per node, cluster yellow each time. Doesn't fix root cause; delays next overflow. Longer queue = longer tail latencies + heap pressure (queued tasks hold their payloads).
Option C · Truncate the must_not filter and post-filter in the app
How: drop the oldest reservations from the filter, post-filter server-side after the OS response.
Pros: ships in one PR, no schema change. Fastest band-aid.
Cons: changes pagination semantics — pages may return fewer than size after post-filter; users may briefly see their own reserved items before the post-filter trims them.
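A sketch of the band-aid. The cap constant and helper names are hypothetical; only the trim-oldest / post-filter behavior comes from the option text:

```js
// Hypothetical cap — the real number would come from load testing.
const MAX_MUST_NOT_TERMS = 200;

function splitReservations(reservedAssetIds) {
  // Keep the newest N in the OS filter; remember the rest for app-side trimming.
  const inFilter = reservedAssetIds.slice(-MAX_MUST_NOT_TERMS);
  const trimmedOut = new Set(reservedAssetIds.slice(0, -MAX_MUST_NOT_TERMS));
  return { inFilter, trimmedOut };
}

function postFilter(hits, trimmedOut) {
  // Pages can come back shorter than `size` after this trim — the Cons above.
  return hits.filter((hit) => !trimmedOut.has(hit._source.assetId));
}
```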
Code pointers: server/controllers/inventory.js:436 · server/redis/reserveAssetRedis.js · server/service/openSearchService.js:500

drafted 2026-05-04 · pre-decision · tradeit.gg