INCIDENT · 2026-05-04 · 07:13–07:56 UTC

OS Search-Pool Saturation

data-node-4 overflowed · 545 rejections · P3 · follow-up to 2026-04-15

TL;DR

What happened, in three sentences

Search threads on data-node-4 ran 7/7 saturated through the morning peak; the queue hit its 1000-task cap at 07:46–47Z and the node emitted ~545 rejections. data-node-1 ran hot but did not overflow.

Root cause: a ~950-item must_not.assetId filter on the inventory/group search — every shard of every fan-in index evaluates the terms filter for every page. Compounded on the Node.js side by gzip decompression of the responses on libuv worker threads (pool overload) and JSON.parse of the large decompressed bodies blocking the event loop.

The fix already exists as PR #1284 (Apr 27, OPEN, BLOCKED 7 days); it disables OS-client gzip, raises UV_THREADPOOL_SIZE, spreads coord traffic across nodes, and caps public-endpoint amplification. See Pre-Existing Fix below.

Headline Numbers

  • Peak queue (cap): 1000
  • Rejections: 545
  • must_not.assetId terms: ~950
  • Slowest OS span: 2.5s
  • Spans >1s per 10 min: 363

Saturation Map

Per-node search thread-pool, 07:30–08:00 UTC

Node | Peak Queue | Peak Active | Rejections | State
data-node-4 | 1000 (cap) | 7 / 7 | 545.65 | overflow
data-node-1 | 755 | 7 / 7 | 0 | hot, no overflow
data-node-3 | 3 | ~1 | 0 | nominal
data-node-2 | 0 | 0 | 0 | idle by design
coord-node-1 | 3 | <1 | 0 | nominal
master ×3 / index ×1 | 0 | 0 | 0 | nominal

Source: elasticsearch.thread_pool.search.{queue,active,rejected} by node_name

Queue Depth at Peak

data-node-4: 1000 (cap) · data-node-1: 755 · coord-node-1: 3 · data-node-3: 3 · data-node-2: 0

Slow Spans (APM)

Representative spans during the saturation window

Time (UTC) | Duration | Resource | must_not.assetId
07:57:12 | 1.70s | POST /group_730,group_730_uuskins/_search | ~950
07:57:03 | 2.50s | POST /inventory_730,inventory_730_uuskins/_search | ~950

Both queries hit coord-node-1 with multi-index fan-in across the uuSkins lane and the own-inventory index. Body shape:

// inventory search · 2.50s · ~950 must_not terms
{
  "from": "0", "size": 100,
  "query": { "bool": {
    "filter": [{ "bool": { "must": [{ "term": { "groupId": 43315 }}] } }],
    "must_not": [{ "terms": { "assetId": [ /* ~950 ids */ ] }}],
    "should":  [{ "range": { "floatValue": { "gte": 0, "lt": 1 } }}],
    "must":    [/* saleOfferOwner / onlyStore branches */]
  }}
}

Why It Hurts

OpenSearch side

Every page of the inventory/group search ships the ~950-item must_not.assetId terms filter to every shard of every index in the fan-in (inventory_730 + inventory_730_uuskins), so per-request filter cost scales with terms × shards × indices × pages, and all of it lands on the already-hot coordinator.

Backend (Node.js) side — amplifies the same request

Each gzipped OS response is decompressed on libuv worker threads (default pool of 4) and then JSON.parse'd on the event loop; with several per-group requests in flight at once, the pool overflows and the loop stalls. These are the same Libuv Thread Pool Overload and Event Loop Blocking patterns listed as active profiler insights under PR #1284 below.
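
A minimal sketch of that per-response cost, purely illustrative (in production the decompression and parsing happen inside the OpenSearch client, not in a hand-written handler like this):

// illustrative only: where the per-response work lands when gzip is on
const zlib = require('zlib');

function parseGzippedOsResponse(gzippedBuffer) {
  return new Promise((resolve, reject) => {
    // 1) gunzip runs on a libuv worker thread (default pool size 4);
    //    several parallel per-group searches saturate the pool
    zlib.gunzip(gzippedBuffer, (err, raw) => {
      if (err) return reject(err);
      // 2) JSON.parse of the multi-MB body is synchronous on the event loop;
      //    no other request, timer, or callback runs until it finishes
      resolve(JSON.parse(raw.toString('utf8')));
    });
  });
}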

Profiler-Window Reconstruction

Apr 16 10:33Z → Apr 17 09:14Z · DD_PROFILING_ENABLED on for 22.7h via PR #1250 → reverted by PR #1251

Flame graphs / hot-frame attribution from the Datadog Profiling product are not exposed through APM/Metrics MCP tools — they live behind app.datadoghq.com/profiling. Asking Nguyen, or checking the UI directly for service:tradeit-backend env:prod over that window, is the only way to recover the actual hot frames.

That said, runtime metrics flowed for the entire window (PR #1246 enabled DD_RUNTIME_METRICS_ENABLED the day before). These give a quantitative reconstruction of what the backend was doing — even without flame graphs.

Headline numbers from the profiler window

  • Worst event-loop stall: 957ms
  • Sustained max-stall (avg): 75ms
  • Peak loop utilization: 0.72
  • Worst GC pause: 68.5ms
  • Peak RSS: 1.50GB (75% of cap)

Hot-spot timeline (event-loop max stall, by 68-min bin)

16:13–17:21Z: 957ms ← worst
17:21–18:29Z: 260ms
03:31–04:39Z: 231ms
15:06–16:13Z: 178ms
13:58–15:06Z: 143ms

GC pressure during peak hours (16:13–18:29Z)

GC type | Avg pause | Worst pause | Notes
mark-sweep-compact | 22.2ms | 68.5ms | Stop-the-world; main blocker on the loop
scavenge (young-gen) | 3.7ms | 4.2ms | Frequent, small, expected
incremental-marking | 2.8ms | 4.9ms | Background, low-cost

Slow OpenSearch spans during the same window

Hour bin (UTC) | Spans >1s | p95 | Max
Apr 16 13:00 | 9 | 1.31s | 1.69s
Apr 16 14:00 | 16 | 1.87s | 2.20s
Apr 17 07:00 | 9 | 4.79s | 4.82s

All other hourly buckets in the 22.7h window had zero spans >1s.

What this tells us — and what it doesn't

Confirmed

  • Severe event-loop stalls happened before the May 4 saturation — the 957ms peak on Apr 16 ~16:13Z shows this is a recurring backend pattern, not a May-4-only artifact.
  • Apr 17 morning peak (07:00Z) hit 4.82s OS span — the same morning-peak pattern as May 4 07:13–07:56Z.
  • Mark-sweep-compact GC routinely paused 22–68ms during peak — a measurable slice of the event-loop max-stall budget, but not enough alone to cause a 957ms stall.
  • Loop utilization sustained at ~0.70 for 10 hours — backend was compute-bound, not I/O-waiting. The blocking work is in JS, not in awaited I/O.

Still unknown

  • Which JS function caused the 957ms stall — flame graphs needed, MCP can't reach them.
  • Whether the 950-item must_not filter array build/serialization is itself the JS hot frame (would tie OS saturation and event-loop stalls into one root cause).
  • RSS climbed to 1.50GB during the window — was this profiler overhead, a leak, or normal heap growth? PM2 may have auto-restarted around 03:31Z as RSS approached the 2GB cap; also unconfirmed.
  • Whether Nguyen's 09:14Z revert was a planned end-of-experiment cutoff or a reaction to an observed problem.

Concrete next step on the profiler gap

Open the Datadog Profiling page for service:tradeit-backend env:prod and filter to Apr 16 10:33Z → Apr 17 09:14Z; in particular, look for the hot frames in the 16:13–17:21Z bin that produced the 957ms stall.

Apr 17 Thundering Herd — The Pattern Was Already Here

9 slow OS spans on Apr 17 06:55–07:10Z, all firing within ~500ms

Pulled the slowest opensearch.query spans from inside the profiler-on window. They tell the same story as May 4 — at roughly ¼ the scale:

Time (UTC) | Duration | groupId / PM2 worker | must_not.assetId
07:04:37.220 | 4.82s | 37212b07a8320 | ~210
07:04:36.779 | 4.78s | 312563f2f87e58 | ~210
07:04:36.942 | 4.68s | 63840ee92caf6 | ~210
07:04:36.702 | 4.58s | 355699dc5d3a7 | ~210
07:04:36.744 | 4.49s | 3168522f8f08b7 | ~210
07:04:36.739 | 4.46s | 1179322f8f08b7 | ~210
+ 3 more in the same window

Five facts the data forces

  1. All 9 hit coord-node-1 — single-coordinator bottleneck. Exactly what PR #1284's OPENSEARCH_COORDINATION_NODE_URLS round-robin solves.
  2. ~210 must_not terms then vs ~950 now — the filter has grown ~4.5× in 17 days. Same shape, just bigger.
  3. All target single index inventory_730 (commit 678e105c, pre-DEV-4815). May 4 is post-DEV-4815 — multi-index fan-in (inventory_730 + inventory_730_uuskins) layered on top of the same growth.
  4. Spread across all 6 PM2 workers simultaneously — different runtime-ids — so libuv pools on every worker were saturated at the same moment, not just one hot worker.
  5. Identical exclude set across all 9 requests — frontend fired N parallel inventory/data?groupId=X requests (one per visible group on the page), each carrying the same user's reserved-set. Thundering herd: parallel page-fetch × 210 must_not × gzipped response × JSON.parse blocking → libuv overflow.

Direct trace links to try in the UI for flame graphs

Datadog correlates traces↔profiles by runtime-id + time. Opening a slow Apr 17 trace in the UI may surface an embedded "Profiling" tab with the flame graph for that exact request, if the profile event for the worker's runtime-id survived retention:

In the trace detail, look for the "Profiling" tab on the right rail or the "Code Hotspots" view. If neither appears, the linked profile has aged out.

Service Flow Map — Coord Dominance

DD APM Flow Map · 49.3k traces sampled

coord-node-1.dns.a-record handles 0.7 hits/s of tradeit-backend's 1.03 hits/s total — ~68% of all backend egress hits one node.

This is the visual case for PR #1284's multi-node OS client in one number. Every other downstream (api.steampowered.com, graph.facebook.com, api.nowpayments.io, api2.amplitude.com, api-02.moengage.com, cdp.faceit.com, openapi.cs2dt.com) is below 0.04 hits/s — rounding error compared to the OS coord traffic.

Two upstream signals worth their own follow-up

5 Inferred Services · 12.5% err

2.10 hits/s of unattributed callers — likely internal cron/bots/admin-backend/old-tradeit/parser without APM tracer or with mis-tagged service names. Error rate is 100× higher than named-service traffic (0.12%). Possible retry storm or repeated 4xx from one of these clients amplifying load.

172.31.38.147 · 10.1% err

0.03 hits/s from a private-subnet caller. Same VPC, named only by IP because the source isn't running the DD tracer. Worth identifying — the high error rate could mean a misconfigured client retrying tight-loop against the backend.

Marketplace mapping — what's hot vs what's cold

Two marketplaces, two different roles in the data.

The uuSkins lane is the one currently amplifying OS query cost via fan-in — every page request fans out to inventory_730 + inventory_730_uuskins. Turning the uuSkins lane reads off in OS (without disabling the marketplace product-side) would halve the per-request shard-level filter evaluations. Worth noting as a separately tunable dial.
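
A hedged sketch of that dial, app-side; the env flag and helper below are hypothetical, not code that exists in the backend today:

// hypothetical helper: build the fan-in index list behind a flag (flag name is illustrative)
const INCLUDE_UUSKINS_IN_SEARCH = process.env.INCLUDE_UUSKINS_IN_SEARCH !== 'false';

function inventorySearchIndices(appId = 730) {
  const indices = [`inventory_${appId}`];
  if (INCLUDE_UUSKINS_IN_SEARCH) {
    // the uuSkins lane doubles shard-level filter evaluations per request
    indices.push(`inventory_${appId}_uuskins`);
  }
  return indices.join(','); // "inventory_730,inventory_730_uuskins" when the lane is on
}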

Pre-Existing Fix: PR #1284 (Open, Unmerged)

Findings from the Apr 16–17 profiler diagnostic — captured Apr 27, sitting blocked since

On 2026-04-27, the Apr 16–17 profiler findings were turned into PR #1284 ("fix(api): cap public limits, disable OS gzip, multi-node OS client") on branch hotfix/os-perf-and-limits targeting master. The PR is OPEN, BLOCKED, never merged: 2 failing checks (claude-review, renovate-check), 0 human reviews.

The same patterns this PR is designed to mitigate hit prod again on May 4 — 7 days later.

What the profiler caught (per the PR body)

"data-node-4 has been pegged at ~100% CPU during peaks; profiler showed event-loop blocking on JSON.parse of large gzipped OS responses + libuv pool overload from gzip decompression. coord-node-1 was also at 100% (single coordinator bottleneck). Several public endpoints accepted unbounded limit/array sizes letting any client amplify pressure."

Datadog Profiling Insights still active for this service

Insight | Severity | Worst case | Last seen
Event Loop Blocking | MEDIUM | 1.59s longest run | ~17 days ago
Burst of High CPU Usage | MEDIUM | 16s burst · 24.54% avg | ~18 days ago
Libuv Thread Pool Overload | LOW | 83.25ms · 5 tasks / 4 threads | ~17 days ago

The "Last seen" dates of all three insights line up with the Apr 16–17 profiler-on window.

What the PR changes (7 files, +57/-6)

Area | Change | Purpose
OS client | compression: false + suggestCompression: false | Frees libuv pool from gzip decompression on internal VPC traffic; cuts coord-node-1 CPU
OS client | OPENSEARCH_COORDINATION_NODE_URLS (CSV) for client-side round-robin | Spreads coord traffic across coord+data nodes; relieves single-coord bottleneck
ecosystem.config.js | UV_THREADPOOL_SIZE: '16' | Default 4 was saturating on gzip + DNS + crypto + fs
queryByItemId | Clamp limit to [1, 100] | Was unbounded from public /search-trade
middleware | New validateLimit + validateArrayBody | Public endpoint limit guardrails
/api/v2/inventory/search-trade | items[] ≤ 50 | Cap user amplification
/api/v2/inventory/items-prices | assetIds[] / groupIds[] ≤ 500 | Cap user amplification
/api/v2/sell/soldItems & /payoutHistories | limit ≤ 100, offset ≤ 30 | Cap user amplification
MAX_LIMIT_INVENTORY_API | 500 → 150 | Frontend already defaults to 150; the 500 ceiling let any client request 3× more than ever renders
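
A minimal sketch of what the validateLimit / validateArrayBody guardrails could look like, assuming Express-style middleware; the PR's actual implementation may differ:

// assumed Express-style middleware: the shape of the PR's limit guardrails, not its exact code
function validateLimit({ field = 'limit', min = 1, max = 100 } = {}) {
  return (req, res, next) => {
    const raw = Number(req.query[field] ?? req.body?.[field] ?? min);
    if (!Number.isFinite(raw)) {
      return res.status(400).json({ error: `invalid ${field}` });
    }
    // clamp to [min, max] instead of trusting the client-supplied value
    req.validatedLimit = Math.min(Math.max(Math.trunc(raw), min), max);
    next();
  };
}

function validateArrayBody(field, maxItems) {
  return (req, res, next) => {
    const value = req.body?.[field];
    if (!Array.isArray(value) || value.length > maxItems) {
      return res.status(400).json({ error: `${field} must be an array of at most ${maxItems} items` });
    }
    next();
  };
}

// usage per the table above, e.g. items[] ≤ 50 on the public search-trade endpoint:
// router.post('/api/v2/inventory/search-trade', validateArrayBody('items', 50), handler);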

Why gzip-off is safe here

The OpenSearch cluster is on the same VPC as the backend (eu-west-1c, private addresses). Bandwidth between EC2 instances inside an AZ is effectively free, so gzip's only payoff, wire-byte savings, buys nothing here. The cost is paid on every request: JSON.parse of the large decompressed response blocks the event loop, and zlib decompression of the gzipped body overloads the libuv worker-thread pool. Disabling compression on the internal client is a pure-positive trade.
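
A minimal sketch of the client shape the PR describes, assuming the @opensearch-project/opensearch client (auth/TLS options omitted; the PR's actual wiring may differ):

// sketch of the PR's OS-client changes: gzip off + client-side round-robin (not the PR's literal diff)
const { Client } = require('@opensearch-project/opensearch');

// OPENSEARCH_COORDINATION_NODE_URLS is the CSV env var named in the PR,
// e.g. "http://coord-node-1:9200,http://data-node-3:9200"
const nodes = (process.env.OPENSEARCH_COORDINATION_NODE_URLS || 'http://coord-node-1:9200')
  .split(',')
  .map((url) => url.trim())
  .filter(Boolean);

const osClient = new Client({
  nodes,                     // round-robin across coord + data nodes
  compression: false,        // no zlib work on libuv threads for internal VPC traffic
  suggestCompression: false, // don't ask OS to gzip its responses either
});

module.exports = osClient;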

Stopgap Options

In addition to landing PR #1284

0 — Land PR #1284 first priority

How: investigate the claude-review + renovate-check failures (likely fixable / soft); rebase onto current master; re-run CI; merge; deploy. The PR's own test plan includes verifying "Libuv Thread Pool Overload profiler insight clears" after deploy.

Pros: already written, already reviewed-by-Claude, already targets the exact symptoms seen May 4. No new design.

Cons: 7 days stale — needs rebase + a fresh CI run. Doesn't directly shrink the 950-item must_not filter (Option A below); it makes the existing filter cheaper to ship over the wire and parse.

A — Shrink the must_not array app-side

How: add an indexed boolean (isReserved / isLocked) maintained by the parser on reservation events. Replace the 950-term filter with a single term predicate. Alternative: terms-lookup against a small per-user reservations_<userId> doc.
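
A hedged sketch of the two query shapes Option A describes; the isReserved field and the reservations_<userId> lookup doc are proposals from this doc, not existing mappings:

// sketch of Option A query shapes (field / index names are proposals, not current mappings)
function buildInventoryQuery({ groupId, userId, useTermsLookup = false }) {
  const excludeReserved = useTermsLookup
    // variant 2: terms lookup against a small per-user doc listing reserved assetIds
    ? { terms: { assetId: { index: `reservations_${userId}`, id: userId, path: 'assetIds' } } }
    // variant 1: single indexed boolean maintained by the parser on reservation events
    : { term: { isReserved: true } };

  return {
    query: {
      bool: {
        filter: [{ term: { groupId } }],
        must_not: [excludeReserved], // replaces the ~950-item assetId terms array
      },
    },
  };
}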

Pros: kills root cost. No OpenSearch restart. Benefits every node including the idle one indirectly via lower coordinator latency.

Cons: requires parser + backend change + deploy. Reservation event plumbing.

B — Raise thread_pool.search.queue_size on data-node-1 + data-node-4

How: opensearch.yml on the affected nodes, 1000 → 2000, rolling restart.

Pros: no app deploy.

Cons: static setting → rolling restart per node, cluster yellow each time. Doesn't fix root cause; delays next overflow. Longer queue = longer tail latencies + heap pressure (queued tasks hold their payloads).

C — Cap must_not.assetId length at ~200

How: drop the oldest reservations from the filter, post-filter server-side after the OS response.
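
A minimal sketch of the Option C band-aid; names are illustrative, and the ordering assumption (reservation list ordered oldest to newest) would need checking against reserveAssetRedis.js:

// sketch of Option C: cap the OS-side exclude list, post-filter the rest in JS (illustrative names)
const MAX_MUST_NOT_TERMS = 200;

function splitReservedIds(reservedAssetIds) {
  // assumes the array is ordered oldest → newest; keep the newest in the OS filter
  const inQuery = reservedAssetIds.slice(-MAX_MUST_NOT_TERMS);
  const postFilterSet = new Set(reservedAssetIds.slice(0, -MAX_MUST_NOT_TERMS));
  return { inQuery, postFilterSet };
}

function postFilterHits(hits, postFilterSet) {
  // pages may come back shorter than `size` after this; that's the pagination caveat in Cons below
  return hits.filter((hit) => !postFilterSet.has(hit._source.assetId));
}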

Pros: ships in one PR, no schema change. Fastest band-aid.

Cons: changes pagination semantics — pages may return fewer than size after post-filter; users may briefly see their own reserved items before the post-filter trims them.

Open Questions

Refs

Related

Call Sites

  • server/controllers/inventory.js:436
  • server/redis/reserveAssetRedis.js
  • server/service/openSearchService.js:500

drafted 2026-05-04 · pre-decision · tradeit.gg