INCIDENT · 2026-05-04 · 07:13–07:56 UTC

OS Search-Pool Saturation

data-node-4 overflowed · 545 rejections · P3 · follow-up to 2026-04-15

TL;DR

What happened, in three sentences

Search threads on data-node-4 ran 7/7 saturated through the morning peak; the queue hit its 1000-task cap at 07:46–47Z and the node emitted ~545 rejections. data-node-1 ran hot but did not overflow.

Root cause: a ~950-item must_not.assetId filter on the inventory/group search — every shard of every fan-in index evaluates the terms filter for every page. Compounded on the Node.js side by gzip decompression of the responses on libuv worker threads (pool overload) and JSON.parse of the large decompressed bodies blocking the event loop.

The fix already exists as PR #1284 (Apr 27, OPEN, BLOCKED 7 days); it disables OS-client gzip, raises UV_THREADPOOL_SIZE, spreads coord traffic across nodes, and caps public-endpoint amplification. See Pre-Existing Fix below.

Headline Numbers

  • Peak queue (cap): 1000
  • Rejections: 545
  • must_not.assetId terms: ~950
  • Slowest OS span: 2.5s
  • Spans >1s per 10 min: 363

Saturation Map

Per-node search thread-pool, 07:30–08:00 UTC

Node | Peak Queue | Peak Active | Rejections | State
data-node-4 | 1000 (cap) | 7 / 7 | 545.65 | overflow
data-node-1 | 755 | 7 / 7 | 0 | hot, no overflow
data-node-3 | 3 | ~1 | 0 | nominal
data-node-2 | 0 | 0 | 0 | idle by design
coord-node-1 | 3 | <1 | 0 | nominal
master ×3 / index ×1 | 0 | 0 | 0 | nominal

Source: elasticsearch.thread_pool.search.{queue,active,rejected} by node_name

Queue Depth at Peak

data-node-4: 1000 (cap) · data-node-1: 755 · coord-node-1: 3 · data-node-3: 3 · data-node-2: 0

Slow Spans (APM)

Representative spans during the saturation window

Time (UTC) | Duration | Resource | must_not.assetId
07:57:12 | 1.70s | POST /group_730,group_730_uuskins/_search | ~950
07:57:03 | 2.50s | POST /inventory_730,inventory_730_uuskins/_search | ~950

Both queries hit coord-node-1 with multi-index fan-in across the uuSkins lane and the own-inventory index. Body shape:

// inventory search · 2.50s · ~950 must_not terms
{
  "from": "0", "size": 100,
  "query": { "bool": {
    "filter": [{ "bool": { "must": [{ "term": { "groupId": 43315 }}] } }],
    "must_not": [{ "terms": { "assetId": [ /* ~950 ids */ ] }}],
    "should":  [{ "range": { "floatValue": { "gte": 0, "lt": 1 } }}],
    "must":    [/* saleOfferOwner / onlyStore branches */]
  }}
}

Why It Hurts

OpenSearch side

Every page of the inventory/group search ships the ~950-item must_not.assetId terms filter to every shard of every index in the fan-in (inventory_730 + inventory_730_uuskins), so per-request filter cost scales with terms × shards × indices × pages, and all of it lands on the already-hot coordinator.

Backend (Node.js) side — amplifies the same request

Each gzipped OS response is decompressed on libuv worker threads (default pool of 4) and then JSON.parse'd on the event loop; with several per-group requests in flight at once, the pool overflows and the loop stalls. These are the same Libuv Thread Pool Overload and Event Loop Blocking patterns listed as active profiler insights under PR #1284 below.
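
A minimal sketch of that per-response cost, purely illustrative (in production the decompression and parsing happen inside the OpenSearch client, not in a hand-written handler like this):

// illustrative only: where the per-response work lands when gzip is on
const zlib = require('zlib');

function parseGzippedOsResponse(gzippedBuffer) {
  return new Promise((resolve, reject) => {
    // 1) gunzip runs on a libuv worker thread (default pool size 4);
    //    several parallel per-group searches saturate the pool
    zlib.gunzip(gzippedBuffer, (err, raw) => {
      if (err) return reject(err);
      // 2) JSON.parse of the multi-MB body is synchronous on the event loop;
      //    no other request, timer, or callback runs until it finishes
      resolve(JSON.parse(raw.toString('utf8')));
    });
  });
}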

Profiler-Window Reconstruction

Apr 16 10:33Z → Apr 17 09:14Z · DD_PROFILING_ENABLED on for 22.7h via PR #1250 → reverted by PR #1251

Flame graphs / hot-frame attribution from the Datadog Profiling product are not exposed through APM/Metrics MCP tools — they live behind app.datadoghq.com/profiling. Asking Nguyen, or checking the UI directly for service:tradeit-backend env:prod over that window, is the only way to recover the actual hot frames.

That said, runtime metrics flowed for the entire window (PR #1246 enabled DD_RUNTIME_METRICS_ENABLED the day before). These give a quantitative reconstruction of what the backend was doing — even without flame graphs.

Headline numbers from the profiler window

  • Worst event-loop stall: 957ms
  • Sustained max-stall (avg): 75ms
  • Peak loop utilization: 0.72
  • Worst GC pause: 68.5ms
  • Peak RSS: 1.50GB (75% of cap)

Hot-spot timeline (event-loop max stall, by 68-min bin)

16:13–17:21Z: 957ms ← worst
17:21–18:29Z: 260ms
03:31–04:39Z: 231ms
15:06–16:13Z: 178ms
13:58–15:06Z: 143ms

GC pressure during peak hours (16:13–18:29Z)

GC type | Avg pause | Worst pause | Notes
mark-sweep-compact | 22.2ms | 68.5ms | Stop-the-world; main blocker on the loop
scavenge (young-gen) | 3.7ms | 4.2ms | Frequent, small, expected
incremental-marking | 2.8ms | 4.9ms | Background, low-cost

Slow OpenSearch spans during the same window

Hour bin (UTC) | Spans >1s | p95 | Max
Apr 16 13:00 | 9 | 1.31s | 1.69s
Apr 16 14:00 | 16 | 1.87s | 2.20s
Apr 17 07:00 | 9 | 4.79s | 4.82s

All other hourly buckets in the 22.7h window had zero spans >1s.

What this tells us — and what it doesn't

Confirmed

  • Severe event-loop stalls happened before the May 4 saturation — the 957ms peak on Apr 16 ~16:13Z shows this is a recurring backend pattern, not a May-4-only artifact.
  • Apr 17 morning peak (07:00Z) hit 4.82s OS span — the same morning-peak pattern as May 4 07:13–07:56Z.
  • Mark-sweep-compact GC routinely paused 22–68ms during peak — a measurable slice of the event-loop max-stall budget, but not enough alone to cause a 957ms stall.
  • Loop utilization sustained at ~0.70 for 10 hours — backend was compute-bound, not I/O-waiting. The blocking work is in JS, not in awaited I/O.

Still unknown

  • Which JS function caused the 957ms stall — flame graphs needed, MCP can't reach them.
  • Whether the 950-item must_not filter array build/serialization is itself the JS hot frame (would tie OS saturation and event-loop stalls into one root cause).
  • RSS climbed to 1.50GB during the window — was this profiler overhead, a leak, or normal heap growth? PM2 may have auto-restarted around 03:31Z as RSS approached the 2GB cap; also unconfirmed.
  • Whether Nguyen's 09:14Z revert was a planned end-of-experiment cutoff or a reaction to an observed problem.

Concrete next step on the profiler gap

Open the Datadog Profiling page for service:tradeit-backend env:prod and filter to Apr 16 10:33Z → Apr 17 09:14Z; in particular, look for the hot frames in the 16:13–17:21Z bin that produced the 957ms stall.

Apr 17 Thundering Herd — The Pattern Was Already Here

9 slow OS spans on Apr 17 06:55–07:10Z, all firing within ~500ms

Pulled the slowest opensearch.query spans from inside the profiler-on window. They tell the same story as May 4 — at roughly ¼ the scale:

Time (UTC) | Duration | groupId / PM2 worker | must_not.assetId
07:04:37.220 | 4.82s | 37212b07a8320 | ~210
07:04:36.779 | 4.78s | 312563f2f87e58 | ~210
07:04:36.942 | 4.68s | 63840ee92caf6 | ~210
07:04:36.702 | 4.58s | 355699dc5d3a7 | ~210
07:04:36.744 | 4.49s | 3168522f8f08b7 | ~210
07:04:36.739 | 4.46s | 1179322f8f08b7 | ~210
+ 3 more in the same window

Five facts the data forces

  1. All 9 hit coord-node-1 — single-coordinator bottleneck. Exactly what PR #1284's OPENSEARCH_COORDINATION_NODE_URLS round-robin solves.
  2. ~210 must_not terms then vs ~950 now — the filter has grown ~4.5× in 17 days. Same shape, just bigger.
  3. All target single index inventory_730 (commit 678e105c, pre-DEV-4815). May 4 is post-DEV-4815 — multi-index fan-in (inventory_730 + inventory_730_uuskins) layered on top of the same growth.
  4. Spread across all 6 PM2 workers simultaneously — different runtime-ids — so libuv pools on every worker were saturated at the same moment, not just one hot worker.
  5. Identical exclude set across all 9 requests — frontend fired N parallel inventory/data?groupId=X requests (one per visible group on the page), each carrying the same user's reserved-set. Thundering herd: parallel page-fetch × 210 must_not × gzipped response × JSON.parse blocking → libuv overflow.

Direct trace links to try in the UI for flame graphs

Datadog correlates traces↔profiles by runtime-id + time. Opening a slow Apr 17 trace in the UI may surface an embedded "Profiling" tab with the flame graph for that exact request, if the profile event for the worker's runtime-id survived retention:

In the trace detail, look for the "Profiling" tab on the right rail or the "Code Hotspots" view. If neither appears, the linked profile has aged out.

Service Flow Map — Coord Dominance

DD APM Flow Map · 49.3k traces sampled

coord-node-1.dns.a-record handles 0.7 hits/s of tradeit-backend's 1.03 hits/s total — ~68% of all backend egress hits one node.

This is the visual case for PR #1284's multi-node OS client in one number. Every other downstream (api.steampowered.com, graph.facebook.com, api.nowpayments.io, api2.amplitude.com, api-02.moengage.com, cdp.faceit.com, openapi.cs2dt.com) is below 0.04 hits/s — rounding error compared to the OS coord traffic.

Two upstream signals worth their own follow-up

5 Inferred Services · 12.5% err

2.10 hits/s of unattributed callers — likely internal cron/bots/admin-backend/old-tradeit/parser without APM tracer or with mis-tagged service names. Error rate is 100× higher than named-service traffic (0.12%). Possible retry storm or repeated 4xx from one of these clients amplifying load.

172.31.38.147 · 10.1% err

0.03 hits/s from a private-subnet caller. Same VPC, named only by IP because the source isn't running the DD tracer. Worth identifying — the high error rate could mean a misconfigured client retrying tight-loop against the backend.

Marketplace mapping — what's hot vs what's cold

Two marketplaces, two different roles in the data.

The uuSkins lane is the one currently amplifying OS query cost via fan-in — every page request fans out to inventory_730 + inventory_730_uuskins. Turning the uuSkins lane reads off in OS (without disabling the marketplace product-side) would halve the per-request shard-level filter evaluations. Worth noting as a separately tunable dial.
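
A hedged sketch of that dial, app-side; the env flag and helper below are hypothetical, not code that exists in the backend today:

// hypothetical helper: build the fan-in index list behind a flag (flag name is illustrative)
const INCLUDE_UUSKINS_IN_SEARCH = process.env.INCLUDE_UUSKINS_IN_SEARCH !== 'false';

function inventorySearchIndices(appId = 730) {
  const indices = [`inventory_${appId}`];
  if (INCLUDE_UUSKINS_IN_SEARCH) {
    // the uuSkins lane doubles shard-level filter evaluations per request
    indices.push(`inventory_${appId}_uuskins`);
  }
  return indices.join(','); // "inventory_730,inventory_730_uuskins" when the lane is on
}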

Pre-Existing Fix: PR #1284 (Open, Unmerged)

Findings from the Apr 16–17 profiler diagnostic — captured Apr 27, sitting blocked since

On 2026-04-27, the Apr 16–17 profiler findings were turned into PR #1284 ("fix(api): cap public limits, disable OS gzip, multi-node OS client") on branch hotfix/os-perf-and-limits targeting master. The PR is OPEN, BLOCKED, never merged: 2 failing checks (claude-review, renovate-check), 0 human reviews.

The same patterns this PR is designed to mitigate hit prod again on May 4 — 7 days later.

What the profiler caught (per the PR body)

"data-node-4 has been pegged at ~100% CPU during peaks; profiler showed event-loop blocking on JSON.parse of large gzipped OS responses + libuv pool overload from gzip decompression. coord-node-1 was also at 100% (single coordinator bottleneck). Several public endpoints accepted unbounded limit/array sizes letting any client amplify pressure."

Datadog Profiling Insights still active for this service

Insight | Severity | Worst case | Last seen
Event Loop Blocking | MEDIUM | 1.59s longest run | ~17 days ago
Burst of High CPU Usage | MEDIUM | 16s burst · 24.54% avg | ~18 days ago
Libuv Thread Pool Overload | LOW | 83.25ms · 5 tasks / 4 threads | ~17 days ago

The "Last seen" dates of all three insights line up with the Apr 16–17 profiler-on window.

What the PR changes (7 files, +57/-6)

Area | Change | Purpose
OS client | compression: false + suggestCompression: false | Frees libuv pool from gzip decompression on internal VPC traffic; cuts coord-node-1 CPU
OS client | OPENSEARCH_COORDINATION_NODE_URLS (CSV) for client-side round-robin | Spreads coord traffic across coord+data nodes; relieves single-coord bottleneck
ecosystem.config.js | UV_THREADPOOL_SIZE: '16' | Default 4 was saturating on gzip + DNS + crypto + fs
queryByItemId | Clamp limit to [1, 100] | Was unbounded from public /search-trade
middleware | New validateLimit + validateArrayBody | Public endpoint limit guardrails
/api/v2/inventory/search-trade | items[] ≤ 50 | Cap user amplification
/api/v2/inventory/items-prices | assetIds[] / groupIds[] ≤ 500 | Cap user amplification
/api/v2/sell/soldItems & /payoutHistories | limit ≤ 100, offset ≤ 30 | Cap user amplification
MAX_LIMIT_INVENTORY_API | 500 → 150 | Frontend already defaults to 150; the 500 ceiling let any client request 3× more than ever renders
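
A minimal sketch of what the validateLimit / validateArrayBody guardrails could look like, assuming Express-style middleware; the PR's actual implementation may differ:

// assumed Express-style middleware: the shape of the PR's limit guardrails, not its exact code
function validateLimit({ field = 'limit', min = 1, max = 100 } = {}) {
  return (req, res, next) => {
    const raw = Number(req.query[field] ?? req.body?.[field] ?? min);
    if (!Number.isFinite(raw)) {
      return res.status(400).json({ error: `invalid ${field}` });
    }
    // clamp to [min, max] instead of trusting the client-supplied value
    req.validatedLimit = Math.min(Math.max(Math.trunc(raw), min), max);
    next();
  };
}

function validateArrayBody(field, maxItems) {
  return (req, res, next) => {
    const value = req.body?.[field];
    if (!Array.isArray(value) || value.length > maxItems) {
      return res.status(400).json({ error: `${field} must be an array of at most ${maxItems} items` });
    }
    next();
  };
}

// usage per the table above, e.g. items[] ≤ 50 on the public search-trade endpoint:
// router.post('/api/v2/inventory/search-trade', validateArrayBody('items', 50), handler);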

Why gzip-off is safe here

The OpenSearch cluster is on the same VPC as the backend (eu-west-1c, private addresses). Bandwidth between EC2 instances inside an AZ is effectively free, so gzip's only payoff, wire-byte savings, buys nothing here. The cost is paid on every request: JSON.parse of the large decompressed response blocks the event loop, and zlib decompression of the gzipped body overloads the libuv worker-thread pool. Disabling compression on the internal client is a pure-positive trade.
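
A minimal sketch of the client shape the PR describes, assuming the @opensearch-project/opensearch client (auth/TLS options omitted; the PR's actual wiring may differ):

// sketch of the PR's OS-client changes: gzip off + client-side round-robin (not the PR's literal diff)
const { Client } = require('@opensearch-project/opensearch');

// OPENSEARCH_COORDINATION_NODE_URLS is the CSV env var named in the PR,
// e.g. "http://coord-node-1:9200,http://data-node-3:9200"
const nodes = (process.env.OPENSEARCH_COORDINATION_NODE_URLS || 'http://coord-node-1:9200')
  .split(',')
  .map((url) => url.trim())
  .filter(Boolean);

const osClient = new Client({
  nodes,                     // round-robin across coord + data nodes
  compression: false,        // no zlib work on libuv threads for internal VPC traffic
  suggestCompression: false, // don't ask OS to gzip its responses either
});

module.exports = osClient;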

Stopgap Options

In addition to landing PR #1284

0 — Land PR #1284 first priority

How: investigate the claude-review + renovate-check failures (likely fixable / soft); rebase onto current master; re-run CI; merge; deploy. The PR's own test plan includes verifying "Libuv Thread Pool Overload profiler insight clears" after deploy.

Pros: already written, already reviewed-by-Claude, already targets the exact symptoms seen May 4. No new design.

Cons: 7 days stale — needs rebase + a fresh CI run. Doesn't directly shrink the 950-item must_not filter (Option A below); it makes the existing filter cheaper to ship over the wire and parse.

A — Shrink the must_not array app-side

How: add an indexed boolean (isReserved / isLocked) maintained by the parser on reservation events. Replace the 950-term filter with a single term predicate. Alternative: terms-lookup against a small per-user reservations_<userId> doc.
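
A hedged sketch of the two query shapes Option A describes; the isReserved field and the reservations_<userId> lookup doc are proposals from this doc, not existing mappings:

// sketch of Option A query shapes (field / index names are proposals, not current mappings)
function buildInventoryQuery({ groupId, userId, useTermsLookup = false }) {
  const excludeReserved = useTermsLookup
    // variant 2: terms lookup against a small per-user doc listing reserved assetIds
    ? { terms: { assetId: { index: `reservations_${userId}`, id: userId, path: 'assetIds' } } }
    // variant 1: single indexed boolean maintained by the parser on reservation events
    : { term: { isReserved: true } };

  return {
    query: {
      bool: {
        filter: [{ term: { groupId } }],
        must_not: [excludeReserved], // replaces the ~950-item assetId terms array
      },
    },
  };
}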

Pros: kills root cost. No OpenSearch restart. Benefits every node including the idle one indirectly via lower coordinator latency.

Cons: requires parser + backend change + deploy. Reservation event plumbing.

B — Raise thread_pool.search.queue_size on data-node-1 + data-node-4

How: opensearch.yml on the affected nodes, 1000 → 2000, rolling restart.

Pros: no app deploy.

Cons: static setting → rolling restart per node, cluster yellow each time. Doesn't fix root cause; delays next overflow. Longer queue = longer tail latencies + heap pressure (queued tasks hold their payloads).

C — Cap must_not.assetId length at ~200

How: drop the oldest reservations from the filter, post-filter server-side after the OS response.
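
A minimal sketch of the Option C band-aid; names are illustrative, and the ordering assumption (reservation list ordered oldest to newest) would need checking against reserveAssetRedis.js:

// sketch of Option C: cap the OS-side exclude list, post-filter the rest in JS (illustrative names)
const MAX_MUST_NOT_TERMS = 200;

function splitReservedIds(reservedAssetIds) {
  // assumes the array is ordered oldest → newest; keep the newest in the OS filter
  const inQuery = reservedAssetIds.slice(-MAX_MUST_NOT_TERMS);
  const postFilterSet = new Set(reservedAssetIds.slice(0, -MAX_MUST_NOT_TERMS));
  return { inQuery, postFilterSet };
}

function postFilterHits(hits, postFilterSet) {
  // pages may come back shorter than `size` after this; that's the pagination caveat in Cons below
  return hits.filter((hit) => !postFilterSet.has(hit._source.assetId));
}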

Pros: ships in one PR, no schema change. Fastest band-aid.

Cons: changes pagination semantics — pages may return fewer than size after post-filter; users may briefly see their own reserved items before the post-filter trims them.

Open Questions

Refs

Related

Call Sites

  • server/controllers/inventory.js:436
  • server/redis/reserveAssetRedis.js
  • server/service/openSearchService.js:500

drafted 2026-05-04 · pre-decision · tradeit.gg