Incident Report · April 14, 2026

OpenSearch Data Node Load Spike

Post-deploy load threshold breaches on marketplace-aware indexing rollout

P1 RESOLVED — 2 Data Nodes Breached load.norm.1 > 1.0

8 alert events across a ~4-hour window post-deploy — all auto-recovered within minutes

TL;DR

One-paragraph summary

Following the marketplace-aware OpenSearch indexing prod rollout on 2026-04-14 (~15:06–15:40 UTC), the parser’s bulk engine was rewritten: 2 parallel workers (was 1 sequential), refresh-interval disabled during bulk (was refresh: true per chunk), and 8MB byte-aware chunking. HaloSkins lane is disabled, so only the own-inventory lane was running — but the new engine pushes the same data volume into a shorter, hotter CPU window. Starting at 18:31 UTC (~3 hours post-deploy), two of four data nodes breached the normalized 1-minute load threshold during those compressed indexing cycles. Mitigation (parser PR #114) adds cluster-health backpressure that self-throttles when CPU gets hot.

  • Peak load norm.1: 1.14
  • Alert events: 8
  • Peak CPU user: 67%
  • Nodes breached: 2 / 4
  • Auto-recovery: < 2 min

Timeline (UTC)

Deploy-to-alert correlation — all timestamps verified from git log + Datadog events

15:06 — Parser PR #99 + Backend PR #1215 merged to master DEPLOY
15:37–15:39 — Prod IP update commits (chore(deploy): update prod ip) — parser container swap completes DEPLOY
17:53–18:30 — Rising load trend on data_node_4 (0.69 avg climbing to 1.14) and data_node_1 (jumps from 0.07 to 0.67)
18:31:17 — P1 Triggered: data_node_1 load > 1.0 ALERT
18:33:17 — P1 Triggered: data_node_4 load > 1.0 ALERT
18:34:17 — Recovered: data_node_4 RECOVERED
18:38:17 — Recovered: data_node_1 RECOVERED
18:39:17 — P1 Triggered (second bounce): data_node_4 ALERT
18:40:17 — Recovered: data_node_4 RECOVERED
22:47:17 — P1 Triggered (isolated): data_node_4 — second indexing cycle hotspot ALERT
22:48:17 — Recovered: data_node_4 — final event RECOVERED

Cluster State During Peak

4-node cluster, c6a.xlarge (4 vCPU), all in eu-west-1c

Peak load norm.1 during 18:17–18:40 UTC window. Threshold: 1.0 (= full CPU saturation per core).

  • data_node_4 (i-07e02b117e9dbd0a9): 1.14 · BREACHED · 2×
  • data_node_1 (i-0e3ae1adac2380052): 1.14 · BREACHED · 1×
  • data_node_3 (i-0f2b2243f8e98a69a): 0.91 · HOT · did not breach
  • data_node_2 (—): 0.04 · IDLE

data_node_2 Idle — By Design

data_node_2 shows a max load of 0.04 across the 6-hour window — effectively zero indexing work. This is intentional: spreading shards across all 4 nodes via higher replica counts was tested previously and made bulk inserts slower (more nodes replicating each write → longer per-bulk latency). The current concentrated shard placement is a deliberate trade-off favoring insert speed over even load distribution. Not a factor in this incident.


Root Cause

Bulk engine rewritten for throughput — removed natural pacing

HaloSkins lane was disabled in prod and stayed disabled through this window, so only the own-inventory lane ran. The behavioral change is inside the bulk engine itself — confirmed by diffing openSearch.service.ts at commit 3abe319 (PR #99):

Before Deploy (old bulkIndex)

  • Sequential for-loop over chunks — 1 bulk request at a time
  • refresh: true on every chunk — forces commit/flush, naturally paces cluster
  • Pure doc-count chunking (splitEvery(5000))
  • No retry, no failure circuit breaker

After Deploy (new processBulkChunks)

  • 2 parallel workers for inventory (getBulkConcurrency(Inventory) = 2)
  • refresh_interval = -1 during bulk + refresh: false on each call — removes the natural brake
  • Byte-aware chunking (2500 docs or 8MB, whichever first)
  • Explicit single refresh + alias swap at the end
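The byte-aware chunking rule above (cut at 2500 docs or 8 MB, whichever comes first) can be sketched as follows. This is an illustrative reconstruction, not the actual parser code; the function name `chunkByDocsOrBytes` is hypothetical.

```typescript
// Hypothetical sketch of byte-aware chunking: close the current chunk when
// adding the next doc would cross either the doc-count or the byte limit.
const MAX_DOCS = 2500;
const MAX_BYTES = 8 * 1024 * 1024; // 8 MB

function chunkByDocsOrBytes<T>(docs: T[]): T[][] {
  const chunks: T[][] = [];
  let current: T[] = [];
  let currentBytes = 0;

  for (const doc of docs) {
    const docBytes = Buffer.byteLength(JSON.stringify(doc), "utf8");
    // Start a new chunk when either limit would be crossed.
    if (
      current.length > 0 &&
      (current.length >= MAX_DOCS || currentBytes + docBytes > MAX_BYTES)
    ) {
      chunks.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(doc);
    currentBytes += docBytes;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

The byte cap matters because doc-count-only chunking (the old `splitEvery(5000)`) can produce oversized bulk payloads when individual documents are large.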

Why This Spikes Load

Same dataset (~320k CSGO items), same total work — but compressed into a shorter, hotter window. With the old per-chunk refresh: true, the cluster committed segments between each of ~130 chunks, spreading the CPU over a longer duration. The new engine streams 2 bulk requests concurrently with refresh disabled, so all indexing happens in a tight burst followed by one final refresh. CPU peaks higher even though total CPU-seconds are similar (or lower).
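A quick back-of-envelope check of the chunk math above. Figures come from the report; the round count is an idealization that ignores per-chunk size variance.

```typescript
// ~320k docs at 2500 docs/chunk gives the "~130 chunks" cited above.
const totalDocs = 320_000;
const docsPerChunk = 2_500;
const chunks = Math.ceil(totalDocs / docsPerChunk); // 128

// Old engine: 128 sequential bulk requests, each followed by a forced refresh.
// New engine: 2 workers share the same 128 chunks with no per-chunk refresh,
// so the same work lands in roughly half the wall-clock rounds.
const workers = 2;
const roundsNew = Math.ceil(chunks / workers); // 64
console.log(chunks, roundsNew);
```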

The two spike windows (18:17–18:40 UTC and 22:22–22:48 UTC) align with 2 separate indexing cycles, each now burning hotter than pre-deploy. data_node_4 triggered in both because primary shards for inventory_730_own_* happen to concentrate there.


What Was Ruled Out

Not an I/O issue. Not a network issue. Not a deploy-itself issue.

  • Disk I/O? NO (iowait < 0.5%, write_time_pct avg 5.8%) → RULED OUT
  • Network? NO (rx ~3.5 MB/s, typical traffic shape) → RULED OUT
  • Deploy itself? NO (first alert came 2h51m after container swap) → RULED OUT

The signature is purely CPU-bound: CPU user peaked at 67% with load norm.1 of 1.14 on 4-vCPU boxes, i.e. roughly 4.56 runnable tasks competing for 4 cores. No disk or network bottleneck.
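For reference, load.norm.1 is the 1-minute load average divided by vCPU count, so the 4.56 figure falls out directly:

```typescript
// De-normalizing the alert metric: raw 1-min load average on a 4-vCPU box.
const vcpus = 4;
const loadNorm1 = 1.14; // the breached threshold value
const rawLoad = loadNorm1 * vcpus; // ≈ 4.56 runnable tasks
console.log(rawLoad.toFixed(2));
```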


Observability Gaps Exposed

Investigation hit blind spots — DEV-4638 relevance confirmed

No OpenSearch Integration Metrics

No opensearch.* / elasticsearch.* metrics in Datadog. No visibility into bulk thread-pool queue depth, rejected writes, JVM heap, GC, segment count. Had to infer from host-level system.* metrics only.

No Parser Logs in Datadog

Searches for service:tradeit-inventory-parser and source:*parser* returned zero results. Cannot directly confirm concurrent lane execution from log timestamps — had to infer from metric correlation.


Mitigation — Action Plan

Ranked by effectiveness and urgency

  • P0 · Merge parser PR #114 (Lane Scheduler + Backpressure) · Owner: Ehud / Nguyen V
    Why: The between-chunk backpressure directly addresses this scenario: when heap_used_percent > 80 or write-queue depth climbs during bulk, the parser inserts proportional delays, restoring the pacing the old refresh: true loop provided. Cheapest fix, no infra change.
  • P1 · Tune bulk concurrency: consider dropping getBulkConcurrency(Inventory) from 2 to 1 until backpressure ships · Owner: Ehud
    Why: One-line revert that restores sequential pacing. Gives immediate relief while PR #114 is in review. Trade-off: slower bulk, but current cycles already fit comfortably within the 10-min SLA.
  • P1 · Ship DEV-4638: OpenSearch dashboards + alerts · Owner: Ehud
    Why: Current observability is host-metrics-only. Need bulk thread-pool queue depth, JVM heap, pending tasks, and rejections to tune backpressure thresholds intelligently.
  • P2 · Scale data nodes: c6a.xlarge (4 vCPU) → c6a.2xlarge (8 vCPU) · Owner: DevOps / Ehud
    Why: Re-evaluate after PR #114 ships and uuskins/IGXE are enabled. Current 4-vCPU headroom is tight but sufficient for the own-only lane.
  • P2 · Enable parser log shipping to Datadog · Owner: DevOps
    Why: Closes the gap that blocked direct correlation in this investigation, where we had to infer from metric shapes alone.
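The P0 backpressure behavior can be sketched as a delay function polled between bulk chunks. This is a minimal sketch under stated assumptions, not the actual PR #114 code: the function name, thresholds, and scaling are illustrative, and the report only specifies "heap_used_percent > 80 or write-queue depth climbs → proportional delays".

```typescript
// Hypothetical between-chunk backpressure: return a pause (ms) proportional
// to how unhealthy the cluster looks. All names/constants are illustrative.
const HEAP_THRESHOLD_PCT = 80;
const MAX_DELAY_MS = 5_000;

interface ClusterStats {
  heapUsedPercent: number; // e.g. sampled from node JVM heap stats
  writeQueueDepth: number; // e.g. sampled from the write thread pool
}

function backpressureDelayMs(stats: ClusterStats): number {
  if (stats.heapUsedPercent <= HEAP_THRESHOLD_PCT && stats.writeQueueDepth === 0) {
    return 0; // cluster healthy: keep streaming chunks at full speed
  }
  // Scale delay with heap overshoot past the threshold (0-20% over → 0-MAX).
  const overshoot = Math.max(0, stats.heapUsedPercent - HEAP_THRESHOLD_PCT);
  const heapDelay = (overshoot / (100 - HEAP_THRESHOLD_PCT)) * MAX_DELAY_MS;
  // Queued writes add a flat per-request penalty.
  const queueDelay = stats.writeQueueDepth * 100;
  return Math.min(MAX_DELAY_MS, Math.round(heapDelay + queueDelay));
}
```

The point of the proportional shape is that it reintroduces pacing only when the cluster is actually hot, unlike the old per-chunk refresh, which paid the pacing cost unconditionally.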

References

Incident owner: Ehud Shahak · Generated 2026-04-15