Incident Report · April 14, 2026

OpenSearch Data Node Load Spike

Post-deploy load threshold breaches on marketplace-aware indexing rollout

P1 RESOLVED — 2 Data Nodes Breached load.norm.1 > 1.0

8 alert events across a ~4-hour window post-deploy — all auto-recovered within minutes

TL;DR

One-paragraph summary

Following the marketplace-aware OpenSearch indexing prod rollout on 2026-04-14 (~15:06–15:40 UTC), the parser’s bulk engine was rewritten: 2 parallel workers (was 1 sequential), refresh-interval disabled during bulk (was refresh: true per chunk), and 8MB byte-aware chunking. HaloSkins lane is disabled, so only the own-inventory lane was running — but the new engine pushes the same data volume into a shorter, hotter CPU window. Starting at 18:31 UTC (~3 hours post-deploy), two of four data nodes breached the normalized 1-minute load threshold during those compressed indexing cycles. Mitigation (parser PR #114) adds cluster-health backpressure that self-throttles when CPU gets hot.

  • Peak load norm.1: 1.14
  • Alert events: 8
  • Peak CPU user: 67%
  • Nodes breached: 2 / 4
  • Auto-recovery: < 2 min

Timeline (UTC)

Deploy-to-alert correlation — all timestamps verified from git log + Datadog events

15:06 — Parser PR #99 + Backend PR #1215 merged to master DEPLOY
15:37–15:39 — Prod IP update commits (chore(deploy): update prod ip) — parser container swap completes DEPLOY
17:53–18:30 — Rising load trend on data_node_4 (0.69 avg climbing to 1.14) and data_node_1 (jumps from 0.07 to 0.67)
18:31:17 — P1 Triggered: data_node_1 load > 1.0 ALERT
18:33:17 — P1 Triggered: data_node_4 load > 1.0 ALERT
18:34:17 — Recovered: data_node_4 RECOVERED
18:38:17 — Recovered: data_node_1 RECOVERED
18:39:17 — P1 Triggered (second bounce): data_node_4 ALERT
18:40:17 — Recovered: data_node_4 RECOVERED
22:47:17 — P1 Triggered (isolated): data_node_4 — second indexing cycle hotspot ALERT
22:48:17 — Recovered: data_node_4 — final event RECOVERED

Cluster State During Peak

4-node cluster, c6a.xlarge (4 vCPU), all in eu-west-1c

Peak load norm.1 during 18:17–18:40 UTC window. Threshold: 1.0 (= full CPU saturation per core).

  • data_node_4 (i-07e02b117e9dbd0a9): 1.14 · BREACHED · 2×
  • data_node_1 (i-0e3ae1adac2380052): 1.14 · BREACHED · 1×
  • data_node_3 (i-0f2b2243f8e98a69a): 0.91 · HOT · did not breach
  • data_node_2 (—): 0.04 · IDLE

data_node_2 Idle — By Design

data_node_2 shows a max load of 0.04 across the 6-hour window — effectively zero indexing work. This is intentional: spreading shards across all 4 nodes via higher replica counts was tested previously and made bulk inserts slower (more nodes replicating each write → longer per-bulk latency). The current concentrated shard placement is a deliberate trade-off favoring insert speed over even load distribution. Not a factor in this incident.


Root Cause

Bulk engine rewritten for throughput — removed natural pacing

HaloSkins lane was disabled in prod and stayed disabled through this window, so only the own-inventory lane ran. The behavioral change is inside the bulk engine itself — confirmed by diffing openSearch.service.ts at commit 3abe319 (PR #99):

Before Deploy (old bulkIndex)

  • Sequential for-loop over chunks — 1 bulk request at a time
  • refresh: true on every chunk — forces commit/flush, naturally paces cluster
  • Pure doc-count chunking (splitEvery(5000))
  • No retry, no failure circuit breaker

After Deploy (new processBulkChunks)

  • 2 parallel workers for inventory (getBulkConcurrency(Inventory) = 2)
  • refresh_interval = -1 during bulk + refresh: false on each call — removes the natural brake
  • Byte-aware chunking (2500 docs or 8MB, whichever first)
  • Explicit single refresh + alias swap at the end
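The byte-aware chunking rule above (cut at 2500 docs or 8 MB, whichever comes first) can be sketched as follows. This is an illustrative reconstruction, not the actual parser code; the function name `chunkByDocsOrBytes` is hypothetical.

```typescript
// Hypothetical sketch of byte-aware chunking: close the current chunk when
// adding the next doc would cross either the doc-count or the byte limit.
const MAX_DOCS = 2500;
const MAX_BYTES = 8 * 1024 * 1024; // 8 MB

function chunkByDocsOrBytes<T>(docs: T[]): T[][] {
  const chunks: T[][] = [];
  let current: T[] = [];
  let currentBytes = 0;

  for (const doc of docs) {
    const docBytes = Buffer.byteLength(JSON.stringify(doc), "utf8");
    // Start a new chunk when either limit would be crossed.
    if (
      current.length > 0 &&
      (current.length >= MAX_DOCS || currentBytes + docBytes > MAX_BYTES)
    ) {
      chunks.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(doc);
    currentBytes += docBytes;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

The byte cap matters because doc-count-only chunking (the old `splitEvery(5000)`) can produce oversized bulk payloads when individual documents are large.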

Why This Spikes Load

Same dataset (~320k CSGO items), same total work — but compressed into a shorter, hotter window. With the old per-chunk refresh: true, the cluster committed segments between each of ~130 chunks, spreading the CPU over a longer duration. The new engine streams 2 bulk requests concurrently with refresh disabled, so all indexing happens in a tight burst followed by one final refresh. CPU peaks higher even though total CPU-seconds are similar (or lower).
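A quick back-of-envelope check of the chunk math above. Figures come from the report; the round count is an idealization that ignores per-chunk size variance.

```typescript
// ~320k docs at 2500 docs/chunk gives the "~130 chunks" cited above.
const totalDocs = 320_000;
const docsPerChunk = 2_500;
const chunks = Math.ceil(totalDocs / docsPerChunk); // 128

// Old engine: 128 sequential bulk requests, each followed by a forced refresh.
// New engine: 2 workers share the same 128 chunks with no per-chunk refresh,
// so the same work lands in roughly half the wall-clock rounds.
const workers = 2;
const roundsNew = Math.ceil(chunks / workers); // 64
console.log(chunks, roundsNew);
```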

The two spike windows (18:17–18:40 UTC and 22:22–22:48 UTC) align with 2 separate indexing cycles, each now burning hotter than pre-deploy. data_node_4 triggered in both because primary shards for inventory_730_own_* happen to concentrate there.


What Was Ruled Out

Not an I/O issue. Not a network issue. Not a deploy-itself issue.

  • Disk I/O? NO (iowait < 0.5%, write_time_pct avg 5.8%) → RULED OUT
  • Network? NO (rx ~3.5 MB/s, typical traffic shape) → RULED OUT
  • Deploy itself? NO (first alert came 2h51m after container swap) → RULED OUT

The signature is purely CPU-bound: CPU user peaked at 67% with load norm.1 of 1.14 on 4-vCPU boxes, i.e. roughly 4.56 runnable tasks competing for 4 cores. No disk or network bottleneck.
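For reference, load.norm.1 is the 1-minute load average divided by vCPU count, so the 4.56 figure falls out directly:

```typescript
// De-normalizing the alert metric: raw 1-min load average on a 4-vCPU box.
const vcpus = 4;
const loadNorm1 = 1.14; // the breached threshold value
const rawLoad = loadNorm1 * vcpus; // ≈ 4.56 runnable tasks
console.log(rawLoad.toFixed(2));
```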


Observability Gaps Exposed

Investigation hit blind spots — DEV-4638 relevance confirmed

No OpenSearch Integration Metrics

No opensearch.* / elasticsearch.* metrics in Datadog. No visibility into bulk thread-pool queue depth, rejected writes, JVM heap, GC, segment count. Had to infer from host-level system.* metrics only.

No Parser Logs in Datadog

Searches for service:tradeit-inventory-parser and source:*parser* returned zero results. Cannot directly confirm concurrent lane execution from log timestamps — had to infer from metric correlation.


Mitigation — Action Plan

Ranked by effectiveness and urgency

  • P0 · Merge parser PR #114 (Lane Scheduler + Backpressure) · Owner: Ehud / Nguyen V
    Why: The between-chunk backpressure directly addresses this scenario: when heap_used_percent > 80 or write-queue depth climbs during bulk, the parser inserts proportional delays, restoring the pacing the old refresh: true loop provided. Cheapest fix, no infra change.
  • P1 · Tune bulk concurrency: consider dropping getBulkConcurrency(Inventory) from 2 to 1 until backpressure ships · Owner: Ehud
    Why: One-line revert that restores sequential pacing. Gives immediate relief while PR #114 is in review. Trade-off: slower bulk, but current cycles already fit comfortably within the 10-min SLA.
  • P1 · Ship DEV-4638: OpenSearch dashboards + alerts · Owner: Ehud
    Why: Current observability is host-metrics-only. Need bulk thread-pool queue depth, JVM heap, pending tasks, and rejections to tune backpressure thresholds intelligently.
  • P2 · Scale data nodes: c6a.xlarge (4 vCPU) → c6a.2xlarge (8 vCPU) · Owner: DevOps / Ehud
    Why: Re-evaluate after PR #114 ships and uuskins/IGXE are enabled. Current 4-vCPU headroom is tight but sufficient for the own-only lane.
  • P2 · Enable parser log shipping to Datadog · Owner: DevOps
    Why: Closes the gap that blocked direct correlation in this investigation, where we had to infer from metric shapes alone.
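The P0 backpressure behavior can be sketched as a delay function polled between bulk chunks. This is a minimal sketch under stated assumptions, not the actual PR #114 code: the function name, thresholds, and scaling are illustrative, and the report only specifies "heap_used_percent > 80 or write-queue depth climbs → proportional delays".

```typescript
// Hypothetical between-chunk backpressure: return a pause (ms) proportional
// to how unhealthy the cluster looks. All names/constants are illustrative.
const HEAP_THRESHOLD_PCT = 80;
const MAX_DELAY_MS = 5_000;

interface ClusterStats {
  heapUsedPercent: number; // e.g. sampled from node JVM heap stats
  writeQueueDepth: number; // e.g. sampled from the write thread pool
}

function backpressureDelayMs(stats: ClusterStats): number {
  if (stats.heapUsedPercent <= HEAP_THRESHOLD_PCT && stats.writeQueueDepth === 0) {
    return 0; // cluster healthy: keep streaming chunks at full speed
  }
  // Scale delay with heap overshoot past the threshold (0-20% over → 0-MAX).
  const overshoot = Math.max(0, stats.heapUsedPercent - HEAP_THRESHOLD_PCT);
  const heapDelay = (overshoot / (100 - HEAP_THRESHOLD_PCT)) * MAX_DELAY_MS;
  // Queued writes add a flat per-request penalty.
  const queueDelay = stats.writeQueueDepth * 100;
  return Math.min(MAX_DELAY_MS, Math.round(heapDelay + queueDelay));
}
```

The point of the proportional shape is that it reintroduces pacing only when the cluster is actually hot, unlike the old per-chunk refresh, which paid the pacing cost unconditionally.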

References

Incident owner: Ehud Shahak · Generated 2026-04-15