Incident Report · April 14, 2026
Post-deploy load threshold breaches on marketplace-aware indexing rollout
P1 RESOLVED — 2 Data Nodes Breached load.norm.1 > 1.0
8 alert events across a ~4-hour window post-deploy — all auto-recovered within minutes
One-paragraph summary
Following the marketplace-aware OpenSearch indexing prod rollout on 2026-04-14 (~15:06–15:40 UTC), the parser’s bulk engine was rewritten: 2 parallel workers (was 1 sequential), refresh-interval disabled during bulk (was refresh: true per chunk), and 8MB byte-aware chunking. HaloSkins lane is disabled, so only the own-inventory lane was running — but the new engine pushes the same data volume into a shorter, hotter CPU window. Starting at 18:31 UTC (~3 hours post-deploy), two of four data nodes breached the normalized 1-minute load threshold during those compressed indexing cycles. Mitigation (parser PR #114) adds cluster-health backpressure that self-throttles when CPU gets hot.
Deploy-to-alert correlation — all timestamps verified from git log + Datadog events
| Event | Detail |
|---|---|
| DEPLOY | chore(deploy): update prod ip — parser container swap completes |
|  | data_node_4 (0.69 avg climbing to 1.14) and data_node_1 (jumps from 0.07 to 0.67) |
| ALERT | data_node_1 load > 1.0 |
| ALERT | data_node_4 load > 1.0 |
| RECOVERED | data_node_4 |
| RECOVERED | data_node_1 |
| ALERT | data_node_4 |
| RECOVERED | data_node_4 |
| ALERT | data_node_4 — second indexing cycle hotspot |
| RECOVERED | data_node_4 — final event |
4-node cluster, c6a.xlarge (4 vCPU), all in eu-west-1c
Peak load norm.1 during 18:17–18:40 UTC window. Threshold: 1.0 (= full CPU saturation per core).
data_node_2 Idle — By Design
data_node_2 shows load max 0.04 across the 6-hour window — effectively zero indexing work. This is intentional: higher replica counts to spread shards across all 4 nodes were previously tested and made bulk inserts slower (more nodes replicating each write → longer per-bulk latency). The current concentrated placement is a deliberate trade-off favoring insert speed over even load distribution. Not a factor in this incident.
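The replica trade-off described above can be made concrete with a hedged sketch; the settings object, index name, and values below are illustrative assumptions, not the live index config:

```typescript
// Illustrative index settings for the trade-off described above: fewer
// replicas means each bulk write fans out to fewer nodes, lowering per-bulk
// latency at the cost of concentrating indexing load on the primary's node.
// Names and values are assumptions, not taken from the production config.
const ownInventoryIndexSettings = {
  settings: {
    number_of_shards: 1,   // assumption: few primaries concentrate writes
    number_of_replicas: 1, // kept low on purpose; higher counts slowed bulk inserts
  },
};

// A bulk write is acknowledged by the primary and each replica, so the
// write fan-out per document is (1 + number_of_replicas) copies:
function writeFanOut(numberOfReplicas: number): number {
  return 1 + numberOfReplicas;
}
```

This is why raising replica counts to spread load backfired for insert speed: every added replica multiplies the write work per bulk request.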
Bulk engine rewritten for throughput — removed natural pacing
HaloSkins lane was disabled in prod and stayed disabled through this window, so only the own-inventory lane ran. The behavioral change is inside the bulk engine itself — confirmed by diffing openSearch.service.ts at commit 3abe319 (PR #99):
Before Deploy (old bulkIndex)
- for-loop over chunks — 1 bulk request at a time
- refresh: true on every chunk — forces a commit/flush, naturally paces the cluster
- count-based chunking (splitEvery(5000))

After Deploy (new processBulkChunks)
- 2 parallel workers (getBulkConcurrency(Inventory) = 2)
- refresh_interval = -1 during bulk + refresh: false on each call — removes the natural brake
- 8MB byte-aware chunking

Why This Spikes Load
Same dataset (~320k CSGO items), same total work — but compressed into a shorter, hotter window. With the old per-chunk refresh: true, the cluster committed segments between each of ~130 chunks, spreading the CPU over a longer duration. The new engine streams 2 bulk requests concurrently with refresh disabled, so all indexing happens in a tight burst followed by one final refresh. CPU peaks higher even though total CPU-seconds are similar (or lower).
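The two mechanics of the rewrite — byte-aware chunking and a fixed pool of concurrent bulk workers — can be sketched as follows. This is a minimal illustration, not the actual processBulkChunks code; sendBulk is a stand-in for the real OpenSearch client call:

```typescript
// Byte-aware chunking: size chunks by serialized payload (e.g. 8MB budget),
// not by item count as the old splitEvery(5000) did.
function chunkByBytes<T>(items: T[], maxBytes: number): T[][] {
  const chunks: T[][] = [];
  let current: T[] = [];
  let currentBytes = 0;
  for (const item of items) {
    const itemBytes = Buffer.byteLength(JSON.stringify(item), "utf8");
    // Start a new chunk when adding this item would exceed the byte budget.
    if (current.length > 0 && currentBytes + itemBytes > maxBytes) {
      chunks.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(item);
    currentBytes += itemBytes;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

// Fixed-concurrency streaming: at most `concurrency` bulk requests in flight.
// With refresh disabled there is no per-chunk commit to pace the loop.
async function processChunksConcurrently<T>(
  chunks: T[][],
  concurrency: number,
  sendBulk: (chunk: T[]) => Promise<void>,
): Promise<void> {
  let next = 0;
  // Each worker claims the next unprocessed chunk until none remain.
  const worker = async () => {
    while (next < chunks.length) {
      const chunk = chunks[next++];
      await sendBulk(chunk);
    }
  };
  await Promise.all(Array.from({ length: concurrency }, worker));
  // In the real flow, refresh_interval is restored and one final refresh
  // is issued here, after all bulk requests complete.
}
```

With concurrency 2 and no per-chunk refresh, the cluster sees a continuous stream of back-to-back bulk requests — exactly the compressed, hotter window described above.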
The two spike windows (18:17–18:40 UTC and 22:22–22:48 UTC) align with 2 separate indexing cycles, each now burning hotter than pre-deploy. data_node_4 triggered in both because primary shards for inventory_730_own_* happen to concentrate there.
Not an I/O issue. Not a network issue. Not a deploy-itself issue.
| Hypothesis | Verdict | Evidence |
|---|---|---|
| Disk I/O? | NO | iowait < 0.5%; write_time_pct avg 5.8% |
| Network? | NO | rx ~3.5 MB/s; typical traffic shape |
| Deploy itself? | NO | First alert was 2h51m after container swap |
The signature is purely CPU-bound: CPU user at 67% peak with load.norm.1 of 1.14 on 4-vCPU boxes — a raw 1-minute load of ~4.56 runnable tasks against 4 cores. No disk or network bottleneck.
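For the arithmetic behind those figures: system.load.norm.1 is the 1-minute load average divided by vCPU count, so the quoted values convert as:

```typescript
// system.load.norm.1 = (1-minute load average) / vCPU count.
// Converting back: a norm value of 1.14 on a 4-vCPU c6a.xlarge means
// roughly 4.56 runnable tasks competing for 4 cores.
function rawLoadFromNorm(normLoad: number, vcpus: number): number {
  return normLoad * vcpus;
}
```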
Investigation hit blind spots — DEV-4638 relevance confirmed
No OpenSearch Integration Metrics
No opensearch.* / elasticsearch.* metrics in Datadog. No visibility into bulk thread-pool queue depth, rejected writes, JVM heap, GC, segment count. Had to infer from host-level system.* metrics only.
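The missing numbers are exposed by the cluster itself via GET _nodes/stats/jvm,thread_pool (the OpenSearch node stats API). A sketch of extracting the fields named above — the response shape mirrors the real API, but the summary function and field selection are ours:

```typescript
// Subset of the GET _nodes/stats/jvm,thread_pool response shape that
// carries the metrics this investigation lacked.
interface NodeStats {
  nodes: Record<string, {
    jvm: { mem: { heap_used_percent: number } };
    thread_pool: { write: { queue: number; rejected: number } };
  }>;
}

// Flatten per-node stats into the three signals worth dashboarding:
// heap pressure, write-queue depth, and rejected writes.
function summarizeNodeStats(stats: NodeStats) {
  return Object.entries(stats.nodes).map(([id, node]) => ({
    id,
    heapUsedPercent: node.jvm.mem.heap_used_percent,
    writeQueueDepth: node.thread_pool.write.queue,
    writeRejections: node.thread_pool.write.rejected,
  }));
}
```

Until DEV-4638 ships Datadog integration metrics, even a cron that polls this endpoint would have answered the queue-depth question directly.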
No Parser Logs in Datadog
Searches for service:tradeit-inventory-parser and source:*parser* returned zero results. Cannot directly confirm concurrent lane execution from log timestamps — had to infer from metric correlation.
Ranked by effectiveness and urgency
| Rank | Action | Owner | Why |
|---|---|---|---|
| P0 | Merge parser PR #114 (Lane Scheduler + Backpressure) | Ehud / Nguyen V | The between-chunk backpressure directly addresses this scenario: when heap_used_percent > 80 or write-queue depth climbs during bulk, the parser inserts proportional delays, restoring the pacing the old refresh: true loop provided. Cheapest fix — no infra change. |
| P1 | Tune bulk concurrency — consider dropping getBulkConcurrency(Inventory) from 2 to 1 until backpressure ships | Ehud | One-line revert to restore sequential pacing. Gives immediate relief while PR #114 is in review. Trade-off: slower bulk — but current cycles already fit within the 10-min SLA comfortably. |
| P1 | Ship DEV-4638 — OpenSearch dashboards + alerts | Ehud | Current observability is host-metrics-only. Need bulk thread-pool queue depth, JVM heap, pending tasks, rejections to tune backpressure thresholds intelligently. |
| P2 | Scale data nodes — c6a.xlarge (4 vCPU) → c6a.2xlarge (8 vCPU) | DevOps / Ehud | Re-evaluate after PR #114 ships and uuskins/IGXE are enabled. Current 4-vCPU headroom is tight but sufficient for own-only lane. |
| P2 | Enable parser log shipping to Datadog | DevOps | Closes the gap that blocked direct correlation in this investigation — had to infer from metric shapes alone. |
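As a sketch of the backpressure mechanism PR #114 is described as adding — the heap threshold comes from the description above, but the queue threshold and scaling factors here are illustrative assumptions, not the PR's actual values:

```typescript
// Between-chunk backpressure: when heap or write-queue pressure is high,
// delay the next bulk request proportionally, restoring the pacing the old
// per-chunk refresh: true loop provided.
function backpressureDelayMs(heapUsedPercent: number, writeQueueDepth: number): number {
  const HEAP_THRESHOLD = 80;  // from the PR #114 description above
  const QUEUE_THRESHOLD = 50; // assumption: tolerable write-queue depth
  let delay = 0;
  if (heapUsedPercent > HEAP_THRESHOLD) {
    delay += (heapUsedPercent - HEAP_THRESHOLD) * 100; // 100ms per % over threshold (assumed)
  }
  if (writeQueueDepth > QUEUE_THRESHOLD) {
    delay += (writeQueueDepth - QUEUE_THRESHOLD) * 20; // 20ms per queued task over threshold (assumed)
  }
  return delay;
}
```

Under calm conditions the delay is zero and throughput is unchanged; the throttle only engages when the cluster is already hot, which is exactly the failure mode seen in this incident.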
Monitor query: avg(last_5m):avg:system.load.norm.1{*} by {host,name} > 1

Incident owner: Ehud Shahak · Generated 2026-04-15