tradeit.gg

Week 20 · 2026

Engineering Operations

May 4 – 10

Headlines of the Week

Tradebot shared-IP pilot validated at N=2 · Inventory reservation moved to per-doc timestamps · 2 foundation deep-dives shipped

11 work items shipped · 42 PRs merged org-wide · 4 systems documented

Architecture & Performance

Inventory Reservation — Redesigned

Fixed Inventory-Browse Slowdown — Same Shape as Apr 15 Incident
When a user starts a trade, the items are "reserved" for 6–15 min. The old design sent the full reserved list (often 1000+ item IDs) with every browse query as an exclusion filter. At the 10:00 IL peak, that filter was large enough to spike OpenSearch CPU and stall the Node event loop, the same shape as the April 15 incident. In the redesign, each item document carries its own reserved-until timestamp, and the query simply excludes documents where reserved-until > now. Query cost stays constant regardless of reservation count.
Root cause of Apr 15 incident pattern removed · behind useOsReservation flag · instant SQL rollback · PRs: backend #1300, parser #147
Ehud (via Claude Code) · May 10
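
A minimal sketch of the query-shape change, written as Python dicts in OpenSearch query DSL. The field names asset_id and reserved_until and both helper functions are hypothetical, chosen for illustration; the real mapping and query builder live in backend #1300 and parser #147 and may differ.

```python
import time

def browse_filter_old(reserved_ids):
    """Old shape: every browse query carries the full reserved-ID list."""
    # Hypothetical field name `asset_id`; the clause grows with the list.
    return {"bool": {"must_not": [{"terms": {"asset_id": list(reserved_ids)}}]}}

def browse_filter_new(now_ms=None):
    """New shape: exclude documents whose reservation has not expired yet."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Hypothetical field name `reserved_until`, stored as epoch milliseconds.
    return {"bool": {"must_not": [{"range": {"reserved_until": {"gt": now_ms}}}]}}

old = browse_filter_old(range(1000))
new = browse_filter_new()
print(len(old["bool"]["must_not"][0]["terms"]["asset_id"]))  # scales with reservations
print(new)  # one range clause, however many items are reserved
```

A range clause does not match documents that lack the field, so in this sketch items that were never reserved pass through without needing a default value.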

Architecture & Performance

Analytics Replica — Two Operational Fixes

Backfilled 29 Analytics Tables
When we added "last-updated" timestamps to 37 trading source tables last week, rows updated in the prior 2.5 days became silently invisible to the incremental sync. Ran a targeted DMS reload on the 29 smaller tables in a low-traffic window: 4 min, 0 errors, ~$0.10 cost.
29 tables back to byte-for-byte freshness · 8 heavier tables scheduled off-peak
Ehud · May 4 · OPS-225
Remote Analytics MCP — Deployed (OAuth Gap)
Built remote MCP on AWS Fargate (ARM/Graviton) so Claude can query analytics DB from any Mac/PC without per-user setup. Live at analytics-mcp.tradeit.gg behind Cloudflare Tunnel + Google SSO, read-only at 3 layers, ~$8/mo. Remaining gap: Claude Desktop OAuth flow incompatible with CF Access SSO.
Infra in place · CF Workers OAuth proxy ~1–2h work · tracked OPS-227
Ehud · May 4 · OPS-201

Tools & Automation

Three Internal Tools Shipped

Claude Queries the Analytics DB
Read-only MCP connector lets Claude (Desktop, Code, Cursor, Continue) query the merged analytics DB in plain English. "Show me top 10 affiliates this week" → SQL + result + explanation. Read-only at 3 layers: DB grants, wrapper script, query timeout.
Data exploration without SQL skill · team guide live at ops.tradeit.gg
Ehud · May 4 · OPS-201
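
Of the three read-only layers, the wrapper-script layer is the one that lends itself to a short illustration. The sketch below is a hypothetical, deliberately conservative check, not the deployed wrapper: it accepts a single statement that starts with SELECT or WITH and rejects anything containing a write keyword. DB grants and the query timeout remain the real backstops.

```python
import re

# Keywords that should never reach a read-only connector (assumed list).
_FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant|revoke|into)\b",
    re.IGNORECASE,
)

def is_read_only(sql: str) -> bool:
    """Conservative check: one statement, starting with SELECT or WITH."""
    body = sql.strip().rstrip(";").strip()
    if not body or ";" in body:
        return False  # empty input or more than one statement
    first_word = body.split(None, 1)[0].lower()
    if first_word not in ("select", "with"):
        return False
    # Also blocks data-modifying CTEs and SELECT ... INTO.
    return _FORBIDDEN.search(body) is None

print(is_read_only("SELECT * FROM affiliates LIMIT 10;"))
print(is_read_only("SELECT 1; DROP TABLE users"))
```

The check errs toward false rejections (a string literal containing "delete" is refused), which is the safer failure mode for a guard of this kind.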
Interactive Tutor
Any wiki page or markdown doc → self-paced browser tutorial tailored to one of 4 personas (Technical, Support, Product, Tech-Light). Auto-generates diagrams, comprehension quiz, one-line takeaway per topic. Stress-tested on tradebot doc: 152 slides across 4 personas, all validated green.
Lowers onboarding cost for complex systems · lands at ops.tradeit.gg/tutor
Ehud · May 10
Ops Dashboard Layout Fix
As more sections landed on ops.tradeit.gg, the left column started clipping Research entries and the brand logo overlapped content. Switched the columns to scroll independently with neon scrollbars + soft fade, and moved the logo to the top-right page-chrome zone. Codified the fix in the page-design skill.
Dashboard grows gracefully · future generated pages inherit the fix
Ehud · May 10

Documentation

Two Foundation Deep-Dives Shipped

Tradebot Service — Fully Documented
Mapped end-to-end docs for the ~400-process Steam tradebot fleet that executes every trade. Three surfaces: deep-dive wiki page, single-page ops.tradeit.gg visual guide, in-repo CLAUDE.md. Covers Redis channels, Bull queue contracts, full 28-guard validation chain + the 18-guard chain in tradeit-backend, HTTP callbacks, kill-switch keys, the auto-QA contract.
New engineer (or AI) can operate the bot fleet from docs alone · surfaced 1 dead-code event (finishedtrades has no subscribers) · foundation for auto-QA
Ehud · May 9
OpenSearch Patterns — Foundation Tier Complete
Third foundation-tier deep-dive (after Platform Config + Redis). Covers cluster topology, the rick/morty alias-swap zero-downtime reindex contract, the reserved-asset must_not filter, marketplace fanout naming, and all 4 OS-consuming services. Every claim cited with file:line. Published as 32.7K-char wiki page + ops.tradeit.gg HTML mirror with 3 Mermaid diagrams.
Foundation tier complete · surfaced 8 open questions · next: Trade Lifecycle (Tier B, DEV-4959)
Ehud · May 10 · DEV-4958
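
For readers new to the alias-swap pattern, a minimal sketch of the request body, assuming rick and morty name the two physical indexes and using a hypothetical alias name ("inventory"). OpenSearch applies all actions in one POST /_aliases call atomically, which is what makes the reindex zero-downtime; the actual contract is in the deep-dive.

```python
import json

def alias_swap_actions(alias, old_index, new_index):
    """Build the body for POST /_aliases that moves `alias` in one atomic call."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

# Hypothetical alias name; readers keep querying the alias while it is repointed.
body = alias_swap_actions("inventory", old_index="rick", new_index="morty")
print(json.dumps(body, indent=2))
```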

Infra & Cost

Tradebot Shared-IP Pilot — Steam Said Yes

Two Bots on One IP — N=2 Probe Cleared
Today: one Steam bot per dedicated AWS server, one public IP per bot, ~400 of them. The cost-reduction program (OPS-229) wants many bots behind shared IPs. The biggest unknown was Steam itself — would it flag two accounts logging in from the same egress IP? Built a NAT server in eu-west-1c, snapshot-cloned two staging bots into a private subnet routed through it, forced fresh logins. Both bots completed refresh-token login with no Steam Guard, no 2FA challenge, no rate-limits, no errors. 72-hour endurance window now running.
The unknown that gated the whole plan now has a positive answer at N=2 · next: 14-day production probe with 2 real bots → 3 → full rollout
Ehud · May 10 · OPS-231 (parent OPS-229)

Infra & Cost

Hot Inventory API — Edge Caching Designed

Two-PR Change for Cloudflare Edge Cache
Traced our AWS internet-bandwidth bill (~$413/day) and found a single API endpoint — the inventory feed loaded on every trade and store page — accounts for more than half of it. Designed a two-PR change letting Cloudflare hold the response at the edge for 1h for anonymous browsers (vs 1min today), with cache-clear triggered from the inventory-indexer cron whenever prices/stock change. Logged-in users keep the short cache so live cart/reservation stays fresh — subtle correctness constraint that took two design iterations.
Engineering shipped · awaits Cloudflare Cache Rule update + new CF API token · PRs: backend #1299, parser #146
Ehud · May 10
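
A rough sketch of the two halves of the design: choosing a TTL per audience, and building the purge request the indexer cron would send. The planned change uses a Cloudflare Cache Rule, so the origin-header approach here is an assumption made for illustration, as are the function names, the zone ID placeholder, and the feed URL. The purge endpoint itself is Cloudflare's standard purge-by-URL API.

```python
ANON_EDGE_TTL_S = 3600  # 1h at the edge for anonymous browsers
SHORT_TTL_S = 60        # the 1min cache in place today

def cache_control(is_logged_in: bool) -> str:
    """Pick a Cache-Control value for the inventory feed response."""
    if is_logged_in:
        # Assumption: logged-in users keep today's 1min policy unchanged.
        return f"public, max-age={SHORT_TTL_S}"
    # s-maxage applies to shared caches such as a CDN edge.
    return f"public, max-age={SHORT_TTL_S}, s-maxage={ANON_EDGE_TTL_S}"

def purge_request(zone_id: str, urls: list) -> tuple:
    """Build (url, json_body) for a purge-by-URL call; nothing is sent here."""
    endpoint = f"https://api.cloudflare.com/client/v4/zones/{zone_id}/purge_cache"
    return endpoint, {"files": list(urls)}

print(cache_control(is_logged_in=False))
print(cache_control(is_logged_in=True))
# Placeholder zone ID and URL; the real feed path is not given in this report.
print(purge_request("ZONE_ID", ["https://example.com/inventory-feed"]))
```

Keeping the audience split in one function makes the correctness constraint explicit: the long TTL can never apply to a response built for a logged-in session.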

Monitoring & Observability

Silent Replicator Failure — Caught & Wired

4-Hourly Analytics Replicator Failing for ~36h, No Alert
Discovered the 4-hourly analytics replicator had been failing every fire since setup — rejected by AWS before the task could even start. Root cause: one-character mismatch — EventBridge override said pricing-replicator but the actual container is replicator. Nothing crashed loudly; just silent FailedInvocations in CloudWatch. 5 analytics tables (reserved_items, trade_revert_reserved_items, banned_users, user_favorite_items, guess_questions) had been stuck at their setup-time snapshot.
Fixed and verified · added CloudWatch FailedInvocations alarms on both schedule rules (15-min + 4h) wired to existing Slack pipeline · next silent rejection pages in ~1 min, not 36h
Ehud · May 4 · OPS-210

Lesson: any new EventBridge → ECS rule needs real-task verification, not just config-on-paper.
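
A cheap static pre-check can complement real-task verification, though it does not replace it. The sketch below compares the container names in a rule target's override against the names the task definition actually defines. The function name is hypothetical; the dict shapes mirror what boto3 returns from events.list_targets_by_rule (one Target) and ecs.describe_task_definition (the "taskDefinition" value), with illustrative values.

```python
import json

def mismatched_overrides(target: dict, task_definition: dict) -> list:
    """Return override container names that the task definition does not define."""
    defined = {c["name"] for c in task_definition.get("containerDefinitions", [])}
    overrides = json.loads(target.get("Input") or "{}").get("containerOverrides", [])
    return [o.get("name") for o in overrides if o.get("name") not in defined]

# Illustrative inputs reproducing the failure described above.
target = {"Input": json.dumps({"containerOverrides": [
    {"name": "pricing-replicator", "command": ["run"]}]})}
task_def = {"containerDefinitions": [{"name": "replicator"}]}

print(mismatched_overrides(target, task_def))  # ['pricing-replicator']
```

An empty list means the names line up; it says nothing about IAM, networking, or the task itself, which is why a real test fire is still needed.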

Team Activity

Team Velocity — W20

May 4 – 10 · live from Swarmia · DORA metrics are org-wide

PRs Merged
42
5 contributors · Swarmia
Review Rate
83%
PRs with ≥1 review
Cycle Time
4.1d
open → merge, median
Deploy Freq
1.6/d
org-wide · 11 deploys
Change Lead Time
8.5h
org-wide · Swarmia
Change Failure Rate
36%
org-wide · Swarmia
MTTR
6.1h
org-wide · Swarmia
Time to First Review
9.4h
median · Swarmia
PRs In Progress
78
end-of-week snapshot

CFR elevated this week — worth a look at what tripped the 4 failed deploys.
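
The "4 failed deploys" figure follows from the cards above; a quick arithmetic check, with helper names that are only illustrative:

```python
def deploys_per_day(deploys: int, days: int = 7) -> float:
    """Deploy frequency over the reporting window, to one decimal."""
    return round(deploys / days, 1)

def implied_failures(deploys: int, change_failure_rate: float) -> int:
    """Failed deploys implied by a change failure rate, rounded to a whole deploy."""
    return round(deploys * change_failure_rate)

print(deploys_per_day(11))         # 1.6
print(implied_failures(11, 0.36))  # 4
```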

Observability

The Platform, Watched

Live from Datadog · what we can see right now

Active Hosts
infrastructure agents reporting
Services Monitored
Datadog agents installed
APM Tracing
backend only · coverage gap
Observability Posture
Datadog infrastructure monitoring is broadly deployed — host metrics and basic agent telemetry across the fleet. APM tracing is only enabled on the backend service, which means request-level visibility into the trading flow, inventory parser, admin backend, tradebots, and the rest of the platform is still dark. The Apr 15 incident (and the inventory-browse fix shipped this week) would have been faster to diagnose with end-to-end traces.
Next observability investment: extend APM tracing to the highest-value services (inventory parser, tradebots, tradeit-service)

Host/service counts pending live Datadog MCP fetch — auth flow in progress.

Next Up

What's Loaded for W21

High Priority

Tradebot Shared-IP — 14-Day Production Probe
Promote the N=2 staging result to 2 real production bots for a 14-day window. Watch for delayed Steam flags or rate-limits before scaling to N=3.
Flip useOsReservation in Production
After staging confirmation, flip the inventory-reservation flag and measure OS data-node CPU + event-loop delay at the 10:00 IL peak. Rollback is one SQL update.
Cloudflare Edge-Cache Rule Activation
Coordinate the CF Cache Rule update + token rotation that unblocks the hot inventory API change already in PR.

Medium Priority

Trade Lifecycle Deep-Dive (Tier B)
Next foundation doc — likely split into deposit / withdrawal / p2p / instant-sell pages per the spec. DEV-4959.
Remote MCP OAuth Bridge
~1–2h Cloudflare Workers OAuth Provider proxy so Claude Desktop's Custom Connector can speak to analytics-mcp.tradeit.gg. OPS-227.
Backfill the 8 Heavier Analytics Tables
Off-peak DMS reload for the 8 heavier tables deferred from the May 4 backfill (OPS-225).

Number of the Week

1 character

The difference between pricing-replicator and replicator

One typo in an EventBridge override silently failed our analytics replicator for 36 hours. AWS rejected every fire before the task could even start — no crash, no alert, just FailedInvocations ticking up in CloudWatch. We caught it, fixed it, and wired alarms so the next silent rejection pages on-call in ~1 minute.

That's the week. Onward.