tradeit.gg

Week 16 · 2026

Engineering Operations

April 6 – 12

Highlight of the Week

3 shipped, 3 in motion — holiday week (Passover Apr 9–11) & first auto QA tests live + 10x OpenSearch in staging

What We Shipped

Delivered This Week

Short week, focused output

Auto QA — First Tests Live
Full auto-test coverage for login/logout flows + page structure validation including SEO metadata checks. First production slice of the QA automation platform.
No more manual login & SEO checks per release
Auto QA · Phase 1
Item Details Caching — 10x Marketplace
Implemented item details caching on the marketplace-aware OpenSearch project. Eliminates redundant detail lookups during search and trade flows.
Faster marketplace query response, reduced DB pressure
Ehud · 10x project
Cost Opportunity — Redis Same-AZ
Spotted Redis instances running in a different AZ from the backend services calling them. AWS bills cross-AZ traffic ($0.01/GB each way) — co-locating Redis with backend eliminates the charge entirely.
Estimated monthly savings under calculation
Infra · cost optimization
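Until the real traffic figure is measured, the saving can be framed as a rough estimate. The sketch below assumes the quoted rate applies once on each side of the transfer, so every GB crossing the AZ boundary costs $0.02 in total. The function name and the 5,000 GB/month volume are hypothetical placeholders, not measured values.

```javascript
// Rate quoted in the report: $0.01/GB in each direction, expressed in cents
// so the arithmetic stays in whole numbers.
const CENTS_PER_GB_PER_DIRECTION = 1;

// Estimates the monthly cross-AZ charge in US dollars.
// gbPerMonth: total GB exchanged between backend and Redis across AZs.
function crossAzMonthlyCostUsd(gbPerMonth) {
  const cents = gbPerMonth * CENTS_PER_GB_PER_DIRECTION * 2; // billed on both sides
  return cents / 100;
}

// Hypothetical volume for illustration only; the real figure is still being calculated.
console.log(crossAzMonthlyCostUsd(5000)); // 100
```

Plugging in the measured monthly Redis traffic would turn this into the actual savings estimate.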
DEV-4815 — OpenSearch Query Fan-Out Reduction
Every trade lookup was hitting 4 OpenSearch indexes (CS:GO, Rust, TF2, Steam) per call — multiplying thread-pool pressure and causing rejected_execution_exception errors at peak. Fixed by threading appId through the call chain so each queryByAssetIds call scopes to a single game. Shipped end-to-end: backend (#1244, #1245), old-tradeit (#144), frontend (#3257). Verified in APM: 1 OS span per call post-deploy vs 4 before.
4x headroom on the OS search thread pool for every trade operation
Ehud · Nguyen V
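The scoping idea can be sketched roughly as follows. This is an illustration, not the code from the PRs: the index names, the `INDEX_BY_APP_ID` map, and `indexesForLookup` are hypothetical, and only `appId` and the four games come from the report. The numeric keys are the public Steam app IDs for each title.

```javascript
// Hypothetical mapping from Steam appId to an OpenSearch index name.
// The index names are assumptions made for illustration only.
const INDEX_BY_APP_ID = {
  730: "items-csgo",
  252490: "items-rust",
  440: "items-tf2",
  753: "items-steam",
};

// Returns the indexes a lookup should query.
// Without an appId the caller has to fan out to every index (the old behaviour);
// with an appId the lookup is scoped to exactly one index.
function indexesForLookup(appId) {
  if (appId === undefined || appId === null) {
    return Object.values(INDEX_BY_APP_ID);
  }
  const index = INDEX_BY_APP_ID[appId];
  if (!index) {
    throw new Error(`Unknown appId: ${appId}`);
  }
  return [index];
}

console.log(indexesForLookup().length); // 4
console.log(indexesForLookup(730));     // [ 'items-csgo' ]
```

Failing loudly on an unknown `appId` is a design choice in this sketch: silently falling back to all indexes would reintroduce the fan-out without anyone noticing.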
Backend Runtime Metrics Enabled (PR #1246)
Added DD_RUNTIME_METRICS_ENABLED: true to ecosystem.config.js env_production. Surfaces runtime.node.event_loop.delay.max, GC pauses, heap, and CPU from every backend pod. Immediately caught 1–3 second event-loop stalls during a load spike that were previously invisible.
New diagnostic dimension for backend tail latency
Ehud
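For readers unfamiliar with the file, the change would look roughly like this in a PM2 ecosystem file. This is a sketch only: the app name and script path are placeholders, and everything except the `DD_RUNTIME_METRICS_ENABLED` entry under `env_production` is assumed rather than taken from PR #1246.

```javascript
// ecosystem.config.js (PM2), sketch only.
module.exports = {
  apps: [
    {
      name: "backend",            // hypothetical app name
      script: "./dist/server.js", // hypothetical entry point
      env_production: {
        NODE_ENV: "production",
        // Asks the Datadog Node.js tracer to emit runtime metrics
        // (event-loop delay, GC pauses, heap, CPU).
        DD_RUNTIME_METRICS_ENABLED: true,
      },
    },
  ],
};
```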
Admin Panel — Full UI Redesign
Complete rewrite of the internal admin panel: neon-minimal dark theme, glassmorphism cards, new dashboard-as-menu homepage (tool grid replaces sidebar on /), animated FLIP morph navigation, Marketplace Config page, and role-based tool routing via shared composables.
Modernised ops tooling — faster navigation, cleaner UX
Ehud · tradeit-admin

Admin Panel — Before & After

Internal Tool UI Redesign

Neon-minimal dark theme · Dashboard-as-menu navigation · Marketplace Config page

Before: [screenshot: admin panel before redesign]

After (Dashboard): [screenshot: new admin dashboard]

After (Inner Page): [screenshot: new admin inner page]

In Progress

Active Work

Three tracks running in parallel

OpenSearch 10x — Staging
Marketplace-aware search deployed to staging. Running end-to-end tests with production-scale data.
On track for prod promotion
DEV-4759
Auto QA — Trade Flow Tests
Building on the live login/logout + page structure suite. Trade flows are next — most complex paths in the platform (inventory, deposits, sales, payments).
Phase 2 starting
OPS-127–147
OS Spike + Event-Loop Investigation (Apr 15)
Datadog alert at 20:38 IDT (17:38Z): data_node_1 + data_node_4 saturated (load.1 5.3, CPU 60–87%) for ~10 min. Backend Node.js event-loop stalls of 1–3s correlated in the same window, but also occurred at 14:41Z when OS was healthy. These look like two separate problems; causation is not proven. Full report on ops.tradeit.gg
P2 open · two-track fix plan (OS query shape + Node.js profiler)
Ehud

Reliability

System Health

Monitoring — CPU Investigation Active

No Incidents

No customer-facing issues this week

3 Alerts Under Watch

RDS storage + host load spike + Apr 15 OS data_node_1 load (P2, auto-recovered) — all caught by Datadog before user impact

Team Activity

Development Metrics (Swarmia · Apr 6–12)

Review Rate
94%
from 81%
Up 13 points — Claude Code Review
Time to First Review
17h
from 6.5h
Holiday week — fewer reviewers online
Cycle Time
3.6d
from 3.3d
Stable
PRs Merged
43
from 29
5 contributors · 5 deploys
Change Failure Rate
60%
from 88%
Down 28 points — improving, still high
MTTR
28h
from 44h
Down 16 hours
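The week-over-week changes quoted above can be re-derived from the raw values. The snippet below is a small sanity-check sketch: the numbers are copied from the Swarmia figures in this section, while the `metrics` object and `delta` helper are illustrative names, not part of any tooling mentioned in the report.

```javascript
// Week-over-week values from the metrics above: [previous, current].
const metrics = {
  reviewRatePct: [81, 94],
  timeToFirstReviewHours: [6.5, 17],
  cycleTimeDays: [3.3, 3.6],
  prsMerged: [29, 43],
  changeFailureRatePct: [88, 60],
  mttrHours: [44, 28],
};

// Returns current minus previous, rounded to one decimal to avoid float noise.
function delta([previous, current]) {
  return Math.round((current - previous) * 10) / 10;
}

for (const [name, pair] of Object.entries(metrics)) {
  console.log(name, delta(pair));
}
```

Positive deltas are good for review rate and PRs merged; for the other four metrics, lower is better.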

4-Week Trend

Development Metrics Over Time (Mar 16 – Apr 12)

Review Rate (%) — higher is better
Time to First Review (h) — lower is better
Cycle Time (days) — lower is better
Change Failure Rate (%) — lower is better

Observability

The Platform, Watched

575 servers  ·  5 engineers  ·  Datadog watching everything

Servers Monitored 24/7
575
across AWS eu-west-1 fleet
Services Monitored
39
infra agents · APM on backend
HTTP Requests This Week
~730M
weekly avg · ~1,200 req/s
Monitor Health — 18 Active Monitors
14 healthy  ·  1 warning  ·  2 alerting
APM Spans Traced / Day
~2M
backend service · distributed tracing
2 Issues Auto-Detected This Week — Before Any User Report
RDS storage >90%  — steamarbitrage & pricing-test · under remediation
Host load spike  — detected Apr 5–6 · investigation in progress
Next: SLO dashboards — define uptime & latency targets per service, track against them weekly
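As a quick consistency check on the traffic figures above, the weekly request count and the per-second average can be reconciled with simple arithmetic. The helper name below is illustrative; the only input is the ~730M weekly figure from this section.

```javascript
const SECONDS_PER_WEEK = 7 * 24 * 60 * 60; // 604,800

// Converts a weekly request total into an average requests-per-second rate.
function averageRequestsPerSecond(weeklyRequests) {
  return weeklyRequests / SECONDS_PER_WEEK;
}

console.log(Math.round(averageRequestsPerSecond(730_000_000))); // 1207
```

The result matches the ~1,200 req/s average quoted above. Peak rates will be higher, since this is a flat average over the whole week.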

What's Next

Coming Up — Week 17

10x to Production
Marketplace-aware OpenSearch passes staging validation → promote to production. Unlocks multi-marketplace search without full reindexes.
Auto QA — Trade Flow Coverage
With login/logout + SEO checks already live, next is automating trade flows: inventory, deposits, sales, payments. Target: full happy-path coverage so PM stops manually testing every release.
Read-Only DB Replica
Daily copy of tradeit + pricing DBs for AI agents and admin users. Isolates analytics queries from production load.

Week 16 in Numbers

Short Week. Strong Output.

~730M
requests served
43
PRs merged
0
user incidents
2
issues auto-caught

OpenSearch 10x ships next week.