tradeit.gg

Week 17 · 2026

Engineering Operations

April 13 – 19

Highlight of the Week

Marketplace 10x shipped to production, site slowness resolved, and full platform documentation written: 24 repos + 13 wiki pages

What We Shipped

Delivered This Week

One flagship feature, one infra win, one documentation blitz

Marketplace 10x — Production
Shipped the marketplace 10x feature — a major upgrade to search and item discovery, powered by marketplace-aware OpenSearch. Significantly faster and smarter results for traders.
Step change in marketplace performance and UX
Ehud · 10x project
Infra Cost Cleanup — Redis + RDS
Moved ElastiCache into eu-west-1c to eliminate cross-AZ transfer costs. Removed migration replicas. Flagged and deleted an orphaned Aurora cluster (steamtrade-25-cluster) that was accumulating daily 245 GB snapshots nobody was using.
~$2,105/mo saved — details on Expenses slide
Ehud · Infra
Full Platform Documentation
Generated CLAUDE.md for all 24 active repos. Created 13 new wiki pages: architecture, service catalog, trade lifecycle, pricing, auth, OpenSearch, Redis patterns, deployment, and platform config.
Zero to full engineering docs — wiki is now self-serve for the whole company
Ehud (via Claude Code) · Documentation

Incident Resolved

Site Slowness — Fixed by Kim

Week-long slowness traced to Nginx upstream TCP handling

Root Cause
Nginx was opening a new TCP connection for every request to the Node.js upstream, then closing it immediately. At our request volume, the Node.js HTTP server was overloaded with TCP handshakes, causing slow responses and event loop stalls.
upstream backend {
    server ip-172-31-42-95...:3000;
    server ip-172-31-44-148...:3000;
    # No keepalive: every request = new TCP handshake
}
The Fix — Upstream Keepalive
Added keepalive directives to the upstream block so Nginx reuses open connections instead of opening a new one per request. Node.js already supports HTTP keep-alive (5s default timeout); Nginx was simply closing each connection after a single request.
upstream backend {
    server ip-172-31-42-95...:3000;
    server ip-172-31-44-148...:3000;
    keepalive 128;
    keepalive_timeout 3s;
    keepalive_requests 1000;
}
Site performance restored immediately
Kim · Nginx · Infrastructure
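The effect of connection reuse can be reproduced locally. The sketch below is not the production setup: a throwaway Node.js server stands in for the upstream and a Node `http.Agent` stands in for Nginx, and the code counts how many TCP connections the server accepts with and without keep-alive. The helper names (`sendRequests`, `countConnections`) are hypothetical.

```typescript
import * as http from "node:http";
import type { AddressInfo } from "node:net";

// Sends `count` GET requests one after another through `agent`,
// waiting for each response body to be fully consumed before the next.
async function sendRequests(port: number, agent: http.Agent, count: number): Promise<void> {
  for (let i = 0; i < count; i++) {
    await new Promise<void>((resolve, reject) => {
      const req = http.get({ host: "127.0.0.1", port, agent }, (res) => {
        res.resume(); // drain the body so the socket can be released
        res.on("end", () => resolve());
      });
      req.on("error", reject);
    });
  }
}

// Starts a local server, sends `count` requests, and returns how many
// TCP connections the server accepted while doing so.
async function countConnections(keepAlive: boolean, count: number): Promise<number> {
  let connections = 0;
  const server = http.createServer((_req, res) => res.end("ok"));
  server.on("connection", () => {
    connections += 1; // fires once per accepted TCP connection
  });
  await new Promise<void>((resolve) => server.listen(0, "127.0.0.1", () => resolve()));
  const { port } = server.address() as AddressInfo;

  // keepAlive: true keeps idle sockets in the agent's pool for reuse;
  // keepAlive: false closes the socket after every response.
  const agent = new http.Agent({ keepAlive });
  try {
    await sendRequests(port, agent, count);
  } finally {
    agent.destroy();
    await new Promise<void>((resolve) => server.close(() => resolve()));
  }
  return connections;
}
```

Calling `countConnections(true, 5)` and `countConnections(false, 5)` should show one connection for the keep-alive agent and one connection per request otherwise. On the Nginx side, the documentation for the upstream `keepalive` directive also calls for `proxy_http_version 1.1` and an empty `Connection` header in the proxying location; the slide does not show whether those were already in place.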

Ongoing Investigation

Node Event Loop & DB Connections

Site slowness resolved, but underlying signals still being investigated

Event Loop Blocking
Datadog is showing signals of Node event loop blocking. With Kim's keepalive fix deployed, the site is stable — but the event loop data suggests there may be deeper issues worth investigating (hot code paths, synchronous operations under load).
Less urgent now — site is stable, investigation ongoing
Infra · Monitoring
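For anyone who wants to cross-check the Datadog signal from inside a process, here is a minimal sketch using Node's built-in `perf_hooks.monitorEventLoopDelay`. It is an illustration under assumptions, not the team's monitoring setup: `blockFor` simulates a synchronous hot path, and `measureMaxDelayMs` is a hypothetical helper name.

```typescript
import { monitorEventLoopDelay } from "node:perf_hooks";

const NS_PER_MS = 1e6; // the histogram reports nanoseconds

// Simulates a synchronous operation that holds the event loop for `ms`.
function blockFor(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // busy wait on purpose
  }
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Samples event loop delay around a synchronous block and returns
// the largest delay observed, in milliseconds.
async function measureMaxDelayMs(blockMs: number): Promise<number> {
  const histogram = monitorEventLoopDelay({ resolution: 10 });
  histogram.enable();
  await sleep(50);   // let the sampler take a few baseline samples
  blockFor(blockMs); // the loop cannot run timers during this call
  await sleep(50);   // give the delayed sample a chance to be recorded
  histogram.disable();
  return histogram.max / NS_PER_MS;
}
```

A 200 ms synchronous block should surface as a maximum delay in the same range, while an idle process stays near the sampling resolution. In a real service the histogram would stay enabled and `histogram.percentile(99)` would be reported periodically rather than read once.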
DB Connection Spike — 3,400
Detected abnormally high database connection count (3,400, far above normal). Likely related to the same upstream issue — excessive connections accumulating when the event loop stalled. Monitoring for recurrence post-fix.
Watching connection counts post-keepalive fix
Ehud · Investigation

Reliability

System Health

Stable — Site Slowness Resolved

Slowness Resolved

Kim's Nginx keepalive fix eliminated TCP handshake overload

Event Loop Under Watch

Datadog showing blocking signals — investigation continues, site stable

Expenses

AWS Cost Savings — This Week

All infra items audited and right-sized in eu-west-1

| Item | Details | Before | After | Monthly Saving |
|---|---|---|---|---|
| Redis (ElastiCache): AZ consolidation + removed replicas | Moved to eu-west-1c, eliminated cross-AZ traffic, removed migration replicas | $52/day | $27/day | ~$2,050 |
| RDS: orphaned Aurora cluster deleted | steamtrade-25-cluster: no instances, 7 daily 245 GB snapshots silently accumulating | ~$55/mo | $0 | ~$55 |
| Elastic IPs: 30 released (Steam Value Tracker) | AWS bills ~$3.65/EIP/mo since Feb 2024; charges were silently accruing on a terminated instance | ~$110/mo | $0 | ~$110 |
| EC2: 2 terminated, 1 right-sized | Killed Steam Value Tracker (t2.xlarge) + Logs staging (c5.large). Cron server: m6a.large → t3a.medium | ~$160/mo | $0 | ~$160 |
| EBS: 487 volumes gp2 → gp3 + 5 orphans deleted | 4,888 GB migrated live, zero downtime. gp3 is 20% cheaper with equal/better performance | ~$115/mo | ~$0 | ~$115 |
| Total Monthly Saving | | | | ~$2,490 / mo · ~$29,880 / yr |
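The totals can be re-derived from the Monthly Saving column. The snippet below is a small arithmetic check using only the figures stated in the table above; the `SavingItem` type and the variable names are illustrative.

```typescript
interface SavingItem {
  item: string;
  monthlyUsd: number; // approximate monthly saving as stated in the table
}

const savings: SavingItem[] = [
  { item: "Redis (ElastiCache)", monthlyUsd: 2050 },
  { item: "RDS orphaned Aurora cluster", monthlyUsd: 55 },
  { item: "Elastic IPs", monthlyUsd: 110 },
  { item: "EC2", monthlyUsd: 160 },
  { item: "EBS gp2 -> gp3", monthlyUsd: 115 },
];

// Sum of the per-item monthly savings.
const totalMonthlyUsd = savings.reduce((sum, s) => sum + s.monthlyUsd, 0);

// Annualised figure quoted on the summary slide.
const totalAnnualUsd = totalMonthlyUsd * 12;

// Elastic IP line: 30 addresses at ~$3.65 each per month.
const eipMonthlyUsd = 30 * 3.65;

console.log({ totalMonthlyUsd, totalAnnualUsd, eipMonthlyUsd });
```

The sum comes to 2,490 per month and 29,880 per year, matching the total row, and the Elastic IP line works out to about 109.50, which the table rounds to ~$110.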

Team Activity

Development Metrics (Swarmia · Apr 13–19)

Review Rate
% of PRs that received at least one review before merge
67%
from 94%
Down 27 points — heavy shipping week, less review bandwidth
Time to First Review
Median time from PR opened to first review comment
4.1h
from 17h
Down 13 hours — back to normal after holiday
Cycle Time
Median time from first commit to PR merged
3.4d
from 3.6d
Slightly improved
PRs Merged
Total pull requests merged this week
58
from 43
5 contributors · 15 deploys
Change Failure Rate
% of deployments that caused an incident or required rollback
60%
same as last week
Flat — still high, needs attention
MTTR
Mean Time To Restore — avg time to recover service after a failed deploy
14.5h
from 28h
Down 13.5 hours — nearly halved

4-Week Trend

Development Metrics Over Time (Mar 23 – Apr 19)

[Charts: Review Rate (%), higher is better · Time to First Review (h), lower is better · Cycle Time (days), lower is better · Change Failure Rate (%), lower is better]

Observability

The Platform, Watched

575 servers  ·  5 engineers  ·  Datadog watching everything

Servers Monitored 24/7
575
across AWS eu-west-1 fleet
Services Monitored
39
infra agents · APM on backend
Deploys This Week
15
~2.1 per day · 58 PRs merged
Monitor Health — 17 Active Monitors
15 healthy  ·  1 warning  ·  1 alerting
2 Issues Under Watch
RDS storage >90%  — still alerting, under remediation
Redis high memory  — steamtrade cluster warning
Investigating: Node event loop blocking signals in Datadog — site stable post-keepalive fix

Admin Panel Redesign

Before

The old admin — functional, but dated

[Screenshot: admin panel before redesign]

Admin Panel Redesign

After — Neon Minimal

Glassmorphism, card-based layout, consistent design language across every page

What's Next

Coming Up — Week 18

Event Loop Investigation
Close out the event loop blocking investigation. Keepalive fix stabilized the site — now dig into whether there are deeper code-level issues causing blocking under load.
Auto QA — Trade Flow Coverage
With login/logout + SEO checks live, next is automating trade flows: inventory, deposits, sales, payments. Full happy-path coverage.
Read-Only DB Replica
Daily copy of tradeit + pricing DBs for AI agents and admin users. Isolates analytics queries from production load.

Week 17 in Numbers

Big Ship. Incident Resolved. Docs Complete.

58
PRs merged
15
deploys
$29.9K
annual infra savings
37
docs (24 repos + 13 wiki)

Site fast again. Wiki answers questions about our code. Admin looks like 2026. $29.9K/yr in infra savings found this week.