AS-BUILT · tradeit-analytics · revised 2026-05-05

Aurora-Merged Analytics Replica

Revised 2026-05-05 — as-built state. All phases (1–5) complete. Live in production. Aurora Serverless v2 merged target. DMS for steamarbitrage bulk load. Fargate scheduled task for 7 pricing tables across 4 strategy groups. analytics-mcp Fargate service live at analytics-mcp.tradeit.gg.

SOLUTION

Aurora Serverless v2 merged cluster (writer-only) · DMS on-demand for steamarbitrage · Fargate every 5–15 min for pricing · ~$110–160/mo

  • 7 pricing tables · across 4 strategy groups
  • 118M rows · largest table, bot_trade_item_logs
  • 8d mutation tail · Steam revert window
  • 5–15 min refresh cadence · Fargate schedule
  • ~$110–160/mo · under the $200 P2 ceiling

Architecture decisions — completed

2026-04-29 design review — all implemented and live

Aurora merged (writer-only) — LIVE

Persistent Aurora Serverless v2 cluster, writer-only — operator access only, not website traffic. 0.5-2 ACU range, idles toward 0.5 ACU when nobody's querying.

Fargate scheduled task — LIVE

VPC/IAM-attached, unlimited runtime — supports bulk loads beyond the 15-min ceiling that constrains Lambda. Two tasks: pricing-replicator (0.25 vCPU / 0.5 GB, every 15 min) and analytics-mcp (ARM64, 0.25 vCPU / 512 MB) with cloudflared sidecar.

8-day mutation tail + DDL fix — DONE

bot_trade_item_logs can mutate up to 8 days post-creation (Steam revert). Source DDL fix applied (Phase 0 / OPS-208): ON UPDATE CURRENT_TIMESTAMP + updated_at_idx, INPLACE, no write blocking. Live.

DMS for steamarbitrage bootstrap — DONE

DMS full-load completed 2026-04-30. 167 tables, zero errors. Gap-fill DMS reload-target (OPS-225) completed 2026-05-05: 41 tables, ~196M rows, ~1h47m wall-clock. DMS daily scheduler cancelled (cost-prohibitive) — on-demand only per OPS-feedback.

incremental-with-cdc-deletes for reserved_items (OPS-230) — LIVE

New strategy: INSERT side uses WHERE reserved_at > watermark with 5-min lookback. DELETE side reads reserved_transactions for type IN (WITHDRAW, DELETE, EXPIRED, FAIL). Cold-start seeds CDC watermark to MAX(tx_id) to skip historical sweep. DELETE-before-INSERT order handles asset_id reuse race correctly.

Scope

What this spec covers

In scope
  • Persistent Aurora MySQL 8 merged cluster
  • Steamarbitrage refresh into steamarbitrage.* via AWS DMS (on-demand full-load; daily scheduler cancelled)
  • All 7 pricing tables synced into pricing.* via Fargate
  • Migration from EC2-MySQL to Aurora; retire i-0adec4087a63a67b5
  • Observability + alerting
Out of scope (follow-ups)
  • Steamarbitrage CDC (sub-day freshness) — full-load only in v1
  • App-side trade-revert event emission (would simplify Group D, but bigger design)

Priorities

Inherited from project decision doc — all hard

  • P1 · Zero prod impact
  • P2 · < $200/mo ceiling
  • P3 · Plain MySQL for MCP

Architecture

Aurora merged cluster fed by two independent pipelines — as-built 2026-05-05

analytics-mcp service (OPS-201 / OPS-227) — LIVE

Fargate ARM64 service (0.25 vCPU / 512 MB) running the tradeit-analytics MCP server with a cloudflared sidecar. Cloudflare Worker OAuth proxy fronts it at analytics-mcp.tradeit.gg, providing Claude AI with authenticated read-only access to the merged Aurora cluster via plain MySQL queries. Deployed Phase 4 cutover 2026-05-04. ~$8-15/mo.

flowchart TB
    classDef src fill:#1f2937,stroke:#f59e0b,stroke-width:2px,color:#fcd34d;
    classDef src2 fill:#1f2937,stroke:#10b981,stroke-width:2px,color:#6ee7b7;
    classDef compute fill:#1f2937,stroke:#a855f7,stroke-width:2px,color:#c084fc;
    classDef target fill:#0f1729,stroke:#22d3ee,stroke-width:3px,color:#22d3ee;
    classDef sched fill:#0f1729,stroke:#64748b,stroke-width:1px,color:#cbd5e1,stroke-dasharray: 4 4;
    classDef consumer fill:#1f2937,stroke:#ec4899,stroke-width:2px,color:#f9a8d4;
    SA["steamarbitrage-cluster
daily snapshot →
ephemeral restored cluster
~250 GB"]:::src
    PR["pricing-test cluster
Aurora reader (db.t4g.medium)
schema: steamtrade.*"]:::src2
    EB1["EventBridge
on-demand"]:::sched
    EB2["EventBridge
rate(15 min)"]:::sched
    DMS["AWS DMS
replication instance
dms.t3.medium"]:::compute
    FG["Fargate task
pricing-replicator
Node 22 + mysql2"]:::compute
    M[("Merged Aurora MySQL 8
Serverless v2 · writer-only
0.5–2 ACU · eu-west-1c")]:::target
    MB["Metabase"]:::consumer
    MCP["analytics-mcp
Fargate ARM64 + cloudflared
analytics-mcp.tradeit.gg
OPS-201/OPS-227"]:::consumer
    OPS["Operators
(SSM tunnel)"]:::consumer
    EB1 -->|"on-demand only
(no daily scheduler)"| DMS
    EB2 -->|trigger every 15 min| FG
    SA -->|"full-load via
ephemeral restore cluster"| DMS
    PR -->|incremental SELECT| FG
    DMS -->|"steamarbitrage.*"| M
    FG -->|"pricing.*"| M
    M --> MB
    M --> MCP
    M --> OPS
FIG. 1 · Two independent pipelines feeding one merged Aurora target · DMS on-demand only (no daily scheduler) · analytics-mcp Fargate ARM64 service at analytics-mcp.tradeit.gg

Sync flow per Fargate run

15-min cadence · per-table strategy dispatch

sequenceDiagram
    autonumber
    participant EB as EventBridge
    participant ECS as ECS Fargate
    participant SM as Secrets Manager
    participant SRC as Pricing Reader
    participant TGT as Merged Aurora
    participant CW as CloudWatch
    EB->>ECS: RunTask (pricing-replicator)
    ECS->>SM: GetSecret reader-ro + merged-rw
    SM-->>ECS: credentials
    ECS->>TGT: SELECT * FROM _repl_state
    TGT-->>ECS: watermarks (last_id, last_updated_at)
    Note over ECS,TGT: Group A · pricing_groups (39 rows)
    ECS->>SRC: SELECT * FROM pricing_groups
    ECS->>TGT: TRUNCATE + INSERT (full refresh)
    Note over ECS,TGT: Group B · 4 small update-only tables (36k each)
    ECS->>SRC: SELECT * (4 tables, parallel)
    ECS->>TGT: REPLACE INTO (full refresh)
    Note over ECS,TGT: Group C · item_eod (append-only, daily)
    ECS->>SRC: WHERE id > last_id LIMIT 50000
    ECS->>TGT: INSERT + advance watermark
    Note over ECS,TGT: Group D · bot_trade_item_logs (118M, 8d tail)
    ECS->>SRC: WHERE updated_at >= last_run - 5min
    ECS->>TGT: INSERT ... ON DUPLICATE KEY UPDATE
    ECS->>TGT: UPDATE _repl_state SET last_run_at = NOW()
    ECS->>CW: PutMetric RowsReplicated, LagSeconds, Errors
    ECS-->>EB: exit 0 (or alert on 3 consecutive failures)
FIG. 2 · Single replicator run touches all 4 group strategies in one process

Components

All in eu-west-1c, all VPC-internal — no EC2-MySQL

Merged Aurora cluster (new)

Aurora MySQL 8 Serverless v2, 0.5-2 ACU range. Writer-only — operator access only, no website-traffic SLA. Native PITR, automatic backups.

~$50-90/mo · idles toward 0.5 ACU when no operators are querying

Pricing Aurora reader

db.t4g.medium reader on pricing cluster. Reuse if cluster already has a reader. Storage-layer replication, no binlog needed, no writer impact.

~$50/mo (or $0 if reusing)

AWS DMS replication instance

dms.t3.medium, ephemeral. Started on demand via EventBridge → runs full-load task → terminates. Used for steamarbitrage refresh and the Group C/D bootstrap.

~$5-10/mo (ephemeral) or ~$30/mo (always-on)

Fargate scheduled task

pricing-replicator: 0.25 vCPU / 0.5 GB. Container: Node 22 + mysql2. Triggered by EventBridge every 5-15 min. VPC-attached. No time limit.

~$5/mo

Watermark control table

pricing._repl_state(table_name PK, last_id, last_updated_at, last_run_at, rows_last_run, status, error_message)

Single source of truth per source table
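
A minimal DDL sketch of the control table. The column names come from the spec above; the types and the status values are assumptions:

CREATE TABLE IF NOT EXISTS pricing._repl_state (
  table_name      VARCHAR(64) NOT NULL PRIMARY KEY,  -- one row per source table
  last_id         BIGINT UNSIGNED NULL,              -- Group C watermark (id)
  last_updated_at DATETIME NULL,                     -- Group D watermark (updated_at)
  last_run_at     DATETIME NULL,                     -- set to NOW() on success
  rows_last_run   INT UNSIGNED NULL,
  status          ENUM('ok','error') NOT NULL DEFAULT 'ok',  -- assumed values
  error_message   TEXT NULL
);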

Secrets & IAM

New: tradeit-analytics/pricing-reader-ro, tradeit-analytics/merged-aurora-rw. DMS endpoint roles, Fargate task role.

No KMS grants needed (Aurora-native encryption)

Strategy quadrant

Why each table got its strategy — size × write-frequency

FIG. 3 axes: write frequency (never → daily → 5–10 min → continuous) against row count (10² → 10⁸, log scale). Quadrant rules: large + cold → bulk-load · large + hot → incremental + UPSERT · tiny + cold → full refresh · small + hot → full refresh (cheap).
  • A · pricing_groups · static reference · 39 rows · TRUNCATE+INSERT
  • B · 4× ~36k tables (item_prices_raw, adjust_percent, store_/trade_item_stats) · small + hot · REPLACE INTO every run, sub-second each
  • C · item_eod · append-only daily · millions of rows · incremental by id, daily cadence
  • D · bot_trade_item_logs · large + 8-day mutation tail · 118M rows · incremental by updated_at + UPSERT
FIG. 3 · Bubble area ≈ row count (log) · position picks the strategy

 Group A — static reference + incremental-with-cdc-deletes

  • banned_users · full-refresh every run
  • trade_revert_reserved_items · full-refresh every run
  • user_favorite_items · full-refresh every run
  • pricing_groups · 39 rows · static · full-refresh every run

Strategy: full-table refresh on every Fargate run. Negligible cost; eliminates drift risk on tiny reference data.
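
As a sketch, the per-table refresh is two statements; pricing_source stands in for the reader endpoint, matching the examples below:

-- Full refresh for a tiny reference table; same shape for banned_users,
-- trade_revert_reserved_items, and user_favorite_items.
TRUNCATE TABLE pricing.pricing_groups;
INSERT INTO pricing.pricing_groups
SELECT * FROM pricing_source.pricing_groups;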

  • reserved_items · incremental-with-cdc-deletes · 15-min cadence · OPS-230

Strategy: incremental-with-cdc-deletes (OPS-230) — previously full-refresh at 4h cadence, now incremental at 15 min.
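
A sketch of one run, following the OPS-230 description; the join key and any column names beyond reserved_at, tx_id, and type are assumptions:

-- DELETE side first: apply removals recorded in reserved_transactions since
-- the CDC watermark. Running deletes before inserts handles the asset_id
-- reuse race.
DELETE ri
FROM pricing.reserved_items ri
JOIN pricing_source.reserved_transactions tx
  ON tx.asset_id = ri.asset_id          -- assumed join key
WHERE tx.tx_id > :last_tx_id
  AND tx.type IN ('WITHDRAW', 'DELETE', 'EXPIRED', 'FAIL');

-- INSERT side second, with the 5-minute lookback on reserved_at. The upsert
-- makes the overlapping lookback window idempotent.
INSERT INTO pricing.reserved_items
SELECT *
FROM pricing_source.reserved_items
WHERE reserved_at > :last_reserved_at - INTERVAL 5 MINUTE
ON DUPLICATE KEY UPDATE reserved_at = VALUES(reserved_at);

-- Cold start: seed :last_tx_id from MAX(tx_id) so the first run skips the
-- historical transaction sweep.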

 Group B — small + update-only + frequent

  • item_prices_raw · 36k · updates every 10 min
  • item_price_adjust_percent · 36k · updates every 10 min
  • store_item_stats · 36k · updates every 5 min
  • trade_item_stats · 36k · updates every 5 min

Strategy: full-table refresh on every replicator run (cadence locked to 15 min; see Resolved notes). 36k rows loads in under a second, which eliminates incremental complexity entirely: no updated_at tracking, no watermarks, no UPSERT race conditions.

REPLACE INTO pricing.item_prices_raw
SELECT * FROM pricing_source.item_prices_raw;
-- same shape × 4 tables in this group

 Group C — large + append-only + daily

  • item_eod · millions of rows · daily inserts only, no updates

Strategy: incremental by id, daily cadence. No need for the 5-15 min cycle since rows only land once a day.

SELECT <columns>
FROM pricing_source.item_eod
WHERE id > :last_id
ORDER BY id LIMIT 50000;
-- paginate, INSERT into target, advance watermark
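
The "advance watermark" step at the end of a successful page loop is a single UPDATE against _repl_state (a sketch using the columns defined above):

UPDATE pricing._repl_state
SET last_id = :max_id_this_run,
    rows_last_run = :row_count,
    last_run_at = NOW(),
    status = 'ok'
WHERE table_name = 'item_eod';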

 Group D — 118M rows + 8-day mutation tail + guess_questions

  • bot_trade_item_logs · 118M+ rows · continuous inserts + updates · incremental-updated-at (OPS-208)
  • guess_questions · 15-min cadence · ~2ms steady-state · OPS-226

guess_questions is now on 15-min cadence after the source-side idx_updated_at_id covering index landed (OPS-226, 2026-05-05). Steady-state query time: ~2ms.
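
The OPS-226 index shape, reconstructed from its name (a sketch; the exact DDL lives with the source cluster's migration):

ALTER TABLE guess_questions
  ADD INDEX idx_updated_at_id (updated_at, id),
  ALGORITHM=INPLACE, LOCK=NONE;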

bot_trade_item_logs is the site's activity log: every transaction lands here. Most mutations settle within 15 min (trade auto-cancel), but Steam's trade-revert feature can mutate a row up to 8 days after creation. Uses updated_at_idx from OPS-208.

Strategy: incremental by updated_at with UPSERT, every 5–15 min. The watermark predicate catches reverts at any age because the Phase 0 source DDL fix makes updated_at auto-bump on every mutation at the DB layer, independent of app code.

SELECT <columns>
FROM pricing_source.bot_trade_item_logs
WHERE updated_at >= :last_run_at - INTERVAL 5 MINUTE
ORDER BY updated_at, id LIMIT 50000;
-- uses updated_at_idx (added in Phase 0)
-- watermark catches an 8-day-old revert as easily as a 5-min-old insert

Loaded with INSERT ... ON DUPLICATE KEY UPDATE for idempotent replay.
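
A sketch of that load; the column list is abbreviated, and only id and updated_at are confirmed by this spec:

INSERT INTO pricing.bot_trade_item_logs (id, updated_at /* , ...remaining columns */)
VALUES (:id, :updated_at /* , ... */)
ON DUPLICATE KEY UPDATE
  updated_at = VALUES(updated_at) /* , ...remaining columns */;
-- Replaying a page is a no-op, so a failed run can safely re-read the same
-- 5-minute-lookback window on the next cycle.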

 Phase 0 DDL (in scope): add ON UPDATE CURRENT_TIMESTAMP + updated_at_idx on source. INPLACE + LOCK=NONE on Aurora MySQL 8 — ~30-90 min on 118M rows, no write blocking, ~1.5-3 GB extra storage.

 Bootstrap via DMS, not Fargate paginated. 118M rows full-load in ~30-90 min on parallel streams. See Phase 3.

Table size distribution

Log scale — the gap between 39 and 118M rows is 6.5 orders of magnitude

FIG. 4 · Row count per source table (log10 scale) · coloured by group

Replication SLO budget

Lag targets per group — alert thresholds drive Slack pages

  • A · pricing_groups · ≤ 15 min
  • B · 4× small tables · ≤ 5 min
  • C · item_eod · ≤ 24 h
  • D · bot_trade_item_logs · ≤ 30 min


Error handling & observability

Error handling
  • Per-table try/catch — failures isolated
  • On error: status=error, watermark NOT advanced (see the sketch after this list)
  • Replay on next run — UPSERT/REPLACE makes it safe
  • EventBridge: 2 retries with backoff
  • 3 consecutive failures → SNS → Slack
  • DMS: built-in CloudWatch metrics + table validation
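
The watermark-NOT-advanced contract from the list above, in SQL terms (a sketch): on a per-table failure only the status columns change, so the next run re-reads the same window.

UPDATE pricing._repl_state
SET status = 'error',
    error_message = :error_text
WHERE table_name = :failed_table;
-- last_id / last_updated_at deliberately untouched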
Observability
  • CW namespace tradeit-analytics/PricingReplicator
  • RowsReplicated, RunDurationMs, LagSeconds, Errors per {table}
  • DMS task metrics: FullLoadProgressPercent, throughput, validation
  • Datadog: per-group lag gauge, DMS daily-task duration, Aurora ACU usage
  • Alarms: Group D lag > 30 min, Group B lag > 10 min → Slack

Migration & rollout

All 5 phases complete — live in production as of 2026-05-05

gantt
    title Migration timeline (target: 4-week rollout)
    dateFormat YYYY-MM-DD
    axisFormat %b %d
    section Phase 0 — DDL prereq
    bot_trade_item_logs ALTER (INPLACE) :crit, p0, 2026-04-29, 2d
    Verify updated_at bumps :p0v, after p0, 1d
    section Phase 1 — Aurora stand-up
    Provision Serverless v2 cluster :done, p1a, 2026-04-29, 1d
    Schemas + _repl_state + secrets :done, p1b, after p1a, 1d
    section Phase 2 — Validation
    Build Fargate container, push ECR :done, p2a, 2026-04-29, 1d
    Deploy task def + EventBridge :done, p2b, after p2a, 1d
    24h soak vs pricing_staging :active, p2c, 2026-04-29, 1d
    section Phase 3 — Initial bulk load
    DMS full-load steamarbitrage :p3a, after p2c, 1d
    DMS full-load Group C/D (118M) :p3b, after p3a, 1d
    Fargate BOOTSTRAP small tables :p3c, after p3b, 1d
    Row-count parity + seed _repl_state :p3d, after p3c, 1d
    section Phase 4 — Cut over
    Enable Fargate schedule :p4a, after p3d, 1d
    DMS daily schedule :p4b, after p4a, 1d
    Switch Metabase + MCP DSN :p4c, after p4b, 1d
    24h watch :p4d, after p4c, 1d
    section Phase 5 — Decommission legacy host
    Stop scheduled jobs on legacy host :p5a, after p4d, 1d
    Snapshot data dir :p5b, after p5a, 1d
    Stop EC2 (preserve 1 week) :p5c, after p5b, 1d
    Terminate :p5d, after p5c, 7d
FIG. 5 · Sequenced rollout · Phase 0 DDL is the hard prerequisite gating Group D bootstrap
Phase 0 — Source DDL prerequisites DONE

On the production pricing cluster, apply two schema changes to bot_trade_item_logs (one ALTER statement) before any sync runs:

ALTER TABLE bot_trade_item_logs
  MODIFY COLUMN updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
    ON UPDATE CURRENT_TIMESTAMP,
  ADD INDEX updated_at_idx (updated_at),
  ALGORITHM=INPLACE, LOCK=NONE;

~30-90 min on 118M rows. INPLACE + LOCK=NONE means no write blocking. ~1.5-3 GB additional index storage (Aurora auto-scales). Coordinate with the team that owns the pricing cluster. Verify with one synthetic UPDATE that updated_at bumps as expected before declaring Phase 0 complete.
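
One way to run that verification (a sketch; :probe_id and the touched column are illustrative). Note that MySQL skips no-op updates, so the synthetic UPDATE must write a genuinely different value for updated_at to bump:

SELECT updated_at INTO @before
FROM bot_trade_item_logs WHERE id = :probe_id;

UPDATE bot_trade_item_logs
SET some_mutable_col = some_mutable_col + 1   -- hypothetical column; must actually change
WHERE id = :probe_id;

SELECT updated_at > @before AS bumped
FROM bot_trade_item_logs WHERE id = :probe_id;  -- expect 1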

Phase 1 — Stand up Aurora merged DONE
  1. Provision Aurora Serverless v2 cluster (writer-only, 0.5-2 ACU)
  2. Create steamarbitrage + pricing schemas (empty)
  3. Create pricing._repl_state watermark table
  4. Create user accounts + Secrets Manager entries
Phase 2 — Validate against staging schema DONE
  1. Build Fargate container, push to ECR
  2. Deploy ECS task definition, EventBridge rule (disabled)
  3. Run task manually against pricing_staging.* schema
  4. Verify all 4 group strategies (A/B/C/D)
  5. ≥24h soak, lag < expected per group
  6. Cutover via atomic RENAME TABLE pricing_staging.* TO pricing.* (2026-04-30) — preserves watermarks, no data movement
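
The cross-schema rename in step 6 is a single multi-table statement, which is what makes it atomic (table list abbreviated and illustrative):

RENAME TABLE
  pricing_staging.pricing_groups       TO pricing.pricing_groups,
  pricing_staging.item_eod             TO pricing.item_eod,
  pricing_staging.bot_trade_item_logs  TO pricing.bot_trade_item_logs,
  pricing_staging._repl_state          TO pricing._repl_state;
-- ...remaining pricing tables follow the same pattern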
Phase 3 — Initial bulk load (OPS-205) DONE 2026-04-30

DMS full-load of steamarbitrage completed 2026-04-30. 167 tables, zero errors. Ephemeral restore cluster used (zero prod impact). Scope reduced: pricing C/D and Group A/B bootstrap were moot — pricing.* already populated via Phase 2 cutover.

Gap-fill DMS reload-target (OPS-225) completed 2026-05-05: 41 tables, ~196M rows backfilled, ~1h47m total wall-clock.

Phase 4 — Cut over & enable steady-state (OPS-206) DONE
  1. Fargate EventBridge schedule enabled 2026-04-30 (15-min pricing-replicator-schedule + daily pricing-replicator-daily for Group C)
  2. DMS daily scheduler cancelled (cost-prohibitive) — on-demand only going forward
  3. analytics-mcp Fargate service deployed 2026-05-04 (OPS-201 / OPS-227) — Claude MCP DSN pointing at merged Aurora via analytics-mcp.tradeit.gg
  4. Metabase migration pending (OPS-161, operator-driven — not auto-migratable due to schema rename steamtrade.* → pricing.* / steamarbitrage.*)
Phase 5 — Decommission legacy host (OPS-207) DONE 2026-04-30
  • EC2 i-0adec4087a63a67b5 terminated
  • EBS vol-08d3b49fbcf44f43a (500 GB) deleted
  • IAM role + instance profile tradeit-analytics-merged-mysql deleted
  • Lambda tradeit-analytics-restore-pipeline deleted
  • KMS alias/rds-restore scheduled deletion 2026-05-07
  • 6 obsolete secrets scheduled deletion 2026-05-07

Testing

Unit: predicate builders per group, REPLACE/UPSERT SQL gen, watermark math, pagination loop termination, full-refresh idempotency.

Integration: ephemeral MySQL via docker-compose. Each group's strategy verified against synthetic data. Late-arriving update (8-day-old row) caught by updated_at predicate.

Staging E2E (Phase 2): ≥24h soak against pricing_staging.*. Row-count and MAX(id) / MAX(updated_at) parity per table.
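
The per-table parity probe, run on both source and target and diffed (a sketch; COUNT(*) on the 118M-row table is a full index scan, so schedule it accordingly):

-- The three values must match between source and target per table.
SELECT COUNT(*) AS row_count,
       MAX(id) AS max_id,
       MAX(updated_at) AS max_updated_at
FROM bot_trade_item_logs;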

Aurora load test: verify Serverless v2 ACU range (0.5-2) handles peak Fargate write rate without throttling.

Cost

Observed monthly steady-state — as of 2026-05-05

FIG. 6 · Monthly cost ranking ($/mo) · teal = managed services, emerald = small line items
Item · $/mo
Aurora merged cluster (Serverless v2, writer-only, 0.5–2 ACU; brief 8 ACU peaks during heavy DMS reloads) · ~$30–60
Aurora pricing reader (db.t4g.medium, same-AZ) · ~$50
DMS replication instance (dms.t3.medium, on-demand only) · ~$50
Fargate pricing-replicator (0.25 vCPU / 0.5 GB, every 15 min) · ~$15–25
Fargate analytics-mcp (ARM64, 0.25 vCPU / 512 MB + cloudflared sidecar) · ~$8–15
Cloudflare Worker (OAuth proxy) · free tier
CloudWatch Logs + Metrics · ~$3
Data transfer (same-AZ VPC) · $0
Total monthly steady-state · ~$110–160

Stays under P2 ceiling. Aurora Serverless v2 observed at 0.5 ACU in steady state, with brief spikes to 8 ACU max during heavy DMS reloads. DMS instance is on-demand (not always-on) — the ~$50/mo figure assumes occasional bulk operations. Cloudflare Worker on free tier.

Out of scope — captured follow-ups

 Source-side updated_at — DONE (Phase 0 / OPS-208)

ON UPDATE CURRENT_TIMESTAMP + updated_at_idx applied INPLACE on the 118M-row table. Live.

 Source-side idx_updated_at_id on guess_questions — DONE (OPS-226, 2026-05-05)

Covering index landed on the source cluster. guess_questions now on 15-min cadence at ~2ms steady-state query time.

 DMS gap-fill reload-target — DONE (OPS-225, 2026-05-05)

41 tables, ~196M rows backfilled, ~1h47m total wall-clock.

 incremental-with-cdc-deletes for reserved_items — DONE (OPS-230)

New strategy live at 15-min cadence. Replaces prior full-refresh at 4h cadence. See Group A section for full strategy details.

 OAuth hygiene hardening (OPS-228) — open

analytics-mcp Cloudflare Worker OAuth proxy. Low urgency; token rotation and expiry hygiene improvements.

 Analytics safe redacted views (OPS-160 / OPS-164) — open, Low priority

Row-level security / redacted views for PII fields. Not blocking current MCP usage (analytics_ro already scoped).

 Datadog APM (OPS-200) — open, Low priority

APM traces for the Fargate replicator task. CloudWatch metrics cover current observability needs.

 Metabase per-question migration (OPS-161) — open, operator-driven

Not auto-migratable. Requires manually updating each saved question's DSN + schema references from steamtrade.* → pricing.* / steamarbitrage.*. No timeline set.

 Steamarbitrage CDC for sub-day freshness — future

Currently on-demand full-load only. For sub-day freshness, evaluate DMS CDC (requires binlog_format=ROW on writer). Cost sign-off required before any recurring DMS task for steamarbitrage.
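
If CDC is ever evaluated, the writer-side prerequisite check is one statement (on Aurora the value comes from the cluster parameter group):

SHOW VARIABLES LIKE 'binlog_format';  -- must report ROW before any DMS CDC task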

Open questions

All implementation-time questions resolved — see "Resolved" line below

No open questions remain. Ongoing operational concerns (DMS daily refresh schedule, Metabase + MCP DSN cutover, EC2 retirement) are tracked in Linear (OPS-205, OPS-206, OPS-207).

 Resolved this revision: table spelling (bot_trade_item_logs plural, confirmed by DDL) · updated_at column status (folded into Phase 0) · Aurora sizing (Serverless v2 writer-only) · bootstrap path for 118M rows (DMS, not Fargate) · pricing reader instance provisioned (pricing-test-reader-analytics) · Group B refresh cadence locked to 15 min (15-min freshness sufficient given 10-min source updates) · Phase 0 DDL ran 2026-04-29 outside any specific window · DMS source for steamarbitrage: snapshot-restored ephemeral cluster (NOT live reader). P1 zero-prod-impact makes this non-negotiable; a live reader endpoint would put read load on the source writer when no replicas exist.

Generated 2026-04-26 · visualised 2026-04-29 · as-built 2026-05-05 · tradeit-analytics · branch main
This HTML is the canonical source of truth for the architecture. All phases complete. Live in production.