tradeit.gg

TRADEBOT INFRA · OPERATOR GUIDE · 2026-06-12

Tradebot Deploy — Operator Guide

How to deploy the bot fleet, what you'll see while it runs, and what each durability safeguard does. The deploy has three layers: deploy.sh (the command you run) → operator/* (runs on your machine and drives AWS SSM) → host/recreate-containers.sh (runs on each bot host).

1. How to run

one front door — pick target + speed

# from an operator machine (AWS creds, eu-west-1)
./infra/deploy/deploy.sh [target] [speed] [botId] [imageRef] [-y]

# target = staging | bot | all     speed = routine | fast | immediate  (all only)

./infra/deploy/deploy.sh                      # fully interactive — it prompts
./infra/deploy/deploy.sh staging              # staging trio 898/899/900
./infra/deploy/deploy.sh bot 539              # one production bot
./infra/deploy/deploy.sh all routine          # whole fleet, safe pace (default)
./infra/deploy/deploy.sh all fast             # whole fleet, ~2× faster
./infra/deploy/deploy.sh all immediate -y     # whole fleet at once (emergencies)

It shows a plan + the speed table, confirms once (type the host count for a fleet/staging deploy, or yes for a single bot), then runs the right script for you. Pass -y to skip the prompt in automation.
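The confirm-once rule can be sketched as a small shell function. This is a minimal illustration under stated assumptions, not the code in deploy.sh: the function name confirm_deploy and its argument order are hypothetical.

```shell
# Hypothetical sketch of the confirm-once rule; not the real deploy.sh code.
# usage: confirm_deploy <target> <host_count> <typed_answer>
# Returns 0 only when the typed answer matches what the target requires.
confirm_deploy() {
  local target="$1" host_count="$2" answer="$3"
  case "$target" in
    bot)         [ "$answer" = "yes" ] ;;          # single bot: type "yes"
    staging|all) [ "$answer" = "$host_count" ] ;;  # staging/fleet: type the host count
    *)           return 2 ;;                       # unknown target
  esac
}
```

Requiring the host count rather than a plain "yes" for fleet-sized deploys makes it harder to confirm a fleet-wide rollout by reflex.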

The three speeds — for the full fleet (178 hosts)

Speed              Hosts at once   ~Steam logins   Stop   Halt on errors   ~Time
routine (default)  5               ~15             35s    10%              ~15–30 min
fast               10              ~30             20s    10%              ~8–15 min
immediate          ALL             ~534            5s     never            fires over ~60s

routine is the default and is almost always what you want. Use fast when you need the rollout sooner. Use immediate only for a genuine emergency (see below).
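The speed table can be read as a simple lookup, and the "~Steam logins" column is roughly hosts at once times bots per host. The sketch below is illustrative only: the function names are hypothetical, and the 3 bots per host figure is inferred from the guide's own numbers (534 bots / 178 hosts).

```shell
# Hypothetical sketch: map a speed name to the parameters in the speed table.
# Prints "<hosts_at_once> <graceful_stop_seconds> <halt_on_errors>".
TOTAL_HOSTS=178      # from this guide
BOTS_PER_HOST=3      # assumption: inferred from 534 bots / 178 hosts

speed_params() {
  case "$1" in
    routine)   echo "5 35 10%" ;;
    fast)      echo "10 20 10%" ;;
    immediate) echo "$TOTAL_HOSTS 5 never" ;;
    *)         echo "unknown speed: $1" >&2; return 1 ;;
  esac
}

# Approximate simultaneous Steam logins = hosts at once x bots per host.
simultaneous_logins() {
  local params hosts
  params=$(speed_params "$1") || return 1
  hosts=${params%% *}                      # first field: hosts at once
  echo $(( $hosts * $BOTS_PER_HOST ))
}
```

For example, simultaneous_logins routine prints 15 and simultaneous_logins immediate prints 534, matching the table.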

The downside of immediate

All ~534 bots stop and re-login at the same moment.

Every bot on every host restarts together, so the whole fleet re-authenticates to Steam and reconnects to Redis simultaneously: a login storm and a Redis connection storm. Steam rate-limits the logins, and Redis can refuse or time out the flood of new connections. Bots can crash-loop while the storm clears.

The random per-host jitter (up to ~60s) softens the storm by spreading the herd over a window, and the safe-swap means a crashed bot falls back to the old image rather than going down. It is still the entire fleet bouncing at once, so use routine or fast unless it is an emergency.
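The per-host jitter amounts to sleeping a random number of seconds before the restart. A minimal sketch follows; the function name jitter_seconds is hypothetical, and the real host script may pick its delay differently.

```shell
# Hypothetical sketch of per-host storm jitter; the real host script may differ.
# Prints a whole number of seconds in [0, max], drawn from /dev/urandom.
jitter_seconds() {
  local max="${1:-60}" raw
  # Read 2 random bytes as one unsigned decimal (0..65535), strip padding.
  raw=$(od -A n -N 2 -t u2 /dev/urandom | tr -d ' \t\n')
  echo $(( $raw % ($max + 1) ))
}

# On each host, before restarting containers (immediate speed only):
#   sleep "$(jitter_seconds 60)"
```

Because each host draws its own delay, the fleet's Steam logins and Redis reconnects spread across the window instead of landing in the same second.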

2. What to expect when it runs

the deploy talks to you the whole way

a. It shows the speed table and asks you to choose:

deploy.sh
$ ./infra/deploy/deploy.sh all
Deploy speed for ALL 178 production hosts (~534 bots):
    speed     hosts/at  ~logins  stop  halt-errs  est. time
    routine   5         ~15      35s   10%        ~15-30 min
    fast      10        ~30      20s   10%        ~8-15 min
    immediate ALL       ~534     5s    never      fires over ~60s
choose  1=routine  2=fast  3=immediate: 1

b. It prints the plan and asks for one confirmation (type the host count):

plan + confirm
================ deploy plan ================
  Target          ALL production bots
  Hosts           178 (~534 bots)
  Speed           routine
  Concurrency     5 host(s) at a time  (~15 simultaneous Steam logins)
  Graceful stop   35s
  Error halt      10%
  Est. duration   ~15-30 min
=============================================
Proceed? type the host count (178): 178

c. A live progress bar + elapsed seconds while it rolls out:

rollout
command: 6ca32b16-367e-4372-8a33-6dd9442ed2c3
  [############........] 104/178 hosts · 0 errors · 612s
--- result: Success (178/178, 0 errors)

d. Then it auto-verifies — which bots are on the new version and logged in:

verification (non-blocking)
── verification report (534 bots) ──
  on deployed version : 534/534
  logged into Steam   : 528/534
  stale (old image)   : 0
  down (no container) : 0
  not logged in yet   : 6 -> bot512 bot516 bot517 bot603 bot77 bot89

e. Finally it waits and retries only the login stragglers (live countdown):

delayed retry
  letting 6 straggler(s) finish Steam login — retry in  41s
  re-checking the 6 straggler(s)…
── retry result ──
  logged in during wait : 6/6
  STILL not logged in   : 0  ✅
──────────────────────────────────────

A stale bot is still on the old image (this includes a safe-swap rollback); a down bot has no container. Both are listed by bot id so you know exactly what to chase, and nothing fails silently.
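The report categories can be thought of as a per-bot verdict derived from two facts: which image the bot's container is running, and whether it is logged into Steam. The sketch below is an assumption about that logic, not the actual verifier; classify_bot and its inputs are hypothetical, and how the digest and login state are collected is not covered here.

```shell
# Hypothetical sketch of the per-bot verdict behind the report categories.
# usage: classify_bot <running_digest_or_empty> <deployed_digest> <logged_in: 1|0>
classify_bot() {
  local running="$1" deployed="$2" logged_in="$3"
  if [ -z "$running" ]; then
    echo "down"            # no container at all
  elif [ "$running" != "$deployed" ]; then
    echo "stale"           # old image, including a safe-swap rollback
  elif [ "$logged_in" != "1" ]; then
    echo "not-logged-in"   # right image, Steam login still pending
  else
    echo "ok"
  fi
}
```

Only the not-logged-in group is worth a delayed retry, since a stale or down bot will not fix itself by waiting.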

3. Durability: what each safeguard does & why

why a bad deploy can't take the fleet down anymore

Each safeguard below lists what it does and why (the failure it prevents).

Rollback-safe swap
  What it does: Renames the old container to bot<ID>-prev, starts the new one, and only drops the old after the new is confirmed running; otherwise it rolls back.
  Why: A failed/OOM docker run used to delete the bot and leave nothing. Now the worst case is "stayed on the old image", never "bot down".

Image prune -af
  What it does: Removes every image not in use, each deploy (was: keep a week).
  Why: Week-long retention filled the 8 GB root disks to 100% → pull/run failed and even corrupted /etc/hosts. This was the root cause of the whole incident.

resolv.conf mount
  What it does: Bind-mounts the host's /etc/resolv.conf (644) into each bot, read-only.
  Why: Some docker versions write the container resolv.conf 0640, which is unreadable by the non-root bot → no DNS → can't reach Redis/MySQL. The mount makes DNS immune.

3 speeds (absolute)
  What it does: Concurrency is a fixed host count (5 / 10 / all), not a percentage.
  Why: A % grows the Steam login storm as the fleet grows; 35 hosts at once tripped the error-halt and false-failed rollouts. 5-at-a-time is verified clean.

Storm jitter
  What it does: Each host waits a random ≤60s before starting (immediate only).
  Why: Spreads the fleet-wide Steam logins + Redis reconnects over a window instead of one thundering-herd instant.

Skip login-wait
  What it does: Immediate skips the informational Steam-login poll (5s settle only).
  Why: The poll is informational (never gated success); skipping it lets an emergency rollout report done in seconds.

Post-deploy verify
  What it does: Checks every bot's image digest + Steam login, lists stale/down/not-logged-in, retries the stragglers.
  Why: Before, the deploy said "success" while bots were actually down or on the old image, which was found only via Slack. Now nothing is silent.

docker enabled-on-boot
  What it does: ensure-host-baseline.sh enables docker at boot fleet-wide.
  Why: AMI-cloned hosts shipped with docker disabled-on-boot, so a reboot left bots down until a manual start. Now a reboot self-recovers.

Drift guard
  What it does: check-host-script-drift.sh sha256-verifies every host runs the repo's host script.
  Why: We hit hosts running an older on-host script than the repo. This catches that before it bites.

Unified deploy.sh
  What it does: One front door for staging / bot / all + speed, with plan + confirm.
  Why: Operators no longer pick between confusing primitives or mistype a fleet-wide command; the plan + single confirm is the guardrail.
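The rollback-safe swap and the resolv.conf mount can be sketched together with standard docker commands. This is an illustration under stated assumptions, not the contents of host/recreate-containers.sh: the function name safe_swap is hypothetical, it assumes the old container is stopped before the new one starts, it assumes a container named bot<ID> already exists, and it omits every other docker run option a real bot would need.

```shell
# Hypothetical sketch of a rollback-safe swap; the real host script may differ.
# usage: safe_swap <bot_id> <image_ref> [graceful_stop_seconds]
safe_swap() {
  local id="$1" image="$2" stop_s="${3:-35}"
  local name="bot$id" prev="bot$id-prev"

  # Assumption: the old container is stopped gracefully before the new one starts.
  docker stop -t "$stop_s" "$name" >/dev/null 2>&1 || true
  docker rename "$name" "$prev" || return 1        # keep the old container as a fallback

  # Start the new container with the host's resolv.conf mounted read-only,
  # then confirm it is actually running before touching the fallback.
  if docker run -d --name "$name" \
       -v /etc/resolv.conf:/etc/resolv.conf:ro "$image" >/dev/null \
     && [ "$(docker inspect -f '{{.State.Running}}' "$name" 2>/dev/null)" = "true" ]; then
    docker rm -f "$prev" >/dev/null                # new one is up: drop the old
    echo "swapped"
  else
    docker rm -f "$name" >/dev/null 2>&1 || true   # clean up the failed attempt
    docker rename "$prev" "$name" && docker start "$name" >/dev/null
    echo "rolled-back"
    return 1
  fi
}
```

The key design point is ordering: the old container is only removed after the new one is confirmed running, so a failed start leaves the bot on the old image instead of leaving nothing.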
