TRADEBOT INFRA · OPERATOR GUIDE · 2026-06-12

Tradebot Deploy — Operator Guide

How to deploy the bot fleet, what you'll see while it runs, and what each durability safeguard does. Three layers: deploy.sh (the command you run) → operator/* (run on your machine, drive AWS SSM) → host/recreate-containers.sh (runs on each bot host).

1. How to run

one front door — pick target + speed

# from an operator machine (AWS creds, eu-west-1)
./infra/deploy/deploy.sh [target] [speed] [botId] [imageRef] [-y]

# target = staging | bot | all
# speed  = routine | fast | immediate  (used with target "all" only)

./infra/deploy/deploy.sh                      # fully interactive — it prompts
./infra/deploy/deploy.sh staging              # staging trio 898/899/900
./infra/deploy/deploy.sh bot 539              # one production bot
./infra/deploy/deploy.sh all routine          # whole fleet, safe pace (default)
./infra/deploy/deploy.sh all fast             # whole fleet, ~2× faster
./infra/deploy/deploy.sh all immediate -y     # whole fleet at once (emergencies)

It shows a plan + the speed table, confirms once (type the host count for a fleet/staging deploy, or yes for a single bot), then runs the right script for you. Pass -y to skip the prompt in automation.
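
The confirmation rule above can be sketched as a small check. This is an illustration only, not the actual code in deploy.sh: the function name `confirm_matches` and its argument order are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from deploy.sh).
# confirm_matches TARGET HOST_COUNT ANSWER
# Returns 0 only when the operator typed what the plan expects:
#   - "yes" for a single bot
#   - the exact host count for a staging or fleet deploy
confirm_matches() {
  local target="$1" host_count="$2" answer="$3"
  case "$target" in
    bot)         [ "$answer" = "yes" ] ;;
    staging|all) [ "$answer" = "$host_count" ] ;;
    *)           return 2 ;;   # unknown target: never confirm
  esac
}

# Example: confirm_matches all 178 "$typed" || { echo "aborted"; exit 1; }
```

Requiring the host count rather than a plain yes makes the operator read the plan before a fleet-wide change.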

The three speeds — for the full fleet (178 hosts)

Speed	Hosts at once	~Simultaneous Steam logins	Graceful stop	Halt on errors	~Time
routine (default)	5	~15	35s	10%	~15–30 min
fast	10	~30	20s	10%	~8–15 min
immediate	ALL	~534	5s	never	fires over ~60s

routine is the default and almost always what you want. Use fast when you need the rollout sooner, and immediate only for a genuine emergency (see below).
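
As a rough sketch of how the table maps to parameters, assuming the values shown above: the function names `speed_params` and `approx_logins` are hypothetical, and the figure of 3 bots per host is inferred from 534 bots on 178 hosts rather than stated anywhere in deploy.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the speed table (not the real deploy.sh).

# speed_params SPEED TOTAL_HOSTS
# Prints "concurrency graceful_stop_seconds error_halt" for a speed.
speed_params() {
  local speed="${1:-routine}" total_hosts="$2"
  case "$speed" in
    routine)   echo "5 35 10%" ;;
    fast)      echo "10 20 10%" ;;
    immediate) echo "$total_hosts 5 never" ;;   # every host at once
    *)         echo "unknown speed: $speed" >&2; return 1 ;;
  esac
}

# approx_logins CONCURRENCY BOTS_PER_HOST
# Simultaneous Steam logins is roughly hosts-at-once times bots per host.
approx_logins() {
  echo $(( $1 * $2 ))
}

# Example: approx_logins 5 3   -> 15   (routine)
#          approx_logins 178 3 -> 534  (immediate)
```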

The downside of immediate

All ~534 bots stop and re-login at the same moment.

Every bot on every host restarts together, so the whole fleet re-authenticates to Steam and reconnects to Redis simultaneously: a login storm plus a Redis connection storm. Steam rate-limits the logins, and Redis can refuse or time out the flood of new connections. Bots can crash-loop while the storm clears.

The random per-host jitter of up to ~60s softens this by spreading the restarts over a window, and the rollback-safe swap means a bot whose new container fails falls back to the old image rather than going down. It is still the entire fleet bouncing at once, so use routine or fast unless it is a genuine emergency.
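
A minimal sketch of per-host jitter, assuming bash on the host. The function name `jitter_seconds` is hypothetical, and the real host script may pick its delay differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from the host script).
# jitter_seconds MAX
# Prints a random whole number of seconds in the range 0..MAX (default 60).
# $RANDOM is a bash built-in variable yielding 0..32767.
jitter_seconds() {
  local max="${1:-60}"
  echo $(( RANDOM % (max + 1) ))
}

# On each host, before restarting containers (immediate speed only):
#   sleep "$(jitter_seconds 60)"
```

Because every host draws its own delay, the restarts land across the window instead of in the same second.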

2. What to expect when it runs

the deploy talks to you the whole way

a. It shows the speed table and asks you to choose:

deploy.sh

$ ./infra/deploy/deploy.sh all
Deploy speed for ALL 178 production hosts (~534 bots):
    speed     hosts/at  ~logins  stop  halt-errs  est. time
    routine   5         ~15      35s   10%        ~15-30 min
    fast      10        ~30      20s   10%        ~8-15 min
    immediate ALL       ~534     5s    never      fires over ~60s
choose  1=routine  2=fast  3=immediate: 1

b. It prints the plan and asks for one confirmation (type the host count):

plan + confirm

================ deploy plan ================
  Target          ALL production bots
  Hosts           178 (~534 bots)
  Speed           routine
  Concurrency     5 host(s) at a time  (~15 simultaneous Steam logins)
  Graceful stop   35s
  Error halt      10%
  Est. duration   ~15-30 min
=============================================
Proceed? type the host count (178): 178

c. A live progress bar + elapsed seconds while it rolls out:

rollout

command: 6ca32b16-367e-4372-8a33-6dd9442ed2c3
  [############........] 104/178 hosts · 0 errors · 612s
--- result: Success (178/178, 0 errors)
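
For reference, a progress line like the one above can be rendered as follows. This is a sketch: the function name `render_progress`, the 20-character bar width, and rounding to the nearest cell are assumptions read off the sample output, not the deploy script's actual code.

```shell
#!/usr/bin/env bash
# Hypothetical renderer for the rollout progress line.
# render_progress DONE TOTAL ERRORS ELAPSED_SECONDS
render_progress() {
  local done_hosts="$1" total="$2" errors="$3" elapsed="$4" width=20
  [ "$total" -gt 0 ] || return 1              # avoid dividing by zero
  # Filled cells, rounded to the nearest whole cell.
  local filled=$(( (done_hosts * width + total / 2) / total ))
  local bar="" i=0
  while [ "$i" -lt "$width" ]; do
    if [ "$i" -lt "$filled" ]; then bar="${bar}#"; else bar="${bar}."; fi
    i=$(( i + 1 ))
  done
  printf '  [%s] %s/%s hosts · %s errors · %ss\n' \
    "$bar" "$done_hosts" "$total" "$errors" "$elapsed"
}

# Example: render_progress 104 178 0 612
```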

d. Then it auto-verifies — which bots are on the new version and logged in:

verification (non-blocking)

── verification report (534 bots) ──
  on deployed version : 534/534
  logged into Steam   : 528/534
  stale (old image)   : 0
  down (no container) : 0
  not logged in yet   : 6 -> bot512 bot516 bot517 bot603 bot77 bot89

e. Finally it waits and retries only the login stragglers (live countdown):

delayed retry

  letting 6 straggler(s) finish Steam login — retry in  41s
  re-checking the 6 straggler(s)…
── retry result ──
  logged in during wait : 6/6
  STILL not logged in   : 0  ✅
──────────────────────────────────────

A stale bot is still on the old image (this includes bots that the safe swap rolled back); a down bot has no container at all. Both are listed by bot id, so you know exactly what to chase and nothing fails silently.
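
The report categories can be sketched as one classification step per bot. The function name `classify_bot` and its inputs are hypothetical: it assumes the verifier already knows the digest the bot is running (empty if there is no container), the deployed digest, and whether the bot is logged into Steam.

```shell
#!/usr/bin/env bash
# Hypothetical classifier mirroring the verification report categories.
# classify_bot RUNNING_DIGEST DEPLOYED_DIGEST LOGGED_IN
#   RUNNING_DIGEST  digest of the bot's current container image, "" if none
#   DEPLOYED_DIGEST digest of the image this deploy rolled out
#   LOGGED_IN       "yes" if the bot is logged into Steam
classify_bot() {
  local running="$1" deployed="$2" logged_in="$3"
  if [ -z "$running" ]; then
    echo "down"              # no container at all
  elif [ "$running" != "$deployed" ]; then
    echo "stale"             # old image, including a safe-swap rollback
  elif [ "$logged_in" != "yes" ]; then
    echo "not-logged-in"     # right image, Steam login still pending
  else
    echo "ok"
  fi
}

# Example: classify_bot "sha256:aaa" "sha256:bbb" yes   -> stale
```

Only the not-logged-in group is retried after the wait; stale and down bots need an operator.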

3. Durability — what each safeguard does & why

why a bad deploy can't take the fleet down anymore

Safeguard	What it does	Why (the failure it prevents)
Rollback-safe swap	Renames the old container to `bot<ID>-prev`, starts the new one, and only drops the old after the new is confirmed running — else rolls back.	A failed/OOM `docker run` used to delete the bot and leave nothing. Now the worst case is "stayed on the old image", never "bot down".
Image prune `-af`	Removes every image not in use, each deploy (was: keep a week).	Week-long retention filled the 8 GB root disks to 100% → `pull`/`run` failed and even corrupted `/etc/hosts`. This was the root cause of the whole incident.
resolv.conf mount	Bind-mounts the host's `/etc/resolv.conf` (644) into each bot, read-only.	Some docker versions write the container resolv.conf `0640` — unreadable by the non-root bot → no DNS → can't reach Redis/MySQL. The mount makes DNS immune.
3 speeds (absolute)	Concurrency is a fixed host count (5 / 10 / all), not a percentage.	A percentage makes the Steam login storm grow with the fleet; at 35 hosts at once it tripped the error halt and falsely marked rollouts as failed. 5 at a time is verified clean.
Storm jitter	Each host waits a random ≤60s before starting (immediate only).	Spreads the fleet-wide Steam logins + Redis reconnects over a window instead of one thundering-herd instant.
Skip login-wait	Immediate skips the informational Steam-login poll (5s settle only).	The poll is informational (never gated success); skipping it lets an emergency rollout report done in seconds.
Post-deploy verify	Checks every bot's image digest + Steam login, lists stale/down/not-logged-in, retries the stragglers.	Before, the deploy said "success" while bots were actually down or on the old image — found only via Slack. Now nothing is silent.
docker enabled-on-boot	`ensure-host-baseline.sh` enables docker at boot fleet-wide.	AMI-cloned hosts shipped with docker disabled-on-boot, so a reboot left bots down until a manual start. Now a reboot self-recovers.
Drift guard	`check-host-script-drift.sh` sha256-verifies every host runs the repo's host script.	We hit hosts running an older on-host script than the repo. This catches that before it bites.
Unified `deploy.sh`	One front door for staging / bot / all + speed, with plan + confirm.	Operators no longer pick between confusing primitives or mistype a fleet-wide command — the plan + single confirm is the guardrail.
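
To make the rollback-safe swap and the resolv.conf mount concrete, here is a minimal sketch using standard docker CLI commands. It is not the contents of host/recreate-containers.sh: the function name `safe_swap`, the container name `bot<ID>`, and the bare `docker run` flags are assumptions, and a real bot needs its own environment, volumes, and limits. It also assumes an old container already exists.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the rollback-safe swap (not the real host script).
# safe_swap BOT_ID IMAGE_REF [STOP_SECONDS]
# Prints "swapped" on success; prints "rolled-back" and returns 1 otherwise.
safe_swap() {
  local name="bot$1" prev="bot$1-prev" image="$2" stop_seconds="${3:-35}"

  # Gracefully stop the old container, then park it under the -prev name.
  docker stop -t "$stop_seconds" "$name" >/dev/null 2>&1 || true
  docker rename "$name" "$prev" || return 1

  # Start the new container with the host's resolv.conf mounted read-only,
  # and only drop the old one once the new one is confirmed running.
  if docker run -d --name "$name" \
       -v /etc/resolv.conf:/etc/resolv.conf:ro "$image" >/dev/null 2>&1 \
     && [ "$(docker inspect -f '{{.State.Running}}' "$name" 2>/dev/null)" = "true" ]; then
    docker rm -f "$prev" >/dev/null 2>&1
    echo "swapped"
  else
    # New container failed: remove it and bring the old one back.
    docker rm -f "$name" >/dev/null 2>&1
    docker rename "$prev" "$name" && docker start "$name" >/dev/null 2>&1
    echo "rolled-back"
    return 1
  fi
}

# Example: safe_swap 539 "$IMAGE_REF" 35
```

The ordering is the point: the old container is removed only on the success path, so a failed start leaves the bot on the old image instead of down.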

tradeit.gg · tradebot infra · 2026-06-12