TRADEBOT INFRA · OPERATOR GUIDE · 2026-06-12
How to deploy the bot fleet, what you'll see while it runs, and what each durability safeguard does. Three layers: deploy.sh (the command you run) → operator/* (run on your machine, drive AWS SSM) → host/recreate-containers.sh (runs on each bot host).
one front door — pick target + speed
# from an operator machine (AWS creds, eu-west-1)
./infra/deploy/deploy.sh [target] [speed] [botId] [imageRef] [-y]
#   target = staging | bot | all
#   speed  = routine | fast | immediate (all only)
./infra/deploy/deploy.sh                    # fully interactive: it prompts
./infra/deploy/deploy.sh staging            # staging trio 898/899/900
./infra/deploy/deploy.sh bot 539            # one production bot
./infra/deploy/deploy.sh all routine        # whole fleet, safe pace (default)
./infra/deploy/deploy.sh all fast           # whole fleet, ~2× faster
./infra/deploy/deploy.sh all immediate -y   # whole fleet at once (emergencies)
It shows a plan + the speed table, confirms once (type the host count for a fleet/staging deploy, or yes for a single bot), then runs the right script for you. Pass -y to skip the prompt in automation.
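The confirmation rule can be sketched as a small shell function. This is an illustration only, not the code inside deploy.sh; the name `confirm_matches` and its argument order are hypothetical.

```shell
# Hypothetical helper, not taken from deploy.sh: decides whether the
# operator's typed answer satisfies the confirmation rule.
#   $1 = target (staging | bot | all)
#   $2 = host count shown in the plan
#   $3 = what the operator typed
confirm_matches() {
  case "$1" in
    bot)         [ "$3" = "yes" ] ;;   # single bot: type "yes"
    staging|all) [ "$3" = "$2" ] ;;    # staging or fleet: type the host count
    *)           return 1 ;;           # unknown target: never confirm
  esac
}

# Example: a fleet deploy of 178 hosts proceeds only on the exact count.
if confirm_matches all 178 178; then echo "proceed"; else echo "abort"; fi
```

Requiring the exact host count for multi-host targets makes it hard to approve a fleet-wide rollout by reflex, which a plain yes/no prompt would allow.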
| Speed | Hosts at once | ~Simultaneous Steam logins | Graceful stop | Error halt | ~Time |
|---|---|---|---|---|---|
| routine (default) | 5 | ~15 | 35s | 10% | ~15–30 min |
| fast | 10 | ~30 | 20s | 10% | ~8–15 min |
| immediate | ALL | ~534 | 5s | never | fires over ~60s |
routine is the default and what you want almost always. Use fast when you need the rollout sooner. Use immediate only for a genuine emergency (see below).
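As a rough sketch, the speed table can be read as a lookup from speed name to rollout parameters. The function names `speed_params` and `logins_at_once` are hypothetical and are not taken from deploy.sh; the 3-bots-per-host figure is inferred from the fleet numbers in this guide (178 hosts, ~534 bots).

```shell
# Illustrative only: maps a speed name to the values in the speed table.
# Prints "<hosts at once> <graceful stop seconds> <error halt percent>";
# "ALL" and "never" are printed as-is for the immediate speed.
speed_params() {
  case "${1:-routine}" in
    routine)   echo "5 35 10" ;;
    fast)      echo "10 20 10" ;;
    immediate) echo "ALL 5 never" ;;
    *)         echo "unknown speed: $1" >&2; return 1 ;;
  esac
}

# Rough simultaneous Steam logins = hosts at once x 3 bots per host
# (assumption: 178 hosts and ~534 bots means 3 bots per host).
logins_at_once() {
  echo $(( $1 * 3 ))
}

set -- $(speed_params fast)
echo "fast: $1 hosts at once, ~$(logins_at_once "$1") logins, stop ${2}s, halt at ${3}%"
```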
immediate: all ~534 bots stop and re-login at the same moment.
Every bot on every host restarts together, so the whole fleet re-authenticates to Steam AND reconnects to Redis simultaneously — a login storm and a Redis connection storm. Steam rate-limits the logins; Redis can refuse/time-out the flood of new connections. Bots can crash-loop while the storm clears.
The ~60s random per-host jitter softens it (spreads the herd over a window) and the safe-swap means a crash falls back to the old image rather than going down — but it is still the entire fleet bouncing at once. Use routine/fast unless it's an emergency.
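The jitter idea can be sketched in a few lines of shell. This is not the host script itself; `jitter_seconds` is a hypothetical name, and reading from /dev/urandom is one possible choice of random source, picked here so that hosts started in the same second still get different delays.

```shell
# Illustrative sketch: print a random delay between 0 and max seconds.
# Reads 2 random bytes from /dev/urandom as an unsigned integer (0..65535)
# and reduces it modulo (max + 1).
jitter_seconds() {
  max="${1:-60}"
  n=$(od -An -N2 -tu2 /dev/urandom | tr -dc '0-9')
  echo $(( n % (max + 1) ))
}

# On each host, before restarting containers (immediate speed only):
delay=$(jitter_seconds 60)
echo "sleeping ${delay}s before restart"
# sleep "$delay"   # left commented out so the sketch runs instantly
```

With 178 hosts drawing independently from a 0 to 60 second window, roughly three hosts start per second on average instead of all 178 in the same instant.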
the deploy talks to you the whole way
a. It shows the speed table and asks you to choose:
$ ./infra/deploy/deploy.sh all
Deploy speed for ALL 178 production hosts (~534 bots):
  speed      hosts/at  ~logins  stop  halt-errs  est. time
  routine    5         ~15      35s   10%        ~15-30 min
  fast       10        ~30      20s   10%        ~8-15 min
  immediate  ALL       ~534     5s    never      fires over ~60s
choose 1=routine 2=fast 3=immediate: 1
b. It prints the plan and asks for one confirmation (type the host count):
================ deploy plan ================
Target ALL production bots
Hosts 178 (~534 bots)
Speed routine
Concurrency 5 host(s) at a time (~15 simultaneous Steam logins)
Graceful stop 35s
Error halt 10%
Est. duration ~15-30 min
=============================================
Proceed? type the host count (178): 178
c. A live progress bar + elapsed seconds while it rolls out:
command: 6ca32b16-367e-4372-8a33-6dd9442ed2c3
[############........] 104/178 hosts · 0 errors · 612s
--- result: Success (178/178, 0 errors)
d. Then it auto-verifies which bots are on the new version and logged in:
── verification report (534 bots) ──
on deployed version : 534/534
logged into Steam   : 528/534
stale (old image)   : 0
down (no container) : 0
not logged in yet   : 6
  -> bot512 bot516 bot517 bot603 bot77 bot89
e. Finally it waits and retries only the login stragglers (live countdown):
letting 6 straggler(s) finish Steam login — retry in 41s
re-checking the 6 straggler(s)…
── retry result ──
logged in during wait : 6/6
STILL not logged in   : 0 ✅
──────────────────────────────────────
If a bot is stale it's on the old image (incl. a safe-swap rollback); down means no container. Both are listed by bot id so you know exactly what to chase — nothing is silent.
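The report categories can be sketched as a simple decision order: no container first, then wrong image, then login state. The function `classify_bot` and its inputs are hypothetical; how the real verifier obtains the running digest and the login state is not shown in this guide.

```shell
# Hypothetical classifier mirroring the verification report categories.
#   $1 = digest that was deployed
#   $2 = digest the bot's container is running ("" if there is no container)
#   $3 = "1" if the bot is logged into Steam, anything else otherwise
classify_bot() {
  if [ -z "$2" ]; then
    echo "down"            # no container at all
  elif [ "$2" != "$1" ]; then
    echo "stale"           # still on the old image (includes safe-swap rollbacks)
  elif [ "$3" != "1" ]; then
    echo "not-logged-in"   # right image, Steam login not finished yet
  else
    echo "ok"
  fi
}

classify_bot sha256:new sha256:new 1   # prints: ok
classify_bot sha256:new sha256:old 1   # prints: stale
classify_bot sha256:new "" 0           # prints: down
```

Only the not-logged-in group is worth waiting on and retrying; stale and down bots will not fix themselves by waiting, which is why they are listed by bot id for follow-up.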
why a bad deploy can't take the fleet down anymore
| Safeguard | What it does | Why (the failure it prevents) |
|---|---|---|
| Rollback-safe swap | Renames the old container to bot<ID>-prev, starts the new one, and only drops the old after the new is confirmed running; otherwise it rolls back. | A failed/OOM docker run used to delete the bot and leave nothing. Now the worst case is "stayed on the old image", never "bot down". |
| Image prune -af | Removes every image not in use, each deploy (was: keep a week). | Week-long retention filled the 8 GB root disks to 100% → pull/run failed and even corrupted /etc/hosts. This was the root cause of the whole incident. |
| resolv.conf mount | Bind-mounts the host's /etc/resolv.conf (644) into each bot, read-only. | Some docker versions write the container resolv.conf 0640, unreadable by the non-root bot → no DNS → can't reach Redis/MySQL. The mount makes DNS immune. |
| 3 speeds (absolute) | Concurrency is a fixed host count (5 / 10 / all), not a percentage. | A % grows the Steam login storm as the fleet grows; 35 hosts at once tripped the error-halt and false-failed rollouts. 5-at-a-time is verified clean. |
| Storm jitter | Each host waits a random ≤60s before starting (immediate only). | Spreads the fleet-wide Steam logins + Redis reconnects over a window instead of one thundering-herd instant. |
| Skip login-wait | Immediate skips the informational Steam-login poll (5s settle only). | The poll is informational (never gated success); skipping it lets an emergency rollout report done in seconds. |
| Post-deploy verify | Checks every bot's image digest + Steam login, lists stale/down/not-logged-in, retries the stragglers. | Before, the deploy said "success" while bots were actually down or on the old image — found only via Slack. Now nothing is silent. |
| docker enabled-on-boot | ensure-host-baseline.sh enables docker at boot fleet-wide. | AMI-cloned hosts shipped with docker disabled-on-boot, so a reboot left bots down until a manual start. Now a reboot self-recovers. |
| Drift guard | check-host-script-drift.sh sha256-verifies every host runs the repo's host script. | We hit hosts running an older on-host script than the repo. This catches that before it bites. |
| Unified deploy.sh | One front door for staging / bot / all + speed, with plan + confirm. | Operators no longer pick between confusing primitives or mistype a fleet-wide command: the plan + single confirm is the guardrail. |
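The rollback-safe swap in the first row can be sketched roughly as below. This is not the real host/recreate-containers.sh: the function name `safe_swap`, the `DOCKER` override variable, the explicit graceful stop before the rename, and the bare `docker run` arguments are all assumptions made for illustration (the real run command would carry the bot's own flags, such as the resolv.conf mount).

```shell
# Sketch of a rollback-safe swap under stated assumptions; NOT the real
# host script. DOCKER can be pointed at a stub for dry runs.
DOCKER="${DOCKER:-docker}"

safe_swap() {
  name="bot$1"; prev="bot$1-prev"; image="$2"; stop_secs="${3:-35}"

  "$DOCKER" stop -t "$stop_secs" "$name" || return 1          # graceful stop
  # Keep the old container aside; if that fails, bring it straight back up.
  "$DOCKER" rename "$name" "$prev" || { "$DOCKER" start "$name"; return 1; }

  # Start the new container and confirm it is actually running.
  if "$DOCKER" run -d --name "$name" "$image" >/dev/null &&
     [ "$("$DOCKER" inspect -f '{{.State.Running}}' "$name")" = "true" ]; then
    "$DOCKER" rm -f "$prev" >/dev/null        # new is up: drop the old one
    echo "swapped"
  else
    "$DOCKER" rm -f "$name" >/dev/null 2>&1   # clear any failed attempt
    "$DOCKER" rename "$prev" "$name" &&
      "$DOCKER" start "$name" >/dev/null      # back on the old image
    echo "rolled-back"
    return 1
  fi
}

# Usage (not executed here): safe_swap 539 "$IMAGE_REF"
```

The key property is ordering: the old container is removed only after the new one reports running, so every failure path ends with the previous container restarted under its original name.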
tradeit.gg · tradebot infra · 2026-06-12