AWS IBKR PT Market Session Runbook
Scope
This runbook covers AWS/Fargate aegis-pt-core as the primary IBKR paper-trading runtime during a US market session. It is written for unattended operation and successor handover: the reader should be able to decide whether to keep observing, notify, or stop and escalate without relying on Ryo being present.
Use this runbook only for read-only checks unless a separate action-time approval explicitly permits mutation.
Golden Signals
Treat the session as healthy when all of these are true:
| Signal | Healthy condition |
|---|---|
| Runtime primary | ECS service aegis-pt-core has desired=1, running=1, pending=0, and exactly one IBKR runtime task. |
| Gateway | ibkr-gateway is RUNNING/HEALTHY; lt-ibkr-trader is RUNNING. |
| Synology cutover guard | Synology aegis-lt-ibkr-* containers remain exited; do not run both primaries. |
| Heartbeat | Market-hours heartbeat is fresh, normally under 5 minutes old. |
| Reconciliation | broker_reconcile_pending=false, position_reconcile_pending=false, entry_blocking_reconcile_pending=false, and recent logs show position reconciliation OK. |
| Pending orders | pending_orders=0 or one normal-age entry order that clears/reprices/cancels with TWS code 202. |
| Exit safety | pending_exits=0, stuck_exits=0, and no unknown fill / broker-confirmation state. |
| Exit attribution | Recent closed trades should trend toward pnl_unknown=false; UNKNOWN exits are a reliability KPI, not cosmetic noise. |
| Broker imports | Open positions with reconcile-import-* / Broker tags should trend toward zero; each one means broker/state drift was repaired after the fact. |
| Market data | market_data_status=ok and market_data_10197_recent=0, or isolated 10197 without concrete harm while safety gates remain clean. |
| Severe signals | No `Existing session detected`, IBKR `1100`/`1101`/`1102`, fatal/severe log burst, or `send_outcome_unknown`. |
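As a rough illustration of how the table can be applied to a status snapshot, the sketch below checks a subset of the signals. The field names are taken from the table, but `heartbeat_age_sec` and the flat-dict snapshot shape are assumptions made for this example, not the detector's actual schema. Missing reconciliation flags are treated as unhealthy so the check fails closed.

```python
from typing import Any

# Reconciliation flags named in the Golden Signals table.
RECONCILE_FLAGS = (
    "broker_reconcile_pending",
    "position_reconcile_pending",
    "entry_blocking_reconcile_pending",
)


def failing_signals(snapshot: dict[str, Any], heartbeat_stale_sec: int = 300) -> list[str]:
    """Return the golden signals that are not in their healthy state.

    `heartbeat_age_sec` is a hypothetical field name; the real snapshot
    may expose heartbeat freshness differently.
    """
    failures = []
    # A missing heartbeat age counts as stale.
    if snapshot.get("heartbeat_age_sec", float("inf")) >= heartbeat_stale_sec:
        failures.append("heartbeat")
    # A missing flag is treated as pending (fail closed).
    if any(snapshot.get(flag, True) for flag in RECONCILE_FLAGS):
        failures.append("reconciliation")
    # Zero or one pending entry order is acceptable; order age is not checked here.
    if snapshot.get("pending_orders", 0) > 1:
        failures.append("pending_orders")
    if snapshot.get("pending_exits", 0) or snapshot.get("stuck_exits", 0):
        failures.append("exit_safety")
    if snapshot.get("market_data_status") != "ok":
        failures.append("market_data")
    return failures
```

An empty list only means these particular checks passed. ECS task counts, gateway health, the Synology guard, and log-based severe signals still need the read-only check.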
Read-Only Check
Run from a clean dev-Project checkout:
Always pass the AWS profile explicitly. The local skill default may point at a different historical profile; a missing default profile is an operator environment problem, not an AWS runtime failure.
Then run the AWS IBKR PT detector:
Expected healthy detector result:
For a successor-repeatable market-session evidence bundle, run:
This is read-only. It records JST/ET time, detector output, optional GitHub
schedule-gap evidence, and a final continue/notify decision line. Add
RUN_PREDEPLOY_READINESS=1 when you also want the offline scheduler proof in
the same report.
Automated Monitor
GitHub Actions workflow:
- `.github/workflows/notify-aws-ibkr-pt-health.yml`
- Name: `Notify — AWS IBKR PT health`
- Schedule: primary at `07,17,27,37,47,57` minutes and fallback at `12,22,32,42,52` minutes on weekdays from `13:00` through `21:59` UTC. This covers the regular US market session plus open/close buffer while avoiding the high-load `00/10/20/30/40/50` minute marks where GitHub scheduled workflows are more likely to be delayed or dropped.
- External fallback trigger: `repository_dispatch` event type `aws_ibkr_pt_health`. Use `scripts/ops/dispatch_aws_ibkr_pt_health.py` from a non-GitHub scheduler when the natural GitHub schedule is absent.
- Prepared external scheduler: `AEGIS/ops/aws-ibkr-pt-health-scheduler/` contains a Cloudflare Worker Cron fallback. Deploying it and setting `GITHUB_DISPATCH_TOKEN` are infrastructure/secret mutations and require action-time approval.
- Path: GitHub Actions runner joins Tailscale with `tag:ci`, reads dashboard `status_v3` / `bot_status`, runs `scripts/ops/aws_ibkr_pt_health_detect.py`, uploads the snapshot artifact, and posts Slack/email only when `new_events` is non-empty.
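The schedule claims above can be sanity-checked offline. The sketch below is illustrative only: it assumes the regular session runs 09:30 to 16:00 ET, and the function names are made up for this example rather than taken from the repository. It confirms that the minute marks avoid the high-load slots and that the session fits inside the 13:00 to 21:59 UTC window under both daylight and standard time.

```python
from datetime import date, datetime, time, timezone
from zoneinfo import ZoneInfo

PRIMARY_MINUTES = {7, 17, 27, 37, 47, 57}
FALLBACK_MINUTES = {12, 22, 32, 42, 52}
HIGH_LOAD_MINUTES = {0, 10, 20, 30, 40, 50}
WINDOW_START_UTC, WINDOW_END_UTC = time(13, 0), time(21, 59)


def schedule_avoids_high_load() -> bool:
    """True when no primary or fallback minute lands on a high-load mark."""
    return not (PRIMARY_MINUTES | FALLBACK_MINUTES) & HIGH_LOAD_MINUTES


def session_within_window(day: date) -> bool:
    """True when the 09:30-16:00 ET session on `day` fits the UTC window.

    The session hours are an assumption of this sketch; holidays and
    half-days are not modelled.
    """
    et = ZoneInfo("America/New_York")
    open_utc = datetime.combine(day, time(9, 30), tzinfo=et).astimezone(timezone.utc)
    close_utc = datetime.combine(day, time(16, 0), tzinfo=et).astimezone(timezone.utc)
    return WINDOW_START_UTC <= open_utc.time() and close_utc.time() <= WINDOW_END_UTC
```

Checking one summer and one winter date is enough to cover both UTC offsets for New York.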
External dispatch smoke command:
The token must be able to create a repository dispatch event for ryofukutani/dev-Project. Do not print or commit the token.
Before any Cloudflare deploy or secret action, run the scheduler predeploy proof:
The same proof is available in GitHub Actions as AEGIS AWS IBKR PT Scheduler
Tools. It compiles the detector/scheduler tooling, runs the Worker runtime
test, verifies cron/config consistency, and runs offline regression tests. It
does not set secrets, deploy Cloudflare, or touch the trading runtime.
After action-time-approved deploy, capture STARTED_AT before deployment and
prove a fresh non-GitHub scheduler run:
```bash
STARTED_AT=2026-06-01T18:00:00Z \
WORKER_URL=https://<worker-url> \
bash AEGIS/ops/aws-ibkr-pt-health-scheduler/prove_after_deploy.sh
```
Completion requires a fresh repository_dispatch run whose source includes
cloudflare-cron: after STARTED_AT. A stale earlier repository-dispatch proof
or a recovered GitHub natural schedule is not enough.
Use the gap verifier to separate "GitHub schedule recovered later" from "all expected windows were covered":
```bash
python3 AEGIS/ops/aws-ibkr-pt-health-scheduler/check_schedule_gap.py \
  --created-after 2026-06-01T17:45:00Z \
  --grace-minutes 8
```
missing_windows means the GitHub native schedule cannot be the only unattended
monitor lane, even if later schedule runs completed successfully.
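To make the distinction concrete, here is a minimal sketch of the idea behind a missing-window check. It is not the implementation of `check_schedule_gap.py`; the function names and the rule "a slot is covered when a run was created within the grace period after it" are assumptions for illustration. It also ignores the weekday and UTC-hour limits of the real schedule.

```python
from datetime import datetime, timedelta, timezone

# Primary minute marks from this runbook.
SCHEDULE_MINUTES = (7, 17, 27, 37, 47, 57)


def expected_windows(start: datetime, end: datetime) -> list[datetime]:
    """List the primary schedule slots in the half-open range [start, end)."""
    slots = []
    cursor = start.replace(second=0, microsecond=0)
    while cursor < end:
        if cursor.minute in SCHEDULE_MINUTES and cursor >= start:
            slots.append(cursor)
        cursor += timedelta(minutes=1)
    return slots


def missing_windows(
    slots: list[datetime], run_times: list[datetime], grace_minutes: int = 8
) -> list[datetime]:
    """Return slots with no run created within `grace_minutes` after the slot."""
    grace = timedelta(minutes=grace_minutes)
    return [
        slot
        for slot in slots
        if not any(slot <= run <= slot + grace for run in run_times)
    ]
```

With runs at 17:49, 18:20, and 18:28 UTC, the latest slots look healthy, yet the 17:57 and 18:07 slots were never covered. That is the "recovered later" case the verifier is meant to expose.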
Default detector thresholds:
| Environment variable | Default | Meaning |
|---|---|---|
| `AEGIS_PT_HEARTBEAT_STALE_SEC` | `300` | Critical heartbeat-stale threshold during market hours. |
| `AEGIS_PT_PENDING_ORDER_WARN` | `2` | Warn when pending orders reach this count. |
| `AEGIS_PT_PENDING_ORDER_CRITICAL` | `3` | Critical pending-order buildup threshold. |
| `AEGIS_PT_MARKET_DATA_10197_WARN` | `10` | Warn on sustained market-data 10197 after open grace. |
| `AEGIS_PT_HIGH_EXIT_ATTEMPT_WARN` | `12` | Warn when an open position repeatedly cancels/retries exits even while pending/stuck counters remain clean. |
| `AEGIS_PT_HIGH_EXIT_ATTEMPT_CRITICAL` | `20` | Critical high-exit-attempt threshold. This intentionally fires before the runtime's market-order escalation zone so successors can inspect/deploy while the position is still on limit-order rescue. |
| `AEGIS_PT_OPEN_GRACE_MINUTE` | `35` | Treat market as past open grace from 09:35 ET. |
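A detector could resolve these thresholds roughly as sketched below. The variable names and defaults come from the table; `load_thresholds` and `severity` are hypothetical helpers written for this example, and the "reach this count" wording is interpreted as a greater-than-or-equal comparison.

```python
import os
from typing import Mapping, Optional

# Defaults copied from the threshold table.
DEFAULTS = {
    "AEGIS_PT_HEARTBEAT_STALE_SEC": 300,
    "AEGIS_PT_PENDING_ORDER_WARN": 2,
    "AEGIS_PT_PENDING_ORDER_CRITICAL": 3,
    "AEGIS_PT_MARKET_DATA_10197_WARN": 10,
    "AEGIS_PT_HIGH_EXIT_ATTEMPT_WARN": 12,
    "AEGIS_PT_HIGH_EXIT_ATTEMPT_CRITICAL": 20,
    "AEGIS_PT_OPEN_GRACE_MINUTE": 35,
}


def load_thresholds(env: Optional[Mapping[str, str]] = None) -> dict[str, int]:
    """Read each threshold from the environment, falling back to the default."""
    env = os.environ if env is None else env
    return {name: int(env.get(name, default)) for name, default in DEFAULTS.items()}


def severity(value: int, warn: int, critical: int) -> str:
    """Map a counter to ok/warn/critical; a threshold fires once it is reached."""
    if value >= critical:
        return "critical"
    if value >= warn:
        return "warn"
    return "ok"
```

Passing an explicit mapping instead of relying on `os.environ` keeps the helper easy to exercise in offline regression tests.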
The workflow deduplicates event signatures through the previous snapshot artifact. A repeated known condition may appear in the run log without re-alerting until a new signature appears.
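The deduplication behaviour can be pictured with the following sketch. How the workflow actually derives a signature is not stated here, so hashing the canonical JSON of an event is purely an assumption, as are the helper names and the event fields used in the usage note.

```python
import hashlib
import json


def event_signature(event: dict) -> str:
    """Derive a stable signature from an event (assumed scheme: SHA-256 of sorted JSON)."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def select_new_events(events: list[dict], previous_signatures: set[str]) -> list[dict]:
    """Keep only events whose signature was absent from the previous snapshot."""
    return [e for e in events if event_signature(e) not in previous_signatures]
```

For example, if the previous snapshot already recorded a hypothetical `{"code": "heartbeat_stale", "severity": "critical"}` event, a repeat of that event is dropped and only a differently shaped event would trigger a new alert.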
Current Stabilization Patch Gate
As of 2026-06-02 JST, two hardening patches are in flight:
| PR | Scope | Risk class | Preferred order |
|---|---|---|---|
| #94 high exit-attempt detector | Read-only GitHub Actions detector/runbook visibility. | Monitoring only; no broker/runtime mutation. | Merge first so successors see Broker, unknown-PnL, and high-attempt loops before runtime changes. |
| #93 completed-order + partial-fill repair | `lt_ibkr_trader` settlement and reconciliation behavior. | Runtime code change; deploy replaces the AWS trader/gateway task. | Deploy only after an action-time approval and pre/post eecheck proof. Prefer market closed or a quiet window. |
Do not treat a green test run as deployment approval. For #93, use this
minimum gate:
- Before deployment, run the read-only check and detector from this runbook.
- Confirm exactly one AWS IBKR runtime task, Synology IBKR containers exited, fresh heartbeat, `pending_exits=0`, `stuck_exits=0`, all reconciliation flags false, and no `send_outcome_unknown` / session conflict.
- Prefer `pending_orders=0`. If a single normal-age entry order is present, wait for normal cancel/fill unless Ryo explicitly approves action during the poll window. Avoid replacing the Fargate task while an exit order is working.
- Deploy only through git / GitHub Actions / AWS pipeline. Do not edit files on Synology, run compose/docker mutations there, or copy artifacts by hand.
- After deployment, rerun `eecheck` and detector. Acceptance requires one AWS primary runtime, fresh heartbeat, gateway healthy, no pending/stuck exits, reconciliation OK, and no new Broker import or unknown fill caused by the rollout.
- If the deploy happens during market hours, watch at least two scan cycles. `High exit attempt loop` may continue for pre-existing residual positions, but `max_exit_attempt_count` should no longer hide behind clean pending/stuck counters.
Decision Rules
Observe only
Keep AWS PT running and keep observing when:
- `pending_orders` is `0`, or exactly one normal-age entry order is present.
- The pending order follows the normal timeout/cancel/reprice path with TWS code `202`.
- Heartbeat is fresh.
- Reconciliation flags are false.
- `pending_exits=0` and `stuck_exits=0`.
- `market_data_status=ok`, or isolated market-data notices appear without order/reconcile harm.
- `market_data_10197_recent` remains below the detector warning threshold and is not rising in combination with stale heartbeat, reconnect/session conflict, pending buildup, or reconcile flags.
Why: a single entry order during entry_poll is normal live behavior. Do not confuse it with pending buildup.
Notify / page
Notify when any of these appears:
- Heartbeat stale for more than 5 minutes during market hours.
- `pending_orders` accumulates beyond the normal single-order entry-poll pattern, or a single order becomes stale.
- `pending_exits>0` persists, or `stuck_exits>0`.
- Any reconciliation flag becomes true, or logs show missing broker/internal positions.
- `broker_confirmation_required=true`, `send_outcome_unknown`, unknown fill, or impossible PnL/state anomaly appears.
- Recent trades contain `pnl_unknown=true`, or open positions include `Broker` / `reconcile-import-*` tags. These are warnings unless paired with stuck exits, state drift, or send-outcome uncertainty, but they must be counted and driven down.
- `Existing session detected`, IBKR `1100`/`1101`/`1102`, fatal/severe burst, or task restart loop appears.
- `market_data_10197_recent` is sustained above the detector threshold during market hours, or a smaller count appears together with stale heartbeat, connectivity/session conflict, pending buildup, or reconciliation risk.
Stop and escalate
Do not stop live runtime merely for isolated 10197 or a single normal-age pending order. Stop/escalate only when a concrete safety signal exists:
- broker/state drift,
- stale heartbeat plus no self-recovery,
- stuck exit or unknown fill,
- repeated reconnect/session conflict,
- pending-order buildup that does not clear,
- or AWS service cannot maintain exactly one primary task.
Stopping or scaling runtime is a live-trading operational mutation. It requires explicit approval unless an already-approved emergency procedure says otherwise.
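The three tiers can be summarised as a small triage function. This is a simplified sketch: the signal labels are hypothetical names invented for this example, and the real judgement also depends on context such as persistence and self-recovery. An `ESCALATE` result means "stop observing and seek approval". It does not authorise stopping or scaling the runtime.

```python
from enum import Enum


class Decision(Enum):
    OBSERVE = "observe"
    NOTIFY = "notify"
    ESCALATE = "escalate"


# Hypothetical labels for the concrete safety signals listed above.
ESCALATE_SIGNALS = frozenset({
    "state_drift",
    "heartbeat_stale_no_recovery",
    "stuck_exit",
    "unknown_fill",
    "repeated_session_conflict",
    "pending_buildup_not_clearing",
    "not_exactly_one_primary",
})

# Signals that alone never justify leaving observe-only mode.
BENIGN_SIGNALS = frozenset({"isolated_10197", "single_normal_age_pending_order"})


def decide(signals: set[str]) -> Decision:
    """Pick the strongest tier that the observed signals justify."""
    if signals & ESCALATE_SIGNALS:
        return Decision.ESCALATE
    if signals - BENIGN_SIGNALS:
        return Decision.NOTIFY
    return Decision.OBSERVE
```

Any signal that is neither benign nor an escalation trigger, such as a hypothetical `pnl_unknown` label, falls through to `NOTIFY`, which mirrors the rule that warnings must be counted even when they do not justify a stop.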
2026-06-01 Recovery Baseline
The 2026-06-01 post-open session provides the current baseline:
- IBKR market-data subscriptions were restored after funding.
- The first post-open Fargate task hit a TWS disconnect and self-recovered through ECS replacement.
- `10197` / `market_data_status=degraded` appeared after open, then recovered to `market_data_status=ok` and `market_data_10197_recent=0`.
- COIN closed with known realized PnL; the required post-close proof later appeared as `position reconciliation OK matched=8 internal_count=8 broker_count=8`.
- Subsequent single pending entry orders cleared normally with TWS code `202`; detector stayed quiet with `events=[]` / `new_events=[]`.
Use this baseline when judging future sessions: the system can self-recover and continue safely, but completion depends on concrete safety gates, not dashboard color alone.
Record Keeping
For material findings, update both:
- `AEGIS/WORK_LOG.md`
- Delimit memory
Record exact absolute times in JST and ET, the market state, the ECS service/task state, heartbeat age/status, pending counts, reconciliation proof, market-data status, detector output, and the current decision.
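One way to avoid timezone mistakes when writing these entries is to derive both local times from a single UTC instant. The helper below is a sketch; its name and output format are assumptions, not an existing project convention.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def dual_timestamp(moment: datetime) -> str:
    """Render one timezone-aware instant in both JST and ET for a log entry."""
    jst = moment.astimezone(ZoneInfo("Asia/Tokyo"))
    et = moment.astimezone(ZoneInfo("America/New_York"))
    return f"{jst:%Y-%m-%d %H:%M} JST / {et:%Y-%m-%d %H:%M} ET"
```

For instance, `dual_timestamp(datetime(2026, 6, 1, 18, 0, tzinfo=timezone.utc))` yields `2026-06-02 03:00 JST / 2026-06-01 14:00 ET`, which makes the date rollover between Tokyo and New York explicit.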