AWS IBKR PT Market Session Runbook

Scope

This runbook covers AWS/Fargate aegis-pt-core as the primary IBKR paper-trading runtime during a US market session. It is written for unattended operation and successor handover: the reader should be able to decide whether to keep observing, notify, or stop and escalate without relying on Ryo being present.

Use this runbook only for read-only checks unless a separate action-time approval explicitly permits mutation.

Golden Signals

Treat the session as healthy when all of these are true:

| Signal | Healthy condition |
| --- | --- |
| Runtime primary | ECS service `aegis-pt-core` has `desired=1`, `running=1`, `pending=0`, and exactly one IBKR runtime task. |
| Gateway | `ibkr-gateway` is `RUNNING/HEALTHY`; `lt-ibkr-trader` is `RUNNING`. |
| Synology cutover guard | Synology `aegis-lt-ibkr-*` containers remain `exited`; do not run both primaries. |
| Heartbeat | Market-hours heartbeat is fresh, normally under 5 minutes old. |
| Reconciliation | `broker_reconcile_pending=false`, `position_reconcile_pending=false`, `entry_blocking_reconcile_pending=false`, and recent logs show `position reconciliation OK`. |
| Pending orders | `pending_orders=0` or one normal-age entry order that clears/reprices/cancels with TWS code `202`. |
| Exit safety | `pending_exits=0`, `stuck_exits=0`, and no unknown fill / broker-confirmation state. |
| Exit attribution | Recent closed trades should trend toward `pnl_unknown=false`; `UNKNOWN` exits are a reliability KPI, not cosmetic noise. |
| Broker imports | Open positions with `reconcile-import-*` / `Broker` tags should trend toward zero; each one means broker/state drift was repaired after the fact. |
| Market data | `market_data_status=ok` and `market_data_10197_recent=0`, or isolated `10197` without concrete harm while safety gates remain clean. |
| Severe signals | No `Existing session detected`, `1100/1101/1102`, fatal/severe log burst, or `send_outcome_unknown`. |
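
As a rough illustration, the status-field rows of this table can be checked mechanically. The sketch below assumes the status payload has already been flattened into a dict keyed by the field names used above; `heartbeat_age_sec` and the function name are hypothetical, and the ECS, gateway, Synology, and log-based rows are deliberately not covered.

```python
from typing import Any

RECONCILE_FLAGS = (
    "broker_reconcile_pending",
    "position_reconcile_pending",
    "entry_blocking_reconcile_pending",
)


def golden_signal_failures(status: dict[str, Any], heartbeat_stale_sec: int = 300) -> list[str]:
    """Return the names of status-field signals that are not in their healthy state."""
    failures = []
    # heartbeat_age_sec is a hypothetical field name; a missing value counts as stale.
    if status.get("heartbeat_age_sec", float("inf")) >= heartbeat_stale_sec:
        failures.append("heartbeat")
    # Every reconciliation flag must be explicitly false; a missing flag is a failure.
    if any(status.get(flag) is not False for flag in RECONCILE_FLAGS):
        failures.append("reconciliation")
    # More than one pending order is outside the healthy pattern. Whether a single
    # order is "normal-age" still needs a log or operator check.
    if status.get("pending_orders", 0) > 1:
        failures.append("pending_orders")
    if status.get("pending_exits", 0) != 0 or status.get("stuck_exits", 0) != 0:
        failures.append("exit_safety")
    # Conservative: any non-ok market-data status is reported, even if it later
    # turns out to be an isolated 10197 without concrete harm.
    if status.get("market_data_status") != "ok":
        failures.append("market_data")
    return failures
```

An empty list means only that these fields look healthy; it does not replace the runtime, gateway, and cutover-guard rows.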

Read-Only Check

Run from a clean dev-Project checkout:

EECHECK_AWS_PROFILE=teriyaki-deploy \
  bash /Users/ryo/.codex/skills/eecheck/scripts/run_eecheck.sh

Always pass the AWS profile explicitly. The local skill default may point at a different historical profile; a missing default profile is an operator environment problem, not an AWS runtime failure.

Then run the AWS IBKR PT detector:

python3 scripts/ops/aws_ibkr_pt_health_detect.py <<<'{}'

Expected healthy detector result:

{
  "events": [],
  "new_events": []
}
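
A wrapper script could interpret that result as follows. This is a minimal sketch that relies only on the two top-level keys shown above; the shape of individual event objects is not specified in this runbook, so the `id` field used in the usage example is purely illustrative.

```python
import json


def detector_is_quiet(raw_output: str) -> bool:
    """True when the detector reported no events at all."""
    result = json.loads(raw_output)
    return result.get("events") == [] and result.get("new_events") == []


def needs_alert(raw_output: str) -> bool:
    """True when at least one event is new since the previous snapshot."""
    result = json.loads(raw_output)
    return bool(result.get("new_events"))
```

For example, `needs_alert('{"events": [{"id": "x"}], "new_events": []}')` is false: a known condition is present, but nothing new needs to be announced.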

For a successor-repeatable market-session evidence bundle, run:

SCHEDULE_CREATED_AFTER=2026-06-01T17:45:00Z \
  scripts/ops/aws_ibkr_pt_market_session_snapshot.sh

This is read-only. It records JST/ET time, detector output, optional GitHub schedule-gap evidence, and a final continue/notify decision line. Add RUN_PREDEPLOY_READINESS=1 when you also want the offline scheduler proof in the same report.

Automated Monitor

GitHub Actions workflow:

.github/workflows/notify-aws-ibkr-pt-health.yml
Name: Notify — AWS IBKR PT health
Schedule: primary at 07,17,27,37,47,57 minutes and fallback at 12,22,32,42,52 minutes on weekdays from 13:00 through 21:59 UTC. This covers the regular US market session plus open/close buffer while avoiding the high-load 00/10/20/30/40/50 minute marks where GitHub scheduled workflows are more likely to be delayed or dropped.
External fallback trigger: repository_dispatch event type aws_ibkr_pt_health. Use scripts/ops/dispatch_aws_ibkr_pt_health.py from a non-GitHub scheduler when the native GitHub schedule does not fire.
Prepared external scheduler: AEGIS/ops/aws-ibkr-pt-health-scheduler/ contains a Cloudflare Worker Cron fallback. Deploying it and setting GITHUB_DISPATCH_TOKEN are infrastructure/secret mutations and require action-time approval.
Path: GitHub Actions runner joins Tailscale with tag:ci, reads dashboard status_v3 / bot_status, runs scripts/ops/aws_ibkr_pt_health_detect.py, uploads the snapshot artifact, and posts Slack/email only when new_events is non-empty.
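
To make the schedule concrete, the sketch below enumerates the UTC start times that the stated minute marks imply for one day. The minute lists and the 13:00 through 21:59 UTC range are copied from the schedule description above; the function name is hypothetical and the workflow file remains the authority on the actual cron expressions.

```python
from datetime import date, datetime, timezone

PRIMARY_MINUTES = (7, 17, 27, 37, 47, 57)
FALLBACK_MINUTES = (12, 22, 32, 42, 52)
MONITOR_HOURS_UTC = range(13, 22)  # 13:00 through 21:59 UTC


def expected_slots(day: date, minutes: tuple[int, ...] = PRIMARY_MINUTES) -> list[datetime]:
    """List the scheduled UTC start times for one day; weekends yield nothing."""
    if day.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return []
    return [
        datetime(day.year, day.month, day.day, hour, minute, tzinfo=timezone.utc)
        for hour in MONITOR_HOURS_UTC
        for minute in minutes
    ]
```

For Monday 2026-06-01 this yields 54 primary slots, from 13:07 UTC to 21:57 UTC.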

External dispatch smoke command:

GH_TOKEN=... python3 scripts/ops/dispatch_aws_ibkr_pt_health.py

The token must be able to create a repository dispatch event for ryofukutani/dev-Project. Do not print or commit the token.
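
For orientation, a repository dispatch is a single authenticated POST to the GitHub REST API. The sketch below builds such a request without sending it. It is not the contents of dispatch_aws_ibkr_pt_health.py: the `client_payload` shape, the `source` value, and the function name are assumptions made for illustration.

```python
import json
import urllib.request


def build_dispatch_request(
    token: str,
    repo: str = "ryofukutani/dev-Project",
    event_type: str = "aws_ibkr_pt_health",
    source: str = "manual-smoke",  # hypothetical payload value
) -> urllib.request.Request:
    """Build (but do not send) a GitHub repository_dispatch request."""
    body = {"event_type": event_type, "client_payload": {"source": source}}
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{repo}/dispatches",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            # The token is only placed in the header; never log or print it.
            "Authorization": f"Bearer {token}",
        },
    )
```

Passing the request to `urllib.request.urlopen` would send it; GitHub answers a successful dispatch with HTTP 204 and an empty body.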

Before any Cloudflare deploy or secret action, run the scheduler predeploy proof:

bash AEGIS/ops/aws-ibkr-pt-health-scheduler/predeploy_readiness.sh

The same proof is available in GitHub Actions as AEGIS AWS IBKR PT Scheduler Tools. It compiles the detector/scheduler tooling, runs the Worker runtime test, verifies cron/config consistency, and runs offline regression tests. It does not set secrets, deploy Cloudflare, or touch the trading runtime.

For an action-time-approved deploy, record STARTED_AT before deploying, then prove a fresh non-GitHub scheduler run afterward:

STARTED_AT=2026-06-01T18:00:00Z \
WORKER_URL=https://<worker-url> \
  bash AEGIS/ops/aws-ibkr-pt-health-scheduler/prove_after_deploy.sh

Completion requires a fresh repository_dispatch run whose source includes the cloudflare-cron: marker and which was created after STARTED_AT. A stale earlier repository-dispatch proof or a recovered native GitHub schedule run is not enough.

Use the gap verifier to separate "GitHub schedule recovered later" from "all expected windows were covered":

python3 AEGIS/ops/aws-ibkr-pt-health-scheduler/check_schedule_gap.py \
  --created-after 2026-06-01T17:45:00Z \
  --grace-minutes 8

A non-empty missing_windows result means the GitHub native schedule cannot be the only unattended monitor lane, even if later schedule runs completed successfully.
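
The distinction can be sketched as a simple window check. This is not the logic of check_schedule_gap.py, only an illustration under the assumption that a window counts as covered when some run was created between its scheduled start and that start plus the grace period.

```python
from datetime import datetime, timedelta, timezone


def missing_windows(
    expected: list[datetime],
    runs: list[datetime],
    grace_minutes: int = 8,
) -> list[datetime]:
    """Return expected start times with no run created inside [start, start + grace]."""
    grace = timedelta(minutes=grace_minutes)
    return [
        start
        for start in expected
        if not any(start <= run <= start + grace for run in runs)
    ]
```

With expected starts at 17:47, 17:57, and 18:07 UTC and runs created at 17:49 and 18:20 UTC, the 17:57 and 18:07 windows are reported as missing: the 18:20 run shows that the schedule recovered, but it does not cover the windows that were skipped.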

Default detector thresholds:

| Environment variable | Default | Meaning |
| --- | --- | --- |
| `AEGIS_PT_HEARTBEAT_STALE_SEC` | `300` | Critical heartbeat-stale threshold during market hours. |
| `AEGIS_PT_PENDING_ORDER_WARN` | `2` | Warn when pending orders reach this count. |
| `AEGIS_PT_PENDING_ORDER_CRITICAL` | `3` | Critical pending-order buildup threshold. |
| `AEGIS_PT_MARKET_DATA_10197_WARN` | `10` | Warn on sustained market-data 10197 after open grace. |
| `AEGIS_PT_HIGH_EXIT_ATTEMPT_WARN` | `12` | Warn when an open position repeatedly cancels/retries exits even while pending/stuck counters remain clean. |
| `AEGIS_PT_HIGH_EXIT_ATTEMPT_CRITICAL` | `20` | Critical high-exit-attempt threshold. This intentionally fires before the runtime's market-order escalation zone so successors can inspect/deploy while the position is still on limit-order rescue. |
| `AEGIS_PT_OPEN_GRACE_MINUTE` | `35` | Treat market as past open grace from 09:35 ET. |
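
One way to read these settings is sketched below. The variable names and defaults are copied from the table; the helper names are hypothetical, and the inclusive `>=` comparison is an assumption based on the "reach this count" wording rather than on the detector source.

```python
import os

DEFAULT_THRESHOLDS = {
    "AEGIS_PT_HEARTBEAT_STALE_SEC": 300,
    "AEGIS_PT_PENDING_ORDER_WARN": 2,
    "AEGIS_PT_PENDING_ORDER_CRITICAL": 3,
    "AEGIS_PT_MARKET_DATA_10197_WARN": 10,
    "AEGIS_PT_HIGH_EXIT_ATTEMPT_WARN": 12,
    "AEGIS_PT_HIGH_EXIT_ATTEMPT_CRITICAL": 20,
    "AEGIS_PT_OPEN_GRACE_MINUTE": 35,
}


def load_thresholds(env=None) -> dict[str, int]:
    """Read each threshold from the environment, falling back to its default."""
    env = os.environ if env is None else env
    return {name: int(env.get(name, default)) for name, default in DEFAULT_THRESHOLDS.items()}


def severity(value: int, warn: int, critical: int) -> str:
    """Classify a counter against a warn/critical pair (inclusive bounds assumed)."""
    if value >= critical:
        return "critical"
    if value >= warn:
        return "warn"
    return "ok"
```

Under these assumptions, two pending orders classify as `warn` and three as `critical` with the default settings.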

The workflow deduplicates event signatures through the previous snapshot artifact. A repeated known condition may appear in the run log without re-alerting until a new signature appears.
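
The deduplication idea can be illustrated with a content hash per event. How the workflow actually derives a signature is not described here, so the hashing scheme, the event fields, and the function names below are all assumptions; a real signature would presumably exclude volatile fields such as timestamps.

```python
import hashlib
import json


def event_signature(event: dict) -> str:
    """Stable hash of an event's content, independent of key order."""
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def select_new_events(events: list[dict], previous_signatures: set[str]) -> list[dict]:
    """Keep only events whose signature was absent from the previous snapshot."""
    return [e for e in events if event_signature(e) not in previous_signatures]
```

An event already present in the previous snapshot is filtered out and only appears in the run log, while an event with an unseen signature passes through and triggers a notification.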

Current Stabilization Patch Gate

As of 2026-06-02 JST, two hardening patches are in flight:

| PR | Scope | Risk class | Preferred order |
| --- | --- | --- | --- |
| `#94` high exit-attempt detector | Read-only GitHub Actions detector/runbook visibility. | Monitoring only; no broker/runtime mutation. | Merge first so successors see Broker, unknown-PnL, and high-attempt loops before runtime changes. |
| `#93` completed-order + partial-fill repair | `lt_ibkr_trader` settlement and reconciliation behavior. | Runtime code change; deploy replaces the AWS trader/gateway task. | Deploy only after an action-time approval and pre/post `eecheck` proof. Prefer market closed or a quiet window. |

Do not treat a green test run as deployment approval. For #93, use this minimum gate:

Before deployment, run the read-only check and detector from this runbook.
Confirm exactly one AWS IBKR runtime task, Synology IBKR containers exited, fresh heartbeat, pending_exits=0, stuck_exits=0, all reconciliation flags false, and no send_outcome_unknown / session conflict.
Prefer pending_orders=0. If a single normal-age entry order is present, wait for normal cancel/fill unless Ryo explicitly approves action during the poll window. Avoid replacing the Fargate task while an exit order is working.
Deploy only through git / GitHub Actions / AWS pipeline. Do not edit files on Synology, run compose/docker mutations there, or copy artifacts by hand.
After deployment, rerun eecheck and detector. Acceptance requires one AWS primary runtime, fresh heartbeat, gateway healthy, no pending/stuck exits, reconciliation OK, and no new Broker import or unknown fill caused by the rollout.
If the deploy happens during market hours, watch at least two scan cycles. A high-exit-attempt loop may continue for pre-existing residual positions, but max_exit_attempt_count should no longer hide behind clean pending/stuck counters.

Decision Rules

Observe only

Keep AWS PT running and keep observing when:

pending_orders is 0, or exactly one normal-age entry order is present.
The pending order follows the normal timeout/cancel/reprice path with TWS code 202.
Heartbeat is fresh.
Reconciliation flags are false.
pending_exits=0 and stuck_exits=0.
market_data_status=ok, or isolated market-data notices appear without order/reconcile harm.
market_data_10197_recent remains below the detector warning threshold and is not rising in combination with stale heartbeat, reconnect/session conflict, pending buildup, or reconcile flags.

Why: a single entry order during entry_poll is normal live behavior. Do not confuse it with pending buildup.

Notify / page

Notify when any of these appears:

Heartbeat stale for more than 5 minutes during market hours.
pending_orders accumulates beyond the normal single-order entry-poll pattern, or a single order becomes stale.
pending_exits>0 persists, or stuck_exits>0.
Any reconciliation flag becomes true, or logs show missing broker/internal positions.
broker_confirmation_required=true, send_outcome_unknown, unknown fill, or impossible PnL/state anomaly appears.
Recent trades contain pnl_unknown=true, or open positions include Broker / reconcile-import-* tags. These are warnings unless paired with stuck exits, state drift, or send-outcome uncertainty, but they must be counted and driven down.
Existing session detected, IBKR 1100/1101/1102, fatal/severe burst, or task restart loop appears.
market_data_10197_recent is sustained above the detector threshold during market hours, or a smaller count appears together with stale heartbeat, connectivity/session conflict, pending buildup, or reconciliation risk.

Stop and escalate

Do not stop the live runtime merely for an isolated 10197 or a single normal-age pending order. Stop/escalate only when a concrete safety signal exists:

broker/state drift,
stale heartbeat plus no self-recovery,
stuck exit or unknown fill,
repeated reconnect/session conflict,
pending-order buildup that does not clear,
or AWS service cannot maintain exactly one primary task.

Stopping or scaling the runtime is a live-trading operational mutation. It requires explicit approval unless an already-approved emergency procedure says otherwise.
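
The three tiers can be approximated as a first-pass triage over one status snapshot. This is a simplified sketch, not the detector: `heartbeat_age_sec`, `unknown_fill`, `state_drift`, and `primary_task_count` are hypothetical field names, and a single snapshot cannot express conditions such as "persists" or "no self-recovery", which stay with the operator. An `escalate` result is a prompt to seek approval, not permission to mutate the runtime.

```python
RECONCILE_FLAGS = (
    "broker_reconcile_pending",
    "position_reconcile_pending",
    "entry_blocking_reconcile_pending",
)


def decide(status: dict, heartbeat_stale_sec: int = 300, md_10197_warn: int = 10) -> str:
    """Map one flattened status snapshot to 'observe', 'notify', or 'escalate'."""
    # Concrete safety signals from the stop-and-escalate list.
    # unknown_fill, state_drift, and primary_task_count are hypothetical fields.
    if (
        status.get("stuck_exits", 0) > 0
        or status.get("unknown_fill", False)
        or status.get("state_drift", False)
        or status.get("primary_task_count", 1) != 1
    ):
        return "escalate"
    # Notify-tier signals; heartbeat_age_sec is a hypothetical field.
    if (
        status.get("heartbeat_age_sec", 0) > heartbeat_stale_sec
        or status.get("pending_orders", 0) > 1
        or status.get("pending_exits", 0) > 0
        or any(status.get(flag, False) for flag in RECONCILE_FLAGS)
        or status.get("send_outcome_unknown", False)
        or status.get("market_data_10197_recent", 0) >= md_10197_warn
    ):
        return "notify"
    # A single entry order or a few isolated 10197 notices stay in the observe tier.
    return "observe"
```

A snapshot with one pending entry order and a fresh heartbeat returns `observe`, while three pending orders return `notify`.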

2026-06-01 Recovery Baseline

The 2026-06-01 post-open session provides the current baseline:

IBKR market-data subscriptions were restored after funding.
The first post-open Fargate task hit a TWS disconnect and self-recovered through ECS replacement.
10197 / market_data_status=degraded appeared after open, then recovered to market_data_status=ok and market_data_10197_recent=0.
COIN closed with known realized PnL; the required post-close proof later appeared as position reconciliation OK matched=8 internal_count=8 broker_count=8.
Subsequent single pending entry orders cleared normally with TWS code 202; detector stayed quiet with events=[] / new_events=[].

Use this baseline when judging future sessions: the system can self-recover and continue safely, but completion depends on concrete safety gates, not dashboard color alone.

Record Keeping

For material findings, update both:

AEGIS/WORK_LOG.md
Delimit memory

Record exact absolute times in JST and ET, the market state, the ECS service/task state, heartbeat age/status, pending counts, reconciliation proof, market-data status, detector output, and the current decision.
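
A small helper can produce both absolute timestamps from one UTC instant, which avoids hand-converting across the JST/ET date boundary. This is a sketch using the standard-library `zoneinfo` module (Python 3.9+, with time zone data available on the host); the function name and output format are illustrative.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

JST = ZoneInfo("Asia/Tokyo")
ET = ZoneInfo("America/New_York")


def dual_timestamp(moment: datetime) -> str:
    """Format an aware datetime as absolute JST and ET wall-clock times."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    fmt = "%Y-%m-%d %H:%M:%S %Z"
    return f"{moment.astimezone(JST).strftime(fmt)} / {moment.astimezone(ET).strftime(fmt)}"


print(dual_timestamp(datetime(2026, 6, 1, 18, 0, tzinfo=timezone.utc)))
# prints: 2026-06-02 03:00:00 JST / 2026-06-01 14:00:00 EDT
```

The example shows why both zones matter: the same instant falls on 2026-06-02 in JST and on 2026-06-01 in ET.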