
AWS IBKR PT Market Session Runbook

Scope

This runbook covers AWS/Fargate aegis-pt-core as the primary IBKR paper-trading runtime during a US market session. It is written for unattended operation and successor handover: the reader should be able to decide whether to keep observing, notify, or stop and escalate without relying on Ryo being present.

Use this runbook only for read-only checks unless a separate action-time approval explicitly permits mutation.

Golden Signals

Treat the session as healthy when all of these are true:

| Signal | Healthy condition |
|---|---|
| Runtime primary | ECS service aegis-pt-core has desired=1, running=1, pending=0, and exactly one IBKR runtime task. |
| Gateway | ibkr-gateway is RUNNING/HEALTHY; lt-ibkr-trader is RUNNING. |
| Synology cutover guard | Synology aegis-lt-ibkr-* containers remain exited; do not run both primaries. |
| Heartbeat | Market-hours heartbeat is fresh, normally under 5 minutes old. |
| Reconciliation | broker_reconcile_pending=false, position_reconcile_pending=false, entry_blocking_reconcile_pending=false, and recent logs show position reconciliation OK. |
| Pending orders | pending_orders=0 or one normal-age entry order that clears/reprices/cancels with TWS code 202. |
| Exit safety | pending_exits=0, stuck_exits=0, and no unknown fill / broker-confirmation state. |
| Exit attribution | Recent closed trades should trend toward pnl_unknown=false; UNKNOWN exits are a reliability KPI, not cosmetic noise. |
| Broker imports | Open positions with reconcile-import-* / Broker tags should trend toward zero; each one means broker/state drift was repaired after the fact. |
| Market data | market_data_status=ok and market_data_10197_recent=0, or isolated 10197 without concrete harm while safety gates remain clean. |
| Severe signals | No Existing session detected, 1100/1101/1102, fatal/severe log burst, or send_outcome_unknown. |
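
The table can be read as a checklist over a status snapshot. The sketch below is illustrative only: it assumes the snapshot is a flat dict keyed by the field names in the table, and heartbeat_age_sec is a hypothetical key that does not appear in this runbook. It deliberately simplifies the market-data row (it does not model "isolated 10197 without concrete harm") and does not cover ECS, gateway, or log-based signals.

```python
from typing import Any, Dict, List

RECONCILE_FLAGS = (
    "broker_reconcile_pending",
    "position_reconcile_pending",
    "entry_blocking_reconcile_pending",
)


def unhealthy_signals(status: Dict[str, Any]) -> List[str]:
    """Return the names of golden signals that are not provably healthy."""
    problems: List[str] = []

    # heartbeat_age_sec is a hypothetical key; a missing value is treated as stale.
    if status.get("heartbeat_age_sec", float("inf")) >= 300:
        problems.append("heartbeat")

    # A missing reconciliation flag counts as unproven, not as healthy.
    for flag in RECONCILE_FLAGS:
        if status.get(flag, True):
            problems.append(flag)

    # Zero or one pending entry order is the normal pattern.
    pending = status.get("pending_orders")
    if pending is None or pending > 1:
        problems.append("pending_orders")

    for counter in ("pending_exits", "stuck_exits"):
        if status.get(counter) != 0:
            problems.append(counter)

    if status.get("market_data_status") != "ok":
        problems.append("market_data_status")

    return problems
```

An empty list means every modeled signal passed; it does not replace the log and ECS checks in the table.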

Read-Only Check

Run from a clean dev-Project checkout:

EECHECK_AWS_PROFILE=teriyaki-deploy \
  bash /Users/ryo/.codex/skills/eecheck/scripts/run_eecheck.sh

Always pass the AWS profile explicitly. The local skill default may point at a different historical profile; a missing default profile is an operator environment problem, not an AWS runtime failure.

Then run the AWS IBKR PT detector:

python3 scripts/ops/aws_ibkr_pt_health_detect.py <<<'{}'

Expected healthy detector result:

{
  "events": [],
  "new_events": []
}
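
When the detector output is captured by another script, the healthy result above can be checked mechanically. This is a minimal sketch that assumes the detector prints exactly one JSON object with the events and new_events keys shown above; the function name is hypothetical.

```python
import json


def detector_is_quiet(raw_output: str) -> bool:
    """True only when both events and new_events are present and empty."""
    result = json.loads(raw_output)
    # A missing key is treated as "not quiet" so a malformed result never passes.
    return result.get("events") == [] and result.get("new_events") == []
```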

For a successor-repeatable market-session evidence bundle, run:

SCHEDULE_CREATED_AFTER=2026-06-01T17:45:00Z \
  scripts/ops/aws_ibkr_pt_market_session_snapshot.sh

This is read-only. It records JST/ET time, detector output, optional GitHub schedule-gap evidence, and a final continue/notify decision line. Add RUN_PREDEPLOY_READINESS=1 when you also want the offline scheduler proof in the same report.

Automated Monitor

GitHub Actions workflow:

  • .github/workflows/notify-aws-ibkr-pt-health.yml
  • Name: Notify — AWS IBKR PT health
  • Schedule: primary at 07,17,27,37,47,57 minutes and fallback at 12,22,32,42,52 minutes on weekdays from 13:00 through 21:59 UTC. This covers the regular US market session plus open/close buffer while avoiding the high-load 00/10/20/30/40/50 minute marks where GitHub scheduled workflows are more likely to be delayed or dropped.
  • External fallback trigger: repository_dispatch event type aws_ibkr_pt_health. Use scripts/ops/dispatch_aws_ibkr_pt_health.py from a non-GitHub scheduler when the natural GitHub schedule is absent.
  • Prepared external scheduler: AEGIS/ops/aws-ibkr-pt-health-scheduler/ contains a Cloudflare Worker Cron fallback. Deploying it and setting GITHUB_DISPATCH_TOKEN are infrastructure/secret mutations and require action-time approval.
  • Path: GitHub Actions runner joins Tailscale with tag:ci, reads dashboard status_v3 / bot_status, runs scripts/ops/aws_ibkr_pt_health_detect.py, uploads the snapshot artifact, and posts Slack/email only when new_events is non-empty.
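
The schedule described above can be enumerated to see how many triggers one weekday should produce and to confirm that none lands on a 10-minute mark. This sketch only restates the minute lists and the 13:00 to 21:59 UTC window from this section; it does not read the workflow file, and the function name is hypothetical.

```python
from datetime import date, datetime, timezone
from typing import List

PRIMARY_MINUTES = (7, 17, 27, 37, 47, 57)
FALLBACK_MINUTES = (12, 22, 32, 42, 52)


def expected_triggers(day: date) -> List[datetime]:
    """List the scheduled trigger times (UTC) for one date; empty on weekends."""
    if day.weekday() >= 5:  # Saturday=5, Sunday=6
        return []
    minutes = sorted(PRIMARY_MINUTES + FALLBACK_MINUTES)
    return [
        datetime(day.year, day.month, day.day, hour, minute, tzinfo=timezone.utc)
        for hour in range(13, 22)  # hours 13 through 21 inclusive
        for minute in minutes
    ]
```

For a weekday this yields 99 trigger times (9 hours times 11 minute marks), which is the upper bound a gap check would compare against.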

External dispatch smoke command:

GH_TOKEN=... python3 scripts/ops/dispatch_aws_ibkr_pt_health.py

The token must be able to create a repository dispatch event for ryofukutani/dev-Project. Do not print or commit the token.
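
For orientation, a repository dispatch is a single authenticated POST to the GitHub REST endpoint /repos/{owner}/{repo}/dispatches with an event_type in the JSON body. The sketch below only builds that request and does not send it. It is not the content of scripts/ops/dispatch_aws_ibkr_pt_health.py, which may add a client_payload or other fields; the function name is hypothetical.

```python
import json
import urllib.request


def build_dispatch_request(
    token: str,
    repo: str = "ryofukutani/dev-Project",
    event_type: str = "aws_ibkr_pt_health",
) -> urllib.request.Request:
    """Build (but do not send) a repository_dispatch request."""
    body = json.dumps({"event_type": event_type}).encode("utf-8")
    return urllib.request.Request(
        f"https://api.github.com/repos/{repo}/dispatches",
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            # Never log this header; it carries the token.
            "Authorization": f"Bearer {token}",
        },
    )
```

Sending it would be a call to urllib.request.urlopen(request); a successful dispatch is expected to return HTTP 204 with no body.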

Before any Cloudflare deploy or secret action, run the scheduler predeploy proof:

bash AEGIS/ops/aws-ibkr-pt-health-scheduler/predeploy_readiness.sh

The same proof is available in GitHub Actions as AEGIS AWS IBKR PT Scheduler Tools. It compiles the detector/scheduler tooling, runs the Worker runtime test, verifies cron/config consistency, and runs offline regression tests. It does not set secrets, deploy Cloudflare, or touch the trading runtime.

For an action-time-approved deploy, capture STARTED_AT before deploying, then prove a fresh non-GitHub scheduler run afterward:

STARTED_AT=2026-06-01T18:00:00Z \
WORKER_URL=https://<worker-url> \
  bash AEGIS/ops/aws-ibkr-pt-health-scheduler/prove_after_deploy.sh

Completion requires a fresh repository_dispatch run, created after STARTED_AT, whose source includes cloudflare-cron:. A stale earlier repository_dispatch proof or a recovered GitHub natural schedule is not enough.

Use the gap verifier to separate "GitHub schedule recovered later" from "all expected windows were covered":

python3 AEGIS/ops/aws-ibkr-pt-health-scheduler/check_schedule_gap.py \
  --created-after 2026-06-01T17:45:00Z \
  --grace-minutes 8

missing_windows means the GitHub native schedule cannot be the only unattended monitor lane, even if later schedule runs completed successfully.
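
The idea behind missing_windows can be sketched as follows: an expected window counts as covered only if some run started within the grace period after it. This is an illustrative model under that assumption, not the actual logic of check_schedule_gap.py, which is not reproduced in this runbook.

```python
from datetime import datetime, timedelta
from typing import Iterable, List


def missing_windows(
    expected: Iterable[datetime],
    run_times: Iterable[datetime],
    grace_minutes: int = 8,
) -> List[datetime]:
    """Return expected windows with no run inside [window, window + grace]."""
    grace = timedelta(minutes=grace_minutes)
    runs = list(run_times)
    return [
        window
        for window in expected
        if not any(window <= run <= window + grace for run in runs)
    ]
```

With expected windows at 18:07, 18:17, and 18:27 and runs at 18:09 and 18:40, the last two windows are reported as missing even though a later run succeeded.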

Default detector thresholds:

| Environment variable | Default | Meaning |
|---|---|---|
| AEGIS_PT_HEARTBEAT_STALE_SEC | 300 | Critical heartbeat-stale threshold during market hours. |
| AEGIS_PT_PENDING_ORDER_WARN | 2 | Warn when pending orders reach this count. |
| AEGIS_PT_PENDING_ORDER_CRITICAL | 3 | Critical pending-order buildup threshold. |
| AEGIS_PT_MARKET_DATA_10197_WARN | 10 | Warn on sustained market-data 10197 after open grace. |
| AEGIS_PT_HIGH_EXIT_ATTEMPT_WARN | 12 | Warn when an open position repeatedly cancels/retries exits even while pending/stuck counters remain clean. |
| AEGIS_PT_HIGH_EXIT_ATTEMPT_CRITICAL | 20 | Critical high-exit-attempt threshold. This intentionally fires before the runtime's market-order escalation zone so successors can inspect/deploy while the position is still on limit-order rescue. |
| AEGIS_PT_OPEN_GRACE_MINUTE | 35 | Treat market as past open grace from 09:35 ET. |
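
As an illustration of how such thresholds are typically applied, the sketch below reads the two pending-order variables with the defaults from the table and maps a count to a severity. The function names are hypothetical, and treating both thresholds as "count reaches or exceeds the value" is an assumption; the detector's real comparison may differ.

```python
import os
from typing import Mapping, Optional


def threshold(name: str, default: int, env: Optional[Mapping[str, str]] = None) -> int:
    """Read an integer threshold from the environment, falling back to the default."""
    source = os.environ if env is None else env
    return int(source.get(name, default))


def pending_order_severity(count: int, env: Optional[Mapping[str, str]] = None) -> str:
    """Classify a pending-order count as ok, warn, or critical."""
    # Assumption: a threshold fires when the count reaches it (>=).
    if count >= threshold("AEGIS_PT_PENDING_ORDER_CRITICAL", 3, env):
        return "critical"
    if count >= threshold("AEGIS_PT_PENDING_ORDER_WARN", 2, env):
        return "warn"
    return "ok"
```

With the defaults, a single pending entry order stays "ok", which matches the normal entry-poll pattern described in this runbook.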

The workflow deduplicates event signatures through the previous snapshot artifact. A repeated known condition may appear in the run log without re-alerting until a new signature appears.
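
Signature-based deduplication can be sketched as below. The signature scheme here (a SHA-256 hash of the event serialized as canonical JSON) is an assumption for illustration; the workflow's actual signature format is not specified in this runbook.

```python
import hashlib
import json
from typing import Any, Dict, Iterable, List, Set


def event_signature(event: Dict[str, Any]) -> str:
    """Hash an event in a key-order-independent way (hypothetical scheme)."""
    canonical = json.dumps(event, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def select_new_events(
    events: Iterable[Dict[str, Any]], previous_signatures: Set[str]
) -> List[Dict[str, Any]]:
    """Keep only events whose signature was not in the previous snapshot."""
    return [e for e in events if event_signature(e) not in previous_signatures]
```

Under this model a repeated condition still appears in events but drops out of the alerting list until its signature changes.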

Current Stabilization Patch Gate

As of 2026-06-02 JST, two hardening patches are in flight:

| PR | Scope | Risk class | Preferred order |
|---|---|---|---|
| #94 high exit-attempt detector | Read-only GitHub Actions detector/runbook visibility. | Monitoring only; no broker/runtime mutation. | Merge first so successors see Broker, unknown-PnL, and high-attempt loops before runtime changes. |
| #93 completed-order + partial-fill repair | lt_ibkr_trader settlement and reconciliation behavior. | Runtime code change; deploy replaces the AWS trader/gateway task. | Deploy only after an action-time approval and pre/post eecheck proof. Prefer market closed or a quiet window. |

Do not treat a green test run as deployment approval. For #93, use this minimum gate:

  1. Before deployment, run the read-only check and detector from this runbook.
  2. Confirm exactly one AWS IBKR runtime task, Synology IBKR containers exited, fresh heartbeat, pending_exits=0, stuck_exits=0, all reconciliation flags false, and no send_outcome_unknown / session conflict.
  3. Prefer pending_orders=0. If a single normal-age entry order is present, wait for normal cancel/fill unless Ryo explicitly approves action during the poll window. Avoid replacing the Fargate task while an exit order is working.
  4. Deploy only through git / GitHub Actions / AWS pipeline. Do not edit files on Synology, run compose/docker mutations there, or copy artifacts by hand.
  5. After deployment, rerun eecheck and detector. Acceptance requires one AWS primary runtime, fresh heartbeat, gateway healthy, no pending/stuck exits, reconciliation OK, and no new Broker import or unknown fill caused by the rollout.
  6. If the deploy happens during market hours, watch at least two scan cycles. A high-exit-attempt loop may continue for pre-existing residual positions, but max_exit_attempt_count should no longer hide behind clean pending/stuck counters.

Decision Rules

Observe only

Keep AWS PT running and keep observing when:

  • pending_orders is 0, or exactly one normal-age entry order is present.
  • The pending order follows the normal timeout/cancel/reprice path with TWS code 202.
  • Heartbeat is fresh.
  • Reconciliation flags are false.
  • pending_exits=0 and stuck_exits=0.
  • market_data_status=ok, or isolated market-data notices appear without order/reconcile harm.
  • market_data_10197_recent remains below the detector warning threshold and is not rising in combination with stale heartbeat, reconnect/session conflict, pending buildup, or reconcile flags.

Why: a single entry order during entry_poll is normal live behavior. Do not confuse it with pending buildup.

Notify / page

Notify when any of these appears:

  • Heartbeat stale for more than 5 minutes during market hours.
  • pending_orders accumulates beyond the normal single-order entry-poll pattern, or a single order becomes stale.
  • pending_exits>0 persists, or stuck_exits>0.
  • Any reconciliation flag becomes true, or logs show missing broker/internal positions.
  • broker_confirmation_required=true, send_outcome_unknown, unknown fill, or impossible PnL/state anomaly appears.
  • Recent trades contain pnl_unknown=true, or open positions include Broker / reconcile-import-* tags. These are warnings unless paired with stuck exits, state drift, or send-outcome uncertainty, but they must be counted and driven down.
  • Existing session detected, IBKR 1100/1101/1102, fatal/severe burst, or task restart loop appears.
  • market_data_10197_recent is sustained above the detector threshold during market hours, or a smaller count appears together with stale heartbeat, connectivity/session conflict, pending buildup, or reconciliation risk.

Stop and escalate

Do not stop the live runtime merely for isolated 10197 or a single normal-age pending order. Stop/escalate only when a concrete safety signal exists:

  • broker/state drift,
  • stale heartbeat plus no self-recovery,
  • stuck exit or unknown fill,
  • repeated reconnect/session conflict,
  • pending-order buildup that does not clear,
  • or AWS service cannot maintain exactly one primary task.

Stopping or scaling the runtime is a live-trading operational mutation. It requires explicit approval unless an already-approved emergency procedure says otherwise.
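
The precedence between the three outcomes can be summarized as: stop signals outrank notify signals, and benign patterns never escalate on their own. The sketch below encodes only that ordering. All signal labels are hypothetical names invented for this example, and the result is a recommendation; it does not authorize stopping or scaling anything.

```python
from typing import Iterable

# Hypothetical labels for the stop/escalate conditions listed above.
STOP_SIGNALS = frozenset({
    "broker_state_drift",
    "heartbeat_stale_no_recovery",
    "stuck_exit",
    "unknown_fill",
    "repeated_session_conflict",
    "pending_buildup_not_clearing",
    "not_exactly_one_primary_task",
})

# Patterns this runbook treats as normal when they occur alone.
BENIGN_SIGNALS = frozenset({"isolated_10197", "single_normal_age_entry_order"})


def decide(active_signals: Iterable[str]) -> str:
    """Map active signal labels to observe, notify, or stop-and-escalate."""
    active = set(active_signals) - BENIGN_SIGNALS
    if active & STOP_SIGNALS:
        return "stop-and-escalate"
    if active:
        return "notify"
    return "observe"
```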

2026-06-01 Recovery Baseline

The 2026-06-01 post-open session provides the current baseline:

  • IBKR market-data subscriptions were restored after funding.
  • The first post-open Fargate task hit a TWS disconnect and self-recovered through ECS replacement.
  • 10197 / market_data_status=degraded appeared after open, then recovered to market_data_status=ok and market_data_10197_recent=0.
  • COIN closed with known realized PnL; the required post-close proof later appeared as position reconciliation OK matched=8 internal_count=8 broker_count=8.
  • Subsequent single pending entry orders cleared normally with TWS code 202; detector stayed quiet with events=[] / new_events=[].

Use this baseline when judging future sessions: the system can self-recover and continue safely, but completion depends on concrete safety gates, not dashboard color alone.

Record Keeping

For material findings, update both:

  • AEGIS/WORK_LOG.md
  • Delimit memory

Record exact absolute times in JST and ET, the market state, the ECS service/task state, heartbeat age/status, pending counts, reconciliation proof, market-data status, detector output, and the current decision.
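
To keep JST and ET timestamps consistent across entries, both can be derived from one UTC instant. This is a small sketch using the standard-library zoneinfo module (Python 3.9+, with time zone data available on the host); the function name and output layout are hypothetical.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def record_timestamp(moment: datetime) -> str:
    """Format one timezone-aware instant as absolute JST and ET wall-clock times."""
    jst = moment.astimezone(ZoneInfo("Asia/Tokyo"))
    et = moment.astimezone(ZoneInfo("America/New_York"))
    return f"{jst:%Y-%m-%d %H:%M:%S} JST / {et:%Y-%m-%d %H:%M:%S} ET"
```

Calling record_timestamp(datetime.now(timezone.utc)) at the moment of the check avoids hand-converting between the two zones, including across daylight-saving changes in ET.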