
# Building Failsafe Architecture for Mission-Critical Trading Systems

When you’re running automated trading systems managing real money with live positions on exchanges, “it usually works” isn’t good enough. A single oversight in service lifecycle management can leave your positions exposed, unmonitored, and vulnerable to market moves.

This post documents a critical architectural change we made to the PoG trading system after discovering a dangerous design flaw.

## The Disaster Scenario

Picture this: You’re running three automated trading services:
- **PM (Portfolio Manager)** – The master controller on port 9004
- **Trader HL** – HyperLiquid automated trading on port 9002
- **Trader Lighter** – Lighter Exchange trading on port 9102

PM provides the oversight dashboard, aggregates positions, tracks equity, and monitors service health. The traders execute actual trades on their respective exchanges.

Everything looks great in the UI – all services showing “Healthy”, positions displayed, real-time PnL streaming. Then you need to restart PM for an update…

**And both traders die.**

No warning. No graceful handoff. Both services killed. Positions still open on the exchanges but no longer monitored. Stop losses orphaned. No one watching.

## Root Cause: The Master Controller Problem

The original architecture treated PM as a “master controller”. When PM shut down, it called `shutdown_all_services()`, which terminated all trader processes:

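```python
# THE DANGEROUS CODE
from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # (startup omitted)
    yield
    # This killed all traders!
    if service_manager:
        await service_manager.shutdown_all_services()
```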

This design made sense from a DevOps perspective – clean shutdown, no orphan processes. But it’s catastrophic for trading:

1. **PM restart = Trading stops** – Any PM maintenance stops all trading
2. **PM crash = Exposure** – If PM crashes, traders die with it
3. **Single point of failure** – PM becomes the weak link in the chain

## The Fix: Service Independence

The fundamental insight is that **traders are the mission-critical component, not PM**. Traders manage actual positions on exchanges. PM just provides visibility.

The new architecture inverts the relationship:

```python
# THE SAFE CODE
from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # (startup omitted)
    yield
    # CRITICAL: Do NOT shutdown trader services – they must keep running!
    if service_manager:
        if service_manager._watchdog_enabled:
            await service_manager.stop_watchdog()
    await aggregator.close()
    # Traders continue running independently
```

### Service Lifecycle Now

**PM Startup:**
1. Check for running traders on their ports
2. **ADOPT** existing traders (don’t restart!)
3. Start watchdog to monitor health
4. Start any services that aren’t running

**PM Shutdown:**
1. Stop watchdog (stop monitoring)
2. Close HTTP clients cleanly
3. EXIT – traders keep running!

**PM Crash:**
1. Traders continue operating independently
2. Windows scheduled task restarts PM within 5 minutes
3. PM adopts traders on restart
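
The adopt-or-start decision at PM startup can be sketched in a few lines of Python. This is not the PoGPM implementation: `TRADER_PORTS`, `is_port_open`, and `plan_startup` are hypothetical names, and the sketch assumes that an open TCP port is enough to treat a trader as running. A production check would likely also query the trader's own health endpoint, since an unrelated process could be holding the port.

```python
import socket

# Ports are taken from the post; the service keys are illustrative.
TRADER_PORTS = {"trader_hl": 9002, "trader_lighter": 9102}


def is_port_open(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


def plan_startup(ports: dict[str, int], probe=is_port_open) -> dict[str, str]:
    """Decide per service: adopt a running instance, or start a new one."""
    return {
        name: "adopt" if probe(port) else "start"
        for name, port in ports.items()
    }
```

Calling `plan_startup(TRADER_PORTS)` maps each service name to either `"adopt"` or `"start"`. Passing the probe in as a parameter keeps the decision logic testable without opening real sockets.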

## UI Failsafe Indicators

A silent failure is worse than a loud one. The UI now shows clear warnings when connectivity is lost:

### API Health Tracking

```javascript
// Track consecutive API failures
this.apiHealth = {
    lastSuccessfulFetch: null,
    consecutiveFailures: 0,
    maxFailuresBeforeWarning: 2,
    isApiDown: false
};
```

### Visual Warnings

After 2 consecutive API failures:
- **Red pulsing banner**: “PM API UNREACHABLE”
- **Unknown service badges**: Services show “Unknown” instead of stale “Healthy”
- **Last update timestamp**: Shows when data was last successfully fetched

The banner pulses with CSS animation to ensure it catches attention:

```css
.api-down-banner {
    background: linear-gradient(90deg, #4a0a0a 0%, #2a0505 50%, #4a0a0a 100%);
    border-bottom: 2px solid #ff0000;
    animation: critical-pulse 1s ease-in-out infinite;
}

@keyframes critical-pulse {
    0%, 100% { background: linear-gradient(90deg, #4a0a0a…); }
    50% { background: linear-gradient(90deg, #5a1010…); }
}
```
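
The failure-counting rule behind these warnings is small enough to model directly. The sketch below is a Python model of the front end's `apiHealth` state, not the actual `app.js` code: the class and method names are hypothetical, the fields are renamed to snake_case, and it assumes the warning trips once the failure streak reaches `maxFailuresBeforeWarning` and clears on the next successful fetch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ApiHealth:
    """Model of the UI health state; names are illustrative."""
    max_failures_before_warning: int = 2
    consecutive_failures: int = 0
    last_successful_fetch: Optional[float] = None
    is_api_down: bool = False

    def record_success(self, now: float) -> None:
        # A successful fetch clears the failure streak and the warning.
        self.last_successful_fetch = now
        self.consecutive_failures = 0
        self.is_api_down = False

    def record_failure(self) -> None:
        # The warning trips once the streak reaches the threshold.
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_failures_before_warning:
            self.is_api_down = True

    def badge(self, cached_status: str) -> str:
        # Never show a cached "Healthy" while the API is unreachable.
        return "Unknown" if self.is_api_down else cached_status


if __name__ == "__main__":
    health = ApiHealth()
    health.record_success(now=100.0)
    health.record_failure()
    print(health.badge("Healthy"))  # Healthy (one failure, below threshold)
    health.record_failure()
    print(health.badge("Healthy"))  # Unknown (threshold reached)
```

Keeping `last_successful_fetch` untouched on failure is what lets the UI show how old the displayed data is while the banner is up.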

## Windows Scheduled Task Watchdog

For production resilience, PM itself needs a watchdog. We created a PowerShell script that runs as a Windows Scheduled Task:

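```powershell
# pm_watchdog.ps1
function Test-PMRunning {
    try {
        # -UseBasicParsing avoids the Internet Explorer dependency in Windows PowerShell 5.1,
        # which is often unavailable when the task runs as SYSTEM
        $response = Invoke-WebRequest -Uri "http://localhost:9004/api/status" -TimeoutSec 5 -UseBasicParsing
        return $response.StatusCode -eq 200
    } catch {
        return $false
    }
}

if (-not (Test-PMRunning)) {
    # Write-Log is a logging helper defined elsewhere in the script (not shown)
    Write-Log "PM is NOT running - attempting restart..."
    Start-Process -FilePath "cmd.exe" -ArgumentList "/c cd /d D:\PoG\PM && python run.py"
}
```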

Setup:
```batch
REM Run as Administrator
schtasks /create /tn "PoG PM Watchdog" /tr "PowerShell.exe -ExecutionPolicy Bypass -File D:\PoG\PM\scripts\pm_watchdog.ps1" /sc minute /mo 5 /ru SYSTEM
```

Because the task runs every 5 minutes, a dead PM is detected and a restart is attempted within roughly 5 minutes of the failure.

## Explicit Service Control

When you actually **want** to stop everything (for maintenance, end of day, etc.), use the explicit API:

```bash
# Stop all services intentionally
curl -X POST http://localhost:9004/api/services/stop-all
```

Or via the UI’s “Stop All” button in the Services panel.
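
The difference between detaching on shutdown and stopping on request can be shown with a toy model. This is a sketch under stated assumptions, not the PoGPM code: the `ServiceManager` class here is a stand-in, `stop_all()` is a hypothetical method corresponding to the stop-all endpoint, and while `detach_services()` is the method name listed in the code changes, its behavior here is assumed.

```python
class ServiceManager:
    """Toy stand-in for PM's service manager (illustrative only)."""

    def __init__(self, services):
        # name -> running flag; real code would track processes or PIDs.
        self.running = {name: True for name in services}
        self.tracked = set(services)

    def detach_services(self):
        """PM shutdown path: stop tracking, leave the processes alone."""
        self.tracked.clear()

    def stop_all(self):
        """Explicit operator action: stop every tracked service."""
        stopped = sorted(self.tracked)
        for name in stopped:
            self.running[name] = False
        self.tracked.clear()
        return stopped


if __name__ == "__main__":
    pm = ServiceManager(["trader_hl", "trader_lighter"])
    pm.detach_services()  # what a PM shutdown does
    print(pm.running)     # both traders are still marked as running
```

Only `stop_all()` ever flips a service to stopped, so stopping trading requires a deliberate call and can never happen as a side effect of PM exiting.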

## Testing the Fix

After implementing:

1. Started PM and both traders
2. Killed PM process (`taskkill /F /PID …`)
3. **Verified traders still running** on ports 9002 and 9102
4. Restarted PM
5. PM adopted running traders automatically

```
PM is NOT running - attempting restart...
PM started with PID: 50436
PM is now responding on port 9004
Trader HL: RUNNING
Trader Lighter: RUNNING
```

## Lessons Learned

1. **Question the “Master Controller” pattern** – In trading systems, the monitors aren’t the critical path. The execution services are.

2. **Fail-safe vs. fail-secure** – When a supervisor dies, trading services should fail safe (keep running and managing their positions), not fail secure (stop everything).

3. **Never trust “Healthy” without verification** – Cached status lies. Real-time health checks don’t.

4. **Design for PM crashes** – PM will crash. Make sure traders don’t care.

5. **Explicit is better than implicit** – If you want to stop services, make it an explicit action, not a side effect.

## Architecture Summary

```
Before:
PM controls traders → PM dies → traders die → positions exposed

After:
PM monitors traders → PM dies → traders continue → PM restarts → PM adopts traders
```

The traders are now truly independent services. PM provides visibility and convenience, but it’s no longer a single point of failure.

## Code Changes

The full implementation is in the PoGPM repository:

- **`backend/main.py`** – Modified lifespan shutdown to not kill services
- **`backend/services/service_manager.py`** – Added `detach_services()` method
- **`frontend/static/app.js`** – API health tracking and warnings
- **`frontend/static/styles.css`** – Critical warning banner styles
- **`scripts/pm_watchdog.ps1`** – Windows scheduled task for PM auto-restart

## GitHub Repository

Full source available at: **https://github.com/sirdavidjcoops/PoGPM**

*This is part of the PoG (Proof of Gains) trading system – automated cryptocurrency trading using reinforcement learning. When real money is on the line, every design decision matters.*