# From Monolith to Microservices: Building a Resilient Trading System with Failsafe Architecture
When we first built our trading system, it was a classic monolithic application. Everything lived in one codebase: market data ingestion, signal generation, order execution, risk management, and reporting. It worked—until it didn’t. As the system grew, deployments became risky, scaling was inefficient, and a bug in one component could bring down the entire platform.
This is the story of how we decomposed that monolith into microservices, built a Portfolio Manager to orchestrate them all, and then took reliability to the next level with a dedicated Failsafe Service.
## The Original Monolith: Where We Started
Our initial trading system was a single Python application handling everything:
- **Market Data Handler**: WebSocket connections to exchanges, parsing tick data
- **Signal Engine**: Technical analysis, pattern recognition, ML model inference
- **Order Manager**: Position sizing, order routing, execution
- **Risk Controller**: Drawdown limits, exposure checks, correlation analysis
- **Database Layer**: Trade logging, performance metrics, audit trails
The problems were textbook monolith issues:
1. **Deployment Fear**: Every release risked breaking unrelated functionality
2. **Scaling Limitations**: We couldn’t scale the compute-heavy signal engine without also scaling the I/O-bound market data handler
3. **Development Bottlenecks**: Teams stepped on each other’s code constantly
4. **Single Point of Failure**: One crashed thread meant complete system failure
## The Microservices Rewrite
We decomposed the monolith into discrete, independently deployable services:
### Market Data Service
Handles all exchange connections, normalizes data formats, and publishes to a message queue. It maintains persistent WebSocket connections and implements automatic reconnection logic with exponential backoff.
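That reconnection loop is worth making concrete. A minimal sketch, assuming a `connect` coroutine that opens the WebSocket (the name and parameter values here are illustrative, not our production API):

```python
import asyncio
import random

async def connect_with_backoff(connect, base_delay=1.0, max_delay=60.0):
    """Retry `connect` until it succeeds, doubling the wait each attempt
    and adding jitter. `connect` stands in for whatever coroutine opens
    the exchange WebSocket; parameter values are illustrative."""
    attempt = 0
    while True:
        try:
            return await connect()
        except OSError:
            delay = min(max_delay, base_delay * (2 ** attempt))
            delay *= 0.5 + random.random() / 2  # jitter: 50-100% of the delay
            attempt += 1
            await asyncio.sleep(delay)
```

The jitter matters in practice: without it, every consumer reconnects on the same schedule after an exchange outage and hammers the endpoint in lockstep.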
### Signal Generation Service
Consumes market data, runs technical indicators and ML models, and emits trading signals. This service is stateless and horizontally scalable—we can spin up multiple instances during high-volatility periods.
### Execution Service
Receives signals, applies position sizing rules, and routes orders to exchanges. It maintains its own order state machine and handles partial fills, rejections, and amendments.
### Risk Service
Monitors all positions in real-time, enforces drawdown limits, and can emit emergency stop signals. It has read access to all other services’ state but operates independently.
### Persistence Service
Handles all database operations, providing a clean API for trade logging, performance queries, and audit compliance.
Each service communicates via Redis Pub/Sub for real-time events and REST APIs for synchronous queries. They’re containerized with Docker and orchestrated with Docker Compose for local development, with Kubernetes manifests ready for production.
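To make the Pub/Sub side concrete, here is a minimal sketch of how the Market Data Service might publish a normalized tick. The `TickEvent` fields and channel name are illustrative, and `redis_client` is assumed to expose an async `publish(channel, message)` as redis-py's `redis.asyncio.Redis` does:

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical event shape; field names are illustrative.
@dataclass
class TickEvent:
    symbol: str
    bid: float
    ask: float
    ts: float

async def publish_tick(redis_client, event: TickEvent, channel: str = "ticks"):
    """Serialize a normalized tick and publish it to a Pub/Sub channel."""
    await redis_client.publish(channel, json.dumps(asdict(event)))
```

Subscribers (the Signal Generation Service, the Risk Service) consume the same channel independently, which is what lets them scale and fail separately.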
## Enter the Portfolio Manager
With five independent services running, we needed a conductor. The Portfolio Manager was born as the central orchestration layer:
### What the Portfolio Manager Does
1. **Service Discovery**: Maintains a registry of all running services and their health status
2. **Configuration Distribution**: Pushes trading parameters, risk limits, and feature flags to all services
3. **State Coordination**: Ensures all services have a consistent view of current positions and P&L
4. **Workflow Orchestration**: Manages complex multi-step operations like end-of-day reconciliation
5. **Logging Aggregation**: Collects logs and metrics from all services into a unified dashboard
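To make the service-discovery piece concrete, a heartbeat-based registry can be sketched in a few lines. This is an in-memory toy, not our production registry; the names and TTL are illustrative:

```python
import time
from typing import Dict, List

class ServiceRegistry:
    """Heartbeat-based registry sketch: services check in periodically,
    and entries older than `ttl` seconds are treated as down."""

    def __init__(self, ttl: float = 15.0):
        self.ttl = ttl
        self._heartbeats: Dict[str, float] = {}

    def heartbeat(self, name: str) -> None:
        # Called by each service (or a sidecar) on its heartbeat interval
        self._heartbeats[name] = time.monotonic()

    def healthy_services(self) -> List[str]:
        now = time.monotonic()
        return [n for n, t in self._heartbeats.items() if now - t <= self.ttl]
```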
### The Portfolio Manager Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                     PORTFOLIO MANAGER                       │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐        │
│  │   Service   │   │   Config    │   │    State    │        │
│  │  Registry   │   │   Manager   │   │    Store    │        │
│  └─────────────┘   └─────────────┘   └─────────────┘        │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐        │
│  │  Workflow   │   │   Metrics   │   │     API     │        │
│  │   Engine    │   │  Collector  │   │   Gateway   │        │
│  └─────────────┘   └─────────────┘   └─────────────┘        │
└─────────────────────────────────────────────────────────────┘
       │              │              │              │
  ┌────┴────┐    ┌────┴────┐    ┌────┴────┐    ┌────┴────┐
  │ Market  │    │ Signal  │    │  Exec   │    │  Risk   │
  │  Data   │    │  Gen    │    │ Service │    │ Service │
  └─────────┘    └─────────┘    └─────────┘    └─────────┘
```
The Portfolio Manager itself runs as a service, exposing a web dashboard for manual oversight and a REST API for programmatic control.
## The Missing Piece: What Happens When Things Go Wrong?
Our microservices architecture was a massive improvement. Deployments were safer, scaling was granular, and development velocity increased. But we had a critical gap: **what happens when services fail?**
The Portfolio Manager could detect that a service was unhealthy, but it couldn’t:
- Automatically restart failed services
- Spin up replacement instances
- Reroute traffic during partial outages
- Maintain trading continuity during infrastructure issues
We were still vulnerable to:
- **Network partitions** between services
- **Resource exhaustion** causing service crashes
- **Cascading failures** when one service’s problems spread
- **External dependency failures** (exchange APIs, data providers)
This led us to design the **Failsafe Service**—a dedicated component focused entirely on system resilience.
## The Failsafe Service: Deep Dive
The Failsafe Service is not just a health checker. It’s a comprehensive reliability layer that implements multiple patterns for fault tolerance, automatic recovery, and graceful degradation.
### Core Responsibilities
#### 1. Health Monitoring and Diagnostics
The Failsafe Service implements multi-level health checking:
**Shallow Health Checks (Every 5 seconds)**
- TCP connection verification
- HTTP endpoint responsiveness
- Basic “am I alive” pings
**Deep Health Checks (Every 30 seconds)**
- Functional verification (can the service actually do its job?)
- Dependency checks (are downstream services accessible?)
- Resource utilization (CPU, memory, disk, file descriptors)
**Synthetic Transactions (Every 60 seconds)**
- End-to-end workflow tests
- Simulated trades through the entire pipeline
- Latency measurements at each hop
```python
import asyncio
import time
from collections import defaultdict, deque
from datetime import datetime
from typing import List

import aiohttp


class HealthCheckOrchestrator:
    def __init__(self, services: List[ServiceConfig]):
        self.services = services
        self.check_history = defaultdict(lambda: deque(maxlen=100))

    async def shallow_check(self, service: ServiceConfig) -> HealthResult:
        """Quick liveness check - TCP + HTTP ping"""
        start = time.time()
        try:
            timeout = aiohttp.ClientTimeout(total=2)
            async with aiohttp.ClientSession(timeout=timeout) as session:
                async with session.get(f"{service.url}/health") as resp:
                    return HealthResult(
                        service=service.name,
                        status="healthy" if resp.status == 200 else "degraded",
                        latency_ms=(time.time() - start) * 1000,
                        timestamp=datetime.utcnow(),
                    )
        except Exception as e:
            return HealthResult(
                service=service.name,
                status="unhealthy",
                error=str(e),
                timestamp=datetime.utcnow(),
            )

    async def deep_check(self, service: ServiceConfig) -> HealthResult:
        """Comprehensive health verification"""
        checks = await asyncio.gather(
            self._check_dependencies(service),
            self._check_resources(service),
            self._check_functionality(service),
            return_exceptions=True,
        )
        # Aggregate results and determine overall health
        return self._aggregate_health(service, checks)
```
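The synthetic-transaction tier mentioned above isn't shown in that snippet. A minimal sketch, assuming each pipeline hop is an async callable that transforms a simulated payload (the hop names and payload fields are illustrative):

```python
import asyncio
import time

async def synthetic_transaction(hops):
    """Push a simulated (zero-quantity) trade through each pipeline hop
    in order and record per-hop latency in milliseconds. `hops` maps a
    hop name to an async callable; names and payload shape are illustrative."""
    payload = {"symbol": "TEST", "qty": 0, "synthetic": True}
    latencies_ms = {}
    for name, hop in hops.items():
        start = time.perf_counter()
        payload = await hop(payload)  # each hop returns the enriched payload
        latencies_ms[name] = (time.perf_counter() - start) * 1000
    return latencies_ms
```

Flagging the payload as `synthetic` lets downstream services process it through the real code paths while execution skips the actual exchange order.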
#### 2. Circuit Breaker Implementation
When a service starts failing, we don’t want to keep hammering it with requests. The Failsafe Service implements the circuit breaker pattern for all inter-service communication:
**States:**
- **Closed**: Normal operation, requests flow through
- **Open**: Service is failing, requests are rejected immediately
- **Half-Open**: Testing if service has recovered
**Thresholds:**
- Open circuit after 5 consecutive failures or 50% failure rate in 30-second window
- Attempt half-open after 30 seconds
- Close circuit after 3 successful requests in half-open state
```python
import time
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    pass


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30, half_open_requests=3):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_requests = half_open_requests
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None

    async def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if self._should_attempt_reset():
                self.state = CircuitState.HALF_OPEN
                self.success_count = 0
            else:
                raise CircuitOpenError("Circuit breaker is open")
        try:
            result = await func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _should_attempt_reset(self):
        return (self.last_failure_time is not None and
                time.time() - self.last_failure_time >= self.recovery_timeout)

    def _on_success(self):
        if self.state == CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.half_open_requests:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
                self.success_count = 0

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        # A failure while half-open re-opens the circuit immediately
        if self.state == CircuitState.HALF_OPEN or self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
```
#### 3. Automatic Service Recovery
The Failsafe Service doesn’t just detect problems—it fixes them:
**Restart Strategies:**
- **Immediate Restart**: For transient failures (memory spike, deadlock)
- **Delayed Restart**: For failures that might need external resolution
- **Escalating Restart**: Increasing delays between restart attempts
**Recovery Actions:**
1. Attempt graceful shutdown (SIGTERM)
2. Wait for cleanup (configurable timeout)
3. Force kill if necessary (SIGKILL)
4. Clear any corrupted state files
5. Start fresh instance
6. Verify health before marking as recovered
7. Notify Portfolio Manager of state change
```python
import asyncio
import logging
from collections import defaultdict

logger = logging.getLogger(__name__)


class ServiceRecoveryManager:
    def __init__(self, docker_client, notification_service):
        self.docker = docker_client
        self.notifications = notification_service
        self.restart_counts = defaultdict(int)
        self.backoff_calculator = ExponentialBackoff(base=5, max=300)

    async def recover_service(self, service_name: str, failure_reason: str):
        """Orchestrate full service recovery"""
        self.restart_counts[service_name] += 1
        restart_count = self.restart_counts[service_name]

        # Calculate backoff delay
        delay = self.backoff_calculator.get_delay(restart_count)
        logger.warning(f"Recovering {service_name} (attempt {restart_count}), "
                       f"waiting {delay}s, reason: {failure_reason}")
        await asyncio.sleep(delay)

        # Stop the failed container (graceful, then forced)
        container = self.docker.containers.get(service_name)
        try:
            container.stop(timeout=10)
        except Exception:
            container.kill()

        # Clear any problematic state
        await self._clear_service_state(service_name)

        # Start fresh container
        container.start()

        # Wait for health
        healthy = await self._wait_for_health(service_name, timeout=60)
        if healthy:
            self.restart_counts[service_name] = 0  # Reset on success
            await self.notifications.send(
                level="info",
                message=f"Service {service_name} recovered successfully"
            )
        else:
            await self.notifications.send(
                level="critical",
                message=f"Service {service_name} failed to recover after restart"
            )
            # Escalate to human intervention
            await self._escalate_to_oncall(service_name)
```
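The `ExponentialBackoff` helper used above is not shown; a minimal version consistent with `base=5, max=300` might look like:

```python
class ExponentialBackoff:
    """Delay doubles per attempt: base * 2^(attempt - 1), capped at `max`."""

    def __init__(self, base: float = 5, max: float = 300):
        self.base = base
        self.max = max

    def get_delay(self, attempt: int) -> float:
        return min(self.max, self.base * 2 ** max(attempt - 1, 0))
```

With the defaults, attempts 1 through 7+ wait 5, 10, 20, 40, 80, 160, then 300 seconds, so a service stuck in a crash loop quickly stops consuming restart capacity.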
#### 4. Fallback Mechanisms
When a service is unavailable, the system should degrade gracefully rather than fail completely:
**Market Data Service Fallback:**
- Primary: Direct WebSocket to exchange
- Secondary: Backup data provider API
- Tertiary: Cached last-known prices with staleness indicators
**Signal Generation Service Fallback:**
- Primary: Full ML model inference
- Secondary: Simplified rule-based signals
- Tertiary: Hold current positions (no new signals)
**Execution Service Fallback:**
- Primary: Smart order routing
- Secondary: Direct exchange API with basic logic
- Tertiary: Queue orders for later execution
```python
import logging
from typing import List

logger = logging.getLogger(__name__)


class FallbackChain:
    def __init__(self, strategies: List[FallbackStrategy]):
        self.strategies = strategies
        self.current_strategy_index = 0

    async def execute(self, request: Request) -> Response:
        for i, strategy in enumerate(self.strategies[self.current_strategy_index:],
                                     start=self.current_strategy_index):
            try:
                result = await strategy.execute(request)
                if i != self.current_strategy_index:
                    logger.info(f"Fallback to strategy {i}: {strategy.name}")
                    self.current_strategy_index = i
                return result
            except StrategyUnavailableError:
                continue
        raise AllStrategiesExhaustedError("No fallback strategies available")

    async def attempt_primary_recovery(self):
        """Periodically try to restore primary strategy"""
        if self.current_strategy_index > 0:
            primary = self.strategies[0]
            if await primary.is_healthy():
                logger.info("Primary strategy recovered, restoring")
                self.current_strategy_index = 0
```
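The chain assumes a small strategy interface that isn't spelled out above. One possible shape, with method names mirroring the chain's usage (`execute`, `is_healthy`, `name`); the default `is_healthy` is just a placeholder:

```python
from abc import ABC, abstractmethod


class FallbackStrategy(ABC):
    """Sketch of the interface FallbackChain relies on."""

    name: str = "unnamed"

    @abstractmethod
    async def execute(self, request):
        """Handle the request, or raise StrategyUnavailableError."""

    async def is_healthy(self) -> bool:
        # Default placeholder; real strategies would probe their backend
        return True
```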
#### 5. State Preservation and Recovery
Trading systems can’t lose state. The Failsafe Service maintains redundant state stores:
**State Synchronization:**
- All services publish state changes to a Redis stream
- Failsafe Service maintains a materialized view of all state
- On service restart, state is injected before the service goes live
**Checkpoint System:**
- Full state snapshots every 5 minutes
- Incremental state updates in real-time
- Point-in-time recovery capability
```python
import json
from datetime import datetime

import aiohttp


class StatePreservationManager:
    def __init__(self, redis_client, s3_client, services):
        self.redis = redis_client
        self.s3 = s3_client
        self.services = services  # name -> ServiceConfig
        self.state_cache = {}

    async def capture_state(self, service_name: str) -> ServiceState:
        """Capture current state of a service"""
        service = self.services[service_name]
        # Get from service if healthy
        try:
            async with aiohttp.ClientSession() as session:
                async with session.get(f"{service.url}/state") as resp:
                    state = ServiceState.from_dict(await resp.json())
                    self.state_cache[service_name] = state
                    return state
        except Exception:
            # Fall back to cached state
            return self.state_cache.get(service_name)

    async def restore_state(self, service_name: str, state: ServiceState):
        """Inject state into a restarted service"""
        service = self.services[service_name]
        async with aiohttp.ClientSession() as session:
            await session.post(
                f"{service.url}/state",
                json=state.to_dict()
            )

    async def create_checkpoint(self):
        """Full system state checkpoint"""
        checkpoint = SystemCheckpoint(
            timestamp=datetime.utcnow(),
            services={
                name: await self.capture_state(name)
                for name in self.services
            }
        )
        # Store in S3 for durability
        await self.s3.put_object(
            Bucket="trading-checkpoints",
            Key=f"checkpoint-{checkpoint.timestamp.isoformat()}.json",
            Body=json.dumps(checkpoint.to_dict())
        )
```
#### 6. Network Partition Handling
In distributed systems, network partitions are inevitable. The Failsafe Service implements split-brain prevention:
**Partition Detection:**
- Heartbeats between all services through the Failsafe Service
- Quorum-based decision making for state changes
- Fencing tokens for operations that require coordination
**Partition Resolution:**
- Automatic partition healing detection
- State reconciliation after partition heals
- Conflict resolution with last-writer-wins or custom merge logic
```python
from datetime import datetime
from typing import List, Set


class PartitionDetector:
    def __init__(self, services: List[str], quorum_size: int):
        self.services = services
        self.quorum_size = quorum_size
        self.last_seen = {}

    async def check_partition(self) -> PartitionStatus:
        """Detect if we're in a network partition"""
        reachable = []
        unreachable = []
        for service in self.services:
            if await self._can_reach(service):
                reachable.append(service)
                self.last_seen[service] = datetime.utcnow()
            else:
                unreachable.append(service)

        if len(reachable) >= self.quorum_size:
            return PartitionStatus.MAJORITY_PARTITION
        elif len(reachable) > 0:
            return PartitionStatus.MINORITY_PARTITION
        else:
            return PartitionStatus.ISOLATED

    def get_allowed_operations(self, status: PartitionStatus) -> Set[Operation]:
        """Determine what operations are safe during partition"""
        if status == PartitionStatus.MAJORITY_PARTITION:
            return {Operation.READ, Operation.WRITE, Operation.TRADE}
        elif status == PartitionStatus.MINORITY_PARTITION:
            return {Operation.READ}  # Read-only mode
        else:
            return set()  # No operations allowed
```
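The fencing tokens mentioned under partition detection deserve a concrete sketch. The idea: a coordinator hands out monotonically increasing tokens, and any shared resource rejects writes carrying a token older than the highest it has seen. This is a toy in-memory model; the class names are illustrative:

```python
class FencingTokenIssuer:
    """Hands out strictly increasing tokens (the coordinator's role)."""

    def __init__(self):
        self._token = 0

    def issue(self) -> int:
        self._token += 1
        return self._token


class FencedResource:
    """A shared resource that rejects writes from stale token holders,
    e.g. a writer that paused through a partition and resumed late."""

    def __init__(self):
        self._highest_seen = 0
        self.value = None

    def write(self, token: int, value) -> bool:
        if token < self._highest_seen:
            return False  # stale writer: fenced off
        self._highest_seen = token
        self.value = value
        return True
```

This is what prevents split-brain writes: a node that was partitioned away and later reconnects still holds an old token, so its in-flight operations bounce off the resource instead of clobbering newer state.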
#### 7. Capacity Management and Auto-Scaling
The Failsafe Service monitors resource utilization and triggers scaling:
**Scaling Triggers:**
- CPU utilization > 70% for 2 minutes
- Memory utilization > 80%
- Request queue depth > threshold
- Response latency > SLA threshold
**Scaling Actions:**
- Horizontal: Add more container instances
- Vertical: Request more resources from orchestrator
- Predictive: Pre-scale based on historical patterns (market open, news events)
```python
class AutoScaler:
    def __init__(self, kubernetes_client, metrics_client):
        self.k8s = kubernetes_client
        self.metrics = metrics_client
        self.scaling_history = []

    async def evaluate_scaling(self, service: ServiceConfig) -> ScalingDecision:
        """Determine if scaling action is needed"""
        current_metrics = await self.metrics.get_current(service.name)
        historical_metrics = await self.metrics.get_historical(service.name, hours=1)

        # Check against thresholds
        if current_metrics.cpu_percent > 70:
            trend = self._calculate_trend(historical_metrics.cpu)
            if trend > 0:  # Still increasing
                return ScalingDecision(
                    action=ScaleAction.SCALE_UP,
                    reason="CPU utilization trending up",
                    target_replicas=self._calculate_target_replicas(current_metrics)
                )

        if current_metrics.request_latency_p99 > service.sla_latency_ms:
            return ScalingDecision(
                action=ScaleAction.SCALE_UP,
                reason="Latency exceeding SLA",
                target_replicas=current_metrics.replicas + 1
            )

        # Check for scale-down opportunity
        if (current_metrics.cpu_percent < 30 and
                current_metrics.replicas > service.min_replicas):
            return ScalingDecision(
                action=ScaleAction.SCALE_DOWN,
                reason="Low utilization",
                target_replicas=max(service.min_replicas,
                                    current_metrics.replicas - 1)
            )

        return ScalingDecision(action=ScaleAction.NONE)
```
### The Complete Failsafe Architecture
Here’s how all the components work together:
```
┌────────────────────────────────────────────────────────────────┐
│                        FAILSAFE SERVICE                        │
│                                                                │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐        │
│  │   Health     │   │   Circuit    │   │   Recovery   │        │
│  │   Monitor    │   │   Breakers   │   │   Manager    │        │
│  │              │   │              │   │              │        │
│  │ - Shallow    │   │ - Per-service│   │ - Restart    │        │
│  │ - Deep       │   │ - Thresholds │   │ - Backoff    │        │
│  │ - Synthetic  │   │ - State mgmt │   │ - Escalation │        │
│  └──────────────┘   └──────────────┘   └──────────────┘        │
│                                                                │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐        │
│  │   Fallback   │   │    State     │   │  Partition   │        │
│  │   Chains     │   │ Preservation │   │   Handler    │        │
│  │              │   │              │   │              │        │
│  │ - Primary    │   │ - Snapshots  │   │ - Detection  │        │
│  │ - Secondary  │   │ - Recovery   │   │ - Quorum     │        │
│  │ - Tertiary   │   │ - Sync       │   │ - Fencing    │        │
│  └──────────────┘   └──────────────┘   └──────────────┘        │
│                                                                │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐        │
│  │     Auto     │   │   Alerting   │   │    Chaos     │        │
│  │    Scaler    │   │    Engine    │   │ Engineering  │        │
│  │              │   │              │   │              │        │
│  │ - Metrics    │   │ - Thresholds │   │ - Fault      │        │
│  │ - Triggers   │   │ - Routing    │   │   injection  │        │
│  │ - Predictive │   │ - Escalation │   │ - Validation │        │
│  └──────────────┘   └──────────────┘   └──────────────┘        │
│                                                                │
└────────────────────────────────────────────────────────────────┘
         │                    │                    │
         ▼                    ▼                    ▼
   ┌──────────┐         ┌──────────┐         ┌──────────┐
   │ Portfolio│         │  Redis   │         │Kubernetes│
   │  Manager │         │ Cluster  │         │   API    │
   └──────────┘         └──────────┘         └──────────┘
```
### Deployment Considerations
The Failsafe Service itself must be highly available:
1. **Multiple Instances**: Run at least 3 replicas across different availability zones
2. **Leader Election**: Use distributed consensus (etcd/Consul) for coordination
3. **Graceful Degradation**: Each instance can operate independently if others fail
4. **Self-Monitoring**: The Failsafe Service monitors its own health and can trigger alerts
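The leader-election piece can be sketched against an abstract lease-holding key-value store. etcd and Consul provide real primitives for this (leases and sessions, respectively), so treat this in-memory model purely as an illustration of the mechanics; all names are ours:

```python
import time
from typing import Optional


class LeaseLeaderElection:
    """Lease-based election sketch over a plain dict standing in for a
    compare-and-set store (etcd leases / Consul sessions in production)."""

    def __init__(self, kv: dict, node_id: str, lease_seconds: float = 10.0):
        self.kv = kv
        self.node_id = node_id
        self.lease = lease_seconds

    def try_acquire(self, now: Optional[float] = None) -> bool:
        # `now` is injectable for testing; real code would use the store's clock
        now = time.monotonic() if now is None else now
        holder = self.kv.get("leader")
        if holder is None or now > holder["expires"] or holder["id"] == self.node_id:
            self.kv["leader"] = {"id": self.node_id, "expires": now + self.lease}
            return True
        return False
```

Each replica calls `try_acquire` on a timer shorter than the lease; only the current leader performs recovery actions, and a crashed leader's lease simply expires, letting a standby take over.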
### Chaos Engineering Integration
To validate our failsafe mechanisms, we integrate chaos engineering:
```python
import logging

logger = logging.getLogger(__name__)


class ChaosEngine:
    """Intentionally inject failures to test resilience"""

    async def run_chaos_scenario(self, scenario: ChaosScenario):
        """Execute a chaos engineering scenario"""
        logger.info(f"Starting chaos scenario: {scenario.name}")

        # Inject the failure
        if scenario.type == ChaosType.KILL_SERVICE:
            await self._kill_service(scenario.target)
        elif scenario.type == ChaosType.NETWORK_DELAY:
            await self._inject_latency(scenario.target, scenario.delay_ms)
        elif scenario.type == ChaosType.RESOURCE_EXHAUSTION:
            await self._exhaust_resources(scenario.target, scenario.resource)

        # Observe recovery
        recovery_time = await self._measure_recovery_time(scenario.target)

        # Validate system behavior
        assertions_passed = await self._validate_assertions(scenario.assertions)

        return ChaosResult(
            scenario=scenario.name,
            recovery_time_seconds=recovery_time,
            assertions_passed=assertions_passed,
            system_state=await self._capture_system_state()
        )
```
## Results and Lessons Learned
After implementing the Failsafe Service, our system reliability improved dramatically:
- **Uptime**: 99.95% → 99.99%
- **Mean Time to Recovery**: 15 minutes → 45 seconds
- **Unattended Recovery Rate**: 60% → 95%
- **Alert Noise**: Reduced by 80% (only actionable alerts)
### Key Lessons
1. **Health checks must be meaningful**: A service returning 200 OK isn’t necessarily healthy. Deep health checks that verify actual functionality are essential.
2. **Fallbacks need testing**: Fallback paths that aren’t regularly exercised will fail when needed. Chaos engineering proved invaluable here.
3. **State is the hardest problem**: Stateless services are easy to recover. State synchronization and preservation required the most engineering effort.
4. **Observability enables reliability**: You can’t fix what you can’t see. Investment in logging, metrics, and tracing paid dividends.
5. **Automation with guardrails**: Automatic recovery is powerful but dangerous. We implemented safeguards to prevent recovery loops and escalate appropriately.
## Conclusion
The journey from monolith to microservices to resilient distributed system was substantial, but the investment paid off. Our trading system now handles failures gracefully, recovers automatically, and maintains uptime even during infrastructure issues.
The Failsafe Service isn’t just a nice-to-have—for any system where uptime matters, it’s essential. Whether you’re building a trading platform, an e-commerce site, or any mission-critical application, the patterns described here can help you build systems that stay running when things go wrong.
The code examples in this article are simplified for illustration, but they represent real patterns we use in production. The key is starting with the basics—health checks and circuit breakers—and progressively adding sophistication as your system grows.
Remember: in distributed systems, failure isn’t a possibility—it’s a certainty. The question is whether your system is prepared to handle it gracefully.