Core Build: Comprehensive Error Handling and Logging
Build production-ready automation with full error handling, logging, and alerting
Overview
This chapter transforms your T2.2 automation into a production-ready system with comprehensive error handling, structured logging, and intelligent alerting. By the end, your automation will handle failures gracefully, provide actionable debugging information, and alert you only when it matters.
Time Allocation: This 50-minute build is divided into four focused parts: error handling patterns (15 min), logging system (15 min), notification and alerts (10 min), and health monitoring (10 min). Each part builds on the previous one to create a robust system.
Part 1: Error Handling Patterns (15 min)
Production automation requires sophisticated error handling that goes beyond basic try/catch. You'll implement patterns that detect failures, retry intelligently, and prevent cascading errors.
Exit Codes and Error Detection
Every command returns an exit code: 0 for success, non-zero for failure. Detecting these codes is the foundation of error handling.
Concept: Check exit codes after critical operations to detect silent failures. Use conditional logic to handle errors before they propagate.
if ! curl -f "$API_URL" -o data.json; then
  echo "ERROR: API request failed" >&2
  exit 1
fi

Why this matters: Many failures happen silently. Without exit code checking, your automation continues with corrupt or missing data.
Retry Logic with Exponential Backoff
Transient failures (network timeouts, rate limits) often resolve themselves. Retry with increasing delays to avoid overwhelming failing systems.
Concept: Implement a retry loop that waits progressively longer between attempts. Start with 1 second, then 2, then 4, doubling each time up to a maximum.
for i in {1..5}; do
  curl -f "$API_URL" && break     # stop retrying as soon as one attempt succeeds
  sleep $((2 ** (i - 1)))         # wait 1, 2, 4, 8, then 16 seconds
done

When to use: API calls, file downloads, database connections, any operation with temporary failure modes. Avoid for permanent errors like authentication failures.
Circuit Breaker Pattern
If a service is down, stop hammering it. The circuit breaker pattern prevents wasted retries and allows systems to recover.
Concept: Track consecutive failures. After a threshold (e.g., 3 failures), stop trying for a cooldown period. Check whether the service is healthy before resuming normal operation.
Implementation strategy: Maintain a failure counter in a file. Increment on failure, reset on success. Check counter before each operation. If threshold exceeded, skip operation and schedule health check.
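One minimal sketch of that counter, assuming a FAIL_COUNT_FILE path and a threshold of 3 (both arbitrary choices), wrapped around the curl call used earlier:

FAIL_COUNT_FILE="/tmp/api_failures"        # hypothetical location for the counter
THRESHOLD=3

failures=$(cat "$FAIL_COUNT_FILE" 2>/dev/null)
failures=${failures:-0}
if [ "$failures" -ge "$THRESHOLD" ]; then
  echo "Circuit open: skipping API call until the next health check" >&2
  exit 0
fi

if curl -f "$API_URL" -o data.json; then
  echo 0 > "$FAIL_COUNT_FILE"                   # success: reset the counter
else
  echo $((failures + 1)) > "$FAIL_COUNT_FILE"   # failure: increment the counter
fi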
Why this matters: Prevents log spam, reduces load on failing systems, and speeds recovery. Essential for distributed systems and external APIs.
Error Context and Recovery
When errors occur, capture context: what failed, why it failed, what data was involved, and what recovery options exist.
Concept: Wrap critical operations in functions that log detailed error context. Include timestamps, input parameters, system state, and suggested recovery actions.
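A sketch of such a wrapper, assuming log lines go to an app.log file; the run_step name and the example command are illustrative:

run_step() {
  local step="$1"; shift
  "$@"
  local status=$?
  if [ "$status" -ne 0 ]; then
    # Capture what failed, the inputs involved, and the exit code in one line
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] ERROR step=$step cmd='$*' exit=$status" >> app.log
    return "$status"
  fi
}

# Usage: name the step, then pass the operation and its parameters
run_step "fetch-data" curl -f "$API_URL" -o data.json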
Recovery strategies: Fallback to cached data, use alternative data sources, degrade gracefully, or fail fast with clear error messages. Choose based on criticality and data availability.
Part 2: Logging System (15 min)
Effective logging turns debugging from hours to minutes. You'll build a structured logging system that captures exactly what you need for troubleshooting.
Log Levels and When to Use Them
Different situations require different logging verbosity. Use standardized log levels to filter information appropriately.
Log Level Hierarchy:
DEBUG - Detailed diagnostic information for development. Use for variable values, loop iterations, conditional branches. Disabled in production to reduce noise.
INFO - General informational messages about normal operation. Use for workflow milestones, successful operations, data processing summaries. Confirms system is working as expected.
WARN - Warning messages for unusual but recoverable situations. Use for retries, fallbacks, deprecated features, approaching limits. Indicates potential issues that need monitoring.
ERROR - Error messages for failures that prevent normal operation. Use for caught exceptions, failed operations, missing resources. Requires investigation and potential intervention.
Choosing the right level: Ask yourself: Is this for debugging (DEBUG), confirming normal flow (INFO), highlighting potential issues (WARN), or reporting failures (ERROR)?
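One way to enforce these levels in a shell script is a small helper that filters by a configured minimum level. This is a sketch, assuming a LOG_LEVEL variable and an app.log destination, both illustrative:

log() {
  local level="$1"; shift
  local rank min
  # Map names to numeric severity so levels can be compared
  case "$level" in DEBUG) rank=0 ;; INFO) rank=1 ;; WARN) rank=2 ;; *) rank=3 ;; esac
  case "${LOG_LEVEL:-INFO}" in DEBUG) min=0 ;; INFO) min=1 ;; WARN) min=2 ;; *) min=3 ;; esac
  [ "$rank" -lt "$min" ] && return 0    # skip messages below the configured level
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] $level $*" >> app.log
}

log INFO "Pipeline started"
log DEBUG "Processing record 42"        # suppressed unless LOG_LEVEL=DEBUG
log ERROR "API request failed"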
Log Rotation and Formatting
Logs grow unbounded without rotation. Configure automatic rotation to prevent disk space issues while maintaining debug history.
Concept: Use logrotate or custom rotation logic to archive logs daily or by size. Keep recent logs readily accessible, compress older logs, delete ancient logs.
# Rotate if log exceeds 10MB
[ $(stat -f%z "app.log") -gt 10485760 ] && \
  mv app.log "app.$(date +%Y%m%d).log"

Formatting best practices: Include timestamps, log levels, script names, and error context in every log line. Use consistent delimiters for parsing.
Structured Logging for Searchability
Unstructured logs are hard to search. Structure your logs with key-value pairs or JSON format for efficient filtering and analysis.
Concept: Instead of free-form text, use consistent structure: timestamp, level, component, message, and metadata. This enables grep searches, log aggregation, and automated analysis.
Example format: [2025-01-15 10:30:45] ERROR [api-client] Request failed | url=https://api.example.com | status=503 | retry=2/3
Searchability benefits: Find all errors from specific component, track all retries for an endpoint, analyze failure patterns over time, correlate errors across systems.
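Assuming the format shown above and an app.log file, these routine questions reduce to one-line searches (the patterns are examples, not a fixed query language):

# All errors from the api-client component
grep 'ERROR \[api-client\]' app.log

# Every retry recorded for a specific endpoint
grep 'url=https://api.example.com' app.log | grep 'retry='

# Count errors per day to spot trends
grep ERROR app.log | awk '{print $1}' | tr -d '[' | sort | uniq -c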
Logging Critical Operations
Not everything deserves logging. Focus on operations critical for debugging: state changes, external calls, error conditions, and recovery actions.
What to log:
- Before critical operations: Input parameters, preconditions
- After operations: Results, side effects, performance metrics
- On errors: Error message, stack trace, recovery action taken
- On state changes: Old state, new state, trigger
What NOT to log: Sensitive data (passwords, tokens), excessive loop iterations, redundant success messages, internal implementation details unrelated to debugging.
Part 3: Notification and Alerts (10 min)
Logging captures information, but alerts demand attention. You'll implement notification systems that alert you when intervention is needed.
Alert Fatigue Warning: Too many alerts become noise. Configure alerts only for actionable failures that require immediate attention. Use thresholds and deduplication to reduce false positives.
Email Alerts
Send email notifications for critical failures using macOS mail command or SMTP.
Concept: Trigger email when error threshold exceeded or critical operation fails. Include error summary, affected systems, and recovery instructions.
echo "Subject: Automation Failed
Critical error in research pipeline" | \
sendmail admin@example.comUse cases: Daily digest failures, authentication errors, data corruption, system resource exhaustion. Reserve for failures requiring human intervention.
macOS Notifications
Display native macOS notifications for immediate visual alerts during active work.
Concept: Use osascript to trigger notification center alerts. Effective for development and monitoring on local machines.
osascript -e 'display notification "Pipeline failed" with title "Error"'

Use cases: Development debugging, local automation monitoring, non-critical warnings. Not suitable for unattended systems or critical alerts.
Slack Webhooks
Post alerts to Slack channels for team visibility and incident coordination.
Concept: Send JSON payloads to Slack incoming webhooks. Format messages with priority levels, affected systems, and remediation links.
curl -X POST "$SLACK_WEBHOOK" \
  -H 'Content-type: application/json' \
  -d '{"text":"Pipeline failed: data.json missing"}'

Use cases: Team-wide incidents, production failures, performance degradation, scheduled report failures. Best for collaborative troubleshooting.
Alert Design Principles
Actionability: Every alert should suggest clear next steps. Include error codes, affected resources, and remediation documentation links.
Severity levels: Use tiered alert channels. Critical failures go to SMS/phone, warnings go to Slack, info goes to email digest. Match urgency to notification method.
Deduplication: Don't send duplicate alerts for the same failure. Track alert state in files, send only on state changes (working → failing, or failing → recovered).
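A sketch of file-based deduplication, reusing the Slack webhook from above; the state file path and the state names are arbitrary choices:

STATE_FILE="/tmp/pipeline_alert_state"       # hypothetical path for the last known state

notify_on_change() {
  local new_state="$1" message="$2"
  local old_state
  old_state=$(cat "$STATE_FILE" 2>/dev/null || echo "ok")
  [ "$new_state" = "$old_state" ] && return 0    # same state as before: stay quiet
  echo "$new_state" > "$STATE_FILE"
  curl -X POST "$SLACK_WEBHOOK" \
    -H 'Content-type: application/json' \
    -d "{\"text\":\"$message\"}"                 # fires only when the state flips
}

# Alerts once when the pipeline breaks, and once more when it recovers
notify_on_change "failing" "Pipeline failed: data.json missing"
notify_on_change "ok" "Pipeline recovered"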
Part 4: Health Monitoring (10 min)
Error handling reacts to failures, but health monitoring detects issues before they become critical. You'll implement proactive monitoring to catch problems early.
Health Check Commands
Create lightweight commands that verify system health without running full workflows. Check connectivity, data freshness, resource availability, and service status.
Concept: Implement standalone health check scripts that test each critical dependency. Run these on schedules separate from main automation to detect issues early.
What to check:
- Connectivity: Can reach APIs, databases, file systems
- Data freshness: Last successful update timestamp
- Resource availability: Disk space, memory, API quotas
- Service status: Expected processes running, ports listening
Example check: Verify last successful data update was within 24 hours. Alert if stale, suggesting possible automation failure.
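A sketch of that staleness check, assuming your automation touches a marker file (here called last_success, an illustrative name) after every good run:

MARKER="last_success"                  # touched by the pipeline after each successful run
MAX_AGE=$((24 * 60 * 60))              # 24 hours in seconds

if [ ! -f "$MARKER" ]; then
  echo "WARN: no successful run recorded yet" >&2
  exit 1
fi

age=$(( $(date +%s) - $(stat -f%m "$MARKER") ))   # stat -f%m: modification time (macOS/BSD)
if [ "$age" -gt "$MAX_AGE" ]; then
  echo "ERROR: data is $((age / 3600)) hours old; automation may have stopped" >&2
  exit 1
fi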
Monitoring Dashboard Concept
Centralize health metrics in a simple dashboard for at-a-glance system status. Track success rates, error counts, response times, and uptime.
Implementation approach: Write metrics to a status file after each run. Create a simple script that reads status files and displays formatted summary. Include timestamps, success rates, recent errors, and trend indicators.
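A sketch of this status-file approach; status.log, and the result and count variables it records, are illustrative names rather than a fixed format:

# At the end of each run, append one structured status line
# ($result and $count are set earlier in your run; the names are examples)
echo "$(date '+%Y-%m-%d %H:%M:%S') result=$result duration=${SECONDS}s records=$count" >> status.log

# Minimal terminal dashboard: recent runs plus a quick success count
tail -n 20 status.log
echo "Successes in last 20 runs: $(tail -n 20 status.log | grep -c 'result=success')"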
Key metrics to track:
- Availability: Percentage of successful runs over 24 hours/7 days
- Performance: Average execution time, slowest operations
- Error patterns: Most common errors, error rate trends
- Data quality: Records processed, data completeness
Visualization options: Simple text dashboard in terminal, HTML report, or integration with monitoring tools like Grafana for advanced visualization.
Success Rate Tracking
Measure automation reliability by tracking success rates over time. Calculate percentage of successful runs and set thresholds for alerts.
Concept: Maintain a rolling log of run outcomes (success/failure). Calculate success rate daily, weekly, monthly. Alert if rate drops below threshold (e.g., 95%).
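Building on the hypothetical status.log format from the dashboard sketch, the daily success rate reduces to a short calculation; the 95% threshold matches the example above:

TODAY=$(date '+%Y-%m-%d')
total=$(grep -c "^$TODAY" status.log 2>/dev/null)
ok=$(grep "^$TODAY" status.log 2>/dev/null | grep -c 'result=success')

if [ "${total:-0}" -gt 0 ]; then
  rate=$(( 100 * ok / total ))
  echo "Success rate for $TODAY: ${rate}% ($ok/$total)"
  [ "$rate" -lt 95 ] && echo "WARN: success rate below 95% threshold" >&2
fi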
Metrics to capture:
- Overall success rate: Total successful runs / total attempts
- Recovery rate: Successful retries / total retries
- Time to recovery: Average time from failure to success
- Failure patterns: Group errors by type, track trends
Using the data: Identify flaky operations, optimize retry logic, prove reliability to stakeholders, justify infrastructure investments.
Proactive Failure Detection
Don't wait for alerts. Implement checks that detect potential failures before they occur.
Predictive checks:
- Disk space trends: Alert if usage growing faster than expected
- API quota consumption: Warn before hitting rate limits
- Data staleness: Detect slowing update frequency
- Performance degradation: Alert on sustained slowdowns
Implementation strategy: Track baseline metrics (normal disk usage, typical API calls, expected update frequency). Compare current values to baselines. Alert on significant deviations before hard failures occur.
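As one concrete example, a disk-usage check against a stored baseline might look like the following sketch (the baseline file name and the 10-point margin are arbitrary):

BASELINE_FILE="disk_baseline"                               # holds the expected usage percentage
current=$(df -P / | awk 'NR==2 {gsub("%",""); print $5}')   # current usage of / in percent

if [ ! -f "$BASELINE_FILE" ]; then
  echo "$current" > "$BASELINE_FILE"                        # first run: record the baseline
elif [ "$current" -gt $(( $(cat "$BASELINE_FILE") + 10 )) ]; then
  echo "WARN: disk usage ${current}% is more than 10 points above baseline" >&2
fi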
Final Integration: Putting It All Together
You've learned error handling patterns, logging systems, alerting mechanisms, and health monitoring. Now integrate these components into a cohesive production system.
Integration workflow: Start by adding error handling to critical paths in your T2.2 automation. Wrap each major operation in explicit error checks with specific error messages. Add logging around error handlers to capture context. Configure alerts for errors that require intervention. Finally, implement health checks that run independently to verify system health.
Layered approach: Think of this as building layers of protection. Exit code checking catches immediate failures. Retry logic handles transient issues. Circuit breakers prevent cascading failures. Logging captures debugging context. Alerts notify on critical issues. Health monitoring detects problems early. Each layer increases reliability.
Operational excellence: Production automation isn't just about handling errors; it's about building systems that degrade gracefully, provide clear debugging information, and alert appropriately. Your automation should be more reliable than the manual process it replaces.
Testing your system: Inject failures deliberately to verify error handling. Disconnect network to test retries. Fill disk to test resource monitoring. Verify alerts fire correctly. Check that logs contain sufficient debugging context. Confirm health checks detect issues early.
Next steps: Apply these patterns to your T2.2 automation in the domain-specific chapters. See how error handling changes for research pipelines (economics), CI/CD systems (software), and competitive intelligence (business).
Success indicator: You know this integration is successful when you can debug automation failures using logs alone, without inspecting code. Alerts provide clear remediation steps. Health checks catch issues before they impact operations. Your automation achieves 99%+ uptime over 4 weeks.