## Chapter 10 – Best Practices

Professional infrastructure management requires discipline, planning, and adherence to proven practices. This chapter distills wisdom from production deployments into actionable guidelines.

### Section 10.1: Development Workflow

**Progressive Implementation**:

1. **Start Simple** (Week 1-2):
   - Single service monitoring (test website)
   - Manual trigger only
   - Basic HTTP health checks
   - No automation, just observation

2. **Add Intelligence** (Week 3-4):
   - Diagnostic capabilities (Docker logs)
   - Structured JSON output
   - Basic Telegram notifications
   - Still manual execution

3. **Automate Carefully** (Week 5-6):
   - Schedule trigger (every 30 minutes initially)
   - Rate limiting and deduplication
   - Human-in-the-loop approvals
   - Comprehensive logging

4. **Expand Scope** (Week 7+):
   - Additional services, one at a time
   - Specialized agents for different domains
   - Agent collaboration
   - Refinement based on experience

**Testing Hierarchy**:

```
1. Local Development
   ↓ (Test thoroughly)
2. Staging/Test Services
   ↓ (Validate behavior)
3. Non-Critical Production
   ↓ (Monitor closely)
4. Critical Production
   ↓ (Only after proven reliable)
```

**Never Skip Steps**. Each phase builds confidence and reveals edge cases.

---

### Section 10.2: Security Considerations

**Principle of Least Privilege**:

```
Agent Capability Levels:

Level 1 (Read-Only):
- HTTP GET requests
- Log viewing
- Status checks
✅ Safe for production immediately

Level 2 (Safe Operations):
- Container restarts
- Cache clearing
- Log rotation
⚠️ Requires testing, generally safe

Level 3 (Configuration Changes):
- Firewall rules
- Resource limits
- Port mappings
❌ Requires approval, staging testing

Level 4 (Data Operations):
- Database modifications
- Storage operations
- User management
🔴 FORBIDDEN for automation
```

**Credential Management**:

```javascript
// ❌ WRONG - Hardcoded secrets
const apiKey = "sk-abc123xyz789";
const dbPassword = "MyPassword123";

// ✅ RIGHT - Environment variables
const apiKey = $env.OPENAI_API_KEY;
const dbPassword = $env.DATABASE_PASSWORD;

// ✅ RIGHT - n8n Credentials
// Use the Credentials feature for:
// - API keys
// - Passwords
// - SSH keys
// - Tokens
```

**Access Control**:

```
Network Segmentation:
- n8n server in management VLAN
- Firewall rules limiting outbound access
- VPN required for remote n8n access

Authentication:
- Enable n8n basic auth or LDAP
- Strong passwords (20+ characters)
- 2FA if available

API Security:
- Use API tokens instead of passwords
- Rotate credentials quarterly
- Audit credential access
- Revoke unused credentials
```

**Audit Logging**:

```javascript
// Log every agent action
const logEntry = {
  timestamp: new Date().toISOString(),
  agent: "vishnu-cto",
  action: "container_restart",
  target: "plex",
  authorized_by: "human_approval",
  approval_id: "APR-20240115-001",
  result: "success",
  user_id: $json.telegram_user_id
};

// Store logs:
// - Local file (/var/log/n8n-agent-actions.log), as sketched below
// - Database (for queryability)
// - SIEM system (for enterprise environments)
```
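For the local-file option, here is a minimal sketch of what a Code node could do, assuming the node is allowed to use Node's built-in `fs` module (for example via `NODE_FUNCTION_ALLOW_BUILTIN=fs`) and that the log path is writable by the n8n process. The field values are illustrative.

```javascript
// Minimal sketch: append each agent action as one JSON line to a local audit log.
// Assumes the Code node may use Node's built-in fs module and the path is writable.
const fs = require('fs');

const logEntry = {
  timestamp: new Date().toISOString(),
  agent: "vishnu-cto",
  action: $json.action,
  target: $json.target,
  result: $json.result
};

// JSON Lines format keeps the file grep-able and easy to ship
// to a database or SIEM later.
fs.appendFileSync('/var/log/n8n-agent-actions.log', JSON.stringify(logEntry) + '\n');

return { json: logEntry };
```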
**Security Monitoring**:

```
Monitor the Monitors:
- Who accessed n8n?
- What workflows were modified?
- What credentials were used?
- What agents took actions?
- Were approvals properly obtained?

Alert on:
- Failed authentication attempts >5
- Workflow changes outside business hours
- Credentials accessed by unusual users
- Agent actions without approval
- New workflows created
```

---

### Section 10.3: Performance Optimization

**Polling Frequency**:

```
Service Type         | Recommended Interval
---------------------|---------------------
Critical (Database)  | Every 2-5 minutes
Important (Web Apps) | Every 5-10 minutes
Standard (Media)     | Every 10-15 minutes
Non-Critical (Dev)   | Every 30-60 minutes

Avoid over-polling:
- Wastes API quota
- Increases costs (LLM API calls)
- Creates alert fatigue
- Adds server load
```

**Caching Strategy**:

```javascript
// Cache service status to reduce redundant checks
const cache = $('WorkflowStaticData').first().json.cache || {};
const cacheKey = `status_${serviceName}`;
const cachedStatus = cache[cacheKey];
const now = Date.now();

// Use cache if fresh (< 5 minutes old)
if (cachedStatus && (now - cachedStatus.timestamp) < 5 * 60 * 1000) {
  return { json: cachedStatus.data };
}

// Otherwise, fetch fresh data
const freshStatus = await checkService();

// Update cache
cache[cacheKey] = {
  timestamp: now,
  data: freshStatus
};

return { json: freshStatus };
```

**Conditional Execution**:

```javascript
// Only trigger notifications on state change
const staticData = $('WorkflowStaticData').first().json;
const previousState = staticData.last_status || {};
const currentState = $json;

if (previousState.status === currentState.status) {
  // No change, skip notification
  return [];
}

// State changed, send notification
staticData.last_status = currentState;
return { json: { notify: true, state_change: true } };
```

**Resource Limits**:

```yaml
# If running n8n in Docker
services:
  n8n:
    image: n8nio/n8n
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 2G
        reservations:
          cpus: '0.5'
          memory: 512M
```

**Workflow Optimization**:

```
Slow Workflow Pattern:
[Trigger] → [Agent] → [Wait 30s] → [Agent] → [Wait 30s] → [Agent]
Total time: 90+ seconds

Optimized Pattern:
[Trigger] → [Agent with all tools] → [Parallel checks] → [Synthesize]
Total time: 10-20 seconds

Use parallel execution where possible (see the sketch below):
- Multiple service checks
- Multiple API calls
- Multiple SSH commands
```
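To make the parallel pattern concrete, the sketch below fans several health checks out at once from a single Code node and returns one item per service. It assumes a runtime with the global `fetch` API (Node 18+); the service names and URLs are illustrative placeholders, and inside an n8n Code node you may prefer the built-in HTTP helpers instead.

```javascript
// Minimal sketch: run several HTTP health checks in parallel instead of sequentially.
// Service names and URLs are illustrative placeholders.
const services = [
  { name: "website", url: "https://example.com/health" },
  { name: "plex", url: "http://192.168.1.50:32400/identity" },
  { name: "api", url: "https://api.example.com/status" }
];

// Promise.allSettled lets one slow or failing check not block the others.
const results = await Promise.allSettled(
  services.map(async (svc) => {
    const res = await fetch(svc.url, { signal: AbortSignal.timeout(10000) });
    return { name: svc.name, status: res.ok ? "up" : "degraded", http_code: res.status };
  })
);

// Normalize into one item per service for the next node to synthesize.
return results.map((r, i) => ({
  json: r.status === "fulfilled"
    ? r.value
    : { name: services[i].name, status: "down", error: String(r.reason) }
}));
```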
---

### Section 10.4: Reliability Guidelines

**Fallback Mechanisms**:

```javascript
// Multi-channel notifications with fallback
async function sendNotification(message) {
  try {
    // Primary: Telegram
    await sendTelegram(message);
  } catch (error) {
    try {
      // Fallback: Email
      await sendEmail(message);
    } catch (error2) {
      try {
        // Last resort: Write to file
        fs.appendFileSync(
          '/var/log/failed-notifications.log',
          JSON.stringify({ timestamp: Date.now(), message, error: String(error2) }) + '\n'
        );
      } catch (error3) {
        // Critical: Can't notify at all
        console.error("CRITICAL: All notification methods failed");
      }
    }
  }
}
```

**Health Checks for Monitoring System**:

```
Monitor the Monitor:

Create a separate workflow that checks if your main
monitoring workflows are running.

[Schedule: Every hour]
        ↓
[Check: When was last execution of main workflow?]
        ↓
[IF: >15 minutes ago]
   YES ↓
[ALERT: Monitoring system appears down!]
[Send via external service - email, SMS, PagerDuty]
```

**Graceful Degradation**:

```javascript
// If one tool fails, try alternatives
async function checkService(url) {
  try {
    // Primary: HTTP Request tool
    return await httpCheck(url);
  } catch (error) {
    try {
      // Fallback: Curl via Execute Command
      return await curlCheck(url);
    } catch (error2) {
      // Can't check directly; report unknown and fall back to cached state
      return {
        status: "unknown",
        error: "All check methods failed",
        last_known_status: getFromCache(url)
      };
    }
  }
}
```

**Recovery Procedures**:

Document manual recovery steps:

````markdown
## Emergency Recovery: n8n Agent System Down

1. Check n8n is running:
   ```
   docker ps | grep n8n
   systemctl status n8n
   ```

2. Check n8n logs:
   ```
   docker logs n8n --tail 100
   journalctl -u n8n -n 100
   ```

3. Restart n8n:
   ```
   docker restart n8n
   systemctl restart n8n
   ```

4. Verify workflows activate:
   - Login to n8n web interface
   - Check each workflow's Active status
   - Manually execute one workflow to test

5. If persistent issues:
   - Disable all workflows
   - Re-enable one at a time
   - Identify problematic workflow

6. Nuclear option:
   - Restore n8n from backup
   - Reimport workflow exports
````

**Backup Strategy**:

```bash
#!/bin/bash
# Daily backup of n8n data
BACKUP_DIR="/backups/n8n"
DATE=$(date +%Y%m%d)

# Backup n8n database
docker exec n8n sqlite3 /home/node/.n8n/database.sqlite ".backup /tmp/backup.db"
docker cp n8n:/tmp/backup.db ${BACKUP_DIR}/database-${DATE}.sqlite

# Backup workflows (export as JSON)
# Via the n8n API (see the sketch below) or manual export

# Keep last 30 days
find ${BACKUP_DIR} -name "*.sqlite" -mtime +30 -delete

# Upload to cloud storage
rclone copy ${BACKUP_DIR} remote:n8n-backups
```
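The workflow-export step that the shell script leaves as a comment can be handled by a small companion script. The sketch below is one way to do it, assuming the n8n public API is enabled and that `N8N_URL` and `N8N_API_KEY` are supplied as environment variables; pagination is omitted for brevity.

```javascript
// Minimal sketch: export all workflows as JSON alongside the nightly backup.
// Run with Node 18+ (global fetch). N8N_URL and N8N_API_KEY are assumed
// to be provided by the environment; adjust the backup path to match yours.
const fs = require('fs');

async function exportWorkflows() {
  const res = await fetch(`${process.env.N8N_URL}/api/v1/workflows`, {
    headers: { 'X-N8N-API-KEY': process.env.N8N_API_KEY }
  });
  if (!res.ok) throw new Error(`Export failed: HTTP ${res.status}`);

  const { data } = await res.json();
  const date = new Date().toISOString().slice(0, 10);

  // One JSON file per run; rotate these the same way as the SQLite dumps.
  fs.writeFileSync(`/backups/n8n/workflows-${date}.json`, JSON.stringify(data, null, 2));
  console.log(`Exported ${data.length} workflows`);
}

exportWorkflows().catch((err) => {
  console.error('Workflow export failed:', err);
  process.exit(1);
});
```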
---

### Section 10.5: Team Collaboration Best Practices

**Clear Role Assignment**:

```
Document each agent's domain:

agents/
├── vishnu-cto.md
│     - Responsibilities: Overall orchestration
│     - Escalation triggers: Multi-system failures
│     - Decision authority: Final say on all issues
│
├── brahma-network.md
│     - Responsibilities: UniFi, routing, Wi-Fi
│     - Escalation: Issues beyond network scope
│     - Tools: UniFi API, network diagnostics
│
├── saraswati-database.md
│     - Responsibilities: PostgreSQL, MySQL
│     - Escalation: Data integrity threats
│     - Forbidden: Write operations without approval
│
└── ...
```

**Escalation Paths**:

```
Level 1: Hanuman (Helpdesk)
├─ Can resolve: Common questions, status checks
├─ Escalate to: Specialists for technical issues
└─ Timeline: Respond within 5 minutes

Level 2: Specialists (Brahma, Saraswati, Ganesha, Shiva)
├─ Can resolve: Domain-specific technical issues
├─ Escalate to: Vishnu for multi-system coordination
└─ Timeline: Respond within 15 minutes

Level 3: Vishnu (CTO)
├─ Can resolve: Complex multi-system issues
├─ Escalate to: Human for business decisions
└─ Timeline: Respond within 30 minutes

Level 4: Human
├─ Can resolve: Anything (final authority)
└─ Timeline: Best effort (SLA depends on severity)
```

**Knowledge Sharing**:

```javascript
// Shared knowledge base accessible to all agents
const kb = {
  "plex_common_issues": [
    {
      "symptom": "Remote access not working",
      "solution": "Check port 32400 forwarding, Plex server settings",
      "solved_count": 12,
      "success_rate": 0.95
    }
  ],
  "network_topology": {
    "vlans": {
      "10": "Management",
      "20": "User Devices",
      "30": "Servers",
      "40": "IoT"
    },
    "aps": [
      { "name": "Living Room AP", "ip": "192.168.1.10" }
    ]
  },
  "service_dependencies": {
    "plex": ["nas", "network"],
    "website": ["docker", "network"],
    "database": ["storage", "network"]
  }
};

// Agents reference the KB before troubleshooting
// Update the KB after resolving new issues
```

**Version Control**:

```bash
# Export workflows regularly
# Store in a git repo

workflows/
├── monitoring-main.json
├── approval-handler.json
├── brahma-network.json
├── saraswati-database.json
└── README.md

# Commit after significant changes
git add workflows/
git commit -m "Add database slow query detection to Saraswati"
git push

# Tag stable versions
git tag -a v1.0 -m "Production-ready release"
```

**Change Management**:

```
Before modifying production workflows:
1. Document change in issue tracker
2. Test in development environment
3. Peer review (or self-review with checklist)
4. Deploy during maintenance window
5. Monitor for 24 hours after change
6. Document results and lessons learned

For emergency fixes:
1. Fix the immediate issue
2. Document what was changed
3. Proper testing and documentation follow-up within 48 hours
```

---

Your agent system is now enterprise-grade, with comprehensive troubleshooting, best practices, and reliability measures in place.

---