## Chapter 10 – Best Practices

Professional infrastructure management requires discipline, planning, and adherence to proven practices. This chapter distills wisdom from production deployments into actionable guidelines.

### Section 10.1: Development Workflow

**Progressive Implementation**:

1. **Start Simple** (Week 1-2):
   - Single service monitoring (test website)
   - Manual trigger only
   - Basic HTTP health checks
   - No automation, just observation

2. **Add Intelligence** (Week 3-4):
   - Diagnostic capabilities (Docker logs)
   - Structured JSON output
   - Basic Telegram notifications
   - Still manual execution

3. **Automate Carefully** (Week 5-6):
   - Schedule trigger (every 30 minutes initially)
   - Rate limiting and deduplication
   - Human-in-the-loop approvals
   - Comprehensive logging

4. **Expand Scope** (Week 7+):
   - Additional services, one at a time
   - Specialized agents for different domains
   - Agent collaboration
   - Refinement based on experience

**Testing Hierarchy**:

```
1. Local Development
   ↓ (Test thoroughly)
2. Staging/Test Services
   ↓ (Validate behavior)
3. Non-Critical Production
   ↓ (Monitor closely)
4. Critical Production
   ↓ (Only after proven reliable)
```

**Never Skip Steps**. Each phase builds confidence and reveals edge cases.

---

### Section 10.2: Security Considerations

**Principle of Least Privilege**:

```
Agent Capability Levels:

Level 1 (Read-Only):
- HTTP GET requests
- Log viewing
- Status checks
✅ Safe for production immediately

Level 2 (Safe Operations):
- Container restarts
- Cache clearing
- Log rotation
⚠️ Requires testing, generally safe

Level 3 (Configuration Changes):
- Firewall rules
- Resource limits
- Port mappings
❌ Requires approval, staging testing

Level 4 (Data Operations):
- Database modifications
- Storage operations
- User management
🔴 FORBIDDEN for automation
```

**Credential Management**:

```javascript
// ❌ WRONG - Hardcoded secrets
const apiKey = "sk-abc123xyz789";
const dbPassword = "MyPassword123";

// ✅ RIGHT - Environment variables
const apiKey = $env.OPENAI_API_KEY;
const dbPassword = $env.DATABASE_PASSWORD;

// ✅ RIGHT - n8n Credentials
// Use the Credentials feature for:
// - API keys
// - Passwords
// - SSH keys
// - Tokens
```

**Access Control**:

```
Network Segmentation:
- n8n server in management VLAN
- Firewall rules limiting outbound access
- VPN required for remote n8n access

Authentication:
- Enable n8n basic auth or LDAP
- Strong passwords (20+ characters)
- 2FA if available

API Security:
- Use API tokens instead of passwords
- Rotate credentials quarterly
- Audit credential access
- Revoke unused credentials
```

**Audit Logging**:

```javascript
// Log every agent action
const logEntry = {
  timestamp: new Date().toISOString(),
  agent: "vishnu-cto",
  action: "container_restart",
  target: "plex",
  authorized_by: "human_approval",
  approval_id: "APR-20240115-001",
  result: "success",
  user_id: $json.telegram_user_id
};

// Store logs:
// - Local file (/var/log/n8n-agent-actions.log), as sketched below
// - Database (for queryability)
// - SIEM system (for enterprise environments)
```
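For the local-file option, here is a minimal sketch of what a Code node could do, assuming the node is allowed to use Node's built-in `fs` module (for example via `NODE_FUNCTION_ALLOW_BUILTIN=fs`) and that the log path is writable by the n8n process. The field values are illustrative.

```javascript
// Minimal sketch: append each agent action as one JSON line to a local audit log.
// Assumes the Code node may use Node's built-in fs module and the path is writable.
const fs = require('fs');

const logEntry = {
  timestamp: new Date().toISOString(),
  agent: "vishnu-cto",
  action: $json.action,
  target: $json.target,
  result: $json.result
};

// JSON Lines format keeps the file grep-able and easy to ship
// to a database or SIEM later.
fs.appendFileSync('/var/log/n8n-agent-actions.log', JSON.stringify(logEntry) + '\n');

return { json: logEntry };
```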
**Security Monitoring**:

```
Monitor the Monitors:
- Who accessed n8n?
- What workflows were modified?
- What credentials were used?
- What agents took actions?
- Were approvals properly obtained?

Alert on:
- Failed authentication attempts >5
- Workflow changes outside business hours
- Credentials accessed by unusual users
- Agent actions without approval
- New workflows created
```

---

### Section 10.3: Performance Optimization

**Polling Frequency**:

```
Service Type         | Recommended Interval
---------------------|---------------------
Critical (Database)  | Every 2-5 minutes
Important (Web Apps) | Every 5-10 minutes
Standard (Media)     | Every 10-15 minutes
Non-Critical (Dev)   | Every 30-60 minutes

Avoid over-polling:
- Wastes API quota
- Increases costs (LLM API calls)
- Creates alert fatigue
- Adds server load
```

**Caching Strategy**:

```javascript
// Cache service status to reduce redundant checks
const cache = $('WorkflowStaticData').first().json.cache || {};
const cacheKey = `status_${serviceName}`;
const cachedStatus = cache[cacheKey];
const now = Date.now();

// Use cache if fresh (< 5 minutes old)
if (cachedStatus && (now - cachedStatus.timestamp) < 5 * 60 * 1000) {
  return { json: cachedStatus.data };
}

// Otherwise, fetch fresh data
const freshStatus = await checkService();

// Update cache
cache[cacheKey] = {
  timestamp: now,
  data: freshStatus
};

return { json: freshStatus };
```

**Conditional Execution**:

```javascript
// Only trigger notifications on state change
const staticData = $('WorkflowStaticData').first().json;
const previousState = staticData.last_status || {};
const currentState = $json;

if (previousState.status === currentState.status) {
  // No change, skip notification
  return [];
}

// State changed, send notification
staticData.last_status = currentState;
return { json: { notify: true, state_change: true } };
```

**Resource Limits**:

```yaml
# If running n8n in Docker
services:
  n8n:
    image: n8nio/n8n
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 2G
        reservations:
          cpus: '0.5'
          memory: 512M
```

**Workflow Optimization**:

```
Slow Workflow Pattern:
[Trigger] → [Agent] → [Wait 30s] → [Agent] → [Wait 30s] → [Agent]
Total time: 90+ seconds

Optimized Pattern:
[Trigger] → [Agent with all tools] → [Parallel checks] → [Synthesize]
Total time: 10-20 seconds

Use parallel execution where possible (see the sketch below):
- Multiple service checks
- Multiple API calls
- Multiple SSH commands
```
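To make the parallel pattern concrete, the sketch below fans several health checks out at once from a single Code node and returns one item per service. It assumes a runtime with the global `fetch` API (Node 18+); the service names and URLs are illustrative placeholders, and inside an n8n Code node you may prefer the built-in HTTP helpers instead.

```javascript
// Minimal sketch: run several HTTP health checks in parallel instead of sequentially.
// Service names and URLs are illustrative placeholders.
const services = [
  { name: "website", url: "https://example.com/health" },
  { name: "plex", url: "http://192.168.1.50:32400/identity" },
  { name: "api", url: "https://api.example.com/status" }
];

// Promise.allSettled lets one slow or failing check not block the others.
const results = await Promise.allSettled(
  services.map(async (svc) => {
    const res = await fetch(svc.url, { signal: AbortSignal.timeout(10000) });
    return { name: svc.name, status: res.ok ? "up" : "degraded", http_code: res.status };
  })
);

// Normalize into one item per service for the next node to synthesize.
return results.map((r, i) => ({
  json: r.status === "fulfilled"
    ? r.value
    : { name: services[i].name, status: "down", error: String(r.reason) }
}));
```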
---

### Section 10.4: Reliability Guidelines

**Fallback Mechanisms**:

```javascript
// Multi-channel notifications with fallback
async function sendNotification(message) {
  try {
    // Primary: Telegram
    await sendTelegram(message);
  } catch (error) {
    try {
      // Fallback: Email
      await sendEmail(message);
    } catch (error2) {
      try {
        // Last resort: Write to file
        fs.appendFileSync(
          '/var/log/failed-notifications.log',
          JSON.stringify({ timestamp: Date.now(), message, error: String(error2) }) + '\n'
        );
      } catch (error3) {
        // Critical: Can't notify at all
        console.error("CRITICAL: All notification methods failed");
      }
    }
  }
}
```

**Health Checks for Monitoring System**:

```
Monitor the Monitor:

Create a separate workflow that checks if your main
monitoring workflows are running.

[Schedule: Every hour]
        ↓
[Check: When was last execution of main workflow?]
        ↓
[IF: >15 minutes ago]
   YES ↓
[ALERT: Monitoring system appears down!]
[Send via external service - email, SMS, PagerDuty]
```

**Graceful Degradation**:

```javascript
// If one tool fails, try alternatives
async function checkService(url) {
  try {
    // Primary: HTTP Request tool
    return await httpCheck(url);
  } catch (error) {
    try {
      // Fallback: Curl via Execute Command
      return await curlCheck(url);
    } catch (error2) {
      // Can't check directly; report unknown and fall back to cached state
      return {
        status: "unknown",
        error: "All check methods failed",
        last_known_status: getFromCache(url)
      };
    }
  }
}
```

**Recovery Procedures**:

Document manual recovery steps:

````markdown
## Emergency Recovery: n8n Agent System Down

1. Check n8n is running:
   ```
   docker ps | grep n8n
   systemctl status n8n
   ```

2. Check n8n logs:
   ```
   docker logs n8n --tail 100
   journalctl -u n8n -n 100
   ```

3. Restart n8n:
   ```
   docker restart n8n
   systemctl restart n8n
   ```

4. Verify workflows activate:
   - Login to n8n web interface
   - Check each workflow's Active status
   - Manually execute one workflow to test

5. If persistent issues:
   - Disable all workflows
   - Re-enable one at a time
   - Identify problematic workflow

6. Nuclear option:
   - Restore n8n from backup
   - Reimport workflow exports
````

**Backup Strategy**:

```bash
#!/bin/bash
# Daily backup of n8n data
BACKUP_DIR="/backups/n8n"
DATE=$(date +%Y%m%d)

# Backup n8n database
docker exec n8n sqlite3 /home/node/.n8n/database.sqlite ".backup /tmp/backup.db"
docker cp n8n:/tmp/backup.db ${BACKUP_DIR}/database-${DATE}.sqlite

# Backup workflows (export as JSON)
# Via the n8n API (see the sketch below) or manual export

# Keep last 30 days
find ${BACKUP_DIR} -name "*.sqlite" -mtime +30 -delete

# Upload to cloud storage
rclone copy ${BACKUP_DIR} remote:n8n-backups
```
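The workflow-export step that the shell script leaves as a comment can be handled by a small companion script. The sketch below is one way to do it, assuming the n8n public API is enabled and that `N8N_URL` and `N8N_API_KEY` are supplied as environment variables; pagination is omitted for brevity.

```javascript
// Minimal sketch: export all workflows as JSON alongside the nightly backup.
// Run with Node 18+ (global fetch). N8N_URL and N8N_API_KEY are assumed
// to be provided by the environment; adjust the backup path to match yours.
const fs = require('fs');

async function exportWorkflows() {
  const res = await fetch(`${process.env.N8N_URL}/api/v1/workflows`, {
    headers: { 'X-N8N-API-KEY': process.env.N8N_API_KEY }
  });
  if (!res.ok) throw new Error(`Export failed: HTTP ${res.status}`);

  const { data } = await res.json();
  const date = new Date().toISOString().slice(0, 10);

  // One JSON file per run; rotate these the same way as the SQLite dumps.
  fs.writeFileSync(`/backups/n8n/workflows-${date}.json`, JSON.stringify(data, null, 2));
  console.log(`Exported ${data.length} workflows`);
}

exportWorkflows().catch((err) => {
  console.error('Workflow export failed:', err);
  process.exit(1);
});
```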
---

### Section 10.5: Team Collaboration Best Practices

**Clear Role Assignment**:

```
Document each agent's domain:

agents/
├── vishnu-cto.md
│     - Responsibilities: Overall orchestration
│     - Escalation triggers: Multi-system failures
│     - Decision authority: Final say on all issues
│
├── brahma-network.md
│     - Responsibilities: UniFi, routing, Wi-Fi
│     - Escalation: Issues beyond network scope
│     - Tools: UniFi API, network diagnostics
│
├── saraswati-database.md
│     - Responsibilities: PostgreSQL, MySQL
│     - Escalation: Data integrity threats
│     - Forbidden: Write operations without approval
│
└── ...
```

**Escalation Paths**:

```
Level 1: Hanuman (Helpdesk)
├─ Can resolve: Common questions, status checks
├─ Escalate to: Specialists for technical issues
└─ Timeline: Respond within 5 minutes

Level 2: Specialists (Brahma, Saraswati, Ganesha, Shiva)
├─ Can resolve: Domain-specific technical issues
├─ Escalate to: Vishnu for multi-system coordination
└─ Timeline: Respond within 15 minutes

Level 3: Vishnu (CTO)
├─ Can resolve: Complex multi-system issues
├─ Escalate to: Human for business decisions
└─ Timeline: Respond within 30 minutes

Level 4: Human
├─ Can resolve: Anything (final authority)
└─ Timeline: Best effort (SLA depends on severity)
```

**Knowledge Sharing**:

```javascript
// Shared knowledge base accessible to all agents
const kb = {
  "plex_common_issues": [
    {
      "symptom": "Remote access not working",
      "solution": "Check port 32400 forwarding, Plex server settings",
      "solved_count": 12,
      "success_rate": 0.95
    }
  ],
  "network_topology": {
    "vlans": {
      "10": "Management",
      "20": "User Devices",
      "30": "Servers",
      "40": "IoT"
    },
    "aps": [
      { "name": "Living Room AP", "ip": "192.168.1.10" }
    ]
  },
  "service_dependencies": {
    "plex": ["nas", "network"],
    "website": ["docker", "network"],
    "database": ["storage", "network"]
  }
};

// Agents reference the KB before troubleshooting
// Update the KB after resolving new issues
```

**Version Control**:

```bash
# Export workflows regularly
# Store in a git repo

workflows/
├── monitoring-main.json
├── approval-handler.json
├── brahma-network.json
├── saraswati-database.json
└── README.md

# Commit after significant changes
git add workflows/
git commit -m "Add database slow query detection to Saraswati"
git push

# Tag stable versions
git tag -a v1.0 -m "Production-ready release"
```

**Change Management**:

```
Before modifying production workflows:
1. Document change in issue tracker
2. Test in development environment
3. Peer review (or self-review with checklist)
4. Deploy during maintenance window
5. Monitor for 24 hours after change
6. Document results and lessons learned

For emergency fixes:
1. Fix the immediate issue
2. Document what was changed
3. Proper testing and documentation follow-up within 48 hours
```

---

Your agent system is now enterprise-grade, with comprehensive troubleshooting, best practices, and reliability measures in place.

---