Chapter 10: Best Practices

Professional infrastructure management requires discipline, planning, and adherence to proven practices. This chapter distills wisdom from production deployments into actionable guidelines.

Section 10.1: Development Workflow

Progressive Implementation:

  1. Start Simple (Week 1-2):

    • Single service monitoring (test website)
    • Manual trigger only
    • Basic HTTP health checks
    • No automation, just observation
  2. Add Intelligence (Week 3-4):

    • Diagnostic capabilities (Docker logs)
    • Structured JSON output
    • Basic Telegram notifications
    • Still manual execution
  3. Automate Carefully (Week 5-6):

    • Schedule trigger (every 30 minutes initially)
    • Rate limiting and deduplication
    • Human-in-the-loop approvals
    • Comprehensive logging
  4. Expand Scope (Week 7+):

    • Additional services one at a time
    • Specialized agents for different domains
    • Agent collaboration
    • Refinement based on experience
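
For phases 1 and 2 above, a health check can start as a single Code node that calls the site and returns structured JSON. The sketch below is illustrative, not part of the book's workflows: the URL and service name are placeholders, and it assumes the n8n Code node's `this.helpers.httpRequest` helper is available.

// Minimal health check: call the service and emit structured JSON
// (URL and service name are placeholders; assumes the Code node's this.helpers.httpRequest helper)
const url = 'https://example.com/health';
const startedAt = Date.now();

let result;
try {
  await this.helpers.httpRequest({ method: 'GET', url, timeout: 10000 });
  result = { service: 'test-website', status: 'up', response_ms: Date.now() - startedAt };
} catch (error) {
  result = { service: 'test-website', status: 'down', error: error.message };
}

result.checked_at = new Date().toISOString();
return [{ json: result }];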

Testing Hierarchy:

1. Local Development
   ↓ (Test thoroughly)
2. Staging/Test Services
   ↓ (Validate behavior)
3. Non-Critical Production
   ↓ (Monitor closely)
4. Critical Production
   ↓ (Only after proven reliable)

Never Skip Steps. Each phase builds confidence and reveals edge cases.


Section 10.2: Security Considerations

Principle of Least Privilege:

Agent Capability Levels:

Level 1 (Read-Only):
- HTTP GET requests
- Log viewing
- Status checks
✅ Safe for production immediately

Level 2 (Safe Operations):
- Container restarts
- Cache clearing
- Log rotation
⚠️ Requires testing, generally safe

Level 3 (Configuration Changes):
- Firewall rules
- Resource limits
- Port mappings
❌ Requires approval, staging testing

Level 4 (Data Operations):
- Database modifications
- Storage operations
- User management
🔴 FORBIDDEN for automation
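
These levels are easiest to enforce in code rather than by convention. Below is a minimal sketch of a guard you could place before any action-taking node; the action names, the `requested_action` and `approval_id` fields, and the level mapping are illustrative assumptions, not n8n built-ins.

// Illustrative capability guard: block anything above Level 2 unless an approval is attached
const LEVEL_BY_ACTION = {
  http_get: 1, view_logs: 1, status_check: 1,
  container_restart: 2, clear_cache: 2, rotate_logs: 2,
  firewall_change: 3, resource_limit_change: 3,
  db_write: 4, user_management: 4,
};

const action = $json.requested_action;          // e.g. "container_restart"
const level = LEVEL_BY_ACTION[action] ?? 4;     // unknown actions default to forbidden

if (level >= 4) {
  throw new Error(`Action "${action}" is Level 4 and is never automated.`);
}
if (level === 3 && $json.approval_id === undefined) {
  return [{ json: { blocked: true, reason: 'Level 3 action requires human approval', action } }];
}

return [{ json: { blocked: false, action, level } }];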

Credential Management:

// ❌ WRONG - Hardcoded secrets
const apiKey = "sk-abc123xyz789";
const dbPassword = "MyPassword123";

// ✅ RIGHT - Environment variables
const apiKey = $env.OPENAI_API_KEY;
const dbPassword = $env.DATABASE_PASSWORD;

// ✅ RIGHT - n8n Credentials
// Use the built-in Credentials feature for:
// - API keys
// - Passwords
// - SSH keys
// - Tokens

Access Control:

Network Segmentation:
- n8n server in management VLAN
- Firewall rules limiting outbound access
- VPN required for remote n8n access

Authentication:
- Enable n8n basic auth or LDAP
- Strong passwords (20+ characters)
- 2FA if available

API Security:
- Use API tokens instead of passwords
- Rotate credentials quarterly
- Audit credential access
- Revoke unused credentials

Audit Logging:

// Log every agent action
const logEntry = {
  timestamp: new Date().toISOString(),
  agent: "vishnu-cto",
  action: "container_restart",
  target: "plex",
  authorized_by: "human_approval",
  approval_id: "APR-20240115-001",
  result: "success",
  user_id: $json.telegram_user_id
};

// Store logs:
// - Local file (/var/log/n8n-agent-actions.log)
// - Database (for queryability)
// - SIEM system (for enterprise environments)

Security Monitoring:

Monitor the Monitors:
- Who accessed n8n?
- What workflows were modified?
- What credentials were used?
- What agents took actions?
- Were approvals properly obtained?

Alert on:
- Failed authentication attempts >5
- Workflow changes outside business hours
- Credentials accessed by unusual users
- Agent actions without approval
- New workflows created
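
At least one of these alerts can be automated directly from the audit log entries shown earlier: flagging agent actions taken without approval. A minimal sketch, assuming the audit entries arrive as input items with the fields used in the audit-logging example:

// Flag audit log entries where an action was taken without a recorded approval
const entries = $input.all();
const unapproved = entries.filter((item) =>
  item.json.authorized_by !== 'human_approval' || !item.json.approval_id
);

if (unapproved.length === 0) {
  return [{ json: { alert: false } }];
}

return [{
  json: {
    alert: true,
    reason: 'Agent actions without approval detected',
    count: unapproved.length,
    actions: unapproved.map((i) => ({ agent: i.json.agent, action: i.json.action, target: i.json.target })),
  },
}];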

Section 10.3: Performance Optimization

Polling Frequency:

Service Type         | Recommended Interval
---------------------|---------------------
Critical (Database)  | Every 2-5 minutes
Important (Web Apps) | Every 5-10 minutes
Standard (Media)     | Every 10-15 minutes
Non-Critical (Dev)   | Every 30-60 minutes

Avoid over-polling:
- Wastes API quota
- Increases costs (LLM API calls)
- Creates alert fatigue
- Adds server load
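
A simple way to respect these intervals without a separate schedule per service is to keep one frequent trigger and skip services that are not yet due. The sketch below is one possible approach, assuming each incoming item carries illustrative `service` and `tier` fields:

// Skip services whose polling interval has not elapsed yet
const INTERVAL_MS = {
  critical: 5 * 60 * 1000,
  important: 10 * 60 * 1000,
  standard: 15 * 60 * 1000,
  non_critical: 60 * 60 * 1000,
};

const staticData = $getWorkflowStaticData('global');
staticData.lastChecked = staticData.lastChecked || {};
const now = Date.now();

const due = $input.all().filter((item) => {
  const { service, tier } = item.json;
  const interval = INTERVAL_MS[tier] ?? INTERVAL_MS.standard;
  const last = staticData.lastChecked[service] || 0;
  if (now - last < interval) return false;      // not due yet, skip this run
  staticData.lastChecked[service] = now;
  return true;
});

return due;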

Caching Strategy:

// Cache service status to reduce redundant checks
// (workflow static data persists between executions)
const staticData = $getWorkflowStaticData('global');
const cache = staticData.cache = staticData.cache || {};
const serviceName = $json.service;   // adjust to match your item structure
const cacheKey = `status_${serviceName}`;
const cachedStatus = cache[cacheKey];
const now = Date.now();

// Use cache if fresh (< 5 minutes old)
if (cachedStatus && (now - cachedStatus.timestamp) < 5 * 60 * 1000) {
  return { json: cachedStatus.data };
}

// Otherwise, fetch fresh data (checkService() is a placeholder for your actual check logic)
const freshStatus = await checkService();

// Update cache
cache[cacheKey] = {
  timestamp: now,
  data: freshStatus
};

return { json: freshStatus };

Conditional Execution:

// Only trigger notifications on state change
const staticData = $getWorkflowStaticData('global');
const previousStatus = staticData.last_status;
const currentStatus = $json.status;

if (previousStatus === currentStatus) {
  // No change, skip notification
  return [];
}

// State changed: remember it, then send notification
staticData.last_status = currentStatus;
return { json: { notify: true, previous_status: previousStatus, current_status: currentStatus } };

Resource Limits:

# If running n8n in Docker
services:
  n8n:
    image: n8nio/n8n
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 2G
        reservations:
          cpus: '0.5'
          memory: 512M

Workflow Optimization:

Slow Workflow Pattern:
[Trigger] → [Agent] → [Wait 30s] → [Agent] → [Wait 30s] → [Agent]
Total time: 90+ seconds

Optimized Pattern:
[Trigger] → [Agent with all tools] → [Parallel checks] → [Synthesize]
Total time: 10-20 seconds

Use parallel execution where possible:
- Multiple service checks
- Multiple API calls
- Multiple SSH commands
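
In a Code node, the parallel pattern usually comes down to Promise.allSettled, so that one slow or failing check does not block or abort the others. A minimal sketch, with checkService() standing in for whatever check logic you already use and an illustrative service list:

// Run all service checks concurrently instead of one after another
const services = ['plex', 'website', 'database'];   // illustrative list

const results = await Promise.allSettled(
  services.map((name) => checkService(name))        // checkService() is a placeholder
);

return results.map((result, i) => ({
  json: result.status === 'fulfilled'
    ? { service: services[i], ...result.value }
    : { service: services[i], status: 'unknown', error: result.reason?.message },
}));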

Section 10.4: Reliability Guidelines

Fallback Mechanisms:

// Multi-channel notifications with fallback
// (sendTelegram/sendEmail are placeholders for your own notification helpers;
//  requiring fs in an n8n Code node needs NODE_FUNCTION_ALLOW_BUILTIN to include it)
const fs = require('fs');

async function sendNotification(message) {
  try {
    // Primary: Telegram
    await sendTelegram(message);
  } catch (error) {
    try {
      // Fallback: Email
      await sendEmail(message);
    } catch (error2) {
      try {
        // Last resort: Write to file
        fs.appendFileSync('/var/log/failed-notifications.log',
          JSON.stringify({ timestamp: Date.now(), message, error: error2.message }) + '\n');
      } catch (error3) {
        // Critical: Can't notify at all
        console.error("CRITICAL: All notification methods failed");
      }
    }
  }
}

Health Checks for Monitoring System:

Monitor the Monitor:
Create a separate workflow that checks if your main monitoring workflows are running.

[Schedule: Every hour]
  ↓
[Check: When was last execution of main workflow?]
  ↓
[IF: >15 minutes ago]
  YES ↓
      [ALERT: Monitoring system appears down!]
      [Send via external service - email, SMS, PagerDuty]
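
One dependency-free way to implement the check is a heartbeat file: the main workflow writes a timestamp at the end of every run, and the watchdog alerts if it goes stale. The sketch below assumes an illustrative file path and that fs is allowed via NODE_FUNCTION_ALLOW_BUILTIN.

// Watchdog: alert if the main monitoring workflow has not written its heartbeat recently
const fs = require('fs');
const HEARTBEAT_FILE = '/data/heartbeats/main-monitoring';   // main workflow writes Date.now() here
const MAX_AGE_MS = 15 * 60 * 1000;

let lastRun = 0;
try {
  lastRun = Number(fs.readFileSync(HEARTBEAT_FILE, 'utf8'));
} catch (error) {
  // A missing file is treated the same as a stale heartbeat
}

const ageMs = Date.now() - lastRun;
if (ageMs > MAX_AGE_MS) {
  return [{ json: { alert: true, message: `Monitoring system appears down (last heartbeat ${Math.round(ageMs / 60000)} min ago)` } }];
}
return [{ json: { alert: false } }];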

Graceful Degradation:

// If one tool fails, try alternatives
// (httpCheck, curlCheck, and getFromCache are placeholders for your own helpers)
async function checkService(url) {
  try {
    // Primary: HTTP Request tool
    return await httpCheck(url);
  } catch (error) {
    try {
      // Fallback: Curl via Execute Command
      return await curlCheck(url);
    } catch (error2) {
      // Can't check, assume down
      return {
        status: "unknown",
        error: "All check methods failed",
        last_known_status: getFromCache(url)
      };
    }
  }
}

Recovery Procedures:

Document manual recovery steps:

## Emergency Recovery: n8n Agent System Down

1. Check n8n is running:

docker ps | grep n8n
systemctl status n8n


2. Check n8n logs:

docker logs n8n --tail 100
journalctl -u n8n -n 100


3. Restart n8n:

docker restart n8n
systemctl restart n8n


4. Verify workflows activate:
- Login to n8n web interface
- Check each workflow's Active status
- Manually execute one workflow to test

5. If persistent issues:
- Disable all workflows
- Re-enable one at a time
- Identify problematic workflow

6. Nuclear option:
- Restore n8n from backup
- Reimport workflow exports

Backup Strategy:

#!/bin/bash
# Daily backup of n8n data
BACKUP_DIR="/backups/n8n"
DATE=$(date +%Y%m%d)

# Backup the n8n database (assumes the sqlite3 CLI is available inside the container)
docker exec n8n sqlite3 /home/node/.n8n/database.sqlite ".backup /tmp/backup.db"
docker cp n8n:/tmp/backup.db ${BACKUP_DIR}/database-${DATE}.sqlite

# Backup workflows (export as JSON)
# Via n8n API or manual export

# Keep last 30 days
find ${BACKUP_DIR} -name "*.sqlite" -mtime +30 -delete

# Upload to cloud storage
rclone copy ${BACKUP_DIR} remote:n8n-backups

Section 10.5: Team Collaboration Best Practices

Clear Role Assignment:

Document each agent's domain:

agents/
├── vishnu-cto.md
│   - Responsibilities: Overall orchestration
│   - Escalation triggers: Multi-system failures
│   - Decision authority: Final say on all issues
│
├── brahma-network.md  
│   - Responsibilities: UniFi, routing, Wi-Fi
│   - Escalation: Issues beyond network scope
│   - Tools: UniFi API, network diagnostics
│
├── saraswati-database.md
│   - Responsibilities: PostgreSQL, MySQL
│   - Escalation: Data integrity threats
│   - Forbidden: Write operations without approval
│
└── ...

Escalation Paths:

Level 1: Hanuman (Helpdesk)
  ├─ Can resolve: Common questions, status checks
  ├─ Escalate to: Specialists for technical issues
  └─ Timeline: Respond within 5 minutes

Level 2: Specialists (Brahma, Saraswati, Ganesha, Shiva)
  ├─ Can resolve: Domain-specific technical issues
  ├─ Escalate to: Vishnu for multi-system coordination
  └─ Timeline: Respond within 15 minutes

Level 3: Vishnu (CTO)
  ├─ Can resolve: Complex multi-system issues
  ├─ Escalate to: Human for business decisions
  └─ Timeline: Respond within 30 minutes

Level 4: Human
  ├─ Can resolve: Anything (final authority)
  └─ Timeline: Best effort (SLA depends on severity)
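
If you route requests in a workflow rather than by prompt alone, the same table can be expressed as a small routing function. The sketch below is illustrative only: the `scope`, `category`, and `requires_business_decision` fields, and the helpdesk agent name, are assumptions to adapt to your own setup.

// Route an issue to the lowest level that can own it (field names are illustrative)
function routeIssue(issue) {
  if (issue.requires_business_decision) return { level: 4, owner: 'human' };
  if (issue.scope === 'multi-system') return { level: 3, owner: 'vishnu-cto' };

  const specialists = {
    network: 'brahma-network',
    database: 'saraswati-database',
    // extend with the remaining specialists' domains (Ganesha, Shiva)
  };
  if (specialists[issue.category]) return { level: 2, owner: specialists[issue.category] };

  return { level: 1, owner: 'hanuman-helpdesk' };   // default: helpdesk triage
}

return [{ json: routeIssue($json) }];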

Knowledge Sharing:

// Shared knowledge base accessible to all agents
const kb = {
  "plex_common_issues": [
    {
      "symptom": "Remote access not working",
      "solution": "Check port 32400 forwarding, Plex server settings",
      "solved_count": 12,
      "success_rate": 0.95
    }
  ],
  "network_topology": {
    "vlans": {
      "10": "Management",
      "20": "User Devices",
      "30": "Servers",
      "40": "IoT"
    },
    "aps": [
      { "name": "Living Room AP", "ip": "192.168.1.10" }
    ]
  },
  "service_dependencies": {
    "plex": ["nas", "network"],
    "website": ["docker", "network"],
    "database": ["storage", "network"]
  }
};

// Agents reference KB before troubleshooting
// Update KB after resolving new issues
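
A small helper makes "reference KB before troubleshooting" concrete: look up the service's known issues and hand the highest-success-rate fixes to the agent first. A minimal sketch against the kb structure above, assuming it is available in the same Code node:

// Return known fixes for a service, best success rate first
function knownFixes(kb, service) {
  const entries = kb[`${service}_common_issues`] || [];
  return [...entries].sort((a, b) => b.success_rate - a.success_rate);
}

// Example: give the agent Plex's proven fixes plus its upstream dependencies
const context = {
  known_fixes: knownFixes(kb, 'plex'),
  depends_on: kb.service_dependencies['plex'] || [],
};

return [{ json: context }];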

Version Control:

# Export workflows regularly
# Store in git repo

workflows/
├── monitoring-main.json
├── approval-handler.json
├── brahma-network.json
├── saraswati-database.json
└── README.md

# Commit after significant changes
git add workflows/
git commit -m "Add database slow query detection to Saraswati"
git push

# Tags for stable versions
git tag -a v1.0 -m "Production-ready release"

Change Management:

Before modifying production workflows:
1. Document change in issue tracker
2. Test in development environment
3. Peer review (or self-review with checklist)
4. Deploy during maintenance window
5. Monitor for 24 hours after change
6. Document results and lessons learned

For emergency fixes:
1. Fix the immediate issue
2. Document what was changed
3. Follow up with proper testing and documentation within 48 hours

Your agent system is now enterprise-grade, with comprehensive troubleshooting, best practices, and reliability measures in place.