n8n-AI-Multiple-Agent-Team/10-Best-Practices.md at main

Files

copilot-swe-agent[bot] 4a7dc2b6b5 Add professional repository structure with metadata files, workflows, docs, and scripts

Co-authored-by: ambicuity <44251619+ambicuity@users.noreply.github.com>

2025-10-05 18:31:53 +00:00

12 KiB

Raw Permalink Blame History

Chapter 10 – Best Practices

Professional infrastructure management requires discipline, planning, and adherence to proven practices. This chapter distills wisdom from production deployments into actionable guidelines.

Section 10.1: Development Workflow

Progressive Implementation:

Start Simple (Week 1-2):
- Single service monitoring (test website)
- Manual trigger only
- Basic HTTP health checks
- No automation, just observation
Add Intelligence (Week 3-4):
- Diagnostic capabilities (Docker logs)
- Structured JSON output
- Basic Telegram notifications
- Still manual execution
Automate Carefully (Week 5-6):
- Schedule trigger (every 30 minutes initially)
- Rate limiting and deduplication
- Human-in-the-loop approvals
- Comprehensive logging
Expand Scope (Week 7+):
- Additional services one at a time
- Specialized agents for different domains
- Agent collaboration
- Refinement based on experience

Testing Hierarchy:

1. Local Development
   ↓ (Test thoroughly)
2. Staging/Test Services
   ↓ (Validate behavior)
3. Non-Critical Production
   ↓ (Monitor closely)
4. Critical Production
   ↓ (Only after proven reliable)

Never Skip Steps. Each phase builds confidence and reveals edge cases.

Section 10.2: Security Considerations

Principle of Least Privilege:

Agent Capability Levels:

Level 1 (Read-Only):
- HTTP GET requests
- Log viewing
- Status checks
✅ Safe for production immediately

Level 2 (Safe Operations):
- Container restarts
- Cache clearing
- Log rotation
⚠️ Requires testing, generally safe

Level 3 (Configuration Changes):
- Firewall rules
- Resource limits
- Port mappings
❌ Requires approval, staging testing

Level 4 (Data Operations):
- Database modifications
- Storage operations
- User management
🔴 FORBIDDEN for automation

Credential Management:

// ❌ WRONG - Hardcoded secrets
const apiKey = "sk-abc123xyz789";
const dbPassword = "MyPassword123";

// ✅ RIGHT - Environment variables
const apiKey = $env.OPENAI_API_KEY;
const dbPassword = $env.DATABASE_PASSWORD;

// ✅ RIGHT - n8n Credentials
// Use Credentials feature for:
- API keys
- Passwords
- SSH keys
- Tokens

Access Control:

Network Segmentation:
- n8n server in management VLAN
- Firewall rules limiting outbound access
- VPN required for remote n8n access

Authentication:
- Enable n8n basic auth or LDAP
- Strong passwords (20+ characters)
- 2FA if available

API Security:
- Use API tokens instead of passwords
- Rotate credentials quarterly
- Audit credential access
- Revoke unused credentials

Audit Logging:

// Log every agent action
const logEntry = {
  timestamp: new Date().toISOString(),
  agent: "vishnu-cto",
  action: "container_restart",
  target: "plex",
  authorized_by: "human_approval",
  approval_id: "APR-20240115-001",
  result: "success",
  user_id: $json.telegram_user_id
};

// Store logs:
// - Local file (/var/log/n8n-agent-actions.log)
// - Database (for queryability)
// - SIEM system (for enterprise environments)

Security Monitoring:

Monitor the Monitors:
- Who accessed n8n?
- What workflows were modified?
- What credentials were used?
- What agents took actions?
- Were approvals properly obtained?

Alert on:
- Failed authentication attempts >5
- Workflow changes outside business hours
- Credentials accessed by unusual users
- Agent actions without approval
- New workflows created

Section 10.3: Performance Optimization

Polling Frequency:

Service Type         | Recommended Interval
---------------------|---------------------
Critical (Database)  | Every 2-5 minutes
Important (Web Apps) | Every 5-10 minutes
Standard (Media)     | Every 10-15 minutes
Non-Critical (Dev)   | Every 30-60 minutes

Avoid over-polling:
- Wastes API quota
- Increases costs (LLM API calls)
- Creates alert fatigue
- Adds server load

Caching Strategy:

// Cache service status to reduce redundant checks
const cache = $('WorkflowStaticData').first().json.cache || {};
const cacheKey = `status_${serviceName}`;
const cachedStatus = cache[cacheKey];
const now = Date.now();

// Use cache if fresh (< 5 minutes old)
if (cachedStatus && (now - cachedStatus.timestamp) < 5 * 60 * 1000) {
  return { json: cachedStatus.data };
}

// Otherwise, fetch fresh data
const freshStatus = await checkService();

// Update cache
cache[cacheKey] = {
  timestamp: now,
  data: freshStatus
};

return { json: freshStatus };

Conditional Execution:

// Only trigger notifications on state change
const previousState = $('WorkflowStaticData').first().json.last_status || {};
const currentState = $json.status;

if (previousState.status === currentState.status) {
  // No change, skip notification
  return [];
}

// State changed, send notification
$('WorkflowStaticData').first().json.last_status = currentState;
return { json: { notify: true, state_change: true } };

Resource Limits:

# If running n8n in Docker
services:
  n8n:
    image: n8nio/n8n
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 2G
        reservations:
          cpus: '0.5'
          memory: 512M

Workflow Optimization:

Slow Workflow Pattern:
[Trigger] → [Agent] → [Wait 30s] → [Agent] → [Wait 30s] → [Agent]
Total time: 90+ seconds

Optimized Pattern:
[Trigger] → [Agent with all tools] → [Parallel checks] → [Synthesize]
Total time: 10-20 seconds

Use parallel execution where possible:
- Multiple service checks
- Multiple API calls
- Multiple SSH commands

Section 10.4: Reliability Guidelines

Fallback Mechanisms:

// Multi-channel notifications with fallback
async function sendNotification(message) {
  try {
    // Primary: Telegram
    await sendTelegram(message);
  } catch (error) {
    try {
      // Fallback: Email
      await sendEmail(message);
    } catch (error2) {
      try {
        // Last resort: Write to file
        fs.appendFileSync('/var/log/failed-notifications.log', 
          JSON.stringify({ timestamp: Date.now(), message, error: error2 }));
      } catch (error3) {
        // Critical: Can't notify at all
        console.error("CRITICAL: All notification methods failed");
      }
    }
  }
}

Health Checks for Monitoring System:

Monitor the Monitor:
Create a separate workflow that checks if your main monitoring workflows are running.

[Schedule: Every hour]
  ↓
[Check: When was last execution of main workflow?]
  ↓
[IF: >15 minutes ago]
  YES ↓
      [ALERT: Monitoring system appears down!]
      [Send via external service - email, SMS, PagerDuty]

Graceful Degradation:

// If one tool fails, try alternatives
async function checkService(url) {
  try {
    // Primary: HTTP Request tool
    return await httpCheck(url);
  } catch (error) {
    try {
      // Fallback: Curl via Execute Command
      return await curlCheck(url);
    } catch (error2) {
      // Can't check, assume down
      return {
        status: "unknown",
        error: "All check methods failed",
        last_known_status: getFromCache(url)
      };
    }
  }
}

Recovery Procedures:

Document manual recovery steps:

## Emergency Recovery: n8n Agent System Down

1. Check n8n is running:

docker ps | grep n8n systemctl status n8n


2. Check n8n logs:

docker logs n8n --tail 100 journalctl -u n8n -n 100


3. Restart n8n:

docker restart n8n systemctl restart n8n


4. Verify workflows activate:
- Login to n8n web interface
- Check each workflow's Active status
- Manually execute one workflow to test

5. If persistent issues:
- Disable all workflows
- Re-enable one at a time
- Identify problematic workflow

6. Nuclear option:
- Restore n8n from backup
- Reimport workflow exports

Backup Strategy:

# Daily backup of n8n data
#!/bin/bash
BACKUP_DIR="/backups/n8n"
DATE=$(date +%Y%m%d)

# Backup n8n database
docker exec n8n sqlite3 /home/node/.n8n/database.sqlite ".backup /tmp/backup.db"
docker cp n8n:/tmp/backup.db ${BACKUP_DIR}/database-${DATE}.sqlite

# Backup workflows (export as JSON)
# Via n8n API or manual export

# Keep last 30 days
find ${BACKUP_DIR} -name "*.sqlite" -mtime +30 -delete

# Upload to cloud storage
rclone copy ${BACKUP_DIR} remote:n8n-backups

Section 10.5: Team Collaboration Best Practices

Clear Role Assignment:

Document each agent's domain:

agents/
├── vishnu-cto.md
│   - Responsibilities: Overall orchestration
│   - Escalation triggers: Multi-system failures
│   - Decision authority: Final say on all issues
│
├── brahma-network.md  
│   - Responsibilities: UniFi, routing, Wi-Fi
│   - Escalation: Issues beyond network scope
│   - Tools: UniFi API, network diagnostics
│
├── saraswati-database.md
│   - Responsibilities: PostgreSQL, MySQL
│   - Escalation: Data integrity threats
│   - Forbidden: Write operations without approval
│
└── ...

Escalation Paths:

Level 1: Hanuman (Helpdesk)
  ├─ Can resolve: Common questions, status checks
  ├─ Escalate to: Specialists for technical issues
  └─ Timeline: Respond within 5 minutes

Level 2: Specialists (Brahma, Saraswati, Ganesha, Shiva)
  ├─ Can resolve: Domain-specific technical issues
  ├─ Escalate to: Vishnu for multi-system coordination
  └─ Timeline: Respond within 15 minutes

Level 3: Vishnu (CTO)
  ├─ Can resolve: Complex multi-system issues
  ├─ Escalate to: Human for business decisions
  └─ Timeline: Respond within 30 minutes

Level 4: Human
  ├─ Can resolve: Anything (final authority)
  └─ Timeline: Best effort (SLA depends on severity)

Knowledge Sharing:

// Shared knowledge base accessible to all agents
const kb = {
  "plex_common_issues": [
    {
      "symptom": "Remote access not working",
      "solution": "Check port 32400 forwarding, Plex server settings",
      "solved_count": 12,
      "success_rate": 0.95
    }
  ],
  "network_topology": {
    "vlans": {
      "10": "Management",
      "20": "User Devices",
      "30": "Servers",
      "40": "IoT"
    },
    "aps": [
      { "name": "Living Room AP", "ip": "192.168.1.10" }
    ]
  },
  "service_dependencies": {
    "plex": ["nas", "network"],
    "website": ["docker", "network"],
    "database": ["storage", "network"]
  }
};

// Agents reference KB before troubleshooting
// Update KB after resolving new issues

Version Control:

# Export workflows regularly
# Store in git repo

workflows/
├── monitoring-main.json
├── approval-handler.json
├── brahma-network.json
├── saraswati-database.json
└── README.md

# Commit after significant changes
git add workflows/
git commit -m "Add database slow query detection to Saraswati"
git push

# Tags for stable versions
git tag -a v1.0 -m "Production-ready release"

Change Management:

Before modifying production workflows:
1. Document change in issue tracker
2. Test in development environment
3. Peer review (or self-review with checklist)
4. Deploy during maintenance window
5. Monitor for 24 hours after change
6. Document results and lessons learned

For emergency fixes:
1. Fix the immediate issue
2. Document what was changed
3. Proper testing and documentation follow-up within 48 hours

Your agent system is now enterprise-grade, with comprehensive troubleshooting, best practices, and reliability measures in place.

12 KiB Raw Permalink Blame History Unescape Escape