Chapter 10 – Best Practices
Professional infrastructure management requires discipline, planning, and adherence to proven practices. This chapter distills wisdom from production deployments into actionable guidelines.
Section 10.1: Development Workflow
Progressive Implementation:
Start Simple (Week 1-2):
- Single service monitoring (test website)
- Manual trigger only
- Basic HTTP health checks
- No automation, just observation
Add Intelligence (Week 3-4):
- Diagnostic capabilities (Docker logs)
- Structured JSON output
- Basic Telegram notifications
- Still manual execution
Automate Carefully (Week 5-6):
- Schedule trigger (every 30 minutes initially)
- Rate limiting and deduplication
- Human-in-the-loop approvals
- Comprehensive logging
Expand Scope (Week 7+):
- Additional services one at a time
- Specialized agents for different domains
- Agent collaboration
- Refinement based on experience
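The rate limiting and deduplication step from weeks 5-6 can be sketched as a small pure function. This is a minimal illustration under stated assumptions, not an n8n feature: the helper name `shouldSendAlert`, the `"plex:down"` key format, and the 30-minute window are all hypothetical choices.

```javascript
// Hypothetical helper: suppress repeat alerts for the same key within a window.
// `sentAt` is a plain object mapping alert keys to the last send time (ms).
function shouldSendAlert(sentAt, alertKey, nowMs, windowMs = 30 * 60 * 1000) {
  const last = sentAt[alertKey];
  if (last !== undefined && nowMs - last < windowMs) {
    return false; // duplicate inside the window: stay quiet
  }
  sentAt[alertKey] = nowMs; // record this send so later duplicates are suppressed
  return true;
}

// Example: the same alert fires three times.
const alertLog = {};
const firstSeen = Date.UTC(2024, 0, 15, 12, 0, 0);
const alertDecisions = [
  shouldSendAlert(alertLog, "plex:down", firstSeen),                  // first occurrence
  shouldSendAlert(alertLog, "plex:down", firstSeen + 10 * 60 * 1000), // 10 minutes later
  shouldSendAlert(alertLog, "plex:down", firstSeen + 45 * 60 * 1000), // 45 minutes later
];
console.log(alertDecisions);
```

Only the first and third calls send, because the second falls inside the window. In a real workflow the `alertLog` object would need to live somewhere persistent, such as workflow static data.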
Testing Hierarchy:
1. Local Development
↓ (Test thoroughly)
2. Staging/Test Services
↓ (Validate behavior)
3. Non-Critical Production
↓ (Monitor closely)
4. Critical Production
↓ (Only after proven reliable)
Never Skip Steps. Each phase builds confidence and reveals edge cases.
Section 10.2: Security Considerations
Principle of Least Privilege:
Agent Capability Levels:
Level 1 (Read-Only):
- HTTP GET requests
- Log viewing
- Status checks
✅ Safe for production immediately
Level 2 (Safe Operations):
- Container restarts
- Cache clearing
- Log rotation
⚠️ Requires testing, generally safe
Level 3 (Configuration Changes):
- Firewall rules
- Resource limits
- Port mappings
❌ Requires approval, staging testing
Level 4 (Data Operations):
- Database modifications
- Storage operations
- User management
🔴 FORBIDDEN for automation
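One way to enforce these levels is a lookup that every agent action passes through before it runs. The sketch below is an assumption-laden illustration: the action names, the `authorize` function, and the three return values are hypothetical, and treating Level 2 as allowed presumes the testing mentioned above has already happened.

```javascript
// Hypothetical mapping of action names to the capability levels listed above.
const ACTION_LEVELS = new Map([
  ["http_get", 1], ["view_logs", 1], ["status_check", 1],
  ["container_restart", 2], ["cache_clear", 2], ["log_rotate", 2],
  ["firewall_rule", 3], ["resource_limit", 3], ["port_mapping", 3],
  ["db_modify", 4], ["storage_op", 4], ["user_manage", 4],
]);

// Decide what automation may do with a requested action.
// Unknown actions are denied, which keeps the default at least privilege.
function authorize(action, approved = false) {
  const level = ACTION_LEVELS.get(action);
  if (level === undefined || level >= 4) return "deny";
  if (level === 3) return approved ? "allow" : "needs_approval";
  return "allow"; // levels 1 and 2
}

console.log(authorize("status_check"), authorize("firewall_rule"), authorize("db_modify", true));
```

Level 4 stays denied even with an approval flag, matching the rule that data operations are never automated.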
Credential Management:
// ❌ WRONG - Hardcoded secrets
const apiKey = "sk-abc123xyz789";
const dbPassword = "MyPassword123";
// ✅ RIGHT - Environment variables
const apiKey = $env.OPENAI_API_KEY;
const dbPassword = $env.DATABASE_PASSWORD;
// ✅ RIGHT - n8n Credentials
// Use the Credentials feature for:
// - API keys
// - Passwords
// - SSH keys
// - Tokens
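When secrets come from environment variables, it helps to fail fast if one is missing rather than sending an empty key to an API. The helper below is a minimal sketch; `requireSecret` is a hypothetical name, and it takes the environment as a parameter so that it runs in plain Node.js (pass `process.env`) as well as anywhere an equivalent object is available.

```javascript
// Hypothetical helper: read a required secret from an environment-like object.
// It throws a clear error naming the variable, and never echoes the value itself.
function requireSecret(env, name) {
  const value = env[name];
  if (typeof value !== "string" || value.trim() === "") {
    throw new Error(`Missing required secret: ${name}`);
  }
  return value;
}

// Example with a stand-in environment object.
const demoEnv = { OPENAI_API_KEY: "example-value" };
console.log(requireSecret(demoEnv, "OPENAI_API_KEY").length > 0);
```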
Access Control:
Network Segmentation:
- n8n server in management VLAN
- Firewall rules limiting outbound access
- VPN required for remote n8n access
Authentication:
- Enable n8n's built-in user accounts, or LDAP where your edition supports it (basic auth was removed in n8n 1.0)
- Strong passwords (20+ characters)
- 2FA if available
API Security:
- Use API tokens instead of passwords
- Rotate credentials quarterly
- Audit credential access
- Revoke unused credentials
Audit Logging:
// Log every agent action
const logEntry = {
timestamp: new Date().toISOString(),
agent: "vishnu-cto",
action: "container_restart",
target: "plex",
authorized_by: "human_approval",
approval_id: "APR-20240115-001",
result: "success",
user_id: $json.telegram_user_id
};
// Store logs:
// - Local file (/var/log/n8n-agent-actions.log)
// - Database (for queryability)
// - SIEM system (for enterprise environments)
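For the local file option, writing one JSON object per line keeps the log easy to grep and to import into a database later. The sketch below only formats and validates an entry; the function name `toAuditLine` and the choice of required fields are assumptions for illustration.

```javascript
// Fields that every audit record is assumed to need (a hypothetical minimum).
const REQUIRED_AUDIT_FIELDS = ["timestamp", "agent", "action", "target", "result"];

// Turn an entry into a single JSON line, rejecting incomplete records.
// JSON.stringify escapes embedded newlines, so each record stays on one line.
function toAuditLine(entry) {
  const missing = REQUIRED_AUDIT_FIELDS.filter(
    (field) => entry[field] === undefined || entry[field] === ""
  );
  if (missing.length > 0) {
    throw new Error(`Audit entry missing: ${missing.join(", ")}`);
  }
  return JSON.stringify(entry) + "\n";
}

const sampleLine = toAuditLine({
  timestamp: "2024-01-15T12:00:00.000Z",
  agent: "vishnu-cto",
  action: "container_restart",
  target: "plex",
  result: "success",
});
process.stdout.write(sampleLine);
```

The returned string can be appended to a file such as /var/log/n8n-agent-actions.log.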
Security Monitoring:
Monitor the Monitors:
- Who accessed n8n?
- What workflows were modified?
- What credentials were used?
- What agents took actions?
- Were approvals properly obtained?
Alert on:
- Failed authentication attempts >5
- Workflow changes outside business hours
- Credentials accessed by unusual users
- Agent actions without approval
- New workflows created
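The failed-authentication rule needs a time window to be actionable, which the list above leaves open. The sketch below assumes a 10-minute sliding window; that window, the function name `exceedsFailedLogins`, and the input format (an array of failure timestamps in milliseconds) are all hypothetical.

```javascript
// Return true when more than `limit` failures fall inside the recent window.
// Timestamps in the future are ignored so clock skew cannot trigger an alert.
function exceedsFailedLogins(failureTimesMs, nowMs, limit = 5, windowMs = 10 * 60 * 1000) {
  const recent = failureTimesMs.filter((t) => t <= nowMs && nowMs - t <= windowMs);
  return recent.length > limit;
}

// Example: six failures, one second apart, ending now.
const checkTime = Date.UTC(2024, 0, 15, 12, 0, 0);
const sixFailures = [0, 1, 2, 3, 4, 5].map((i) => checkTime - i * 1000);
console.log(exceedsFailedLogins(sixFailures, checkTime));
```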
Section 10.3: Performance Optimization
Polling Frequency:
Service Type | Recommended Interval
---------------------|---------------------
Critical (Database) | Every 2-5 minutes
Important (Web Apps) | Every 5-10 minutes
Standard (Media) | Every 10-15 minutes
Non-Critical (Dev) | Every 30-60 minutes
Avoid over-polling:
- Wastes API quota
- Increases costs (LLM API calls)
- Creates alert fatigue
- Adds server load
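The interval table can be turned into a simple "is this check due?" guard so that one frequent schedule drives services with different polling rates. This is a sketch under assumptions: the tier keys and the `isCheckDue` helper are hypothetical, and it uses the upper bound of each recommended range.

```javascript
// Upper bound of each recommended interval from the table above, in minutes.
const MAX_INTERVAL_MINUTES = new Map([
  ["critical", 5],
  ["important", 10],
  ["standard", 15],
  ["non-critical", 60],
]);

// Decide whether a service in the given tier should be checked again.
function isCheckDue(tier, lastCheckedMs, nowMs) {
  const minutes = MAX_INTERVAL_MINUTES.get(tier);
  if (minutes === undefined) throw new Error(`Unknown service tier: ${tier}`);
  if (lastCheckedMs === null || lastCheckedMs === undefined) return true; // never checked
  return nowMs - lastCheckedMs >= minutes * 60 * 1000;
}

// Five minutes after the last check, only the critical tier is due.
const fiveMinutes = 5 * 60 * 1000;
console.log(isCheckDue("critical", 0, fiveMinutes), isCheckDue("standard", 0, fiveMinutes));
```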
Caching Strategy:
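// Cache service status to reduce redundant checks.
// Assumes an n8n Code node; serviceName and checkService() are defined elsewhere.
// Note: static data persists only for active (production) executions, not manual test runs.
const staticData = $getWorkflowStaticData('global');
const cache = staticData.cache || {};
const cacheKey = `status_${serviceName}`;
const cachedStatus = cache[cacheKey];
const now = Date.now();
// Use cache if fresh (< 5 minutes old)
if (cachedStatus && (now - cachedStatus.timestamp) < 5 * 60 * 1000) {
  return { json: cachedStatus.data };
}
// Otherwise, fetch fresh data
const freshStatus = await checkService();
// Update the cache and write it back so n8n persists it
cache[cacheKey] = {
  timestamp: now,
  data: freshStatus
};
staticData.cache = cache;
return { json: freshStatus };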
Conditional Execution:
// Only trigger notifications on state change.
// Assumes an n8n Code node where $json.status is a string such as "up" or "down".
const staticData = $getWorkflowStaticData('global');
const previousStatus = staticData.last_status;
const currentStatus = $json.status;
if (previousStatus === currentStatus) {
  // No change, skip notification
  return [];
}
// State changed: remember the new status, then notify
staticData.last_status = currentStatus;
return { json: { notify: true, state_change: true } };
Resource Limits:
# If running n8n in Docker
services:
  n8n:
    image: n8nio/n8n
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 2G
        reservations:
          cpus: '0.5'
          memory: 512M
Workflow Optimization:
Slow Workflow Pattern:
[Trigger] → [Agent] → [Wait 30s] → [Agent] → [Wait 30s] → [Agent]
Total time: 90+ seconds
Optimized Pattern:
[Trigger] → [Agent with all tools] → [Parallel checks] → [Synthesize]
Total time: 10-20 seconds
Use parallel execution where possible:
- Multiple service checks
- Multiple API calls
- Multiple SSH commands
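In code, parallel checks map naturally onto `Promise.allSettled`, which waits for every check and reports failures individually instead of aborting on the first one. The sketch below is illustrative: `checkAll` and the stand-in `fakeCheck` are hypothetical names, and a real checker would make an HTTP or SSH call.

```javascript
// Run one check per service concurrently and collect a result for each,
// even when some checks reject.
async function checkAll(services, checkFn) {
  const settled = await Promise.allSettled(services.map((name) => checkFn(name)));
  return settled.map((outcome, i) => ({
    service: services[i],
    status: outcome.status === "fulfilled" ? outcome.value : "unknown",
    error: outcome.status === "rejected"
      ? (outcome.reason?.message ?? String(outcome.reason))
      : null,
  }));
}

// Stand-in checker for the demo: "plex" fails, everything else is up.
async function fakeCheck(name) {
  if (name === "plex") throw new Error("timeout");
  return "up";
}

checkAll(["website", "plex"], fakeCheck).then((report) => console.log(report));
```

Total time is bounded by the slowest check rather than the sum of all checks, which is where the speedup over the sequential pattern comes from.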
Section 10.4: Reliability Guidelines
Fallback Mechanisms:
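// Multi-channel notifications with fallback.
// sendTelegram() and sendEmail() are assumed to be defined elsewhere.
// In an n8n Code node, require('fs') works only if NODE_FUNCTION_ALLOW_BUILTIN permits it.
const fs = require('fs');

async function sendNotification(message) {
  try {
    // Primary: Telegram
    await sendTelegram(message);
  } catch (error) {
    try {
      // Fallback: Email
      await sendEmail(message);
    } catch (error2) {
      try {
        // Last resort: append one JSON line to a file.
        // Error objects serialize to {}, so log the message text instead.
        fs.appendFileSync('/var/log/failed-notifications.log',
          JSON.stringify({ timestamp: Date.now(), message, error: error2.message }) + '\n');
      } catch (error3) {
        // Critical: Can't notify at all
        console.error("CRITICAL: All notification methods failed");
      }
    }
  }
}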
Health Checks for Monitoring System:
Monitor the Monitor:
Create a separate workflow that checks if your main monitoring workflows are running.
[Schedule: Every hour]
↓
[Check: When was last execution of main workflow?]
↓
[IF: >15 minutes ago (set this threshold above the main workflow's schedule interval)]
YES ↓
[ALERT: Monitoring system appears down!]
[Send via external service - email, SMS, PagerDuty]
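The staleness test at the heart of this watchdog is a timestamp comparison. The sketch below assumes the last execution time is available as an ISO 8601 string; how you obtain it (for example from the n8n API or a heartbeat file) is left open, and `isMonitorStale` is a hypothetical name.

```javascript
// Return true when the main workflow has not run recently enough.
// A missing or unparseable timestamp counts as stale, so the watchdog fails loud.
function isMonitorStale(lastExecutionIso, nowMs, maxAgeMinutes = 15) {
  if (!lastExecutionIso) return true; // never ran, or no record found
  const lastMs = Date.parse(lastExecutionIso);
  if (Number.isNaN(lastMs)) return true;
  return nowMs - lastMs > maxAgeMinutes * 60 * 1000;
}

const watchdogNow = Date.parse("2024-01-15T12:00:00Z");
console.log(
  isMonitorStale("2024-01-15T11:50:00Z", watchdogNow), // ran 10 minutes ago
  isMonitorStale("2024-01-15T11:30:00Z", watchdogNow)  // ran 30 minutes ago
);
```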
Graceful Degradation:
// If one tool fails, try alternatives
async function checkService(url) {
try {
// Primary: HTTP Request tool
return await httpCheck(url);
} catch (error) {
try {
// Fallback: Curl via Execute Command
return await curlCheck(url);
} catch (error2) {
// Can't check: report "unknown" instead of assuming down
return {
status: "unknown",
error: "All check methods failed",
last_known_status: getFromCache(url)
};
}
}
}
Recovery Procedures:
Document manual recovery steps:
## Emergency Recovery: n8n Agent System Down
1. Check n8n is running:
docker ps | grep n8n          # Docker install
systemctl status n8n          # systemd install
2. Check n8n logs:
docker logs n8n --tail 100    # Docker install
journalctl -u n8n -n 100      # systemd install
3. Restart n8n:
docker restart n8n            # Docker install
systemctl restart n8n         # systemd install
4. Verify workflows activate:
- Log in to the n8n web interface
- Check each workflow's Active status
- Manually execute one workflow to test
5. If persistent issues:
- Disable all workflows
- Re-enable one at a time
- Identify problematic workflow
6. Nuclear option:
- Restore n8n from backup
- Reimport workflow exports
Backup Strategy:
#!/bin/bash
# Daily backup of n8n data
set -euo pipefail
BACKUP_DIR="/backups/n8n"
DATE=$(date +%Y%m%d)
mkdir -p "${BACKUP_DIR}"
# Back up the n8n database (assumes the sqlite3 CLI is available inside the container)
docker exec n8n sqlite3 /home/node/.n8n/database.sqlite ".backup /tmp/backup.db"
docker cp n8n:/tmp/backup.db "${BACKUP_DIR}/database-${DATE}.sqlite"
# Back up workflows (export as JSON)
# Via n8n API or manual export
# Keep last 30 days
find "${BACKUP_DIR}" -name "*.sqlite" -mtime +30 -delete
# Upload to cloud storage
rclone copy "${BACKUP_DIR}" remote:n8n-backups
Section 10.5: Team Collaboration Best Practices
Clear Role Assignment:
Document each agent's domain:
agents/
├── vishnu-cto.md
│ - Responsibilities: Overall orchestration
│ - Escalation triggers: Multi-system failures
│ - Decision authority: Final say on all issues
│
├── brahma-network.md
│ - Responsibilities: UniFi, routing, Wi-Fi
│ - Escalation: Issues beyond network scope
│ - Tools: UniFi API, network diagnostics
│
├── saraswati-database.md
│ - Responsibilities: PostgreSQL, MySQL
│ - Escalation: Data integrity threats
│ - Forbidden: Write operations without approval
│
└── ...
Escalation Paths:
Level 1: Hanuman (Helpdesk)
├─ Can resolve: Common questions, status checks
├─ Escalate to: Specialists for technical issues
└─ Timeline: Respond within 5 minutes
Level 2: Specialists (Brahma, Saraswati, Ganesha, Shiva)
├─ Can resolve: Domain-specific technical issues
├─ Escalate to: Vishnu for multi-system coordination
└─ Timeline: Respond within 15 minutes
Level 3: Vishnu (CTO)
├─ Can resolve: Complex multi-system issues
├─ Escalate to: Human for business decisions
└─ Timeline: Respond within 30 minutes
Level 4: Human
├─ Can resolve: Anything (final authority)
└─ Timeline: Best effort (SLA depends on severity)
Knowledge Sharing:
// Shared knowledge base accessible to all agents
const kb = {
"plex_common_issues": [
{
"symptom": "Remote access not working",
"solution": "Check port 32400 forwarding, Plex server settings",
"solved_count": 12,
"success_rate": 0.95
}
],
"network_topology": {
"vlans": {
"10": "Management",
"20": "User Devices",
"30": "Servers",
"40": "IoT"
},
"aps": [
{ "name": "Living Room AP", "ip": "192.168.1.10" }
]
},
"service_dependencies": {
"plex": ["nas", "network"],
"website": ["docker", "network"],
"database": ["storage", "network"]
}
};
// Agents reference KB before troubleshooting
// Update KB after resolving new issues
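The `service_dependencies` map is most useful when agents query it in reverse: given a failed component, which services are likely affected? The helper below is a minimal sketch using the same dependency data; the function name `affectedBy` is hypothetical.

```javascript
// Same dependency data as the knowledge base example.
const serviceDependencies = {
  plex: ["nas", "network"],
  website: ["docker", "network"],
  database: ["storage", "network"],
};

// List the services that depend directly on a failed component, sorted by name.
function affectedBy(dependencies, failedComponent) {
  return Object.keys(dependencies)
    .filter((service) => dependencies[service].includes(failedComponent))
    .sort();
}

console.log(affectedBy(serviceDependencies, "network"));
console.log(affectedBy(serviceDependencies, "nas"));
```

A network failure touches every service, so the orchestrating agent can report one root cause instead of three separate outages.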
Version Control:
# Export workflows regularly
# Store in git repo
workflows/
├── monitoring-main.json
├── approval-handler.json
├── brahma-network.json
├── saraswati-database.json
└── README.md
# Commit after significant changes
git add workflows/
git commit -m "Add database slow query detection to Saraswati"
git push
# Tags for stable versions
git tag -a v1.0 -m "Production-ready release"
Change Management:
Before modifying production workflows:
1. Document change in issue tracker
2. Test in development environment
3. Peer review (or self-review with checklist)
4. Deploy during maintenance window
5. Monitor for 24 hours after change
6. Document results and lessons learned
For emergency fixes:
1. Fix the immediate issue
2. Document what was changed
3. Follow up with proper testing and documentation within 48 hours
With these practices in place, your agent system has the troubleshooting procedures, safeguards, and reliability measures expected of a production deployment.