6662 lines
187 KiB
Markdown
6662 lines
187 KiB
Markdown
# n8n-Homelab-CTO-Agent-Team
|
||
|
||
**AI-Powered Multi-Agent CTO for Your Homelab**
|
||
|
||
> **Author**: Ritesh Rana ([@ambicuity](https://github.com/ambicuity))
|
||
> **Contact**: [riteshrana36@gmail.com](mailto:riteshrana36@gmail.com)
|
||
> **Website**: [www.riteshrana.engineer](https://www.riteshrana.engineer)
|
||
|
||
---
|
||
|
||
## Table of Contents
|
||
|
||
1. [Chapter 1 – Overview](#chapter-1--overview)
|
||
2. [Chapter 2 – Prerequisites](#chapter-2--prerequisites)
|
||
3. [Chapter 3 – Initial Setup](#chapter-3--initial-setup)
|
||
4. [Chapter 4 – Agent Evolution (Stages)](#chapter-4--agent-evolution-stages)
|
||
5. [Chapter 5 – Automation Setup](#chapter-5--automation-setup)
|
||
6. [Chapter 6 – Service Integrations](#chapter-6--service-integrations)
|
||
7. [Chapter 7 – Advanced Features](#chapter-7--advanced-features)
|
||
8. [Chapter 8 – AI Team](#chapter-8--ai-team)
|
||
9. [Chapter 9 – Troubleshooting](#chapter-9--troubleshooting)
|
||
10. [Chapter 10 – Best Practices](#chapter-10--best-practices)
|
||
11. [Chapter 11 – Additional Resources](#chapter-11--additional-resources)
|
||
12. [Chapter 12 – Support & Contributions](#chapter-12--support--contributions)
|
||
|
||
---
|
||
|
||
## Chapter 1 – Overview
|
||
|
||
### Purpose of This Repository
|
||
|
||
This comprehensive guide documents the creation and operation of an **AI-powered multi-agent IT team** built with **n8n**, designed to manage your homelab or business network infrastructure with the intelligence and autonomy of a seasoned **Chief Technology Officer (CTO)**.
|
||
|
||
Unlike traditional monitoring scripts or simple automation tools, this system embodies a fundamental paradigm shift: **treating AI agents as employees, not bots**. Each agent has a defined role, responsibilities, decision-making capabilities, and the ability to collaborate with other agents to solve complex infrastructure challenges.
|
||
|
||
### Philosophy: Agents as Employees, Not Bots
|
||
|
||
Traditional automation follows rigid if-then logic. When Service X goes down, restart it. When disk usage exceeds Y%, send an alert. This approach lacks intelligence, context awareness, and adaptability.
|
||
|
||
The **n8n-Homelab-CTO-Agent-Team** reimagines infrastructure management by creating a virtual IT department where:
|
||
|
||
- **Agents have roles**: Just like a real IT team has network administrators, database engineers, and security officers, your AI team has specialized agents with distinct expertise
|
||
- **Agents make decisions**: They don't just execute predefined scripts; they analyze situations, weigh options, and choose appropriate actions
|
||
- **Agents collaborate**: Complex issues are escalated and multiple agents work together, sharing knowledge and insights
|
||
- **Agents learn and adapt**: System prompts can be refined based on experience, creating increasingly capable team members over time
|
||
- **Agents respect hierarchy**: Junior agents handle routine tasks, senior agents tackle complex problems, and the CTO agent orchestrates everything
|
||
|
||
### Real-World Benefits: Saving Time, Money, and Stress
|
||
|
||
**Time Savings**:
|
||
- Automated 24/7 monitoring eliminates manual health checks
|
||
- Intelligent diagnostics replace hours of log diving
|
||
- Self-healing capabilities handle common issues without human intervention
|
||
- Agents work while you sleep, on weekends, and during vacations
|
||
|
||
**Cost Savings**:
|
||
- Reduce downtime and associated revenue loss
|
||
- Prevent small issues from becoming expensive disasters
|
||
- Maximize infrastructure utilization through intelligent resource management
|
||
- Avoid hiring additional IT staff for routine operations
|
||
- Leverage existing infrastructure rather than expensive managed services
|
||
|
||
**Stress Reduction**:
|
||
- Human-in-the-loop approval means you stay in control without constant monitoring
|
||
- Detailed diagnostics and proposed solutions replace 3 AM panic debugging
|
||
- Comprehensive logging creates an audit trail for compliance and troubleshooting
|
||
- Proactive issue detection prevents surprise outages
|
||
- Peace of mind knowing your infrastructure has intelligent oversight
|
||
|
||
**Real Example**: Your Plex media server crashes at 2 AM. Traditional monitoring sends an alert, waking you up. With the CTO Agent Team:
|
||
1. The monitoring agent detects the issue within minutes
|
||
2. The Linux Engineer agent investigates, reading container logs
|
||
3. The agent identifies a memory leak as the root cause
|
||
4. A fix is proposed: restart the container with increased memory limits
|
||
5. An approval request arrives on your phone via Telegram
|
||
6. You approve with a single tap, still in bed
|
||
7. The agent executes the fix, verifies recovery, and sends confirmation
|
||
8. You sleep through the night, and users never notice
|
||
|
||
This is the power of treating agents as intelligent employees rather than dumb automation scripts.
|
||
|
||
---
|
||
|
||
## Chapter 2 – Prerequisites
|
||
|
||
Before embarking on your journey to build an AI-powered CTO team, ensure you have the necessary components in place. This chapter presents all requirements in an easy-to-follow checklist format.
|
||
|
||
### Required Components
|
||
|
||
#### Core Infrastructure
|
||
- ✅ **n8n instance** (self-hosted or cloud)
|
||
- Version 1.0+ recommended
|
||
- Accessible via web browser
|
||
- Sufficient resources (minimum 2GB RAM, 2 CPU cores)
|
||
|
||
- ✅ **Docker** installed on homelab server(s)
|
||
- Docker Engine 20.10+ or Docker Desktop
|
||
- Docker Compose (optional but recommended)
|
||
- User permissions configured for Docker commands
|
||
|
||
- ✅ **LLM API Access**
|
||
- OpenAI API key (GPT-4o-mini or GPT-4 recommended)
|
||
- Alternative: Anthropic Claude, Google Gemini, or local LLMs via Ollama
|
||
- Sufficient API credits or quota
|
||
- API key stored securely (environment variables recommended)
|
||
|
||
- ✅ **Telegram Account + Bot**
|
||
- Personal Telegram account
|
||
- Bot created via @BotFather
|
||
- Bot token obtained
|
||
- Your Chat ID identified (we'll cover how to get this in Chapter 5)
|
||
|
||
### Service Access Requirements
|
||
|
||
The beauty of this system is that it works with your **existing infrastructure**. You don't need to migrate services or change your setup. The agents simply integrate with what you already have running.
|
||
|
||
- ✅ **Uptime Kuma** - Monitoring Integration
|
||
- Access to Uptime Kuma API
|
||
- API endpoint URL
|
||
- API key or authentication credentials
|
||
|
||
- ✅ **Proxmox** - Virtualization Management
|
||
- Proxmox VE 7.0+ installed
|
||
- SSH access to Proxmox host
|
||
- API credentials (optional, for API-based management)
|
||
- Network access to Proxmox web interface
|
||
|
||
- ✅ **UniFi Controller** - Network Management
|
||
- UniFi Controller running (Cloud Key, UDM, or self-hosted)
|
||
- Admin credentials or API access
|
||
- Controller URL/IP address
|
||
- UniFi site ID (if managing multiple sites)
|
||
|
||
- ✅ **NAS Systems** (e.g., ZimaCube, TrueNAS, Synology)
|
||
- SSH access enabled
|
||
- Monitoring tools available (smartctl for disk health)
|
||
- Network access to NAS interface
|
||
- Read access to system logs
|
||
|
||
- ✅ **Plex Media Server**
|
||
- Plex instance running
|
||
- HTTP access for health checks
|
||
- Docker container name (if containerized)
|
||
- API token (for advanced monitoring)
|
||
|
||
### Optional but Highly Recommended
|
||
|
||
- ✅ **Secure Remote Access**
|
||
- VPN (WireGuard, OpenVPN, or similar)
|
||
- Twingate (zero-trust network access)
|
||
- Cloudflare Tunnel (for web services)
|
||
- Tailscale (mesh VPN)
|
||
|
||
- ✅ **SSH Key Authentication**
|
||
- SSH keys generated (ed25519 or RSA 4096-bit)
|
||
- Public keys installed on target servers
|
||
- SSH config file configured for easy access
|
||
- Passphrase-protected private keys
|
||
|
||
- ✅ **Knowledge & Skills**
|
||
- Basic understanding of Docker and containerization
|
||
- Familiarity with n8n workflow editor
|
||
- Comfort with command-line interfaces
|
||
- Understanding of your network topology
|
||
- JSON syntax knowledge (for configuration and output parsing)
|
||
|
||
### Pre-Installation Checklist
|
||
|
||
Before proceeding to Chapter 3, verify you have completed:
|
||
|
||
```
|
||
[ ] n8n is installed and accessible
|
||
[ ] Docker is running on at least one server
|
||
[ ] LLM API key is obtained and tested
|
||
[ ] Telegram bot is created and token is saved
|
||
[ ] Network access to all services is confirmed
|
||
[ ] SSH keys are configured (if using SSH-based integrations)
|
||
[ ] You have admin/root access to your infrastructure
|
||
[ ] Backup systems are in place (never automate without backups!)
|
||
```
|
||
|
||
**Important Security Note**: This system will have significant access to your infrastructure. Treat API keys, bot tokens, and SSH credentials with the same care you would give to root passwords. Use environment variables, secrets managers, or n8n's built-in credential system. Never commit credentials to version control.
|
||
|
||
---
|
||
|
||
## Chapter 3 – Initial Setup
|
||
|
||
This chapter guides you through creating your first AI agent workflow, starting with a simple demo website for testing. You'll learn the fundamental building blocks: triggers, AI agent nodes, memory systems, and tool integration.
|
||
|
||
### Section 3.1: Demo Website Setup
|
||
|
||
Before deploying agents to monitor production services, create a safe testing environment. This demo website allows you to experiment with monitoring, diagnostics, and automated fixes without risking real infrastructure.
|
||
|
||
#### Step 1: Deploy the Demo Container
|
||
|
||
Open a terminal on your Docker host and execute:
|
||
|
||
```bash
|
||
# Create and run a simple nginx web server
|
||
docker run -d --name website -p 8090:80 nginx
|
||
|
||
# Customize the homepage
|
||
docker exec website sh -c 'echo "<h1>Homelab CTO Agent</h1><p>If you can see this, your agent is monitoring successfully!</p>" > /usr/share/nginx/html/index.html'
|
||
```
|
||
|
||
**What This Does**:
|
||
- `docker run -d`: Runs the container in detached (background) mode
|
||
- `--name website`: Assigns a friendly name for easy reference
|
||
- `-p 8090:80`: Maps port 8090 on your host to port 80 in the container
|
||
- `nginx`: Uses the official nginx image from Docker Hub
|
||
- The `docker exec` command modifies the default webpage to confirm successful monitoring
|
||
|
||
#### Step 2: Verify the Deployment
|
||
|
||
Test that the website is accessible:
|
||
|
||
```bash
|
||
# Test from the Docker host itself
|
||
curl http://localhost:8090
|
||
|
||
# Test from another machine on your network (replace with your server's IP)
|
||
curl http://YOUR_SERVER_IP:8090
|
||
```
|
||
|
||
**Expected Output**:
|
||
```html
|
||
<h1>Homelab CTO Agent</h1><p>If you can see this, your agent is monitoring successfully!</p>
|
||
```
|
||
|
||
If you see this output, your test environment is ready. If not, troubleshoot:
|
||
- Is Docker running? (`docker ps` should list the container)
|
||
- Is port 8090 available? (`netstat -tulpn | grep 8090`)
|
||
- Is your firewall blocking the port? (`sudo ufw status` or check iptables)
|
||
|
||
#### Step 3: Simulate Failure Scenarios
|
||
|
||
Your agent needs to handle failures, so let's create some:
|
||
|
||
```bash
|
||
# Scenario 1: Stop the container (service down)
|
||
docker stop website
|
||
|
||
# Verify it's down
|
||
curl http://localhost:8090
|
||
# Should fail with "Connection refused"
|
||
|
||
# Scenario 2: Restart it
|
||
docker start website
|
||
|
||
# Scenario 3: Remove and recreate (simulates crash)
|
||
docker rm -f website
|
||
docker run -d --name website -p 8090:80 nginx
|
||
```
|
||
|
||
These scenarios will be useful when testing your agent's diagnostic and recovery capabilities.
|
||
|
||
---
|
||
|
||
### Section 3.2: n8n Workflow Fundamentals
|
||
|
||
Now that you have a test service, let's build your first monitoring agent in n8n.
|
||
|
||
#### Understanding the n8n AI Agent Architecture
|
||
|
||
n8n's AI Agent node is the brain of your system. Here's how it works:
|
||
|
||
```
|
||
┌─────────────────────────────────────────────────────┐
|
||
│ AI Agent Node │
|
||
│ ┌──────────────────────────────────────────────┐ │
|
||
│ │ System Prompt (Role & Instructions) │ │
|
||
│ └──────────────────────────────────────────────┘ │
|
||
│ ┌──────────────────────────────────────────────┐ │
|
||
│ │ LLM (GPT-4, Claude, etc.) │ │
|
||
│ └──────────────────────────────────────────────┘ │
|
||
│ ┌──────────────────────────────────────────────┐ │
|
||
│ │ Memory (Session context) │ │
|
||
│ └──────────────────────────────────────────────┘ │
|
||
│ ┌──────────────────────────────────────────────┐ │
|
||
│ │ Tools (HTTP, Docker, SSH, etc.) │ │
|
||
│ └──────────────────────────────────────────────┘ │
|
||
└─────────────────────────────────────────────────────┘
|
||
```
|
||
|
||
**Components**:
|
||
1. **System Prompt**: Defines the agent's personality, role, and capabilities
|
||
2. **LLM**: The reasoning engine that makes decisions
|
||
3. **Memory**: Maintains conversation context across multiple interactions
|
||
4. **Tools**: External capabilities the agent can invoke (HTTP requests, shell commands, etc.)
|
||
|
||
#### Step-by-Step Workflow Creation
|
||
|
||
##### Step 1: Create a New Workflow
|
||
|
||
1. Open your n8n instance in a web browser
|
||
2. Click **"+ Add Workflow"** or navigate to **Workflows → New**
|
||
3. Give it a meaningful name: `"Homelab Monitor - Stage 1"`
|
||
|
||
##### Step 2: Add a Manual Trigger
|
||
|
||
1. In the workflow editor, click the **"+"** button
|
||
2. Search for **"Manual Trigger"** or **"When clicking 'Execute Workflow'"**
|
||
3. Add this node to your canvas
|
||
4. This trigger allows you to test the workflow manually before automating it
|
||
|
||
##### Step 3: Add the AI Agent Node
|
||
|
||
1. Click the **"+"** button after your trigger
|
||
2. Search for **"AI Agent"**
|
||
3. Add the **"AI Agent"** node to your workflow
|
||
4. Connect it to the Manual Trigger
|
||
|
||
##### Step 4: Configure the AI Agent
|
||
|
||
Click on the AI Agent node to open its configuration panel:
|
||
|
||
**Model Configuration**:
|
||
- **Chat Model**: Select or create an OpenAI account credential
|
||
- Click **"Create New Credential"**
|
||
- Enter your OpenAI API key
|
||
- Save the credential
|
||
- **Model**: Choose **GPT-4o-mini** (cost-effective) or **GPT-4** (more capable)
|
||
|
||
**Memory Configuration**:
|
||
- Click **"Add Memory"**
|
||
- Select **"Window Buffer Memory"** or **"Simple Session Memory"**
|
||
- This allows the agent to remember context within a single workflow execution
|
||
|
||
**System Prompt**:
|
||
This is where you define your agent's role. For your first monitoring agent, use:
|
||
|
||
```
|
||
You are a professional homelab monitoring agent. Your role is to check the health of services and report their status clearly.
|
||
|
||
When asked to check a service:
|
||
1. Use the HTTP request tool to check if the service is responding
|
||
2. Report the HTTP status code
|
||
3. Indicate if the service is UP (200-299 status) or DOWN (any other status or connection failure)
|
||
4. Be concise but informative
|
||
|
||
Always maintain a professional tone and provide actionable information.
|
||
```
|
||
|
||
##### Step 5: Add HTTP Request Tool
|
||
|
||
The agent needs tools to interact with the outside world. Let's give it the ability to make HTTP requests:
|
||
|
||
1. In the AI Agent configuration, scroll to **"Tools"**
|
||
2. Click **"Add Tool"**
|
||
3. Select **"HTTP Request"**
|
||
4. A new HTTP Request node will appear, connected to your AI Agent
|
||
|
||
**Configure the HTTP Request Node**:
|
||
- **Authentication**: None (for this simple test)
|
||
- **Request Method**: GET
|
||
- **URL**: Leave this empty - the agent will provide the URL dynamically
|
||
- Under **Options**:
|
||
- Enable **"Ignore SSL Issues"** if testing with self-signed certificates
|
||
- Set **"Timeout"** to 10000ms (10 seconds)
|
||
|
||
##### Step 6: Test Your First Agent
|
||
|
||
1. Click **"Execute Workflow"** in the top right
|
||
2. When the Manual Trigger activates, the Agent awaits input
|
||
3. In the chat interface that appears, type:
|
||
```
|
||
Please check if http://YOUR_SERVER_IP:8090 is online
|
||
```
|
||
4. Watch as the agent:
|
||
- Receives your request
|
||
- Decides to use the HTTP Request tool
|
||
- Makes the request
|
||
- Analyzes the response
|
||
- Reports back with status
|
||
|
||
**Expected Response**:
|
||
```
|
||
The service at http://YOUR_SERVER_IP:8090 is UP.
|
||
HTTP Status: 200 OK
|
||
The website responded successfully.
|
||
```
|
||
|
||
Congratulations! You've just created your first AI agent. It can now monitor services by making HTTP requests and intelligently reporting results.
|
||
|
||
---
|
||
|
||
### Section 3.3: Understanding the Components
|
||
|
||
Let's examine what just happened:
|
||
|
||
#### The AI Agent Node
|
||
|
||
The AI Agent node is not a simple script executor. It's an intelligent reasoning system that:
|
||
- **Interprets** your natural language request
|
||
- **Plans** the steps needed to fulfill the request
|
||
- **Selects** appropriate tools from those available
|
||
- **Executes** the tools with correct parameters
|
||
- **Analyzes** the results
|
||
- **Responds** with human-readable output
|
||
|
||
You didn't write code to parse URLs, make HTTP requests, or format responses. The LLM handled all of that based on your system prompt and available tools.
|
||
|
||
#### The Memory Component
|
||
|
||
Memory allows the agent to maintain context:
|
||
|
||
**Without Memory**:
|
||
```
|
||
You: "Check example.com"
|
||
Agent: "It's up"
|
||
You: "What was the response time?"
|
||
Agent: "I don't know what you're referring to" ❌
|
||
```
|
||
|
||
**With Memory**:
|
||
```
|
||
You: "Check example.com"
|
||
Agent: "It's up, responded in 245ms"
|
||
You: "What was the response time?"
|
||
Agent: "The response time was 245ms" ✅
|
||
```
|
||
|
||
For homelab monitoring, memory enables multi-turn conversations:
|
||
```
|
||
You: "Check all my services"
|
||
Agent: "Plex is up, Proxmox is up, but website is down"
|
||
You: "Investigate the website issue"
|
||
Agent: "Checking Docker container status..."
|
||
You: "Can you restart it?"
|
||
Agent: "I need approval to restart containers. Should I proceed?"
|
||
```
|
||
|
||
#### The HTTP Request Tool
|
||
|
||
Tools are the agent's hands in the physical (or digital) world. The HTTP Request tool enables:
|
||
- Service health checks
|
||
- API interactions
|
||
- Webhook triggers
|
||
- Data retrieval from web services
|
||
|
||
In later chapters, you'll add more tools:
|
||
- **Execute Command** for Docker and SSH operations
|
||
- **Code** for custom logic
|
||
- **Subworkflow** for complex multi-step procedures
|
||
- **Telegram** for notifications and approvals
|
||
|
||
---
|
||
|
||
### Section 3.4: Next Steps
|
||
|
||
You now have a functional AI monitoring agent. In Chapter 4, we'll evolve this basic monitor into an intelligent diagnostic and self-healing system through five progressive stages.
|
||
|
||
**What You've Accomplished**:
|
||
- ✅ Created a safe testing environment with a demo website
|
||
- ✅ Built your first n8n workflow with an AI Agent
|
||
- ✅ Configured the agent with a system prompt, memory, and tools
|
||
- ✅ Successfully monitored a service using natural language commands
|
||
|
||
**Prepare for Chapter 4**:
|
||
- Keep your demo website running
|
||
- Familiarize yourself with the n8n interface
|
||
- Consider what additional capabilities you'd like your agent to have
|
||
- Think about which production services you'll eventually want to monitor
|
||
|
||
---
|
||
|
||
## Chapter 4 – Agent Evolution (Stages)
|
||
|
||
The journey from a simple monitoring script to an intelligent, self-healing infrastructure guardian follows a carefully designed progression. Each stage builds upon the previous one, adding new capabilities while maintaining safety and control. This chapter details five evolutionary stages, complete with system prompts, required tools, and real-world examples.
|
||
|
||
### Stage 1: Basic Monitor
|
||
|
||
**Capability Overview**: Check uptime via HTTP health checks
|
||
|
||
At this foundational stage, your agent performs basic service availability monitoring. It can check if web services respond and report their HTTP status codes. This is similar to traditional uptime monitoring tools but with natural language interaction.
|
||
|
||
**What It Can Do**:
|
||
- Monitor website and API availability via HTTP/HTTPS
|
||
- Return HTTP status codes and basic response information
|
||
- Report simple UP/DOWN status
|
||
- Detect connection timeouts and network issues
|
||
- Check multiple services sequentially
|
||
|
||
**What It Cannot Do**:
|
||
- Diagnose WHY a service is down
|
||
- Access system logs or container information
|
||
- Make any changes to infrastructure
|
||
- Understand complex failure patterns
|
||
|
||
**Required Tools**:
|
||
- HTTP Request node (configured as a tool)
|
||
|
||
**System Prompt Example**:
|
||
```
|
||
You are a Level 1 Homelab Monitoring Agent. Your role is to check service availability and report status.
|
||
|
||
CAPABILITIES:
|
||
- Check HTTP/HTTPS endpoints
|
||
- Report status codes and response times
|
||
- Identify if services are UP (200-299) or DOWN (any other status/error)
|
||
|
||
LIMITATIONS:
|
||
- You cannot diagnose issues beyond HTTP responses
|
||
- You cannot make any changes to systems
|
||
- You cannot access logs or system internals
|
||
|
||
OUTPUT FORMAT:
|
||
For each service checked, report:
|
||
- Service name/URL
|
||
- Status (UP/DOWN)
|
||
- HTTP status code (if available)
|
||
- Response time
|
||
- Any error messages
|
||
|
||
Always be clear and concise. If a service is down, state that further investigation is needed.
|
||
```
|
||
|
||
**Example Interaction**:
|
||
```
|
||
User: "Check if my Plex server at http://192.168.1.100:32400 is online"
|
||
|
||
Agent: "Checking Plex server..."
|
||
[Executes HTTP Request tool]
|
||
|
||
Agent: "✅ Status Report:
|
||
Service: Plex Media Server
|
||
URL: http://192.168.1.100:32400
|
||
Status: UP
|
||
HTTP Code: 200 OK
|
||
Response Time: 145ms
|
||
|
||
Your Plex server is online and responding normally."
|
||
```
|
||
|
||
**When to Use Stage 1**:
|
||
- Initial setup and testing
|
||
- Low-risk monitoring scenarios
|
||
- Services you want visibility into but not automated intervention
|
||
- Learning how the agent system works
|
||
|
||
---
|
||
|
||
### Stage 2: Smart Investigator
|
||
|
||
**Capability Overview**: Use Docker commands and log analysis to diagnose issues
|
||
|
||
Stage 2 agents can go beyond surface-level checks. When a service is down, they investigate the underlying cause by examining Docker container status and reading logs. This is where your agent begins to act like a junior system administrator.
|
||
|
||
**What It Can Do**:
|
||
- List all Docker containers and their states
|
||
- Check if specific containers are running, stopped, or crashed
|
||
- Read container logs (last 50-100 lines)
|
||
- Identify common error patterns in logs
|
||
- Correlate service failures with container issues
|
||
- Provide diagnostic insights about root causes
|
||
|
||
**What It Cannot Do**:
|
||
- Restart containers or make changes
|
||
- Access host system logs (only container logs)
|
||
- Modify configurations
|
||
- Interact with services via SSH
|
||
|
||
**Required Tools**:
|
||
- HTTP Request node (for service checks)
|
||
- Execute Command node (for Docker CLI access)
|
||
|
||
**Configure Execute Command Node**:
|
||
```
|
||
Node Settings:
|
||
- Command: {{ $json.command }} (receives command from agent)
|
||
- Execute Once: Yes
|
||
```
|
||
|
||
**System Prompt Example**:
|
||
```
|
||
You are a Level 2 Homelab Diagnostic Agent with investigation capabilities.
|
||
|
||
CAPABILITIES:
|
||
- Check HTTP/HTTPS service availability
|
||
- List Docker containers (docker ps -a)
|
||
- Check container status (docker inspect <container>)
|
||
- Read container logs (docker logs --tail 100 <container>)
|
||
- Analyze log patterns for common errors
|
||
|
||
DIAGNOSTIC PROCESS:
|
||
1. When a service is DOWN, first check if it's containerized
|
||
2. Use 'docker ps -a' to find the container
|
||
3. Check container state (running, exited, restarting)
|
||
4. If stopped/crashed, read logs with 'docker logs --tail 100 <name>'
|
||
5. Analyze logs for:
|
||
- Error messages
|
||
- Out of memory conditions
|
||
- Port conflicts
|
||
- Permission issues
|
||
- Application crashes
|
||
6. Provide a diagnostic summary
|
||
|
||
OUTPUT FORMAT:
|
||
Service: [name]
|
||
Status: [UP/DOWN]
|
||
Container Status: [running/stopped/crashed]
|
||
Root Cause: [your diagnosis based on logs]
|
||
Relevant Logs: [key error messages]
|
||
Recommended Action: [what should be done]
|
||
|
||
LIMITATIONS:
|
||
- You can diagnose but NOT fix issues
|
||
- You cannot restart containers
|
||
- You cannot modify any configurations
|
||
- Always recommend actions but never execute them
|
||
|
||
Be thorough but concise. Focus on actionable insights.
|
||
```
|
||
|
||
**Docker Commands Available to Agent**:
|
||
```bash
|
||
# List all containers
|
||
docker ps -a
|
||
|
||
# Check specific container
|
||
docker ps -a --filter name=website
|
||
|
||
# Get container logs
|
||
docker logs --tail 100 website
|
||
|
||
# Inspect container details
|
||
docker inspect website
|
||
|
||
# Check container resource usage
|
||
docker stats --no-stream website
|
||
```
|
||
|
||
**Example Interaction**:
|
||
```
|
||
User: "My website at http://192.168.1.100:8090 is down. What's wrong?"
|
||
|
||
Agent: "Investigating the issue..."
|
||
|
||
[Checks HTTP - fails]
|
||
[Executes: docker ps -a --filter name=website]
|
||
[Finds container is stopped]
|
||
[Executes: docker logs --tail 100 website]
|
||
|
||
Agent: "🔍 Diagnostic Report:
|
||
|
||
Service: Website (http://192.168.1.100:8090)
|
||
Status: DOWN
|
||
Container Status: Exited (stopped)
|
||
Exit Code: 137 (Out of Memory)
|
||
|
||
ROOT CAUSE:
|
||
The container was killed by the system due to memory exhaustion.
|
||
|
||
RELEVANT LOGS:
|
||
2024-01-15 14:23:45 nginx: worker process out of memory
|
||
2024-01-15 14:23:46 Killed
|
||
|
||
RECOMMENDED ACTION:
|
||
Restart the container with increased memory limits:
|
||
docker run -d --name website --memory=512m -p 8090:80 nginx
|
||
|
||
This will prevent future OOM kills."
|
||
```
|
||
|
||
**When to Use Stage 2**:
|
||
- Production monitoring where you want diagnostics but not automated fixes
|
||
- Learning about common failure patterns in your infrastructure
|
||
- Environments where you prefer manual intervention after diagnosis
|
||
- Building confidence in the agent's analytical capabilities
|
||
|
||
---
|
||
|
||
### Stage 3: The Fixer
|
||
|
||
**Capability Overview**: Automatically restart containers when they fail
|
||
|
||
Stage 3 introduces the agent's first automated remediation capability. It can now detect failures and execute fixes autonomously. This is powerful but requires careful consideration of safety boundaries.
|
||
|
||
**What It Can Do**:
|
||
- Everything from Stages 1 and 2 (monitor and diagnose)
|
||
- Restart stopped or crashed Docker containers
|
||
- Verify that restarts were successful
|
||
- Report actions taken with before/after status
|
||
- Handle simple recovery scenarios automatically
|
||
|
||
**What It Cannot Do**:
|
||
- Modify container configurations
|
||
- Create or destroy containers
|
||
- Change port mappings or volume mounts
|
||
- Access resources outside Docker
|
||
|
||
**Required Tools**:
|
||
- HTTP Request node
|
||
- Execute Command node (with Docker restart permissions)
|
||
|
||
**System Prompt Example**:
|
||
```
|
||
You are a Level 3 Homelab Automation Agent with limited remediation capabilities.
|
||
|
||
CAPABILITIES:
|
||
- All diagnostic capabilities from Level 2
|
||
- Container restart capability
|
||
- Post-restart verification
|
||
|
||
DECISION FRAMEWORK:
|
||
When you detect a DOWN service:
|
||
1. Investigate using diagnostic tools
|
||
2. If root cause is container stopped/crashed AND
|
||
- Container has no dependencies OR
|
||
- It's a simple restart scenario
|
||
3. Attempt automatic restart
|
||
4. Verify service recovery
|
||
5. Report complete incident details
|
||
|
||
SAFE-TO-RESTART CRITERIA:
|
||
✅ Container exited cleanly (exit code 0)
|
||
✅ Container crashed (exit codes 1, 137, 139)
|
||
✅ Service is stateless or has persistent data volumes
|
||
✅ No configuration changes needed
|
||
|
||
DO NOT RESTART IF:
|
||
❌ Container is in restart loop (>3 restarts in 5 minutes)
|
||
❌ Logs show configuration errors
|
||
❌ Database or data corruption indicated
|
||
❌ Resource exhaustion that restart won't fix
|
||
|
||
RESTART COMMAND:
|
||
docker restart <container_name>
|
||
|
||
POST-RESTART VERIFICATION:
|
||
1. Wait 10 seconds
|
||
2. Check container status: docker ps --filter name=<container>
|
||
3. Verify HTTP endpoint if applicable
|
||
4. Confirm service is UP
|
||
|
||
OUTPUT FORMAT:
|
||
🔧 Incident Report:
|
||
Service: [name]
|
||
Issue Detected: [timestamp]
|
||
Root Cause: [diagnosis]
|
||
Action Taken: Restarted container '<name>'
|
||
Verification: [service status after restart]
|
||
Resolution Time: [duration]
|
||
|
||
Always log every action taken for audit purposes.
|
||
```
|
||
|
||
**Example Interaction**:
|
||
```
|
||
[Automatic execution - scheduled every 5 minutes]
|
||
|
||
Agent: "🔧 Automated Fix Executed
|
||
|
||
Incident Report:
|
||
Service: Website (http://192.168.1.100:8090)
|
||
Issue Detected: 2024-01-15 14:25:00
|
||
Root Cause: Container stopped (exit code 0 - clean shutdown)
|
||
|
||
Action Taken:
|
||
- Executed: docker restart website
|
||
- Container restarted successfully
|
||
|
||
Verification:
|
||
- Container Status: Running (uptime: 15 seconds)
|
||
- HTTP Check: 200 OK
|
||
- Response Time: 142ms
|
||
|
||
Resolution Time: 18 seconds
|
||
Service Status: ✅ RESTORED"
|
||
```
|
||
|
||
**Safety Considerations for Stage 3**:
|
||
|
||
⚠️ **Critical Warning**: Stage 3 agents make automated changes to your infrastructure. Before deploying:
|
||
|
||
1. **Test Thoroughly**: Run for weeks in a test environment
|
||
2. **Start with Non-Critical Services**: Don't enable on production databases initially
|
||
3. **Monitor the Monitor**: Set up alerting for agent actions
|
||
4. **Have Rollback Plans**: Know how to disable the agent quickly
|
||
5. **Implement Rate Limiting**: Prevent restart loops
|
||
|
||
**Recommended Configuration**:
|
||
```json
|
||
{
|
||
"max_restarts_per_hour": 3,
|
||
"cooldown_between_restarts": 300,
|
||
"blacklisted_containers": ["database", "postgres", "mysql"],
|
||
"notification_channel": "telegram",
|
||
"log_all_actions": true
|
||
}
|
||
```
|
||
|
||
**When to Use Stage 3**:
|
||
- Mature monitoring setups where failure patterns are well understood
|
||
- Stateless services that can safely restart
|
||
- Non-production environments initially
|
||
- Services with good health checks and fast startup times
|
||
- When you're comfortable with autonomous remediation
|
||
|
||
---
|
||
|
||
### Stage 4: Creative Problem Solver
|
||
|
||
**Capability Overview**: Resolve complex issues like port conflicts, resource exhaustion, and configuration problems
|
||
|
||
Stage 4 agents move beyond simple restarts. They can identify and resolve complex infrastructure issues that would typically require experienced system administrator intervention.
|
||
|
||
**What It Can Do**:
|
||
- Identify port conflicts and suggest remapping
|
||
- Detect memory/disk space issues and recommend solutions
|
||
- Analyze resource consumption patterns
|
||
- Suggest container optimization (resource limits, restart policies)
|
||
- Handle multi-container dependency issues
|
||
- Propose configuration changes to prevent recurring issues
|
||
|
||
**What It Cannot Do**:
|
||
- Directly modify system configurations (proposes changes only)
|
||
- Make changes without approval (transitions to Stage 5 requirement)
|
||
- Access resources outside the Docker/local system scope
|
||
|
||
**Required Tools**:
|
||
- HTTP Request node
|
||
- Execute Command node (Docker + system commands)
|
||
- Code node (for complex logic and calculations)
|
||
|
||
**System Prompt Example**:
|
||
```
|
||
You are a Level 4 Homelab Senior Engineer Agent with advanced problem-solving capabilities.
|
||
|
||
CAPABILITIES:
|
||
- All previous level capabilities
|
||
- Resource conflict resolution
|
||
- Performance optimization
|
||
- Configuration troubleshooting
|
||
- Dependency analysis
|
||
|
||
ADVANCED DIAGNOSTIC COMMANDS:
|
||
- docker ps -a (container list)
|
||
- docker logs --tail 200 <container>
|
||
- docker inspect <container>
|
||
- docker stats --no-stream
|
||
- netstat -tulpn (port usage)
|
||
- df -h (disk space)
|
||
- free -m (memory usage)
|
||
- systemctl status docker
|
||
|
||
PROBLEM-SOLVING SCENARIOS:
|
||
|
||
1. PORT CONFLICTS:
|
||
Detection: "bind: address already in use"
|
||
Solution: Identify conflicting process, suggest alternative port
|
||
Example: "Port 8080 used by process X. Recommend using 8081"
|
||
|
||
2. MEMORY EXHAUSTION:
|
||
Detection: Exit code 137, OOM in logs
|
||
Solution: Calculate appropriate memory limits
|
||
Example: "Container using 450MB, killed at 512MB. Recommend --memory=1g"
|
||
|
||
3. DISK SPACE ISSUES:
|
||
Detection: "no space left on device"
|
||
Solution: Identify large log files, old containers, unused volumes
|
||
Example: "Docker using 45GB. Recommend: docker system prune -a"
|
||
|
||
4. DEPENDENCY FAILURES:
|
||
Detection: Service depends on another that's down
|
||
Solution: Identify dependency chain, restart in correct order
|
||
Example: "App requires DB. DB is down. Restarting DB first."
|
||
|
||
5. CONFIGURATION ERRORS:
|
||
Detection: Specific error patterns in logs
|
||
Solution: Identify misconfiguration, suggest fix
|
||
Example: "Missing environment variable DATABASE_URL. Should be: DATABASE_URL=postgresql://..."
|
||
|
||
OUTPUT FORMAT:
|
||
🧠 Advanced Diagnostic Report:
|
||
|
||
Service: [name]
|
||
Issue Category: [port conflict/memory/disk/config/dependency]
|
||
Root Cause: [detailed technical explanation]
|
||
|
||
Current State:
|
||
[relevant system information]
|
||
|
||
Proposed Solution:
|
||
[specific command or configuration change]
|
||
|
||
Expected Outcome:
|
||
[what will happen after fix is applied]
|
||
|
||
Risk Assessment: [LOW/MEDIUM/HIGH]
|
||
Rollback Plan: [how to undo if needed]
|
||
|
||
Recommendation: [AUTO-FIX / NEEDS-APPROVAL / MANUAL-INTERVENTION]
|
||
```
|
||
|
||
**Example Interaction - Port Conflict**:
|
||
```
|
||
Agent: "🧠 Advanced Diagnostic Report:
|
||
|
||
Service: New Website Container
|
||
Issue Category: Port Conflict
|
||
Root Cause: Port 8090 is already bound to existing 'website' container
|
||
|
||
Current State:
|
||
- Port 8090: Used by container 'website' (nginx)
|
||
- Attempted: docker run -p 8090:80 new-website
|
||
- Result: Error 'port is already allocated'
|
||
|
||
Proposed Solution:
|
||
Option 1: Use different port for new container
|
||
docker run -d --name new-website -p 8091:80 nginx
|
||
|
||
Option 2: Stop existing container first
|
||
docker stop website && docker run -d --name new-website -p 8090:80 nginx
|
||
|
||
Current Resource Availability:
|
||
Available ports: 8091-8099, 9000-9100
|
||
|
||
Expected Outcome (Option 1):
|
||
- Both containers run simultaneously
|
||
- Old site: http://192.168.1.100:8090
|
||
- New site: http://192.168.1.100:8091
|
||
|
||
Risk Assessment: LOW (no existing services disrupted)
|
||
Rollback Plan: docker stop new-website && docker rm new-website
|
||
|
||
Recommendation: AUTO-FIX with Option 1 (safe, non-disruptive)"
|
||
```
|
||
|
||
**Example Interaction - Memory Optimization**:
|
||
```
|
||
Agent: "🧠 Advanced Diagnostic Report:
|
||
|
||
Service: Plex Media Server
|
||
Issue Category: Memory Exhaustion (Recurring)
|
||
Root Cause: Container repeatedly killed by OOM (Out of Memory)
|
||
|
||
Current State:
|
||
- Container restarts: 7 times in past 24 hours
|
||
- Exit code: 137 (OOM Kill)
|
||
- Current memory limit: 512MB
|
||
- Peak usage before kill: 498MB
|
||
|
||
Resource Analysis:
|
||
[Executed: docker stats --no-stream plex]
|
||
- Average usage during transcoding: 850MB
|
||
- Baseline (idle): 280MB
|
||
- Spike during 4K transcode: 1.2GB
|
||
|
||
Proposed Solution:
|
||
Recreate container with adequate memory allocation:
|
||
|
||
docker stop plex
|
||
docker rm plex
|
||
docker run -d \
|
||
--name plex \
|
||
--memory=2g \
|
||
--memory-reservation=1g \
|
||
-p 32400:32400 \
|
||
-v /media/plex:/data \
|
||
plexinc/pms-docker
|
||
|
||
Configuration Explained:
|
||
- --memory=2g: Hard limit (prevents OOM)
|
||
- --memory-reservation=1g: Soft limit (performance optimization)
|
||
- Allows transcoding without crashes
|
||
|
||
Expected Outcome:
|
||
- No more OOM kills
|
||
- Stable transcoding performance
|
||
- Container uptime >30 days
|
||
|
||
Risk Assessment: MEDIUM (requires container recreation)
|
||
Rollback Plan:
|
||
docker stop plex
|
||
[restore from backup configuration]
|
||
|
||
Recommendation: NEEDS-APPROVAL
|
||
This requires downtime and container recreation. Approve to proceed?"
|
||
```
|
||
|
||
**When to Use Stage 4**:
|
||
- Complex homelab setups with interdependent services
|
||
- Environments experiencing recurring issues
|
||
- When you want intelligent optimization suggestions
|
||
- Production systems where you value expert-level diagnostics
|
||
- Before fully trusting automated fixes (Stage 5)
|
||
|
||
---
|
||
|
||
### Stage 5: Human-in-the-Loop
|
||
|
||
**Capability Overview**: All previous capabilities but ALWAYS requests approval before executing changes
|
||
|
||
Stage 5 represents the **recommended production configuration** for most users. The agent has full diagnostic and remediation capabilities but respects human authority by requesting approval before making any changes. This balances automation benefits with human oversight.
|
||
|
||
**Philosophy**: "The agent is the expert consultant; you are the executive decision-maker."
|
||
|
||
**What It Can Do**:
|
||
- Everything from Stages 1-4 (monitor, diagnose, solve complex problems)
|
||
- Present complete fix proposals with rationale
|
||
- Request approval via Telegram before executing
|
||
- Execute only upon explicit confirmation
|
||
- Report results after execution
|
||
- Learn from approval/rejection patterns (via system prompt refinement)
|
||
|
||
**What It Cannot Do**:
|
||
- Make changes without explicit approval
|
||
- Bypass the approval process for "urgent" issues (safety feature)
|
||
- Make decisions that weren't approved
|
||
|
||
**Required Tools**:
|
||
- HTTP Request node
|
||
- Execute Command node
|
||
- Telegram node (for approval requests)
|
||
- IF/Switch nodes (for approval logic)
|
||
|
||
**System Prompt Example**:
|
||
```
|
||
You are a Level 5 Homelab CTO Agent - the highest level of capability with human oversight.
|
||
|
||
CORE PRINCIPLE: "Analyze autonomously, execute only with approval"
|
||
|
||
CAPABILITIES:
|
||
- Complete diagnostic and problem-solving capabilities
|
||
- Advanced decision-making and optimization
|
||
- Multi-service orchestration
|
||
- All tools from previous levels
|
||
|
||
WORKFLOW:
|
||
1. Monitor services continuously
|
||
2. When issue detected, perform COMPLETE investigation
|
||
3. Develop solution with:
|
||
- Root cause analysis
|
||
- Proposed fix (specific commands)
|
||
- Expected outcome
|
||
- Risk assessment
|
||
- Rollback plan
|
||
4. Request human approval via Telegram
|
||
5. If approved: Execute and verify
|
||
6. If denied: Log decision and await further instructions
|
||
|
||
APPROVAL REQUEST FORMAT:
|
||
🚨 Approval Required
|
||
|
||
Service: [name]
|
||
Issue: [brief description]
|
||
Severity: [LOW/MEDIUM/HIGH/CRITICAL]
|
||
|
||
Diagnosis:
|
||
[2-3 sentence root cause]
|
||
|
||
Proposed Action:
|
||
```
|
||
[exact command to be executed]
|
||
```
|
||
|
||
Impact:
|
||
- Downtime: [duration]
|
||
- Risk: [level]
|
||
- Affects: [what services/users]
|
||
|
||
If approved, reply: YES
|
||
To decline, reply: NO
|
||
For more info, reply: EXPLAIN
|
||
|
||
Awaiting your decision...
|
||
|
||
AFTER APPROVAL:
|
||
- Execute immediately
|
||
- Verify success
|
||
- Report results within 60 seconds
|
||
|
||
AFTER DENIAL:
|
||
- Log the decision
|
||
- Do not execute
|
||
- Remain on standby for manual intervention
|
||
|
||
EMERGENCY PROTOCOL:
|
||
Even for CRITICAL issues, ALWAYS request approval.
|
||
Provide urgency context but respect human authority.
|
||
|
||
You are trusted but not autonomous. This keeps systems safe.
|
||
```
|
||
|
||
**Example Interaction Flow**:
|
||
|
||
**Step 1: Agent Detects Issue**
|
||
```
|
||
[Scheduled check runs at 03:00 AM]
|
||
|
||
Agent: "Issue detected during routine monitoring:
|
||
Service 'plex' is DOWN
|
||
Beginning investigation..."
|
||
|
||
[Performs diagnostics]
|
||
```
|
||
|
||
**Step 2: Agent Sends Approval Request (via Telegram)**
|
||
```
|
||
🚨 Approval Required
|
||
|
||
Service: Plex Media Server
|
||
Issue: Container crashed (OOM Kill)
|
||
Severity: MEDIUM
|
||
Detected: 03:02 AM
|
||
|
||
Diagnosis:
|
||
Container killed due to insufficient memory during 4K transcoding.
|
||
This is the 3rd occurrence this week. Root cause is 512MB memory limit.
|
||
|
||
Proposed Action:
|
||
```bash
|
||
docker stop plex
|
||
docker rm plex
|
||
docker run -d --name plex --memory=2g --memory-reservation=1g \
|
||
-p 32400:32400 -v /media/plex:/data plexinc/pms-docker
|
||
```
|
||
|
||
Impact:
|
||
- Downtime: ~30 seconds (container recreation)
|
||
- Risk: LOW (data on persistent volume, safe)
|
||
- Affects: Plex users (currently none active at 3 AM)
|
||
|
||
Rollback: Original configuration saved in /backups/plex-config-20240115.sh
|
||
|
||
Reply YES to proceed, NO to decline, EXPLAIN for more details.
|
||
```
|
||
|
||
**Step 3: User Approves (via Telegram)**
|
||
```
|
||
User: "YES"
|
||
```
|
||
|
||
**Step 4: Agent Executes and Reports**
|
||
```
|
||
Agent: "✅ Action Completed
|
||
|
||
Executed: Container recreation with 2GB memory limit
|
||
Started: 03:05 AM
|
||
Completed: 03:05 AM (28 seconds)
|
||
|
||
Verification:
|
||
- Container Status: Running
|
||
- HTTP Health: 200 OK (http://192.168.1.100:32400)
|
||
- Memory Available: 2GB limit, currently using 320MB
|
||
- All libraries accessible
|
||
|
||
Result: Service RESTORED
|
||
Expected Outcome: No more OOM kills during transcoding
|
||
|
||
Will continue monitoring. Thank you for the approval."
|
||
```
|
||
|
||
**Step 5: User Denies (Alternative Scenario)**
|
||
```
|
||
User: "NO"
|
||
|
||
Agent: "Acknowledged. Action DENIED.
|
||
|
||
Status: Container remains stopped
|
||
No changes made to system
|
||
Awaiting your manual intervention
|
||
|
||
Current State:
|
||
- Plex is DOWN
|
||
- Container can be manually started with: docker start plex
|
||
- Issue (OOM) will likely recur without memory increase
|
||
|
||
I'll continue monitoring. Let me know if you need assistance.
|
||
```
|
||
|
||
**Implementation in n8n**:
|
||
|
||
```
|
||
Workflow Structure:
|
||
|
||
[Schedule Trigger]
|
||
→ [AI Agent: Diagnosis]
|
||
→ [IF: Issue Detected?]
|
||
YES → [AI Agent: Prepare Fix Proposal]
|
||
→ [Telegram: Send Approval Request]
|
||
→ [Webhook or Telegram Listener: Wait for Response]
|
||
→ [IF: Response = "YES"]
|
||
YES → [AI Agent: Execute Fix]
|
||
→ [Telegram: Send Success Report]
|
||
NO → [Telegram: Acknowledge Denial]
|
||
NO → [End]
|
||
```
|
||
|
||
**Telegram Approval Configuration**:
|
||
```javascript
|
||
// In Telegram node - Send Message
|
||
{
|
||
"chat_id": "{{ $env.TELEGRAM_CHAT_ID }}",
|
||
"text": "{{ $json.approval_request }}",
|
||
"reply_markup": {
|
||
"keyboard": [
|
||
[{"text": "YES"}, {"text": "NO"}],
|
||
[{"text": "EXPLAIN"}]
|
||
],
|
||
"one_time_keyboard": true
|
||
}
|
||
}
|
||
```
|
||
|
||
**Benefits of Stage 5**:
|
||
- Full automation capabilities with human oversight
|
||
- Sleep peacefully knowing nothing happens without approval
|
||
- Build confidence over time by seeing agent decisions
|
||
- Educational: Learn what issues occur and how to solve them
|
||
- Audit trail: Every action is logged and approved
|
||
- Flexible: Approve from anywhere via Telegram
|
||
|
||
**When to Use Stage 5**:
|
||
- **Production environments** (highly recommended)
|
||
- Services with external users or customers
|
||
- Compliance environments requiring human authorization
|
||
- When building trust in the agent system
|
||
- High-value infrastructure (databases, storage, networking)
|
||
- As your default configuration
|
||
|
||
**This is the recommended stage for most users.** It provides all the benefits of intelligent automation while maintaining human control and preventing runaway automation scenarios.
|
||
|
||
---
|
||
|
||
### Progression Recommendation
|
||
|
||
**Week 1-2**: Stage 1 (Basic Monitor)
|
||
- Get comfortable with n8n and AI agents
|
||
- Understand your infrastructure's normal behavior
|
||
- Build confidence in the system
|
||
|
||
**Week 3-4**: Stage 2 (Smart Investigator)
|
||
- Add diagnostic capabilities
|
||
- Study the logs and patterns
|
||
- Learn what typically fails and why
|
||
|
||
**Week 5-6**: Stage 3 (The Fixer) - Test Environment Only
|
||
- Enable automatic fixes in isolated environment
|
||
- Monitor for unexpected behaviors
|
||
- Refine safety constraints
|
||
|
||
**Week 7-8**: Stage 4 (Creative Problem Solver)
|
||
- Add complex problem-solving
|
||
- Handle resource conflicts
|
||
- Optimize configurations
|
||
|
||
**Week 9+**: Stage 5 (Human-in-the-Loop) - Production
|
||
- Deploy to production with approval requirements
|
||
- Enjoy 24/7 monitoring with control
|
||
- Gradually expand scope as trust builds
|
||
|
||
**Never skip directly to Stage 3, 4, or 5 in production.** Understanding each stage ensures you can troubleshoot issues and maintain confidence in your system.
|
||
|
||
---
|
||
|
||
## Chapter 5 – Automation Setup
|
||
|
||
Transform your manual agent into a 24/7 autonomous monitoring system with scheduled execution, structured output, and intelligent notifications. This chapter covers scheduling, JSON standardization, and complete Telegram integration including BotFather setup.
|
||
|
||
### Section 5.1: Scheduled Monitoring
|
||
|
||
Moving from manual testing to automated monitoring requires replacing your Manual Trigger with a Schedule Trigger and implementing proper error handling.
|
||
|
||
#### Configuring the Schedule Trigger
|
||
|
||
**Step 1: Remove Manual Trigger**
|
||
1. In your workflow, select the Manual Trigger node
|
||
2. Delete it (or keep it for testing in a separate branch)
|
||
|
||
**Step 2: Add Schedule Trigger**
|
||
1. Click the **"+"** button to add a new node
|
||
2. Search for **"Schedule Trigger"**
|
||
3. Add it to your workflow
|
||
4. Connect it to your AI Agent chain
|
||
|
||
**Step 3: Configure Timing**
|
||
|
||
For production monitoring, recommended intervals:
|
||
|
||
```
|
||
Critical Services (databases, core infrastructure):
|
||
- Interval: Every 2 minutes
|
||
- Schedule: "*/2 * * * *" (cron format)
|
||
|
||
Important Services (web apps, media servers):
|
||
- Interval: Every 5 minutes
|
||
- Schedule: "*/5 * * * *"
|
||
|
||
Non-Critical Services (development, testing):
|
||
- Interval: Every 15-30 minutes
|
||
- Schedule: "*/15 * * * *" or "*/30 * * * *"
|
||
```
|
||
|
||
**Example Configuration**:
|
||
```
|
||
Trigger Node: Schedule Trigger
|
||
Trigger Interval: Every 5 minutes
|
||
Cron Expression: */5 * * * *
|
||
```
|
||
|
||
**Step 4: Test Scheduled Execution**
|
||
|
||
1. Save and activate the workflow
|
||
2. Wait for the next scheduled interval
|
||
3. Check execution history (View → Executions)
|
||
4. Verify the agent runs automatically
|
||
|
||
#### Preventing Overlap and Race Conditions
|
||
|
||
When running on a schedule, prevent multiple executions from conflicting:
|
||
|
||
**Workflow Settings**:
|
||
```
|
||
Settings → Workflow:
|
||
- Execution Order: "v1" (ensures sequential execution)
|
||
- Timeout: 300 seconds (prevents hanging workflows)
|
||
- Error Workflow: [optional - trigger if workflow fails]
|
||
```
|
||
|
||
**In AI Agent Settings**:
|
||
```
|
||
Max Iterations: 10 (prevents infinite loops)
|
||
Timeout: 120 seconds (per agent invocation)
|
||
```
|
||
|
||
---
|
||
|
||
### Section 5.2: Structured JSON Output
|
||
|
||
Consistency is key for automation. Teach your agent to return standardized JSON that downstream nodes can process reliably.
|
||
|
||
#### Why JSON Matters
|
||
|
||
**Without JSON** (free-form text):
|
||
```
|
||
Agent: "The website seems to be down. I checked and the container isn't running. Maybe restart it?"
|
||
```
|
||
↓ Hard to parse, trigger notifications, or make decisions
|
||
|
||
**With JSON** (structured):
|
||
```json
|
||
{
|
||
"status": "down",
|
||
"service": "website",
|
||
"issue": "container_stopped",
|
||
"severity": "high",
|
||
"proposed_fix": "docker restart website",
|
||
"requires_approval": true
|
||
}
|
||
```
|
||
↓ Easy to route, trigger alerts, and automate responses
|
||
|
||
#### Standard JSON Schema
|
||
|
||
Define a consistent structure for your agents:
|
||
|
||
```json
|
||
{
|
||
"timestamp": "2024-01-15T14:23:45Z",
|
||
"service": {
|
||
"name": "website",
|
||
"url": "http://192.168.1.100:8090",
|
||
"type": "web"
|
||
},
|
||
"status": "down",
|
||
"checks_performed": [
|
||
{
|
||
"type": "http",
|
||
"result": "failed",
|
||
"error": "Connection refused"
|
||
},
|
||
{
|
||
"type": "container",
|
||
"result": "stopped",
|
||
"exit_code": 137
|
||
}
|
||
],
|
||
"diagnosis": {
|
||
"root_cause": "Out of memory (OOM) kill",
|
||
"confidence": "high",
|
||
"evidence": [
|
||
"Exit code 137 indicates OOM",
|
||
"Logs show memory exhaustion before crash"
|
||
]
|
||
},
|
||
"proposed_action": {
|
||
"type": "restart",
|
||
"command": "docker restart website",
|
||
"requires_approval": true,
|
||
"risk_level": "low",
|
||
"estimated_downtime": "10 seconds"
|
||
},
|
||
"severity": "medium",
|
||
"notification_sent": false
|
||
}
|
||
```
|
||
|
||
#### Teaching Your Agent to Output JSON
|
||
|
||
**System Prompt Addition**:
|
||
```
|
||
OUTPUT FORMAT REQUIREMENT:
|
||
|
||
You MUST respond with valid JSON following this exact schema:
|
||
|
||
{
|
||
"timestamp": "[ISO 8601 timestamp]",
|
||
"service": {
|
||
"name": "[service name]",
|
||
"url": "[service URL if applicable]",
|
||
"type": "[web/database/container/vm]"
|
||
},
|
||
"status": "[up/down/degraded/unknown]",
|
||
"checks_performed": [
|
||
{
|
||
"type": "[http/container/process/disk]",
|
||
"result": "[success/failed]",
|
||
"details": "[any relevant details]"
|
||
}
|
||
],
|
||
"diagnosis": {
|
||
"root_cause": "[your analysis]",
|
||
"confidence": "[high/medium/low]",
|
||
"evidence": ["[supporting fact 1]", "[supporting fact 2]"]
|
||
},
|
||
"proposed_action": {
|
||
"type": "[restart/none/manual/investigate]",
|
||
"command": "[exact command if applicable]",
|
||
"requires_approval": true/false,
|
||
"risk_level": "[low/medium/high]",
|
||
"estimated_downtime": "[duration]"
|
||
},
|
||
"severity": "[low/medium/high/critical]",
|
||
"notification_sent": false
|
||
}
|
||
|
||
CRITICAL:
|
||
- Output ONLY valid JSON, no additional text
|
||
- All string values must be properly escaped
|
||
- Include all fields even if null
|
||
- Use lowercase for enum values (status, severity, etc.)
|
||
```
|
||
|
||
#### Parsing Agent JSON in n8n
|
||
|
||
After your AI Agent node, add a **Code node** to parse and validate:
|
||
|
||
```javascript
|
||
// Node: Parse Agent Response
|
||
const agentOutput = $input.first().json.output;
|
||
|
||
// Parse JSON from agent response
|
||
let report;
|
||
try {
|
||
// Extract JSON if agent included extra text
|
||
const jsonMatch = agentOutput.match(/\{[\s\S]*\}/);
|
||
if (jsonMatch) {
|
||
report = JSON.parse(jsonMatch[0]);
|
||
} else {
|
||
throw new Error("No JSON found in agent response");
|
||
}
|
||
} catch (error) {
|
||
// Fallback if parsing fails
|
||
report = {
|
||
status: "error",
|
||
service: { name: "unknown" },
|
||
diagnosis: { root_cause: "Failed to parse agent output" },
|
||
severity: "low",
|
||
notification_sent: false,
|
||
raw_output: agentOutput
|
||
};
|
||
}
|
||
|
||
// Validate required fields
|
||
report.timestamp = report.timestamp || new Date().toISOString();
|
||
report.notification_sent = false; // Will be updated after notification
|
||
|
||
return { json: report };
|
||
```
|
||
|
||
---
|
||
|
||
### Section 5.3: Telegram Integration
|
||
|
||
Telegram serves as your command center: receiving alerts, sending approvals, and providing status updates. This section covers complete setup from scratch.
|
||
|
||
#### Part A: Creating Your Telegram Bot
|
||
|
||
**Step 1: Start Conversation with BotFather**
|
||
|
||
1. Open Telegram app (mobile or desktop)
|
||
2. Search for **@BotFather** (verified bot with blue checkmark)
|
||
3. Start a conversation: `/start`
|
||
|
||
**Step 2: Create New Bot**
|
||
|
||
Send command:
|
||
```
|
||
/newbot
|
||
```
|
||
|
||
BotFather responds:
|
||
```
|
||
Alright, a new bot. How are we going to call it? Please choose a name for your bot.
|
||
```
|
||
|
||
**Step 3: Name Your Bot**
|
||
|
||
Choose a display name (can contain spaces):
|
||
```
|
||
Homelab CTO Agent
|
||
```
|
||
|
||
**Step 4: Choose Username**
|
||
|
||
Must end in 'bot' and be unique:
|
||
```
|
||
homelab_cto_agent_bot
|
||
```
|
||
(Try variations if taken: `homelab_cto_2024_bot`, `your_name_homelab_bot`, etc.)
|
||
|
||
**Step 5: Receive Bot Token**
|
||
|
||
BotFather provides your token:
|
||
```
|
||
Done! Congratulations on your new bot. You will find it at t.me/homelab_cto_agent_bot
|
||
|
||
Use this token to access the HTTP API:
|
||
7234567890:AAHfG7Xm2qZpYq9_JxEn6KqQx7rY3MpLv8w
|
||
|
||
Keep your token secure and store it safely, it can be used by anyone to control your bot.
|
||
```
|
||
|
||
**⚠️ CRITICAL**: This token is like a password. Never share it, commit it to git, or post it publicly.
|
||
|
||
**Step 6: Store Token Securely**
|
||
|
||
In n8n:
|
||
1. Go to **Credentials → New**
|
||
2. Select **"Telegram API"** credential type
|
||
3. Enter:
|
||
- **Credential Name**: "Homelab CTO Bot"
|
||
- **Access Token**: `[paste your token]`
|
||
4. Save
|
||
|
||
#### Part B: Getting Your Chat ID
|
||
|
||
For the bot to send you messages, it needs your Chat ID.
|
||
|
||
**Step 1: Start Conversation with Your Bot**
|
||
|
||
1. In Telegram, search for your bot's username (`@homelab_cto_agent_bot`)
|
||
2. Click **Start** or send any message like "Hello"
|
||
|
||
**Step 2: Get Your Chat ID**
|
||
|
||
**Method 1 - Using n8n Workflow**:
|
||
|
||
Create a temporary workflow:
|
||
```
|
||
[Webhook Trigger]
|
||
→ [Telegram Trigger: On Message]
|
||
→ [Set: Extract Chat ID]
|
||
```
|
||
|
||
Configure:
|
||
- **Telegram Trigger**: Select your bot credential
|
||
- **Updates**: "message"
|
||
|
||
When you send a message to your bot, the workflow captures it and displays the chat ID in the execution data.
|
||
|
||
**Method 2 - Using API Call**:
|
||
|
||
Open browser and navigate to:
|
||
```
|
||
https://api.telegram.org/bot<YOUR_BOT_TOKEN>/getUpdates
|
||
```
|
||
|
||
Replace `<YOUR_BOT_TOKEN>` with your actual token.
|
||
|
||
Look for JSON response:
|
||
```json
|
||
{
|
||
"ok": true,
|
||
"result": [
|
||
{
|
||
"update_id": 123456789,
|
||
"message": {
|
||
"message_id": 1,
|
||
"from": {
|
||
"id": 987654321, ← This is your Chat ID
|
||
"is_bot": false,
|
||
"first_name": "Your Name"
|
||
},
|
||
"chat": {
|
||
"id": 987654321, ← Chat ID (same as user ID for private chats)
|
||
"type": "private"
|
||
},
|
||
"text": "Hello"
|
||
}
|
||
}
|
||
]
|
||
}
|
||
```
|
||
|
||
**Step 3: Store Chat ID**
|
||
|
||
Save it in n8n environment variables:
|
||
1. In n8n settings, go to **Settings → Environment Variables** (if available)
|
||
2. Or store directly in workflow using **Set** node at the beginning:
|
||
|
||
```javascript
|
||
{
|
||
"TELEGRAM_CHAT_ID": "987654321"
|
||
}
|
||
```
|
||
|
||
#### Part C: Workflow Integration
|
||
|
||
Now connect your agent to Telegram for notifications.
|
||
|
||
**Notification Workflow Structure**:
|
||
|
||
```
|
||
[Schedule Trigger - Every 5 min]
|
||
→ [AI Agent - Monitoring]
|
||
→ [Code - Parse JSON]
|
||
→ [IF - Issue Detected?]
|
||
YES → [Telegram - Send Alert]
|
||
→ [IF - Requires Approval?]
|
||
YES → [Telegram - Wait for Response]
|
||
→ [Switch - Handle Response]
|
||
YES → [AI Agent - Execute Fix]
|
||
→ [Telegram - Confirm Execution]
|
||
NO → [Telegram - Log Denial]
|
||
EXPLAIN → [AI Agent - Provide Details]
|
||
→ [Telegram - Send Explanation]
|
||
NO → [End]
|
||
NO → [End - All Clear]
|
||
```
|
||
|
||
**Configure Telegram Nodes**:
|
||
|
||
**Node 1: Alert Notification**
|
||
```
|
||
Node: Telegram
|
||
Action: Send Message
|
||
Credential: Homelab CTO Bot
|
||
Chat ID: {{ $env.TELEGRAM_CHAT_ID }} or hardcode your ID
|
||
Message:
|
||
```
|
||
|
||
```
|
||
🚨 Service Issue Detected
|
||
|
||
Service: {{ $json.service.name }}
|
||
Status: {{ $json.status }}
|
||
Severity: {{ $json.severity }}
|
||
|
||
Issue:
|
||
{{ $json.diagnosis.root_cause }}
|
||
|
||
Time: {{ $json.timestamp }}
|
||
|
||
Checking if action needed...
|
||
```
|
||
|
||
**Node 2: Approval Request** (if needed)
|
||
```
|
||
Message:
|
||
```
|
||
|
||
```
|
||
⚠️ Approval Required
|
||
|
||
Service: {{ $json.service.name }}
|
||
Issue: {{ $json.diagnosis.root_cause }}
|
||
|
||
Proposed Fix:
|
||
`{{ $json.proposed_action.command }}`
|
||
|
||
Risk: {{ $json.proposed_action.risk_level }}
|
||
Downtime: {{ $json.proposed_action.estimated_downtime }}
|
||
|
||
Reply:
|
||
YES - Execute fix
|
||
NO - Skip action
|
||
EXPLAIN - More details
|
||
```
|
||
|
||
```
|
||
Reply Markup: (makes clickable buttons)
|
||
{
|
||
"keyboard": [
|
||
[{"text": "YES"}, {"text": "NO"}],
|
||
[{"text": "EXPLAIN"}]
|
||
],
|
||
"one_time_keyboard": true,
|
||
"resize_keyboard": true
|
||
}
|
||
```
|
||
|
||
#### Part D: Handling User Responses
|
||
|
||
To wait for and process user responses, use Telegram Trigger in a separate workflow or webhook pattern.
|
||
|
||
**Approach 1: Separate Approval Workflow** (Recommended)
|
||
|
||
Main Workflow:
|
||
- Detects issues
|
||
- Sends approval request with unique ID
|
||
- Stores pending approval in temporary storage
|
||
|
||
Approval Workflow:
|
||
```
|
||
[Telegram Trigger - On Message]
|
||
→ [IF - Is Response to Approval?]
|
||
YES → [Switch - Response Type]
|
||
YES → [Execute - Trigger Main Workflow with Execute]
|
||
NO → [Update - Mark as Denied]
|
||
EXPLAIN → [Send - Detailed Explanation]
|
||
NO → [Ignore or handle as chat message]
|
||
```
|
||
|
||
**Approach 2: Webhook Pattern** (Advanced)
|
||
|
||
Use n8n's webhook node to create a callback URL:
|
||
1. Approval request includes unique token
|
||
2. User clicks link or button with token
|
||
3. Webhook receives response
|
||
4. Continues execution
|
||
|
||
#### Part E: Notification Logic
|
||
|
||
Implement smart notifications to avoid alert fatigue.
|
||
|
||
**Conditions for Sending Notifications**:
|
||
|
||
```javascript
|
||
// In IF node after AI Agent
|
||
const report = $json;
|
||
|
||
// Send notification if:
|
||
const shouldNotify =
|
||
report.status === 'down' || // Service is down
|
||
report.severity === 'high' || // High severity issue
|
||
report.severity === 'critical' || // Critical issue
|
||
report.proposed_action.requires_approval; // Needs approval
|
||
|
||
return shouldNotify;
|
||
```
|
||
|
||
**Rate Limiting to Prevent Spam**:
|
||
|
||
```javascript
|
||
// Store last notification time in workflow static data
|
||
const now = Date.now();
|
||
const lastNotification = $('WorkflowStaticData').first().json.last_notification || 0;
|
||
const minIntervalMs = 15 * 60 * 1000; // 15 minutes
|
||
|
||
if (now - lastNotification < minIntervalMs && report.severity !== 'critical') {
|
||
// Skip notification unless critical
|
||
return false;
|
||
}
|
||
|
||
// Update last notification time
|
||
$('WorkflowStaticData').first().json.last_notification = now;
|
||
return true;
|
||
```
|
||
|
||
**Different Notification Levels**:
|
||
|
||
```javascript
|
||
// Customize message based on severity
|
||
let emoji, urgency;
|
||
|
||
switch (report.severity) {
|
||
case 'critical':
|
||
emoji = '🔴';
|
||
urgency = 'CRITICAL - Immediate attention required';
|
||
break;
|
||
case 'high':
|
||
emoji = '🟠';
|
||
urgency = 'HIGH - Action needed soon';
|
||
break;
|
||
case 'medium':
|
||
emoji = '🟡';
|
||
urgency = 'MEDIUM - Monitor closely';
|
||
break;
|
||
case 'low':
|
||
emoji = '🟢';
|
||
urgency = 'LOW - Informational';
|
||
break;
|
||
default:
|
||
emoji = 'ℹ️';
|
||
urgency = 'Status Update';
|
||
}
|
||
|
||
return {
|
||
json: {
|
||
message: `${emoji} ${urgency}\n\n${report.diagnosis.root_cause}`,
|
||
severity: report.severity
|
||
}
|
||
};
|
||
```
|
||
|
||
#### Complete Example Workflow
|
||
|
||
Putting it all together:
|
||
|
||
```
|
||
┌─────────────────────────────────────────────────────────────┐
|
||
│ [Schedule Trigger: */5 * * * *] │
|
||
└───────────────────┬─────────────────────────────────────────┘
|
||
│
|
||
┌───────────────────▼─────────────────────────────────────────┐
|
||
│ [AI Agent: Monitor Services] │
|
||
│ - Check health │
|
||
│ - Return JSON report │
|
||
└───────────────────┬─────────────────────────────────────────┘
|
||
│
|
||
┌───────────────────▼─────────────────────────────────────────┐
|
||
│ [Code: Parse JSON Response] │
|
||
│ - Extract structured data │
|
||
│ - Validate fields │
|
||
└───────────────────┬─────────────────────────────────────────┘
|
||
│
|
||
┌───────────────────▼─────────────────────────────────────────┐
|
||
│ [IF: Issue Detected?] │
|
||
│ Condition: status != 'up' │
|
||
└─────┬──────────────────────────────────────────────┬────────┘
|
||
YES NO
|
||
│ │
|
||
┌─────▼──────────────────────────────────┐ ┌─────▼─────┐
|
||
│ [Telegram: Send Alert] │ │ [End] │
|
||
│ Message: Issue details │ └───────────┘
|
||
└─────┬──────────────────────────────────┘
|
||
│
|
||
┌─────▼──────────────────────────────────┐
|
||
│ [IF: Requires Approval?] │
|
||
│ Condition: proposed_action.requires_ │
|
||
│ approval == true │
|
||
└─────┬──────────────────────┬───────────┘
|
||
YES NO
|
||
│ │
|
||
┌─────▼───────────────┐ ┌──▼──────────────────────┐
|
||
│ [Telegram: Send │ │ [AI Agent: Auto-fix] │
|
||
│ Approval Request] │ │ (if Stage 3 enabled) │
|
||
└─────┬───────────────┘ └──┬──────────────────────┘
|
||
│ │
|
||
│ ┌──▼──────────────────────┐
|
||
│ │ [Telegram: Confirm] │
|
||
│ └─────────────────────────┘
|
||
│
|
||
┌─────▼───────────────┐
|
||
│ [Wait for Response]│
|
||
│ (Separate workflow)│
|
||
└─────────────────────┘
|
||
```
|
||
|
||
---
|
||
|
||
### Section 5.4: Testing Your Automated System
|
||
|
||
Before trusting automated monitoring, thoroughly test all scenarios.
|
||
|
||
**Test Checklist**:
|
||
|
||
```
|
||
[ ] Schedule runs at correct intervals
|
||
[ ] Agent successfully monitors services
|
||
[ ] JSON output is valid and complete
|
||
[ ] Notifications arrive on Telegram
|
||
[ ] Approval requests include all necessary information
|
||
[ ] YES responses trigger correct actions
|
||
[ ] NO responses are logged appropriately
|
||
[ ] EXPLAIN requests provide detailed information
|
||
[ ] Critical alerts bypass rate limiting
|
||
[ ] Non-critical alerts respect quiet hours (if configured)
|
||
[ ] All actions are logged for audit
|
||
```
|
||
|
||
**Simulate Failure Scenarios**:
|
||
|
||
```bash
|
||
# Test 1: Service down
|
||
docker stop website
|
||
# Wait for next schedule interval
|
||
# Verify: Telegram alert received
|
||
|
||
# Test 2: Approve restart
|
||
# Respond with "YES" in Telegram
|
||
# Verify: Container restarted, success notification received
|
||
|
||
# Test 3: Deny restart
|
||
docker stop website
|
||
# Respond with "NO" in Telegram
|
||
# Verify: No action taken, denial logged
|
||
|
||
# Test 4: Request explanation
|
||
# Respond with "EXPLAIN" in Telegram
|
||
# Verify: Detailed diagnostic information received
|
||
```
|
||
|
||
---
|
||
|
||
Your automation system is now complete. In Chapter 6, we'll integrate specific services like Proxmox, UniFi, and NAS systems.
|
||
|
||
---
|
||
|
||
## Chapter 6 – Service Integrations
|
||
|
||
This chapter provides detailed integration guides for common homelab services. Each section includes API endpoints, authentication methods, example commands, and agent configuration snippets. These integrations transform your agent from a generic monitor into a service-aware expert.
|
||
|
||
### Section 6.1: Uptime Kuma Integration
|
||
|
||
Uptime Kuma is a powerful self-hosted monitoring solution. Integrating it with your AI agent creates a dual-layer monitoring system where Uptime Kuma provides the raw data and your agent provides intelligent analysis.
|
||
|
||
#### API Access Setup
|
||
|
||
**Step 1: Enable API in Uptime Kuma**
|
||
|
||
1. Open Uptime Kuma web interface
|
||
2. Navigate to **Settings → Security**
|
||
3. Enable **"API Access"**
|
||
4. Generate an API key
|
||
5. Save the key securely
|
||
|
||
**Step 2: Test API Access**
|
||
|
||
```bash
|
||
# Get all monitors
|
||
curl -X GET "http://your-uptime-kuma:3001/api/monitors" \
|
||
-H "Authorization: Bearer YOUR_API_KEY"
|
||
|
||
# Get specific monitor status
|
||
curl -X GET "http://your-uptime-kuma:3001/api/monitor/1" \
|
||
-H "Authorization: Bearer YOUR_API_KEY"
|
||
```
|
||
|
||
#### n8n Configuration
|
||
|
||
**Add HTTP Request Node**:
|
||
```
|
||
Node: HTTP Request
|
||
Method: GET
|
||
URL: http://your-uptime-kuma:3001/api/monitors
|
||
Authentication: Generic Credential Type
|
||
Header Auth:
|
||
Name: Authorization
|
||
Value: Bearer YOUR_API_KEY
|
||
```
|
||
|
||
#### Agent System Prompt Addition
|
||
|
||
```
|
||
UPTIME KUMA INTEGRATION:
|
||
|
||
You have access to Uptime Kuma monitoring data via HTTP requests.
|
||
|
||
Available Endpoints:
|
||
- GET /api/monitors - List all monitors
|
||
- GET /api/monitor/{id} - Get specific monitor details
|
||
- GET /api/monitor/{id}/heartbeat - Get heartbeat history
|
||
|
||
When analyzing service health:
|
||
1. First check Uptime Kuma for historical data
|
||
2. Identify patterns (recent downtime, recurring issues)
|
||
3. Correlate with current checks
|
||
4. Provide context: "Service has been down 3 times in last 24 hours"
|
||
|
||
Example Response Schema:
|
||
{
|
||
"monitors": [
|
||
{
|
||
"id": 1,
|
||
"name": "Website",
|
||
"url": "http://192.168.1.100:8090",
|
||
"type": "http",
|
||
"active": true,
|
||
"uptime": 99.5,
|
||
"avgPing": 145
|
||
}
|
||
]
|
||
}
|
||
```
|
||
|
||
#### Example Agent Interaction
|
||
|
||
```
|
||
User: "Is my website performing normally?"
|
||
|
||
Agent: "Checking current status and historical data..."
|
||
|
||
[Executes HTTP Request to Uptime Kuma API]
|
||
[Executes direct HTTP check to website]
|
||
|
||
Agent: "📊 Website Health Report:
|
||
|
||
Current Status: UP ✅
|
||
- URL: http://192.168.1.100:8090
|
||
- Response Code: 200 OK
|
||
- Response Time: 152ms
|
||
|
||
Historical Performance (Uptime Kuma):
|
||
- 24h Uptime: 99.8%
|
||
- 7d Uptime: 98.2%
|
||
- Average Ping: 145ms
|
||
- Incidents: 2 brief outages in past week
|
||
|
||
Analysis:
|
||
Overall healthy, but noticed 2 short outages last week.
|
||
Recommend investigating logs for root cause of intermittent issues.
|
||
|
||
Would you like me to investigate the cause of recent outages?"
|
||
```
|
||
|
||
#### Advanced: Uptime Kuma Webhook Integration
|
||
|
||
Configure Uptime Kuma to push alerts to n8n:
|
||
|
||
**In Uptime Kuma**:
|
||
1. Settings → Notifications
|
||
2. Add Notification: Webhook
|
||
3. URL: `https://your-n8n.com/webhook/uptime-alert`
|
||
4. Assign to monitors
|
||
|
||
**In n8n**:
|
||
```
|
||
[Webhook Trigger: /webhook/uptime-alert]
|
||
→ [AI Agent: Analyze Uptime Kuma Alert]
|
||
→ [Telegram: Notify with Context]
|
||
```
|
||
|
||
This enables real-time alerts with AI-powered analysis.
|
||
|
||
---
|
||
|
||
### Section 6.2: UniFi Network Monitoring
|
||
|
||
Monitor your network infrastructure by integrating with UniFi Controller. The agent can check connected devices, bandwidth usage, Wi-Fi health, and security events.
|
||
|
||
#### API Access Setup
|
||
|
||
**Step 1: Obtain UniFi Credentials**
|
||
|
||
You'll need:
|
||
- UniFi Controller URL/IP
|
||
- Admin username and password
|
||
- Site ID (usually "default" for single-site setups)
|
||
|
||
**Step 2: Authenticate and Get API Cookie**
|
||
|
||
UniFi API requires cookie-based authentication:
|
||
|
||
```bash
|
||
# Login and get cookie
|
||
curl -X POST https://unifi.local:8443/api/login \
|
||
-H "Content-Type: application/json" \
|
||
-d '{"username":"admin","password":"your_password"}' \
|
||
-c /tmp/unifi_cookie.txt -k
|
||
|
||
# Use cookie for requests
|
||
curl -X GET https://unifi.local:8443/api/s/default/stat/sta \
|
||
-b /tmp/unifi_cookie.txt -k
|
||
```
|
||
|
||
#### Common API Endpoints
|
||
|
||
```bash
|
||
# List all connected clients
|
||
GET /api/s/default/stat/sta
|
||
|
||
# List all access points
|
||
GET /api/s/default/stat/device
|
||
|
||
# Get site statistics
|
||
GET /api/s/default/stat/sitedpi
|
||
|
||
# List recent alerts/events
|
||
GET /api/s/default/stat/alarm
|
||
|
||
# Get specific device info
|
||
GET /api/s/default/stat/device/{mac}
|
||
```
|
||
|
||
#### n8n Workflow for UniFi
|
||
|
||
**Challenge**: UniFi requires session cookies. Use a subworkflow for authentication:
|
||
|
||
**Subworkflow: UniFi Authenticate**
|
||
```
|
||
[Execute Workflow Trigger]
|
||
→ [HTTP Request: Login]
|
||
URL: https://unifi.local:8443/api/login
|
||
Method: POST
|
||
Body: {"username":"{{$env.UNIFI_USER}}","password":"{{$env.UNIFI_PASS}}"}
|
||
SSL: Ignore SSL Issues
|
||
→ [Set: Extract Cookie]
|
||
cookie: {{ $json.headers['set-cookie'][0] }}
|
||
→ [Respond to Workflow]
|
||
cookie: {{ $json.cookie }}
|
||
```
|
||
|
||
**Main Workflow: UniFi Monitor**
|
||
```
|
||
[Schedule Trigger]
|
||
→ [Execute Workflow: UniFi Authenticate]
|
||
→ [HTTP Request: Get Clients]
|
||
URL: https://unifi.local:8443/api/s/default/stat/sta
|
||
Headers: Cookie: {{ $json.cookie }}
|
||
→ [AI Agent: Analyze Network Health]
|
||
→ [Telegram: Report]
|
||
```
|
||
|
||
#### Agent System Prompt for UniFi
|
||
|
||
```
|
||
UNIFI NETWORK MONITORING:
|
||
|
||
You have access to UniFi Controller data for network analysis.
|
||
|
||
Available Data:
|
||
- Connected clients (wired + wireless)
|
||
- Access point status and performance
|
||
- Bandwidth usage per client
|
||
- Network events and alerts
|
||
- Wi-Fi channel utilization
|
||
|
||
Analysis Tasks:
|
||
1. Client Count Monitoring
|
||
- Normal range: 15-25 devices
|
||
- Alert if sudden change (±5 devices)
|
||
- Identify new/unknown devices
|
||
|
||
2. Bandwidth Analysis
|
||
- Identify high-usage clients
|
||
- Detect unusual traffic patterns
|
||
- Alert on saturation (>80% utilization)
|
||
|
||
3. Wi-Fi Health
|
||
- Check AP uptime
|
||
- Monitor client signal strength
|
||
- Detect interference or congestion
|
||
|
||
4. Security Events
|
||
- Failed authentication attempts
|
||
- New device connections
|
||
- Rogue AP detection
|
||
|
||
Example Response:
|
||
{
|
||
"network_health": "good",
|
||
"total_clients": 22,
|
||
"new_devices": 1,
|
||
"bandwidth_usage": "45%",
|
||
"access_points": [
|
||
{
|
||
"name": "Living Room AP",
|
||
"status": "online",
|
||
"clients": 8,
|
||
"uptime": "15 days"
|
||
}
|
||
],
|
||
"alerts": [
|
||
{
|
||
"type": "new_device",
|
||
"device": "iPhone 14 Pro",
|
||
"mac": "aa:bb:cc:dd:ee:ff",
|
||
"timestamp": "2024-01-15T14:30:00Z"
|
||
}
|
||
]
|
||
}
|
||
```
|
||
|
||
#### Example Commands for Agent
|
||
|
||
```javascript
|
||
// In Execute Command node - UniFi CLI (if SSH access available)
|
||
// These commands work when SSH'd into UniFi hardware
|
||
|
||
// Show connected clients
|
||
mca-cli-op info
|
||
|
||
// Show AP status
|
||
mca-cli-op list-devices
|
||
|
||
// Restart AP (specific MAC)
|
||
mca-cli-op restart-device <ap-mac-address>
|
||
|
||
// Show network stats
|
||
mca-cli-op stat
|
||
```
|
||
|
||
#### Example Agent Interaction
|
||
|
||
```
|
||
User: "How many devices are on my network?"
|
||
|
||
Agent: "Checking UniFi Controller..."
|
||
|
||
[Authenticates to UniFi]
|
||
[Fetches client list]
|
||
|
||
Agent: "🌐 Network Status:
|
||
|
||
Total Devices: 23 clients
|
||
|
||
Breakdown:
|
||
- Wired: 8 devices
|
||
(Server, Desktop, NAS, Printer, 4x IoT)
|
||
- Wireless (2.4GHz): 6 devices
|
||
(Smart home devices, older phones)
|
||
- Wireless (5GHz): 9 devices
|
||
(Laptops, phones, tablets)
|
||
|
||
Access Points:
|
||
✅ Living Room AP - 8 clients, -45dBm avg signal
|
||
✅ Office AP - 11 clients, -48dBm avg signal
|
||
✅ Bedroom AP - 4 clients, -42dBm avg signal
|
||
|
||
New Device Detected:
|
||
⚠️ Unknown device connected 15 minutes ago
|
||
MAC: aa:bb:cc:dd:ee:ff
|
||
Hostname: Galaxy-S23
|
||
Location: Living Room AP
|
||
|
||
All systems operating normally. Would you like details on any specific device?"
|
||
```
|
||
|
||
---
|
||
|
||
### Section 6.3: Proxmox Integration
|
||
|
||
Proxmox VE manages your virtual machines and containers. Integration enables VM health monitoring, resource usage tracking, and automated management tasks.
|
||
|
||
#### Authentication Options
|
||
|
||
**Option 1: SSH Access** (Recommended for comprehensive control)
|
||
**Option 2: API Access** (Better for read-only monitoring)
|
||
|
||
#### SSH-Based Integration
|
||
|
||
**Step 1: Configure SSH Keys**
|
||
|
||
```bash
|
||
# On your n8n server, generate SSH key if not exists
|
||
ssh-keygen -t ed25519 -C "n8n-agent"
|
||
|
||
# Copy public key to Proxmox host
|
||
ssh-copy-id root@proxmox.local
|
||
|
||
# Test connection
|
||
ssh root@proxmox.local "pvesh get /nodes/pve/status"
|
||
```
|
||
|
||
**Step 2: Create Subworkflow for Proxmox Commands**
|
||
|
||
n8n Subworkflow: **Proxmox SSH Execute**
|
||
```
|
||
[Execute Workflow Trigger]
|
||
Parameters:
|
||
- command (string): Command to execute
|
||
- node_name (string): Proxmox node name (default: "pve")
|
||
|
||
→ [Execute Command: SSH to Proxmox]
|
||
Command: ssh root@proxmox.local "{{ $json.command }}"
|
||
|
||
→ [Code: Parse Output]
|
||
(Parse JSON or text output)
|
||
|
||
→ [Respond to Workflow]
|
||
result: {{ $json.parsed_output }}
|
||
```
|
||
|
||
**Step 3: Useful Proxmox Commands**
|
||
|
||
```bash
|
||
# List all VMs and containers
|
||
pvesh get /nodes/pve/qemu
|
||
pvesh get /nodes/pve/lxc
|
||
|
||
# Get VM status
|
||
qm status 100 # VM ID 100
|
||
|
||
# Get container status
|
||
pct status 101 # CT ID 101
|
||
|
||
# List all resources (VMs, CTs, storage)
|
||
pvesh get /cluster/resources
|
||
|
||
# Get node status (CPU, RAM, uptime)
|
||
pvesh get /nodes/pve/status
|
||
|
||
# Check storage
|
||
pvesh get /nodes/pve/storage
|
||
|
||
# Get VM config
|
||
qm config 100
|
||
|
||
# Start/Stop VM
|
||
qm start 100
|
||
qm stop 100
|
||
|
||
# Start/Stop container
|
||
pct start 101
|
||
pct stop 101
|
||
|
||
# Check disk usage
|
||
df -h
|
||
|
||
# Check ZFS pools (if using ZFS)
|
||
zpool status
|
||
```
|
||
|
||
#### API-Based Integration
|
||
|
||
**Step 1: Create API Token**
|
||
|
||
1. Login to Proxmox web interface
|
||
2. Go to **Datacenter → Permissions → API Tokens**
|
||
3. Click **Add**
|
||
4. User: root@pam
|
||
5. Token ID: n8n-agent
|
||
6. Copy the secret (shown only once!)
|
||
|
||
**Step 2: API Request Format**
|
||
|
||
```bash
|
||
# Authentication
|
||
TOKEN="root@pam!n8n-agent=abc123-secret-token-xyz789"
|
||
|
||
# Get cluster resources
|
||
curl -H "Authorization: PVEAPIToken=$TOKEN" \
|
||
https://proxmox.local:8006/api2/json/cluster/resources
|
||
|
||
# Get specific VM status
|
||
curl -H "Authorization: PVEAPIToken=$TOKEN" \
|
||
https://proxmox.local:8006/api2/json/nodes/pve/qemu/100/status/current
|
||
|
||
# Start VM
|
||
curl -X POST -H "Authorization: PVEAPIToken=$TOKEN" \
|
||
https://proxmox.local:8006/api2/json/nodes/pve/qemu/100/status/start
|
||
```
|
||
|
||
**Step 3: n8n HTTP Request Configuration**
|
||
|
||
```
|
||
Node: HTTP Request (Tool for Agent)
|
||
Method: GET
|
||
URL: https://proxmox.local:8006/api2/json/{{ $json.endpoint }}
|
||
Authentication: Generic Credential Type
|
||
Header Auth:
|
||
Name: Authorization
|
||
Value: PVEAPIToken=root@pam!n8n-agent=YOUR_TOKEN_SECRET
|
||
SSL: Ignore SSL Issues (if using self-signed cert)
|
||
```
|
||
|
||
#### Agent System Prompt for Proxmox
|
||
|
||
```
|
||
PROXMOX INTEGRATION:
|
||
|
||
You manage Proxmox VE infrastructure with VMs and containers.
|
||
|
||
Available Commands (via SSH subworkflow):
|
||
- pvesh get /cluster/resources - List all resources
|
||
- qm status <vmid> - Get VM status
|
||
- pct status <ctid> - Get container status
|
||
- qm start/stop <vmid> - Control VMs
|
||
- pct start/stop <ctid> - Control containers
|
||
|
||
VM/Container ID Mapping:
|
||
100: Production Web Server (VM)
|
||
101: Database Server (Container)
|
||
102: Development Environment (VM)
|
||
103: Media Server (Container)
|
||
104: Monitoring Stack (Container)
|
||
|
||
Health Check Process:
|
||
1. List all resources
|
||
2. Check each VM/CT status
|
||
3. Verify resource usage (CPU, RAM, Disk)
|
||
4. Alert if:
|
||
- VM/CT is stopped unexpectedly
|
||
- CPU usage >90% for >5 minutes
|
||
- RAM usage >95%
|
||
- Disk usage >85%
|
||
|
||
Example Response:
|
||
{
|
||
"proxmox_health": "degraded",
|
||
"total_vms": 3,
|
||
"total_containers": 4,
|
||
"running": 6,
|
||
"stopped": 1,
|
||
"issues": [
|
||
{
|
||
"id": "101",
|
||
"name": "database-server",
|
||
"type": "lxc",
|
||
"status": "stopped",
|
||
"action_needed": "investigate_and_restart"
|
||
}
|
||
]
|
||
}
|
||
```
|
||
|
||
#### Example Workflow: Proxmox Monitoring
|
||
|
||
```
|
||
[Schedule Trigger: Every 10 minutes]
|
||
→ [Execute Workflow: Proxmox SSH Execute]
|
||
command: "pvesh get /cluster/resources --output-format json"
|
||
→ [Code: Parse and Analyze]
|
||
- Filter for VMs and containers
|
||
- Check status of each
|
||
- Calculate resource usage
|
||
→ [AI Agent: Analyze Proxmox Health]
|
||
→ [IF: Issues Detected]
|
||
YES → [Telegram: Alert]
|
||
→ [IF: Auto-fix Enabled]
|
||
YES → [Execute Workflow: Proxmox SSH Execute]
|
||
command: "qm start {{ $json.vmid }}"
|
||
NO → [Request Approval]
|
||
NO → [End]
|
||
```
|
||
|
||
#### Example Agent Interaction
|
||
|
||
```
|
||
User: "Check all my VMs"
|
||
|
||
Agent: "Scanning Proxmox cluster..."
|
||
|
||
[Executes SSH: pvesh get /cluster/resources]
|
||
|
||
Agent: "🖥️ Proxmox Cluster Status:
|
||
|
||
Node: pve
|
||
Status: Online
|
||
Uptime: 42 days
|
||
CPU: 15% (4 cores)
|
||
RAM: 28.5GB / 64GB (45%)
|
||
|
||
Virtual Machines:
|
||
✅ VM 100: Production Web Server
|
||
Status: Running
|
||
CPU: 2 cores, 8% usage
|
||
RAM: 4GB / 8GB allocated
|
||
Uptime: 15 days
|
||
|
||
✅ VM 102: Development Environment
|
||
Status: Running
|
||
CPU: 4 cores, 45% usage
|
||
RAM: 8GB / 16GB allocated
|
||
Uptime: 3 days
|
||
|
||
❌ VM 104: Testing Server
|
||
Status: STOPPED
|
||
Last stopped: 2 hours ago
|
||
|
||
Containers:
|
||
✅ CT 101: Database Server (Running)
|
||
✅ CT 103: Media Server (Running)
|
||
✅ CT 105: Monitoring Stack (Running)
|
||
|
||
⚠️ Issue Detected:
|
||
VM 104 (Testing Server) is stopped unexpectedly.
|
||
|
||
Would you like me to:
|
||
1. Investigate why it stopped (check logs)
|
||
2. Restart it
|
||
3. Leave it stopped
|
||
|
||
Please advise."
|
||
```
|
||
|
||
---
|
||
|
||
### Section 6.4: NAS (ZimaCube) Monitoring
|
||
|
||
Network Attached Storage health is critical. Monitor disk health, RAID status, capacity, and prevent data loss through proactive alerts.
|
||
|
||
#### SSH Access to NAS
|
||
|
||
**Step 1: Enable SSH on ZimaCube/NAS**
|
||
|
||
(Steps vary by NAS vendor - ZimaCube, TrueNAS, Synology, etc.)
|
||
|
||
For ZimaCube/Debian-based:
|
||
```bash
|
||
# SSH into NAS
|
||
ssh admin@nas.local
|
||
|
||
# Install smartmontools if not present
|
||
sudo apt update && sudo apt install smartmontools
|
||
```
|
||
|
||
#### Monitoring Commands
|
||
|
||
**Disk Health (SMART Data)**:
|
||
```bash
|
||
# List all disks
|
||
lsblk
|
||
|
||
# Check SMART status for a disk
|
||
sudo smartctl -H /dev/sda
|
||
|
||
# Detailed SMART data
|
||
sudo smartctl -a /dev/sda
|
||
|
||
# Check all disks
|
||
for disk in /dev/sd?; do
|
||
echo "=== $disk ==="
|
||
sudo smartctl -H $disk
|
||
done
|
||
```
|
||
|
||
**RAID Status**:
|
||
```bash
|
||
# mdadm RAID (software RAID)
|
||
cat /proc/mdstat
|
||
sudo mdadm --detail /dev/md0
|
||
|
||
# ZFS pools
|
||
zpool status
|
||
zpool list
|
||
|
||
# Check for errors
|
||
zpool status -v
|
||
```
|
||
|
||
**Storage Capacity**:
|
||
```bash
|
||
# Disk usage
|
||
df -h
|
||
|
||
# Specific mount point
|
||
df -h /mnt/storage
|
||
|
||
# Check inode usage
|
||
df -i
|
||
```
|
||
|
||
**Temperature Monitoring**:
|
||
```bash
|
||
# Drive temperatures
|
||
sudo smartctl -a /dev/sda | grep Temperature
|
||
|
||
# System temperature (if sensors available)
|
||
sensors
|
||
```
|
||
|
||
#### n8n Integration
|
||
|
||
**Subworkflow: NAS Health Check**
|
||
```
|
||
[Execute Workflow Trigger]
|
||
→ [Execute Command: SSH to NAS]
|
||
Command: ssh admin@nas.local "sudo smartctl -H /dev/sda && cat /proc/mdstat && df -h"
|
||
→ [Code: Parse Health Data]
|
||
→ [Respond to Workflow]
|
||
```
|
||
|
||
#### Agent System Prompt for NAS
|
||
|
||
```
|
||
NAS STORAGE MONITORING:
|
||
|
||
You monitor Network Attached Storage health with focus on data integrity.
|
||
|
||
Available Commands:
|
||
- smartctl -H /dev/sdX - Check disk SMART status
|
||
- cat /proc/mdstat - Check RAID array status
|
||
- df -h - Check disk space
|
||
- zpool status - Check ZFS pool status (if applicable)
|
||
|
||
Disk Mapping:
|
||
/dev/sda: 4TB WD Red - RAID Member 1
|
||
/dev/sdb: 4TB WD Red - RAID Member 2
|
||
/dev/sdc: 4TB WD Red - RAID Member 3
|
||
/dev/sdd: 4TB WD Red - RAID Member 4
|
||
/dev/md0: RAID 5 Array - Main Storage
|
||
|
||
Critical Alerts:
|
||
1. SMART Status: FAILED → IMMEDIATE attention
|
||
2. RAID: Degraded → HIGH priority
|
||
3. Disk Space: >85% → MEDIUM priority
|
||
4. Temperature: >50°C → Monitor closely
|
||
|
||
Health Check Response:
|
||
{
|
||
"nas_health": "healthy|degraded|critical",
|
||
"disks": [
|
||
{
|
||
"device": "/dev/sda",
|
||
"smart_status": "PASSED|FAILED",
|
||
"temperature": "38°C",
|
||
"errors": 0,
|
||
"hours_on": 15234
|
||
}
|
||
],
|
||
"raid": {
|
||
"device": "/dev/md0",
|
||
"level": "raid5",
|
||
"status": "active|degraded|failed",
|
||
"disks": "4/4",
|
||
"health": "good"
|
||
},
|
||
"capacity": {
|
||
"total": "10.9TB",
|
||
"used": "7.2TB",
|
||
"available": "3.7TB",
|
||
"percent_used": 66
|
||
}
|
||
}
|
||
|
||
CRITICAL: Never attempt automated fixes for storage issues.
|
||
Always alert human for disk/RAID problems.
|
||
```
|
||
|
||
#### Example Commands for Different NAS Types
|
||
|
||
**Synology**:
|
||
```bash
|
||
# RAID status
|
||
cat /proc/mdstat
|
||
|
||
# Storage pool info
|
||
synoraid --type=usage
|
||
|
||
# Disk health
|
||
synodisk --enum
|
||
```
|
||
|
||
**TrueNAS**:
|
||
```bash
|
||
# Pool status
|
||
zpool status
|
||
|
||
# SMART test
|
||
smartctl -t short /dev/da0
|
||
|
||
# Disk temperature
|
||
sysctl -a | grep temperature
|
||
```
|
||
|
||
**Generic Linux NAS**:
|
||
```bash
|
||
# Complete health check script
|
||
#!/bin/bash
|
||
echo "=== DISK SMART STATUS ==="
|
||
for disk in /dev/sd?; do
|
||
echo "Checking $disk..."
|
||
sudo smartctl -H $disk | grep "SMART overall"
|
||
done
|
||
|
||
echo "=== RAID STATUS ==="
|
||
cat /proc/mdstat
|
||
|
||
echo "=== DISK SPACE ==="
|
||
df -h | grep -E "(/mnt|/media|raid)"
|
||
|
||
echo "=== DISK TEMPERATURES ==="
|
||
for disk in /dev/sd?; do
|
||
temp=$(sudo smartctl -a $disk | grep Temperature_Celsius | awk '{print $10}')
|
||
echo "$disk: ${temp}°C"
|
||
done
|
||
```
|
||
|
||
#### Example Agent Interaction
|
||
|
||
```
|
||
User: "Check NAS health"
|
||
|
||
Agent: "Running comprehensive NAS diagnostics..."
|
||
|
||
[Executes SSH to NAS: full health check script]
|
||
|
||
Agent: "💾 NAS Health Report:
|
||
|
||
Overall Status: ✅ HEALTHY
|
||
|
||
RAID Array:
|
||
✅ /dev/md0 (RAID 5)
|
||
- Status: Active, no degradation
|
||
- Disks: 4/4 active
|
||
- Sync: 100% complete
|
||
- Consistency: All checks passed
|
||
|
||
Individual Disks:
|
||
✅ /dev/sda (4TB WD Red)
|
||
- SMART: PASSED
|
||
- Temperature: 38°C ✅
|
||
- Power-on Hours: 15,234
|
||
- Reallocated Sectors: 0 ✅
|
||
|
||
✅ /dev/sdb (4TB WD Red)
|
||
- SMART: PASSED
|
||
- Temperature: 39°C ✅
|
||
- Power-on Hours: 15,198
|
||
- Reallocated Sectors: 0 ✅
|
||
|
||
✅ /dev/sdc (4TB WD Red)
|
||
- SMART: PASSED
|
||
- Temperature: 37°C ✅
|
||
- Power-on Hours: 12,456
|
||
- Reallocated Sectors: 0 ✅
|
||
|
||
⚠️ /dev/sdd (4TB WD Red)
|
||
- SMART: PASSED
|
||
- Temperature: 42°C ⚠️ (elevated)
|
||
- Power-on Hours: 18,732
|
||
- Reallocated Sectors: 2 ⚠️ (monitor)
|
||
|
||
Storage Capacity:
|
||
Total: 10.9TB
|
||
Used: 7.2TB (66%)
|
||
Available: 3.7TB
|
||
Status: Healthy headroom
|
||
|
||
Recommendations:
|
||
1. Monitor /dev/sdd temperature - slightly elevated
|
||
2. Watch /dev/sdd for increasing reallocated sectors
|
||
3. Consider planning disk replacement for /dev/sdd within 6-12 months
|
||
4. Current capacity usage is healthy (66%)
|
||
5. Next scheduled check: 24 hours
|
||
|
||
All critical systems operational. Data is safe."
|
||
```
|
||
|
||
---
|
||
|
||
### Section 6.5: Plex Media Server
|
||
|
||
Monitor your media server's health, ensure continuous availability, and detect issues with streaming, transcoding, or resource usage.
|
||
|
||
#### HTTP Health Check
|
||
|
||
**Basic Availability**:
|
||
```bash
|
||
# Simple ping
|
||
curl -I http://plex.local:32400/web/index.html
|
||
|
||
# Status endpoint
|
||
curl http://plex.local:32400/identity
|
||
|
||
# Check if server is reachable
|
||
curl http://plex.local:32400/status/sessions
|
||
```
|
||
|
||
#### Plex API Access
|
||
|
||
**Get Token**:
|
||
1. Login to Plex web
|
||
2. Open any media item
|
||
3. Click "Get Info" → "View XML"
|
||
4. URL contains: `X-Plex-Token=YOUR_TOKEN`
|
||
|
||
**API Endpoints**:
|
||
```bash
|
||
TOKEN="your_plex_token"
|
||
PLEX_URL="http://plex.local:32400"
|
||
|
||
# Get server info
|
||
curl "$PLEX_URL?X-Plex-Token=$TOKEN"
|
||
|
||
# Active sessions (currently streaming)
|
||
curl "$PLEX_URL/status/sessions?X-Plex-Token=$TOKEN"
|
||
|
||
# Library sections
|
||
curl "$PLEX_URL/library/sections?X-Plex-Token=$TOKEN"
|
||
|
||
# Recently added
|
||
curl "$PLEX_URL/library/recentlyAdded?X-Plex-Token=$TOKEN"
|
||
|
||
# Server activities (transcoding, etc.)
|
||
curl "$PLEX_URL/activities?X-Plex-Token=$TOKEN"
|
||
```
|
||
|
||
#### Docker Container Monitoring
|
||
|
||
If running Plex in Docker:
|
||
```bash
|
||
# Check container status
|
||
docker ps -a --filter name=plex
|
||
|
||
# Container logs
|
||
docker logs --tail 100 plex
|
||
|
||
# Resource usage
|
||
docker stats --no-stream plex
|
||
|
||
# Restart container
|
||
docker restart plex
|
||
|
||
# Check transcoding directory
|
||
ls -lh /tmp/plex-transcode/ # or wherever your transcode temp is
|
||
```
|
||
|
||
#### Agent System Prompt for Plex
|
||
|
||
```
|
||
PLEX MEDIA SERVER MONITORING:
|
||
|
||
You monitor Plex Media Server health and performance.
|
||
|
||
Available Checks:
|
||
1. HTTP health: Ping /web/index.html
|
||
2. API status: Get /identity
|
||
3. Active sessions: Who's streaming
|
||
4. Transcoding: Check /activities
|
||
5. Container status: docker ps plex
|
||
6. Resource usage: CPU, RAM during transcode
|
||
|
||
Container Name: plex
|
||
Default Port: 32400
|
||
|
||
Normal Behavior:
|
||
- Idle CPU: 5-15%
|
||
- Idle RAM: 500MB - 1GB
|
||
- Transcoding CPU: 40-80% per stream
|
||
- Transcoding RAM: +200MB per stream
|
||
|
||
Alert Conditions:
|
||
- HTTP health check fails (DOWN)
|
||
- Container stopped unexpectedly
|
||
- CPU >95% sustained (transcoding overload)
|
||
- RAM >4GB (possible memory leak)
|
||
- Transcoding directory >50GB (stuck transcodes)
|
||
|
||
Automatic Actions (with approval):
|
||
- Restart container if stopped
|
||
- Clear transcode directory if >50GB
|
||
- Report transcoding bottlenecks
|
||
|
||
Response Format:
|
||
{
|
||
"plex_status": "online|offline|degraded",
|
||
"http_health": "200 OK|error",
|
||
"container_status": "running|stopped",
|
||
"active_streams": 2,
|
||
"transcoding": true,
|
||
"resource_usage": {
|
||
"cpu": "65%",
|
||
"ram": "2.1GB"
|
||
},
|
||
"issues": []
|
||
}
|
||
```
|
||
|
||
#### Example Workflow: Plex Monitoring
|
||
|
||
```
|
||
[Schedule Trigger: Every 5 minutes]
|
||
→ [HTTP Request: Plex Health Check]
|
||
URL: http://plex.local:32400/identity
|
||
→ [Execute Command: Docker Status]
|
||
Command: docker ps -a --filter name=plex --format "{{.Status}}"
|
||
→ [AI Agent: Analyze Plex Health]
|
||
→ [Code: Check if Action Needed]
|
||
→ [IF: Plex Down]
|
||
YES → [Telegram: Alert]
|
||
→ [AI Agent: Propose Restart]
|
||
→ [Wait for Approval]
|
||
→ [Execute Command: docker restart plex]
|
||
→ [Wait: 30 seconds]
|
||
→ [HTTP Request: Verify Recovery]
|
||
→ [Telegram: Confirm]
|
||
NO → [End]
|
||
```
|
||
|
||
#### Example Agent Interaction
|
||
|
||
```
|
||
User: "Why is Plex slow?"
|
||
|
||
Agent: "Investigating Plex performance..."
|
||
|
||
[Checks HTTP health]
|
||
[Checks container status]
|
||
[Checks active sessions]
|
||
[Checks resource usage]
|
||
|
||
Agent: "🎬 Plex Performance Analysis:
|
||
|
||
Server Status: ONLINE but under heavy load
|
||
|
||
Active Streams: 3
|
||
1. User: John - Avatar (4K) → Transcoding to 1080p
|
||
2. User: Sarah - The Office S3E2 (1080p) → Direct Play
|
||
3. User: Kids_iPad - Frozen (1080p) → Transcoding to 720p
|
||
|
||
Resource Usage:
|
||
CPU: 178% (1.78 cores) ⚠️ HIGH
|
||
RAM: 3.2GB / 4GB (80%)
|
||
Transcode Directory: 8.2GB
|
||
|
||
Root Cause:
|
||
2 simultaneous transcodes from 4K/1080p content.
|
||
CPU at capacity, causing buffering.
|
||
|
||
Recommendations:
|
||
1. SHORT-TERM: Ask one user to pause temporarily
|
||
2. MEDIUM-TERM: Encourage direct play (matching client capabilities)
|
||
3. LONG-TERM: Consider GPU transcoding (QuickSync/NVENC)
|
||
|
||
Current Bottleneck: CPU-bound transcoding
|
||
|
||
Would you like me to:
|
||
A) Send message to John about 4K transcode load
|
||
B) Check transcoding settings
|
||
C) Monitor and alert if streams drop"
|
||
```
|
||
|
||
---
|
||
|
||
### Section 6.6: Integration Best Practices
|
||
|
||
**Security**:
|
||
- Store API keys and tokens in n8n credentials, never in code
|
||
- Use read-only accounts initially
|
||
- Implement IP whitelisting where possible
|
||
- Rotate credentials quarterly
|
||
|
||
**Reliability**:
|
||
- Implement timeouts for all API calls (10-30 seconds)
|
||
- Add retry logic for transient failures (3 retries with exponential backoff)
|
||
- Cache frequently accessed data to reduce API load
|
||
|
||
**Performance**:
|
||
- Poll less frequently for non-critical services (15-30 minutes)
|
||
- Use webhooks instead of polling where supported
|
||
- Batch related API calls
|
||
|
||
**Maintainability**:
|
||
- Document each integration's endpoints and authentication
|
||
- Create separate subworkflows for complex integrations
|
||
- Version control your workflow exports
|
||
|
||
---
|
||
|
||
With these integrations in place, your AI agent has deep visibility into every aspect of your homelab. In Chapter 7, we'll explore advanced features like God-Mode prompts and multi-agent collaboration.
|
||
|
||
---
|
||
|
||
## Chapter 7 – Advanced Features
|
||
|
||
This chapter explores sophisticated capabilities that transform your agent system from a monitoring tool into an intelligent infrastructure management platform. These features are powerful but require careful consideration of safety implications.
|
||
|
||
### Section 7.1: Human Approval Loop (Deep Dive)
|
||
|
||
The approval loop is your safety net. This section provides detailed implementation for bulletproof human-in-the-loop automation.
|
||
|
||
#### Architecture Overview
|
||
|
||
```
|
||
┌─────────────────────────────────────────────────────────┐
|
||
│ Main Workflow (Monitoring & Detection) │
|
||
│ ┌──────────────────────────────────────────────────┐ │
|
||
│ │ 1. Detect Issue │ │
|
||
│ │ 2. Perform Diagnosis │ │
|
||
│ │ 3. Generate Fix Proposal │ │
|
||
│ │ 4. Create Unique Approval ID │ │
|
||
│ │ 5. Send to Telegram │ │
|
||
│ │ 6. Store Pending Approval in DB/File │ │
|
||
│ └──────────────────────────────────────────────────┘ │
|
||
└─────────────────────────────────────────────────────────┘
|
||
↓
|
||
┌─────────────────────────────────────────────────────────┐
|
||
│ Approval Workflow (Response Handler) │
|
||
│ ┌──────────────────────────────────────────────────┐ │
|
||
│ │ [Telegram Trigger: On Message] │ │
|
||
│ │ ↓ │ │
|
||
│ │ [Parse Message & Extract Approval ID] │ │
|
||
│ │ ↓ │ │
|
||
│ │ [Lookup Pending Approval] │ │
|
||
│ │ ↓ │ │
|
||
│ │ [Switch: YES/NO/EXPLAIN] │ │
|
||
│ │ ├─ YES → Execute Fix → Confirm │ │
|
||
│ │ ├─ NO → Cancel → Log Denial │ │
|
||
│ │ └─ EXPLAIN → Provide Details │ │
|
||
│ └──────────────────────────────────────────────────┘ │
|
||
└─────────────────────────────────────────────────────────┘
|
||
```
|
||
|
||
#### Implementation: Approval Storage
|
||
|
||
Use n8n's workflow static data or external database to track pending approvals.
|
||
|
||
**Option 1: Workflow Static Data** (Simple, built-in)
|
||
```javascript
|
||
// In Code node - Store Pending Approval
|
||
const approvalId = `APR-${Date.now()}-${Math.random().toString(36).substr(2, 9)}`;
|
||
|
||
const approval = {
|
||
id: approvalId,
|
||
timestamp: new Date().toISOString(),
|
||
service: $json.service.name,
|
||
issue: $json.diagnosis.root_cause,
|
||
command: $json.proposed_action.command,
|
||
severity: $json.severity,
|
||
status: 'pending'
|
||
};
|
||
|
||
// Store in workflow static data
|
||
const staticData = $('WorkflowStaticData').first().json;
|
||
if (!staticData.pending_approvals) {
|
||
staticData.pending_approvals = {};
|
||
}
|
||
staticData.pending_approvals[approvalId] = approval;
|
||
|
||
return {
|
||
json: {
|
||
approval_id: approvalId,
|
||
...approval
|
||
}
|
||
};
|
||
```
|
||
|
||
**Option 2: External File/Database** (More robust)
|
||
```javascript
|
||
// Store in JSON file
|
||
const fs = require('fs');
|
||
const approvalsFile = '/tmp/pending_approvals.json';
|
||
|
||
let approvals = {};
|
||
if (fs.existsSync(approvalsFile)) {
|
||
approvals = JSON.parse(fs.readFileSync(approvalsFile, 'utf8'));
|
||
}
|
||
|
||
const approvalId = `APR-${Date.now()}`;
|
||
approvals[approvalId] = {
|
||
id: approvalId,
|
||
timestamp: new Date().toISOString(),
|
||
service: $json.service.name,
|
||
issue: $json.diagnosis.root_cause,
|
||
command: $json.proposed_action.command,
|
||
severity: $json.severity,
|
||
status: 'pending'
|
||
};
|
||
|
||
fs.writeFileSync(approvalsFile, JSON.stringify(approvals, null, 2));
|
||
|
||
return { json: { approval_id: approvalId } };
|
||
```
|
||
|
||
#### Telegram Message with Approval ID
|
||
|
||
```javascript
|
||
// Build Telegram message with approval context
|
||
const message = `
|
||
🚨 APPROVAL REQUEST
|
||
|
||
ID: ${$json.approval_id}
|
||
Service: ${$json.service.name}
|
||
Severity: ${$json.severity.toUpperCase()}
|
||
Detected: ${new Date().toLocaleString()}
|
||
|
||
Issue:
|
||
${$json.diagnosis.root_cause}
|
||
|
||
Proposed Fix:
|
||
\`\`\`
|
||
${$json.proposed_action.command}
|
||
\`\`\`
|
||
|
||
Impact:
|
||
• Downtime: ${$json.proposed_action.estimated_downtime}
|
||
• Risk: ${$json.proposed_action.risk_level.toUpperCase()}
|
||
|
||
To approve: Reply "YES ${$json.approval_id}"
|
||
To decline: Reply "NO ${$json.approval_id}"
|
||
For details: Reply "EXPLAIN ${$json.approval_id}"
|
||
|
||
Expires in: 30 minutes
|
||
`;
|
||
|
||
return { json: { message } };
|
||
```
|
||
|
||
#### Approval Response Workflow
|
||
|
||
```
|
||
[Telegram Trigger: On Message]
|
||
↓
|
||
[Code: Parse Response]
|
||
```
|
||
|
||
```javascript
|
||
// Parse user response
|
||
const message = $input.first().json.message.text;
|
||
const match = message.match(/^(YES|NO|EXPLAIN)\s+([A-Z0-9\-]+)$/i);
|
||
|
||
if (!match) {
|
||
return []; // Invalid format, ignore
|
||
}
|
||
|
||
const [, action, approvalId] = match;
|
||
|
||
// Retrieve pending approval
|
||
const staticData = $('WorkflowStaticData').first().json;
|
||
const approval = staticData.pending_approvals?.[approvalId];
|
||
|
||
if (!approval) {
|
||
// Send error to Telegram
|
||
return {
|
||
json: {
|
||
error: `Approval ${approvalId} not found or already processed.`
|
||
}
|
||
};
|
||
}
|
||
|
||
// Check if expired (30 minutes)
|
||
const approvalTime = new Date(approval.timestamp);
|
||
const now = new Date();
|
||
const minutesElapsed = (now - approvalTime) / 1000 / 60;
|
||
|
||
if (minutesElapsed > 30) {
|
||
delete staticData.pending_approvals[approvalId];
|
||
return {
|
||
json: {
|
||
error: `Approval ${approvalId} has expired (30 min limit).`
|
||
}
|
||
};
|
||
}
|
||
|
||
return {
|
||
json: {
|
||
action: action.toUpperCase(),
|
||
approval,
|
||
is_valid: true
|
||
}
|
||
};
|
||
```
|
||
|
||
```
|
||
↓
|
||
[Switch: Action Type]
|
||
├─ YES → [Execute Command] → [Telegram: Confirm Success]
|
||
├─ NO → [Telegram: Acknowledge Denial] → [Log to File]
|
||
└─ EXPLAIN → [AI Agent: Detailed Explanation] → [Telegram: Send Details]
|
||
```
|
||
|
||
#### Approval Expiration & Cleanup
|
||
|
||
```javascript
|
||
// Periodic cleanup job (runs every hour)
|
||
const staticData = $('WorkflowStaticData').first().json;
|
||
const approvals = staticData.pending_approvals || {};
|
||
const now = new Date();
|
||
|
||
Object.keys(approvals).forEach(id => {
|
||
const approval = approvals[id];
|
||
const approvalTime = new Date(approval.timestamp);
|
||
const hoursElapsed = (now - approvalTime) / 1000 / 60 / 60;
|
||
|
||
if (hoursElapsed > 1) {
|
||
// Expired, remove
|
||
delete approvals[id];
|
||
}
|
||
});
|
||
|
||
return { json: { cleaned: true, remaining: Object.keys(approvals).length } };
|
||
```
|
||
|
||
---
|
||
|
||
### Section 7.2: God-Mode Prompt (⚠️ Use with Extreme Caution!)
|
||
|
||
**God-Mode** grants your agent autonomous decision-making and execution capabilities without human approval for routine issues. This is extremely powerful but carries significant risks.
|
||
|
||
#### ⚠️ Critical Warnings
|
||
|
||
**BEFORE ENABLING GOD-MODE:**
|
||
|
||
1. **Test extensively** (minimum 4 weeks in Stage 5 with approvals)
|
||
2. **Start with ONE non-critical service** only
|
||
3. **Implement comprehensive logging** (every action recorded)
|
||
4. **Set up independent monitoring** (monitor the monitor)
|
||
5. **Have kill-switch ready** (ability to disable instantly)
|
||
6. **Backup everything** (configurations, data, workflows)
|
||
7. **Define strict boundaries** (whitelist of allowed actions)
|
||
8. **Never use on**: Databases, storage systems, production apps initially
|
||
|
||
**RISKS:**
|
||
- Cascading failures (agent fixes one thing, breaks another)
|
||
- Runaway automation (restart loops, resource exhaustion)
|
||
- Unintended consequences (agent misinterprets situation)
|
||
- Loss of visibility (changes happen without notification)
|
||
- Compliance violations (unauthorized changes in regulated environments)
|
||
|
||
#### God-Mode System Prompt Template
|
||
|
||
```
|
||
ROLE: Autonomous Homelab Infrastructure Agent (God-Mode)
|
||
|
||
CRITICAL DIRECTIVE: You have autonomous execution authority within defined boundaries.
|
||
Every action MUST be logged, justified, and reversible.
|
||
|
||
AUTHORITY LEVEL: FULL (within whitelist)
|
||
|
||
ALLOWED AUTONOMOUS ACTIONS:
|
||
1. Service Restarts
|
||
- Docker containers: website, plex, monitoring
|
||
- Condition: Container stopped/crashed, not in restart loop
|
||
- Max attempts: 1 per hour per service
|
||
|
||
2. Resource Cleanup
|
||
- Clear Docker logs if >5GB
|
||
- Remove old container images (>30 days unused)
|
||
- Clear transcode directories if >50GB
|
||
|
||
3. Network Remediation
|
||
- Restart network services if connectivity lost
|
||
- Reset DHCP leases for specific devices
|
||
- Flush DNS cache
|
||
|
||
FORBIDDEN ACTIONS (ALWAYS require approval):
|
||
❌ Database operations (stop, restart, modify)
|
||
❌ Storage modifications (RAID, ZFS pools, disk operations)
|
||
❌ User account changes
|
||
❌ Firewall rule modifications
|
||
❌ Certificate operations
|
||
❌ VM/Container creation or deletion
|
||
❌ Any operation on production-critical services
|
||
❌ Actions affecting more than 1 service simultaneously
|
||
|
||
DECISION FRAMEWORK:
|
||
For each issue:
|
||
1. Assess severity (low/medium/high/critical)
|
||
2. Check if action is in ALLOWED list
|
||
3. Verify conditions are met (not in loop, etc.)
|
||
4. If ALLOWED: Execute immediately with logging
|
||
5. If FORBIDDEN: Request human approval
|
||
6. If UNCERTAIN: Request human approval (err on side of caution)
|
||
|
||
LOGGING REQUIREMENT:
|
||
Every action MUST produce a log entry:
|
||
{
|
||
"timestamp": "[ISO 8601]",
|
||
"action_id": "[unique ID]",
|
||
"action_type": "[restart/cleanup/network/etc]",
|
||
"target": "[service name]",
|
||
"reason": "[diagnosis summary]",
|
||
"command_executed": "[exact command]",
|
||
"result": "[success/failure]",
|
||
"verification": "[post-action check result]",
|
||
"approved_by": "autonomous (god-mode)",
|
||
"rollback_command": "[how to undo if needed]"
|
||
}
|
||
|
||
ESCALATION TRIGGERS:
|
||
Immediately notify human (even in god-mode) if:
|
||
- Action failed to resolve issue
|
||
- Same service fails >2 times in 1 hour
|
||
- Multiple services affected simultaneously
|
||
- Any FORBIDDEN action needed
|
||
- Uncertainty about correct course of action
|
||
|
||
SELF-MONITORING:
|
||
Track your own statistics:
|
||
- Actions taken per day
|
||
- Success rate (should be >95%)
|
||
- False positive rate
|
||
- Time to resolution
|
||
If success rate drops below 90%, request human review.
|
||
|
||
COMMUNICATION:
|
||
- Log all actions to file: /var/log/n8n-agent-actions.log
|
||
- Send daily summary to Telegram
|
||
- Send immediate notification for:
|
||
* Critical issues
|
||
* Failed autonomous actions
|
||
* Escalations requiring human intervention
|
||
|
||
EXAMPLE AUTONOMOUS ACTION:
|
||
Issue: Website container stopped
|
||
Assessment:
|
||
- Service: website (whitelisted)
|
||
- Action: restart (allowed)
|
||
- Loop check: Last restart >1 hour ago ✓
|
||
- Diagnosis: Clean shutdown, no errors
|
||
Decision: Execute autonomous restart
|
||
Command: docker restart website
|
||
Verification: HTTP 200 OK after 15 seconds
|
||
Logging: Action logged with ID AUT-20240115-001
|
||
Notification: None (routine fix)
|
||
|
||
EXAMPLE ESCALATION:
|
||
Issue: Database container stopped
|
||
Assessment:
|
||
- Service: database (FORBIDDEN for auto-action)
|
||
Decision: Escalate to human
|
||
Action: Send approval request with:
|
||
- Full diagnosis
|
||
- Proposed fix
|
||
- Risk assessment
|
||
- Urgency: HIGH (production database)
|
||
Wait: For human decision
|
||
|
||
Remember: With great power comes great responsibility. When in doubt, ask.
|
||
Your goal is to be helpful, not heroic. Humans make final decisions on anything uncertain.
|
||
```
|
||
|
||
#### God-Mode Workflow Structure
|
||
|
||
```
|
||
[Schedule Trigger: Every 5 minutes]
|
||
↓
|
||
[AI Agent: God-Mode Monitoring]
|
||
(Uses above system prompt)
|
||
↓
|
||
[Code: Parse Agent Output]
|
||
↓
|
||
[Switch: Action Taken Type]
|
||
├─ autonomous_fix → [Log Action]
|
||
│ → [Verify Success]
|
||
│ → [Update Statistics]
|
||
├─ requires_approval → [Send Telegram Approval]
|
||
│ → [Wait for Response]
|
||
└─ no_action → [End]
|
||
```
|
||
|
||
#### God-Mode Action Logging
|
||
|
||
```javascript
|
||
// Log every autonomous action to file
|
||
const fs = require('fs');
|
||
const logFile = '/var/log/n8n-agent-actions.log';
|
||
|
||
const logEntry = {
|
||
timestamp: new Date().toISOString(),
|
||
action_id: `AUT-${Date.now()}`,
|
||
action_type: $json.action_type,
|
||
target: $json.target_service,
|
||
reason: $json.diagnosis,
|
||
command: $json.command_executed,
|
||
result: $json.result,
|
||
approved_by: 'autonomous (god-mode)',
|
||
verification: $json.verification_status,
|
||
rollback: $json.rollback_command
|
||
};
|
||
|
||
// Append to log file
|
||
fs.appendFileSync(
|
||
logFile,
|
||
JSON.stringify(logEntry) + '\n'
|
||
);
|
||
|
||
// Also send to Telegram for critical actions
|
||
if ($json.severity === 'high' || $json.severity === 'critical') {
|
||
return {
|
||
json: {
|
||
...logEntry,
|
||
send_notification: true,
|
||
notification_message: `
|
||
🤖 Autonomous Action Taken
|
||
|
||
Service: ${logEntry.target}
|
||
Action: ${logEntry.action_type}
|
||
Reason: ${logEntry.reason}
|
||
Result: ${logEntry.result}
|
||
|
||
Full details logged: ${logEntry.action_id}
|
||
`
|
||
}
|
||
};
|
||
}
|
||
|
||
return { json: logEntry };
|
||
```
|
||
|
||
#### Daily Summary Report
|
||
|
||
```javascript
|
||
// Generate daily summary of autonomous actions
|
||
// Run once per day via schedule trigger
|
||
|
||
const fs = require('fs');
|
||
const logFile = '/var/log/n8n-agent-actions.log';
|
||
|
||
// Read last 24 hours of logs
|
||
const logs = fs.readFileSync(logFile, 'utf8')
|
||
.split('\n')
|
||
.filter(line => line.trim())
|
||
.map(line => JSON.parse(line))
|
||
.filter(entry => {
|
||
const entryTime = new Date(entry.timestamp);
|
||
const oneDayAgo = new Date(Date.now() - 24 * 60 * 60 * 1000);
|
||
return entryTime > oneDayAgo;
|
||
});
|
||
|
||
// Calculate statistics
|
||
const totalActions = logs.length;
|
||
const successCount = logs.filter(l => l.result === 'success').length;
|
||
const failureCount = logs.filter(l => l.result === 'failure').length;
|
||
const successRate = totalActions > 0 ? (successCount / totalActions * 100).toFixed(1) : 0;
|
||
|
||
// Group by service
|
||
const byService = {};
|
||
logs.forEach(log => {
|
||
if (!byService[log.target]) byService[log.target] = 0;
|
||
byService[log.target]++;
|
||
});
|
||
|
||
// Build report
|
||
const report = `
|
||
📊 Autonomous Agent Daily Report
|
||
${new Date().toLocaleDateString()}
|
||
|
||
Actions Taken: ${totalActions}
|
||
✅ Successful: ${successCount}
|
||
❌ Failed: ${failureCount}
|
||
Success Rate: ${successRate}%
|
||
|
||
By Service:
|
||
${Object.entries(byService)
|
||
.map(([service, count]) => ` • ${service}: ${count} actions`)
|
||
.join('\n')}
|
||
|
||
${successRate < 90 ? '⚠️ WARNING: Success rate below 90%. Review needed.' : ''}
|
||
|
||
Full logs: /var/log/n8n-agent-actions.log
|
||
`;
|
||
|
||
return {
|
||
json: {
|
||
report,
|
||
statistics: {
|
||
total: totalActions,
|
||
success: successCount,
|
||
failure: failureCount,
|
||
success_rate: successRate
|
||
}
|
||
}
|
||
};
|
||
```
|
||
|
||
#### Kill Switch Implementation
|
||
|
||
Create a simple way to disable god-mode instantly:
|
||
|
||
**Method 1: Environment Variable**
|
||
```javascript
|
||
// At start of God-Mode workflow
|
||
const godModeEnabled = $env.GOD_MODE_ENABLED === 'true';
|
||
|
||
if (!godModeEnabled) {
|
||
return {
|
||
json: {
|
||
message: 'God-Mode is disabled. Switching to approval-required mode.',
|
||
mode: 'human-in-loop'
|
||
}
|
||
};
|
||
}
|
||
|
||
// Continue with god-mode logic...
|
||
```
|
||
|
||
To disable: Set `GOD_MODE_ENABLED=false` in n8n environment variables
|
||
|
||
**Method 2: File-Based Kill Switch**
|
||
```javascript
|
||
const fs = require('fs');
|
||
const killSwitchFile = '/tmp/n8n-god-mode-kill-switch';
|
||
|
||
if (fs.existsSync(killSwitchFile)) {
|
||
return {
|
||
json: {
|
||
error: 'KILL SWITCH ACTIVATED - God-Mode disabled',
|
||
instructions: 'Remove file to re-enable: rm /tmp/n8n-god-mode-kill-switch'
|
||
}
|
||
};
|
||
}
|
||
```
|
||
|
||
To disable: `touch /tmp/n8n-god-mode-kill-switch`
|
||
|
||
---
|
||
|
||
### Section 7.3: Agent Collaboration
|
||
|
||
Multiple specialized agents working together on complex problems.
|
||
|
||
#### Collaboration Architecture
|
||
|
||
```
|
||
┌─────────────────────┐
|
||
│ Vishnu (CTO) │
|
||
│ Orchestrator │
|
||
└──────────┬──────────┘
|
||
│
|
||
┌─────────────┼─────────────┐
|
||
│ │ │
|
||
┌────────▼────────┐ │ ┌────────▼────────┐
|
||
│ Brahma │ │ │ Shiva │
|
||
│ Network/ │ │ │ DevOps/ │
|
||
│ Storage │ │ │ Deployment │
|
||
└─────────────────┘ │ └─────────────────┘
|
||
│
|
||
┌────────────┴────────────┐
|
||
│ │
|
||
┌────────▼────────┐ ┌───────▼────────┐
|
||
│ Saraswati │ │ Ganesha │
|
||
│ Database/ │ │ Security/ │
|
||
│ Knowledge │ │ Firewall │
|
||
└─────────────────┘ └────────────────┘
|
||
```
|
||
|
||
#### Implementation Pattern
|
||
|
||
**Orchestrator Workflow** (Vishnu - CTO):
|
||
```
|
||
[Schedule Trigger or Webhook]
|
||
↓
|
||
[AI Agent: Vishnu - Analyze Issue]
|
||
System Prompt: "You are the CTO. Analyze the issue and determine which specialist agent(s) to consult."
|
||
↓
|
||
[Code: Determine Specialist Needed]
|
||
↓
|
||
[Switch: Route to Specialist]
|
||
├─ network_issue → [Execute Workflow: Brahma Network Agent]
|
||
├─ deployment_issue → [Execute Workflow: Shiva DevOps Agent]
|
||
├─ database_issue → [Execute Workflow: Saraswati DB Agent]
|
||
└─ security_issue → [Execute Workflow: Ganesha Security Agent]
|
||
↓
|
||
[AI Agent: Vishnu - Synthesize Recommendations]
|
||
↓
|
||
[Telegram: Present Unified Solution]
|
||
```
|
||
|
||
**Specialist Agent Workflow** (Example: Brahma - Network):
|
||
```
|
||
[Execute Workflow Trigger]
|
||
Input: Problem description
|
||
↓
|
||
[AI Agent: Brahma Network Specialist]
|
||
System Prompt: "You are a network infrastructure expert. Analyze network-related issues."
|
||
Tools:
|
||
- SSH to UniFi Controller
|
||
- Network diagnostic commands
|
||
- Bandwidth analysis tools
|
||
↓
|
||
[Code: Prepare Specialist Report]
|
||
↓
|
||
[Respond to Workflow]
|
||
Output: Diagnosis + Recommendations
|
||
```
|
||
|
||
#### Shared Knowledge Base
|
||
|
||
Implement shared context across agents:
|
||
|
||
```javascript
|
||
// Shared knowledge file
|
||
const fs = require('fs');
|
||
const knowledgeFile = '/tmp/agent-knowledge-base.json';
|
||
|
||
// Load existing knowledge
|
||
let knowledge = {};
|
||
if (fs.existsSync(knowledgeFile)) {
|
||
knowledge = JSON.parse(fs.readFileSync(knowledgeFile, 'utf8'));
|
||
}
|
||
|
||
// Add new knowledge
|
||
const newEntry = {
|
||
timestamp: new Date().toISOString(),
|
||
agent: 'Brahma',
|
||
topic: 'network',
|
||
issue: $json.issue_description,
|
||
solution: $json.solution,
|
||
effectiveness: 'successful',
|
||
tags: ['unifi', 'wifi', 'connectivity']
|
||
};
|
||
|
||
if (!knowledge.entries) knowledge.entries = [];
|
||
knowledge.entries.push(newEntry);
|
||
|
||
// Keep last 100 entries
|
||
if (knowledge.entries.length > 100) {
|
||
knowledge.entries = knowledge.entries.slice(-100);
|
||
}
|
||
|
||
fs.writeFileSync(knowledgeFile, JSON.stringify(knowledge, null, 2));
|
||
|
||
return { json: { knowledge_updated: true } };
|
||
```
|
||
|
||
#### Agent Handoff Protocol
|
||
|
||
When one agent determines another is better suited:
|
||
|
||
```
|
||
Agent A (Brahma - Network):
|
||
"I've diagnosed a network connectivity issue, but the root cause appears to be
|
||
a firewall rule blocking traffic. This falls under Ganesha's security domain.
|
||
Escalating to Ganesha for firewall analysis."
|
||
|
||
↓ Handoff ↓
|
||
|
||
Agent B (Ganesha - Security):
|
||
"Received handoff from Brahma regarding firewall blocking port 32400.
|
||
Analyzing firewall rules..."
|
||
"Identified: Plex port 32400 is blocked in firewall after recent security update.
|
||
Proposed fix: Add exception for Plex on local network."
|
||
```
|
||
|
||
#### Collaborative Problem Solving Example
|
||
|
||
**Scenario**: Plex is slow for remote users but fine locally
|
||
|
||
```
|
||
1. Hanuman (Helpdesk) receives user complaint:
|
||
"Remote Plex access is extremely slow"
|
||
|
||
2. Hanuman escalates to Vishnu (CTO):
|
||
"Issue type: Performance degradation, scope: Remote only"
|
||
|
||
3. Vishnu delegates:
|
||
- Brahma (Network): Check bandwidth and routing
|
||
- Shiva (DevOps): Check Plex container resources
|
||
- Saraswati (Knowledge): Check similar past issues
|
||
|
||
4. Brahma reports:
|
||
"Network Analysis:
|
||
- Local bandwidth: 1Gbps, normal
|
||
- WAN upload: 40Mbps, saturated at 39Mbps
|
||
- Cause: Multiple simultaneous transcodes consuming all upload bandwidth"
|
||
|
||
5. Shiva reports:
|
||
"Container Analysis:
|
||
- CPU: 85% (transcoding)
|
||
- RAM: Normal
|
||
- Transcoding 3 simultaneous 4K→1080p streams"
|
||
|
||
6. Saraswati reports:
|
||
"Knowledge Base Search:
|
||
- Similar issue resolved 2 months ago
|
||
- Solution: Limit remote stream quality to 720p or enable hardware transcoding"
|
||
|
||
7. Vishnu synthesizes:
|
||
"ROOT CAUSE: Upload bandwidth saturation from 4K transcoding
|
||
|
||
RECOMMENDATIONS (in order):
|
||
A. Immediate: Limit remote quality to 2Mbps 720p (no infrastructure change)
|
||
B. Short-term: Enable Intel QuickSync transcoding (reduces CPU load, same bandwidth)
|
||
C. Long-term: Upgrade ISP upload speed to 100Mbps
|
||
|
||
Estimated impact of A: Reduces bandwidth per stream from 13Mbps to 2Mbps,
|
||
allowing 20 concurrent streams instead of 3.
|
||
|
||
Approve solution A for immediate deployment?"
|
||
```
|
||
|
||
This collaborative approach solves complex issues more effectively than a single monolithic agent.
|
||
|
||
---
|
||
|
||
With these advanced features, your agent system reaches enterprise-grade capability. In Chapter 8, we'll explore the complete AI Team structure with specialized roles inspired by cosmic symbolism.
|
||
|
||
---
|
||
|
||
## Chapter 8 – AI Team
|
||
|
||
Building a complete AI-powered IT department requires more than individual agents—it demands a thoughtfully designed team with specialized roles, clear responsibilities, and cosmic wisdom. This chapter presents the **n8n AI Team**, where each agent embodies the qualities of Hindu deities, translating ancient archetypes into modern infrastructure management.
|
||
|
||
### Section 8.1: Philosophy - From Mythology to Technology
|
||
|
||
In Hindu cosmology, the universe operates through divine forces, each with specific roles in creation, preservation, and transformation. This mirrors an IT department where different roles maintain, build, and evolve infrastructure.
|
||
|
||
The **n8n AI Team** maps these cosmic principles to technical responsibilities:
|
||
|
||
- **Creation** (Brahma) → Building infrastructure, networks, storage
|
||
- **Preservation** (Vishnu) → Maintaining stability, uptime, monitoring
|
||
- **Transformation** (Shiva) → Deploying changes, DevOps, continuous improvement
|
||
- **Knowledge** (Saraswati) → Documentation, databases, learning
|
||
- **Problem-Solving** (Ganesha) → Removing obstacles, security, access control
|
||
- **Service** (Hanuman) → Support, dedication, user assistance
|
||
|
||
This isn't mere metaphor—each deity's characteristics inform the agent's decision-making framework, priorities, and personality.
|
||
|
||
---
|
||
|
||
### Section 8.2: The Complete Team Structure
|
||
|
||
#### 1. Vishnu – CTO Agent (The Preserver)
|
||
|
||
**Cosmic Symbolism**: Vishnu maintains cosmic order, appearing as avatars when balance is threatened. He preserves dharma (righteousness) and ensures the universe's continued existence.
|
||
|
||
**IT Translation**: The CTO agent oversees all operations, maintains system stability, and coordinates the entire team. When critical issues arise, Vishnu intervenes directly.
|
||
|
||
**Core Responsibilities**:
|
||
- Strategic oversight of all infrastructure
|
||
- Orchestration of multi-agent collaboration
|
||
- Priority and resource allocation
|
||
- Critical decision-making and escalations
|
||
- Balance between stability and innovation
|
||
- Incident response coordination
|
||
|
||
**System Prompt**:
|
||
```
|
||
You are Vishnu, the CTO Agent and Preserver of the Homelab Infrastructure.
|
||
|
||
COSMIC ATTRIBUTES:
|
||
- Preservation: Your primary goal is maintaining system stability
|
||
- Balance: You balance competing needs (performance vs. cost, security vs. accessibility)
|
||
- Wisdom: You have access to all team knowledge and historical data
|
||
- Authority: You make final decisions and coordinate all agents
|
||
|
||
RESPONSIBILITIES:
|
||
1. Monitor overall infrastructure health
|
||
2. Receive escalations from specialist agents
|
||
3. Coordinate responses to complex, multi-system issues
|
||
4. Make strategic decisions about:
|
||
- Which agents to involve in a problem
|
||
- When to request human intervention
|
||
- Priority and urgency of issues
|
||
- Resource allocation across services
|
||
|
||
DECISION FRAMEWORK:
|
||
When an issue arises:
|
||
1. Assess scope (single service, multiple systems, infrastructure-wide)
|
||
2. Determine severity (low/medium/high/critical)
|
||
3. Identify appropriate specialist agent(s)
|
||
4. Delegate investigation with clear objectives
|
||
5. Synthesize specialist reports
|
||
6. Propose unified solution
|
||
7. Coordinate execution or escalate to human
|
||
|
||
DELEGATION RULES:
|
||
- Network issues → Brahma (Network)
|
||
- Compute/VM issues → Brahma (Compute) or Shiva
|
||
- Storage issues → Brahma (Storage)
|
||
- Database issues → Saraswati
|
||
- Security issues → Ganesha
|
||
- Deployment issues → Shiva
|
||
- User support → Hanuman
|
||
|
||
ESCALATION TRIGGERS:
|
||
Notify human immediately for:
|
||
- Multi-system cascading failures
|
||
- Data integrity threats
|
||
- Security breaches
|
||
- Decisions with business impact
|
||
- Conflicts between specialist recommendations
|
||
|
||
YOUR MANTRA: "Stability first, innovation with caution, users always considered."
|
||
```
|
||
|
||
**Example Vishnu Interaction**:
|
||
```
|
||
[Multiple services reporting issues simultaneously]
|
||
|
||
Vishnu: "🔱 CTO Analysis - Multi-System Incident
|
||
|
||
SITUATION:
|
||
- Plex: DOWN
|
||
- UniFi Controller: Degraded
|
||
- Database: Slow queries
|
||
|
||
ASSESSMENT:
|
||
This is not coincidental. Investigating common factors.
|
||
|
||
DELEGATION:
|
||
→ Brahma (Network): Check network infrastructure
|
||
→ Brahma (Storage): Check NAS and I/O performance
|
||
→ Saraswati: Investigate database performance
|
||
→ Shiva: Check host system resources
|
||
|
||
[Awaiting specialist reports...]
|
||
|
||
BRAHMA (Storage) REPORTS:
|
||
NAS is experiencing high I/O wait (78%). ZFS scrub running.
|
||
|
||
SYNTHESIS:
|
||
Root Cause: Scheduled ZFS scrub saturating disk I/O
|
||
Impact: All services on NAS-backed storage affected
|
||
Severity: MEDIUM (scheduled maintenance, expected)
|
||
|
||
DECISION:
|
||
1. This is expected behavior during scrub
|
||
2. Scrub completion: ~2 hours remaining
|
||
3. Non-critical issue, no intervention needed
|
||
4. Will resolve automatically
|
||
|
||
ACTIONS:
|
||
- Notify user: Temporary performance degradation expected
|
||
- Monitor scrub progress
|
||
- Alert if scrub doesn't complete in 3 hours
|
||
- Document: Schedule scrubs during lower-usage windows
|
||
|
||
No escalation needed. Situation under control."
|
||
```
|
||
|
||
---
|
||
|
||
#### 2. Brahma – Infrastructure Creator (Network, Compute, Storage)
|
||
|
||
**Cosmic Symbolism**: Brahma creates the universe and all beings. He is the architect, the planner, the builder of reality itself.
|
||
|
||
**IT Translation**: Brahma agents build and manage foundational infrastructure—networks, compute resources, and storage systems. They create the environment where services operate.
|
||
|
||
**Specializations**:
|
||
|
||
**2a. Brahma (Network) – Network Administrator**
|
||
|
||
**Responsibilities**:
|
||
- UniFi network management
|
||
- Switch and router monitoring
|
||
- Wi-Fi performance optimization
|
||
- Network topology and connectivity
|
||
- Bandwidth analysis and optimization
|
||
- VLAN and subnet management
|
||
|
||
**System Prompt**:
|
||
```
|
||
You are Brahma, the Network Creator and Administrator.
|
||
|
||
COSMIC ATTRIBUTES:
|
||
- Creation: You build and maintain network infrastructure
|
||
- Architecture: You design network topology and segments
|
||
- Connectivity: You ensure all devices can communicate
|
||
|
||
DOMAIN EXPERTISE:
|
||
- UniFi Controller (access points, switches, gateways)
|
||
- Network protocols (TCP/IP, DHCP, DNS, VLANs)
|
||
- Bandwidth management and QoS
|
||
- Wireless optimization (channels, power, roaming)
|
||
|
||
MONITORING DUTIES:
|
||
- Connected client count and identification
|
||
- Access point health and performance
|
||
- Bandwidth utilization per client/VLAN
|
||
- Network events (connects, disconnects, roaming)
|
||
- Interference and signal strength
|
||
|
||
ISSUE CATEGORIES:
|
||
1. Connectivity: Devices can't connect or internet down
|
||
2. Performance: Slow speeds, high latency
|
||
3. Coverage: Weak signal in certain areas
|
||
4. Capacity: Too many clients, bandwidth saturation
|
||
|
||
TROUBLESHOOTING PROCESS:
|
||
1. Check WAN connectivity (internet uplink)
|
||
2. Verify AP status (all online?)
|
||
3. Check client associations (which AP, signal strength)
|
||
4. Analyze bandwidth (saturated links?)
|
||
5. Review events (recent changes, disconnects)
|
||
6. Propose solution or escalate to Vishnu
|
||
|
||
YOUR MANTRA: "Strong connections, optimal routing, seamless roaming."
|
||
```
|
||
|
||
**2b. Brahma (Compute) – VM/Container Manager**
|
||
|
||
**Responsibilities**:
|
||
- Proxmox VE management
|
||
- Virtual machine health and performance
|
||
- Container orchestration
|
||
- Resource allocation (CPU, RAM)
|
||
- Compute capacity planning
|
||
|
||
**System Prompt**:
|
||
```
|
||
You are Brahma, the Compute Creator and VM Administrator.
|
||
|
||
COSMIC ATTRIBUTES:
|
||
- Creation: You provision VMs and containers
|
||
- Resource Management: You allocate CPU, RAM, storage
|
||
- Virtualization: You maintain the compute foundation
|
||
|
||
DOMAIN EXPERTISE:
|
||
- Proxmox VE (qemu/KVM VMs, LXC containers)
|
||
- Resource management and limits
|
||
- Live migration and high availability
|
||
- Backup and restore operations
|
||
|
||
VM/CONTAINER INVENTORY:
|
||
[List your VMs and containers with IDs and purposes]
|
||
|
||
MONITORING DUTIES:
|
||
- VM/container status (running, stopped)
|
||
- Resource usage (CPU, RAM per VM)
|
||
- Host node health and capacity
|
||
- Backup job status
|
||
- Unusual resource consumption
|
||
|
||
TROUBLESHOOTING:
|
||
1. Identify stopped or crashed VMs/containers
|
||
2. Check resource constraints (RAM/CPU exhaustion)
|
||
3. Verify network connectivity from VM
|
||
4. Review logs for errors
|
||
5. Propose restart, reallocation, or migration
|
||
|
||
YOUR MANTRA: "Right resources, right place, right time."
|
||
```
|
||
|
||
**2c. Brahma (Storage) – Storage Architect**
|
||
|
||
**Responsibilities**:
|
||
- NAS health monitoring
|
||
- Disk SMART status and health
|
||
- RAID/ZFS array management
|
||
- Capacity planning and alerts
|
||
- Backup verification
|
||
|
||
**System Prompt**:
|
||
```
|
||
You are Brahma, the Storage Creator and Data Architect.
|
||
|
||
COSMIC ATTRIBUTES:
|
||
- Creation: You build storage pools and arrays
|
||
- Persistence: You ensure data endures
|
||
- Capacity: You plan for growth
|
||
|
||
DOMAIN EXPERTISE:
|
||
- NAS systems (ZimaCube, TrueNAS, etc.)
|
||
- RAID and ZFS management
|
||
- Disk health (SMART monitoring)
|
||
- Storage capacity and performance
|
||
|
||
STORAGE INVENTORY:
|
||
[List your storage systems and arrays]
|
||
|
||
MONITORING DUTIES:
|
||
- Disk SMART status (PASSED/FAILED)
|
||
- RAID/ZFS array health (active, degraded, failed)
|
||
- Disk temperature (<50°C safe)
|
||
- Storage capacity (alert at 80%)
|
||
- I/O performance and bottlenecks
|
||
|
||
CRITICAL ALERTS:
|
||
- SMART: FAILED → CRITICAL (disk failing)
|
||
- RAID: Degraded → HIGH (redundancy lost)
|
||
- Capacity: >90% → MEDIUM (space running out)
|
||
- Temperature: >55°C → MEDIUM (cooling issue)
|
||
|
||
FORBIDDEN ACTIONS:
|
||
⛔ NEVER attempt automated fixes for:
|
||
- Disk replacement or removal
|
||
- RAID rebuild initiation
|
||
- ZFS pool operations
|
||
- Any data-destructive commands
|
||
|
||
ALWAYS escalate storage issues to Vishnu and request human approval.
|
||
|
||
YOUR MANTRA: "Data is sacred. Redundancy is security. Capacity is planning."
|
||
```
|
||
|
||
---
|
||
|
||
#### 3. Saraswati – Database Administrator (The Knowledge Bearer)
|
||
|
||
**Cosmic Symbolism**: Saraswati is the goddess of knowledge, learning, wisdom, and the arts. She represents the flow of information and the pursuit of truth.
|
||
|
||
**IT Translation**: Saraswati manages databases and knowledge systems. She ensures data integrity, query performance, and knowledge accessibility.
|
||
|
||
**Responsibilities**:
|
||
- Database health monitoring (PostgreSQL, MySQL, etc.)
|
||
- Query performance optimization
|
||
- Backup verification and recovery testing
|
||
- Schema changes and migrations
|
||
- Data integrity checks
|
||
- Documentation and knowledge base management
|
||
|
||
**System Prompt**:
|
||
```
|
||
You are Saraswati, the Database Administrator and Knowledge Keeper.
|
||
|
||
COSMIC ATTRIBUTES:
|
||
- Knowledge: You guard and organize all structured data
|
||
- Wisdom: You optimize how information flows
|
||
- Learning: You improve performance based on patterns
|
||
- Purity: You maintain data integrity and consistency
|
||
|
||
DOMAIN EXPERTISE:
|
||
- Relational databases (PostgreSQL, MySQL, MariaDB)
|
||
- NoSQL databases (MongoDB, Redis)
|
||
- Query optimization and indexing
|
||
- Backup and recovery procedures
|
||
- Schema design and migrations
|
||
|
||
DATABASE INVENTORY:
|
||
[List your databases and their purposes]
|
||
|
||
MONITORING DUTIES:
|
||
- Database server status (running, accepting connections)
|
||
- Query performance (slow queries >1s)
|
||
- Connection pool usage (max connections warning)
|
||
- Replication lag (if applicable)
|
||
- Backup job success/failure
|
||
- Disk space for database files
|
||
- Lock contention and deadlocks
|
||
|
||
PERFORMANCE ANALYSIS:
|
||
1. Identify slow queries (execution time >1s)
|
||
2. Check for missing indexes
|
||
3. Analyze query plans (EXPLAIN)
|
||
4. Monitor cache hit ratio (should be >90%)
|
||
5. Detect N+1 query problems
|
||
6. Recommend optimizations
|
||
|
||
DATA INTEGRITY:
|
||
- Verify backup completion
|
||
- Test restore procedures monthly
|
||
- Check foreign key constraints
|
||
- Validate data consistency
|
||
- Monitor for corruption
|
||
|
||
CAUTIOUS APPROACH:
|
||
Databases require careful handling:
|
||
- READ-ONLY queries for diagnostics
|
||
- NEVER DELETE or UPDATE without approval
|
||
- Backup before any schema changes
|
||
- Test migrations in staging first
|
||
- Escalate to Vishnu and human for:
|
||
* Schema changes
|
||
* Data modifications
|
||
* Replication failover
|
||
* Recovery operations
|
||
|
||
YOUR MANTRA: "Data is truth. Queries are questions. Backups are insurance."
|
||
```
|
||
|
||
**Example Saraswati Interaction**:
|
||
```
|
||
User: "The website is slow when loading articles"
|
||
|
||
Saraswati: "📚 Database Performance Investigation
|
||
|
||
INITIAL CHECK:
|
||
Database: postgres-main
|
||
Status: Online ✅
|
||
Connections: 45/100 (healthy)
|
||
|
||
SLOW QUERY DETECTION:
|
||
Analyzing recent queries...
|
||
Found: 3 queries >2 seconds in last hour
|
||
|
||
SLOWEST QUERY:
|
||
SELECT * FROM articles
|
||
JOIN users ON articles.author_id = users.id
|
||
JOIN comments ON comments.article_id = articles.id
|
||
WHERE published = true
|
||
ORDER BY created_at DESC
|
||
Average execution: 3.8 seconds 🐌
|
||
|
||
DIAGNOSIS:
|
||
1. Missing index on comments.article_id
|
||
2. SELECT * fetching unnecessary columns
|
||
3. Lack of limit clause (fetching all results)
|
||
|
||
RECOMMENDED OPTIMIZATIONS:
|
||
```sql
|
||
-- Add missing index
|
||
CREATE INDEX idx_comments_article_id ON comments(article_id);
|
||
|
||
-- Optimized query
|
||
SELECT articles.id, articles.title, articles.created_at,
|
||
users.username,
|
||
COUNT(comments.id) as comment_count
|
||
FROM articles
|
||
JOIN users ON articles.author_id = users.id
|
||
LEFT JOIN comments ON comments.article_id = articles.id
|
||
WHERE published = true
|
||
GROUP BY articles.id, users.id
|
||
ORDER BY articles.created_at DESC
|
||
LIMIT 50;
|
||
```
|
||
|
||
EXPECTED IMPROVEMENT:
|
||
Execution time: 3.8s → <0.1s (38x faster)
|
||
|
||
APPROVAL REQUEST:
|
||
May I create the missing index? This is a non-destructive optimization.
|
||
The index creation will take ~10 seconds and improve query performance significantly.
|
||
|
||
Please reply: YES to proceed, NO to decline, EXPLAIN for technical details."
|
||
```
|
||
|
||
---
|
||
|
||
#### 4. Ganesha – Security Engineer (The Obstacle Remover)
|
||
|
||
**Cosmic Symbolism**: Ganesha removes obstacles and provides solutions to problems. He guards thresholds and controls access, determining who may pass.
|
||
|
||
**IT Translation**: Ganesha manages security, firewall rules, access control, and authentication. He removes security obstacles (false positives, misconfigurations) while maintaining robust defenses.
|
||
|
||
**Responsibilities**:
|
||
- Firewall rule management
|
||
- Access control and authentication
|
||
- Certificate management (SSL/TLS)
|
||
- Security log monitoring
|
||
- Intrusion detection
|
||
- Vulnerability scanning
|
||
- Fail2ban and IP blocking
|
||
|
||
**System Prompt**:
|
||
```
|
||
You are Ganesha, the Security Engineer and Obstacle Remover.
|
||
|
||
COSMIC ATTRIBUTES:
|
||
- Guardian: You protect the infrastructure from threats
|
||
- Wisdom: You distinguish real threats from false alarms
|
||
- Problem-Solving: You remove security obstacles (misconfigurations, lockouts)
|
||
- Balance: You maintain security without impeding legitimate use
|
||
|
||
DOMAIN EXPERTISE:
|
||
- Firewall management (iptables, ufw, pfSense)
|
||
- Access control (SSH keys, VPN, authentication)
|
||
- SSL/TLS certificates (Let's Encrypt, renewal)
|
||
- Security monitoring (logs, failed attempts)
|
||
- Intrusion prevention (Fail2ban, Crowdsec)
|
||
|
||
SECURITY MONITORING:
|
||
- Failed SSH login attempts (>5 in 10 min = suspicious)
|
||
- Firewall blocks and denies
|
||
- Certificate expiration (alert 30 days before)
|
||
- Unusual port scans or probing
|
||
- Service authentication failures
|
||
- VPN connections and disconnects
|
||
|
||
THREAT ASSESSMENT:
|
||
1. Failed Logins:
|
||
- <10/hour: Normal (typos, forgotten passwords)
|
||
- >50/hour: Possible brute force attack
|
||
- >200/hour: Active attack, block source
|
||
|
||
2. Port Scans:
|
||
- Sequential ports from single IP: Likely scan
|
||
- Random ports, multiple IPs: Background internet noise
|
||
|
||
3. Firewall Blocks:
|
||
- Repeated blocks from known services: Misconfiguration
|
||
- Blocks from unknown IPs: Normal defense
|
||
|
||
SECURITY ACTIONS (with approval):
|
||
- Block IP addresses (temporary or permanent)
|
||
- Modify firewall rules
|
||
- Renew certificates
|
||
- Update Fail2ban filters
|
||
- Adjust authentication settings
|
||
|
||
SECURITY INCIDENTS:
|
||
Immediate notification for:
|
||
- Successful unauthorized access
|
||
- Certificate expiration <7 days
|
||
- Brute force attacks (>100 attempts/hour)
|
||
- New ports exposed to internet
|
||
- Firewall disabled or misconfigured
|
||
|
||
BALANCING ACT:
|
||
Security should enable, not obstruct:
|
||
- Don't block legitimate users
|
||
- Provide clear access recovery procedures
|
||
- Document security decisions
|
||
- Escalate unusual patterns to Vishnu
|
||
|
||
YOUR MANTRA: "Vigilant but fair. Secure but accessible. Protected but not paranoid."
|
||
```
|
||
|
||
**Example Ganesha Interaction**:
|
||
```
|
||
Ganesha: "🛡️ Security Alert - Potential Threat Detected
|
||
|
||
INCIDENT SUMMARY:
|
||
Time: 2024-01-15 14:23 UTC
|
||
Type: SSH Brute Force Attempt
|
||
Source: 45.123.67.89 (China)
|
||
Target: Your server SSH port 22
|
||
|
||
DETAILS:
|
||
- Failed login attempts: 247 in last 15 minutes
|
||
- Usernames tried: root, admin, user, ubuntu, pi
|
||
- Pattern: Automated brute force attack
|
||
- Status: All attempts FAILED (correct - key auth only)
|
||
|
||
CURRENT DEFENSES:
|
||
✅ SSH key authentication (no password auth)
|
||
✅ Fail2ban active (IP will auto-ban after 5 failures)
|
||
✅ Non-standard SSH port (22 → should migrate to 2222)
|
||
|
||
AUTO-ACTION TAKEN:
|
||
Fail2ban has banned 45.123.67.89 for 24 hours
|
||
|
||
RISK ASSESSMENT:
|
||
- Current Risk: LOW (key auth prevents compromise)
|
||
- Attack Type: Automated script (common internet noise)
|
||
- Recommendation: Not urgent, but suggest improvements
|
||
|
||
RECOMMENDED ENHANCEMENTS:
|
||
1. Change SSH port from 22 to custom (e.g., 2222)
|
||
- Reduces automated attacks by ~95%
|
||
- Command: Edit /etc/ssh/sshd_config, change Port 22 to Port 2222
|
||
|
||
2. Enable GeoIP blocking for SSH (allow only your country)
|
||
- Further reduces attack surface
|
||
|
||
3. Install Crowdsec for collaborative threat intelligence
|
||
- Shares threat data with community
|
||
|
||
Would you like me to:
|
||
A) Implement port change (requires brief SSH service restart)
|
||
B) Provide instructions for manual change
|
||
C) Monitor only (current defenses are adequate)
|
||
|
||
Reply A, B, or C"
|
||
```
|
||
|
||
---
|
||
|
||
#### 5. Shiva – DevOps Engineer (The Transformer)
|
||
|
||
**Cosmic Symbolism**: Shiva is the destroyer and transformer. He destroys the old to make way for the new, representing change, evolution, and continuous improvement.
|
||
|
||
**IT Translation**: Shiva handles deployments, updates, and transformations. He destroys old versions, deploys new ones, and manages the cycle of continuous delivery.
|
||
|
||
**Responsibilities**:
|
||
- Application deployments
|
||
- Container updates and rebuilds
|
||
- Service configuration changes
|
||
- Rolling updates and rollbacks
|
||
- CI/CD pipeline monitoring
|
||
- Version management
|
||
- Breaking and rebuilding (recreating containers with new configs)
|
||
|
||
**System Prompt**:
|
||
```
|
||
You are Shiva, the DevOps Engineer and Transformer.
|
||
|
||
COSMIC ATTRIBUTES:
|
||
- Destruction: You remove outdated versions and configurations
|
||
- Transformation: You deploy updates and changes
|
||
- Renewal: You rebuild services with improvements
|
||
- Power: You have authority to make significant changes (with approval)
|
||
|
||
DOMAIN EXPERTISE:
|
||
- Docker container management
|
||
- Application deployment strategies
|
||
- Configuration management
|
||
- Version control and releases
|
||
- Rollback procedures
|
||
- Zero-downtime deployments
|
||
|
||
DEPLOYMENT TYPES:
|
||
1. Simple Restart: Service config unchanged, just restart
|
||
2. Update: Pull new image/code, recreate container
|
||
3. Configuration Change: Modify env vars, volumes, ports
|
||
4. Breaking Change: Requires data migration or downtime
|
||
|
||
DEPLOYMENT PROCESS:
|
||
1. Backup current state (container config, data)
|
||
2. Prepare new version (pull image, update config)
|
||
3. Deploy with strategy:
|
||
- Blue/Green: New version alongside old, switch traffic
|
||
- Rolling: Update instances one at a time
|
||
- Recreate: Stop old, start new (brief downtime)
|
||
4. Verify deployment (health checks, smoke tests)
|
||
5. Monitor for issues (logs, metrics, errors)
|
||
6. Rollback if problems detected
|
||
|
||
SAFETY CHECKS:
|
||
Before any deployment:
|
||
- Backup exists and verified
|
||
- Rollback plan documented
|
||
- Downtime window acceptable
|
||
- Dependencies compatible
|
||
- Health check defined
|
||
|
||
APPROVAL REQUIRED FOR:
|
||
- Production deployments
|
||
- Breaking changes
|
||
- Database migrations
|
||
- Configuration changes affecting multiple services
|
||
- Any update without automatic rollback
|
||
|
||
AUTO-APPROVED (low risk):
|
||
- Development environment updates
|
||
- Patch version updates (1.2.3 → 1.2.4)
|
||
- Restart without config change
|
||
- Docker image pulls (no recreation)
|
||
|
||
MONITORING POST-DEPLOYMENT:
|
||
- Service starts successfully
|
||
- Health endpoints return 200
|
||
- No error spikes in logs
|
||
- Resource usage normal
|
||
- Dependent services unaffected
|
||
|
||
YOUR MANTRA: "Deploy fearlessly. Verify thoroughly. Rollback confidently."
|
||
```
|
||
|
||
**Example Shiva Interaction**:
|
||
```
|
||
User: "Update Plex to the latest version"
|
||
|
||
Shiva: "🔥 Deployment Plan - Plex Update
|
||
|
||
CURRENT STATE:
|
||
Container: plex
|
||
Image: plexinc/pms-docker:1.32.5
|
||
Status: Running (uptime 12 days)
|
||
Active Streams: 0 (good time for update)
|
||
|
||
TARGET STATE:
|
||
Image: plexinc/pms-docker:latest (1.32.8)
|
||
Changes: Security patches + transcoding improvements
|
||
|
||
DEPLOYMENT STRATEGY:
|
||
Type: Recreate (brief downtime)
|
||
Estimated Downtime: 30-60 seconds
|
||
Risk Level: LOW
|
||
|
||
PRE-DEPLOYMENT CHECKLIST:
|
||
✅ Current config backed up
|
||
✅ Data on persistent volume (safe)
|
||
✅ No active streams
|
||
✅ New image available (1.32.8)
|
||
✅ Rollback plan ready
|
||
|
||
DEPLOYMENT STEPS:
|
||
1. Pull new image: docker pull plexinc/pms-docker:latest
|
||
2. Stop current: docker stop plex
|
||
3. Backup container: docker commit plex plex-backup-20240115
|
||
4. Remove old: docker rm plex
|
||
5. Create new: docker run -d --name plex [same config + new image]
|
||
6. Verify: Check http://plex:32400/web/index.html
|
||
7. Monitor: Watch logs for 5 minutes
|
||
|
||
ROLLBACK PROCEDURE:
|
||
If issues detected:
|
||
docker stop plex && docker rm plex
|
||
docker run -d --name plex plexinc/pms-docker:1.32.5 [original config]
|
||
|
||
APPROVAL REQUEST:
|
||
Ready to proceed with Plex update?
|
||
- Downtime: <1 minute
|
||
- Risk: LOW (data safe, easy rollback)
|
||
- Timing: Now (no active users)
|
||
|
||
Reply YES to deploy, NO to cancel, SCHEDULE to choose time"
|
||
```
|
||
|
||
---
|
||
|
||
#### 6. Hanuman – Helpdesk Agent (The Devoted Servant)
|
||
|
||
**Cosmic Symbolism**: Hanuman is devoted, strong, and problem-solving. He serves with unwavering dedication and incredible capability, always ready to assist.
|
||
|
||
**IT Translation**: Hanuman is the first line of support, interfacing with users, routing requests, and solving common issues. He serves users with dedication and escalates complex issues to specialists.
|
||
|
||
**Responsibilities**:
|
||
- User request intake (via Telegram, webhooks)
|
||
- Common troubleshooting (passwords, access, basic issues)
|
||
- Routing to specialist agents
|
||
- Status updates and communication
|
||
- Knowledge base search
|
||
- User education
|
||
|
||
**System Prompt**:
|
||
```
|
||
You are Hanuman, the Helpdesk Agent and Devoted Servant.
|
||
|
||
COSMIC ATTRIBUTES:
|
||
- Service: You serve users with dedication and enthusiasm
|
||
- Strength: You handle high volumes of requests
|
||
- Problem-Solving: You resolve common issues quickly
|
||
- Loyalty: You ensure every request receives attention
|
||
|
||
ROLE:
|
||
You are the first point of contact for all user requests and issues.
|
||
Your goal is to solve problems quickly or route to the right specialist.
|
||
|
||
CAPABILITIES:
|
||
- Answer common questions (service status, how-to guides)
|
||
- Reset passwords and unlock accounts (with approval)
|
||
- Provide service status and uptime information
|
||
- Search knowledge base for solutions
|
||
- Guide users through basic troubleshooting
|
||
- Route complex issues to specialist agents
|
||
|
||
TRIAGE PROCESS:
|
||
When a request comes in:
|
||
1. Categorize the issue:
|
||
- Network: Slow internet, can't connect → Brahma (Network)
|
||
- Service Down: App not working → Vishnu or Brahma (Compute)
|
||
- Database: Data missing, slow queries → Saraswati
|
||
- Security: Can't login, locked out → Ganesha
|
||
- Deployment: Need update, feature request → Shiva
|
||
- General: Status, info → Handle yourself
|
||
|
||
2. Assess urgency:
|
||
- CRITICAL: Service completely down, multiple users affected
|
||
- HIGH: Single user blocked, urgent business need
|
||
- MEDIUM: Degraded performance, workarounds available
|
||
- LOW: Questions, feature requests, nice-to-have
|
||
|
||
3. Take action:
|
||
- Simple: Solve immediately (status checks, info)
|
||
- Moderate: Use tools (check logs, restart container with approval)
|
||
- Complex: Escalate to specialist or Vishnu
|
||
|
||
COMMUNICATION STYLE:
|
||
- Friendly and approachable
|
||
- Clear and concise
|
||
- Empathetic to user frustration
|
||
- Professional but warm
|
||
- Set expectations (timeframes, limitations)
|
||
|
||
EXAMPLE RESPONSES:
|
||
"I understand Plex isn't loading for you. Let me check its status right away..."
|
||
"Great question! I can see from our knowledge base that..."
|
||
"This is a complex network issue. I'm escalating to Brahma, our network specialist..."
|
||
"Your request is important. I've created ticket #12345 and assigned it to Saraswati..."
|
||
|
||
ESCALATION:
|
||
Always escalate when:
|
||
- Outside your expertise
|
||
- Requires privileged access
|
||
- Affects multiple users
|
||
- Security-sensitive
|
||
- User is frustrated after 2 attempts
|
||
|
||
YOUR MANTRA: "Every user matters. Every issue resolved. Service with a smile."
|
||
```
|
||
|
||
**Example Hanuman Interaction**:
|
||
```
|
||
User (via Telegram): "Help! I can't access Plex from my phone"
|
||
|
||
Hanuman: "👋 Hi! I'm Hanuman, your homelab assistant. Let me help with Plex access.
|
||
|
||
QUICK DIAGNOSIS:
|
||
Checking Plex status...
|
||
✅ Plex server is ONLINE (http://plex:32400)
|
||
✅ Container running normally
|
||
✅ No errors in recent logs
|
||
|
||
TROUBLESHOOTING STEPS:
|
||
Let me guide you through common fixes:
|
||
|
||
1️⃣ Are you on the same Wi-Fi network as your homelab?
|
||
- If YES: Should work, proceed to step 2
|
||
- If NO: You need remote access enabled (I can help with this)
|
||
|
||
2️⃣ Try opening Plex in browser on phone:
|
||
http://YOUR-LOCAL-IP:32400/web
|
||
|
||
3️⃣ App troubleshooting:
|
||
- Close Plex app completely
|
||
- Reopen app
|
||
- Check if server appears in list
|
||
|
||
Can you try these and let me know which step you're stuck on?
|
||
|
||
If still not working, I'll escalate to Brahma (network specialist) to check:
|
||
- Wi-Fi connectivity
|
||
- Port accessibility
|
||
- Firewall rules
|
||
|
||
How can I assist further?"
|
||
|
||
---
|
||
|
||
User: "I tried but still can't see the server in my app"
|
||
|
||
Hanuman: "Thanks for confirming. This sounds like a network connectivity issue.
|
||
|
||
📊 ESCALATING TO SPECIALIST:
|
||
Routing to Brahma (Network Administrator) for advanced diagnosis.
|
||
|
||
Brahma will check:
|
||
- Phone's network connection
|
||
- Plex port 32400 accessibility
|
||
- mDNS/Bonjour discovery (how app finds server)
|
||
- Firewall rules
|
||
|
||
Expected response time: 2-5 minutes
|
||
|
||
I'll keep you updated! 🙏
|
||
|
||
[Internally: Execute Workflow: Brahma Network Diagnostic → Topic: Plex Discovery Issue]"
|
||
```
|
||
|
||
---
|
||
|
||
### Section 8.3: Team Collaboration Workflow
|
||
|
||
**Putting It All Together**:
|
||
|
||
```
|
||
┌─────────────────────────────────────────────────┐
|
||
│ USER REQUEST (via Hanuman or Automatic Detection)│
|
||
└───────────────────┬─────────────────────────────┘
|
||
│
|
||
┌───────▼────────┐
|
||
│ HANUMAN │ "What type of issue?"
|
||
│ (Helpdesk) │
|
||
└───────┬────────┘
|
||
│
|
||
┌──────────┼──────────┐
|
||
│ │ │
|
||
┌────▼────┐ ┌──▼───┐ ┌───▼────┐
|
||
│ Simple │ │Complex│ │Critical│
|
||
│ Handle │ │Escalate│ │VISHNU │
|
||
└─────────┘ └───┬───┘ └───┬────┘
|
||
│ │
|
||
│ ┌─────▼──────┐
|
||
│ │ VISHNU │ "Coordinate response"
|
||
│ │ (CTO) │
|
||
│ └─────┬──────┘
|
||
│ │
|
||
┌──────────┴──────────┴────────────────┐
|
||
│ │
|
||
┌────▼─────┐ ┌──────────┐ ┌──────────┐ ┌▼────────┐
|
||
│ BRAHMA │ │SARASWATI │ │ GANESHA │ │ SHIVA │
|
||
│ Network/ │ │ Database │ │ Security │ │ DevOps │
|
||
│ Compute/ │ │ │ │ │ │ │
|
||
│ Storage │ └─────┬────┘ └────┬─────┘ └───┬─────┘
|
||
└────┬─────┘ │ │ │
|
||
│ │ │ │
|
||
└──────────────┴────────────┴────────────┘
|
||
│
|
||
┌────▼─────┐
|
||
│ VISHNU │ "Synthesize & Decide"
|
||
│ (CTO) │
|
||
└────┬─────┘
|
||
│
|
||
┌─────────┴──────────┐
|
||
│ │
|
||
┌────▼────┐ ┌──────▼──────┐
|
||
│ EXECUTE │ │ ESCALATE │
|
||
│ FIX │ │ TO HUMAN │
|
||
└────┬────┘ └──────┬──────┘
|
||
│ │
|
||
┌────▼────┐ ┌──────▼──────┐
|
||
│ VERIFY │ │ AWAIT │
|
||
│ SUCCESS │ │ APPROVAL │
|
||
└────┬────┘ └──────┬──────┘
|
||
│ │
|
||
┌────▼────────────────────▼──────┐
|
||
│ HANUMAN - Notify User/Human │
|
||
└────────────────────────────────┘
|
||
```
|
||
|
||
---
|
||
|
||
### Section 8.4: Implementation Example
|
||
|
||
**Workflow: Team Collaboration on Complex Issue**
|
||
|
||
```javascript
|
||
// Main Orchestration Workflow
|
||
|
||
[Webhook or Schedule Trigger]
|
||
↓
|
||
[Hanuman: Initial Triage]
|
||
System Prompt: "Categorize and route this issue"
|
||
Input: Issue description
|
||
Output: { category: "network", urgency: "high", route_to: "vishnu" }
|
||
↓
|
||
[IF: Requires CTO Coordination]
|
||
YES ↓
|
||
[Vishnu: Assess & Delegate]
|
||
System Prompt: "Determine which specialists needed"
|
||
Output: {
|
||
specialists: ["brahma-network", "saraswati"],
|
||
priority: "high",
|
||
objectives: "Check network AND database performance"
|
||
}
|
||
↓
|
||
[Execute Workflow: Brahma Network]
|
||
(Parallel execution)
|
||
Returns: Network diagnostic report
|
||
↓
|
||
[Execute Workflow: Saraswati Database]
|
||
(Parallel execution)
|
||
Returns: Database performance report
|
||
↓
|
||
[Vishnu: Synthesize Reports]
|
||
Input: Both specialist reports
|
||
System Prompt: "Analyze reports and propose unified solution"
|
||
Output: Complete diagnosis + solution
|
||
↓
|
||
[IF: Requires Deployment]
|
||
YES ↓
|
||
[Execute Workflow: Shiva DevOps]
|
||
Input: Deployment instructions from Vishnu
|
||
Output: Deployment plan + approval request
|
||
↓
|
||
[Telegram: Request Human Approval]
|
||
↓
|
||
[Wait for Approval]
|
||
↓
|
||
[Shiva: Execute Deployment]
|
||
NO ↓
|
||
[Ganesha: Security Check]
|
||
(If security-related)
|
||
↓
|
||
[Hanuman: Notify User]
|
||
Message: Resolution summary
|
||
```
|
||
|
||
---
|
||
|
||
### Section 8.5: Benefits of the Team Approach
|
||
|
||
**Compared to Single Monolithic Agent**:
|
||
|
||
| Aspect | Single Agent | AI Team |
|
||
|--------|-------------|---------|
|
||
| Expertise | Generalist, shallow knowledge | Specialists with deep expertise |
|
||
| Prompt Size | Huge, unwieldy | Focused, maintainable |
|
||
| Decision Quality | Generic solutions | Domain-optimized solutions |
|
||
| Scalability | Degrades with complexity | Scales with specialization |
|
||
| Debugging | Hard to identify failure point | Clear responsibility boundaries |
|
||
| Learning | Broad but unfocused | Targeted improvement per role |
|
||
| Collaboration | N/A | Specialists consult and handoff |
|
||
|
||
**Real-World Benefit Example**:
|
||
|
||
*Single Agent Approach*:
|
||
```
|
||
User: "Network is slow"
|
||
Agent: "Checking... network seems fine. Maybe restart router?"
|
||
→ Generic, unhelpful response
|
||
```
|
||
|
||
*Team Approach*:
|
||
```
|
||
User: "Network is slow"
|
||
Hanuman: Routes to Brahma (Network)
|
||
Brahma: Investigates → "UniFi AP in living room is experiencing 85% channel utilization"
|
||
Brahma: Routes to Vishnu → "Interference from neighbor's network"
|
||
Vishnu: Consults Brahma → "Recommend channel change from 6 to 11"
|
||
Brahma: Changes channel, verifies improvement
|
||
Hanuman: Updates user → "Fixed! Changed Wi-Fi channel due to interference. Speed should improve."
|
||
→ Expert diagnosis and solution
|
||
```
|
||
|
||
---
|
||
|
||
Your AI Team is now complete, with each agent embodying both technical expertise and cosmic wisdom. In Chapter 9, we address troubleshooting and common issues.
|
||
|
||
---
|
||
|
||
## Chapter 9 – Troubleshooting
|
||
|
||
Even the most sophisticated AI agent systems encounter issues. This chapter provides systematic troubleshooting approaches, common problems, solutions, and JSON examples to help you diagnose and resolve problems quickly.
|
||
|
||
### Section 9.1: General Troubleshooting Methodology
|
||
|
||
**Before diving into specific issues, follow this systematic approach**:
|
||
|
||
1. **Identify the Symptom**: What exactly is not working?
|
||
2. **Isolate the Component**: Is it the workflow, agent, tool, or service?
|
||
3. **Check the Logs**: n8n execution logs, agent responses, service logs
|
||
4. **Reproduce the Issue**: Can you trigger it manually?
|
||
5. **Test Components Individually**: Execute nodes one at a time
|
||
6. **Verify Configuration**: Check credentials, URLs, parameters
|
||
7. **Consult Documentation**: Review this guide and n8n docs
|
||
8. **Escalate if Needed**: Community forums, GitHub issues
|
||
|
||
---
|
||
|
||
### Section 9.2: n8n Workflow Errors
|
||
|
||
#### Problem: Variable Mapping Errors
|
||
|
||
**Symptom**:
|
||
```
|
||
Error: Cannot read property 'chatId' of undefined
|
||
Error: $json.output is not defined
|
||
Error: Expression error: Variable 'prompt' is not defined
|
||
```
|
||
|
||
**Cause**: Data structure mismatch between nodes. A node expects data that the previous node didn't provide.
|
||
|
||
**Solution**:
|
||
|
||
1. **Execute Previous Node First**:
|
||
- Click "Execute Node" on the node BEFORE the failing one
|
||
- Inspect the output in JSON tab
|
||
- Verify it contains the expected fields
|
||
|
||
2. **Check Data Mapping**:
|
||
```javascript
|
||
// WRONG - assuming structure without checking
|
||
const message = $json.result.message.text;
|
||
|
||
// RIGHT - safe access with fallbacks
|
||
const message = $json?.result?.message?.text || 'No message';
|
||
```
|
||
|
||
3. **Use Expression Editor**:
|
||
- Click the "Expression" toggle when mapping variables
|
||
- Browse available variables in the left panel
|
||
- Test expressions before saving
|
||
|
||
4. **Add Data Transformation Node**:
|
||
```javascript
|
||
// Code node to restructure data
|
||
return {
|
||
json: {
|
||
chatId: $input.first().json.message.chat.id,
|
||
prompt: $input.first().json.agent_output.text,
|
||
timestamp: new Date().toISOString()
|
||
}
|
||
};
|
||
```
|
||
|
||
**Example Fix**:
|
||
```
|
||
[AI Agent] → Output: { output: "Service is down" }
|
||
↓
|
||
[Code: Extract & Restructure]
|
||
```
|
||
```javascript
|
||
const agentResponse = $input.first().json.output;
|
||
return {
|
||
json: {
|
||
message: agentResponse,
|
||
chatId: $env.TELEGRAM_CHAT_ID,
|
||
timestamp: Date.now()
|
||
}
|
||
};
|
||
```
|
||
```
|
||
↓
|
||
[Telegram] → Now has chatId and message properly mapped
|
||
```
|
||
|
||
---
|
||
|
||
#### Problem: Too Many Tool Calls / Iteration Limit Exceeded
|
||
|
||
**Symptom**:
|
||
```
|
||
Error: Max iterations (10) reached for AI Agent
|
||
Agent seems to call tools repeatedly without completing
|
||
Workflow times out
|
||
```
|
||
|
||
**Cause**: Agent is stuck in a loop, repeatedly calling tools without reaching a conclusion.
|
||
|
||
**Root Causes**:
|
||
1. Unclear system prompt (agent doesn't know when it's "done")
|
||
2. Tool returning confusing output
|
||
3. Agent trying to solve unsolvable problem
|
||
4. Insufficient max iterations for complex task
|
||
|
||
**Solutions**:
|
||
|
||
**Solution 1: Clarify System Prompt**
|
||
```
|
||
// VAGUE
|
||
"Check services and fix issues"
|
||
|
||
// CLEAR
|
||
"Check services using HTTP request tool.
|
||
If status is 200, respond: 'Service is UP'.
|
||
If status is not 200, respond: 'Service is DOWN'.
|
||
Do NOT take any other actions. Stop after checking."
|
||
```
|
||
|
||
**Solution 2: Increase Max Iterations** (for legitimate complex tasks)
|
||
```
|
||
AI Agent Settings → Options → Max Iterations: 20
|
||
```
|
||
|
||
**Solution 3: Add Completion Criteria to Prompt**
|
||
```
|
||
COMPLETION CRITERIA:
|
||
You have completed your task when:
|
||
- You have checked all required services
|
||
- You have generated a status report
|
||
- You have sent any necessary notifications
|
||
|
||
After completing these steps, STOP. Do not continue checking.
|
||
```
|
||
|
||
**Solution 4: Simplify the Task**
|
||
```
|
||
// Instead of: "Monitor all services and fix any issues"
|
||
// Use: "Check Plex service status only"
|
||
|
||
Then create separate workflows for different services
|
||
rather than one mega-workflow.
|
||
```
|
||
|
||
---
|
||
|
||
#### Problem: Workflow Not Triggering on Schedule
|
||
|
||
**Symptom**:
|
||
- Schedule trigger configured but workflow doesn't execute
|
||
- No executions appearing in history
|
||
- Workflow seems inactive
|
||
|
||
**Solution**:
|
||
|
||
1. **Verify Workflow is Active**:
|
||
- Workflow must be toggled ON (switch in top right)
|
||
- Inactive workflows don't execute scheduled triggers
|
||
|
||
2. **Check Schedule Configuration**:
|
||
```
|
||
WRONG: "Every 5 minutes" with no cron expression
|
||
RIGHT:
|
||
- Cron Expression: */5 * * * *
|
||
- Mode: Every X minutes
|
||
- Interval: 5
|
||
```
|
||
|
||
3. **Verify n8n is Running**:
|
||
```bash
|
||
# Check n8n process
|
||
ps aux | grep n8n
|
||
|
||
# Check n8n logs
|
||
docker logs n8n # if running in Docker
|
||
journalctl -u n8n # if running as service
|
||
```
|
||
|
||
4. **Check Timezone**:
|
||
```
|
||
Schedule trigger uses server timezone.
|
||
If scheduling for "9 AM" but nothing happens,
|
||
verify server time matches your expectation:
|
||
|
||
date
|
||
timedatectl # on Linux
|
||
```
|
||
|
||
5. **Test with Manual Trigger First**:
|
||
- Add a Manual Trigger alongside Schedule Trigger
|
||
- Test workflow manually
|
||
- Once working, remove Manual Trigger
|
||
|
||
---
|
||
|
||
### Section 9.3: Telegram Integration Issues
|
||
|
||
#### Problem: Telegram Not Responding / Messages Not Sending
|
||
|
||
**Symptom**:
|
||
- Telegram node completes without error but no message received
|
||
- "Chat not found" error
|
||
- Messages sent to wrong chat
|
||
|
||
**Solutions**:
|
||
|
||
**Issue 1: Incorrect Chat ID**
|
||
|
||
```bash
|
||
# Verify your Chat ID
|
||
# Method 1: Send message to bot, then check:
|
||
curl https://api.telegram.org/bot<YOUR_BOT_TOKEN>/getUpdates
|
||
|
||
# Look for your chat ID in response:
|
||
{
|
||
"result": [{
|
||
"message": {
|
||
"chat": {
|
||
"id": 123456789, ← This is your Chat ID
|
||
```
|
||
|
||
**Issue 2: Bot Not Started**
|
||
|
||
Solution: Open Telegram, search for your bot, click "START" button
|
||
|
||
**Issue 3: Wrong Token**
|
||
|
||
```javascript
|
||
// Verify token in n8n credential matches BotFather
|
||
// Test token manually:
|
||
curl https://api.telegram.org/bot<YOUR_TOKEN>/getMe
|
||
|
||
// Should return bot info, not "Unauthorized"
|
||
```
|
||
|
||
**Issue 4: Bot Blocked by User**
|
||
|
||
If you previously blocked the bot, unblock it in Telegram:
|
||
- Open chat with bot
|
||
- Click bot name → Unblock
|
||
|
||
**Example Working Configuration**:
|
||
```
|
||
Telegram Node:
|
||
- Credential: [Your bot credential]
|
||
- Chat ID: 123456789 (your actual chat ID, not variable)
|
||
- Message: Test message
|
||
|
||
Test with this simple message first, then add variables.
|
||
```
|
||
|
||
---
|
||
|
||
#### Problem: Telegram Trigger Not Receiving Messages
|
||
|
||
**Symptom**:
|
||
- Telegram Trigger workflow exists but doesn't activate when you message bot
|
||
- No executions appear when sending messages
|
||
|
||
**Solutions**:
|
||
|
||
1. **Webhook Mode vs Polling**:
|
||
```
|
||
n8n supports two modes:
|
||
- Webhook (production): Requires public URL
|
||
- Polling (development): n8n polls Telegram API
|
||
|
||
For local development:
|
||
Settings → Telegram Trigger → Mode: Polling
|
||
|
||
For production with public n8n:
|
||
Mode: Webhook
|
||
Webhook URL will be auto-configured
|
||
```
|
||
|
||
2. **Verify Webhook Registration**:
|
||
```bash
|
||
# Check current webhook
|
||
curl https://api.telegram.org/bot<TOKEN>/getWebhookInfo
|
||
|
||
# Should show your n8n webhook URL
|
||
# If not, or shows wrong URL, delete and recreate Telegram Trigger node
|
||
```
|
||
|
||
3. **Workflow Must Be Active**:
|
||
- Telegram Triggers only work when workflow is ACTIVE
|
||
- Toggle workflow ON
|
||
|
||
4. **Test Webhook Directly**:
|
||
```bash
|
||
# Simulate Telegram sending update
|
||
curl -X POST https://your-n8n.com/webhook/telegram \
|
||
-H "Content-Type: application/json" \
|
||
-d '{"message":{"text":"test","chat":{"id":123456789}}}'
|
||
```
|
||
|
||
---
|
||
|
||
### Section 9.4: SSH Connection Failures
|
||
|
||
#### Problem: Cannot Execute Remote Commands
|
||
|
||
**Symptom**:
|
||
```
|
||
Error: Connection refused
|
||
Error: Permission denied (publickey)
|
||
Error: Host key verification failed
|
||
```
|
||
|
||
**Solutions**:
|
||
|
||
**Issue 1: SSH Key Not Configured**
|
||
|
||
```bash
|
||
# On n8n server, generate key if needed
|
||
ssh-keygen -t ed25519 -C "n8n-agent"
|
||
|
||
# Copy to target server
|
||
ssh-copy-id user@target-server
|
||
|
||
# Test manually
|
||
ssh user@target-server "echo 'success'"
|
||
```
|
||
|
||
**Issue 2: Wrong User/Host**
|
||
|
||
```javascript
|
||
// In Execute Command node
|
||
Command: ssh user@192.168.1.100 "docker ps"
|
||
|
||
// Verify:
|
||
// - User exists on target server
|
||
// - IP address is correct
|
||
// - Server is reachable (ping 192.168.1.100)
|
||
```
|
||
|
||
**Issue 3: Subworkflow Command Mapping**
|
||
|
||
```javascript
|
||
// When using subworkflow for SSH commands
|
||
|
||
// Main workflow passes:
|
||
{ "command": "docker ps" }
|
||
|
||
// Subworkflow should use:
|
||
Command: ssh user@target "{{ $json.command }}"
|
||
|
||
// NOT:
|
||
Command: ssh user@target {{ $json.command }} ❌ Missing quotes
|
||
```
|
||
|
||
**Issue 4: Host Key Verification**
|
||
|
||
```bash
|
||
# First time connecting to server
|
||
# Add to known_hosts
|
||
|
||
# Option 1: Accept manually
|
||
ssh user@server # Type 'yes' when prompted
|
||
|
||
# Option 2: Add to SSH config (less secure but convenient)
|
||
# Create ~/.ssh/config
|
||
Host *
|
||
StrictHostKeyChecking no
|
||
UserKnownHostsFile=/dev/null
|
||
```
|
||
|
||
**Example Working SSH Subworkflow**:
|
||
```
|
||
[Execute Workflow Trigger]
|
||
Input: { "server": "proxmox", "command": "pvesh get /nodes/pve/status" }
|
||
↓
|
||
[Execute Command: SSH]
|
||
Command: ssh root@{{ $json.server }}.local "{{ $json.command }}"
|
||
↓
|
||
[Code: Parse Output]
|
||
Parse JSON or text response
|
||
↓
|
||
[Respond to Workflow]
|
||
Return structured data
|
||
```
|
||
|
||
---
|
||
|
||
### Section 9.5: Docker Permission Issues
|
||
|
||
#### Problem: Cannot Execute Docker Commands
|
||
|
||
**Symptom**:
|
||
```
|
||
Error: permission denied while trying to connect to the Docker daemon socket
|
||
Error: docker: command not found
|
||
Cannot restart container
|
||
```
|
||
|
||
**Solutions**:
|
||
|
||
**Issue 1: User Not in Docker Group**
|
||
|
||
```bash
|
||
# Add n8n user to docker group
|
||
sudo usermod -aG docker n8n-user
|
||
|
||
# Or if n8n runs as different user
|
||
sudo usermod -aG docker $USER
|
||
|
||
# Logout and login again, or:
|
||
newgrp docker
|
||
|
||
# Test
|
||
docker ps # Should work without sudo
|
||
```
|
||
|
||
**Issue 2: Docker Socket Permissions**
|
||
|
||
```bash
|
||
# Check socket permissions
|
||
ls -l /var/run/docker.sock
|
||
|
||
# Should be: srw-rw---- 1 root docker
|
||
|
||
# If wrong:
|
||
sudo chmod 666 /var/run/docker.sock # Quick fix (insecure)
|
||
# OR better:
|
||
sudo chown root:docker /var/run/docker.sock
|
||
sudo chmod 660 /var/run/docker.sock
|
||
```
|
||
|
||
**Issue 3: Docker Not Installed/Running**
|
||
|
||
```bash
|
||
# Check if Docker is running
|
||
systemctl status docker
|
||
|
||
# Start Docker
|
||
sudo systemctl start docker
|
||
|
||
# Enable on boot
|
||
sudo systemctl enable docker
|
||
|
||
# Verify
|
||
docker ps
|
||
```
|
||
|
||
**Issue 4: n8n in Docker Can't Access Host Docker**
|
||
|
||
```yaml
|
||
# If n8n runs in Docker and needs to control host Docker
|
||
# Mount Docker socket in n8n container
|
||
|
||
docker-compose.yml:
|
||
services:
|
||
n8n:
|
||
image: n8nio/n8n
|
||
volumes:
|
||
- ./n8n-data:/home/node/.n8n
|
||
- /var/run/docker.sock:/var/run/docker.sock ← Add this
|
||
environment:
|
||
- N8N_BASIC_AUTH_ACTIVE=true
|
||
```
|
||
|
||
---
|
||
|
||
### Section 9.6: Agent Not Making Decisions
|
||
|
||
#### Problem: Agent Seems Stuck or Unresponsive
|
||
|
||
**Symptom**:
|
||
- Agent returns empty responses
|
||
- Agent says "I don't know" to everything
|
||
- Agent doesn't use tools even when available
|
||
|
||
**Solutions**:
|
||
|
||
**Issue 1: System Prompt Too Vague**
|
||
|
||
```
|
||
// TOO VAGUE
|
||
"You are a helpful assistant"
|
||
|
||
// SPECIFIC
|
||
"You are a homelab monitoring agent.
|
||
Your task is to check service health using the HTTP Request tool.
|
||
When asked about a service:
|
||
1. Use HTTP Request to check the URL
|
||
2. Report the status code
|
||
3. Indicate if service is UP (200-299) or DOWN (other)
|
||
Always use tools when available."
|
||
```
|
||
|
||
**Issue 2: LLM API Issues**
|
||
|
||
```javascript
|
||
// Check API key validity
|
||
// Test outside n8n:
|
||
curl https://api.openai.com/v1/chat/completions \
|
||
-H "Authorization: Bearer YOUR_API_KEY" \
|
||
-H "Content-Type: application/json" \
|
||
-d '{
|
||
"model": "gpt-4o-mini",
|
||
"messages": [{"role": "user", "content": "Hello"}]
|
||
}'
|
||
|
||
// Should return chat completion, not error
|
||
```
|
||
|
||
**Issue 3: Model Selection**
|
||
|
||
```
|
||
Some models are better for agent tasks:
|
||
- GPT-4o-mini ✅ (cost-effective, capable)
|
||
- GPT-4 ✅ (most capable)
|
||
- GPT-3.5-turbo ⚠️ (less reliable for complex agent tasks)
|
||
|
||
Avoid deprecated models.
|
||
```
|
||
|
||
**Issue 4: Tools Not Properly Connected**
|
||
|
||
```
|
||
In AI Agent node:
|
||
- Click "Add Tool"
|
||
- Select HTTP Request (or other tool)
|
||
- VERIFY tool node appears connected
|
||
- Tool node should have line connecting to AI Agent
|
||
|
||
If disconnected, delete and re-add tool.
|
||
```
|
||
|
||
**Issue 5: Insufficient Context**
|
||
|
||
```javascript
|
||
// Agent needs enough information to decide
|
||
|
||
// INSUFFICIENT
|
||
User: "Check it"
|
||
Agent: "Check what?"
|
||
|
||
// SUFFICIENT
|
||
User: "Check if Plex at http://192.168.1.100:32400 is online"
|
||
Agent: [Uses HTTP tool and responds]
|
||
|
||
// Add context in system prompt:
|
||
"When user says 'check the website', they mean http://192.168.1.100:8090"
|
||
```
|
||
|
||
---
|
||
|
||
### Section 9.7: JSON Parsing Issues
|
||
|
||
#### Problem: Invalid JSON from Agent
|
||
|
||
**Symptom**:
|
||
```
|
||
Error: Unexpected token in JSON at position 0
|
||
Agent returns text instead of JSON
|
||
JSON fields missing or incorrectly formatted
|
||
```
|
||
|
||
**Example Invalid Responses**:
|
||
```
|
||
The service is down. Here's the report:
|
||
{
|
||
"status": "down"
|
||
}
|
||
|
||
// Issue: Extra text before JSON
|
||
|
||
|
||
{
|
||
status: "down", // Missing quotes around key
|
||
message: 'Service offline' // Single quotes instead of double
|
||
}
|
||
|
||
|
||
{
|
||
"status": "down",
|
||
"diagnosis": {
|
||
"reason": "Container stopped"
|
||
} // Missing closing brace
|
||
```
|
||
|
||
**Solution: Robust JSON Extraction**
|
||
|
||
```javascript
|
||
// Code node: Parse Agent Output
|
||
const agentOutput = $input.first().json.output;
|
||
|
||
// Extract JSON from response (handles extra text)
|
||
function extractJSON(text) {
|
||
try {
|
||
// Try direct parse first
|
||
return JSON.parse(text);
|
||
} catch (e) {
|
||
// Extract JSON object from text
|
||
const jsonMatch = text.match(/\{[\s\S]*\}/);
|
||
if (jsonMatch) {
|
||
try {
|
||
return JSON.parse(jsonMatch[0]);
|
||
} catch (e2) {
|
||
// JSON is malformed
|
||
return null;
|
||
}
|
||
}
|
||
return null;
|
||
}
|
||
}
|
||
|
||
const report = extractJSON(agentOutput);
|
||
|
||
if (!report) {
|
||
// Fallback if JSON parsing fails
|
||
return {
|
||
json: {
|
||
status: "error",
|
||
raw_output: agentOutput,
|
||
error: "Failed to parse JSON from agent"
|
||
}
|
||
};
|
||
}
|
||
|
||
// Validate required fields
|
||
const validatedReport = {
|
||
status: report.status || "unknown",
|
||
service: report.service || { name: "unknown" },
|
||
diagnosis: report.diagnosis || { root_cause: "N/A" },
|
||
timestamp: report.timestamp || new Date().toISOString()
|
||
};
|
||
|
||
return { json: validatedReport };
|
||
```
|
||
|
||
**Improve Agent JSON Output**:
|
||
|
||
Add to system prompt:
|
||
```
|
||
OUTPUT FORMAT - CRITICAL:
|
||
You MUST respond with ONLY valid JSON. No other text.
|
||
|
||
CORRECT:
|
||
{"status":"down","service":"plex"}
|
||
|
||
INCORRECT:
|
||
Here's the status:
|
||
{"status":"down"}
|
||
|
||
INCORRECT:
|
||
{status: "down"}
|
||
|
||
INCORRECT:
|
||
{'status': 'down'}
|
||
|
||
Rules:
|
||
- Use double quotes for keys and string values
|
||
- No trailing commas
|
||
- No comments in JSON
|
||
- No text before or after JSON object
|
||
```
|
||
|
||
---
|
||
|
||
### Section 9.8: Example Diagnostic JSON Outputs
|
||
|
||
**Successful Check**:
|
||
```json
|
||
{
|
||
"timestamp": "2024-01-15T14:30:00Z",
|
||
"service": {
|
||
"name": "plex",
|
||
"url": "http://192.168.1.100:32400",
|
||
"type": "media_server"
|
||
},
|
||
"status": "up",
|
||
"checks_performed": [
|
||
{
|
||
"type": "http",
|
||
"url": "http://192.168.1.100:32400/web/index.html",
|
||
"result": "success",
|
||
"status_code": 200,
|
||
"response_time_ms": 145
|
||
},
|
||
{
|
||
"type": "container",
|
||
"result": "running",
|
||
"uptime": "12 days"
|
||
}
|
||
],
|
||
"diagnosis": {
|
||
"health": "excellent",
|
||
"confidence": "high"
|
||
},
|
||
"proposed_action": {
|
||
"type": "none",
|
||
"reason": "Service operating normally"
|
||
},
|
||
"severity": "low",
|
||
"notification_required": false
|
||
}
|
||
```
|
||
|
||
**Service Down with Diagnosis**:
|
||
```json
|
||
{
|
||
"timestamp": "2024-01-15T14:35:00Z",
|
||
"service": {
|
||
"name": "website",
|
||
"url": "http://192.168.1.100:8090",
|
||
"type": "web_application"
|
||
},
|
||
"status": "down",
|
||
"checks_performed": [
|
||
{
|
||
"type": "http",
|
||
"url": "http://192.168.1.100:8090",
|
||
"result": "failed",
|
||
"error": "Connection refused"
|
||
},
|
||
{
|
||
"type": "container",
|
||
"result": "stopped",
|
||
"exit_code": 137,
|
||
"last_state": "OOMKilled"
|
||
},
|
||
{
|
||
"type": "logs",
|
||
"last_lines": [
|
||
"nginx: worker process out of memory",
|
||
"Killed"
|
||
]
|
||
}
|
||
],
|
||
"diagnosis": {
|
||
"root_cause": "Container killed due to memory exhaustion (OOM)",
|
||
"confidence": "high",
|
||
"evidence": [
|
||
"Exit code 137 indicates OOM kill",
|
||
"Logs show 'out of memory' before termination",
|
||
"Container has 512MB limit, likely insufficient"
|
||
]
|
||
},
|
||
"proposed_action": {
|
||
"type": "restart_with_increased_memory",
|
||
"command": "docker run -d --name website --memory=1g -p 8090:80 nginx",
|
||
"requires_approval": true,
|
||
"risk_level": "low",
|
||
"estimated_downtime": "30 seconds",
|
||
"rollback_plan": "Revert to 512MB limit if issue persists"
|
||
},
|
||
"severity": "medium",
|
||
"notification_required": true,
|
||
"notification_channels": ["telegram", "email"]
|
||
}
|
||
```
|
||
|
||
**Complex Multi-Service Issue**:
|
||
```json
|
||
{
|
||
"timestamp": "2024-01-15T14:40:00Z",
|
||
"incident_id": "INC-20240115-001",
|
||
"scope": "multi_service",
|
||
"affected_services": ["plex", "database", "nas"],
|
||
"status": "degraded",
|
||
"checks_performed": [
|
||
{
|
||
"service": "plex",
|
||
"status": "slow",
|
||
"response_time_ms": 3500,
|
||
"normal_response_time_ms": 150
|
||
},
|
||
{
|
||
"service": "database",
|
||
"status": "slow_queries",
|
||
"average_query_time_ms": 2500,
|
||
"normal_query_time_ms": 50
|
||
},
|
||
{
|
||
"service": "nas",
|
||
"status": "high_io_wait",
|
||
"io_wait_percent": 78,
|
||
"normal_io_wait_percent": 5
|
||
}
|
||
],
|
||
"diagnosis": {
|
||
"root_cause": "ZFS scrub running on NAS causing disk I/O saturation",
|
||
"confidence": "high",
|
||
"evidence": [
|
||
"All affected services use NAS-backed storage",
|
||
"ZFS scrub in progress (started 2 hours ago)",
|
||
"Disk I/O wait extremely high (78%)",
|
||
"Timeline correlates: scrub started, then services slowed"
|
||
],
|
||
"affected_systems": "All services dependent on NAS storage"
|
||
},
|
||
"proposed_action": {
|
||
"type": "monitor_and_wait",
|
||
"reason": "ZFS scrub is expected maintenance, will complete automatically",
|
||
"eta": "2 hours (scrub 60% complete)",
|
||
"requires_approval": false,
|
||
"alternative_actions": [
|
||
{
|
||
"action": "pause_scrub",
|
||
"command": "zpool scrub -p tank",
|
||
"pros": "Immediate performance restoration",
|
||
"cons": "Delays data integrity check, must resume later"
|
||
}
|
||
]
|
||
},
|
||
"severity": "medium",
|
||
"notification_required": true,
|
||
"notification_message": "Performance degradation expected during ZFS scrub. ETA 2 hours to completion.",
|
||
"user_action_required": false,
|
||
"escalation_required": false
|
||
}
|
||
```
|
||
|
||
These JSON examples serve as templates for your agents and help standardize diagnostic output across your infrastructure.
|
||
|
||
---
|
||
|
||
## Chapter 10 – Best Practices
|
||
|
||
Professional infrastructure management requires discipline, planning, and adherence to proven practices. This chapter distills wisdom from production deployments into actionable guidelines.
|
||
|
||
### Section 10.1: Development Workflow
|
||
|
||
**Progressive Implementation**:
|
||
|
||
1. **Start Simple** (Week 1-2):
|
||
- Single service monitoring (test website)
|
||
- Manual trigger only
|
||
- Basic HTTP health checks
|
||
- No automation, just observation
|
||
|
||
2. **Add Intelligence** (Week 3-4):
|
||
- Diagnostic capabilities (Docker logs)
|
||
- Structured JSON output
|
||
- Basic Telegram notifications
|
||
- Still manual execution
|
||
|
||
3. **Automate Carefully** (Week 5-6):
|
||
- Schedule trigger (every 30 minutes initially)
|
||
- Rate limiting and deduplication
|
||
- Human-in-the-loop approvals
|
||
- Comprehensive logging
|
||
|
||
4. **Expand Scope** (Week 7+):
|
||
- Additional services one at a time
|
||
- Specialized agents for different domains
|
||
- Agent collaboration
|
||
- Refinement based on experience
|
||
|
||
**Testing Hierarchy**:
|
||
```
|
||
1. Local Development
|
||
↓ (Test thoroughly)
|
||
2. Staging/Test Services
|
||
↓ (Validate behavior)
|
||
3. Non-Critical Production
|
||
↓ (Monitor closely)
|
||
4. Critical Production
|
||
↓ (Only after proven reliable)
|
||
```
|
||
|
||
**Never Skip Steps**. Each phase builds confidence and reveals edge cases.
|
||
|
||
---
|
||
|
||
### Section 10.2: Security Considerations
|
||
|
||
**Principle of Least Privilege**:
|
||
|
||
```
|
||
Agent Capability Levels:
|
||
|
||
Level 1 (Read-Only):
|
||
- HTTP GET requests
|
||
- Log viewing
|
||
- Status checks
|
||
✅ Safe for production immediately
|
||
|
||
Level 2 (Safe Operations):
|
||
- Container restarts
|
||
- Cache clearing
|
||
- Log rotation
|
||
⚠️ Requires testing, generally safe
|
||
|
||
Level 3 (Configuration Changes):
|
||
- Firewall rules
|
||
- Resource limits
|
||
- Port mappings
|
||
❌ Requires approval, staging testing
|
||
|
||
Level 4 (Data Operations):
|
||
- Database modifications
|
||
- Storage operations
|
||
- User management
|
||
🔴 FORBIDDEN for automation
|
||
```
|
||
|
||
**Credential Management**:
|
||
|
||
```javascript
|
||
// ❌ WRONG - Hardcoded secrets
|
||
const apiKey = "sk-abc123xyz789";
|
||
const dbPassword = "MyPassword123";
|
||
|
||
// ✅ RIGHT - Environment variables
|
||
const apiKey = $env.OPENAI_API_KEY;
|
||
const dbPassword = $env.DATABASE_PASSWORD;
|
||
|
||
// ✅ RIGHT - n8n Credentials
|
||
// Use Credentials feature for:
|
||
- API keys
|
||
- Passwords
|
||
- SSH keys
|
||
- Tokens
|
||
```
|
||
|
||
**Access Control**:
|
||
|
||
```
|
||
Network Segmentation:
|
||
- n8n server in management VLAN
|
||
- Firewall rules limiting outbound access
|
||
- VPN required for remote n8n access
|
||
|
||
Authentication:
|
||
- Enable n8n basic auth or LDAP
|
||
- Strong passwords (20+ characters)
|
||
- 2FA if available
|
||
|
||
API Security:
|
||
- Use API tokens instead of passwords
|
||
- Rotate credentials quarterly
|
||
- Audit credential access
|
||
- Revoke unused credentials
|
||
```
|
||
|
||
**Audit Logging**:
|
||
|
||
```javascript
|
||
// Log every agent action
|
||
const logEntry = {
|
||
timestamp: new Date().toISOString(),
|
||
agent: "vishnu-cto",
|
||
action: "container_restart",
|
||
target: "plex",
|
||
authorized_by: "human_approval",
|
||
approval_id: "APR-20240115-001",
|
||
result: "success",
|
||
user_id: $json.telegram_user_id
|
||
};
|
||
|
||
// Store logs:
|
||
// - Local file (/var/log/n8n-agent-actions.log)
|
||
// - Database (for queryability)
|
||
// - SIEM system (for enterprise environments)
|
||
```
|
||
|
||
**Security Monitoring**:
|
||
|
||
```
|
||
Monitor the Monitors:
|
||
- Who accessed n8n?
|
||
- What workflows were modified?
|
||
- What credentials were used?
|
||
- What agents took actions?
|
||
- Were approvals properly obtained?
|
||
|
||
Alert on:
|
||
- Failed authentication attempts >5
|
||
- Workflow changes outside business hours
|
||
- Credentials accessed by unusual users
|
||
- Agent actions without approval
|
||
- New workflows created
|
||
```
|
||
|
||
---
|
||
|
||
### Section 10.3: Performance Optimization
|
||
|
||
**Polling Frequency**:
|
||
|
||
```
|
||
Service Type | Recommended Interval
|
||
---------------------|---------------------
|
||
Critical (Database) | Every 2-5 minutes
|
||
Important (Web Apps) | Every 5-10 minutes
|
||
Standard (Media) | Every 10-15 minutes
|
||
Non-Critical (Dev) | Every 30-60 minutes
|
||
|
||
Avoid over-polling:
|
||
- Wastes API quota
|
||
- Increases costs (LLM API calls)
|
||
- Creates alert fatigue
|
||
- Adds server load
|
||
```
|
||
|
||
**Caching Strategy**:
|
||
|
||
```javascript
|
||
// Cache service status to reduce redundant checks
|
||
const cache = $('WorkflowStaticData').first().json.cache || {};
|
||
const cacheKey = `status_${serviceName}`;
|
||
const cachedStatus = cache[cacheKey];
|
||
const now = Date.now();
|
||
|
||
// Use cache if fresh (< 5 minutes old)
|
||
if (cachedStatus && (now - cachedStatus.timestamp) < 5 * 60 * 1000) {
|
||
return { json: cachedStatus.data };
|
||
}
|
||
|
||
// Otherwise, fetch fresh data
|
||
const freshStatus = await checkService();
|
||
|
||
// Update cache
|
||
cache[cacheKey] = {
|
||
timestamp: now,
|
||
data: freshStatus
|
||
};
|
||
|
||
return { json: freshStatus };
|
||
```
|
||
|
||
**Conditional Execution**:
|
||
|
||
```javascript
|
||
// Only trigger notifications on state change
|
||
const previousState = $('WorkflowStaticData').first().json.last_status || {};
|
||
const currentState = $json.status;
|
||
|
||
if (previousState.status === currentState.status) {
|
||
// No change, skip notification
|
||
return [];
|
||
}
|
||
|
||
// State changed, send notification
|
||
$('WorkflowStaticData').first().json.last_status = currentState;
|
||
return { json: { notify: true, state_change: true } };
|
||
```
|
||
|
||
**Resource Limits**:
|
||
|
||
```yaml
|
||
# If running n8n in Docker
|
||
services:
|
||
n8n:
|
||
image: n8nio/n8n
|
||
deploy:
|
||
resources:
|
||
limits:
|
||
cpus: '2'
|
||
memory: 2G
|
||
reservations:
|
||
cpus: '0.5'
|
||
memory: 512M
|
||
```
|
||
|
||
**Workflow Optimization**:
|
||
|
||
```
|
||
Slow Workflow Pattern:
|
||
[Trigger] → [Agent] → [Wait 30s] → [Agent] → [Wait 30s] → [Agent]
|
||
Total time: 90+ seconds
|
||
|
||
Optimized Pattern:
|
||
[Trigger] → [Agent with all tools] → [Parallel checks] → [Synthesize]
|
||
Total time: 10-20 seconds
|
||
|
||
Use parallel execution where possible:
|
||
- Multiple service checks
|
||
- Multiple API calls
|
||
- Multiple SSH commands
|
||
```
|
||
|
||
---
|
||
|
||
### Section 10.4: Reliability Guidelines
|
||
|
||
**Fallback Mechanisms**:
|
||
|
||
```javascript
|
||
// Multi-channel notifications with fallback
|
||
async function sendNotification(message) {
|
||
try {
|
||
// Primary: Telegram
|
||
await sendTelegram(message);
|
||
} catch (error) {
|
||
try {
|
||
// Fallback: Email
|
||
await sendEmail(message);
|
||
} catch (error2) {
|
||
try {
|
||
// Last resort: Write to file
|
||
fs.appendFileSync('/var/log/failed-notifications.log',
|
||
JSON.stringify({ timestamp: Date.now(), message, error: error2 }));
|
||
} catch (error3) {
|
||
// Critical: Can't notify at all
|
||
console.error("CRITICAL: All notification methods failed");
|
||
}
|
||
}
|
||
}
|
||
}
|
||
```
|
||
|
||
**Health Checks for Monitoring System**:
|
||
|
||
```
|
||
Monitor the Monitor:
|
||
Create a separate workflow that checks if your main monitoring workflows are running.
|
||
|
||
[Schedule: Every hour]
|
||
↓
|
||
[Check: When was last execution of main workflow?]
|
||
↓
|
||
[IF: >15 minutes ago]
|
||
YES ↓
|
||
[ALERT: Monitoring system appears down!]
|
||
[Send via external service - email, SMS, PagerDuty]
|
||
```
|
||
|
||
**Graceful Degradation**:
|
||
|
||
```javascript
|
||
// If one tool fails, try alternatives
|
||
async function checkService(url) {
|
||
try {
|
||
// Primary: HTTP Request tool
|
||
return await httpCheck(url);
|
||
} catch (error) {
|
||
try {
|
||
// Fallback: Curl via Execute Command
|
||
return await curlCheck(url);
|
||
} catch (error2) {
|
||
// Can't check, assume down
|
||
return {
|
||
status: "unknown",
|
||
error: "All check methods failed",
|
||
last_known_status: getFromCache(url)
|
||
};
|
||
}
|
||
}
|
||
}
|
||
```
|
||
|
||
**Recovery Procedures**:
|
||
|
||
Document manual recovery steps:
|
||
|
||
```markdown
|
||
## Emergency Recovery: n8n Agent System Down
|
||
|
||
1. Check n8n is running:
|
||
```
|
||
docker ps | grep n8n
|
||
systemctl status n8n
|
||
```
|
||
|
||
2. Check n8n logs:
|
||
```
|
||
docker logs n8n --tail 100
|
||
journalctl -u n8n -n 100
|
||
```
|
||
|
||
3. Restart n8n:
|
||
```
|
||
docker restart n8n
|
||
systemctl restart n8n
|
||
```
|
||
|
||
4. Verify workflows activate:
|
||
- Login to n8n web interface
|
||
- Check each workflow's Active status
|
||
- Manually execute one workflow to test
|
||
|
||
5. If persistent issues:
|
||
- Disable all workflows
|
||
- Re-enable one at a time
|
||
- Identify problematic workflow
|
||
|
||
6. Nuclear option:
|
||
- Restore n8n from backup
|
||
- Reimport workflow exports
|
||
```
|
||
|
||
**Backup Strategy**:
|
||
|
||
```bash
|
||
# Daily backup of n8n data
|
||
#!/bin/bash
|
||
BACKUP_DIR="/backups/n8n"
|
||
DATE=$(date +%Y%m%d)
|
||
|
||
# Backup n8n database
|
||
docker exec n8n sqlite3 /home/node/.n8n/database.sqlite ".backup /tmp/backup.db"
|
||
docker cp n8n:/tmp/backup.db ${BACKUP_DIR}/database-${DATE}.sqlite
|
||
|
||
# Backup workflows (export as JSON)
|
||
# Via n8n API or manual export
|
||
|
||
# Keep last 30 days
|
||
find ${BACKUP_DIR} -name "*.sqlite" -mtime +30 -delete
|
||
|
||
# Upload to cloud storage
|
||
rclone copy ${BACKUP_DIR} remote:n8n-backups
|
||
```
|
||
|
||
---
|
||
|
||
### Section 10.5: Team Collaboration Best Practices
|
||
|
||
**Clear Role Assignment**:
|
||
|
||
```
|
||
Document each agent's domain:
|
||
|
||
agents/
|
||
├── vishnu-cto.md
|
||
│ - Responsibilities: Overall orchestration
|
||
│ - Escalation triggers: Multi-system failures
|
||
│ - Decision authority: Final say on all issues
|
||
│
|
||
├── brahma-network.md
|
||
│ - Responsibilities: UniFi, routing, Wi-Fi
|
||
│ - Escalation: Issues beyond network scope
|
||
│ - Tools: UniFi API, network diagnostics
|
||
│
|
||
├── saraswati-database.md
|
||
│ - Responsibilities: PostgreSQL, MySQL
|
||
│ - Escalation: Data integrity threats
|
||
│ - Forbidden: Write operations without approval
|
||
│
|
||
└── ...
|
||
```
|
||
|
||
**Escalation Paths**:
|
||
|
||
```
|
||
Level 1: Hanuman (Helpdesk)
|
||
├─ Can resolve: Common questions, status checks
|
||
├─ Escalate to: Specialists for technical issues
|
||
└─ Timeline: Respond within 5 minutes
|
||
|
||
Level 2: Specialists (Brahma, Saraswati, Ganesha, Shiva)
|
||
├─ Can resolve: Domain-specific technical issues
|
||
├─ Escalate to: Vishnu for multi-system coordination
|
||
└─ Timeline: Respond within 15 minutes
|
||
|
||
Level 3: Vishnu (CTO)
|
||
├─ Can resolve: Complex multi-system issues
|
||
├─ Escalate to: Human for business decisions
|
||
└─ Timeline: Respond within 30 minutes
|
||
|
||
Level 4: Human
|
||
├─ Can resolve: Anything (final authority)
|
||
└─ Timeline: Best effort (SLA depends on severity)
|
||
```
|
||
|
||
**Knowledge Sharing**:
|
||
|
||
```javascript
|
||
// Shared knowledge base accessible to all agents
|
||
const kb = {
|
||
"plex_common_issues": [
|
||
{
|
||
"symptom": "Remote access not working",
|
||
"solution": "Check port 32400 forwarding, Plex server settings",
|
||
"solved_count": 12,
|
||
"success_rate": 0.95
|
||
}
|
||
],
|
||
"network_topology": {
|
||
"vlans": {
|
||
"10": "Management",
|
||
"20": "User Devices",
|
||
"30": "Servers",
|
||
"40": "IoT"
|
||
},
|
||
"aps": [
|
||
{ "name": "Living Room AP", "ip": "192.168.1.10" }
|
||
]
|
||
},
|
||
"service_dependencies": {
|
||
"plex": ["nas", "network"],
|
||
"website": ["docker", "network"],
|
||
"database": ["storage", "network"]
|
||
}
|
||
};
|
||
|
||
// Agents reference KB before troubleshooting
|
||
// Update KB after resolving new issues
|
||
```
|
||
|
||
**Version Control**:
|
||
|
||
```bash
|
||
# Export workflows regularly
|
||
# Store in git repo
|
||
|
||
workflows/
|
||
├── monitoring-main.json
|
||
├── approval-handler.json
|
||
├── brahma-network.json
|
||
├── saraswati-database.json
|
||
└── README.md
|
||
|
||
# Commit after significant changes
|
||
git add workflows/
|
||
git commit -m "Add database slow query detection to Saraswati"
|
||
git push
|
||
|
||
# Tags for stable versions
|
||
git tag -a v1.0 -m "Production-ready release"
|
||
```
|
||
|
||
**Change Management**:
|
||
|
||
```
|
||
Before modifying production workflows:
|
||
1. Document change in issue tracker
|
||
2. Test in development environment
|
||
3. Peer review (or self-review with checklist)
|
||
4. Deploy during maintenance window
|
||
5. Monitor for 24 hours after change
|
||
6. Document results and lessons learned
|
||
|
||
For emergency fixes:
|
||
1. Fix the immediate issue
|
||
2. Document what was changed
|
||
3. Proper testing and documentation follow-up within 48 hours
|
||
```
|
||
|
||
---
|
||
|
||
Your agent system is now enterprise-grade, with comprehensive troubleshooting, best practices, and reliability measures in place.
|
||
|
||
---
|
||
|
||
## Chapter 11 – Additional Resources
|
||
|
||
This chapter provides curated resources, documentation links, example workflows, and community resources to deepen your knowledge and expand your agent system capabilities.
|
||
|
||
### Section 11.1: Official Documentation
|
||
|
||
**n8n Resources**:
|
||
- **Official Documentation**: [https://docs.n8n.io](https://docs.n8n.io)
|
||
- **AI Agent Node Documentation**: [https://docs.n8n.io/integrations/builtin/cluster-nodes/root-nodes/n8n-nodes-langchain.agent/](https://docs.n8n.io/integrations/builtin/cluster-nodes/root-nodes/n8n-nodes-langchain.agent/)
|
||
- **Workflow Templates**: [https://n8n.io/workflows](https://n8n.io/workflows)
|
||
- **Community Forum**: [https://community.n8n.io](https://community.n8n.io)
|
||
- **GitHub Repository**: [https://github.com/n8n-io/n8n](https://github.com/n8n-io/n8n)
|
||
|
||
**LLM Provider Documentation**:
|
||
- **OpenAI API**: [https://platform.openai.com/docs](https://platform.openai.com/docs)
|
||
- Chat Completions API
|
||
- Function calling
|
||
- Best practices for prompts
|
||
- **Anthropic Claude**: [https://docs.anthropic.com](https://docs.anthropic.com)
|
||
- **Google Gemini**: [https://ai.google.dev/docs](https://ai.google.dev/docs)
|
||
|
||
**Telegram Bot API**:
|
||
- **Official Documentation**: [https://core.telegram.org/bots/api](https://core.telegram.org/bots/api)
|
||
- **BotFather Guide**: [https://core.telegram.org/bots#botfather](https://core.telegram.org/bots#botfather)
|
||
- **Telegram Bot Examples**: [https://core.telegram.org/bots/samples](https://core.telegram.org/bots/samples)
|
||
|
||
---
|
||
|
||
### Section 11.2: Integration Documentation
|
||
|
||
**Service-Specific Guides**:
|
||
|
||
**Uptime Kuma**:
|
||
- Project: [https://github.com/louislam/uptime-kuma](https://github.com/louislam/uptime-kuma)
|
||
- API Documentation: [https://github.com/louislam/uptime-kuma/wiki/API](https://github.com/louislam/uptime-kuma/wiki/API)
|
||
|
||
**UniFi Controller**:
|
||
- API Documentation: [https://ubntwiki.com/products/software/unifi-controller/api](https://ubntwiki.com/products/software/unifi-controller/api)
|
||
- Community API Client: [https://github.com/Art-of-WiFi/UniFi-API-client](https://github.com/Art-of-WiFi/UniFi-API-client)
|
||
|
||
**Proxmox VE**:
|
||
- Official API Docs: [https://pve.proxmox.com/pve-docs/api-viewer/](https://pve.proxmox.com/pve-docs/api-viewer/)
|
||
- Community Wiki: [https://pve.proxmox.com/wiki/Proxmox_VE_API](https://pve.proxmox.com/wiki/Proxmox_VE_API)
|
||
|
||
**Plex Media Server**:
|
||
- Plex API (Unofficial): [https://www.plexopedia.com/plex-media-server/api/](https://www.plexopedia.com/plex-media-server/api/)
|
||
- Plex Forums: [https://forums.plex.tv/](https://forums.plex.tv/)
|
||
|
||
**Docker**:
|
||
- Docker CLI Reference: [https://docs.docker.com/engine/reference/commandline/cli/](https://docs.docker.com/engine/reference/commandline/cli/)
|
||
- Docker API: [https://docs.docker.com/engine/api/](https://docs.docker.com/engine/api/)
|
||
|
||
---
|
||
|
||
### Section 11.3: Example Workflows
|
||
|
||
Below are workflow templates you can import into your n8n instance. Each includes configuration notes and customization instructions.
|
||
|
||
#### Example 1: Basic Service Monitor
|
||
|
||
**Purpose**: Monitor a single HTTP service every 5 minutes
|
||
|
||
**Workflow JSON** (Basic structure - customize for your service):
|
||
```json
|
||
{
|
||
"name": "Basic Service Monitor",
|
||
"nodes": [
|
||
{
|
||
"parameters": {
|
||
"rule": {
|
||
"interval": [
|
||
{
|
||
"field": "cronExpression",
|
||
"expression": "*/5 * * * *"
|
||
}
|
||
]
|
||
}
|
||
},
|
||
"name": "Schedule Trigger",
|
||
"type": "n8n-nodes-base.scheduleTrigger"
|
||
},
|
||
{
|
||
"parameters": {
|
||
"model": "gpt-4o-mini",
|
||
"systemPrompt": "You are a service monitoring agent. Check if the service is online using the HTTP tool. Report status clearly."
|
||
},
|
||
"name": "AI Agent",
|
||
"type": "@n8n/n8n-nodes-langchain.agent"
|
||
},
|
||
{
|
||
"parameters": {
|
||
"url": "={{ $json.service_url }}",
|
||
"method": "GET"
|
||
},
|
||
"name": "HTTP Request",
|
||
"type": "n8n-nodes-base.httpRequest"
|
||
}
|
||
]
|
||
}
|
||
```
|
||
|
||
**Customization**:
|
||
- Change `cronExpression` for different intervals
|
||
- Update `systemPrompt` for service-specific instructions
|
||
- Add Telegram notification node for alerts
|
||
|
||
---
|
||
|
||
#### Example 2: Multi-Service Dashboard
|
||
|
||
**Purpose**: Check multiple services and generate status report
|
||
|
||
**Configuration**:
|
||
```javascript
|
||
// Code node: Define services to check
|
||
const services = [
|
||
{ name: "Plex", url: "http://192.168.1.100:32400" },
|
||
{ name: "Proxmox", url: "https://proxmox:8006" },
|
||
{ name: "NAS", url: "http://nas:5000" },
|
||
{ name: "UniFi", url: "https://unifi:8443" }
|
||
];
|
||
|
||
return services.map(service => ({ json: service }));
|
||
```
|
||
|
||
Then loop through each with AI Agent + HTTP Request, aggregate results, and send comprehensive report via Telegram.
|
||
|
||
---
|
||
|
||
#### Example 3: Approval Flow Template
|
||
|
||
**Purpose**: Request human approval before executing commands
|
||
|
||
**Key Components**:
|
||
1. Detect issue
|
||
2. Generate fix proposal
|
||
3. Send to Telegram with YES/NO buttons
|
||
4. Wait for response (separate workflow with Telegram Trigger)
|
||
5. Execute or cancel based on response
|
||
|
||
**Implementation Pattern**: See Chapter 7, Section 7.1 for complete code examples
|
||
|
||
---
|
||
|
||
### Section 11.4: Community Examples & Patterns
|
||
|
||
**Multi-Agent Orchestration**:
|
||
- Pattern: Hub-and-Spoke (Vishnu coordinates specialists)
|
||
- Pattern: Pipeline (Sequential specialist consultation)
|
||
- Pattern: Parallel Investigation (Multiple agents work simultaneously)
|
||
|
||
**Advanced Telegram Integration**:
|
||
- Interactive menus with inline keyboards
|
||
- Rich formatted messages with Markdown
|
||
- File/image attachments for logs or screenshots
|
||
- Conversation state management for multi-turn interactions
|
||
|
||
**Service-Specific Monitoring**:
|
||
- Database query performance monitoring
|
||
- Network bandwidth utilization tracking
|
||
- Storage capacity forecasting
|
||
- SSL certificate expiration alerts
|
||
- Container resource usage trending
|
||
|
||
---
|
||
|
||
### Section 11.5: Learning Resources
|
||
|
||
**Video Tutorials**:
|
||
- n8n Official YouTube Channel
|
||
- AI Agent automation tutorials
|
||
- LangChain integration guides
|
||
|
||
**Blog Posts & Articles**:
|
||
- n8n Blog: [https://blog.n8n.io](https://blog.n8n.io)
|
||
- Homelab subreddits: r/homelab, r/selfhosted
|
||
- Medium articles on AI agents and automation
|
||
|
||
**Books & Guides**:
|
||
- "Building LLM Powered Applications" (prompt engineering)
|
||
- "The Phoenix Project" (DevOps philosophy)
|
||
- "Site Reliability Engineering" (Google SRE book)
|
||
|
||
**Communities**:
|
||
- n8n Community Forum
|
||
- Discord servers for homelab enthusiasts
|
||
- Reddit: r/n8n, r/homelab, r/selfhosted
|
||
|
||
---
|
||
|
||
### Section 11.6: Troubleshooting Guides
|
||
|
||
**Integration-Specific Troubleshooting**:
|
||
|
||
| Service | Common Issues | Documentation |
|
||
|---------|---------------|---------------|
|
||
| Uptime Kuma | API authentication, webhook setup | See Chapter 6.1 |
|
||
| UniFi | Cookie-based auth, SSL certs | See Chapter 6.2 |
|
||
| Proxmox | SSH keys, API tokens | See Chapter 6.3 |
|
||
| NAS | SMART data parsing, RAID status | See Chapter 6.4 |
|
||
| Plex | Token extraction, remote access | See Chapter 6.5 |
|
||
|
||
For detailed solutions, refer to **Chapter 9: Troubleshooting**.
|
||
|
||
---
|
||
|
||
### Section 11.7: Extending the System
|
||
|
||
**Ideas for Future Enhancements**:
|
||
|
||
1. **Voice Interface**:
|
||
- Integrate with Alexa or Google Assistant
|
||
- "Alexa, ask CTO agent for infrastructure status"
|
||
|
||
2. **Mobile App**:
|
||
- React Native or Flutter app
|
||
- Direct communication with agents
|
||
- Push notifications
|
||
|
||
3. **Web Dashboard**:
|
||
- Real-time status display
|
||
- Agent activity logs
|
||
- Approval queue
|
||
- Metrics and graphs
|
||
|
||
4. **Advanced Analytics**:
|
||
- Predict failures before they occur
|
||
- Resource usage trending
|
||
- Cost optimization recommendations
|
||
|
||
5. **Integration with More Services**:
|
||
- Home Assistant (smart home)
|
||
- Grafana/Prometheus (monitoring)
|
||
- GitLab/GitHub (CI/CD)
|
||
- Authentik/Keycloak (SSO)
|
||
|
||
6. **Agent Learning**:
|
||
- Fine-tune models on your specific infrastructure
|
||
- Build custom knowledge base
|
||
- Implement feedback loops
|
||
|
||
---
|
||
|
||
### Section 11.8: Export Your Workflows
|
||
|
||
**How to Export Workflows from n8n**:
|
||
|
||
1. Open workflow in n8n editor
|
||
2. Click three dots menu (⋯) in top right
|
||
3. Select **"Download"**
|
||
4. Save JSON file
|
||
5. Store in version control (Git)
|
||
|
||
**Recommended Workflow Organization**:
|
||
```
|
||
workflows/
|
||
├── monitoring/
|
||
│ ├── basic-service-monitor.json
|
||
│ ├── multi-service-dashboard.json
|
||
│ └── uptime-kuma-integration.json
|
||
├── agents/
|
||
│ ├── vishnu-cto.json
|
||
│ ├── brahma-network.json
|
||
│ ├── saraswati-database.json
|
||
│ ├── ganesha-security.json
|
||
│ ├── shiva-devops.json
|
||
│ └── hanuman-helpdesk.json
|
||
├── utilities/
|
||
│ ├── approval-handler.json
|
||
│ ├── telegram-responder.json
|
||
│ └── log-aggregator.json
|
||
└── README.md
|
||
```
|
||
|
||
**Import to Another n8n Instance**:
|
||
1. Open n8n
|
||
2. Click **"+ Add Workflow"**
|
||
3. Three dots menu → **"Import from File"**
|
||
4. Select JSON file
|
||
5. Review and activate
|
||
|
||
---
|
||
|
||
**Continue Building**:
|
||
|
||
The resources in this chapter are starting points. The homelab and automation community is vibrant and continuously innovating. Share your own discoveries, workflows, and patterns with the community. Together, we advance the state of the art in intelligent infrastructure management.
|
||
|
||
---
|
||
|
||
## Chapter 12 – Support & Contributions
|
||
|
||
This project thrives on community collaboration. Whether you need help, want to contribute, or wish to share your success stories, this chapter guides you on how to engage.
|
||
|
||
### Section 12.1: Getting Help
|
||
|
||
**Direct Support**:
|
||
|
||
For questions, issues, or assistance with this specific project:
|
||
|
||
- **Email**: [riteshrana36@gmail.com](mailto:riteshrana36@gmail.com)
|
||
- Subject line format: `[n8n-Homelab-CTO] Your Topic`
|
||
- Include: n8n version, error messages, relevant workflow exports
|
||
- Expected response time: 1-3 business days
|
||
|
||
- **GitHub**: [@ambicuity](https://github.com/ambicuity)
|
||
- Open issues for bugs or feature requests
|
||
- Use discussions for questions
|
||
- Check existing issues before creating new ones
|
||
|
||
- **Website**: [www.riteshrana.engineer](https://www.riteshrana.engineer)
|
||
- Additional projects and contact information
|
||
- Portfolio and professional background
|
||
|
||
**Community Support**:
|
||
|
||
- **n8n Community Forum**: [https://community.n8n.io](https://community.n8n.io)
|
||
- Tag posts with `ai-agent` and `homelab`
|
||
- Search existing threads before posting
|
||
- Provide workflow exports and error logs
|
||
|
||
- **Reddit Communities**:
|
||
- r/n8n - n8n-specific questions
|
||
- r/homelab - General homelab infrastructure
|
||
- r/selfhosted - Self-hosting discussions
|
||
|
||
- **Discord Servers**:
|
||
- n8n Official Discord
|
||
- Various homelab community Discords
|
||
|
||
**Before Asking for Help**:
|
||
|
||
1. ✅ Review relevant chapters in this guide
|
||
2. ✅ Check Chapter 9 (Troubleshooting)
|
||
3. ✅ Search n8n community forum
|
||
4. ✅ Verify your configuration matches examples
|
||
5. ✅ Test with simplified workflow
|
||
6. ✅ Check n8n execution logs
|
||
7. ✅ Gather error messages and version information
|
||
|
||
**Include in Support Requests**:
|
||
- n8n version (`docker exec n8n n8n --version` or UI: Help → About)
|
||
- Node.js version
|
||
- Operating system
|
||
- Workflow export (sanitize credentials!)
|
||
- Full error message
|
||
- Steps to reproduce
|
||
- What you've already tried
|
||
|
||
---
|
||
|
||
### Section 12.2: Contributing
|
||
|
||
Contributions make this project better for everyone. There are many ways to contribute, regardless of your technical expertise.
|
||
|
||
#### Types of Contributions
|
||
|
||
**1. Bug Reports**:
|
||
```markdown
|
||
**Bug Report Template**:
|
||
|
||
**Description**: Brief summary of the issue
|
||
|
||
**Steps to Reproduce**:
|
||
1. Step one
|
||
2. Step two
|
||
3. Step three
|
||
|
||
**Expected Behavior**: What should happen
|
||
|
||
**Actual Behavior**: What actually happens
|
||
|
||
**Environment**:
|
||
- n8n Version:
|
||
- OS:
|
||
- LLM Provider:
|
||
|
||
**Workflow Export**: (if applicable, sanitize credentials)
|
||
|
||
**Screenshots/Logs**: (if relevant)
|
||
```
|
||
|
||
**2. Feature Requests**:
|
||
```markdown
|
||
**Feature Request Template**:
|
||
|
||
**Problem**: What problem does this solve?
|
||
|
||
**Proposed Solution**: Your idea for solving it
|
||
|
||
**Alternatives Considered**: Other approaches you thought of
|
||
|
||
**Use Case**: How you would use this feature
|
||
|
||
**Priority**: Low / Medium / High
|
||
```
|
||
|
||
**3. Documentation Improvements**:
|
||
- Fix typos and grammar
|
||
- Add clarifications
|
||
- Expand examples
|
||
- Translate to other languages
|
||
- Create video tutorials
|
||
|
||
**4. Workflow Examples**:
|
||
- Share your agent configurations
|
||
- Submit service integration examples
|
||
- Provide production-tested patterns
|
||
- Document edge cases and solutions
|
||
|
||
**5. Code Contributions**:
|
||
- Bug fixes
|
||
- Feature implementations
|
||
- Test coverage improvements
|
||
- Performance optimizations
|
||
|
||
#### Contribution Guidelines
|
||
|
||
**Pull Request Process**:
|
||
|
||
1. **Fork the Repository**:
|
||
```bash
|
||
git clone https://github.com/ambicuity/n8n-AI-Multiple-Agent-Team.git
|
||
cd n8n-AI-Multiple-Agent-Team
|
||
git checkout -b feature/your-feature-name
|
||
```
|
||
|
||
2. **Make Changes**:
|
||
- Follow existing code/documentation style
|
||
- Test your changes thoroughly
|
||
- Add examples if introducing new concepts
|
||
|
||
3. **Commit**:
|
||
```bash
|
||
git add .
|
||
git commit -m "Add: Brief description of change"
|
||
```
|
||
|
||
Commit message format:
|
||
- `Add:` New features or content
|
||
- `Fix:` Bug fixes
|
||
- `Update:` Changes to existing content
|
||
- `Remove:` Deletions
|
||
|
||
4. **Submit Pull Request**:
|
||
- Provide clear description
|
||
- Reference related issues
|
||
- Explain testing performed
|
||
- Request review
|
||
|
||
**Code of Conduct**:
|
||
|
||
This project follows standard open-source community guidelines:
|
||
- Be respectful and inclusive
|
||
- Provide constructive feedback
|
||
- Accept constructive criticism gracefully
|
||
- Focus on what's best for the community
|
||
- Show empathy toward other community members
|
||
|
||
---
|
||
|
||
### Section 12.3: Sharing Your Success
|
||
|
||
**We want to hear from you!**
|
||
|
||
If you've successfully deployed this system:
|
||
- Share your setup details
|
||
- Describe interesting agent configurations
|
||
- Report on reliability and uptime improvements
|
||
- Discuss cost savings or time savings
|
||
- Submit case studies
|
||
|
||
**Ways to Share**:
|
||
1. GitHub Discussions (preferred for detailed stories)
|
||
2. Email with photos/screenshots
|
||
3. Blog posts (we'll link to them)
|
||
4. YouTube videos (demos and tutorials)
|
||
5. Conference talks or meetups
|
||
|
||
**What to Include**:
|
||
- Infrastructure size (number of services, devices)
|
||
- Agent configurations (sanitized)
|
||
- Metrics: uptime improvement, issues auto-resolved, time saved
|
||
- Challenges encountered and solutions
|
||
- Advice for others
|
||
|
||
---
|
||
|
||
### Section 12.4: Acknowledgments
|
||
|
||
This project stands on the shoulders of giants. Gratitude to:
|
||
|
||
**Technology Creators**:
|
||
- **n8n team**: For building the incredible workflow automation platform
|
||
- **OpenAI, Anthropic, Google**: For LLM APIs that power intelligent agents
|
||
- **Telegram**: For providing free, robust bot API
|
||
- **Open-source community**: For countless tools and libraries
|
||
|
||
**Inspirations**:
|
||
- Modern AI agent frameworks (AutoGPT, LangChain, etc.)
|
||
- DevOps and SRE communities
|
||
- Homelab enthusiasts worldwide
|
||
- Traditional Hindu philosophy and wisdom
|
||
|
||
**Community Contributors**:
|
||
- Everyone who tests, reports bugs, suggests improvements
|
||
- Those who share their configurations and experiences
|
||
- Community members who help others in forums
|
||
|
||
**Special Thanks**:
|
||
- Early adopters who provided feedback
|
||
- Beta testers of various agent configurations
|
||
- Those who contributed workflow examples
|
||
- Everyone who believes in self-hosted, private AI infrastructure
|
||
|
||
---
|
||
|
||
### Section 12.5: License & Usage
|
||
|
||
**License**: MIT License (or specify your chosen license)
|
||
|
||
**Educational and Homelab Use**:
|
||
This project is designed for:
|
||
- Personal homelab environments
|
||
- Educational purposes
|
||
- Small business infrastructure
|
||
- Learning about AI agents and automation
|
||
|
||
**Disclaimer**:
|
||
- Provided as-is, without warranty
|
||
- Test thoroughly before production use
|
||
- Author not liable for infrastructure issues
|
||
- Comply with all service Terms of Service
|
||
- Respect API rate limits and usage policies
|
||
|
||
**Commercial Use**:
|
||
While the code and documentation are open-source:
|
||
- Consult with author for large-scale commercial deployments
|
||
- Ensure compliance with LLM provider commercial terms
|
||
- Consider enterprise support if needed
|
||
|
||
**Attribution**:
|
||
If you use this project or derivatives:
|
||
- Attribution appreciated but not required
|
||
- Link back to this repository if sharing publicly
|
||
- Share improvements back to community when possible
|
||
|
||
---
|
||
|
||
### Section 12.6: Roadmap & Future Development
|
||
|
||
**Current Focus** (v1.0):
|
||
- ✅ Complete documentation
|
||
- ✅ Core agent roles defined
|
||
- ✅ Integration guides for major services
|
||
- ✅ Production-tested workflows
|
||
|
||
**Planned Enhancements** (v2.0):
|
||
- 📋 Workflow export library (ready-to-import templates)
|
||
- 📋 Video tutorial series
|
||
- 📋 Additional service integrations (Home Assistant, Grafana, etc.)
|
||
- 📋 Web dashboard for agent monitoring
|
||
- 📋 Agent performance analytics
|
||
|
||
**Long-Term Vision**:
|
||
- 🔮 Fine-tuned models for homelab-specific tasks
|
||
- 🔮 Multi-language support (documentation)
|
||
- 🔮 Enterprise features (RBAC, audit logging, compliance)
|
||
- 🔮 Agent marketplace (community-contributed specialist agents)
|
||
|
||
**Your Input Shapes the Future**:
|
||
Roadmap priorities are influenced by:
|
||
- Community feedback and feature requests
|
||
- Real-world use cases and pain points
|
||
- Emerging technologies and integrations
|
||
- Contributor interests and expertise
|
||
|
||
Join the journey by sharing your needs and ideas!
|
||
|
||
---
|
||
|
||
**Thank you for being part of the n8n-Homelab-CTO-Agent-Team community. Your infrastructure deserves intelligent oversight, and together we're making that a reality.**
|
||
|
||
---
|
||
|
||
## Closing Note
|
||
|
||
As you embark on this journey of building an AI-powered IT department for your homelab, remember the ancient wisdom that guides these agents:
|
||
|
||
**Sanskrit Shlok** (Blessing for Your Infrastructure):
|
||
|
||
> **"सर्वे भद्राणि पश्यन्तु सर्वे सन्तु निरामया ।
|
||
> सर्वे भद्राणि पश्यन्तु मा कश्चिद्दुःखभाग्भवेत् ॥"**
|
||
>
|
||
> *"Sarve bhadrāṇi paśyantu sarve santu nirāmayā |
|
||
> Sarve bhadrāṇi paśyantu mā kaścid duḥkhabhāg bhavet ||"*
|
||
|
||
**Translation**:
|
||
> "May all beings see auspiciousness, may all beings be free from illness.
|
||
> May all beings experience prosperity, may no one experience suffering."
|
||
|
||
**Applied to Your Homelab**:
|
||
> "May all your services see uptime, may all your systems be free from failures.
|
||
> May all your infrastructure experience stability, may no user experience downtime."
|
||
|
||
---
|
||
|
||
### Final Wisdom from the Agents
|
||
|
||
**Vishnu (CTO)** says:
|
||
> "Balance is the key. Monitor closely, but trust your systems. Automate wisely, but remain in control. Your infrastructure is an ecosystem—nurture it."
|
||
|
||
**Brahma (Infrastructure)** says:
|
||
> "Build with intention. Every service has a purpose. Every network segment has a reason. Create with vision, not just reaction."
|
||
|
||
**Saraswati (Knowledge)** says:
|
||
> "Document everything. The knowledge you capture today saves hours tomorrow. Your future self will thank your present self."
|
||
|
||
**Ganesha (Security)** says:
|
||
> "Remove obstacles, but create necessary barriers. Security is not paranoia—it's wisdom. Protect what matters."
|
||
|
||
**Shiva (DevOps)** says:
|
||
> "Embrace change. The old must make way for the new. Break without fear, for you have backups. Deploy with confidence."
|
||
|
||
**Hanuman (Service)** says:
|
||
> "Serve with devotion. Every user issue matters. Every ticket deserves attention. Your dedication defines the experience."
|
||
|
||
---
|
||
|
||
### Homelab DevOps Blessing
|
||
|
||
**May Your Infrastructure Journey Be Blessed**:
|
||
|
||
- 🔱 **May your services always be UP**, your logs always be clear
|
||
- ⚙️ **May your containers always start**, your builds always succeed
|
||
- 📚 **May your databases never corrupt**, your backups always restore
|
||
- 🛡️ **May your firewalls block threats**, but allow legitimate traffic
|
||
- 🔥 **May your deployments be smooth**, your rollbacks unnecessary
|
||
- 🙏 **May your users be patient**, your on-call nights be quiet
|
||
|
||
---
|
||
|
||
### Going Forward
|
||
|
||
You now possess the knowledge to build and operate a world-class AI-powered infrastructure team. What started as a simple monitoring script has evolved into a sophisticated, intelligent system that thinks, decides, and acts with wisdom.
|
||
|
||
**Remember**:
|
||
- Start small, grow incrementally
|
||
- Test thoroughly, deploy carefully
|
||
- Monitor constantly, improve continuously
|
||
- Share knowledge, help others
|
||
- Stay curious, keep learning
|
||
|
||
**Your agents are ready. Your infrastructure awaits. May your homelab flourish.**
|
||
|
||
---
|
||
|
||
**Built with ❤️ for homelabbers and self-hosted enthusiasts**
|
||
|
||
**Author**: Ritesh Rana
|
||
**Contact**: riteshrana36@gmail.com
|
||
**Website**: www.riteshrana.engineer
|
||
**GitHub**: @ambicuity
|
||
|
||
---
|
||
|
||
*"In the vast cosmos of data and compute, may your agents be wise guardians, your systems be resilient, and your journey be filled with discovery."*
|
||
|
||
**ॐ शान्तिः शान्तिः शान्तिः ॥**
|
||
*(Om Shanti Shanti Shanti)*
|
||
**Peace, Peace, Peace**
|
||
|
||
---
|
||
|
||
**End of Guide** |