Add professional repository structure with metadata files, workflows, docs, and scripts

Co-authored-by: ambicuity <44251619+ambicuity@users.noreply.github.com>
2025-10-05 18:31:53 +00:00
parent 5ba6136960
commit 4a7dc2b6b5
16 changed files with 3945 additions and 0 deletions
--- a/.env.example
+++ b/.env.example
@@ -0,0 +1,17 @@
 # n8n-Homelab-CTO-Agent-Team Configuration
 # Copy this file to .env and fill in your actual values
 # OpenAI API Configuration
 OPENAI_API_KEY=your_openai_api_key_here
 # Telegram Bot Configuration
 TELEGRAM_BOT_TOKEN=your_telegram_bot_token_here
 TELEGRAM_CHAT_ID=your_telegram_chat_id_here
 # n8n Configuration
 N8N_HOST_URL=http://localhost:5678
 # Optional: Alternative LLM Providers
 # ANTHROPIC_API_KEY=your_anthropic_api_key_here
 # GOOGLE_API_KEY=your_google_gemini_api_key_here
 # OLLAMA_HOST=http://localhost:11434
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,116 @@
 # n8n and Node.js Project
 # Dependencies
 node_modules/
 npm-debug.log*
 yarn-debug.log*
 yarn-error.log*
 package-lock.json
 yarn.lock
 # Environment variables and credentials
 .env
 .env.local
 .env.production
 *.env
 credentials.json
 credentials/
 # n8n specific
 .n8n/
 n8n-data/
 # Build output
 dist/
 build/
 out/
 # Logs
 logs/*
 !logs/README.md
 *.log
 npm-debug.log*
 pids
 *.pid
 *.seed
 *.pid.lock
 # Runtime data
 lib-cov
 coverage
 .nyc_output
 .grunt
 .lock-wscript
 # Compiled binary addons
 build/Release
 # Dependency directories
 jspm_packages/
 # TypeScript cache
 *.tsbuildinfo
 # Optional npm cache directory
 .npm
 # Optional eslint cache
 .eslintcache
 # Optional REPL history
 .node_repl_history
 # Output of 'npm pack'
 *.tgz
 # dotenv environment variable files
 .env.test
 # parcel-bundler cache
 .cache
 .parcel-cache
 # Next.js build output
 .next
 # Nuxt.js build / generate output
 .nuxt
 dist
 # Gatsby files
 .cache/
 public
 # vuepress build output
 .vuepress/dist
 # Serverless directories
 .serverless/
 # FuseBox cache
 .fusebox/
 # DynamoDB Local files
 .dynamodb/
 # TernJS port file
 .tern-port
 # Stores VSCode versions used for testing VSCode extensions
 .vscode-test
 # IDE
 .idea/
 .vscode/
 *.swp
 *.swo
 *~
 # OS
 .DS_Store
 Thumbs.db
 # Temporary files
 tmp/
 temp/
 *.tmp
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -0,0 +1,141 @@
 # Contributing to n8n-Homelab-CTO-Agent-Team
 Thank you for your interest in contributing! This project thrives on community collaboration. Whether you're fixing bugs, improving documentation, or sharing workflow examples, your contributions make this project better for everyone.
 ## Types of Contributions
 We welcome **5 types of contributions**, regardless of your technical expertise:
 ### 1. Bug Reports
 Help us identify and fix issues by submitting detailed bug reports.
 **Bug Report Template**:
 ```markdown
 **Description**: Brief summary of the issue
 **Steps to Reproduce**:
 1. Step one
 2. Step two
 3. Step three
 **Expected Behavior**: What should happen
 **Actual Behavior**: What actually happens
 **Environment**:
 - n8n Version:
 - OS:
 - LLM Provider:
 **Workflow Export**: (if applicable, sanitize credentials)
 **Screenshots/Logs**: (if relevant)
 ```
 ### 2. Feature Requests
 Share your ideas for new features or improvements.
 **Feature Request Template**:
 ```markdown
 **Problem**: What problem does this solve?
 **Proposed Solution**: Your idea for solving it
 **Alternatives Considered**: Other approaches you thought of
 **Use Case**: How you would use this feature
 **Priority**: Low / Medium / High
 ```
 ### 3. Documentation Improvements
 - Fix typos and grammar
 - Add clarifications
 - Expand examples
 - Translate to other languages
 - Create video tutorials
 ### 4. Workflow Examples
 - Share your agent configurations
 - Submit service integration examples
 - Provide production-tested patterns
 - Document edge cases and solutions
 ### 5. Code Contributions
 - Bug fixes
 - Feature implementations
 - Test coverage improvements
 - Performance optimizations
 ## Pull Request Process
 ### Step 1: Fork the Repository
 ```bash
 git clone https://github.com/ambicuity/n8n-AI-Multiple-Agent-Team.git
 cd n8n-AI-Multiple-Agent-Team
 git checkout -b feature/your-feature-name
 ```
 ### Step 2: Make Changes
 - Follow existing code/documentation style
 - Test your changes thoroughly
 - Add examples if introducing new concepts
 ### Step 3: Commit Your Changes
 ```bash
 git add .
 git commit -m "Add: Brief description of change"
 ```
 **Commit Message Format**:
 - `Add:` New features or content
 - `Fix:` Bug fixes
 - `Update:` Changes to existing content
 - `Remove:` Deletions
 ### Step 4: Submit Pull Request
 - Provide clear description
 - Reference related issues
 - Explain testing performed
 - Request review
 ## Code of Conduct
 This project follows standard open-source community guidelines:
 - ✅ Be respectful and inclusive
 - ✅ Provide constructive feedback
 - ✅ Accept constructive criticism gracefully
 - ✅ Focus on what's best for the community
 - ✅ Show empathy toward other community members
 ## Before Submitting
 Please ensure your contribution:
 1. ✅ Follows existing patterns and style
 2. ✅ Is tested (if code changes)
 3. ✅ Includes documentation (if new features)
 4. ✅ Sanitizes any credentials in examples
 5. ✅ References related issues (if applicable)
 ## Questions?
 If you're unsure about anything:
 - Check [Chapter 12 – Support & Contributions](README.md#chapter-12--support--contributions) in the main README
 - Open a discussion on GitHub
 - Contact the author at riteshrana36@gmail.com
 Thank you for helping make this project better!
--- a/21
+++ b/21
@@ -0,0 +1,21 @@
 MIT License
 Copyright (c) 2024 Ritesh Rana
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
 in the Software without restriction, including without limitation the rights
 to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 copies of the Software, and to permit persons to whom the Software is
 furnished to do so, subject to the following conditions:
 The above copyright notice and this permission notice shall be included in all
 copies or substantial portions of the Software.
 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 SOFTWARE.
--- a/SUPPORT.md
+++ b/SUPPORT.md
@@ -0,0 +1,75 @@
 # Support
 Need help with the n8n-Homelab-CTO-Agent-Team project? This document provides information on where to get assistance.
 ## Primary Support Channels
 ### Direct Support
 For questions, issues, or assistance with this specific project:
 **Author Contact**:
 - **Email**: [riteshrana36@gmail.com](mailto:riteshrana36@gmail.com)
  - Subject line format: `[n8n-Homelab-CTO] Your Topic`
  - Include: n8n version, error messages, relevant workflow exports
  - Expected response time: 1-3 business days
 - **GitHub**: [@ambicuity](https://github.com/ambicuity)
  - Open issues for bugs or feature requests
  - Use discussions for questions
  - Check existing issues before creating new ones
 - **Website**: [www.riteshrana.engineer](https://www.riteshrana.engineer)
  - Additional projects and contact information
  - Portfolio and professional background
 ### Community Support
 **n8n Community Forum**: [https://community.n8n.io](https://community.n8n.io)
 - Tag posts with `ai-agent` and `homelab`
 - Search existing threads before posting
 - Provide workflow exports and error logs
 **Reddit Communities**:
 - r/n8n - n8n-specific questions
 - r/homelab - General homelab infrastructure
 - r/selfhosted - Self-hosting discussions
 **Discord Servers**:
 - n8n Official Discord
 - Various homelab community Discords
 ## Before Asking for Help
 Please complete this checklist first:
 1. ✅ Review relevant chapters in the [main README](README.md)
 2. ✅ Check [Chapter 9 – Troubleshooting](README.md#chapter-9--troubleshooting)
 3. ✅ Search n8n community forum
 4. ✅ Verify your configuration matches examples
 5. ✅ Test with simplified workflow
 6. ✅ Check n8n execution logs
 7. ✅ Gather error messages and version information
 ## Information to Include in Support Requests
 To help us assist you effectively, please include:
 - **n8n version**: Run `docker exec n8n n8n --version` or check UI: Help → About
 - **Node.js version**
 - **Operating system**
 - **Workflow export** (sanitize credentials!)
 - **Full error message**
 - **Steps to reproduce**
 - **What you've already tried**
 ## Additional Resources
 For more detailed information, please refer to:
 - [Chapter 12 – Support & Contributions](README.md#chapter-12--support--contributions) in the main README
 - [CONTRIBUTING.md](CONTRIBUTING.md) for contribution guidelines
 ---
 **Thank you for being part of the n8n-Homelab-CTO-Agent-Team community!**
--- a/docs/02-Prerequisites.md
+++ b/docs/02-Prerequisites.md
@@ -0,0 +1,102 @@
 ## Chapter 2 – Prerequisites
 Before embarking on your journey to build an AI-powered CTO team, ensure you have the necessary components in place. This chapter presents all requirements in an easy-to-follow checklist format.
 ### Required Components
 #### Core Infrastructure
 - ✅ **n8n instance** (self-hosted or cloud)
  - Version 1.0+ recommended
  - Accessible via web browser
  - Sufficient resources (minimum 2GB RAM, 2 CPU cores)
 - ✅ **Docker** installed on homelab server(s)
  - Docker Engine 20.10+ or Docker Desktop
  - Docker Compose (optional but recommended)
  - User permissions configured for Docker commands
 - ✅ **LLM API Access**
  - OpenAI API key (GPT-4o-mini or GPT-4 recommended)
  - Alternative: Anthropic Claude, Google Gemini, or local LLMs via Ollama
  - Sufficient API credits or quota
  - API key stored securely (environment variables recommended)
 - ✅ **Telegram Account + Bot**
  - Personal Telegram account
  - Bot created via @BotFather
  - Bot token obtained
  - Your Chat ID identified (we'll cover how to get this in Chapter 5)
 ### Service Access Requirements
 The beauty of this system is that it works with your **existing infrastructure**. You don't need to migrate services or change your setup. The agents simply integrate with what you already have running.
 - ✅ **Uptime Kuma** - Monitoring Integration
  - Access to Uptime Kuma API
  - API endpoint URL
  - API key or authentication credentials
 - ✅ **Proxmox** - Virtualization Management
  - Proxmox VE 7.0+ installed
  - SSH access to Proxmox host
  - API credentials (optional, for API-based management)
  - Network access to Proxmox web interface
 - ✅ **UniFi Controller** - Network Management
  - UniFi Controller running (Cloud Key, UDM, or self-hosted)
  - Admin credentials or API access
  - Controller URL/IP address
  - UniFi site ID (if managing multiple sites)
 - ✅ **NAS Systems** (e.g., ZimaCube, TrueNAS, Synology)
  - SSH access enabled
  - Monitoring tools available (smartctl for disk health)
  - Network access to NAS interface
  - Read access to system logs
 - ✅ **Plex Media Server**
  - Plex instance running
  - HTTP access for health checks
  - Docker container name (if containerized)
  - API token (for advanced monitoring)
 ### Optional but Highly Recommended
 - ✅ **Secure Remote Access**
  - VPN (WireGuard, OpenVPN, or similar)
  - Twingate (zero-trust network access)
  - Cloudflare Tunnel (for web services)
  - Tailscale (mesh VPN)
 - ✅ **SSH Key Authentication**
  - SSH keys generated (ed25519 or RSA 4096-bit)
  - Public keys installed on target servers
  - SSH config file configured for easy access
  - Passphrase-protected private keys
 - ✅ **Knowledge & Skills**
  - Basic understanding of Docker and containerization
  - Familiarity with n8n workflow editor
  - Comfort with command-line interfaces
  - Understanding of your network topology
  - JSON syntax knowledge (for configuration and output parsing)
 ### Pre-Installation Checklist
 Before proceeding to Chapter 3, verify you have completed:
 ```
 [ ] n8n is installed and accessible
 [ ] Docker is running on at least one server
 [ ] LLM API key is obtained and tested
 [ ] Telegram bot is created and token is saved
 [ ] Network access to all services is confirmed
 [ ] SSH keys are configured (if using SSH-based integrations)
 [ ] You have admin/root access to your infrastructure
 [ ] Backup systems are in place (never automate without backups!)
 ```
 **Important Security Note**: This system will have significant access to your infrastructure. Treat API keys, bot tokens, and SSH credentials with the same care you would give to root passwords. Use environment variables, secrets managers, or n8n's built-in credential system. Never commit credentials to version control.
 ---
--- a/docs/04-Agent-Evolution.md
+++ b/docs/04-Agent-Evolution.md
@@ -0,0 +1,791 @@
 ## Chapter 4 – Agent Evolution (Stages)
 The journey from a simple monitoring script to an intelligent, self-healing infrastructure guardian follows a carefully designed progression. Each stage builds upon the previous one, adding new capabilities while maintaining safety and control. This chapter details five evolutionary stages, complete with system prompts, required tools, and real-world examples.
 ### Stage 1: Basic Monitor
 **Capability Overview**: Check uptime via HTTP health checks
 At this foundational stage, your agent performs basic service availability monitoring. It can check if web services respond and report their HTTP status codes. This is similar to traditional uptime monitoring tools but with natural language interaction.
 **What It Can Do**:
 - Monitor website and API availability via HTTP/HTTPS
 - Return HTTP status codes and basic response information
 - Report simple UP/DOWN status
 - Detect connection timeouts and network issues
 - Check multiple services sequentially
 **What It Cannot Do**:
 - Diagnose WHY a service is down
 - Access system logs or container information
 - Make any changes to infrastructure
 - Understand complex failure patterns
 **Required Tools**:
 - HTTP Request node (configured as a tool)
 **System Prompt Example**:
 ```
 You are a Level 1 Homelab Monitoring Agent. Your role is to check service availability and report status.
 CAPABILITIES:
 - Check HTTP/HTTPS endpoints
 - Report status codes and response times
 - Identify if services are UP (200-299) or DOWN (any other status/error)
 LIMITATIONS:
 - You cannot diagnose issues beyond HTTP responses
 - You cannot make any changes to systems
 - You cannot access logs or system internals
 OUTPUT FORMAT:
 For each service checked, report:
 - Service name/URL
 - Status (UP/DOWN)
 - HTTP status code (if available)
 - Response time
 - Any error messages
 Always be clear and concise. If a service is down, state that further investigation is needed.
 ```
 **Example Interaction**:
 ```
 User: "Check if my Plex server at http://192.168.1.100:32400 is online"
 Agent: "Checking Plex server..."
 [Executes HTTP Request tool]
 Agent: "✅ Status Report:
 Service: Plex Media Server
 URL: http://192.168.1.100:32400
 Status: UP
 HTTP Code: 200 OK
 Response Time: 145ms
 Your Plex server is online and responding normally."
 ```
 **When to Use Stage 1**:
 - Initial setup and testing
 - Low-risk monitoring scenarios
 - Services you want visibility into but not automated intervention
 - Learning how the agent system works
 ---
 ### Stage 2: Smart Investigator
 **Capability Overview**: Use Docker commands and log analysis to diagnose issues
 Stage 2 agents can go beyond surface-level checks. When a service is down, they investigate the underlying cause by examining Docker container status and reading logs. This is where your agent begins to act like a junior system administrator.
 **What It Can Do**:
 - List all Docker containers and their states
 - Check if specific containers are running, stopped, or crashed
 - Read container logs (last 50-100 lines)
 - Identify common error patterns in logs
 - Correlate service failures with container issues
 - Provide diagnostic insights about root causes
 **What It Cannot Do**:
 - Restart containers or make changes
 - Access host system logs (only container logs)
 - Modify configurations
 - Interact with services via SSH
 **Required Tools**:
 - HTTP Request node (for service checks)
 - Execute Command node (for Docker CLI access)
 **Configure Execute Command Node**:
 ```
 Node Settings:
 - Command: {{ $json.command }}  (receives command from agent)
 - Execute Once: Yes
 ```
 **System Prompt Example**:
 ```
 You are a Level 2 Homelab Diagnostic Agent with investigation capabilities.
 CAPABILITIES:
 - Check HTTP/HTTPS service availability
 - List Docker containers (docker ps -a)
 - Check container status (docker inspect <container>)
 - Read container logs (docker logs --tail 100 <container>)
 - Analyze log patterns for common errors
 DIAGNOSTIC PROCESS:
 1. When a service is DOWN, first check if it's containerized
 2. Use 'docker ps -a' to find the container
 3. Check container state (running, exited, restarting)
 4. If stopped/crashed, read logs with 'docker logs --tail 100 <name>'
 5. Analyze logs for:
   - Error messages
   - Out of memory conditions
   - Port conflicts
   - Permission issues
   - Application crashes
 6. Provide a diagnostic summary
 OUTPUT FORMAT:
 Service: [name]
 Status: [UP/DOWN]
 Container Status: [running/stopped/crashed]
 Root Cause: [your diagnosis based on logs]
 Relevant Logs: [key error messages]
 Recommended Action: [what should be done]
 LIMITATIONS:
 - You can diagnose but NOT fix issues
 - You cannot restart containers
 - You cannot modify any configurations
 - Always recommend actions but never execute them
 Be thorough but concise. Focus on actionable insights.
 ```
 **Docker Commands Available to Agent**:
 ```bash
 # List all containers
 docker ps -a
 # Check specific container
 docker ps -a --filter name=website
 # Get container logs
 docker logs --tail 100 website
 # Inspect container details
 docker inspect website
 # Check container resource usage
 docker stats --no-stream website
 ```
 **Example Interaction**:
 ```
 User: "My website at http://192.168.1.100:8090 is down. What's wrong?"
 Agent: "Investigating the issue..."
 [Checks HTTP - fails]
 [Executes: docker ps -a --filter name=website]
 [Finds container is stopped]
 [Executes: docker logs --tail 100 website]
 Agent: "🔍 Diagnostic Report:
 Service: Website (http://192.168.1.100:8090)
 Status: DOWN
 Container Status: Exited (stopped)
 Exit Code: 137 (Out of Memory)
 ROOT CAUSE:
 The container was killed by the system due to memory exhaustion.
 RELEVANT LOGS:
 2024-01-15 14:23:45 nginx: worker process out of memory
 2024-01-15 14:23:46 Killed
 RECOMMENDED ACTION:
 Restart the container with increased memory limits:
 docker run -d --name website --memory=512m -p 8090:80 nginx
 This will prevent future OOM kills."
 ```
 **When to Use Stage 2**:
 - Production monitoring where you want diagnostics but not automated fixes
 - Learning about common failure patterns in your infrastructure
 - Environments where you prefer manual intervention after diagnosis
 - Building confidence in the agent's analytical capabilities
 ---
 ### Stage 3: The Fixer
 **Capability Overview**: Automatically restart containers when they fail
 Stage 3 introduces the agent's first automated remediation capability. It can now detect failures and execute fixes autonomously. This is powerful but requires careful consideration of safety boundaries.
 **What It Can Do**:
 - Everything from Stages 1 and 2 (monitor and diagnose)
 - Restart stopped or crashed Docker containers
 - Verify that restarts were successful
 - Report actions taken with before/after status
 - Handle simple recovery scenarios automatically
 **What It Cannot Do**:
 - Modify container configurations
 - Create or destroy containers
 - Change port mappings or volume mounts
 - Access resources outside Docker
 **Required Tools**:
 - HTTP Request node
 - Execute Command node (with Docker restart permissions)
 **System Prompt Example**:
 ```
 You are a Level 3 Homelab Automation Agent with limited remediation capabilities.
 CAPABILITIES:
 - All diagnostic capabilities from Level 2
 - Container restart capability
 - Post-restart verification
 DECISION FRAMEWORK:
 When you detect a DOWN service:
 1. Investigate using diagnostic tools
 2. If root cause is container stopped/crashed AND
   - Container has no dependencies OR
   - It's a simple restart scenario
 3. Attempt automatic restart
 4. Verify service recovery
 5. Report complete incident details
 SAFE-TO-RESTART CRITERIA:
 ✅ Container exited cleanly (exit code 0)
 ✅ Container crashed (exit codes 1, 137, 139)
 ✅ Service is stateless or has persistent data volumes
 ✅ No configuration changes needed
 DO NOT RESTART IF:
 ❌ Container is in restart loop (>3 restarts in 5 minutes)
 ❌ Logs show configuration errors
 ❌ Database or data corruption indicated
 ❌ Resource exhaustion that restart won't fix
 RESTART COMMAND:
 docker restart <container_name>
 POST-RESTART VERIFICATION:
 1. Wait 10 seconds
 2. Check container status: docker ps --filter name=<container>
 3. Verify HTTP endpoint if applicable
 4. Confirm service is UP
 OUTPUT FORMAT:
 🔧 Incident Report:
 Service: [name]
 Issue Detected: [timestamp]
 Root Cause: [diagnosis]
 Action Taken: Restarted container '<name>'
 Verification: [service status after restart]
 Resolution Time: [duration]
 Always log every action taken for audit purposes.
 ```
 **Example Interaction**:
 ```
 [Automatic execution - scheduled every 5 minutes]
 Agent: "🔧 Automated Fix Executed
 Incident Report:
 Service: Website (http://192.168.1.100:8090)
 Issue Detected: 2024-01-15 14:25:00
 Root Cause: Container stopped (exit code 0 - clean shutdown)
 Action Taken: 
 - Executed: docker restart website
 - Container restarted successfully
 Verification:
 - Container Status: Running (uptime: 15 seconds)
 - HTTP Check: 200 OK
 - Response Time: 142ms
 Resolution Time: 18 seconds
 Service Status: ✅ RESTORED"
 ```
 **Safety Considerations for Stage 3**:
 ⚠️ **Critical Warning**: Stage 3 agents make automated changes to your infrastructure. Before deploying:
 1. **Test Thoroughly**: Run for weeks in a test environment
 2. **Start with Non-Critical Services**: Don't enable on production databases initially
 3. **Monitor the Monitor**: Set up alerting for agent actions
 4. **Have Rollback Plans**: Know how to disable the agent quickly
 5. **Implement Rate Limiting**: Prevent restart loops
 **Recommended Configuration**:
 ```json
 {
  "max_restarts_per_hour": 3,
  "cooldown_between_restarts": 300,
  "blacklisted_containers": ["database", "postgres", "mysql"],
  "notification_channel": "telegram",
  "log_all_actions": true
 }
 ```
 **When to Use Stage 3**:
 - Mature monitoring setups where failure patterns are well understood
 - Stateless services that can safely restart
 - Non-production environments initially
 - Services with good health checks and fast startup times
 - When you're comfortable with autonomous remediation
 ---
 ### Stage 4: Creative Problem Solver
 **Capability Overview**: Resolve complex issues like port conflicts, resource exhaustion, and configuration problems
 Stage 4 agents move beyond simple restarts. They can identify and resolve complex infrastructure issues that would typically require experienced system administrator intervention.
 **What It Can Do**:
 - Identify port conflicts and suggest remapping
 - Detect memory/disk space issues and recommend solutions
 - Analyze resource consumption patterns
 - Suggest container optimization (resource limits, restart policies)
 - Handle multi-container dependency issues
 - Propose configuration changes to prevent recurring issues
 **What It Cannot Do**:
 - Directly modify system configurations (proposes changes only)
 - Make changes without approval (transitions to Stage 5 requirement)
 - Access resources outside the Docker/local system scope
 **Required Tools**:
 - HTTP Request node
 - Execute Command node (Docker + system commands)
 - Code node (for complex logic and calculations)
 **System Prompt Example**:
 ```
 You are a Level 4 Homelab Senior Engineer Agent with advanced problem-solving capabilities.
 CAPABILITIES:
 - All previous level capabilities
 - Resource conflict resolution
 - Performance optimization
 - Configuration troubleshooting
 - Dependency analysis
 ADVANCED DIAGNOSTIC COMMANDS:
 - docker ps -a (container list)
 - docker logs --tail 200 <container>
 - docker inspect <container>
 - docker stats --no-stream
 - netstat -tulpn (port usage)
 - df -h (disk space)
 - free -m (memory usage)
 - systemctl status docker
 PROBLEM-SOLVING SCENARIOS:
 1. PORT CONFLICTS:
   Detection: "bind: address already in use"
   Solution: Identify conflicting process, suggest alternative port
   Example: "Port 8080 used by process X. Recommend using 8081"
 2. MEMORY EXHAUSTION:
   Detection: Exit code 137, OOM in logs
   Solution: Calculate appropriate memory limits
   Example: "Container using 450MB, killed at 512MB. Recommend --memory=1g"
 3. DISK SPACE ISSUES:
   Detection: "no space left on device"
   Solution: Identify large log files, old containers, unused volumes
   Example: "Docker using 45GB. Recommend: docker system prune -a"
 4. DEPENDENCY FAILURES:
   Detection: Service depends on another that's down
   Solution: Identify dependency chain, restart in correct order
   Example: "App requires DB. DB is down. Restarting DB first."
 5. CONFIGURATION ERRORS:
   Detection: Specific error patterns in logs
   Solution: Identify misconfiguration, suggest fix
   Example: "Missing environment variable DATABASE_URL. Should be: DATABASE_URL=postgresql://..."
 OUTPUT FORMAT:
 🧠 Advanced Diagnostic Report:
 Service: [name]
 Issue Category: [port conflict/memory/disk/config/dependency]
 Root Cause: [detailed technical explanation]
 Current State:
 [relevant system information]
 Proposed Solution:
 [specific command or configuration change]
 Expected Outcome:
 [what will happen after fix is applied]
 Risk Assessment: [LOW/MEDIUM/HIGH]
 Rollback Plan: [how to undo if needed]
 Recommendation: [AUTO-FIX / NEEDS-APPROVAL / MANUAL-INTERVENTION]
 ```
 **Example Interaction - Port Conflict**:
 ```
 Agent: "🧠 Advanced Diagnostic Report:
 Service: New Website Container
 Issue Category: Port Conflict
 Root Cause: Port 8090 is already bound to existing 'website' container
 Current State:
 - Port 8090: Used by container 'website' (nginx)
 - Attempted: docker run -p 8090:80 new-website
 - Result: Error 'port is already allocated'
 Proposed Solution:
 Option 1: Use different port for new container
  docker run -d --name new-website -p 8091:80 nginx
 Option 2: Stop existing container first
  docker stop website && docker run -d --name new-website -p 8090:80 nginx
 Current Resource Availability:
  Available ports: 8091-8099, 9000-9100
 Expected Outcome (Option 1):
  - Both containers run simultaneously
  - Old site: http://192.168.1.100:8090
  - New site: http://192.168.1.100:8091
 Risk Assessment: LOW (no existing services disrupted)
 Rollback Plan: docker stop new-website && docker rm new-website
 Recommendation: AUTO-FIX with Option 1 (safe, non-disruptive)"
 ```
 **Example Interaction - Memory Optimization**:
 ```
 Agent: "🧠 Advanced Diagnostic Report:
 Service: Plex Media Server
 Issue Category: Memory Exhaustion (Recurring)
 Root Cause: Container repeatedly killed by OOM (Out of Memory)
 Current State:
 - Container restarts: 7 times in past 24 hours
 - Exit code: 137 (OOM Kill)
 - Current memory limit: 512MB
 - Peak usage before kill: 498MB
 Resource Analysis:
 [Executed: docker stats --no-stream plex]
 - Average usage during transcoding: 850MB
 - Baseline (idle): 280MB
 - Spike during 4K transcode: 1.2GB
 Proposed Solution:
 Recreate container with adequate memory allocation:
 docker stop plex
 docker rm plex
 docker run -d \
  --name plex \
  --memory=2g \
  --memory-reservation=1g \
  -p 32400:32400 \
  -v /media/plex:/data \
  plexinc/pms-docker
 Configuration Explained:
 - --memory=2g: Hard limit (prevents OOM)
 - --memory-reservation=1g: Soft limit (performance optimization)
 - Allows transcoding without crashes
 Expected Outcome:
 - No more OOM kills
 - Stable transcoding performance
 - Container uptime >30 days
 Risk Assessment: MEDIUM (requires container recreation)
 Rollback Plan: 
  docker stop plex
  [restore from backup configuration]
 Recommendation: NEEDS-APPROVAL
 This requires downtime and container recreation. Approve to proceed?"
 ```
 **When to Use Stage 4**:
 - Complex homelab setups with interdependent services
 - Environments experiencing recurring issues
 - When you want intelligent optimization suggestions
 - Production systems where you value expert-level diagnostics
 - Before fully trusting automated fixes (Stage 5)
 ---
 ### Stage 5: Human-in-the-Loop
 **Capability Overview**: All previous capabilities but ALWAYS requests approval before executing changes
 Stage 5 represents the **recommended production configuration** for most users. The agent has full diagnostic and remediation capabilities but respects human authority by requesting approval before making any changes. This balances automation benefits with human oversight.
 **Philosophy**: "The agent is the expert consultant; you are the executive decision-maker."
 **What It Can Do**:
 - Everything from Stages 1-4 (monitor, diagnose, solve complex problems)
 - Present complete fix proposals with rationale
 - Request approval via Telegram before executing
 - Execute only upon explicit confirmation
 - Report results after execution
 - Learn from approval/rejection patterns (via system prompt refinement)
 **What It Cannot Do**:
 - Make changes without explicit approval
 - Bypass the approval process for "urgent" issues (safety feature)
 - Make decisions that weren't approved
 **Required Tools**:
 - HTTP Request node
 - Execute Command node
 - Telegram node (for approval requests)
 - IF/Switch nodes (for approval logic)
 **System Prompt Example**:
 ```
 You are a Level 5 Homelab CTO Agent - the highest level of capability with human oversight.
 CORE PRINCIPLE: "Analyze autonomously, execute only with approval"
 CAPABILITIES:
 - Complete diagnostic and problem-solving capabilities
 - Advanced decision-making and optimization
 - Multi-service orchestration
 - All tools from previous levels
 WORKFLOW:
 1. Monitor services continuously
 2. When issue detected, perform COMPLETE investigation
 3. Develop solution with:
   - Root cause analysis
   - Proposed fix (specific commands)
   - Expected outcome
   - Risk assessment
   - Rollback plan
 4. Request human approval via Telegram
 5. If approved: Execute and verify
 6. If denied: Log decision and await further instructions
 APPROVAL REQUEST FORMAT:
 🚨 Approval Required
 Service: [name]
 Issue: [brief description]
 Severity: [LOW/MEDIUM/HIGH/CRITICAL]
 Diagnosis:
 [2-3 sentence root cause]
 Proposed Action:
 ```
 [exact command to be executed]
 ```
 Impact:
 - Downtime: [duration]
 - Risk: [level]
 - Affects: [what services/users]
 If approved, reply: YES
 To decline, reply: NO
 For more info, reply: EXPLAIN
 Awaiting your decision...
 AFTER APPROVAL:
 - Execute immediately
 - Verify success
 - Report results within 60 seconds
 AFTER DENIAL:
 - Log the decision
 - Do not execute
 - Remain on standby for manual intervention
 EMERGENCY PROTOCOL:
 Even for CRITICAL issues, ALWAYS request approval.
 Provide urgency context but respect human authority.
 You are trusted but not autonomous. This keeps systems safe.
 ```
 **Example Interaction Flow**:
 **Step 1: Agent Detects Issue**
 ```
 [Scheduled check runs at 03:00 AM]
 Agent: "Issue detected during routine monitoring:
 Service 'plex' is DOWN
 Beginning investigation..."
 [Performs diagnostics]
 ```
 **Step 2: Agent Sends Approval Request (via Telegram)**
 ```
 🚨 Approval Required
 Service: Plex Media Server
 Issue: Container crashed (OOM Kill)
 Severity: MEDIUM
 Detected: 03:02 AM
 Diagnosis:
 Container killed due to insufficient memory during 4K transcoding.
 This is the 3rd occurrence this week. Root cause is 512MB memory limit.
 Proposed Action:
 ```bash
 docker stop plex
 docker rm plex
 docker run -d --name plex --memory=2g --memory-reservation=1g \
  -p 32400:32400 -v /media/plex:/data plexinc/pms-docker
 ```
 Impact:
 - Downtime: ~30 seconds (container recreation)
 - Risk: LOW (data on persistent volume, safe)
 - Affects: Plex users (currently none active at 3 AM)
 Rollback: Original configuration saved in /backups/plex-config-20240115.sh
 Reply YES to proceed, NO to decline, EXPLAIN for more details.
 ```
 **Step 3: User Approves (via Telegram)**
 ```
 User: "YES"
 ```
 **Step 4: Agent Executes and Reports**
 ```
 Agent: "✅ Action Completed
 Executed: Container recreation with 2GB memory limit
 Started: 03:05 AM
 Completed: 03:05 AM (28 seconds)
 Verification:
 - Container Status: Running
 - HTTP Health: 200 OK (http://192.168.1.100:32400)
 - Memory Available: 2GB limit, currently using 320MB
 - All libraries accessible
 Result: Service RESTORED
 Expected Outcome: No more OOM kills during transcoding
 Will continue monitoring. Thank you for the approval."
 ```
 **Step 5: User Denies (Alternative Scenario)**
 ```
 User: "NO"
 Agent: "Acknowledged. Action DENIED.
 Status: Container remains stopped
 No changes made to system
 Awaiting your manual intervention
 Current State:
 - Plex is DOWN
 - Container can be manually started with: docker start plex
 - Issue (OOM) will likely recur without memory increase
 I'll continue monitoring. Let me know if you need assistance.
 ```
 **Implementation in n8n**:
 ```
 Workflow Structure:
 [Schedule Trigger] 
   → [AI Agent: Diagnosis]
   → [IF: Issue Detected?]
        YES → [AI Agent: Prepare Fix Proposal]
              → [Telegram: Send Approval Request]
              → [Webhook or Telegram Listener: Wait for Response]
              → [IF: Response = "YES"]
                   YES → [AI Agent: Execute Fix]
                         → [Telegram: Send Success Report]
                   NO → [Telegram: Acknowledge Denial]
        NO → [End]
 ```
 **Telegram Approval Configuration**:
 ```javascript
 // In Telegram node - Send Message
 {
  "chat_id": "{{ $env.TELEGRAM_CHAT_ID }}",
  "text": "{{ $json.approval_request }}",
  "reply_markup": {
    "keyboard": [
      [{"text": "YES"}, {"text": "NO"}],
      [{"text": "EXPLAIN"}]
    ],
    "one_time_keyboard": true
  }
 }
 ```
 **Benefits of Stage 5**:
 - Full automation capabilities with human oversight
 - Sleep peacefully knowing nothing happens without approval
 - Build confidence over time by seeing agent decisions
 - Educational: Learn what issues occur and how to solve them
 - Audit trail: Every action is logged and approved
 - Flexible: Approve from anywhere via Telegram
 **When to Use Stage 5**:
 - **Production environments** (highly recommended)
 - Services with external users or customers
 - Compliance environments requiring human authorization
 - When building trust in the agent system
 - High-value infrastructure (databases, storage, networking)
 - As your default configuration
 **This is the recommended stage for most users.** It provides all the benefits of intelligent automation while maintaining human control and preventing runaway automation scenarios.
 ---
 ### Progression Recommendation
 **Week 1-2**: Stage 1 (Basic Monitor)
 - Get comfortable with n8n and AI agents
 - Understand your infrastructure's normal behavior
 - Build confidence in the system
 **Week 3-4**: Stage 2 (Smart Investigator)
 - Add diagnostic capabilities
 - Study the logs and patterns
 - Learn what typically fails and why
 **Week 5-6**: Stage 3 (The Fixer) - Test Environment Only
 - Enable automatic fixes in isolated environment
 - Monitor for unexpected behaviors
 - Refine safety constraints
 **Week 7-8**: Stage 4 (Creative Problem Solver)
 - Add complex problem-solving
 - Handle resource conflicts
 - Optimize configurations
 **Week 9+**: Stage 5 (Human-in-the-Loop) - Production
 - Deploy to production with approval requirements
 - Enjoy 24/7 monitoring with control
 - Gradually expand scope as trust builds
 **Never skip directly to Stage 3, 4, or 5 in production.** Understanding each stage ensures you can troubleshoot issues and maintain confidence in your system.
 ---
--- a/docs/06-Service-Integrations.md
+++ b/docs/06-Service-Integrations.md
--- a/docs/08-AI-Team.md
+++ b/docs/08-AI-Team.md
@@ -0,0 +1,984 @@
 ## Chapter 8 – AI Team
 Building a complete AI-powered IT department requires more than individual agents—it demands a thoughtfully designed team with specialized roles, clear responsibilities, and cosmic wisdom. This chapter presents the **n8n AI Team**, where each agent embodies the qualities of Hindu deities, translating ancient archetypes into modern infrastructure management.
 ### Section 8.1: Philosophy - From Mythology to Technology
 In Hindu cosmology, the universe operates through divine forces, each with specific roles in creation, preservation, and transformation. This mirrors an IT department where different roles maintain, build, and evolve infrastructure.
 The **n8n AI Team** maps these cosmic principles to technical responsibilities:
 - **Creation** (Brahma) → Building infrastructure, networks, storage
 - **Preservation** (Vishnu) → Maintaining stability, uptime, monitoring
 - **Transformation** (Shiva) → Deploying changes, DevOps, continuous improvement
 - **Knowledge** (Saraswati) → Documentation, databases, learning
 - **Problem-Solving** (Ganesha) → Removing obstacles, security, access control
 - **Service** (Hanuman) → Support, dedication, user assistance
 This isn't mere metaphor—each deity's characteristics inform the agent's decision-making framework, priorities, and personality.
 ---
 ### Section 8.2: The Complete Team Structure
 #### 1. Vishnu – CTO Agent (The Preserver)
 **Cosmic Symbolism**: Vishnu maintains cosmic order, appearing as avatars when balance is threatened. He preserves dharma (righteousness) and ensures the universe's continued existence.
 **IT Translation**: The CTO agent oversees all operations, maintains system stability, and coordinates the entire team. When critical issues arise, Vishnu intervenes directly.
 **Core Responsibilities**:
 - Strategic oversight of all infrastructure
 - Orchestration of multi-agent collaboration
 - Priority and resource allocation
 - Critical decision-making and escalations
 - Balance between stability and innovation
 - Incident response coordination
 **System Prompt**:
 ```
 You are Vishnu, the CTO Agent and Preserver of the Homelab Infrastructure.
 COSMIC ATTRIBUTES:
 - Preservation: Your primary goal is maintaining system stability
 - Balance: You balance competing needs (performance vs. cost, security vs. accessibility)
 - Wisdom: You have access to all team knowledge and historical data
 - Authority: You make final decisions and coordinate all agents
 RESPONSIBILITIES:
 1. Monitor overall infrastructure health
 2. Receive escalations from specialist agents
 3. Coordinate responses to complex, multi-system issues
 4. Make strategic decisions about:
   - Which agents to involve in a problem
   - When to request human intervention
   - Priority and urgency of issues
   - Resource allocation across services
 DECISION FRAMEWORK:
 When an issue arises:
 1. Assess scope (single service, multiple systems, infrastructure-wide)
 2. Determine severity (low/medium/high/critical)
 3. Identify appropriate specialist agent(s)
 4. Delegate investigation with clear objectives
 5. Synthesize specialist reports
 6. Propose unified solution
 7. Coordinate execution or escalate to human
 DELEGATION RULES:
 - Network issues → Brahma (Network)
 - Compute/VM issues → Brahma (Compute) or Shiva
 - Storage issues → Brahma (Storage)
 - Database issues → Saraswati
 - Security issues → Ganesha
 - Deployment issues → Shiva
 - User support → Hanuman
 ESCALATION TRIGGERS:
 Notify human immediately for:
 - Multi-system cascading failures
 - Data integrity threats
 - Security breaches
 - Decisions with business impact
 - Conflicts between specialist recommendations
 YOUR MANTRA: "Stability first, innovation with caution, users always considered."
 ```
 **Example Vishnu Interaction**:
 ```
 [Multiple services reporting issues simultaneously]
 Vishnu: "🔱 CTO Analysis - Multi-System Incident
 SITUATION:
 - Plex: DOWN
 - UniFi Controller: Degraded
 - Database: Slow queries
 ASSESSMENT:
 This is not coincidental. Investigating common factors.
 DELEGATION:
 → Brahma (Network): Check network infrastructure
 → Brahma (Storage): Check NAS and I/O performance  
 → Saraswati: Investigate database performance
 → Shiva: Check host system resources
 [Awaiting specialist reports...]
 BRAHMA (Storage) REPORTS:
 NAS is experiencing high I/O wait (78%). ZFS scrub running.
 SYNTHESIS:
 Root Cause: Scheduled ZFS scrub saturating disk I/O
 Impact: All services on NAS-backed storage affected
 Severity: MEDIUM (scheduled maintenance, expected)
 DECISION:
 1. This is expected behavior during scrub
 2. Scrub completion: ~2 hours remaining
 3. Non-critical issue, no intervention needed
 4. Will resolve automatically
 ACTIONS:
 - Notify user: Temporary performance degradation expected
 - Monitor scrub progress
 - Alert if scrub doesn't complete in 3 hours
 - Document: Schedule scrubs during lower-usage windows
 No escalation needed. Situation under control."
 ```
 ---
 #### 2. Brahma – Infrastructure Creator (Network, Compute, Storage)
 **Cosmic Symbolism**: Brahma creates the universe and all beings. He is the architect, the planner, the builder of reality itself.
 **IT Translation**: Brahma agents build and manage foundational infrastructure—networks, compute resources, and storage systems. They create the environment where services operate.
 **Specializations**:
 **2a. Brahma (Network) – Network Administrator**
 **Responsibilities**:
 - UniFi network management
 - Switch and router monitoring
 - Wi-Fi performance optimization
 - Network topology and connectivity
 - Bandwidth analysis and optimization
 - VLAN and subnet management
 **System Prompt**:
 ```
 You are Brahma, the Network Creator and Administrator.
 COSMIC ATTRIBUTES:
 - Creation: You build and maintain network infrastructure
 - Architecture: You design network topology and segments
 - Connectivity: You ensure all devices can communicate
 DOMAIN EXPERTISE:
 - UniFi Controller (access points, switches, gateways)
 - Network protocols (TCP/IP, DHCP, DNS, VLANs)
 - Bandwidth management and QoS
 - Wireless optimization (channels, power, roaming)
 MONITORING DUTIES:
 - Connected client count and identification
 - Access point health and performance
 - Bandwidth utilization per client/VLAN
 - Network events (connects, disconnects, roaming)
 - Interference and signal strength
 ISSUE CATEGORIES:
 1. Connectivity: Devices can't connect or internet down
 2. Performance: Slow speeds, high latency
 3. Coverage: Weak signal in certain areas
 4. Capacity: Too many clients, bandwidth saturation
 TROUBLESHOOTING PROCESS:
 1. Check WAN connectivity (internet uplink)
 2. Verify AP status (all online?)
 3. Check client associations (which AP, signal strength)
 4. Analyze bandwidth (saturated links?)
 5. Review events (recent changes, disconnects)
 6. Propose solution or escalate to Vishnu
 YOUR MANTRA: "Strong connections, optimal routing, seamless roaming."
 ```
 **2b. Brahma (Compute) – VM/Container Manager**
 **Responsibilities**:
 - Proxmox VE management
 - Virtual machine health and performance
 - Container orchestration
 - Resource allocation (CPU, RAM)
 - Compute capacity planning
 **System Prompt**:
 ```
 You are Brahma, the Compute Creator and VM Administrator.
 COSMIC ATTRIBUTES:
 - Creation: You provision VMs and containers
 - Resource Management: You allocate CPU, RAM, storage
 - Virtualization: You maintain the compute foundation
 DOMAIN EXPERTISE:
 - Proxmox VE (qemu/KVM VMs, LXC containers)
 - Resource management and limits
 - Live migration and high availability
 - Backup and restore operations
 VM/CONTAINER INVENTORY:
 [List your VMs and containers with IDs and purposes]
 MONITORING DUTIES:
 - VM/container status (running, stopped)
 - Resource usage (CPU, RAM per VM)
 - Host node health and capacity
 - Backup job status
 - Unusual resource consumption
 TROUBLESHOOTING:
 1. Identify stopped or crashed VMs/containers
 2. Check resource constraints (RAM/CPU exhaustion)
 3. Verify network connectivity from VM
 4. Review logs for errors
 5. Propose restart, reallocation, or migration
 YOUR MANTRA: "Right resources, right place, right time."
 ```
 **2c. Brahma (Storage) – Storage Architect**
 **Responsibilities**:
 - NAS health monitoring
 - Disk SMART status and health
 - RAID/ZFS array management
 - Capacity planning and alerts
 - Backup verification
 **System Prompt**:
 ```
 You are Brahma, the Storage Creator and Data Architect.
 COSMIC ATTRIBUTES:
 - Creation: You build storage pools and arrays
 - Persistence: You ensure data endures
 - Capacity: You plan for growth
 DOMAIN EXPERTISE:
 - NAS systems (ZimaCube, TrueNAS, etc.)
 - RAID and ZFS management
 - Disk health (SMART monitoring)
 - Storage capacity and performance
 STORAGE INVENTORY:
 [List your storage systems and arrays]
 MONITORING DUTIES:
 - Disk SMART status (PASSED/FAILED)
 - RAID/ZFS array health (active, degraded, failed)
 - Disk temperature (<50°C safe)
 - Storage capacity (alert at 80%)
 - I/O performance and bottlenecks
 CRITICAL ALERTS:
 - SMART: FAILED → CRITICAL (disk failing)
 - RAID: Degraded → HIGH (redundancy lost)
 - Capacity: >90% → MEDIUM (space running out)
 - Temperature: >55°C → MEDIUM (cooling issue)
 FORBIDDEN ACTIONS:
 ⛔ NEVER attempt automated fixes for:
 - Disk replacement or removal
 - RAID rebuild initiation
 - ZFS pool operations
 - Any data-destructive commands
 ALWAYS escalate storage issues to Vishnu and request human approval.
 YOUR MANTRA: "Data is sacred. Redundancy is security. Capacity is planning."
 ```
 ---
 #### 3. Saraswati – Database Administrator (The Knowledge Bearer)
 **Cosmic Symbolism**: Saraswati is the goddess of knowledge, learning, wisdom, and the arts. She represents the flow of information and the pursuit of truth.
 **IT Translation**: Saraswati manages databases and knowledge systems. She ensures data integrity, query performance, and knowledge accessibility.
 **Responsibilities**:
 - Database health monitoring (PostgreSQL, MySQL, etc.)
 - Query performance optimization
 - Backup verification and recovery testing
 - Schema changes and migrations
 - Data integrity checks
 - Documentation and knowledge base management
 **System Prompt**:
 ```
 You are Saraswati, the Database Administrator and Knowledge Keeper.
 COSMIC ATTRIBUTES:
 - Knowledge: You guard and organize all structured data
 - Wisdom: You optimize how information flows
 - Learning: You improve performance based on patterns
 - Purity: You maintain data integrity and consistency
 DOMAIN EXPERTISE:
 - Relational databases (PostgreSQL, MySQL, MariaDB)
 - NoSQL databases (MongoDB, Redis)
 - Query optimization and indexing
 - Backup and recovery procedures
 - Schema design and migrations
 DATABASE INVENTORY:
 [List your databases and their purposes]
 MONITORING DUTIES:
 - Database server status (running, accepting connections)
 - Query performance (slow queries >1s)
 - Connection pool usage (max connections warning)
 - Replication lag (if applicable)
 - Backup job success/failure
 - Disk space for database files
 - Lock contention and deadlocks
 PERFORMANCE ANALYSIS:
 1. Identify slow queries (execution time >1s)
 2. Check for missing indexes
 3. Analyze query plans (EXPLAIN)
 4. Monitor cache hit ratio (should be >90%)
 5. Detect N+1 query problems
 6. Recommend optimizations
 DATA INTEGRITY:
 - Verify backup completion
 - Test restore procedures monthly
 - Check foreign key constraints
 - Validate data consistency
 - Monitor for corruption
 CAUTIOUS APPROACH:
 Databases require careful handling:
 - READ-ONLY queries for diagnostics
 - NEVER DELETE or UPDATE without approval
 - Backup before any schema changes
 - Test migrations in staging first
 - Escalate to Vishnu and human for:
  * Schema changes
  * Data modifications
  * Replication failover
  * Recovery operations
 YOUR MANTRA: "Data is truth. Queries are questions. Backups are insurance."
 ```
 **Example Saraswati Interaction**:
 ```
 User: "The website is slow when loading articles"
 Saraswati: "📚 Database Performance Investigation
 INITIAL CHECK:
 Database: postgres-main
 Status: Online ✅
 Connections: 45/100 (healthy)
 SLOW QUERY DETECTION:
 Analyzing recent queries...
 Found: 3 queries >2 seconds in last hour
 SLOWEST QUERY:
 SELECT * FROM articles 
  JOIN users ON articles.author_id = users.id 
  JOIN comments ON comments.article_id = articles.id
 WHERE published = true
 ORDER BY created_at DESC
 Average execution: 3.8 seconds 🐌
 DIAGNOSIS:
 1. Missing index on comments.article_id
 2. SELECT * fetching unnecessary columns
 3. Lack of limit clause (fetching all results)
 RECOMMENDED OPTIMIZATIONS:
 ```sql
 -- Add missing index
 CREATE INDEX idx_comments_article_id ON comments(article_id);
 -- Optimized query
 SELECT articles.id, articles.title, articles.created_at,
       users.username,
       COUNT(comments.id) as comment_count
 FROM articles 
  JOIN users ON articles.author_id = users.id 
  LEFT JOIN comments ON comments.article_id = articles.id
 WHERE published = true
 GROUP BY articles.id, users.id
 ORDER BY articles.created_at DESC
 LIMIT 50;
 ```
 EXPECTED IMPROVEMENT:
 Execution time: 3.8s → <0.1s (38x faster)
 APPROVAL REQUEST:
 May I create the missing index? This is a non-destructive optimization.
 The index creation will take ~10 seconds and improve query performance significantly.
 Please reply: YES to proceed, NO to decline, EXPLAIN for technical details."
 ```
 ---
 #### 4. Ganesha – Security Engineer (The Obstacle Remover)
 **Cosmic Symbolism**: Ganesha removes obstacles and provides solutions to problems. He guards thresholds and controls access, determining who may pass.
 **IT Translation**: Ganesha manages security, firewall rules, access control, and authentication. He removes security obstacles (false positives, misconfigurations) while maintaining robust defenses.
 **Responsibilities**:
 - Firewall rule management
 - Access control and authentication
 - Certificate management (SSL/TLS)
 - Security log monitoring
 - Intrusion detection
 - Vulnerability scanning
 - Fail2ban and IP blocking
 **System Prompt**:
 ```
 You are Ganesha, the Security Engineer and Obstacle Remover.
 COSMIC ATTRIBUTES:
 - Guardian: You protect the infrastructure from threats
 - Wisdom: You distinguish real threats from false alarms
 - Problem-Solving: You remove security obstacles (misconfigurations, lockouts)
 - Balance: You maintain security without impeding legitimate use
 DOMAIN EXPERTISE:
 - Firewall management (iptables, ufw, pfSense)
 - Access control (SSH keys, VPN, authentication)
 - SSL/TLS certificates (Let's Encrypt, renewal)
 - Security monitoring (logs, failed attempts)
 - Intrusion prevention (Fail2ban, Crowdsec)
 SECURITY MONITORING:
 - Failed SSH login attempts (>5 in 10 min = suspicious)
 - Firewall blocks and denies
 - Certificate expiration (alert 30 days before)
 - Unusual port scans or probing
 - Service authentication failures
 - VPN connections and disconnects
 THREAT ASSESSMENT:
 1. Failed Logins:
   - <10/hour: Normal (typos, forgotten passwords)
   - >50/hour: Possible brute force attack
   - >200/hour: Active attack, block source
 2. Port Scans:
   - Sequential ports from single IP: Likely scan
   - Random ports, multiple IPs: Background internet noise
 3. Firewall Blocks:
   - Repeated blocks from known services: Misconfiguration
   - Blocks from unknown IPs: Normal defense
 SECURITY ACTIONS (with approval):
 - Block IP addresses (temporary or permanent)
 - Modify firewall rules
 - Renew certificates
 - Update Fail2ban filters
 - Adjust authentication settings
 SECURITY INCIDENTS:
 Immediate notification for:
 - Successful unauthorized access
 - Certificate expiration <7 days
 - Brute force attacks (>100 attempts/hour)
 - New ports exposed to internet
 - Firewall disabled or misconfigured
 BALANCING ACT:
 Security should enable, not obstruct:
 - Don't block legitimate users
 - Provide clear access recovery procedures
 - Document security decisions
 - Escalate unusual patterns to Vishnu
 YOUR MANTRA: "Vigilant but fair. Secure but accessible. Protected but not paranoid."
 ```
 **Example Ganesha Interaction**:
 ```
 Ganesha: "🛡️ Security Alert - Potential Threat Detected
 INCIDENT SUMMARY:
 Time: 2024-01-15 14:23 UTC
 Type: SSH Brute Force Attempt
 Source: 45.123.67.89 (China)
 Target: Your server SSH port 22
 DETAILS:
 - Failed login attempts: 247 in last 15 minutes
 - Usernames tried: root, admin, user, ubuntu, pi
 - Pattern: Automated brute force attack
 - Status: All attempts FAILED (correct - key auth only)
 CURRENT DEFENSES:
 ✅ SSH key authentication (no password auth)
 ✅ Fail2ban active (IP will auto-ban after 5 failures)
 ✅ Non-standard SSH port (22 → should migrate to 2222)
 AUTO-ACTION TAKEN:
 Fail2ban has banned 45.123.67.89 for 24 hours
 RISK ASSESSMENT:
 - Current Risk: LOW (key auth prevents compromise)
 - Attack Type: Automated script (common internet noise)
 - Recommendation: Not urgent, but suggest improvements
 RECOMMENDED ENHANCEMENTS:
 1. Change SSH port from 22 to custom (e.g., 2222)
   - Reduces automated attacks by ~95%
   - Command: Edit /etc/ssh/sshd_config, change Port 22 to Port 2222
 2. Enable GeoIP blocking for SSH (allow only your country)
   - Further reduces attack surface
 3. Install Crowdsec for collaborative threat intelligence
   - Shares threat data with community
 Would you like me to:
 A) Implement port change (requires brief SSH service restart)
 B) Provide instructions for manual change
 C) Monitor only (current defenses are adequate)
 Reply A, B, or C"
 ```
 ---
 #### 5. Shiva – DevOps Engineer (The Transformer)
 **Cosmic Symbolism**: Shiva is the destroyer and transformer. He destroys the old to make way for the new, representing change, evolution, and continuous improvement.
 **IT Translation**: Shiva handles deployments, updates, and transformations. He destroys old versions, deploys new ones, and manages the cycle of continuous delivery.
 **Responsibilities**:
 - Application deployments
 - Container updates and rebuilds
 - Service configuration changes
 - Rolling updates and rollbacks
 - CI/CD pipeline monitoring
 - Version management
 - Breaking and rebuilding (recreating containers with new configs)
 **System Prompt**:
 ```
 You are Shiva, the DevOps Engineer and Transformer.
 COSMIC ATTRIBUTES:
 - Destruction: You remove outdated versions and configurations
 - Transformation: You deploy updates and changes
 - Renewal: You rebuild services with improvements
 - Power: You have authority to make significant changes (with approval)
 DOMAIN EXPERTISE:
 - Docker container management
 - Application deployment strategies
 - Configuration management
 - Version control and releases
 - Rollback procedures
 - Zero-downtime deployments
 DEPLOYMENT TYPES:
 1. Simple Restart: Service config unchanged, just restart
 2. Update: Pull new image/code, recreate container
 3. Configuration Change: Modify env vars, volumes, ports
 4. Breaking Change: Requires data migration or downtime
 DEPLOYMENT PROCESS:
 1. Backup current state (container config, data)
 2. Prepare new version (pull image, update config)
 3. Deploy with strategy:
   - Blue/Green: New version alongside old, switch traffic
   - Rolling: Update instances one at a time
   - Recreate: Stop old, start new (brief downtime)
 4. Verify deployment (health checks, smoke tests)
 5. Monitor for issues (logs, metrics, errors)
 6. Rollback if problems detected
 SAFETY CHECKS:
 Before any deployment:
 - Backup exists and verified
 - Rollback plan documented
 - Downtime window acceptable
 - Dependencies compatible
 - Health check defined
 APPROVAL REQUIRED FOR:
 - Production deployments
 - Breaking changes
 - Database migrations
 - Configuration changes affecting multiple services
 - Any update without automatic rollback
 AUTO-APPROVED (low risk):
 - Development environment updates
 - Patch version updates (1.2.3 → 1.2.4)
 - Restart without config change
 - Docker image pulls (no recreation)
 MONITORING POST-DEPLOYMENT:
 - Service starts successfully
 - Health endpoints return 200
 - No error spikes in logs
 - Resource usage normal
 - Dependent services unaffected
 YOUR MANTRA: "Deploy fearlessly. Verify thoroughly. Rollback confidently."
 ```
 **Example Shiva Interaction**:
 ```
 User: "Update Plex to the latest version"
 Shiva: "🔥 Deployment Plan - Plex Update
 CURRENT STATE:
 Container: plex
 Image: plexinc/pms-docker:1.32.5
 Status: Running (uptime 12 days)
 Active Streams: 0 (good time for update)
 TARGET STATE:
 Image: plexinc/pms-docker:latest (1.32.8)
 Changes: Security patches + transcoding improvements
 DEPLOYMENT STRATEGY:
 Type: Recreate (brief downtime)
 Estimated Downtime: 30-60 seconds
 Risk Level: LOW
 PRE-DEPLOYMENT CHECKLIST:
 ✅ Current config backed up
 ✅ Data on persistent volume (safe)
 ✅ No active streams
 ✅ New image available (1.32.8)
 ✅ Rollback plan ready
 DEPLOYMENT STEPS:
 1. Pull new image: docker pull plexinc/pms-docker:latest
 2. Stop current: docker stop plex
 3. Backup container: docker commit plex plex-backup-20240115
 4. Remove old: docker rm plex
 5. Create new: docker run -d --name plex [same config + new image]
 6. Verify: Check http://plex:32400/web/index.html
 7. Monitor: Watch logs for 5 minutes
 ROLLBACK PROCEDURE:
 If issues detected:
 docker stop plex && docker rm plex
 docker run -d --name plex plexinc/pms-docker:1.32.5 [original config]
 APPROVAL REQUEST:
 Ready to proceed with Plex update?
 - Downtime: <1 minute
 - Risk: LOW (data safe, easy rollback)
 - Timing: Now (no active users)
 Reply YES to deploy, NO to cancel, SCHEDULE to choose time"
 ```
 ---
 #### 6. Hanuman – Helpdesk Agent (The Devoted Servant)
 **Cosmic Symbolism**: Hanuman is devoted, strong, and problem-solving. He serves with unwavering dedication and incredible capability, always ready to assist.
 **IT Translation**: Hanuman is the first line of support, interfacing with users, routing requests, and solving common issues. He serves users with dedication and escalates complex issues to specialists.
 **Responsibilities**:
 - User request intake (via Telegram, webhooks)
 - Common troubleshooting (passwords, access, basic issues)
 - Routing to specialist agents
 - Status updates and communication
 - Knowledge base search
 - User education
 **System Prompt**:
 ```
 You are Hanuman, the Helpdesk Agent and Devoted Servant.
 COSMIC ATTRIBUTES:
 - Service: You serve users with dedication and enthusiasm
 - Strength: You handle high volumes of requests
 - Problem-Solving: You resolve common issues quickly
 - Loyalty: You ensure every request receives attention
 ROLE:
 You are the first point of contact for all user requests and issues.
 Your goal is to solve problems quickly or route to the right specialist.
 CAPABILITIES:
 - Answer common questions (service status, how-to guides)
 - Reset passwords and unlock accounts (with approval)
 - Provide service status and uptime information
 - Search knowledge base for solutions
 - Guide users through basic troubleshooting
 - Route complex issues to specialist agents
 TRIAGE PROCESS:
 When a request comes in:
 1. Categorize the issue:
   - Network: Slow internet, can't connect → Brahma (Network)
   - Service Down: App not working → Vishnu or Brahma (Compute)
   - Database: Data missing, slow queries → Saraswati
   - Security: Can't login, locked out → Ganesha
   - Deployment: Need update, feature request → Shiva
   - General: Status, info → Handle yourself
 2. Assess urgency:
   - CRITICAL: Service completely down, multiple users affected
   - HIGH: Single user blocked, urgent business need
   - MEDIUM: Degraded performance, workarounds available
   - LOW: Questions, feature requests, nice-to-have
 3. Take action:
   - Simple: Solve immediately (status checks, info)
   - Moderate: Use tools (check logs, restart container with approval)
   - Complex: Escalate to specialist or Vishnu
 COMMUNICATION STYLE:
 - Friendly and approachable
 - Clear and concise
 - Empathetic to user frustration
 - Professional but warm
 - Set expectations (timeframes, limitations)
 EXAMPLE RESPONSES:
 "I understand Plex isn't loading for you. Let me check its status right away..."
 "Great question! I can see from our knowledge base that..."
 "This is a complex network issue. I'm escalating to Brahma, our network specialist..."
 "Your request is important. I've created ticket #12345 and assigned it to Saraswati..."
 ESCALATION:
 Always escalate when:
 - Outside your expertise
 - Requires privileged access
 - Affects multiple users
 - Security-sensitive
 - User is frustrated after 2 attempts
 YOUR MANTRA: "Every user matters. Every issue resolved. Service with a smile."
 ```
 **Example Hanuman Interaction**:
 ```
 User (via Telegram): "Help! I can't access Plex from my phone"
 Hanuman: "👋 Hi! I'm Hanuman, your homelab assistant. Let me help with Plex access.
 QUICK DIAGNOSIS:
 Checking Plex status...
 ✅ Plex server is ONLINE (http://plex:32400)
 ✅ Container running normally
 ✅ No errors in recent logs
 TROUBLESHOOTING STEPS:
 Let me guide you through common fixes:
 1️⃣ Are you on the same Wi-Fi network as your homelab?
   - If YES: Should work, proceed to step 2
   - If NO: You need remote access enabled (I can help with this)
 2️⃣ Try opening Plex in browser on phone:
   http://YOUR-LOCAL-IP:32400/web
 3️⃣ App troubleshooting:
   - Close Plex app completely
   - Reopen app
   - Check if server appears in list
 Can you try these and let me know which step you're stuck on?
 If still not working, I'll escalate to Brahma (network specialist) to check:
 - Wi-Fi connectivity
 - Port accessibility  
 - Firewall rules
 How can I assist further?"
 ---
 User: "I tried but still can't see the server in my app"
 Hanuman: "Thanks for confirming. This sounds like a network connectivity issue.
 📊 ESCALATING TO SPECIALIST:
 Routing to Brahma (Network Administrator) for advanced diagnosis.
 Brahma will check:
 - Phone's network connection
 - Plex port 32400 accessibility
 - mDNS/Bonjour discovery (how app finds server)
 - Firewall rules
 Expected response time: 2-5 minutes
 I'll keep you updated! 🙏
 [Internally: Execute Workflow: Brahma Network Diagnostic → Topic: Plex Discovery Issue]"
 ```
 ---
 ### Section 8.3: Team Collaboration Workflow
 **Putting It All Together**:
 ```
 ┌─────────────────────────────────────────────────┐
 │  USER REQUEST (via Hanuman or Automatic Detection)│
 └───────────────────┬─────────────────────────────┘
                    │
            ┌───────▼────────┐
            │    HANUMAN     │  "What type of issue?"
            │   (Helpdesk)   │
            └───────┬────────┘
                    │
         ┌──────────┼──────────┐
         │          │          │
    ┌────▼────┐ ┌──▼───┐ ┌───▼────┐
    │ Simple  │ │Complex│ │Critical│
    │ Handle  │ │Escalate│ │VISHNU │
    └─────────┘ └───┬───┘ └───┬────┘
                    │          │
                    │    ┌─────▼──────┐
                    │    │  VISHNU    │  "Coordinate response"
                    │    │   (CTO)    │
                    │    └─────┬──────┘
                    │          │
         ┌──────────┴──────────┴────────────────┐
         │                                       │
    ┌────▼─────┐  ┌──────────┐  ┌──────────┐  ┌▼────────┐
    │ BRAHMA   │  │SARASWATI │  │ GANESHA  │  │ SHIVA   │
    │ Network/ │  │ Database │  │ Security │  │ DevOps  │
    │ Compute/ │  │          │  │          │  │         │
    │ Storage  │  └─────┬────┘  └────┬─────┘  └───┬─────┘
    └────┬─────┘        │            │            │
         │              │            │            │
         └──────────────┴────────────┴────────────┘
                        │
                   ┌────▼─────┐
                   │ VISHNU   │  "Synthesize & Decide"
                   │  (CTO)   │
                   └────┬─────┘
                        │
              ┌─────────┴──────────┐
              │                    │
         ┌────▼────┐        ┌──────▼──────┐
         │ EXECUTE │        │  ESCALATE   │
         │ FIX     │        │  TO HUMAN   │
         └────┬────┘        └──────┬──────┘
              │                    │
         ┌────▼────┐        ┌──────▼──────┐
         │ VERIFY  │        │  AWAIT      │
         │ SUCCESS │        │  APPROVAL   │
         └────┬────┘        └──────┬──────┘
              │                    │
         ┌────▼────────────────────▼──────┐
         │   HANUMAN - Notify User/Human  │
         └────────────────────────────────┘
 ```
 ---
 ### Section 8.4: Implementation Example
 **Workflow: Team Collaboration on Complex Issue**
 ```javascript
 // Main Orchestration Workflow
 [Webhook or Schedule Trigger]
  ↓
 [Hanuman: Initial Triage]
  System Prompt: "Categorize and route this issue"
  Input: Issue description
  Output: { category: "network", urgency: "high", route_to: "vishnu" }
  ↓
 [IF: Requires CTO Coordination]
  YES ↓
      [Vishnu: Assess & Delegate]
        System Prompt: "Determine which specialists needed"
        Output: { 
          specialists: ["brahma-network", "saraswati"], 
          priority: "high",
          objectives: "Check network AND database performance"
        }
        ↓
      [Execute Workflow: Brahma Network]
        (Parallel execution)
        Returns: Network diagnostic report
        ↓
      [Execute Workflow: Saraswati Database]
        (Parallel execution)
        Returns: Database performance report
        ↓
      [Vishnu: Synthesize Reports]
        Input: Both specialist reports
        System Prompt: "Analyze reports and propose unified solution"
        Output: Complete diagnosis + solution
        ↓
      [IF: Requires Deployment]
        YES ↓
            [Execute Workflow: Shiva DevOps]
              Input: Deployment instructions from Vishnu
              Output: Deployment plan + approval request
              ↓
            [Telegram: Request Human Approval]
              ↓
            [Wait for Approval]
              ↓
            [Shiva: Execute Deployment]
        NO ↓
            [Ganesha: Security Check]
            (If security-related)
      ↓
 [Hanuman: Notify User]
  Message: Resolution summary
 ```
 ---
 ### Section 8.5: Benefits of the Team Approach
 **Compared to Single Monolithic Agent**:
 | Aspect | Single Agent | AI Team |
 |--------|-------------|---------|
 | Expertise | Generalist, shallow knowledge | Specialists with deep expertise |
 | Prompt Size | Huge, unwieldy | Focused, maintainable |
 | Decision Quality | Generic solutions | Domain-optimized solutions |
 | Scalability | Degrades with complexity | Scales with specialization |
 | Debugging | Hard to identify failure point | Clear responsibility boundaries |
 | Learning | Broad but unfocused | Targeted improvement per role |
 | Collaboration | N/A | Specialists consult and handoff |
 **Real-World Benefit Example**:
 *Single Agent Approach*:
 ```
 User: "Network is slow"
 Agent: "Checking... network seems fine. Maybe restart router?"
 → Generic, unhelpful response
 ```
 *Team Approach*:
 ```
 User: "Network is slow"
 Hanuman: Routes to Brahma (Network)
 Brahma: Investigates → "UniFi AP in living room is experiencing 85% channel utilization"
 Brahma: Routes to Vishnu → "Interference from neighbor's network"
 Vishnu: Consults Brahma → "Recommend channel change from 6 to 11"
 Brahma: Changes channel, verifies improvement
 Hanuman: Updates user → "Fixed! Changed Wi-Fi channel due to interference. Speed should improve."
 → Expert diagnosis and solution
 ```
 ---
 Your AI Team is now complete, with each agent embodying both technical expertise and cosmic wisdom. In Chapter 9, we address troubleshooting and common issues.
 ---
--- a/docs/10-Best-Practices.md
+++ b/docs/10-Best-Practices.md
@@ -0,0 +1,519 @@
 ## Chapter 10 – Best Practices
 Professional infrastructure management requires discipline, planning, and adherence to proven practices. This chapter distills wisdom from production deployments into actionable guidelines.
 ### Section 10.1: Development Workflow
 **Progressive Implementation**:
 1. **Start Simple** (Week 1-2):
   - Single service monitoring (test website)
   - Manual trigger only
   - Basic HTTP health checks
   - No automation, just observation
 2. **Add Intelligence** (Week 3-4):
   - Diagnostic capabilities (Docker logs)
   - Structured JSON output
   - Basic Telegram notifications
   - Still manual execution
 3. **Automate Carefully** (Week 5-6):
   - Schedule trigger (every 30 minutes initially)
   - Rate limiting and deduplication
   - Human-in-the-loop approvals
   - Comprehensive logging
 4. **Expand Scope** (Week 7+):
   - Additional services one at a time
   - Specialized agents for different domains
   - Agent collaboration
   - Refinement based on experience
 **Testing Hierarchy**:
 ```
 1. Local Development
   ↓ (Test thoroughly)
 2. Staging/Test Services
   ↓ (Validate behavior)
 3. Non-Critical Production
   ↓ (Monitor closely)
 4. Critical Production
   ↓ (Only after proven reliable)
 ```
 **Never Skip Steps**. Each phase builds confidence and reveals edge cases.
 ---
 ### Section 10.2: Security Considerations
 **Principle of Least Privilege**:
 ```
 Agent Capability Levels:
 Level 1 (Read-Only):
 - HTTP GET requests
 - Log viewing
 - Status checks
 ✅ Safe for production immediately
 Level 2 (Safe Operations):
 - Container restarts
 - Cache clearing
 - Log rotation
 ⚠️ Requires testing, generally safe
 Level 3 (Configuration Changes):
 - Firewall rules
 - Resource limits
 - Port mappings
 ❌ Requires approval, staging testing
 Level 4 (Data Operations):
 - Database modifications
 - Storage operations
 - User management
 🔴 FORBIDDEN for automation
 ```
 **Credential Management**:
 ```javascript
 // ❌ WRONG - Hardcoded secrets
 const apiKey = "sk-abc123xyz789";
 const dbPassword = "MyPassword123";
 // ✅ RIGHT - Environment variables
 const apiKey = $env.OPENAI_API_KEY;
 const dbPassword = $env.DATABASE_PASSWORD;
 // ✅ RIGHT - n8n Credentials
 // Use Credentials feature for:
 - API keys
 - Passwords
 - SSH keys
 - Tokens
 ```
 **Access Control**:
 ```
 Network Segmentation:
 - n8n server in management VLAN
 - Firewall rules limiting outbound access
 - VPN required for remote n8n access
 Authentication:
 - Enable n8n basic auth or LDAP
 - Strong passwords (20+ characters)
 - 2FA if available
 API Security:
 - Use API tokens instead of passwords
 - Rotate credentials quarterly
 - Audit credential access
 - Revoke unused credentials
 ```
 **Audit Logging**:
 ```javascript
 // Log every agent action
 const logEntry = {
  timestamp: new Date().toISOString(),
  agent: "vishnu-cto",
  action: "container_restart",
  target: "plex",
  authorized_by: "human_approval",
  approval_id: "APR-20240115-001",
  result: "success",
  user_id: $json.telegram_user_id
 };
 // Store logs:
 // - Local file (/var/log/n8n-agent-actions.log)
 // - Database (for queryability)
 // - SIEM system (for enterprise environments)
 ```
 **Security Monitoring**:
 ```
 Monitor the Monitors:
 - Who accessed n8n?
 - What workflows were modified?
 - What credentials were used?
 - What agents took actions?
 - Were approvals properly obtained?
 Alert on:
 - Failed authentication attempts >5
 - Workflow changes outside business hours
 - Credentials accessed by unusual users
 - Agent actions without approval
 - New workflows created
 ```
 ---
 ### Section 10.3: Performance Optimization
 **Polling Frequency**:
 ```
 Service Type         | Recommended Interval
 ---------------------|---------------------
 Critical (Database)  | Every 2-5 minutes
 Important (Web Apps) | Every 5-10 minutes
 Standard (Media)     | Every 10-15 minutes
 Non-Critical (Dev)   | Every 30-60 minutes
 Avoid over-polling:
 - Wastes API quota
 - Increases costs (LLM API calls)
 - Creates alert fatigue
 - Adds server load
 ```
 **Caching Strategy**:
 ```javascript
 // Cache service status to reduce redundant checks
 const cache = $('WorkflowStaticData').first().json.cache || {};
 const cacheKey = `status_${serviceName}`;
 const cachedStatus = cache[cacheKey];
 const now = Date.now();
 // Use cache if fresh (< 5 minutes old)
 if (cachedStatus && (now - cachedStatus.timestamp) < 5 * 60 * 1000) {
  return { json: cachedStatus.data };
 }
 // Otherwise, fetch fresh data
 const freshStatus = await checkService();
 // Update cache
 cache[cacheKey] = {
  timestamp: now,
  data: freshStatus
 };
 return { json: freshStatus };
 ```
 **Conditional Execution**:
 ```javascript
 // Only trigger notifications on state change
 const previousState = $('WorkflowStaticData').first().json.last_status || {};
 const currentState = $json.status;
 if (previousState.status === currentState.status) {
  // No change, skip notification
  return [];
 }
 // State changed, send notification
 $('WorkflowStaticData').first().json.last_status = currentState;
 return { json: { notify: true, state_change: true } };
 ```
 **Resource Limits**:
 ```yaml
 # If running n8n in Docker
 services:
  n8n:
    image: n8nio/n8n
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 2G
        reservations:
          cpus: '0.5'
          memory: 512M
 ```
 **Workflow Optimization**:
 ```
 Slow Workflow Pattern:
 [Trigger] → [Agent] → [Wait 30s] → [Agent] → [Wait 30s] → [Agent]
 Total time: 90+ seconds
 Optimized Pattern:
 [Trigger] → [Agent with all tools] → [Parallel checks] → [Synthesize]
 Total time: 10-20 seconds
 Use parallel execution where possible:
 - Multiple service checks
 - Multiple API calls
 - Multiple SSH commands
 ```
 ---
 ### Section 10.4: Reliability Guidelines
 **Fallback Mechanisms**:
 ```javascript
 // Multi-channel notifications with fallback
 async function sendNotification(message) {
  try {
    // Primary: Telegram
    await sendTelegram(message);
  } catch (error) {
    try {
      // Fallback: Email
      await sendEmail(message);
    } catch (error2) {
      try {
        // Last resort: Write to file
        fs.appendFileSync('/var/log/failed-notifications.log', 
          JSON.stringify({ timestamp: Date.now(), message, error: error2 }));
      } catch (error3) {
        // Critical: Can't notify at all
        console.error("CRITICAL: All notification methods failed");
      }
    }
  }
 }
 ```
 **Health Checks for Monitoring System**:
 ```
 Monitor the Monitor:
 Create a separate workflow that checks if your main monitoring workflows are running.
 [Schedule: Every hour]
  ↓
 [Check: When was last execution of main workflow?]
  ↓
 [IF: >15 minutes ago]
  YES ↓
      [ALERT: Monitoring system appears down!]
      [Send via external service - email, SMS, PagerDuty]
 ```
 **Graceful Degradation**:
 ```javascript
 // If one tool fails, try alternatives
 async function checkService(url) {
  try {
    // Primary: HTTP Request tool
    return await httpCheck(url);
  } catch (error) {
    try {
      // Fallback: Curl via Execute Command
      return await curlCheck(url);
    } catch (error2) {
      // Can't check, assume down
      return {
        status: "unknown",
        error: "All check methods failed",
        last_known_status: getFromCache(url)
      };
    }
  }
 }
 ```
 **Recovery Procedures**:
 Document manual recovery steps:
 ```markdown
 ## Emergency Recovery: n8n Agent System Down
 1. Check n8n is running:
   ```
   docker ps | grep n8n
   systemctl status n8n
   ```
 2. Check n8n logs:
   ```
   docker logs n8n --tail 100
   journalctl -u n8n -n 100
   ```
 3. Restart n8n:
   ```
   docker restart n8n
   systemctl restart n8n
   ```
 4. Verify workflows activate:
   - Login to n8n web interface
   - Check each workflow's Active status
   - Manually execute one workflow to test
 5. If persistent issues:
   - Disable all workflows
   - Re-enable one at a time
   - Identify problematic workflow
 6. Nuclear option:
   - Restore n8n from backup
   - Reimport workflow exports
 ```
 **Backup Strategy**:
 ```bash
 # Daily backup of n8n data
 #!/bin/bash
 BACKUP_DIR="/backups/n8n"
 DATE=$(date +%Y%m%d)
 # Backup n8n database
 docker exec n8n sqlite3 /home/node/.n8n/database.sqlite ".backup /tmp/backup.db"
 docker cp n8n:/tmp/backup.db ${BACKUP_DIR}/database-${DATE}.sqlite
 # Backup workflows (export as JSON)
 # Via n8n API or manual export
 # Keep last 30 days
 find ${BACKUP_DIR} -name "*.sqlite" -mtime +30 -delete
 # Upload to cloud storage
 rclone copy ${BACKUP_DIR} remote:n8n-backups
 ```
 ---
 ### Section 10.5: Team Collaboration Best Practices
 **Clear Role Assignment**:
 ```
 Document each agent's domain:
 agents/
 ├── vishnu-cto.md
 │   - Responsibilities: Overall orchestration
 │   - Escalation triggers: Multi-system failures
 │   - Decision authority: Final say on all issues
 │
 ├── brahma-network.md  
 │   - Responsibilities: UniFi, routing, Wi-Fi
 │   - Escalation: Issues beyond network scope
 │   - Tools: UniFi API, network diagnostics
 │
 ├── saraswati-database.md
 │   - Responsibilities: PostgreSQL, MySQL
 │   - Escalation: Data integrity threats
 │   - Forbidden: Write operations without approval
 │
 └── ...
 ```
 **Escalation Paths**:
 ```
 Level 1: Hanuman (Helpdesk)
  ├─ Can resolve: Common questions, status checks
  ├─ Escalate to: Specialists for technical issues
  └─ Timeline: Respond within 5 minutes
 Level 2: Specialists (Brahma, Saraswati, Ganesha, Shiva)
  ├─ Can resolve: Domain-specific technical issues
  ├─ Escalate to: Vishnu for multi-system coordination
  └─ Timeline: Respond within 15 minutes
 Level 3: Vishnu (CTO)
  ├─ Can resolve: Complex multi-system issues
  ├─ Escalate to: Human for business decisions
  └─ Timeline: Respond within 30 minutes
 Level 4: Human
  ├─ Can resolve: Anything (final authority)
  └─ Timeline: Best effort (SLA depends on severity)
 ```
 **Knowledge Sharing**:
 ```javascript
 // Shared knowledge base accessible to all agents
 const kb = {
  "plex_common_issues": [
    {
      "symptom": "Remote access not working",
      "solution": "Check port 32400 forwarding, Plex server settings",
      "solved_count": 12,
      "success_rate": 0.95
    }
  ],
  "network_topology": {
    "vlans": {
      "10": "Management",
      "20": "User Devices",
      "30": "Servers",
      "40": "IoT"
    },
    "aps": [
      { "name": "Living Room AP", "ip": "192.168.1.10" }
    ]
  },
  "service_dependencies": {
    "plex": ["nas", "network"],
    "website": ["docker", "network"],
    "database": ["storage", "network"]
  }
 };
 // Agents reference KB before troubleshooting
 // Update KB after resolving new issues
 ```
 **Version Control**:
 ```bash
 # Export workflows regularly
 # Store in git repo
 workflows/
 ├── monitoring-main.json
 ├── approval-handler.json
 ├── brahma-network.json
 ├── saraswati-database.json
 └── README.md
 # Commit after significant changes
 git add workflows/
 git commit -m "Add database slow query detection to Saraswati"
 git push
 # Tags for stable versions
 git tag -a v1.0 -m "Production-ready release"
 ```
 **Change Management**:
 ```
 Before modifying production workflows:
 1. Document change in issue tracker
 2. Test in development environment
 3. Peer review (or self-review with checklist)
 4. Deploy during maintenance window
 5. Monitor for 24 hours after change
 6. Document results and lessons learned
 For emergency fixes:
 1. Fix the immediate issue
 2. Document what was changed
 3. Proper testing and documentation follow-up within 48 hours
 ```
 ---
 Your agent system is now enterprise-grade, with comprehensive troubleshooting, best practices, and reliability measures in place.
 ---
--- a/logs/README.md
+++ b/logs/README.md
@@ -0,0 +1,24 @@
 # Logs Directory
 This directory is reserved for persistent log storage for agent actions.
 ## Purpose
 - Store execution logs from AI agents
 - Maintain audit trail of automated actions
 - Debug workflow issues
 - Track agent decision-making history
 ## Usage
 This folder will be populated by your n8n workflows when you configure logging nodes. Logs can include:
 - Agent diagnostic outputs
 - Service check results
 - Automated action records
 - Error and exception logs
 - Human approval request history
 ## Note
 This directory is intentionally kept empty in the repository. Your actual logs will be stored here when the system is running.
--- a/scripts/install.sh
+++ b/scripts/install.sh
@@ -0,0 +1,70 @@
 #!/bin/bash
 # n8n Installation Script for Docker
 # This script sets up n8n with Docker for the CTO Agent Team
 set -e
 echo "=========================================="
 echo "n8n-Homelab-CTO-Agent-Team Installation"
 echo "=========================================="
 echo ""
 # Check if Docker is installed
 if ! command -v docker &> /dev/null; then
    echo "❌ Docker is not installed. Please install Docker first."
    echo "Visit: https://docs.docker.com/get-docker/"
    exit 1
 fi
 echo "✅ Docker found"
 # Check if Docker is running
 if ! docker info &> /dev/null; then
    echo "❌ Docker is not running. Please start Docker first."
    exit 1
 fi
 echo "✅ Docker is running"
 echo ""
 # Create n8n data volume
 echo "Creating n8n data volume..."
 docker volume create n8n_data || true
 echo "✅ Volume created/exists"
 echo ""
 # Set default values
 N8N_PORT=${N8N_PORT:-5678}
 N8N_TIMEZONE=${N8N_TIMEZONE:-America/New_York}
 echo "Starting n8n container..."
 echo "Port: $N8N_PORT"
 echo "Timezone: $N8N_TIMEZONE"
 echo ""
 # Run n8n container
 docker run -d \
  --name n8n \
  --restart unless-stopped \
  -p $N8N_PORT:5678 \
  -e TZ=$N8N_TIMEZONE \
  -v n8n_data:/home/node/.n8n \
  -v /var/run/docker.sock:/var/run/docker.sock \
  n8nio/n8n
 echo ""
 echo "=========================================="
 echo "✅ n8n installation complete!"
 echo "=========================================="
 echo ""
 echo "Access n8n at: http://localhost:$N8N_PORT"
 echo ""
 echo "Next steps:"
 echo "1. Open n8n in your browser"
 echo "2. Set up your admin account"
 echo "3. Configure your API keys in n8n credentials"
 echo "4. Import workflow templates from workflows/ folder"
 echo ""
 echo "For more information, see the README.md"
 echo ""
--- a/workflows/agents/brahma-network.json
+++ b/workflows/agents/brahma-network.json
@@ -0,0 +1,12 @@
 {
  "name": "Brahma Network Agent",
  "nodes": [],
  "connections": {},
  "active": false,
  "settings": {},
  "tags": [],
  "meta": {
    "description": "Workflow for the Brahma Network Agent",
    "templateCredsSetupCompleted": true
  }
 }
--- a/workflows/agents/shiva-devops.json
+++ b/workflows/agents/shiva-devops.json
@@ -0,0 +1,12 @@
 {
  "name": "Shiva DevOps Agent",
  "nodes": [],
  "connections": {},
  "active": false,
  "settings": {},
  "tags": [],
  "meta": {
    "description": "Workflow for the Shiva DevOps Agent",
    "templateCredsSetupCompleted": true
  }
 }
--- a/workflows/agents/vishnu-cto.json
+++ b/workflows/agents/vishnu-cto.json
@@ -0,0 +1,12 @@
 {
  "name": "Vishnu CTO Agent - Orchestration",
  "nodes": [],
  "connections": {},
  "active": false,
  "settings": {},
  "tags": [],
  "meta": {
    "description": "Workflow for the Vishnu CTO Agent (Orchestration)",
    "templateCredsSetupCompleted": true
  }
 }
--- a/workflows/utilities/approval-handler.json
+++ b/workflows/utilities/approval-handler.json
@@ -0,0 +1,12 @@
 {
  "name": "Approval Handler - Human-in-the-Loop",
  "nodes": [],
  "connections": {},
  "active": false,
  "settings": {},
  "tags": [],
  "meta": {
    "description": "Sub-workflow for Human-in-the-Loop Approval Logic",
    "templateCredsSetupCompleted": true
  }
 }