2026-03-26 05:32:52 +02:00

15 KiB

Remediate Feature Architecture

This document provides a detailed architecture overview of the Remediate feature in the DevOps AI Toolkit.

Overview

The Remediate feature provides AI-powered Kubernetes issue analysis and remediation. It investigates problems using kubectl tools, identifies root causes with confidence scoring, and executes verified fixes with optional post-execution validation.

High-Level Architecture

flowchart TB
    subgraph Users["User / AI Agent"]
        Agent["Claude Code, Cursor,<br/>VS Code, etc."]
    end

    subgraph MCP["MCP Server (dot-ai)"]
        Remediate["remediate Tool"]
        AI["AI Provider"]
        Session["Session<br/>Manager"]
        Kubectl["Kubectl Tools"]
    end

    subgraph External["External Services"]
        LLM["Claude, OpenAI,<br/>Gemini, etc."]
    end

    subgraph K8s["Kubernetes Cluster"]
        API["Kubernetes API"]
        Controller["Controller<br/>(dot-ai-controller)"]
        Events["Kubernetes Events"]
        Resources["Cluster Resources<br/>Pods, Deployments,<br/>Services, etc."]
    end

    subgraph WebUI["Web UI (dot-ai-ui)"]
        Viz["Visualization Dashboard<br/>- Investigation Flow<br/>- Root Cause Analysis<br/>- Remediation Commands"]
    end

    subgraph Notifications["Notifications"]
        Slack["Slack"]
        GChat["Google Chat"]
    end

    Agent <-->|MCP Protocol| Remediate
    Remediate --> AI
    Remediate --> Session
    Remediate --> Kubectl
    AI --> LLM
    AI <-->|Tool Loop| Kubectl
    Kubectl --> API
    Remediate -->|Execute Commands| API
    Controller -->|Watch| Events
    Controller -->|RemediationPolicy| Remediate
    Controller -.->|Webhook| Slack
    Controller -.->|Webhook| GChat
    Events --> Resources
    Agent -.->|User opens<br/>Visualization URL| WebUI

Remediation Workflow

The remediate tool operates as a multi-phase workflow with persistent session management:

flowchart TD
    subgraph Phase1["Phase 1: Investigation"]
        Issue["Issue Description"]
        CreateSession["Create Session<br/>(rem-{ts}-{uuid})"]
        Investigation["AI Investigation Loop<br/>(max 30 iterations)"]
        KubectlTools["kubectl Tools:<br/>get, describe, logs,<br/>events, api-resources, patch"]

        Issue --> CreateSession --> Investigation
        Investigation <-->|Tool Calls| KubectlTools
    end

    subgraph Phase2["Phase 2: Analysis"]
        ParseResponse["Parse AI Response"]
        StatusCheck{"Issue<br/>Status?"}
        AlreadyResolved["Return: Issue already<br/>resolved/non-existent"]
        Analysis["Root Cause Analysis<br/>+ Confidence Score<br/>+ Contributing Factors"]

        Investigation --> ParseResponse --> StatusCheck
        StatusCheck -->|resolved| AlreadyResolved
        StatusCheck -->|non_existent| AlreadyResolved
        StatusCheck -->|active| Analysis
    end

    subgraph Phase3["Phase 3: Execution Decision"]
        ModeCheck{"Execution<br/>Mode?"}

        subgraph Manual["Manual Mode (default)"]
            ReturnChoices["Return 2 Choices:<br/>1. Execute via MCP<br/>2. Execute via Agent"]
            WaitApproval["await_user_approval"]
        end

        subgraph Auto["Automatic Mode"]
            ThresholdCheck{"Confidence >= threshold<br/>AND Risk <= maxRisk?"}
            AutoExecute["Execute Automatically"]
            Fallback["Return with<br/>fallbackReason"]
        end

        Analysis --> ModeCheck
        ModeCheck -->|manual| ReturnChoices --> WaitApproval
        ModeCheck -->|automatic| ThresholdCheck
        ThresholdCheck -->|Yes| AutoExecute
        ThresholdCheck -->|No| Fallback
    end

    subgraph Phase4["Phase 4: Command Execution"]
        UserChoice["User Choice<br/>(executeChoice=1 or 2)"]
        Choice1{"Choice?"}
        ExecuteMCP["Execute via MCP<br/>(child_process.exec)"]
        ExecuteAgent["Return Commands<br/>for Agent Execution"]
        LogResults["Log Results<br/>(success/failure/output)"]

        WaitApproval --> UserChoice --> Choice1
        Choice1 -->|1| ExecuteMCP --> LogResults
        Choice1 -->|2| ExecuteAgent
        AutoExecute --> ExecuteMCP
    end

    subgraph Phase5["Phase 5: Validation"]
        ValidationCheck{"All Commands<br/>Succeeded?"}
        HasValidation{"Has validationIntent?"}
        RecursiveCall["Recursive Investigation<br/>with validationIntent"]
        WaitReconcile["Wait 30s<br/>(automatic mode)"]
        FinalStatus["Final Status:<br/>resolved / still_active"]

        LogResults --> ValidationCheck
        ValidationCheck -->|Yes| HasValidation
        ValidationCheck -->|No| FinalStatus
        HasValidation -->|Yes| WaitReconcile --> RecursiveCall --> FinalStatus
        HasValidation -->|No| FinalStatus
    end

Component Details

MCP Server (dot-ai)

The MCP server is the core remediation engine:

Component File Description
remediate tool src/tools/remediate.ts Entry point, orchestrates investigation and execution
System Prompt prompts/remediate-system.md AI instructions for investigation behavior
GenericSessionManager src/core/generic-session-manager.ts File-based session persistence
AIProvider src/core/ai-provider.interface.ts AI abstraction with tool loop support
AIProviderFactory src/core/ai-provider-factory.ts Multi-provider factory (Anthropic, OpenAI, etc.)
kubectl-tools src/core/kubectl-tools.ts Kubectl investigation tools
visualization src/core/visualization.ts URL generation for web UI

Kubectl Investigation Tools

Tools available during AI investigation:

Tool Description
kubectl_api_resources Discover available resources in cluster
kubectl_get List resources with table format (compact)
kubectl_describe Detailed resource information with events
kubectl_logs Container logs (supports --previous for crashes)
kubectl_events Cluster events for understanding state changes
kubectl_patch_dryrun Validate patches before actual execution

Controller (dot-ai-controller)

The Kubernetes controller provides event-driven remediation:

Component File Description
RemediationPolicy CRD config/crd/bases/ Custom resource for remediation rules
Policy Controller internal/controller/remediationpolicy_controller.go Event matching and MCP dispatch
Rate Limiter internal/controller/remediationpolicy_ratelimit.go Per-object cooldowns and rate limits
MCP Client internal/controller/remediationpolicy_mcp.go HTTP client for remediate tool
Cooldown State ConfigMaps Persistent cooldown state across restarts

Web UI (dot-ai-ui)

Provides visualization for remediation results:

Component File Description
Visualization Page src/pages/Visualization.tsx Main page for /v/{sessionId}
MermaidRenderer src/components/renderers/MermaidRenderer.tsx Interactive flowcharts (collapsible)
CardRenderer src/components/renderers/CardRenderer.tsx Issue/solution cards
CodeRenderer src/components/renderers/CodeRenderer.tsx Commands and logs with syntax highlighting
InsightsPanel src/components/InsightsPanel.tsx AI observations display
API Client src/api/client.ts Data fetching from MCP server

Integration Points

flowchart LR
    subgraph MCP["MCP Server"]
        Remediate["remediate tool"]
        AIProvider["AI Provider"]
        KubectlTools["Kubectl Tools"]
        SessionMgr["Session Manager"]
    end

    subgraph AI["AI Providers"]
        Anthropic["Claude API"]
        OpenAI["OpenAI API"]
        Google["Gemini API"]
        Others["xAI, Bedrock,<br/>OpenRouter, etc."]
    end

    subgraph K8s["Kubernetes"]
        API["API Server"]
        Controller["RemediationPolicy<br/>Controller"]
        Events["Kubernetes Events"]
    end

    subgraph Storage["Session Storage"]
        Files["File System<br/>tmp/sessions/*.json"]
        ConfigMaps["ConfigMaps<br/>(cooldown state)"]
    end

    subgraph UI["Web UI"]
        Viz["Visualization<br/>Dashboard"]
    end

    subgraph Notif["Notifications"]
        Slack["Slack"]
        GChat["Google Chat"]
    end

    AIProvider <-->|Tool Loop| KubectlTools
    AIProvider --> Anthropic
    AIProvider --> OpenAI
    AIProvider --> Google
    AIProvider --> Others

    KubectlTools -->|Investigation| API
    Remediate -->|Execute Commands| API

    Controller -->|Watch| Events
    Controller -->|HTTP POST| Remediate
    Controller -.->|Webhook| Slack
    Controller -.->|Webhook| GChat

    SessionMgr --> Files
    Controller --> ConfigMaps

    Remediate -.->|Session URL| Viz

MCP Server ↔ AI Provider

  • Tool Loop: AI iteratively calls kubectl tools (max 30 iterations)
  • Investigation: Gathers cluster data to understand the issue
  • Analysis: Parses JSON response with root cause, confidence, and remediation steps
  • Validation: Optional recursive investigation after command execution

MCP Server ↔ Kubernetes API

  • Read Operations: kubectl get, describe, logs, events
  • Validation: kubectl patch --dry-run=server
  • Execution: child_process.exec() for remediation commands

Controller ↔ MCP Server

  • Event-Driven: Controller watches Kubernetes events
  • Policy Matching: Events matched against RemediationPolicy selectors
  • HTTP Dispatch: POST to MCP /api/v1/tools/remediate
  • Rate Limiting: Per-object cooldowns prevent remediation storms

MCP Server ↔ Web UI

  • Session Storage: Remediation data stored with session IDs
  • Visualization API: /api/v1/visualize/{sessionId} endpoint
  • URL Generation: WEB_UI_BASE_URL/v/{sessionId}

Controller ↔ Notifications

  • Slack Webhooks: Controller sends remediation events to Slack channels
  • Google Chat Webhooks: Controller sends remediation events to Google Chat spaces
  • Secret References: Webhook URLs stored securely in Kubernetes Secrets
  • Event Types: Notifications sent on remediation start, success, and failure

Session Management

Sessions persist workflow state across tool calls:

Session ID Format: rem-{timestamp}-{uuid8}
Example: rem-1767465086590-11029192

Session Data:
├── toolName: 'remediate'
├── issue: "Pod my-app is crashing with OOMKilled"
├── mode: 'manual' | 'automatic'
├── interaction_id: (for evaluation dataset)
├── status: 'investigating' | 'analysis_complete' | 'executed_*' | ...
├── finalAnalysis:
│   ├── rootCause: "Container memory limit too low"
│   ├── confidence: 0.92
│   ├── factors: ["High memory usage", "No HPA"]
│   ├── remediation:
│   │   ├── summary: "Increase memory limit"
│   │   ├── actions: [{description, command, risk, rationale}]
│   │   └── risk: 'low' | 'medium' | 'high'
│   └── validationIntent: "Verify pod is running"
├── executionResults: [{command, success, output, error}]
└── timestamp: ISO date

Session States

State Description
investigating AI is gathering data via kubectl tools
analysis_complete Analysis done, awaiting user approval
failed Investigation failed (error)
executed_successfully All commands succeeded
executed_with_errors Some commands failed
cancelled User cancelled the remediation

RemediationPolicy CRD

The controller uses a CRD for event-driven remediation:

apiVersion: dot-ai.devopstoolkit.live/v1alpha1
kind: RemediationPolicy
metadata:
  name: oom-killer-policy
spec:
  eventSelectors:
    - type: Warning
      reason: OOMKilled
      involvedObjectKind: Pod
      namespace: production
      message: ".*memory.*"  # Regex support
      mode: automatic        # Override per selector
      confidenceThreshold: 0.9
      maxRiskLevel: low

  mcpEndpoint: https://mcp.example.com/api/v1/tools
  mcpAuthSecretRef:
    name: mcp-auth
    key: token
  mcpTool: remediate

  mode: manual              # Default mode
  confidenceThreshold: 0.8
  maxRiskLevel: low

  rateLimiting:
    eventsPerMinute: 10
    cooldownMinutes: 5

  notifications:
    slack:
      webhookSecretRef:
        name: slack-webhook
        key: url
      channel: "#alerts"
    googleChat:
      webhookSecretRef:
        name: gchat-webhook
        key: url

status:
  totalEventsProcessed: 150
  successfulRemediations: 142
  failedRemediations: 8
  rateLimitedEvents: 25
  lastProcessedEvent: "2025-01-07T10:30:00Z"

Output Formats

The remediate tool returns structured output:

Field Description
status success, failed, or awaiting_user_approval
sessionId Session ID for continuation or visualization
investigation.iterations Number of AI tool loop iterations
investigation.dataGathered List of kubectl tools called
analysis.rootCause Identified root cause
analysis.confidence Confidence score (0-1)
analysis.factors Contributing factors
remediation.summary Human-readable summary
remediation.actions Commands with risk levels
remediation.risk Overall risk level
validationIntent Post-execution validation instructions
executionChoices Available execution options
results Execution results (if executed)

Error Handling

The remediation workflow includes robust error handling:

  1. Session Not Found: Clear guidance to start new investigation
  2. AI Service Errors: Logged with request IDs for debugging
  3. JSON Parsing Failures: Original AI response logged for analysis
  4. Command Execution Failures: Individual command results tracked
  5. Validation Failures: Recursive investigation continues despite errors
  6. Investigation Timeouts: Max 30 iterations prevents infinite loops

Configuration

Environment Variables

Variable Description Default
AI_PROVIDER AI provider selection anthropic
ANTHROPIC_API_KEY Anthropic API key Required if using
OPENAI_API_KEY OpenAI API key Required if using
KUBECONFIG Kubernetes config path Auto-detected
DOT_AI_SESSION_DIR Session storage directory ./tmp/sessions
WEB_UI_BASE_URL Web UI base URL Optional
DEBUG_DOT_AI Enable debug logging false

Supported AI Providers

Provider Models Notes
Anthropic Claude Sonnet 4.5, Opus, Haiku Default, 1M token context
OpenAI GPT-5.1-codex
Google Gemini 3 Pro, Flash
xAI Grok-4
Amazon Bedrock Various Uses AWS credential chain
OpenRouter Multi-model Proxy to multiple providers
Custom Ollama, vLLM, LocalAI Via baseURL config

See Also