Operate Feature Architecture
This document provides a detailed architecture overview of the Operate feature in the DevOps AI Toolkit.
Overview
The Operate feature provides AI-powered Day 2 operations for Kubernetes applications. It handles updates, scaling, enhancements, rollbacks, and deletions through natural language intents while applying organizational patterns and policies, validating changes via dry-run, and executing approved operations safely.
High-Level Architecture
flowchart TB
subgraph Users["User / AI Agent"]
Agent["Claude Code, Cursor,<br/>VS Code, etc."]
end
subgraph MCP["MCP Server (dot-ai)"]
Operate["operate Tool"]
AI["AI Provider"]
Session["Session<br/>Manager"]
Kubectl["Kubectl Tools"]
VectorClient["Vector DB<br/>Client"]
end
subgraph External["External Services"]
LLM["Claude, OpenAI,<br/>Gemini, etc."]
Qdrant["Qdrant<br/>(Patterns, Policies,<br/>Capabilities)"]
end
subgraph K8s["Kubernetes Cluster"]
API["Kubernetes API"]
Controller["Controller<br/>(dot-ai-controller)"]
Resources["Cluster Resources<br/>Deployments, StatefulSets,<br/>CRDs, etc."]
end
subgraph WebUI["Web UI (dot-ai-ui)"]
Viz["Visualization Dashboard<br/>- Current State<br/>- Proposed Changes<br/>- Risk Assessment"]
end
Agent <-->|MCP Protocol| Operate
Operate --> AI
Operate --> Session
Operate --> Kubectl
Operate --> VectorClient
AI --> LLM
AI <-->|Tool Loop| Kubectl
VectorClient --> Qdrant
Kubectl --> API
Operate -->|Execute Commands| API
Controller -->|Scan & Embed| Qdrant
Controller -->|Watch| Resources
Agent -.->|User opens<br/>Visualization URL| WebUI
Operation Workflow
The operate tool implements a three-phase workflow with persistent session management:
flowchart TD
subgraph Phase1["Phase 1: Intent Analysis"]
Intent["User Intent<br/>(natural language)"]
CreateSession["Create Session<br/>(opr-{ts}-{uuid})"]
EmbedContext["Embed Context:<br/>- Search Patterns<br/>- Search Policies<br/>- Search Capabilities"]
ContextCheck{"Capabilities<br/>Found?"}
NoCapError["Error: Run<br/>capability scan first"]
Investigation["AI Investigation Loop<br/>(max 30 iterations)"]
KubectlTools["kubectl Tools:<br/>get, describe, logs,<br/>patch_dryrun, apply_dryrun,<br/>delete_dryrun, get_crd_schema"]
Intent --> CreateSession --> EmbedContext
EmbedContext --> ContextCheck
ContextCheck -->|No| NoCapError
ContextCheck -->|Yes| Investigation
Investigation <-->|Tool Calls| KubectlTools
end
subgraph Phase2["Phase 2: Analysis & Validation"]
ParseResponse["Parse AI Response"]
DryRunCheck{"Dry-Run<br/>Validated?"}
RetryAnalysis["AI Iterates with<br/>Corrected Commands"]
Analysis["Generate Analysis:<br/>- Current State<br/>- Proposed Changes<br/>- Commands<br/>- Risk Assessment"]
SaveSession["Save Session<br/>(status: analysis_complete)"]
ReturnAnalysis["Return with<br/>Visualization URL"]
Investigation --> ParseResponse --> DryRunCheck
DryRunCheck -->|No| RetryAnalysis --> Investigation
DryRunCheck -->|Yes| Analysis
Analysis --> SaveSession --> ReturnAnalysis
end
subgraph Phase3["Phase 3: User Approval & Execution"]
UserChoice["User Approval<br/>(executeChoice=1)"]
LoadSession["Load Session"]
ExecuteCommands["Execute Commands<br/>(continue-on-error)"]
PostValidation{"Has<br/>validationIntent?"}
CallRemediate["Call remediate<br/>Internally"]
SaveResults["Save Results<br/>(status: executed_*)"]
ReturnResults["Return Execution<br/>Results"]
ReturnAnalysis -.->|"User reviews<br/>and approves"| UserChoice
UserChoice --> LoadSession --> ExecuteCommands
ExecuteCommands --> PostValidation
PostValidation -->|Yes| CallRemediate --> SaveResults
PostValidation -->|No| SaveResults
SaveResults --> ReturnResults
end
Component Details
MCP Server (dot-ai)
The MCP server provides the core operations engine:
| Component | File | Description |
|---|---|---|
| operate tool | src/tools/operate.ts | Entry point, routing, context embedding, formatting |
| Analysis workflow | src/tools/operate-analysis.ts | Intent analysis, AI tool loop, response parsing |
| Execution workflow | src/tools/operate-execution.ts | Command execution, post-validation, results |
| System Prompt | prompts/operate-system.md | AI instructions for operation behavior |
| User Prompt | prompts/operate-user.md | Handlebars template with context injection |
| GenericSessionManager | src/core/generic-session-manager.ts | File-based session persistence |
| AIProvider | src/core/ai-provider.interface.ts | AI abstraction with tool loop support |
| kubectl-tools | src/core/kubectl-tools.ts | Kubectl investigation and validation tools |
| Vector Services | src/services/*-vector-service.ts | Pattern, policy, capability search |
| visualization | src/core/visualization.ts | URL generation for web UI |
Kubectl Investigation & Validation Tools
Tools available during AI analysis:
| Tool | Description |
|---|---|
| kubectl_api_resources | Discover available resources in cluster |
| kubectl_get | List resources with table format |
| kubectl_describe | Detailed resource information with events |
| kubectl_logs | Container logs for debugging |
| kubectl_patch_dryrun | Validate patch operations before execution |
| kubectl_apply_dryrun | Validate apply operations before execution |
| kubectl_delete_dryrun | Validate delete operations before execution |
| kubectl_get_crd_schema | Get CRD schema for custom resources |
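As a rough illustration of what the three dry-run tools validate, the sketch below maps each tool name to a kubectl argument vector ending in `--dry-run=server`. The `DryRunRequest` shape and the `dryRunArgs` helper are hypothetical; the document does not show the real tool arguments.

```typescript
// Hypothetical request shape; the real tool arguments are not shown in this document.
type DryRunTool =
  | "kubectl_patch_dryrun"
  | "kubectl_apply_dryrun"
  | "kubectl_delete_dryrun";

interface DryRunRequest {
  tool: DryRunTool;
  resource?: string; // e.g. "deployment/my-api" (patch and delete)
  namespace?: string;
  patch?: string; // patch body (patch only)
}

// Build the kubectl argument vector for a server-side dry run.
function dryRunArgs(req: DryRunRequest): string[] {
  const ns = req.namespace ? ["-n", req.namespace] : [];
  switch (req.tool) {
    case "kubectl_patch_dryrun":
      if (!req.resource || !req.patch) {
        throw new Error("patch dry-run needs a resource and a patch body");
      }
      return ["patch", req.resource, ...ns, "-p", req.patch, "--dry-run=server"];
    case "kubectl_apply_dryrun":
      // The manifest is expected on stdin ("-f -").
      return ["apply", "-f", "-", ...ns, "--dry-run=server"];
    case "kubectl_delete_dryrun":
      if (!req.resource) {
        throw new Error("delete dry-run needs a resource");
      }
      return ["delete", req.resource, ...ns, "--dry-run=server"];
  }
}
```

Returning an argument array rather than a single shell string keeps quoting of the patch body out of the picture.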
Controller (dot-ai-controller)
The Kubernetes controller provides capability scanning:
| Component | File | Description |
|---|---|---|
| Capability Scanner | internal/controller/capability_scanner.go | Discovers cluster resources and capabilities |
| Embedding Service | internal/controller/embedding_service.go | Generates embeddings for semantic search |
| Qdrant Client | internal/controller/qdrant_client.go | Stores capabilities in vector database |
Web UI (dot-ai-ui)
Provides visualization for operation analysis and execution:
| Component | File | Description |
|---|---|---|
| Visualization Page | src/pages/Visualization.tsx | Main page for /v/{sessionId} |
| MermaidRenderer | src/components/renderers/MermaidRenderer.tsx | Interactive flowcharts |
| CardRenderer | src/components/renderers/CardRenderer.tsx | Current state and proposed changes |
| CodeRenderer | src/components/renderers/CodeRenderer.tsx | Commands with syntax highlighting |
| InsightsPanel | src/components/InsightsPanel.tsx | AI observations and risk assessment |
Integration Points
flowchart LR
subgraph MCP["MCP Server"]
Operate["operate tool"]
AIProvider["AI Provider"]
KubectlTools["Kubectl Tools"]
SessionMgr["Session Manager"]
VectorSvc["Vector Services"]
end
subgraph AI["AI Providers"]
Anthropic["Claude API"]
OpenAI["OpenAI API"]
Google["Gemini API"]
Others["xAI, Bedrock,<br/>OpenRouter, etc."]
end
subgraph VectorDB["Vector Database"]
Patterns["Patterns<br/>Collection"]
Policies["Policies<br/>Collection"]
Capabilities["Capabilities<br/>Collection"]
end
subgraph K8s["Kubernetes"]
API["API Server"]
Controller["Capability<br/>Scanner"]
end
subgraph Storage["Session Storage"]
Files["File System<br/>~/.dot-ai/sessions/opr-sessions/"]
end
subgraph UI["Web UI"]
Viz["Visualization<br/>Dashboard"]
end
AIProvider <-->|Tool Loop| KubectlTools
AIProvider --> Anthropic
AIProvider --> OpenAI
AIProvider --> Google
AIProvider --> Others
VectorSvc --> Patterns
VectorSvc --> Policies
VectorSvc --> Capabilities
KubectlTools -->|Investigation| API
Operate -->|Execute Commands| API
Controller -->|Scan & Embed| Capabilities
SessionMgr --> Files
Operate -.->|Session URL| Viz
MCP Server ↔ AI Provider
- Tool Loop: AI iteratively calls kubectl tools (max 30 iterations)
- Investigation: Gathers current cluster state to understand resources
- Dry-Run Validation: Validates all commands before proposing
- Analysis: Generates JSON response with changes, commands, and risk assessment
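The bounded tool loop can be sketched as follows. This is a minimal illustration, not the project's implementation: `NextTurn`, `RunTool`, and `runToolLoop` are hypothetical names, and only the cap of 30 iterations comes from this document.

```typescript
// All names here are hypothetical; only the 30-iteration cap comes from the document.
interface ToolCall {
  tool: string;
  args: string[];
}

type ModelTurn =
  | { kind: "tool_call"; call: ToolCall }
  | { kind: "final"; analysis: string };

type NextTurn = (transcript: string[]) => Promise<ModelTurn>;
type RunTool = (call: ToolCall) => Promise<string>;

async function runToolLoop(
  nextTurn: NextTurn,
  runTool: RunTool,
  maxIterations = 30
): Promise<{ analysis: string; iterations: number }> {
  const transcript: string[] = [];
  for (let i = 1; i <= maxIterations; i++) {
    const turn = await nextTurn(transcript);
    if (turn.kind === "final") {
      return { analysis: turn.analysis, iterations: i };
    }
    // Feed the tool output back so the next turn can build on it.
    const output = await runTool(turn.call);
    transcript.push(`${turn.call.tool} ${turn.call.args.join(" ")} => ${output}`);
  }
  // The cap turns a model that never finishes into an explicit error.
  throw new Error(`Investigation stopped after ${maxIterations} iterations`);
}
```

The model either asks for another tool call or returns its final analysis; the loop ends on the first final turn or when the cap is reached.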
MCP Server ↔ Vector Database
- Patterns: Organizational patterns for operational best practices
- Policies: Policy intents for validation and compliance
- Capabilities: Cluster resource capabilities for intelligent recommendations
- Capabilities are mandatory; patterns/policies are optional
MCP Server ↔ Kubernetes API
- Read Operations: kubectl get, describe, logs
- Validation: kubectl patch/apply/delete --dry-run=server
- Execution: Sequential command execution via child_process.exec()
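Sequential execution with continue-on-error can be sketched as below. The command runner is injected so the example runs without a cluster; in practice it would presumably wrap child_process.exec(). The names `executeCommands`, `CommandRunner`, and `overallStatus` are illustrative assumptions, while the result fields mirror the executionResults entries in the session data.

```typescript
interface ExecutionResult {
  command: string;
  success: boolean;
  output?: string;
  error?: string;
}

// Hypothetical runner: resolves with stdout, rejects when the command fails.
type CommandRunner = (command: string) => Promise<string>;

// Run commands one at a time; a failure is recorded and the next command still runs.
async function executeCommands(
  commands: string[],
  run: CommandRunner
): Promise<ExecutionResult[]> {
  const results: ExecutionResult[] = [];
  for (const command of commands) {
    try {
      const output = await run(command);
      results.push({ command, success: true, output });
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      results.push({ command, success: false, error });
    }
  }
  return results;
}

// Map per-command results onto the two executed_* session states.
function overallStatus(
  results: ExecutionResult[]
): "executed_successfully" | "executed_with_errors" {
  return results.every((r) => r.success)
    ? "executed_successfully"
    : "executed_with_errors";
}
```

Because each result is captured separately, the caller can report exactly which commands failed instead of stopping at the first error.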
MCP Server ↔ Remediate Tool
- Post-Execution Validation: Internally calls remediate with validationIntent
- Verification: Confirms operations completed successfully
- Error Detection: Identifies issues introduced by operations
MCP Server ↔ Web UI
- Session Storage: Operation data stored with session IDs
- Visualization API: /api/v1/visualize/{sessionId} endpoint
- URL Generation: WEB_UI_BASE_URL/v/{sessionId}
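The URL pattern can be illustrated with a small helper. `visualizationUrl` is a hypothetical name, and returning no URL when WEB_UI_BASE_URL is unset is an assumption based on the variable being optional.

```typescript
// Hypothetical helper: builds WEB_UI_BASE_URL/v/{sessionId}.
// Returns undefined when no base URL is configured (assumed behavior).
function visualizationUrl(
  baseUrl: string | undefined,
  sessionId: string
): string | undefined {
  if (!baseUrl) {
    return undefined;
  }
  // Trim trailing slashes so the result never contains "//v/".
  const trimmed = baseUrl.replace(/\/+$/, "");
  return `${trimmed}/v/${encodeURIComponent(sessionId)}`;
}
```

For example, `visualizationUrl("https://dot-ai-ui/", "opr-1704067200000-a1b2c3d4")` yields `https://dot-ai-ui/v/opr-1704067200000-a1b2c3d4`.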
Session Management
Sessions persist workflow state across tool calls:
Session ID Format: opr-{timestamp}-{uuid8}
Example: opr-1704067200000-a1b2c3d4
Session Data:
├── toolName: 'operate'
├── intent: "Update my-api to v2.0 with zero downtime"
├── context:
│ ├── patterns: OrganizationalPattern[]
│ ├── policies: PolicyIntent[]
│ └── capabilities: ResourceCapability[]
├── proposedChanges:
│ ├── create: ResourceChange[]
│ ├── update: ResourceChange[]
│ └── delete: ResourceChange[]
├── commands: ["kubectl set image...", "kubectl patch..."]
├── dryRunValidation:
│ ├── status: 'success' | 'failed'
│ └── details: string
├── patternsApplied: ["Zero-Downtime Rolling Update"]
├── capabilitiesUsed: ["metrics-server", "KEDA"]
├── policiesChecked: ["Production Update Policy"]
├── risks: { level: 'low', description: "..." }
├── validationIntent: "Verify deployment rollout complete"
├── status: 'analyzing' | 'analysis_complete' | 'executing' | 'executed_*'
└── executionResults: [{command, success, output, error}]
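The session layout above can be expressed as a TypeScript type. This is a sketch derived from the listing, not the project's actual definitions: the shapes of OrganizationalPattern, PolicyIntent, ResourceCapability, and ResourceChange are not given here and are left opaque, and `newSessionId`, `SESSION_ID_PATTERN`, and `isAwaitingApproval` are hypothetical helpers. The pattern assumes the uuid8 part is eight lowercase hex characters, as in the example ID.

```typescript
// Nested shapes the document does not spell out are left opaque.
type Opaque = Record<string, unknown>;

interface ExecutionResult {
  command: string;
  success: boolean;
  output?: string;
  error?: string;
}

interface OperateSession {
  toolName: "operate";
  intent: string;
  context: { patterns: Opaque[]; policies: Opaque[]; capabilities: Opaque[] };
  proposedChanges: { create: Opaque[]; update: Opaque[]; delete: Opaque[] };
  commands: string[];
  dryRunValidation: { status: "success" | "failed"; details: string };
  patternsApplied: string[];
  capabilitiesUsed: string[];
  policiesChecked: string[];
  risks: { level: string; description: string };
  validationIntent?: string;
  status:
    | "analyzing"
    | "analysis_complete"
    | "executing"
    | "executed_successfully"
    | "executed_with_errors"
    | "failed";
  executionResults?: ExecutionResult[];
}

// Assumed shape of opr-{timestamp}-{uuid8}.
const SESSION_ID_PATTERN = /^opr-\d+-[0-9a-f]{8}$/;

// Hypothetical helper. Eight random hex characters stand in for the UUID prefix.
function newSessionId(now: number = Date.now()): string {
  let suffix = "";
  for (let i = 0; i < 8; i++) {
    suffix += Math.floor(Math.random() * 16).toString(16);
  }
  return `opr-${now}-${suffix}`;
}

// A session can be executed only once its analysis is complete.
function isAwaitingApproval(session: OperateSession): boolean {
  return session.status === "analysis_complete";
}
```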
Session States
| State | Description |
|---|---|
| analyzing | AI is gathering data and generating commands |
| analysis_complete | Analysis done, awaiting user approval |
| executing | Commands are being executed |
| executed_successfully | All commands succeeded |
| executed_with_errors | Some commands failed |
| failed | Analysis or execution failed |
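One way to read the table is as a small state machine. The transition map below is an assumption inferred from the three-phase workflow; the document lists the states but does not define the allowed transitions, and `canTransition` is a hypothetical helper.

```typescript
type SessionStatus =
  | "analyzing"
  | "analysis_complete"
  | "executing"
  | "executed_successfully"
  | "executed_with_errors"
  | "failed";

// Assumed transitions, inferred from the workflow rather than taken from the source.
const TRANSITIONS: Record<SessionStatus, SessionStatus[]> = {
  analyzing: ["analysis_complete", "failed"],
  analysis_complete: ["executing"],
  executing: ["executed_successfully", "executed_with_errors", "failed"],
  executed_successfully: [],
  executed_with_errors: [],
  failed: [],
};

// True when a session may move from one status to another under the assumed map.
function canTransition(from: SessionStatus, to: SessionStatus): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}
```

Under this reading, execution can start only from analysis_complete, which matches the approval step in Phase 3.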
Organizational Context Integration
The operate tool integrates organizational knowledge via vector database search:
flowchart TB
subgraph Input["User Intent"]
Intent["Update my-api to v2.0<br/>with zero downtime"]
end
subgraph Search["Vector DB Search"]
PatternSearch["Search Patterns<br/>(limit: 5)"]
PolicySearch["Search Policies<br/>(limit: 5)"]
CapSearch["Search Capabilities<br/>(limit: 50)"]
end
subgraph Results["Search Results"]
Patterns["Patterns Found:<br/>- Zero-Downtime Rolling Update<br/>- Blue-Green Deployment"]
Policies["Policies Found:<br/>- Production Update Policy<br/>- Change Management"]
Caps["Capabilities Found:<br/>- metrics-server<br/>- KEDA Operator<br/>- Argo Rollouts"]
end
subgraph Prompt["AI Prompt"]
Context["Embedded Context:<br/>patterns + policies + capabilities"]
end
Intent --> PatternSearch & PolicySearch & CapSearch
PatternSearch --> Patterns
PolicySearch --> Policies
CapSearch --> Caps
Patterns & Policies & Caps --> Context
Context Priority
- Capabilities (Mandatory): What the cluster can do
- Patterns (Optional): Organizational best practices
- Policies (Optional): Compliance and validation rules
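The priority rules can be sketched as a context-gathering function. The `Search` signature and the `embedContext` body are illustrative assumptions: the search limits (5, 5, 50) and the mandatory capability check come from this document, while treating a missing or failing optional search as an empty result is an assumption.

```typescript
// Hypothetical search signature: returns the names of matching entries.
type Search = (query: string, limit: number) => Promise<string[]>;

interface ContextSources {
  capabilities: Search; // mandatory
  patterns?: Search; // optional
  policies?: Search; // optional
}

interface EmbeddedContext {
  patterns: string[];
  policies: string[];
  capabilities: string[];
}

// Optional collections degrade to an empty list when missing or failing (assumed).
async function searchOptional(
  search: Search | undefined,
  query: string,
  limit: number
): Promise<string[]> {
  if (!search) {
    return [];
  }
  try {
    return await search(query, limit);
  } catch {
    return [];
  }
}

async function embedContext(
  intent: string,
  sources: ContextSources
): Promise<EmbeddedContext> {
  const [patterns, policies, capabilities] = await Promise.all([
    searchOptional(sources.patterns, intent, 5),
    searchOptional(sources.policies, intent, 5),
    sources.capabilities(intent, 50),
  ]);
  // Capabilities are mandatory: without them the analysis cannot proceed.
  if (capabilities.length === 0) {
    throw new Error("No capabilities found: run a capability scan first");
  }
  return { patterns, policies, capabilities };
}
```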
Output Formats
The operate tool returns structured output at different stages:
Analysis Response
| Field | Description |
|---|---|
| status | awaiting_user_approval |
| sessionId | Session ID for continuation |
| visualizationUrl | URL to view analysis in web UI |
| currentState | Current cluster resource state |
| proposedChanges | Create, update, delete operations |
| commands | Pre-validated kubectl commands |
| dryRunValidation | Dry-run validation results |
| patternsApplied | Applied organizational patterns |
| capabilitiesUsed | Used cluster capabilities |
| policiesChecked | Checked policies |
| risks | Risk assessment (level + description) |
| validationIntent | Post-execution validation instructions |
Execution Response
| Field | Description |
|---|---|
| status | success or failed |
| sessionId | Session ID for reference |
| results | Per-command execution results |
| validation | Post-execution validation summary |
| message | Human-readable summary |
Error Handling
The operation workflow handles the following failure cases:
- No Capabilities Found: Clear guidance to run capability scan first
- Session Not Found: Guidance to start new operation
- Dry-Run Failures: AI iterates to fix commands before proposing
- Command Execution Failures: Continue-on-error, capture all results
- Validation Failures: Report issues via remediate tool integration
- AI Service Errors: Logged with request IDs for debugging
- Investigation Limits: The cap of 30 tool-loop iterations prevents infinite loops
Configuration
Environment Variables
| Variable | Description | Default |
|---|---|---|
| AI_PROVIDER | AI provider selection | anthropic |
| ANTHROPIC_API_KEY | Anthropic API key | Required if using |
| OPENAI_API_KEY | OpenAI API key | Required if using |
| QDRANT_URL | Qdrant vector database URL | http://localhost:6333 |
| QDRANT_API_KEY | Qdrant API key | Optional |
| QDRANT_CAPABILITIES_COLLECTION | Capabilities collection name | capabilities |
| KUBECONFIG | Kubernetes config path | Auto-detected |
| DOT_AI_SESSION_DIR | Session storage directory | ~/.dot-ai/sessions |
| WEB_UI_BASE_URL | Web UI base URL | Optional |
| DEBUG_DOT_AI | Enable debug logging | false |
Supported AI Providers
| Provider | Models | Notes |
|---|---|---|
| Anthropic | Claude Sonnet 4.5, Opus, Haiku | Default, 1M token context |
| OpenAI | GPT-5.1-codex | |
| Google | Gemini 3 Pro, Flash | |
| xAI | Grok-4 | |
| Amazon Bedrock | Various | Uses AWS credential chain |
| OpenRouter | Multi-model | Proxy to multiple providers |
| Custom | Ollama, vLLM, LocalAI | Via baseURL config |
Workflow Example
User Intent: "Update my-api deployment in prod to v2.0 with zero downtime"
1. CONTEXT EMBEDDING
└─ embedContext(intent)
├─ Search patterns → "Zero-Downtime Rolling Update"
├─ Search policies → "Production Update Requirements"
└─ Search capabilities → "metrics-server", "KEDA Operator"
2. AI INVESTIGATION LOOP
└─ AI Tool Loop (30 iterations max)
├─ kubectl_get deployment/my-api -n prod
├─ kubectl_describe deployment/my-api -n prod
├─ kubectl_patch_dryrun (test maxUnavailable: 0)
└─ kubectl_patch_dryrun (test v2.0 image)
3. ANALYSIS GENERATION
└─ Session created: opr-1704067200000-a1b2c3d4
├─ Status: analysis_complete
├─ Current: 3 replicas, my-api:v1.5, maxUnavailable: 1
├─ Proposed: image v2.0, maxUnavailable: 0
├─ Commands: set image + patch strategy
├─ Risk: LOW
└─ Visualization URL: https://dot-ai-ui/v/opr-1704067200000-a1b2c3d4
4. USER APPROVAL
└─ User reviews analysis in terminal or web UI
└─ Calls: operate({ sessionId: 'opr-...', executeChoice: 1 })
5. COMMAND EXECUTION
└─ executeOperations()
├─ Load session (status: analysis_complete)
├─ Update status to executing
├─ Execute commands sequentially
│ ├─ kubectl set image deployment/my-api my-api=my-api:v2.0 -n prod
│ └─ kubectl patch deployment/my-api -n prod -p '{"spec":...}'
├─ Call remediate internally for validation
└─ Update status to executed_successfully
6. RETURN RESULTS
└─ Results: 2 commands succeeded
├─ Validation: "Rollout complete, all pods running v2.0"
└─ Status: success
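From the caller's side, the walkthrough amounts to two operate calls with an approval step in between. The sketch below assumes a hypothetical `callOperate` function standing in for the MCP client call and uses only a subset of the documented response fields; it is not the toolkit's client API.

```typescript
interface AnalysisResponse {
  status: "awaiting_user_approval";
  sessionId: string;
  commands: string[];
  risks: { level: string; description: string };
  visualizationUrl?: string;
}

interface ExecutionResponse {
  status: "success" | "failed";
  sessionId: string;
  message: string;
}

type OperateArgs = { intent: string } | { sessionId: string; executeChoice: 1 };

// Hypothetical stand-in for invoking the operate tool over MCP.
type CallOperate = (
  args: OperateArgs
) => Promise<AnalysisResponse | ExecutionResponse>;

type Approve = (analysis: AnalysisResponse) => Promise<boolean>;

// First call analyzes the intent; the second runs only after approval.
async function operateWithApproval(
  callOperate: CallOperate,
  intent: string,
  approve: Approve
): Promise<ExecutionResponse | undefined> {
  const analysis = await callOperate({ intent });
  if (analysis.status !== "awaiting_user_approval") {
    throw new Error(`Expected an analysis, got status ${analysis.status}`);
  }
  if (!(await approve(analysis))) {
    return undefined; // declined: nothing is executed
  }
  const result = await callOperate({
    sessionId: analysis.sessionId,
    executeChoice: 1,
  });
  if (result.status === "awaiting_user_approval") {
    throw new Error("Expected execution results, got another analysis");
  }
  return result;
}
```

The session ID returned by the analysis is the only state the caller carries between the two calls; everything else is loaded from the stored session.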