DreamOps

Dream Easy While AI Takes Care Of Your Oncall Duty

Created on 22nd June 2025

The problem DreamOps solves

On-call engineers face constant interruptions, sleep deprivation, and alert fatigue from managing production incidents 24/7. Current incident response relies heavily on manual intervention: engineers must wake up at 3 AM, analyze logs across multiple systems, diagnose issues, and execute remediation steps, all under pressure.

How DreamOps Solves This

Core Features:

  • Automated Incident Response: An AI agent triages and resolves common incidents (pod crashes, OOM kills, configuration issues) without human intervention
  • Intelligent Root Cause Analysis: Uses Claude AI to analyze alerts with full context from Kubernetes, logs, metrics, and documentation
  • Reduced MTTR: What typically takes 30-60 minutes of manual debugging can be resolved in 2-5 minutes automatically
  • Context-Aware Decisions: Integrates with GitHub, Grafana, and Notion to understand your specific infrastructure and runbooks
  • Sleep Protection: Engineers can actually rest while the AI handles routine incidents, only escalating complex issues

Real-World Impact:

  • Saves 2-4 hours per on-call shift
  • Reduces incident resolution time by 80% for common issues
  • Prevents engineer burnout from repetitive tasks
  • Maintains consistent remediation quality even at 3 AM

Challenges we ran into

1. MCP Integration Complexity

Integrating multiple Model Context Protocol servers (Kubernetes, GitHub, Grafana) required building a robust abstraction layer. Each MCP server had different authentication methods and response formats.

  • Solution: Created a unified MCPIntegration base class.
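
As an illustration of what such an abstraction layer might look like, here is a minimal sketch. Only the name MCPIntegration comes from the write-up; the MCPResult type, the method names, and the FakeKubernetesIntegration subclass with its canned response are hypothetical and stand in for real MCP server calls.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class MCPResult:
    """Normalized response shape shared by every integration (hypothetical)."""
    source: str
    ok: bool
    data: Dict[str, Any] = field(default_factory=dict)
    error: Optional[str] = None


class MCPIntegration(ABC):
    """Base class that hides per-server auth and response differences."""

    name = "base"

    @abstractmethod
    def authenticate(self) -> None:
        """Perform whatever auth this server needs (token, kubeconfig, ...)."""

    @abstractmethod
    def _call(self, action: str, params: Dict[str, Any]) -> Any:
        """Send the request and return the server's raw response."""

    @abstractmethod
    def _normalize(self, raw: Any) -> Dict[str, Any]:
        """Convert the raw response into a plain dict."""

    def execute(self, action: str, params: Optional[Dict[str, Any]] = None) -> MCPResult:
        # Callers always get an MCPResult, whether the call succeeded or not.
        try:
            raw = self._call(action, params or {})
            return MCPResult(source=self.name, ok=True, data=self._normalize(raw))
        except Exception as exc:
            return MCPResult(source=self.name, ok=False, error=str(exc))


class FakeKubernetesIntegration(MCPIntegration):
    """Stand-in subclass returning canned data instead of calling a server."""

    name = "kubernetes"

    def authenticate(self) -> None:
        self.authenticated = True

    def _call(self, action: str, params: Dict[str, Any]) -> Any:
        if action != "list_pods":
            raise ValueError(f"unsupported action: {action}")
        return {"items": [{"metadata": {"name": "api-7d9f"},
                           "status": {"phase": "Running"}}]}

    def _normalize(self, raw: Any) -> Dict[str, Any]:
        return {"pods": [{"name": item["metadata"]["name"],
                          "phase": item["status"]["phase"]}
                         for item in raw["items"]]}
```

With this shape, the agent can treat every integration the same way: call execute, check ok, and read data, regardless of which server produced it.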

2. Safety vs Automation Balance

Implementing a YOLO mode that executes kubectl commands automatically was risky.

  • Solution: Built a risk assessment system that categorizes commands as low, medium, or high risk and auto-executes only when the confidence score is at least 0.8.
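
The gating logic can be sketched roughly as follows. The three risk levels and the 0.8 threshold come from the description above; the verb-to-risk mapping and the rule that high-risk commands are never auto-executed are assumptions made for this example, not the project's actual policy.

```python
import shlex

CONFIDENCE_THRESHOLD = 0.8  # threshold stated in the write-up

# Assumed mapping from kubectl verb to risk level (illustrative only).
LOW_RISK_VERBS = {"get", "describe", "logs", "top"}
MEDIUM_RISK_VERBS = {"rollout", "scale", "annotate", "label"}


def classify(command: str) -> str:
    """Return 'low', 'medium', or 'high' for a kubectl command string."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return "high"  # unparseable input is treated as unsafe
    if len(tokens) < 2 or tokens[0] != "kubectl":
        return "high"  # anything that is not a plain kubectl call
    verb = tokens[1]
    if verb in LOW_RISK_VERBS:
        return "low"
    if verb in MEDIUM_RISK_VERBS:
        return "medium"
    return "high"  # unknown verbs (delete, apply, exec, ...) default to high


def should_auto_execute(command: str, confidence: float) -> bool:
    """Auto-execute only low/medium-risk commands at or above the threshold."""
    if classify(command) == "high":
        return False  # assumption: high risk always needs a human
    return confidence >= CONFIDENCE_THRESHOLD
```

Defaulting unknown verbs to high risk keeps the failure mode conservative: a command the classifier has never seen is escalated rather than run.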

3. Real-time Webhook Processing

PagerDuty webhooks need sub-second acknowledgment, but Claude API calls take 2-4 seconds.

  • Solution: Implemented async processing with immediate webhook acknowledgment and background analysis.
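
A minimal sketch of the acknowledge-first pattern, using only asyncio so it runs without a web framework. The function names, the payload shape, and the sleep that stands in for the slow model call are all hypothetical; a real handler would sit behind an HTTP endpoint.

```python
import asyncio
from typing import Any, Dict, List, Set

# Keep references to running tasks so they are not garbage-collected early.
background_tasks: Set["asyncio.Task[None]"] = set()


async def analyze_incident(payload: Dict[str, Any], results: List[Dict[str, Any]]) -> None:
    """Slow analysis step; the sleep is a placeholder for the model call."""
    await asyncio.sleep(0.05)
    results.append({"incident_id": payload["id"], "status": "analyzed"})


async def handle_webhook(payload: Dict[str, Any], results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Schedule the analysis and return an acknowledgment right away."""
    task = asyncio.create_task(analyze_incident(payload, results))
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return {"status": "accepted", "incident_id": payload["id"]}


async def demo():
    results: List[Dict[str, Any]] = []
    ack = await handle_webhook({"id": "INC-1"}, results)
    done_at_ack = len(results)  # analysis has not finished at this point
    await asyncio.gather(*background_tasks)  # wait only so the demo can inspect results
    return ack, done_at_ack, results
```

The handler never awaits the analysis itself, so the acknowledgment time does not depend on how long the model call takes.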

4. Context Window Management

With multiple integrations providing data, we hit Claude's token limits.

  • Solution: Developed context prioritization that fetches only the logs and metrics relevant to the alert type.
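
One way to picture this prioritization is a per-alert-type list of sources, fetched in priority order until a token budget runs out. Everything here is an assumption for illustration: the alert types, the source names, the four-characters-per-token estimate, and the choice to keep the tail of an oversized source.

```python
from typing import Callable, Dict, List, Tuple

# Assumed priority order of context sources per alert type (illustrative).
RELEVANT_SOURCES: Dict[str, List[str]] = {
    "oom_kill": ["pod_events", "memory_metrics", "container_logs"],
    "crash_loop": ["container_logs", "pod_events", "recent_deploys"],
    "config_error": ["recent_deploys", "config_diff", "container_logs"],
}
DEFAULT_SOURCES = ["pod_events", "container_logs"]
CHARS_PER_TOKEN = 4  # rough heuristic, not a real tokenizer


def estimate_tokens(text: str) -> int:
    return max(1, len(text) // CHARS_PER_TOKEN)


def build_context(alert_type: str,
                  fetchers: Dict[str, Callable[[], str]],
                  token_budget: int) -> List[Tuple[str, str]]:
    """Fetch only sources relevant to the alert type, within the budget."""
    selected: List[Tuple[str, str]] = []
    remaining = token_budget
    for source in RELEVANT_SOURCES.get(alert_type, DEFAULT_SOURCES):
        if remaining <= 0:
            break
        fetch = fetchers.get(source)
        if fetch is None:
            continue  # integration not configured
        text = fetch()
        cost = estimate_tokens(text)
        if cost > remaining:
            # Keep the tail, since the most recent lines are usually last.
            selected.append((source, text[-remaining * CHARS_PER_TOKEN:]))
            break
        selected.append((source, text))
        remaining -= cost
    return selected
```

Sources that are not listed for the alert type are never fetched, which saves both tokens and the latency of the extra calls.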

5. Testing Kubernetes Failures

Creating realistic test scenarios was challenging.

  • Solution: Built a fuck_kubernetes.sh script that simulates various failure modes (CrashLoopBackOff, OOMKills, etc.) in isolated namespaces.
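
The script itself is not shown in this write-up. As a rough idea of how two of these failure modes can be provoked, the sketch below generates pod manifests in Python. The pod names, images, and memory figures are assumptions for this example: a container that exits non-zero under restartPolicy Always ends up in CrashLoopBackOff, and a container that allocates more memory than its limit is OOM-killed.

```python
import json
from typing import Any, Callable, Dict


def crash_loop_pod(namespace: str) -> Dict[str, Any]:
    """Pod whose container exits with status 1 on every start."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "crashloop-demo", "namespace": namespace},
        "spec": {
            "restartPolicy": "Always",
            "containers": [{
                "name": "crasher",
                "image": "busybox",
                "command": ["sh", "-c", "echo 'simulated failure'; exit 1"],
            }],
        },
    }


def oom_kill_pod(namespace: str, limit_mi: int = 100, alloc_mb: int = 250) -> Dict[str, Any]:
    """Pod that tries to allocate more memory than its limit allows."""
    if alloc_mb <= limit_mi:
        raise ValueError("allocation must exceed the limit to trigger an OOM kill")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "oomkill-demo", "namespace": namespace},
        "spec": {
            "containers": [{
                "name": "memory-hog",
                "image": "polinux/stress",
                "command": ["stress"],
                "args": ["--vm", "1", "--vm-bytes", f"{alloc_mb}M", "--vm-hang", "1"],
                "resources": {"limits": {"memory": f"{limit_mi}Mi"}},
            }],
        },
    }


SCENARIOS: Dict[str, Callable[[str], Dict[str, Any]]] = {
    "crashloop": crash_loop_pod,
    "oomkill": oom_kill_pod,
}


def render(scenario: str, namespace: str) -> str:
    """Return the manifest as JSON, which kubectl apply accepts."""
    return json.dumps(SCENARIOS[scenario](namespace), indent=2)
```

The rendered JSON can be piped to kubectl apply -f - against a throwaway namespace created for the test, so the simulated failures stay isolated from real workloads.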

Progress made before hackathon

Existing Similar Products

We conducted extensive research on current incident response automation solutions:

1. PagerDuty Event Intelligence

  • Offers noise reduction and intelligent alerting
  • Limited to alert grouping and routing, not actual remediation
  • Requires manual intervention for fixes

2. Shoreline.io

  • Provides automated remediation for known issues
  • Requires extensive manual runbook creation
  • Limited AI capabilities for unknown scenarios

3. BigPanda AIOps

  • Focuses on alert correlation and root cause analysis
  • Lacks direct infrastructure integration for automated fixes
  • Enterprise-focused with high barrier to entry

4. K8sGPT

  • Kubernetes-specific AI analysis tool
  • Provides recommendations but no automated execution
  • Limited to cluster-level issues only

MCP Server Ecosystem Research

We evaluated available Model Context Protocol servers for integration:

Infrastructure & Monitoring:

  • kubernetes-mcp-server: Direct cluster access for pod management
  • grafana-mcp-server: Metrics and dashboard integration
  • datadog-mcp-server: Alternative monitoring solution (not implemented)

Knowledge & Documentation:

  • notion-mcp-server: Access to internal runbooks and documentation
  • github-mcp-server: Codebase context for application-specific issues

Communication:

  • slack-mcp-server: Team notifications and approval workflows
  • discord-mcp-server: Alternative communication channel

Key Differentiators Identified

1. Unified AI Brain: Unlike existing solutions that handle specific aspects, DreamOps combines alert processing, analysis, and remediation in one AI agent

2. MCP Ecosystem Leverage: First solution to fully utilize Anthropic's Model Context Protocol (MCP) for comprehensive context gathering

3. Confidence-Based Automation: Dynamic execution based on AI confidence rather than rigid rule-based systems

4. Zero-Configuration Start: Works out of the box with standard Kubernetes setups, unlike competitors that require extensive setup

5. Developer-First Design: Built for startups and small teams, not just enterprises
