On-call engineers face constant interruptions, sleep deprivation, and alert fatigue from managing production incidents around the clock. Traditional incident response relies heavily on manual intervention: engineers must wake up at 3 AM, analyze logs across multiple systems, diagnose the issue, and execute remediation steps while under pressure.
How DreamOps Solves This
Core Features:
- Automated Incident Response: AI agent automatically triages and resolves common incidents (pod crashes, OOM kills, configuration issues) without human intervention
- Intelligent Root Cause Analysis: Uses Claude AI to analyze alerts with full context from Kubernetes, logs, metrics, and documentation
- Reduced MTTR: What typically takes 30-60 minutes of manual debugging can be resolved in 2-5 minutes automatically
- Context-Aware Decisions: Integrates with GitHub, Grafana, and Notion to understand your specific infrastructure and runbooks
- Sleep Protection: Engineers can actually rest while the AI handles routine incidents, only escalating complex issues
Real-World Impact:
- Saves 2-4 hours per on-call shift
- Reduces incident resolution time by 80% for common issues
- Prevents engineer burnout from repetitive tasks
- Maintains consistent remediation quality even at 3 AM
Technical Challenges
1. MCP Integration Complexity
Integrating multiple Model Context Protocol servers (Kubernetes, GitHub, Grafana) required building a robust abstraction layer. Each MCP server had different authentication methods and response formats.
- Solution: Created a unified `MCPIntegration` base class that abstracts the differing authentication methods and response formats (sketched below).
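A minimal sketch of what such an abstraction layer can look like, assuming an async Python backend; the method names and the example `KubernetesMCP` subclass are illustrative, not the actual DreamOps API:

```python
from abc import ABC, abstractmethod
from typing import Any


class MCPIntegration(ABC):
    """Base class that hides per-server auth and response-format differences."""

    def __init__(self, name: str, server_url: str, credentials: dict[str, str]):
        self.name = name
        self.server_url = server_url
        self.credentials = credentials

    @abstractmethod
    async def connect(self) -> None:
        """Authenticate against the MCP server (token, kubeconfig, API key, ...)."""

    @abstractmethod
    async def call_tool(self, tool: str, params: dict[str, Any]) -> dict[str, Any]:
        """Invoke a server tool and normalize its response into a plain dict."""

    async def fetch_context(self, alert: dict[str, Any]) -> dict[str, Any]:
        """Default hook: integrations override this to return alert-relevant context."""
        return {}


class KubernetesMCP(MCPIntegration):
    async def connect(self) -> None:
        ...  # e.g. load kubeconfig and open the MCP session

    async def call_tool(self, tool: str, params: dict[str, Any]) -> dict[str, Any]:
        ...  # e.g. map the call onto the kubernetes-mcp-server tool set

    async def fetch_context(self, alert: dict[str, Any]) -> dict[str, Any]:
        return await self.call_tool("get_pod_logs", {"pod": alert.get("pod")})
```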
2. Safety vs Automation Balance
Implementing YOLO mode that can execute kubectl commands automatically was risky.
- Solution: Built a comprehensive risk assessment system that categorizes commands (low/medium/high risk) and only auto-executes with high confidence scores (≥0.8).
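A simplified sketch of that confidence gate. The risk tiers and the ≥0.8 threshold come from the description above; the command patterns and function names are illustrative assumptions:

```python
import re

# Illustrative pattern -> risk mapping; real classification would be richer.
HIGH_RISK = [r"\bdelete\s+(namespace|pv|crd)\b", r"\bdrain\b"]
MEDIUM_RISK = [r"\bdelete\s+pod\b", r"\brollout\s+restart\b", r"\bscale\b"]

CONFIDENCE_THRESHOLD = 0.8  # only auto-execute above this score


def classify_risk(command: str) -> str:
    if any(re.search(p, command) for p in HIGH_RISK):
        return "high"
    if any(re.search(p, command) for p in MEDIUM_RISK):
        return "medium"
    return "low"


def should_auto_execute(command: str, confidence: float) -> bool:
    """Auto-execute only low/medium risk commands backed by a high-confidence analysis."""
    if classify_risk(command) == "high":
        return False  # always escalate high-risk actions to a human
    return confidence >= CONFIDENCE_THRESHOLD


# A restart suggested with 0.92 confidence passes the gate; a namespace delete never does.
print(should_auto_execute("kubectl rollout restart deployment/api", 0.92))  # True
print(should_auto_execute("kubectl delete namespace prod", 0.99))           # False
```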
3. Real-time Webhook Processing
PagerDuty webhooks needed sub-second acknowledgment while Claude API calls take 2-4 seconds.
- Solution: Implemented async processing with immediate webhook acknowledgment and background analysis.
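One common way to implement this pattern, sketched here with FastAPI (an assumed framework; the write-up doesn't name one): return the acknowledgment immediately and run the slow Claude analysis as a background task.

```python
import asyncio
from fastapi import FastAPI, Request

app = FastAPI()


async def analyze_incident(alert: dict) -> None:
    """Slow path: gather MCP context, call Claude, decide on remediation."""
    await asyncio.sleep(3)  # placeholder for the 2-4 s Claude call
    # ... run root-cause analysis and (optionally) remediation here ...


@app.post("/webhook/pagerduty")
async def pagerduty_webhook(request: Request) -> dict:
    payload = await request.json()
    # Kick off analysis without blocking the response.
    asyncio.create_task(analyze_incident(payload))
    # Return immediately so PagerDuty gets a sub-second acknowledgment.
    return {"status": "accepted"}
```

In a production setup you would keep a reference to the task (or push the alert onto a proper queue) so work isn't lost if the process restarts mid-analysis.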
4. Context Window Management
With multiple integrations providing data, we hit Claude's token limits.
- Solution: Developed intelligent context prioritization: fetch only the logs and metrics relevant to the alert type (see the sketch below).
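A rough sketch of alert-type-driven context selection under a token budget; the alert types, source names, token limit, and `estimate_tokens` heuristic are all hypothetical placeholders:

```python
# Which context sources matter most for each alert type (highest priority first).
CONTEXT_PRIORITY = {
    "OOMKilled": ["pod_memory_metrics", "pod_logs", "deployment_spec"],
    "CrashLoopBackOff": ["pod_logs", "recent_deploys", "deployment_spec"],
    "HighLatency": ["grafana_latency_panels", "pod_cpu_metrics", "recent_deploys"],
}

MAX_CONTEXT_TOKENS = 150_000  # stay safely under the model's context window


def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token.
    return len(text) // 4


def build_context(alert_type: str, fetchers: dict) -> str:
    """Fetch sources in priority order until the token budget is exhausted."""
    parts, used = [], 0
    for source in CONTEXT_PRIORITY.get(alert_type, []):
        chunk = fetchers[source]()  # each fetcher returns raw text for that source
        cost = estimate_tokens(chunk)
        if used + cost > MAX_CONTEXT_TOKENS:
            break  # drop lower-priority data instead of overflowing the window
        parts.append(f"## {source}\n{chunk}")
        used += cost
    return "\n\n".join(parts)
```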
5. Testing Kubernetes Failures
Creating realistic test scenarios was challenging.
- Solution: Built a `fuck_kubernetes.sh` script that simulates various failure modes (CrashLoopBackOff, OOM kills, etc.) in isolated namespaces; a Python equivalent of the idea is sketched below.
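The shell script itself isn't reproduced here, but a Python sketch of the same approach (create an isolated namespace, then apply a pod that is guaranteed to crash into CrashLoopBackOff) could look like this; the namespace name and manifest are illustrative:

```python
import subprocess

CRASHLOOP_POD = """
apiVersion: v1
kind: Pod
metadata:
  name: crashloop-demo
spec:
  restartPolicy: Always
  containers:
    - name: crasher
      image: busybox
      command: ["sh", "-c", "echo simulated failure; exit 1"]
"""


def simulate_crashloop(namespace: str = "chaos-test") -> None:
    """Create an isolated namespace and a pod that enters CrashLoopBackOff."""
    subprocess.run(["kubectl", "create", "namespace", namespace], check=False)
    subprocess.run(
        ["kubectl", "apply", "-n", namespace, "-f", "-"],
        input=CRASHLOOP_POD,
        text=True,
        check=True,
    )


def cleanup(namespace: str = "chaos-test") -> None:
    """Tear the whole namespace down so the experiment stays isolated."""
    subprocess.run(["kubectl", "delete", "namespace", namespace], check=True)
```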
Existing Similar Products
We conducted extensive research on current incident response automation solutions:
1. PagerDuty Event Intelligence
- Offers noise reduction and intelligent alerting
- Limited to alert grouping and routing, not actual remediation
- Requires manual intervention for fixes
2. Shoreline.io
- Provides automated remediation for known issues
- Requires extensive manual runbook creation
- Limited AI capabilities for unknown scenarios
3. BigPanda AIOps
- Focuses on alert correlation and root cause analysis
- Lacks direct infrastructure integration for automated fixes
- Enterprise-focused with high barrier to entry
4. K8sGPT
- Kubernetes-specific AI analysis tool
- Provides recommendations but no automated execution
- Limited to cluster-level issues
MCP Server Ecosystem Research
We evaluated available Model Context Protocol servers for integration:
Infrastructure & Monitoring:
- kubernetes-mcp-server: Direct cluster access for pod management
- grafana-mcp-server: Metrics and dashboard integration
- datadog-mcp-server: Alternative monitoring solution (not implemented)
Knowledge & Documentation:
- notion-mcp-server: Access to internal runbooks and documentation
- github-mcp-server: Codebase context for application-specific issues
Communication:
- slack-mcp-server: Team notifications and approval workflows
- discord-mcp-server: Alternative communication channel
Key Differentiators Identified
1. Unified AI Brain: Unlike existing solutions that each address only one part of the workflow, DreamOps combines alert processing, analysis, and remediation in a single AI agent
2. MCP Ecosystem Leverage: First solution to fully utilize Anthropic's Model Context Protocol (MCP) for comprehensive context gathering
3. Confidence-Based Automation: Dynamic execution based on AI confidence rather than rigid rule-based systems
4. Zero-Configuration Start: Works out of the box with standard Kubernetes clusters, unlike competitors that require extensive configuration
5. Developer-First Design: Built for startups and small teams, not just enterprises