VibeCut
v0 for Video Editors
The Problem It Solves
Video editing is traditionally a time-consuming, highly technical, and often tedious process. Whether you're a content creator trying to keep up with the relentless pace of social media or a professional editor bogged down by the initial rough cut, hours are lost simply sifting through raw footage, finding the right moments, syncing clips to audio, and applying basic transitions or color grading.
VibeCut completely reimagines this workflow by turning video editing into a conversational, intent-driven experience. By abstracting away the complex mechanics of traditional NLEs (Non-Linear Editors) behind a team of specialized AI agents, VibeCut gives users their time back and lets them focus purely on creative storytelling.
What it can be used for:
- Rapid Social Media Content Creation: Creators can upload a folder of raw b-roll, provide a simple prompt (e.g., "Create a high-energy travel reel using this script..."), and VibeCut will automatically select the best clips, generate text overlays, and match the cuts to the music.
- Automated Rough Cuts: Video editors and filmmakers can use it to instantly generate a baseline timeline from hours of raw footage, saving them from the grueling, repetitive task of initial clip sorting and assembly.
- Accessible Storytelling: Individuals or small businesses with zero video editing experience or technical knowledge can produce high-quality, engaging promotional videos just by describing the "vibe" they want to achieve.
How it makes existing tasks easier and incredibly fast:
- Intelligent Media Analysis: Instead of manually scrubbing through hours of clips, VibeCut's multi-agent system uses Vision and Speech AI to automatically analyze, transcribe, and tag your footage based on sentiment, action, and visual content.
- Context-Aware Assembly: The `Edit Planner` and `Preset Intelligence` agents translate your natural language prompt into precise timeline operations. They automatically apply the correct color profiles, motion effects, and typography that fit the exact mood of your request.
- Real-Time, Conversational Iteration: Making changes is as simple as chatting with the AI. Ask the orchestrator to "make the pacing punchier" or "swap the second clip with something more cinematic," and the React-based timeline updates in real time.
- Multi-Format Generation: With built-in aspect ratio intelligence, repurposing content for different platforms (e.g., converting a 16:9 YouTube video into a 9:16 TikTok/Reel) is handled natively by the agentic pipeline without the need for manual reframing.
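The aspect-ratio conversion above reduces, in its simplest form, to crop math. A minimal sketch, assuming a plain center crop (the `reframe_9_16` helper is hypothetical; an agentic pipeline would presumably reframe around the detected subject rather than the center):

```python
def reframe_9_16(src_w, src_h):
    """Compute a center crop converting any frame to 9:16 portrait.

    Returns (crop_w, crop_h, x_offset, y_offset) suitable for a
    crop filter: keep full height, trim the width symmetrically.
    """
    crop_w = int(src_h * 9 / 16)        # width implied by 9:16 at full height
    x = (src_w - crop_w) // 2           # center the crop horizontally
    return crop_w, src_h, x, 0

# A 16:9 1080p frame becomes a 607x1080 portrait crop offset 656px in.
w, h, x, y = reframe_9_16(1920, 1080)   # -> (607, 1080, 656, 0)
```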
Architecture
VibeCut is built on a Python/FastAPI backend and a React frontend connected over WebSocket for real-time updates. The core of the system is a 10-agent agentic pipeline powered by Google Gemini, where each agent is a specialist responsible for one stage of the editing process, from media ingestion and scene analysis all the way to music selection and final FFmpeg export.
The user's natural language prompt enters the Orchestrator, which coordinates the full pipeline. Media is first analyzed by Ingestion, Speech, and Vision agents to extract metadata, transcripts, and visual context. This indexed context flows into the Clip Retrieval agent for semantic search. The Edit Planner then converts the user's intent into precise timeline operations, collaborating with the Preset Intelligence, Music, and Gen Media agents before handing off to the Editing Execution agent for the final render.
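The stage ordering described above can be sketched as a minimal orchestrator loop. This is illustrative only: the stage names follow the text, but the function bodies are stubs (in VibeCut each stage would call Gemini and real media tooling):

```python
def run_pipeline(prompt, media_paths):
    """Sketch of the orchestrator: analysis -> retrieval -> planning
    -> enrichment, mirroring the stage order described above."""
    context = {"prompt": prompt, "media": media_paths}

    # 1. Ingestion / Speech / Vision agents index the raw footage (stubbed).
    for stage in ("ingestion", "speech", "vision"):
        context[stage] = f"{stage} metadata for {len(media_paths)} file(s)"

    # 2. Clip Retrieval + Edit Planner turn intent into timeline operations.
    context["clips"] = list(media_paths)
    context["plan"] = [{"op": "add_clip", "src": p} for p in context["clips"]]

    # 3. Preset Intelligence and Music agents enrich the plan (stubbed),
    #    then Editing Execution would hand the result to FFmpeg.
    context["preset"] = "travel-reel"
    context["music"] = "upbeat.mp3"
    return {"operations": context["plan"], "preset": context["preset"]}

result = run_pipeline("high-energy travel reel", ["beach.mp4", "food.mp4"])
```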

Example Workflow
- Open the editor.
- Upload video clips or set `folder_path` when creating a project.
- Type: Create a high-energy travel reel using this script: "Welcome to paradise. The beaches are stunning. The food is incredible. Let's explore together."
- Watch the timeline assemble live with clips, text overlays, color presets, and music.
Challenges We Ran Into
Building an autonomous, multi-agent video editing pipeline introduced several complex technical hurdles, particularly around state management and real-time execution.
1. Orchestrating a 10-Agent Pipeline
The Bug/Hurdle: Initially, agents were executing sequentially and passing massive JSON objects back and forth. This caused severe latency (sometimes taking 3-4 minutes to process a simple prompt) and increased the likelihood of the LLM context window overflowing or generating hallucinated timeline operations.
The Fix: We completely refactored the pipeline. Instead of passing massive state objects, we implemented a centralized `Project State` model. Agents now operate asynchronously, fetching only the metadata they need (from a structured Vector DB/Clip Registry for retrieval) and yielding operational intents rather than full timeline JSONs. The `Edit Planner` then synthesizes these intents into strict, atomic operations.
2. Handling Explicit vs. Implicit Instructions
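The intents-then-synthesis split might look like the following minimal sketch. The `Intent` type and field names are assumptions for illustration, not the actual VibeCut schema:

```python
from dataclasses import dataclass, field

@dataclass
class Intent:
    """Hypothetical operational intent: an agent states *what* it
    wants done, not the full timeline JSON."""
    kind: str               # e.g. "place_clip", "apply_preset"
    target: str             # clip path or preset name
    params: dict = field(default_factory=dict)

def synthesize(intents):
    """Edit-Planner-style step: flatten loose intents into strictly
    ordered, atomic timeline operations."""
    ops = []
    for seq, intent in enumerate(intents):
        ops.append({"seq": seq, "op": intent.kind,
                    "target": intent.target, **intent.params})
    return ops

ops = synthesize([Intent("place_clip", "beach.mp4", {"start_ms": 0}),
                  Intent("apply_preset", "travel")])
```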
The Bug/Hurdle: When users provided highly specific commands (e.g., "Put clip A at 0:00 and clip B at 5:00"), the system would still try to intelligently match scenes based on emotion or action, completely ignoring the user's explicit timing requests.
The Fix: We implemented a dual-path processing model in the Orchestrator. The system now parses the prompt to separate structural commands (explicit timings, specific file requests) from narrative intent (pacing, mood, style). Structural operations are processed immediately and locked in the timeline, while the narrative agents (like the Preset Intelligence agent) fill in the rest around those locked constraints.
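The dual-path split can be sketched as a simple classifier over commands. This is illustrative only: a regex stands in for the LLM parsing the real Orchestrator performs, and `split_prompt` is a hypothetical helper:

```python
import re

def split_prompt(commands):
    """Separate structural commands (explicit timings, specific files)
    from narrative intent (pacing, mood, style)."""
    structural, narrative = [], []
    timing = re.compile(r"\b\d+:\d{2}\b")   # e.g. "0:00", "5:00"
    for cmd in commands:
        if timing.search(cmd) or ".mp4" in cmd:
            structural.append(cmd)   # locked into the timeline first
        else:
            narrative.append(cmd)    # filled in around the locks
    return structural, narrative

s, n = split_prompt(["Put clipA.mp4 at 0:00", "make it feel cinematic"])
```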
3. Real-Time WebSocket Synchronization
The Bug/Hurdle: The frontend React timeline would often glitch or desync when the backend sent complex timeline updates, especially when multiple overlapping clips and text overlays were generated simultaneously.
The Fix: We implemented a differential update system over the WebSocket. Instead of sending the entire updated timeline state every time an agent made a change, the backend now emits specific event payloads (e.g., `operation_added`, `clip_trimmed`). The React frontend uses a structured reducer to apply these atomic changes, ensuring smooth UI updates without fully re-rendering the timeline component.
4. FFmpeg Media Alignment
The Bug/Hurdle: Aligning audio beats with video cuts dynamically using FFmpeg was a nightmare. Small floating-point precision errors in timestamps resulted in visually jarring cuts that were a few frames off the beat.
The Fix: We moved away from relying purely on FFmpeg for the intermediate timeline intelligence. We built a Python-based `engine` module that handles all the heavy lifting of calculating precise trim, offset, and duration math (in milliseconds) before ever touching FFmpeg. FFmpeg is now strictly an execution layer at the end of the pipeline.
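The millisecond-precision idea can be sketched like this: work on an integer-millisecond beat grid, and only convert to fractional seconds at the very edge when building the FFmpeg arguments. These helpers are hypothetical, not the actual `engine` API:

```python
def snap_cut_to_beat(cut_ms, beat_times_ms):
    """Snap a proposed cut point to the nearest detected beat,
    in integer milliseconds to avoid floating-point drift."""
    return min(beat_times_ms, key=lambda b: abs(b - cut_ms))

def to_ffmpeg_seconds(ms):
    """Convert to seconds only when emitting the FFmpeg command
    (e.g. the value passed to -ss)."""
    return f"{ms / 1000:.3f}"

beats = [0, 480, 960, 1440]            # e.g. a 125 BPM beat grid
cut = snap_cut_to_beat(1010, beats)    # -> 960
arg = to_ffmpeg_seconds(cut)           # -> "0.960"
```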
Tracks Applied: Gemini API (Major League Hacking)