Kahani
Immersive Audio-Stories for Bharat
Created on 21st June 2025
The problem Kahani solves
Kahani: AI-Native, Creator-First Audio Storytelling
Kahani is the first true AI-native, creator-first audio storytelling platform for India.
We help creators go from text → full, regional-language audio stories — in Hindi, Punjabi, English, Marathi, and Telugu — in seconds, unlocking the next chapter of India’s audio economy.
The Opportunity
India’s audio storytelling market has grown rapidly — fueled by platforms like PocketFM (a portfolio company of Lightspeed Ventures).
However, content creation is still locked into studio-heavy, curated production.
Millions of creators across India, especially in Tier-2 and Tier-3 cities, don’t have access to scalable tools that let them produce high-quality audio in their own language, at their own pace.
Why Now?
Two key shifts make this the perfect time:
- AI for regional languages is finally production-grade — powered by players like SarvamAI.
- India’s creator economy is booming — with 200M+ creators, mostly text- and video-first.
Audio remains an untapped layer that more and more creators, especially from Tier-2/3 cities, are eager to explore.
Our Solution
Kahani solves this by being an AI-native, creator-first tool where anyone can turn a simple text prompt into a 2–10 minute rich audio story — all in their own language — with no studio, no mic, and no wait.
It’s like “Midjourney for audio”, already live in 5 languages and built to scale across India.
Impact
With Kahani:
- Audio storytelling becomes accessible to all, removing technical, cost, and time barriers.
- Creators can easily generate and iterate on stories in their native language.
Market context:
- The audio OTT market is projected to grow at a 34% CAGR, crossing $1.5B by 2027.
- India's 200M+ monthly active audio consumers represent a massive opportunity.
- Kahani is well-positioned both to build an independent platform and to complement existing players, creating new synergies across the ecosystem.
Reference:
Grand View Research: Audio Streaming Market Outlook, India
Challenges we ran into
Core Technical Challenges
Building this system introduced several complex challenges, especially around multi-agent coordination and the current capabilities of the voice models we employed.
1. Extensive Prompt Engineering
A significant part of our process involved deep prompt refinement to enable seamless multi-agent collaboration.
Key Details:
- Each agent (e.g., narrator, character 1, character 2, etc.) requires very carefully constructed prompts.
- Poorly tuned prompts can cause agents to produce dialogues that feel disjointed or off-tone.
- The refinement cycle involved several iterations, spanning multiple hours of trial and error.
- Achieving prompt accuracy was crucial for allowing agents to align properly on context and narrative style.
Outcome:
By investing the time to refine these prompts, we ensured that each agent could “stay in character” and produce high-quality output that enhances the listener’s experience.
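As a minimal sketch of the per-agent prompt scaffolding described above, role-specific system prompts can be kept as templates and rendered per story. The template wording, role names, and `build_agent_prompt` helper here are illustrative assumptions, not Kahani's actual prompts:

```python
# Hypothetical per-agent prompt templates (illustrative, not Kahani's real prompts).
ROLE_TEMPLATES = {
    "narrator": (
        "You are the narrator of a {language} audio story. "
        "Keep a warm, steady tone and never speak a character's dialogue lines."
    ),
    "character": (
        "You are {name}, a character in a {language} audio story. "
        "Stay strictly in character: {persona}. Speak only your own lines."
    ),
}

def build_agent_prompt(role: str, language: str, **fields) -> str:
    """Render a role-specific system prompt for one agent."""
    return ROLE_TEMPLATES[role].format(language=language, **fields)

# One prompt per agent keeps context and narrative style aligned across roles.
narrator_prompt = build_agent_prompt("narrator", "Hindi")
hero_prompt = build_agent_prompt(
    "character", "Hindi", name="Arjun", persona="a shy village teacher"
)
```

Keeping the shared fields (language, tone rules) in one template set is one way to ensure all agents receive consistent framing, which is what the refinement cycle above was converging toward.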
2. Sequential Multi-Agent Execution & Context Limits
Our design adopted a sequential multi-agent architecture, where one agent passes its output as context to the next. However, this led to serious context-window constraints.
Challenges:
- Every agent call consumed valuable context length, quickly pushing us toward token limits.
- Output quality degraded with each step: the longer the chain ran, the more earlier context was lost.
Solution Strategies:
- Developed novel context management techniques to minimize token usage without losing key story details.
- Implemented parallelization where possible, allowing some agents to work concurrently and then merge their contributions.
- These optimizations preserved both consistency of output and processing speed, even as the dialogues grew in length and complexity.
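The context-trimming idea above can be sketched as a budget-aware window over the conversation: always keep the story premise, then keep as many recent turns as fit. This is a simplified illustration under the assumption that token count can be approximated by word count; it is not Kahani's actual implementation:

```python
def trim_context(turns: list[str], budget: int) -> list[str]:
    """Keep the first turn (story premise) plus the most recent turns whose
    combined approximate token count (here: word count) fits the budget."""
    premise, rest = turns[0], turns[1:]
    kept: list[str] = []
    used = len(premise.split())  # the premise is always retained
    for turn in reversed(rest):  # walk backwards from the newest turn
        cost = len(turn.split())
        if used + cost > budget:
            break  # older turns no longer fit the window
        kept.append(turn)
        used += cost
    return [premise] + list(reversed(kept))

history = ["Premise: a lost letter", "Agent A's scene", "Agent B's reply", "Narration"]
trimmed = trim_context(history, budget=10)
```

A production version would use the model's real tokenizer and could summarize dropped turns instead of discarding them, but the shape of the optimization is the same.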
3. Expressiveness in Synthetic Voices
Our use case — audio entertainment — places a premium on expressive and dynamic speech. However, our chosen TTS model posed some hurdles.
Current TTS Model:
- Sarvam's Bulbul TTS model was originally tuned for voice-agent and customer-support scenarios.
- Its speech style is fairly flat, lacking the rich expressiveness we want for entertainment use cases.
Workarounds Implemented:
- Introduced explicit prompts for pacing, pitch, and loudness to mimic a more dramatic and emotional voice.
- Tuned dialogues to add pauses, emphasis, and tonal variations.
- This yielded a subtle but meaningful improvement in perceived expressiveness.
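The workarounds above can be illustrated as a small pre-processing pass that stretches sentence boundaries into pauses and attaches mood-based delivery hints to each line before the TTS call. The `pace`/`pitch`/`loudness` parameter names and preset values here are assumptions for illustration, not Sarvam's actual API:

```python
import re

# Hypothetical delivery presets (values and keys are illustrative assumptions).
PRESETS = {
    "tense":  {"pace": 0.85, "pitch": -0.1, "loudness": 1.1},
    "joyful": {"pace": 1.10, "pitch": 0.15, "loudness": 1.0},
    "calm":   {"pace": 0.95, "pitch": 0.0,  "loudness": 0.9},
}

def dramatize(line: str, mood: str) -> dict:
    """Return the line plus per-utterance delivery hints for a TTS request."""
    # Insert a short pause marker after each sentence boundary.
    text = re.sub(r"([.!?])\s+", r"\1 ... ", line.strip())
    return {"text": text, **PRESETS.get(mood, PRESETS["calm"])}
```

Even with a flat base voice, this kind of per-line pacing and pause injection is what produced the subtle but meaningful improvement in perceived expressiveness noted above.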
Future Plans:
- Discussed these requirements with the Sarvam team, who acknowledged this gap.
- They assured us that future models will better support expressive audio for creative industries — a very promising direction for Kahani.
Progress made before hackathon
The team was formed at the venue, so no preparatory work was done before the hackathon. Each team member studied the tracks listed on the Devfolio page. No part of the project was built outside the venue.
Tracks Applied (2)
Sarvam AI Track
Sarvam.ai
Google Cloud Platform Usage
Google Cloud Platform
Technologies used