Kahani
Immersive Audio-Stories for Bharat
Created on 21st June 2025
The problem Kahani solves
Kahani: AI-Native, Creator-First Audio Storytelling
Kahani is the first true AI-native, creator-first audio storytelling platform for India.
We help creators go from text → full, regional-language audio stories — in Hindi, Punjabi, English, Marathi, and Telugu — in seconds, unlocking the next chapter of India’s audio economy.
The Opportunity
India’s audio storytelling market has grown rapidly — fueled by platforms like PocketFM (a portfolio company of Lightspeed Ventures).
However, content creation is still locked into studio-heavy, curated production.
Millions of creators across India, especially in Tier-2 and Tier-3 cities, don’t have access to scalable tools that let them produce high-quality audio in their own language, at their own pace.
Why Now?
Two key shifts make this the perfect time:
- AI for regional languages is finally production-grade — powered by players like SarvamAI.
- India’s creator economy is booming — with 200M+ creators, mostly text- and video-first.
Audio remains an untapped layer that more and more creators, especially from Tier-2/3 cities, are eager to explore.
Our Solution
Kahani solves this by being an AI-native, creator-first tool where anyone can turn a simple text prompt into a 2–10 minute rich audio story — all in their own language — with no studio, no mic, and no wait.
It’s like “Midjourney for audio”, already live in 5 languages and built to scale across India.
Impact
With Kahani:
- Audio storytelling becomes accessible to all, removing technical, cost, and time barriers.
- Creators can easily generate and iterate on stories in their native language.
Market context:
- The audio OTT market is projected to grow at a 34% CAGR, crossing $1.5B by 2027.
- India's 200M+ monthly active audio consumers represent a massive opportunity.
- Kahani is well-positioned both to build an independent platform and to complement existing players, creating new synergies across the ecosystem.
Reference:
Grand View Research: Audio Streaming Market Outlook, India
Challenges we ran into
Core Technical Challenges
Building this system introduced several complex challenges, especially around multi-agent coordination and the current capabilities of the voice models we employed.
1. Extensive Prompt Engineering
A significant part of our process involved deep prompt refinement to enable seamless multi-agent collaboration.
Key Details:
- Each agent (e.g., narrator, character 1, character 2, etc.) requires very carefully constructed prompts.
- Poorly tuned prompts can cause agents to produce dialogues that feel disjointed or off-tone.
- The refinement cycle involved several iterations, spanning multiple hours of trial and error.
- Achieving prompt accuracy was crucial for allowing agents to align properly on context and narrative style.
Outcome:
By investing the time to refine these prompts, we ensured that each agent could “stay in character” and produce high-quality output that enhances the listener’s experience.
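As a minimal sketch of the per-agent prompt scaffolding described above, role-specific system prompts can be kept as templates and rendered per story. The template wording, role names, and `build_agent_prompt` helper here are illustrative assumptions, not Kahani's actual prompts:

```python
# Hypothetical per-agent prompt templates (illustrative, not Kahani's real prompts).
ROLE_TEMPLATES = {
    "narrator": (
        "You are the narrator of a {language} audio story. "
        "Keep a warm, steady tone and never speak a character's dialogue lines."
    ),
    "character": (
        "You are {name}, a character in a {language} audio story. "
        "Stay strictly in character: {persona}. Speak only your own lines."
    ),
}

def build_agent_prompt(role: str, language: str, **fields) -> str:
    """Render a role-specific system prompt for one agent."""
    return ROLE_TEMPLATES[role].format(language=language, **fields)

# One prompt per agent keeps context and narrative style aligned across roles.
narrator_prompt = build_agent_prompt("narrator", "Hindi")
hero_prompt = build_agent_prompt(
    "character", "Hindi", name="Arjun", persona="a shy village teacher"
)
```

Keeping the shared fields (language, tone rules) in one template set is one way to ensure all agents receive consistent framing, which is what the refinement cycle above was converging toward.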
2. Sequential Multi-Agent Execution & Context Limits
Our design adopted a sequential multi-agent architecture, where one agent passes its output as context to the next. However, this led to serious context-window constraints.
Challenges:
- Every agent call consumed valuable context length, quickly pushing us toward token limits.
- Output quality degraded with each step: the longer the chain ran, the more earlier context was lost.
Solution Strategies:
- Developed novel context management techniques to minimize token usage without losing key story details.
- Implemented parallelization where possible, allowing some agents to work concurrently and then merge their contributions.
- These optimizations preserved both consistency of output and processing speed, even as the dialogues grew in length and complexity.
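The context-trimming idea above can be sketched as a budget-aware window over the conversation: always keep the story premise, then keep as many recent turns as fit. This is a simplified illustration under the assumption that token count can be approximated by word count; it is not Kahani's actual implementation:

```python
def trim_context(turns: list[str], budget: int) -> list[str]:
    """Keep the first turn (story premise) plus the most recent turns whose
    combined approximate token count (here: word count) fits the budget."""
    premise, rest = turns[0], turns[1:]
    kept: list[str] = []
    used = len(premise.split())  # the premise is always retained
    for turn in reversed(rest):  # walk backwards from the newest turn
        cost = len(turn.split())
        if used + cost > budget:
            break  # older turns no longer fit the window
        kept.append(turn)
        used += cost
    return [premise] + list(reversed(kept))

history = ["Premise: a lost letter", "Agent A's scene", "Agent B's reply", "Narration"]
trimmed = trim_context(history, budget=10)
```

A production version would use the model's real tokenizer and could summarize dropped turns instead of discarding them, but the shape of the optimization is the same.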
3. Expressiveness in Synthetic Voices
Our use case — audio entertainment — places a premium on expressive and dynamic speech. However, our chosen TTS model posed some hurdles.
Current TTS Model:
- Sarvam's Bulbul TTS model was originally tuned for voice-agent and customer-support scenarios.
- Its speech style is fairly flat, lacking the rich expressiveness we want for entertainment use cases.
Workarounds Implemented:
- Introduced explicit prompts for pacing, pitch, and loudness to mimic a more dramatic and emotional voice.
- Tuned dialogues to add pauses, emphasis, and tonal variations.
- This yielded a subtle but meaningful improvement in perceived expressiveness.
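The workarounds above can be illustrated as a small pre-processing pass that stretches sentence boundaries into pauses and attaches mood-based delivery hints to each line before the TTS call. The `pace`/`pitch`/`loudness` parameter names and preset values here are assumptions for illustration, not Sarvam's actual API:

```python
import re

# Hypothetical delivery presets (values and keys are illustrative assumptions).
PRESETS = {
    "tense":  {"pace": 0.85, "pitch": -0.1, "loudness": 1.1},
    "joyful": {"pace": 1.10, "pitch": 0.15, "loudness": 1.0},
    "calm":   {"pace": 0.95, "pitch": 0.0,  "loudness": 0.9},
}

def dramatize(line: str, mood: str) -> dict:
    """Return the line plus per-utterance delivery hints for a TTS request."""
    # Insert a short pause marker after each sentence boundary.
    text = re.sub(r"([.!?])\s+", r"\1 ... ", line.strip())
    return {"text": text, **PRESETS.get(mood, PRESETS["calm"])}
```

Even with a flat base voice, this kind of per-line pacing and pause injection is what produced the subtle but meaningful improvement in perceived expressiveness noted above.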
Future Plans:
- Discussed these requirements with the Sarvam team, who acknowledged this gap.
- They assured us that future models will better support expressive audio for creative industries — a very promising direction for Kahani.
Progress made before hackathon
The team was formed at the venue, so no preparatory work was done before the hackathon. Each team member studied the tracks listed on the Devfolio page. No part of the project was built outside the venue.
Tracks Applied (2)
Sarvam AI Track
Sarvam.ai
Google Cloud Platform Usage
Google Cloud Platform
Technologies used