The problem CR3SC3NDO solves
Are you a music addict? Of course you are; who isn't? Remember being hooked on music visualisers that stimulate not only your sense of hearing but also your sense of sight?
Music visualisers have been a familiar sight on computers for a while. They are essentially animated imagery synchronised with whatever musical selection is being played, changing with every song.
A great variety of instruments come together to create the 3-minute song you vibe to. But have you ever wished to absorb every independent piece of the song, individually, all at the same time?
No existing music visualiser animates on the basis of the individual instruments that make up a song, so it never conveys the role each instrument plays in shaping the song's true essence. One reason is that composers rarely make their individual instrument tracks publicly accessible; another is that separating instrument tracks from a finished mix is a state-of-the-art ML problem, and animating them by hand is a tedious task. This is where we come in and automate the entire process.
On top of all this fun, what if the whole setup went up a dimension? A VR-based implementation of such an audio visualiser would be nothing short of a musical concert. Plenty of VR videos exist where objects move around the space, either randomly or along very precise paths, but a virtual space where everything animates to the instruments of your music makes you feel like the song has come to life.
Challenges we ran into
- For intuitive music visualisation, we wanted to animate individual instruments from a song. But stem files (per-instrument audio tracks) aren't available for most songs. To solve this, we used a state-of-the-art Machine Learning model that separates a song into its constituent instrument stems (see the first sketch after this list).
- The generated instrument stems are not very clean, which later caused stuttering in the animations. We solved this by resampling the audio and calculating root-mean-square (RMS) amplitude values for every 1/60th of a second (see the second sketch after this list).
- The next challenge was generating a video whose animations react to the song. In almost every graphics library we tried, audio and animation drifted out of sync, because video typically runs at about 60 frames per second while audio files are sampled at about 44.1 kHz. We solved this by resampling the audio to the target frame rate and rendering the video frame by frame, using the Processing language.
- Processing does not come with a GUI, which is essential for building 360-degree music video templates. This led us to switch to Blender for the 360-degree videos and drive the animations with Python scripts inside Blender (see the Blender sketch after this list).
- Every task in our workflow is computationally intensive, taking about 10 minutes per song to process, which called for a more complex backend for our website.
- Rendering video is a computationally intensive task that also demands high throughput, so our backend is hosted on a GPU-powered virtual machine.
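
A minimal sketch of the stem-separation step, assuming Spleeter as the source-separation model (the writeup does not name the exact model we used; the file names are illustrative):

```python
# Separate a mixed song into per-instrument stems with Spleeter.
from spleeter.separator import Separator

# The '4stems' configuration splits a mix into vocals, drums, bass and other.
separator = Separator('spleeter:4stems')

# Writes one WAV file per stem, e.g. output/song/drums.wav.
separator.separate_to_file('song.mp3', 'output/')
```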
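A sketch of turning a stem into one RMS amplitude value per video frame so the animation stays in sync with the audio, assuming librosa; the file names and the 60 fps target are illustrative:

```python
# Compute one smoothed amplitude value per rendered video frame.
import librosa
import numpy as np

FPS = 60                                    # target video frame rate
y, sr = librosa.load('output/song/drums.wav', sr=44100, mono=True)

hop = sr // FPS                             # audio samples per video frame (~735 at 44.1 kHz)
rms = librosa.feature.rms(y=y, frame_length=hop * 2, hop_length=hop)[0]

# Normalise to 0..1 so the values can directly drive an animation parameter.
rms = rms / (rms.max() + 1e-9)
np.save('drums_rms.npy', rms)               # one value per rendered frame
```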
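And a sketch of how those per-frame values can drive an animation from a Python script inside Blender; the object name 'DrumSphere' and the file path are hypothetical:

```python
# Run inside Blender: keyframe an object's scale from the per-frame RMS values.
import bpy
import numpy as np

rms = np.load('/path/to/drums_rms.npy')     # illustrative path
obj = bpy.data.objects['DrumSphere']        # hypothetical scene object

scene = bpy.context.scene
scene.render.fps = 60
scene.frame_end = len(rms)

for frame, amp in enumerate(rms, start=1):
    s = 1.0 + 2.0 * float(amp)              # louder frame -> bigger sphere
    obj.scale = (s, s, s)
    obj.keyframe_insert(data_path="scale", frame=frame)
```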