Created on 20th June 2022
Speech activity and robocall detection using state-of-the-art supervised learning.
Problem Statement: The Dial 112 system receives a large number of calls every day, and over 95 percent of them are blank, system-generated, or spoofed. The solution should identify these non-productive calls and stop them at the IVRS.
I am using a technique called Voice Activity Detection (VAD) to screen each call for human voice and filter out system-generated calls and calls made by mistake by user applications such as emergency dialers.
We can screen the first 5 seconds of every call to decide whether it is genuine, and only then pass it to the police operator. I've also added a feature that extracts the exact timestamps in the audio where speech is detected. Try it out here.
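The screening step and the timestamp extraction can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: a simple RMS-energy check stands in for the retrained Silero model, and the sampling rate, frame length, thresholds, and function names (`frame_is_speech`, `speech_timestamps`, `screen_call`) are all assumptions made for the example.

```python
import math

SAMPLE_RATE = 16000   # assumed sampling rate in Hz
FRAME_MS = 30         # assumed frame length in milliseconds
SCREEN_SECONDS = 5    # screening window described in the text


def frame_is_speech(frame, threshold=0.02):
    """Stand-in classifier: treat a frame as speech if its RMS energy
    exceeds a threshold. The real system would call the VAD model here."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms > threshold


def speech_timestamps(samples, sample_rate=SAMPLE_RATE, frame_ms=FRAME_MS,
                      is_speech=frame_is_speech):
    """Return (start_s, end_s) tuples for consecutive runs of speech frames."""
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    segments = []
    start = None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        t = i * frame_len / sample_rate
        if is_speech(frame):
            if start is None:
                start = t          # a speech run begins at this frame
        elif start is not None:
            segments.append((start, t))  # the run ended at this frame
            start = None
    if start is not None:          # speech continued to the end of the audio
        segments.append((start, n_frames * frame_len / sample_rate))
    return segments


def screen_call(samples, sample_rate=SAMPLE_RATE, min_speech_s=0.3):
    """Inspect only the first SCREEN_SECONDS of audio and decide whether
    enough speech was found to forward the call (hypothetical rule)."""
    window = samples[:SCREEN_SECONDS * sample_rate]
    segments = speech_timestamps(window, sample_rate)
    total = sum(end - start for start, end in segments)
    return total >= min_speech_s, segments
```

Calling `screen_call(samples)` on a list of float samples in the range -1.0 to 1.0 returns a forward/drop decision together with the detected speech segments. Swapping `frame_is_speech` for a model-based classifier would keep the rest of the logic unchanged.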
VADs are already widely used in the call-center industry to increase agent productivity. With accurately tuned VAD settings, a dialer can tell whether a human or a machine answered the call and, if it was a person, transfer the call to an available agent.
I extended an existing state-of-the-art VAD model (Silero) to better detect Indian speech, using a dataset generously provided by IEEE DataPort.
I cleaned, standardised and organised the data (uploaded here) and used PyTorch and torchaudio to retrain the existing Silero JIT model. (I took some help from the Silero author to better understand the dataset the original model was trained on.)