Speech Enhancement and Metrics in TTS systems

This is a research project that aims to remove and quantify noise in TTS systems.

Created on 6th February 2022

The problem Speech Enhancement and Metrics in TTS systems solves

Problem Statement

The future of CX is with Voice AI. Text-To-Speech (TTS) systems, Skit's included, tend to mix some ambient noise into the speech they output. The aim of this research project was to remove that noise and to quantify, using standard metrics, how well it has been removed.

Implementations

Filters

Speech enhancement can be done with traditional signal processing techniques or with deep learning techniques. We hypothesised that signal processing techniques would be suitable for the task and tested them. I implemented the Wiener filter, the Kalman filter, the Minimum Mean Square Error (MMSE) filter, the Minimum Mean Square Log Error filter and Spectral Subtraction (Oversubtraction).
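To make one of these filters concrete, here is a minimal sketch of spectral subtraction with oversubtraction. It is not the project's implementation: the function names, the parameters `alpha` (oversubtraction factor) and `beta` (spectral floor), the frame sizes, and the assumption that a noise-only clip `noise_clip` is available are all choices made for illustration.

```python
import numpy as np

def _frame(x, frame_len, hop, window):
    """Cut x into overlapping, windowed frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] * window
                     for i in range(n_frames)])

def spectral_subtraction(noisy, noise_clip, frame_len=256, hop=128,
                         alpha=2.0, beta=0.01):
    """Power spectral subtraction with oversubtraction (illustrative sketch).

    Assumes noise_clip is a noise-only recording of at least frame_len
    samples, and that hop divides frame_len.
    """
    noisy = np.asarray(noisy, dtype=float)
    noise_clip = np.asarray(noise_clip, dtype=float)
    window = np.hanning(frame_len)

    # Average noise power spectrum, estimated from the noise-only clip.
    noise_spec = np.fft.rfft(_frame(noise_clip, frame_len, hop, window), axis=1)
    noise_power = (np.abs(noise_spec) ** 2).mean(axis=0)

    # Zero-pad so every original sample is covered by overlapping frames.
    extra = (-len(noisy)) % hop
    x = np.pad(noisy, (frame_len, frame_len + extra))
    spec = np.fft.rfft(_frame(x, frame_len, hop, window), axis=1)
    power = np.abs(spec) ** 2

    # Subtract alpha times the noise power, but never go below the floor.
    clean_power = np.maximum(power - alpha * noise_power, beta * noise_power)

    # Reuse the noisy phase and return to the time domain.
    clean_spec = np.sqrt(clean_power) * np.exp(1j * np.angle(spec))
    frames = np.fft.irfft(clean_spec, n=frame_len, axis=1) * window

    # Weighted overlap-add, normalised by the summed squared window.
    out = np.zeros(len(x))
    norm = np.zeros(len(x))
    for i, frame in enumerate(frames):
        start = i * hop
        out[start:start + frame_len] += frame
        norm[start:start + frame_len] += window ** 2
    out /= np.maximum(norm, 1e-8)
    return out[frame_len:frame_len + len(noisy)]
```

Setting `alpha` above 1 removes more than the average noise power, which suppresses more residual noise at the cost of more speech distortion; `beta` keeps a small noise floor so that bins are not zeroed out completely.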

Metrics

To find out which of these filters works 'best', we first need to define 'best'. For this, I tested several existing metrics, such as Perceptual Evaluation of Speech Quality (PESQ, narrowband and wideband) and Short-Time Objective Intelligibility (STOI), and implemented several others from scratch: F0 Frame Error (FFE), Gross Pitch Error (GPE), Mel Cepstral Distortion (MCD, both versions), Voicing Decision Error (VDE), Mean Speech Distortion (MSD), pitch tracking and Word Error Rate (WER).
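As an illustration of the pitch-based metrics, the sketch below computes Gross Pitch Error, the voicing-decision error rate and F0 Frame Error from two frame-aligned F0 tracks. It follows the commonly used definitions rather than the project's code; the function name `pitch_errors`, the convention that an F0 of 0 marks an unvoiced frame, and the 20% tolerance are assumptions.

```python
def pitch_errors(ref_f0, est_f0, tol=0.2):
    """Compare two frame-aligned F0 tracks (Hz, 0 meaning unvoiced).

    Returns a dict with:
      gpe: share of frames voiced in both tracks whose F0 differs by
           more than tol (relative to the reference)
      vde: share of all frames where the voiced/unvoiced decision differs
      ffe: share of all frames with either kind of error
    """
    if len(ref_f0) != len(est_f0) or len(ref_f0) == 0:
        raise ValueError("F0 tracks must be non-empty and of equal length")

    n = len(ref_f0)
    voicing_errors = 0
    both_voiced = 0
    gross_errors = 0
    for ref, est in zip(ref_f0, est_f0):
        ref_voiced, est_voiced = ref > 0, est > 0
        if ref_voiced != est_voiced:
            voicing_errors += 1
        elif ref_voiced:
            both_voiced += 1
            if abs(est - ref) > tol * ref:
                gross_errors += 1

    return {
        "gpe": gross_errors / both_voiced if both_voiced else 0.0,
        "vde": voicing_errors / n,
        "ffe": (gross_errors + voicing_errors) / n,
    }
```

In practice the two tracks would come from running the same pitch tracker on the reference and the enhanced audio with identical frame settings, so that frames line up one to one.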

Results

We found that the Wiener filter and the Kalman filter perform best on the NOIZEUS dataset, each outperforming the other at different signal-to-noise ratios (SNRs). However, they do not perform as well as we would like on the TTS data. This is because the noise in the TTS dataset is very subtle, unlike the noise in the NOIZEUS dataset.
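For readers unfamiliar with how SNR conditions are defined, here is a small sketch of measuring SNR in dB and of mixing a noise clip into clean speech at a chosen SNR. It assumes additive noise and equal-length signals; the helper names `snr_db` and `mix_at_snr` are hypothetical and are not taken from the project or from the dataset's tooling.

```python
import numpy as np

def snr_db(clean, noisy):
    """SNR in dB, treating (noisy - clean) as the noise component."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noisy, dtype=float) - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

def mix_at_snr(clean, noise, target_db):
    """Scale noise so that clean + noise has the requested SNR in dB."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Solve sum(clean^2) / sum((scale * noise)^2) = 10^(target_db / 10).
    scale = np.sqrt(np.sum(clean ** 2)
                    / (np.sum(noise ** 2) * 10.0 ** (target_db / 10.0)))
    return clean + scale * noise
```

A lower target SNR means louder noise relative to the speech, which is one way to see why a filter can do well at one SNR and worse at another.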

Use Cases

The repo for this project can be used for real-life speech denoising and, most importantly, it provides implementations of crucial metrics that can be used to measure the distortion or clarity of speech.

Challenges I ran into

The biggest challenge of this project was that the noise produced by TTS systems is not like other real-world noise. Generally, noise and clean signal are additive in nature, or have a direct sum decomposition. This is not the case with TTS systems, where the noise is generated along with the speech rather than added separately to the signal. Hence most of the traditional filters, although they work well for real-life noise separation, do not work well for this use case. This is where we resorted to deep learning techniques, and models like the Facebook denoiser and SEGAN helped.
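The additive assumption can be written out as follows. This is the standard textbook formulation, not something taken from the project's code, and the symbols (clean speech x, noise d, observation y, power spectra P) are notation introduced here for illustration.

```latex
\begin{align}
y(n) &= x(n) + d(n)
  && \text{noisy signal = clean speech + noise} \\
Y(\omega) &= X(\omega) + D(\omega)
  && \text{the same relation in the frequency domain} \\
H(\omega) &= \frac{P_x(\omega)}{P_x(\omega) + P_d(\omega)}
  && \text{Wiener gain, assuming } x \text{ and } d \text{ are uncorrelated}
\end{align}
```

Filters of this kind need an estimate of the noise power P_d, typically taken from noise-only segments. When the noise is generated jointly with the speech, there may be no separate d(n) to estimate, which is consistent with the weaker results seen on the TTS data.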
