Results Audio Inpainting

Andrea Rodriguez  |  Thesis  |  Slides
Audio restoration and inpainting have advanced significantly in recent years, driven by growing demand for high-quality audio processing. Generative diffusion models are one promising approach. Our research builds on earlier work that applied generative diffusion models successfully in the image domain, and adapts those techniques to the particular characteristics and challenges of audio signals. By combining generative diffusion models with state-of-the-art image inpainting techniques adapted to audio, we aim to reconstruct missing or degraded audio segments robustly and effectively.

* SDR = Signal-to-Distortion Ratio
* SNR = Signal-to-Noise Ratio
* PSNR = Peak Signal-to-Noise Ratio
* SSIM = Structural Similarity Index Measure
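
The signal-level metrics listed above follow standard definitions. As a minimal sketch, assuming equal-length sequences of float samples and a peak amplitude of 1.0 for PSNR (both are assumptions, since the page does not state its exact evaluation setup), SNR and PSNR could be computed as follows. The function names are illustrative and are not taken from the thesis code. SDR and SSIM are left out because their values depend on the specific variant and implementation used.

```python
import math
from typing import Sequence


def _squared_error(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """Sum of squared sample differences between two equal-length signals."""
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("signals must be non-empty and of equal length")
    return sum((r - e) ** 2 for r, e in zip(reference, estimate))


def snr_db(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """Signal-to-noise ratio in dB: reference energy over error energy."""
    noise_energy = _squared_error(reference, estimate)
    signal_energy = sum(r * r for r in reference)
    if noise_energy == 0.0:
        return math.inf  # perfect reconstruction
    if signal_energy == 0.0:
        return -math.inf  # silent reference, non-zero error
    return 10.0 * math.log10(signal_energy / noise_energy)


def psnr_db(reference: Sequence[float], estimate: Sequence[float],
            peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; `peak` = 1.0 is an assumed full scale."""
    mse = _squared_error(reference, estimate) / len(reference)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    # Toy signals, not data from the page: a uniform 10 % amplitude error.
    clean = [1.0, -1.0, 1.0, -1.0]
    restored = [0.9, -0.9, 0.9, -0.9]
    print(f"SNR  = {snr_db(clean, restored):.2f} dB")
    print(f"PSNR = {psnr_db(clean, restored):.2f} dB")
```

For inpainting, one design choice is whether to evaluate the whole clip or only the reconstructed gap. The sketch takes whatever samples it is given, so either choice can be made by slicing the inputs first.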
Metrics averages
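| SDR (dB) | AudioLDM | Tango | Tango + DDNM | Tango + DDNM+ | Tango + RePaint | Tango + RePaint+ |
| --- | --- | --- | --- | --- | --- | --- |
| 4.5s - 5.5s | -3.27 | 5.48 | 5.03 | 5.47 | 5.96 | 4.97 |
| 4s - 6s | -4.97 | 1.48 | 1.53 | 1.99 | 1.64 | 1.48 |

| SNR (dB) | AudioLDM | Tango | Tango + DDNM | Tango + DDNM+ | Tango + RePaint | Tango + RePaint+ |
| --- | --- | --- | --- | --- | --- | --- |
| 4.5s - 5.5s | -0.25 | 5.73 | 5.39 | 5.71 | 6.17 | 5.28 |
| 4s - 6s | -1.45 | 2.82 | 2.90 | 2.21 | 2.71 | 2.02 |

| PSNR (dB) | AudioLDM | Tango | Tango + DDNM | Tango + DDNM+ | Tango + RePaint | Tango + RePaint+ |
| --- | --- | --- | --- | --- | --- | --- |
| 4.5s - 5.5s | 39.46 | 43.35 | 42.44 | 43.22 | 44.08 | 42.38 |
| 4s - 6s | 35.44 | 39.74 | 39.85 | 38.61 | 39.92 | 38.56 |

| SSIM (%) | AudioLDM | Tango | Tango + DDNM | Tango + DDNM+ | Tango + RePaint | Tango + RePaint+ |
| --- | --- | --- | --- | --- | --- | --- |
| 4.5s - 5.5s | 98.40 | 99.25 | 99.21 | 99.28 | 99.18 | 99.14 |
| 4s - 6s | 97.84 | 98.34 | 98.46 | 98.45 | 98.59 | 98.56 |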
Inference time
Measured on the Colab free tier (1 x T4 GPU)

| Method | Parameters | 1 clip (minutes) | Batch of 8 clips (minutes) |
| --- | --- | --- | --- |
| AudioLDM | DDIM T = 200 | 1 | 4 |
| Tango | - | 10 | 40 |
| Tango + DDNM | - | 10 | 40 |
| Tango + DDNM+ | travel length = 10 | 120 | 480 |
| Tango + RePaint | number of resampling = 10 | 100 | 400 |
| Tango + RePaint+ | number of resampling = 10; jump length = 10 | 100 | 400 |
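
The roughly tenfold cost of the RePaint variants relative to plain Tango can be made plausible by counting sampler steps. The sketch below assumes that "number of resampling" and "jump length" correspond to the resampling count and jump length of the RePaint sampling schedule, and that the variant with no jump length listed uses a jump length of 1. The base step count of 100 is a hypothetical value, since the page does not give one for Tango. The function name `count_repaint_steps` is illustrative, and the code only counts steps without running any model.

```python
from typing import Tuple


def count_repaint_steps(num_steps: int, num_resamplings: int,
                        jump_length: int) -> Tuple[int, int]:
    """Count (denoising, re-noising) steps of a RePaint-style schedule.

    At every jump point the sampler may go `jump_length` steps back up the
    noise schedule and denoise again, up to `num_resamplings - 1` times.
    """
    if num_steps < 1 or num_resamplings < 1 or jump_length < 1:
        raise ValueError("all arguments must be positive integers")

    # Remaining round trips allowed at each jump point.
    jumps_left = {t: num_resamplings - 1
                  for t in range(0, num_steps - jump_length, jump_length)}

    t = num_steps
    denoise_steps = 0
    renoise_steps = 0
    while t >= 1:
        t -= 1                      # one reverse (denoising) step
        denoise_steps += 1
        if jumps_left.get(t, 0) > 0:
            jumps_left[t] -= 1
            t += jump_length        # jump back up: forward (re-noising) steps
            renoise_steps += jump_length
    return denoise_steps, renoise_steps


if __name__ == "__main__":
    base_steps = 100  # hypothetical, not taken from the page
    for resamplings, jump in [(1, 1), (10, 1), (10, 10)]:
        denoise, renoise = count_repaint_steps(base_steps, resamplings, jump)
        print(f"resamplings={resamplings:2d} jump_length={jump:2d} "
              f"denoising steps={denoise} ({denoise / base_steps:.1f}x)")
```

Under these assumptions both settings with 10 resamplings need about ten times as many denoising steps as a single pass, which is in line with the measured 100 versus 10 minutes, although the measured times also depend on factors this count ignores.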
Denoising using modified DDNM

SNR for denoising only
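| PSNR 30 | Clip 1 | Clip 2 | Clip 3 | Clip 4 |
| --- | --- | --- | --- | --- |
| Noisy | -3.53 | -2.45 | -3.54 | -3.61 |
| Denoised | -1.82 | -1.14 | -1.32 | -2.20 |

| PSNR 20 | Clip 1 | Clip 2 | Clip 3 | Clip 4 |
| --- | --- | --- | --- | --- |
| Noisy | -9.80 | -9.11 | -8.61 | -10.10 |
| Denoised | -2.47 | -3.80 | -2.64 | -2.58 |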
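
To make the effect of denoising easier to read, the per-clip SNR gain (denoised minus noisy) can be computed directly from the values above. This is a small worked calculation on the reported numbers only. It assumes the values are in dB, and the variable and function names are illustrative.

```python
from statistics import mean
from typing import Dict, List

# SNR values copied from the denoising results above (assumed to be in dB).
snr_table: Dict[str, Dict[str, List[float]]] = {
    "PSNR 30": {
        "noisy": [-3.53, -2.45, -3.54, -3.61],
        "denoised": [-1.82, -1.14, -1.32, -2.20],
    },
    "PSNR 20": {
        "noisy": [-9.80, -9.11, -8.61, -10.10],
        "denoised": [-2.47, -3.80, -2.64, -2.58],
    },
}


def snr_gains(noisy: List[float], denoised: List[float]) -> List[float]:
    """Per-clip SNR improvement, rounded to the precision of the table."""
    return [round(d - n, 2) for n, d in zip(noisy, denoised)]


if __name__ == "__main__":
    for condition, values in snr_table.items():
        gains = snr_gains(values["noisy"], values["denoised"])
        print(f"{condition}: per-clip gain {gains}, mean {mean(gains):.2f}")
```

The mean gain works out to about 1.66 for the PSNR 30 condition and about 6.53 for the PSNR 20 condition, so the improvement is larger in the PSNR 20 case.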
References
AudioLDM:
Tango:
DDNM:
RePaint:
DDPM:
AudioCaps: