AN AUDIO-TEXTUAL DIFFUSION MODEL FOR CONVERTING SPEECH SIGNALS INTO ULTRASOUND TONGUE IMAGING DATA

1Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China.
2University of Chinese Academy of Sciences, Beijing, China.
3Guangdong-Hong Kong-Macao Joint Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen, China.

Abstract

Acoustic-to-articulatory inversion (AAI) converts audio into the movements of articulators, such as ultrasound images (UIs) of the tongue. A key limitation of existing AAI methods is that they rely only on the highly personalized information in the acoustic input to derive the general patterns of tongue motion, which limits the quality of the generated UIs. To address this issue, this paper proposes an audio-textual diffusion model for generating UIs from speech data, named WAV2UIT. The model consists of two stages: conditional encoding and UI generation. In the first stage, the inherent acoustic characteristics of individuals, related to the details of tongue movement, are encoded with wav2vec 2.0, while ASR transcriptions in the textual space, related to the universal patterns of tongue motion, are encoded with BERT. In the second stage, high-quality UIs are generated by a diffusion model with a cyclic denoising sampling strategy. Experimental results on a Mandarin speech-ultrasound dataset show that the proposed WAV2UIT system outperforms the state-of-the-art DNN baseline for UI generation by a relative LPIPS improvement of 67.95%. Code and examples are available on this website.
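
As a rough illustration of the first (conditional-encoding) stage, the sketch below extracts acoustic features with wav2vec 2.0 and textual features from an ASR transcription with BERT, which would then condition the diffusion-based UI generator. The checkpoints, pooling, and feature dimensions are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the conditional-encoding stage (assumed checkpoints and
# feature handling; not the paper's exact configuration).
import torch
from transformers import (Wav2Vec2FeatureExtractor, Wav2Vec2Model,
                          BertTokenizer, BertModel)

# Acoustic branch: wav2vec 2.0 encodes speaker-specific acoustic detail.
wav_fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
wav_enc = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

# Textual branch: BERT encodes the ASR transcription (Mandarin here).
txt_tok = BertTokenizer.from_pretrained("bert-base-chinese")
txt_enc = BertModel.from_pretrained("bert-base-chinese")

def encode_condition(waveform_16k, transcription):
    """Return a joint audio-textual conditioning vector for the diffusion stage."""
    audio_in = wav_fe(waveform_16k, sampling_rate=16000, return_tensors="pt")
    text_in = txt_tok(transcription, return_tensors="pt")
    with torch.no_grad():
        # (1, T_audio, 768) frame-level acoustic features
        audio_feat = wav_enc(audio_in.input_values).last_hidden_state
        # (1, T_text, 768) token-level textual features
        text_feat = txt_enc(**text_in).last_hidden_state
    # Mean-pool each branch and concatenate as the condition (an assumed fusion).
    cond = torch.cat([audio_feat.mean(dim=1), text_feat.mean(dim=1)], dim=-1)
    return cond  # (1, 1536), fed to the UI diffusion generator
```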

Real and generated ultrasound tongue imaging

Hover over a video to see whether it is an original sample or a generated sample (samples are labeled Original, A4U+A, or A4U+AT).

Datasets

Experiments were conducted on a Mandarin speech-ultrasound dataset. The dataset was collected from 44 healthy speakers performing three speech tasks (vowel, word, and sentence), totaling 6.85 hours. The training set consists of 40 speakers and the test set of 4 speakers, with no speaker overlap between the two. The ultrasound tongue imaging (UTI) data were recorded in the mid-sagittal orientation using a Focus&Fusion Finus 55 ultrasound system at a frame rate of 60 fps and a resolution of 920×700, with the P5-2 phased-array probe fixed by an ultrasound stabilization headset. The speech data were recorded with a BOYA BY-WM4 PRO microphone as single-channel audio at a 16 kHz sampling rate. The speech signals and UTI data were synchronized using an external sound card. The UTI data were downsampled to different resolutions for training different models; a sketch of the audio-frame alignment follows the table below.

Task      Number of utterances  Duration (hours)
Vowel     687                   0.76
Word      599                   0.66
Sentence  4913                  5.42
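
Since the audio (16 kHz) and the ultrasound video (60 fps) share a common timeline after synchronization, each ultrasound frame corresponds to roughly 16000 / 60 ≈ 267 audio samples. The snippet below is a hedged sketch of how a synchronized waveform could be split into per-frame audio segments for training; the array shapes and the helper function are illustrative assumptions, not part of the released pipeline.

```python
# Sketch: pair each 60-fps ultrasound frame with its ~267-sample audio window
# (16 kHz mono audio). Shapes and the helper name are illustrative assumptions.
import numpy as np

SR = 16000   # audio sampling rate (Hz)
FPS = 60     # ultrasound frame rate (frames per second)

def audio_chunks_per_frame(audio, num_frames):
    """Split a synchronized waveform into one audio segment per ultrasound frame."""
    # 16000 / 60 is not an integer, so compute frame boundaries in float and
    # round to the nearest sample to avoid accumulating drift over time.
    bounds = np.round(np.arange(num_frames + 1) * SR / FPS).astype(int)
    return [audio[bounds[i]:bounds[i + 1]] for i in range(num_frames)]

# Toy example: 2 seconds of audio and the matching 120 downsampled UTI frames.
audio = np.zeros(2 * SR, dtype=np.float32)
frames = np.zeros((2 * FPS, 128, 128), dtype=np.float32)
segments = audio_chunks_per_frame(audio, num_frames=frames.shape[0])
assert len(segments) == frames.shape[0]
```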

BibTeX

Under review.