Acoustic-to-articulatory inversion (AAI) converts audio into the movements of articulators, such as ultrasound images (UIs) of the tongue. A key limitation of existing AAI methods is that they rely only on the highly personalized information in the acoustic input to derive the general patterns of tongue motion, which limits the quality of the generated UIs. To address this issue, this paper proposes an audio-textual diffusion model for generating UIs from speech, named WAV2UIT. The model consists of two stages: conditional encoding and UI generation. In the first stage, the inherent acoustic characteristics of individuals, which relate to the details of tongue movements, are encoded with wav2vec 2.0, while ASR transcriptions in the textual space, which relate to the universal patterns of tongue motion, are encoded with BERT. In the second stage, high-quality UIs are generated by a diffusion model with a cyclic denoising sampling strategy. Experimental results on a Mandarin speech-ultrasound dataset show that the proposed WAV2UIT system outperforms the state-of-the-art DNN baseline for UI generation by a relative LPIPS improvement of 67.95%. Code and examples are available on this website.
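To make the conditional encoding stage concrete, the sketch below shows how the two conditioning branches could be assembled with the Hugging Face `transformers` library. This is a minimal illustration under stated assumptions, not the paper's implementation: the checkpoint names (`facebook/wav2vec2-base`, `bert-base-chinese`) and the concatenation-based fusion are placeholders, since the actual fusion mechanism is not described here.

```python
# Minimal sketch of WAV2UIT's conditional encoding stage using Hugging Face
# transformers. Checkpoint names and the fusion by concatenation are
# assumptions, not the paper's verified implementation.
import torch
from transformers import (BertModel, BertTokenizer,
                          Wav2Vec2FeatureExtractor, Wav2Vec2Model)

# Acoustic branch: wav2vec 2.0 captures speaker-specific acoustic detail.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

# Textual branch: BERT encodes the ASR transcription, which carries the
# universal (speaker-independent) patterns of tongue motion.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

@torch.no_grad()
def encode_condition(waveform_16k, transcription: str) -> torch.Tensor:
    """Build a joint audio-textual condition for the diffusion decoder."""
    audio_in = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    acoustic = wav2vec(**audio_in).last_hidden_state     # (1, T_audio, 768)

    text_in = tokenizer(transcription, return_tensors="pt")
    textual = bert(**text_in).last_hidden_state          # (1, T_text, 768)

    # Hypothetical fusion: concatenate along the time axis; the real model
    # may instead use cross-attention or a learned projection.
    return torch.cat([acoustic, textual], dim=1)
```

The diffusion decoder would then attend to this joint condition when denoising ultrasound frames; the cyclic denoising sampling strategy itself is a contribution of the paper and is not reproduced in this sketch.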
Experiments were conducted on a Mandarin speech-ultrasound dataset collected from 44 healthy speakers performing three speech tasks (vowels, words, and sentences), totaling 6.85 hours. The training set consists of 40 speakers and the test set of 4 speakers, with no overlap between the two. The ultrasound tongue imaging (UTI) data were recorded in the midsagittal orientation with a Focus&Fusion Finus 55 ultrasound system at 60 fps and a resolution of 920×700; the P5-2 phased-array probe was fixed with an ultrasound stabilization headset. The speech data were recorded with a BOYA BY-WM4 PRO microphone at a 16 kHz sampling frequency, single channel. The speech signals and UTI data were synchronized through an external sound card. The UTI data were downsampled to different resolutions to train different models, as illustrated in the sketch after the table below.
Task | Utterances | Time (hours)
---|---|---
Vowel | 687 | 0.76
Word | 599 | 0.66
Sentence | 4913 | 5.42
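Because the 16 kHz audio and the 60 fps UTI stream run at different rates, each ultrasound frame corresponds to roughly 16000/60 ≈ 267 audio samples. The sketch below illustrates one plausible way to pair frames with their audio windows and downsample them for training; the 128×128 target resolution and the use of OpenCV are assumptions, not the paper's actual pipeline.

```python
# Sketch of pairing 60 fps ultrasound frames with 16 kHz audio and
# downsampling frames for training. Target size and resize method are
# assumptions; the paper's preprocessing details are not specified here.
import cv2          # pip install opencv-python
import numpy as np

SAMPLE_RATE = 16000  # microphone sampling frequency (Hz)
FRAME_RATE = 60      # ultrasound frame rate (fps)

def align_and_downsample(audio: np.ndarray, frames: np.ndarray,
                         size=(128, 128)):
    """Pair each 920x700 ultrasound frame with its ~267-sample audio window."""
    pairs = []
    for i, frame in enumerate(frames):                 # frames: (N, 700, 920)
        # Round the window boundaries per frame so the non-integer
        # 16000/60 samples-per-frame ratio does not accumulate drift.
        start = round(i * SAMPLE_RATE / FRAME_RATE)
        end = round((i + 1) * SAMPLE_RATE / FRAME_RATE)
        if end > len(audio):                           # drop a trailing partial frame
            break
        window = audio[start:end]
        small = cv2.resize(frame, size, interpolation=cv2.INTER_AREA)
        pairs.append((window, small))
    return pairs
```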