09:00
According to IT House on July 27, Microsoft recently released a speech model called NaturalSpeech2. The model uses a "latent diffusion" design and performs strongly in zero-shot speech synthesis. It offers a solution for synthesizing speech and singing in the same style as a reference sample, giving users a high-quality and diverse speech synthesis experience. Unlike traditional text-to-speech (TTS) systems, NaturalSpeech2 represents speech with "continuous vectors" rather than "discrete tokens", which yields more complete speech segments and avoids the flat, emotionless, word-by-word delivery typical of robotic reading. Experimental results show that, in the zero-shot setting, the prosody of speech generated by NaturalSpeech2 closely matches both the speech prompt and the real speech, and its naturalness (measured by CMOS, the comparative mean opinion score) on the LibriTTS and VCTK test sets is comparable to real speech.
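To give a rough sense of why the choice between "discrete tokens" and "continuous vectors" matters, here is a toy Python sketch. It is not Microsoft's code and does not reflect how NaturalSpeech2 is implemented: the two-dimensional `CODEBOOK`, the sample `frames`, and the helper names `quantize` and `reconstruction_error` are all hypothetical, chosen only to show that mapping each frame to its nearest token discards some detail, while keeping the continuous vector does not.

```python
import math

# Hypothetical toy codebook: each discrete token id stands for one 2-D vector.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]


def quantize(frame):
    """Return the id of the codebook vector nearest to `frame`."""
    return min(range(len(CODEBOOK)), key=lambda i: math.dist(frame, CODEBOOK[i]))


def reconstruction_error(frames):
    """Mean distance between each frame and the codebook vector it maps to."""
    total = sum(math.dist(f, CODEBOOK[quantize(f)]) for f in frames)
    return total / len(frames)


# Made-up "speech frames" for illustration only.
frames = [(0.2, 0.1), (0.9, 0.4), (0.6, 0.7)]

# Discrete representation: one token id per frame.
tokens = [quantize(f) for f in frames]

# Detail lost by the discrete representation; keeping the original
# continuous vectors would lose nothing in this toy setting.
lost_detail = reconstruction_error(frames)

print("tokens:", tokens)
print("mean detail lost by quantizing:", round(lost_detail, 3))
```

Real systems use far larger codebooks and much higher-dimensional vectors, so the gap is less stark than in this sketch, but the trade-off it shows is the same in kind.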








