This AI Can Clone Your Voice Just by Listening for 5 Seconds

This post covers a fairly recent improvement in AI-based voice cloning. Existing methods can clone a voice when hours of recordings of it are available. This newer approach does the same with a minuscule amount of data: only five seconds of audio. The output has a timbre strikingly similar to the original voice, and the system can synthesize sounds and consonants that never occur in the original sample, constructing them on its own. You can listen to some generated samples here.

Here is what the interface looks like:

(Reference [2])

The detailed architecture diagram is given below. The method relies on the following three components.

Architecture diagram, as seen in the paper

1. The Speaker Encoder

The speaker encoder is a neural network trained on thousands of speakers. It squeezes what it learns from the training data into a compressed representation; in other words, it learns the essence of human speech from a multitude of speakers. It needs the training audio to pick up the intricacies of human speech, but this training needs to be done only once. After that, five seconds of speech is enough to replicate the voice of a previously unseen speaker.
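To make "compressed representation" a little more concrete, here is a minimal sketch that assumes the encoder's output is a fixed-length vector (an embedding) and that two clips are compared by cosine similarity. The function names and the four-dimensional vectors are made up for illustration; they are not taken from the paper or its code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so only its direction matters."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    ua, ub = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(ua, ub))


# Hypothetical embeddings, invented for this example.
# Real speaker embeddings have far more dimensions.
alice_clip_1 = [0.9, 0.1, 0.3, 0.2]
alice_clip_2 = [0.8, 0.2, 0.4, 0.1]
bob_clip = [0.1, 0.9, 0.2, 0.7]

same_speaker = cosine_similarity(alice_clip_1, alice_clip_2)
different_speaker = cosine_similarity(alice_clip_1, bob_clip)
print(f"same speaker:      {same_speaker:.3f}")
print(f"different speaker: {different_speaker:.3f}")
```

Under these assumptions, clips from the same speaker should point in nearly the same direction and score close to 1, while clips from different speakers should score lower.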

2. Synthesizer

The synthesizer takes as input the text we want the synthesized voice to say and returns a Mel spectrogram. A Mel spectrogram is a compact time-frequency representation of audio that captures a speaker's voice and intonation. This part of the network is implemented using Google's Tacotron 2 technique. The diagram below shows example Mel spectrograms for male and female speakers. On the left is the spectrogram of the reference recording of the voice we want to replicate; on the right is the text we want the synthesized voice to say, along with its corresponding synthesized spectrogram.

Mel spectrograms for training and synthesized data, as seen in the paper
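The "Mel" in Mel spectrogram refers to the mel scale, which spaces frequency bands roughly the way human hearing perceives pitch. The sketch below uses one common (HTK-style) conversion formula and shows how bands that are evenly spaced in mels become wider in Hz at higher frequencies. The band count and frequency range are arbitrary choices for illustration, not the settings used in the paper.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_bands, f_min_hz, f_max_hz):
    """Center frequencies (Hz) of n_bands bands equally spaced in mels."""
    m_min, m_max = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (m_max - m_min) / (n_bands + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_bands)]


# Assumed values for illustration: 8 bands between 0 Hz and 8000 Hz.
centers = mel_band_centers(n_bands=8, f_min_hz=0.0, f_max_hz=8000.0)
for center in centers:
    print(f"{center:7.1f} Hz")
```

The printed centers crowd together at low frequencies and spread out at high ones, which is why a Mel spectrogram devotes more resolution to the range where speech carries most of its detail.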

3. The Neural Vocoder

Ultimately, to listen to the learned voices we need to output a waveform. This is done by the neural vocoder component, which is implemented using DeepMind's WaveNet technique.
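The data flow through the three components described above can be sketched as follows. Every function body here is a toy placeholder standing in for a trained neural network, and all names, types, and parameters are hypothetical; the point is only to show what each stage consumes and produces.

```python
from typing import List

Embedding = List[float]             # fixed-length speaker vector
MelSpectrogram = List[List[float]]  # frames x mel bands
Waveform = List[float]              # audio samples


def speaker_encoder(reference_audio: Waveform) -> Embedding:
    """Placeholder: a real encoder is a trained neural network."""
    # Toy stand-in: summarize the clip with two statistics.
    n = len(reference_audio)
    mean = sum(reference_audio) / n
    energy = sum(s * s for s in reference_audio) / n
    return [mean, energy]


def synthesizer(text: str, speaker: Embedding, n_mels: int = 4) -> MelSpectrogram:
    """Placeholder: emits one dummy frame per character of text."""
    return [[speaker[1]] * n_mels for _ in text]


def vocoder(mel: MelSpectrogram, samples_per_frame: int = 3) -> Waveform:
    """Placeholder: expands each spectrogram frame into a few samples."""
    samples: Waveform = []
    for frame in mel:
        level = sum(frame) / len(frame)
        samples.extend([level] * samples_per_frame)
    return samples


def clone_voice(reference_audio: Waveform, text: str) -> Waveform:
    embedding = speaker_encoder(reference_audio)  # short clip -> voice identity
    mel = synthesizer(text, embedding)            # text + identity -> spectrogram
    return vocoder(mel)                           # spectrogram -> waveform


audio = clone_voice([0.0, 0.5, -0.5, 0.25], "hello")
print(len(audio))
```

Note that the reference audio only influences the result through the embedding, while the text only enters at the synthesizer. That separation is what lets the system say new sentences in a voice it has heard for just a few seconds.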

Measuring Similarity and Naturalness

Ultimately, our goal is to output speech that sounds like the target person's voice but says something very different from the input sample, and does so in a natural manner. From the table below we can see that swapping the training and test datasets drastically changes the naturalness and similarity of the synthesized voices. The detailed section of the paper describes how to work around these difficulties. The authors report results using Mean Opinion Score (MOS), a standard subjective metric in which human listeners rate samples, typically on a scale from 1 to 5; here it indicates how well a cloned voice sample would pass as authentic human speech.

Similarity and naturalness metrics, as seen in the paper
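As a rough illustration of how a Mean Opinion Score is computed, the sketch below averages a set of listener ratings on a 1 to 5 scale and attaches an approximate 95% confidence interval using the normal approximation. The ratings are invented for the example and the helper name is hypothetical; this is not the paper's evaluation code.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mean rating, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mean = statistics.mean(ratings)
    std_err = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, 1.96 * std_err  # 1.96: normal approximation for 95%


# Made-up listener ratings for one synthesized sample.
ratings = [4, 5, 4, 3, 4, 5, 4, 4, 3, 5]
mos, half_width = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

Reporting the interval alongside the mean matters because MOS comes from a limited number of human judgments, so small differences between systems may not be meaningful.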

Conclusion:

This technology holds a lot of promise, for example in generating voices for people who have lost theirs to degenerative diseases. It might also be used unscrupulously to clone the voices of authoritative people and world leaders for the wrong reasons. The only way out of this would be to have proper techniques to detect whether a voice is natural or synthesized.

Translated from: https://medium.com/swlh/this-ai-can-clone-your-voice-just-by-listening-for-5-seconds-783885102a8d
