VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
word embedding
ELMo
GPT
BERT
AudioLM: a Language Modeling Approach to Audio Generation
abstract
intro
related work
VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
abstract
speech qua