Repository: roedoejet/FastSpeech2 on GitHub, an implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech".
python3 synthesize.py --text "你好" --restore_step 400000 --mode single -p config/AISHELL3/preprocess.yaml -m config/AISHELL3/model.yaml -t config/AISHELL3/train.yaml
Notes:
--restore_step must match the checkpoint step of the trained model you are actually using.
--text accepts Chinese characters only; pinyin input is not supported.
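For batch mode, the invocation looks like this instead (the --source path is an assumption based on the repository's preprocessed-data layout; adjust it to your setup):

python3 synthesize.py --source preprocessed_data/AISHELL3/val.txt --restore_step 400000 --mode batch -p config/AISHELL3/preprocess.yaml -m config/AISHELL3/model.yaml -t config/AISHELL3/train.yaml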
There are four top-level functions:
def read_lexicon(lex_path):
def preprocess_english(text, preprocess_config):
def preprocess_mandarin(text, preprocess_config):
def synthesize(model, step, configs, vocoder, batchs, control_values):
plus a main entry point:
if __name__ == "__main__":
Defining the command-line arguments, including the controllable synthesis parameters (pitch, energy, duration):
if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    parser.add_argument("--restore_step", type=int, required=True)
    parser.add_argument(
        "--mode",
        type=str,
        choices=["batch", "single"],
        required=True,
        help="Synthesize a whole dataset or a single sentence",
    )
    parser.add_argument(
        "--source",
        type=str,
        default=None,
        help="path to a source file with format like train.txt and val.txt, for batch mode only",
    )
    parser.add_argument(
        "--text",
        type=str,
        default=None,
        help="raw text to synthesize, for single-sentence mode only",
    )
    parser.add_argument(
        "--speaker_id",
        type=int,
        default=0,
        help="speaker ID for multi-speaker synthesis, for single-sentence mode only",
    )
    parser.add_argument(
        "-p",
        "--preprocess_config",
        type=str,
        required=True,
        help="path to preprocess.yaml",
    )
    parser.add_argument(
        "-m", "--model_config", type=str, required=True, help="path to model.yaml"
    )
    parser.add_argument(
        "-t", "--train_config", type=str, required=True, help="path to train.yaml"
    )
    parser.add_argument(
        "--pitch_control",
        type=float,
        default=1.0,
        help="control the pitch of the whole utterance, larger value for higher pitch",
    )
    parser.add_argument(
        "--energy_control",
        type=float,
        default=1.0,
        help="control the energy of the whole utterance, larger value for larger volume",
    )
    parser.add_argument(
        "--duration_control",
        type=float,
        default=1.0,
        help="control the speed of the whole utterance, larger value for slower speaking rate",
    )
    args = parser.parse_args()
The source text is validated separately for batch mode and single mode:
    # Check source texts
    if args.mode == "batch":
        assert args.source is not None and args.text is None
    if args.mode == "single":
        assert args.source is None and args.text is not None
Reading the configs:
    # Read Config
    preprocess_config = yaml.load(
        open(args.preprocess_config, "r"), Loader=yaml.FullLoader
    )
    model_config = yaml.load(open(args.model_config, "r"), Loader=yaml.FullLoader)
    train_config = yaml.load(open(args.train_config, "r"), Loader=yaml.FullLoader)
    configs = (preprocess_config, model_config, train_config)
Loading the model and the vocoder from model.py in the utils folder:
    # Get model
    model = get_model(args, configs, device, train=False)

    # Load vocoder
    vocoder = get_vocoder(model_config, device)
Because preprocess_config["preprocessing"]["text"]["language"] was set to "zh" during preprocessing, the preprocess_mandarin function is invoked to preprocess the texts.
Note: for English or any other language, preprocess_config["preprocessing"]["text"]["language"] and the preprocess function called in synthesize.py must be adjusted accordingly, as in the sketch below.
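A minimal sketch of how that dispatch could be extended; the "fr" branch below is hypothetical and not part of the upstream script, which only ships preprocess_english and preprocess_mandarin:

lang = preprocess_config["preprocessing"]["text"]["language"]
if lang == "en":
    texts = np.array([preprocess_english(args.text, preprocess_config)])
elif lang == "zh":
    texts = np.array([preprocess_mandarin(args.text, preprocess_config)])
elif lang == "fr":
    texts = np.array([preprocess_french(args.text, preprocess_config)])  # hypothetical user-supplied preprocessor
else:
    raise ValueError("no preprocessor registered for language: " + lang)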
    # Preprocess texts
    if args.mode == "batch":
        # Get dataset
        dataset = TextDataset(args.source, preprocess_config)
        batchs = DataLoader(
            dataset,
            batch_size=8,
            collate_fn=dataset.collate_fn,
        )
    if args.mode == "single":
        ids = raw_texts = [args.text[:100]]
        speakers = np.array([args.speaker_id])
        if preprocess_config["preprocessing"]["text"]["language"] == "en":
            texts = np.array([preprocess_english(args.text, preprocess_config)])
        elif preprocess_config["preprocessing"]["text"]["language"] == "zh":
            texts = np.array([preprocess_mandarin(args.text, preprocess_config)])
        text_lens = np.array([len(texts[0])])
        batchs = [(ids, raw_texts, speakers, texts, text_lens, max(text_lens))]

    control_values = args.pitch_control, args.energy_control, args.duration_control
Finally, the synthesize function is invoked to carry out the actual synthesis:
    synthesize(model, args.restore_step, configs, vocoder, batchs, control_values)
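For example, to slow the utterance down and lower its pitch slightly with the same checkpoint as above (the control values are illustrative):

python3 synthesize.py --text "你好" --restore_step 400000 --mode single -p config/AISHELL3/preprocess.yaml -m config/AISHELL3/model.yaml -t config/AISHELL3/train.yaml --pitch_control 0.9 --duration_control 1.2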
preprocess_mandarin first calls the read_lexicon function to load the lexicon (in my setup, "./lexicon/pinyin-lexicon-r.txt"):
def preprocess_mandarin(text, preprocess_config):
    lexicon = read_lexicon(preprocess_config["path"]["lexicon_path"])
Next, the PyPinyin library converts the text into pinyin syllables, which the lexicon then maps into the phones list (see the runnable sketch after this function).
Note: style=Style.TONE3 is tone style 3, where the tone follows each syllable as a digit [1-4], with 5 marking the neutral tone because neutral_tone_with_five=True. For example: 中国 -> zhong1 guo2
    phones = []
    pinyins = [
        p[0]
        for p in pinyin(
            text, style=Style.TONE3, strict=False, neutral_tone_with_five=True
        )
    ]
    for p in pinyins:
        if p in lexicon:
            phones += lexicon[p]
        else:
            phones.append("sp")

    phones = "{" + " ".join(phones) + "}"
    print("Raw Text Sequence: {}".format(text))
    print("Phoneme Sequence: {}".format(phones))
Finally, the text_to_sequence function from __init__.py in the text folder turns the processed phones into a sequence, which is returned:
    sequence = np.array(
        text_to_sequence(
            phones, preprocess_config["preprocessing"]["text"]["text_cleaners"]
        )
    )

    return np.array(sequence)
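A runnable sketch of the PyPinyin call in isolation (assuming pypinyin is installed):

from pypinyin import Style, pinyin

result = pinyin("中国", style=Style.TONE3, strict=False, neutral_tone_with_five=True)
print(result)                   # [['zhong1'], ['guo2']]
print([p[0] for p in result])   # ['zhong1', 'guo2'] -- the pinyins list above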
preprocess_english mirrors preprocess_mandarin, so it is not walked through in detail; a worked example of its brace handling follows the function:
def preprocess_english(text, preprocess_config):
    text = text.rstrip(punctuation)
    lexicon = read_lexicon(preprocess_config["path"]["lexicon_path"])

    g2p = G2p()
    phones = []
    words = re.split(r"([,;.\-\?\!\s+])", text)
    for w in words:
        if w.lower() in lexicon:
            phones += lexicon[w.lower()]
        else:
            phones += list(filter(lambda p: p != " ", g2p(w)))
    phones = "{" + "}{".join(phones) + "}"
    phones = re.sub(r"\{[^\w\s]?\}", "{sp}", phones)
    phones = phones.replace("}{", " ")

    print("Raw Text Sequence: {}".format(text))
    print("Phoneme Sequence: {}".format(phones))

    sequence = np.array(
        text_to_sequence(
            phones, preprocess_config["preprocessing"]["text"]["text_cleaners"]
        )
    )

    return np.array(sequence)
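A worked example of the brace handling above, with a hypothetical phone list (roughly what "hello," yields after the lexicon lookup):

import re

phones = ["HH", "AH0", "L", "OW1", ","]
s = "{" + "}{".join(phones) + "}"       # {HH}{AH0}{L}{OW1}{,}
s = re.sub(r"\{[^\w\s]?\}", "{sp}", s)  # the punctuation token becomes {sp}
s = s.replace("}{", " ")                # {HH AH0 L OW1 sp}
print(s)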
read_lexicon loads the lexicon from lexicon_path.
Note: the lexicon must follow this format, one word per line followed by its phones (if a word appears more than once, only the first entry is kept; note the "not in" check below):
WORDA PHONEA PHONEB
WORDA PHONEC
WORDB PHONEB PHONEC
def read_lexicon(lex_path):
    lexicon = {}
    with open(lex_path) as f:
        for line in f:
            temp = re.split(r"\s+", line.strip("\n"))
            word = temp[0]
            phones = temp[1:]
            if word.lower() not in lexicon:
                lexicon[word.lower()] = phones
    return lexicon
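A usage sketch with a hypothetical two-entry, CMUdict-style lexicon file, assuming read_lexicon above is in scope:

with open("toy_lexicon.txt", "w") as f:
    f.write("HELLO HH AH0 L OW1\n")
    f.write("WORLD W ER1 L D\n")

lexicon = read_lexicon("toy_lexicon.txt")
print(lexicon["hello"])  # ['HH', 'AH0', 'L', 'OW1'] -- keys are lower-cased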
3.2.5.1 Inputs of the synthesize function
They are fixed in the main function; see the explanation of the main function in section 3.2.1 above:
if __name__ == "__main__":
    synthesize(model, args.restore_step, configs, vocoder, batchs, control_values)
3.2.5.2 Understanding the synthesize function
It moves each batch onto the device with the to_device function from tools.py in the utils folder, runs the model that was set up in the main function, and finally calls the synth_samples function (also in utils/tools.py) to synthesize the final speech; a sketch of the batch layout follows the function:
def synthesize(model, step, configs, vocoder, batchs, control_values):
    preprocess_config, model_config, train_config = configs
    pitch_control, energy_control, duration_control = control_values

    for batch in batchs:
        batch = to_device(batch, device)
        with torch.no_grad():
            # Forward
            output = model(
                *(batch[2:]),
                p_control=pitch_control,
                e_control=energy_control,
                d_control=duration_control
            )
            synth_samples(
                batch,
                output,
                vocoder,
                model_config,
                preprocess_config,
                train_config["path"]["result_path"],
            )
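For orientation, a hedged sketch of the batch layout in single mode, matching the batchs tuple built in the main function above:

ids, raw_texts, speakers, texts, text_lens, max_len = batchs[0]
# model(*(batch[2:]), ...) therefore receives (speakers, texts, text_lens, max_len);
# synth_samples then writes the audio and its spectrogram under result_path (see section 4).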
4. Output of the synthesis code
The audio files and their spectrograms are written to the configured result_path (here ./output/result/AISHELL3).