公司项目之前一直是采用人工录音,然而上线一段时间之后发现,人工录音成本太高,而且每周上线的音频不多,老板发现问题后,甚至把音频功能裸停了一段时间。直到最近项目要向海外扩展,需要内容做国际化,就想到了用机器翻译。目前机翻已经相对成熟,做的好的国内有科大讯飞,国外有微软。既然项目主要面对海外用户,就决定采用微软的TTS。(PS:这里不是打广告,微软的TTS是真的不错,自己可以去官网试听下,虽然无法像人一样很有感情的朗读诗歌什么的,但是朗读新闻咨询类文章还是抑扬顿挫的。)
使用背景已经啰嗦了一大堆,我觉得读者还是会关注的,但是我想作为资深CV码农,我想你们更关注还是如何应用,所以还是老规矩,简简单单的上代码。(申请账号这些就不介绍了)
com.microsoft.cognitiveservices.speech
client-sdk
1.12.1
public class TtsConst {
/**
* 音频合成类型(亲测这种效果最佳,其他的你自己去试试)
*/
public static final String AUDIO_24KHZ_48KBITRATE_MONO_MP3 = "audio-24khz-48kbitrate-mono-mp3";
/**
* 授权url
*/
public static final String ACCESS_TOKEN_URI = "https://eastasia.api.cognitive.microsoft.com/sts/v1.0/issuetoken";
/**
* api key
*/
public static final String API_KEY = "你自己的 api key";
/**
* 设置accessToken的过期时间为9分钟
*/
public static final Integer ACCESS_TOKEN_EXPIRE_TIME = 9 * 60;
/**
* 性别
*/
public static final String MALE = "Male";
/**
* tts服务url
*/
public static final String TTS_SERVICE_URI = "https://eastasia.tts.speech.microsoft.com/cognitiveservices/v1";
}
public class HttpsConnection {
public static HttpsURLConnection getHttpsConnection(String connectingUrl) throws Exception {
URL url = new URL(connectingUrl);
return (HttpsURLConnection) url.openConnection();
}
}
@Component
@Slf4j
public class Authentication {
@Resource
private RedisCache redisCache;
public String genAccessToken() {
InputStream inSt;
HttpsURLConnection webRequest;
try {
String accessToken = redisCache.get(RedisKey.KEY_TTS_ACCESS_TOKEN);
if (StringUtils.isEmpty(accessToken)) {
webRequest = HttpsConnection.getHttpsConnection(TtsConst.ACCESS_TOKEN_URI);
webRequest.setDoInput(true);
webRequest.setDoOutput(true);
webRequest.setConnectTimeout(5000);
webRequest.setReadTimeout(5000);
webRequest.setRequestMethod("POST");
byte[] bytes = new byte[0];
webRequest.setRequestProperty("content-length", String.valueOf(bytes.length));
webRequest.setRequestProperty("Ocp-Apim-Subscription-Key", TtsConst.API_KEY);
webRequest.connect();
DataOutputStream dop = new DataOutputStream(webRequest.getOutputStream());
dop.write(bytes);
dop.flush();
dop.close();
inSt = webRequest.getInputStream();
InputStreamReader in = new InputStreamReader(inSt);
BufferedReader bufferedReader = new BufferedReader(in);
StringBuilder strBuffer = new StringBuilder();
String line = null;
while ((line = bufferedReader.readLine()) != null) {
strBuffer.append(line);
}
bufferedReader.close();
in.close();
inSt.close();
webRequest.disconnect();
accessToken = strBuffer.toString();
//设置accessToken的过期时间为9分钟
redisCache.set(RedisKey.KEY_TTS_ACCESS_TOKEN, accessToken, TtsConst.ACCESS_TOKEN_EXPIRE_TIME);
log.info("New tts access token {}", accessToken);
}
return accessToken;
} catch (Exception e) {
log.error("Generate tts access token failed {}", e.getMessage());
}
return null;
}
}
public class ByteArray {
private byte[] data;
private int length;
public ByteArray(){
length = 0;
data = new byte[length];
}
public ByteArray(byte[] ba){
data = ba;
length = ba.length;
}
/**
合并数组
*/
public void cat(byte[] second, int offset, int length){
if(this.length + length > data.length) {
int allocatedLength = Math.max(data.length, length);
byte[] allocated = new byte[allocatedLength << 1];
System.arraycopy(data, 0, allocated, 0, this.length);
System.arraycopy(second, offset, allocated, this.length, length);
data = allocated;
}else {
System.arraycopy(second, offset, data, this.length, length);
}
this.length += length;
}
public void cat(byte[] second){
cat(second, 0, second.length);
}
public byte[] getArray(){
if(length == data.length){
return data;
}
byte[] ba = new byte[length];
System.arraycopy(data, 0, ba, 0, this.length);
data = ba;
return ba;
}
public int getLength(){
return length;
}
}
@Slf4j
public class XmlDom {
public static String createDom(String locale, String genderName, String voiceName, String textToSynthesize){
Document doc = null;
Element speak, voice;
try {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = dbf.newDocumentBuilder();
doc = builder.newDocument();
if (doc != null){
speak = doc.createElement("speak");
speak.setAttribute("version", "1.0");
speak.setAttribute("xml:lang", "en-us");
voice = doc.createElement("voice");
voice.setAttribute("xml:lang", locale);
voice.setAttribute("xml:gender", genderName);
voice.setAttribute("name", voiceName);
voice.appendChild(doc.createTextNode(textToSynthesize));
speak.appendChild(voice);
doc.appendChild(speak);
}
} catch (ParserConfigurationException e) {
log.error("Create ssml document failed: {}",e.getMessage());
return null;
}
return transformDom(doc);
}
private static String transformDom(Document doc){
StringWriter writer = new StringWriter();
try {
TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer;
transformer = tf.newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.transform(new DOMSource(doc), new StreamResult(writer));
} catch (TransformerException e) {
log.error("Transform ssml document failed: {}",e.getMessage());
return null;
}
return writer.getBuffer().toString().replaceAll("\n|\r", "");
}
}
@Slf4j
@Component
public class TtsService {
@Resource
private Authentication authentication;
/**
* 合成音频
*/
public byte[] genAudioBytes(String textToSynthesize, String locale, String gender, String voiceName) {
String accessToken = authentication.genAccessToken();
if (StringUtils.isEmpty(accessToken)) {
return new byte[0];
}
try {
HttpsURLConnection webRequest = HttpsConnection.getHttpsConnection(TtsConst.TTS_SERVICE_URI);
webRequest.setDoInput(true);
webRequest.setDoOutput(true);
webRequest.setConnectTimeout(5000);
webRequest.setReadTimeout(300000);
webRequest.setRequestMethod("POST");
webRequest.setRequestProperty("Content-Type", "application/ssml+xml");
webRequest.setRequestProperty("X-Microsoft-OutputFormat", TtsConst.AUDIO_24KHZ_48KBITRATE_MONO_MP3);
webRequest.setRequestProperty("Authorization", "Bearer " + accessToken);
webRequest.setRequestProperty("X-Search-AppId", "07D3234E49CE426DAA29772419F436CC");
webRequest.setRequestProperty("X-Search-ClientID", "1ECFAE91408841A480F00935DC390962");
webRequest.setRequestProperty("User-Agent", "TTSAndroid");
webRequest.setRequestProperty("Accept", "*/*");
String body = XmlDom.createDom(locale, gender, voiceName, textToSynthesize);
if (StringUtils.isEmpty(body)) {
return new byte[0];
}
byte[] bytes = body.getBytes();
webRequest.setRequestProperty("content-length", String.valueOf(bytes.length));
webRequest.connect();
DataOutputStream dop = new DataOutputStream(webRequest.getOutputStream());
dop.write(bytes);
dop.flush();
dop.close();
InputStream inSt = webRequest.getInputStream();
ByteArray ba = new ByteArray();
int rn2 = 0;
int bufferLength = 4096;
byte[] buf2 = new byte[bufferLength];
while ((rn2 = inSt.read(buf2, 0, bufferLength)) > 0) {
ba.cat(buf2, 0, rn2);
}
inSt.close();
webRequest.disconnect();
return ba.getArray();
} catch (Exception e) {
log.error("Synthesis tts speech failed {}", e.getMessage());
}
return null;
}
}
由于项目中需要将音频上传到OSS,所以这里生成的是字节码文件,你也可以选择下载或保存音频文件。
由于项目中需要生成超过10分钟的音频,我在调试中发现tts不能生成超过10分钟的音频,尴尬了呀,在微软官网中摸索了半天也没找到生成超过10分钟音频的办法,放弃了吗?不可能的。在我感觉到无计可施的时候,我的脑海中蹦出了四个字,那就是”断点续传“。我就想能不能通过tts把内容分段生成字节码两个,然后拼接后再上传到OSS。说干就干,没想到最后真的可行。成功那一瞬间的感觉无法言喻呀。不废话了,嗯,上大妈,哦不是,上代码。太激动了。
/**
* 生成中文音频信息
*/
public byte[] getZHAudioBuffer(String gender, String chapterContent, String locale, String voiceName) {
byte[] audioBuffer;
if (chapterContent.length() <= 2600) {
audioBuffer = ttsService.genAudioBytes(chapterContent, locale, gender, voiceName);
} else {
byte[] audioBuffer1 = ttsService.genAudioBytes(chapterContent.substring(0, chapterContent.length() / 2), locale, gender, voiceName);
byte[] audioBuffer2 = ttsService.genAudioBytes(chapterContent.substring(chapterContent.length() / 2), locale, gender, voiceName);
ByteArray byteArray = new ByteArray(audioBuffer1);
byteArray.cat(audioBuffer2);
audioBuffer = byteArray.getArray();
}
return audioBuffer;
}
/**
* 生成英文音频信息
*/
public byte[] getUSAudioBuffer(String gender, String chapterContent, String locale, String voiceName) {
String[] words = chapterContent.split(" ");
byte[] audioBuffer;
int maxLength = 1500;
if (words.length <= maxLength) {
audioBuffer = ttsService.genAudioBytes(chapterContent, locale, gender, voiceName);
} else {
String[] part1 = new String[maxLength];
String[] part2 = new String[words.length - maxLength];
for (int i = 0; i < words.length; i++) {
if (i < maxLength) {
part1[i] = words[i];
} else {
part2[i - maxLength] = words[i];
}
}
byte[] audioBuffer1 = ttsService.genAudioBytes(String.join(" ", part1), locale, gender, voiceName);
byte[] audioBuffer2 = ttsService.genAudioBytes(String.join(" ", part2), locale, gender, voiceName);
ByteArray byteArray = new ByteArray(audioBuffer1);
byteArray.cat(audioBuffer2);
audioBuffer = byteArray.getArray();
}
return audioBuffer;
}
我要说的都在代码里了,你细品。(PS:中文的2600字符和英文的1500字符,是我调试出来的,生成的音频肯定是在10分钟以内的)
微软TTS还是挺香的,嗯,总结很到位,我继续摸索其他功能去了。