AVS Basics
The Alexa Voice Service (AVS) gives users access to cloud-based Alexa capabilities, i.e., it lets you integrate Alexa into a product. Because speech recognition and natural language understanding happen in the cloud, building voice-enabled products becomes much simpler.
AVS is made up of interfaces that correspond to client-side capabilities (interfaces are explained in detail below), such as speech recognition, audio playback, and volume control. Each interface contains logically grouped messages called directives and events.
A directive is a message sent from the cloud instructing the client to take an action.
An event is a message sent from the client to the cloud notifying Alexa that something has happened.
To call the AVS API, a product needs a Login with Amazon (LWA) access token, which authorizes the product to invoke the API on the user's behalf. The following authorization methods are available:
(1) Remote Authorization: used for devices such as smart speakers.
Companion app authorization. Reference:
https://blog.csdn.net/qq_27061049/article/details/80167376
Key point: obtain the correct Api_key. After authorization succeeds you receive an access token, which must be placed in the header of every event sent to AVS.
Companion site authorization.
(2) Local Authorization: used for Android and iOS applications.
(3) Code-Based Linking: typically used for products with limited input options, such as TVs or smart watches.
Maintaining an HTTP/2 Connection
(1) Basic concepts:
Frame: the basic protocol unit in HTTP/2. Each frame serves a different purpose; for example, HEADERS and DATA frames make up a basic request and response.
Stream: an independent, bidirectional sequence of frames exchanged between the client and server within an HTTP/2 connection.
Downchannel: a stream created within the HTTP/2 connection that delivers directives from the cloud to the client, used mainly for cloud-initiated directives or audio.
Cloud-initiated Directives: directives that originate in the cloud without an associated speech request, e.g., when a user adjusts the device volume through the companion app, a directive is sent to the product.
(2) Maintaining the HTTP/2 connection requires: a) establishing a downchannel stream, and b) synchronizing the product's component state with AVS.
(3) Establishing the downchannel stream. Within 10 seconds of opening the connection to AVS, the client sends a GET request like the following:
:method = GET
:scheme = https
:path = /{{API version}}/directives
authorization = Bearer {{YOUR_ACCESS_TOKEN}}
In Android the request can be made with OkHttp:
// Build the client. The read timeout governs the downchannel stream, which can stay
// silent for long stretches, so it must be long (see note (5)c below); a short value
// such as the original 10 seconds would tear the downchannel down prematurely.
mClient = new OkHttpClient.Builder()
        .connectTimeout(5, TimeUnit.SECONDS)
        .writeTimeout(10, TimeUnit.SECONDS)
        .readTimeout(60, TimeUnit.MINUTES)
        .build();

// Build and execute the downchannel GET request
Request request = new Request.Builder()
        .addHeader(AUTHORIZATION, BEARER + mAuthToken)
        .addHeader(CONTENT_TYPE, MULTI_DATA)
        .url(AVS_DIRECTIVES_URL)
        .build();
Log.i(TAG, "[http call] downchannel request");
mDownChannelCall = mClient.newCall(request);
Response downChannelRsp = mDownChannelCall.execute();
boolean isDownChannelOpen = downChannelRsp.isSuccessful();
Log.i(TAG, "[down channel]: " + downChannelRsp.code());
The relevant member variables are:
private final String AVS_DIRECTIVES_URL = "https://avs-alexa-na.amazon.com/v20160207/directives";
private final String AVS_EVENTS_URL = "https://avs-alexa-na.amazon.com/v20160207/events";
private final String AVS_EVENTS_PING = "https://avs-alexa-na.amazon.com/ping";
private final String BOUNDARY = "this-is-a-boundary";
private final String AUTHORIZATION = "authorization";
private final String BEARER = "Bearer ";
private final String CONTENT_TYPE = "content-type";
private final String MULTI_DATA = "multipart/form-data; boundary=" + BOUNDARY;
private final String DISPOSITION = "Content-Disposition";
private final String META_DATA = "form-data; name=\"metadata\"";
private final String AUDIO_DATA = "form-data; name=\"audio\"";
(4) Synchronizing component state
On the existing connection, open a new event stream with a POST request. This event stream can be closed once the client receives the response. Below is a sample SynchronizeState event:
:method = POST
:scheme = https
:path = /{{API version}}/events
authorization = Bearer {{YOUR_ACCESS_TOKEN}}
content-type = multipart/form-data; boundary={{BOUNDARY_TERM_HERE}}
--{{BOUNDARY_TERM_HERE}}
Content-Disposition: form-data; name="metadata"
Content-Type: application/json; charset=UTF-8
{
    "context": [
        // This is an array of context objects that are used to communicate the
        // state of all client components to Alexa. See Context for details.
    ],
    "event": {
        "header": {
            "namespace": "System",
            "name": "SynchronizeState",
            "messageId": "{{STRING}}"
        },
        "payload": {
        }
    }
}
--{{BOUNDARY_TERM_HERE}}--
Android code:
RequestBody requestBody = RequestBody.create(JSON_TYPE,
        JsonHelper.getSyncStateJson(mPlayToken, mMsgId));
RequestBody multiBody = new MultipartBody.Builder()
        .setType(MultipartBody.FORM)
        .addPart(Headers.of(DISPOSITION, META_DATA), requestBody)
        .build();
Request request2 = new Request.Builder()
        .url(AVS_EVENTS_URL)
        .post(multiBody)
        .addHeader(AUTHORIZATION, BEARER + mAuthToken)
        .addHeader(CONTENT_TYPE, MULTI_DATA)
        .build();
Log.i(TAG, "[http call] Synchronized state request");
mStateCall = mClient.newCall(request2);
Response response2 = mStateCall.execute();
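JsonHelper.getSyncStateJson is not shown above; a minimal sketch of what it might build with org.json, matching the sample event (the context array is left empty here, and playToken is kept only for signature compatibility):
// Hypothetical sketch of JsonHelper.getSyncStateJson(). A real client would fill
// the context array with the state of all of its components; playToken would feed
// into that context and is unused in this bare-bones version.
public static String getSyncStateJson(String playToken, String messageId) throws JSONException {
    JSONObject header = new JSONObject()
            .put("namespace", "System")
            .put("name", "SynchronizeState")
            .put("messageId", messageId);
    JSONObject event = new JSONObject()
            .put("header", header)
            .put("payload", new JSONObject());
    return new JSONObject()
            .put("context", new JSONArray())
            .put("event", event)
            .toString();
}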
Once synchronization succeeds, the client can send events and receive directives over this connection.
(5) Notes:
a) Captured audio must be encoded as follows:
16bit Linear PCM (LPCM16)
16kHz sample rate
Single channel
Little endian byte order
b) An HTTP/2 connection supports at most 10 concurrent streams, including event streams, the downchannel, and ping. Make sure each event stream is closed once its response has been received.
c) Many HTTP libraries have a read timeout that fires when the client has received no data for a while. Because the downchannel stream must stay open between AVS and the client and may carry no data for long periods, set the read timeout to at least 60 minutes.
(6) Ping and Timeout
When the connection is idle, send a PING frame every 5 minutes.
Sample Request
:method = GET
:scheme = https
:path = /ping
authorization = Bearer {{YOUR_ACCESS_TOKEN}}
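A minimal sketch of such a ping loop in the same OkHttp style, assuming the mClient, mAuthToken, AUTHORIZATION, BEARER, and AVS_EVENTS_PING members shown earlier (scheduling with java.util.concurrent is one option among many):
// Hypothetical sketch: GET /ping every 5 minutes while the connection is idle.
private final ScheduledExecutorService mPingExecutor =
        Executors.newSingleThreadScheduledExecutor();

private void startPingLoop() {
    mPingExecutor.scheduleWithFixedDelay(new Runnable() {
        @Override
        public void run() {
            Request pingRequest = new Request.Builder()
                    .url(AVS_EVENTS_PING)
                    .addHeader(AUTHORIZATION, BEARER + mAuthToken)
                    .build();
            try {
                Response pingRsp = mClient.newCall(pingRequest).execute();
                Log.i(TAG, "[ping]: " + pingRsp.code());
                pingRsp.close();
                // A non-2xx code here would be the cue to re-establish the connection.
            } catch (IOException e) {
                Log.e(TAG, "ping failed", e);
            }
        }
    }, 5, 5, TimeUnit.MINUTES);
}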
SpeechRecognizer Interface
Reference: https://developer.amazon.com/zh/docs/alexa-voice-service/speechrecognizer.html#stopcapture
In Android, audio is captured with the AudioRecord class. The general flow is:
private final int AUDIO_SAMPLE_RATE = 16000;                          // 16000 Hz sample rate
private final int AUDIO_CHANNEL_CONFIG = AudioFormat.CHANNEL_IN_MONO; // single channel
// Sample resolution: each sample can be 16-bit or 8-bit; 16-bit takes more space and
// processing power but reproduces the audio more faithfully.
private final int AUDIO_FORMAT = AudioFormat.ENCODING_PCM_16BIT;
private final int AUDIO_BUFFER_SIZE = AudioRecord.getMinBufferSize(AUDIO_SAMPLE_RATE,
        AUDIO_CHANNEL_CONFIG, AUDIO_FORMAT); // 1280

// 1. Construct the AudioRecord:
//    (audio source (microphone), sample rate, channel config, sample format, buffer size)
mAudioRecorder = new AudioRecord(MediaRecorder.AudioSource.MIC, AUDIO_SAMPLE_RATE,
        AUDIO_CHANNEL_CONFIG, AUDIO_FORMAT, AUDIO_BUFFER_SIZE);

// 2. Start capturing
mAudioRecorder.startRecording();
mRecording = true;

// 3. Write the captured data out for AVS to recognize
mAudioRecordingThread = new Thread(new Runnable() {
    @Override
    public void run() {
        writeAudioDataToFile();
    }
});
mAudioRecordingThread.start();

// 4. Stop capturing after MAX_TIME
mHandler.postDelayed(mStopAudioRunnable, MAX_TIME);
Once an interaction with Alexa has been initiated, the microphone stays open until one of the situations described below occurs (for example, a StopCapture directive is received).
The Recognize event sends user speech to AVS. It is a message with two parts: the first is a JSON-formatted object, the second is the audio byte stream captured by the product's microphone.
All captured audio sent to AVS must use the encoding listed in note (5)a above (16-bit LPCM, 16 kHz, single channel, little endian).
Sample Message
{
    "context": [
        // This is an array of context objects that are used to communicate the
        // state of all client components to Alexa. See Context for details.
    ],
    "event": {
        "header": {
            "namespace": "SpeechRecognizer",
            "name": "Recognize",
            "messageId": "{{STRING}}",
            "dialogRequestId": "{{STRING}}"
        },
        "payload": {
            "profile": "{{STRING}}",
            "format": "{{STRING}}",
            "initiator": {
                "type": "{{STRING}}",
                "payload": {
                    "wakeWordIndices": {
                        "startIndexInSamples": {{LONG}},
                        "endIndexInSamples": {{LONG}}
                    },
                    "token": "{{STRING}}"
                }
            }
        }
    }
}
Binary Audio Attachment
Every Recognize event must include an audio byte stream. The headers for each audio attachment are:
Content-Disposition: form-data; name="audio"
Content-Type: application/octet-stream
{{BINARY AUDIO ATTACHMENT}}
The corresponding Android request:
RequestBody audioBody = new MultipartBody.Builder()
        .setType(MultipartBody.FORM)
        .addPart(Headers.of(DISPOSITION, META_DATA),
                RequestBody.create(JSON_TYPE,
                        JsonHelper.getRecognizeJson(mPlayToken, mMsgId, mDialogReqId)))
        .addPart(Headers.of(DISPOSITION, AUDIO_DATA), streamBody)
        .build();
Request request3 = new Request.Builder()
        .url(AVS_EVENTS_URL)
        .post(audioBody)
        .addHeader(AUTHORIZATION, BEARER + mAuthToken)
        .addHeader(CONTENT_TYPE, MULTI_DATA)
        .build();
Log.i(TAG, "[http call] recognize audio request");
mAudioCall = mClient.newCall(request3);

Here streamBody is the audio data stream captured from the microphone, i.e. the mRequestBody below:

mRequestBody = new RequestBody() {
    @Override
    public MediaType contentType() {
        return AVSMgr.OCTET_TYPE;
    }

    // Write the audio data from the AUDIO_BUFFER_SIZE buffer into the BufferedSink
    @Override
    public void writeTo(BufferedSink bufferedSink) throws IOException {
        int readSize;
        byte[] buffer = new byte[AUDIO_BUFFER_SIZE];
        while (mRecording) {
            readSize = mAudioRecorder.read(buffer, 0, AUDIO_BUFFER_SIZE);
            if (AudioRecord.ERROR_INVALID_OPERATION != readSize) {
                try {
                    // Forward the captured audio to the sink in AMZN_AUDIO_BYTE_SIZE chunks
                    int sizeCount = readSize / AMZN_AUDIO_BYTE_SIZE;
                    if (sizeCount > 0) {
                        byte[][] bytes = new byte[sizeCount][AMZN_AUDIO_BYTE_SIZE];
                        for (int i = 0, start = 0, end = AMZN_AUDIO_BYTE_SIZE; i < sizeCount;
                                i++, start += AMZN_AUDIO_BYTE_SIZE, end += AMZN_AUDIO_BYTE_SIZE) {
                            bytes[i] = Arrays.copyOfRange(buffer, start, end);
                            bufferedSink.write(bytes[i]);
                        }
                    }
                    // Rough silence detection: average the absolute sample values and
                    // count consecutive quiet windows
                    mTimeCounter++;
                    mCounter++;
                    if (mTimeCounter > 4 && mCounter > FIRST_RECORD_COUNTER) {
                        long sum = 0;
                        for (int i = 0; i < buffer.length; i++) {
                            sum += Math.abs(buffer[i]);
                        }
                        double rawAmplitude = sum / (double) readSize;
                        mAmplitude = Math.max(rawAmplitude, mAmplitude);
                        Log.d(TAG, "rawAmp: " + mAmplitude);
                        if (rawAmplitude < 33.3 && rawAmplitude > 25) {
                            Log.d(TAG, "no sound detected");
                            mNoSoundCounter++;
                        } else if (rawAmplitude > 33.3) {
                            Log.d(TAG, "sound detected");
                            mNoSoundCounter = 0;
                        }
                        mTimeCounter = 0;
                        mAmplitude = 0;
                    }
                    // Stop recording after several consecutive silent windows
                    if (mNoSoundCounter > 3) {
                        stopRecording();
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                    Log.e(TAG, "Error writing to audio file", e);
                }
            }
        }
        Log.i(TAG, "Finish speaking");
        if (null != mListener) {
            mListener.onFinishRecording();
        } else {
            Log.e(TAG, "RecordMgr: mListener == null.");
        }
    }
};
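JsonHelper.getRecognizeJson is also not shown above; a minimal sketch of what it might produce for the sample Recognize event (the context array is left empty, the optional initiator is omitted, mPlayToken is kept only for signature compatibility, and the profile/format values are the CLOSE_TALK and AUDIO_L16_RATE_16000_CHANNELS_1 constants from the AVS docs):
// Hypothetical sketch of JsonHelper.getRecognizeJson().
public static String getRecognizeJson(String playToken, String messageId,
                                      String dialogRequestId) throws JSONException {
    JSONObject header = new JSONObject()
            .put("namespace", "SpeechRecognizer")
            .put("name", "Recognize")
            .put("messageId", messageId)
            .put("dialogRequestId", dialogRequestId);
    JSONObject payload = new JSONObject()
            .put("profile", "CLOSE_TALK")
            .put("format", "AUDIO_L16_RATE_16000_CHANNELS_1");
    JSONObject event = new JSONObject()
            .put("header", header)
            .put("payload", payload);
    return new JSONObject()
            .put("context", new JSONArray())
            .put("event", event)
            .toString();
}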
a) StopCapture Directive
When AVS identifies the user's intent or detects the end of speech, this directive instructs the client to stop capturing the user's speech. On receiving it, the client must close the microphone and stop listening. A dispatch sketch follows the sample message below.
Sample Message
{
    "directive": {
        "header": {
            "namespace": "SpeechRecognizer",
            "name": "StopCapture",
            "messageId": "{{STRING}}",
            "dialogRequestId": "{{STRING}}"
        },
        "payload": {
        }
    }
}
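A sketch of how this might be wired up on the client, reusing stopRecording() from the capture code above; handleDirective is a hypothetical name and assumes the directive JSON has already been parsed out of the downchannel or event stream:
// Hypothetical sketch: stop capturing when a StopCapture directive arrives.
private void handleDirective(JSONObject directiveJson) throws JSONException {
    JSONObject headerJson = directiveJson.getJSONObject("header");
    String namespace = headerJson.getString("namespace");
    String name = headerJson.getString("name");
    if ("SpeechRecognizer".equals(namespace) && "StopCapture".equals(name)) {
        stopRecording();   // close the microphone and stop listening
    }
}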
b) ExpectSpeech Directive
When Alexa needs additional information to fulfill a user's request, AVS sends an ExpectSpeech directive. It instructs the client to open the microphone and capture speech. If the microphone is not opened within the specified timeout, the client must send an ExpectSpeechTimedOut event to AVS.
In a multi-turn interaction with Alexa, the device receives at least one ExpectSpeech directive instructing the client to listen for user speech. If an initiator object is included in the payload of the ExpectSpeech directive, it must also be included in the Recognize event returned to Alexa; if the payload contains no initiator, the Recognize event must not include one either.
Sample Message
{
    "directive": {
        "header": {
            "namespace": "SpeechRecognizer",
            "name": "ExpectSpeech",
            "messageId": "{{STRING}}",
            "dialogRequestId": "{{STRING}}"
        },
        "payload": {
            "timeoutInMilliseconds": {{LONG}},
            "initiator": {{STRING}}
        }
    }
}
c) ExpectSpeechTimedOut Event
If the microphone is not opened within the specified timeout, the client must send an ExpectSpeechTimedOut event to AVS. A handling sketch follows the sample message below.
Sample Message
{
    "event": {
        "header": {
            "namespace": "SpeechRecognizer",
            "name": "ExpectSpeechTimedOut",
            "messageId": "{{STRING}}"
        },
        "payload": {
        }
    }
}
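A sketch of one way to honor this contract, reusing mHandler and mRecording from the capture code above; startRecording(), sendEvent(), and JsonHelper.getExpectSpeechTimedOutJson() are hypothetical names:
// Hypothetical sketch: open the microphone for the next turn, and if that has not
// happened within timeoutInMilliseconds, send ExpectSpeechTimedOut instead.
private void handleExpectSpeech(JSONObject payload) throws JSONException {
    long timeoutMs = payload.getLong("timeoutInMilliseconds");
    mHandler.postDelayed(new Runnable() {
        @Override
        public void run() {
            if (!mRecording) {
                // placeholder helper building the ExpectSpeechTimedOut event JSON
                sendEvent(JsonHelper.getExpectSpeechTimedOutJson(mMsgId));
            }
        }
    }, timeoutMs);
    startRecording();   // reopen the microphone for the next turn
}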
SpeechSynthesizer Interface
The SpeechSynthesizer interface delivers Alexa's spoken replies when the user asks a question or makes a request. For example, when the user asks "What's the weather in Seattle?", AVS returns a Speak directive with binary audio to the client.
Speak Directive
AVS sends a Speak directive to the client when a spoken response from Alexa is required. The directive is delivered as a multipart message: one part is the JSON-formatted directive, the other is the binary audio data.
Sample Message
{
    "directive": {
        "header": {
            "namespace": "SpeechSynthesizer",
            "name": "Speak",
            "messageId": "{{STRING}}",
            "dialogRequestId": "{{STRING}}"
        },
        "payload": {
            "url": "{{STRING}}",
            "format": "{{STRING}}",
            "token": "{{STRING}}"
        }
    }
}
Binary Audio Attachment
The headers of the binary audio attachment are:
Content-Type: application/octet-stream
Content-ID: {{Audio Item CID}}
{{BINARY AUDIO ATTACHMENT}}
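The client has to pull this attachment out of the multipart response and keep the bytes for playback. A minimal sketch using okio, assuming partSource is a BufferedSource already positioned at the attachment body (the multipart parsing itself is not shown, and the .mp3 suffix assumes the common MP3 format reported in the payload's format field):
// Hypothetical sketch: stream a Speak directive's binary audio attachment to a
// local file so it can be handed to a media player later.
private File saveSpeakAudio(BufferedSource partSource, File cacheDir) throws IOException {
    File speakFile = File.createTempFile("speak", ".mp3", cacheDir);
    BufferedSink sink = Okio.buffer(Okio.sink(speakFile));
    try {
        sink.writeAll(partSource);   // copy all attachment bytes to disk
    } finally {
        sink.close();
    }
    return speakFile;
}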
SpeechStarted Event
The client should send a SpeechStarted event to AVS once it has processed the Speak directive and begun playing the synthesized speech.
Sample Message
{
    "event": {
        "header": {
            "namespace": "SpeechSynthesizer",
            "name": "SpeechStarted",
            "messageId": "{{STRING}}"
        },
        "payload": {
            "token": "{{STRING}}"
        }
    }
}
SpeechFinished Event
The client should send a SpeechFinished event to AVS once it has processed the Speak directive and the Alexa TTS has been rendered to the user in full (playback completed). If playback does not complete, e.g., the user interrupts it with "Alexa, stop", SpeechFinished is not sent.
Sample Message
{
    "event": {
        "header": {
            "namespace": "SpeechSynthesizer",
            "name": "SpeechFinished",
            "messageId": "{{STRING}}"
        },
        "payload": {
            "token": "{{STRING}}"
        }
    }
}
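Tying the two events together, a sketch of playing the saved Speak audio and reporting progress; sendEvent(), JsonHelper.getSpeechStartedJson(), and JsonHelper.getSpeechFinishedJson() are hypothetical names following the samples above:
// Hypothetical sketch: play the Speak attachment saved earlier and emit
// SpeechStarted/SpeechFinished around playback. The token must echo the one
// received in the Speak directive's payload.
private void playSpeak(File speakFile, final String token) throws IOException {
    final MediaPlayer player = new MediaPlayer();
    player.setDataSource(speakFile.getAbsolutePath());
    player.setOnCompletionListener(new MediaPlayer.OnCompletionListener() {
        @Override
        public void onCompletion(MediaPlayer mp) {
            sendEvent(JsonHelper.getSpeechFinishedJson(token)); // placeholder helper
            mp.release();
        }
    });
    player.prepare();
    player.start();
    sendEvent(JsonHelper.getSpeechStartedJson(token)); // placeholder helper
}
Note that if the user interrupts playback ("Alexa, stop"), onCompletion never fires, so SpeechFinished is naturally withheld, matching the rule above.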