The Speech Kit library provides the classes necessary to perform network-based speech recognition and text-to-speech synthesis. This library provides a simple, high-level speech service API that automatically performs all the tasks necessary for speech recognition or synthesis, including audio recording, audio playback, and network connection management.
The following sections describe how to connect to a speech server and perform speech recognition or synthesis:
- “Speech Kit Basics” provides an overview of the Speech Kit library.
- “Connecting to a Speech Server” details the top-level server connection process.
- “Recognizing Speech” describes how to use a network recognizer to transcribe speech.
- “Converting Text to Speech” shows how to use the network-based vocalizer to convert text to speech.
The Speech Kit library allows you to add voice recognition and text-to-speech services to your applications easily and quickly. This library provides access to speech processing components hosted on a server through a clean asynchronous network service API, minimizing overhead and resource consumption. The Speech Kit library lets you provide fast voice search, dictation, and high-quality, multilingual text-to-speech functionality in your application.
The Speech Kit library is a full-featured, high-level library that automatically manages all the required low-level services.
At the application level, there are two main components available to the developer: the recognizer and the text-to-speech synthesizer.
In the library there are several coordinated processes:
- The library fully manages the audio system for recording and playback.
- The networking component manages the connection to the server and, at the start of a new request, automatically re-establishes connections that have timed out.
- The end-of-speech detector determines when the user has stopped speaking and automatically stops recording.
- The encoding component compresses and decompresses the streaming audio to reduce bandwidth requirements and decrease latency.
The server system is responsible for the majority of the work in the speech processing cycle. The complete recognition or synthesis procedure is performed on the server, consuming or producing the streaming audio. In addition, the server manages authentication as configured through the developer portal.
To use Speech Kit, you will need to have the Android SDK installed. Instructions for installing the Android SDK can be found at http://developer.android.com/sdk/index.html. You can use the Speech Kit library in the same way that you would use any standard JAR library.
To start using the Speech Kit library, add it to your new or existing project, as follows:
- Copy the libs folder into the root of the project folder for your Android project. The libs folder contains an armeabi subfolder that contains the file libnmsp_speex.so. This is a native library that is required to use Speech Kit.
- From the menu select Project ‣ Properties....
- In the popup menu, select Java Build Path from the menu at the left.
- In the right panel of the popup menu, select the Libraries tab.
- Use the Add External JARs button to add nmdp_speech_kit.jar.
To view the Javadoc for Speech Kit in Eclipse, you must tell Eclipse where to find the class documentation. This can be done with the following steps:
- In the Package Explorer tab for your project, a Referenced Libraries heading should appear inside your project. Expand this heading so that all referenced libraries are visible.
- Right-click nmdp_speech_kit.jar and select Properties.
- In the popup menu, select Javadoc Location from the menu at the left.
- In the right panel of the popup menu, select the Javadoc URL radio button.
- Click the Browse button to the right of Javadoc location path.
- Browse to and select the Speech Kit javadoc folder.
You also need to add the necessary permissions to AndroidManifest.xml so that the application can carry out the needed operations. This can be done as follows:
- In the Package Explorer tab for your project, open AndroidManifest.xml.
- Add the following lines immediately before the end of the manifest tag.
<uses-permission android:name="android.permission.ACCESS_NETWORK_STATE"></uses-permission>
<uses-permission android:name="android.permission.INTERNET"></uses-permission>
<uses-permission android:name="android.permission.RECORD_AUDIO"></uses-permission>
<uses-permission android:name="android.permission.READ_PHONE_STATE"></uses-permission>
...
</manifest>
- If you want to use prompts that vibrate, you will need to include the following additional permission:
<uses-permission android:name="android.permission.VIBRATE"></uses-permission>
You are now ready to start using recognition and text-to-speech services.
While using the Speech Kit library, you will occasionally encounter errors. The SpeechError class uses the codes defined in SpeechError.Codes to identify the various types of possible errors.
There are effectively two types of errors that can be expected in this framework.
- The first type is service connection errors, which include the SpeechError.Codes.ServerConnectionError and SpeechError.Codes.ServerRetryError codes. These errors indicate that there is some kind of failure in the connection with the speech server. The failure may be temporary and can be solved by retrying the query, or it may be the result of an authorization failure or some other network problem.
- The second type is speech processing errors, which include the SpeechError.Codes.RecognizerError and SpeechError.Codes.VocalizerError codes. These errors indicate a problem with the speech request, ranging from a text format issue to an audio detection failure.
It is essential to always monitor for errors, as signal conditions may generate errors even in a correctly implemented application. The application’s user interface needs to respond appropriately and elegantly to ensure a robust user experience.
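As an illustration only, an error handler might branch on the error category before deciding whether to retry. This sketch assumes the codes are integer constants and that SpeechError exposes the current code through a getErrorCode accessor; verify both against the class documentation.

void handleSpeechError(SpeechError error) {
    int code = error.getErrorCode();  // assumed accessor
    if (code == SpeechError.Codes.ServerConnectionError
            || code == SpeechError.Codes.ServerRetryError) {
        // Connection-level failure: possibly temporary, so offer the user a retry
    } else if (code == SpeechError.Codes.RecognizerError
            || code == SpeechError.Codes.VocalizerError) {
        // Speech processing failure: report the problem instead of retrying blindly
    }
}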
The Speech Kit library is a network service and requires some basic setup before you can use either the recognition or text-to-speech classes.
This setup performs two primary operations:
First, it identifies and authorizes your application.
Second, it optionally establishes a connection to the speech server immediately, allowing for fast initial speech requests and thus enhancing the user experience.
Note
This network connection requires authorization credentials and server details set by the developer. The necessary credentials are provided through the Dragon Mobile SDK portal at http://dragonmobile.nuancemobiledeveloper.com.
The application key SpeechKitApplicationKey is required by the Speech Kit library and must be set by the developer. This key is effectively your application’s password for the speech server and should be kept secret to prevent misuse.
Your unique credentials, provided through the developer portal, include the necessary line of code to set this value. Thus, this process is as simple as copying and pasting the line into your source file. You must set this key before you initialize the Speech Kit system. For example, you configure the application key as follows:
static final byte[] SpeechKitApplicationKey = {
    (byte)0x12, (byte)0x34, ..., (byte)0x89
};
The setup method, SpeechKit.initialize(), takes six parameters:
- An application Context (android.content.Context)
- An application identifier
- A server address
- A port
- The SSL setting
- The application key defined above.
The appContext parameter is used to determine application level information such as the state of the network. It can be defined with code such as:
Context context = getApplication().getApplicationContext();
The ID parameter identifies your application and is used in conjunction with your key to provide authorization to the speech server.
The host and port parameters define the speech server, which may differ from application to application. Therefore, you should always use the values provided with your authentication parameters.
The ssl parameter indicates whether to connect to the speech server using SSL. The specified server and port must support the given SSL setting, or else a connection error will occur.
The applicationKey stores the key that identifies your application to the server.
The library is configured in the following example:
SpeechKit sk = SpeechKit.initialize(context,
                                    speechKitAppId,
                                    speechKitServer,
                                    speechKitPort,
                                    speechKitSsl,
                                    speechKitApplicationKey);
Note
This method is meant to be called one time per application execution to configure the underlying network connection. This method does not attempt to establish the connection to the server.
At this point the Speech Kit library is fully configured. The connection to the server will be established automatically when needed. To make sure the next recognition or vocalization is as fast as possible, connect to the server in advance using the optional connect method.
sk.connect();
Note
This method does not indicate failure. Instead, the success or failure of the setup is known when the Recognizer and Vocalizer classes are used.
When the connection is opened, it will remain open for some period of time, ensuring that subsequent speech requests are served quickly as long as the user is actively making use of speech. If the connection times out and closes, it will be re-opened automatically on the next speech request or call to connect .
The application is now configured and ready to recognize and synthesize speech.
The recognizer allows users to speak instead of type in locations where text entry would generally be required. The speech recognizer returns a list of text results. It is not attached to any UI object in any way, so the presentation of the best result and the selection of alternative results are left up to the UI of the application.
Before you use speech recognition, ensure that you have set up the core Speech Kit library with the SpeechKit.initialize method.
Then create and initialize a Recognizer object:
recognizer = sk.createRecognizer(Recognizer.RecognizerType.Dictation,
                                 Recognizer.EndOfSpeechDetection.Short,
                                 "en_US",
                                 this,
                                 handler);
The SpeechKit.createRecognizer method initializes a recognizer that is used to perform the speech recognition process.
The type parameter is a String, generally one of the recognition type constants defined in the Speech Kit library and available in the class documentation for Recognizer. Nuance may provide you with a different value for your unique recognition needs, in which case you will enter the raw String.
The detection parameter determines the end-of-speech detection model and must be one of the Recognizer.EndOfSpeechDetection types.
The language parameter defines the speech language as a string in the format of the ISO 639 language code, followed by an underscore “_”, followed by the ISO 3166-1 country code.
Note
For example, the English language as spoken in the United States is en_US. An up-to-date list of supported languages for recognition is available on the FAQ at http://dragonmobile.nuancemobiledeveloper.com/faq.php.
The this parameter defines the object to receive status, error, and result messages from the recognizer. It can be replaced with any object that implements the Recognizer.Listener interface.
handler should be an android.os.Handler object that was created with
Handler handler = new Handler();
Handler is a special Android object that processes messages. It is needed to receive call-backs from the Speech Kit library. This object can be created inside an Activity that is associated with the main window of your application, or with the windows or controls where voice recognition will actually be used.
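For example, a minimal sketch of creating the handler inside an Activity (the class name is illustrative):

import android.app.Activity;
import android.os.Bundle;
import android.os.Handler;

public class DictationActivity extends Activity {
    private Handler handler;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        // Created on the main (UI) thread, so call-backs delivered through this
        // handler can safely update the user interface.
        handler = new Handler();
    }
}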
Start the recognition by calling start.
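For example, with the recognizer created above:

recognizer.start();  // begins recording and streaming audio to the speech server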
The Recognizer.Listener passed into SpeechKit.createRecognizer receives the recognition results or error messages, as described below.
Prompts are short audio clips or vibrations that may be played at various stages of a recognition.
The SpeechKit.defineAudioPrompt method defines an audio prompt from a raw resource ID packaged with the Android application. Audio prompts may consume significant system resources until release is called, so try to minimize the number of instances. The Prompt.vibrate method defines a vibration prompt. Vibration prompts are inexpensive: they can be created on the fly as they are used, and there is no need to release them.
Call SpeechKit.setDefaultRecognizerPrompts to specify the audio or vibration prompts to play during all recognitions by default. To override the default prompts in a specific recognition, call setPrompt prior to calling start.
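The following is only a rough sketch: the argument lists of defineAudioPrompt, Prompt.vibrate, and setDefaultRecognizerPrompts are not shown in this document, so the resource ID, the vibration duration in milliseconds, and the assumption of one prompt per recognition stage should all be verified against the Speech Kit Javadoc.

// Assumed signatures -- verify against the Speech Kit class documentation.
Prompt startChime = sk.defineAudioPrompt(R.raw.start_chime);   // hypothetical raw audio resource
Prompt stopBuzz   = Prompt.vibrate(100);                       // assumed duration in milliseconds
// Assumed parameter order: prompts for recording start, recording stop, error, cancel.
sk.setDefaultRecognizerPrompts(startChime, stopBuzz, null, null);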
To retrieve the recognition results, implement the Recognizer.Listener.onResults method. For example:
public void onResults(Recognizer recognizer, Recognition results) {
    String topResult;
    if (results.getResultCount() > 0) {
        topResult = results.getResult(0).getText();
        // do something with topResult...
    }
}
This method will be called only on successful completion, and the results list will have zero or more results.
Even in the absence of an error, the recognition results object may contain a suggestion from the speech server. This suggestion should be presented to the user.
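For example, inside onResults you might check for a suggestion; whether an absent suggestion is reported as null or as an empty string is an assumption handled defensively here.

String suggestion = results.getSuggestion();
if (suggestion != null && suggestion.length() > 0) {
    // Display the server's suggestion to the user, for example in a dialog
}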
To be informed of any recognition errors, implement the onError method of the Recognizer.Listener interface. In the case of errors, only this method will be called; conversely, on success this method will not be called. In addition to the error, a suggestion, as described in the previous section, may or may not be present. Note that both the Recognition and the SpeechError class have a getSuggestion method that can be used to check for a suggestion from the server.
public void onError(Recognizer recognizer, SpeechError error) {
    // Inform the user of the error and suggestion
}
Optionally, to be informed when the recognizer starts or stops recording audio, implement the onRecordingBegin and onRecordingDone methods of the Recognizer.Listener interface. There may be a delay between initialization of the recognizer and the actual start of recording, so the onRecordingBegin message can be used to signal to the user when the system is listening.
public void onRecordingBegin(Recognizer recognizer) {
    // Update the UI to indicate the system is now recording
}
The onRecordingDone message is sent before the speech server has finished receiving and processing the audio, and therefore before the result is available.
public void onRecordingDone(Recognizer recognizer) {
    // Update the UI to indicate that recording has stopped and the speech is still being processed
}
This message is sent both with and without end-of-speech detection models in place; it is sent whether recording was stopped by a call to the stopRecording method or by end-of-speech detection.
In some scenarios, especially for longer dictations, it is useful to provide the user with visual feedback of the volume of their speech. The Recognizer interface supports this feature through the getAudioLevel method, which returns the relative power level of the recorded audio in decibels. This value is a float in the range 0.0 to -90.0 dB, where 0.0 is the highest power level and -90.0 is the lowest. This method should be called during recording, specifically between receiving the onRecordingBegin and onRecordingDone messages. Generally, you should use a timer to read the power level regularly.
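For example, a minimal polling sketch that reuses the handler created earlier; the 50 ms interval and the UI update are illustrative choices.

private final Runnable audioLevelPoller = new Runnable() {
    @Override
    public void run() {
        float level = recognizer.getAudioLevel();   // 0.0 dB (loudest) down to -90.0 dB (quietest)
        // Update a volume meter in the UI with 'level' here.
        handler.postDelayed(this, 50);              // poll again in 50 ms
    }
};

// Start polling in onRecordingBegin:  handler.post(audioLevelPoller);
// Stop polling in onRecordingDone:    handler.removeCallbacks(audioLevelPoller);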
The Vocalizer class provides a network text-to-speech interface for developers.
Before you use speech synthesis, ensure that you have set up the core Speech Kit library with the SpeechKit.initialize method.
Then create and initialize a Vocalizer object to perform text-to-speech conversion:
Vocalizer voc = sk.createVocalizerWithLanguage("en_US", this, handler);
The SpeechKit.createVocalizerWithLanguage method initializes a text-to-speech synthesizer with a default language.
The language parameter is a String that defines the spoken language in the format of the ISO 639 language code, followed by an underscore “_”, followed by the ISO 3166-1 country code. For example, the English language as spoken in the United States is en_US . Each supported language has one or more uniquely defined voices, either male or female.
Note
An up-to-date list of supported languages for text-to-speech is available at http://dragonmobile.nuancemobiledeveloper.com/faq.php . The list of supported languages will be updated when new language support is added. The new languages will not necessarily require updating an existing Dragon Mobile SDK.
The this parameter defines the object to receive status and error messages from the speech synthesizer. It can be replaced with any object that implements the Vocalizer.Listener interface.
handler should be an android.os.Handler object that was created with
Handler handler = new Handler();
Handler is a special Android object that processes messages. It is needed to receive call-backs from the Speech Kit library. This object can be created inside an Activity that is associated with the main window of your application, or with the windows or controls where text-to-speech will actually be used.
The SpeechKit.createVocalizerWithLanguage method uses a default voice chosen by Nuance. To select a different voice, use the createVocalizerWithVoice method instead.
The voice parameter is a String that defines the voice model. For example, the female US English voice is Samantha .
Note
The up-to-date list of supported voices is provided with the supported languages at http://dragonmobile.nuancemobiledeveloper.com/faq.php .
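For example, assuming createVocalizerWithVoice mirrors createVocalizerWithLanguage and takes the voice name in place of the language code:

Vocalizer voc = sk.createVocalizerWithVoice("Samantha", this, handler);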
To begin converting text to speech, you must use either the speakString or speakMarkupString method. These methods send the requested string to the speech server and start streaming and playing audio on the device.
voc.speakString("Hello world.", context);
Note
The speakMarkupString method is used in exactly the same manner as speakString, except that it takes a String filled with SSML, a markup language tailored for describing synthesized speech. An advanced discussion of SSML is beyond the scope of this document; however, you can find more information from the W3C at http://www.w3.org/TR/speech-synthesis/.
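For example, a brief sketch that passes a minimal SSML document to speakMarkupString; the markup shown is standard SSML and the break duration is arbitrary.

String ssml = "<?xml version=\"1.0\"?>"
        + "<speak version=\"1.0\" xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"en-US\">"
        + "Hello <break time=\"300ms\"/> world."
        + "</speak>";
voc.speakMarkupString(ssml, context);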
As speech synthesis is a network-based service, these methods are all asynchronous, and in general an error condition is not immediately reported. Any errors are reported as messages to the Vocalizer.Listener that was passed to createVocalizerWithLanguage or createVocalizerWithVoice .
The speakString and speakMarkupString methods may be called multiple times for a single Vocalizer instance. To change the language or voice without having to create a new Vocalizer, call setLanguage or setVoice.
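For example, a brief sketch of reusing the same Vocalizer for a second request in another language; it assumes setLanguage accepts the same language-code format used at creation time.

// First request in the vocalizer's current language.
voc.speakString("Hello world.", context);
// Later, switch the language for subsequent requests (assumed ISO language_country format).
voc.setLanguage("fr_FR");
voc.speakString("Bonjour tout le monde.", context);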
The synthesized speech will not immediately start playback. Rather, there will be a brief delay as the request is sent to the speech server and speech is streamed back. For UI coordination, the optional Vocalizer.Listener.onSpeakingBegin method is provided to indicate when audio playback begins.
public void onSpeakingBegin(Vocalizer vocalizer, String text, Object context) {
    // update UI to indicate that text is being spoken
}
The context in the message is a reference to the context object that was passed to the speakString or speakMarkupString method, and may be used to track sequences of playback when sequential text-to-speech requests are made.
On completion of the speech playback, the Vocalizer.Listener.onSpeakingDone message is sent. This message is always sent on successful completion and on error. In the success case, error is null .
public void onSpeakingDone(Vocalizer vocalizer, String text, SpeechError error, Object context) {
    if (error != null) {
        // Present error dialog to user
    } else {
        // Update UI to indicate speech is complete
    }
}