Introduction

The Near-LATTE (Lock And Track Type Engine) variant for ASR-listening within the Yobe SDK detects the voice of a pre-enrolled user and extracts it from surrounding noise and human crosstalk. The software enrolls a user from 10-20 seconds of that user's unscripted speech. Yobe Near-LATTE provides high-quality speaker voice signals, enabling Automatic Speech Recognition (ASR) platforms to function accurately in extremely complex environments. A typical use case for the Near-LATTE variant for ASR-listening is an enrolled user talking to a kiosk or point-of-sale tablet in environments with noisy human crosstalk (e.g. restaurants, drive-thrus, shopping malls, and other public spaces).

The Near-LATTE variant for ASR-listening has the following capabilities:

Training for Voice Template: The program captures a Voice Template from a speaker. The training is text independent and takes just 10-20 seconds of unscripted speech.
Audio Preparation for Accurate Transcription: The program uses the stored template to lock onto and track the voice of an enrolled user. The extracted voice is then passed as input to an ASR program for subsequent transcription.

Note: Only one user can be enrolled at a time.

When to use this variant

This is a Near-Listening scenario as shown in the diagram below:
The objective of this variant is that its output audio should contain the extracted voice of the pre-enrolled user in a form that enables accurate ASR transcription.

Package Contents

This Yobe Android SDK package contains the following elements:

The speech.aar library that provides the Yobe Android SDK
A sample Android Studio project that implements the Yobe Android SDK
The Yobe license

Prerequisites

Library

Place the provided .aar in the proper location for use in your Android project. The library is commonly included through the dependencies block of the project's build.gradle.

See the Android Studio Documentation on adding build dependencies for more details.

Permissions

IDListenerSpeech requires permission to use the device's microphones in order to capture audio data for processing. Place this entry inside the manifest tag of your project's AndroidManifest.xml:

<uses-permission android:name="android.permission.RECORD_AUDIO" />

IDListener (Voice Identification)

LATTE's main functionality is accessed via the com.yobe.speech.IDListener class.

Initialization

Create a new com.yobe.speech.IDListener instance. The instance must then be initialized using the license provided by Yobe, as well as two configuration arguments: the Microphone Orientation and the Output Buffer Type.

// java
IDListener idListener = new IDListener();
idListener.Init("YOBE_LICENSE", MicOrientation.END_FIRE, OutputBufferType.FIXED);

Register and Select User

A com.yobe.speech.IDTemplate must be created using the desired user's voice so the IDListener can select that template and identify the voice.

Register the user by inputting their voice audio data using com.yobe.speech.IDListener.RegisterTemplate. This is done using a continuous array of audio samples.

It is recommended to first process the audio data so that only speech is present; this yields better identification results. To achieve this, the IDListener can be placed into Enrollment Mode. In this mode, the audio that will be used to create the IDTemplate is processed buffer by buffer using com.yobe.speech.IDListener.ProcessBuffer, which returns a status of com.yobe.speech.Status.ENROLLING for as long as buffers are being processed in Enrollment Mode.

Enrollment Mode is started by calling com.yobe.speech.IDListener.StartEnrollment. It stops either when com.yobe.speech.IDListener.StopEnrollment is called manually or automatically once an internal counter has seen enough buffers (currently, enough to equal 20 seconds of audio).

Enrollment Mode can be started at any point after initial calibration. Any samples processed while in Enrollment Mode will not be matched for identification to a selected template, if there is one.
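To budget enrollment audio, it can help to estimate how many buffers make up the 20 seconds of audio mentioned above. The following is a minimal sketch only: EnrollmentMath is a hypothetical helper that is not part of the SDK, and the sample rate and buffer size are caller-supplied assumptions, since this document states neither (the real buffer size comes from GetBufferSizeSamples).

```java
// Hypothetical helper, not part of the Yobe SDK.
final class EnrollmentMath {
    private EnrollmentMath() {}

    /**
     * Returns the number of whole buffers needed to cover at least
     * {@code seconds} of audio. Both sampleRateHz and samplesPerBuffer are
     * per-channel values supplied by the caller (assumptions, not SDK facts).
     */
    static long buffersForDuration(double seconds, int sampleRateHz, int samplesPerBuffer) {
        if (seconds < 0 || sampleRateHz <= 0 || samplesPerBuffer <= 0) {
            throw new IllegalArgumentException("arguments must be non-negative / positive");
        }
        long totalSamples = (long) Math.ceil(seconds * sampleRateHz);
        // Ceiling division: a partially filled final buffer still counts as one buffer.
        return (totalSamples + samplesPerBuffer - 1) / samplesPerBuffer;
    }
}
```

For example, under an assumed 16 kHz sample rate and 1024-sample buffers, buffersForDuration(20, 16000, 1024) gives the number of ProcessBuffer calls that would span the automatic enrollment window.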

// java

// Options 1 and 2 are alternatives; use one or the other.

/**** (Option 1) register with unprocessed samples ****/
IDTemplate idTemplate = idListener.RegisterTemplate(samples);

/**** (Option 2) register with processed samples ****/
// process enough unspecified audio to calibrate
Status status = Status.NEEDS_MORE_DATA;
do {
    status = SpeechUtil.GetStatusFromResults(idListener.ProcessBuffer(inputBuffer));
    inputBuffer = GetNextInputBuffer(); // some example function to get next buffer
} while (status == Status.NEEDS_MORE_DATA);

// now that we've calibrated, process desired user audio in Enrollment Mode
short[] processedVoiceAudio = new short[someLength];
idListener.StartEnrollment();
do {
    Object[] result = idListener.ProcessBuffer(voiceInputBuffer);
    status = SpeechUtil.GetStatusFromResults(result);
    short[] processedAudio = SpeechUtil.GetAudioFromResults(result);
    // store processedAudio in processedVoiceAudio, such as with a for-loop
    //...

    voiceInputBuffer = GetNextVoiceBuffer(); // some example function to get next buffer
    if (voiceInputBuffer == null) {
        // the case where we've run out of voice audio before enrollment automatically stops
        idListener.StopEnrollment();
        break;
    }
} while (status == Status.ENROLLING);

// register with the processed audio
IDTemplate idTemplate = idListener.RegisterTemplate(processedVoiceAudio);
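The "store processedAudio in processedVoiceAudio" step in Option 2 is left open in the snippet. One way to fill it, when the total length is not known in advance, is a growable buffer of shorts. This is a sketch under that assumption; ShortAccumulator is a hypothetical helper, not an SDK class.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the Yobe SDK: a growable buffer of PCM samples.
final class ShortAccumulator {
    private short[] data;
    private int size;

    ShortAccumulator(int initialCapacity) {
        data = new short[Math.max(1, initialCapacity)];
    }

    /** Appends one processed buffer, growing the backing array when needed. */
    void append(short[] chunk) {
        int needed = size + chunk.length;
        if (needed > data.length) {
            // Grow geometrically so repeated appends stay amortized O(1) per sample.
            data = Arrays.copyOf(data, Math.max(needed, data.length * 2));
        }
        System.arraycopy(chunk, 0, data, size, chunk.length);
        size = needed;
    }

    /** Number of samples stored so far. */
    int size() {
        return size;
    }

    /** Returns a copy trimmed to exactly the samples appended so far. */
    short[] toArray() {
        return Arrays.copyOf(data, size);
    }
}
```

In the enrollment loop, each processedAudio array would be passed to append, and the result of toArray would then be the continuous array handed to RegisterTemplate.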

Select the user using the template returned by the registration.

// java

idListener.SelectUser(idTemplate);

Any new audio buffers passed to com.yobe.speech.IDListener.ProcessBuffer while not in Enrollment Mode will be processed with respect to the selected user's voice.

Process and Use Audio

Audio data is passed into the com.yobe.speech.IDListener one buffer at a time. See Audio Buffers for more details on their format. The audio is encoded as 16-bit PCM samples, held as Java short values.

// java
Object[] result = idListener.ProcessBuffer(buffer);
short[] processedAudio = SpeechUtil.GetAudioFromResults(result);

In the example above, result has an entry that contains the processed version of the audio held in buffer. A typical next step is to append that processed audio to a stream or to a larger buffer.
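Appending processed audio to a stream usually means converting the short samples to bytes first. The sketch below assumes the consumer wants 16-bit little-endian PCM, which is a common choice but is not stated in this document; PcmStreamWriter is a hypothetical helper, not an SDK class.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of the Yobe SDK.
final class PcmStreamWriter {
    private PcmStreamWriter() {}

    /** Converts 16-bit samples to bytes, low byte first (little-endian is an assumption). */
    static byte[] toLittleEndianBytes(short[] samples) {
        ByteBuffer bytes = ByteBuffer.allocate(samples.length * 2).order(ByteOrder.LITTLE_ENDIAN);
        // The short view inherits the byte order set on the underlying buffer.
        bytes.asShortBuffer().put(samples);
        return bytes.array();
    }

    /** Appends one processed buffer to the given stream as raw PCM bytes. */
    static void append(OutputStream out, short[] samples) throws IOException {
        out.write(toLittleEndianBytes(samples));
    }
}
```

Here processedAudio from the example above would be passed to append once per ProcessBuffer call.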

Note: You can find the library's built-in buffer size using com.yobe.speech.Util.GetBufferSizeSamples.
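When audio comes from a file or a long recording rather than arriving in library-sized chunks, it has to be cut into buffers before it can be processed one buffer at a time. The sketch below assumes ProcessBuffer expects buffers of exactly the built-in size, which this document implies but does not state, and it zero-pads the final buffer as one possible policy. BufferSplitter is a hypothetical helper, and bufferSize stands for the value obtained from GetBufferSizeSamples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the Yobe SDK.
final class BufferSplitter {
    private BufferSplitter() {}

    /**
     * Splits a continuous sample array into fixed-size buffers.
     * The last buffer is padded with zeros (silence) if the input
     * length is not a multiple of bufferSize.
     */
    static List<short[]> split(short[] samples, int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        List<short[]> buffers = new ArrayList<>();
        for (int start = 0; start < samples.length; start += bufferSize) {
            // copyOfRange zero-fills when the end index runs past the source array.
            buffers.add(Arrays.copyOfRange(samples, start, start + bufferSize));
        }
        return buffers;
    }
}
```

Dropping the incomplete tail instead of padding it is an equally valid policy; which one suits the library better is not covered here.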

Clean Up

To ensure proper clean up of the IDListener, simply call com.yobe.speech.IDListener.Deinit.

// java

idListener.Deinit();

IDListenerSpeech (Real-Time Voice Extraction)

LATTE's real-time functionality is accessed via the com.yobe.speech.IDListenerSpeech class.

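Because IDListenerSpeech captures audio from the device's microphones, the RECORD_AUDIO entry described under Permissions must be present in the manifest. In addition, your project must either prompt the user for the relevant permission or have the permission enabled in the app's settings on the device.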

Initialization

An IDListenerSpeech object can be created using a class that implements com.yobe.speech.AudioConsumer.

// java

IDListenerSpeech idListenerSpeech = new IDListenerSpeech(new MyAudioConsumer()); // creation of object

See Define Real-Time Processing Callbacks for details on MyAudioConsumer.

Define Real-Time Processing Callbacks

Creating a com.yobe.speech.IDListenerSpeech object requires an object that implements the callback functions in the com.yobe.speech.AudioConsumer interface. These callback functions will receive processed audio buffers for further processing in real-time. Audio data is captured by the device's microphones, processed, and sent to the com.yobe.speech.AudioConsumer.onDataFeed callback function one buffer at a time. See Audio Buffers for more details on audio buffers. The status after each processing step is sent via a call to the com.yobe.speech.AudioConsumer.onResponse callback function.

The output buffers are arrays of short values.

Note: The originalBuffer will contain two channels of interleaved audio data, while the processedBuffer will only contain one channel of audio data.

// java
import com.yobe.speech.*;
 
// create a class implementing the AudioConsumer callback functions
class MyAudioConsumer implements AudioConsumer {
    @Override
    public void onDataFeed(short[] originalBuffer, short[] processedBuffer) { /* do something with the original and/or processed buffers */ }
 
    @Override
    public void onResponse(Status code) { /* do something with the status */ }
}
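As noted above, originalBuffer carries two channels of interleaved audio. If the channels are needed separately, they can be split as sketched below. This assumes the usual interleaving order (channel 0 sample, channel 1 sample, channel 0 sample, and so on), which the document does not spell out; ChannelSplitter is a hypothetical helper, not an SDK class.

```java
// Hypothetical helper, not part of the Yobe SDK.
final class ChannelSplitter {
    private ChannelSplitter() {}

    /**
     * Splits interleaved samples into one array per channel.
     * result[c][f] is the sample of channel c in frame f.
     */
    static short[][] deinterleave(short[] interleaved, int channels) {
        if (channels <= 0 || interleaved.length % channels != 0) {
            throw new IllegalArgumentException("length must be a multiple of the channel count");
        }
        int frames = interleaved.length / channels;
        short[][] out = new short[channels][frames];
        for (int frame = 0; frame < frames; frame++) {
            for (int channel = 0; channel < channels; channel++) {
                out[channel][frame] = interleaved[frame * channels + channel];
            }
        }
        return out;
    }
}
```

Inside onDataFeed, this would be called as deinterleave(originalBuffer, 2).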

The processedBuffer callback argument in onDataFeed can be thought of as the output of the com.yobe.speech.IDListener.ProcessBuffer function. Further processing or storage can be done. However, it's important to minimize runtime spent in the callback function to keep the audio processing running at real-time speeds.
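One common way to keep the callback short is to hand each buffer to a bounded queue and do the heavy work on a separate thread. The sketch below illustrates that pattern only; BufferHandoff is a hypothetical class, not part of the SDK. It copies each buffer defensively because this document does not say whether the SDK reuses the array after the callback returns, and it drops buffers rather than blocking when the worker falls behind.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper, not part of the Yobe SDK.
final class BufferHandoff {
    private final BlockingQueue<short[]> queue;
    private final AtomicLong dropped = new AtomicLong();

    BufferHandoff(int capacity) {
        queue = new ArrayBlockingQueue<>(capacity);
    }

    /**
     * Intended to be called from onDataFeed: never blocks.
     * Returns false (and counts a drop) if the queue is full.
     */
    boolean offer(short[] processedBuffer) {
        boolean accepted = queue.offer(processedBuffer.clone());
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** Intended for the worker thread: waits up to timeoutMillis, returns null on timeout. */
    short[] poll(long timeoutMillis) throws InterruptedException {
        return queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    /** Number of buffers discarded because the worker could not keep up. */
    long droppedCount() {
        return dropped.get();
    }
}
```

With this in place, onDataFeed would contain little more than a call to offer(processedBuffer), while a worker thread loops on poll and performs storage or streaming.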

Register and Select User

The functions for IDTemplate registration and selection are the same as for the IDListener; see Register and Select User in the IDListener section. For live, processed enrollment, implement Option 2 in the onDataFeed callback.

Start Processing

Processing is started via com.yobe.speech.IDListenerSpeech.Start and stopped via com.yobe.speech.IDListenerSpeech.Stop. Once started, the callback functions will start being called with the processed audio buffers in real-time.

Note: An IDListenerSpeech object has a startup time of 5 seconds after Start is called. Until this startup time has passed, the object prompts for more data by reporting the com.yobe.speech.SpeechUtil.Status.NEEDS_MORE_DATA code in onResponse. After the startup time has passed, onResponse reports com.yobe.speech.SpeechUtil.Status.OK.

// java

idListenerSpeech.Start("YOBE_LICENSE"); // audio will start getting captured and processed

Clean Up

To stop and clean up the IDListenerSpeech object, simply call com.yobe.speech.IDListenerSpeech.Stop.

// java

idListenerSpeech.Stop(); // audio is no longer captured or processed