YobeSDK  0.3.0
Near-LATTE ASR-Listening

Introduction

The Near-LATTE (Lock And Track Type Engine) variant for ASR-listening within the Yobe SDK detects the voice of a pre-enrolled user and extracts it from surrounding noise and human crosstalk. The software enrolls a user from 10-20 seconds of unscripted speech. Yobe Near-LATTE produces high-quality speaker voice signals that enable Automatic Speech Recognition (ASR) platforms to function accurately even in extremely complex environments. A typical use case for the Near-LATTE variant for ASR-listening is an enrolled user talking to a kiosk or point-of-sale tablet amid noisy human crosstalk (e.g. restaurants, drive-thrus, shopping malls and other public spaces).

The Near-LATTE variant for ASR-listening has the following capabilities:

  • Training for Voice Template: The program captures a Voice Template from a speaker. The training is text independent and takes just 10-20 seconds of unscripted speech.
  • Audio Preparation for Accurate Transcription: The program uses the stored template to lock onto and track the voice of an enrolled user. The extracted voice is then suitable as input to an ASR program for subsequent transcription.

Note: Only one user can be enrolled at a time.

When to use this variant

  • This is a Near-Listening scenario, as shown in the diagram below.
  • The output audio contains the extracted voice of the pre-enrolled user in a form that enables accurate ASR transcription.

Installation

Place the provided libraries and header files in a location that can be discovered by your application's build system.
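The SDK does not prescribe a build system. As one illustration, a CMake project might wire in the headers and libraries like this; the paths, the `my_app` target, and the `yobe_sdk` library name are assumptions to adapt to your project, not names defined by the SDK:

```cmake
# Illustrative only: adjust paths, target, and library name to your project.
target_include_directories(my_app PRIVATE ${CMAKE_SOURCE_DIR}/third_party/yobe/include)
target_link_directories(my_app PRIVATE ${CMAKE_SOURCE_DIR}/third_party/yobe/lib)
target_link_libraries(my_app PRIVATE yobe_sdk)
```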

IDListener (Voice Identification)

LATTE's main functionality is accessed via the Yobe::IDListener class.

Initialization

Yobe::Create::NewIDListener is used to obtain a shared pointer to a new Yobe::IDListener instance. The instance must then be initialized using the license provided by Yobe, a path to its initialization data, and two configuration arguments: the Microphone Orientation and the Output Buffer Type.

// cpp
auto id_listener = Yobe::Create::NewIDListener();
id_listener->Init("YOBE_LICENSE", "init_data_path", Yobe::MicOrientation::END_FIRE, Yobe::OutputBufferType::YOBE_VARIABLE);
Declarations and enum values used above:

  • Yobe::Create::NewIDListener(): creates a new instance of IDListener, returned as a std::shared_ptr<IDListener>.
  • Yobe::OutputBufferType::YOBE_VARIABLE: allows the output buffer to change size between calls.
  • Yobe::MicOrientation::END_FIRE: this orientation has the target voice parallel to the line connecting the two microphones (ideally th...

Register and Select User

A Yobe::BiometricTemplate must be created using the desired user's voice so the IDListener can select that template and identify the voice.

Register the user by inputting their voice audio data using Yobe::IDListener::RegisterTemplate. This is done by passing a contiguous array of audio samples.

It is recommended to first process the audio data so that only speech is present; this yields better identification results. To achieve this, the IDListener can be placed into Enrollment Mode. In this mode, the audio that should be used to create a BiometricTemplate is processed, buffer-by-buffer, using Yobe::IDListener::ProcessBuffer, which returns a status of Yobe::Status::ENROLLING as long as buffers are being processed in Enrollment Mode. Enrollment Mode is started by calling Yobe::IDListener::StartEnrollment, and is stopped either by manually calling Yobe::IDListener::StopEnrollment or by processing enough buffers for it to stop automatically based on an internal counter (currently, enough buffers to equal 20 seconds of audio).

Enrollment Mode can be started at any point after initial calibration. Any samples processed while in Enrollment Mode will not be matched for identification to a selected template, if there is one.

// cpp
/****(Option 1) register with unprocessed samples****/
auto biometric_template = id_listener->RegisterTemplate(samples, samples_size);

/****(Option 2) register with processed samples****/
// process enough unspecified audio to calibrate
while (id_listener->ProcessBuffer(input_buffer, out_buffer, input_size, is_user_verify) == Yobe::Status::NEEDS_MORE_DATA) {
    input_buffer = GetNextInputBuffer(); // some example function to get the next buffer
    input_size = GetNextInputSize();
}

// now that we've calibrated, process the desired user's audio in Enrollment Mode
id_listener->StartEnrollment();
std::vector<double> processed_voice{};
while (id_listener->ProcessBuffer(voice_input_buffer, voice_out_buffer, voice_input_size, is_user_verify) == Yobe::Status::ENROLLING) {
    voice_input_buffer = GetNextVoiceBuffer(); // some example function to get the next buffer
    if (voice_input_buffer == nullptr) {
        // we ran out of voice audio before enrollment stopped automatically
        id_listener->StopEnrollment();
        break;
    }
    voice_input_size = GetNextVoiceSize();
    // continuously append the processed output to a vector
    processed_voice.insert(processed_voice.end(), voice_out_buffer.begin(), voice_out_buffer.end());
}

// register with the processed audio
auto biometric_template = id_listener->RegisterTemplate(processed_voice.data(), processed_voice.size());
Status values seen above:

  • Yobe::Status::NEEDS_MORE_DATA: the algorithm needs more data before it can start processing the audio.
  • Yobe::Status::ENROLLING: the last buffer was processed while the IDListener was configured for enrolling a new Biom...

Select the user using the template returned by the registration.

// cpp
id_listener->SelectUser(biometric_template);

Any new audio buffers passed to Yobe::IDListener::ProcessBuffer while not in Enrollment Mode will be processed with respect to the selected user's voice.

Process and Use Audio

Audio data is passed into the Yobe::IDListener one buffer at a time. See Audio Buffers for more details on their format. As seen in the method signatures for the Yobe::IDListener::ProcessBuffer functions, the audio can be encoded as double or PCM 16-bit integer samples. When the YOBE_VARIABLE output buffer type is used, the output buffer size can also vary from call to call, so prepare your application to handle buffers of different sizes; in practice, the size only changes on the transition from the unauthorized to the authorized state.
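Since ProcessBuffer accepts either encoding, you may need to convert captured PCM 16-bit audio to doubles. A minimal sketch; the helper name and the [-1.0, 1.0) scaling convention are assumptions for illustration, not part of the Yobe API:

```cpp
#include <cstdint>
#include <vector>

// Convert PCM 16-bit samples to doubles in [-1.0, 1.0).
// Divides by 32768 so that INT16_MIN maps exactly to -1.0.
std::vector<double> PcmToDouble(const std::vector<int16_t>& pcm) {
    std::vector<double> out;
    out.reserve(pcm.size());
    for (int16_t s : pcm) {
        out.push_back(static_cast<double>(s) / 32768.0);
    }
    return out;
}
```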

// cpp
id_listener->ProcessBuffer(input_buffer, out_buffer, input_size, is_user_detected);

out_buffer in the above example now contains the processed version of the audio contained in input_buffer. One common use of out_buffer is to append its contents to a stream or larger buffer. A boolean out-parameter stores whether the selected user was detected in the last buffer of audio; in this example, the variable is_user_detected is assumed to be initialized before the call.
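The append-to-a-larger-buffer pattern can be sketched with a small accumulator. OutputAccumulator is a hypothetical helper, not part of the Yobe API; it collects the variable-size output buffers into one stream and also counts transitions of the detection flag from undetected to detected:

```cpp
#include <vector>

// Accumulates variable-size output buffers into one stream and counts
// transitions from "user not detected" to "user detected".
struct OutputAccumulator {
    std::vector<double> stream;
    bool was_detected = false;
    int detect_transitions = 0;

    void Add(const std::vector<double>& out_buffer, bool is_user_detected) {
        // append this buffer's samples to the growing stream
        stream.insert(stream.end(), out_buffer.begin(), out_buffer.end());
        if (is_user_detected && !was_detected) {
            ++detect_transitions;
        }
        was_detected = is_user_detected;
    }
};
```

In a real loop you would call `acc.Add(out_buffer, is_user_detected)` after each ProcessBuffer call, then hand the accumulated stream to your ASR front end.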

Note: You can find the library's built-in buffer size using Yobe::Info::InputBufferSize.
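Because enrollment stops automatically after roughly 20 seconds of audio, it can be useful to estimate how many buffers that is. A sketch; the 512-sample buffer and 16 kHz rate below are illustrative assumptions (in practice, take the buffer size from Yobe::Info::InputBufferSize and the sample rate from your capture configuration):

```cpp
#include <cstddef>

// Number of buffers needed to cover `seconds` of audio, given a buffer
// size in samples and a sample rate in Hz (rounds the final partial
// buffer up).
std::size_t BuffersForDuration(std::size_t buffer_samples,
                               std::size_t sample_rate_hz,
                               std::size_t seconds) {
    std::size_t total_samples = sample_rate_hz * seconds;
    return (total_samples + buffer_samples - 1) / buffer_samples;
}
```

For example, with the assumed 512-sample buffers at 16 kHz, 20 seconds of audio is 625 buffers.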

Deinitialization

To ensure the IDListener is properly deinitialized, simply call Yobe::IDListener::Deinit.

// cpp
id_listener->Deinit();