Introduction

The Near-LATTE (Lock And Track Type Engine) variant for AVR-listening within the Yobe SDK performs Automatic Voice Recognition (AVR) on the voice of a pre-enrolled user. It extracts that user's voice from surrounding noise and human crosstalk, determines when the user is silent, and mutes the entire signal during those intervals. The software can also enroll a user from 10-20 seconds of unscripted speech; the resulting Voice Template is what identifies the intervals in which the enrolled user is not talking. Typical use cases for the Near-LATTE variant for AVR-listening are a registered user in a two-way conversation who needs the signal entirely muted whenever they are not speaking, and a solution that must input or record only the authorized speaker at all times, even when other people are speaking and other noise sources are present.

The Near-LATTE variant for AVR-listening has the following capabilities:

Training for Voice Template: The program captures a Voice Template from a talker. The training is text independent and takes just 10-20 seconds of unscripted speech.
Audio Preparation for Automatic Muting/Unmuting: The program uses the stored template to lock onto and extract the voice of an enrolled user and automatically mutes the signal whenever the enrolled user is not talking.

Note: Only one user can be enrolled at a time.

When to use this variant

This is a Near-Listening scenario as shown in the diagram below:
The output audio of this variant is muted whenever the pre-enrolled user is not talking, and contains the extracted voice of the user whenever they are talking.
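An application consuming this output may want to tell muted buffers from buffers that carry the user's voice. The following is a minimal sketch of such a check, not part of the Yobe SDK. The helper name IsMutedBuffer, the use of std::vector<double> for a buffer, and the assumption that a muted buffer consists of samples at or near zero are all illustrative.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper, not part of the Yobe SDK.
// Assumption: a muted buffer holds only samples whose magnitude is at or
// below a small threshold. The threshold value is illustrative.
bool IsMutedBuffer(const std::vector<double>& buffer, double threshold = 1e-9) {
    for (double sample : buffer) {
        if (std::fabs(sample) > threshold) {
            return false;  // audible content found, so the buffer is not muted
        }
    }
    return true;  // every sample is within the threshold (an empty buffer counts as muted)
}
```

A recorder could, for example, skip writing buffers for which this check returns true.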

Installation

Place the provided libraries and header files in a location that can be discovered by your application's build system.

IDListener (Voice Identification)

LATTE's main functionality is accessed via the Yobe::IDListener class.

Initialization

Yobe::Create::NewIDListener is used to obtain a shared pointer to a new Yobe::IDListener instance. The instance must then be initialized using the license provided by Yobe, a path argument (the "init_data_path" placeholder in the example below), and two configuration arguments: the Microphone Orientation and the Output Buffer Type.

// cpp
auto id_listener = Yobe::Create::NewIDListener();
id_listener->Init("YOBE_LICENSE", "init_data_path", Yobe::MicOrientation::END_FIRE, Yobe::OutputBufferType::YOBE_FIXED);

Register and Select User

A Yobe::BiometricTemplate must be created using the desired user's voice so the IDListener can select that template and identify the voice.

Register the user by inputting their voice audio data using Yobe::IDListener::RegisterTemplate. This is done using a continuous array of audio samples.

It is recommended to first process the audio data so that only speech is present in the audio; this will yield better identification results. To achieve this, the IDListener can be placed into Enrollment Mode. In this mode, the audio that should be used to create a BiometricTemplate is processed, buffer-by-buffer, using Yobe::IDListener::ProcessBuffer. ProcessBuffer returns a status of Yobe::Status::ENROLLING for as long as the buffers are being processed in Enrollment Mode. Enrollment Mode is started by calling Yobe::IDListener::StartEnrollment. It stops either when Yobe::IDListener::StopEnrollment is called manually, or automatically once an internal counter has seen enough buffers (currently, enough buffers to equal 20 seconds of audio).

Enrollment Mode can be started at any point after initial calibration. Any samples processed while in Enrollment Mode will not be matched for identification to a selected template, if there is one.
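To estimate how many ProcessBuffer calls the 20-second automatic stop corresponds to in a given pipeline, the duration can be converted to a buffer count. The sketch below is plain arithmetic and does not call the SDK. The function name BuffersForDuration is hypothetical, and the sample rate and buffer size are inputs you must supply, since this document does not state the values the library uses.

```cpp
#include <cstddef>

// Hypothetical helper, not part of the Yobe SDK.
// Returns the number of buffers needed to cover `seconds` of audio, rounding
// up so that a partially filled final buffer is counted.
// sample_rate_hz and buffer_size are caller-supplied assumptions.
std::size_t BuffersForDuration(double seconds,
                               std::size_t sample_rate_hz,
                               std::size_t buffer_size) {
    if (buffer_size == 0 || seconds <= 0.0) {
        return 0;  // nothing sensible to compute
    }
    const auto total_samples =
        static_cast<std::size_t>(seconds * static_cast<double>(sample_rate_hz));
    return (total_samples + buffer_size - 1) / buffer_size;  // ceiling division
}
```

For instance, assuming 16000 samples per second and 512 samples per buffer, 20 seconds of audio is 320000 samples, or 625 buffers.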

// cpp
 
/****(Option 1) register with unprocessed samples****/
auto biometric_template = id_listener->RegisterTemplate(samples, samples_size);
 
/****(Option 2) register with processed samples****/
// process enough unspecified audio to calibrate
while (id_listener->ProcessBuffer(input_buffer, out_buffer, input_size, is_user_verify) == Yobe::Status::NEEDS_MORE_DATA) {
    input_buffer = GetNextInputBuffer(); // some example function to get next buffer
    input_size = GetNextInputSize();
}
 
// now that we've calibrated, process desired user audio in Enrollment Mode
id_listener->StartEnrollment();
std::vector<double> processed_voice{};
while (id_listener->ProcessBuffer(voice_input_buffer, voice_out_buffer, voice_input_size, is_user_verify) == Yobe::Status::ENROLLING) {
    // keep the output of the buffer that was just processed
    processed_voice.insert(processed_voice.end(), voice_out_buffer.begin(), voice_out_buffer.end());
    voice_input_buffer = GetNextVoiceBuffer(); // some example function to get next buffer
    if (voice_input_buffer == nullptr) {
        // the case where we've run out of voice audio before enrollment automatically stops
        id_listener->StopEnrollment();
        break;
    }
    voice_input_size = GetNextVoiceSize();
}
 
// register with the processed audio
auto biometric_template = id_listener->RegisterTemplate(processed_voice.data(), processed_voice.size());

Select the user using the template returned by the registration.

// cpp

id_listener->SelectUser(biometric_template);

Any new audio buffers passed to Yobe::IDListener::ProcessBuffer while not in Enrollment Mode will be processed with respect to the selected user's voice.

Process and Use Audio

Audio data is passed into the Yobe::IDListener one buffer at a time. See Audio Buffers for more details on their format. As the method signatures of the Yobe::IDListener::ProcessBuffer functions show, the audio can be encoded as Double or as PCM 16-bit Integer. The output buffer size can also vary from call to call. Use the size of the returned output buffer to prepare your application to deal with the buffers accordingly. For this variant, the size changes only on a transition from the unauthorized to the authorized state.
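If the audio source delivers one encoding and the rest of the application works in the other, the samples can be converted on the application side. The sketch below shows one common conversion between PCM 16-bit integers and doubles. It is not SDK code: the function names are hypothetical, and it assumes double samples are normalized to the range [-1.0, 1.0], which this document does not confirm for the Yobe SDK.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helpers, not part of the Yobe SDK.
// Assumption: double samples are normalized to [-1.0, 1.0].

// Scales each 16-bit sample by 1/32768 so that -32768 maps to -1.0.
std::vector<double> Pcm16ToDouble(const std::vector<std::int16_t>& pcm) {
    std::vector<double> out;
    out.reserve(pcm.size());
    for (std::int16_t sample : pcm) {
        out.push_back(static_cast<double>(sample) / 32768.0);
    }
    return out;
}

// Scales by 32768, clamps to the int16 range, and rounds to nearest.
std::vector<std::int16_t> DoubleToPcm16(const std::vector<double>& samples) {
    std::vector<std::int16_t> out;
    out.reserve(samples.size());
    for (double sample : samples) {
        const double scaled =
            std::min(32767.0, std::max(-32768.0, sample * 32768.0));
        out.push_back(static_cast<std::int16_t>(std::lround(scaled)));
    }
    return out;
}
```

Clamping before the cast keeps out-of-range values such as 1.0 or above from overflowing the 16-bit type.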

// cpp

id_listener->ProcessBuffer(input_buffer, out_buffer, input_size, is_user_detected);

out_buffer in the above example now contains the processed version of the audio that is contained in input_buffer. An example of what to do with this out_buffer is to append its contents to a stream or larger buffer. The boolean out-parameter reports whether the selected user was detected in the buffer of audio that was just processed. In this example, the variable is_user_detected is assumed to be declared before this call.
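Appending each output buffer to a larger stream, as suggested above, can be sketched as follows. The OutputStream type is hypothetical and not part of the Yobe SDK. It assumes the output buffer is a std::vector<double>, and it tolerates buffers of any size, including empty ones, because the output size can vary from call to call.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical accumulator, not part of the Yobe SDK.
// Assumption: each processed output buffer is a std::vector<double>.
struct OutputStream {
    std::vector<double> samples;       // all processed audio appended so far
    std::size_t total_buffers = 0;     // number of buffers appended
    std::size_t detected_buffers = 0;  // buffers in which the user was detected

    // Appends one processed buffer and records the detection flag for it.
    void Append(const std::vector<double>& out_buffer, bool is_user_detected) {
        samples.insert(samples.end(), out_buffer.begin(), out_buffer.end());
        ++total_buffers;
        if (is_user_detected) {
            ++detected_buffers;
        }
    }
};
```

After each ProcessBuffer call, the application would pass out_buffer and is_user_detected to Append.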

Note: You can find the library's built-in buffer size using Yobe::Info::InputBufferSize.
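When the audio is available as one long array, it has to be cut into buffers of that size before being processed one at a time. The sketch below shows one way to do so. It is not SDK code: SplitIntoBuffers is a hypothetical helper, buffer_size stands in for the value reported by the library, and zero-padding the final partial buffer is an assumption about what the application wants, not documented SDK behavior.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper, not part of the Yobe SDK.
// Splits `samples` into consecutive buffers of exactly `buffer_size` samples.
// Assumption: a trailing partial buffer is padded with zeros (silence).
std::vector<std::vector<double>> SplitIntoBuffers(
    const std::vector<double>& samples, std::size_t buffer_size) {
    std::vector<std::vector<double>> buffers;
    if (buffer_size == 0) {
        return buffers;  // avoid an endless loop on a zero step
    }
    for (std::size_t start = 0; start < samples.size(); start += buffer_size) {
        const std::size_t count = std::min(buffer_size, samples.size() - start);
        std::vector<double> buffer(buffer_size, 0.0);  // pre-filled with zeros
        const auto first = samples.begin() + static_cast<std::ptrdiff_t>(start);
        std::copy(first, first + static_cast<std::ptrdiff_t>(count), buffer.begin());
        buffers.push_back(std::move(buffer));
    }
    return buffers;
}
```

Each element of the result can then be handed to ProcessBuffer in order.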

Deinitialization

To ensure the IDListener is properly deinitialized, simply call Yobe::IDListener::Deinit.

// cpp

id_listener->Deinit();
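To make sure Deinit is called on every exit path, including early returns, the call can be tied to scope with a small guard object. The following is a sketch under stated assumptions rather than an SDK facility: DeinitGuard is a hypothetical class, it works with any type that exposes a Deinit() member function, and it ignores whatever Deinit returns.

```cpp
#include <memory>
#include <utility>

// Hypothetical scope guard, not part of the Yobe SDK.
// Calls Deinit() on the held object when the guard goes out of scope.
template <typename T>
class DeinitGuard {
public:
    explicit DeinitGuard(std::shared_ptr<T> target) : target_(std::move(target)) {}

    ~DeinitGuard() {
        if (target_) {
            target_->Deinit();  // any return value is intentionally ignored
        }
    }

    // Non-copyable so that Deinit() is not called twice for one guard.
    DeinitGuard(const DeinitGuard&) = delete;
    DeinitGuard& operator=(const DeinitGuard&) = delete;

private:
    std::shared_ptr<T> target_;
};
```

With the listener from the earlier examples, a guard could be declared right after initialization as DeinitGuard<Yobe::IDListener> guard(id_listener); the explicit Deinit call is then no longer needed in that scope.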