YobeSDK 0.2.1
GrandE

Introduction

The GrandE (Grand Engine) variant combines all of the functionality available in the Yobe SDK. This includes the following capabilities and settings (note that certain settings are mutually exclusive):

  • BioListener voice extraction
  • IDListener voice identification
    • with a voice target of EITHER:
      • Near-Listening OR
      • Far-Listening
    • AND an output mode of EITHER:
      • Automatic Speech Recognition (ASR) Listening OR
      • Automatic Voice Recognition (AVR) Listening

This guide is broken up into sections describing each function of the Yobe SDK and how to access it within the GrandE variant.

Prerequisites

Place the provided libraries and header files in a location that can be discovered by your application's build system.

BioListener (Voice Extraction)

The BioListener is meant for situations where a person is talking while there are noise sources (including other voices). The job of the BioListener is to extract the voice of a person while pushing the noise sources further into the auditory background.

Initialization

Yobe::Create::NewBioListener is used to obtain a shared pointer to a new Yobe::BioListener instance. The instance must then be initialized via Yobe::BioListener::Init using the license provided by Yobe.

The BioListener can be initialized while specifying either a Near-Listening or Far-Listening target. See their respective glossary entries for more information.

// cpp
auto bio_listener = Yobe::Create::NewBioListener();
bio_listener->Init("YOBE_LICENSE", Yobe::VoiceTarget::NEAR_FIELD); // near-listening
// OR
bio_listener->Init("YOBE_LICENSE", Yobe::VoiceTarget::FAR_FIELD); // far-listening

Yobe::Create::NewBioListener() creates a new instance of BioListener. Yobe::VoiceTarget::NEAR_FIELD indicates that the target voice is near field (less than 2 feet), while Yobe::VoiceTarget::FAR_FIELD indicates that the target voice is far field (3 feet or more).

You can verify the selected voice target using the following function:

// cpp
bio_listener->GetVoiceTarget();
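
For example, assuming Yobe::BioListener::GetVoiceTarget returns a Yobe::VoiceTarget value (an assumption based on the initialization parameters above), the configuration could be checked like this:

// cpp
// Sketch only: requires <iostream>; the comparison values are the ones used during Init above.
if (bio_listener->GetVoiceTarget() == Yobe::VoiceTarget::NEAR_FIELD) {
    std::cout << "BioListener is configured for near-listening\n";
} else {
    std::cout << "BioListener is configured for far-listening\n";
}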

Process and Use Audio

Audio data is passed into the Yobe::BioListener one buffer at a time. See Audio Buffers for more details on their format. As seen in the method signatures for the Yobe::BioListener::ProcessBuffer functions, the audio can be encoded as PCM 16-bit Integer or 64-bit Double. The output buffer size can also vary from call to call.

// cpp
bio_listener->ProcessBuffer(input_buffer, out_buffer, input_size, &out_buffer_size);

out_buffer in the above example now contains the processed version of the audio that is contained in input_buffer. An example of what to do with this out_buffer is to append its contents to a stream or larger buffer.

Note: You can find the library's built-in buffer size using Yobe::Info::InputBufferSize.
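
As an illustrative sketch (not the definitive API), the loop below accumulates the processed audio into a std::vector<double>. HasMoreAudio and GetNextInputBuffer are hypothetical stand-ins for your audio source, input_size is assumed to already hold the library's per-call input size (see the note above), and the raw double buffers assume the 64-bit Double overload of ProcessBuffer; adapt the types to the signatures in the headers provided by Yobe.

// cpp
// Sketch only: HasMoreAudio() and GetNextInputBuffer() are hypothetical helpers for your
// audio source; buffer types assume the 64-bit Double overload of ProcessBuffer, and
// input_size is assumed to be set to the library's built-in buffer size.
std::vector<double> extracted_audio;              // accumulates all processed output
std::vector<double> out_storage(2 * input_size);  // generously sized output scratch buffer
double* out_buffer = out_storage.data();
std::size_t out_buffer_size = 0;

while (HasMoreAudio()) {
    const double* input_buffer = GetNextInputBuffer();
    bio_listener->ProcessBuffer(input_buffer, out_buffer, input_size, &out_buffer_size);
    // Append only the samples produced by this call; the output size can vary per call.
    extracted_audio.insert(extracted_audio.end(), out_buffer, out_buffer + out_buffer_size);
}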

Deinitialization

To ensure BioListener is properly deinitialized, simply call Yobe::BioListener::Deinit.

// cpp
bio_listener->Deinit();

IDListener (Voice Identification)

The IDListener is used to perform certain actions with the voice of a pre-enrolled user. The IDListener has two output targets that define its functionality:

  • Automatic Speech Recognition (ASR) Listening
  • Automatic Voice Recognition (AVR) Listening

Automatic Speech Recognition (ASR) Listening

The IDListener with ASR output target detects the voice of a pre-enrolled user and extracts it from surrounding noise and human crosstalk. The software enrolls a user on the basis of 10-20 seconds of unscripted speech from that user. The IDListener provides high-quality speaker voice signals, enabling Automatic Speech Recognition (ASR) platforms to function properly and with a high degree of accuracy in extremely complex environments. A typical use case is an enrolled user talking to a kiosk or point-of-sale tablet in a noisy environment with human crosstalk (e.g. restaurants, drive-thrus, shopping malls and other public spaces).

Automatic Voice Recognition (AVR) Listening

The IDListener with AVR output target performs AVR on the voice of a pre-enrolled user: it extracts their voice from surrounding noise and human crosstalk, determines when the pre-enrolled user is silent, and mutes the entire signal at those times. The software also has the capability of enrolling a user on the basis of 10-20 seconds of unscripted speech. The Voice Template is used to determine the intervals in which the pre-enrolled user is not talking; during those intervals, the audio signal is muted. Typical use cases for this mode are:

  • a registered user engaged in a two-way conversation who needs the signal to be entirely muted when they are not speaking
  • a solution wanting to input or record only the authorized speaker at all times, even though there may be other people speaking as well as other sources of noise

Initialization

Yobe::Create::NewIDListener is used to obtain a shared pointer to a new Yobe::IDListener instance. The instance must then be initialized using the license and a path to the initialization data, both provided by Yobe.

As with the BioListener, the IDListener is initialized for either Near-Listening or Far-Listening. In addition, the initialization specifies whether the IDListener operates in ASR or AVR output mode.

// cpp
auto id_listener = Yobe::Create::NewIDListener();
id_listener->Init("YOBE_LICENSE", "init_data_path", Yobe::VoiceTarget::NEAR_FIELD, Yobe::OutputTarget::YOBE_ASR); // near-listening, ASR output
// OR
id_listener->Init("YOBE_LICENSE", "init_data_path", Yobe::VoiceTarget::FAR_FIELD, Yobe::OutputTarget::YOBE_ASR); // far-listening, ASR output
// OR
id_listener->Init("YOBE_LICENSE", "init_data_path", Yobe::VoiceTarget::NEAR_FIELD, Yobe::OutputTarget::YOBE_MUTE); // near-listening, AVR output
// OR
id_listener->Init("YOBE_LICENSE", "init_data_path", Yobe::VoiceTarget::FAR_FIELD, Yobe::OutputTarget::YOBE_MUTE); // far-listening, AVR output

Yobe::Create::NewIDListener() creates a new instance of IDListener. Yobe::OutputTarget::YOBE_ASR indicates that the output or sink is for Automatic Speech Recognition (ASR) applications, while Yobe::OutputTarget::YOBE_MUTE indicates that the output or sink is for muting applications.

You can verify the selected output mode using the following function:

// cpp
id_listener->GetOutputTarget();
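
As a sketch, assuming Yobe::IDListener::GetOutputTarget returns a Yobe::OutputTarget value, the selected mode can be checked in the same way:

// cpp
// Sketch only: requires <iostream>; YOBE_ASR and YOBE_MUTE are the values used during Init above.
if (id_listener->GetOutputTarget() == Yobe::OutputTarget::YOBE_ASR) {
    std::cout << "IDListener is configured for ASR output\n";
} else {
    std::cout << "IDListener is configured for muting (AVR) output\n";
}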

Register and Select User

A Yobe::BiometricTemplate must be created using the desired user's voice so the IDListener can select that template and identify the voice.

Register the user by inputting their voice audio data using Yobe::IDListener::RegisterTemplate. This is done using a continuous array of audio samples.

It is recommended to first process the audio data so that only speech is present in it; this will yield better identification results. To achieve this, the IDListener can be placed into Enrollment Mode. In this mode, the audio that should be used to create a BiometricTemplate is processed, buffer by buffer, using Yobe::IDListener::ProcessBuffer, which returns a status of Yobe::Status::ENROLLING as long as buffers are being processed in Enrollment Mode. Enrollment Mode is started by calling Yobe::IDListener::StartEnrollment, and is stopped either by manually calling Yobe::IDListener::StopEnrollment or by processing enough buffers for it to stop automatically based on an internal counter (currently, enough buffers to equal 20 seconds of audio).

Enrollment Mode can be started at any point after initial calibration. Any samples processed while in Enrollment Mode will not be matched for identification to a selected template, if there is one.

// cpp
/**** (NOT RECOMMENDED) register with unprocessed samples ****/
auto biometric_template = id_listener->RegisterTemplate(samples, samples_size);

/**** (RECOMMENDED) register with processed samples ****/
// input_buffer, out_buffer, input_size and is_user_verify are assumed to be declared
// as in the ProcessBuffer examples elsewhere in this guide.
// process enough unspecified audio to calibrate
while (id_listener->ProcessBuffer(input_buffer, out_buffer, input_size, is_user_verify) == Yobe::Status::NEEDS_MORE_DATA) {
    input_buffer = GetNextInputBuffer(); // some example function to get the next buffer
    input_size = GetNextInputSize();
}

// now that we've calibrated, process the desired user's audio in Enrollment Mode
id_listener->StartEnrollment();
std::vector<double> processed_voice{};
while (id_listener->ProcessBuffer(voice_input_buffer, voice_out_buffer, voice_input_size, is_user_verify) == Yobe::Status::ENROLLING) {
    voice_input_buffer = GetNextVoiceBuffer(); // some example function to get the next buffer
    if (voice_input_buffer == nullptr) {
        // we've run out of voice audio before enrollment stopped automatically
        id_listener->StopEnrollment();
        break;
    }
    voice_input_size = GetNextVoiceSize();
    // continuously append the processed output to a vector
    processed_voice.insert(processed_voice.end(), voice_out_buffer.begin(), voice_out_buffer.end());
}

// register with the processed audio
auto biometric_template = id_listener->RegisterTemplate(processed_voice.data(), processed_voice.size());

Yobe::Status::NEEDS_MORE_DATA means the algorithm needs more data before it can start processing the audio. Yobe::Status::ENROLLING means the last buffer was processed while the IDListener was configured for enrolling a new BiometricTemplate.

Select the user using the template returned by the registration.

// cpp
id_listener->SelectUser(biometric_template);

Any new audio buffers passed to Yobe::IDListener::ProcessBuffer while not in Enrollment Mode will be processed with respect to the selected user's voice.

Process and Use Audio

Audio data is passed into the Yobe::IDListener one buffer at a time. See Audio Buffers for more details on their format. As seen in the method signatures for the Yobe::IDListener::ProcessBuffer functions, the audio can be encoded as PCM 16-bit Integer or 64-bit Double. The output buffer size can also vary from call to call, so prepare your application to handle buffers of varying size; for the IDListener, this size change only happens when there is a transition from the unauthorized to the authorized state.

// cpp
id_listener->ProcessBuffer(input_buffer, out_buffer, input_size, is_user_verify);

out_buffer in the above example now contains the processed version of the audio that is contained in input_buffer. An example of what to do with this out_buffer is to append its contents to a stream or larger buffer.

Note: You can find the library's built-in buffer size using Yobe::Info::InputBufferSize.
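
The sketch below shows one way to accumulate the IDListener output into a running std::vector, mirroring the call form above and the container-style output buffer used in the enrollment example. HasMoreAudio and GetNextInputBuffer are hypothetical helpers, and input_buffer, out_buffer, input_size and is_user_verify are assumed to be set up as in the earlier examples; adapt the types to the ProcessBuffer overload you actually use.

// cpp
// Sketch only: helpers and buffer types are assumptions; the ProcessBuffer call
// mirrors the form shown above.
std::vector<double> identified_audio;   // accumulates the selected user's processed voice

while (HasMoreAudio()) {
    input_buffer = GetNextInputBuffer();
    id_listener->ProcessBuffer(input_buffer, out_buffer, input_size, is_user_verify);
    // Append whatever this call produced; the output size can change on an
    // unauthorized-to-authorized transition.
    identified_audio.insert(identified_audio.end(), out_buffer.begin(), out_buffer.end());
}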

Clean Up

To ensure proper clean up of the IDListener, simply call Yobe::IDListener::Deinit.

// cpp
id_listener->Deinit();