Creating Unit Tests for your assistant in the Console

In the Snips console, at the bottom of your assistant page, you will find a link to the Unit Tests tool. With this tool, you can create and grow a permanent library of voice query recordings, used as tests to assess the robustness of the key use cases of your assistant. You will also be equipped to identify regressions as you keep adding new capabilities to your assistant over time.

Getting started with unit tests

Let's see what this tool looks like. When getting started, you have no unit tests for your assistant. Click on the microphone button to start recording your first unit test, and say a query you would like your assistant to understand. Once you're done talking, click on Stop recording. This creates a new unit test for you to complete. You can click on the play button to replay the recording. If you are unhappy with the recording, you can re-record it by clicking on the microphone button of the unit test, or delete it with the trash button.

Alternatively, instead of recording from the browser, you can import audio files to your unit tests. Just click on the vertical three-dot menu at the top right, next to the Intents filter dropdown above your unit tests list. Be aware that your imported audio files need to be in 16 kHz, 16-bit PCM (PCM16), mono format.
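
If your recordings are in another format, you can convert them before importing. Here is a minimal sketch using the pydub library (an assumption on our part: any tool that produces 16 kHz mono 16-bit PCM WAV files works just as well):

```python
from pydub import AudioSegment  # pip install pydub (requires ffmpeg)

# Load a recording in any format ffmpeg understands, then resample
# to 16 kHz, mono, 16-bit samples (PCM16).
query = AudioSegment.from_file("query.m4a")
query = query.set_frame_rate(16000).set_channels(1).set_sample_width(2)

# Export as WAV, ready to be imported into the unit tests tool.
query.export("query_16k.wav", format="wav")
```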

If you are happy with the recording and/or imported files, you next need to indicate what results you expect to obtain from the queries. We'll go through how this works for one query, one step at a time.

Pick an intent

Clicking on Pick an intent opens a menu listing all the intents of your assistant. Select the intent that corresponds to your unit test. Later, when you run the tests, if Snips fails to identify this intent, the unit test status will be set to Fail. We will see later how to test contexts other than the default one, in which only the intents marked enabledByDefault are active.

You are also invited to test queries that do not match any intent of your assistant, to make sure it behaves correctly when users misuse it.

Type the expected result and tag the expected slots

In the Type the expected result text field, type the transcript of your query. Once done, you can tag the words corresponding to slots the same way you would when entering training examples in your intent editing page. Naturally, only the slots of the expected intent you've set are available. Later, when you run the tests, if Snips fails to catch these slots, the unit test status will be set to Fail.

Run your first unit test

Once you have entered the expected intent and slot(s) for this unit test, you can run your first test by clicking on the run icon illustrated by two circular arrows. This will run your recorded voice sample through the speech recognition and natural language understanding engines of your assistant. The console will display the output of both engines, and show whether the unit test passed or failed.
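
Conceptually, the default pass/fail decision compares the NLU output to your expectations. Here is a rough illustration of that logic (our interpretation, not the console's actual implementation):

```python
def unit_test_passes(expected: dict, predicted: dict) -> bool:
    """Default pass/fail logic: the test passes only if the NLU
    predicts the expected intent and captures exactly the expected
    slots (slot name, value, and tagged text span)."""
    return (predicted["intent"] == expected["intent"]
            and predicted["slots"] == expected["slots"])

# Example: the right intent but a missed slot fails the test.
expected = {"intent": "whenWillIArrive",
            "slots": {"destination": "New York"}}
predicted = {"intent": "whenWillIArrive", "slots": {}}
print(unit_test_passes(expected, predicted))  # False
```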

Going further: create your unit test suite

To thoroughly validate the robustness of your assistant, and to make sure you can identify regressions as your assistant evolves, we encourage you to create a large library of unit tests, much as you would for a software project. To do this, reproduce the steps above and create a series of unit tests covering all the intents of your assistant. Then, to run all your unit tests at once, click on the Run all tests button at the top.

In particular, we encourage you to test the most obvious formulations for your intents, to make sure your assistant covers the main use cases. Beyond that, the more thorough your tests, the stronger the guarantees you get regarding the performance of your assistant.

What to do when a unit test fails?

Ideally, all your unit tests pass, and you are ready to go. However, unit tests are here to help you challenge your assistant, and assist you in making it better. Here's how to handle the different types of errors:

ASR output is wrong

In itself, a unit test will not fail because the ASR output is wrong: what matters are the predicted intent and the captured slots. Only if the intent and/or the slot(s) are wrong does the unit test fail. However, serious errors in the ASR output will prevent the NLU from identifying the right intent or capturing the right slot(s). If a word is incorrectly transcribed by the assistant, make sure this word belongs to the vocabulary of the assistant, meaning that it appears either in the training examples or in the slot values. If it does, but the word still isn't captured correctly, don't hesitate to add more training examples featuring this word.

Intent is wrong

If the ASR output is right, or at least contains the key words that make the intent explicit, the NLU should predict the expected intent. If the NLU predicts another intent, there may be an ambiguity between the two intents. Are they inherently two different intents, or would it make sense to merge them in your assistant? If you believe they are inherently different, have you checked that each has enough training examples, and that these examples do not create ambiguity between the two intents? To make sure your assistant correctly captures this formulation, don't hesitate to add similar formulations to the training examples of your target intent.

A slot is wrong

If the NLU predicts the correct intent but tags the wrong slot, there may be an ambiguity between the two slots. Are they inherently two different slots, or would it make sense to merge them in your assistant? If you believe they are inherently different, have you checked that this slot is sufficiently represented in your training examples? In some cases, the NLU will tag the slot differently than in your expected output. Double-check that you tagged the slot properly in your test: a single mistakenly tagged character or space is enough to fail the slot matching.

Advanced use

The unit test tool gives you access to the following advanced features:

Intents filter

Intents filters can be used to test your assistant in different states. A typical context that calls for a different filter is an elicitation context. In the following example, the developer created a destinationElicitation intents filter to handle cases where the user is simply expected to specify their destination:

  • Hey Snips, when will I arrive? [enabledByDefault]

  • Can you remind me what's your destination [assistantResponse]

  • I'm going to New York [destinationElicitation]

  • ...

This intents filter can be created either from the assistant, app, or intent editing page, or directly from the unit test tool:

To create a test for the example above, click on New Intents Filter and select the intent(s) you want to be active under the destinationElicitation filter. Once this intents filter is created, you can create unit tests that will run in the context of this filter. Using this feature, you can create a complete test suite for each intents filter, covering all the possible states your assistant can be in.
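
For context, this mirrors what action code does at runtime: after asking for the destination, it continues the dialogue session with an intent filter restricted to the elicitation intent(s). Below is a minimal sketch assuming the hermes-python client; the intent names are hypothetical, and method signatures may vary across versions:

```python
from hermes_python.hermes import Hermes

MQTT_ADDR = "localhost:1883"  # assumption: default local MQTT broker

def when_will_i_arrive(hermes, intent_message):
    # Keep the session open and restrict the next round of NLU
    # to the intent(s) of the destinationElicitation filter.
    hermes.publish_continue_session(
        intent_message.session_id,
        "Can you remind me what's your destination?",
        ["username:specifyDestination"],  # hypothetical intent name
    )

with Hermes(MQTT_ADDR) as h:
    h.subscribe_intent("username:whenWillIArrive", when_will_i_arrive).start()
```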

Tests Settings

In the Unit Tests tool, you can access a test settings menu:

Let us describe the role and impact of each of these settings.

Fail test whenever a word is incorrectly captured

The standard objective of the Snips platform is to extract the intent and slots from input audio. This is why, by default, unit tests pass or fail based on whether the intent and slots are correctly captured. However, you may be interested in the raw output of the ASR, whether because you aim to display it to the end user, or simply because you want to evaluate the ASR independently of the NLU. In those cases, tick the box to enable a stricter pass/fail logic, in which a unit test fails as soon as a single word (including its spelling) differs from the expected transcript.
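
A rough illustration of this stricter mode (our interpretation of the word-level comparison, not the console's actual implementation):

```python
def asr_strict_pass(expected_transcript: str, asr_output: str) -> bool:
    """Stricter mode: every word, including spelling, must match
    the expected transcript (here compared case-insensitively)."""
    return expected_transcript.lower().split() == asr_output.lower().split()

print(asr_strict_pass("turn on the light", "turn on the light"))  # True
print(asr_strict_pass("turn on the light", "turn on the lite"))   # False
```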

Expand all tests by default

The default behaviour of the unit tests tool is to keep unit tests collapsed, hiding the details of each test's ASR and NLU outputs. Tick this box if you want to invert this behaviour.

Confidence Scores

The ASR and NLU outputs of the Snips platform come with associated confidence scores. More specifically, there is a score attached to the complete ASR output, to the intent classification decision made by the NLU, and to the ASR confidence of each captured slot.

A good general practice for voice assistants is to prefer triggering no action over triggering the wrong action: it is easier for a user to repeat or reformulate than to cancel an action before reformulating. To control this behaviour, action code developers should take these confidence scores into account when deciding whether to trigger an action (see the sketch after this list). This means, for example, setting:

  • a threshold on the ASR confidence, below which the ASR result is discarded

  • and/or a threshold on the intent classification confidence, below which the intent is considered not recognized

  • and/or a slot-level threshold, below which the slot value is considered not captured
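
Here is a minimal sketch of how action code might apply such thresholds. The threshold values and the field names of the simplified result object are assumptions for illustration, not the actual shape of a Snips NLU message:

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical thresholds; tune them for your assistant.
ASR_THRESHOLD = 0.70      # below this, discard the ASR result
INTENT_THRESHOLD = 0.75   # below this, treat the intent as not recognized
SLOT_THRESHOLD = 0.60     # below this, treat the slot value as not captured

@dataclass
class NluResult:
    """Simplified stand-in for an NLU result (field names are assumptions)."""
    asr_confidence: float
    intent_confidence: float
    slot_confidences: Dict[str, float] = field(default_factory=dict)

def should_trigger(result: NluResult) -> bool:
    """Trigger the action only when every confidence clears its threshold."""
    if result.asr_confidence < ASR_THRESHOLD:
        return False  # transcript too uncertain: discard the ASR result
    if result.intent_confidence < INTENT_THRESHOLD:
        return False  # intent considered not recognized
    # Reject if any captured slot falls below the slot-level threshold.
    return all(score >= SLOT_THRESHOLD
               for score in result.slot_confidences.values())

# Example: a confident transcript and intent, but an ambiguous slot, fails.
result = NluResult(asr_confidence=0.92, intent_confidence=0.88,
                   slot_confidences={"destination": 0.41})
print(should_trigger(result))  # False: the slot value is not captured
```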

To test your assistant while accounting for such thresholds, thus reflecting the final decisions (or absence of decisions) of your assistant, you can activate the confidence score threshold logic from the test settings and manually set each threshold. Pass and fail statistics will then directly reflect the consequences of these choices.