Improving Assistant Quality with Data Generation

Data Generation lets you generate high-quality, highly varied training examples for your intents, using crowdsourced data.

Generating data for your intents has several advantages:

  • it drastically increases the diversity of your training examples, which improves the coverage and performance of your assistant

  • it saves you the time you would otherwise spend shipping test versions and collecting data from users, so you can build a well-performing assistant in very little time

  • it minimizes the labeling effort, as the generated data comes pre-tagged

You can check the product presentation screencast.

If you are already familiar with data generation, you may want to jump directly to the guide How to Make a Successful Data Generation Campaign, where you will learn how to improve the quality of your assistant's data.

Why training example diversity matters

The diversity of your training examples is crucial for the end-to-end performance of your assistant, both at the speech recognition level (ASR) and at the natural language understanding level (NLU).

  • ASR: a specialised language model is trained specifically for your assistant, based on the training examples you provide. The ASR cannot "guess" out-of-vocabulary words, so diverse wording in your training examples makes recognition more robust to variations in phrasing.

  • NLU: the natural language understanding performance of your assistant will also improve greatly with the number of training examples provided, both for intent classification and slot-filling. If you want to learn more about NLU performance improvement expectations, we strongly encourage you to visit our NLU benchmark.
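
To make the out-of-vocabulary point concrete, here is a minimal sketch in Python. It is not how the Snips ASR works internally: it only counts which words of a user request never appear in a set of training examples. The function names and the sample utterances are hypothetical and chosen for illustration.

```python
import re


def tokenize(utterance):
    """Lowercase an utterance and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", utterance.lower())


def build_vocabulary(training_examples):
    """Collect every distinct word seen in the training examples."""
    vocabulary = set()
    for example in training_examples:
        vocabulary.update(tokenize(example))
    return vocabulary


def out_of_vocabulary_words(utterance, vocabulary):
    """Return the words of an utterance that never appear in training."""
    return [word for word in tokenize(utterance) if word not in vocabulary]


# Hypothetical training sets: one with little wording diversity, one with more.
narrow_examples = ["turn on the light", "turn on the lamp"]
diverse_examples = narrow_examples + [
    "switch the lights on please",
    "can you light up the kitchen",
]

user_request = "please switch on the kitchen lights"

narrow_oov = out_of_vocabulary_words(user_request, build_vocabulary(narrow_examples))
diverse_oov = out_of_vocabulary_words(user_request, build_vocabulary(diverse_examples))

print(narrow_oov)   # words the narrow training set never covered
print(diverse_oov)  # words the diverse training set never covered
```

With the narrow set, most words of the request are unseen; with the more diverse set, every word is covered. A specialised language model trained on the diverse set therefore has a better chance of recognising the request.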


You'll find below answers to some of the practical questions you might have about Data Generation at Snips.

Is this feature free?

Data generation is a paid service. The current pricing is shown below.

Number of intents



100 euros


220 euros


400 euros

The table shows the price of Data Generation by number of intents ordered.

How long does it take?

We guarantee that you will receive the pre-tagged training examples you've asked for within 3 business days.

How does it work?

Our data generation engine combines machine learning algorithms with human operators. It has been used internally at Snips for years and is therefore extensively tested. For instance, it is the process behind our benchmark data.

What quality guarantee do I have?

Each generated training example is reviewed manually by several people, which allows us to guarantee a high level of quality. Our service also includes a disambiguation algorithm that is responsible for identifying problematic training examples. If some training examples remain ambiguous after these rounds of validation, you might be asked to resolve these ambiguities manually. Ambiguous training examples should not represent more than 5% of the order size. If they do, get in touch and we will resolve it.
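
As a quick way to apply the 5% rule to your own order, the check can be sketched as below. This is an illustrative helper, not part of the Snips console or any Snips API. The function name and the example figures are hypothetical.

```python
def exceeds_ambiguity_threshold(order_size, ambiguous_count, threshold_percent=5):
    """Return True when ambiguous examples exceed the given share of the order.

    Integer arithmetic is used so that an order sitting exactly at the
    threshold is not affected by floating-point rounding.
    """
    if order_size <= 0:
        raise ValueError("order_size must be positive")
    if ambiguous_count < 0:
        raise ValueError("ambiguous_count cannot be negative")
    return ambiguous_count * 100 > order_size * threshold_percent


# Hypothetical orders: (number of examples ordered, number flagged as ambiguous)
print(exceeds_ambiguity_threshold(1000, 50))  # exactly 5%: within the limit
print(exceeds_ambiguity_threshold(1000, 51))  # just above 5%: worth getting in touch
```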

If you have any other questions or remarks regarding this feature, do not hesitate to contact us. We will do our best to help.

Next Steps

Our console is equipped with the necessary tooling and processes to generate data to improve the quality of your assistant. Go through our guide How to Make a Successful Data Generation Campaign to learn how to do it.

If you have further questions about this service, we have prepared an FAQ that answers the most common inquiries.