The need for large datasets is evident. Existing AI models like Gemma 7B required six trillion tokens for training, and Llama 2 7B required two trillion. These two models are built for general-purpose text generation and must be knowledgeable across a wide array of topics, so their training sets can consist of general data such as publicly available web documents. Finding data to train models in niche domains is more difficult due to data scarcity. Even when real-life data is available, it can suffer from issues like low data diversity, privacy-regulation concerns, and data imbalance.
AI engineers can address these issues by using synthetic data generation to train AI models in niche domains. Synthetic data helps models handle a wider range of inputs more robustly, decreases time-to-market for AI applications, and can improve model performance relative to competitors that rely only on non-synthetic or human-curated datasets.
Synthetic Data Generation Techniques
Synthetic data generation is the creation of artificial datasets that mirror the characteristics of real-life datasets. Structured synthetic data comes in tabular format, while unstructured synthetic data includes text, images, and videos. Synthetic data can be created using the following methods:
- Random sampling, where random values are drawn from a distribution fitted to the characteristics of the real-life data.
- Agent-based modeling (ABM), where data is created from the interactions between agents in a simulation.
- Rule-based generation, where data is created using rules defined from domain knowledge and the structure and relationships of the real-life data.
- Generative AI, where AI models generate data based on the complex patterns recognized during model training.
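As a minimal sketch of the random-sampling method above, the snippet below fits a normal distribution to a small "real" sample and draws synthetic values from it. The column (ages) and all values are hypothetical, and real datasets would need a distribution chosen to match their actual shape:

```python
import random
import statistics

def synthesize_numeric(real_values, n, seed=None):
    """Draw n synthetic values from a normal distribution whose mean and
    standard deviation are estimated from the real sample."""
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    rng = random.Random(seed)  # seeded for reproducibility
    return [rng.gauss(mu, sigma) for _ in range(n)]

# Hypothetical "real" ages observed in a small, scarce dataset
real_ages = [34, 29, 41, 38, 33, 45, 30, 36]

# 100 synthetic ages that follow the same estimated distribution
synthetic_ages = synthesize_numeric(real_ages, n=100, seed=42)
```

A normal distribution is only one choice; the same pattern applies with any fitted distribution (log-normal, categorical frequencies, and so on).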
Synthetic Success Stories
Amazon One, a palm recognition system for contactless payment, is one success story for synthetic data generation in model training. Synthetic data was needed in this domain because palm image data was scarce, and the system's accuracy was critical because it handles payments. The system was trained on millions of artificial palm images created by generative AI. In September 2023, Amazon reported that the system had been used more than three million times with an accuracy of 99.9999%.
Another success story is SynCLR, an image classifier created by Google and MIT without using any real-life data. The work used artificial image captions generated by GPT-4 and Llama 2 7B to prompt Stable Diffusion 1.5 to create artificial images matching those captions. The artificial captions and images were then used to train SynCLR. The rationale for this approach was to demonstrate the value and convenience of generative models for producing diverse datasets. The resulting classifier performed comparably to state-of-the-art image classifiers.
Synthetic Data Generation Vendors
Some synthetic data vendors include MOSTLY AI, Gretel, Hazy, and Anyverse, all of which use generative AI to produce their datasets (see Table 1). All of these vendors offer a free plan or a demo with limited credits or functionality for trying the software, and some of them use a credit system for data generation. Interestingly, MOSTLY AI is marketed as a no-code solution that anyone can use to generate synthetic data, while the other three solutions require technical knowledge.
Table 1: Comparison of synthetic data vendors
| | MOSTLY AI | Gretel | Hazy | Anyverse |
| --- | --- | --- | --- | --- |
| Free plan | Yes (5 daily credits) | Yes (15 monthly credits) | Yes (source data limit) | No, but demo available |
| Cost per credit | USD $3-$5 | USD $2-$2.20 | - | - |
| Monthly subscription | - | USD $0-$295 | None for free plan; USD $0-$2000 for other plans | - |
| Synthetic data type | Structured | Structured; unstructured coming soon | Structured | Unstructured |
| Generated data | Any text data | Any text data | Any text data | Images |
Open-source solutions for synthetic data generation are also available, such as DataSynthesizer and Zpy. DataSynthesizer is a Python library that generates structured synthetic data in random, independent, and correlated modes; it also provides a web UI that can be installed for use without coding. Zpy is a Python library that creates artificial images using Blender, making it well suited to building datasets for computer vision models.
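When no library fits the domain, a rule-based generator of the kind described earlier can be sketched in plain Python. Everything below is hypothetical: the schema, the plan names, and the rule tying each plan to an allowed fee range are illustrative, not taken from any vendor or library:

```python
import random

# Hypothetical domain rule: the plan type constrains the allowed monthly fee
PLAN_FEES = {"basic": (0, 10), "pro": (10, 50), "enterprise": (50, 200)}

def generate_customer(rng):
    """Generate one synthetic customer record that obeys the fee rule."""
    plan = rng.choice(list(PLAN_FEES))
    low, high = PLAN_FEES[plan]
    return {
        "customer_id": rng.randrange(100000, 999999),
        "plan": plan,
        "monthly_fee": round(rng.uniform(low, high), 2),
    }

rng = random.Random(0)  # seeded so runs are reproducible
records = [generate_customer(rng) for _ in range(1000)]
```

Because the rules encode the domain constraints directly, every generated record is valid by construction, which is the main appeal of the rule-based method for structured data.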
Recommendations
- Ensure synthetic data generation platforms or tools follow privacy regulations. Synthetic data should be anonymous so there is no risk of exposing personally identifiable information (PII). As a rule of thumb, prefer solutions that adhere to major privacy regulations, like the EU's GDPR and the US's HIPAA, so your business has the opportunity to operate in more countries in the future.
- Verify that your chosen vendor can generate your type of data. Some vendors generate synthetic data for specific domains (such as healthcare and finance), while others generate text data for any domain. If you need synthetic data in only one domain, choose a domain-specific solution; if you will generate data across multiple domains, choose a general-purpose one.
- Create your own data if vendors cannot generate it. If no vendor covers your domain, you can create your own data using the DataSynthesizer or Zpy Python libraries or similar tools. This approach requires technical skills, so factor in staff training and education costs or the cost of hiring third-party assistance.
Bottom Line
Synthetic data generation solves data scarcity, diversity, and imbalance issues when training AI models in niche domains. AI engineers can use this technique to accelerate model training, save on costs, and get AI applications to market more quickly.