
Faux Data, Real Intelligence: Low-cost AI Model Training with Synthetic Datasets

Mon., 1. April 2024 | 4 min read

The need for large datasets is evident. Gemma 7B was trained on six trillion tokens and Llama 2 7B on two trillion. These models are built for text generation and must be knowledgeable on a wide array of topics, so their datasets can consist of general data such as publicly available web documents. Finding data to train models in niche domains is more difficult because of data scarcity, and even when real-life data is available, it can suffer from low data diversity, privacy-regulation concerns and data imbalance.

AI engineers can address these issues by using synthetic data generation to train AI models in niche domains. Synthetic data helps models handle a wider range of inputs more robustly, shortens time-to-market for AI applications, and can improve model performance relative to competitors that rely only on real-life or human-curated datasets.

Synthetic Data Generation Techniques

Synthetic data generation is the creation of artificial datasets that mirror the characteristics of real-life datasets. Structured synthetic data is in a tabular format, while unstructured synthetic data includes text, images and videos. Synthetic data can be created using the following methods:

  1. Random sampling, where values are drawn from a distribution fitted to the characteristics of real-life data.
  2. Agent-based modeling (ABM), where data is created from the interactions between agents in a simulation.
  3. Rule-based generation, where data is created using rules derived from domain knowledge and the structure and relationships of the real-life data.
  4. Generative AI, where AI models generate data based on the complex patterns recognized during model training.
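The first and third techniques above can be sketched with the Python standard library alone. This is a minimal illustration with made-up column names, values and rules, not a production generator:

```python
import random
import statistics

# Real-life measurements we want to mimic (hypothetical sample).
real_ages = [23, 31, 35, 41, 44, 52, 58, 60]

# 1. Random sampling: fit a simple distribution to the real data,
#    then draw synthetic values from it.
mu = statistics.mean(real_ages)
sigma = statistics.stdev(real_ages)
synthetic_ages = [random.gauss(mu, sigma) for _ in range(1000)]

# 3. Rule-based: encode domain knowledge as explicit rules.
#    Here the (made-up) rule is that customers under 40 get the
#    "standard" plan and older customers get the "senior" plan.
def make_record(age):
    plan = "standard" if age < 40 else "senior"
    return {"age": round(age, 1), "plan": plan}

synthetic_records = [make_record(a) for a in synthetic_ages]
```

Real generators fit richer distributions (or copulas) and far more elaborate rule sets, but the principle is the same: capture the shape of the real data, then sample from that shape instead of the data itself.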

Synthetic Success Stories

Amazon One, a palm recognition system for contactless payment, is one success story for synthetic data generation in model training. Synthetic data was needed in this domain because palm image data was scarce, and the system's accuracy was critical because it handles payments. The system was trained on millions of artificial palm images created by generative AI. In September 2023, Amazon reported that it had been used more than three million times with an accuracy of 99.9999%.

Another success story is SynCLR, an image classifier created by Google and MIT without using any real-life data. This work used artificial image captions generated by GPT-4 and Llama 2 7B to prompt Stable Diffusion 1.5 to create artificial images matching the captions. The artificial captions and images were then used to train SynCLR. The rationale for this approach was to show the value and convenience of generative models for producing diverse datasets. The classifier produced results comparable to state-of-the-art image classifiers trained on real data.

Synthetic Data Generation Vendors

Some synthetic data vendors include MOSTLY AI, Gretel, Hazy and Anyverse, all of which use generative AI to produce their datasets (see Table 1). All of these vendors offer a free plan or a demo with limited credits or functionality for trying the software, and some price data generation through a credit system. Interestingly, MOSTLY AI is marketed as a no-code solution that anyone can use to generate synthetic data, while the other three solutions require technical knowledge.

Table 1: Comparison of synthetic data vendors

|                      | MOSTLY AI            | Gretel                                 | Hazy                                               | Anyverse               |
|----------------------|----------------------|----------------------------------------|----------------------------------------------------|------------------------|
| Free plan            | Yes (5 daily credits) | Yes (15 monthly credits)              | Yes (source data limit)                            | No, but demo available |
| Cost per credit      | USD $3-$5            | USD $2-$2.20                           | -                                                  | -                      |
| Monthly subscription | -                    | USD $0-$295                            | None for free plan; USD $0-$2000 for other plans   | -                      |
| Synthetic data type  | Structured           | Structured; unstructured (coming soon) | Structured                                         | Unstructured           |
| Generated data       | Any text data        | Any text data                          | Any text data                                      | Images                 |

Open-source solutions for synthetic data generation are also available through DataSynthesizer and Zpy. DataSynthesizer is a Python library that generates structured synthetic data in random, independent and correlated modes. It also provides a web UI that can be installed for use without coding. Zpy is a Python library that creates artificial images using Blender. This library is suited for creating datasets for computer vision models.
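To make DataSynthesizer's "independent" mode concrete: each column is sampled from its own observed distribution, deliberately discarding correlations between columns. The sketch below illustrates that idea using only the standard library and a hypothetical table; it is not the DataSynthesizer API itself:

```python
import random

# A tiny "real" table (hypothetical data): each row is one customer.
real_rows = [
    {"city": "Berlin", "age": 34},
    {"city": "Berlin", "age": 41},
    {"city": "Munich", "age": 29},
    {"city": "Hamburg", "age": 55},
]

def independent_mode(rows, n):
    """Sample each column independently from its observed values.
    This mirrors the idea behind DataSynthesizer's independent
    attribute mode: per-column distributions are preserved, but
    correlations between columns are intentionally lost."""
    columns = {key: [row[key] for row in rows] for key in rows[0]}
    return [
        {key: random.choice(values) for key, values in columns.items()}
        for _ in range(n)
    ]

synthetic = independent_mode(real_rows, 100)
```

The trade-off is visible even in this toy: a synthetic row may pair a Berlin city value with an age that never occurred in Berlin. DataSynthesizer's correlated mode exists precisely to recover such cross-column relationships.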

Recommendations

  1. Ensure synthetic data generation platforms or tools follow privacy regulations. Synthetic data should be anonymous so that there is no risk of exposing personally identifiable information (PII). As a rule of thumb, choose solutions that adhere to regulations such as GDPR and HIPAA, so your business keeps the option of operating in more countries in the future.
  2. Verify that your chosen vendor can generate your type of data. Some vendors generate synthetic data for specific domains (such as healthcare and finance), while others generate text data for any domain. If you need synthetic data for a single domain, choose a domain-specific solution; if you will generate data across multiple domains, choose a general-purpose one.
  3. Create your own data if vendors cannot generate it. If no vendor covers your domain, you can create your own datasets using the DataSynthesizer or Zpy Python libraries or similar tools. This requires technical skills, so factor in staff training and education costs or the cost of hiring third-party assistance.
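A crude but useful screen for the first recommendation is checking that no synthetic record is an exact copy of a real one, since verbatim copies can carry PII straight into the "anonymous" dataset. This sketch (with hypothetical records) is a sanity check, not a substitute for formal anonymity guarantees such as differential privacy:

```python
def exact_copy_leaks(real_rows, synthetic_rows):
    """Return every synthetic row that exactly duplicates a real row.
    Exact copies of real records may leak personally identifiable
    information, so a clean result is a minimal bar to clear."""
    real_set = {tuple(sorted(r.items())) for r in real_rows}
    return [s for s in synthetic_rows
            if tuple(sorted(s.items())) in real_set]

real = [{"name": "Ada", "age": 36}]
synth = [{"name": "Ada", "age": 36}, {"name": "Bob", "age": 41}]
leaks = exact_copy_leaks(real, synth)  # the first synthetic row leaks
```

Production pipelines go further, also flagging near-duplicates and rare attribute combinations that could re-identify individuals.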

Bottom Line

Synthetic data generation solves data scarcity, diversity and imbalance issues in niche domains when training AI models. AI engineers can use this technique to accelerate model training, save on costs and get AI applications to market faster.

