How GenAI-created synthetic data improves augmentation
Synthetic data can enhance the performance and capabilities of data augmentation techniques. Navigate the challenges generative AI models present to reap the benefits.
Data augmentation techniques rely on existing data sets to simulate events and forecast outcomes. Synthetic data provides larger, more diverse data sets and stands in for data that's difficult to access.
Much of the excitement around generative AI has focused on the technology's text-generating capabilities, but GenAI is also powering synthetic data creation and data augmentation. In the process, it's accelerating the use of synthetic data in existing applications and fueling its adoption in new ones.
"The idea of creating synthetic data has been around a long time," said Thomas Coughlin, life fellow, president of the Institute of Electrical and Electronics Engineers and president of Coughlin Associates.
"What changes things now is the amount of detail, and the speed at which that synthetic data can be generated, as well as the depth and complexity that can be created because of technological advances such as generative AI."
Analysts, researchers and data practitioners tout the benefits that GenAI brings to data augmentation, citing the same points Coughlin made as well as additional benefits, such as safeguarding data privacy. Using synthetic data to augment existing data sets can spur innovation across different areas of an organization.
"We are at a point where there's potential for a reinvigoration of synthetic data use for a variety of uses in the enterprise," said Rowan Curran, a senior analyst at Forrester Research.
Synthetic data generation for data augmentation
Synthetic data is manufactured data, as opposed to data generated by real-world events and direct observation. Data practitioners use algorithms to generate the synthetic data from existing data sets. They set parameters to ensure the synthetic data meets the quality standards required for the use cases where it will be applied.
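As a minimal sketch of that workflow, the Python below fits a normal distribution to a small existing data set, samples synthetic values from it and then checks the result against a quality threshold. The function names, the sample readings and the 10% tolerance are assumptions made for illustration; production pipelines typically rely on richer models and more thorough checks.

```python
import random
import statistics


def generate_synthetic(real_values, n, seed=0):
    """Sample n synthetic values from a normal distribution fitted to real_values."""
    rng = random.Random(seed)
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    return [rng.gauss(mu, sigma) for _ in range(n)]


def passes_quality_check(real_values, synthetic_values, tolerance=0.1):
    """Return True if the synthetic mean and spread stay within `tolerance`
    of the real data, measured as a fraction of the real standard deviation."""
    real_sd = statistics.stdev(real_values)
    mean_gap = abs(statistics.mean(synthetic_values) - statistics.mean(real_values)) / real_sd
    spread_gap = abs(statistics.stdev(synthetic_values) - real_sd) / real_sd
    return mean_gap <= tolerance and spread_gap <= tolerance


# Hypothetical sensor readings standing in for an existing data set.
real_readings = [20.1, 19.8, 21.3, 20.7, 19.5, 20.9, 21.1, 20.2]
synthetic_readings = generate_synthetic(real_readings, n=1000)
print(len(synthetic_readings), passes_quality_check(real_readings, synthetic_readings))
```

The tolerance plays the role of the parameters practitioners set: tightening it rejects synthetic sets that drift too far from the source data.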
Synthetic data can either augment existing data sets that are too limited or stand in for real-world data sets that are not easily accessible and usable, such as data that's highly restricted or heavily regulated. It can also fill a gap where real-world data is nonexistent, such as when analyzing a scenario so rare, theoretical or futuristic that there is no data from actual events. Analysts can also use synthetic data to run simulations in place of real-world situations that could be too risky or dangerous.
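One simple way to augment a data set that is too limited is to add lightly perturbed copies of the real records. The sketch below assumes numeric records stored as tuples; the `augment` function, its `copies` and `noise_fraction` parameters and the sample records are all hypothetical, and noise-based jitter is only one of many augmentation techniques.

```python
import random
import statistics


def augment(records, copies=3, noise_fraction=0.05, seed=0):
    """Return the real records followed by jittered synthetic copies.

    Each synthetic record is a real record plus Gaussian noise whose scale
    is `noise_fraction` of that column's standard deviation.
    """
    rng = random.Random(seed)
    columns = list(zip(*records))
    scales = [noise_fraction * statistics.pstdev(col) for col in columns]
    synthetic = [
        tuple(value + rng.gauss(0, scale) for value, scale in zip(record, scales))
        for record in records
        for _ in range(copies)
    ]
    return list(records) + synthetic


# Hypothetical (age, income) records from a small real data set.
small_set = [(34.0, 52000.0), (45.0, 61000.0), (29.0, 48000.0)]
augmented = augment(small_set, copies=3)
print(len(small_set), "real records ->", len(augmented), "records after augmentation")
```

Keeping the noise small relative to each column's spread is a design choice: it enlarges the data set without moving synthetic records far from what was actually observed.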
The use of synthetic data predates modern computers, said Arthur Carvalho, associate professor of information systems and analytics at Miami University's Farmer School of Business.
In the past, engineers manufactured data to test if planned projects, such as bridges or skyscrapers, could withstand extreme events. But manufacturing and analyzing synthetic data was typically limited to large-scale, high-cost projects where the consequences of failure were significant. The higher price of failure justified the cost and complexity of using synthetic data.
"We can very quickly obtain quality synthetic data at a fraction of the cost of obtaining the data from the real world," Carvalho said.
Modern technology has expanded the number of use cases for synthetic data:
- Create simulations that train and test self-driving vehicles.
- Create and run preliminary tests on new drugs and train machine learning (ML) systems to perform different tasks.
- Create healthcare data sets from sources such as medical imaging and patient records, which makes it easier to perform complicated research.
- Create virtual and augmented environments.
- Support uncommon or theoretical test scenarios in digital twins.
GenAI models for data augmentation
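Data augmentation uses several GenAI models to generate synthetic data. A generative adversarial network is an ML model that has two neural networks. The two networks compete with each other using deep learning methods: one, the generator, produces synthetic samples, while the other, the discriminator, tries to tell them apart from real data. The competition pushes the generator toward increasingly realistic output.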
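The adversarial idea can be sketched in a few lines of plain Python. In this toy version, which is an illustration under heavy simplifying assumptions rather than a practical GAN, each "network" is a single linear unit: a discriminator scores samples as real or synthetic, and a generator learns only an offset `b` that shifts random noise toward the real data. The function name, learning rates, starting offset and sample data are all hypothetical.

```python
import math
import random
import statistics


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_gan(real, n_samples, steps=2000, batch=64, lr_d=0.05, lr_g=0.01, seed=0):
    """Train a one-parameter generator against a logistic discriminator
    and return n_samples synthetic values on the scale of the real data."""
    rng = random.Random(seed)
    mu, sd = statistics.mean(real), statistics.stdev(real)
    scaled = [(x - mu) / sd for x in real]  # standardize the real data
    w, c = 0.0, 0.0   # discriminator: D(x) = sigmoid(w * x + c)
    b = -3.0          # generator: G(z) = z + b, started deliberately off target
    for _ in range(steps):
        reals = rng.sample(scaled, batch)
        fakes = [rng.gauss(0, 1) + b for _ in range(batch)]
        # Discriminator step: gradient ascent on log D(real) + log(1 - D(fake)).
        grad_w = grad_c = 0.0
        for x in reals:
            p = sigmoid(w * x + c)
            grad_w += (1 - p) * x
            grad_c += 1 - p
        for g in fakes:
            p = sigmoid(w * g + c)
            grad_w -= p * g
            grad_c -= p
        w += lr_d * grad_w / batch
        c += lr_d * grad_c / batch
        # Generator step: gradient ascent on log D(fake), so fakes look more real.
        grad_b = sum((1 - sigmoid(w * g + c)) * w for g in fakes) / batch
        b += lr_g * grad_b
    # Map generated values back to the original scale.
    return [mu + sd * (rng.gauss(0, 1) + b) for _ in range(n_samples)]


# Hypothetical real data: 500 measurements centered near 4.
data_rng = random.Random(42)
real_data = [data_rng.gauss(4.0, 1.0) for _ in range(500)]
synthetic = train_toy_gan(real_data, n_samples=200)
print(round(statistics.mean(real_data), 2), round(statistics.mean(synthetic), 2))
```

Real GANs replace both linear units with deep networks and are considerably harder to train; the sketch only shows the alternating updates that define the approach.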
A variational autoencoder is a GenAI model with an encoder-decoder architecture. It uses deep learning to generate new content and detect anomalies.
Large language models power ChatGPT and other content-generating AI tools. They may have future uses for synthetic data generation and data augmentation, but they have limited uses currently, Curran said.
Benefits and challenges of data augmentation with GenAI
According to research firm Gartner, the use of GenAI to create synthetic data is growing rapidly. Gartner predicted that 60% of AI data would be synthetic in 2024, generated to simulate reality and future scenarios, up from 1% in 2021.
Experts said the benefits of synthetic data, whether used on its own or to augment real-world data sets, are driving that increase. Those benefits include the following:
- Improved privacy for individuals because their data isn't used.
- Increased safeguards for restricted and regulated data by avoiding its use.
- The ability to test extreme or edge scenarios; the lack of data or limited data from theoretical or rare events is no longer a barrier to analysis.
- Fewer biases. The creation or use of synthetic data to augment existing data sets can be corrective.
- More effective and cost-efficient data. The creation and use of synthetic data -- particularly when used to augment and scale a small data set -- can be an improvement over real-world data on both points.
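To illustrate the bias point in the list above, the following sketch rebalances a data set in which one group is underrepresented by creating synthetic records that interpolate between pairs of real records from that group. This is loosely modeled on oversampling techniques such as SMOTE; the `rebalance` function, the group labels and the sample values are hypothetical, and each group is assumed to hold at least two numeric records.

```python
import random


def rebalance(records_by_group, seed=0):
    """Pad every group up to the size of the largest one using synthetic
    records interpolated between two randomly chosen real records."""
    rng = random.Random(seed)
    target = max(len(records) for records in records_by_group.values())
    balanced = {}
    for group, records in records_by_group.items():
        extra = []
        while len(records) + len(extra) < target:
            a, b = rng.sample(records, 2)   # two distinct real records
            t = rng.random()                # position along the line from a to b
            extra.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
        balanced[group] = list(records) + extra
    return balanced


# Hypothetical data set with an underrepresented group.
data = {
    "group_a": [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0), (4.0, 5.0), (5.0, 6.0)],
    "group_b": [(10.0, 1.0), (12.0, 3.0)],
}
balanced = rebalance(data)
print({group: len(records) for group, records in balanced.items()})
```

Interpolated records stay within the range of the real ones, which limits how far the correction can stray, but it also means this approach cannot fix bias in what the real records themselves contain.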
However, extracting the benefits from synthetic data use with data augmentation techniques requires overcoming several challenges, experts said.
For example, all the challenges that exist when using real-world data, including bias, can exist when using synthetic data if controls aren't in place. Synthetic data may also be less accurate than real-world data, and creating it and using it to augment data sets adds complexity to an organization's data programs. The tools aren't ready for prime time, either.
"We're not at the point where this is a business-user level of tooling in any of the products [for synthetic data generation] I have seen," Curran said.
And as organizations strive to give all workers more access to data for analysis, those outside the data function will, for the most part, be shut out of using synthetic data for everyday analytics because they lack the technical skills it requires.
Mary K. Pratt is an award-winning freelance journalist with a focus on covering enterprise IT and cybersecurity management.