Synthetic data: Treasure trove or AI's ticking time bomb?

  • Synthetic data is being used to train AI models in environments where real data is hard to obtain
  • The benefits of synthetic data are vast, but it also comes with challenges including bias and reliability
  • AT&T is just one company that is striking a balance between using real and synthetic data to train its AI models

The newest irony in our modern, data-driven world is that while we generate plenty of it, we somehow lack the right kind of data to train certain artificial intelligence (AI) models. Synthetic data, which is artificially generated rather than collected from real-world events, is stepping in to fill this gap. But is it safe to use?

“In an era where we are literally producing more data than humankind ever has before, we’re running out of the specific types of data needed for AI training,” Bart Willemsen, VP analyst at Gartner, told Fierce Network. The more diverse the training data, the more robust the AI model, which is why organizations are now keen on leveraging synthetic data.

The promise of synthetic data is vast, and its appeal lies in its scalability and versatility. 

As Shelly Kramer, managing director and principal analyst at SiliconAngle, pointed out, “You can use synthetic data to help generate very large volumes of data quickly while also providing the ability to have control of the process.” This capability is particularly valuable for companies in sectors like pharmaceuticals, cybersecurity and autonomous vehicles, where large datasets are essential for training and testing.

A fresh report from GlobalData underscored the versatility of synthetic data. While often used to test software in pre-production environments, synthetic data’s applications “extend far beyond” that, Rena Bhattacharyya, chief analyst at GlobalData, said in a note. It can also evaluate risk, prevent fraud and even aid in drug discovery.

In the healthcare sector, for example, synthetic data is helping to address privacy concerns while accelerating research, making it an attractive option for companies bound by strict privacy regulations.

The dark side of synthetic data

However, the use of synthetic data inevitably comes with challenges.

“For starters, you've got the potential for bias and of course then the risk of the amplification of any bias that you've got in a synthetic data set,” Kramer told Fierce Network. This can lead to models that perform well on synthetic test sets but “bomb in real-world scenarios.”

This risk, coupled with the potential for missing nuances found in real-world data, raises questions about the reliability of synthetic data in critical applications.

Additional concerns have arisen around “model collapse,” a scenario in which AI models degrade in performance if they rely too heavily on synthetic data rather than real-world examples. Kramer cautioned against over-reliance on synthetic data, suggesting a hybrid approach that blends real-world and synthetic data.
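A toy illustration of the idea, not taken from any of the sources quoted here: the sketch below repeatedly fits a simple bell-curve model to samples drawn from the previous generation's model, so every generation after the first learns only from synthetic data. The function name, sample size and number of generations are illustrative assumptions. In runs like this, the estimated spread tends to drift away from the original value and shrink over many generations, which is one simplified way to picture model collapse.

```python
import random
from statistics import mean, stdev

def run_generations(generations=200, sample_size=20, seed=0):
    """Refit a Gaussian on its own samples, generation after generation."""
    rng = random.Random(seed)
    # Generation 0 stands in for the real-world distribution (assumed values).
    mu, sigma = 0.0, 1.0
    history = [sigma]
    for _ in range(generations):
        # Each new "model" sees only data generated by the previous one.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        mu, sigma = mean(samples), stdev(samples)
        history.append(sigma)
    return history

history = run_generations()
print(f"spread at generation 0: {history[0]:.3f}")
print(f"spread at generation {len(history) - 1}: {history[-1]:.3f}")
```

Mixing fresh real samples into each generation is the kind of hybrid approach Kramer describes, and in a toy setup like this it would be expected to slow that drift.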

“Balance is key,” she said. “Make sure your testing processes are rigorous, engage in continuous monitoring and refinement and also develop validation processes using real-world data.”

Striking a balance

Kramer reiterated that synthetic data can be very effective and that it is not new: researchers have been using it for decades.

“If your tasks are well-defined and you're using high quality, well-represented synthetic data, your accurate rates and effectiveness can be high,” she said.

In the telco realm, AT&T is leveraging synthetic data to enhance AI processes throughout its network, while balancing real and synthetic data to do so.

“For most of our work, we rely on our own data,” Raj Savoor, AT&T’s VP of network analytics and automation, told Fierce Network. Synthetic data generation is also useful for scenarios like time series forecasting and testing in non-deterministic environments, such as when the company is simulating network impacts during a weather event.

In scenarios where real data may be insufficient or impractical to obtain, synthetic data supplements the real data. Using synthetic data can also be more cost-effective than going through the process of collecting, organizing and labeling real-time data, said AT&T’s VP of data science Mark Austin.

This can include creating data that mimics real-world scenarios but with more control over the variables involved. Essentially, if a company doesn’t want to invest heavily in using high-priced LLMs to label data, synthetic data offers a more cost-effective alternative.

That said, Austin emphasized that results are always double-checked against real data. “We always test it on the real data, and we grade it there, but the synthetic data helps us in the middle piece,” he concluded.
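The workflow Austin describes, with synthetic data in the middle and grading on real data at the end, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not AT&T's pipeline: the data, the jitter-based generator and the nearest-centroid model are all hypothetical stand-ins chosen to keep the example short.

```python
import random
from statistics import mean

def make_real(rng, n):
    """Stand-in for scarce real data: (feature, label) pairs (hypothetical)."""
    rows = []
    for i in range(n):
        label = i % 2  # alternate labels so both classes are always present
        center = 2.0 if label else -2.0
        rows.append((rng.gauss(center, 1.0), label))
    return rows

def synthesize(rng, seed_rows, n, jitter=0.5):
    """Hypothetical generator: add noise to real rows to create extra examples."""
    return [(x + rng.gauss(0.0, jitter), y) for x, y in rng.choices(seed_rows, k=n)]

def fit_centroids(rows):
    """Nearest-centroid model: one mean feature value per label."""
    return {label: mean(x for x, y in rows if y == label) for label in (0, 1)}

def predict(model, x):
    """Pick the label whose centroid is closest to x."""
    return min(model, key=lambda label: abs(x - model[label]))

def accuracy(model, rows):
    return mean(1.0 if predict(model, x) == y else 0.0 for x, y in rows)

rng = random.Random(0)
real = make_real(rng, 60)
real_train, real_test = real[:20], real[20:]

# Synthetic rows supplement the small real training set ("the middle piece").
synthetic = synthesize(rng, real_train, 200)
model = fit_centroids(real_train + synthetic)

# The grade comes only from real rows the model never saw.
real_accuracy = accuracy(model, real_test)
print(f"accuracy on real holdout: {real_accuracy:.2f}")
```

The key design choice is that the held-out real rows are never used to generate or train on synthetic data, so the final score reflects real-world performance rather than how well the model fits its own synthetic inputs.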