GenAI could illuminate decades' worth of dark data

  • GenAI is transforming the way organizations manage and utilize unstructured data
  • Unstructured data, including documents, photos and videos, is plentiful but difficult to harness 
  • While GenAI offers powerful tools for extracting and utilizing this data, experts emphasize the need for strong data governance

Generative AI (GenAI) is revolutionizing how organizations manage and use their unstructured data, a resource that has long been abundant yet difficult to harness. But without a clear strategy, those organizations could be opening Pandora's box.

Structured data includes your standard tables, Excel sheets and databases, while unstructured data spans everything from old emails, PDFs, purchase orders and invoices to training manuals and repair guides, noted Stewart Bond, VP of data intelligence and integration software at IDC. This data has often become invisible thanks to a lack of proper management, sitting unused on hard drives or in cloud storage.

“Where I think there is tremendous opportunity is in the use of GenAI to shed light on dark data,” Bond told Fierce Network. “GenAI could be used to read through content that is not labelled or tagged, identify what is in the content including any sensitive information, add the appropriate metadata so that it becomes known and available for use.”
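The tagging pass Bond describes can be sketched in a few lines. This is only an illustration of the workflow (read untagged content, flag sensitive information, attach metadata), not anything IDC or Bond supplied. Simple keyword and regular-expression rules stand in for the GenAI model, and every name here (SENSITIVE_PATTERNS, DOC_TYPE_KEYWORDS, tag_document) is hypothetical.

```python
import re
from dataclasses import dataclass, field

# Hypothetical rules standing in for a GenAI model's judgment.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

DOC_TYPE_KEYWORDS = {
    "invoice": ("invoice", "amount due"),
    "purchase_order": ("purchase order", "po number"),
    "manual": ("manual", "guide"),
}


@dataclass
class Metadata:
    doc_type: str = "unknown"
    sensitive: list = field(default_factory=list)


def tag_document(text):
    """Return metadata describing what an untagged document appears to contain."""
    lowered = text.lower()
    meta = Metadata()
    # Assign the first document type whose keywords appear in the text.
    for doc_type, keywords in DOC_TYPE_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            meta.doc_type = doc_type
            break
    # Record which kinds of sensitive information were found, not the values.
    meta.sensitive = sorted(
        name for name, pattern in SENSITIVE_PATTERNS.items() if pattern.search(text)
    )
    return meta


print(tag_document("Invoice 1001. Amount due: $40. Contact jane@example.com"))
```

In a real deployment, a model would replace the keyword and pattern rules, but the output would have the same shape: a metadata record that makes the file discoverable and flags it for review.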

Organizations generate truly massive amounts of unstructured data, with the IDC Global DataSphere estimating that 132 zettabytes of data were created in 2023 alone, 64% of which came from enterprises. Much of this data is "dark data," Bond said, meaning it is not properly captured, tagged or managed, making it difficult to access and use.

Large language models (LLMs), though, are uniquely well-suited to understanding and processing unstructured data because they are trained on vast amounts of such content. Bond explained that LLMs can answer a wide range of questions and generate content based on unstructured data inputs.

Indeed, techniques like retrieval-augmented generation (RAG) provide a way to incorporate additional data into the model's prompts, improving the accuracy and relevance of the generated content.
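To make the idea concrete, here is a minimal sketch of the retrieval and prompt-building steps of RAG. It is a toy under stated assumptions: word overlap stands in for the embedding-based search a production system would more likely use, no model is actually called, and the sample documents and function names (retrieve, build_prompt) are invented for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, document):
    """Toy relevance score: number of word occurrences shared by query and document."""
    query_words = Counter(tokenize(query))
    doc_words = Counter(tokenize(document))
    return sum(min(count, doc_words[word]) for word, count in query_words.items())


def retrieve(query, documents, k=2):
    """Return the k documents with the highest overlap score."""
    ranked = sorted(documents, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:k]


def build_prompt(query, documents, k=2):
    """Place the retrieved passages into the prompt ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Hypothetical snippets of unstructured content.
DOCUMENTS = [
    "Repair guide: replace the pump filter every six months.",
    "Invoice 1001: twelve pump filters shipped to the Denver warehouse.",
    "Training manual: new hires complete safety onboarding in week one.",
]

QUERY = "How often should the pump filter be replaced?"
print(build_prompt(QUERY, DOCUMENTS, k=2))
```

The resulting prompt would then be sent to an LLM. Because the model sees the retrieved passages alongside the question, its answer can draw on company content that was never part of its training data.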

What can unstructured data tell us?

Enterprises and service providers alike are recognizing the potential of GenAI to extract and structure data from unstructured sources like long-form documents, research papers and emails.

This capability is crucial for organizations that need to connect unstructured data to downstream processes, such as integrating purchase orders into an ERP system, said Amy Machado, research manager for enterprise content and knowledge management strategies at IDC. Beyond simple extraction, GenAI can also help organizations search, discover, summarize and even generate new content based on existing unstructured data.

This turns previously inaccessible knowledge into actionable insights that can drive business processes and decision-making, Machado told Fierce. "A lot of unstructured data has knowledge stored in it," she added.

In one such instance, AWS is now using GenAI to enhance its sales teams by combining structured data from CRM systems with unstructured data like sales collateral. This allows the models to generate comprehensive customer account summaries that provide sales teams with more contextually relevant insights.

Structured data provides the quantitative foundation (e.g., consumption, pipeline), as well as historical trends, while unstructured data adds qualitative depth. Unstructured content such as sales collateral and external web data provides context and nuance that structured data alone might miss.

“Generative AI and LLMs have revolutionized what we can do with unstructured content which historically has been challenging to analyze at scale,” Rupa Boddu, principal tech product manager, Generative AI, AWS, told Fierce Network.

Weeding out the bad data

While the benefits of leveraging unstructured data with GenAI are clear, it is equally important to ensure that the data feeding these models is accurate, unbiased and free of sensitive information.

Unstructured data is much like a digital landfill, where information has been tossed and forgotten for years. Now, many companies have little understanding of what’s buried within, and opening it up could have unforeseen consequences. 

Bart Willemsen, VP analyst at Gartner, said using unstructured data for GenAI applications introduces critical questions around privacy and data governance, “something most organizations seem to not have solved yet.”

Most companies don’t actually know what data they’ve accumulated (in some cases “decades and decades of history,” Willemsen said), why they had that data to begin with, or what purposes it served.

Without proper data governance, there is a risk that GenAI could propagate inaccuracies or biases, leading to flawed outputs and potentially harmful decisions. Organizations must, therefore, implement robust data governance frameworks to manage the quality and security of the unstructured data used in training and deploying GenAI models.

Organizations need general data governance with “absolute, granular control before they can be successful at using any AI,” Willemsen concluded. “I don't care how good the AI technology itself is, if you have crappy data, you will have crappy AI.”