
The Five Nine: What is data curation and why does it matter?

By Diana Goovaerts Oct 17, 2024 4:00pm

This week we’re going back down the AI rabbit hole, but we’re venturing down a new tunnel to talk about something called data curation.

Though AI is still a developing technology, it’s well understood at this point that models are only as good as the data they’re trained on. But for enterprises looking to fine-tune publicly available models, it can be a challenge to make sure they’re making the right data available. Why? Well, the vast majority of enterprise data is what is known as unstructured data. That includes any data that doesn’t fit neatly into the rows and columns of a database – photos, videos, emails, PDFs, you name it.

Enter data curation – which is basically just the process of sorting through all this data to decide what is relevant to train the model and what’s not. Today this is mostly a tedious, manual process. But is it even worth the hassle?

We spoke to Vincent Chen, Director of Product and Founding Engineer at Snorkel AI, to get the lowdown on how data curation works, why it matters and whether it’s worth the hassle.

This podcast is written and hosted by Diana Goovaerts. It is edited by Diana Goovaerts and Matt Rickman. Liz Coyne is our executive producer.

To learn more about the topics in this episode:

Snorkel AI dives into hot market of data curation https://www.fierce-network.com/cloud/snorkel-ai-dives-hot-market-data-curation

Data storage gets spicy with help from AI https://www.fierce-network.com/ai/data-storage-gets-spicy-help-ai

GenAI could illuminate decades worth of dark data https://www.fierce-network.com/cloud/unstructured-data-pandoras-box-genai-its-key
