- Databricks has shifted its mission to support generative AI (GenAI) development for customers of all sizes
- With the acquisition of MosaicML, the company has made it possible for organizations to train large language models (LLMs) and manage data in a single environment
- However, many organizations are still struggling to bring their AI projects past proof of concept
Born out of UC Berkeley by the creators of Apache Spark, Databricks initially focused on solving challenges in big data processing, a field that "was AI before it was cool to be AI," said Robin Sutara, the company’s field chief data strategy officer. Now, Databricks is on a mission to support generative artificial intelligence (GenAI) development right where enterprise data lives.
Early on, Databricks addressed the need for efficient big data processing in nascent AI and machine learning (ML) applications. As the data landscape evolved, Databricks recognized another challenge hindering organizations: data silos.
Many organizations traditionally stored structured data in warehouses for backward-looking insights while dumping structured and unstructured data into data lakes. The data used by GenAI “will come from everywhere and in any form,” noted Mary Mesaglio, distinguished VP analyst at Gartner. The analyst firm estimates that only 10% to 30% of the data in most enterprises is structured.
Data lakes often became “data swamps,” Sutara said — overwhelmed, costly repositories with limited usability. To address this problem, Databricks in 2020 introduced its lakehouse model, blending the structured nature of data warehouses with the flexibility of data lakes. This model allowed companies to run both business intelligence (BI) and AI on a single platform.
The lakehouse approach cleared a smoother path to data insights, but now GenAI has prompted Databricks to rethink what the lakehouse can accomplish, Sutara told Fierce Network at the Gartner IT Symposium.
About 18 months ago, the acquisition of MosaicML marked Databricks’ leap into GenAI, with the former's platform allowing customers to train large language models (LLMs) without the vast resources of tech giants. Thus, GenAI has given Databricks an opportunity to "unlock the power of a lakehouse," Sutara explained.
MosaicML’s AI development platform dovetailed with Databricks’ goal to make advanced AI capabilities accessible to its customers. Companies can store data, run custom GenAI models, automate data management tasks and integrate governance in a single environment.
For example, Corning uses the Databricks Data Intelligence Platform for machine learning and analytics projects to improve its manufacturing processes and refine product quality. Using MLflow and GenAI, Corning has created LLMs explicitly trained on its company data to accelerate development processes and make data-driven decisions.
Sutara noted GenAI can also generate explanations to make data easier to understand and put into context. "How do we actually help them construct the business semantics around it so that they understand what is that data, who’s using it?" she added.
Most customers struggle to scale AI projects
Despite these advancements, Sutara cautioned that a successful AI implementation goes beyond infrastructure. Many of Databricks’ customers face difficulties in scaling AI projects, with "about 85% of them never mak[ing] it into production," Sutara said. She emphasized that a “people-first approach” is crucial and that AI should be accessible to non-technical users through natural language interfaces.
And for many companies, “the work of data preparation still remains,” noted Hung LeHong, distinguished VP analyst at Gartner. Many tech vendors are racing to lower costs for data management, and GenAI itself can help organize data.
However, companies "still need sophisticated data management," LeHong said at the symposium. Specifically, companies should know what is included in the unstructured data being used for GenAI, and ensure that data has the correct protections and access rights in place.
"You can't abandon data management because even though Gen AI tools can technically do this, it's costly to leave [data] messy," LeHong said.