Just like it did with the data lakehouse five years ago, Databricks is establishing another paradigm with data intelligence, which has the data lakehouse architecture at its core but is infused with generative AI (GenAI). Data intelligence was a key theme throughout Databricks Data & AI Summit and signals Databricks’ intentions to further democratize AI and ultimately help every company become an AI company.
A Brief Databricks Backstory
Founded by the creators of Apache Spark, Databricks is known as a trailblazer for launching new concepts in the world of data, such as Delta Lake, the open table format with over 1 billion yearly downloads, and the “lakehouse” architecture, which reflects Databricks’ effort to combine the best of what the data lake and data warehouse offer. Launched in 2020, the lakehouse architecture can handle both structured and unstructured data, and addresses the data engineer and business analyst personas in a single platform.
Delta Lake and Unity Catalog, which governs the data stored in these Delta tables, serve as the basis for the lakehouse architecture and are part of Databricks’ longtime strategy of simplifying the data estate and, by extension, AI. But with the advent of GenAI, which is causing the amount of unstructured data to proliferate, Databricks has spearheaded yet another market paradigm, pushing the company beyond its core areas of data ingestion and governance into data intelligence.
At the heart of data intelligence is the lakehouse architecture and also Mosaic AI, the rebranded result of last year’s MosaicML acquisition that equipped Databricks with the tools to help customers train, build and fine-tune large language models (LLMs). These also happen to be the same technologies Databricks used to build its own open-source LLM ― DBRX ― sending a compelling message to customers that they, too, can build their own models and use the Mosaic AI capabilities to contextualize that data and tailor it to their business, thus achieving true data intelligence.
What Is Data Intelligence?
Databricks’ executives and product managers largely communicated the definition of data intelligence through demonstrations. One of the more compelling demos showed how Mosaic AI can be used to create an agent that will build a social media campaign, including an image and caption for that campaign, to boost sales.
The demo depicted how a user can use transaction data as a tool to supplement a base model, such as Meta’s Llama 3. This demo was key to highlighting one of Databricks’ product announcements, the Shutterstock ImageAI model, which is built on Databricks in partnership with Shutterstock and marks Databricks’ foray into the multimodal model space.
The exercise created an image for the fictional social media campaign that included a company’s bestselling product, chosen through transaction data, and a catchy slogan. But to convey the contrast between data intelligence and general intelligence, the demonstrator removed the “intelligence” (all the data-enabled tools that exist in Unity Catalog) and generated the image again. This time, the image did not include the bestselling product and was accompanied by a much more generic slogan.
This demo reinforced the importance of contextualized data in GenAI and the role of Unity Catalog, which helps govern the data being used, and Mosaic AI, which allows developers to use enterprise data as tools for creating agents (e.g., customer support bots).
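To make the contrast in the demo concrete, here is a minimal sketch of the idea of exposing enterprise data to an agent as a tool. It is not Databricks' or Mosaic AI's API: the transaction records, the `bestselling_product` helper and the `build_campaign_prompt` function are all hypothetical names invented for illustration, and the "agent" is reduced to building a text prompt.

```python
from collections import Counter

# Hypothetical transaction records. In the demo, this kind of data would
# live in governed tables rather than in an in-memory list.
TRANSACTIONS = [
    {"product": "trail shoes", "units": 3},
    {"product": "rain jacket", "units": 1},
    {"product": "trail shoes", "units": 2},
    {"product": "water bottle", "units": 4},
]


def bestselling_product(transactions):
    """Hypothetical data tool: return the product with the most units sold."""
    totals = Counter()
    for row in transactions:
        totals[row["product"]] += row["units"]
    return totals.most_common(1)[0][0]


def build_campaign_prompt(tools=None):
    """Build an image prompt, grounded in data only if a tool is supplied."""
    if tools and "bestseller" in tools:
        product = tools["bestseller"]()
        return (
            "Create a social media image featuring our bestselling "
            f"product, {product}, with a catchy slogan."
        )
    # Without data tools, the prompt can only be generic.
    return "Create a social media image for our store with a slogan."


# With the data tool available, the prompt names the actual bestseller.
grounded = build_campaign_prompt(
    {"bestseller": lambda: bestselling_product(TRANSACTIONS)}
)
# With the tool removed, the same request falls back to a generic prompt.
generic = build_campaign_prompt()

print(grounded)
print(generic)
```

The only difference between the two calls is whether the data-backed tool is available, which mirrors the demonstrator's step of removing the tools and regenerating the image.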
Data intelligence is about not only the context behind the data but also making that context a reality for the enterprise. For instance, in the above scenario, the demonstrator was able to put the image and slogan into Slack and share it with the marketing team through a single prompt. In this example, it is clear how a customer with Databricks skills could use GenAI in their business.
Databricks’ Acquisition of Tabular Is a Blow to Snowflake and a Surefire Way to Stay Relevant in the Microsoft Ecosystem
As a company born on the values of openness and reducing lock-in, Databricks pioneered Delta Lake to ensure any engine can access the data sitting in a data lake. Delta Lake remains the most widely adopted lakehouse format today, handling over 90% of the data processed in Databricks, and is supported by other companies, as 66% of contributions to the open-source software come from outside Databricks.
But over the past few years, we have seen Apache Iceberg gain traction as a notable alternative, garnering significant investment from data cloud platforms, including Snowflake. When Databricks announced its acquisition of Tabular ― created by the founders of Apache Iceberg ― days before the Data & AI Summit, it signified a strategic shift that will help Databricks target a new set of prospects who are all in on Iceberg, including many digital natives.
The general availability of Databricks’ Delta Universal Format (UniForm), which helps unify tables from different formats, indicates the company’s intention to make Delta and Iceberg more interoperable and, over time, potentially narrow the differences between the two formats, though this may be a longer-term vision.
The Tabular acquisition in some ways also undercuts Snowflake’s steps to become more relevant as a Microsoft Fabric partner. Available through Azure as a first-party native service, Databricks has always had a unique relationship with Microsoft, and Delta serves as the basis for Microsoft Fabric. But Microsoft’s recent announcement that it will support Iceberg tables with Snowflake in a push for more interoperability was notable, and now with Tabular, Databricks can ensure it remains competitive in the Microsoft Fabric ecosystem.
It Is All About Governance
First announced three years ago, Unity Catalog has emerged as one of Databricks’ more popular products, allowing customers to govern not just their tables but also their AI models, an increasingly important component in GenAI.
At the event, Databricks announced it will open source Unity Catalog, which we watched happen during the Day 2 keynote, when Unity Catalog was uploaded to GitHub. Despite Unity Catalog’s mounting success, this announcement is not surprising and only reinforces the company’s commitment to fostering the most open and interoperable data estate.
It is very early days, but open sourcing Unity Catalog could help drive adoption, especially as governance of GenAI technologies remains among the top adoption barriers.
Databricks SQL Is Gaining Momentum
It is no secret that Databricks and Snowflake have been moving into one another’s territories. Databricks, with its expertise in AI and machine learning (ML), has been progressing down the stack, trying to capture data warehouse workloads. Snowflake, with its expertise in data warehousing, is looking to get in on the AI opportunity and address the core Databricks audience of data scientists and engineers.
Snowflake’s early lead in the data warehouse and strong relationship with Amazon Web Services (AWS) could be making it more difficult for Databricks to attract workloads. Given that lead and the sheer size of the market, there may never be a scenario in which Databricks becomes a “standard” in enterprise accounts for data warehousing. But Databricks’ messaging of “the best data warehouse is a lakehouse” certainly seems to be working.
Traditionally, customers have come to Databricks for jobs like Spark processing and ETL (Extract, Transform, Load), but customers are increasingly looking to Databricks for their data warehouse. These customers fall into two groups. In the first group, customers on legacy systems, such as Oracle, are fed up with the licensing and are looking to modernize. In the second group, existing cloud customers are looking for a self-contained environment with less lock-in, compared to vendors like Snowflake, or are seeking to avoid challenges with system management and scale after having worked with hyperscalers.
As highlighted by Databricks Co-founder and Chief Architect Reynold Xin, Databricks SQL is the company’s fastest-growing product, with over 7,000 customers, or roughly 60% of Databricks’ total customer base. During his keynote, Xin touted a startup time improved to five seconds with Databricks SQL Serverless, as well as automatic optimizations that make BI workloads four times faster than they were two years ago. Provided Databricks can continue to enhance performance while pushing the boundaries on ease of use to better compete with Snowflake and other vendors in attracting less technical business personas, we expect this momentum will continue and will challenge competitors to raise the bar for their own systems.
Databricks Is Bringing an Added Layer of Value to the BI Stack
Databricks AI/BI is a new service available to all Databricks SQL customers that allows them to ask questions using natural language (Genie) and perform analytics (Dashboards). In a demo, we saw the two user interfaces (UIs) in action: BI offers common features like no-code drag and drop and cross-filtering, and AI includes the conversational experience where customers can ask questions about their data.
Databricks AI/BI may lack some of the complex features of incumbent BI tools, but ultimately these are not the goals of the offering. The true value is in the agents that can understand the question the business analyst is asking and hoping to visualize. Databricks’ approach exposes the challenges of bolting generic LLMs onto a BI tool. But the company is not interested in keeping this value confined to its own BI capabilities. Staying true to its culture of openness, Databricks announced at the event that it will open up its API to partners, ensuring Power BI, Tableau and Google Looker customers can take advantage of data intelligence in these BI environments.
Conclusion
With its lakehouse architecture, which was founded on the principles of open-source software and reduced lock-in, Databricks is well positioned to help customers achieve data intelligence and deploy GenAI. The core lakehouse architecture will remain Databricks’ secret sauce, but acquisitions, including those of MosaicML and Tabular, are allowing Databricks to broaden the scope of its platform to tap into new customer bases and serve new use cases.
If Databricks can continue to lower the skills barrier for its technology and sell the partner ecosystem around its platform, the company will no doubt strengthen its hold on the data cloud market and make competitors, including the hyperscalers in certain instances, increasingly nervous.