IBM has taken another open source technology provider into its portfolio, acquiring Ahana, one of the leading vendors behind Presto, the massively parallel, distributed, in-memory SQL query engine. Presto, first created and still actively developed at Meta, has attracted a broad array of open source contributors and is championed by several vendors. It has also been forked to create Trino, which competes for the same set of use cases. Trino has its own adherents, including the much better funded Starburst, which is seen by some as “more closed.” Both technologies, and the vendors behind them, have competed with unified data access solutions from the major data management vendors. Many of those vendors also support one or both of these emerging standards, in an effort to attract prospects unwilling to be tied to what they may view as the vendors’ proprietary solutions.
IBM is placing a bet here. The choice reflects its posture of building an open source-based stack that ties together its sizable community of DBMS customers on proprietary technologies like Db2 and Netezza with those adopting newer, often open source, alternatives. IBM brings the expertise it has developed in the open source formats it has helped fund and build, across both on-premises and cloud deployments. It has already partnered with Cloudera to craft strategies for carrying the transforming Hadoop wave forward and extending it as it moves to the cloud, and it is eager to use its presence in the cloud and the open source community to join the rising tide of analytic data management with a comprehensive solution. Ahana offers a key component and a connection to a powerful community, the Presto Foundation within the Linux Foundation, and will no doubt figure prominently in upcoming deliverables.
Ahana must be seen as a down payment on the construction of IBM’s own entry into this space. IBM has a formidable array of assets, and although it is late to the party, the lakehouse market is just getting started, and there is room to make a splash.
Some Context – The Evolution of the Open Lakehouse
In the beginning was the transactional data; it resulted from, and drove, business processes. And business users wanted it after the fact, to analyze, to understand what was happening and improve outcomes, to predict the future and create new opportunities. But it was hard to find, hard to get to, and hard to combine with other data. And so the most well-organized data was copied into data warehouses: well-structured, well-chosen subsets of the data for well-known uses required by well-funded, well-connected users.
Before long, new kinds of operational data came along – interactions on websites, trails of breadcrumbs left behind by visitors navigating shopping sites and games and all manner of commercial and personal activity, readings from all sorts of instruments on oil rigs, self-driving cars, and communications networks – as well as communications between people on all kinds of social sites. This data was not so well organized. So as uses for this data became apparent, the files it had been stored in became a new target, and those files – new and old, in many formats – were put into data lakes. Largely unaltered, at least at first, the data in these lakes became a playground for data scientists trying out new analytic tools, like notebooks with built-in graphics tools and machine learning models to detect correlations and patterns.
The success of these efforts led to an obvious question:
What if we combined these two kinds of data to generate richer insights and connect new techniques with old ones?
Traditional data warehouse users pursued this goal by creating logical data warehouses, a term coined by Gartner. But data lake users, often a separate group within their firms, pursued the same goal, and from their ranks emerged a new name: the data lakehouse, popularized and promoted by Databricks and now broadly adopted to describe this rich place of intersection.
Vendors from both ends of this “structured-to-unstructured” continuum – data warehouse purveyors, often using proprietary formats and providing exclusive access to the contents, and data lake vendors, often with a more open, standards-based posture – pursued this emerging market. Both sets of vendors have spawned new technologies, evangelized emerging use cases, and helped accelerate the growth of these new analytic tools.