Over the past two years, Cloudera has demonstrated the power of surrounding emerging open source software with support services, expertise and its own IP. The firm has racked up more than 30 customers since its founding in late 2008 and has emerged as the leading commercial provider of Apache Hadoop. Cloudera’s recent C round of financing brought its total funding to $36 million, and it has been investing aggressively: 45 employees, a very visible voice on the Big Data circuit and a stellar, experienced leadership team. It evangelizes through training, thought leadership and, increasingly, a growing sales and marketing team. Cloudera deserves a full post of its own; I hope to get to that before year-end.
One indicator of Cloudera’s precocity has been its prioritization of key alliances – higher than at many firms its size – and that strategy is likely to have a big payoff if the partnerships are well executed and deliver the marketplace momentum and value they promise. Two key recent announcements involved Membase and Informatica. I’ll discuss the latter in another post; here I’ll talk about why the Membase deal makes so much sense.
Membase (formerly NorthScale) is a player in the NoSQL movement, a competitor of MongoDB, Apache Cassandra and Riak. Like Cloudera’s distribution, Membase Server is built on the commercialization and extension of an open source project – in this case memcached, a distributed memory object caching system used to parallelize database activities for web apps. memcached is in wide enough use with MySQL that the latter’s manual documents how to use them together; Membase also markets a commercial version of memcached. Membase’s premise is simple: add persistence to the key-value store so it can replace the RDBMS in use today for the overwhelming majority of web applications. The company was founded in January 2009 and has also been effective at raising funds, with a $15M round under its belt. Mike Olson, CEO of Cloudera, is on the board, and like Cloudera, Membase has assembled an experienced team for its run at the emerging space.
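To make that premise concrete: memcached is typically deployed in the cache-aside pattern, where the application checks the cache before the database and fills it on a miss; Membase’s pitch is to make the cache layer itself the persistent store. A minimal sketch of the cache-aside pattern, using plain dicts as stand-ins for the memcached client and the MySQL tables (both stand-ins are my assumption for illustration, not Membase’s actual API):

```python
# Cache-aside pattern as used with memcached in front of MySQL.
# A dict stands in for the memcached client, another for the RDBMS.

class CacheAsideStore:
    def __init__(self, db):
        self.cache = {}      # stand-in for a memcached client
        self.db = db         # stand-in for the relational database
        self.db_reads = 0    # counts how often we fall through to the DB

    def get_user(self, user_id):
        key = f"user:{user_id}"
        row = self.cache.get(key)      # 1. try the cache first
        if row is None:
            self.db_reads += 1
            row = self.db[user_id]     # 2. cache miss: read the database
            self.cache[key] = row      # 3. populate the cache for next time
        return row

db = {42: {"name": "alice"}}
store = CacheAsideStore(db)
store.get_user(42)     # miss: falls through to the DB
store.get_user(42)     # hit: served from the cache
print(store.db_reads)  # -> 1
```

In the Membase model, the step that falls through to a separate RDBMS disappears: the key-value tier is durable on its own.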
But what space are we talking about? Co-founder and SVP of Products James Phillips is quite eloquent in a recent blog post that acknowledges “NoSQL” is a lousy name for it – no better than “Big Data.” He offers a different way – or set of ways – to think about these offerings:
Ultimately, categorizing these solutions will require looking beyond the underlying data model alone (key-value, document, column-oriented, graph, etc.). Rather, these systems should be compared using a larger, hopefully manageable, set of attributes: Must you declare a schema before inserting data? Can you change the schema on the fly if one is required? How hard is it to do that? Can the database transparently (to an application) spill across machines or is it a single-server focused solution? Must you take down the database to add or remove capacity? Can you query the database using a query language or must you write code? Does the system maintain indices to speed queries? How does the database perform on random and sequential operations? How does it perform on reads versus writes? Is data written to durable media immediately, or eventually, and what is my data loss exposure on node failure? How about on datacenter failure? Can I change that exposure through synchronous operations? What will that do to performance? Can the database work across datacenter boundaries? Will I always read my writes, or are there periods of data inconsistency across readers?
Potential buyers who ask these questions will be well on the way to a good product selection process. This is not the same analytics-focused Hadoop marketplace that has been the center of attention in the ADBMS community; the broad discussion of Hadoop there has revolved around MapReduce and its use for batch-oriented, large-scale analytics. The objective here is something else entirely: Membase, which went GA at the beginning of October, is really about OLTP – scalable, low-latency access to data for processing a certain simple class of transactions.
Membase is available in a Community Edition and an Enterprise Edition. The two versions differ technically in only one dimension: the number of nodes they support. Unlike what some in the open source movement call “crippleware” (I’ve talked about this before), Membase Community Edition is fully functional. There are several versions at differing price levels.
Membase already has substantial applications in place. An early customer, Zynga, was running 500,000 operations per second (yes, for FarmVille) on a prerelease version this summer, before Membase Server’s commercial release. The product runs on Linux or Windows, adds nodes with a click, and sports one-click failover and a GUI.
Why the Cloudera-Membase partnership? For scenarios that require both scalable, low-latency data access and batch analytics to complete the application’s mission. This kind of hybrid, bidirectional data integration is the topological requirement of a new class of applications – AOL Advertising and ShareThis are joint customers with exactly these requirements. A Flume interface streams data from Membase to Hadoop; a Sqoop utility handles batch transfers between the two. Both utilities will be familiar to Hadoop watchers.
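The data flow being described has two paths out of the operational store: a continuous stream of writes (the Flume-style path) and periodic bulk snapshots (the Sqoop-style path). A conceptual sketch of that topology, with all class and method names my own hypothetical stand-ins rather than the real tools’ APIs:

```python
# Conceptual sketch of the hybrid topology: operational writes land in the
# key-value store, each write is also emitted to a stream (Flume-like path),
# and a batch job (Sqoop-like path) copies a full snapshot to the analytics
# side. This mimics the data flow only, not Membase, Flume, or Sqoop.

class HybridStore:
    def __init__(self):
        self.kv = {}       # low-latency operational store (the Membase role)
        self.stream = []   # streaming feed toward Hadoop (the Flume role)

    def set(self, key, value):
        self.kv[key] = value
        self.stream.append((key, value))   # every write is streamed out

    def batch_export(self):
        # Sqoop-style bulk transfer: snapshot the current store state.
        return dict(self.kv)

store = HybridStore()
store.set("ad:1", {"clicks": 10})
store.set("ad:1", {"clicks": 11})

print(len(store.stream))     # -> 2: both writes reached the stream
print(store.batch_export())  # the batch copy sees only the latest state
```

The difference matters for analytics: the streaming path preserves every event, while the batch path sees only end-state values, which is why applications like ad serving want both.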
Phillips told me that dedicated engineering resources on both sides are driving the development. Developers and architects can rely on “quasi-deterministic latency and throughput” by scaling out with inexpensive new hardware as needed. Built-in memcached technology, auto-migration of hot data to the lowest-latency storage available (RAM, SSD or disk), and selectable write behavior (asynchronous or synchronous) offer flexibility to meet a wide range of requirements. Membase says its high throughput is driven by its multi-threaded model, which minimizes lock contention, and by automatic write deduplication.
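Write deduplication is worth unpacking: when many writes to the same key arrive between flushes to durable media, only the latest value needs to be persisted. A minimal sketch of the idea, assuming a simple pending-queue-plus-flush structure (my assumption for illustration, not Membase’s internals):

```python
# Sketch of write deduplication: repeated writes to the same key that
# arrive before the next flush are coalesced, so only the latest value
# reaches durable storage. The queue/flush design here is hypothetical.

class DedupWriteQueue:
    def __init__(self):
        self.pending = {}   # key -> latest value awaiting flush
        self.flushed = 0    # how many writes actually hit "disk"

    def write(self, key, value):
        # Overwriting a pending entry silently dedupes the earlier write.
        self.pending[key] = value

    def flush(self, disk):
        for key, value in self.pending.items():
            disk[key] = value
            self.flushed += 1
        self.pending.clear()

disk = {}
q = DedupWriteQueue()
for n in range(100):
    q.write("counter", n)          # 100 logical writes to one hot key
q.flush(disk)
print(q.flushed, disk["counter"])  # -> 1 99
```

For a hot key like a game counter, this turns a flood of logical writes into a single physical one per flush interval – one reason asynchronous write behavior can buy so much throughput over synchronous writes, at the cost of some data-loss exposure on node failure.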