EMC Buys Greenplum – Big Data Realignment Continues
July 6, 2010
EMC’s acquisition of Greenplum, announced today as a cash transaction, reaffirms the obvious: the Big Data tsunami upends conventional wisdom. It has already reshaped the market, spawning the most ferment in the RDBMS (and non-R DBMS via the NoSQL players) space in years. When I first posted on Greenplum over a year ago, I said that
“Open source + capital has created an intriguing new model of rapid innovation in “mature” markets, and the database space – like BI – is not a done deal. It is indeed possible to escape the gravity well, if you execute. Greenplum is getting it done, and is among the new stars to watch.”
Why the open source reference? Greenplum uses a parallelization layer atop PostgreSQL (like Aster, another of the new breed of ADBMS).
Now EMC has written the next chapter in that story. In the process, it adds a new piece (after literally dozens of others in the past few years) to its own portfolio, which already includes unstructured data (via Documentum) and virtualization (via VMware), layered in among the industry-leading storage and information management pieces. Disruptive? You bet. Is EMC finished? I doubt it. Candidates? BI tools, ETL, MDM, data integration come to mind. Losers? At least one big one. Read on.
What is this deal about? It’s about reshaping the DBMS and data warehouse markets, affecting new players, new use cases, form factors, and new assumptions about performance. It’s about:
- Use cases that conventional DBMS struggle with. Organizations are turning in increasing numbers to specialist platforms for analytic uses – not all of which are even databases in the conventional sense, viz. Hadoop/MapReduce. The typical sales pitch? “We can do that faster than [incumbent big DBMS]. Just let us show you.”
- Appliance convenience – software plus processors, and also plus storage. At the high end, database integration with servers and storage has become increasingly important. Netezza helped change this game, which had been left to Teradata, by adding the convenience factor and price as value propositions. Not incidentally, it also means added revenue for the vendor, so it’s no surprise that Oracle and IBM DB2 have joined the game too. Among the big players, only Microsoft remains odd man out of the appliance game (for a few more months).
- Controlling the explosive growth of Big Data. This may seem counterintuitive for a storage vendor, but Greenplum’s columnar storage capability means it can help its customers use less storage. Is this a problem? No – it will fill up as fast as customers realize they can keep a year’s worth for analysis instead of 3 months’ worth. And expanding use cases will sell more storage, not less. Oracle is touting Exadata’s ability to do the same, Netezza has had it for a while too, and IBM has been pushing its database compression as well. See this useful post by Curt Monash on the topic for more.
- Finally, and not least, it’s about performance. At scale. Technical improvements become more possible when the vendor owns the whole stack. Balancing IO, processor speed, memory, multiple tiers of storage, virtualization, security, calls to external analytics, ingesting data, exporting data – all become more tunable, more optimizable, if you own all the pieces.
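To make the storage point in the list above concrete, here is a minimal sketch of why a column-oriented layout tends to compress well. It is a generic illustration using simple run-length encoding, not Greenplum’s actual storage format; the table, its values, and the function name `run_length_encode` are all hypothetical.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


# Hypothetical fact table, one tuple per row: (region, day, amount).
rows = [
    ("east", "2010-07-01", 10),
    ("east", "2010-07-01", 12),
    ("east", "2010-07-02", 7),
    ("west", "2010-07-02", 9),
    ("west", "2010-07-02", 3),
    ("west", "2010-07-03", 5),
]

# Column layout: each attribute is stored contiguously, so repeated
# values sit next to each other and collapse into long runs.
region_col, day_col, amount_col = (list(col) for col in zip(*rows))
region_rle = run_length_encode(region_col)
day_rle = run_length_encode(day_col)

# Row layout: whole rows rarely repeat, so the same encoding finds no runs.
row_rle = run_length_encode(rows)

print(len(region_rle), len(day_rle), len(row_rle))
```

In this toy table the region column collapses to two runs and the day column to three, while the row-wise encoding keeps all six entries. Real products combine several encodings, so treat this only as intuition for the direction of the effect.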
Greenplum, with over 100 customers, liked to say that it was acquiring net new sites at a pace faster than Netezza and Teradata. That made it a worthy candidate for EMC’s next step. But there was much more to talk about. Scale? A customer in Japan already had a 300-node system. Mature code line? Not as broadly as, say, Oracle, but Greenplum version 4 had recently added some significant features: complex query optimization, fast parallelized data loading, some early workload management suitable for its use cases, embedded languages for analytics (compiled C, Map and Reduce functions with optimized binding and data marshaling), connection management, query prioritization, and self-healing physical block replication. While it was not yet at the level that, say, Aster had achieved for programmability, it did support Java functions – both for general function declarations and as a MapReduce language. The fast parallel data-out capability – using an external table metaphor for flexibility and ease of use – fits nicely in a storage vendor’s vision. So does the ability to control the placement of tables on disk, used for flexible space allocation and assignment to different storage types (SSD, near-line, etc.).
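For readers less familiar with the Map and Reduce functions mentioned above, the pattern can be sketched in a few lines of plain Python. This is a single-process illustration of the general technique, not Greenplum’s or Hadoop’s API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `run_job`, and the sample input, are hypothetical.

```python
from collections import defaultdict
from functools import reduce


def map_phase(record):
    """Emit (key, value) pairs for one input record; here, one pair per word."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine every value seen for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on a list of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = run_job(["big data big storage", "data warehouse"])
print(counts)
```

In a parallel database or a Hadoop cluster, the map and reduce steps would run on many nodes at once and the shuffle would move data between them; the sketch only shows the shape of the functions a developer writes.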
Greenplum’s Enterprise Data Cloud vision, discussed in a great post by Ramon Chen here and in a white paper I wrote for Greenplum here, had not taken off yet, but its notions of elastic provisioning, spinning up and (perhaps even more important) spinning down analytical projects have great resonance. Greenplum recently added to the vision with its Chorus product, focusing on self-service provisioning, data virtualization services, and data collaboration – issues that matter a great deal in environments where relevant data is in many places, inside and outside the data warehouse – and will stay there and become more widespread. The alignment with EMC’s strengths is again obvious, though clearly there will be work that must be done to flesh the integration story out. Bringing CEO Bill Cook over to run EMC’s new division sounds like a bet to do just that.
Who loses here? Not customers – especially Greenplum’s. They get a much more richly funded supplier, and apparently, management continuity. The most obviously displaced competitor is EMC’s existing partner ParAccel. In 2008, the two unveiled the Scalable Analytic Appliance (SAA) pilot program at EMC World, and ParAccel joined EMC’s DW/BI Competency Center. In 2009 they went GA, landed a few customers, and ParAccel won EMC’s Offering of the Year award. In April 2010 ParAccel published a record-breaking TPC-H benchmark over virtualized storage, an impressive feat. At the time, I was surprised at how utterly minimal the promotion of this by EMC was – but the new deal perhaps makes the reason clearer. More is at stake for ParAccel: EMC’s RecoverPoint and MirrorView provided its replication capabilities, and ParAccel was leaning on EMC in part for its HA and DR story. One has to assume all that is now at risk.
[correction, July 31 - Barry Zane of ParAccel, responding at my request to a correction offered by an unnamed guest, says:
The benchmark was NOT run over virtualized storage. It used direct attached disks. It WAS based on VMWare virtual machines, running two VMs per server. Each VM was given ownership of half the drives. No EMC in the picture at all, other than the fact that they own most of the VMWare shares.
There’s another factual error regarding ParAccel. We do NOT rely on a SAN (such as EMC) for HA. ParAccel clusters that are SAN-less use their own cross-node internal mirroring, we call it RAID-P. In a SAN-connected environment, we turn off RAID-P and just consider the SAN the mirror copy for HA. We do leverage the SAN for DR. By doing so, we are the ONLY young vendor that has a DR offering at all.
I appreciate Barry providing the information. It's another reason I like the blogosphere so much - such issues can be quickly resolved. - MA]
How about the other ADBMS players? Netezza now will have a formidable foe to contend with in a similar space, depending on how the new EMC offering is configured, priced and targeted. The other sizable players in customer volume are Vertica and Sand. The former has been very visible, is also MPP and columnar, and is similarly mature in its SQL and management capabilities; the latter is less so. Aster is very differentiated – much more focused on analytics and programming, but considerably smaller, as are a number of others. What about the big guns? Oracle, IBM, and Sybase (recently acquired by SAP, in another indicator of the market upheaval) are on notice, though not likely to feel threatened anytime soon. HP? It is not a factor in the market today, and this deal is yet another example of HP sitting on its hands and failing to acquire assets that could make a difference. But the game has changed, and there are implications for Microsoft, whose largest competitors own their stack from storage up, and can optimize in-house, while Microsoft must work on reference architectures for its still-unreleased Parallel Data Warehouse with partners and get them to build to order. That does not sound like a recipe for leadership, though they may be able to achieve fast follower status given their clout. [added 7/11/2010: Note that all this copying of data to ever larger analytic platforms ups the ante on data quality and synchronization issues - these vendors need to have an answer to the problems this creates. Their focus on it varies widely. See Barry Devlin's discussion of this in his blog.]
Change indeed. EMC can be expected to invest substantially, and add its formidable selling organization to the tiny one Greenplum has. R&D can pick off more of the “to-dos” on its list. Welcome to the picnic, EMC – you’ve already made it more interesting. Even – or perhaps especially – on a cloudy day.
Disclosures: Greenplum, and several of the other vendors named above, are clients of IT Market Strategy.