EMC Buys Greenplum – Big Data Realignment Continues

EMC’s acquisition of Greenplum, announced today as a cash transaction, reaffirms the obvious: the Big Data tsunami upends conventional wisdom. It has already reshaped the market, spawning the most ferment in the RDBMS (and non-R DBMS via the noSQL players) space in years. When I first posted on Greenplum over a year ago, I said that

“Open source + capital has created an intriguing new model of rapid innovation in “mature” markets, and the database space – like BI – is not a done deal. It is indeed possible to escape the gravity well, if you execute. Greenplum is getting it done, and is among the new stars to watch.”

Why the open source reference? Greenplum uses a parallelization layer atop PostgreSQL (like Aster, another of the new breed of ADBMS.)

Now EMC has written the next chapter in that story. In the process, it adds a new piece (after literally dozens of others in the past few years) to its own portfolio, which already includes unstructured data (via Documentum) and virtualization (via VMware), layered in among the industry-leading storage and information management pieces. Disruptive? You bet. Is EMC finished? I doubt it. Candidates? BI tools, ETL, MDM, data integration come to mind. Losers? At least one big one. Read on.

What is this deal about? It’s about reshaping the DBMS and data warehouse markets, affecting new players, new use cases, form factors, and new assumptions about performance. It’s about:

  • Use cases that conventional DBMS struggle with. Organizations are turning in increasing numbers to specialist platforms for analytic uses – not all of which are even databases in the conventional sense, viz. Hadoop/MapReduce. The typical sales pitch? “We can do that faster than [incumbent big DBMS]. Just let us show you.”
  • Appliance convenience – software plus processors, and also plus storage. At the high end, database integration with servers and storage has become increasingly important. Netezza helped change this game, which had been left to Teradata, by adding the convenience factor and price as value propositions. Not incidentally, it also means added revenue for the vendor, so it’s no surprise that Oracle and IBM DB2 have joined the game too. Among the big players, only Microsoft remains odd man out of the appliance game (for a few more months).
  • Controlling the explosive growth of Big Data. This may seem counterintuitive for a storage vendor, but Greenplum’s columnar storage capability means it can help its customers use less storage. Is this a problem? No – it will fill up as fast as customers realize they can keep a year’s worth for analysis instead of 3 months’ worth. And expanding use cases will sell more storage, not less. Oracle is touting Exadata’s ability to do the same, Netezza has had it for a while too, and IBM has been pushing its database compression as well. See this useful post by Curt Monash on the topic for more.
  • Finally, and not least, it’s about performance. At scale. Technical improvements become more possible when the vendor owns the whole stack. Balancing IO, processor speed, memory, multiple tiers of storage, virtualization, security, calls to external analytics, ingesting data, exporting data  – all become more tunable, more optimizable, if you own all the pieces.

Greenplum, with over 100 customers, liked to say that it was acquiring net new sites at a pace faster than Netezza and Teradata. That made it a worthy candidate for EMC’s next step. But there was much more to talk about. Scale? A customer in Japan already had a 300-node system. Mature code line? Not as broadly as, say, Oracle, but Greenplum version 4 had recently added some significant features: complex query optimization, fast parallelized data loading, some early workload management suitable for its use cases, embedded languages for analytics (compiled C, Map and Reduce functions with optimized binding and data marshaling), connection management, query prioritization, and self-healing physical block replication. While it was not yet at the level that, say, Aster has achieved for programmability, it does support Java functions – both for general function declarations and as a MapReduce language. The fast parallel data-out capability – using an external table metaphor for flexibility and ease of use – fits nicely in a storage vendor’s vision. So does the ability to control the placement of tables on disk, used for flexible space allocation and assignment to different storage types (SSD, near-line, etc.).

Greenplum’s Enterprise Data Cloud vision, discussed in a great post by Ramon Chen here and in a white paper I wrote for Greenplum here, had not taken off yet, but its notions of elastic provisioning, spinning up and (perhaps even more important) spinning down analytical projects have great resonance. Greenplum recently added to the vision with its Chorus product, focusing on self-service provisioning, data virtualization services, and data collaboration – issues that matter a great deal in environments where relevant data is in many places, inside and outside the data warehouse – and will stay there and become more widespread. The alignment with EMC’s strengths is again obvious, though clearly there will be work that must be done to flesh the integration story out. Bringing CEO Bill Cook over to run EMC’s new division sounds like a bet to do just that.

Who loses here? Not customers – especially Greenplum’s. They get a much more richly funded supplier, and apparently, management continuity. The most obviously displaced competitor is EMC’s existing partner ParAccel. In 2008, the two unveiled the Scalable Analytic Appliance (SAA) pilot program at EMC World, and ParAccel joined EMC’s DW/BI Competency Center. In 2009 they went GA, landed a few customers, and ParAccel won EMC’s Offering of the Year award. In April 2010 ParAccel published a record-breaking TPC-H benchmark over virtualized storage, an impressive feat. At the time, I was surprised at how utterly minimal the promotion of this by EMC was – but the new deal perhaps makes the reason clearer. More is at stake for ParAccel: EMC’s RecoverPoint and MirrorView provided its replication capabilities, and ParAccel was leaning on EMC in part for its HA and DR story. One has to assume all that is now at risk.

[correction, July 31 – Barry Zane of ParAccel, responding at my request to a correction offered by an unnamed guest, says:

The benchmark was NOT run over virtualized storage. It used direct attached disks. It WAS based on VMWare virtual machines, running two VMs per server. Each VM was given ownership of half the drives. No EMC in the picture at all, other than the fact that they own most of the VMWare shares.

There’s another factual error regarding ParAccel. We do NOT rely on a SAN (such as EMC) for HA. ParAccel clusters that are SAN-less use their own cross-node internal mirroring, we call it RAID-P. In a SAN-connected environment, we turn off RAID-P and just consider the SAN the mirror copy for HA. We do leverage the SAN for DR. By doing so, we are the ONLY young vendor that has a DR offering at all.

I appreciate Barry providing the information. It’s another reason I like the blogosphere so much – such issues can be quickly resolved. – MA]

How about the other ADBMS players? Netezza now will have a formidable foe to contend with in a similar space, depending on how the new EMC offering is configured, priced and targeted. The other sizable players in customer volume are Vertica and Sand. The former has been very visible, is also MPP and columnar, and similarly mature in its SQL and management capabilities; the latter less so. Aster is very differentiated – much more focused on analytics and programming, but considerably smaller, as are a number of others. What about the big guns? Oracle, IBM, and Sybase (recently acquired by SAP, in another indicator of the market upheaval) are on notice, though not likely to feel threatened anytime soon. HP? It is not a factor in the market today, and this deal is yet another example of HP sitting on its hands and failing to acquire assets that could make a difference. But the game has changed, and there are implications for Microsoft, whose largest competitors own their stack from storage up, and can optimize in-house, while Microsoft must work on reference architectures for its still-unreleased Parallel Data Warehouse with partners and get them to build to order. That does not sound like a recipe for leadership, though they may be able to achieve fast follower status given their clout. [added 7/11/2010: Note that all this copying of data to ever larger analytic platforms ups the ante on data quality and synchronization issues – these vendors need to have an answer to the problems this creates. Their focus on it varies widely. See Barry Devlin’s discussion of this in his blog.]

Change indeed. EMC can be expected to invest substantially, and add its formidable selling organization to the tiny one Greenplum has. R&D can pick off more of the “to-dos” on its list. Welcome to the picnic, EMC – you’ve already made it more interesting. Even – or perhaps especially – on a cloudy day.

Disclosures: Greenplum, and several of the other vendors named above, are clients of IT Market Strategy.

Published by Merv Adrian

Independent information technology market analyst and consultant, 40 years of industry experience, covering software in and around the data management space.

15 thoughts on “EMC Buys Greenplum – Big Data Realignment Continues”

    1. Thanks, Chuck, and good to meet you. Sorry we haven’t had the pleasure before. Look forward to changing that.
      I read your post, and I have to say I’m skeptical on the ParAccel front. Of course, I know you’ll sell storage to anyone and will partner with all comers. But that relationship, which they positioned as very special, has at least become substantially less so.


  1. Merv – great points and wonderful coverage of the angles. Looking at this market consolidation, would you think that Dell, HP, Cisco, and other hardware vendors might want to make a move on storage, data, and maybe SaaS? =) – ray

    1. Thanks, Ray, for the comment. HP already has storage. Oh, and servers. And database. Perhaps they are playing. But not visibly enough by far. For the others, getting into the database game would be a leap. Still there may never be a better time.

  2. Thanks Merv – great post…wanted to comment on a couple things relative to ParAccel…our strategy here has been to ensure that we integrate well into an enterprise’s standard operating environment – we’ve felt that working in concert with a customer’s data management and storage infrastructure is a key piece of this and thus developed an approach to do so without sacrificing any performance (and, indeed, enhancing it). We developed our ‘Blended Scan’ technology (which is patent pending) to work with any storage infrastructure. Blended Scan enhances scan speeds (and query performance) when PADB is integrated with a customer’s storage infrastructure. It does this by laying out data in a ‘storage aware’ fashion.

    We drove our initial approach to this via the SAA (Scalable Analytic Appliance), an integrated appliance including an EMC SAN that has been highly successful (as you noted, it won us an award from EMC – see http://www.paraccel.com). Our Blended Scan technology, though, integrates with any SAN or NAS infrastructure that a customer has in place while still providing the high performance data warehousing and analytics our customers expect. Over the last year, we’ve integrated across a wide variety of storage implementations (including HP, Dell, EMC, NetApp, Hitachi etc.) and have shown tremendous performance improvements over other analytic database implementations leveraging enterprise storage. So to net it out, our overall strategy is to work across any storage infrastructure our customers may have deployed – we believe that customers will find our Blended Scan technology to be the highest performance solution for their storage implementation needs no matter whether it’s EMC or another provider.

    Also, you mentioned that “ParAccel was leaning on EMC in part for its HA and DR story.” To clarify, we have distinct HA and DR strategies that exploit unique functionality we have built into PADB which work with any storage infrastructure the customer has. For DR, we ensure that the SAN or NAS is the overall database of record and enable the customer to leverage their existing models for snapshots/replication/backup etc. In addition, we have utilities that back up the database and restore it at a later point if need be. For our failover and general uptime/availability strategy we use technology called RAID-P (RAID ParAccel), a proprietary data redundancy scheme that doesn’t depend on a SAN or NAS. RAID-P uses block-level mirroring to allow the system to operate without data loss in the face of a disk or node failure.

    Finally, as noted by Phil Francisco at Netezza, EMC’s acquisition of Greenplum doesn’t really address corporate needs for high performance data analytics. This is exactly what we at ParAccel live to do – deliver the highest performance, easiest to deploy analytics platform on the planet; and doing so without locking customers into any specific or proprietary hardware implementations…

    Hope this helps! (and apologies for the long ‘comment’ 🙂)
    CMO, ParAccel

    1. Tarun, thanks for the comment. You lay out some of the technical issues well, and in more depth than I thought appropriate for a piece that was about Greenplum, not ParAccel. And I recognize some of the interesting interactions with the storage layer in the blended scan technology were intended to be storage-vendor independent. My comments had more to do with my perception – correct or not – that EMC was a significant contributor to your sales efforts. I’ve known in the past about the wins that EMC played a role in; one has to believe their primary thrust will now be focused elsewhere.

      As a follower of ParAccel, I have watched the change in management – including your arrival as CMO – with great interest. I believe that stepping up your marketing, which was already underway with a significant growth in the number of sales teams, is coming at a critical time in the company’s maturation. It’s a crowded field out there, and getting your message of high performance – and high concurrency, which is one of your competitive strengths in this comparison – out there will be key.

  3. It took nearly 2 years, but as I have known (and stated – see dbms2 blog below) for some time, it was inevitable that the storage industry veterans would find themselves owners and purveyors of database technology, but not from a traditional BI viewpoint.


    This is a massive infrastructure play and as Chuck Hollis may attest (thanks for sharing your perspectives), newly minted EMC Divisions are expected to generate multi-billion dollar top lines. Hats off to Greenplum and EMC for stirring the passion of what will prove to be an extraordinary and exciting future. Merv, an excellent summary, but your last point above as to what’s this deal about, is where a lot of the future value lies for our industry. After all, who’s going to manage all of this “explosive growth in data”? Storage companies are the trusted stewards of data in the enterprise and control most of the routes. And yes, SANs (centralized storage management) will dominate, even more so, now that EMC is firmly in control.

    1. Aravind, I’m very aware of Microsoft’s upcoming release of the acquired DatAllegro code, or what’s left of it – see http://mervadrian.wordpress.com/2010/06/16/microsofts-parallel-dw-still-waiting/

      I don’t minimize Microsoft’s eventual impact, despite the longer-than-expected delay in bringing PDW to market. Recall that SQL Server itself was a rewrite of the Sybase code line; that effort was similarly intensive, all the way down to literally changing the block structure data was saved in, the wire protocol, and the T-SQL language. And that was just the beginning. Analyst commentary on markets often looks at the immediate – it’s an occupational hazard. But hopefully we connect it to history when doing so is instructive.
