Hadoop Summit Recap Part Two – SELECT FROM hdfs WHERE bigdatavendor USING SQL

Probably the most widespread, and commercially imminent, theme at the Summit was “SQL on Hadoop.” Since last year, many offerings have been touted, debated, and some have even shipped. In this post, I offer a brief look at where things stood at the Summit and how we got there. To net it out: offerings today range from the not-even-submitted to GA – if you’re interested, a bit of familiarity will help. Even more useful: patience.


EMC Jumps Into ADBMS Appliance Game

The Data Computing Appliance, first deliverable from EMC’s acquisition of Greenplum, was announced last month, only 75 days after the acquisition closed, and it doesn’t lack for ambition. Pat Gelsinger, President and Chief Operating Officer, EMC Information Infrastructure, pointed to the high level opportunity: unlocking the “hidden value” of enormous and growing data assets every company is increasingly holding, and often failing to leverage. The appliance will reach many hitherto untapped resources in the data centers that EMC occupies. Adding EMC’s manufacturing, sales and marketing, and reference architectures to the Greenplum IP brings what Gelsinger calls Greenplum’s “first phase” to its completion. And begins what is likely to be a sizable battle with Oracle, Teradata and IBM, if EMC mounts campaigns and spending to match its ambitious vision.

More TDWI Notes – ParAccel Rolling On, HP Stalled, Vertica Leading Insurgents

On my second day at TDWI, I was in meetings all day – events like this are a great opportunity for analysts to catch up with many of the companies they follow at one time, and this particular one was packed with sponsors. Congrats to the folks who sell sponsorships – they had a packed exhibit hall, and a lot of very interested attendees. I got a chance to chat at a few booths (all buzzing), ask a few attendees some real-world questions (and was asked some surprising ones myself), and get a sense of the workload in the trenches (heavy and growing.)

EMC Buys Greenplum – Big Data Realignment Continues

EMC’s acquisition of Greenplum, announced today as a cash transaction, reaffirms the obvious: the Big Data tsunami upends conventional wisdom. It has already reshaped the market, spawning the most ferment in the RDBMS (and non-R DBMS via the noSQL players) space in years. When I first posted on Greenplum over a year ago, I said that

“Open source + capital has created an intriguing new model of rapid innovation in “mature” markets, and the database space – like BI – is not a done deal. It is indeed possible to escape the gravity well, if you execute. Greenplum is getting it done, and is among the new stars to watch.”

Why the open source reference? Greenplum uses a parallelization layer atop PostgreSQL (like Aster, another of the new breed of ADBMS.)

Now EMC has written the next chapter in that story. In the process, it adds a new piece (after literally dozens of others in the past few years) to its own portfolio, which already includes unstructured data (via Documentum) and virtualization (via VMware), layered in among the industry-leading storage and information management pieces. Disruptive? You bet. Is EMC finished? I doubt it. Candidates? BI tools, ETL, MDM, data integration come to mind. Losers? At least one big one. Read on.

New TPC-H Record – Virtualized by ParAccel, VMware

You can set performance records in a virtualized environment – that’s the message of the new 1 TB TPC-H benchmark record (scroll down to see the 1 TB results) just released by ParAccel and VMware. Running on VMware’s vSphere 4, the ParAccel Analytic Database (PADB) delivered a one-two punch: not only the top performance number for a 1 terabyte (TB) benchmark, but the top price-performance number as well. The results in a nutshell: 1,316,882 Composite Queries per Hour (QphH), a price/performance of 70 cents/QphH, and a data load rate of over 3.5 TB per hour. ParAccel moved quickly to promote the result; oddly, VMware seems to have been asleep at the switch, with no promotion on its site as the release hit the wires, and a bland quote from a partner exec in the release itself.

Vertica Projects Leadership, Embraces MapReduce (Sorta)

With the August announcement of Vertica Analytic Database 3.5, Vertica is laying claim to leadership of the new ADBMS vendors. With its most recent numbers – several dozen customers are now in production and the company expects to pass 100 this year – the assertion bears thinking about. Driving forward with an aggressive release strategy, Vertica is showing its maturity and increasing ability to challenge the old school leaders like Teradata and Netezza – but with a software-only strategy. This agility allowed it to offer early support for release 3.5 in quick succession after its last release, with GA scheduled for later this year.

TDWI Disappoints, But There is Hope Ahead

Few events offer as much promise as The Data Warehouse Institute World Conferences. With a deep educational focus, TDWI provides important opportunities for users. For vendors, the event offers one of the most focused, serious prospect audiences possible. My expectations, tempered though they were by economic realities, were still fairly high for this year’s San Diego event. Unfortunately, the drop in volume was greater than all of us expected, the number of announcements from the vendor community was low, and the content focus seemed a bit out of date.

Aster Appliance Elevates MapReduce Chatter, ADBMS Visibility

Since my last post about Aster, the analytic DBMS (ADBMS) vendor has added another arrow to its quiver. Its new MapReduce Data Warehouse Appliance Express Edition starts at $50,000, and includes Aster nCluster on Dell hardware and a copy of MicroStrategy BI software for up to 1 TB of user data, which Aster clearly sees as a sweet spot. (MicroStrategy has been doing a lot of seeding with the ADBMSs lately; it also has an introductory bundling deal with Sybase IQ.) Delivering a ‘compute rich’ appliance on commodity hardware, with reduced operating costs, certainly hits all the right notes. But is 1 TB the sweet spot for MapReduce? I think not – although it makes a great starting point, and that may be Aster’s real opportunity – give ’em a taste of what SQL plus MapReduce can do, and watch them demand more and more. And sell it to them. Dell and MicroStrategy should love this strategy – if it works.

ParAccel Secures $22 Million – The Game’s Afoot

Recently, ParAccel published a TPC-H benchmark, and I said here that it was a coup that ought to get them significant attention. The blizzard of discussion that ensued was no doubt gratifying for ParAccel – Google reported 182 hits for “the past week” for them as of 6/28.

Now, Google hits – and visibility in general – aren’t everything. In a relatively crowded field, ParAccel will need more than just a fairly well-received press release – they will need money. Money to drive marketing, money to turn interest into leads, and money to fund a sales and field force to convert those leads into business. The good news? They just got some. On June 29th the firm announced that it had secured a C round of venture capital, to the tune of $22 million, led by Menlo Ventures; ParAccel’s previous investors participated as well.

ParAccel Rocks the TPC-H – Will See Added Momentum

ParAccel, another of the analytic database upstarts, has weighed in on Sun hardware with a record-shattering benchmark that its competitors have thus far avoided – the 30 TB TPC-H. It’s been two years since anyone has published a 30 TB TPC-H, and only 10 of any size (all smaller) have been published in the past year. One can scoff (many do) at this venerable institution, but TPC benchmarks are a rite of passage, and a badge of engineering prowess. The ParAccel Analytic Database (PADB) has set new records, raising its profile dramatically in one fell swoop. PADB came in at 16x the price/performance of Oracle, the prior leader (and only other vendor willing to tackle the 30 TB benchmark to date.) PADB, running on Sun Opteron 2356 servers, Sun Fire™ X4540 storage servers and OpenSolaris™, was 7x faster on queries and 4.6x faster loading the data than the two-year-old Oracle result. And because of its architecture, the construction and tuning of indexes and partitioning strategies were not needed. TPC rules are specific about having product in GA within 90 days, so one can expect to see PADB version 2.0, on which the benchmark was based, out in Q3.

ParAccel has seen some skepticism in the analyst community because of its relatively small published number of customers. It claims a dozen, and half are listed on its web site. Other vendors, like Vertica and Greenplum, have been very forthcoming in promoting theirs, but both have had more time in the market. PADB was released in Q4 2007 and really began its arc in 2008; Vertica has a year’s head start, and Greenplum even more. Rumors have also floated about whether CTO and founder Barry Zane was leaving. I had a conversation with Barry in late June to discuss the business and the benchmarks. He was clearly excited about the benchmarks, in which he was very involved, even working on the full disclosure report personally – “It got to be like a hobby for me,” he said – and he was quite clear that he is not going anywhere.