ParAccel, another of the analytic database upstarts, has weighed in on Sun hardware with a record-shattering benchmark that its competitors have thus far avoided – the 30 TB TPC-H. It’s been two years since anyone has published a 30 TB TPC-H, and only 10 results of any size (all smaller) have been published in the past year. One can scoff (many do) at this venerable institution, but TPC benchmarks are a rite of passage and a badge of engineering prowess. The ParAccel Analytic Database (PADB) has set new records, raising its profile dramatically in one fell swoop. PADB came in at 16x the price/performance of Oracle, the prior leader (and the only other vendor willing to tackle the 30 TB benchmark to date). PADB, running on Sun Opteron 2356 servers, Sun Fire™ X4540 storage servers and OpenSolaris™, was 7x faster on queries and 4.6x faster loading the data than the two-year-old Oracle result. And because of its architecture, no indexes or partitioning strategies had to be constructed and tuned. TPC rules are specific about having product in GA within 90 days, so one can expect to see PADB version 2.0, on which the benchmark was based, out in Q3.
ParAccel has seen some skepticism in the analyst community because of its relatively small number of published customers. It claims a dozen, and half are listed on its web site. Other vendors, like Vertica and Greenplum, have been very forthcoming in promoting theirs, but both have had more time in the market. PADB was released in Q4 2007 and really began its arc in 2008; Vertica has a year’s head start, and Greenplum even more. Rumors have also floated about whether CTO and founder Barry Zane was leaving. I had a conversation with Barry in late June to discuss the business and the benchmarks. He was clearly excited about the benchmarks, in which he was very involved, even working on the full disclosure report personally – “It got to be like a hobby for me,” he said – and he was quite clear that he is not going anywhere.
There is suddenly a sizable number of new entrants in the analytic database space, and (disclosure) some are clients (not ParAccel at this writing). I’ve posted about Greenplum, Vertica and Aster. Like several of the others, ParAccel has some roots in Postgres and is a massively parallel, shared-nothing architecture; like Vertica, it uses column-based storage. Like the whole group of new players, it is routinely winning proof of concept engagements (POCs) against traditional DBMS players.
Structurally, ParAccel’s products come in two forms: one built on proprietary components and one built entirely on commodity components. ParAccel’s description is that PADB is “standard-server-based;” the software may be purchased and run on a variety of commodity platforms. The Scalable Analytic Appliance (SAA) offering uses “enterprise-class midrange SAN components from EMC.” SAA uses a gigabit Ethernet interconnect and 4 processors at each node with dedicated storage. In either case, a leader node (the Postgres-derived code is found here) coordinates the activities of the compute nodes. A hot standby node is always part of the installation, and can step in for any failing node, including the leader node.
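To make the leader/compute/standby arrangement concrete, here is a toy sketch in Python. It is not ParAccel’s code: the class names, the sum query, and the way the standby picks up a failed node’s data slice are all assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeNode:
    """Hypothetical compute node holding one slice of a table."""
    name: str
    rows: list = field(default_factory=list)
    healthy: bool = True

    def partial_sum(self):
        # Each node aggregates only its own slice (shared-nothing).
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return sum(self.rows)


class Leader:
    """Hypothetical leader: fans a query out and combines partial results."""

    def __init__(self, nodes, standby):
        self.nodes = list(nodes)
        self.standby = standby

    def total(self):
        result = 0
        for i, node in enumerate(self.nodes):
            if not node.healthy:
                if self.standby is None:
                    raise RuntimeError("no standby available")
                # Assumption: the failed node's slice is still reachable
                # (for example via shared or replicated storage).
                self.standby.rows = node.rows
                self.nodes[i] = node = self.standby
                self.standby = None
            result += node.partial_sum()
        return result


nodes = [
    ComputeNode("n1", [1, 2, 3]),
    ComputeNode("n2", [4, 5]),
    ComputeNode("n3", [6]),
]
leader = Leader(nodes, ComputeNode("standby"))

before = leader.total()
nodes[1].healthy = False      # simulate a node failure
after = leader.total()        # the standby answers for n2's slice
```

The query result is the same before and after the failure; only the node that serves the second slice changes.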
The EMC partnership allows ParAccel to rely on a Fibre Channel-connected SAN (in a modular, midrange form factor designed to scale along with servers) for its enterprise-class features. Availability, backup and rapid recovery, and data replication thus become easier for ParAccel to deliver – as long as EMC’s CLARiiON CX4 is in the picture. PADB is able to concurrently scan the server-based direct attached storage (DAS) and the SAN. This “blended scan,” ParAccel argues, gives it the best of both worlds.
ParAccel’s technical papers do a nice job describing how “Continuous Sequential Scan Rate” (CSSR), measured in megabytes of I/O per second, describes throughput from platter to server and helps demonstrate the power of the new architectures. PADB boasts a patent-pending query optimizer, notable for its ability to handle correlated subqueries (CSQs), which feature in several of the TPC-H benchmark queries and are often a performance stumbling block. Without belaboring the techie talk here, removing unneeded columns from CSQs can have a large impact, just as it does for table scans. Retrieving only the relevant columns also improves CSSR markedly. Columnar storage aids data compression as well, and combined with the space saved by not having to maintain indexes, it considerably improves the growth rate of installed storage relative to raw data.
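The arithmetic behind the column-pruning point can be sketched in a few lines of Python. The table layout, row count, and scan rate below are hypothetical figures chosen for easy math, not ParAccel or TPC-H numbers, and the model ignores compression and seek overhead.

```python
def scan_seconds(rows, column_widths, needed, scan_rate_mb_s, columnar):
    """Estimate the time to scan a table at a fixed sequential scan rate.

    A row store reads every column of every row; a column store reads
    only the columns the query needs.
    """
    if columnar:
        bytes_per_row = sum(column_widths[c] for c in needed)
    else:
        bytes_per_row = sum(column_widths.values())
    total_mb = rows * bytes_per_row / 1_000_000
    return total_mb / scan_rate_mb_s


# Hypothetical table: column name -> width in bytes (100 bytes per row).
WIDTHS = {
    "order_id": 8,
    "customer_id": 8,
    "order_date": 4,
    "amount": 8,
    "comment": 72,
}
ROWS = 1_000_000_000      # hypothetical row count
RATE_MB_S = 1000.0        # hypothetical sustained scan rate
NEEDED = ["order_date", "amount"]

row_time = scan_seconds(ROWS, WIDTHS, NEEDED, RATE_MB_S, columnar=False)
col_time = scan_seconds(ROWS, WIDTHS, NEEDED, RATE_MB_S, columnar=True)
print(f"row store: {row_time:.0f} s, column store: {col_time:.0f} s")
```

Under these assumptions the query touches 12 of every 100 bytes, so the same hardware finishes the scan in roughly an eighth of the time.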
ParAccel claims wins over competitors such as Sybase IQ, Netezza, and Vertica as well as Oracle, and touts real-world performance numbers far better than the benchmark. Of the new-architecture competitors it faces, only Sybase IQ has stepped up to the TPC-H bar to date, and there will surely be many win and loss claims from all the vendors over the year ahead. But PADB has vaulted into contention with this announcement, and will no doubt be on more short lists – as it should. ParAccel will also begin to see more attention from its partners, including hardware players beyond EMC: Dell, Fujitsu-Siemens, Intel, AMD and others. Sun, which made a substantial contribution to the benchmark by making much of the hardware available, may be somewhat less aggressive in the wake of its acquisition by Oracle. But ParAccel says Sun is not its most installed platform; customers are running on HP, Dell and IBM hardware already. Software partners are also likely to be friendlier as the temperature rises.
ParAccel is fortunate to have completed this work with Sun, whose acquisition by Oracle will no doubt create some reallocation of resources and priorities. This is a coup for ParAccel, whose timing turns out to be impeccable. As always, prospects should insist on a POC. And as Mark Madsen of Third Nature (his blog is here) says – “always hold back some queries;” you want to see how any database performs without heroic tuning, unless you plan to keep an army of specialists around.