Microsoft’s SQL Server Parallel Data Warehouse (PDW) has been eagerly awaited for a long time. It still is. Though much of the news at the BI Conference running in parallel with TechEd in New Orleans (discussed here) was generally quite good, the PDW story was much less so. It’s late, and it’s not all there.
Perhaps this shouldn’t be a surprise. After all, even its web page begins with a sentence that ends: “delivers performance at low cost through a massively parallel processing (MPP).” Clearly, there is some cleanup work to do. Analysts I respect share my opinion that Microsoft grossly underestimated the work it would have to do when it acquired DATAllegro nearly two years ago. In New Orleans, the conversation about Microsoft’s upcoming MPP approach to the high end of the market continued the unsatisfying arc it has been on; there was no new news about when it will ship. “Quality is the top priority; we’re not too far away,” we were told. Perhaps months. “This year” is still the best we have heard.
The assembled analysts were offered some details, some of which have changed a little. Much of the architectural design appears to still be there from the DATAllegro acquisition: a Control Node, Management Servers, a Landing Zone for incoming data (running SSIS), a Backup Node with high-capacity storage, a Spare Node (one per rack) and the individual distributed database servers, or Compute nodes. Each Compute node is mapped to a Storage node, and all run the SQL Server query optimizer, which has been extensively rebuilt. David DeWitt and his academic team have played a role in that work, which is a promising sign. Each Compute node maintains statistics that are used by every query running on that node, so the plan for each node is tuned to the data there.
The Compute nodes are tied together with an InfiniBand connection; they are tied to the Storage nodes with a Fibre Channel interconnect. Both interconnects are dual for fault tolerance. A command-line loader is available to supplement SSIS; I have not yet dug into the use cases where it is the better choice. It’s not entirely clear whether the Landing Zone (a single node) is a bottleneck; early adopters are getting “hundreds of GBs per hour,” we were told. But as Evan Levy of Baseline Consulting pointed out to me, “it’s about rows, not gigabytes.”
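Levy's point can be made concrete with a little arithmetic. The sketch below is only an illustration: the 200 GB/hour rate and the row widths are assumed figures chosen for the example, not numbers Microsoft or its early adopters supplied. It shows how the same byte rate translates into very different row rates depending on how wide the rows are.

```python
def rows_per_hour(gb_per_hour: float, avg_row_bytes: int) -> float:
    """Convert a load rate in GB/hour into rows/hour for a given average row width."""
    bytes_per_hour = gb_per_hour * 1_000_000_000  # assumes decimal gigabytes
    return bytes_per_hour / avg_row_bytes


if __name__ == "__main__":
    # Hypothetical load rate standing in for "hundreds of GBs per hour".
    assumed_rate_gb = 200
    # Hypothetical average row widths, from narrow fact rows to wide records.
    for width in (100, 1_000, 10_000):
        rate = rows_per_hour(assumed_rate_gb, width)
        print(f"{width:>6} bytes/row -> {rate:,.0f} rows/hour")
```

A quoted gigabytes-per-hour figure says little about load performance until the average row width behind it is known.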
So far, concurrency is fixed at 32 users, each of whom gets 1/32 of the machine’s resources; the SQL Server Resource Governor is pre-set to 32 concurrent queries, and there is no workload manager yet. The lack of workload management is a serious omission, and Microsoft will need to fix it to be competitive at the high end. V1 also lacks stored procedures, and some T-SQL is not parallelized, although analytic functions are.
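To see why fixed slots matter, here is a toy model of the arrangement. It assumes, and this is my reading rather than anything Microsoft stated, that a query cannot borrow capacity from unused slots. The function names are hypothetical and nothing here calls a real SQL Server API.

```python
TOTAL_SLOTS = 32  # from the briefing: concurrency is fixed at 32


def fixed_share(active_queries: int) -> float:
    """Share of the machine each query gets when every slot is a fixed 1/32."""
    if not 1 <= active_queries <= TOTAL_SLOTS:
        raise ValueError(f"active_queries must be between 1 and {TOTAL_SLOTS}")
    return 1 / TOTAL_SLOTS


def idle_fraction(active_queries: int) -> float:
    """Capacity left unused, ASSUMING queries cannot borrow from empty slots."""
    return 1 - active_queries * fixed_share(active_queries)


if __name__ == "__main__":
    for n in (1, 4, 16, 32):
        print(f"{n:>2} active: {fixed_share(n):.3%} each, "
              f"{idle_fraction(n):.1%} of the machine idle")
```

Under that assumption, a lightly loaded system leaves most of its capacity unused, which is the kind of gap a workload manager is meant to close.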
Clearly the technical part still has a way to go. Another delivery challenge is the creation of reference architectures with Microsoft’s partners: HP, IBM, Dell and EMC (storage only). Microsoft’s Fast Track program has served to share information and prepare for an appliance-style delivery when PDW does become available, but Microsoft has to work with each vendor separately and coordinate the lessons learned to keep the product experience the same for each. Given the more tightly coupled nature of the appliance-style PDW, the mechanics of this delivery and the support process will be new, and a learning experience for all concerned. There are challenges ahead for Microsoft and its partners. Clearly PDW will not be a force at the high end of the market in 2010. 2011? Time will tell, but it’s unlikely that V1 will make much of a dent. Let’s hope V2 comes soon.
Disclosures: Microsoft is not a client of IT Market Strategy.
Good summary. It is hard to take them seriously as a warehouse platform without the basics.
There’s time. There’s always time. It’s all about setting expectations right. They can climb the ladder just as they did with SQL Server itself.