Infobright Bids to Anchor An Open Source DW Ecosystem

I recently sat down for a talk with Miriam Tuerk, CEO of Infobright – an open source, commodity hardware-based analytic database (ADBMS) vendor focused on the data warehousing market. Infobright is another of the leaders in the open source information management wave IT Market Strategy has been tracking. Founded in 2006, Infobright has assembled a remarkable team now committed to exploiting this economic model to reduce the startup costs of data warehousing. Like other open source players, MySQL-based Infobright has two versions: a Community Edition (ICE, whose community gathers at www.infobright.org) and an Enterprise Edition (IEE). This bifurcation allows it to distribute starter software broadly at minimal direct cost, then upsell; along the way, it gets to tap into the vibrant innovation provided by the user community that forms. As the product matures, such vendors fund the more hardened features large firms require by charging them for those added capabilities that they need. And now (July 7), Infobright has partnered with Jaspersoft for tighter integration with a report server and OLAP analysis.

Infobright is privately held, with roughly 50 employees (in Canada, the US and Europe, where a development team in Poland does much of the heavy lifting). Investors including Sun and Flybridge Capital Partners injected a reported $10M into Infobright in 2008; the company doesn’t discuss revenue, but considers funding “adequate through the end of 2010.” I expect that they will seek additional funding rounds as their infrastructure buildout continues.

Infobright moved into general release under a GPL license in September 2008. With three product releases under its belt, ICE boasts nearly 10,000 downloads, and the company claims that 2,000 of those downloaders are active community participants. There are now over 60 paying customers in 7 countries. The new “integrated virtual machine download” announced today includes: ICE; the JasperServer Community Edition for report creation, delivery and scheduling; and JasperAnalysis, an OLAP server.

Infobright has implemented a column-oriented data store, deployed atop MySQL, as columns divided into 65,536 row elements known as Data Packs, which are compressed as they are stored – 10:1 compression was an early claim but the company says they frequently do far better. Statistics about the data (things like min/max, cardinality, etc.) are stored in a “Knowledge Grid” – essentially an indexing scheme, not unlike what vendors such as Illuminate use, that permits retrieval to be limited only to the data needed to resolve the specific query in question. Query tests in customer use cases routinely deliver sizable improvements in query times, as we have seen with other players in the new analytic DBMS space.
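
To make the pruning idea concrete, here is a minimal Python sketch of how per-pack min/max statistics can limit which packs a query has to open. It is an illustration under stated assumptions, not Infobright's implementation: the names `PackStats`, `build_packs` and `count_greater_than` are hypothetical, and a real Knowledge Grid holds more than min/max.

```python
from dataclasses import dataclass
from typing import List, Tuple

PACK_ROWS = 65_536  # rows per Data Pack, as described in the post


@dataclass
class PackStats:
    """Per-pack summary; a hypothetical stand-in for one Knowledge Grid entry."""
    lo: int
    hi: int


def build_packs(values: List[int], pack_rows: int = PACK_ROWS):
    """Split one column into fixed-size packs and record min/max for each."""
    packs = [values[i:i + pack_rows] for i in range(0, len(values), pack_rows)]
    stats = [PackStats(min(p), max(p)) for p in packs]
    return packs, stats


def count_greater_than(packs, stats, threshold) -> Tuple[int, int]:
    """Count values > threshold; return (count, number of packs actually scanned)."""
    total = 0
    opened = 0
    for pack, s in zip(packs, stats):
        if s.hi <= threshold:
            continue                # no row in this pack can match: skip it
        if s.lo > threshold:
            total += len(pack)      # every row matches: answer from stats alone
            continue
        opened += 1                 # stats cannot decide: scan the pack
        total += sum(1 for v in pack if v > threshold)
    return total, opened
```

With 200,000 sequential values and a threshold of 100,000, `count_greater_than` scans only one of the four packs; the other three are resolved from their statistics alone.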

Infobright loads data quite rapidly on commodity hardware, and asserts that, as a result of its architecture, load speed will remain constant regardless of raw data size. The MySQL loader can be replaced with the Infobright loader in IEE to ensure high speed loads at scale. Infobright makes familiar assertions about “load and go”; certainly, with no careful designing of models, partitions and indexes, time to usage is significantly reduced. “Hardware setup and configuration can be done in a day,” company marketing asserts. Infobright makes several different claims to scalability, including “to 50 Tb and more.” It inherits management tools and partnerships from its MySQL heritage, and thus also benefits from MySQL’s ability to run on Linux, Solaris and Windows and to work with Ruby on Rails, Perl, Python, etc.

The firm’s new CTO Bob Zurek, who joined in Q2 2009, is an example of the seriousness Infobright brings to its commitment to enterprise-class offerings. Bob was most recently CTO and VP of Products at EnterpriseDB, after a distinguished career that includes stints at IBM, Ascential, Sybase and Powersoft. Partnerships are playing a key role, and Zurek’s industry experience will no doubt have a big impact in working cooperatively with other OSS vendors and commercial ones. The company recently announced an “open source project for End-to-End business intelligence” with Jaspersoft’s BI tools, ETL (Talend-based) and Infobright’s DW at the MySQL conference. Shortly thereafter, it unveiled a hardware and software system for the deployment of BI with Pentaho, based on the Sun Fire X4275 storage server or the Sun Storage 7310 unified storage system. And today’s announcement adds the Jasper report server and the JasperAnalysis OLAP piece for yet another configuration.

But announcements are not enough. True integration needs to be shown if the company wants to move into mainstream shops that don’t want to do all the work themselves, and the degree of pre-integration is not clear just yet. To date, Infobright has signed up some 30 partners, and making all of the technology deals represent meaningful deliverables will take focus, experience, and some legwork to communicate successes. But the funding is there, the experience is in place, and Infobright joins the battle with some strong assets. Download ICE and check it out – it’s worth a look.

Published by Merv Adrian

Independent information technology market analyst and consultant, 40 years of industry experience, covering software in and around the data management space.

26 thoughts on “Infobright Bids to Anchor An Open Source DW Ecosystem”

  1. I did download ICE several months ago. It didn’t support DML at the time (not sure if it does now). Also had a hell of a time actually loading data (was using the FoodMart one at the time) and was never able to do much with it notwithstanding limited support from their forums (and a final admission that yup, there was a bug handling a basic VARCHAR data type at the moment…bummer). They’re very nice people but you can’t really do much with ICE and I imagine the enterprise version is the only viable one in the enterprise.
    Just my 2 cents.

    1. This is what’s great about blogs: real data from real users. Thanks for jumping right in on this. Have you got details of the version number at the time? Maybe we can find out what has been fixed since then. Given the timeline, I’m not surprised you found the seams in an open source product introduced in September 2008 “a few months ago.” That would mean it had been out there for a quarter or so. The challenge for any open source vendor is what happens when people expect things to work without having to code the fixes themselves – hhmmm…maybe that ought to be one definition of “GA” – even if you are open source.

  2. I’d have to get back to my emails/forum posts of the time. I will do that and provide it on here momentarily. At the time I felt that expecting DML support was fairly reasonable although, granted, UPDATE/DELETE aren’t exactly key in warehousing but still, at least for SCDs one would expect…

  3. Ok so there it is: http://www.infobright.org/Forums/viewthread/547

    Looks like I pulled ICE 3.1 for Windows 32 and *my bad* it was telecom data I was trying to load, and not FoodMart. You can follow the thread on there called “Problems using LOAD DATA INFILE—can anyone help me?”

    On the DML issue, see http://www.infobright.org/doc/specifications/
    It is available on the Enterprise version only (at least it was at the time).

    Maybe these issues are all resolved now but, who knows…and that’s the point with open source really isn’t it? You never really know what you’re getting, or what works when or where. So yes if you’re into tinkering and hacking stuff, it’s interesting but if you want to load data and start querying it right there & then w/o resorting to experts or having to wait until someone on some forum can maybe help you, it’s a little limited. This problem is pervasive with all OSS offerings IMHO – I think it’s time someone actually called this stuff for what it’s worth 🙂

  4. I’ve tried ICE too and I’ve honestly been impressed at what is possible with their engine since it is open source, but it still has some very severe limitations which are hard to live with.

    ICE is INSERT only, and I think the only DML method exposed via the storage engine is LOAD DATA INFILE. You can’t DELETE or UPDATE information. As Jerome pointed out, this can be a problem for SCD (slowly changing dimension), but then again, since IB discourages long dimensions, they’ll tell you to just reload the table. If you can live with that, then great. It doesn’t support TRUNCATE TABLE though.

    IB can’t handle SQL like sum(a * b), because this requires examining contents of data packs which is “slow”. Their next version (3.2) is supposed to support this, but it probably won’t be optimal. According to their forums, ICE still has stability issues. You can love or hate the TPC-H queries, but that isn’t an issue with Infobright because they can’t run those queries anyway.

    ICE doesn’t support CTAS or temporary tables. You could potentially work around both of these with a MySQL proxy script.

    All-in-all, if you can live with the extensive limitations, you might find a diamond in the rough. Or you may find a rough diamond. As with all things, your mileage may vary.

    1. Justin, thank you! This is just wonderful, specific input, and I hope we get lots more of it – and that the Infobright gang jumps in – some vendors do more of that than others, as we know. I’ll ping them to make sure they know how much things are hopping over here.

  5. First Merv, many thanks for your kind remarks about Infobright and our solutions. Also, appreciate the feedback and insight from the additional comments on this blog posting. Let me respond by saying that we are very dedicated to helping our community downloaders, users and contributors. The forum thread that was pointed out by Jerome shows that we rapidly try to respond in assisting users with their efforts using our products.

    As Jerome points out, our DML capabilities are available in our IEE version available at http://www.infobright.com. In addition, we have resolved the issues pointed out in these comments with our newest version of the software (V 3.2) which will be available in the next 30 days (give or take a few days). You will see and experience significant enhancements when the download is available. Finally, key complex expressions will also be available in this release.
    We look forward to additional feedback in our forums and wanted to assure you (the reader) that we are working very hard to ensure your success with the use of our products. If we can be of any assistance, don’t hesitate to contact us via our forums or feel free to call us directly.
    I agree with Merv, this is what is great about blogs. Keep the comments coming and all the best.

    Bob Zurek
    CTO and VP Product Management
    Infobright

    1. Thanks, Bob – great to have you on the blog. Wonderful to have the responsiveness. We’ve always had these kinds of discussions in the software industry, but it’s a new development – especially in OSS – that they sometimes take place so publicly, outside of the direct communications with the support organizations of the affected vendors. Kudos to you for being so in touch with the new channel and responding to it, and good luck with the next release.

  6. One of my criticisms of ICE since it went open source is that it is severely limited. I generally expect a community edition of software to be usable in production, although manageability or other aspects may not be as robust as a subscription pro version.

    The lack of DML is a significant limitation that really should be removed. ICE will be trialware for most people since you need the pro version if you expect to work at any scale. For smaller projects where it’s feasible to reload your DW nightly, it’s workable.

    The handling of SQL is improving with each release. SQL standards compliance (particularly for complex queries) and concurrency are two elements that every new appliance/database product has to deal with. They come with maturity.

    1. Thanks! Insightful as always, Mark. The delicate boundary between the typical two levels of OSS vendors is one we’ve seen other companies trip over. Not all features/functions are easily decomposed into “entry level” and “premium”; some things are either there or not.

      What it means to be “in production” also admits to several interpretations. For me the difference between “try, then buy if you plan to do anything useful” and open source as we might like it is that the latter does need to be workable for some level of production workload, but you’ll have to support yourself and there are likely some capacity issues.

      Perhaps that’s naive; the market will tell us. And as long as we’re out here telling vendors and users what they ought to be able to expect, we should have plenty to talk about.

  7. I’m not sure how effective DML on ICE would be, since there are absolutely no indexes available to the MySQL storage engine.

    Thus, a “bulk” update, to update the value of every column to some new value is likely easy to do, but updating values for any subset of rows will still require a FTS via a row-based iterator over column store data. That just isn’t going to be efficient and it may not be worth sinking engineering time into.

    Interestingly, the engine is open source. Would Infobright take exception to someone making patches available to make inefficient DML like I suggest available?

  8. Also, since it isn’t ACID, you are only safe appending data. This is safe because everything after a given HWM can be ignored and/or overwritten in the case of failed writes.

    Updating data packs probably isn’t easy, and if it can’t be done in an ACID way, not safe.
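
    The append-only argument can be sketched in a few lines of Python. This is a toy model under my own assumptions, not Infobright's code: `AppendOnlyColumn`, `load` and `read` are hypothetical names, and a real engine would persist the high-water mark (HWM) durably rather than keep it in memory.

```python
class AppendOnlyColumn:
    """Toy append-only store guarded by a high-water mark (hypothetical)."""

    def __init__(self):
        self._cells = []  # physical storage; may hold leftovers past the HWM
        self._hwm = 0     # number of committed values

    def load(self, batch, validate=lambda v: True):
        # Anything past the HWM came from a failed load: overwrite it.
        del self._cells[self._hwm:]
        for v in batch:
            if not validate(v):
                # Abort without moving the HWM, so the partial write stays invisible.
                raise ValueError(f"bad value: {v!r}")
            self._cells.append(v)
        # Commit in one step by advancing the HWM.
        self._hwm = len(self._cells)

    def read(self):
        # Readers never look past the HWM.
        return self._cells[:self._hwm]
```

    A load that fails halfway leaves data physically behind, but readers never see it and the next load overwrites it. In-place updates have no such single commit point, which is the difficulty raised above.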

  9. Hi Merv,

    It is great to see that your blog post has produced a lot of commentary from the community. In regard to questions about the suitability of ICE for production use, people may be interested to read what some ICE users are doing. There is a new post from ICE user Kevin Galligan about the product he built on ICE at http://www.kagii.com/?p=85.

    There is also another recent post about some work that Osma Ahvenlampi has been doing with ICE. He spoke at April’s MySQL User Conference in Santa Clara about his project in detail. You can read his initial blog post here: http://www.fishpool.org/post/2009/04/08/Using-the-Infobright-Community-Edition-for-event-log-storage

    His presentation slides for the conference are also available: http://www.fishpool.org/post/2009/04/22/Mining-for-insight-presentation-materials

    And a follow-up posting to the conference: http://www.fishpool.org/post/2009/04/23/Three-domains-of-data

    He is running a feature rich platform – in production – with ICE.

    Some (if not most) of the issues that Osma raises (in the above links) will be addressed with our upcoming release 3.2 of ICE (and IEE, the Enterprise Edition) that Bob Zurek referenced in his post.

    In regard to features differences between ICE and IEE, users will decide based on their specific needs which product is right for them. IEE provides a higher degree of operational capability which is typically required with a high volume production data warehouse. The IEE subscription also includes the support SLA’s, warranty and indemnification and services many enterprises need. The active, continually-growing community of ICE users however demonstrates that ICE is very useful for many people. Our goal with both products is to deliver high value – and while ICE is free, we believe that IEE is the lowest cost enterprise DW product on the market.

    With regards to Justin Swanhart’s comments about someone making DML functionality available for ICE – we welcome all contributions. We have been receiving code contributions for the project, and everything that we’ve received has been tested and incorporated into the solution.

    I look forward to any comments. Thanks again Merv.

    Best regards,

    Mark Windrim
    VP Community Relations
    Infobright

    1. Thanks for this, Mark. Lots of data along with the commentary. Data is always good, and I’m delighted that you’re watching and making sure we have it.

  10. Yeah that’s one thing you can say about InfoBright, they have a finger on the pulse out there. In other words, they’re on the ball. And they have good SEs working there too IMHO.

  11. Some thoughts after I hit the submit button…

    When I say ACID “is what it is”, I don’t mean to imply you should be OK with files partially loading and things like that. I’ve had issues where there’s some funky data in a file, and ICE will throw an error back. None of the data in that file is committed to your table in that case. However, I believe ICE does lock tables to reads while loading, and as far as I know, you can’t load multiple files in one transaction, so if you’re concerned about a query returning data after, say, only 3 of 5 files have loaded during a load process, you need to make sure you can take your database off line. I don’t know what IEE does, but I think this post was more about ICE anyway.

    Also…

    “Thus, a “bulk” update, to update the value of every column to some new value is likely easy to do, but updating values for any subset of rows will still require a FTS via a row-based iterator over column store data.”

    I use the ICE version, so I don’t do any updates, but as I understand the engine, it would identify portions of the data grid that need updating similar to how it runs queries, and only work on rebuilding those subsets. As far as Infobright not having indexes, it does. They’re just not b-tree style. So, you can isolate different parts of your data set that require updates. They’re just not “indexes” in the row-base database sense.

    Also…

    In the list of open source software that’s NOT built by teenage loners I forgot to include WordPress, the blog/cms system we’re currently discussing this topic on 😉

  12. These are going to appear out of order. I must have had some issue last night while posting. In any case, if the order can’t be fixed by the administrator, here’s the original post from last night…

    I’d like to throw in my 2 cents. I have a feeling it will wind up being significantly more than 2 cents by the end, but here it is.

    I am a current ICE user. ICE is my first real foray into data warehousing databases, so pardon my lack of experience with competing products and DW terminology, but I have nothing but great things to say about it. We have a data set that is relatively small, 270 million rows by 900 columns or so, in front of which we’ve built a web application to run queries with the expectation of “web time” performance (< 30 seconds in the worst case). The initial application was built on top of a standard Mysql installation, which is really my technical background. Performance? Horrible. Thrashed for a few weeks trying to sort things out. Heard about column oriented DB's on a forum, and about 6 hours later had loaded my app on Infobright, with data, and things were working great. See my forum thread. Notice it ends near 3am…

    http://www.infobright.org/Forums/viewthread/274/

    Our app is currently in demo and semi-internal beta, although if you're really interested in seeing it, contact me. I'll send you login information. We have multiple instances of the same app. For larger customers, a semi-custom branded install, and a single large site for small users.

    Tomorrow and over the next week I'm going to be publishing a series of blog posts about the technology we're using. The first is how we're outputting large queries onto a map, but later I'll get into more detail on how we're using Infobright…

    http://www.kagii.com/

    Some responses to comments above, in no particular order:

    – I have had no stability problems with ICE. Any software, especially early versions, can have trouble (I would argue that closed source, proprietary apps are significantly more susceptible to this). Infobright, and ICE specifically, is now well into its version 3 point release cycle.

    – The load times are excellent. That, and the great performance we get, has put this project in some peril because my partner sees no need to get better hardware for production deployment. He's never supported a live web application. I can attest to the constant load time. The reason this works, which wasn't covered above in detail, is simple: the data summary info and compression happen on 64k blocks independently. Standard b-trees have to maintain a tree structure, whose maintenance cost in an ideal world would increase at a strictly logarithmic pace, although I imagine reality dictates somewhat worse performance. Logarithmic growth is slow, but still ever increasing. Infobright does not work in the same fashion.

    – Some of the comments here are bashing open source like it's a hacky free for all. There are two basic ways in which open source projects are run: commercially funded and supported (like Java, Jboss, various flavors of Linux, MySql, Apache, Postgresql, Eclipse, ROL, Python, etc.), and the kind of open source that is hacked by high schoolers on Saturday nights in their parents' rec room. Infobright is very much in the former camp. All of the work I do now is on open source products, and we have a much smoother ride as a result. I ran a group that built a medium sized organization's operations on top of the Weblogic stack. That was rough. We had more than one day with production shutting down due to an obscure platform issue.

    – ACID? Well, it is what it is. It depends what you're doing with the database. The general use case is running aggregate reports on huge data sets. For every instance that requires pure ACID compliance, I have to imagine there are 10s or 100s that don't. It's not a transactional database in that sense. Right? Many of the obvious use cases I've discussed involved appending data as time goes on, so it's not really an issue. My project has the luxury of reloading the full data set once a month, so this type of thing was never on the concern radar.

    – DML. Again, it is what it is. ICE is fantastic if your architecture supports append-only. Not supporting updates in the community edition is a business decision. I can't fault them there.

    My opinion is a little skewed at this point. ICE really saved my project, and I've become quite friendly with the company over the past few months. It is a fantastic product, though.

    Sorry for the long ramble 😉

  13. My comment isn’t submitting. I assume due to length, so some of this will sound terse. Originally tried to send last night (sorry if the two are out of order).

    I am a current ICE user. ICE is my first real foray into data warehousing databases, so pardon my lack of experience with competing products and DW terminology, but I have nothing but great things to say about it. We have a data set that is relatively small, 270 million rows by 900 columns or so, with a query builder web app in front. Built on mysql. Performance = bad. Found ICE. Converted in 5-6 hours. See story here…

    http://www.infobright.org/Forums/viewthread/274/

    We’re in semi-internal beta. Contact me if you’d like to see it (has ajax-y maps. Fun to demo).

    Tomorrow and over the next week I’m going to be publishing a series of blog posts about the technology we’re using. The first is how we’re outputting large queries onto a map, but later I’ll get into more detail on how we’re using Infobright…

    http://www.kagii.com/

    Some responses to comments above, in no particular order:

    – I have had no stability problems with ICE. Any software, especially early versions, can have trouble (I would argue that closed source, proprietary apps are significantly more susceptible to this). Infobright, and ICE specifically, is now well into its version 3 point release cycle.

    – The load times are excellent. Constant time is due to the way IB indexes data: it is not a tree structure, so it doesn’t slow with size (however logarithmically slow that growth may be).

    – Some of the comments here are bashing open source like it’s a hacky free for all. There are two basic ways in which open source projects are run: commercially funded and supported (like Java, Jboss, various flavors of Linux, MySql, Apache, Postgresql, Eclipse, ROL, Python, etc.), and the kind of open source that is hacked by high schoolers on Saturday nights in their parents’ rec room. Infobright is very much in the former camp. All of the work I do now is on open source products, and we have a much smoother ride as a result. I ran a group that built a medium sized organization’s operations on top of the Weblogic stack. That was rough. We had more than one day with production shutting down due to an obscure platform issue.

    – ACID? Well, it is what it is. It depends what you’re doing with the database. The general use case is running aggregate reports on huge data sets. For every instance that requires pure ACID compliance, I have to imagine there are 10s or 100s that don’t. It’s not a transactional database in that sense. Right? Many of the obvious use cases I’ve discussed involved appending data as time goes on, so it’s not really an issue. My project has the luxury of reloading the full data set once a month, so this type of thing was never on the concern radar.

    – DML. Again, it is what it is. ICE is fantastic if your architecture supports append-only. Not supporting updates in the community edition is a business decision. I can’t fault them there.

    My opinion is a little skewed at this point. ICE really saved my project, and I’ve become quite friendly with the company over the past few months. It is a fantastic product, though.

    Sorry for the long ramble 😉

  14. >I use the ICE version, so I don’t do any
    >updates, but as I understand the engine, it
    >would identify portions of the data grid
    >that need updating similar to how it runs
    >queries, and only work on rebuilding those
    >subsets. As far as Infobright not having
    >indexes, it does.

    At the MySQL storage engine level, there are no indexes in Infobright. The “knowledge grid” is not accessible to the storage engine level. When Infobright executes SELECT queries they INTERCEPT them, that is they run them in an external execution engine which accesses the data.

    Consider what happens when IB can’t execute a query, and instead runs it via ‘mysql path’ (you get a warning when this happens). Performance is often terrible because MySQL is forced to use the join_buffer algorithm between the tables. Same thing would happen with DML if you tried to join tables, because there are no indexes to support block nested loops.

    Kickfire does the same kind of “query intercept” thing. This is because the MySQL parser and query execution are very closely tied together, and the parser can’t optimize over a column store.

    If the DML statement goes to the storage engine interface, it will have no choice but to FTS with its row-oriented interface over the column store which is expensive.

    So, for Infobright, if DML were to be implemented so as to work with the knowledge grid, it would have to be “intercepted” too, which is a very complex proposition.

    Kickfire supports unique indexes and primary keys which can be used by the storage engine interface when using DML. We still will not match row store performance, but DML performance is acceptable, which is important for SCD.

  15. Also some of the ACID “problems” might be possible to work around using datapack level locking, and making copies of the changed data packs in some sort of redo structure. Once again, not a simple exercise.

    Normally RDBMS lock a) entire tables, b) entire pages or c) entire rows. A column store can choose instead to lock d) an entire column, and with Infobright locking could be at the e) datapack level.

    When I visualize Infobright storage, I do so as a vertically partitioned table (column store) where each column is horizontally partitioned dynamically by range so that the data aligns into 65K blocks. Since min/max and other stats are known about the data blocks, partition pruning essentially is used to limit the number of data blocks to be examined. Not having looked at the source code, I’m not 100% sure this is how it works, but this is what I imagine from reading a description of the technology.
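
    The copy-the-changed-packs idea can be sketched as follows. This is a minimal Python illustration under my own assumptions, not a description of any shipping engine: `update_where_equal` is a hypothetical name, the `staged` dictionary stands in for a redo structure, and min/max are computed on the fly where a real system would read stored statistics.

```python
def update_where_equal(packs, old, new):
    """Replace `old` with `new`, copying only the packs that contain `old`.

    Returns (new_pack_list, indexes_of_rewritten_packs). The input packs
    are never mutated, so readers of the old list see a consistent view.
    """
    staged = {}  # pack index -> rewritten copy (stand-in for a redo structure)
    for i, pack in enumerate(packs):
        # Pruning step: the min/max range says whether `old` can be present.
        if min(pack) <= old <= max(pack) and old in pack:
            staged[i] = [new if v == old else v for v in pack]
    # "Commit" by publishing a new list; untouched packs are shared as-is.
    published = [staged.get(i, pack) for i, pack in enumerate(packs)]
    return published, sorted(staged)
```

    Only the rewritten packs would need locking, which matches granularity (e) above. Making the final publish step atomic and durable is the part that remains hard.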

  16. Hi Merv,

    good to see the increased interest in (Open Source) ADBMS’s. It looks like you’re most interested in hands on experiences: my first ones are in the ‘ps’ of http://www.tholis.com/news/open-source-data-warehousing/. What’s not in the post is the TPC-H sf100 tests I ran. Load speed was faster than MonetDB (about 20%), but most of the queries could not be executed due to limited SQL support. The ones I rewrote showed mixed results; some were about on par with MonetDB, some were (considerably) slower and I believe one was actually faster. Most interesting however is the compression rate: loading 100 GB of TPC-H data resulted in a 15.5 GB database. And with 32GB Ram or more you can store this on a Ramdisk which should give some interesting results. Hope to get these in a couple of weeks…

    ps: I’m a bit skeptical re. the claimed DB sizes. As far as I’m aware of, ICE is a SMP solution, so for a 50TB database you’ll need a monster machine.

    ps2: On the other hand, I know of a project where they use PostgreSQL to do all the DWH stuff (esp. SCD’s) and reload ICE every day for end user access. Works like a charm.

    ps3: The Infobright people are great and really on top of things when they think you can need a little help, even on a Saturday night…

    hope this helps to complete the picture; looking forward to read other comments!

    best, Jos
