The Business Intelligence landscape has changed rapidly in the last few years. There have been many new entrants touting a wide range of technologies, but fundamentally these are the hot technologies to look out for when evaluating a BI technology stack:
Columnar Storage
Targeted at large-scale data warehousing, columnar storage achieves massive compression ratios on the underlying data, reducing the need for high-performance IO subsystems at the cost of increased CPU load, so reading and loading are faster. The downside is that updates and deletes are considerably slower. Players in this space include Vertica, Sybase IQ, ParAccel, Infobright, EMC Greenplum, Aster Data and Teradata.
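To make the compression claim concrete, here is a minimal Python sketch of why column-oriented layouts compress so well: all values in a column are of one type and are often repetitive, so even a trivial scheme like run-length encoding collapses them dramatically. The function names are invented for illustration and are not any vendor's API.

```python
def rle_encode(column):
    """Run-length encode a list of column values into (value, count) pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs

def rle_decode(runs):
    """Expand (value, count) pairs back into the original column values."""
    return [value for value, count in runs for _ in range(count)]

# A sorted "country" column: 1,000,000 row values collapse to three runs.
column = ["DE"] * 400_000 + ["FR"] * 350_000 + ["UK"] * 250_000
runs = rle_encode(column)
print(len(runs))   # 3
assert rle_decode(runs) == column
```

Real columnar engines use far more sophisticated encodings, but the trade-off is the same: scans over compressed columns need less IO, while an update in the middle of a run forces a rewrite, which is why updates and deletes are slower.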
Hardware Acceleration
This involves hardware acceleration of BI: offloading processing that would normally be done by the CPU to dedicated chips, which can include FPGAs and GPUs. Players include Netezza, Teradata and Jedox.
OLAP
Designed to facilitate "on-the-fly" slice-and-dice querying, these technologies offer fast query performance at the cost of pre-aggregation (which takes time). The upside is that little or no code has to be written to generate most queries (many of the products use the MDX language instead of SQL, so there are no complex join syntaxes to remember); the downside is that considerable expertise has to go into designing the underlying cube structures. See http://en.wikipedia.org/wiki/Comparison_of_OLAP_Servers for a comparison of OLAP products.
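The pre-aggregation trade-off can be sketched in a few lines of Python: pay the aggregation cost up front across every combination of dimensions, and any subsequent slice is a dictionary lookup rather than a table scan. The `build_cube` function and its key layout are invented for illustration; real products build genuine cube structures queried via MDX.

```python
from itertools import chain, combinations

def build_cube(rows, dims, measure):
    """Pre-aggregate `measure` over every combination of the dimensions."""
    cube = {}
    groupings = chain.from_iterable(
        combinations(dims, n) for n in range(len(dims) + 1))
    for grouping in groupings:
        for row in rows:
            key = (grouping, tuple(row[d] for d in grouping))
            cube[key] = cube.get(key, 0) + row[measure]
    return cube

rows = [
    {"region": "EU", "product": "A", "sales": 100},
    {"region": "EU", "product": "B", "sales": 50},
    {"region": "US", "product": "A", "sales": 70},
]
cube = build_cube(rows, ("region", "product"), "sales")
print(cube[(("region",), ("EU",))])   # 150: the slice is a lookup, not a scan
print(cube[((), ())])                 # 220: grand total
```

Note that the cube's size grows with the number of dimension combinations, which hints at why designing the underlying structures takes expertise.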
Partitioning
This technology allows you to partition data horizontally into large blocks based on a partitioning key value or range of values. Queries whose criteria restrict on the partitioning key become efficient, and partitions can be loaded efficiently as atomic units where the partition key is tied to the atomic unit of load. The downside is increased implementation complexity, but this technology is essential for dealing with large amounts of data on a single machine. Almost all the mainstream relational database vendors offer partitioning as a feature.
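The two benefits described above, partition pruning and atomic loads/drops, can be sketched as follows. The class and method names are purely illustrative, not any vendor's API; real databases do this transparently underneath SQL.

```python
import bisect

class RangePartitionedTable:
    """Toy range-partitioned table keyed on, say, order year."""

    def __init__(self, boundaries):
        # boundaries [2009, 2010] create three partitions:
        # key < 2009, 2009 <= key < 2010, key >= 2010
        self.boundaries = boundaries
        self.partitions = [[] for _ in range(len(boundaries) + 1)]

    def _index(self, key):
        return bisect.bisect_right(self.boundaries, key)

    def insert(self, key, row):
        self.partitions[self._index(key)].append(row)

    def rows_for(self, key):
        # Partition pruning: a query restricted on the key scans one block.
        return self.partitions[self._index(key)]

    def drop(self, key):
        # Partitions are atomic units: a whole block is swapped in or out.
        self.partitions[self._index(key)] = []

table = RangePartitionedTable([2009, 2010])
table.insert(2008, "old order")
table.insert(2009, "order A")
table.insert(2010, "order B")
print(table.rows_for(2009))   # ['order A'], only one partition scanned
```

Dropping or reloading `drop(2008)`-style is why partition-aligned loads are so cheap: no row-by-row deletes are needed.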
Transparent Sharding
This technology splits data transparently among a number of physical servers. The upside is that you can use commodity servers to increase scalability. The downside is that you really have to commit to a sharding strategy up front, and modifying it later can be very difficult. There are also potential performance issues when combining large result sets from multiple nodes. Players include Netezza, Vertica, Microsoft (Parallel Data Warehouse and SQL Azure), Teradata and Aster Data.
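A hash-based shard router, the simplest common strategy, can be sketched like this. The store and its in-process "nodes" are illustrative stand-ins for physical servers, not a real product's API.

```python
import hashlib

class HashShardedStore:
    """Toy hash-sharded key/value store spread over N 'nodes'."""

    def __init__(self, node_count):
        self.nodes = [{} for _ in range(node_count)]   # stand-ins for servers

    def _node_for(self, key):
        # A stable hash of the shard key picks the owning node.
        digest = hashlib.sha1(str(key).encode()).hexdigest()
        return int(digest, 16) % len(self.nodes)

    def put(self, key, row):
        self.nodes[self._node_for(key)][key] = row

    def get(self, key):
        # Routed straight to one node; no other server is touched.
        return self.nodes[self._node_for(key)].get(key)

store = HashShardedStore(4)
store.put("cust-42", {"name": "Acme"})
print(store.get("cust-42"))
```

The `% len(self.nodes)` in the routing function is also why changing strategy later is painful: adding a node remaps almost every key, forcing a near-total data reshuffle, which is exactly the up-front-commitment downside mentioned above.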
Massively Parallel Processing
This feature often accompanies transparent sharding, but the difference is that MPP places no bottleneck on the number of nodes involved at any layer of the sharded database technology stack. So where each node is both a storage node and a query distribution/result-assembly node (as in Vertica), the system is inherently MPP. A system can also be MPP if it requires dedicated distribution/result-assembly nodes, provided those nodes can be increased in number as required.
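The distribution/result-assembly pattern described above is essentially scatter-gather: each node aggregates its own shard locally, and the (small) partial results are merged by whichever node plays assembler. This is a minimal sketch with threads standing in for nodes; the data and function names are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

shards = [
    [("EU", 100), ("US", 70)],     # node 0's local data
    [("EU", 50), ("APAC", 30)],    # node 1's local data
    [("US", 20)],                  # node 2's local data
]

def partial_sum(shard):
    """Each node aggregates locally before shipping results anywhere."""
    totals = {}
    for region, sales in shard:
        totals[region] = totals.get(region, 0) + sales
    return totals

def assemble(partials):
    """Result assembly: merge the small per-node partial aggregates."""
    merged = {}
    for p in partials:
        for region, total in p.items():
            merged[region] = merged.get(region, 0) + total
    return merged

with ThreadPoolExecutor() as pool:      # stand-in for parallel nodes
    partials = list(pool.map(partial_sum, shards))
print(assemble(partials))   # {'EU': 150, 'US': 90, 'APAC': 30}
```

Because any node could run `assemble`, adding more shards adds both storage and assembly capacity, which is the "no bottleneck at any layer" property.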
In-Memory Databases
These databases load data from source systems directly into memory, without persisting it to disk. The upside is that you don't need a high-performance (or any) IO subsystem for the database itself. The downside is that if the power goes off, you have to reload the data from source or from a backup; and if the data is large, you'll need a high-performance IO subsystem for that eventuality! Another limitation is the amount of physical memory you can get in a single server: at the time of writing, HP's biggest ProLiant server (DL980 G7) can hold up to 2TB of RAM. Players in this space include Microsoft SSAS (Tabular Mode/VertiPaq), QlikView, VoltDB, Jedox and MicroStrategy.
Complex Event Processing
This technology is for fast computation over highly temporal data (i.e. data whose value is greatest the moment it arrives). It is used to calculate things like stock market indexes and moving averages in real time. Players include KDB and Microsoft's StreamInsight.
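The real-time moving average mentioned above is the canonical streaming computation: each result is emitted as a tick arrives, over a fixed window, without ever materialising the full history. A minimal Python sketch (the generator is illustrative, not KDB or StreamInsight syntax):

```python
from collections import deque

def moving_average(stream, window):
    """Yield the average of the last `window` ticks as each tick arrives."""
    buf = deque(maxlen=window)   # old ticks fall off the back automatically
    for tick in stream:
        buf.append(tick)
        yield sum(buf) / len(buf)

prices = [10.0, 12.0, 11.0, 13.0, 14.0]
print(list(moving_average(prices, 3)))
# [10.0, 11.0, 11.0, 12.0, 12.666666666666666]
```

Because `moving_average` is a generator over an arbitrary iterable, the same code works unchanged on an unbounded live feed: results are pushed out incrementally, which is the essence of complex event processing.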
Integrated Technology Stack
Often excluded, but of vital importance: an integrated technology stack can make the whole much greater than the sum of the parts. Needless to say, the big vendors can provide an entire framework that cuts down not only on delivery time but also on project risk. Players here are Microsoft, IBM, Oracle and Sybase.