There are many analytic engines and appliances out there. Yet many of these engines share common technology patterns: In-memory databases, transparent sharding and columnar storage are all technology patterns that appear in many of these products.
In this article I want to concentrate on particular products that bring something innovative to the table: something that isn’t present in most of their peers, in other words, the product differentiators. (To keep this article relevant, I reserve the right to add more products and innovations as I discover them, so this article will be updated over time.)
HP Vertica is a Massively Parallel Processing (MPP) scale-out distributed database for structured data that utilises columnar compression. Unlike RDBMS-derived database engines, HP Vertica is a ground-up analytics platform, and it has a few distinct features which demonstrate this ground-up approach:
Many of the MPP products out there originate from open-source RDBMS products (commonly MySQL/MariaDB or Postgres) which are subsequently heavily modified. Because Vertica is a ground-up design, it doesn’t have an on-disk persisted transaction log; instead, its transaction log is in-memory. This means that it doesn’t have to contend with a dependency on the performance of the transaction log’s disk subsystem. Not having to persist the transaction log frees the available IO resources to deal with the persistence of the compressed data files.
Indexes have existed in most database products for decades: They provide hierarchical, sorted structures of select data columns together with pointers to the underlying rows (in the case of row-based indexing).
Vertica’s projections are different – they utilise columnar compression of column subsets, which don’t necessarily require the inclusion of a primary key column or any other pointer to the underlying table structure: Columnar compression already helps greatly in reducing storage and increasing the scan density, and the ability to exclude a row identifier increases the scan density further still. The projections can be segmented over multiple processing nodes, parallelising scan activity across the nodes involved. The database designer can choose between a number of algorithms for this segmentation. Vertica also offers projections that are aggregates and “top” projections.
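To make the idea concrete, here is a minimal Python sketch of a projection: a sorted subset of columns with no row identifier, run-length encoded and then segmented across nodes by hashing one column. This is a toy model only, not Vertica's implementation; the sample rows, the function names and the choice of CRC32 as the hash are all assumptions made for illustration.

```python
from itertools import groupby
from zlib import crc32

# Hypothetical fact rows: (sale_id, region, amount)
rows = [
    (1, "north", 10.0),
    (2, "south", 7.5),
    (3, "north", 2.5),
    (4, "east", 4.0),
    (5, "south", 1.0),
    (6, "north", 6.0),
]

def build_projection(rows, columns, sort_key):
    """Keep only the chosen columns, sorted; no row identifier is stored."""
    projected = [tuple(r[c] for c in columns) for r in rows]
    return sorted(projected, key=lambda t: t[sort_key])

def rle_encode(values):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

def segment(projection, key_index, node_count):
    """Assign each projected row to a node by hashing one column."""
    nodes = [[] for _ in range(node_count)]
    for t in projection:
        node = crc32(str(t[key_index]).encode()) % node_count
        nodes[node].append(t)
    return nodes

# Projection over (region, amount), sorted by region; sale_id is dropped.
projection = build_projection(rows, columns=(1, 2), sort_key=0)
region_rle = rle_encode([t[0] for t in projection])
segments = segment(projection, key_index=0, node_count=2)

print(region_rle)  # [('east', 1), ('north', 3), ('south', 2)]
```

Sorting before encoding is what makes the runs long: six region values collapse to three pairs. Each node can then scan its own segment independently.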
PolyBase is a new technology that’s been added to SQL Server 2016: It enables data in external data stores (Hadoop/HDInsight, Azure Data Lake Store and Azure Blob Storage) to be queried together with data in the SQL Server database, using the T-SQL language with which so many developers are familiar. It offers push-down computation: For a query that accesses tables in both the local database and external storage, the engine can decide to push the part of the computation that can be carried out on the external data to that data store, so that only the interim results are transferred to the local SQL Server for subsequent processing in the database engine.
In a way, this is highly similar to Microsoft’s OLEDB technology, but for cloud data.
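The benefit of push-down computation is easiest to see by counting what crosses the network. The Python sketch below is a toy model under stated assumptions, not PolyBase's API: the in-memory lists stand in for an external store and a local table, and every function name is hypothetical. It runs the same query twice, once pulling every external row across and once filtering and pre-aggregating next to the data.

```python
from collections import defaultdict

# Hypothetical external data (stand-in for files in remote storage):
# (customer_id, amount)
REMOTE_SALES = [(1, 20.0), (2, 5.0), (1, 12.5), (3, 40.0), (2, 7.5), (3, 1.0)]
# Hypothetical local table: customer_id -> name
LOCAL_CUSTOMERS = {1: "Ada", 2: "Grace", 3: "Edsger"}

def remote_scan():
    """No push-down: every remote row crosses the network."""
    return list(REMOTE_SALES)

def remote_filter_and_aggregate(min_amount):
    """Push-down: filter and pre-aggregate where the data lives."""
    totals = defaultdict(float)
    for customer_id, amount in REMOTE_SALES:
        if amount >= min_amount:
            totals[customer_id] += amount
    return sorted(totals.items())

def join_with_customers(totals):
    """Local step: join the interim totals to the local customer table."""
    return {LOCAL_CUSTOMERS[cid]: total for cid, total in totals}

def query_without_pushdown(min_amount):
    transferred = remote_scan()
    totals = defaultdict(float)
    for cid, amount in transferred:
        if amount >= min_amount:
            totals[cid] += amount
    return join_with_customers(sorted(totals.items())), len(transferred)

def query_with_pushdown(min_amount):
    transferred = remote_filter_and_aggregate(min_amount)
    return join_with_customers(transferred), len(transferred)

naive_result, naive_rows = query_without_pushdown(10.0)
pushed_result, pushed_rows = query_with_pushdown(10.0)
print(naive_rows, pushed_rows)  # 6 2
```

Both plans return the same answer, but the pushed-down plan transfers two interim rows instead of six raw ones. On real data volumes that difference is the whole point.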
Using a stack from a single vendor can bring significant advantages: Sole responsibility, lower cost and an integrated toolset are a few of them. However, one that is often overlooked is the consistency of the object model APIs across different products within the stack – and nobody does this better than Microsoft: Prime examples of this are the .NET APIs for SQL Server (SQL Server Management Objects API), Analysis Services (Analysis Services Management Objects) and
Being a Microsoft product, SQL Server is tightly integrated into the Microsoft ecosystem and nothing shows this better than SQL Server’s security model, providing tight integration with Active Directory, not as an alternative authentication, but as the default method for connecting to SQL Server and other database products (like Analysis Services).
IBM’s PureData offering comes from its acquisition of Netezza. What differentiates it from other Massively Parallel Processing (MPP) solutions is its reliance on hardware to increase performance rather than clever indexing strategies. Its use of FPGAs is particularly innovative:
The inclusion of FPGA processing enables certain operations to be performed on the data as it comes off the underlying disk, before that data passes across the system bus to the CPUs. Explicitly, these FPGAs perform the following operations, all of which speed up processing:
Because of its reliance on pushing data processing down to the FPGAs associated with the disk drives, PureData for Analytics doesn’t have any user-configurable indexes!
Because PureData for Analytics doesn’t have indexing, it’s unlikely to have the same performance as an MPP system where the clustering, indexing, partitioning and sharding strategies can be devised in full knowledge of the actual data structure and distribution, but it also doesn’t require the level of expertise, data knowledge and planning that such systems require to fully exploit their capabilities.
Jedox’s OLAP server is an in-memory OLAP server, similar to Microsoft’s Analysis Services server (running in Tabular mode). However, it has a particularly innovative feature: The ability to exploit GPUs for accelerated parallel processing:
Jedox can utilise NVIDIA GPUs to accelerate processing. It is particularly effective when dealing with queries accessing large numbers of numeric cells, performing compute-intensive calculations or aggregations on large numbers of rows, and applying large numbers of business rules or complex operations at a cell level.
The GPU acceleration is (by default) dynamic – each read operation required by the query in question is evaluated and a decision is made as to whether to use the GPU or CPU for each operation. Clever stuff.
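I don't know the exact heuristic Jedox uses, but the general shape of such a per-operation decision can be sketched in Python as a simple cost comparison. Everything here is an assumption for illustration: the `ReadOperation` fields, the fixed dispatch overhead and the speed-up factor are made-up numbers, not values from Jedox.

```python
from dataclasses import dataclass

@dataclass
class ReadOperation:
    cell_count: int    # number of numeric cells the operation touches
    ops_per_cell: int  # rough compute intensity per cell

# Assumed tuning constants; not taken from any Jedox documentation.
GPU_DISPATCH_OVERHEAD = 50_000  # fixed cost of handing work to the GPU
GPU_SPEEDUP = 20                # assumed throughput advantage once running

def estimated_cpu_cost(op):
    return op.cell_count * op.ops_per_cell

def estimated_gpu_cost(op):
    return GPU_DISPATCH_OVERHEAD + estimated_cpu_cost(op) / GPU_SPEEDUP

def choose_device(op):
    """Pick the device with the lower estimated cost for this operation."""
    return "gpu" if estimated_gpu_cost(op) < estimated_cpu_cost(op) else "cpu"

def plan(operations):
    """Decide per operation, so one query can mix CPU and GPU work."""
    return [choose_device(op) for op in operations]

small_lookup = ReadOperation(cell_count=1_000, ops_per_cell=2)
big_aggregation = ReadOperation(cell_count=10_000_000, ops_per_cell=4)
devices = plan([small_lookup, big_aggregation])
print(devices)  # ['cpu', 'gpu']
```

The fixed overhead is what keeps small reads on the CPU: the GPU only wins once there is enough work to amortise the cost of dispatching to it.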
MapD have a very nice high-performance analytics server and web-visualisation server that are interdependent. Each of these servers brings real innovation to the table:
Unlike Jedox’s GPU acceleration, MapD’s analytics server is built primarily around processing using one or more NVIDIA graphics cards. Multiple cards (up to 16 per server) can be placed in a machine, with data spread over the GPU RAM present in the cards. Unlike Jedox’s offering, MapD’s core engine speaks SQL natively, not MDX. Like Jedox, MapD utilises columnar compression.
MapD stores data first in GPU memory. However, if that memory is insufficient, data can be stored in server RAM, and if that should prove to be insufficient, on disk (MapD recommends Flash storage).
This approach, of prioritising data storage in memory as close as possible to the GPUs, means that MapD avoids the overhead of transferring data from server RAM and removes the bottleneck of memory bus bandwidth and latency that can occur in similar systems where data is loaded from system RAM on demand.
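The fall-through from GPU memory to server RAM to disk can be modelled as a first-fit placement over tiers ordered from fastest to slowest. The Python sketch below is a toy model of that idea only; the `Tier` class, the capacities and the chunk sizes are hypothetical, and MapD's real placement logic is certainly more sophisticated.

```python
from dataclasses import dataclass, field

@dataclass
class Tier:
    name: str
    capacity: int  # space available in this tier (arbitrary units)
    used: int = 0
    chunks: list = field(default_factory=list)

    def try_store(self, chunk_id, size):
        """Store the chunk if it fits; report whether it was stored."""
        if self.used + size > self.capacity:
            return False
        self.used += size
        self.chunks.append(chunk_id)
        return True

def place_chunks(chunks, tiers):
    """Place each (chunk_id, size) in the first (fastest) tier with room."""
    placement = {}
    for chunk_id, size in chunks:
        for tier in tiers:
            if tier.try_store(chunk_id, size):
                placement[chunk_id] = tier.name
                break
        else:
            raise MemoryError(f"no tier can hold chunk {chunk_id}")
    return placement

# Tiers ordered from closest to the GPU to furthest away.
tiers = [Tier("gpu", 100), Tier("ram", 200), Tier("disk", 10_000)]
chunks = [("a", 60), ("b", 60), ("c", 30), ("d", 180), ("e", 50)]
placement = place_chunks(chunks, tiers)
print(placement)
# {'a': 'gpu', 'b': 'ram', 'c': 'gpu', 'd': 'disk', 'e': 'ram'}
```

Note that chunk "c" still lands in GPU memory after "b" has spilled to RAM: each chunk is tried against the fastest tier first, so any remaining GPU space keeps being used.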
MapD utilises the GPUs to produce PNG images on demand. This avoids the overhead of transferring query results (which can comprise a large amount of data) to a separate server for subsequent rendering. No copying of this data is needed – the data is already resident in the RAM of the GPU that renders the image. As a result, images containing a large amount of data are rendered very quickly. This is particularly useful for complex images such as those that show geographical data distributions.
Furthermore, the speed with which GPUs can process images enables the production of complex animations, showing, for example, how data changes over time.
Not all report components utilise the GPU: The approach taken is a hybrid one, in which simple graphs are rendered without utilising GPU processing.