Microsoft offers both the Hadoop ecosystem on Azure (which it collectively calls HDInsight) and a range of its own Azure Big Data services. This article compares and contrasts these technologies and suggests reasons why you might choose one over the other.
Below is a simplified matrix showing the Azure Big Data service that corresponds to each HDInsight offering. Note that in some cases, the Big Data services may encompass more functionality than the HDInsight (Hadoop) technologies to which they are mapped:
|Function||HDInsight||Azure Big Data Service|
|SQL on Hadoop||Hive||Data Lake Analytics|
|File <=> Database Load/Extract||Sqoop||Data Lake Analytics|
|ETL (Extract, Transform, Load)||Pig||Data Lake Analytics|
|Key-Value Table Database||HBase||Table Storage, CosmosDB|
|Machine Learning||Mahout, R Server||Machine Learning|
|Real-time Event Processing||Storm||Stream Analytics|
|Tabular Sharded Analytics||Spark||SQL Data Warehouse|
Infrastructure as a Service versus Platform as a Service
Perhaps the most significant difference between HDInsight and Microsoft’s own Azure Big Data services is that HDInsight is a virtual cluster of Hadoop nodes that may be spun up or down on request, and over which the customer has considerable control. In this sense, HDInsight is a Hadoop Infrastructure-as-a-Service offering. By contrast, the Big Data services are always available and do not need to be spun up or down; instead, they are allocated from a pool of services. The Big Data services are therefore a Platform-as-a-Service offering.
As a result, the pricing models for the various services differ considerably too. HDInsight clusters are charged on a time basis while the cluster is up (per minute of usage), with the hourly rate varying according to the specification of the nodes. This means that you are charged for as long as the cluster is up, regardless of how much or how little processing you actually carry out.
By contrast, the Big Data services vary in their charging model. Some are charged based upon usage (for example, Data Lake Analytics is charged per query); others are charged on a per-seat-per-month basis (for example, Machine Learning), per month (for example, Stream Analytics), or per unit of code executed per month (for example, Data Factory).
Because of this difference in pricing models, if you’re using HDInsight you will probably want your clusters to be spun up only when you need to perform computation, so there is a strong incentive to spin down clusters when they are not being used. Since it may take some time to spin up a cluster, you should weigh this start-up delay carefully when choosing between the two.
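To make the pricing trade-off concrete, here is a rough sketch that compares a time-billed cluster with a usage-billed service and finds the break-even workload. Every figure in it (node count, hourly rate, price per job) is a made-up placeholder, not an Azure price, and the function names are purely illustrative.

```python
def cluster_cost(nodes, hours_up, rate_per_node_hour):
    """Dedicated cluster: billed for every hour it is up, busy or idle."""
    return nodes * hours_up * rate_per_node_hour


def per_job_cost(jobs, cost_per_job):
    """Usage-billed service: billed only for the work actually submitted."""
    return jobs * cost_per_job


def break_even_jobs(nodes, hours_up, rate_per_node_hour, cost_per_job):
    """Jobs per period at which both charging models cost the same."""
    return cluster_cost(nodes, hours_up, rate_per_node_hour) / cost_per_job


if __name__ == "__main__":
    # Hypothetical figures for illustration only; substitute real prices.
    nodes, rate, job_price = 4, 0.50, 2.00

    always_on = cluster_cost(nodes, 24 * 30, rate)  # cluster up all month
    windowed = cluster_cost(nodes, 4 * 30, rate)    # cluster up 4 hours a day

    print(f"Always-on cluster:  {always_on:.2f} per month")
    print(f"4-hour daily window: {windowed:.2f} per month")
    print(f"Break-even (always on): "
          f"{break_even_jobs(nodes, 24 * 30, rate, job_price):.0f} jobs")
    print(f"Break-even (windowed):  "
          f"{break_even_jobs(nodes, 4 * 30, rate, job_price):.0f} jobs")
```

Under these assumed numbers, restricting the cluster to a daily window lowers the break-even point considerably, which is why spinning down idle clusters matters so much. The sketch ignores storage charges and cluster start-up time.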
With HDInsight, the number of cluster nodes is decided when you spin up the cluster. You cannot dynamically spin up additional nodes without taking down the cluster first. By contrast, services like Azure Data Lake Analytics can process data across multiple nodes on a query-by-query basis.
This means that Azure Big Data services are more elastic: additional resources can be deployed for specific jobs or for spikes in demand.
Not all HDInsight components are functionally equivalent to the Microsoft Big Data services they are mapped to. In some cases, an Azure service encompasses more functionality than a single HDInsight component: Azure Data Lake Analytics, for example, has greater functionality than Hive, encompassing features that would otherwise require Pig and Sqoop. In other cases, there is a substantial difference in underlying architecture. Spark’s in-memory store is closer in underlying design to Microsoft’s Azure Analysis Services offering, but Spark shards data over multiple servers whereas Analysis Services does not, which is why Analysis Services is not listed in the table above. Azure SQL Data Warehouse is a closer fit, although it is an on-disk sharded RDBMS supporting columnar compression.
Programmatic Language Support
The underlying Hadoop components are notable in that few of them share a programming language. For example, Pig requires Pig Latin, Hive requires HiveQL (HQL), and Sqoop has its own command-line syntax. This can result in a steep learning curve when compared with Microsoft’s Azure Data Lake Analytics service, which uses a single service and language to cover the same functionality. In a nutshell, you may need to master more languages and more varied command syntax with HDInsight than you would with Microsoft’s Big Data service offerings.
Having one’s own dedicated cluster, as HDInsight provides, is a double-edged sword. Although it is far less elastic than a PaaS offering, what it loses in elasticity it makes up for in resource isolation: the HDInsight cluster nodes you provision are yours, and are not shared with anyone else. This means that you are unlikely to suffer performance degradation of your nodes as a result of activity by other tenants on Microsoft’s Azure infrastructure. In a nutshell, HDInsight offers performance consistency at the cost of elasticity.
Without diving into the deeper technical details of the products available as Microsoft Azure services and HDInsight components, the choice between the two operating models is a choice between a dedicated cluster and a managed service. If your usage pattern is going to be fairly consistent around the clock, or you can limit your cluster up-time to a known window each day, then HDInsight will deliver performance consistency, limited by the type and number of nodes you provision. However, if you want a scalable, always-available service, have a high variance in activity, want a pay-on-usage model, and want reduced complexity in both development and operations, Azure Big Data services are the route to go, provided that you can live with less consistent performance.
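The criteria in this conclusion can be turned into a simple checklist. The sketch below is one hypothetical way to do it: the `Workload` fields and the equal weighting of criteria are assumptions made for illustration, not an official sizing method, and a real decision would also weigh cost and feature fit.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # Criteria that favour a dedicated HDInsight cluster
    steady_usage: bool                  # fairly consistent around the clock
    fixed_daily_window: bool            # up-time fits a known daily window
    needs_consistent_performance: bool  # cannot tolerate noisy neighbours
    # Criteria that favour the managed Azure Big Data services
    needs_always_available: bool
    high_variance: bool                 # spiky, unpredictable activity
    wants_pay_per_use: bool
    wants_low_complexity: bool          # in development and operations


def suggest(w: Workload) -> str:
    """Compare the fraction of criteria met on each side (equal weights assumed)."""
    cluster_score = sum(
        [w.steady_usage, w.fixed_daily_window, w.needs_consistent_performance]
    ) / 3
    managed_score = sum(
        [w.needs_always_available, w.high_variance,
         w.wants_pay_per_use, w.wants_low_complexity]
    ) / 4
    if cluster_score > managed_score:
        return "HDInsight"
    if managed_score > cluster_score:
        return "Azure Big Data services"
    return "no clear preference"


if __name__ == "__main__":
    nightly_batch = Workload(
        steady_usage=False, fixed_daily_window=True,
        needs_consistent_performance=True, needs_always_available=False,
        high_variance=False, wants_pay_per_use=False,
        wants_low_complexity=False,
    )
    print(suggest(nightly_batch))
```

A nightly batch job with a fixed window and strict performance needs leans towards HDInsight in this sketch, while a workload that ticks mostly the second group of boxes leans towards the managed services.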