
Optimal Compressed Data File Strategies for HDInsight and Azure Data Lake

HDInsight (Microsoft’s canned Hadoop offering on Azure) and Azure Data Lake are competing Azure services, with many similar features and yet significant differences.

One of the significant differences between the two platforms is their ability to process compressed file formats. This article looks at the similarities and differences between the two and attempts to formulate strategies to gain the maximum performance for each platform.

Before we delve into the strategies for the two platforms, it’s important to gain an understanding of the benefits and drawbacks of some common compression algorithms used with these systems.

Compression Algorithms

Compression algorithms fall broadly into two categories: Lossy and Lossless.

Lossy compression algorithms compress data efficiently, but lose some of the data in order to do so. Lossy compression algorithms cannot be decompressed to derive the original uncompressed data and therefore are of no use for text compression. Examples of lossy compression algorithms include JPG, MPEG, MP3, etc. As the example file types indicate, they are normally used for media compression, where loss of data contributes to loss of media quality, which is usually an acceptable trade-off.

Lossless compression algorithms can be decompressed to derive the original data. They are therefore suitable for text compression. Examples of lossless compression algorithms include GZip, ZIP, RAR, LZO, BZip2 and 7-Zip.

Both Hadoop and Azure Data Lake Analytics can process lossless algorithms. However, lossless algorithms fall into two distinct categories: Splittable and Unsplittable. Splittable algorithms permit different parts of the same compressed file to be decompressed in parallel; Unsplittable algorithms do not. Below is a table indicating which algorithms are Unsplittable and which are Splittable:

Unsplittable                               Splittable
GZip (Default Compression (-6), Medium)    Bzip2 (Slow)
GZip (Minimal Compression (-1), Fast)      Indexed LZO (Fast)
Un-indexed LZO (Fast)
Snappy (Fast)

Understanding which compression algorithms are splittable by a Hadoop distribution is essential to performance: Before any compressed file can be processed, the file has to be decompressed. Since an unsplittable file can only be decompressed by one thread, any processing request will have to wait upon the completion of the decompression thread before real data processing can start, negating the opportunities for parallelism that the Hadoop environment provides for decompression of that file. This can add massively to total request completion time.

So if we’re going to process compressed files quickly, we either have to use a splittable compression format, or we have to split the original large file into multiple smaller uncompressed files and compress these individual files on parallel threads prior to storing them on HDFS.

It should be pointed out that another option is simply to store the files uncompressed on HDFS: this is the better option if the underlying storage subsystem is fast enough that the time saved by eliminating decompression from each processing request outweighs the extra IO incurred by reading the larger, uncompressed data. As one would expect, a performance test is required to determine which approach is superior on the hardware in question.
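As a rough illustration of such a test, here is a minimal sketch that times reading the same data both uncompressed and GZip-compressed on a local machine. The file names (data.csv and data.csv.gz) are placeholders, and a real test should of course be run against the actual storage subsystem and cluster concerned.

# Minimal sketch of a compressed-vs-uncompressed read test.
# Assumes data.csv and data.csv.gz contain the same data; the paths are placeholders.
import gzip
import time

def time_read(open_fn, path):
    """Read a file end-to-end and return the elapsed seconds."""
    start = time.perf_counter()
    with open_fn(path, "rb") as f:
        while f.read(64 * 1024 * 1024):  # read in 64 MB chunks
            pass
    return time.perf_counter() - start

uncompressed_secs = time_read(open, "data.csv")
compressed_secs = time_read(gzip.open, "data.csv.gz")  # read + decompress

print(f"Uncompressed read:      {uncompressed_secs:.2f}s")
print(f"Read + gzip decompress: {compressed_secs:.2f}s")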

Optimal Compression Strategies for HDInsight

For HDInsight, a splittable, fast compression algorithm really is the key to performance, and LZO really is the algorithm of choice. LZO doesn’t have a high compression ratio; but what it lacks in compression ratio, it makes up for in speed, particularly in view of the fact that it decompresses faster than it compresses. This makes it ideal for the write-once, read-many-times nature of Big Data.

It should be noted that not only does LZO support need to be enabled in the Hadoop cluster, but the LZO files also need to be indexed once they are uploaded to the cluster (creating a <filename>.lzo.index file for each <filename>.lzo file). This indexing requires a full read of each LZO file in order to create its index.
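For illustration, the indexing step is typically performed with the indexer that ships with the hadoop-lzo library; the sketch below simply shells out to it for every .lzo file in an HDFS directory. The jar location and the HDFS path are assumptions, and the exact class name may differ between hadoop-lzo builds.

# Rough sketch: index every .lzo file under an HDFS directory so it becomes splittable.
# The jar path and HDFS directory below are illustrative assumptions.
import subprocess

HADOOP_LZO_JAR = "/usr/lib/hadoop/lib/hadoop-lzo.jar"  # assumed install location
HDFS_DIR = "/data/raw"                                 # hypothetical HDFS directory

# List the files in the directory ("-C" prints bare paths only).
listing = subprocess.run(
    ["hdfs", "dfs", "-ls", "-C", HDFS_DIR],
    capture_output=True, text=True, check=True
).stdout.splitlines()

for path in (p for p in listing if p.endswith(".lzo")):
    # The distributed indexer writes a <filename>.lzo.index next to each file.
    subprocess.run(
        ["hadoop", "jar", HADOOP_LZO_JAR,
         "com.hadoop.compression.lzo.DistributedLzoIndexer", path],
        check=True,
    )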

If this strategy is too complex, then a less optimal option is to pre-process your large files into n smaller files prior to compression. Ideally, n should reflect the number of non-head cluster nodes in your cluster, to maximize parallelisation. Consider either LZO or GZIP with the -1 switch, both of which offer very fast compression/decompression.
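Below is a minimal sketch of that pre-processing step, assuming a line-oriented source file: it writes n round-robin slices of the file and GZip-compresses each slice at level 1 (the -1 switch) on a separate process. The file name and the value of n are placeholders.

# Minimal sketch: split a large line-oriented file into n parts and gzip each in parallel.
# compresslevel=1 corresponds to gzip's -1 (fastest) setting mentioned above.
# "bigfile.csv" and N_PARTS are placeholders.
import gzip
from concurrent.futures import ProcessPoolExecutor

N_PARTS = 8  # ideally the number of worker (non-head) nodes in the cluster

def write_part(part_index):
    """Write every N_PARTS-th line of the source to its own gzip file."""
    out_path = f"bigfile.part{part_index:03d}.csv.gz"
    with open("bigfile.csv", "rt") as src, \
         gzip.open(out_path, "wt", compresslevel=1) as dst:
        for line_no, line in enumerate(src):
            if line_no % N_PARTS == part_index:
                dst.write(line)
    return out_path

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=N_PARTS) as pool:
        for path in pool.map(write_part, range(N_PARTS)):
            print("wrote", path)

For simplicity each worker re-reads the whole source file and keeps only its own slice; a production version would read the source once and stream lines out to the n compressors.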

Optimal Compression Strategies for Azure Data Lake

As of the time of writing, I could find no documentation indicating that Azure Data Lake can create or process splittable, indexed LZO files. (If someone can find a reference, I’ll update this article!)

As a result, the optimal compression strategy I could devise is to split the file into n smaller files prior to compression and upload to Azure Data Lake Store. Note that this will invariably make the extraction syntax in U-SQL more complex, as we will have to specify a mask in the EXTRACT clause indicating which files should be referenced.

At first glance, this would appear to be a severe disadvantage of Azure Data Lake; but what Data Lake lacks in splittable LZO support, it makes up for in elastic scalability: on a query-by-query basis, we can choose to parallelise the decompression across many more nodes than we are likely to have procured in our HDInsight cluster.

Therefore, if we intend to process our source files repeatedly, we should break our large files into many more small files prior to compression on Azure Data Lake than we would on HDInsight.

Optimally, we should break each file into (Maximum Number of Azure Data Lake Configured Nodes) pieces, provided that the file is at least as big as (Block Size × Maximum Number of Azure Data Lake Configured Nodes).
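As a small worked example of that rule of thumb (the node count, block size and file size below are all made-up figures):

# Worked example of the sizing rule above; every figure is illustrative.
max_adl_nodes = 50              # maximum configured Azure Data Lake nodes
block_size = 256 * 1024 ** 2    # assumed block size: 256 MB
file_size = 100 * 1024 ** 3     # a 100 GB source file

if file_size >= block_size * max_adl_nodes:
    n_pieces = max_adl_nodes                    # large enough: one piece per node
else:
    n_pieces = max(1, file_size // block_size)  # otherwise roughly one block per piece

print(f"Split the file into {n_pieces} pieces before compressing")

With these figures the file comfortably exceeds Block Size × Node Count, so the target is one piece per node (50 pieces).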

Alternatively, we can choose to work with large uncompressed files, relying on the massive parallel IO throughput of Data Lake Store to process blocks in parallel across many more nodes.

However, Azure Data Lake Analytics is designed to offer high-speed analytics with a different workflow to HDInsight: far faster query performance is possible if data is transformed into Tables, which are strongly typed and can not only be clustered, but can also be partitioned (enabling partition elimination as a processing strategy) and sharded across processing nodes to increase parallelisation.

If working with Tables is your preferred workflow, then it will probably be worthwhile to compress your large files once, upload them to Data Lake Store and transform them once into Tables for all subsequent processing.

Conclusion

As this article has shown, both HDInsight and Azure Data Lake Analytics can process compressed files directly. HDInsight offers more compression options, including parallel decompression of single (indexed LZO) files. However, Data Lake Analytics is a very different beast to HDInsight, with different workflow possibilities favouring transformation into partitioned, sharded Tables.

Whichever you choose, a proof-of-concept should always be undertaken to ensure that your workflow will be able to handle the anticipated load.


About Ian Posner

Ian Posner is an independent consultant specialising in the design, implementation and troubleshooting of systems that demand the very highest performance and scalability.
