In ETL Optimisation: Part 1 – Extraction, I looked at data sources and the importance of optimising the design of the data source as well as the process of obtaining the data source.
In this part, I’m looking at ways of optimising the transformation of data sources. Here are a few pointers for getting to grips with the transformation functionality of your ETL tool:
Optimise the Source Metadata in the ETL Tool
ETL tools require a definition for each of the data sources. You may get off to a quick start with some of them by pointing the tool at a sample file and getting the tool to interpret the datatypes within the file. Getting the metadata definitions to tightly reflect the content of the underlying data is essential in optimising your load speed, because the metadata definitions will determine how many rows of data may fit into each buffer – and the greater the number of rows per buffer of a given size, the faster the transformation process will execute.
However, there is a limit to the metadata that a tool can interpret from most file formats. Many formats (flat files especially) provide very little information about:
- The maximum and minimum numeric values that may occur in a column
- The maximum number of decimal places that may occur for numeric values in a column
- The maximum number of characters within a string column
- The maximum and minimum values that may occur in a date column
- The nullability of data in a column
Therefore it is incumbent upon the ETL developer to closely examine the data sources and define the source metadata as tightly as possible.
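To see why tight metadata matters, here is an illustrative sketch (not tied to any specific ETL tool; the column widths and 10 MB buffer size are hypothetical) of how the declared row width determines rows per buffer:

```python
# Estimate how many rows fit in a fixed-size pipeline buffer for a loose
# versus a tight metadata definition. All figures are illustrative.

def rows_per_buffer(column_widths_bytes, buffer_size_bytes=10 * 1024 * 1024):
    """Rows that fit in one buffer, given per-column widths in bytes."""
    row_width = sum(column_widths_bytes)
    return buffer_size_bytes // row_width

# Loose definition: every string defaulted to 255 characters, numerics to 8 bytes.
loose = [255, 255, 8, 8, 8]

# Tight definition after profiling the data: the strings never exceed 30 and
# 10 characters, and two of the numerics fit in 4 and 2 bytes.
tight = [30, 10, 4, 2, 8]

print(rows_per_buffer(loose))  # far fewer rows per buffer
print(rows_per_buffer(tight))  # roughly ten times the row density
```

The tighter the declared widths, the more rows each buffer carries, and the fewer buffers the transformation stage must process.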
Eliminate Unnecessary Columns From the Output of Each Data Source
This sounds obvious, but it’s worth stating because it can easily be overlooked:
As discussed above, when you point an ETL tool at a sample data file, it will interpret the metadata it can and create a data source defining the file. Additionally, your tool may create an output for each column identified. For many transformations, not all of the columns are required, so eliminating unnecessary columns will allow you to improve throughput by increasing the number of rows per buffer.
Some tools (such as Microsoft SQL Server Integration Services) provide warning messages during execution indicating that columns exist in the data pipeline that are not being used. So take heed, read the warnings and remove unnecessary columns.
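If you are hand-rolling part of an extraction step, the same principle applies: carry only the columns the transformation needs. A minimal sketch, with hypothetical column names:

```python
# Read a CSV source but keep only the columns the downstream transform uses,
# rather than carrying every column through the pipeline.
import csv
import io

# Stand-in for a source file; column names are hypothetical.
sample = io.StringIO(
    "customer_id,name,address,phone,order_total\n"
    "1,Alice,1 High St,555-0100,19.99\n"
    "2,Bob,2 Low Rd,555-0101,5.50\n"
)

needed = ["customer_id", "order_total"]  # the only columns this transform uses

reader = csv.DictReader(sample)
rows = [{k: row[k] for k in needed} for row in reader]
print(rows)
```

Dropping the three unused columns here shrinks each row from five fields to two, with a corresponding increase in rows per buffer.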
Baseline the Data Source
Once you’ve tightly defined your source data format and your source metadata definition in your ETL tool, you should perform a quick benchmark test. Sending the output of the source through to a simple row-count transformation, or writing the data to a destination defined as a null device, will let you determine how quickly your data source can be read.
This is a very useful exercise and well worth the little time expended, for it allows you to determine whether your source data format is fast enough to meet the performance requirements of the project. If you can’t read the data fast enough, you won’t be able to transform or load it any faster, so the results of this test may well send you back to the drawing board.
Perform Datatype Casting to Smaller Target Types Early in the Pipeline
If you tightly define the source format in the ETL tool, then you’re off to a good start in using lean datatypes. But you should check that the outputs of your data source are optimal for successive transformations – in particular, be aware that some tools (like Microsoft Integration Services) treat all string formats as Unicode by default. So if your source file is single-byte encoded or your destination is a single-byte character type column, consider converting the Unicode output of the data source to the single-byte type as early as possible in the pipeline. By so doing, you’ll double the density of data for those columns within each buffer – and cut the processing time of those columns by up to a half.
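The space saving is easy to demonstrate. A small sketch, assuming the data really is representable in a single-byte encoding (the encodings stand in for a two-byte Unicode column and a single-byte character column, e.g. DT_WSTR versus DT_STR in Integration Services):

```python
# Show the per-column space saving from converting a Unicode string to a
# single-byte encoding early in the pipeline.
text = "CUSTOMER NAME PADDED TO WIDTH   "  # illustrative column value

utf16_bytes = text.encode("utf-16-le")  # 2 bytes per character (Unicode column)
latin1_bytes = text.encode("latin-1")   # 1 byte per character (single-byte column)

# The single-byte form is half the size, so the column packs twice as
# densely into each buffer.
assert len(utf16_bytes) == 2 * len(latin1_bytes)
```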
Perform Datatype Casting to Larger Target Types Late in the Pipeline
It may be the case that the target column type is larger than the type defined for the source. This can occur when you are taking data from multiple sources, some of which contain data requiring a larger datatype than others.
The smaller the datatypes used within a pipeline, the higher the density of rows per buffer and the faster it will process, so keep datatypes small wherever possible and expand only where necessary and at the end of the pipeline.
Consider Parallel Processing
This really is too expansive a subject to deal with in this article, but the transformation stage of an ETL process may provide significant opportunities for parallel processing.
Parallel processing is complex and there are potential pitfalls for the unwary. I therefore recommend optimising your ETL process before you attempt to introduce parallel processing. After all, you may find that performance meets your objectives without the complexity and cost of parallelisation.
In ETL Optimisation: Part 3 – Load, I’ll be looking in-depth at the loading aspect of ETL.