I often get brought in to review and enhance existing systems exhibiting performance, scalability or resilience problems, and many of these engagements centre on poorly performing or unstable ETL (Extraction, Transformation and Load) subsystems. In this series of three articles, I’m going to focus on optimisation of these ETL processes, whilst remaining mindful of the trade-offs sometimes necessary to ensure robustness.
In this first part, I’m going to be looking at techniques to optimise the first stage of the ETL process, Extraction, together with the issues associated with data sources in general (location, format, size, number of files/feeds):
It’s important to understand how ETL tools work: Most ETL tools that use some form of data pipelining (such as Microsoft SQL Server Integration Services and Informatica) move buffers of multiple data rows from a file, queue or database source, through multiple transformations, to destination files or databases.
Therefore optimising the speed of processing necessitates pursuing the following goals:
- Reducing the initial amount of data that enters the pipeline to a minimum
- Increasing the density of rows per buffer without increasing the buffer size
- Optimising the input format for easy processing
- Optimising the data types for easy processing
So here are my golden rules to get your ETL development off to a cracking start:
Think “Incremental” From the Start
It sounds pretty obvious, but I’ll state it for the record – there’s no point receiving data that’s not needed. You’ll only waste time transporting and processing it just to discard it. Therefore treat each load as if it’s an incremental load – with the first load being a rather large increment from a zero start.
If it’s possible for the ETL process to request data from the source according to criteria (e.g. if the data source is a database), interrogate the target to determine what criteria should be sent from the source system.
However, herein lies a trap for the unwary – never look for the maximum date loaded so far, or the maximum LoadId or whatever your unit of load is, for in so doing, you will unwittingly prevent out-of-order loading from working! Instead, use the algorithm “load everything from the source that’s not in the target”. Doing so will ensure that your loading process will pick up loads that have been skipped (due to system outage/Act of God/whatever) and still load them smoothly.
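The “load everything that’s not in the target” algorithm can be sketched in a few lines – a minimal illustration, with the function name and sample LoadIds invented for the example:

```python
# Hypothetical sketch: determine pending loads by set difference,
# rather than by "max LoadId seen so far".
def pending_loads(source_load_ids, target_load_ids):
    """Return every load present in the source but missing from the target.

    Unlike a max-based check, this also picks up loads that were
    skipped out of order (e.g. after a system outage).
    """
    return sorted(set(source_load_ids) - set(target_load_ids))

# Load 7 was skipped earlier; a "max LoadId" check (8) would never revisit it.
print(pending_loads([5, 6, 7, 8, 9], [5, 6, 8]))  # → [7, 9]
```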
Location of the Source Relative to the Destination
Sometimes our source is a database, other times it’s one or more files, or a message queue. Whatever form it takes, the location of the source is one of several factors that place an upper bound on the speed of data insertion. So consider the following:
This can make a massive difference to the performance of your load: Most methods of network data transfer use the TCP/IP protocol as the underlying transport mechanism. Because TCP requires each window of data to be acknowledged by the destination before further data can be sent, throughput is limited not only by the available bandwidth but also by the round-trip latency of the link.
Latency may not be an issue when shuffling data between servers in the same data centre, but when the source is in Australasia and the destination is in Europe, high latency will result in a significant deterioration in data transfer time regardless of the bandwidth available.
There are a number of techniques for reducing the effect of high latency:
Breaking the source data into multiple files/requests and copying/accessing the source in parallel
Using data transfer mechanisms built on the UDP protocol rather than TCP (e.g. UDT, or Aspera’s FASP)
Utilising specialist hardware WAN accelerator appliances that implement TCP striping
Configuring Jumbo frames on the network cards and the intermediate routers to permit more data to be transferred per packet
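The first of those techniques – splitting the source into multiple files and copying them in parallel, so that several transfers are in flight at once and round-trip latency overlaps rather than accumulates – can be sketched as follows. This is a hedged example using only Python’s standard library, with local temporary files standing in for the remote source; all file names are illustrative:

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def parallel_copy(files, dest_dir, workers=4):
    """Copy each file to dest_dir using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda f: shutil.copy(f, dest_dir), files))

# Demo: four extract fragments copied concurrently.
src = Path(tempfile.mkdtemp())
dst = Path(tempfile.mkdtemp())
parts = []
for i in range(4):
    p = src / f"extract_part{i}.csv"
    p.write_text("id,value\n1,a\n")
    parts.append(p)

copied = parallel_copy(parts, dst)
print(len(copied))  # → 4
```

The same pattern applies whether the per-file operation is a local copy, an FTP get or an HTTP request – the point is that the waiting happens concurrently.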
Network Location Within a LAN
When moving data within a data centre, latency is less of an issue and bandwidth far more significant. Even so, moving large volumes of data across the network can add substantially to the overall time it takes to obtain the data and process it. Therefore consider the following techniques for optimising data transfer across the LAN:
If the data is in files, compare the time it takes to compress the file, transfer it and decompress it using different compression algorithms versus the time it takes to transmit it uncompressed. The lower your bandwidth, the greater the opportunity for compression/decompression to be an effective means of reducing file transmission time. The faster your bandwidth, the more important it becomes to use a fast compression/decompression algorithm rather than a slower one that achieves greater compression. Consider fast compression algorithms, such as gzip with the “-1” switch, or the fastest compression algorithm of all, lzop.
Even if it takes longer to compress-transmit-decompress a single large file, this technique may still yield benefits if the data comprising the load is broken into multiple files and the compression/decompression process parallelised.
When your data load consists of a single file, you have to wait for the whole file to be compressed/transmitted/decompressed before you can start the load. However, when the load is split across multiple files they can be compressed/transmitted/decompressed simultaneously, and it’s possible to start loading the first file while the remaining files are still in transit, shortening the overall load time.
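A rough way to measure the compression trade-off on your own data is simply to time different compression levels directly. This sketch uses Python’s built-in gzip module, with synthetic repetitive data standing in for a real flat file:

```python
import gzip
import time

data = b"some,repetitive,row\n" * 200_000  # stand-in for a flat file

for level in (1, 9):
    t0 = time.perf_counter()
    compressed = gzip.compress(data, compresslevel=level)
    elapsed = time.perf_counter() - t0
    print(f"level {level}: {len(compressed):,} bytes in {elapsed:.3f}s")
```

On highly compressible data the size gap between the fastest and slowest levels is often small, so once transmission time is included the faster setting frequently wins.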
Some operating systems throttle the maximum bandwidth a single process can use with particular networking protocols (e.g. Windows Networking, aka SMB or CIFS) – it may be necessary to split a single large file so that the full bandwidth can be exploited by multiple simultaneous file copies (an alternative would be to use a different protocol, such as FTP).
If the data comes directly from a database or queue, investigate the data access API to determine if the transmission protocol has an option for transparent data compression.
Network Data versus Local Data
Source data co-located on the target server can be loaded far faster than source data located on the network because of both the bandwidth and latency inherent in network communication. Therefore in many cases, it may well be faster to copy the source data to a local disk, then start the transformation and loading process.
Number of Sources to be Loaded
As has been mentioned above, breaking file data into multiple files offers numerous advantages when moving data around. There’s a further advantage when it comes to the actual ETL process itself: Multiple files provide the opportunity to parallelise the load process. This is not quite as easy as it sounds for the following reasons:
- Multiple parallel loads must not attempt to seize the same file – some form of algorithm has to be implemented to prevent the same file from being picked up by multiple loading processes.
- Parallelisation may not be possible for all stages of the ETL process. For example, if two files contain duplicate dimension data (but different fact data) that is not yet in the database, then the dimensional data elements will have to be processed serially to prevent a second file attempting to insert data that has been inserted from the first file.
- Tabular, flat files are easy to split into multiple files, but hierarchical formats such as XML require more complex processing and will process more slowly.
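One simple claiming algorithm for the first point above is an atomic rename: on POSIX filesystems a rename either succeeds or fails as a unit, so exactly one loader can win each file. A minimal sketch, with the file and worker names invented for the example:

```python
import tempfile
from pathlib import Path

def claim_file(path, worker_id):
    """Try to take ownership of `path`; return the claimed path, or None."""
    claimed = path.with_name(f"{path.name}.claimed.{worker_id}")
    try:
        path.rename(claimed)   # atomic: exactly one renamer succeeds
        return claimed
    except FileNotFoundError:  # another loader already claimed it
        return None

# Demo: two workers race for the same load file.
src = Path(tempfile.mkdtemp()) / "load_001.csv"
src.write_text("id\n1\n")
first = claim_file(src, "w1")
second = claim_file(src, "w2")
print(first is not None, second)  # → True None
```

On a shared database-driven queue the same effect is usually achieved with an UPDATE … WHERE status = 'unclaimed' style of check, but the principle – one atomic operation decides ownership – is identical.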
However there’s a downside to multiple files: When a data load consists of a single file, the file can be processed as an atomic unit that constitutes the entire load; but when the load comprises multiple files, it’s quite possible for one or more of the files to fail at any stage of the ETL process. To handle this scenario, the ETL process needs to be designed so that, after a failure at any stage, processing can resume from that point and the load still completes with no data duplicated in the destination and no data omitted from the source load.
Format of the Data to be Loaded
The format of the source data will have an impact both on the size of data and the ease with which it may be processed. Generally speaking:
- Binary data processes faster than character data
- Single-byte encoded text (e.g. ISO-8859-1, ASCII) is half the size of double-byte encoded text (e.g. UTF-16) and can therefore be processed faster
- Tabular formats (e.g. delimited or fixed width flat files) are less verbose and therefore tend to be faster to process than non-tabular formats (e.g. XML, JSON).
Now reading the above, you might conclude that I would always advocate tabular data structures. However I take the view that achieving high performance necessitates high reliability as a prerequisite. After all, what good is it if your one-hour load fails ten percent of the time due to poor-quality data and you miss your data delivery SLAs by hours?
Therefore if the quality of your data sources is dubious and you have the ability to redefine the source file format to be provided from source systems, consider a strongly-typed, self-validating format such as XML with tightly defined XSD schemas. Such schemas are especially useful where the source systems are beyond your control, for they can form the “contract” or interface between your system and the source systems, giving the maintainers of the source file a quick and easy way to validate their output.
XML is slower to process than flat files, so using XML files as a direct source for your transformations will limit the maximum speed of your load from the beginning. To avoid this, first convert the XML source into a number of flat files, preferably in the binary format supported by your ETL tool. Remember that this stage can be parallelised and run as a separate process, reducing the overhead XML imposes on the transformation stage. Furthermore, pre-processing the XML into a format more “digestible” by the ETL tool is an opportunity to really optimise the file format (perhaps by changing the encoding or even using the ETL tool’s binary storage format) and, at the same time, to normalise the data within the files to reduce the number of rows that have to be processed.
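As an illustration of that pre-processing step, here’s a minimal sketch that shreds a hypothetical XML extract into a delimited flat file using Python’s standard library – the element and attribute names are invented for the example:

```python
import csv
import io
import xml.etree.ElementTree as ET

XML = """<orders>
  <order id="1"><customer>Acme</customer><amount>10.50</amount></order>
  <order id="2"><customer>Beta</customer><amount>3.99</amount></order>
</orders>"""

def xml_to_csv(xml_text):
    """Flatten each <order> element into one delimited row."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["id", "customer", "amount"])
    for order in ET.fromstring(xml_text).iter("order"):
        writer.writerow([order.get("id"),
                         order.findtext("customer"),
                         order.findtext("amount")])
    return out.getvalue()

print(xml_to_csv(XML))
```

In a real pipeline each worker would stream one XML file to one flat file, so the shredding parallelises naturally across the files of a load.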
Eliminating Unnecessary Columns/Attributes from the Data Source
We’ve discussed eliminating unnecessary rows from the source files using an incremental loading algorithm. If you are able to define the source file format, you can further optimise your ETL process by eliminating unnecessary columns/attributes.
So with a database extract you can reduce the number of columns in the underlying SELECT statement, with a flat file you can remove the unneeded columns, and with an XML file you can remove superfluous attributes/elements.
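For the flat-file case, column pruning can be as simple as re-writing the file with only the columns the load actually uses. A small sketch, with illustrative column names:

```python
import csv
import io

def project_columns(csv_text, wanted):
    """Re-emit a delimited file keeping only the `wanted` columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=wanted)
    writer.writeheader()
    for row in reader:
        writer.writerow({col: row[col] for col in wanted})
    return out.getvalue()

raw = "id,name,legacy_code,audit_blob\n1,widget,X9,ignored\n"
print(project_columns(raw, ["id", "name"]))
```

Done as a pre-processing step, this also shrinks the file before compression and transfer, compounding the savings discussed earlier.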
In ETL Optimisation: Part 2 – Transformation, I’ll be looking in-depth at the Transformation aspect of ETL.