In ETL Optimisation: Part 2 – Transformation, I looked at the Transformation process and simple ways of optimising it.
In this part, I’m going to look at optimising the Load process:
Consider the Location of the Source Relative to the Destination
To avoid repetition: the same arguments hold true for the destination as they do for the source. See Part 1 of this series, where this is discussed in detail.
Choice of Output Format
File destinations are normally locations where the data will be picked up for subsequent processing, either by another system or by a later stage of the system being developed.
As a general rule, the speed at which a file format can be written depends upon the complexity of the processing required to format the data. It is therefore not surprising that:
- Outputting the data to a file in the native format of the ETL tool is faster than any other format.
- Tabular formats output faster than hierarchical formats.
- Re-encoding the output from the native encoding incurs an overhead.
- Double-byte character encodings (e.g. UTF-16) are slower to write and roughly twice the size of single-byte character encodings (e.g. ASCII, ISO-8859-1, etc).
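The size cost of a double-byte encoding is easy to demonstrate. The sketch below encodes the same ASCII-only batch of records both ways; the sample data is invented for illustration:

```python
# Compare the size of a batch of records in a single-byte encoding
# versus a double-byte one. The record content is ASCII-only, so every
# character costs exactly one byte in ASCII and two in UTF-16.
text = "customer_id,amount\n" + "10042,19.99\n" * 100_000

single = text.encode("ascii")      # 1 byte per character
double = text.encode("utf-16-le")  # 2 bytes per character, no BOM

print(len(double) // len(single))  # → 2
```

The doubled output size also doubles the disk I/O and any subsequent network transfer, which is why the encoding choice matters beyond raw formatting speed.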
You may be constrained in your choice of output formats by either business requirements or technical constraints.
Choice of Driver
You may have a choice of database driver available to your ETL tool. For optimal speed, choose a database-specific driver (e.g. SQL Server Native Client) over a generic driver (e.g. a driver for ODBC data sources). Specific drivers will offer you advanced options for configuration and may expose specific high-speed bulk-loading interfaces inherent in the underlying database.
Where a driver exposes a bulk-loading interface, use it! Note that many of these bulk-loading interfaces only reach their full potential when the ETL process is executing on the machine hosting the database, because bulk-loading throughput far outstrips what many networks can deliver.
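The difference between row-at-a-time inserts and a batched interface can be sketched with Python's built-in SQLite driver, whose `executemany` stands in here for a database-specific bulk API (real drivers, such as ones for SQL Server, expose similar or faster batched paths):

```python
# Load a batch of rows through the driver's batched interface rather
# than issuing one INSERT call per row. SQLite's executemany is a
# stand-in for a vendor-specific bulk-loading API.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")

rows = [(i, i * 1.5) for i in range(10_000)]

# One call through the API for the whole batch, not one per row.
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0])  # → 10000
```

The batched call amortises per-statement overhead (parsing, round trips, transaction bookkeeping) across the whole batch, which is the same principle a true bulk-loading interface exploits at larger scale.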
Depending upon your target database, there may be additional options (such as the ability to lock the entire destination table for the duration of the load, or to load data with minimal journalled logging). Take the time to understand what options are available to your bulk-loading mechanism and whether each is suited to your scenario and goal.
If the target table is sorted by means of a clustered index, then it may be possible to insert data far faster by pre-sorting the data to be loaded and providing a hint to the bulk loading mechanism to indicate that the data is pre-sorted.
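A minimal sketch of the pre-sorting approach, with SQL Server's `BULK INSERT` hints shown as the loader. The table, file path and column names are invented for illustration:

```python
# Pre-sort the extract on the clustered-index key before handing it to
# the bulk loader, then tell the loader the data is already ordered.
import csv
import io

rows = [(3, "carol"), (1, "alice"), (2, "bob")]
rows.sort(key=lambda r: r[0])  # match the clustered index key (OrderId)

buf = io.StringIO()
csv.writer(buf).writerows(rows)  # write the sorted extract

# With the ORDER hint, the engine can skip its own sort during the load;
# TABLOCK takes a table-level lock for the duration of the load.
bulk_insert = """
BULK INSERT dbo.Orders
FROM '/data/orders.csv'
WITH (FORMAT = 'CSV', ORDER (OrderId ASC), TABLOCK);
"""
```

Without the hint, the engine must sort the incoming rows itself to maintain the clustered index; pre-sorting moves that cost into the ETL tool, where it can often be done more cheaply or overlapped with extraction.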
It is often the case that a source and destination are on different machines. In this case, it may be worthwhile formatting the data to be loaded into the ETL tool’s specific binary file format on the source machine, transferring the binary file to the destination machine and loading the destination using an ETL process executing locally on the destination machine.
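The stage-and-ship pattern can be sketched as below. Python's `pickle` stands in for the ETL tool's native binary format, and a local file copy stands in for the network transfer; all paths are invented:

```python
# Stage-and-ship: serialise the batch into a compact binary file on the
# source machine, transfer it, and load it with a process running
# locally on the destination, keeping the bulk load off the network.
import os
import pickle
import shutil
import tempfile

rows = [(i, f"name_{i}") for i in range(1_000)]

src_dir = tempfile.mkdtemp()  # stand-in for the source machine
dst_dir = tempfile.mkdtemp()  # stand-in for the destination machine

# 1. On the source machine: write the native-format binary file.
src_file = os.path.join(src_dir, "batch.bin")
with open(src_file, "wb") as f:
    pickle.dump(rows, f)

# 2. Transfer the file (stand-in for scp, robocopy, etc.).
dst_file = shutil.copy(src_file, dst_dir)

# 3. On the destination machine: deserialise and load locally.
with open(dst_file, "rb") as f:
    loaded = pickle.load(f)

print(len(loaded))  # → 1000
```

The transfer step uses bulk file-copy tooling, which moves bytes far more efficiently than streaming rows over a database connection, and the final load then runs at local bulk-load speed.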
Yet another technique is the use of parallel bulk-loading. For this to work effectively, there must be minimal contention on any shared resource consumed by the bulk-loading processes executing in parallel. This is sometimes quite complex to achieve where parallel loads are loading into the same table, as there may be contention on parts of the target table’s underlying resources — frequently the target table’s indexes, or the table itself. In this case, you may want to minimise the indexing, or disable or remove it for the duration of the load, re-enabling or redefining it after the load completes.
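The partition-and-load-in-parallel pattern can be sketched as follows. Here `bulk_load` is a placeholder for a driver-specific bulk-loading call, and the sketch assumes the target can absorb concurrent loads without contention:

```python
# Partition the batch and run one bulk loader per partition in parallel.
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n):
    """Split rows into n interleaved chunks, one per parallel loader."""
    return [rows[i::n] for i in range(n)]


def bulk_load(chunk):
    # Placeholder: a real implementation would invoke the driver's
    # bulk-loading interface for this chunk.
    return len(chunk)


rows = list(range(100_000))
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(bulk_load, partition(rows, 4)))

print(total)  # → 100000
```

In practice the degree of parallelism worth using is bounded by the first shared resource to saturate — disk, log, lock, or index — which is why reducing indexing for the duration of the load often determines whether parallel loading pays off at all.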