Extract, transform, load (ETL) describes the process of copying data from one or more source repositories into a new, centralized system--such as a data warehouse, data mart, or operational data store (ODS).
You can’t just take data from one repository and throw it into a new one; you’ve got to follow a set of procedures. Done right, ETL gives you data in a ready-to-use state for analysts, developers, and department decision-makers. As you might guess, however, it’s a big job, and we’ll also discuss why it’s not always necessary for analysis, and why you shouldn’t do it any more than you absolutely need to.
Extraction--the ‘E’ in ETL
The extraction phase involves gathering the data from sources, which may use different data schemas and formats, and may also be semi-structured or unstructured. These sources may be owned and guarded by different departments within your organization.
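As a minimal sketch of what "different formats" means in practice, here's what extracting the same kind of record from a CSV source and a JSON source might look like. The department names, fields, and sample data are hypothetical; real extraction would read from files, APIs, or databases rather than in-memory strings.

```python
import csv
import io
import json

# Hypothetical sources: two departments expose customer data in
# different formats. Extraction pulls both into one common
# in-memory representation before any transformation happens.
crm_csv = io.StringIO("id,name\n1,Ada Lovelace\n2,Grace Hopper\n")
billing_json = io.StringIO('[{"id": 3, "name": "Alan Turing"}]')

def extract_csv(handle):
    """Yield each CSV row as a plain dict with typed fields."""
    for row in csv.DictReader(handle):
        yield {"id": int(row["id"]), "name": row["name"]}

def extract_json(handle):
    """Yield each record from a JSON array as a plain dict."""
    for record in json.load(handle):
        yield {"id": record["id"], "name": record["name"]}

rows = list(extract_csv(crm_csv)) + list(extract_json(billing_json))
print(rows)
```

The point is that whatever the source format, extraction ends with records in a single, uniform shape that the transformation phase can work on.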
Transformation--the ‘T’ in ETL
Once the data has been extracted, the Transformation phase involves applying rules to prepare it for its new home. This may involve:
Selecting the proper columns to load
Ensuring everything is properly and consistently coded (for example, if “Male” is “M” in one source and “1” in another, it’s got to be reformatted for consistency)
Calculating new values and aggregating data
Sorting the data for improved search functionality
Combining data from multiple tables
Generating surrogate keys to ensure uniqueness and uniformity of records
Splitting columns (for example, if the first and last name exist in a single column, relational database rules may require them to be separated into two columns)
Identifying duplicate data
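Several of the rules above can be sketched in a few lines: recoding values consistently, splitting a single name column in two, generating surrogate keys, and identifying duplicates. The sample rows and the gender coding scheme are hypothetical, chosen to match the "M" vs. "1" example above.

```python
import itertools

# Hypothetical rows from two sources that code gender differently
# ("M"/"F" in one, "1"/"2" in the other) and keep the full name
# in a single column.
raw_rows = [
    {"name": "Ada Lovelace", "gender": "F"},
    {"name": "Grace Hopper", "gender": "2"},
    {"name": "Grace Hopper", "gender": "2"},  # duplicate record
]

GENDER_MAP = {"M": "M", "1": "M", "F": "F", "2": "F"}
surrogate = itertools.count(1)  # source of surrogate keys

def transform(row):
    first, _, last = row["name"].partition(" ")
    return {
        "sk": next(surrogate),                # surrogate key for uniqueness
        "first_name": first,                  # single column split in two
        "last_name": last,
        "gender": GENDER_MAP[row["gender"]],  # consistent coding
    }

# Identify duplicates on the natural columns before transforming.
seen, clean = set(), []
for row in raw_rows:
    key = (row["name"], row["gender"])
    if key not in seen:
        seen.add(key)
        clean.append(transform(row))

print(clean)
```

In a real pipeline each of these rules would be driven by configuration or a mapping spec rather than hard-coded, but the shape of the work is the same.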
Load--the ‘L’ in ETL
During the ‘Load’ phase, as you might suspect, the data is loaded into its new repository. In some cases the data is entirely new, while in other cases it overwrites existing data. Loading may happen in hourly, daily, weekly, or monthly intervals, depending on resources and needs of the business (such as the need to analyze very new data).
The ‘Load’ phase follows strict guidelines. At this point, whatever rules or constraints have been defined by the schema--such as distinctness, non-null fields, referential integrity--are applied. Depending on the rules you’ve set up, some data might be rejected outright, some partially rejected, and some (hopefully most) fully accepted.
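To make the accept/reject behavior concrete, here is a minimal sketch using SQLite, where the target schema enforces non-null and distinctness constraints and each incoming row is accepted or rejected individually. The table and sample rows are invented for illustration.

```python
import sqlite3

# Target schema enforces NOT NULL and UNIQUE; rows that violate
# those constraints are rejected rather than loaded.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)"
)

incoming = [
    (1, "ada@example.com"),
    (2, None),                # violates NOT NULL -> rejected
    (3, "ada@example.com"),   # violates UNIQUE   -> rejected
    (4, "grace@example.com"),
]

accepted, rejected = [], []
for row in incoming:
    try:
        with conn:  # each row commits, or rolls back on error
            conn.execute("INSERT INTO customers VALUES (?, ?)", row)
        accepted.append(row)
    except sqlite3.IntegrityError as err:
        rejected.append((row, str(err)))

print(len(accepted), len(rejected))  # 2 accepted, 2 rejected
```

Production loaders usually route the rejected rows to a quarantine table or error file for later review rather than silently dropping them.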
Needless to say, ETL is fraught with challenges, and many problems can occur. For instance, the range of data values that the system needs to handle may wind up being wider than originally anticipated, and may require that the validation rules be updated. Additionally, scalability may be an issue, and systems that start out only being required to process gigabytes of data may need to eventually handle terabytes or even petabytes.
ETL can be slowed down for many reasons. For instance, if there are multiple input sources, and those sources must be reconciled, a source that is quickly extracted and transformed may be delayed during the load process by a slower source until they can be properly synchronized.
Obviously, the faster an ETL system is able to work, the better, so performance-boosting techniques, such as the following, may be beneficial:
Importing data using a direct path, vs. database queries, to reduce computation resources
Using bulk loading (i.e. loading multiple rows of data simultaneously)
Validating as much of the data before the loading process begins so that certain constraints (which slow the process) can be disabled
Disabling triggers in the repository while data is being loaded (once again, reducing computation during the load)
Recreating indexes after the load, rather than trying to conform to them during the load
Making appropriate tradeoffs, such as removing duplicates only when they are present in such volume as to slow the load process
Partitioning the pipeline according to the three ETL phases so that some data is being extracted, some transformed, and some loaded simultaneously
Allowing the concurrent execution of multiple actions associated with a single job, such as sorting and deduplicating files
Reducing dependencies as much as possible
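The pipeline-partitioning idea in particular can be sketched with threads and queues: extraction, transformation, and loading run concurrently, so one batch can be loading while the next is still being transformed. The stage bodies here are stand-ins; a real pipeline would read from sources and write to a warehouse.

```python
import queue
import threading

# Partitioned pipeline: each ETL phase runs in its own thread,
# connected by queues, so the phases overlap in time.
SENTINEL = object()  # marks end of the stream
extracted, transformed = queue.Queue(), queue.Queue()
loaded = []

def extract():
    for i in range(5):              # stand-in for reading source batches
        extracted.put({"value": i})
    extracted.put(SENTINEL)

def transform():
    while (row := extracted.get()) is not SENTINEL:
        row["value"] *= 10          # stand-in for real transformation rules
        transformed.put(row)
    transformed.put(SENTINEL)

def load():
    while (row := transformed.get()) is not SENTINEL:
        loaded.append(row)          # stand-in for writing to the warehouse

threads = [threading.Thread(target=f) for f in (extract, transform, load)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print([row["value"] for row in loaded])  # [0, 10, 20, 30, 40]
```

The queues also give you natural backpressure: bounding their size keeps a fast extractor from running arbitrarily far ahead of a slow loader, which is exactly the synchronization problem described above.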
Selecting an ETL Tool
Of course it really depends on the specifics of your organization’s data needs and data architecture, but some things to think about when selecting an ETL tool include:
Its ability to work with a wide range of data repositories, vendors and file formats
Advanced capabilities such as the ability to assess data quality and assign metadata
User experience, as ETL tools are increasingly being used by non-IT staff
Ability to maintain lineage, i.e. a history and audit trail of all changes to the data over the course of its lifetime
Don’t ETL any more than you need to
At the end of the day it should be acknowledged that ETL is a big process. Do it when you need to, but if it’s not absolutely necessary, don’t. As the number of data sources and volume of data grows, when is ETL no longer a viable option? As one of the major purposes of ETL is to have data ready for analytics, when it becomes so cumbersome that it actually slows the analytics process it may be time to consider another direction.
For instance, conventional wisdom says that to have all your data ready for analysis, you’ve got to ETL it into a massive warehouse. But this isn’t always the case. If you want to find out how a virtual layer can give you access to all your data without copying or moving it, read on.