Data Pipeline — ETL, ELT & EtLT

When it comes to data pipelines, a lot has been said about ETL, ELT and, more recently, EtLT. While these terms sound synonymous, there are real differences among them. Let us try to understand each of them.

If you are someone just starting with data engineering, this will help you understand the difference between these three. Experienced data engineers may find this very basic.

As we all know,

Extract is gathering data from various sources.

Load is bringing the gathered data into a warehouse, lake or store.

Transform is formatting, slicing and dicing the data to make it ready for analysis.

ETL has been the default choice for data pipelines for a long time. Data was usually transformed before the load phase because:

  1. Data warehouses did not have enough compute resources to perform the transformations needed for analytics.
  2. Data warehouses were built on row-oriented databases (MySQL, Postgres), which were not suited to high-volume analytical workloads.

So, data transformation happened in a separate system before the data reached the warehouse.
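
To make the ordering concrete, here is a minimal sketch of a classic ETL flow. The source data and table name are hypothetical, and sqlite3 merely stands in for a real warehouse:

```python
# Classic ETL: data is transformed in a separate system before the load.
import sqlite3

def extract() -> list[dict]:
    # Hypothetical source; in practice an API, files, or an operational DB.
    return [
        {"user_id": 1, "amount": "19.99"},
        {"user_id": 2, "amount": "5.00"},
    ]

def transform(rows: list[dict]) -> list[tuple]:
    # The heavy lifting happens outside the warehouse: casting, reshaping, etc.
    return [(row["user_id"], float(row["amount"])) for row in rows]

def load(rows: list[tuple]) -> None:
    conn = sqlite3.connect("warehouse.db")  # stand-in for a real warehouse
    conn.execute("CREATE TABLE IF NOT EXISTS orders (user_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    conn.commit()
    conn.close()

load(transform(extract()))  # E -> T -> L
```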

Today’s warehouses are columnar (e.g., Snowflake) and have enough resources to do the transformation themselves, so there is no need for a separate transformation system. Even better, data can be loaded into the warehouse right after extraction, skipping the transformation step entirely: just ‘extract from the source’ and ‘load into the data warehouse’. This makes data ingestion very simple, leaving the complex transformation work to the powerful modern warehouse.
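
Here is the same idea as an ELT sketch, with the transformation pushed down into the warehouse as SQL. Again, sqlite3 stands in for a real warehouse and the table names are made up:

```python
# ELT: load the raw data first, then transform inside the warehouse with SQL.
import sqlite3

def extract() -> list[tuple]:
    # Hypothetical raw rows, loaded untouched.
    return [(1, "19.99"), (2, "5.00")]

conn = sqlite3.connect(":memory:")  # stand-in for a modern columnar warehouse
conn.execute("CREATE TABLE raw_orders (user_id INTEGER, amount TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", extract())  # E -> L

# T: the warehouse itself performs the transformation (casting, aggregating, ...).
conn.execute("""
    CREATE TABLE orders AS
    SELECT user_id, CAST(amount AS REAL) AS amount FROM raw_orders
""")
print(conn.execute("SELECT * FROM orders").fetchall())  # [(1, 19.99), (2, 5.0)]
conn.close()
```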

The extract and load steps are collectively called data ingestion.

So, that clarifies ETL and ELT. What is EtLT?

In ELT, the data is used for analytics right after transformation. So the transformation itself needs proper context and a solid understanding of the business logic, which makes it a complex task when we deal with large volumes of data. But not all transformations are complex, and not all of them need that context and business logic. Deduplication of data is a good example.

When data from a source is loaded into a table, it is good to remove duplicate entries first. Removing duplicates has no dependency on the business logic. Such transformations are called non-contextual transformations. They are simple tasks, but they help a lot before the data is presented for analytics.
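
A minimal sketch of such a non-contextual transformation; the record shape here is hypothetical, and the point is that nothing beyond the records themselves is needed:

```python
def deduplicate(records: list[dict]) -> list[dict]:
    """Drop exact duplicate records; needs no business context at all."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

rows = [{"id": 1, "name": "a"}, {"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
print(deduplicate(rows))  # [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}]
```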

The earlier these transformations happen, the better. For example, if we remove duplicates from an extracted dataset before loading it into the warehouse, the warehouse has less data to process. So, EtLT takes the ELT process and adds a small transformation between the extract and the load steps. It is not a complex transformation like the one done at the warehouse; it is simple, more preparatory in nature, and done right after extracting from the source. That is why it is denoted by a small t. Hence, EtLT.
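
Putting the pieces together, an EtLT pipeline is just ELT with that small step wedged between extract and load. A minimal sketch, with hypothetical data and table names and sqlite3 again standing in for the warehouse:

```python
# EtLT: extract -> small t (dedup) -> load raw -> big T in the warehouse.
import sqlite3

def extract() -> list[tuple]:
    return [(1, "19.99"), (1, "19.99"), (2, "5.00")]  # note the duplicate row

def small_t(rows: list[tuple]) -> list[tuple]:
    # Non-contextual cleanup: dedup before the warehouse ever sees the data.
    return list(dict.fromkeys(rows))  # keeps order, drops exact duplicates

conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
conn.execute("CREATE TABLE raw_orders (user_id INTEGER, amount TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", small_t(extract()))

# Big T: the contextual, business-aware transformation runs in the warehouse.
conn.execute("""
    CREATE TABLE orders AS
    SELECT user_id, CAST(amount AS REAL) AS amount FROM raw_orders
""")
print(conn.execute("SELECT * FROM orders").fetchall())  # [(1, 19.99), (2, 5.0)]
conn.close()
```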

Hope this was helpful.
