Data Integration for Data Warehousing

October 2021, Mohsin Raza, DPulse

Introduction

In this blog we will discuss how data can be integrated from different sources and how a data warehouse can facilitate that, combined with the advantages that the latest data warehouse techniques can offer.

Traditional transactional systems (ERP, CRM systems etc) are often difficult to integrate which can lead to data silos in organizations; departments that have their own systems which often can’t talk to each other. Interfaces are being made between systems, but it is cumbersome and often doesn’t lead to a full integration of data. On top of that unstructured data from sources such as social media, but also external sources such as the weather forecast, traffic information, stock exchanges etc., are very challenging to integrate and are often not used at all.

A data warehouse is a central repository of information that provides users with current and historical decision support information which is difficult to access or present in the traditional operational data stores. To develop a data warehouse, we need to integrate data from various sources, such as data from various departments of an organization and external data.

Activities involving integration

Data Integration requires planning and communication with the stakeholders and relies on the company’s data strategy. Data Integration may comprise of the following activities:

Advantages of data integration

Data often resides in a number of separate data sources for most companies. For example, for a typical customer 360 degree view, the data that must be combined may include data from their CRM systems, web traffic, a web-shop, customer loyalty systems, booking systems, and even partner data, just to name a few. Information from all of those different sources often needs to be pulled together for analytical needs or operational actions.

Employees in every department — and sometimes in different physical locations — increasingly need access to the company’s data for shared and individual projects. IT needs a secure solution for delivering data via self-service access across all lines of business. Data Integration can help in such collaboration.

Data integration efforts cut down significantly on the time it takes to prepare and analyze that data. The automation of unified views cuts out the need for manually gathering data, when they need to run a report.

Data integration in modern business

Data integration can vary based on numerous business needs. Here are some common scenarios for data integration.

Data Lake

A data lake can house both structured and unstructured data and does not have a predetermined structure of data. The advantage is that the data can be saved in its original format without the need to impose a specific structure unlike a data warehouse. In practice, data lakes can be highly complex and massive in volume. Some major IT companies of the world process a non-stop influx of data from billions of users. Data lakes can be implemented in all major cloud services. Besides, some notable mentions are Cloudera and Databricks.

Data warehouse

A data warehouse handles primarily structured data and has a predetermined structure for the data it houses. Such a structure is carefully designed to facilitate the reporting and business intelligence. Many companies rely on data warehousing technologies in the cloud such as AWS Redshift, Azure Synapse Analytics and Snowflake.

ETL and ELT for data integration

Extract, Transform, Load, commonly known as ETL, is a process within data integration wherein data is taken from the source system and delivered into the warehouse. This is the ongoing process that data warehousing undertakes to transform multiple data sources into useful, consistent information for business intelligence and analytical efforts. For data lakes, ELT is the process that is used for data integration. Transformation is done after the integration as per the needs of the user. In Figure 1, there is an illustration of end-to-end pipeline of the data integration.

Figure 1: Illustration of Data Integration Pipeline

Which tools are available for ETL/ELT?

There are data integration tools that can simplify the data integration process.

Integration Strategies

There are several ways to integrate data that depend on the size of the business, the need to be fulfilled, and the resources available.

What are the costs for data integration?

We outline a rough estimate of the costs associated with building and maintaining a data-warehouse.

Cloud storage solution: Roughly €100 per terabyte per month

On-site storage solution: €1,000 per month

ETL software: Rougly €5000 - €10000 per year

+ Consultancy costs (One-time + support and maintenance)

For a medium sized company with the data storage requirements of 1-10TB in the data-warehouse, this means the cost could vary from €10000 to €25000 per year in addition to the consultancy or personnel costs.

References:

1. https://www.talend.com/resources/what-is-data-integration/

2. https://databricks.com/glossary/data-warehouse

3. https://redshelf.com/app/ecom/book/256986/data-warehouse-essentials-256986-9780983332480-mannino