October 2021, Mohsin Raza, DPulse
Introduction
In this blog we will discuss how data can be integrated from different sources and how a data warehouse can facilitate that, combined with the advantages that the latest data warehouse techniques can offer.
Traditional transactional systems (ERP, CRM systems etc) are often difficult to integrate which can lead to data silos in organizations; departments that have their own systems which often can’t talk to each other. Interfaces are being made between systems, but it is cumbersome and often doesn’t lead to a full integration of data. On top of that unstructured data from sources such as social media, but also external sources such as the weather forecast, traffic information, stock exchanges etc., are very challenging to integrate and are often not used at all.
A data warehouse is a central repository of information that provides users with current and historical decision support information which is difficult to access or present in the traditional operational data stores. To develop a data warehouse, we need to integrate data from various sources, such as data from various departments of an organization and external data.
Data Integration requires planning and communication with the stakeholders and relies on the company’s data strategy. Data Integration may comprise of the following activities:
Identify the data sources to integrate,
extracting data from different sources,
designing the data architecture and
developing data pipelines for transformation, integration, and storage of data (the ETL/ELT operations),
automating and optimizing workflows and flow of data between systems anddeveloping applications needed duringthe data integration workflow.
Testing and assessments of readiness and risk of the Data Warehouse
Data often resides in a number of separate data sources for most companies. For example, for a typical customer 360 degree view, the data that must be combined may include data from their CRM systems, web traffic, a web-shop, customer loyalty systems, booking systems, and even partner data, just to name a few. Information from all of those different sources often needs to be pulled together for analytical needs or operational actions.
Employees in every department — and sometimes in different physical locations — increasingly need access to the company’s data for shared and individual projects. IT needs a secure solution for delivering data via self-service access across all lines of business. Data Integration can help in such collaboration.
Data integration efforts cut down significantly on the time it takes to prepare and analyze that data. The automation of unified views cuts out the need for manually gathering data, when they need to run a report.
Data integration can vary based on numerous business needs. Here are some common scenarios for data integration.
A data lake can house both structured and unstructured data and does not have a predetermined structure of data. The advantage is that the data can be saved in its original format without the need to impose a specific structure unlike a data warehouse. In practice, data lakes can be highly complex and massive in volume. Some major IT companies of the world process a non-stop influx of data from billions of users. Data lakes can be implemented in all major cloud services. Besides, some notable mentions are Cloudera and Databricks.
A data warehouse handles primarily structured data and has a predetermined structure for the data it houses. Such a structure is carefully designed to facilitate the reporting and business intelligence. Many companies rely on data warehousing technologies in the cloud such as AWS Redshift, Azure Synapse Analytics and Snowflake.
Extract, Transform, Load, commonly known as ETL, is a process within data integration wherein data is taken from the source system and delivered into the warehouse. This is the ongoing process that data warehousing undertakes to transform multiple data sources into useful, consistent information for business intelligence and analytical efforts. For data lakes, ELT is the process that is used for data integration. Transformation is done after the integration as per the needs of the user. In Figure 1, there is an illustration of end-to-end pipeline of the data integration.
There are data integration tools that can simplify the data integration process.
There are several ways to integrate data that depend on the size of the business, the need to be fulfilled, and the resources available.
Manual data integration is simply the process by which an individual user manually collects necessary data from various sources by accessing interfaces directly, then cleans it up as needed, and combines it into a warehouse. This is highly inefficient and inconsistent.
API-based integration. This is the traditional ‘automated’ way of integrating data. Storage is not a part of this approach, which means that reporting is not readily available and goes at the cost of flexibility. Building the interfaces and creating reports requires skills that are often not available for smaller companies.
‘Common storage’ integration is the most frequently used approach to storage within data integration. The common storage approach is the underlying principle behind the traditional data warehousing solution, whereby data is readily available for reporting. Such integration is automated by the use of software. A copy of data from the original source is kept in the integrated system and processed for a unified view.
We outline a rough estimate of the costs associated with building and maintaining a data-warehouse.
Cloud storage solution: Roughly €100 per terabyte per month
On-site storage solution: €1,000 per month
ETL software: Rougly €5000 - €10000 per year
+ Consultancy costs (One-time + support and maintenance)
For a medium sized company with the data storage requirements of 1-10TB in the data-warehouse, this means the cost could vary from €10000 to €25000 per year in addition to the consultancy or personnel costs.
References:
1. https://www.talend.com/resources/what-is-data-integration/
2. https://databricks.com/glossary/data-warehouse
3. https://redshelf.com/app/ecom/book/256986/data-warehouse-essentials-256986-9780983332480-mannino