An Overview of Azure Data Factory

Microsoft's Azure Data Factory (ADF) is a cloud-based data integration service that allows users to create, schedule and orchestrate data workflows at scale.
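
To make this concrete, the sketch below shows how a data factory can be provisioned programmatically with the Azure SDK for Python (azure-mgmt-datafactory); the subscription ID, resource group, factory name, and region are placeholder assumptions, and the same result can be achieved through the Azure portal or ARM/Bicep templates.

```python
# pip install azure-identity azure-mgmt-datafactory
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

# Placeholder values -- substitute your own subscription, resource group, and region.
subscription_id = "<subscription-id>"
rg_name = "my-resource-group"
df_name = "my-data-factory"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Create (or update) the Data Factory instance itself.
factory = adf_client.factories.create_or_update(rg_name, df_name, Factory(location="eastus"))
print(factory.name, factory.provisioning_state)
```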

Azure Data Factory consists of the following components:


1. Integration Runtime (IR)

  •   Azure Integration Runtime: Used for data movement, transformation, and activity dispatch within the Azure ecosystem.

  •   Self-hosted Integration Runtime: Used to connect data sources on-premises or in a private network.

  •   Azure-SSIS Integration Runtime: Specifically used for executing SQL Server Integration Services (SSIS) packages.
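
As a rough illustration, here is a minimal sketch (Azure SDK for Python) of registering a self-hosted integration runtime; the runtime and resource names are assumptions, and the returned authentication key is what the on-premises IR installer would use to link a local machine to this registration.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import IntegrationRuntimeResource, SelfHostedIntegrationRuntime

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "my-resource-group", "my-data-factory"

# Register a self-hosted IR; the actual runtime is installed on an
# on-premises machine and linked to this resource with an auth key.
adf_client.integration_runtimes.create_or_update(
    rg_name, df_name, "OnPremIR",
    IntegrationRuntimeResource(properties=SelfHostedIntegrationRuntime(
        description="Reaches databases inside the corporate network")))

# Retrieve the keys used to register the locally installed runtime node.
keys = adf_client.integration_runtimes.list_auth_keys(rg_name, df_name, "OnPremIR")
print(keys.auth_key1)
```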


2. Linked Services

Azure Data Factory (ADF) uses Linked Services to define connections to various data stores and compute services. These connections allow ADF to perform data movement and transformation tasks. Linked Services can broadly be categorized into two types: compute and storage. Let's explore each type in detail.


Storage Linked Services

Storage Linked Services allow data to be read from and written to data storage systems. Storage systems can be on-premises or within Azure or other cloud services. Common storage linked services include:


  •   Azure Blob Storage: Used to connect to Azure Blob Storage, an object storage solution for the cloud. This service is often used for storing unstructured data such as text or binary data.

  •   Azure Data Lake Storage (ADLS): Used to connect to Azure Data Lake Storage, which is optimized for big data analytics. It provides a scalable and secure data lake for high-performance analytics workloads.

  •   Azure SQL Database: Used to connect to Azure SQL Database, a managed relational database service in Azure. It is often used for structured data storage and transactional data.

  •   Azure Cosmos DB: Used to connect to Azure Cosmos DB, a globally distributed, multi-model database service designed for high availability and low latency.

  •   Azure Table Storage: Used to connect to Azure Table Storage, which provides a key/attribute store with a schema-less design.

  •   Amazon S3: Used to connect to Amazon Simple Storage Service (S3), a scalable object storage service provided by AWS.

  •   On-Premises SQL Server: Used to connect to an on-premises SQL Server database, which requires a self-hosted Integration Runtime.
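
To show what a storage linked service looks like in practice, here is a minimal sketch (Azure SDK for Python) that registers an Azure Blob Storage connection; the connection string and names are placeholders, and in production the secret would typically come from Azure Key Vault rather than being embedded in code.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import LinkedServiceResource, AzureBlobStorageLinkedService

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "my-resource-group", "my-data-factory"

# Register a connection to a storage account; datasets and activities refer to it by name.
blob_ls = LinkedServiceResource(properties=AzureBlobStorageLinkedService(
    connection_string="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"))
adf_client.linked_services.create_or_update(rg_name, df_name, "BlobStorageLS", blob_ls)
```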

Compute Linked Services

Compute Linked Services define connections to compute resources used to perform data transformation and processing tasks. These services provide the necessary compute power to execute various operations on data. Common compute linked services include:


  •   Azure HDInsight: Used to connect to an Azure HDInsight cluster, which is a fully-managed Apache Hadoop and Spark service. It is used for big data processing and analytics.

  •   Azure Databricks: Used to connect to an Azure Databricks workspace, which provides a fast, easy, and collaborative Apache Spark-based analytics platform optimized for Azure.

  •   Azure Machine Learning: Used to connect to Azure Machine Learning services, allowing you to execute machine learning models and pipelines as part of your data workflows.

  •   Azure Batch: Used to connect to Azure Batch, a service for running large-scale parallel and batch compute jobs in Azure. It is useful for compute-intensive tasks.

  •   Azure SQL Managed Instance: Used to connect to Azure SQL Managed Instance, which provides a fully managed SQL Server instance in the cloud.
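
A compute linked service is defined in the same way; only the properties differ. The sketch below registers an Azure Databricks workspace connection; the workspace URL, access token, and cluster ID are placeholder assumptions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureDatabricksLinkedService, SecureString)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "my-resource-group", "my-data-factory"

# Point ADF at an existing Databricks workspace and interactive cluster (placeholder values).
dbx_ls = LinkedServiceResource(properties=AzureDatabricksLinkedService(
    domain="https://adb-1234567890123456.7.azuredatabricks.net",
    access_token=SecureString(value="<databricks-personal-access-token>"),
    existing_cluster_id="0123-456789-abcdefgh"))
adf_client.linked_services.create_or_update(rg_name, df_name, "DatabricksLS", dbx_ls)
```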

3. Datasets

In Azure Data Factory (ADF), a Dataset is a named view of data that points to or references the data you want to use in your activities. Datasets represent the structure of the data within the data stores. They provide essential metadata about the data, such as its schema, format, and location. Here’s a more detailed look at Datasets:


Key Components of a Dataset

  •   Linked Service: Each Dataset is associated with a Linked Service, which provides connection information to the data store where the actual data resides. The Linked Service defines how ADF connects to the data store, while the Dataset specifies what data within the store is of interest.

  •   Schema: Datasets define the schema or structure of the data. This includes the data types and structure of the data entities (e.g., tables, files). For example, if the Dataset points to a SQL table, the schema would include the columns and their data types.

  •   Location: Datasets specify the location of the data in the data store. This could be a table name in a SQL database, a file path in Azure Blob Storage, a directory in a Data Lake, etc.

  •   Format: Datasets also define the format of the data. This includes the file format (e.g., CSV, JSON, Parquet, Avro) and any specific properties related to the format (e.g., delimiter in a CSV file).
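
Here is a minimal sketch (Azure SDK for Python) that ties these components together: a delimited-text (CSV) dataset whose Linked Service, Location, and Format are all spelled out. The linked service name, container, and file path are assumptions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatasetResource, DelimitedTextDataset, LinkedServiceReference, AzureBlobStorageLocation)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "my-resource-group", "my-data-factory"

# Linked Service: where the data lives; Location and Format: exactly what to read and how.
csv_ds = DatasetResource(properties=DelimitedTextDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="BlobStorageLS"),
    location=AzureBlobStorageLocation(
        container="raw", folder_path="sales/2024", file_name="orders.csv"),
    column_delimiter=",",
    first_row_as_header=True))
adf_client.datasets.create_or_update(rg_name, df_name, "OrdersCsvDataset", csv_ds)
```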


Types of Datasets

Datasets can be broadly categorized based on the type of data store they refer to:


Structured Datasets:

  •   Examples: SQL tables, Cosmos DB collections, Azure Table Storage

  •   These datasets have a predefined schema with columns and data types.

Semi-Structured Datasets:

  •   Examples: JSON files, XML files

  •   These datasets have a flexible schema and are often used with data formats that contain nested or hierarchical data.

Unstructured Datasets:

  •   Examples: Blob files, Data Lake files

  •   These datasets do not have a predefined schema and are often used with file-based data stores.
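
For comparison, a semi-structured dataset is declared much like the CSV dataset above; only the format class changes. The short sketch below assumes the same blob storage linked service and an illustrative JSON file path.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatasetResource, JsonDataset, LinkedServiceReference, AzureBlobStorageLocation)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "my-resource-group", "my-data-factory"

# A JSON dataset: flexible, possibly nested schema rather than fixed columns.
json_ds = DatasetResource(properties=JsonDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="BlobStorageLS"),
    location=AzureBlobStorageLocation(
        container="raw", folder_path="events", file_name="clicks.json")))
adf_client.datasets.create_or_update(rg_name, df_name, "ClickEventsJsonDataset", json_ds)
```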


4. Pipelines

A Pipeline in Azure Data Factory (ADF) is a logical grouping of activities that together perform a task. A Pipeline represents a workflow or a sequence of steps needed to accomplish a specific data integration or data transformation task. Here’s a more detailed look at Pipelines:

Key Components of a Pipeline

1. Activities:

  •   Activities are the building blocks of a Pipeline. Each activity performs a specific operation on data, for example copying data from one location to another, transforming data using a data flow, or executing a stored procedure in a database (a minimal Copy Activity sketch follows this list). There are several types of activities:

  •   Data Movement Activities: For moving data between data stores (e.g., Copy Activity).

  •   Data Transformation Activities: For transforming data (e.g., Mapping Data Flow, Data Flow Activity).

  •   Control Activities: For controlling the workflow logic (e.g., If Condition Activity, ForEach Activity, Until Activity).
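
As a minimal sketch of the most common case, the pipeline below contains a single Copy Activity that moves data between two blob datasets and is then run on demand; the dataset and pipeline names are assumptions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference, BlobSource, BlobSink)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "my-resource-group", "my-data-factory"

# One data-movement activity: copy from an input blob dataset to an output one.
copy_activity = CopyActivity(
    name="CopyOrders",
    inputs=[DatasetReference(type="DatasetReference", reference_name="InputBlobDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="OutputBlobDataset")],
    source=BlobSource(),
    sink=BlobSink())

pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update(rg_name, df_name, "CopyOrdersPipeline", pipeline)

# Kick off a run on demand (the manual-trigger case described below).
run = adf_client.pipelines.create_run(rg_name, df_name, "CopyOrdersPipeline")
print(run.run_id)
```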

2. Triggers:

  •   Triggers define when a Pipeline should be executed. There are different types of triggers:

  •   Schedule Trigger: Executes the Pipeline on a predefined schedule (see the sketch after this list).

  •   Tumbling Window Trigger: Executes the Pipeline at fixed intervals, with a window of time for data processing.

  •   Event-based Trigger: Executes the Pipeline in response to an event, such as a file being created in a blob storage container.

  •   Manual Trigger: Executes the Pipeline on demand.
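
For example, a schedule trigger that runs a pipeline once a day might be defined as in the sketch below (Azure SDK for Python); the pipeline and trigger names are assumptions.

```python
from datetime import datetime, timedelta
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, PipelineReference)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "my-resource-group", "my-data-factory"

# Run the pipeline once a day, starting shortly after the trigger is created.
schedule = ScheduleTrigger(
    recurrence=ScheduleTriggerRecurrence(
        frequency="Day", interval=1,
        start_time=datetime.utcnow() + timedelta(minutes=15), time_zone="UTC"),
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(
            type="PipelineReference", reference_name="CopyOrdersPipeline"))])

adf_client.triggers.create_or_update(rg_name, df_name, "DailyCopyTrigger",
                                     TriggerResource(properties=schedule))

# Triggers are created stopped; start this one explicitly
# (begin_start is the long-running-operation form in recent SDK versions).
adf_client.triggers.begin_start(rg_name, df_name, "DailyCopyTrigger").result()
```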

3. Parameters:

Parameters are key-value pairs that can be passed into a Pipeline at runtime to make it dynamic and reusable. Parameters can be used to control the behavior of activities within the Pipeline.
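
For instance, a run can supply parameter values as shown below; the pipeline name and the inputPath parameter are assumptions, and the pipeline is expected to declare a matching parameter in its definition.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "my-resource-group", "my-data-factory"

# Pass runtime values into a pipeline that declares matching parameters
# (e.g. parameters={"inputPath": ParameterSpecification(type="String")} in its definition).
run = adf_client.pipelines.create_run(
    rg_name, df_name, "CopyOrdersPipeline",
    parameters={"inputPath": "sales/2024/orders.csv"})
print(run.run_id)
```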


4. Variables:

Variables are used to store temporary values within a Pipeline. They can be set and modified using Set Variable activities and used to control the flow of the Pipeline.
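
A minimal sketch of a variable being declared and then set by a Set Variable activity might look like this; the pipeline, variable, and activity names are illustrative.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, VariableSpecification, SetVariableActivity)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "my-resource-group", "my-data-factory"

# Declare a pipeline-scoped string variable and set it from an activity.
pipeline = PipelineResource(
    variables={"loadStatus": VariableSpecification(type="String")},
    activities=[SetVariableActivity(
        name="MarkLoadStarted", variable_name="loadStatus", value="started")])
adf_client.pipelines.create_or_update(rg_name, df_name, "VariableDemoPipeline", pipeline)
```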


5. Integration Runtime:

Integration Runtime (IR) provides the compute environment for the activities within the Pipeline. It can be Azure IR, Self-hosted IR, or Azure-SSIS IR.


6. Dependencies and Conditions:

Pipelines can have dependencies between activities, defining the order in which activities are executed. Conditions can be set to control the flow based on certain criteria (e.g., success, failure, or completion of previous activities).
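
To illustrate dependency conditions, the sketch below runs a hypothetical error-handling pipeline only if the copy activity fails; all dataset and pipeline names are assumptions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference, BlobSource, BlobSink,
    ExecutePipelineActivity, PipelineReference, ActivityDependency)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "my-resource-group", "my-data-factory"

copy = CopyActivity(
    name="CopyOrders",
    inputs=[DatasetReference(type="DatasetReference", reference_name="InputBlobDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="OutputBlobDataset")],
    source=BlobSource(), sink=BlobSink())

# Run the error handler only when the copy reports the 'Failed' condition.
on_failure = ExecutePipelineActivity(
    name="RunErrorHandler",
    pipeline=PipelineReference(type="PipelineReference", reference_name="NotifyOnFailurePipeline"),
    depends_on=[ActivityDependency(activity="CopyOrders", dependency_conditions=["Failed"])])

pipeline = PipelineResource(activities=[copy, on_failure])
adf_client.pipelines.create_or_update(rg_name, df_name, "CopyWithErrorHandling", pipeline)
```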


Here are a few example use cases for Azure Data Factory pipelines.

ETL Process:

A common use case for a Pipeline is an ETL (Extract, Transform, Load) process, where data is extracted from various sources, transformed according to business rules, and loaded into a data warehouse.

Data Movement:

Another use case is to copy data from an on-premises SQL Server to Azure Blob Storage for backup or archiving purposes. The Pipeline might include activities to copy the data and then send a notification email upon completion.

Data Integration:

A Pipeline can also integrate data from multiple sources, such as combining data from an API, a SQL database, and a CSV file stored in Azure Blob Storage, transform it, and load the integrated data into an analytics platform.

Ready to Transform Your Data Strategy?

If you’re interested in outsourcing work through remote arrangements, we can provide you with the best services in Data Infrastructure, Data Engineering, and Analytics Engineering. Let’s connect and explore how we can help you achieve your goals!