Data Analytics - Integration
Azure Analytics Services
Azure Synapse Analytics
Azure Synapse Analytics (formerly Azure SQL Data Warehouse) is a limitless analytics service that brings together enterprise data warehousing and big data analytics. You can query data on your terms, by using either serverless or provisioned resources at scale.
Azure Synapse Analytics implements a massively parallel processing (MPP) architecture and has the following characteristics:
- The Azure Synapse Analytics architecture includes a control node and a pool of compute nodes.
- The control node is the brain of the architecture. It's the front end that interacts with all applications.
- The compute nodes provide the computational power. The data to be processed is distributed evenly across the nodes.
- You submit queries in the form of Transact-SQL statements, and Azure Synapse Analytics runs them.
- Azure Synapse uses a technology named PolyBase that enables you to retrieve and query data from relational and non-relational sources. You can save the data that's read in as SQL tables within the Azure Synapse service.
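The control-node/compute-node split above can be sketched in a few lines of plain Python. This is only an illustration of the MPP idea (hash-distribute rows, compute partial results per node, combine on the control node), not the actual Synapse engine; all names here are invented for the example.

```python
# Illustrative MPP sketch (not the real Synapse engine): a "control node"
# hash-distributes rows across "compute nodes", each node produces a
# partial result, and the control node combines the partials.
from collections import defaultdict

NUM_COMPUTE_NODES = 4  # illustrative pool size

def distribute(rows, key):
    """Hash-distribute rows across compute nodes by a distribution column."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[hash(row[key]) % NUM_COMPUTE_NODES].append(row)
    return nodes

def parallel_count(rows, key):
    """Each node counts its local rows; the control node sums the partials."""
    nodes = distribute(rows, key)
    partials = [len(chunk) for chunk in nodes.values()]  # one partial per node
    return sum(partials)  # combined on the control node

rows = [{"customer_id": i % 10, "amount": i} for i in range(100)]
print(parallel_count(rows, "customer_id"))  # 100
```

A real dedicated SQL pool makes the same kind of decision when you pick a distribution column: rows with the same hash land on the same node, so per-key work stays local.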
Synapse Analytics Elements
Azure Synapse Analytics is composed of five elements:
- Azure Synapse SQL pool: Synapse SQL offers both serverless and dedicated resource models that work with a node-based architecture.
- Serves large, varied workloads without a lag in performance, because of workload isolation and high concurrency.
- Provides security features such as network security and fine-grained access control.
- Azure Synapse Spark pool: This pool is a cluster of servers that run Apache Spark to process data. You write your data processing logic by using one of the four supported languages: Python, Scala, SQL and C#.
- Azure Synapse Pipelines: This component applies the capabilities of Azure Data Factory. It's the cloud-based ETL and data integration service of Azure Synapse Analytics.
- Azure Synapse Link: This component allows you to connect to Azure Cosmos DB. You can use it to perform near real-time analytics over the operational data stored in an Azure Cosmos DB database.
- Azure Synapse Studio: This element is a web-based IDE that can be used centrally to work with all capabilities of Azure Synapse Analytics.
- You can use Azure Synapse Studio to create SQL and Spark pools, define and run pipelines, and configure links to external data sources.
Azure Stream Analytics
Azure Stream Analytics is a fully managed (PaaS) real-time analytics and complex event-processing engine. It offers the possibility to perform real-time analytics on multiple streams of data from sources such as IoT device data, sensors, clickstreams and social media feeds.
Azure Stream Analytics works on the following concepts:
- Data streams: Data streams are continuous data generated by applications, IoT devices or sensors. The data streams are analyzed, and actionable insights are extracted.
- Examples include monitoring data streams from industrial and manufacturing equipment, and monitoring water pipeline data by utility providers. Data streams help us understand change over time.
- Event processing: Event processing refers to the consumption and analysis of a continuous data stream to extract actionable insights from the events happening within that stream.
- For example, an event that represents a car passing through a tollbooth should include temporal information, such as a timestamp that indicates when the event occurred.
Stream Analytics ingests data from Azure Event Hubs, Azure IoT Hub or Azure Blob Storage.
The query, which is based on the SQL query language, can be used to easily filter, sort, aggregate and join streaming data over a period of time. You can also extend this SQL language with JavaScript and C# user-defined functions (UDFs).
An Azure Stream Analytics job consists of an input, a query and an output.
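To make the windowing idea concrete, here is a plain-Python sketch of what a tumbling-window aggregation computes: events are bucketed into fixed, non-overlapping time windows and counted per window. This is not the Stream Analytics service itself; a real job would express the same logic in its SQL dialect (for example with `GROUP BY TumblingWindow(second, 10)`).

```python
# Sketch of a tumbling-window count in plain Python. A Stream Analytics
# query would express this declaratively, e.g.:
#   SELECT COUNT(*) FROM input GROUP BY TumblingWindow(second, 10)
from collections import Counter

def tumbling_window_counts(events, window_seconds):
    """Count events per fixed, non-overlapping time window.

    `events` is an iterable of (timestamp_seconds, payload) pairs.
    """
    counts = Counter()
    for ts, _payload in events:
        # Each event belongs to exactly one window, keyed by its start time.
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

events = [(1, "a"), (3, "b"), (12, "c"), (19, "d"), (25, "e")]
print(tumbling_window_counts(events, 10))  # {0: 2, 10: 2, 20: 1}
```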
Azure Databricks
Azure Databricks helps you unlock insights from all your data and build artificial intelligence solutions. You can set up your Apache Spark environment in minutes, then autoscale and collaborate on shared projects in an interactive workspace.
Azure Databricks supports Python, Scala, R, Java and SQL, as well as data science frameworks and libraries including TensorFlow, PyTorch and scikit-learn.
Azure Databricks has a control plane and a data plane:
- Control plane: Hosts Databricks jobs, notebooks with query results and the cluster manager. The control plane also hosts the web application, the Hive metastore, security access control lists (ACLs) and user sessions.
- Data plane: Contains all the Azure Databricks runtime clusters that are hosted within the workspace. All data processing and storage exists within the client subscription. No data processing ever takes place within the Microsoft/Databricks-managed subscription.
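As a loose illustration of the kind of work that runs in the data plane, here is the classic word count in plain Python. This is deliberately not Spark or Databricks API code; in a real cluster the same map-and-reduce shape would run distributed across the data-plane nodes (for example via RDD `flatMap`/`reduceByKey` or DataFrame aggregations).

```python
# Plain-Python sketch of the classic word count. In Azure Databricks the
# equivalent logic runs distributed over data-plane cluster nodes.
from collections import Counter

def word_count(lines):
    """Split lines into words and count occurrences (a map + reduce)."""
    counts = Counter()
    for line in lines:                 # in Spark: partitions of the input
        for word in line.split():      # the "map"/flatMap step
            counts[word.lower()] += 1  # the "reduce by key" step
    return dict(counts)

lines = ["Spark runs in the data plane", "the control plane schedules jobs"]
print(word_count(lines)["the"])  # 2
```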
Azure Databricks offers three environments for developing data-intensive applications:
- Databricks SQL: Provides an easy-to-use platform for analysts who want to run SQL queries on their data lake.
- Databricks Machine Learning: An integrated, end-to-end machine learning environment.
- Databricks Data Science & Engineering: An interactive workspace that enables collaboration between data engineers, data scientists and machine learning engineers.
Azure Data Lake
Azure Data Lake Analytics is an on-demand analytics job service that simplifies big data. Instead of deploying, configuring and tuning hardware, you can write queries to transform your data and extract valuable insights.
A data lake is a repository of data that's stored in its natural format, usually as blobs or files. Azure Data Lake can ingest real-time data directly from multiple sources.
Azure Data Lake Storage is a set of capabilities dedicated to big data analytics, built on Azure Blob Storage.
Data Lake Storage Gen1 does not have limits on file sizes, account sizes or the number of files.
Data Lake Storage Gen2 has all the features of Gen1 and adds the following:
- A hierarchical namespace built on top of Blob storage. The hierarchical namespace organizes objects/files into a hierarchy of directories for efficient data access.
- A superset of POSIX permissions for finer-grained access control.
- The ability to manage and access data as with a Hadoop Distributed File System (HDFS), by using the ABFS driver.
Lake Storage Gen2 Authorization Mechanisms
Data Lake Storage Gen2 supports the following authorization mechanisms:
- Shared Key authorization.
- Shared access signature (SAS) authorization.
- Role-based access control (Azure RBAC).
- Access control lists (ACLs).
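The ACL mechanism can be sketched as a simple permission check: each file or directory carries entries that grant read/write/execute to specific users or groups. This is only a toy model of POSIX-style evaluation; the actual Gen2 rules (masks, default ACLs, interaction with Azure RBAC) are richer.

```python
# Toy model of a POSIX-style ACL check, the general shape of the
# fine-grained authorization Data Lake Storage Gen2 offers. The entry
# format and names below are invented for illustration.
def is_allowed(acl, principal, groups, permission):
    """Check whether `principal` (or one of its `groups`) holds `permission`.

    `acl` maps an entry like ("user", "alice") or ("group", "analysts")
    to a permission string such as "r-x"; `permission` is "r", "w" or "x".
    """
    entries = [("user", principal)] + [("group", g) for g in groups]
    for entry in entries:
        if permission in acl.get(entry, ""):
            return True
    return False

acl = {("user", "alice"): "rwx", ("group", "analysts"): "r--"}
print(is_allowed(acl, "bob", ["analysts"], "r"))  # True, via group membership
print(is_allowed(acl, "bob", ["analysts"], "w"))  # False
```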
Azure Data Explorer
Azure Data Explorer is a platform for big data that helps you analyze high volumes of data in near real time.
Data Explorer comes equipped with features to help you configure an end-to-end solution for ingesting and managing your data, running queries and generating visualizations.
Azure Data Explorer is well integrated with machine learning (ML) services such as Azure Databricks and Azure Machine Learning.
Azure Data Factory
Azure Data Factory is a cloud-based data integration service that can help you create and schedule data-driven workflows.
You can use Azure Data Factory to orchestrate data movement and transform data at scale. The data-driven workflows/pipelines ingest data from disparate data stores.
There are four major steps to create and implement a data-driven workflow in the Azure Data Factory architecture:
- Connect and collect: First, ingest the data to collect all the data from different sources into a centralized location.
- Azure Data Factory supports more than 70 different connectors for various data formats.
- Transform and enrich: Next, transform the data by using a compute service such as Azure Databricks or Azure HDInsight Hadoop.
- Provide continuous integration and delivery (CI/CD) and publish: Support CI/CD by using GitHub and Azure DevOps to deliver the ETL process incrementally before publishing the data to the analytics engine.
- Monitor: Finally, use the Azure portal to monitor the pipeline for scheduled activities and for any failures.
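The four steps above can be sketched as a tiny in-process "pipeline". This is purely illustrative: a real Data Factory pipeline is defined as JSON activities and executed and monitored by the service, not by application code like this.

```python
# Illustrative sketch of the four Data Factory steps as one function.
# Every name here is invented; this is not the Data Factory API.
def run_pipeline(sources):
    status = {}
    # 1. Connect and collect: ingest from disparate stores into one place.
    collected = [row for source in sources for row in source]
    status["collected"] = len(collected)
    # 2. Transform and enrich: in practice, hand off to a compute service
    # such as Azure Databricks; here, a trivial enrichment stands in.
    transformed = [{"value": row, "enriched": True} for row in collected]
    status["transformed"] = len(transformed)
    # 3. Publish: deliver the transformed data to the analytics engine.
    published = transformed
    status["published"] = len(published)
    # 4. Monitor: report per-step outcomes (Data Factory surfaces these
    # in the Azure portal).
    return published, status

data, status = run_pipeline([[1, 2], [3]])
print(status)  # {'collected': 3, 'transformed': 3, 'published': 3}
```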
Azure Event Hubs
Azure Event Hubs is a fully managed, big data streaming platform and event ingestion service.
- Event Hubs supports real-time data ingestion and microservices batching on the same stream.
- You can send and receive events in many different languages. Messages can also be received from Azure Event Hubs by using Apache Storm.
- Event Hubs doesn't have a built-in mechanism to handle messages that aren't processed as expected.
- It scales according to the number of purchased throughput (processing) units.
- Events received by Azure Event Hubs are added to the end of its data stream.
- The data stream orders events according to the time each event is received. Consumers can seek along the data stream by using time offsets.
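The append-only stream and offset-based seeking described above can be modeled in a few lines. This is a conceptual sketch only, not the real client (the `azure-eventhub` SDK provides the actual API); note in particular that reading does not remove events, so multiple consumers can replay the same stream.

```python
# Conceptual model of the Event Hubs data stream: events are appended in
# arrival order, and consumers seek by time offset instead of dequeuing.
import bisect

class EventStream:
    def __init__(self):
        self._times = []   # arrival times, kept in append order
        self._events = []

    def append(self, arrival_time, payload):
        """Events are always added to the end of the stream."""
        self._times.append(arrival_time)
        self._events.append(payload)

    def read_from(self, time_offset):
        """A consumer seeks to a time offset and reads onward; nothing
        is removed, so another consumer can replay the same events."""
        start = bisect.bisect_left(self._times, time_offset)
        return self._events[start:]

stream = EventStream()
for t, e in [(1, "a"), (5, "b"), (9, "c")]:
    stream.append(t, e)
print(stream.read_from(5))  # ['b', 'c']
```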