Data Analytics
GCP offers many services for data analytics purposes.
BigQuery
BigQuery is a fully managed, petabyte-scale and cost-effective analytics data warehouse that lets you run analytics over vast amounts of data in near real time.
BigQuery storage
BigQuery stores data using a columnar storage format that is optimized for analytical queries. BigQuery presents data in tables, rows, and columns, and provides full support for database transaction semantics (ACID).
- Datasets : A dataset is contained within a specific project. Datasets are top-level containers that are used to organize and control access to your tables and views (see the CLI sketch after this list).
- The dataset location can only be set at creation time.
- All tables that are referenced in a query must be stored in datasets in the same location.
- When you copy a table, the datasets that contain the source table and destination table must reside in the same location.
- Dataset names must be unique for each project.
- Tables : A BigQuery table contains individual records organized in rows. Each record is composed of columns (also called fields).
- Every table is defined by a schema that describes the column names, data types, and other information.
- Table types are:
- Standard BigQuery tables : structured data stored in BigQuery storage.
- External tables : tables that reference data stored outside BigQuery.
- Views : logical tables that are created by using a SQL query.
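A minimal bq CLI sketch of these concepts; the project, dataset, table, and schema names are placeholder assumptions:

```
# Create a dataset; the location can only be set here, at creation time.
bq --location=EU mk --dataset my_project:my_dataset

# Create a standard table with an explicit schema (column:type pairs).
bq mk --table my_project:my_dataset.my_table name:STRING,age:INTEGER

# Run a standard-SQL query against the table.
bq query --use_legacy_sql=false 'SELECT name FROM my_dataset.my_table LIMIT 10'
```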
Pub/Sub
Cloud Pub/Sub is a fully managed messaging service that enables you to build loosely coupled microservices that can communicate asynchronously. You can use Cloud Pub/Sub to integrate components of your application.
CLI examples.
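A minimal gcloud sketch; the topic and subscription names are placeholders:

```
# Create a topic and a default (pull) subscription attached to it.
gcloud pubsub topics create my-topic
gcloud pubsub subscriptions create my-sub --topic=my-topic

# Publish a message to the topic.
gcloud pubsub topics publish my-topic --message="hello"

# Pull and acknowledge messages from the subscription.
gcloud pubsub subscriptions pull my-sub --auto-ack --limit=10
```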
It supports two subscription modes :
- Pull (default) : The subscriber explicitly calls the pull method to request messages for delivery.
- Pub/Sub returns a message and an acknowledgment ID. To acknowledge receipt, the subscriber invokes the acknowledge method by using the acknowledgment ID.
- A subscriber can modify the acknowledgment deadline to allow more time to process messages.
- It enables batch delivery and acknowledgments as well as massively parallel consumption.
- Push : Pub/Sub sends each message as an HTTP request to the subscriber at a preconfigured HTTP endpoint (see the sketch after this list).
- The endpoint acknowledges the message by returning an HTTP success status code.
- A failure response indicates that the message should be sent again.
- Pub/Sub dynamically adjusts the rate of push requests based on the rate at which it receives success responses.
- The default deadline is 10 seconds, which can be increased up to 10 minutes.
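A sketch of both modes with gcloud; the endpoint URL and resource names are placeholder assumptions:

```
# Pull subscription with a longer acknowledgment deadline (seconds, max 600).
gcloud pubsub subscriptions create my-pull-sub --topic=my-topic --ack-deadline=120

# Push subscription: Pub/Sub POSTs each message to the preconfigured endpoint.
gcloud pubsub subscriptions create my-push-sub --topic=my-topic \
  --push-endpoint=https://example.com/push --ack-deadline=60

# The acknowledgment deadline can also be modified later.
gcloud pubsub subscriptions update my-pull-sub --ack-deadline=600
```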
For retention, a subscription tries to deliver a message for 7 days (the default retention duration); if it is still unacknowledged after that, it is discarded.
To save undelivered messages, you can configure a dead-letter topic.
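A sketch of both settings with gcloud; topic and subscription names are placeholders:

```
# Keep unacknowledged messages for the full 7-day retention period.
gcloud pubsub subscriptions update my-sub --message-retention-duration=7d

# Forward messages that repeatedly fail delivery to a dead-letter topic.
gcloud pubsub topics create my-dead-letter-topic
gcloud pubsub subscriptions update my-sub \
  --dead-letter-topic=my-dead-letter-topic --max-delivery-attempts=5
```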
Cloud Tasks
Cloud Tasks is an alternative to Pub/Sub if you want explicit rate controls or need more than 10 minutes to respond to a message.
Cloud Tasks supports response deadlines of up to 30 minutes.
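A sketch of the rate controls with gcloud; the queue name and handler URL are placeholder assumptions, and flag availability may depend on your gcloud version:

```
# Create a queue with explicit rate and concurrency controls.
gcloud tasks queues create my-queue \
  --max-dispatches-per-second=10 --max-concurrent-dispatches=5

# Enqueue an HTTP task targeting a handler endpoint.
gcloud tasks create-http-task --queue=my-queue \
  --url=https://example.com/handler --method=POST
```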
By default, Cloud Tasks and Pub/Sub can potentially deliver messages out of order.
You can prevent that by enabling message ordering (sketched after this list) :
- Only Pub/Sub has the option to deliver messages in the order they were published.
- You need to be aware that using this option will increase latency for message delivery.
- If you enable message ordering, you can’t catch undeliverable messages.
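A sketch of message ordering with gcloud; the names and the ordering key are placeholders:

```
# Create a subscription that delivers messages in publish order per ordering key.
gcloud pubsub subscriptions create my-ordered-sub --topic=my-topic \
  --enable-message-ordering

# Publish messages with the same ordering key so they are delivered in order.
gcloud pubsub topics publish my-topic --message="first" --ordering-key="customer-123"
gcloud pubsub topics publish my-topic --message="second" --ordering-key="customer-123"
```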
Dataflow
Cloud Dataflow is a managed service for executing a wide variety of data processing patterns.
- Serverless, stream/batch data processing.
- Batch & stream processing with autoscaling.
- Open source programming model based on Apache Beam.
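A sketch that runs a Google-provided Dataflow template with gcloud; the job name, bucket, and region are placeholder assumptions:

```
# Run the provided Word_Count batch template; Dataflow provisions and autoscales workers.
gcloud dataflow jobs run my-wordcount-job \
  --gcs-location=gs://dataflow-templates/latest/Word_Count \
  --region=us-central1 \
  --parameters=inputFile=gs://dataflow-samples/shakespeare/kinglear.txt,output=gs://my-bucket/results/output
```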
Dataprep
Cloud Dataprep is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning.
It is fully managed and scaled on-demand to meet your growing data-preparation needs, so you can stay focused on analysis.
Dataproc
Dataproc is a managed Apache Spark and Apache Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning.
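A sketch with gcloud; the cluster name and region are placeholders:

```
# Create a Dataproc cluster (managed Spark/Hadoop).
gcloud dataproc clusters create my-cluster --region=us-central1

# Submit the Spark Pi example job to the cluster.
gcloud dataproc jobs submit spark --cluster=my-cluster --region=us-central1 \
  --class=org.apache.spark.examples.SparkPi \
  --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000
```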
Data Studio
Data Studio is a free, self-service business intelligence platform that lets users build and consume data visualizations, dashboards and reports.
You can use it to explore and visualize BigQuery data.
Cloud Composer
Cloud Composer is a managed Apache Airflow service that helps you create, schedule, monitor and manage workflows.
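A sketch with gcloud; the environment name and location are placeholders, and the Airflow subcommand syntax depends on the Airflow version in the environment:

```
# Create a Cloud Composer (managed Airflow) environment.
gcloud composer environments create my-environment --location=us-central1

# Run an Airflow CLI command in the environment, e.g. list DAGs (Airflow 2 syntax).
gcloud composer environments run my-environment --location=us-central1 dags list
```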