Operations and Monitoring

Google Cloud's operations suite (formerly Stackdriver) includes:

Cloud Monitoring.
Cloud Logging.
Cloud Trace.
Cloud Profiler.
Cloud Debugger.

Cloud Monitoring

Cloud Monitoring collects metrics, events, and metadata from Google Cloud, Amazon Web Services (AWS), hosted uptime probes, and application instrumentation.

Cloud Monitoring ingests that data and generates insights via dashboards, Metrics Explorer charts and automated alerts.

Collect metrics, events, metadata.
Analyze trends.
Set alert thresholds.
Detect outlier activity.
Alert on SLOs.
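
The threshold and outlier items above can be illustrated with a small local sketch. It does not call Cloud Monitoring; the function names, the sample CPU values, the 0.8 threshold, and the z-score limit of 2.0 are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def threshold_breaches(samples, threshold):
    """Return the indexes of samples that exceed a fixed alert threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


def outliers(samples, z_limit=2.0):
    """Return the indexes of samples more than z_limit standard deviations from the mean."""
    if len(samples) < 2:
        return []
    mu = mean(samples)
    sigma = stdev(samples)
    if sigma == 0:
        return []
    return [i for i, value in enumerate(samples) if abs(value - mu) / sigma > z_limit]


# Hypothetical CPU utilization samples (fraction of one core), one per minute.
cpu_utilization = [0.41, 0.43, 0.40, 0.42, 0.44, 0.95, 0.42, 0.41]

print(threshold_breaches(cpu_utilization, threshold=0.8))
print(outliers(cpu_utilization))
```

A fixed threshold suits metrics with a known safe limit, while the z-score check flags values that are unusual relative to recent history even when no limit is known in advance.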

Monitoring is the foundation of product reliability. It reveals what needs urgent attention, shows trends in application usage patterns, and generally helps improve the experience of an application's clients.

It provides visibility into the performance, uptime, and overall health of cloud-powered applications. It collects metrics, events, and metadata from projects, logs, services, systems, agents, custom code, and various common application components.

The four golden signals measure a system's performance and reliability: latency, traffic, saturation, and errors.

The four golden signals

Latency, Traffic, Saturation, Errors

Latency measures how long it takes a particular part of a system to return a result.

It directly affects the user experience.
Changes in latency could indicate emerging issues.
Its values may be tied to capacity demands.
It can be used to measure system improvements.

Traffic measures how many requests are reaching your system.

It’s an indicator of current system demand.
Its historical trends are used for capacity planning.
It’s a core measure when calculating infrastructure spend.

Saturation measures how close to capacity a system is.

It's an indicator of how full the service is.
It focuses on the most constrained resources.
It’s frequently tied to degrading performance as capacity is reached.

Errors are events that measure system failures or other issues.

They may indicate that something is failing.
They may indicate configuration or capacity issues.
They can indicate service level objective violations.
An error might mean it's time to send out an alert.
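
As a rough illustration of how the four signals relate, the sketch below derives all of them from one window of request records. The `Request` type, the `capacity_rps` parameter, and the choice to express saturation as traffic divided by capacity are assumptions made for this example; in practice saturation is usually read from the most constrained resource (CPU, memory, disk, connections).

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # time taken to return a result
    status: int        # HTTP status code


def golden_signals(requests, window_seconds, capacity_rps):
    """Summarize one observation window of requests into the four golden signals."""
    count = len(requests)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile; 0.0 when the window is empty.
    p95 = latencies[max(0, math.ceil(count * 95 / 100) - 1)] if count else 0.0
    traffic_rps = count / window_seconds
    # Treat 5xx responses as errors (an assumption; adjust to your SLO definition).
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": p95,
        "traffic_rps": traffic_rps,
        "saturation": traffic_rps / capacity_rps,
        "error_rate": errors / count if count else 0.0,
    }


# Hypothetical 10-second window: 18 fast requests, one slow, one failed.
window = [Request(100.0, 200)] * 18 + [Request(250.0, 200), Request(900.0, 500)]
print(golden_signals(window, window_seconds=10, capacity_rps=4))
```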

Cloud Logging

Cloud Logging is a fully managed service that allows users to collect, store, search, analyze, monitor, and alert on log entries and events.

Automated logging is integrated into Google Cloud products such as App Engine, Cloud Run, GKE, and Compute Engine VMs running the logging agent.

Analyze: Analyze log data in real time with the integrated Logs Explorer.
- Analyze exported logs from Cloud Storage or BigQuery.
Export: Export to Cloud Storage, Pub/Sub, or BigQuery.
- Create logs-based metrics for augmented Monitoring.
Retain: Data access and service logs are retained for 30 days and admin logs for 400 days.

You can write and query log entries with the gcloud CLI.
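
The sketch below builds the argument lists for the `gcloud logging write` and `gcloud logging read` commands and prints them as shell-ready strings. The helper names, the log name `my-test-log`, the message, and the filter are illustrative assumptions; nothing is executed against a project.

```python
import shlex


def gcloud_logging_write(log_name, message, severity="INFO"):
    """Build the argument list for writing one text entry to a log."""
    return ["gcloud", "logging", "write", log_name, message, f"--severity={severity}"]


def gcloud_logging_read(log_filter, limit=10):
    """Build the argument list for querying entries that match a Logging filter."""
    return ["gcloud", "logging", "read", log_filter, f"--limit={limit}", "--format=json"]


# Hypothetical log name, message, and filter.
write_cmd = gcloud_logging_write("my-test-log", "Deployment finished", severity="NOTICE")
read_cmd = gcloud_logging_read('severity>=ERROR AND resource.type="gce_instance"', limit=5)

# Print the commands quoted for a POSIX shell.
print(shlex.join(write_cmd))
print(shlex.join(read_cmd))
```

To run a command for real, pass the list to `subprocess.run(cmd, check=True)` on a machine where the gcloud CLI is installed and authenticated against a project.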

Key log categories

Cloud Audit Logs, Agent Logs, Network Logs, Service Logs

Cloud Audit Logs answer the question: who did what, where?
Admin Activity
Data Access
System Event
Access Transparency

Fluentd agent
Common third-party applications
System software

Agent logs use a Google-customized and packaged Fluentd agent that can be installed on any AWS or Google Cloud VM to ingest log data from that instance.

VPC flow logs
Firewall rules
NAT gateway
Load Balancer

Network logs provide both network and security operations with in-depth network service telemetry.

VPC Flow Logs records a sample of VPC network flows and can be used for network monitoring, forensics, real-time security analysis, and expense optimization.

Standard Out / Error
Created with API

Service logs provide access to logs created by developers deploying code to Google Cloud.

For example, if they build a container using Node.js and deploy it to Cloud Run, anything the application writes to standard output or standard error is automatically sent to Cloud Logging.
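
A minimal sketch of the same idea, written in Python rather than the Node.js mentioned above. It assumes the service runs somewhere that forwards standard output to Cloud Logging (such as Cloud Run), where a single-line JSON object is treated as a structured entry and its `severity` and `message` fields are recognized. The helper names and the `order_id` field are hypothetical.

```python
import json
import sys


def format_log(message, severity="INFO", **fields):
    """Serialize one log entry as a single JSON line."""
    entry = {"severity": severity, "message": message, **fields}
    return json.dumps(entry)


def log(message, severity="INFO", **fields):
    """Write the entry to standard output, flushing so it is not held in a buffer."""
    print(format_log(message, severity, **fields), file=sys.stdout, flush=True)


# Hypothetical application events.
log("order processed", order_id="A-1001")
log("payment failed", severity="ERROR", order_id="A-1002")
```

Plain `print` calls also reach Cloud Logging as text entries; emitting JSON simply makes fields such as severity filterable.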

Cloud Trace

Cloud Trace is a distributed tracing system for Google Cloud that collects latency data from applications and displays it in near real-time.

Latency reporting.
Per-URL latency sampling.
Displays data in near real-time.
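
To make per-URL latency sampling concrete, here is a local sketch that times a sampled fraction of requests and reports latency per URL. It is not the Cloud Trace API; the `LatencySampler` class, its `sample_rate` parameter, and the `/checkout` URL are assumptions for illustration only.

```python
import random
import time
from collections import defaultdict
from contextlib import contextmanager


class LatencySampler:
    """Records latency for a random fraction of requests, grouped by URL."""

    def __init__(self, sample_rate=1.0, rng=None):
        self.sample_rate = sample_rate
        self.rng = rng or random.Random()
        self.samples = defaultdict(list)

    @contextmanager
    def trace(self, url):
        # Decide up front whether this request is sampled.
        sampled = self.rng.random() < self.sample_rate
        start = time.perf_counter()
        try:
            yield
        finally:
            if sampled:
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                self.samples[url].append(elapsed_ms)

    def report(self):
        """Return count, mean, and max latency in milliseconds for each URL."""
        return {
            url: {"count": len(v), "mean_ms": sum(v) / len(v), "max_ms": max(v)}
            for url, v in self.samples.items()
        }


sampler = LatencySampler(sample_rate=1.0)
for _ in range(2):
    with sampler.trace("/checkout"):
        time.sleep(0.01)  # stand-in for real request handling
print(sampler.report())
```

Sampling only a fraction of requests keeps overhead low while still giving a representative latency picture for each URL.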

Error Reporting

Error Reporting aggregates and displays errors produced in your running cloud services.

Error notifications.
Error dashboard.
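
The aggregation idea can be sketched locally by grouping caught exceptions under a shared key. The grouping key used here (exception type plus the function that raised it) is an assumption for illustration and not a description of how Error Reporting groups errors internally; `parse_price` and `lookup` are hypothetical application functions.

```python
import traceback
from collections import Counter


def error_key(exc):
    """Group key: exception type plus the name of the function where it was raised."""
    frames = traceback.extract_tb(exc.__traceback__)
    location = frames[-1].name if frames else "<unknown>"
    return (type(exc).__name__, location)


def aggregate(exceptions):
    """Count how many exceptions fall into each group."""
    return Counter(error_key(e) for e in exceptions)


def parse_price(text):
    return float(text)


def lookup(table, key):
    return table[key]


caught = []
actions = (
    lambda: parse_price("abc"),
    lambda: parse_price("n/a"),
    lambda: lookup({}, "sku"),
)
for action in actions:
    try:
        action()
    except Exception as exc:
        caught.append(exc)

groups = aggregate(caught)
for (kind, where), count in groups.most_common():
    print(f"{kind} in {where}: {count}")
```

Grouping turns many individual failures into a short list of distinct problems, which is what makes an error dashboard and per-group notifications practical.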

Cloud Profiler

Cloud Profiler is a statistical, low-overhead profiler that continuously gathers CPU usage and memory-allocation information from your production applications.

Continuous profiling of production systems.
Statistical, low-overhead memory and CPU profiler.
Contextualized to your code.