Analytics_ML

Data - Analytics

Storage best practices for data and analytics applications.

Amazon Redshift

Amazon Redshift is a fast, scalable data warehouse that makes it simple and cost-effective to analyze all your data across your data warehouse and data lake.

How does Amazon Redshift work?

An Amazon Redshift data warehouse is a cluster. A cluster is composed of a leader node and one or more compute nodes.
The leader node is responsible for distributing jobs to the compute nodes.
Clients access Amazon Redshift via a SQL endpoint on the leader node.

Each compute node is partitioned into slices. Each slice is allocated a portion of the node's memory and disk space.
The slices work in parallel, each processing its assigned portion of the job.

After the slices have completed their assigned tasks, each compute node returns its intermediate results to the leader node.
The leader node then aggregates the results from all nodes and returns them to the client.
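
The flow described above (leader distributes, slices compute, leader aggregates) can be sketched as a toy simulation. This is not Redshift code: the node and slice counts, the round-robin row placement and every function name are assumptions made for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical layout: 2 compute nodes, each with 2 slices.
NODES = 2
SLICES_PER_NODE = 2


def distribute(rows, nodes, slices_per_node):
    """Leader step: spread rows round-robin over every slice of every node."""
    layout = [[[] for _ in range(slices_per_node)] for _ in range(nodes)]
    total_slices = nodes * slices_per_node
    for i, row in enumerate(rows):
        slice_id = i % total_slices
        layout[slice_id // slices_per_node][slice_id % slices_per_node].append(row)
    return layout


def run_slice(rows):
    """Slice step: compute a partial SUM over the rows this slice holds."""
    return sum(rows)


def run_node(slices):
    """Node step: run every slice in parallel, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        return sum(pool.map(run_slice, slices))


def leader_sum(rows, nodes=NODES, slices_per_node=SLICES_PER_NODE):
    """Leader step: fan out to the nodes, then aggregate their results."""
    layout = distribute(rows, nodes, slices_per_node)
    return sum(run_node(node_slices) for node_slices in layout)


if __name__ == "__main__":
    print(leader_sum(list(range(1, 101))))  # prints 5050
```

The client only ever calls `leader_sum`, which mirrors the way clients talk only to the SQL endpoint on the leader node.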

Redshift Security

Redshift:

  • uses IAM to create and manage credentials.
  • requires both authentication and permission to access tables and data.
  • is configured to run inside a VPC.
  • uses AES-256 to encrypt data at rest when encryption is enabled.

Amazon Kinesis

Kinesis allows you to ingest, process and analyse real-time streaming data.
It is a fully managed, real-time serverless service.

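There are two major types of Kinesis.

  • Data Streams : real-time streaming for ingesting data. In provisioned capacity mode it does not scale automatically; you manage the number of shards yourself.
  • Data Firehose : data transfer tool that delivers data to S3, Redshift, Elasticsearch (now Amazon OpenSearch Service) or Splunk. Near real time (about 60 s of buffering). Scales automatically.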

Kinesis vs SQS : both are message brokers, but only Kinesis is real time.
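
The "near real time (60s)" behaviour of Data Firehose comes from buffering: records are collected and delivered in batches rather than one at a time. The sketch below is a toy model of that idea, not the Firehose API. The class name, the tiny record limit and the simplification that the timer is only checked when a new record arrives are all assumptions.

```python
class BufferedDelivery:
    """Toy buffer: flush when the interval elapses or the size limit is hit."""

    def __init__(self, interval_seconds=60, max_records=3):
        self.interval_seconds = interval_seconds
        self.max_records = max_records
        self.buffer = []
        self.window_start = None
        self.delivered = []  # each entry is one batch written to the destination

    def put(self, record, now):
        """Accept one record at time `now` (in seconds)."""
        # Flush first if the current buffering window has already expired.
        if self.buffer and now - self.window_start >= self.interval_seconds:
            self.flush()
        # An empty buffer starts a new window at the current time.
        if not self.buffer:
            self.window_start = now
        self.buffer.append(record)
        # Flush early when the (hypothetical) size limit is reached.
        if len(self.buffer) >= self.max_records:
            self.flush()

    def flush(self):
        """Deliver the buffered records as one batch, if there are any."""
        if self.buffer:
            self.delivered.append(self.buffer)
            self.buffer = []


if __name__ == "__main__":
    d = BufferedDelivery(interval_seconds=60, max_records=3)
    for record, t in [("a", 0), ("b", 10), ("c", 20), ("d", 30), ("e", 95)]:
        d.put(record, t)
    print(d.delivered)  # prints [['a', 'b', 'c'], ['d']]
```

A record can therefore wait up to one buffering interval before it reaches the destination, which is why the delivery is described as near real time rather than real time.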

Amazon EMR

EMR (Elastic MapReduce) is a managed big data platform that allows you to process vast amounts of data using open-source tools such as Spark, HBase and Hive.
It is an ETL (Extract - Transform - Load) tool.

EMR is a managed fleet of EC2 instances running open-source tools.
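
The "MapReduce" in the name refers to a programming model that the tools running on EMR build on. As a rough illustration only, here is a single-process word count split into map, shuffle and reduce steps. It uses no EMR or Spark API, and the function names are made up for the example.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each group of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Run the three phases in order on a list of text lines."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["Spark on EMR", "Hive on EMR"]))
```

On a real cluster the map and reduce steps run on many instances at once, which is what lets the same pattern handle very large datasets.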

Amazon Athena - Glue - QuickSight - Data Pipeline

Athena is a serverless interactive query service that makes it easy to analyse data in S3 using SQL without loading it into a database.
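
As a minimal sketch of what querying S3 data with SQL can look like from code, the helper below builds the arguments for Athena's `start_query_execution` call in boto3. The SQL text, database name and output location in the usage example are hypothetical placeholders, and the client is passed in so the snippet does not need AWS credentials to be read or tested.

```python
def build_athena_request(sql, database, output_location):
    """Build keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def start_query(athena_client, sql, database, output_location):
    """Submit the query and return its execution id.

    `athena_client` is expected to be `boto3.client("athena")`.
    """
    request = build_athena_request(sql, database, output_location)
    response = athena_client.start_query_execution(**request)
    return response["QueryExecutionId"]


# Usage (assumes boto3 is installed and AWS credentials are configured;
# the table, database and bucket names below are placeholders):
#
#   import boto3
#   query_id = start_query(
#       boto3.client("athena"),
#       "SELECT status, COUNT(*) FROM access_logs GROUP BY status",
#       "my_database",
#       "s3://my-results-bucket/athena/",
#   )
```

The call returns as soon as the query is submitted. The results are written to the output location in S3 and can be fetched once the query has finished.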

Glue is a serverless data integration service that makes it easy to discover, prepare and combine data.
Glue allows you to perform ETL workloads without managing the underlying servers.

QuickSight is a fully managed BI data visualization service to easily create dashboards and share them.

Data Pipeline is a managed ETL (Extract - Transform - Load) service for automating the movement and transformation of data.
It integrates easily with DynamoDB, RDS, Redshift, S3.

ML

  • Amazon Comprehend : uses natural-language processing (NLP) to help you understand the meaning and sentiment in your text.
  • Amazon Kendra : allows you to create an intelligent search service powered by ML.
  • Amazon Textract : uses ML and OCR (optical character recognition) to automatically extract text, handwriting and data from scanned documents.
  • Amazon Forecast : a time-series forecasting service that uses ML to give you business insights.
  • Amazon Fraud Detector : an AWS AI service built to detect fraud in your data.
  • Amazon Rekognition : a computer vision service that automates the recognition of content in pictures and videos using deep learning and neural networks.
  • Amazon SageMaker : a service for building and training ML models, with notebooks and training support for frameworks such as TensorFlow and PyTorch.

Speech

  • Amazon Transcribe : to convert audio and video files to text.
  • Amazon Lex : to build conversational interfaces in your apps.
  • Amazon Polly : to convert text into natural speech in multiple languages.
  • Amazon Translate : to automate language translation.