Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Full Stack Analytics on AWS
Ian Robinson, Specialist Solutions Architect, Data and Analytics, EMEA
November 17th, 2017
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Forces and Trends Prompting the Move to Cloud
• Cost Optimization
• Licenses
• Hardware
• Data center and operations
• Dark Data
• Prematurely discarding data
• Agility
• Experimentation (data & tools)
• Democratised Access to Data
• Time-to-first-results
• Terminate failed experiments early
• From BI to Data Science
• In-house data science
• From back office to product

Storage is the Gravity for Cloud Applications
• Store all your data, for ever, at every stage of its lifecycle
• Apply it using the right tool for the job

Storage is Job #1

Object Storage is Foundational

Events and Lifecycle Management
• Standard: active data
• Standard - Infrequent Access: infrequently accessed data
• Amazon Glacier: archive data
• Events: Create, Delete
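
The tiering on this slide can be expressed as an S3 lifecycle configuration. The following is a minimal sketch rather than the speaker's setup: the `raw/` prefix, the 30-day and 365-day thresholds, the rule ID, and the helper names are all assumptions chosen for illustration. The dictionary follows the shape accepted by boto3's `put_bucket_lifecycle_configuration`.

```python
import json


def build_lifecycle_configuration(prefix, ia_after_days, glacier_after_days):
    """Return a lifecycle configuration dict in the shape boto3 expects."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the Standard-IA transition")
    return {
        "Rules": [
            {
                "ID": "tier-" + (prefix.strip("/") or "all"),  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    # Standard -> Standard - Infrequent Access
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    # Standard - Infrequent Access -> Amazon Glacier
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def apply_lifecycle(bucket, configuration):
    """Apply the configuration to a bucket; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above has no dependencies

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=configuration
    )


# Illustrative values only: thresholds and prefix are assumptions.
config = build_lifecycle_configuration("raw/", ia_after_days=30, glacier_after_days=365)
print(json.dumps(config, indent=2))
```

Keeping the builder separate from the API call makes the policy easy to review and version before it is applied to a bucket.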

S3 as the Data Lake Fabric
• Unlimited number of objects and volume
• 99.99% availability
• 99.999999999% durability
• Versioning
• Tiered storage via lifecycle policies
• SSL, client/server-side encryption at rest
• Low cost (just over $2700/month for 100TB)
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• Decouples storage and compute
• Run transient compute clusters (with Amazon EC2 Spot Instances)
• Multiple, heterogeneous clusters can use same data

Automated Data Ingestion
• Database Migration Service

Stream Events to S3 Using Kinesis Firehose

Write Database Changes to S3 with DMS
• Full Load: <schema_name>/<table_name>/LOAD001.csv, <schema_name>/<table_name>/LOAD002.csv
• Change Data Capture: <schema_name>/<table_name>/<time-stamp>.csv
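
A downstream job usually needs to tell full-load files from change-data-capture files. The sketch below classifies object keys using only the layout shown on this slide. It assumes keys have exactly three segments and that full-load files are named `LOAD` followed by digits; the `DmsObject` type, the function name, and the sample timestamp format are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumption taken from the slide: full-load files look like LOAD001.csv.
FULL_LOAD_FILE = re.compile(r"^LOAD\d+\.csv$")


class DmsObject(NamedTuple):
    schema: str
    table: str
    kind: str  # "full_load" or "cdc"


def classify_dms_key(key: str) -> Optional[DmsObject]:
    """Split <schema_name>/<table_name>/<file>.csv and label the file kind."""
    parts = key.split("/")
    if len(parts) != 3 or not parts[2].endswith(".csv"):
        return None  # not in the layout shown on the slide
    schema, table, filename = parts
    kind = "full_load" if FULL_LOAD_FILE.match(filename) else "cdc"
    return DmsObject(schema, table, kind)


# Sample keys; the timestamp format in the second key is made up.
print(classify_dms_key("sales/orders/LOAD001.csv"))
print(classify_dms_key("sales/orders/20171117-101500123.csv"))
```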

Storage + Catalog
• Scalable (secure, versioned, durable) storage
• + Immutable data at every stage of its lifecycle
• + Versioned schema and metadata
• = Data discovery, lineage

AWS Glue
• Data Catalog: Discover and store metadata
• Job Authoring: Auto-generated ETL code
• Job Execution: Serverless scheduling and execution

Glue Data Catalog
Hive metastore-compatible, highly available metadata repository:
• Classification for identifying and parsing files
• Versioning of table metadata as schemas evolve
• Table definitions – usable by Redshift, Athena, Glue, EMR
Populate using Hive DDL, bulk import, or automatically through crawlers.

Crawlers: Automatic Schema Inference
• Enumerate S3 objects (file 1, file 2, … file N)
• Identify file type and parse files
• Custom classifiers: app log parser, metrics parser, …
• System classifiers: JSON parser, CSV parser, Apache log parser, …
• Semi-structured per-file schema (fields typed as int, char, bool, array, struct)
• Semi-structured unified schema
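
The flow on this slide (per-file schema, then unified schema) can be illustrated with a toy example. This is not Glue's crawler algorithm, which the slide does not describe; it is a simplified sketch that assumes newline-delimited JSON input, a coarse set of type names, and a "fall back to string on conflict" rule. All function names are hypothetical.

```python
import json


def infer_type(value):
    """Map a parsed JSON value to a coarse type name."""
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "struct"
    return "null"


def infer_file_schema(lines):
    """Per-file schema: column name -> set of types seen in that file."""
    schema = {}
    for line in lines:
        for column, value in json.loads(line).items():
            schema.setdefault(column, set()).add(infer_type(value))
    return schema


def unify(schemas):
    """Unified schema: union of columns; conflicting types fall back to string."""
    merged = {}
    for schema in schemas:
        for column, types in schema.items():
            merged.setdefault(column, set()).update(types)
    return {
        column: next(iter(types)) if len(types) == 1 else "string"
        for column, types in merged.items()
    }


# Two made-up files with overlapping but different columns.
file_1 = ['{"id": 1, "tags": ["a"], "ok": true}']
file_2 = ['{"id": 2, "name": "x"}', '{"id": "3", "name": "y"}']
unified = unify([infer_file_schema(file_1), infer_file_schema(file_2)])
print(unified)
```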

Indexing and Searching Using Metadata
• Amazon S3: ObjectCreated and ObjectDeleted events
• AWS Lambda: PutItem into the Metadata Index (Amazon DynamoDB)
• Update Stream from the Metadata Index
• AWS Lambda: Extract Search Fields, Update Index in the Search Index (Amazon Elasticsearch)
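
The first hop in this pipeline, from S3 events to the DynamoDB metadata index, might look like the sketch below. Several things are assumptions: the table name `metadata-index`, a key schema of `bucket` plus `key`, and the item attributes. Note that S3 notification records name their events with the prefixes `ObjectCreated:` and `ObjectRemoved:`, so the slide's "ObjectDeleted" label is matched here as `ObjectRemoved`.

```python
from urllib.parse import unquote_plus


def index_actions(event):
    """Translate an S3 event notification into metadata-index actions."""
    actions = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        event_name = record["eventName"]
        if event_name.startswith("ObjectCreated"):
            item = {
                "bucket": bucket,
                "key": key,
                "size": record["s3"]["object"].get("size", 0),
                "event_time": record.get("eventTime"),
            }
            actions.append(("put", item))
        elif event_name.startswith("ObjectRemoved"):
            actions.append(("delete", {"bucket": bucket, "key": key}))
    return actions


def handler(event, context, table=None):
    """Lambda entry point. `table` can be injected for local testing."""
    if table is None:
        import boto3  # available in the Lambda runtime

        # Hypothetical table with partition key "bucket" and sort key "key".
        table = boto3.resource("dynamodb").Table("metadata-index")
    for action, payload in index_actions(event):
        if action == "put":
            table.put_item(Item=payload)
        else:
            table.delete_item(Key=payload)


# A trimmed, made-up event for local experimentation.
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "eventTime": "2017-11-17T10:15:00.000Z",
            "s3": {
                "bucket": {"name": "my-data-lake"},
                "object": {"key": "raw/app+log.json", "size": 1024},
            },
        },
        {
            "eventName": "ObjectRemoved:Delete",
            "s3": {"bucket": {"name": "my-data-lake"}, "object": {"key": "raw/old.json"}},
        },
    ]
}
print(index_actions(sample_event))
```

The second hop (DynamoDB update stream to the search index) would follow the same pattern with a separate handler.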

Security is Job #0

• Data Access & Authorisation: Give your users easy and secure access
• Storage & Catalog: Secure, cost-effective storage in Amazon S3. Robust metadata in AWS Catalog
• Protect and Secure: Use entitlements to ensure data is secure and users’ identities are verified

Identity and Access Management
• Manage users, groups, and roles
• Identity federation with OpenID
• Temporary credentials with AWS Security Token Service (AWS STS)
• Stored policy templates
• Powerful policy language
• Amazon S3 bucket policies
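
As an illustration of the policy language, the sketch below builds an IAM policy document that grants read-only access to one prefix of a data-lake bucket. The bucket name, prefix, statement IDs, and function name are assumptions; the document structure (`Version`, `Statement`, `Effect`, `Action`, `Resource`, `Condition`) is the standard IAM JSON policy format.

```python
import json


def read_only_prefix_policy(bucket, prefix):
    """Build a policy allowing list and read access under one prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                # Only allow listing keys under the given prefix.
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# Hypothetical bucket and prefix.
policy = read_only_prefix_policy("my-data-lake", "curated/sales")
print(json.dumps(policy, indent=2))
```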

Security at the Data Level
• IAM: Service API Access
• Services: Amazon S3, Amazon ElastiCache, Amazon DynamoDB, Amazon EMR, Amazon Kinesis, Amazon Athena

Storage Level Support for Access Logging and Audit
• Amazon S3: Access Logging
• AWS CloudTrail: API Logging
• Access Log Analytics: Amazon Athena, Amazon EMR
• IAM
• Third Party Ecosystem Security Tools
• Links: http://amzn.to/2tSimHj, http://amzn.to/2si6RqS

Encryption Options
• AWS Server-Side encryption: AWS managed key infrastructure
• AWS Key Management Service: Automated key rotation & auditing; Integration with other AWS services
• AWS CloudHSM: Dedicated Tenancy SafeNet Luna SA HSM Device; Common Criteria EAL4+, NIST FIPS 140-2

Serverless Processing and Analytics

Managed ETL with AWS Glue
• Python code generated by AWS Glue
• Connect a notebook or IDE to AWS Glue
• Existing code brought into AWS Glue

Job Execution with AWS Glue
• Schedule-based
• Event-based
• On demand

Amazon Kinesis Analytics
• Interact with streaming data in real time using SQL
• Build fully managed and elastic stream processing applications that process data for real-time visualizations and alarms

Amazon Kinesis Analytics – Simple SQL Interface
SELECT STREAM author, count(author) OVER ONE_MINUTE
FROM Tweets
WHERE text LIKE '%#BigDataSpain%'
WINDOW ONE_MINUTE AS (PARTITION BY author RANGE INTERVAL '1' MINUTE PRECEDING);
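
To make the semantics of this windowed query concrete, here is a plain-Python sketch of what it computes: for every matching tweet, the number of matching tweets by the same author in the preceding minute. It is an illustration, not how Kinesis Analytics executes the query. It assumes tweets arrive ordered by timestamp (in seconds) and treats the window boundary as inclusive; the tuple layout and function name are hypothetical.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60


def sliding_counts(tweets, hashtag="#BigDataSpain"):
    """Yield (author, count) per matching tweet, counting that author's
    matching tweets in the preceding minute, current tweet included."""
    recent = defaultdict(deque)  # author -> timestamps still inside the window
    for timestamp, author, text in tweets:  # assumed ordered by timestamp
        if hashtag not in text:
            continue  # WHERE text LIKE '%#BigDataSpain%'
        window = recent[author]  # PARTITION BY author
        window.append(timestamp)
        while window[0] < timestamp - WINDOW_SECONDS:
            window.popleft()  # RANGE INTERVAL '1' MINUTE PRECEDING
        yield author, len(window)


# Made-up sample stream: (seconds, author, text).
tweets = [
    (0, "ana", "Hi #BigDataSpain"),
    (20, "ana", "More #BigDataSpain"),
    (30, "bob", "lunch"),
    (70, "ana", "Still #BigDataSpain"),
    (75, "bob", "#BigDataSpain talk"),
]
print(list(sliding_counts(tweets)))
```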

Amazon Athena – Analyze Data in S3
• Interactive queries
• ANSI SQL
• No infrastructure or administration
• Zero spin up time
• Query data in its raw format
• Avro, Text, CSV, JSON, weblogs, AWS service logs
• Convert to an optimized form like ORC or Parquet for the best performance and lowest cost
• No loading of data, no ETL required
• Stream data directly from Amazon S3, take advantage of Amazon S3 durability and availability
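
Queries can also be submitted programmatically. The sketch below wraps the Athena API calls `start_query_execution` and `get_query_execution` in a small polling helper. The helper name, the polling interval, and the database, table, and bucket names in the usage comment are assumptions; the client is passed in so the function can be exercised without AWS access.

```python
import time


def run_athena_query(client, sql, database, output_location, poll_seconds=1.0):
    """Start a query, wait until it finishes, and return (query_id, state)."""
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)  # still QUEUED or RUNNING


# Usage with real credentials (names below are hypothetical):
# import boto3
# run_athena_query(
#     boto3.client("athena"),
#     "SELECT status, count(*) FROM weblogs GROUP BY status",
#     database="analytics",
#     output_location="s3://my-athena-results/",
# )
```

Results are written to the given S3 output location, so no cluster or loading step is involved, which matches the "no infrastructure" point on the slide.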

• Simple query editor with syntax highlighting and autocomplete
• Data Catalog
• Query History, Saved Queries, and Catalog Management

Using Amazon Athena with Amazon QuickSight
QuickSight allows you to connect to data from a wide variety of AWS, third-party, and on-premises sources including Amazon Athena
• Amazon RDS
• Amazon S3
• Amazon Redshift
• Amazon Athena

Building Smarter Applications

Add Machine Learning Capabilities
• Amazon Machine Learning Service: Batch and online predictions; Train using data in S3, RDS and Redshift
• Amazon EMR: Comprehensive machine learning libraries (e.g. Spark MLlib, Anaconda); Provision analytics clusters in minutes, autoscale with data volume or query demand

Amazon AI Services
• Amazon Polly – Lifelike Text-to-Speech: 47 voices, 24 languages; Low-latency, real time
• Amazon Rekognition – Image Analysis: Object and scene detection; Facial analysis
• Amazon Lex – Conversational Engine: Speech and text recognition; Enterprise connectors

Facial Analysis with Rekognition
• Demographic Data
• Facial Landmarks
• Sentiment Expressed
• Image Quality (Brightness: 25.84, Sharpness: 160)
• General Attributes
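
The attributes on this slide correspond to fields in each `FaceDetails` entry returned by Rekognition's `detect_faces` call when `Attributes=["ALL"]` is requested. The sketch below condenses one such entry into a small summary. The helper name and the summary keys are hypothetical, and apart from the two quality figures taken from the slide, the sample values are made up.

```python
def summarize_face(face_detail):
    """Condense one FaceDetails entry into the attributes named on the slide."""
    quality = face_detail.get("Quality", {})
    emotions = face_detail.get("Emotions", [])
    top_emotion = max(emotions, key=lambda e: e["Confidence"], default=None)
    age = face_detail.get("AgeRange", {})
    return {
        "brightness": quality.get("Brightness"),  # image quality
        "sharpness": quality.get("Sharpness"),
        "sentiment": top_emotion["Type"] if top_emotion else None,
        "age_range": (age.get("Low"), age.get("High")),  # demographic data
        "gender": face_detail.get("Gender", {}).get("Value"),
        "landmarks": len(face_detail.get("Landmarks", [])),  # facial landmarks
    }


# Illustrative entry; only Brightness and Sharpness come from the slide.
sample_face = {
    "AgeRange": {"Low": 30, "High": 45},
    "Gender": {"Value": "Male", "Confidence": 99.0},
    "Emotions": [
        {"Type": "HAPPY", "Confidence": 92.5},
        {"Type": "CALM", "Confidence": 5.1},
    ],
    "Landmarks": [{"Type": "eyeLeft", "X": 0.3, "Y": 0.4}],
    "Quality": {"Brightness": 25.84, "Sharpness": 160},
}
print(summarize_face(sample_face))

# With real credentials (bucket and object names are hypothetical):
# import boto3
# response = boto3.client("rekognition").detect_faces(
#     Image={"S3Object": {"Bucket": "my-images", "Name": "speaker.jpg"}},
#     Attributes=["ALL"],
# )
# summaries = [summarize_face(face) for face in response["FaceDetails"]]
```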

AWS Deep Learning AMI
• Up to ~40k CUDA cores
• Pre-configured CUDA drivers
• Jupyter notebook with Python2, Python3, Anaconda
• CloudFormation Template
• AWS Marketplace – one-click deploy

• Data Access & Authorisation: Give your users easy and secure access
• Data Ingestion: Get your data into S3 quickly and securely
• Processing & Analytics: Use of predictive and prescriptive analytics to gain better understanding
• Protect and Secure: Use entitlements to ensure data is secure and users’ identities are verified
• Storage & Catalog: Secure, cost-effective storage in Amazon S3. Robust metadata in AWS Catalog
• Services shown: Kinesis Firehose, Athena Query Service, Glue, Machine Learning, Predictive analytics, Amazon AI

Thank You Full Stack Analytics on AWS
