Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Building analytics applications requires more than just one good service. It requires the ability to capture a vast amount of data, and react to data changes in real time.

https://www.bigdataspain.org/2017/talk/full-stack-analytics-on-amazon-web-services

Big Data Spain 2017
November 16th - 17th Kinépolis Madrid

Big Data Spain

December 18, 2017

Transcript

  1. None
  2. Full Stack Analytics on AWS
    Ian Robinson, Specialist Solutions Architect, Data and Analytics, EMEA
    November 17th, 2017
    © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  3. Forces and Trends Prompting the Move to Cloud
    Cost Optimization; Licenses; Hardware; Data center and operations; Dark Data; Prematurely discarding data; Agility; Experimentation (data & tools); Democratised Access to Data; Time-to-first-results; Terminate failed experiments early; From BI to Data Science; In-house data science; From back office to product
  4. Storage is the Gravity for Cloud Applications
    Store all your data, forever, at every stage of its lifecycle. Apply it using the right tool for the job.
  5. Storage is Job #1

  6. Object Storage is Foundational

  7. Events and Lifecycle Management
    • Storage classes: Standard (active data), Standard - Infrequent Access (infrequently accessed data), Amazon Glacier (archive data)
    • Events: Create, Delete
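As a rough illustration of tiering between these storage classes, the sketch below builds an S3 lifecycle configuration document in Python. The prefix, bucket name, and day thresholds are assumptions chosen for the example, not values from the talk.

```python
import json


def build_lifecycle_rule(prefix, ia_after_days=30, glacier_after_days=90):
    """Return one S3 lifecycle rule that tiers objects under `prefix`.

    The day thresholds are illustrative assumptions, not values from the talk.
    """
    return {
        "ID": f"tier-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            # Move to Standard - Infrequent Access first, then to Glacier.
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }


lifecycle_configuration = {"Rules": [build_lifecycle_rule("raw/")]}
print(json.dumps(lifecycle_configuration, indent=2))

# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-data-lake",  # hypothetical bucket name
#       LifecycleConfiguration=lifecycle_configuration,
#   )
```

The document is built as plain data so it can be reviewed or versioned before it is applied to a bucket.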
  8. S3 as the Data Lake Fabric
    • Unlimited number of objects and volume
    • 99.99% availability
    • 99.999999999% durability
    • Versioning
    • Tiered storage via lifecycle policies
    • SSL, client/server-side encryption at rest
    • Low cost (just over $2700/month for 100TB)
    • Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
    • Decouples storage and compute
    • Run transient compute clusters (with Amazon EC2 Spot Instances)
    • Multiple, heterogeneous clusters can use the same data
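To put the availability, durability, and cost figures on this slide into more concrete terms, here is a small worked calculation. It assumes the durability figure is an annual per-object probability, that "just over $2700" can be rounded to $2700, and that 1 TB means 1024 GB; none of these conventions are stated on the slide.

```python
from fractions import Fraction

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

# 99.99% availability leaves 0.01% of the year as potential downtime.
availability = 0.9999
downtime_minutes_per_year = (1 - availability) * MINUTES_PER_YEAR

# 99.999999999% durability read as an annual loss probability of 1e-11 per object.
annual_loss_probability = Fraction(1, 10**11)
objects_stored = 10_000_000  # illustrative object count
expected_losses_per_year = objects_stored * annual_loss_probability
years_per_lost_object = 1 / expected_losses_per_year

# Per-GB monthly rate implied by roughly $2700/month for 100 TB.
implied_rate_per_gb = 2700 / (100 * 1024)

print(f"Potential downtime: {downtime_minutes_per_year:.2f} minutes/year")
print(f"One object lost on average every {years_per_lost_object} years")
print(f"Implied storage rate: ${implied_rate_per_gb:.4f} per GB-month")
```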
  9. Database Migration Service Automated Data Ingestion

  10. Stream Events to S3 Using Kinesis Firehose
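A minimal sketch of what streaming an event through Firehose might look like from Python. The stream name and event fields are hypothetical. The helper appends a newline because Firehose concatenates records as delivered, so the resulting S3 objects stay line-delimited.

```python
import json


def to_firehose_record(event):
    """Encode one event as a newline-terminated JSON record for Firehose."""
    line = json.dumps(event, separators=(",", ":"), sort_keys=True) + "\n"
    return {"Data": line.encode("utf-8")}


def send_event(event, stream_name="example-clickstream"):
    """Send one event; needs boto3, credentials, and an existing delivery stream.

    The default stream name is a hypothetical placeholder.
    """
    import boto3  # imported lazily so the helper above runs without AWS access

    client = boto3.client("firehose")
    return client.put_record(
        DeliveryStreamName=stream_name, Record=to_firehose_record(event)
    )


record = to_firehose_record({"user": "alice", "action": "click"})
print(record)
```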

  11. Write Database Changes to S3 with DMS
    • Full Load: <schema_name>/<table_name>/LOAD001.csv, <schema_name>/<table_name>/LOAD002.csv
    • Change Data Capture: <schema_name>/<table_name>/<time-stamp>.csv
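The key layout on this slide makes it possible to tell full-load files from change-data-capture files by name alone. The sketch below shows one way a downstream job might do that. The function name and the sample time-stamp format are assumptions, since the slide does not spell out the time-stamp format.

```python
import re

# Full-load files are named LOAD<number>.csv under <schema_name>/<table_name>/.
FULL_LOAD = re.compile(r"^(?P<schema>[^/]+)/(?P<table>[^/]+)/LOAD\d+\.csv$")
# Any other .csv at the same depth is treated as a change-data-capture file.
ANY_CSV = re.compile(r"^(?P<schema>[^/]+)/(?P<table>[^/]+)/[^/]+\.csv$")


def classify_dms_key(key):
    """Return (schema, table, phase) for a DMS output key, or None."""
    match = FULL_LOAD.match(key)
    if match:
        return match["schema"], match["table"], "full_load"
    match = ANY_CSV.match(key)
    if match:
        return match["schema"], match["table"], "cdc"
    return None


for key in ["sales/orders/LOAD001.csv", "sales/orders/20171117-101500.csv"]:
    print(key, "->", classify_dms_key(key))
```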
  12. Storage + Catalog
    Scalable (secure, versioned, durable) storage + Immutable data at every stage of its lifecycle + Versioned schema and metadata = Data discovery, lineage
  13. AWS Glue
    • Data Catalog: Discover and store metadata
    • Job Authoring: Auto-generated ETL code
    • Job Execution: Serverless scheduling and execution
  14. Glue Data Catalog
    Hive metastore-compatible, highly-available metadata repository:
    • Classification for identifying and parsing files
    • Versioning of table metadata as schemas evolve
    • Table definitions – usable by Redshift, Athena, Glue, EMR
    Populate using Hive DDL, bulk import, or automatically through crawlers.
  15. Crawlers: Automatic Schema Inference
    Pipeline: enumerate S3 objects → identify file type and parse files → semi-structured per-file schema → semi-structured unified schema
    • Custom classifiers: app log parser, metrics parser, …
    • System classifiers: JSON parser, CSV parser, Apache log parser, …
    • Diagram: file 1, file 2, … file N each yield a schema of typed fields (int, char, bool, array, struct)
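The step from per-file schemas to a unified schema can be pictured with a toy example. This is only a sketch of the general idea, not the algorithm Glue crawlers actually use. The widening rule (conflicting types fall back to `str`) and the sample records are assumptions.

```python
def infer_schema(record):
    """Map each top-level field of one parsed record to a simple type name."""
    return {name: type(value).__name__ for name, value in record.items()}


def unify(schemas):
    """Merge per-file schemas; fields with conflicting types widen to 'str'."""
    unified = {}
    for schema in schemas:
        for name, type_name in schema.items():
            if unified.setdefault(name, type_name) != type_name:
                unified[name] = "str"  # assumed widening rule
    return unified


# One parsed sample record per file (hypothetical data).
files = [
    {"id": 1, "name": "a"},
    {"id": 2, "tags": ["x"], "active": True},
    {"id": "3", "name": "c"},
]
per_file_schemas = [infer_schema(record) for record in files]
unified_schema = unify(per_file_schemas)
print(unified_schema)
```

Fields that appear in only some files are kept, so the unified schema is the union of all per-file fields.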
  16. Indexing and Searching Using Metadata
    • Amazon S3 emits ObjectCreated and ObjectDeleted events to AWS Lambda
    • AWS Lambda: PutItem into the Metadata Index (Amazon DynamoDB)
    • Update Stream from the Metadata Index triggers AWS Lambda: Extract Search Fields, Update Index
    • Search Index (Amazon Elasticsearch)
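The first Lambda function in this flow could look roughly like the sketch below, which turns an S3 notification event into put or delete actions for the metadata index. The item attribute names are hypothetical, and the DynamoDB write itself is left out so the example runs without AWS access. Note that S3 reports deletions with event names starting with `ObjectRemoved`.

```python
from urllib.parse import unquote_plus


def to_index_actions(event):
    """Translate an S3 notification event into metadata-index actions."""
    actions = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        key = unquote_plus(s3["object"]["key"])  # keys arrive URL-encoded
        item_key = {"bucket": s3["bucket"]["name"], "key": key}
        if record["eventName"].startswith("ObjectCreated"):
            item = dict(
                item_key,
                size=s3["object"].get("size", 0),
                event_time=record["eventTime"],
            )
            actions.append(("put", item))
        elif record["eventName"].startswith("ObjectRemoved"):
            actions.append(("delete", item_key))
    return actions


def lambda_handler(event, context):
    """Lambda entry point; a real handler would apply each action to DynamoDB."""
    return to_index_actions(event)


sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "eventTime": "2017-11-17T10:15:00.000Z",
            "s3": {
                "bucket": {"name": "example-data-lake"},
                "object": {"key": "logs/2017/a+b.json", "size": 42},
            },
        }
    ]
}
print(lambda_handler(sample_event, None))
```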
  17. Security is Job #0

  18. • Data Access & Authorisation: Give your users easy and secure access
    • Storage & Catalog: Secure, cost-effective storage in Amazon S3. Robust metadata in AWS Catalog
    • Protect and Secure: Use entitlements to ensure data is secure and users’ identities are verified
  19. Identity and Access Management
    • Manage users, groups, and roles
    • Identity federation with OpenID
    • Temporary credentials with AWS Security Token Service (AWS STS)
    • Stored policy templates
    • Powerful policy language
    • Amazon S3 bucket policies
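As an example of the policy language, the following sketch generates an IAM policy document that grants read-only access to a single prefix in a bucket. The bucket name, prefix, and helper function are hypothetical. The policy elements themselves (`Version`, `Statement`, `Effect`, `Action`, `Resource`, `Condition`) are standard.

```python
import json


def read_only_prefix_policy(bucket, prefix):
    """Build an IAM policy document allowing list and read under one prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                # Restrict listing to keys under the chosen prefix.
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


policy = read_only_prefix_policy("example-data-lake", "curated/sales/")
print(json.dumps(policy, indent=2))
```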
  20. Security at the Data Level
    IAM: Service API Access for Amazon S3, Amazon ElastiCache, Amazon DynamoDB, Amazon EMR, Amazon Kinesis, Amazon Athena
  21. Storage Level Support for Access Logging and Audit
    Third Party Ecosystem Security Tools Amazon S3 AWS CloudTrail http://amzn.to/2tSimHj Amazon Athena Access Logging API Logging Access Log Analytics IAM Amazon EMR http://amzn.to/2si6RqS
  22. Encryption Options
    AWS Server-Side encryption
    • AWS managed key infrastructure
    AWS Key Management Service
    • Automated key rotation & auditing
    • Integration with other AWS services
    AWS CloudHSM
    • Dedicated Tenancy SafeNet Luna SA HSM Device
    • Common Criteria EAL4+, NIST FIPS 140-2
  23. Serverless Processing and Analytics

  24. Managed ETL with AWS Glue
    • Python code generated by AWS Glue
    • Connect a notebook or IDE to AWS Glue
    • Existing code brought into AWS Glue
  25. Job Execution with AWS Glue
    • Schedule-based
    • Event-based
    • On demand
  26. Amazon Kinesis Analytics
    • Interact with streaming data in real time using SQL
    • Build fully managed and elastic stream processing applications that process data for real-time visualizations and alarms
  27. Amazon Kinesis Analytics – Simple SQL Interface
    SELECT STREAM author, count(author) OVER ONE_MINUTE
    FROM Tweets
    WINDOW ONE_MINUTE AS (PARTITION BY author RANGE INTERVAL '1' MINUTE PRECEDING)
    WHERE text LIKE '%#BigDataSpain%';
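To make the windowing in this query easier to follow, here is a plain-Python simulation of the same idea: for every tweet that mentions the hashtag, emit the author together with the number of that author's matching tweets in the preceding minute. It is a sketch of the semantics under the assumption that tweets arrive ordered by timestamp. It is not how Kinesis Analytics is implemented, and the sample tweets are made up.

```python
from collections import defaultdict, deque


def windowed_author_counts(tweets, hashtag="#BigDataSpain", window_seconds=60):
    """Yield (author, count) per matching tweet over a sliding time window.

    `tweets` must be ordered by timestamp; each is (seconds, author, text).
    """
    recent = defaultdict(deque)  # author -> timestamps still inside the window
    for ts, author, text in tweets:
        if hashtag not in text:  # plays the role of the WHERE ... LIKE filter
            continue
        timestamps = recent[author]
        timestamps.append(ts)
        # Drop timestamps that fell out of the one-minute range.
        while timestamps[0] < ts - window_seconds:
            timestamps.popleft()
        yield author, len(timestamps)


sample_tweets = [
    (0, "ana", "hello #BigDataSpain"),
    (10, "ana", "more #BigDataSpain"),
    (20, "bob", "#BigDataSpain talk"),
    (30, "ana", "no hashtag here"),
    (75, "ana", "late #BigDataSpain"),
]
results = list(windowed_author_counts(sample_tweets))
print(results)
```

Keeping one deque per author mirrors the `PARTITION BY author` clause: each author's window is tracked independently.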
  28. Amazon Athena – Analyze Data in S3
    • Interactive queries
    • ANSI SQL
    • No infrastructure or administration
    • Zero spin up time
    • Query data in its raw format
    • AVRO, Text, CSV, JSON, weblogs, AWS service logs
    • Convert to an optimized form like ORC or Parquet for the best performance and lowest cost
    • No loading of data, no ETL required
    • Stream data directly from Amazon S3, take advantage of Amazon S3 durability and availability
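Submitting a query to Athena programmatically comes down to a SQL string, a database, and an S3 location for results. The sketch below builds the request arguments for the `StartQueryExecution` API. The table, database, and results bucket are hypothetical, and the actual call is shown only as a comment so the example runs without AWS access.

```python
def athena_query_request(sql, database, output_location):
    """Build keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


request = athena_query_request(
    sql="SELECT status, count(*) AS hits FROM weblogs GROUP BY status",
    database="example_db",  # hypothetical database name
    output_location="s3://example-athena-results/",  # hypothetical bucket
)
print(request)

# With boto3 installed and credentials configured:
#   boto3.client("athena").start_query_execution(**request)
```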
  29. Simple query editor with syntax highlighting and autocomplete; Data Catalog; Query History, Saved Queries, and Catalog Management
  30. Using Amazon Athena with Amazon QuickSight
    QuickSight allows you to connect to data from a wide variety of AWS, third-party, and on-premises sources including Amazon Athena
    Sources shown: Amazon RDS, Amazon S3, Amazon Redshift, Amazon Athena
  31. None
  32. Building Smarter Applications

  33. Add Machine Learning Capabilities
    • Amazon Machine Learning Service: Batch and online predictions; train using data in S3, RDS and Redshift
    • Amazon EMR: Comprehensive machine learning libraries (e.g. Spark MLlib, Anaconda); provision analytics clusters in minutes, autoscale with data volume or query demand
  34. Amazon AI Services
    • Amazon Polly – Lifelike Text-to-Speech: 47 voices, 24 languages; low-latency, real time
    • Amazon Rekognition – Image Analysis: Object and scene detection; facial analysis
    • Amazon Lex – Conversational Engine: Speech and text recognition; enterprise connectors
  35. Facial Analysis with Rekognition
    • Demographic Data
    • Facial Landmarks
    • Sentiment Expressed
    • Image Quality (Brightness: 25.84, Sharpness: 160)
    • General Attributes
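The image-quality values on this slide correspond to the `Quality` block that Rekognition's `DetectFaces` operation returns for each face. The sketch below pulls those two numbers out of a response. The sample response is a hand-written stand-in that reuses the slide's values, not real service output.

```python
def image_quality(response):
    """Return (brightness, sharpness) for each face in a DetectFaces response."""
    return [
        (face["Quality"]["Brightness"], face["Quality"]["Sharpness"])
        for face in response.get("FaceDetails", [])
        if "Quality" in face
    ]


# Hand-written stand-in for a response, using the values shown on the slide.
sample_response = {
    "FaceDetails": [{"Quality": {"Brightness": 25.84, "Sharpness": 160}}]
}
print(image_quality(sample_response))

# With boto3 installed and credentials configured, a real response could come from:
#   boto3.client("rekognition").detect_faces(
#       Image={"S3Object": {"Bucket": "example-bucket", "Name": "face.jpg"}},
#       Attributes=["ALL"],
#   )
```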
  36. AWS Deep Learning AMI
    • Up to ~40k CUDA cores
    • Pre-configured CUDA drivers
    • Jupyter notebook with Python2, Python3, Anaconda
    • CloudFormation Template
    • AWS Marketplace – one-click deploy
  37. Kinesis Firehose; Athena Query Service; Glue; Machine Learning; Predictive analytics; Amazon AI
    • Data Ingestion: Get your data into S3 quickly and securely
    • Storage & Catalog: Secure, cost-effective storage in Amazon S3. Robust metadata in AWS Catalog
    • Processing & Analytics: Use of predictive and prescriptive analytics to gain better understanding
    • Data Access & Authorisation: Give your users easy and secure access
    • Protect and Secure: Use entitlements to ensure data is secure and users’ identities are verified
  38. Thank You Full Stack Analytics on AWS