
Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017


Building analytics applications requires more than just one good service. It requires the ability to capture a vast amount of data, and react to data changes in real time.

https://www.bigdataspain.org/2017/talk/full-stack-analytics-on-amazon-web-services

Big Data Spain 2017
November 16th - 17th Kinépolis Madrid

Big Data Spain

December 18, 2017


Transcript


  2. Full Stack Analytics on AWS
    Ian Robinson
    Specialist Solutions Architect, Data and Analytics, EMEA
    November 17th, 2017
    © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.


  3. Forces and Trends Prompting the Move to Cloud
    Cost Optimization
    Licenses
    Hardware
    Data center and operations
    Dark Data
    Prematurely discarding data
    Agility
    Experimentation (data & tools)
    Democratised Access to Data
    Time-to-first-results
    Terminate failed experiments early
    From BI to Data Science
    In-house data science
    From back office to product


  4. Storage is the Gravity for Cloud Applications
    Store all your data, for ever, at every stage of its lifecycle
    Apply it using the right tool for the job


  5. Storage is Job #1


  6. Object Storage is Foundational


  7. Events and Lifecycle Management
    Standard: active data
    Standard - Infrequent Access: infrequently accessed data
    Amazon Glacier: archive data
    Create
    Delete
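
The tiering on this slide (Standard, then Standard - Infrequent Access, then Amazon Glacier, then deletion) can be expressed as a lifecycle rule. The sketch below builds a rule in the dictionary shape that boto3's `put_bucket_lifecycle_configuration` accepts, and simulates which storage class an object of a given age would be in. The prefix, rule ID, and day thresholds are illustrative assumptions, not values from the talk, and no AWS call is made.

```python
import json

# Hypothetical rule: the prefix, ID, and day thresholds are assumptions
# chosen for illustration; they do not come from the talk.
LIFECYCLE_CONFIGURATION = {
    "Rules": [
        {
            "ID": "tier-then-expire-raw-events",
            "Filter": {"Prefix": "raw/events/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 3650},
        }
    ]
}


def storage_class_for_age(rule, age_days):
    """Return the storage class an object of this age would be in,
    or None once the rule has expired (deleted) the object."""
    expiration_days = rule.get("Expiration", {}).get("Days")
    if expiration_days is not None and age_days >= expiration_days:
        return None
    current = "STANDARD"
    # Apply transitions in ascending order of age threshold.
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


if __name__ == "__main__":
    rule = LIFECYCLE_CONFIGURATION["Rules"][0]
    for age in (0, 45, 400, 4000):
        print(age, storage_class_for_age(rule, age))
    print(json.dumps(LIFECYCLE_CONFIGURATION, indent=2))
```

With real credentials, the same dictionary could be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`; that call is left out here so the sketch runs offline.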


  8. S3 as the Data Lake Fabric
    • Unlimited number of objects and volume
    • 99.99% availability
    • 99.999999999% durability
    • Versioning
    • Tiered storage via lifecycle policies
    • SSL, client/server-side encryption at rest
    • Low cost (just over $2700/month for 100TB)
    • Natively supported by big data frameworks (Spark, Hive, Presto, etc)
    • Decouples storage and compute
    • Run transient compute clusters (with Amazon EC2 Spot Instances)
    • Multiple, heterogeneous clusters can use same data
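
To make the availability, durability, and cost figures on this slide concrete, here is a small arithmetic sketch. It assumes a 30-day month, reads the durability figure as an annual per-object figure (which is how AWS describes it), and assumes 1024 GB per TB; the per-GB price is only what the slide's "$2700/month for 100TB" implies, not a quoted price.

```python
from fractions import Fraction

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def failure_fraction(percent):
    """Turn a percentage string such as "99.99" into the exact
    complementary fraction (here 1/10000), avoiding float rounding."""
    return 1 - Fraction(percent) / 100


def max_downtime_minutes(availability_percent,
                         period_minutes=MINUTES_PER_30_DAY_MONTH):
    """Downtime allowed by an availability figure over the period."""
    return float(failure_fraction(availability_percent) * period_minutes)


def expected_objects_lost_per_year(object_count, durability_percent):
    """Expected number of objects lost per year at the given durability."""
    return float(object_count * failure_fraction(durability_percent))


def implied_price_per_gb(monthly_cost_usd, terabytes, gb_per_tb=1024):
    """Per-GB monthly price implied by a total monthly cost."""
    return monthly_cost_usd / (terabytes * gb_per_tb)


if __name__ == "__main__":
    # 99.99% availability allows about 4.32 minutes of downtime per 30 days.
    print(max_downtime_minutes("99.99"))
    # Eleven nines: 10 million objects lose on average 0.0001 objects a year,
    # i.e. one object every 10,000 years.
    print(expected_objects_lost_per_year(10_000_000, "99.999999999"))
    # The slide's cost figure implies roughly 2.6 cents per GB-month.
    print(implied_price_per_gb(2700, 100))
```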


  9. Automated Data Ingestion
    Database Migration Service


  10. Stream Events to S3 Using Kinesis Firehose


  11. Write Database Changes to S3 with DMS
    Full Load: //LOAD001.csv, //LOAD002.csv
    Change Data Capture: //.csv
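
A downstream job typically has to combine the full-load files with the change data capture (CDC) files to reconstruct the current table. The sketch below assumes the CDC rows begin with an operation column (`I`, `U`, or `D`), which is how DMS writes CDC files to S3, and that the first data column is the primary key. The table contents are invented, and in-memory strings stand in for the S3 objects because the file paths on the slide are elided in this transcript.

```python
import csv
import io

# Hypothetical sample data: a two-column table (id, name).
FULL_LOAD_CSV = "1,alice\n2,bob\n3,carol\n"
CDC_CSV = "I,4,dave\nU,2,robert\nD,1,alice\n"


def load_snapshot(full_load_text):
    """Read full-load rows into a dict keyed by the first column."""
    reader = csv.reader(io.StringIO(full_load_text))
    return {row[0]: row[1:] for row in reader}


def apply_changes(snapshot, cdc_text):
    """Replay CDC rows in order on a copy of the snapshot."""
    table = dict(snapshot)
    for op, key, *values in csv.reader(io.StringIO(cdc_text)):
        if op in ("I", "U"):      # insert or update: upsert the row
            table[key] = values
        elif op == "D":           # delete: drop the row if present
            table.pop(key, None)
        else:
            raise ValueError(f"unknown CDC operation: {op!r}")
    return table


if __name__ == "__main__":
    current = apply_changes(load_snapshot(FULL_LOAD_CSV), CDC_CSV)
    for key in sorted(current):
        print(key, current[key])
```

Because the input files are never modified, the same replay can be rerun at any time, which fits the talk's emphasis on immutable data at every stage.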


  12. Storage + Catalog
    Scalable (secure, versioned, durable) storage
    + Immutable data at every stage of its lifecycle
    + Versioned schema and metadata
    = Data discovery, lineage


  13. AWS Glue
    • Data Catalog: discover and store metadata
    • Job Authoring: auto-generated ETL code
    • Job Execution: serverless scheduling and execution


  14. Glue Data Catalog
    Hive metastore-compatible, highly available metadata repository:
    • Classification for identifying and parsing files
    • Versioning of table metadata as schemas evolve
    • Table definitions usable by Redshift, Athena, Glue, EMR
    Populate using Hive DDL, bulk import, or automatically through crawlers.


  15. Crawlers: Automatic Schema Inference
    [Diagram: enumerate S3 objects (file 1, file 2, file N) → identify file type and parse files → semi-structured per-file schema → semi-structured unified schema]
    Custom classifiers: app log parser, metrics parser
    System classifiers: JSON parser, CSV parser, Apache log parser
    Field types shown in the schemas: int, char, bool, array, struct
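
The two-stage idea on this slide, a schema per file followed by a unified schema, can be illustrated with a toy inference over JSON records. This is a minimal sketch of the concept only; it is not Glue's actual algorithm, and the type names simply echo the labels in the slide's diagram.

```python
import json


def infer_type(value):
    """Map a parsed JSON value to a simple type name."""
    if isinstance(value, bool):      # bool first: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "char"
    if isinstance(value, list):
        element_types = sorted({infer_type(item) for item in value})
        return "array<" + "|".join(element_types) + ">"
    if isinstance(value, dict):
        fields = ",".join(
            f"{name}:{infer_type(item)}" for name, item in sorted(value.items())
        )
        return "struct<" + fields + ">"
    return "null"


def infer_file_schema(json_lines):
    """Per-file schema: each field maps to the set of types seen for it."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(infer_type(value))
    return schema


def unify(schemas):
    """Unified schema: the union of fields and types across all files."""
    unified = {}
    for schema in schemas:
        for field, types in schema.items():
            unified.setdefault(field, set()).update(types)
    return unified


if __name__ == "__main__":
    file_1 = ['{"id": 1, "tags": ["a", "b"]}']
    file_2 = ['{"id": 2, "active": true, "user": {"name": "x", "age": 3}}']
    per_file = [infer_file_schema(f) for f in (file_1, file_2)]
    for field, types in sorted(unify(per_file).items()):
        print(field, sorted(types))
```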


  16. Indexing and Searching Using Metadata
    [Diagram: Amazon S3 (ObjectCreated, ObjectDeleted) → AWS Lambda (PutItem) → Metadata Index (Amazon DynamoDB) → Update Stream → AWS Lambda (Extract Search Fields, Update Index) → Search Index (Amazon Elasticsearch)]
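
The first hop of this pipeline, an S3 event handled by a Lambda function that maintains a metadata index, might look like the sketch below. A plain dictionary stands in for the DynamoDB table so the example runs offline, and the bucket name and item fields are assumptions. Note that S3 reports deletions with event names beginning `ObjectRemoved`, which the slide labels ObjectDeleted.

```python
from urllib.parse import unquote_plus

# Stand-in for the DynamoDB metadata table (assumption: keyed by bucket/key).
METADATA_INDEX = {}


def handler(event, context=None):
    """Process S3 event notification records and keep the index in sync."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        item_id = f"{bucket}/{key}"
        event_name = record["eventName"]
        if event_name.startswith("ObjectCreated"):
            METADATA_INDEX[item_id] = {
                "bucket": bucket,
                "key": key,
                "size": record["s3"]["object"].get("size", 0),
                "event_time": record["eventTime"],
            }
        elif event_name.startswith("ObjectRemoved"):
            METADATA_INDEX.pop(item_id, None)
    return len(METADATA_INDEX)


def make_event(event_name, key, size=None):
    """Build a minimal event with only the fields the handler reads."""
    s3_object = {"key": key}
    if size is not None:
        s3_object["size"] = size
    return {
        "Records": [
            {
                "eventName": event_name,
                "eventTime": "2017-11-17T10:00:00.000Z",
                "s3": {"bucket": {"name": "example-data-lake"},
                       "object": s3_object},
            }
        ]
    }


if __name__ == "__main__":
    handler(make_event("ObjectCreated:Put", "logs/2017/app+log.json", 2048))
    print(METADATA_INDEX)
    handler(make_event("ObjectRemoved:Delete", "logs/2017/app+log.json"))
    print(METADATA_INDEX)
```

In the architecture on the slide, the dictionary write would be a DynamoDB `PutItem`, and a second function would consume the table's update stream to refresh the search index.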


  17. Security is Job #0


  18. Data Access & Authorisation
    Give your users easy and secure access
    Storage & Catalog
    Secure, cost-effective storage in Amazon S3.
    Robust metadata in AWS Catalog
    Protect and Secure
    Use entitlements to ensure data is secure and users’ identities are verified


  19. Identity and Access Management
    • Manage users, groups, and roles
    • Identity federation with OpenID
    • Temporary credentials with Amazon Security Token Service (Amazon STS)
    • Stored policy templates
    • Powerful policy language
    • Amazon S3 bucket policies
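
As an illustration of the policy language mentioned above, the sketch below builds an IAM policy document that grants read-only access to a single prefix of a data lake bucket. The bucket name, prefix, and statement IDs are hypothetical; the document structure (`Version`, `Statement`, `Effect`, `Action`, `Resource`, `Condition`) follows the standard IAM JSON policy format.

```python
import json


def read_only_prefix_policy(bucket, prefix):
    """Return a policy allowing list and read access under one prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                # Restrict listing to keys under the prefix.
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket and prefix, for illustration only.
    policy = read_only_prefix_policy("example-data-lake", "curated/sales/")
    print(json.dumps(policy, indent=2))
```

Anything not explicitly allowed is denied by default, so a role carrying only this policy could read the curated sales data and nothing else in the bucket.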


  20. Security at the Data Level
    IAM: Service API Access
    Amazon S3, Amazon ElastiCache, Amazon DynamoDB, Amazon EMR, Amazon Kinesis, Amazon Athena


  21. Storage Level Support for Access Logging and Audit
    Third Party Ecosystem Security Tools
    Amazon S3
    AWS CloudTrail
    http://amzn.to/2tSimHj
    Amazon Athena
    Access Logging
    API Logging
    Access Log Analytics
    IAM
    Amazon EMR
    http://amzn.to/2si6RqS


  22. Encryption Options
    AWS Server-Side encryption
    • AWS managed key infrastructure
    AWS Key Management Service
    • Automated key rotation & auditing
    • Integration with other AWS services
    AWS CloudHSM
    • Dedicated Tenancy SafeNet Luna SA HSM Device
    • Common Criteria EAL4+, NIST FIPS 140-2


  23. Serverless Processing and Analytics


  24. Managed ETL with AWS Glue
    • Python code generated by AWS Glue
    • Connect a notebook or IDE to AWS Glue
    • Existing code brought into AWS Glue


  25. Job Execution with AWS Glue
    • Schedule-based
    • Event-based
    • On demand


  26. Amazon Kinesis Analytics
    • Interact with streaming data in real time using SQL
    • Build fully managed and elastic stream processing applications that process data for real-time visualizations and alarms


  27. Amazon Kinesis Analytics – Simple SQL Interface
    SELECT STREAM author,
      count(author) OVER ONE_MINUTE
    FROM Tweets
    WHERE text LIKE '%#BigDataSpain%'
    WINDOW ONE_MINUTE AS
      (PARTITION BY author
       RANGE INTERVAL '1' MINUTE PRECEDING);
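
To show what this sliding-window query computes, here is a plain-Python simulation: for every tweet containing the hashtag, it emits the author together with the number of that author's matching tweets in the preceding minute. It is a sketch of the semantics only, not how Kinesis Analytics executes the query. The input is assumed to arrive in timestamp order, timestamps are plain seconds, and treating a tweet exactly 60 seconds old as still inside the window is an assumption.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60


def sliding_counts(tweets, hashtag="#BigDataSpain"):
    """Yield (author, count) for each matching tweet, where count is the
    number of that author's matching tweets in the last WINDOW_SECONDS."""
    recent = defaultdict(deque)  # author -> timestamps still in the window
    for timestamp, author, text in tweets:
        if hashtag not in text:          # the WHERE ... LIKE filter
            continue
        window = recent[author]          # the PARTITION BY author
        window.append(timestamp)
        # Evict timestamps older than the window (RANGE ... PRECEDING).
        while window[0] < timestamp - WINDOW_SECONDS:
            window.popleft()
        yield author, len(window)


if __name__ == "__main__":
    # Hypothetical sample stream: (seconds, author, text).
    tweets = [
        (0, "ana", "Doors open at #BigDataSpain"),
        (30, "ana", "Keynote starting #BigDataSpain"),
        (40, "ben", "Coffee first"),
        (50, "ben", "Great S3 talk #BigDataSpain"),
        (95, "ana", "Wrapping up #BigDataSpain"),
    ]
    for author, count in sliding_counts(tweets):
        print(author, count)
```

Unlike a tumbling window, a sliding window produces one output row per input row, which is why the count for an author can fall again once earlier tweets age out.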


  28. Amazon Athena – Analyze Data in S3
    • Interactive queries
    • ANSI SQL
    • No infrastructure or administration
    • Zero spin up time
    • Query data in its raw format
    • AVRO, Text, CSV, JSON, weblogs, AWS service logs
    • Convert to an optimized form like ORC or Parquet for the best performance and lowest cost
    • No loading of data, no ETL required
    • Stream data directly from Amazon S3, take advantage of Amazon S3 durability and availability
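
The advice to convert raw data to a columnar format follows from the fact that Athena charges by the amount of data scanned. The toy model below compares the bytes a query must read to fetch one column from a row-oriented text file with the bytes needed when each column is stored separately. It is a deliberately simplified illustration: it ignores the compression, encoding, and file metadata that real ORC and Parquet files use, and the sample rows are invented.

```python
# Hypothetical sample rows.
ROWS = [
    {"author": "ana", "text": "hello", "retweets": 3},
    {"author": "ben", "text": "world!", "retweets": 10},
]


def bytes_scanned_row_layout(rows):
    """Row-oriented text (like CSV): every line is read in full, whichever
    column the query needs. One extra byte per line for the newline."""
    total = 0
    for row in rows:
        line = ",".join(str(value) for value in row.values())
        total += len(line.encode("utf-8")) + 1
    return total


def bytes_scanned_column_layout(rows, wanted_column):
    """Column-oriented layout: only the requested column's values are read.
    One extra byte per value as a separator."""
    return sum(len(str(row[wanted_column]).encode("utf-8")) + 1 for row in rows)


if __name__ == "__main__":
    print("row layout:", bytes_scanned_row_layout(ROWS))
    print("column layout:", bytes_scanned_column_layout(ROWS, "retweets"))
```

The gap widens as tables gain more and wider columns, since a query that touches a few columns skips the rest entirely.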


  29. Simple query editor with syntax highlighting and autocomplete
    Data Catalog
    Query History, Saved Queries, and Catalog Management


  30. Using Amazon Athena with Amazon QuickSight
    QuickSight allows you to connect to data from a wide variety of AWS, third-party, and on-premises sources, including Amazon Athena:
    Amazon RDS
    Amazon S3
    Amazon Redshift
    Amazon Athena



  32. Building Smarter Applications


  33. Add Machine Learning Capabilities
    Amazon Machine Learning Service
    Batch and online predictions
    Train using data in S3, RDS and Redshift
    Amazon EMR
    Comprehensive machine learning libraries (e.g. Spark MLlib, Anaconda)
    Provision analytics clusters in minutes, autoscale with data volume or query demand


  34. Amazon AI Services
    Amazon Polly – Lifelike Text-to-Speech
    47 voices, 24 languages
    Low-latency, real time
    Amazon Rekognition – Image Analysis
    Object and scene detection
    Facial analysis
    Amazon Lex – Conversational Engine
    Speech and text recognition
    Enterprise connectors


  35. Facial Analysis with Rekognition
    Demographic Data
    Facial Landmarks
    Sentiment Expressed
    Image Quality
    Brightness: 25.84
    Sharpness: 160
    General Attributes
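
The categories on this slide map onto fields of the response returned by Rekognition's face detection call. The sketch below summarises one entry of a hand-written sample shaped like a `DetectFaces` response (`AgeRange`, `Gender`, `Emotions`, `Quality`). Only the brightness and sharpness values come from the slide; every other value is invented, and no call to the service is made.

```python
# Hand-written sample in the shape of a DetectFaces response.
# Brightness and Sharpness are the slide's values; the rest is hypothetical.
SAMPLE_RESPONSE = {
    "FaceDetails": [
        {
            "AgeRange": {"Low": 26, "High": 38},
            "Gender": {"Value": "Female", "Confidence": 99.9},
            "Emotions": [
                {"Type": "HAPPY", "Confidence": 93.1},
                {"Type": "CALM", "Confidence": 4.2},
            ],
            "Quality": {"Brightness": 25.84, "Sharpness": 160},
        }
    ]
}


def summarise_face(face):
    """Reduce one FaceDetails entry to the attributes named on the slide."""
    top_emotion = max(face["Emotions"], key=lambda e: e["Confidence"])
    return {
        "age_range": (face["AgeRange"]["Low"], face["AgeRange"]["High"]),
        "gender": face["Gender"]["Value"],
        "sentiment": top_emotion["Type"],
        "brightness": face["Quality"]["Brightness"],
        "sharpness": face["Quality"]["Sharpness"],
    }


if __name__ == "__main__":
    for face in SAMPLE_RESPONSE["FaceDetails"]:
        print(summarise_face(face))
```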


  36. AWS Deep Learning AMI
    Up to ~40k CUDA cores
    Pre-configured CUDA drivers
    Jupyter notebook with Python2, Python3, Anaconda
    CloudFormation Template
    AWS Marketplace – one-click deploy


  37. Kinesis Firehose
    Athena Query Service
    Glue
    Machine Learning
    Predictive analytics
    Data Access & Authorisation
    Give your users easy and secure access
    Data Ingestion
    Get your data into S3 quickly and securely
    Processing & Analytics
    Use of predictive and prescriptive analytics to gain better understanding
    Protect and Secure
    Use entitlements to ensure data is secure and users’ identities are verified
    Amazon AI
    Storage & Catalog
    Secure, cost-effective storage in Amazon S3.
    Robust metadata in AWS Catalog


  38. Thank You
    Full Stack Analytics on AWS
