Amazon Redshift evolution history and future direction

jozono
July 06, 2021

2021.04.06 Data Engineering Study #7
Public seminar hosted by Forkwell and primeNumber

Presentation deck "Amazon Redshift evolution history and future direction"

Some people say:
"I know little about Redshift in the first place."
"What's going on with Redshift recently?"

This deck recaps the history of Amazon Redshift's evolution
and gives an update on the latest Amazon Redshift feature releases.

Transcript

  1. © 2021, Amazon Web Services, Inc. or its Affiliates.
    Amazon Web Services Japan K.K.
    Senior Solutions Architect, Analytics
    Junpei Ozono
    Amazon Redshift evolution
    history and future direction

  2. Today's Session
    Some people say:
    "I know little about Redshift in the first place."
    "What's going on with Redshift recently?"
    This session recaps the history of Amazon Redshift's evolution
    and gives an update on the latest Amazon Redshift feature releases.

  3. History of Amazon Redshift Evolution
    Timeline: 2012, 2017, 2019, 2020, 2021, Future

  4. Typical data analysis architecture in the past
    Data Source: Relational Databases
    Collect/Store/Analyze: Data Warehouse
    Visualize: Business Intelligence
    • Analysis targeted structured data in relational databases
    • That data was gathered into a data warehouse, then analyzed and visualized with BI tools
    • On-premises centric as of 2012

  5. © 2021, Amazon Web Services, Inc. or its Affiliates.

  6. Introducing Amazon Redshift
    Data Source: Relational Databases
    Collect/Store/Analyze: Amazon Redshift
    Visualize: Business Intelligence
    A fast, scalable, and cost-effective managed data warehouse service
    ✓ Petabyte scale
    ✓ Compatibility with PostgreSQL
    ✓ Connectivity with typical third-party tools
    ✓ Priced at less than 1/10th of a traditional DWH (at the time)

  7. Amazon Redshift Architecture at Service Launch
    Clients connect to Amazon Redshift via JDBC/ODBC
    Shared Nothing + MPP (Massively Parallel Processing) Architecture
    An approach to increasing processing throughput for analytic queries by distributing
    data across multiple compute nodes and processing it in parallel on each node
    Leader Node
    • Query endpoint
    • Generates and deploys SQL processing code
    Compute nodes
    • Local columnar storage
    • Parallel execution of queries
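
    A minimal sketch of how distribution across compute nodes is typically declared in Redshift SQL. The `sales` table, its columns, and the choice of `store` as the distribution key are illustrative assumptions and do not come from the deck.

    ```sql
    -- Hypothetical table: rows with the same store value are placed together,
    -- so each compute node holds a slice of the data in local columnar storage.
    CREATE TABLE sales (
        item  VARCHAR(16),
        store VARCHAR(16),
        cust  VARCHAR(16),
        price DECIMAL(10,2)
    )
    DISTSTYLE KEY
    DISTKEY (store)
    SORTKEY (store);

    -- Each compute node aggregates its local rows in parallel;
    -- the leader node returns the combined result to the client.
    SELECT store, SUM(price) AS total_sales
    FROM sales
    GROUP BY store;
    ```

    Because the query groups by the distribution key, each node can likely compute its groups without exchanging rows with other nodes, which is the point of the shared-nothing design.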

  8. Amazon Redshift Architecture at Service Launch
    Clients connect to Amazon Redshift via JDBC/ODBC
    Amazon S3 (user bucket): COPY / UNLOAD of user data
    Amazon S3 (Redshift management buckets): Backup / Restore
    User data is loaded and unloaded through user-managed S3 buckets with COPY and UNLOAD
    Redshift uses service-managed S3 space for automatic backups and restores
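
    A short sketch of loading and unloading through a user-managed bucket. It uses the current `IAM_ROLE` form of the commands rather than whatever was available at launch; the bucket name, role ARN, and `sales` table are placeholders assumed for illustration.

    ```sql
    -- Load CSV files from a user-managed S3 prefix into a hypothetical table.
    COPY sales
    FROM 's3://example-user-bucket/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
    FORMAT AS CSV;

    -- Write query results back to the user-managed bucket as Parquet files.
    UNLOAD ('SELECT item, store, cust, price FROM sales')
    TO 's3://example-user-bucket/unload/sales_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
    FORMAT AS PARQUET;
    ```

    Backups and restores need no comparable statements, since they go through the service-managed S3 space.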

  9. History of Amazon Redshift Evolution
    2012: Redshift Announcement
    Next: 2017

  10. Changes in data and how it is used
    More data than you can imagine
    More diversity of data
    More users and more applications accessing your data
    • Data scientists, analysts, business users, applications
    Data analyzed in different ways
    • Machine learning, SQL analysis, real-time streaming, scientific computing
    * IDC, Data Age 2025: The Evolution of Data to Life-Critical: Don't Focus on Big Data, Focus on the Data That's Big, April 2017.

  11. Data Lake Architecture
    Data Source: Stream Data/Event Log, NoSQL Databases, Relational Databases, ...
    Collect/Store: Data Lake
    Analyze: Data Warehouse, Big Data Processing, Log Search, Machine Learning, ...
    Visualize: Business Intelligence, Business Application

  12. Data Lake Architecture
    Data Source: Stream Data/Event Log, NoSQL Databases, Relational Databases, ...
    Collect/Store: Data Lake
    Analyze: Data Warehouse, Big Data Processing, Log Search, Machine Learning, ...
    Visualize: Business Intelligence, Business Application
    More data than you can imagine, and more diversity of data:
    the idea of the data lake was born to make both easier to handle

  13. Data Lake Architecture
    Data Source: Stream Data/Event Log, NoSQL Databases, Relational Databases, ...
    Collect/Store: Data Lake
    Analyze: Data Warehouse, Big Data Processing, Log Search, Machine Learning, ...
    Visualize: Business Intelligence, Business Application
    More users and more applications access your data, and analyze it in different ways:
    leverage purpose-built data stores such as a data warehouse and other analytics services

  14. Data warehousing and data lake work together
    Data Source: Stream Data/Event Log, NoSQL Databases, Relational Databases, ...
    Collect/Store: Amazon S3
    Analyze: Amazon Redshift, Big Data Processing, Log Search, Machine Learning, ...
    Visualize: Business Intelligence, Business Application
    It has become difficult to store all data in the data warehouse (Redshift).
    If queries can run directly against data in the data lake (S3), more and
    more data can be analyzed while keeping costs down.

  15. Extend your architecture to a data lake with Redshift Spectrum
    Clients connect to Amazon Redshift via JDBC/ODBC
    Applications transparently access data in both the data warehouse and the data lake
    Amazon Redshift Spectrum
    • Parallel query execution engine for files on S3
    Data Lake
    • User-managed S3 buckets
    • Open-format files (Parquet, ORC, JSON, CSV, etc.)
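
    A minimal sketch of querying data lake files alongside local tables with Redshift Spectrum. The schema name `lake`, the catalog database, the S3 location, the role ARN, and both tables are hypothetical names chosen for illustration.

    ```sql
    -- Register an external schema backed by the data catalog (names are placeholders).
    CREATE EXTERNAL SCHEMA lake
    FROM DATA CATALOG
    DATABASE 'example_lake_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    -- Describe Parquet files that stay in the user-managed bucket.
    CREATE EXTERNAL TABLE lake.sales_history (
        item  VARCHAR(16),
        store VARCHAR(16),
        cust  VARCHAR(16),
        price DECIMAL(10,2)
    )
    STORED AS PARQUET
    LOCATION 's3://example-user-bucket/sales_history/';

    -- One query spans the local table and the files on S3.
    SELECT store, SUM(price) AS total_sales
    FROM (
        SELECT store, price FROM sales
        UNION ALL
        SELECT store, price FROM lake.sales_history
    ) AS all_sales
    GROUP BY store;
    ```

    The application sees ordinary schemas and tables either way, which is what "transparently" refers to on the slide.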

  16. History of Amazon Redshift Evolution
    2012: Redshift Announcement
    2017: Redshift Spectrum
    Next: 2019

  17. Challenges of Parallel Workloads
    Data Source: Stream Data/Event Log, NoSQL Databases, Relational Databases, ...
    Collect/Store: Amazon S3
    Analyze: Amazon Redshift, Big Data Processing, Log Search, Machine Learning, ...
    Visualize: Business Intelligence, Business Application
    More users and applications now access the data warehouse.
    When workloads burst, overall throughput on the cluster may drop.
    e.g. Most users access the data warehouse simultaneously at 9:00 am every
    Monday.

  18. Concurrency Scaling automatically scales compute during peak hours
    Amazon Redshift: the main cluster dispatches queries to additional clusters (1-10)
    When queries on a Redshift cluster burst and there are not enough resources
    to run them, Redshift automatically launches additional cluster(s) behind the scenes and
    processes the queries without waiting.
    Up to 1 hour per day is free of charge, and the cost can be controlled.

  19. History of Amazon Redshift Evolution
    2012: Redshift Announcement
    2017: Redshift Spectrum
    2019: Concurrency Scaling
    Next: 2020

  20. Redshift scaling challenges
    Data Source: Stream Data/Event Log, NoSQL Databases, Relational Databases, ...
    Collect/Store: Amazon S3
    Analyze: Amazon Redshift, Big Data Processing, Log Search, Machine Learning, ...
    Visualize: Business Intelligence, Business Application
    If you have more data on Redshift and want to add more
    storage capacity, you can resize the cluster to add more nodes
    with storage space.
    However, the Redshift architecture at the time didn't allow storage and
    compute to be scaled separately.

  21. (Recap) Amazon Redshift Architecture
    Clients connect to Amazon Redshift via JDBC/ODBC
    Leader Node
    • Query endpoint
    • Generates and deploys SQL processing code
    Compute nodes
    • Local columnar storage
    • Parallel execution of queries

  22. RA3 instances with managed storage
    Clients connect to Amazon Redshift via JDBC/ODBC
    Leader Node
    • Query endpoint
    • Generates and deploys SQL processing code
    Compute nodes (Nitro-based hardware)
    • High-speed local SSD cache + large volume of RAM + high-bandwidth networking
    • Parallel execution of queries
    Managed Storage (reached over high-bandwidth networking)
    • Redshift-managed S3 bucket
    • Redshift-format files
    Size the data warehouse based only on steady-state compute needs
    Scale and pay independently for compute and storage
    Frequently accessed data is automatically cached on the compute nodes

  23. Linking data warehouses and operational databases
    Data Source: Stream Data/Event Log, NoSQL Databases, Amazon Aurora / RDS, ...
    Collect/Store: Amazon S3
    Analyze: Amazon Redshift, Big Data Processing, Log Search, Machine Learning, ...
    Visualize: Business Intelligence, Business Application
    Not all data is loaded into a data lake or data warehouse
    in real time, so being able to directly query the latest data in an
    operational database enables even more analysis.

  24. Amazon Redshift Federated Query
    Unified analytics across databases, data warehouse, and data lake
    Clients connect to Amazon Redshift via JDBC/ODBC
    Data reachable from Amazon Redshift: Amazon RDS (PostgreSQL, MySQL),
    Amazon Aurora (PostgreSQL, MySQL), Amazon S3 data lake
    • Analyze live data without data movement
    • Query data directly on Amazon RDS/Aurora PostgreSQL from Amazon Redshift
    • Secure, high-performance data access
    • Amazon RDS/Aurora MySQL support (preview)
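
    A minimal sketch of a federated query against an Aurora or RDS PostgreSQL database. The schema name `ops_pg`, the endpoint, the database, the secret and role ARNs, and the `orders` and `sales` tables are all hypothetical placeholders.

    ```sql
    -- Map a PostgreSQL schema into Redshift as an external schema.
    CREATE EXTERNAL SCHEMA ops_pg
    FROM POSTGRES
    DATABASE 'appdb' SCHEMA 'public'
    URI 'example-aurora.cluster-abc123.us-east-1.rds.amazonaws.com' PORT 5432
    IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
    SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:example-pg-secret';

    -- Join live operational rows with a local warehouse table, without copying data.
    SELECT o.order_id, o.status, s.price
    FROM ops_pg.orders AS o
    JOIN sales AS s ON s.item = o.item;
    ```

    The database credentials stay in AWS Secrets Manager and are referenced by ARN, so they do not appear in the SQL.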

  25. History of Amazon Redshift Evolution
    2012: Redshift Announcement
    2017: Redshift Spectrum
    2019: Concurrency Scaling
    2020: RA3, Federated Query
    Next: 2021

  26. Data sharing across multiple clusters
    Data Source: Stream Data/Event Log, NoSQL Databases, Relational Databases, ...
    Collect/Store: Amazon S3
    Analyze: multiple Amazon Redshift clusters, Machine Learning, ...
    Visualize: Business Intelligence, Business Application
    Multiple Redshift clusters may be required for various reasons:
    • Complete separation of workloads
    • Management by different departments
    • Separate environments for production, development, etc.
    To share data between these clusters, you had to transfer data from cluster
    to cluster.

  27. Amazon Redshift Data Sharing
    Secure and easy data sharing across Redshift clusters
    Producer cluster (RA3 instances): leader node and compute nodes
    Consumer cluster (RA3 instances): leader node and compute nodes
    Amazon Redshift Managed Storage: read shared data; read and write private data
    • The producer pays for Amazon Redshift managed storage, and consumers pay for
      their consumer clusters
    • Workloads accessing shared data are isolated from each other and from the producer
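
    A minimal sketch of the producer and consumer sides of a datashare. The share name, the `sales` table, the database name, and the namespace identifiers are placeholders; real namespace values are GUIDs taken from each cluster.

    ```sql
    -- On the producer cluster: create a datashare and add objects to it.
    CREATE DATASHARE sales_share;
    ALTER DATASHARE sales_share ADD SCHEMA public;
    ALTER DATASHARE sales_share ADD TABLE public.sales;
    GRANT USAGE ON DATASHARE sales_share
        TO NAMESPACE '11111111-2222-3333-4444-555555555555';  -- consumer namespace (placeholder)

    -- On the consumer cluster: expose the share as a database and read it.
    CREATE DATABASE sales_db
    FROM DATASHARE sales_share
        OF NAMESPACE 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee';  -- producer namespace (placeholder)

    SELECT store, SUM(price) AS total_sales
    FROM sales_db.public.sales
    GROUP BY store;
    ```

    The consumer query runs on the consumer's own compute nodes, which matches the workload isolation described above.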

  28. Redshift Automated Performance Tuning
    ML-based optimizations to get started easily and get the fastest performance quickly
    • Automates physical data design and optimization
    • Optimizes for peak performance as data and workloads scale
    • Leverages machine learning to adapt to shifting workloads
    Automated performance tuning features
    • Automatic sort keys
    • Automatic distribution keys
    • Automatic vacuum delete
    • Automatic table sort
    • Auto Workload Manager
    • MV auto-refresh and rewrite
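
    A short sketch of how automatic distribution and sort keys are usually requested in SQL. The table names `sales_auto` and `sales` and their columns are assumptions for illustration.

    ```sql
    -- New table: let Redshift choose the distribution style and sort key.
    CREATE TABLE sales_auto (
        item  VARCHAR(16),
        store VARCHAR(16),
        cust  VARCHAR(16),
        price DECIMAL(10,2)
    )
    DISTSTYLE AUTO
    SORTKEY AUTO;

    -- Existing table (hypothetical): hand key selection over to Redshift.
    ALTER TABLE sales ALTER DISTSTYLE AUTO;
    ALTER TABLE sales ALTER SORTKEY AUTO;
    ```

    Automatic vacuum delete and automatic table sort run in the background and need no statement of their own.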

  29. Practical Materialized View
    Physical view to speed up frequently executed queries
    • Join, Filter, Aggregate, Projection
    • Specify a different key than the base table
    • Reference external tables
    When the base table is updated, the associated
    materialized views are also refreshed automatically
    No need to be aware of the materialized view
    • Just query the table
    • Redshift rewrites the execution plan
      as needed to read from a materialized view
    Base table: sales
    item | store | cust | price
    i1   | s1    | c1   | 12.0
    i2   | s2    | c1   | 3.0
    i3   | s2    | c2   | 7.0
    Base table: store_info
    store | owner | loc
    s1    | Joe   | SF
    s2    | Ann   | NY
    s3    | Lisa  | SF
    Materialized view: loc_sales
    loc | total_sales
    SF  | 12.00
    NY  | 10.00
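
    A sketch of a definition that would produce `loc_sales` from the two base tables shown on the slide. The deck does not give the statement, so the join and the `AUTO REFRESH YES` option are assumptions inferred from the sample data (SF = 12.0; NY = 3.0 + 7.0).

    ```sql
    -- Pre-compute total sales per location from the two base tables.
    CREATE MATERIALIZED VIEW loc_sales
    AUTO REFRESH YES
    AS
    SELECT i.loc, SUM(s.price) AS total_sales
    FROM sales AS s
    JOIN store_info AS i ON s.store = i.store
    GROUP BY i.loc;

    -- A query written against the base tables; Redshift may rewrite it
    -- to read from loc_sales instead of recomputing the join.
    SELECT i.loc, SUM(s.price) AS total_sales
    FROM sales AS s
    JOIN store_info AS i ON s.store = i.store
    GROUP BY i.loc;
    ```

    `REFRESH MATERIALIZED VIEW loc_sales;` remains available when a manual refresh is preferred.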

  30. History of Amazon Redshift Evolution
    2012: Redshift Announcement
    2017: Redshift Spectrum
    2019: Concurrency Scaling
    2020: RA3, Federated Query
    2021: Data Sharing, Auto-Tuning
    Next: Future

  31. RA3 instances further enhancements
    Amazon Redshift RA3 compute nodes and Redshift Managed Storage: network bottlenecks?
    How can network performance penalties between compute nodes
    and managed storage be prevented?

  32. Advanced Query Accelerator (AQUA) (GA: 2021/04)
    New hardware-accelerated cache that delivers up to 10x better query performance
    than other cloud data warehouses
    Diagram: compute nodes, a scale-out layer of AQUA nodes with AWS-designed custom
    processors working in parallel, and Redshift Managed Storage
    • Minimizes data movement over the network by pushing down operations to AQUA nodes
    • AQUA nodes use custom AWS-designed analytics processors that make operations
      (compression, encryption, filtering, and aggregation) faster than on traditional CPUs
    • Available on ra3.16xlarge/ra3.4xlarge at no additional cost; no need to modify any
      SQL or application code

  33. SUPER data type (GA: 2021/04)
    Store semi-structured data in a table without a schema specification
    New data type: SUPER
    • Easy, efficient, and powerful JSON processing
    • Fast row-oriented data ingestion
    • Fast column-oriented analytics with materialized views over SUPER/JSON
    • Access to schema-less nested data with easy-to-use SQL extensions powered
      by the PartiQL query language
    Table: customers
    id (INTEGER) | name (SUPER)                          | phones (SUPER)
    1            | {"given": "Jane", "family": "Doe"}    | [{"type": "work", "num": "9255550100"}, {"type": "cell", "num": 6505550101}]
    2            | {"given": "Richard", "family": "Roe"} | [{"type": "work", "num": 5105550102}]
    Query:
    SELECT c.name.given AS firstname,
           ph.num
    FROM customers c, c.phones ph
    WHERE ph.type = 'cell';
    Result:
    firstname | num
    ----------+---------------
    "Jane"    | 6505550101
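
    A minimal sketch of how the `customers` table from this slide could be created and populated. The deck shows only the query and its result, so the DDL and the use of `JSON_PARSE` for ingestion are assumptions.

    ```sql
    -- SUPER columns hold nested JSON values without a declared schema.
    CREATE TABLE customers (
        id     INTEGER,
        name   SUPER,
        phones SUPER
    );

    -- JSON_PARSE converts JSON text into SUPER values at insert time.
    INSERT INTO customers VALUES
        (1,
         JSON_PARSE('{"given": "Jane", "family": "Doe"}'),
         JSON_PARSE('[{"type": "work", "num": "9255550100"}, {"type": "cell", "num": 6505550101}]')),
        (2,
         JSON_PARSE('{"given": "Richard", "family": "Roe"}'),
         JSON_PARSE('[{"type": "work", "num": 5105550102}]'));
    ```

    In the slide's query, `c.phones ph` in the FROM clause iterates over each element of the `phones` array, so `ph.type` and `ph.num` refer to one phone entry at a time.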

  34. Amazon Redshift ML (GA: 2021/05)
    Easily create and train ML models using SQL queries with Amazon SageMaker
    CREATE MODEL demo_ml.customer_churn
    FROM (SELECT c.age, c.zip, c.monthly_spend,
                 c.monthly_cases, c.active
          FROM customer_info_table c)
    TARGET c.active;
    • Use cases: product recommendations, fraud prevention, customer churn reduction
    • Create, train, and apply ML models using SQL
    • Deploy inference models locally in Amazon Redshift; run inference by invoking a
      user-defined function as part of a SQL statement
    • Automatic selection of ML algorithms, or choose an algorithm yourself, such as XGBoost
    • Automatic pre-processing, creation, training, and deployment of your model
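
    The statement on the slide is abbreviated. A fuller sketch, with the clauses a `CREATE MODEL` statement also needs, might look as follows. The function name `predict_customer_churn`, the role ARN, and the bucket name are placeholders assumed for illustration.

    ```sql
    -- Train a model from a query; Redshift ML hands training off to Amazon SageMaker.
    CREATE MODEL demo_ml.customer_churn
    FROM (SELECT age, zip, monthly_spend, monthly_cases, active
          FROM customer_info_table)
    TARGET active
    FUNCTION predict_customer_churn
    IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftMLRole'
    SETTINGS (S3_BUCKET 'example-redshift-ml-bucket');

    -- Once training finishes, inference is an ordinary SQL function call.
    SELECT age, zip,
           predict_customer_churn(age, zip, monthly_spend, monthly_cases) AS predicted_active
    FROM customer_info_table;
    ```

    The function takes the same input columns as the training query, minus the target column.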

  35. Data Sharing for Data Lake (Coming soon)
    Share Amazon Redshift data with other data services via AWS Lake Formation
    Diagram: Amazon Redshift, AWS Lake Formation, Amazon Athena, Amazon SageMaker, Amazon EMR
    • Share the latest Redshift data with no ETL required
    • Query live and transactionally consistent Redshift data from EMR, Athena, Glue, and
      SageMaker
    • Queries run without using any Redshift compute
    • No Redshift cluster is necessary to consume data

  36. AWS Glue Elastic Views (Preview, by request)
    Easily combine and replicate data across multiple data stores
    Data stores in the diagram: RDS, Aurora, DynamoDB, Amazon S3, Amazon Redshift,
    Amazon Elasticsearch Service
    • Create materialized views across data on various databases using familiar SQL
    • Access the latest data view for multiple targets
    • Easy to duplicate, combine, and connect data without custom coding
    • Serverless: automatically scales capacity up and down to accommodate workloads
    • Continuously monitors changes in the source database and updates the target within seconds

  37. Before...
    Data Source: Relational Databases
    Collect/Store/Analyze: Amazon Redshift
    Visualize: Business Intelligence

  38. What's Next?
    Services around Amazon Redshift: Amazon Kinesis, Amazon DynamoDB, Amazon Aurora / RDS,
    Amazon S3, Amazon SageMaker, Amazon QuickSight, Amazon EMR, Amazon Athena,
    other Amazon Redshift clusters, ...
    Features connecting them: Spectrum, Federated Query, Concurrency Scaling, Data Sharing,
    Data Sharing (Coming soon), Amazon Redshift ML (Preview), AWS Glue Elastic Views (Preview)

  39. Thank You!