Amazon Redshift evolution history and future direction

© 2021, Amazon Web Services, Inc. or its Affiliates.
Amazon Redshift evolution history and future direction
Junpei Ozono, Senior Solutions Architect, Analytics, Amazon Web Services Japan K.K.

Today's Session
Some people say: "I know little about Redshift in the first place." "What's going on with Redshift recently?"
This session recaps the history of Amazon Redshift's evolution and gives an update on the latest Amazon Redshift feature releases.

History of Amazon Redshift Evolution: 2012, 2017, 2019, 2020, 2021, Future

Typical data analysis architecture in the past
Data Source (Relational Databases) → Collect/Store/Analyze (Data Warehouse) → Visualize (Business Intelligence)
• Analyzes structured data on relational databases
• Gathers this data into a data warehouse, then analyzes and visualizes it with BI tools
• On-premises centric as of 2012

Introducing Amazon Redshift
Data Source (Relational Databases) → Collect/Store/Analyze (Amazon Redshift) → Visualize (Business Intelligence)
A fast, scalable, and cost-effective managed data warehouse service
✓ Petabyte scale
✓ Compatibility with PostgreSQL
✓ Connects with typical third-party tools
✓ Price less than 1/10th of a traditional DWH (at the time)

Amazon Redshift Architecture at Service Launch
Shared-nothing + MPP (Massively Parallel Processing) architecture: an approach to increasing processing throughput for analytic queries by distributing data across multiple compute nodes and processing in parallel on each node.
Clients connect to Amazon Redshift via JDBC/ODBC.
Leader node
• Query endpoint
• Generates and deploys SQL processing code
Compute nodes
• Local columnar storage
• Parallel execution of queries
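The shared-nothing + MPP idea on this slide can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not how Redshift is implemented: the node count, the CRC32 hash, the `store` distribution key, and every function name are hypothetical, and threads merely stand in for compute nodes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_NODES = 4  # hypothetical number of compute nodes


def node_for(key: str) -> int:
    """Choose a node by hashing the distribution key (stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_NODES


def distribute(rows):
    """Shared nothing: each node receives and owns only its own slice of rows."""
    slices = [[] for _ in range(NUM_NODES)]
    for row in rows:
        slices[node_for(row["store"])].append(row)
    return slices


def local_sum(node_rows):
    """Work done on one compute node: partial SUM(price) GROUP BY store."""
    partial = defaultdict(float)
    for row in node_rows:
        partial[row["store"]] += row["price"]
    return dict(partial)


def leader_query(rows):
    """Leader node: fan the work out to all nodes, then merge partial results."""
    slices = distribute(rows)
    with ThreadPoolExecutor(max_workers=NUM_NODES) as pool:
        partials = list(pool.map(local_sum, slices))
    merged = defaultdict(float)
    for partial in partials:
        for store, total in partial.items():
            merged[store] += total
    return dict(merged)


# Illustrative rows only.
SALES = [
    {"store": "s1", "price": 12.0},
    {"store": "s2", "price": 3.0},
    {"store": "s2", "price": 7.0},
]

if __name__ == "__main__":
    print(leader_query(SALES))
```

Because rows are distributed on the grouping key, each store's rows land on a single node, so the leader only has to merge small partial results rather than move raw rows.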

Amazon Redshift Architecture at Service Launch
• Clients connect to Amazon Redshift via JDBC/ODBC
• User data is loaded (COPY) from and unloaded (UNLOAD) to user-managed Amazon S3 buckets
• Automatic backups and restores use service-managed S3 space (Redshift management buckets)

History of Amazon Redshift Evolution: 2012 (Redshift announcement), 2017

Changes in data and how it is used
• More data than you can imagine*
• More diversity of data
• More users and more applications access your data: Data Scientist, Analyst, Application, Business user
• Analyze in different ways: Machine Learning, SQL analysis, Real-time streaming, Scientific computing
* IDC, Data Age 2025: The Evolution of Data to Life-Critical: Don't Focus on Big Data, Focus on the Data That's Big, April 2017.

Data Lake Architecture
Data Source: Stream Data / Event Log, NoSQL Databases, Relational Databases, ...
Collect/Store: Data Lake
Analyze: Data Warehouse, Big Data Processing, Log Search, Machine Learning, ...
Visualize: Business Intelligence, Business Application, ...

Data Lake Architecture (continued)
More data than you can imagine, and more diversity of data: the idea of the data lake was born to make both easier to handle.

Data Lake Architecture (continued)
More users and more applications access your data and analyze it in different ways: leverage purpose-built data stores, such as a data warehouse, and other analytics services.

Data warehousing and data lake work together
Data Source (Stream Data / Event Log, NoSQL Databases, Relational Databases) → Collect/Store (Amazon S3) → Analyze (Amazon Redshift, Big Data Processing, Log Search, Machine Learning, ...) → Visualize (Business Intelligence, Business Application, ...)
It has become difficult to store all data in the data warehouse (Redshift). If queries can run directly against data in the data lake (S3), more and more data can be analyzed while keeping costs down.

Extend your architecture to a data lake with Redshift Spectrum
Applications connect to Amazon Redshift via JDBC/ODBC and can transparently access data in both the data warehouse and the data lake.
Amazon Redshift Spectrum
• Parallel query execution engine for files on S3
Data Lake
• User-managed S3 buckets
• Open-format files (Parquet, ORC, JSON, CSV, etc.)

History of Amazon Redshift Evolution: 2012 (Redshift announcement), 2017 (Redshift Spectrum), 2019

Challenges of Parallel Workloads
Data Source (Stream Data / Event Log, NoSQL Databases, Relational Databases) → Collect/Store (Amazon S3) → Analyze (Amazon Redshift, Big Data Processing, Log Search, Machine Learning, ...) → Visualize (Business Intelligence, Business Application, ...)
More users and applications are accessing the data warehouse. When workloads burst, overall throughput on the cluster may drop, e.g. when most users access the data warehouse simultaneously at 9:00 am every Monday.

Concurrency Scaling automatically scales compute during peak hours
Main cluster → dispatch → additional clusters (1-10)
When queries burst and the Redshift cluster does not have enough resources to run them, Redshift automatically launches additional cluster(s) behind the scenes and processes the queries without waiting. One hour per day is free of charge, and the cost can be controlled.

History of Amazon Redshift Evolution: 2012 (Redshift announcement), 2017 (Redshift Spectrum), 2019 (Concurrency Scaling), 2020

Redshift scaling challenges
Data Source (Stream Data / Event Log, NoSQL Databases, Relational Databases) → Collect/Store (Amazon S3) → Analyze (Amazon Redshift, Big Data Processing, Log Search, Machine Learning, ...) → Visualize (Business Intelligence, Business Application, ...)
If you had more data on Redshift and wanted more storage capacity, you had to resize the cluster to add more nodes with storage space. The Redshift architecture at the time didn't allow storage and compute to be scaled separately.

(Recap) Amazon Redshift Architecture
Clients connect to Amazon Redshift via JDBC/ODBC.
Leader node
• Query endpoint
• Generates and deploys SQL processing code
Compute nodes
• Local columnar storage
• Parallel execution of queries

RA3 instances with managed storage
Clients connect to Amazon Redshift via JDBC/ODBC.
Leader node
• Query endpoint
• Generates and deploys SQL processing code
Compute nodes
• Nitro-based hardware
• High-speed local SSD cache + large volume of RAM + high-bandwidth networking
• Parallel execution of queries
Managed storage (reached over high-bandwidth networking)
• Redshift-managed S3 bucket
• Redshift-format files
• Size the data warehouse based only on steady-state compute needs
• Scale and pay independently for compute and storage
• Frequently accessed data is automatically cached on the compute nodes

Linking data warehouses and operational databases
Data Source (Stream Data / Event Log, NoSQL Databases, Amazon Aurora / RDS) → Collect/Store (Amazon S3) → Analyze (Amazon Redshift, Big Data Processing, Log Search, Machine Learning, ...) → Visualize (Business Intelligence, Business Application, ...)
Not all data is loaded into a data lake or data warehouse in real time, so being able to query the latest data directly on an operational database enables even more analysis.

Amazon Redshift Federated Query
Unified analytics across databases, data warehouse, and data lake
Sources: Amazon RDS (PostgreSQL, MySQL), Amazon Aurora (PostgreSQL, MySQL), Amazon S3 data lake; clients connect to Amazon Redshift via JDBC/ODBC
• Analyze live data without data movement
• Query data directly on Amazon RDS/Aurora PostgreSQL from Amazon Redshift
• Secure, high-performance data access
• Amazon RDS/Aurora MySQL support (preview)

History of Amazon Redshift Evolution: 2012 (Redshift announcement), 2017 (Redshift Spectrum), 2019 (Concurrency Scaling), 2020 (RA3, Federated Query), 2021

Data sharing across multiple clusters
Data Source (Stream Data / Event Log, NoSQL Databases, Relational Databases) → Collect/Store (Amazon S3) → Analyze (multiple Amazon Redshift clusters, Machine Learning, ...) → Visualize (Business Intelligence, Business Application, ...)
Multiple Redshift clusters may be required for various reasons:
• Completely separating workloads
• Management by different departments
• Separating environments for production, development, etc.
To share data between these clusters, you had to transfer data from cluster to cluster.

Amazon Redshift Data Sharing
Secure and easy data sharing across Redshift clusters
Producer cluster (RA3 instances): leader node + compute nodes
Consumer cluster (RA3 instances): leader node + compute nodes
Both clusters sit on Amazon Redshift Managed Storage (diagram labels: read shared data; read and write private data)
• The producer pays for Amazon Redshift managed storage, and consumers pay for the consumer cluster
• Workloads accessing shared data are isolated from each other and from the producer

Redshift Automated Performance Tuning
ML-based optimizations to get started easily and get the fastest performance quickly
• Automates physical data design and optimization
• Optimizes for peak performance as data and workloads scale
• Leverages machine learning to adapt to shifting workloads
Automated performance tuning features: automatic sort keys, automatic distribution keys, automatic table sort, automatic vacuum delete, Auto Workload Manager, MV auto-refresh and rewrite

Practical Materialized View
Physical view to speed up frequently executed queries
• Join, Filter, Aggregate, Projection
• Specify a different key than the base table
• Reference external tables
When the base table is updated, the associated materialized views are also refreshed automatically
No need to be aware of the materialized view
• Just query the table
• Redshift rewrites the execution plan as needed to read from a materialized view
Base table sales:
item | store | cust | price
i1   | s1    | c1   | 12.0
i2   | s2    | c1   | 3.0
i3   | s2    | c2   | 7.0
Base table store_info:
store | owner | loc
s1    | Joe   | SF
s2    | Ann   | NY
s3    | Lisa  | SF
Materialized view loc_sales:
loc | total_sales
SF  | 12.00
NY  | 10.00
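The loc_sales example on this slide can be reproduced with a small Python sketch. It assumes SQLite as a stand-in: SQLite has no materialized views, so the sketch stores the query result in an ordinary table and rebuilds it by hand, whereas Redshift maintains the view and rewrites queries itself. The base-table name `sales`, the column names, and the helper functions are assumptions read off the slide's sample data, not an API from the deck.

```python
import sqlite3

# Query that defines the (emulated) materialized view: total sales per location.
VIEW_QUERY = """
    SELECT si.loc AS loc, SUM(s.price) AS total_sales
    FROM sales s
    JOIN store_info si ON s.store = si.store
    GROUP BY si.loc
"""


def build_db():
    """Create the two base tables with the sample rows from the slide."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sales (item TEXT, store TEXT, cust TEXT, price REAL);
        CREATE TABLE store_info (store TEXT, owner TEXT, loc TEXT);
        INSERT INTO sales VALUES
            ('i1', 's1', 'c1', 12.0),
            ('i2', 's2', 'c1', 3.0),
            ('i3', 's2', 'c2', 7.0);
        INSERT INTO store_info VALUES
            ('s1', 'Joe', 'SF'),
            ('s2', 'Ann', 'NY'),
            ('s3', 'Lisa', 'SF');
    """)
    return conn


def refresh_loc_sales(conn):
    """Emulate a full refresh: precompute and store the query result."""
    conn.execute("DROP TABLE IF EXISTS loc_sales")
    conn.execute("CREATE TABLE loc_sales AS " + VIEW_QUERY)
    conn.commit()


def read_loc_sales(conn):
    """Read the precomputed result instead of re-running the join."""
    rows = conn.execute("SELECT loc, total_sales FROM loc_sales").fetchall()
    return dict(rows)


if __name__ == "__main__":
    conn = build_db()
    refresh_loc_sales(conn)
    print(read_loc_sales(conn))
```

Store s3 has no sales rows, so SF's total comes only from s1. Inserting a new sale and calling `refresh_loc_sales` again mimics the refresh that the slide says happens automatically when a base table is updated.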

History of Amazon Redshift Evolution: 2012 (Redshift announcement), 2017 (Redshift Spectrum), 2019 (Concurrency Scaling), 2020 (RA3, Federated Query), 2021 (Data Sharing, Auto-Tuning), Future

Further enhancements to RA3 instances
Amazon Redshift RA3 ↔ Redshift Managed Storage: network bottlenecks?
How can network performance penalties between compute nodes and managed storage be prevented?

Advanced Query Accelerator (AQUA) (GA: 2021/04)
New hardware-accelerated cache that delivers up to 10x better query performance than other cloud data warehouses
Compute nodes ↔ AQUA nodes (AWS-designed custom processors; parallelism, scale-out) ↔ Redshift Managed Storage
• Minimizes data movement over the network by pushing down operations to AQUA nodes
• AQUA nodes use custom AWS-designed analytics processors to make operations (compression, encryption, filtering, and aggregation) faster than on traditional CPUs
• Available on ra3.16xlarge/ra3.4xlarge at no additional cost
• No need to modify any SQL or application code

SUPER data type (GA: 2021/04)
Ingest semi-structured data into a table without a schema specification
New data type: SUPER
• Easy, efficient, and powerful JSON processing
• Fast row-oriented data ingestion
• Fast column-oriented analytics with materialized views over SUPER/JSON
• Access to schema-less nested data with easy-to-use SQL extensions powered by the PartiQL query language
Table customers:
id (INTEGER) | name (SUPER) | phones (SUPER)
1 | {"given": "Jane", "family": "Doe"} | [{"type": "work", "num": "9255550100"}, {"type": "cell", "num": 6505550101}]
2 | {"given": "Richard", "family": "Roe"} | [{"type": "work", "num": 5105550102}]
Query:
    SELECT name.given AS firstname, ph.num
    FROM customers c, c.phones ph
    WHERE ph.type = 'cell';
Result:
    firstname | num
    ----------+------------
    "Jane"    | 6505550101
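What the PartiQL query on this slide does can be shown as a plain Python sketch over the same sample rows. This only mirrors the query's navigation and unnesting logic; it is not how Redshift executes SUPER queries, and the function name `cell_numbers` is hypothetical.

```python
# Sample rows from the slide; the nested values play the role of SUPER columns.
CUSTOMERS = [
    {
        "id": 1,
        "name": {"given": "Jane", "family": "Doe"},
        "phones": [
            {"type": "work", "num": "9255550100"},
            {"type": "cell", "num": 6505550101},
        ],
    },
    {
        "id": 2,
        "name": {"given": "Richard", "family": "Roe"},
        "phones": [{"type": "work", "num": 5105550102}],
    },
]


def cell_numbers(customers):
    """Return (firstname, num) pairs for every phone entry of type 'cell'."""
    results = []
    for c in customers:                   # FROM customers c
        for ph in c.get("phones", []):    # , c.phones ph  (unnest the array)
            if ph.get("type") == "cell":  # WHERE ph.type = 'cell'
                # SELECT name.given AS firstname, ph.num
                results.append((c["name"]["given"], ph["num"]))
    return results


if __name__ == "__main__":
    print(cell_numbers(CUSTOMERS))
```

The nested loop is the key point: listing `c.phones ph` in the FROM clause iterates over each array element per customer, so a customer with no cell phone (Richard here) contributes no rows.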

Amazon Redshift ML (GA: 2021/05)
Easily create and train ML models using SQL queries with Amazon SageMaker
Use cases: product recommendations, fraud prevention, customer churn reduction
    CREATE MODEL demo_ml.customer_churn
    FROM (SELECT c.age, c.zip, c.monthly_spend, c.monthly_cases, c.active
          FROM customer_info_table c)
    TARGET c.active;
• Create, train, and apply ML models using SQL
• Deploy inference models locally in Amazon Redshift; run inference by invoking a user-defined function as part of SQL statements
• Automatic selection of ML algorithms, or choose an algorithm yourself, such as XGBoost
• Automatic pre-processing, creation, training, and deployment of your model

Data Sharing for Data Lake (coming soon)
Share Amazon Redshift data with other data services via AWS Lake Formation
Amazon Redshift → AWS Lake Formation → Amazon Athena, Amazon SageMaker, Amazon EMR
• Share the latest Redshift data with no ETL required
• Query live and transactionally consistent Redshift data from EMR, Athena, Glue, and SageMaker
• Queries run without using any Redshift compute
• No Redshift cluster necessary to consume data

AWS Glue Elastic Views (Request Preview)
Easily combine and replicate data across multiple data stores
Create materialized views across data on various databases using familiar SQL
Sources (RDS, Aurora, DynamoDB) → AWS Glue Elastic Views (materialized views) → Targets (Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, RDS, Aurora, DynamoDB)
• Access the latest data view for multiple targets
• Easy to duplicate, combine, and connect data without custom coding
• Serverless: automatically scales capacity up and down to accommodate workloads
• Continuously monitors changes in the source database and updates the target within seconds

Before...
Relational Databases → Amazon Redshift → Business Intelligence

What's Next? Amazon Kinesis Amazon DynamoDB Amazon Aurora / RDS Amazon SageMaker Amazon QuickSight Amazon Redshift Amazon S3 Amazon Redshift Amazon Redshift ML (Preview) Federated Query Data Sharing Data Sharing (Coming soon) Spectrum Spectrum AWS Elastic Views (Preview) Amazon EMR Amazon Athena Amazon SageMaker Concurrency Scaling ...
