Yugabyte & Databricks Spark workshop at Distributed SQL Summit(DSS) 2021

Slide 1

Slide 1 text

Slide 2

Slide 2 text

© 2021 - All Rights Reserved Who we are Wei Wang VP of Field Engineering, Yugabyte, Inc. Pivotal • FINRA • NYSE University of Pennsylvania (UPenn) @ameybanarse about.me/amey Amey Banarse Principal Pre Sales Architect Rackspace • Accenture • HP Enterprise University of Oklahoma @wwang0825 2

Slide 3

Slide 3 text

Slide 4

Slide 4 text

© 2021 - All Rights Reserved Transactional, distributed SQL database designed for resilience and scale 100% open source, PostgreSQL compatible, enterprise-grade RDBMS …..built to run across all your cloud environments 4

Slide 5

Slide 5 text

© 2021 - All Rights Reserved Designed for cloud native microservices. 5 Sharding & Load Balancing Raft Consensus Replication Distributed Transaction Manager & MVCC Document Storage Layer Custom RocksDB Storage Engine DocDB Distributed Document Store Yugabyte Query Layer YSQL YCQL PostgreSQL Google Spanner YugabyteDB SQL Ecosystem ✓ Massively adopted ✘ New SQL ﬂavor ✓ Reuse PostgreSQL RDBMS Features ✓ Advanced Complex ✘ Basic cloud-native ✓ Advanced Complex and cloud-native Highly Available ✘ ✓ ✓ Horizontal Scale ✘ ✓ ✓ Distributed Txns ✘ ✓ ✓ Data Replication Async Sync Sync + Async

Slide 6

Slide 6 text

© 2021 - All Rights Reserved 6 ○ Strongly-consistent secondary indexes ○ Native JSON support ○ Geographic location hints to optimize cost and latency ○ Distributed ACID transactions ○ Geo-distributed ○ Supports Cassandra CQL 3.9.x and 4.x Features Cassandra YCQL Clustering and sharding ✓ ✓ Keyspaces and tables ✓ ✓ Indexes & unique constraints ✓ ✓ Transactions ✓ ✓ Strongly consistent secondary indexes 𐄂 ✓ Native JSON 𐄂 ✓ Geographic location hints 𐄂 ✓ In addition to Cassandra CQL API support, YCQL adds the following enhancements: Yugabyte Query Layer - YCQL

Slide 7

Slide 7 text

© 2021 - All Rights Reserved 7 Secondary Indexes - YCQL CREATE TABLE product_rankings ( asin text, category text, sales_rank int, ... PRIMARY KEY (asin, category) ); CREATE INDEX top_products_in_category ON product_rankings (category, sales_rank, asin); Partition by product ID for efficient lookups Secondary index to list top products in a category

Slide 8

Slide 8 text

© 2021 - All Rights Reserved 8 Secondary Indexes - YCQL SELECT * FROM product_rankings WHERE asin = '0684841363'; SELECT * FROM product_rankings WHERE category = 'Books' LIMIT 10 OFFSET 20; Query by ID for product summary data Query by index to list top products in a category by sales rank (best sellers)

Slide 9

Slide 9 text

© 2021 - All Rights Reserved Global Transactions Multi-Row/Multi-Shard Operations At Scale 9 CREATE TABLE orders ( order_id text PRIMARY KEY, user_id uuid, order_details jsonb, ... ) WITH transactions = {'enabled': 'true'}; Update inventory and orders tables atomically. Simple to enable global transactions on any table Use JSON type for flexible schema objects.

Slide 10

Slide 10 text

© 2021 - All Rights Reserved 10 Global Transactions Multi-Row/Multi-Shard Operations At Scale BEGIN TRANSACTION UPDATE product_inventory SET quantity = quantity - 2 WHERE asin = '0684841363'; INSERT INTO orders (order_id, user_id, order_details, … ) VALUES ('', '', '{id: "0684841363", quantity: 2, … }', … ); END TRANSACTION; Decrease inventory count for products fulfilled. Query and index by JSON attributes if needed! Add to orders table for products purchased.

Slide 11

Slide 11 text

Slide 12

Slide 12 text

© 2021 - All Rights Reserved 12 YugabyteDB Spark Connector Source Tables Derived Tables Enrichment / pre-aggregation Batch Aggregates Key features: ● Native JSONB support - ex. Column pruning ● Performance optimizations with predicate pushdowns ● Cluster, topology and partition awareness

Slide 13

Slide 13 text

© 2021 - All Rights Reserved What we will be building in this workshop? Integrating Apache Spark with Yugabyte Cloud A Scala Application Yugabyte Cloud Read From Write To YugabyteDB Spark Connector 3.0-yb-8 - Compatible with Spark 3.0 and Scala 2.12 - Compatible with Yugabyte YCQL 3.7+ - Exposes YCQL tables as Spark RDDs and Datasets/DataFrames - Saves RDDs /DataFrames back to Cassandra by implicit saveToCassandra call - Allows for execution of arbitrary CQL statements Yugabyte Cloud - Data source and target for Spark application - Namespace: test - Table: Employees_json Spark application: Native support of JSON - Read from and write to Yugabyte cloud - Perform sample ETL operation: Window function - Process JSON data type Apache Spark 3.x YugabyteDB Spark Connector Spark APIs Perform ETL 13

Slide 14

Slide 14 text

Slide 15

Slide 15 text

Slide 16

Slide 16 text

Slide 17

Slide 17 text

Slide 18

Slide 18 text

Slide 19

Slide 19 text

19 Yugabyte Conﬁdential © 2019 All rights reserved. YCQL Example CREATE KEYSPACE example WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': '3'};; CREATE TABLE user_actions(user_id INT, ts TIMESTAMP, action TEXT, PRIMARY KEY((user_id), ts)) WITH CLUSTERING ORDER BY (ts DESC); INSERT INTO user_actions(user_id, ts, action) VALUES (1, '2000-12-2 12:30:15', 'log in'); INSERT INTO user_actions(user_id, ts, action) VALUES (1, '2000-12-2 12:30:25', 'change password'); INSERT INTO user_actions(user_id, ts, action) VALUES (1, '2000-12-2 12:30:35', 'log out');

Slide 20

Slide 20 text

20 Yugabyte Conﬁdential © 2019 All rights reserved. Databricks Scala Notebook (ycql) import org.apache.spark.SparkConf import org.apache.spark.SparkContext import com.datastax.spark.connector._ import com.datastax.spark.connector.cql._ val keyspace = "example" val table = "user_actions" val tableDf = spark.read .format("org.apache.spark.sql.cassandra") .options(Map( "table" -> table, "keyspace" -> keyspace)) .load() tableDf.show() +-------+-------------------+---------------+ |user_id| ts| action| +-------+-------------------+---------------+ | 1|2000-12-02 12:30:35| log out| | 1|2000-12-02 12:30:25|change password| | 1|2000-12-02 12:30:15| log in| +-------+-------------------+---------------+

Slide 21

Slide 21 text

21 Yugabyte Conﬁdential © 2019 All rights reserved. YSQL Example CREATE SCHEMA cycling; DROP TABLE IF EXISTS cycling.cyclist_name; CREATE TABLE cycling.cyclist_name ( id bigint PRIMARY KEY, lastname character varying (50), firstname character varying (50) ); INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (1, 'VOS','Marianne'); INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (2, 'VAN DER BREGGEN','Anna'); INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (3, 'FRAME','Alex'); INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (4, 'TIRALONGO','Paolo'); INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (5, 'KRUIKSWIJK','Steven'); INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (6, 'MATTHEWS', 'Michael');

Slide 22

Slide 22 text

22 Yugabyte Conﬁdential © 2019 All rights reserved. Databricks Scala Notebook (ysql) val jdbcDF = spark.read .format("jdbc") .option("url", "jdbc:postgresql://xxx.xxx.xxx.xxx:5433/test") .option("dbtable", "cycling.cyclist_name") .option("user", "yugabyte") .option("password", "") .load() jdbcDF.show() +---+---------------+---------+ | id| lastname|firstname| +---+---------------+---------+ | 3| FRAME| Alex| | 5| KRUIKSWIJK| Steven| | 4| TIRALONGO| Paolo| | 2|VAN DER BREGGEN| Anna| | 6| MATTHEWS| Michael| | 1| VOS| Marianne| +---+---------------+---------+