Slide 1

Slide 1 text

© 2021 - All Rights Reserved 1 Building Applications with YugabyteDB and Databricks Spark Amey Banarse Wei Wang

Slide 2

Slide 2 text

© 2021 - All Rights Reserved Who we are Wei Wang VP of Field Engineering, Yugabyte, Inc. Pivotal • FINRA • NYSE University of Pennsylvania (UPenn) @ameybanarse about.me/amey Amey Banarse Principal Pre Sales Architect Rackspace • Accenture • HP Enterprise University of Oklahoma @wwang0825 2

Slide 3

Slide 3 text

© 2021 - All Rights Reserved Workshop Agenda 3 ● Overview of Yugabyte Architecture ● Yugabyte’s YCQL API ● YugabyteDB Spark Connector ● Hands-on Workshop with Databricks Spark

Slide 4

Slide 4 text

© 2021 - All Rights Reserved Transactional, distributed SQL database designed for resilience and scale 100% open source, PostgreSQL compatible, enterprise-grade RDBMS …..built to run across all your cloud environments 4

Slide 5

Slide 5 text

© 2021 - All Rights Reserved Designed for cloud native microservices. 5 Sharding & Load Balancing Raft Consensus Replication Distributed Transaction Manager & MVCC Document Storage Layer Custom RocksDB Storage Engine DocDB Distributed Document Store Yugabyte Query Layer YSQL YCQL PostgreSQL Google Spanner YugabyteDB SQL Ecosystem ✓ Massively adopted ✘ New SQL flavor ✓ Reuse PostgreSQL RDBMS Features ✓ Advanced Complex ✘ Basic cloud-native ✓ Advanced Complex and cloud-native Highly Available ✘ ✓ ✓ Horizontal Scale ✘ ✓ ✓ Distributed Txns ✘ ✓ ✓ Data Replication Async Sync Sync + Async

Slide 6

Slide 6 text

© 2021 - All Rights Reserved 6 ○ Strongly-consistent secondary indexes ○ Native JSON support ○ Geographic location hints to optimize cost and latency ○ Distributed ACID transactions ○ Geo-distributed ○ Supports Cassandra CQL 3.9.x and 4.x Features Cassandra YCQL Clustering and sharding ✓ ✓ Keyspaces and tables ✓ ✓ Indexes & unique constraints ✓ ✓ Transactions ✓ ✓ Strongly consistent secondary indexes 𐄂 ✓ Native JSON 𐄂 ✓ Geographic location hints 𐄂 ✓ In addition to Cassandra CQL API support, YCQL adds the following enhancements: Yugabyte Query Layer - YCQL

Slide 7

Slide 7 text

© 2021 - All Rights Reserved 7 Secondary Indexes - YCQL CREATE TABLE product_rankings ( asin text, category text, sales_rank int, ... PRIMARY KEY (asin, category) ); CREATE INDEX top_products_in_category ON product_rankings (category, sales_rank, asin); Partition by product ID for efficient lookups Secondary index to list top products in a category

Slide 8

Slide 8 text

© 2021 - All Rights Reserved 8 Secondary Indexes - YCQL SELECT * FROM product_rankings WHERE asin = '0684841363'; SELECT * FROM product_rankings WHERE category = 'Books' LIMIT 10 OFFSET 20; Query by ID for product summary data Query by index to list top products in a category by sales rank (best sellers)

Slide 9

Slide 9 text

© 2021 - All Rights Reserved Global Transactions Multi-Row/Multi-Shard Operations At Scale 9 CREATE TABLE orders ( order_id text PRIMARY KEY, user_id uuid, order_details jsonb, ... ) WITH transactions = {'enabled': 'true'}; Update inventory and orders tables atomically. Simple to enable global transactions on any table Use JSON type for flexible schema objects.

Slide 10

Slide 10 text

© 2021 - All Rights Reserved 10 Global Transactions Multi-Row/Multi-Shard Operations At Scale BEGIN TRANSACTION UPDATE product_inventory SET quantity = quantity - 2 WHERE asin = '0684841363'; INSERT INTO orders (order_id, user_id, order_details, … ) VALUES ('', '', '{id: "0684841363", quantity: 2, … }', … ); END TRANSACTION; Decrease inventory count for products fulfilled. Query and index by JSON attributes if needed! Add to orders table for products purchased.

Slide 11

Slide 11 text

© 2021 - All Rights Reserved 11 Native JSON

Slide 12

Slide 12 text

© 2021 - All Rights Reserved 12 YugabyteDB Spark Connector Source Tables Derived Tables Enrichment / pre-aggregation Batch Aggregates Key features: ● Native JSONB support - ex. Column pruning ● Performance optimizations with predicate pushdowns ● Cluster, topology and partition awareness

Slide 13

Slide 13 text

© 2021 - All Rights Reserved What we will be building in this workshop? Integrating Apache Spark with Yugabyte Cloud A Scala Application Yugabyte Cloud Read From Write To YugabyteDB Spark Connector 3.0-yb-8 - Compatible with Spark 3.0 and Scala 2.12 - Compatible with Yugabyte YCQL 3.7+ - Exposes YCQL tables as Spark RDDs and Datasets/DataFrames - Saves RDDs /DataFrames back to Cassandra by implicit saveToCassandra call - Allows for execution of arbitrary CQL statements Yugabyte Cloud - Data source and target for Spark application - Namespace: test - Table: Employees_json Spark application: Native support of JSON - Read from and write to Yugabyte cloud - Perform sample ETL operation: Window function - Process JSON data type Apache Spark 3.x YugabyteDB Spark Connector Spark APIs Perform ETL 13

Slide 14

Slide 14 text

© 2021 - All Rights Reserved 14 Hands-on Lab github.com/yugabyte/yugabyte-spring-workshop

Slide 15

Slide 15 text

15 Yugabyte Confidential © 2019 All rights reserved. Databricks Configuration

Slide 16

Slide 16 text

16 Yugabyte Confidential © 2019 All rights reserved.

Slide 17

Slide 17 text

17 Yugabyte Confidential © 2019 All rights reserved.

Slide 18

Slide 18 text

18 Yugabyte Confidential © 2019 All rights reserved.

Slide 19

Slide 19 text

19 Yugabyte Confidential © 2019 All rights reserved. YCQL Example CREATE KEYSPACE example WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': '3'};; CREATE TABLE user_actions(user_id INT, ts TIMESTAMP, action TEXT, PRIMARY KEY((user_id), ts)) WITH CLUSTERING ORDER BY (ts DESC); INSERT INTO user_actions(user_id, ts, action) VALUES (1, '2000-12-2 12:30:15', 'log in'); INSERT INTO user_actions(user_id, ts, action) VALUES (1, '2000-12-2 12:30:25', 'change password'); INSERT INTO user_actions(user_id, ts, action) VALUES (1, '2000-12-2 12:30:35', 'log out');

Slide 20

Slide 20 text

20 Yugabyte Confidential © 2019 All rights reserved. Databricks Scala Notebook (ycql) import org.apache.spark.SparkConf import org.apache.spark.SparkContext import com.datastax.spark.connector._ import com.datastax.spark.connector.cql._ val keyspace = "example" val table = "user_actions" val tableDf = spark.read .format("org.apache.spark.sql.cassandra") .options(Map( "table" -> table, "keyspace" -> keyspace)) .load() tableDf.show() +-------+-------------------+---------------+ |user_id| ts| action| +-------+-------------------+---------------+ | 1|2000-12-02 12:30:35| log out| | 1|2000-12-02 12:30:25|change password| | 1|2000-12-02 12:30:15| log in| +-------+-------------------+---------------+

Slide 21

Slide 21 text

21 Yugabyte Confidential © 2019 All rights reserved. YSQL Example CREATE SCHEMA cycling; DROP TABLE IF EXISTS cycling.cyclist_name; CREATE TABLE cycling.cyclist_name ( id bigint PRIMARY KEY, lastname character varying (50), firstname character varying (50) ); INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (1, 'VOS','Marianne'); INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (2, 'VAN DER BREGGEN','Anna'); INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (3, 'FRAME','Alex'); INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (4, 'TIRALONGO','Paolo'); INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (5, 'KRUIKSWIJK','Steven'); INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (6, 'MATTHEWS', 'Michael');

Slide 22

Slide 22 text

22 Yugabyte Confidential © 2019 All rights reserved. Databricks Scala Notebook (ysql) val jdbcDF = spark.read .format("jdbc") .option("url", "jdbc:postgresql://xxx.xxx.xxx.xxx:5433/test") .option("dbtable", "cycling.cyclist_name") .option("user", "yugabyte") .option("password", "") .load() jdbcDF.show() +---+---------------+---------+ | id| lastname|firstname| +---+---------------+---------+ | 3| FRAME| Alex| | 5| KRUIKSWIJK| Steven| | 4| TIRALONGO| Paolo| | 2|VAN DER BREGGEN| Anna| | 6| MATTHEWS| Michael| | 1| VOS| Marianne| +---+---------------+---------+

Slide 23

Slide 23 text

© 2021 - All Rights Reserved Thank You Join us on Slack: yugabyte.com/slack Star us on Github: github.com/yugabyte/yugabyte-db