Slide 1

Slide 1 text

Level 102 for Presto: SQL on Everything
Part 2 of the Tech Talk Series for Presto
Getting started with PrestoDB? What's the difference?
Da Cheng, Software Engineer, Twitter
Dipti Borkar, Co-Founder & CPO, Ahana

Slide 2

Slide 2 text

Pluggable Presto Connectors

Slide 3

Slide 3 text

Presto Connector
● Connector: driver for a data source
  ○ Access HDFS datasets, e.g., Hive
  ○ Extend to real-time event streams, e.g., Kafka
  ○ Push down/offload computation to data engines, e.g., Druid
● Data model from the user's perspective
  ○ Catalog: contains the schemas from a data source specified by the connector
  ○ Schema: namespace to organize tables
  ○ Table: set of unordered rows organized into columns with types

SHOW TABLES FROM hive.default;
SELECT * FROM hive.default.employees LIMIT 1;

Slide 4

Slide 4 text

Presto Connector
● Catalog properties file
  ○ Hive: etc/catalog/hive.properties
  ○ Kafka: etc/catalog/kafka.properties
  ○ Druid: etc/catalog/druid.properties

etc/catalog/hive.properties:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://example.net:9083

etc/catalog/kafka.properties:
connector.name=kafka
kafka.table-names=table1,table2
kafka.nodes=host1:port,host2:port

etc/catalog/druid.properties:
connector.name=druid
druid.coordinator-url=hostname:port
druid.broker-url=hostname:port
druid.schema-name=schema
druid.compute-pushdown-enabled=true

Slide 5

Slide 5 text

Presto Kafka Connector
● How it works
  ○ Kafka brokers
  ○ Topic schema
  ○ Reading from Kafka brokers
● Twitter setup
● Performance
● Scalability

Slide 6

Slide 6 text

Kafka Brokers
https://www.tutorialspoint.com/apache_kafka/apache_kafka_fundamentals.htm

Slide 7

Slide 7 text

Topic Schema
● A table definition file consists of a JSON definition for a table. The file name can be arbitrary but must end in .json

{
    "tableName": ...,
    "schemaName": ...,
    "topicName": ...,
    "key": {
        "dataFormat": ...,
        "fields": [ ... ]
    },
    "message": {
        "dataFormat": ...,
        "fields": [ ... ]
    }
}

Slide 8

Slide 8 text

Read from a topic broker
[Diagram: the coordinator calls getOffsetsByTimestamp(ts) for ALL partitions and builds splits with the resulting offsets; the workers then poll and decode from topic brokers A and B]
● Coordinator builds the splits in a single-thread execution
● getOffsetsByTimestamp()
  ○ API call to get the message offsets for the timestamps the user is interested in
  ○ E.g., a KafkaClient subscribed to broker A returns the offset range for Partition 1
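
To make the offset lookup concrete, here is a minimal Java sketch of the coordinator-side step, assuming the standard kafka-clients consumer API (offsetsForTimes, endOffsets); the class and variable names are hypothetical, not the connector's actual code.

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

// Hypothetical sketch: translate a query's [startTs, endTs) window into
// per-partition offset ranges -- the work the coordinator does while building splits.
public class OffsetRangeLookup {
    public static Map<TopicPartition, long[]> lookup(
            KafkaConsumer<byte[], byte[]> consumer, String topic, long startTs, long endTs) {
        Map<TopicPartition, Long> startQuery = new HashMap<>();
        Map<TopicPartition, Long> endQuery = new HashMap<>();
        for (PartitionInfo p : consumer.partitionsFor(topic)) {
            TopicPartition tp = new TopicPartition(topic, p.partition());
            startQuery.put(tp, startTs);
            endQuery.put(tp, endTs);
        }
        // offsetsForTimes() returns, per partition, the earliest offset whose timestamp is >= the requested one.
        Map<TopicPartition, OffsetAndTimestamp> startOffsets = consumer.offsetsForTimes(startQuery);
        Map<TopicPartition, OffsetAndTimestamp> endOffsets = consumer.offsetsForTimes(endQuery);

        Map<TopicPartition, long[]> ranges = new HashMap<>();
        for (TopicPartition tp : startQuery.keySet()) {
            OffsetAndTimestamp start = startOffsets.get(tp);
            if (start == null) {
                continue; // no messages at or after startTs in this partition
            }
            OffsetAndTimestamp end = endOffsets.get(tp);
            long endOffset = (end != null)
                    ? end.offset()
                    : consumer.endOffsets(Collections.singleton(tp)).get(tp);
            ranges.put(tp, new long[] {start.offset(), endOffset}); // becomes one split per partition
        }
        return ranges;
    }
}

Doing this serially for every partition of a large topic is what makes the single-threaded coordinator step expensive, which motivates the change on the next slide.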

Slide 9

Slide 9 text

Read from a topic broker (slightly smarter!)
[Diagram: the coordinator builds splits with only the timestamp range for ALL partitions; each worker gets the offsets for its own split, polls, and decodes from topic brokers A and B]
● Decouples getOffsetsByTimestamp() from the single-thread execution on the coordinator
  ○ Saves a fixed ~90 seconds that was spent getting offset ranges for all 1,000 partitions in the test topic
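
Under this scheme each worker would first resolve its split's timestamp range to offsets (e.g., with a lookup like the previous sketch, restricted to its own partition) and then read the range. A rough sketch of that read loop, assuming the standard kafka-clients consumer API: the class name and decode() helper are hypothetical placeholders, and poll(Duration) assumes kafka-clients 2.0+ (older clients use the long-based poll).

import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

// Hypothetical sketch: a worker reads one split by seeking to the split's start
// offset and polling until it reaches the end offset, decoding records as it goes.
public class SplitReader {
    public static void readSplit(KafkaConsumer<byte[], byte[]> consumer,
            TopicPartition partition, long startOffset, long endOffset) {
        consumer.assign(Collections.singletonList(partition));
        consumer.seek(partition, startOffset);

        long nextOffset = startOffset;
        while (nextOffset < endOffset) {
            ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<byte[], byte[]> record : records.records(partition)) {
                if (record.offset() >= endOffset) {
                    return; // past this split's range
                }
                decode(record); // placeholder for the Thrift decoding into Presto rows
                nextOffset = record.offset() + 1;
            }
        }
    }

    private static void decode(ConsumerRecord<byte[], byte[]> record) {
        // Decoding into Presto pages/blocks is omitted in this sketch.
    }
}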

Slide 10

Slide 10 text

Read from a topic broker: by segment
[Diagram: log segments holding (msg1, msg2), (msg3, msg4), (msg5, msg6), ... with segment offsets 0x01, 0x03, 0x05, ... and timestamps t1, t3, t5, ...; a Split covering msg3..msg6 is decoded (ThriftRecordDecoder) and surfaced as rows through KafkaRecordSetCursor]
● Used the segment timestamp (before kafka-clients 1.0.0) to limit the data scanned

Slide 11

Slide 11 text

Read from a topic broker: with message offsets
[Diagram: individual messages msg1..msg6 with per-message offsets 0x01..0x06 and timestamps t1..t6; a Split covering only msg4..msg5 is decoded (ThriftRecordDecoder) and surfaced as rows through KafkaRecordSetCursor]
● Deprecates the notion of a segment
● Uses the internal field _timestamp in the message format (after kafka-clients 1.0.0) to limit data at a finer granularity
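
A minimal sketch of this finer-grained filter, assuming record-level timestamps from kafka-clients 1.0.0+; the class is hypothetical, and in the connector this kind of check would sit in the cursor's read loop before a record is decoded into a row.

import org.apache.kafka.clients.consumer.ConsumerRecord;

// Hypothetical sketch: with per-message timestamps, the cursor can drop individual
// records outside the requested window instead of scanning whole segments.
public class TimestampWindowFilter {
    private final long startTs; // inclusive, epoch millis
    private final long endTs;   // exclusive, epoch millis

    public TimestampWindowFilter(long startTs, long endTs) {
        this.startTs = startTs;
        this.endTs = endTs;
    }

    public boolean accept(ConsumerRecord<byte[], byte[]> record) {
        long ts = record.timestamp(); // the per-record timestamp surfaced as _timestamp in the example queries
        return ts >= startTs && ts < endTs;
    }
}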

Slide 12

Slide 12 text

Setup
● Presto cluster on Aurora
  ○ Dedicated hosts (55+ CPUs)
● A topic with 1,000 partitions
  ○ 30 brokers, each holding 100 partitions
    ■ Message replication factor = 3
    ■ A total of 1,000 partitions
  ○ Each partition serving 10 MB/s (SLA)
● HDFS counterpart
  ○ ~1 hour latency

Slide 13

Slide 13 text

Performance
● Query: filters on both the event time (timestamp_ms) and the processing time (_timestamp)

SELECT ip_address, count(1) AS cnt, avg(response_size), max(_timestamp), max(authentication.user_id)
FROM kafka.default.test
WHERE timestamp_ms BETWEEN cast(to_unixtime(CAST('2019-11-13 21:00:00' AS timestamp))*1000 AS bigint)
        AND cast(to_unixtime(CAST('2019-11-13 21:01:00' AS timestamp))*1000 AS bigint)
    AND _timestamp BETWEEN cast(to_unixtime(CAST('2019-11-13 21:00:00' AS timestamp))*1000 AS bigint)
        AND cast(to_unixtime(CAST('2019-11-13 21:01:00' AS timestamp))*1000 AS bigint)
GROUP BY ip_address
ORDER BY cnt DESC
LIMIT 10;

Slide 14

Slide 14 text

Performance
● Execution time
  ○ Using a cluster of 50 Presto workers
  ○ Messages from one partition are packaged in a single split, so the number of splits is fixed
  ○ Execution time increases linearly with the time span (which determines the message range) the user is interested in

Time span           | 2 min | 10 min | 1 hour
Kafka-clients 1.1.1 | 0:21  | 1:17   | 7:15

● Data scanned

Time span           | 2 min  | 10 min  | 1 hour
Kafka-clients 1.1.1 | 470 GB | 2.21 TB | 13.1 TB

Slide 15

Slide 15 text

Validation
● Query
  ○ Uses count(*) without further aggregation to show the difference vs. HDFS
  ○ Fetching one hour of data is equivalent to selecting one datehour partition on HDFS

Kafka:
SELECT count(*)
FROM kafka.default.test
WHERE timestamp_ms BETWEEN cast(to_unixtime(cast('2019-11-22 23:00:00' AS timestamp))*1000 AS bigint)
        AND cast(to_unixtime(cast('2019-11-23 00:00:00' AS timestamp))*1000 AS bigint)
    AND _timestamp BETWEEN cast(to_unixtime(cast('2019-11-22 23:00:00' AS timestamp))*1000 AS bigint)
        AND cast(to_unixtime(cast('2019-11-23 00:00:00' AS timestamp))*1000 AS bigint);

HDFS:
SELECT count(*)
FROM hive.logs.test
WHERE timestamp_ms BETWEEN cast(to_unixtime(cast('2019-11-22 23:00:00' AS timestamp))*1000 AS bigint)
        AND cast(to_unixtime(cast('2019-11-23 00:00:00' AS timestamp))*1000 AS bigint)
    AND datehour='2019112223';

Slide 16

Slide 16 text

Validation
● count(*) results

Time span  | 1 min     | 10 min     | 30 min     | 1 hour
Kafka      | 298655376 | 2922330813 | 8715311075 | 17310768994
HDFS       | 299548797 | 2921707581 | 8713096303 | 17307371235
Difference | 0.3%      | -0.02%     | -0.03%     | -0.02%

● Execution time / bytes scanned

Time span | 1 min          | 10 min         | 30 min         | 1 hour
Kafka     | 0:10 / 231 GB  | 1:17 / 2.21 TB | 3:41 / 6.58 TB | 7:15 / 13.1 TB
HDFS      | 0:07 / 28.5 GB | 0:07 / 31.8 GB | 0:09 / 38.7 GB | 0:32 / 40.6 GB

HDFS Parquet format allows scanning only certain columns.

Slide 17

Slide 17 text

Scaling
● For a fixed time span of 10 minutes [~2B rows, ~2 TB]
● Execution time vs. cluster size

Slide 18

Slide 18 text

Scaling
[Charts: Kafka topic tx (MB/s), Presto workers rx (MB/s), and broker tx (MB/s), for cluster sizes of 10, 20, 50, 75, and 100 workers]
● Bottleneck: network bandwidth per host
● Bottleneck: broker serving rate

Slide 19

Slide 19 text

Q&A

Slide 20

Slide 20 text

Presto Connector Interface
● ConnectorMetadata
  ○ Schema, Table, Column
● ConnectorSplitManager
  ○ Divides the data into splits
● ConnectorSplit
  ○ Split data range
  ○ Predicate/Aggregation/Limit pushdown
● ConnectorRecordCursor
  ○ Transforms the underlying storage data into Presto-internal pages/blocks
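
To show how these responsibilities line up, here is a deliberately simplified Java sketch; the interface and method names below are hypothetical illustrations, not the actual Presto SPI signatures.

import java.util.List;

// Hypothetical, simplified interfaces -- the real Presto SPI differs; this only
// illustrates how the connector responsibilities divide up.
interface SimpleMetadata {
    List<String> listTables(String schema);      // ConnectorMetadata: schemas, tables, columns
}

interface SimpleSplitManager {
    List<SimpleSplit> getSplits(String table);   // ConnectorSplitManager: divide the data into splits
}

interface SimpleSplit {
    long startOffset();                          // ConnectorSplit: a data range, plus any
    long endOffset();                            // pushed-down predicates/aggregations/limits
}

interface SimpleRecordCursor {
    boolean advanceNextRow();                    // cursor: step through the underlying storage data
    Object getValue(int column);                 // that Presto packs into internal pages/blocks
}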

Slide 21

Slide 21 text

Join the Presto Community
● Request a new feature or file a bug: github.com/prestodb/presto
● Slack: prestodb.slack.com
● Twitter: @prestodb

Stay up-to-date with Ahana
● URL: ahana.io
● Twitter: @ahanaio