
Tech Talk Series – Level 102 for Presto: Getting Started with PrestoDB

In this tech talk, we’ll build on what we learned in Level 101 and show you how easy it is to get started with PrestoDB. We’ll also dive deeper into key features and connectors.

Ahana

August 13, 2020

Transcript

  1. Level 102 for Presto: SQL on Everything. Part 2 of the Tech Talk Series for Presto.
     Getting started with PrestoDB? What's the difference?
     Da Cheng, Software Engineer, Twitter
     Dipti Borkar, Co-Founder & CPO, Ahana
  2. Presto Connector
     • Connector: Driver for a data source
       ◦ Access HDFS datasets, e.g., Hive
       ◦ Extend to real-time event streams, e.g., Kafka
       ◦ Push-down/offload computation to data engines, e.g., Druid
     • Data model from the user's perspective
       ◦ Catalog: Contains schemas from a data source specified by the connector
       ◦ Schema: Namespace to organize tables
       ◦ Table: Set of unordered rows organized into columns with types

     SHOW TABLES FROM hive.default;
     SELECT * FROM hive.default.employees LIMIT 1;
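
     A quick way to walk this catalog.schema.table hierarchy from the Presto CLI (a minimal sketch, assuming the hive catalog and employees table shown above are configured):

     SHOW CATALOGS;                     -- one entry per catalog properties file
     SHOW SCHEMAS FROM hive;            -- namespaces within the hive catalog
     DESCRIBE hive.default.employees;   -- typed columns of a table
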
  3. Presto Connector
     • Catalog properties files
       ◦ Hive: etc/catalog/hive.properties
       ◦ Kafka: etc/catalog/kafka.properties
       ◦ Druid: etc/catalog/druid.properties

     # etc/catalog/hive.properties
     connector.name=hive-hadoop2
     hive.metastore.uri=thrift://example.net:9083

     # etc/catalog/kafka.properties
     connector.name=kafka
     kafka.table-names=table1,table2
     kafka.nodes=host1:port,host2:port

     # etc/catalog/druid.properties
     connector.name=druid
     druid.coordinator-url=hostname:port
     druid.broker-url=hostname:port
     druid.schema-name=schema
     druid.compute-pushdown-enabled=true
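
     With more than one catalog configured, a single Presto query can read across them. The join below is purely illustrative (the hive users table and the user_id column are hypothetical, not from the deck; kafka.default.table1 assumes the table-names setting above and the connector's default schema):

     SELECT k.user_id, count(*) AS events
     FROM kafka.default.table1 k
     JOIN hive.default.users h ON k.user_id = h.user_id
     GROUP BY k.user_id
     ORDER BY events DESC
     LIMIT 10;
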
  4. Presto Kafka Connector
     • How does it work?
       ◦ Kafka brokers
       ◦ Topic schema
       ◦ Read from Kafka brokers
     • Twitter setup
     • Performance
     • Scalability
  5. Topic Schema
     • A table definition file consists of a JSON definition for a table. The name of the file can be arbitrary but must end in .json

     {
       "tableName": ...,
       "schemaName": ...,
       "topicName": ...,
       "key": {
         "dataFormat": ...,
         "fields": [ ... ]
       },
       "message": {
         "dataFormat": ...,
         "fields": [ ... ]
       }
     }
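
     Once the definition file is registered with the Kafka catalog, the fields declared under "key" and "message" show up as typed columns. A minimal sketch, assuming the test table used later in the deck (the column names follow the later query):

     DESCRIBE kafka.default.test;

     SELECT ip_address, response_size
     FROM kafka.default.test
     LIMIT 5;
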
  6. Read from a topic broker
     [Diagram: topic brokers A and B each hold several partitions; the Coordinator, for ALL partitions, calls getOffsetsByTimestamp(ts) and builds splits with offsets; each Worker polls and decodes]
     • Coordinator builds splits in a single-threaded execution
     • getOffsetsByTimestamp()
       ◦ API call to get message offsets for the timestamp range of interest
       ◦ E.g., a KafkaClient subscribed to broker A returns the offset range for Partition 1
  7. Read from a topic broker (slightly smarter!)
     [Diagram: the Coordinator, for ALL partitions, builds splits with only the timestamp; each Worker gets offsets for its split, polls, and decodes]
     • Decouple getOffsetsByTimestamp() from the single-threaded coordinator execution
       ◦ Saves a fixed ~90 seconds previously spent getting offset ranges for all 1,000 partitions in the test topic
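
     One way to sanity-check which offset ranges the splits end up covering is to group on the connector's internal columns. A sketch, assuming the Kafka connector exposes _partition_id and _partition_offset (as the open-source connector documents):

     SELECT _partition_id,
            min(_partition_offset) AS first_offset,
            max(_partition_offset) AS last_offset,
            count(*) AS messages
     FROM kafka.default.test
     GROUP BY _partition_id
     ORDER BY _partition_id;
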
  8. Read from a topic broker (by segment)
     [Diagram: segments (msg1, msg2), (msg3, msg4), (msg5, msg6), ... with offsets 0x01, 0x03, 0x05 and timestamps t1, t3, t5; a Split covering msg3..msg6 is read through ThriftRecordDecoder and KafkaRecordSetCursor into rows]
     • Used segment timestamps (before KafkaClient 1.0.0) to limit scanned data
  9. Read from a topic broker (with message offset)
     • Deprecate the notion of Segment
     • Use the internal field _timestamp in the message format (after 1.0.0) to limit data at finer granularity
     [Diagram: messages msg1..msg6 with offsets 0x01..0x06 and timestamps t1..t6; a Split covering only msg4..msg5 is read through ThriftRecordDecoder and KafkaRecordSetCursor into rows]
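
     In query form, the finer-grained pruning boils down to a range predicate on the internal _timestamp column (the time bounds here are illustrative; the pattern matches the queries on the following slides):

     SELECT count(*)
     FROM kafka.default.test
     WHERE _timestamp BETWEEN cast(to_unixtime(CAST('2019-11-13 21:00:00' AS timestamp)) * 1000 AS bigint)
                          AND cast(to_unixtime(CAST('2019-11-13 21:01:00' AS timestamp)) * 1000 AS bigint);
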
  10. Setup
      • Presto cluster on Aurora
        ◦ Dedicated hosts (55+ CPUs)
      • A topic with 1,000 partitions
        ◦ 30 brokers, each holding 100 partitions
          ▪ Message replication factor = 3
          ▪ A total of 1,000 partitions
        ◦ Each partition serving 10 MB/s (SLA)
      • HDFS counterpart
        ◦ ~1 hour latency
  11. Performance
      • Query (timestamp_ms is the event time; _timestamp is the processing time)

      SELECT ip_address, count(1) AS cnt, avg(response_size), max(_timestamp), max(authentication.user_id)
      FROM kafka.default.test
      WHERE timestamp_ms BETWEEN cast(to_unixtime(CAST('2019-11-13 21:00:00' AS timestamp))*1000 AS bigint)
                             AND cast(to_unixtime(CAST('2019-11-13 21:01:00' AS timestamp))*1000 AS bigint)
        AND _timestamp BETWEEN cast(to_unixtime(CAST('2019-11-13 21:00:00' AS timestamp))*1000 AS bigint)
                           AND cast(to_unixtime(CAST('2019-11-13 21:01:00' AS timestamp))*1000 AS bigint)
      GROUP BY ip_address
      ORDER BY cnt DESC
      LIMIT 10;
  12. Performance
      • Execution time
        ◦ Uses a cluster of 50 Presto workers
        ◦ Messages from one partition are packaged into a single split, so the number of splits is fixed
        ◦ Execution time increases linearly with the time span of interest (which determines the message range)

        Time span            2 min    10 min   1 hour
        Kafka-clients 1.1.1  0:21     1:17     7:15

      • Data scanned

        Time span            2 min    10 min   1 hour
        Kafka-clients 1.1.1  470GB    2.21TB   13.1TB
  13. Validation
      • Query
        ◦ Use count(*) without other aggregations to show the difference vs. HDFS
        ◦ Fetching one hour of data is equivalent to selecting one datehour partition on HDFS

      Kafka:
      SELECT count(*) FROM kafka.default.test
      WHERE timestamp_ms BETWEEN cast(to_unixtime(cast('2019-11-22 23:00:00' AS timestamp))*1000 AS bigint)
                             AND cast(to_unixtime(cast('2019-11-23 00:00:00' AS timestamp))*1000 AS bigint)
        AND _timestamp BETWEEN cast(to_unixtime(cast('2019-11-22 23:00:00' AS timestamp))*1000 AS bigint)
                           AND cast(to_unixtime(cast('2019-11-23 00:00:00' AS timestamp))*1000 AS bigint);

      HDFS:
      SELECT count(*) FROM hive.logs.test
      WHERE timestamp_ms BETWEEN cast(to_unixtime(cast('2019-11-22 23:00:00' AS timestamp))*1000 AS bigint)
                             AND cast(to_unixtime(cast('2019-11-23 00:00:00' AS timestamp))*1000 AS bigint)
        AND datehour='2019112223';
  14. Validation
      • count(*) results

        Time span    1 min        10 min        30 min        1 hour
        Kafka        298655376    2922330813    8715311075    17310768994
        HDFS         299548797    2921707581    8713096303    17307371235
        Difference   0.3%         -0.02%        -0.03%        -0.02%

      • Execution time / bytes scanned

        Time span    1 min           10 min          30 min          1 hour
        Kafka        0:10 / 231GB    1:17 / 2.21TB   3:41 / 6.58TB   7:15 / 13.1TB
        HDFS         0:07 / 28.5GB   0:07 / 31.8GB   0:09 / 38.7GB   0:32 / 40.6GB

        HDFS's Parquet format allows scanning only certain columns.
  15. Scaling
      • For a fixed time span of 10 minutes [~2B rows, ~2TB]
      • Execution time vs. cluster size
  16. Scaling
      [Charts: Kafka topic tx (MB/s), Presto worker rx (MB/s), and broker tx (MB/s) for cluster sizes 10, 20, 50, 75, and 100]
      • Bottleneck: network bandwidth per host
      • Bottleneck: broker serving rate
  17. Q&A

  18. Presto Connector Interface
      • ConnectorMetadata
        ◦ Schema, Table, Column
      • ConnectorSplitManager
        ◦ Divide data into splits
      • ConnectorSplit
        ◦ Split data range
        ◦ Predicate/Aggregation/Limit pushdown
      • ConnectorRecordCursor
        ◦ Transform underlying storage data into Presto's internal page/block
  19. Join the Presto Community
      • Request a new feature or file a bug: github.com/prestodb/presto
      • Slack: prestodb.slack.com
      • Twitter: @prestodb

      Stay up-to-date with Ahana
      • URL: ahana.io
      • Twitter: @ahanaio