
Tech Talk Series – Level 102 for Presto: Getting Started with PrestoDB

In this tech talk, we’ll build on what we learned in Level 101 and show you how easy it is to get started with PrestoDB. We’ll also dive deeper into key features and connectors.

Ahana

August 13, 2020

Transcript

  1. Level 102 for Presto: SQL on Everything. Part 2 of the Tech Talk Series for Presto.
     Getting started with PrestoDB? What's the difference?
     Da Cheng, Software Engineer, Twitter
     Dipti Borkar, Co-Founder & CPO, Ahana
  2. Presto Connector
     • Connector: Driver for a data source
       ◦ Access HDFS datasets, e.g., Hive
       ◦ Extend to real-time event streams, e.g., Kafka
       ◦ Push-down/offload computation to data engines, e.g., Druid
     • Data model from the user's perspective
       ◦ Catalog: Contains schemas from a data source specified by the connector
       ◦ Schema: Namespace to organize tables
       ◦ Table: Set of unordered rows organized into columns with types

     SHOW TABLES FROM hive.default;
     SELECT * FROM hive.default.employees LIMIT 1;
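
     A quick way to walk this catalog.schema.table hierarchy from the Presto CLI (a minimal sketch, assuming the hive catalog and employees table shown above are configured):

     SHOW CATALOGS;                     -- one entry per catalog properties file
     SHOW SCHEMAS FROM hive;            -- namespaces within the hive catalog
     DESCRIBE hive.default.employees;   -- typed columns of a table
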
  3. Presto Connector
     • Catalog properties files
       ◦ Hive: etc/catalog/hive.properties
       ◦ Kafka: etc/catalog/kafka.properties
       ◦ Druid: etc/catalog/druid.properties

     # etc/catalog/hive.properties
     connector.name=hive-hadoop2
     hive.metastore.uri=thrift://example.net:9083

     # etc/catalog/kafka.properties
     connector.name=kafka
     kafka.table-names=table1,table2
     kafka.nodes=host1:port,host2:port

     # etc/catalog/druid.properties
     connector.name=druid
     druid.coordinator-url=hostname:port
     druid.broker-url=hostname:port
     druid.schema-name=schema
     druid.compute-pushdown-enabled=true
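
     With more than one catalog configured, a single Presto query can read across them. The join below is purely illustrative (the hive users table and the user_id column are hypothetical, not from the deck; kafka.default.table1 assumes the table-names setting above and the connector's default schema):

     SELECT k.user_id, count(*) AS events
     FROM kafka.default.table1 k
     JOIN hive.default.users h ON k.user_id = h.user_id
     GROUP BY k.user_id
     ORDER BY events DESC
     LIMIT 10;
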
  4. Presto Kafka Connector
     • How does it work?
       ◦ Kafka brokers
       ◦ Topic schema
       ◦ Read from Kafka brokers
     • Twitter setup
     • Performance
     • Scalability
  5. Topic Schema
     • A table definition file consists of a JSON definition for a table. The name of the file can be arbitrary but must end in .json

     {
       "tableName": ...,
       "schemaName": ...,
       "topicName": ...,
       "key": {
         "dataFormat": ...,
         "fields": [ ... ]
       },
       "message": {
         "dataFormat": ...,
         "fields": [ ... ]
       }
     }
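
     Once the definition file is registered with the Kafka catalog, the fields declared under "key" and "message" show up as typed columns. A minimal sketch, assuming the test table used later in the deck (the column names follow the later query):

     DESCRIBE kafka.default.test;

     SELECT ip_address, response_size
     FROM kafka.default.test
     LIMIT 5;
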
  6. Read from a topic broker
     [Diagram: topic brokers A and B each hold several partitions; the Coordinator, for ALL partitions, calls getOffsetsByTimestamp(ts) and builds splits with offsets; each Worker polls and decodes]
     • Coordinator builds splits in a single-threaded execution
     • getOffsetsByTimestamp()
       ◦ API call to get message offsets for the timestamp range of interest
       ◦ E.g., a KafkaClient subscribed to broker A returns the offset range for Partition 1
  7. Read from a topic broker (slightly smarter!)
     [Diagram: the Coordinator, for ALL partitions, builds splits with only the timestamp; each Worker gets offsets for its split, polls, and decodes]
     • Decouple getOffsetsByTimestamp() from the single-threaded coordinator execution
       ◦ Saves a fixed ~90 seconds previously spent getting offset ranges for all 1,000 partitions in the test topic
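
     One way to sanity-check which offset ranges the splits end up covering is to group on the connector's internal columns. A sketch, assuming the Kafka connector exposes _partition_id and _partition_offset (as the open-source connector documents):

     SELECT _partition_id,
            min(_partition_offset) AS first_offset,
            max(_partition_offset) AS last_offset,
            count(*) AS messages
     FROM kafka.default.test
     GROUP BY _partition_id
     ORDER BY _partition_id;
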
  8. Read from a topic broker (by segment)
     [Diagram: segments (msg1, msg2), (msg3, msg4), (msg5, msg6), ... with offsets 0x01, 0x03, 0x05 and timestamps t1, t3, t5; a Split covering msg3..msg6 is read through ThriftRecordDecoder and KafkaRecordSetCursor into rows]
     • Used segment timestamps (before KafkaClient 1.0.0) to limit scanned data
  9. Read from a topic broker (with message offset)
     • Deprecate the notion of Segment
     • Use the internal field _timestamp in the message format (after 1.0.0) to limit data at finer granularity
     [Diagram: messages msg1..msg6 with offsets 0x01..0x06 and timestamps t1..t6; a Split covering only msg4..msg5 is read through ThriftRecordDecoder and KafkaRecordSetCursor into rows]
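
     In query form, the finer-grained pruning boils down to a range predicate on the internal _timestamp column (the time bounds here are illustrative; the pattern matches the queries on the following slides):

     SELECT count(*)
     FROM kafka.default.test
     WHERE _timestamp BETWEEN cast(to_unixtime(CAST('2019-11-13 21:00:00' AS timestamp)) * 1000 AS bigint)
                          AND cast(to_unixtime(CAST('2019-11-13 21:01:00' AS timestamp)) * 1000 AS bigint);
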
  10. Setup
      • Presto cluster on Aurora
        ◦ Dedicated hosts (55+ CPUs)
      • A topic with 1,000 partitions
        ◦ 30 brokers, each holding 100 partitions
          ▪ Message replication factor = 3
          ▪ A total of 1,000 partitions
        ◦ Each partition serving 10 MB/s (SLA)
      • HDFS counterpart
        ◦ ~1 hour latency
  11. Performance
      • Query (timestamp_ms is the event time; _timestamp is the processing time)

      SELECT ip_address, count(1) AS cnt, avg(response_size), max(_timestamp), max(authentication.user_id)
      FROM kafka.default.test
      WHERE timestamp_ms BETWEEN cast(to_unixtime(CAST('2019-11-13 21:00:00' AS timestamp))*1000 AS bigint)
                             AND cast(to_unixtime(CAST('2019-11-13 21:01:00' AS timestamp))*1000 AS bigint)
        AND _timestamp BETWEEN cast(to_unixtime(CAST('2019-11-13 21:00:00' AS timestamp))*1000 AS bigint)
                           AND cast(to_unixtime(CAST('2019-11-13 21:01:00' AS timestamp))*1000 AS bigint)
      GROUP BY ip_address
      ORDER BY cnt DESC
      LIMIT 10;
  12. Performance
      • Execution time
        ◦ Uses a cluster of 50 Presto workers
        ◦ Messages from one partition are packaged into a single split, so the number of splits is fixed
        ◦ Execution time increases linearly with the time span of interest (which determines the message range)

        Time span            2 min    10 min   1 hour
        Kafka-clients 1.1.1  0:21     1:17     7:15

      • Data scanned

        Time span            2 min    10 min   1 hour
        Kafka-clients 1.1.1  470GB    2.21TB   13.1TB
  13. Validation
      • Query
        ◦ Use count(*) without other aggregations to show the difference vs. HDFS
        ◦ Fetching one hour of data is equivalent to selecting one datehour partition on HDFS

      Kafka:
      SELECT count(*) FROM kafka.default.test
      WHERE timestamp_ms BETWEEN cast(to_unixtime(cast('2019-11-22 23:00:00' AS timestamp))*1000 AS bigint)
                             AND cast(to_unixtime(cast('2019-11-23 00:00:00' AS timestamp))*1000 AS bigint)
        AND _timestamp BETWEEN cast(to_unixtime(cast('2019-11-22 23:00:00' AS timestamp))*1000 AS bigint)
                           AND cast(to_unixtime(cast('2019-11-23 00:00:00' AS timestamp))*1000 AS bigint);

      HDFS:
      SELECT count(*) FROM hive.logs.test
      WHERE timestamp_ms BETWEEN cast(to_unixtime(cast('2019-11-22 23:00:00' AS timestamp))*1000 AS bigint)
                             AND cast(to_unixtime(cast('2019-11-23 00:00:00' AS timestamp))*1000 AS bigint)
        AND datehour='2019112223';
  14. Validation
      • count(*) results

        Time span    1 min        10 min        30 min        1 hour
        Kafka        298655376    2922330813    8715311075    17310768994
        HDFS         299548797    2921707581    8713096303    17307371235
        Difference   0.3%         -0.02%        -0.03%        -0.02%

      • Execution time / bytes scanned

        Time span    1 min           10 min          30 min          1 hour
        Kafka        0:10 / 231GB    1:17 / 2.21TB   3:41 / 6.58TB   7:15 / 13.1TB
        HDFS         0:07 / 28.5GB   0:07 / 31.8GB   0:09 / 38.7GB   0:32 / 40.6GB

        HDFS's Parquet format allows scanning only certain columns.
  15. Scaling
      • For a fixed time span of 10 minutes [~2B rows, ~2TB]
      • Execution time vs. cluster size
  16. Scaling
      [Charts: Kafka topic tx (MB/s), Presto worker rx (MB/s), and broker tx (MB/s) for cluster sizes 10, 20, 50, 75, and 100]
      • Bottleneck: network bandwidth per host
      • Bottleneck: broker serving rate
  17. Q&A

  18. Presto Connector Interface
      • ConnectorMetadata
        ◦ Schema, Table, Column
      • ConnectorSplitManager
        ◦ Divide data into splits
      • ConnectorSplit
        ◦ Split data range
        ◦ Predicate/Aggregation/Limit pushdown
      • ConnectorRecordCursor
        ◦ Transform underlying storage data into Presto's internal page/block
  19. Join the Presto Community
      • Request a new feature or file a bug: github.com/prestodb/presto
      • Slack: prestodb.slack.com
      • Twitter: @prestodb

      Stay up-to-date with Ahana
      • URL: ahana.io
      • Twitter: @ahanaio