Presto in Treasure Data

Presto in Treasure Data (Updated) Mitsunori Komatsu

• Mitsunori Komatsu,   Software engineer @ Treasure Data. •
Presto, Hive, PlazmaDB, td-android-sdk,   td-ios-sdk, Mobile SDK backend,  embedded-sdk • github:komamitsu,  msgpack-java committer, Presto contributor, etc… About me

Today’s talk • What's Presto? • Pros & Cons •
Architecture • Recent updates • Who uses Presto? • How do we use Presto?

What’s Presto?

Fast • Distributed SQL query engine (MPP) • Low latency
and good performance • No disk IO • Pipelined execution (not Map Reduce) • Compile a query plan down to byte code • Off heap memory • Suitable for ad-hoc query

Pluggable • Pluggable backends (“connectors”) • Cassandra / Hive /
JMX / Kafka / MySQL / PostgreSQL / System / TPCH • We can add a new connector by   extending SPI • Treasure Data has been developed a connector to access our storage

What kind of SQL • Supports ANSI SQL (Not HiveQL)
• Easy to use Presto compared to HiveQL • Structural type: Map, Array, JSON, Row • Window functions • Approximate queries • http://blinkdb.org/

Limitations • Fails with huge JOIN or DISTINCT • In
memory only (broadcast / distributed JOIN) • No grace / hybrid hash join • No fault tolerance • Coordinator is SPOF • No “cost based” optimization • No authentication / authorization • No native ODBC => Prestogres

Limitations • Fails with huge JOIN or DISTINCT • In
memory only (broadcast / distributed JOIN) • No grace / hybrid hash join • No fault tolerance • Coordinator is SPOF • No “cost based” optimization • No authentication / authorization • No native ODBC => Prestogres https://highlyscalable.wordpress.com/2013/08/20/in-stream-big-data-processing/

Architectural overview https://prestodb.io/overview.html With Hive connector

Query plan Output[nationkey, _col1] => [nationkey:bigint, count:bigint]  - _col1 :=
count Exchange[GATHER] => nationkey:bigint, count:bigint Aggregate(FINAL)[nationkey] => [nationkey:bigint, count:bigint]  - count := "count"("count_15") Exchange[REPARTITION] => nationkey:bigint, count_15:bigint Aggregate(PARTIAL)[nationkey] => [nationkey:bigint, count_15:bigint] - count_15 := "count"("expr") Project => [nationkey:bigint, expr:bigint] - expr := 1 InnerJoin[("custkey" = "custkey_0")] => [custkey:bigint, custkey_0:bigint, nationkey:bigint] Project => [custkey:bigint] Filter[("orderpriority" = '1-URGENT')] => [custkey:bigint, orderpriority:varchar] TableScan[tpch:tpch:orders:sf0.01, original constraint=  ('1-URGENT' = "orderpriority")] => [custkey:bigint, orderpriority:varchar]  - custkey := tpch:custkey:1  - orderpriority := tpch:orderpriority:5 Exchange[REPLICATE] => custkey_0:bigint, nationkey:bigint TableScan[tpch:tpch:customer:sf0.01, original constraint=true] => [custkey_0:bigint, nationkey:bigint]  - custkey_0 := tpch:custkey:0  - nationkey := tpch:nationkey:3 select  c.nationkey,  count(1)  from orders o join customer c  on o.custkey = c.custkey where o.orderpriority = '1-URGENT' group by c.nationkey

Stage 1 Stage 2 Stage 0 Query, stage, task and
split Query Task 0.0 Split Task 1.0 Split Task 1.1 Task 1.2 Split Split Split Task 2.0 Split Task 2.1 Task 2.2 Split Split Split Split Split Split Split Split For example… TableScan (FROM) Aggregation (GROUP BY) Output @worker#2 @worker#3 @worker#0

count Exchange[GATHER] => nationkey:bigint, count:bigint Aggregate(FINAL)[nationkey] => [nationkey:bigint, count:bigint]  - count := "count"("count_15") Exchange[REPARTITION] => nationkey:bigint, count_15:bigint Aggregate(PARTIAL)[nationkey] => [nationkey:bigint, count_15:bigint] - count_15 := "count"("expr") Project => [nationkey:bigint, expr:bigint] - expr := 1 InnerJoin[("custkey" = "custkey_0")] => [custkey:bigint, custkey_0:bigint, nationkey:bigint] Project => [custkey:bigint] Filter[("orderpriority" = '1-URGENT')] => [custkey:bigint, orderpriority:varchar] TableScan[tpch:tpch:orders:sf0.01, original constraint=  ('1-URGENT' = "orderpriority")] => [custkey:bigint, orderpriority:varchar]  - custkey := tpch:custkey:1  - orderpriority := tpch:orderpriority:5 Exchange[REPLICATE] => custkey_0:bigint, nationkey:bigint TableScan[tpch:tpch:customer:sf0.01, original constraint=true] => [custkey_0:bigint, nationkey:bigint]  - custkey_0 := tpch:custkey:0  - nationkey := tpch:nationkey:3 select  c.nationkey,  count(1)  from orders o join customer c  on o.custkey = c.custkey where o.orderpriority = '1-URGENT' group by c.nationkey

count Exchange[GATHER] => nationkey:bigint, count:bigint Aggregate(FINAL)[nationkey] => [nationkey:bigint, count:bigint]  - count := "count"("count_15") Exchange[REPARTITION] => nationkey:bigint, count_15:bigint Aggregate(PARTIAL)[nationkey] => [nationkey:bigint, count_15:bigint] - count_15 := "count"("expr") Project => [nationkey:bigint, expr:bigint] - expr := 1 InnerJoin[("custkey" = "custkey_0")] => [custkey:bigint, custkey_0:bigint, nationkey:bigint] Project => [custkey:bigint] Filter[("orderpriority" = '1-URGENT')] => [custkey:bigint, orderpriority:varchar] TableScan[tpch:tpch:orders:sf0.01, original constraint=  ('1-URGENT' = "orderpriority")] => [custkey:bigint, orderpriority:varchar]  - custkey := tpch:custkey:1  - orderpriority := tpch:orderpriority:5 Exchange[REPLICATE] => custkey_0:bigint, nationkey:bigint TableScan[tpch:tpch:customer:sf0.01, original constraint=true] => [custkey_0:bigint, nationkey:bigint]  - custkey_0 := tpch:custkey:0  - nationkey := tpch:nationkey:3 select  c.nationkey,  count(1)  from orders o join customer c  on o.custkey = c.custkey where o.orderpriority = '1-URGENT' group by c.nationkey Stage 3

count Exchange[GATHER] => nationkey:bigint, count:bigint Aggregate(FINAL)[nationkey] => [nationkey:bigint, count:bigint]  - count := "count"("count_15") Exchange[REPARTITION] => nationkey:bigint, count_15:bigint Aggregate(PARTIAL)[nationkey] => [nationkey:bigint, count_15:bigint] - count_15 := "count"("expr") Project => [nationkey:bigint, expr:bigint] - expr := 1 InnerJoin[("custkey" = "custkey_0")] => [custkey:bigint, custkey_0:bigint, nationkey:bigint] Project => [custkey:bigint] Filter[("orderpriority" = '1-URGENT')] => [custkey:bigint, orderpriority:varchar] TableScan[tpch:tpch:orders:sf0.01, original constraint=  ('1-URGENT' = "orderpriority")] => [custkey:bigint, orderpriority:varchar]  - custkey := tpch:custkey:1  - orderpriority := tpch:orderpriority:5 Exchange[REPLICATE] => custkey_0:bigint, nationkey:bigint TableScan[tpch:tpch:customer:sf0.01, original constraint=true] => [custkey_0:bigint, nationkey:bigint]  - custkey_0 := tpch:custkey:0  - nationkey := tpch:nationkey:3 select  c.nationkey,  count(1)  from orders o join customer c  on o.custkey = c.custkey where o.orderpriority = '1-URGENT' group by c.nationkey Stage 3 Stage 2

count Exchange[GATHER] => nationkey:bigint, count:bigint Aggregate(FINAL)[nationkey] => [nationkey:bigint, count:bigint]  - count := "count"("count_15") Exchange[REPARTITION] => nationkey:bigint, count_15:bigint Aggregate(PARTIAL)[nationkey] => [nationkey:bigint, count_15:bigint] - count_15 := "count"("expr") Project => [nationkey:bigint, expr:bigint] - expr := 1 InnerJoin[("custkey" = "custkey_0")] => [custkey:bigint, custkey_0:bigint, nationkey:bigint] Project => [custkey:bigint] Filter[("orderpriority" = '1-URGENT')] => [custkey:bigint, orderpriority:varchar] TableScan[tpch:tpch:orders:sf0.01, original constraint=  ('1-URGENT' = "orderpriority")] => [custkey:bigint, orderpriority:varchar]  - custkey := tpch:custkey:1  - orderpriority := tpch:orderpriority:5 Exchange[REPLICATE] => custkey_0:bigint, nationkey:bigint TableScan[tpch:tpch:customer:sf0.01, original constraint=true] => [custkey_0:bigint, nationkey:bigint]  - custkey_0 := tpch:custkey:0  - nationkey := tpch:nationkey:3 select  c.nationkey,  count(1)  from orders o join customer c  on o.custkey = c.custkey where o.orderpriority = '1-URGENT' group by c.nationkey Stage 3 Stage 2 Stage 1

count Exchange[GATHER] => nationkey:bigint, count:bigint Aggregate(FINAL)[nationkey] => [nationkey:bigint, count:bigint]  - count := "count"("count_15") Exchange[REPARTITION] => nationkey:bigint, count_15:bigint Aggregate(PARTIAL)[nationkey] => [nationkey:bigint, count_15:bigint] - count_15 := "count"("expr") Project => [nationkey:bigint, expr:bigint] - expr := 1 InnerJoin[("custkey" = "custkey_0")] => [custkey:bigint, custkey_0:bigint, nationkey:bigint] Project => [custkey:bigint] Filter[("orderpriority" = '1-URGENT')] => [custkey:bigint, orderpriority:varchar] TableScan[tpch:tpch:orders:sf0.01, original constraint=  ('1-URGENT' = "orderpriority")] => [custkey:bigint, orderpriority:varchar]  - custkey := tpch:custkey:1  - orderpriority := tpch:orderpriority:5 Exchange[REPLICATE] => custkey_0:bigint, nationkey:bigint TableScan[tpch:tpch:customer:sf0.01, original constraint=true] => [custkey_0:bigint, nationkey:bigint]  - custkey_0 := tpch:custkey:0  - nationkey := tpch:nationkey:3 select  c.nationkey,  count(1)  from orders o join customer c  on o.custkey = c.custkey where o.orderpriority = '1-URGENT' group by c.nationkey Stage 3 Stage 2 Stage 1 Stage 0

What each component does? Presto Cli Coordinator - Parse Query
- Analyze Query - Create Query Plan - Execute Query - Contains Stages - Execute Stages - Contains Tasks - Issue Tasks Discovery Service Worker Worker - Execute Tasks - Convert Query to Java Bytecode (Operator) - Execute Operator Connector - MetaData - Table, Column… - SplitManager - Split, … Connector - RecordSetProvider - RecordSet - RecordCursor - Read Storage Connector Storage Worker Connector External  Metadata?

Presto Cli Coordinator - Parse Query - Analyze Query -
Create Query Plan - Execute Query - Contains Stages - Execute Stages - Contains Tasks - Issue Tasks Discovery Service Worker Worker - Execute Tasks - Convert Query to Java Bytecode (Operator) - Execute Operator Connector - MetaData - Table, Column… - SplitManager - Split, … Connector - RecordSetProvider - RecordSet - RecordCursor - Read Storage Connector Storage Worker Connector External  Metadata? What each component does?

Multi connectors Presto MySQL - test.users PostgreSQL - public.accesses -
public.codes Raptor - default.  user_codes create table raptor.default.user_codes as select c.text, u.name, count(1) as count from postgres.public.accesses a join mysql.test.users u on cast(a.user as bigint) = u.id join postgres.public.codes c on a.code = c.code where a.time < 1200000000

connector.name=mysql connection-url=jdbc:mysql://127.0.0.1:3306 connection-user=root - etc/catalog/mysql.properties connector.name=postgresql connection-url=jdbc:postgresql://127.0.0.1:5432/postgres connection-user=komamitsu - etc/catalog/postgres.properties
connector.name=raptor metadata.db.type=h2 metadata.db.filename=var/data/db/MetaStore - etc/catalog/postgres.properties Multi connectors

: 2015-08-30T22:29:28.882+0900 DEBUG 20150830_132930_00032_btyir. 4.0-0-50 com.facebook.presto.plugin.jdbc.JdbcRecordCursor Executing: SELECT `id`, `name`
FROM `test`.`users` : 2015-08-30T22:29:28.856+0900 DEBUG 20150830_132930_00032_btyir. 5.0-0-56 com.facebook.presto.plugin.jdbc.JdbcRecordCursor Executing: SELECT "text", "code" FROM “public"."codes" : 2015-08-30T22:30:09.294+0900 DEBUG 20150830_132930_00032_btyir. 3.0-2-70 com.facebook.presto.plugin.jdbc.JdbcRecordCursor Executing: SELECT "user", "code", "time" FROM "public"."accesses" WHERE (("time" < 1200000000)) : - log message Condition pushdown Join pushdown isn’t supported Multi connectors

Recent updates (0.108~0.117) New functions: normalize(), from_iso8601_timestamp(), from_iso8601_date(), to_iso8601() slice(),
md5(), array_min(), array_max(), histogram() element_at(), url_encode(), url_decode() sha1(), sha256(), sha512() multimap_agg(), checksum() Teradata compatibility functions: index(), char2hexint(), to_char(), to_date(), to_timestamp()

Recent updates (0.108~0.117) Cluster Resource Management. - query.max-memory - query.max-memory-per-node
- resources.reserved-system-memory “Big Query” option was removed Semi-joins are hash-partitioned if distributed_join is turned on. Add support for partial cast from JSON. For example, json can be cast to array<json>, map<varchar, json>, etc. Use JSON_PARSE() and JSON_FORMAT() instead of CAST Add query_max_run_time session property and query.max-run-time conﬁg. Queries are failed after the speciﬁed duration. optimizer.optimize-hash-generation and distributed-joins-enabled are both enabled by default now.

Recent updates (0.108~0.117) Cluster Resource Management. - query.max-memory - query.max-memory-per-node
- resources.reserved-system-memory “Big Query” option was removed Semi-joins are hash-partitioned if distributed_join is turned on. Add support for partial cast from JSON. For example, json can be cast to array<json>, map<varchar, json>, etc. Use JSON_PARSE() and JSON_FORMAT() instead of CAST Add query_max_run_time session property and query.max-run-time conﬁg. Queries are failed after the speciﬁed duration. optimizer.optimize-hash-generation and distributed-joins-enabled are both enabled by default now. https://highlyscalable.wordpress.com/2013/08/20/in-stream-big-data-processing/

Who uses Presto? • Facebook http://www.slideshare.net/dain1/presto-meetup-2015

• Dropbox Who uses Presto?

• Airbnb Who uses Presto?

• Qubole • SaaS • Treasure Data • SaaS •
Teradata • commercial support Who uses Presto? As a service…

Today’s talk • What's Presto? • How do we use
Presto? • What’s Treasure Data? • System architecture • How we manage Presto

What’s Treasure Data?

What’s Treasure Data? and more… • Collect, store and analyze
data • With multi-tenancy • With a schema-less structure • On cloud, but mitigates the outage Founded in 2011 in the U.S. Over 85 employees

What’s Treasure Data? and more… Founded in 2011 in the
U.S. Over 85 employees • Collect, store and analyze data • With multi-tenancy • With a schema-less structure • On cloud, but mitigates the outage

Time to Value Send query result Result Push Acquire Analyze
Store Plazma DB Flexible, Scalable, Columnar Storage Web Log App Log Censor CRM ERP RDBMS Treasure Agent(Server) SDK(JS, Android, iOS, Unity) Streaming Collector Batch / Reliability Ad-hoc /  Low latency KPI$ KPI Dashboard BI Tools Other Products RDBMS, Google Docs, AWS S3, FTP Server, etc. Metric Insights Tableau, Motion Boardetc. POS REST API ODBC / JDBC SQL, Pig Bulk Uploader Embulk,  TD Toolbelt SQL-based query @AWS or @IDCF Connectivity Economy & Flexibility Simple & Supported Collect! Store! Analyze!

Integrations?

Treasure Data in ﬁgures

Some of our customers

System architecture

Components in Treasure Data worker queue (MySQL) api server td
worker process plazmadb (PostgreSQL + S3/RiakCS) select user_id, count(1) from … Presto coordinator Presto worker Presto worker Presto worker Presto worker result bucket (S3/RiakCS) Retry failed query if needed Authentication / Authorization Columnar ﬁle format. Schema-less. td-presto connector

worker queue (MySQL) api server td worker process plazmadb (PostgreSQL
+ S3/RiakCS) select user_id, count(1) from … Presto coordinator Presto worker Presto worker Presto worker Presto worker Retry failed query if needed Authentication / Authorization Columnar ﬁle format. Schema-less. td-presto connector result bucket (S3/RiakCS) Components in Treasure Data

Schema on read time code method user_id 2015-06-01 10:07:11 200
GET 2015-06-01 10:10:12 “200” GET 2015-06-01 10:10:20 200 GET 2015-06-01 10:11:30 200 POST 2015-06-01 10:20:45 200 GET 2015-06-01 10:33:50 400 GET 206 2015-06-01 10:40:11 200 GET 852 2015-06-01 10:51:32 200 PUT 1223 2015-06-01 10:58:02 200 GET 5118 2015-06-01 11:02:11 404 GET 12 2015-06-01 11:14:27 200 GET 3447 access_logs table User added a new column “user_id” in imported data User can select this column with only adding it to the schema (w/o reconstruct the table) Schema on read

Columnar ﬁle format time code method user_id 2015-06-01 10:07:11 200
GET 2015-06-01 10:10:12 “200” GET 2015-06-01 10:10:20 200 GET 2015-06-01 10:11:30 200 POST 2015-06-01 10:20:45 200 GET 2015-06-01 10:33:50 400 GET 206 2015-06-01 10:40:11 200 GET 852 2015-06-01 10:51:32 200 PUT 1223 2015-06-01 10:58:02 200 GET 5118 2015-06-01 11:02:11 404 GET 12 2015-06-01 11:14:27 200 GET 3447 access_logs table time code method user_id Columnar ﬁle format This query accesses only code column   select code, count(1) from tbl group by code

td-presto connector • MessagePack v07 • off heap • avoiding
“TypeProﬁle” • Async IO with Jetty-client • Scheduling & Resource management

How we manage Presto • Blue-Green Deployment • Stress test
tool • Monitoring with DataDog

Blue-Green Deployment worker queue (MySQL) api server td worker process
plazmadb (PostgreSQL + S3/RiakCS) select user_id, count(1) from … Presto coordinator result bucket (S3) Presto coordinator Presto worker Presto worker Presto worker Presto worker Presto worker Presto worker Presto worker Presto worker production rc

plazmadb (PostgreSQL + S3/RiakCS) select user_id, count(1) from … Presto coordinator result bucket (S3) Presto coordinator Presto worker Presto worker Presto worker Presto worker Presto worker Presto worker Presto worker Presto worker production rc Test Test Test!

plazmadb (PostgreSQL + S3/RiakCS) select user_id, count(1) from … Presto coordinator result bucket (S3) Presto coordinator Presto worker Presto worker Presto worker Presto worker Presto worker Presto worker Presto worker Presto worker production! Release stable cluster No downtime

Stress test tool • Collect queries that has ever caused
issues. • Add a new query with just adding this entry. • Issue the query, gets the result and implements a calculated digest automatically.  • We can send all the queries including very heavy ones (around 6000 stages) to Presto - job_id: 28889999 - result: 227d16d801a9a43148c2b7149ce4657c - job_id: 28889999

Monitoring with DataDog Presto coordinator Presto worker Presto worker Presto
worker Presto worker Presto process td-agent in_presto_metrics /v1/jmx/mbean /v1/query /v1/node out_metricsense DataDog

Monitoring with DataDog Query stalled time - Most important for
us. - It triggers alert calls to us… - It can be mainly increased by td-presto connector problems. Most of them are race condition issue.

How many queries processed More than 20000 queries / day

How many queries processed Most queries ﬁnish within 1min

Analytics Infrastructure, Simpliﬁed in the Cloud We’re hiring! https://jobs.lever.co/treasure-data

Presto in Treasure Data

Presto in Treasure Data

More Decks by Mitsunori Komatsu

Other Decks in Technology

Featured

Transcript