
Tez Deep Dive with Geolocation Data

cj_harris5
October 05, 2013


Transcript

  1. Potential Uses of Geolocation Data

     Geolocation data can be used to:
     •  Locate people
     •  Locate assets

     Today we will focus on the vehicle location use case.
  2. © Hortonworks Inc. 2013 Hadoop 1 - Basics

     [Diagram: MapReduce (Computation Framework) running on top of HDFS
     (Storage Framework), with blocks A, B and C replicated across nodes.]
  3. © Hortonworks Inc. 2013 Hadoop 1 - Reading Files

     [Diagram: a Hadoop client asks the NameNode (fsimage/edits) to read a
     file; the NameNode returns DataNodes, block ids, etc., and the client
     reads blocks directly from DataNode/TaskTracker (DN | TT) nodes across
     Rack1 … RackN. The Secondary NameNode checkpoints the NameNode;
     DataNodes send heartbeats and block reports.]
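
     As a sketch of the client side of this read path (standard
     org.apache.hadoop.fs API; the file path is a hypothetical example):

        import java.io.BufferedReader;
        import java.io.InputStreamReader;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsRead {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);
                // open() consults the NameNode for block locations; the
                // stream then reads block data directly from the DataNodes.
                FSDataInputStream in =
                    fs.open(new Path("/data/trucks/events.csv")); // hypothetical path
                BufferedReader reader =
                    new BufferedReader(new InputStreamReader(in));
                System.out.println(reader.readLine());
                reader.close();
            }
        }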
  4. © Hortonworks Inc. 2013 Hadoop 1 - Writing Files

     [Diagram: a Hadoop client requests a write from the NameNode
     (fsimage/edits), which returns target DataNodes; the client writes
     blocks to DN | TT nodes, which replicate them via pipelining. The
     Secondary NameNode checkpoints; DataNodes send block reports.]
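
     The write path looks similar from the client side (again a sketch,
     with a hypothetical path):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsWrite {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);
                // create() asks the NameNode for target DataNodes; each
                // block the client writes is then replicated down the
                // DataNode pipeline.
                FSDataOutputStream out =
                    fs.create(new Path("/data/trucks/out.txt")); // hypothetical path
                out.writeUTF("truck,lat,lon");
                out.close();
            }
        }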
  5. © Hortonworks Inc. 2013 Hadoop 1 - Running Jobs

     [Diagram: a Hadoop client submits a job to the JobTracker, which
     deploys it onto DN | TT nodes across racks; map tasks feed reduce
     tasks through a shuffle, producing output part 0.]
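
     On the client side, a minimal Hadoop 1 job submission looks roughly
     like this (paths are hypothetical; with no mapper or reducer set,
     JobConf defaults to the identity classes):

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapred.FileInputFormat;
        import org.apache.hadoop.mapred.FileOutputFormat;
        import org.apache.hadoop.mapred.JobClient;
        import org.apache.hadoop.mapred.JobConf;

        public class SubmitJob {
            public static void main(String[] args) throws Exception {
                JobConf conf = new JobConf(SubmitJob.class);
                conf.setJobName("identity-pass"); // defaults to identity map/reduce
                FileInputFormat.setInputPaths(conf, new Path("/data/in"));   // hypothetical
                FileOutputFormat.setOutputPath(conf, new Path("/data/out")); // hypothetical
                // runJob() submits to the JobTracker, which deploys map and
                // reduce tasks onto TaskTrackers and blocks until completion.
                JobClient.runJob(conf);
            }
        }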
  6. © Hortonworks Inc. 2013 Hadoop 2

     •  Potentially up to 10,000 nodes per cluster
     •  O(cluster size)
     •  Supports multiple namespaces for managing HDFS
     •  Efficient cluster utilization (YARN)
     •  MRv1 backward and forward compatible
     •  Any apps can integrate with Hadoop
     •  Beyond Java
  7. © Hortonworks Inc. 2013 Hadoop 2 - Reading Files (w/ NN Federation)

     [Diagram: a Hadoop client reads a file through one of several federated
     NameNodes (NN1/ns1 … NN4/ns4), each paired with a Secondary NameNode
     (or a Backup NN) for checkpointing and fsimage/edits copies (fs sync).
     The chosen NameNode returns DataNodes, block ids, etc., and the client
     reads blocks from DN | NM nodes, which register and send
     heartbeats/block reports. Block pools map namespaces to DataNodes,
     e.g. ns1 → dn1, dn2; ns2 → dn1, dn3; ns3 → dn4, dn5; ns4 → dn4, dn5.]
  8. © Hortonworks Inc. 2013 Hadoop 2 - Writing Files

     [Diagram: as with reads, the client requests a write from one of the
     federated NameNodes (NN1/ns1 … NN4/ns4), which returns target
     DataNodes; the client writes blocks to DN | NM nodes with replication
     pipelining. Each NameNode is paired with a Secondary NameNode (or a
     Backup NN) for checkpoints and fsimage/edits copies (fs sync);
     DataNodes send block reports.]
  9. © Hortonworks Inc. 2013 Hadoop 2 - Running Jobs

     [Diagram: two Hadoop clients submit app1 and app2 to the
     ResourceManager, whose ApplicationsManager (ASM) and Scheduler create
     the applications. Per-application ApplicationMasters (AM1, AM2) run in
     containers on NodeManagers across Rack1 … RackN; each AM negotiates
     resources with the Scheduler, reports status to the ASM, and launches
     task containers (C1.1–C1.4, C2.1–C2.3) that the NodeManagers
     partition.]
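
     The client-side counterpart of this flow, sketched below with the
     org.apache.hadoop.yarn.client.api.YarnClient API, is a skeleton rather
     than a complete application: the ApplicationMaster launch spec is
     elided.

        import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
        import org.apache.hadoop.yarn.client.api.YarnClient;
        import org.apache.hadoop.yarn.client.api.YarnClientApplication;
        import org.apache.hadoop.yarn.conf.YarnConfiguration;

        public class SubmitYarnApp {
            public static void main(String[] args) throws Exception {
                YarnClient yarnClient = YarnClient.createYarnClient();
                yarnClient.init(new YarnConfiguration());
                yarnClient.start();

                // Ask the ResourceManager to create an application, then
                // fill in the submission context (AM container launch spec,
                // resources, queue).
                YarnClientApplication app = yarnClient.createApplication();
                ApplicationSubmissionContext ctx =
                    app.getApplicationSubmissionContext();
                ctx.setApplicationName("geo-app"); // hypothetical name
                // ... set AM ContainerLaunchContext, Resource, queue here ...

                // The Scheduler allocates a container for the
                // ApplicationMaster, which then negotiates task containers
                // on NodeManagers.
                yarnClient.submitApplication(ctx);
            }
        }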
 10. © Hortonworks Inc. 2013 Tez – Introduction

     •  Distributed execution framework targeted towards data-processing
        applications.
     •  Based on expressing a computation as a dataflow graph.
     •  Built on top of YARN – the resource management framework for
        Hadoop.
     •  Open source Apache incubator project, Apache licensed.
 11. © Hortonworks Inc. 2013 Tez – Empowering End Users

     •  Expressive dataflow definition APIs
        –  Enable definition of complex data-flow pipelines using simple
           graph connection APIs. Tez expands the logical plan at runtime.
        –  Targeted towards data processing applications like Hive/Pig but
           not limited to them. Hive/Pig query plans naturally map to Tez
           dataflow graphs with no translation impedance.

     [Diagram: a dataflow graph of tasks TaskA-1/2 through TaskE-1/2.]
 12. © Hortonworks Inc. 2013 Tez – Empowering End Users

     •  Flexible Input-Processor-Output runtime model
        –  Construct physical runtime executors dynamically by connecting
           different inputs, processors and outputs.
        –  End goal is to have a library of inputs, outputs and processors
           that can be programmatically composed to generate useful tasks,
           as sketched below.

     [Diagram examples: IntermediateReduce = ShuffleInput → ReduceProcessor
     → FileSortedOutput; FinalReduce = ShuffleInput → ReduceProcessor →
     HDFSOutput; PairwiseJoin = Input1 + Input2 → JoinProcessor →
     FileSortedOutput.]
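
     A rough sketch of that composition idea (the interface and class names
     here are invented to mirror the slide's model, not the real Tez APIs):

        import java.util.List;

        // Illustrative only: these interfaces mirror the slide's
        // Input-Processor-Output model, not the real Tez class names.
        interface Input     { Iterable<byte[]> read(); }
        interface Output    { void write(byte[] datum); }
        interface Processor { void run(List<Input> inputs, Output output); }

        // A "task" is nothing more than a composition of pluggable parts,
        // so a PairwiseJoin is two Inputs + a JoinProcessor + a sorted
        // file Output.
        class ComposedTask {
            private final List<Input> inputs;
            private final Processor processor;
            private final Output output;

            ComposedTask(List<Input> inputs, Processor processor, Output output) {
                this.inputs = inputs;
                this.processor = processor;
                this.output = output;
            }

            void execute() {
                processor.run(inputs, output);
            }
        }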
 13. © Hortonworks Inc. 2013 Tez – Empowering End Users

     •  Data type agnostic
        –  Tez is only concerned with the movement of data: files and
           streams of bytes.
        –  Does not impose any data format on the user application. An MR
           application can use key-value pairs on top of Tez; Hive and Pig
           can use tuple-oriented formats that are natural and native to
           them.

     [Diagram: user code exchanges key-value pairs or tuples, while the Tez
     task beneath it moves only bytes over files and streams.]
 14. © Hortonworks Inc. 2013 Tez – Empowering End Users

     •  Simplifying deployment
        –  Tez is a completely client-side application.
        –  No deployments to do: simply upload the Tez libraries to any
           accessible FileSystem and change the local Tez configuration to
           point to them.
        –  Enables running different versions concurrently. Easy to test
           new functionality while keeping stable versions for production.
        –  Leverages YARN local resources.

     [Diagram: two client machines with TezClient point at Tez Lib 1 and
     Tez Lib 2 on HDFS; NodeManagers localize the libraries to run
     TezTasks.]
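
     For illustration, pointing a client at an uploaded Tez version might
     look like the sketch below (it assumes the tez.lib.uris property and
     the TezConfiguration class; the HDFS path is hypothetical):

        import org.apache.tez.dag.api.TezConfiguration;

        public class TezLibConfig {
            public static void main(String[] args) {
                // Point this client at one uploaded Tez version; a second
                // client could point at a different version on the same
                // cluster, so both run concurrently.
                TezConfiguration tezConf = new TezConfiguration();
                tezConf.set(TezConfiguration.TEZ_LIB_URIS,   // "tez.lib.uris"
                    "hdfs://namenode:8020/apps/tez-lib-1/"); // hypothetical path
            }
        }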
 15. © Hortonworks Inc. 2013 Tez – Execution Performance

     •  Performance gains over MapReduce
        –  Eliminate replicated write barrier between successive
           computations.
        –  Eliminate job launch overhead of workflow jobs.
        –  Eliminate extra stage of map reads in every workflow job.
        –  Eliminate queue and resource contention suffered by workflow
           jobs that are started after a predecessor job completes.

     [Diagram: the same Pig/Hive workflow as a chain of MR jobs vs. a
     single Tez DAG.]
 16. © Hortonworks Inc. 2013 Tez – Execution Performance

     •  Optimal resource management
        –  Reuse YARN containers to launch new tasks.
        –  Reuse YARN containers to enable shared objects across tasks, as
           sketched below.

     [Diagram: the Tez Application Master starts TezTask1 in a YARN
     container, receives Task Done, and starts TezTask2 in the same
     container; the two tasks share objects on the host.]
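
     A minimal illustration of the shared-objects pattern, using a plain
     static registry rather than the real Tez API:

        import java.util.concurrent.ConcurrentHashMap;
        import java.util.concurrent.ConcurrentMap;

        // Illustrative only: sharing objects across tasks that run in the
        // same reused container/JVM. The real Tez registry API differs.
        public class SharedObjects {
            private static final ConcurrentMap<String, Object> REGISTRY =
                new ConcurrentHashMap<String, Object>();

            // The first task to publish a value wins; later tasks in the
            // same reused JVM see the cached instance (e.g. a dictionary
            // or lookup table) instead of rebuilding it.
            public static Object putIfAbsent(String key, Object value) {
                Object prev = REGISTRY.putIfAbsent(key, value);
                return prev != null ? prev : value;
            }

            public static Object get(String key) {
                return REGISTRY.get(key);
            }
        }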
 17. © Hortonworks Inc. 2013 Tez – Execution Performance

     •  Plan reconfiguration at runtime
        –  Dynamic runtime concurrency control based on data size, user
           operator resources, available cluster resources and locality.
        –  Advanced changes in dataflow graph structure.
        –  Progressive graph construction in concert with user optimizer.

     [Diagram: Stage 1 runs 50 maps over HDFS blocks and produces 100
     partitions; Stage 2 was planned with 100 reducers, but because only
     ~10 GB of data actually arrives, it is reconfigured at runtime to 10
     reducers within the available YARN resources.]
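
     The arithmetic behind that reconfiguration is simple; as an
     illustrative sketch (the class name, helper and the 1 GB-per-reducer
     threshold are invented here, not Tez code):

        // Illustrative only: pick the reducer count from observed output
        // size rather than the compile-time plan.
        public class ReducerAutoParallelism {
            static final long BYTES_PER_REDUCER = 1024L * 1024 * 1024; // assumed 1 GB

            static int reducersFor(long observedBytes, int plannedReducers) {
                int needed = (int) ((observedBytes + BYTES_PER_REDUCER - 1)
                                    / BYTES_PER_REDUCER);
                // Never exceed the planned parallelism; never drop below 1.
                return Math.max(1, Math.min(plannedReducers, needed));
            }

            public static void main(String[] args) {
                // 10 GB observed, 100 reducers planned -> 10 reducers.
                System.out.println(
                    reducersFor(10L * 1024 * 1024 * 1024, 100));
            }
        }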
 18. © Hortonworks Inc. 2013 Tez – Execution Performance

     •  Dynamic physical data flow decisions
        –  Decide the type of physical byte movement and storage on the
           fly.
        –  Store intermediate data on a distributed store, local store or
           in-memory.
        –  Transfer bytes via blocking files or streaming, and the spectrum
           in between.

     [Diagram: at runtime, a small producer output goes to its consumer
     in-memory, while a larger one goes through a local file.]
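
     A toy illustration of such a decision (the enum, threshold and names
     are invented for this sketch):

        // Illustrative only: choose the physical transport for an
        // intermediate output by its observed size and reliability needs.
        public class TransportChooser {
            enum Transport { IN_MEMORY, LOCAL_FILE, DISTRIBUTED_STORE }

            static Transport choose(long outputBytes, boolean needsFaultTolerance) {
                if (needsFaultTolerance) {
                    return Transport.DISTRIBUTED_STORE; // e.g. replicated HDFS write
                }
                // Small outputs can be handed to the consumer directly
                // in memory; larger ones go through a local file.
                return outputBytes < 32L * 1024 * 1024
                    ? Transport.IN_MEMORY
                    : Transport.LOCAL_FILE;
            }
        }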
 19. © Hortonworks Inc. 2013 Tez – Deep Dive – API

     Simple DAG definition API:

        DAG dag = new DAG();
        Vertex map1    = new Vertex(MapProcessor.class);
        Vertex map2    = new Vertex(MapProcessor.class);
        Vertex reduce1 = new Vertex(ReduceProcessor.class);
        Vertex reduce2 = new Vertex(ReduceProcessor.class);
        Vertex join1   = new Vertex(JoinProcessor.class);
        …….
        Edge edge1 = new Edge(map1, reduce1, SCATTER_GATHER, PERSISTED,
                              SEQUENTIAL, MOutput.class, RInput.class);
        Edge edge2 = new Edge(map2, reduce2, SCATTER_GATHER, PERSISTED,
                              SEQUENTIAL, MOutput.class, RInput.class);
        Edge edge3 = new Edge(reduce1, join1, SCATTER_GATHER, PERSISTED,
                              SEQUENTIAL, MOutput.class, RInput.class);
        Edge edge4 = new Edge(reduce2, join1, SCATTER_GATHER, PERSISTED,
                              SEQUENTIAL, MOutput.class, RInput.class);
        …….
        dag.addVertex(map1).addVertex(map2)
           .addVertex(reduce1).addVertex(reduce2)
           .addVertex(join1)
           .addEdge(edge1).addEdge(edge2)
           .addEdge(edge3).addEdge(edge4);

     [Diagram: map1 → reduce1 and map2 → reduce2 feed join1, each edge
     annotated Scatter_Gather / Bipartite / Sequential.]
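
     Submitting the resulting DAG is a short follow-on step; the calls
     below follow the shape of the post-incubation Tez client API
     (TezClient/DAGClient) and are an assumption, not slide content:

        // Sketch only: client calls assumed from later Tez releases.
        TezConfiguration tezConf = new TezConfiguration();
        TezClient tezClient = TezClient.create("dag-example", tezConf); // hypothetical name
        tezClient.start();
        DAGClient dagClient = tezClient.submitDAG(dag); // dag from the snippet above
        dagClient.waitForCompletion();
        tezClient.stop();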
 20. © Hortonworks Inc. 2013 Tez – Deep Dive – API

     Edge properties define the connection between producer and consumer
     vertices in the DAG (see the sketch after this list):

     •  Data movement – defines routing of data between tasks
        –  One-To-One: data from the ith producer task routes to the ith
           consumer task.
        –  Broadcast: data from a producer task routes to all consumer
           tasks.
        –  Scatter-Gather: producer tasks scatter data into shards and
           consumer tasks gather the data; the ith shard from all producer
           tasks routes to the ith consumer task.
     •  Scheduling – defines when a consumer task is scheduled
        –  Sequential: consumer task may be scheduled after a producer task
           completes.
        –  Concurrent: consumer task must be co-scheduled with a producer
           task.
     •  Data source – defines the lifetime/reliability of a task output
        –  Persisted: output will be available after the task exits; output
           may be lost later on.
        –  Persisted-Reliable: output is reliably stored and will always be
           available.
        –  Ephemeral: output is available only while the producer task is
           running.
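
     A compact way to see the taxonomy is as three orthogonal enums; the
     sketch below mirrors the slide's names, not necessarily the exact Tez
     API:

        // Illustrative enums mirroring the slide's edge-property taxonomy.
        enum DataMovement { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }
        enum Scheduling   { SEQUENTIAL, CONCURRENT }
        enum DataSource   { PERSISTED, PERSISTED_RELIABLE, EPHEMERAL }

        // An edge bundles the three dimensions; edge1 on the previous
        // slide would carry (SCATTER_GATHER, PERSISTED, SEQUENTIAL).
        class EdgeProperties {
            final DataMovement movement;
            final DataSource   source;
            final Scheduling   scheduling;

            EdgeProperties(DataMovement movement, DataSource source,
                           Scheduling scheduling) {
                this.movement = movement;
                this.source = source;
                this.scheduling = scheduling;
            }
        }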
 21. © Hortonworks Inc. 2013 Tez – Current Status

     •  Apache Incubator project
        –  Rapid development: over 330 JIRAs opened, over 220 resolved.
        –  Growing community.
     •  Focus on stability
        –  Testing and quality are the highest priority.
        –  Working on Tez+YARN to fix basic performance overheads.
        –  Code ready and deployed on multi-node environments.
     •  DAG of MR processing is working
        –  Already functionally equivalent to MapReduce; existing MapReduce
           jobs can be executed on Tez with few or no changes.
        –  Working Hive prototype that can target Tez for execution of
           queries (HIVE-4660).
        –  Work started on a prototype of Pig that can target Tez.
 22. © Hortonworks Inc. 2013 Hortonworks Sandbox

     Hands-on tutorials integrated into the Sandbox HDP environment for
     evaluation.
 23. © Hortonworks Inc. 2013 THANK YOU!

     Chris Harris
     [email protected]
     Twitter: cj_harris5

     Download Sandbox: hortonworks.com/sandbox