
Tez Deep Dive with Geolocation Data

cj_harris5
October 05, 2013


Transcript

  1. Potential Uses of Geolocation Data

     Geolocation data can be used to:
     •  Locate people
     •  Locate assets

     Today we will focus on the vehicle location use case.
  2. © Hortonworks Inc. 2013 Hadoop 1 - Basics

     [Diagram: MapReduce (Computation Framework) running on top of HDFS
     (Storage Framework), with blocks A, B and C replicated across nodes.]
  3. © Hortonworks Inc. 2013 Hadoop 1 - Reading Files

     [Diagram: a Hadoop client asks the NameNode (fsimage/edits) to read a
     file; the NameNode returns DataNodes, block ids, etc., and the client
     reads blocks directly from DataNode/TaskTracker (DN | TT) nodes across
     Rack1 … RackN. The Secondary NameNode checkpoints the NameNode;
     DataNodes send heartbeats and block reports.]
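
     As a sketch of the client side of this read path (standard
     org.apache.hadoop.fs API; the file path is a hypothetical example):

        import java.io.BufferedReader;
        import java.io.InputStreamReader;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsRead {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);
                // open() consults the NameNode for block locations; the
                // stream then reads block data directly from the DataNodes.
                FSDataInputStream in =
                    fs.open(new Path("/data/trucks/events.csv")); // hypothetical path
                BufferedReader reader =
                    new BufferedReader(new InputStreamReader(in));
                System.out.println(reader.readLine());
                reader.close();
            }
        }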
  4. © Hortonworks Inc. 2013 Hadoop 1 - Writing Files

     [Diagram: a Hadoop client requests a write from the NameNode
     (fsimage/edits), which returns target DataNodes; the client writes
     blocks to DN | TT nodes, which replicate them via pipelining. The
     Secondary NameNode checkpoints; DataNodes send block reports.]
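
     The write path looks similar from the client side (again a sketch,
     with a hypothetical path):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsWrite {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);
                // create() asks the NameNode for target DataNodes; each
                // block the client writes is then replicated down the
                // DataNode pipeline.
                FSDataOutputStream out =
                    fs.create(new Path("/data/trucks/out.txt")); // hypothetical path
                out.writeUTF("truck,lat,lon");
                out.close();
            }
        }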
  5. © Hortonworks Inc. 2013 Hadoop 1 - Running Jobs

     [Diagram: a Hadoop client submits a job to the JobTracker, which
     deploys it onto DN | TT nodes across racks; map tasks feed reduce
     tasks through a shuffle, producing output part 0.]
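
     On the client side, a minimal Hadoop 1 job submission looks roughly
     like this (paths are hypothetical; with no mapper or reducer set,
     JobConf defaults to the identity classes):

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapred.FileInputFormat;
        import org.apache.hadoop.mapred.FileOutputFormat;
        import org.apache.hadoop.mapred.JobClient;
        import org.apache.hadoop.mapred.JobConf;

        public class SubmitJob {
            public static void main(String[] args) throws Exception {
                JobConf conf = new JobConf(SubmitJob.class);
                conf.setJobName("identity-pass"); // defaults to identity map/reduce
                FileInputFormat.setInputPaths(conf, new Path("/data/in"));   // hypothetical
                FileOutputFormat.setOutputPath(conf, new Path("/data/out")); // hypothetical
                // runJob() submits to the JobTracker, which deploys map and
                // reduce tasks onto TaskTrackers and blocks until completion.
                JobClient.runJob(conf);
            }
        }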
  6. © Hortonworks Inc. 2013 Hadoop 2

     •  Potentially up to 10,000 nodes per cluster
     •  O(cluster size)
     •  Supports multiple namespaces for managing HDFS
     •  Efficient cluster utilization (YARN)
     •  MRv1 backward and forward compatible
     •  Any apps can integrate with Hadoop
     •  Beyond Java
  7. © Hortonworks Inc. 2013 Hadoop 2 - Reading Files (w/ NN Federation)

     [Diagram: a Hadoop client reads a file through one of several federated
     NameNodes (NN1/ns1 … NN4/ns4), each paired with a Secondary NameNode
     (or a Backup NN) for checkpointing and fsimage/edits copies (fs sync).
     The chosen NameNode returns DataNodes, block ids, etc., and the client
     reads blocks from DN | NM nodes, which register and send
     heartbeats/block reports. Block pools map namespaces to DataNodes,
     e.g. ns1 → dn1, dn2; ns2 → dn1, dn3; ns3 → dn4, dn5; ns4 → dn4, dn5.]
  8. © Hortonworks Inc. 2013 Hadoop 2 - Writing Files

     [Diagram: as with reads, the client requests a write from one of the
     federated NameNodes (NN1/ns1 … NN4/ns4), which returns target
     DataNodes; the client writes blocks to DN | NM nodes with replication
     pipelining. Each NameNode is paired with a Secondary NameNode (or a
     Backup NN) for checkpoints and fsimage/edits copies (fs sync);
     DataNodes send block reports.]
  9. © Hortonworks Inc. 2013 Hadoop 2 - Running Jobs

     [Diagram: two Hadoop clients submit app1 and app2 to the
     ResourceManager, whose ApplicationsManager (ASM) and Scheduler create
     the applications. Per-application ApplicationMasters (AM1, AM2) run in
     containers on NodeManagers across Rack1 … RackN; each AM negotiates
     resources with the Scheduler, reports status to the ASM, and launches
     task containers (C1.1–C1.4, C2.1–C2.3) that the NodeManagers
     partition.]
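
     The client-side counterpart of this flow, sketched below with the
     org.apache.hadoop.yarn.client.api.YarnClient API, is a skeleton rather
     than a complete application: the ApplicationMaster launch spec is
     elided.

        import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
        import org.apache.hadoop.yarn.client.api.YarnClient;
        import org.apache.hadoop.yarn.client.api.YarnClientApplication;
        import org.apache.hadoop.yarn.conf.YarnConfiguration;

        public class SubmitYarnApp {
            public static void main(String[] args) throws Exception {
                YarnClient yarnClient = YarnClient.createYarnClient();
                yarnClient.init(new YarnConfiguration());
                yarnClient.start();

                // Ask the ResourceManager to create an application, then
                // fill in the submission context (AM container launch spec,
                // resources, queue).
                YarnClientApplication app = yarnClient.createApplication();
                ApplicationSubmissionContext ctx =
                    app.getApplicationSubmissionContext();
                ctx.setApplicationName("geo-app"); // hypothetical name
                // ... set AM ContainerLaunchContext, Resource, queue here ...

                // The Scheduler allocates a container for the
                // ApplicationMaster, which then negotiates task containers
                // on NodeManagers.
                yarnClient.submitApplication(ctx);
            }
        }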
 10. © Hortonworks Inc. 2013 Tez – Introduction

     •  Distributed execution framework targeted towards data-processing
        applications.
     •  Based on expressing a computation as a dataflow graph.
     •  Built on top of YARN – the resource management framework for
        Hadoop.
     •  Open source Apache incubator project, Apache licensed.
 11. © Hortonworks Inc. 2013 Tez – Empowering End Users

     •  Expressive dataflow definition APIs
        –  Enable definition of complex data-flow pipelines using simple
           graph connection APIs. Tez expands the logical plan at runtime.
        –  Targeted towards data processing applications like Hive/Pig but
           not limited to them. Hive/Pig query plans naturally map to Tez
           dataflow graphs with no translation impedance.

     [Diagram: a dataflow graph of tasks TaskA-1/2 through TaskE-1/2.]
 12. © Hortonworks Inc. 2013 Tez – Empowering End Users

     •  Flexible Input-Processor-Output runtime model
        –  Construct physical runtime executors dynamically by connecting
           different inputs, processors and outputs.
        –  End goal is to have a library of inputs, outputs and processors
           that can be programmatically composed to generate useful tasks,
           as sketched below.

     [Diagram examples: IntermediateReduce = ShuffleInput → ReduceProcessor
     → FileSortedOutput; FinalReduce = ShuffleInput → ReduceProcessor →
     HDFSOutput; PairwiseJoin = Input1 + Input2 → JoinProcessor →
     FileSortedOutput.]
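
     A rough sketch of that composition idea (the interface and class names
     here are invented to mirror the slide's model, not the real Tez APIs):

        import java.util.List;

        // Illustrative only: these interfaces mirror the slide's
        // Input-Processor-Output model, not the real Tez class names.
        interface Input     { Iterable<byte[]> read(); }
        interface Output    { void write(byte[] datum); }
        interface Processor { void run(List<Input> inputs, Output output); }

        // A "task" is nothing more than a composition of pluggable parts,
        // so a PairwiseJoin is two Inputs + a JoinProcessor + a sorted
        // file Output.
        class ComposedTask {
            private final List<Input> inputs;
            private final Processor processor;
            private final Output output;

            ComposedTask(List<Input> inputs, Processor processor, Output output) {
                this.inputs = inputs;
                this.processor = processor;
                this.output = output;
            }

            void execute() {
                processor.run(inputs, output);
            }
        }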
 13. © Hortonworks Inc. 2013 Tez – Empowering End Users

     •  Data type agnostic
        –  Tez is only concerned with the movement of data: files and
           streams of bytes.
        –  Does not impose any data format on the user application. An MR
           application can use key-value pairs on top of Tez; Hive and Pig
           can use tuple-oriented formats that are natural and native to
           them.

     [Diagram: user code exchanges key-value pairs or tuples, while the Tez
     task beneath it moves only bytes over files and streams.]
 14. © Hortonworks Inc. 2013 Tez – Empowering End Users

     •  Simplifying deployment
        –  Tez is a completely client-side application.
        –  No deployments to do: simply upload the Tez libraries to any
           accessible FileSystem and change the local Tez configuration to
           point to them.
        –  Enables running different versions concurrently. Easy to test
           new functionality while keeping stable versions for production.
        –  Leverages YARN local resources.

     [Diagram: two client machines with TezClient point at Tez Lib 1 and
     Tez Lib 2 on HDFS; NodeManagers localize the libraries to run
     TezTasks.]
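
     For illustration, pointing a client at an uploaded Tez version might
     look like the sketch below (it assumes the tez.lib.uris property and
     the TezConfiguration class; the HDFS path is hypothetical):

        import org.apache.tez.dag.api.TezConfiguration;

        public class TezLibConfig {
            public static void main(String[] args) {
                // Point this client at one uploaded Tez version; a second
                // client could point at a different version on the same
                // cluster, so both run concurrently.
                TezConfiguration tezConf = new TezConfiguration();
                tezConf.set(TezConfiguration.TEZ_LIB_URIS,   // "tez.lib.uris"
                    "hdfs://namenode:8020/apps/tez-lib-1/"); // hypothetical path
            }
        }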
 15. © Hortonworks Inc. 2013 Tez – Execution Performance

     •  Performance gains over MapReduce
        –  Eliminate replicated write barrier between successive
           computations.
        –  Eliminate job launch overhead of workflow jobs.
        –  Eliminate extra stage of map reads in every workflow job.
        –  Eliminate queue and resource contention suffered by workflow
           jobs that are started after a predecessor job completes.

     [Diagram: the same Pig/Hive workflow as a chain of MR jobs vs. a
     single Tez DAG.]
 16. © Hortonworks Inc. 2013 Tez – Execution Performance

     •  Optimal resource management
        –  Reuse YARN containers to launch new tasks.
        –  Reuse YARN containers to enable shared objects across tasks, as
           sketched below.

     [Diagram: the Tez Application Master starts TezTask1 in a YARN
     container, receives Task Done, and starts TezTask2 in the same
     container; the two tasks share objects on the host.]
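
     A minimal illustration of the shared-objects pattern, using a plain
     static registry rather than the real Tez API:

        import java.util.concurrent.ConcurrentHashMap;
        import java.util.concurrent.ConcurrentMap;

        // Illustrative only: sharing objects across tasks that run in the
        // same reused container/JVM. The real Tez registry API differs.
        public class SharedObjects {
            private static final ConcurrentMap<String, Object> REGISTRY =
                new ConcurrentHashMap<String, Object>();

            // The first task to publish a value wins; later tasks in the
            // same reused JVM see the cached instance (e.g. a dictionary
            // or lookup table) instead of rebuilding it.
            public static Object putIfAbsent(String key, Object value) {
                Object prev = REGISTRY.putIfAbsent(key, value);
                return prev != null ? prev : value;
            }

            public static Object get(String key) {
                return REGISTRY.get(key);
            }
        }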
 17. © Hortonworks Inc. 2013 Tez – Execution Performance

     •  Plan reconfiguration at runtime
        –  Dynamic runtime concurrency control based on data size, user
           operator resources, available cluster resources and locality.
        –  Advanced changes in dataflow graph structure.
        –  Progressive graph construction in concert with user optimizer.

     [Diagram: Stage 1 runs 50 maps over HDFS blocks and produces 100
     partitions; Stage 2 was planned with 100 reducers, but because only
     ~10 GB of data actually arrives, it is reconfigured at runtime to 10
     reducers within the available YARN resources.]
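
     The arithmetic behind that reconfiguration is simple; as an
     illustrative sketch (the class name, helper and the 1 GB-per-reducer
     threshold are invented here, not Tez code):

        // Illustrative only: pick the reducer count from observed output
        // size rather than the compile-time plan.
        public class ReducerAutoParallelism {
            static final long BYTES_PER_REDUCER = 1024L * 1024 * 1024; // assumed 1 GB

            static int reducersFor(long observedBytes, int plannedReducers) {
                int needed = (int) ((observedBytes + BYTES_PER_REDUCER - 1)
                                    / BYTES_PER_REDUCER);
                // Never exceed the planned parallelism; never drop below 1.
                return Math.max(1, Math.min(plannedReducers, needed));
            }

            public static void main(String[] args) {
                // 10 GB observed, 100 reducers planned -> 10 reducers.
                System.out.println(
                    reducersFor(10L * 1024 * 1024 * 1024, 100));
            }
        }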
 18. © Hortonworks Inc. 2013 Tez – Execution Performance

     •  Dynamic physical data flow decisions
        –  Decide the type of physical byte movement and storage on the
           fly.
        –  Store intermediate data on a distributed store, local store or
           in-memory.
        –  Transfer bytes via blocking files or streaming, and the spectrum
           in between.

     [Diagram: at runtime, a small producer output goes to its consumer
     in-memory, while a larger one goes through a local file.]
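
     A toy illustration of such a decision (the enum, threshold and names
     are invented for this sketch):

        // Illustrative only: choose the physical transport for an
        // intermediate output by its observed size and reliability needs.
        public class TransportChooser {
            enum Transport { IN_MEMORY, LOCAL_FILE, DISTRIBUTED_STORE }

            static Transport choose(long outputBytes, boolean needsFaultTolerance) {
                if (needsFaultTolerance) {
                    return Transport.DISTRIBUTED_STORE; // e.g. replicated HDFS write
                }
                // Small outputs can be handed to the consumer directly
                // in memory; larger ones go through a local file.
                return outputBytes < 32L * 1024 * 1024
                    ? Transport.IN_MEMORY
                    : Transport.LOCAL_FILE;
            }
        }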
 19. © Hortonworks Inc. 2013 Tez – Deep Dive – API

     Simple DAG definition API:

        DAG dag = new DAG();
        Vertex map1    = new Vertex(MapProcessor.class);
        Vertex map2    = new Vertex(MapProcessor.class);
        Vertex reduce1 = new Vertex(ReduceProcessor.class);
        Vertex reduce2 = new Vertex(ReduceProcessor.class);
        Vertex join1   = new Vertex(JoinProcessor.class);
        …….
        Edge edge1 = new Edge(map1, reduce1, SCATTER_GATHER, PERSISTED,
                              SEQUENTIAL, MOutput.class, RInput.class);
        Edge edge2 = new Edge(map2, reduce2, SCATTER_GATHER, PERSISTED,
                              SEQUENTIAL, MOutput.class, RInput.class);
        Edge edge3 = new Edge(reduce1, join1, SCATTER_GATHER, PERSISTED,
                              SEQUENTIAL, MOutput.class, RInput.class);
        Edge edge4 = new Edge(reduce2, join1, SCATTER_GATHER, PERSISTED,
                              SEQUENTIAL, MOutput.class, RInput.class);
        …….
        dag.addVertex(map1).addVertex(map2)
           .addVertex(reduce1).addVertex(reduce2)
           .addVertex(join1)
           .addEdge(edge1).addEdge(edge2)
           .addEdge(edge3).addEdge(edge4);

     [Diagram: map1 → reduce1 and map2 → reduce2 feed join1, each edge
     annotated Scatter_Gather / Bipartite / Sequential.]
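
     Submitting the resulting DAG is a short follow-on step; the calls
     below follow the shape of the post-incubation Tez client API
     (TezClient/DAGClient) and are an assumption, not slide content:

        // Sketch only: client calls assumed from later Tez releases.
        TezConfiguration tezConf = new TezConfiguration();
        TezClient tezClient = TezClient.create("dag-example", tezConf); // hypothetical name
        tezClient.start();
        DAGClient dagClient = tezClient.submitDAG(dag); // dag from the snippet above
        dagClient.waitForCompletion();
        tezClient.stop();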
 20. © Hortonworks Inc. 2013 Tez – Deep Dive – API

     Edge properties define the connection between producer and consumer
     vertices in the DAG (see the sketch after this list):

     •  Data movement – defines routing of data between tasks
        –  One-To-One: data from the ith producer task routes to the ith
           consumer task.
        –  Broadcast: data from a producer task routes to all consumer
           tasks.
        –  Scatter-Gather: producer tasks scatter data into shards and
           consumer tasks gather the data; the ith shard from all producer
           tasks routes to the ith consumer task.
     •  Scheduling – defines when a consumer task is scheduled
        –  Sequential: consumer task may be scheduled after a producer task
           completes.
        –  Concurrent: consumer task must be co-scheduled with a producer
           task.
     •  Data source – defines the lifetime/reliability of a task output
        –  Persisted: output will be available after the task exits; output
           may be lost later on.
        –  Persisted-Reliable: output is reliably stored and will always be
           available.
        –  Ephemeral: output is available only while the producer task is
           running.
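
     A compact way to see the taxonomy is as three orthogonal enums; the
     sketch below mirrors the slide's names, not necessarily the exact Tez
     API:

        // Illustrative enums mirroring the slide's edge-property taxonomy.
        enum DataMovement { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }
        enum Scheduling   { SEQUENTIAL, CONCURRENT }
        enum DataSource   { PERSISTED, PERSISTED_RELIABLE, EPHEMERAL }

        // An edge bundles the three dimensions; edge1 on the previous
        // slide would carry (SCATTER_GATHER, PERSISTED, SEQUENTIAL).
        class EdgeProperties {
            final DataMovement movement;
            final DataSource   source;
            final Scheduling   scheduling;

            EdgeProperties(DataMovement movement, DataSource source,
                           Scheduling scheduling) {
                this.movement = movement;
                this.source = source;
                this.scheduling = scheduling;
            }
        }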
 21. © Hortonworks Inc. 2013 Tez – Current Status

     •  Apache Incubator project
        –  Rapid development: over 330 JIRAs opened, over 220 resolved.
        –  Growing community.
     •  Focus on stability
        –  Testing and quality are the highest priority.
        –  Working on Tez+YARN to fix basic performance overheads.
        –  Code ready and deployed on multi-node environments.
     •  DAG of MR processing is working
        –  Already functionally equivalent to MapReduce; existing MapReduce
           jobs can be executed on Tez with few or no changes.
        –  Working Hive prototype that can target Tez for execution of
           queries (HIVE-4660).
        –  Work started on a prototype of Pig that can target Tez.
 22. © Hortonworks Inc. 2013 Hortonworks Sandbox

     Hands-on tutorials integrated into the Sandbox HDP environment for
     evaluation.
 23. © Hortonworks Inc. 2013 THANK YOU!

     Chris Harris
     [email protected]
     Twitter: cj_harris5

     Download Sandbox: hortonworks.com/sandbox