Apache Tez

This presentation gives an overview of the Apache Tez project. It explains Tez as a processing system built on Hadoop YARN and compares it with MapReduce.

Links for further information and connecting

http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/

https://nz.linkedin.com/pub/mike-frampton/20/630/385

https://open-source-systems.blogspot.com/

Mike Frampton

May 17, 2020

Transcript

  1. What Is Apache Tez?
     • An application framework
     • Built on top of Apache Hadoop YARN
     • Uses directed acyclic graphs (DAGs)
     • Open source, Apache 2.0 license
     • Scalable
     • Performant
  2. Tez DAG
     • Tez directed acyclic graphs (DAGs)
     • Distributed data processing
     • Vertices represent data transformation
     • Edges represent data movement
     • For data processing applications
     • Tez is an execution engine
     • Built on top of YARN
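The vertex and edge roles above can be sketched in a few lines of plain Python. This is a toy model, not the Tez API: the `Dag` class, its method names, and the word-count example are all hypothetical, chosen only to show vertices as transformations and edges as data movement from producer to consumer.

```python
from collections import deque


class Dag:
    """Toy DAG (hypothetical, not the Tez API).

    Vertices are named transformations; edges move one vertex's
    output to another vertex's input.
    """

    def __init__(self):
        self.vertices = {}  # vertex name -> callable transformation
        self.edges = {}     # producer name -> list of consumer names

    def add_vertex(self, name, transform):
        self.vertices[name] = transform
        self.edges.setdefault(name, [])

    def add_edge(self, producer, consumer):
        for name in (producer, consumer):
            if name not in self.vertices:
                raise KeyError(f"unknown vertex: {name}")
        self.edges[producer].append(consumer)

    def topological_order(self):
        """Return vertex names so every producer precedes its consumers."""
        indegree = {name: 0 for name in self.vertices}
        for consumers in self.edges.values():
            for consumer in consumers:
                indegree[consumer] += 1
        ready = deque(name for name, degree in indegree.items() if degree == 0)
        order = []
        while ready:
            name = ready.popleft()
            order.append(name)
            for consumer in self.edges[name]:
                indegree[consumer] -= 1
                if indegree[consumer] == 0:
                    ready.append(consumer)
        if len(order) != len(self.vertices):
            raise ValueError("graph contains a cycle, so it is not a DAG")
        return order

    def run(self, source):
        """Run each vertex once; root vertices receive the source data."""
        inputs = {name: [] for name in self.vertices}
        results = {}
        for name in self.topological_order():
            args = inputs[name] or [source]
            results[name] = self.vertices[name](*args)
            for consumer in self.edges[name]:
                inputs[consumer].append(results[name])
        return results


# Hypothetical two-vertex word count.
dag = Dag()
dag.add_vertex("tokenize", lambda lines: [w for line in lines for w in line.split()])
dag.add_vertex("count", lambda words: {w: words.count(w) for w in set(words)})
dag.add_edge("tokenize", "count")
results = dag.run(["a b a", "b a"])
```

The topological sort is what makes the graph executable: a vertex runs only after every vertex feeding it has finished, and a cycle is rejected because no valid order exists.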
  3. Tez Performance
     • Performance improvement compared to MapReduce
       – No need for HDFS storage between MR jobs
       – Better execution performance
     • Expressive dataflow API for DAGs
       – Visualise what you wish to construct
       – Add processor vertices to the graph
       – Add data movement edges to the graph
       – To build the computational DAG that you require
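The point about HDFS storage between MR jobs can be illustrated with a small simulation. It is a simplified sketch under stated assumptions: the `hdfs` dictionary stands in for HDFS, the function names and the three stages are hypothetical, and real engines still buffer intermediate data elsewhere (for example on local disk), which this model ignores.

```python
def run_as_mr_chain(stages, data, hdfs):
    """Run each stage as its own job; every job persists its output to 'hdfs'."""
    for index, stage in enumerate(stages):
        data = stage(data)
        hdfs[f"job_{index}_output"] = data  # the next job would read this back
    return data


def run_as_dag(stages, data, hdfs):
    """Run all stages as one job; only the final result is persisted."""
    for stage in stages:
        data = stage(data)  # handed straight to the next vertex
    hdfs["dag_output"] = data
    return data


# Hypothetical three-step pipeline: double, filter, sum.
stages = [
    lambda xs: [x * 2 for x in xs],
    lambda xs: [x for x in xs if x > 2],
    sum,
]

mr_hdfs, dag_hdfs = {}, {}
mr_result = run_as_mr_chain(stages, [1, 2, 3], mr_hdfs)
dag_result = run_as_dag(stages, [1, 2, 3], dag_hdfs)
```

Both runs compute the same answer; the difference is how many times data is written to the shared store, which grows with the number of jobs in the chain but stays at one for the single DAG.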
  4. Tez Deployment
     • Tez is client side
     • Install the Tez client locally
     • Build the task DAG
     • Load the DAG/Tez libraries to HDFS
     • Execute the YARN based job
       – From the Tez client
       – Using the HDFS based DAG library
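For the step that loads the Tez libraries to HDFS, the client is usually told where they live through the `tez.lib.uris` property in `tez-site.xml`. The sketch below only renders such a file in Python; the helper name and the HDFS path are hypothetical, so substitute the location used in your own cluster.

```python
import xml.etree.ElementTree as ET


def render_tez_site(tez_lib_uri):
    """Render a minimal tez-site.xml pointing at the uploaded Tez libraries."""
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = "tez.lib.uris"
    ET.SubElement(prop, "value").text = tez_lib_uri
    return ET.tostring(configuration, encoding="unicode")


# Hypothetical HDFS location of the uploaded Tez archive.
tez_site_xml = render_tez_site("hdfs:///apps/tez/tez.tar.gz")
```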
  5. Tez Existing MR Tasks
     • Tez can process existing MapReduce (MR) tasks
     • No need for any modification
     • Allows for phased migration
       – Of existing MR jobs to DAGs
     • Allows for near real time task types
     • Rather than just MR tasks, which are
       – Batch oriented
       – Iterative
       – Resource intensive
  6. Tez API
     • A Tez DAG defines the job
     • A vertex defines one DAG job step
       – Requires user logic and resources for the step
     • An edge defines one DAG data movement step
       – From producer to consumer
       – Edge properties define the movement
         • How data moves
         • When producer and consumer tasks are scheduled relative to each other
         • How durable the data is
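The three edge properties listed above (how data moves, when it is scheduled, and how durable it is) can be modelled as enums. This is a plain Python sketch, not the Tez Java API: the enum member names are modelled on the categories Tez uses for edge properties and should be treated as assumptions, and `move_records` is a hypothetical helper showing how each movement type would distribute one producer task's output.

```python
from dataclasses import dataclass
from enum import Enum


class DataMovement(Enum):
    ONE_TO_ONE = "one_to_one"          # producer task i feeds consumer task i
    BROADCAST = "broadcast"            # every consumer task sees all output
    SCATTER_GATHER = "scatter_gather"  # output is partitioned across consumers


class Scheduling(Enum):
    SEQUENTIAL = "sequential"  # consumer starts after the producer finishes
    CONCURRENT = "concurrent"  # producer and consumer run at the same time


class Durability(Enum):
    PERSISTED = "persisted"                    # survives the producing task
    PERSISTED_RELIABLE = "persisted_reliable"  # stored reliably
    EPHEMERAL = "ephemeral"                    # available only while the producer runs


@dataclass(frozen=True)
class EdgeProperty:
    movement: DataMovement
    scheduling: Scheduling
    durability: Durability


def move_records(prop, producer_index, records, num_consumers):
    """Return {consumer task index: records} for one producer task's output."""
    if prop.movement is DataMovement.ONE_TO_ONE:
        return {producer_index: list(records)}
    if prop.movement is DataMovement.BROADCAST:
        return {i: list(records) for i in range(num_consumers)}
    # SCATTER_GATHER: partition each record by its hash.
    partitions = {i: [] for i in range(num_consumers)}
    for record in records:
        partitions[hash(record) % num_consumers].append(record)
    return partitions


# Hypothetical shuffle-style edge between two vertices.
shuffle = EdgeProperty(DataMovement.SCATTER_GATHER, Scheduling.SEQUENTIAL, Durability.PERSISTED)
broadcast = EdgeProperty(DataMovement.BROADCAST, Scheduling.SEQUENTIAL, Durability.PERSISTED)
one_to_one = EdgeProperty(DataMovement.ONE_TO_ONE, Scheduling.SEQUENTIAL, Durability.PERSISTED)

shuffled = move_records(shuffle, 0, [1, 2, 3, 4], 2)
copied = move_records(broadcast, 0, [1, 2, 3, 4], 2)
paired = move_records(one_to_one, 1, [1, 2, 3, 4], 2)
```

The example uses small integers as records because their hash values are stable in Python; strings would need a deterministic hash function to give repeatable partitions.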
  7. Tez Hive
     • Increased performance
       – Compared to MapReduce usage
     • No need to use HDFS for intermediate steps
     • Greater parallelism via DAGs
     • Less complex steps in a DAG compared to MR
     • Reduced latency
     • Higher throughput
     • Better speed
  8. Available Books
     • See “Big Data Made Easy”
       – Apress Jan 2015
     • See “Mastering Apache Spark”
       – Packt Oct 2015
     • See “Complete Guide to Open Source Big Data Stack”
       – Apress Jan 2018
     • Find the author on Amazon
       – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
     • Connect on LinkedIn
       – www.linkedin.com/in/mike-frampton-38563020
  9. Connect
     • Feel free to connect on LinkedIn
       – www.linkedin.com/in/mike-frampton-38563020
     • See my open source blog at
       – open-source-systems.blogspot.com/
     • I am always interested in
       – New technology
       – Opportunities
       – Technology based issues
       – Big data integration