Apache Tez

This presentation gives an overview of the Apache Tez project. It explains Tez as a processing system built on Hadoop YARN and compares it with Map Reduce.

Links for further information and connecting

http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/

https://nz.linkedin.com/pub/mike-frampton/20/630/385

https://open-source-systems.blogspot.com/

Mike Frampton

May 17, 2020

Transcript

  1. What Is Apache Tez? • An application framework • Built on top of Apache Hadoop YARN • Uses directed acyclic graphs (DAGs) • Open source / Apache 2.0 license • Scalable • Performant
  2. Tez DAG • Tez directed acyclic graphs (DAGs) • Distributed data processing • Vertices represent data transformations • Edges represent data movement • For data processing applications • Tez is an execution engine • Built on top of YARN
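
The vertex and edge idea on this slide can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Tez API: the vertex names, the `run_dag` helper and the word-count logic are all hypothetical, and everything runs in one process rather than on YARN.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy model, not the Tez API.
# Each vertex is a data transformation; each edge moves data from a
# producer vertex to a consumer vertex.
vertices = {
    "tokenize": lambda inputs: [w for line in inputs["source"] for w in line.split()],
    "count": lambda inputs: {w: inputs["tokenize"].count(w) for w in set(inputs["tokenize"])},
    "top": lambda inputs: max(inputs["count"].items(), key=lambda kv: kv[1]),
}
edges = [("tokenize", "count"), ("count", "top")]  # (producer, consumer)


def run_dag(vertices, edges, source):
    """Run every vertex once, in an order that respects the edges."""
    # Map each consumer to its producers, the shape TopologicalSorter expects.
    producers = {name: set() for name in vertices}
    for producer, consumer in edges:
        producers[consumer].add(producer)

    outputs = {}
    for name in TopologicalSorter(producers).static_order():
        if producers[name]:
            # Data arrives over the incoming edges.
            inputs = {p: outputs[p] for p in producers[name]}
        else:
            # Root vertices read the external input.
            inputs = {"source": source}
        outputs[name] = vertices[name](inputs)
    return outputs


result = run_dag(vertices, edges, ["a b a", "b a"])
print(result["top"])
```

Because the graph is acyclic, a topological order always exists, which is what lets an engine decide when each step can start.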
  3. Tez Performance • Performance improvement compared to Map Reduce – No need for HDFS storage between MR jobs – Better execution performance • Expressive dataflow API for DAGs – Visualise what you wish to construct – Add processor vertices to the graph – Add data movement edges to the graph – To build the computational DAG that you require
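
The point about HDFS storage between MR jobs can be illustrated with a small Python comparison. It is a deliberately simplified sketch: the dictionaries stand in for HDFS, the stage functions and helper names are hypothetical, and a real engine still moves intermediate data between tasks by other means.

```python
def run_as_mr_chain(stages, data, store):
    """Run each stage as its own job, writing every job's output to the store."""
    for i, stage in enumerate(stages):
        store[f"job_{i}/output"] = stage(data)  # stand-in for an HDFS write
        data = store[f"job_{i}/output"]         # the next job reads it back
    return data


def run_as_single_dag(stages, data, store):
    """Run all stages as one job, writing only the final result to the store."""
    for stage in stages:
        data = stage(data)                      # handed straight to the next step
    store["dag/output"] = data
    return data


# Hypothetical three-step pipeline: double, filter, sum.
stages = [
    lambda xs: [x * 2 for x in xs],
    lambda xs: [x for x in xs if x > 2],
    sum,
]

mr_store, dag_store = {}, {}
mr_result = run_as_mr_chain(stages, [1, 2, 3], mr_store)
dag_result = run_as_single_dag(stages, [1, 2, 3], dag_store)
print(mr_result, dag_result, len(mr_store), len(dag_store))
```

Both runs produce the same answer; the difference is how many times the shared store is written, which grows with the number of chained jobs in the first case and stays at one in the second.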
  4. Tez Deployment • Tez is client side • Install the Tez client locally • Build the task DAG • Load the DAG/Tez libraries to HDFS • Execute the YARN based job – From the Tez client – Using the HDFS based DAG library
  5. Tez Existing MR Tasks • Tez can process existing Map Reduce (MR) tasks • No need for any modification • Allows for phased migration – Of existing MR jobs to DAGs • Allows for near real time task types • Rather than just MR tasks, which are – Batch oriented – Iterative – Resource intensive
  6. Tez API • Tez DAG defines the job • Vertex defines one DAG job step – Requires user logic and resources for the step • Edge defines one DAG data movement step – From producer to consumer – Edge properties define movement • How data moves • When producer and consumer tasks are scheduled relative to each other • How durable the moved data is
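
The DAG, vertex and edge roles described on this slide can be modelled as plain Python data classes. This is a sketch of the concepts only: Tez itself is a Java library, and the class names, fields and processor names here are hypothetical. The enum values are modelled on the movement, scheduling and data source options that Tez's Java edge properties are understood to offer, so treat them as an assumption rather than a reference.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class DataMovement(Enum):      # how data moves from producer to consumer
    ONE_TO_ONE = auto()
    BROADCAST = auto()
    SCATTER_GATHER = auto()


class Scheduling(Enum):        # when the consumer runs relative to the producer
    SEQUENTIAL = auto()
    CONCURRENT = auto()


class DataSource(Enum):        # how durable the producer's output is
    PERSISTED = auto()
    PERSISTED_RELIABLE = auto()
    EPHEMERAL = auto()


@dataclass(frozen=True)
class Vertex:
    name: str
    processor: str             # hypothetical: name of the user logic for this step
    parallelism: int           # hypothetical: number of tasks (resources) for this step


@dataclass(frozen=True)
class Edge:
    producer: Vertex
    consumer: Vertex
    movement: DataMovement
    scheduling: Scheduling = Scheduling.SEQUENTIAL
    source: DataSource = DataSource.PERSISTED


@dataclass
class Dag:
    name: str
    vertices: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_vertex(self, vertex):
        if vertex.name in self.vertices:
            raise ValueError(f"duplicate vertex: {vertex.name}")
        self.vertices[vertex.name] = vertex
        return self            # returning self allows chained calls

    def add_edge(self, edge):
        for v in (edge.producer, edge.consumer):
            if v.name not in self.vertices:
                raise ValueError(f"unknown vertex: {v.name}")
        self.edges.append(edge)
        return self


tokenizer = Vertex("tokenizer", "TokenProcessor", parallelism=4)
summer = Vertex("summer", "SumProcessor", parallelism=2)
dag = (
    Dag("word_count")
    .add_vertex(tokenizer)
    .add_vertex(summer)
    .add_edge(Edge(tokenizer, summer, DataMovement.SCATTER_GATHER))
)
print(dag.name, list(dag.vertices), dag.edges[0].movement.name)
```

Keeping the three edge properties as separate fields mirrors the slide: movement, scheduling and durability are independent choices made per edge, not per job.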
  7. Tez Hive • Increased performance – Compared to Map Reduce usage • No need to use HDFS for intermediate steps • Greater parallelism via DAGs • Less complex steps in DAG compared to MR • Reduced latency • Higher throughput • Better speed
  8. Available Books • See “Big Data Made Easy” – Apress Jan 2015 • See “Mastering Apache Spark” – Packt Oct 2015 • See “Complete Guide to Open Source Big Data Stack” – Apress Jan 2018 • Find the author on Amazon – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/ • Connect on LinkedIn – www.linkedin.com/in/mike-frampton-38563020
  9. Connect • Feel free to connect on LinkedIn – www.linkedin.com/in/mike-frampton-38563020 • See my open source blog at – open-source-systems.blogspot.com/ • I am always interested in – New technology – Opportunities – Technology based issues – Big data integration