Apache Tajo

This presentation gives an overview of the Apache Tajo project. It explains Tajo architecture in relation to Hadoop/Hive and ETL.

Links for further information and connecting

http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/

https://nz.linkedin.com/pub/mike-frampton/20/630/385

https://open-source-systems.blogspot.com/

Mike Frampton

May 20, 2020

Transcript

  1. What Is Apache Tajo?
    • A data warehouse system
    • Open source / Apache 2.0 license
    • Stores data on HDFS and others
    • For low latency big data queries / ETL
    • Supports SQL
    • No release since May 2016
  2. Tajo Catalog Storage
    • Tajo can store catalog information in
      – Apache Derby (default)
      – MySQL
      – MariaDB
      – In-memory
      – Hive Catalog / HiveMetaStore
    • Derby is the default with storage under /tmp
  3. Tajo Data Storage
    • Tajo can store data in the following locations
      – HDFS
      – Amazon S3
      – OpenStack Swift
      – HBase
      – RDBMS
    • It is also possible to register user-defined storage
      – Place the user-defined jar file in tajo/extlib
      – Copy the modified conf/storage-site.json.template into conf/storage-site.json
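As a rough illustration of the last step on this slide, the Python sketch below writes a small storage-site.json. The top-level "spaces" key with a per-tablespace "uri" is an assumption based on recollection of the Tajo tablespace documentation and should be checked against conf/storage-site.json.template. The helper names, the tablespace name `warehouse_ts` and the HDFS URI are hypothetical.

```python
import json
from pathlib import Path


def build_storage_site(spaces):
    """Return a storage-site style dict from {tablespace_name: uri} pairs.

    Assumption: tablespaces are listed under "spaces", each with a "uri".
    Verify the layout against conf/storage-site.json.template.
    """
    return {"spaces": {name: {"uri": uri} for name, uri in spaces.items()}}


def write_storage_site(conf_dir, spaces):
    """Write storage-site.json into conf_dir and return the file path."""
    path = Path(conf_dir) / "storage-site.json"
    path.write_text(json.dumps(build_storage_site(spaces), indent=2))
    return path


# Hypothetical tablespace name and URI, for illustration only.
EXAMPLE_SPACES = {"warehouse_ts": "hdfs://localhost:8020/tajo/warehouse"}

if __name__ == "__main__":
    print(json.dumps(build_storage_site(EXAMPLE_SPACES), indent=2))
```

Calling `write_storage_site("conf", EXAMPLE_SPACES)` would place the file where the slide says Tajo expects it, assuming the conf directory already exists.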
  4. Tajo Shell
    • Tajo provides a shell for instance manipulation
      – Issue meta commands, e.g. \l (list databases)
      – Issue HDFS commands
      – Use \set to set session variables
      – Issue \admin administration commands
      – Issue commands interactively or in batch
      – Run as a background process
  5. Tajo Cluster Architecture
    • A Tajo cluster has
      – One or more TajoMaster servers
      – One or more TajoWorker servers
    • TajoMaster coordinates TajoWorkers
    • TajoWorkers carry out processing
    • More TajoWorkers mean more processing capacity
    • Capacity scales linearly
  6. Tajo TajoMaster Architecture
    • A TajoMaster process has a
      – QueryCoordinator
        • Decides whether each query should be executed in a distributed way or executed immediately in TajoMaster
      – ResourceTracker
        • Manages membership of cluster nodes
      – Client Service Provider
        • Routes client API calls to the proper QueryCoordinator or ResourceTracker
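The QueryCoordinator decision described above can be pictured with a toy Python sketch. The slides do not state the criterion, so the rule used here (plans with a join, aggregation or filter go to the workers, anything simpler runs in the master) is purely an assumption for illustration. `QueryPlan`, `ExecutionMode` and `choose_mode` are hypothetical names, not Tajo classes.

```python
from dataclasses import dataclass
from enum import Enum


class ExecutionMode(Enum):
    IMMEDIATE = "run in TajoMaster"
    DISTRIBUTED = "run on TajoWorkers"


@dataclass(frozen=True)
class QueryPlan:
    """Hypothetical summary of a parsed query; not a Tajo class."""
    has_join: bool = False
    has_aggregation: bool = False
    has_filter: bool = False


def choose_mode(plan: QueryPlan) -> ExecutionMode:
    """Assumed rule: any join, aggregation or filter means distributed execution."""
    if plan.has_join or plan.has_aggregation or plan.has_filter:
        return ExecutionMode.DISTRIBUTED
    return ExecutionMode.IMMEDIATE


if __name__ == "__main__":
    print(choose_mode(QueryPlan()).value)
    print(choose_mode(QueryPlan(has_join=True)).value)
```

Under this assumed rule a plain scan stays in the master, while a query with a join is handed to the workers.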
  7. Tajo TajoWorker Architecture
    • A TajoWorker process has a
      – NodeResourceManager
        • Manages the resources of the worker node
      – TaskManager
        • Launches tasks to the TaskExecutor
        • Uses multiple threads equal to the number of CPU cores
      – TaskExecutor
        • Creates TaskContainers for workload
      – NodeStatusUpdater
        • Updates the current status when resources change
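The idea of using one thread per CPU core can be sketched generically in Python. This is not Tajo code (Tajo is written in Java), and `run_tasks` is a hypothetical helper: it only shows how a pool can be sized from the core count and used to run submitted tasks.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_tasks(tasks, workers=None):
    """Run zero-argument callables on a thread pool and return their results in order.

    By default the pool has one thread per CPU core, falling back to a
    single thread if the core count cannot be determined.
    """
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]


if __name__ == "__main__":
    squares = run_tasks([lambda i=i: i * i for i in range(4)])
    print(squares)
```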
  8. Tajo Table Spaces
    • Tajo supports Table Spaces
      – Data may be stored in multiple locations
      – e.g. HDFS, S3, or HBase
      – It might be stored in multiple formats
      – e.g. CSV, Parquet, or ORC
    • TableSpaces provide a way to
      – Easily handle data stored on different storage types
      – In various file formats
  9. Tajo Table Spaces
    • Multiple tablespaces exist for a data source
    • A tablespace contains multiple tables while a table has only one tablespace
    • External tables don't have any tablespaces because they have their own storage information
    • A database can contain tables of different tablespaces
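The relationships on this slide can be modelled with a few lines of Python. This is a conceptual sketch only: the class names, method names and example URIs are hypothetical and do not correspond to Tajo's catalog API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Tablespace:
    """A named storage location that can hold many tables."""
    name: str
    uri: str
    table_names: List[str] = field(default_factory=list)


@dataclass
class Table:
    """A table has at most one tablespace; external tables have none."""
    name: str
    tablespace: Optional[Tablespace] = None
    location: Optional[str] = None  # external tables carry their own storage info


class Database:
    """A database may mix tables from different tablespaces."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.tables: Dict[str, Table] = {}

    def create_table(self, name: str, tablespace: Tablespace) -> Table:
        table = Table(name, tablespace=tablespace)
        tablespace.table_names.append(name)
        self.tables[name] = table
        return table

    def create_external_table(self, name: str, location: str) -> Table:
        table = Table(name, location=location)
        self.tables[name] = table
        return table
```

For example, a `Database("sales")` could hold an `orders` table in an HDFS tablespace, a `clicks` table in an S3 tablespace, and an external `raw_logs` table that has only a location.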
  10. Available Books
    • See “Big Data Made Easy” – Apress Jan 2015
    • See “Mastering Apache Spark” – Packt Oct 2015
    • See “Complete Guide to Open Source Big Data Stack” – Apress Jan 2018
    • Find the author on Amazon
      – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
    • Connect on LinkedIn
      – www.linkedin.com/in/mike-frampton-38563020
  11. Connect
    • Feel free to connect on LinkedIn
      – www.linkedin.com/in/mike-frampton-38563020
    • See my open source blog at
      – open-source-systems.blogspot.com/
    • I am always interested in
      – New technology
      – Opportunities
      – Technology based issues
      – Big data integration