Apache Tajo

This presentation gives an overview of the Apache Tajo project. It explains Tajo architecture in relation to Hadoop/Hive and ETL.

Links for further information and connecting

http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/

https://nz.linkedin.com/pub/mike-frampton/20/630/385

https://open-source-systems.blogspot.com/

Mike Frampton

May 20, 2020

Transcript

  1. What Is Apache Tajo?
    • A data warehouse system
    • Open source / Apache 2.0 license
    • Stores data on HDFS and others
    • For low latency big data queries / ETL
    • Supports SQL
    • No release since May 2016
  2. Tajo Catalog Storage
    • Tajo can store catalog information in
      – Apache Derby (default)
      – MySQL
      – MariaDB
      – In-memory
      – Hive Catalog / HiveMetaStore
    • With the default Derby store, catalog data is kept under /tmp
  3. Tajo Data Storage
    • Tajo can store data in the following locations
      – HDFS
      – Amazon S3
      – OpenStack Swift
      – HBase
      – RDBMS
    • It is also possible to register user-defined storage
      – Place the user-defined JAR file in tajo/extlib
      – Copy the modified conf/storage-site.json.template to conf/storage-site.json
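As a rough illustration of the second step, the sketch below generates the kind of JSON that might go into conf/storage-site.json when registering a custom storage handler. The key names ("storages", "handler", "default-format"), the scheme "myfs", and the class name are assumptions made for illustration; the shipped conf/storage-site.json.template remains the authority on the real layout.

```python
import json


def build_storage_site(scheme: str, handler_class: str, default_format: str) -> dict:
    """Build a storage-site style mapping for one user-defined storage.

    The key names here are an assumed layout, not copied from the template.
    """
    return {
        "storages": {
            scheme: {
                "handler": handler_class,          # class packaged in the JAR under tajo/extlib
                "default-format": default_format,  # format used when none is specified
            }
        }
    }


# Hypothetical scheme and handler class, for illustration only.
config = build_storage_site("myfs", "com.example.tajo.MyTablespace", "text")
storage_site_json = json.dumps(config, indent=2)
print(storage_site_json)
```

Generating the file from code rather than editing it by hand keeps the scheme name and handler class in one place if several nodes need the same configuration.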
  4. Tajo Shell
    • Tajo provides a shell for instance manipulation
      – Issue meta commands, e.g. \l (list databases)
      – Issue HDFS commands
      – Use \set to set session variables
      – Issue \admin administration commands
      – Issue commands interactively or in batch mode
      – Run as a background process
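The batch usage mentioned above could be scripted along these lines. This is a minimal sketch that only builds the argument list and does not start a process. It assumes the shell is launched as bin/tsql and accepts -c for a single quoted statement and -f for a script file; the table name and script name are hypothetical, and the options should be checked against the installed version before use.

```python
import shlex
from typing import List, Optional


def tsql_command(statement: Optional[str] = None,
                 script: Optional[str] = None,
                 tsql: str = "bin/tsql") -> List[str]:
    """Return an argv list for a one-shot (non-interactive) shell run.

    Assumes '-c' takes one statement and '-f' takes a script file.
    """
    if (statement is None) == (script is None):
        raise ValueError("pass exactly one of statement or script")
    if statement is not None:
        return [tsql, "-c", statement]
    return [tsql, "-f", script]


# Hypothetical table and script names.
single = tsql_command(statement="select count(*) from orders")
batch = tsql_command(script="nightly_etl.sql")
print(shlex.join(single))
print(shlex.join(batch))
```

The returned list can be handed to subprocess.Popen to run the batch as a background process.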
  5. Tajo Cluster Architecture
    • A Tajo cluster has
      – One or more TajoMaster servers
      – One or more TajoWorker servers
    • TajoMaster coordinates TajoWorkers
    • TajoWorkers carry out processing
    • More TajoWorkers mean more processing capacity
    • Capacity scales linearly
  6. Tajo TajoMaster Architecture
    • A TajoMaster process has a
      – QueryCoordinator
        • Decides whether each query should be executed in a distributed way or executed immediately in the TajoMaster
      – ResourceTracker
        • Manages membership of cluster nodes
      – Client Service Provider
        • Routes client API calls to the proper QueryCoordinator or ResourceTracker
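The QueryCoordinator decision can be pictured with a toy model. The sketch below is not Tajo's actual logic: the QueryPlan fields, the 64 MB threshold, and the rule itself (small scans without joins or aggregation run in the master) are assumptions chosen only to show the shape of such a decision.

```python
from dataclasses import dataclass
from enum import Enum


class ExecutionMode(Enum):
    IMMEDIATE_IN_MASTER = "immediate"
    DISTRIBUTED = "distributed"


@dataclass(frozen=True)
class QueryPlan:
    """Hypothetical summary of a planned query."""
    has_join: bool
    has_aggregation: bool
    estimated_input_bytes: int


def choose_mode(plan: QueryPlan, threshold_bytes: int = 64 * 1024 * 1024) -> ExecutionMode:
    """Assumed rule: only small, simple scans run directly in the master."""
    is_simple = not plan.has_join and not plan.has_aggregation
    if is_simple and plan.estimated_input_bytes <= threshold_bytes:
        return ExecutionMode.IMMEDIATE_IN_MASTER
    return ExecutionMode.DISTRIBUTED


small_scan = QueryPlan(has_join=False, has_aggregation=False, estimated_input_bytes=10_000)
big_join = QueryPlan(has_join=True, has_aggregation=True, estimated_input_bytes=10**12)
print(choose_mode(small_scan).value)
print(choose_mode(big_join).value)
```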
  7. Tajo TajoWorker Architecture
    • A TajoWorker process has a
      – NodeResourceManager
        • Manages the resources of the worker node
      – TaskManager
        • Launches tasks on the TaskExecutor
        • Uses a number of threads equal to the number of CPU cores
      – TaskExecutor
        • Creates TaskContainers for the workload
      – NodeStatusUpdater
        • Updates the current status when resources change
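The "one thread per CPU core" sizing used by the TaskManager is a common pattern, and it can be sketched in a few lines. This is a generic illustration using Python's standard thread pool, not Tajo code (Tajo itself is written in Java); run_tasks and the sample tasks are hypothetical names.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_tasks(tasks: List[Callable[[], int]]) -> List[int]:
    """Run tasks on a pool sized to the number of CPU cores."""
    workers = os.cpu_count() or 1  # cpu_count() may return None
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in submission order
        return list(pool.map(lambda task: task(), tasks))


# Eight small tasks; the default argument binds n at definition time.
results = run_tasks([lambda n=n: n * n for n in range(8)])
print(results)
```

Sizing the pool to the core count avoids oversubscribing the node when tasks are CPU-bound.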
  8. Tajo Table Spaces
    • Tajo supports tablespaces
      – Data may be stored in multiple locations
      – e.g. HDFS, S3, or HBase
      – It may be stored in multiple formats
      – e.g. CSV, Parquet, or ORC
    • Tablespaces provide a way to
      – Easily handle data stored on different storage types
      – In various file formats
  9. Tajo Table Spaces
    • Multiple tablespaces can exist for a data source
    • A tablespace contains multiple tables, while a table has only one tablespace
    • External tables don't have a tablespace because they carry their own storage information
    • A database can contain tables from different tablespaces
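The relationships on this slide can be written down as a small data model. The sketch below is an illustration only: the class names, field names, and example URIs are hypothetical and do not mirror Tajo's catalog classes. It encodes the stated rules that a managed table has exactly one tablespace, an external table has none, and a database may mix tables from several tablespaces.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass(frozen=True)
class Tablespace:
    name: str
    uri: str


@dataclass(frozen=True)
class Table:
    name: str
    tablespace: Optional[Tablespace] = None   # set for managed tables
    external_location: Optional[str] = None   # set for external tables

    def __post_init__(self) -> None:
        # Exactly one of the two must be given.
        if (self.tablespace is None) == (self.external_location is None):
            raise ValueError("a table needs either a tablespace or its own location")


@dataclass
class Database:
    name: str
    tables: List[Table] = field(default_factory=list)

    def tablespaces(self) -> Set[str]:
        """Names of all tablespaces used by this database's managed tables."""
        return {t.tablespace.name for t in self.tables if t.tablespace is not None}


# Hypothetical tablespaces and locations.
warehouse = Tablespace("warehouse", "hdfs://namenode:8020/tajo/warehouse")
archive = Tablespace("archive", "s3://example-bucket/tajo")
db = Database("sales", [
    Table("orders", tablespace=warehouse),
    Table("orders_2015", tablespace=archive),
    Table("raw_logs", external_location="hdfs://namenode:8020/logs"),
])
print(sorted(db.tablespaces()))
```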
  10. Available Books
    • See “Big Data Made Easy” – Apress Jan 2015
    • See “Mastering Apache Spark” – Packt Oct 2015
    • See “Complete Guide to Open Source Big Data Stack” – Apress Jan 2018
    • Find the author on Amazon
      – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
    • Connect on LinkedIn
      – www.linkedin.com/in/mike-frampton-38563020
  11. Connect
    • Feel free to connect on LinkedIn
      – www.linkedin.com/in/mike-frampton-38563020
    • See my open source blog at
      – open-source-systems.blogspot.com/
    • I am always interested in
      – New technology
      – Opportunities
      – Technology based issues
      – Big data integration