Apache Tajo

What Is Apache Tajo?
• A data warehouse system
• Open source / Apache 2.0 license
• Stores data on HDFS and others
• For low latency big data queries / ETL
• Supports SQL
• No release since May 2016

Tajo Catalog Storage
• Tajo can store catalog information in
  – Apache Derby (default)
  – MySQL
  – MariaDB
  – In-memory
  – Hive Catalog / HiveMetaStore
• Derby is the default with storage under /tmp
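As a rough illustration of moving the catalog off the Derby default, the sketch below renders a Hadoop-style catalog-site.xml for a MySQL-backed catalog. The property names and the store class are recalled from the Tajo documentation and should be treated as assumptions to verify against your release; the host, database name, user, and password are placeholders.

```python
import xml.etree.ElementTree as ET


def catalog_site_xml(properties):
    """Render a Hadoop-style <configuration> document from a dict of properties."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Assumed property names and store class; all values are placeholders.
mysql_catalog = {
    "tajo.catalog.store.class": "org.apache.tajo.catalog.store.MySQLStore",
    "tajo.catalog.jdbc.connection.id": "tajo_user",
    "tajo.catalog.jdbc.connection.password": "change-me",
    "tajo.catalog.jdbc.uri": "jdbc:mysql://localhost:3306/tajo_catalog?createDatabaseIfNotExist=true",
}

xml_text = catalog_site_xml(mysql_catalog)
print(xml_text)
```

The output would be saved as conf/catalog-site.xml. A persistent catalog store is worth considering because the Derby default keeps its data under /tmp.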

Tajo Data Storage
• Tajo can store data in the following locations
  – HDFS
  – Amazon S3
  – OpenStack Swift
  – HBase
  – RDBMS
• It is also possible to register user-defined storage
  – Place the user-defined JAR file in tajo/extlib
  – Copy a modified conf/storage-site.json.template to conf/storage-site.json
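The registration step can be sketched as generating the JSON that goes into conf/storage-site.json. The "storages" / "handler" / "default-format" layout is an assumption based on the Tajo documentation, so compare it with the shipped template; the storage name "mystore" and the handler class are hypothetical.

```python
import json

# Assumed layout of conf/storage-site.json; verify against storage-site.json.template.
storage_site = {
    "storages": {
        # "mystore" and the handler class below are hypothetical names for a
        # user-defined storage whose JAR has been placed in tajo/extlib.
        "mystore": {
            "handler": "com.example.tajo.MyStoreTablespace",
            "default-format": "json",
        }
    }
}

storage_site_text = json.dumps(storage_site, indent=2)
print(storage_site_text)
```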

Tajo Shell
• Tajo provides a shell for instance manipulation
  – Issue meta commands, e.g. \l (list databases)
  – Issue HDFS commands
  – Use \set to set session variables
  – Issue \admin administration commands
  – Issue commands interactively or in batch
  – Run as a background process
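For batch use, a small helper can assemble a tsql command line. This sketch only builds the argument list and does not run anything. The options used (-c for an inline statement, -f for a script file, -h and -p for the TajoMaster host and port, -B for background) are recalled from the tsql usage text and should be checked with the shell's help output; the bin/tsql path and the table name are placeholders.

```python
import shlex


def tsql_command(sql=None, script=None, host=None, port=None,
                 background=False, tsql="bin/tsql"):
    """Build an argv list for a non-interactive tsql run (nothing is executed)."""
    if (sql is None) == (script is None):
        raise ValueError("pass exactly one of sql or script")
    argv = [tsql]
    if host:
        argv += ["-h", host]      # assumed option: TajoMaster host
    if port:
        argv += ["-p", str(port)]  # assumed option: TajoMaster port
    if background:
        argv.append("-B")          # assumed option: run as a background process
    argv += ["-c", sql] if sql is not None else ["-f", script]
    return argv


# "table1" is a hypothetical table name.
cmd = tsql_command(sql="select count(*) from table1", host="localhost")
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run once the options have been confirmed for the installed version.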

Tajo Cluster Architecture
• A Tajo cluster has
  – One or more TajoMaster servers
  – One or more TajoWorker servers
• TajoMaster coordinates TajoWorkers
• TajoWorkers carry out processing
• More TajoWorkers mean more processing capacity
• Capacity scales linearly

Tajo Cluster Architecture

Tajo TajoMaster Architecture
• A TajoMaster process has a
  – QueryCoordinator
    • Decides whether each query should be executed in a distributed way or be executed immediately in TajoMaster
  – ResourceTracker
    • Manages membership of cluster nodes
  – Client Service Provider
    • Routes client API calls to the proper QueryCoordinator or ResourceTracker

Tajo TajoWorker Architecture
• A TajoWorker process has a
  – NodeResourceManager
    • Manages the resources of the worker node
  – TaskManager
    • Launches tasks on the TaskExecutor
    • Uses as many threads as there are CPU cores
  – TaskExecutor
    • Creates TaskContainers for the workload
  – NodeStatusUpdater
    • Updates the current status when resources change
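The "one thread per CPU core" sizing used by the TaskManager can be illustrated with a generic thread pool. This is a sketch of the sizing idea only, not Tajo's implementation; run_tasks and the sample tasks are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_tasks(tasks):
    """Run callables on a pool with one thread per CPU core; results keep input order."""
    workers = os.cpu_count() or 1  # fall back to 1 if the core count is unknown
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]


# Eight toy tasks standing in for query fragments.
results = run_tasks([lambda n=n: n * n for n in range(8)])
print(results)
```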

Tajo Table Spaces
• Tajo supports Table Spaces
  – Data may be stored in multiple locations
  – e.g. HDFS, S3, or HBase
  – It might be stored in multiple formats
  – e.g. CSV, Parquet, or ORC
• TableSpaces provide a way to
  – Easily handle data stored on different storage types
  – In various file formats

Tajo Table Spaces

Tajo Table Spaces
• Multiple tablespaces exist for a data source
• A tablespace contains multiple tables while a table has only one tablespace
• External tables don't have any tablespaces because they have their own storage information
• A database can contain tables of different tablespaces
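The relationships listed above can be modelled with a few data classes. This is a toy model for illustration only; the class names, tablespace names, and URIs are hypothetical and do not reflect Tajo's internal catalog classes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Tablespace:
    name: str
    uri: str
    # Back-references are excluded from repr/compare to avoid cycles.
    tables: List["Table"] = field(default_factory=list, repr=False, compare=False)


@dataclass
class Table:
    name: str
    tablespace: Optional[Tablespace] = None  # None for external tables
    location: Optional[str] = None           # external tables carry their own storage info

    def __post_init__(self):
        # A table has exactly one tablespace, or (if external) its own location.
        if (self.tablespace is None) == (self.location is None):
            raise ValueError("give either a tablespace or an external location")
        if self.tablespace is not None:
            self.tablespace.tables.append(self)


@dataclass
class Database:
    name: str
    tables: List[Table] = field(default_factory=list)


warehouse = Tablespace("warehouse", "hdfs://namenode:8020/tajo/warehouse")
archive = Tablespace("archive", "s3://example-bucket/tajo")
sales = Database("sales", [
    Table("orders", tablespace=warehouse),
    Table("customers", tablespace=warehouse),
    Table("orders_2014", tablespace=archive),
    Table("raw_logs", location="hdfs://namenode:8020/data/logs"),  # external table
])

for table in sales.tables:
    space = table.tablespace.name if table.tablespace else "(external)"
    print(table.name, space)
```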

Tajo Table Spaces

Available Books
• See “Big Data Made Easy” – Apress Jan 2015
• See “Mastering Apache Spark” – Packt Oct 2015
• See “Complete Guide to Open Source Big Data Stack” – Apress Jan 2018
• Find the author on Amazon
  – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
• Connect on LinkedIn
  – www.linkedin.com/in/mike-frampton-38563020

Connect
• Feel free to connect on LinkedIn
  – www.linkedin.com/in/mike-frampton-38563020
• See my open source blog at
  – open-source-systems.blogspot.com/
• I am always interested in
  – New technology
  – Opportunities
  – Technology based issues
  – Big data integration

Mike Frampton
