Apache Kylin

This presentation gives an overview of the Apache Kylin project. It explains Kylin architecture in relation to Hadoop/HBase/Hive and Druid.

Links for further information and connecting

http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/

https://nz.linkedin.com/pub/mike-frampton/20/630/385

https://open-source-systems.blogspot.com/

Mike Frampton

May 19, 2020

Transcript

  1. What Is Apache Kylin?
    • An analytics data warehouse
    • For big data / Apache 2.0 license
    • Open source / written in Java
    • Kylin is an OLAP engine with a SQL interface
    • For huge tables (e.g., >100 million rows)
    • Provides query latency on the order of seconds at TB to PB scale
  2. How Does Kylin Work?
    • Kylin runs on a Hadoop cluster
    • It needs these services
      – HDFS, YARN, MapReduce, Hive, HBase, ZooKeeper
    • State information is stored in HBase
    • Historic data / star schema is stored in Hive
    • Access Kylin at http://<hostname>:7070/kylin
    • Uses a Lambda architecture for real-time streaming
      – layers: batch, speed and serving
      – batch / near real-time processing
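Because Kylin exposes a SQL interface behind the web address above, a query can also be submitted over HTTP. The following is a minimal sketch, not taken from the slides: it assumes Kylin's REST query endpoint at `/kylin/api/query` with HTTP Basic authentication, and the host, project name, table name and credentials shown are placeholders. The function `build_kylin_query` is a hypothetical helper that only builds the request; nothing is sent until `urlopen` is called.

```python
import base64
import json
import urllib.request


def build_kylin_query(host, sql, project, user, password, limit=50):
    """Build (but do not send) a POST request for Kylin's SQL query endpoint.

    Assumptions: port 7070 and the /kylin prefix come from the slide;
    the /api/query path and the payload fields follow Kylin's REST API.
    """
    url = f"http://{host}:7070/kylin/api/query"
    payload = {
        "sql": sql,
        "project": project,
        "offset": 0,
        "limit": limit,
        "acceptPartial": False,
    }
    # HTTP Basic authentication: base64("user:password").
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": f"Basic {token}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Placeholder values; replace them with those of a real installation.
request = build_kylin_query(
    host="kylin.example.com",
    sql="SELECT COUNT(*) FROM kylin_sales",
    project="learn_kylin",
    user="ADMIN",
    password="KYLIN",
)
# To execute against a running server:
#   with urllib.request.urlopen(request) as response:
#       result = json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect before it touches a live cluster.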
  3. Kylin Software Requirements
    • Requirements as of release v3.0.1
      – Hadoop: 2.7+, 3.1+ (since v2.5)
      – Hive: 0.13 - 1.2.1+
      – HBase: 1.1+, 2.0 (since v2.5)
      – Spark (optional): 2.3.0+
      – Kafka (optional): 1.0.0+ (since v2.5)
      – JDK: 1.8+ (since v2.5)
      – OS: Linux only, CentOS 6.5+ or Ubuntu 16.04+
  4. Kylin Real-Time Streaming Architecture
    • Streaming Receiver
      – ingests data from stream data sources
    • Streaming Coordinator
      – coordinates workloads
    • Metadata Store
      – stores streaming-related metadata
    • Query Engine
      – queries real-time data from the streaming receiver
    • Build Engine
      – builds cubes from the real-time data
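The way a serving layer in a Lambda architecture answers a query can be sketched as merging a batch-built result with a real-time one. This is an illustration only and not Kylin's implementation: `merge_views` is a hypothetical helper, and it assumes the measure is additive (such as a SUM or COUNT) and that both views are keyed by the same dimension values.

```python
def merge_views(batch_view, realtime_view):
    """Combine an additive measure from a batch view and a real-time view.

    Both arguments map a dimension value (e.g. a country) to a partial
    aggregate. Keys present in only one view are carried over unchanged.
    """
    merged = dict(batch_view)
    for key, value in realtime_view.items():
        merged[key] = merged.get(key, 0) + value
    return merged


# Hypothetical data: totals already built into a cube (batch layer)
# plus totals for events that arrived since the last build (speed layer).
batch_totals = {"US": 10, "NZ": 3}
realtime_totals = {"NZ": 2, "UK": 1}
combined = merge_views(batch_totals, realtime_totals)
```

Non-additive measures, such as distinct counts, cannot be merged by simple addition and would need a mergeable representation instead.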
  5. Kylin vs Druid
    • Druid is more suitable for real-time analysis. Kylin is more focused on the OLAP case.
    • Druid has good integration with Kafka for real-time streaming analysis. The real-time capability of Kylin (v3) is for real-time OLAP.
    • Druid uses bitmap indexes for internal data structures. Kylin uses bitmap indexes for real-time data and MOLAP cubes for historical data.
    • Kylin provides ANSI SQL. Druid provides its own query language.
    • Druid has limitations on table joins. Kylin supports star schemas.
    • Kylin has good integration with BI tools, such as Tableau or Excel. Druid has limited integration with existing BI tools.
    • Since Kylin supports MOLAP cubes, it has very good performance for complex queries on data sets with billions of rows.
    • Since Druid needs to scan the full index, performance may suffer if the data set and query range are too large.
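The difference between answering a query from a MOLAP cube and answering it by scanning can be illustrated with a toy example. The sketch below is not Kylin's build engine: it assumes a tiny in-memory fact table and a single SUM measure, and all function and column names are hypothetical. It pre-aggregates every combination of dimensions (each one a "cuboid") so that a GROUP BY becomes a lookup rather than a pass over the rows.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Pre-aggregate SUM(measure) for every subset of the dimensions."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cuboid = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                cuboid[key] += row[measure]
            cube[dims] = dict(cuboid)
    return cube


def query_cube(cube, group_by):
    """Answer a GROUP BY by lookup; group_by must follow the build order."""
    return cube[tuple(group_by)]


def query_scan(rows, group_by, measure):
    """Answer the same GROUP BY by scanning every row at query time."""
    result = defaultdict(int)
    for row in rows:
        result[tuple(row[d] for d in group_by)] += row[measure]
    return dict(result)


# Hypothetical fact table with two dimensions and one measure.
sales = [
    {"country": "US", "year": 2019, "amount": 10},
    {"country": "US", "year": 2020, "amount": 5},
    {"country": "NZ", "year": 2020, "amount": 7},
]
cube = build_cube(sales, ["country", "year"], "amount")
by_country = query_cube(cube, ["country"])
```

The trade-off is visible even at this scale: the cube holds one cuboid per dimension subset (2^n for n dimensions), so build time and storage grow quickly, while query cost no longer depends on the number of fact rows.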
  6. Kylin Ecosystem
    • Kylin Core
      – The fundamental framework of the Kylin OLAP engine comprises the Metadata Engine, Query Engine, Job Engine and Storage Engine to run the entire stack. It also includes a REST server to service client requests
    • Extensions
      – Plugins to support additional functions and features
    • Integration
      – Lifecycle management support to integrate with job scheduler, ETL, monitoring and alerting systems
    • User Interface
      – Allows third-party users to build customized user interfaces atop the Kylin core
    • Drivers
      – ODBC and JDBC drivers to support different tools and products, such as Tableau
  7. Available Books
    • See “Big Data Made Easy” (Apress, Jan 2015)
    • See “Mastering Apache Spark” (Packt, Oct 2015)
    • See “Complete Guide to Open Source Big Data Stack” (Apress, Jan 2018)
    • Find the author on Amazon
      – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
    • Connect on LinkedIn
      – www.linkedin.com/in/mike-frampton-38563020
  8. Connect
    • Feel free to connect on LinkedIn
      – www.linkedin.com/in/mike-frampton-38563020
    • See my open source blog at
      – open-source-systems.blogspot.com/
    • I am always interested in
      – New technology
      – Opportunities
      – Technology based issues
      – Big data integration