Apache Kylin

This presentation gives an overview of the Apache Kylin project. It explains the Kylin architecture in relation to Hadoop, HBase and Hive, and compares Kylin with Druid.

Links for further information and connecting

http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/

https://nz.linkedin.com/pub/mike-frampton/20/630/385

https://open-source-systems.blogspot.com/

Mike Frampton

May 19, 2020

Transcript

  1. What Is Apache Kylin?
    • An analytics data warehouse
    • For big data / Apache 2.0 license
    • Open source / written in Java
    • Kylin is an OLAP engine with a SQL interface
    • For huge tables (e.g., >100 million rows)
    • Provides query responses within seconds at TB to PB scale
  2. How Does Kylin Work?
    • Kylin runs on a Hadoop cluster
    • It needs these services
      – HDFS, YARN, MapReduce, Hive, HBase, ZooKeeper
    • State information is stored in HBase
    • Historic data / star schema stored in Hive
    • Access Kylin at http://<hostname>:7070/kylin
    • Uses Lambda architecture for real-time streaming
      – layers: batch, speed and serving
      – batch / near real-time processing
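
As a rough illustration of the SQL interface behind the URL above, the sketch below builds (but does not send) a request for Kylin's REST query endpoint, `POST /kylin/api/query`. The hostname, the `ADMIN`/`KYLIN` credentials, the `learn_kylin` project and the `kylin_sales` table are assumptions borrowed from a typical sample installation, not details taken from the slides.

```python
import base64
import json
import urllib.request


def build_kylin_query(host, sql, project, user, password, limit=100):
    """Build an HTTP request for Kylin's SQL query endpoint without sending it."""
    url = f"http://{host}:7070/kylin/api/query"
    body = json.dumps({"sql": sql, "project": project, "limit": limit}).encode("utf-8")
    # Kylin's REST API uses HTTP Basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


request = build_kylin_query(
    host="kylin.example.com",            # hypothetical hostname
    sql="SELECT COUNT(*) FROM kylin_sales",  # assumed sample table
    project="learn_kylin",               # assumed sample project
    user="ADMIN",                        # assumed default credentials
    password="KYLIN",
)
# Against a live server, the request could be sent with:
#   urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the example runnable without a cluster and makes the URL, headers and JSON body easy to inspect.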
  3. Kylin Software Requirements
    • Requirements as of release v3.0.1
      – Hadoop: 2.7+, 3.1+ (since v2.5)
      – Hive: 0.13 - 1.2.1+
      – HBase: 1.1+, 2.0 (since v2.5)
      – Spark (optional): 2.3.0+
      – Kafka (optional): 1.0.0+ (since v2.5)
      – JDK: 1.8+ (since v2.5)
      – OS: Linux only, CentOS 6.5+ or Ubuntu 16.04+
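
The minimum versions listed above can be checked mechanically. The following is a minimal sketch that only compares lower bounds; the function names and the `installed` dictionary are hypothetical, and it does not model the "since v2.5" notes or the Hive version range.

```python
# Lower bounds taken from the requirements list for Kylin v3.0.1.
MINIMUM_VERSIONS = {
    "hadoop": "2.7",
    "hive": "0.13",
    "hbase": "1.1",
    "spark": "2.3.0",  # optional component
    "kafka": "1.0.0",  # optional component
    "jdk": "1.8",
}


def parse_version(text):
    """Turn '2.7' or '2.10.1' into a 3-part tuple of ints for comparison."""
    parts = tuple(int(part) for part in text.split("."))
    return parts + (0,) * (3 - len(parts))


def unmet_requirements(installed):
    """Return a message for every installed component below its minimum version."""
    problems = []
    for component, found in installed.items():
        required = MINIMUM_VERSIONS.get(component)
        if required is not None and parse_version(found) < parse_version(required):
            problems.append(f"{component} {found} is older than {required}")
    return problems


# Hypothetical cluster inventory.
installed = {"hadoop": "2.10.1", "hive": "1.2.1", "hbase": "1.0.3", "jdk": "1.8"}
problems = unmet_requirements(installed)
print(problems)
```

Comparing integer tuples rather than strings avoids ordering mistakes such as treating Hadoop 2.10 as older than 2.7.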
  4. Kylin Real-Time Streaming Architecture
    • Streaming Receiver – ingests data from stream data sources
    • Streaming Coordinator – coordinates workloads
    • Metadata Store – stores streaming-related metadata
    • Query Engine – queries real-time data from the streaming receiver
    • Build Engine – builds cubes from the real-time data
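
The general idea of answering a query from both historical and real-time data, as in a Lambda architecture's serving layer, can be shown with a toy example. This is an illustrative sketch only and not Kylin's implementation; the function name and the per-country totals are invented for the example.

```python
from collections import Counter


def serve_query(batch_view, speed_view):
    """Serving layer: add recent real-time totals to precomputed batch totals."""
    merged = Counter(batch_view)
    merged.update(speed_view)  # Counter.update adds counts key by key
    return dict(merged)


# Hypothetical aggregates: totals built from historical data (batch layer)
# and totals for events that arrived since the last build (speed layer).
batch_view = {"US": 1200, "NZ": 300}
speed_view = {"NZ": 5, "UK": 2}

totals = serve_query(batch_view, speed_view)
print(totals)
```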
  5. Kylin vs Druid
    • Druid is more suitable for real-time analysis. Kylin is more focused on the OLAP case.
    • Druid has good integration with Kafka for real-time streaming analysis. The real-time capability of Kylin (v3) is for real-time OLAP.
    • Druid uses bitmap indexes for internal data structures. Kylin uses bitmap indexes for real-time data and MOLAP cubes for historical data.
    • Kylin provides ANSI SQL. Druid provides its own query language.
    • Druid has limitations on table joins. Kylin supports star schemas.
    • Kylin has good integration with BI tools, such as Tableau or Excel. Druid has limited integration with existing BI tools.
    • Since Kylin supports MOLAP cubes, it has very good performance for complex queries on billion-row-scale data sets.
    • Since Druid needs to scan the full index, performance may suffer if the data set and query range are too large.
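
To show why a MOLAP cube helps with complex queries, the sketch below precomputes an aggregate for every combination of dimensions so that a later GROUP BY becomes a dictionary lookup. It is a simplified illustration under assumed data, not Kylin's cube format; `build_cube`, `query` and the sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Precompute SUM(measure) for every subset of the dimensions (one cuboid each)."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cuboid = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                cuboid[key] += row[measure]
            cube[dims] = dict(cuboid)
    return cube


def query(cube, group_by):
    """Answer a GROUP BY from the precomputed cuboid, with no scan of the rows.

    group_by must list dimensions in the same order used to build the cube.
    """
    return cube[tuple(group_by)]


# Hypothetical fact table.
rows = [
    {"country": "US", "year": 2019, "amount": 10},
    {"country": "US", "year": 2020, "amount": 20},
    {"country": "NZ", "year": 2020, "amount": 5},
]
cube = build_cube(rows, ["country", "year"], "amount")
by_country = query(cube, ["country"])
grand_total = query(cube, [])
print(by_country, grand_total)
```

The trade-off is that the number of cuboids grows as 2^n for n dimensions, so the build step is expensive while queries stay cheap.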
  6. Kylin Ecosystem
    • Kylin Core – the fundamental framework of the Kylin OLAP engine, comprising the Metadata Engine, Query Engine, Job Engine and Storage Engine to run the entire stack. It also includes a REST Server to service client requests
    • Extensions – plugins to support additional functions and features
    • Integration – lifecycle management support to integrate with job schedulers, ETL, monitoring and alerting systems
    • User Interface – allows third-party users to build customized user interfaces atop the Kylin core
    • Drivers – ODBC and JDBC drivers to support different tools and products, such as Tableau
  7. Available Books
    • See “Big Data Made Easy” – Apress Jan 2015
    • See “Mastering Apache Spark” – Packt Oct 2015
    • See “Complete Guide to Open Source Big Data Stack” – Apress Jan 2018
    • Find the author on Amazon – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
    • Connect on LinkedIn – www.linkedin.com/in/mike-frampton-38563020
  8. Connect
    • Feel free to connect on LinkedIn – www.linkedin.com/in/mike-frampton-38563020
    • See my open source blog at – open-source-systems.blogspot.com/
    • I am always interested in
      – New technology
      – Opportunities
      – Technology based issues
      – Big data integration