Apache Kudu

This presentation gives an overview of the Apache Kudu project. It explains Kudu in terms of its architecture, schema, partitioning and replication. It also provides an example deployment scale.

Links for further information and connecting

http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/

https://nz.linkedin.com/pub/mike-frampton/20/630/385

https://open-source-systems.blogspot.com/

Mike Frampton

May 31, 2020

Transcript

  1. What Is Apache Kudu?

    • A column-oriented data store
    • Open source / Apache 2.0 license
    • Written in C++
    • Provides fast processing of OLAP workloads
    • Integrates with
      – MapReduce, Spark, Hadoop ecosystem, Impala
    • Scales to large datasets and large clusters
    • Consistency requirements can be chosen on a per-request basis
  2. Kudu Architecture

    • Kudu tables are split into units called tablets
    • A Kudu cluster may have multiple masters
    • One master leads while the others follow
    • Tablet servers store and serve tablets
    • Raft consensus is used to elect leaders and followers
    • A tablet server may be the leader for some tablets and a follower for others
    • This architecture supports
      – Fault tolerance
      – High availability
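The fault tolerance mentioned above follows from Raft's majority rule: a group of voters keeps working as long as more than half of them are available. The sketch below only illustrates that arithmetic. The function names are hypothetical and are not part of any Kudu API.

```python
def raft_majority(voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    if voters < 1:
        raise ValueError("a Raft group needs at least one voter")
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """Voters that can be lost while a majority is still reachable."""
    return voters - raft_majority(voters)


# Print the majority size and failure tolerance for common group sizes.
for n in (1, 3, 5):
    print(f"{n} voters: majority={raft_majority(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

With three masters, for example, the group tolerates the loss of one master, which is why odd group sizes are the usual choice.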
  3. Kudu Schema

    • Structured data model similar to an RDBMS
    • Three main concerns for schema design
      – Column design
      – Primary key design
      – Partitioning design
    • Kudu has strongly-typed columns
    • It uses a columnar on-disk storage format
  4. Kudu Schema

    • Schema design should accomplish
    • Efficient partition design
      – Even distribution of data across tablet servers
      – Even distribution of reads/writes across tablet servers
      – Even growth of data across tablet servers
      – Scans that read the minimum amount of data
    • The last point is also impacted by
      – Primary key design
  5. Kudu Partitioning

    • Partitioning involves
      – Partitioning tables into tablets
      – Across tablet servers
    • Partitioning affects performance
    • Aim to partition evenly across the cluster
    • Strategies include
      – Range, hash, multilevel
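The three strategies named above can be sketched in a few lines. This is a toy model, not Kudu's implementation: the CRC32 hash, the function names, and the `host`/`year` columns are all assumptions chosen for illustration, and Kudu uses its own hash function and partition metadata.

```python
import bisect
import zlib
from typing import List, Tuple


def hash_bucket(key: str, num_buckets: int) -> int:
    """Hash partitioning: spread keys across a fixed number of buckets.

    CRC32 is a stand-in here; it is NOT the hash Kudu uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def range_partition(value: int, split_points: List[int]) -> int:
    """Range partitioning: find which range a value falls into.

    Sorted split points [s0, s1] define the ranges
    (-inf, s0), [s0, s1), [s1, +inf), numbered 0, 1, 2.
    """
    return bisect.bisect_right(split_points, value)


def multilevel_partition(host: str, year: int, num_buckets: int,
                         split_points: List[int]) -> Tuple[int, int]:
    """Multilevel partitioning: combine a hash level with a range level."""
    return hash_bucket(host, num_buckets), range_partition(year, split_points)


# Hypothetical table: hashed on host into 4 buckets, ranged on year.
splits = [2020, 2021]
for host, year in [("host-a", 2019), ("host-b", 2020), ("host-a", 2022)]:
    print(host, year, multilevel_partition(host, year, 4, splits))
```

In this model each (bucket, range) pair corresponds to one tablet, so 4 buckets and 3 ranges give 12 tablets. Hashing spreads writes evenly, while the range level lets a scan on `year` skip ranges it does not need.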
  6. Kudu Column Types

    • Supported column types include
      – boolean
      – 8-bit signed integer
      – 16-bit / 32-bit / 64-bit signed integer
      – date (32-bit days since the Unix epoch)
      – unixtime_micros (64-bit microseconds since the Unix epoch)
      – single-precision (32-bit) IEEE-754 floating-point number
      – double-precision (64-bit) IEEE-754 floating-point number
      – decimal
      – varchar
      – UTF-8 encoded string (up to 64KB uncompressed)
      – binary (up to 64KB uncompressed)
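The two epoch-based types above are plain integers. As a minimal sketch using only the Python standard library, the helpers below show how a calendar date maps to days since the Unix epoch and a timestamp maps to microseconds since the Unix epoch. The helper names are hypothetical, and a real Kudu client may perform these conversions for you.

```python
from datetime import date, datetime, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_DATETIME = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_days_since_epoch(d: date) -> int:
    """Encode a calendar date as whole days since 1970-01-01."""
    return (d - EPOCH_DATE).days


def to_unixtime_micros(ts: datetime) -> int:
    """Encode a timezone-aware timestamp as microseconds since the epoch.

    Integer arithmetic is used to avoid floating-point rounding.
    """
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    delta = ts - EPOCH_DATETIME
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


print(to_days_since_epoch(date(1970, 1, 2)))
print(to_unixtime_micros(datetime(1970, 1, 1, 0, 0, 1, tzinfo=timezone.utc)))
```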
  7. Kudu Replication

    • Kudu is rack aware
      – It knows the server rack assignments
    • It replicates operations, not on-disk data
    • It performs logical replication, not physical replication
    • Inserts and updates do transmit data over the network
    • Deletes do not need to move any data
    • Compaction does not transmit data over the network
    • Tablets performing compactions do not need to
      – Perform them at the same time
      – Use the same schedule
      – Remain in synchronisation on the physical storage layer
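Logical replication, as described above, means each replica receives the same stream of operations and applies them locally. The toy model below illustrates the idea under heavy simplification: the `Replica` class and the operation tuples are hypothetical, and consensus, ordering guarantees, and on-disk storage are all left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# An operation is (kind, key, value). A delete carries only the key.
Operation = Tuple[str, str, Optional[str]]


@dataclass
class Replica:
    """In-memory stand-in for one tablet replica."""
    rows: Dict[str, str] = field(default_factory=dict)

    def apply(self, op: Operation) -> None:
        kind, key, value = op
        if kind == "upsert":
            if value is None:
                raise ValueError("upsert needs a value")
            self.rows[key] = value
        elif kind == "delete":
            # Only the key was sent; the row is removed locally.
            self.rows.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {kind}")


def replicate(ops: List[Operation], replicas: List[Replica]) -> None:
    """Send every operation, in order, to every replica."""
    for op in ops:
        for replica in replicas:
            replica.apply(op)


replicas = [Replica() for _ in range(3)]
replicate(
    [("upsert", "k1", "v1"), ("upsert", "k2", "v2"), ("delete", "k1", None)],
    replicas,
)
print([r.rows for r in replicas])
```

Because only operations are shared, each replica is free to reorganise its own storage (for example by compacting) without coordinating with the others.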
  8. Kudu Replication Terms

    • Kudu hot replica
      – A tablet replica that is continuously receiving writes
    • Kudu cold replica
      – A tablet replica that is not hot
      – A replica that is not frequently receiving writes
    • Kudu data on disk
      – Total amount of data stored on a tablet server
      – Across all disks
  9. Kudu Example Scale

    • 3 master servers
    • 100 tablet servers
    • 8 TiB of stored data per tablet server
      – post-replication and post-compression
    • 1000 tablets per tablet server
      – post-replication
    • 60 tablets per table
      – per tablet server, at table-creation time
    • 10 GiB of stored data per tablet
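The figures above can be combined into cluster-wide totals with simple arithmetic. The sketch below does that calculation. The replication factor of 3 is an assumption, since the slide does not state one, and the function name is hypothetical.

```python
from typing import Dict


def cluster_totals(tablet_servers: int, tib_per_server: float,
                   tablets_per_server: int,
                   replication_factor: int) -> Dict[str, float]:
    """Derive cluster-wide totals from per-tablet-server figures."""
    stored_tib = tablet_servers * tib_per_server
    tablet_replicas = tablet_servers * tablets_per_server
    return {
        # Data on disk across the cluster, post-replication.
        "stored_tib": stored_tib,
        # The same data counted once, before replication.
        "logical_tib": stored_tib / replication_factor,
        # Tablet replicas hosted across all tablet servers.
        "tablet_replicas": tablet_replicas,
        # Distinct tablets, each stored replication_factor times.
        "tablets": tablet_replicas // replication_factor,
    }


# Slide figures, plus an ASSUMED replication factor of 3.
totals = cluster_totals(tablet_servers=100, tib_per_server=8,
                        tablets_per_server=1000, replication_factor=3)
for name, value in totals.items():
    print(f"{name}: {value}")
```

Under that assumption the example cluster holds 800 TiB on disk, or roughly 267 TiB of data before replication, spread over 100,000 tablet replicas.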
  10. Available Books

    • See “Big Data Made Easy” – Apress Jan 2015
    • See “Mastering Apache Spark” – Packt Oct 2015
    • See “Complete Guide to Open Source Big Data Stack” – Apress Jan 2018
    • Find the author on Amazon
      – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
    • Connect on LinkedIn
      – www.linkedin.com/in/mike-frampton-38563020
  11. Connect

    • Feel free to connect on LinkedIn
      – www.linkedin.com/in/mike-frampton-38563020
    • See my open source blog at
      – open-source-systems.blogspot.com/
    • I am always interested in
      – New technology
      – Opportunities
      – Technology based issues
      – Big data integration