Apache Kudu

What Is Apache Kudu?
• A column-oriented data store
• Open source, Apache 2.0 license
• Written in C++
• Provides fast processing of OLAP workloads
• Integrates with
  – MapReduce, Spark, Hadoop ecosystem, Impala
• Scales to large datasets and large clusters
• Choose consistency requirements on a per-request basis

Kudu Architecture
• Kudu tables are split into units called tablets
• A Kudu cluster may have multiple masters
• One master leads while the others follow
• Tablet servers store and serve tablets
• Raft consensus is used to elect leaders and followers
• A tablet server may hold the leader replica for some tablets and follower replicas for others
• This architecture supports
  – Fault tolerance
  – High availability
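The fault-tolerance claim follows from the majority rule that Raft-style consensus relies on: a leader can only be elected, and a write can only be committed, while more than half of the replicas are reachable. The sketch below shows that arithmetic only. The function names are hypothetical and this is not Kudu code.

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can be lost while a majority still exists."""
    return replicas - majority(replicas)


def can_elect_leader(replicas: int, live: int) -> bool:
    """A leader can be elected only if a majority of replicas is live."""
    return live >= majority(replicas)


if __name__ == "__main__":
    for n in (1, 3, 5, 7):
        print(n, "replicas: majority", majority(n),
              "tolerates", tolerated_failures(n), "failure(s)")
```

With three replicas, a common choice for both masters and tablets, one failure can be tolerated. Five replicas tolerate two.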

Kudu Schema
• Structured data model similar to an RDBMS
• Three main concerns for schema design
  – Column design
  – Primary key design
  – Partitioning design
• Kudu has strongly-typed columns
• It uses a columnar on-disk storage format

Kudu Schema
• Schema design should accomplish
• Efficient partition design
  – Even distribution of data across tablet servers
  – Even distribution of reads/writes across tablet servers
  – Even growth of data across tablet servers
  – Scans that read the minimum amount of data
• The last point is also impacted by
  – Primary key design

Kudu Partitioning
• Partitioning involves
  – Splitting tables into tablets
  – Distributing those tablets across tablet servers
• Partitioning affects performance
• Aim to partition evenly across the cluster
• Strategies include
  – Range, hash, multilevel
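The three strategies can be illustrated with a small simulation of how a row is mapped to a tablet. This is a minimal sketch, not Kudu's implementation: zlib.crc32 stands in for Kudu's internal hash function, and the column names (host, ts), bucket count and split points are hypothetical.

```python
import bisect
import zlib
from collections import Counter
from typing import Sequence, Tuple


def hash_bucket(key: str, num_buckets: int) -> int:
    """Hash partitioning: a stable hash of the key picks one of N buckets."""
    # crc32 is only a stand-in for illustration; Kudu uses its own hash.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def range_partition(value: int, split_points: Sequence[int]) -> int:
    """Range partitioning: sorted split points divide the key space.

    Partition i holds values in [split_points[i-1], split_points[i]).
    """
    return bisect.bisect_right(split_points, value)


def multilevel_tablet(host: str, ts: int, num_buckets: int,
                      split_points: Sequence[int]) -> Tuple[int, int]:
    """Multilevel partitioning: combine a hash level with a range level."""
    return hash_bucket(host, num_buckets), range_partition(ts, split_points)


if __name__ == "__main__":
    splits = [1000, 2000]  # three time ranges
    rows = [("host-%d" % (i % 50), i * 3) for i in range(1000)]
    counts = Counter(multilevel_tablet(h, ts, 4, splits) for h, ts in rows)
    for tablet, n in sorted(counts.items()):
        print(tablet, n)
```

Hashing on host spreads writes for any one time range over several tablets, while the range level lets a scan over a time window skip tablets outside that window.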

Kudu Column Types
• Supported column types include
  – boolean
  – 8-bit signed integer
  – 16-bit / 32-bit / 64-bit signed integer
  – date (32-bit days since the Unix epoch)
  – unixtime_micros (64-bit microseconds since the Unix epoch)
  – single-precision (32-bit) IEEE-754 floating-point number
  – double-precision (64-bit) IEEE-754 floating-point number
  – decimal
  – varchar
  – UTF-8 encoded string (up to 64KB uncompressed)
  – binary (up to 64KB uncompressed)
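The date and unixtime_micros encodings are plain integer offsets from the Unix epoch, which the following sketch makes concrete using only the Python standard library. The helper names are hypothetical, and the size check assumes that 64KB means 64 * 1024 bytes of UTF-8.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_DATETIME = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_days_since_epoch(d: date) -> int:
    """Encode a calendar date as days since the Unix epoch."""
    return (d - EPOCH_DATE).days


def to_unixtime_micros(dt: datetime) -> int:
    """Encode a timezone-aware datetime as microseconds since the epoch."""
    delta = dt - EPOCH_DATETIME  # raises TypeError if dt is naive
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


def from_unixtime_micros(micros: int) -> datetime:
    """Decode microseconds since the epoch back to a UTC datetime."""
    return EPOCH_DATETIME + timedelta(microseconds=micros)


def fits_string_limit(text: str, limit_bytes: int = 64 * 1024) -> bool:
    """Check the encoded UTF-8 size, not the character count."""
    return len(text.encode("utf-8")) <= limit_bytes


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    micros = to_unixtime_micros(now)
    print(to_days_since_epoch(now.date()), micros, from_unixtime_micros(micros))
```

Integer arithmetic is used for the microsecond conversion so that no precision is lost to floating point.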

Kudu Replication
• Kudu is rack aware
  – It knows the server rack assignments
• It replicates operations, not on-disk data
• It performs logical replication, not physical replication
• Inserts and updates do transmit data over the network
• Deletes do not need to move any data
• Compaction does not transmit data over the network
• Tablet replicas performing compactions do not need to
  – Run at the same time
  – Use the same schedule
  – Remain in synchronisation on the physical storage layer
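Logical replication can be pictured as shipping a small operation record to every replica and letting each one apply it to its own local storage. The toy model below is only a sketch of that idea: the Op and Replica classes are hypothetical, and no consensus, logging or storage engine is modelled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Op:
    """An operation sent to replicas instead of on-disk data."""
    kind: str                    # "upsert" or "delete"
    key: int
    value: Optional[str] = None  # a delete carries a key but no row data


@dataclass
class Replica:
    """A replica that applies operations to its own local rows."""
    rows: Dict[int, str] = field(default_factory=dict)

    def apply(self, op: Op) -> None:
        if op.kind == "upsert":
            if op.value is None:
                raise ValueError("upsert needs a value")
            self.rows[op.key] = op.value
        elif op.kind == "delete":
            self.rows.pop(op.key, None)  # performed locally
        else:
            raise ValueError("unknown operation: " + op.kind)


def replicate(op: Op, replicas: List[Replica]) -> None:
    """Send the same operation to every replica."""
    for replica in replicas:
        replica.apply(op)


def row_payload_bytes(op: Op) -> int:
    """Row data carried by the operation, excluding the key."""
    return 0 if op.value is None else len(op.value.encode("utf-8"))


if __name__ == "__main__":
    replicas = [Replica() for _ in range(3)]
    replicate(Op("upsert", 1, "hello"), replicas)
    replicate(Op("delete", 1), replicas)
    print([r.rows for r in replicas])
```

Because each replica ends up with the same logical rows regardless of how it lays them out on disk, background work such as compaction can run independently on each one.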

Kudu Replication Terms
• Kudu hot replica
  – A tablet replica that is continuously receiving writes
• Kudu cold replica
  – A tablet replica that is not hot
  – A replica that is not frequently receiving writes
• Kudu data on disk
  – Total amount of data stored on a tablet server
  – Across all disks

Kudu Example Scale
• 3 master servers
• 100 tablet servers
• 8 TiB of stored data per tablet server
  – post-replication and post-compression
• 1000 tablets per tablet server
  – post-replication
• 60 tablets per table
  – per tablet server, at table-creation time
• 10 GiB of stored data per tablet
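The figures above can be multiplied out to give cluster-wide totals. The sketch below does that arithmetic. A replication factor of 3 is an assumption not stated on the slide, and the function names are hypothetical.

```python
GIB_PER_TIB = 1024


def cluster_stored_tib(tablet_servers: int, tib_per_server: float) -> float:
    """Total stored data across the cluster, post-replication."""
    return tablet_servers * tib_per_server


def pre_replication_tib(stored_tib: float, replication_factor: int) -> float:
    """Stored data before replication, for an assumed replication factor."""
    return stored_tib / replication_factor


def cluster_tablet_replicas(tablet_servers: int, tablets_per_server: int) -> int:
    """Total tablet replicas hosted across the cluster."""
    return tablet_servers * tablets_per_server


def average_gib_per_tablet(tib_per_server: float, tablets_per_server: int) -> float:
    """Average tablet size when a server is at both per-server figures."""
    return tib_per_server * GIB_PER_TIB / tablets_per_server


if __name__ == "__main__":
    stored = cluster_stored_tib(100, 8)
    print("stored, post-replication (TiB):", stored)
    print("pre-replication at assumed factor 3 (TiB):", pre_replication_tib(stored, 3))
    print("tablet replicas in cluster:", cluster_tablet_replicas(100, 1000))
    print("tablet replicas per table at creation:", cluster_tablet_replicas(100, 60))
    print("average GiB per tablet:", average_gib_per_tablet(8, 1000))
```

A server holding 8 TiB in 1000 tablets averages about 8.2 GiB per tablet, which sits under the 10 GiB per-tablet figure. The numbers are best read as separate ceilings rather than values that are all reached at once.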

Available Books
• See “Big Data Made Easy” – Apress Jan 2015
• See “Mastering Apache Spark” – Packt Oct 2015
• See “Complete Guide to Open Source Big Data Stack” – Apress Jan 2018
• Find the author on Amazon
  – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
• Connect on LinkedIn
  – www.linkedin.com/in/mike-frampton-38563020

Connect
• Feel free to connect on LinkedIn
  – www.linkedin.com/in/mike-frampton-38563020
• See my open source blog at
  – open-source-systems.blogspot.com/
• I am always interested in
  – New technology
  – Opportunities
  – Technology based issues
  – Big data integration

Mike Frampton
