Apache Fluo - Speaker Deck

Slide 1

Slide 1 text

What Is Apache Fluo?
● Supports incremental updates to large-scale data sets
● Open source, Apache 2.0 license
● Based upon Apache Accumulo
  – Uses Hadoop HDFS to store data
  – Uses ZooKeeper for configuration
  – Partitions tables into tablets
● It is a distributed system
● Supports cross-node transactions

Slide 2

Slide 2 text

What Is Apache Fluo?
● Allows monitoring of large data sets to
  – Identify small changes
  – Join changes into the larger data set
  – Without reprocessing all of the data
● Transactions allow many concurrent changes
  – Without data corruption
● Fluo uses code-based observers which
  – Act on table column changes
● Offers a Java-based API

Slide 3

Slide 3 text

What Is Apache Fluo?
● Use of Fluo is code based and low level
● Fluo uses Hadoop YARN to run its processes
● Fluo uses ZooKeeper to
  – Store its metadata
  – Store its state information
● Fluo data is stored in Fluo tables on Accumulo (HDFS)
  – Same structure as Accumulo, except
  – Rows have no timestamps

Slide 4

Slide 4 text

Fluo Architecture

Slide 5

Slide 5 text

Fluo Architecture
● Large-scale computation through small-scale transactions
● Clients access Fluo through the Java API
● Clients ingest data through the API
● Application Oracle processes assign transaction timestamps
● Application worker processes run user code
● User code (observers) monitors column changes
● Multiple workers can run the same observers
● Transactions change data, snapshots read data

Slide 6

Slide 6 text

Fluo Architecture
● Fluo provides snapshot isolation
● A snapshot sees only transactions committed before it started
● Transactions can overlap and collide
● Write skew is possible when
  – Overlapping transactions update different keys concurrently
● Fluo supports scanners to read data ranges, or spans
● Fluo has a transaction-based LoaderExecutor
  – To aid the loading of data
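The snapshot and write-skew points above can be illustrated with a toy in-memory multi-version store. This is a minimal sketch and not the Fluo API: `ToySnapshotStore`, `WriteSkewDemo`, the single counter standing in for the Oracle, and the rule "x + y must stay at or above 0" are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy multi-version store (hypothetical, NOT the Fluo API).
class ToySnapshotStore {
    // key -> (commit timestamp -> value)
    private final Map<String, TreeMap<Long, Integer>> versions = new HashMap<>();
    private long clock = 0; // stands in for the timestamp Oracle

    synchronized long nextTimestamp() {
        return ++clock;
    }

    // Returns the latest value committed at or before the snapshot timestamp.
    synchronized Integer read(String key, long snapshotTs) {
        TreeMap<Long, Integer> history = versions.get(key);
        if (history == null) return null;
        Map.Entry<Long, Integer> e = history.floorEntry(snapshotTs);
        return e == null ? null : e.getValue();
    }

    // Collision check: fail if any written key changed after the snapshot was taken.
    synchronized boolean commit(long snapshotTs, Map<String, Integer> writes) {
        for (String key : writes.keySet()) {
            TreeMap<Long, Integer> history = versions.get(key);
            if (history != null && history.lastKey() > snapshotTs) return false;
        }
        long commitTs = nextTimestamp();
        for (Map.Entry<String, Integer> w : writes.entrySet()) {
            versions.computeIfAbsent(w.getKey(), k -> new TreeMap<>()).put(commitTs, w.getValue());
        }
        return true;
    }
}

class WriteSkewDemo {
    // Two overlapping transactions each check x + y >= 100, then write different keys.
    static int run() {
        ToySnapshotStore store = new ToySnapshotStore();
        store.commit(store.nextTimestamp(), Map.of("x", 50, "y", 50));

        long t1 = store.nextTimestamp(); // both snapshots begin before either commit
        long t2 = store.nextTimestamp();

        if (store.read("x", t1) + store.read("y", t1) >= 100) {
            store.commit(t1, Map.of("x", store.read("x", t1) - 100)); // T1 writes only x
        }
        if (store.read("x", t2) + store.read("y", t2) >= 100) {
            store.commit(t2, Map.of("y", store.read("y", t2) - 100)); // T2 writes only y
        }

        long now = store.nextTimestamp();
        return store.read("x", now) + store.read("y", now);
    }
}
```

Both transactions commit because they write different keys, so the collision check never fires, yet the combined result breaks the rule each one checked. Two transactions writing the same key would collide instead, and the second commit would be rejected.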

Slide 7

Slide 7 text

Fluo Architecture
● Fluo supports incremental processing via notifications and observers
● Notifications
  – Persistent markers set by a transaction that indicate
  – An observer should run later for a certain row+column
● Observers
  – User-provided code that is registered to
  – Process notifications for a certain column
● An observer receives the row/column that triggered it, plus a transaction
● Fluo worker processes running across a cluster execute observers
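The notification and observer flow above can be sketched with a small in-memory model. This is an illustrative assumption, not the Fluo API: `ToyNotifier`, its `Observer` interface, and the `content` and `words` column names are hypothetical, and the markers here live in memory rather than being persisted.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Toy table with observers (hypothetical, NOT the Fluo API).
class ToyNotifier {
    interface Observer {
        void process(ToyNotifier table, String row, String column);
    }

    private final Map<String, String> cells = new HashMap<>();        // "row|column" -> value
    private final Map<String, Observer> observers = new HashMap<>();  // column -> observer
    private final Deque<String[]> notifications = new ArrayDeque<>(); // pending {row, column}

    void register(String column, Observer observer) {
        observers.put(column, observer);
    }

    String get(String row, String column) {
        return cells.get(row + "|" + column);
    }

    // Write a cell; if the column is observed, leave a marker to be handled later.
    void set(String row, String column, String value) {
        cells.put(row + "|" + column, value);
        if (observers.containsKey(column)) {
            notifications.add(new String[] {row, column});
        }
    }

    // Stand-in for a worker process: drain the markers and run the matching observers.
    int runWorker() {
        int executed = 0;
        while (!notifications.isEmpty()) {
            String[] n = notifications.poll();
            observers.get(n[1]).process(this, n[0], n[1]);
            executed++;
        }
        return executed;
    }
}

class WordCountDemo {
    static String run() {
        ToyNotifier table = new ToyNotifier();
        // The observer derives a word count only for the row that changed.
        table.register("content", (t, row, col) ->
            t.set(row, "words", String.valueOf(t.get(row, col).trim().split("\\s+").length)));
        table.set("doc1", "content", "incremental updates at scale");
        table.runWorker();
        return table.get("doc1", "words");
    }
}
```

The write to `content` does no derived work itself. It only leaves a marker, and the word count appears once the worker runs the observer for that row and column.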

Slide 8

Slide 8 text

Fluo Architecture
● Fluo supports two types of notification
● Strong notification
  – Guarantees an observer will run at most once
  – When a column is modified
  – Even for multiple updates to the same row+column
● Weak notification
  – Causes an observer to run at least once
  – Observers may run multiple times and/or concurrently
  – Based on a single weak notification
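One way to picture the strong-notification behaviour is as a set of pending markers, where repeated updates to the same row+column before the observer runs collapse into a single marker. The sketch below models only that coalescing and does not model weak notifications. `ToyStrongNotifications` and its method names are hypothetical and are not part of the Fluo API.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiConsumer;

// Toy model of coalescing markers (hypothetical, NOT the Fluo API).
class ToyStrongNotifications {
    private final Set<String> pending = new LinkedHashSet<>(); // "row|column" markers

    // Marking an already-marked row+column is a no-op, so repeated updates coalesce.
    void markModified(String row, String column) {
        pending.add(row + "|" + column);
    }

    // Run the observer once per pending marker and return the number of runs.
    int drain(BiConsumer<String, String> observer) {
        // Copy first so an observer may safely mark new modifications while running.
        List<String> batch = new ArrayList<>(pending);
        pending.clear();
        for (String marker : batch) {
            String[] rc = marker.split("\\|", 2);
            observer.accept(rc[0], rc[1]);
        }
        return batch.size();
    }
}
```

Marking `r1`/`c` three times and `r2`/`c` once, then draining, runs the observer twice: once per distinct row+column, not once per update.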

Slide 9

Slide 9 text

Fluo Row Locking

Slide 10

Slide 10 text

Fluo Row Locking
● For cross-node transactions Fluo uses
  – Accumulo conditional mutations
● Conditional mutations lock entire rows
  – On the server side, while conditions are checked
● Row locks can impact transaction performance
● May be a problem if
  – Many transactions update separate columns in the same row
  – Those transactions are very likely to run concurrently
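The row-locking concern can be sketched as a check-then-write that holds a per-row lock. This is a toy model and not Accumulo's conditional mutation API: `ToyConditionalTable` and `putIf` are hypothetical names, and the lock is a plain in-process `ReentrantLock`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Toy conditional-write table (hypothetical, NOT the Accumulo or Fluo API).
class ToyConditionalTable {
    private final Map<String, ReentrantLock> rowLocks = new ConcurrentHashMap<>();
    private final Map<String, Map<String, String>> rows = new ConcurrentHashMap<>();

    // Write only if the column currently holds `expected` (null means absent).
    // The whole row is locked while the condition is checked and the write applied,
    // so updates to different columns of the same row still wait on each other.
    boolean putIf(String row, String column, String expected, String value) {
        ReentrantLock lock = rowLocks.computeIfAbsent(row, r -> new ReentrantLock());
        lock.lock();
        try {
            Map<String, String> cols = rows.computeIfAbsent(row, r -> new HashMap<>());
            if (!Objects.equals(cols.get(column), expected)) {
                return false; // condition failed: another writer got there first
            }
            cols.put(column, value);
            return true;
        } finally {
            lock.unlock();
        }
    }

    String get(String row, String column) {
        ReentrantLock lock = rowLocks.computeIfAbsent(row, r -> new ReentrantLock());
        lock.lock();
        try {
            Map<String, String> cols = rows.get(row);
            return cols == null ? null : cols.get(column);
        } finally {
            lock.unlock();
        }
    }
}
```

Because the lock is per row rather than per column, a layout that spreads hot columns across many rows would reduce contention in this model. The same reasoning is why the slide flags many concurrent updates to one row as a possible problem.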

Slide 11

Slide 11 text

Available Books
● See “Big Data Made Easy” – Apress, Jan 2015
● See “Mastering Apache Spark” – Packt, Oct 2015
● See “Complete Guide to Open Source Big Data Stack” – Apress, Jan 2018
● Find the author on Amazon
  – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
● Connect on LinkedIn
  – www.linkedin.com/in/mike-frampton-38563020

Slide 12

Slide 12 text

Connect
● Feel free to connect on LinkedIn
  – www.linkedin.com/in/mike-frampton-38563020
● See my open source blog at
  – open-source-systems.blogspot.com/
● I am always interested in
  – New technology
  – Opportunities
  – Technology-based issues
  – Big data integration