
Apache Fluo

This presentation gives an overview of the Apache Fluo project. It explains Apache Fluo in terms of its architecture, functionality and transactions.

Links for further information and connecting

http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/

https://nz.linkedin.com/pub/mike-frampton/20/630/385

https://open-source-systems.blogspot.com/

Mike Frampton

June 08, 2020

Transcript

  1. What Is Apache Fluo?
    • For incremental updates to large scale data sets
    • Open source, Apache 2.0 license
    • Based upon Apache Accumulo
      – Uses Hadoop HDFS to store data
      – Uses ZooKeeper for configuration
      – Partitions tables into tablets
    • It is a distributed system
    • Supports cross node transactions
  2. What Is Apache Fluo?
    • Allows monitoring of large datasets to
      – Identify small changes
      – Join changes into the larger data set
      – Without processing all data
    • Transactions allow many concurrent changes
      – Without data corruption
    • Fluo uses code based observers which
      – Act on table column changes
    • Offers a Java based Fluo API
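
The idea on slide 2 of joining a small change into a larger result without reprocessing all data can be pictured with a word count that is adjusted by deltas. This is a minimal plain-Java sketch, not Fluo code: the class `IncrementalCountSketch` and its methods are hypothetical names chosen for illustration, and an in-memory map stands in for a Fluo table.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of incremental updates. NOT the Fluo API; all names are hypothetical.
class IncrementalCountSketch {
    private final Map<String, String> docs = new HashMap<>();     // docId -> current text
    private final Map<String, Integer> counts = new HashMap<>();  // word -> global count

    // Apply one changed document: subtract its old words, add its new words.
    // No other document is read or reprocessed.
    void upsert(String docId, String text) {
        String old = docs.put(docId, text);
        if (old != null) {
            apply(old, -1);
        }
        apply(text, +1);
    }

    private void apply(String text, int delta) {
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            // Returning null from the remapping function removes words whose count reaches zero.
            counts.merge(word, delta, (a, b) -> a + b == 0 ? null : a + b);
        }
    }

    int count(String word) {
        return counts.getOrDefault(word, 0);
    }

    // Returns the counts of "fluo", "accumulo" and "zookeeper" after one document is edited.
    static int[] demo() {
        IncrementalCountSketch index = new IncrementalCountSketch();
        index.upsert("d1", "fluo uses accumulo");
        index.upsert("d2", "fluo uses yarn");
        index.upsert("d1", "fluo uses zookeeper"); // small change to one document
        return new int[] {index.count("fluo"), index.count("accumulo"), index.count("zookeeper")};
    }

    public static void main(String[] args) {
        int[] r = demo();
        System.out.println("fluo=" + r[0] + " accumulo=" + r[1] + " zookeeper=" + r[2]);
    }
}
```

Only the edited document's old and new text are touched on each update. In Fluo the equivalent work would be done inside transactions and observers rather than a single in-memory object.
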
  3. What Is Apache Fluo?
    • Use of Fluo is code based and low level
    • Fluo uses Hadoop YARN to run its processes
    • Fluo uses ZooKeeper to
      – Store its metadata
      – Store its state information
    • Fluo data is stored in Fluo tables on Accumulo (HDFS)
      – Same structure as Accumulo except
      – Rows have no timestamps
  4. Fluo Architecture
    • Large scale computation through small scale transactions
    • Clients access Fluo through Java API
    • Clients ingest data through the API
    • Application Oracle processes apply transaction timestamps
    • Application worker processes run user code
    • User code/observers monitor column changes
    • Multiple workers can run the same observers
    • Transactions change data, snapshots read data
  5. Fluo Architecture
    • Fluo provides snapshot isolation
    • A snapshot only sees transactions committed before it started
    • Transaction overlap / collision is possible
    • In this case a write skew is possible if
      – Different keys are concurrently updated
    • Fluo supports scanners to read data ranges or spans
    • Fluo has a transaction based LoaderExecutor
      – To aid the loading of data
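
Snapshot isolation and write skew, as described on slide 5, can be illustrated with a toy multi-version store. The sketch below is plain Java and does not use the Fluo API: `SnapshotSketch`, `Tx` and the counter that stands in for the timestamp oracle are assumptions made for illustration. Each transaction reads the newest version committed before its start timestamp, and a commit is rejected only when another transaction has committed a newer version of a key it wrote.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy in-memory model of snapshot isolation. NOT the Fluo API; all names are hypothetical.
class SnapshotSketch {
    // key -> (commit timestamp -> value)
    private final Map<String, TreeMap<Long, Integer>> versions = new HashMap<>();
    private long clock = 0; // stands in for the timestamp oracle

    class Tx {
        private final long startTs = ++clock;
        private final Map<String, Integer> writes = new HashMap<>();

        // Reads see only versions committed before this transaction started (0 if none).
        int get(String key) {
            TreeMap<Long, Integer> v = versions.get(key);
            Map.Entry<Long, Integer> e = (v == null) ? null : v.floorEntry(startTs);
            return (e == null) ? 0 : e.getValue();
        }

        void set(String key, int value) {
            writes.put(key, value);
        }

        // Commit fails if a key written here has a version committed after startTs.
        boolean commit() {
            for (String key : writes.keySet()) {
                TreeMap<Long, Integer> v = versions.get(key);
                if (v != null && v.lastKey() > startTs) {
                    return false; // collision on the same key
                }
            }
            long commitTs = ++clock;
            writes.forEach((k, val) ->
                versions.computeIfAbsent(k, x -> new TreeMap<>()).put(commitTs, val));
            return true;
        }
    }

    Tx begin() {
        return new Tx();
    }

    // Two overlapping transactions write the same key: the second commit is rejected.
    static boolean collisionDemo() {
        SnapshotSketch db = new SnapshotSketch();
        Tx t1 = db.begin();
        Tx t2 = db.begin();
        t1.set("a", 1);
        t2.set("a", 2);
        t1.commit();
        return t2.commit();
    }

    // Write skew: both transactions check a + b >= 100, then update different keys.
    static int writeSkewDemo() {
        SnapshotSketch db = new SnapshotSketch();
        Tx init = db.begin();
        init.set("a", 50);
        init.set("b", 50);
        init.commit();

        Tx t1 = db.begin();
        Tx t2 = db.begin();
        if (t1.get("a") + t1.get("b") >= 100) t1.set("a", t1.get("a") - 100);
        if (t2.get("a") + t2.get("b") >= 100) t2.set("b", t2.get("b") - 100);
        t1.commit(); // both commits succeed because the write sets do not overlap
        t2.commit();

        Tx check = db.begin();
        return check.get("a") + check.get("b");
    }

    public static void main(String[] args) {
        System.out.println("second commit on same key succeeded: " + collisionDemo());
        System.out.println("a + b after write skew: " + writeSkewDemo());
    }
}
```

In `writeSkewDemo` each transaction saw a consistent snapshot and neither collided, yet the combined result breaks the rule both of them checked. That is the risk the slide refers to when different keys are updated concurrently.
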
  6. Fluo Architecture
    • Fluo supports incremental processing via
    • Notifications
      – Persistent markers set by a transaction, which indicate that
      – An Observer should run later for a certain row+column
    • Observers
      – User provided code that is registered to
      – Process notifications for a certain column
    • Observer receives row/column that triggered it plus transaction
    • Fluo worker processes running across a cluster will execute Observers
  7. Fluo Architecture
    • Fluo supports two types of notification
    • Strong notification
      – Guarantee an observer will run at most once
      – When a column is modified
      – Even for multiple row+column updates
    • Weak notification
      – Cause an observer to run at least once
      – Observers may run multiple times and/or concurrently
      – Based on a single weak notification
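
The notification and observer model on slides 6 and 7 can be pictured as a set of row+column markers that a worker later drains. The following is a minimal plain-Java sketch, not the Fluo API: `NotificationSketch`, its `Observer` interface and the column name `stats:count` are hypothetical. It models only the strong-notification behaviour, where repeated updates to the same row+column before the observer runs leave a single marker. The transaction that a real observer receives, and weak notifications, are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of notifications and observers. NOT the Fluo API; all names are hypothetical.
class NotificationSketch {
    interface Observer {
        void process(String row, String column);
    }

    private final Map<String, Observer> observers = new HashMap<>();  // column -> observer
    private final Set<List<String>> pending = new LinkedHashSet<>();  // [row, column] markers

    void register(String column, Observer observer) {
        observers.put(column, observer);
    }

    // A write to an observed column leaves a marker; duplicate markers collapse in the set.
    void write(String row, String column) {
        if (observers.containsKey(column)) {
            pending.add(List.of(row, column));
        }
    }

    // One worker pass: run the registered observer once per pending marker.
    int runWorker() {
        List<List<String>> batch = new ArrayList<>(pending);
        pending.clear();
        for (List<String> marker : batch) {
            observers.get(marker.get(1)).process(marker.get(0), marker.get(1));
        }
        return batch.size();
    }

    static List<String> demo() {
        NotificationSketch sketch = new NotificationSketch();
        List<String> calls = new ArrayList<>();
        sketch.register("stats:count", (row, column) -> calls.add(row + "/" + column));
        sketch.write("row1", "stats:count");
        sketch.write("row1", "stats:count"); // same row+column again: still one marker
        sketch.write("row2", "stats:count");
        sketch.write("row1", "other:col");   // no observer registered for this column
        sketch.runWorker();
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The observer runs once for `row1` and once for `row2`, even though `row1` was written twice. In Fluo the markers are persistent and the workers run across a cluster; this sketch keeps everything in one process for clarity.
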
  8. Fluo Row Locking
    • For cross node transactions Fluo uses
      – Accumulo conditional mutations
    • Conditional mutations lock entire rows on the server side when checking conditions
    • Row locks can impact the transaction performance
    • May be a problem if
      – Many transactions will update separate columns in a row
      – Those transactions are very likely to run concurrently
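
The row locking concern on slide 8 can be sketched as a check-and-set that takes one lock per row, whichever column is being updated. This is an illustrative plain-Java model, not Accumulo's or Fluo's implementation: `RowLockSketch` and `conditionalPut` are hypothetical names, and a `null` expected value is assumed to mean "column not set yet".

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Toy model of a conditional update guarded by a per-row lock.
// NOT the Accumulo or Fluo API; all names are hypothetical.
class RowLockSketch {
    private final Map<String, ReentrantLock> rowLocks = new ConcurrentHashMap<>();
    private final Map<String, Map<String, String>> rows = new ConcurrentHashMap<>();

    // Apply the write only if the column currently holds the expected value.
    boolean conditionalPut(String row, String column, String expected, String newValue) {
        ReentrantLock lock = rowLocks.computeIfAbsent(row, r -> new ReentrantLock());
        lock.lock(); // the whole row is locked while the condition is checked
        try {
            Map<String, String> columns = rows.computeIfAbsent(row, r -> new ConcurrentHashMap<>());
            if (!Objects.equals(columns.get(column), expected)) {
                return false; // condition failed, nothing written
            }
            columns.put(column, newValue);
            return true;
        } finally {
            lock.unlock();
        }
    }

    static boolean[] demo() {
        RowLockSketch table = new RowLockSketch();
        boolean first = table.conditionalPut("row1", "colA", null, "1"); // colA absent, as expected
        boolean stale = table.conditionalPut("row1", "colA", null, "2"); // colA is now "1"
        boolean other = table.conditionalPut("row1", "colB", null, "x"); // different column, same row lock
        return new boolean[] {first, stale, other};
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("first=" + r[0] + " stale=" + r[1] + " other=" + r[2]);
    }
}
```

The update to `colB` does not conflict with `colA`, yet it still has to acquire the `row1` lock. With many concurrent transactions touching separate columns of one row, that shared lock is where they would queue.
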
  9. Available Books
    • See “Big Data Made Easy” – Apress Jan 2015
    • See “Mastering Apache Spark” – Packt Oct 2015
    • See “Complete Guide to Open Source Big Data Stack” – Apress Jan 2018
    • Find the author on Amazon
      – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
    • Connect on LinkedIn
      – www.linkedin.com/in/mike-frampton-38563020
  10. Connect
    • Feel free to connect on LinkedIn
      – www.linkedin.com/in/mike-frampton-38563020
    • See my open source blog at
      – open-source-systems.blogspot.com/
    • I am always interested in
      – New technology
      – Opportunities
      – Technology based issues
      – Big data integration