Kafka on HDInsight

Welcome to #GAB2017! Kafka™ on HDInsight
Hans-Peter Grahsl | Developer - Trainer - Consultant | Netconomy | FH CAMPUS 02 | @hpgrahsl | April 22, 2017

Contents
• HDInsight
  - Cluster setup & configuration
• Apache Kafka
  - Concepts & fundamentals
  - Client APIs
• Kafka @ Azure
  - Cloud live demos
  - Alternatives? → Confluent Platform

Azure Intelligence & Analytics

Azure HDInsight: Open-Source Big Data Services
- originally Apache Hadoop (Windows only)
- by now a comprehensive big data tech stack offering (Linux as well)

Azure HDInsight: Industry-leading *-as-a-Service offerings (independent of cluster type and OS)
- guaranteed availability (99.9%) incl. 24/7 support
- cluster provisioning at the push of a button, within minutes
- HA node configurations & geo-replication of data
- HIPAA, PCI/DSS, SOC & ISO compliance

Why Messaging & Stream Processing in the Cloud?
- The majority of data is no longer local…
- Event-based data often already resides in the cloud
- Event-based data is increasingly distributed globally
- Reduced TCO
- Elastic scale-out
- Service, not infrastructure
“Bring the processing to the data, not the data to the processing!”

Kafka™ → “central nervous system for data” -- Jay Kreps
“…everything that happens in a company - every customer interaction, every API request, every database change - can be represented as a real-time stream that anything else can tap into, process, or react to.”

Apache Kafka™
“…is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.”

Why / where does the name -> Kafka <- come from?
“I thought that since Kafka was a system optimized for writing using a writer’s name would make sense. I had taken a lot of lit classes in college and liked Franz Kafka. Plus the name sounded cool for an open source project… So basically there is not much of a relationship.” -- Jay Kreps

Apache Kafka™ Core Characteristics
- flexible and scalable publish / subscribe scenarios
- fault-tolerant storage of data streams
- near-real-time processing of events for...
  - reliable data exchange across systems
  - lightweight, high-performance data stream analytics

Apache Kafka™
[Architecture diagram: Producer (Message In), Kafka Brokers, Consumer (Message Out); Zookeeper holds Broker & Topic Metadata and Consumer Metadata & Partition Offsets]

Apache Kafka™ Kafka Logs
- the central data structure is an append-only log
- every message has a unique sequential number
- messages on the left “are older” than those on the right
[Diagram: log cells 0 1 2 3 4 5 6 7 8 9 10 …, running from older (first message) to newer (next message)]
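The log idea above can be sketched in a few lines of Python. This is only a toy in-memory model, not how Kafka stores data; the class name `AppendOnlyLog` and its methods are hypothetical and chosen purely for illustration.

```python
class AppendOnlyLog:
    """Toy in-memory model of an append-only log (illustration only)."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Append a message at the end and return its sequential number (offset)."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Return (offset, message) pairs starting at the given offset."""
        end = min(offset + max_messages, len(self._messages))
        return [(i, self._messages[i]) for i in range(offset, end)]

    @property
    def next_offset(self):
        """Offset that the next appended message will receive."""
        return len(self._messages)


log = AppendOnlyLog()
offsets = [log.append(m) for m in ["first", "second", "third"]]
```

Existing entries are never modified or reordered; readers only choose the offset at which they start reading.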

Apache Kafka™ Topics
- messages are categorized into topics
- topics can be partitioned
[Diagram: Topic “Azure_Bootcamp” with Partition 0 (0 1 2 3 4 5 6 7 8 9 …), Partition 1 (0 1 2 3 4 5 6 …), Partition 2 (0 1 2 3 4 5 6 7 8 …)]

Apache Kafka™ 4 Core Client APIs: PRODUCER, CONNECT, CONSUMER, STREAM

Apache Kafka™ Producers
- create messages and write them to topics
- the partitioning strategy decides on the target partition
- ordering within a partition is guaranteed
[Diagram: Producer writes messages to Topic “Azure_Bootcamp”, Partitions 0 to 2]
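One possible partitioning strategy is to hash the message key and take the result modulo the number of partitions, so that all messages with the same key end up in the same partition and keep their relative order. The sketch below assumes this key-hash strategy and uses `zlib.crc32` as a stand-in hash (the Kafka Java client uses a different hash function); `choose_partition` and `send` are hypothetical helper names.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # three partitions, as in the diagram


def choose_partition(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition; crc32 is only a stand-in hash function."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


topic = defaultdict(list)  # partition number -> list of (key, value) in write order


def send(key, value):
    """Append the message to its target partition and return (partition, offset)."""
    partition = choose_partition(key)
    topic[partition].append((key, value))
    return partition, len(topic[partition]) - 1


for i in range(3):
    send("user-a", f"a{i}")
    send("user-b", f"b{i}")

partition_a = choose_partition("user-a")
values_a = [value for key, value in topic[partition_a] if key == "user-a"]
```

Ordering is only guaranteed per partition: `values_a` preserves the write order for `user-a`, while nothing is said about the relative order of messages in different partitions.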

Apache Kafka™ Consumers
- read and process messages from topics
- can subscribe to 1..N topics
- processing within a partition happens in insertion order
- persist the offsets of the most recently consumed messages
[Diagram: Consumer Group with three Consumers reading from Topic “Azure_Bootcamp”, Partitions 0 to 2]
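Persisted offsets are what allow a consumer to continue where it left off after a restart. The following toy model illustrates that idea for a single partition; the `Consumer` class, the shared `committed_offsets` dictionary, and the convention of storing the next offset to read are assumptions made for this sketch, not the Kafka client API.

```python
class Consumer:
    """Toy consumer that reads one partition and commits its position."""

    def __init__(self, partition_log, committed_offsets, group_id):
        self.log = partition_log
        self.committed = committed_offsets  # group id -> next offset to read
        self.group_id = group_id

    def poll(self, max_messages=2):
        """Return the start offset and the next batch, in insertion order."""
        start = self.committed.get(self.group_id, 0)
        return start, self.log[start:start + max_messages]

    def commit(self, next_offset):
        """Persist the position so a later consumer of the group can resume."""
        self.committed[self.group_id] = next_offset


partition_log = ["m0", "m1", "m2", "m3", "m4"]
committed_offsets = {}

first = Consumer(partition_log, committed_offsets, "demo-group")
start, batch = first.poll()
first.commit(start + len(batch))

# A "restarted" consumer of the same group resumes after the committed position.
second = Consumer(partition_log, committed_offsets, "demo-group")
resumed_start, resumed_batch = second.poll()
```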

Apache Kafka™ Demo: Producer / Consumer API
Java Producer, Console Consumer

Apache Kafka™ Connect API
- bring data from the source system → into Kafka
[Diagram: Sources, Source Connectors (Connect API), Kafka Cluster]

Apache Kafka™ Connect API
- bring data out of Kafka → into the target system
[Diagram: Kafka Cluster, Sink Connectors (Connect API), Sinks]

Apache Kafka™ Connect API
- connect source and target systems via Kafka
- Kafka as the central data pipeline
- reliable & scalable synchronization of data streams
[Diagram: Sources, Source Connectors, Kafka Cluster, Sink Connectors, Sinks (Connect API)]

Apache Kafka™ Connect API Characteristics
- timeliness
- reliability
- throughput
- format independence
- transformability
- security
- fault tolerance
NO CODE: “configure instead of program”
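To make “configure instead of program” concrete, the sketch below builds a connector definition as plain data. It assumes the `FileStreamSourceConnector` that ships with Apache Kafka and a payload shape as accepted by the Connect REST API (`POST /connectors` on a worker in distributed mode); the connector name and the file path are hypothetical, and the topic name is borrowed from the slides.

```python
import json

# Connector definition: no custom code, only configuration.
connector = {
    "name": "demo-file-source",  # hypothetical connector name
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/tmp/demo-input.txt",  # hypothetical input file
        "topic": "Azure_Bootcamp",      # topic name borrowed from the slides
    },
}

payload = json.dumps(connector, indent=2)
print(payload)
```

Switching to a different source or sink would mainly mean changing `connector.class` and the connector-specific settings.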

Apache Kafka™ Connect API (since version 0.9+)
- many official connectors are already available
- some more from the open source community
- see the Confluent website https://www.confluent.io/product/connectors/
Example from the community: MongoDB Sink Connector, my private open source project https://github.com/hpgrahsl/kafka-connect-mongodb

Apache Kafka™ Demo: Connect API
Twitter Source Connector

Apache Kafka™ Stream API (since version 0.10+)
- data stream processing directly with a Kafka library
- lightweight client applications
- no external, dedicated streaming frameworks (Storm, Spark, Flink, Samza, ...)
- Processor API & Stream DSL

Apache Kafka™ Data Stream Properties:
- unbounded
- ordered
- immutable
- replayable
Further aspects → time and state

Apache Kafka™ Time in data stream processing?
- Event Time
- Log-Append Time
- Processing Time
[Image: https://www.flickr.com/photos/smemon/5281453002/]
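Which notion of time is used matters as soon as events arrive late. The sketch below assigns the same events to 60-second tumbling windows, once by event time and once by processing time; the window size and the timestamps are made-up illustration values, and log-append time is not modeled.

```python
from collections import Counter

WINDOW_SIZE = 60  # seconds; hypothetical tumbling window length


def window_start(timestamp, size=WINDOW_SIZE):
    """Return the start of the tumbling window that contains the timestamp."""
    return timestamp - (timestamp % size)


# (event_time, processing_time) in seconds; the third event is processed late.
events = [(10, 12), (50, 55), (58, 65), (70, 72)]

by_event_time = Counter(window_start(event) for event, _ in events)
by_processing_time = Counter(window_start(processed) for _, processed in events)
```

The third event happened in the first window but is processed in the second one, so the two groupings produce different counts per window.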

Apache Kafka™ State in data stream processing?
- stateless (e.g. only map / filter)
- local state (e.g. groupBy & aggregations within a partition)
- “shared state” (e.g. global top N across partitions)
stateful:
- local (internal) → state store: RocksDB + Kafka topic
- remote (external) → usually some *DB (NoSQL or RDBMS)
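The three levels of state can be illustrated with plain Python collections. This is a conceptual sketch only, not Kafka Streams code: the hashtag data is invented, and in-memory `Counter` objects stand in for the state stores mentioned above.

```python
from collections import Counter

# Hashtags as they might arrive in three partitions (made-up data).
partitions = {
    0: ["#Azure", "#Kafka", "#kafka"],
    1: ["#kafka", "#KAFKA", "#gab2017"],
    2: ["#azure", "#gab2017", "#GAB2017"],
}

# Stateless: map / filter look at one message at a time.
normalized = {
    p: [tag.lower() for tag in tags if tag.startswith("#")]
    for p, tags in partitions.items()
}

# Local state: counts are kept per partition.
local_counts = {p: Counter(tags) for p, tags in normalized.items()}

# "Shared state": a global top N needs the results of all partitions.
global_counts = Counter()
for counts in local_counts.values():
    global_counts.update(counts)

top_2 = global_counts.most_common(2)
```

Only the last step needs information from more than one partition, which is why a global top N is harder to scale than per-partition aggregations.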

Apache Kafka™ Demo: Stream API
Producer, Streaming App: Windowed Aggregates

Apache Kafka™ Demo: Stream API
Producer, Streaming App: TopN #hashtags, emojis, ...

Apache Kafka™ Example Use Cases
- General Purpose Messaging
- Clickstream Tracking
- Central IoT DataHub for
- Operational Metrics Monitoring
- Log Aggregation
- Stream Processing
- Microservices & Event Sourcing
- External Commit Log

Beyond the core product...

Confluent Platform on Azure...

Grab your Kafka swag... sponsored by

Contact details...
Hans-Peter Grahsl | @hpgrahsl | https://www.xing.com/profile/HansPeter_Grahsl | hans_peter_g | [email protected] | +43 650 217 17 04
