Slide 1

Slide 1 text

Edelweiss Kammermann & Frank Munz – Open Source Big Data in OPC – Analytics & Big Data Summit 2018

Slide 2

Slide 2 text

munz & more #2

Slide 3

Slide 3 text

© IT Convergence 2016. All rights reserved. About Me • Computer Engineer, BI and Data Integration Specialist • Over 20 years of consulting and project management experience in Oracle tech • Co-founder and Vice President of the Uruguayan Oracle User Group (UYOUG) • Vice President of LAOUC (Latin America Oracle User Community) • BI Manager at IT Convergence • Writer and frequent speaker at tech conferences • Oracle ACE Director • Oracle Big Data Implementation Specialist

Slide 4

Slide 4 text

Uruguay

Slide 5

Slide 5 text

Dr. Frank Munz • Founded munz & more in 2007 • 18 years Oracle Middleware, Cloud, and Distributed Computing • Consulting and high-end training • Wrote two Oracle WLS books and one Cloud book

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

#1 Hadoop HDFS + M/R

Slide 8

Slide 8 text

What is Big Data? • Volume: high amount of data • Variety: different data types and formats; unstructured/semi-structured data • Velocity: the speed at which data is created and/or consumed • Veracity: quality and accuracy of the data • Value: data has intrinsic value, but it must be discovered

Slide 9

Slide 9 text

What is Oracle Big Data Cloud Compute Edition? • A Big Data platform that integrates Oracle's Big Data solution with open source tools • Fully elastic • Integrated with other PaaS services such as Database Cloud Service, MySQL Cloud Service, and Event Hub Cloud Service • Access, data, and network security • REST access to all the functionality

Slide 10

Slide 10 text

Big Data Cloud Service – Compute Edition (BDCS-CE)

Slide 11

Slide 11 text

BDCS-CE Notebook: Interactive Analysis • Apache Zeppelin Notebook (version 0.7) to work interactively with data

Slide 12

Slide 12 text

What is Hadoop? • An open source software platform for distributed storage and processing • Manages huge volumes of unstructured data • Parallel processing of large data sets • Highly scalable • Fault-tolerant • Two main components: HDFS, the Hadoop Distributed File System, for storing information, and MapReduce, a programming framework for processing information

Slide 13

Slide 13 text

HDFS Architecture (Simplified) • Client: requests reads or writes of data • NameNode: manages metadata and access control; knows where the data is (which DataNodes contain the blocks of each file) and keeps this information in memory • DataNodes: store and retrieve data (blocks) on client request

Slide 14

Slide 14 text

HDFS: Writing Data 1. The client divides the file into fixed-size blocks (usually 64 or 128 MB) 2. For each block, the client asks the NameNode which DataNodes it can write to, specifying block size and replication factor 3. For each block, the NameNode provides DataNode addresses, sorted by increasing distance
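The client-side block-splitting step above can be sketched in a few lines of plain Python. This is a simplified illustration, not real HDFS client code; 128 MB is the usual default block size.

```python
def split_into_blocks(data: bytes, block_size: int = 128 * 1024 * 1024):
    """Split a file's bytes into fixed-size blocks, as an HDFS client does
    before asking the NameNode where each block may be written."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# Tiny block size for illustration: a 10-byte "file" split into 4-byte blocks
blocks = split_into_blocks(b"0123456789", block_size=4)
print([len(b) for b in blocks])  # the last block may be shorter
```

Note that only the final block can be shorter than the block size; concatenating all blocks reproduces the original file.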

Slide 15

Slide 15 text

HDFS: Writing Data (continued) 4. The client sends the block data and the list of nodes to the first DataNode 5. Each DataNode forwards the data to the following DataNode in the replication pipeline 6. Each DataNode sends "Done" to the NameNode once the block data is written to disk

Slide 16

Slide 16 text

HDFS: Reading Data 1. The client asks the NameNode for a specific file 2. The NameNode sends the list of blocks of the file, with the list of DataNodes holding each block 3. The client downloads the data from the nearest DataNode (sending the block number) 4. The DataNode sends the data for the requested block

Slide 17

Slide 17 text

HDFS: High Availability • Standby NameNode (active-passive configuration) • NameNodes use shared storage • DataNodes send block reports to both NameNodes

Slide 18

Slide 18 text

HDFS: Command Examples • hadoop fs -ls • hadoop fs -put • hadoop fs -get • hadoop fs -cat • hadoop fs -rmr • hadoop fs -copyFromLocal

Slide 19

Slide 19 text

Hadoop Components: MapReduce • Retrieves data from HDFS • A MapReduce program is composed of a Map() method, which performs filtering and sorting of the inputs, and a Reduce() method, which summarizes the pairs produced by the Mappers • Code can be written in many languages (Perl, Python, Java, etc.)

Slide 20

Slide 20 text

MapReduce Example
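The classic MapReduce example is a word count. Here is a minimal plain-Python simulation of the map and the shuffle/reduce phases, just to show the (key, value) flow; no Hadoop is involved.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle + reduce: group the pairs by key and sum the values."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big cloud", "big data"]
print(reduce_phase(map_phase(lines)))  # {'big': 3, 'data': 2, 'cloud': 1}
```

In real MapReduce the mappers and reducers run in parallel on different nodes, and the framework performs the shuffle (grouping by key) between the two phases.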

Slide 21

Slide 21 text

HDFS & M/R • HDFS enabled big data – democratization • Could you operate Hadoop on-premises? • Hadoop on VirtualBox (Oracle Big Data Lite) is an easy start – localhost only

Slide 22

Slide 22 text

HDFS & M/R • S3 is very popular today – blob (key, value) cloud storage service – not a file system, but used instead of HDFS • M/R is powerful, but low level • Google: "We do not use M/R internally at Google anymore"

Slide 23

Slide 23 text

#2 Hive

Slide 24

Slide 24 text

What is Hive? • Open source data warehouse software built on top of Apache Hadoop • Analyze and query data stored in HDFS • Structures the data into tables • Tools for simple ETL • SQL-like queries (HiveQL) • Procedural language with HPL-SQL • Metadata storage in an RDBMS

Slide 25

Slide 25 text

Hive: Code Example

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
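A concrete query following this syntax might look as follows; the `weblogs` table and its `host` and `status` columns are hypothetical, purely for illustration.

```sql
-- Hosts with more than 10 "not found" responses (hypothetical weblogs table)
SELECT host, COUNT(*) AS hits
FROM weblogs
WHERE status = 404
GROUP BY host
HAVING COUNT(*) > 10
SORT BY hits DESC
LIMIT 5;
```

Hive compiles such a query into one or more MapReduce (or Tez/Spark) jobs over the files backing the table.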

Slide 26

Slide 26 text

Hive • Yes, SQL again! – People already know it – Hive is the basis for integration with other tools

Slide 27

Slide 27 text

#3 Spark

Slide 28

Slide 28 text

Revisited: MapReduce I/O (Source: Hadoop Application Architectures book)

Slide 29

Slide 29 text

Spark • In-memory: orders of magnitude faster than M/R • Higher-level Scala, Java, R, or Python API • Runs standalone, in Hadoop, or on Mesos • Principle: run an operation on all data -> "Spark is the new MapReduce" • See also: Apache Storm, etc. • Uses RDDs, DataFrames, or Datasets https://stackoverflow.com/questions/31508083/difference-between-dataframe-in-spark-2-0-i-e-datasetrow-and-rdd-in-spark https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf

Slide 30

Slide 30 text

RDDs: Resilient Distributed Datasets • Collection of data (DataFrames: grouped into named columns) • Supports text, JSON, Apache Parquet, and sequence files • Where do they come from? – Read from HDFS, local FS, S3, HBase – Parallelize an existing collection – Transform another RDD -> RDDs are immutable

Slide 31

Slide 31 text

Lazy Evaluation • Transformations – map(), flatMap(), reduceByKey(), groupByKey() – nothing is executed • Actions – collect(), count(), first(), takeOrdered(), saveAsTextFile(), … – trigger execution http://spark.apache.org/docs/2.1.1/programming-guide.html
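Spark's lazy evaluation can be mimicked with Python generators: transformations only build a pipeline, and nothing runs until an action consumes it. This is a conceptual sketch in plain Python, not PySpark code.

```python
executed = []

def numbers():
    for n in [1, 2, 3]:
        executed.append(n)   # record that work actually happened
        yield n

# "Transformation": builds a lazy pipeline; nothing has been executed yet
doubled = (n * 2 for n in numbers())
assert executed == []        # no work done so far

# "Action": consuming the pipeline triggers the execution
result = list(doubled)
print(result)                # [2, 4, 6]
print(executed)              # [1, 2, 3] -- the work happened only now
```

Spark works the same way at cluster scale: the chain of transformations is recorded as a lineage graph, and only an action such as collect() or count() causes jobs to be scheduled.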

Slide 32

Slide 32 text

Spark Demo

Slide 33

Slide 33 text

Apache Zeppelin Notebook

Slide 34

Slide 34 text

Word Count and Histogram

res = (t.flatMap(lambda line: line.split(" "))
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b))
res.takeOrdered(5, key=lambda x: -x[1])

Slide 35

Slide 35 text

Zeppelin Notebooks

Slide 36

Slide 36 text

Big Data Compute Service CE

Slide 37

Slide 37 text

#4 Kafka Fast Data

Slide 38

Slide 38 text

Kafka • Partitioned, replicated commit log • Immutable log: messages with offsets 0, 1, 2, 3, 4, … n • Producers append; Consumers A and B each read at their own offset https://www.quora.com/Kafka-writes-every-message-to-broker-disk-Still-performance-wise-it-is-better-than-some-of-the-in-memory-message-storing-message-queues-Why-is-that
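The commit log above can be modeled in a few lines of Python: an append-only list of messages addressed by offset, with consumers that each track their own read position. This is a toy model of one partition, not a Kafka client.

```python
class PartitionLog:
    """Toy model of one Kafka partition: an immutable, append-only log."""
    def __init__(self):
        self._messages = []

    def produce(self, message):
        """Append a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def consume(self, offset):
        """Read the message at a given offset; the log itself never changes."""
        return self._messages[offset]

log = PartitionLog()
for msg in ["order-1", "order-2", "order-3"]:
    log.produce(msg)

# Consumers A and B read independently; each only advances its own offset
offset_a, offset_b = 0, 2
print(log.consume(offset_a))  # order-1
print(log.consume(offset_b))  # order-3
```

Because messages are never modified in place, many consumers can read the same partition concurrently at different offsets, which is what makes replaying a topic from the beginning possible.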

Slide 39

Slide 39 text

Brokers 1–3 • Topic A partitions (1), (2), (3) – each broker is the leader for one partition • Replicas Repl A (1), (2), (3) – replication to followers • Producer writes to the partition leader • ZooKeeper ensemble for state / HA

Slide 40

Slide 40 text

Example for Stream / Table Duality: Apache Kafka at The New York Times – 1 topic – 1 partition – contains every article published since 1851 – multiple producers / consumers https://www.confluent.io/blog/publishing-apache-kafka-new-york-times/

Slide 41

Slide 41 text

Kafka Clients
• SDKs – high/low level Kafka API – OOTB: Java, Scala – Confluent: Python, C, C++
• REST – language agnostic – easy for mobile apps – easy to tunnel through FW etc. – lightweight
• Connect – integrate external systems – configuration only – plugin .jar file – Confluent: HDFS sink, JDBC source, S3 sink, Elasticsearch sink – JDBC: change data capture (CDC) – real-time data ingestion
• Streams – data in motion – stream / table duality – microservices – KSQL: SQL streaming engine for streaming ETL, anomaly detection, monitoring – .jar file runs anywhere

Slide 42

Slide 42 text

Oracle Event Hub Cloud Service • PaaS: managed Kafka 0.10.2 • Two deployment modes – Basic (broker and ZK on 1 node) – Recommended (distributed) • REST Proxy – separate server(s) running the REST Proxy

Slide 43

Slide 43 text

Event Hub

Slide 44

Slide 44 text

Event Hub Service

Slide 45

Slide 45 text

Ports • You must open ports to allow access for external clients • Kafka broker (from the OPC connect string) • ZooKeeper on port 2181

Slide 46

Slide 46 text

Scaling • horizontal (scale out) • vertical (scale up)

Slide 47

Slide 47 text

Event Hub REST Interface • Service = Topic • https://129.151.91.31:1080/restproxy/topics/a12345orderTopic

Slide 48

Slide 48 text

Interesting to Know • Event Hub topics are prefixed with the ID domain • Topics with the ID domain can be created with the Kafka CLI • Topics without the ID domain are not shown in the OPC console

Slide 49

Slide 49 text

#5 Conclusion

Slide 50

Slide 50 text

TL;DR #bigData #openSource #OPC Open source: entry point to the Oracle Big Data CS world and vice versa / Low(er) setup times / Check resource usage & limits in Big Data OPC / BDCS-CE: managed Hadoop, Hive, Spark + Event Hub: Kafka / Attend a hands-on Oracle Cloud workshop! / Next level: Oracle Big Data tools @EdelweissK @FrankMunz

Slide 51

Slide 51 text

3 Membership Tiers • Oracle ACE Director • Oracle ACE • Oracle ACE Associate – bit.ly/OracleACEProgram – 500+ Technical Experts Helping Peers Globally – Connect: @oracleace, Facebook.com/oracleaces, oracle-ace_ww@oracle.com – Nominate yourself or someone you know: acenomination.oracle.com

Slide 52

Slide 52 text

www.linkedin.com/in/frankmunz/ www.munzandmore.com/blog facebook.com/cloudcomputingbook facebook.com/weblogicbook @frankmunz youtube.com/weblogicbook -> more than 50 webcasts Don't be shy :)