Computer Engineer, BI and Data Integration Specialist
→ Over 20 years of consulting and project management experience in Oracle technology
→ Co-founder and Vice President of the Uruguayan Oracle User Group (UYOUG)
→ Vice President of LAOUC (Latin America Oracle User Community)
→ BI Manager at IT Convergence
→ Writer and frequent speaker at tech conferences
→ Oracle ACE Director
→ Oracle Big Data Implementation Specialist
Big Data?
✓ Volume: high amount of data
✓ Variety: different data types and formats; unstructured/semi-structured data
✓ Velocity: speed at which data is created and/or consumed
✓ Veracity: quality and accuracy of the data
✓ Value: data has intrinsic value, but that value must be discovered
Big Data Cloud Compute Edition?
→ Big Data platform that integrates the Oracle Big Data solution with open source tools
→ Fully elastic
→ Integrated with other PaaS services such as Database Cloud Service, MySQL Cloud Service, and Event Hub Cloud Service
→ Access, data, and network security
→ REST access to all the functionality
→ An open source software platform for distributed storage and processing
→ Manages huge volumes of unstructured data
→ Parallel processing of large data sets
→ Highly scalable
→ Fault-tolerant
→ Two main components:
→ HDFS: Hadoop Distributed File System, for storing information
→ MapReduce: a programming framework that processes information
Client / NameNode / DataNodes
→ NameNode: manages metadata and access control. Knows where the data is (which DataNodes contain the blocks of each file) and keeps this information in memory.
→ DataNodes: store and retrieve data blocks on client request, serving read and write requests. (See the shell sketch below.)
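This division of labor is visible from the standard HDFS shell: listing is answered by the NameNode's metadata alone, while put/cat stream the actual bytes to and from the DataNodes. Paths and file names below are illustrative.

# Copy a local file into HDFS: the client asks the NameNode where to
# write, then streams the blocks directly to DataNodes.
hdfs dfs -put access.log /user/demo/access.log

# Listing is a metadata-only operation answered by the NameNode.
hdfs dfs -ls /user/demo

# Reading streams the blocks back from the DataNodes that hold them.
hdfs dfs -cat /user/demo/access.log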
Writing a file (Client / NameNode / DataNodes):
1. The client divides the file into fixed-size blocks (usually 64 or 128 MB).
2. For each block, the client asks the NameNode which DataNodes it can write to, specifying the block size and replication factor.
3. For each block, the NameNode provides the DataNode addresses, sorted by increasing distance from the client.
4. The client sends the block data and the list of remaining nodes to the first DataNode.
5. Each DataNode forwards the data to the next DataNode in the list: the replication pipeline.
6. Each DataNode reports back to the NameNode once the block data is written to its disk.
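Both write parameters can be set per command with generic -D options; the values below (128 MB blocks, replication factor 3) and the paths are illustrative.

# Write a file with an explicit block size and replication factor.
hdfs dfs -D dfs.blocksize=134217728 -D dfs.replication=3 \
    -put big_dataset.csv /user/demo/big_dataset.csv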
Reading a file (Client / NameNode / DataNodes):
1. The client asks the NameNode for a specific file.
2. The NameNode sends the list of blocks of the file, with the list of DataNodes holding each block.
3. The client downloads each block from the nearest DataNode (sending the block number).
4. The DataNode sends the data for the required block.
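The block-to-DataNode mapping that the NameNode hands out can be inspected with hdfs fsck; the path is a placeholder.

# Show which blocks make up the file and which DataNodes hold each replica.
hdfs fsck /user/demo/big_dataset.csv -files -blocks -locations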
NameNode high availability:
✓ Standby NameNode (active-standby configuration) takes over if the active NameNode fails
✓ The NameNodes use shared storage for the edit log
✓ DataNodes send block reports to both NameNodes
(Diagram: Active NameNode and Passive NameNode connected to Shared Storage)
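A minimal hdfs-site.xml sketch of such a pair, assuming a nameservice named mycluster, NameNodes nn1/nn2, and a quorum of JournalNodes as the shared storage (all host and service names are illustrative):

<!-- Logical name for the NameNode pair and its two members. -->
<property><name>dfs.nameservices</name><value>mycluster</value></property>
<property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>host1:8020</value></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>host2:8020</value></property>
<!-- Shared storage for the edit log, here a JournalNode quorum. -->
<property><name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value></property>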
→ Retrieves data from HDFS
→ A MapReduce program is composed of:
→ a Map() method, which performs filtering and sorting of the <key, value> inputs
→ a Reduce() method, which summarizes the <key, value> pairs provided by the mappers
→ Code can be written in many languages (Perl, Python, Java, etc.); see the word-count sketch below
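A minimal word-count sketch in Python, runnable with Hadoop Streaming (file and HDFS paths are illustrative). The mapper emits <word, 1> pairs, the framework sorts them by key, and the reducer sums the counts for each word.

#!/usr/bin/env python
# mapper.py: emit one "<word>\t1" pair per word (the Map phase).
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py: input arrives sorted by key, so all counts for a word
# are adjacent and can be summed in a single pass (the Reduce phase).
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, int(value)
if current_word is not None:
    print("%s\t%d" % (current_word, count))

# Run both over HDFS data (the streaming jar location varies by install):
# hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#     -input /user/demo/books -output /user/demo/wordcount \
#     -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py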
Blob (key, value) storage cloud service: not a file system, but often used instead of HDFS
• MapReduce is powerful, but low level
• Google: "We do not use MapReduce internally at Google anymore"
→ Open source data warehouse software built on top of Apache Hadoop
→ Analyzes and queries data stored in HDFS
→ Structures the data into tables
→ Tools for simple ETL
→ SQL-like queries (HiveQL)
→ Procedural language support with HPL/SQL
→ Metadata stored in an RDBMS
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
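A concrete HiveQL sketch against a hypothetical web-log table (table name, columns, and location are illustrative):

-- External table over delimited files already sitting in HDFS.
CREATE EXTERNAL TABLE logs (ip STRING, url STRING, status INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/demo/logs';

-- Top 10 URLs by number of requests.
SELECT url, COUNT(*) AS hits
FROM logs
GROUP BY url
ORDER BY hits DESC
LIMIT 10;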
Higher-level Scala, Java, R, or Python API
• Runs standalone, on Hadoop, or on Mesos
• Principle: run an operation on all the data -> "Spark is the new MapReduce"
• See also: Apache Storm, etc.
• Uses RDDs, DataFrames, or Datasets
https://stackoverflow.com/questions/31508083/difference-between-dataframe-in-spark-2-0-i-e-datasetrow-and-rdd-in-spark
https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
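To make the "new MapReduce" comparison concrete, here is the same word count as above expressed as chained RDD operations in PySpark (paths are placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

counts = (sc.textFile("hdfs:///user/demo/books")     # RDD of input lines
            .flatMap(lambda line: line.split())      # Map: split into words
            .map(lambda word: (word, 1))             # emit <word, 1> pairs
            .reduceByKey(lambda a, b: a + b))        # Reduce: sum per word
counts.saveAsTextFile("hdfs:///user/demo/wordcount")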
DataFrame: a distributed collection of data grouped into named columns. Supports text, JSON, Apache Parquet, and sequence file formats. An RDD can be created by reading data in HDFS, the local file system, S3, or HBase; by parallelizing an existing collection; or by transforming another RDD -> RDDs are immutable, so a transformation always produces a new RDD. (See the DataFrame sketch below.)
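A small PySpark DataFrame sketch, assuming a JSON file of web events with a status column (all names and paths are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("EventsDemo").getOrCreate()

# Column names are inferred from the JSON attributes.
events = spark.read.json("hdfs:///user/demo/events.json")

# Transformations return new, immutable DataFrames.
errors = events.filter(events.status >= 500).groupBy("status").count()
errors.show()

# Persist in a columnar format (Parquet) for later queries.
errors.write.parquet("hdfs:///user/demo/error_counts")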
Kafka topic: an immutable log of messages, each identified by its offset (1, 2, 3, 4, ..., n). A Producer appends to the end of the log; Consumer A and Consumer B each read at their own offset.
https://www.quora.com/Kafka-writes-every-message-to-broker-disk-Still-performance-wise-it-is-better-than-some-of-the-in-memory-message-storing-message-queues-Why-is-that
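A minimal producer/consumer sketch using the third-party kafka-python package (broker address and topic name are assumptions):

from kafka import KafkaProducer, KafkaConsumer

# Producer: append messages to the end of the topic's log.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"first message")
producer.flush()

# Consumer: read from the earliest offset; each message carries the
# offset it occupies in the log.
consumer = KafkaConsumer("events",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.offset, message.value)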
/ Data CS world and vice versa
/ Low(er) setup times
/ Check for resource usage & limits in Big Data OPC
/ BDCS-CE: managed Hadoop, Hive, Spark + Event Hub: Kafka
/ Attend a hands-on Oracle Cloud workshop!
/ Next level: Oracle Big Data tools
@EdelweissK @FrankMunz