Computer Engineer, BI and Data Integration Specialist à Over 20 years of Consulting and Project Management experience in Oracle technology. à Co-founder and Vice President of Uruguayan Oracle User Group (UYOUG) à Director of Community of LAOUC à Head of BI Team CMS at ITConvergence à Writer and frequent speaker at international conferences: à Collaborate, OTN Tour LA, UKOUG Tech & Apps, OOW, Rittman Mead BI Forum à Oracle ACE Director
Data? à Volume: The high amount of data à Variety: The wide range of different data formats and schemas. Unstructured and semi-structured data à Velocity: The speed which data is created or consumed à Oracle added another V in this definition à Value: Data has intrinsic value—but it must be discovered.
Big Data Cloud Compute Edition? à Big Data Platform that integrates Oracle Big Data solution with Open Source tools à Fully Elastic à Integrated with Other Paas Services as Database Cloud Service, MySQL Cloud Service, Event Hub Cloud Service à Access, Data and Network Security à REST access to all the funcitonality
à An open source software platform for distributed storage and processing à Manage huge volumes of unstructured data à Parallel processing of large data set à Highly scalable à Fault-tolerant à Two main components: à HDFS: Hadoop Distributed File System for storing information à MapReduce: programming framework that process information
à Retrieves data from HDFS à A MapReduce program is composed by à Map() method: performs filtering and sorting of the <key, value> inputs à Reduce() method: summarize the <key,value> pairs provided by the Mappers à Code can be written in many languages (Perl, Python, Java etc)
à An open source data warehouse software on top of Apache Hadoop à Analyze and query data stored in HDFS à Structure the data into tables à Tools for simple ETL à SQL- like queries (HiveQL) à Procedural language with HPL-SQL à Metadata storage in a RDBMS
level Scala, Java or Python API • Standalone, in Hadoop, or Mesos • Principle: Run an operation on all data -> ”Spark is the new MapReduce” • See also: Apache Storm, etc • Uses RDDs, or Dataframes, or Datasets munz & more #24 https://stackoverflow.com/questions/31508083/difference-between- dataframe-in-spark-2-0-i-e-datasetrow-and-rdd-in-spark https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
of data grouped into named columns. Supports text, JSON, Apache Parquet, sequence. Read in HDFS, Local FS, S3, Hbase Parallelize existing Collection Transform other RDD -> RDDs are immutable
element of the source through a function func. flatMap(func) Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item). reduceByKey(func, [numTasks]) When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. groupByKey([numTasks]) When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Transformations
1 2 3 4 … n Immutable log: Messages with offset Producer Consumer A Consumer B https://www.quora.com/Kafka-writes-every-message-to-broker-disk-Still-performance-wise-it- is-better-than-some-of-the-in-memory-message-storing-message-queues-Why-is-that
Data world / Low(er) setup times / Check for resource usage & limits in Big Data OPC / BDCS-CE: managed Hadoop, Hive, Spark + Event hub: Kafka / Attend a hands-on workshop! / Next level: Oracle Big Data tools @EdelweissK @FrankMunz