Slide 1

Slide 1 text

Mark Rittman, Oracle ACE Director
THE FUTURE OF ANALYTICS, DATA INTEGRATION AND BI ON BIG DATA PLATFORMS
HADOOP USER GROUP IRELAND (HUG IRL)
Dublin, September 2016

Slide 2

Slide 2 text

About the Speaker

•Mark Rittman, Co-Founder of Rittman Mead
•Oracle ACE Director, specialising in Oracle BI & DW
•14 years' experience with Oracle technology
•Regular columnist for Oracle Magazine
•Author of two Oracle Press Oracle BI books
•Oracle Business Intelligence Developers Guide
•Oracle Exalytics Revealed
•Writer for the Rittman Mead blog: http://www.rittmanmead.com/blog
•Email: [email protected]
•Twitter: @markrittman

Slide 3

Slide 3 text

OR AS I SAY AT PARTIES… 3

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

BUT SERIOUSLY… 5

Slide 6

Slide 6 text

20 Years in Old-school BI & Data Warehousing

•Started back in 1996 on a bank Oracle DW project
•Our tools were Oracle 7.3.4, SQL*Plus, PL/SQL and shell scripts
•Went on to use Oracle Developer/2000 and Designer/2000
•Our initial users queried the DW using SQL*Plus
•And later on, we rolled out Discoverer/2000 to everyone else
•And life was fun…

Slide 7

Slide 7 text

Data Warehouses and Enterprise BI Tools

•Data warehouses provided a unified view of the business
•Single place to store key data and metrics
•Joined-up view of the business
•Aggregates and conformed dimensions
•ETL routines to load, cleanse and conform data
•BI tools for simple, guided access to information
•Tabular data access using SQL-generating tools
•Drill paths, hierarchies, facts, attributes
•Fast access to pre-computed aggregates
•Packaged BI for fast-start ERP analytics

[Diagram: source systems (Core ERP Platform, Retail Banking, Call Center, E-Commerce, CRM, running on Oracle, MongoDB, Sybase, IBM DB/2 and MS SQL Server) feeding an ODS / Foundation Layer, a Data Warehouse Access & Performance Layer, and Business Intelligence Tools]

Slide 8

Slide 8 text

Reporting Back Then…

•Examples were Crystal Reports, Oracle Reports, Cognos Impromptu, Business Objects
•Reports written against a carefully-curated BI dataset, or connecting directly to the ERP/CRM
•Adding data from external sources, or other RDBMSs, was difficult and involved IT resources
•Report-writing was a skilled job
•High ongoing cost for maintenance and changes
•Little scope for analysis, predictive modeling
•Often user frustration with the pace of delivery

Slide 9

Slide 9 text

Then Came Enterprise BI Tools

•For example Oracle OBIEE, SAP Business Objects, IBM Cognos
•Full-featured, IT-orientated enterprise BI platforms
•Metadata layers, integrated security, web delivery
•Pre-built ERP metadata layers, dashboards + reports
•Federated queries across multiple sources
•Single version of the truth across the enterprise
•Mobile, web dashboards, alerts, published reports
•Integration with SOA and web services

Slide 10

Slide 10 text

THEN CAME … BIG DATA 11

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

AND HADOOP 13

Slide 13

Slide 13 text

BIG, FAST AND FAULT-TOLERANT 14

Slide 14

Slide 14 text

But Why Hadoop? Reason #1 - Flexible Storage

•Data from new-world applications is not like historic data
•Typically comes in non-tabular form
•JSON, log files, key/value pairs
•Users often want it speculatively
•Haven't thought it through
•Schema can evolve
•Or maybe there isn't one
•But the end-users want it now
•Not when you're ready

[Diagram: Big Data Management Platform with Discovery & Development Labs - a safe and secure discovery and development environment holding data sets, samples, models and programs - feeding a Single Customer View and Enriched Customer Profile via correlating, modeling, machine learning, scoring and schema-on-read analysis]
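
To make schema-on-read concrete, here is a minimal PySpark sketch (my illustration, not from the deck; the HDFS path and field names are hypothetical): raw JSON is landed as-is, and the schema is inferred when the data is read, not when it is stored.

# Minimal schema-on-read sketch in PySpark (hypothetical paths and fields)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# No table definition up front - Spark infers the schema from the JSON itself
events = spark.read.json("hdfs:///data/raw/web_events/")
events.printSchema()  # discover what the data actually looks like

# Query immediately, even though nothing was modelled in advance
events.filter(events.event_type == "page_view") \
      .groupBy("page_url").count() \
      .show()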

Slide 15

Slide 15 text

But Why Hadoop? Reason #2 - Massive Scalability

•Enterprise High-End RDBMSs such as Oracle can scale
•Clustering for single-instance DBs can scale to >PB
•Exadata scales further by offloading queries to storage
•Sharded databases (e.g. Netezza) can scale further
•But cost (and complexity) become limiting factors
•Costs of $1m/node are not uncommon

Slide 16

Slide 16 text

Reason #3 - Processing Frameworks

•Hadoop started by being synonymous with MapReduce, and Java coding
•But YARN (Yet Another Resource Negotiator) broke this dependency
•Modern Hadoop platforms provide overall cluster resource management, but support multiple processing frameworks (see the PySpark sketch below)
•General-purpose (e.g. MapReduce)
•Graph processing
•Machine Learning
•Real-Time Processing (Spark Streaming, Storm)
•Even the Hadoop resource management framework can be swapped out
•Apache Mesos

[Diagram: Big Data Platform - all running natively under Hadoop; YARN (Cluster Resource Management) over HDFS (cluster filesystem holding raw data), supporting Batch (MapReduce), Interactive (Impala, Drill, Tez, Presto), Streaming + In-Memory (Spark, Storm) and Graph + Search (Solr, Giraph) workloads; Enriched Customer Profile, Modeling, Scoring]
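
As a hedged illustration of "one resource manager, many frameworks" (my sketch, not from the deck; the app name and path are hypothetical): the same PySpark job runs under YARN simply by selecting yarn as the master, and YARN arbitrates cluster resources between it and any MapReduce, Tez or other workloads running alongside.

# Hypothetical sketch: a Spark job scheduled by YARN rather than a dedicated cluster
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("yarn-managed-wordcount")  # appears in the YARN ResourceManager UI
         .master("yarn")                     # YARN places the executors on the cluster
         .getOrCreate())

# Classic wordcount over raw files in HDFS; YARN decides where the work runs
counts = (spark.sparkContext.textFile("hdfs:///data/raw/logs/")
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))

print(counts.take(10))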

Slide 17

Slide 17 text

Meet the New Data Warehouse : The "Data Lake"

•Data now landed in Hadoop clusters, NoSQL databases and cloud storage
•Flexible data storage platform with cheap storage, flexible schema support + compute
•Data lands in the data lake or reservoir in raw form, then is minimally processed
•Data then accessed directly by "data scientists", or processed further into the DW

[Diagram: data streams plus file-, stream- and ETL-based integration feed a Hadoop-platform Data Reservoir holding Raw Customer Data (stored in the original format, usually files, such as SS7, ASN.1, JSON etc.) and Mapped Customer Data (data sets produced by mapping and transforming raw data); a Data Factory connects the reservoir to Discovery & Development Labs (a safe and secure discovery and development environment with data sets, samples, models and programs), the Data Warehouse and Business Intelligence Tools; sources include Marketing / Sales Applications, Operational Data and Transactions, Customer Master Data and unstructured Voice + Chat Transcripts; outputs include Models, Machine Learning and Segments]
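
A hedged PySpark sketch of the "land raw, then minimally process" flow described above (my illustration; paths, columns and the cleansing rules are hypothetical): raw JSON stays untouched in the reservoir's raw zone, and a mapping job writes a curated, columnar copy for downstream access.

# Hypothetical raw-to-mapped data reservoir flow
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("reservoir-mapping").getOrCreate()

# 1. Raw zone: data kept in its original JSON form, exactly as landed
raw = spark.read.json("hdfs:///reservoir/raw/customers/")

# 2. Minimal mapping: standardise fields, drop obvious junk, stamp a load date
mapped = (raw
          .withColumn("email", F.lower(F.col("email")))
          .filter(F.col("customer_id").isNotNull())
          .withColumn("load_date", F.current_date()))

# 3. Mapped zone: written as Parquet for efficient downstream analytics
mapped.write.mode("overwrite").parquet("hdfs:///reservoir/mapped/customers/")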

Slide 18

Slide 18 text

NEW STARTUPS ENABLING A HYBRID 
 “OLD WORLD/NEW WORLD” APPROACH 20

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

AND PERFECT FOR ANALYTICS 22

Slide 21

Slide 21 text

Hadoop : The Default Platform Today for Analytics

•Enterprise High-End RDBMSs such as Oracle can scale into the petabytes, using clustering
•Sharded databases (e.g. Netezza) can scale further, but with complexity / single-workload trade-offs
•Hadoop was designed from the outset for massive horizontal scalability - using cheap hardware
•Anticipates hardware failure and makes multiple copies of data as protection
•The more nodes you add, the more stable it becomes
•And at a fraction of the cost of traditional RDBMS platforms

Slide 22

Slide 22 text

BI INNOVATION IS HAPPENING
 AROUND HADOOP 24

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

“WE’RE WINNING!” 27

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

BUT… 29

Slide 27

Slide 27 text

isn’t Hadoop Slow?

Slide 28

Slide 28 text

too slow
 for ad-hoc querying?

Slide 29

Slide 29 text

WELCOME TO 2016 32

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

(HADOOP 2.0) 35

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

HADOOP IS NOW FAST 37

Slide 35

Slide 35 text

Hadoop 2.0 Processing Frameworks + Tools 38

Slide 36

Slide 36 text

Cloudera Impala - Fast, MPP-style Access to Hadoop Data

•Cloudera's answer to Hive query response time issues
•MPP SQL query engine running on Hadoop; bypasses MapReduce for direct data access
•Mostly in-memory, but spills to disk if required
•Uses the Hive metastore to access Hive table metadata
•Similar SQL dialect to Hive - though not as rich, and no support for Hive SerDes, storage handlers etc
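
For a sense of what this looks like from a client, here is a hedged sketch using the impyla Python package (my example, not from the deck; the daemon host and table are hypothetical). Impala speaks a HiveServer2-compatible protocol, so existing Hive-metastore tables are queryable directly:

# Hypothetical example: querying Impala from Python via impyla (pip install impyla)
from impala.dbapi import connect

conn = connect(host='impala-daemon.example.com', port=21050)
cursor = conn.cursor()

# Table metadata comes from the shared Hive metastore
cursor.execute("SELECT page_url, COUNT(*) AS hits "
               "FROM web_events GROUP BY page_url LIMIT 10")
for row in cursor.fetchall():
    print(row)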

Slide 37

Slide 37 text

Parquet - Column-Orientated Storage for Analytics

•Beginners usually store data in HDFS using text file formats (CSV), but these have limitations
•Apache Avro often used for general-purpose processing
•Splittability, schema evolution, in-built metadata, support for block compression
•Parquet now commonly used with Impala due to its column-orientated storage
•Mirrors work in the RDBMS world around column-store
•Only return (project) the columns you require across a wide table
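
A small PySpark sketch of the column-projection benefit (my illustration; paths and column names are hypothetical): because Parquet stores each column separately, a query that selects two columns from a wide table only reads those two columns from disk.

# Hypothetical sketch: convert a wide extract to Parquet, then project columns
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-projection").getOrCreate()

# One-off conversion of a wide CSV extract into Parquet
wide = spark.read.csv("hdfs:///data/raw/customers.csv", header=True, inferSchema=True)
wide.write.mode("overwrite").parquet("hdfs:///data/parquet/customers/")

# Analytic queries now touch only the columns they project;
# Parquet's columnar layout means the rest are never read from disk
spark.read.parquet("hdfs:///data/parquet/customers/") \
     .select("customer_id", "lifetime_value") \
     .show(10)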

Slide 38

Slide 38 text

Cloudera Kudu - Best of HBase and Column-Store

•But Parquet (and HDFS) have significant limitations for real-time analytics applications
•Append-only orientation and column-store focus make streaming ingestion harder
•Cloudera Kudu aims to combine the best of HDFS + HBase
•Real-time, analytics-optimised
•Supports updates to data
•Fast ingestion of data
•Accessed using SQL-style tables and a get/put/update/delete API

Slide 39

Slide 39 text

Example Impala DDL + DML Commands with Kudu

•Kudu storage used with Impala - create tables using the Kudu storage handler
•Can now UPDATE, DELETE and INSERT into Hadoop tables, not just SELECT and LOAD DATA

-- Create an Impala table backed by Kudu storage (Impala/Kudu beta syntax, ca. 2016)
CREATE TABLE `my_first_table` (
  `id` BIGINT,
  `name` STRING
)
TBLPROPERTIES(
  'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name' = 'my_first_table',
  'kudu.master_addresses' = 'kudu-master.example.com:7051',
  'kudu.key_columns' = 'id'
);

-- Row-level DML, not possible on plain HDFS-backed tables
INSERT INTO my_first_table VALUES (99, "sarah");
INSERT IGNORE INTO my_first_table VALUES (99, "sarah");  -- skip duplicate-key errors
UPDATE my_first_table SET name="bob" WHERE id = 3;
DELETE FROM my_first_table WHERE id < 3;
DELETE c FROM my_second_table c, stock_symbols s WHERE c.name = s.symbol;

Slide 40

Slide 40 text

AND IT’S NOW IN-MEMORY 43

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

Accompanied by Innovations in Underlying Platform

•Cluster resource management to support multi-tenant distributed services
•In-memory distributed storage, to accompany in-memory distributed processing

Slide 43

Slide 43 text

DATAFLOW PIPELINES 
 ARE THE NEW ETL 46

Slide 44

Slide 44 text

New ways to do BI

Slide 45

Slide 45 text

New ways to do BI

Slide 46

Slide 46 text

HADOOP IS THE NEW ETL ENGINE 49

Slide 47

Slide 47 text

Proprietary ETL is Dead. Apache-based ETL is What's Next

[Timeline diagram from an Oracle Open World 2015 slide: proprietary ETL engines die circa 2015, folded into big data. 1990's "Eon of Scripts and PL-SQL" (scripted SQL, stored procs); "Period of Proprietary Batch ETL Engines" from 1994 (Informatica, Ascential/IBM, Ab Initio, Acta/SAP, SyncSort, Oracle Warehouse Builder); "Era of SQL E-LT/Pushdown" (Oracle Data Integrator: ODI for Columnar, ODI for In-Mem, ODI for Exadata); "Big Data ETL in Batch" (ODI for Hive, ODI for Pig & Oozie, ODI for Spark); "Streaming ETL" (ODI for Spark Streaming)]

Slide 48

Slide 48 text

MACHINE LEARNING & SEARCH FOR 
 “AUTOMAGIC” SCHEMA DISCOVERY 51

Slide 49

Slide 49 text

New ways to do BI

Slide 50

Slide 50 text

Google GOODS - Catalog + Search At Google-Scale

•By definition there's lots of data in a big data system… so how do you find the data you want?
•Google's own internal solution - GOODS ("Google Dataset Search")
•Uses a crawler to discover new datasets
•ML classification routines to infer each dataset's domain
•Data provenance and lineage
•Indexes and catalogs 26bn datasets
•Other users and vendors also have solutions
•Oracle Big Data Discovery
•Datameer
•Platfora
•Cloudera Navigator
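
As a toy illustration of the crawl-and-classify idea only (entirely my own sketch, nothing like GOODS' actual implementation, with hypothetical paths and keyword rules standing in for real ML classification):

# Toy dataset-catalog sketch: crawl a filesystem, profile files, infer a crude domain
import os

DOMAIN_KEYWORDS = {          # crude stand-in for ML domain classification
    "customer": "crm",
    "order": "sales",
    "click": "web-analytics",
}

def infer_domain(filename):
    name = filename.lower()
    for keyword, domain in DOMAIN_KEYWORDS.items():
        if keyword in name:
            return domain
    return "unknown"

def crawl(root):
    catalog = []
    for dirpath, _dirs, files in os.walk(root):       # the "crawler"
        for f in files:
            path = os.path.join(dirpath, f)
            catalog.append({
                "path": path,
                "size_bytes": os.path.getsize(path),  # basic profiling metadata
                "domain": infer_domain(f),            # inferred classification
            })
    return catalog

for entry in crawl("/data/lake"):
    print(entry)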

Slide 51

Slide 51 text

A NEW TAKE ON BI 54

Slide 52

Slide 52 text

Web-Based Data Analysis Notebooks

•Came out of the data science movement, as a way to "show workings"
•A set of reproducible steps that tell a story about the data
•As well as being a better command-line environment for data analysis
•One example is Jupyter, an evolution of the IPython notebook
•Supports PySpark, Pandas etc
•See also Apache Zeppelin
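
A hedged sketch of the kind of reproducible notebook cell this enables (my example; the dataset path and columns are hypothetical): PySpark does the heavy aggregation cluster-side, then hands a small result to Pandas for inspection, with the code itself documenting each step of the story.

# Hypothetical notebook cell: PySpark for the heavy lifting, Pandas for inspection
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("notebook-analysis").getOrCreate()

# Step 1: aggregate the full event history cluster-side
daily = (spark.read.parquet("hdfs:///data/parquet/web_events/")
         .groupBy("event_date")
         .agg(F.countDistinct("user_id").alias("unique_users")))

# Step 2: pull the small result down as a Pandas DataFrame to tell the story
pdf = daily.orderBy("event_date").toPandas()
print(pdf.describe())  # in a notebook this renders inline beneath the cell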

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

AND EMERGING OPEN-SOURCE
 BI TOOLS AND PLATFORMS 57

Slide 55

Slide 55 text

And Emerging Open-Source
 BI Tools and Platforms

Slide 56

Slide 56 text

No content

Slide 57

Slide 57 text

And Emerging Open-Source
 BI Tools and Platforms

Slide 58

Slide 58 text

WELCOME TO THE FUTURE 62

Slide 59

Slide 59 text

Mark Rittman, Oracle ACE Director
THE FUTURE OF ANALYTICS, DATA INTEGRATION AND BI ON BIG DATA PLATFORMS
HADOOP USER GROUP IRELAND (HUG IRL)
Dublin, September 2016