Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)
Set of product roadmap + capabilities slides from Oracle Data Integration Product Management, plus thoughts on data integration in big data implementations by Mark Rittman (Independent Analyst)
Safe Harbor Statement: The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
CON6624: Oracle Data Integration Platform, A Cornerstone for Big Data
Christophe Dupupet (@XofDup), Director, A-Team
Mark Rittman (@markrittman), Independent Analyst
Julien Testut (@JulienTestut), Senior Principal Product Manager
September 2016
• Data Movement: data anywhere it’s needed
• Data Transformation: data accessible in any format
• Data Governance: data that can be trusted
• Streaming Data: data in motion or at rest
Pushdown / E-LT data integration tool:
• 1st to certify replication with streaming Big Data
• 1st to certify an E-LT tool with Apache Spark/Python
• 1st to power data preparation with ML + NLP + graph data
• 1st to offer a self-service & hybrid cloud solution
Hybrid open-source architecture:
• Open source at the core of the speed and batch processing engines
• Enterprise vendor tools for connecting to existing IT systems
• Cloud platforms for the data fabric
(Diagram: data streams from social, logs and enterprise data arrive via pub/sub, REST APIs and bulk data into a speed layer (stream processing) and a batch layer (batch processing over raw data, highly available databases and NoSQL), producing prepared data for a serving layer used by business apps and analytics.)
Reference Architecture: Examples
Comprehensive architecture covers key areas – #1. Data Ingestion, #2. Data Preparation & Transformation, #3. Streaming Big Data, #4. Parallel Connectivity, and #5. Data Governance – and Oracle Data Integration has it covered.
(Diagram: the same serving/batch/speed-layer architecture with Oracle products mapped onto it: GoldenGate for ingestion; Data Preparation, Oracle Data Integrator, Dataflow ML, Stream Analytics and connectors for preparation and streaming; Active Data Guard for highly available databases; Data Quality, Metadata Management & Business Glossary for governance.)
Oracle GoldenGate provides low-impact capture, routing, transformation, and delivery of database transactions across homogeneous and heterogeneous environments in real time, with no distance limitations. It turns transactions from most databases into data-event and transaction streams for cloud databases and Big Data targets.
• Supports databases, Big Data and NoSQL
• The most popular enterprise integration tool in history
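For Big Data targets, GoldenGate can deliver the captured transactions as messages that downstream consumers pick up. As a hedged illustration only, the Python sketch below reads JSON change records from a Kafka topic with kafka-python; the broker address, topic name and record field names (op_type, before, after) are assumptions for this example, not a statement of the exact format GoldenGate emits.

```python
# Hypothetical sketch: consume change records that GoldenGate has delivered
# to a Kafka topic as JSON. Broker, topic and field names are assumptions.
import json
from kafka import KafkaConsumer   # pip install kafka-python

consumer = KafkaConsumer(
    "ogg.sales.orders",                       # assumed topic name
    bootstrap_servers="broker1:9092",         # assumed broker
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for record in consumer:
    change = record.value
    # route inserts/updates/deletes captured from the source database
    if change.get("op_type") in ("I", "U"):
        print("upsert into reservoir:", change.get("after", {}))
    elif change.get("op_type") == "D":
        print("delete from reservoir:", change.get("before", {}))
```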
Oracle Data Preparation is a self-service tool that makes it simple to transform, prepare, enrich and standardize business data. It helps IT accelerate solutions for the business by giving control of data formatting directly to data analysts, feeding reporting, apps, files and ETL.
• Easy-to-use, browser-based interface
• Better automation and less grunt work for humans
• Graph database of real-world facts used for enrichment
Data wrangling wastes time and money: months of effort are spent on each new dataset, with programmers writing scripts or complex ETL. “Big Data’s dirty little secret is that 90% of time spent on a project is devoted to preparing data… After all the preparation work, there isn’t enough time left to do sophisticated analytics on it…” (Thomas H. Davenport)
(Diagram: structured and unstructured data, such as internet logs, takes weeks or months to flow through enterprise ETL and data integration into enterprise reporting and discovery & visualization, delaying the business value opportunity; BDP for Data Preparation closes that gap.)
Oracle Data Integrator: bulk data performance, non-invasive footprint, future-proof IT skills. Oracle Data Integrator provides high-performance bulk data movement, massively parallel data transformation using database or big data technologies, and block-level data loading that leverages native data utilities, moving bulk data between most apps, databases, cloud databases and Big Data.
• 1000s of customers, more than other ETL tools
• Flexible ELT workloads run anywhere: databases, Big Data, Cloud
• Up to 2x faster batch processes and 3x more efficient tooling
Business value of ODI: the only tool with portable mappings
• No ETL engine is required
• Separation of logical and physical design
• Physical execution on SQL, Hive, Pig, or Spark
• Runtime execution in Oozie or via the ODI Java Agent
• Rich set of pre-built operators; user-defined functions
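Because ODI generates code for whichever engine is chosen at the physical layer rather than running its own ETL server, the mapping logic ends up as ordinary set-based operations on that engine. Below is a purely illustrative PySpark sketch of the kind of pushdown transformation such a mapping might produce when Spark is the physical target; it is not actual ODI-generated code, and the table and column names are hypothetical.

```python
# Illustrative only: the kind of set-based E-LT transformation an ODI mapping
# might push down when Spark is the chosen physical engine. Table and column
# names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("odi_style_elt")
         .enableHiveSupport()
         .getOrCreate())

orders = spark.table("staging.orders")        # source datastore
customers = spark.table("staging.customers")  # lookup datastore

# join, filter and aggregate run inside the cluster (pushdown),
# then the result is written to the target Hive table
daily_revenue = (
    orders.filter(F.col("status") == "SHIPPED")
          .join(customers, "customer_id")
          .groupBy("order_date", "region")
          .agg(F.sum("amount").alias("revenue"))
)

daily_revenue.write.mode("overwrite").saveAsTable("dw.daily_revenue")
```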
Oracle Stream Analytics is a powerful analytic toolkit designed to work directly on data in motion: simple data correlations, complex event processing, geo-fencing, and advanced dashboards running on millions of events per second. It takes event and transaction streams from web and device sources and delivers results downstream (e.g. to Hadoop).
• Innovative dual model for Apache Spark or Coherence grid
• Simple-to-use spatial and geo-fencing features, an industry first
• Includes Oracle GoldenGate for streaming transactions
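To show only the underlying geo-fencing idea (not the Oracle Stream Analytics API), here is a small plain-Python sketch that flags events falling inside a circular fence using great-circle distance; the fence coordinates, radius and event fields are assumptions.

```python
# Conceptual sketch of geo-fencing on an event stream: flag events whose
# coordinates fall inside a circular fence. Not a product API; the fence
# centre, radius and event fields are hypothetical.
from math import radians, sin, cos, asin, sqrt

FENCE_LAT, FENCE_LON, FENCE_RADIUS_KM = 51.5074, -0.1278, 2.0  # assumed fence

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def inside_fence(event):
    return haversine_km(event["lat"], event["lon"], FENCE_LAT, FENCE_LON) <= FENCE_RADIUS_KM

events = [{"device": "taxi-42", "lat": 51.51, "lon": -0.12},
          {"device": "taxi-77", "lat": 52.20, "lon": 0.12}]
alerts = [e for e in events if inside_fence(e)]
print(alerts)   # only events inside the fence trigger an alert
```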
Oracle Dataflow ML: stream or batch data, Spark-based pipelines, ML-powered profiling. Oracle Dataflow ML is a big data solution for stream and batch processing in a single environment: Lambda-based applications that can run streaming ETL for cloud-based analytic solutions, moving bulk and streaming data between most apps, databases, cloud databases and Big Data.
• Batch and stream processing at the same time
• Machine learning guides users through data profiling
• Data movement across Oracle PaaS services
Oracle Metadata Management provides an integrated toolkit that combines business glossary, workflow, metadata harvesting and rich data-steward collaboration features. It supports databases, Big Data, ETL tools, BI tools and more, delivering BI report lineage, taxonomy lineage and data model lineage.
Leverage a Wide Range of Modern Analytic Styles: 4 Business Patterns of Big Data Customer Adoption
1. Analytic Data Sandbox (data-first analytics):
 – Stakeholder: Functional Line of Business (LoB)
 – Core value: faster access to business data, faster time to value on analytics
 – Innovation: schema-on-read empowers rapid data staging and true data discovery
2. ETL Offload (model-first analytics):
 – Stakeholder: Information Technology (IT)
 – Core value: cost avoidance on DW/marts
 – Innovation: YARN/Hadoop empowers lower-cost compute and lower-cost storage
3. Deep Data Storage:
 – Stakeholder: Risk / Compliance (LoB)
 – Core value: high-fidelity aged data
 – Innovation: SQL-on-Hadoop engines enable very low-cost, queryable data access
4. Streaming (in-motion analytics):
 – Stakeholder: Marketing (LoB) / Telematics (LoB)
 – Core value: new data services or higher click rates
 – Innovation: MPP-capable streaming platforms combined with modern in-motion analytics
Analytic Data Sandbox
Analytic styles served:
 – Discovery, exploratory and visualization style analytics: Oracle Endeca, Big Data Discovery; Tableau, Qlik, Spotfire; Datameer etc.
 – Business intelligence, reporting and dashboard style analytics: Oracle BIEE, Visual Analyzer; Cognos, SAS, MicroStrategy; Business Objects, Actuate etc.
1. Analytic Data Sandbox:
 – Stakeholder: Functional Line of Business (LoB)
 – Core value: faster access to business data, faster time to value on analytics
 – Innovation: schema-on-read empowers rapid data staging and true data discovery
 – Industries: all industries
Supports the “Data First” style of analytics:
 – No schema required
 – Staging data is simple and fast
 – Minimal data preparation required (mainly for un/semi-structured data sets)
Typical customer data types / sets:
 – Usually bringing in structured data from OLTP (primary data is their existing application data)
 – Often bringing in semi-structured data (secondary data is clickstream, logs, machine data)
 – Business value is usually in the combination of the various data sets and the improved speed of discovery
Often the data flow may not require any ETL tooling; other data flows may still require ETL as a pipeline (BI self-service).
ETL Offload
2. ETL Offload:
 – Stakeholder: Information Technology (IT)
 – Core value: cost avoidance on DW/marts
 – Innovation: YARN/Hadoop empowers lower-cost compute and lower-cost storage
 – Industries: Teradata, Netezza & Ab Initio customers
Supports the “Model First” style of analytics:
 – Schemas required (for working areas, sources and targets)
 – Staging data requires modeled staging tables
 – Data preparation required (mapping data sets); un/semi-structured data sets require pre-parsing
Typical customer data types / sets:
 – Usually bringing in structured data from OLTP apps (primary data is their existing application data)
 – Occasionally adding new data types to the EDW schema (secondary data is clickstream, logs, machine data)
 – Business value is usually tied to the “cost avoidance” around escalating DW and ETL tooling costs
The primary data flow requires data integration tools.
Deep Data Storage
3. Deep Data Storage:
 – Stakeholder: Risk / Compliance (LoB)
 – Core value: high-fidelity aged data
 – Innovation: SQL-on-Hadoop engines enable very low-cost, queryable data access
 – Industries: insurance and banking
Typically deep storage of relational data:
 – Schemas required (item detail records, not necessarily aggregates)
 – Archival can be “on the way in” as part of routine loading, and also via “periodic” pruning from the EDW and data marts
Popular with SQL on Hadoop and federation:
 – Teradata QueryGrid from Teradata/Aster
 – IBM Big SQL from Netezza/PureData
 – Oracle Big Data SQL from Exadata
 – Pivotal HAWQ from Greenplum
 – Cisco Composite Software is also selling on this use case (in addition to BI virtualization)
Serves pattern mining, compliance, and a queryable archive (sketched below).
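As a rough illustration of the “queryable archive” idea, a hedged PySpark sketch that queries aged item-detail records kept on low-cost Hadoop storage with plain SQL; the database, table, partition and column names are hypothetical.

```python
# Minimal sketch of the "queryable archive" pattern: aged item-detail records
# pruned from the EDW remain queryable with SQL on Hadoop. Table, partition
# and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("deep_storage_query")
         .enableHiveSupport()
         .getOrCreate())

aged_claims = spark.sql("""
    SELECT policy_id, claim_date, claim_amount
    FROM archive.claims_detail          -- low-cost HDFS storage, full fidelity
    WHERE year BETWEEN 2006 AND 2010    -- partitions pruned from the warehouse
      AND claim_amount > 50000
""")
aged_claims.show(20)
```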
Streaming Big Data Analytics
4. Streaming:
 – Stakeholder: Marketing (LoB) / Telematics (LoB)
 – Core value: new data services or higher click rates
 – Innovation: MPP-capable streaming platforms combined with modern in-motion analytics
 – Industries: automotive, aerospace, industrial manufacturing, some energy/oil & gas
Decisions on data before it hits disk:
 – Data volume may be too high to persist all data; only save the important data
 – Data may be highly repetitive (sensor data)
 – Correlations may need to happen with very low latency requirements, based on LoB demand
Key use case for “data monetization”:
 – Customers are standing up new data services (e.g. real-time equipment failure alerts and subscription-based monitoring)
 – “Connected car” services from most car makers
 – Disaster preparedness centers (energy/aerospace)
Other data flows may still require ETL as a pipeline.
Some Common Themes Across Use Cases
1. Nearly 100% analytic use cases:
 – Data discovery directly in Hadoop
 – ETL offloading for analytics in a SQL DB
 – Deep data storage for analytics in a SQL DB
 – Streaming analytics for data before it hits disk (Lambda architecture)
2. Nearly all the data is structured data:
 – OLTP sources: every customer starts with the trusted data sets that already drive the majority of business value (app data)
 – New sources: clickstream logs, machine data and other app exhaust all have “structure” even if they may not have schema
3. Many more sources are app/OLTP sources:
 – By quantity of sources: most customers have many (dozens or hundreds of) app/OLTP sources they are bringing in
 – By volume: the amount of machine data or log data may often exceed the OLTP data sets
4. Mainframes matter:
 – High-value apps: most of the biggest customers are bringing mainframe (DB2/z, IMS, VSAM) data to Hadoop
5. Multiple projects / programs using Hadoop:
 – Larger customers: most of the biggest customers have multiple Hadoop projects running in parallel; some are IT-led (DW/ETL offload) and others are LoB-led (discovery/telematics)
6. Customers are starting in phases:
 – By value: IT-led vs. LoB-led initiatives have different characteristics; even if the “lake / reservoir” factors in as a long-term goal, the initial phases are often quite small in scale
7. Size of Hadoop clusters varies widely:
 – Investment sizes differ (by a lot): some “start” with mega-commitments (1000s of nodes) and others start very small
8. Commodity H/W clusters dominate:
 – Commodity: for use cases designed to work across groups
 – Appliances: for use cases attached to a single project
9. Data lakes as a way to handle vendor diversity:
 – Middleware for data: bigger customers have DWs/DBs from every vendor and 6+ different BI tools; Hadoop is becoming the “canonical” data platform to sit in between
10. Open-source data platform is a strategic priority:
 – Senior stakeholder feedback: as a design-point priority for their “next gen”, it is becoming more important that open source has a central role to play in the enterprise data platform
11. Industry clusters: 1. Banking, 2. Insurance, 3. Manufacturing, 4. Media, 5. Retail
About the Presenter
• Oracle ACE Director, blogger + ODTUG member
• Regular columnist for Oracle Magazine
• Past ODTUG Executive Board Member
• Author of two books on Oracle BI
• Co-founder of Rittman Mead, now independent analyst
• 15+ years in Oracle BI, DW, ETL + now Big Data
• Based in Brighton, UK
Big Data Technology Core to Modern BI Platforms
• Nearly every engagement and customer discussion has Big Data central to the project
• Hadoop extending traditional DWs through scalability, flexibility, cost, RDBMS compatibility
• Hadoop as the ETL engine, driven by ODI Big Data KMs
• New datatypes and methods of analysis enabled by Hadoop schema-on-read
• Project innovation driven by machine learning, streaming, and the ability to store and keep *all* data
• And what is driving the interest in these projects…?
(Reference architecture diagram: a Data Reservoir and Data Factory on the Oracle Big Data Platform; data streams from operational data, transactions, customer master data and event, social + unstructured data (voice + chat transcripts) are ingested via OGG for Big Data 12c and Oracle Stream Analytics; ODI12c and Oracle Data Preparation turn Raw Customer Data (data stored in its original format, usually files, such as SS7, ASN.1, JSON) into Mapped Customer Data (data sets produced by mapping and transforming raw data) and an Enriched Customer Profile used for modeling and scoring; a safe and secure discovery and development environment holds data sets, samples, models and programs feeding Oracle Big Data Discovery and Oracle Data Visualization for marketing/sales applications, models, machine learning and segments; runs on Oracle Big Data Appliance Starter Rack + Expansion: Cloudera CDH + Oracle software, 18 high-spec Hadoop nodes with InfiniBand switches for internal Hadoop traffic optimised for network throughput, 1 Cisco management switch, and a single place for support for H/W + S/W.)
The Big Data Secret? It’s All About Data Integration
• Data from all the sources will need to be integrated to create the single customer view
• Hadoop technologies (Flume, Kafka, Storm) can be used to ingest events and log data
• Files can be loaded “as is” into the HDFS filesystem
• Oracle/DB data can be bulk-loaded using Sqoop
• GoldenGate for trickle-feeding transactional data
• But the nature of the new data sources brings challenges
 – May be semi-structured or of unknown schema
 – Joining schema-free datasets (see the sketch below)
• Need to consider quality and resolve incorrect, incomplete, and inconsistent customer data
(Diagram: heterogeneous enterprise + web sources and streaming sources with JSON payloads feed the single customer view / enriched customer profile; data from structured + schema-on-read sources needs integrating, requires preparation + obfuscation, and schema must be applied to raw and semi-structured data; ML helps answer the “who”, “what”, “how” and “why”.)
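As a minimal sketch of the integration step, the hedged PySpark example below joins a bulk-loaded customer master (for example, landed in Hive via Sqoop) with schema-on-read JSON web events to start building the single customer view; every table, path and column name here is an assumption.

```python
# Hypothetical sketch: combine a bulk-loaded customer master with
# schema-on-read JSON web events to seed a single customer view.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("customer_360")
         .enableHiveSupport()
         .getOrCreate())

customers = spark.table("staging.customer_master")       # structured, bulk-loaded
web_events = spark.read.json("/data/raw/web_events/")    # schema-on-read JSON files

activity = (
    web_events.groupBy("customer_id")
              .agg(F.count("*").alias("page_views"),
                   F.max("event_ts").alias("last_seen"))
)

single_view = customers.join(activity, "customer_id", "left")
single_view.write.mode("overwrite").saveAsTable("reservoir.customer_360")
```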
Landing, Preparing and Securing Raw Data is *Hard*
• Landing raw data is easy; then the real work needs to be done, which can be > 90% of the project
• Four main tasks to land, prepare and integrate raw data and turn it into a customer profile:
 1. Ingest it in real time into the data reservoir
 2. Apply schema to raw and semi-structured data
 3. Remove sensitive data from any input files
 4. Transform and map into your customer 360-degree profile
Oracle Big Data Preparation Cloud Service
• Data preparation and enrichment tool aimed at domain experts, not programmers
• Uses machine learning to automate data classification + profiling steps
• Automatically highlights sensitive data, and offers to redact or obfuscate it
• Dramatically reduces the time required to onboard new data sources
• Hosted in Oracle Cloud for zero install
 – File upload and download from browser
 – Automate for production data loads
(Diagram: raw data stored in its original format (usually files) such as SS7, ASN.1, JSON, plus voice + chat transcripts, is turned into mapped data: data sets produced by mapping and transforming the raw data.)
Step 2: Apply Schema to Raw and Semi-Structured Data
(Diagram: batch loads from files and databases are easy; streaming from APIs/HTTP is moderate; loading raw text from blog entries and reviews brings NLP challenges: information and entities embedded in unstructured text, no reliable patterns, invalid and missing data, sensitive data, invalid emails.)
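A minimal sketch of applying schema on read, assuming hypothetical file paths, field names and a deliberately crude email pattern: declare an explicit schema, read the raw JSON, and split valid records from those with invalid or missing emails.

```python
# Sketch of schema-on-read for raw semi-structured data: declare an explicit
# schema, read raw JSON, and quarantine records with invalid/missing emails.
# Paths, field names and the email pattern are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("apply_schema").getOrCreate()

review_schema = StructType([
    StructField("review_id", StringType()),
    StructField("customer_email", StringType()),
    StructField("review_text", StringType()),
    StructField("created_at", TimestampType()),
])

reviews = spark.read.schema(review_schema).json("/data/raw/reviews/")

# very rough validity check; records that fail go to a quarantine area
valid_email = F.col("customer_email").rlike(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
clean = reviews.filter(valid_email)
quarantine = reviews.filter(F.col("customer_email").isNull() | ~valid_email)

clean.write.mode("overwrite").parquet("/data/prepared/reviews/")
quarantine.write.mode("overwrite").json("/data/quarantine/reviews/")
```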
Step 3: Remove Sensitive Data from Any Input Files
• Automatically profile and analyse datasets
• Use machine learning to spot and obfuscate sensitive data automatically
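Services such as Big Data Preparation use machine learning to find the sensitive fields automatically; the hedged PySpark sketch below shows only the obfuscation step itself, with assumed input paths, column names and simplified regex patterns.

```python
# Minimal sketch of redacting sensitive values before data lands in the
# reservoir. Columns and regex patterns here are simple assumptions; real
# services detect the sensitive fields with ML.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("obfuscate").getOrCreate()

profiles = spark.read.json("/data/raw/chat_transcripts/")   # assumed input path

masked = (
    profiles
    # blank out anything that looks like an email address in free text
    .withColumn("transcript",
                F.regexp_replace("transcript", r"[^@\s]+@[^@\s]+\.[^@\s]+", "<EMAIL>"))
    # keep only the last 4 digits of phone-like numbers
    .withColumn("phone",
                F.regexp_replace("phone", r"\d(?=\d{4})", "*"))
)

masked.write.mode("overwrite").json("/data/prepared/chat_transcripts/")
```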
Step 4: Transform, Join + Map into Polyglot Data Stores
• Oracle Data Integration offers a wider set of products for managing Customer 360 data:
 – Oracle GoldenGate
 – Oracle Enterprise Data Quality
 – Oracle Data Integrator
 – Oracle Enterprise Metadata Management
• All Hadoop-enabled
• Works across Big Data, relational and Cloud
Future-Proof Big Data Integration Platform
• Pipelines built yesterday using MapReduce need to be rewritten in Spark today
 – Then Spark needs to be upgraded to Spark Streaming + Kafka for real time…
 – Upgrades, and replatforming onto the latest tech, can bring “fragile” initiatives to a halt
• ODI’s pluggable KM approach to big data integration makes tech upgrades simple
• Focus time + investment on new big data initiatives
 – Not on rewriting fragile hand-coded scripts
(Diagram: a big data management platform with discovery & development labs (a safe and secure environment holding data sets, samples, models and programs), a data warehouse of curated data with a historical view and business-aligned access, and an ODI desktop client; the big data platform runs natively under Hadoop: YARN for cluster resource management, HDFS as the cluster filesystem holding raw data, Hive + Pig for log processing and UDFs, Spark for in-memory data processing, and Kafka + Spark Streaming (Apache Beam?) feeding the enriched customer profile for modeling and scoring.)
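For hand-coded pipelines, one mitigation, sketched below under assumed broker, topic, path and field names, is to isolate the business logic so the same function runs over a batch DataFrame and a Kafka-fed Structured Streaming DataFrame. ODI’s pluggable knowledge modules address the same replatforming problem at the tool level, so this is only an illustration of the underlying issue, not ODI’s mechanism.

```python
# Sketch: keep transformation logic portable across batch and streaming so a
# replatforming touches plumbing, not business rules. Broker, topic, paths
# and field names are assumptions.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("portable_logic").getOrCreate()

def enrich_events(df: DataFrame) -> DataFrame:
    """Shared business logic, independent of how the data arrives."""
    return (df.withColumn("event_date", F.to_date("event_ts"))
              .filter(F.col("event_type").isin("click", "purchase")))

# batch: reprocess historical files already landed in the reservoir
raw_batch = spark.read.json("/data/raw/events/")
enrich_events(raw_batch).write.mode("overwrite").parquet("/data/prepared/events/")

# streaming: the same function over a Kafka topic with Structured Streaming
raw_stream = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")
              .option("subscribe", "web_events")
              .load())
parsed = (raw_stream
          .select(F.from_json(F.col("value").cast("string"), raw_batch.schema).alias("e"))
          .select("e.*"))
query = (enrich_events(parsed).writeStream
         .format("parquet")
         .option("path", "/data/prepared/events_stream/")
         .option("checkpointLocation", "/checkpoints/events/")
         .start())
```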
And the Next Challenge: Data Quality + Provenance
• Big data projects have had it “easy” so far in terms of data quality + data provenance
 – Innovation labs + schema-on-read prioritise discovery + insight, not accuracy and audit trails
 – But a data reservoir without any cleansing, management + data quality = a data cesspool
 – … and nobody knows where all the contamination came from, or who made it worse
Data Governance: Why I Recommend Oracle DI Tools
• From my perspective, this is what makes Oracle Data Integration my Hadoop DI platform of choice
• Most vendors can load and transform data in Hadoop (not as well, but it is a basic capability)
• Only Oracle has the tools to tackle tomorrow’s Big Data challenge: data quality + data governance
 – Oracle Enterprise Data Quality
 – Oracle Enterprise Metadata Mgmt
• Seamlessly integrated with ODI
• Brings enterprise “smarts” to less mature Big Data projects
Data Integration Solutions Program: tinyurl.com/DISOOW16
Presentations on: see the full program below
Demo stations: Oracle Enterprise Metadata Management, Oracle Enterprise Data Quality, Oracle GoldenGate, Oracle Data Integrator, Oracle Big Data Preparation Cloud Service (Middleware Demoground, Big Data Showcase and Database Demoground, Moscone South)
Hands-on labs: Oracle Enterprise Data Quality [HOL7466], Oracle GoldenGate Deep Dive [HOL7528], ODI and OGG for Big Data [HOL7434], Oracle Big Data Preparation Cloud Service [HOL7432]
Data Integration Solutions Program - tinyurl.com/DISOOW16
Monday, Sept 19
• Oracle Data Integration Solutions – Platform Overview and Roadmap [CON6619]
• Oracle Data Integration: the Foundation for Cloud Integration [CON6620]
• A Practical Path to Enterprise Data Governance with Cummins [CON6621]
• Oracle Data Integrator Product Update and Strategy [CON6622]
• Deep Dive into Oracle GoldenGate 12.3 New Features for the Oracle 12.2 Database [CON6555]
Tuesday, Sept 20
• Oracle Big Data Integration in the Cloud [CON7472]
• Oracle Data Integration Platform: a Cornerstone for Big Data [CON6624]
• Oracle Data Integrator and Oracle GoldenGate for Big Data [HOL7434]
• Oracle Enterprise Data Quality – Product Overview and Roadmap [CON6627]
• Self-Service Data Preparation for Domain Experts – No Programming Required [CON6630]
• Oracle Big Data Preparation Cloud Service: Self-Service Data Prep for Business Users [HOL7432]
• Oracle GoldenGate 12.3 Product Update and Strategy [CON6631]
• New GoldenGate 12.3 Services Architecture [CON6551]
• Meet the Experts: Oracle GoldenGate Cloud Service [MTE7119]
Wednesday, Sept 21
• Data Quality for the Cloud: Enabling Cloud Applications with Trusted Data [CON6629]
• Transforming Streaming Analytical Business Intelligence to Business Advantage [CON7352]
• Oracle Enterprise Data Quality for All Types of Data [HOL7466]
• Oracle GoldenGate for Big Data [CON6632]
• Accelerate Cloud On-Boarding using Oracle GoldenGate Cloud Service [CON6633]
• Oracle GoldenGate Deep Dive and Oracle GoldenGate Cloud Service for Cloud Onboarding [HOL7528]
Thursday, Sept 22
• Best Practices for Migrating to Oracle Data Integrator [CON6623]
• Best Practices for Oracle Data Integrator: Hear from the Experts [CON6625]
• Dataflow, Machine Learning and Streaming Big Data Preparation [CON6626]
• Data Governance with Oracle Enterprise Data Quality and Metadata Management [CON6628]
• Faster Design, Development and Deployment with Oracle GoldenGate Studio [CON6634]
• Getting Started with Oracle GoldenGate [CON7318]
• Best Practice for High Availability and Performance Tuning for Oracle GoldenGate [CON6558]
Oracle Cloud Platform Innovation Awards: Meet the Most Impressive Cloud Platform Innovators
• Meet peers who implemented cutting-edge solutions with Oracle Cloud Platform
• Learn how you can transform your business
• No registration or OpenWorld pass required to attend
• Tuesday, Sep 20, 4:00 p.m. - 6:00 p.m., YBCA Theater, 701 Mission St
Oracle PaaS Customer Appreciation Reception
• FREE appreciation reception for all Oracle PaaS customers, directly following the Innovation Awards ceremony
• No OpenWorld pass is required to attend this reception
• Tuesday, Sep 20, 6:00 p.m. - 8:30 p.m., YBCA Theater, 701 Mission St
Safe Harbor Statement: The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.