Slide 1

Slide 1 text

ORACLE BIG DATA DISCOVERY EXTENDING INTO MACHINE LEARNING : A QUANTIFIED SELF CASE STUDY
MARK RITTMAN, ORACLE ACE DIRECTOR
ODTUG KSCOPE’16, CHICAGO

Slide 2

Slide 2 text

ABOUT THE SPEAKER
Mark Rittman, CTO, Rittman Mead
Contact: T: @MARKRITTMAN
Oracle ACE Director, blogger + ODTUG member
Regular columnist for Oracle Magazine
Past ODTUG Executive Board Member
Author of two books on Oracle BI
Co-founder & CTO of Rittman Mead
15+ years in Oracle BI, DW, ETL + now Big Data
Implementor, trainer, consultant + company founder
Hobbies include football, tech + now … cycling
KSCOPE’16, CHICAGO, JUNE 2016
ORACLE BIG DATA DISCOVERY : EXTENDING INTO MACHINE LEARNING

Slide 3

Slide 3 text

SO WHERE HAVE I BEEN FOR THE PAST 6 MONTHS?
Strangely quiet on the blog, the occasional tweet about Christian’s laptop, where have I been?
On sabbatical … looking at emerging real-time Hadoop-based BI technologies
Taking time out from day-to-day consulting and the Oracle-centric BI world
Building prototypes and making contact with startups, analysts, open-source teams
Asking myself the question “what will an analytics platform look like in 5 years’ time?”

Slide 4

Slide 4 text

AND TAKING SOME TIME OUT

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

SUMMER IN ENGLAND

Slide 9

Slide 9 text

CYCLING … WITH A GEEK TWIST
Recording routes, speed, cadence for later analysis, gamification and ride history
Something that makes cycling and all workouts more interesting today
Record routes you took using GPS in phone
Specialised bike computers for more detailed and accurate speed, cadence data
Upload into smartphone, load into services such as Strava, Cyclometer, Apple Health
Review and analyse cycling style, set goals
Compare and compete against yourself (“gamification”) or others (league tables)

Slide 10

Slide 10 text

PART OF A WIDER ECOSYSTEM OF HEALTH DEVICES
All bought by me over the past year, part of my “get fit” initiative
Jawbone UP health band for workouts, sleep tracking
Withings Smart Scale for weight
Apple Health, Apple Watch and iPhone with M7 Motion co-processor
Each of which integrates or forms its own ecosystem

Slide 11

Slide 11 text

USING JAWBONE UP PUBLIC API AS AGGREGATOR
Apple HealthKit was an option for data aggregation, but no central cloud store
Can manually download HealthKit data using iOS apps, or use Hipbone iOS app for Dropbox download
Jawbone UP API was most robust and widely supported ecosystem API
Download data as CSV file, or automate using API
Access all of Jawbone UP health metrics
Integrate weight data from Withings scale
Workout data from Strava

Slide 12

Slide 12 text

AND ANOTHER PROJECT…

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

ANOTHER PERSONAL PROJECT : HOME AUTOMATION
Smart appliances, Internet-connected heating and lights, sensors and home automation platforms
Another personal project has been home automation, IoT and the “smart home”
Started with Nest thermostat and Philips Hue lights
Extended the Nest system to include Nest Protect and Nest Cam
Used Apple HomeKit, HomeBridge, Apple TV and Domoticz for Siri voice control
Added Samsung Smart Things hub for Z-Wave, Zigbee compatibility

Slide 15

Slide 15 text

HOME AUTOMATION / IOT NETWORK
Linking Apple HomeKit, Samsung Smart Things and other IoT devices with Siri + Hadoop logging
[Network diagram. Components shown: Philips Hue lighting; Nest Protect (x2), thermostat, cam; Withings smart scales; AirPlay speakers; HomeBridge HomeKit / Smart Things connector; Samsung Smart Things hub (Z-Wave, Zigbee); door, motion, moisture, presence sensors; Siri on iPhone, Watch; Smart Things Watch app; Apple HomeKit, Apple TV, Siri; Hadoop cluster]

Slide 16

Slide 16 text

USING JAWBONE EVENTS TO TRIGGER SWITCHES
IFTTT can also drive actions directly in Hue, Nest and other smart devices
Use Jawbone UP events and IFTTT to trigger Smart Things actions
When I wake up, boil the kettle
If my sleep was lower than usual last night, dim the lights early

Slide 17

Slide 17 text

WHAT IF I COULD COMBINE…?

Slide 18

Slide 18 text

…MY HEALTH DATA…

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

… WITH SMART HOME SENSOR DATA …

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

… AND ALSO MY EMAILS, TWEETS, LOCATION DATA AND SO ON …

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

TO CREATE A “PERSONAL” DATA LAKE

Slide 25

Slide 25 text

“PERSONAL DATA LAKE” LOGICAL ARCHITECTURE
All built and currently running, combination of real-time and batch loading
Data extracted or transported to target platform using LogStash, CSV file batch loads
Landed into Elasticsearch indexes, then exposed as Hive tables using Storage Handler
Cataloged, visualised and analysed using Oracle Big Data Discovery + Python ML
[Architecture diagram. Labels shown:
Data sources: Health Data; Unstructured Comms Data; Smart Home Sensor Data
Data Transfer: data streams (CSV, IFTTT or API call); LogStash via HTTP; manual CSV upload
Data Factory / Staging: Elasticsearch indexes (three indexes, one for each data source); Hive tables with Elastic Storage Handler (index data turned into tabular format)
“Personal” Data Lake: 6-node Hadoop cluster (CDH5.5)
Data Access / Discovery & Development Labs: Oracle Big Data Discovery 1.2 (data sets and samples, models and programs); Jupyter web notebook; Oracle DV Desktop; BDD Shell, Python, Spark ML (models)]

Slide 26

Slide 26 text

AND USE MACHINE LEARNING FOR INSIGHTS…
Find correlations, predict outcomes based on regression analysis, classify and cluster data
As well as visualising the combined dataset, we could also use “machine learning”
Advanced analytics, classification, regression, clustering
Run algorithms on the full dataset to answer questions like:
“What are the biggest determinants of weight gain or loss for me?”
“On a good day, what is the typical combination of behaviours I exhibit?”
“If I raised my cadence RPM average, how much further could I cycle per day?”
“Is working late or missing lunch self-defeating in terms of overall weekly output?”
[Diagram excerpt: Discovery & Development Labs: Oracle Big Data Discovery 1.2; BDD Shell, Python, Spark ML; models]

Slide 27

Slide 27 text

ORACLE BIG DATA DISCOVERY - WHAT IS IT?
Brief summary for the one person in this session who’s not seen Oracle’s marketing
Oracle’s first Hadoop-native BI & data discovery tool
Catalog, visualize, data wrangle and search the datasets you land in Hadoop
Initial releases focused on these areas of functionality, and OEID migrations
… but lacked functionality that a data scientist would require

Slide 28

Slide 28 text

BIG DATA DISCOVERY 1.0/1.1 - PARTIAL SOLUTION
Missing full data tidying, data aggregation features, plus no real machine learning or stats features
BDD Catalog + Data Wrangling features; BDD Data Processing CLI; BDD Dashboards
Partial solution - no aggs, null-handling, no materialised joins
No solution for M/L or Predictive Analytics

Slide 29

Slide 29 text

NEW FEATURES IN ORACLE BDD 1.2
BDD 1.2.0 Release Theme : “Developer Productivity”

Slide 30

Slide 30 text

BDD SHELL - WHAT IS IT?
Note comment about Jupyter - more on this later
Interactive tool designed to work with BDD without using Studio's front-end
Exposes all BDD concepts (views, datasets, data sources etc)
Supports Apache Spark
HiveContext and SQLContext exposed
BDD Shell SDK for easy access to BDD features, functionality
Access to third-party libraries such as Pandas, Spark ML, NumPy
Use with web-based notebook such as IPython, Jupyter, Zeppelin

Slide 31

Slide 31 text

WORKFLOW TO INGEST, TIDY AND PREPARE DATA
1. Create Apache Hive tables over Elasticsearch indexes using Storage Handler, and any CSV files or JSON documents from Jawbone UP / Google Locations
2. Import Hive table data into DGraph, and auto-enrich
3. Perform exploratory analysis on the imported data
4. Transform data to create one table, with one row of readings per period
5. Aggregate rows as appropriate (e.g. weekly averages + counts, for weight analysis)
6. Deal with nulls and missing data
7. Expose dataset through BDD Shell / Jupyter web notebook UI
8. Do any further transformations (e.g. pct chg on prior period) using Python Pandas
9. Run machine learning algorithms on data using Pandas, pySpark, Spark ML

Slide 32

Slide 32 text

BASE BDD DATASET - JAWBONE UP EXTRACT
Initially manually downloaded from Jawbone UP website; long term route would be direct via API
Data extract contains one row per day, data in various categories
Base activity data (steps, active time, active calories expended)
Sleep data (time asleep, time in-bed, light and deep sleep, resting heart-rate)
Mood if recorded; food ingested if recorded
Workout data as provided by Strava integration
Weight data as provided by Withings integration

Slide 33

Slide 33 text

PERFORM EXPLORATORY ANALYSIS ON DATA
Understand the “spread” of data using histograms
Use box-plot charts to identify outliers and range of “usual” values
Sort attributes by strongest correlation to a target attribute
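The box-plot check above can also be approximated in code. The following is a minimal sketch, assuming the common 1.5 x IQR whisker rule and a made-up `steps` series; it is not taken from the slides or from BDD itself.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return values outside the box-plot whiskers (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical daily step counts (illustrative data only).
steps = pd.Series([8200, 7900, 8500, 9100, 8800, 7600, 8300, 25000], name="steps")
outliers = iqr_outliers(steps)
print(outliers)
```

The multiplier `k` is a convention rather than a rule; widening it flags fewer readings as unusual.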

Slide 34

Slide 34 text

TRANSFORM (“WRANGLE”) DATA AS NEEDED

Slide 35

Slide 35 text

DEALING WITH MISSING DATA (“NULLS”)
Very typical with self-recorded healthcare and workout data
Most machine-learning algorithms expect every attribute to have a value per row
Self-recorded data is typically sporadically recorded, lots of gaps in data
Need to decide what to do with columns of poorly populated values
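One possible way to make that decision, sketched in pandas rather than in the BDD transform UI: drop columns that are mostly empty, then fill the remaining gaps. The column names, the 50% threshold and the choice of median fill are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily extract with gaps (illustrative data only).
daily = pd.DataFrame({
    "steps":     [8200, 7900, np.nan, 9100, 8800, 7600],
    "sleep_hrs": [7.1, np.nan, 6.5, 6.9, np.nan, 7.4],
    "mood":      [np.nan, np.nan, np.nan, 4.0, np.nan, np.nan],  # rarely recorded
})

# 1. Drop columns populated in fewer than half of the rows (assumed threshold).
min_populated = 0.5
populated = daily.notna().mean()
kept = daily.loc[:, populated >= min_populated]

# 2. Fill remaining gaps with each column's median so every row has a value.
filled = kept.fillna(kept.median())
print(filled)
```

Median fill is only one option; dropping incomplete rows or carrying the last reading forward may suit some attributes better.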

Slide 36

Slide 36 text

JOINING DATASETS TO MATERIALIZE RELATED DATA
Previous versions of BDD allowed you to create joins for views
Used in visualisations, equivalent to a SQL view i.e. SELECT only
BDD 1.2.x allows you to add new joined attributes to data view, i.e. materialise
In this instance, use to bring in data on emails, and on geolocation
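As a rough analogy only, the same idea of materialising joined attributes can be shown with a pandas left join. This does not use BDD; the table and column names are hypothetical.

```python
import pandas as pd

# Hypothetical per-day health readings and email counts (illustrative data only).
health = pd.DataFrame({
    "day": pd.to_datetime(["2016-05-02", "2016-05-03", "2016-05-04"]),
    "steps": [8200, 7900, 9100],
})
emails = pd.DataFrame({
    "day": pd.to_datetime(["2016-05-02", "2016-05-04"]),
    "emails_sent": [12, 30],
})

# Left join keeps every health row; days with no email record get a count of 0.
combined = health.merge(emails, on="day", how="left")
combined["emails_sent"] = combined["emails_sent"].fillna(0).astype(int)
print(combined)
```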

Slide 37

Slide 37 text

AGGREGATE DATA TO WEEK LEVEL
Only sensible option when looking at change in weight compared to prior period - day-level too short
New feature in BDD 1.2.x is ability to aggregate (“rollup”) data
Previous releases only supported row-level transforms
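For comparison, a weekly rollup of the same kind (averages plus counts) can be sketched in pandas. This is not the BDD aggregation feature itself, and the data and column names are invented for illustration.

```python
import pandas as pd

# Two full Monday-to-Sunday weeks of hypothetical daily readings.
days = pd.date_range("2016-05-02", periods=14, freq="D")
daily = pd.DataFrame({
    "weight_kg": [80.0, None, 79.8, None, 79.6, None, 79.4,
                  79.2, None, None, 79.0, None, None, 78.8],
    "steps": [8000] * 7 + [10000] * 7,
}, index=days)

# Roll up to weeks ending on Sunday: mean values plus a count of weight readings.
weekly = pd.DataFrame({
    "weight_kg_mean": daily["weight_kg"].resample("W").mean(),
    "weight_readings": daily["weight_kg"].resample("W").count(),
    "steps_mean": daily["steps"].resample("W").mean(),
})
print(weekly)
```

Keeping the reading count alongside the mean shows how many measurements each weekly average rests on.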

Slide 38

Slide 38 text

NOW FOR THE CLEVER BIT

Slide 39

Slide 39 text

USE BDD SHELL API TO IDENTIFY MAIN DATASET

Slide 40

Slide 40 text

USE PYTHON PANDAS TO CALCULATE % CHG W/W
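The notebook code from this slide is not captured in the text. A minimal sketch of a week-on-week percentage change in pandas, assuming a hypothetical `weight_kg_mean` column of weekly averages, might look like this:

```python
import pandas as pd

# Hypothetical weekly average weights (illustrative data only).
weekly = pd.DataFrame(
    {"weight_kg_mean": [80.0, 79.2, 79.6, 78.8]},
    index=pd.to_datetime(["2016-05-08", "2016-05-15", "2016-05-22", "2016-05-29"]),
)

# Percentage change versus the prior week; the first week has no prior value.
weekly["weight_pct_chg"] = weekly["weight_kg_mean"].pct_change() * 100
print(weekly)
```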

Slide 41

Slide 41 text

IDENTIFY CORRELATIONS IN ATTRIBUTES
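The slide's own code is not in the extracted text. As a sketch of the idea, attributes can be ranked by the strength of their Pearson correlation with a target attribute; the data and column names below are made up.

```python
import pandas as pd

# Hypothetical weekly dataset (illustrative data only).
weekly = pd.DataFrame({
    "weight_pct_chg": [-1.0, 0.5, -0.8, 0.3, -1.2, 0.6],
    "emails_sent":    [120, 40, 110, 55, 130, 35],
    "steps_mean":     [9000, 9100, 8700, 8800, 8900, 9050],
    "sleep_hrs_mean": [6.8, 7.0, 6.9, 7.1, 7.0, 6.9],
})

target = "weight_pct_chg"
# Correlation of every other attribute with the target.
corr = weekly.corr()[target].drop(target)
# Order by absolute strength, keeping the sign so the direction stays visible.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

With so few rows any correlation is fragile, so a ranking like this is a prompt for further investigation rather than a finding.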

Slide 42

Slide 42 text

PERFORM LINEAR REGRESSION ON DATA
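The regression code shown on this slide is not in the extracted text. The sketch below substitutes a plain ordinary-least-squares fit using NumPy for whatever library the talk used, with made-up data and variable names.

```python
import numpy as np

# Hypothetical weekly data: emails sent and % change in weight (illustrative only).
emails_sent = np.array([35.0, 40.0, 55.0, 110.0, 120.0, 130.0])
weight_pct_chg = np.array([0.6, 0.5, 0.3, -0.8, -1.0, -1.2])

# Design matrix with an intercept column, solved by ordinary least squares.
X = np.column_stack([np.ones_like(emails_sent), emails_sent])
coef, *_ = np.linalg.lstsq(X, weight_pct_chg, rcond=None)
intercept, slope = coef

# R-squared: share of the variance in the target explained by the fitted line.
predicted = X @ coef
ss_res = np.sum((weight_pct_chg - predicted) ** 2)
ss_tot = np.sum((weight_pct_chg - weight_pct_chg.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"slope={slope:.4f} intercept={intercept:.4f} r2={r_squared:.4f}")
```

A negative slope here would correspond to an inverse relationship between emails sent and weight change.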

Slide 43

Slide 43 text

INITIAL FINDINGS IN THE EXERCISE
Most influential variable/attribute in my weight loss / gain is “# of emails sent”
Inverse correlation: the more emails I sent, the more weight I lost. But why?
In my case, an unusual set of circumstances that led to late nights, burst of intense work
So busy I skipped meals, didn’t snack; stress and overwork perhaps
And then compensated once work was over by getting out on bike and exercising
Correlation and most influential variable will probably change in time
This is where the data, measuring it, and analysing it comes in
Useful basis for experimenting
And bring in the Smart Home data too

Slide 44

Slide 44 text

ORACLE BIG DATA DISCOVERY EXTENDING INTO MACHINE LEARNING : A QUANTIFIED SELF CASE STUDY
MARK RITTMAN, ORACLE ACE DIRECTOR
ODTUG KSCOPE’16, CHICAGO