The Story Behind the 11hr Cup of Tea, Wifi Kettles, & How it Was All About... Data

Slide 1

Slide 1 text

THE STORY BEHIND THE 11HR CUP OF TEA, WIFI KETTLES, & HOW IT WAS ALL ABOUT... DATA
Mark Rittman, Oracle ACE Director
OCTOBER 2016
T: @markrittman

Slide 2

Slide 2 text

About The Presenter
•Oracle ACE Director, now Independent Analyst
•Regular columnist for Oracle Magazine
•Past ODTUG Executive Board Member
•Author of two books on Oracle BI
•Co-founder & CTO of Rittman Mead
•15+ Years in Oracle BI, DW, ETL + now Big Data
•Implementor, trainer, consultant + company founder
•Based in Brighton, UK

Slide 3

Slide 3 text

Slide 4

Slide 4 text

ONE MORNING (AND THEN ALL DAY) LAST WEEK…

Slide 5

Slide 5 text

Slide 6

Slide 6 text

THE FOLLOWING MORNING…

Slide 7

Slide 7 text

Slide 8

Slide 8 text

Slide 9

Slide 9 text

Slide 10

Slide 10 text

Slide 11

Slide 11 text

Slide 12

Slide 12 text

Slide 13

Slide 13 text

WHY?

Slide 14

Slide 14 text

Slide 15

Slide 15 text

DATA

Slide 16

Slide 16 text

D A T A

Slide 17

Slide 17 text

AND WHAT’S POSSIBLE WHEN YOU JOIN THAT DATA TOGETHER

Slide 18

Slide 18 text

SIX MONTHS AGO…

Slide 19

Slide 19 text

Personal Data Science Project - “Quantified Self”
•Over the past months I’ve been on sabbatical, taking time out to look at new Hadoop tech
•Building prototypes, working with startups & analysts outside of core Oracle world
•Asking myself the question “What will an analytics platform look like in 5 years’ time?”
•But also during this time, getting fit, getting into cycling and losing 14kg over 12 months
•Using Wahoo Elemnt + Strava for workout recording
•Withings Wifi scales for weight + body fat measurement
•Jawbone UP3 for steps, sleep, resting heart rate
•All the time, collecting data and storing it in Hadoop

Slide 20

Slide 20 text

Personal Data Science Project - “Quantified Self”
•Quantified Self is about self-knowledge through numbers
•Decide on some goals, work out what metrics to track
•Use wearables and other smart devices to record steps, heart rate, workouts, weight and other health metrics
•Plot, correlate, track trends and combine datasets
•For me, goal was to maintain new “healthy weight”
•Understand drivers of weight gain or loss
•See how sleep affected productivity
•Understand what behaviours led to a “good day”

Slide 21

Slide 21 text

MY OTHER SABBATICAL PROJECT…

Slide 22

Slide 22 text

HOME AUTOMATION

Slide 23

Slide 23 text

Slide 24

Slide 24 text

Hadoop Cluster Dataset - “Personal Data Lake”
•Data extracted or transported to target platform using LogStash, CSV file batch loads
•Landed into HDFS as JSON documents, then exposed as Hive tables using Storage Handler
•Cataloged, visualised and analysed using Oracle Big Data Discovery + Python ML
[Diagram: “Personal” Data Lake architecture, with the following labels]
•Data streams: Health Data; Unstructured Comms Data; Smart Home Sensor Data (CSV, IFTTT or API call)
•Data Transfer: LogStash via HTTP; Manual CSV U/L
•Data Factory: 6 Node Hadoop Cluster (CDH5.5); Raw JSON log files in HDFS (each document an event, daily record or comms message); Hive Tables w/ Elastic Storage Handler (index data turned into tabular format)
•Data Access / Discovery & Development Labs: Oracle Big Data Discovery 1.2; Jupyter Web Notebook; BDD Shell, Python, Spark ML; Oracle DV Desktop; Data sets and samples; Models and programs; Models
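As a rough illustration of "index data turned into tabular format", the sketch below flattens a few JSON event documents into fixed-column rows using only the Python standard library. The field names (`ts`, `device`, `event`, `value`) and the sample documents are hypothetical; the deck does not show the actual document schema, and in the pipeline described here that step is done by Hive rather than Python.

```python
import json

# Hypothetical raw event documents, one JSON object per line,
# roughly as they might land in HDFS (schema is an assumption).
raw_lines = [
    '{"device": "kettle", "event": "switch", "value": "on", "ts": "2016-10-01T07:30:00"}',
    '{"device": "scales", "event": "weight", "value": "82.5", "ts": "2016-10-01T07:35:00"}',
    '{"device": "hall_sensor", "event": "motion", "ts": "2016-10-01T07:36:00"}',
]

COLUMNS = ["ts", "device", "event", "value"]


def to_rows(lines, columns=COLUMNS):
    """Flatten JSON documents into fixed-width rows; missing fields become None."""
    rows = []
    for line in lines:
        doc = json.loads(line)
        rows.append(tuple(doc.get(col) for col in columns))
    return rows


rows = to_rows(raw_lines)
for row in rows:
    print(row)
```

Documents that lack a field still produce a full-width row, which is the same "null" problem that comes up later when preparing the data for machine learning.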

Slide 25

Slide 25 text

Landing Wearables Data In Real-Time
•Uses IFTTT cloud workflow service to subscribe to events on wearables’ APIs
•Triggers HTTP GET request via IFTTT Maker Channel to Logstash running at home
•Event data sent as JSON documents, loaded into HDFS via webhdfs protocol
•Structured in Hadoop using Hive JSONSerDe
•Then loaded hourly into DGraph using Big Data Discovery dataprocessing CLI
•Event data automatically enriched, and can be joined to smart home data for analysis
[Diagram: numbered flow, spanning “In the Cloud” and “Home”]
1 New workout logged using Strava
2 Workout details uploaded to Strava using cloud API
3 IFTTT recipe gets workout event from Strava API, triggers an HTTP GET web request
4 JSON document received by Logstash, then forwarded to Hadoop using webhdfs PUT
5 JSON documents landed in HDFS in raw form, then structured using Hive JSONSerDe
6 Hive data uploaded into Oracle Big Data Discovery, visualised and wrangled, and modelled using pySpark
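For readers unfamiliar with the "webhdfs PUT" step: WebHDFS is Hadoop's REST interface, and a file is created by sending an HTTP PUT to a URL of the form `/webhdfs/v1/<path>?op=CREATE`. The sketch below only builds that URL and a sample payload; it does not send anything. The host name, port, HDFS path, user and workout fields are all hypothetical, and in the setup described here Logstash makes the request, not hand-written Python.

```python
import json
from urllib.parse import quote, urlencode


def webhdfs_create_url(host, port, hdfs_path, user, overwrite=False):
    """Build the WebHDFS REST URL for an op=CREATE request (sent as HTTP PUT)."""
    query = urlencode({
        "op": "CREATE",
        "user.name": user,
        "overwrite": "true" if overwrite else "false",
    })
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


# Hypothetical workout event, standing in for what an IFTTT recipe might pass on.
workout = {"source": "strava", "type": "ride", "distance_km": 32.4}
payload = json.dumps(workout)

# Hypothetical namenode address and target path.
url = webhdfs_create_url(
    "namenode.example", 50070, "/user/demo/strava/workout-1.json", "demo"
)
print("PUT", url)
print(payload)
```

With a real cluster the namenode normally answers the first PUT with a redirect to a datanode, and the file body is sent in a second PUT to that location.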

Slide 26

Slide 26 text

Landing Smart Home Data In Real-Time
•All smart device events and sensor readings are routed through Samsung Smart Things hub
•Including Apple HomeKit devices, through custom integration
•Event data uploads to Smart Things cloud service + storage
•Custom Groovy SmartApp subscribes to device events, transmits JSON documents to Logstash using HTTP GET requests
•Then process flow the same as with wearables and social media / comms data
[Diagram: numbered flow, spanning “In the Cloud” and “Home”]
1 Sensor or other smart device raises a Smart Things event
2 Event logged in Samsung Smart Things Cloud Service from Smart Things Hub
3 SmartApp subscribes to device events, forwards them as JSON documents using HTTP GET requests
4 JSON document received by Logstash, then forwarded to Hadoop using webhdfs PUT
5 JSON documents landed in HDFS in raw form, then structured using Hive JSONSerDe
6 Hive data uploaded into Oracle Big Data Discovery, visualised and wrangled, and modelled using pySpark
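One way a device event could travel in an HTTP GET request is as URL query parameters that the receiver turns back into a JSON document. The sketch below shows that round trip with the Python standard library. It is an assumption about the mechanism, not the author's SmartApp (which is written in Groovy and is not shown in the deck); the endpoint URL and the event fields are hypothetical.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def event_to_get_url(base_url, event):
    """Encode a flat event dict as query parameters on a GET URL."""
    return f"{base_url}?{urlencode(event)}"


def get_url_to_json(url):
    """Recover the event from the query string and serialise it as JSON."""
    params = parse_qs(urlsplit(url).query)
    flat = {key: values[0] for key, values in params.items()}
    return json.dumps(flat, sort_keys=True)


# Hypothetical device event and hypothetical Logstash HTTP endpoint.
event = {"device": "Kitchen Kettle", "name": "switch", "value": "on"}
url = event_to_get_url("http://logstash.example:8080/", event)

print(url)
print(get_url_to_json(url))
```

Everything arrives as a string on the receiving side, so numeric readings would need casting later, for example when the documents are structured in Hive.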

Slide 27

Slide 27 text

And Use Machine Learning For Insights…
•As well as visualising the combined dataset, we could also use “machine learning”
•Find correlations, predict outcomes based on regression analysis, classify and cluster data
•Run algorithms on the full dataset to answer questions like:
•“What are the biggest determinants of weight gain or loss for me?”
•“On a good day, what is the typical combination of behaviours I exhibit?”
•“If I raised my cadence RPM average, how much further could I cycle per day?”
•“Is working late or missing lunch self-defeating in terms of overall weekly output?”
MODELING AND INFERRING

Slide 28

Slide 28 text

Initial Base Dataset - Jawbone Up Extract
•Analysis started with data from Jawbone UP2 ecosystem (manual export, and via IFTTT events)
•Base activity data (steps, active time, active calories expended)
•Sleep data (time asleep, time in-bed, light and deep sleep, resting heart-rate)
•Mood if recorded; food ingested if recorded
•Workout data as provided by Strava integration
•Weight data as provided by Withings integration

Slide 29

Slide 29 text

Perform Exploratory Analysis On Data
•Understand the “spread” of data using histograms
•Use box-plot charts to identify outliers and range of “usual” values
•Sort attributes by strongest correlation to a target attribute
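The first two steps can be approximated outside Big Data Discovery with pandas: bin a column to see its spread, then apply the usual box-plot rule (values beyond 1.5 times the interquartile range from the quartiles) to flag outliers. This is a sketch on made-up step counts, not the author's data or the BDD implementation; the column name and the 1.5 multiplier are assumptions.

```python
import pandas as pd

# Made-up daily step counts; the last value is deliberately extreme.
steps = pd.Series(
    [4000, 5200, 6100, 7000, 7400, 8000, 8800, 9500, 10200, 30000],
    name="steps",
)

# Histogram as counts per equal-width bin (the "spread" of the data).
histogram = pd.cut(steps, bins=4).value_counts(sort=False)
print(histogram)


def iqr_outliers(series, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], the box-plot whisker rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series[(series < q1 - k * iqr) | (series > q3 + k * iqr)]


outliers = iqr_outliers(steps)
print(outliers)
```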

Slide 30

Slide 30 text

Transform (“Wrangle”) Data As Needed
•Initial row-wise preparation and transformation of data using Groovy transformations

Slide 31

Slide 31 text

Dealing With Missing Data (“Nulls”)
•Very typical with self-recorded healthcare and workout data
•Most machine-learning algorithms expect every attribute to have a value per row
•Self-recorded data is typically sporadically recorded, lots of gaps in data
•Need to decide what to do with columns of poorly populated values
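One common way to handle sparse columns, sketched here in pandas, is to drop attributes that are mostly empty and fill the remaining gaps with a column median. The 50% threshold, the median fill and the sample values are assumptions for illustration; the deck does not say which strategy was actually chosen.

```python
import numpy as np
import pandas as pd

# Made-up self-recorded data with gaps; "mood" is rarely filled in.
df = pd.DataFrame({
    "steps": [8000, 7500, np.nan, 9100, 8700],
    "sleep_hours": [7.0, np.nan, 6.5, 7.2, 6.8],
    "mood": [np.nan, np.nan, np.nan, 4.0, np.nan],
})

# Fraction of populated (non-null) values per column.
populated = df.notna().mean()

# Keep columns that are at least half populated (threshold is an assumption).
keep = populated[populated >= 0.5].index
cleaned = df[keep].fillna(df[keep].median())

print(populated)
print(cleaned)
```

Dropping a column loses whatever signal it carried, and median filling flattens variation, so either choice is a trade-off worth revisiting once more data has been recorded.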

Slide 32

Slide 32 text

Joining Datasets To Materialize Related Data
•Previous versions of BDD allowed you to create joins for views
•Used in visualisations, equivalent to a SQL view i.e. SELECT only
•BDD 1.2.x allows you to add new joined attributes to data view, i.e. materialise
•In this instance, use to bring in data on emails, and on geolocation
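The idea of materialising joined attributes can be illustrated with a pandas left join: every row of the main dataset is kept and the related attribute is added as a new column. This is an analogy only, not the BDD feature itself; the column names and values are made up.

```python
import pandas as pd

# Made-up main dataset: one health record per day.
health = pd.DataFrame({
    "date": pd.to_datetime(["2016-09-05", "2016-09-06", "2016-09-07"]),
    "weight_kg": [83.0, 82.8, 82.9],
})

# Made-up related dataset: emails sent, with one day missing.
emails = pd.DataFrame({
    "date": pd.to_datetime(["2016-09-05", "2016-09-07"]),
    "emails_sent": [42, 17],
})

# Left join keeps every health row; days without email data get NaN.
joined = health.merge(emails, on="date", how="left")
print(joined)
```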

Slide 33

Slide 33 text

Aggregate Data To Week Level
•Only sensible option when looking at change in weight compared to prior period
•Change compared to previous day too granular
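A minimal pandas sketch of rolling daily records up to week level, assuming a date-indexed frame. The aggregation choices (mean weight, total steps) and the values are assumptions for illustration.

```python
import pandas as pd

# Two made-up weeks of daily records, Monday 2016-09-05 to Sunday 2016-09-18.
days = pd.date_range("2016-09-05", periods=14, freq="D")
daily = pd.DataFrame(
    {
        "weight_kg": [83.0] * 7 + [82.0] * 7,
        "steps": [8000] * 7 + [10000] * 7,
    },
    index=days,
)

# "W" bins end on Sunday by default; average the weight, total the steps.
weekly = daily.resample("W").agg({"weight_kg": "mean", "steps": "sum"})
print(weekly)
```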

Slide 34

Slide 34 text

NOW FOR THE CLEVER BIT
MODELING AND INFERRING

Slide 35

Slide 35 text

Use BDD Shell API to Identify Main Dataset ID

Slide 36

Slide 36 text

Use Python PANDAS to Calculate % CHG W/w
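The slide's screenshot is not in this transcript, so the following is a stand-in rather than the author's code: a minimal pandas sketch of week-on-week percentage change using `pct_change`, on made-up weekly values.

```python
import pandas as pd

# Made-up weekly aggregates, indexed by week-ending date.
weekly = pd.DataFrame(
    {"weight_kg": [80.0, 78.0, 78.0], "emails_sent": [100, 150, 120]},
    index=pd.to_datetime(["2016-09-11", "2016-09-18", "2016-09-25"]),
)

# Percentage change versus the previous week; the first week has no
# predecessor, so it is dropped.
pct_chg = (weekly.pct_change() * 100).dropna()
print(pct_chg)
```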

Slide 37

Slide 37 text

Identify Correlations Between Attributes
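As a stand-in for the missing screenshot, here is a pandas sketch that computes a Pearson correlation matrix and ranks attributes by the strength of their correlation with a target. The column names and numbers are invented purely to make the mechanics visible and are not the author's measurements.

```python
import pandas as pd

# Invented week-on-week changes; emails_chg is constructed to move
# exactly opposite to weight_chg so the ranking is easy to follow.
changes = pd.DataFrame({
    "weight_chg": [-1.0, -0.5, 0.0, 0.5, 1.0],
    "emails_chg": [20.0, 10.0, 0.0, -10.0, -20.0],
    "sleep_chg": [1.0, -2.0, 0.0, 3.0, -1.0],
})

# Pairwise Pearson correlations between all attributes.
corr = changes.corr()

# Rank the other attributes by absolute correlation with the target.
target = "weight_chg"
ranking = corr[target].drop(target).abs().sort_values(ascending=False)

print(corr)
print(ranking)
```

Ranking on the absolute value keeps strong inverse relationships at the top, which matters when the interesting driver moves in the opposite direction to the target.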

Slide 38

Slide 38 text

Use Linear Regression on BDD Dataset via Python
•To answer the question - which metric is the most influential when it comes to weight change?
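One way to read "most influential" from a linear regression is to standardise every column first, so the fitted coefficients are on a comparable scale, and then compare their absolute sizes. The sketch below does this with NumPy least squares on synthetic data; the feature names, the generating coefficients and the choice of standardised coefficients as the influence measure are all assumptions, not the author's model.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_names = ["emails_sent", "sleep_hours", "steps"]  # hypothetical attributes

# Synthetic data: 52 "weeks", with the target built mostly from the first feature.
n = 52
X = rng.normal(size=(n, 3))
y = X @ np.array([-0.8, 0.3, 0.0]) + rng.normal(scale=0.05, size=n)


def standardise(a):
    """Scale to zero mean and unit standard deviation, column-wise."""
    return (a - a.mean(axis=0)) / a.std(axis=0)


# Ordinary least squares on standardised inputs, with an intercept column.
A = np.column_stack([np.ones(n), standardise(X)])
coefs, *_ = np.linalg.lstsq(A, standardise(y), rcond=None)

weights = dict(zip(feature_names, coefs[1:]))
most_influential = max(weights, key=lambda name: abs(weights[name]))

print(weights)
print(most_influential)
```

A large coefficient shows association in this sample, not cause, so a result like this is better treated as a prompt for an experiment than as an answer.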

Slide 39

Slide 39 text

And the Answer … Amount of Sleep Each Night
•Most influential variable/attribute in my weight loss / gain is “# of emails sent”
•Inverse correlation - the more emails I sent, the more weight I lost - but why?
•In my case - unusual set of circumstances that led to late nights, burst of intense work
•So busy I skipped meals, didn’t snack, stress and overwork perhaps
•And then compensated once work over by getting out on bike and exercising
•Correlation and most influential variable will probably change in time
•This is where the data, measuring it, and analysing it comes in
•Useful basis for experimenting
•And bring in the Smart Home data too

Slide 40

Slide 40 text

For The Future..?
•Load device + event data into Cloudera Kudu rather than HDFS + Hive
•Current limitation is around Big Data Discovery - does not work with Kudu or Impala
•But useful for real-time metrics (BDD requires batch ingest, and samples the data)
•Use Kafka for more reliable event routing
•Push email, social media, saved documents etc into Cloudera Search
•Do more on the machine learning / data integration + correlation side

Slide 41

Slide 41 text

Slide 42

Slide 42 text

Slide 43

Slide 43 text

Slide 44

Slide 44 text

Slide 45

Slide 45 text

Slide 46

Slide 46 text

THANK YOU

Slide 47

Slide 47 text
