
Health Insurance Predictive Analysis with Hadoop and Machine Learning. Julien Cabot at Big Data Spain 2012


Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicación, UPM, Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/health-insurance-predictive-analysis-with-hadoop-and-machine-learning/julien-cabot


Transcript

  1. 1
    Tel: +33 (0)1 58 56 10 00
    Fax: +33 (0)1 58 56 10 01
    www.octo.com
    © OCTO 2012
    50, avenue des Champs-Elysées
    75008 Paris - FRANCE
    Health Insurance Predictive Analysis
    with MapReduce and Machine Learning
    Julien Cabot
    Managing Director
    OCTO
    [email protected]
    @julien_cabot
    Madrid
    16th of November 2012
    www.bigdataspain.org


  2. 2
    Internet as a Data Source…
    Internet as the voice of the crowd


  3. 3
    … in Healthcare
    71% of messages are about:
    • Illness
    • Symptoms
    • Medicine
    • Advice / opinions
    The main sources are old-school forums, not social networks


  4. 4
    Benefits for the insurance company?
    • Understand the patients' subjects of interest, to design customer-centric products and marketing actions
    • Anticipate the psycho-social effects of the Internet, to prevent excessive consultations (and reimbursements)
    • Predict claims by monitoring requests about symptoms and drugs


  5. 5
    How to run the predictive analysis?


  6. 6
    The data problem
    • Understand the semantic field of healthcare… as used on the Internet
    • Find correlations between the evolution of the claims and… many millions of unidentified external variables
    • Find the correlated variables that… anticipate the claims
    We need some help from Machine Learning!


  7. 7
    Correlation search in external datasets
    Inputs:
    • Trends of medical keywords used in forums (automated tokenization of messages per posted date, plus semantic tagging)
    • Trends of medical keywords searched in Google (Google search volume of symptom and drug keywords)
    • Trends of socio-economic factors (socio-economic context from Open Data initiatives)
    • Health claims by act typology
    All inputs feed a correlation search machine, which outputs a matrix sorted by coefficient of determination (R²).
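    A minimal sketch of such a correlation search machine, assuming the time series fit in NumPy arrays; the names (r_squared, correlation_search) and the lag range are illustrative, not the production code:

    import numpy as np

    def r_squared(x, y):
        # Coefficient of determination of a simple linear fit of y on x
        slope, intercept = np.polyfit(x, y, 1)
        ss_res = np.sum((y - (slope * x + intercept)) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        return 1.0 - ss_res / ss_tot

    def correlation_search(claims, candidates, max_lag=6):
        # Score every candidate series against the claims at each monthly lag
        scores = []
        for name, series in candidates.items():
            for lag in range(1, max_lag + 1):
                # The candidate leads the claims by `lag` months
                scores.append((r_squared(series[:-lag], claims[lag:]), lag, name))
        return sorted(scores, reverse=True)  # the R²-sorted matrix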


  8. 8
    Understand the semantic field of Healthcare
    How to tag healthcare words?
    1. Build a first list of keywords
    2. Enrich the list with highly searched keywords
    3. Learn automatically from Wikipedia medical categories
    Pipeline: message tokenization by date, then word stemming, tagging and common-word filtering with NLTK, producing timelines of healthcare keywords backed by a healthcare semantic-field keyword database.
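    A minimal sketch of the tagging step with NLTK; the keyword file name is hypothetical:

    from nltk.tokenize import regexp_tokenize
    from nltk.stem.snowball import FrenchStemmer

    stemmer = FrenchStemmer()

    # Healthcare semantic-field keyword database, one keyword per line
    with open('healthcare_keywords.txt') as f:
        healthcare_stems = set(stemmer.stem(w.strip()) for w in f)

    def tag_healthcare_words(message):
        # Return the healthcare-related stems found in a forum message
        tokens = regexp_tokenize(message.lower(), pattern=r'\w+')
        return [stemmer.stem(t) for t in tokens
                if stemmer.stem(t) in healthcare_stems]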


  9. 9
    How to find correlations between time series?
    • Compare the evolution of the variable and of the claims over time
    • Fit a non-linear regression and learn a polymorphic predictive function f(x) from the dataset with Support Vector Regression (SVR)
    [Figure: data points inside the ε-insensitive tube, between f(x) - ε and f(x) + ε]
    Problem to solve:
        min_w (1/2)·||w||²
        subject to: y_i - (w·ϕ(x_i) + b) ≤ ε
                    (w·ϕ(x_i) + b) - y_i ≤ ε
    Resolution:
    • Stochastic gradient descent
    • Test the response through the coefficient of determination (R²)
    Open-source ML libraries help!
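    A minimal sketch of this step with scikit-learn; the RBF kernel, the parameters and the synthetic data are assumptions for illustration, not the talk's actual settings:

    import numpy as np
    from sklearn.svm import SVR
    from sklearn.metrics import r2_score

    rng = np.random.RandomState(0)
    x = np.sort(rng.uniform(0, 10, 200)).reshape(-1, 1)  # external variable
    y = np.sin(x).ravel() + rng.normal(0, 0.1, 200)      # claims-like target

    # epsilon is the half-width of the insensitive tube around f(x)
    model = SVR(kernel='rbf', C=10.0, epsilon=0.1).fit(x, y)

    # Test the response through the coefficient of determination R²
    print(r2_score(y, model.predict(x)))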


  10. 10
    The current volume of external data grabbed is large, but not that huge (~10 GB)
    Data processing profiles:
    • Data aggregation, e.g. SELECT … GROUP BY date (data volume: ~10 GB)
    • Correlation search, e.g. SVR computing (data volume: ~5 GB × 12³ = 8.64 TB)
    We need parallel computing to divide the RAM requirements and the processing time!


  11. 11
    How to build the platform?


  12. 12
    IT drivers
    Requirements → IT drivers:
    • Data aggregation: aggregate data from MB- to GB-scale files with sequential reads → IO elasticity
    • Large task execution: SVR and NLP execution time is ~100 ms per task → CPU elasticity
    • Large RAM execution: process many TB of in-memory data → RAM elasticity
    • Cost elasticity: increase the ROI of the research project while decreasing the TCO → low CAPEX, low OPEX, OSS software, commodity hardware


  13. 13
    Available solutions
    [Matrix comparing RDBMS, Hadoop, AWS Elastic MapReduce, HPC and in-memory analytics against the drivers: IO elasticity, CPU elasticity, OSS software, cost elasticity, RAM elasticity and commodity hardware. Some criteria are met only "with repartitioning" or "through tasks".]


  14. 14
    AWS Elastic MapReduce Architecture
    Source: AWS


  15. 15
    Hadoop components
    • HDFS: distributed file storage
    • MapReduce: parallel processing framework
    • Pig: flow processing
    • Streaming: MR scripting
    • Hive: SQL-like querying
    • BI tools: Tableau, Pentaho, …
    • Mahout: machine learning
    • Hama: bulk synchronous processing
    • Data mining tools: R, SAS
    • Sqoop: RDBMS integration
    • Zookeeper: coordination service
    • Flume: data stream integration
    • Hue: Hadoop GUI
    • HBase: NoSQL on HDFS
    • Solr: full-text search
    • Oozie: MR workflow
    • Custom apps: Java, C#, PHP, …
    All on a grid of commodity hardware, for storage and processing


  16. 16
    General architecture of the platform
    • AWS S3: stores the raw data and the result files
    • 1 master instance and 2 core instances (2 × m2.4xlarge)
    • 4 task instances (4 × m2.4xlarge), for the SVR and NLP processing only
    • Redis: stores the detailed results for drill-down by the DataViz application


  17. 17
    Data aggregation with a Pig job flow
    records = LOAD '/input/forums/messages.txt'
        USING PigStorage(';')  -- forum dumps are ';'-separated (see mapper.py)
        AS (str_date:chararray, message:chararray, url:chararray);
    date_grouped = GROUP records BY str_date;
    results = FOREACH date_grouped GENERATE group, COUNT(records);
    DUMP results;
    Num_of_messages_by_date.pig
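    The script can be smoke-tested against a local file system before running on the cluster, for example:
    pig -x local Num_of_messages_by_date.pig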


  18. 18
    Hadoop Streaming
    Hadoop Streaming runs map/reduce jobs with any executables or scripts, communicating through standard input and standard output.
    It looks like this (on a cluster):
    cat input.txt | map.py | sort | reduce.py
    Why Hadoop Streaming?
    • Intensive use of NLTK for Natural Language Processing
    • Intensive use of NumPy and scikit-learn for Machine Learning
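    For reference, submitting the jobs below on a Hadoop 1.x cluster of that era would look roughly like this (input/output paths are illustrative):
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -input /input/forums/messages.txt \
        -output /output/stem_distribution \
        -mapper mapper.py \
        -reducer reducer.py \
        -file mapper.py -file reducer.py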


  19. 19
    Stemmed word distribution with Hadoop Streaming, mapper.py
    import sys
    from nltk.tokenize import regexp_tokenize
    from nltk.stem.snowball import FrenchStemmer

    # Build the stemmer once, rather than per input line
    stemmer = FrenchStemmer()

    # Input comes from STDIN (standard input), one ';'-separated record per line
    for line in sys.stdin:
        line = line.strip()
        str_date, message, url = line.split(";")
        tokens = regexp_tokenize(message, pattern=r'\w+')
        for token in tokens:
            word = stemmer.stem(token)
            if len(word) >= 3:  # drop very short tokens
                print '%s;%s' % (word, str_date)
    Stem_distribution_by_date/mapper.py


  20. 20
    Stemmed word distribution with Hadoop Streaming, reducer.py
    import sys
    import json
    from itertools import groupby
    from operator import itemgetter
    from nltk.probability import FreqDist

    def read(f):
        # Yield (stem, date) pairs from the ';'-separated mapper output
        for line in f:
            line = line.strip()
            yield line.split(';')

    data = read(sys.stdin)
    # Hadoop sorts the mapper output by key, so records for a stem arrive together
    for current_stem, group in groupby(data, itemgetter(0)):
        dates = [item[1] for item in group]
        freq_dist = FreqDist(dates)  # occurrences of the stem per date
        print "%s;%s" % (current_stem, json.dumps(dict(freq_dist)))
    Stem_distribution_by_date/reducer.py
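    Both scripts can be tested locally, without a cluster, by emulating the streaming pipeline on a small sample (file name assumed):
    cat sample_messages.txt | python mapper.py | sort | python reducer.py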


  21. 21
    Conclusions


  22. 22
    Conclusions
     The correlation search currently identifies 462 variables correlated with the claims, with R² ≥ 80% and a lag ≥ 1 month
     Amazon Elastic MapReduce provides the elasticity required by the morphology of the jobs, as well as cost elasticity
     Monthly cost with zero activity: < €5
     Monthly cost with intensive activity: < €1,000
     The equivalent cost of a dedicated platform would be around €50,000
     The S3 transfer overhead is not a problem, given the volume of stored data
     During correlation search processing, at most 80% of the virtual CPUs are used, because job scheduling yields a parallelism factor of 36 instead of the 48 available with SMP


  23. 23
    Future work
    Data mining
     Increase the number of data sources
     Test the robustness of the predictive model over time
     Reduce the overfitting of the correlations
     Enhance the correlation search by testing word combinations
    IT
     Switch only the correlation search to a MapReduce engine for SMP architectures and clusters of cores, inspired by Stanford Phoenix and the Nokia Disco engine
     Industrialize the data mining components as a platform, for generalization to IARD (property and casualty) insurance, banking, e-commerce, telecoms and retail


  24. 24
    OCTO in a nutshell
    IT consulting firm
     Established in 1998
     175 employees
     €19.5 million turnover worldwide (2011)
     Verticals-based organization:
     Banking – Financial Services
     Insurance
     Media – Internet – Leisure
     Industry – Distribution
     Telecom – Services
    Big Data Analytics offer
     Business case and benchmark studies
     Business proof of concept
     Data feeds: web trends
     Big Data and Analytics architecture design
     Big Data project delivery
     Training and seminars: Big Data, Hadoop


  25. 25
    Thank you!
