Health Insurance Predictive Analysis with Hadoop and Machine Learning. JULIEN CABOT at Big Data Spain 2012

Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
More info: http://www.bigdataspain.org/es-2012/conference/health-insurance-predictive-analysis-with-hadoop-and-machine-learning/julien-cabot


Big Data Spain

November 16, 2012


  1. Health Insurance Predictive Analysis with MapReduce and Machine Learning

    Julien Cabot, Managing Director, OCTO (jcabot@octo.com, @julien_cabot). Madrid, 16th of November 2012. www.bigdataspain.org
    OCTO: 50, avenue des Champs-Elysées, 75008 Paris, FRANCE. Tel: +33 (0)1 58 56 10 00, Fax: +33 (0)1 58 56 10 01. www.octo.com. © OCTO 2012
  2. Internet as a Data Source…

    Internet as the voice of the crowd
  3. … in Healthcare

    71% about:
    • Illness
    • Symptom
    • Medicine
    • Advice / opinion
    Main sources are old-school forums, not social networks.
  4. Benefits for Insurance Company?

    • Understand the patient's subjects of interest, to design customer-centric products and marketing actions
    • Anticipate the psycho-social effect of the Internet, to prevent excessive consultations (and reimbursements)
    • Predict claims by monitoring requests about symptoms and drugs
  5. How to run the predictive analysis?

  6. The data problem

    • Understand the semantic field of Healthcare… as used on the Internet
    • Find correlations between the evolution of claims and… many millions of unidentified external variables
    • Find correlated variables… that anticipate the claims
    We need some help from Machine Learning!
  7. Correlation search in external datasets

    Inputs to the Correlation Search Machine:
    • Trends of medical keywords used in forums (automated tokenization of messages per posted date, and semantic tagging)
    • Trends of medical keywords searched in Google (Google search volume of symptom and drug keywords)
    • Trends of socio-economic factors (socio-economic context from Open Data initiatives)
    • Health claims by act typology
    Output: matrix of determination coefficients (R²), sorted
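The correlation search described on this slide can be sketched roughly as follows. This is a minimal illustration, not the deck's implementation: it uses a plain least-squares R² as a stand-in for the SVR-based fit, and the function names (`r_squared`, `lagged_r_squared`, `correlation_search`), the `max_lag` parameter and the two monthly series are all hypothetical.

```python
from statistics import mean


def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line y = a*x + b."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series explains nothing
    return (sxy * sxy) / (sxx * syy)


def lagged_r_squared(variable, claims, lag):
    """R² between variable[t] and claims[t + lag], with lag >= 0 in periods."""
    if lag == 0:
        return r_squared(variable, claims)
    return r_squared(variable[:-lag], claims[lag:])


def correlation_search(variables, claims, max_lag=3):
    """Return (name, best_lag, r2) rows sorted by decreasing R²."""
    rows = []
    for name, series in variables.items():
        # Only lags >= 1 are tried, so the variable anticipates the claims.
        best_lag, best_r2 = max(
            ((lag, lagged_r_squared(series, claims, lag))
             for lag in range(1, max_lag + 1)),
            key=lambda pair: pair[1],
        )
        rows.append((name, best_lag, best_r2))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical monthly series, made up for illustration only.
claims = [10, 12, 11, 15, 18, 17, 21, 24]
variables = {
    "keyword_a": [12, 11, 15, 18, 17, 21, 24, 26],  # leads claims by one month
    "keyword_b": [5, 9, 4, 8, 5, 9, 4, 8],          # unrelated oscillation
}
ranking = correlation_search(variables, claims)
for name, lag, r2 in ranking:
    print("%s lag=%d R2=%.2f" % (name, lag, r2))
```

Each external variable is scored at several lags and the rows are sorted by R², which mirrors the "sorted matrix" output in a much reduced form.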
  8. How to tag Healthcare words?

    Understand the semantic field of Healthcare.
    Healthcare semantic field keywords database:
    1-Build a first list of keywords
    2-Enrich the list with highly searched keywords
    3-Learn automatically from Wikipedia Medical Categories
    Message tokenization by date; word stemming, tagging and common word filtering with NLTK
    Timelines of healthcare keywords
  9. How to find correlations between time series?

    • Compare the evolution of the variable and of the claims over time
    • Find a non-linear regression and learn a polymorphic predictive function f(x) from the dataset with Support Vector Regression (SVR)
    (Figure: y against x, showing f(x) and the tube bounds f(x) + ε and f(x) − ε.)
    Problem to solve:
    $$\min_{w} \frac{1}{2}\lVert w \rVert^{2} \quad \text{subject to} \quad y_i - (w \cdot \phi(x_i) + b) \le \varepsilon, \quad (w \cdot \phi(x_i) + b) - y_i \le \varepsilon$$
    Resolution:
    • Stochastic gradient descent
    • Test the response through the coefficient of determination R²
    Open source ML library helps!
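To make the "stochastic gradient descent, then check R²" resolution concrete, here is a minimal sketch under simplifying assumptions: the feature map ϕ is taken as the identity (so the model is linear, unlike the non-linear regression the slide aims for), the hard constraints are relaxed into an ε-insensitive loss with a small regularization weight, and the data is synthetic. The names `train_linear_svr` and `r_squared` and all hyperparameter values are illustrative, not taken from the deck; in practice a library such as scikit-learn (its `sklearn.svm.SVR` class) would typically do this work.

```python
import random


def train_linear_svr(xs, ys, epsilon=0.1, lam=0.001, lr=0.01, epochs=2000, seed=0):
    """Fit f(x) = w*x + b by SGD on (lam/2)*w^2 + max(0, |y - f(x)| - epsilon)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for i in indices:
            residual = ys[i] - (w * xs[i] + b)
            grad_w = lam * w  # gradient of the regularization term
            grad_b = 0.0
            if residual > epsilon:       # point lies above the tube
                grad_w -= xs[i]
                grad_b -= 1.0
            elif residual < -epsilon:    # point lies below the tube
                grad_w += xs[i]
                grad_b += 1.0
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


def r_squared(ys, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic data on the line y = 2x + 1, made up for illustration only.
xs = [i / 10.0 for i in range(11)]
ys = [2.0 * x + 1.0 for x in xs]
w, b = train_linear_svr(xs, ys)
r2 = r_squared(ys, [w * x + b for x in xs])
print("w=%.2f b=%.2f R2=%.3f" % (w, b, r2))
```

Points inside the ε-tube contribute no loss gradient, so the fit stops moving once every residual is within roughly ε of zero; the R² check then measures how well the learned function tracks the series.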
  10. Data Processing Profiles

    The current volume of external data grabbed is large but not so huge (~10 GB).
    • Data aggregation, e.g. Select … Group By Date
    • Correlation search, e.g. SVR computing
    Data volume: ~5 GB × 12³ = 8.64 TB
    We need Parallel Computing to divide the RAM requirement and the processing time!
  11. How to build the platform?

  12. IT drivers

    Requirements and the IT drivers they lead to:
    • Data aggregation: aggregate data from MB- to GB-sized files with sequential reads → IO Elasticity
    • Large tasks execution: SVR and NLP execution time is ~100 ms per task → CPU Elasticity
    • Large RAM execution: process many TB of in-memory data → RAM Elasticity
    • Low CAPEX, Low OPEX: increase the ROI of the research project while decreasing the TCO → Commodity HW, OSS SW, Cost Elasticity
  13. Available solutions

    (Comparison matrix; the cell values did not survive extraction.)
    Criteria: IO Elasticity, CPU Elasticity, RAM Elasticity, OSS Software, Cost Elasticity, Commodity Hardware
    Solutions: RDBMS, Hadoop, AWS Elastic MapReduce, HPC, In Memory analytics
    Surviving cell annotations: "With repartitioning" (three cells), "Through Task" (two cells)
  14. AWS Elastic MapReduce Architecture (diagram; source: AWS)

  15. Hadoop components

    • HDFS: Distributed file storage
    • MapReduce: Parallel processing framework
    • Pig: Flow processing
    • Streaming: MR scripting
    • Hive: SQL-like querying
    • BI tools: Tableau, Pentaho, …
    • Mahout: Machine Learning
    • Hama: Bulk synchronous processing
    • Data mining tools: R, SAS
    • Sqoop: RDBMS integration
    • Zookeeper: Coordination service
    • Flume: Data stream integration
    • Hue: Hadoop GUI
    • HBase: NoSQL on HDFS
    • Solr: Full text search
    • Oozie: MR workflow
    • Custom App: Java, C#, PHP, …
    Grid of commodity hardware – storage and processing
  16. General architecture of the platform

    Components (diagram): AWS S3, Master Instance, Core Instances 1 and 2, Task Instances 1 to 4, Redis, DataViz Application
    • Core instances: 2 x m2.4xlarge
    • Task instances: 4 x m2.4xlarge, for SVR and NLP processing only
    Storage roles noted on the diagram:
    • Store raw data
    • Store results files
    • Store detailed results for drill down
  17. Data aggregation with Pig

    Job flow (Num_of_messages_by_date.pig):

        -- Load one (date, message, url) record per forum message.
        records = LOAD '/input/forums/messages.txt' AS (str_date:chararray, message:chararray, url:chararray);
        -- Group the messages by posting date.
        date_grouped = GROUP records BY str_date;
        -- Count the messages in each date group.
        results = FOREACH date_grouped GENERATE group, COUNT(records);
        DUMP results;
  18. Hadoop streaming

    Hadoop streaming runs map/reduce jobs with any executables or scripts, through standard input and standard output. Conceptually (on a cluster) it behaves like:
        cat input.txt | map.py | sort | reduce.py
    Why Hadoop streaming?
    • Intensive use of NLTK for Natural Language Processing
    • Intensive use of NumPy and Sklearn for Machine Learning
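The pipe analogy on this slide can be tried without a cluster. The sketch below is a local, in-process imitation of the map, sort and reduce stages using only the standard library; it does not call Hadoop at all. The function names, the `date;message;url` record layout and the sample records are assumptions made for the illustration.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'date;1' record per input message line."""
    for line in lines:
        str_date = line.strip().split(";", 1)[0]
        yield "%s;1" % str_date


def reducer(sorted_records):
    """Sum the counts of consecutive records that share the same key."""
    keyed = (record.split(";") for record in sorted_records)
    for key, group in groupby(keyed, key=lambda pair: pair[0]):
        yield "%s;%d" % (key, sum(int(count) for _, count in group))


def run_streaming_locally(lines):
    """Local equivalent of: cat input.txt | map.py | sort | reduce.py"""
    return list(reducer(sorted(mapper(lines))))


# Hypothetical input records in a 'date;message;url' layout.
sample = [
    "2012-01-02;headache since monday;http://example.org/a",
    "2012-01-01;cough and fever;http://example.org/b",
    "2012-01-02;opinion on this drug;http://example.org/c",
]
counts = run_streaming_locally(sample)
print(counts)  # ['2012-01-01;1', '2012-01-02;2']
```

The sort step matters: the reducer only merges adjacent records, which is why the streaming model guarantees that the reducer input arrives sorted by key.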
  19. Stemmed word distribution with Hadoop streaming, mapper.py

    Stem_distribution_by_date/mapper.py:

        import sys

        from nltk.stem.snowball import FrenchStemmer
        from nltk.tokenize import regexp_tokenize

        # Build the stemmer once rather than once per input line.
        stemmer = FrenchStemmer()

        # Input comes from STDIN (standard input), one "date;message;url" record per line.
        for line in sys.stdin:
            line = line.strip()
            if not line:
                continue
            str_date, message, url = line.split(";")
            # Tokenize the message into words, then stem each word.
            for token in regexp_tokenize(message, pattern=r'\w+'):
                word = stemmer.stem(token)
                # Drop very short stems and emit "stem;date" for the reducer.
                if len(word) >= 3:
                    print('%s;%s' % (word, str_date))
  20. Stemmed word distribution with Hadoop streaming, reducer.py

    Stem_distribution_by_date/reducer.py:

        import json
        import sys
        from itertools import groupby
        from operator import itemgetter

        from nltk.probability import FreqDist


        def read(f):
            """Yield [stem, date] pairs from the sorted mapper output."""
            for line in f:
                line = line.strip()
                if line:
                    yield line.split(';')


        data = read(sys.stdin)
        # The input is sorted by key, so records for one stem are adjacent.
        for current_stem, group in groupby(data, itemgetter(0)):
            values = [item[1] for item in group]
            # Count how many times the stem occurs on each date.
            freq_dist = FreqDist(values)
            print("%s;%s" % (current_stem, json.dumps(freq_dist)))
  21. Conclusions

  22. Conclusions

    • The correlation search currently identifies 462 variables correlated with R² >= 80% and a lag >= 1 month
    • Amazon Elastic MapReduce provides the elasticity required by the morphology of the jobs, as well as cost elasticity
    • Monthly cost with zero activity: < 5 €
    • Monthly cost with intensive activity: < 1 000 €
    • The equivalent cost of the platform would be around 50 000 €
    • The S3 transfer overhead is not a problem given the volume of stored data
    • During correlation search processing, at most 80% of the virtual CPUs are used, because job scheduling gives a parallelism factor of 36 instead of 48 with regard to SMP
  23. Future works

    Data mining
    • Increase the number of data sources
    • Test the robustness of the predictive model over time
    • Reduce the overfitting of the correlation
    • Enhance the correlation search for words by testing combinations
    IT
    • Switch only the correlation search to a map reduce engine for SMP architecture and clusters of cores, inspired by the Stanford Phoenix and the Nokia Disco engines
    • Industrialize the data mining components as a platform, to generalize to IARD insurance, banking, e-commerce, telecoms and retail
  24. OCTO in a nutshell

    Big data Analytics Offer
    • Business case and benchmark studies
    • Business Proof of Concept
    • Data feeds: Web Trends
    • Big Data and Analytics architecture design
    • Big data project delivery
    • Training, seminar: Big Data, Hadoop
    IT Consulting firm
    • Established in 1998
    • 175 employees
    • 19.5 million turnover worldwide (2011)
    • Verticals-based organization: Banking – Financial Services; Insurance; Media – Internet – Leisure; Industry – Distribution; Telecom – Services
    OCTO offices (map)
  25. Thank you!