
Health Insurance Predictive Analysis with Hadoop and Machine Learning. Julien Cabot at Big Data Spain 2012


Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicación, UPM, Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/health-insurance-predictive-analysis-with-hadoop-and-machine-learning/julien-cabot


Transcript

  1. 1
    Tel: +33 (0)1 58 56 10 00
    Fax: +33 (0)1 58 56 10 01
    www.octo.com
    © OCTO 2012
    50, avenue des Champs-Elysées
    75008 Paris - FRANCE
    Health Insurance Predictive Analysis
    with MapReduce and Machine Learning
    Julien Cabot
    Managing Director
    OCTO
    [email protected]
    @julien_cabot
    Madrid
    16th of November 2012
    www.bigdataspain.org


  2. 2
    Internet as a Data Source…
    Internet as the voice of the crowd


  3. 3
    … in Healthcare
    71% of messages are about:
    • Illness
    • Symptoms
    • Medicine
    • Advice / opinions
    The main sources are old-school forums, not social networks


  4. 4
    Benefits for the insurance company?
    • Understand the patients' subjects of interest, to design customer-centric products and marketing actions
    • Anticipate the psycho-social effects of the Internet, to prevent excessive consultations (and reimbursements)
    • Predict claims by monitoring requests about symptoms and drugs


  5. 5
    How to run the predictive analysis?


  6. 6
    The data problem
    • Understand the semantic field of healthcare… as used on the Internet
    • Find correlations between the evolution of the claims and… many millions of unidentified external variables
    • Find the correlated variables that… anticipate the claims
    We need some help from Machine Learning!


  7. 7
    Correlation search in external datasets
    Inputs:
    • Trends of medical keywords used in forums (automated tokenization of messages per posted date, plus semantic tagging)
    • Trends of medical keywords searched in Google (Google search volume of symptom and drug keywords)
    • Trends of socio-economic factors (socio-economic context from Open Data initiatives)
    • Health claims by act typology
    All inputs feed a correlation search machine, which outputs a matrix sorted by coefficient of determination (R²).
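    A minimal sketch of such a correlation search machine, assuming the time series fit in NumPy arrays; the names (r_squared, correlation_search) and the lag range are illustrative, not the production code:

    import numpy as np

    def r_squared(x, y):
        # Coefficient of determination of a simple linear fit of y on x
        slope, intercept = np.polyfit(x, y, 1)
        ss_res = np.sum((y - (slope * x + intercept)) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        return 1.0 - ss_res / ss_tot

    def correlation_search(claims, candidates, max_lag=6):
        # Score every candidate series against the claims at each monthly lag
        scores = []
        for name, series in candidates.items():
            for lag in range(1, max_lag + 1):
                # The candidate leads the claims by `lag` months
                scores.append((r_squared(series[:-lag], claims[lag:]), lag, name))
        return sorted(scores, reverse=True)  # the R²-sorted matrix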


  8. 8
    Understand the semantic field of Healthcare
    How to tag healthcare words?
    1. Build a first list of keywords
    2. Enrich the list with highly searched keywords
    3. Learn automatically from Wikipedia medical categories
    Pipeline: message tokenization by date, then word stemming, tagging and common-word filtering with NLTK, producing timelines of healthcare keywords backed by a healthcare semantic-field keyword database.
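    A minimal sketch of the tagging step with NLTK; the keyword file name is hypothetical:

    from nltk.tokenize import regexp_tokenize
    from nltk.stem.snowball import FrenchStemmer

    stemmer = FrenchStemmer()

    # Healthcare semantic-field keyword database, one keyword per line
    with open('healthcare_keywords.txt') as f:
        healthcare_stems = set(stemmer.stem(w.strip()) for w in f)

    def tag_healthcare_words(message):
        # Return the healthcare-related stems found in a forum message
        tokens = regexp_tokenize(message.lower(), pattern=r'\w+')
        return [stemmer.stem(t) for t in tokens
                if stemmer.stem(t) in healthcare_stems]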


  9. 9
    How to find correlations between time series?
    • Compare the evolution of the variable and of the claims over time
    • Fit a non-linear regression and learn a polymorphic predictive function f(x) from the dataset with Support Vector Regression (SVR)
    [Figure: data points inside the ε-insensitive tube, between f(x) - ε and f(x) + ε]
    Problem to solve:
        min_w (1/2)·||w||²
        subject to: y_i - (w·ϕ(x_i) + b) ≤ ε
                    (w·ϕ(x_i) + b) - y_i ≤ ε
    Resolution:
    • Stochastic gradient descent
    • Test the response through the coefficient of determination (R²)
    Open-source ML libraries help!
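    A minimal sketch of this step with scikit-learn; the RBF kernel, the parameters and the synthetic data are assumptions for illustration, not the talk's actual settings:

    import numpy as np
    from sklearn.svm import SVR
    from sklearn.metrics import r2_score

    rng = np.random.RandomState(0)
    x = np.sort(rng.uniform(0, 10, 200)).reshape(-1, 1)  # external variable
    y = np.sin(x).ravel() + rng.normal(0, 0.1, 200)      # claims-like target

    # epsilon is the half-width of the insensitive tube around f(x)
    model = SVR(kernel='rbf', C=10.0, epsilon=0.1).fit(x, y)

    # Test the response through the coefficient of determination R²
    print(r2_score(y, model.predict(x)))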


  10. 10
    The current volume of external data grabbed is large, but not that huge (~10 GB)
    Data processing profiles:
    • Data aggregation, e.g. SELECT … GROUP BY date (data volume: ~10 GB)
    • Correlation search, e.g. SVR computing (data volume: ~5 GB × 12³ = 8.64 TB)
    We need parallel computing to divide the RAM requirements and the processing time!


  11. 11
    How to build the platform?


  12. 12
    IT drivers
    Requirements → IT drivers:
    • Data aggregation: aggregate data from MB- to GB-scale files with sequential reads → IO elasticity
    • Large task execution: SVR and NLP execution time is ~100 ms per task → CPU elasticity
    • Large RAM execution: process many TB of in-memory data → RAM elasticity
    • Cost elasticity: increase the ROI of the research project while decreasing the TCO → low CAPEX, low OPEX, OSS software, commodity hardware


  13. 13
    Available solutions
    [Matrix comparing RDBMS, Hadoop, AWS Elastic MapReduce, HPC and in-memory analytics against the drivers: IO elasticity, CPU elasticity, OSS software, cost elasticity, RAM elasticity and commodity hardware. Some criteria are met only "with repartitioning" or "through tasks".]


  14. 14
    AWS Elastic MapReduce Architecture
    Source: AWS


  15. 15
    Hadoop components
    • HDFS: distributed file storage
    • MapReduce: parallel processing framework
    • Pig: flow processing
    • Streaming: MR scripting
    • Hive: SQL-like querying
    • BI tools: Tableau, Pentaho, …
    • Mahout: machine learning
    • Hama: bulk synchronous processing
    • Data mining tools: R, SAS
    • Sqoop: RDBMS integration
    • Zookeeper: coordination service
    • Flume: data stream integration
    • Hue: Hadoop GUI
    • HBase: NoSQL on HDFS
    • Solr: full-text search
    • Oozie: MR workflow
    • Custom apps: Java, C#, PHP, …
    All on a grid of commodity hardware, for storage and processing


  16. 16
    General architecture of the platform
    • AWS S3: stores the raw data and the result files
    • 1 master instance and 2 core instances (2 × m2.4xlarge)
    • 4 task instances (4 × m2.4xlarge), for the SVR and NLP processing only
    • Redis: stores the detailed results for drill-down by the DataViz application


  17. 17
    Data aggregation with a Pig job flow
    records = LOAD '/input/forums/messages.txt'
        USING PigStorage(';')  -- forum dumps are ';'-separated (see mapper.py)
        AS (str_date:chararray, message:chararray, url:chararray);
    date_grouped = GROUP records BY str_date;
    results = FOREACH date_grouped GENERATE group, COUNT(records);
    DUMP results;
    Num_of_messages_by_date.pig
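    The script can be smoke-tested against a local file system before running on the cluster, for example:
    pig -x local Num_of_messages_by_date.pig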


  18. 18
    Hadoop Streaming
    Hadoop Streaming runs map/reduce jobs with any executables or scripts, communicating through standard input and standard output.
    It looks like this (on a cluster):
    cat input.txt | map.py | sort | reduce.py
    Why Hadoop Streaming?
    • Intensive use of NLTK for Natural Language Processing
    • Intensive use of NumPy and scikit-learn for Machine Learning
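    For reference, submitting the jobs below on a Hadoop 1.x cluster of that era would look roughly like this (input/output paths are illustrative):
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -input /input/forums/messages.txt \
        -output /output/stem_distribution \
        -mapper mapper.py \
        -reducer reducer.py \
        -file mapper.py -file reducer.py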


  19. 19
    Stemmed word distribution with Hadoop Streaming, mapper.py
    import sys
    from nltk.tokenize import regexp_tokenize
    from nltk.stem.snowball import FrenchStemmer

    # Build the stemmer once, rather than per input line
    stemmer = FrenchStemmer()

    # Input comes from STDIN (standard input), one ';'-separated record per line
    for line in sys.stdin:
        line = line.strip()
        str_date, message, url = line.split(";")
        tokens = regexp_tokenize(message, pattern=r'\w+')
        for token in tokens:
            word = stemmer.stem(token)
            if len(word) >= 3:  # drop very short tokens
                print '%s;%s' % (word, str_date)
    Stem_distribution_by_date/mapper.py


  20. 20
    Stemmed word distribution with Hadoop Streaming, reducer.py
    import sys
    import json
    from itertools import groupby
    from operator import itemgetter
    from nltk.probability import FreqDist

    def read(f):
        # Yield (stem, date) pairs from the ';'-separated mapper output
        for line in f:
            line = line.strip()
            yield line.split(';')

    data = read(sys.stdin)
    # Hadoop sorts the mapper output by key, so records for a stem arrive together
    for current_stem, group in groupby(data, itemgetter(0)):
        dates = [item[1] for item in group]
        freq_dist = FreqDist(dates)  # occurrences of the stem per date
        print "%s;%s" % (current_stem, json.dumps(dict(freq_dist)))
    Stem_distribution_by_date/reducer.py
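    Both scripts can be tested locally, without a cluster, by emulating the streaming pipeline on a small sample (file name assumed):
    cat sample_messages.txt | python mapper.py | sort | python reducer.py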


  21. 21
    Conclusions


  22. 22
    Conclusions
     The correlation search currently identifies 462 variables correlated with the claims, with R² ≥ 80% and a lag ≥ 1 month
     Amazon Elastic MapReduce provides the elasticity required by the morphology of the jobs, as well as cost elasticity
     Monthly cost with zero activity: < €5
     Monthly cost with intensive activity: < €1,000
     The equivalent cost of a dedicated platform would be around €50,000
     The S3 transfer overhead is not a problem, given the volume of stored data
     During correlation search processing, at most 80% of the virtual CPUs are used, because job scheduling yields a parallelism factor of 36 instead of the 48 available with SMP


  23. 23
    Future work
    Data mining
     Increase the number of data sources
     Test the robustness of the predictive model over time
     Reduce the overfitting of the correlations
     Enhance the correlation search by testing word combinations
    IT
     Switch only the correlation search to a MapReduce engine for SMP architectures and clusters of cores, inspired by Stanford Phoenix and the Nokia Disco engine
     Industrialize the data mining components as a platform, for generalization to IARD (property and casualty) insurance, banking, e-commerce, telecoms and retail


  24. 24
    OCTO in a nutshell
    IT consulting firm
     Established in 1998
     175 employees
     €19.5 million turnover worldwide (2011)
     Verticals-based organization:
     Banking – Financial Services
     Insurance
     Media – Internet – Leisure
     Industry – Distribution
     Telecom – Services
    Big Data Analytics offer
     Business case and benchmark studies
     Business proof of concept
     Data feeds: web trends
     Big Data and Analytics architecture design
     Big Data project delivery
     Training and seminars: Big Data, Hadoop


  25. 25
    Thank you!
