
Record Linkage - A real use case with Spark ML

Presentation from Spark Saturday DC, April 30th 2016

Alexis Seigneurin

April 30, 2016

Transcript

  1. RECORD LINKAGE,
    A REAL USE CASE WITH SPARK ML
    Alexis Seigneurin


  2. Who I am
    • Software engineer for 15 years
    • Consultant at Ippon USA, previously at Ippon France
    • Favorite subjects: Spark, Machine Learning, Cassandra
    • Spark trainer
    • @aseigneurin


  3. • 200 software engineers in France and the US
    • In the US: offices in DC, NYC and Richmond, Virginia
    • Digital, Big Data and Cloud applications
    • Java & Agile expertise
    • Open-source projects: JHipster, Tatami, etc.
    • @ipponusa


  4. The project
    • Record Linkage with Machine Learning
    • Use cases:
      • Find new clients coming from insurance comparison services → commission
      • Find duplicates in existing files (acquisitions)
    • Also known as:
      • Record Linkage
      • Entity resolution
      • Deduplication
      • Entity disambiguation
      • …


  5. Overview


  6. Purpose
    • Find duplicates!
    +---+-------+------------+----------+------------+---------+------------+
    | ID| veh|codptgar_veh|dt_nais_cp|dt_permis_cp|redmaj_cp| formule|
    +---+-------+------------+----------+------------+---------+------------+
    |...|PE28221| 50000|1995-10-12| 2013-10-08| 100.0| TIERS|
    |...|FO26016| 59270|1967-01-01| 1987-02-01| 100.0|VOL_INCENDIE|
    |...|FI19107| 77100|1988-09-27| 2009-09-13| 105.0|TOUS_RISQUES|
    |...|RE07307| 69100|1984-08-15| 2007-04-20| 50.0| TIERS|
    |...|FO26016| 59270|1967-01-01| 1987-02-07| 105.0|TOUS_RISQUES|
    +---+-------+------------+----------+------------+---------+------------+


  7. Steps
    1. Preprocessing
      1. Find potential duplicates
      2. Feature engineering
    2. Manual labeling of a sample
    3. Machine Learning to make predictions on the rest of the records


  8. Prototype
    • Crafted by a Data Scientist
    • Not architected, not versioned, not unit tested…
    → Not ready for production
    • Spark, but a lot of Spark SQL (data processing)
    • Machine Learning in Python (scikit-learn)
    → Objective: industrialize the code


  9. Preprocessing


  10. Inputs
    • Data (CSV) + Schema (JSON)
    000010;Jose;Lester;10/10/1970
    000011;José;Lester;10/10/1970
    000012;Tyler;Hunt;12/12/1972
    000013;Tiler;Hunt;25/12/1972
    000014;Patrick;Andrews;1973-12-13
    {
      "tableSchemaBeforeSelection": [
        {
          "name": "ID",
          "typeField": "StringType",
          "hardJoin": false
        },
        {
          "name": "name",
          "typeField": "StringType",
          "hardJoin": true,
          "cleaning": "digitLetter",
          "listFeature": [ "scarcity" ],
          "listDistance": [ "equality", "soundLike" ]
        },
        ...


  11. Data loading
    • Spark CSV module → DataFrame
    • Don’t use type inference
    +------+-------+-------+----------+
    | ID| name|surname| birthDt|
    +------+-------+-------+----------+
    |000010| Jose| Lester|10/10/1970|
    |000011| José| Lester|10/10/1970|
    |000012| Tyler| Hunt|12/12/1972|
    |000013| Tiler| Hunt|25/12/1972|
    |000014|Patrick|Andrews|1970-10-10|
    +------+-------+-------+----------+
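    A minimal sketch of this loading step, assuming Spark 1.6 with the Databricks spark-csv package; the path, delimiter, header option and column names are assumptions, and every column is read as a string since type inference is disabled (sqlContext is assumed to be available, e.g. from the Spark shell):

    // Hypothetical sketch: load the CSV as strings only; types are applied later,
    // in the cleansing step, based on the JSON schema.
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "false")
      .option("delimiter", ";")
      .option("inferSchema", "false") // don't use type inference
      .load("data/records.csv")
      .toDF("ID", "name", "surname", "birthDt")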


  12. Data cleansing
    • Parsing of dates, numbers…
    • Cleaning of strings
    • UDF + DataFrame.withColumn()
    +------+-------+-------+----------+
    | ID| name|surname| birthDt|
    +------+-------+-------+----------+
    |000010| jose| lester|1970-10-10|
    |000011| jose| lester|1970-10-10|
    |000012| tyler| hunt|1972-12-12|
    |000013| tiler| hunt|1972-12-25|
    |000014|patrick|andrews| null|
    +------+-------+-------+----------+
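    A minimal sketch of such cleaning UDFs; the exact rules (lower-casing, accent stripping, dd/MM/yyyy parsing, null on failure) are assumptions consistent with the tables above:

    import java.text.{Normalizer, SimpleDateFormat}
    import java.sql.Date
    import org.apache.spark.sql.functions.udf

    // Lower-case, trim and strip accents: "José" -> "jose"
    val cleanString = udf((s: String) =>
      if (s == null) null
      else Normalizer.normalize(s.toLowerCase.trim, Normalizer.Form.NFD).replaceAll("\\p{M}", ""))

    // Parse dd/MM/yyyy dates, return null when the value cannot be parsed
    val parseDate = udf((s: String) => {
      val fmt = new SimpleDateFormat("dd/MM/yyyy")
      fmt.setLenient(false)
      try { new Date(fmt.parse(s).getTime) } catch { case _: Exception => null }
    })

    val cleanedDF = df
      .withColumn("name", cleanString(df("name")))
      .withColumn("surname", cleanString(df("surname")))
      .withColumn("birthDt", parseDate(df("birthDt")))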


  13. Feature calculation
    • Convert strings to phonetics (Beider-Morse)
    • Apache Commons Codec
    • …
    +------+-------+-------+----------+--------------------+
    | ID| name|surname| birthDt| BMencoded_name|
    +------+-------+-------+----------+--------------------+
    |000010| jose| lester|1970-10-10|ios|iosi|ioz|iozi...|
    |000011| jose| lester|1970-10-10|ios|iosi|ioz|iozi...|
    |000012| tyler| hunt|1972-12-12| tilir|
    |000013| tiler| hunt|1972-12-25| tQlir|tili|tilir|
    |000014|patrick|andrews| null|pYtrQk|pYtrik|pat...|
    +------+-------+-------+----------+--------------------+
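    A minimal sketch of the Beider-Morse encoding with Apache Commons Codec, wrapped in a UDF; the output column name follows the table above and default encoder settings are assumed:

    import org.apache.commons.codec.language.bm.BeiderMorseEncoder
    import org.apache.spark.sql.functions.udf

    // Beider-Morse produces a pipe-separated list of phonetic spellings,
    // e.g. "jose" -> "ios|iosi|ioz|iozi|..."
    // (the encoder is created inside the UDF to avoid serialization concerns)
    val bmEncode = udf((s: String) =>
      if (s == null) null
      else try { new BeiderMorseEncoder().encode(s) } catch { case _: Exception => null })

    val withPhonetics = cleanedDF.withColumn("BMencoded_name", bmEncode(cleanedDF("name")))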


  14. Find potential duplicates
    • Auto-join (more on that later…)
    +------+------+---------+...+------+------+---------+...
    | ID_1|name_1|surname_1|...| ID_2|name_2|surname_2|...
    +------+------+---------+...+------+------+---------+...
    |000010| jose| lester|...|000011| jose| lester|...
    |000012| tyler| hunt|...|000013| tiler| hunt|...
    +------+------+---------+...+------+------+---------+...


  15. Distance calculation
    • Several distance algorithms:
      • Levenshtein distance (Apache Commons Lang)
      • Date difference
      • …
    +------+...+------+...+-------------+--------------+...+----------------+
    | ID_1|...| ID_2|...|equality_name|soundLike_name|...|dateDiff_birthDt|
    +------+...+------+...+-------------+--------------+...+----------------+
    |000010|...|000011|...| 0.0| 0.0|...| 0.0|
    |000012|...|000013|...| 1.0| 0.0|...| 13.0|
    +------+...+------+...+-------------+--------------+...+----------------+
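    A minimal sketch of two such distances computed over the joined pairs; pairsDF is a placeholder for the auto-joined DataFrame, and the Levenshtein column is just one example alongside the equality/soundLike columns shown above:

    import org.apache.commons.lang3.StringUtils
    import org.apache.spark.sql.functions.{abs, datediff, udf}

    // Raw Levenshtein edit distance between the two name columns
    val levenshtein = udf((s1: String, s2: String) =>
      if (s1 == null || s2 == null) None
      else Some(StringUtils.getLevenshteinDistance(s1, s2).toDouble))

    val withDistances = pairsDF
      .withColumn("levenshtein_name", levenshtein(pairsDF("name_1"), pairsDF("name_2")))
      .withColumn("dateDiff_birthDt", abs(datediff(pairsDF("birthDt_1"), pairsDF("birthDt_2"))))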


  16. Standardization / vectorization
    • Standardization of distances only
    • Vectorization (2 vectors)
    • Spark ML’s VectorAssembler
    +------+------+---------+----------+------+------+---------+----------+------------+--------------+
    | ID_1|name_1|surname_1| birthDt_1| ID_2|name_2|surname_2| birthDt_2| distances|other_features|
    +------+------+---------+----------+------+------+---------+----------+------------+--------------+
    |000010| jose| lester|1970-10-10|000011| jose| lester|1970-10-10|[0.0,0.0,...| [2.0,2.0,...|
    |000012| tyler| hunt|1972-12-12|000013| tiler| hunt|1972-12-25|[1.0,1.0,...| [1.0,2.0,...|
    +------+------+---------+----------+------+------+---------+----------+------------+--------------+
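    A minimal sketch with Spark ML's VectorAssembler; the input column lists are assumptions based on the tables above, and StandardScaler stands in for the "distances only" standardization:

    import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}

    // One vector for the distances, one for the non-distance features
    val distanceAssembler = new VectorAssembler()
      .setInputCols(Array("equality_name", "soundLike_name", "dateDiff_birthDt"))
      .setOutputCol("raw_distances")
    val featureAssembler = new VectorAssembler()
      .setInputCols(Array("scarcity_name")) // assumed non-distance feature columns
      .setOutputCol("other_features")

    // Standardize the distance vector only
    val scaler = new StandardScaler()
      .setInputCol("raw_distances")
      .setOutputCol("distances")

    val assembled = featureAssembler.transform(distanceAssembler.transform(withDistances))
    val vectorized = scaler.fit(assembled).transform(assembled)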


  17. Spark SQL → DataFrames


  18. From SQL…
    • Generated SQL requests
    • Hard to maintain (especially with regard to UDFs)

    val cleaningRequest = tableSchema.map(x => {
      x.cleaningFunction match {
        case (Some(udf), _) => udf + "(" + x.name + ") as " + x.name
        case _ => x.name
      }
    }).mkString(", ")
    val cleanedTable = sqlContext.sql("select " + cleaningRequest + " from " + tableName)
    cleanedTable.registerTempTable(schema.tableName + "_cleaned")


  19. … to DataFrames
    • DataFrame primitives
    • More work done by the Scala compiler

    val cleanedDF = tableSchema.filter(_.cleaning.isDefined).foldLeft(df) {
      case (df, field) =>
        val udf: UserDefinedFunction = ... // get the cleaning UDF
        df.withColumn(field.name + "_cleaned", udf.apply(df(field.name)))
          .drop(field.name)
          .withColumnRenamed(field.name + "_cleaned", field.name)
    }


  20. Unit testing


  21. Unit testing
    • ScalaTest + Scoverage
    • Coverage of all the data processing operations


  22. val resDF = schema.cleanTable(rows)

    "The cleaning process" should "clean text fields" in {
      val res = resDF.select("ID", "name", "surname").collect()
      val expected = Array(
        Row("000010", "jose", "lester"),
        Row("000011", "jose", "lester ea"),
        Row("000012", "jose", "lester")
      )
      res should contain theSameElementsAs expected
    }

    "The cleaning process" should "parse dates" in {
    ...

    → Comparison of Row objects

    Input data:
    000010;Jose;Lester;10/10/1970
    000011;Jose =-+;Lester éà;10/10/1970
    000012;Jose;Lester;invalid date


  23. Shared SparkContext
    • Don’ts:
      • Use one SparkContext per class of tests → multiple contexts
      • Set up / tear down the SparkContext for each test → slow tests
    • Do’s:
      • Use a shared SparkContext

    object SparkTestContext {
      val conf = new SparkConf()
        .setAppName("deduplication-tests")
        .setMaster("local[*]")

      val sc = new SparkContext(conf)
      val sqlContext = new SQLContext(sc)
    }
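    A test class can then simply reference the shared object; a minimal sketch (the class name and the test body are assumptions):

    import org.scalatest.{FlatSpec, Matchers}

    class CleaningSpec extends FlatSpec with Matchers {
      // reuse the single SparkContext/SQLContext across all test classes
      import SparkTestContext.sqlContext

      "The cleaning process" should "lower-case names" in {
        val df = sqlContext.createDataFrame(Seq(("000010", "Jose"))).toDF("ID", "name")
        // ... apply the cleaning and assert on the collected Rows
      }
    }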


  24. spark-testing-base
    • Holden Karau’s spark-testing-base library
    • https://github.com/holdenk/spark-testing-base
    • Provides:
    • Shared SparkContext and SQLContext
    • Comparison of RDDs, DataFrames, DataSets
    • Mock data generators
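    A hedged sketch of a DataFrame comparison with spark-testing-base, assuming its DataFrameSuiteBase trait; the suite name and the data are placeholders:

    import com.holdenkarau.spark.testing.DataFrameSuiteBase
    import org.scalatest.FunSuite

    class DeduplicationSuite extends FunSuite with DataFrameSuiteBase {
      // DataFrameSuiteBase provides a shared SparkContext and SQLContext
      test("cleaning keeps the expected rows") {
        val expected = sqlContext.createDataFrame(Seq(("000010", "jose"))).toDF("ID", "name")
        val actual = sqlContext.createDataFrame(Seq(("000010", "jose"))).toDF("ID", "name")
        assertDataFrameEquals(expected, actual) // compares schemas and content
      }
    }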


  25. Matching potential duplicates


  26. Join strategy
    • For record linkage, first merge the two sources
    • Then auto-join
    [Diagram: "Prospects" and "New clients" sources, with a duplicate record appearing in both]


  27. Join - Volume of data
    • Input: 1M records
    • Cartesian product: 1,000 billion (10¹²) records
    → Find an appropriate join condition


  28. Join condition
    • Multiple joins, each on 2 fields
    • Equality of values or custom condition (UDF)
    • Union of all the intermediate results
    • E.g. with fields name, surname, birth_date:

    df1.join(df2, (df1("ID_1") < df2("ID_2"))
      && (df1("name_1") === df2("name_2"))
      && (soundLike(df1("surname_1"), df2("surname_2"))))

    df1.join(df2, (df1("ID_1") < df2("ID_2"))
      && (df1("name_1") === df2("name_2"))
      && (df1("birth_date_1") === df2("birth_date_2")))

    df1.join(df2, (df1("ID_1") < df2("ID_2"))
      && (soundLike(df1("surname_1"), df2("surname_2")))
      && (df1("birth_date_1") === df2("birth_date_2")))

    → UNION of the three results
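    A hedged sketch of the union step; the value names are placeholders for the three joins above, and unionAll/dropDuplicates are standard Spark 1.x DataFrame operations:

    // joinNameSound, joinNameBirth, joinSoundBirth: the three DataFrames
    // produced by the joins above (placeholder names)
    val potentialDuplicates = joinNameSound
      .unionAll(joinNameBirth)
      .unionAll(joinSoundBirth)
      .dropDuplicates(Seq("ID_1", "ID_2")) // a pair matched by several conditions appears once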


  29. DataFrames extension


  30. DataFrames extension
    • 3 types of columns: data, distances, and non-distance features
    +------+...+------+...+-------------+--------------+...+----------------+
    | ID_1|...| ID_2|...|equality_name|soundLike_name|...|dateDiff_birthDt|
    +------+...+------+...+-------------+--------------+...+----------------+
    |000010|...|000011|...| 0.0| 0.0|...| 0.0|
    |000012|...|000013|...| 1.0| 0.0|...| 13.0|
    +------+...+------+...+-------------+--------------+...+----------------+


  31. DataFrames extension
    • DataFrame columns have a name and a data type
    • DataFrameExt = DataFrame + metadata over columns

    case class OutputColumn(name: String, columnType: ColumnType)

    class DataFrameExt(val df: DataFrame, val outputColumns: Seq[OutputColumn]) {
      def show() = df.show()
      def drop(colName: String): DataFrameExt = ...
      def withColumn(colName: String, col: Column, columnType: ColumnType): DataFrameExt = ...
      ...


  32. Labeling


  33. Labeling
    • Manual operation
    • Is this a duplicate? → Yes / No
    • Performed on a sample of the potential duplicates
    • Between 1000 and 10 000 records


  34. Labeling


  35. Predictions


  36. Predictions
    • Machine Learning - Spark ML
    • Random Forests
    • (Gradient Boosting Trees also give good results)
    • Training on the potential duplicates labeled by hand
    • Predictions on the potential duplicates not labeled by hand
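    A minimal sketch of the training step with Spark ML's RandomForestClassifier, assuming the hand-labeled DataFrame (labeledDF) has a label column and the distances/other_features vectors shown earlier; labeledDF, unlabeledDF and the column names are placeholders, and the 0.8/0.2 split matches the 800/200 sample on the next slide:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.feature.VectorAssembler

    // Combine the two vectors into a single features column, then train a Random Forest
    val assembler = new VectorAssembler()
      .setInputCols(Array("distances", "other_features"))
      .setOutputCol("features")
    val rf = new RandomForestClassifier()
      .setLabelCol("label") // 1.0 = duplicate, 0.0 = not a duplicate
      .setFeaturesCol("features")
    val pipeline = new Pipeline().setStages(Array(assembler, rf))

    // Train on the hand-labeled sample, predict on the remaining potential duplicates
    val Array(training, test) = labeledDF.randomSplit(Array(0.8, 0.2))
    val model = pipeline.fit(training)
    val predictions = model.transform(unlabeledDF)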


  37. Predictions
    • Sample: 1000 records
      • Training set: 800 records
      • Test set: 200 records
    • Results:
      • True positives: 53
      • False positives: 2
      • True negatives: 126
      • False negatives: 5
    → Found 53 of the 58 expected duplicates (53 + 5), with only 2 errors
    • Precision ≈ 93%
    • Recall ≈ 91%


  38. Summary
    &
    Conclusion


  39. Summary
    ✓ Single engine for Record Linkage and Deduplication
    ✓ Machine Learning → rules specific to each dataset
    ✓ More matches identified
      • Previously ~50% → now ~90%


  40. Thank you!
    @aseigneurin
