Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Record linkage, a real use case with Spark ML

Record linkage, a real use case with Spark ML

Alexis Seigneurin

November 30, 2015
Tweet

More Decks by Alexis Seigneurin

Other Decks in Technology

Transcript

  1. RECORD LINKAGE,
    A REAL USE CASE WITH SPARK ML
    Alexis Seigneurin

    View Slide

  2. Who I am
    • Software engineer for 15 years
    • Consultant at Ippon Tech in Paris, France
    • Spark trainer
    • Favorite subjects: Spark, Machine Learning, Cassandra
    • @aseigneurin

    View Slide

  3. The project
    • Record Linkage with Machine learning
    • Use cases:
    • Find new clients who come from insurance comparison services
    → Commission
    • Find duplicates in existing files (acquisitions)
    • Record Linkage
    • Entity resolution
    • Deduplication
    • Entity disambiguation
    • …

    View Slide

  4. Overview

    View Slide

  5. • Find duplicates!
    Purpose
    +---+-------+------------+----------+------------+---------+------------+
    | ID| veh|codptgar_veh|dt_nais_cp|dt_permis_cp|redmaj_cp| formule|
    +---+-------+------------+----------+------------+---------+------------+
    |...|PE28221| 50000|1995-10-12| 2013-10-08| 100.0| TIERS|
    |...|FO26016| 59270|1967-01-01| 1987-02-01| 100.0|VOL_INCENDIE|
    |...|FI19107| 77100|1988-09-27| 2009-09-13| 105.0|TOUS_RISQUES|
    |...|RE07307| 69100|1984-08-15| 2007-04-20| 50.0| TIERS|
    |...|FO26016| 59270|1967-01-07| 1987-02-01| 105.0|TOUS_RISQUES|
    +---+-------+------------+----------+------------+---------+------------+

    View Slide

  6. Steps
    1. Preprocessing
    1. Find potential duplicates
    2. Feature engineering
    2. Manual labeling of a sample
    3. Machine Learning to make predictions on the rest of the
    records

    View Slide

  7. Prototype
    • Crafted by a Data Scientist
    • Not architectured, not versioned, not unit tested…
    → Not ready for production
    • Spark, but a lot of Spark SQL (data processing)
    • Machine Learning in Python (Scikit Learn)
    → Objective: industrialization of the code

    View Slide

  8. Preprocessing

    View Slide

  9. • Data (CSV) + Schema (JSON)
    Inputs
    000010;Jose;Lester;10/10/1970
    000011;José;Lester;10/10/1970
    000012;Tyler;Hunt;12/12/1972
    000013;Tiler;Hunt;25/12/1972
    000014;Patrick;Andrews;1973-12-13
    {
    "tableSchemaBeforeSelection": [
    {
    "name": "ID",
    "typeField": "StringType",
    "hardJoin": false
    },
    {
    "name": "name",
    "typeField": "StringType",
    "hardJoin": true,
    "cleaning": "digitLetter",
    "listFeature": [ "scarcity" ],
    "listDistance": [ "equality", "soundLike" ]
    },
    ...

    View Slide

  10. • Spark CSV module → DataFrame
    Don’t use type inference
    Data loading
    +------+-------+-------+----------+
    | ID| name|surname| birthDt|
    +------+-------+-------+----------+
    |000010| Jose| Lester|10/10/1970|
    |000011| José| Lester|10/10/1970|
    |000012| Tyler| Hunt|12/12/1972|
    |000013| Tiler| Hunt|25/12/1972|
    |000014|Patrick|Andrews|1970-10-10|
    +------+-------+-------+----------+

    View Slide

  11. • Parsing of dates, numbers…
    • Cleaning of strings
    Data cleaning
    +------+-------+-------+----------+
    | ID| name|surname| birthDt|
    +------+-------+-------+----------+
    |000010| jose| lester|1970-10-10|
    |000011| jose| lester|1970-10-10|
    |000012| tyler| hunt|1972-12-12|
    |000013| tiler| hunt|1972-12-25|
    |000014|patrick|andrews| null|
    +------+-------+-------+----------+

    View Slide

  12. • Convert strings to phonetics (Beider-Morse)
    • …
    Feature calculation
    +------+-------+-------+----------+--------------------+
    | ID| name|surname| birthDt| BMencoded_name|
    +------+-------+-------+----------+--------------------+
    |000010| jose| lester|1970-10-10|ios|iosi|ioz|iozi...|
    |000011| jose| lester|1970-10-10|ios|iosi|ioz|iozi...|
    |000012| tyler| hunt|1972-12-12| tilir|
    |000013| tiler| hunt|1972-12-25| tQlir|tili|tilir|
    |000014|patrick|andrews| null|pYtrQk|pYtrik|pat...|
    +------+-------+-------+----------+--------------------+

    View Slide

  13. • Auto-join (more on that later…)
    Find potential duplicates
    +------+------+---------+...+------+------+---------+...
    | ID_1|name_1|surname_1|...| ID_2|name_2|surname_2|...
    +------+------+---------+...+------+------+---------+...
    |000010| jose| lester|...|000011| jose| lester|...
    |000012| tyler| hunt|...|000013| tiler| hunt|...
    +------+------+---------+...+------+------+---------+...

    View Slide

  14. • Several distance algorithms:
    • Levenshtein distance, date difference…
    Distance calculation
    +------+...+------+...+-------------+--------------+...+----------------+
    | ID_1|...| ID_2|...|equality_name|soundLike_name|...|dateDiff_birthDt|
    +------+...+------+...+-------------+--------------+...+----------------+
    |000010|...|000011|...| 0.0| 0.0|...| 0.0|
    |000012|...|000013|...| 1.0| 0.0|...| 13.0|
    +------+...+------+...+-------------+--------------+...+----------------+

    View Slide

  15. • Standardization of distances only
    • Vectorization (2 vectors)
    Standardization / vectorization
    +------+------+---------+----------+------+------+---------+----------+------------+--------------+
    | ID_1|name_1|surname_1| birthDt_1| ID_2|name_2|surname_2| birthDt_2| distances|other_features|
    +------+------+---------+----------+------+------+---------+----------+------------+--------------+
    |000010| jose| lester|1970-10-10|000011| jose| lester|1970-10-10|[0.0,0.0,...| [2.0,2.0,...|
    |000012| tyler| hunt|1972-12-12|000013| tiler| hunt|1972-12-25|[1.0,1.0,...| [1.0,2.0,...|
    +------+------+---------+----------+------+------+---------+----------+------------+--------------+

    View Slide

  16. Spark
    SQL → DataFrames

    View Slide

  17. From SQL…
    • Generated SQL requests
    • Hard to maintain (especially as regards to UDFs)
    val cleaningRequest = tableSchema.map(x => {
    x.CleaningFuction match {
    case (Some(a), _) => a + "(" + x.name + ") as " + x.name
    case _ => x.name
    }
    }).mkString(", ")
    val cleanedTable = sqlContext.sql("select " + cleaningRequest + " from " + tableName)
    cleanedTable.registerTempTable(schema.tableName + "_cleaned")

    View Slide

  18. … to DataFrames
    • DataFrame primitives
    • More work done by the Scala compiler
    val cleanedDF = tableSchema.filter(_.cleaning.isDefined).foldLeft(df) {
    case (df, field) =>
    val udf: UserDefinedFunction = ... // get the cleaning UDF
    df.withColumn(field.name + "_cleaned", udf.apply(df(field.name)))
    .drop(field.name)
    .withColumnRenamed(field.name + "_cleaned", field.name)
    }

    View Slide

  19. Matching potential duplicates

    View Slide

  20. • Join table A with table B
    Join strategy #1
    Prospects New clients

    View Slide

  21. Join strategy #2
    • Union table A with table B
    • Auto-join
    • Allows for deduplication, not only
    matching
    Prospects New clients
    Duplicate

    View Slide

  22. Join - Volume of data
    • Input: 1M records
    • Cartesian product: 1000 B records
    → Find an appropriate join condition
    0
    25
    50
    75
    100

    View Slide

  23. Join condition
    • Multiples join on 2 fields
    • Equality of values or custom condition (UDF)
    • Union between all the intermediate results
    • E.g. with fields name, surname, birth_date:
    df1.join(df2, (df1("ID_1") < df2("ID_2"))

    && (df1("name_1") === df2("name_2"))

    && (soundLike(df1("surname_1"), df2("surname_2")))
    df1.join(df2, (df1("ID_1") < df2("ID_2"))

    && (df1("name_1") === df2("name_2"))

    && (df1("birth_date_1") === df2("birth_date_2")))
    df1.join(df2, (df1("ID_1") < df2("ID_2"))

    && (soundLike(df1("surname_1"), df2("surname_2")))

    && (df1("birth_date_1") === df2("birth_date_2")))
    UNION

    View Slide

  24. DataFrames extension

    View Slide

  25. • 3 types of columns
    DataFrames extension
    +------+...+------+...+-------------+--------------+...+----------------+
    | ID_1|...| ID_2|...|equality_name|soundLike_name|...|dateDiff_birthDt|
    +------+...+------+...+-------------+--------------+...+----------------+
    |000010|...|000011|...| 0.0| 0.0|...| 0.0|
    |000012|...|000013|...| 1.0| 0.0|...| 13.0|
    +------+...+------+...+-------------+--------------+...+----------------+
    Data Distances
    Non-distance features

    View Slide

  26. • DataFrame columns have a name and a data type
    • DataFrameExt = DataFrame + metadata over columns
    DataFrames extension
    case class OutputColumn(name: String, columnType: ColumnType)



    class DataFrameExt(val df: DataFrame, val outputColumns: Seq[OutputColumn]) {
    def show() = df.show()
    def drop(colName: String): DataFrameExt = ...
    def withColumn(colName: String, col: Column, columnType: ColumnType): DataFrameExt = ...
    ...

    View Slide

  27. Unit testing

    View Slide

  28. Unit testing
    • Scalatest + Scoverage
    • Coverage of all the data processing operations
    • Comparison of Row objects

    View Slide

  29. val resDF = schema.cleanTable(rows)

    "The cleaning process" should "clean text fields" in {

    val res = resDF.select("ID", "name", "surname").collect()

    val expected = Array(

    Row("000010", "jose", "lester"),

    Row("000011", "jose", "lester ea"),

    Row("000012", "jose", "lester")

    )

    res should contain theSameElementsAs expected

    }

    "The cleaning process" should "parse dates" in {
    ...
    Unit testing
    000010;Jose;Lester;10/10/1970

    000011;Jose =-+;Lester éà;10/10/1970

    000012;Jose;Lester;invalid date

    View Slide

  30. Labeling

    View Slide

  31. Labeling
    • Manual operation
    • Is this a duplicate? → Yes / No
    • Performed on a sample of the potential duplicates
    • Between 1000 and 10 000 records

    View Slide

  32. Labeling

    View Slide

  33. Predictions

    View Slide

  34. Predictions
    • Machine Learning
    • Random Forests
    • (Gradient Boosting Trees also give good results)
    • Training on the potential duplicates labeled by hand
    • Predictions on the potential duplicates not labeled by hand

    View Slide

  35. Predictions
    • Sample: 1000 records
    • Training set: 800 records
    • Test set: 200 records
    • Results
    • True positives: 53
    • False positives: 2
    • True negatives: 126
    • False negatives: 5
    → Found 53 duplicates on the 58 expected (53+5) and only 2 errors

    View Slide

  36. Summary
    &
    Conclusion

    View Slide

  37. Summary
    ✓ Single engine for Record Linkage and Deduplication
    ✓ Machine Learning → Specific rules for each dataset
    ✓ Higher identification of matches
    • Previously ~50% → Now 80-90%

    View Slide

  38. Thank you!
    @aseigneurin

    View Slide