Record Linkage - A real use case with Spark ML

Presentation from Spark Saturday DC, April 30th 2016

Alexis Seigneurin

April 30, 2016

Transcript

  1. RECORD LINKAGE, A REAL USE CASE WITH SPARK ML
     Alexis Seigneurin
  2. Who I am
     • Software engineer for 15 years
     • Consultant at Ippon USA, previously at Ippon France
     • Favorite subjects: Spark, Machine Learning, Cassandra
     • Spark trainer
     • @aseigneurin
  3. • 200 software engineers in France and the US
     • In the US: offices in DC, NYC and Richmond, Virginia
     • Digital, Big Data and Cloud applications
     • Java & Agile expertise
     • Open-source projects: JHipster, Tatami, etc.
     • @ipponusa
  4. The project
     • Record Linkage with Machine Learning
     • Use cases:
       • Find new clients who come from insurance comparison services → Commission
       • Find duplicates in existing files (acquisitions)
     • Related terms: Record Linkage • Entity resolution • Deduplication • Entity disambiguation • …
  5. Overview

  6. Purpose
     • Find duplicates!
     +---+-------+------------+----------+------------+---------+------------+
     | ID|    veh|codptgar_veh|dt_nais_cp|dt_permis_cp|redmaj_cp|     formule|
     +---+-------+------------+----------+------------+---------+------------+
     |...|PE28221|       50000|1995-10-12|  2013-10-08|    100.0|       TIERS|
     |...|FO26016|       59270|1967-01-01|  1987-02-01|    100.0|VOL_INCENDIE|
     |...|FI19107|       77100|1988-09-27|  2009-09-13|    105.0|TOUS_RISQUES|
     |...|RE07307|       69100|1984-08-15|  2007-04-20|     50.0|       TIERS|
     |...|FO26016|       59270|1967-01-01|  1987-02-07|    105.0|TOUS_RISQUES|
     +---+-------+------------+----------+------------+---------+------------+
  7. Steps
     1. Preprocessing
        1. Find potential duplicates
        2. Feature engineering
     2. Manual labeling of a sample
     3. Machine Learning to make predictions on the rest of the records
  8. Prototype
     • Crafted by a Data Scientist
     • Not architected, not versioned, not unit tested… → Not ready for production
     • Spark, but a lot of Spark SQL (data processing)
     • Machine Learning in Python (scikit-learn)
     → Objective: industrialization of the code
  9. Preprocessing

  10. Inputs
      • Data (CSV) + Schema (JSON)

      000010;Jose;Lester;10/10/1970
      000011;José;Lester;10/10/1970
      000012;Tyler;Hunt;12/12/1972
      000013;Tiler;Hunt;25/12/1972
      000014;Patrick;Andrews;1973-12-13

      {
        "tableSchemaBeforeSelection": [
          { "name": "ID", "typeField": "StringType", "hardJoin": false },
          { "name": "name", "typeField": "StringType", "hardJoin": true,
            "cleaning": "digitLetter",
            "listFeature": [ "scarcity" ],
            "listDistance": [ "equality", "soundLike" ] },
          ...
  11. Data loading
      • Spark CSV module → DataFrame (loading sketched below)
      • Don't use type inference
      +------+-------+-------+----------+
      |    ID|   name|surname|   birthDt|
      +------+-------+-------+----------+
      |000010|   Jose| Lester|10/10/1970|
      |000011|   José| Lester|10/10/1970|
      |000012|  Tyler|   Hunt|12/12/1972|
      |000013|  Tiler|   Hunt|25/12/1972|
      |000014|Patrick|Andrews|1970-10-10|
      +------+-------+-------+----------+
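
      For reference, the loading step might look like this with the spark-csv
      module (a minimal sketch; the file path, the delimiter and the absence
      of a header row are assumptions):

      import org.apache.spark.sql.DataFrame

      // Load the CSV without type inference: every column stays a StringType
      // and is parsed explicitly during the cleansing step.
      val raw: DataFrame = sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "false")       // the sample file has no header row
        .option("delimiter", ";")
        .option("inferSchema", "false")  // keep everything as strings
        .load("/path/to/records.csv")    // hypothetical path
        .toDF("ID", "name", "surname", "birthDt")
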
  12. Data cleansing
      • Parsing of dates, numbers…
      • Cleaning of strings
      • UDF + DataFrame.withColumn() (sketch below)
      +------+-------+-------+----------+
      |    ID|   name|surname|   birthDt|
      +------+-------+-------+----------+
      |000010|   jose| lester|1970-10-10|
      |000011|   jose| lester|1970-10-10|
      |000012|  tyler|   hunt|1972-12-12|
      |000013|  tiler|   hunt|1972-12-25|
      |000014|patrick|andrews|      null|
      +------+-------+-------+----------+
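
      A sketch of a cleaning UDF, assuming the "digitLetter" cleaning means
      lowercasing, stripping accents and keeping only letters and digits
      (the deck does not show the actual rules):

      import java.text.Normalizer
      import org.apache.spark.sql.functions.udf

      // Hypothetical "digitLetter" cleaning: lowercase, strip diacritics,
      // keep only letters, digits and spaces.
      val cleanText = udf { (s: String) =>
        if (s == null) null
        else Normalizer.normalize(s.toLowerCase, Normalizer.Form.NFD)
          .replaceAll("\\p{M}", "")       // é -> e, à -> a
          .replaceAll("[^a-z0-9 ]", " ")
          .replaceAll(" +", " ")
          .trim
      }

      val cleaned = raw
        .withColumn("name", cleanText(raw("name")))
        .withColumn("surname", cleanText(raw("surname")))
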
  13. Feature calculation
      • Convert strings to phonetics (Beider-Morse)
      • Apache Commons Codec (sketch below)
      • …
      +------+-------+-------+----------+--------------------+
      |    ID|   name|surname|   birthDt|      BMencoded_name|
      +------+-------+-------+----------+--------------------+
      |000010|   jose| lester|1970-10-10|ios|iosi|ioz|iozi...|
      |000011|   jose| lester|1970-10-10|ios|iosi|ioz|iozi...|
      |000012|  tyler|   hunt|1972-12-12|               tilir|
      |000013|  tiler|   hunt|1972-12-25|   tQlir|tili|tilir|
      |000014|patrick|andrews|      null|pYtrQk|pYtrik|pat...|
      +------+-------+-------+----------+--------------------+
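
      The phonetic column can be sketched with commons-codec's
      BeiderMorseEncoder wrapped in a UDF; the pipe-separated alternatives it
      returns are the values visible in the table above:

      import org.apache.commons.codec.language.bm.BeiderMorseEncoder
      import org.apache.spark.sql.functions.udf

      // Beider-Morse returns several phonetic spellings separated by '|'
      // (e.g. "jose" -> "ios|iosi|ioz|iozi|...").
      val bmEncode = udf { (s: String) =>
        if (s == null) null
        else new BeiderMorseEncoder().encode(s)
      }

      val withPhonetics = cleaned.withColumn("BMencoded_name", bmEncode(cleaned("name")))
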
  14. Find potential duplicates
      • Auto-join (more on that later…)
      +------+------+---------+...+------+------+---------+...
      |  ID_1|name_1|surname_1|...|  ID_2|name_2|surname_2|...
      +------+------+---------+...+------+------+---------+...
      |000010|  jose|   lester|...|000011|  jose|   lester|...
      |000012| tyler|     hunt|...|000013| tiler|     hunt|...
      +------+------+---------+...+------+------+---------+...
  15. Distance calculation
      • Several distance algorithms:
        • Levenshtein distance (Apache Commons Lang) — sketch below
        • Date difference
        • …
      +------+...+------+...+-------------+--------------+...+----------------+
      |  ID_1|...|  ID_2|...|equality_name|soundLike_name|...|dateDiff_birthDt|
      +------+...+------+...+-------------+--------------+...+----------------+
      |000010|...|000011|...|          0.0|           0.0|...|             0.0|
      |000012|...|000013|...|          1.0|           0.0|...|            13.0|
      +------+...+------+...+-------------+--------------+...+----------------+
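
      A sketch of one distance column, using the Levenshtein implementation
      from Apache Commons Lang; pairs stands for the auto-joined DataFrame,
      and the -1.0 sentinel for missing values is an assumption:

      import org.apache.commons.lang3.StringUtils
      import org.apache.spark.sql.functions.udf

      // Raw edit distance between the two name fields of a candidate pair.
      val levenshtein = udf { (a: String, b: String) =>
        if (a == null || b == null) -1.0   // sentinel for missing values
        else StringUtils.getLevenshteinDistance(a, b).toDouble
      }

      val withDistances = pairs.withColumn(
        "levenshtein_name", levenshtein(pairs("name_1"), pairs("name_2")))
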
  16. Standardization / vectorization
      • Standardization of distances only
      • Vectorization (2 vectors)
      • Spark ML's VectorAssembler (sketch below)
      +------+------+---------+----------+------+------+---------+----------+------------+--------------+
      |  ID_1|name_1|surname_1| birthDt_1|  ID_2|name_2|surname_2| birthDt_2|   distances|other_features|
      +------+------+---------+----------+------+------+---------+----------+------------+--------------+
      |000010|  jose|   lester|1970-10-10|000011|  jose|   lester|1970-10-10|[0.0,0.0,...| [2.0,2.0,...|
      |000012| tyler|     hunt|1972-12-12|000013| tiler|     hunt|1972-12-25|[1.0,1.0,...| [1.0,2.0,...|
      +------+------+---------+----------+------+------+---------+----------+------------+--------------+
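
      A sketch of the two-vector assembly with Spark ML's VectorAssembler,
      with a StandardScaler applied to the distances only (the input column
      lists are assumptions):

      import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}

      // Assemble the distance columns into one vector...
      val distanceAssembler = new VectorAssembler()
        .setInputCols(Array("equality_name", "soundLike_name", "dateDiff_birthDt"))
        .setOutputCol("distances_raw")

      // ...and the non-distance features into another.
      val featureAssembler = new VectorAssembler()
        .setInputCols(Array("scarcity_name", "scarcity_surname"))
        .setOutputCol("other_features")

      // Standardize the distance vector only.
      val scaler = new StandardScaler()
        .setInputCol("distances_raw")
        .setOutputCol("distances")

      val assembled = featureAssembler.transform(distanceAssembler.transform(withDistances))
      val vectorized = scaler.fit(assembled).transform(assembled)
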
  17. Spark SQL → DataFrames

  18. From SQL…
      • Generated SQL queries
      • Hard to maintain (especially with regard to UDFs)

      val cleaningRequest = tableSchema.map(x => {
        x.cleaningFunction match {
          case (Some(udf), _) => udf + "(" + x.name + ") as " + x.name
          case _ => x.name
        }
      }).mkString(", ")

      val cleanedTable = sqlContext.sql("select " + cleaningRequest + " from " + tableName)
      cleanedTable.registerTempTable(schema.tableName + "_cleaned")
  19. … to DataFrames
      • DataFrame primitives
      • More work done by the Scala compiler

      val cleanedDF = tableSchema.filter(_.cleaning.isDefined).foldLeft(df) {
        case (df, field) =>
          val udf: UserDefinedFunction = ... // get the cleaning UDF
          df.withColumn(field.name + "_cleaned", udf.apply(df(field.name)))
            .drop(field.name)
            .withColumnRenamed(field.name + "_cleaned", field.name)
      }
  20. Unit testing

  21. Unit testing
      • Scalatest + Scoverage
      • Coverage of all the data processing operations
  22. Comparison of Row objects

      Input data:

      000010;Jose;Lester;10/10/1970
      000011;Jose =-+;Lester éà;10/10/1970
      000012;Jose;Lester;invalid date

      val resDF = schema.cleanTable(rows)

      "The cleaning process" should "clean text fields" in {
        val res = resDF.select("ID", "name", "surname").collect()
        val expected = Array(
          Row("000010", "jose", "lester"),
          Row("000011", "jose", "lester ea"),
          Row("000012", "jose", "lester")
        )
        res should contain theSameElementsAs expected
      }

      "The cleaning process" should "parse dates" in { ...
  23. Shared SparkContext
      • Don'ts:
        • Use one SparkContext per class of tests → multiple contexts
        • Setup / tear down the SparkContext for each test → slow tests
      • Do's:
        • Use a shared SparkContext (usage sketched below)

      object SparkTestContext {

        val conf = new SparkConf()
          .setAppName("deduplication-tests")
          .setMaster("local[*]")

        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)

      }
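
      A test class can then import the shared context (the class and the test
      are hypothetical):

      import org.apache.spark.sql.Row
      import org.scalatest.{FlatSpec, Matchers}

      class CleaningSpec extends FlatSpec with Matchers {
        // Reuse the single shared context across all test classes.
        import SparkTestContext.sqlContext

        "The shared context" should "build DataFrames" in {
          val df = sqlContext.createDataFrame(Seq(("000010", "Jose"))).toDF("ID", "name")
          df.collect() should contain (Row("000010", "Jose"))
        }
      }
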
  24. spark-testing-base
      • Holden Karau's spark-testing-base library (usage sketched below)
      • https://github.com/holdenk/spark-testing-base
      • Provides:
        • Shared SparkContext and SQLContext
        • Comparison of RDDs, DataFrames, Datasets
        • Mock data generators
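
      A minimal sketch of a suite built on the library's DataFrameSuiteBase,
      which provides the contexts and assertDataFrameEquals (the suite and
      the test are hypothetical):

      import com.holdenkarau.spark.testing.DataFrameSuiteBase
      import org.scalatest.FunSuite

      class DeduplicationSuite extends FunSuite with DataFrameSuiteBase {

        test("cleaning is idempotent on already-clean rows") {
          import sqlContext.implicits._
          val input = Seq(("000010", "jose")).toDF("ID", "name")
          // assertDataFrameEquals checks schemas and row contents.
          assertDataFrameEquals(input, input)
        }
      }
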
  25. Matching potential duplicates

  26. Join strategy
      • For record linkage, first merge the two sources
      • Then auto-join (sketch below)
      (diagram: Prospects / New clients / Duplicate)
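
      In Spark 1.x terms, the merge might look like this, assuming both
      sources share the same schema after preprocessing; df1 and df2 are then
      used in the join conditions on the "Join condition" slide:

      // Merge the two sources, then self-join the result.
      val merged = prospects.unionAll(newClients)

      // Rename columns with _1/_2 suffixes so the self-join is unambiguous.
      val df1 = merged.toDF(merged.columns.map(_ + "_1"): _*)
      val df2 = merged.toDF(merged.columns.map(_ + "_2"): _*)
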
  27. Join - Volume of data
      • Input: 1M records
      • Cartesian product: 1M × 1M = 1,000B records
      → Find an appropriate join condition
  28. Join condition
      • Multiple joins, each on 2 fields
      • Equality of values or custom condition (UDF)
      • Union of all the intermediate results
      • E.g. with fields name, surname, birth_date (a sketch of the soundLike
        UDF follows below):

      df1.join(df2, (df1("ID_1") < df2("ID_2"))
        && (df1("name_1") === df2("name_2"))
        && (soundLike(df1("surname_1"), df2("surname_2"))))

      df1.join(df2, (df1("ID_1") < df2("ID_2"))
        && (df1("name_1") === df2("name_2"))
        && (df1("birth_date_1") === df2("birth_date_2")))

      df1.join(df2, (df1("ID_1") < df2("ID_2"))
        && (soundLike(df1("surname_1"), df2("surname_2")))
        && (df1("birth_date_1") === df2("birth_date_2")))

      → UNION of the three results
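
      The soundLike UDF itself is not shown in the deck; a plausible sketch
      using commons-codec's Soundex, together with the union step (join1,
      join2 and join3 stand for the three joins above):

      import org.apache.commons.codec.language.Soundex
      import org.apache.spark.sql.functions.udf

      // Two strings "sound alike" if their Soundex codes match (an
      // assumption; the project may use a different phonetic comparison).
      val soundLike = udf { (a: String, b: String) =>
        a != null && b != null && new Soundex().encode(a) == new Soundex().encode(b)
      }

      // Union of the intermediate results, with duplicate pairs removed.
      val candidates = join1.unionAll(join2).unionAll(join3).distinct()
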
  29. DataFrames extension

  30. DataFrames extension
      • 3 types of columns: Data, Distances, Non-distance features
      +------+...+------+...+-------------+--------------+...+----------------+
      |  ID_1|...|  ID_2|...|equality_name|soundLike_name|...|dateDiff_birthDt|
      +------+...+------+...+-------------+--------------+...+----------------+
      |000010|...|000011|...|          0.0|           0.0|...|             0.0|
      |000012|...|000013|...|          1.0|           0.0|...|            13.0|
      +------+...+------+...+-------------+--------------+...+----------------+
  31. DataFrames extension
      • DataFrame columns have a name and a data type
      • DataFrameExt = DataFrame + metadata over columns (usage sketched below)

      case class OutputColumn(name: String, columnType: ColumnType)

      class DataFrameExt(val df: DataFrame, val outputColumns: Seq[OutputColumn]) {
        def show() = df.show()
        def drop(colName: String): DataFrameExt = ...
        def withColumn(colName: String, col: Column, columnType: ColumnType): DataFrameExt = ...
        ...
      }
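
      A hypothetical usage sketch: since every column carries its type, later
      stages can select, say, the distance columns without hard-coding names
      (the ColumnType values and equalityUdf are assumptions):

      // Hypothetical ColumnType values: Data, Distance, Feature.
      val ext = new DataFrameExt(df, outputColumns)
        .withColumn("equality_name",
                    equalityUdf(df("name_1"), df("name_2")),
                    ColumnType.Distance)

      // Recover all distance columns from the metadata.
      val distanceCols = ext.outputColumns
        .filter(_.columnType == ColumnType.Distance)
        .map(_.name)
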
  32. Labeling

  33. Labeling
      • Manual operation
      • Is this a duplicate? → Yes / No
      • Performed on a sample of the potential duplicates
      • Between 1,000 and 10,000 records
  34. Labeling

  35. Predictions

  36. Predictions
      • Machine Learning - Spark ML
      • Random Forests (training sketched below)
      • (Gradient-Boosted Trees also give good results)
      • Training on the potential duplicates labeled by hand
      • Predictions on the potential duplicates not labeled by hand
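
      A sketch of the training step with Spark ML's RandomForestClassifier
      (the column names, the split and the number of trees are assumptions):

      import org.apache.spark.ml.classification.RandomForestClassifier
      import org.apache.spark.ml.feature.VectorAssembler

      // Combine the two vectors into the single "features" column Spark ML expects.
      val assembler = new VectorAssembler()
        .setInputCols(Array("distances", "other_features"))
        .setOutputCol("features")

      val rf = new RandomForestClassifier()
        .setLabelCol("label")        // 1.0 = duplicate, 0.0 = not a duplicate
        .setFeaturesCol("features")
        .setNumTrees(20)             // hypothetical setting

      // 80/20 split of the hand-labeled pairs, as on the next slide.
      val Array(training, test) = assembler.transform(labeled).randomSplit(Array(0.8, 0.2))
      val model = rf.fit(training)
      val predictions = model.transform(test)
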
  37. Predictions
      • Sample: 1,000 records
      • Training set: 800 records
      • Test set: 200 records
      • Results:
        • True positives: 53
        • False positives: 2
        • True negatives: 126
        • False negatives: 5
      → Found 53 of the 58 expected duplicates (53 + 5), with only 2 errors
      • Precision = 53 / (53 + 2) ≈ 96%
      • Recall = 53 / (53 + 5) ≈ 91%
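
      These metrics follow directly from the confusion counts; a sketch of
      computing them from the predictions DataFrame (the column names are
      Spark ML's defaults):

      // Confusion counts on the test set.
      val tp = predictions.filter("prediction = 1.0 AND label = 1.0").count()
      val fp = predictions.filter("prediction = 1.0 AND label = 0.0").count()
      val fn = predictions.filter("prediction = 0.0 AND label = 1.0").count()

      val precision = tp.toDouble / (tp + fp)  // 53 / 55 ≈ 0.96
      val recall    = tp.toDouble / (tp + fn)  // 53 / 58 ≈ 0.91
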
  38. Summary & Conclusion

  39. Summary
      ✓ Single engine for Record Linkage and Deduplication
      ✓ Machine Learning → Specific rules for each dataset
      ✓ Higher identification of matches: previously ~50%, now ~90%
  40. Thank you! @aseigneurin