Who I am
• Software engineer for 15 years
• Consultant at Ippon USA, previously at Ippon France
• Favorite subjects: Spark, Machine Learning, Cassandra
• Spark trainer
• @aseigneurin
Ippon Technologies
• 200 software engineers in France and the US
• In the US: offices in DC, NYC and Richmond, Virginia
• Digital, Big Data and Cloud applications
• Java & Agile expertise
• Open-source projects: JHipster, Tatami, etc.
• @ipponusa
The project
• Record Linkage with Machine Learning
• Use cases:
  • Find new clients who come from insurance comparison services → commission
  • Find duplicates in existing files (acquisitions)
• Related terms:
  • Record Linkage
  • Entity resolution
  • Deduplication
  • Entity disambiguation
  • …
Steps
1. Preprocessing
   1. Find potential duplicates
   2. Feature engineering (see the sketch below)
2. Manual labeling of a sample
3. Machine Learning to make predictions on the rest of the records
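As an illustration of step 1.2, each pair of potentially duplicate records can be turned into numeric similarity features. A minimal sketch, assuming a hypothetical pairsDF of candidate pairs with name_1/name_2 and surname_1/surname_2 columns (the use of Levenshtein distance here is illustrative, not from the talk):

import org.apache.spark.sql.functions.{col, levenshtein}

// Each string comparison becomes a numeric feature for the ML step
val featuresDF = pairsDF
  .withColumn("name_dist", levenshtein(col("name_1"), col("name_2")))
  .withColumn("surname_dist", levenshtein(col("surname_1"), col("surname_2")))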
Prototype
• Crafted by a Data Scientist
• Not architected, not versioned, not unit tested…
  → Not ready for production
• Spark, but a lot of Spark SQL (data processing)
• Machine Learning in Python (scikit-learn)
→ Objective: industrialization of the code
… to DataFrames
• DataFrame primitives
• More work done by the Scala compiler

val cleanedDF = tableSchema.filter(_.cleaning.isDefined).foldLeft(df) {
  case (df, field) =>
    val udf: UserDefinedFunction = ... // get the cleaning UDF
    df.withColumn(field.name + "_cleaned", udf.apply(df(field.name)))
      .drop(field.name)
      .withColumnRenamed(field.name + "_cleaned", field.name)
}
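For illustration, a possible implementation of one such cleaning UDF; the rules here (lower-case, strip accents, drop non-letters) are assumptions chosen to match the test expectations below:

import java.text.Normalizer
import org.apache.spark.sql.functions.udf

// Hypothetical text-cleaning UDF: lower-case, strip accents (é -> e),
// keep only letters and spaces, collapse repeated spaces
val cleanTextUdf = udf { (value: String) =>
  if (value == null) null
  else Normalizer.normalize(value.toLowerCase, Normalizer.Form.NFD)
    .replaceAll("\\p{M}", "")   // remove combining accent marks
    .replaceAll("[^a-z ]", "")  // keep letters and spaces only
    .replaceAll(" +", " ")      // collapse multiple spaces
    .trim
}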
val resDF = schema.cleanTable(rows)

"The cleaning process" should "clean text fields" in {
  val res = resDF.select("ID", "name", "surname").collect()
  val expected = Array(
    Row("000010", "jose", "lester"),
    Row("000011", "jose", "lester ea"),
    Row("000012", "jose", "lester")
  )
  res should contain theSameElementsAs expected
}
"The cleaning process" should "parse dates" in { ... Comparison of Row objects 000010;Jose;Lester;10/10/1970 000011;Jose =-+;Lester éà;10/10/1970 000012;Jose;Lester;invalid date
Shared SparkContext
• Don'ts:
  • Use one SparkContext per class of tests → multiple contexts
  • Set up / tear down the SparkContext for each test → slow tests
• Do's:
  • Use a shared SparkContext

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkTestContext {

  val conf = new SparkConf()
    .setAppName("deduplication-tests")
    .setMaster("local[*]")

  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)
}
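A sketch of a test class pulling in the shared context; the class name is hypothetical, and the FlatSpec/Matchers style is assumed from the "should … in" tests above:

import org.scalatest.{FlatSpec, Matchers}

// Hypothetical spec reusing the single shared SparkContext/SQLContext
// instead of creating (and tearing down) its own
class CleaningSpec extends FlatSpec with Matchers {
  val sc = SparkTestContext.sc
  val sqlContext = SparkTestContext.sqlContext

  // tests go here, all sharing the same context
}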
DataFrames extension
• DataFrame columns have a name and a data type
• DataFrameExt = DataFrame + metadata over columns

case class OutputColumn(name: String, columnType: ColumnType)
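A minimal sketch of the wrapper, assuming a simple case class (the field names and the helper method are illustrative):

import org.apache.spark.sql.DataFrame

// Hypothetical extension: the DataFrame plus typed metadata per column
case class DataFrameExt(df: DataFrame, outputColumns: Seq[OutputColumn]) {
  // example of a metadata-aware helper: names of the columns of a given type
  def columnsOfType(t: ColumnType): Seq[String] =
    outputColumns.filter(_.columnType == t).map(_.name)
}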
Predictions
• Machine Learning: Spark ML
• Random Forests
  • (Gradient-Boosted Trees also give good results)
• Training on the potential duplicates labeled by hand
• Predictions on the potential duplicates not labeled by hand
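A hedged sketch of the training/prediction step with Spark ML's RandomForestClassifier; the DataFrame and column names ("label", "features") and the number of trees are assumptions, not values from the talk:

import org.apache.spark.ml.classification.RandomForestClassifier

// Train on the hand-labeled sample (label: 1.0 = duplicate, 0.0 = not),
// then predict on the potential duplicates that were not labeled
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(50)

val model = rf.fit(labeledDF)                   // hand-labeled sample
val predictions = model.transform(unlabeledDF)  // rest of the records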
Summary
✓ Single engine for Record Linkage and Deduplication
✓ Machine Learning → specific rules for each dataset
✓ Higher identification of matches
  • Previously ~50% → now ~90%