High performance privacy by design using Matryoshka & Spark

Olivier Girardot

October 31, 2018

Transcript

  1. 2.

    Big Data Architect Olivier Girardot • Scala, Python & Java

    dev • Data Engineer • Big Data Architect • Co-Founder of LateralThoughts.com
  2. 3.

    Software Engineer Wiem Zine Elabidine • Data Engineer at ebiznext

    • Enthusiastic about FP • Contributor to Scalaz-ZIO
  3. 4.

    The Plan • Introduction • Privacy Framework goals • Recursive

    Data Structures • Our Use Case • Matryoshka • Privacy Engines • Performance results • Conclusion
  4. 5.
  5. 6.

    { "_id": "5bd9761695a4b11a262c6f6d", "isActive": false, "balance": "$1,217.04", "age": 40, "eyeColor":

    "brown", "name": "Hebert Mullen", "gender": "male", "company": "EVENTIX", "email": "hebertmullen@eventix.com", "phone": "+1 (962) 529-3054", "address": "260 Clark Street, Corinne, Maryland, 1890", "coords": { "latitude": 33.118464, "longitude": 168.775865 } } Example :
  6. 7.

Problem : We want to build a generic framework that can : - represent any object - apply an encryption function (ex: mask, hash, cypher…) to any object’s field, no matter how nested - while being general-purpose & expressive
  7. 9.

How ? 1. Build annotated Schemas with field metadata representing what it is “semantically” ex: This is a Person’s first name 2. Define Privacy Strategies, i.e. what to do with “a Person’s first name” ex: Hash, Mask, Delete…
  8. 14.

    Privacy Framework type PrivacyStrategies = Map[Seq[String], PrivacyStrategy] “address” “name” “email”

    “id” “pw” ... Tags encryptStrategy changeSchema ******* 4lEhcqv4 #Fde32 -1 &&& ...
  9. 15.

    Privacy Framework type PrivacyStrategies = Map[Seq[String], PrivacyStrategy] String String String

    Long String ... Tags encryptStrategy changeSchema “address” “name” “email” “id” “pw” ... Schema
  10. 16.

    Privacy Framework type PrivacyStrategies = Map[Seq[String], PrivacyStrategy] String String String

    Long String ... Tags encryptStrategy changeSchema “address” “name” “email” “id” “pw” ... Schema String String String String String ...
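The `PrivacyStrategies` lookup sketched on these slides can be written down as a plain Scala value. This is a minimal stand-in, assuming a `PrivacyStrategy` is just a `String => String` function; the `mask` and `hash` strategies and the tag sequences are hypothetical examples, not the framework's real ones:

```scala
// Hypothetical simplification: a PrivacyStrategy is modelled as String => String
type PrivacyStrategy = String => String
type PrivacyStrategies = Map[Seq[String], PrivacyStrategy]

// Two toy strategies
val mask: PrivacyStrategy = s => "*" * s.length
val hash: PrivacyStrategy = s => s.hashCode.toHexString

// Tag sequences (e.g. "a Person's name") are the lookup keys
val strategies: PrivacyStrategies = Map(
  Seq("person", "name")  -> mask,
  Seq("person", "email") -> hash
)
```

A field whose tags are not in the map is simply left untouched.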
  11. 19.

    Recursive Data Structure Succ(Succ(Succ(Zero))) = ??? sealed trait Number case

    object Zero extends Number case class Succ(prev: Number) extends Number
  12. 20.

    Recursive Data Structure Succ(Succ(Succ(Zero))) = ??? sealed trait Number case

    object Zero extends Number case class Succ(prev: Number) extends Number def numberToInt(n: Number): Int = n match { case Succ(x) => 1 + numberToInt(x) case Zero => 0 }
  13. 21.

    Recursive Data Structure sealed trait Number case object Zero extends

    Number case class Succ(prev: Number) extends Number Succ(Succ(Succ(Zero))) = 3 def numberToInt(n: Number): Int = n match { case Succ(x) => 1 + numberToInt(x) case Zero => 0 } Zero
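The `Number` encoding and `numberToInt` from these slides run as-is:

```scala
sealed trait Number
case object Zero extends Number
case class Succ(prev: Number) extends Number

// Structural recursion: one case per constructor
def numberToInt(n: Number): Int = n match {
  case Succ(x) => 1 + numberToInt(x)
  case Zero    => 0
}

val three: Int = numberToInt(Succ(Succ(Succ(Zero))))
```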
  14. 23.

    Schema vs Data Schema: Struct id: Long name: String addresses:

    Array email: String pw: String Row1: Data 143265 Bob [“Paris”] bob.st@gmail.com ******** Row2 143267 Anna [“Lyon”] anna@gmail.com ******** Row3 143225 Robert [“Germany”] robert@gmail.com ******** ... ... ... ... ... ... Row5466 345675 Alice [“Scotland”] alice@gmail.com ******** ... ... ... ... ... ...
  15. 26.

    Our Schema sealed trait TSchema case class TStruct(fields: List[(String, TSchema)],

    metadata: ColumnMetadata) extends TSchema case class TArray(elementType: TSchema, metadata: ColumnMetadata) extends TSchema sealed trait TValue extends TSchema case class TBoolean(metadata: ColumnMetadata) extends TValue case class TDate(metadata: ColumnMetadata) extends TValue case class TDouble(metadata: ColumnMetadata) extends TValue case class TFloat(metadata: ColumnMetadata) extends TValue case class TInteger(metadata: ColumnMetadata) extends TValue case class TLong(metadata: ColumnMetadata) extends TValue case class TString(metadata: ColumnMetadata) extends TValue
  16. 27.

Our Schema sealed trait TSchema case class TStruct(fields: List[(String, TSchema)], metadata: ColumnMetadata) extends TSchema case class TArray(elementType: TSchema, metadata: ColumnMetadata) extends TSchema sealed trait TValue extends TSchema case class TBoolean(metadata: ColumnMetadata) extends TValue case class TDate(metadata: ColumnMetadata) extends TValue case class TDouble(metadata: ColumnMetadata) extends TValue case class TFloat(metadata: ColumnMetadata) extends TValue case class TInteger(metadata: ColumnMetadata) extends TValue case class TLong(metadata: ColumnMetadata) extends TValue case class TString(metadata: ColumnMetadata) extends TValue
  17. 30.

    Our Generic Data sealed trait GData case class GStruct(fields: List[(String,

    GData)]) extends GData case class GArray(elements: Seq[GData]) extends GData sealed trait GValue extends GData case class GBoolean(value: Boolean) extends GValue case class GString(value: String) extends GValue ...
  18. 34.

    Magical Steps 1. Remove Recursion 2. Recapture Recursion 3. Define

    Functor 4. Enjoy the free recursion functions in Matryoshka! cata/ana/hylo
  19. 35.

    Step 1: Remove Recursion sealed trait TSchema case class TStruct(fields:

    List[(String, TSchema)], metadata: ColumnMetadata) extends TSchema case class TArray(elementType: TSchema, metadata: ColumnMetadata) extends TSchema sealed trait TValue extends TSchema case class TBoolean(metadata: ColumnMetadata) extends TSchema case class TString(metadata: ColumnMetadata) extends TSchema ...
  20. 36.

    Step 1: Remove Recursion sealed trait SchemaF[A] case class StructF[A]

    (fields: List[(String, A)], metadata: ColumnMetadata) extends SchemaF[A] case class ArrayF[A](elementType: A, metadata: ColumnMetadata) extends SchemaF[A] sealed trait ValueF[A] extends SchemaF[A] case class BooleanF[A](metadata: ColumnMetadata) extends ValueF[A] case class StringF[A](metadata: ColumnMetadata) extends ValueF[A] … sealed trait TSchema case class TStruct(fields: List[(String, TSchema)], metadata: ColumnMetadata) extends TSchema case class TArray(elementType: TSchema, metadata: ColumnMetadata) extends TSchema sealed trait TValue extends TSchema case class TBoolean(metadata: ColumnMetadata) extends TSchema case class TString(metadata: ColumnMetadata) extends TSchema ...
  21. 37.

    But.. case class StructF[A] (fields: List[(String, A)], metadata: ColumnMetadata) extends

    SchemaF[A] case class ArrayF[A](elementType: A, metadata: ColumnMetadata) extends SchemaF[A] 1. What if A is another SchemaF[A]?
  22. 42.

Step 2: Recapture Recursion case class Fix[F[_]](unFix: F[Fix[F]]) Fix[F] == F[Fix[F]] val schema: Fix[SchemaF] = Fix(ArrayF(Fix(DoubleF(???)), ???)) SchemaF[Fix[SchemaF]]
  23. 43.

Step 2: Recapture Recursion case class Fix[F[_]](unFix: F[Fix[F]]) val schema: Fix[SchemaF] = Fix(StructF(List("isAvailable" -> Fix(BooleanF(???)), "date" -> Fix(DateF(???)), "person" -> Fix(StructF(List("name" -> Fix(StringF(???)), "array" -> Fix(ArrayF(Fix(DoubleF(???)), ???))), ???))), ???))
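A minimal sketch of how `Fix` recaptures recursion, reusing the `Number` example from the earlier slides (`NumberF`, `zero` and `succ` are our names, not from the deck):

```scala
import scala.language.higherKinds

case class Fix[F[_]](unFix: F[Fix[F]])

// Non-recursive "pattern functor" version of Number: the recursive
// position is replaced by the type parameter A
sealed trait NumberF[A]
case class ZeroF[A]() extends NumberF[A]
case class SuccF[A](prev: A) extends NumberF[A]

// Smart constructors that recapture the recursion through Fix
val zero: Fix[NumberF] = Fix(ZeroF())
def succ(n: Fix[NumberF]): Fix[NumberF] = Fix(SuccF(n))

val two: Fix[NumberF] = succ(succ(zero))

// A hand-written fold, to show the Fix-ed value is still usable
def toInt(n: Fix[NumberF]): Int = n.unFix match {
  case ZeroF()  => 0
  case SuccF(p) => 1 + toInt(p)
}
```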
  24. 44.

Step 3: Define Functor implicit val schemaFunctor: Functor[SchemaF] = new Functor[SchemaF] { def map[A, B](fa: SchemaF[A])(f: A => B): SchemaF[B] = fa match { case StructF(fields, m) => StructF(fields.map { case (name, value) => name -> f(value) }, m) case ArrayF(elem, m) => ArrayF(f(elem), m) case BooleanF(m) => BooleanF(m) case StringF(m) => StringF(m) case IntegerF(m) => IntegerF(m) ... } }
  25. 45.
  26. 46.

    We Are Ready! • Build SchemaF from Spark Schema •

    Collapse SchemaF to Spark Schema We need recursive functions!
  27. 47.

    We Are Ready! • Build SchemaF from Spark Schema •

    Collapse SchemaF to Spark Schema Matryoshka offers that for free
  28. 51.

    How to build up a single layer of SchemaF from

DataType? val dataTypeToSchemaF: Coalgebra[SchemaF, DataType] = { case StructType(fields) => StructF(fields.map(f => f.name -> f.dataType).toList, ColumnMetadata.empty) case ArrayType(elem, _) => ArrayF(elem, ColumnMetadata.empty) case BooleanType => BooleanF(ColumnMetadata.empty) case DateType => DateF(ColumnMetadata.empty) case DoubleType => DoubleF(ColumnMetadata.empty) case FloatType => FloatF(ColumnMetadata.empty) case IntegerType => IntegerF(ColumnMetadata.empty) case LongType => LongF(ColumnMetadata.empty) case StringType => StringF(ColumnMetadata.empty) } DataType TOP DOWN
  29. 52.

    val dataTypeToSchemaF: Coalgebra[SchemaF, DataType] = { case StructType(fields) => StructF(fields.map(f

=> f.name -> f.dataType).toList, ColumnMetadata.empty) case ArrayType(elem, _) => ArrayF(elem, ColumnMetadata.empty) case BooleanType => BooleanF(ColumnMetadata.empty) case DateType => DateF(ColumnMetadata.empty) case DoubleType => DoubleF(ColumnMetadata.empty) case FloatType => FloatF(ColumnMetadata.empty) case IntegerType => IntegerF(ColumnMetadata.empty) case LongType => LongF(ColumnMetadata.empty) case StringType => StringF(ColumnMetadata.empty) } StructType(StructField(id,LongType), StructField(name,StringType)) StructType TOP DOWN How to build up a single layer of SchemaF from DataType?
  30. 53.

    val dataTypeToSchemaF: Coalgebra[SchemaF, DataType] = { case StructType(fields) => StructF(fields.map(f

=> f.name -> f.dataType).toList, ColumnMetadata.empty) case ArrayType(elem, _) => ArrayF(elem, ColumnMetadata.empty) case BooleanType => BooleanF(ColumnMetadata.empty) case DateType => DateF(ColumnMetadata.empty) case DoubleType => DoubleF(ColumnMetadata.empty) case FloatType => FloatF(ColumnMetadata.empty) case IntegerType => IntegerF(ColumnMetadata.empty) case LongType => LongF(ColumnMetadata.empty) case StringType => StringF(ColumnMetadata.empty) } StructType(StructField(id,LongType), StructField(name,StringType)) StructType StructF(List((id,LongType),(name,StringType))) TOP DOWN How to build up a single layer of SchemaF from DataType?
  31. 54.

    val dataTypeToSchemaF: Coalgebra[SchemaF, DataType] = { case StructType(fields) => StructF(fields.map(f

=> f.name -> f.dataType).toList, ColumnMetadata.empty) case ArrayType(elem, _) => ArrayF(elem, ColumnMetadata.empty) case BooleanType => BooleanF(ColumnMetadata.empty) case DateType => DateF(ColumnMetadata.empty) case DoubleType => DoubleF(ColumnMetadata.empty) case FloatType => FloatF(ColumnMetadata.empty) case IntegerType => IntegerF(ColumnMetadata.empty) case LongType => LongF(ColumnMetadata.empty) case StringType => StringF(ColumnMetadata.empty) } StructType(StructField(id,LongType), StructField(name,StringType)) StructF(List((id,LongType),(name,StringType))) LongType TOP DOWN How to build up a single layer of SchemaF from DataType?
  32. 55.

    val dataTypeToSchemaF: Coalgebra[SchemaF, DataType] = { case StructType(fields) => StructF(fields.map(f

=> f.name -> f.dataType).toList, ColumnMetadata.empty) case ArrayType(elem, _) => ArrayF(elem, ColumnMetadata.empty) case BooleanType => BooleanF(ColumnMetadata.empty) case DateType => DateF(ColumnMetadata.empty) case DoubleType => DoubleF(ColumnMetadata.empty) case FloatType => FloatF(ColumnMetadata.empty) case IntegerType => IntegerF(ColumnMetadata.empty) case LongType => LongF(ColumnMetadata.empty) case StringType => StringF(ColumnMetadata.empty) } StructType(StructField(id,LongType), StructField(name,StringType)) StructF(List((id,LongF(ColumnMetadata.empty)),(name,StringType))) LongType TOP DOWN How to build up a single layer of SchemaF from DataType?
  33. 56.

    val dataTypeToSchemaF: Coalgebra[SchemaF, DataType] = { case StructType(fields) => StructF(fields.map(f

=> f.name -> f.dataType).toList, ColumnMetadata.empty) case ArrayType(elem, _) => ArrayF(elem, ColumnMetadata.empty) case BooleanType => BooleanF(ColumnMetadata.empty) case DateType => DateF(ColumnMetadata.empty) case DoubleType => DoubleF(ColumnMetadata.empty) case FloatType => FloatF(ColumnMetadata.empty) case IntegerType => IntegerF(ColumnMetadata.empty) case LongType => LongF(ColumnMetadata.empty) case StringType => StringF(ColumnMetadata.empty) } StructType(StructField(id,LongType), StructField(name,StringType)) StructF(List((id,LongF(ColumnMetadata.empty)),(name,StringType))) StringType TOP DOWN How to build up a single layer of SchemaF from DataType?
  34. 57.

    val dataTypeToSchemaF: Coalgebra[SchemaF, DataType] = { case StructType(fields) => StructF(fields.map(f

=> f.name -> f.dataType).toList, ColumnMetadata.empty) case ArrayType(elem, _) => ArrayF(elem, ColumnMetadata.empty) case BooleanType => BooleanF(ColumnMetadata.empty) case DateType => DateF(ColumnMetadata.empty) case DoubleType => DoubleF(ColumnMetadata.empty) case FloatType => FloatF(ColumnMetadata.empty) case IntegerType => IntegerF(ColumnMetadata.empty) case LongType => LongF(ColumnMetadata.empty) case StringType => StringF(ColumnMetadata.empty) } StructType(StructField(id,LongType), StructField(name,StringType)) StructF(List((id,LongF(ColumnMetadata.empty)),(name,StringF(ColumnMetadata.empty)))) StringType TOP DOWN How to build up a single layer of SchemaF from DataType?
  35. 58.

    How to build a full blown SchemaF using Coalgebra? val

    sparkSchema: DataType = StructType(List( StructField("id", LongType, true), StructField("name", StringType, true) )) val schemaF: Fix[SchemaF] = sparkSchema ?? dataTypeToSchemaF
  36. 59.

    ana val sparkSchema: DataType = StructType(List( StructField("id", LongType, true), StructField("name",

StringType, true) )) val schemaF: Fix[SchemaF] = sparkSchema ?? dataTypeToSchemaF def ana[T](f: Coalgebra[F, A])(implicit BF: Functor[F]): T
  37. 60.

    ana val sparkSchema: DataType = StructType(List( StructField("id", LongType, true), StructField("name",

    StringType, true) )) val schemaF: Fix[SchemaF] = sparkSchema.ana[Fix[SchemaF]](dataTypeToSchemaF)
  38. 61.

    ana val sparkSchema: DataType = StructType(List( StructField("id", LongType, true), StructField("name",

    StringType, true) )) val schemaF: Fix[SchemaF] = sparkSchema.ana[Fix[SchemaF]](dataTypeToSchemaF) Fix(StructF(List((id,Fix(LongF(ColumnMetadata(true,List())))), (name,Fix(StringF(ColumnMetadata(true,List()))))),ColumnMetadata(true,List())))
  39. 65.

    Algebra def schemaFToDataType: Algebra[SchemaF, DataType] = { case StructF(fields, _)

=> StructType(fields.map { case (name, value) => StructField(name, value) }.toArray) case ArrayF(elem, m) => ArrayType(elem, containsNull = false) case BooleanF(_) => BooleanType case DateF(_) => DateType case DoubleF(_) => DoubleType case FloatF(_) => FloatType case IntegerF(_) => IntegerType case LongF(_) => LongType case StringF(_) => StringType } SchemaF BOTTOM UP
  40. 66.

    Algebra def schemaFToDataType: Algebra[SchemaF, DataType] = { case StructF(fields, _)

=> StructType(fields.map { case (name, value) => StructField(name, value) }.toArray) case ArrayF(elem, m) => ArrayType(elem, containsNull = false) case BooleanF(_) => BooleanType case DateF(_) => DateType case DoubleF(_) => DoubleType case FloatF(_) => FloatType case IntegerF(_) => IntegerType case LongF(_) => LongType case StringF(_) => StringType } LongF StructF(List((id,LongF),(name,StringF))) StructF(List((id,LongF),(name,StringF))) BOTTOM UP
  41. 67.

    Algebra def schemaFToDataType: Algebra[SchemaF, DataType] = { case StructF(fields, _)

=> StructType(fields.map { case (name, value) => StructField(name, value) }.toArray) case ArrayF(elem, m) => ArrayType(elem, containsNull = false) case BooleanF(_) => BooleanType case DateF(_) => DateType case DoubleF(_) => DoubleType case FloatF(_) => FloatType case IntegerF(_) => IntegerType case LongF(_) => LongType case StringF(_) => StringType } LongF StructF(List((id,LongType),(name,StringF))) StructF(List((id,LongF),(name,StringF))) BOTTOM UP
  42. 68.

    Algebra def schemaFToDataType: Algebra[SchemaF, DataType] = { case StructF(fields, _)

=> StructType(fields.map { case (name, value) => StructField(name, value) }.toArray) case ArrayF(elem, m) => ArrayType(elem, containsNull = false) case BooleanF(_) => BooleanType case DateF(_) => DateType case DoubleF(_) => DoubleType case FloatF(_) => FloatType case IntegerF(_) => IntegerType case LongF(_) => LongType case StringF(_) => StringType } StringF StructF(List((id,LongType),(name,StringF))) StructF(List((id,LongF),(name,StringF))) BOTTOM UP
  43. 69.

    Algebra def schemaFToDataType: Algebra[SchemaF, DataType] = { case StructF(fields, _)

=> StructType(fields.map { case (name, value) => StructField(name, value) }.toArray) case ArrayF(elem, m) => ArrayType(elem, containsNull = false) case BooleanF(_) => BooleanType case DateF(_) => DateType case DoubleF(_) => DoubleType case FloatF(_) => FloatType case IntegerF(_) => IntegerType case LongF(_) => LongType case StringF(_) => StringType } StringF StructF(List((id,LongType),(name,StringType))) StructF(List((id,LongF),(name,StringF))) BOTTOM UP
  44. 70.

    Algebra def schemaFToDataType: Algebra[SchemaF, DataType] = { case StructF(fields, _)

=> StructType(fields.map { case (name, value) => StructField(name, value) }.toArray) case ArrayF(elem, m) => ArrayType(elem, containsNull = false) case BooleanF(_) => BooleanType case DateF(_) => DateType case DoubleF(_) => DoubleType case FloatF(_) => FloatType case IntegerF(_) => IntegerType case LongF(_) => LongType case StringF(_) => StringType } StructF(List((id,LongType),(name,StringType))) StructF(List((id,LongF),(name,StringF))) StructF BOTTOM UP
  45. 71.

    Algebra def schemaFToDataType: Algebra[SchemaF, DataType] = { case StructF(fields, _)

=> StructType(fields.map { case (name, value) => StructField(name, value) }.toArray) case ArrayF(elem, m) => ArrayType(elem, containsNull = false) case BooleanF(_) => BooleanType case DateF(_) => DateType case DoubleF(_) => DoubleType case FloatF(_) => FloatType case IntegerF(_) => IntegerType case LongF(_) => LongType case StringF(_) => StringType } StructType(StructField(id,LongType), StructField(name,StringType)) StructF(List((id,LongF),(name,StringF))) StructF BOTTOM UP
  46. 72.

    How to collapse a SchemaF into DataType? val schemaF: Fix[SchemaF]

= Fix(StructF(List(("id", Fix(LongF(ColumnMetadata.empty))), ("name", Fix(StringF(ColumnMetadata.empty)))), ColumnMetadata.empty)) val dataType: DataType = schemaF ??? schemaFToDataType
  47. 73.

cata val schemaF: Fix[SchemaF] = Fix(StructF(List(("id", Fix(LongF(ColumnMetadata.empty))), ("name", Fix(StringF(ColumnMetadata.empty)))), ColumnMetadata.empty)) val dataType:

    DataType = schemaF ??? schemaFToDataType def cata[A](f: Algebra[F, A])(implicit BF: Functor[F]): A
  48. 74.

cata val schemaF: Fix[SchemaF] = Fix(StructF(List(("id", Fix(LongF(ColumnMetadata.empty))), ("name", Fix(StringF(ColumnMetadata.empty)))), ColumnMetadata.empty)) val dataType:

DataType = schemaF.cata[DataType](schemaFToDataType) StructType(List(StructField("id", LongType, true), StructField("name", StringType, true)))
  49. 79.

hylo def hylo[F[_]: Functor, A, B](a: A)(alg: Algebra[F, B], co: Coalgebra[F, A]): B
  50. 80.

Recap - ana: unfold, requires a Coalgebra - cata: fold, requires an Algebra - hylo: refold (unfold + fold), requires both an Algebra and a Coalgebra
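Hand-rolled, dependency-free versions of the three schemes in this recap fit in a few lines. These stand in for Matryoshka's real `ana`/`cata`/`hylo` (whose signatures also thread `Recursive`/`Corecursive` instances); the `NumberF` demo reuses the Number example from the earlier slides:

```scala
import scala.language.higherKinds

case class Fix[F[_]](unFix: F[Fix[F]])

trait Functor[F[_]] { def map[A, B](fa: F[A])(f: A => B): F[B] }

type Algebra[F[_], A]   = F[A] => A
type Coalgebra[F[_], A] = A => F[A]

// cata: fold bottom-up with an Algebra
def cata[F[_], A](t: Fix[F])(alg: Algebra[F, A])(implicit F: Functor[F]): A =
  alg(F.map(t.unFix)(sub => cata(sub)(alg)))

// ana: unfold top-down with a Coalgebra
def ana[F[_], A](a: A)(co: Coalgebra[F, A])(implicit F: Functor[F]): Fix[F] =
  Fix(F.map(co(a))(seed => ana(seed)(co)))

// hylo: refold (unfold + fold) without building the intermediate Fix structure
def hylo[F[_], A, B](a: A)(alg: Algebra[F, B], co: Coalgebra[F, A])(implicit F: Functor[F]): B =
  alg(F.map(co(a))(seed => hylo(seed)(alg, co)))

// Demo with the Number pattern functor
sealed trait NumberF[A]
case class ZeroF[A]() extends NumberF[A]
case class SuccF[A](prev: A) extends NumberF[A]

implicit val numberFunctor: Functor[NumberF] = new Functor[NumberF] {
  def map[A, B](fa: NumberF[A])(f: A => B): NumberF[B] = fa match {
    case ZeroF()  => ZeroF()
    case SuccF(p) => SuccF(f(p))
  }
}

val toInt: Algebra[NumberF, Int]     = { case ZeroF() => 0; case SuccF(n) => n + 1 }
val fromInt: Coalgebra[NumberF, Int] = i => if (i <= 0) ZeroF() else SuccF(i - 1)

val three: Fix[NumberF] = ana(3)(fromInt)  // unfold 3 into Succ(Succ(Succ(Zero)))
val back: Int           = cata(three)(toInt) // fold it back
```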
  51. 82.

Problem We need to encrypt data only if the tags within its schema match those of a Privacy Strategy
  52. 83.

How : A naive approach would be to zip the data & schema together recursively, and then use an Algebra to pattern-match schemas with the proper tags and mutate the data accordingly
  53. 84.

    Zip Data & Schema recursively ? case class EnvT[E, W[_],

    A](run: (E, W[A])) EnvT is a Matryoshka pattern-functor that annotates a Functor W[_] with a label of type E and has a type-parameter A
  54. 85.

Example TStruct( "personName" -> TString(tags1), "gender" -> TLong(tags2) ) GStructF( GStringF("John McClane"), GLongF(0) ) Schema Data GStructF( EnvT((TString(tags1), GStringF("John McClane"))), EnvT((TLong(tags2), GLongF(0))) )
  55. 86.

EnvT is a case class So it can be pattern matched, for example with a type zipping Data & Schema : type DataWithSchema[A] = EnvT[Fix[SchemaF], DataF, A] … match { case EnvT((TStruct(f, m), data @ GStructF(fields))) => … } Or you can access the inner values with ask & lower final case class EnvT[E, W[_], A](run: (E, W[A])) { self => def ask: E = run._1 def lower: W[A] = run._2 }
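Stripped of Matryoshka, `EnvT` with `ask` & `lower` runs standalone; the `Option` layer and `"pii"` tag label below are a hypothetical demo, not the framework's types:

```scala
import scala.language.higherKinds

// EnvT as on the slide: a label of type E paired with one functor layer W[A]
final case class EnvT[E, W[_], A](run: (E, W[A])) { self =>
  def ask: E      = run._1 // the annotation
  def lower: W[A] = run._2 // the underlying layer
}

// Demo: annotate an Option layer with a list of tags
val annotated: EnvT[List[String], Option, Int] = EnvT((List("pii"), Some(42)))
```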
  56. 87.

How to zip ? Using Matryoshka, all you need to do is match Schema ⇔ Data; if it doesn’t match then your data is not compatible with your Schema : def zipWithSchema: CoalgebraM[\/[Incompatibility, ?], DataWithSchema, (Fix[SchemaF], Fix[DataF])] = { case (structf @ Fix(StructF(fields, metadata)), Fix(GStructF(values))) => … // everything is fine ! build the EnvT case (arrayf @ Fix(ArrayF(elementType, metadata)), Fix(GArrayF(values))) => … // everything is fine ! build the EnvT case … // (you get the idea !) case (wutSchema, wutData) => … // everything is not fine ! Incompatibility ! }
  57. 88.

    Let’s build our privacy engine now ! As usual in

    F.P. - once you’ve prepared the Types - it’s easy-peasy : type PrivacyStrategies = Map[Seq[(String, String)], PrivacyStrategy] val privacyAlg: AlgebraM[\/[Incompatibility, ?], DataWithSchema, Fix[DataF]] = { case EnvT((vSchema, value)) => val tags = vSchema.unFix.metadata.tags val fixedValue = Fix(value) privacyStrategies.get(tags).map { privacyStrategy => privacyStrategy.applyOrFail(fixedValue)(logger.error) } .getOrElse(fixedValue) .right }
  58. 89.

    Putting it all together (schema, data).hyloM[\/[Incompatibility, ?], DataWithSchema, Fix[DataF]](privacyAlg, zipWithSchema)

    match { case -\/(incompatibilities) => log.error(s"Found incompatibilities between the observed data and its expected schema : $incompatibilities") case \/-(result) => result }
  59. 90.

    Victory \o/ We now have our most : • versatile

    • generic • efficient privacy engine !
  60. 92.

Problem The Matryoshka engine is perfect BUT : - for every piece of data we need to zip it with its schema - so for 1,000 rows of the same table, we will duplicate the same schema 1,000 times Is it possible to just “prepare” the mutation ?
  61. 93.

How? Let’s build a “Lambda” that will go down into the data according to a schema if there’s something to cypher : - trait MutationOp = NoMutationOp + GoDownOp - Algebra[SchemaF, MutationOp]
  62. 94.

sealed trait MutationOp extends Serializable { def apply(gdata: Fix[DataF]): Fix[DataF] // Transform the data def andThen(f: Fix[DataF] => Fix[DataF]): MutationOp // Chain transformations } // NoOp - nothing comes out of this - there's nothing to do ! case object NoMutationOp extends MutationOp { override def apply(gdata: Fix[DataF]): Fix[DataF] = gdata override def andThen(f: Fix[DataF] => Fix[DataF]): MutationOp = GoDownOp(f) } // A specific [[MutationOp]] that goes "down" and applies a function to the data case class GoDownOp(apply0: Fix[DataF] => Fix[DataF]) extends MutationOp { override def apply(gdata: Fix[DataF]): Fix[DataF] = apply0(gdata) override def andThen(f: Fix[DataF] => Fix[DataF]): MutationOp = GoDownOp(apply0.andThen(f)) } What would be the form of our lambda ?
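The `MutationOp` contract can be exercised without `Fix[DataF]`; this sketch substitutes plain `String` for the data type (a simplification of ours, not the deck's), everything else follows the slide:

```scala
// Hypothetical simplification: data is just String instead of Fix[DataF]
type Data = String

sealed trait MutationOp {
  def apply(gdata: Data): Data                // transform the data
  def andThen(f: Data => Data): MutationOp    // chain transformations
}

// Nothing to do: the identity operation
case object NoMutationOp extends MutationOp {
  def apply(gdata: Data): Data = gdata
  def andThen(f: Data => Data): MutationOp = GoDownOp(f)
}

// Goes "down" and applies a function to the data
case class GoDownOp(apply0: Data => Data) extends MutationOp {
  def apply(gdata: Data): Data = apply0(gdata)
  def andThen(f: Data => Data): MutationOp = GoDownOp(apply0.andThen(f))
}

// Composing starts from NoMutationOp and accumulates functions
val chained: MutationOp = NoMutationOp.andThen(_.toUpperCase).andThen(_ + "!")
```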
  63. 95.

    Let’s build the Algebra Now it’s going to be simpler,

    no incompatibilities to handle : def prepareTransform(schema: Fix[SchemaF], privacyStrategies: PrivacyStrategies): MutationOp = { val privacyAlg: Algebra[SchemaF, MutationOp] = ??? schema.cata[MutationOp](privacyAlg) }
  64. 96.

First things First - Values Like before : case value: ValueF[MutationOp] if value.metadata.tags.nonEmpty => val tags = value.metadata.tags privacyStrategies.get(tags).map { privacyStrategy => GoDownOp(fx => privacyStrategy.applyOrFail(fx)(logger.error)) }.getOrElse(NoMutationOp)
  65. 97.

That was easy - now let’s compose Now it’s not that hard, if there’s something to do - apply it on all the elements : case ArrayF(previousOp, metadata) => previousOp match { case NoMutationOp => NoMutationOp case op => GoDownOp { case Fix(GArrayF(elems)) => val result = elems.map(op.apply) Fix(GArrayF(result)) } }
  66. 98.

And for Structs Same thing - if there’s something to do, apply it on the fields : case StructF(fields, metadata) => if (fields.map(_._2).forall(_ == NoMutationOp)) { // no field needs to be touched NoMutationOp } else { // at least one field needs work done GoDownOp { case Fix(GStructF(dataFields)) => Fix(GStructF(fields.zip(dataFields).map { case ((fieldName, innerOp), (_, data)) => if (innerOp == NoMutationOp || data == Fix[DataF](GNullF())) { (fieldName, data) } else { (fieldName, innerOp(data)) } })) } }
  67. 99.

So According to any given Schema, we can now build - only once - a lambda that : - will zoom into our recursive data - but only go into what it needs to > data.get(0).get(1).get(0) <=== There it is ! - and can be serialized & applied many times
  68. 100.

    Victory \o/ We now have our most : • versatile

    • generic • efficient privacy engine ! At least… on the Heap
  69. 102.

Problem : Applying any of the previous engines to an Apache Spark job (e.g. millions of records) is : - GC intensive (lots of conversions back & forth) - e.g. for the Matryoshka engine : (Spark Row) => (DataF) => (DataWithSchema) => (DataF) => (Row) - not really integrated with Spark (no DataFrame function, so do we need a UDF ? or to go back to RDDs ?)
  70. 103.

Consequences • It breaks the DataFrame Logical Plan optimizations • It generates too many objects ⇒ GC overhead • It becomes tedious to use : > val r = df.rdd.map(row => ApplyPrivacyEngine.transform(schema, row, strategies)) spark.createDataFrame(r, newSchema)... :(
  71. 105.

How : Use the Spark Catalyst engine to generate ad-hoc optimized “Java Code” to - go down into the data according to a schema - mutate it according to privacy - stay “off-heap” using sun.misc.Unsafe as much as possible i.e. : Algebra[SchemaF, CatalystOp] where CatalystOp = NoOp + CatalystCode(InputVariable => String, outputVariable)
  72. 107.

Spark Driver (at startup) Lifecycle : SchemaF Algebra + cata to generate Java Code as String, compiled by Janino; the ByteCode is then sent to the Spark Executor(s)
  73. 108.

    First things first: define your output/contract So : • NoOp

    = No need to do anything • CatalystCode = we’ll generate some code ◦ The “caller” provides the name of the input variable ◦ The case class provides the name of the output variable case class InputVariable(name: String) extends AnyVal sealed trait CatalystOp case class CatalystCode(code: InputVariable => String, outputVariable: String) extends CatalystOp case object NoOp extends CatalystOp
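This contract can be exercised without Spark: a `CatalystCode` fragment is just a function from the caller's input variable name to generated Java source. The masking fragment and the variable names (`masked`, `value_0`) below are hypothetical:

```scala
case class InputVariable(name: String) extends AnyVal

sealed trait CatalystOp
// code: given the caller's input variable, produce Java source
// outputVariable: the name of the variable that source defines
case class CatalystCode(code: InputVariable => String, outputVariable: String) extends CatalystOp
case object NoOp extends CatalystOp

// A fragment that masks a string column: reads the caller's variable, defines `masked`
val maskField = CatalystCode(
  in => s"""String masked = "*".repeat(${in.name}.length());""",
  "masked"
)

// The "caller" decides what the input variable is called
val generated: String = maskField.code(InputVariable("value_0"))
```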
  74. 109.

    Creating a new expression : case class ApplyPrivacyExpression(schema: Fix[SchemaF], //

    Our schema privacyStrategies: PrivacyStrategies, // The strategies to apply children: Seq[Expression] // The top columns of our dataframe ) extends Expression { // can your expression output a null ? override def nullable: Boolean = ??? // How does your expression transform the original schema of your data override def dataType: DataType = ??? // What spark will call to evaluate your expression without codegen override def eval(input: InternalRow) = ??? // here’s the code generation part ! override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = ??? }
  75. 110.

    doGenCode - draft type FieldWithInfos = (DataType, CatalystOp) override protected

    def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { val privacyAlg: Algebra[SchemaF, FieldWithInfos] = ??? ev.copy(code = schema.cata[FieldWithInfos](privacyAlg)._2 match { case NoOp => s""" final InternalRow $output = $input; """ case CatalystCode(method, privacyOutput) => s""" ${method(InputVariable(input))} final InternalRow $output = $privacyOutput; """ }) }
  76. 111.

Let’s create the CatalystCode val privacyAlg: Algebra[SchemaF, FieldWithInfos] = { case StructF(fieldsWithDataTypes, metadata) => // create the code to destroy / re-create the struct // & call the code previously computed for each field case ArrayF(elementType, metadata) => // create the code to destroy / re-create the array // & call the code previously computed for the “elementType” case v: ValueF[FieldWithInfos] if v.metadata.tags.nonEmpty => // create the code to mutate the field (or NoOp) case v: ValueF[FieldWithInfos] if v.metadata.tags.isEmpty => // \o/ NoOp FTW ! }
  77. 112.

ValueF case valueColumnSchema: ValueF[FieldWithInfos] if valueColumnSchema.metadata.tags.nonEmpty => val tags = valueColumnSchema.metadata.tags val valueCode = privacyStrategies.get(tags).map { cypherLambda => val cypherInSpark = ctx.addReferenceObj("cypherLambda", cypherLambda) val code = (inputVariable: InputVariable) => s""" $javaType $output = ($javaType) $cypherInSpark.apply(${inputVariable.name}); """ CatalystCode(code, output) }.getOrElse(NoOp) (valueColumnSchema.dataType, valueCode)
  78. 113.

ArrayF case ArrayF((elementDataType, innerOp), metadata) => val resOp = innerOp match { case NoOp => NoOp case CatalystCode(elementCode, elementOutput) => CatalystCode(inputVariable => s""" Object[] $tmp = new Object[$input.numElements()]; for (int $pos = 0; $pos < $input.numElements(); $pos++) { ${elementCode(InputVariable(s"$input.get($pos)"))} $tmp[$pos] = $elementOutput; } ArrayData $output = new GenericArrayData($tmp); """, output) } (arrayDataType, resOp)
  79. 114.

StructF - from “unsafe” to “on-heap” case StructF(fieldsWithDataTypes, metadata) => val CatalystCode(fieldsCode, _) = generateCodeForStruct(ctx, fieldsWithDataTypes, tmpRow) val code = (inputVariable: InputVariable) => { s""" InternalRow $input = (InternalRow) ${inputVariable.name}; InternalRow $tmpRow = InternalRow.fromSeq($input.toSeq($outputDataType)); ${fieldsCode.apply(InputVariable(tmpRow))} """ } (outputDataType, CatalystCode(code, tmpRow))
  80. 115.

Putting it all together type FieldWithInfos = (DataType, CatalystOp) override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { val privacyAlg: Algebra[SchemaF, FieldWithInfos] = ??? ev.copy(code = schema.cata(privacyAlg) match { case (_, NoOp) => s""" final InternalRow $output = $input; """ case rec @ (_, CatalystCode(method, privacyOutput)) => s""" ${method(InputVariable(input))} final InternalRow $output = $privacyOutput; """ }) }
  81. 116.

Victory \o/ (for real !) It was tough ! But now : - the data stays “off-heap” if it’s not needed - it can even stay in the Tungsten format for Long, Int, etc… while being mutated - & it is deeply integrated with Spark in a non-hacky way !
  82. 118.

Results - Apache Spark job - 10 cores - 5G of heap per executor - 5G of compressed (snappy) Apache Parquet \o/ Engine runtimes : Matryoshka 70 min • Lambda 45 min • Codegen 21 min
  83. 119.

Conclusion Using FP we managed to : • create a generic privacy framework • create 3 engines with different points of view : ◦ Matryoshka Engine for the most complicated cases ◦ Lambda Engine well suited for streaming apps ◦ Codegen Engine well suited for simple low-overhead batch processing • All of that in a testable, (type-)safe, efficient and maintainable way !
  84. 120.

Voilà Special thanks to the people that made it possible : - Amine Sagaama (@sagaama) - Ghazy Ben Ahmed (@ghazy17) - and Valentin Kasas (@ValentinKasas) for the foundations
  85. 121.

    To go further All the code and slides are available

    here - https://github.com/ogirardot/high-perf-privacy-scalaIO2018 Check out the ongoing effort around scalaz-schema - https://github.com/scalaz/scalaz-schema Contact us on twitter : @ogirardot @WiemZin
  86. 122.