Define privacy strategies for specified tags
Person
  “address”: String
  “name”: String
  “email”: String
  “id”: Long
  “pw”: String
  ...
For a Person’s email → email => mask(email)
For a Person’s id → delete(_)
For a Person’s password → hash(_)
...

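The tag-to-strategy mapping on this slide can be sketched as a plain map from tag name to function. This is only an illustration: the `Strategy` type, the `mask` rule (hiding the local part of the address) and the choice of SHA-256 for `hash` are assumptions, not details given in the talk, and values are treated as strings for brevity.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object PrivacyStrategiesExample {
  // Hypothetical: a strategy returns the new value, or None to delete it.
  type Strategy = String => Option[String]

  // Hypothetical masking rule: hide everything before the '@'.
  def mask(email: String): String = {
    val at = email.indexOf('@')
    if (at < 0) "***" else "***" + email.substring(at)
  }

  // SHA-256 hex digest (assumption: the talk does not name a hash function).
  def hash(value: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(value.getBytes(StandardCharsets.UTF_8))
      .map(b => f"${b & 0xff}%02x")
      .mkString

  val maskValue: Strategy   = v => Some(mask(v))
  val deleteValue: Strategy = _ => None
  val hashValue: Strategy   = v => Some(hash(v))

  // One strategy per tag, mirroring the three rules on the slide.
  val strategies: Map[String, Strategy] = Map(
    "email"    -> maskValue,
    "id"       -> deleteValue,
    "password" -> hashValue
  )

  // Untagged values pass through unchanged.
  def applyTo(tag: String, value: String): Option[String] =
    strategies.get(tag) match {
      case Some(strategy) => strategy(value)
      case None           => Some(value)
    }
}
```

For instance, `PrivacyStrategiesExample.applyTo("email", "john@doe.com")` yields `Some("***@doe.com")`, while a value tagged `"id"` is dropped.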
TSchema)], metadata: ColumnMetadata) extends TSchema
case class TArray(elementType: TSchema, metadata: ColumnMetadata) extends TSchema
sealed trait TValue extends TSchema
case class TBoolean(metadata: ColumnMetadata) extends TValue
case class TDate(metadata: ColumnMetadata) extends TValue
case class TDouble(metadata: ColumnMetadata) extends TValue
case class TFloat(metadata: ColumnMetadata) extends TValue
case class TInteger(metadata: ColumnMetadata) extends TValue
case class TLong(metadata: ColumnMetadata) extends TValue
case class TString(metadata: ColumnMetadata) extends TValue

GData)]) extends GData
case class GArray(elements: Seq[GData]) extends GData
sealed trait GValue extends GData
case class GBoolean(value: Boolean) extends GValue
case class GString(value: String) extends GValue
...

metadata: ColumnMetadata) extends TSchema
case class TArray(elementType: TSchema, metadata: ColumnMetadata) extends TSchema
sealed trait TValue extends TSchema
case class TBoolean(metadata: ColumnMetadata) extends TValue
case class TString(metadata: ColumnMetadata) extends TValue
...

metadata: ColumnMetadata) extends TSchema
case class TArray(elementType: TSchema, metadata: ColumnMetadata) extends TSchema
sealed trait TValue extends TSchema
case class TBoolean(metadata: ColumnMetadata) extends TValue
case class TString(metadata: ColumnMetadata) extends TValue
...

sealed trait SchemaF[A]
case class StructF[A](fields: List[(String, A)], metadata: ColumnMetadata) extends SchemaF[A]
case class ArrayF[A](elementType: A, metadata: ColumnMetadata) extends SchemaF[A]
sealed trait ValueF[A] extends SchemaF[A]
case class BooleanF[A](metadata: ColumnMetadata) extends ValueF[A]
case class StringF[A](metadata: ColumnMetadata) extends ValueF[A]
…

def map[A, B](fa: SchemaF[A])(f: A => B): SchemaF[B] = fa match {
  case StructF(fields, m) => StructF(fields.map { case (name, value) => name -> f(value) }, m)
  case ArrayF(elem, m) => ArrayF(f(elem), m)
  case BooleanF(m) => BooleanF(m)
  case StringF(m) => StringF(m)
  case IntegerF(m) => IntegerF(m)
  ...
}
}

StructF(fields, _) => StructType(fields.map { case (name, value) => StructField(name, value) }.toArray)
case ArrayF(elem, m) => ArrayType(elem, containsNull = false)
case BooleanF(_) => BooleanType
case DateF(_) => DateType
case DoubleF(_) => DoubleType
case FloatF(_) => FloatType
case IntegerF(_) => IntegerType
case LongF(_) => LongType
case StringF(_) => StringType
}

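To see how a functor instance and an algebra such as the ones above fit together, here is a minimal, library-free sketch. It assumes a simplified `SchemaF` without `ColumnMetadata`, a hand-rolled `Fix` and `cata` standing in for Matryoshka's, and a hypothetical `render` algebra that produces a type string instead of a Spark `DataType`.

```scala
object SchemaCataExample {
  // Simplified pattern functor: ColumnMetadata is left out for brevity.
  sealed trait SchemaF[A]
  final case class StructF[A](fields: List[(String, A)]) extends SchemaF[A]
  final case class ArrayF[A](elementType: A) extends SchemaF[A]
  final case class BooleanF[A]() extends SchemaF[A]
  final case class StringF[A]() extends SchemaF[A]

  // Fixed point: ties the recursive knot that SchemaF leaves open.
  final case class Fix[F[_]](unFix: F[Fix[F]])

  // Functor map: applies f to every "hole" of one layer.
  def map[A, B](fa: SchemaF[A])(f: A => B): SchemaF[B] = fa match {
    case StructF(fields) => StructF(fields.map { case (n, v) => n -> f(v) })
    case ArrayF(elem)    => ArrayF(f(elem))
    case BooleanF()      => BooleanF()
    case StringF()       => StringF()
  }

  // Hand-rolled catamorphism: fold the children first, then this layer.
  def cata[A](schema: Fix[SchemaF])(alg: SchemaF[A] => A): A =
    alg(map(schema.unFix)(child => cata(child)(alg)))

  // Hypothetical algebra: each hole already holds the rendered child type.
  val render: SchemaF[String] => String = {
    case StructF(fields) =>
      fields.map { case (n, t) => s"$n:$t" }.mkString("struct<", ",", ">")
    case ArrayF(elem) => s"array<$elem>"
    case BooleanF()   => "boolean"
    case StringF()    => "string"
  }

  val person: Fix[SchemaF] = Fix[SchemaF](StructF(List(
    "name" -> Fix[SchemaF](StringF()),
    "tags" -> Fix[SchemaF](ArrayF(Fix[SchemaF](StringF())))
  )))
}
```

The algebra never recurses itself: `cata(person)(render)` returns `"struct<name:string,tags:array<string>>"` because each layer only sees the already-computed results of its children.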
class EnvT[E, W[_], A](run: (E, W[A]))
EnvT is a Matryoshka pattern functor that annotates a functor W[_] with a label of type E and takes a type parameter A.

DataWithSchema[A] = EnvT[Fix[SchemaF], DataF, A]
final case class EnvT[E, W[_], A](run: (E, W[A])) { self =>
  def ask: E = run._1
  def lower: W[A] = run._2
}
This is the “hole” that will be filled with intermediate computations, i.e. the results of the previous layer.

and zip them together
• The result might fail if the data and the schema are not compatible.
Privacy engine - Zip the Data & the Schema
(TSchema, Fix[DataF]) → Either[Incompatibility, DataWithSchema]
type DataWithSchema[A] = EnvT[TSchema, DataF, A]

zipWithSchema: CoalgebraM[\/[Incompatibility, ?], DataWithSchema, (TSchema, Fix[DataF])] = {
  case (structf @ TStruct(fields, metadata), Fix(GStructF(values))) => … // everything is fine! build the EnvT
  case (arrayf @ TArray(elementType, metadata), Fix(GArrayF(values))) => … // everything is fine! build the EnvT (you get the idea!)
  case values …
  case (wutSchema, wutData) => … // everything is not fine! Incompatibility!
}

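The zipping step can be illustrated without Matryoshka by a direct recursive function over simplified types. Everything here is an assumption made for the sketch: the `Schema`, `Data` and `Zipped` ADTs (with `Zipped` playing the role of `EnvT`), the standard library `Either` in place of `\/`, and the `sequence` helper.

```scala
object ZipWithSchemaExample {
  sealed trait Schema
  final case class SStruct(fields: List[(String, Schema)]) extends Schema
  final case class SArray(elementType: Schema) extends Schema
  final case class SString(tags: Set[String]) extends Schema

  sealed trait Data
  final case class DStruct(fields: List[(String, Data)]) extends Data
  final case class DArray(elements: List[Data]) extends Data
  final case class DString(value: String) extends Data

  // Data annotated with its schema at every node (the role of EnvT).
  sealed trait Zipped { def schema: Schema }
  final case class ZStruct(schema: Schema, fields: List[(String, Zipped)]) extends Zipped
  final case class ZArray(schema: Schema, elements: List[Zipped]) extends Zipped
  final case class ZString(schema: Schema, value: String) extends Zipped

  final case class Incompatibility(message: String)

  // List of Eithers to Either of List, keeping the first Left in list order.
  def sequence[A](xs: List[Either[Incompatibility, A]]): Either[Incompatibility, List[A]] =
    xs.foldRight[Either[Incompatibility, List[A]]](Right(Nil)) { (x, acc) =>
      for { h <- x; t <- acc } yield h :: t
    }

  def zip(schema: Schema, data: Data): Either[Incompatibility, Zipped] =
    (schema, data) match {
      // Structs match only when field names line up one-to-one.
      case (SStruct(sf), DStruct(df)) if sf.map(_._1) == df.map(_._1) =>
        sequence(sf.zip(df).map { case ((name, s), (_, d)) => zip(s, d).map(name -> _) })
          .map(ZStruct(schema, _))
      case (SArray(elem), DArray(values)) =>
        sequence(values.map(zip(elem, _))).map(ZArray(schema, _))
      case (SString(_), DString(v)) =>
        Right(ZString(schema, v))
      // Any other pairing means the data does not fit the schema.
      case (s, d) =>
        Left(Incompatibility(s"schema $s does not match data $d"))
    }
}
```

Zipping a struct schema with a bare `DString` returns a `Left`, which corresponds to the final catch-all case of the coalgebra.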
that matches the tags in the Schema.
Privacy engine - Encrypt Data that matches the tags in the Schema
type DataWithSchema[A] = EnvT[TSchema, DataF, A]
Either[Incompatibility, DataWithSchema] → Either[Incompatibility, Fix[DataF]]

Privacy engine - Apply privacy
(schema, data).hyloM[\/[Incompatibility, ?], DataWithSchema, Fix[DataF]](privacyAlg, zipWithSchema) match {
  case -\/(incompatibilities) =>
    log.error(s"Found incompatibilities between the observed data and its expected schema : $incompatibilities")
  case \/-(result) =>
    result
}

piece of data we need to zip it with its schema.
- For 1,000 rows of the same table, we will duplicate the same schema.
Is it possible to just “prepare” the mutation?
Lambda

… case ArrayF(previousOp, metadata) =>
  previousOp match {
    case NoOp => NoOp
    case op => GoDownOp {
      case Fix(GArrayF(elems)) =>
        val result = elems.map(previousOp.apply)
        Fix(GArrayF(result))
    }
    ...
  }

case StructF(fields, metadata) =>
  if (fields.forall(_._2 == NoOp)) {
    // no field needs to be touched
    NoOp
  } else {
    // at least one field needs work
    GoDownOp {
      case Fix(GStructF(dataFields)) =>
        Fix(GStructF(fields.zip(dataFields).map {
          case ((fieldName, innerOp), (_, data)) =>
            if (innerOp == NoOp) {
              (fieldName, data)
            } else {
              (fieldName, innerOp(data))
            }
        }))
    }
    ...
  }

only once a lambda that:
- will zoom into our recursive data
- but only goes into what it needs to
  > data.get(0).get(1).get(0) <=== There it is!
- and can be serialized & applied many times

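The "prepare once, apply many times" idea can be sketched end to end with simplified types. The `Schema` and `Data` ADTs, the `prepare` and `run` names, and the `String => String` strategies are assumptions made for this illustration; only `NoOp` and `GoDownOp` come from the slides. If a field carries several matching tags, this sketch picks an arbitrary one.

```scala
object PreparedLambdaExample {
  sealed trait Schema
  final case class SStruct(fields: List[(String, Schema)]) extends Schema
  final case class SArray(elementType: Schema) extends Schema
  final case class SString(tags: Set[String]) extends Schema

  sealed trait Data
  final case class DStruct(fields: List[(String, Data)]) extends Data
  final case class DArray(elements: List[Data]) extends Data
  final case class DString(value: String) extends Data

  sealed trait Op
  case object NoOp extends Op
  final case class GoDownOp(run: Data => Data) extends Op

  // Built once per schema: untouched subtrees collapse to NoOp.
  def prepare(schema: Schema, strategies: Map[String, String => String]): Op =
    schema match {
      case SString(tags) =>
        tags.collectFirst { case t if strategies.contains(t) => strategies(t) } match {
          case Some(f) => GoDownOp {
            case DString(v) => DString(f(v))
            case other      => other // mismatching data is left as-is in this sketch
          }
          case None => NoOp
        }
      case SArray(elem) =>
        prepare(elem, strategies) match {
          case NoOp => NoOp
          case GoDownOp(f) => GoDownOp {
            case DArray(elems) => DArray(elems.map(f))
            case other         => other
          }
        }
      case SStruct(fields) =>
        val ops = fields.map { case (name, s) => name -> prepare(s, strategies) }
        if (ops.forall(_._2 == NoOp)) NoOp
        else GoDownOp {
          case DStruct(dataFields) =>
            DStruct(ops.zip(dataFields).map {
              case ((name, GoDownOp(f)), (_, d)) => (name, f(d))
              case ((name, NoOp), (_, d))        => (name, d) // skip untouched fields
            })
          case other => other
        }
    }

  // Applied to every row; no schema traversal happens here.
  def run(op: Op, data: Data): Data = op match {
    case NoOp        => data
    case GoDownOp(f) => f(data)
  }
}
```

The schema is traversed only inside `prepare`; `run` just invokes the resulting closure, so the per-row cost is limited to the branches that actually contain tagged fields.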
Spark Job (e.g. millions of records) is:
- GC intensive (lots of conversions back & forth)
  - e.g. for the Matryoshka engine: (Spark Row) => (DataF) => (DataWithSchema) => (DataF) => (Row)
- Not really integrated with Spark (no DataFrame function, so do we need a UDF? Or go back to RDD?)

Consequences
It generates too many objects => GC Overflow
• It becomes tedious to use:
val transformed = df.rdd.map( row => PrivacyEngine.transform(schema, row, strategies) )
val newSchema = PrivacyEngine.transformSchema(schema, strategies)
spark.createDataFrame(transformed, newSchema) // :( :( :( :(

Code” to
- go down into the data according to a schema
- mutate it according to privacy
- stay “off-heap” using sun.misc.Unsafe as much as possible
How?

to do anything
• CatalystCode = we’ll generate some code
  ◦ The “caller” provides the name of the input variable
  ◦ The case class provides the name of the output variable
case class InputVariable(name: String) extends AnyVal
sealed trait CatalystOp
case class CatalystCode(code: InputVariable => String, outputVariable: String) extends CatalystOp
case object NoOp extends CatalystOp

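A small sketch may help show how the caller and the case class split the naming of variables. The `upperCase` operation, the `render` helper and the generated Java snippet are hypothetical and not taken from the talk; real generated code would operate on Spark's internal types rather than `String`.

```scala
object CatalystOpExample {
  final case class InputVariable(name: String) extends AnyVal

  sealed trait CatalystOp
  final case class CatalystCode(code: InputVariable => String, outputVariable: String) extends CatalystOp
  case object NoOp extends CatalystOp

  // Hypothetical op: emits Java source that upper-cases a String variable.
  // The op chooses the output variable name; the input name comes later.
  def upperCase(outputVariable: String): CatalystOp =
    CatalystCode(
      in => s"String $outputVariable = ${in.name}.toUpperCase();",
      outputVariable
    )

  // Hypothetical helper: returns (generated code, variable holding the result).
  // For NoOp nothing is generated and the input variable is reused as-is.
  def render(op: CatalystOp, input: InputVariable): (String, String) = op match {
    case NoOp                     => ("", input.name)
    case CatalystCode(code, outV) => (code(input), outV)
  }
}
```

With these definitions, `render(upperCase("masked"), InputVariable("value1"))` gives the pair `("String masked = value1.toUpperCase();", "masked")`, and the next operation in the chain can take `masked` as its input variable.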
Our schema
  privacyStrategies: PrivacyStrategies, // The strategies to apply
  children: Seq[Expression] // The top columns of our dataframe
) extends Expression {
  // Can your expression output a null?
  override def nullable: Boolean = ???
  // How does your expression transform the original schema of your data?
  override def dataType: DataType = ???
  // What Spark will call to evaluate your expression without codegen
  override def eval(input: InternalRow) = ???
  // Here’s the code generation part!
  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = ???
}

case StructF(fieldsWithDataTypes, metadata) =>
  // create the code to destroy / re-create the struct
  // & call the code previously computed for each field
case ArrayF(elementType, metadata) =>
  // create the code to destroy / re-create the array
  // & call the code previously computed for the “elementType”
case v: ValueF[FieldTypeAndCode] if v.metadata.tags.nonEmpty =>
  // create the code to mutate the field (or NoOp)
case v: ValueF[FieldTypeAndCode] if v.metadata.tags.isEmpty =>
  // \o/ NoOp FTW !
}

! But now:
- The data stays “off”-heap if it’s not needed
- It can even stay in the Tungsten format for Long, Int, etc. while being mutated
- It is deeply integrated with Spark in a non-hacky way!

a generic privacy framework
• create 3 engines with different points of view:
  ◦ Matryoshka Engine for the most complicated cases
  ◦ Lambda Engine, well suited for streaming apps
  ◦ Codegen Engine, well suited for simple, low-overhead batch processing
• All of that in a testable, (type-)safe, efficient and maintainable way!