
DeviceGraph — Clustering Devices into People at Adobe with Apache Spark

Bucharest FP

April 19, 2017

Transcript

1. DeviceGraph: Clustering Devices into People at Adobe by Using Apache Spark
2. About Us
   • Călin-Andrei Burloiu, Big Data Engineer at Adobe
   • Adobe
     – Creative: Photoshop, Premiere, Audition etc.
     – Marketing Cloud: Audience Manager, Analytics, Target etc.
3. DeviceGraph
   • Also known as Cross-Device Co-op
   • Problem solved
     – Identify devices that belong to the same person
   • Solution
     – Perform connected components on a graph of IDs
     – Identify people as clusters
4. Use Cases
   • Improve reports
     – Count people, not devices or cookies
   • Frequency capping
   • Cross-device targeting
   • Cross-device attribution
5. DeviceGraph
   [Diagram: authenticated profiles P1–P3 linked to anonymous devices D1–D7]
   • User Profiles (Pn): Customer IDs, CRM IDs, Business IDs
   • Devices (Dn): Browser IDs, Mobile Device IDs, Connected Device IDs
   • Deterministic: linking devices based on an authentication event
   • Probabilistic: linking devices based on probabilistic signals (e.g. IPs)
   (A small data-model sketch follows below.)
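Not part of the deck: a small sketch of how a single deterministic or probabilistic link between two IDs could be modelled as data. All names here (LinkType, IdLink) are assumptions made for illustration.

   sealed trait LinkType
   case object Deterministic extends LinkType   // observed in the same authentication event
   case object Probabilistic extends LinkType   // inferred from signals such as a shared IP

   case class IdLink(l: String, r: String, linkType: LinkType)

   val links = Seq(
     IdLink("P1", "D1", Deterministic),   // profile P1 authenticated on device D1
     IdLink("D3", "D4", Probabilistic)    // D3 and D4 observed behind the same IP
   )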
6. Cross-Device Co-operation
   [Diagram: partial ID graphs (D1–D4, P1, P3) contributed by different partners and merged]
   • Beeper knows D1 and D2 are linked
   • Facepalm knows D2 and D3 are linked
7. Apache Spark
   • We use Apache Spark with Scala
   • A fast and general engine for large-scale data processing (Big Data)
   • API:
     – Functional (Scala-like): map, flatMap, filter, sort
     – Relational (SQL-like): select, where, groupBy, join
   • Distributed
     – A Driver node submits work to Executor nodes
   (A small illustration of the two API styles follows below.)
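To make the two API styles concrete, here is a tiny illustration that is not from the deck; it assumes a SparkSession in scope named `spark` and a hypothetical Visit case class.

   case class Visit(device: String, country: String)

   import spark.implicits._

   val visits = Seq(Visit("D1", "us"), Visit("D2", "ca"), Visit("D3", "fr")).toDS()

   // Functional (Scala-like): typed lambdas over the Dataset
   val northAmericanDevices = visits
     .filter(v => Set("us", "ca").contains(v.country))
     .map(_.device)

   // Relational (SQL-like): column-based operators
   val visitsPerCountry = visits.groupBy("country").count()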
8. DeviceGraph Jobs
   [Pipeline diagram] ID-Sync Logs → Ingestion Job → Clustering Input → Clustering Job → Clusters → Reports Job → Metrics
9. Job Design

   trait Job[T <: IO[_, _], C <: Configuration] {
     def run(io: T, conf: C): Unit
   }

   trait IO[I, O] {
     def readInput(): I
     def writeOutput(output: O): Unit
   }
10. Job Design
    • Extracting data access (IO) from the business logic (Job)
    • IO depends on the environment
      – In production we read/write data from Amazon S3 (or HDFS)
      – In tests we create an IO stub with input test cases and expected output
    (A sketch of such a stub follows below.)
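As an illustration of that test setup (not shown in the deck), an IO stub can be a tiny generic class built on the IO trait from slide 9; the name StubIO is an assumption.

    class StubIO[I, O](input: I) extends IO[I, O] {
      var written: Option[O] = None                            // captured output, for assertions

      override def readInput(): I = input
      override def writeOutput(output: O): Unit = written = Some(output)
    }

A test can then construct a StubIO with fixture input, run the job against it, and assert on `written` instead of touching S3 or HDFS.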
11. Functional Infrastructure
    • Immutable Infrastructure
      – IT resources are replaced rather than changed
      – Create a Spark cluster every time we want to run a bunch of jobs
    • Deploy on Amazon EMR (Elastic MapReduce)
    • Deploying jobs is a function
      – which
        • creates a cluster
        • deploys the jobs
        • runs the jobs
      – takes as input
        • Configuration
        • Environment
        • Jobs to run
12. Functional Infrastructure

    abstract class Infrastructure {
      def deploy(
          conf: Configuration,
          env: Environment,
          jobs: Seq[Class[Job[_, _]]]): Unit
    }
13. Ingestion Job: Parsing

    class IngestionJob extends Job[IngestionIO, IngestionConfiguration] {
      override def run(io: IngestionIO, conf: IngestionConfiguration): Unit = {
        val outputRDD = io.readInput()
          .flatMap { line: String =>
            parseLine(line) match {
              case Success(value) => Some(value)
              case Failure(_)     => None
            }
          }
        io.writeOutput(outputRDD.toDS)
      }
    }
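The deck does not show parseLine or the record type it produces. Below is a minimal sketch of what they might look like, assuming each log line is a tab-separated ID and country code; both the line format and the Record name are assumptions.

    import scala.util.Try

    case class Record(id: String, country: String)

    def parseLine(line: String): Try[Record] = Try {
      val Array(id, country) = line.split("\t", -1)   // a malformed line throws and becomes a Failure
      require(id.nonEmpty, s"empty id in line: $line")
      Record(id, country.toLowerCase)
    }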
14. Ingestion Job: Filtering

    class IngestionJob extends Job[IngestionIO, IngestionConfiguration] {
      val acceptedCountries = Set("us", "ca")

      override def run(io: IngestionIO, conf: IngestionConfiguration): Unit = {
        val outputRDD = io.readInput()
          .flatMap { line: String =>
            parseLine(line) match {
              case Success(value) => Some(value)
              case Failure(_)     => None
            }
          }
          .filter { record => acceptedCountries.contains(record.country) }
        io.writeOutput(outputRDD.toDS)
      }
    }
15. Ingestion Job: Counting Errors

    class IngestionJob extends Job[IngestionIO, IngestionConfiguration] {
      val parseFailures = sparkContext.longAccumulator("parseFailures")   // slide callout: "serialized"
      val acceptedCountries = Set("us", "ca")

      override def run(io: IngestionIO, conf: IngestionConfiguration): Unit = {
        val outputRDD = io.readInput()
          .flatMap { line: String =>
            parseLine(line) match {
              case Success(value) => Some(value)
              case Failure(_) =>
                parseFailures.add(1)
                None
            }
          }
          .filter { record => acceptedCountries.contains(record.country) }
        io.writeOutput(outputRDD.toDS)
      }
    }
16. Ingestion Job: Using a Metrics Container

    object IngestionMetrics {
      val parseFailures = sparkContext.longAccumulator("parseFailures")
      val invalidId = sparkContext.longAccumulator("invalidId")
    }
17. Ingestion Job: Using a Metrics Container

    class IngestionJob extends Job[IngestionIO, IngestionConfiguration] {
      val acceptedCountries = Set("us", "ca")

      override def run(io: IngestionIO, conf: IngestionConfiguration): Unit = {
        val outputRDD = io.readInput()
          .flatMap { line: String =>
            parseLine(line) match {
              case Success(value) => Some(value)
              case Failure(_) =>
                IngestionMetrics.parseFailures.add(1)
                None
            }
          }
          .filter { record => acceptedCountries.contains(record.country) }
        io.writeOutput(outputRDD.toDS)
      }
    }

    FAILURE!
    • Accumulators need to be instantiated on the Driver
    • When the Scala app starts, an IngestionMetrics instance will be created on each Spark process, on each machine
    • Not what we want!
18. Ingestion Job: Using a Metrics Container

    class IngestionMetrics extends Metrics {
      val parseFailures = sparkContext.longAccumulator("parseFailures")
      val invalidId = sparkContext.longAccumulator("invalidId")
    }

    object IngestionMetrics {
      lazy val instance = new IngestionMetrics
    }
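The Metrics base type extended above is not shown in the deck. A plausible minimal sketch, assuming its only responsibility is to give subclasses access to the Driver's SparkContext (the member's name and shape are assumptions):

    import org.apache.spark.SparkContext

    trait Metrics extends Serializable {
      // Resolved on the Driver, where the single metrics instance is built;
      // @transient keeps the non-serializable SparkContext out of the shipped closure.
      @transient protected lazy val sparkContext: SparkContext = SparkContext.getOrCreate()
    }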
19. Ingestion Job: Using a Metrics Container

    class IngestionJob extends Job[IngestionIO, IngestionConfiguration] {
      lazy val metrics = IngestionMetrics.instance   // slide callouts: "Instantiated once on the Driver", "serialized"
      val acceptedCountries = Set("us", "ca")

      override def run(io: IngestionIO, conf: IngestionConfiguration): Unit = {
        val outputRDD = io.readInput()
          .flatMap { line: String =>
            parseLine(line) match {
              case Success(value) => Some(value)
              case Failure(_) =>
                metrics.parseFailures.add(1)
                None
            }
          }
          .filter { record => acceptedCountries.contains(record.country) }
        io.writeOutput(outputRDD.toDS)
      }
    }
20. Connected Components
    [Diagram: ID graph with profiles P1–P3 and devices D10–D50]

    l    r    cluster
    P1   D30  D30
    P1   D40  D40 → D30
    P2   D40  D40
    P2   D50  D50 → D40
    P3   D10  D10
    P3   D20  D20 → D10
21. Connected Components
    [Diagram: ID graph with profiles P1–P3 and devices D10–D50]

    l    r    cluster
    D30  P1   D30 → P1
    D40  P1   D30 → P1
    D40  P2   D40 → P1
    D50  P2   D40 → P2
    D10  P3   D10 → P3
    D20  P3   D10 → P3
22. Connected Components
    [Diagram: ID graph with profiles P1–P3 and devices D10–D50]

    l    r    cluster
    P1   D30  P1
    P1   D40  P1
    P2   D40  P1
    P2   D50  P2 → P1
    P3   D10  P3
    P3   D20  P3
23. Connected Components
    [Diagram: ID graph with profiles P1–P3 and devices D10–D50]

    l    r    cluster
    P1   D30  P1
    P1   D40  P1
    P2   D40  P1
    P2   D50  P1
    P3   D10  P3
    P3   D20  P3
24. Clustering Job: Streams and Accumulators

    val changed: Accumulator[Boolean] =
      sparkContext.accumulator(true, "changed")

    val clusterImprovementStream = Iterator.iterate(initialClusters) { oldClusters =>
      changed.setValue(false)
      val newClusters = chooseMinAndPropagateToNeighbor(oldClusters)
      newClusters
    }

    val clusters = clusterImprovementStream.find { _ => !changed.value }
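chooseMinAndPropagateToNeighbor is only named in the deck. Below is a rough sketch of one way such a step could be written, not the production implementation: it assumes clusters are held as an RDD of (id, label) pairs, that an edges: RDD[(String, String)] value with the linked ID pairs is in scope, and that changed is a boolean-OR accumulator (Executors may only add to it).

    import org.apache.spark.rdd.RDD

    def chooseMinAndPropagateToNeighbor(
        clusters: RDD[(String, String)]): RDD[(String, String)] = {
      // Each node offers its current label to its neighbours, in both directions.
      val offers = edges
        .flatMap { case (l, r) => Seq((l, r), (r, l)) }
        .join(clusters)                                   // (node, (neighbour, nodeLabel))
        .map { case (_, (neighbour, label)) => (neighbour, label) }

      // Keep the smallest label per node. Plain String ordering is used here for
      // simplicity; the deck's example converges on the person IDs, which would
      // need an ordering that prefers P-IDs over D-IDs.
      clusters
        .union(offers)
        .reduceByKey((a, b) => if (a <= b) a else b)
        .join(clusters)
        .map { case (node, (newLabel, oldLabel)) =>
          if (newLabel != oldLabel) changed.add(true)     // flag that another iteration is needed
          (node, newLabel)
        }
    }

Each call returns a new labelling; the Iterator.iterate loop above keeps applying it until an iteration leaves `changed` untouched.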
25. Conclusions
    • Use Immutable Infrastructure
      – no side effects
      – everything is in the Configuration
    • Split computation into functional & testable jobs
      – Extract data access from business logic
    • Pay attention to objects that are blindly serialized
      – like Scala singleton objects
    • Leverage mutable state for efficiency
      – Use accumulators for iterative algorithms