Slide 1

Slide 1 text

DeviceGraph
Clustering Devices into People at Adobe by Using Apache Spark

Slide 2

Slide 2 text

About Us
• Călin-Andrei Burloiu, Big Data Engineer at Adobe
• Adobe
  – Creative: Photoshop, Premiere, Audition etc.
  – Marketing Cloud: Audience Manager, Analytics, Target etc.

Slide 3

Slide 3 text

DeviceGraph
• Also known as Cross-Device Co-op
• Problem solved
  – Identify devices that belong to the same person
• Solution
  – Perform connected components on a graph of IDs (a small illustration follows below)
  – Identify people as clusters
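For illustration only (this is not the deck's implementation, which is built job by job in the later slides): the same idea expressed with GraphX's built-in connectedComponents, assuming profiles and devices have already been mapped to numeric vertex IDs.

import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Hypothetical numeric IDs: 1L, 2L are profiles; 101L-103L are devices.
def clusterIds(sc: SparkContext): RDD[(Long, Long)] = {
  val edges: RDD[Edge[Int]] = sc.parallelize(Seq(
    Edge(1L, 101L, 1), // P1 -- D1 (authentication event)
    Edge(1L, 102L, 1), // P1 -- D2
    Edge(2L, 103L, 1)  // P2 -- D3
  ))
  val graph = Graph.fromEdges(edges, defaultValue = 0)
  // Every vertex ends up labelled with the smallest vertex ID in its
  // connected component, i.e. its "person" cluster.
  graph.connectedComponents().vertices
}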

Slide 4

Slide 4 text

Use Cases
• Improve reports
  – Count people, not devices or cookies
• Frequency capping
• Cross-device targeting
• Cross-device attribution

Slide 5

Slide 5 text

DeviceGraph (diagram: authenticated profiles P1–P3 linked to anonymous devices D1–D7)
• User Profiles (Pn): Customer IDs, CRM IDs, Business IDs
• Devices (Dn): Browser IDs, Mobile Device IDs, Connected Device IDs
• Deterministic: linking devices based on an authentication event
• Probabilistic: linking devices based on probabilistic signals (e.g. IPs)

Slide 6

Slide 6 text

Cross-Device Co-operation (diagram)
• Beeper knows D1 and D2 are linked
• Facepalm knows D2 and D3 are linked

Slide 7

Slide 7 text

Apache Spark
• We use Apache Spark with Scala
• A fast and general engine for large-scale data processing (Big Data)
• API (see the example below):
  – Functional (Scala-like): map, flatMap, filter, sort
  – Relational (SQL-like): select, where, groupBy, join
• Distributed
  – A Driver node submits work to Executor nodes
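A small illustration of the two API styles; the SparkSession setup and the toy data are assumptions, not from the deck.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("api-styles").master("local[*]").getOrCreate()
import spark.implicits._

// Functional (Scala-like): transformations on a distributed collection.
val usDeviceIds = spark.sparkContext
  .parallelize(Seq("d1:us", "d2:fr", "d3:us"))
  .map(_.split(":"))
  .filter(_(1) == "us")
  .map(_(0))

// Relational (SQL-like): the same filter expressed over named columns.
val devices = Seq(("d1", "us"), ("d2", "fr"), ("d3", "us")).toDF("id", "country")
val usDevices = devices.select("id").where($"country" === "us")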

Slide 8

Slide 8 text

DeviceGraph Jobs (pipeline diagram)
ID-Sync Logs → Ingestion Job → Clustering Input → Clustering Job → Clusters → Reports Job → Metrics / Reports

Slide 9

Slide 9 text

Job Design

trait Job[T <: IO[_, _], C <: Configuration] {
  def run(io: T, conf: C): Unit
}

trait IO[I, O] {
  def readInput(): I
  def writeOutput(output: O): Unit
}

Slide 10

Slide 10 text

Job Design
• Extracting data access (IO) from the business logic (Job)
• IO depends on the environment
  – In production we read/write data from Amazon S3 (or HDFS)
  – In tests we create an IO stub with input test cases and expected output (see the sketch below)
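For example, a test stub could look like the following. The type parameters of IngestionIO are not spelled out in the deck, so this is a sketch under the assumption that the ingestion job reads raw log lines and writes a Dataset of parsed records; IdSyncRecord is a hypothetical name.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical IO stub for tests: canned input lines in, captured output out.
class StubIngestionIO(spark: SparkSession, inputLines: Seq[String])
    extends IO[RDD[String], Dataset[IdSyncRecord]] {

  // Whatever the job writes is kept here so the test can assert on it.
  var written: Option[Dataset[IdSyncRecord]] = None

  override def readInput(): RDD[String] =
    spark.sparkContext.parallelize(inputLines)

  override def writeOutput(output: Dataset[IdSyncRecord]): Unit =
    written = Some(output)
}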

Slide 11

Slide 11 text

Functional Infrastructure
• Immutable Infrastructure
  – IT resources are replaced rather than changed
  – Create a Spark cluster every time we want to run a bunch of jobs
• Deploy on Amazon EMR (Elastic MapReduce)
• Deploying jobs is a function
  – which
    • creates a cluster
    • deploys the jobs
    • runs the jobs
  – takes as input
    • Configuration
    • Environment
    • Jobs to run

Slide 12

Slide 12 text

Functional Infrastructure

abstract class Infrastructure {
  def deploy(
      conf: Configuration,
      env: Environment,
      jobs: Seq[Class[Job[_, _]]]): Unit
}
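A hypothetical environment-specific implementation might look like the sketch below; EmrInfrastructure and its private helpers are illustrative names, and the actual EMR calls are elided.

// Sketch only: each deploy spins up a fresh, immutable EMR cluster,
// ships the job artifacts, and runs the jobs as cluster steps.
class EmrInfrastructure extends Infrastructure {
  override def deploy(
      conf: Configuration,
      env: Environment,
      jobs: Seq[Class[Job[_, _]]]): Unit = {
    val clusterId = createCluster(conf, env)               // new cluster per run
    uploadArtifacts(clusterId, jobs)                       // ship the job JARs
    jobs.foreach(job => submitStep(clusterId, job, conf))  // run each job as a step
  }

  // EMR-specific plumbing, elided in this sketch.
  private def createCluster(conf: Configuration, env: Environment): String = ???
  private def uploadArtifacts(clusterId: String, jobs: Seq[Class[Job[_, _]]]): Unit = ???
  private def submitStep(clusterId: String, job: Class[Job[_, _]], conf: Configuration): Unit = ???
}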

Slide 13

Slide 13 text

Ingestion Job: Parsing

class IngestionJob extends Job[IngestionIO, IngestionConfiguration] {
  override def run(io: IngestionIO, conf: IngestionConfiguration): Unit = {
    val outputRDD = io.readInput()
      .flatMap { line: String =>
        parseLine(line) match {
          case Success(value) => Some(value)
          case Failure(_)     => None
        }
      }
    io.writeOutput(outputRDD.toDS)
  }
}
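parseLine is not shown in the deck; a minimal sketch, assuming a tab-separated log format with a device ID, a profile ID and a country (the field layout and the IdSyncRecord name are assumptions):

import scala.util.Try

// Hypothetical parsed record; the real schema is not shown in the deck.
case class IdSyncRecord(deviceId: String, profileId: String, country: String)

def parseLine(line: String): Try[IdSyncRecord] = Try {
  line.split("\t", -1) match {
    case Array(deviceId, profileId, country) =>
      IdSyncRecord(deviceId, profileId, country)
    case _ =>
      throw new IllegalArgumentException(s"Malformed line: $line")
  }
}

Returning Try keeps the job's flatMap free of try/catch: a Success becomes Some(value), a Failure is dropped (and, on the next slides, counted).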

Slide 14

Slide 14 text

Ingestion Job: Filtering

class IngestionJob extends Job[IngestionIO, IngestionConfiguration] {
  val acceptedCountries = Set("us", "ca")

  override def run(io: IngestionIO, conf: IngestionConfiguration): Unit = {
    val outputRDD = io.readInput()
      .flatMap { line: String =>
        parseLine(line) match {
          case Success(value) => Some(value)
          case Failure(_)     => None
        }
      }
      .filter { record => acceptedCountries.contains(record.country) }
    io.writeOutput(outputRDD.toDS)
  }
}

Slide 15

Slide 15 text

Ingestion Job: Counting Errors

class IngestionJob extends Job[IngestionIO, IngestionConfiguration] {
  val parseFailures = sparkContext.longAccumulator("parseFailures")
  val acceptedCountries = Set("us", "ca")

  override def run(io: IngestionIO, conf: IngestionConfiguration): Unit = {
    val outputRDD = io.readInput()
      .flatMap { line: String =>
        parseLine(line) match {
          case Success(value) => Some(value)
          case Failure(_) =>
            parseFailures.add(1)
            None
        }
      }
      .filter { record => acceptedCountries.contains(record.country) }
    io.writeOutput(outputRDD.toDS)
  }
}

(slide annotation: serialized — the closure referencing the accumulator is shipped to the executors)

Slide 16

Slide 16 text

Ingestion Job: Using a Metrics Container

object IngestionMetrics {
  val parseFailures = sparkContext.longAccumulator("parseFailures")
  val invalidId = sparkContext.longAccumulator("invalidId")
}

Slide 17

Slide 17 text

Ingestion Job: Using a Metrics Container

class IngestionJob extends Job[IngestionIO, IngestionConfiguration] {
  val acceptedCountries = Set("us", "ca")

  override def run(io: IngestionIO, conf: IngestionConfiguration): Unit = {
    val outputRDD = io.readInput()
      .flatMap { line: String =>
        parseLine(line) match {
          case Success(value) => Some(value)
          case Failure(_) =>
            IngestionMetrics.parseFailures.add(1)
            None
        }
      }
      .filter { record => acceptedCountries.contains(record.country) }
    io.writeOutput(outputRDD.toDS)
  }
}

FAILURE!
• Accumulators need to be instantiated on the Driver
• When the Scala app starts, an IngestionMetrics instance will be created on each Spark process on each machine
• Not what we want!

Slide 18

Slide 18 text

Ingestion Job: Using a Metrics Container

class IngestionMetrics extends Metrics {
  val parseFailures = sparkContext.longAccumulator("parseFailures")
  val invalidId = sparkContext.longAccumulator("invalidId")
}

object IngestionMetrics {
  lazy val instance = new IngestionMetrics
}
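The Metrics base class itself is not shown in the deck; a minimal sketch of what it could provide (the sparkContext member and the use of SparkContext.getOrCreate() are assumptions):

import org.apache.spark.SparkContext

// Hypothetical base: exposes the driver's SparkContext so that subclasses
// such as IngestionMetrics can register their accumulators on it.
trait Metrics extends Serializable {
  protected def sparkContext: SparkContext = SparkContext.getOrCreate()
}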

Slide 19

Slide 19 text

Ingestion Job: Using a Metrics Container

class IngestionJob extends Job[IngestionIO, IngestionConfiguration] {
  lazy val metrics = IngestionMetrics.instance   // slide annotation: instantiated once, on the Driver
  val acceptedCountries = Set("us", "ca")

  override def run(io: IngestionIO, conf: IngestionConfiguration): Unit = {
    val outputRDD = io.readInput()
      .flatMap { line: String =>
        parseLine(line) match {
          case Success(value) => Some(value)
          case Failure(_) =>
            metrics.parseFailures.add(1)         // slide annotation: serialized with the closure
            None
        }
      }
      .filter { record => acceptedCountries.contains(record.country) }
    io.writeOutput(outputRDD.toDS)
  }
}

Slide 20

Slide 20 text

Connected Components (diagram: graph with profiles P1–P3 and devices D10–D50)

l  | r   | cluster
P1 | D30 | D30
P1 | D40 | D40 → D30
P2 | D40 | D40
P2 | D50 | D50 → D40
P3 | D10 | D10
P3 | D20 | D20 → D10

(graph labels after this step: P1 → D30, P2 → D40, P3 → D10)

Slide 21

Slide 21 text

Connected Components (diagram)

l   | r  | cluster
D30 | P1 | D30 → P1
D40 | P1 | D30 → P1
D40 | P2 | D40 → P1
D50 | P2 | D40 → P2
D10 | P3 | D10 → P3
D20 | P3 | D10 → P3

(cluster labels on the graph nodes omitted)

Slide 22

Slide 22 text

Connected Components (diagram)

l  | r   | cluster
P1 | D30 | P1
P1 | D40 | P1
P2 | D40 | P1
P2 | D50 | P2 → P1
P3 | D10 | P3
P3 | D20 | P3

(cluster labels on the graph nodes omitted)

Slide 23

Slide 23 text

Connected Components (diagram)

l  | r   | cluster
P1 | D30 | P1
P1 | D40 | P1
P2 | D40 | P1
P2 | D50 | P1
P3 | D10 | P3
P3 | D20 | P3

(final labels on the graph: the P1 component contains P1, P2, D30, D40, D50; the P3 component contains P3, D10, D20)

Slide 24

Slide 24 text

Clustering Job: Streams and Accumulators

val changed: Accumulator[Boolean] = sparkContext.accumulator(true, "changed")

val clusterImprovementStream = Iterator.iterate(initialClusters) { oldClusters =>
  changed.setValue(false)
  val newClusters = chooseMinAndPropagateToNeighbor(oldClusters)
  newClusters
}

val clusters = clusterImprovementStream.find { _ => !changed.value }
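chooseMinAndPropagateToNeighbor is not shown in the deck. Below is a minimal sketch of one improvement step over the (l, r, cluster) edge table from slides 20–23; ClusterEdge, the RDD shape, the lexicographic ordering of labels, and passing the accumulator explicitly (the slide captures it from the enclosing scope) are assumptions, as is a Boolean "or" AccumulatorParam being in scope for the changed accumulator.

import org.apache.spark.Accumulator
import org.apache.spark.rdd.RDD

// Hypothetical edge record matching the l / r / cluster tables on slides 20-23.
case class ClusterEdge(l: String, r: String, cluster: String)

// One step: every l group, then every r group, adopts the smallest cluster
// label seen on its edges; `changed` is raised whenever a label improves,
// so the Iterator.iterate loop above knows when to stop.
def chooseMinAndPropagateToNeighbor(
    edges: RDD[ClusterEdge],
    changed: Accumulator[Boolean]): RDD[ClusterEdge] = {

  def propagate(keyOf: ClusterEdge => String)(in: RDD[ClusterEdge]): RDD[ClusterEdge] = {
    val minPerKey = in
      .map(e => (keyOf(e), e.cluster))
      .reduceByKey((a, b) => if (a <= b) a else b)   // smallest label per node
    in.map(e => (keyOf(e), e))
      .join(minPerKey)
      .map { case (_, (e, minCluster)) =>
        if (minCluster != e.cluster) changed.add(true)
        e.copy(cluster = minCluster)
      }
  }

  propagate(_.r)(propagate(_.l)(edges))
}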

Slide 25

Slide 25 text

Conclusions
• Use Immutable Infrastructure
  – No side effects
  – Everything is in the Configuration
• Split computation into functional & testable jobs
  – Extract data access from business logic
• Pay attention to objects that are blindly serialized
  – like Scala singleton objects
• Leverage mutable state for efficiency
  – Use accumulators for iterative algorithms

Slide 26

Slide 26 text

© 2016 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.