About Us
• Călin-Andrei Burloiu, Big Data Engineer at Adobe
• Adobe
  – Creative
    • Photoshop, Premiere, Audition, etc.
  – Marketing Cloud
    • Audience Manager, Analytics, Target, etc.
DeviceGraph
• Also known as Cross-Device Co-op
• Problem solved
  – Identify devices that belong to the same person
• Solution
  – Perform connected components on a graph of IDs (see the sketch below)
  – Identify people as clusters
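As a rough sketch of the clustering idea above (not the production pipeline), connected components over pairs of IDs can be computed with Spark's GraphX. The sample pairs and the hashing of string IDs to Long vertex IDs are assumptions made for this example, and sc is an existing SparkContext:

import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical input: pairs of device IDs observed to belong together.
val idPairs: RDD[(String, String)] = sc.parallelize(Seq(
  ("cookie-123", "idfa-456"),
  ("idfa-456", "email-hash-789")
))

// Hash string IDs to Long vertex IDs (an assumption made for this sketch).
val edges: RDD[(VertexId, VertexId)] =
  idPairs.map { case (a, b) => (a.hashCode.toLong, b.hashCode.toLong) }

// Each connected component is one cluster of IDs, i.e. one person.
val graph = Graph.fromEdgeTuples(edges, defaultValue = 1)
val clusters = graph.connectedComponents().vertices   // (vertexId, componentId)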
Apache Spark
• We use Apache Spark with Scala
• A fast and general engine for large-scale data processing (Big Data)
• API:
  – Functional (Scala-like): map, flatMap, filter, sort
  – Relational (SQL-like): select, where, groupBy, join
  – Both styles are sketched below
• Distributed
  – A Driver node submits work to Executor nodes
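A minimal sketch of the two API styles (not code from the DeviceGraph jobs), assuming a local SparkSession and a made-up visits dataset:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("api-demo").master("local[*]").getOrCreate()
import spark.implicits._

val visits = Seq(("us", 3), ("ca", 5), ("de", 2)).toDF("country", "count")

// Functional (Scala-like) style on a typed Dataset
val doubled = visits.as[(String, Int)]
  .map { case (country, n) => (country, n * 2) }
  .filter(_._2 > 4)

// Relational (SQL-like) style on a DataFrame
val perCountry = visits
  .where(col("count") > 1)
  .groupBy("country")
  .agg(sum("count").as("total"))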
Job Design
• Extract data access (IO) from the business logic (Job)
• IO depends on the environment
  – In production we read/write data from Amazon S3 (or HDFS)
  – In tests we create an IO stub with input test cases and expected output (see the sketch below)
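A hedged sketch of what this split can look like; the trait shape, the Record type, and the stub below are assumptions for illustration, not the actual production interfaces:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}

case class Record(id: String, country: String)

// The business logic only sees this abstraction; it never knows about S3, HDFS or test data.
trait IngestionIO {
  def readInput(): RDD[String]
  def writeOutput(records: Dataset[Record]): Unit
}

// Production IO: paths point to Amazon S3 (or HDFS).
class S3IngestionIO(spark: SparkSession, inputPath: String, outputPath: String) extends IngestionIO {
  override def readInput(): RDD[String] = spark.sparkContext.textFile(inputPath)
  override def writeOutput(records: Dataset[Record]): Unit = records.write.parquet(outputPath)
}

// Test IO stub: serves in-memory test cases and captures the output for assertions.
class StubIngestionIO(spark: SparkSession, lines: Seq[String]) extends IngestionIO {
  var written: Seq[Record] = Seq.empty
  override def readInput(): RDD[String] = spark.sparkContext.parallelize(lines)
  override def writeOutput(records: Dataset[Record]): Unit = { written = records.collect().toSeq }
}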
Functional Infrastructure
• Immutable Infrastructure
  – IT resources are replaced rather than changed
  – Create a new Spark cluster every time we want to run a batch of jobs
• Deploy on Amazon EMR (Elastic MapReduce)
• Deploying jobs is a function (sketch below)
  – which
    • creates a cluster
    • deploys the jobs
    • runs the jobs
  – takes as input
    • Configuration
    • Environment
    • Jobs to run
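An illustrative shape of "deploying jobs is a function"; the type names and the helper functions are hypothetical stand-ins for the real EMR calls, not the actual deployment code:

case class Environment(name: String, region: String)
case class Configuration(clusterSize: Int, sparkVersion: String)

// Hypothetical helpers standing in for the real EMR API calls.
def createEmrCluster(conf: Configuration, env: Environment): String = {
  println(s"creating ${conf.clusterSize}-node cluster in ${env.region}"); "j-CLUSTERID"
}
def deployArtifacts(clusterId: String, jobs: Seq[String]): Unit =
  println(s"deploying ${jobs.mkString(", ")} to $clusterId")
def runJob(clusterId: String, job: String): Unit =
  println(s"running $job on $clusterId")
def terminateCluster(clusterId: String): Unit =
  println(s"terminating $clusterId")

// The function creates a cluster, deploys the jobs, runs them, and takes the
// configuration, the environment, and the jobs to run as input.
def deployAndRun(conf: Configuration, env: Environment, jobs: Seq[String]): Unit = {
  val clusterId = createEmrCluster(conf, env)
  deployArtifacts(clusterId, jobs)
  jobs.foreach(job => runJob(clusterId, job))
  terminateCluster(clusterId)  // immutable infrastructure: the cluster is replaced, never patched
}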
Ingestion Job: Using a Metrics Container

object IngestionMetrics {
  val parseFailures = sparkContext.longAccumulator("parseFailures")
  val invalidId = sparkContext.longAccumulator("invalidId")
}
Ingestion Job: Using a Metrics Container

class IngestionJob extends Job[IngestionIO, IngestionConfiguration] {
  val acceptedCountries = Set("us", "ca")

  override def run(io: IngestionIO, conf: IngestionConfiguration): Unit = {
    val outputRDD = io.readInput()
      .flatMap { line: String =>
        parseLine(line) match {
          case Success(value) => Some(value)
          case Failure(_) =>
            IngestionMetrics.parseFailures.add(1)
            None
        }
      }
      .filter { record => acceptedCountries.contains(record.country) }
    io.writeOutput(outputRDD.toDS)
  }
}

FAILURE!
• Accumulators need to be instantiated on the Driver
• When the Scala app starts, an IngestionMetrics instance will be created on each Spark process, on each machine
• Not what we want!
Ingestion Job: Using a Metrics Container

class IngestionMetrics extends Metrics {
  val parseFailures = sparkContext.longAccumulator("parseFailures")
  val invalidId = sparkContext.longAccumulator("invalidId")
}

object IngestionMetrics {
  lazy val instance = new IngestionMetrics
}
Ingestion Job: Using a Metrics Container

class IngestionJob extends Job[IngestionIO, IngestionConfiguration] {
  lazy val metrics = IngestionMetrics.instance
  val acceptedCountries = Set("us", "ca")

  override def run(io: IngestionIO, conf: IngestionConfiguration): Unit = {
    val outputRDD = io.readInput()
      .flatMap { line: String =>
        parseLine(line) match {
          case Success(value) => Some(value)
          case Failure(_) =>
            metrics.parseFailures.add(1)
            None
        }
      }
      .filter { record => acceptedCountries.contains(record.country) }
    io.writeOutput(outputRDD.toDS)
  }
}

• Instantiated once on the Driver, then serialized to the Executors with the job closure
  – the accumulated values can then be read back on the Driver (sketch below)
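After a run, the accumulated counts can be read back on the Driver, for example to log or publish them; a minimal usage sketch, assuming the job above has already executed:

// Accumulator values are only reliable on the Driver; Executors can only add to them.
val metrics = IngestionMetrics.instance
println(s"parse failures: ${metrics.parseFailures.value}")
println(s"invalid IDs:    ${metrics.invalidId.value}")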
Conclusions
• Use Immutable Infrastructure
  – no side effects
  – everything is in the Configuration
• Split the computation into functional & testable jobs
  – Extract data access from business logic
• Pay attention to objects that are blindly serialized
  – like Scala singleton objects
• Leverage mutable state for efficiency
  – Use accumulators for iterative algorithms (sketch below)
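As an illustration of the last point (not the DeviceGraph algorithm itself), an accumulator can count how many records changed in each iteration of a loop and serve as a cheap convergence check; sc is an existing SparkContext and the update rule is a stand-in:

// (id, clusterLabel) pairs; a real algorithm would propagate labels along graph edges.
var labels = sc.parallelize(Seq((1L, 1L), (2L, 2L), (3L, 2L)))
var converged = false

while (!converged) {
  val changes = sc.longAccumulator("changes")
  labels = labels.map { case (id, label) =>
    val newLabel = math.min(id, label)   // stand-in for the real update rule
    if (newLabel != label) changes.add(1)
    (id, newLabel)
  }.cache()
  labels.count()                         // force evaluation so the accumulator is populated
  converged = changes.value == 0         // read on the Driver to decide whether to iterate again
}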