
DeviceGraph — Clustering Devices into People at Adobe with Apache Spark

Bucharest FP

April 19, 2017

Transcript

  1. DeviceGraph
    Clustering Devices into People at Adobe by Using Apache Spark

  2. About Us
    • Călin-Andrei Burloiu, Big Data Engineer at Adobe
    • Adobe
    • Creative
    • Photoshop, Premiere, Audition etc.
    • Marketing Cloud
    • Audience Manager, Analytics, Target etc.

  3. DeviceGraph
    • Also known as Cross-Device Co-op
    • Problem solved
    – Identify devices that belong to the same person
    • Solution
    – Perform connected components on a graph of IDs
    – Identify people as clusters

  4. Use Cases
    • Improve reports
    – Count people, not devices or cookies
    • Frequency capping
    • Cross-device targeting
    • Cross-device attribution

  5. DeviceGraph
    [Diagram: anonymous devices D1–D7 linked to authenticated profiles P1–P3 in the DeviceGraph]
    User Profiles (Pn):
    • Customer IDs
    • CRM IDs
    • Business IDs
    Devices (Dn):
    • Browser IDs
    • Mobile Device IDs
    • Connected Device IDs
    Deterministic:
    • Linking devices based on an authentication event
    Probabilistic:
    • Linking devices based on probabilistic signals (e.g. IPs)
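
    The deck does not show the underlying edge data model; as a minimal sketch (field and type names below are assumptions, not Adobe's schema), one link in the graph could be represented as:

    // Hypothetical edge record for the DeviceGraph (names are assumptions).
    sealed trait LinkType
    case object Deterministic extends LinkType  // backed by an authentication event
    case object Probabilistic extends LinkType  // backed by signals such as shared IPs

    // `left` and `right` each hold either a profile ID (Pn) or a device ID (Dn).
    case class IdLink(left: String, right: String, linkType: LinkType)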

  6. Cross-Device Co-operation
    [Diagram: links contributed by different co-op members combine into a single cluster]
    • Beeper knows D1 and D2 are linked
    • Facepalm knows D2 and D3 are linked

  7. Apache Spark
    • We use Apache Spark with Scala
    • A fast and general engine for large-scale data processing (Big Data)
    • API:
    – Functional (Scala-like)
    • map, flatMap, filter, sort
    – Relational (SQL-like)
    • select, where, groupBy, join
    • Distributed
    – A Driver node submits work to Executor nodes
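
    To make the two API styles concrete, here is a small, self-contained sketch (not from the deck; the toy data and names are illustrative):

    import org.apache.spark.sql.SparkSession

    object ApiStylesExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("api-styles").master("local[*]").getOrCreate()
        import spark.implicits._

        val events = Seq(("d1", "us"), ("d2", "ca"), ("d3", "fr")).toDF("deviceId", "country")

        // Functional (Scala-like): filter and map over a typed Dataset
        val northAmericanDevices = events.as[(String, String)]
          .filter { case (_, country) => Set("us", "ca").contains(country) }
          .map { case (deviceId, _) => deviceId }

        // Relational (SQL-like): where, groupBy and aggregate over a DataFrame
        val devicesPerCountry = events
          .where($"country".isin("us", "ca"))
          .groupBy($"country")
          .count()

        northAmericanDevices.show()
        devicesPerCountry.show()
        spark.stop()
      }
    }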

  8. DeviceGraph Jobs
    [Pipeline diagram]
    • Ingestion Job: ID-Sync Logs → Clustering Input
    • Clustering Job: Clustering Input → Clusters
    • Reports Job: Clusters → Metrics

  9. Job Design
    trait Job[T <: IO[_, _], C <: Configuration] {
      def run(io: T, conf: C): Unit
    }

    trait IO[I, O] {
      def readInput(): I
      def writeOutput(output: O): Unit
    }

  10. Job Design
    • Extracting data access (IO) from the business logic (Job)
    • IO depends on the environment
    – In production we read/write data from Amazon S3 (or HDFS)
    – In tests we create an IO stub with input test cases and expected output
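
    As an illustration of that test setup, here is a hedged sketch of an in-memory IO stub (the deck only shows the IO trait; IngestionIO's concrete input/output types, the Record shape and the stub itself are assumptions):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Dataset, SparkSession}

    // Assumed shape of a parsed ID-sync record.
    case class Record(id: String, country: String)

    // Assumption: the production IngestionIO reads raw log lines and writes parsed records.
    trait IngestionIO extends IO[RDD[String], Dataset[Record]]

    // Test stub: input comes from an in-memory Seq, output is captured for assertions.
    class StubIngestionIO(spark: SparkSession, lines: Seq[String]) extends IngestionIO {
      var written: Option[Dataset[Record]] = None

      override def readInput(): RDD[String] = spark.sparkContext.parallelize(lines)
      override def writeOutput(output: Dataset[Record]): Unit = written = Some(output)
    }

    A test can then run the job against a StubIngestionIO and assert on `written` without touching S3 or HDFS.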

  11. Functional Infrastructure
    • Immutable Infrastructure
    – IT resources are replaced rather than
    changed
    – Create a Spark cluster every time
    we want to run a bunch of jobs
    • Deploy on Amazon EMR (Elastic MapReduce)
    • Deploying jobs is a function
    – which
    • creates a cluster
    • deploys the jobs
    • runs the jobs
    – takes as input
    • Configuration
    • Environment
    • Jobs to run

  12. Functional Infrastructure
    abstract class Infrastructure {
      def deploy(
          conf: Configuration,
          env: Environment,
          jobs: Seq[Class[Job[_, _]]]): Unit
    }
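
    For illustration, a minimal in-process sketch of the "deploying jobs is a function" idea (not the deck's EMR implementation; LoggingInfrastructure and its private helpers are assumed names, with the real cluster provisioning and job submission calls replaced by stubs):

    class LoggingInfrastructure extends Infrastructure {
      override def deploy(
          conf: Configuration,
          env: Environment,
          jobs: Seq[Class[Job[_, _]]]): Unit = {
        val clusterId = createCluster(conf, env)     // e.g. request a fresh EMR cluster
        jobs.foreach(job => submit(clusterId, job))  // e.g. add one step per job class
      }

      // Stubs standing in for the real provisioning / submission calls.
      private def createCluster(conf: Configuration, env: Environment): String = {
        println(s"creating cluster for $env"); "cluster-1"
      }

      private def submit(clusterId: String, job: Class[Job[_, _]]): Unit =
        println(s"submitting ${job.getName} to $clusterId")
    }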

  13. Ingestion Job: Parsing
    class IngestionJob extends Job[IngestionIO, IngestionConfiguration] {
      override def run(io: IngestionIO, conf: IngestionConfiguration): Unit = {
        val outputRDD = io.readInput()
          .flatMap { line: String =>
            parseLine(line) match {
              case Success(value) => Some(value)
              case Failure(_) =>
                None
            }
          }
        io.writeOutput(outputRDD.toDS)
      }
    }
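
    parseLine itself is not shown in the deck; a hedged sketch of what it might look like, assuming tab-separated id/country lines and the Record shape sketched earlier:

    import scala.util.Try

    // Hypothetical parser: returns Success(Record) or Failure for malformed lines.
    def parseLine(line: String): Try[Record] = Try {
      line.split("\t", -1) match {
        case Array(id, country) if id.nonEmpty => Record(id, country.toLowerCase)
        case _ => throw new IllegalArgumentException(s"malformed line: $line")
      }
    }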

  14. Ingestion Job: Filtering
    class IngestionJob extends Job[IngestionIO, IngestionConfiguration] {
      val acceptedCountries = Set("us", "ca")

      override def run(io: IngestionIO, conf: IngestionConfiguration): Unit = {
        val outputRDD = io.readInput()
          .flatMap { line: String =>
            parseLine(line) match {
              case Success(value) => Some(value)
              case Failure(_) =>
                None
            }
          }
          .filter { record =>
            acceptedCountries.contains(record.country)
          }
        io.writeOutput(outputRDD.toDS)
      }
    }

  15. Ingestion Job: Counting Errors
    class IngestionJob extends Job[IngestionIO, IngestionConfiguration] {
      val parseFailures = sparkContext.longAccumulator("parseFailures")
      val acceptedCountries = Set("us", "ca")

      override def run(io: IngestionIO, conf: IngestionConfiguration): Unit = {
        val outputRDD = io.readInput()
          .flatMap { line: String =>
            parseLine(line) match {
              case Success(value) => Some(value)
              case Failure(_) =>
                parseFailures.add(1)
                None
            }
          }
          .filter { record =>
            acceptedCountries.contains(record.country)
          }
        io.writeOutput(outputRDD.toDS)
      }
    }
    Note: the flatMap closure captures the accumulator, so it is serialized and shipped to the Executors.

  16. Ingestion Job: Using a Metrics Container
    object IngestionMetrics {
      val parseFailures = sparkContext.longAccumulator("parseFailures")
      val invalidId = sparkContext.longAccumulator("invalidId")
    }

  17. Ingestion Job: Using a Metrics Container
    class IngestionJob extends Job[IngestionIO, IngestionConfiguration] {
      val acceptedCountries = Set("us", "ca")

      override def run(io: IngestionIO, conf: IngestionConfiguration): Unit = {
        val outputRDD = io.readInput()
          .flatMap { line: String =>
            parseLine(line) match {
              case Success(value) => Some(value)
              case Failure(_) =>
                IngestionMetrics.parseFailures.add(1)
                None
            }
          }
          .filter { record =>
            acceptedCountries.contains(record.country)
          }
        io.writeOutput(outputRDD.toDS)
      }
    }
    FAILURE!
    • Accumulators need to be instantiated on the Driver
    • Because IngestionMetrics is a singleton object, it is initialized independently in every Spark process, on every machine
    • Not what we want!

  18. Ingestion Job: Using a Metrics Container
    class IngestionMetrics extends Metrics {
      val parseFailures = sparkContext.longAccumulator("parseFailures")
      val invalidId = sparkContext.longAccumulator("invalidId")
    }

    object IngestionMetrics {
      lazy val instance = new IngestionMetrics
    }
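
    The Metrics base type is not shown in the deck; a minimal sketch of what it might provide, assuming its only job is to give subclasses access to the active SparkContext on the Driver:

    import org.apache.spark.SparkContext

    // Assumed base class: resolves the SparkContext lazily, so accumulators are
    // registered only when the (lazy) singleton is first touched, on the Driver.
    abstract class Metrics extends Serializable {
      @transient protected lazy val sparkContext: SparkContext = SparkContext.getOrCreate()
    }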

  19. Ingestion Job: Using a Metrics Container
    class IngestionJob extends Job[IngestionIO, IngestionConfiguration] {
      lazy val metrics = IngestionMetrics.instance
      val acceptedCountries = Set("us", "ca")

      override def run(io: IngestionIO, conf: IngestionConfiguration): Unit = {
        val outputRDD = io.readInput()
          .flatMap { line: String =>
            parseLine(line) match {
              case Success(value) => Some(value)
              case Failure(_) =>
                metrics.parseFailures.add(1)
                None
            }
          }
          .filter { record =>
            acceptedCountries.contains(record.country)
          }
        io.writeOutput(outputRDD.toDS)
      }
    }
    Notes: the metrics container is instantiated once, lazily, on the Driver; the closure passed to flatMap references it and is serialized to the Executors.
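
    Not shown in the deck: once the job's action has run, the accumulator values can be read back on the Driver, e.g. to feed the Metrics output (a hedged usage sketch):

    // After io.writeOutput(...) has triggered the computation, on the Driver:
    val failures = IngestionMetrics.instance.parseFailures.value
    println(s"lines that failed to parse: $failures")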

  20. Connected Components
    [Graph: edges P1–D30, P1–D40, P2–D40, P2–D50, P3–D10, P3–D20]
    Pass 1 – each edge's cluster starts as its device ID; within each profile's edges, the minimum is propagated:
    l    r    cluster
    P1   D30  D30
    P1   D40  D40 → D30
    P2   D40  D40
    P2   D50  D50 → D40
    P3   D10  D10
    P3   D20  D20 → D10

  21. Connected Components
    [Same graph, viewed from the device side]
    Pass 2:
    l    r    cluster
    D30  P1   D30 → P1
    D40  P1   D30 → P1
    D40  P2   D40 → P1
    D50  P2   D40 → P2
    D10  P3   D10 → P3
    D20  P3   D10 → P3

  22. Connected Components
    [Same graph, cluster labels shown per node]
    Pass 3:
    l    r    cluster
    P1   D30  P1
    P1   D40  P1
    P2   D40  P1
    P2   D50  P2 → P1
    P3   D10  P3
    P3   D20  P3

  23. Connected Components
    [Same graph, final cluster labels]
    Converged – nothing changes any more: P1, P2, D30, D40 and D50 form one person cluster (P1); P3, D10 and D20 form another (P3).
    l    r    cluster
    P1   D30  P1
    P1   D40  P1
    P2   D40  P1
    P2   D50  P1
    P3   D10  P3
    P3   D20  P3

  24. Clustering Job: Streams and Accumulators
    // A Boolean accumulator records whether the last pass changed any cluster label.
    val changed: Accumulator[Boolean] = sparkContext.accumulator(true, "changed")

    // Lazily build the stream of successive clusterings.
    val clusterImprovementStream = Iterator.iterate(initialClusters) { oldClusters =>
      changed.setValue(false)
      val newClusters = chooseMinAndPropagateToNeighbor(oldClusters)
      newClusters
    }

    // Stop at the first clustering after which nothing changed.
    val clusters = clusterImprovementStream.find { _ => !changed.value }
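
    chooseMinAndPropagateToNeighbor is not shown in the deck; below is a hedged sketch of one propagation pass over an (l, r, cluster) edge list, written against the current AccumulatorV2 API (a LongAccumulator counting changed labels rather than the Boolean accumulator above). Edge, the parameter names, and the use of the plain lexicographic minimum (instead of the profile-ID labels used in the walkthrough) are assumptions:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.util.LongAccumulator

    case class Edge(l: String, r: String, cluster: String)

    // One pass: give every edge the smallest cluster label seen by its l node,
    // then the smallest seen by its r node, counting how many labels change.
    def chooseMinAndPropagateToNeighbor(edges: RDD[Edge], changedLabels: LongAccumulator): RDD[Edge] = {
      def pass(in: RDD[Edge], key: Edge => String): RDD[Edge] = {
        val minPerNode = in.map(e => (key(e), e.cluster)).reduceByKey((a, b) => if (a <= b) a else b)
        in.map(e => (key(e), e)).join(minPerNode).map { case (_, (e, m)) =>
          if (m < e.cluster) { changedLabels.add(1); e.copy(cluster = m) } else e
        }
      }
      // The caller must materialize the result (e.g. cache + count) before reading the accumulator.
      pass(pass(edges, _.l), _.r)
    }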

  25. Conclusions
    • Use Immutable Infrastructure
    – no side effects
    – everything is in the Configuration
    • Split computation into functional & testable jobs
    – Extract data access from business logic
    • Pay attention to objects that are blindly serialized
    – like Scala singleton objects
    • Leverage mutable state for efficiency
    – Use accumulators for iterative algorithms
