
How to find a needle in a very, very large haystack using Apache Spark


This workshop is geared towards engineers with some data experience who would like to dip their toes into Apache Spark.

This workshop covers how to:

* use a notebook environment
* write simple Apache Spark queries to filter and transform a dataset
* do very simple outlier detection

and is taught using the Amazon Electronics reviews dataset.

The prerequisites are available at: https://github.com/stefano-meschiari/spark_workshop

Stefano Meschiari

February 22, 2021

Transcript

  1. How to find a needle in a very, very large haystack using Apache Spark
    @smeschiari (Data Science)


  2. What you’ll take away
    1. How Data Scientists find anomalies in data
    2. What you can do with Apache Spark
    3. How to run simple operations on Spark DataFrames


  3. How we’ll learn it
     PART 1
     1. Define Anomaly Detection (~10 minutes)
     2. Introduce Spark (~10 minutes)
     PART 2
     Interactive Spark Workshop: Finding Fake Reviews on Amazon (60 minutes)


  4. Requirements
     1. You followed the instructions at
        https://github.com/stefano-meschiari/spark_workshop
     2. Some Python knowledge
     3. If you have any last-minute questions about setup, join the
        #spark-workshop channel


  5. What is Anomaly Detection?


  6. What is an anomaly?
     1. An anomaly (outlier) is an observation that deviates significantly
        from the bulk of the other observations.
     [Plot: a cluster of points labeled NORMAL, with a few distant points
     labeled ANOMALIES]


  7. What is an anomaly?
     1. An anomaly (outlier) is an observation that deviates significantly
        from the bulk of the other observations.
     2. We suspect anomalies are generated by a different process than
        normal observations.
     [Same plot as slide 6: NORMAL points vs. ANOMALIES]
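     As a toy numeric sketch of that definition (made-up data, not from the
     deck): most values sit near 10, and one point lies far from the bulk.

     # Hypothetical observations: six values near 10 and one far from the bulk.
     values = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 42.0]

     mean = sum(values) / len(values)
     std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5

     # Flag points more than 2 standard deviations from the mean. On a sample
     # this small, the outlier inflates the std itself, so a robust statistic
     # (median/MAD) would be a better choice in practice.
     anomalies = [v for v in values if abs(v - mean) / std > 2]
     print(anomalies)  # [42.0]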


  8. [Images: “normal data” vs. an “anomaly generated by a different
     process” (NSFL)]


  9. Applications
     ● Fraud detection: Is this credit card transaction normal? (Stolen
       credit card)
     ● Log and load monitoring: Has load spiked in an unexpected way? (DDoS)
     ● Healthcare: Is this insurance claim suspicious? (Insurance scam)
     ● Quality control: Is this widget out of tolerance? (Malfunctioning
       machine)
     ● Time series: Is the current amount of stellar activity abnormal?
       (New physics)
     ● Intrusion/threat detection: Does this authentication look legit?
       (Credential theft)


  10. TYPES OF MACHINE LEARNING MODELS
      Supervised
      We can train models with examples that are labeled as either normal
      or anomalous.


  11. TYPES OF MACHINE LEARNING MODELS
      Supervised
      We can train models with examples that are labeled as either normal
      or anomalous.
      Unsupervised
      We don’t have labeled examples. We have to make strong assumptions
      about what anomalies look like.


  12. TYPES OF MACHINE LEARNING MODELS
      Supervised
      “Learning with a teacher”: every prediction is associated with either
      a correct answer or an error.
      Unsupervised
      “Learning without a teacher”: no direct measure of success or
      effectiveness. Much more challenging.


  13. Simple statistical methods
      [Plot: distribution of “API response time”; values inside the
      1st–99th percentile are treated as “normal”, values outside as
      outliers]
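      A rough sketch of that percentile rule in Spark (not from the deck),
      assuming a DataFrame named responses with a numeric response_time_ms
      column:

      from pyspark.sql import functions as F

      # approxQuantile computes approximate percentiles; the last argument
      # is the relative error traded for speed on large data.
      p01, p99 = responses.approxQuantile("response_time_ms", [0.01, 0.99], 0.01)

      outliers = responses.filter(
          (F.col("response_time_ms") < p01) | (F.col("response_time_ms") > p99))
      outliers.show()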


  14. Multi-dimensional unsupervised models
      [Illustrations: modeling the density of normal observations; clustering]
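      A hedged sketch of the clustering approach with SparkML (not the
      deck’s code; the DataFrame df and its numeric columns x and y are
      assumptions):

      import numpy as np
      from pyspark.ml.feature import VectorAssembler
      from pyspark.ml.clustering import KMeans
      from pyspark.sql import functions as F
      from pyspark.sql.types import DoubleType

      # Assemble the numeric columns into a single feature vector.
      assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
      points = assembler.transform(df)

      # Fit cluster centers on the (mostly normal) data.
      model = KMeans(k=5, seed=42).fit(points)
      centers = model.clusterCenters()

      # Score each point by its distance to its assigned center; points far
      # from every center are anomaly candidates.
      @F.udf(DoubleType())
      def dist_to_center(v, cluster):
          return float(np.linalg.norm(v.toArray() - centers[cluster]))

      scored = model.transform(points).withColumn(
          "score", dist_to_center("features", "prediction"))
      scored.orderBy(F.desc("score")).show(10)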


  15. Unsupervised to Supervised


  16. [Image-only slide]

  17. How can we operate on large datasets that can’t fit on a single
      machine?
      1. Split the data into many small partitions
      2. Apply operations on each partition in parallel on multiple machines
      3. Combine the results
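      A minimal sketch of that split/apply/combine pattern (illustrative,
      not from the deck), assuming a DataFrame df with a user column:

      # Spark splits df into partitions, computes partial counts for each
      # partition in parallel, then combines them into the final result.
      counts = df.groupBy("user").count()

      # You can inspect or change how the data is split:
      print(df.rdd.getNumPartitions())
      df16 = df.repartition(16)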


  18. What Spark does
    Spark is a distributed data processing framework that
    automatically parallelizes data transformations and
    distributes them to a cluster of workers.
    Data transformations are expressed using a high-level API.


  19. Spark Libraries
    ● Spark SQL and DataFrames
    ● SparkML (MLlib)
    ● Spark Streaming
    ● GraphX


  20. The main abstraction: Spark DataFrames
      DataFrame objects are distributed, resilient abstractions over
      structured data.
      ○ Distributed because they can be partitioned in such a way that
        computations can be executed on separate nodes, on multiple threads
      ○ Resilient because they are fault-tolerant
      ○ Structured because they are tabular data structures that carry a
        schema
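      A small sketch of such a schema (the file name and columns are
      assumptions, not from the deck), given a SparkSession named spark:

      from pyspark.sql.types import StructType, StructField, StringType, TimestampType

      schema = StructType([
          StructField("user", StringType()),
          StructField("ts", TimestampType()),
      ])
      authlog = spark.read.csv("authlog.csv", header=True, schema=schema)
      authlog.printSchema()
      # root
      #  |-- user: string (nullable = true)
      #  |-- ts: timestamp (nullable = true)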


  21. “Count the number of authentications initiated by each user
      after January 1st, 2018”

      SQL:
      SELECT
        user,
        count(*) AS auths
      FROM
        authlog
      WHERE
        ts > '2018-01-01'
      GROUP BY
        user
      ORDER BY
        auths

      Spark:
      from pyspark.sql import functions as F

      result = authlog \
          .filter("ts > '2018-01-01'") \
          .groupBy("user") \
          .agg(F.count("*").alias("auths")) \
          .orderBy("auths")
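      The two styles are interchangeable: the same SQL string can be run
      directly (assuming a SparkSession named spark, with authlog
      registered as a temporary view):

      authlog.createOrReplaceTempView("authlog")
      result = spark.sql("""
          SELECT user, count(*) AS auths
          FROM authlog
          WHERE ts > '2018-01-01'
          GROUP BY user
          ORDER BY auths
      """)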


  22. Engine sets up the plan...


  23. ...and ships it to executors
      ● Spark is layered on top of a cluster manager and distributed
        storage (S3, HDFS, Cassandra...)
      ● The driver sets up the computation plan and divides it into jobs
      ● Worker nodes are set up with executors
      ● Executors run tasks (small, discrete computation steps) on data
        partitions
      Credit: High Performance Spark
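      For reference, a minimal local setup (an assumption matching this
      workshop; on a real cluster, master would point at the cluster
      manager instead):

      from pyspark.sql import SparkSession

      # The driver lives in this Python process; local[4] stands in for
      # four executor threads instead of a cluster of workers.
      spark = (SparkSession.builder
          .master("local[4]")
          .appName("spark-workshop")
          .getOrCreate())
      print(spark.sparkContext.defaultParallelism)  # 4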


  24. Workshop: Finding Fake Reviews on Amazon


  25. Open up Jupyter
    1. Navigate to the spark_workshop folder
    2. Run sh run.sh
    3. Navigate to http://0.0.0.0:8889


  26. How it will work
      1. I’ll guide you through some worked-out examples
      2. You’ll get some time to do some simple operations on your own
      3. You’ll independently build new anomaly heuristics using Spark
         (nothing too rigorous!)


  27. Workshop time!
    #spark-workshop


  28. What’s Next?


  29. Next steps: What Would a Data Scientist Do?
      1. Build model features based on intuition, heuristics, and the
         domain expertise you gained
      2. Train a multidimensional ML model on historical data
      3. Get labels (if you can)
      4. Validate model predictions
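      A hedged sketch of those steps with SparkML (the column names,
      features, and the labeled_reviews DataFrame are hypothetical, not
      from the deck):

      from pyspark.ml.feature import VectorAssembler
      from pyspark.ml.classification import RandomForestClassifier
      from pyspark.ml.evaluation import BinaryClassificationEvaluator

      # 1. Heuristic features, e.g. review rate and average rating per user.
      assembler = VectorAssembler(
          inputCols=["reviews_per_day", "avg_rating"], outputCol="features")
      data = assembler.transform(labeled_reviews)  # 3. assumes a 0/1 label column

      train, test = data.randomSplit([0.8, 0.2], seed=42)

      # 2. Train a multidimensional model on historical data.
      model = RandomForestClassifier(labelCol="label").fit(train)

      # 4. Validate predictions on held-out data (area under ROC).
      evaluator = BinaryClassificationEvaluator(labelCol="label")
      print(evaluator.evaluate(model.transform(test)))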


  30. Next steps: What Would a Bad Actor Do?
      Devise adversarial attacks on the recommender system that can also
      circumvent fake-user detection models.
      Lam & Riedl, “Shilling Recommender Systems for Fun and Profit”, 2004
      Christakopoulou & Banerjee, “Adversarial Recommendation: Attack of
      the Learned Fake Users”, 2018


  31. Thank you!
    Questions?
