
How to find a needle in a very, very large haystack using Apache Spark

This is geared towards engineers with some data experience who would like to dip their toes into using Apache Spark.

This workshop covers how to:

* use a notebook environment
* write simple Apache Spark queries to filter and transform a dataset
* do very simple outlier detection

and is taught using the Amazon Electronics reviews dataset.

The prerequisites are available at: https://github.com/stefano-meschiari/spark_workshop


Stefano Meschiari

February 22, 2021

Transcript

  1. How to find a needle in a very, very large haystack using Apache Spark
     @smeschiari (Data Science)
  2. What you’ll take away
     1. How Data Scientists find anomalies in data
     2. What you can do with Apache Spark
     3. How to run simple operations on Spark DataFrames
  3. How we’ll learn it
     PART 1
     1. Define Anomaly Detection (~10 minutes)
     2. Introduce Spark (~10 minutes)
     PART 2
     Interactive Spark Workshop: Finding Fake Reviews on Amazon (60 minutes)
  4. Requirements
     1. You followed the instructions at https://github.com/stefano-meschiari/spark_workshop
     2. Some Python knowledge
     3. If you have any last-minute questions about setup, join the #sparkworkshop channel
  5. What is Anomaly Detection?

  6. What is an anomaly?
     1. An anomaly (outlier) is an observation that deviates significantly from the bulk of the other observations.
     (Figure: normal observations vs. anomalies)
  7. What is an anomaly?
     1. An anomaly (outlier) is an observation that deviates significantly from the bulk of the other observations.
     2. We suspect anomalies are generated by a different process than normal observations.
     (Figure: normal observations vs. anomalies)
  8. (Figure: normal data vs. an anomaly generated by a different process, NSFL)

  9. Applications
     Fraud detection: Is this credit card transaction normal? (Stolen credit card)
     Log and load monitoring: Has load spiked in an unexpected way? (DDoS)
     Healthcare: Is this insurance claim suspicious? (Insurance scam)
     Quality control: Is this widget out of tolerance? (Malfunctioning machine)
     Time series: Is the current amount of stellar activity abnormal? (New physics)
     Intrusion/threat detection: Does this authentication look legit? (Credential theft)
  10. Types of machine learning models
      Supervised: We can train models with examples that are labeled as either normal or anomalous.
  11. Types of machine learning models
      Supervised: We can train models with examples that are labeled as either normal or anomalous.
      Unsupervised: We don’t have labeled examples. We have to make strong assumptions about what anomalies look like.
  12. Types of machine learning models
      Supervised: “Learning with a teacher.” Every prediction is associated with either a correct answer or an error.
      Unsupervised: “Learning without a teacher.” No direct measure of success or effectiveness. Much more challenging.
  13. Simple statistical methods
      (Figure: distribution of “API response time”; the 1st–99th percentile range is “normal”, points outside it are outliers)
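The percentile rule on this slide can be sketched in plain Python (plain Python rather than Spark, and the response times below are invented, so both the data and the cutoffs are purely illustrative):

```python
# Flag points outside the 1st-99th percentile range as outliers,
# as in the "API response time" picture on the slide.
import statistics

# Mostly-normal latencies (ms), plus one very slow and one very fast point.
response_times_ms = [12, 15, 14, 13, 16, 18, 11, 14, 15, 13] * 10 + [450, 3]

# quantiles(..., n=100) returns the 99 cut points P1..P99.
cuts = statistics.quantiles(response_times_ms, n=100)
p1, p99 = cuts[0], cuts[98]

outliers = [t for t in response_times_ms if t < p1 or t > p99]
print(f"normal range: [{p1:.1f}, {p99:.1f}] ms, outliers: {sorted(outliers)}")
```

The same idea carries over to Spark by computing the percentiles with `approxQuantile` on a DataFrame column and filtering against them.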
  14. Multi-dimensional unsupervised models
      (Figure: density of normal observations; clustering)

  15. Unsupervised to Supervised

  16. (Slide with no transcript)
  17. How can we operate on large datasets that can’t fit on a single machine?
      1. Split the data into many small partitions
      2. Apply operations on each partition in parallel, on multiple machines
      3. Combine the results
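The three steps above can be mimicked in miniature with plain Python, to make the split/apply/combine pattern concrete before introducing Spark (this is a toy sketch, not Spark itself; threads stand in for the cluster of machines):

```python
# Toy split -> apply-in-parallel -> combine, the pattern Spark automates.
from concurrent.futures import ThreadPoolExecutor

def split(data, n_partitions):
    """Step 1: split the data into roughly equal partitions."""
    size = -(-len(data) // n_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def partial_sum(partition):
    """Step 2: the per-partition operation (independent on each chunk)."""
    return sum(partition)

data = list(range(1, 1001))
partitions = split(data, 4)

# Run the per-partition work in parallel (threads play the role of workers).
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, partitions))

# Step 3: combine the partial results.
total = sum(partials)
print(total)  # 500500
```

Spark's value is doing exactly this, but across real machines, with fault tolerance and a high-level API, so you never write the plumbing yourself.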
  18. What Spark does
      Spark is a distributed data processing framework that automatically parallelizes data transformations and distributes them to a cluster of workers. Data transformations are expressed using a high-level API.
  19. Spark libraries
      • Spark SQL and DataFrames
      • Spark ML (MLlib)
      • Spark Streaming
      • GraphX
  20. The main abstraction: Spark DataFrames
      DataFrame objects represent distributed and resilient abstractions over structured data.
      ◦ Distributed because they can be partitioned in such a way that computations can be executed on separate nodes, on multiple threads.
      ◦ Resilient because they are fault-tolerant.
      ◦ Structured because they are tabular data structures that carry a schema.
  21. “Count the number of authentications initiated by each user after January 1st, 2018”

      SQL:
      SELECT user, count(*) AS auths
      FROM authlog
      WHERE ts > '2018-01-01'
      GROUP BY user
      ORDER BY auths

      Spark:
      result = authlog \
          .filter("ts > '2018-01-01'") \
          .groupBy("user") \
          .agg(F.count("*").alias("auths")) \
          .orderBy("auths")
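To make explicit what this query computes, here is the same filter/group/count/sort done in plain Python on a tiny in-memory auth log (the field names `user` and `ts` match the slide; the records themselves are made up for illustration):

```python
# What the SQL and Spark versions both compute, spelled out in plain Python.
from collections import Counter

authlog = [
    {"user": "alice", "ts": "2018-03-01"},
    {"user": "bob",   "ts": "2017-12-30"},   # filtered out: before the cutoff
    {"user": "alice", "ts": "2018-04-15"},
    {"user": "carol", "ts": "2018-01-02"},
]

# WHERE ts > '2018-01-01' (string comparison works for ISO-formatted dates)
recent = [r for r in authlog if r["ts"] > "2018-01-01"]

# GROUP BY user, count(*)
auths = Counter(r["user"] for r in recent)

# ORDER BY auths
result = sorted(auths.items(), key=lambda kv: kv[1])
print(result)  # [('carol', 1), ('alice', 2)]
```

The difference is that Spark evaluates the same logical plan lazily and in parallel across partitions, so it scales to data that would never fit in one process's memory.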
  22. The engine sets up the plan...

  23. ...and ships it to executors (credit: High Performance Spark)
      • Layered on top of a cluster manager and distributed storage (S3, HDFS, Cassandra, ...)
      • The driver sets up the computation plan and divides it into jobs
      • Worker nodes are set up with executors
      • Executors run tasks (small, discrete computation steps) on data partitions
  24. Workshop: Finding Fake Reviews on Amazon

  25. Open up Jupyter
      1. Navigate to the spark_workshop folder
      2. Run sh run.sh
      3. Navigate to http://0.0.0.0:8889
  26. How it will work
      1. I’ll guide you through some worked-out examples
      2. You’ll get some time to do simple operations on your own
      3. You’ll independently build new anomaly heuristics using Spark (nothing too rigorous!)
  27. Workshop time! #spark-workshop

  28. What’s Next?

  29. Next steps: What Would a Data Scientist Do?
      1. Build model features based on intuition, heuristics, and the domain expertise you gained
      2. Train a multidimensional ML model on historical data
      3. Get labels (if you can)
      4. Validate model predictions
  30. Next steps: What Would a Bad Actor Do?
      Devise adversarial attacks on the recommender system that can also circumvent fake-user detection models.
      Lam & Riedl, “Shilling Recommender Systems for Fun and Profit”, 2004
      Christakopoulou & Banerjee, “Adversarial Recommendation: Attack of the Learned Fake Users”, 2018
  31. Thank you! Questions?