
How to find a needle in a very, very large haystack using Apache Spark

This is geared towards engineers with some data experience who would like to dip their toes into using Apache Spark.

This workshop covers how to:

* use a notebook environment
* write simple Apache Spark queries to filter and transform a dataset
* do very simple outlier detection

and is taught using the Amazon Electronics reviews dataset.

The prerequisites are available at: https://github.com/stefano-meschiari/spark_workshop


Stefano Meschiari

February 22, 2021

Transcript

  1. How to find a needle in a very, very large haystack using Apache Spark
     @smeschiari (Data Science)
  2. What you’ll take away
     1. How Data Scientists find anomalies in data
     2. What you can do with Apache Spark
     3. How to run simple operations on Spark DataFrames
  3. How we’ll learn it
     PART 1
     1. Define Anomaly Detection (~10 minutes)
     2. Introduce Spark (~10 minutes)
     PART 2
     Interactive Spark Workshop: Finding Fake Reviews on Amazon (60 minutes)
  4. Requirements
     1. You followed the instructions at https://github.com/stefano-meschiari/spark_workshop
     2. Some Python knowledge
     3. If you have any last-minute questions about setup, join the #sparkworkshop channel
  5. What is Anomaly Detection?

  6. What is an anomaly?
     1. An anomaly (outlier) is an observation that deviates significantly from the bulk of the other observations.
     (Figure: normal observations vs. anomalies)
  7. What is an anomaly?
     1. An anomaly (outlier) is an observation that deviates significantly from the bulk of the other observations.
     2. We suspect anomalies are generated by a different process than normal observations.
     (Figure: normal observations vs. anomalies)
  8. (Figure: normal data vs. an anomaly generated by a different process, NSFL)

  9. Applications
     Fraud detection: Is this credit card transaction normal? (Stolen credit card)
     Log and load monitoring: Has load spiked in an unexpected way? (DDoS)
     Healthcare: Is this insurance claim suspicious? (Insurance scam)
     Quality control: Is this widget out of tolerance? (Malfunctioning machine)
     Time series: Is the current amount of stellar activity abnormal? (New physics)
     Intrusion/threat detection: Does this authentication look legit? (Credential theft)
  10. Types of machine learning models
      Supervised: We can train models with examples that are labeled as either normal or anomalous.
  11. Types of machine learning models
      Supervised: We can train models with examples that are labeled as either normal or anomalous.
      Unsupervised: We don’t have labeled examples. We have to make strong assumptions about what anomalies look like.
  12. Types of machine learning models
      Supervised: “Learning with a teacher.” Every prediction is associated with either a correct answer or an error.
      Unsupervised: “Learning without a teacher.” No direct measure of success or effectiveness. Much more challenging.
  13. Simple statistical methods
      (Figure: distribution of “API response time”; the 1st–99th percentile range is “normal”, points outside it are outliers)
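The percentile rule on this slide can be sketched in plain Python (plain Python rather than Spark, and the response times below are invented, so both the data and the cutoffs are purely illustrative):

```python
# Flag points outside the 1st-99th percentile range as outliers,
# as in the "API response time" picture on the slide.
import statistics

# Mostly-normal latencies (ms), plus one very slow and one very fast point.
response_times_ms = [12, 15, 14, 13, 16, 18, 11, 14, 15, 13] * 10 + [450, 3]

# quantiles(..., n=100) returns the 99 cut points P1..P99.
cuts = statistics.quantiles(response_times_ms, n=100)
p1, p99 = cuts[0], cuts[98]

outliers = [t for t in response_times_ms if t < p1 or t > p99]
print(f"normal range: [{p1:.1f}, {p99:.1f}] ms, outliers: {sorted(outliers)}")
```

The same idea carries over to Spark by computing the percentiles with `approxQuantile` on a DataFrame column and filtering against them.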
  14. Multi-dimensional unsupervised models
      (Figure: density of normal observations; clustering)

  15. Unsupervised to Supervised

  16. (Slide with no transcript)
  17. How can we operate on large datasets that can’t fit on a single machine?
      1. Split the data into many small partitions
      2. Apply operations on each partition in parallel, on multiple machines
      3. Combine the results
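The three steps above can be mimicked in miniature with plain Python, to make the split/apply/combine pattern concrete before introducing Spark (this is a toy sketch, not Spark itself; threads stand in for the cluster of machines):

```python
# Toy split -> apply-in-parallel -> combine, the pattern Spark automates.
from concurrent.futures import ThreadPoolExecutor

def split(data, n_partitions):
    """Step 1: split the data into roughly equal partitions."""
    size = -(-len(data) // n_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def partial_sum(partition):
    """Step 2: the per-partition operation (independent on each chunk)."""
    return sum(partition)

data = list(range(1, 1001))
partitions = split(data, 4)

# Run the per-partition work in parallel (threads play the role of workers).
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, partitions))

# Step 3: combine the partial results.
total = sum(partials)
print(total)  # 500500
```

Spark's value is doing exactly this, but across real machines, with fault tolerance and a high-level API, so you never write the plumbing yourself.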
  18. What Spark does
      Spark is a distributed data processing framework that automatically parallelizes data transformations and distributes them to a cluster of workers. Data transformations are expressed using a high-level API.
  19. Spark libraries
      • Spark SQL and DataFrames
      • Spark ML (MLlib)
      • Spark Streaming
      • GraphX
  20. The main abstraction: Spark DataFrames
      DataFrame objects represent distributed and resilient abstractions over structured data.
      ◦ Distributed because they can be partitioned in such a way that computations can be executed on separate nodes, on multiple threads.
      ◦ Resilient because they are fault-tolerant.
      ◦ Structured because they are tabular data structures that carry a schema.
  21. “Count the number of authentications initiated by each user after January 1st, 2018”

      SQL:
      SELECT user, count(*) AS auths
      FROM authlog
      WHERE ts > '2018-01-01'
      GROUP BY user
      ORDER BY auths

      Spark:
      result = authlog \
          .filter("ts > '2018-01-01'") \
          .groupBy("user") \
          .agg(F.count("*").alias("auths")) \
          .orderBy("auths")
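To make explicit what this query computes, here is the same filter/group/count/sort done in plain Python on a tiny in-memory auth log (the field names `user` and `ts` match the slide; the records themselves are made up for illustration):

```python
# What the SQL and Spark versions both compute, spelled out in plain Python.
from collections import Counter

authlog = [
    {"user": "alice", "ts": "2018-03-01"},
    {"user": "bob",   "ts": "2017-12-30"},   # filtered out: before the cutoff
    {"user": "alice", "ts": "2018-04-15"},
    {"user": "carol", "ts": "2018-01-02"},
]

# WHERE ts > '2018-01-01' (string comparison works for ISO-formatted dates)
recent = [r for r in authlog if r["ts"] > "2018-01-01"]

# GROUP BY user, count(*)
auths = Counter(r["user"] for r in recent)

# ORDER BY auths
result = sorted(auths.items(), key=lambda kv: kv[1])
print(result)  # [('carol', 1), ('alice', 2)]
```

The difference is that Spark evaluates the same logical plan lazily and in parallel across partitions, so it scales to data that would never fit in one process's memory.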
  22. The engine sets up the plan...

  23. ...and ships it to executors (credit: High Performance Spark)
      • Layered on top of a cluster manager and distributed storage (S3, HDFS, Cassandra, ...)
      • The driver sets up the computation plan and divides it into jobs
      • Worker nodes are set up with executors
      • Executors run tasks (small, discrete computation steps) on data partitions
  24. Workshop: Finding Fake Reviews on Amazon

  25. Open up Jupyter
      1. Navigate to the spark_workshop folder
      2. Run sh run.sh
      3. Navigate to http://0.0.0.0:8889
  26. How it will work
      1. I’ll guide you through some worked-out examples
      2. You’ll get some time to do simple operations on your own
      3. You’ll independently build new anomaly heuristics using Spark (nothing too rigorous!)
  27. Workshop time! #spark-workshop

  28. What’s Next?

  29. Next steps: What Would a Data Scientist Do?
      1. Build model features based on intuition, heuristics, and the domain expertise you gained
      2. Train a multidimensional ML model on historical data
      3. Get labels (if you can)
      4. Validate model predictions
  30. Next steps: What Would a Bad Actor Do?
      Devise adversarial attacks on the recommender system that can also circumvent fake-user detection models.
      Lam & Riedl, “Shilling Recommender Systems for Fun and Profit”, 2004
      Christakopoulou & Banerjee, “Adversarial Recommendation: Attack of the Learned Fake Users”, 2018
  31. Thank you! Questions?