Hey!
I am Matild Reema
Software Engineer @ Indix
Enthusiastic Spark developer
What do we do at Indix?
Crawl → Parse → Classify → Match → Analytics → API/Feeds
◇ 100TB of data processed daily
◇ 5TB of HTML data crawled per day
◇ 1.5B product URLs
◇ 3000 sites
◇ 6000 categories
◇ 30B price points
What is Spark?
◇ Alternative to Hadoop MapReduce
◇ In-memory processing engine
◇ Support for both batch and stream processing
◇ Libraries on top: Spark SQL, MLlib, GraphX
◇ Up to 100x faster in memory and 10x faster on disk
◇ Can run on a standalone cluster, Mesos or YARN
◇ HDFS for data storage
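As a quick orientation, here is a minimal sketch of bootstrapping a Spark application; the app name and master URL are placeholders, not something from the talk.

import org.apache.spark.sql.SparkSession

// Spark 2.x entry point; wraps the SparkContext and the SQL functionality
val spark = SparkSession.builder()
  .appName("spark-intro-demo")   // hypothetical application name
  .master("local[*]")            // or a standalone / Mesos / YARN master URL
  .getOrCreate()

val sc = spark.sparkContext      // underlying SparkContext for RDD code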
How is it different from Hadoop?
In-memory processing!
Why is it faster?
◇ Writes intermediate output to memory instead of disk
◇ Less disk I/O
◇ Lazy evaluation of tasks - optimizes the data processing workflow
◇ In-memory caching is helpful when multiple operations access the same dataset
[Diagram: Hadoop MapReduce reads its input from HDFS and writes intermediate output back to HDFS between successive MR operations, while Spark reads the input once and serves Query 1, 2 and 3 from memory, returning Results 1, 2 and 3 without the intermediate HDFS reads and writes.]
Sample code
val sc = new SparkContext()
val rdd = sc.parallelize(List(1, 2, 3, 4, 5, 6), 3) // RDD[Int] split into 3 partitions
rdd
  .filter(no => no % 2 != 0)            // transformation (lazy)
  .map(no => no + 1)                    // transformation (lazy)
  .saveAsTextFile("s3://bucket/path")   // action - triggers the actual execution
Data representation in Spark
RDD
◇ Collection of objects operated on in parallel
◇ Transformations and actions
◇ Immutable - every transformation returns a new RDD
◇ Fault tolerant - lost partitions are recomputed from lineage
Eg: rdd.map(person => person.age).filter(_ > 18)

Dataframe
◇ Distributed data organized into named columns
◇ Like a relational db table
◇ As of Spark 2.0, a Dataframe is nothing but a Dataset[Row]
Eg: df.select("age").filter("age > 18")

Datasets
◇ Available since 1.6
◇ Has all the benefits of an RDD along with Spark SQL's optimized execution engine
◇ Has a strongly typed API as well as an untyped one (the Dataframe)
Eg: ds.map(person => person.age).filter(_ > 18)
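To make the comparison concrete, here is a small sketch showing the same records through all three APIs; the Person case class and the data are made up for illustration.

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)   // hypothetical schema

val spark = SparkSession.builder().appName("api-comparison").master("local[*]").getOrCreate()
import spark.implicits._

val people = Seq(Person("alice", 17), Person("bob", 25))

val rdd = spark.sparkContext.parallelize(people)   // RDD[Person]
val df  = people.toDF()                            // Dataframe (Dataset[Row])
val ds  = people.toDS()                            // Dataset[Person]

rdd.map(_.age).filter(_ > 18)          // typed, but no Catalyst optimization
df.select("age").filter("age > 18")    // untyped columns, optimized by Catalyst
ds.map(_.age).filter(_ > 18)           // typed and optimized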
Large no. of small executors vs. small no. of large executors?
◇ Large executors - more cores
◇ Preferable to have larger executors - less movement of data across the network (see the sketch below)
◇ The more cores per executor, the more tasks can run in parallel
◇ A single-core executor doesn't take full advantage of running multiple tasks in the same JVM
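For illustration only (the numbers are placeholders, not recommendations from the talk), configuring fewer, larger executors could look like this:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("large-executors-demo")                  // hypothetical app name
  .set("spark.executor.instances", "10")               // fewer executors...
  .set("spark.executor.cores", "5")                    // ...each with several cores
  .set("spark.executor.memory", "16g")
  .set("spark.yarn.executor.memoryOverhead", "2048")   // off-heap overhead in MB (YARN)
val sc = new SparkContext(conf)

The same settings are more commonly passed to spark-submit via --num-executors, --executor-cores and --executor-memory.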
"Most Spark jobs run as-is, but there are certain pitfalls one should be careful of."
Common Issues
◇ Executor OOM
◇ Driver OOM
◇ Straggler Tasks
◇ Shuffle Failures
◇ GC Limit Exceeded
◇ No space left on device
Executors going OOM
◇ Increase executor memory + memory overhead
◇ Have smaller partitions
◇ Keep partition counts above 2000 - Spark switches to a more compressed map status format beyond 2000 partitions
◇ Avoid groupByKey while working with RDDs
- Switch to reduceByKey if aggregating (map-side combine) - see the sketch below
- Apply filters before grouping rather than after
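A minimal sketch of the reduceByKey point, assuming the SparkContext sc from the sample code and a made-up pair RDD:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))   // hypothetical (word, 1) pairs

// groupByKey shuffles every value for a key to a single executor before summing
val viaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values on the map side first, so far less data is shuffled
val viaReduce = pairs.reduceByKey(_ + _)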
Driver going OOM
◇ Increase driver memory + overhead
◇ Avoid actions that bring all the data to the driver for large datasets (e.g. collect)
◇ Too many partitions
- Every task sends a map status object back to the driver
- The driver also has to handle map output status requests from the reducers
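For example (a sketch; the paths are placeholders and sc is the SparkContext from the sample code):

val bigRdd = sc.textFile("s3://bucket/input-path")   // hypothetical large dataset

// Risky: collect() pulls every record into the driver's memory
// val everything = bigRdd.collect()

// Safer: inspect a small sample, or let the executors write the result out in parallel
bigRdd.take(10).foreach(println)
bigRdd.saveAsTextFile("s3://bucket/output-path")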
Straggler Tasks
◇ Skewed data
- A single task deals with more data than the rest
◇ Increase spark.locality.wait - a long-running task might just have poor data locality
◇ Increase the number of partitions
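A sketch of both knobs; the wait time and partition count are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// Raise the locality wait before building the context (the default is 3s)
val conf = new SparkConf().set("spark.locality.wait", "10s")
val sc = new SparkContext(conf)

// Spread skewed data over more partitions before the heavy operation
val input = sc.textFile("s3://bucket/skewed-input")   // hypothetical dataset
val repartitioned = input.repartition(400)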
GC Limit Exceeded
◇ Too much time spent on GC
◇ Increase executor heap size (executor memory)
◇ Increase storage memory (spark.memory.storageFraction)
◇ Choose a different GC policy (CMS, ParallelOld GC, G1GC)
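A sketch of the heap and GC settings; the sizes and the chosen collector are placeholders:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "8g")                      // bigger executor heap
  .set("spark.memory.storageFraction", "0.6")              // keep more of the unified memory for storage
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")  // or CMS / ParallelOld flags instead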
Shuffle Failures
◇ Expensive operation - involves disk I/O, data serialization and network I/O
◇ Avoid shuffles when possible
- Broadcast the smaller dataset while joining - shuffling the larger table across the network can cost far more than shipping a copy of the small table to every executor (see the sketch below)
◇ More cores per executor still achieves parallelism with fewer shuffles over the network, compared to more executors with fewer cores each
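A sketch of a broadcast join, assuming the SparkSession spark from the earlier sketches; the paths and the join key are placeholders:

import org.apache.spark.sql.functions.broadcast

val largeDF = spark.read.parquet("s3://bucket/large")   // hypothetical inputs
val smallDF = spark.read.parquet("s3://bucket/small")

// The broadcast hint ships smallDF to every executor instead of shuffling largeDF
val joined = largeDF.join(broadcast(smallDF), "id")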
No space left on device
◇ Continuous eviction of data to disk
- e.g. "UnsafeExternalSorter: Thread 75 spilling sort data of 141.0 MB to disk (90 times so far)"
◇ Ensure that spark.memory.fraction isn't set too low
◇ If you're running with a minimal number of nodes, increase them
◇ By default the spilled data is written to /tmp; add more disks by listing them in spark.local.dirs
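A sketch of the relevant settings; the mount points are placeholders, and on YARN the node manager's local directories take precedence over spark.local.dirs:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.fraction", "0.6")                        // the default; avoid setting it much lower
  .set("spark.local.dirs", "/mnt/disk1/tmp,/mnt/disk2/tmp")   // spread spills across several disks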
Good Practices
Using Parquet as File Format
◇ Columnar data storage
◇ Around 75% data compression on average
◇ Push-down filters (1.6 onwards) reduce disk I/O - only the required columns and row groups are read
◇ Use DirectParquetOutputCommitter - avoids expensive renames
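A sketch, assuming an existing DataFrame df, the SparkSession spark from earlier, and made-up paths and column names:

// Write out as Parquet
df.write.parquet("s3://bucket/products.parquet")

// Only the referenced columns are read, and the price predicate is pushed down to the Parquet reader
val products = spark.read.parquet("s3://bucket/products.parquet")
products.select("title", "price").filter("price > 100")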
Use Kryo Serialization
◇ More compact than Java serialization (almost 10x)
◇ Faster
◇ Register all custom classes
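A sketch of enabling Kryo and registering custom classes; Person is the hypothetical case class from the earlier sketch:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Person]))   // register every custom class you ship to executors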
Use Aggregate functions
◇ Prefer reduceByKey, combineByKey and aggregateByKey over groupByKey
◇ Support for custom aggregate functions
◇ Map side combines
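For instance, a per-key average with aggregateByKey; the data is made up and sc is the SparkContext from the sample code:

val scores = sc.parallelize(Seq(("a", 10.0), ("a", 20.0), ("b", 5.0)))

val sumAndCount = scores.aggregateByKey((0.0, 0))(
  (acc, value) => (acc._1 + value, acc._2 + 1),   // fold a value into the per-partition accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2))           // merge accumulators across partitions

val averages = sumAndCount.mapValues { case (sum, count) => sum / count }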
Persist data
◇ When multiple operations are performed on the same data, persist it
◇ The data isn't recomputed every time an action is performed
◇ Helpful during interactive analysis
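A sketch, with a placeholder input path and the SparkContext sc from the sample code:

import org.apache.spark.storage.StorageLevel

val parsed = sc.textFile("s3://bucket/logs").map(_.split("\t"))
parsed.persist(StorageLevel.MEMORY_AND_DISK)   // or parsed.cache() for MEMORY_ONLY

parsed.count()    // the first action computes and caches the partitions
parsed.take(5)    // later actions reuse the cached data instead of re-reading the input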
How do I debug my application?
◇ Jobs (Actions)
◇ Stages
◇ Tasks
◇ Executor Metrics
To summarize
◇ Avoid unnecessary groupBys in RDDs - replace them with aggregate functions wherever possible
◇ Never collect/coalesce large data onto a single node - let the executors write out the data in a distributed manner
◇ Use the Parquet file format and Kryo serialization
◇ Switch to Datasets to take full advantage of Spark's SQL engine
◇ Default partitions may not work all the time - know when to increase or decrease them
◇ Cache data, especially in jobs where it is re-used across stages - avoids re-computation
◇ Prefer larger executors with more cores - achieves more parallelism
◇ Broadcast variables that are shared by all executors - also broadcast the smaller dataset when performing joins