
Spark Introduction 20160426 NTU

Erica Li
April 25, 2016


Hadoop MapReduce vs. Spark: an introduction to Spark, from RDDs to infrastructure, from Streaming to GraphX. In this lecture you will get an overview of Spark's scope and learn its pros and cons. (Spark 1.6.1)

Transcript

  1. About Me: @Shrimp_li / ericalitw. What am I working on? Data Scientist; Girls in Tech Taiwan; ElasticMining CTO & Co-founder; Taiwan Spark User Group Founder; Taiwan People's Food Bank IT consultant
  2. 10 Things About Apache Spark: 1. Introduction, 2. Hadoop vs. Spark, 3. Spark Features, 4. Spark Ecosystem, 5. Spark Architecture, 6. RDD, 7. Streaming, 8. SQL & DataFrame, 9. MLlib, 10. GraphX
  3. Apache Spark™ is a powerful open source processing engine built around speed, ease of use, and sophisticated analytics. It was originally developed at UC Berkeley in 2009.
  4. A Brief Story (timeline, 2004 to 2016): the MapReduce paper (2004), Hadoop at Yahoo!, the Hadoop Summit, the Spark paper, Apache Spark winning the Sort Benchmark (2014), and what's next: Spark 2.0?
  5. Hadoop: a full-stack massively parallel processing (MPP) system with both big data storage (HDFS) and a parallel execution model (MapReduce). Spark: an open source parallel processing framework that enables users to run large-scale data analytics applications across clustered computers.
  6. Spark vs. Hadoop MapReduce, 2014 Sort Benchmark Competition (machines / time):
      Hadoop MapReduce: 2100 machines, 72 min
      Spark:            207 machines,  23 min
    Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. http://spark.apache.org/
  7. Spark vs. Hadoop MapReduce (diagram): MapReduce reads from and writes to HDFS between each iteration, while Spark keeps intermediate results in memory and passes them directly from one iteration to the next.
  8. Spark Features
    • Written in Scala
    • Runs on the JVM
    • Takes MapReduce to the next level
    • In-memory data storage
    • Near real-time processing
    • Lazy evaluation of queries (see the sketch below)
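    Lazy evaluation is easiest to see in code. A minimal sketch, assuming a spark-shell session (so the SparkContext sc already exists) and a placeholder file name:

      // Transformations only record lineage; nothing executes until an action is called.
      val lines  = sc.textFile("data.txt")              // placeholder path; no file is read yet
      val errors = lines.filter(_.contains("ERROR"))    // still nothing has run
      val n      = errors.count()                       // action: now the file is read and filtered in memory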
  9. Components
    • Data storage: HDFS and other Hadoop-compatible storage
    • Cluster management framework: Standalone; Mesos ($ ./bin/spark-shell --master mesos://host:5050); YARN ($ ./bin/spark-shell --master yarn --deploy-mode client); AWS EC2
    • Distributed computing API (Scala, Python, Java)
  10. Launch Cluster (diagram): the Driver Program creates a SparkContext, which connects to a Cluster Manager master (Standalone, YARN, or Mesos); the Cluster Manager allocates Worker Nodes for the application. A minimal sketch of this wiring follows.
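    A rough sketch of how a driver program wires itself to a cluster manager (the app name and master URL below are placeholders; spark-shell does this for you):

      import org.apache.spark.{SparkConf, SparkContext}

      val conf = new SparkConf()
        .setAppName("my-app")                           // placeholder application name
        .setMaster("spark://master-host:7077")          // placeholder standalone master; could be yarn-client or mesos://host:5050
      val sc = new SparkContext(conf)                   // the driver's entry point to the cluster
      // ... build RDDs and run jobs here ...
      sc.stop()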
  11. Cluster Processing (diagram, from http://spark.apache.org/): the Driver Program's SparkContext talks to the Cluster Manager (master); each Worker Node hosts an Executor with a cache that runs Tasks, and the application code (*.jar / *.py) is shipped to every executor.
  12. Client Mode (default; diagram from http://spark.apache.org/): the application is submitted from a client machine and the driver, with its SparkContext, runs in that client process; it talks to the master, while executors on the Worker Nodes run the tasks against the shipped *.jar / *.py.
  13. Cluster Mode (diagram): the client submits the application to the master, but the driver itself is launched on one of the Worker Nodes inside the cluster; executors on the other workers run the tasks.
  14. Spark on YARN (diagram): the Spark YARN client sends a request to the Resource Manager, which assigns and invokes an Application Master; the Application Master hosts the SparkContext, DAG Scheduler, and YarnClusterScheduler, and applies for containers; the Node Managers then launch the assigned containers, each running an ExecutorBackend with its Executor.
  15. Spark Installation
    Download:
      wget http://www.apache.org/dyn/closer.lua/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz
    Unpack it, then cd into it:
      tar zxvf spark-1.6.1-bin-hadoop2.6.tgz
      cd spark-1.6.1-bin-hadoop2.6
    Prerequisites: Scala 2.10.x, Java 7+, Python 2.6+, R 3.1+
  16. Resilient Distributed Datasets: "The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures." (diagram: Partition 1, Partition 2, ...; a short sketch follows)
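    A minimal sketch of the two ways of creating an RDD described above (paths are placeholders; sc is the spark-shell SparkContext):

      // From an existing Scala collection in the driver program:
      val nums = sc.parallelize(1 to 1000, 4)           // spread over 4 partitions

      // From a file in HDFS or any other Hadoop-supported file system:
      val logs = sc.textFile("hdfs:///data/logs/*.txt") // placeholder path

      // Operate on the partitions in parallel:
      val total = nums.map(_ * 2).reduce(_ + _)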
  17. RDD operations. Here are some common operations (see the word-count sketch below).
    Transformations: map(func), flatMap(func), filter(func), groupByKey(), reduceByKey(), union(other), sortByKey()
    Actions: reduce(func), collect(), first(), take(n), saveAsTextFile(path), countByKey()
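    A small word-count sketch that combines transformations and actions from the lists above (the paths are placeholders):

      val counts = sc.textFile("hdfs:///data/input.txt")     // placeholder input path
        .flatMap(line => line.split("\\s+"))                 // transformation
        .map(word => (word, 1))                              // transformation
        .reduceByKey(_ + _)                                  // transformation

      counts.take(10).foreach(println)                       // action: bring only 10 results to the driver
      counts.saveAsTextFile("hdfs:///data/word-counts")      // action: write the full result back out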
  18. Cache
    • RDD persistence
    • Caching is a key tool for iterative algorithms and fast interactive use
    • Usage (see the sketch below)
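    A hedged sketch of the usage the slide refers to, reusing the hypothetical logs RDD from the earlier example:

      import org.apache.spark.storage.StorageLevel

      val errors = logs.filter(_.contains("ERROR"))
      errors.cache()                                    // shorthand for persist(StorageLevel.MEMORY_ONLY)
      // errors.persist(StorageLevel.MEMORY_AND_DISK)   // or pick a level that spills to disk

      errors.count()                                    // first action materializes and caches the RDD
      errors.filter(_.contains("timeout")).count()      // later jobs reuse the cached partitions
      errors.unpersist()                                // release the cached blocks when done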
  19. 1. Avoid groupByKey (diagram: word-count pairs with keys A and B). With reduceByKey, pairs are partially combined on each partition before the shuffle, so only small per-key sums move across the network; with groupByKey, every single (key, 1) pair is shuffled to the reducer before any combining happens. (A sketch of both follows.)
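    The same word count written both ways (pairs is a hypothetical RDD[(String, Int)] built as in the earlier sketch):

      val pairs = sc.textFile("hdfs:///data/input.txt") // placeholder path
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1))

      // Preferred: values are combined per key on each partition before the shuffle.
      val countsFast = pairs.reduceByKey(_ + _)

      // Works, but ships every single (word, 1) pair across the network first.
      val countsSlow = pairs.groupByKey().mapValues(_.sum)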
  20. 2. Don't copy all elements to the driver: instead of collect(), prefer take(), sample(), countByValue(), countByKey(), collectAsMap(), saving to a file, or filtering/sampling first (see the sketch below).
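    A brief sketch of the safer alternatives, assuming the counts RDD from the word-count example:

      // Risky on large data: counts.collect() pulls the whole RDD into the driver's memory.
      val preview = counts.take(20)                                          // a small, fixed number of elements
      val sampled = counts.sample(withReplacement = false, fraction = 0.01)  // still distributed, not collected
      val perKey  = counts.countByKey()                                      // small driver-side Map of key -> count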
  21. Use cases (a minimal streaming sketch follows)
    • Streaming ETL: Uber (Kafka, HDFS), Conviva (quality of live video)
    • Data enrichment: KKBOX
    • Complex session analysis: Pinterest (immediate user behavior)
    • Trigger event detection: UITOX
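    The slides only name production users; as a rough illustration of what a streaming job looks like, here is the canonical socket word count with the Spark 1.6 DStream API (host, port, and batch interval are placeholders; a real streaming ETL pipeline would read from Kafka instead):

      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val ssc = new StreamingContext(sc, Seconds(10))          // 10-second micro-batches
      val lines = ssc.socketTextStream("localhost", 9999)      // placeholder source
      val counts = lines.flatMap(_.split("\\s+"))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
      counts.print()                                           // or save each batch to HDFS
      ssc.start()
      ssc.awaitTermination()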
  22. Datasets API
    • Released in Spark 1.6
    • RDD + Encoder
    • RDD-style API with the Catalyst optimizer (see the sketch below)
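    A minimal Dataset sketch for Spark 1.6, assuming the spark-shell sqlContext (the Person case class and its rows are made up):

      import sqlContext.implicits._

      case class Person(name: String, age: Int)

      val ds = Seq(Person("Ann", 30), Person("Bob", 25)).toDS() // an Encoder is derived for Person
      val adults = ds.filter(_.age >= 18)                       // typed, RDD-like operations...
      adults.show()                                             // ...planned through Catalyst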
  23. M. Armbrust et al. (2015), "Spark SQL: Relational Data Processing in Spark." Phases of query planning (diagram): a SQL query or DataFrame becomes an Unresolved Logical Plan; Analysis produces the Logical Plan; Logical Optimization produces the Optimized Logical Plan; Physical Planning generates candidate Physical Plans, from which a Cost Model selects one; Code Generation turns the Selected Physical Plan into RDDs. (A small explain() sketch follows.)
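    One way to watch these phases from a shell session is DataFrame.explain; a small sketch with made-up data:

      import sqlContext.implicits._

      val df = Seq(("Ann", 30), ("Bob", 25)).toDF("name", "age")
      df.filter("age >= 18").select("name").explain(true)       // prints the parsed, analyzed, optimized, and physical plans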
  24. MLlib
    • MLlib is Spark's machine learning (ML) library
    • Its goal is to make practical machine learning scalable and easy
    • It consists of common learning algorithms and utilities (a short sketch follows)
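    A hedged k-means sketch using the RDD-based MLlib API in Spark 1.6 (the input path, feature format, and k are placeholders):

      import org.apache.spark.mllib.clustering.KMeans
      import org.apache.spark.mllib.linalg.Vectors

      // Each line is assumed to hold space-separated numeric features, e.g. "1.0 2.5 0.3".
      val data = sc.textFile("hdfs:///data/features.txt")
        .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
        .cache()

      val model = KMeans.train(data, k = 3, maxIterations = 20)
      println("Within-set sum of squared errors: " + model.computeCost(data))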
  25. Models & Use cases
    • Classification & Regression
    • Collaborative Filtering
    • Clustering
    • Dimensionality Reduction
    • Feature Extraction...
    • User behavior analysis: PIXNET, Pinkoi
  26. Hiring! (Full-time) http://www.wetogether.co/hiring.html
    Frontend Developer Job Description: HTML5/CSS3, JavaScript, AngularJS, SASS, SEO, jQuery, Bootstrap, RWD/Mobile First Design
    Backend Developer Job Description: Web, Python/Scala, MVC, HTTP, Linux, Database (SQL or NoSQL)
    Email me!