
Spark - Alexis Seigneurin - English

Spark presentation in English

Alexis Seigneurin

January 21, 2015


Transcript

  1. Spark • Processing of large volumes of data • Distributed processing on commodity hardware • Written in Scala, with a Java binding
  2. History • 2009: AMPLab, UC Berkeley • June 2013: "Top-level project" of the Apache Foundation • May 2014: version 1.0.0 • Currently: version 1.2.0
  3. Use cases • Log analysis • Processing of text files • Analytics • Distributed search (as Google once did) • Fraud detection • Product recommendation
  4. Proximity with Hadoop • Same use cases • Same development model: MapReduce • Integration with the ecosystem
  5. Simpler than Hadoop • A simpler API to learn • "Relaxed" MapReduce • Spark Shell: interactive processing
  6. Faster than Hadoop • "Spark officially sets a new record in large-scale sorting" (5 November 2014) • Sorting 100 TB of data: ◦ Hadoop MR: 72 minutes, with 2,100 nodes (50,400 cores) ◦ Spark: 23 minutes, with 206 nodes (6,592 cores)
  7. RDD • Resilient Distributed Dataset • Abstraction of a collection processed in parallel • Fault tolerant • Can work with tuples: ◦ key-value pairs ◦ tuples must be independent from each other
  8. Sources • Files on HDFS • Local files • Collection in memory • Amazon S3 • NoSQL database • ... • Or a custom implementation of InputFormat
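A minimal sketch of creating RDDs from two of these sources with the Java API (the file is the CSV used later in the deck; the in-memory values are made up):

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    JavaSparkContext sc = new JavaSparkContext("local", "sources");
    // RDD from a text file (an hdfs:// or s3n:// URL works the same way)
    JavaRDD<String> lines = sc.textFile("data/arbresalignementparis2010.csv");
    // RDD from an in-memory collection
    JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4));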
  9. Transformations • Processes an RDD, returns another RDD • Lazy! • Examples: ◦ map(): one value → another value ◦ mapToPair(): one value → a tuple ◦ filter(): filters values/tuples given a condition ◦ groupByKey(): groups values by key ◦ reduceByKey(): aggregates values by key ◦ join(), cogroup()...: joins two RDDs
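A minimal sketch of the lazy behavior, reusing the hypothetical lines RDD from above:

    // Nothing runs here: transformations only record the execution graph
    JavaRDD<String> longLines = lines.filter(line -> line.length() > 80);
    JavaRDD<Integer> lengths = longLines.map(line -> line.length());
    // Only an action (see the next slide) triggers the actual computation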
  10. Actions • Do not return an RDD • Examples: ◦ count(): counts values/tuples ◦ saveAsHadoopFile(): saves results in Hadoop's format ◦ foreach(): applies a function on each item ◦ collect(): retrieves values in a list (List<T>)
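Continuing the sketch above, the actions are what trigger the computation:

    import java.util.List;

    long n = lengths.count();                        // number of values
    List<Integer> values = lengths.collect();        // all values, back on the driver
    lengths.foreach(len -> System.out.println(len)); // applied to each item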
  11. Spark - Example • Trees of Paris: CSV file, Open Data • Count of trees by species

    geom_x_y;circonfere;adresse;hauteurenm;espece;varieteouc;dateplanta
    48.8648454814, 2.3094155344;140.0;COURS ALBERT 1ER;10.0;Aesculus hippocastanum;;
    48.8782668139, 2.29806967519;100.0;PLACE DES TERNES;15.0;Tilia platyphyllos;;
    48.889306184, 2.30400164126;38.0;BOULEVARD MALESHERBES;0.0;Platanus x hispanica;;
    48.8599934405, 2.29504883623;65.0;QUAI BRANLY;10.0;Paulownia tomentosa;;1996-02-29
    ...
  12. Spark - Example

    JavaSparkContext sc = new JavaSparkContext("local", "arbres");
    sc.textFile("data/arbresalignementparis2010.csv")
        .filter(line -> !line.startsWith("geom"))
        .map(line -> line.split(";"))
        .mapToPair(fields -> new Tuple2<String, Integer>(fields[4], 1))
        .reduceByKey((x, y) -> x + y)
        .sortByKey()
        .foreach(t -> System.out.println(t._1 + " : " + t._2));

    (diagram: tuples flowing through textFile → filter → map → mapToPair → reduceByKey → sortByKey → foreach)
  13. Spark - Example

    Acacia dealbata : 2
    Acer acerifolius : 39
    Acer buergerianum : 14
    Acer campestre : 452
    ...
  14. Topology & Terminology • One master / several workers ◦ (+ one standby master) • Submit an application to the cluster • Execution managed by a driver
  15. Spark in a cluster Several options: • YARN • Mesos • Standalone ◦ Workers started manually ◦ Workers started by the master
  16. Storage & Processing • MapReduce: ◦ Spark (API) ◦ Distributed processing ◦ Fault tolerant • Storage: ◦ HDFS, NoSQL databases... ◦ Distributed storage ◦ Fault tolerant
  17. Data locality (diagram: each Spark Worker is co-located with an HDFS Datanode; the Spark Master with the HDFS Namenode, plus a standby Namenode and a standby Master)
  18. Demo

    $ $SPARK_HOME/sbin/start-master.sh
    $ $SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker \
        spark://MBP-de-Alexis:7077 --cores 2 --memory 2G
    $ mvn clean package
    $ $SPARK_HOME/bin/spark-submit \
        --master spark://MBP-de-Alexis:7077 \
        --class com.seigneurin.spark.WikipediaMapReduceByKey \
        --deploy-mode cluster \
        target/pres-spark-0.0.1-SNAPSHOT.jar
  19. Spark SQL • Querying RDDs in SQL • SQL engine: converts SQL statements into low-level instructions
  20. Spark SQL Prerequisites: • Use tabular data • Describe the schema → SchemaRDD Describing the schema: • Programmatic description of the data • Schema inference through reflection (POJO)
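A minimal sketch of the reflection path, assuming the Spark 1.2-era JavaSQLContext API and a hypothetical Tree JavaBean (the programmatic path is shown on the next slides):

    // Hypothetical JavaBean; the schema is inferred from its getters
    public static class Tree implements java.io.Serializable {
        private float hauteurenm;
        private String espece;
        public float getHauteurenm() { return hauteurenm; }
        public void setHauteurenm(float h) { this.hauteurenm = h; }
        public String getEspece() { return espece; }
        public void setEspece(String e) { this.espece = e; }
    }

    JavaRDD<Tree> treeBeans = sc.textFile("data/arbresalignementparis2010.csv")
        .filter(line -> !line.startsWith("geom"))
        .map(line -> line.split(";"))
        .map(fields -> {
            Tree t = new Tree();
            t.setHauteurenm(Float.parseFloat(fields[3]));
            t.setEspece(fields[4]);
            return t;
        });
    JavaSchemaRDD schemaRDD = sqlContext.applySchema(treeBeans, Tree.class);
    schemaRDD.registerTempTable("tree");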
  21. Spark SQL - Example • Creating tabular data (type Row)

    JavaRDD<Row> rdd = trees.map(fields -> Row.create(
        Float.parseFloat(fields[3]), fields[4]));

    ---------------------------------------
    | 10.0 | Aesculus hippocastanum |
    | 15.0 | Tilia platyphyllos |
    | 0.0  | Platanus x hispanica |
    | 10.0 | Paulownia tomentosa |
    | ...  | ... |
  22. Spark SQL - Example • Describing the schema

    List<StructField> fields = new ArrayList<StructField>();
    fields.add(DataType.createStructField("hauteurenm", DataType.FloatType, false));
    fields.add(DataType.createStructField("espece", DataType.StringType, false));
    StructType schema = DataType.createStructType(fields);
    JavaSchemaRDD schemaRDD = sqlContext.applySchema(rdd, schema);
    schemaRDD.registerTempTable("tree");

    ---------------------------------------
    | hauteurenm | espece |
    ---------------------------------------
    | 10.0 | Aesculus hippocastanum |
    | 15.0 | Tilia platyphyllos |
    | 0.0  | Platanus x hispanica |
    | 10.0 | Paulownia tomentosa |
    | ...  | ... |
  23. Spark SQL - Example • Counting trees by species

    sqlContext.sql("SELECT espece, COUNT(*) FROM tree WHERE espece <> '' " +
            "GROUP BY espece ORDER BY espece")
        .foreach(row -> System.out.println(row.getString(0) + " : " + row.getLong(1)));

    Acacia dealbata : 2
    Acer acerifolius : 39
    Acer buergerianum : 14
    Acer campestre : 452
    ...
  24. Window operations • Sliding window • Reuses data from other windows • Initialized with a window length and a slide interval
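A minimal sketch, assuming an existing JavaDStream<String> named tweets (the 60 s window and 10 s slide are arbitrary values):

    import org.apache.spark.streaming.Duration;
    import org.apache.spark.streaming.api.java.JavaDStream;

    // Every 10 seconds, process the last 60 seconds of data
    JavaDStream<String> windowed = tweets.window(
            new Duration(60 * 1000),  // window length
            new Duration(10 * 1000)); // slide interval
    windowed.count().print();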
  25. Sources • Socket • Kafka • Flume • HDFS • MQ (ZeroMQ...) • Twitter • ... • Or a custom implementation of Receiver
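For instance, a minimal sketch with the socket source (host, port, and the 2-second batch interval are made up):

    import org.apache.spark.streaming.Duration;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    // One micro-batch every 2 seconds
    JavaStreamingContext ssc = new JavaStreamingContext("local[2]", "demo", new Duration(2000));
    JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);
    lines.print();
    ssc.start();
    ssc.awaitTermination();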
  26. Spark Streaming Demo • Receive tweets with hashtag #Android ◦ Twitter4J • Detect the language of each tweet ◦ Language Detection • Indexing with Elasticsearch • Reporting with Kibana 4
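A minimal sketch of the receiving side, assuming the spark-streaming-twitter module with Twitter4J credentials provided as system properties; the language-detection and Elasticsearch steps are only marked by a comment:

    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.twitter.TwitterUtils;
    import twitter4j.Status;

    String[] filters = { "#Android" };
    JavaDStream<Status> tweets = TwitterUtils.createStream(ssc, filters);
    tweets.map(status -> status.getText())
          // here: detect the language, then index the document into Elasticsearch
          .print();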
  27. Demo • Launch Elasticsearch:

    $ curl -X DELETE localhost:9200
    $ curl -X PUT localhost:9200/spark/_mapping/tweets -d '{
        "tweets": {
          "properties": {
            "user": {"type": "string", "index": "not_analyzed"},
            "text": {"type": "string"},
            "createdAt": {"type": "date", "format": "date_time"},
            "language": {"type": "string", "index": "not_analyzed"}
          }
        }
      }'

    • Launch Kibana → http://localhost:5601
    • Launch the Spark Streaming process