Alexis Seigneurin @aseigneurin @ippontech

Spark ● Processing of large volumes of data ● Distributed processing on commodity hardware ● Written in Scala, with Java bindings

History ● 2009: AMPLab, UC Berkeley ● June 2013: enters the Apache Incubator ● February 2014: Apache top-level project ● May 2014: version 1.0.0 ● Currently: version 1.2.0

Use cases ● Log analysis ● Processing of text files ● Analytics ● Distributed search (Google, before) ● Fraud detection ● Product recommendation

Proximity with Hadoop ● Same use cases ● Same development model: MapReduce ● Integration with the ecosystem

Simpler than Hadoop ● API simpler to learn ● “Relaxed” MapReduce ● Spark Shell: interactive processing

Faster than Hadoop Spark officially sets a new record in large-scale sorting (5th November 2014) ● Sorting 100 TB of data ● Hadoop MR: 72 minutes ○ With 2100 nodes (50400 cores) ● Spark: 23 minutes ○ With 206 nodes (6592 cores)

Spark ecosystem ● Spark ● Spark Shell ● Spark Streaming ● Spark SQL ● Spark ML ● GraphX

Integration ● Yarn, Zookeeper, Mesos ● HDFS ● Cassandra ● Elasticsearch ● MongoDB

Spark Operating principle

RDD ● Resilient Distributed Dataset ● Abstraction of a collection processed in parallel ● Fault tolerant ● Can work with tuples: ○ Key - Value ○ Tuples must be independent from each other

Sources ● Files on HDFS ● Local files ● Collection in memory ● Amazon S3 ● NoSQL database ● ... ● Or a custom implementation of InputFormat

Transformations ● Process an RDD, return another RDD ● Lazy! ● Examples: ○ map(): one value → another value ○ mapToPair(): one value → a tuple ○ filter(): filters values/tuples given a condition ○ groupByKey(): groups values by key ○ reduceByKey(): aggregates values by key ○ join(), cogroup()...: joins two RDDs

Actions ● Does not return an RDD ● Examples: ○ count(): counts values/tuples ○ saveAsHadoopFile(): saves results in Hadoop’s format ○ foreach(): applies a function on each item ○ collect(): retrieves values in a list (List)
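The "Lazy!" point can be illustrated without a cluster. The sketch below is not Spark code: it uses plain `java.util.stream` as a stand-in, on the assumption that the analogy holds closely enough (stream intermediate operations play the role of transformations, the terminal `collect` plays the role of an action). The class name `LazyPipelineSketch` and the `trace` list are hypothetical, introduced only to observe when work actually happens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyPipelineSketch {
    // Records which elements the map step touched, to observe when work happens.
    static final List<String> trace = new ArrayList<>();

    // "Transformations": describe the pipeline only; nothing is computed here.
    static Stream<Integer> describe(List<String> values) {
        return values.stream()
                .filter(v -> !v.isEmpty())
                .map(v -> {
                    trace.add(v);
                    return v.length();
                });
    }

    public static void main(String[] args) {
        Stream<Integer> lengths = describe(Arrays.asList("oak", "", "maple"));
        System.out.println("elements processed after transformations: " + trace.size());

        // "Action": the terminal operation triggers the whole pipeline.
        List<Integer> result = lengths.collect(Collectors.toList());
        System.out.println("elements processed after the action: " + trace.size());
        System.out.println(result);
    }
}
```

In Spark the same pattern applies to RDDs: chaining map() or filter() only builds up the description of the job, and an action such as count() or collect() is what starts the computation.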


Spark - Example ● Trees of Paris: CSV file, Open Data ● Count of trees by species
geom_x_y;circonfere;adresse;hauteurenm;espece;varieteouc;dateplanta
48.8648454814, 2.3094155344;140.0;COURS ALBERT 1ER;10.0;Aesculus hippocastanum;;
48.8782668139, 2.29806967519;100.0;PLACE DES TERNES;15.0;Tilia platyphyllos;;
48.889306184, 2.30400164126;38.0;BOULEVARD MALESHERBES;0.0;Platanus x hispanica;;
48.8599934405, 2.29504883623;65.0;QUAI BRANLY;10.0;Paulownia tomentosa;;1996-02-29
...

Spark - Example
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

JavaSparkContext sc = new JavaSparkContext("local", "arbres");
sc.textFile("data/arbresalignementparis2010.csv")
  .filter(line -> !line.startsWith("geom"))                        // skip the header line
  .map(line -> line.split(";"))                                    // split each line into fields
  .mapToPair(fields -> new Tuple2<String, Integer>(fields[4], 1))  // (species, 1)
  .reduceByKey((x, y) -> x + y)                                    // sum the counts per species
  .sortByKey()                                                     // sort by species name
  .foreach(t -> System.out.println(t._1 + " : " + t._2));
[Diagram: data flowing through textFile → filter → map → mapToPair → reduceByKey → sortByKey → foreach]

Spark - Example
Acacia dealbata : 2
Acer acerifolius : 39
Acer buergerianum : 14
Acer campestre : 452
...
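To check the counting logic on a laptop without Spark, the same steps can be approximated in plain Java. This is only a sketch using `java.util.stream`, not the Spark API: `Collectors.toMap` with a merge function stands in for mapToPair() plus reduceByKey(), and a `TreeMap` stands in for sortByKey(). The class name `TreeCountSketch` is hypothetical, and the last input row is a made-up duplicate added so that one species is counted twice.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class TreeCountSketch {
    // Counts lines per species (5th column), sorted by species name.
    static Map<String, Integer> countBySpecies(List<String> lines) {
        return lines.stream()
                .filter(line -> !line.startsWith("geom"))   // skip the header line
                .map(line -> line.split(";"))               // split each line into fields
                .collect(Collectors.toMap(
                        fields -> fields[4],                // key: species
                        fields -> 1,                        // value: one tree
                        (x, y) -> x + y,                    // merge counts for the same key
                        TreeMap::new));                     // keep keys sorted
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "geom_x_y;circonfere;adresse;hauteurenm;espece;varieteouc;dateplanta",
                "48.8648454814, 2.3094155344;140.0;COURS ALBERT 1ER;10.0;Aesculus hippocastanum;;",
                "48.8782668139, 2.29806967519;100.0;PLACE DES TERNES;15.0;Tilia platyphyllos;;",
                "48.8599934405, 2.29504883623;65.0;QUAI BRANLY;10.0;Paulownia tomentosa;;1996-02-29",
                // Hypothetical extra row, not from the data set, to show aggregation.
                "0.0, 0.0;50.0;EXAMPLE STREET;12.0;Tilia platyphyllos;;");
        countBySpecies(lines).forEach((species, count) ->
                System.out.println(species + " : " + count));
    }
}
```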

Topology & Terminology ● One master / several workers ○ (+ one standby master) ● Submit an application to the cluster ● Execution managed by a driver

Spark in a cluster Several options ● YARN ● Mesos ● Standalone ○ Workers started manually ○ Workers started by the master

Storage & Processing MapReduce ● Spark (API) ● Distributed processing ● Fault tolerant Storage ● HDFS, NoSQL database... ● Distributed storage ● Fault tolerant

Data locality ● Process the data where it is stored ● Avoid network I/Os

Data locality [Diagram] ● Three nodes, each running a Spark Worker and an HDFS Datanode ● Spark Master and HDFS Namenode ● Spark Master (Standby) and HDFS Namenode (Standby)

Demo Spark in a cluster

Demo
$ $SPARK_HOME/sbin/
$ $SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker spark://MBP-de-Alexis:7077 --cores 2 --memory 2G
$ mvn clean package
$ $SPARK_HOME/bin/spark-submit --master spark://MBP-de-Alexis:7077 --class com.seigneurin.spark.WikipediaMapReduceByKey --deploy-mode cluster target/pres-spark-0.0.1-SNAPSHOT.jar

Spark SQL

Spark SQL ● Usage of an RDD in SQL ● SQL engine: converts SQL instructions to low-level instructions

Spark SQL Prerequisites: ● Use tabular data ● Describe the schema → SchemaRDD Describing the schema: ● Programmatic description of the data ● Schema inference through reflection (POJO)
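To give an intuition for "schema inference through reflection", here is a simplified sketch in plain Java. It does not reproduce Spark's own mechanism: it simply lists the declared fields of a class and their types. The `Tree` POJO and the `inferSchema` helper are hypothetical names introduced for the illustration, with column names borrowed from the tree data set.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Map;
import java.util.TreeMap;

public class SchemaInferenceSketch {
    // Hypothetical POJO mirroring two columns of the tree data set.
    public static class Tree {
        private float hauteurenm;
        private String espece;
    }

    // Derives a "column name -> type name" description from the class itself.
    static Map<String, String> inferSchema(Class<?> pojo) {
        Map<String, String> schema = new TreeMap<>();
        for (Field field : pojo.getDeclaredFields()) {
            // Ignore static and compiler-generated fields: they are not columns.
            if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
                continue;
            }
            schema.put(field.getName(), field.getType().getSimpleName());
        }
        return schema;
    }

    public static void main(String[] args) {
        inferSchema(Tree.class).forEach((name, type) ->
                System.out.println(name + " : " + type));
    }
}
```

The trade-off against the programmatic description is the usual one: reflection needs no extra schema code, but the schema is then tied to a class known at compile time.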

Spark SQL - Example ● Creating tabular data (type Row)
JavaRDD<Row> rdd = ... -> Row.create(Float.parseFloat(fields[3]), fields[4]));
---------------------------------------
| 10.0 | Aesculus hippocastanum |
| 15.0 | Tilia platyphyllos |
| 0.0 | Platanus x hispanica |
| 10.0 | Paulownia tomentosa |
| ... | ... |

Spark SQL - Example ● Describing the schema
List<StructField> fields = new ArrayList<StructField>();
fields.add(DataType.createStructField("hauteurenm", DataType.FloatType, false));
fields.add(DataType.createStructField("espece", DataType.StringType, false));
StructType schema = DataType.createStructType(fields);
JavaSchemaRDD schemaRDD = sqlContext.applySchema(rdd, schema);
schemaRDD.registerTempTable("tree");
---------------------------------------
| hauteurenm | espece |
---------------------------------------
| 10.0 | Aesculus hippocastanum |
| 15.0 | Tilia platyphyllos |
| 0.0 | Platanus x hispanica |
| 10.0 | Paulownia tomentosa |
| ... | ... |

Spark SQL - Example ● Counting trees by species
sqlContext.sql("SELECT espece, COUNT(*) FROM tree WHERE espece <> '' GROUP BY espece ORDER BY espece")
  .foreach(row -> System.out.println(row.getString(0) + " : " + row.getLong(1)));
Acacia dealbata : 2
Acer acerifolius : 39
Acer buergerianum : 14
Acer campestre : 452
...

Spark Streaming

Micro-batches ● Slices a continuous flow of data into batches ● Same API ● ≠ Apache Storm

DStream ● Discretized Streams ● Sequence of RDDs ● Initialized with a Duration
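The idea of slicing a continuous flow into batches can be sketched in a few lines of plain Java. This is not the Spark Streaming API: `MicroBatchSketch` and `slice` are hypothetical names, events are reduced to non-negative timestamps in milliseconds, and the batch duration is an ordinary `long`. Each resulting list plays the role of one RDD in the sequence.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MicroBatchSketch {
    // Groups event timestamps (in ms, assumed non-negative) into consecutive
    // batches of batchMillis each, keyed by the start time of the batch.
    static Map<Long, List<Long>> slice(List<Long> timestamps, long batchMillis) {
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long t : timestamps) {
            long batchStart = (t / batchMillis) * batchMillis;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(t);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Long> events = Arrays.asList(0L, 400L, 1000L, 1999L, 2500L);
        // One-second batches: each entry is one "micro-batch".
        slice(events, 1000L).forEach((start, batch) ->
                System.out.println("batch starting at " + start + " ms: " + batch));
    }
}
```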

Window operations ● Sliding window ● Reuses data from other windows ● Initialized with a window length and a slide interval
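How the window length and the slide interval interact is easier to see on numbers. The sketch below is plain Java, not Spark Streaming: `SlidingWindowSketch` and `windowedSums` are hypothetical names, each input value stands for the event count of one batch, and both parameters are expressed in number of batches. Only complete windows are emitted, which is a simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SlidingWindowSketch {
    // Sums batch counts over a window of windowLength batches,
    // moving forward by slideInterval batches each time.
    static List<Integer> windowedSums(List<Integer> batchCounts, int windowLength, int slideInterval) {
        List<Integer> sums = new ArrayList<>();
        for (int end = windowLength; end <= batchCounts.size(); end += slideInterval) {
            int sum = 0;
            for (int i = end - windowLength; i < end; i++) {
                sum += batchCounts.get(i);
            }
            sums.add(sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // Window of 3 batches sliding by 2: the third batch is part of both windows.
        System.out.println(windowedSums(Arrays.asList(1, 2, 3, 4, 5), 3, 2));
    }
}
```

With a slide interval smaller than the window length, consecutive windows overlap, which is what "reuses data from other windows" refers to.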

Sources ● Socket ● Kafka ● Flume ● HDFS ● MQ (ZeroMQ...) ● Twitter ● ... ● Or a custom implementation of Receiver

Demo Spark Streaming

Spark Streaming Demo ● Receive Tweets with hashtag #Android ○ Twitter4J ● Detection of the language of the Tweet ○ Language Detection ● Indexing with Elasticsearch ● Reporting with Kibana 4

Demo
● Launch Elasticsearch
$ curl -X DELETE localhost:9200
$ curl -X PUT localhost:9200/spark/_mapping/tweets -d '{
  "tweets": {
    "properties": {
      "user": {"type": "string", "index": "not_analyzed"},
      "text": {"type": "string"},
      "createdAt": {"type": "date", "format": "date_time"},
      "language": {"type": "string", "index": "not_analyzed"}
    }
  }
}'
● Launch Kibana -> http://localhost:5601
● Launch the Spark Streaming process
