Slide 1

Alexis Seigneurin @aseigneurin @ippontech

Slide 2

Spark
● Processing of large volumes of data
● Distributed processing on commodity hardware
● Written in Scala, Java binding

Slide 3

History
● 2009: AMPLab, UC Berkeley
● June 2013: "Top-level project" of the Apache Foundation
● May 2014: version 1.0.0
● Currently: version 1.2.0

Slide 4

Use cases
● Log analysis
● Processing of text files
● Analytics
● Distributed search (as Google did before)
● Fraud detection
● Product recommendation

Slide 5

Proximity with Hadoop
● Same use cases
● Same development model: MapReduce
● Integration with the ecosystem

Slide 6

Simpler than Hadoop
● API simpler to learn
● “Relaxed” MapReduce
● Spark Shell: interactive processing

Slide 7

Faster than Hadoop
Spark officially sets a new record in large-scale sorting (5th November 2014)
● Sorting 100 TB of data
● Hadoop MR: 72 minutes
  ○ With 2,100 nodes (50,400 cores)
● Spark: 23 minutes
  ○ With 206 nodes (6,592 cores)

Slide 8

Spark ecosystem
● Spark
● Spark Shell
● Spark Streaming
● Spark SQL
● Spark ML
● GraphX

Slide 9

Integration
● YARN, Zookeeper, Mesos
● HDFS
● Cassandra
● Elasticsearch
● MongoDB

Slide 10

Spark - Operating principle

Slide 11

RDD
● Resilient Distributed Dataset
● Abstraction of a collection processed in parallel
● Fault tolerant
● Can work with tuples:
  ○ Key - Value
  ○ Tuples must be independent from each other
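Not on the original slide: a minimal Java sketch of the RDD abstraction, creating an RDD from an in-memory collection and working with key-value tuples. The local context, application name, and sample values are assumptions made for illustration.

import java.util.Arrays;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class RddSketch {
    public static void main(String[] args) {
        // Local context with 2 threads; "rdd-sketch" is just an application name
        JavaSparkContext sc = new JavaSparkContext("local[2]", "rdd-sketch");

        // An RDD abstracts a collection processed in parallel
        JavaRDD<String> words = sc.parallelize(Arrays.asList("oak", "elm", "oak"));

        // Key-value tuples: each tuple is independent from the others
        JavaPairRDD<String, Integer> pairs = words.mapToPair(w -> new Tuple2<>(w, 1));

        pairs.reduceByKey((a, b) -> a + b)
             .foreach(t -> System.out.println(t._1 + " : " + t._2));

        sc.stop();
    }
}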

Slide 12

Sources
● Files on HDFS
● Local files
● Collection in memory
● Amazon S3
● NoSQL database
● ...
● Or a custom implementation of InputFormat

Slide 13

Transformations
● Process an RDD and return another RDD
● Lazy!
● Examples:
  ○ map(): one value → another value
  ○ mapToPair(): one value → a tuple
  ○ filter(): filters values/tuples given a condition
  ○ groupByKey(): groups values by key
  ○ reduceByKey(): aggregates values by key
  ○ join(), cogroup()...: joins two RDDs
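Not on the slide: a short Java sketch illustrating that transformations are lazy; the chain below only describes the processing, and nothing is read or computed until an action is called (actions are on the next slide). The file path and local context are assumptions.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LazyTransformations {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local", "lazy-demo");

        // Each transformation returns a new RDD; no data is processed yet
        JavaRDD<String[]> tallTrees = sc.textFile("data/arbresalignementparis2010.csv")
                .filter(line -> !line.startsWith("geom"))          // drop the CSV header
                .map(line -> line.split(";"))
                .filter(fields -> !fields[3].isEmpty())             // keep rows with a height
                .filter(fields -> Float.parseFloat(fields[3]) > 10.0f);

        // Only this action triggers the actual computation
        System.out.println("Tall trees: " + tallTrees.count());

        sc.stop();
    }
}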

Slide 14

Actions
● Do not return an RDD
● Examples:
  ○ count(): counts values/tuples
  ○ saveAsHadoopFile(): saves results in Hadoop’s format
  ○ foreach(): applies a function on each item
  ○ collect(): retrieves the values as a List
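Not on the slide: a sketch of an RDD consumed with actions; count() and collect() return plain Java values and trigger execution of the lazy transformation chain. The file path, output directory, and filter condition are assumptions.

import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ActionsSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local", "actions-demo");

        JavaRDD<String> lines = sc.textFile("data/arbresalignementparis2010.csv")
                .filter(line -> !line.startsWith("geom"));

        // count() is an action: it returns a long, not an RDD
        long nbTrees = lines.count();

        // collect() retrieves the values as a java.util.List on the driver
        List<String> quaiBranly = lines.filter(l -> l.contains("QUAI BRANLY")).collect();

        // saveAsTextFile() writes the RDD to a directory of part files
        lines.saveAsTextFile("output/arbres");

        System.out.println(nbTrees + " trees, " + quaiBranly.size() + " on QUAI BRANLY");
        sc.stop();
    }
}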

Slide 15

Example

Slide 16

Spark - Example
● Trees of Paris: CSV file, Open Data
● Count of trees by species

geom_x_y;circonfere;adresse;hauteurenm;espece;varieteouc;dateplanta
48.8648454814, 2.3094155344;140.0;COURS ALBERT 1ER;10.0;Aesculus hippocastanum;;
48.8782668139, 2.29806967519;100.0;PLACE DES TERNES;15.0;Tilia platyphyllos;;
48.889306184, 2.30400164126;38.0;BOULEVARD MALESHERBES;0.0;Platanus x hispanica;;
48.8599934405, 2.29504883623;65.0;QUAI BRANLY;10.0;Paulownia tomentosa;;1996-02-29
...

Slide 17

Spark - Example

JavaSparkContext sc = new JavaSparkContext("local", "arbres");

sc.textFile("data/arbresalignementparis2010.csv")
    .filter(line -> !line.startsWith("geom"))
    .map(line -> line.split(";"))
    .mapToPair(fields -> new Tuple2(fields[4], 1))
    .reduceByKey((x, y) -> x + y)
    .sortByKey()
    .foreach(t -> System.out.println(t._1 + " : " + t._2));

[Diagram: the CSV records flowing through the pipeline textFile → filter → map → mapToPair → reduceByKey → sortByKey → foreach]

Slide 18

Spark - Example

Acacia dealbata : 2
Acer acerifolius : 39
Acer buergerianum : 14
Acer campestre : 452
...

Slide 19

Topology & Terminology
● One master / several workers
  ○ (+ one standby master)
● Submit an application to the cluster
● Execution managed by a driver

Slide 20

Spark in a cluster
Several options:
● YARN
● Mesos
● Standalone
  ○ Workers started manually
  ○ Workers started by the master

Slide 21

Storage & Processing

MapReduce
● Spark (API)
● Distributed processing
● Fault tolerant

Storage
● HDFS, NoSQL database...
● Distributed storage
● Fault tolerant

Slide 22

Data locality
● Process the data where it is stored
● Avoid network I/Os

Slide 23

Data locality

[Diagram: each Spark Worker co-located with an HDFS Datanode; the Spark Master co-located with the HDFS Namenode; plus a standby HDFS Namenode and a standby Spark Master]

Slide 24

Demo
Spark in a cluster

Slide 25

Demo

$ $SPARK_HOME/sbin/start-master.sh

$ $SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker \
    spark://MBP-de-Alexis:7077 --cores 2 --memory 2G

$ mvn clean package

$ $SPARK_HOME/bin/spark-submit \
    --master spark://MBP-de-Alexis:7077 \
    --class com.seigneurin.spark.WikipediaMapReduceByKey \
    --deploy-mode cluster \
    target/pres-spark-0.0.1-SNAPSHOT.jar

Slide 26

Spark SQL

Slide 27

Spark SQL
● Usage of an RDD in SQL
● SQL engine: converts SQL instructions to low-level instructions

Slide 28

Spark SQL
Prerequisites:
● Use tabular data
● Describe the schema → SchemaRDD

Describing the schema:
● Programmatic description of the data
● Schema inference through reflection (POJO)
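Not in the deck, which only demonstrates the programmatic description: a sketch of the reflection-based alternative, assuming the Spark 1.2 Java API (JavaSQLContext.applySchema with a bean class) and a hypothetical Tree POJO. Column names and types are inferred from the bean's getters.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.api.java.JavaSQLContext;
import org.apache.spark.sql.api.java.JavaSchemaRDD;

public class TreeReflectionSchema {

    // Hypothetical POJO: the schema is inferred by reflection from its getters
    public static class Tree implements java.io.Serializable {
        private float hauteurenm;
        private String espece;
        public Tree() {}
        public Tree(float hauteurenm, String espece) {
            this.hauteurenm = hauteurenm;
            this.espece = espece;
        }
        public float getHauteurenm() { return hauteurenm; }
        public void setHauteurenm(float hauteurenm) { this.hauteurenm = hauteurenm; }
        public String getEspece() { return espece; }
        public void setEspece(String espece) { this.espece = espece; }
    }

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local", "arbres-sql");
        JavaSQLContext sqlContext = new JavaSQLContext(sc);

        JavaRDD<Tree> trees = sc.textFile("data/arbresalignementparis2010.csv")
                .filter(line -> !line.startsWith("geom"))
                .map(line -> line.split(";"))
                .filter(fields -> !fields[3].isEmpty())
                .map(fields -> new Tree(Float.parseFloat(fields[3]), fields[4]));

        // Schema inferred from the Tree bean: no StructType to build by hand
        JavaSchemaRDD schemaRDD = sqlContext.applySchema(trees, Tree.class);
        schemaRDD.registerTempTable("tree");
    }
}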

Slide 29

Spark SQL - Example
● Creating tabular data (type Row)

JavaRDD<Row> rdd = trees.map(fields -> Row.create(
        Float.parseFloat(fields[3]), fields[4]));

---------------------------------------
| 10.0 | Aesculus hippocastanum |
| 15.0 | Tilia platyphyllos     |
| 0.0  | Platanus x hispanica   |
| 10.0 | Paulownia tomentosa    |
| ...  | ...                    |
---------------------------------------

Slide 30

Spark SQL - Example
● Describing the schema

List<StructField> fields = new ArrayList<>();
fields.add(DataType.createStructField("hauteurenm", DataType.FloatType, false));
fields.add(DataType.createStructField("espece", DataType.StringType, false));
StructType schema = DataType.createStructType(fields);

JavaSchemaRDD schemaRDD = sqlContext.applySchema(rdd, schema);
schemaRDD.registerTempTable("tree");

---------------------------------------
| hauteurenm | espece                 |
---------------------------------------
| 10.0       | Aesculus hippocastanum |
| 15.0       | Tilia platyphyllos     |
| 0.0        | Platanus x hispanica   |
| 10.0       | Paulownia tomentosa    |
| ...        | ...                    |
---------------------------------------

Slide 31

Spark SQL - Example
● Counting trees by species

sqlContext.sql("SELECT espece, COUNT(*) FROM tree WHERE espece <> '' GROUP BY espece ORDER BY espece")
    .foreach(row -> System.out.println(row.getString(0) + " : " + row.getLong(1)));

Acacia dealbata : 2
Acer acerifolius : 39
Acer buergerianum : 14
Acer campestre : 452
...

Slide 32

Spark Streaming

Slide 33

Micro-batches
● Slices a continuous flow of data into batches
● Same API
● ≠ Apache Storm

Slide 34

DStream
● Discretized Stream
● Sequence of RDDs
● Initialized with a Duration
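Not on the slide: a minimal Java sketch of a DStream, assuming a local run and a text stream on localhost:9999 (e.g. fed by netcat). The streaming context is initialized with a Duration of one second, so the DStream is a sequence of one RDD per second.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class DStreamSketch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("dstream-sketch");

        // The context is initialized with a Duration: the batch interval (1 s here)
        JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(1000));

        // A DStream is a sequence of RDDs, one per batch interval
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);
        lines.count().print();   // number of lines received in each batch

        ssc.start();
        ssc.awaitTermination();
    }
}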

Slide 35

Window operations
● Sliding window
● Reuses data from other windows
● Initialized with a window length and a slide interval
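Not on the slide: a Java sketch of a sliding window, assuming pairs built from a socket stream; the window length (30 s) and slide interval (10 s) are illustrative values and must both be multiples of the batch interval.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class WindowSketch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("window-sketch");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(1000));

        // One (line, 1) pair per line received on the socket
        JavaPairDStream<String, Integer> pairs = ssc
                .socketTextStream("localhost", 9999)
                .mapToPair(line -> new Tuple2<>(line, 1));

        // Sliding window: 30 s of data, recomputed every 10 s
        JavaPairDStream<String, Integer> counts = pairs.reduceByKeyAndWindow(
                (a, b) -> a + b,         // aggregation inside the window
                new Duration(30000),     // window length
                new Duration(10000));    // slide interval

        counts.print();
        ssc.start();
        ssc.awaitTermination();
    }
}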

Slide 36

Sources
● Socket
● Kafka
● Flume
● HDFS
● MQ (ZeroMQ...)
● Twitter
● ...
● Or a custom implementation of Receiver

Slide 37

Demo
Spark Streaming

Slide 38

Spark Streaming - Demo
● Receive tweets with the hashtag #Android
  ○ Twitter4J
● Detect the language of the tweet
  ○ Language Detection
● Index into Elasticsearch
● Report with Kibana 4
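Not in the deck: a hedged sketch of the tweet-receiving part only, assuming the spark-streaming-twitter module and Twitter4J credentials supplied as twitter4j.oauth.* system properties; language detection and Elasticsearch indexing are left out and only the user and text are printed.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.twitter.TwitterUtils;
import twitter4j.Status;

public class TweetStreamSketch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("tweets-android");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(2000));

        // Stream of tweets filtered on the #Android hashtag
        JavaDStream<Status> tweets = TwitterUtils.createStream(ssc, new String[]{"#Android"});

        // In the demo, each tweet would then be run through language detection
        // and indexed into Elasticsearch; here we only print a summary
        tweets.map(status -> status.getUser().getScreenName() + " : " + status.getText())
              .print();

        ssc.start();
        ssc.awaitTermination();
    }
}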

Slide 39

Demo
● Launch Elasticsearch

$ curl -X DELETE localhost:9200
$ curl -X PUT localhost:9200/spark/_mapping/tweets -d '{
  "tweets": {
    "properties": {
      "user":      {"type": "string", "index": "not_analyzed"},
      "text":      {"type": "string"},
      "createdAt": {"type": "date", "format": "date_time"},
      "language":  {"type": "string", "index": "not_analyzed"}
    }
  }
}'

● Launch Kibana -> http://localhost:5601
● Launch the Spark Streaming process

Slide 40

@aseigneurin aseigneurin.github.io @ippontech blog.ippon.fr