Slide 1

Slide 1 text

Introduction to Spark
Nantes - 08/07/2014
Ludwine Probst - @nivdul

Slide 2

Slide 2 text

developer
maths lover
machine learning & big data
Duchess France Leader
@nivdul
nivdul.wordpress.com

Slide 3

Slide 3 text

Lay of the land

Slide 4

Slide 4 text

But…

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

• in-memory big data analytics
• Resilient Distributed Datasets (RDD)
• shared variables
• principle of lineage
• complements Hadoop
• better performance than Hadoop
• more flexible programming model
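The principle of lineage can be pictured with a small plain-Java sketch: a dataset remembers its parent data and the transformation that produced it, so a lost result can be rebuilt instead of being replicated. The class `LineageSketch` and its methods are hypothetical names made up for this illustration; this is a conceptual model, not how Spark implements RDDs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Conceptual sketch only: a hypothetical class, not the Spark API.
final class LineageSketch<T, R> {
    private final List<T> parent;            // source data the result derives from
    private final Function<T, R> transform;  // recorded transformation (the "lineage")
    private List<R> materialized;            // computed result, which may be lost
    private int computations = 0;            // how many times the result was (re)built

    LineageSketch(List<T> parent, Function<T, R> transform) {
        this.parent = parent;
        this.transform = transform;
    }

    // Returns the result, rebuilding it from the parent if it is missing.
    List<R> get() {
        if (materialized == null) {
            List<R> out = new ArrayList<>();
            for (T item : parent) {
                out.add(transform.apply(item));
            }
            materialized = out;
            computations++;
        }
        return materialized;
    }

    // Simulates losing the computed data, for example after a node failure.
    void simulateLoss() {
        materialized = null;
    }

    int computations() {
        return computations;
    }

    public static void main(String[] args) {
        LineageSketch<String, Integer> lengths =
                new LineageSketch<>(Arrays.asList("a", "bb", "ccc"), String::length);
        System.out.println(lengths.get());          // prints [1, 2, 3]
        lengths.simulateLoss();
        System.out.println(lengths.get());          // prints [1, 2, 3] again, recomputed
        System.out.println(lengths.computations()); // prints 2
    }
}
```

The second call to `get()` after the simulated loss succeeds because the recipe, not only the data, was kept.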

Slide 7

Slide 7 text

Supported languages
lambda expressions (Java 8)

Slide 8

Slide 8 text

[Diagram: Spark running over data in three deployment modes: Standalone, Mesos, and YARN (alongside Hadoop)]

Slide 9

Slide 9 text

InputFormat Spark

Slide 10

Slide 10 text

SparkContext
SparkConf sparkConf = new SparkConf()
    .setAppName("SimpleExample")
    .setMaster("local")
    //.setMaster("spark://192.168.1.11:7077")
    .setJars(new String[]{PATH + "/project-spark/target/project-spark-0.0.1-SNAPSHOT.jar"});
JavaSparkContext sc = new JavaSparkContext(sparkConf);

Slide 11

Slide 11 text

Resilient Distributed Datasets (RDD)
Definition: fault-tolerant, immutable, distributed collections
• processed in parallel
• operations on RDDs = transformations & actions
• persistence: MEMORY, DISK, MEMORY & DISK…
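The split between transformations and actions matters because transformations are lazy: they only describe the computation, and nothing runs until an action asks for a result. The sketch below shows that behaviour with `java.util.stream` as a local stand-in for an RDD. It is an analogy under that assumption, not Spark code, and `LazyPipelineSketch` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Local analogy only: java.util.stream stands in for an RDD; this is not the Spark API.
final class LazyPipelineSketch {

    // Counts how many times the mapping function actually runs.
    static final AtomicInteger CALLS = new AtomicInteger();

    // "Transformations": map and filter describe the pipeline, nothing is computed yet.
    static Stream<Integer> lengthsOfLongWords(List<String> words) {
        return words.stream()
                .map(w -> {
                    CALLS.incrementAndGet();
                    return w.length();
                })
                .filter(n -> n > 4);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "rdd", "lineage", "hadoop");

        Stream<Integer> pipeline = lengthsOfLongWords(words);
        System.out.println("calls before action: " + CALLS.get()); // prints 0

        // "Action": collecting the result triggers the whole pipeline.
        List<Integer> result = pipeline.collect(Collectors.toList());
        System.out.println("result: " + result);                   // prints [5, 7, 6]
        System.out.println("calls after action: " + CALLS.get());  // prints 4
    }
}
```

In Spark the same roles are played by calls such as `map` and `filter` (transformations) and `count` or `foreach` (actions).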

Slide 12

Slide 12 text

Create an RDD
// from a text file
JavaRDD<String> lines = sc.textFile("ensemble-des-equipements-sportifs-de-lile-de-france.csv");
// from HDFS
JavaRDD<String> lines = sc.textFile("hdfs://dataset/ensemble-des-equipements-sportifs-de-lile-de-france.csv");
// from S3
JavaRDD<String> lines = sc.textFile("s3n://dataset/ensemble-des-equipements-sportifs-de-lile-de-france.csv");
// file from Hadoop
sc.hadoopFile(path, inputFormatClass, keyClass, valueClass);
// create a distributed dataset
sc.parallelize(data);

Slide 13

Slide 13 text

Operations on RDDs
JavaRDD<String[]> lines = sc.textFile("ensemble-des-equipements-sportifs-de-lile-de-france.csv")
    .map(line -> line.split(";"))
    // remove the header line
    .filter(line -> !line[1].equals("ins_com"));
lines.count();
// number of facilities per equipment type, sorted alphabetically
lines.mapToPair(line -> new Tuple2<>(line[3], 1))
    .reduceByKey((x, y) -> x + y)
    .sortByKey()
    .foreach(t -> System.out.println(t._1() + " -> " + t._2()));

Slide 14

Slide 14 text

RDD Persistence
// lines = RDD object
// the default persistence level is MEMORY_ONLY
lines.cache();
// other persistence levels (choose one: a storage level cannot be changed once assigned)
lines.persist(StorageLevel.DISK_ONLY());
lines.persist(StorageLevel.MEMORY_ONLY());
lines.persist(StorageLevel.MEMORY_AND_DISK());
// replication
lines.persist(StorageLevel.apply(1, 3));

Slide 15

Slide 15 text

Shared variables
broadcast variables & accumulators
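The two kinds of shared variables play different roles: a broadcast variable is a read-only value made available to every task, while an accumulator is a value that tasks can only add to and that is read back at the end. The sketch below mimics both with plain Java and a parallel stream. It is a local analogy, not the Spark API (in Spark these are created through `sc.broadcast(...)` and `sc.accumulator(...)`), and the class name, codes, and lookup table are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

// Local analogy only: plain Java standing in for Spark's shared variables.
final class SharedVariablesSketch {

    // Returns {number of codes found in the lookup, number of unknown codes}.
    static long[] countKnownAndUnknown(List<String> codes, Map<String, String> lookup) {
        // "Broadcast variable": a read-only copy shared by every parallel task.
        Map<String, String> broadcast = Collections.unmodifiableMap(new HashMap<>(lookup));
        // "Accumulator": tasks only add to it; the total is read once the work is done.
        LongAdder unknown = new LongAdder();

        long known = codes.parallelStream()
                .filter(code -> {
                    boolean found = broadcast.containsKey(code);
                    if (!found) {
                        unknown.increment();
                    }
                    return found;
                })
                .count();

        return new long[]{known, unknown.sum()};
    }

    public static void main(String[] args) {
        // Hypothetical equipment-type codes and labels, for illustration only.
        Map<String, String> labels = new HashMap<>();
        labels.put("A", "stadium");
        labels.put("B", "swimming pool");

        long[] counts = countKnownAndUnknown(Arrays.asList("A", "B", "X", "A", "Y"), labels);
        System.out.println("known: " + counts[0]);   // prints known: 3
        System.out.println("unknown: " + counts[1]); // prints unknown: 2
    }
}
```

`LongAdder` is used here because it tolerates concurrent increments, which is the property an accumulator needs.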

Slide 16

Slide 16 text

Performance
Example: logistic regression

Slide 17

Slide 17 text

Spark Ecosystem
Spark Streaming, Spark SQL

Slide 18

Slide 18 text

No content