Introduction to Spark

Nantes - 08/07/2014
Ludwine Probst - @nivdul

• developer
• maths lover
• machine learning & big data
• Duchess France Leader
• @nivdul
• nivdul.wordpress.com

Lay of the land

But…

• big data analytics in memory
• Resilient Distributed Datasets (RDD)
• shared variables
• principle of lineage
• complements Hadoop
• better performance than Hadoop
• more flexible programming models

Supported languages
lambda expressions (Java 8)

[Diagram: Spark deployment modes: Standalone, Mesos, YARN (Hadoop)]

InputFormat Spark

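SparkContext

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    SparkConf sparkConf = new SparkConf()
        .setAppName("SimpleExample")
        // run locally with a single worker thread
        .setMaster("local")
        // or connect to a standalone cluster instead
        //.setMaster("spark://192.168.1.11:7077")
        // JAR shipped to the workers
        .setJars(new String[]{PATH + "/project-spark/target/project-spark-0.0.1-SNAPSHOT.jar"});

    JavaSparkContext sc = new JavaSparkContext(sparkConf);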

Resilient Distributed Datasets (RDD)

Definition: fault-tolerant, immutable, distributed collections

• processed in parallel
• operations on RDDs = transformations & actions
• persistence: MEMORY, DISK, MEMORY & DISK…
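The split between transformations and actions can be illustrated without a cluster. The sketch below is a plain-Java analogy built on `java.util.stream`, not Spark's API: intermediate stream operations play the role of transformations (they only describe the work), and the terminal `collect` plays the role of an action (it triggers the work). The class name, the counter, and the sample rows are hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LazyPipelineSketch {

    // Counts how many times the map step actually runs.
    static final AtomicInteger evaluations = new AtomicInteger();

    // Only describes the pipeline; like an RDD transformation, nothing is computed yet.
    static Stream<String[]> buildPipeline(List<String> rawLines) {
        return rawLines.stream()
                .map(line -> {
                    evaluations.incrementAndGet();
                    return line.split(";");
                })
                // drop the header row, identified by its second column
                .filter(fields -> !fields[1].equals("ins_com"));
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, not taken from the real CSV file.
        List<String> rawLines = Arrays.asList(
                "ins_numero;ins_com;ins_nom;type",
                "1;75101;Stade A;Terrain de football",
                "2;75102;Piscine B;Bassin de natation");

        Stream<String[]> pipeline = buildPipeline(rawLines);
        System.out.println("evaluations after building: " + evaluations.get());

        // Terminal operation: comparable to an RDD action, it forces evaluation.
        List<String[]> rows = pipeline.collect(Collectors.toList());
        System.out.println("rows kept: " + rows.size()
                + ", evaluations after collect: " + evaluations.get());
    }
}
```

The analogy stops at laziness: a Java stream is single-use and local, whereas an RDD is distributed and can be reused or recomputed.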

Create an RDD

    // from a text file (textFile returns one String per line)
    JavaRDD<String> lines = sc.textFile("ensemble-des-equipements-sportifs-de-lile-de-france.csv");

    // from HDFS
    JavaRDD<String> hdfsLines = sc.textFile("hdfs://dataset/ensemble-des-equipements-sportifs-de-lile-de-france.csv");

    // from S3
    JavaRDD<String> s3Lines = sc.textFile("s3n://dataset/ensemble-des-equipements-sportifs-de-lile-de-france.csv");

    // file from Hadoop
    sc.hadoopFile(path, inputFormatClass, keyClass, valueClass);

    // create a distributed dataset from a local collection
    sc.parallelize(data);

Operations on RDDs

    JavaRDD<String[]> lines = sc.textFile("ensemble-des-equipements-sportifs-de-lile-de-france.csv")
        .map(line -> line.split(";"))
        // remove the header line
        .filter(line -> !line[1].equals("ins_com"));

    lines.count();

    // number of rows per equipment type, in alphabetical order
    lines.mapToPair(line -> new Tuple2<>(line[3], 1))
        .reduceByKey((x, y) -> x + y)
        .sortByKey()
        .foreach(t -> System.out.println(t._1 + " -> " + t._2));
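To make the result of the `mapToPair` / `reduceByKey` / `sortByKey` chain concrete, here is a minimal local sketch of the same computation in plain Java, with no Spark involved. It assumes, as in the slide, that column 3 holds the equipment type; the class name and the sample rows are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CountByTypeSketch {

    // Local equivalent of: pair each row with 1, sum the values per key, sort by key.
    static Map<String, Integer> countByType(List<String[]> rows) {
        Map<String, Integer> counts = new TreeMap<>(); // TreeMap keeps keys sorted
        for (String[] row : rows) {
            // merge applies the same (x, y) -> x + y reduction for each key
            counts.merge(row[3], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical rows; only column 3 (equipment type) matters here.
        List<String[]> rows = Arrays.asList(
                "1;75101;Stade A;Terrain de football".split(";"),
                "2;75102;Piscine B;Bassin de natation".split(";"),
                "3;75103;Stade C;Terrain de football".split(";"));

        countByType(rows).forEach((type, n) -> System.out.println(type + " -> " + n));
    }
}
```

In Spark the per-key sums are first computed inside each partition and then combined across the cluster; this single loop only shows what the final key/value pairs look like.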

RDD Persistence

    // lines = RDD object

    // the default persistence level is MEMORY_ONLY
    lines.cache();

    // other persistence levels
    // (alternatives: an RDD's storage level cannot be changed once assigned)
    lines.persist(StorageLevel.DISK_ONLY());
    lines.persist(StorageLevel.MEMORY_ONLY());
    lines.persist(StorageLevel.MEMORY_AND_DISK());

    // replication
    lines.persist(StorageLevel.apply(1, 3));

Shared variables
broadcast variables & accumulators
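The two kinds of shared variables can be pictured with a plain-Java analogy that uses a parallel stream instead of a cluster. This is not Spark code: in the Spark 1.x Java API the corresponding entry points are `sc.broadcast(...)` and `sc.accumulator(...)`. Here a read-only lookup table stands in for a broadcast variable, and a `LongAdder` that tasks can only add to stands in for an accumulator. The class name, method name, and sample data are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

class SharedVariablesSketch {

    // Counts rows whose commune code (column 1) is missing from the shared lookup table.
    static long countUnknownCommunes(List<String[]> rows, Map<String, String> communeNames) {
        // Broadcast analogue: one read-only table visible to every task.
        Map<String, String> lookup = Collections.unmodifiableMap(communeNames);
        // Accumulator analogue: tasks only add to it; the caller reads the total at the end.
        LongAdder unknown = new LongAdder();

        rows.parallelStream().forEach(row -> {
            if (!lookup.containsKey(row[1])) {
                unknown.increment();
            }
        });
        return unknown.sum();
    }

    public static void main(String[] args) {
        // Hypothetical lookup table and rows, for illustration only.
        Map<String, String> communeNames = new HashMap<>();
        communeNames.put("75101", "Paris 1er");

        List<String[]> rows = Arrays.asList(
                "1;75101;Stade A;Terrain de football".split(";"),
                "2;99999;Piscine B;Bassin de natation".split(";"));

        System.out.println("rows with unknown commune: "
                + countUnknownCommunes(rows, communeNames));
    }
}
```

The design point carried over from Spark is the direction of data flow: the lookup table only travels from the driver to the tasks, and the counter only travels from the tasks back to the driver.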

Performance
Example: logistic regression

Spark Ecosystem
[Diagram: Spark Streaming, Spark SQL]

