
Introduction to Spark


Probst Ludwine

July 09, 2014


Transcript

  1. • big data analytics in memory
     • Resilient Distributed Datasets (RDD)
     • shared variables
     • principle of lineage
     • complements Hadoop
     • better performance than Hadoop
     • more flexible programming models
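The lineage principle listed above can be pictured with a toy model. The following is a minimal plain-Java sketch, not Spark code: `ToyDataset`, `fromSource`, `simulateLoss` and the other names are hypothetical and exist only for illustration. Each dataset remembers the recipe that produced it, so lost data is rebuilt by replaying that recipe rather than by restoring a copy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

// Toy illustration of lineage (hypothetical class, not part of Spark).
class ToyDataset<T> {
    private final Supplier<List<T>> lineage; // recipe that rebuilds the data
    private List<T> materialized;            // null until computed or after a simulated loss

    private ToyDataset(Supplier<List<T>> lineage) {
        this.lineage = lineage;
    }

    // Root of a lineage chain: the recipe simply copies the source data.
    static <E> ToyDataset<E> fromSource(List<E> source) {
        List<E> copy = new ArrayList<>(source);
        return new ToyDataset<E>(() -> new ArrayList<>(copy));
    }

    // Derived dataset: its recipe is "collect the parent, then apply f".
    <R> ToyDataset<R> map(Function<T, R> f) {
        ToyDataset<T> parent = this;
        return new ToyDataset<R>(() -> {
            List<R> out = new ArrayList<>();
            for (T item : parent.collect()) {
                out.add(f.apply(item));
            }
            return out;
        });
    }

    // Computes the data on first use, or again after a loss.
    List<T> collect() {
        if (materialized == null) {
            materialized = lineage.get();
        }
        return materialized;
    }

    // Simulates losing the computed data (for example, a failed node).
    void simulateLoss() {
        materialized = null;
    }

    public static void main(String[] args) {
        ToyDataset<Integer> doubled = ToyDataset.fromSource(Arrays.asList(1, 2, 3)).map(x -> x * 2);
        System.out.println(doubled.collect()); // [2, 4, 6]
        doubled.simulateLoss();
        System.out.println(doubled.collect()); // [2, 4, 6], rebuilt from the lineage
    }
}
```

The sketch ignores partitioning and distribution entirely; it only shows why remembering how data was derived is enough to recover it.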
  2. SparkContext

         SparkConf sparkConf = new SparkConf()
             .setAppName("SimpleExample")
             .setMaster("local")
             //.setMaster("spark://192.168.1.11:7077")
             .setJars(new String[]{PATH + "/project-spark/target/project-spark-0.0.1-SNAPSHOT.jar"});

         JavaSparkContext sc = new JavaSparkContext(sparkConf);
  3. Resilient Distributed Datasets (RDD)

     • processed in parallel
     • operations on RDDs = transformations & actions
     • persistence: MEMORY, DISK, MEMORY & DISK…

     Definition: fault-tolerant, immutable, distributed collections
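The split between transformations and actions can be tried out without a cluster. The sketch below is a plain-Java analogy built on `java.util.stream`, not the Spark API: an intermediate stream operation stands in for a lazy transformation, and a terminal operation stands in for an action that triggers the work. The class `LazyPipelineSketch` and its `calls` counter are hypothetical names used only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

// Analogy only: Java streams are single-use and local, unlike RDDs.
class LazyPipelineSketch {
    // Counts how many times the mapping function has actually run.
    static final AtomicInteger calls = new AtomicInteger();

    // Describes the computation without running it, like a transformation.
    static Stream<Integer> lengths(List<String> words) {
        return words.stream().map(w -> {
            calls.incrementAndGet();
            return w.length();
        });
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = lengths(Arrays.asList("spark", "rdd", "hadoop"));
        System.out.println("calls before the terminal operation: " + calls.get()); // 0

        // The terminal operation plays the role of an action: it runs the pipeline.
        int total = pipeline.reduce(0, Integer::sum);
        System.out.println("total length: " + total);                             // 14
        System.out.println("calls after the terminal operation: " + calls.get()); // 3
    }
}
```

The analogy stops at laziness: a stream cannot be reused after its terminal operation, whereas an RDD can serve several actions.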
  4. Create an RDD

         // from a text file (textFile returns one String per line)
         JavaRDD<String> lines = sc.textFile("ensemble-des-equipements-sportifs-de-lile-de-france.csv");

         // from HDFS
         JavaRDD<String> linesFromHdfs = sc.textFile("hdfs://dataset/ensemble-des-equipements-sportifs-de-lile-de-france.csv");

         // from S3
         JavaRDD<String> linesFromS3 = sc.textFile("s3n://dataset/ensemble-des-equipements-sportifs-de-lile-de-france.csv");

         // file from Hadoop
         sc.hadoopFile(path, inputFormatClass, keyClass, valueClass);

         // create a distributed dataset from an in-memory collection
         sc.parallelize(data);
  5. Operations on RDDs

         JavaRDD<String[]> lines = sc.textFile("ensemble-des-equipements-sportifs-de-lile-de-france.csv")
             .map(line -> line.split(";"))
             // remove first line
             .filter(line -> !line[1].equals("ins_com"));

         lines.count();

         // number per type of equipment, in alphabetical order
         lines.mapToPair(line -> new Tuple2<>(line[3], 1))
             .reduceByKey((x, y) -> x + y)
             .sortByKey()
             .foreach(t -> System.out.println(t._1 + " -> " + t._2));
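To see what the pipeline above computes, the same steps can be replayed on a few rows in plain Java. This is a sketch under assumptions, not Spark code: the sample rows and every header name except `ins_com` are invented for illustration, and `CountByTypeSketch` is a hypothetical class. A `TreeMap` stands in for `reduceByKey` followed by `sortByKey`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local, single-machine mirror of the slide's counting logic (illustrative only).
class CountByTypeSketch {
    static Map<String, Integer> countByType(List<String> rawLines) {
        // TreeMap keeps keys in natural (alphabetical) order, like sortByKey().
        Map<String, Integer> counts = new TreeMap<>();
        for (String raw : rawLines) {
            String[] line = raw.split(";");     // same split as the map step
            if (line[1].equals("ins_com")) {
                continue;                       // same header filter as the slide
            }
            // Same per-key sum as reduceByKey((x, y) -> x + y).
            counts.merge(line[3], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented sample rows; the real CSV columns are not shown in the deck.
        List<String> sample = Arrays.asList(
            "id;ins_com;name;type",
            "1;75101;A;Tennis court",
            "2;75102;B;Swimming pool",
            "3;75103;C;Tennis court");
        countByType(sample).forEach((type, n) -> System.out.println(type + " -> " + n));
        // Swimming pool -> 1
        // Tennis court -> 2
    }
}
```

In Spark the per-key sums are computed in parallel across partitions before being merged; the loop here only reproduces the result, not the execution model.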
  6. RDDs Persistence

         // lines = RDD object

         // cache() uses the default storage level, MEMORY_ONLY
         lines.cache();

         // other storage levels (choose one: an RDD's level cannot be changed once assigned)
         lines.persist(StorageLevel.DISK_ONLY());
         lines.persist(StorageLevel.MEMORY_ONLY());
         lines.persist(StorageLevel.MEMORY_AND_DISK());

         // replication (the second argument is the number of replicas)
         lines.persist(StorageLevel.apply(1, 3));