Introduction to Spark

Probst Ludwine

July 09, 2014

Transcript

  1. • big data analytics in memory
     • Resilient Distributed Datasets (RDD)
     • shared variables (see the sketch after this slide)
     • principle of lineage
     • complements Hadoop
     • better performance than Hadoop
     • more flexible programming models
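The shared variables mentioned above are Spark's broadcast variables and accumulators. A minimal sketch of both, assuming the sc JavaSparkContext created on the next slide (the numbers and the threshold are only illustrative):

     import java.util.Arrays;
     import org.apache.spark.Accumulator;
     import org.apache.spark.api.java.JavaRDD;
     import org.apache.spark.broadcast.Broadcast;

     // broadcast variable: a read-only value shipped once to every worker
     Broadcast<Integer> threshold = sc.broadcast(10);

     // accumulator: workers can only add to it, the driver reads the result
     Accumulator<Integer> above = sc.accumulator(0);

     JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(3, 15, 8, 42));
     numbers.foreach(n -> {
         if (n > threshold.value()) above.add(1);   // runs on the workers
     });
     System.out.println(above.value());             // 2, read back on the driver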
  2. SparkContext

     SparkConf sparkConf = new SparkConf()
         .setAppName("SimpleExample")
         .setMaster("local")
         //.setMaster("spark://192.168.1.11:7077")
         .setJars(new String[]{PATH + "/project-spark/target/project-spark-0.0.1-SNAPSHOT.jar"});

     JavaSparkContext sc = new JavaSparkContext(sparkConf);
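Once the job is finished the context should be released; sc.stop() is the call for that, and wrapping the work in try/finally is just one way of making sure it runs:

     try {
         // transformations and actions on the RDDs go here
     } finally {
         sc.stop();   // frees the application's resources on the cluster
     }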
  3. Resilient Distributed Datasets (RDD)

     Definition: fault-tolerant, immutable, distributed collections
     • processed in parallel
     • operations on RDDs = transformations & actions
     • persistence: MEMORY, DISK, MEMORY & DISK…
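To make the transformation/action distinction concrete, a small sketch reusing the CSV file of the next slides: map is lazy and only records the lineage, while count is an action that triggers the actual computation.

     JavaRDD<String> raw = sc.textFile("ensemble-des-equipements-sportifs-de-lile-de-france.csv");

     // transformation: lazy, nothing is read yet, only the lineage is recorded
     JavaRDD<String[]> fields = raw.map(line -> line.split(";"));

     // action: triggers the computation (and actually reads the file)
     long n = fields.count();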
  4. Create an RDD

     // from a text file
     JavaRDD<String> lines = sc.textFile("ensemble-des-equipements-sportifs-de-lile-de-france.csv");

     // from HDFS
     JavaRDD<String> lines = sc.textFile("hdfs://dataset/ensemble-des-equipements-sportifs-de-lile-de-france.csv");

     // from S3
     JavaRDD<String> lines = sc.textFile("s3n://dataset/ensemble-des-equipements-sportifs-de-lile-de-france.csv");

     // file from Hadoop
     sc.hadoopFile(path, inputFormatClass, keyClass, valueClass);

     // create a distributed dataset from a local collection
     sc.parallelize(data);
  5. Operations on RDDs

     JavaRDD<String[]> lines = sc.textFile("ensemble-des-equipements-sportifs-de-lile-de-france.csv")
         .map(line -> line.split(";"))
         // remove the header line
         .filter(line -> !line[1].equals("ins_com"));

     lines.count();

     // number of equipments per type, in alphabetical order
     lines.mapToPair(line -> new Tuple2<>(line[3], 1))
         .reduceByKey((x, y) -> x + y)
         .sortByKey()
         .foreach(t -> System.out.println(t._1 + " -> " + t._2));
  6. RDDs Persistence

     // lines = RDD object

     // persistence by default is MEMORY_ONLY
     lines.cache();

     // other persistence levels
     lines.persist(StorageLevel.DISK_ONLY());
     lines.persist(StorageLevel.MEMORY_ONLY());
     lines.persist(StorageLevel.MEMORY_AND_DISK());

     // replication
     lines.persist(StorageLevel.apply(1, 3));
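Note that Spark refuses to change the storage level of an RDD once one has been assigned; to switch levels, or simply to free the cached blocks when they are no longer needed, unpersist the RDD first (a sketch):

     // drop the cached partitions
     lines.unpersist();

     // the storage level can now be set again
     lines.persist(StorageLevel.MEMORY_AND_DISK());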