Introduction to Spark

Probst Ludwine

July 09, 2014

Transcript

  1. Introduction to Spark
    Nantes - 08/07/2014
    Ludwine Probst - @nivdul

  2. developer
    maths lover
    machine learning & big data
    Duchess France Leader
    @nivdul
    nivdul.wordpress.com

  3. Lay of the land

  4. But…

  6. • in-memory big data analytics
    • Resilient Distributed Datasets (RDD)
    • shared variables
    • principle of lineage
    • complements Hadoop
    • better performance than Hadoop MapReduce
    • more flexible programming model
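
The principle of lineage listed on this slide can be illustrated without a cluster. What follows is a toy model in plain Java, not Spark code: the class `LineageSketch` and its methods (`fromSource`, `map`, `simulateLoss`, `lineage`) are hypothetical names invented for this sketch. The idea it tries to show is that each dataset remembers how it was derived, so a lost result can be rebuilt from its source rather than restored from a copy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Toy model of lineage (hypothetical class, not part of Spark).
class LineageSketch<T> {
    private final String lineage;               // readable derivation chain
    private final Supplier<List<T>> recompute;  // rebuilds the data from the parent
    private List<T> materialized;               // computed data, may be lost

    private LineageSketch(String lineage, Supplier<List<T>> recompute) {
        this.lineage = lineage;
        this.recompute = recompute;
    }

    // Root dataset: wraps an immutable copy of the input.
    static <T> LineageSketch<T> fromSource(List<T> data) {
        List<T> copy = Collections.unmodifiableList(new ArrayList<>(data));
        return new LineageSketch<T>("source", () -> copy);
    }

    // Derived dataset: stores the parent and the function, computes nothing yet.
    <R> LineageSketch<R> map(String name, Function<T, R> f) {
        return new LineageSketch<R>(lineage + " -> map(" + name + ")",
                () -> this.collect().stream().map(f).collect(Collectors.toList()));
    }

    // Returns the data, recomputing it through the lineage if it is missing.
    List<T> collect() {
        if (materialized == null) {
            materialized = recompute.get();
        }
        return materialized;
    }

    // Simulates losing the computed data, for example after a worker failure.
    void simulateLoss() {
        materialized = null;
    }

    String lineage() {
        return lineage;
    }

    public static void main(String[] args) {
        LineageSketch<Integer> doubled =
                LineageSketch.fromSource(Arrays.asList(1, 2, 3)).map("double", x -> x * 2);
        System.out.println(doubled.lineage() + " = " + doubled.collect());
        doubled.simulateLoss();
        System.out.println("after loss: " + doubled.collect());
    }
}
```

In this sketch the recomputation walks back through the parent chain; a real engine applies the same idea per partition across machines.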

  7. Supported languages
    lambda expressions (Java 8)

  8. [Diagram: Data and Spark on three cluster managers: Standalone, Mesos, YARN (Hadoop)]

  9. InputFormat
    Spark

  10. SparkContext
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // PATH is the project root directory, defined elsewhere
    SparkConf sparkConf = new SparkConf()
        .setAppName("SimpleExample")
        .setMaster("local")
        //.setMaster("spark://192.168.1.11:7077")
        .setJars(new String[]{PATH + "/project-spark/target/project-spark-0.0.1-SNAPSHOT.jar"});

    JavaSparkContext sc = new JavaSparkContext(sparkConf);

  11. Resilient Distributed Datasets (RDD)
    • processed in parallel
    • operations on RDDs = transformations & actions
    • persistence: MEMORY, DISK, MEMORY & DISK…
    Definition: fault-tolerant, immutable, distributed collections
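
The split between transformations and actions can be tried locally with `java.util.stream`, which follows a similar evaluation model: intermediate operations only describe the work, and nothing runs until a terminal operation is called. This is an analogy only, assuming that the similarity in evaluation order is what matters here; streams are not distributed and the class name `LazyPipelineSketch` is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Local analogy only: java.util.stream is not Spark, but it also separates
// lazy "transformations" from the eager "action" that triggers them.
class LazyPipelineSketch {

    // Counts how many elements the map step has actually processed.
    static final AtomicInteger mapCalls = new AtomicInteger();

    // Describes the pipeline (map, then filter) without running it.
    static Stream<Integer> buildPipeline(List<String> words) {
        return words.stream()
                .map(w -> {
                    mapCalls.incrementAndGet();
                    return w.length();
                })
                .filter(len -> len > 3);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "rdd", "hadoop", "yarn");

        Stream<Integer> pipeline = buildPipeline(words);
        System.out.println("map calls before the action: " + mapCalls.get());

        // The terminal operation plays the role of an action.
        List<Integer> lengths = pipeline.collect(Collectors.toList());
        System.out.println("lengths: " + lengths);
        System.out.println("map calls after the action: " + mapCalls.get());
    }
}
```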

  12. Create an RDD
    // from a text file
    JavaRDD<String> lines = sc.textFile("ensemble-des-equipements-sportifs-de-lile-de-france.csv");
    // from HDFS
    JavaRDD<String> hdfsLines = sc.textFile("hdfs://dataset/ensemble-des-equipements-sportifs-de-lile-de-france.csv");
    // from S3
    JavaRDD<String> s3Lines = sc.textFile("s3n://dataset/ensemble-des-equipements-sportifs-de-lile-de-france.csv");

    // file from Hadoop
    sc.hadoopFile(path, inputFormatClass, keyClass, valueClass);

    // create a distributed dataset from a local collection
    sc.parallelize(data);

  13. Operations on RDDs
    // requires org.apache.spark.api.java.JavaRDD and scala.Tuple2
    JavaRDD<String[]> lines = sc.textFile("ensemble-des-equipements-sportifs-de-lile-de-france.csv")
        .map(line -> line.split(";"))
        // remove the header line
        .filter(line -> !line[1].equals("ins_com"));

    lines.count();

    // number of rows per equipment type, in alphabetical order
    // (foreach runs on the executors, so the output appears in order only in local mode)
    lines.mapToPair(line -> new Tuple2<>(line[3], 1))
        .reduceByKey((x, y) -> x + y)
        .sortByKey()
        .foreach(t -> System.out.println(t._1() + " -> " + t._2()));
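
To check the expected result of the counting pipeline on a handful of rows before running it on the full file, the same steps can be written with plain Java streams. This is a local sketch, not the Spark job itself: the class name, the header names other than `ins_com`, and the sample rows are all hypothetical, and only columns 1 and 3 matter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java counterpart of the slide's pipeline, for small local samples.
class EquipmentCountSketch {

    // Splits each line on ';', drops the header row, then counts rows per
    // value of column 3. A TreeMap keeps the keys in alphabetical order.
    static Map<String, Long> countByType(List<String> csvLines) {
        return csvLines.stream()
                .map(line -> line.split(";"))
                .filter(cols -> !cols[1].equals("ins_com"))
                .collect(Collectors.groupingBy(
                        cols -> cols[3], TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Hypothetical rows; the real file has more columns.
        List<String> sample = Arrays.asList(
                "col0;ins_com;col2;equipment_type",
                "1;75101;Stade A;Tennis court",
                "2;75102;Gymnase B;Gymnasium",
                "3;75103;Stade C;Tennis court",
                "4;75104;Piscine D;Pool");

        countByType(sample).forEach((type, n) -> System.out.println(type + " -> " + n));
    }
}
```

Here `groupingBy` with `counting` stands in for `mapToPair` plus `reduceByKey`, and the `TreeMap` stands in for `sortByKey`.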

  14. RDD Persistence
    // lines = RDD object

    // cache() uses the default storage level, MEMORY_ONLY
    lines.cache();

    // other storage levels (a level can be assigned to an RDD only once, so pick one)
    lines.persist(StorageLevel.DISK_ONLY());
    lines.persist(StorageLevel.MEMORY_ONLY());
    lines.persist(StorageLevel.MEMORY_AND_DISK());

    // replication: StorageLevel.apply(flags, replication)
    lines.persist(StorageLevel.apply(1, 3));

  15. Shared variables
    broadcast variables & accumulators
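
The two kinds of shared variables can be approximated on a single machine to show their roles. In this sketch, which is plain Java and not the Spark API, an unmodifiable map plays the part of a broadcast variable (read-only data available to every worker) and a `LongAdder` plays the part of an accumulator (workers only add to it, the caller reads the total at the end). The class `SharedVariablesSketch`, the method `translate`, and the sample codes are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.Collectors;

// Single-machine analogy for broadcast variables and accumulators.
class SharedVariablesSketch {

    // Holds the translated labels and the number of codes that were unknown.
    static final class Result {
        final List<String> labels;
        final long unknownCodes;

        Result(List<String> labels, long unknownCodes) {
            this.labels = labels;
            this.unknownCodes = unknownCodes;
        }
    }

    static Result translate(List<String> codes, Map<String, String> lookup) {
        // "Broadcast": a read-only lookup table shared by all worker threads.
        Map<String, String> readOnly = Collections.unmodifiableMap(new HashMap<>(lookup));
        // "Accumulator": workers only add to it; the total is read once at the end.
        LongAdder unknown = new LongAdder();

        List<String> labels = codes.parallelStream()
                .map(code -> {
                    String label = readOnly.get(code);
                    if (label == null) {
                        unknown.increment();
                        return "?";
                    }
                    return label;
                })
                .collect(Collectors.toList());

        return new Result(labels, unknown.sum());
    }

    public static void main(String[] args) {
        Map<String, String> lookup = new HashMap<>();
        lookup.put("T", "Tennis court");
        lookup.put("P", "Pool");

        Result result = translate(Arrays.asList("T", "P", "X", "T", "Y"), lookup);
        System.out.println(result.labels + ", unknown codes: " + result.unknownCodes);
    }
}
```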

  16. Performance
    Example: logistic regression

  17. Spark Ecosystem
    Spark Streaming
    Spark SQL
