
Spark - Alexis Seigneurin - English

Spark presentation in English

Alexis Seigneurin

January 21, 2015


Transcript

  1. Spark • Processing of large volumes of data • Distributed processing on commodity hardware • Written in Scala, with a Java binding
  2. History • 2009: AMPLab, UC Berkeley • June 2013: "Top-level project" of the Apache Foundation • May 2014: version 1.0.0 • Currently: version 1.2.0
  3. Use cases • Log analysis • Processing of text files • Analytics • Distributed search (as Google once did) • Fraud detection • Product recommendation
  4. Proximity with Hadoop • Same use cases • Same development model: MapReduce • Integration with the ecosystem
  5. Simpler than Hadoop • A simpler API to learn • "Relaxed" MapReduce • Spark Shell: interactive processing
  6. Faster than Hadoop • "Spark officially sets a new record in large-scale sorting" (5 November 2014) • Sorting 100 TB of data: ◦ Hadoop MR: 72 minutes, with 2,100 nodes (50,400 cores) ◦ Spark: 23 minutes, with 206 nodes (6,592 cores)
  7. RDD • Resilient Distributed Dataset • Abstraction of a collection processed in parallel • Fault tolerant • Can work with tuples: ◦ key-value pairs ◦ tuples must be independent from each other
  8. Sources • Files on HDFS • Local files • Collection in memory • Amazon S3 • NoSQL database • ... • Or a custom implementation of InputFormat
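A minimal sketch of creating RDDs from two of these sources with the Java API (the file is the CSV used later in the deck; the in-memory values are made up):

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    JavaSparkContext sc = new JavaSparkContext("local", "sources");
    // RDD from a text file (an hdfs:// or s3n:// URL works the same way)
    JavaRDD<String> lines = sc.textFile("data/arbresalignementparis2010.csv");
    // RDD from an in-memory collection
    JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4));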
  9. Transformations • Processes an RDD, returns another RDD • Lazy! • Examples: ◦ map(): one value → another value ◦ mapToPair(): one value → a tuple ◦ filter(): filters values/tuples given a condition ◦ groupByKey(): groups values by key ◦ reduceByKey(): aggregates values by key ◦ join(), cogroup()...: joins two RDDs
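A minimal sketch of the lazy behavior, reusing the hypothetical lines RDD from above:

    // Nothing runs here: transformations only record the execution graph
    JavaRDD<String> longLines = lines.filter(line -> line.length() > 80);
    JavaRDD<Integer> lengths = longLines.map(line -> line.length());
    // Only an action (see the next slide) triggers the actual computation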
  10. Actions • Do not return an RDD • Examples: ◦ count(): counts values/tuples ◦ saveAsHadoopFile(): saves results in Hadoop's format ◦ foreach(): applies a function on each item ◦ collect(): retrieves values in a list (List<T>)
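Continuing the sketch above, the actions are what trigger the computation:

    import java.util.List;

    long n = lengths.count();                        // number of values
    List<Integer> values = lengths.collect();        // all values, back on the driver
    lengths.foreach(len -> System.out.println(len)); // applied to each item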
  11. Spark - Example • Trees of Paris: CSV file, Open Data • Count of trees by species

    geom_x_y;circonfere;adresse;hauteurenm;espece;varieteouc;dateplanta
    48.8648454814, 2.3094155344;140.0;COURS ALBERT 1ER;10.0;Aesculus hippocastanum;;
    48.8782668139, 2.29806967519;100.0;PLACE DES TERNES;15.0;Tilia platyphyllos;;
    48.889306184, 2.30400164126;38.0;BOULEVARD MALESHERBES;0.0;Platanus x hispanica;;
    48.8599934405, 2.29504883623;65.0;QUAI BRANLY;10.0;Paulownia tomentosa;;1996-02-29
    ...
  12. Spark - Example

    JavaSparkContext sc = new JavaSparkContext("local", "arbres");
    sc.textFile("data/arbresalignementparis2010.csv")
        .filter(line -> !line.startsWith("geom"))
        .map(line -> line.split(";"))
        .mapToPair(fields -> new Tuple2<String, Integer>(fields[4], 1))
        .reduceByKey((x, y) -> x + y)
        .sortByKey()
        .foreach(t -> System.out.println(t._1 + " : " + t._2));

    (diagram: tuples flowing through textFile → filter → map → mapToPair → reduceByKey → sortByKey → foreach)
  13. Spark - Example

    Acacia dealbata : 2
    Acer acerifolius : 39
    Acer buergerianum : 14
    Acer campestre : 452
    ...
  14. Topology & Terminology • One master / several workers ◦ (+ one standby master) • Submit an application to the cluster • Execution managed by a driver
  15. Spark in a cluster Several options: • YARN • Mesos • Standalone ◦ Workers started manually ◦ Workers started by the master
  16. Storage & Processing • MapReduce: ◦ Spark (API) ◦ Distributed processing ◦ Fault tolerant • Storage: ◦ HDFS, NoSQL databases... ◦ Distributed storage ◦ Fault tolerant
  17. Data locality (diagram: each Spark Worker is co-located with an HDFS Datanode; the Spark Master with the HDFS Namenode, plus a standby Namenode and a standby Master)
  18. Demo

    $ $SPARK_HOME/sbin/start-master.sh
    $ $SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker \
        spark://MBP-de-Alexis:7077 --cores 2 --memory 2G
    $ mvn clean package
    $ $SPARK_HOME/bin/spark-submit \
        --master spark://MBP-de-Alexis:7077 \
        --class com.seigneurin.spark.WikipediaMapReduceByKey \
        --deploy-mode cluster \
        target/pres-spark-0.0.1-SNAPSHOT.jar
  19. Spark SQL • Querying RDDs in SQL • SQL engine: converts SQL statements into low-level instructions
  20. Spark SQL Prerequisites: • Use tabular data • Describe the schema → SchemaRDD Describing the schema: • Programmatic description of the data • Schema inference through reflection (POJO)
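A minimal sketch of the reflection path, assuming the Spark 1.2-era JavaSQLContext API and a hypothetical Tree JavaBean (the programmatic path is shown on the next slides):

    // Hypothetical JavaBean; the schema is inferred from its getters
    public static class Tree implements java.io.Serializable {
        private float hauteurenm;
        private String espece;
        public float getHauteurenm() { return hauteurenm; }
        public void setHauteurenm(float h) { this.hauteurenm = h; }
        public String getEspece() { return espece; }
        public void setEspece(String e) { this.espece = e; }
    }

    JavaRDD<Tree> treeBeans = sc.textFile("data/arbresalignementparis2010.csv")
        .filter(line -> !line.startsWith("geom"))
        .map(line -> line.split(";"))
        .map(fields -> {
            Tree t = new Tree();
            t.setHauteurenm(Float.parseFloat(fields[3]));
            t.setEspece(fields[4]);
            return t;
        });
    JavaSchemaRDD schemaRDD = sqlContext.applySchema(treeBeans, Tree.class);
    schemaRDD.registerTempTable("tree");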
  21. Spark SQL - Example • Creating tabular data (type Row)

    JavaRDD<Row> rdd = trees.map(fields -> Row.create(
        Float.parseFloat(fields[3]), fields[4]));

    ---------------------------------------
    | 10.0 | Aesculus hippocastanum |
    | 15.0 | Tilia platyphyllos |
    | 0.0  | Platanus x hispanica |
    | 10.0 | Paulownia tomentosa |
    | ...  | ... |
  22. Spark SQL - Example • Describing the schema

    List<StructField> fields = new ArrayList<StructField>();
    fields.add(DataType.createStructField("hauteurenm", DataType.FloatType, false));
    fields.add(DataType.createStructField("espece", DataType.StringType, false));
    StructType schema = DataType.createStructType(fields);
    JavaSchemaRDD schemaRDD = sqlContext.applySchema(rdd, schema);
    schemaRDD.registerTempTable("tree");

    ---------------------------------------
    | hauteurenm | espece |
    ---------------------------------------
    | 10.0 | Aesculus hippocastanum |
    | 15.0 | Tilia platyphyllos |
    | 0.0  | Platanus x hispanica |
    | 10.0 | Paulownia tomentosa |
    | ...  | ... |
  23. Spark SQL - Example • Counting trees by species

    sqlContext.sql("SELECT espece, COUNT(*) FROM tree WHERE espece <> '' " +
            "GROUP BY espece ORDER BY espece")
        .foreach(row -> System.out.println(row.getString(0) + " : " + row.getLong(1)));

    Acacia dealbata : 2
    Acer acerifolius : 39
    Acer buergerianum : 14
    Acer campestre : 452
    ...
  24. Window operations • Sliding window • Reuses data from other windows • Initialized with a window length and a slide interval
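A minimal sketch, assuming an existing JavaDStream<String> named tweets (the 60 s window and 10 s slide are arbitrary values):

    import org.apache.spark.streaming.Duration;
    import org.apache.spark.streaming.api.java.JavaDStream;

    // Every 10 seconds, process the last 60 seconds of data
    JavaDStream<String> windowed = tweets.window(
            new Duration(60 * 1000),  // window length
            new Duration(10 * 1000)); // slide interval
    windowed.count().print();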
  25. Sources • Socket • Kafka • Flume • HDFS • MQ (ZeroMQ...) • Twitter • ... • Or a custom implementation of Receiver
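For instance, a minimal sketch with the socket source (host, port, and the 2-second batch interval are made up):

    import org.apache.spark.streaming.Duration;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    // One micro-batch every 2 seconds
    JavaStreamingContext ssc = new JavaStreamingContext("local[2]", "demo", new Duration(2000));
    JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);
    lines.print();
    ssc.start();
    ssc.awaitTermination();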
  26. Spark Streaming Demo • Receive tweets with hashtag #Android ◦ Twitter4J • Detect the language of each tweet ◦ Language Detection • Indexing with Elasticsearch • Reporting with Kibana 4
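A minimal sketch of the receiving side, assuming the spark-streaming-twitter module with Twitter4J credentials provided as system properties; the language-detection and Elasticsearch steps are only marked by a comment:

    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.twitter.TwitterUtils;
    import twitter4j.Status;

    String[] filters = { "#Android" };
    JavaDStream<Status> tweets = TwitterUtils.createStream(ssc, filters);
    tweets.map(status -> status.getText())
          // here: detect the language, then index the document into Elasticsearch
          .print();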
  27. Demo • Launch Elasticsearch:

    $ curl -X DELETE localhost:9200
    $ curl -X PUT localhost:9200/spark/_mapping/tweets -d '{
        "tweets": {
          "properties": {
            "user": {"type": "string", "index": "not_analyzed"},
            "text": {"type": "string"},
            "createdAt": {"type": "date", "format": "date_time"},
            "language": {"type": "string", "index": "not_analyzed"}
          }
        }
      }'

    • Launch Kibana → http://localhost:5601
    • Launch the Spark Streaming process