Brown Bag Lunch : Hadoop + Spark

BBL HADOOP / SPARK Vincent Heuschling 12/12/2016

AGENDA Affini-Tech Hadoop Spark

3 Imagine Experiment Build platforms Data

4 Imagine: What can you do with your data, and for which use case? What do these technologies change?
Experiment: Validate in a few days. Move from intuition to conviction.
Execute: Build and run operational platforms & data lakes.
HELPING OUR CLIENTS SIMPLIFY THEIR DATA TO USE IT BETTER

7 DATALAKE [Diagram labels: STORE, COMPUTE, EXPLORE & EXPERIMENT, COLLECT, TRANSFORM & PROCESS, EXPOSE (WEBSERVICE & SQL), INDEX & CATALOG]

HADOOP History How it works Ecosystem / Distributions

10 STORE: HDFS (distributed file system)
PROCESS: MAP/REDUCE & YARN
APPLICATIONS: TOOLS and FRAMEWORKS (Load, Transform, Query, Machine Learning)
COMMUNITY: APACHE

11 HDFS [Diagram: a file split into blocks a to i; each block is stored three times across the cluster nodes]
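
The block layout on this slide can be mimicked with a small sketch. It is a toy model, not HDFS's real placement policy: the function names `place_blocks` and `surviving_blocks`, the node names, and the round-robin placement are assumptions made for illustration. The replication factor of 3 matches the diagram, where each block appears three times.

```python
from collections import defaultdict

def place_blocks(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Toy placement only: real HDFS also considers racks and free space.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = defaultdict(list)  # node -> blocks stored on it
    for i, block in enumerate(blocks):
        for r in range(replication):
            node = nodes[(i + r) % len(nodes)]
            placement[node].append(block)
    return dict(placement)

def surviving_blocks(placement, failed_node):
    """Blocks still readable after one node is lost."""
    return {b for node, bs in placement.items() if node != failed_node for b in bs}

blocks = list("abcdefghi")                      # the nine blocks of the slide
nodes = ["node1", "node2", "node3", "node4"]    # hypothetical node names
placement = place_blocks(blocks, nodes)
```

With three replicas on distinct nodes, losing any single node still leaves every block readable, which `surviving_blocks(placement, "node2")` lets you check.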

12 HDFS: EXTENSION

13 HDFS: EXTENSION

14 MAP-REDUCE [Diagram: the same replicated block layout as on the HDFS slide, blocks a to i]

15 MAP-REDUCE
Map function: output (word : 1)
Reduce function: output (word : sum(1))
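
The two functions above are the classic word count. A minimal single-process sketch, assuming a plain list of lines as input: the names `map_function`, `reduce_function` and `word_count` are illustrative, and the grouping step in the middle stands in for the shuffle that the framework performs in Hadoop.

```python
from collections import defaultdict

def map_function(line):
    """Map: emit (word, 1) for every word of a line."""
    return [(word, 1) for word in line.split()]

def reduce_function(word, ones):
    """Reduce: sum the 1s emitted for a word."""
    return word, sum(ones)

def word_count(lines):
    # Map phase: one list of (word, 1) pairs for all lines
    pairs = [pair for line in lines for pair in map_function(line)]
    # Shuffle: group the values by key
    grouped = defaultdict(list)
    for word, one in pairs:
        grouped[word].append(one)
    # Reduce phase: one call per distinct word
    return dict(reduce_function(w, ones) for w, ones in sorted(grouped.items()))

counts = word_count(["a b a", "b c"])
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```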

16 MAP-REDUCE

Input:
First name  Last name  Age
Vincent     H          38
Marc        T          26
François    K          40
Vincent     O          26
Frederic    P          41
Marc        N          32

Map (Key, Value): François 40, Frederic 41, Marc 26, Marc 32, Vincent 38, Vincent 26
Sort & Shuffle (Key, Value): François [40], Frederic [41], Marc [26,32], Vincent [38,26]
Reduce (Key, Value), average age per first name: François 40, Frederic 41, Marc 29, Vincent 32
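
The three stages of this slide can be replayed on the same six rows. This is a single-process sketch in plain Python, not Hadoop code; the variable names are illustrative.

```python
from itertools import groupby
from operator import itemgetter

people = [
    ("Vincent", "H", 38), ("Marc", "T", 26), ("François", "K", 40),
    ("Vincent", "O", 26), ("Frederic", "P", 41), ("Marc", "N", 32),
]

# Map: keep only (first name, age)
mapped = [(first, age) for first, _last, age in people]

# Sort & shuffle: sort by key, then gather the values of each key
mapped.sort(key=itemgetter(0))
shuffled = {key: [age for _, age in group]
            for key, group in groupby(mapped, key=itemgetter(0))}

# Reduce: average the ages of each first name
averages = {key: sum(ages) / len(ages) for key, ages in shuffled.items()}
print(averages)
# {'François': 40.0, 'Frederic': 41.0, 'Marc': 29.0, 'Vincent': 32.0}
```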

17 MAP-REDUCE
Input 1: Vincent H 38, Marc T 26, François K 40
Input 2: Vincent O 26, Frederic P 41, Marc N 32
Map 1: Vincent 38, Marc 26, François 40
Map 2: Vincent 26, Frederic 41, Marc 32
Sort & Shuffle: François 40, Frederic 41, Marc 26, Marc 32, Vincent 26, Vincent 38
Combine: Vincent [38,26], Marc [26,32], Frederic [41], François [40]
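
One point worth sketching when several mappers work in parallel: an average cannot be pre-aggregated by averaging partial averages. A common workaround, shown here as an assumption rather than as what the slide implements, is to have each mapper emit partial (sum, count) pairs and merge them at the end. The two input splits follow the two "Map" boxes of the slide; `combine` and `reduce_partials` are hypothetical names.

```python
from collections import defaultdict

split_1 = [("Vincent", 38), ("Marc", 26), ("François", 40)]
split_2 = [("Vincent", 26), ("Frederic", 41), ("Marc", 32)]

def combine(pairs):
    """Local pre-aggregation on one mapper: name -> (sum of ages, count)."""
    partial = defaultdict(lambda: (0, 0))
    for name, age in pairs:
        total, count = partial[name]
        partial[name] = (total + age, count + 1)
    return dict(partial)

def reduce_partials(partials):
    """Merge the partial (sum, count) pairs, then compute each average."""
    merged = defaultdict(lambda: (0, 0))
    for partial in partials:
        for name, (total, count) in partial.items():
            t, c = merged[name]
            merged[name] = (t + total, c + count)
    return {name: total / count for name, (total, count) in merged.items()}

averages = reduce_partials([combine(split_1), combine(split_2)])
```

The result matches the single-mapper version (Marc 29, Vincent 32), while each mapper only ships one pair per name over the network.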

18 HADOOP V1 [Diagram: two Clients submit jobs to one Job Tracker, which drives three Task Trackers running six Tasks]

19 HADOOP V2: YARN [Diagram: two Clients, one Resource Manager, three Node Managers hosting two Masters and seven Containers]

20 CONSTRAINTS [Chart: constraint over time] Map: disk I/O. Shuffle: network. Reduce: memory.

21 PERFORMANCE [Chart labels: Volume, Performance, SQL MPP, Variety]

22 COMPONENTS

SPARK History Key points Demo

24 HISTORY
2009: AMPLab project at Berkeley
2013: Most active 'data' project at Apache
2014: Version 1; GraySort record: 100 TB sorted in 23 minutes on 206 nodes
2016: Version 2

25 KEY POINTS
- Uses memory to store and process data
- Written in Scala
- Usable from Scala, Python, Java, R
- "Full stack": core + SQL + streaming + MLlib + GraphX

26 FUNCTIONAL

    import sys
    from pyspark import SparkContext

    if __name__ == "__main__":
        # Run locally; "WordCount" is the application name
        sc = SparkContext("local", "WordCount")
        # argv[1]: input path, argv[2]: output directory
        lines = sc.textFile(sys.argv[1])
        counts = lines.flatMap(lambda s: s.split(" ")) \
            .map(lambda word: (word, 1)) \
            .reduceByKey(lambda x, y: x + y)
        counts.saveAsTextFile(sys.argv[2])

27 RDD [Diagram: RDD 1, RDD 2, RDD, Value]
RDD (Resilient Distributed Dataset):
- Collection of objects distributed across the cluster, in RAM or on disk (Distributed)
- Fault tolerant (Resilient) thanks to lineage
Built through parallel operations
Lazy
2 types of operations on RDDs:
- Transformations: create a new RDD
- Actions: return values
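
The difference between transformations and actions, and the role of lineage, can be illustrated without a cluster. The class below is a toy stand-in written in plain Python, not Spark's implementation or API: `ToyRDD` and its internals are hypothetical, and only the method names `map`, `filter` and `collect` mirror real RDD operations.

```python
class ToyRDD:
    """Toy stand-in for an RDD: a source plus a lineage of transformations."""

    def __init__(self, source, lineage=()):
        self._source = source      # original data
        self._lineage = lineage    # tuple of (kind, function) steps

    # Transformations: return a new ToyRDD and compute nothing
    def map(self, fn):
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._source, self._lineage + (("filter", fn),))

    # Action: replay the lineage on the source and return values
    def collect(self):
        data = iter(self._source)
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)

calls = []

def double(x):
    calls.append(x)   # records when the function really runs
    return 2 * x

rdd1 = ToyRDD([1, 2, 3, 4])
rdd2 = rdd1.map(double).filter(lambda x: x > 4)
assert calls == []          # lazy: nothing has run yet
result = rdd2.collect()     # the action triggers the computation
print(result)  # [6, 8]
```

Because `rdd2` keeps only the source and the list of steps, a lost result can be rebuilt by calling `collect()` again, which is the idea behind resilience through lineage.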

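29 DATAFRAMES

    import spark.implicits._  // enables the $"column" syntax

    df.select("name").show()
    // Name, and age incremented by 1
    df.select($"name", $"age" + 1).show()
    // Rows where age is over 21
    df.filter($"age" > 21).show()
    // Number of rows per age
    df.groupBy("age").count().show()
    // Register the DataFrame as a temporary view, then query it in SQL
    df.createOrReplaceTempView("people")
    val sqlDF = spark.sql("SELECT * FROM people")
    sqlDF.show()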

30 DEMO http://spark-notebook.io/

28 DATAFRAMES

    val df = spark.read.json("people.json")
    df.show()
    // +----+-------+ //
