JUC - Processing Data with Hadoop: MapReduce and YARN

CETA-Ciemat
February 11, 2015

1st UEx-CIEMAT Technical Conference. Processing large volumes of data with Hadoop

Transcript

  1. UEx-CIEMAT Technical Conference // 10-12 February 2015
    Processing large volumes of data with
    HADOOP
    César Suárez Ortega
    [email protected]
    Processing data with Hadoop:
    MapReduce and YARN

  2. Software Developer / Researcher
    César Suárez Ortega
    tharandur
    csuarez

  4. Software Engineer / Researcher
    César Suárez Ortega

  6. Outline
    1.
    2.
    3.
    4. Getting the most out of MapReduce
    5.

  7. Introduction to
    MapReduce

  8. Programming model
    for data processing

  10. What is MapReduce?
    MANY

  11. by
    MapReduce: Simplified Data Processing on Large Clusters.
    Dean, J. and Ghemawat, S. (2004)

  12. Learning by example

  13. National Climatic Data Center

  14. 1958
    -0021
    1959
    +0065
    1960

  15. Solutions
    1. Script
    2. Parallelization by year
    3. Parallelization in equal-sized chunks
    4. MapReduce

  16. MapReduce 101
    A function must be defined for each stage.
    MAP REDUCE

  17. Map
    MAP
    <0, 005733213…> //line1
    <160, 006844324…> //line2

  18. Map
    MAP

  19. Map
    MAP
    list(K2, V2)

  20. 1958
    -0021
    1959
    +0065
    1960

  21. Map
    MAP
    <0, 005733213…> //line1
    <160, 006844324…> //line2
    <1958, -21>
    <1959, +65>

  22. Shuffle
    <1958, -21>
    <1959, +65>
    <1958, -34>
    <1959, +28>
    <1958, [-21, -34]>
    <1959, [+65, +28]>

  23. Shuffle
    list(K2, V2)
    list(K2, list(V2))

  24. Reduce
    REDUCE
    list(K2, list(V2))
    list(K3, V3)

  25. Reduce
    REDUCE
    <1958, [-21, -34]>
    <1959, [+65, +28]>
    <1958, -21>
    <1959, +65>

  26. Map & Reduce
    MAP
    list(K2, V2)
    REDUCE
    list(K2, list(V2))
    list(K3, V3)
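
The type flow above (map emits list(K2, V2), shuffle groups it into list(K2, list(V2)), reduce emits list(K3, V3)) can be simulated in plain Java with no Hadoop dependency. This is only a minimal sketch: the class name and the simplified `year,temperature` input lines are assumptions (the real NCDC records are elided in the slides), and the reducer keeps the maximum per year, which matches the values shown on the Reduce slide.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MapReduceSimulation {

    // MAP: (offset, line) -> list(year, temperature); one pair per line in this sketch.
    static List<Map.Entry<Integer, Integer>> map(long offset, String line) {
        String[] parts = line.split(",");
        List<Map.Entry<Integer, Integer>> out = new ArrayList<>();
        out.add(new SimpleEntry<>(Integer.parseInt(parts[0]), Integer.parseInt(parts[1])));
        return out;
    }

    // SHUFFLE: list(K2, V2) -> list(K2, list(V2)), grouped and sorted by key.
    static Map<Integer, List<Integer>> shuffle(List<Map.Entry<Integer, Integer>> pairs) {
        Map<Integer, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<Integer, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // REDUCE: (year, list(temperature)) -> maximum temperature for that year.
    static int reduce(int year, List<Integer> temperatures) {
        return Collections.max(temperatures);
    }

    // Runs the three stages in sequence over the input lines.
    static Map<Integer, Integer> run(List<String> lines) {
        List<Map.Entry<Integer, Integer>> mapped = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            mapped.addAll(map(offset, line));
            offset += line.length() + 1; // start offset of the next line
        }
        Map<Integer, Integer> result = new TreeMap<>();
        shuffle(mapped).forEach((year, temps) -> result.put(year, reduce(year, temps)));
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1958,-21", "1959,65", "1958,-34", "1959,28");
        System.out.println(run(lines)); // prints {1958=-21, 1959=65}
    }
}
```

In a real job only `map` and `reduce` are written by the developer; the grouping done here by `shuffle` is carried out by the framework.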

  27. Map & Reduce
    Job

  30. Exercise 1
    Submitting our
    first job

  31. MapReduce API
    1. Blabla
    2. BlablaMapper
    3. BlablaReducer

  32. FlightsByCarrier
    https://github.com/csuarez/seminario-mapreduce
    flights-by-carrier/

  33. Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
    1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,15,4,729,730,903,849,PS,1451,NA,94,79,NA,14,-1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,17,6,741,730,918,849,PS,1451,NA,97,79,NA,29,11,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,18,7,729,730,847,849,PS,1451,NA,78,79,NA,-2,-1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,19,1,749,730,922,849,PS,1451,NA,93,79,NA,33,19,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,21,3,728,730,848,849,PS,1451,NA,80,79,NA,-1,-2,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,22,4,728,730,852,849,PS,1451,NA,84,79,NA,3,-2,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,23,5,731,730,902,849,PS,1451,NA,91,79,NA,13,1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,24,6,744,730,908,849,PS,1451,NA,84,79,NA,19,14,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,25,7,729,730,851,849,PS,1451,NA,82,79,NA,2,-1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,26,1,735,730,904,849,PS,1451,NA,89,79,NA,15,5,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,28,3,741,725,919,855,PS,1451,NA,98,90,NA,24,16,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,29,4,742,725,906,855,PS,1451,NA,84,90,NA,11,17,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,31,6,726,725,848,855,PS,1451,NA,82,90,NA,-7,1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,1,4,936,915,1035,1001,PS,1451,NA,59,46,NA,34,21,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,2,5,918,915,1017,1001,PS,1451,NA,59,46,NA,16,3,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,3,6,928,915,1037,1001,PS,1451,NA,69,46,NA,36,13,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,4,7,914,915,1003,1001,PS,1451,NA,49,46,NA,2,-1,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,5,1,1042,915,1129,1001,PS,1451,NA,47,46,NA,88,87,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,6,2,934,915,1024,1001,PS,1451,NA,50,46,NA,23,19,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,7,3,946,915,1037,1001,PS,1451,NA,51,46,NA,36,31,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,8,4,932,915,1033,1001,PS,1451,NA,61,46,NA,32,17,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,9,5,947,915,1036,1001,PS,1451,NA,49,46,NA,35,32,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,10,6,915,915,1022,1001,PS,1451,NA,67,46,NA,21,0,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,11,7,916,915,1006,1001,PS,1451,NA,50,46,NA,5,1,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,12,1,944,915,1027,1001,PS,1451,NA,43,46,NA,26,29,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,13,2,941,915,1036,1001,PS,1451,NA,55,46,NA,35,26,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,14,3,930,915,1029,1001,PS,1451,NA,59,46,NA,28,15,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,15,4,920,915,1023,1001,PS,1451,NA,63,46,NA,22,5,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,17,6,1009,915,1104,1001,PS,1451,NA,55,46,NA,63,54,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,18,7,915,915,1008,1001,PS,1451,NA,53,46,NA,7,0,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,19,1,940,915,1032,1001,PS,1451,NA,52,46,NA,31,25,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,21,3,913,915,1003,1001,PS,1451,NA,50,46,NA,2,-2,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,22,4,915,915,1017,1001,PS,1451,NA,62,46,NA,16,0,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA

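  34. import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class FlightsByCarrier {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            job.setJarByClass(FlightsByCarrier.class);
            job.setJobName("FlightsByCarrier");
            // Input: args[0], read line by line
            TextInputFormat.addInputPath(job, new Path(args[0]));
            job.setInputFormatClass(TextInputFormat.class);
            // Output: args[1], written as <carrier, count> text lines
            TextOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setOutputFormatClass(TextOutputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setMapperClass(FlightsByCarrierMapper.class);
            job.setReducerClass(FlightsByCarrierReducer.class);
            // The opencsv JAR must already be in HDFS; it is added to every task's classpath
            job.addFileToClassPath(new Path("/user/root/opencsv-2.3.jar"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }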

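  35. // Mapper
    import java.io.IOException;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.Mapper;
    import au.com.bytecode.opencsv.CSVParser;

    public class FlightsByCarrierMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // The key is the byte offset of the line, so offset 0 is the CSV header: skip it
            if (key.get() > 0) {
                String[] fields = new CSVParser().parseLine(value.toString());
                // Column 8 is UniqueCarrier: emit <carrier, 1>
                context.write(new Text(fields[8]), new IntWritable(1));
            }
        }
    }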

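  36. // Reducer
    import java.io.IOException;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.Reducer;

    public class FlightsByCarrierReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text token, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            // Add up all the 1s emitted for this carrier
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(token, new IntWritable(sum));
        }
    }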
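
To check the logic of the mapper and reducer above without a cluster, the same computation can be sketched in plain Java: take column 8 (`UniqueCarrier`) of every data row, count 1 for it, and sum per carrier. The class name is hypothetical, the sample rows are truncated to their first ten columns, and a plain `split(",")` stands in for opencsv's `CSVParser` on the assumption that no field contains a quoted comma.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FlightsByCarrierLocal {

    static final int UNIQUE_CARRIER = 8; // same column index the mapper uses

    // Counts flights per carrier; the first line is treated as the CSV header.
    static Map<String, Integer> count(List<String> csvLines) {
        Map<String, Integer> flights = new TreeMap<>();
        boolean header = true;
        for (String line : csvLines) {
            if (header) {
                header = false; // skip the header row
                continue;
            }
            String[] fields = line.split(",");
            // "map" emits <carrier, 1>; "reduce" sums the 1s per carrier
            flights.merge(fields[UNIQUE_CARRIER], 1, Integer::sum);
        }
        return flights;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum",
            "1987,10,14,3,741,730,912,849,PS,1451",
            "1987,10,15,4,729,730,903,849,PS,1451");
        System.out.println(count(lines)); // prints {PS=2}
    }
}
```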

  37. Execution
    $ git clone https://github.com/csuarez/seminario-mapreduce.git
    [...]
    $ tar xvzf 1987.tar.gz
    $ hdfs dfs -copyFromLocal lib/opencsv-2.3.jar /user/root
    $ hdfs dfs -copyFromLocal 1987.csv /user/root
    $ sh build.sh
    $ hadoop jar FlightsByCarrier.jar FlightsByCarrier /user/root/1987.csv \
        /user/root/output/flightsCount
    $ hdfs dfs -cat /user/root/output/flightsCount/part-r-00000

  38. FlightsByCarrier output
    INFO mapreduce.JobSubmitter: number of splits:2
    ...
    INFO mapreduce.Job: map 0% reduce 0%
    INFO mapreduce.Job: map 22% reduce 0%
    INFO mapreduce.Job: map 41% reduce 0%
    INFO mapreduce.Job: map 83% reduce 0%
    INFO mapreduce.Job: map 100% reduce 0%
    INFO mapreduce.Job: map 100% reduce 100%
    ...
    Job Counters
    Launched map tasks=2
    Launched reduce tasks=1
    Rack-local map tasks=2
    Total time spent by all maps in occupied slots (ms)=42442
    Total time spent by all reduces in occupied slots (ms)=13465

  39. MapReduce
    How it works

  40. MapReduce v1

  42. Actors
    1.
    2.
    3.
    4.

  43. MapReduce execution

  44. Step 1: Job Submission

  45. Step 2: Job Initialization

  46. Step 3: Task Assignment

  47. Step 4: Task Execution

  48. Step 5: Progress & Status

  49. Step 6: Job Completion

  50. MapReduce v2
    +
    YARN

  51. Why YARN?

  53. YARN: Actors
    1.
    2.
    3.
    4.
    5.

  54. vs. MapReduce v1: Actors
    1.
    2.
    3.
    4.
    1. YARN Resource Manager
    2. YARN Node Manager
    3. Application Master

  56. MapReduce v1
    Layers: API, Processing FW, Resource Manager, Storage
    Components: MR, MapReduce v1, HDFS, PIG, HIVE, HBASE

  57. YARN
    Layers: API, Processing FW, Resource Manager, Storage
    Components: MR, YARN, HDFS, PIG, STORM, MR v2, TEZ, MPI

  59. Step 1: Job Submission

  61. Step 2: Job Initialization

  62. Step 2**: Uber Mode

  64. Step 3: Task Assignment

  66. Step 4: Task Execution

  68. Step 5: Progress & Status

  70. Getting the most out of
    MapReduce

  71. Apache Pig

  72. Apache Pig

  73. records = LOAD '1987.csv' USING PigStorage(',') AS
        (Year, Month, DayofMonth, DayOfWeek, DepTime, CRSDepTime, ArrTime, CRSArrTime,
         UniqueCarrier, FlightNum, TailNum, ActualElapsedTime, CRSElapsedTime, AirTime,
         ArrDelay, DepDelay, Origin, Dest, Distance:int, TaxiIn, TaxiOut, Cancelled,
         CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay,
         LateAircraftDelay);
    milage_recs = GROUP records ALL;
    tot_miles = FOREACH milage_recs GENERATE SUM(records.Distance);
    STORE tot_miles INTO '/user/root/totalmiles';

    https://github.com/csuarez/seminario-mapreduce
    $ cd pig-total-miles/
    $ pig totalmiles.pig
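
For comparison, the Pig script above boils down to summing one column. A minimal plain-Java sketch of the same computation follows; the class name is hypothetical, column index 18 is `Distance` in the header shown earlier, and skipping values that do not parse as integers is an assumption meant to mirror how Pig's `SUM` ignores nulls (such as the header row loaded as `Distance:int`).

```java
import java.util.Arrays;
import java.util.List;

class TotalMilesLocal {

    static final int DISTANCE = 18; // position of the Distance column in the CSV header

    // Sums the Distance column over all rows, skipping values that are not integers.
    static long totalMiles(List<String> csvLines) {
        long total = 0;
        for (String line : csvLines) {
            String[] fields = line.split(",");
            if (fields.length <= DISTANCE) {
                continue; // malformed or truncated row
            }
            try {
                total += Integer.parseInt(fields[DISTANCE]);
            } catch (NumberFormatException e) {
                // header row or "NA": ignored
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance",
            "1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA",
            "1987,10,1,4,936,915,1035,1001,PS,1451,NA,59,46,NA,34,21,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA");
        System.out.println(totalMiles(lines)); // prints 639
    }
}
```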

  74. Hadoop
    Streaming

  75. $ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input myInputDirs \
        -output myOutputDir \
        -mapper myPythonScript.py \
        -reducer /bin/wc \
        -file myPythonScript.py
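
With Hadoop Streaming, the mapper is any program that reads input lines from standard input and writes `key<TAB>value` lines to standard output. The contents of `myPythonScript.py` are not shown in the slides, so the sketch below only illustrates that contract, written in Java for consistency with the rest of the deck. The class name is hypothetical and the choice of column 8 (the carrier code of the flights dataset) as the key is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class StreamingMapperSketch {

    // Turns one CSV input line into one "key<TAB>value" output line.
    static String mapLine(String line) {
        String[] fields = line.split(",");
        String carrier = fields.length > 8 ? fields[8] : "UNKNOWN";
        return carrier + "\t1";
    }

    public static void main(String[] args) throws IOException {
        // Streaming feeds the input split through stdin, one record per line
        BufferedReader in = new BufferedReader(
            new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(mapLine(line));
        }
    }
}
```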

  78. Exercise 2
    MapReduce Job

  79. GreekGodCounter (Standard)
    https://github.com/csuarez/seminario-mapreduce
    greek-god-counter-standard/

  80. private final static String[] gods = {
        "Zeus",
        "Hera",
        "Poseidón",
        "Dioniso",
        "Apolo",
        "Artemisa",
        "Hermes",
        "Atenea",
        "Ares",
        "Afrodita",
        "Hefesto",
        "Deméter"
    };

  81. // Initialize the counter of every god to zero
    Map<String, Integer> godMap = new HashMap<>();
    for (String god : gods) {
        godMap.put(god, 0);
    }
    try {
        // Read the input file line by line
        BufferedReader br = new BufferedReader(new FileReader(args[0]));
        String line = br.readLine();
        while (line != null) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                String token = tokenizer.nextToken();
                if (godMap.containsKey(token)) {
                    godMap.put(token, godMap.get(token) + 1);
                }
            }
            line = br.readLine();
        }
        br.close();
        // Write one "god = count" line per god
        Writer writer = new BufferedWriter(new FileWriter("gods.txt"));
        for (Entry<String, Integer> entry : godMap.entrySet()) {
            writer.write(entry.getKey() + " = " + entry.getValue());
            writer.write(System.lineSeparator());
        }
        writer.close();
    } catch (IOException e) {
        e.printStackTrace();
    }

  82. TASK!!!

  83. GreekGodCounter (MapReduce)
    https://github.com/csuarez/seminario-mapreduce
    greek-god-counter-mapreduce/

  84. import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.filecache.DistributedCache;

    public class GreekGodCounterMapReduce {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            job.setJarByClass(GreekGodCounterMapReduce.class);
            job.setJobName("GreekGodCounterMapReduce");
            TextInputFormat.addInputPath(job, new Path(args[0]));
            job.setInputFormatClass(TextInputFormat.class);
            job.setMapperClass(GreekGodCounterMapReduceMapper.class);
            job.setReducerClass(GreekGodCounterMapReduceReducer.class);
            TextOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setOutputFormatClass(TextOutputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.waitForCompletion(true);
        }
    }

  85. import java.io.IOException;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.Mapper;
    import java.util.HashMap;
    import java.util.Map.Entry;
    import java.util.StringTokenizer;

    public class GreekGodCounterMapReduceMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static String[] gods = {
            "Zeus",
            "Hera",
            "Poseidón",
            "Dioniso",
            "Apolo",
            "Artemisa",
            "Hermes",
            "Atenea",
            "Ares",
            "Afrodita",
            "Hefesto",
            "Deméter"
        };

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Exercise: to be implemented
        }
    }

  86. import java.io.IOException;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.Reducer;

    public class GreekGodCounterMapReduceReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text token, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            // Exercise: to be implemented
        }
    }
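
The empty `map` and `reduce` bodies above are the exercise. One possible approach, sketched here in plain Java so it runs without Hadoop, is for the map step to emit a god's name for every matching token and for the reduce step to sum the occurrences per name. This is not the author's solution: the class name, the helper methods, and the sample sentences are hypothetical, and unlike the standard version, gods that never appear produce no output line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeMap;

class GreekGodCounterSketch {

    static final Set<String> GODS = new HashSet<>(Arrays.asList(
        "Zeus", "Hera", "Poseidón", "Dioniso", "Apolo", "Artemisa",
        "Hermes", "Atenea", "Ares", "Afrodita", "Hefesto", "Deméter"));

    // MAP: one line -> the god names found in it (each stands for a <god, 1> pair).
    static List<String> map(String line) {
        List<String> emitted = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            if (GODS.contains(token)) {
                emitted.add(token);
            }
        }
        return emitted;
    }

    // REDUCE: list of counts for one god -> their sum.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Simulates map, shuffle (grouping by god), and reduce over the input lines.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (String god : map(line)) {
                grouped.computeIfAbsent(god, k -> new ArrayList<>()).add(1);
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((god, counts) -> result.put(god, reduce(counts)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("Zeus y Hera", "Zeus lanza un rayo")));
        // prints {Hera=1, Zeus=2}
    }
}
```

In the Hadoop version, each name returned by `map` would correspond to a `context.write(new Text(god), new IntWritable(1))` call, and `reduce` would iterate over the `Iterable<IntWritable>` it receives.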

  87. Resources

  88. Resources

  89. Thank you!
    Any questions?
    [email protected]