Processing data with Hadoop: MapReduce and YARN

Talk given at the Jornadas Técnicas Uex-CIEMAT (Uex-CIEMAT Technical Workshops): "Processing large volumes of data with Hadoop"

**NOTE** Some slides appear cropped in the viewer. The downloadable version displays correctly.

César Suárez Ortega

February 11, 2015

Transcript

  1. Jornadas Técnicas Uex-CIEMAT // 10-12 February 2015
    Processing large volumes of data with
    HADOOP
    César Suárez Ortega
    [email protected]
    Processing data with Hadoop:
    MapReduce and YARN

  2. Software Developer / Researcher
    César Suárez Ortega
    tharandur
    csuarez

  4. Software Engineer / Researcher
    César Suárez Ortega

  6. Contents
    4.  Getting the most out of MapReduce

  7. Introduction to MapReduce

  8. A programming model for data processing

  10. What is MapReduce?
    MANY

  11. MapReduce: Simplified Data Processing on Large
    Clusters. Dean J. and Ghemawat S. (2004)

  12. Learning by example

  13. National Climatic Data Center

  14. 1958  -0021
    1959  +0065
    1960  +0054

  15. Solutions
    1.  Script
    2.  Parallelization by year
    3.  Parallelization in equal-sized chunks
    4.  MapReduce

  16. MapReduce 101
    A function must be defined for each stage.
    MAP REDUCE

  17. Map
    MAP
    <0, 005733213…> //line1
    <160, 006844324…> //line2

  19. Map
    MAP
    list(K2, V2)

  20. 1958  -0021
    1959  +0065
    1960  +0054

  21. Map
    MAP
    <0, 005733213…> //line1
    <160, 006844324…> //line2
    <1958, -21>
    <1959, +65>

  22. Shuffle
    <1958, -21>
    <1959, +65>
    <1958, -34>
    <1959, +28>
    <1958, [-21, -34]>
    <1959, [+65, +28]>

  23. Shuffle
    list(K2, V2)
    list(K2, list(V2))

  24. Reduce
    REDUCE
    list(K2, list(V2))
    list(K3, V3)

  25. Reduce
    REDUCE
    <1958, [-21, -34]>
    <1959, [+65, +28]>
    <1958, -21>
    <1959, +65>

  26. Map & Reduce
    REDUCE
    list(K2, list(V2))
    list(K3, V3)
    MAP
    list(K2, V2)

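The map, shuffle and reduce stages above can be sketched in plain Java, without any Hadoop classes. This is only an illustrative simulation: the `year,temperature` record format, the class name `MaxTemperatureSketch` and the choice of the maximum as the reduce function are assumptions (the sample values on the slides are consistent with a per-year maximum, but the slides do not state it).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of map -> shuffle -> reduce; no Hadoop involved.
class MaxTemperatureSketch {

    // Map: (K1 = offset, V1 = line) -> (K2 = year, V2 = temperature).
    // Assumes a simplified "year,temperature" record, not the real NCDC layout.
    static Map.Entry<String, Integer> map(long offset, String line) {
        String[] fields = line.split(",");
        return new AbstractMap.SimpleEntry<>(fields[0], Integer.parseInt(fields[1].trim()));
    }

    // Shuffle: list(K2, V2) -> list(K2, list(V2)), grouped and sorted by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: (K2, list(V2)) -> V3. Assumed here to be the maximum temperature.
    static int reduce(String year, List<Integer> temperatures) {
        int max = Integer.MIN_VALUE;
        for (int t : temperatures) {
            max = Math.max(max, t);
        }
        return max;
    }

    // Runs the three stages in sequence over in-memory lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            mapped.add(map(offset, line));
            offset += line.length() + 1; // byte offset of the next line (ASCII input)
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(mapped).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1958,-21", "1959,+65", "1958,-34", "1959,+28");
        System.out.println(run(lines)); // prints {1958=-21, 1959=65}
    }
}
```

In a real job the framework performs the shuffle itself; only the map and reduce functions are written by the user.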

  27. Map & Reduce
    Job

  28. [Diagram: split 0, split 1, split 2 -> map, map, map -> reduce -> output (HDFS replication)]

  29. [Diagram: split 0, split 1, split 2 -> map, map, map -> two reduce tasks -> part0 and part1 (each with HDFS replication)]

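With more than one reduce task, every key has to be routed to exactly one of them. The sketch below uses the same arithmetic as Hadoop's default `HashPartitioner`, but applied to `String.hashCode()` rather than to a Hadoop `Text` key, so the concrete partition numbers it yields are not necessarily the ones a cluster would produce. The class and method names are hypothetical.

```java
// Illustrates how a key is assigned to one of N reduce tasks (part0, part1, ...).
class PartitionSketch {

    // Mask off the sign bit so the result is never negative, then take the modulo.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String year : new String[] {"1958", "1959", "1960"}) {
            System.out.println(year + " -> part" + partitionFor(year, 2));
        }
    }
}
```

Because the partition depends only on the key, all values for a given year reach the same reducer, which is what makes the per-key reduce correct.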

  30. Exercise 1
    Submitting our first job

  31. MapReduce API
    1.  Blabla
    2.  BlablaMapper
    3.  BlablaReducer

  32. FlightsByCarrier
    https://github.com/csuarez/seminario-mapreduce
    flights-by-carrier/

  33. Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
    1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,15,4,729,730,903,849,PS,1451,NA,94,79,NA,14,-1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,17,6,741,730,918,849,PS,1451,NA,97,79,NA,29,11,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,18,7,729,730,847,849,PS,1451,NA,78,79,NA,-2,-1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,19,1,749,730,922,849,PS,1451,NA,93,79,NA,33,19,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,21,3,728,730,848,849,PS,1451,NA,80,79,NA,-1,-2,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,22,4,728,730,852,849,PS,1451,NA,84,79,NA,3,-2,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,23,5,731,730,902,849,PS,1451,NA,91,79,NA,13,1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,24,6,744,730,908,849,PS,1451,NA,84,79,NA,19,14,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,25,7,729,730,851,849,PS,1451,NA,82,79,NA,2,-1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,26,1,735,730,904,849,PS,1451,NA,89,79,NA,15,5,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,28,3,741,725,919,855,PS,1451,NA,98,90,NA,24,16,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,29,4,742,725,906,855,PS,1451,NA,84,90,NA,11,17,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,31,6,726,725,848,855,PS,1451,NA,82,90,NA,-7,1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,1,4,936,915,1035,1001,PS,1451,NA,59,46,NA,34,21,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,2,5,918,915,1017,1001,PS,1451,NA,59,46,NA,16,3,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,3,6,928,915,1037,1001,PS,1451,NA,69,46,NA,36,13,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,4,7,914,915,1003,1001,PS,1451,NA,49,46,NA,2,-1,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,5,1,1042,915,1129,1001,PS,1451,NA,47,46,NA,88,87,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,6,2,934,915,1024,1001,PS,1451,NA,50,46,NA,23,19,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,7,3,946,915,1037,1001,PS,1451,NA,51,46,NA,36,31,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,8,4,932,915,1033,1001,PS,1451,NA,61,46,NA,32,17,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,9,5,947,915,1036,1001,PS,1451,NA,49,46,NA,35,32,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,10,6,915,915,1022,1001,PS,1451,NA,67,46,NA,21,0,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,11,7,916,915,1006,1001,PS,1451,NA,50,46,NA,5,1,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,12,1,944,915,1027,1001,PS,1451,NA,43,46,NA,26,29,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,13,2,941,915,1036,1001,PS,1451,NA,55,46,NA,35,26,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,14,3,930,915,1029,1001,PS,1451,NA,59,46,NA,28,15,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,15,4,920,915,1023,1001,PS,1451,NA,63,46,NA,22,5,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,17,6,1009,915,1104,1001,PS,1451,NA,55,46,NA,63,54,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,18,7,915,915,1008,1001,PS,1451,NA,53,46,NA,7,0,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,19,1,940,915,1032,1001,PS,1451,NA,52,46,NA,31,25,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
    1987,10,21,3,913,915,1003,1001,PS,1451,NA,50,46,NA,2,-2,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA

  34. import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class FlightsByCarrier {
        public static void main(String[] args) throws Exception {

            Job job = new Job();
            job.setJarByClass(FlightsByCarrier.class);
            job.setJobName("FlightsByCarrier");

            // Input and output paths come from the command line
            TextInputFormat.addInputPath(job, new Path(args[0]));
            TextOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            job.setMapperClass(FlightsByCarrierMapper.class);
            job.setReducerClass(FlightsByCarrierReducer.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            // The opencsv jar must already be in HDFS so the tasks can load it
            job.addFileToClassPath(new Path("/user/root/opencsv-2.3.jar"));
            job.waitForCompletion(true);
        }
    }

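  35. //Mapper
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import au.com.bytecode.opencsv.CSVParser;

    public class FlightsByCarrierMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {

            if (key.get() > 0) { // Skip the first line (the key is the byte offset, so 0 is the CSV header)
                String[] fields = new CSVParser().parseLine(value.toString());

                // Column 8 is UniqueCarrier: emit (carrier, 1)
                context.write(new Text(fields[8]), new IntWritable(1));
            }
        }
    }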

  36. public class FlightsByCarrierReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text token, Iterable<IntWritable> counts,
                Context context) throws IOException, InterruptedException {

            // Add up the 1s emitted by the mapper for this carrier
            int sum = 0;

            for (IntWritable count : counts) {
                sum += count.get();
            }

            context.write(token, new IntWritable(sum));
        }
    }

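What the mapper and reducer above compute together can be checked locally with a few sample rows. The following is a plain-Java stand-in, not the Hadoop job itself: it uses `String.split` instead of opencsv and detects the header by its first column name instead of by byte offset, and the class name `FlightsByCarrierSketch` is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for FlightsByCarrierMapper + FlightsByCarrierReducer.
class FlightsByCarrierSketch {

    // Column 8 of the CSV is UniqueCarrier, the field the slide's mapper emits.
    static final int UNIQUE_CARRIER = 8;

    static Map<String, Integer> countByCarrier(List<String> csvLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : csvLines) {
            // Assumption: the header is recognised by its first column name.
            if (line.startsWith("Year,")) {
                continue;
            }
            // Naive split; adequate here because the sample rows contain no quoted commas.
            String[] fields = line.split(",", -1);
            // "Map" emits (carrier, 1); "reduce" sums the ones per carrier.
            counts.merge(fields[UNIQUE_CARRIER], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier",
            "1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA",
            "1987,10,1,4,936,915,1035,1001,PS,1451,NA,59,46,NA,34,21,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA");
        System.out.println(countByCarrier(sample)); // prints {PS=2}
    }
}
```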

  37. Execution
    $ git clone https://github.com/csuarez/seminario-mapreduce.git
    [...]
    $ tar xvzf 1987.tar.gz
    $ hdfs dfs -copyFromLocal lib/opencsv-2.3.jar /user/root
    $ hdfs dfs -copyFromLocal 1987.csv /user/root
    $ sh build.sh
    $ hadoop jar FlightsByCarrier.jar FlightsByCarrier /user/root/1987.csv /user/root/output/flightsCount
    $ hdfs dfs -cat /user/root/output/flightsCount/part-r-00000

  38. FlightsByCarrier output
    INFO mapreduce.JobSubmitter: number of splits:2
    ...
    INFO mapreduce.Job: map 0% reduce 0%
    INFO mapreduce.Job: map 22% reduce 0%
    INFO mapreduce.Job: map 41% reduce 0%
    INFO mapreduce.Job: map 83% reduce 0%
    INFO mapreduce.Job: map 100% reduce 0%
    INFO mapreduce.Job: map 100% reduce 100%
    ...
    Job Counters
        Launched map tasks=2
        Launched reduce tasks=1
        Rack-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=42442
        Total time spent by all reduces in occupied slots (ms)=13465

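The log reports two splits and, accordingly, two map tasks. The sketch below shows how a split count can follow from file size and block size. It mirrors the arithmetic of Hadoop's `FileInputFormat` as I understand it (split size clamped between a minimum and a maximum, plus a 10% slack on the last split). The file and block sizes in `main` are hypothetical, since the slides state neither.

```java
// Estimates how many input splits (and hence map tasks) a single file yields.
class SplitCountSketch {

    // The last split may be up to 10% larger than the split size.
    static final double SPLIT_SLOP = 1.1;

    // Split size is the block size, clamped to [minSize, maxSize].
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        // Carve off full-size splits while more than 1.1 split sizes remain.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        // Whatever is left becomes the final split.
        if (remaining != 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long size = splitSize(64 * mb, 1, Long.MAX_VALUE); // hypothetical 64 MB blocks
        System.out.println(countSplits(120 * mb, size));   // hypothetical 120 MB file: prints 2
    }
}
```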

  39. MapReduce
    How it works

  40. MapReduce v1

  41. [Diagram: MapReduce v1 job execution. Client node: MapReduce Program and Job Client (JVM). Job tracker node: Job Tracker. Tasktracker node: Task Tracker and Child JVM running the M/R task. Shared: HDFS.
    Steps: 1. run; 2. get id; 3. copy; 4. submit job; 5. init. job; 6. get splits; 7. heartbeats; 8. get resources; 9. launch; 10. run]

  42. Actors

  43. MapReduce execution

  44. Step 1: Job Submission

  45. Step 2: Job Initialization

  46. Step 3: Task Assignment

  47. Step 4: Task Execution

  48. Step 5: Progress & Status

  49. Step 6: Job Completion

  50. MapReduce v2 + YARN

  51. Why YARN?

  52. [Diagram: MapReduce v2 job execution on YARN. Client node: MapReduce Program and Job Client (JVM). Resource manager node: Resource Manager. Node manager nodes: one Node Manager hosting the Application Master, one Node Manager hosting the YARN Child JVM that runs the M/R task. Shared: HDFS.
    Steps: 1. run; 2. get ID; 3. copy job resources; 4. submit application; 5a. start container; 5b. launch; 6. init. job; 7. get splits; 8. allocate resources; 9a. start container; 9b. launch; 10. get data]

  53. YARN: Actors

  54. vs. MapReduce v1: Actors
    1.  YARN Resource Manager
    2.  YARN Node Manager
    3.  Application Master

  56. MapReduce v1
    API:               PIG, HIVE, HBASE
    Processing FW:     MR
    Resource Manager:  MapReduce v1
    Storage:           HDFS

  57. YARN
    API:               PIG
    Processing FW:     MR v2, TEZ, STORM, MPI
    Resource Manager:  YARN
    Storage:           HDFS

  59. Step 1: Job Submission

  61. Step 2: Job Initialization

  62. Step 2**: Uber Mode

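The bullet text of this slide did not survive in the transcript. As general background, uber mode lets MapReduce v2 run the tasks of a small job inside the Application Master's own JVM instead of requesting separate containers. It is controlled by the `mapreduce.job.ubertask.enable`, `mapreduce.job.ubertask.maxmaps`, `mapreduce.job.ubertask.maxreduces` and `mapreduce.job.ubertask.maxbytes` properties. The sketch below is a simplified version of that size test only; the real check also considers memory and CPU limits, and the class and method names here are hypothetical.

```java
// Simplified "is this job small enough for uber mode?" test.
class UberModeSketch {

    static boolean isUberCandidate(int numMaps, int numReduces, long inputBytes,
                                   int maxMaps, int maxReduces, long maxBytes) {
        // All three size limits must hold for the job to count as small.
        return numMaps <= maxMaps
            && numReduces <= maxReduces
            && inputBytes <= maxBytes;
    }

    public static void main(String[] args) {
        // Limits assumed here: 9 maps, 1 reduce, one 128 MB block of input.
        long oneBlock = 128L * 1024L * 1024L;
        System.out.println(isUberCandidate(2, 1, 1_000_000L, 9, 1, oneBlock));  // prints true
        System.out.println(isUberCandidate(10, 1, 1_000_000L, 9, 1, oneBlock)); // prints false
    }
}
```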

  64. Step 3: Task Assignment

  66. Step 4: Task Execution

  68. Step 5: Progress & Status

  70. Getting the most out of MapReduce

  71. Apache Pig

  73. records = LOAD '1987.csv' USING PigStorage(',') AS
        (Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance:int,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay);

    milage_recs = GROUP records ALL;

    tot_miles = FOREACH milage_recs GENERATE SUM(records.Distance);

    STORE tot_miles INTO '/user/root/totalmiles';

    https://github.com/csuarez/seminario-mapreduce
    $ cd pig-total-miles/
    $ pig totalmiles.pig

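For comparison, the Pig script above groups every record into a single bag and sums the `Distance` column. A plain-Java equivalent of that computation over in-memory lines might look like the sketch below. It is not how Pig executes the script (Pig compiles it into MapReduce jobs), and the class name `TotalMilesSketch` is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java equivalent of: GROUP records ALL; SUM(records.Distance)
class TotalMilesSketch {

    // Distance is column 18 in the CSV header shown earlier.
    static final int DISTANCE = 18;

    static long totalMiles(List<String> csvLines) {
        long total = 0;
        for (String line : csvLines) {
            String[] fields = line.split(",", -1);
            if (fields.length <= DISTANCE) {
                continue; // malformed or truncated row
            }
            try {
                total += Long.parseLong(fields[DISTANCE]);
            } catch (NumberFormatException e) {
                // Non-numeric values (the header, "NA") are skipped, much as
                // Pig's SUM ignores values that cannot be cast to int.
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA",
            "1987,10,1,4,936,915,1035,1001,PS,1451,NA,59,46,NA,34,21,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA");
        System.out.println(totalMiles(sample)); // prints 639
    }
}
```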

  74. Hadoop Streaming

  75. $ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input myInputDirs \
        -output myOutputDir \
        -mapper myPythonScript.py \
        -reducer /bin/wc \
        -file myPythonScript.py

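With Hadoop Streaming the mapper and reducer are ordinary executables: each reads records line by line from standard input and writes `key<TAB>value` lines to standard output. The command above uses a Python script; to stay in the language of the other examples, the sketch below expresses the same contract in Java. The class name `StreamingCarrierMapper` and the choice of emitting the carrier column are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// A streaming-style mapper: stdin lines in, "key<TAB>value" lines out.
class StreamingCarrierMapper {

    // Returns the output line for one input record, or null to emit nothing.
    static String mapLine(String line) {
        String[] fields = line.split(",", -1);
        // Skip the CSV header and rows too short to contain column 8.
        if (fields.length <= 8 || fields[0].equals("Year")) {
            return null;
        }
        return fields[8] + "\t1"; // (UniqueCarrier, 1)
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            String out = mapLine(line);
            if (out != null) {
                System.out.println(out);
            }
        }
    }
}
```

A streaming reducer follows the same contract, receiving the mapper output sorted by key on its standard input.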

  78. Exercise 2
    MapReduce Job

  79. GreekGodCounter (Standard)
    https://github.com/csuarez/seminario-mapreduce
    greek-god-counter-standard/

  80. private final static String[] gods = {
        "Zeus",
        "Hera",
        "Poseidón",
        "Dioniso",
        "Apolo",
        "Artemisa",
        "Hermes",
        "Atenea",
        "Ares",
        "Afrodita",
        "Hefesto",
        "Deméter"
    };

  81. //Initializing the initial structure
    for (String god : gods) {
        godMap.put(god, 0);
    }

    try {
        //Reading input
        br = new BufferedReader(new FileReader(args[0]));
        String line = br.readLine();
        while (line != null) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                String token = tokenizer.nextToken();
                if (godMap.containsKey(token)) {
                    godMap.put(token, godMap.get(token) + 1);
                }
            }
            line = br.readLine();
        }

        //Writing output
        Writer writer = new BufferedWriter(new FileWriter("gods.txt"));
        for (Entry<String, Integer> entry : godMap.entrySet()) {
            writer.write(entry.getKey() + " = " + entry.getValue());
            writer.write(System.lineSeparator());
        }
        writer.close();
    }
    // (the slide ends here; the catch/finally block is not shown)

  82. TASK!!!

  83. GreekGodCounter (MapReduce)
    https://github.com/csuarez/seminario-mapreduce
    greek-god-counter-mapreduce/

  84. import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.filecache.DistributedCache;

    public class GreekGodCounterMapReduce {
        public static void main(String[] args) throws Exception {

            Job job = new Job();
            job.setJarByClass(GreekGodCounterMapReduce.class);
            job.setJobName("GreekGodCounterMapReduce");
            TextInputFormat.addInputPath(job, new Path(args[0]));
            job.setInputFormatClass(TextInputFormat.class);
            job.setMapperClass(GreekGodCounterMapReduceMapper.class);
            job.setReducerClass(GreekGodCounterMapReduceReducer.class);
            TextOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setOutputFormatClass(TextOutputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.waitForCompletion(true);
        }
    }

  85. import java.io.IOException;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.Mapper;
    import java.util.HashMap;
    import java.util.Map.Entry;
    import java.util.StringTokenizer;

    public class GreekGodCounterMapReduceMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static String[] gods = {
            "Zeus",
            "Hera",
            "Poseidón",
            "Dioniso",
            "Apolo",
            "Artemisa",
            "Hermes",
            "Atenea",
            "Ares",
            "Afrodita",
            "Hefesto",
            "Deméter"
        };

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Left empty on the slide: to be completed as part of the exercise
        }
    }

  86. import java.io.IOException;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.Reducer;

    public class GreekGodCounterMapReduceReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text token, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            // Left empty on the slide: to be completed as part of the exercise
        }
    }

  87. Resources

  89. Thank you!
    Any questions?
    [email protected]