JUC - Processing data with Hadoop: MapReduce and YARN

CETA-Ciemat
February 11, 2015

I Jornadas Técnicas UEx - CIEMAT. Processing large volumes of data with Hadoop

Transcript

  1. Jornadas Técnicas Uex-CIEMAT // February 10-12, 2015. Processing large volumes of data with HADOOP. César Suárez Ortega cesar.suarez@externos.ciemat.es. Processing data with Hadoop: MapReduce and YARN
  2. Software Developer / Researcher César Suárez Ortega tharandur csuarez

  3. None
  4. Software Engineer / Researcher César Suárez Ortega

  5. Software Engineer / Researcher César Suárez Ortega

  6. Outline 1. 2. 3. 4. Getting the most out of MapReduce 5.
  7. Introduction to MapReduce

  8. A programming model for data processing

  9. None
  10. What is MapReduce?   MANY  

  11. by MapReduce: Simplified Data Processing on Large Clusters. Dean J.

    and Ghemawat S. (2004)
  12. Learning by example

  13. National Climatic Data Center     

  14. 1958 -0021 1959 +0065 1960

  15. Solutions 1. Script  2. Parallelize by year  3. Parallelize into equal parts  4. MapReduce 
  16. MapReduce 101   <Key, Value>  A function must be defined for each stage. MAP REDUCE
  17. Map MAP <0, 005733213…> //line1 <160, 006844324…> //line2

  18. Map MAP <K1, V1>

  19. Map MAP <K1, V1> list(K2, V2)

  20. 1958 -0021 1959 +0065 1960

  21. Map MAP <0, 005733213…> //line1 <160, 006844324…> //line2 <1958, -21>

    <1959, +65>
  22. Shuffle <1958, -21> <1959, +65> <1958, -34> <1959, +28> <1958,

    [-21, -34]> <1959, [+65, +28]>
  23. Shuffle list(K2, V2) list(K2, list(V2))

  24. Reduce REDUCE list(K2, list(V2)) list(K3, V3)

  25. Reduce REDUCE <1958, [-21, -34]> <1959, [+65, +28]> <1958, -21>

    <1959, +65>
  26. Map & Reduce REDUCE list(K2, list(V2)) list(K3, V3) MAP <K1, V1> list(K2, V2)
  27. Map&Reduce      Job   
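The map → shuffle → reduce flow from the preceding slides can be sketched in plain Java, without a Hadoop cluster. This is only an illustration: the input lines (`"1958 -21"`, etc.) and the max-temperature-per-year reduce follow the slides, but the `MiniMapReduce` class, its method names, and the in-memory shuffle are assumptions made for this sketch; in a real job the framework performs the shuffle across the cluster.

```java
import java.util.*;

// Plain-Java sketch (hypothetical class, not Hadoop API) of the
// map/shuffle/reduce flow shown in slides 16-26.
public class MiniMapReduce {

    // MAP: one input line "<year> <temp>" -> list of <year, temp> pairs.
    static List<Map.Entry<Integer, Integer>> map(String line) {
        String[] parts = line.split(" ");
        int year = Integer.parseInt(parts[0]);
        int temp = Integer.parseInt(parts[1]); // parseInt accepts a leading '+'
        return List.of(Map.entry(year, temp));
    }

    // SHUFFLE: list(K2, V2) -> K2 -> list(V2), i.e. group values by key.
    static Map<Integer, List<Integer>> shuffle(List<Map.Entry<Integer, Integer>> pairs) {
        Map<Integer, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<Integer, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // REDUCE: <K2, list(V2)> -> <K3, V3>: here, the max temperature per year.
    static Map<Integer, Integer> reduce(Map<Integer, List<Integer>> grouped) {
        Map<Integer, Integer> result = new TreeMap<>();
        grouped.forEach((year, temps) -> result.put(year, Collections.max(temps)));
        return result;
    }

    static Map<Integer, Integer> run(List<String> lines) {
        List<Map.Entry<Integer, Integer>> pairs = new ArrayList<>();
        for (String line : lines) pairs.addAll(map(line));
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        // The sample pairs from slides 21-25.
        List<String> lines = List.of("1958 -21", "1959 +65", "1958 -34", "1959 +28");
        System.out.println(run(lines)); // {1958=-21, 1959=65}
    }
}
```

The `TreeMap` keeps keys sorted, mimicking the sorted key order a reducer receives after the shuffle.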

  28. None
  29. None
  30. Lab 1: Submitting our first job

  31. MapReduce API   1. Blabla  2. BlablaMapper 

    3. BlablaReducer 
  32. FlightsByCarrier   https://github.com/csuarez/seminario-mapreduce flights-by-carrier/

  33. Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay 1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,15,4,729,730,903,849,PS,1451,NA,94,79,NA,14,-1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,17,6,741,730,918,849,PS,1451,NA,97,79,NA,29,11,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,18,7,729,730,847,849,PS,1451,NA,78,79,NA,-2,-1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,19,1,749,730,922,849,PS,1451,NA,93,79,NA,33,19,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,21,3,728,730,848,849,PS,1451,NA,80,79,NA,-1,-2,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,22,4,728,730,852,849,PS,1451,NA,84,79,NA,3,-2,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA

    1987,10,23,5,731,730,902,849,PS,1451,NA,91,79,NA,13,1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,24,6,744,730,908,849,PS,1451,NA,84,79,NA,19,14,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,25,7,729,730,851,849,PS,1451,NA,82,79,NA,2,-1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,26,1,735,730,904,849,PS,1451,NA,89,79,NA,15,5,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,28,3,741,725,919,855,PS,1451,NA,98,90,NA,24,16,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,29,4,742,725,906,855,PS,1451,NA,84,90,NA,11,17,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,31,6,726,725,848,855,PS,1451,NA,82,90,NA,-7,1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,1,4,936,915,1035,1001,PS,1451,NA,59,46,NA,34,21,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,2,5,918,915,1017,1001,PS,1451,NA,59,46,NA,16,3,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,3,6,928,915,1037,1001,PS,1451,NA,69,46,NA,36,13,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,4,7,914,915,1003,1001,PS,1451,NA,49,46,NA,2,-1,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,5,1,1042,915,1129,1001,PS,1451,NA,47,46,NA,88,87,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,6,2,934,915,1024,1001,PS,1451,NA,50,46,NA,23,19,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,7,3,946,915,1037,1001,PS,1451,NA,51,46,NA,36,31,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,8,4,932,915,1033,1001,PS,1451,NA,61,46,NA,32,17,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,9,5,947,915,1036,1001,PS,1451,NA,49,46,NA,35,32,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,10,6,915,915,1022,1001,PS,1451,NA,67,46,NA,21,0,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,11,7,916,915,1006,1001,PS,1451,NA,50,46,NA,5,1,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,12,1,944,915,1027,1001,PS,1451,NA,43,46,NA,26,29,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,13,2,941,915,1036,1001,PS,1451,NA,55,46,NA,35,26,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA 
1987,10,14,3,930,915,1029,1001,PS,1451,NA,59,46,NA,28,15,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,15,4,920,915,1023,1001,PS,1451,NA,63,46,NA,22,5,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,17,6,1009,915,1104,1001,PS,1451,NA,55,46,NA,63,54,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,18,7,915,915,1008,1001,PS,1451,NA,53,46,NA,7,0,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,19,1,940,915,1032,1001,PS,1451,NA,52,46,NA,31,25,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,21,3,913,915,1003,1001,PS,1451,NA,50,46,NA,2,-2,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA 1987,10,22,4,915,915,1017,1001,PS,1451,NA,62,46,NA,16,0,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
  34. public class FlightsByCarrier {
        public static void main(String[] args) throws Exception {
          Job job = new Job();
          job.setJarByClass(FlightsByCarrier.class);
          job.setJobName("FlightsByCarrier");
          TextInputFormat.addInputPath(job, new Path(args[0]));
          TextOutputFormat.setOutputPath(job, new Path(args[1]));
          job.setInputFormatClass(TextInputFormat.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          job.setMapperClass(FlightsByCarrierMapper.class);
          job.setReducerClass(FlightsByCarrierReducer.class);
          job.setOutputFormatClass(TextOutputFormat.class);
          job.addFileToClassPath(new Path("/user/root/opencsv-2.3.jar"));
          job.waitForCompletion(true);
        }
      }
  35. //Mapper<KeyIn, ValueIn, KeyOut, ValueOut>
      public class FlightsByCarrierMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          if (key.get() > 0) { //Skip the header line
            String[] lines = new CSVParser().parseLine(value.toString());
            context.write(new Text(lines[8]), new IntWritable(1));
          }
        }
      }
  36. public class FlightsByCarrierReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text token, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable count : counts) {
            sum += count.get();
          }
          context.write(token, new IntWritable(sum));
        }
      }
  37. Execution
      $ git clone https://github.com/csuarez/seminario-mapreduce.git
      [...]
      $ tar xvzf 1987.tar.gz
      $ hdfs dfs -copyFromLocal lib/opencsv-2.3.jar /user/root
      $ hdfs dfs -copyFromLocal 1987.csv /user/root
      $ sh build.sh
      $ hadoop jar FlightsByCarrier.jar FlightsByCarrier /user/root/1987.csv /user/root/output/flightsCount
      $ hdfs dfs -cat /user/root/output/flightsCount/part-r-00000
  38. FlightsByCarrier output
      INFO mapreduce.JobSubmitter: number of splits:2
      ...
      INFO mapreduce.Job: map 0% reduce 0%
      INFO mapreduce.Job: map 22% reduce 0%
      INFO mapreduce.Job: map 41% reduce 0%
      INFO mapreduce.Job: map 83% reduce 0%
      INFO mapreduce.Job: map 100% reduce 0%
      INFO mapreduce.Job: map 100% reduce 100%
      ...
      Job Counters
        Launched map tasks=2
        Launched reduce tasks=1
        Rack-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=42442
        Total time spent by all reduces in occupied slots (ms)=13465
  39. MapReduce: How it works

  40. MapReduce v1

  41. None
  42. Actors 1. 2. 3. 4.

  43. MapReduce execution       

  44. Step 1: Job Submission

  45. Step 2: Job Initialization

  46. Step 3: Task Assignment

  47. Step 4: Task Execution

  48. Step 5: Progress & Status

  49. Step 6: Job Completion

  50. MapReduce v2 + YARN

  51. Why YARN?     

  52. None
  53. YARN: Actors 1.  2.  3.  4.  5.
  54. vs. MapReduce v1: Actors 1. 2. 3. 4. 1. YARN Resource Manager 2. YARN Node Manager 3. Application Master
  55. None
  56. MapReduce v1 API Processing FW Resource Manager Storage MR MapReduce

    v1 HDFS PIG HIVE HBASE
  57. YARN API Processing FW Resource Manager Storage MR YARN HDFS

    PIG STORM MR v2 TEZ MPI
  58. None
  59. Step 1: Job Submission

  60. None
  61. Step 2: Job Initialization

  62. Step 2**: Uber Mode

  63. None
  64. Step 3: Task Assignment

  65. None
  66. Step 4: Task Execution

  67. None
  68. Step 5: Progress & Status

  69. None
  70. Getting the most out of MapReduce

  71. Apache Pig

  72. Apache Pig     

  73. records = LOAD '1987.csv' USING PigStorage(',') AS
        (Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance:int,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay);
      milage_recs = GROUP records ALL;
      tot_miles = FOREACH milage_recs GENERATE SUM(records.Distance);
      STORE tot_miles INTO '/user/root/totalmiles';

      https://github.com/csuarez/seminario-mapreduce
      $ cd pig-total-miles/
      $ pig totalmiles.pig
  74. Hadoop Streaming

  75. $ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input myInputDirs \
        -output myOutputDir \
        -mapper myPythonScript.py \
        -reducer /bin/wc \
        -file myPythonScript.py
  76. None
  77. None
  78. Lab 2: MapReduce Job

  79. GreekGodCounter (Standard)   https://github.com/csuarez/seminario-mapreduce greek-god-counter-standard/

  80. private final static String[] gods = { "Zeus", "Hera", "Poseidón",
        "Dioniso", "Apolo", "Artemisa", "Hermes", "Atenea", "Ares",
        "Afrodita", "Hefesto", "Deméter" };
  81. //Initializing the counter structure
      for (String god : gods) {
        godMap.put(god, 0);
      }
      try {
        //Reading input
        br = new BufferedReader(new FileReader(args[0]));
        String line = br.readLine();
        while (line != null) {
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            if (godMap.containsKey(token)) {
              godMap.put(token, godMap.get(token) + 1);
            }
          }
          line = br.readLine();
        }
        //Writing output
        Writer writer = new BufferedWriter(new FileWriter("gods.txt"));
        for (Entry<String, Integer> entry : godMap.entrySet()) {
          writer.write(entry.getKey() + " = " + entry.getValue());
          writer.write(System.lineSeparator());
        }
        writer.close();
      }
  82. ASSIGNMENT!!!

  83. GreekGodCounter (MapReduce)   https://github.com/csuarez/seminario-mapreduce greek-god-counter-mapreduce/

  84. import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.*;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
      import org.apache.hadoop.filecache.DistributedCache;

      public class GreekGodCounterMapReduce {
        public static void main(String[] args) throws Exception {
          Job job = new Job();
          job.setJarByClass(GreekGodCounterMapReduce.class);
          job.setJobName("GreekGodCounterMapReduce");
          TextInputFormat.addInputPath(job, new Path(args[0]));
          job.setInputFormatClass(TextInputFormat.class);
          job.setMapperClass(GreekGodCounterMapReduceMapper.class);
          job.setReducerClass(GreekGodCounterMapReduceReducer.class);
          TextOutputFormat.setOutputPath(job, new Path(args[1]));
          job.setOutputFormatClass(TextOutputFormat.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          job.waitForCompletion(true);
        }
      }
  85. import java.io.IOException;
      import java.util.HashMap;
      import java.util.Map.Entry;
      import java.util.StringTokenizer;
      import org.apache.hadoop.io.*;
      import org.apache.hadoop.mapreduce.Mapper;

      public class GreekGodCounterMapReduceMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static String[] gods = { "Zeus", "Hera", "Poseidón", "Dioniso",
            "Apolo", "Artemisa", "Hermes", "Atenea", "Ares", "Afrodita", "Hefesto", "Deméter" };

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        }
      }
  86. import java.io.IOException;
      import org.apache.hadoop.io.*;
      import org.apache.hadoop.mapreduce.Reducer;

      public class GreekGodCounterMapReduceReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text token, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        }
      }
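The empty `map()` and `reduce()` bodies above are the exercise. As a hint rather than the Hadoop solution itself, the logic they need can be sketched in plain Java (no cluster required), mirroring the sequential `GreekGodCounter` shown earlier; the `GodCountSketch` class and its method names are invented for this illustration.

```java
import java.util.*;

// Hypothetical sketch of the per-line tokenizing (mapper side) and the
// per-key summing (reducer side) that the empty bodies need. Not Hadoop API.
public class GodCountSketch {

    static final Set<String> GODS = Set.of(
        "Zeus", "Hera", "Poseidón", "Dioniso", "Apolo", "Artemisa",
        "Hermes", "Atenea", "Ares", "Afrodita", "Hefesto", "Deméter");

    // Mapper side: for each token that is a god name, emit it
    // (in Hadoop this would be context.write(new Text(token), new IntWritable(1))).
    static List<String> mapLine(String line) {
        List<String> emitted = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            if (GODS.contains(token)) emitted.add(token);
        }
        return emitted;
    }

    // Reducer side: sum the 1s emitted for each god.
    static Map<String, Integer> reduce(List<String> emitted) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String god : emitted) counts.merge(god, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> emitted = mapLine("Zeus habló con Hera y Zeus rió");
        System.out.println(reduce(emitted)); // {Hera=1, Zeus=2}
    }
}
```

The shuffle between the two halves is what Hadoop provides for free: it groups the emitted `<god, 1>` pairs by key before the reducer sums them.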
  87. Resources

  88. Resources

  89. Thank you! Any questions? cesar.suarez@externos.ciemat.es