Jornadas Técnicas Uex-CIEMAT // 10-12 February 2015
Processing Large Volumes of Data with Hadoop
César Suárez Ortega
Processing Data with Hadoop: MapReduce and YARN

Software Developer / Researcher César Suárez Ortega tharandur csuarez

Contents 1. 2. 3. 4. Getting the Most out of MapReduce 5.

Introduction to MapReduce

A programming model for data processing

What is MapReduce? MANY

by MapReduce: Simplified Data Processing on Large Clusters. Dean J. and Ghemawat S. (2004)

Learning by example

National Climatic Data Center     

1958 -0021 1959 +0065 1960

Solutions
1. Script
2. Parallelization by year
3. Parallelization by equal-sized chunks
4. MapReduce

MapReduce 101
A function must be defined for each stage: MAP and REDUCE.

Map
MAP input: <0, 005733213…> //line1 <160, 006844324…> //line2

Map
MAP: (K1, V1) → list(K2, V2)

1958 -0021 1959 +0065 1960

Map
MAP input: <0, 005733213…> //line1 <160, 006844324…> //line2
MAP output: <1958, -21> <1959, +65>

Shuffle
Input: <1958, -21> <1959, +65> <1958, -34> <1959, +28>
Output: <1958, [-21, -34]> <1959, [+65, +28]>

Shuffle
list(K2, V2) → list(K2, list(V2))

Reduce
REDUCE: list(K2, list(V2)) → list(K3, V3)

Reduce
REDUCE input: <1958, [-21, -34]> <1959, [+65, +28]>
REDUCE output: <1958, -21> <1959, +65>

Map & Reduce
MAP: (K1, V1) → list(K2, V2)
REDUCE: list(K2, list(V2)) → list(K3, V3)
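The three stages can be tried out locally before touching a cluster. The following is a minimal sketch in plain Java, without any Hadoop classes, that mimics map, shuffle and reduce for the maximum temperature per year. The class name `MaxTemperatureSketch` and the simplified "year temperature" line format are assumptions made for illustration; real NCDC records are longer and need their own parsing.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local simulation of map -> shuffle -> reduce (no Hadoop involved).
public class MaxTemperatureSketch {

    // MAP: (K1 = line offset, V1 = line text) -> (K2 = year, V2 = temperature).
    // Assumes each line looks like "1958 -0021" (a simplification of the real records).
    static Map.Entry<Integer, Integer> map(long offset, String line) {
        String[] fields = line.trim().split("\\s+");
        int year = Integer.parseInt(fields[0]);
        int temperature = Integer.parseInt(fields[1]);
        return new AbstractMap.SimpleEntry<>(year, temperature);
    }

    // SHUFFLE: list(K2, V2) -> list(K2, list(V2)), grouped and sorted by key.
    static Map<Integer, List<Integer>> shuffle(List<Map.Entry<Integer, Integer>> pairs) {
        Map<Integer, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<Integer, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // REDUCE: (K2, list(V2)) -> V3, here the maximum temperature of the year.
    static int reduce(int year, List<Integer> temperatures) {
        int max = Integer.MIN_VALUE;
        for (int t : temperatures) {
            max = Math.max(max, t);
        }
        return max;
    }

    // Chains the three stages over an in-memory list of lines.
    static Map<Integer, Integer> run(List<String> lines) {
        List<Map.Entry<Integer, Integer>> pairs = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            pairs.add(map(offset, line));
            offset += line.length() + 1; // offset of the next line, as TextInputFormat would report it
        }
        Map<Integer, Integer> result = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("1958 -0021", "1959 +0065", "1958 -0034", "1959 +0028");
        System.out.println(run(input)); // prints {1958=-21, 1959=65}
    }
}
```

In a real job the framework performs the shuffle; only the map and reduce functions are written by hand.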

Map&Reduce      Job   

Exercise 1
Submitting our first job

MapReduce API   1. Blabla  2. BlablaMapper  3. BlablaReducer 

FlightsByCarrier   flights-by-carrier/

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
1987,10,14,3,741,730,912,849,PS,1451,NA,91,79,NA,23,11,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,15,4,729,730,903,849,PS,1451,NA,94,79,NA,14,-1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,17,6,741,730,918,849,PS,1451,NA,97,79,NA,29,11,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,18,7,729,730,847,849,PS,1451,NA,78,79,NA,-2,-1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,19,1,749,730,922,849,PS,1451,NA,93,79,NA,33,19,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,21,3,728,730,848,849,PS,1451,NA,80,79,NA,-1,-2,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,22,4,728,730,852,849,PS,1451,NA,84,79,NA,3,-2,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,23,5,731,730,902,849,PS,1451,NA,91,79,NA,13,1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,24,6,744,730,908,849,PS,1451,NA,84,79,NA,19,14,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,25,7,729,730,851,849,PS,1451,NA,82,79,NA,2,-1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,26,1,735,730,904,849,PS,1451,NA,89,79,NA,15,5,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,28,3,741,725,919,855,PS,1451,NA,98,90,NA,24,16,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,29,4,742,725,906,855,PS,1451,NA,84,90,NA,11,17,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,31,6,726,725,848,855,PS,1451,NA,82,90,NA,-7,1,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,1,4,936,915,1035,1001,PS,1451,NA,59,46,NA,34,21,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,2,5,918,915,1017,1001,PS,1451,NA,59,46,NA,16,3,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,3,6,928,915,1037,1001,PS,1451,NA,69,46,NA,36,13,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,4,7,914,915,1003,1001,PS,1451,NA,49,46,NA,2,-1,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,5,1,1042,915,1129,1001,PS,1451,NA,47,46,NA,88,87,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,6,2,934,915,1024,1001,PS,1451,NA,50,46,NA,23,19,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,7,3,946,915,1037,1001,PS,1451,NA,51,46,NA,36,31,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,8,4,932,915,1033,1001,PS,1451,NA,61,46,NA,32,17,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,9,5,947,915,1036,1001,PS,1451,NA,49,46,NA,35,32,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,10,6,915,915,1022,1001,PS,1451,NA,67,46,NA,21,0,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,11,7,916,915,1006,1001,PS,1451,NA,50,46,NA,5,1,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,12,1,944,915,1027,1001,PS,1451,NA,43,46,NA,26,29,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,13,2,941,915,1036,1001,PS,1451,NA,55,46,NA,35,26,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,14,3,930,915,1029,1001,PS,1451,NA,59,46,NA,28,15,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,15,4,920,915,1023,1001,PS,1451,NA,63,46,NA,22,5,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,17,6,1009,915,1104,1001,PS,1451,NA,55,46,NA,63,54,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,18,7,915,915,1008,1001,PS,1451,NA,53,46,NA,7,0,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,19,1,940,915,1032,1001,PS,1451,NA,52,46,NA,31,25,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,21,3,913,915,1003,1001,PS,1451,NA,50,46,NA,2,-2,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,22,4,915,915,1017,1001,PS,1451,NA,62,46,NA,16,0,SFO,RNO,192,NA,NA,0,NA,0,NA,NA,NA,NA,NA

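import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Driver: configures the job and submits it to the cluster
public class FlightsByCarrier {
    public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setJarByClass(FlightsByCarrier.class);
        job.setJobName("FlightsByCarrier");

        // args[0] = input path, args[1] = output path (both in HDFS)
        TextInputFormat.addInputPath(job, new Path(args[0]));
        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Types of the <key, value> pairs written by the job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(FlightsByCarrierMapper.class);
        job.setReducerClass(FlightsByCarrierReducer.class);

        // The CSV parser used by the mapper must be available to every task
        job.addFileToClassPath(new Path("/user/root/opencsv-2.3.jar"));

        job.waitForCompletion(true);
    }
}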

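import java.io.IOException;
import au.com.bytecode.opencsv.CSVParser;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper: emits <carrier, 1> for every flight record
public class FlightsByCarrierMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The key is the byte offset of the line, so offset 0 is the header: skip it
        if (key.get() > 0) {
            String[] fields = new CSVParser().parseLine(value.toString());
            // Column 8 is UniqueCarrier
            context.write(new Text(fields[8]), new IntWritable(1));
        }
    }
}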

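import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer: adds up the 1s emitted for each carrier
public class FlightsByCarrierReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text token, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(token, new IntWritable(sum));
    }
}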
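Before submitting the job, the counting logic can be checked locally. The following is a minimal sketch in plain Java that applies the same two ideas, skipping the line at offset 0 and counting column 8, to an in-memory list of lines. The class name `FlightsByCarrierSketch` is hypothetical, `String.split` stands in for the opencsv `CSVParser` (good enough here because the sample has no quoted fields), and the sample rows are the first two rows of the data set truncated to their first ten columns.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local check of the FlightsByCarrier logic (no Hadoop, no opencsv).
public class FlightsByCarrierSketch {

    static final int UNIQUE_CARRIER_COLUMN = 8;

    static Map<String, Integer> countByCarrier(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        long offset = 0; // byte offset of the current line, like the mapper's key
        for (String line : lines) {
            if (offset > 0) { // the line at offset 0 is the header
                String carrier = line.split(",")[UNIQUE_CARRIER_COLUMN];
                counts.merge(carrier, 1, Integer::sum); // map emits 1, reduce sums
            }
            offset += line.length() + 1;
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum",
                "1987,10,14,3,741,730,912,849,PS,1451",
                "1987,10,15,4,729,730,903,849,PS,1451");
        System.out.println(countByCarrier(sample)); // prints {PS=2}
    }
}
```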

Execution
$ git clone [...]
$ tar xvzf 1987.tar.gz
$ hdfs dfs -copyFromLocal lib/opencsv-2.3.jar /user/root
$ hdfs dfs -copyFromLocal 1987.csv /user/root
$ sh
$ hadoop jar FlightsByCarrier.jar FlightsByCarrier /user/root/1987.csv /user/root/output/flightsCount
$ hdfs dfs -cat /user/root/output/flightsCount/part-r-00000

FlightsByCarrier output
INFO mapreduce.JobSubmitter: number of splits:2
...
INFO mapreduce.Job: map 0% reduce 0%
INFO mapreduce.Job: map 22% reduce 0%
INFO mapreduce.Job: map 41% reduce 0%
INFO mapreduce.Job: map 83% reduce 0%
INFO mapreduce.Job: map 100% reduce 0%
INFO mapreduce.Job: map 100% reduce 100%
...
Job Counters
    Launched map tasks=2
    Launched reduce tasks=1
    Rack-local map tasks=2
    Total time spent by all maps in occupied slots (ms)=42442
    Total time spent by all reduces in occupied slots (ms)=13465

MapReduce
How it works

MapReduce v1

Actors 1. 2. 3. 4.

MapReduce execution

Step 1: Job Submission

Step 2: Job Initialization

Step 3: Task Assignment

Step 4: Task Execution

Step 5: Progress & Status

Step 6: Job Completion

MapReduce v2 + YARN

Why YARN?

YARN: Actors 1. 2. 3. 4. 5.

vs. MapReduce v1: Actors 1. 2. 3. 4.
1. YARN Resource Manager
2. YARN Node Manager
3. Application Master

MapReduce v1 API Processing FW Resource Manager Storage MR MapReduce v1 HDFS PIG HIVE HBASE

YARN API Processing FW Resource Manager Storage MR YARN HDFS PIG STORM MR v2 TEZ MPI

Step 1: Job Submission

Step 2: Job Initialization

Step 2**: Uber Mode

Step 3: Task Assignment

Step 4: Task Execution

Step 5: Progress & Status

Getting the Most out of MapReduce

Apache Pig

Apache Pig     

records = LOAD '1987.csv' USING PigStorage(',') AS
    (Year, Month, DayofMonth, DayOfWeek, DepTime, CRSDepTime, ArrTime, CRSArrTime,
     UniqueCarrier, FlightNum, TailNum, ActualElapsedTime, CRSElapsedTime, AirTime,
     ArrDelay, DepDelay, Origin, Dest, Distance:int, TaxiIn, TaxiOut, Cancelled,
     CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay,
     LateAircraftDelay);
milage_recs = GROUP records ALL;
tot_miles = FOREACH milage_recs GENERATE SUM(records.Distance);
STORE tot_miles INTO '/user/root/totalmiles';

$ cd pig-total-miles/
$ pig totalmiles.pig

Hadoop Streaming

$ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper \
    -reducer /bin/wc \
    -file

Exercise 2
MapReduce Job

GreekGodCounter (Standard)   greek-god-counter-standard/

private final static String[] gods = {
    "Zeus", "Hera", "Poseidón", "Dioniso", "Apolo", "Artemisa",
    "Hermes", "Atenea", "Ares", "Afrodita", "Hefesto", "Deméter"
};

// Fragment: godMap (a Map<String, Integer>) and br (a BufferedReader) are declared elsewhere.
// Start every god's count at zero
for (String god : gods) {
    godMap.put(god, 0);
}
try {
    // Read the input line by line and count the tokens that match a god's name
    br = new BufferedReader(new FileReader(args[0]));
    String line = br.readLine();
    while (line != null) {
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            if (godMap.containsKey(token)) {
                godMap.put(token, godMap.get(token) + 1);
            }
        }
        line = br.readLine();
    }
    // Write one "name = count" line per god
    Writer writer = new BufferedWriter(new FileWriter("gods.txt"));
    for (Entry<String, Integer> entry : godMap.entrySet()) {
        writer.write(entry.getKey() + " = " + entry.getValue());
        writer.write(System.lineSeparator());
    }
    writer.close();
} catch (IOException e) {
    // The slide does not show the handler; a minimal one is assumed here
    e.printStackTrace();
}

GreekGodCounter (MapReduce)   greek-god-counter-mapreduce/

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.filecache.DistributedCache;

// Driver: same structure as FlightsByCarrier
public class GreekGodCounterMapReduce {
    public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setJarByClass(GreekGodCounterMapReduce.class);
        job.setJobName("GreekGodCounterMapReduce");

        TextInputFormat.addInputPath(job, new Path(args[0]));
        job.setInputFormatClass(TextInputFormat.class);

        job.setMapperClass(GreekGodCounterMapReduceMapper.class);
        job.setReducerClass(GreekGodCounterMapReduceReducer.class);

        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.waitForCompletion(true);
    }
}

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;
import java.util.HashMap;
import java.util.Map.Entry;
import java.util.StringTokenizer;

public class GreekGodCounterMapReduceMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static String[] gods = {
        "Zeus", "Hera", "Poseidón", "Dioniso", "Apolo", "Artemisa",
        "Hermes", "Atenea", "Ares", "Afrodita", "Hefesto", "Deméter"
    };

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Left empty: to be completed as part of the exercise
    }
}

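import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;

public class GreekGodCounterMapReduceReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text token, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // Left empty: to be completed as part of the exercise
    }
}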
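As a starting point for the exercise, here is one possible shape of the map and reduce logic, sketched in plain Java without Hadoop types so it can be run locally. It is not the reference solution: the class name `GreekGodCounterSketch`, the helper `run` and the sample sentences are hypothetical. The map step emits one entry per token that matches a god's name (standing for the pair <god, 1>) and the reduce step sums the 1s.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical local sketch of the GreekGodCounter map and reduce logic.
public class GreekGodCounterSketch {

    // Same names as the gods array; accents are written as Unicode escapes
    // so the file compiles regardless of the source encoding.
    private static final Set<String> GODS = new HashSet<>(Arrays.asList(
            "Zeus", "Hera", "Poseid\u00f3n", "Dioniso", "Apolo", "Artemisa",
            "Hermes", "Atenea", "Ares", "Afrodita", "Hefesto", "Dem\u00e9ter"));

    // Map step: returns one key per matching token; each one stands for <god, 1>.
    static List<String> map(String line) {
        List<String> emitted = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            if (GODS.contains(token)) {
                emitted.add(token);
            }
        }
        return emitted;
    }

    // Reduce step: adds up all the counts received for one god.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Glue code playing the role of the framework: map, group by key, reduce.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (String god : map(line)) {
                grouped.computeIfAbsent(god, k -> new ArrayList<>()).add(1);
            }
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            totals.put(group.getKey(), reduce(group.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up sample sentences
        List<String> lines = Arrays.asList("Zeus y Hera", "Zeus habla con Atenea");
        System.out.println(run(lines)); // prints {Atenea=1, Hera=1, Zeus=2}
    }
}
```

Unlike the standard version, which starts every god at zero, this approach only produces output for gods that actually appear in the input. As in the standard version, a token with punctuation attached (for example "Zeus,") does not match.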

Thank you! Any questions?