
MapReduce: Which Way To Go?

DaFED
August 05, 2015


DaFED#35
Speaker: Ozren Gulan
Big data is all around us and is used to solve complex problems. The first way of processing data in the big data world was the batch-processing approach, and various batch-processing technologies use the MapReduce algorithm, still the most popular way of solving problems in the big data world, to process and analyze data. This talk presents a comparative analysis of two approaches: Apache Pig and Java MapReduce. It also looks at the future of the big data world, namely a different approach to data processing: data streaming (Apache Spark).


Transcript

  1. - Facebook: 60 TB
     - YouTube: 300 hours uploaded per minute
     - Instagram: 140k photos per minute
     - Twitter: 350k tweets per minute
     - Google web index: 10+ PB
     - Logs, requests, ...
     - Large Hadron Collider: ~1 PB/day

  2. $ showcase
     For each customer group:
     - top 5 products bought
     - average number of views per visit
     - average number of purchases
     - average purchase amount

  3. $ cat input_record.json
     {
       "sessionId": 1,
       "customerCategoryId": 5,
       "customerCategoryDescription": "desc",
       "products": [
         {
           "id": 1222,
           "name": "product",
           "category": "product category",
           "bought": true,
           "price": 57990.0
         },
         ...
       ]
     }

  4. products = LOAD '/example/products/customer_records_map_reduce_input.json' USING JsonLoader('...');
     categories = LOAD '/example/dimension/customer_categories.db'
         AS (categoryId:int, age:chararray, gender:chararray);
     joinedRecords = JOIN categories BY categoryId, products BY customerCategoryId;

     -- for each group of users, show top five selling products
     flattenedProducts = FOREACH joinedRecords GENERATE
         sessionId AS sessionId,
         categories::categoryId AS categoryId,
         categories::age AS age,
         categories::gender AS gender,
         FLATTEN(products.(id, name, category, bought, price)) AS (id, name, category, bought, price);
     boughtProducts = FILTER flattenedProducts BY bought == true;
     groupedProducts = GROUP boughtProducts BY (categoryId, age, gender, id, name);
     countedProducts = FOREACH groupedProducts GENERATE FLATTEN(group), COUNT(boughtProducts) AS counter;
     groupTopFiveProducts = GROUP countedProducts BY (categoryId, age, gender);

  5. resultTopFiveProducts = FOREACH groupTopFiveProducts {
         sorted = ORDER countedProducts BY counter DESC;
         topProducts = LIMIT sorted 5;
         GENERATE FLATTEN(topProducts);
     };
     STORE resultTopFiveProducts INTO '/example/results/topTenProducts' USING JsonStorage();

     -- average number of seen products
     averageSeenProducts = FOREACH joinedRecords GENERATE
         categories::categoryId AS categoryId,
         categories::age AS age,
         categories::gender AS gender,
         COUNT(products) AS counter;
     grpAverageSeenProducts = GROUP averageSeenProducts BY (categoryId, age, gender);
     averageCountedProducts = FOREACH grpAverageSeenProducts GENERATE FLATTEN(group), AVG(averageSeenProducts.counter) AS averageSeen;

     -- average number of bought products per visit
     groupedBySession = GROUP boughtProducts BY (sessionId, categoryId, age, gender);

  6. averageBoughtProducts = FOREACH groupedBySession GENERATE FLATTEN(group), COUNT(boughtProducts.name) AS counter;
     groupedAverageBoughtProducts = GROUP averageBoughtProducts BY (categoryId, age, gender);
     resultAverageBoughtProducts = FOREACH groupedAverageBoughtProducts GENERATE FLATTEN(group), AVG(averageBoughtProducts.counter) AS averageBought;

     -- average purchase amount
     groupedAveragePrice = GROUP boughtProducts BY (categoryId, age, gender);
     averagePrice = FOREACH groupedAveragePrice GENERATE FLATTEN(group), AVG(boughtProducts.price) AS averagePaid;

     joinedFinal = JOIN averageCountedProducts BY (categoryId, age, gender),
                        resultAverageBoughtProducts BY (categoryId, age, gender),
                        averagePrice BY (categoryId, age, gender);
     finalResult = FOREACH joinedFinal GENERATE
         averageCountedProducts::categoryId AS categoryId,
         averageCountedProducts::age AS age,
         averageCountedProducts::gender AS gender,
         averageCountedProducts::averageSeen AS averageSeen,
         resultAverageBoughtProducts::averageBought AS averageBought,
         averagePrice::averagePaid AS averagePaid;
     STORE finalResult INTO '/example/results/productsStatistic' USING JsonStorage();

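The heart of the Pig script above is a group / count / order / limit pipeline. As a rough, Hadoop-free illustration of that step (not the talk's actual code; the class, the `Purchase` record and all data are hypothetical), the same "top 5 bought products per customer group" logic can be sketched in plain Java:

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical, Hadoop-free sketch of the Pig pipeline's core step:
// group bought products per customer group, count purchases, keep the top N.
public class TopProductsSketch {

    // One bought product observed in some customer group (toy stand-in for a joined record).
    record Purchase(String customerGroup, String product) {}

    static Map<String, List<String>> topProducts(List<Purchase> purchases, int limit) {
        // GROUP boughtProducts BY (group, product); COUNT(...)
        Map<String, Map<String, Long>> counts = purchases.stream()
            .collect(Collectors.groupingBy(Purchase::customerGroup,
                     Collectors.groupingBy(Purchase::product, Collectors.counting())));
        // ORDER ... BY counter DESC; LIMIT ... 5
        Map<String, List<String>> top = new HashMap<>();
        counts.forEach((group, perProduct) -> top.put(group,
            perProduct.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList())));
        return top;
    }

    public static void main(String[] args) {
        List<Purchase> purchases = List.of(
            new Purchase("30-40 male", "oven"),
            new Purchase("30-40 male", "oven"),
            new Purchase("30-40 male", "fridge"),
            new Purchase("20-30 female", "phone"));
        System.out.println(topProducts(purchases, 5));
    }
}
```

The point of the comparison: Pig expresses this in a handful of declarative statements and compiles them into MapReduce jobs, while the hand-written Java that follows has to spell out every data structure.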
  7. $ analysis:readability_maintainability --java
     package com.codingserbia.dto;

     import java.util.ArrayList;
     import java.util.Collections;
     import java.util.Comparator;
     import java.util.HashMap;
     import java.util.Iterator;
     import java.util.List;
     import java.util.Map;
     import java.util.Map.Entry;
     import java.util.Set;
     import org.apache.hadoop.io.LongWritable;
     import org.apache.hadoop.io.Text;
     import com.codingserbia.writable.ProductWritable;

     public class CustomerCategoryProductBag {
         public LongWritable customerCategoryId;
         public Text customerCategoryDescription;
         private Map<LongWritable, ProductWritable> products;
         private Map<LongWritable, Long> purchasesByProduct;
         private Map<LongWritable, Long> viewsByProduct;
         private int numberOfViews = 0;
         private int numberOfSessions = 0;

  8. private int numberOfPurchases = 0;

         public CustomerCategoryProductBag() {
             customerCategoryId = new LongWritable(0L);
             customerCategoryDescription = new Text();
             products = new HashMap<LongWritable, ProductWritable>();
             purchasesByProduct = new HashMap<LongWritable, Long>();
             viewsByProduct = new HashMap<LongWritable, Long>();
         }

         public ProductWritable getProductWritable(LongWritable id) {
             return products.get(id);
         }

         public boolean contains(LongWritable productId) {
             return getProductWritable(productId) != null;
         }

         public void add(ProductWritable product) {
             products.put(product.id, product);
             viewsByProduct.put(product.id, 1L);
             numberOfViews++;
             if (product.bought.get()) {
                 purchasesByProduct.put(product.id, 1L);
                 numberOfPurchases++;
             }
         }

         public void processOccurance(ProductWritable product) {
             if (product.bought.get()) {
                 Long productNumberOfPurchases = purchasesByProduct.get(product.id);
                 if (productNumberOfPurchases == null) {
                     productNumberOfPurchases = 1L;
                 } else {
                     productNumberOfPurchases++;
                 }
                 purchasesByProduct.put(product.id, productNumberOfPurchases);

  9. numberOfPurchases++;
             }
             Long productNumberOfViews = viewsByProduct.get(product.id);
             productNumberOfViews++;
             viewsByProduct.put(product.id, productNumberOfViews);
             numberOfViews++;
         }

         public List<ProductWritable> getTopProductsBought(int numberOfProducts) {
             List<ProductWritable> topProducts = new ArrayList<ProductWritable>();
             Set<Entry<LongWritable, Long>> entrySet = purchasesByProduct.entrySet();
             List<Entry<LongWritable, Long>> entries = new ArrayList<Entry<LongWritable, Long>>();
             for (Iterator<Entry<LongWritable, Long>> iterator = entrySet.iterator(); iterator.hasNext();) {
                 entries.add(iterator.next());
             }
             Collections.sort(entries, new Comparator<Entry<LongWritable, Long>>() {
                 @Override
                 public int compare(Entry<LongWritable, Long> entry1, Entry<LongWritable, Long> entry2) {
                     return entry2.getValue().intValue() - entry1.getValue().intValue();
                 }
             });
             int resultSize = numberOfProducts;
             if (resultSize > entries.size()) {
                 resultSize = entries.size();
             }
             for (Entry<LongWritable, Long> e : entries.subList(0, resultSize)) {
                 topProducts.add(products.get(e.getKey()));
             }
             return topProducts;
         }

  10. public void increaseNumberOfSessions() {
          numberOfSessions++;
      }

      public float calculateAverageNumberOfViews() {
          if (numberOfSessions == 0) {
              return 0f;
          }
          return (float) numberOfViews / (float) numberOfSessions;
      }

      public float calculateAverageNumberOfPurchases() {
          if (numberOfSessions == 0) {
              return 0f;
          }
          return (float) numberOfPurchases / (float) numberOfSessions;
      }

      public float calculateAveragePurchase() {
          float amountInTotal = 0f;
          for (Iterator<LongWritable> iterator = purchasesByProduct.keySet().iterator(); iterator.hasNext();) {
              LongWritable key = iterator.next();
              amountInTotal += products.get(key).price.get() * purchasesByProduct.get(key);
          }
          return numberOfPurchases != 0 ? amountInTotal / numberOfPurchases : 0f;
      }
  }

  package com.codingserbia.dto;

  import java.util.ArrayList;
  import java.util.List;
  import com.fasterxml.jackson.annotation.JsonProperty;

  11. public class CustomerSession {
          @JsonProperty public long sessionId;
          @JsonProperty public long customerCategoryId;
          @JsonProperty(required = false) public String customerCategoryDescription;
          @JsonProperty public List<Product> products;

          public CustomerSession() {
              products = new ArrayList<Product>();
          }
      }

      package com.codingserbia.dto;

      import java.util.ArrayList;
      import java.util.List;
      import com.fasterxml.jackson.annotation.JsonProperty;

      public class CustomerSessionOutput {
          @JsonProperty public long customerCategoryId;
          @JsonProperty public String customerCategoryDescription;
          @JsonProperty public List<ProductOutput> products;
          @JsonProperty public float averageNumberOfViews;

  12. @JsonProperty public float averageNumberOfPurchases;
      @JsonProperty public float averagePurchase;

      public CustomerSessionOutput() {
          products = new ArrayList<ProductOutput>();
      }
  }

  package com.codingserbia.dto;

  import com.fasterxml.jackson.annotation.JsonIgnore;
  import com.fasterxml.jackson.annotation.JsonProperty;

  public class Product {
      @JsonProperty public long id;
      @JsonProperty public String name;
      @JsonProperty public String category;
      @JsonProperty public boolean bought;
      @JsonProperty public double price;
      @JsonIgnore public int numberOfPurchases;

      public Product() {
      }

  13. public Product(long id, String name, String category, boolean bought, double price) {
          super();
          this.id = id;
          this.name = name;
          this.category = category;
          this.bought = bought;
          this.price = price;
      }

      public Product(Product aProduct) {
          super();
          this.id = aProduct.id;
          this.name = aProduct.name;
          this.category = aProduct.category;
          this.bought = aProduct.bought;
          this.price = aProduct.price;
      }
  }

  package com.codingserbia.dto;

  import com.fasterxml.jackson.annotation.JsonProperty;

  public class ProductOutput {
      @JsonProperty public long id;
      @JsonProperty public String name;

      public ProductOutput() {
          name = "";
      }

      public ProductOutput(long id, String name) {
          super();
          this.id = id;
          this.name = name;
      }
  }

  14. package com.codingserbia.writable;

      import java.io.DataInput;
      import java.io.DataOutput;
      import java.io.IOException;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.io.Writable;

      public class CustomerCategoryWritable implements Writable {
          public LongWritable categoryId;
          public Text description;
          public Text gender;

          public CustomerCategoryWritable() {
              super();
              categoryId = new LongWritable();
              description = new Text();
              gender = new Text();
          }

          public CustomerCategoryWritable(long id, String description, String gender) {
              super();
              categoryId = new LongWritable(id);
              this.description = new Text(description);
              this.gender = new Text(gender);
          }

          @Override
          public void readFields(DataInput input) throws IOException {
              categoryId.readFields(input);
              description.readFields(input);
              gender.readFields(input);
          }

          @Override
          public void write(DataOutput output) throws IOException {
              categoryId.write(output);
              description.write(output);

  15. gender.write(output);
          }

          @Override
          public int hashCode() {
              final int prime = 31;
              int result = 1;
              result = prime * result + ((categoryId == null) ? 0 : categoryId.hashCode());
              result = prime * result + ((description == null) ? 0 : description.hashCode());
              result = prime * result + ((gender == null) ? 0 : gender.hashCode());
              return result;
          }

          @Override
          public boolean equals(Object obj) {
              if (this == obj) { return true; }
              if (obj == null) { return false; }
              if (getClass() != obj.getClass()) { return false; }
              CustomerCategoryWritable other = (CustomerCategoryWritable) obj;
              if (categoryId == null) {
                  if (other.categoryId != null) { return false; }
              } else if (!categoryId.equals(other.categoryId)) { return false; }
              if (description == null) {
                  if (other.description != null) { return false; }
              } else if (!description.equals(other.description)) { return false; }

  16. if (gender == null) {
                  if (other.gender != null) { return false; }
              } else if (!gender.equals(other.gender)) { return false; }
              return true;
          }

          @Override
          public String toString() {
              return "CustomerCategoryWritable [categoryId=" + categoryId + ", description=" + description + ", gender=" + gender + "]";
          }
      }

      package com.codingserbia.writable;

      import java.io.DataInput;
      import java.io.DataOutput;
      import java.io.IOException;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.io.Writable;
      import com.codingserbia.dto.CustomerSession;

      public class CustomerSessionWritable implements Writable {
          public LongWritable categoryId;
          public Text categoryDescription;
          public ProductArrayWritable products;

          public CustomerSessionWritable() {
              super();

  17. categoryDescription = new Text();
              products = new ProductArrayWritable(ProductWritable.class);
          }

          public CustomerSessionWritable(String categoryDesc, CustomerSession json) {
              super();
              categoryId = new LongWritable(json.customerCategoryId);
              categoryDescription = new Text(categoryDesc);
              products = new ProductArrayWritable(ProductWritable.class);
              ProductWritable[] pwArray = new ProductWritable[json.products.size()];
              for (int i = 0; i < json.products.size(); i++) {
                  ProductWritable pw = new ProductWritable(json.products.get(i));
                  pwArray[i] = pw;
              }
              products.set(pwArray);
          }

          @Override
          public void readFields(DataInput input) throws IOException {
              categoryId.readFields(input);
              categoryDescription.readFields(input);
              products.readFields(input);
          }

          @Override
          public void write(DataOutput output) throws IOException {
              categoryId.write(output);
              categoryDescription.write(output);
              products.write(output);
          }

          @Override
          public int hashCode() {
              final int prime = 31;
              int result = 1;
              result = prime * result + ((categoryDescription == null) ? 0 : categoryDescription.hashCode());
              result = prime * result + ((categoryId == null) ? 0 : categoryId.hashCode());
              result = prime * result + ((products == null) ? 0 : products.hashCode());
              return result;
          }

  18. @Override
      public boolean equals(Object obj) {
          if (this == obj) { return true; }
          if (obj == null) { return false; }
          if (getClass() != obj.getClass()) { return false; }
          CustomerSessionWritable other = (CustomerSessionWritable) obj;
          if (categoryDescription == null) {
              if (other.categoryDescription != null) { return false; }
          } else if (!categoryDescription.equals(other.categoryDescription)) { return false; }
          if (categoryId == null) {
              if (other.categoryId != null) { return false; }
          } else if (!categoryId.equals(other.categoryId)) { return false; }
          if (products == null) {
              if (other.products != null) { return false; }
          } else if (!products.equals(other.products)) { return false; }
          return true;
      }

      @Override
      public String toString() {
          return "CustomerSessionWritable [categoryId=" + categoryId + ", categoryDescription=" + categoryDescription.toString() + ", products=[" + products.toString() + "]]";
      }
  }

  19. package com.codingserbia.writable;

      import java.io.DataInput;
      import java.io.DataOutput;
      import java.io.IOException;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.MapWritable;
      import org.apache.hadoop.io.Writable;

      public class CustomerSessionWritablesGroupedByCustomerCategoryId implements Writable {
          public LongWritable customerCategoryId;
          public MapWritable sessions;

          public CustomerSessionWritablesGroupedByCustomerCategoryId() {
              super();
              customerCategoryId = new LongWritable();
              sessions = new MapWritable();
          }

          public CustomerSessionWritablesGroupedByCustomerCategoryId(Long categoryId) {
              super();
              customerCategoryId = new LongWritable(categoryId);
              sessions = new MapWritable();
          }

          @Override
          public void readFields(DataInput input) throws IOException {
              customerCategoryId.readFields(input);
              sessions.readFields(input);
          }

          @Override
          public void write(DataOutput output) throws IOException {
              customerCategoryId.write(output);
              sessions.write(output);
          }

  20. @Override
      public int hashCode() {
          final int prime = 31;
          int result = 1;
          result = prime * result + ((customerCategoryId == null) ? 0 : customerCategoryId.hashCode());
          result = prime * result + ((sessions == null) ? 0 : sessions.hashCode());
          return result;
      }

      @Override
      public boolean equals(Object obj) {
          if (this == obj) { return true; }
          if (obj == null) { return false; }
          if (getClass() != obj.getClass()) { return false; }
          CustomerSessionWritablesGroupedByCustomerCategoryId other = (CustomerSessionWritablesGroupedByCustomerCategoryId) obj;
          if (customerCategoryId == null) {
              if (other.customerCategoryId != null) { return false; }
          } else if (!customerCategoryId.equals(other.customerCategoryId)) { return false; }
          if (sessions == null) {
              if (other.sessions != null) { return false; }
          } else if (!sessions.equals(other.sessions)) { return false; }
          return true;
      }
  }

  21. package com.codingserbia.writable;

      import org.apache.hadoop.io.ArrayWritable;
      import org.apache.hadoop.io.Writable;

      public class ProductArrayWritable extends ArrayWritable {

          public ProductArrayWritable(Class<? extends Writable> valueClass) {
              super(valueClass);
          }

          @Override
          public String toString() {
              String value = "ProductArrayWritable [";
              Writable[] pwArray = get();
              for (Writable pw : pwArray) {
                  value += pw.toString();
              }
              value += "]";
              return value;
          }
      }

      package com.codingserbia.writable;

      import java.io.DataInput;
      import java.io.DataOutput;
      import java.io.IOException;
      import org.apache.hadoop.io.BooleanWritable;
      import org.apache.hadoop.io.DoubleWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.io.Writable;
      import com.codingserbia.dto.Product;

      public class ProductWritable implements Writable {
          public LongWritable id;

  22. public Text name;
          public Text category;
          public BooleanWritable bought;
          public DoubleWritable price;

          public ProductWritable() {
              super();
              id = new LongWritable();
              name = new Text();
              category = new Text();
              bought = new BooleanWritable();
              price = new DoubleWritable();
          }

          public ProductWritable(Product json) {
              super();
              id = new LongWritable(json.id);
              name = new Text(json.name);
              category = new Text(json.category);
              bought = new BooleanWritable(json.bought);
              price = new DoubleWritable(json.price);
          }

          @Override
          public void readFields(DataInput input) throws IOException {
              id.readFields(input);
              name.readFields(input);
              category.readFields(input);
              bought.readFields(input);
              price.readFields(input);
          }

          @Override
          public void write(DataOutput output) throws IOException {
              id.write(output);
              name.write(output);
              category.write(output);
              bought.write(output);
              price.write(output);
          }

  23. @Override
      public int hashCode() {
          final int prime = 31;
          int result = 1;
          result = prime * result + ((bought == null) ? 0 : bought.hashCode());
          result = prime * result + ((category == null) ? 0 : category.hashCode());
          result = prime * result + ((id == null) ? 0 : id.hashCode());
          result = prime * result + ((name == null) ? 0 : name.hashCode());
          result = prime * result + ((price == null) ? 0 : price.hashCode());
          return result;
      }

      @Override
      public boolean equals(Object obj) {
          if (this == obj) { return true; }
          if (obj == null) { return false; }
          if (getClass() != obj.getClass()) { return false; }
          ProductWritable other = (ProductWritable) obj;
          if (bought == null) {
              if (other.bought != null) { return false; }
          } else if (!bought.equals(other.bought)) { return false; }
          if (category == null) {
              if (other.category != null) { return false; }
          } else if (!category.equals(other.category)) { return false; }

  24. if (id == null) {
              if (other.id != null) { return false; }
          } else if (!id.equals(other.id)) { return false; }
          if (name == null) {
              if (other.name != null) { return false; }
          } else if (!name.equals(other.name)) { return false; }
          if (price == null) {
              if (other.price != null) { return false; }
          } else if (!price.equals(other.price)) { return false; }
          return true;
      }

      @Override
      public String toString() {
          return "ProductWritable [id=" + id + ", name=" + name + ", category=" + category + ", bought=" + bought + ", price=" + price + "]";
      }
  }

  25. package com.codingserbia;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.conf.Configured;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.NullWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
      import org.apache.hadoop.util.Tool;
      import org.apache.hadoop.util.ToolRunner;
      import org.slf4j.Logger;
      import org.slf4j.LoggerFactory;
      import com.codingserbia.writable.CustomerSessionWritable;

      public class CodingSerbiaMapReduce extends Configured implements Tool {

          private static final Logger LOGGER = LoggerFactory.getLogger(CodingSerbiaMapReduce.class);

          protected String customerCategoriesFilePath = "";
          protected String inputPath = "";
          protected String outputPath = "";

          public CodingSerbiaMapReduce(Configuration config) {
              super();
              setConf(config);
          }

          public static void main(String[] args) throws Exception {
              System.setProperty("hadoop.home.dir", "C:/work/tools/hadoop-common-2.2.0-bin-master");
              Configuration config = new Configuration();
              CodingSerbiaMapReduce mr = new CodingSerbiaMapReduce(config);
              ToolRunner.run(config, mr, args);
          }

  26. protected boolean validateAndParseInput(String[] args) {
          if (args == null || args.length < 3) {
              LOGGER.error("Three arguments are required: path to customer categories file, path to input data and path to desired output directory.");
              return false;
          }
          if (args.length > 3) {
              LOGGER.error("Too many arguments. Only three arguments are required: path to customer categories file, path to input data and path to desired output directory.");
              return false;
          }
          customerCategoriesFilePath = args[0];
          LOGGER.info("Customer categories file path: " + customerCategoriesFilePath);
          getConf().set("customer.categories.file.path", customerCategoriesFilePath);
          inputPath = args[1];
          LOGGER.info("Input path: " + inputPath);
          outputPath = args[2];
          LOGGER.info("Output path: " + outputPath);
          LOGGER.info("Input validation succeeded");
          return true;
      }

      @Override
      public int run(String[] args) throws Exception {
          if (!validateAndParseInput(args)) {
              throw new RuntimeException("Input validation failed.");
          }
          Job job = Job.getInstance(getConf());
          job.setMapOutputKeyClass(LongWritable.class);
          job.setMapOutputValueClass(CustomerSessionWritable.class);
          job.setOutputKeyClass(NullWritable.class);
          job.setOutputValueClass(Text.class);
          job.setMapperClass(CustomerRecordsMapper.class);
          job.setReducerClass(CustomerRecordsReducer.class);

  27. job.setInputFormatClass(TextInputFormat.class);
          job.setOutputFormatClass(TextOutputFormat.class);
          FileInputFormat.setInputPaths(job, new Path(inputPath));
          FileOutputFormat.setOutputPath(job, new Path(outputPath));
          job.setJarByClass(CodingSerbiaMapReduce.class);
          return job.waitForCompletion(true) ? 0 : 1;
      }
  }

  package com.codingserbia;

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.io.InputStreamReader;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.slf4j.Logger;
  import org.slf4j.LoggerFactory;
  import com.codingserbia.dto.CustomerSession;
  import com.codingserbia.writable.CustomerCategoryWritable;
  import com.codingserbia.writable.CustomerSessionWritable;
  import com.fasterxml.jackson.databind.ObjectMapper;

  public class CustomerRecordsMapper extends Mapper<LongWritable, Text, LongWritable, CustomerSessionWritable> {

      private static Logger LOGGER = LoggerFactory.getLogger(CustomerRecordsMapper.class);

      private Map<LongWritable, CustomerCategoryWritable> groupedCategories;
      private ObjectMapper jsonMapper;

  28. public CustomerRecordsMapper() {
          super();
          groupedCategories = new HashMap<LongWritable, CustomerCategoryWritable>();
          jsonMapper = new ObjectMapper();
      }

      @SuppressWarnings({ "rawtypes", "unchecked" })
      @Override
      protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context) throws IOException, InterruptedException {
          super.setup(context);
          String customerCategoriesPath = context.getConfiguration().get("customer.categories.file.path");
          loadCustomerCategories(customerCategoriesPath, context);
      }

      @SuppressWarnings("unused")
      @Override
      protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
          try {
              CustomerSession jsonObj = jsonMapper.readValue(value.toString(), CustomerSession.class);
              LongWritable categoryId = new LongWritable(jsonObj.customerCategoryId);
              CustomerCategoryWritable category = groupedCategories.get(categoryId);
              if (category != null) {
                  CustomerSessionWritable session = new CustomerSessionWritable(category.description.toString(), jsonObj);
                  context.write(categoryId, session);
              }
          } catch (Exception e) {
              LOGGER.error(e.getMessage(), e);
          }
      }

      private void loadCustomerCategories(String filePath, Context context) throws IOException {
          Path path = new Path(filePath);
          FileSystem fs = path.getFileSystem(context.getConfiguration());
          BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)));
          String line;
          while ((line = br.readLine()) != null) {
              String[] columns = line.split("\t");
              long categoryId = Long.valueOf(columns[0]);

  29. String description = columns[1] + " " + columns[2];
              String gender = columns[2];
              CustomerCategoryWritable writable = new CustomerCategoryWritable(categoryId, description, gender);
              groupedCategories.put(writable.categoryId, writable);
          }
          br.close();
      }
  }

  package com.codingserbia;

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.slf4j.Logger;
  import org.slf4j.LoggerFactory;
  import com.codingserbia.dto.CustomerCategoryProductBag;
  import com.codingserbia.dto.CustomerSessionOutput;
  import com.codingserbia.dto.ProductOutput;
  import com.codingserbia.writable.CustomerSessionWritable;
  import com.codingserbia.writable.ProductWritable;
  import com.fasterxml.jackson.databind.ObjectMapper;

  public class CustomerRecordsReducer extends Reducer<LongWritable, CustomerSessionWritable, NullWritable, Text> {

      private static Logger LOGGER = LoggerFactory.getLogger(CustomerRecordsReducer.class);

      private Map<LongWritable, CustomerCategoryProductBag> categoryMap;
      private ObjectMapper jsonMapper;

  30. public CustomerRecordsReducer() {
          super();
          categoryMap = new HashMap<LongWritable, CustomerCategoryProductBag>();
          jsonMapper = new ObjectMapper();
      }

      @Override
      protected void reduce(LongWritable key, Iterable<CustomerSessionWritable> values, Context context) throws IOException, InterruptedException {
          CustomerCategoryProductBag aBag = categoryMap.get(key);
          if (aBag == null) {
              aBag = new CustomerCategoryProductBag();
              aBag.customerCategoryId = key;
          }
          for (CustomerSessionWritable value : values) {
              aBag.increaseNumberOfSessions();
              if (aBag.customerCategoryDescription.getLength() == 0) {
                  aBag.customerCategoryDescription = value.categoryDescription;
              }
              Writable[] productWritables = value.products.get();
              for (Writable writable : productWritables) {
                  ProductWritable product = (ProductWritable) writable;
                  if (!aBag.contains(product.id)) {
                      aBag.add(product);
                  } else {
                      aBag.processOccurance(product);
                  }
              }
          }
          categoryMap.put(key, aBag);
          int numberOfTopBoughtProducts = 5;
          List<ProductWritable> topProducts = aBag.getTopProductsBought(numberOfTopBoughtProducts);
          CustomerSessionOutput outputJsonObj = new CustomerSessionOutput();

  31. outputJsonObj.customerCategoryId = key.get();
          outputJsonObj.customerCategoryDescription = aBag.customerCategoryDescription.toString();
          outputJsonObj.averageNumberOfViews = aBag.calculateAverageNumberOfViews();
          outputJsonObj.averageNumberOfPurchases = aBag.calculateAverageNumberOfPurchases();
          outputJsonObj.averagePurchase = aBag.calculateAveragePurchase();
          for (ProductWritable pw : topProducts) {
              outputJsonObj.products.add(new ProductOutput(pw.id.get(), pw.name.toString()));
          }
          context.write(NullWritable.get(), new Text(jsonMapper.writeValueAsString(outputJsonObj)));
          LOGGER.info(jsonMapper.writeValueAsString(outputJsonObj));
      }
  }

  32. public class CodingSerbiaMapReduce extends Configured implements Tool {
          ...
          Configuration config = new Configuration();
          CodingSerbiaMapReduce mr = new CodingSerbiaMapReduce(config);
          ToolRunner.run(config, mr, args);
          ...
          Job job = Job.getInstance(getConf());
          job.setInputFormatClass(TextInputFormat.class);
          job.setOutputFormatClass(TextOutputFormat.class);
          job.setMapOutputKeyClass(LongWritable.class);
          job.setMapOutputValueClass(CustomerSessionWritable.class);
          job.setOutputKeyClass(NullWritable.class);
          job.setOutputValueClass(Text.class);
          job.setMapperClass(CustomerRecordsMapper.class);
          job.setReducerClass(CustomerRecordsReducer.class);
          FileInputFormat.setInputPaths(job, new Path(inputPath));
          FileOutputFormat.setOutputPath(job, new Path(outputPath));
          return job.waitForCompletion(true) ? 0 : 1;
      }

  33. public class CustomerRecordsMapper extends Mapper<LongWritable, Text, LongWritable, CustomerSessionWritable> {
          protected void map(LongWritable key, Text value, Context context) ... {
              ...
              CustomerSession jsonObj = jsonMapper.readValue(value.toString(), CustomerSession.class);
              LongWritable categoryId = new LongWritable(jsonObj.customerCategoryId);
              CustomerCategoryWritable category = categories.get(categoryId);
              if (category != null) {
                  CustomerSessionWritable session = new CustomerSessionWritable(..., jsonObj);
                  context.write(categoryId, session);
              }

  34. public class CustomerRecordsReducer extends Reducer<LongWritable, CustomerSessionWritable, NullWritable, Text> {
          protected void reduce(LongWritable key, Iterable<CustomerSessionWritable> values, Context context) ... {
              for (CustomerSessionWritable value : values) {
                  // increase number of customer visits
                  for (Writable writable : value.products.get()) {
                      // process an occurrence of a product
                      // track if it is bought or viewed, etc...
                  }
              }
              // calculate average values we need
              // order bought/viewed products based on number of purchases/views
              context.write(NullWritable.get(), new Text(jsonMapper.writeValueAsString(outputJsonObj)));

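The reducer outline above can be exercised without Hadoop. This is a minimal in-memory sketch (plain Java stand-ins for the Writable classes; the class name and data are hypothetical, not the talk's code) of the per-category aggregation it performs:

```java
import java.util.*;

// Hadoop-free sketch of what the reducer aggregates for one customer category:
// count views and purchases across all sessions of that category, then derive
// the three averages written to the output record.
public class ReducerSketch {

    // Toy stand-in for ProductWritable.
    record Product(long id, boolean bought, double price) {}

    // Each inner list plays the role of one CustomerSessionWritable value.
    static Map<String, Float> aggregate(List<List<Product>> sessions) {
        int views = 0, purchases = 0;
        double amount = 0.0;
        for (List<Product> session : sessions) {   // iterate the reducer's values
            for (Product p : session) {            // process each product occurrence
                views++;
                if (p.bought()) {
                    purchases++;
                    amount += p.price();
                }
            }
        }
        int n = sessions.size();
        Map<String, Float> out = new LinkedHashMap<>();
        out.put("averageNumberOfViews", n == 0 ? 0f : (float) views / n);
        out.put("averageNumberOfPurchases", n == 0 ? 0f : (float) purchases / n);
        out.put("averagePurchase", purchases == 0 ? 0f : (float) (amount / purchases));
        return out;
    }

    public static void main(String[] args) {
        List<List<Product>> sessions = List.of(
            List.of(new Product(1, true, 100.0), new Product(2, false, 50.0)),
            List.of(new Product(1, true, 100.0)));
        System.out.println(aggregate(sessions));
    }
}
```

Comparing this ~30-line core against the full listing in slides 7-31 makes the talk's point concrete: most of the Java MapReduce code is serialization and plumbing, not business logic.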
  35. $ cat output_record.json
      {
        "customerCategoryId": 4,
        "customerCategoryDescription": "30-40 male",
        "products": [
          { "id": 1229, "name": "Candy ugradna rerna FS 635 AQUA" },
          ...
        ],
        "averageNumberOfViews": 2.3333333,
        "averageNumberOfPurchases": 1.3333334,
        "averagePurchase": 44750.0
      }

  36. ...

  37. $ analysis:performance
      cloudera-quickstart-vm-5.1.0, 64-bit
      Intel i5 CPU @ 2.60GHz, 16 GB RAM (12 GB RAM for VM)
      - Small: 150000 json records ~ 103 MB
      - Medium: 750000 json records ~ 517 MB
      - Large: 1500000 json records ~ 1 GB
      - X-large: 2250000 json records ~ 1.5 GB

  38. $ analysis:performance --mode single-node
      (chart, axis in seconds 0-900)
      dataset                   Pig       Java
      small (150000/103MB)      4m 05s    40s
      medium (750000/517MB)     7m 40s    1m 20s
      large (1500000/1GB)       10m 30s   2m 30s
      x-large (2250000/1.5GB)   13m 30s   3m 30s

  39. $ analysis:performance --mode single-node
      (chart, axis 0-30; series in the slide's order, Pig then Java)
      small (150000/103MB):     8  vs  5
      medium (750000/517MB):    13 vs  5
      large (1500000/1GB):      19 vs  6
      x-large (2250000/1.5GB):  26 vs  6

  40. $ analysis:performance --mode cluster
      (chart, axis in seconds 0-300)
      dataset                   pig_2     java
      small (150000/103MB)      2m 40s    35s
      medium (750000/517MB)     3m 5s     38s
      large (1500000/1GB)       3m 34s    46s
      x-large (2250000/1.5GB)   4m 6s     51s

  41. $ analysis:performance --mode compare
      (chart, axis in seconds 0-900)
      dataset                   Pig single-node   Java single-node   Pig cluster   Java cluster
      small (150000/103MB)      4m 5s             40s                2m 40s        35s
      medium (750000/517MB)     7m 40s            1m 20s             3m 5s         38s
      large (1500000/1GB)       10m 30s           2m 30s             3m 34s        46s
      x-large (2250000/1.5GB)   13m 30s           3m 30s             4m 6s         51s

  42. $ analysis:language_support
      Pig - UDF (Java, Python, Jython, Groovy, Ruby, JavaScript)
      REGISTER myUDFs.jar
      DEFINE ShinyUDF some.shiny.udf.DoSomething();

  43. $ analysis:dev_tools:pig
      - Plugins for IDEs
      - Plugins for text editors
      - Diagnostic operators: Describe, Dump, Explain and Illustrate
      - PigUnit
      $ analysis:dev_tools:java
      - MRUnit

  44. $ conclusion:pig
      + high abstraction level
      + quick development
      + maintenance
      + extensions (UDF, PiggyBank)
      - performance
      - restrictions of Pig Latin
      $ conclusion:java
      + speeeeed, control
      + tools
      - complexity, maintenance, control

  45. Apache Spark
      - 10 to 100 times faster than MapReduce!
      - Advanced DAG (Directed Acyclic Graph) execution engine
      - Java, Scala, Python, R
      - In memory or on disk
      - Berkeley paper: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
      - Initially started by Matei Zaharia at UC Berkeley in 2009
      - 2013: donated to the Apache Software Foundation
      - 2014: set a new world record in large-scale sorting
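Much of Spark's speedup comes from building a lazy DAG of transformations and keeping intermediate results in memory instead of writing to HDFS between MapReduce jobs. As a loose, Spark-free analogy only (no Spark code here; the class and data are hypothetical), a Java stream shows the same idea: the pipeline is just a plan until a terminal action triggers a single in-memory pass:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

// Loose analogy for DAG execution: a Java stream chains transformations
// lazily and evaluates them in one in-memory pass when a terminal action
// runs -- nothing is materialized between the stages.
public class LazyPipelineSketch {

    static long countLongWords(List<String> lines) {
        AtomicInteger stageCalls = new AtomicInteger();
        Stream<String> words = lines.stream()
            .flatMap(line -> Stream.of(line.split(" ")))   // "map"-like stage
            .peek(w -> stageCalls.incrementAndGet());
        // Nothing has executed yet: the pipeline is only a plan, like a Spark DAG.
        assert stageCalls.get() == 0;
        // The terminal action triggers one fused pass over the data.
        return words.filter(w -> w.length() > 3).count();
    }

    public static void main(String[] args) {
        System.out.println(countLongWords(List.of("big data everywhere", "map reduce")));
    }
}
```

The analogy is only about laziness and stage fusion on one machine; Spark additionally distributes the DAG across a cluster and can spill to disk when data does not fit in memory.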