
pmux

Pmux is a lightweight file-based MapReduce system, written in Ruby.

maebashi

June 02, 2013

Transcript

  1. What is MapReduce? Copyright (c) 2013 Internet Initiative Japan Inc.

     Represent problems as Map and Reduce steps: (1) Map: extract, convert; (2) Reduce: aggregate, summarize.
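The two steps can be sketched in a few lines of plain Ruby. This is only an in-process illustration, not pmux itself; the method names `map_step` and `reduce_step` and the "STATUS BYTES" sample records are made up for the example.

```ruby
# Map step: extract and convert. Each "STATUS BYTES" record becomes a
# [status, bytes] pair.
def map_step(records)
  records.map do |record|
    status, bytes = record.split
    [status, bytes.to_i]
  end
end

# Reduce step: aggregate. Sum the bytes seen for each status.
def reduce_step(pairs)
  pairs.each_with_object(Hash.new(0)) do |(status, bytes), totals|
    totals[status] += bytes
  end
end

# Hypothetical sample input.
records = ["200 512", "404 128", "200 1024"]
totals = reduce_step(map_step(records))
puts totals.inspect
```

Because the map step looks at one record at a time, it can run on many files in parallel; only the reduce step needs to see all intermediate pairs.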
  2. Locate files based solely on their name (in case of distributed volume)
  3. What is pmux? (1)

     • stands for "pipeline multiplexer"
     • https://github.com/iij/pmux
     • https://github.com/iij/pmux/wiki
  4. What is pmux? (2)

     • file-based map/reduce tool
     • uses Unix standard input/output as the interface

     Example: distributed grep over files on GlusterFS

         $ pmux --mapper="grep PATTERN" *.log
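Since the interface is just standard input and output, any filter program can serve as a mapper. A minimal sketch of such a filter in Ruby follows; `grep_mapper` is an illustrative name rather than a pmux API, and the streams are passed as parameters only so the filter can be exercised in memory.

```ruby
require "stringio"

# A mapper is just a filter: read records from an input stream and write
# the selected ones to an output stream (stdin/stdout by default).
def grep_mapper(pattern, input = $stdin, output = $stdout)
  input.each_line do |line|
    output.print(line) if line.match?(pattern)
  end
end

# Exercise the filter in memory instead of on real log files.
out = StringIO.new
grep_mapper(/ERROR/, StringIO.new("ok\nERROR disk full\nok\n"), out)
```

Saved as a script that calls `grep_mapper` with its defaults, the same logic would read from standard input and write to standard output, which is the only contract the slide describes.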
  5. 1. lookup target files

     Run the pmux command on this host; read trusted.glusterfs.pathinfo from xattr.
  6. 2. invoke pmux on each node

     Roles: dispatcher, worker
  7. 3. assign map tasks to nodes

     Tasks are assigned to nodes (workers) dynamically.
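Dynamic assignment can be pictured as idle workers pulling the next task from a shared queue, so that faster workers end up processing more files. The sketch below is an in-process analogy using threads, under that assumption; it does not reproduce how pmux talks to remote nodes, and `run_dynamically` and the node names are hypothetical.

```ruby
# Hypothetical sketch: each worker repeatedly takes the next pending task,
# instead of receiving a fixed share of the files up front.
def run_dynamically(tasks, worker_names)
  queue = Queue.new
  tasks.each { |task| queue << task }
  queue.close # once drained, pop returns nil and the workers stop

  assignments = Hash.new { |hash, name| hash[name] = [] }
  lock = Mutex.new

  threads = worker_names.map do |name|
    Thread.new do
      while (task = queue.pop)
        # A real worker would run the mapper on the file here.
        lock.synchronize { assignments[name] << task }
      end
    end
  end
  threads.each(&:join)
  assignments
end

result = run_dynamically(%w[a.log b.log c.log d.log e.log], %w[node1 node2])
```

Which worker gets which file varies from run to run; the only guarantee in this sketch is that every task is handled exactly once.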
  8. 4. mapper produces tmp files

     The mapper produces temporary files containing intermediate results.
  9. example (1): count of status code

     Extract the status code from Apache log files and count:

         $ pmux --mapper='grep PAT | cut -d" " -f 9' \
                --reducer='sort | uniq -c' /mnt/glusterfs/*.log
         176331 200
         106360 206
            809 400
          21852 403
            533 404
             27 406
            805 416
             25 500
  10. example (2): word count

     command line:

         $ pmux --mapper=map.rb --reducer=reduce.rb \
                --file=map.rb --file=reduce.rb \
                /mnt/glusterfs/*.txt

     map.rb:

         #! /usr/bin/ruby -an
         $F.each {|f| print "#{f}\t1\n"}

     reduce.rb:

         #! /usr/bin/ruby -an
         BEGIN {$c = Hash.new 0}
         $c[$F[0]] += $F[1].to_i
         END {$c.each {|k, v| print "#{k} #{v}\n"}}
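The two scripts rely on Ruby's `-n` switch (wrap the script in a loop over input lines) and `-a` switch (split each line into `$F`). The sketch below spells out the same logic as ordinary methods so it can be tried on a string without a cluster; the method names `map_words` and `reduce_words` are made up for illustration.

```ruby
# map.rb logic: emit "word<TAB>1" for every whitespace-separated word.
def map_words(input)
  input.each_line.flat_map do |line|
    line.split.map { |word| "#{word}\t1" }
  end
end

# reduce.rb logic: sum the counts per word.
def reduce_words(pairs)
  counts = Hash.new(0)
  pairs.each do |pair|
    word, n = pair.split
    counts[word] += n.to_i
  end
  counts
end

# Hypothetical input standing in for the contents of the *.txt files.
word_counts = reduce_words(map_words("a b a\nc a\n"))
```

As in reduce.rb, the counts live in a hash with a default of 0, so a word seen for the first time needs no special case.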
  11. Performance

     Input: packet capture logs (made by tcpdump), for example:

         14:00:00.416011 IP 21.44.60.29.http > 170.73.162.175.58546: . 3523999974:3524001422(1448) ack 3401170238 win 1716 <nop,nop,timestamp 1070614671 1955062367>

     Task: extract the most frequently appearing IP address in each file.
     Data size: 8344 files, 500K lines/file, about 4 billion lines in total.
  12. map command

         --mapper='egrep -o "[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+" | sort | uniq -c | sort -nr | head -1'
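To see what this pipeline computes for a single file, here is a rough Ruby equivalent that can be run on a small sample. It is a sketch for checking the logic, not what the benchmark ran; `top_address` and the shortened sample lines are hypothetical, and tie-breaking may differ from the shell pipeline.

```ruby
# Same pattern as the egrep call: four dot-separated runs of digits.
IP_PATTERN = /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/

# Return [address, count] for the most frequent match in the text,
# or nil when nothing matches.
def top_address(text)
  counts = Hash.new(0)
  text.scan(IP_PATTERN) { |addr| counts[addr] += 1 }
  counts.max_by { |_addr, n| n }
end

# Shortened, made-up lines in the style of the tcpdump sample.
sample = <<~LOG
  14:00:00.416011 IP 21.44.60.29.http > 170.73.162.175.58546: . ack 1 win 1716
  14:00:00.416020 IP 170.73.162.175.58546 > 21.44.60.29.http: . ack 2 win 1716
  14:00:00.416031 IP 21.44.60.29.http > 10.0.0.7.40000: . ack 3 win 1716
LOG

top = top_address(sample)
```

The pattern stops after four groups of digits, so the trailing port or service name (`.58546`, `.http`) is not part of the match, and the timestamp contains too few dots to match at all.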
  13. result

     1 node, without pmux: 8 hr 49 min 6 sec
  14. result

     1 node, without pmux: 8 hr 49 min 6 sec
     60 nodes (each node has 8 cores): 1 min 45 sec
     About 300 times faster
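The speedup figure follows directly from the two timings. A short worked calculation, using only the numbers on the slide:

```ruby
# Convert hours, minutes and seconds into seconds.
def seconds(hours, minutes, secs)
  hours * 3600 + minutes * 60 + secs
end

single  = seconds(8, 49, 6)  # 1 node, without pmux  => 31746 s
cluster = seconds(0, 1, 45)  # 60 nodes, with pmux   => 105 s
speedup = single.to_f / cluster

puts format("%.1fx", speedup) # prints "302.3x"
```

A speedup of roughly 302 on 60 nodes is about 5 per node, which exceeds the node count because each node also runs several map tasks on its 8 cores.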
  15. related tools

     • pmux-gw (pmux-gateway): HTTP interface for pmux
     • pmux-logview: visualizer for pmux job progress