pmux

Pmux is a lightweight file-based MapReduce system, written in Ruby.

maebashi

June 02, 2013

Transcript

  1. What is MapReduce? Copyright (c) 2013 Internet Initiative Japan Inc.

     Represent problems as Map and Reduce steps:
     (1) Map: extract, convert
     (2) Reduce: aggregate, summarize
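
The two steps can be sketched in a few lines of Ruby. This is only an in-process illustration of the idea, not how pmux is implemented; the sample lines and the names `map_step` and `reduce_step` are made up for the example.

```ruby
# Map: extract one key from each record (here, the first word of a line).
def map_step(lines)
  lines.map { |line| line.split.first }
end

# Reduce: aggregate the extracted keys into a summary (here, a count per key).
def reduce_step(keys)
  keys.each_with_object(Hash.new(0)) { |key, counts| counts[key] += 1 }
end

# Hypothetical sample input.
lines = ["GET /index.html", "POST /login", "GET /about.html"]
counts = reduce_step(map_step(lines))
counts.each { |key, n| puts "#{key} #{n}" }
```
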
  2. Locate files based solely on their name

     (in the case of a distributed volume)
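
To illustrate what "based solely on their name" can mean, here is a minimal Ruby sketch that picks a node by hashing the file name. It is not GlusterFS's actual placement algorithm; the node names, the use of CRC32, and the `node_for` helper are assumptions chosen to keep the example short.

```ruby
require "zlib"

# Hypothetical node names; a real volume has its own list of storage servers.
NODES = %w[node0 node1 node2 node3].freeze

# Pick a node from the file name alone: hash the base name, then take it
# modulo the number of nodes. No lookup table or metadata server is needed.
def node_for(path, nodes = NODES)
  nodes[Zlib.crc32(File.basename(path)) % nodes.size]
end

a = node_for("/mnt/glusterfs/access.log")
b = node_for("/some/other/dir/access.log")
puts "access.log -> #{a}"
```

Because only the name is hashed, any host can compute where a file lives without asking anyone else.
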
  3. What is pmux? (1)

     • stands for pipeline multiplexer
     • https://github.com/iij/pmux
     • https://github.com/iij/pmux/wiki
  4. What is pmux? (2)

     • file-based map/reduce tool
     • uses Unix standard input/output as the interface

     Example: distributed grep (files on GlusterFS)

         $ pmux --mapper="grep PATTERN" *.log
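
Since the interface is just standard input and output, a mapper can be any filter program. The Ruby sketch below mimics what `grep PATTERN` does as a mapper; the `grep_mapper` name and the sample log lines are made up, and in-memory streams stand in for the real stdin/stdout so the example runs on its own.

```ruby
require "stringio"

# A mapper is a filter: read records from an input stream and
# write the results to an output stream.
def grep_mapper(pattern, input = $stdin, output = $stdout)
  input.each_line { |line| output.print(line) if line.match?(pattern) }
end

# Demonstration with in-memory streams; the log lines are hypothetical.
input = StringIO.new("ok request\nERROR disk full\nok again\n")
output = StringIO.new
grep_mapper(/ERROR/, input, output)
puts output.string
```
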
  5. 1. Look up target files

     • run the pmux command on this host
     • read trusted.glusterfs.pathinfo from xattr
  6. 2. Invoke pmux on each node

     (diagram: dispatcher, worker)
  7. 3. Assign map tasks to nodes

     Tasks are assigned to nodes (workers) dynamically
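
Dynamic assignment can be pictured as workers pulling from a shared task queue. The sketch below uses Ruby threads in place of nodes; the slides do not describe the dispatcher's actual protocol, so the queue, the worker names, and the file names here are all assumptions.

```ruby
# Hypothetical task list: one map task per input file.
tasks = Queue.new
%w[a.log b.log c.log d.log e.log].each { |file| tasks << file }

results = Queue.new

# Each worker takes the next task as soon as it is free,
# so a faster worker simply ends up handling more tasks.
workers = %w[worker1 worker2].map do |name|
  Thread.new do
    loop do
      file = begin
        tasks.pop(true) # non-blocking pop; raises ThreadError when empty
      rescue ThreadError
        break
      end
      results << [name, file]
    end
  end
end
workers.each(&:join)

assigned = []
assigned << results.pop until results.empty?
assigned.each { |name, file| puts "#{name} <- #{file}" }
```

Which worker gets which file varies from run to run; the only guarantee is that every task is handled exactly once.
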
  8. 4. Mapper produces tmp files

     The mapper produces temporary files containing intermediate results
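
A rough Ruby sketch of this step: each map task writes its output to a temporary file, and a later step reads those files back. The file naming, the location, and the `run_map_task` helper are assumptions for illustration, not pmux's actual layout.

```ruby
require "tempfile"

# Run a map step over one input and store its output in a temporary file.
# The block is the map function applied to each line.
def run_map_task(lines)
  tmp = Tempfile.new("map-output")
  lines.each { |line| tmp.puts(yield(line)) }
  tmp.flush
  tmp
end

# Two hypothetical inputs, each mapped to upper case.
tmp_files = [%w[a b], %w[c]].map { |lines| run_map_task(lines) { |l| l.upcase } }

# A later reduce step reads the intermediate files back.
intermediate = tmp_files.flat_map { |tmp| File.readlines(tmp.path, chomp: true) }
puts intermediate.inspect

# Close and delete the temporary files once they have been consumed.
tmp_files.each(&:close!)
```
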
  9. example(1): count of status code

     Extract the status code from Apache log files and count

         $ pmux --mapper='grep PAT |cut -d" " -f 9' \
             --reducer='sort|uniq -c' /mnt/glusterfs/*.log
         176331 200
         106360 206
            809 400
          21852 403
            533 404
             27 406
            805 416
             25 500
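
For readers less familiar with `cut` and `uniq -c`, the same extraction and count can be sketched in Ruby. The log lines below are made up, the `grep PAT` filter is left out, and `status_of` and `count_values` are illustrative names (`tally` needs Ruby 2.7 or later).

```ruby
# Map: take the 9th space-separated field, as `cut -d" " -f 9` does.
def status_of(line)
  line.chomp.split(/ /, -1)[8]
end

# Reduce: count occurrences of each value, as `sort | uniq -c` does.
def count_values(values)
  values.tally.sort.to_h
end

# Hypothetical Apache access log lines.
lines = [
  '192.0.2.1 - - [02/Jun/2013:14:00:00 +0900] "GET /a HTTP/1.1" 200 512',
  '192.0.2.2 - - [02/Jun/2013:14:00:01 +0900] "GET /b HTTP/1.1" 404 128',
  '192.0.2.1 - - [02/Jun/2013:14:00:02 +0900] "GET /c HTTP/1.1" 200 2048'
]
counts = count_values(lines.map { |line| status_of(line) })
counts.each { |status, n| puts "#{n} #{status}" }
```
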
  10. example(2): word count

     command line:

         $ pmux --mapper=map.rb --reducer=reduce.rb \
             --file=map.rb --file=reduce.rb \
             /mnt/glusterfs/*.txt

     map.rb:

         #! /usr/bin/ruby -an
         $F.each {|f| print "#{f}\t1\n"}

     reduce.rb:

         #! /usr/bin/ruby -an
         BEGIN {$c = Hash.new 0}
         $c[$F[0]] += $F[1].to_i
         END {$c.each {|k, v| print "#{k} #{v}\n"}}
  11. Performance

     packet capture logs (made by tcpdump):

         14:00:00.416011 IP 21.44.60.29.http > 170.73.162.175.58546: . 3523999974:3524001422(1448) ack 3401170238 win 1716 <nop,nop,timestamp 1070614671 1955062367>

     extract the most frequently appearing IP address in each file
     8344 files, 500K lines/file, 4 billion lines in total
  12. map command

         --mapper='egrep -o "[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+"| sort|uniq -c|sort -nr|head -1'
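
What this pipeline computes for one file can be sketched in Ruby as follows. The regular expression is the one from the slide; the `top_address` name and the shortened sample lines are made up, and ties may be resolved differently than by `sort -nr | head -1`.

```ruby
IP_PATTERN = /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/

# Return [address, count] for the most frequent match in the text,
# or nil if nothing matches.
def top_address(text)
  text.scan(IP_PATTERN).tally.max_by { |_addr, count| count }
end

# Hypothetical, shortened tcpdump-style lines.
text = <<~LOG
  IP 21.44.60.29.http > 170.73.162.175.58546: .
  IP 170.73.162.175.58546 > 21.44.60.29.http: .
  IP 21.44.60.29.http > 10.0.0.1.40000: .
LOG

address, count = top_address(text)
puts "#{count} #{address}"
```

Note that the pattern matches only the first four dotted numbers, so the trailing port number in `170.73.162.175.58546` is not part of the match.
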
  13. result

     1 node, without pmux: 8 hr 49 min 6 sec
  14. result

     1 node, without pmux: 8 hr 49 min 6 sec
     60 nodes (each node has 8 cores): 1 min 45 sec
     300 times faster
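
The "300 times" figure follows from the two timings. A quick check in Ruby; the per-core ratio at the end is only meaningful as an efficiency if the single-node run used one core, which the slide does not state.

```ruby
# Timings reported on the slide, converted to seconds.
single_node = 8 * 3600 + 49 * 60 + 6 # 8 hr 49 min 6 sec
with_pmux   = 1 * 60 + 45            # 1 min 45 sec

speedup = single_node.to_f / with_pmux
cores = 60 * 8 # 60 nodes, 8 cores each

puts format("speedup: %.0fx", speedup)
# Assumption: the baseline used a single core (not stated on the slide).
puts format("speedup per core: %.2f", speedup / cores)
```
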
  15. related tools

     • pmux-gw (pmux-gateway)
       – HTTP interface for pmux
     • pmux-logview
       – visualizer for pmux job progress