Map, Reduce, AWK!

Presented at !!con (http://bangbangcon.com/) 2014.

"Is my data Big enough for Hadoop?" If you're asking, then probably not! Instead, try AWK. It's on your computer, it's easy to use, and you can do a whole lot with it. This talk will (re-)introduce you to AWK, the best little stream processor you could ever ask for. Learn how to map or reduce or both with Unix tools that have existed in one form or another since the 1970s. Don't be ashamed of your small-to-medium data — embrace it with AWK!

Mark Wunsch

May 18, 2014

Transcript

  1. Map, Reduce, AWK! @markwunsch

  2. $ ls -l s3log.txt

    -rw-r--r-- 1 mwunsch RTRHQ\Domain Users 13553855 May 16 14:08 s3log.txt
  3. 13553855

  4. 13M

  5. That’s BIG DATA right?

  6. I’m going to need an Hadoop.

  7. None
  8. What about small data?

  9. What about small data?

  10. small data

  11. None
  12. Alfred Aho Peter Weinberger Brian Kernighan

  13. None
  14. Alfred V. Aho:

    "AWK is a language for processing text files. A file is treated as a sequence of records, and by default each line is a record. Each line is broken up into a sequence of fields, so we can think of the first word in a line as the first field, the second word as the second field, and so on. An AWK program is a sequence of pattern-action statements. AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed."
  15. AWK

    - a language for processing text files
    - each line is a record
    - each line is broken up into a sequence of fields
    - pattern-action statements
    - for each pattern that matches, the associated action is executed
  16. condition { action }

    An AWK program is a sequence of pattern-action statements.
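
A minimal sketch of one pattern-action statement, to make the `condition { action }` shape concrete. The sample input (fruit names and counts) is invented for illustration and is not from the talk.

```shell
# Hypothetical sample input: a name and a number on each line.
fruit='apple 3
banana 12
cherry 7'

# Pattern: the second field is greater than 5.
# Action:  print the first field of the matching line.
matches=$(printf '%s\n' "$fruit" | awk '$2 > 5 { print $1 }')
echo "$matches"
```

Leaving out the pattern runs the action on every line; leaving out the action prints every line that matches the pattern.
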
  17. awk

  18. awk nawk gawk mawk jawk

  19. YOU ALREADY HAVE IT!

  20. docs.aws.amazon.com/AmazonS3/latest/dev/LogFormat.html

    Field Name         Example
    Bucket Owner       79a59df900b949e55d96a1e6…
    Bucket             mybucket
    Time               [06/Feb/2014:00:00:38 +0000]
    Remote IP          192.0.2.3
    Requester          79a59df900b949e55d96a1e698f…
    Request ID         3E57427F33A59F07
    Operation          REST.PUT.OBJECT
    Key                /photos/2014/08/puppy.jpg
    Request-URI        "GET /mybucket/photos/2014/08/puppy.jpg?x-foo=bar"
    HTTP status        200
    Error Code         NoSuchBucket
    Bytes Sent         2662992
    Object Size        3462992
    Total Time         70
    Turn-Around Time   10
    Referrer           "http://www.amazon.com/webservices"
    User-Agent         "curl/7.15.1"
    Version Id         3HL4kqtJvjVBH40Nrjfkd
  21. httpd.apache.org/docs/trunk/logs.html#accesslog

    Apache Combined Log Format:
    "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""
  22. S3 log fields mapped to the Apache format string

    Field Name         Example                            Apache Format String
    Bucket Owner       79a59df900b949e55d96a1e6…
    Bucket             mybucket
    Time               [06/Feb/2014:00:00:38 +0000]       %t
    Remote IP          192.0.2.3                          %h
    Requester          79a59df900b949e55d96a1e698f…       %u
    Request ID         3E57427F33A59F07
    Operation          REST.PUT.OBJECT
    Key                /photos/2014/08/puppy.jpg
    Request-URI        "GET /mybucket/photos/2014/08…"    \"%r\"
    HTTP status        200                                %>s
    Error Code         NoSuchBucket
    Bytes Sent         2662992                            %b
    Object Size        3462992
    Total Time         70
    Turn-Around Time   10
    Referrer           "http://www.amazon.com/…"          \"%{Referer}i\"
    User-Agent         "curl/7.15.1"                      \"%{User-agent}i\"
    Version Id         3HL4kqtJvjVBH40Nrjfkd
  23. Map()

  24. condition { action }

  25. { action }

  26. { print }

  27. Each line is broken up into a sequence of fields…

    3252c3… www.abstractfactory.tv [09/May/2
  28. Each line is broken up into a sequence of fields…

    FS 0897 10897 18 17 "-" "Podcasts/2.0.2" -
  29. Fields of one S3 log line

    $1   3252c3…
    $2   www.abstractfactory.tv
    $3   [09/May/2014:15:58:01
    $4   +0000]
    $5   64.124.28.146
    $6   -
    $7   6EB0563A2F14BD4A
    $8   WEBSITE.GET.OBJECT
    $9   feed.xml
    $10  "GET
    $11  /feed.xml
    $12  HTTP/1.1"
    $13  200
    $14  -
    $15  10897
    $16  10897
    $17  18
    $18  17
    $19  "-"
    $20  "Podcasts/2.0.2"
    $21  -
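
A small sketch of the default field splitting, using the log line from the slide above reassembled into a single record. The bucket owner ID is truncated on the slide and is kept truncated here; the variable name `s3_line` is only for illustration.

```shell
# The slide's sample record, put back together on one line.
# The owner ID stays truncated exactly as the slide shows it.
s3_line='3252c3… www.abstractfactory.tv [09/May/2014:15:58:01 +0000] 64.124.28.146 - 6EB0563A2F14BD4A WEBSITE.GET.OBJECT feed.xml "GET /feed.xml HTTP/1.1" 200 - 10897 10897 18 17 "-" "Podcasts/2.0.2" -'

# NF is the number of fields; $5, $13 and $9 are the remote IP,
# the HTTP status and the object key in this layout.
summary=$(printf '%s\n' "$s3_line" | awk '{ print NF, $5, $13, $9 }')
echo "$summary"
```

Because the default separator is whitespace, the timestamp and the quoted request each span several fields, which is why the later slides print `$3,$4` and `$10,$11,$12` together.
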
  30. { print $5,"-",$6,$3,$4,$10,$11,$12,$13,$15,$19,$20; }

    Apache Combined Log Format
  31. { print $5,"-",$6,$3,$4,$10,$11,$12,$13,$15,$19,$20; }

    Apache Combined Log Format

  32. Apache Combined Log Format User Agent

    "Castro/39 (iPhone; iOS 7.1.1; Scale/2.00)"
    FS FS FS FS
  33. Apache Combined Log Format

    {
      ua=$20
      for (i=21; i<NF; i++) { ua=(ua " " $i); }
      print $5,"-",$6,$3,$4,$10,$11,$12,$13,$15,$19,ua;
    }
  34. #!/usr/bin/awk -f

    {
      ua=$20
      for (i=21; i<NF; i++) { ua=(ua " " $i); }
      print $5,"-",$6,$3,$4,$10,$11,$12,$13,$15,$19,ua;
    }
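
A sketch of the map step run against one record whose user agent contains spaces. The record is an assumption: it reuses the sample line from the field breakdown, with the Castro user agent from the earlier slide swapped in, and keeps the truncated owner ID as shown.

```shell
# Assumed record: the slide's sample line with a multi-word user agent.
castro_line='3252c3… www.abstractfactory.tv [09/May/2014:15:58:01 +0000] 64.124.28.146 - 6EB0563A2F14BD4A WEBSITE.GET.OBJECT feed.xml "GET /feed.xml HTTP/1.1" 200 - 10897 10897 18 17 "-" "Castro/39 (iPhone; iOS 7.1.1; Scale/2.00)" -'

# Rejoin fields 20 through NF-1 into the user agent; the final field
# (the version ID) is deliberately left out, hence i<NF.
combined=$(printf '%s\n' "$castro_line" | awk '{
  ua = $20
  for (i = 21; i < NF; i++) { ua = (ua " " $i) }
  print $5, "-", $6, $3, $4, $10, $11, $12, $13, $15, $19, ua
}')
echo "$combined"
```

The printed line has the shape of an Apache Combined Log Format entry, so tools that expect that format should be able to read it.
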
  35. BEGIN { FS="\""; }

    ($2 ~ /feed\.xml/) { print $6; }
  36. awk -F\" '($2 ~ /feed\.xml/) {print $6}'

  37. awk -F\" '($2 ~ /feed\.xml/) {print $6}' | sort | uniq -c | sort -fr
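
A sketch of what splitting on the double quote does, run on three made-up records. Everything outside the quoted parts is a placeholder (`owner`, `bucket`, `reqid`, and so on), and the third record is there only to show a line the pattern rejects.

```shell
# Placeholder records: only the quoted request and user agent matter here.
quoted_lines='owner bucket [time +0000] 192.0.2.3 - reqid op feed.xml "GET /feed.xml HTTP/1.1" 200 - 1 1 1 1 "-" "Podcasts/2.0.2" -
owner bucket [time +0000] 192.0.2.4 - reqid op feed.xml "GET /feed.xml HTTP/1.1" 200 - 1 1 1 1 "-" "Castro/39 (iPhone; iOS 7.1.1; Scale/2.00)" -
owner bucket [time +0000] 192.0.2.5 - reqid op index.html "GET /index.html HTTP/1.1" 200 - 1 1 1 1 "-" "Podcasts/2.0.2" -'

# With FS set to a double quote, $2 is the whole request line and
# $6 is the whole user agent, spaces included.
agents=$(printf '%s\n' "$quoted_lines" | awk -F'"' '($2 ~ /feed\.xml/) { print $6 }')
echo "$agents"
```

Piping this output through `sort | uniq -c` then gives one count per distinct user agent.
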
  38. tail -50 s3logs.txt | awk -F\" '($2 ~ /feed\.xml/) {print $6}' | sort | uniq -c | sort -fr

    16 Castro/39 (iPhone; iOS 7.1.1; Scale/2.00)
     9 Feedbin - 2 subscribers
     7 iTunes/10.7 Downcast/2.8.18.1003
     3 Podcasts/2.0.2
     3 Instacast/4.5.2 (like iTunes/10.1.2)
     2 StitcherBot (MP3 Search Bot for Stitcher Personalized Radio Service)
     2 Overcast/1.0 Podcast Sync (http://overcast.fm/)
     2 Mozilla/5.0 (compatible; Linux x86_64; Mail.RU_Bot/2.0; +http://go.mail.ru/help/robots)
     2 iTunes/11.1.5 (Macintosh; OS X 10.9.2) AppleWebKit/537.75.14
     1 Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0
     1 iTunes/11.1.5 (Macintosh; OS X 10.7.5) AppleWebKit/534.57.7
     1 iTunes/10.7 Downcast/2.8.17.1003
     1 Apache-HttpClient/UNAVAILABLE (java 1.4)
  39. Reduce()

  40. ($9 == 206) { addresses[$1] += $10; }

    END {
      for (x in addresses) { print x, addresses[x]; }
    }

    $9  => Status Code
    206 => HTTP Partial Content
    $1  => IP Address
    $10 => Bytes
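
A sketch of the reduce step on a few invented Apache Combined Log Format lines. The addresses, paths and byte counts are made up for illustration. The trailing `sort` is an addition: `for (x in addresses)` visits keys in no guaranteed order, so sorting makes the output stable.

```shell
# Hypothetical combined-format records: three partial-content (206)
# responses and one ordinary 200 that the pattern should skip.
access_log='192.0.2.3 - - [09/May/2014:15:58:01 +0000] "GET /ep1.mp3 HTTP/1.1" 206 1000 "-" "Podcasts/2.0.2"
192.0.2.4 - - [09/May/2014:15:58:02 +0000] "GET /ep1.mp3 HTTP/1.1" 206 500 "-" "Podcasts/2.0.2"
192.0.2.3 - - [09/May/2014:15:58:03 +0000] "GET /ep1.mp3 HTTP/1.1" 206 250 "-" "Podcasts/2.0.2"
192.0.2.5 - - [09/May/2014:15:58:04 +0000] "GET /feed.xml HTTP/1.1" 200 10897 "-" "Podcasts/2.0.2"'

# Sum bytes per client address for 206 responses, then print each total.
bytes_by_ip=$(printf '%s\n' "$access_log" | awk '
  ($9 == 206) { addresses[$1] += $10 }
  END { for (x in addresses) { print x, addresses[x] } }
' | sort)
echo "$bytes_by_ip"
```

The associative array plays the role of the reducer's accumulator: one running total per key, emitted once the input ends.
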
  41. awk '{ sum += $2 } END { print sum }'
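
A short sketch of the column sum on "key value" input. The two rows are invented and simply mimic the shape a per-address reduce would print.

```shell
# Hypothetical "address bytes" rows.
totals='192.0.2.3 1250
192.0.2.4 500'

# Add up the second column and print the total once input ends.
grand_total=$(printf '%s\n' "$totals" | awk '{ sum += $2 } END { print sum }')
echo "$grand_total"
```
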
  42. sort | uniq | wc -l
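
For comparison, a sketch of the same distinct count done inside awk. This is an alternative shown for illustration, not something from the talk; the `seen` array and the sample keys are assumptions.

```shell
# Hypothetical keys with repeats.
keys='192.0.2.3
192.0.2.4
192.0.2.3
192.0.2.5
192.0.2.4'

# The pipeline from the slide; tr strips the padding some wc versions add.
with_pipeline=$(printf '%s\n' "$keys" | sort | uniq | wc -l | tr -d ' ')

# The awk version: count a line only the first time it is seen.
with_awk=$(printf '%s\n' "$keys" | awk '!seen[$0]++ { n++ } END { print n + 0 }')
echo "$with_pipeline $with_awk"
```

The awk version needs no sort but holds every distinct line in memory, which is usually fine for small-to-medium data.
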

  43. Map Reduce AWK

  44. AWK(1)

    NAME
        awk - pattern-directed scanning and processing language
    SYNOPSIS
        awk [ -F fs ] [ -v var=value ] [ 'prog' | -f progfile ] [ file … ]
  45. Great Auks. John James Audubon, The Birds of America

  46. ^D