Map, Reduce, AWK!

Presented at !!con (http://bangbangcon.com/) 2014.

"Is my data Big enough for Hadoop?" If you're asking, then probably not! Instead, try AWK. It's on your computer, it's easy to use, and you can do a whole lot with it. This talk will (re-)introduce you to AWK, the best little stream processor you could ever ask for. Learn how to map or reduce or both with Unix tools that have existed in one form or another since the 1970s. Don't be ashamed of your small-to-medium data — embrace it with AWK!

Mark Wunsch

May 18, 2014

Transcript

  1. 13M

  2. Alfred V. Aho:

    "AWK is a language for processing text files. A file is treated as a sequence of records, and by default each line is a record. Each line is broken up into a sequence of fields, so we can think of the first word in a line as the first field, the second word as the second field, and so on. An AWK program is a sequence of pattern-action statements. AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed."
  3. AWK

    - a language for processing text files
    - each line is a record
    - each line is broken up into a sequence of fields
    - pattern-action statements
    - for each pattern that matches, the associated action is executed
  4. condition { action }

    An AWK program is a sequence of pattern-action statements.
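A minimal sketch of the `condition { action }` form, assuming a made-up three-line input (the names and numbers are illustrative and do not come from the talk). The condition is tested against every record, and the action runs only for records that match.

```shell
# Hypothetical sample input: one record per line, two fields per record.
input='alice 3
bob 12
carol 7'

# condition { action }: print the first field of each record
# whose second field is greater than 5.
printf '%s\n' "$input" | awk '$2 > 5 { print $1 }'
# prints:
# bob
# carol
```

Leaving out the condition makes the action run for every record; leaving out the action prints each matching record unchanged.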
  5. awk

  6. docs.aws.amazon.com/AmazonS3/latest/dev/LogFormat.html

    Field Name         Example
    Bucket Owner       79a59df900b949e55d96a1e6…
    Bucket             mybucket
    Time               [06/Feb/2014:00:00:38 +0000]
    Remote IP          192.0.2.3
    Requester          79a59df900b949e55d96a1e698f…
    Request ID         3E57427F33A59F07
    Operation          REST.PUT.OBJECT
    Key                /photos/2014/08/puppy.jpg
    Request-URI        "GET /mybucket/photos/2014/08/puppy.jpg?x-foo=bar"
    HTTP status        200
    Error Code         NoSuchBucket
    Bytes Sent         2662992
    Object Size        3462992
    Total Time         70
    Turn-Around Time   10
    Referrer           "http://www.amazon.com/webservices"
    User-Agent         "curl/7.15.1"
    Version Id         3HL4kqtJvjVBH40Nrjfkd
  7. S3 log fields and the matching Apache format strings

    Field Name         Example                           Apache Format String
    Bucket Owner       79a59df900b949e55d96a1e6…
    Bucket             mybucket
    Time               [06/Feb/2014:00:00:38 +0000]      %t
    Remote IP          192.0.2.3                         %h
    Requester          79a59df900b949e55d96a1e698f…      %u
    Request ID         3E57427F33A59F07
    Operation          REST.PUT.OBJECT
    Key                /photos/2014/08/puppy.jpg
    Request-URI        "GET /mybucket/photos/2014/08…"   \"%r\"
    HTTP status        200                               %>s
    Error Code         NoSuchBucket
    Bytes Sent         2662992                           %b
    Object Size        3462992
    Total Time         70
    Turn-Around Time   10
    Referrer           "http://www.amazon.com/…"         \"%{Referer}i\"
    User-Agent         "curl/7.15.1"                     \"%{User-agent}i\"
    Version Id         3HL4kqtJvjVBH40Nrjfkd
  8. Each line is broken up into a sequence of fields…

    3252c3… www.abstractfactory.tv [09/May/2
  9. Each line is broken up into a sequence of fields…

    FS 0897 10897 18 17 "-" "Podcasts/2.0.2" -
  10. $1   3252c3…
      $2   www.abstractfactory.tv
      $3   [09/May/2014:15:58:01
      $4   +0000]
      $5   64.124.28.146
      $6   -
      $7   6EB0563A2F14BD4A
      $8   WEBSITE.GET.OBJECT
      $9   feed.xml
      $10  "GET
      $11  /feed.xml
      $12  HTTP/1.1"
      $13  200
      $14  -
      $15  10897
      $16  10897
      $17  18
      $18  17
      $19  "-"
      $20  "Podcasts/2.0.2"
      $21  -
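The field numbering above can be checked with a short sketch. The sample record below is reassembled from the field values on the slide; the first field is truncated on the slide and is kept truncated here. The shell variable name `line` is only for illustration.

```shell
# Sample record rebuilt from the slide's field values ($1 stays truncated, as on the slide).
line='3252c3… www.abstractfactory.tv [09/May/2014:15:58:01 +0000] 64.124.28.146 - 6EB0563A2F14BD4A WEBSITE.GET.OBJECT feed.xml "GET /feed.xml HTTP/1.1" 200 - 10897 10897 18 17 "-" "Podcasts/2.0.2" -'

# With the default field separator, awk splits on runs of whitespace.
# NF is the number of fields; $5 is the remote IP, $13 the HTTP status.
printf '%s\n' "$line" | awk '{ print NF; print $5, $13 }'
# prints:
# 21
# 64.124.28.146 200
```

Because the split is on whitespace, quoted values that contain spaces (the request line, and many user agents) are spread over several fields.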
  11. Apache Combined Log Format

    {
      ua=$20
      # A user agent may contain spaces, so append fields 21 through NF-1.
      for (i=21; i<NF; i++) { ua=(ua " " $i); }
      print $5,"-",$6,$3,$4,$10,$11,$12,$13,$15,$19,ua;
    }
  12. #!/usr/bin/awk -f

    {
      ua=$20
      # A user agent may contain spaces, so append fields 21 through NF-1.
      for (i=21; i<NF; i++) { ua=(ua " " $i); }
      print $5,"-",$6,$3,$4,$10,$11,$12,$13,$15,$19,ua;
    }
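As a rough usage sketch, the conversion program can be run against two sample records. The first record is rebuilt from the field values shown earlier in the deck; the second is the same record with a multi-word user agent taken from the deck's user-agent tally, which is an assumption made to exercise the loop. The shell variable names `prog` and `lines` are illustrative.

```shell
# The program from the slide, held in a shell variable for this demo.
prog='{
  ua = $20
  # Rejoin a user agent that was split on spaces (fields 21 through NF-1).
  for (i = 21; i < NF; i++) { ua = (ua " " $i) }
  print $5, "-", $6, $3, $4, $10, $11, $12, $13, $15, $19, ua
}'

# Two S3 log records; the second differs only in its (assumed) user agent.
lines='3252c3… www.abstractfactory.tv [09/May/2014:15:58:01 +0000] 64.124.28.146 - 6EB0563A2F14BD4A WEBSITE.GET.OBJECT feed.xml "GET /feed.xml HTTP/1.1" 200 - 10897 10897 18 17 "-" "Podcasts/2.0.2" -
3252c3… www.abstractfactory.tv [09/May/2014:15:58:01 +0000] 64.124.28.146 - 6EB0563A2F14BD4A WEBSITE.GET.OBJECT feed.xml "GET /feed.xml HTTP/1.1" 200 - 10897 10897 18 17 "-" "Castro/39 (iPhone; iOS 7.1.1; Scale/2.00)" -'

printf '%s\n' "$lines" | awk "$prog"
# prints:
# 64.124.28.146 - - [09/May/2014:15:58:01 +0000] "GET /feed.xml HTTP/1.1" 200 10897 "-" "Podcasts/2.0.2"
# 64.124.28.146 - - [09/May/2014:15:58:01 +0000] "GET /feed.xml HTTP/1.1" 200 10897 "-" "Castro/39 (iPhone; iOS 7.1.1; Scale/2.00)"
```

The loop stops before `NF` because the last field of an S3 record is the version id, not part of the user agent.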
  13. tail -50 s3logs.txt | awk -F\" '($2 ~ /feed\.xml/) {print $6}' | sort | uniq -c | sort -fr

    16 Castro/39 (iPhone; iOS 7.1.1; Scale/2.00)
     9 Feedbin - 2 subscribers
     7 iTunes/10.7 Downcast/2.8.18.1003
     3 Podcasts/2.0.2
     3 Instacast/4.5.2 (like iTunes/10.1.2)
     2 StitcherBot (MP3 Search Bot for Stitcher Personalized Radio Service)
     2 Overcast/1.0 Podcast Sync (http://overcast.fm/)
     2 Mozilla/5.0 (compatible; Linux x86_64; Mail.RU_Bot/2.0; +http://go.mail.ru/help/robots)
     2 iTunes/11.1.5 (Macintosh; OS X 10.9.2) AppleWebKit/537.75.14
     1 Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0
     1 iTunes/11.1.5 (Macintosh; OS X 10.7.5) AppleWebKit/534.57.7
     1 iTunes/10.7 Downcast/2.8.17.1003
     1 Apache-HttpClient/UNAVAILABLE (java 1.4)
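The pipeline on this slide maps in awk and reduces with `sort | uniq -c`. As an alternative sketch (not what the slide does), the counting can also happen inside awk with an associative array. The four log records below are hypothetical and shortened: `OWNER`, `BUCKET`, `REQ1` to `REQ4`, and the 192.0.2.x addresses are placeholders.

```shell
# Hypothetical S3-style records; only the quoted parts matter for this demo.
logs='OWNER BUCKET [09/May/2014:15:58:01 +0000] 192.0.2.1 - REQ1 WEBSITE.GET.OBJECT feed.xml "GET /feed.xml HTTP/1.1" 200 - 10897 10897 18 17 "-" "Podcasts/2.0.2" -
OWNER BUCKET [09/May/2014:15:58:02 +0000] 192.0.2.2 - REQ2 WEBSITE.GET.OBJECT feed.xml "GET /feed.xml HTTP/1.1" 200 - 10897 10897 18 17 "-" "Podcasts/2.0.2" -
OWNER BUCKET [09/May/2014:15:58:03 +0000] 192.0.2.3 - REQ3 WEBSITE.GET.OBJECT feed.xml "GET /feed.xml HTTP/1.1" 200 - 10897 10897 18 17 "-" "Overcast/1.0 Podcast Sync (http://overcast.fm/)" -
OWNER BUCKET [09/May/2014:15:58:04 +0000] 192.0.2.4 - REQ4 WEBSITE.GET.OBJECT index.html "GET /index.html HTTP/1.1" 200 - 512 512 5 4 "-" "Mozilla/5.0" -'

# -F'"' splits each record on double quotes, so $2 is the request line
# and $6 is the whole user agent, spaces included.
# Map: keep feed.xml requests. Reduce: count records per user agent.
printf '%s\n' "$logs" |
  awk -F'"' '$2 ~ /feed\.xml/ { count[$6]++ }
             END { for (ua in count) print count[ua], ua }' |
  sort -rn
# prints:
# 2 Podcasts/2.0.2
# 1 Overcast/1.0 Podcast Sync (http://overcast.fm/)
```

The trailing `sort -rn` is still needed because awk does not define the order in which `for (ua in count)` visits keys.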
  14. ($9 == 206) { addresses[$1] += $10; }

    END {
      for (x in addresses) {
        print x, addresses[x];
      }
    }

    $9  => Status Code
    206 => HTTP Partial Content
    $1  => IP Address
    $10 => Bytes
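A small sketch of this reduce step, assuming Apache-style access log records where `$1` is the client address, `$9` the status code, and `$10` the byte count. The four records are made up and use documentation addresses; they are not from the talk.

```shell
# Hypothetical access log records: two clients receive partial content (206).
access='192.0.2.1 - - [09/May/2014:15:58:01 +0000] "GET /episode.mp3 HTTP/1.1" 206 1000
192.0.2.2 - - [09/May/2014:15:58:02 +0000] "GET /episode.mp3 HTTP/1.1" 200 5000
192.0.2.1 - - [09/May/2014:15:58:03 +0000] "GET /episode.mp3 HTTP/1.1" 206 500
192.0.2.3 - - [09/May/2014:15:58:04 +0000] "GET /episode.mp3 HTTP/1.1" 206 250'

# Sum the bytes of 206 responses per client address, then print the totals.
# The final sort only makes the output order predictable.
printf '%s\n' "$access" |
  awk '($9 == 206) { addresses[$1] += $10 }
       END { for (x in addresses) print x, addresses[x] }' |
  sort
# prints:
# 192.0.2.1 1500
# 192.0.2.3 250
```

The `END` rule runs once after the last record, which is what turns the per-record accumulation into a single summary.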
  15. AWK(1)                                                        AWK(1)

    NAME
        awk - pattern-directed scanning and processing language

    SYNOPSIS
        awk [ -F fs ] [ -v var=value ] [ 'prog' | -f progfile ] [ file … ]
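To illustrate two of the options in the synopsis, here is a brief sketch that uses `-F` to set the field separator and `-v` to pass a value into the program. The colon-separated input and the variable name `min` are invented for the example.

```shell
# -F: makes the colon the field separator; -v min=2 sets the awk variable min
# before the program starts reading input.
printf 'a:1\nb:2\nc:3\n' | awk -F: -v min=2 '$2 >= min { print $1 }'
# prints:
# b
# c
```

A longer program would usually live in a file and be run with `-f progfile` instead of being quoted on the command line.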
  16. ^D