Python in Action - Part 2 (Systems Programming)

Tutorial presentation, 2007 USENIX LISA conference.

David Beazley

November 16, 2007

Transcript

1. Python in Action (Part II - Systems Programming)
   Presented at the USENIX LISA Conference, November 16, 2007
   David M. Beazley
   http://www.dabeaz.com
   Copyright (C) 2007, http://www.dabeaz.com

2. Section Overview
   • In this section, we're going to get dirty
   • Systems programming
   • Files, I/O, file-system
   • Text parsing, data decoding
   • Processes and IPC
   • Threads and concurrency
   • Networking
3. Commentary
   • I personally think Python is a fantastic tool for systems programming.
   • Modules provide access to most of the major system libraries I used
     to access via C
   • No enforcement of "morality"
   • Good performance
   • It just "works" and it feels right.
4. Approach
   • I've thought long and hard about how I would present this part of
     the class.
   • A reference-manual approach is just going to be long and very boring.
   • So instead, we're going to focus on building something more in tune
     with the times.
5. "To Catch a Slacker"
   • Write some Python programs that can quietly monitor Firefox browser
     caches to find out who has been spending their day reading Slashdot
     instead of working on their TPS reports.
   • Oh yeah, and be a real sneaky bugger about it.
6. Why this Problem?
   • Involves a real-world system and data
   • Firefox already installed on your machine (?)
   • Cross platform (Linux, Mac, Windows)
   • Many opportunities for tool building
   • Related to a variety of practical problems
   • A good tour of "Python in Action"
7. Disclaimers
   • I am not involved in browser forensics (or spyware, for that matter).
   • I am in no way affiliated with Firefox/Mozilla, nor have I ever seen
     the Firefox source code.
   • I had never worked with the cache data prior to preparing this
     tutorial.
   • I have never used any third-party tools for looking at this data.
8. More Disclaimers
   • All of the code in this tutorial works with a standard Python
     installation
   • No third-party modules.
   • All code is cross-platform
   • Code samples are available online at http://www.dabeaz.com/lisa/
   • Please look at that code and follow along
9. Assumptions
   • This is not a review of systems concepts
   • You should be generally familiar with the background material
     (files, filesystems, file formats, processes, threads, networking,
     protocols, etc.)
   • You can "extrapolate" from the material presented here to construct
     more advanced Python applications.
10. Rough Outline
    • The file system and environment
    • Data processing (text and binary)
    • File encoding/decoding
    • Interprocess communication
    • Networking
    • Concurrency
    • Distributed computing
11. The Firefox Cache
    • The Firefox browser keeps a disk cache of recently visited sites:

      % ls Cache/
      -rw------- 1 beazley  111169 Sep 25 17:15 01CC0844d01
      -rw------- 1 beazley  104991 Sep 25 17:15 01CC3844d01
      -rw------- 1 beazley   47233 Sep 24 16:41 021F221Ad01
      ...
      -rw------- 1 beazley   26749 Sep 21 11:19 FF8AEDF0d01
      -rw------- 1 beazley   58172 Sep 25 18:16 FFE628C6d01
      -rw------- 1 beazley 1939456 Sep 25 19:14 _CACHE_001_
      -rw------- 1 beazley 2588672 Sep 25 19:14 _CACHE_002_
      -rw------- 1 beazley 4567040 Sep 25 18:44 _CACHE_003_
      -rw------- 1 beazley   33044 Sep 23 21:58 _CACHE_MAP_

    • A bunch of cryptic files.
12. Problem: Finding Files
    • Find the Firefox cache. Write a program findcache.py that takes a
      directory name as input and recursively scans that directory and
      all subdirectories looking for Firefox/Mozilla cache directories.
    • Example:

      % python findcache.py /Users/beazley
      /Users/beazley/Library/.../qs1ab616.default/Cache
      /Users/beazley/Library/.../wxuoyiuf.slt/Cache
      %

    • Use case: searching on the filesystem.
13. Solution

    # findcache.py
    # Recursively scan a directory looking for
    # Firefox/Mozilla cache directories

    import sys
    import os

    if len(sys.argv) != 2:
        print >>sys.stderr, "Usage: python findcache.py dirname"
        raise SystemExit(1)

    caches = (path for path, dirs, files in os.walk(sys.argv[1])
                   if '_CACHE_MAP_' in files)

    for name in caches:
        print name
14. The sys module (annotating the findcache.py listing above)
    The sys module holds basic information about the execution
    environment:

    sys.argv      List of the command line options
    sys.stdin
    sys.stdout
    sys.stderr    Standard I/O files

    For the example run above:

    sys.argv = ['findcache.py', '/Users/beazley']

15. Program Termination (findcache.py, continued)
    The SystemExit exception forces Python to exit; its value is the
    program's return code:

    raise SystemExit(1)

16. The os Module (findcache.py, continued)
    The os module contains useful OS-related functions (files,
    processes, etc.).

17. os.walk() (findcache.py, continued)
    os.walk(topdir) recursively walks a directory tree and generates a
    sequence of tuples (path, dirs, files):

    path  = the current directory name
    dirs  = list of all subdirectory names in path
    files = list of all regular files (data) in path
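A quick illustration (my example, not from the slides) of what os.walk()
yields; the directory name here is hypothetical:

    import os

    # Print each (path, dirs, files) tuple that os.walk() generates
    for path, dirs, files in os.walk('/tmp/example'):
        print path
        print '   subdirs:', dirs
        print '   files  :', files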
18. A Sequence of Caches (findcache.py, continued)
    The generator expression

    caches = (path for path, dirs, files in os.walk(sys.argv[1])
                   if '_CACHE_MAP_' in files)

    generates a sequence of directory names: the file-name check selects
    directories whose file list contains '_CACHE_MAP_', and the directory
    name is what gets generated as a result.

19. Printing the Result (findcache.py, continued)
    The final for-loop prints the sequence of cache directories that are
    generated.  Note: output is produced immediately, as caches are found.
20. Commentary
    • Our solution is strongly based on a "declarative" programming
      style (again)
    • We simply write out a sequence of operations that produce what
      we want
    • Not focused on the underlying mechanics of how to traverse all of
      the directories.
21. Mini-Reference
    • sys module

      sys.argv           # List of command line options
      sys.stdin          # Standard input
      sys.stdout         # Standard output
      sys.stderr         # Standard error
      sys.executable     # Full path of Python executable
      sys.exc_info()     # Information on current exception

    • os module

      os.walk(dir)       # Recursively walk dir producing a
                         # sequence of tuples (path,dlist,flist)
      os.listdir(dir)    # Return a list of all files in dir

    • SystemExit exception

      raise SystemExit(n)    # Exit with integer code n
22. Problem: Getting File Info
    • Collect some cache information. Create a command-line tool
      cacheinfo.py that searches for cache directories and produces
      reports including directory names, sizes, and modification dates.
    • Example:

      % python cacheinfo.py -s /Users/
      32435354 /Users/beazley/Library/.../qs1ab616.default/Cache
      254100 /Users/beazley/Library/.../wxuoyiuf.slt/Cache
      % python cacheinfo.py -t /Users
      09/26/2007 08:54 /Users/beazley/Library/.../qs1ab616...
      01/29/2007 20:47 /Users/beazley/Library/.../wxuoyiuf.slt...
      %
23. cacheinfo.py usage
    • We're creating a command-line tool:

      python cacheinfo.py [opts] dir1 dir2 ... dirn

      Recursively searches the directories dir1, ..., dirn for Firefox
      browser caches and prints out information.

      Options:
        -s                    Print total size of directory contents
        -t                    Print last modification time
        --sortby=[size|time]  Sort results by size or time

    • Use case: development of command-line oriented tools and utilities.
24. Command Line Parsing

    # cacheinfo.py
    import optparse

    p = optparse.OptionParser()
    p.add_option('-s', action="store_true", dest="size",
                 help="Show total size of each directory")
    p.add_option('-t', action="store_true", dest="time",
                 help="Show last modification date")
    p.add_option('--sortby', action="store", dest="sortby",
                 type="choice", choices=["size", "time"],
                 help="Sort by 'size' or 'time'")

    opt, args = p.parse_args()
25. optparse Module (annotating the cacheinfo.py listing above)
    optparse is a module for parsing Unix-style command line options,
    a problem that sounds simple, but which is not in practice.
    Rule of thumb: Python comes with modules that deal with common
    programming problems.

26. optparse Module (cacheinfo.py, continued)
    First, create an OptionParser object and configure it with the
    available options (the p.add_option() calls above).

27. optparse Module (cacheinfo.py, continued)
    p.parse_args() then parses the command line.  For

    python cacheinfo.py -s --sortby=time dir1 dir2

    sys.argv = ['cacheinfo.py', '-s', '--sortby=time', 'dir1', 'dir2']

    and after parsing:

    opt.size   = True
    opt.time   = None
    opt.sortby = 'time'
    args       = ['dir1', 'dir2']

28. Sample Use

    % python cacheinfo.py -f /Users
    Usage: cacheinfo.py [options]
    cacheinfo.py: error: no such option: -f
    % python cacheinfo.py -s --sortby=owner /Users
    Usage: cacheinfo.py [options]
    cacheinfo.py: error: option --sortby: invalid choice: 'owner'
    (choose from 'size', 'time')
    % python cacheinfo.py -h
    Usage: cacheinfo.py [options]

    Options:
      -h, --help       show this help message and exit
      -s               Show total size of each directory
      -t               Show last modification date
      --sortby=SORTBY  Sort by 'size' or 'time'
29. Collecting File Metadata

    # cacheinfo.py
    ...
    opt, args = p.parse_args()

    import os
    cacheinfo = []
    for dirname in args:
        for path, dirs, files in os.walk(dirname):
            if '_CACHE_MAP_' in files:
                fnames = [os.path.join(path, name) for name in files]
                size = sum(os.path.getsize(name) for name in fnames)
                mtime = max(os.path.getmtime(name) for name in fnames)
                cacheinfo.append((path, size, mtime))
30. Collecting File Metadata (annotating the listing above)
    General idea:
    • Walk over a sequence of directories as before.
    • For each cache directory, get the total size of all files and the
      most recent modification date
    • Store the information in a list of tuples:

      cacheinfo = [
        (dirname, bytes, mtime),
        (dirname, bytes, mtime),
        ...
      ]

31. os.path Module (continued)
    os.path has portable file-related functions:

    os.path.join(name1,name2,...)    # Join path names
    os.path.getsize(filename)        # Get the file size
    os.path.getmtime(filename)       # Get modification date

    There are many more functions, but this is the preferred module for
    basic filename handling.

32. os.path.join() (continued)
    Creates a fully expanded pathname:

    path = '/foo/bar'
    files = ['file1', 'file2', ...]
    fnames = [os.path.join(path, name) for name in files]
    # fnames = ['/foo/bar/file1', '/foo/bar/file2', ...]

    Note: use of os.path solves cross-platform issues related to
    pathnames ('/' vs. '\').

33. Getting Metadata (continued)
    sum() and max() perform reductions across the list of filenames:

    fnames = ['/foo/bar/file1', '/foo/bar/file2', ...]
    size = sum([os.path.getsize('/foo/bar/file1'),
                os.path.getsize('/foo/bar/file2'),
                ...])

    The argument looks funny, but it's really just generating a sequence
    of values as input.

34. Data Representation (continued)
    The data is collected as a list of tuples:

    cacheinfo = [
      (dirname, bytes, mtime),
      (dirname, bytes, mtime),
      ...
    ]
35. Commentary
    • Again, making strong use of declarative style to collect data:

      size = sum(os.path.getsize(name) for name in fnames)

    • Compare to this:

      size = 0
      for name in fnames:
          size += os.path.getsize(name)

    • The choice of programming style is mostly a matter of personal
      preference
    • I personally tend to prefer the first approach
36. Producing Results

    # cacheinfo.py
    ...
    cacheinfo = []
    ...
    if opt.sortby == 'size':
        cacheinfo.sort(key=lambda x: x[1])
    elif opt.sortby == 'time':
        cacheinfo.sort(key=lambda x: x[2])

    import time
    for path, size, mtime in cacheinfo:
        if opt.size:
            print size,
        if opt.time:
            tm = time.localtime(mtime)
            print time.strftime("%m/%d/%Y %H:%M", tm),
        print path
37. Sorting (Revisited) (annotating the listing above)
    Here we are sorting based on an optional command line option
    (--sortby=[size|time]).  .sort() sorts a list "in place."

38. Sorting with a key func (continued)
    key= supplies a function that produces the keys used when comparing
    elements in the sort.  lambda creates a function from a single
    expression; these are equivalent:

    cacheinfo.sort(key=lambda x: x[2])

    def keyfunc(x):
        return x[2]
    cacheinfo.sort(key=keyfunc)

39. More on lambda (continued)
    lambda is used to create small anonymous functions, almost always
    reserved for simple callbacks.  These statements are equivalent:

    def add(x, y):
        return x + y

    add = lambda x, y: x + y

    lambda is used sparingly.

40. Back to Sorting (continued)
    Recall what is being sorted:

    cacheinfo = [
      (dirname, bytes, mtime),
      (dirname, bytes, mtime),
      ...
    ]

41. Printing Output (continued)
    The print statements emit optional fields; a trailing comma omits
    the newline.

42. Date/time Handling (continued)
    The time module contains functions related to system time values
    (e.g., seconds since 1970):

    time.localtime(s)       # Make time structure
    time.time()             # Get current time
    time.strftime(fmt, tm)  # Format time string
    time.clock()            # CPU clock
    time.sleep(s)           # Sleep for s seconds
    ...
43. Mini-Reference
    • os.path module

      os.path.join(s1,s2,...)    # Join pathname parts together
      os.path.getsize(path)      # Get file size of path
      os.path.getmtime(path)     # Get modify time of path
      os.path.getatime(path)     # Get access time of path
      os.path.getctime(path)     # Get creation time of path
      os.path.exists(path)       # Check if path exists
      os.path.isfile(path)       # Check if regular file
      os.path.isdir(path)        # Check if directory
      os.path.islink(path)       # Check if symbolic link
      os.path.basename(path)     # Return file part of path
      os.path.dirname(path)      # Return directory part of path
      os.path.abspath(path)      # Get absolute path

    • time module

      time.time()             # Current time (seconds)
      time.localtime([s])     # Turn seconds into a structure
                              # of different parts (hour,min,etc.)
      time.ctime([s])         # Time as a string
      time.strftime(fmt,tm)   # Time string formatting
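Putting the two modules together (my example; any readable file path
works here, '/etc/hosts' is just an illustration):

    import os.path, time

    path = '/etc/hosts'                # hypothetical example file
    print os.path.getsize(path)        # size in bytes
    mtime = os.path.getmtime(path)     # seconds since 1970
    print time.strftime('%m/%d/%Y %H:%M', time.localtime(mtime))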
44. Interlude: Tool Building
    • So far, we have been exploring some of the basic machinery for
      building tools
    • Program environment (sys module)
    • Filesystem (os, os.path modules)
    • Command line processing (optparse)
    • A lot of this would provide the basic framework for more advanced
      applications
45. Problem: Searching Data
    • Extract all requests in the cache. Write a program cachecontents.py
      that scans the contents of the _CACHE_00n_ files and prints a list
      of URLs for documents stored in the cache.
    • Example:

      % python cachecontents.py /Users/.../qs1ab616.default/Cache
      http://www.yahoo.com/
      http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/js/ad_eo_1
      http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
      http://us.i1.yimg.com/us.yimg.com/i/ww/thm/1/search_1.1.png
      ...
      %

    • Use case: searching files for text or specific data patterns.
46. The Firefox Cache
    • The cache directory holds two types of data
      • Metadata (URLs, headers, etc.)
      • Raw data (HTML, JPEG, PNG, etc.)
    • This data is stored in two places
      • Cryptic files in the Cache directory
      • Blocks inside the _CACHE_00n_ files
47. Possible Solution: Regex
    • The _CACHE_00n_ files are encoded in a binary format, but URLs are
      embedded inside:

      \x00\x01\x00\x08\x92\x00\x02\x18\x00\x00\x00\x13F\xff\x9f
      \xceF\xff\x9f\xce\x00\x00\x00\x00\x00\x00H)\x00\x00\x00\x1a
      \x00\x00\x023HTTP:http://slashdot.org/\x00request-method\x00
      GET\x00request-User-Agent\x00Mozilla/5.0 (Macintosh; U; Intel
      Mac OS X; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7\x00
      request-Accept-Encoding\x00gzip,deflate\x00response-head\x00
      HTTP/1.1 200 OK\r\nDate: Sun, 30 Sep 2007 13:07:29 GMT\r\n
      Server: Apache/1.3.37 (Unix) mod_perl/1.29\r\nSLASH_LOG_DATA:
      shtml\r\nX-Powered-By: Slash 2.005000176\r\nX-Fry: How can I
      live my life if I can't tell good from evil?\r\nCache-Control: ...

    • Maybe the requests could just be ripped using a regular expression.
48. A Regex Solution

    # cachecontents.py
    import re
    import os
    import sys

    cachedir = sys.argv[1]
    cachefiles = ['_CACHE_001_', '_CACHE_002_', '_CACHE_003_']

    # A regex for URL strings
    request_pat = re.compile('(http:|https:)//.*?\x00')

    # Loop over all files and search for URLs
    for name in cachefiles:
        data = open(os.path.join(cachedir, name), "rb").read()
        for m in request_pat.finditer(data):
            print m.group()
49. The re module (annotating the cachecontents.py listing above)
    The re module contains all functionality related to regular
    expression pattern matching, searching, replacing, etc.  Features
    are strongly influenced by Perl, but regexes are not directly
    integrated into the Python language.

50. Using re (continued)
    Patterns are first specified as strings and compiled into a regex
    object:

    pat = re.compile(pattern [, flags])

    The pattern syntax is "standard":

    pat* pat+ pat? (pat) . pat1|pat2 [chars] [^chars] pat{n} pat{n,m}

51. Using re (continued)
    All subsequent operations are methods of the compiled regex pattern:

    m = pat.match(data)              # Check for match
    m = pat.search(data)             # Search for match
    newdata = pat.sub(repl, data)    # Pattern replace
    allmatches = pat.findall(data)   # Find all matches
    for m in pat.finditer(data):     # Iterate over matches
        ...

52. re Matches (continued)
    Regex matches are represented by a MatchObject:

    m.group([n])    # Text matched by group n
    m.start([n])    # Starting index of group n
    m.end([n])      # End index of group n

    In our loop, m.group() prints the matched text of the entire pattern.
53. Commentary on Solution
    • This regex approach is mostly a hack for this particular
      application.
    • Reads entire cache files into memory as strings (may be quite large)
    • Only finds URLs, no other metadata
    • Returns false positives, since the cache also contains document
      data (e.g., HTML pages with embedded URL links)
54. Mini-Reference
    • Searching for text using re:

      datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
      m = datepat.search(data)
      if m:
          fulltext = m.group()    # Text of complete match
          month = m.group(1)      # Text of specific groups
          day   = m.group(2)
          year  = m.group(3)

    • Replacement example:

      def euro_date(m):
          month = m.group(1)
          day   = m.group(2)
          year  = m.group(3)
          return "%s/%s/%s" % (day, month, year)

      newdata = datepat.sub(euro_date, data)
55. Problem: Parsing Data
    • Extract the cache data (for real). Write a module ffcache.py that
      contains a set of functions for reading Firefox cache data into
      useful data structures that can be used by other programs.
      Capture all available information, including URLs, timestamps,
      sizes, locations, etc.
    • Use case: blood and guts. Writing programs that can process
      foreign file formats. Processing binary-encoded data. Creating
      code for later reuse.
56. The Firefox Cache
    • There are four critical files:

      _CACHE_MAP_    # Cache index
      _CACHE_001_    # Cache data
      _CACHE_002_    # Cache data
      _CACHE_003_    # Cache data

    • All files are binary-encoded
    • _CACHE_MAP_ is used by Firefox to locate data, but it is not
      updated until Firefox exits.
    • We will ignore _CACHE_MAP_ since I want to observe caches of live
      Firefox sessions.
57. Firefox _CACHE_ Files
    • _CACHE_00n_ file organization: a free/used block bitmap
      (4096 bytes) followed by up to 32768 blocks.
    • The block size varies according to the file:

      _CACHE_001_    256 byte blocks
      _CACHE_002_    1024 byte blocks
      _CACHE_003_    4096 byte blocks
58. Cache Entries
    • Each cache entry:
      • Occupies a maximum of 4 cache blocks
      • Can be either data or metadata
      • If >16K, it is written to a file instead
    • Notice how all the "cryptic" files are >16K:

      -rw------- beazley  111169 Sep 25 17:15 01CC0844d01
      -rw------- beazley  104991 Sep 25 17:15 01CC3844d01
      -rw------- beazley   47233 Sep 24 16:41 021F221Ad01
      ...
      -rw------- beazley   26749 Sep 21 11:19 FF8AEDF0d01
      -rw------- beazley   58172 Sep 25 18:16 FFE628C6d01
59. Cache Metadata
    • Metadata is encoded as a binary structure:

      Header            36 bytes
      Request String    variable length (size given in header)
      Request Info      variable length (size given in header)

    • Header encoding (binary, big-endian):

      bytes    field          type
      0-3      magic (???)    unsigned int (0x00010008)
      4-7      location       unsigned int
      8-11     fetchcount     unsigned int
      12-15    fetchtime      unsigned int (system time)
      16-19    modifytime     unsigned int (system time)
      20-23    expiretime     unsigned int (system time)
      24-27    datasize       unsigned int (byte count)
      28-31    requestsize    unsigned int (byte count)
      32-35    infosize       unsigned int (byte count)
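As a sanity check (my addition, not on the slide), the struct module
confirms that nine big-endian unsigned ints occupy exactly the 36 bytes
shown above:

    import struct
    print struct.calcsize('>9I')    # prints 36, the header size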
60. Solution Outline
    • Part 1: Parsing metadata headers
    • Part 2: Getting request information
    • Part 3: Scanning each cache file
    • Part 4: Collecting cache data from an entire cache directory
61. Part I - Reading Headers
    • Write a function that can parse the metadata header and return the
      data in a useful format
    • Add some checks that can bail out if the data does not look like a
      valid header.
62. Reading Headers

    import struct

    # This function parses a cache metadata header into a dict
    # of named fields (listed in _headernames below)

    _headernames = ['magic', 'location', 'fetchcount',
                    'fetchtime', 'modifytime', 'expiretime',
                    'datasize', 'requestsize', 'infosize']

    def parse_meta_header(headerdata, maxsize):
        head = struct.unpack(">9I", headerdata)
        magic = head[0]
        rsize = head[7]
        isize = head[8]
        if magic != 0x00010008 or (rsize + isize + 36) > maxsize:
            return None
        meta = dict(zip(_headernames, head))
        return meta
63. Reading Headers
    • How this is supposed to work:

      >>> f = open("Cache/_CACHE_001_","rb")
      >>> f.seek(4096)              # Skip the bit map
      >>> headerdata = f.read(36)   # Read 36 byte header
      >>> meta = parse_meta_header(headerdata,1024)
      >>> meta
      {'fetchtime': 1190829792, 'requestsize': 27, 'magic': 65544,
       'fetchcount': 3, 'expiretime': 0, 'location': 2449473536L,
       'modifytime': 1190829792, 'datasize': 29448, 'infosize': 531}
      >>>

    • Basically, we're parsing the header into a useful Python data
      structure.
64. struct module (annotating parse_meta_header() above)
    struct parses binary-encoded data into Python objects.  You would
    use this module to pack/unpack raw binary data from Python strings.
    Sample format codes:

    'i'    int
    'I'    unsigned int
    'h'    short
    'H'    unsigned short
    'd'    double
    ...
    '>'    Big-endian
    '<'    Little-endian
    '!'    Network (big-endian)

    A leading count repeats a code: '9I' means nine unsigned ints.
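A tiny round-trip example (my example, not from the slides) showing
pack() and unpack() with an explicit byte order:

    import struct

    # Pack one unsigned int and two unsigned shorts, big-endian
    data = struct.pack('>IHH', 0x00010008, 2007, 11)
    print len(data)                    # 8 bytes: 4 + 2 + 2
    print struct.unpack('>IHH', data)  # (65544, 2007, 11)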
65. Header Validation (parse_meta_header(), continued)
    The unpacked data is a tuple:

    head = (n, n, n, n, n, n, n, n, n)

    The magic/size test is a check that sees if the unpacked data looks
    valid; if not, the function returns None.

66. Dictionary Creation (parse_meta_header(), continued)
    zip(s1, s2) makes a list of tuples, from which dict() makes a
    dictionary:

    zip(_headernames, head)
    # [('magic', head[0]),
    #  ('location', head[1]),
    #  ('fetchcount', head[2]),
    #  ...]

67. Commentary
    • Dictionaries as data structures:

      meta = {
        'fetchtime'   : 1190829792,
        'requestsize' : 27,
        'magic'       : 65544,
        'fetchcount'  : 3,
        'expiretime'  : 0,
        'location'    : 2449473536L,
        'modifytime'  : 1190829792,
        'datasize'    : 29448,
        'infosize'    : 531
      }

    • Useful if the data has many parts:

      data = f.read(head[8])             # Huh?!?

      vs.

      data = f.read(meta['infosize'])    # Better
68. Part 2: Reading Requests
    • Write a function that will read the request string and request
      information
    • Request string: a null-terminated string
    • Request info: a sequence of null-terminated key-value pairs
      (like a dictionary)
69. Reading Requests

    # Given a dictionary of header information and a file,
    # this function extracts the request data from a cache
    # metadata entry and saves it in the dictionary.  Returns
    # True or False depending on success.

    def read_request_data(meta, f):
        request = f.read(meta['requestsize']).strip('\x00')
        infodata = f.read(meta['infosize']).strip('\x00')
        # Validate request and infodata here (nothing now)
        # Turn the infodata into a dictionary
        parts = infodata.split('\x00')
        info = dict(zip(parts[::2], parts[1::2]))
        meta['request'] = request.split(':', 1)[1]
        meta['info'] = info
        return True
70. Usage: Requests
    • Usage of the function:

      >>> f = open("Cache/_CACHE_001_","rb")
      >>> f.seek(4096)              # Skip the bit map
      >>> headerdata = f.read(36)   # Read 36 byte header
      >>> meta = parse_meta_header(headerdata,1024)
      >>> read_request_data(meta,f)
      True
      >>> meta['request']
      'http://www.yahoo.com/'
      >>> meta['info']
      {'request-method': 'GET', 'request-User-Agent': 'Mozilla/5.0
      (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.7) Gecko/20070914
      Firefox/2.0.0.7', 'charset': 'UTF-8', 'response-head': 'HTTP/1.1
      200 OK\r\nDate: Wed, 26 Sep 2007 18:03:17 ...'}
      >>>
71. String Stripping (annotating read_request_data() above)
    Here, we just read the request string followed by the request info
    string.  Both end with a NUL, which .strip('\x00') removes.

72. String Splitting (continued)
    The request info is a string of NUL-separated key/value pairs.
    After stripping the trailing NUL:

    infodata = 'key\x00value\x00key\x00value\x00key\x00value'
    parts = infodata.split('\x00')
    # parts = ['key','value','key','value','key','value']

73. Advanced List Slicing (continued)
    We can slice a list with a stride:

    parts = ['key','value','key','value','key','value']
    parts[::2]     # ['key','key','key']
    parts[1::2]    # ['value','value','value']

    zip(parts[::2], parts[1::2])
    # [('key','value'), ('key','value'), ('key','value')]

    dict() then makes a dictionary from those pairs.

74. Fixing the Request (continued)
    Cleaning up the request string:

    request = "HTTP:http://www.google.com"
    request.split(':',1)       # ['HTTP', 'http://www.google.com']
    request.split(':',1)[1]    # 'http://www.google.com'
75. Commentary
    • Emphasize that Python has very powerful list manipulation
      primitives
      • Indexing
      • Slicing
      • List comprehensions
      • Etc.
    • Knowing how to use these leads to rapid development and compact
      code
76. Part 3: File Scanning
    • Write a function that scans a cache file and produces a sequence
      of records containing all of the cache metadata.
    • This is just one more of our building blocks
    • The goal is to hide some of the nasty bits
77. File Scanning

    # Scan a cache file from beginning to end, producing a
    # sequence of dictionaries with metadata information

    def scan_cachefile(f, blocksize):
        maxsize = 4 * blocksize    # Maximum size of an entry
        f.seek(4096)               # Skip the bit-map
        while True:
            headerdata = f.read(36)
            if not headerdata:
                break
            header = parse_meta_header(headerdata, maxsize)
            if header and read_request_data(header, f):
                yield header
            # Move the file pointer to the next block
            fp = f.tell()
            if fp % blocksize:
                f.seek(blocksize - (fp % blocksize), 1)
78. Usage: File Scanning
    • Usage of the scan function:

      >>> f = open("Cache/_CACHE_001_","rb")
      >>> for meta in scan_cachefile(f,256):
      ...     print meta['request']
      ...
      http://www.yahoo.com/
      http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/
      http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
      ...

    • We can just open up a cache file and write a for-loop to iterate
      over all of the metadata entries.
79. Python File I/O (annotating scan_cachefile() above)
    File objects are modeled after ANSI C: files are just bytes, and a
    file pointer keeps track of the current position.

    f.read()          # Read bytes
    f.tell()          # Current file pointer
    f.seek(n, off)    # Move file pointer
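To make the block-alignment trick concrete, here is a small sketch
(my example; it assumes a cache file is present in the current
directory):

    f = open('_CACHE_001_', 'rb')
    f.read(100)                    # file pointer now at offset 100
    fp = f.tell()
    blocksize = 256
    if fp % blocksize:
        # whence=1 seeks relative to the current position
        f.seek(blocksize - (fp % blocksize), 1)
    print f.tell()                 # 256: the start of the next block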
80. Using Earlier Code (scan_cachefile(), continued)
    Here we are using the header- and request-parsing functions written
    in the previous parts.

81. Generating Results (scan_cachefile(), continued)
    We are using yield to produce data for a single metadata entry.  If
    someone uses a for-loop, they will get all of the entries.
    Note: this allows us to process the cache without reading all of the
    data into memory.

82. Commentary
    • Have created a function that can scan a single _CACHE_00n_ file
      and produce a sequence of dictionaries with metadata.
    • It's still somewhat low-level
    • Just need to package it a little better
83. Part 4: Scan the Cache
    • Write a function that takes the name of a Firefox cache directory,
      scans all of the cache files for metadata, and produces a useful
      sequence of results.
    • Make it real easy to extract data
84. Solution: Cache Scan

    # Given the name of a Firefox cache directory, this function
    # scans all of the _CACHE_00n_ files for metadata.  A sequence
    # of dictionaries containing metadata is returned.

    import os
    def scan_cache(cachedir):
        n = 1
        blocksize = 256
        while n <= 3:
            cname = "_CACHE_00%d_" % n
            cfile = open(os.path.join(cachedir, cname), "rb")
            for meta in scan_cachefile(cfile, blocksize):
                meta['cachedir'] = cachedir
                meta['cachefile'] = cname
                yield meta
            cfile.close()
            n += 1
            blocksize *= 4
85. Solution: Cache Scan (annotating scan_cache() above)
    General idea: we loop over the three _CACHE_00n_ files and produce a
    sequence of the metadata dictionaries.

86. More Generation (scan_cache(), continued)
    By using yield here, we are chaining together the results obtained
    from all three cache files into one big long sequence of results.
    The underlying mechanics and implementation details are hidden
    (the user doesn't care).

87. Additional Data (scan_cache(), continued)
    The meta['cachedir'] and meta['cachefile'] assignments add path and
    file information to each dictionary (may be useful later).
88. Usage: Cache Scan
    • Usage of the scan function:

      >>> for meta in scan_cache("Cache/"):
      ...     print meta['request']
      ...
      http://www.yahoo.com/
      http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/
      http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
      ...

    • Given the name of a cache directory, we can just loop over all of
      the metadata.  Trivial!
    • With work, we could perform various kinds of queries and
      processing of the data
89. A Mini-Example
    • Find all requests related to Slashdot:

      >>> for meta in scan_cache("Cache/"):
      ...     if 'slashdot' in meta['request']:
      ...         print meta['request']
      ...
      http://www.slashdot.org/
      http://images.slashdot.org/topics/topiccommunications.gif
      http://images.slashdot.org/topics/topicstorage.gif
      http://images.slashdot.org/comments.css?T_2_5_0_176
      ...

    • Well, that was pretty easy.
90. Intermission
    • Have written two programs:
      • findcache.py: a program that locates Firefox cache directories
        on the file system
      • ffcache.py: a set of utility functions for extracting cache
        metadata
    • Have taken a moderately complex data processing problem and
      simplified it.
    • < 100 lines of code.
91. Data Encoding
    • Getting more information: all documents on the Internet have some
      type, encoding, and character set.  For example, a 'text/html'
      file that uses the 'UTF-8' character set and is encoded using gzip
      compression.
    • Your problem: write a function add_content_info(m) that inspects
      the cache metadata and adds additional information concerning the
      content type, encoding, and charset.
92. HTTP Responses
    • The cache metadata includes an HTTP response header carrying the
      content type, character set, and encoding:

      >>> print meta['info']['response-head']
      HTTP/1.1 200 OK
      Date: Sat, 29 Sep 2007 20:51:37 GMT
      Cache-Control: private
      Vary: User-Agent
      Content-Type: text/html; charset=utf-8
      Content-Encoding: gzip
      >>>
93. Solution

    # Given a metadata dictionary, this function adds additional
    # fields related to the content type, charset, and encoding

    import email
    def add_content_info(meta):
        info = meta['info']
        if 'response-head' not in info:
            content = None
            encoding = None
            charset = None
        else:
            rhead = info.get('response-head').split("\n", 1)[1]
            m = email.message_from_string(rhead)
            content = m.get_content_type()
            encoding = m.get('content-encoding', None)
            charset = m.get_content_charset()
        meta['content-type'] = content
        meta['content-encoding'] = encoding
        meta['charset'] = charset
94. Internet Data Handling (annotating add_content_info() above)
    Python has a vast assortment of internet data handling modules.
    The email module parses email messages, MIME headers, etc.
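A standalone sketch (the headers here are made up for illustration) of
the email calls used above:

    import email

    m = email.message_from_string(
        "Content-Type: text/html; charset=utf-8\r\n"
        "Content-Encoding: gzip\r\n\r\n")
    print m.get_content_type()         # 'text/html'
    print m.get_content_charset()      # 'utf-8'
    print m.get('content-encoding')    # 'gzip'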
95. Modified Cache Scan

    # A cache scanning function that adds the content
    # information to the metadata returned.

    def scan_cache_withcontent(cachedir):
        for meta in scan_cache(cachedir):
            add_content_info(meta)    # Add content info
            yield meta
96. A Mini-Example
    • Find all large JPEG images in the cache:

      >>> jpegs = (meta for meta in scan_cache_withcontent("Cache/")
      ...              if meta['content-type'] == 'image/jpeg'
      ...              and meta['datasize'] > 100000)
      >>> for j in jpegs:
      ...     print j['request']
      ...
      http://images.salon.com/ent/video_dog/comedy/2007/09/27/cereal/story.jpg
      http://images.salon.com/ent/video_dog/ifc/2007/09/28/apocalypse/story.jpg
      http://www.lakesideinns.com/images/fallroadphoto2006.jpg
      ...
      >>>

    • That was also pretty easy.
97. File Inspector
    • Getting cached documents: add a function getdata(m) to ffcache.py
      that takes a metadata dictionary and attempts to return the actual
      data that's stored in the cache.  If the data has been compressed
      or encoded in some manner, decode it before returning.
    • Big picture: more file I/O, but with the added problem of dealing
      with various data encoding issues.
98. Cache Locations
    • The Firefox cache metadata does not encode the location of the
      corresponding data.
    • The location is only stored in _CACHE_MAP_
    • _CACHE_MAP_ is only updated when Firefox terminates (it is held in
      memory).
    • So, we won't be able to accurately pull data from a live session.
99. A File Heuristic
    • Look for cache files that have the same file size as that encoded
      in the metadata.
    • If there are duplicates, pick the one with the modification date
      closest to that in the metadata.
    • If none of this works, don't worry about it.
100. Guessing at the File

     # Attempt to locate the matching cache file (if possible)

     import glob
     def getcandidate(meta):
         datasize = meta['datasize']
         filepat = os.path.join(meta['cachedir'], '[0-9A-F]*')
         filelist = glob.glob(filepat)
         # Get all files that are the same size
         samesize = [name for name in filelist
                          if os.path.getsize(name) == datasize]
         if not samesize:
             return ""
         # Get file with closest modification time
         mtime = meta['modifytime']
         delta, filename = min((abs(mtime - os.path.getmtime(name)), name)
                               for name in samesize)
         return filename
101. glob Module (annotating getcandidate() above)
     The glob module returns a list of filenames matching a pattern.
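For example (a sketch; this assumes the Cache/ directory from earlier):

    import glob

    # All entries whose names start with a hex digit, the same
    # pattern used in getcandidate()
    for name in glob.glob('Cache/[0-9A-F]*'):
        print name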
102. More File Operations (getcandidate(), continued)
     The list comprehension gets all files with the correct size; the
     min() over (time delta, name) tuples picks the file with the
     closest modification time.
103. Read a Cache File

     # Given a metadata entry, this attempts to read the associated
     # data (assuming it can be located)

     import gzip, codecs
     def getdata(meta):
         filename = getcandidate(meta)
         if not filename:
             return ""
         encoding = meta.get("content-encoding", "")
         if encoding == 'gzip':
             f = gzip.open(filename)
         else:
             f = open(filename, "rb")
         charset = meta.get("charset", None)
         if charset:
             reader = codecs.getreader(charset)(f)
         else:
             reader = f
         try:
             data = reader.read()
         except (IOError, ValueError):
             return ""
         return data
104. Internet Data Encoding (annotating getdata() above)
     When working with foreign data, it is critical to be able to work
     with different encodings.
     • gzip: read/write gzip-encoded files.
     • codecs: read/write different character encodings (UTF-8, UTF-16,
       Big5, etc.)
     There are many similar modules.
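A minimal round trip combining both modules (my example; the file path
is hypothetical):

    import gzip, codecs

    f = gzip.open('/tmp/sample.gz', 'wb')
    f.write(u'Jalape\xf1o'.encode('utf-8'))    # gzip-compressed UTF-8
    f.close()

    reader = codecs.getreader('utf-8')(gzip.open('/tmp/sample.gz'))
    print repr(reader.read())                  # u'Jalape\xf1o'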
105. Reading gzip Files (getdata(), continued)
     If the file is encoded as 'gzip', we open it using the gzip module;
     otherwise, we open it as a normal file.

106. Character Encoding (getdata(), continued)
     The codecs module is used to deal with special character encodings
     such as UTF-8.  Here, we are putting a codecs "reader" around the
     file if a charset was specified; otherwise, we just use the
     original file.
107. More on Data Encoding
     • Python has full support for Unicode
     • Unicode strings:

       pepper = u'Jalape\xf1o'

     • An advanced (and painful) topic
     • For internet applications, you should assume that text data will
       always be encoded according to some standard codec (Latin-1,
       ISO-8859-1, UTF-8, Big5, etc.)
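For instance (my example), the same unicode string under two common
encodings:

    pepper = u'Jalape\xf1o'
    print repr(pepper.encode('utf-8'))             # 'Jalape\xc3\xb1o'
    print repr(pepper.encode('latin-1'))           # 'Jalape\xf1o'
    print repr('Jalape\xc3\xb1o'.decode('utf-8'))  # u'Jalape\xf1o'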
  108. Copyright (C) 2007, http://www.dabeaz.com 1- Where do we stand? •

findcache.py. A program that can locate Firefox cache directories. (10 lines)
• ffcache.py. A module that can read cache metadata, determine content encoding, and read files from the cache (assuming they can be located). (140 lines)

108
  109. Copyright (C) 2007, http://www.dabeaz.com 1- Problem : CacheSpy • Big

Brother (make an evil sound here)

109

Write a program that first locates all of the Firefox cache directories under a given directory. Then have that program run forever as a network server, waiting for connections. On each connection, send back all of the current cache metadata.

• Big Picture: We're going to write a daemon that will find and quietly monitor browser caches.
  110. Copyright (C) 2007, http://www.dabeaz.com 1- Part I : Running a

Program

• Find all Firefox cache directories
• Didn't we already write that code?
• Yes. Let's run that program as a subprocess and collect its output.
• Problem: How to run other processes?

110
  111. Copyright (C) 2007, http://www.dabeaz.com 1- Solution : Pipes # Run

the findcache.py program as a subprocess and
# collect the output.

import popen2
import sys

def findcaches(topdir):
    cmd = sys.executable + " findcache.py " + topdir
    out,inp = popen2.popen2(cmd)
    inp.close()       # No input to subprocess
    caches = [line.strip() for line in out]
    return caches

111
  112. Copyright (C) 2007, http://www.dabeaz.com 1- popen2 module # Run the

    findcache.py program as a subprocess and # collect the output. import popen2 import sys def findcaches(topdir): cmd = sys.executable + " findcache.py " + topdir out,inp = popen2.popen2(cmd) inp.close() # No input to subprocess caches = [line.strip() for line in out] return caches 112 popen2 Contains functions for launching subprocesses and creating pipes.
  113. Copyright (C) 2007, http://www.dabeaz.com 1- Shell Commands # Run the

    findcache.py program as a subprocess and # collect the output. import popen2 import sys def findcaches(topdir): cmd = sys.executable + " findcache.py " + topdir out,inp = popen2.popen2(cmd) inp.close() # No input to subprocess caches = [line.strip() for line in out] return caches 113 We create a full shell command and execute it. cmd = "/usr/local/bin/python2.5 findcache.py topdir" sys.executable Full path of python intepreter.
  114. Copyright (C) 2007, http://www.dabeaz.com 1- Usage >>> caches = findcaches("/Users")

>>> caches
['/Users/beazley/Library/Caches/Firefox/Profiles/qs1ab616.default/Cache',
 '/Users/beazley/Library/Mozilla/Profiles/default/wxuoyiuf.slt/Cache']
>>>

114

• Usage of the findcaches function
• Commentary: Using pipes might be overkill for this. This is mostly just to illustrate.
  115. Copyright (C) 2007, http://www.dabeaz.com 1- More on Processes • Python

has extensive support for processes

• os module (pipes, fork, exec, spawnv, wait, exit, system, ttys)
• popen2 (pipes)
• subprocess (high level subprocess API)
• signal (signal handling)
• Almost any C-level program written using POSIX calls can be implemented in Python

115
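As a point of comparison, here is a sketch of the same findcaches() function written against the higher-level subprocess module instead of popen2 (not from the original slides; assumes Python 2.4 or later):

import subprocess, sys

def findcaches(topdir):
    # Launch findcache.py and collect its stdout, no shell involved
    p = subprocess.Popen([sys.executable,"findcache.py",topdir],
                         stdout=subprocess.PIPE)
    caches = [line.strip() for line in p.stdout]
    p.wait()
    return caches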
  116. Copyright (C) 2007, http://www.dabeaz.com 1- Interactive Processes • Python does

not have built-in support for controlling interactive subprocesses (e.g., "Expect")

• Must install third party modules for this
• Example: pexpect
    http://pexpect.sourceforge.net

116
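For a flavor of what that looks like, a hypothetical pexpect session (not from the slides; assumes pexpect is installed, and the host name is made up):

import pexpect

child = pexpect.spawn("ftp ftp.example.com")   # made-up host
child.expect("Name .*: ")
child.sendline("anonymous")
child.expect("Password:")
child.sendline("guest@example.com")
child.expect("ftp> ")
child.sendline("quit")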
  117. Copyright (C) 2007, http://www.dabeaz.com 1- Part 2 : Making a

Server

• Write a simple server program that sends back all metadata when it receives a connection.

117
  118. Copyright (C) 2007, http://www.dabeaz.com 1- CacheSpy Server import pickle, sys,

SocketServer, ffcache

cachelist = findcaches(sys.argv[1])

def dump_cache(f):
    for dir in cachelist:
        for meta in ffcache.scan_cache_withcontent(dir):
            pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",31337),SpyHandler)
print "CacheSpy running on port 31337"
serv.serve_forever()

118
  119. Copyright (C) 2007, http://www.dabeaz.com 1- SocketServer Module import pickle, sys,

SocketServer, ffcache

cachelist = findcaches(sys.argv[1])

def dump_cache(f):
    for dir in cachelist:
        for meta in ffcache.scan_cache_withcontent(dir):
            pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",31337),SpyHandler)
print "CacheSpy running on port 31337"
serv.serve_forever()

119

SocketServer. A module for easily creating low-level internet applications using sockets.
  120. Copyright (C) 2007, http://www.dabeaz.com 1- SocketServer Module import pickle, sys,

SocketServer, ffcache

cachelist = findcaches(sys.argv[1])

def dump_cache(f):
    for dir in cachelist:
        for meta in ffcache.scan_cache_withcontent(dir):
            pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",31337),SpyHandler)
print "CacheSpy running on port 31337"
serv.serve_forever()

120

You define a simple class that implements handle(). This implements the server logic.
  121. Copyright (C) 2007, http://www.dabeaz.com 1- SocketServer Module import pickle, sys,

SocketServer, ffcache

cachelist = findcaches(sys.argv[1])

def dump_cache(f):
    for dir in cachelist:
        for meta in ffcache.scan_cache_withcontent(dir):
            pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",31337),SpyHandler)
print "CacheSpy running on port 31337"
serv.serve_forever()

121

Next, you just create a Server object, hook the handler up to it, and run the server.
  122. Copyright (C) 2007, http://www.dabeaz.com 1- Data Serialization import pickle, sys,

SocketServer, ffcache

cachelist = findcaches(sys.argv[1])

def dump_cache(f):
    for dir in cachelist:
        for meta in ffcache.scan_cache_withcontent(dir):
            pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",31337),SpyHandler)
print "CacheSpy running on port 31337"
serv.serve_forever()

122

Here, we are turning a socket into a file and dumping cache data on it.
self.request: the socket corresponding to the client that connected.
  123. Copyright (C) 2007, http://www.dabeaz.com 1- pickle Module import pickle, sys,

SocketServer, ffcache

cachelist = findcaches(sys.argv[1])

def dump_cache(f):
    for dir in cachelist:
        for meta in ffcache.scan_cache_withcontent(dir):
            pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",31337),SpyHandler)
print "CacheSpy running on port 31337"
serv.serve_forever()

123

The pickle module takes any Python object and serializes it. There are really only two ops:

pickle.dump(obj,f)     # Dump object
obj = pickle.load(f)   # Load object
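A tiny round-trip sketch of those two ops (illustrative only, using an in-memory file rather than a socket):

import pickle
from StringIO import StringIO

buf = StringIO()
pickle.dump({'request': 'http://www.dabeaz.com'},buf)   # serialize
buf.seek(0)
obj = pickle.load(buf)                                  # reconstruct
print obj['request']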
  124. Copyright (C) 2007, http://www.dabeaz.com 1- Running our Server % python

cachespy.py /Users
CacheSpy running on port 31337

124

• Example: the server is just sitting there waiting
• You can try connecting with telnet:

% telnet localhost 31337
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
(dp0
S'info'
p1
... bunch of cryptic data ...
  125. Copyright (C) 2007, http://www.dabeaz.com 1- Problem : CacheMon • The

evil overlord (bwahahahahaha!)

125

Write a program cachemon.py that contains a function for retrieving the cache contents from a remote machine.

• Big Picture: Writing network clients. Programs that make outgoing connections to internet services.
  126. Copyright (C) 2007, http://www.dabeaz.com 1- # cachemon.py import pickle, socket

def scan_remote_cache(host):
    s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
    s.connect(host)
    f = s.makefile()
    try:
        while True:
            meta = pickle.load(f)
            meta['host'] = host     # Add host to metadata
            yield meta
    except EOFError:
        pass
    f.close()
    s.close()

Solution : Cachemon

126
  127. Copyright (C) 2007, http://www.dabeaz.com 1- # cachemon.py import pickle, socket

def scan_remote_cache(host):
    s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
    s.connect(host)
    f = s.makefile()
    try:
        while True:
            meta = pickle.load(f)
            meta['host'] = host     # Add host to metadata
            yield meta
    except EOFError:
        pass
    f.close()
    s.close()

Solution : Socket Module

127

The socket module provides direct access to the low-level socket API:

s = socket(family,type)
s.connect(addr)
s.bind(addr)
s.listen(n)
s.accept()
s.recv(n)
s.send(data)
...
  128. Copyright (C) 2007, http://www.dabeaz.com 1- # cachemon.py import pickle, socket

def scan_remote_cache(host):
    s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
    s.connect(host)
    f = s.makefile()
    try:
        while True:
            meta = pickle.load(f)
            meta['host'] = host     # Add host to metadata
            yield meta
    except EOFError:
        pass
    f.close()
    s.close()

Unpickling a Sequence

128

Here we use pickle to repeatedly load objects. We use yield to generate a sequence of received objects.
  129. Copyright (C) 2007, http://www.dabeaz.com 1- Example Usage >>> rcache =

scan_remote_cache(("localhost",31337))
>>> jpegs = (meta for meta in rcache
...            if meta['content-type'] == 'image/jpeg'
...            and meta['datasize'] > 100000)
>>> for j in jpegs:
...     print j['request']
...
http://images.salon.com/ent/video_dog/comedy/2007/09/27/cereal/story.jpg
http://images.salon.com/ent/video_dog/ifc/2007/09/28/apocalypse/story.jpg
http://www.lakesideinns.com/images/fallroadphoto2006.jpg
...

129

• Example: Find all JPEG images > 100K on a remote machine
• Very similar to old code!
  130. Copyright (C) 2007, http://www.dabeaz.com 1- Variation : CacheMon • Scan

a whole cluster of machines

130

Write a function that can easily scan the caches of an entire list of remote hosts.

• Big Picture: Collecting data from a group of machines on the network.
  131. Copyright (C) 2007, http://www.dabeaz.com 1- # cachemon.py def scan_caches(hostlist): for

host in hostlist:
    try:
        for meta in scan_remote_cache(host):
            yield meta
    except (EnvironmentError,socket.error):
        pass

Solution : Cachemon

131

A bit of exception handling to deal with dead machines and other problems (might need to be expanded).
  132. Copyright (C) 2007, http://www.dabeaz.com 1- Example Usage >>> hosts =

[('host1',31337),('host2',31337),...]
>>> rcaches = scan_caches(hosts)
>>> jpegs = (meta for meta in rcaches
...            if meta['content-type'] == 'image/jpeg'
...            and meta['datasize'] > 100000)
>>> for j in jpegs:
...     print j['request']
...
...

132

• Example: Find all JPEG images > 100K on a set of remote machines
• Think about the abstraction of "iteration" here. The query code is exactly the same.
  133. Copyright (C) 2007, http://www.dabeaz.com 1- Concurrent Monitor • Collect data

from a large set of machines

133

In the last section, we wrote a function that scanned an entire list of hosts, one at a time. Modify that function to scan a set of hosts using concurrent network connections.

• Big Picture: Breaking a task up into concurrently executing parts. Programming with threads.
  134. Copyright (C) 2007, http://www.dabeaz.com 1- Concurrency 134 • Python provides

full support for threads

• They are real threads (pthreads, system threads, etc.)
• The only catch: The Python run-time interpreter is protected by a global interpreter lock. So, no true concurrency across multiple CPUs.
  135. Copyright (C) 2007, http://www.dabeaz.com 1- # cachemon.py ... import threading

class ScanThread(threading.Thread):
    def __init__(self,host,msg_q):
        threading.Thread.__init__(self)
        self.host = host
        self.msg_q = msg_q
    def run(self):
        try:
            for meta in scan_remote_cache(self.host):
                self.msg_q.put(meta)
        except (EnvironmentError,socket.error):
            pass

A Cache Scanning Thread

135
  136. Copyright (C) 2007, http://www.dabeaz.com 1- # cachemon.py ... import threading

class ScanThread(threading.Thread):
    def __init__(self,host,msg_q):
        threading.Thread.__init__(self)
        self.host = host
        self.msg_q = msg_q
    def run(self):
        try:
            for meta in scan_remote_cache(self.host):
                self.msg_q.put(meta)
        except (EnvironmentError,socket.error):
            pass

threading Module

136

threading module. Contains most functionality related to threads.
  137. Copyright (C) 2007, http://www.dabeaz.com 1- # cachemon.py ... import threading

class ScanThread(threading.Thread):
    def __init__(self,host,msg_q):
        threading.Thread.__init__(self)
        self.host = host
        self.msg_q = msg_q
    def run(self):
        try:
            for meta in scan_remote_cache(self.host):
                self.msg_q.put(meta)
        except (EnvironmentError,socket.error):
            pass

Thread Base Class

137

Threads are defined by inheriting from the Thread base class.
  138. Copyright (C) 2007, http://www.dabeaz.com 1- # cachemon.py ... import threading

class ScanThread(threading.Thread):
    def __init__(self,host,msg_q):
        threading.Thread.__init__(self)
        self.host = host
        self.msg_q = msg_q
    def run(self):
        try:
            for meta in scan_remote_cache(self.host):
                self.msg_q.put(meta)
        except (EnvironmentError,socket.error):
            pass

Thread Initialization

138

__init__(): initialization and setup.
  139. Copyright (C) 2007, http://www.dabeaz.com 1- # cachemon.py ... import threading

class ScanThread(threading.Thread):
    def __init__(self,host,msg_q):
        threading.Thread.__init__(self)
        self.host = host
        self.msg_q = msg_q
    def run(self):
        try:
            for meta in scan_remote_cache(self.host):
                self.msg_q.put(meta)
        except (EnvironmentError,socket.error):
            pass

Thread Execution

139

run() method. Contains code that executes in the thread.
  140. Copyright (C) 2007, http://www.dabeaz.com 1- Launching a Thread 140 •

You create a thread object and launch it:

t1 = ScanThread(("host1",31337),msg_q)
t1.start()
t2 = ScanThread(("host2",31337),msg_q)
t2.start()

• .start() starts the thread and calls .run()
  141. Copyright (C) 2007, http://www.dabeaz.com 1- Interlude 141 • Threads are

commonly used to implement variations of producer/consumer problems.

• Data is produced by one or more producers (e.g., each remote client)
• Consumed by one or more consumers (e.g., a centralized function)
• You could try to coordinate this with locks and other synchronization.
  142. Copyright (C) 2007, http://www.dabeaz.com 1- Thread Safe Queues 142 •

Queue module. Provides a thread-safe queue:

import Queue
msg_q = Queue.Queue()

• Queue insertion:
    msg_q.put(obj)
• Queue removal:
    obj = msg_q.get()
• A Queue can be shared by as many threads as you want without worrying about locking.
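A toy producer/consumer sketch with this module (illustrative only; note the None sentinel, the same trick the monitor uses below):

import threading, Queue

def producer(q,items):
    for item in items:
        q.put(item)
    q.put(None)                 # Sentinel: no more data

q = Queue.Queue()
t = threading.Thread(target=producer,args=(q,range(5)))
t.start()
while True:
    item = q.get()
    if item is None:            # Sentinel received
        break
    print item
t.join()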
  143. Copyright (C) 2007, http://www.dabeaz.com 1- # cachemon.py ... import threading

class ScanThread(threading.Thread):
    def __init__(self,host,msg_q):
        threading.Thread.__init__(self)
        self.host = host
        self.msg_q = msg_q
    def run(self):
        try:
            for meta in scan_remote_cache(self.host):
                self.msg_q.put(meta)
        except (EnvironmentError,socket.error):
            pass

Use of a Queue Object

143

msg_q: a Queue object where incoming objects are placed.
run(): gets data from the remote machine and puts it in the Queue.
  144. Copyright (C) 2007, http://www.dabeaz.com 1- Primitive Use of a Queue

144

• You create a queue, then launch the threads:

msg_q = Queue.Queue()
t1 = ScanThread(("host1",31337),msg_q)
t1.start()
t2 = ScanThread(("host2",31337),msg_q)
t2.start()
while True:
    meta = msg_q.get()     # Get metadata
  145. Copyright (C) 2007, http://www.dabeaz.com 1- Monitor Architecture 145 Host Host

[Figure: Monitor architecture. Three hosts, each connected by a socket to its own ScanThread inside the monitor; every thread calls msg_q.put(), and a consumer pulls results with msg_q.get().]
  146. Copyright (C) 2007, http://www.dabeaz.com 1- Concurrent Monitor import threading, Queue

def launch_scanners(hostlist,msg_q):
    tlist = []
    for host in hostlist:
        thr = ScanThread(host,msg_q)
        thr.start()
        tlist.append(thr)
    for thr in tlist:
        thr.join()
    msg_q.put(None)     # Sentinel

def scan_caches(hostlist):
    msg_q = Queue.Queue()
    thr = threading.Thread(target=launch_scanners,
                           args=(hostlist,msg_q))
    thr.start()
    while True:
        meta = msg_q.get()
        if meta:
            yield meta
        else:
            break

146
  147. Copyright (C) 2007, http://www.dabeaz.com 1- Launching Threads import threading, Queue

def launch_scanners(hostlist,msg_q):
    tlist = []
    for host in hostlist:
        thr = ScanThread(host,msg_q)
        thr.start()
        tlist.append(thr)
    for thr in tlist:
        thr.join()
    msg_q.put(None)     # Sentinel

def scan_caches(hostlist):
    msg_q = Queue.Queue()
    thr = threading.Thread(target=launch_scanners,
                           args=(hostlist,msg_q))
    thr.start()
    while True:
        meta = msg_q.get()
        if meta:
            yield meta
        else:
            break

147

The above function is a thread that launches ScanThreads. It then waits for the threads to terminate by joining with them. After all threads have terminated, a sentinel is dropped in the Queue.
  148. Copyright (C) 2007, http://www.dabeaz.com 1- Collecting Results import threading, Queue

def launch_scanners(hostlist,msg_q):
    tlist = []
    for host in hostlist:
        thr = ScanThread(host,msg_q)
        thr.start()
        tlist.append(thr)
    for thr in tlist:
        thr.join()
    msg_q.put(None)     # Sentinel

def scan_caches(hostlist):
    msg_q = Queue.Queue()
    thr = threading.Thread(target=launch_scanners,
                           args=(hostlist,msg_q))
    thr.start()
    while True:
        meta = msg_q.get()
        if meta:
            yield meta
        else:
            break

148

The function below creates a Queue and launches a thread to launch all of the scanning threads. It then produces a sequence of cache data until the sentinel (None) is pulled off of the queue.
  149. Copyright (C) 2007, http://www.dabeaz.com 1- More on Threads 149 •

There are many more issues to thread programming that we could discuss.

• All issues concerning locking, synchronization, event handling, and race conditions apply to Python.
• Because of the global interpreter lock, threads are not (generally) a way to achieve higher performance.
  150. Copyright (C) 2007, http://www.dabeaz.com 1- Thread Synchronization 150 • threading

module has various primitives:

Lock()          # Mutex lock
RLock()         # Reentrant mutex lock
Semaphore(n)    # Semaphore

• Example use:

x = value          # Some kind of shared object
x_lock = Lock()    # A lock associated with x
...
x_lock.acquire()
# Modify or do something with x (critical section)
...
x_lock.release()
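One refinement worth making (not on the original slide): release the lock in a finally clause so that an exception raised in the critical section cannot leave the lock held forever. update() here is a hypothetical operation on x.

x_lock.acquire()
try:
    update(x)           # Critical section (hypothetical operation)
finally:
    x_lock.release()    # Runs even if update() raises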
  151. Copyright (C) 2007, http://www.dabeaz.com 1- Story so Far 151 •

Wrote a program findcache.py that located cache directories (~10 lines)
• Wrote a module ffcache.py that parsed contents of caches (~140 lines)
• Wrote cachespy.py that allows caches to be retrieved (~30 lines)
• Wrote a concurrent monitor for getting that data (~50 lines)
  152. Copyright (C) 2007, http://www.dabeaz.com 1- A subtle observation 152 •

In none of our programs have we read the entire contents of the Firefox cache into memory.

• In cachespy.py, the contents are read iteratively and piped through a socket (not stored in memory).
• In cachemon.py, contents are received and routed through message queues. Processed iteratively (no temporary lists of results).
  153. Copyright (C) 2007, http://www.dabeaz.com 1- Another Observation 153 • For

every connection, cachespy sends the entire contents of the Firefox cache metadata back to the client.

• However, mostly we're just performing various kinds of queries on the data and filtering.
• Question: Could we do any of this work remotely?
  154. Copyright (C) 2007, http://www.dabeaz.com 1- Remote Mapping • Distribute the

work

154

Modify the cachespy program so that some of the mapping and filtering work can be performed remotely on each of the machines. Only send back a subset of the data to the monitor program.

• Big Picture: Distributed computation. Massive security problem.
  155. Copyright (C) 2007, http://www.dabeaz.com 1- The idea • Modify scan_remote_cache()

to accept an optional filter specification. Pass this on to the remote machine and use it to process the data remotely before returning results.

155

filter = """
if meta['content-type'] == 'image/jpeg'
   and meta['datasize'] > 100000
"""
rcache = scan_remote_cache(host,filter)
  156. Copyright (C) 2007, http://www.dabeaz.com 1- Changes to the Monitor #

cachemon.py

def scan_remote_cache(host,filter=""):
    s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
    s.connect(host)
    f = s.makefile()
    pickle.dump(filter,f)     # Send the filter to the remote host
    f.flush()
    try:
        while True:
            meta = pickle.load(f)
            meta['host'] = host
            yield meta
    except EOFError:
        pass

156
  157. Copyright (C) 2007, http://www.dabeaz.com 1- Changes to the Monitor #

cachemon.py
...

class ScanThread(threading.Thread):
    def __init__(self,host,msg_q,filter=""):
        threading.Thread.__init__(self)
        self.host = host
        self.msg_q = msg_q
        self.filter = filter     # filter added to thread data
    def run(self):
        try:
            for meta in scan_remote_cache(self.host,self.filter):
                self.msg_q.put(meta)
        except (EnvironmentError,socket.error):
            pass

157
  158. Copyright (C) 2007, http://www.dabeaz.com 1- Changes to the Monitor #

cachemon.py
...

def launch_scanners(hostlist,msg_q,filter=""):
    tlist = []
    for host in hostlist:
        thr = ScanThread(host,msg_q,filter)   # filter passed to thread creation
        thr.start()
        tlist.append(thr)
    for thr in tlist:
        thr.join()
    msg_q.put(None)     # Sentinel

158
  159. Copyright (C) 2007, http://www.dabeaz.com 1- Changes to the Monitor #

cachemon.py
...

def scan_caches(hostlist,filter=""):       # filter added
    msg_q = Queue.Queue()
    thr = threading.Thread(target=launch_scanners,
                           args=(hostlist,msg_q,filter))
    thr.start()
    while True:
        meta = msg_q.get()
        if not meta:
            break
        yield meta

159
  160. Copyright (C) 2007, http://www.dabeaz.com 1- Changes to CacheSpy # cachespy.py

...
import traceback     # needed for format_exc() below

def dump_cache(f,filter):
    valuegen = """(meta for dir in cachelist
                        for meta in ffcache.scan_cache_withcontent(dir)
                        %s)""" % filter
    try:
        for meta in eval(valuegen):
            pickle.dump(meta,f)
    except:
        pickle.dump({'error': traceback.format_exc()},f)

160
  161. Copyright (C) 2007, http://www.dabeaz.com 1- Changes to CacheSpy # cachespy.py

...

def dump_cache(f,filter):
    valuegen = """(meta for dir in cachelist
                        for meta in ffcache.scan_cache_withcontent(dir)
                        %s)""" % filter
    try:
        for meta in eval(valuegen):
            pickle.dump(meta,f)
    except:
        pickle.dump({'error': traceback.format_exc()},f)

161

Filter added and used to create an expression string. For example:

filter = "if meta['datasize'] > 100000"
valuegen = """(meta for dir in cachelist
                    for meta in ffcache.scan_cache_withcontent(dir)
                    if meta['datasize'] > 100000)"""
  162. Copyright (C) 2007, http://www.dabeaz.com 1- Eval() # cachespy.py ... def

dump_cache(f,filter):
    valuegen = """(meta for dir in cachelist
                        for meta in ffcache.scan_cache_withcontent(dir)
                        %s)""" % filter
    try:
        for meta in eval(valuegen):
            pickle.dump(meta,f)
    except:
        pickle.dump({'error': traceback.format_exc()},f)

162

eval(s). Evaluates s as a Python expression.
A bit of error handling: the traceback module creates stack traces for exceptions.
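A tiny sketch of the underlying eval() trick, building a generator expression from a string (illustrative only, not from the slides):

nums = [1, 2, 3, 4]
expr = "(n*n for n in nums if n > 2)"
gen = eval(expr)        # eval() sees nums in the enclosing scope
print list(gen)         # prints [9, 16]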
  163. Copyright (C) 2007, http://www.dabeaz.com 1- Changes to the Server #

cachespy.py
...

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        filter = pickle.load(f)     # filter added: read it from the client
        dump_cache(f,filter)
        f.close()

163
  164. Copyright (C) 2007, http://www.dabeaz.com 1- Putting it all Together •

Have created some interesting machinery

164

# Find all of those slashdot slackers
import cachemon

hosts = [('host1',31337),('host2',31337),
         ('host3',31337),...]
filter = "if 'slashdot' in meta['request']"
rcaches = cachemon.scan_caches(hosts,filter)
for meta in rcaches:
    print meta['request']
    print meta['host'],meta['cachedir']
    print
  165. Copyright (C) 2007, http://www.dabeaz.com 1- Putting it all Together •

Queries run remotely on all the hosts
• Only data of interest is sent back
• No temporary lists or large data structures
• Concurrent execution on monitor
• Concurrency is hidden from user

165
  166. Copyright (C) 2007, http://www.dabeaz.com 1- The Power of Iteration •

Loop over all entries in a cache file:

for meta in scan_cache_file(f,256):
    ...

• Loop over all entries in a cache directory:

for meta in scan_cache(dirname):
    ...

• Loop over all cache entries on remote host:

for meta in scan_remote_cache(host):
    ...

• Loop over all cache entries on many hosts:

for meta in scan_caches(hostlist):
    ...

166
  167. Copyright (C) 2007, http://www.dabeaz.com 1- Wrapping Up • A lot

of material has been presented

• Again, the goal was to do something interesting with Python, not to be just a reference manual.
• This is only a small taste of what's possible
• And it's only a small taste of why people like programming in Python

167
  168. Copyright (C) 2007, http://www.dabeaz.com 1- Where to go from here?

• Everything Pythonic:
    http://www.python.org
• Get involved. PyCon'2008 (Chicago)
• Have an on-site course (shameless plug):
    http://www.dabeaz.com/python.html

168
  169. Copyright (C) 2007, http://www.dabeaz.com 1- Thanks for Listening! • Hope

you got something out of the class

169

• Please give me feedback!