Python in Action - Part 2 (Systems Programming)

Tutorial presentation, 2007 USENIX LISA conference.

David Beazley

November 16, 2007

Transcript

1. Python in Action (Part II - Systems Programming)
   Presented at the USENIX LISA Conference, November 16, 2007
   David M. Beazley
   http://www.dabeaz.com
   Copyright (C) 2007, http://www.dabeaz.com

2. Section Overview
   • In this section, we're going to get dirty
   • Systems programming
   • Files, I/O, file-system
   • Text parsing, data decoding
   • Processes and IPC
   • Threads and concurrency
   • Networking
3. Commentary
   • I personally think Python is a fantastic tool for systems programming.
   • Modules provide access to most of the major system libraries I used
     to access via C
   • No enforcement of "morality"
   • Good performance
   • It just "works" and it feels right.
4. Approach
   • I've thought long and hard about how I would present this part of
     the class.
   • A reference-manual approach is just going to be long and very boring.
   • So instead, we're going to focus on building something more in tune
     with the times.
5. "To Catch a Slacker"
   • Write some Python programs that can quietly monitor Firefox browser
     caches to find out who has been spending their day reading Slashdot
     instead of working on their TPS reports.
   • Oh yeah, and be a real sneaky bugger about it.
6. Why this Problem?
   • Involves a real-world system and data
   • Firefox already installed on your machine (?)
   • Cross platform (Linux, Mac, Windows)
   • Many opportunities for tool building
   • Related to a variety of practical problems
   • A good tour of "Python in Action"
7. Disclaimers
   • I am not involved in browser forensics (or spyware, for that matter).
   • I am in no way affiliated with Firefox/Mozilla, nor have I ever seen
     the Firefox source code.
   • I had never worked with the cache data prior to preparing this
     tutorial.
   • I have never used any third-party tools for looking at this data.
8. More Disclaimers
   • All of the code in this tutorial works with a standard Python
     installation
   • No third-party modules.
   • All code is cross-platform
   • Code samples are available online at http://www.dabeaz.com/lisa/
   • Please look at that code and follow along
9. Assumptions
   • This is not a review of systems concepts
   • You should be generally familiar with the background material
     (files, filesystems, file formats, processes, threads, networking,
     protocols, etc.)
   • You can "extrapolate" from the material presented here to construct
     more advanced Python applications.
10. Rough Outline
    • The file system and environment
    • Data processing (text and binary)
    • File encoding/decoding
    • Interprocess communication
    • Networking
    • Concurrency
    • Distributed computing
11. The Firefox Cache
    • The Firefox browser keeps a disk cache of recently visited sites:

      % ls Cache/
      -rw------- 1 beazley  111169 Sep 25 17:15 01CC0844d01
      -rw------- 1 beazley  104991 Sep 25 17:15 01CC3844d01
      -rw------- 1 beazley   47233 Sep 24 16:41 021F221Ad01
      ...
      -rw------- 1 beazley   26749 Sep 21 11:19 FF8AEDF0d01
      -rw------- 1 beazley   58172 Sep 25 18:16 FFE628C6d01
      -rw------- 1 beazley 1939456 Sep 25 19:14 _CACHE_001_
      -rw------- 1 beazley 2588672 Sep 25 19:14 _CACHE_002_
      -rw------- 1 beazley 4567040 Sep 25 18:44 _CACHE_003_
      -rw------- 1 beazley   33044 Sep 23 21:58 _CACHE_MAP_

    • A bunch of cryptic files.
12. Problem: Finding Files
    • Find the Firefox cache. Write a program findcache.py that takes a
      directory name as input and recursively scans that directory and
      all subdirectories looking for Firefox/Mozilla cache directories.
    • Example:

      % python findcache.py /Users/beazley
      /Users/beazley/Library/.../qs1ab616.default/Cache
      /Users/beazley/Library/.../wxuoyiuf.slt/Cache
      %

    • Use case: searching on the filesystem.
13. Solution

    # findcache.py
    # Recursively scan a directory looking for
    # Firefox/Mozilla cache directories

    import sys
    import os

    if len(sys.argv) != 2:
        print >>sys.stderr, "Usage: python findcache.py dirname"
        raise SystemExit(1)

    caches = (path for path, dirs, files in os.walk(sys.argv[1])
                   if '_CACHE_MAP_' in files)

    for name in caches:
        print name
14. The sys module (annotating the findcache.py listing above)
    The sys module holds basic information about the execution
    environment:

    sys.argv      List of the command line options
    sys.stdin
    sys.stdout
    sys.stderr    Standard I/O files

    For the example run above:

    sys.argv = ['findcache.py', '/Users/beazley']

15. Program Termination (findcache.py, continued)
    The SystemExit exception forces Python to exit; its value is the
    program's return code:

    raise SystemExit(1)

16. The os Module (findcache.py, continued)
    The os module contains useful OS-related functions (files,
    processes, etc.).

17. os.walk() (findcache.py, continued)
    os.walk(topdir) recursively walks a directory tree and generates a
    sequence of tuples (path, dirs, files):

    path  = the current directory name
    dirs  = list of all subdirectory names in path
    files = list of all regular files (data) in path
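A quick illustration (my example, not from the slides) of what os.walk()
yields; the directory name here is hypothetical:

    import os

    # Print each (path, dirs, files) tuple that os.walk() generates
    for path, dirs, files in os.walk('/tmp/example'):
        print path
        print '   subdirs:', dirs
        print '   files  :', files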
18. A Sequence of Caches (findcache.py, continued)
    The generator expression

    caches = (path for path, dirs, files in os.walk(sys.argv[1])
                   if '_CACHE_MAP_' in files)

    generates a sequence of directory names: the file-name check selects
    directories whose file list contains '_CACHE_MAP_', and the directory
    name is what gets generated as a result.

19. Printing the Result (findcache.py, continued)
    The final for-loop prints the sequence of cache directories that are
    generated.  Note: output is produced immediately, as caches are found.
20. Commentary
    • Our solution is strongly based on a "declarative" programming
      style (again)
    • We simply write out a sequence of operations that produce what
      we want
    • Not focused on the underlying mechanics of how to traverse all of
      the directories.
21. Mini-Reference
    • sys module

      sys.argv           # List of command line options
      sys.stdin          # Standard input
      sys.stdout         # Standard output
      sys.stderr         # Standard error
      sys.executable     # Full path of Python executable
      sys.exc_info()     # Information on current exception

    • os module

      os.walk(dir)       # Recursively walk dir producing a
                         # sequence of tuples (path,dlist,flist)
      os.listdir(dir)    # Return a list of all files in dir

    • SystemExit exception

      raise SystemExit(n)    # Exit with integer code n
22. Problem: Getting File Info
    • Collect some cache information. Create a command-line tool
      cacheinfo.py that searches for cache directories and produces
      reports including directory names, sizes, and modification dates.
    • Example:

      % python cacheinfo.py -s /Users/
      32435354 /Users/beazley/Library/.../qs1ab616.default/Cache
      254100 /Users/beazley/Library/.../wxuoyiuf.slt/Cache
      % python cacheinfo.py -t /Users
      09/26/2007 08:54 /Users/beazley/Library/.../qs1ab616...
      01/29/2007 20:47 /Users/beazley/Library/.../wxuoyiuf.slt...
      %
23. cacheinfo.py usage
    • We're creating a command-line tool:

      python cacheinfo.py [opts] dir1 dir2 ... dirn

      Recursively searches the directories dir1, ..., dirn for Firefox
      browser caches and prints out information.

      Options:
        -s                    Print total size of directory contents
        -t                    Print last modification time
        --sortby=[size|time]  Sort results by size or time

    • Use case: development of command-line oriented tools and utilities.
24. Command Line Parsing

    # cacheinfo.py
    import optparse

    p = optparse.OptionParser()
    p.add_option('-s', action="store_true", dest="size",
                 help="Show total size of each directory")
    p.add_option('-t', action="store_true", dest="time",
                 help="Show last modification date")
    p.add_option('--sortby', action="store", dest="sortby",
                 type="choice", choices=["size", "time"],
                 help="Sort by 'size' or 'time'")

    opt, args = p.parse_args()
25. optparse Module (annotating the cacheinfo.py listing above)
    optparse is a module for parsing Unix-style command line options,
    a problem that sounds simple, but which is not in practice.
    Rule of thumb: Python comes with modules that deal with common
    programming problems.

26. optparse Module (cacheinfo.py, continued)
    First, create an OptionParser object and configure it with the
    available options (the p.add_option() calls above).

27. optparse Module (cacheinfo.py, continued)
    p.parse_args() then parses the command line.  For

    python cacheinfo.py -s --sortby=time dir1 dir2

    sys.argv = ['cacheinfo.py', '-s', '--sortby=time', 'dir1', 'dir2']

    and after parsing:

    opt.size   = True
    opt.time   = None
    opt.sortby = 'time'
    args       = ['dir1', 'dir2']

28. Sample Use

    % python cacheinfo.py -f /Users
    Usage: cacheinfo.py [options]
    cacheinfo.py: error: no such option: -f
    % python cacheinfo.py -s --sortby=owner /Users
    Usage: cacheinfo.py [options]
    cacheinfo.py: error: option --sortby: invalid choice: 'owner'
    (choose from 'size', 'time')
    % python cacheinfo.py -h
    Usage: cacheinfo.py [options]

    Options:
      -h, --help       show this help message and exit
      -s               Show total size of each directory
      -t               Show last modification date
      --sortby=SORTBY  Sort by 'size' or 'time'
29. Collecting File Metadata

    # cacheinfo.py
    ...
    opt, args = p.parse_args()

    import os
    cacheinfo = []
    for dirname in args:
        for path, dirs, files in os.walk(dirname):
            if '_CACHE_MAP_' in files:
                fnames = [os.path.join(path, name) for name in files]
                size = sum(os.path.getsize(name) for name in fnames)
                mtime = max(os.path.getmtime(name) for name in fnames)
                cacheinfo.append((path, size, mtime))
30. Collecting File Metadata (annotating the listing above)
    General idea:
    • Walk over a sequence of directories as before.
    • For each cache directory, get the total size of all files and the
      most recent modification date
    • Store the information in a list of tuples:

      cacheinfo = [
        (dirname, bytes, mtime),
        (dirname, bytes, mtime),
        ...
      ]

31. os.path Module (continued)
    os.path has portable file-related functions:

    os.path.join(name1,name2,...)    # Join path names
    os.path.getsize(filename)        # Get the file size
    os.path.getmtime(filename)       # Get modification date

    There are many more functions, but this is the preferred module for
    basic filename handling.

32. os.path.join() (continued)
    Creates a fully expanded pathname:

    path = '/foo/bar'
    files = ['file1', 'file2', ...]
    fnames = [os.path.join(path, name) for name in files]
    # fnames = ['/foo/bar/file1', '/foo/bar/file2', ...]

    Note: use of os.path solves cross-platform issues related to
    pathnames ('/' vs. '\').

33. Getting Metadata (continued)
    sum() and max() perform reductions across the list of filenames:

    fnames = ['/foo/bar/file1', '/foo/bar/file2', ...]
    size = sum([os.path.getsize('/foo/bar/file1'),
                os.path.getsize('/foo/bar/file2'),
                ...])

    The argument looks funny, but it's really just generating a sequence
    of values as input.

34. Data Representation (continued)
    The data is collected as a list of tuples:

    cacheinfo = [
      (dirname, bytes, mtime),
      (dirname, bytes, mtime),
      ...
    ]
35. Commentary
    • Again, making strong use of declarative style to collect data:

      size = sum(os.path.getsize(name) for name in fnames)

    • Compare to this:

      size = 0
      for name in fnames:
          size += os.path.getsize(name)

    • The choice of programming style is mostly a matter of personal
      preference
    • I personally tend to prefer the first approach
36. Producing Results

    # cacheinfo.py
    ...
    cacheinfo = []
    ...
    if opt.sortby == 'size':
        cacheinfo.sort(key=lambda x: x[1])
    elif opt.sortby == 'time':
        cacheinfo.sort(key=lambda x: x[2])

    import time
    for path, size, mtime in cacheinfo:
        if opt.size:
            print size,
        if opt.time:
            tm = time.localtime(mtime)
            print time.strftime("%m/%d/%Y %H:%M", tm),
        print path
37. Sorting (Revisited) (annotating the listing above)
    Here we are sorting based on an optional command line option
    (--sortby=[size|time]).  .sort() sorts a list "in place."

38. Sorting with a key func (continued)
    key= supplies a function that produces the keys used when comparing
    elements in the sort.  lambda creates a function from a single
    expression; these are equivalent:

    cacheinfo.sort(key=lambda x: x[2])

    def keyfunc(x):
        return x[2]
    cacheinfo.sort(key=keyfunc)

39. More on lambda (continued)
    lambda is used to create small anonymous functions, almost always
    reserved for simple callbacks.  These statements are equivalent:

    def add(x, y):
        return x + y

    add = lambda x, y: x + y

    lambda is used sparingly.

40. Back to Sorting (continued)
    Recall what is being sorted:

    cacheinfo = [
      (dirname, bytes, mtime),
      (dirname, bytes, mtime),
      ...
    ]

41. Printing Output (continued)
    The print statements emit optional fields; a trailing comma omits
    the newline.

42. Date/time Handling (continued)
    The time module contains functions related to system time values
    (e.g., seconds since 1970):

    time.localtime(s)       # Make time structure
    time.time()             # Get current time
    time.strftime(fmt, tm)  # Format time string
    time.clock()            # CPU clock
    time.sleep(s)           # Sleep for s seconds
    ...
43. Mini-Reference
    • os.path module

      os.path.join(s1,s2,...)    # Join pathname parts together
      os.path.getsize(path)      # Get file size of path
      os.path.getmtime(path)     # Get modify time of path
      os.path.getatime(path)     # Get access time of path
      os.path.getctime(path)     # Get creation time of path
      os.path.exists(path)       # Check if path exists
      os.path.isfile(path)       # Check if regular file
      os.path.isdir(path)        # Check if directory
      os.path.islink(path)       # Check if symbolic link
      os.path.basename(path)     # Return file part of path
      os.path.dirname(path)      # Return directory part of path
      os.path.abspath(path)      # Get absolute path

    • time module

      time.time()             # Current time (seconds)
      time.localtime([s])     # Turn seconds into a structure
                              # of different parts (hour,min,etc.)
      time.ctime([s])         # Time as a string
      time.strftime(fmt,tm)   # Time string formatting
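Putting the two modules together (my example; any readable file path
works here, '/etc/hosts' is just an illustration):

    import os.path, time

    path = '/etc/hosts'                # hypothetical example file
    print os.path.getsize(path)        # size in bytes
    mtime = os.path.getmtime(path)     # seconds since 1970
    print time.strftime('%m/%d/%Y %H:%M', time.localtime(mtime))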
44. Interlude: Tool Building
    • So far, we have been exploring some of the basic machinery for
      building tools
    • Program environment (sys module)
    • Filesystem (os, os.path modules)
    • Command line processing (optparse)
    • A lot of this would provide the basic framework for more advanced
      applications
45. Problem: Searching Data
    • Extract all requests in the cache. Write a program cachecontents.py
      that scans the contents of the _CACHE_00n_ files and prints a list
      of URLs for documents stored in the cache.
    • Example:

      % python cachecontents.py /Users/.../qs1ab616.default/Cache
      http://www.yahoo.com/
      http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/js/ad_eo_1
      http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
      http://us.i1.yimg.com/us.yimg.com/i/ww/thm/1/search_1.1.png
      ...
      %

    • Use case: searching files for text or specific data patterns.
46. The Firefox Cache
    • The cache directory holds two types of data
      • Metadata (URLs, headers, etc.)
      • Raw data (HTML, JPEG, PNG, etc.)
    • This data is stored in two places
      • Cryptic files in the Cache directory
      • Blocks inside the _CACHE_00n_ files
47. Possible Solution: Regex
    • The _CACHE_00n_ files are encoded in a binary format, but URLs are
      embedded inside:

      \x00\x01\x00\x08\x92\x00\x02\x18\x00\x00\x00\x13F\xff\x9f
      \xceF\xff\x9f\xce\x00\x00\x00\x00\x00\x00H)\x00\x00\x00\x1a
      \x00\x00\x023HTTP:http://slashdot.org/\x00request-method\x00
      GET\x00request-User-Agent\x00Mozilla/5.0 (Macintosh; U; Intel
      Mac OS X; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7\x00
      request-Accept-Encoding\x00gzip,deflate\x00response-head\x00
      HTTP/1.1 200 OK\r\nDate: Sun, 30 Sep 2007 13:07:29 GMT\r\n
      Server: Apache/1.3.37 (Unix) mod_perl/1.29\r\nSLASH_LOG_DATA:
      shtml\r\nX-Powered-By: Slash 2.005000176\r\nX-Fry: How can I
      live my life if I can't tell good from evil?\r\nCache-Control: ...

    • Maybe the requests could just be ripped using a regular expression.
48. A Regex Solution

    # cachecontents.py
    import re
    import os
    import sys

    cachedir = sys.argv[1]
    cachefiles = ['_CACHE_001_', '_CACHE_002_', '_CACHE_003_']

    # A regex for URL strings
    request_pat = re.compile('(http:|https:)//.*?\x00')

    # Loop over all files and search for URLs
    for name in cachefiles:
        data = open(os.path.join(cachedir, name), "rb").read()
        for m in request_pat.finditer(data):
            print m.group()
49. The re module (annotating the cachecontents.py listing above)
    The re module contains all functionality related to regular
    expression pattern matching, searching, replacing, etc.  Features
    are strongly influenced by Perl, but regexes are not directly
    integrated into the Python language.

50. Using re (continued)
    Patterns are first specified as strings and compiled into a regex
    object:

    pat = re.compile(pattern [, flags])

    The pattern syntax is "standard":

    pat* pat+ pat? (pat) . pat1|pat2 [chars] [^chars] pat{n} pat{n,m}

51. Using re (continued)
    All subsequent operations are methods of the compiled regex pattern:

    m = pat.match(data)              # Check for match
    m = pat.search(data)             # Search for match
    newdata = pat.sub(repl, data)    # Pattern replace
    allmatches = pat.findall(data)   # Find all matches
    for m in pat.finditer(data):     # Iterate over matches
        ...

52. re Matches (continued)
    Regex matches are represented by a MatchObject:

    m.group([n])    # Text matched by group n
    m.start([n])    # Starting index of group n
    m.end([n])      # End index of group n

    In our loop, m.group() prints the matched text of the entire pattern.
53. Commentary on Solution
    • This regex approach is mostly a hack for this particular
      application.
    • Reads entire cache files into memory as strings (may be quite large)
    • Only finds URLs, no other metadata
    • Returns false positives, since the cache also contains document
      data (e.g., HTML pages with embedded URL links)
54. Mini-Reference
    • Searching for text using re:

      datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
      m = datepat.search(data)
      if m:
          fulltext = m.group()    # Text of complete match
          month = m.group(1)      # Text of specific groups
          day   = m.group(2)
          year  = m.group(3)

    • Replacement example:

      def euro_date(m):
          month = m.group(1)
          day   = m.group(2)
          year  = m.group(3)
          return "%s/%s/%s" % (day, month, year)

      newdata = datepat.sub(euro_date, data)
55. Problem: Parsing Data
    • Extract the cache data (for real). Write a module ffcache.py that
      contains a set of functions for reading Firefox cache data into
      useful data structures that can be used by other programs.
      Capture all available information, including URLs, timestamps,
      sizes, locations, etc.
    • Use case: blood and guts. Writing programs that can process
      foreign file formats. Processing binary-encoded data. Creating
      code for later reuse.
56. The Firefox Cache
    • There are four critical files:

      _CACHE_MAP_    # Cache index
      _CACHE_001_    # Cache data
      _CACHE_002_    # Cache data
      _CACHE_003_    # Cache data

    • All files are binary-encoded
    • _CACHE_MAP_ is used by Firefox to locate data, but it is not
      updated until Firefox exits.
    • We will ignore _CACHE_MAP_ since I want to observe caches of live
      Firefox sessions.
57. Firefox _CACHE_ Files
    • _CACHE_00n_ file organization: a free/used block bitmap
      (4096 bytes) followed by up to 32768 blocks.
    • The block size varies according to the file:

      _CACHE_001_    256 byte blocks
      _CACHE_002_    1024 byte blocks
      _CACHE_003_    4096 byte blocks
58. Cache Entries
    • Each cache entry:
      • Occupies a maximum of 4 cache blocks
      • Can be either data or metadata
      • If >16K, it is written to a file instead
    • Notice how all the "cryptic" files are >16K:

      -rw------- beazley  111169 Sep 25 17:15 01CC0844d01
      -rw------- beazley  104991 Sep 25 17:15 01CC3844d01
      -rw------- beazley   47233 Sep 24 16:41 021F221Ad01
      ...
      -rw------- beazley   26749 Sep 21 11:19 FF8AEDF0d01
      -rw------- beazley   58172 Sep 25 18:16 FFE628C6d01
59. Cache Metadata
    • Metadata is encoded as a binary structure:

      Header            36 bytes
      Request String    variable length (size given in header)
      Request Info      variable length (size given in header)

    • Header encoding (binary, big-endian):

      bytes    field          type
      0-3      magic (???)    unsigned int (0x00010008)
      4-7      location       unsigned int
      8-11     fetchcount     unsigned int
      12-15    fetchtime      unsigned int (system time)
      16-19    modifytime     unsigned int (system time)
      20-23    expiretime     unsigned int (system time)
      24-27    datasize       unsigned int (byte count)
      28-31    requestsize    unsigned int (byte count)
      32-35    infosize       unsigned int (byte count)
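As a sanity check (my addition, not on the slide), the struct module
confirms that nine big-endian unsigned ints occupy exactly the 36 bytes
shown above:

    import struct
    print struct.calcsize('>9I')    # prints 36, the header size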
60. Solution Outline
    • Part 1: Parsing metadata headers
    • Part 2: Getting request information
    • Part 3: Scanning each cache file
    • Part 4: Collecting cache data from an entire cache directory
61. Part I - Reading Headers
    • Write a function that can parse the metadata header and return the
      data in a useful format
    • Add some checks that can bail out if the data does not look like a
      valid header.
62. Reading Headers

    import struct

    # This function parses a cache metadata header into a dict
    # of named fields (listed in _headernames below)

    _headernames = ['magic', 'location', 'fetchcount',
                    'fetchtime', 'modifytime', 'expiretime',
                    'datasize', 'requestsize', 'infosize']

    def parse_meta_header(headerdata, maxsize):
        head = struct.unpack(">9I", headerdata)
        magic = head[0]
        rsize = head[7]
        isize = head[8]
        if magic != 0x00010008 or (rsize + isize + 36) > maxsize:
            return None
        meta = dict(zip(_headernames, head))
        return meta
63. Reading Headers
    • How this is supposed to work:

      >>> f = open("Cache/_CACHE_001_","rb")
      >>> f.seek(4096)              # Skip the bit map
      >>> headerdata = f.read(36)   # Read 36 byte header
      >>> meta = parse_meta_header(headerdata,1024)
      >>> meta
      {'fetchtime': 1190829792, 'requestsize': 27, 'magic': 65544,
       'fetchcount': 3, 'expiretime': 0, 'location': 2449473536L,
       'modifytime': 1190829792, 'datasize': 29448, 'infosize': 531}
      >>>

    • Basically, we're parsing the header into a useful Python data
      structure.
64. struct module (annotating parse_meta_header() above)
    struct parses binary-encoded data into Python objects.  You would
    use this module to pack/unpack raw binary data from Python strings.
    Sample format codes:

    'i'    int
    'I'    unsigned int
    'h'    short
    'H'    unsigned short
    'd'    double
    ...
    '>'    Big-endian
    '<'    Little-endian
    '!'    Network (big-endian)

    A leading count repeats a code: '9I' means nine unsigned ints.
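A tiny round-trip example (my example, not from the slides) showing
pack() and unpack() with an explicit byte order:

    import struct

    # Pack one unsigned int and two unsigned shorts, big-endian
    data = struct.pack('>IHH', 0x00010008, 2007, 11)
    print len(data)                    # 8 bytes: 4 + 2 + 2
    print struct.unpack('>IHH', data)  # (65544, 2007, 11)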
65. Header Validation (parse_meta_header(), continued)
    The unpacked data is a tuple:

    head = (n, n, n, n, n, n, n, n, n)

    The magic/size test is a check that sees if the unpacked data looks
    valid; if not, the function returns None.

66. Dictionary Creation (parse_meta_header(), continued)
    zip(s1, s2) makes a list of tuples, from which dict() makes a
    dictionary:

    zip(_headernames, head)
    # [('magic', head[0]),
    #  ('location', head[1]),
    #  ('fetchcount', head[2]),
    #  ...]

67. Commentary
    • Dictionaries as data structures:

      meta = {
        'fetchtime'   : 1190829792,
        'requestsize' : 27,
        'magic'       : 65544,
        'fetchcount'  : 3,
        'expiretime'  : 0,
        'location'    : 2449473536L,
        'modifytime'  : 1190829792,
        'datasize'    : 29448,
        'infosize'    : 531
      }

    • Useful if the data has many parts:

      data = f.read(head[8])             # Huh?!?

      vs.

      data = f.read(meta['infosize'])    # Better
68. Part 2: Reading Requests
    • Write a function that will read the request string and request
      information
    • Request string: a null-terminated string
    • Request info: a sequence of null-terminated key-value pairs
      (like a dictionary)
69. Reading Requests

    # Given a dictionary of header information and a file,
    # this function extracts the request data from a cache
    # metadata entry and saves it in the dictionary.  Returns
    # True or False depending on success.

    def read_request_data(meta, f):
        request = f.read(meta['requestsize']).strip('\x00')
        infodata = f.read(meta['infosize']).strip('\x00')
        # Validate request and infodata here (nothing now)
        # Turn the infodata into a dictionary
        parts = infodata.split('\x00')
        info = dict(zip(parts[::2], parts[1::2]))
        meta['request'] = request.split(':', 1)[1]
        meta['info'] = info
        return True
70. Usage: Requests
    • Usage of the function:

      >>> f = open("Cache/_CACHE_001_","rb")
      >>> f.seek(4096)              # Skip the bit map
      >>> headerdata = f.read(36)   # Read 36 byte header
      >>> meta = parse_meta_header(headerdata,1024)
      >>> read_request_data(meta,f)
      True
      >>> meta['request']
      'http://www.yahoo.com/'
      >>> meta['info']
      {'request-method': 'GET', 'request-User-Agent': 'Mozilla/5.0
      (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.7) Gecko/20070914
      Firefox/2.0.0.7', 'charset': 'UTF-8', 'response-head': 'HTTP/1.1
      200 OK\r\nDate: Wed, 26 Sep 2007 18:03:17 ...'}
      >>>
71. String Stripping (annotating read_request_data() above)
    Here, we just read the request string followed by the request info
    string.  Both end with a NUL, which .strip('\x00') removes.

72. String Splitting (continued)
    The request info is a string of NUL-separated key/value pairs.
    After stripping the trailing NUL:

    infodata = 'key\x00value\x00key\x00value\x00key\x00value'
    parts = infodata.split('\x00')
    # parts = ['key','value','key','value','key','value']

73. Advanced List Slicing (continued)
    We can slice a list with a stride:

    parts = ['key','value','key','value','key','value']
    parts[::2]     # ['key','key','key']
    parts[1::2]    # ['value','value','value']

    zip(parts[::2], parts[1::2])
    # [('key','value'), ('key','value'), ('key','value')]

    dict() then makes a dictionary from those pairs.

74. Fixing the Request (continued)
    Cleaning up the request string:

    request = "HTTP:http://www.google.com"
    request.split(':',1)       # ['HTTP', 'http://www.google.com']
    request.split(':',1)[1]    # 'http://www.google.com'
75. Commentary
    • Emphasize that Python has very powerful list manipulation
      primitives
      • Indexing
      • Slicing
      • List comprehensions
      • Etc.
    • Knowing how to use these leads to rapid development and compact
      code
76. Part 3: File Scanning
    • Write a function that scans a cache file and produces a sequence
      of records containing all of the cache metadata.
    • This is just one more of our building blocks
    • The goal is to hide some of the nasty bits
77. File Scanning

    # Scan a cache file from beginning to end, producing a
    # sequence of dictionaries with metadata information

    def scan_cachefile(f, blocksize):
        maxsize = 4 * blocksize    # Maximum size of an entry
        f.seek(4096)               # Skip the bit-map
        while True:
            headerdata = f.read(36)
            if not headerdata:
                break
            header = parse_meta_header(headerdata, maxsize)
            if header and read_request_data(header, f):
                yield header
            # Move the file pointer to the next block
            fp = f.tell()
            if fp % blocksize:
                f.seek(blocksize - (fp % blocksize), 1)
78. Usage: File Scanning
    • Usage of the scan function:

      >>> f = open("Cache/_CACHE_001_","rb")
      >>> for meta in scan_cachefile(f,256):
      ...     print meta['request']
      ...
      http://www.yahoo.com/
      http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/
      http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
      ...

    • We can just open up a cache file and write a for-loop to iterate
      over all of the metadata entries.
79. Python File I/O (annotating scan_cachefile() above)
    File objects are modeled after ANSI C: files are just bytes, and a
    file pointer keeps track of the current position.

    f.read()          # Read bytes
    f.tell()          # Current file pointer
    f.seek(n, off)    # Move file pointer
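To make the block-alignment trick concrete, here is a small sketch
(my example; it assumes a cache file is present in the current
directory):

    f = open('_CACHE_001_', 'rb')
    f.read(100)                    # file pointer now at offset 100
    fp = f.tell()
    blocksize = 256
    if fp % blocksize:
        # whence=1 seeks relative to the current position
        f.seek(blocksize - (fp % blocksize), 1)
    print f.tell()                 # 256: the start of the next block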
80. Using Earlier Code (scan_cachefile(), continued)
    Here we are using the header- and request-parsing functions written
    in the previous parts.

81. Generating Results (scan_cachefile(), continued)
    We are using yield to produce data for a single metadata entry.  If
    someone uses a for-loop, they will get all of the entries.
    Note: this allows us to process the cache without reading all of the
    data into memory.

82. Commentary
    • Have created a function that can scan a single _CACHE_00n_ file
      and produce a sequence of dictionaries with metadata.
    • It's still somewhat low-level
    • Just need to package it a little better
83. Part 4: Scan the Cache
    • Write a function that takes the name of a Firefox cache directory,
      scans all of the cache files for metadata, and produces a useful
      sequence of results.
    • Make it real easy to extract data
84. Solution: Cache Scan

    # Given the name of a Firefox cache directory, this function
    # scans all of the _CACHE_00n_ files for metadata.  A sequence
    # of dictionaries containing metadata is returned.

    import os
    def scan_cache(cachedir):
        n = 1
        blocksize = 256
        while n <= 3:
            cname = "_CACHE_00%d_" % n
            cfile = open(os.path.join(cachedir, cname), "rb")
            for meta in scan_cachefile(cfile, blocksize):
                meta['cachedir'] = cachedir
                meta['cachefile'] = cname
                yield meta
            cfile.close()
            n += 1
            blocksize *= 4
85. Solution: Cache Scan (annotating scan_cache() above)
    General idea: we loop over the three _CACHE_00n_ files and produce a
    sequence of the metadata dictionaries.

86. More Generation (scan_cache(), continued)
    By using yield here, we are chaining together the results obtained
    from all three cache files into one big long sequence of results.
    The underlying mechanics and implementation details are hidden
    (the user doesn't care).

87. Additional Data (scan_cache(), continued)
    The meta['cachedir'] and meta['cachefile'] assignments add path and
    file information to each dictionary (may be useful later).
88. Usage: Cache Scan
    • Usage of the scan function:

      >>> for meta in scan_cache("Cache/"):
      ...     print meta['request']
      ...
      http://www.yahoo.com/
      http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/
      http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
      ...

    • Given the name of a cache directory, we can just loop over all of
      the metadata.  Trivial!
    • With work, we could perform various kinds of queries and
      processing of the data
89. A Mini-Example
    • Find all requests related to Slashdot:

      >>> for meta in scan_cache("Cache/"):
      ...     if 'slashdot' in meta['request']:
      ...         print meta['request']
      ...
      http://www.slashdot.org/
      http://images.slashdot.org/topics/topiccommunications.gif
      http://images.slashdot.org/topics/topicstorage.gif
      http://images.slashdot.org/comments.css?T_2_5_0_176
      ...

    • Well, that was pretty easy.
90. Intermission
    • Have written two programs:
      • findcache.py: a program that locates Firefox cache directories
        on the file system
      • ffcache.py: a set of utility functions for extracting cache
        metadata
    • Have taken a moderately complex data processing problem and
      simplified it.
    • < 100 lines of code.
91. Data Encoding
    • Getting more information: all documents on the Internet have some
      type, encoding, and character set.  For example, a 'text/html'
      file that uses the 'UTF-8' character set and is encoded using gzip
      compression.
    • Your problem: write a function add_content_info(m) that inspects
      the cache metadata and adds additional information concerning the
      content type, encoding, and charset.
92. HTTP Responses
    • The cache metadata includes an HTTP response header carrying the
      content type, character set, and encoding:

      >>> print meta['info']['response-head']
      HTTP/1.1 200 OK
      Date: Sat, 29 Sep 2007 20:51:37 GMT
      Cache-Control: private
      Vary: User-Agent
      Content-Type: text/html; charset=utf-8
      Content-Encoding: gzip
      >>>
93. Solution

    # Given a metadata dictionary, this function adds additional
    # fields related to the content type, charset, and encoding

    import email
    def add_content_info(meta):
        info = meta['info']
        if 'response-head' not in info:
            content = None
            encoding = None
            charset = None
        else:
            rhead = info.get('response-head').split("\n", 1)[1]
            m = email.message_from_string(rhead)
            content = m.get_content_type()
            encoding = m.get('content-encoding', None)
            charset = m.get_content_charset()
        meta['content-type'] = content
        meta['content-encoding'] = encoding
        meta['charset'] = charset
94. Internet Data Handling (annotating add_content_info() above)
    Python has a vast assortment of internet data handling modules.
    The email module parses email messages, MIME headers, etc.
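A standalone sketch (the headers here are made up for illustration) of
the email calls used above:

    import email

    m = email.message_from_string(
        "Content-Type: text/html; charset=utf-8\r\n"
        "Content-Encoding: gzip\r\n\r\n")
    print m.get_content_type()         # 'text/html'
    print m.get_content_charset()      # 'utf-8'
    print m.get('content-encoding')    # 'gzip'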
95. Modified Cache Scan

    # A cache scanning function that adds the content
    # information to the metadata returned.

    def scan_cache_withcontent(cachedir):
        for meta in scan_cache(cachedir):
            add_content_info(meta)    # Add content info
            yield meta
96. A Mini-Example
    • Find all large JPEG images in the cache:

      >>> jpegs = (meta for meta in scan_cache_withcontent("Cache/")
      ...              if meta['content-type'] == 'image/jpeg'
      ...              and meta['datasize'] > 100000)
      >>> for j in jpegs:
      ...     print j['request']
      ...
      http://images.salon.com/ent/video_dog/comedy/2007/09/27/cereal/story.jpg
      http://images.salon.com/ent/video_dog/ifc/2007/09/28/apocalypse/story.jpg
      http://www.lakesideinns.com/images/fallroadphoto2006.jpg
      ...
      >>>

    • That was also pretty easy.
97. File Inspector
    • Getting cached documents: add a function getdata(m) to ffcache.py
      that takes a metadata dictionary and attempts to return the actual
      data that's stored in the cache.  If the data has been compressed
      or encoded in some manner, decode it before returning.
    • Big picture: more file I/O, but with the added problem of dealing
      with various data encoding issues.
98. Cache Locations
    • The Firefox cache metadata does not encode the location of the
      corresponding data.
    • The location is only stored in _CACHE_MAP_
    • _CACHE_MAP_ is only updated when Firefox terminates (it is held in
      memory).
    • So, we won't be able to accurately pull data from a live session.
99. A File Heuristic
    • Look for cache files that have the same file size as that encoded
      in the metadata.
    • If there are duplicates, pick the one with the modification date
      closest to that in the metadata.
    • If none of this works, don't worry about it.
100. Guessing at the File

     # Attempt to locate the matching cache file (if possible)

     import glob
     def getcandidate(meta):
         datasize = meta['datasize']
         filepat = os.path.join(meta['cachedir'], '[0-9A-F]*')
         filelist = glob.glob(filepat)
         # Get all files that are the same size
         samesize = [name for name in filelist
                          if os.path.getsize(name) == datasize]
         if not samesize:
             return ""
         # Get file with closest modification time
         mtime = meta['modifytime']
         delta, filename = min((abs(mtime - os.path.getmtime(name)), name)
                               for name in samesize)
         return filename
101. glob Module (annotating getcandidate() above)
     The glob module returns a list of filenames matching a pattern.
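For example (a sketch; this assumes the Cache/ directory from earlier):

    import glob

    # All entries whose names start with a hex digit, the same
    # pattern used in getcandidate()
    for name in glob.glob('Cache/[0-9A-F]*'):
        print name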
102. More File Operations (getcandidate(), continued)
     The list comprehension gets all files with the correct size; the
     min() over (time delta, name) tuples picks the file with the
     closest modification time.
103. Read a Cache File

     # Given a metadata entry, this attempts to read the associated
     # data (assuming it can be located)

     import gzip, codecs
     def getdata(meta):
         filename = getcandidate(meta)
         if not filename:
             return ""
         encoding = meta.get("content-encoding", "")
         if encoding == 'gzip':
             f = gzip.open(filename)
         else:
             f = open(filename, "rb")
         charset = meta.get("charset", None)
         if charset:
             reader = codecs.getreader(charset)(f)
         else:
             reader = f
         try:
             data = reader.read()
         except (IOError, ValueError):
             return ""
         return data
104. Internet Data Encoding (annotating getdata() above)
     When working with foreign data, it is critical to be able to work
     with different encodings.
     • gzip: read/write gzip-encoded files.
     • codecs: read/write different character encodings (UTF-8, UTF-16,
       Big5, etc.)
     There are many similar modules.
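A minimal round trip combining both modules (my example; the file path
is hypothetical):

    import gzip, codecs

    f = gzip.open('/tmp/sample.gz', 'wb')
    f.write(u'Jalape\xf1o'.encode('utf-8'))    # gzip-compressed UTF-8
    f.close()

    reader = codecs.getreader('utf-8')(gzip.open('/tmp/sample.gz'))
    print repr(reader.read())                  # u'Jalape\xf1o'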
105. Reading gzip Files (getdata(), continued)
     If the file is encoded as 'gzip', we open it using the gzip module;
     otherwise, we open it as a normal file.

106. Character Encoding (getdata(), continued)
     The codecs module is used to deal with special character encodings
     such as UTF-8.  Here, we are putting a codecs "reader" around the
     file if a charset was specified; otherwise, we just use the
     original file.
107. More on Data Encoding
     • Python has full support for Unicode
     • Unicode strings:

       pepper = u'Jalape\xf1o'

     • An advanced (and painful) topic
     • For internet applications, you should assume that text data will
       always be encoded according to some standard codec (Latin-1,
       ISO-8859-1, UTF-8, Big5, etc.)
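For instance (my example), the same unicode string under two common
encodings:

    pepper = u'Jalape\xf1o'
    print repr(pepper.encode('utf-8'))             # 'Jalape\xc3\xb1o'
    print repr(pepper.encode('latin-1'))           # 'Jalape\xf1o'
    print repr('Jalape\xc3\xb1o'.decode('utf-8'))  # u'Jalape\xf1o'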
  108. Copyright (C) 2007, http://www.dabeaz.com 1- Where do we stand? •

findcache.py. A program that can locate Firefox cache directories. (10 lines)
• ffcache.py. A module that can read cache metadata, determine content encoding, and read files from the cache (assuming they can be located). (140 lines)

108
  109. Copyright (C) 2007, http://www.dabeaz.com 1- Problem : CacheSpy • Big

Brother (make an evil sound here)

109

Write a program that first locates all of the Firefox cache directories under a given directory. Then have that program run forever as a network server, waiting for connections. On each connection, send back all of the current cache metadata.

• Big Picture: We're going to write a daemon that will find and quietly monitor browser caches.
  110. Copyright (C) 2007, http://www.dabeaz.com 1- Part I : Running a

Program

• Find all Firefox cache directories
• Didn't we already write that code?
• Yes. Let's run that program as a subprocess and collect its output.
• Problem: How to run other processes?

110
  111. Copyright (C) 2007, http://www.dabeaz.com 1- Solution : Pipes # Run

the findcache.py program as a subprocess and
# collect the output.

import popen2
import sys

def findcaches(topdir):
    cmd = sys.executable + " findcache.py " + topdir
    out,inp = popen2.popen2(cmd)
    inp.close()       # No input to subprocess
    caches = [line.strip() for line in out]
    return caches

111
  112. Copyright (C) 2007, http://www.dabeaz.com 1- popen2 module # Run the

    findcache.py program as a subprocess and # collect the output. import popen2 import sys def findcaches(topdir): cmd = sys.executable + " findcache.py " + topdir out,inp = popen2.popen2(cmd) inp.close() # No input to subprocess caches = [line.strip() for line in out] return caches 112 popen2 Contains functions for launching subprocesses and creating pipes.
  113. Copyright (C) 2007, http://www.dabeaz.com 1- Shell Commands # Run the

    findcache.py program as a subprocess and # collect the output. import popen2 import sys def findcaches(topdir): cmd = sys.executable + " findcache.py " + topdir out,inp = popen2.popen2(cmd) inp.close() # No input to subprocess caches = [line.strip() for line in out] return caches 113 We create a full shell command and execute it. cmd = "/usr/local/bin/python2.5 findcache.py topdir" sys.executable Full path of python intepreter.
  114. Copyright (C) 2007, http://www.dabeaz.com 1- Usage >>> caches = findcaches("/Users")

>>> caches
['/Users/beazley/Library/Caches/Firefox/Profiles/qs1ab616.default/Cache',
 '/Users/beazley/Library/Mozilla/Profiles/default/wxuoyiuf.slt/Cache']
>>>

114

• Usage of the findcaches function
• Commentary: Using pipes might be overkill for this. This is mostly just to illustrate.
  115. Copyright (C) 2007, http://www.dabeaz.com 1- More on Processes • Python

has extensive support for processes

• os module (pipes, fork, exec, spawnv, wait, exit, system, ttys)
• popen2 (pipes)
• subprocess (high level subprocess API)
• signal (signal handling)
• Almost any C-level program written using POSIX calls can be implemented in Python

115
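As a point of comparison, here is a sketch of the same findcaches() function written against the higher-level subprocess module instead of popen2 (not from the original slides; assumes Python 2.4 or later):

import subprocess, sys

def findcaches(topdir):
    # Launch findcache.py and collect its stdout, no shell involved
    p = subprocess.Popen([sys.executable,"findcache.py",topdir],
                         stdout=subprocess.PIPE)
    caches = [line.strip() for line in p.stdout]
    p.wait()
    return caches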
  116. Copyright (C) 2007, http://www.dabeaz.com 1- Interactive Processes • Python does

not have built-in support for controlling interactive subprocesses (e.g., "Expect")

• Must install third party modules for this
• Example: pexpect
    http://pexpect.sourceforge.net

116
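For a flavor of what that looks like, a hypothetical pexpect session (not from the slides; assumes pexpect is installed, and the host name is made up):

import pexpect

child = pexpect.spawn("ftp ftp.example.com")   # made-up host
child.expect("Name .*: ")
child.sendline("anonymous")
child.expect("Password:")
child.sendline("guest@example.com")
child.expect("ftp> ")
child.sendline("quit")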
  117. Copyright (C) 2007, http://www.dabeaz.com 1- Part 2 : Making a

Server

• Write a simple server program that sends back all metadata when it receives a connection.

117
  118. Copyright (C) 2007, http://www.dabeaz.com 1- CacheSpy Server import pickle, sys,

SocketServer, ffcache

cachelist = findcaches(sys.argv[1])

def dump_cache(f):
    for dir in cachelist:
        for meta in ffcache.scan_cache_withcontent(dir):
            pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",31337),SpyHandler)
print "CacheSpy running on port 31337"
serv.serve_forever()

118
  119. Copyright (C) 2007, http://www.dabeaz.com 1- SocketServer Module import pickle, sys,

SocketServer, ffcache

cachelist = findcaches(sys.argv[1])

def dump_cache(f):
    for dir in cachelist:
        for meta in ffcache.scan_cache_withcontent(dir):
            pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",31337),SpyHandler)
print "CacheSpy running on port 31337"
serv.serve_forever()

119

SocketServer. A module for easily creating low-level internet applications using sockets.
  120. Copyright (C) 2007, http://www.dabeaz.com 1- SocketServer Module import pickle, sys,

SocketServer, ffcache

cachelist = findcaches(sys.argv[1])

def dump_cache(f):
    for dir in cachelist:
        for meta in ffcache.scan_cache_withcontent(dir):
            pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",31337),SpyHandler)
print "CacheSpy running on port 31337"
serv.serve_forever()

120

You define a simple class that implements handle(). This implements the server logic.
  121. Copyright (C) 2007, http://www.dabeaz.com 1- SocketServer Module import pickle, sys,

SocketServer, ffcache

cachelist = findcaches(sys.argv[1])

def dump_cache(f):
    for dir in cachelist:
        for meta in ffcache.scan_cache_withcontent(dir):
            pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",31337),SpyHandler)
print "CacheSpy running on port 31337"
serv.serve_forever()

121

Next, you just create a Server object, hook the handler up to it, and run the server.
  122. Copyright (C) 2007, http://www.dabeaz.com 1- Data Serialization import pickle, sys,

SocketServer, ffcache

cachelist = findcaches(sys.argv[1])

def dump_cache(f):
    for dir in cachelist:
        for meta in ffcache.scan_cache_withcontent(dir):
            pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",31337),SpyHandler)
print "CacheSpy running on port 31337"
serv.serve_forever()

122

Here, we are turning a socket into a file and dumping cache data on it.
self.request: the socket corresponding to the client that connected.
  123. Copyright (C) 2007, http://www.dabeaz.com 1- pickle Module import pickle, sys,

SocketServer, ffcache

cachelist = findcaches(sys.argv[1])

def dump_cache(f):
    for dir in cachelist:
        for meta in ffcache.scan_cache_withcontent(dir):
            pickle.dump(meta,f)

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        dump_cache(f)
        f.close()

SocketServer.TCPServer.allow_reuse_address = True
serv = SocketServer.TCPServer(("",31337),SpyHandler)
print "CacheSpy running on port 31337"
serv.serve_forever()

123

The pickle module takes any Python object and serializes it. There are really only two ops:

pickle.dump(obj,f)     # Dump object
obj = pickle.load(f)   # Load object
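A tiny round-trip sketch of those two ops (illustrative only, using an in-memory file rather than a socket):

import pickle
from StringIO import StringIO

buf = StringIO()
pickle.dump({'request': 'http://www.dabeaz.com'},buf)   # serialize
buf.seek(0)
obj = pickle.load(buf)                                  # reconstruct
print obj['request']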
  124. Copyright (C) 2007, http://www.dabeaz.com 1- Running our Server % python

cachespy.py /Users
CacheSpy running on port 31337

124

• Example: the server is just sitting there waiting
• You can try connecting with telnet:

% telnet localhost 31337
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
(dp0
S'info'
p1
... bunch of cryptic data ...
  125. Copyright (C) 2007, http://www.dabeaz.com 1- Problem : CacheMon • The

evil overlord (bwahahahahaha!)

125

Write a program cachemon.py that contains a function for retrieving the cache contents from a remote machine.

• Big Picture: Writing network clients. Programs that make outgoing connections to internet services.
  126. Copyright (C) 2007, http://www.dabeaz.com 1- # cachemon.py import pickle, socket

def scan_remote_cache(host):
    s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
    s.connect(host)
    f = s.makefile()
    try:
        while True:
            meta = pickle.load(f)
            meta['host'] = host     # Add host to metadata
            yield meta
    except EOFError:
        pass
    f.close()
    s.close()

Solution : Cachemon

126
  127. Copyright (C) 2007, http://www.dabeaz.com 1- # cachemon.py import pickle, socket

def scan_remote_cache(host):
    s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
    s.connect(host)
    f = s.makefile()
    try:
        while True:
            meta = pickle.load(f)
            meta['host'] = host     # Add host to metadata
            yield meta
    except EOFError:
        pass
    f.close()
    s.close()

Solution : Socket Module

127

The socket module provides direct access to the low-level socket API:

s = socket(family,type)
s.connect(addr)
s.bind(addr)
s.listen(n)
s.accept()
s.recv(n)
s.send(data)
...
  128. Copyright (C) 2007, http://www.dabeaz.com 1- # cachemon.py import pickle, socket

def scan_remote_cache(host):
    s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
    s.connect(host)
    f = s.makefile()
    try:
        while True:
            meta = pickle.load(f)
            meta['host'] = host     # Add host to metadata
            yield meta
    except EOFError:
        pass
    f.close()
    s.close()

Unpickling a Sequence

128

Here we use pickle to repeatedly load objects. We use yield to generate a sequence of received objects.
  129. Copyright (C) 2007, http://www.dabeaz.com 1- Example Usage >>> rcache =

scan_remote_cache(("localhost",31337))
>>> jpegs = (meta for meta in rcache
...            if meta['content-type'] == 'image/jpeg'
...            and meta['datasize'] > 100000)
>>> for j in jpegs:
...     print j['request']
...
http://images.salon.com/ent/video_dog/comedy/2007/09/27/cereal/story.jpg
http://images.salon.com/ent/video_dog/ifc/2007/09/28/apocalypse/story.jpg
http://www.lakesideinns.com/images/fallroadphoto2006.jpg
...

129

• Example: Find all JPEG images > 100K on a remote machine
• Very similar to old code!
  130. Copyright (C) 2007, http://www.dabeaz.com 1- Variation : CacheMon • Scan

a whole cluster of machines

130

Write a function that can easily scan the caches of an entire list of remote hosts.

• Big Picture: Collecting data from a group of machines on the network.
  131. Copyright (C) 2007, http://www.dabeaz.com 1- # cachemon.py def scan_caches(hostlist): for

host in hostlist:
    try:
        for meta in scan_remote_cache(host):
            yield meta
    except (EnvironmentError,socket.error):
        pass

Solution : Cachemon

131

A bit of exception handling to deal with dead machines and other problems (might need to be expanded).
  132. Copyright (C) 2007, http://www.dabeaz.com 1- Example Usage >>> hosts =

[('host1',31337),('host2',31337),...]
>>> rcaches = scan_caches(hosts)
>>> jpegs = (meta for meta in rcaches
...            if meta['content-type'] == 'image/jpeg'
...            and meta['datasize'] > 100000)
>>> for j in jpegs:
...     print j['request']
...
...

132

• Example: Find all JPEG images > 100K on a set of remote machines
• Think about the abstraction of "iteration" here. The query code is exactly the same.
  133. Copyright (C) 2007, http://www.dabeaz.com 1- Concurrent Monitor • Collect data

from a large set of machines

133

In the last section, we wrote a function that scanned an entire list of hosts, one at a time. Modify that function to scan a set of hosts using concurrent network connections.

• Big Picture: Breaking a task up into concurrently executing parts. Programming with threads.
  134. Copyright (C) 2007, http://www.dabeaz.com 1- Concurrency 134 • Python provides

full support for threads

• They are real threads (pthreads, system threads, etc.)
• The only catch: The Python run-time interpreter is protected by a global interpreter lock. So, no true concurrency across multiple CPUs.
  135. Copyright (C) 2007, http://www.dabeaz.com 1- # cachemon.py ... import threading

class ScanThread(threading.Thread):
    def __init__(self,host,msg_q):
        threading.Thread.__init__(self)
        self.host = host
        self.msg_q = msg_q
    def run(self):
        try:
            for meta in scan_remote_cache(self.host):
                self.msg_q.put(meta)
        except (EnvironmentError,socket.error):
            pass

A Cache Scanning Thread

135
  136. Copyright (C) 2007, http://www.dabeaz.com 1- # cachemon.py ... import threading

class ScanThread(threading.Thread):
    def __init__(self,host,msg_q):
        threading.Thread.__init__(self)
        self.host = host
        self.msg_q = msg_q
    def run(self):
        try:
            for meta in scan_remote_cache(self.host):
                self.msg_q.put(meta)
        except (EnvironmentError,socket.error):
            pass

threading Module

136

threading module. Contains most functionality related to threads.
  137. Copyright (C) 2007, http://www.dabeaz.com 1- # cachemon.py ... import threading

class ScanThread(threading.Thread):
    def __init__(self,host,msg_q):
        threading.Thread.__init__(self)
        self.host = host
        self.msg_q = msg_q
    def run(self):
        try:
            for meta in scan_remote_cache(self.host):
                self.msg_q.put(meta)
        except (EnvironmentError,socket.error):
            pass

Thread Base Class

137

Threads are defined by inheriting from the Thread base class.
  138. Copyright (C) 2007, http://www.dabeaz.com 1- # cachemon.py ... import threading

class ScanThread(threading.Thread):
    def __init__(self,host,msg_q):
        threading.Thread.__init__(self)
        self.host = host
        self.msg_q = msg_q
    def run(self):
        try:
            for meta in scan_remote_cache(self.host):
                self.msg_q.put(meta)
        except (EnvironmentError,socket.error):
            pass

Thread Initialization

138

__init__(): initialization and setup.
  139. Copyright (C) 2007, http://www.dabeaz.com 1- # cachemon.py ... import threading

class ScanThread(threading.Thread):
    def __init__(self,host,msg_q):
        threading.Thread.__init__(self)
        self.host = host
        self.msg_q = msg_q
    def run(self):
        try:
            for meta in scan_remote_cache(self.host):
                self.msg_q.put(meta)
        except (EnvironmentError,socket.error):
            pass

Thread Execution

139

run() method. Contains code that executes in the thread.
  140. Copyright (C) 2007, http://www.dabeaz.com 1- Launching a Thread 140 •

You create a thread object and launch it:

t1 = ScanThread(("host1",31337),msg_q)
t1.start()
t2 = ScanThread(("host2",31337),msg_q)
t2.start()

• .start() starts the thread and calls .run()
  141. Copyright (C) 2007, http://www.dabeaz.com 1- Interlude 141 • Threads are

commonly used to implement variations of producer/consumer problems.

• Data is produced by one or more producers (e.g., each remote client)
• Consumed by one or more consumers (e.g., a centralized function)
• You could try to coordinate this with locks and other synchronization.
  142. Copyright (C) 2007, http://www.dabeaz.com 1- Thread Safe Queues 142 •

Queue module. Provides a thread-safe queue:

import Queue
msg_q = Queue.Queue()

• Queue insertion:
    msg_q.put(obj)
• Queue removal:
    obj = msg_q.get()
• A Queue can be shared by as many threads as you want without worrying about locking.
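A toy producer/consumer sketch with this module (illustrative only; note the None sentinel, the same trick the monitor uses below):

import threading, Queue

def producer(q,items):
    for item in items:
        q.put(item)
    q.put(None)                 # Sentinel: no more data

q = Queue.Queue()
t = threading.Thread(target=producer,args=(q,range(5)))
t.start()
while True:
    item = q.get()
    if item is None:            # Sentinel received
        break
    print item
t.join()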
  143. Copyright (C) 2007, http://www.dabeaz.com 1- # cachemon.py ... import threading

class ScanThread(threading.Thread):
    def __init__(self,host,msg_q):
        threading.Thread.__init__(self)
        self.host = host
        self.msg_q = msg_q
    def run(self):
        try:
            for meta in scan_remote_cache(self.host):
                self.msg_q.put(meta)
        except (EnvironmentError,socket.error):
            pass

Use of a Queue Object

143

msg_q: a Queue object where incoming objects are placed.
run(): gets data from the remote machine and puts it in the Queue.
  144. Copyright (C) 2007, http://www.dabeaz.com 1- Primitive Use of a Queue

144

• You create a queue, then launch the threads:

msg_q = Queue.Queue()
t1 = ScanThread(("host1",31337),msg_q)
t1.start()
t2 = ScanThread(("host2",31337),msg_q)
t2.start()
while True:
    meta = msg_q.get()     # Get metadata
  145. Copyright (C) 2007, http://www.dabeaz.com 1- Monitor Architecture 145 Host Host

[Figure: Monitor architecture. Three hosts, each connected by a socket to its own ScanThread inside the monitor; every thread calls msg_q.put(), and a consumer pulls results with msg_q.get().]
  146. Copyright (C) 2007, http://www.dabeaz.com 1- Concurrent Monitor import threading, Queue

def launch_scanners(hostlist,msg_q):
    tlist = []
    for host in hostlist:
        thr = ScanThread(host,msg_q)
        thr.start()
        tlist.append(thr)
    for thr in tlist:
        thr.join()
    msg_q.put(None)     # Sentinel

def scan_caches(hostlist):
    msg_q = Queue.Queue()
    thr = threading.Thread(target=launch_scanners,
                           args=(hostlist,msg_q))
    thr.start()
    while True:
        meta = msg_q.get()
        if meta:
            yield meta
        else:
            break

146
  147. Copyright (C) 2007, http://www.dabeaz.com 1- Launching Threads import threading, Queue

def launch_scanners(hostlist,msg_q):
    tlist = []
    for host in hostlist:
        thr = ScanThread(host,msg_q)
        thr.start()
        tlist.append(thr)
    for thr in tlist:
        thr.join()
    msg_q.put(None)     # Sentinel

def scan_caches(hostlist):
    msg_q = Queue.Queue()
    thr = threading.Thread(target=launch_scanners,
                           args=(hostlist,msg_q))
    thr.start()
    while True:
        meta = msg_q.get()
        if meta:
            yield meta
        else:
            break

147

The above function is a thread that launches ScanThreads. It then waits for the threads to terminate by joining with them. After all threads have terminated, a sentinel is dropped in the Queue.
  148. Copyright (C) 2007, http://www.dabeaz.com 1- Collecting Results import threading, Queue

def launch_scanners(hostlist,msg_q):
    tlist = []
    for host in hostlist:
        thr = ScanThread(host,msg_q)
        thr.start()
        tlist.append(thr)
    for thr in tlist:
        thr.join()
    msg_q.put(None)     # Sentinel

def scan_caches(hostlist):
    msg_q = Queue.Queue()
    thr = threading.Thread(target=launch_scanners,
                           args=(hostlist,msg_q))
    thr.start()
    while True:
        meta = msg_q.get()
        if meta:
            yield meta
        else:
            break

148

The function below creates a Queue and launches a thread to launch all of the scanning threads. It then produces a sequence of cache data until the sentinel (None) is pulled off of the queue.
  149. Copyright (C) 2007, http://www.dabeaz.com 1- More on Threads 149 •

There are many more issues to thread programming that we could discuss.

• All issues concerning locking, synchronization, event handling, and race conditions apply to Python.
• Because of the global interpreter lock, threads are not (generally) a way to achieve higher performance.
  150. Copyright (C) 2007, http://www.dabeaz.com 1- Thread Synchronization 150 • threading

module has various primitives:

Lock()          # Mutex lock
RLock()         # Reentrant mutex lock
Semaphore(n)    # Semaphore

• Example use:

x = value          # Some kind of shared object
x_lock = Lock()    # A lock associated with x
...
x_lock.acquire()
# Modify or do something with x (critical section)
...
x_lock.release()
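One refinement worth making (not on the original slide): release the lock in a finally clause so that an exception raised in the critical section cannot leave the lock held forever. update() here is a hypothetical operation on x.

x_lock.acquire()
try:
    update(x)           # Critical section (hypothetical operation)
finally:
    x_lock.release()    # Runs even if update() raises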
  151. Copyright (C) 2007, http://www.dabeaz.com 1- Story so Far 151 •

Wrote a program findcache.py that located cache directories (~10 lines)
• Wrote a module ffcache.py that parsed contents of caches (~140 lines)
• Wrote cachespy.py that allows caches to be retrieved (~30 lines)
• Wrote a concurrent monitor for getting that data (~50 lines)
  152. Copyright (C) 2007, http://www.dabeaz.com 1- A subtle observation 152 •

In none of our programs have we read the entire contents of the Firefox cache into memory.

• In cachespy.py, the contents are read iteratively and piped through a socket (not stored in memory).
• In cachemon.py, contents are received and routed through message queues. Processed iteratively (no temporary lists of results).
  153. Copyright (C) 2007, http://www.dabeaz.com 1- Another Observation 153 • For

every connection, cachespy sends the entire contents of the Firefox cache metadata back to the client.

• However, mostly we're just performing various kinds of queries on the data and filtering.
• Question: Could we do any of this work remotely?
  154. Copyright (C) 2007, http://www.dabeaz.com 1- Remote Mapping • Distribute the

work

154

Modify the cachespy program so that some of the mapping and filtering work can be performed remotely on each of the machines. Only send back a subset of the data to the monitor program.

• Big Picture: Distributed computation. Massive security problem.
  155. Copyright (C) 2007, http://www.dabeaz.com 1- The idea • Modify scan_remote_cache()

to accept an optional filter specification. Pass this on to the remote machine and use it to process the data remotely before returning results.

155

filter = """
if meta['content-type'] == 'image/jpeg'
   and meta['datasize'] > 100000
"""
rcache = scan_remote_cache(host,filter)
  156. Copyright (C) 2007, http://www.dabeaz.com 1- Changes to the Monitor #

cachemon.py

def scan_remote_cache(host,filter=""):
    s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
    s.connect(host)
    f = s.makefile()
    pickle.dump(filter,f)     # Send the filter to the remote host
    f.flush()
    try:
        while True:
            meta = pickle.load(f)
            meta['host'] = host
            yield meta
    except EOFError:
        pass

156
  157. Copyright (C) 2007, http://www.dabeaz.com 1- Changes to the Monitor #

cachemon.py
...

class ScanThread(threading.Thread):
    def __init__(self,host,msg_q,filter=""):
        threading.Thread.__init__(self)
        self.host = host
        self.msg_q = msg_q
        self.filter = filter     # filter added to thread data
    def run(self):
        try:
            for meta in scan_remote_cache(self.host,self.filter):
                self.msg_q.put(meta)
        except (EnvironmentError,socket.error):
            pass

157
  158. Copyright (C) 2007, http://www.dabeaz.com 1- Changes to the Monitor #

cachemon.py
...

def launch_scanners(hostlist,msg_q,filter=""):
    tlist = []
    for host in hostlist:
        thr = ScanThread(host,msg_q,filter)   # filter passed to thread creation
        thr.start()
        tlist.append(thr)
    for thr in tlist:
        thr.join()
    msg_q.put(None)     # Sentinel

158
  159. Copyright (C) 2007, http://www.dabeaz.com 1- Changes to the Monitor #

cachemon.py
...

def scan_caches(hostlist,filter=""):       # filter added
    msg_q = Queue.Queue()
    thr = threading.Thread(target=launch_scanners,
                           args=(hostlist,msg_q,filter))
    thr.start()
    while True:
        meta = msg_q.get()
        if not meta:
            break
        yield meta

159
  160. Copyright (C) 2007, http://www.dabeaz.com 1- Changes to CacheSpy # cachespy.py

...
import traceback     # needed for format_exc() below

def dump_cache(f,filter):
    valuegen = """(meta for dir in cachelist
                        for meta in ffcache.scan_cache_withcontent(dir)
                        %s)""" % filter
    try:
        for meta in eval(valuegen):
            pickle.dump(meta,f)
    except:
        pickle.dump({'error': traceback.format_exc()},f)

160
  161. Copyright (C) 2007, http://www.dabeaz.com 1- Changes to CacheSpy # cachespy.py

...

def dump_cache(f,filter):
    valuegen = """(meta for dir in cachelist
                        for meta in ffcache.scan_cache_withcontent(dir)
                        %s)""" % filter
    try:
        for meta in eval(valuegen):
            pickle.dump(meta,f)
    except:
        pickle.dump({'error': traceback.format_exc()},f)

161

Filter added and used to create an expression string. For example:

filter = "if meta['datasize'] > 100000"
valuegen = """(meta for dir in cachelist
                    for meta in ffcache.scan_cache_withcontent(dir)
                    if meta['datasize'] > 100000)"""
  162. Copyright (C) 2007, http://www.dabeaz.com 1- Eval() # cachespy.py ... def

dump_cache(f,filter):
    valuegen = """(meta for dir in cachelist
                        for meta in ffcache.scan_cache_withcontent(dir)
                        %s)""" % filter
    try:
        for meta in eval(valuegen):
            pickle.dump(meta,f)
    except:
        pickle.dump({'error': traceback.format_exc()},f)

162

eval(s). Evaluates s as a Python expression.
A bit of error handling: the traceback module creates stack traces for exceptions.
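A tiny sketch of the underlying eval() trick, building a generator expression from a string (illustrative only, not from the slides):

nums = [1, 2, 3, 4]
expr = "(n*n for n in nums if n > 2)"
gen = eval(expr)        # eval() sees nums in the enclosing scope
print list(gen)         # prints [9, 16]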
  163. Copyright (C) 2007, http://www.dabeaz.com 1- Changes to the Server #

cachespy.py
...

class SpyHandler(SocketServer.BaseRequestHandler):
    def handle(self):
        f = self.request.makefile()
        filter = pickle.load(f)     # filter added: read it from the client
        dump_cache(f,filter)
        f.close()

163
  164. Copyright (C) 2007, http://www.dabeaz.com 1- Putting it all Together •

Have created some interesting machinery

164

# Find all of those slashdot slackers
import cachemon

hosts = [('host1',31337),('host2',31337),
         ('host3',31337),...]
filter = "if 'slashdot' in meta['request']"
rcaches = cachemon.scan_caches(hosts,filter)
for meta in rcaches:
    print meta['request']
    print meta['host'],meta['cachedir']
    print
  165. Copyright (C) 2007, http://www.dabeaz.com 1- Putting it all Together •

Queries run remotely on all the hosts
• Only data of interest is sent back
• No temporary lists or large data structures
• Concurrent execution on monitor
• Concurrency is hidden from user

165
  166. Copyright (C) 2007, http://www.dabeaz.com 1- The Power of Iteration •

Loop over all entries in a cache file:

for meta in scan_cache_file(f,256):
    ...

• Loop over all entries in a cache directory:

for meta in scan_cache(dirname):
    ...

• Loop over all cache entries on remote host:

for meta in scan_remote_cache(host):
    ...

• Loop over all cache entries on many hosts:

for meta in scan_caches(hostlist):
    ...

166
  167. Copyright (C) 2007, http://www.dabeaz.com 1- Wrapping Up • A lot

of material has been presented

• Again, the goal was to do something interesting with Python, not to be just a reference manual.
• This is only a small taste of what's possible
• And it's only a small taste of why people like programming in Python

167
  168. Copyright (C) 2007, http://www.dabeaz.com 1- Where to go from here?

• Everything Pythonic:
    http://www.python.org
• Get involved. PyCon'2008 (Chicago)
• Have an on-site course (shameless plug):
    http://www.dabeaz.com/python.html

168
  169. Copyright (C) 2007, http://www.dabeaz.com 1- Thanks for Listening! • Hope

you got something out of the class

169

• Please give me feedback!