Slides from the 10-week course I taught in the Computer Science Masters Degree program at the University of Chicago in 2008. Fairly opinionated and off the deep end. If you ever wanted to attend a 30-hour tutorial, this is for you.
“There are two types of programming languages--those that everyone hates and those that nobody uses.” - John Ousterhout (overheard at a conference) This course is mostly about the first category...
• Clearly there is some kind of distinction • Other than dynamic languages often being derided by "real programmers" • Let's look at a simple programming problem...
Dave has taken out a $500,000 mortgage from Guido's Mortgage, Stock, and Viagra trading corporation. He got an unbelievable rate of 4% and a monthly payment of only $499. However, Guido, being kind of soft-spoken, didn't tell Dave that after 2 years, the rate changes to 9% and the monthly payment becomes $3999. • Question: How much does Dave pay and how many months does it take?
programs are compiled
    shell % cc mortgage.c -o mortgage.exe
    shell %
• Requires the use of a compiler/development environment (gcc, Visual Studio, etc.)
    shell % javac Mortgage.java
    shell %
• Produces an executable/class file that is separate from the original source code
• That is what you use to run the program
Compilation is a one-time operation. When you want to run the program, you just use the output of the compiler (e.g., the .exe file) • If you want to make any change to the program, the source must be recompiled. • Edit/compile/run/debug cycle.
Compilers perform extensive error checking/validation. • Goal is to find errors before the program runs (reported as compiler errors) • To do this, programs include extra specifications that are used to perform these checks. • Usually associated with "type-checking"
functions/methods have prototypes
    double square(double x) { return x*x; }
• Inconsistent use results in errors
    double y = square(3,4);        // Error. Too many args
    double y = square("Hello");    // Error. Bad arg type
    char *z = square(4.0);         // Error. Bad return type
• Emphasize: Errors caught during compilation
compiled languages, the main focus is the compiler. • Compiler produces executables, performs validation, reports errors, performs various kinds of optimizations, etc. • The result is a "static" program. A program whose functionality is rigidly fixed at the time of compilation. A program that can not be changed without recompiling.
"static" programs have been successfully compiled, you are reasonably sure that they are free from certain kinds of errors (especially inconsistent use of data). • (Of course, there may be other bugs) • Since a compiler provides a framework for analyzing programs, a lot of serious computer science has focused on this.
The main feature of "dynamic" languages is that they get rid of separate compilation • You write programs (usually without worrying about low-level details). • You then just "run" the program. • Let's look at an example...
    shell % python mortgage.py
    Total paid 2623323.00
    Months 677
    shell %
• Running a Python program
• The program is executed by an interpreter (python) that reads statements from the input program and runs them one after the other.
There is no separate compilation. You just run Python on the program. The source code is the program. • If you make changes, they show up next time. • You don't have to package code into a main() function or anything similar to that. • A program can be just a sequence of statements.
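As a rough sketch of what a flat script for the mortgage problem could look like (the actual mortgage.py is not reproduced in this transcript, so the structure and variable names here are assumptions), the whole program can be a plain sequence of statements. It is written with Python 3's print() function rather than the Python 2 print statement used elsewhere in these slides.

```python
# Sketch of the mortgage problem as a flat script (names are assumptions).
principal = 500000.0   # amount borrowed
payment = 499.0        # teaser monthly payment
rate = 0.04            # teaser annual interest rate
month = 0
total_paid = 0.0

while principal > 0:
    # Add one month of interest, then subtract the payment
    principal = principal * (1 + rate / 12) - payment
    total_paid += payment
    month += 1
    if month == 24:
        # After 2 years the teaser terms end
        rate = 0.09
        payment = 3999.0

print("Total paid %0.2f" % total_paid)
print("Months", month)
```

Run under these assumptions, the script reproduces the figures quoted on the slide (677 months, 2623323.00 paid in total).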
Interpreters delay error checking/validation to run-time. As a result, programs don't generally involve explicit "type" declarations
    principle = 500000
    payment = 499
    rate = 0.04
    month = 0
    total_paid = 0
Notice how none of these variable assignments has a "type"
• One consequence: Programs in dynamic languages tend to involve much less typing (at the keyboard)
    x = 42          # x is an integer
    ...
    x = "hello"     # x is now a string (OK)
• In dynamic languages, variables are not restricted to a single type of data
• The type of a "variable" is associated with whatever value is currently assigned to the variable---it may change while running!
• This is very different from C/C++/Java.
    z = x + y       # Succeeds if x + y makes sense

    x = 37
    y = 42
    z = x + y       # Ok. z = 79

    x = "Hello"
    y = "World"
    z = x + y       # Ok. z = "HelloWorld"

    x = 37
    y = "World"
    z = x + y       # Error!
This operation fails because the two operands are incompatible (number and string)
• All operations involve run-time checks
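The run-time check described above can be observed directly. The following is a minimal sketch in Python 3 syntax; the helper name add is only for illustration.

```python
def add(x, y):
    # No declared types: the + operation is checked when it actually runs
    return x + y

print(add(37, 42))             # two integers
print(add("Hello", "World"))   # two strings

try:
    add(37, "World")           # incompatible operands
except TypeError as err:
    # The failure is reported only when this line executes
    print("Run-time error:", err)
```

Note that the same function body works for both numbers and strings; nothing is rejected until an incompatible pair actually reaches the + operator.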
Since dynamic languages do everything at run-time, the interpreters can often be used interactively (like a shell)
    shell % python
    Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
    [GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
    >>> 3 + 4
    7
    >>> print "Hello World"
    Hello World
    >>>
• Sometimes known as a "read-eval-print" loop (REPL)
• As a general rule, when a computer scientist talks about some part of a program being "dynamic", it means that it occurs while the program runs. • Dynamic Typing - Type checking at run time. • Dynamic Binding - Virtual methods in OO • Dynamic Linking - Run-time linking of program modules/libraries
• Performance. Dynamic programs run much slower than static programs because they perform all of the error-checking as the program runs. • Systems. Hard to do low-level hacking of the hardware (e.g., device drivers) • Validation. Since errors are not detected until a program runs, programs may have hidden/obscure errors (that would have been caught by a compiler).
have some benefits. • Rapid Development. Languages are high-level. Programs assembled from components • Scripting. Complex applications can be controlled by programmable scripts that can be changed without having to recompile • Flexibility. Programs can be easily changed/reconfigured.
• Ease of use. Dynamic languages are often better suited for end-users. They do not require users to worry about low-level implementation details (such as types). • Portability. Languages are often so high level, they work easily across different machines. • Prototyping. Often significantly easier to prototype a system in a dynamic language.
are used almost everywhere--often behind the scenes. • Internet (Google,Web, etc.) • Movie-making (special effects) • Television (control systems) • Scientific computing (supercomputing) • Robots • Video games
are often viewed as exotic, unreliable, and "unserious" by managers and crusty software engineers • In many cases, their use in an organization is subversive (initiated by lone-programmers, interns, students, etc.) • In certain cases, the use of such languages is considered to be a strategic advantage (a.k.a., a "trade secret").
Programmers are not using dynamic languages as a replacement for C++ or Java. • They're using these languages in addition to static languages • They're writing programs that utilize the strengths of both (e.g., C++ for speed, dynamic languages for flexibility).
In the early days, computers were big, very expensive, and quite limited in power (your cell-phone has far more compute power). • Early programs written directly on the hardware (hard-wired, machine language). • Later, assembly language. • Fed to systems on punch-cards
first "high-level" programming languages • Fortran (1954) • Lisp (1958) • ALGOL (1958) • COBOL (1959) • Each of these efforts came out of different communities (Fortran - Engineering/Science, Lisp - Mathematics, COBOL - Business)
In the early days, no one really knew exactly what they were doing • "Computer Science" didn't even emerge as a separate discipline until the 1960s • Aspects of "programming languages" were still being worked out.
primarily meant to replace hand-coding of assembly language • Highly focused on raw performance for science/engineering work • Initially developed by IBM around 1954 • Still used today for that same purpose (Fortran 2008 standard underway).
language in many ways
• Example: Implicit Typing
    NMONTHS = 0        (an integer)
    TOTALPAID = 0      (a real)
• Type determined by first character of the name (I-N are ints, all others are reals)
• And this only scratches the surface.
• Yet, compare the "look" of early Fortran to modern scripting languages
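To make the implicit typing rule concrete, here is a small Python sketch of it. The function implicit_type is a hypothetical helper written for this illustration; it is not part of Fortran or of any library.

```python
def implicit_type(name):
    # Classic Fortran rule: names starting with I through N are integers,
    # everything else is real. (Hypothetical helper for illustration only.)
    first = name[0].upper()
    return "integer" if "I" <= first <= "N" else "real"

print(implicit_type("NMONTHS"))     # starts with N
print(implicit_type("TOTALPAID"))   # starts with T
```

The point of the rule is the same as in modern scripting languages: a variable can be used without any declaration at all.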
a committee of scientists around 1958 (ETH-Zurich) • Initial motivation was to address perceived problems with FORTRAN (of which there were many) • Was hugely influential in subsequent computer science research on programming languages, type-systems, compilers, etc.
modern programming languages utilize concepts that were worked out in various versions of ALGOL. • Very strong focus on the design of compilers • However, ALGOL itself never really caught on commercially (legacy ALGOL?) • The language never offered any standard I/O facilities (different on every machine)
used widely in business/finance • Developed in 1959 by committee (Burroughs, IBM, Honeywell, RCA, Sperry Rand, Sylvania, USAF, NIST, etc.) • Still lives today and has all of the features that you would expect from such a committee effort.
MORTGAGE.
DATA DIVISION.
WORKING-STORAGE SECTION.
01 PRINCIPLE PIC S9(7)V99 VALUE 500000.00 .
01 PAYMENT   PIC 9(7)V99  VALUE 499.00 .
01 RATE      PIC 9V99     VALUE 0.04 .
01 MONTH     PIC 999      VALUE 0 .
01 TOTALPAID PIC 9(7)     VALUE 0.00 .
PROCEDURE DIVISION .
MAIN.
    PERFORM WITH TEST BEFORE UNTIL PRINCIPLE < 0.00
        COMPUTE PRINCIPLE = PRINCIPLE*(1+RATE/12)-PAYMENT
        ADD PAYMENT TO TOTALPAID
        ADD 1 TO MONTH
        IF MONTH = 24 THEN
            SET PAYMENT TO 3999.00
            SET RATE TO 0.09
        END-IF
    END-PERFORM
    DISPLAY "TOTAL PAID", TOTALPAID
    DISPLAY "MONTHS", MONTH.
    STOP RUN.
a language for writing computer programs based on the lambda calculus (Alonzo Church) • Invented by John McCarthy at MIT (1958) • Name derives from "List Processing Language" • Many modern variations in use (Common Lisp, Scheme, etc.)
truly unlike any of the other early programming languages • The entire language is basically based on the "list." Lisp programs themselves are lists (thus programs can process their own code as data). • Programs written as functions that apply various operations to lists (functional programming). Especially strong reliance on recursive functions, mathematical thinking.
the first major dynamic language • Almost all major concepts of dynamic programming languages were first invented with Lisp. • Modern languages like Python and Ruby borrow heavily from Lisp, but even today, have not replicated all of its features. • All "real programmers" eventually reinvent some part of Lisp without knowing it.
early computing, machines were quite limited and Lisp was much slower than the compiled languages (dynamic) "A LISP programmer knows the value of everything, but the cost of nothing" - Alan Perlis • Lisp was not widely adopted by those who were obsessed with high-performance (science/engineering/business).
Lisp requires a certain degree of mathematical sophistication. Functions, composition of functions, recursion, etc. • Let's be honest---a huge majority of the world's programmers are not mathematicians. You don't need a math degree to write accounting software (or make a web page). • Programs in other languages are more like "recipes" of steps (imperative). Conceptually, this is easier for most people to grasp.
languages of choice for "serious" applications have almost all been compiled programming languages that derive from Fortran/Algol. • Some major languages from the 70s/80s • Fortran (updates, F66, F77). • Pascal (1970) • C (1972)
1970 by Niklaus Wirth • Derives from ALGOL, but strongly focused on structured programming/data structures • Initially developed as a teaching language for structured programming. • Personal note : I remember learning Pascal in high school (1985). Most programmers in 70's/80's would have encountered it.
was very picky about "correctness" • Very strong type system, very strict in what it allowed and did not allow (pitched as a good teaching language) • Used as an alternative to BASIC on early PCs (e.g., Turbo Pascal, UCSD Pascal, etc.). • Early Macintosh systems made heavy use of Pascal (parts of the OS, major applications)
1972 at AT&T Bell Labs by Dennis Ritchie in order to implement Unix • Created as a systems implementation language (a better assembly language) • Although the language borrows some ideas from ALGOL, the language was always meant to be minimal and low-level.
developments in programming languages, there are situations where programs need to directly manipulate the computer hardware • Operating systems, device drivers, etc. • Before C, this code would typically be written in assembly language
C was not created to be a better ALGOL. It was a replacement for assembly coding. • Although it was compiled, "safety" was never really a primary concern • C allowed direct access to hardware/memory (the complete opposite of Pascal) • Witness the consequences : Buffer overflow attacks (malware).
is currently the de-facto standard for developing systems software • There are many reasons why this happened • Not just related to the technical merits (or lack of) of C as a programming language
of minicomputers/microcomputers in the 1970s • These systems were extremely minimal/resource starved (compare a Commodore-64 with an IBM mainframe). • If you wanted anything to run fast and fit in memory, you had to write it in assembly. • A lot of early PC software was assembly
Use of C really exploded with minis/PCs • C was minimal, portable, and didn't enforce any morality rules on programming (you could do anything you wanted). • You could easily write programs that ran almost as fast as hand-written assembler (maybe even faster with optimization). • Growth completely driven by practical applications (and economics)
I don't know when C clobbered Pascal, but I'm guessing in the late 80s (I don't remember many people talking about Pascal after about 1990). • Humor : "How to shoot yourself in the foot" C : "You shoot yourself in the foot." Pascal : "The compiler won't let you shoot yourself in the foot."
had become a lot more interactive • Early 1970's : Interactive video terminals replaced punch cards. Enabled programs to interact with the user in new ways. (shells) • Early 1980's : Graphical User Interfaces. A lot of previous research (e.g., Xerox), but the Apple Lisa/Mac was really the first GUI-centric system.
computing power • Rapid growth of CPU power, memory capacity, and disk storage • Enabled new kinds of programs with vastly more complexity than anything before. • GUIs took it to a whole new level
programming was not enough from the standpoint of software engineering. • How to manage large-scale programming projects and complexity? • Rise of object-oriented programming, software components, etc. (1980s). • Example : Development of C++ as a "better C"
greater reliance on programming libraries, pre-built software components • Example: GUI "widgets" • Writing software was becoming less about creating everything from scratch and more about gluing together components that already existed.
clear that languages somewhat different from those in existence today would enhance the preparation of structured programs. We will perhaps eventually be writing only small modules which are identified by name as they are used to build larger ones, so that devices like indentation [...] might become feasible for expressing local structure in the source language." - Donald Knuth (1974)
early home computers came prepackaged with BASIC (usually Microsoft) • If you turned on the system, you were often dropped directly into a BASIC interpreter • Greatly expanded the number of programmers • Notion of "programmability" (maybe BASIC's only redeeming quality other than peek/poke)
growing (1980's) • Major universities/corporations were already connected (ARPANET) • Services for home users (CompuServe, BBS, etc.). • Growth of free software/open source • Sharing of ideas and source code
By the mid 1980's, there were a lot of things going on (PCs, GUIs, faster systems, early Internet, objects, etc.). • Programmers using C/Pascal, but there were many perceived limitations. • A cauldron of activity.
programs to solve problems. • Problems that are of interest to people who are usually not programmers • Therefore, it is important to figure out some way to make programs generally usable (i.e., "user friendly").
just isn't that useful • Everything (including the underlying logic) is hard-coded into the program • Despite the possible merits of keeping a programmer employed full-time to make changes, the program isn't reusable • Thus, an important part of software engineering is how to make code more general purpose
rule, software engineers do not like to write programs where everything is just hard-coded. • I'd deduct points if you turned in a big programming assignment and you did this. • An extreme example : A former student (nameless) when asked to create a website that played "tic tac toe" tried to create a separate .html document for every possible game configuration (of which there are many)
makes it easier to change the code. If you want to make changes to parameters, you just change them in one location. • However, it's still not very user-friendly. To change a parameter, you have to recompile • "Pardon me, I'll tell you how much your mortgage will cost as soon as I finish recompiling my mortgage software."
parameters from user works, but does not scale well to more complicated problems. • Example : A problem where you had to specify several hundred parameters • Also messy if the program is "branchy." • End-users probably find this to be clunky
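One way to picture the step from hard-coded values to user-supplied parameters is the sketch below. It is an illustration only: the function simulate, the option names, and the defaults are assumptions, and no input validation is done (a payment too small to cover the interest would loop forever).

```python
import argparse

def simulate(principal, rate, payment, teaser_rate, teaser_payment, teaser_months):
    """Return (total_paid, months) for a loan with an initial teaser period."""
    month = 0
    total_paid = 0.0
    # Start on the teaser terms only if there is a teaser period at all
    if teaser_months > 0:
        cur_rate, cur_payment = teaser_rate, teaser_payment
    else:
        cur_rate, cur_payment = rate, payment
    while principal > 0:
        principal = principal * (1 + cur_rate / 12) - cur_payment
        total_paid += cur_payment
        month += 1
        if month == teaser_months:
            cur_rate, cur_payment = rate, payment
    return total_paid, month

def build_parser():
    # Every parameter becomes an option instead of a constant in the code
    p = argparse.ArgumentParser(description="Mortgage calculator (illustrative)")
    p.add_argument("--principal", type=float, default=500000.0)
    p.add_argument("--rate", type=float, default=0.09)
    p.add_argument("--payment", type=float, default=3999.0)
    p.add_argument("--teaser-rate", type=float, default=0.04)
    p.add_argument("--teaser-payment", type=float, default=499.0)
    p.add_argument("--teaser-months", type=int, default=24)
    return p

# An empty list means "use the defaults"; a real script would pass sys.argv[1:]
args = build_parser().parse_args([])
total, months = simulate(args.principal, args.rate, args.payment,
                         args.teaser_rate, args.teaser_payment, args.teaser_months)
print("Total paid %0.2f" % total)
print("Months", months)
```

Even with only six options the parser is already as long as the calculation itself, which hints at why this approach gets clunky when a problem has hundreds of parameters.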
    loan.exe
    Loan type (1=Mortgage, 2=Auto, 3=Commercial) : 1
    Mortgage type (1=Conventional, 2=Evil) : 2
    Principle : 500000
    Payment : 3999
    Rate : 0.09
    Teaser Payment : 499
    Teaser Rate : 0.04
    Teaser Period : 24
    Be a sneaky bugger? (Y=Yes, N=No) : Y
    ...
• You might laugh, but a huge amount of "mission-critical" software is often not much more sophisticated than this.
• Many GUIs not much different (dialogs)
their own devices, programmers have had a tendency to create their own weird application-specific command/config languages. • You see this in large apps (e.g., VBA in Microsoft Office) • Typically done without thinking much about programming languages, theory, or previous work however.
unchecked, configuration languages may grow into some sort of ad-hoc domain-specific (scripting) language “Any sufficiently complicated C or Fortran program contains an ad-hoc, informally-specified bug-ridden slow implementation of half of Common Lisp.” - Philip Greenspun
a "scripting language" has been around for a very long time • JCL - IBM System/360 (1964) • sh - Unix Shell (1971) • Rexx - IBM (1979) • Basically, these languages are oriented around controlling applications and the operating system.
programmers know something about the operating system shell • They use it to run their applications, run the compiler, etc. • The command shell is often cited as an influence for the domain-specific languages that get created
focus on running other applications
• Most basic operation is running a program
• Supplying arguments to a program
    shell % someprog.exe foo bar 42 blah -x
    int main(int argc, char *argv[]) { ... }    /* args */
mechanisms for I/O
    shell % someprog.exe > out.txt
    shell % someprog.exe < in.txt
    shell % someprog.exe | otherprog.exe
• So, you can hook programs up to files and hook programs up to other programs
have "local" variables
    shell % x="Hello"
    shell % y="World"
    shell % echo "$x $y"
    Hello World
    shell %
• These are not passed to applications, but you can export them to the global environment
    shell % export x
to work heavily with text
• Shell interpreter performs variable substitutions prior to executing any command (known as interpolation).
• Usually a special syntax is used ($var)
    shell % cmd=ls
    shell % opts=-l
    shell % $cmd $opts /somedir
    -rw-r--r-- beazley staff 408 Apr 30 2007 foo
    -rw-r--r-- beazley staff 658 Apr 30 2007 bar
    -rw-r--r-- beazley staff 332 Apr 30 2007 spam
    ...
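The substitute-then-execute idea can be imitated in a few lines of Python using string.Template from the standard library, which happens to use the same $var syntax. This is only a sketch of the interpolation step; it does not run the command, and the dictionary of variables stands in for the shell's variable table.

```python
from string import Template

# Stand-in for the shell's table of variables
variables = {"cmd": "ls", "opts": "-l"}

# Step 1: substitute variables into the raw command text
command_line = Template("$cmd $opts /somedir").substitute(variables)
print(command_line)

# Step 2: split the result into words, as a shell would before launching
argv = command_line.split()
print(argv)
```

Only after both steps does a shell decide which program to launch, which is why a variable can hold the command name itself.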
basic control-flow features
    if test $x -gt 0; then
        echo "$x is greater than 0"
    else
        echo "$x is not greater than 0"
    fi
• Loops
    x="foo bar spam"
    for i in $x
    do
        echo $i
    done
define procedures in the shell
    add() {
        echo `expr $1 + $2`
    }
    shell % add 3 4
    7
    shell %
• However, all of this starts to get weird pretty fast (e.g., no local variables)
shell scripts tend to grow into large shell scripts (usually unintelligible) • If you want to write an "application", you get a convoluted mix of tools hooked together in bizarre ways • Very limited support for data processing and data structures (strings and lists of strings) • Slow
the interactivity of shells • But they want to do more than just launch other programs and manipulate strings • So, there has always been an interest in expanding shells with features from "real" programming languages • More flexible data structures, proper procedures, control flow, variables, etc.
brief 5-10 year period, there was a sudden flurry of development in which a variety of new programming languages were created • All of these languages were created by single individuals, often without any “official” funding. • Not associated with academic CS. • Question : Why?
"The Tcl scripting language grew out of my work on design tools for integrated circuits [...] Each tool needed to have a command language. However, our primary interest was in the tools, not their command languages. Thus, we didn’t invest much effort in command languages and the languages ended up being weak and quirky. Furthermore, the language for one tool couldn’t be carried over to the next so each tool ended up with a different bad command language. After a while, this became rather embarrassing." - John Ousterhout
"Like the typical human, Perl was conceived in secret, and existed for roughly nine months before anyone in the world ever saw it. Its womb was a secret project for the National Security Agency known as the ‘Blacker’ project, which has long since closed down. The goal of that sexy project was not to produce Perl. However, Perl may well have been the most useful thing to come from Blacker. Sex can fool you that way." - Larry Wall (Note: Perl was created to process logs and generate reports)
"My original motivation for creating Python was the perceived need for a higher level language in the Amoeba [Operating Systems] project. I realized that the development of system administration utilities in C was taking too long. Moreover, doing these things in the Bourne shell wouldn't work for a variety of reasons. ... So, there was a need for a language that would bridge the gap between C and the shell." - Guido van Rossum
"PHP, known originally as Personal Home Pages, was first conceived in the autumn of 1994 by Rasmus Lerdorf. He wrote it as a way to track visitors to his online CV. The first version was released in early 1995, by which time Rasmus had found that by making the project open-source, people would fix his bugs.” - From “A History of PHP” (In Q1, '07, there were 78 PHP books on the market).
have largely been developed outside of “academia.” • All are based in practical applications • None was meant to be a “theoretical” experiment in programming languages. • Much to the dismay of academic researchers in programming languages (“I don’t get no respect” - Rodney Dangerfield)
the Christmas holiday in December 1989, hacker Guido van Rossum of the Netherlands was bored, so he created a descendent of the ABC scripting language for the Unix platform, dubbing it Python, from the British comedy troop Monty Python's Flying Circus." - Timothy Morgan "Python has been an important part of Google since the beginning, and remains so as the system grows and evolves." - Peter Norvig, Google.
("Tickle") • Released as open-source in late 80s • One of the most influential early scripting languages • The big idea : A simple standardized programming language that could be easily added to other applications (like a library). • So you didn't have to write your own
A traditional C program is launched from some kind of shell/command prompt
• Command line arguments are passed as strings in an array (argv)
    shell % someprog.exe foo bar 42 blah -x
    int main(int argc, char *argv[]) { ... }    /* args */
• There is just one entry point to the program (main) which figures out what to do
simple interpreter that can call a collection of C functions using the same idea (Diagram: User, Tcl Interpreter, C Code) • User interacts with interpreter, issuing commands that call into C.
each "command" you write C code
    int square(void *clientData, Tcl_Interp *interp,
               int argc, char *argv[]) {
        double x, y;
        if (argc != 2) {
            return TCL_ERROR;
        }
        x = atof(argv[1]);    /* Convert argument */
        y = x*x;              /* Compute something */
        Tcl_SetObjResult(interp, Tcl_NewDoubleObj(y));    /* Return result */
        return TCL_OK;
    }
• Each command looks like a little C main() function
provided a "shell" where commands could be invoked (tclsh)
    % square 4
    16
    % square 5
    25
    %
(Diagram: the command name and arguments typed at the prompt launch the C function int square(void *clientData, Tcl_Interp *interp, int argc, char *argv[]) { ... })
lifted some ideas from the Unix shell and put them inside C programs • If you had a big C program with several hundred functions, you would take each function and turn it into a Tcl command • High-level control flow of the application driven by a Tcl script (instead of being hard-coded in C).
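The structure described above (a table of named commands, each receiving its arguments as a list of strings, just like main) can be sketched in Python. This is an illustration of the idea only, not Tcl's actual implementation; the names commands, command, and evaluate are all hypothetical.

```python
# Registry mapping command names to functions (hypothetical, for illustration)
commands = {}

def command(func):
    """Register a function as a command under its own name."""
    commands[func.__name__] = func
    return func

@command
def square(argv):
    # argv[0] is the command name, argv[1:] are string arguments
    if len(argv) != 2:
        raise ValueError("usage: square x")
    x = float(argv[1])        # Convert argument
    return str(x * x)         # Results go back as strings

@command
def add(argv):
    return str(sum(float(a) for a in argv[1:]))

def evaluate(line):
    """Split a command line into words and dispatch on the first word."""
    argv = line.split()
    return commands[argv[0]](argv)

print(evaluate("square 4"))
print(evaluate("add 1 2 3"))
```

A script is then just a sequence of such lines fed to evaluate, so the high-level control flow lives outside the functions that do the work.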
Tcl, an optional add-on called "Tk" was released • Tk was a Tcl-based interface to a graphical user interface widget set and toolkit • At the time, it revolutionized GUI programming on UNIX systems. • Could build entire GUI using high level scripts
originally envisioned, Tcl was meant to be a small language you added to huge C programs (most code written in C) • Didn't quite turn out that way. • Programmers wrote huge applications entirely in Tcl (> 100K lines) • Tcl/Tk used in a large number of mission critical applications (control systems, etc.)
Tcl/Tk is out of fashion, but it was very influential. • Showed that there was great utility in using a dynamic language to control code written in a static language (mixed languages) • Later became one of the first cross-platform GUI development languages • A lot of ground-breaking software engineering related to scripting
dynamic languages have an optional interface to Tcl/Tk for GUI programming. • Python (Tkinter), Perl/Tk, Ruby/Tk, Scheme/Tk, etc. (Tcl is often hidden behind the scenes) • Many other languages copied much of the SW-engineering practices of Tcl.
learned that a dynamic language made them far more productive • For example, creating a simple GUI in Tcl was something you could do in an afternoon • Mixed-language development. C for systems/high performance, Tcl for control.
in 1987 • The big idea : Take concepts from various facets of shell programming, but create a general purpose programming language • Fix annoying issues with shell programs • Incorporate features from text processing tools (especially sed and awk)
Roughly taken from C • Expands shell scripting with some new data structures (lists and associative arrays) • Adds support for regular-expression pattern matching (from sed) • Major goal : Scripting related to data processing, text processing, report generation.
greatly simplified tasks that were previously done using rather complicated shell scripts (and faster) • Showed that a lot of network, system admin, and web-development code could be written entirely in a "script language" • Completely dominated early web-development (CGI scripting)
user community was very effective at organizing third-party modules, add-ons, and extensions • Huge contributed library (CPAN) • Very influential on other open-source language projects (Ruby, Python, etc.)
languages have evolved from earlier experiences with Tcl, Perl, C etc. • Because everything has been done in the open, there is a lot of cross-pollination • For example, Python has copied Perl's regular expression features. Perl copied Python's object system (in a manner)
When building a large C application, it is useful to have an extension/control language (e.g., Tcl). • Scripting languages. For scripting, it is useful to have a real programming language with useful data structures and high-level features (e.g., Perl).
In this class, we will be using a wide variety of programming languages • This section serves as an introduction to one of the languages we'll be using more often. • More reference material is available online and in books
interpreted, dynamically typed programming language. • In other words: A language that's similar to Perl, Ruby, Tcl, and other so-called "scripting languages." • Created by Guido van Rossum around 1990. • Named in honor of Monty Python
In this section, we will cover the absolute basics of Python programming • How to start Python • Interactive mode • Creating and running simple programs • Basic calculations and file I/O.
an interpreter • If you give it a filename, it interprets the statements in that file • Otherwise, you get an "interactive" mode where you can experiment • No edit/compile/run/debug cycle
line
    shell % python
    Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
    [GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
    Type "help", "copyright", "credits" or "license"
    >>>
• Integrated Development Environment (IDLE)
    shell % idle
put in .py files
    # helloworld.py
    print "hello world"
• Source files are simple text files
• Create with your favorite editor (e.g., emacs)
• Note: May be special editing modes
• Can also edit programs with IDLE or other Python IDE (too many to list)
environments, Python may be run from command line or a script
• Command line (Unix)
    shell % python helloworld.py
    hello world
    shell %
• Command shell (Windows)
    C:\Somewhere>c:\python25\python helloworld.py
    hello world
    C:\Somewhere>
Sears Tower Problem You are given a standard sheet of paper which you fold in half. You then fold that in half and keep folding. How many folds do you have to make for the thickness of the folded paper to be taller than the Sears Tower? A sheet of paper is 0.1mm thick and the Sears Tower is 442 meters tall.
    # How many times do you have to fold a piece of paper
    # for it to be taller than the Sears Tower?
    height = 442                 # Meters
    thickness = 0.1*(0.001)      # Meters (0.1 millimeter)
    numfolds = 0
    while thickness <= height:
        thickness = thickness * 2
        numfolds = numfolds + 1
        print numfolds, thickness
    print numfolds, "folds required"
    print "final thickness is", thickness, "meters"
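As a cross-check on the loop, the answer can also be computed in closed form: each fold doubles the thickness, so the number of folds is the smallest n with thickness * 2**n > height. The sketch below (Python 3 syntax; both function names are made up for this illustration) compares the two approaches.

```python
import math

def folds_by_loop(height, thickness):
    # Same doubling loop as the program above, wrapped in a function
    numfolds = 0
    while thickness <= height:
        thickness = thickness * 2
        numfolds = numfolds + 1
    return numfolds

def folds_by_log(height, thickness):
    # Smallest n such that thickness * 2**n > height
    return math.floor(math.log2(height / thickness)) + 1

height = 442                  # Meters
thickness = 0.1 * 0.001       # Meters (0.1 millimeter)
print(folds_by_loop(height, thickness), "folds (loop)")
print(folds_by_log(height, thickness), "folds (logarithm)")
```

Both give 23 folds: 22 folds reach about 419 meters, and the 23rd doubles that to about 839 meters.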
A Python program is a sequence of statements • Each statement is terminated by a newline • Statements are executed one after the other until you reach the end of the file. • When there are no more statements, the program stops
Comments are denoted by #
    # This is a comment
    height = 442    # Meters
• Extend to the end of the line
• There are no block comments in Python (e.g., /* ... */).
A variable is just a name for some value
• Variable names follow same rules as C
    [A-Za-z_][A-Za-z0-9_]*
• You do not declare types (int, float, etc.)
    height = 442             # An integer
    height = 442.0           # Floating point
    height = "Really tall"   # A string
• Differs from C++/Java where variables have a fixed type that must be declared.
has a basic set of language keywords
• These are mostly C-like and have the same meaning in most cases
• Variables can not have one of these names
    and       assert    break     class     continue
    def       del       elif      else      except
    exec      finally   for       from      global
    if        import    in        is        lambda
    not       or        pass      print     raise
    return    try       while     yield
The while statement executes a loop
• Executes the indented statements underneath while the condition is true
    while thickness <= height:
        thickness = thickness * 2
        numfolds = numfolds + 1
        print numfolds, thickness
If-else
    if a < b:
        print "Computer says no"
    else:
        print "Computer says yes"
• If-elif-else
    if a == '+':
        op = PLUS
    elif a == '-':
        op = MINUS
    elif a == '*':
        op = TIMES
    else:
        op = UNKNOWN
Boolean expressions (and, or, not)
    if b >= a and b <= c:
        print "b is between a and c"
    if not (b < a or b > c):
        print "b is still between a and c"
• Non-zero numbers, non-empty objects also evaluate as True.
    x = 42
    if x:
        pass    # x is nonzero
    else:
        pass    # x is zero
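The truth-testing rule can be tried out on a handful of values. This is a small sketch in Python 3 syntax; describe is a hypothetical helper used only for illustration.

```python
def describe(value):
    # Any object can be tested directly in a condition
    if value:
        return "true"
    else:
        return "false"

for value in [42, 0, "hello", "", [1, 2], [], None]:
    print(repr(value), "is treated as", describe(value))
```

Zero, empty strings, empty lists, and None count as false; everything else in the list counts as true.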
The print statement print x print x,y,z print "Your name is", name print x, # Omits newline • Produces a single line of text • Items are separated by spaces • Works with any kind of Python object • Very useful for debugging
• Sometimes you will need to specify an empty block of code if name in namelist: # Do something else: pass # Not implemented yet • pass is a "no-op" statement • It does nothing, but serves as a placeholder for statements (possibly to be added later)
put single-line bodies on same line for i in range(10): print i • Multiple statements on the same line (;) x = 4; y = 10; z = "hello" • Line continuation (\) if product=="game" and type=="pirate memory" \ and age >= 4 and age <= 8: print "I'll take it!" • Line continuation not needed for (),[],{} if (product=="game" and type=="pirate memory" and age >= 4 and age <= 8): print "I'll take it!"
True, False a = True b = False • Evaluated as integers with value 1,0 c = 4 + True # c = 5 d = False if d == 0: print "d is False" • A relatively late addition to Python (v2.3)
precision integers a = 37L b = -126477288399477266376467L • Integers that overflow promote to longs >>> 3 ** 73 67585198634817523235520443624317923L >>> a = 72883988882883812 >>> a 72883988882883812L >>> • Can almost always be used interchangeably with integers
Subtract * Multiply / Divide // Floor divide % Modulo ** Power << Bit shift left >> Bit shift right & Bit-wise AND | Bit-wise OR ^ Bit-wise XOR ~ Bit-wise NOT abs(x) Absolute value pow(x,y[,z]) Power with optional modulo (x**y)%z divmod(x,y) Division with remainder
a decimal or exponential notation a = 37.45 b = 4e5 c = -1.345e-10 • Represented as double precision using the native CPU representation (IEEE 754) 17 digits of precision Exponent from -308 to 308 • Same as the C double type
that floating point numbers are inexact when representing decimal values. >>> a = 3.4 >>> a 3.3999999999999999 >>> • This is not caused by Python, but by the underlying floating point hardware on the CPU.
- Subtract * Multiply / Divide % Modulo (remainder) ** Power pow(x,y [,z]) Power modulo (x**y)%z abs(x) Absolute value divmod(x,y) Division with remainder • Additional functions are in the math module import math a = math.sqrt(x) b = math.sin(x) c = math.cos(x) d = math.tan(x) e = math.log(x)
can be used to convert a = int(x) # Convert x to integer b = long(x) # Convert x to long c = float(x) # Convert x to float • Only work if type conversion makes sense >>> a = "Hello World" >>> int(a) ValueError: invalid literal for int() >>> • Also work with strings containing numbers >>> a = "3.14159" >>> float(a) 3.14159 >>> int("0xff",16) # Optional integer base 255
a = "Yeah but no but yeah but..." b = 'computer says no' c = ''' Look into my eyes, look into my eyes, the eyes, the eyes, the eyes, not around the eyes, don't look around the eyes, look into my eyes, you're under. ''' • Standard escape sequences work (e.g., '\n') • Triple quotes capture all literal text enclosed
feed '\r' Carriage return '\t' Tab '\xhh' Hexadecimal value '\"' Literal quote '\\' Backslash • In literals, standard escape codes work • Raw strings (don't interpret escape codes) use a leading r a = r"\w+\.\w+" # String exactly as specified
sequence of bytes (characters) • Store 8-bit data (ASCII) • May contain binary data, embedded nulls • Strings are frequently used for both text and for raw-data of any kind
of characters : s[n] a = "Hello world" b = a[4] # b = 'o' c = a[-1] # c = 'd' (Taken from end of string) • Slicing/substrings : s[start:end] d = a[:5] # d = "Hello" e = a[6:] # e = "world" f = a[3:8] # f = "lo wo" g = a[-5:] # g = "world" • Concatenation (+) a = "Hello" + "World" b = "Say " + a
(len) >>> s = "Hello" >>> len(s) 5 >>> • Membership test (in) >>> 'e' in s True >>> 'x' in s False >>> "ello" in s True • Replication (s*n) >>> s = "Hello" >>> s*5 'HelloHelloHelloHelloHello' >>>
leading/trailing whitespace t = s.strip() • Case conversion t = s.lower() t = s.upper() • Replacing text t = s.replace("Hello","Hallo") • Strings have "methods" that perform various operations with the string data.
Check if string ends with suffix s.find(t) # First occurrence of t in s s.index(t) # First occurrence of t in s s.isalpha() # Check if characters are alphabetic s.isdigit() # Check if characters are numeric s.islower() # Check if characters are lower-case s.isupper() # Check if characters are upper-case s.join(slist) # Joins a list of strings using s as the delimiter s.lower() # Convert to lower case s.replace(old,new) # Replace text s.rfind(t) # Search for t from end of string s.rindex(t) # Search for t from end of string s.split([delim]) # Split string into list of substrings s.startswith(prefix) # Check if string starts with prefix s.strip() # Strip leading/trailing space s.upper() # Convert to upper case
"immutable" • Once created, the value can't be changed >>> s = "Hello World" >>> s[1] = 'a' Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: 'str' object does not support item assignment >>> • All operations and methods that manipulate string data always create new strings
any object to string • Produces the same text as print s = str(obj) • Actually, print uses str() for output >>> x = [1,2,3,4] >>> str(x) '[1, 2, 3, 4]' >>>
often split into a list of strings >>> line = 'GOOG 100 490.10' >>> fields = line.split() >>> fields ['GOOG', '100', '490.10'] >>> • Example: When reading data from a file, you might read each line and then split the line into columns or fields.
• Lists are indexed by integers (starting at 0) names = [ "Elwood", "Jake", "Curtis" ] names[0] "Elwood" names[1] "Jake" names[2] "Curtis" • Negative indices are from the end names[-1] "Curtis" • Changing one of the items names[1] = "Joliet Jake"
item names.remove("Curtis") • Deleting an item by index del names[2] • Removal results in items moving down to fill the space vacated (i.e., no "holes").
Opening a file f = open("foo.txt","r") # Open for reading g = open("bar.txt","w") # Open for writing • To read a line of text line = f.readline() • To write text to a file g.write(text) • To print to a file print >>g, "Your name is", name
Reading a file line by line f = open("foo.txt","r") for line in f: # Process the line ... f.close() • Alternatively for line in open("foo.txt","r"): # Process the line ... • This reads all lines until you reach the end of the file
for code you want to reuse def square(x): return x*x • Calling a function a = square(3) • A function is just a series of statements that return a result or carry out some task
with a large standard library • Library modules accessed using import import math x = math.sqrt(10) import urllib u = urllib.urlopen("http://www.python.org/index.html") data = u.read() • Will cover in more detail later
reported as exceptions • Cause the program to stop >>> f = open("file.dat","r") Traceback (most recent call last): File "<stdin>", line 1, in <module> IOError: [Errno 2] No such file or directory: 'file.dat' >>> • For debugging, message describes what happened, where the error occurred, along with a traceback.
• Exceptions can be caught with try-except try: f = open(filename,"r") except IOError: print "Could not open", filename • To raise an exception, use raise raise RuntimeError("What a kerfuffle")
an overview of simple Python • Enough to write basic programs • Python code tends to be fairly readable • Just have to know the core datatypes and a few basics (loops, conditions, etc.)
section, we take a closer look at how dynamic languages handle data • Topics include: • Variables and values • Primitive data types • Operations on data • Compound data • Memory management
to cover some topics that you normally do not find in the "user manual" for various languages. • My goal is to explore the design challenges and decisions that have been made in various languages. • The big picture
with data, programs typically assign values to "variables" • A variable has a name which is known as an "identifier" • The identifier is used to identify values in subsequent calculations • The value is some sort of data
x; • In static languages such as C, C++, and Java, all variables must be declared and given a specific type in advance (declarations) • Underneath the covers, this binds the variable name to a fixed memory location that holds the value of the variable. • Type and location remain fixed
x = 42 • In dynamic languages, variables are just names for values • As the program runs, the value may change. • And it may change to a completely different type of data x = "foo" • Underneath the covers, it's just a table
a = 0.0 b = 42 c = "Hello World" • Variable table: 'a' -> 0.0, 'b' -> 42, 'c' -> "Hello World" • As your program runs, this table gets dynamically updated as variables are created, values get changed, and variables are destroyed.
represents some kind of data • Usually falls into a couple of categories • Primitive data (numbers and strings) • Compound data (arrays) • Objects • The treatment of values is actually a fairly complex problem (more soon)
some languages, all values are the same • For example, in shell scripts and Tcl, all values are just text strings set a 0.0 set b 42 set c "Hello World" 'a' "0.0" 'b' "42" 'c' "Hello World" • Because there are no types, programs simply interpret the value strings in different ways set c "$a + $b" # c -> "0.0 + 42" set c [expr "$a + $b"] # c -> "42.0"
dynamic languages use typed values a = 0.0 b = 42 c = "Hello World" 'a' (float, 0.0) 'b' (int, 42) 'c' (str,"Hello World") • Various operations in the language then look at the types to figure out what to do x = a + b # Ok. x = 42.0 y = b + c # Error. Can't add int and str • However, there is great variation in how "strict" a language is when types are mixed.
a language is strongly typed, it tends to enforce strict rules about how values are used # Python a = 42 # An integer b = "Hello World" # A string x = a + b # Error • Any operation involving incompatible types may result in some kind of "Type Error"
language may also be "weakly" typed. • In this case, the language performs implicit conversions to make certain operation go ahead. • For example, implicitly treating numbers as strings (shown above). // Javascript var a = 42 // An integer var b = "Hello World" // A string var x = a + b // x = "42Hello World"
"weak" typing • Generally this just refers to whether or not a programming language makes a lot of implicit type conversions. a = 42 b = "Hello World" a + b Error # Strong typing a + b "42Hello World" # Weak typing • For example, even though C is statically typed, it is considered to be weakly typed
related issue that pertains to whether or not a language lets you "cast" values between incompatible data types. • Example : Pointer casting in C Foo *f; int x; x = (int) f; // OK. • This was one big difference between C/ Pascal (C let you do anything you wanted)
programming languages don't always fall neatly into any one category • Certain parts of the language may appear to be strongly typed whereas other parts seem to be weakly typed. x = 42 # int y = 2.5 # float z = x + y # float (x implicitly converted to float) • If it's too strict, it's "safe" but quite fussy
are obviously one of the most common primitive data types • There are two basic kinds of numbers: • Integers : 123, -45, 1234 • Reals : 1.23, 4.5, 12e+34 • However, working with numbers is often a surprisingly difficult problem • Let's talk more about this....
• The CPU of your computer supports math with a few primitive types • Integer word (32 or 64 bits) • Floating point (32 or 64 bits) • In static languages (C, Java), these map to very specific datatypes int # 32 bit integer long # 32 or 64 bit integer (depends) float # 32-bit floating point number double # 64-bit floating point
On the CPU, integers are a bunch of bits 5 00000000000000000000000000000101 -5 11111111111111111111111111111011 00000000110101111110101000010001 sign bit "value" • Data representation is in 2's complement • Invert all bits and add 1 to go between +/- 32 bits
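The "invert all bits and add 1" rule can be checked with a few lines of Python. This is a minimal sketch in Python 3 syntax; the helper names to_twos_complement and negate are made up for illustration and work on bit strings rather than real machine words.

```python
def to_twos_complement(value, bits=32):
    """Return the two's complement bit pattern of value as a string of 0s and 1s."""
    mask = (1 << bits) - 1                      # e.g. 0xFFFFFFFF for 32 bits
    return format(value & mask, "0{}b".format(bits))

def negate(pattern):
    """Invert all bits and add 1, working purely on the bit pattern."""
    bits = len(pattern)
    mask = (1 << bits) - 1
    inverted = int(pattern, 2) ^ mask           # flip every bit
    return format((inverted + 1) & mask, "0{}b".format(bits))

five = to_twos_complement(5)
minus_five = to_twos_complement(-5)
print(five)
print(minus_five)
print(negate(five) == minus_five)               # going from +5 to -5
print(negate(minus_five) == five)               # and back again
```

The same rule works in both directions, which is why the hardware needs only one adder for signed and unsigned values.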
Range of integers (32 bits) • 10000000000000000000000000000000 = -2147483648 • 00000000000000000000000000000000 = 0 • 01111111111111111111111111111111 = 2147483647 • Commentary : The representation of numbers is a surprisingly complex problem (there are many ways to do it). Would cover in more detail in a computer architecture course.
CPU, math operations that exceed the hardware range will overflow 01001001100101100000001011010010 1234567890 10010011001011000000010110100100 -1825831516 * 2 Result overflows into the sign bit • C/C++ is completely silent when this happens (i.e., you don't get an error).
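Python integers do not overflow, so the wraparound has to be simulated to see it. The sketch below (Python 3 syntax) uses a hypothetical helper, wrap_int32, that keeps the low 32 bits and reinterprets the top bit as the sign, which is what a 32-bit register effectively does.

```python
def wrap_int32(value):
    """Reduce an arbitrary Python integer to the signed 32-bit range."""
    value &= 0xFFFFFFFF              # keep only the low 32 bits
    if value & 0x80000000:           # sign bit set, so the value is negative
        value -= 0x100000000
    return value

result = wrap_int32(1234567890 * 2)
print(result)                        # the overflowed value from the slide
print(wrap_int32(2147483647 + 1))    # largest int + 1 wraps to the smallest
```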
of integer math on the CPU is fairly well understood by C/C++ programmers (maybe) • Math operations in those languages are directly mapped to low-level machine instructions (C as a better assembly) • Truncation, overflow, and other aspects are just "features" of those languages.
Floating point numbers are a representation of the real numbers (decimals) • A number consists of three parts: sign (+,-), mantissa, and exponent, e.g. -1.23647223 x 10^34 • "Floating Point" refers to the fact that the position of the decimal point varies
Fixed point numbers 1.23456 0.12345 0.01234 0.00123 0.00012 • Floating point numbers 1.23456 x 10^0 1.23456 x 10^-1 1.23456 x 10^-2 1.23456 x 10^-3 1.23456 x 10^-4 The exponent adjusts the position of the decimal point
On hardware, floating point numbers are merely a different interpretation of the bits 00000000110101111110101000010001 • 32 bit float layout: sign bit (1 bit), exponent (8 bits), mantissa (23 bits) • Value is computed as (+/-) mantissa * 2^exponent
There are two main types of floats • Described by a standard : IEEE 754 • Single precision (32-bit) sign bit exponent mantissa 8 bits 23 bits • Double precision (64-bit) sign bit exponent mantissa 11 bits 52 bits
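To see the fields of a single precision float, the bits can be pulled apart with the standard struct module. This is a sketch in Python 3 syntax; float32_fields is a made-up helper name. It also shows two IEEE 754 details the slides simplify away: the exponent is stored with a bias of 127, and normal numbers have an implied leading 1 in front of the mantissa bits.

```python
import struct

def float32_fields(x):
    """Split a number, stored as a 32-bit float, into (sign, exponent, mantissa)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF        # 23 bits, implied leading 1 is not stored
    return sign, exponent, mantissa

sign, exponent, mantissa = float32_fields(-6.5)

# Rebuild the value; this formula only covers normal (non-zero, non-special) numbers
value = (-1) ** sign * (1 + mantissa / 2.0 ** 23) * 2.0 ** (exponent - 127)
print(sign, exponent, mantissa, value)
```

Here -6.5 is -1.625 x 2^2, so the stored exponent is 2 + 127 and the mantissa bits hold the 0.625.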
Numerical range of floating point • Single precision: 8 digits of accuracy, max value 3.4 x 10^38 • Double precision: 17 digits of accuracy, max value 1.8 x 10^308 • Given a choice, most people use double
The CPU has a floating point unit to perform math operations (+,-,*,/, sqrt, sin, cos, tan, etc). • One caution : Floating point can not accurately represent decimals (all values are approximate). >>> x = 3.4 >>> x 3.39999999999999 >>> >>> x = 0.1 * 0.1 >>> print x 0.01 >>> x == 0.01 False >>>
Because floating point is approximate, repeated calculations result in mathematical errors that accumulate in a program. • Normally, this is covered in a numerical analysis/numerical methods class D. Goldberg, "What Every Computer Scientist Should Know About Floating Point Arithmetic" • One reason why floating point is sometimes avoided in business software
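A small sketch of error accumulation, in Python 3 syntax. It adds 0.1 ten times, then shows two common workarounds: comparing with a tolerance (the 1e-9 threshold is an arbitrary choice for illustration) and math.fsum(), which compensates for the lost low-order bits.

```python
import math

total = 0.0
for _ in range(10):
    total += 0.1                      # each addition rounds slightly

print(total == 1.0)                   # the accumulated sum misses 1.0
print(abs(total - 1.0) < 1e-9)        # compare with a tolerance instead
print(math.fsum([0.1] * 10) == 1.0)   # fsum tracks the rounding error
```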
Certain operations result in exceptions (divide by 0, overflow, sqrt(-1)) • There are three special values +Inf Positive infinity -Inf Minus infinity NaN Not a number • These get encoded in a special way in the number (exponent field set to all 1s).
Design issue : If a math calculation produces an exceptional value (+Inf, -Inf, NaN), should it cause a program to abort or should the program keep running? • Note : These special values are "sticky". Any operation involving +Inf,-Inf, NaN will only produce one of those values as a result (it will not ever turn back into a normal number)
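The special values can be created directly in Python, which makes their behavior easy to poke at. A minimal sketch in Python 3 syntax:

```python
import math

inf = float("inf")
nan = float("nan")

print(inf + 1)              # still infinity
print(inf - inf)            # no meaningful answer, so NaN
print(nan * 0)              # NaN propagates through arithmetic
print(nan == nan)           # NaN never compares equal, even to itself
print(math.isnan(nan * 0))  # use math.isnan() to test for it
```

Because NaN fails every equality test, a check such as x == x is sometimes used as a NaN detector in languages that lack an isnan function.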
If you ignore errors, a program may run for a very long time silently producing garbage data (NaNs, Inf, etc.) • If you cause an abort, an unexpected math error (e.g., due to some kind of transient event) might cause the whole program to mysteriously crash.
tricky • Classic example : Ariane 5 Rocket Launch (1996) • Exploded 37 seconds after launch • Cause : Overflow in a float to integer conversion produced an uncaught math exception (which then caused the guidance software to dump core).
• Dynamic languages tend to be very high level x = 12345 # An integer y = 123.45 # A float • Design question : Should a high-level language force programmers to think about low-level implementation details regarding math? (e.g., bits, overflow, etc.)
common to represent integers by mapping them to the native integer type • This is beneficial for performance • If so, integers will have a fixed range: • -2147483648 -> 2147483647 (32 bits) • Question : what happens if you go outside that range?
solution : Do nothing. Just let math operations overflow like they do in C. • Example: Tcl set x 1234567890 set y 1234567890 set z [expr $x + $y] puts $z # Outputs -1825831516 • It's all perfectly intuitive if you're a C programmer (in fact, you'd ideally write your program to depend on this "feature" in some sort of very crucial, but diabolical way)
"Perhaps my greatest shock came when I found an innocent loop that had no test in it. No Test. None. Common sense said it had to be a closed loop, where the program would circle, forever, endlessly. Program control passed right through it, however, and safely out the other side. It took me two weeks to figure it out." "The vital clue came when I noticed ... incrementing the instruction address would make it overflow..." http://www.cs.utah.edu/~elb/folklore/mel.html
Another strategy : Promote integers to arbitrary precision longs/bignums • Example: Ruby/Python x = 1234567890 # int y = 1234567890 # int z = x + y # long print type(x) # produces <type 'int'> print type(z) # produces <type 'long'> • In this case, integers are allowed to grow to arbitrary size.
a bignum is a sequence of native integers chained together. 32 bits 32 bits 32 bits ... • The number of parts is allowed to grow or shrink dynamically to accommodate the number as necessary bignum
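The chaining idea can be sketched by splitting a large Python integer into 32-bit parts and putting it back together. This is an illustration in Python 3 syntax only; to_limbs and from_limbs are hypothetical helpers, and real interpreters pick their own part sizes and storage layout.

```python
def to_limbs(n, bits=32):
    """Split a non-negative integer into fixed-size parts, least significant first."""
    mask = (1 << bits) - 1
    limbs = []
    while True:
        limbs.append(n & mask)        # take the low 32 bits
        n >>= bits                    # and move on to the next part
        if n == 0:
            return limbs

def from_limbs(limbs, bits=32):
    """Reassemble the integer from its parts."""
    n = 0
    for limb in reversed(limbs):
        n = (n << bits) | limb
    return n

big = 3 ** 73
parts = to_limbs(big)
print(len(parts), from_limbs(parts) == big)
print(to_limbs(5), to_limbs(2 ** 32))   # small values need one part, larger ones grow
```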
Why not just store all integers using the big number format? • Calculations will be slower due to extra processing overhead • Big numbers take more memory • Since small integer values are the most common, it makes little sense to penalize them.
Some languages just always represent integers using double precision floating point numbers • Example : Perl, Javascript $x = 1234567890; # float $y = 1234567890; # float $z = $x + $y; # float • In this case, you just dispense with the problem of having to promote values (all numbers are the same type)
If you use floats, you'll get an extended range of exact integer values (53 bits): 0 to 9007199254740992.0 (9.0e+15) • If you go beyond this, things get "weird" 9007199254740992.0 + 1 -> 9007199254740992.0 (same) 9007199254740992.0 + 2 -> 9007199254740994.0 9007199254740992.0 + 3 -> 9007199254740996.0 • Will start to get "gaps" between numbers
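Python floats are 64-bit doubles, so the 53-bit limit can be reproduced directly. A minimal sketch in Python 3 syntax, comparing float arithmetic against exact integer arithmetic:

```python
limit = 2 ** 53                               # 9007199254740992

print(float(limit) + 1 == float(limit))       # adding 1 is lost to rounding
print(float(limit) + 2 == float(limit + 2))   # even values are still representable
print(float(limit - 1) + 1 == float(limit))   # below the limit, math is exact
print(int(float(limit) + 3))                  # odd results round to a neighbor
```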
There are some downsides • Floating point math is slower than integer math on the hardware. However, maybe you don't care in an interpreted language. • Increased memory footprint. 64-bit floats take twice as much memory as 32-bit ints. • May find special cases at/around 32 bit limit. For example, systems interfaces may only work with 32 bit integer values
• Sometimes silently truncated at 32-bits x = 9876543210 a = x * 2; # Multiplies x by 2 b = x << 1; # Multiplies x by 2 • Perl: a = 19753086420, b = 4294967294 • Python/Ruby: a = 19753086420, b = 19753086420 • PHP: a = 19753086420, b = -2 • Javascript: a = 19753086420, b = -1721750060
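One way to get a truncated result like the Javascript one is to wrap both the operand and the result of the shift to signed 32 bits. The sketch below (Python 3 syntax) emulates that; to_int32 and shift_left_32 are hypothetical helper names, and the other languages in the table reach their answers by different rules that are not modeled here.

```python
def to_int32(value):
    """Wrap an integer into the signed 32-bit range."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value

def shift_left_32(x, n):
    """Left shift where the operand and the result are both truncated to 32 bits."""
    return to_int32(to_int32(x) << n)

x = 9876543210
print(x * 2)                 # full-precision multiply
print(x << 1)                # Python's own shift agrees with the multiply
print(shift_left_32(x, 1))   # truncated shift
```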
the reasons why Python/Ruby use integer bignums is to provide mathematical consistency across the entire range of integer values • There is no mysterious 32-bit cut-off for some operations and not others
There are actually very few practical applications that need the accuracy of really big integer numbers (e.g., cryptography) • Most uses of integers are for counting and for indexing into data (e.g., array lookup). • Example : Indexing bytes in a file 1 Gigabyte 1073741824.0 1 Terabyte 1099511627776.0 1 Petabyte 1125899906842624.0 Largest int in a 64-bit float 9007199254740992.0 • For now, using floats is fine.
-7/3 -> -3 (Python, Ruby, Tcl) -7/3 -> -2 (C, C++, Java) • Integer division behaves differently in different languages (a surprise!) • Choices: • Convert to floating value (exact value) • Floor division (closest integer less than the value) • Truncate towards zero.
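All three choices can be produced side by side. Note that this sketch uses Python 3 syntax, where / always returns a float and // is floor division; in the Python 2 used elsewhere in these slides, / on two integers floors.

```python
import math

a, b = -7, 3

floor_div = a // b                 # floor: round toward negative infinity
trunc_div = math.trunc(a / b)      # truncate: round toward zero (C-style)
true_div = a / b                   # Python 3: always a float

print(floor_div, trunc_div, true_div)

# The remainder follows the same convention as the quotient
print(a % b, math.fmod(a, b))
```

The remainder matters as much as the quotient: with floor division the remainder takes the sign of the divisor, with truncation it takes the sign of the dividend.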
the trend in dynamic languages may be to make integer division convert the result to a floating point number if result not exact • Python is changing integer division in v. 3.0 • This change is highly controversial (I was even skeptical when I first heard it).
compiled languages, you can write functions that expect to work with floats, but you can use them fine with integers float midpoint(float x, float y) { return (x+y)/2; } ... float m = midpoint(12,17); // m = 14.5 • Inside the compiler, it knows the arguments are supposed to be floats. So, it automatically converts the integer arguments to floats. • It all just works.
dynamic languages, functions are written with no type information def midpoint(x, y): return (x+y)/2 ... m = midpoint(12,17) # m = 14 m = midpoint(12.0,17.0) # m = 14.5 • It is very easy to silently introduce numerical errors into a program. • You can code around it, but it is error prone, makes code hard to read, and runs slower.
There are important uses of truncating integer division • Most common : Date/time calculations seconds = x; minutes = seconds / 60 hours = minutes / 60 days = hours / 24 • A good subject for flame wars involving people with far too much spare time
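The date/time breakdown is a case where the integer quotient and the remainder are both wanted. A sketch in Python 3 syntax using the built-in divmod(); split_seconds is a made-up helper name.

```python
def split_seconds(total_seconds):
    """Break a count of seconds into (days, hours, minutes, seconds)."""
    minutes, seconds = divmod(total_seconds, 60)   # integer quotient and remainder
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)
    return days, hours, minutes, seconds

print(split_seconds(100000))
```

If the division returned 1666.67 minutes instead of 1666, every later step would be carrying fractions of a unit along with it.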
Some languages blur the distinction between numbers and text strings x = "42 bottles" y = "37 bottles" $x + $y 79 (Perl, PHP) x + y "42 bottles37 bottles" (Python, Ruby) • If a string is used in a context that expects a number, it may be converted to a number
In some cases, it's a little diabolical // Javascript var x = "42" var y = "37" var a = x + y; // a = "4237" var b = x * y; // b = 1554 • If numbers and strings are mixed, it's more common to have separate string/math ops # Perl/PHP $x = "42"; $y = "37"; $a = $x + $y; # Numeric add : a = 79 $b = $x . $y; # String concat : b = "4237"
the covers, an interpreter may keep multiple representations of data $x = "1 bottle"; str : "1 bottle" num : 1 • If used as a number, the numeric value will be saved and reused in later calculations • Perl does this.
Many languages have multiple numeric types int : 42 long : 1273894812883991923 float : 1.2374623 complex : 1.23 + 4.5j • In calculations involving mixed types, numbers are converted to the same type. • Usually done in a way so accuracy is not lost 42 + 4.5 -> 42.0 + 4.5 • You still need to be careful
Generalized Decimal Arithmetic • Example : Python Decimal() types >>> import decimal >>> a = decimal.Decimal("3.45") >>> b = decimal.Decimal("7.22") >>> a + b Decimal("10.67") >>> a / b Decimal("0.4778393351800554016620498615") • This module performs exact decimal ops • IBM General Decimal Arithmetic Spec.
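A short comparison of binary floats and Decimal on the same sum, in Python 3 syntax. One practical point it illustrates: build Decimal values from strings, because a Decimal built from a float inherits the float's approximation.

```python
from decimal import Decimal

float_total = 0.1 + 0.2
decimal_total = Decimal("0.1") + Decimal("0.2")

print(float_total == 0.3)                    # binary floats miss
print(decimal_total == Decimal("0.3"))       # decimal arithmetic hits exactly
print(Decimal("0.1") == Decimal(0.1))        # a float argument carries its error along
```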
can tell, numbers are a bit of a mess • Not entirely standardized across languages • Many possibilities for program errors (especially with weak typing) • Many tradeoffs and design considerations • Let's move on to something more "simple"...
typically refers to a sequence of characters x = "Hello World" • Most programmers are generally familiar with working with strings. • Operations on strings are fairly well "standardized" across languages.
are a number of quoting conventions a = "Hello World" b = 'Hello World' c = """This is a multiline string. It captures all text.""" • "Heredoc" assignment (Perl, PHP, Ruby) a = <<END All of the text from here on is captured just as it is typed. END
can sometimes substitute variables name = "Dave" text = "Your name is $name" # Perl/PHP/Tcl text = "Your name is ${name}" # Alternative text = "Your name is #{name}" # Ruby • This is notably absent in Python/Javascript (although you can sometimes hack it) name = "Dave" text = "Your name is %(name)s" % vars() # Python
are common string operations • stripping/chopping " text \n" "text" • splitting "text1 text2 text3" [ "text1", "text2", "text3" ] • replacing "Hello World" "Hello There" • Just read a manual to find out how
There are a number of "hard" real-world issues concerning the use, design, and implementation of strings. • These issues are often overlooked/ignored by programmers (at their own peril) • We're not one of those programmers
is a number • 65 'A' (ASCII) • The number is a symbolic representation of some sort of a writing element (e.g., a letter) • The number 65 represents the letter 'A' • A character is not the visual presentation (that's called a "glyph"). • Oh, and a character is not a byte
Characters are bytes and always have been! • Characters 0-127 : ASCII • Characters 128-255 : Everything else • Just look it up in the manual... you'll see. "A string is an array of bytes." - From Ruby in a Nutshell • Yes, it's common for characters to be bytes
of early computing was invented in the west (US/Europe) • Strong bias towards European languages and European characters • Which, conveniently, happened to mostly fit into a single byte of data. • Example : ASCII character set • But there was other horrible weirdness
In the late 70s, every manufacturer implemented ASCII, but did whatever they wanted with the rest of the characters Starship Enterprise! Incriminating drink stain "\x09\x0a"
The practice of storing characters in a single byte is not workable in general • Thousands of different world languages • Some have thousands of characters (e.g., Chinese, Japanese, Korean, etc.) • Is everyone going to go create their own mutually incompatible character set? (well, yes, actually).
mapping of characters to numerical codes • First appeared ~1991 and is periodically updated by a consortium • Simple explanation : Assign a unique number to all characters used by humans in all written languages. • (Sounds like a good project for a hapless Ph.D. student).
There are currently about 100000 characters • Characters 0-127 correspond to ASCII • Other character sets are mapped to different ranges of numbers (usually given in hex) • Example: Armenian (0530-058F) • Example: Mongolian (1800-18AF)
Unicode just assigns numbers • The unicode standard does NOT specify how characters are supposed to be represented in memory • Does NOT specify how characters are supposed to be stored in files • And it's not even entirely consistent on how you represent certain characters
• In theory, there is a unique integer code for each character. • However, in some languages, there are characters and then there are characters with modifiers (e.g.,ä, ã, â, á, à, å). • Unicode gives all of these variants a separate numerical code. ä U+00e4 ã U+00e3 â U+00e2 á U+00e1 à U+00e0
• But certain characters can also be constructed by adding modifiers • ä = a + combining diaeresis (0061 + 0308) • So, you might have multiple representations of "Jalapeño" • With ñ as a single code point: 004a 0061 006c 0061 0070 0065 00f1 006f • With n followed by a combining tilde: 004a 0061 006c 0061 0070 0065 006e 0303 006f
the same text has multiple representations, how do you do string comparison? • Well, in general you don't • To do this, you would have to "normalize" strings to one standard representation • A related, but equally nasty problem : Alphabetization (collation)
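Python ships a normalizer in the standard unicodedata module, so the comparison problem is easy to demonstrate. A minimal sketch in Python 3 syntax (where all strings are Unicode), using the two spellings of "Jalapeño":

```python
import unicodedata

composed = "Jalape\u00f1o"        # ñ as one code point (U+00F1)
decomposed = "Jalapen\u0303o"     # n followed by combining tilde (U+0303)

print(composed == decomposed)     # different code point sequences
print(len(composed), len(decomposed))

nfc = unicodedata.normalize("NFC", decomposed)   # compose where possible
nfd = unicodedata.normalize("NFD", composed)     # decompose fully
print(nfc == composed, nfd == decomposed)
```

Either form works for comparison as long as both strings are normalized to the same one first.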
collation of characters varies by language/region, not by character set • So, to make sorting work, you would have to have some kind of collation sequence that specifies the desired order [..., c, d, e, è, é, ê, ë, f, g, h, ... • Bloody hell • Let's move on...
a string is just a sequence of "characters" x = "Hello World" • Question : You're the language designer. Are you going to support Unicode? • Well, it's 2008, so let's assume yes...
1: Make each character a 32-bit int • This is known as UCS-4 • More than enough bits to represent all unicode characters, but it hogs memory • ASCII text takes 4 times as much memory • Memory is cheap--buy more RAM. • Worse performance (e.g., CPU cache)
2 : Make each character a slightly smaller, but still large enough integer. • For example : 20 bits • Fine except that 20 bits is pretty odd • No C,C++,Java datatype for that. • Not natively supported on the CPU. • Will run slow as hell. • Nobody does this.
3 : Make each character a 16-bit int • Known as UCS-2 (very common) • Much less memory overhead. • But 16-bits is not enough to represent all of the unicode characters • However, the Unicode people thought of that...
Unicode characters can be encoded into a pair of smaller character codes • U+D800 - U+DFFF (Surrogate pairs) • How it works : U+1D122 • Subtract 10000 (hex) to get D122, a 20-bit value: 00001101000100100010 • Split into 2x10 bits: 0000110100 0100100010 • Add the first half to D800 and the second half to DC00: 1101100000110100 1101110100100010 • Result is a pair of 16 bit values: (U+D834, U+DD22)
unicode characters now get encoded as a pair of "sort of" characters • U+1D122 becomes (U+D834, U+DD22) • How is that supposed to work in practice? • Does an application programmer check? • If surrogate pairs get handled automatically, you are probably working in a string encoding known as UTF-16
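The surrogate calculation is short enough to write out: subtract 0x10000, split the remaining 20 bits into two 10-bit halves, and add the halves to 0xD800 and 0xDC00. The sketch below (Python 3 syntax, with a made-up helper name to_surrogates) does that and cross-checks the answer against Python's built-in UTF-16 encoder.

```python
def to_surrogates(code_point):
    """Encode a code point above U+FFFF as a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000          # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1D122)
print(hex(high), hex(low))

# Cross-check against the built-in encoder (big-endian, no byte order mark)
encoded = chr(0x1D122).encode("utf-16-be")
print(encoded.hex())
```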
4. Variable length encoding • Example : UTF-8 • Now, this is something you see a lot <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/> • But, what is UTF-8 exactly?
are ASCII (backwards compatibility) • The rest of the entire Unicode character set is encoded using bytes in the numerical range 128-255. • However, single characters may require a variable number of bytes to be represented
some nice properties • Can often be plugged into legacy programs that just process characters as bytes • Problem : Not a good internal format for randomly accessing unicode characters. • Example : Array lookup s = "some unicode string" c = s[n] # Does this return the nth character # or does it return the nth byte?
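The character-versus-byte question has a concrete answer once text and its UTF-8 encoding are kept as separate objects. A sketch in Python 3 syntax, where str indexes by character and bytes indexes by byte:

```python
text = "Jalape\u00f1o"
data = text.encode("utf-8")

print(len(text), len(data))        # 8 characters, but more bytes than that
print(text[6])                     # the 7th character
print(data[6:8])                   # the same character occupies two bytes
print(data[:6].decode("ascii"))    # the ASCII part is one byte per character
```

Finding the nth character in the encoded form means scanning from the start, which is why UTF-8 is awkward as an internal format for random access.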
• Putting unicode characters in string literals • Source code encoded in Unicode • Read/writing Unicode data from files (later) • Unicode character properties database a = "¼" b = "x" numeric(a) -> 0.25 numeric(b) -> false
There is still a need for processing data as raw sequences of bytes • Example : Processing binary formats (images, sound files, video, etc.) • Example : Fast ASCII Text processing • Do you have to wedge this into all of the Unicode processing?
solution is to provide an entirely different primitive datatype for "byte strings" • Example : Python-3000 s = b"just some bytes" • This raises new issues : Do you allow text strings and byte strings to intermix? • If so, what rules define that relationship?
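In Python 3 as eventually released, the answer to the intermixing question is "no": text and bytes refuse to combine, and crossing between them requires naming an encoding. A minimal sketch (the choice of ASCII here is just for the example data):

```python
raw = b"just some bytes"
text = "just some text"

try:
    mixed = raw + text            # mixing the two types is rejected
except TypeError as exc:
    mixed = None
    print("TypeError:", exc)

# Crossing the boundary requires naming an encoding
print(raw.decode("ascii") + " / " + text)
print(raw[0], text[0])            # bytes index to integers, text to characters
```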
programming languages are wrestling with the unicode problem right now • It's a complicated issue because these languages have grown entirely out of real-world application development. • Many of the issues are quite subtle. • We haven't even discussed I/O yet!
once met some American programmers working on a news web site that published some articles in Spanish. They couldn't figure out how to deal with the special "Spanish" characters so they just dropped them entirely. "That's a spicy Jalapeo" • Don't be like those guys...
string is a sequence of characters x = "Hello World" • Yes, conceptually simple, but some horrible details concerning "characters" • But let's assume you've sorted that out. • Question : How do strings actually behave in our favorite dynamic language?
: Can you modify the contents of a string after you create it? • Sometimes yes : Perl, PHP, Ruby irb(main):001:0> a = "Hello World" => "Hello World" irb(main):002:0> a[1] = 'a' => "a" irb(main):003:0> a => "Hallo World" • Sometimes no : Python, Javascript >>> a = "Hello World" >>> a[1] = 'a' Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: 'str' object does not support item assignment >>>
in-place modification of string data • High performance for manipulating huge strings $text =~ s/Foo/Bar/g; # Substitution in perl • It's fast because you can modify the contents without making a new copy in memory. • Question : But what happens here? $other = $text;
on assignment. When saving the value of a string in a new variable, make a fresh copy. $other = $text; • Copy by reference. Just copy a pointer. Of course, this can lead to bizarre sharing. # ruby a = "Hello World" b = a b.sub!("Hello","Hello Cruel") print a # "Hello Cruel World"
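on write. Initially copy by reference, but if anyone makes a modification, make a local copy a = "Hello World" b = a b[1] = 'a' • Before the write: a and b both refer to the same "Hello World" • After the write: a still refers to "Hello World", b refers to a copy holding "Hallo World" • Sounds tricky...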
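Copy on write can be sketched with a toy class. This is Python 3 syntax and purely illustrative: CowString is a hypothetical name, it handles ASCII only, and it is not how any particular interpreter implements strings.

```python
class CowString:
    """Toy copy-on-write string: copies share one buffer until someone writes."""

    def __init__(self, text):
        self._buf = bytearray(text, "ascii")
        self._shared = False

    def copy(self):
        other = CowString.__new__(CowString)
        other._buf = self._buf                 # copy by reference
        other._shared = self._shared = True    # both sides now know about sharing
        return other

    def __setitem__(self, index, char):
        if self._shared:                       # first write: take a private copy
            self._buf = bytearray(self._buf)
            self._shared = False
        self._buf[index] = ord(char)

    def __str__(self):
        return self._buf.decode("ascii")

a = CowString("Hello World")
b = a.copy()
shared_before = a._buf is b._buf               # one buffer, two names
b[1] = "a"                                     # triggers the copy
print(a, b, shared_before, a._buf is b._buf)
```

The flag here is deliberately crude: after b takes its private copy, a still believes it is shared and will make one unnecessary copy on its next write. Real implementations usually track this with a reference count.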
of sharing, mutable strings may have one set of methods that always return a new string and another set that modify a string "in-place" • Example: Ruby
Create new strings    In-place
------------------    -------------
s.capitalize          s.capitalize!
s.chomp               s.chomp!
s.gsub                s.gsub!
s.strip               s.strip!
...
mutable strings requires a certain degree of programming discipline • Since values might be shared, changes can unexpectedly affect other parts of code (like working with pointers in C++) • Could get real messy if working with Unicode and multibyte character sets because of internal representation and encoding issues.
strings are much simpler • Since they are read-only, all operations that manipulate strings always return new strings s = "Hello World" a = s.upper() # a = "HELLO WORLD" b = s.replace("Hello","Hallo") # b = 'Hallo World' • Copies are always made by reference a = "Hello World" b = a c = b "Hello World" a b c
fact that strings are immutable allows operations to be optimized inside the interpreter • For example : Use of small strings to refer to named fields, etc. • Gives more freedom in how strings are represented/manipulated internally (since programs aren't allowed to touch the bits)
have been more about numbers and strings than you ever wanted to know • In my experience, working programmers only have a flimsy grasp of the details (especially when it concerns Unicode). • One goal of going into detail has been to inform you of the important issues and pitfalls. • Note : There is not going to be a unicode quiz
programs, it is often necessary to represent data that consists of multiple parts • Example: A holding of stock Name : "GOOG" (string) Shares : 100 (integer) Price : 490.10 (float) 100 shares of GOOG at 490.10 • There are three basic components
• In static programming languages (C, Java, etc.), data structures are managed by defining a "structure" or "class" struct StockHolding { char name[8]; int shares; double price; }; • This precisely defines the members, the memory layout, and other low-level details
languages don't really have "structs" • Instead, you can group values together g = "GOOG", 100, 490.10 a = "AAPL", 50, 123.45 • This becomes a single object composed of multiple parts (sometimes known as a tuple) • You can pass it around in your program as a single "value"
When values are grouped, the components are typically ordered (like an array) g = "GOOG", 100, 490.10 name = g[0] shares = g[1] price = g[2] • However, you also just unpack values like this: g = "GOOG", 100, 490.10 ... name, shares, price = g
of packing/unpacking values is surprisingly rich in most dynamic languages • Most programmers aren't even aware of the full extent to which this actually works • Some examples follow
= ( "Working with Data", ("David Beazley","[email protected]"), ( (16,"Jan", 2008), (17, 30) ) ) Just flip the sides and put in some variable names: ( title, (name,email), ( (day,month,year), (hour,minute) ) ) = Lecture
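A runnable version of the nested unpacking idea, as a small Python sketch. The email address is a placeholder chosen for this example, not a value from the slides.

```python
# A nested tuple: title, (name, email), ((day, month, year), (hour, minute))
lecture = (
    "Working with Data",
    ("David Beazley", "dave@example.com"),   # placeholder address (assumption)
    ((16, "Jan", 2008), (17, 30)),
)

# The unpacking pattern mirrors the shape of the data exactly.
title, (name, email), ((day, month, year), (hour, minute)) = lecture

print(title, name, month, year, hour)
```

If the pattern on the left does not match the shape of the data, Python raises an error at run time instead of silently misassigning values.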
data using packed values tends to be quite efficient • Fairly small memory footprint • Implementation is highly optimized in the interpreter (dynamic languages often rely on these same data structures for their own operation)
unpacking values only really works well if data consists of a small number of parts • It would be extremely annoying to do this with a 50-field database row • There may be constraints on packed values. For example, in Python, such objects are immutable.
is almost identical across languages • Work like arrays but you use the field names shares = g['shares']; # Retrieval g['shares'] = 75; # Assignment • Unlike a normal array, there is no ordering. • Keys aren't stored in alphabetical order, etc.
Programs often have to work with collections of "objects" • Example : A collection of stocks in a portfolio
YHOO   50   19.25
AAPL  100  143.41
SCOX  500    4.21
GOOG   20  490.10
MSFT   50   67.12
JAVA   75    6.23
IBM    50   91.10
dynamic languages, there are two very common choices for collections • List or array (ordered sequence of items) • Associative array/hash table (unordered data) • We've already used these in the last section.
sequence of values items = [1, 3.5, "Hello"] • Items are accessed by numerical indices n = len(items) # Number of items a = items[i] # Retrieve the ith item items[i] = b # Change the ith item • There are often append/insert/delete operations items.append(x) items.remove(y) items.insert(i,z) • Read the manual to know exact syntax
collection of values prices = { 'GOOG' : 523.10, 'AAPL' : 172.23, 'IBM' : 105.44 } • Values are accessed by keys n = len(prices) # Number of items a = prices['GOOG'] # Retrieve 'GOOG' value prices['SCOX'] = 0 # Change the 'SCOX' value • Likewise, there are various operations for manipulating the contents
critical part of using containers is knowing that you can store any kind of data that you want inside • This includes other lists and hashes
YHOO   50   19.25
AAPL  100  143.41
SCOX  500    4.21
GOOG   20  490.10
MSFT   50   67.12
JAVA   75    6.23
IBM    50   91.10
The same data as a list of lists:
[ ['YHOO', 50, 19.25],
  ['AAPL', 100, 143.41],
  ['SCOX', 500, 4.21],
  ['GOOG', 20, 490.10],
  ['MSFT', 50, 67.12],
  ['JAVA', 75, 6.23],
  ['IBM', 50, 91.10] ]
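A short Python sketch of working with the list of lists above. The names `total` and `price_by_name` are chosen for this example.

```python
portfolio = [
    ['YHOO', 50, 19.25],
    ['AAPL', 100, 143.41],
    ['SCOX', 500, 4.21],
    ['GOOG', 20, 490.10],
    ['MSFT', 50, 67.12],
    ['JAVA', 75, 6.23],
    ['IBM', 50, 91.10],
]

# Each inner list unpacks into three variables as the loop runs.
total = sum(shares * price for name, shares, price in portfolio)

# Containers nest freely: build a hash (dict) keyed by name from the list.
price_by_name = {name: price for name, shares, price in portfolio}

print(round(total, 2))
print(price_by_name['GOOG'])
```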
Sometimes you will hear computer scientists talking about so-called "First Class" objects. • This means that whatever they're talking about can be used as a data value in a program. • You can assign it to a variable • You can store it in an array. • It has equal status with primitive types • In most dynamic languages, everything is first class
programmer can write significant programs that do nothing but perform operations on lists and hashes • These data structures are powerful enough to do almost any kind of data processing you would ever need to do. • Note : You almost never hear of people implementing things like linked lists and search trees in these languages (why bother?)
ordered data, an array is usually just a resizable array of references to values items = [1, 3.5, "Hello"] (Diagram: items points to an array whose slots point to 1, 3.5, and "Hello") • It's an array of pointers to the values
dictionaries/hashes, you get a mapping of keys to values prices = { 'GOOG' : 523.10, 'AAPL' : 172.23, 'IBM' : 105.44 } (Diagram: prices maps the keys 'GOOG', 'AAPL', 'IBM' to the values 523.10, 172.23, 105.44) • The tricky part : Searching for keys
critical part of creating a dictionary is knowing what to do with the keys • Can a key be any object or is it restricted to strings? • How do you perform a fast key-lookup?
that can be used as a key is given a hash value operation. • This usually computes an integer value irb(main):023:0> a = "GOOG" => "GOOG" irb(main):024:0> a.hash => 252612492 irb(main):025:0> (252612492 is the hash value) • You use the hash value to perform a lookup
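How a hash value turns into a lookup can be sketched with a toy table in Python. `ToyHash` is a hypothetical class written for this illustration; real interpreters use far more refined schemes (resizing, open addressing, and so on).

```python
class ToyHash:
    """Toy hash table: the key's hash value picks a bucket, then we scan it."""

    def __init__(self, nbuckets=8):
        self.buckets = [[] for _ in range(nbuckets)]

    def _bucket(self, key):
        # hash(key) is an integer; reduce it to a bucket index.
        return self.buckets[hash(key) % len(self.buckets)]

    def set(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # existing key: replace the value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # new key: add an entry

    def get(self, key):
        for k, v in self._bucket(key):    # only one small bucket is searched
            if k == key:
                return v
        raise KeyError(key)


prices = ToyHash()
prices.set('GOOG', 523.10)
prices.set('AAPL', 172.23)
print(prices.get('GOOG'))
```

The point of the hash value is that a lookup inspects one bucket instead of every key in the table.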
Hash tables are one of the most essential data structures in virtually every dynamic language • Used not only by end-users, but for the implementation of the interpreter itself • Reading : A. Kuchling, "Python's Dictionary Implementation : Being All Things To All People", in Beautiful Code (O'Reilly)
x = 42 • In dynamic languages, variables are just names for values • As the program runs, the value may change. • And it may change to a completely different type of data x = "foo" • Er..... didn't we already have this slide????
x = 42 • What does this do? • It assigns a value to a variable, yes. • But what does this do? y = x • And this...? z = [x,y] # A list/array with two items
x = 42 # Binding a value to a name y = x # Binding a value to a name items[2] = x # Binding to container location • When programs run, values (i.e., data) get bound to different locations • But, what really happens? • In the above code, the value 42 has been assigned to three different places. • Does that mean there are three copies of 42 in memory? (Answer : It depends)
x = 42 y = x items[2] = x • One possibility: assignments always make a local copy of whatever value is being stored. (Diagram: x, y, and items[2] each hold their own copy of 42.) These are all distinct objects even though they have the same value
• But, consider this case: x = "... A string with 10 million characters ..." y = x items[2] = x • Discuss amongst yourselves... • Maybe this one isn't so clear-cut. • Might depend on how strings were implemented.
$x = 42; $y = \$x; # Reference to $x print $$y,"\n"; # Dereference the value $$y = 37; # Reassign value being referenced • You might introduce special reference/pointer variables (Perl/PHP) • This lets you refer to data instead of making copies, but it also introduces pointers • That may or may not be a good thing
x = 42 y = x items[2] = x • Another possibility: all assignments merely make a reference to the value (like a pointer) (Diagram: x, y, and items[2] all point to a single 42.) There is one object with value 42, many locations point to it.
If everything is a reference, there are other issues. • Are primitive types mutable? x = 37 • In general, you want immutable data to avoid making your head explode.
do you track memory? • Must keep reference counts on values (in the diagram, the shared 42 has ref=3) or perform some kind of garbage collection • Values (memory) will be reclaimed when no more references exist
dynamic languages assign by reference (Python, Ruby, Javascript, etc.) • This is one of the reasons why strings are immutable in Python/Javascript • You can check it out: >>> a = 42 >>> b = a >>> a is b True >>>
with containers (lists/hashes) is very tricky in this model irb(main):001:0> a = [1,2,3,4] => [1, 2, 3, 4] irb(main):002:0> b = a => [1, 2, 3, 4] irb(main):003:0> b[2] = 99 => 99 irb(main):004:0> a => [1, 2, 99, 4] irb(main):005:0> • Since assignments only make references, you get shared references to the same object (Diagram: a and b both point to the same array [1, 2, 99, 4])
copies of containers, you have to take special steps irb(main):001:0> a = [1,2,3,4] => [1, 2, 3, 4] irb(main):002:0> b = a.clone => [1, 2, 3, 4] irb(main):003:0> b[2] = 99 => 99 irb(main):004:0> a => [1, 2, 3, 4] irb(main):005:0> b => [1, 2, 99, 4] Make a "copy" of the object
of containers (python) >>> a = [2,3,[100,101],4] >>> b = list(a) >>> a is b False • However, items in list copied by reference >>> a[2].append(102) >>> b[2] [100, 101, 102] >>> (Diagram: a and b are separate lists, but both refer to the same inner list [100, 101, 102]) This list is being shared
actually copy data, you might have to execute a "deep copy" operation >>> a = [2,3,[100,101],4] >>> import copy >>> b = copy.deepcopy(a) >>> a[2].append(102) >>> b[2] [100,101] >>> • Recursively traverses through the object and copies everything that can be found. • (This is also an interesting CS problem)
languages use various facets of references, immutable data, and this memory model to perform various kinds of optimization. • Example : Small integer caching. Small integers are frequently cached and reused. >>> a = 42 >>> b = 37 >>> c = b + 5 >>> c is a True >>> (Diagram: a and c refer to the same 42 object; b refers to 37)
Sharing dictionary keys and variable names stock = { 'name' : 'GOOG', 'shares' : 100, 'price' : 490.10 } person = { 'name' : 'Dave', 'email' : '[email protected]' } 'name' name = "Mondo" • Programs may use much less memory than you think
of material has been presented in this section, but there are three big take-aways • Data. A close look at primitive datatypes (numbers, reals, and strings) • Data structures. How to group data together (lists, arrays, etc.) • Assignment. What happens when you assign variables and manipulate values in a program.
else related to manipulating basic data is user-manual sorts of stuff • E.g., can look up how to append to a list in your favorite language. • Will get a chance to explore in the exercise
explores the problem of structuring more complicated programs • Program structure and statements • Control flow structures • Functions • Exception handling
• A program is a series of statements • The statements perform various operations and generate some kind of result • When a program runs, it executes statements until there is nothing more to do • It seems pretty straightforward---although there are some thorny theoretical questions (e.g., "The Halting Problem").
while n > 0: print "T-minus", n n = n - 1 print "Fizzle..." • Statements never run in isolation! • They always run inside an "environment" • This is where the variables live (Diagram: the statements next to an environment whose variables hold 'n' : 10)
while n > 0: print "T-minus", n n = n - 1 print "Fizzle..." • As a program runs, the statements tend to do one of several things • They either modify the environment (Diagram: 'n' changes from 10 to 9)
while n > 0: print "T-minus", n n = n - 1 print "Fizzle..." • Or they control the next statement that executes (control flow)
while n > 0: print "T-minus", n n = n - 1 print "Fizzle..." • Or they perform some kind of input/output with the outside world
a fairly simple view, but most programs really don't do much more than what I've described • Of course, the devil is in the details • We're going to look at various facets of program structure and execution
Assignment stores a value x = 42 avg = (x+y)/2 items[2] = 37 a["name"] = "Elvis" a.response = "Yeah" • General form of assignment location = expression • An expression represents a value • Location specifies the place where the value is stored
always represents a value • May involve various operations on data 42 # Literal x # Variable x + y # Math operator x[i+n] # Array lookup foo(x) # Function call (x+y) / (a+b) # Grouping • The syntax and set of operators is fairly standard across languages (minor variations)
represents a place where a value is going to be "stored" (known as an "lvalue") • It might be a name x = 42 name = "Elvis" • But it might also involve an expression names[i+n] = "Elvis" • Key point : The left hand side must always represent a place where you can put a value
refer to a place in the surrounding environment • Storing to an unknown name creates an entry • Storing to an existing place replaces the value a = 3 b = 4 x = [a, b] x[1] = 37 (Diagram: the environment's variables hold 'a' : 3, 'b' : 4, 'x' : [3, 4]; the last statement replaces the 4 inside x with 37)
does it mean to "store" a value? • From last lecture, we saw that this can be more complicated than you might imagine • Might be by value or by reference
Assign by reference (copy pointers) a = 42 b = a environment statements variables • Here, there is one object with the value 42, but two names refer to it 42 'a' 'b'
overwrite previous values by overwriting the memory? • Or does assignment make a new object? a = 42 ... a = 37 (Diagram: either the memory holding 42 is overwritten with 37, or the old 42 loses a reference (ref--) and 'a' is rebound to a new object 37) • Answer : It depends on the language
Overwriting a variable a = 42 b = a a = 37 environment statements variables 42 'a' 'b' 37 old new • You get a new value and the name is rebound • The old value may persist if used elsewhere
point : Assignment is an operation that modifies the environment in which statements execute • Deep thought : The environment sure looks a lot like a hash table/dictionary/associative array
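The "deep thought" can be checked directly in Python, where a dictionary can literally serve as the environment for running statements. This is a minimal sketch; `env`, `program`, and `names` are illustrative names.

```python
# Run a few assignment statements inside an explicit environment (a dict).
env = {}
program = """
a = 3
b = 4
x = [a, b]
x[1] = 37
"""
exec(program, env)

# exec() also adds a '__builtins__' entry; filter it out to see our variables.
names = {k: v for k, v in env.items() if not k.startswith("__")}
print(names)   # {'a': 3, 'b': 4, 'x': [3, 37]}
```

Each assignment to an unknown name created an entry, and the store to `x[1]` replaced a value inside an existing container.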
statements on condition # Python if x > 0: statements else: statements • Almost all languages do exactly what you would expect here • Condition is checked and only one branch runs # Ruby if x > 0 statements else statements end # Perl/PHP if ($x > 0) { statements } else { statements }
generally rely on the result of a conditional expression if condition statements • Usually, this is built from special operators expr == expr expr != expr expr < expr expr > expr expr <= expr expr >= expr condition and condition condition or condition not condition • Produces a true/false value
A tricky issue : What happens here? if x statements • x is just some value (we don't know what) • For this to make sense, you have to know what it means for a value to be "True" • Believe it or not, there are some differing ideas about that.
Option 1: A value is true if it is non-zero, non- empty, or generally looks like it has an interesting value • This is probably the most common treatment # True Values x = 1 x = "Hello" x = [1,2,3] # False Values x = 0 x = "" x = [] x = None if x statements
Option 2: A value is true unless it is nil or false; anything that is an actual value counts as true. • This is the approach Ruby takes if x statements # True Values x = 0 x = "Hello" x = [1,2,3] x = "" # False values x = nil x = false • Danger : 0 evaluates as True! • This is really a pointer check--does x point to a value?
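The two options can be compared side by side in Python. Python itself follows Option 1; the Option 2 rule is only simulated here with an explicit test (`ruby_style_true` is a name invented for this sketch, with `None` and `False` standing in for Ruby's nil and false).

```python
values = [1, "Hello", [1, 2, 3], 0, "", [], None]

# Option 1 (Python's own rule): zero and empty values are false.
truthy = [v for v in values if v]
falsy = [v for v in values if not v]

# Option 2 (simulated): only "no value" and false itself are false.
ruby_style_true = [v for v in values if v is not None and v is not False]

print(truthy)            # [1, 'Hello', [1, 2, 3]]
print(falsy)             # [0, '', [], None]
print(ruby_style_true)   # [1, 'Hello', [1, 2, 3], 0, '', []]
```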
only evaluate parts until the result can be determined if condition1 and condition2 statements (condition2 is not evaluated if condition1 is False) if condition1 or condition2 statements (condition2 is not evaluated if condition1 is True) • Example: if x != 0 and y/x < 0.01 statements • Also known as "short-circuit" evaluation
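A small Python sketch that makes short-circuit evaluation visible by recording which conditions actually ran. The helper `check` and the list `calls` are invented for the example.

```python
calls = []

def check(label, result):
    """Record that this condition was evaluated, then return its result."""
    calls.append(label)
    return result

# 'and' stops at the first false operand: "b" is never evaluated.
r1 = check("a", False) and check("b", True)

# 'or' stops at the first true operand: "d" is never evaluated.
r2 = check("c", True) or check("d", True)

# The guard pattern from the slide: the division never happens when x is 0.
x, y = 0, 5
safe = x != 0 and y / x < 0.01    # no ZeroDivisionError

print(calls)   # ['a', 'c']
print(safe)    # False
```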
a condition statement if condition; statement unless condition; • Examples: print "Hello Dave\n" unless ($name ne "Dave"); # Perl print "Hello Dave\n" unless $name != "Dave" # Ruby • This form is somewhat less common. • Personally, I'm not a huge fan... Here is a really long statement that looks like it will erase the entire filesystem ... NOT!
code based on the value of a variable: switch(variable) { case value1: statements case value2: statements case value3: statements default: statements } • This is not always supported and even if it is, there are subtle issues
blocks "fall-through?" switch(variable) { case value1: statements break case value2: statements case value3: statements break default: statements } statements If no break, execution falls through to the next case • This is the behavior of C/Java/Javascript, etc. • Fall-through may be disallowed (Ruby)
provide any kind of switch. You just chain if-elif-else statements if condition: statements elif condition: statements elif condition: statements else: statements • Thinking : Having a separate switch statement just seems redundant • Example: if content == 'gif': ... elif content == 'png': ... elif content == 'jpg': ... else: print "Unknown content!"
statement might be far more efficient than chained if-else depending on how it is implemented • Historically, compilers would turn switch into a jump table (a goto lookup table) switch(variable) { case value1: statements case value2: statements case value3: statements default: statements } (Diagram: a table mapping value1, value2, value3 to code locations loc1, loc2, loc3) loc1: statements ... loc2: statements ... loc3: statements
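In a language without switch, the jump-table idea can be imitated with a hash of functions. This is a sketch of that idiom in Python, not how any compiler implements switch; the handler names are hypothetical.

```python
def handle_gif():
    return "decode GIF"

def handle_png():
    return "decode PNG"

def handle_jpg():
    return "decode JPEG"

def handle_unknown():
    return "Unknown content!"

# The dict plays the role of the jump table: value -> code location.
handlers = {
    'gif': handle_gif,
    'png': handle_png,
    'jpg': handle_jpg,
}

def dispatch(content):
    # One hash lookup replaces a chain of comparisons;
    # the fallback acts like the "default" case.
    handler = handlers.get(content, handle_unknown)
    return handler()

print(dispatch('png'))
print(dispatch('bmp'))
```

The lookup cost stays roughly constant as cases are added, whereas a chained if-elif tests each case in turn.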
of switch in many dynamic languages seems to be hit or miss • I actually did a little experiment ("The Big Switch") switch(variable) { case 1: statement case 2: statement ... case 999: statement } vs. if (variable == 1) { statement } else if (variable == 2) { statement } ... else if (variable == 999) { statement } (variable == 999)
switch : 58.9 seconds else if : 108.1 seconds • Ruby switch :151.5 seconds else if : 106.5 seconds • Javascript (in Firefox) switch :~7 seconds else if : ???? minutes (didn't have patience to wait) • A million repetitions
while loops (universally supported) while condition statement statement statement end • Only the syntax differs slightly • Sometimes you will find this variation do statement statement statement while condition
break out of a loop while condition statement break # Terminates a loop statement end • Example: Python while True: line = f.readline() if line == 'END' : break # Various processing ...
the rest of the statements and go back to the start of the loop while condition statement continue # Go back to the top statement # Not executed end • Example: Python while True: line = f.readline() if line.startswith("#"): continue # Do more processing ...
some kind of looping variable for (init; condition; increment) { statements } • Example: for (i = 0; i < 10; i++) { print i } • This is really just a short-hand for this i = 0; while (i < 10) { print i; i++; }
modern use of for is to loop over items of a collection (array, hash, etc.) for item in collection statements end • Example: items = [1, 4, "Foo", "Bar"] # Ruby for x in items # x = 1, 4, "Foo", "Bar" ... end • Might be known as a "foreach" statement
collection is a very powerful concept • A collection could be many different things • An array, hash, set, string, file, etc. f = open("foo.txt") for line in f: statements ...
a collection of stocks portfolio = [ ('GOOG',100, 490.10), ('IBM', 50, 91.10), ('AAPL', 75, 122.45), ('YHOO', 45, 28.42) ] for name, shares, cost in portfolio: # statements # ... • Notice how values get expanded into variables for you (very nice)
concept of "iterating" over data is something that has been expanded greatly in dynamic languages • For instance, a large number of recent features in Python are just related to this • We'll see a lot more of this later
for x in s: statements else: statements • The else clause only runs if the loop runs all the way to completion without breaking for line in open("stocks.dat"): if 'IBM' in line: break else: print "Didn't find it"
(Ruby) for x in s statements redo statements end • Restarts the body of the loop without updating the iteration variable • Retry : Restarts from the beginning for x in s statements retry statements end
of things can be done using nothing but basic statements, conditions, and loops • For example: Writing scripts, data processing, etc. • I would suspect that a large number of programs actually use nothing more than these features to do various odd-jobs
Mathematically, it's an operation that accepts a bunch of inputs (arguments) and produces an output (the result) • Examples • sin(x) • f(x,y) -> 3x^2 + 2xy - 7 • However, this isn't a math class nor is it a theoretical programming languages course
A function is a named sequence of statements def funcname statement statement ... statement end • If you want those statements to run, you just invoke the function name funcname
function is actually an assignment in the environment def countdown n = 10 while n > 0 printf("T-minus %d\n", n) n -= 1 end print "Fizzle...\n" end (Diagram: the environment's variables gain an entry 'countdown' that refers to the function's statements) • The "value" of a function is the list of statements inside the body of the function
like data in dynamic languages • In fact, they can be redefined on-the-fly just like variables • You can even redefine a function in the middle of running your program (try that in C++)
def countdown n = 10 while n > 0 printf("T-minus %d\n", n) n -= 1 end print "Fizzle...\n" end countdown # Run the above function def countdown print "Boom!\n" end countdown # Run the new function
Function Execution • What happens when you call a function? def countdown n = 10 while n > 0 printf("T-minus %d\n", n) n -= 1 end print "Fizzle...\n" end statement statement countdown statement statement • Control passes to the first function statement • After the function is done, you go back to the statement after the function call
function call creates a new environment, everything that happens inside a function stays localized • A function can freely create new variables and modify its own environment • These changes don't affect anything else • The environment is destroyed when the function returns
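A minimal Python sketch of the private environment: the function's `n` is a different variable from the caller's `n`, and it disappears when the function returns. The specific values are chosen for this example.

```python
def countdown():
    n = 3               # lives only in this call's environment
    steps = []
    while n > 0:
        steps.append(n)
        n -= 1          # modifies the local n only
    return steps

n = 10                  # a variable in the caller's (global) environment
result = countdown()

print(result)           # [3, 2, 1]
print(n)                # 10 -- unaffected by what happened inside countdown
```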
executes in its own private environment, how do you get data in and out of the environment? • Passing parameters to a function • Returning results from a function
data into a function, use arguments def square(x) return x*x end • However, an argument doesn't receive a value until the function is actually called a = square(3) (here 3 is the argument) • Arguments represent incoming values that will be bound to names when the function runs
data from a function, use return def square(x) return x*x end (x*x is the return value) • It is up to the caller to save the result (using assignment) r = square(3) (Diagram: the caller's variables now hold 'r' : 9)
passing and returning values seems like it should be straightforward • However, there are a number of subtle issues that come up • Where do arguments get evaluated? • How do arguments get passed?
• Consider the following a = 3 b = 4 square(a+b) (the expression a+b must be evaluated; the variables hold 'a' : 3, 'b' : 4, 'square' : <func>) • When calling a function, the arguments are usually fully evaluated first. • Known as "Applicative Evaluation Order"
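Applicative order can be observed in Python by logging when each piece runs. The helper `arg` and the list `trace` are invented for this sketch.

```python
trace = []

def arg(label, value):
    """Stand-in for evaluating an argument sub-expression; logs when it runs."""
    trace.append("evaluate " + label)
    return value

def square(x):
    trace.append("enter square")
    return x * x

a, b = 3, 4
result = square(arg("a", a) + arg("b", b))

print(trace)    # ['evaluate a', 'evaluate b', 'enter square']
print(result)   # 49
```

The function body only starts after the whole argument expression has been reduced to the single value 7.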
How do the values get passed into a function? func(x,y) def func(a,b) statement statement statement (Diagram: the caller's variables 'x' and 'y' refer to value1 and value2; question marks mark the unknown link to the function's 'a' and 'b')
is often preferred because it is the most efficient way to pass containers (lists and hashes) to functions • For example, if you have a list with a million entries in it, you don't want to make a copy • However, be aware that modifications to an argument will affect the caller.
mutable data types (e.g., lists, dicts) will be reflected in the original object--arguments are not copies. def insert_sorted(s,val): for i,x in enumerate(s): if x > val: s.insert(i,val) break else: s.append(val) a = [10, 15, 50] insert_sorted(a,27) # a = [10, 15, 27, 50] (modifies the passed object)
• Recall : All statements execute in an environment that holds variables • A thorny question : Are statements able to access variables that have been defined in other environments? • For example, can a function access variables that were defined outside of the function? (e.g., globals)
programming languages deal with these questions using two-level "lexical scoping" • General idea : All variables either live in a "local" space or they live in a "global" space as determined by the structure of the source code. • Globals are the variables defined outside of function bodies • Locals are the variables defined inside functions
a program starts, there is an empty global environment x = 42 def foo y = 2*x x = 37 bar end def bar print x print y end foo start statements variables
start populating the environment x = 42 def foo y = 2*x x = 37 bar end def bar print x print y end foo statements variables 'x' : 42 'foo' : <func> 'bar' : <func>
consider a function call: x = 42 def foo y = 2*x x = 37 bar end def bar print x print y end foo statements variables 'x' : 42 'foo' : <func> 'bar' : <func> Call a function
calls create a new environment globals 'x' : 42 'foo' : <func> 'bar' : <func> y = 2*x x = 37 bar statements variables foo : environment • Globals is the variable table from the global environment (previous slide)
variable lookup globals 'x' : 42 'foo' : <func> 'bar' : <func> y = 2*x x = 37 bar statements variables foo : environment x? x? • When looking up a value, look in the variable table of the local environment first • If not found, look in globals (as a fallback)
assignment globals 'x' : 42 'foo' : <func> 'bar' : <func> y = 2*x x = 37 bar statements variables 'y' : 84 foo : environment • Assignment puts a new value in the locals
assignment globals 'x' : 42 'foo' : <func> 'bar' : <func> y = 2*x x = 37 bar statements variables 'y' : 84 foo : environment • But, what happens here? • Notice : There is nothing in the assignment statement that indicates where it goes x? x?
overwrite previous values? globals 'x' : 42 'foo' : <func> 'bar' : <func> y = 2*x x = 37 bar statements variables 'y' : 84 foo : environment • Is that a good idea or not? 37 ?
this code fragment def foo() x = 42 y = 37 ... end • If reading this code, most programmers will interpret those variables as locals • It would be pretty damn weird if the behavior changed depending on whether or not someone defined a global with those names
assignment should be local globals 'x' : 42 'foo' : <func> 'bar' : <func> y = 2*x x = 37 bar statements variables 'y' : 84 'x' : 37 foo : environment • Python and Ruby operate like this • However, it's not so clear cut • Let's continue for now...
lookup (reprise) globals 'x' : 42 'foo' : <func> 'bar' : <func> y = 2*x x = 37 bar statements variables 'y' : 84 'x' : 37 foo : environment print x print y statements variables bar : environment y? y? A name error. "y" is not defined.
scoping is an effective way of managing variables, but it has a problem • There is always this distinction between the local space and global space • Sometimes you want to assign values in either one of those spaces • Sometimes you actually do want to change a global variable.
• The same experiment in Javascript x = 19; function foo() { x = 42; y = 37; } foo(); document.writeln(x); // Produces 42 document.writeln(y); // Produces 37 • Okay, I'm just a little disturbed (and you should be too)
here is that the syntax for assignment doesn't say anything about where a value is supposed to be stored x = 37 • No clear way to indicate that it's a local or a global variable • So, a language will do whatever its designers felt like it should do.
languages, this problem is solved through the use of "declarations" which precisely pin down the location int x; // Global void foo() { int y; // Local y = 2*x; // No issues here x = 37; } • Fine, but in dynamic languages you don't declare datatypes
common approach : Require variables with non-local scope to be tagged x = 19 def foo(): global x # The x below is global x = 42 y = 37 foo() print x # Produces 42 print y # NameError : y not defined • This approach is used by Python, PHP, Tcl
approach : Allow variables to have optional scope "declarators" • This is an approach used by Perl • Except there is more to this (in a minute) $x = 19; sub foo() { local $x = 42; local $y = 37; } foo(); print "$x\n"; # Produces 19 print "$y\n"; # Produces nothing
Javascript, variables live where they are formally declared using "var" • If you leave off the var when defining, the variable is just global var x = 19; // A global function foo() { x = 42; var y = 37; // A local } foo(); document.writeln(x); // 42 document.writeln(y); // Nothing
prepends variables with a special symbol to tell you where it's located • Here, you just look at the variable and you know where it lives Name # A constant name # A local variable $name # A global variable @name # An instance variable (objects) @@name # A class variable (objects) $x = 19 def foo $x = 42 # A global y = 37 # A local end
Just when you thought you were safe, consider this bit of Perl code $x = 42; sub foo() { local $x = 37; bar(); } sub bar() { print "$x\n"; } bar(); # Prints 42 foo(); # Prints 37 (?!?!?!?!?!?!) • This is an example of "Dynamic Scope"
Variables bind to the nearest definition on the function call stack $x = 42; sub foo() { local $x = 37; bar(); } sub bar() { print "$x\n"; } foo(); # Prints 37 globals $x = 42; foo() local $x = 37; bar() print "$x\n"; • This can get absolutely diabolical!
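Python has no dynamic scoping, but the rule "bind to the nearest definition on the call stack" can be simulated with an explicit stack of environments. Everything here (`stack`, `lookup`, and the way `foo` pushes a frame) is a hypothetical model of the Perl `local` example, not Perl's actual mechanism.

```python
# Bottom frame plays the role of the globals: $x = 42
stack = [{"x": 42}]

def lookup(name):
    # Search from the most recent frame back toward the globals.
    for frame in reversed(stack):
        if name in frame:
            return frame[name]
    raise NameError(name)

def bar():
    return lookup("x")          # like: print "$x\n"

def foo():
    stack.append({"x": 37})     # like: local $x = 37;
    try:
        return bar()            # bar sees the caller's x
    finally:
        stack.pop()             # the local binding ends when foo returns

print(bar())   # 42
print(foo())   # 37
print(bar())   # 42 again
```

What `bar` sees depends on who called it, which is exactly why dynamic scope is hard to reason about from the source text alone.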
Allows a variable to exist only in the block where it was defined $x = 42; sub foo() { my $x = 37; # Local variable bar(); } sub bar() { print "$x\n"; } bar(); # Prints 42 foo(); # Prints 42 • This takes us back to two-level scoping
is quite a bit more to variable assignment than meets the eye • Especially related to where a value lives • The lack of formal declarations creates various sorts of chaos • Read the manual! • Ignore at your own peril
usually no checking or validation of function arguments. • A function will work on any data that is compatible with the statements in the function • Example (Python): def add(x,y): return x + y add(3,4) # 7 add("Hello","World") # "HelloWorld" add([1,2],[3,4]) # [1,2,3,4]
also rarely any checking of return values. • Inconsistent use does not result in an error • Example: def foo(x,y): if x: return x + y else: return (inconsistent use of return, not checked)
are errors in a function, they will show up at run time (as an exception) • Example: def add(x,y): return x+y >>> add(3,"hello") Traceback (most recent call last): ... TypeError: unsupported operand type(s) for +: 'int' and 'str' >>>
of looping over data is a very common programming operation • Example: names = [ "Dave", "Leo", "Nita" ] for name in names: print "Hello", name • This is something programmers use all of the time without even thinking about it
languages have the ability to turn iteration itself into some kind of "object" that you can manipulate • Example : Generator Functions in Python • Example : Code blocks in Ruby • Disclaimer : This is an advanced topic---I'm just going to cover the basics now
that, instead of returning a single value, stays alive and generates a sequence of results • Example: def countdown(n): print "Counting down!" while n > 0: yield n # Yield a value n -= 1 >>> for i in countdown(5): ... print i, ... Counting down! 5 4 3 2 1 >>>
are pretty odd • If you call one, it doesn't seem to do anything >>> c = countdown(5) >>> • However, if you call .next() on the result, you'll see it start to run >>> c.next() Counting down! 5 >>> c.next() 4 >>>
A generator is like a normal function except that the "environment" is an object with a method that triggers statement execution. [Diagram: the countdown environment, suspended. Statements: print "Counting down!"; while n > 0: yield n; n -= 1. Variables: 'n' : 5.]
Calling .next() runs statements until you reach a yield statement. That pops a value out of the function. [Diagram: .next() on the countdown environment, with variables 'n' : 5, produces the value 5.]
On each following .next(), it wakes up and continues where it left off until the next yield statement is encountered [Diagram: .next() on the countdown environment, now with variables 'n' : 4, produces the value 4.] • This continues until there are no more statements
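The suspend and resume steps above can be traced in a short sketch. It assumes Python 3, where the built-in `next(c)` replaces the `c.next()` method shown on the slides. `countdown` is the generator from the slides; the variable names are illustrative.

```python
def countdown(n):
    print("Counting down!")
    while n > 0:
        yield n      # Suspend here and hand n to the caller
        n -= 1

c = countdown(3)     # Nothing is printed yet: the body has not started
first = next(c)      # Prints "Counting down!" and runs to the first yield
second = next(c)     # Resumes after the yield, loops, and yields again
rest = list(c)       # Drains whatever is left

print(first, second, rest)

# Once the statements run out, asking for another value raises StopIteration.
try:
    next(c)
    exhausted = False
except StopIteration:
    exhausted = True
print(exhausted)
```

A `for` loop does the same thing behind the scenes and stops quietly when `StopIteration` is raised.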
Generators separate the concept of iteration from code that uses the iteration for i in countdown(5): print i, for i in countdown(5): print "T-minus", i for i in countdown(5): os.system("rm img%d.png" % i) (in each loop, "for i in countdown(5)" is the iteration and the loop body is the code block that uses it) • We will talk more about this in a later class
Instead of turning iteration into an object, Ruby flips the whole thing around and allows code blocks to be turned into objects. def countdown(n) print "Counting down!\n" while n > 0 yield n n -= 1 end end countdown(5) { |i| puts i } Counting down! 5 4 3 2 1 (here, { |i| puts i } is the code block)
A block of code gets packaged into an object countdown(5) { |i| puts i } • That object is then passed into a function as part of the environment
The function runs normally until the yield statement is reached [Diagram: the countdown environment. Statements: print "Counting down!"; while n > 0; yield n; n -= 1; end. Variables: 'n' : 5 and <block>, the code block |i| puts i.]
The yield then produces a value that's fed into the code block, which runs [Diagram: the countdown environment (variables 'n' : 5 and <block>); the value from yield n is passed to the block |i| puts i.] • The code block then executes in the environment where it was defined!
After the code block runs, you go back to statements in the current function [Diagram: the countdown environment (variables 'n' : 5 and <block>) alongside the block |i| puts i.] • This continues until no more statements
It is important to realize that there is an environment switch going on here! sum = 0 countdown(5) { |i| sum += i } • When the code block gets executed, it runs in the outer environment---not in the environment of the countdown function • Note: This is related to "closures" (which we will cover in a few weeks)
Features related to iteration are really just fancy tricks involving the execution environment • Generator : A function that can suspend itself and emit a value from its environment • Code block : A chunk of code that you can run, but which executes in the environment where it was defined.
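The Ruby code block idea can be approximated in Python by passing a function explicitly. This is only an analogy, assuming Python 3: the `block` parameter, `total_countdown`, and `add_to_total` are hypothetical names, and the explicit call `block(n)` stands in for Ruby's `yield n`.

```python
def countdown(n, block):
    # 'block' plays the role of the Ruby code block: a callable from the caller.
    while n > 0:
        block(n)     # Roughly what "yield n" does in the Ruby version
        n -= 1

def total_countdown(start):
    total = 0
    def add_to_total(i):
        nonlocal total   # Refers to 'total' in the enclosing function,
        total += i       # not to anything inside countdown
    countdown(start, add_to_total)
    return total

print(total_countdown(5))
```

The block is called from inside `countdown`, but the variable it updates lives in the environment where the block was defined. That is the environment switch the slides point out.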
When you write a program and it encounters an error, it normally aborts with some kind of traceback >>> prices['SCOX'] Traceback (most recent call last): File "<stdin>", line 1, in <module> KeyError: 'SCOX' >>> • Errors usually have a "type" and some informative diagnostics. • Programs can raise and catch errors
Catching an exception (begin - rescue) begin statements rescue RuntimeError => e puts e end • Raising an exception (raise) raise RuntimeError,"Name not found" • Exceptions are very similar in most languages • So, I will focus on Python.
Exceptions have an associated value • More information about what's wrong raise RuntimeError("Invalid user name") • Passed to variable supplied in except try: ... except RuntimeError,e: ... • Commonly a string, but may be any object
A finally block specifies code that must run regardless of whether or not an exception occurs f = open("foo","r") try: ... finally: f.close() # Close file • Commonly used to properly manage resources (especially locks, files, etc.) • In Ruby, this is the "ensure" clause
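A small sketch of both ideas, the value carried by an exception and the cleanup behavior of `finally`. It assumes Python 3, where the slides' `except RuntimeError,e` is written `except RuntimeError as e`. The function `risky` and the `log` list are hypothetical names that stand in for a real resource such as a file or lock.

```python
log = []

def risky(fail):
    log.append("acquire")
    try:
        if fail:
            raise RuntimeError("Something went wrong")
        return "ok"
    finally:
        # Runs on both the normal return and the exception path.
        log.append("release")

result = risky(False)
try:
    risky(True)
except RuntimeError as e:
    message = str(e)     # The value associated with the exception

print(result, message, log)
```

The "release" entry is recorded on both calls, even though one returns normally and the other raises.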
It is not always possible to retry code that generated an error • For example, in Python, execution always resumes after the try-except block • Ruby has a retry statement begin statements rescue # Determine if we can retry retry end
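Since Python has no `retry` statement, one way to get a similar effect is an explicit loop around the `try` block. This is a sketch under that assumption, in Python 3. `with_retries`, `flaky`, and `max_attempts` are hypothetical names, not part of any library.

```python
def with_retries(operation, max_attempts=3):
    # Hypothetical helper: loop explicitly because there is no retry statement.
    attempt = 0
    while True:
        attempt += 1
        try:
            return operation(attempt)
        except RuntimeError:
            if attempt >= max_attempts:
                raise      # Give up and let the error propagate

def flaky(attempt):
    # Fails on the first two attempts, then succeeds.
    if attempt < 3:
        raise RuntimeError("try again")
    return "succeeded on attempt %d" % attempt

print(with_retries(flaky))
```

The decision about whether a retry makes sense sits in the `except` clause, which is where the Ruby version places its `retry`.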
Since most dynamic languages are interpreted, they typically have the ability to run their own code given as a string s = """ for i in range(10): print "i =", i """ exec(s) x = eval("3 + 20/5") • The exact syntax varies
Executing code strings is inherently a frightening concept • The usual semantics is for the code string to execute as if it were typed directly into the program at the point of the eval/exec statement • However, it is sometimes possible to execute strings in their own environment
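Running a string in its own environment can be sketched with the optional dictionary argument that Python's `exec` and `eval` accept. The example assumes Python 3 (where `/` is true division). The names `code`, `env`, and `value` are illustrative.

```python
code = """
total = 0
for i in range(5):
    total += i
"""

env = {}            # A separate dictionary that serves as the code's globals
exec(code, env)     # Names assigned by the code string land in env

value = eval("3 + 20/5", {})   # Evaluate an expression in an empty environment

print(env["total"], value)
```

Because the string runs against `env`, the variable `total` is created there rather than in the surrounding program.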
We've really focused on the structure and control-flow of programs • The really big issues: • Statement execution environment • Global/local environment distinction • What happens during function calls • Will return to more of this later.
In this section, we look at the problem of how to put larger programs together • Organizing programs into modules • Introduction to "objects" • Object oriented programming
The focus of this section is not on how to hack code (e.g. loops, variables, functions, algorithms, etc.) • It's more related to software engineering • How do you organize a million line program? • How do you make programs extensible? • How do you make programs maintainable?
You have already been working with "objects" import csv reader = csv.reader(open("portfolio.dat")) a = [1,2,3] # A list (object) a.append(42) # Append to a list (a method) • Have probably encountered modules as well • However, you may not have thought much about why objects and modules behave the way that they do.
The engineering aspects of developing software have been known for a long time • Currently, most of the work in this area is found under the banner of "Object Oriented Programming" • However, it is a lot like religion. There might be something redeeming about it, but it's not always clear what people are talking about.
If you can own just one book on software engineering... • F. Brooks, "The Mythical Man-Month" • And when you're done, read T. Kidder, "The Soul of a New Machine."
This class isn't about software project management (at least not directly) • Instead, we're going to focus a little bit on how object oriented programming came into existence • A lot of background on why people work with objects to begin with
We're later going to go explore Smalltalk • One of the earliest OO languages • By far, the most influential language on the development of later dynamic languages such as Ruby and Python. • Disclaimer : I might crash and burn with this. (Don't say I didn't warn you).
• A program is a series of statements • There are many different types of statements • Assignment, conditions, loops, exception handling, function calls, function definition, etc. • But what happens as a program grows?
: Editing the source code • As a program grows in size, the program will start to become quite large in the editor. • You don't want to edit a 100000 line program that's been typed into one big file • As a practical matter, programmers don't like to edit files that are much longer than a few thousand lines.
Large programs involve multiple source files • Generally, you break it up by putting related functionality into the same file • However, this introduces a variety of new problems related to file management • Example : Separate compilation in C • Example : Management of the global namespace
/* foo.c */ extern int bar(int); void foo(int n) { ... x = bar(n); ... } /* bar.c */ int bar(int n) { ... statements ... } • If you split across files, you have to have some way to reference definitions in other locations (e.g., "extern")
In addition, there is the question of how global symbols are managed. • For example, do all of the variable and function names have to be distinct? • In C/C++, the answer is generally "yes."
If everything exists in the same global space, the problem of picking names becomes increasingly difficult as the program grows • An added problem arises if you start using programming libraries • Those libraries also need to pick unique names
solution : Name prefixing def Foo_bar() ... end def Foo_spam() ... end def Foo_grok() ... end • Group related functionality under a common name prefix (very common in C)
Extensibility : How do you make it easy to extend a program with new features? • Code reuse : How do you make it easy to re-use parts of a program that you have already written? • Modularization : How do you make it easier to divide a program into pieces that many people can work on?
Suppose you have written a big application ("Mondo Application") that has to read data def read_data(source) ... end • And suppose there is a single function that reads input data (e.g., read_data)
modify the application to support reading data from the following file formats • CSV files • From a relational database • XML files • Scraped off HTML pages • Excel spreadsheets
To implement this, you might implement many different versions of the function: def read_data_csv(name) ... def read_data_xml(name) ... def read_data_db(name) ... def read_data_html(name) ... • But now, you have the problem of plugging these functions into the larger program.
solution : A dispatch function data_format = "xml" ... def read_data(name) if data_format == 'csv' d = read_data_csv(name) elif data_format == 'xml' d = read_data_xml(name) elif data_format == 'db' d = read_data_db(name) elif data_format == 'html' d = read_data_html(name) • Yow! How is this going to scale up in a huge programming project? (hint : It's not!)
Suppose you have implemented some general purpose functionality def read_data_xml(source) ... # General purpose code to parse XML statements ... # Specific statements to process data statements ... • Maybe you want to re-use the more general parts of this code in other places • Question : How?
In large projects, there is a benefit to breaking a program up into small self-contained modules • Each module can be maintained separately • Often by different groups of programmers • To make it work, you really have to think about the boundaries between modules (interfaces, versions, etc.)
Efforts to modularize code typically lead to the use of software "components" • Components are self-contained and have a well-defined programming interface (API) • Applications constructed mostly by assembling and gluing components together
In the real world, software components may be written in entirely different languages • There is a whole industry surrounding the use of components • Example : COM, Active-X, etc. on Windows • This is also a major reason why people are using dynamic languages
When you start working on a program, it often starts out as one source file • However, at some point, you want to split it into two files • Splitting across files is probably the most fundamental division of source code. • It seems simple enough...
Most dynamic languages provide some kind of statement to load statements from another source code file • Examples: execfile("foo.py") # Python require 'foo.rb' # Ruby require "foo.pl"; # Perl require("foo.php"); # PHP
A file include typically executes the statements in the file as if they had been typed at the point where the include statement was placed • However, there are still some tricky issues lurking underneath the covers # Here's my big application require 'funcs.src' require 'utils.src' ... statements ...
Can a file be included more than once? • require() is often a one-time operation. require 'foo.src' ... statements ... require 'foo.src' # Ignored • The one-time behavior is used to make programming libraries work correctly
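The one-time behavior can be sketched in a few lines. This is a toy model in Python 3, not how any particular language implements it. `require`, `loaded`, `sources`, and `env` are hypothetical names, and a dictionary of source strings stands in for the file system so the example is self-contained.

```python
loaded = set()      # Names of files that have already been included

def require(name, sources, env):
    # Hypothetical sketch: 'sources' maps file names to source text and
    # 'env' is the environment in which the statements execute.
    if name in loaded:
        return False            # Already included: ignore the request
    loaded.add(name)
    exec(sources[name], env)
    return True

sources = {"foo.src": "counter = counter + 1"}
env = {"counter": 0}

first = require("foo.src", sources, env)    # Executes the statements
second = require("foo.src", sources, env)   # Ignored
print(first, second, env["counter"])
```

Without the `loaded` check, every library that included `foo.src` would re-run its statements, which is the problem the one-time rule avoids.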
Specifying a filename is trickier than it looks require './foo.src' require '/users/beazley/Projects/foo.src' require 'C:\Documents and Settings\Projects\foo.src' • As a general rule, it's bad practice to hard-code path names into a program (especially if it has to be moved around) • Solution : PATH variables • May be platform dependent
Most languages will have some kind of internal variable that contains the list of search directories for file includes • Example : Ruby ($: variable) irb(main):001:0> puts $: /usr/lib/ruby/site_ruby/1.8 /usr/lib/ruby/site_ruby/1.8/powerpc-darwin8.0 /usr/lib/ruby/site_ruby/1.8/universal-darwin8.0 /usr/lib/ruby/site_ruby /usr/lib/ruby/1.8 /usr/lib/ruby/1.8/powerpc-darwin8.0 /usr/lib/ruby/1.8/universal-darwin8.0 . => nil
As a program continues to grow, you will reach a point where you want to split files across multiple directories Formats/ png.src gif.src jpg.src tiff.src Parsing/ html.src xml.src csv.src
A collection of related files is sometimes known as a "package" Formats/ png.src gif.src jpg.src tiff.src • To install, you need to put the package directory on the file search path. • But packages don't always play nice...
Consider two packages of source code /Blah/ foo.src bar.src spam.src grok.src /Yow/ bar.src spam.src • What happens if both packages are on the file search path and they include the same filename? # foo.src require 'spam.src' # bar.src require 'spam.src' ??
Every dynamic language has implemented file inclusion in some sort of slightly broken way • It's a problem that seems like it should be easy, but which is hard and sneaky • Solutions usually focus on making the loading of files more abstract and high level • Example : Packages in Java, Python, Perl, etc.
import import java.io.*; // Java import os.path // Python use blah; // Perl • Here, the request to "import" code is not directly tied to low-level details concerning the file system • You still worry about configuration, but it's a little more controlled
Even if you have multiple files, you still have issues with naming things • Recall that all statements execute inside an environment (that holds variables) • There is usually a global/local environment • Question : Do all files execute in the same global environment?
A namespace is a named environment where program statements can execute • To break up a large program, different parts of the program can execute in different namespaces • This provides isolation between components • Namespace serves as a kind of "module"
Consider two different sets of statements x = 42 def square(y) return y*y end x = 10 def countdown(n) while n > 0 print "T-minus", n n -= 1 end print "Fizzle" end • If you do nothing, these statements live in the same space
Namespaces put statements into a named env namespace foo { x = 42 def square(y) return y*y end } namespace bar { x = 10 def countdown(n) while n > 0 print "T-minus", n n -= 1 end print "Fizzle" end } • Note : Exact syntax varies widely for this
Namespaces define separate environments namespace foo { x = 42 def square(y) return y*y end } namespace bar { x = 10 def countdown(n) while n > 0 print "T-minus", n n -= 1 end print "Fizzle" end } • Environment foo: 'x' : 42, 'square' : <func> • Environment bar: 'x' : 10, 'countdown' : <func>
Although namespaces are isolated, you still need access to data/functionality contained in other namespaces • You need an access mechanism to cross module boundaries
Often, a namespace is implemented as some kind of data or "object" in the language print foo.x print bar.x bar.countdown(10) a = foo.square(3) • Here, the "namespace" is something you can pass around and treat like data b = bar b.countdown(5)
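A rough Python 3 illustration of namespaces as objects, using `types.SimpleNamespace` from the standard library as a stand-in for the hypothetical `namespace foo { ... }` syntax on the slides. Here `countdown` returns its lines instead of printing them, an adaptation made only so the result is easy to inspect.

```python
from types import SimpleNamespace

def square(y):
    return y * y

def countdown(n):
    out = []
    while n > 0:
        out.append("T-minus %d" % n)
        n -= 1
    out.append("Fizzle")
    return out

# Two separate environments, each with its own 'x'.
foo = SimpleNamespace(x=42, square=square)
bar = SimpleNamespace(x=10, countdown=countdown)

print(foo.x, bar.x)        # The two x values do not collide
a = foo.square(3)
b = bar                    # A namespace can be assigned and passed like data
print(a, b.countdown(2))
```

In practice, Python modules fill this role: each imported module is an object whose attributes are the names defined in that file.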
With namespaces, you now have a mechanism for breaking code across files and isolating the execution of code to different environments • This is critical to the development of programming libraries and components • Library builders can isolate their code and have a reasonable assurance that it won't conflict with your code
Programs need to work with data structures • For example, a graphics program might have to work with shapes like Circles and Rectangles. • Each shape will have some basic attributes • C: struct Circle { double radius; }; struct Rectangle { double width; double height; }; • Python: { 'radius' : 4 } { 'width' : 4, 'height' : 5 }
Usually, there are functions that perform various operations on data • These are typically called "methods" • Some examples for shapes: • Compute the area • Compute the perimeter • Draw on the screen
How do you bundle data structures and methods together in an effective way? • One approach : Use a namespace • Rationale : Namespaces keep code isolated. So, just put the functionality for each kind of data in a separate namespace.
• Here is some code for a Circle (Python) # Circle.py import math def new(): c = { } return c def init(c,radius): c['radius'] = radius def area(c): return math.pi*c['radius']**2 def perimeter(c): return 2*math.pi*c['radius']
Example : A Circle (annotations on the Circle.py code) • The file Circle.py is the namespace "Circle" • new() : Create a container where we will store data related to the circle • init() : Initialize a circle by storing some data (the radius) inside the container • area(), perimeter() : Perform some kind of operation on a Circle
Here's how you would use the Circle >>> import Circle >>> c = Circle.new() >>> Circle.init(c,4) >>> Circle.area(c) 50.26548245743669 >>> Circle.perimeter(c) 25.132741228718345 >>> • Notice how the namespace (Circle) is encapsulating all of the functionality related to circles
• Here is similar code for a Rectangle # Rectangle.py def new(): r = { } return r def init(r,width,height): r['width'] = width r['height'] = height def area(r): return r['width']*r['height'] def perimeter(r): return 2*(r['width']+r['height'])
Example use The code for each shape is isolated in a separate module (namespace) >>> import Circle >>> import Rectangle >>> c = Circle.new() >>> Circle.init(c,4) >>> r = Rectangle.new() >>> Rectangle.init(r,4,5) >>> Circle.area(c) 50.26548245743669 >>> Rectangle.area(r) 20 >>> Circle.perimeter(c) 25.132741228718345 >>> Rectangle.perimeter(r) 18 >>>
Example : Shapes (annotations on the example use) • Circle.new(), Circle.init(), Rectangle.new(), Rectangle.init() : Here, we are creating and initializing some shapes (which are just dictionaries) • The area() and perimeter() calls : Performing various operations on the shapes
This "works," but you have to be very specific about the methods you call. >>> Circle.area(c) 50.26548245743669 >>> Rectangle.area(r) 20 >>> • There is another issue: How do you know what kind of shape you have? s = ... # A shape of some kind # Calculate the area area = ????? # What do you do here?
In order to distinguish different kinds of data, you can tag it with some kind of "class" # Circle.py import math def new(): c = { 'class' : 'Circle' } return c def init(c,radius): c['radius'] = radius def area(c): return math.pi*c['radius']**2 def perimeter(c): return 2*math.pi*c['radius'] An attribute that says what the data actually is
Once data is tagged with a "class", you can create a high-level "dispatch function" def area(shape): if shape['class'] == 'Circle': return Circle.area(shape) elif shape['class'] == 'Rectangle': return Rectangle.area(shape) ... # Usage c = Circle.new(); Circle.init(c,4) s = Rectangle.new(); Rectangle.init(s,4,5) print area(c) # Calls Circle.area print area(s) # Calls Rectangle.area
What if you wanted all shapes to have position information and some functions for movement? • Example : x,y coordinates and a function for moving the shape. • One approach : Modify every source file involving shapes... ugh.
>>> import Circle >>> c = Circle.new() >>> Circle.init(c,4) >>> c['x'] 0 >>> Circle.move(c,3,7) >>> c['x'] 3 >>> Circle.area(c) 50.26548245743669 >>> • Notice how Circles picked up the functionality we defined in Shape
Since Circle and Rectangles are using common functionality from Shape, we should probably make sure that both objects get created in a consistent way def new(): c = { 'class' : 'Circle' } return c def new(): r = { 'class' : 'Rectangle' } return r
# ColoredCircle.py import Circle def new(classname="ColoredCircle"): c = Circle.new(classname) return c def init(c,color,radius): Circle.init(c,radius) c['color'] = color # Add a color value # Just use the same functions for area/perimeter area = Circle.area perimeter = Circle.perimeter
Extending an existing object with new attributes or methods is called "inheritance" • You're "inheriting" all of the features of the original object, but making modifications
We've set up a lot of machinery, but the problem of method dispatch is still horrible • Here's an example of what's wrong: import Rectangle, Circle, ColoredCircle # Create some shapes a = Rectangle.new(); Rectangle.init(a,4,5) b = Circle.new(); Circle.init(b,4) c = ColoredCircle.new(); ColoredCircle.init(c,"red",5) shapes = [a,b,c] for s in shapes: print area(s) # Compute area of whatever • Not quite sure what to do here (depends on the shape)
You can implement a sort of "hack" import sys def dispatch(s,name): classname = s['class'] module = sys.modules[classname] return getattr(module,name) • Example: shapes = [a,b,c] for s in shapes: print dispatch(s,"area")(s) • This looks up a method based on the classname
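A self-contained Python 3 sketch of the same lookup. The slides find the module through `sys.modules`. To avoid needing separate `Circle.py` and `Rectangle.py` files, this version is a simplification that builds the two namespaces in memory with `types.SimpleNamespace` and keeps them in a hypothetical `namespaces` dictionary.

```python
import math
from types import SimpleNamespace

# Stand-ins for the Circle and Rectangle modules (assumption: built in memory
# here; the slides use separate .py files).
Circle = SimpleNamespace(
    new=lambda: {'class': 'Circle'},
    area=lambda c: math.pi * c['radius'] ** 2,
)
Rectangle = SimpleNamespace(
    new=lambda: {'class': 'Rectangle'},
    area=lambda r: r['width'] * r['height'],
)

namespaces = {'Circle': Circle, 'Rectangle': Rectangle}

def dispatch(s, name):
    # Find the function 'name' in the namespace named by the instance's tag.
    return getattr(namespaces[s['class']], name)

c = Circle.new()
c['radius'] = 4
r = Rectangle.new()
r['width'] = 4
r['height'] = 5

areas = [dispatch(s, "area")(s) for s in [c, r]]
print(areas)
```

The function returned by `dispatch` still has to be called with the shape itself, since the data and the functions live in different places.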
• By now, it should be pretty clear • The code we have been writing has been building towards the concept of an "object" • Roughly speaking, an "object" is a way of packaging data and functions together • It ties most of what we just did together
The container used to hold object data is called an "instance." The data stored inside is called "instance data." • The namespace where all of the methods are defined is called a "class" • Borrowing methods from other classes is called "inheritance." • Dispatching is called "polymorphism"
People have been writing programs that do these sorts of things for a long time • For example, you can implement all of this in C or other simple languages • However, it's usually really clunky, verbose, and really hard to maintain
An "object oriented language" makes it a lot easier by taking care of low-level details • There is special syntax and other features >>> c = Circle(4.0) >>> r = Rectangle(4,5) >>> c.area() 50.26548245743669 >>> r.area() 20 >>> • So, let's talk about that...
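For comparison, here is roughly what the hand-built machinery from the previous slides looks like with Python's class syntax (Python 3). The `Shape` base class with `x`, `y`, and `move` follows the position and movement example from earlier. The exact layout of the classes is an assumption made for illustration.

```python
import math

class Shape:
    def __init__(self):
        self.x = 0
        self.y = 0
    def move(self, dx, dy):
        self.x += dx
        self.y += dy

class Circle(Shape):                 # Inherits move() from Shape
    def __init__(self, radius):
        super().__init__()
        self.radius = radius
    def area(self):
        return math.pi * self.radius ** 2

class Rectangle(Shape):
    def __init__(self, width, height):
        super().__init__()
        self.width = width
        self.height = height
    def area(self):
        return self.width * self.height

shapes = [Circle(4.0), Rectangle(4, 5)]
for s in shapes:
    s.move(3, 7)
    # The language picks the right area() based on the instance's class.
    print(type(s).__name__, s.area(), s.x, s.y)
```

The class tag, the dispatch function, and the borrowing of functions between namespaces are all handled by the language.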
Simula • The first "object oriented" programming language was Simula. • Simula was largely based on adding support for "objects" to Algol-60. • Strongly based on static compilers • Most of the core ideas in Simula later resurfaced in C++ and by extension in Java.
Smalltalk was also one of the first object-oriented programming languages • Initially developed at Xerox PARC (~1971) • Smalltalk-80 was the first public release • Unlike Simula, it was a dynamic language (!)
One day, in a typical PARC hallway bullsession, Ted Kaehler, Dan Ingalls, and I were standing around talking about programming languages. The subject of power came up and the two of them wondered how large a language one would have to make to get great power. With as much panache as I could muster, I asserted that you could define the "most powerful language in the world" in "a page of code." They said, "Put up or shut up." - Alan Kay, "The Early History of Smalltalk"
Almost all modern dynamic languages cite Smalltalk as an "influence" "The idea of [....] comes from Smalltalk" • I'll be honest, I've never written a single program in Smalltalk before this lecture. • But, what in the heck does it mean to be "influenced" by Smalltalk? • Let's go find out...
• Everything in Smalltalk is an "object" • Objects hold state (data) • An object sends messages to and receives messages from other objects (or itself) ("That's all folks.")
It is possible (but unusual) to program Smalltalk as a text-based language • For the examples that follow, I am using GNU smalltalk 3.0 • Fine for illustrating the general idea
Creating an object (an integer) x := 5 • The data stored by that object is the value (5) • The object has an associated class that indicates what kind of object it is st> x class SmallInteger st> • The object is called an "instance"
Once you have created an object, there is only one thing you do with it • You can send it a message. • That's it. • Nothing else. • Thus ends our tutorial of Smalltalk....
A message has two components: a selector and an optional parameter • It is first delivered to the object's class [Diagram: the class hierarchy, root first: Object, Magnitude, Number, Integer, SmallInteger. x := 5 is an instance of SmallInteger, and the message (selector | parm) is delivered to SmallInteger.]
If a message is not handled, it propagates to the superclass • This is an example of "inheritance" [Diagram: the class hierarchy, root first: Object, Magnitude, Number, Integer, SmallInteger. The message (selector | parm) sent to x := 5, an instance of SmallInteger, moves up to the superclass.]
Messages will propagate up the class hierarchy until a matching selector is found • At this point, the message is handled. [Diagram: the class hierarchy, root first: Object, Magnitude, Number, Integer, SmallInteger. The message (selector | parm) sent to x := 5 travels upward until a class with a matching selector runs its handler code.]
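The upward search for a matching selector can be imitated in Python 3 by walking a class's `__mro__` (the ordered list of the class and its ancestors). This is an analogy rather than Smalltalk's actual implementation. `find_handler` is a hypothetical helper, and the small `Shape`/`Circle` hierarchy is made up for the example.

```python
def find_handler(cls, selector):
    # Walk from the class up through its superclasses until some class
    # defines a method with the requested name.
    for klass in cls.__mro__:
        if selector in vars(klass):
            return klass
    return None     # Nobody understood the message

class Shape:
    def move(self):
        pass

class Circle(Shape):
    def area(self):
        pass

print(find_handler(Circle, "area"))   # Found directly on Circle
print(find_handler(Circle, "move"))   # Not on Circle, found on Shape
print(find_handler(Circle, "fly"))    # Not found anywhere
```

The last case corresponds to the "did not understand" error that Smalltalk reports when no class in the chain handles a message.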
How do you send a message? • Here is an example: st> x := 5. 5 st> x factorial. 120 st> x abs. 5 st> • In this case, we are sending a simple "unary" message (just a selector, no parameters): in "x factorial", x is the object and factorial is the message
Binary messages (+, -, /, *, etc.) • These take another object as a parameter st> x := 5. 5 st> x + 3. 8 st> x * 4. 20 st> • Here, the operator (+,*) is the selector and the value on the right is the parameter
You'll now notice some pretty odd things st> x := 5 5 st> x + 3 * 4. 32 st> • There are no "operators" in Smalltalk, just messages which usually bind left to right ???? (x + 3) * 4. -> Send "+ 3" to x. Produces 8 8 * 4. -> Send "* 4" to 8. Produces 32
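The left-to-right rule can be checked with a small Python 3 sketch. `send_binary` and `OPS` are hypothetical names. Each message is modeled as a (selector, parameter) pair applied to the result of the previous one, which is how the slide describes the evaluation of `x + 3 * 4`.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def send_binary(receiver, *messages):
    # Each message is a (selector, parameter) pair, applied left to right.
    for selector, parameter in messages:
        receiver = OPS[selector](receiver, parameter)
    return receiver

x = 5
smalltalk_style = send_binary(x, ("+", 3), ("*", 4))   # (5 + 3) * 4
python_style = x + 3 * 4                                # 5 + (3 * 4)
print(smalltalk_style, python_style)
```

The two results differ because Python applies operator precedence, while the message model has no precedence among binary selectors.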
Displaying an object object display • Printing with a newline object printNl (note ends with lower-case 'L') • Inspecting an object (debugging) object inspect
I said that Smalltalk only has objects and messages • THAT'S IT! • There are no "control flow" statements • No conditional statements • No looping statements • No function statements
Blocks of code are objects st> a := [ x := 3. y := 4. x + y ]. st> a a BlockClosure st> a value 7 st> • a is an object holding the code • value is a message telling the code block to run and produce a value
Code blocks can optionally take parameters st> b := [ :x :y | x + y ]. st> b value: 3 value: 4 7 st> • This gives you something that roughly looks like a function • But it's still just an object. You send it messages to get it to run.
There are also messages involving code blocks st> x := 0. st> 3 timesRepeat: [x := x + 1]. 3 st> [ x < 100 ] whileTrue: [x := x + 1]. nil st> x 100 st> • It's a message with code blocks as parameters
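To see how control flow can be expressed as calls that take code blocks, here is a Python 3 imitation of `timesRepeat:` and `whileTrue:`. The helpers `times_repeat` and `while_true` are hypothetical and are themselves written with ordinary Python loops. The point is only that the caller passes blocks (callables) rather than using loop statements.

```python
def times_repeat(n, block):
    # Analogue of "n timesRepeat: [ ... ]": run the block n times.
    for _ in range(n):
        block()

def while_true(condition, body):
    # Analogue of "[ cond ] whileTrue: [ body ]": both arguments are callables.
    while condition():
        body()

state = {"x": 0}

def increment():
    state["x"] += 1

times_repeat(3, increment)
after_repeat = state["x"]
while_true(lambda: state["x"] < 100, increment)
print(after_repeat, state["x"])
```

The condition has to be a block too. It must be re-evaluated on every pass, so it cannot be passed as an already computed value.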
Everything in Smalltalk is an object • You create your own objects by defining a class. • However, there is no special "class" statement. • Instead, you send a message
To create a new class, you send a message to the parent class (the superclass) • If you don't know what the parent is, you send a message to Object. The root of all objects. st> Object subclass: #Shape. Shape st> • Here, we are asking Object to create a subclass called "Shape."
Objects have internal data • The members are set up in the class. st> Shape instanceVariableNames: 'x y'. Shape st> • This operation sets the names of instance variables of Shape objects • Enforces that Shapes will have x and y.
To create a shape, you have to define new Shape class extend [ new [ | s | s := super new. s init. ^s ] ] • This is an example of a "class method" • It is a message that is sent to the class itself
Annotations on the definition of new: • | s | : A local variable • super new : Sends 'new' to the parent class. super is the "parent" (superclass) • s init : Send the 'init' message to the newly created instance • ^s : Return the instance
Here's some sample output st> s := Shape new. Object: Shape new "<0x40292bb0>" error: did not understand #init ... st> • It didn't work because we didn't define init yet
Every object you create is called an "instance" • All of the internal data is completely private • There is no way to inspect it from outside • To do anything, you make the object respond to messages (by implementing methods)
Let's make a shape move Shape extend [ movex: dx [ x := x + dx. ] movey: dy [ y := y + dy. ] ] • These also correspond to messages st> s movex: 3. a Shape st> s movey: 2. a Shape st> s x. 3 st> s y. 2 st>
Here's a method that sends some messages Shape extend [ movene: distance [ self movex: distance. self movey: distance. ] ] • Example use: st> s movene: 4. a Shape st> s x. 4 st> s y. 4 st> self is the instance
Shape subclass: #Circle. Circle instanceVariableNames: 'radius'. Circle class extend [ new: radius [ |c| c := super new. c init: radius. ^c ] ] Circle extend [ init: rad [ radius := rad. ] area [ |a| a := 3.1415926*(radius raisedTo: 2). ^a ] ]
So far, everything we have been doing has focused on instances. • However, the class itself is an object • The class can have its own variables (called class variables) • A class can have its own methods (called class methods)
Here's a sample definition Shape class extend [ foo [ 'Hello World' printNl. ] ] • Here's a use st> s := Shape new. a Shape. st> s foo. Object: Shape new "<0x402960b8>" did not understand #foo st> Shape foo. Hello World st> A method on the class
Class variables are set up when the class is created Object subclass: #Shape instanceVariableNames: 'x y' classVariableNames: 'ncreate' poolDictionaries: '' category: nil ! • Can be accessed in class methods Shape class extend [ ncreate [^ncreate] new [ |s| s := super new. s init. (ncreate = nil) ifTrue: [ncreate := 1] ifFalse: [ncreate := ncreate + 1]. ^s.] ]
Example use: st> s := Shape new. st> s ncreate. Object: Shape new "<0x402988d8>" error: did not understand #ncreate st> Shape ncreate. 1 st> • Again: Notice that it's part of the class
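A loose Python 3 counterpart of the instance counter, for readers more familiar with Python. The class variable `ncreate` matches the slide. The class method `create` is a hypothetical stand-in for Smalltalk's redefined `new`.

```python
class Shape:
    ncreate = 0                 # Class variable: stored on the class itself

    def __init__(self):
        self.x = 0
        self.y = 0

    @classmethod
    def create(cls):
        # Hypothetical class method playing the role of Smalltalk's 'new'.
        cls.ncreate += 1
        return cls()

s1 = Shape.create()
s2 = Shape.create()
print(Shape.ncreate)
```

One difference from the Smalltalk session: in Python, `s1.ncreate` also works, because attribute lookup on an instance falls back to its class.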
All objects in Smalltalk are "open" • You can add and modify methods of both instances and classes at any time. • Essentially, you can make instances respond to new kinds of messages at will (even after creation) • On the other hand, you can't really add new instance variables after creation.
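Python classes are open in a similar way, which makes for a compact illustration (Python 3). The names mirror the `movex:` example from the Smalltalk slides. Defining the function outside the class and attaching it afterwards is the point of the sketch.

```python
class Shape:
    def __init__(self):
        self.x = 0

s = Shape()                     # Created before the method exists

def movex(self, dx):
    self.x += dx

Shape.movex = movex             # Add a method to the class after the fact
s.movex(3)                      # The existing instance responds to it
print(s.x)
```

The instance picks up the new method because methods are looked up on the class at call time rather than copied into each instance.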
The Smalltalk environment itself is an object • It turns out that the assignment operator (:=) is actually a message as well. st> Smalltalk at: #x put: 42. 42 st> x 42 st> • The whole language is objects and messages.
Last time, we looked at problems related to creating large programs • Took a detour to go look at Smalltalk, one of the first and most influential object oriented languages • Today, we're going to look at how all of this gets put together in modern languages
review of concepts • The Ruby object model (in depth) • The Python object model (in depth) • The Perl object model (brief survey) • The Javascript object model (brief survey)
• An object is a programming abstraction that bundles two things together • Data • Methods that operate on the data • For example : A Circle • Data : The radius • Methods : area(), perimeter(), etc.
When you create objects, you are creating "instances" • Each instance of an object has its own internal data (instance variables) • Examples : Instances of circles .radius=6 .radius=3 .radius=4 .radius=9
Methods that operate on instances of objects are known as "instance methods" • For example: Compute the area of a circle • The result depends on the circle instance that you supply to the method
Methods are not stored as part of the instances themselves. • They are found in an associated class • Instances are always linked back to a class [Diagram: class Circle, holding area(), perimeter(), ..., linked to the Circle instances .radius=6, .radius=3, .radius=4]
class may define its own variables known as "class variables" • These variables act as a kind of "global variable" for everything in the class class Circle area() perimeter() ... ncircles = 3 .radius=6 .radius=3 .radius=4 Circle instances class variable
point of inheritance is to borrow or modify existing functionality • For example, a Circle picks up all of the functionality that was defined for shapes • And it can modify that functionality if it wants
In some languages, classes themselves are objects (instances of a "class") • The "data" stored in a class object consists of the instance methods and class variables class Circle area() perimeter() class Shape move() class Stock sell() cost() Instances of "classes"
class may define methods that operate on the class itself (as an object) • These are known as "class methods" • An example : the new method • New is a class method that asks the class to create a new instance
normal function that just happens to be placed in a class for the purposes of packaging • It has no relation to instances or classes • It's just placed into the class namespace • This is more of a C++/Java oddity
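The concepts reviewed so far (instance data, instance methods, class variables, class methods, static methods) fit into one small class. This is a minimal sketch in Python 3 syntax, not code from the original slides; the `ncircles` counter and the `unit` and `degrees_to_radians` helpers are hypothetical names chosen for illustration.

```python
import math

class Circle:
    ncircles = 0                      # class variable: one copy shared by all instances

    def __init__(self, radius):
        self.radius = radius          # instance variable: one copy per instance
        Circle.ncircles += 1

    def area(self):                   # instance method: depends on the instance supplied
        return math.pi * self.radius ** 2

    @classmethod
    def unit(cls):                    # class method: operates on the class itself
        return cls(1.0)

    @staticmethod
    def degrees_to_radians(deg):      # static method: a plain function in the class namespace
        return deg * math.pi / 180.0

c = Circle(4.0)
u = Circle.unit()
```

Note that `area` is not stored in `c` itself; it is found through the link from the instance back to its class.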
Ruby is object oriented "I wanted a scripting language that was more powerful than Perl, and more object-oriented than Python. That’s why I decided to design my own language (Ruby).” - Matz (creator of Ruby) • What it really means : Matz likes Smalltalk.
• How to navigate the hierarchy x.class # The class to which x belongs cls.superclass # The superclass of a class cls • Example: irb(main):025:0> x = 37 => 37 irb(main):026:0> x.class => Fixnum irb(main):027:0> Fixnum.superclass => Integer irb(main):028:0> Integer.superclass => Numeric irb(main):029:0> Numeric.superclass => Object irb(main):030:0>
To create an object, you first define a class class Circle def initialize(radius) @radius = radius end def area Math::PI * @radius ** 2 end def perimeter 2*Math::PI * @radius end end • A class is mainly just a collection of methods
variables are denoted by @varname class Circle def initialize(radius) @radius = radius end def area Math::PI * @radius ** 2 end def perimeter 2*Math::PI * @radius end end Instance variables • These variables are storing the data that is unique to each instance that is created
called when an object is created class Circle def initialize(radius) @radius = radius end def area Math::PI * @radius ** 2 end def perimeter 2*Math::PI * @radius end end • The name of this method is "special". Ruby expects initialization to use this specific name.
create instances, you use new c = Circle.new(4) d = Circle.new(9) ... • This calls initialize with the supplied argument c = Circle.new(4) class Circle def initialize(radius) @radius = radius end ...
call methods, you just need an instance c = Circle.new(4) d = Circle.new(9) ... puts c.area # Calls the area method on c puts d.area # Calls the area method on d puts c.perimeter • Inside methods, the instance variables (@vars) bind to the values stored in the instance.
a debugging aid, you can inspect objects c = Circle.new(4) ... puts c.inspect • Generates a string showing what's inside #<Circle:0x25170 @radius=4> • No way to directly access the internals however (more in a minute)
point : The set of instance variables on an object is not restricted or declared • Whenever a method assigns to @varname, that creates a new instance variable class Circle <Shape def initialize(radius) @radius = radius end ... def set_color(color) @color=color end end This "spontaneously" creates a new instance variable when called the first time
instance variables in Ruby are private • The only way to access is through methods class Circle def initialize(radius) @radius = radius end def radius @radius end def radius=(value) @radius=value end end Return the value Set the value
of using the accessor methods c = Circle.new(4) puts c.radius # Prints 4 c.radius=5 puts c.area # Prints 78.5398163397448 Both of these operations are actually method calls • Important point : There is never direct access to instance data in Ruby. It's always a method.
Instance variables and methods are separate • Plus, there is special syntax to distinguish instance variables from methods (@varname) • So, it does not matter that there is an instance variable called @radius and a method called radius.
may inherit from one other class class Shape def initialize @x = 0 @y = 0 end def move(dx,dy) @x += dx @y += dy end end class Circle <Shape ... end superclass
superclass is listed, Object is assumed class Shape ... end class Shape <Object ... end == • Ruby only supports single inheritance. • So, there is always just one superclass
Derived classes must initialize parents • This is done using "super" class Shape def initialize @x = 0 @y = 0 end end class Circle <Shape def initialize(radius) super() # Initialize parent @radius = radius end end
method, super is a special keyword that refers to the same method in the parent class (superclass) • This is used when you re-implement a method, but still want to call the original version within the new method
data always has to be accessed through methods, but defining those methods repeatedly gets tedious and annoying • Here is a shortcut class Shape attr_reader :x, :y def initialize @x = 0 @y = 0 end end This creates accessor methods for reading the values def x @x end def y @y end
attribute writers class Shape attr_reader :x, :y attr_writer :x, :y def initialize @x = 0 @y = 0 end end This creates accessor methods for writing the values def x=(value) @x=value end def y=(value) @y=value end • Note - :x is the Ruby syntax for a symbol
All access to an object occurs via methods • However, methods that take no arguments also look like data "attributes" class Circle <Shape ... def radius @radius end def area Math::PI*@radius **2 end def perimeter 2*Math::PI*@radius end end c = Circle.new(4) puts c.radius puts c.area puts c.perimeter ... Notice how the access is very "uniform"
is part of the "public" interface of an object that's presented to a user • It has nothing to do with the internal state stored by an object (instance data) • Example : Certain attributes are stored (the radius), but others are computed (the area) • This concept of hiding internals behind methods is very big in OO-programming
are also "objects" in Ruby • This is subtle, but a class can have its own internal variables (like instance data) class Shape @@ncreated = 0 # Class variable def initialize @x = 0 @y = 0 @@ncreated += 1 end end • A class variable is shared by all instances (but there is just one copy of the variable)
variables are not part of instances irb(main):002:0> s = Shape.new => #<Shape:0x625b0 @y=0, @x=0> irb(main):003:0> • Nor are they readable... (They're private too) You do not see a reference to @@ncreated here irb(main):002:0> Shape.ncreated NoMethodError: undefined method 'ncreated' for Shape:Class irb(main):003:0>
can be defined for the class itself class Shape @@ncreated = 0 ... def Shape.ncreated @@ncreated end end Prefix with the class name to define a class method • To use the method, apply it to the class irb(main):006:0> s = Shape.new => #<Shape:0x58cf4 @x=0, @y=0> irb(main):007:0> Shape.ncreated => 1 irb(main):008:0>
methods only operate on classes, not instances of objects defined by a class • This is somewhat subtle irb(main):006:0> s = Shape.new => #<Shape:0x58cf4 @x=0, @y=0> irb(main):007:0> Shape.ncreated => 1 irb(main):008:0> s.ncreated NoMethodError: undefined method `ncreated' for #<Shape:0x58cf4 @x=0, @y=0> from (irb):8 irb(main):009:0>
are a "deep concept" • When you define a class, you're actually defining two different kinds of objects • Instances of the class • The class itself • Although they're related, these objects are distinct from each other and handled separately (more in a minute)
are "open" in Ruby • After a class has been defined, you can later open it and add new methods to it • Repeated use of class merely extends the previous definition with new methods
Here's a class class Circle <Shape def initialize(radius) @radius = radius end ... end • And some code that extends it c = Circle.new(4) # Create a circle class Circle # Add a new method def holler puts "I'm a happy shiny circle" end end c.holler # Print 'I'm a happy ...'
affect instances already created! c = Circle.new(4) d = Circle.new(5) puts c.area # Prints 50.2654824574367 puts d.area # Prints 78.5398163397448 class Circle def area (4/(5/4.0))*@radius**2 end end puts c.area # Prints 51.2 puts d.area # Prints 80.0
new methods for a single instance c = Circle.new(4) d = Circle.new(4) # Circle c is moving to Indiana. Fix it class <<c def area 4/(5/4.0)*@radius**2 end end puts c.area # Prints 51.2 puts d.area # Prints 50.2654824574367
a "module" mechanism module MoveLRUD # Methods for moving def left(dx) # left,right,up,down move(-dx,0) end def right(dx) move(dx,0) end def up(dy) move(0,-dy) end def down(dy) move(0,dy) end end • A module is a namespace • It can contain instance methods/class methods
a collection of methods, but they do not define any kind of class or instance • Which makes them rather odd creatures irb(main):010:0> MoveLRUD.left(3) NoMethodError: undefined method `left' for MoveLRUD:Module from (irb):10 irb(main):011:0> • If you define instance methods in a module, there's no obvious way to use them
can be included into a class class Shape include MoveLRUD # Include as a mixin def move(dx,dy) @x += dx @y += dy end end • This takes all of the methods in the module and makes them part of the class as if they were defined there • It "mixes in" the other methods
can also be mixed into an instance a = Shape.new b = Shape.new b.extend(MoveLRUD) b.left(4) b.up(3) a.left(4) # Error. Left not defined • Here, the module methods only work on the specific instance that was extended
mixins, you implement common functionality in one place (a module) • You then include it in a variety of different places over and over again to reuse it • Note: This is a slightly different concept than "inheritance"
are normally public, meaning anyone can call them • Can also have protected and private methods class Foo private def bar ... end protected def spam ... end end
• Method can be called by any method of the defining class or subclasses • May be invoked on other instances • Private • Method can only be called by methods in the same class • And only on the current object
ivars class class Circle <Shape def initialize(radius) @radius = radius end end c = Circle.new(4) d = Circle.new(5) flags ivars class c d { 'radius' => 4, ... } { 'radius' => 5, ... } Circle All instances link back to their class
class is an object with additional information • A reference to the superclass • A list of methods flags ivars class super methods • Important : A class is also an object
... def area ... end def perimeter ... end ... end { } Circle flags ivars class super methods { 'area'=> method, 'perimeter' => method, ... } Shape flags ivars class super methods class Shape @@ncreated = 0 def move ... end end { 'ncreated' => 0 } { 'move' => method, ... } Object
are linked to classes • Classes are linked to the superclass • This is the key to knowing how methods get dispatched to the appropriate definition • Essentially you just follow those links
class super methods { 'area'=> method, 'perimeter' => method, } Shape flags ivars class super methods { 'move' => method } Object c = Circle.new(4) c.area c.move Every method call involves a search of the class and all base classes (just follow super)
ivars class flags ivars class super methods Circle c { 'area'=> meth, 'perimeter' => } Now, let's extend that instance by redefining area c = Circle.new(4) class <<c def area 4/(5/4.0)*@radius**2 end end The new area method needs to be inserted here
ivars class flags ivars class super methods c = Circle.new(4) class <<c def area 4/(5/4.0)*@radius**2 end end flags ivars class super methods Circle <virtual> c { 'area'=> meth } { 'area'=> meth, 'perimeter' => } A "virtual" anonymous class gets inserted into the class chain for c V
bar puts 'Foo.bar' end end First define a module A Module is just a collection of methods flags ivars class super methods Foo (Module) { 'bar'=> meth, }
super methods module Foo def bar puts 'Foo.bar' end end class Circle <Shape ... end Circle flags ivars class super methods Foo (Module) flags ivars class super methods Shape Now, start defining a class { 'bar'=> meth, }
super methods module Foo def bar puts 'Foo.bar' end end class Circle <Shape include Foo ... end Circle flags ivars class super methods Foo (Module) flags ivars class super methods Shape Include a module as a mixin { 'bar'=> meth, } The functionality of Foo needs to be added to Circle somehow
super methods module Foo def bar puts 'Foo.bar' end end class Circle <Shape include Foo ... end Circle flags ivars class super methods Foo (Module) flags ivars class super methods Shape Again, an anonymous class gets inserted into the class chain { 'bar'=> meth, } flags ivars class super methods Mixin Proxy
Deep thought : A class is also an object • If so, it must belong to some class! • It does - check it out class Circle ... end puts Circle.class # Prints 'Class' • The output says that "Circle" is a "Class"
class super methods • Here is a picture Circle flags ivars class super methods Class Circle is a Class • Notice the parallel to instances flags ivars class super methods Circle flags ivars class c = Circle.new(r) c is a Circle
What is this "Class"? • Well, a "Class" is just another object • Which is, well, also a Class irb(main):001:0> puts Class.class Class => nil irb(main):002:0> • Huh?!?!?!
flags ivars class super methods • Just sketch it out... Circle flags ivars class super methods Class Circle is a Class Class is a Class • Clearly, there's something going on here • Let's take a look at the superclasses...
Inspect the superclass irb(main):001:0> puts Class.class Class => nil irb(main):002:0> puts Class.superclass Module => nil • Whoa, a class is a kind of Module • And a Module is a namespace • And in the last lecture we saw how you could implement objects using namespaces
ivars class super methods Circle flags ivars class super methods Class Circle is a Class Class is a Class flags ivars class super methods Module Classes are implemented on top of Modules
Inspect the class and superclass irb(main):001:0> puts Module.class Class => nil irb(main):002:0> puts Module.superclass Object => nil • A module is an object like everything else • There is a class (Module) • Module inherits from Object
Circle flags ivars class super methods Class flags ivars class super methods Module flags ivars class super methods Object Here, you are seeing inheritance, but it's all about classes. 1. A Circle is a Class 2. A Class is a Module 3. A Module is an Object
• A look at "Object" irb(main):001:0> puts Object.class Class => nil irb(main):002:0> puts Object.superclass nil • An Object is described by a Class • There are no parents to an Object • That's the end of the line...
Circle flags ivars class super methods Class flags ivars class super methods Module flags ivars class super methods Object nil Object has no parents. So, it terminates the chain of linked objects
Circle flags ivars class super methods Class flags ivars class super methods Module flags ivars class super methods Object flags ivars class super methods Shape Let's add in some other classes nil
it? • There are always two different paths • The "instance" path • The "class" path • Instance path : Instance methods • Class path : Class methods • The choice depends on the starting object
Circle flags ivars class super methods Object flags ivars class super methods Shape nil The Instance Path # c is a Circle instance c.area Here, you start with an instance of Circle. flags ivars class c
Circle flags ivars class super methods Class flags ivars class super methods Module flags ivars class super methods Object nil The Class Path c = Circle.new(4) Here, you start with the class itself (Circle).
how the search process is highly uniform • In fact, there is a simple algorithm obj.meth 1. Follow the class link of obj 2. Look in the method table 3. If not found, follow the super link 4. Repeat steps 2-3 until you find the method
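The lookup algorithm above can be modeled in a few lines. This is only a sketch in Python 3 of the linked structures drawn in the diagrams, not how the Ruby interpreter is actually written; `Klass`, `Instance`, `find_method`, and `send` are hypothetical names.

```python
class Klass:
    """Stand-in for a class record: a method table plus a super link."""
    def __init__(self, name, superclass=None, methods=None):
        self.name = name
        self.superclass = superclass
        self.methods = dict(methods or {})

class Instance:
    """Stand-in for an instance: instance variables plus a class link."""
    def __init__(self, klass, **ivars):
        self.klass = klass
        self.ivars = ivars

def find_method(obj, name):
    klass = obj.klass                      # 1. follow the class link of obj
    while klass is not None:
        if name in klass.methods:          # 2. look in the method table
            return klass.methods[name]
        klass = klass.superclass           # 3. not found: follow the super link
    raise AttributeError(name)             # ran off the end of the chain

def send(obj, name, *args):
    return find_method(obj, name)(obj, *args)

def move(obj, dx, dy):
    obj.ivars["x"] += dx
    obj.ivars["y"] += dy
    return (obj.ivars["x"], obj.ivars["y"])

Object = Klass("Object")
Shape = Klass("Shape", Object, {"move": move})
Circle = Klass("Circle", Shape,
               {"area": lambda obj: 3.14159 * obj.ivars["radius"] ** 2})

c = Instance(Circle, radius=4, x=0, y=0)

# Extending one instance: splice an anonymous class in front of Circle.
d = Instance(Circle, radius=4, x=0, y=0)
d.klass = Klass("<virtual>", Circle,
                {"area": lambda obj: 3.2 * obj.ivars["radius"] ** 2})
```

Here `send(c, "move", 2, 3)` finds `move` one link up in `Shape`, while `send(d, "area")` stops at the spliced-in virtual class before it ever reaches `Circle`.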
Class methods • Recall that class methods are methods that operate on classes, not instances class Shape <Object @@ncreated = 0 def Shape.ncreated # A class method @@ncreated end end • These methods live along the class chain
0 def Shape.ncreated @@ncreated end end flags ivars class super methods Shape flags ivars class super methods Shape' (virtual) flags ivars class super methods Class { 'ncreated'=> meth } flags ivars class super methods Object Class methods live in a separate anonymous class that's inserted into the class chain (sometimes known as a "metaclass")
are the key points • Everything is an object • All objects are described by a class • Classes are objects • Everything is linked together in a big tree/graph
has always had "objects", but OOP was never the overriding design philosophy • In fact, user-defined classes were one of the last features added to the language • Recall that one motivation for Ruby was to address a perceived problem with Python OO 92
• Objects in Python are organized in a hierarchy object int • object is at the top • However, the hierarchy is relatively flat • Example: Don't see numbers grouped together under a class "Numeric" float str list dict
objects have a "type" >>> x = 37 >>> x.__class__ <type 'int'> >>> • The type is the class to which an object belongs • Finding the parent class (superclass) >>> int.__bases__ (<type 'object'>,) >>>
a new user-defined object class Circle(object): def __init__(self,radius): self.radius = radius def area(self): return math.pi*(self.radius**2) def perimeter(self): return 2*math.pi*self.radius • A class is a collection of functions (methods) • Nothing conceptually new here 96
serves as a "factory" >>> c = Circle(4.0) >>> c.area() 50.26548245743669 >>> c.radius 4.0 >>> 97 • Note : You don't call a special method like "new", you just use the class as a function
initialize objects • Called whenever a new object is created >>> c = Circle(4.0) class Circle(object): def __init__(self,radius): self.radius = radius newly created object • __init__ is example of a "special method" • Has special meaning to Python interpreter 98
within the object class Circle(object): def __init__(self,radius): self.radius = radius • Outside class, just access through the instance name • Inside class, referenced using self.attrname def area(self): return math.pi*(self.radius**2) >>> c = Circle(4.0) >>> c.radius 4.0 99
instances of an object class Circle(object): ... def area(self): return math.pi*self.radius**2 • By convention, called "self" • The object is always passed as first argument >>> c.area() def area(self): ... The name is unimportant---the object is always passed as the first argument. It is simply Python programming style to call this argument "self." C++ programmers might prefer to call it "this." 100
seeing some huge differences from the Ruby object system • All instance variables are fully visible >>> c = Circle(4) >>> c.radius 4 >>> c.radius = 5 >>> 101 • Explicit use of "self" to refer to the instance def area(self): return math.pi*self.radius**2
Python may have multiple bases class Foo(object): ... class Bar(object): ... class Spam(Foo,Bar): ... • We have not seen this before • Not allowed in Ruby, Smalltalk, Java, etc. • There are some nasty issues (later) 104
have variables. Just define in the class definition class Shape(object): numcreated = 0 # class variable def __init__(self): Shape.numcreated += 1 self.x = 0.0 self.y = 0.0 >>> Shape.numcreated 0 >>> s = Shape() >>> Shape.numcreated 1 >>> 105
"decoration" class Shape(object): @classmethod def spam(cls): print "Hello. Your class is", cls class Circle(Shape): ... >>> Circle.spam() Hello. Your class is <class '__main__.Circle'> >>> Shape.spam() Hello. Your class is <class '__main__.Shape'> >>> 106 • Class methods receive the class itself as the first argument (classes are also objects)
and instance methods/variables are co-mingled class Shape(object): @classmethod def spam(cls): print "Hello. Your class is", cls class Circle(Shape): ... >>> c = Circle(4) >>> c.spam() Hello. Your class is <class '__main__.Circle'> >>> c.numcreated 1 >>> 107 • Class methods can be invoked via instance
quite "open" in the same way as they are in Ruby • If the same class definition appears more than once, the new definition replaces the old definition (but it does not affect existing instances) 108
Class redefinition 109 class Foo(object): def bar(self): print "Hello World" a = Foo() class Foo(object): def bar(self): print "Hello Cruel World" b = Foo() a.bar() # Prints "Hello World" b.bar() # Prints "Hello Cruel World"
add new methods to an existing class by just defining them outside and attaching them to the class object 110 class Circle(object): def __init__(self,radius): self.radius = radius def area(c): return math.pi*c.radius**2 Circle.area = area
it or not, that is about the extent of defining and using classes in Python • A class is just a bunch of functions • Each function is normally an instance method that receives the instance as the first parameter (self) • Functions can optionally be defined as class methods that receive the class as the first parameter
features are available, but they just take a different form • Example : Mixins using multiple inheritance 113 class Shape(object): def move(self,dx,dy): self.x += dx self.y += dy class MoveLRUD(object): def left(self,dx): self.move(-dx,0) def right(self,dx): self.move(dx,0) def up(self,dy): self.move(0,-dy) def down(self,dy): self.move(0,dy) class Circle(Shape,MoveLRUD): ...
Python view on objects... • An instance is just a collection of stuff • A class is just a collection of stuff • A dictionary is just a collection of stuff • Hey, I'll just use that! 114
of objects is mainly just a wrapper layer • Objects and classes are just wrappers around dictionaries • And methods are just wrappers around ordinary functions • Let's go take a look... 115
definition class Circle(Shape): def __init__(self,radius): Shape.__init__(self) self.radius = radius def area(self): return math.pi*self.radius**2 def perimeter(self): return 2*math.pi*self.radius • A class creates a special kind of object >>> Circle <class '__main__.Circle'> >>> • What is this object? 116 A Class Object
hold a reference to their class >>> c = Circle(4) >>> c.__dict__ {'x': 0,'radius': 4,'y': 0 } >>> c.__class__ <class '__main__.Circle'> >>> • __class__ attribute refers to class object 120
class A(B,C): ... • Classes may inherit from other classes • Bases stored as a tuple in class object >>> A.__bases__ (<class '__main__.B'>,<class '__main__.C'>) >>> • __bases__ is tuple of base class objects 121
Everything stored in dictionaries • Instances have dicts (instance data) • Classes have dicts (methods, class attributes) • Instances, classes, bases are linked together (__class__, __bases__) 123
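The three links named above (`__dict__`, `__class__`, `__bases__`) can be inspected directly. A minimal sketch in Python 3 syntax, using a cut-down `Shape`/`Circle` pair and an approximate value of pi to stay self-contained:

```python
class Shape:
    def __init__(self):
        self.x = 0
        self.y = 0

class Circle(Shape):
    def __init__(self, radius):
        Shape.__init__(self)
        self.radius = radius

    def area(self):
        return 3.14159 * self.radius ** 2

c = Circle(4)
instance_data = c.__dict__        # instance dictionary: x, y, radius
class_members = Circle.__dict__   # class dictionary: __init__, area, ...
cls = c.__class__                 # link from the instance to its class
bases = Circle.__bases__          # link from the class to its base classes
```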
special operators for getting, setting, and deleting "attributes" 124 obj.name # Get an attribute value obj.name = value # Set an attribute value del obj.name # Delete an attribute • To finish off the object system, you just have to define the behavior of these operators • Connect it up to all of those dictionaries
attribute overrides any attributes set in class or bases >>> c = Circle(4) >>> c.area() 50.2654824574367 >>> c.area = "pretty big" >>> c.area() Traceback (most recent call last): File "<stdin>", line 1, in ? TypeError: 'str' object is not callable >>> c.area 'pretty big' >>> • One way to create a mighty kerfuffle 126
in local __dict__ • If not found, look in __dict__ of class >>> c = Circle(...) >>> c.radius 4 >>> c.area() 50.2654824574 >>> c .__dict__ .__class__ {'x' : 0, 'radius' : 4, ...} Circle .__dict__ {'area': <func>, 'perimeter':<func>, '__init__':..} • If not found in class, look in base classes .__bases__ look in __bases__ 1 2 3 128
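The three lookup steps in the picture can be written out by hand. This is a simplified sketch in Python 3, assuming single inheritance and ignoring descriptors and `__getattr__`; the function name `lookup` is hypothetical, and it walks `__mro__` as a stand-in for "look in `__bases__`".

```python
import math

class Shape:
    def __init__(self):
        self.x = 0
        self.y = 0

    def move(self, dx, dy):
        self.x += dx
        self.y += dy

class Circle(Shape):
    def __init__(self, radius):
        Shape.__init__(self)
        self.radius = radius

    def area(self):
        return math.pi * self.radius ** 2

def lookup(obj, name):
    # 1. look in the instance's own dictionary
    if name in obj.__dict__:
        return obj.__dict__[name]
    # 2 and 3. look in the class dictionary, then in the base classes
    for cls in type(obj).__mro__:
        if name in cls.__dict__:
            return cls.__dict__[name]
    raise AttributeError(name)

c = Circle(4)
```

What comes back for `area` is the plain function stored in the class dictionary, with no wrapping applied.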
uses a single dictionary to store everything associated with a class • This dictionary contains both data (class variables) and methods • So, you can't have instance data and methods with the same names (they'll conflict) • However, there is a tricky bit with all of this... 129
look up data, you get the data • If you look up a method, it's different • You don't get the method function! >>> c = Circle(4) >>> c.radius 4 >>> c.area <bound method Circle.area of <__main__.Circle object at 0x6cb50>> >>> • What in the heck is that?
get "wrapped" • The returned object is a method that's waiting for you to call it... 131 >>> c = Circle(4) >>> a = c.area >>> a <bound method Circle.area of <__main__.Circle object at 0x6cb50>> >>> a() 50.2654824574 >>> Calls the method The method itself as an object
don't see it, but method calls are always a two-step process like this 132 c.area() <bound method : area> . operator - attribute lookup () operator - call 50.2654824574 • Essentially, looking up a method is separate from calling the method
covers 133 >>> c = Circle(4) >>> a = c.area >>> a <bound method Circle.area of <__main__.Circle object at 0x6cb50>> >>> a.im_class <class '__main__.Circle'> >>> a.im_func <function area at 0x69370> >>> a.im_self <__main__.Circle object at 0x6cb50> >>> • What happens on call? >>> a.im_func(a.im_self) 50.2654824574 >>> The class The func Instance
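The session above uses the Python 2 attribute names. In Python 3 the same pieces are exposed as `__func__` and `__self__` (there is no `im_class`). A minimal sketch of the two-step call, with variable names chosen for illustration:

```python
import math

class Circle:
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return math.pi * self.radius ** 2

c = Circle(4)
a = c.area                  # attribute lookup only; nothing is called yet
func = a.__func__           # the plain function stored in the class (im_func in Python 2)
inst = a.__self__           # the instance the method is bound to (im_self in Python 2)
result = func(inst)         # what a() does under the covers
```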
that certain items pop out of a class with a wrapper slapped onto it should be somewhat disturbing • Who or what is doing this wrapping? • How does it fit into the rest of the object system? 134
is performed by defining "descriptor" objects • A descriptor is an object that hooks into the attribute access on classes in Python • Allows customized actions to be defined 135
Object class Descriptor(object): def __get__(self,instance,cls): print "get", instance,cls def __set__(self,instance,value): print "set", instance, value def __delete__(self,instance): print "delete", instance • Placing it into a class definition class Foo(object): bar = Descriptor() ... • It just has to have __get__,__set__, etc.
put together def bar_impl(self): print "I'm an instance method bar" class Foo(object): bar = InstanceMethodDescriptor(bar_impl) 139 • Example use: >>> f = Foo() >>> f.bar <__main__.BoundMethod object at 0x6cbb0> >>> f.bar() I'm an instance method bar >>>
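The names `InstanceMethodDescriptor` and `BoundMethod` appear in the example above, but their definitions are not part of this section. The bodies below are an assumption about what such classes could look like, written in Python 3 as a sketch of the wrapping idea rather than as Python's real implementation.

```python
class BoundMethod:
    """Pairs a function with an instance and waits to be called."""
    def __init__(self, func, instance):
        self.func = func
        self.instance = instance

    def __call__(self, *args, **kwargs):
        # Supply the instance as the first argument (self)
        return self.func(self.instance, *args, **kwargs)

class InstanceMethodDescriptor:
    """Wraps a function whenever it is looked up through an instance."""
    def __init__(self, func):
        self.func = func

    def __get__(self, instance, cls):
        if instance is None:          # looked up on the class itself
            return self.func
        return BoundMethod(self.func, instance)

def bar_impl(self):
    return "I'm an instance method bar"

class Foo:
    bar = InstanceMethodDescriptor(bar_impl)

f = Foo()
m = f.bar                             # triggers __get__, returns a BoundMethod
```

Calling `m()` then runs `bar_impl(f)`, which is the same lookup-then-call sequence that real bound methods go through.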
makes sense, you're at 11 with Python • Much of this tucked away behind the scenes • It's critical to how Python works • But unknown to most Python programmers 140
class B(object): pass class C(A,B): pass • Base tuple contains multiple entries object A B C • For example: >>> C.__bases__ (<class '__main__.A'>, <class '__main__.B'>) >>> 141
looks in base classes • However, complex hierarchies make this much more tricky class A(object): def bar(self): pass def spam(self): pass class B(object): def spam(self): pass class C(A,B): pass >>> c = C() >>> c.spam() # Which spam()??? 142
• Class is always checked first • Then bases are checked in order listed class A(object): ... class B(object): ... class C(A,B): ... >>> c = C() >>> c.spam() 143 Search order: C, A, B
class B(object): pass class C(A,B): pass class D(B): pass class E(C,D): pass • Consider a more complex hierarchy object A B C D E • What happens here? >>> e = E() >>> e.x # Attribute access 144
is based on a sort of bases object A B C D E • Search rules >>> e = E() >>> e.x • Can all of these be satisfied? Check E first C before D : class E(C,D) C before A : class C(A,B) C before B : class C(A,B) D before B : class D(B) A before B : class C(A,B) object last • Answer: Yes. E,C, A, D, B, object 145
order (MRO) • __mro__ attribute contains order in which classes are searched >>> E.__mro__ (<class '__main__.E'>, <class '__main__.C'>, <class '__main__.A'>, <class '__main__.D'>, <class '__main__.B'>, <type 'object'>) • Determination of MRO is rather complex • Beyond scope of this talk • "C3 Linearization Algorithm" 146
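For the curious, the merge step behind the MRO is short enough to sketch. This is a simplified Python 3 version of C3 linearization for illustration, not the interpreter's own code; `c3_merge` and `c3_linearize` are hypothetical names.

```python
def c3_merge(sequences):
    sequences = [list(s) for s in sequences if s]
    result = []
    while sequences:
        for seq in sequences:
            head = seq[0]
            # A head is acceptable only if it is in no sequence's tail
            if not any(head in other[1:] for other in sequences):
                break
        else:
            raise TypeError("Cannot create a consistent method resolution order")
        result.append(head)
        # Remove the chosen class everywhere and drop exhausted sequences
        sequences = [[c for c in seq if c is not head] for seq in sequences]
        sequences = [seq for seq in sequences if seq]
    return result

def c3_linearize(cls):
    bases = list(cls.__bases__)
    return [cls] + c3_merge([c3_linearize(b) for b in bases] + [bases])

class A(object): pass
class B(object): pass
class C(A, B): pass
class D(B): pass
class E(C, D): pass

order = [cls.__name__ for cls in c3_linearize(E)]
```

For this hierarchy the sketch agrees with `E.__mro__`. Feeding it an inconsistent set of constraints (B before C, while C lists B as a base) makes the merge run out of acceptable heads and raise `TypeError`.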
classes that are rejected! • Example: class A(object): pass class B(object): pass class C(A,B): pass class D(B,C): pass Traceback (most recent call last): File "<stdin>", line 1, in ? TypeError: Error when calling the metaclass bases Cannot create a consistent method resolution order (MRO) for bases B, C object A B C D • Reason: class D(B,C) --> B before C class C(A,B) --> C before B (B is base of C) 147
are much more exposed • No notion of private data • Implementation is completely visible • Again, it's really just a layer that's been wrapped around dictionaries • However, over time, various "tweaks" have shown up in the language 148
with leading __ is "private" class Foo(object): def __init__(self): self.__x = 0 • Example >>> f = Foo() >>> f.__x AttributeError: 'Foo' object has no attribute '__x' >>> • This is really just a name mangling trick >>> f = Foo() >>> f._Foo__x 0 >>> 149
restrict the set of attribute names class Foo(object): __slots__ = ['x','y'] ... • Produces errors for other attributes >>> f = Foo() >>> f.x = 3 >>> f.y = 20 >>> f.z = 1 Traceback (most recent call last): File "<stdin>", line 1, in ? AttributeError: 'Foo' object has no attribute 'z' • Prevents errors, restricts usage of objects 150
with some accessor funcs class Foo(object): def __init__(self,name): self.__name = name def getName(self): return self.__name def setName(self,name): if not isinstance(name,str): raise TypeError, "Expected a string" self.__name = name • Property maps accessor funcs to attribute class Foo(object): ... name = property(getName,setName) ... 151
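Put together, the property example might look like this. A minimal sketch in Python 3 syntax (so `raise TypeError("...")` replaces the Python 2 `raise TypeError, "..."` form); the sample names passed in are made up.

```python
class Foo:
    def __init__(self, name):
        self.__name = name            # stored as _Foo__name via name mangling

    def getName(self):
        return self.__name

    def setName(self, name):
        if not isinstance(name, str):
            raise TypeError("Expected a string")
        self.__name = name

    # Map attribute-style access onto the accessor functions
    name = property(getName, setName)

f = Foo("Guido")
f.name = "Dave"                       # actually calls setName
current = f.name                      # actually calls getName
```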
classes may redefine attribute access entirely • In other words, you can redefine (.) • Set of special methods for setting, deleting, and getting attributes 153
every time an attribute is read • Default behavior looks at instance dict • Then it checks the class dict • Then it checks base classes (inheritance) • If that fails, __getattr__(self,name) method is invoked 154
Ruby, Python also has the concept of classes as objects • However, there is a huge twist to it • Python lets you redefine what a class is! • Let's go take a look...
a class, you get an "object" 159 class Circle(Shape): def __init__(self,radius): Shape.__init__(self) self.radius = radius def area(self): return math.pi*self.radius**2 def perimeter(self): return 2*math.pi*self.radius >>> Circle <class '__main__.Circle'> • A "class object"
are instances of "types" >>> class Circle(Shape): pass >>> type(Circle) <type 'type'> >>> isinstance(Circle,type) True >>> Recall: type() tells you the type of an object. Here we're using it on a class itself. 160 • Here, Python is following the convention that you see in C++/Java • Classes define types
definitions create new types. • However, a type is just a class class type(object): def __init__(self, *args, **kwargs): ... >>> type <type 'type'> >>> 161 • It's a class that creates new "types" • This is something known as a "metaclass"
Body is exec'd in its own dictionary __dict__ = { } exec body in globals(), __dict__ • The statements in the body execute • Afterwards, __dict__ is populated >>> __dict__ {'__init__' : <function __init__ at 0x4da10>, 'area' : <function area at 0x4dd70>, 'perimeter': <function perimeter at 0x4dea0>,} >>> 165
Class is constructed from its name, base classes, and the dictionary >>> Circle = type("Circle",(Shape,),__dict__) >>> Circle <class '__main__.Circle'> >>> c = Circle(4) >>> c.area() 50.2654824574 >>> • type(name, bases, dict) constructs a class object 166
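The two steps (execute the body into a dictionary, then hand it to `type`) can be reproduced by hand. A sketch in Python 3 syntax, where `exec` is a function rather than a statement; it assumes `object` as the only base instead of `Shape` so that it stands alone, and `body` and `namespace` are illustrative names.

```python
import math

body = '''
def __init__(self, radius):
    self.radius = radius

def area(self):
    return math.pi * self.radius ** 2
'''

namespace = {}
exec(body, {"math": math}, namespace)            # 1. run the class body in its own dict
Circle = type("Circle", (object,), namespace)    # 2. build the class from name, bases, dict
c = Circle(4)
```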
provides a hook that allows you to intercept the class creation step • Using this, you can feed the "class" into something other than "type" • In other words, you could come up with something very different than a normal class 167
• Sets the metaclass that's used for construction • May be a class attribute or a global variable class Foo: __metaclass__ = type def bar(self): print "Foo.bar" 168 __metaclass__ = type class Foo: ... class Bar: ...
you inherit from type and redefine __new__ class mytype(type): def __new__(cls,name,bases,__dict__): print "Creating class : ", name print "Base classes : ", bases print "Attributes : ", __dict__.keys() return type.__new__(cls,name,bases,__dict__) 170 • Then you define objects that hook to it class myobject: __metaclass__ = mytype
provides class name, base classes, and dictionary prior to class creation • Can inspect this information • Can modify this information • If you know what you are doing, can be used for a variety of useful/diabolical purposes 171
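For readers on Python 3, the same hook is spelled differently: the `__metaclass__` attribute is gone and the metaclass is passed as a keyword in the class statement. The sketch below is one possible illustration of "inspect and modify prior to class creation"; the `created` list and the `registered` attribute are invented for the example.

```python
created = []   # hypothetical log of every class the metaclass has seen

class mytype(type):
    def __new__(cls, name, bases, namespace):
        created.append(name)               # inspect the information
        namespace["registered"] = True     # modify it before the class exists
        return type.__new__(cls, name, bases, namespace)

class myobject(metaclass=mytype):
    pass

class Circle(myobject):                    # subclasses inherit the metaclass
    def area(self):
        return 0
```

Every class derived from `myobject` passes through `mytype.__new__`, which is how frameworks typically use this feature to register or rewrite classes.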
the most advanced and misunderstood part of Python • However, used widely by framework developers • Can be used to perform very interesting things with objects • Where do you go after reaching level 11? Metaclasses. 172
of objects is based on a simple idea (just use dictionaries) • However, there are some subtle complications of that approach. • The co-mingling of data and functions means that you have to play some games with wrappers (descriptors) to get it to work • Otherwise, it's similar to what we saw before 173
support for OO programming • It is generally acknowledged that the whole idea for it was taken straight out of Python. • Guido (Python) and Larry Wall (Perl) had previously interacted at conferences • And Perl already had a dictionary type (Hash)
has to store instance variables • Let's just put them in a hash sub new { my $radius = shift; my $instance_data = { "radius" => $radius }; return $instance_data; } 177 • Hey, Python used a dictionary after all...
that use the hash sub area { my $self = shift; return $PI*$self->{'radius'}**2; } sub perimeter { my $self = shift; return 2*$PI*$self->{'radius'}; } • These are just normal functions my $c = new(4); print(area($c),"\n");
which are a namespace package Circle; sub new { my $radius = shift; my $instance_data = { "radius" => $radius }; return $instance_data; } sub area { my $self = shift; return $PI*$self->{'radius'}**2; } sub perimeter { my $self = shift; return 2*$PI*$self->{'radius'}; }
real close... $c = Circle::new(4); print(Circle::area($c),"\n"); ... 180 • All of the methods are packaged together • Similar to a class • Recall from last lecture : This was one way that classes came about
"bless" data into a package 181 package Circle; sub new { my $radius = shift; my $instance_data = { "radius" => $radius }; bless $instance_data,"Circle"; return $instance_data; } • This sets an attribute on the hash to point to the package name supplied • Aha! So that's a link to the class (the package)
more OO-syntax: 183 $c = Circle->new(4); print($c->area(),"\n"); print($c->perimeter(),"\n"); • Requires a slight change to the new function sub new { my ($pkg,$radius) = @_; # Get name and argument my $instance_data = { "radius" => $radius }; bless $instance_data, $pkg; return $instance_data; }
few more details • Basically, you're just linking hash tables and packages together • Hash table is the instance data • Package is the class • Variables in the package set up inheritance 187
system is actually more flexible than you might imagine • For example, you don't technically have to use a Hash object to store data • You can implement objects in different ways • Many other customization features 188
really have an OO system based on classes per se. • Instead, it just merges arrays and objects together • Essentially : An associative array is an object. An object is an associative array. 190
some instance data 191 var c = { 'radius' : 4, 'x' : 0, 'y' : 0 }; • Once you've done that, you can access the data in two different ways document.writeln(c['radius']); document.writeln(c.radius); • The (.) operator is just an array lookup
c = { 'radius' : 4, 'x' : 0, 'y' : 0 }; c.area = function() { return 3.1415926*this.radius*this.radius; } • Just attach a function to an array • Call the method a = c.area(); • Inside the function, 'this' refers to the array
{ this.radius = radius; } c = new Circle(4); • Any function can be a "constructor" • If you call a function like this, 'this' is already set up to point to an empty array • You just place values into it
Circle(radius) { this.radius = radius; this.area = function() { return PI*this.radius*this.radius; } this.perimeter = function() { return 2*PI*this.radius; } } c = new Circle(4); a = c.area(); p = c.perimeter(); • Just write a function and define methods
problem with this approach • Methods get defined and stored in every single instance function Circle(radius) { this.radius = radius; this.area = function() { return PI*this.radius*this.radius; } this.perimeter = function() { return 2*PI*this.radius; } } • Needless to say, that isn't very efficient
{ this.radius = radius; } Circle.prototype.area = function() { return PI*this.radius*this.radius; } Circle.prototype.perimeter = function() { return 2*PI*this.radius; } • A function has a hidden "prototype" attached • A prototype is just another object (array)
new Circle(4); r = c.radius; // From array c a = c.area(); // From Circle.prototype • If a function has a prototype attached, a link to the prototype gets carried along with any object that gets created • Attribute lookup will go to the prototype if it can't be found in the array itself • Diagram: the object { 'radius' : 4 } links to Circle.prototype, which is { 'area' : <func>, 'perimeter' : <func> }
Prototypes look a lot like a class • Every object has its own data • But, each object is also linked to a prototype that can supply values as a fallback
doesn't really have classes, there is no "class-based" inheritance • However, you can play funny games with linking prototypes together • This gets rather ugly in a hurry
{ this.x = 0; this.y = 0; } Shape.prototype.move = function(dx,dy) { this.x += dx; this.y += dy; } function Circle(radius) { Shape.call(this); this.radius = radius; } Circle.prototype = new Shape(); delete Circle.prototype.x; delete Circle.prototype.y; Circle.prototype.constructor = Circle; Circle.prototype.area = function() { return PI*this.radius*this.radius; } We create a Shape and use it as the prototype. However, we have to patch it up a bit.
probably the logical extension of using hash tables/arrays to represent objects • It essentially just merges them together • Functions are set up to receive the array as "this" if they're invoked through an array
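The "array plus fallback lookup" idea can be mimicked with plain Python dictionaries. This is only a toy model of the lookup rule described above, not a description of how any JavaScript engine works; the `__proto__` key and the `lookup` and `call` helpers are names invented for the sketch.

```python
def lookup(obj, name):
    """Follow the prototype links until the name is found."""
    while obj is not None:
        if name in obj:
            return obj[name]
        obj = obj.get("__proto__")
    raise KeyError(name)

def call(obj, name, *args):
    """Invoke a function found via lookup, passing obj as 'this'."""
    return lookup(obj, name)(obj, *args)

PI = 3.1415926
circle_prototype = {
    "area": lambda this: PI * lookup(this, "radius") ** 2,
    "__proto__": None,
}
# The instance holds only its own data plus a link to the prototype
c = {"radius": 4, "__proto__": circle_prototype}
```

Here `lookup(c, "radius")` is answered by the instance itself, while `call(c, "area")` falls through to the prototype.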
have taken a very detailed tour of how objects work in a variety of languages • There were some common themes • Covered many subtle differences between implementations
be honest, most "serious" computing applications tend to be written in C, C++, Java, or some kind of "compiled" language • It's partly for performance (a C program may be 100x faster than an equivalent script) • Also for extra safety. The compiler has strict rules and performs all kinds of program checking (to catch errors before you run)
applications don't exist in total isolation • They always have to read/write data • The data may arrive in many ways (files, pipes, network, etc.) • And many possible formats
don't use one application for everything (well, let's exclude emacs) • You solve problems by using many different applications for different kinds of tasks • You use the best tool for the job • Much of day-to-day work is involved in simply moving data around between applications
picture, each component is a completely separate application • May be written in different languages • Developed completely independently • May be legacy code that can't be replaced
I started working with dynamic languages in 1995, I was a programmer on a large scientific computing project • About 80% of our time was spent futzing around with data files (moving them around, converting them, making them work with other programs, etc.) • It was a huge pain.
Very easy to develop and reconfigure • Can handle a huge variety of data formats • You're taking a problem which is inherently messy and solving with a language that is adept at solving messy problems.
probably the most basic form of handling data • Programs write files as output • You create files that serve as input • Let's talk about some basic concepts...
the lowest level, a file is a byte sequence • All operations concerning files are focused around manipulating that byte sequence (reading it, writing it, modifying it, etc.) • To the operating system, there is nothing particularly "special" about any given file • It's just a bunch of bytes...
To use a file, it must first be "opened" • Example: f = open("somefile.txt","r") # Open for read f = open("somefile.txt","w") # Open for write f = open("somefile.txt","a") # Open for append • This gives you an object with basic operations f.read(maxbytes) # Read N bytes f.write(text) # Write to a file f.close() # Close the file
The programming model for most languages is taken from low-level system calls (POSIX) open(filename,mode,flags) # Open a file read(fd,buffer,maxsize) # Read into a buffer write(fd,buffer,nbytes) # Write a buffer close(fd) # Close a file seek(fd,offset,origin) # Seek to a new position tell(fd) # Get file pointer • It might be cleaned up a bit, but usually it's not much different than this
open("foo.txt","r") mode : r flags : XX fp : 0 Operating System • Opening a file creates an OS data structure • The contents are not visible • Holds the state of the "file"
useful internal state is the file pointer • Keeps track of current file position f = open("foo.txt","r") data = f.read(10) data = f.read(15) read(10) read(15) mode : r flags : XX fp : 25 Operating System foo.txt
Manipulation of the file pointer >>> f = open("foo.txt","r") >>> f.seek(1024) # Set fp >>> data = f.read(76) >>> f.tell() 1100 >>> • It's exactly the same in most other languages
The same file can be open in more than one place at a time (even in the same program) • Each time you open a file, you get a new file object with a separate file pointer • Although each file is managed separately, they all operate on the same underlying data
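A small experiment makes the separate file pointers visible. This is a sketch in Python 3; the scratch file and its contents are made up for the example.

```python
import os
import tempfile

# Create a scratch file with known contents
fd, path = tempfile.mkstemp()
os.write(fd, b"0123456789abcdefghij")
os.close(fd)

f = open(path, "rb")
g = open(path, "rb")          # second, independent file object

first = f.read(10)            # advances f's file pointer only
f_pos, g_pos = f.tell(), g.tell()

g.seek(15)                    # set g's file pointer explicitly
tail = g.read()               # read from there to the end

f.close()
g.close()
os.remove(path)
```

Reading through `f` leaves `g` at position 0, even though both objects refer to the same underlying data.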
to a file are reflected everywhere • If a file is opened for reading and the file contents get modified behind the scenes, those changes will affect subsequent read operations • Basically, everything stays in sync. • Details are covered in an OS class
default, files are opened in text mode f = open(filename,"r") # Read, text mode f = open(filename,"w") # Write, text mode f = open(filename,"a") # Append, text mode • Text mode assumes line orientation • However, what is a line? some characters .......\n (Unix) some characters .......\r\n (Windows) some characters .......\r (Classic Mac) • This determination is made by the system
reading, system newline is converted back to the standard '\n' character >>> f = open("test.txt","r") >>> f.read() 'Hello World\n' >>> • Mostly, you don't have to worry about it • .... except if you do cross-platform work
• Example: Reading a Windows text file on Unix >>> f = open("test.txt","r") >>> f.readlines() ['Hello\r\n', 'World\r\n'] >>> • Here, you get that extra '\r' in the input • Which may break code not expecting it
data requires a different I/O mode f = open(filename,"rb") # Read, binary mode f = open(filename,"wb") # Write, binary mode f = open(filename,"ab") # Append, binary mode • Disables all newline translation (reads/writes) • Required for binary data on Windows • Optional, but supported on Unix (gotcha)
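The difference between the modes is easy to observe directly. One caveat: the slides describe Python 2, whereas in Python 3 (used in this sketch) text mode translates '\r\n' and '\r' to '\n' on input by default on every platform, and `newline=""` turns that translation off. The scratch file is made up for the example.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"Hello\r\nWorld\r\n")      # Windows-style line endings
os.close(fd)

with open(path, "rb") as f:              # binary mode: bytes exactly as stored
    raw = f.read()

with open(path, "r", newline="") as f:   # text mode, translation disabled
    untranslated = f.readlines()

with open(path, "r") as f:               # text mode, default translation
    translated = f.readlines()

os.remove(path)
```

The `untranslated` result matches what the slide shows for a Windows file read on Unix; the `translated` result is what Python 3 gives when no `newline` argument is supplied.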
business with text vs. binary is part of the operating system itself • All programming languages and applications on the system face the same issue • It's one reason why data is sometimes corrupted when transferred between systems (unintended newline expansion)
is a running program • Has its own dedicated resources • Memory, open files, net connections, etc. • Runs independently (own stack, PC, etc.) • Isolated from other processes • Closely associated with an "application" 32
create a new process • This is called a "subprocess" • The subprocess often runs under the control of the original process (which is known as the "parent" process) • Parent often wants to collect output or the status result of the subprocess 34
a subprocess, the parent typically has control over the following: • Command line arguments • Environment variables • Standard I/O streams • Signal handling 35
list of strings shell % foo.exe arg1 arg2 ... arg3 • In the target process, these show up in argv C/C++: int main(int argc, char *argv[]) { ... } Java: public static void main(String argv[]) { ... } Python: sys.argv Perl: @ARGV
of string values shell % setenv NAME VALUE shell % foo.exe • In the target process C/C++: char *value = getenv("NAME"); Python: value = os.environ['NAME'] Perl: $value = $ENV{'NAME'}
set of files (stdin, stdout, stderr) 38 shell % foo.exe >out.txt shell % foo.exe <in.txt shell % foo.exe | bar.exe subprocess stdin stdout stderr • In the shell, controlled via redirection/pipes • Parent process sets up these files for subprocess
can signal a subprocess 39 shell % kill -signo pid shell % subprocess • Examples : suspend, terminate, etc. • On Unix, this is the "kill" command parent signal • On Windows, support is weak/nonexistent
terminates, it returns a status • An integer code of some kind 40 C: exit(status); Java: System.exit(status); Python: raise SystemExit(status) • Convention is for 0 to indicate "success." Anything else is an error.
that subprocesses are almost entirely independent from the parent • The parent can set up the environment, send signals, and collect return codes, but otherwise has no control over what happens inside the subprocess. 41
`ls -l`; # Backticks. Perl/Ruby • Support for this varies • There may be some simple options 42 • This runs a shell command and captures the output. • However, it's lacking for a lot of other things • Will often see a process management module. Will illustrate for Python.
module for subprocesses • Cross-platform (Unix/Windows) • Tries to consolidate the functionality of a wide assortment of low-level system calls (system(), popen(), exec(), spawn(), etc.) • Will illustrate with some common use cases
to execute a simple shell command or run a separate program. You don't care about capturing its output. import subprocess p = subprocess.Popen(['mkdir','temp']) q = subprocess.Popen(['rm','-f','tempdata']) • Executes a command string • Returns a Popen object (more in a minute) 44
Popen() accepts a list of command args 45 • These are the same as the args in the shell shell % rm -f tempdata • Note: Each "argument" is a separate item subprocess.Popen(['rm','-f','tempdata']) # Good subprocess.Popen(['rm','-f tempdata']) # Bad Don't merge multiple arguments into a single string like this.
>>> • When launching a command, Popen() uses the setting of the PATH environment variable to search for the subprocess program 46 • Changes affect subsequent Popen() calls os.environ['PATH']="/mypath/bin:"+os.environ['PATH'] p = subprocess.Popen(["foo"])
'NAME1' : 'VALUE1', 'NAME2' : 'VALUE2', ... } p = subprocess.Popen(['cmd','arg1',...,'argn'], env=env_vars) • How to set up environment variables 47 • Note : If this is supplied and there is a PATH environment variable, it will be used to search for the command (Unix)
cwd='/some/directory') • If you need to change the working directory 48 • Note: This changes the working directory for the subprocess, but does not affect how Popen() searches for the command
subprocess.Popen(['cmd','arg1',...,'argn']) ... status = p.wait() • When you launch a subprocess, it runs independently from the parent • To wait and collect status, use wait() 49 • Status will be the integer return code (which is also stored) p.returncode # Exit status of subprocess
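The pieces covered so far (argument list, environment, exit status) can be tried together in one self-contained run. In this sketch the child is simply another Python interpreter launched through `sys.executable`; the `DEMO_NAME` variable and the exit code 3 are arbitrary choices for the example. Python 3 syntax is used.

```python
import os
import subprocess
import sys

# A tiny child program: echo one argument and one environment variable,
# then exit with a non-zero status
child_code = (
    "import os, sys\n"
    "print(sys.argv[1], os.environ['DEMO_NAME'])\n"
    "sys.exit(3)\n"
)

env_vars = dict(os.environ, DEMO_NAME="VALUE")   # parent's environment plus one
p = subprocess.Popen([sys.executable, "-c", child_code, "arg1"],
                     env=env_vars, stdout=subprocess.PIPE)
output = p.stdout.read()     # whatever the child printed
p.stdout.close()
status = p.wait()            # integer return code, also kept in p.returncode
```

Copying `os.environ` before adding to it keeps the rest of the parent's environment available to the child.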
subprocess.Popen(['cmd','arg1',...,'argn']) ... if p.poll() is None: # Process is still running else: status = p.returncode # Get the return code • poll() - Checks status of subprocess 50 • Returns None if the process is still running, otherwise the returncode is returned
subprocess.Popen(['cmd','arg1',...,'argn']) import os os.kill(p.pid,9) # Signal 9 is SIGKILL • A notable omission when these slides were written (the subprocess module provided no such functionality; Python 2.6 and later add Popen.terminate() and Popen.kill()) • On Unix, can use os.kill() • On Windows, a mess (many options) subprocess.Popen(['TASKKILL','/PID',str(p.pid),'/F']) import win32api win32api.TerminateProcess(int(p._handle),-1)
to execute another program and capture its output • Use additional options to Popen() import subprocess p = subprocess.Popen(['cmd'], stdout=subprocess.PIPE) data = p.stdout.read() • This works with both Unix and Windows • Captures any output printed to stdout 52
to execute a program, send it some input data, and capture its output • Set up pipes using Popen() p = subprocess.Popen(['cmd'], stdin = subprocess.PIPE, stdout = subprocess.PIPE) p.stdin.write(data) # Send data p.stdin.close() # No more input result = p.stdout.read() # Read output python cmd p.stdout p.stdin stdin stdout 53
• In the example above, p.stdin and p.stdout are a pair of files that are hooked up to the subprocess
to a file f_in = open("somefile","r") p = subprocess.Popen(['cmd'], stdin=f_in) 56 • Connecting the output to a file f_out = open("somefile","w") p = subprocess.Popen(['cmd'], stdout=f_out) • Basically, stdin and stdout can be connected to any open file object • Note : Must be a real file in the OS
can be used to set up fairly complex I/O patterns import subprocess p1 = subprocess.Popen("ls -l", shell=True, stdout=subprocess.PIPE) p2 = subprocess.Popen("wc",shell=True, stdin=p1.stdout, stdout=subprocess.PIPE) out = p2.stdout.read() • Note: this is the same as popen2.popen2("ls -l | wc")
required when communicating with subprocesses 58 • To signal end of input, don't forget to close the input stream p = subprocess.Popen(['cmd'], stdout=subprocess.PIPE, stdin=subprocess.PIPE) p.stdin.write(data) # Send data p.stdin.close() # No more input result = p.stdout.read() # Read output • If you forget, subprocess may hang
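Here is a runnable version of the send/close/read sequence, in Python 3 syntax. The child program is a made-up one-liner that upper-cases its input; it stands in for the slide's `cmd`. The second half uses `Popen.communicate()`, which performs the write, close, read, and wait steps in a single call.

```python
import subprocess
import sys

# Hypothetical child: reads all of stdin, writes it back upper-cased
child_code = "import sys; sys.stdout.write(sys.stdin.read().upper())"

p = subprocess.Popen([sys.executable, "-c", child_code],
                     stdin=subprocess.PIPE, stdout=subprocess.PIPE)
p.stdin.write(b"hello\n")     # Send data
p.stdin.close()               # No more input (child sees end-of-file)
result = p.stdout.read()      # Read output
p.stdout.close()
p.wait()

# Same exchange with communicate()
q = subprocess.Popen([sys.executable, "-c", child_code],
                     stdin=subprocess.PIPE, stdout=subprocess.PIPE)
result2, _ = q.communicate(b"world\n")
```

Because the child calls `sys.stdin.read()`, it cannot produce any output until its input is closed, which is exactly the hang the slide warns about.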
does not work well for controlling interactive processes • Buffering behavior is often wrong (may hang) • Pipes don't properly emulate terminals • Subprocess may not operate correctly 59
want to clone the original process and have two identical processes • fork(), wait(), _exit() import os pid = os.fork() if pid == 0: # Child process ... os._exit(0) else: # Parent process ... # Wait for child os.waitpid(pid, 0) • Diagram: python calls fork(); parent and child then execute concurrently until the child calls _exit() and the parent collects it with wait()
creates an identical process • Newly created process is a "child process" • fork() returns different values in parent/child import os pid = os.fork() if pid == 0: # Child process else: # Parent process 61 pid is 0 in child, non-zero in parent • Parent and child run independently afterwards
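A minimal fork round trip, assuming a Unix system (`os.fork()` is not available on Windows). The exit status 7 is an arbitrary value chosen so the parent can recognize it.

```python
import os

pid = os.fork()
if pid == 0:
    # Child process: leave immediately with a recognizable status
    os._exit(7)
else:
    # Parent process: wait for that specific child and decode its status
    waited_pid, raw_status = os.waitpid(pid, 0)
    if os.WIFEXITED(raw_status):
        exit_code = os.WEXITSTATUS(raw_status)
    else:
        exit_code = None          # child was ended by a signal instead
```

The child uses `os._exit()` rather than `sys.exit()` so that it terminates without running any cleanup inherited from the parent.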
common use-case: multiple clients • Server forks a new process to handle each client 62 Server listening Server Server Server Server fork() Client Client Client Client new clients
with I/O is that data is often encoded in a variety of different formats • Compression (gz, bz2, zip, etc.) • Unicode (UTF-8, UTF-16, etc.) • Text (Base64, Hex, Quopri, etc.) • Data might be a mix of formats • Example: A compressed UTF-8 file 65
itself as a file but is really just a wrapper around a low-level file • This kind of approach is used by Java • Also starting to show up in dynamic languages • Let's look at codecs in Python 67
like files, but certain operations may break the encoding process • Example: Random access/seeks • Example: Invalid data (encoding error) • Your mileage might vary 70
and codecs are friends 71 s.encode(encoding) # Encode a string s.decode(encoding) # Decode a string • Example: >>> s = "Hello World" >>> t = s.encode("base64") >>> t 'SGVsbG8gV29ybGQ=\n' >>> t.decode("base64") 'Hello World' >>>
manually handle encodings • Just call encode/decode yourself as needed 72 >>> f = open("foo.dat","wb") >>> f.write(data.encode("zlib")) >>> f.close() >>> g = open("foo.dat","rb") >>> data = g.read().decode("zlib") >>> g.close() • Note: using codecs module may be more memory efficient
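A note for Python 3 readers: the `s.encode("base64")` and `s.encode("zlib")` string methods shown in these slides are Python 2 only. The sketch below shows one way to get the same results in Python 3, using the `base64` and `zlib` modules directly, plus `codecs.encode()` for the name-based form.

```python
import base64
import codecs
import zlib

data = b"Hello World"

# Base64 through the dedicated module (no trailing newline)
t = base64.b64encode(data)
back = base64.b64decode(t)

# The name-based codec still exists, reached through the codecs module
t2 = codecs.encode(data, "base64")

# Compression and decompression
packed = zlib.compress(data * 100)
unpacked = zlib.decompress(packed)
```

The codec form appends a newline to its output, matching the 'SGVsbG8gV29ybGQ=\n' value shown on the earlier slide.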
73 s = u"Hello World" t = u"Jalape\u00f1o" • Encodes characters from all used languages • Widely used on Internet (internationalization) • A huge topic (won't cover all details)
stores Unicode as 16-bit integers (UCS-2) 74 t = u"Jalape\u00f1o" 004a 0061 006c 0061 0070 0065 00f1 006f • Normally, you don't worry about this • Except if you write a unicode string to a file u"J" --> 00 4a (Big Endian) u"J" --> 4a 00 (Little Endian)
I/O always involves some encoding • Handled through codecs module 75 >>> f = codecs.open("data.txt","w","utf-8") >>> f.write(u"Hello World\n") >>> f.close() >>> f = codecs.open("data.txt","w","utf-16") >>> f.write(data) >>> • Several hundred character codecs are provided • Consult documentation for details
via strings 76 >>> a = u"Jalape\u00f1o" >>> enc_a = a.encode("utf-8") >>> • Example: Writing Unicode strings to a file >>> f = open(filename,"wb") >>> f.write(data.encode("utf-8")) • Note: Since encoding may contain binary data, should probably use binary file modes.
also be decoded into Unicode >>> enc_a = 'Jalape\xc3\xb1o' >>> a = enc_a.decode("utf-8") >>> a u'Jalape\xf1o' >>> • Example: Reading Unicode strings from a file >>> f = open(filename,"rb") >>> enc_data = f.read() >>> data = enc_data.decode("utf-8") • Again, be aware that Unicode data may contain binary data
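The same round trip in Python 3, where every `str` is Unicode and encoded data is `bytes`, so the `u""` prefix is unnecessary. This sketch also shows the byte-order difference mentioned earlier and lets `open()` do the encoding; the scratch file is made up for the example.

```python
import os
import tempfile

a = "Jalape\u00f1o"

enc_a = a.encode("utf-8")          # text -> bytes
dec_a = enc_a.decode("utf-8")      # bytes -> text

be = "J".encode("utf-16-be")       # big endian, no byte-order mark
le = "J".encode("utf-16-le")       # little endian, no byte-order mark

# Let the file object handle the encoding on write
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w", encoding="utf-8") as f:
    f.write(a)
with open(path, "rb") as f:        # look at the raw bytes that were stored
    on_disk = f.read()
os.remove(path)
```

The bytes on disk are identical to what `a.encode("utf-8")` produces by hand.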
do you determine the encoding of a file? • Might be known in advance (strongly typed) • Often indicated in the file itself 78 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/> • Depends on the data source, application, etc.
dynamic languages have extensive support for text pattern processing with regular expressions • And in some languages, regular expressions are part of the language itself (e.g., Perl and Ruby). • Let's dig a little deeper... 80
involve searching and matching specific text patterns. • Example: email addresses 81 Please send email to [email protected] and maybe you will get a response (maybe). • Example: URLs Go look on http://www.google.com for details. • Example: A U.S. phone number 773-555-1212
matching a text pattern is a much more complicated problem than looking for an exact substring • Must have a concise and easy way to specify the legal characters that make up a pattern along with the order in which they are supposed to appear 82
regular expression is a concise specification of a text pattern • Built from a few basic rules:
    abc         Matches the chars 'abc' exactly
    [chars]     Matches characters in a set
    [^chars]    Matches characters not in a set
    pat1|pat2   Matches either pat1 or pat2
    pat*        Zero or more repetitions of pat
    pat+        One or more repetitions of pat
    pat?        Zero or one occurrence of pat
    (pat)       A group that matches pat
• These are then combined
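Each building block can be checked on its own with Python's `re` module. The patterns and test strings below are arbitrary examples; `full()` is a small helper defined only for this sketch.

```python
import re

def full(pattern, text):
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

results = [
    full(r"abc", "abc"),          # literal characters
    full(r"[abc]", "b"),          # one character from a set
    full(r"[^abc]", "b"),         # 'b' is in the excluded set, so no match
    full(r"cat|dog", "dog"),      # either alternative
    full(r"ab*", "a"),            # zero b's is acceptable
    full(r"ab+", "a"),            # at least one b is required, so no match
    full(r"ab?", "abb"),          # at most one b is allowed, so no match
    full(r"(ab)+", "ababab"),     # a group, repeated
]
```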
pattern to match the title of an HTML doc <title>(.*?)</title> • Problem 1 : Print all matching lines <html> <head> <title>This is an example</title> </head> <body> ... </body> </html>
• Perl open(INFILE,"foo.html"); while ($line = <INFILE>) { if ($line =~ /<title>(.*?)<\/title>/) { print $line; } } • Ruby f = open("foo.html") for line in f if line =~ /<title>(.*?)<\/title>/ print line end end
• Example: Python import re pat = re.compile('<title>(.*?)</title>') for line in open("foo.html"): if pat.search(line): print line • Here, the regex features are just in a library module. There is no special syntax or operators devoted to matching (it's a method)
pattern to match the title of an HTML doc <title>(.*?)</title> • Problem 2 : Extract just the title text itself <title>This is an example</title> This is an example
define groups <title>(.*?)</title> ([\w-]+):(.*) • Groups are assigned numbers <title>(.*?)</title> ([\w-]+):(.*) 1 1 2 • Number determined left-to-right 88
open(INFILE,"foo.html"); while ($line = <INFILE>) { if ($line =~ /<title>(.*?)<\/title>/) { print $1,"\n"; } } • Ruby f = open("foo.html") for line in f if line =~ /<title>(.*?)<\/title>/ print Regexp.last_match(1),"\n" end end
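For comparison, a Python version of the same extraction. Groups are read from the match object with `group(n)`. The list of lines stands in for the file, and the `Content-Type` header string is a made-up input for the second pattern from the groups slide.

```python
import re

pat = re.compile(r"<title>(.*?)</title>")
lines = ["<head>", "<title>This is an example</title>", "</head>"]

titles = []
for line in lines:
    m = pat.search(line)
    if m:
        titles.append(m.group(1))      # text matched by group 1

# Two groups, numbered left to right
hdr = re.match(r"([\w-]+):(.*)", "Content-Type: text/html")
name, value = hdr.group(1), hdr.group(2)
```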
pattern to match the title of an HTML doc <title>(.*?)</title> • Problem 3 : Change the title to subject <title>This is an example</title> <subject>This is an example</subject>
open(INFILE,"foo.html"); while ($line = <INFILE>) { $line =~ s/<title>(.*?)<\/title>/<subject>$1<\/subject>/; print $line; } • Ruby f = open("foo.html") for line in f line.gsub!(/<title>(.*?)<\/title>/, '<subject>\1</subject>') print line end
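The Python counterpart uses the `sub()` method of a compiled pattern, with `\1` in the replacement referring back to group 1. A short sketch on a single line of input:

```python
import re

pat = re.compile(r"<title>(.*?)</title>")
line = "<title>This is an example</title>"

# \1 in the replacement inserts whatever group 1 matched
new_line = pat.sub(r"<subject>\1</subject>", line)
```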
use regular expressions is mostly a matter of reading the manual • All of the books on dynamic languages cover them • Most programmers have used them at some point • So, I'm not going to continue with a manual 94
think most programmers (including myself) think regular expressions involve some fairly serious magic • This is partly true... especially for some of the more hard-core features • However, how do they really work? • Is there anything interesting to be learned by looking into this? Well, maybe...
originate in theoretical computer science--automata theory. • First appear sometime in the 1950s • They were popularized greatly by Ken Thompson who incorporated regex capabilities into the Unix ed editor (~1970) • They then propagated to other Unix tools (grep, awk, vi, lex, emacs, etc.)
free software library written by Henry Spencer (~1985) • Used to build regex support in early versions of Perl and Tcl which then expanded upon/rewrote the library • Almost every modern language with regex support derives directly/indirectly from the Spencer library (or at least its approach) 97
expression patterns are typically used to build a NFA • Non-deterministic Finite Automata • Covered in great detail in theory course, but let's look at the general idea 98
and try the other path out 107 Input: "aabaabbbab" b b > a start final start The unlabeled arrow here means that we can just move to the next state without reading any input
expression patterns can be turned into an NFA using a few primitive building blocks • It's tricky, but relatively straightforward • You can read details 109
dynamic languages are using an approach to NFA matching known as "recursive backtracking" • This involves trying all possibilities until a suitable match is found • And it can lead to some pathological cases • Certain patterns that match very poorly
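To see why backtracking can go badly, here is a toy matcher that understands only literal characters and `c?`. It is a simplified model written for this illustration, not the algorithm of any particular regex library, and the step counter exists only to make the cost visible. The test pattern is n copies of `a?` followed by n copies of `a`, matched against a string of n letters.

```python
def match_here(pattern, text, counter):
    """Recursive backtracking matcher for literals and 'c?' only."""
    counter[0] += 1                       # count every attempt
    if not pattern:
        return not text                   # succeed only if text is used up
    if len(pattern) >= 2 and pattern[1] == "?":
        # Path 1: let the optional character consume input
        if text and text[0] == pattern[0]:
            if match_here(pattern[2:], text[1:], counter):
                return True
        # Path 2: back up and try skipping it instead
        return match_here(pattern[2:], text, counter)
    if text and text[0] == pattern[0]:
        return match_here(pattern[1:], text[1:], counter)
    return False

def count_steps(n):
    """Match a?...a?a...a (n of each) against n letter a's."""
    counter = [0]
    matched = match_here("a?" * n + "a" * n, "a" * n, counter)
    return matched, counter[0]

matched_small, steps_small = count_steps(6)
matched_large, steps_large = count_steps(12)
```

The only way to match is to skip every optional `a`, but the matcher tries that combination last, so the number of attempts grows exponentially as n increases.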
last few classes, we have spent a lot of time looking at "objects" • An object encapsulates data and has a collection of methods that operate on that data • However, this is not the only way to do it • Let's return to functions...
It turns out that you can do a lot of very useful programming just using functions • "Functional programming" • Mathematicians study functions a lot • However, there are some essential features that you need to move beyond the basics • Today, we'll look at it in a little more detail
I program, the more I find myself drawn towards functional programming • It feels more logically coherent than OO • And a lot less byzantine • Plus, I have a secret past as a math major • Oh yeah, and get off my lawn!
is a HUGE topic • Which can be highly mathematical • I'm not going to take that approach... • Especially with my brain pounding cold • However, I will try to cover a few absolute basics and show some interesting examples
all examples are going to be Python • Python is by no means considered to be a purely "functional" language • However, it has enough of the core features for me to illustrate some interesting things • People often remark on its freakish similarity to certain parts of Lisp
A function is a series of statements def foo(x): statements ... some calculation ... statements return result • A function receives input arguments • Performs some kind of calculation • Returns a result
A function is also an object that you can treat like it was ordinary data def square(x): return x*x • You can assign it to a variable s = square • Put it in a list items = [1,"Hello",square] • Pass it as an argument to another function y = foo(3,square)
In fact, nothing that is allowed on other objects is forbidden on a function object • The only difference is that the contents of a function don't look like anything you're used to (number, string, array, etc.) • In reality, it's just a sequence of statements
functions have equal footing with numbers, strings, and other core datatypes, then they're said to be "first-class" • Basically, it means that functions are nothing special---they're just like anything else in the language
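A few lines are enough to show functions being handled like any other value. The `cube`, `table`, and `apply_twice` names are invented for this sketch.

```python
def square(x):
    return x * x

def cube(x):
    return x * x * x

s = square                                  # assign to a variable
table = {"square": square, "cube": cube}    # store in a data structure

def apply_twice(func, x):
    """Receive a function as an argument and call it."""
    return func(func(x))

results = [s(3), table["cube"](2), apply_twice(square, 3)]
```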
functions lets you pass functions into other functions as an argument • This allows a program to make use of so-called "callback" functions • Functions that get executed under certain circumstances by another function
use case: Supplying the comparison function for a list sort def wordcmp(s,t): s_l = s.lower() t_l = t.lower() if s_l < t_l : return -1 elif s_l > t_l : return 1 else: return 0 words = ['MONDO','diabolical','Thrash'] words.sort(wordcmp) # Produces ['diabolical','MONDO','Thrash'] • sort() "calls back" into the compare function to help it figure out the ordering
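In Python 3 the `sort()` method no longer accepts a comparison function directly. Two common alternatives are sketched here: wrapping the comparison with `functools.cmp_to_key`, or supplying a one-argument key function, which is also a callback.

```python
from functools import cmp_to_key

def wordcmp(s, t):
    s_l = s.lower()
    t_l = t.lower()
    if s_l < t_l:
        return -1
    elif s_l > t_l:
        return 1
    else:
        return 0

words = ['MONDO', 'diabolical', 'Thrash']

by_cmp = sorted(words, key=cmp_to_key(wordcmp))   # adapts the old-style compare
by_key = sorted(words, key=str.lower)             # sort() calls back into str.lower
```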
The fact that functions are data opens up a variety of interesting possibilities • Can have stored tables and collections of functions (already saw that with classes) • Also, functions can be passed around to different parts of a program • At first glance, all of this might sound a little exotic (but we'll see examples soon)
a function, you can define new functions and use them elsewhere (like data) • Example: def make_greeting(name): def greet(): print "Hey %s, get off my lawn!" % name return greet • Check it out - a function was returned >>> p = make_greeting("Punk") >>> p <function greet at 0x69330> >>>
Using an "inner function" is interesting • It secretly carries information about all of the variables that were alive when it was defined >>> p = make_greeting("Punk") >>> p <function greet at 0x69330> >>> p() Hey Punk, get off my lawn! >>> Notice how it somehow picked up the name variable
together with its surrounding environment is known as a "closure" • Basically, the closure has all of the information needed to make the function execute correctly • Normally all of this is tucked away behind the scenes (it just works)
inspect the closure if sneaky >>> p = make_greeting("Punk") >>> p <function greet at 0x69330> >>> p.func_closure (<cell at 0x6c950: str object at 0x6c9e0>,) >>> p.func_closure[0].cell_contents 'Punk' >>> • A closure is almost like a weird kind of "object" >>> k = make_greeting("Kid") >>> g = make_greeting("Governor") >>> g() Hey Governor, get off my lawn! >>> k() Hey Kid, get off my lawn! >>>
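The same experiment in Python 3 syntax, where the attribute is spelled `__closure__` instead of `func_closure`. In this sketch the inner function returns the greeting rather than printing it, so the result is easy to check.

```python
def make_greeting(name):
    def greet():
        return "Hey %s, get off my lawn!" % name
    return greet

p = make_greeting("Punk")
k = make_greeting("Kid")

# Each closure carries its own copy of the environment
captured = p.__closure__[0].cell_contents
greetings = [p(), k()]
```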
just scratched the surface of what it means for functions to be "first-class" • You can pass existing functions around as data • You can create new functions on-the-fly • Newly created functions retain parts of the environment where they were created • This is where it starts to get interesting...
In the assignment, you wrote some programs that worked with a portfolio of stocks • There was some data in a file (a list of lines) MSFT,100,54.25 IBM,50,91.10 AA,25,23.10 CAT,75,70.13 MSFT,50,64.23 GM,200,45.11 HPQ,80,37.42 IBM,40,88.20 PG,125,56.22 BA,75,92.72 MSFT,50,71.21 AIG,40,41.81
Calculate the cost of the portfolio total = 0.0 for stock in portfolio: total += stock[1]*stock[2] print "Total", total • Involves a list of "stocks" • Iterating over this list • Performing an "operation" on each item
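The loop above assumes each stock is a sequence of (name, shares, price). One way to get from the lines of the file to that form is sketched below, using the first three lines of the sample data; the tuple layout is an assumption based on how `stock[1]` and `stock[2]` are used.

```python
lines = ["MSFT,100,54.25", "IBM,50,91.10", "AA,25,23.10"]

portfolio = []
for line in lines:
    name, shares, price = line.split(",")
    portfolio.append((name, int(shares), float(price)))   # convert the fields

total = 0.0
for stock in portfolio:
    total += stock[1] * stock[2]      # shares * price
```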
that maps a function to each item of a list, producing a new list def map(func, items): result = [] for it in items: result.append(func(it)) return result • Example: def square(x): return x*x nums = [1,2,3,4,5] sqs = map(square,nums) # [1,4,9,16,25]
that checks each item and discards those that don't match a condition def filter(condf,items): result = [] for it in items: if condf(it): result.append(it) return result • Example: def positive(x): return x > 0 nums = [1,-2,3,-4,5] p = filter(positive,nums) # [1,3,5]
that combines successive list elements and produces a single result def reduce(combinef,items,initial=0): result = initial for it in items: result = combinef(result,it) return result • Example: def add(x,y): return x+y nums = [1,2,3,4,5] total = reduce(add,nums) # total = 15
total cost of all stocks in the portfolio with 100 or more shares # Three functions def cost_f(s): return s[1]*s[2] def hundred_f(s): return s[1] >= 100 def add_f(x,y): return x+y # Now, using these operations stocks = filter(hundred_f, portfolio) costs = map(cost_f,stocks) total = reduce(add_f,costs) • It's a little clunky, but essentially applying operations to entire lists at each step
language already has some variation of map, filter, and reduce • And some common reductions (sum, min, max) • These are basic list/array operations • Have been around for almost forever. • Look in the manual for details.
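As a rough sketch of what the built-in versions look like, here is the "100 or more shares" calculation using Python 3's own map, filter, and reduce. The portfolio tuples are the assignment data typed in by hand; in Python 3, reduce lives in functools, and map/filter return lazy iterators instead of lists.

```python
from functools import reduce   # no longer a builtin in Python 3

# (name, shares, price) records copied from the portfolio file
portfolio = [
    ("MSFT", 100, 54.25), ("IBM", 50, 91.10), ("AA", 25, 23.10),
    ("CAT", 75, 70.13), ("MSFT", 50, 64.23), ("GM", 200, 45.11),
    ("HPQ", 80, 37.42), ("IBM", 40, 88.20), ("PG", 125, 56.22),
    ("BA", 75, 92.72), ("MSFT", 50, 71.21), ("AIG", 40, 41.81),
]

def cost_f(s): return s[1] * s[2]
def hundred_f(s): return s[1] >= 100
def add_f(x, y): return x + y

stocks = filter(hundred_f, portfolio)   # lazy iterator, not a list
costs = map(cost_f, stocks)             # also lazy
total = reduce(add_f, costs, 0.0)       # this step drives the iteration

# sum() is one of the "common reductions" and gives the same answer
shortcut = sum(s[1] * s[2] for s in portfolio if s[1] >= 100)

print(total, shortcut)
```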
can be easily passed around, you often end up writing code that relies on a lot of small functions or "formulas" # Three functions def cost_f(s): return s[1]*s[2] def hundred_f(s): return s[1] >= 100 def add_f(x,y): return x+y # Now, using these operations stocks = filter(hundred_f, portfolio) costs = map(cost_f,stocks) total = reduce(add_f,costs) • This style gets old real fast...
Lambda expressions. Creates a function right on the spot for you # Now, using these operations stocks = filter(lambda s: s[1] >= 100, portfolio) costs = map(lambda s: s[1]*s[2], stocks) total = reduce(lambda x,y: x+y,costs) • Lambda creates a function that is a single expression lambda x,y : x+y # Is the same as typing this out long-form def anon(x,y): return x+y
too much into this lambda stuff • It's just a special syntax that lets us take a simple expression and quickly turn it into an unnamed function • Often more convenient than defining a separate function elsewhere • Name "lambda" comes from Lisp, which took it from the "Lambda Calculus"
reality, you probably want to use it sparingly • Overuse makes code impossible to decipher • Not as powerful as defining a normal function • The body of a lambda can only be a single expression (not a bunch of statements)
new list by applying an operation to each element of a sequence. >>> a = [1,2,3,4,5] >>> b = [2*x for x in a] >>> b [2,4,6,8,10] >>> • Another example: 41 >>> names = ['Elwood','Jake'] >>> a = [name.lower() for name in names] >>> a ['elwood','jake'] >>>
comprehension can also filter >>> f = open("stockreport","r") >>> goog = [line for line in f if 'GOOG' in line] >>> >>> a = [1, -5, 4, 2, -2, 10] >>> b = [2*x for x in a if x > 0] >>> b [2,8,4,20] >>> • Another example 42
[expression for x in s if condition] • What it means result = [] for x in s: if condition: result.append(expression) 43 • Basically, this is map/filter rolled into one op
syntax (in full) [expression for x in s if cond1 for y in t if cond2 ... if condfinal] • What it means result = [] for x in s: if cond1: for y in t: if cond2: if condfinal: result.append(expression) 44
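A small concrete instance of the full syntax may help. The ranges and conditions below are made up for illustration; the point is only that the comprehension and the hand-expanded loops produce the same list (Python 3 syntax).

```python
s = range(5)
t = range(4)

# Two for-clauses, each with its own condition
pairs = [(x, y) for x in s if x % 2 == 0 for y in t if y > x]

# The same thing written out as nested loops
result = []
for x in s:
    if x % 2 == 0:
        for y in t:
            if y > x:
                result.append((x, y))

print(pairs)
print(pairs == result)
```

The clauses nest left to right, so the first `for` is the outermost loop.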
comprehensions are hugely useful • Collecting the values of a specific field stocknames = [s['name'] for s in stocks] • Performing database-like queries a = [s for s in stocks if s['price'] > 100 and s['shares'] > 50 ] • Quick mathematics over sequences cost = sum([s['shares']*s['price'] for s in stocks]) 45
come from Haskell a = [x*x for x in s if x > 0] # Python a = [x*x | x <- s, x > 0] # Haskell • And this is motivated by sets (from math) a = { x² | x ∈ s, x > 0 }
List comprehensions encourage a more "declarative" style of programming when processing sequences of data. • Data can be manipulated by simply "declaring" a series of statements that perform various operations on it. • Although, it may require some care... 47
of stocks lines = open("dowportfolio.csv") fields = [line.split(",") for line in lines] portfolio = [[f[0],int(f[1]),float(f[2])] for f in fields] 48 • Performing a calculation total = sum([s[1]*s[2] for s in portfolio if s[1] >= 100]) • We're just applying list operation after list operation to get the result we want
example: 50 def add(x,y): def do_add(): return x+y return do_add • This function creates a new function that performs a calculation when it runs (later) >>> r = add(3,4) >>> r <function do_add at 0x693b0> >>> r() 7 >>>
example illustrates something known as "lazy" evaluation • A function was created to perform some work • But the execution of the function didn't occur until later on (it was delayed) • This style of programming can be used for all sorts of good and evil 51
expensive calculations in a way where they will only be carried out if actually requested later • Example : Fetch a URL, but, not right now import urllib def prepare_download(url): def do_download(): return urllib.urlopen(url).read() return do_download >>> d = prepare_download("http://www.blah.com") ... >>> text = d() # Okay, do it
some of the arguments to a function now (get the rest later) def partial(func,*args): def call(*moreargs): return func(*(args+moreargs)) return call • Example: def add(x,y,z): return x+y+z a = partial(add,2,3) ... print a(4) # prints 9 : 2 + 3 + 4 print a(10) # prints 15 : 2 + 3 + 10
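For comparison, the standard library already ships this idea as functools.partial. A short sketch in Python 3, reusing the add example from the slide:

```python
from functools import partial

def add(x, y, z):
    return x + y + z

# Supply the first two arguments now; the third comes later
a = partial(add, 2, 3)

print(a(4))    # 2 + 3 + 4
print(a(10))   # 2 + 3 + 10
```

The library version also accepts keyword arguments, which the hand-written closure does not attempt.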
a logfile import time def tail(thefile): thefile.seek(0,2) # Go to EOF def do_next(): while True: line = thefile.readline() if line: return line time.sleep(0.1) return do_next • Example: >>> next = tail(open("logfile","r")) >>> while True: ... print next(),
-f example was interesting • That function created a function which emitted a new line from a file every time you called it • You might be able to expand on that idea by writing functions that generate sequences
: A Countdown def countdown(n): while n > 0: yield n n = n - 1 • This spits out new values for use in a for-loop • Example: >>> c = countdown(5) >>> for n in c: ... print n, ... 5 4 3 2 1 >>>
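To see what the for-loop is doing behind the scenes, the generator can be stepped by hand. This is a sketch in Python 3, where the built-in next() replaces the Python 2 .next() method.

```python
def countdown(n):
    while n > 0:
        yield n          # hand one value to the caller, then pause here
        n = n - 1

c = countdown(3)         # nothing runs yet; this just creates the generator
first = next(c)          # runs until the first yield
rest = list(c)           # drains whatever is left

# Once the function body finishes, further requests raise StopIteration,
# which is the signal a for-loop uses to stop.
try:
    next(c)
    exhausted = False
except StopIteration:
    exhausted = True

print(first, rest, exhausted)
```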
version of a list comprehension >>> a = [1,2,3,4] >>> b = (2*x for x in a) >>> b <generator object at 0x58760> >>> for i in b: print i, ... 2 4 6 8 >>> • Important differences • Does not construct a list. • Only useful purpose is iteration • Once consumed, can't be reused
(expression for i in s for j in t ... if conditional) • Can also serve as a function argument sum(x*x for x in a) • Can be applied to any iterator >>> a = [1,2,3,4] >>> b = (x*x for x in a) >>> c = (-x for x in b) >>> for i in c: print i, ... -1 -4 -9 -16 >>> 59
a field in a large input file f = open("datfile.txt") # Strip all lines that start with a comment lines = (line for line in f if not line.startswith('#')) # Split the lines into fields fields = (s.split() for s in lines) # Sum up one of the fields print sum(float(f[2]) for f in fields) • Solution 60 823.1838823 233.128883 14.2883881 44.1787723 377.1772737 123.177277 143.288388 3884.78772 ...
• Each generator expression only evaluates data as needed (lazy evaluation) • Example: Running above on a 6GB input file only consumes about 60K of RAM f = open("datfile.txt") # Strip all lines that start with a comment lines = (line for line in f if not line.startswith('#')) # Split the lines into fields fields = (s.split() for s in lines) # Sum up one of the fields print sum(float(f[2]) for f in fields)
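The same three-stage pipeline can be tried on a tiny in-memory "file". The sample lines below are invented for the demonstration (they are not the datfile.txt contents), and the code is Python 3.

```python
import io

# Hypothetical stand-in for datfile.txt: comment lines plus three data rows
sample = io.StringIO(
    "# name shares price\n"
    "IBM 50 91.5\n"
    "# a comment in the middle\n"
    "AA 25 23.25\n"
    "GM 200 45.0\n"
)

# Each stage is a generator: no intermediate list is ever built
lines = (line for line in sample if not line.startswith("#"))
fields = (s.split() for s in lines)
total = sum(float(f[2]) for f in fields)

# The pipeline is now used up; asking again yields nothing
leftover = list(lines)

print(total, leftover)
```

Only one line is "in flight" at a time, which is why memory use stays flat no matter how large the input is.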
been a small taste of functional programming idioms • If you go further, focus on organization of functions, closures, routing of data, etc. • Personally, I think it's a fun way to program • Very different than OO however... 64
we ended with some discussion of generator functions • However, didn't get a chance to look at more interesting examples • Let's spend a little more time on this
: A Countdown def countdown(n): while n > 0: yield n n = n - 1 • This spits out new values for use in a for-loop • Example: >>> c = countdown(5) >>> for n in c: ... print n, ... 5 4 3 2 1 >>>
Example : A Countdown class Countdown def initialize(n) @start = n end def each n = @start while n > 0 yield n n -= 1 end end end • Use for i in Countdown.new(5) puts i end
version of a list comprehension >>> a = [1,2,3,4] >>> b = (2*x for x in a) >>> b <generator object at 0x58760> >>> for i in b: print i, ... 2 4 6 8 >>> • Generates a sequence of values where some operation has been applied
Generators are most effectively used to set up data processing pipelines • Similar to pipes in Unix 10 % ls -l | wc • Can structure programs as stages of processing chained together
how many bytes of data were transferred by summing up the last column of data in this Apache web server log 81.107.39.38 - ... "GET /ply/ HTTP/1.1" 200 7587 81.107.39.38 - ... "GET /favicon.ico HTTP/1.1" 404 133 81.107.39.38 - ... "GET /ply/bookplug.gif HTTP/1.1" 200 23903 81.107.39.38 - ... "GET /ply/ply.html HTTP/1.1" 200 97238 81.107.39.38 - ... "GET /ply/example.html HTTP/1.1" 200 2359 66.249.72.134 - ... "GET /index.html HTTP/1.1" 200 4447 Oh yeah, and the log file might be huge (Gbytes)
line of the log looks like this: 81.107.39.38 - ... "GET /ply/ply.html HTTP/1.1" 200 97238 • The number of bytes is the last column bytestr = line.rsplit(None,1)[1] • It's either a number or a missing value (-) 81.107.39.38 - ... "GET /ply/ HTTP/1.1" 304 - • Converting the value if bytestr != '-': bytes = int(bytestr)
do a simple for-loop 13 wwwlog = open("access-log") total = 0 for line in wwwlog: bytestr = line.rsplit(None,1)[1] if bytestr != '-': total += int(bytestr) print "Total", total • We read line-by-line and just update a sum
use some generator expressions 14 wwwlog = open("access-log") bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog) bytes = (int(x) for x in bytecolumn if x != '-') print "Total", sum(bytes) • Well, that's certainly different • Less code • A completely different programming style
The solution is setting up a pipeline 15 wwwlog bytecolumn bytes sum() access-log total • Each step is defined by iteration/generation wwwlog = open("access-log") bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog) bytes = (int(x) for x in bytecolumn if x != '-') print "Total", sum(bytes)
step of the pipeline, we declare an operation that will be applied to the entire input stream 16 wwwlog bytecolumn bytes sum() access-log total bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog) This operation gets applied to every line of the log file
focusing on the problem at a line-by-line level, you just break it down into big operations that operate on the whole file • This is very much a "declarative" style • The key : Think big... 17
• The glue that holds the pipeline together is the iteration that occurs in each step wwwlog = open("access-log") bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog) bytes = (int(x) for x in bytecolumn if x != '-') print "Total", sum(bytes) • The calculation is being driven by the last step • The sum() function is consuming values by pulling them through the pipeline one at a time (via .next() calls)
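One way to convince yourself that the last step drives everything is to record when the source is actually read. In this Python 3 sketch, source() and its three log lines are made-up stand-ins for the real access-log file.

```python
events = []

def source():
    # Hypothetical stand-in for the log file; notes every line handed out
    for line in ['a "GET /x" 200 10', 'b "GET /y" 304 -', 'c "GET /z" 200 32']:
        events.append("read")
        yield line

bytecolumn = (line.rsplit(None, 1)[1] for line in source())
nbytes = (int(x) for x in bytecolumn if x != "-")

before = len(events)     # pipeline is set up, but no line has been read
total = sum(nbytes)      # sum() pulls values, which pulls lines
after = len(events)

print(before, total, after)
```

Setting up the stages costs nothing; all of the reading happens inside the call to sum().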
wwwlog = open("access-log") total = 0 for line in wwwlog: bytestr = line.rsplit(None,1)[1] if bytestr != '-': total += int(bytestr) print "Total", total • Time: 21.19s wwwlog = open("access-log") bytecolumn = (line.rsplit(None,1)[1] for line in wwwlog) bytes = (int(x) for x in bytecolumn if x != '-') print "Total", sum(bytes) • Time: 20.14s
program that can easily extract metadata from Firefox browser cache files • You were supposed to do this in the assignment • Probably encountered a variety of nasty bits of code with that
are four critical files 22 _CACHE_MAP_ # Cache index _CACHE_001_ # Cache data _CACHE_002_ # Cache data _CACHE_003_ # Cache data • All files are binary-encoded • _CACHE_MAP_ is the index, but it is encoded in a tricky way. • Don't need it to extract URL requests anyways
import os import struct cachefiles = [('_CACHE_001_',256),('_CACHE_002_',1024), ('_CACHE_003_',4096)] def generate_headers(cachedir): for name, blocksize in cachefiles: pathname = os.path.join(cachedir,name) f = open(pathname,"rb") f.seek(4096) while True: header = f.read(36) if not header: break fields = struct.unpack(">9I",header) if fields[0] == 0x00010008: yield f, fields fp = f.tell() offset = fp % blocksize if offset: f.seek(blocksize - offset,1) f.close() We loop over each _CACHE_00N_ file one by one. Open each file and skip the 4096 byte block bit-map at the beginning
import os import struct cachefiles = [('_CACHE_001_',256),('_CACHE_002_',1024), ('_CACHE_003_',4096)] def generate_headers(cachedir): for name, blocksize in cachefiles: pathname = os.path.join(cachedir,name) f = open(pathname,"rb") f.seek(4096) while True: header = f.read(36) if not header: break fields = struct.unpack(">9I",header) if fields[0] == 0x00010008: yield f, fields fp = f.tell() offset = fp % blocksize if offset: f.seek(blocksize - offset,1) f.close() We read the file and look for metadata headers (look for the magic bytes)
import os import struct cachefiles = [('_CACHE_001_',256),('_CACHE_002_',1024), ('_CACHE_003_',4096)] def generate_headers(cachedir): for name, blocksize in cachefiles: pathname = os.path.join(cachedir,name) f = open(pathname,"rb") f.seek(4096) while True: header = f.read(36) if not header: break fields = struct.unpack(">9I",header) if fields[0] == 0x00010008: yield f, fields fp = f.tell() offset = fp % blocksize if offset: f.seek(blocksize - offset,1) f.close() Skip to the start of the next block (we look at the current file pointer to compute a skip value)
to the processing pipeline 31 def generate_meta(headers): for f, header in headers: urlstr = f.read(header[7]) infostr = f.read(header[8]) yield header, urlstr, infostr • Example: headers = generate_headers("FFCache") meta = generate_meta(headers) for header, url, info in meta: print header print url print info print
34 http://www.google.com/images/firefox/grgrad.gif Sat Oct 20 10:01:54 2007 http://www.google.com/images/firefox/clear.gif Sat Oct 20 10:01:54 2007 http://www.google.com/images/firefox/title.gif Sat Oct 20 10:01:54 2007 http://www.google.com/images/firefox/fox1.gif Sat Oct 20 10:01:54 2007 http://www.google.com/images/firefox/fox2.gif Sat Oct 20 10:01:54 2007 http://www.google.com/images/firefox/google.gif Sat Oct 20 10:01:54 2007
names from requests 35 def generate_domains(requests): for url, fetchtime in requests: proto,request = url.split("://",1) domain = request.split("/",1)[0] yield domain • Example: headers = generate_headers("FFCache") meta = generate_meta(headers) requests = generate_requests(meta) domains = generate_domains(requests) for d in sorted(set(domains)): print d
sequences together def concatenate(seq): for s in seq: for item in s: yield item • Example: Find all domains in all caches all_caches = (path for path,dirlist,filelist in os.walk("/") if '_CACHE_MAP_' in filelist) headers = concatenate(generate_headers(path) for path in all_caches) meta = generate_meta(headers) requests = generate_requests(meta) domains = generate_domains(requests)
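A concatenating generator should yield the items of each inner sequence, one after another. The sketch below (Python 3, with made-up sample lists) shows that behavior and checks it against itertools.chain.from_iterable, the standard library tool that does the same job.

```python
from itertools import chain

def concatenate(seq):
    # seq is a sequence of sequences; emit every inner item in order
    for s in seq:
        for item in s:
            yield item

groups = [[1, 2], [3], [4, 5]]          # sample data for illustration

flat = list(concatenate(groups))
flat_stdlib = list(chain.from_iterable(groups))

print(flat, flat_stdlib)
```

Both versions are lazy, so they work equally well when the outer sequence is itself a generator.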
fund 39 INSANE MONEY w/ GUIDO PY 142.34 (+8.12) JV 34.23 (-4.23) CPP 4.10 (-1.34) NET 14.12 (-0.50) After watching 87 straight hours of "Guido's Insane Money" on his Tivo, Dave has decided to quit his day job as a jazz musician and start a hedge fund. • Problem : Write a program that can process infinite streams of real-time stock market data
a sequence of real-time data is being written to a log (stocklog) 40 unix % tail -f stocklog.dat "MCD",50.80,"6/11/2007","09:30.00",-0.61,51.47,50.80,50.80,92400 "KO",51.63,"6/11/2007","09:30.00",-0.04,51.67,51.63,51.63,395215 "MMM",85.75,"6/11/2007","09:30.00",-0.19,85.94,85.75,85.75,15610 "JNJ",62.08,"6/11/2007","09:30.00",-0.05,62.89,62.08,62.08,25340 "AXP",62.39,"6/11/2007","09:30.01",-0.65,62.79,62.39,62.38,83462 ... • Again let's use generators...
Python version of 'tail -f' 41 import time def follow(thefile): thefile.seek(0,2) # Go to the end of the file while True: line = thefile.readline() if not line: time.sleep(0.1) # Sleep briefly continue yield line • Idea : Seek to the end of the file and repeatedly try to read new lines. If new data is written to the file, we'll pick it up.
function 42 stocklog = open("stocklog.dat","r") lines = follow(stocklog) ... for line in lines: print line, • This produces the same output as 'tail -f'
lines of stock data are CSV formatted • Let's route them through the CSV module 43 import csv stocklog = open("stocklog.dat","r") lines = follow(stocklog) fields = csv.reader(lines) for f in fields: print f
that to our processing pipeline 46 import csv stocklog = open("stocklog.dat","r") lines = follow(stocklog) fields = csv.reader(lines) fieldtypes = [str,float,str,str, float,float,float,float,int] converted = (convert_fields(fieldtypes,f) for f in fields) for s in converted: print s • This now produces lists of converted values
all of the fields into a dictionary 48 import csv stocklog = open("stocklog.dat","r") lines = follow(stocklog) fields = csv.reader(lines) fieldtypes = [str,float,str,str, float,float,float,float,int] converted = (convert_fields(fieldtypes,f) for f in fields) fieldnames = ['name','price','date','time', 'change','open','high','low','volume'] stocks = (dict(zip(fieldnames,c)) for c in converted)
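The convert_fields helper is used here but its definition does not appear on these slides, so the one below is an assumed implementation that simply applies each type to the matching column. To keep the sketch finite, it runs on two sample lines from the stock log instead of follow(); it is written for Python 3.

```python
import csv

def convert_fields(fieldtypes, fields):
    # Assumed helper: pair each conversion function with its column value
    return [ftype(value) for ftype, value in zip(fieldtypes, fields)]

# Two lines taken from the stocklog sample, standing in for follow(stocklog)
lines = [
    '"MCD",50.80,"6/11/2007","09:30.00",-0.61,51.47,50.80,50.80,92400',
    '"KO",51.63,"6/11/2007","09:30.00",-0.04,51.67,51.63,51.63,395215',
]

fields = csv.reader(lines)              # accepts any iterable of strings
fieldtypes = [str, float, str, str, float, float, float, float, int]
converted = (convert_fields(fieldtypes, f) for f in fields)
fieldnames = ['name', 'price', 'date', 'time',
              'change', 'open', 'high', 'low', 'volume']
stocks = [dict(zip(fieldnames, c)) for c in converted]

for s in stocks:
    print(s['name'], s['price'], s['volume'])
```

With an infinite source such as follow(), the last step would stay a generator expression instead of a list.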
an infinite input source • We then routed lines from that source through a processing pipeline that produces an infinite sequence of dictionaries 50 follow csv.reader convert makedict stocklog.dat { } lines lines lists lists dicts
has a portfolio of stocks 51 portfolio = set(['IBM','MSFT','HPQ','CAT','AA']) • Write a program that prints out a real-time ticker showing the name, price, change, and volume for just these stocks in_portfolio = (s for s in stocks if s['name'] in portfolio) ticker = ((s['name'],s['price'],s['change'],s['volume']) for s in in_portfolio) for t in ticker: print "%10s %10.2f %10.2f %10d" % t
data for negative change 52 in_portfolio = (s for s in stocks if s['name'] in portfolio) ticker = ((s['name'],s['price'],s['change'],s['volume']) for s in in_portfolio) negticker = (t for t in ticker if t[2] < 0) for t in negticker: print "%10s %10.2f %10.2f %10d" % t
used heavily in network programming applications • Processing different file formats • Interacting with web servers • Implementing network servers, etc. 55
• This is a complete Python web-server with support for CGI scripting from BaseHTTPServer import HTTPServer from CGIHTTPServer import CGIHTTPRequestHandler import os os.chdir("/home/docs/html") serv = HTTPServer(("",8080),CGIHTTPRequestHandler) serv.serve_forever() • Serves HTML files and executes scripts in "/cgi-bin" and "/htbin" directories
lot of these network features is mostly a matter of reading the manual • There are various libraries and frameworks • Instead of talking about that, will cover absolute basics of network programming • Material most good programmers should just know about 57
network have a hostname • Hostname mapped to numerical address (e.g., IP address, DNS) 60 Network foo.bar.com 205.172.13.4 www.python.org 82.94.237.218
between "ports" • Ports are bound to running processes/services 61 foo.bar.com 205.172.13.4 web email IM Port 80 Port 25 Port 31337 browser sendmail Port 7823 Port 3342
for incoming connections and provide some kind of service (e.g., web) • Clients make connections to servers 63 www.bar.com 205.172.13.4 web Port 80 browser Client Server • To make it work, servers use standardized port numbers (e.g., web server always on port 80)
application use a request/ response programming model • Client sends a request (e.g., HTTP) 65 GET /index.html HTTP/1.0 • Server sends a response (e.g., HTTP) HTTP/1.0 200 OK Content-type: text/html Content-length: 48823 <HTML> ... • Actual protocol depends on the application
network code • Socket: A communication endpoint 66 socket socket • Supported by socket library module • Allows data to be written/read (e.g., like a file) network
• Most common case: TCP connection s = socket(AF_INET, SOCK_STREAM) s = socket(AF_INET, SOCK_DGRAM) 68 • Almost all code will use one of following s = socket(AF_INET, SOCK_STREAM)
a socket is only the first step 69 s = socket(AF_INET, SOCK_STREAM) • Further use depends on application • Server • Listen for incoming connections • Client • Make an outgoing connection
establish a dedicated connection • Bi-directional data transfer • Continuous I/O stream (like a file, pipe, etc.) • Reliable • Connection stays open until explicitly closed DATA TCP/IP socket(AF_INET,SOCK_STREAM)
socket to make a connection from socket import * s = socket(AF_INET,SOCK_STREAM) s.connect(("www.python.org",80)) s.send("GET /index.html HTTP/1.0\n\n") data = s.recv(10000) s.close() 71 • s.connect(addr) makes a connection s.connect(("www.python.org",80)) • Once connected, use send(),recv() to transmit and receive data • close() shuts down the connection
server 72 from socket import * s = socket(AF_INET,SOCK_STREAM) s.bind(("",9000)) s.listen(5) while True: c,a = s.accept() print "Received connection from", a c.send("Hello %s\n" % a[0]) c.close() • Send a message back to a client % telnet localhost 9000 Connected to localhost. Escape character is '^]'. Hello 127.0.0.1 Connection closed by foreign host. % Server message
from socket import * s = socket(AF_INET,SOCK_STREAM) s.bind(("",9000)) s.listen(5) while True: c,a = s.accept() print "Received connection from", a c.send("Hello %s\n" % a[0]) c.close() • Addressing: s.bind() binds the socket to a specific address s.bind(("",9000)) s.bind(("localhost",9000)) (binds to localhost) s.bind(("192.168.2.1",9000)) s.bind(("104.21.4.2",9000)) (If system has multiple IP addresses, can bind to a specific address)
for connections 74 from socket import * s = socket(AF_INET,SOCK_STREAM) s.bind(("",9000)) s.listen(5) while True: c,a = s.accept() print "Received connection from", a c.send("Hello %s\n" % a[0]) c.close() • s.listen(backlog) • backlog is # of pending connections to allow • Note: not related to number of clients Tells system to start listening for connections on the socket
new connection 75 from socket import * s = socket(AF_INET,SOCK_STREAM) s.bind(("",9000)) s.listen(5) while True: c,a = s.accept() print "Received connection from", a c.send("Hello %s\n" % a[0]) c.close() • s.accept() blocks until connection received • Server sleeps if nothing is happening Accept a new client connection
and address from socket import * s = socket(AF_INET,SOCK_STREAM) s.bind(("",9000)) s.listen(5) while True: c,a = s.accept() print "Received connection from", a c.send("Hello %s\n" % a[0]) c.close() Accept returns a pair (client_socket,addr) • <socket._socketobject object at 0x3be30> : This is a new socket that's used for data • ("104.23.11.4",27743) : This is the network/port address of the client that connected
77 from socket import * s = socket(AF_INET,SOCK_STREAM) s.bind(("",9000)) s.listen(5) while True: c,a = s.accept() print "Received connection from", a c.send("Hello %s\n" % a[0]) c.close() Send data to client Note: Using the client socket, not the server socket
connection 78 from socket import * s = socket(AF_INET,SOCK_STREAM) s.bind(("",9000)) s.listen(5) while True: c,a = s.accept() print "Received connection from", a c.send("Hello %s\n" % a[0]) c.close() Close client connection • Note: Server can keep client connection alive as long as it wants • Can repeatedly receive/send data
the next connection 79 from socket import * s = socket(AF_INET,SOCK_STREAM) s.bind(("",9000)) s.listen(5) while True: c,a = s.accept() print "Received connection from", a c.send("Hello %s\n" % a[0]) c.close() Wait for next connection • Original server socket is reused for further connections • Server runs forever
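To try the whole accept/send/close cycle without telnet, the server and a client can be run in one script. This is a sketch for Python 3 under a few assumptions of mine: it binds to the loopback address on an OS-chosen port (port 0) instead of 9000, handles exactly one client in a helper thread, and sends bytes rather than str, as Python 3 sockets require.

```python
import socket
import threading

def serve_one(server):
    c, a = server.accept()                       # blocks until a client connects
    c.sendall(("Hello %s\n" % a[0]).encode("ascii"))
    c.close()                                    # closes only this client socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))                    # port 0: let the OS pick a free port
server.listen(5)
port = server.getsockname()[1]

t = threading.Thread(target=serve_one, args=(server,))
t.start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))
reply = b""
while True:
    chunk = client.recv(1024)
    if not chunk:                                # empty read: server closed the connection
        break
    reply += chunk
client.close()

t.join()
server.close()
print(reply.decode("ascii"), end="")
```

The loop around recv() matters: TCP is a byte stream, so a single recv() is not guaranteed to return the whole message.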
Price Server • Suppose there is a dictionary with prices 80 prices = { } for line in open("prices.dat"): fields = line.split(",") prices[fields[0]] = float(fields[1]) >>> prices['IBM'] 102.86 >>> prices['AA'] 39.48 >>> • Turn this into a server where clients can connect and get the prices
last example so specific prices can be requested and returned • Allow a list of stock names to be sent 82 % telnet localhost 9000 Trying 127.0.0.1... Connected to localhost. Escape character is '^]'. IBM AA CAT <newline> CAT,78.29 IBM,102.86 AA,39.48 Connection closed by foreign host. %
import socket s = socket.socket(socket.AF_INET,socket.SOCK_STREAM) s.setsockopt(socket.SOL_SOCKET,socket.SO_REUSEADDR,1) s.bind(("",9000)) s.listen(5) while True: c,a = s.accept() f = c.makefile() nameline = f.readline() nameset = nameline.split() for name in prices: if not nameset or name in nameset: c.sendall("%s,%0.2f\n" % (name, prices[name])) f.close() c.close()
sent in discrete packets (Datagrams) • No concept of a "connection" • No reliability, no ordering of data • Datagrams may be lost, arrive in any order • Higher performance (used in games, etc.) DATA DATA DATA socket(AF_INET,SOCK_DGRAM)
datagram to a server from socket import * s = socket(AF_INET,SOCK_DGRAM) # Create datagram socket s.sendto(msg,("server.com",10000)) # Send a message data, addr = s.recvfrom(maxsize) # Wait for a response (returned data, remote address) • Key concept: No "connection" • You just send a data packet
datagram server from socket import * s = socket(AF_INET,SOCK_DGRAM) # Create datagram socket s.bind(("",10000)) # Bind to a specific port while True: data, addr = s.recvfrom(maxsize) # Wait for a message # Do something ... s.sendto(resp,addr) # Send response • Much simpler than a TCP server • Again: No "connection" is established
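Because there is no connection to set up, a UDP exchange can be demonstrated with two sockets in a single script. This Python 3 sketch makes some choices of its own: loopback only, an OS-assigned port, a 5 second timeout so a lost packet cannot hang the script, and a "server" that just upper-cases whatever it receives.

```python
import socket

# "Server" side: bind to a port and wait for datagrams
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))            # port 0: OS picks a free port
server.settimeout(5)
server_addr = server.getsockname()

# "Client" side: no connect(), just send a packet to an address
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.settimeout(5)
client.sendto(b"ping", server_addr)

data, addr = server.recvfrom(8192)       # returned data, sender's address
server.sendto(data.upper(), addr)        # reply goes back to that address

resp, remote = client.recvfrom(8192)
print(data, resp)

client.close()
server.close()
```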
lowest level of network programming • If you know what you are doing, you can use sockets to write programs that interact with any other program on the network • Of course, the low-level details might be really hairy 90
web page: urlopen() >>> import urllib >>> u = urllib.urlopen("http://www.python.org/index.html") >>> data = u.read() >>> print data <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML ... ... >>> • urlopen() returns a file-like object • Can use standard file operations on it
a request GET /index.html HTTP/1.1 Host: www.python.org User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; Accept: text/xml,application/xml,application/xhtml+xml,text/h Accept-Language: en-us,en;q=0.5 Accept-Encoding: gzip,deflate Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7 Keep-Alive: 300 Connection: keep-alive <blank line> • Request line followed by headers • Terminated by a blank line
a response HTTP/1.1 200 OK Date: Thu, 26 Apr 2007 19:54:01 GMT Server: Apache/2.0.54 (Debian GNU/Linux) DAV/2 SVN/1.1.4 mod_py Last-Modified: Thu, 26 Apr 2007 18:40:24 GMT ETag: "61b82-37eb-5a0eb600" Accept-Ranges: bytes Content-Length: 14315 Connection: close Content-Type: text/html <HTML> ... • Response line followed by headers • Blank line followed by data
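The "status line, headers, blank line, data" layout is simple enough to pick apart by hand. The sketch below (Python 3) does so for a short hard-coded response of my own invention; it ignores many details of real HTTP, which the standard library's HTTP modules handle properly.

```python
# A made-up response; real servers separate lines with \r\n
raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Length: 5\r\n"
       b"Content-Type: text/html\r\n"
       b"\r\n"
       b"hello")

# The first blank line separates the header block from the data
head, _, body = raw.partition(b"\r\n\r\n")

status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
version, code, reason = status_line.split(" ", 2)

headers = {}
for h in header_lines:
    name, _, value = h.partition(":")
    headers[name.strip().lower()] = value.strip()   # header names are case-insensitive

print(version, code, reason)
print(headers)
print(body)
```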
a small number of request types GET POST HEAD PUT • This isn't an exhaustive tutorial • There are standardized response codes 200 OK 403 Forbidden 404 Not Found 501 Not implemented ...
a custom class... 99 from BaseHTTPServer import BaseHTTPRequestHandler,HTTPServer class MyHandler(BaseHTTPRequestHandler): def do_GET(self): ... def do_POST(self): ... def do_HEAD(self): ... def do_PUT(self): ... serv = HTTPServer(("",8080),MyHandler) serv.serve_forever() Redefine the behavior of the server by defining code for all of the standard HTTP request types
create a stand-alone server 102 from SimpleXMLRPCServer import SimpleXMLRPCServer def add(x,y): return x+y s = SimpleXMLRPCServer(("",8080)) s.register_function(add) s.serve_forever() • How to test it (xmlrpclib) >>> import xmlrpclib >>> s = xmlrpclib.ServerProxy("http://localhost:8080") >>> s.add(3,5) 8 >>> s.add("Hello","World") "HelloWorld" >>>
have to write programs that perform multiple tasks in parallel • Example : Network servers that handle multiple client connections • Modern systems have multiple CPU cores • There is interest in writing programs that can take advantage of multiple cores to get better performance (parallel processing)
class, we looked at how it is possible to create subprocesses 7 • Example: Setting up a pipe p = subprocess.Popen(['cmd'], stdin = subprocess.PIPE, stdout = subprocess.PIPE) p.stdin.write(data) # Send data p.stdin.close() # No more input result = p.stdout.read() # Read output python cmd p.stdout p.stdin stdin stdout
creates an identical process • Newly created process is a "child process" • fork() returns different values in parent/child import os pid = os.fork() if pid == 0: # Child process else: # Parent process 9 pid is 0 in child, non-zero in parent • Parent and child run independently afterwards
programmers find their way into concurrent programming by way of network programming • In order to handle multiple clients, servers must manage simultaneous network connections 10
client has its own socket connection 12 web browser web web browser server clients # server code s = socket(AF_INET, SOCK_STREAM) ... while True: c,a = s.accept() ... a connection point for clients client data transmitted on a different socket
manage multiple clients, • Server must accept multiple connections and keep all connections alive • Must actively manage all client connections • Each client may be performing different tasks 14
import os from socket import * s = socket(AF_INET,SOCK_STREAM) s.bind(("",9000)) s.listen(5) while True: c,a = s.accept() if os.fork() == 0: # Child process. Manage client ... c.close() os._exit(0) else: # Parent process. Clean up and go # back to wait for more connections c.close() • Each client is handled by a subprocess
s = socket(AF_INET,SOCK_STREAM) s.bind(("",9000)) s.listen(5) nservers = 0 while True: # Spawn servers if nservers < maxservers: if os.fork() == 0: for i in xrange(maxrequests): c,a = s.accept() # Manage client ... c.close() os._exit(0) else: nservers += 1 else: os.wait() nservers -= 1 • Server creates copies of itself in advance • A popular approach used by Apache, etc. Each server runs in this simple loop
subprocesses are covered in great detail in an Operating Systems course • There are a few important details • Every process is independent • If multiple CPUs, more than one process can run simultaneously • Processes can exchange data with each other (Interprocess Communication)
structure applications as a collection of co-operating processes that work together 22 process process process process • Each process runs independently, but sends/ receives data from other processes • Question : What are the communication options?
between two processes 24 pipe p = subprocess.Popen(['process2'], stdin=subprocess.PIPE, stdout=subprocess.PIPE) # Send data to subprocess p.stdin.write(data) # Receive data from subprocess result = p.stdout.read() Process 1 Process 2 • A pair of "files" hooked up to a subprocess
commonly used to collect the output of commands executed as subprocesses • However, a pipe can be left "open" indefinitely • With proper programming, can be used as a bi- directional communication channel for exchanging data • Terminology : This is known as a "co-process" 25
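Here is a runnable version of the pipe idea in Python 3. The child process is an assumption made for the example: a one-line Python script that upper-cases its standard input, launched with the same interpreter via sys.executable. It uses communicate(), which writes the input, closes stdin, and reads stdout in one step, avoiding the deadlocks that can occur when both pipes fill up.

```python
import subprocess
import sys

# Hypothetical child program: echo stdin back in upper case
child_code = "import sys; sys.stdout.write(sys.stdin.read().upper())"

p = subprocess.Popen([sys.executable, "-c", child_code],
                     stdin=subprocess.PIPE,
                     stdout=subprocess.PIPE)

# Send data to the child, signal end of input, and collect its output
result, _ = p.communicate(b"hello from the parent\n")

print(result)
print(p.returncode)
```

A long-lived co-process would instead keep p.stdin and p.stdout open and exchange messages incrementally, which takes more care with buffering.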
(named pipe) 26 • Example: # Creating a FIFO import os os.mkfifo("fifo_A",0666) # Reading from a FIFO f = open("fifo_A","r") data = f.read(nbytes) # Writing to a FIFO f = open("fifo_A","w",0) # Unbuffered I/O f.write(data) Process 1 Process 2 FIFO FIFO
set up elaborate communications 27 Process 1 FIFO1 Process 2 Process 3 FIFO2 FIFO3 • Each process has own FIFO for messages • Any process can send to any other process
With FIFOs, multiple processes can send data to the same target 28 Process 1 Process 2 Process 3 FIFO3 • Will cause chaos on the receiver unless you figure out some way to coordinate it
control access to the channel via file system locking or some other approach # Each process opens a lock file import fcntl f = open("/tmp/fifo","w",0) # The FIFO g = open("/tmp/fifo.lock","w") # A lock # Critical section fcntl.flock(g.fileno(),fcntl.LOCK_EX) ... f.write("Some data\n") # Write on the FIFO ... fcntl.flock(g.fileno(),fcntl.LOCK_UN) • Example: Unix
starting to see problems with concurrency • Once there are multiple processes that access shared resources, you often need to coordinate control and access • Locking and synchronization • Also : Almost none of this is "portable"
network layer 31 Process 1 Process 2 Process 3 socket socket socket • Basic idea: communication via TCP, UDP, etc. • We talked a bit about this last time
socket to make a connection from socket import * s = socket(AF_INET,SOCK_STREAM) s.connect(("some.host.com",10000)) ... # Send/receive data s.send(request) response = s.recv(10000) ... # Done. Close the connection s.close()
thought of for "network programming" • Can be used as an IPC mechanism between processes running on the same machine • Networking via the loopback interface (127.0.0.1)
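A minimal sketch of socket-based IPC over the loopback interface, written for Python 3. A thread stands in for the second process so the example is self-contained; `serve_once` and the "echo: " prefix are made up for illustration. Binding to port 0 asks the operating system for any free port.

```python
import socket
import threading

def serve_once(server):
    # Accept one connection, echo one request back with a prefix, then close
    conn, addr = server.accept()
    request = conn.recv(10000)
    conn.sendall(b"echo: " + request)
    conn.close()

# Bind to the loopback interface; port 0 means "any free port"
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

t = threading.Thread(target=serve_once, args=(server,))
t.start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))
client.sendall(b"ping")

# recv() may return partial data, so read until the server closes the connection
chunks = []
while True:
    chunk = client.recv(10000)
    if not chunk:
        break
    chunks.append(chunk)
response = b"".join(chunks)

client.close()
t.join()
server.close()
```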
the socket API to create a "pipe" s = socket(AF_UNIX,SOCK_STREAM) s.bind("/tmp/foo") s.listen(5) c,a = s.accept() # Send/receive data request = c.recv(10000) c.send(response) ... c.close() • Clients s = socket(AF_UNIX,SOCK_STREAM) s.connect("/tmp/foo") s.send(request) resp = s.recv(10000)
on the system, pipes and FIFOs are often highly optimized in the operating system • Network layer often involves more processing steps and buffering • However, programming with pipes may be more difficult (especially FIFOs)
can share memory via mmap [Diagram: Process 1 and Process 2 each hold an array backed by the same memory mapped file] • Idea here: processes share a mutable byte array • Changes immediately reflected in both processes • Highly optimized in the OS (no copying)
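a memory mapped file # Common code import mmap SIZE = 100000 # Number of bytes f = open("shared","w+b") f.seek(SIZE,0) # Expand file to desired size f.write("\n") f.flush() # File must really have this size before it is mapped # Now, memory map the file into an array m = mmap.mmap(f.fileno(),SIZE, access=mmap.ACCESS_WRITE) • This creates a shared byte array • Note: access must be passed by keyword (the third positional argument is flags on Unix and tagname on Windows)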
a memory mapped file m = mmap.mmap(f.fileno(),100000, access=mmap.ACCESS_WRITE) • Extract data from the memory array data = m[start:stop] • Store data in the memory array m[start:stop] = data # Data must exactly fit • Key point: Modifications to the array instantly appear in all shared copies of the file. Memory is shared, there is no copying/buffering.
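A minimal sketch of the shared byte array, written for Python 3. Two independent mappings of the same scratch file stand in for two processes; the path and the variable names are assumptions for illustration. It also shows the "must exactly fit" rule for slice assignment.

```python
import mmap
import os
import tempfile

SIZE = 4096
path = os.path.join(tempfile.mkdtemp(), "shared")  # Hypothetical scratch file

# Create the backing file at its full size before mapping it
with open(path, "wb") as f:
    f.write(b"\0" * SIZE)

# Two independent mappings stand in for two processes
f1 = open(path, "r+b")
f2 = open(path, "r+b")
m1 = mmap.mmap(f1.fileno(), SIZE, access=mmap.ACCESS_WRITE)
m2 = mmap.mmap(f2.fileno(), SIZE, access=mmap.ACCESS_WRITE)

m1[0:5] = b"hello"   # Store through the first mapping
seen = m2[0:5]       # Visible through the second mapping

# Slice assignment must match the slice length exactly
try:
    m1[0:5] = b"too long"
    size_error = False
except (ValueError, IndexError):
    size_error = True

m1.close()
m2.close()
f1.close()
f2.close()
```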
memory mapped regions requires very careful coordination • Again, you may have to use file-locks # Each process opens a lock file import fcntl f = open("shared","w+b") # The shared file # Critical section fcntl.flock(f.fileno(),fcntl.LOCK_EX) ... ... Some critical operation ... fcntl.flock(f.fileno(),fcntl.LOCK_UN)
to program with IPC • In practice, programs can be written in a manner where the actual IPC mechanism being used is hidden from application code. • Programming abstractions: • IPC via "files" • IPC via "messages"
For pipes : You already get a pair of files p = subprocess.Popen(['cmd'], stdin=subprocess.PIPE, stdout=subprocess.PIPE) in_f = p.stdin # You write here (the subprocess's input) out_f = p.stdout # You read here (the subprocess's output) • For sockets : Can wrap with a file-layer s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) ... in_f = s.makefile("w") # You write here out_f = s.makefile("r") # You read here
With a file-API, you just read/write streams of characters • Processes communicate by interpreting the contents of the I/O stream. • The tricky part : There is no concept of "records" or "messages" • If more than one process writes onto a single stream, then you have to figure out how to coordinate it and sort out the results
General idea : Encapsulate IPC into some sort of message-passing API ch = IPC_Channel(args) ch.send(msg) # Send a message msg = ch.receive() # Receive a message • Message passing is a long established concept • However, there are dozens (if not hundreds) of libraries related to doing it. • Each with their own slightly different API
With messages, processes send well-defined chunks of data to each other. • The absolute critical operations are • send() - Send a message somewhere • receive() - Wait for a message • Let's build a message passing library...
idea: Define an I/O "Channel" class Channel(object): ... • Implement methods such as the following c.send(msg) # Send a message c.receive() # Receive a message • A message is just a string of bytes
we want to implement message passing over a pair of file objects inf # File open for reading outf # File open for writing • Ex: Files might be from a pipe or FIFO
code to send a message class FileChannel(object): ... def send(self,msg): self.outf.write("%d\n" % len(msg)) self.outf.write(msg) self.outf.flush() • In this case, the length (size) is written on its own line, followed by the data (msg) • This approach gives us a means for "framing" the data into records that can be easily understood by the receiver
opposite of sending class FileChannel(object): ... def receive(self): size_str = self.inf.readline() size = int(size_str) msg = self.inf.read(size) return msg • Note: Would probably want to add some more robust error handling (will skip for now)
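Putting the two halves together, here is a sketch of a complete `FileChannel`, written for Python 3 with binary file objects. The constructor signature (read file first, write file second) is inferred from how the class is called elsewhere in these slides, and the `EOFError` checks are an added assumption about what "more robust error handling" might look like. The channel is looped back on itself through an OS pipe so the example is self-contained.

```python
import os

class FileChannel(object):
    def __init__(self, inf, outf):
        self.inf = inf      # Binary file open for reading
        self.outf = outf    # Binary file open for writing

    def send(self, msg):
        # Frame: decimal length on its own line, then the payload bytes
        self.outf.write(b"%d\n" % len(msg))
        self.outf.write(msg)
        self.outf.flush()

    def receive(self):
        size_str = self.inf.readline()
        if not size_str:
            raise EOFError("channel closed")
        size = int(size_str)
        msg = self.inf.read(size)
        if len(msg) != size:
            raise EOFError("truncated message")
        return msg

# Loop a channel back on itself through an OS pipe
r, w = os.pipe()
ch = FileChannel(os.fdopen(r, "rb"), os.fdopen(w, "wb"))
ch.send(b"Hello")
ch.send(b"two\nlines")      # Newlines in the payload are fine: the length says where it ends
first = ch.receive()
second = ch.receive()
```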
Use >>> import channel >>> echo_in = open("/tmp/echo_in","wb",0) >>> echo_out = open("/tmp/echo_out","rb") >>> ch = channel.FileChannel(echo_out,echo_in) >>> ch.send("Hello") >>> ch.receive() 'Client received: Hello' >>> • Note : Order in which FIFOs are opened is critical here • Client must already be started
Message Passing is simple • Just a few basic primitives • send, receive • Can be used to build more advanced IPC programming abstractions • Remote procedure call • Distributed objects (e.g., CORBA, etc.)
Long history of message passing • It's an established programming technique • Algorithms, properties, pitfalls are known • Scalable performance • Thousands of processors • Supercomputers
languages can be extremely powerful when mixed with message passing • Can build systems based on remote procedure call/distributed objects, etc. • Let's look at a simple example...
languages allow objects to be "serialized" into strings • For example : pickle module in Python import pickle bytes = pickle.dumps(obj) # Turn obj into bytes obj = pickle.loads(bytes) # Turn bytes back to obj
add an object serialization/unserialization step on each end of a communication channel [Diagram: Process 1 serializes, Process 2 unserializes] • Let's look at that a little further...
subprocess : An Adder # adder.py import sys, channel fch = channel.FileChannel(sys.stdin,sys.stdout) pch = channel.PickleChannel(fch) while True: x,y = pch.receive() pch.send(x+y) • Receive two objects as input • Adds and sends the result back
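The adder relies on a `PickleChannel` class. One possible sketch of it, written for Python 3, is shown here; it is an assumption about what such a class might look like, not necessarily the course's own implementation. It wraps any object with `send()`/`receive()` and adds a pickle step on each end. A thread and two OS pipes stand in for the subprocess so the example runs on its own.

```python
import os
import pickle
import threading

class FileChannel(object):
    # Minimal length-prefixed framing over a pair of binary files
    def __init__(self, inf, outf):
        self.inf = inf
        self.outf = outf

    def send(self, msg):
        self.outf.write(b"%d\n" % len(msg))
        self.outf.write(msg)
        self.outf.flush()

    def receive(self):
        size = int(self.inf.readline())
        return self.inf.read(size)

class PickleChannel(object):
    # Hypothetical wrapper: serialize on send, unserialize on receive
    def __init__(self, ch):
        self.ch = ch

    def send(self, obj):
        self.ch.send(pickle.dumps(obj))

    def receive(self):
        return pickle.loads(self.ch.receive())

def adder(pch):
    # Same body as the adder loop, but handles a single request
    x, y = pch.receive()
    pch.send(x + y)

# Two OS pipes give one channel in each direction
r1, w1 = os.pipe()
r2, w2 = os.pipe()
client = PickleChannel(FileChannel(os.fdopen(r2, "rb"), os.fdopen(w1, "wb")))
server = PickleChannel(FileChannel(os.fdopen(r1, "rb"), os.fdopen(w2, "wb")))

t = threading.Thread(target=adder, args=(server,))
t.start()
client.send((2, 3))
result = client.receive()
t.join()
```

Unpickling data can execute arbitrary code, so a channel like this should only be used between processes that trust each other.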
This is an example of remote procedure call • Have a subprocess/coprocess that implements such functionality • Another process sends data (parameters) and receives results • Can package this up in more exotic ways
passing is a very powerful technique for setting up concurrent programs • Easily adapted to different I/O schemes • Can be extended across the network • Quite portable if done right
[Diagram: program.py runs its statements, calls create thread(foo), and continues with its own statements while def foo() runs its statements concurrently until it returns or exits] • Key idea: Thread is like a little subprocess that runs inside your program
create a thread, you define a class import time import threading class CountdownThread(threading.Thread): def __init__(self,count): threading.Thread.__init__(self) self.count = count def run(self): while self.count > 0: print "Counting down", self.count self.count -= 1 time.sleep(5) return • Inherit from Thread and redefine run()
may be necessary to wait for a thread t.start() # Launch a thread ... # Do other work ... # Wait for thread to finish t.join() # Waits for thread t to exit • t.join([timeout]) • Can only be used by other threads (a thread can't join itself)
daemon thread (detached thread) t.setDaemon(True) # Must be called before t.start() • Daemon threads typically run forever • Like a background thread • Destroyed when the process exits • Not normally joined • Often used when creating worker/client threads
share common data • Extreme care if accessing shared data • One thread must not modify data while another thread is reading it • Otherwise, will get a "race condition"
shared object x = 0 • And two threads Thread-1 -------- ... x = x + 1 ... Thread-2 -------- ... x = x - 1 ... • Possible that the value will be corrupted if one thread modifies it just after the other has read it.
threads Thread-1 -------- ... x = x + 1 ... Thread-2 -------- ... x = x - 1 ... • Low level interpreter code Thread-1 -------- push(x) push(1) add x = pop() Thread-2 -------- push(x) push(1) sub x = pop() • If a context switch to Thread-2 happens in the middle of Thread-1's sequence, Thread-1 resumes with a stale value and overwrites the update by Thread-2
a real concern or simply theoretical? >>> x = 0 >>> def foo(): ... global x ... for i in xrange(100000000): x += 1 ... >>> def bar(): ... global x ... for i in xrange(100000000): x -= 1 ... >>> t1 = threading.Thread(target=foo) >>> t2 = threading.Thread(target=bar) >>> t1.start(); t2.start() >>> t1.join(); t2.join() >>> x -834018 >>> • ??? (the result should have been 0)
locks m = threading.Lock() # Create a lock m.acquire() # Acquire the lock m.release() # Release the lock • If another thread tries to acquire the lock, it blocks until the lock is released • Only one thread may hold the lock
Commonly placed around critical sections x = 0 x_lck = threading.Lock() def foo(): global x x_lck.acquire() x += 1 # Critical section x_lck.release() def bar(): global x x_lck.acquire() x -= 1 # Critical section x_lck.release()
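A sketch of the same critical-section pattern, written for Python 3, using the lock as a context manager. The `with` statement acquires the lock on entry and releases it on exit, even if the body raises. The `Counter` class and `adjust` function are made-up names for illustration.

```python
import threading

class Counter(object):
    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

    def add(self, delta):
        with self.lock:          # Critical section
            self.value += delta

def adjust(counter, delta, count):
    for _ in range(count):
        counter.add(delta)

counter = Counter()
t1 = threading.Thread(target=adjust, args=(counter, +1, 100000))
t2 = threading.Thread(target=adjust, args=(counter, -1, 100000))
t1.start(); t2.start()
t1.join(); t2.join()
final = counter.value            # Increments and decrements cancel out
```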
Mutex Lock m = threading.RLock() # Create a lock m.acquire() # Acquire the lock m.release() # Release the lock • Can be acquired multiple times by same thread • Semaphores m = threading.Semaphore(n) # Create a semaphore m.acquire() # Acquire the lock m.release() # Release the lock • Lock based on a counter • Won't cover in detail here
between threads e = threading.Event() e.isSet() # Return True if event set e.set() # Set event e.clear() # Clear event e.wait() # Wait for event • Common use Thread 1 -------- ... # Wait for an event e.wait() ... # Respond to event Thread 2 -------- ... # Trigger an event (notifies Thread 1) e.set()
work with multiple threads e = threading.Event() [Diagram: Thread 1, Thread 2, and Thread 3 are each blocked in e.wait(); Thread X calls e.set()] • Setting the event unblocks all waiting threads
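A minimal sketch of one event releasing several waiting threads, written for Python 3. The names `waiter`, `woken`, and the thread labels are made up for illustration.

```python
import threading

e = threading.Event()
woken = []
woken_lock = threading.Lock()

def waiter(event, name):
    event.wait()                 # Blocks until some thread calls event.set()
    with woken_lock:
        woken.append(name)

threads = [threading.Thread(target=waiter, args=(e, n)) for n in ("T1", "T2", "T3")]
for t in threads:
    t.start()

woken_before_set = len(woken)    # No waiter can get past wait() before set()
e.set()                          # Unblocks all three waiters
for t in threads:
    t.join()
```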
define parts of program that can run concurrently (may depend on algorithm) • Must identify all shared data structures • Must protect critical sections with locks • Synchronize threads with events as needed • Must cross fingers and hope that it works
must use threads, consider using the approach which causes the least amount of peril and pain • Independent threads that communicate via message queues [Diagram: Thread 1 sends to Thread 2 through a Queue]
in Python • Creating a Queue with a maximum number of elements import Queue q = Queue.Queue(maxsize) • To create an unbounded Queue import Queue q = Queue.Queue()
• Producer threads while True: # Produce an item ... # Send to the consumer in_q.put(item) • Consumer thread def consume_items(): while True: item = in_q.get() # Consume the item ...
threading.Thread.__init__(self) self.in_q = Queue.Queue() def send(self,item): self.in_q.put(item) def run(self): while True: item = self.in_q.get() # Process item ... • An alternative formulation is to structure consumers as objects you "send" items to • This ties threads to "message passing"
• No locks. Queue is thread-safe • No shared data. Producer/consumer only communicate via queue. • Strikingly similar to message passing • Code is simple
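A sketch of a consumer thread structured as an object you "send" items to, written for Python 3 (where the module is named `queue` rather than `Queue`). The class name `Consumer`, the squaring step, and the use of `None` as a shutdown sentinel are assumptions added for illustration.

```python
import queue
import threading

class Consumer(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        self.in_q = queue.Queue()
        self.results = []

    def send(self, item):
        self.in_q.put(item)

    def run(self):
        while True:
            item = self.in_q.get()
            if item is None:     # Sentinel (an assumption): no more items
                break
            self.results.append(item * item)   # Stand-in for "process item"

c = Consumer()
c.start()
for n in range(5):               # The main thread acts as the producer
    c.send(n)
c.send(None)
c.join()
```

The producer never touches `results` until after `join()`, so the queue is the only thing the two threads share while both are running.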
sometimes considered for applications where there is massive concurrency (e.g., server with thousands of clients) • However, threads are fairly expensive • Don't improve performance (context-switching) • Incur considerable memory overhead (each thread has its own C stack, etc.)
languages often make very poor use of threads • The interpreters themselves are often not thread-safe (or are locked down in some way) • Example : Global interpreter lock in Python • As a result, even if you use threads, programs won't run on more than one CPU
event driven systems, programs get built as a collection of functions/objects that react to different events • Classic example : GUIs • However, the same approach can be applied to networks, file I/O, etc.
• Make a button (using Tk) >>> def response(): ... print "You did it!" ... >>> from Tkinter import Button >>> x = Button(None,text="Do it!",command=response) >>> x.pack() >>> x.mainloop() • Clicking on the button.... You did it! You did it! ...
used for implementing co-operative multitasking def do_foo(): while True: # Various statements .... ... (yield) # Yield control to someone else ... • Basic idea : Functions run until they explicitly yield to some other function • Only one thing runs at once, but it gives the illusion of concurrency
example def countdown(n): while n > 0: print "T-minus", n (yield) n = n - 1 • This is like a generator, but we're not actually generating any values >>> c = countdown(10) >>> c.next() T-minus 10 >>> c.next() T-minus 9 >>>
co-routines c1 = countdown(20) c2 = countdown(10) procs = [c1,c2] while procs: for p in list(procs): # Iterate over a copy so remove() is safe try: p.next() except StopIteration: procs.remove(p) • This is an outer loop that is "scheduling" the different co-routines (round-robin)
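A runnable sketch of the round-robin scheduler, written for Python 3 (so `next(p)` replaces `p.next()`). As assumptions for illustration, this version of `countdown` stops when it reaches zero, takes a name, and appends to a log instead of printing, so the interleaving can be inspected afterwards.

```python
def countdown(name, n, log):
    # Each pass records one line, then yields control back to the scheduler
    while n > 0:
        log.append("%s T-minus %d" % (name, n))
        yield
        n -= 1

def run_all(procs):
    procs = list(procs)
    while procs:
        for p in list(procs):        # Iterate over a copy so removal is safe
            try:
                next(p)              # Run the co-routine up to its next yield
            except StopIteration:
                procs.remove(p)      # Finished: drop it from the schedule

log = []
run_all([countdown("A", 2, log), countdown("B", 3, log)])
```

The two countdowns take turns one step at a time; once "A" finishes, "B" keeps running alone until it is done.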