Discovering Python

Presentation. PyCon 2014. Montreal. Conference video at https://www.youtube.com/watch?v=RZ4Sn-Y7AP8

David Beazley

April 11, 2014

Transcript

  1. 2.

    In 2005... I was hired to go look at 1.5 TB (yes, that's terabytes) of source code sitting in a secret vault.
  2. 3.

    Six Years Later... I testified in US district court about:

    - Concurrency
    - Threads
    - Event loops
    - Interrupts

    Good god!
  3. 5.

    Disclaimer

    Everything in this talk actually happened
    Names and details have been changed
    Non-disclosure (I'd have to kill you)
    All exhibits/photos are fictional
    I know nothing, you'll learn nothing
  4. 13.

    Let's Talk Patents

    A hot-button issue
    Myth: All patent lawsuits are trolls
    Myth: All patent lawsuits involving software are purely about software
    Fact: Patent litigation is hell
  5. 14.

    Patent Litigation Timeline

    You hear about patents a lot
    But what actually happens?
    This talk is about that!

    Initial Complaint
    Fact Discovery (9-12 months)
    Claim Construction
    Summary Judgement
    Trial
  6. 21.

    What Happens

    You are dropped into a firestorm
    No technical guidance
    Because no one knows anything... that's why they called you!
  7. 23.

    Quick Learning

    The Invention

    7. The system of claim 5 or 6, wherein the display and input means comprises displays means and input means, the input means being connected to the central processing unit, the display means being connected to the slithering means and the central processing unit, the display means being arranged to display the displays and the input means transferring the input responses to the central processing unit, and wherein the display and input task means further comprises display task means and input task means, the display task means being arranged to control the display means by transferring display commands to, and receiving the display responses from, the display means, the input task means being arranged to control the input means by transferring input commands to, and receiving input responses from, the input means.

    The Patent
  8. 24.
  9. 26.

    Patent Compilation

    Does the patent even work?
    Would the code compile?
    Can it be explained to others?
    You'd better find out
    How?
  10. 28.

    Enter Python

    # definitions: page number -> names defined on that page
    definitions = {
        450: ['spam', 'grok'],
        451: ['foo'],
        452: ['bar'],
    }

    # calls: page number -> names called on that page
    calls = {
        123: ['blah', 'read_input', 'send_msg'],
        124: ['spam', 'foo', 'bar'],
    }

    Entered by hand (from paper copy)
    A long weekend
  11. 29.

    Just Link It

    from collections import defaultdict

    # Map each defined name to the page that defines it
    symbols = {name: pageno
               for pageno, defns in definitions.items()
               for name in defns}

    # Calls to names that are not defined anywhere
    unresolved = [(name, pageno)
                  for pageno, clist in calls.items()
                  for name in clist
                  if name not in symbols]

    missing = defaultdict(list)
    for name, pageno in unresolved:
        missing[name].append(pageno)

    for item in missing.items():
        print("Missing: %s on pages %s" % item)
  12. 41.
  13. 42.

    From: Guido van Rossum <guido@python.org>
    Date: Dec 9 23:21:42 CET 2011
    Subject: [Python-Dev][PATCH] Adding braces to __future__

    For me, if I had to design a new language today, I would probably use braces, not because they're better than whitespace, but because pretty much every other language uses them, and there are more interesting concepts to distinguish a new language. That said, I don't regret that Python uses indentation, and the rest I have to say about the topic would violate the above request.

    --
    --Guido van Rossum (python.org/~guido)

    Emails

    Smoking gun?!?
  14. 45.

    We Have Their Software

    It's highly proprietary
    You're the only one approved to look at it
    It's actually sitting over in a vault
    AKA: Software escrow
  15. 46.
  16. 47.
  17. 52.

    What's There?

    A collection of large hard drives: D:\ E:\ F:\ G:\
    Each containing copies of CDs (>1.5 TB total)
    No documentation or organization
  18. 57.
  19. 58.

    Perspective

    Software archive for the infringing invention

    Embedded Microcontroller System
    Display Module
    Keypads
    7-segment
    A PC
    Custom PCI Board
    Second PC
  20. 59.

    Perspective

    A PC
    Custom PCI Board
    Second PC
    Custom Router

    Actually, more of a distributed system
  21. 60.

    Perspective

    The software is "all stack" (a million lines of code)

    C++/Win32, C/ASM, DCOM/CORBA, C/ASM, C/ASM, VB, Java RMI, RTOS
  22. 61.

    Enter Time

    [Timeline figure: years 90 through 01, with labels OS/2, WinNT, V1 through V9, and RevA through RevC]

    • Weekly snapshots (52 x 15 years = 780 versions)
    • Multiple hardware revisions/configurations
    • Operating system changes/deployment changes
  23. 62.

    Enter Customers

    • Dozen major customers (corporations)
    • Customer-specific system modifications
    • Think "skins" on main system
    • Hundreds of interlocking versions

    Base System Version 2.51
    ACME Vers 1.23
    Buy N Large Vers 4.22
    Tyrell Corp Vers 3.43
  24. 69.

    Constraints

    No working hardware setup (can't run code)
    No working build environment (can't compile)
    No tech support (can't call anyone)
    Fragmentary documentation (if any)
  25. 70.
  26. 72.

    Python? What? How?

    Unknown: How did Python get placed on the machine in the vault?
    I have NO idea
    A new IBM PC with only "approved tools"
    Best Guess: Used by an IBM OEM tool
    (Yet, there it was, python... in the Windows path no less).
  27. 73.

    Desert Island Coding

    Admit it, you've probably thought Python might be a good choice
    Batteries included FOR THE WIN!
  28. 74.

    Strategy

    Create a fact discovery environment from scratch in the vault
    I was destined for this job... I wrote the book
  29. 76.

    Goals

    What was provided?
    Is it complete?
    How does the code work?
    Where is the patent in the code?
  30. 77.

    The Horror! The Horror!

    Reverse engineering the entire build environment
    Makefiles, config files, etc.
    Identifying all major software components
    Examples: .exe files, .DLLs, plugins, etc.
    Sorting out version histories
  31. 78.

    MKDEP= mkdep
    SHELL= /bin/sh

    # === Fixed definitions ===

    OBJS= \
        bltinmodule.o \
        ceval.o cgensupport.o compile.o \
        errors.o \
        frozen.o \
        getargs.o getcompiler.o getcopyright.o getm
        getplatform.o getversion.o graminit.o \
        import.o importdl.o \
        marshal.o modsupport.o mystrtoul.o \
        pythonrun.o \
        sigcheck.o structmember.o sysmodule.o \
        traceback.o \
        $(LIBOBJS)

    LIB= libPython.a

    Sources
    Library

    You try to figure it out
  32. 80.

    Basic Tooling

    Reimplement Unix: find, grep, wc, diff, tail, head
    Because that Windows search mutt must die
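
A minimal sketch of what a few of these reimplemented tools might look like, using only the standard library. The function names and signatures (`grep`, `wc`, `head`) are illustrative assumptions, not the code written in the vault.

```python
import re
from itertools import islice

def grep(pattern, filename):
    """Yield (line number, line) for each line matching the regex pattern."""
    pat = re.compile(pattern)
    # errors='replace' keeps the scan going on files with odd encodings
    with open(filename, errors='replace') as f:
        for lineno, line in enumerate(f, 1):
            if pat.search(line):
                yield lineno, line.rstrip('\n')

def wc(filename):
    """Return (lines, words, bytes) for a file, roughly like Unix wc."""
    lines = words = nbytes = 0
    with open(filename, 'rb') as f:
        for line in f:
            lines += 1
            words += len(line.split())
            nbytes += len(line)
    return lines, words, nbytes

def head(filename, n=10):
    """Return the first n lines of a file."""
    with open(filename, errors='replace') as f:
        return [line.rstrip('\n') for line in islice(f, n)]
```

Because `grep` is a generator, it can be chained with other tools or stopped early without reading the rest of the file.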
  33. 82.

    Example: diff

    # diff.py
    import sys, difflib

    def diff(fromfile, tofile):
        fromlines = open(fromfile).readlines()
        tolines = open(tofile).readlines()
        diff = difflib.context_diff(fromlines, tolines, fromfile, tofile)
        sys.stdout.writelines(diff)
  34. 83.

    Interactive Shell

    >>> cd('pycode')
    >>> pwd()
    D:\Files\pycode
    >>> diff('Python-2.6/Lib/collections.py',
    ...      'Python-2.6.2/Lib/collections.py')
    *** Python-2.6/Lib/collections.py
    --- Python-2.6.2/Lib/collections.py
    ***************
    *** 103,109 ****
      # where the named tuple is created. Bypass this step in where
      # sys._getframe is not defined (Jython for example).
      if hasattr(_sys, '_getframe'):
    !     result.__module__ = _sys._getframe(1).f_globals['__n
      return result
    --- 103,109 ----
      # where the named tuple is created. Bypass this step in where
  35. 84.

    More Than Reinvention

    Actually implementing an entire workflow
    Building up layers of tools/analyses
    Not unlike what is done with IPython NB
    Can't overstate Python awesomeness
  36. 85.

    Example

    import os

    def allfiles(topdir):
        # Lazily yield (directory, filename) for every file under topdir
        return ((path, filename)
                for path, dirs, files in os.walk(topdir)
                for filename in files)

    >>> files = allfiles('AllPython')
    >>> next(files)
    ('AllPython/0/python-0.9.1', 'python.man')
    >>> next(files)
    ('AllPython/0/python-0.9.1', 'README')
    >>>
  37. 86.

    Example

    def filetypes(topdir):
        from collections import Counter
        from pprint import pprint
        c = Counter(os.path.splitext(name)[1]
                    for _, name in allfiles(topdir))
        pprint(c.most_common())

    >>> filetypes('AllPython')
    [('.py', 125277),
     ('.c', 27200),
     ('', 17010),
     ('.rst', 15439),
     ('.h', 14782),
     ('.tex', 12257),
     ...

    allfiles()
  38. 87.

    Example

    def find(topdir, pattern):
        from fnmatch import fnmatch
        return ((path, name)
                for path, name in allfiles(topdir)
                if fnmatch(name, pattern))

    >>> f = find('AllPython', '*.py')
    >>> next(f)
    ('AllPython/0/python-0.9.1/demo/scripts', 'findlinksto.py')
    >>> next(f)
    ('AllPython/0/python-0.9.1/demo/scripts', 'mkreal.py')
    >>> next(f)
    ('AllPython/0/python-0.9.1/demo/scripts', 'ptags.py')
    >>>

    allfiles() filetypes()
  39. 88.

    Example

    def find_versions(topdir):
        import re
        # Each directory holding pgen.c marks one Python source tree
        for path, _ in find(topdir, 'pgen.c'):
            pypath, _ = os.path.split(path)
            version = re.search(r'-(\w+\.\w+(\.\w+)?)$', pypath).group(1)
            yield version, pypath

    allfiles() filetypes() find()
  40. 89.

    Example

    >>> vers = find_versions('AllPython')
    >>> next(vers)
    ('0.9.1', 'AllPython/0/python-0.9.1')
    >>> next(vers)
    ('1.0.1', 'AllPython/1/python-1.0.1')
    >>>

    allfiles() filetypes() find() find_versions()
  41. 93.

    Pile it Higher and Higher

    You keep building abstractions
    Reorganized file layer
    Different views (version, date, prod, debug, etc.)

    [Diagram labels: HDDs, Snapshots, "Virtual File System", View, View, View, .csv]
  42. 94.

    Example: Versioning

    def versions(filename):
        import os
        import hashlib
        from collections import defaultdict
        manifest = read_manifest()
        groups = defaultdict(list)
        # Group versions whose copy of the file has the same MD5 digest
        for vers, path in manifest.items():
            fullname = os.path.join(path, filename)
            if os.path.exists(fullname):
                digest = hashlib.new('md5')
                digest.update(open(fullname, 'rb').read())
                groups[digest.digest()].append(vers)
        return sorted([sorted(g) for g in groups.values()])
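
The slides never show `read_manifest()`. A minimal sketch, assuming the manifest is a two-column CSV file of `version,path` rows (the file name `manifest.csv` and the column layout are hypothetical):

```python
import csv

def read_manifest(filename='manifest.csv'):
    """Return {version: path} from a two-column CSV file (assumed layout)."""
    manifest = {}
    with open(filename, newline='') as f:
        for row in csv.reader(f):
            if len(row) < 2:          # skip blank or malformed rows
                continue
            version, path = row[0].strip(), row[1].strip()
            manifest[version] = path
    return manifest
```

Keeping the version-to-directory mapping in a flat file means every higher-level tool can share one view of where each snapshot lives.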
  43. 95.

    Example: Versioning

    >>> for x in versions('Python/thread.c'):
    ...     print(x)
    ...
    ['1.0.1']
    ['1.1']
    ['1.2', '1.3']
    ['1.4']
    ['1.5', '1.5.1']
    ['1.5.2', '1.5.2c1']
    ['1.5.2b1', '1.5.2b2']
    ['1.6', '1.6b1']
    ['2.0', '2.0.1', '2.0c1', '2.1', '2.1.1', '2.1.2', '2.1.3']
    ...
  44. 96.

    Navigational Tooling

    Query tools for going to any version/file

    >>> view('2.7.3', 'Python/ceval.c')
    >>>

    Typically launch Windows tools (e.g., Visual Studio)

    [Diagram labels: "Virtual File System", View, View, View, Navigational Tools]
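
One way such a navigation helper could work, as a sketch only. Unlike the `view()` call on the slide, this version takes the manifest (a version-to-directory dict) as an explicit argument, and the `viewer` parameter and `resolve()` helper are hypothetical names.

```python
import os
import subprocess

def resolve(manifest, version, filename):
    """Return the on-disk path of filename in the given version."""
    path = os.path.join(manifest[version], filename)
    if not os.path.exists(path):
        raise FileNotFoundError(path)
    return path

def view(manifest, version, filename, viewer=None):
    """Open one file from one version in an external tool."""
    path = resolve(manifest, version, filename)
    if viewer is not None:
        subprocess.Popen([viewer, path])   # e.g. an editor or IDE executable
    elif hasattr(os, 'startfile'):
        os.startfile(path)                 # Windows: open with the associated program
    else:
        print(path)                        # fallback: just report the location
    return path
```

Separating `resolve()` from `view()` keeps the path lookup usable by other tools that never need to launch anything.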
  45. 97.

    Timeline/Inventory Tools

    Link together every version of every component found
    Development timelines
    Official vs. Debug releases

    [Timeline figure: versions V1 through V4 with release markers]
  46. 98.

    Commentary

    I don't know if the opposing side actually expected us to figure out their code
    We knew almost everything about everything
    Python FOR.THE.WIN.
  47. 99.

    How Does Code Work?

    Better make sure you understand everything about the code
    Software architecture
    Interaction between components
    Underlying algorithms
  48. 100.

    Problem: Code Sucks

    Nobody wants to read code
    Better: Design documents, specs
    Nobody wants to give you that
    "Go read the source."
  49. 101.

    Let's Go Fishing

    Interesting files
    Code comments
    TPS reports
    PDF, DOC, RTF, HTML, TXT

    // See: Important Document
    Fixed bug. See important specification.
  50. 102.

    Back and Forth

    An obscure find
    /* See FS-6541-8v2.0 for details */
    A request to attorneys
    "Tell opposing counsel we can't find FS-6541-8v2.0"
    A few silent days pass....
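
Fishing for references like this can be automated. A minimal sketch, assuming document IDs look roughly like `FS-6541-8v2.0` (the `DOC_ID` pattern is a guess at that shape, not a real numbering scheme) and that comments are C-style. The regex-based comment matching is deliberately naive and does not account for comment markers inside string literals.

```python
import re
from collections import defaultdict

# Hypothetical pattern for document IDs such as FS-6541-8v2.0
DOC_ID = re.compile(r'\b[A-Z]{2,}-\d[\w.\-]*\w')
# C-style block comments and // line comments
COMMENT = re.compile(r'/\*.*?\*/|//[^\n]*', re.DOTALL)

def doc_references(source):
    """Yield document IDs mentioned inside comments of C-like source text."""
    for m in COMMENT.finditer(source):
        yield from DOC_ID.findall(m.group())

def reference_index(sources):
    """Map each document ID to the sorted list of files that mention it.

    sources is an iterable of (filename, text) pairs.
    """
    index = defaultdict(set)
    for filename, text in sources:
        for doc in doc_references(text):
            index[doc].add(filename)
    return {doc: sorted(files) for doc, files in index.items()}
```

The resulting index gives a list of documents to ask for, along with the files that prove each one exists.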
  51. 103.
  52. 104.
  53. 105.
  54. 106.

    Casting a Wide Net

    Search for documents is far and wide
    Software change notices
    Unrelated software (peripheral devices)
    Emails
    The web (catalogs, manuals, job postings, etc.)
    Analogy: Pulling on a loose thread...
  55. 107.

    Commentary

    You're learning the invention from scratch
    Reading other people's code
    You're teaching attorneys about it
    The other side doesn't want you to succeed
    You will learn A LOT in this exercise
  56. 113.
  57. 114.
  58. 115.
  59. 116.

    Some Lessons Learned

    SUCKS          ROCKS
    C++            Assembly code
    Asynchronous   Threads
    Objects        Functions
    Makefiles      IDEs
    CASE Tools     Humans
    1990s          1970s
    UML            Words

    (of course this is just my opinion, I could be wrong)
  60. 117.

    Speaking of Attorneys

    Do the "facts" support patent infringement?
    Does it look like it infringes?
    Can it be proven that it infringes?
    (Let the game begin)
  61. 118.
  62. 119.
  63. 120.

    Remember this?

    7. The system of claim 5 or 6, wherein the display and input means comprises displays means and input means, the input means being connected to the central processing unit, the display means being connected to the slithering means and the central processing unit, the display means being arranged to display the displays and the input means transferring the input responses to the central processing unit, and wherein the display and input task means further comprises display task means and input task means, the display task means being arranged to control the display means by transferring display commands to, and receiving the display responses from, the display means, the input task means being arranged to control the input means by transferring input commands to, and receiving input responses from, the input means.

    The Patent

    What does this claim mean? (let's rumble)
  64. 121.
  65. 122.

    Defining Claim Terms

    7. The system of claim 5 or 6, wherein the display and input means comprises displays means and input means, the input means being connected to the central processing unit, the display means being connected to the slithering means and the central processing unit, the display means being arranged to display the displays and the input means transferring the input responses to the central processing unit, and wherein the display and input task means further comprises display task means and input task means, the display task means being arranged to control the display means by transferring display commands to, and receiving the display responses from, the display means, the input task means being arranged to control the input means by transferring input commands to, and receiving input responses from, the input means.

    The Patent

    Term                      Plaintiff   Defendant
    central processing unit
    display means
  66. 123.

    Claim Construction

    Claim terms have to be supported by reality
    If not, it's game over
    A lot of attorney/expert consultation
    Problem: very specific facts and structure
    File: Widget/foo.c, lines 230-255.
    Requires a deep dive
  67. 124.

    Problem

    Matching claims to 800 versions of a million-line program
    Pick one version? Which one?
    Match them all?
  68. 125.
  69. 126.
  70. 127.
  71. 128.

    /* source.c */
    void grok() {
        if (spam) {
            foo();
            bar();
        }
        ...
    }
    void blah() {
        ...
    }

    Fragment
    file: source.c
    start: 'void grok()'
    end: 'void blah()'

    Snapshots (>800)
    Global fragment search across all versions
  72. 129.

    /* source.c */
    void grok() {
        if (spam) {
            foo();
            bar();
        }
        ...
    }
    void blah() {
        ...
    }

    Fragment
    file: source.c
    start: 'void grok()'
    end: 'void blah()'

    Snapshots (>800)

    Ver1
    void grok() {
        if (spam) {
            foo();
            bar();
        }
    }
    void blah() {

    Ver2
    void grok() {
        if (spam) {
            foo();
            bar(x);
        }
    }
    void blah() {

    Ver3
    void grok() {
        if (spam) {
            new_foo();
            bar(x);
        }
    }
    void blah() {
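
The fragment idea (a file plus a start and end marker, searched across every snapshot) can be sketched as follows. This is an illustration under assumptions, not the tool used in the vault: snapshots are passed in as a dict of version to file text so the example stays self-contained, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict

def extract_fragment(text, start, end):
    """Return the text from the first `start` marker up to (not including)
    the next `end` marker, or None if either marker is missing."""
    i = text.find(start)
    if i < 0:
        return None
    j = text.find(end, i + len(start))
    if j < 0:
        return None
    return text[i:j]

def normalize(fragment):
    """Collapse whitespace so pure reformatting does not count as a change."""
    return ' '.join(fragment.split())

def fragment_history(snapshots, start, end):
    """Group versions by the content of one fragment.

    snapshots maps version -> full text of the file in that version.
    Returns a sorted list of (versions, fragment text), one per distinct fragment.
    """
    groups = defaultdict(list)
    texts = {}
    for version, text in snapshots.items():
        frag = extract_fragment(text, start, end)
        if frag is None:
            continue                      # fragment absent in this version
        key = hashlib.sha256(normalize(frag).encode()).hexdigest()
        groups[key].append(version)
        texts.setdefault(key, frag)       # keep the first text seen per group
    return sorted((sorted(v), texts[k]) for k, v in groups.items())
```

Hashing the normalized fragment is what collapses hundreds of snapshots into a handful of distinct versions of the code that matters for a claim.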
  73. 130.

    Big Picture

    Reduce a massive data set to something sane
    "This claim matches this structure in the code. There have only been six versions of this code over 15 years. Here are the six versions."
    Keep in mind: All this happening in the vault
    Big collection of fragment histories
  74. 131.
  75. 132.
  76. 133.

    PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON

    Python makes the impossible possible (even Python 3)
  77. 134.

    Final Thoughts

    If you get the chance to do this, do it!
    You will learn A LOT!
    Would I want to do it again? Not sure.
  78. 136.
  79. 137.

    My End Game

    I learned a lot about generator functions
    Ultimately a well-known PyCon tutorial...
  80. 138.

    Postscript: Expert Report

    You may be asked to write an expert report
    Outlines all factual findings
    Ties facts to patent claims
    A scientific document
    It's a document that WILL be read
  81. 139.

    Postscript: Deposition

    You
    A room of attorneys
    Opposing expert
    Court reporter
    Videographer
    8 hours
    It will be one of the most intense, surreal, awesome/worst experiences of your whole life.
  82. 140.

    Postscript: Court Testimony

    Like deposition, but dialed up to 11
    Twice as many attorneys, more experts
    Judge & clerks
  83. 141.