Discovering Python

Presentation. PyCon 2014. Montreal. Conference video at https://www.youtube.com/watch?v=RZ4Sn-Y7AP8

David Beazley

April 11, 2014

Transcript

  1. Discovering Python David Beazley (@dabeaz) http://www.dabeaz.com PyCon'2014 Montreal

  2. In 2005... ... I was hired to go look at

    1.5 TB (yes, that's Terabytes) of source code sitting in a secret vault.
  3. Six Years Later... I testified in US district court about:

    - Concurrency - Threads - Event loops - Interrupts Good god!
  4. Discovering with Python (or what happens when Python is brought

    into the ring of a legal battle)
  5. Disclaimer Everything in this talk actually happened Names and details

    have been changed Non-disclosure (I'd have to kill you) All exhibits/photos are fictional I know nothing, you'll learn nothing
  6. Meet Alice

  7. Alice Meet Bob

  8. Alice Bob "No, I'll send YOU a message!"

  9. Alice Bob Bob's Attorney

  10. Alice Bob Bob's Attorney "Bwhahahaha!" Patent Infringement

  11. Alice Bob Bob's Attorney "Prepare to die!" Alice's Attorney

  12. Alice Bob Bob's Attorney "Prepare to die!" Alice's Attorney "Bring

    it!"
  13. Let's Talk Patents A hot-button issue Myth: All patent lawsuits

    are trolls Myth: All patent lawsuits involving software are purely about software Fact: Patent litigation is hell
  14. Patent Litigation Timeline You hear about patents a lot But

    what actually happens? This talk is about that! Initial Complaint Fact Discovery (9-12 months) Claim Construction Summary Judgement Trial
  15. Fact Discovery Bob "Obvious Infringement"

  16. Fact Discovery Alice "Obviously Different"

  17. Fact Discovery Bob's Attorney Alice's Attorney Facts

  18. "Just the facts, ma'am" Enter: Fact Expert Technical expert Unbiased

    party Privileged Works with legal
  19. Reality Bob's Attorney Bob's Coworkers Fact Expert

  20. The Team Bob's Attorney Bob's Coworkers Fact Expert Me

  21. What Happens You are dropped into a firestorm No technical

    guidance Because no one knows anything... that's why they called you!
  22. Quick Learning The Invention

  23. Quick Learning The Invention 7. The system of claim 5

    or 6, wherein the display and input means comprises displays means and input means, the input means being connected to the central processing unit, the display means being connected to the slithering means and the central processing unit, the display means being arranged to display the displays and the input means transferring the input responses to the central processing unit, and wherein the display and input task means further comprises display task means and input task means, the display task means being arranged to control the display means by transferring display commands to, and receiving the display responses from, the display means, the input task means being arranged to control the input means by transferring input commands to, and receiving input responses from, the input means. The Patent
  24. Quick Learning The Invention 7. The system of claim 5

    or 6, wherein the display and input means comprises displays means and input means, the input means being connected to the central processing unit, the display means being connected to the slithering means and the central processing unit, the display means being arranged to display the displays and the input means transferring the input responses to the central processing unit, and wherein the display and input task means further comprises display task means and input task means, the display task means being arranged to control the display means by transferring display commands to, and receiving the display responses from, the display means, the input task means being arranged to control the input means by transferring input commands to, and receiving input responses from, the input means. The Patent
  25. Invention has some code - 600 pages C - PDF

    - 1989
  26. Patent Compilation Does the patent even work? Would the code

    compile? Can it be explained to others? You'd better find out How?
  27. Hand Compilation from PDF - Use highlighter

  28. Enter Python

        # page number -> functions defined on that page
        definitions = {
            450: ['spam', 'grok'],
            451: ['foo'],
            452: ['bar'],
        }
        # page number -> functions called on that page
        calls = {
            123: ['blah', 'read_input', 'send_msg'],
            124: ['spam', 'foo', 'bar'],
        }

    Entered by hand (from paper copy). A long weekend.
  29. Just Link It

        from collections import defaultdict

        # Map each defined name to the page that defines it
        symbols = { name: pageno
                    for pageno, defns in definitions.items()
                    for name in defns }

        # Calls to names that are not defined on any page
        unresolved = [ (name, pageno)
                       for pageno, clist in calls.items()
                       for name in clist
                       if name not in symbols ]

        missing = defaultdict(list)
        for name, pageno in unresolved:
            missing[name].append(pageno)

        for item in missing.items():
            print("Missing: %s on pages %s" % item)
  30. Secret Weapons List/dict/set comprehensions collections module

  31. WHY?!?!?!?! Due diligence You'd better understand your side's invention Otherwise,

    you will die
  32. Meet The Enemy Alice

  33. Meet The Enemy Alice Alice's Ninja Rockstar Coders

  34. Alice's Ninja Rockstar Coders Meet The Enemy Alice

  35. Meet The Enemy Alice Alice's Adult Engineers SEI CMMI Level

    4
  36. Here Are Some Documents 500 pages

  37. 500 pages 5,000 pages Here Are Some Documents

  38. 500 pages 5,000 pages 500,000 pages Here Are Some Documents

  39. (what's better than one? 300,000 that's what!) Sample Documents

  40. Purported Source Code? ATTORNEY EYES ONLY ATTORNEY EYES ONLY 1677723

    1677724
  41. From: Guido van Rossum <guido@python.org> Date: Dec 9 23:21:42 CET

    2011 Subject: [Python-Dev][PATCH] Adding braces to __future__ For me, if I had to design a new language today, I would probably use braces, not because they're better than whitespace, but because pretty much every other language uses them, and there are more interesting concepts to distinguish a new language. That said, I don't regret that Python uses indentation, and the rest I have to say about the topic would violate the above request. -- --Guido van Rossum (python.org/~guido) Emails
  42. From: Guido van Rossum <guido@python.org> Date: Dec 9 23:21:42 CET

    2011 Subject: [Python-Dev][PATCH] Adding braces to __future__ For me, if I had to design a new language today, I would probably use braces, not because they're better than whitespace, but because pretty much every other language uses them, and there are more interesting concepts to distinguish a new language. That said, I don't regret that Python uses indentation, and the rest I have to say about the topic would violate the above request. -- --Guido van Rossum (python.org/~guido) Emails Smoking gun?!?
  43. Alleged Prior Art

  44. Deposition of Crazy Old Guy Prior art

  45. We Have Their Software It's highly proprietary You're the only

    one approved to look at it It's actually sitting over in a vault AKA: Software escrow
  46. None
  47. The Vault

  48. The Vault By the tracks

  49. The Vault By the tracks Rock band rehearsal space

  50. Vault Protocol No computers No phone No electronics No storage

    devices Pen, paper and books okay
  51. The Vault PC in a locked cage (no network) Printer

    Special paper Log Book
  52. What's There? A collection of large hard drives D:\ E:\

    F:\ G:\ Each containing copies of CDs (>1.5 TB total) No documentation or organization
  53. Perspective Software archive for the infringing invention

  54. Perspective Software archive for the infringing invention Embedded Microcontroller System

  55. Perspective Software archive for the infringing invention Embedded Microcontroller System

    Display Module Keypads 7-segment
  56. Perspective Software archive for the infringing invention Embedded Microcontroller System

    Display Module Keypads 7-segment A PC
  57. Perspective Software archive for the infringing invention Embedded Microcontroller System

    Display Module Keypads 7-segment A PC Custom PCI Board
  58. Perspective Software archive for the infringing invention Embedded Microcontroller System

    Display Module Keypads 7-segment A PC Custom PCI Board Second PC
  59. Perspective A PC Custom PCI Board Second PC Custom Router

    Actually, more of a distributed system
  60. Perspective The software is "all stack" (a million lines of

    code) C++/Win32 C/ASM DCOM/CORBA C/ASM C/ASM VB Java RMI RTOS
  61. Enter Time

    [Timeline figure: OS/2, then WinNT; years 90, 92, 94, 95, 96, 97, 98, 00, 01; software versions V1 to V9; hardware RevA, RevB, RevC]
    • Weekly snapshots (52 x 15 years = 780 versions)
    • Multiple hardware revisions/configurations
    • Operating system changes/deployment changes
  62. Enter Customers

    • Dozen major customers (corporations)
    • Customer-specific system modifications
    • Think "skins" on main system
    • Hundreds of interlocking versions
    [Diagram: Base System Version 2.51; ACME Vers 1.23; Buy N Large Vers 4.22; Tyrell Corp Vers 3.43]
  63. Provided Tools Windows-XP

  64. Provided Tools Windows-XP Command Prompt

  65. Provided Tools Windows-XP Command Prompt Search Mutt

  66. Provided Tools Notepad

  67. Official Tools Notepad Visual Studio

  68. Printing You can print anything Must be logged Numbered, copied,

    given to opposing side
  69. Constraints No working hardware setup (can't run code) No working

    build environ (can't compile) No tech support (can't call anyone) Fragmentary documentation (if any)
  70. None
  71. Secret Weapon

  72. Python? What? How? Unknown: How did Python get placed on

    the machine in the vault? I have NO idea A new IBM PC with only "approved tools" Best Guess: Used by an IBM OEM tool (Yet, there it was, python... in the Windows path no less).
  73. Desert Island Coding Admit it, you've probably thought Python might

    be a good choice Batteries included FOR THE WIN!
  74. Strategy Create a fact discovery environment from scratch in the

    vault I was destined for this job... I wrote the book
  75. Question: What are the objectives? (What does it mean to

    "look at" the code?)
  76. Goals What was provided? Is it complete? How does the

    code work? Where is the patent in the code?
  77. The Horror! The Horror! Reverse engineering the entire build environ

    Makefiles, config files, etc. Identifying all major software components Examples: .exe files, .DLLs, plugins, etc. Sorting out version histories
  78. MKDEP= mkdep SHELL= /bin/sh # === Fixed definitions === OBJS=

    \ bltinmodule.o \ ceval.o cgensupport.o compile.o \ errors.o \ frozen.o \ getargs.o getcompiler.o getcopyright.o getm getplatform.o getversion.o graminit.o \ import.o importdl.o \ marshal.o modsupport.o mystrtoul.o \ pythonrun.o \ sigcheck.o structmember.o sysmodule.o \ traceback.o \ $(LIBOBJS) LIB= libPython.a Sources Library You try to figure it out
  79. Tackle the Provided Code

  80. Basic Tooling Reimplement Unix find grep wc diff tail head

    Because that Windows search mutt must die
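
The slides show `diff` and a few navigation helpers, but not the other reimplemented tools. A minimal sketch of what `grep`, `wc`, and `head` might look like using only the standard library follows. The function names, signatures, and the `filepat` parameter are assumptions for illustration, not the code written in the vault.

```python
import fnmatch
import os
import re


def grep(pattern, topdir, filepat='*'):
    """Yield (path, lineno, line) for every line under topdir matching a regex."""
    regex = re.compile(pattern)
    for dirpath, _, filenames in os.walk(topdir):
        for name in fnmatch.filter(filenames, filepat):
            fullname = os.path.join(dirpath, name)
            # errors='replace' keeps the search going on files with odd encodings
            with open(fullname, errors='replace') as f:
                for lineno, line in enumerate(f, 1):
                    if regex.search(line):
                        yield fullname, lineno, line.rstrip('\n')


def wc(filename):
    """Return (lines, words, characters) for one file, like Unix wc."""
    with open(filename, errors='replace') as f:
        text = f.read()
    return text.count('\n'), len(text.split()), len(text)


def head(filename, n=10):
    """Return the first n lines of a file."""
    with open(filename, errors='replace') as f:
        return [line.rstrip('\n') for _, line in zip(range(n), f)]
```

Because `grep()` is a generator, it can be stopped after the first few hits, which matters when the search space is an entire hard drive.
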
  81. Example: navigation

        import os

        def cd(dirname):
            os.chdir(dirname)

        def pwd():
            print(os.getcwd())

        def ls(dirname=''):
            os.system('dir %s' % dirname)
  82. Example: diff

        # diff.py
        import sys, difflib

        def diff(fromfile, tofile):
            fromlines = open(fromfile).readlines()
            tolines = open(tofile).readlines()
            diff = difflib.context_diff(fromlines, tolines, fromfile, tofile)
            sys.stdout.writelines(diff)
  83. Interactive Shell

        >>> cd('pycode')
        >>> pwd()
        D:\Files\pycode
        >>> diff('Python-2.6/Lib/collections.py',
        ...      'Python-2.6.2/Lib/collections.py')
        *** Python-2.6/Lib/collections.py
        --- Python-2.6.2/Lib/collections.py
        ***************
        *** 103,109 ****
          # where the named tuple is created. Bypass this step in where
          # sys._getframe is not defined (Jython for example).
          if hasattr(_sys, '_getframe'):
        ! result.__module__ = _sys._getframe(1).f_globals['__n
          return result
        --- 103,109 ----
          # where the named tuple is created. Bypass this step in where
  84. More Than Reinvention Actually implementing an entire workflow Building up

    layers of tools/analyses Not unlike what is done with IPython NB Can't overstate Python awesomeness
  85. Example

        def allfiles(topdir):
            return ((path, filename)
                    for path, dirs, files in os.walk(topdir)
                    for filename in files)

        >>> files = allfiles('AllPython')
        >>> next(files)
        ('AllPython/0/python-0.9.1', 'python.man')
        >>> next(files)
        ('AllPython/0/python-0.9.1', 'README')
        >>>
  86. Example

        def filetypes(topdir):
            from collections import Counter
            from pprint import pprint
            c = Counter(os.path.splitext(name)[1] for _, name in allfiles(topdir))
            pprint(c.most_common())

        >>> filetypes('AllPython')
        [('.py', 125277),
         ('.c', 27200),
         ('', 17010),
         ('.rst', 15439),
         ('.h', 14782),
         ('.tex', 12257),
         ...

    Tools so far: allfiles()
  87. Example

        def find(topdir, pattern):
            from fnmatch import fnmatch
            return ((path, name)
                    for path, name in allfiles(topdir)
                    if fnmatch(name, pattern))

        >>> f = find('AllPython', '*.py')
        >>> next(f)
        ('AllPython/0/python-0.9.1/demo/scripts', 'findlinksto.py')
        >>> next(f)
        ('AllPython/0/python-0.9.1/demo/scripts', 'mkreal.py')
        >>> next(f)
        ('AllPython/0/python-0.9.1/demo/scripts', 'ptags.py')
        >>>

    Tools so far: allfiles(), filetypes()
  88. Example

        def find_versions(topdir):
            import re
            # The directory holding pgen.c sits one level below a Python source root
            for path, _ in find(topdir, 'pgen.c'):
                pypath, _ = os.path.split(path)
                version = re.search(r'-(\w+\.\w+(\.\w+)?)$', pypath).group(1)
                yield version, pypath

    Tools so far: allfiles(), filetypes(), find()
  89. Example

        >>> vers = find_versions('AllPython')
        >>> next(vers)
        ('0.9.1', 'AllPython/0/python-0.9.1')
        >>> next(vers)
        ('1.0.1', 'AllPython/1/python-1.0.1')
        >>>

    Tools so far: allfiles(), filetypes(), find(), find_versions()
  90. Example

        def write_manifest(topdir):
            import csv
            # newline='' stops the csv module from writing blank rows on Windows
            with open('manifest.csv', 'w', newline='') as f:
                csv.writer(f).writerows(find_versions(topdir))

    Tools so far: allfiles(), filetypes(), find(), find_versions()
  91. Example allfiles() filetypes() find() find_versions() write_manifest()

  92. Example allfiles() filetypes() find() find_versions() write_manifest() .csv Workflows!

  93. Pile it Higher and Higher HDDs Snapshots "Virtual File System"

    View View View You keep building abstractions Reorganized file layer Different views (version, date, prod, debug, etc.) .csv
  94. Example: Versioning

        def versions(filename):
            import hashlib
            from collections import defaultdict
            manifest = read_manifest()
            # Versions whose copy of the file hashes identically land in one group
            groups = defaultdict(list)
            for vers, path in manifest.items():
                fullname = os.path.join(path, filename)
                if os.path.exists(fullname):
                    digest = hashlib.new('md5')
                    digest.update(open(fullname,'rb').read())
                    groups[digest.digest()].append(vers)
            return sorted([sorted(g) for g in groups.values()])
  95. Example: Versioning

        >>> for x in versions('Python/thread.c'):
        ...     print(x)
        ...
        ['1.0.1']
        ['1.1']
        ['1.2', '1.3']
        ['1.4']
        ['1.5', '1.5.1']
        ['1.5.2', '1.5.2c1']
        ['1.5.2b1', '1.5.2b2']
        ['1.6', '1.6b1']
        ['2.0', '2.0.1', '2.0c1', '2.1', '2.1.1', '2.1.2', '2.1.3']
        ...
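
The `versions()` function calls `read_manifest()`, which the slides never show. A minimal sketch, assuming the manifest is the two-column CSV of (version, path) rows produced by `write_manifest()` and that a plain dict is all `versions()` needs; the `filename` parameter and its default are assumptions.

```python
import csv


def read_manifest(filename='manifest.csv'):
    """Load rows of (version, path) into a dict mapping version -> path."""
    with open(filename, newline='') as f:
        # Skip empty rows so a stray blank line does not raise IndexError
        return {row[0]: row[1] for row in csv.reader(f) if row}
```
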
  96. Navigational Tooling "Virtual File System" View View View Query tools

    for going to any version/file Navigational Tools >>> view('2.7.3', 'Python/ceval.c') >>> Typically launch Windows tools (e.g., Visual Studio)
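
The slide shows only the call `view('2.7.3', 'Python/ceval.c')`, not its body. One way such a helper could work is sketched below, assuming a manifest dict that maps version to source root. Unlike the slide, this sketch takes the manifest as an explicit argument, and both the `resolve()` helper and the `'notepad'` default are hypothetical.

```python
import os
import subprocess


def resolve(manifest, version, filename):
    """Return the full path of filename inside the given version's tree."""
    fullname = os.path.join(manifest[version], filename)
    if not os.path.exists(fullname):
        raise FileNotFoundError(fullname)
    return fullname


def view(manifest, version, filename, viewer='notepad'):
    """Open one file from one version in an external viewer without waiting for it."""
    return subprocess.Popen([viewer, resolve(manifest, version, filename)])
```

Keeping path resolution separate from launching means the same lookup can feed other tools, such as `diff()`, without opening a window.
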
  97. Timeline/Inventory Tools Link together every version of every component found

    Development timelines Official vs. Debug releases V1 V2 V3 V4 release release release release release release release
  98. Commentary I don't know if the opposing side actually expected

    us to figure out their code We knew almost everything about everything Python FOR.THE.WIN.
  99. How Does Code Work? Better make sure you understand everything

    about the code Software architecture Interaction between components Underlying algorithms
  100. Problem: Code Sucks Nobody wants to read code Better: Design

    documents, specs Nobody wants to give you that "Go read the source."
  101. Let's Go Fishing Interesting files Code comments TPS reports PDF,

    DOC, RTF, HTML, TXT // See: Important Document Fixed bug. See important specification.
  102. Back and Forth An obscure find /* See FS-6541-8v2.0 for

    details */ A request to attorneys "Tell opposing counsel we can't find FS-6541-8v2.0" A few silent days pass....
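
Finding a reference like `FS-6541-8v2.0` by hand is luck; finding every such reference is a search problem. A minimal sketch of scanning source lines for document identifiers follows. The regular expression is an assumption modeled on the single identifier shown on the slide, and `doc_references()` is a hypothetical name.

```python
import re
from collections import defaultdict

# Hypothetical identifier shape, modeled on "FS-6541-8v2.0" from the slide:
# capital letters, a dash, digits, then optional "-part" or ".part" suffixes
DOC_ID = re.compile(r'\b[A-Z]{2,}-\d+(?:[-.]\w+)*')


def doc_references(lines):
    """Map each document identifier to the line numbers that mention it."""
    refs = defaultdict(list)
    for lineno, line in enumerate(lines, 1):
        for match in DOC_ID.finditer(line):
            refs[match.group()].append(lineno)
    return dict(refs)
```

The resulting mapping doubles as a checklist of documents to request from the other side.
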
  103. None
  104. None
  105. None
  106. Casting a Wide Net Search for documents is far and

    wide Software change notices Unrelated software (peripheral devices) Emails The web (catalogs, manuals, job postings, etc.) Analogy: Pulling on a loose thread...
  107. Commentary You're learning the invention from scratch Reading other people's

    code You're teaching attorneys about it The other side doesn't want you to succeed You will learn A LOT in this exercise
  108. Some Lessons Learned SUCKS ROCKS

  109. Some Lessons Learned SUCKS ROCKS C++ Assembly code

  110. Some Lessons Learned SUCKS ROCKS C++ Assembly code Asynchronous Threads

  111. Some Lessons Learned SUCKS ROCKS C++ Assembly code Asynchronous Threads

    Objects Functions
  112. Some Lessons Learned SUCKS ROCKS C++ Assembly code Asynchronous Threads

    Objects Functions Makefiles IDEs
  113. Some Lessons Learned SUCKS ROCKS C++ Assembly code Asynchronous Threads

    Objects Functions Makefiles IDEs CASE Tools Humans
  114. Some Lessons Learned SUCKS ROCKS C++ Assembly code Asynchronous Threads

    Objects Functions Makefiles IDEs CASE Tools Humans UML Words
  115. Some Lessons Learned SUCKS ROCKS C++ Assembly code Asynchronous Threads

    Objects Functions Makefiles IDEs CASE Tools Humans 1990s 1970s UML Words
  116. Some Lessons Learned

    SUCKS          ROCKS
    C++            Assembly code
    Asynchronous   Threads
    Objects        Functions
    Makefiles      IDEs
    CASE Tools     Humans
    1990s          1970s
    UML            Words

    (of course this is just my opinion, I could be wrong)
  117. Speaking of Attorneys Do the "facts" support patent infringement? Does

    it look like it infringes? Can it be proven that it infringes? (Let the game begin)
  118. None
  119. Remember this? 7. The system of claim 5 or 6,

    wherein the display and input means comprises displays means and input means, the input means being connected to the central processing unit, the display means being connected to the slithering means and the central processing unit, the display means being arranged to display the displays and the input means transferring the input responses to the central processing unit, and wherein the display and input task means further comprises display task means and input task means, the display task means being arranged to control the display means by transferring display commands to, and receiving the display responses from, the display means, the input task means being arranged to control the input means by transferring input commands to, and receiving input responses from, the input means. The Patent
  120. Remember this? 7. The system of claim 5 or 6,

    wherein the display and input means comprises displays means and input means, the input means being connected to the central processing unit, the display means being connected to the slithering means and the central processing unit, the display means being arranged to display the displays and the input means transferring the input responses to the central processing unit, and wherein the display and input task means further comprises display task means and input task means, the display task means being arranged to control the display means by transferring display commands to, and receiving the display responses from, the display means, the input task means being arranged to control the input means by transferring input commands to, and receiving input responses from, the input means. The Patent What does this claim mean? (let's rumble)
  121. None
  122. Defining Claim Terms 7. The system of claim 5 or

    6, wherein the display and input means comprises displays means and input means, the input means being connected to the central processing unit, the display means being connected to the slithering means and the central processing unit, the display means being arranged to display the displays and the input means transferring the input responses to the central processing unit, and wherein the display and input task means further comprises display task means and input task means, the display task means being arranged to control the display means by transferring display commands to, and receiving the display responses from, the display means, the input task means being arranged to control the input means by transferring input commands to, and receiving input responses from, the input means. The Patent Term Plaintiff Defendant central processing unit display means
  123. Claim Construction Claim terms have to be supported by reality

    If not, it's game over A lot of attorney/expert consultation Problem: very specific facts and structure File: Widget/foo.c, lines 230-255. Requires a deep dive
  124. Problem Matching claims to 800 versions of a million line

    program Pick one version? Which one? Match them all?
  125. Fragment Versioning You're familiar with source code control Imagine applying

    it to code fragments/excerpts In reverse Hmmm.
  126. /* source.c */

        void grok() {
            if (spam) {
                foo();
                bar();
            }
            ...
        }
        void blah() {
            ...
        }
  127. /* source.c */

        void grok() {
            if (spam) {
                foo();
                bar();
            }
            ...
        }
        void blah() {
            ...
        }

    Fragment:
        file: source.c
        start: 'void grok()'
        end: 'void blah()'
  128. /* source.c */

        void grok() {
            if (spam) {
                foo();
                bar();
            }
            ...
        }
        void blah() {
            ...
        }

    Fragment:
        file: source.c
        start: 'void grok()'
        end: 'void blah()'

    Snapshots (>800). Global fragment search across all versions.
  129. /* source.c */

        void grok() {
            if (spam) {
                foo();
                bar();
            }
            ...
        }
        void blah() {
            ...
        }

    Fragment:
        file: source.c
        start: 'void grok()'
        end: 'void blah()'

    Snapshots (>800)

    Ver1:
        void grok() {
            if (spam) {
                foo();
                bar();
            }
        }
        void blah() {

    Ver2:
        void grok() {
            if (spam) {
                foo();
                bar(x);
            }
        }
        void blah() {

    Ver3:
        void grok() {
            if (spam) {
                new_foo();
                bar(x);
            }
        }
        void blah() {
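
The fragment idea (a file plus start and end markers, tracked across every snapshot) can be sketched in a few lines. This is an illustration under stated assumptions, not the tool used in the vault: `extract_fragment()` and `fragment_history()` are hypothetical names, snapshots are passed in as a dict of version name to file text, and whitespace is normalized before comparison.

```python
from collections import defaultdict


def extract_fragment(text, start, end):
    """Return the text from the start marker up to (not including) the end marker."""
    begin = text.find(start)
    if begin < 0:
        return None
    stop = text.find(end, begin + len(start))
    return text[begin:] if stop < 0 else text[begin:stop]


def fragment_history(snapshots, start, end):
    """Group version names by the content of one fragment.

    snapshots maps a version name to the full text of the file in that version.
    """
    groups = defaultdict(list)
    for version, text in snapshots.items():
        fragment = extract_fragment(text, start, end)
        if fragment is None:
            continue  # this version has no such fragment
        # Collapse whitespace so reformatting alone doesn't count as a new version
        normalized = ' '.join(fragment.split())
        groups[normalized].append(version)
    return sorted(sorted(g) for g in groups.values())
```

Run over hundreds of snapshots, the result is a short list of groups, one per distinct form the fragment ever took.
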
  130. Big Picture Reduce a massive data set to something sane

    "This claim matches this structure in the code. There have only been six versions of this code over 15 years. Here are the six versions." Keep in mind: All this happening in the vault Big collection of fragment histories
  131. PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON

    PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON
  132. PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON

    PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON Python makes the impossible possible
  133. PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON

    PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON Python makes the impossible possible (even Python 3)
  134. Final Thoughts If you get the chance to do this,

    do it! You will learn A LOT! Would I want to do it again? Not sure.
  135. But How Did it End?

  136. None
  137. My End Game I learned a lot about generator functions

    Ultimately a well-known PyCon tutorial...
  138. Postscript: Expert Report You may be asked to write an

    expert report Outlines all factual findings Ties facts to patent claims A scientific document It's a document that WILL be read
  139. Postscript: Deposition You A room of attorneys Opposing expert Court

    reporter Videographer 8 hours It will be one of the most intense, surreal, awesome/worst experiences of your whole life.
  140. Postscript: Court Testimony Like deposition, but dialed up to 11

    Twice as many attorneys, more experts Judge & clerks
  141. Questions