How to add scripting to a web application, Konstantin Lopukhin, Scrapinghub

Talk given at the PyCon Russia 2016 conference

IT-People

July 25, 2016

Transcript

  1. About me
    Scrapinghub: turning web content into useful data
    CHTD: business intelligence (BI)
    Side projects: psycopg2cffi, ML and linguistics
  2. More examples
    • VBA (Visual Basic for Applications), Google Apps Script
    • Vim, Emacs, Atom, Sublime, …
    • GIMP, Maya, …
    • Salesforce, OpenERP, 1C, …
  3. But why?
    • Text is powerful, fast to input, read and modify
    • Allows extension in unforeseen ways
    • Can be easier to develop than comparable GUI
    Scripting alternatives:
    • Build a graphical interface
    • Customization via custom development
  4. Too hard?
    Users:
    • Steep learning curve
    • Not mobile friendly
    Developers:
    • Security risk
    • No out of the box solutions
    • Not my area of expertise
  5. Implementation
    • Our own parser + interpreter or compiler:
      ◦ more suitable for simple or declarative languages
    • Existing programming language:
      ◦ if we need complex control flow, libraries, etc.
    • Off-the-shelf solutions:
      ◦ Template engines
      ◦ Markdown, rst, html
  6. Implementation
    • Our own parser + interpreter or compiler:
      ◦ more suitable for simple or declarative languages
    • Existing programming language:
      ◦ if we need complex control flow, libraries, etc.
    • Off-the-shelf solutions:
      ◦ Template engines
      ◦ Markdown, rst, html
  7. Parsing
    Turn text into structured data
    • Get AST for next steps
    • Validation
    • Error reporting
  8. Parsing: example
    = A1 + SUM(A*)
    Formula(
      Plus(
        left = Cell(column = "A", row = "1"),
        right = Function(
          fn = "SUM",
          args = [Cell(column = "A", row = "*")])))
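A minimal sketch of how the formula on this slide could be turned into that tree with a hand-written tokenizer and a small recursive-descent parser. The class names follow the slide; the `body` field, the `tokenize` and `parse` helpers, and the tiny grammar (only `+`, cells, and function calls with at least one argument) are assumptions made for illustration, not the speaker's implementation.

```python
import re
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Cell:
    column: str
    row: str  # a row number, or "*" meaning "every row"


@dataclass
class Function:
    fn: str
    args: List["Node"]


@dataclass
class Plus:
    left: "Node"
    right: "Node"


@dataclass
class Formula:
    body: "Node"


Node = Union[Cell, Function, Plus]

# A name followed by "(" opens a function call; a name followed by a row
# number or "*" is a cell reference; everything else is a one-character operator.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<func>[A-Z]+)\(|(?P<cell>[A-Z]+)(?P<row>\d+|\*)|(?P<op>[=+,)]))"
)


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected input at column {pos + 1}: {text[pos:]!r}")
        if m.group("func"):
            tokens.append(("func", m.group("func")))
        elif m.group("cell"):
            tokens.append(("cell", (m.group("cell"), m.group("row"))))
        else:
            tokens.append(("op", m.group("op")))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.i = 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else ("end", None)

    def take(self, kind, value=None):
        tok = self.peek()
        if tok[0] != kind or (value is not None and tok[1] != value):
            raise SyntaxError(f"expected {value or kind}, got {tok[1]!r}")
        self.i += 1
        return tok[1]

    def formula(self):  # formula := "=" expr
        self.take("op", "=")
        node = self.expr()
        self.take("end")
        return Formula(node)

    def expr(self):  # expr := term ("+" term)*
        node = self.term()
        while self.peek() == ("op", "+"):
            self.take("op", "+")
            node = Plus(node, self.term())
        return node

    def term(self):  # term := cell | func "(" expr ("," expr)* ")"
        kind, value = self.peek()
        if kind == "cell":
            self.take("cell")
            return Cell(*value)
        if kind == "func":
            self.take("func")
            args = [self.expr()]
            while self.peek() == ("op", ","):
                self.take("op", ",")
                args.append(self.expr())
            self.take("op", ")")
            return Function(value, args)
        raise SyntaxError(f"expected a cell or a function call, got {value!r}")


def parse(text):
    return Parser(tokenize(text)).formula()
```

Calling `parse("= A1 + SUM(A*)")` yields a `Formula` whose body is a `Plus` of a `Cell` and a `Function`, and malformed input raises `SyntaxError`, which is the hook for the validation and error reporting mentioned on the previous slide.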
  9. Parsing approaches
    • Regular expressions
      ◦ very simple grammars, still need error reporting
    • Parser generators and libraries
      ◦ The most common case. See “DSL in Python. How and why?” by Ivan Tsyganov
    • Hand-written parsers
      ◦ Only for very complex grammars and very custom error reporting
  10. Case study: conditional cell formatting
    A1<100: A2.color = red
    B*>A*: B*.arrow = up
    • Parsed with regexps
    • Evaluated on the server
    • GUI to assist text input
    • Very easy to learn
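A rough sketch of what "parsed with regexps" could look like for the two rules on this slide. The exact pattern, the `Rule` tuple, the `parse_rule` helper, and the extra operators `<=`, `>=`, and `=` are assumptions for illustration; the slide only shows `<` and `>`.

```python
import re
from typing import NamedTuple

RULE_RE = re.compile(
    r"""^\s*
    (?P<left>[A-Z]+(?:\d+|\*))                   # cell to test, e.g. A1 or B*
    \s*(?P<op><=|>=|<|>|=)\s*                    # comparison operator
    (?P<right>[A-Z]+(?:\d+|\*)|-?\d+(?:\.\d+)?)  # another cell or a number
    \s*:\s*
    (?P<target>[A-Z]+(?:\d+|\*))                 # cell to format
    \.(?P<prop>[a-z_]+)                          # property, e.g. color or arrow
    \s*=\s*(?P<value>\w+)                        # new value, e.g. red or up
    \s*$""",
    re.VERBOSE,
)


class Rule(NamedTuple):
    left: str
    op: str
    right: str
    target: str
    prop: str
    value: str


def parse_rule(line: str) -> Rule:
    m = RULE_RE.match(line)
    if m is None:
        # Even a regexp-based parser needs to report errors to the user.
        raise ValueError(f"cannot parse rule: {line!r}")
    return Rule(**m.groupdict())
```

With a single pattern the only error the parser can report is "this line does not match", which is one reason this approach suits only very simple grammars.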
  11. Implementation
    • Our own parser + interpreter or compiler:
      ◦ more suitable for simple or declarative languages
    • Existing programming language:
      ◦ if we need complex control flow, libraries, etc.
  12. Sandbox
    Allow only desired (safe) functionality
    • Do not allow I/O: filesystem, network
    • Limit CPU and memory usage
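As an illustration of the last bullet only, here is a sketch that caps CPU time and address space for a child interpreter using the standard `resource` module (Unix only). The `run_limited` helper and the default limits are assumptions for this example. Resource limits by themselves do not block filesystem or network access, so this is one building block and not a sandbox.

```python
import resource
import subprocess
import sys


def run_limited(code, cpu_seconds=1, memory_bytes=1024 * 1024 * 1024):
    """Run `code` in a child interpreter with CPU and memory limits."""

    def apply_limits():
        # Runs in the child process just before the interpreter starts;
        # the limits apply to that process only.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (memory_bytes, memory_bytes))

    return subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=apply_limits,
        capture_output=True,
        text=True,
        timeout=30,  # wall-clock safety net for code that sleeps instead of spinning
    )
```

A script that finishes quickly exits normally, while `run_limited("while True: pass")` is terminated by the kernel once the CPU limit is reached and returns a non-zero exit status.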
  13. Python
    • Python :)
    • Nice syntax
    • Rich stdlib
    • Taught a lot, quite popular
    • Convenient for code sharing with main app
  14. Naïve Python
    >>> del __builtins__
    >>> open
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    NameError: name 'open' is not defined
  15. Naïve Python
    >>> del __builtins__
    >>> import os
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ImportError: __import__ not found
  16. Naïve Python
    >>> ().__class__.__bases__[0].__subclasses__()
    [<type 'type'>, <type 'weakref'>, <type 'weakcallableproxy'>, <type 'weakproxy'>, <type 'int'>, <type 'basestring'>, <type 'bytearray'>, <type 'list'>, <type 'NoneType'>, <type 'NotImplementedType'>, <type 'traceback'>, <type 'super'>, <type 'xrange'>, <type 'dict'>, ...
  17. Naïve Python
    >>> __builtins__ = [
    ...     x for x in ().__class__.__base__.__subclasses__()
    ...     if x.__name__ == 'catch_warnings'][0]()._module.__builtins__
    >>> import os
    >>>
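The sessions above are from Python 2. The following sketch suggests that the first step of the same leak still applies in Python 3 when code is evaluated with an empty `__builtins__` mapping. The variable names are chosen for this example.

```python
# Evaluate expressions with no builtins available at all.
restricted_globals = {"__builtins__": {}}

# Names such as open() are indeed gone...
try:
    eval("open", restricted_globals)
    open_is_reachable = True
except NameError:
    open_is_reachable = False

# ...but attribute access on a literal still leads to `object` and, from
# there, to every class currently loaded in the interpreter.
leaked_classes = eval("().__class__.__base__.__subclasses__()", restricted_globals)
```

Hiding names therefore does not remove the objects behind them, which is why removing builtins is not a sandbox.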
  18. RPython sandbox
    • Proper sandbox
    • Limit heap size, CPU time
    • Supports all Python syntax
    • Many stdlib modules (pure-python, re, datetime*, etc.)
  19. Using RPython sandbox
    • Start the process, give it some source code
    • Usually it’s our “driver” code + user code
    • Communicate with it via stdin/stdout
    • Optionally, allow desired I/O events
    Very bare-bones, only Python 2. Use if you REALLY need Python (large API, code sharing, etc.)
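A sketch of the parent side of this arrangement: driver code and user code are combined into one source, handed to a child process, and values are exchanged as one Python literal per line over stdin/stdout. The child here is a plain `sys.executable`, used only as a stand-in so the example runs anywhere; it is NOT sandboxed. The `DRIVER` template, the `transform` function name, and the helpers are all assumptions for illustration.

```python
import ast
import subprocess
import sys

# Hypothetical driver: reads one Python literal per line from stdin, applies
# the user's function to it and prints the repr() of the result.
DRIVER = '''
import ast, sys
{user_code}
for line in iter(sys.stdin.readline, ""):
    value = ast.literal_eval(line)
    print(repr(transform(value)), flush=True)
'''

USER_CODE = '''
def transform(value):
    return value.upper()
'''


def start_child(user_code):
    # Stand-in command: a real deployment would launch the sandboxed
    # interpreter here instead of the unsandboxed sys.executable.
    return subprocess.Popen(
        [sys.executable, "-u", "-c", DRIVER.format(user_code=user_code)],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )


def call(child, value):
    # repr() escapes newlines, so every request fits on a single line.
    child.stdin.write(repr(value) + "\n")
    child.stdin.flush()
    return ast.literal_eval(child.stdout.readline())


def stop_child(child):
    # Closing stdin ends the driver loop; then wait for a clean exit.
    child.stdin.close()
    child.wait(timeout=10)
    child.stdout.close()
    return child.returncode
```

For example, `call(start_child(USER_CODE), "moscow")` returns `"MOSCOW"`. Parsing replies with `ast.literal_eval` instead of `eval` means the parent never executes anything the child sends back.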
  20. JavaScript
    Selling point: it’s possible to run (and sandbox) JavaScript in the browser.
    Implementation:
    • PyV8: embedding V8 in Python (last commit in 2014, sits in the Google Code archive)
    • V8 or SpiderMonkey in a separate process
  21. Lua
    • Embeddable language by design
    • “JavaScript done right”
    • Naïve approach that failed in Python works!
    Very easy to use from Python using lupa.
  22. Lupa: other nice things
    Easy to build a rich interface:
    • Passing Lua and Python objects
    • Python callbacks, coroutines
    As a bonus, LuaJIT is very fast.
  23. Docker/VM/chroot/etc.
    They are neither safe by default nor foolproof sandboxes. You need to be an expert to build a safe sandbox from the basic building blocks.
  24. Case study: ETL scripts
    ETL = Extract, Transform, Load.
    Define a short script for one field (column)
    • Splitting: “Moscow, Russia” -> (“Moscow”, “Russia”)
    • Cleanup: “00-123-78” -> 12378
    • Parsing: dates
  25. Example scripts
    return field.split("-")[0]

    return datetime.strptime(field.split()[0], "%Y %m %d")

    m = re.search(r"(\d+)", f["index"])
    return m.groups()[0] if m else 0
  26. Implementation details
    • RPython sandbox
    • Freeze timezone to make datetime work
    • Line-oriented text protocol, using ast.literal_eval
    • User API: input via magic variable “field”, output via return. Extra features: fields["name"], raise Skip.
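One way the user API described here could be wired up on the driver side, shown as a sketch under stated assumptions. The user's script is indented into a function body so that a bare `return` is legal, `field` and `fields` become its arguments, and `Skip` is an exception the driver catches. The wire format (a `(fields, name)` tuple in, a tagged tuple out) and the helper names are invented for this example; the talk only says the protocol is line-oriented and uses `ast.literal_eval`. In the setup described above this code would run inside the sandbox, since `exec` offers no isolation of its own.

```python
import ast
import re
import textwrap
from datetime import datetime


class Skip(Exception):
    """Raised by a user script to drop the current row."""


def compile_script(source):
    # Indent the user's statements into a function body so that a bare
    # top-level `return` becomes legal Python. (Indenting would also alter
    # multi-line string literals; a real driver would need to handle that.)
    wrapped = "def _script(field, fields):\n" + textwrap.indent(source, "    ")
    namespace = {"re": re, "datetime": datetime, "Skip": Skip}
    exec(compile(wrapped, "<user script>", "exec"), namespace)
    return namespace["_script"]


def handle_line(script, line):
    """Process one request line and return one response line."""
    fields, name = ast.literal_eval(line)
    try:
        result = ("ok", script(fields[name], fields))
    except Skip:
        result = ("skip", None)
    except Exception as exc:
        # Report the failure to the parent instead of crashing the driver.
        result = ("error", f"{type(exc).__name__}: {exc}")
    return repr(result)
```

With this shape, the first example script from the previous slide, `return field.split("-")[0]`, can be compiled once with `compile_script` and then applied to every incoming row with `handle_line`.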
  27. User reception
    • Usable by non-programmers with the help of examples
    • API is simple, quick to learn
    • Flexible enough to overcome missing features in other parts of the system