PyConZA 2012: "Building RESTful, service-oriented architectures with Twisted" by Bryn Divey
General overview of building service-oriented systems using REST and Twisted: motivations for doing so, design considerations, issues encountered, and possible mitigations. Will try to relate it back to Twisted as far as possible.
running at scale (100s of machines currently, aiming for 10s of thousands)
• Creates a private cloud fabric and API over datacenters for virtualization
• Written in Python/Twisted
Friday, October 12, 2012
having two to fifteen instances depending on the size of the site
• ~5 node-level services, running on every machine in the site
Navigable - transition between resources using links (hypertext)
• Stateless - client state not persisted between requests
• Cacheable - conditions defined by server
the way, or provide caching or load-balancing services
• Uniform interface - defined methods (GET, PUT, etc.), standard response codes (2xx, 4xx, 5xx), well-known content-types
requests comprising methods (POST) and representations of the resource (JSON representation of a server object that may be stored in a different form)
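As a rough illustration of the difference between a resource and its representation, the sketch below keeps a server object in one internal form and renders a JSON representation for the client. The `ServerRow` fields and the `/compute/server/` URI are hypothetical and not taken from the talk.

```python
import json
from dataclasses import dataclass


@dataclass
class ServerRow:
    """Hypothetical internal storage form of a server resource."""
    pk: int
    hostname: str
    ram_bytes: int


def to_representation(row: ServerRow) -> str:
    """Build the JSON representation that would travel in a request or response."""
    return json.dumps(
        {
            "uri": f"/compute/server/{row.pk}",  # a link keeps the resource navigable
            "hostname": row.hostname,
            "ram_mb": row.ram_bytes // (1024 * 1024),  # representation differs from storage
        },
        sort_keys=True,
    )


doc = to_representation(ServerRow(pk=7, hostname="node-07", ram_bytes=8 * 1024 ** 3))
```

The client only ever sees `doc`; how the row is stored can change without changing the representation.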
to run while waiting on async IO
• Allows network heavy applications to “simultaneously” service multiple requests in a single thread, scheduling action on IO
within their own service
• Enforces good interfaces
• Having defined interfaces between sections of code is good practice; using services enforces it
across hundreds of machines, failure is expected - having multiple instances of services makes it non-disruptive
• Freedom with respect to implementation
• No lock-in to language or framework - allows for e.g. using compiled languages for performance-critical sections
APIs is hard. Designing models and choosing which HTTP verbs apply to them is a little easier
• Also, forces developers to simplify their APIs (no EmailUsersOfFeatureX calls - must create or manipulate a model)
• caches, proxies, etc.
• Easy to apply middleware
• e.g. authentication
• Seemed like a good idea at the time
• were designing for WANs, not LANs
• REST was the coolest thing in 2008
• Battle-proven framework
• No threading complexities
• Initial developers had familiarity
• Thought we could isolate unfamiliar devs using threadPools and inlineCallbacks
handled all user, group, authentication-type calls
• ‘instance’ service places, runs, and destroys virtual machines
• ‘storage’ service manages storage servers and attaches virtual block storage
represent all aspects of the system
• as the API is a rather CRUDdy version of REST, most methods on these objects match up to HTTP verbs: get, list, update, save, delete
• this makes models easy to write with metaclasses consuming their field definitions
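One way a metaclass can consume field definitions is sketched below. This is an assumption about the general pattern, not the speaker's code: `Field`, `ModelMeta`, `Model` and the `Instance` fields are all hypothetical names, and the get/list/update/save/delete methods are left out.

```python
class Field:
    """Hypothetical field definition: a type to coerce to, plus a default."""

    def __init__(self, type_, default=None):
        self.type_ = type_
        self.default = default


class ModelMeta(type):
    """Collects Field attributes into a `_fields` mapping at class creation."""

    def __new__(mcls, name, bases, namespace):
        fields = {}
        for base in bases:  # inherit fields declared on parent models
            fields.update(getattr(base, "_fields", {}))
        for key, value in list(namespace.items()):
            if isinstance(value, Field):
                fields[key] = namespace.pop(key)
        cls = super().__new__(mcls, name, bases, namespace)
        cls._fields = fields
        return cls


class Model(metaclass=ModelMeta):
    def __init__(self, **values):
        unknown = set(values) - set(self._fields)
        if unknown:
            raise TypeError(f"unknown fields: {sorted(unknown)}")
        for name, field in self._fields.items():
            value = values.get(name, field.default)
            # Coerce supplied values to the declared type; leave None alone.
            setattr(self, name, None if value is None else field.type_(value))

    def to_dict(self):
        return {name: getattr(self, name) for name in self._fields}


class Instance(Model):
    name = Field(str, "")
    cpus = Field(int, 1)


vm = Instance(name="web-1", cpus="4")
```

Because every model exposes the same `_fields` mapping, generic code (serialisation, validation, verb dispatch) can be written once against `Model`.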
per model
• Operations take a model and metadata, return a response or raise an exception
• Details of HTTP transport hidden from controller
• Allows for HTTP to be swapped out for e.g. AMQP invisibly
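The controller contract described above might look roughly like this. All names (`Response`, `NotFound`, `InstanceController`, `http_adapter`) and the status-code mapping are assumptions made for illustration; only the shape (model and metadata in, response or exception out) comes from the slide.

```python
class NotFound(Exception):
    """Hypothetical domain error; the transport layer decides how to report it."""
    status = 404


class Response:
    def __init__(self, body, status=200):
        self.body = body
        self.status = status


class InstanceController:
    """Operates on plain dict models and knows nothing about HTTP."""

    def __init__(self):
        self._store = {}

    def save(self, model, metadata):
        self._store[model["id"]] = dict(model, owner=metadata["user"])
        return Response(self._store[model["id"]], status=201)

    def get(self, model, metadata):
        try:
            return Response(self._store[model["id"]])
        except KeyError:
            raise NotFound(model["id"])


def http_adapter(controller, verb, model, metadata):
    """Hypothetical transport shim: maps verbs to operations, errors to codes."""
    operation = {"PUT": controller.save, "GET": controller.get}[verb]
    try:
        response = operation(model, metadata)
    except NotFound as exc:
        return exc.status, {"error": f"not found: {exc}"}
    return response.status, response.body


controller = InstanceController()
created = http_adapter(controller, "PUT", {"id": "vm-1"}, {"user": "alice"})
missing = http_adapter(controller, "GET", {"id": "vm-2"}, {"user": "alice"})
```

Swapping the transport then means writing another adapter with the same two responsibilities; the controller itself is untouched.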
access patterns depending on context
• user.get() in a separate controller will cause an HTTP call
• user.get() in the user service will query the underlying database
• calls are identical, sources change
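A sketch of how identical calls can resolve to different sources, assuming the backend is chosen once per service. The `LocalBackend` and `RemoteBackend` classes, the injected `fetch` callable and the `http://api.cloud/users` base URL are hypothetical; a real remote backend would issue an HTTP request instead of calling a stub.

```python
class LocalBackend:
    """Used inside the owning service: reads that service's own database."""

    def __init__(self, rows):
        self.rows = rows

    def get(self, kind, key):
        return self.rows[(kind, key)]


class RemoteBackend:
    """Used by every other service: asks the owning service over the network."""

    def __init__(self, base_url, fetch):
        self.base_url = base_url
        self.fetch = fetch  # injected callable standing in for an HTTP client

    def get(self, kind, key):
        return self.fetch(f"{self.base_url}/{kind}/{key}")


class User:
    backend = None  # configured once at service start-up

    @classmethod
    def get(cls, key):
        return cls.backend.get("user", key)


# Inside the user service: database lookup.
User.backend = LocalBackend({("user", "42"): {"name": "alice"}})
local = User.get("42")

# In any other service: the same call becomes a (stubbed) HTTP request.
requested = []


def fake_fetch(url):
    requested.append(url)
    return {"name": "alice"}


User.backend = RemoteBackend("http://api.cloud/users", fake_fetch)
remote = User.get("42")
```

Calling code never changes; only the configured backend does.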
• models.Instance.list() -> front-end
• GET http://api.cloud/compute/instance -> compute service
• GET http://compute.cloud:5001/compute/instance
• nginx acts as a service bus - routes requests with ‘compute’ base URI to the compute service DNS
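The routing rule can be modelled in a few lines of Python. This is only a stand-in for what nginx does in its own configuration, not nginx syntax; the `UPSTREAMS` table and `route` function are hypothetical, with hostnames and ports following those shown on the slides.

```python
from urllib.parse import urlsplit

# Base URI segment -> service address (hypothetical table).
UPSTREAMS = {
    "compute": "http://compute.cloud:5001",
    "images": "https://image:5005",
}


def route(url):
    """Return the upstream URL for a request arriving at the API front-end."""
    path = urlsplit(url).path
    base = path.lstrip("/").split("/", 1)[0]  # first path segment picks the service
    try:
        return UPSTREAMS[base] + path
    except KeyError:
        raise LookupError(f"no service registered for {base!r}")


target = route("http://api.cloud/compute/instance")
```

The full path is preserved, so the service sees the same resource URI the client asked for.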
is not authorized to use" " machine image 'c323c3ee-cc86-47c0-8518-59a7ee02a06a'")
GET https://api.cloud/images/machineimage/c323c3ee-cc86-47c0-8518-59a7ee02a06a
proxies to
GET https://image:5005/images/machineimage/c323c3ee-cc86-47c0-8518-59a7ee02a06a
returns 200 OK - with machine image details on success
returns 401 Unauthorized - if the user doesn't have access
Examples
to multiple external blocking services: get machine images and storage volumes, and check if there are conflicting virtual machines (an operation on the database local to the service)
deferreds
machine_images = get_machine_images_for_instance(instance)
storage_volumes = get_storage_volumes_for_instance(instance)
related_instances = get_related_instances(instance)
# all deferreds are processed in parallel, so this call (hypothetically)
# takes only as long as the slowest one
results = yield gatherResults(machine_images + storage_volumes + related_instances)
calls to both the image and storage services, and local DB queries
• All call signatures to fetch objects are the same
• All IO operations happen in parallel, and the thread of execution does something else until they finish
auth requests, fetching 1000 machine images, creating 1000 quota objects etc. to launch 1000 instances
• Each requires an HTTP request, requiring an SSL-encrypted connection, and these have to be limited. Each connection is blocked until the current request completes.
hire a lot of low-level devs
• Moving across to the idioms and libraries of Python, from C, can sometimes be difficult
• Adding learning Twisted on top makes it even more so
• Forgetting a yield, or the decorator, causes horrible problems
• Currently we have 6415 yields in the codebase - 6413 of them in inlineCallbacks functions
• Very difficult to track down - needs static analysis
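A minimal sketch of such a static check, using only the standard-library `ast` module. It covers just the missing-decorator half of the problem: it reports functions that contain a `yield` but are not decorated with `inlineCallbacks`. It will also flag ordinary generators, so its output is a list of candidates to review, and the function names here are hypothetical rather than the speaker's tooling.

```python
import ast


def _is_inline_callbacks(decorator):
    # Matches both `@inlineCallbacks` and `@defer.inlineCallbacks`.
    name = getattr(decorator, "id", None) or getattr(decorator, "attr", None)
    return name == "inlineCallbacks"


def _yields_directly(func):
    """True if `func` contains a yield that is not inside a nested function."""
    stack = list(ast.iter_child_nodes(func))
    while stack:
        node = stack.pop()
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.Lambda)):
            continue  # yields in nested functions belong to those functions
        if isinstance(node, (ast.Yield, ast.YieldFrom)):
            return True
        stack.extend(ast.iter_child_nodes(node))
    return False


def undecorated_generators(source):
    """Return sorted names of yielding functions that lack @inlineCallbacks."""
    tree = ast.parse(source)
    return sorted(
        node.name
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef)
        and _yields_directly(node)
        and not any(_is_inline_callbacks(d) for d in node.decorator_list)
    )


SAMPLE = '''
@defer.inlineCallbacks
def good():
    yield fetch()

def bad():
    yield fetch()

def plain():
    return 1
'''
suspects = undecorated_generators(SAMPLE)
```

Detecting a forgotten `yield` is harder, since it requires knowing which calls return Deferreds; that is outside what this sketch attempts.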
the low-level routing
• We get connection pools locking up, requests timing out, etc.
• Monitor these things and throw loud alerts