with JS, C & Objective-C Took a more operationally involved role in 2013–2014 ➤ Learned a whole bunch of other stuff Started working with Erlang/Elixir in 2015
opinion; any resemblance to facts should be considered a mere coincidence. Depending on your current scale and cultural inclination, they may or may not be applicable to your product.
Phoenix ➤ Dockerisation, OTP Releases, etc. already covered numerous times ➤ Distillery is clearly the way forward Starting from scratch is a different story ➤ Let’s try to build up a decision framework
for your deployment Requirements: things that are important to have in your deployment Design: how we currently address this issue Implementation: SSL/TLS steering, clustering, configuration, etc Demo: (if time permits) a full demonstration of what we’re using Then try to answer any questions
Time ➤ Erlang/OTP is a beautiful and pioneering tool ➤ Many concepts have found their way outside the Erlang/OTP ecosystem ➤ Load balancing, clustering, service discovery, etc ➤ Your colleagues probably already know how to use them ➤ Key: your colleagues are already knowledgeable; make sure that knowledge stays useful
Product Development ➤ You wrote it — You maintain it! ➤ Any code written by your team will need to be maintained by your team, on your team’s budget, for as long as it exists ➤ Consider opportunity cost of custom platform engineering ➤ Seek maximum leverage from commercial off-the-shelf solutions ➤ Key: Fully exploit well-known solutions created and maintained by others
Lock-In Anyway ➤ “Portability” is a damned lie; it’s vendor-specific shims all the way down ➤ Open Source Languages/Frameworks — Always! ➤ Closed Source Third-Party Services — Some? Many? All? ➤ Post-porting activities are also important: fine-tune, measure, monitor, re-adjust… ➤ Might as well design and exploit each platform to its fullest ➤ Key: When possible, spend other people’s R&D money
existing knowledge; avoid forced re-learning Use as many pre-built solutions as possible ➤ Build upon existing solutions by your community Properly integrate with your primary platform ➤ An existing and adequate solution is still better than nothing Also: try to follow https://12factor.net as closely as possible
not compromise: ➤ Minimum hours spent away from work per day, per team member ➤ Number of involuntary interventions per week/month/year Do not optimise for “developer happiness” or “developer productivity” ➤ These are by-products of a system correctly designed for perpetual operation ➤ Focus on operational stability and sustainability instead
issues do not require human intervention Efficient: quick to deploy/rollback, quick to start, etc Secure: proper role/net segregation, minimised public footprint Observable: easy to monitor/intervene (say, with console)
you can probably lose an AZ but not an entire region Intermittent errors can be recovered from automatically. ➤ Let your Supervision Tree sort out the easy bits. ➤ Have your infrastructure pick up the hard bits. ➤ NB: they operate at different levels and are actually complementary
fails its health check and is replaced automatically. New containers are launched on the replacement instance; service continues. ➤ Bad: Intermittent 500s start to appear. The ailing server returns errors faster than the healthy servers return real responses, so it attracts ever more traffic. Everything is broken. The site is down.
Site goes read-only as read replica gets promoted to primary. Important events queued for deferred processing. Impacted Erlang/OTP processes restart. You may or may not have lost a couple of transactions (depending on how your replication is built). ➤ Bad: Site goes down. Restoration from yesterday’s backup will take at least 4 hours. Third party events dropped. R.T.O. now on the floor.
seconds: good; 1 minute: OK; longer than 5 minutes: insufferable Helps reduce ceremonial role of deployments ➤ Much lower cost of error correction if deployments are fast Also helps you fulfil the inevitable “build a new stack” requests ➤ Staging / Test / Support / One-Off Odds & Ends / Big Migration / DR… ➤ You will probably get asked to “run a copy of PROD” with miracles expected
a safe manner” Good: New stack created and traffic split at DNS / LB level. Everybody carries on working/testing. Eventually, rollout completed. Bad: Big-bang rollout which inevitably fails. After much gnashing of teeth and lots of finger-pointing, the operators were blamed (!) Probably OK: use feature flags…
separate places ➤ Utilise VPC capabilities to their fullest ➤ Additional layers around your systems Separate development/deployment/infrastructure roles ➤ Resource access/creation/mutation limited with dedicated roles ➤ Also allows you to deploy customer-hosted services as a vendor
TVBCOA… ➤ Good: “Run this template, then give us keys from its output, which are limited to deployment/maintenance for this service only.” ➤ Bad: “Yeah we need root access to these servers just to deploy…” Scenario 2: Lots of frenemies in one room ➤ Good: each division gets their own VPC ➤ Bad: everything can see everything else…
time ➤ Have a mental picture of what “normal” looks like ➤ Different tools for Application and Infrastructure level needs Maintain ability to intervene quickly if required ➤ Little can be done by pure operators apart from scaling horizontally ➤ Developer access is still crucial as systems mature
Good: Console access is available. Node connected to, root cause identified, and ad-hoc patch seems to alleviate the problem. New release created and rolled out. ➤ Bad: Insufficient access, so a new release was needed, just to add logging statements…
Production, One for Development ➤ Some older AWS EC2 customers actually do not have a “Default VPC”! ➤ Even if they do, it would be poor form to put everything in there Three Subnets per Availability Zone ➤ One each for Public, Private and Data services: Public → Private, Private → Data ➤ For Internet Access (NTP, etc): Private → Public, Data → Public ➤ You can put in a S3 Gateway if you wish: cheaper S3 access!
➤ AWS-managed Kubernetes would be great. ➤ ECS hosts spread around all available AZs Fully utilise Docker for production builds ➤ We’re all using Macs anyway ➤ Development on macOS, Release/Production on Linux ➤ Helps catch system-specific issues
➤ This is much simpler than writing an Infrastructure Deployment Guide ➤ 100% repeatable, no churn, no faffing about ➤ Checked into Git and Version Controlled Locally run Bash scripts for everything else ➤ The choice is simple: either run Bash locally or trust/run Lambda Functions remotely. I’d rather do it locally ➤ Checked into Git and Version Controlled
pin dependencies against upstream stacks ➤ Pinned resources cannot change upstream Run Stack (as Administrator) ➤ You can make more Administrator accounts using IAM ➤ Meta: you could also build a self-service administrator account in IAM, which can only update the stack that created it.
➤ The same images previously used for validation are reused Push Images to ECS Container Repository Revise ECS Task Definition Revise ECS Service Definition
particulars Retrieve latest events from ECS ➤ Has the cluster entered a stable state? Retrieve EC2 instances and their particulars Print metrics of particular interest (RAM/CPU utilisation, etc)
# find a container ➤ $ docker exec -it (container) iex -S mix Also for later consideration: attach a local Erlang node to VPC infra ➤ This is a straightforward matter of lining the ports up ➤ You could even do an ad-hoc task if you wish, but it’d be slower Future Script: Attach Console
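A rough sketch of what attaching a local node could look like, assuming the distribution ports are reachable through the tunnels described later; the node name and cookie below are purely hypothetical placeholders:

# Start a named local node first, e.g.: iex --name console@127.0.0.1
# Then, from that IEx session:
Node.set_cookie(:"cookie-fetched-from-ssm")   # hypothetical cookie value
Node.connect(:"app@10.0.1.23")                # hypothetical remote node name
Node.list()                                   # should now list the remote node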
servers ➤ Essential for automatic failover ➤ You can attach volumes for stuff you wish to keep Alternative: Cloud Init, Custom AMI, Ansible / Puppet ➤ Either slower (re-provisioning everything dynamically takes minutes to hours), or less efficient (entire AMIs need to be rebuilt: is your build infrastructure also reproducible?)
get the artefacts out. ➤ This gets you the underlying Erlang/OTP release ➤ You could then wrap it in another delivery mechanism ➤ Erlang on Xen, perhaps?
$PGDATA. The key pair has been
# generated manually beforehand, as the container running
# PostgreSQL does not have OpenSSL exposed anyway.
#
# https://www.postgresql.org/docs/9.1/static/ssl-tcp.html
# See: 17.9.3. Creating a Self-signed Certificate
#
cp /docker-entrypoint-initdb.d/server.{crt,key} "$PGDATA"
chown postgres:postgres "$PGDATA"/server.{crt,key}
chmod 0600 "$PGDATA"/server.key
#
# Given that this is a development container,
# we do not wish to play with PostgreSQL configuration too much
# therefore a simple line appended to the end of the configuration
# file will suffice.
#
echo "ssl = on" >> "$PGDATA/postgresql.conf"
endpoint, redirect to HTTPS ➤ Much better for applications than throwing an error ➤ Not really needed for programmatic access: just fail straight away X-Forwarded-Proto header ➤ Emitted by Heroku and AWS Elastic Load Balancer ➤ Supported by Plug.SSL and therefore Phoenix’s Endpoint ➤ config :app, Endpoint, force_ssl: [rewrite_on: [:x_forwarded_proto]]
the Elastic Load Balancer, all ELB / App interaction will be conducted over HTTP, including health checks. With HTTPS enforcement, health checks get a 301 by default. Adjust your Matcher accordingly to avoid knocking your site offline.
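A minimal sketch of the combination described above, with :my_app and MyApp.Endpoint as hypothetical names:

# config/prod.exs (sketch)
config :my_app, MyApp.Endpoint,
  url: [scheme: "https", host: "example.com", port: 443],
  force_ssl: [rewrite_on: [:x_forwarded_proto]]

# The ALB health check arrives over plain HTTP without X-Forwarded-Proto,
# so Plug.SSL answers it with a 301; the target group health check matcher
# therefore has to accept 301 as well as 200 to keep instances in service.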
AWS ➤ https://aws.amazon.com/articles/6234671078671125 ➤ Establish VPN tunnels between each host, then fake it… You could emulate UDP multicast in AWS if you really want to ➤ A lot of work, though! ➤ Consider what the actual reward would be
load-balancer / proxy ➤ Perhaps etcd or Riak, but you could use Redis/Postgres if you wish Each Erlang/OTP node puts in an expiring “heartbeat” ➤ {nodename, host, port, cookie, memo}, TTL = 30 seconds Each Erlang/OTP node then tries to connect to the others ➤ Custom libCluster strategy needed
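The real version plugs into libCluster as a custom strategy; the fragment below is only a sketch of the heartbeat-and-connect loop, with the registry reads/writes (etcd, Riak, Redis, Postgres…) stubbed out and all names hypothetical:

defmodule MyApp.ClusterHeartbeat do
  use GenServer

  @poll_interval 10_000
  @ttl_seconds 30

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  def init(state) do
    send(self(), :tick)
    {:ok, state}
  end

  def handle_info(:tick, state) do
    # Refresh our own expiring heartbeat entry in the external registry
    write_heartbeat(node(), @ttl_seconds)

    # Connect to every other node currently advertising a heartbeat
    for other <- read_heartbeats(), other != node(), do: Node.connect(other)

    Process.send_after(self(), :tick, @poll_interval)
    {:noreply, state}
  end

  # Placeholders: the actual reads/writes against the registry are left out
  defp write_heartbeat(_node_name, _ttl), do: :ok
  defp read_heartbeats, do: []
end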
wish to have work partitioned among your nodes. ➤ You can utilise Riak Core once you get a cluster going ➤ Alternatively: you can take a distributed lock and allocate from there ➤ Many other ways to distribute work — find the best for your application
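For the distributed-lock option, a minimal sketch using :global from OTP (the lock key and module name are hypothetical):

defmodule MyApp.SingletonJob do
  # Takes a cluster-wide lock via :global.trans/2, so the given function
  # runs on at most one connected node at a time for this lock key.
  def run(fun) when is_function(fun, 0) do
    :global.trans({:singleton_job, self()}, fun)
  end
end

# Usage: MyApp.SingletonJob.run(fn -> do_the_partitioned_work() end)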
customising vm.args ➤ Pass a “placeholder” Erlang cookie, so we can deploy many revisions of something together without them talking to each other ➤ Actual Erlang Cookie kept in an AWS SSM Parameter Also customised the startup script ➤ Dynamically generate the Node Name to avoid clashes
false
  set cookie: :placeholder
  set vm_args: "./rel/vm.args"
end

# rel/vm.args
## Name of the node
-name ${NODE_NAME}

## Cookie for distributed erlang
-setcookie ${NODE_COOKIE}
to RDS ➤ ssh -L local_port:remote_host:remote_port ➤ Tunnel from Laptop to Bastion ➤ Tunnel from Bastion to any ECS Host ➤ You must have at least one host in there anyway ➤ Result: localhost:5436 maps to RDS:5432 — success! ➤ Postgres should be configured to use SSL only
Goal: once the stack is handed off to the development team, it is already running the application in a fully functional fashion off the master branch You could tunnel to RDS and build up the state yourself ➤ Good way to consistently exercise the “Seed” file ➤ Good way to expose incorrect migrations too
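As a sketch of that approach, a local Repo configuration pointed at the tunnelled port might look like this (app, repo, and credential names are hypothetical; 5436 → 5432 is the mapping from the tunnelling slide):

# Local-only config (sketch): Repo pointed at the SSH tunnel to RDS
config :my_app, MyApp.Repo,
  hostname: "localhost",
  port: 5436,                                 # local end of the tunnel
  ssl: true,                                  # Postgres is SSL-only, as above
  username: "app",
  password: System.get_env("DB_PASSWORD"),
  database: "app_production"

# Then, for example: mix ecto.migrate && mix run priv/repo/seeds.exs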
It is a good starting point ➤ However, your migrations should always run with only the Repo running ➤ This means no application dependencies ➤ iex -S mix run --no-start ➤ Treat Ecto migrations as a stable Elixir-to-SQL transformation
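One possible shape for such a migration entry point, as a sketch (module name is hypothetical; assumes the Postgrex adapter and Ecto’s built-in migrator):

defmodule MyApp.MigrationTask do
  # Runs pending migrations with only the Repo started, so none of the
  # application’s other dependencies need to come up.
  def migrate do
    {:ok, _} = Application.ensure_all_started(:postgrex)
    {:ok, _} = Application.ensure_all_started(:ecto)
    {:ok, _pid} = MyApp.Repo.start_link(pool_size: 2)

    Ecto.Migrator.run(MyApp.Repo, "priv/repo/migrations", :up, all: true)
  end
end

# e.g. from the --no-start session above: MyApp.MigrationTask.migrate()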
tree! ➤ However, you may not want to run everything when starting iex. ➤ Wouldn’t be good if a remote node started sharing production workloads ➤ You can check :init.get_arguments[:user] ➤ Does it have ’Elixir.IEx.CLI’? ➤ Optionally skip certain bits of your supervision tree
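A sketch of that check and how it might gate the supervision tree (child module names are hypothetical):

defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    base     = [MyApp.Repo]
    workload = [MyApp.Endpoint, MyApp.Worker]

    # Only start the workload children when this is not an IEx console node
    children = if console?(), do: base, else: base ++ workload

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end

  # A VM started via IEx carries 'Elixir.IEx.CLI' in its -user argument
  defp console? do
    :init.get_arguments()
    |> Keyword.get(:user, [])
    |> Enum.member?('Elixir.IEx.CLI')
  end
end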