About Me
• VP of Engineering, offers.com
• Austin PHP Organizer
• I play competitive Skee Ball
• github.com/jimbojsb
• @jimbojsb
Let’s Start With a Survey
Agenda
• What can we consider highly available?
• Why are containers well-suited for this?
• What technology choices do we have to make?
• Recommendations
• Lessons learned the hard way
Opinion vs Fact
• This talk is based on my opinions
• There are many different ways to do things
• If I trash your favorite, we can still have a beer later
• Why am I even qualified to talk about this?
This is not a tutorial
• There’s no way I can show you enough in an hour to build this all from scratch
• See what ideas might apply to your systems
What is High Availability
• Your stuff just doesn’t go down
• Like ever
• And not just by happy coincidence!
How Often Are You Down?
• 99% Uptime = Down 7h a month
• 99.9% Uptime = Down 45m a month
• 99.99% Uptime = Down <5m a month
• 99.999% Uptime = Down <30s a month
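The slide rounds these figures; a quick back-of-the-envelope sketch in PHP (assuming a 30-day month) shows where the numbers come from:

    function downtimePerMonth(float $uptimePercent): string
    {
        $minutesPerMonth = 30 * 24 * 60; // 43,200 minutes in a 30-day month
        $downMinutes = $minutesPerMonth * (1 - $uptimePercent / 100);
        return round($downMinutes, 1) . ' minutes of downtime per month';
    }

    echo downtimePerMonth(99.0);    // ~432 minutes (~7.2 hours)
    echo downtimePerMonth(99.9);    // ~43.2 minutes
    echo downtimePerMonth(99.99);   // ~4.3 minutes
    echo downtimePerMonth(99.999);  // ~0.4 minutes (~26 seconds)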
What Should We Shoot For?
• Minimum “4 9’s”
• “5 9’s” is totally doable
• HA costs real money
• All about managing potential loss
How To Calculate Your Risk Tolerance
• Log in to your AWS account
• Hand me your laptop
• I will terminate one EC2 instance of my choosing, at random
• How long will you let me sit there?
But seriously…
• Risk mitigation costs money
• Consider battery backups as an example
• Asking “how much reliability do you want?” is a silly question
• Make these decisions with hard numbers, not feelings
Obligatory Metaphors
• Until the late 2000s, we treated servers like pets
• Then with Chef, Puppet, Ansible, etc., we treated them like cattle
• Now we can treat them like ants
Assumptions for Now
• You have an AWS account
• You have something to lose if your apps are down
• You have a budget to solve this problem
Example App Ecosystem
(Diagram: a PHP web app, API, scheduled jobs, and queue workers, backed by a database, cache, job queue, and uploaded files)
Let’s Start with Hardware
• All this stuff works great in the cloud
• It also works just fine on bare metal
• You need at least 2 of everything
• You need a plan for how to fail
• You need a replacement plan
Self-Healing Systems
• If a server ceases to exist, it should be replaced without human interaction
• Use AWS CloudFormation
• Actually learn AWS CloudFormation
• AWS Elastic Beanstalk is an option
• Terraform is also decent
What about my DevOps Tools?
• I’ve got all this _____ stuff already set up
• Run EVERYTHING in Docker, and you don’t need it
• You can still use it if you insist
• Who runs the scripts?
• Is the _____ server highly available?
Why Docker Is Suitable
• Immutable, disposable infrastructure
• Requires no bootstrapping if using a Docker-friendly OS
• Don’t have to care what is running where, just that you have enough hardware
Set up your hosts
• I recommend CoreOS
• Understand the CoreOS version scheme
• CoreOS is self-upgrading (probably bad)
Outsource Your State
• Docker isn’t amazing at managing state
• Local state is not fault-tolerant
• AWS provides perfectly good, managed storage
• Managed != High Availability
Database
• What does your I/O load look like?
• Split writes and reads (see the sketch below)
• Recommend AWS Aurora
• Always have at least 2 of your biggest server
• Watch out for maintenance windows
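A minimal sketch of the read/write split with PDO. The endpoint names and credentials here are placeholders, not from the talk; Aurora gives you a writer ("cluster") endpoint and a read-only ("cluster-ro") endpoint that balances across replicas:

    $user = getenv('DB_USER');
    $pass = getenv('DB_PASS');

    // Writes go to the cluster (writer) endpoint...
    $writer = new PDO('mysql:host=myapp.cluster-abc123.us-east-1.rds.amazonaws.com;dbname=myapp', $user, $pass);
    // ...reads go to the reader endpoint.
    $reader = new PDO('mysql:host=myapp.cluster-ro-abc123.us-east-1.rds.amazonaws.com;dbname=myapp', $user, $pass);

    $stmt = $reader->query('SELECT id, email FROM users LIMIT 10');

    $insert = $writer->prepare('INSERT INTO audit_log (event) VALUES (?)');
    $insert->execute(['user.login']);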
Disk Storage
• Try to avoid local disk storage of anything
• Put PHP sessions in a cache of your choosing
• Upload files directly to S3
• Consider Flysystem so development doesn’t need real S3 buckets (see the sketch below)
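A minimal sketch of both ideas, assuming the phpredis extension for sessions and Flysystem 1.x with the AWS S3 adapter for uploads; the hostnames, bucket name, and APP_ENV variable are made up for illustration:

    use Aws\S3\S3Client;
    use League\Flysystem\AwsS3v3\AwsS3Adapter;
    use League\Flysystem\Adapter\Local;
    use League\Flysystem\Filesystem;

    // Sessions: keep them off local disk by pointing PHP at a shared cache.
    ini_set('session.save_handler', 'redis');
    ini_set('session.save_path', 'tcp://cache.internal:6379');

    // Uploads: straight to S3 in production, a local adapter in development.
    $adapter = getenv('APP_ENV') === 'production'
        ? new AwsS3Adapter(new S3Client(['version' => 'latest', 'region' => 'us-east-1']), 'myapp-uploads')
        : new Local(__DIR__ . '/storage/uploads');

    $files = new Filesystem($adapter);
    $files->put('avatars/42.png', file_get_contents($_FILES['avatar']['tmp_name']));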
Job Queue
• Pick one that can be load balanced
• Ideally outsource this too
• Apache Kafka
• Amazon SQS (see the sketch below)
• RabbitMQ
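A minimal SQS sketch with the AWS SDK for PHP; the queue URL environment variable and the job payload are assumptions:

    use Aws\Sqs\SqsClient;

    $sqs = new SqsClient(['version' => 'latest', 'region' => 'us-east-1']);
    $queueUrl = getenv('JOB_QUEUE_URL');

    // Producer: enqueue a job from the web app.
    $sqs->sendMessage([
        'QueueUrl'    => $queueUrl,
        'MessageBody' => json_encode(['job' => 'resize-image', 'id' => 42]),
    ]);

    // Worker: long-poll for jobs, process, then delete.
    $result = $sqs->receiveMessage(['QueueUrl' => $queueUrl, 'WaitTimeSeconds' => 20]);
    foreach ($result->get('Messages') ?? [] as $message) {
        // ...do the work...
        $sqs->deleteMessage(['QueueUrl' => $queueUrl, 'ReceiptHandle' => $message['ReceiptHandle']]);
    }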
Cache
• How important is your cache?
• Does your app work if the cache disappears? (see the fallback sketch below)
• Make sure it’s not the source of truth
• Sharding vs. Replication for scale
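A minimal fallback sketch, assuming Memcached and a PDO connection; the point is that a cold or dead cache degrades performance, not correctness:

    function getUser(Memcached $cache, PDO $db, int $id)
    {
        // Best-effort read: a dead or empty cache just falls through to the DB.
        $user = $cache->get("user:$id");
        if ($user !== false) {
            return $user;
        }

        // The database stays the source of truth.
        $stmt = $db->prepare('SELECT * FROM users WHERE id = ?');
        $stmt->execute([$id]);
        $user = $stmt->fetch(PDO::FETCH_ASSOC);

        // Repopulate opportunistically.
        $cache->set("user:$id", $user, 300); // 5-minute TTL
        return $user;
    }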
Now that we’ve solved our state problems…
RUN EVERYTHING ELSE IN DOCKER
Containerize All The Things
• This isn’t just about containers for the sake of containers
• The container way of thinking leads you down the right path
docker run Is Not Sufficient
• Just like with building apps, you’re going to want a framework
• Some sort of API and deployment system to run containers
• Something to wrangle hardware with
This is a solved problem
• Kubernetes
• Mesos / Marathon
• Docker Swarm / Compose
• Rancher
My Recommendation
• Rancher
  • Experimenting & learning
  • Small-to-mid scale
  • NoOps
• Mesos & Marathon
  • Big scale (dozens of services & instances)
  • You have an ops team (or an expert dev)
  • Full AWS stack
Apache Mesos
• Your “interface” to hardware
• You don’t speak to it directly
• Clusters many instances into a pool of resources
• Runs containers
Marathon
• Runs your long-running services / containers (see the deploy sketch below)
• Websites, APIs, etc.
• Job queue workers
• May or may not expose ports
• Works with HAProxy
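A rough sketch of deploying through Marathon’s REST API (POST /v2/apps) from PHP; the Marathon hostname, image name, and resource numbers are placeholders:

    $app = [
        'id'        => '/myapp-web',
        'cpus'      => 0.5,
        'mem'       => 256,
        'instances' => 2,
        'container' => [
            'type'   => 'DOCKER',
            'docker' => [
                'image'        => 'registry.example.com/myapp:1.2.3',
                'network'      => 'BRIDGE',
                'portMappings' => [['containerPort' => 80, 'hostPort' => 0]], // 0 = let Mesos pick the host port
            ],
        ],
        'healthChecks' => [['protocol' => 'HTTP', 'path' => '/health']],
    ];

    $ch = curl_init('http://marathon.internal:8080/v2/apps');
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => json_encode($app),
        CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
        CURLOPT_RETURNTRANSFER => true,
    ]);
    echo curl_exec($ch);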
Scheduled Jobs
• Remember, we agreed to run everything in Docker.
• How do we load balance & failover cron?
This, also, is a solved problem
• Chronos
  • Part of “Mesosphere”
  • Clunky UI
  • Kind of a pain to deploy to
• Singularity
  • Not singly focused
  • Happens to do scheduled jobs really well
Resource Allocation
• Marathon and Singularity both support cgroups and Docker resource offers
• Commit ahead of time to what your app needs; don’t over-provision
• Can result in fragmentation
Load Balancers
• Apply liberally
• Also good for port translation on the public-facing side
• Load balance all your internal services too
• If a service doesn’t need a load balancer, that’s a good sign it’s a risky service
Service Discovery
• Can you ever really know where anything is running?
• No. No, you cannot.
• Because of the way Docker works, there will be many “unknown” ports
This is a solved problem
• etcd
• Consul
• ZooKeeper
• Several others
My Recommendation
• Don’t use any of these
• Service discovery costs time
• How often do your services actually reconfigure themselves?
• We use known-port discovery (see the sketch below)
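A minimal sketch of what known-port discovery looks like from the app’s side; the hostnames and ports are examples, not the ones we use:

    // Every internal service sits behind a load balancer on a fixed, agreed-upon
    // port, so the app only needs host names; there is no discovery service to query.
    $services = [
        'api'   => 'http://haproxy.internal:10090',
        'cache' => 'tcp://haproxy.internal:10091',
        'queue' => 'tcp://haproxy.internal:10092',
    ];

    // HAProxy tracks the containers as they move; the app never has to.
    $response = file_get_contents($services['api'] . '/healthcheck');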
Single App Network Diagram
(Diagram: Internet → myapp.com:80 → two haproxy instances on :10090 → app containers on randomly assigned ports, e.g. :31456 and :30437)
So Now What
• You went and built out all this fancy stuff
• You absolutely MUST test it
• Go terminate an instance and see if everything fixes itself (see the sketch below)
• If you haven’t tested each “HA” component, you’re not done yet
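A minimal “pull the plug” sketch with the AWS SDK for PHP; the instance ID is a placeholder, so pick a victim from a non-critical pool first, then watch whether it gets replaced with no human involved:

    use Aws\Ec2\Ec2Client;

    $ec2 = new Ec2Client(['version' => 'latest', 'region' => 'us-east-1']);

    // Kill one instance on purpose; your self-healing setup should replace it.
    $ec2->terminateInstances(['InstanceIds' => ['i-0abc123def4567890']]);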
We can version our infrastructure
DevOps Benefits
• Ops can focus on supporting the cluster and not the apps
• Empower engineers to ship infrastructure in a “foolproof” way
• Intangible benefit of spreading cool tech into the organization
Learn from my mistakes
• RTFM, specifically about minimum requirements
• Is your DNS HA?
• Upgrade your way out of trouble
• It’s highly unlikely you need a bleeding-edge version of Docker
• If you deploy in Docker, you’d better be developing in it
Downsides
• Mesos/Marathon doesn’t expose some of the most modern Docker features
• This stuff is not easy, and it’s moving fast
• It can be hard to test the waters; lots of circular dependencies