About Me
• VP of Engineering, offers.com
• Austin PHP Organizer
• I play competitive Skee Ball
• github.com/jimbojsb
• @jimbojsb
Let’s Start With a Survey
Agenda
• What can we consider highly available?
• Why are containers well-suited for this?
• What technology choices do we have to make?
• Recommendations
• Lessons learned the hard way
Opinion vs Fact
• This talk is based on my opinions
• There are many different ways to do things
• If I trash your favorite, we can still have a beer later
• Why am I even qualified to talk about this?
This is not a tutorial
• There’s no way I can show you enough in an hour to build this all from scratch
• See what ideas might apply to your systems
What is High Availability
• Your stuff just doesn’t go down
• Like ever
• And not just by happy coincidence!
How Often Are You Down?
• 99% Uptime = Down 7h a month
• 99.9% Uptime = Down 45m a month
• 99.99% Uptime = Down <5m a month
• 99.999% Uptime = Down <30s a month
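The slide rounds these figures; a quick back-of-the-envelope sketch in PHP (assuming a 30-day month) shows where the numbers come from:

    function downtimePerMonth(float $uptimePercent): string
    {
        $minutesPerMonth = 30 * 24 * 60; // 43,200 minutes in a 30-day month
        $downMinutes = $minutesPerMonth * (1 - $uptimePercent / 100);
        return round($downMinutes, 1) . ' minutes of downtime per month';
    }

    echo downtimePerMonth(99.0);    // ~432 minutes (~7.2 hours)
    echo downtimePerMonth(99.9);    // ~43.2 minutes
    echo downtimePerMonth(99.99);   // ~4.3 minutes
    echo downtimePerMonth(99.999);  // ~0.4 minutes (~26 seconds)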
What Should We Shoot For?
• Minimum “4 9’s”
• “5 9’s” is totally doable
• HA costs real money
• All about managing potential loss
How To Calculate Your Risk Tolerance
• Log in to your AWS account
• Hand me your laptop
• I will terminate one EC2 instance of my choosing, at random
• How long will you let me sit there?
But seriously…
• Risk mitigation costs money
• Consider battery backups as an example
• Asking “how much reliability do you want?” is a silly question
• Make these decisions with hard numbers, not feelings
Obligatory Metaphors
• Until the late 2000s, we treated servers like pets
• Then with Chef, Puppet, Ansible, etc., we treated them like cattle
• Now we can treat them like ants
Assumptions for Now
• You have an AWS account
• You have something to lose if your apps are down
• You have a budget to solve this problem
Example App Ecosystem
(Diagram: a PHP web app, API, scheduled jobs, and queue workers, backed by a database, cache, job queue, and uploaded files)
Let’s Start with Hardware
• All this stuff works great in the cloud
• It also works just fine on bare metal
• You need at least 2 of everything
• You need a plan for how to fail
• You need a replacement plan
Self-Healing Systems
• If a server ceases to exist, it should be replaced without human interaction
• Use AWS CloudFormation
• Actually learn AWS CloudFormation
• AWS Elastic Beanstalk is an option
• Terraform is also decent
What about my DevOps Tools?
• I’ve got all this _____ stuff already set up
• Run EVERYTHING in Docker, and you don’t need it
• You can still use it if you insist
• Who runs the scripts?
• Is the _____ server highly available?
Why Docker Is Suitable
• Immutable, disposable infrastructure
• Requires no bootstrapping if using a Docker-friendly OS
• Don’t have to care what is running where, just that you have enough hardware
Set up your hosts
• I recommend CoreOS
• Understand the CoreOS version scheme
• CoreOS is self-upgrading (probably bad)
Outsource Your State
• Docker isn’t amazing at managing state
• Local state is not fault-tolerant
• AWS provides perfectly good, managed storage
• Managed != High Availability
Database
• What does your I/O load look like?
• Split writes and reads (see the sketch below)
• Recommend AWS Aurora
• Always have at least 2 of your biggest server
• Watch out for maintenance windows
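A minimal sketch of the read/write split with PDO. The endpoint names and credentials here are placeholders, not from the talk; Aurora gives you a writer ("cluster") endpoint and a read-only ("cluster-ro") endpoint that balances across replicas:

    $user = getenv('DB_USER');
    $pass = getenv('DB_PASS');

    // Writes go to the cluster (writer) endpoint...
    $writer = new PDO('mysql:host=myapp.cluster-abc123.us-east-1.rds.amazonaws.com;dbname=myapp', $user, $pass);
    // ...reads go to the reader endpoint.
    $reader = new PDO('mysql:host=myapp.cluster-ro-abc123.us-east-1.rds.amazonaws.com;dbname=myapp', $user, $pass);

    $stmt = $reader->query('SELECT id, email FROM users LIMIT 10');

    $insert = $writer->prepare('INSERT INTO audit_log (event) VALUES (?)');
    $insert->execute(['user.login']);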
Disk Storage
• Try to avoid local disk storage of anything
• Put PHP sessions in a cache of your choosing
• Upload files directly to S3
• Consider Flysystem so development doesn’t need real S3 buckets (see the sketch below)
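A minimal sketch of both ideas, assuming the phpredis extension for sessions and Flysystem 1.x with the AWS S3 adapter for uploads; the hostnames, bucket name, and APP_ENV variable are made up for illustration:

    use Aws\S3\S3Client;
    use League\Flysystem\AwsS3v3\AwsS3Adapter;
    use League\Flysystem\Adapter\Local;
    use League\Flysystem\Filesystem;

    // Sessions: keep them off local disk by pointing PHP at a shared cache.
    ini_set('session.save_handler', 'redis');
    ini_set('session.save_path', 'tcp://cache.internal:6379');

    // Uploads: straight to S3 in production, a local adapter in development.
    $adapter = getenv('APP_ENV') === 'production'
        ? new AwsS3Adapter(new S3Client(['version' => 'latest', 'region' => 'us-east-1']), 'myapp-uploads')
        : new Local(__DIR__ . '/storage/uploads');

    $files = new Filesystem($adapter);
    $files->put('avatars/42.png', file_get_contents($_FILES['avatar']['tmp_name']));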
Job Queue
• Pick one that can be load balanced
• Ideally outsource this too
• Apache Kafka
• Amazon SQS (see the sketch below)
• RabbitMQ
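A minimal SQS sketch with the AWS SDK for PHP; the queue URL environment variable and the job payload are assumptions:

    use Aws\Sqs\SqsClient;

    $sqs = new SqsClient(['version' => 'latest', 'region' => 'us-east-1']);
    $queueUrl = getenv('JOB_QUEUE_URL');

    // Producer: enqueue a job from the web app.
    $sqs->sendMessage([
        'QueueUrl'    => $queueUrl,
        'MessageBody' => json_encode(['job' => 'resize-image', 'id' => 42]),
    ]);

    // Worker: long-poll for jobs, process, then delete.
    $result = $sqs->receiveMessage(['QueueUrl' => $queueUrl, 'WaitTimeSeconds' => 20]);
    foreach ($result->get('Messages') ?? [] as $message) {
        // ...do the work...
        $sqs->deleteMessage(['QueueUrl' => $queueUrl, 'ReceiptHandle' => $message['ReceiptHandle']]);
    }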
Cache
• How important is your cache?
• Does your app work if the cache disappears? (see the fallback sketch below)
• Make sure it’s not the source of truth
• Sharding vs. Replication for scale
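A minimal fallback sketch, assuming Memcached and a PDO connection; the point is that a cold or dead cache degrades performance, not correctness:

    function getUser(Memcached $cache, PDO $db, int $id)
    {
        // Best-effort read: a dead or empty cache just falls through to the DB.
        $user = $cache->get("user:$id");
        if ($user !== false) {
            return $user;
        }

        // The database stays the source of truth.
        $stmt = $db->prepare('SELECT * FROM users WHERE id = ?');
        $stmt->execute([$id]);
        $user = $stmt->fetch(PDO::FETCH_ASSOC);

        // Repopulate opportunistically.
        $cache->set("user:$id", $user, 300); // 5-minute TTL
        return $user;
    }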
Now that we’ve solved our state problems…
RUN EVERYTHING ELSE IN DOCKER
Containerize All The Things
• This isn’t just about containers for the sake of containers
• The container way of thinking leads you down the right path
docker run Is Not Sufficient
• Just like with building apps, you’re going to want a framework
• Some sort of API and deployment system to run containers
• Something to wrangle hardware with
This is a solved problem
• Kubernetes
• Mesos / Marathon
• Docker Swarm / Compose
• Rancher
My Recommendation
• Rancher
  • Experimenting & learning
  • Small-to-mid scale
  • NoOps
• Mesos & Marathon
  • Big scale (dozens of services & instances)
  • You have an ops team (or an expert dev)
  • Full AWS stack
Apache Mesos
• Your “interface” to hardware
• You don’t speak to it directly
• Clusters many instances into a pool of resources
• Runs containers
Marathon
• Runs your long-running services / containers (see the deploy sketch below)
• Websites, APIs, etc.
• Job queue workers
• May or may not expose ports
• Works with HAProxy
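A rough sketch of deploying through Marathon’s REST API (POST /v2/apps) from PHP; the Marathon hostname, image name, and resource numbers are placeholders:

    $app = [
        'id'        => '/myapp-web',
        'cpus'      => 0.5,
        'mem'       => 256,
        'instances' => 2,
        'container' => [
            'type'   => 'DOCKER',
            'docker' => [
                'image'        => 'registry.example.com/myapp:1.2.3',
                'network'      => 'BRIDGE',
                'portMappings' => [['containerPort' => 80, 'hostPort' => 0]], // 0 = let Mesos pick the host port
            ],
        ],
        'healthChecks' => [['protocol' => 'HTTP', 'path' => '/health']],
    ];

    $ch = curl_init('http://marathon.internal:8080/v2/apps');
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => json_encode($app),
        CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
        CURLOPT_RETURNTRANSFER => true,
    ]);
    echo curl_exec($ch);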
Scheduled Jobs
• Remember, we agreed to run everything in Docker.
• How do we load balance & failover cron?
This, also, is a solved problem
• Chronos
  • Part of “Mesosphere”
  • Clunky UI
  • Kind of a pain to deploy to
• Singularity
  • Not singly focused
  • Happens to do scheduled jobs really well
Resource Allocation
• Marathon and Singularity both support cgroups and Docker resource offers
• Commit ahead of time to what your app needs; don’t over-provision
• Can result in fragmentation
Load Balancers
• Apply liberally
• Also good for port translation on the public-facing side
• Load balance all your internal services too
• If a service doesn’t need a load balancer, that’s a good sign it’s a risky service
Service Discovery
• Can you ever really know where anything is running?
• No. No, you cannot.
• Because of the way Docker works, there will be many “unknown” ports
This is a solved problem
• etcd
• Consul
• ZooKeeper
• Several others
My Recommendation
• Don’t use any of these
• Service discovery costs time
• How often do your services actually reconfigure themselves?
• We use known-port discovery (see the sketch below)
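A minimal sketch of what known-port discovery looks like from the app’s side; the hostnames and ports are examples, not the ones we use:

    // Every internal service sits behind a load balancer on a fixed, agreed-upon
    // port, so the app only needs host names; there is no discovery service to query.
    $services = [
        'api'   => 'http://haproxy.internal:10090',
        'cache' => 'tcp://haproxy.internal:10091',
        'queue' => 'tcp://haproxy.internal:10092',
    ];

    // HAProxy tracks the containers as they move; the app never has to.
    $response = file_get_contents($services['api'] . '/healthcheck');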
Single App Network Diagram
(Diagram: Internet → myapp.com:80 → two haproxy instances on :10090 → app containers on randomly assigned ports, e.g. :31456 and :30437)
So Now What
• You went and built out all this fancy stuff
• You absolutely MUST test it
• Go terminate an instance and see if everything fixes itself (see the sketch below)
• If you haven’t tested each “HA” component, you’re not done yet
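A minimal “pull the plug” sketch with the AWS SDK for PHP; the instance ID is a placeholder, so pick a victim from a non-critical pool first, then watch whether it gets replaced with no human involved:

    use Aws\Ec2\Ec2Client;

    $ec2 = new Ec2Client(['version' => 'latest', 'region' => 'us-east-1']);

    // Kill one instance on purpose; your self-healing setup should replace it.
    $ec2->terminateInstances(['InstanceIds' => ['i-0abc123def4567890']]);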
We can version our infrastructure
DevOps Benefits
• Ops can focus on supporting the cluster and not the apps
• Empower engineers to ship infrastructure in a “foolproof” way
• Intangible benefit of spreading cool tech into the organization
Learn from my mistakes
• RTFM, specifically about minimum requirements
• Is your DNS HA?
• Upgrade your way out of trouble
• It’s highly unlikely you need a bleeding-edge version of Docker
• If you deploy in Docker, you’d better be developing in it
Downsides
• Mesos/Marathon doesn’t expose some of the most modern Docker features
• This stuff is not easy, and it’s moving fast
• It can be hard to test the waters; lots of circular dependencies