High Availability PHP (Nomad PHP January 2018)

Slide 1

Slide 1 text

High Availability PHP
Josh Butts
Nomad PHP - January 2018

Slide 2

Slide 2 text

About Me
• SVP of Engineering, Ziff Davis
• Austin PHP Organizer
• github.com/jimbojsb
• @jimbojsb

Slide 3

Slide 3 text

Agenda
• What can we consider highly available?
• How do we mitigate risk?
• Why are containers well suited for HA?
• Recommendations
• Lessons learned the hard way

Slide 4

Slide 4 text

Opinion vs Fact
• This talk is based on my opinions
• There are many different ways to do things
• If I trash your favorite, we can still be friends
• Why am I even qualified to talk about this?

Slide 5

Slide 5 text

This is not a tutorial
• There’s no way I can show you enough in an hour to build all this from scratch
• See what ideas might apply to your systems
• Commit to incremental improvements

Slide 6

Slide 6 text

What is high availability?
• Your stuff just doesn’t go down
• Like ever
• This is not just a happy coincidence

Slide 7

Slide 7 text

How often are you down?
• 99% Uptime = Down 7h/month
• 99.9% Uptime = Down 45m/month
• 99.99% Uptime = Down <5m/month
• 99.999% Uptime = Down <30s/month
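
The figures above follow from simple arithmetic. A minimal sketch, assuming a 30-day month (the function name and the month length are illustrative choices, not from the talk):

```php
<?php
/**
 * Seconds of downtime allowed per month at a given uptime percentage.
 * Assumes a 30-day month; the slide rounds its figures anyway.
 */
function allowedDowntimeSeconds(float $uptimePercent, int $daysInMonth = 30): float
{
    $secondsInMonth = $daysInMonth * 24 * 60 * 60;
    return $secondsInMonth * (1 - $uptimePercent / 100);
}

foreach ([99.0, 99.9, 99.99, 99.999] as $uptime) {
    printf(
        "%s%% uptime allows %.1f minutes of downtime per month\n",
        $uptime,
        allowedDowntimeSeconds($uptime) / 60
    );
}
```

With these assumptions, 99% works out to about 7.2 hours and 99.999% to about 26 seconds, in line with the rounded numbers on the slide.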

Slide 8

Slide 8 text

What should we shoot for?
• Minimum of “4 9’s”
• “5 9’s” is totally doable
• HA costs real money
• Balance potential loss against costs

Slide 9

Slide 9 text

How to calculate your risk tolerance
• Log in to your AWS account
• Hand me your laptop
• I will terminate one EC2 instance of my choosing
• How long will you let me sit there?

Slide 10

Slide 10 text

But seriously…
• Risk mitigation costs money
• Consider battery backups as an example
• Asking “how much reliability do you want” is a silly question
• Make these decisions with hard numbers, not feelings
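
One way to put hard numbers on the decision is to compare the expected loss from downtime before and after an improvement. A rough sketch; the revenue figure and uptime levels are made-up inputs, and the function name is hypothetical:

```php
<?php
/**
 * Expected monthly loss from downtime, in the same currency as $revenuePerHour.
 * Assumes revenue is lost evenly for every hour the site is down.
 */
function expectedMonthlyLoss(float $uptimePercent, float $revenuePerHour, int $hoursInMonth = 720): float
{
    $downtimeHours = $hoursInMonth * (1 - $uptimePercent / 100);
    return $downtimeHours * $revenuePerHour;
}

// Hypothetical inputs: 2,000 per hour in revenue, moving from 99% to 99.99% uptime.
$revenuePerHour = 2000.0;
$lossBefore = expectedMonthlyLoss(99.0, $revenuePerHour);
$lossAfter  = expectedMonthlyLoss(99.99, $revenuePerHour);

printf("Extra HA spend pays for itself below %.2f per month\n", $lossBefore - $lossAfter);
```

If the extra infrastructure costs less per month than the difference, the spend is justified on revenue alone.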

Slide 11

Slide 11 text

Obligatory Metaphors
• Until the late 2000s, we treated servers like pets
• Then with Chef, Puppet, Ansible, etc., we treated them like cattle
• Now we can treat them like ants!

Slide 12

Slide 12 text

Example App Ecosystem
[Diagram components: PHP Web App, API, Scheduled Jobs, Queue Workers, Database, Cache, Job Queue, Uploaded Files]

Slide 13

Slide 13 text

Let’s start with hardware
• All these tactics work great with cloud providers (doesn’t have to be AWS)
• You need at least 2 of everything
• You need a plan for how to fail
• You need a replacement plan

Slide 14

Slide 14 text

Self-healing systems
• If a server ceases to exist, it should be replaced without human interaction
• AWS CloudFormation and Elastic Beanstalk are good options
• Terraform for non-AWSers

Slide 15

Slide 15 text

What about my devops tools?
• I’ve already got all this ____ stuff set up
• Docker can obviate all of that
• You can still use these things if you must
• Who runs the scripts?
• Is the ____ server highly available?

Slide 16

Slide 16 text

Why Docker?
• Immutable, disposable infrastructure
• Requires no bootstrapping if using a Docker-friendly OS
• Don’t have to care about what is running where, just that you have enough hardware

Slide 17

Slide 17 text

Containerize All The Things!
• This isn’t just about containers for the sake of containers
• The container way of thinking leads you down the right path

Slide 18

Slide 18 text

docker run Is Not Sufficient
• Just like with building apps, you’re going to want a framework
• API-based deployment and scheduling of containers
• Something to wrangle hardware

Slide 19

Slide 19 text

Don’t worry, this is a solved problem
• Kubernetes
• Mesosphere / DC/OS
• Docker Swarm
• Rancher

Slide 20

Slide 20 text

Containers and Schedulers
• Common to run multiple containers on one piece of hardware
• What if that hardware goes down?
• What if US-East-1D goes down?

Slide 21

Slide 21 text

Example

Slide 22

Slide 22 text

I’m so tired of hearing people talk about Docker
• This is not a Docker talk, I promise
• Containers breed immutable, repeatable infrastructure
• Immutable infrastructure is disposable and replaceable
• Containers breed 12-factor apps
• 12-factor apps are modular enough to facilitate true HA

Slide 23

Slide 23 text

Database
• What does your I/O load look like?
• Split reads and writes
• AWS Aurora if applicable
• You really need at least 2 of your biggest server
• Maintenance windows?
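
Splitting reads and writes can start out as a small routing decision in the app. A deliberately naive sketch; the function name and hostnames are hypothetical, and the `mysql:` driver is an assumption:

```php
<?php
/**
 * Pick a DSN for a statement: SELECTs go to a random replica,
 * everything else goes to the primary.
 * Naive on purpose: reads inside a transaction, or reads that must see
 * a write that was just made, still belong on the primary.
 */
function chooseDsn(string $sql, string $primaryDsn, array $replicaDsns): string
{
    $isRead = preg_match('/^\s*SELECT\b/i', $sql) === 1;
    if ($isRead && count($replicaDsns) > 0) {
        return $replicaDsns[array_rand($replicaDsns)];
    }
    return $primaryDsn;
}

// Hypothetical hostnames, for illustration only.
$primary  = 'mysql:host=db-primary.example.com;dbname=app';
$replicas = [
    'mysql:host=db-replica1.example.com;dbname=app',
    'mysql:host=db-replica2.example.com;dbname=app',
];

echo chooseDsn('SELECT * FROM users', $primary, $replicas), "\n";
echo chooseDsn("UPDATE users SET name = 'x'", $primary, $replicas), "\n";
```

Because replicas lag behind the primary, anything that must read its own writes should bypass this routing.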

Slide 24

Slide 24 text

Disk Storage
• Try to avoid local disk storage of anything
• Put PHP sessions in a memory cache
• Upload files directly to S3
• Flysystem is your friend, especially for development
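
Moving sessions off local disk comes down to giving PHP a session handler that talks to a shared store. A minimal sketch, assuming a store object with `get`, `set` and `del` methods; those method names and both class names are assumptions for illustration, not a specific client's API:

```php
<?php
/**
 * Session handler that keeps session data in a shared key-value store
 * instead of local disk, so any server can serve any request.
 */
class SharedStoreSessionHandler implements SessionHandlerInterface
{
    private $store;
    private $prefix;

    public function __construct($store, string $prefix = 'sess:')
    {
        $this->store = $store;
        $this->prefix = $prefix;
    }

    public function open($path, $name): bool { return true; }

    public function close(): bool { return true; }

    public function read($id): string
    {
        // PHP expects an empty string when no session data exists yet.
        return (string) ($this->store->get($this->prefix . $id) ?? '');
    }

    public function write($id, $data): bool
    {
        $this->store->set($this->prefix . $id, $data);
        return true;
    }

    public function destroy($id): bool
    {
        $this->store->del($this->prefix . $id);
        return true;
    }

    // Expiry is left to the store (for example a TTL per key); this sketch sets none.
    public function gc($max_lifetime): int { return 0; }
}

/** In-memory stand-in so the sketch runs without a cache server. */
class ArrayStore
{
    private $data = [];
    public function get($key) { return $this->data[$key] ?? null; }
    public function set($key, $value) { $this->data[$key] = $value; }
    public function del($key) { unset($this->data[$key]); }
}

$handler = new SharedStoreSessionHandler(new ArrayStore());
// In a real app: session_set_save_handler($handler, true); session_start();
```

In practice the Redis and Memcached PHP extensions ship their own session save handlers, so a custom class like this is mainly useful when you need control over key naming or failure behavior.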

Slide 25

Slide 25 text

Cache
• How important is your cache?
• Does your app work if the cache disappears?
• Make sure it’s not the source of truth
• Sharding vs Replication for scale

Slide 26

Slide 26 text

The Basics

Slide 27

Slide 27 text

Getting Started

Slide 28

Slide 28 text

Now IP addresses are broken
    $clientIp = $_SERVER['REMOTE_ADDR']; // 1.2.3.4 - address of the load balancer, not the client

Slide 29

Slide 29 text

Let’s fix IPs
    // Start from the direct peer, then prefer a valid forwarded address
    $clientIp = $_SERVER['REMOTE_ADDR'] ?? null;
    if (isset($_SERVER['HTTP_X_FORWARDED_FOR'])) {
        // The header may hold a comma-separated chain: client, proxy1, proxy2
        foreach (explode(',', $_SERVER['HTTP_X_FORWARDED_FOR']) as $ip) {
            $ip = trim($ip);
            if (filter_var($ip, FILTER_VALIDATE_IP)) {
                $clientIp = $ip;
                break;
            }
        }
    }
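
X-Forwarded-For is an ordinary request header, so any client can send one. A hedged variation that honors the header only when the request arrived through a known load balancer; the function name and the `$trustedProxies` list are assumptions, not part of the talk:

```php
<?php
/**
 * Resolve the client IP, honoring X-Forwarded-For only when the direct
 * peer is one of our own proxies.
 * $trustedProxies is a hypothetical list of load balancer addresses.
 */
function resolveClientIp(array $server, array $trustedProxies): ?string
{
    $remoteAddr = $server['REMOTE_ADDR'] ?? null;
    $forwarded  = $server['HTTP_X_FORWARDED_FOR'] ?? null;

    // Ignore the header unless the request came through a trusted proxy.
    if ($forwarded === null || !in_array($remoteAddr, $trustedProxies, true)) {
        return $remoteAddr;
    }

    // Take the first valid address in the chain, as the slide does.
    foreach (explode(',', $forwarded) as $candidate) {
        $candidate = trim($candidate);
        if (filter_var($candidate, FILTER_VALIDATE_IP) !== false) {
            return $candidate;
        }
    }
    return $remoteAddr;
}

// Usage: $clientIp = resolveClientIp($_SERVER, ['10.0.0.5']);
```

Taking the leftmost entry matches the slide, but a client can prepend its own values before the proxy appends the real one. Stricter setups walk the list from the right and stop at the first address that is not a trusted proxy.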

Slide 30

Slide 30 text

Now let’s fix the databases

Slide 31

Slide 31 text

App Considerations
    $dbs = ["db1.site.com", "db2.site.com", "db3.site.com"];
    $slaveNum = mt_rand(0, count($dbs) - 1);
    // PDO takes a DSN, not a bare hostname (mysql driver assumed; credentials omitted)
    $pdo = new \PDO("mysql:host=" . $dbs[$slaveNum]);

Slide 32

Slide 32 text

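Better Version
    $dbs = ["db1.site.com", "db2.site.com", "db3.site.com"];
    $slaveNum = mt_rand(0, count($dbs) - 1);
    $pdo = null;
    try {
        // PDO takes a DSN, not a bare hostname (mysql driver assumed; credentials omitted)
        $pdo = new \PDO("mysql:host=" . $dbs[$slaveNum]);
    } catch (\PDOException $e) {
        // First choice failed: try each of the other servers in turn
        for ($i = 0; $i < count($dbs); $i++) {
            if ($i != $slaveNum) {
                try {
                    $pdo = new \PDO("mysql:host=" . $dbs[$i]);
                    break;
                } catch (\PDOException $e) {
                    // This one is down too; keep going
                }
            }
        }
        if (!$pdo) {
            throw new \RuntimeException("all out of DBs");
        }
    }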
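
The same failover idea can be written as a small reusable function. This is one possible shape, not the speaker's code; `connectToAny` and the `$connect` factory are hypothetical names, and the demo uses a fake connector so it runs without a database:

```php
<?php
/**
 * Try hosts in random order and return the first connection that succeeds.
 * $connect is a caller-supplied factory that returns a connection or throws,
 * for example one that builds a PDO instance for the given host.
 */
function connectToAny(array $hosts, callable $connect)
{
    shuffle($hosts); // random order spreads load and never retries the same host
    $lastError = null;
    foreach ($hosts as $host) {
        try {
            return $connect($host);
        } catch (\Exception $e) {
            $lastError = $e; // remember why, then move on to the next host
        }
    }
    throw new \RuntimeException('all out of DBs', 0, $lastError);
}

// Demo with a fake connector: db1 is "down", the others answer.
$conn = connectToAny(['db1.site.com', 'db2.site.com', 'db3.site.com'], function ($host) {
    if ($host === 'db1.site.com') {
        throw new \RuntimeException("$host unreachable");
    }
    return "connected to $host";
});
echo $conn, "\n";
```

Shuffling once up front replaces the "pick one, then loop over the rest" logic, and chaining the last exception keeps the root cause visible when every host is down.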

Slide 33

Slide 33 text

Now we need more load balancers

Slide 34

Slide 34 text

Let’s not forget caching!

Slide 35

Slide 35 text

App Considerations
    require_once __DIR__ . '/vendor/autoload.php';

    $client = new \Predis\Client();
    $value = $client->get("myvalue"); // throws if Redis is unreachable
    if (!$value) {
        $value = reallyExpensiveFunction();
        $client->set("myvalue", $value);
    }

Slide 36

Slide 36 text

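Better Version
    class Cache
    {
        private $redis;

        public function __construct(\Predis\Client $redis)
        {
            $this->redis = $redis;
        }

        public function get($key)
        {
            try {
                return $this->redis->get($key);
            } catch (\Exception $e) {
                return null; // treat an unreachable cache as a miss
            }
        }

        public function set($key, $value)
        {
            try {
                return $this->redis->set($key, $value);
            } catch (\Exception $e) {
                return null; // a failed write is not fatal
            }
        }
    }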
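
Combining the two caching slides gives a "get or compute" helper that keeps working when the cache is gone. A sketch under assumptions: `remember` and `DownCache` are hypothetical names, and `$client` is any object with `get` and `set` methods, such as a Predis client:

```php
<?php
/**
 * Cache-aside helper: return the cached value, or compute it and try to store it.
 * Cache failures are swallowed so the app degrades to "slow" instead of "down".
 */
function remember($client, string $key, callable $compute)
{
    try {
        $value = $client->get($key);
        if ($value !== null) {
            return $value;
        }
    } catch (\Exception $e) {
        // Cache is unreachable: fall through and compute directly.
    }

    $value = $compute();

    try {
        $client->set($key, $value);
    } catch (\Exception $e) {
        // Failing to store is not fatal; the next request recomputes.
    }
    return $value;
}

/** Stand-in for a cache server that is down (for the demo only). */
class DownCache
{
    public function get($key) { throw new \RuntimeException('connection refused'); }
    public function set($key, $value) { throw new \RuntimeException('connection refused'); }
}

echo remember(new DownCache(), 'myvalue', function () {
    return 'computed without cache';
}), "\n";
```

Checking for `null` rather than using `!$value` means legitimately falsy cached values such as "0" are not recomputed on every request.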

Slide 37

Slide 37 text

Circuit Breaker Pattern
• Consider a circuit breaker pattern for other external services
• https://github.com/offers/rho
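
To show the idea behind the pattern, here is a minimal hand-rolled breaker. It is not the API of the library linked above; the class name, method names, and thresholds are all assumptions made for illustration:

```php
<?php
/**
 * Minimal circuit breaker: after $maxFailures consecutive failures the
 * circuit "opens" and calls are rejected immediately until $retryAfter
 * seconds have passed, at which point one trial call is allowed through.
 * State lives in this object only; sharing it across PHP requests would
 * need a shared store such as APCu or a cache server.
 */
class CircuitBreaker
{
    private $maxFailures;
    private $retryAfter;
    private $failures = 0;
    private $openedAt = null;

    public function __construct(int $maxFailures = 3, int $retryAfter = 30)
    {
        $this->maxFailures = $maxFailures;
        $this->retryAfter = $retryAfter;
    }

    public function call(callable $operation, callable $fallback)
    {
        $isOpen = $this->openedAt !== null
            && (time() - $this->openedAt) < $this->retryAfter;
        if ($isOpen) {
            return $fallback(); // fail fast instead of waiting on a dead service
        }

        try {
            $result = $operation();
            $this->failures = 0;    // success closes the circuit again
            $this->openedAt = null;
            return $result;
        } catch (\Exception $e) {
            $this->failures++;
            if ($this->failures >= $this->maxFailures) {
                $this->openedAt = time(); // open (or re-open) the circuit
            }
            return $fallback();
        }
    }
}
```

Once the breaker opens, a slow or dead dependency costs one fallback call per request instead of a full timeout, which is what keeps one failing service from dragging the whole app down.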

Slide 38

Slide 38 text

Service Discovery Overview
• Distributed data stores that are a registry of what servers are where
• Your code connects to these instead of using a config file
• Even if you had to update it manually, it’d be faster than deploying

Slide 39

Slide 39 text

Service Discovery
• Etcd
• Consul
• Zookeeper
• Oh by the way, these all need their own cluster of at least 3 servers

Slide 40

Slide 40 text

Quick Service Discovery Example
    $etcd = new LinkOrb\Component\Etcd\Client($etcdClusterHostname);
    $dbs = $etcd->get("/database/slaves");
    $slaveNum = mt_rand(0, count($dbs) - 1);
    try {
        $pdo = new \PDO($dbs[$slaveNum]);
    } catch (\PDOException $e) {
        //...
    }

Slide 41

Slide 41 text

The problem with service discovery
• Latency
• Each lookup takes approximately 10ms
• If you have to look up DB, Cache, ElasticSearch, SMTP, etc, it adds up
• Try to organize services by logical application, so you can query for a whole namespace at once

Slide 42

Slide 42 text

Updated Discovery Example
    $etcd = new LinkOrb\Component\Etcd\Client($etcdClusterHostname);
    $config = $etcd->get("/services/myapp");
    $dbs = $config["databases"]["slaves"];
    $slaveNum = mt_rand(0, count($dbs) - 1);
    try {
        $pdo = new \PDO($dbs[$slaveNum]);
    } catch (\PDOException $e) {
        //...
    }
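
One way to make sure the namespace is fetched only once per request is to wrap the lookup in a small memoizing class. A sketch only; `ServiceConfig`, the `$fetch` callable, and the cache hostname are hypothetical, and the fake fetcher stands in for the real discovery client:

```php
<?php
/**
 * Fetches a whole service namespace once and answers later lookups from memory.
 * $fetch is a caller-supplied function that performs the real discovery
 * query and returns a nested array.
 */
class ServiceConfig
{
    private $fetch;
    private $namespace;
    private $config = null;

    public function __construct(callable $fetch, string $namespace)
    {
        $this->fetch = $fetch;
        $this->namespace = $namespace;
    }

    /** Look up a value by path, e.g. get('databases', 'slaves'). */
    public function get(string ...$path)
    {
        if ($this->config === null) {
            // The only network round trip for this request.
            $this->config = ($this->fetch)($this->namespace);
        }
        $node = $this->config;
        foreach ($path as $segment) {
            if (!is_array($node) || !array_key_exists($segment, $node)) {
                return null;
            }
            $node = $node[$segment];
        }
        return $node;
    }
}

// Demo with a fake fetcher that counts how often the "network" is hit.
$lookups = 0;
$config = new ServiceConfig(function ($namespace) use (&$lookups) {
    $lookups++; // stands in for a ~10ms call to the discovery cluster
    return [
        'databases' => ['slaves' => ['db1.site.com', 'db2.site.com']],
        'cache'     => ['host' => 'cache1.site.com'], // hypothetical host
    ];
}, '/services/myapp');

$dbs   = $config->get('databases', 'slaves');
$cache = $config->get('cache', 'host');
echo "lookups performed: $lookups\n";
```

However many settings the request reads, the latency cost stays at a single lookup.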

Slide 43

Slide 43 text

Questions?