Open Source Infrastructure

Slide 1

Slide 1 text

Lightning Talk: Open Source Infrastructure June 10, 2017

Slide 2

Slide 2 text

About Me • Systems Engineer @ SeatGeek • Cake Core Developer • twitter.com/savant • github.com/josegonzalez

Slide 3

Slide 3 text

Some Stats • 25 Core Developers • 5 Different Continents • 15+ timezones • 12+ languages

Slide 4

Slide 4 text

What do we do? • A butcher • A baker • A candlestick-maker

Slide 5

Slide 5 text

What do we do? • Write and translate documentation • Maintain existing CakePHP Websites • Investigate new core/community initiatives • Provide support via chat/email/forums • Wrangle Social Media

Slide 6

Slide 6 text

What day jobs do we have? • Car Parts Salesman • Company Owner • Professional Dancer • Software Developer

Slide 7

Slide 7 text

We have real jobs, with real lives, and concerns other than: Is the docs site up?

Slide 8

Slide 8 text

We have real jobs, with real lives, and concerns other than: Did the server get hacked?

Slide 9

Slide 9 text

We have real jobs, with real lives, and concerns other than: Why is the server down?

Slide 10

Slide 10 text

We have real jobs, with real lives, and concerns other than: Who can deploy the bakery?

Slide 11

Slide 11 text

What is the problem? We need to ensure that the CakePHP Sites and related services are highly available to our users with minimal interference

Slide 12

Slide 12 text

What does this even mean?

Slide 13

Slide 13 text

• Centralized Logging • Server Metrics and APM • Authentication and ACL for infrastructure access • Backups (and backup testing!) • Scaling • Disaster Recovery

Slide 14

Slide 14 text

All things that are full time jobs

Slide 15

Slide 15 text

At normal, paid institutions

Slide 16

Slide 16 text

For teams of dedicated systems engineers

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

Why is this even a problem?

Slide 19

Slide 19 text

Moar background • 5-10 people online at any time • Might be busy with paid work, side projects, or life • Some with tons of infra experience, most with none • Language Barrier • Little to no on-boarding time

Slide 20

Slide 20 text

Can we pay for a service? • CakeDC isn’t made out of money • Services are expensive • Still require maintenance, on-boarding

Slide 21

Slide 21 text

What about our systems engineers? • Full time jobs/burnout • Different tech than day job • May not be available

Slide 22

Slide 22 text

Choosing Technology Because technology solves everything™

Slide 23

Slide 23 text

Solves the problem Not creating interesting ones for the hell of it

Slide 24

Slide 24 text

Familiar to maintainers Or at least some of them

Slide 25

Slide 25 text

Quick to pick up For those for whom the tech is new

Slide 26

Slide 26 text

Boring Choice is the Best Choice

Slide 27

Slide 27 text

Easy to extend Needs change, so should infra

Slide 28

Slide 28 text

Infra as code Why is this setting there? Who applied this change?

Slide 29

Slide 29 text

Configuration Management with Ansible

Slide 30

Slide 30 text

But why tho? • Everyone can read YAML • Low learning curve, lots of tutorials • Maps well to existing server tasks • Easy to write custom modules

Slide 31

Slide 31 text

But why not tho? • Everyone needs ssh access • Repo credentials are in the open, even if encrypted • YAML sucks as an automation language • Moves fast and breaks things

Slide 32

Slide 32 text

Continuous Integration Via Jenkins

Slide 33

Slide 33 text

But why tho? • Everyone has used it, everyone hates it equally • Jobs can be generated via Groovy DSL • Deployable via Docker • Plugins for everything

Slide 34

Slide 34 text

But why not tho? • Ecosystem is constantly moving • Default UI isn’t great • Really easy to use/abuse plugins and non-pipeline jobs

Slide 35

Slide 35 text

But why not CircleCI/TravisCI/Wercker? • Expensive • Jobs are usually attached to a single repo • Hard to do OSS with secure secrets • At the whim of service providers

Slide 36

Slide 36 text

Automated Deployments Using Dokku

Slide 37

Slide 37 text

But why tho? • Already built • Integrates with Ansible, and mostly unattended • Designed with Docker in mind • Has an internal Champion • OSS

Slide 38

Slide 38 text

But why not other solutions? • Does not need to be clustered • We can withstand 30 min of downtime during restores • Setup/training costs much larger for K8s/Mesos/Nomad • Custom scripting not necessary, can use Heroku Buildpacks • No need to reinvent the wheel and write build/release code

Slide 39

Slide 39 text

Considerations For the considerate

Slide 40

Slide 40 text

Access Control • Everyone has access, but is that appropriate? • Smaller circle of trust makes infra control easier, but harder to deal with in distributed context

Slide 41

Slide 41 text

Access Control • Passwords and keys need to be decrypted • Web of trust for initial access must be established

Slide 42

Slide 42 text

Access Control • Use strong authentication, prefer keys to passwords • SSH Session Auditing?
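
One way to keep the "prefer keys to passwords" rule honest is a small check that can run against each server's sshd configuration. The sketch below is an illustration only, not something shown in the talk: it assumes the configuration text has already been read from the server, and it ignores Match blocks and Include files.

```python
def password_logins_enabled(sshd_config_text):
    """Report whether PasswordAuthentication is effectively 'yes'.

    sshd uses the first value it sees for a keyword and defaults to 'yes'.
    Match blocks and Include files are ignored in this sketch.
    """
    for raw in sshd_config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].lower() == "passwordauthentication":
            return parts[1].strip().lower() != "no"
    return True


# Hypothetical configuration text for illustration.
sample = """
# keys only
PubkeyAuthentication yes
PasswordAuthentication no
"""
print(password_logins_enabled(sample))
```

A check like this could run from the same automation that applies the configuration, so drift shows up without anyone logging in by hand.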

Slide 43

Slide 43 text

Logging • Logs should be aggregated • Use the same format, JSON or logfmt • ISO 8601 - Don’t make up date time formats

Slide 44

Slide 44 text

Logging level=info datetime=2017-06-10T08:01:40+00:00 msg="Stopping all fetchers"
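
A line in this shape can be produced with a few lines of code. The following is a minimal sketch, assuming a simplified logfmt quoting rule (quote any value containing a space, a double quote, or an equals sign); the helper names `logfmt_value` and `logfmt_line` are made up for illustration.

```python
from datetime import datetime, timezone


def logfmt_value(value):
    """Quote a value when it is empty or contains a space, quote, or '='."""
    text = str(value)
    if text == "" or any(ch in text for ch in ' "='):
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return text


def logfmt_line(level, msg, when=None, **fields):
    """Build one logfmt line with an ISO 8601 UTC timestamp."""
    when = when or datetime.now(timezone.utc)
    pairs = {
        "level": level,
        "datetime": when.isoformat(timespec="seconds"),
        "msg": msg,
    }
    pairs.update(fields)  # extra metadata is appended in call order
    return " ".join(f"{key}={logfmt_value(val)}" for key, val in pairs.items())


stamp = datetime(2017, 6, 10, 8, 1, 40, tzinfo=timezone.utc)
print(logfmt_line("info", "Stopping all fetchers", when=stamp))
```

Letting the standard library format the timestamp keeps every service on the same ISO 8601 output instead of a hand-rolled date format.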

Slide 45

Slide 45 text

Logging Add metadata to make logs useful

Slide 46

Slide 46 text

Logging level=info datetime=2017-06-10T08:01:40+00:00 msg="Stopping all fetchers" id=ConsumerFetcherManager-1382721708341 tag=stopping_fetchers module=kafka.consumer.ConsumerFetcherManager
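
The payoff of the extra fields is that a log line becomes something you can query. The sketch below parses the line above into a dictionary; it leans on `shlex` as an approximation of logfmt quoting rather than a full logfmt parser, and `parse_logfmt` is a hypothetical helper name.

```python
import shlex


def parse_logfmt(line):
    """Split a logfmt line into a dict; shlex handles the quoted values."""
    record = {}
    for token in shlex.split(line):
        key, _, value = token.partition("=")
        record[key] = value
    return record


line = (
    'level=info datetime=2017-06-10T08:01:40+00:00 msg="Stopping all fetchers" '
    "id=ConsumerFetcherManager-1382721708341 tag=stopping_fetchers "
    "module=kafka.consumer.ConsumerFetcherManager"
)
record = parse_logfmt(line)
print(record["tag"], record["module"])
```

With `tag` and `module` available as keys, filtering for one component's events is a dictionary lookup instead of a regular expression over free text.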

Slide 47

Slide 47 text

Logging • Self-hosted logging can be cheaper, but more labor intensive • Ship first, filter logs later

Slide 48

Slide 48 text

Monitoring • Our site needs to be globally available, does yours? • Server metrics via Diamond/StatsD/Graphite • Visualization via Grafana
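
For a sense of how little code it takes to feed a StatsD pipeline, here is a rough sketch of the plain-text line protocol (`<name>:<value>|<type>`) sent over UDP. The metric names, host, and port are assumptions for illustration; 8125 is the conventional StatsD port, and `send_metric` is defined but not called here.

```python
import socket


def statsd_payload(name, value, kind):
    """Format one metric in the StatsD line protocol: <name>:<value>|<type>."""
    if kind not in {"c", "g", "ms"}:  # counter, gauge, timer
        raise ValueError(f"unsupported metric type: {kind}")
    return f"{name}:{value}|{kind}"


def send_metric(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget a metric over UDP (host and port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("ascii"), (host, port))


# Hypothetical metric names for illustration.
print(statsd_payload("docs.requests", 1, "c"))
print(statsd_payload("docs.response_time", 320, "ms"))
```

Because the transport is UDP, a slow or missing metrics server does not block the application that emits the metric.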

Slide 49

Slide 49 text

Monitoring • APM can be expensive • Site Speed and Analytics

Slide 50

Slide 50 text

Backups Do Them and Verify Them

Slide 51

Slide 51 text

Backupss • Backups go to Rackspace Cloud, manually cleared • Manual verification performed semi-monthly • Automated backup verification coming up
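
As a rough idea of what automated verification might look like, the sketch below checks a backup archive against a recorded SHA-256 checksum and confirms that expected files are present. It is not the project's actual tooling: the archive layout, file names, and the `verify_backup` helper are all hypothetical, and a real check would also restore the data somewhere and query it.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large archives do not load into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(archive, expected_sha256, required_members):
    """Return a list of problems; an empty list means the backup passed."""
    problems = []
    if sha256_of(archive) != expected_sha256:
        problems.append("checksum mismatch")
    try:
        with tarfile.open(archive) as tar:
            names = set(tar.getnames())
    except tarfile.TarError:
        return problems + ["archive is not readable"]
    for member in required_members:
        if member not in names:
            problems.append(f"missing {member}")
    return problems


# Demo with a throwaway archive; file names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    dump = Path(tmp) / "db.sql"
    dump.write_text("CREATE TABLE articles (id INT);\n")
    archive = str(Path(tmp) / "backup.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(str(dump), arcname="db.sql")
    recorded = sha256_of(archive)
    result = verify_backup(archive, recorded, ["db.sql", "uploads"])
    print(result)
```

Returning a list of problems rather than a boolean makes the result easy to post to chat or a CI job log.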

Slide 52

Slide 52 text

Backupsss • No Offsite Backups • Backups not encrypted (yet) • No Disaster Recovery