Lightning Talk: Open Source Infrastructure
June 10, 2017
Slide 2
Slide 2 text
About Me
• Systems Engineer @ SeatGeek
• Cake Core Developer
• twitter.com/savant
• github.com/josegonzalez
Slide 3
Slide 3 text
Some Stats
• 25 Core Developers
• 5 Different Continents
• 15+ timezones
• 12+ languages
Slide 4
Slide 4 text
What do we do?
• A butcher
• A baker
• A candlestick-maker
Slide 5
Slide 5 text
What do we do?
• Write and translate documentation
• Maintain existing CakePHP Websites
• Investigate new core/community initiatives
• Provide support via chat/email/forums
• Wrangle Social Media
Slide 6
Slide 6 text
What day jobs do we have?
• Car Parts Salesman
• Company Owner
• Professional Dancer
• Software Developer
Slide 7
Slide 7 text
We have real jobs,
With real lives, and
concerns other than:
Is the docs site up?
Slide 8
Slide 8 text
We have real jobs,
With real lives, and
concerns other than:
Did the server get hacked?
Slide 9
Slide 9 text
We have real jobs,
With real lives, and
concerns other than:
Why is the server down?
Slide 10
Slide 10 text
We have real jobs,
With real lives, and
concerns other than:
Who can deploy the bakery?
Slide 11
Slide 11 text
What is the problem?
We need to ensure that the CakePHP sites
and related services are highly available to our
users with minimal interference
Slide 12
Slide 12 text
What does this even
mean?
Slide 13
Slide 13 text
• Centralized Logging
• Server Metrics and APM
• Authentication and ACL for infrastructure access
• Backups (and backup testing!)
• Scaling
• Disaster Recovery
Slide 14
Slide 14 text
All things that are full
time jobs
Slide 15
Slide 15 text
At normal, paid
institutions
Slide 16
Slide 16 text
For teams of dedicated
systems engineers
Slide 18
Slide 18 text
Why is this even a
problem?
Slide 19
Slide 19 text
Moar background
• 5-10 people online at any time
• Might be busy with paid work, side projects, or
life
• Some with tons of infra experience, most with
none
• Language Barrier
• Little to no onboarding time
Slide 20
Slide 20 text
Can we pay for a service?
• CakeDC isn’t made out of money
• Services are expensive
• Still require maintenance, onboarding
Slide 21
Slide 21 text
What about our systems engineers?
• Full time jobs/burnout
• Different tech than day job
• May not be available
Slide 22
Slide 22 text
Choosing Technology
Because technology solves everything™
Slide 23
Slide 23 text
Solves the problem
Not creating interesting ones for the hell of it
Slide 24
Slide 24 text
Familiar to maintainers
Or at least some of them
Slide 25
Slide 25 text
Quick to pick up
For those for whom the tech is new
Slide 26
Slide 26 text
Boring Choice is the
Best Choice
Slide 27
Slide 27 text
Easy to extend
Needs change, so should infra
Slide 28
Slide 28 text
Infra as code
Why is this setting there? Who applied this change?
Slide 29
Slide 29 text
Configuration
Management
with Ansible
Slide 30
Slide 30 text
But why tho?
• Everyone can read YAML
• Low learning curve, lots of tutorials
• Maps well to existing server tasks
• Easy to write custom modules
Slide 31
Slide 31 text
But why not tho?
• Everyone needs SSH access
• Repo credentials are in the open, even if
encrypted
• YAML sucks as an automation language
• Moves fast and breaks things
Slide 32
Slide 32 text
Continuous Integration
Via Jenkins
Slide 33
Slide 33 text
But why tho?
• Everyone has used it, everyone hates it equally
• Jobs can be generated via Groovy DSL
• Deployable via Docker
• Plugins for everything
Slide 34
Slide 34 text
But why not tho?
• Ecosystem is constantly moving
• Default UI isn’t great
• Really easy to use/abuse plugins in non-pipeline
jobs
Slide 35
Slide 35 text
But why not CircleCI/TravisCI/Wercker?
• Expensive
• Jobs are usually attached to a single repo
• Hard to do OSS with secure secrets
• At the whim of service providers
Slide 36
Slide 36 text
Automated
Deployments
Using Dokku
Slide 37
Slide 37 text
But why tho?
• Already built
• Integrates with Ansible, and mostly unattended
• Designed with Docker in mind
• Has an internal Champion
• OSS
Slide 38
Slide 38 text
But why not other solutions?
• Does not need to be clustered
• We can withstand 30 min of downtime during restores
• Setup/training costs are much larger for K8s/Mesos/
Nomad
• Custom scripting not necessary, can use Heroku
Buildpacks
• No need to reinvent the wheel and write build/release
code
Slide 39
Slide 39 text
Considerations
For the considerate
Slide 40
Slide 40 text
Access Control
• Everyone has access, but is that appropriate?
• Smaller circle of trust makes infra control easier,
but harder to deal with in distributed context
Slide 41
Slide 41 text
Access Control
• Passwords and keys need to be decrypted
• Web of trust for initial access must be
established
Slide 42
Slide 42 text
Access Control
• Use strong authentication, prefer keys to
passwords
• SSH Session Auditing?
Slide 43
Slide 43 text
Logging
• Logs should be aggregated
• Use the same format everywhere, JSON or logfmt
• ISO 8601 - Don’t make up datetime formats
Slide 44
Slide 44 text
Logging
level=info datetime=2017-06-10T08:01:40+00:00 msg="Stopping all fetchers"
Slide 45
Slide 45 text
Logging
Add metadata to make logs useful
Slide 46
Slide 46 text
Logging
level=info datetime=2017-06-10T08:01:40+00:00
msg="Stopping all fetchers" id=ConsumerFetcherManager-1382721708341
tag=stopping_fetchers module=kafka.consumer.ConsumerFetcherManager
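A line like the one above could be produced by a small helper. This is only a sketch, not what the CakePHP infrastructure actually runs: the function names `logfmt_value` and `format_log` are made up for illustration, and the quoting rules are a simplified take on logfmt.

```python
from datetime import datetime, timezone


def logfmt_value(value):
    """Quote a value when it is empty or contains a space, quote, or equals sign."""
    text = str(value)
    if text == "" or any(ch in text for ch in ' "='):
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return text


def format_log(level, msg, when=None, **metadata):
    """Build one logfmt line: level, ISO 8601 datetime, msg, then any metadata."""
    # Default to the current time in UTC so every host logs the same way.
    when = when or datetime.now(timezone.utc)
    fields = {
        "level": level,
        "datetime": when.isoformat(timespec="seconds"),
        "msg": msg,
    }
    fields.update(metadata)
    return " ".join(f"{key}={logfmt_value(val)}" for key, val in fields.items())
```

Calling `format_log("info", "Stopping all fetchers", tag="stopping_fetchers")` yields the level, datetime, and msg fields followed by the extra `tag` field, so metadata is just keyword arguments.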
Slide 47
Slide 47 text
Logging
• Self-hosted logging can be cheaper, but more
labor intensive
• Ship first, filter logs later
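"Filter later" is cheap when every line shares one format. A rough sketch of filtering logfmt lines after they have been shipped, assuming values are either bare or double-quoted; `parse_logfmt` and `filter_by_level` are hypothetical names, and `shlex` is used here as a convenient tokenizer rather than a full logfmt parser:

```python
import shlex


def parse_logfmt(line):
    """Split a logfmt line into a dict of key/value strings."""
    fields = {}
    # shlex keeps quoted values such as msg="Stopping all fetchers" together.
    for token in shlex.split(line):
        key, _, value = token.partition("=")
        fields[key] = value
    return fields


def filter_by_level(lines, level):
    """Keep only the lines whose level field matches."""
    return [line for line in lines if parse_logfmt(line).get("level") == level]
```

Because the metadata is already in the line, the same approach works for filtering on `tag` or `module` instead of `level`.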
Slide 48
Slide 48 text
Monitoring
• Our site needs to be globally available, does
yours?
• Server metrics via Diamond/StatsD/Graphite
• Visualization via Grafana
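For a sense of how little it takes to feed that pipeline, here is a minimal sketch of sending one metric in the StatsD plain-text format (`name:value|type`) over UDP. The host, port, and metric name are assumptions for illustration (8125 is the usual StatsD default), and a real setup would more likely use an existing client library:

```python
import socket


def statsd_packet(name, value, metric_type):
    """Encode one metric as a StatsD datagram, e.g. b'docs.requests:1|c'."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(name, value, metric_type="g", host="127.0.0.1", port=8125):
    """Fire-and-forget a metric over UDP; nothing is raised if no one listens."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(statsd_packet(name, value, metric_type), (host, port))
```

The common metric types are `c` (counter), `g` (gauge), and `ms` (timer), so a page render time could be sent as `send_metric("docs.render_time", 120, "ms")`.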
Slide 49
Slide 49 text
Monitoring
• APM can be expensive
• Site Speed and Analytics
Slide 50
Slide 50 text
Backups
Do Them and Verify Them
Slide 51
Slide 51 text
Backupss
• Backups go to Rackspace Cloud, manually
cleared
• Manual verification performed semi-monthly
• Automated backup verification coming up
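One possible first step toward automated verification is a checksum comparison. This is only a sketch under assumptions: it presumes a SHA-256 digest was recorded when the backup was taken, and the names `sha256_of` and `verify_backup` are hypothetical. It proves the file is intact, not that it restores, so a test restore is still needed.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(path, expected_sha256):
    """Return True when the file matches the digest recorded at backup time."""
    return sha256_of(path) == expected_sha256.lower()
```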
Slide 52
Slide 52 text
Backupsss
• No Offsite Backups
• Backups not encrypted (yet)
• No Disaster Recovery