Slide 1

Slide 1 text

INTRO TO IT OPERATIONS CONCEPTS & TOOLS MICHAEL DEHAAN NCSU SPRING 2017

Slide 2

Slide 2 text

WHY • Familiarity with how your code is deployed and managed in production leads to better code • Better understanding of performance and failure modes • It’s good to be friends with the ops team (inverse: even more true!) • Responsibility is increasingly shared between development and operations, sometimes filed under the overloaded umbrella term “DevOps”.

Slide 3

Slide 3 text

THINGS TO COVER • Classic vs Microservices Architectures • IaaS / Cloud APIs and Tools • Configuration Management / Immutable Systems • Monitoring / Log Collection • Load Balancers / Update Strategies • Backup / Disaster Recovery • Security Policy • Continuous Integration / Continuous Deployment

Slide 4

Slide 4 text

“TYPICAL” WEB ARCHITECTURE Load Balancer www1 www2 www3 db1 db2 message bus job1 job2

Slide 5

Slide 5 text

MICROSERVICES “ARCHITECTURES”

Slide 6

Slide 6 text

IN THE BEGINNING • In the beginning (and still a lot today), software installs were largely run by systems administrators writing their own custom scripts • These scripts grew unmaintainable over time • Scripts could fail • Much of the install process was not fully automated, even where some scripts existed • Upgrades were a frequent cause of widespread system failure

Slide 7

Slide 7 text

IAAS / CLOUD • Misleading assumption that Cloud services (ex: Amazon, GCE) are primarily about renting IP addresses • ALSO: storage, databases, load balancers, firewalls/security, messaging, etc • Cloud topology control examples: CloudFormation (AWS), Terraform (generic) • Cloud API examples: Boto (AWS Python) • CLI Tools

Slide 8

Slide 8 text

CONFIGURATION MANAGEMENT • Declarative description of what should be on a system • “Idempotence” & the GPS Analogy: F(x) = F(F(x)) • Typically “push” or “pull” based • Designed around Pull: Puppet, Chef • Designed around Push: Ansible
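
A minimal sketch of the idempotence property F(x) = F(F(x)) from the slide above. The `ensure_line` helper and the sample config lines are hypothetical and only illustrate the idea; they are not taken from Puppet, Chef, or Ansible.

```python
def ensure_line(lines, wanted):
    """Return the config lines with `wanted` present (hypothetical helper).

    The line is appended only if it is missing, so applying the
    function a second time changes nothing.
    """
    if wanted in lines:
        return list(lines)
    return list(lines) + [wanted]


config = ["PermitRootLogin no"]          # illustrative starting state
once = ensure_line(config, "MaxAuthTries 3")
twice = ensure_line(once, "MaxAuthTries 3")

# Idempotence: F(x) == F(F(x))
print(once == twice)
```

Like a GPS that recomputes the route from wherever you are, the function describes the desired end state rather than the steps to get there.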

Slide 9

Slide 9 text

IMMUTABLE SYSTEMS • Alternative strategy to configuration management • New images replace old images, rather than upgrading systems in place • Increases reliability and potentially decreases upgrade times • Cannot be as easily applied to stateful servers (databases, etc) • Can slow down the development process • Image building: Packer, Docker • Image management: EC2, Mesos/Kubernetes

Slide 10

Slide 10 text

MONITORING • On-site: • Graphite, Ganglia, Nagios, Cacti, Munin • Hosted / Off-site: • New Relic • Alerting vs trending • Application Performance Management (APM): • AppDynamics
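
One way to picture "alerting vs trending", as a sketch under stated assumptions: the CPU samples, the threshold of 90, and both function names are made up for illustration and do not reflect how any of the tools listed above work internally.

```python
def should_alert(sample, threshold):
    """Alerting: react to a single sample crossing a threshold."""
    return sample > threshold


def trend(samples, window):
    """Trending: smooth samples with a simple moving average."""
    return [
        sum(samples[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(samples))
    ]


cpu = [40, 43, 46, 94, 49, 52]   # hypothetical CPU % readings

alerts = [s for s in cpu if should_alert(s, threshold=90)]
smoothed = trend(cpu, window=3)

print(alerts)
print(smoothed)
```

The single spike triggers an alert, while the moving average shows the slower upward drift that matters for capacity planning.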

Slide 11

Slide 11 text

LOG COLLECTION/SEARCH • Off-site: • Splunk • Sumo Logic • Loggly • On-site: • Logstash / “ELK Stack”

Slide 12

Slide 12 text

LOAD BALANCERS & AUTO SCALING • Typically more than one instance of a service is deployed • The load balancer routes requests among those instances • Closely related: auto-scaling groups • Warm-up problems and solutions • TV show voting example
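
A minimal sketch of round-robin routing, one of the simplest load-balancing policies. The `RoundRobinBalancer` class is hypothetical, and the backend names are borrowed from the "typical" web architecture slide; real load balancers also handle health checks, weights, and connection draining.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand each incoming request to the next backend in turn."""

    def __init__(self, backends):
        self._backends = cycle(backends)

    def route(self, request):
        # The request content is ignored here; only arrival order matters.
        return next(self._backends)


lb = RoundRobinBalancer(["www1", "www2", "www3"])
assigned = [lb.route(f"req{i}") for i in range(5)]
print(assigned)
```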

Slide 13

Slide 13 text

BACKUP / DISASTER RECOVERY • You must be able to restore everything from backup • Minimize number/types of data sources • If backups are not tested they do not exist • Understanding multi-region and multi-datacenter
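
"If backups are not tested they do not exist" can be sketched as a backup step that always performs a trial restore and compares checksums. The file names and the `backup_and_verify` helper are assumptions for illustration; a real setup would restore to a separate machine or region.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256(path):
    """Checksum of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def backup_and_verify(source, backup_dir, restore_dir):
    """Copy `source` to a backup, restore it elsewhere, compare checksums."""
    backup = Path(backup_dir) / Path(source).name
    shutil.copy2(source, backup)
    restored = Path(restore_dir) / Path(source).name
    shutil.copy2(backup, restored)
    return sha256(source) == sha256(restored)


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "backups").mkdir()
    (tmp / "restore").mkdir()
    data = tmp / "db.dump"              # hypothetical data source
    data.write_text("customer records")
    verified = backup_and_verify(data, tmp / "backups", tmp / "restore")

print(verified)
```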

Slide 14

Slide 14 text

HIDDEN MANAGEMENT COMPLEXITY • As you add management software, the management software often needs management • Be aware of what happens when you lose a shard or key server • Some software upgrades in “weird” ways • Holes in the bucket: This software requires ZooKeeper, which requires etcd, …

Slide 15

Slide 15 text

UPDATE STRATEGIES • Outages • Rolling updates • “Red/green, blue/green, whatever” updates
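
A rolling update, sketched under simple assumptions: the fleet is a dict of server name to version, and `healthy` is a hypothetical health-check callback. The update proceeds in batches and stops at the first unhealthy batch, so the remaining servers keep serving the old version.

```python
def rolling_update(servers, new_version, batch_size, healthy):
    """Upgrade servers in batches; halt if a batch fails its health check."""
    names = list(servers)
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        for name in batch:
            servers[name] = new_version
        if not all(healthy(name) for name in batch):
            return False          # stop; later servers are left untouched
    return True


fleet = {"www1": "v1", "www2": "v1", "www3": "v1"}

# Pretend www2 fails its health check after the upgrade.
ok = rolling_update(fleet, "v2", batch_size=1,
                    healthy=lambda name: name != "www2")
print(ok, fleet)
```

A blue/green update differs in that a complete second fleet is brought up on the new version and traffic is switched over all at once.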

Slide 16

Slide 16 text

SECURITY POLICY • As the number of teams engaged in “self-service” type deployments grows… • Security scans increasingly need to happen at build-time • Consistency is mandatory • Code-review checks need to be in place and not simply rubber stamps

Slide 17

Slide 17 text

CONTINUOUS INTEGRATION • Automatically build code when checked in • Ideally: run unit tests as part of the build step. • Typically: Jenkins. Also Travis CI, CircleCI, TeamCity, Bamboo, others. • Dangers of inconsistent build job rules.

Slide 18

Slide 18 text

CONTINUOUS DEPLOYMENT • Can’t get here overnight; this is a spectrum. • First requires full automation of a deploy, and a solid C.I. setup • When C.I. completes, at least deploy to stage and run functional tests (FTs) • Next step: if FTs pass, consider a deploy to prod
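
The gating described above can be sketched as a small pipeline function. Everything here is hypothetical (the step names, the `functional_tests` callback, the `deploy_to_prod` switch); it only shows the order of the gates, not how Jenkins or any other C.I. tool implements them.

```python
def run_pipeline(ci_passed, functional_tests, deploy_to_prod=False):
    """Return the list of deployment steps that would run."""
    steps = []
    if not ci_passed:
        return steps                      # nothing deploys without green C.I.
    steps.append("deploy:stage")
    if not functional_tests("stage"):
        steps.append("halt:functional-tests-failed")
        return steps
    if deploy_to_prod:                    # opt-in: the far end of the spectrum
        steps.append("deploy:prod")
    return steps


passing = run_pipeline(True, lambda env: True, deploy_to_prod=True)
failing = run_pipeline(True, lambda env: False, deploy_to_prod=True)
print(passing)
print(failing)
```

Keeping the production deploy behind a flag lets a team move along the spectrum gradually.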

Slide 19

Slide 19 text

ADDITIONAL RESOURCES • Unfortunately, this field moves fast. • Latest tech, but advice of varying quality: • news.ycombinator.com • Reddit.com/r/devops • Reddit.com/r/sysadmin
