Operations Concepts & Tools

An intro to some important high-level aspects of running code in production, given to NCSU students. It is mostly intended as a list of "things you may encounter and may wish to look up" rather than an in-depth treatment. (Not supremely valuable without the accompanying talk, but maybe useful.)

Michael DeHaan

February 24, 2017

Transcript

  1. INTRO TO IT OPERATIONS
    CONCEPTS & TOOLS
    MICHAEL DEHAAN
    NCSU
    SPRING 2017

  2. WHY
    • Familiarity with how your code is deployed and managed in production leads
    to better code
    • Better understanding of performance and failure modes
    • It’s good to be friends with the ops team (and the reverse is even more true!)
    • Responsibility is increasingly shared between development and operations, a
    shift sometimes filed under the overloaded umbrella term “DevOps”.

  3. THINGS TO COVER
    • Classic vs Microservices Architectures
    • IaaS / Cloud APIs and Tools
    • Configuration Management / Immutable Systems
    • Monitoring / Log Collection
    • Load Balancers / Update Strategies
    • Backup / Disaster Recovery
    • Security Policy
    • Continuous Integration / Continuous Deployment

  4. “TYPICAL” WEB ARCHITECTURE
    Load Balancer
    www1 www2 www3
    db1
    db2
    message bus
    job1 job2

  5. MICROSERVICES “ARCHITECTURES”

  6. IN THE BEGINNING
    • In the beginning (and still often today), software installs were largely run by
    systems administrators writing their own custom scripts
    • These scripts grew unmaintainable over time
    • Scripts could fail
    • Many install processes were not fully automated, even when some scripts
    existed
    • Upgrades were a frequent cause of widespread system failure

  7. IAAS / CLOUD
    • Misleading assumption: that cloud services (e.g., Amazon, GCE) are primarily
    about renting IP addresses
    • They ALSO offer storage, databases, load balancers, firewalls/security, messaging, etc.
    • Cloud topology control examples: CloudFormation (AWS), Terraform (generic)
    • Cloud API examples: Boto (AWS SDK for Python)
    • CLI tools

  8. CONFIGURATION MANAGEMENT
    • Declarative description of what should be on a system
    • “Idempotence” & the GPS Analogy: F(x) = F(F(x))
    • Typically “push” or “pull” based
    • Designed around Pull: Puppet, Chef
    • Designed around Push: Ansible
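A minimal Python sketch of the idempotence property F(x) = F(F(x)), under the assumption that "desired state" means "this line is present in this file". The `ensure_line` helper is hypothetical; it is similar in spirit to Ansible's `lineinfile` module but is not its implementation. It declares a state and only changes the file when the file is not already in that state, so a second run is a no-op.

```python
import os
import tempfile


def ensure_line(path: str, line: str) -> bool:
    """Declare that `line` must be present in the file at `path`.

    Returns True if the file had to be changed, False if it was already in the
    desired state. Calling it twice has the same effect as calling it once,
    i.e. F(x) == F(F(x)).
    """
    content = ""
    if os.path.exists(path):
        with open(path) as f:
            content = f.read()
    if line in content.splitlines():
        return False  # Already converged: make no change.
    with open(path, "a") as f:
        # Avoid gluing the new line onto a final line that lacks a newline.
        if content and not content.endswith("\n"):
            f.write("\n")
        f.write(line + "\n")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        hosts = os.path.join(d, "hosts")  # Hypothetical file and entry.
        print(ensure_line(hosts, "10.0.0.5 db1"))  # True: file was changed
        print(ensure_line(hosts, "10.0.0.5 db1"))  # False: nothing to do
```

Like the GPS analogy, the function describes where you want to end up rather than the steps to get there, which is what makes repeated runs safe.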

  9. IMMUTABLE SYSTEMS
    • Alternative strategy to configuration management
    • New images replace old images, rather than upgrading systems in place
    • Increases reliability and potentially decreases upgrade times
    • Cannot be as easily applied to stateful servers (databases, etc.)
    • Can slow down the development process
    • Image building: Packer, Docker
    • Image management: EC2, Mesos/Kubernetes

  10. MONITORING
    • On-site: Graphite, Ganglia, Nagios, Cacti, Munin
    • Hosted / off-site: New Relic
    • Alerting vs. trending
    • Application Performance Management (APM): AppDynamics
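To make "alerting vs. trending" concrete, here is a toy Python sketch that is not modeled on any of the tools above. Alerting checks each new sample against a threshold and fires immediately; trending looks at a moving average over a window to spot slow drift before it becomes an alert. The metric name, threshold, and window size are hypothetical.

```python
from collections import deque
from statistics import mean
from typing import Optional


class Metric:
    """Toy metric tracker showing alerting vs. trending on the same data."""

    def __init__(self, name: str, alert_threshold: float, window: int = 5):
        self.name = name
        self.alert_threshold = alert_threshold
        # Only the most recent `window` samples are kept for trending.
        self.samples = deque(maxlen=window)

    def record(self, value: float) -> Optional[str]:
        """Alerting: store the sample and return a message if it crosses the threshold."""
        self.samples.append(value)
        if value > self.alert_threshold:
            return f"ALERT {self.name}={value} (threshold {self.alert_threshold})"
        return None

    def trend(self) -> Optional[float]:
        """Trending: moving average of recent samples, for dashboards and capacity planning."""
        return mean(self.samples) if self.samples else None


if __name__ == "__main__":
    disk = Metric("disk_used_pct", alert_threshold=90)  # Hypothetical metric.
    for v in [70, 75, 80, 85, 92]:
        msg = disk.record(v)
        if msg:
            print(msg)  # Fires only for 92.
    print(disk.trend())  # 80.4: the climb was visible before the alert fired.
```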

  11. LOG COLLECTION/SEARCH
    • Off-site: Splunk, Sumo Logic, Loggly
    • On-site: Logstash / “ELK Stack” (Elasticsearch, Logstash, Kibana)
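Log collection and search work best when applications emit structured logs. A minimal sketch using Python's standard `logging` module writes one JSON object per line, which log shippers such as Logstash can parse as JSON instead of relying on custom regexes. The field names and the `extra={"fields": ...}` convention are assumptions of this sketch, not a required schema.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Extra structured fields, passed as extra={"fields": {...}} (a convention of this sketch).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def make_logger(name: str, stream=sys.stdout) -> logging.Logger:
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # Don't also emit through the root logger.
    return logger


if __name__ == "__main__":
    log = make_logger("checkout")  # Hypothetical service name.
    log.info("order placed", extra={"fields": {"order_id": 42, "latency_ms": 87}})
```

Searching for every slow checkout then becomes a field query (e.g., on `latency_ms`) rather than a text grep.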

  12. LOAD BALANCERS & AUTO SCALING
    • Typically more than one instance of a service is deployed
    • A load balancer routes requests across those instances
    • Closely related: auto-scaling groups
    • Warm-up problems and solutions
    • TV show voting example
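A toy Python sketch of the two ideas on this slide, assuming simple round-robin routing (one of several common balancing strategies) and a naive capacity-based scaling rule. The backend names reuse the `www1`..`www3` hosts from the architecture diagram; `capacity_per_instance` and the min/max bounds are hypothetical. Warm-up time for new instances is deliberately not modeled.

```python
import itertools
import math
from typing import Iterable


class RoundRobinBalancer:
    """Toy load balancer: hands out backends in turn, skipping unhealthy ones."""

    def __init__(self, backends: Iterable[str]):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = itertools.cycle(self.backends)

    def mark_down(self, backend: str) -> None:
        self.healthy.discard(backend)  # e.g. after a failed health check

    def mark_up(self, backend: str) -> None:
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self) -> str:
        # Try each backend at most once per call before giving up.
        for _ in range(len(self.backends)):
            backend = next(self._ring)
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backends")


def desired_instances(requests_per_sec: float, capacity_per_instance: float,
                      min_n: int = 2, max_n: int = 20) -> int:
    """Naive auto-scaling rule: enough instances for the load, clamped to [min_n, max_n]."""
    needed = math.ceil(requests_per_sec / capacity_per_instance)
    return max(min_n, min(max_n, needed))


if __name__ == "__main__":
    lb = RoundRobinBalancer(["www1", "www2", "www3"])
    lb.mark_down("www2")
    print([lb.pick() for _ in range(4)])
    print(desired_instances(950, capacity_per_instance=100))
```

In the TV-voting scenario, a sudden spike means `desired_instances` jumps long before new instances are warm, which is exactly where the warm-up problem bites.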

  13. BACKUP / DISASTER RECOVERY
    • You must be able to restore everything from backup
    • Minimize number/types of data sources
    • If backups are not tested they do not exist
    • Understand multi-region and multi-datacenter deployments
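"If backups are not tested they do not exist" can be partly automated. A minimal sketch, assuming the data to protect is a plain directory of files: archive it, actually restore the archive into a scratch directory, and compare SHA-256 checksums of every file. Real restore tests (databases, object stores, whole environments) are much more involved; the function names here are hypothetical.

```python
import hashlib
import os
import tarfile
import tempfile
from typing import Dict, Iterator, Tuple


def walk_files(root: str) -> Iterator[Tuple[str, str]]:
    """Yield (full_path, path_relative_to_root) for every file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            yield full, os.path.relpath(full, root)


def tree_digests(root: str) -> Dict[str, str]:
    """Map each file's relative path to the SHA-256 of its contents."""
    digests = {}
    for full, rel in walk_files(root):
        with open(full, "rb") as f:
            digests[rel] = hashlib.sha256(f.read()).hexdigest()
    return digests


def backup(src_dir: str, archive_path: str) -> None:
    with tarfile.open(archive_path, "w:gz") as tar:
        for full, rel in walk_files(src_dir):
            tar.add(full, arcname=rel)


def verify_backup(src_dir: str, archive_path: str) -> bool:
    """Actually restore the archive into a scratch directory and compare contents."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive_path, "r:gz") as tar:
            if hasattr(tarfile, "data_filter"):
                tar.extractall(scratch, filter="data")  # Safer extraction where supported.
            else:
                tar.extractall(scratch)
        return tree_digests(scratch) == tree_digests(src_dir)
```

A scheduled job that runs `verify_backup` and alerts on `False` turns "we have backups" into "we have restorable backups", at least for this simple case.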

  14. HIDDEN MANAGEMENT COMPLEXITY
    • As you add management software, the management software often needs
    management
    • Be aware of what happens when you lose a shard or a key server
    • Some software upgrades in “weird” ways
    • Holes in the bucket: this software requires ZooKeeper, which requires etcd, …

  15. UPDATE STRATEGIES
    • Outages
    • Rolling updates
    • “Red/green, blue/green, whatever” updates
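A minimal sketch of a rolling update, assuming `deploy` and `is_healthy` are caller-supplied hooks (hypothetical; in practice they would call your deploy tooling and health-check endpoint). Instances are updated one batch at a time and the rollout halts at the first unhealthy batch, so the rest of the fleet keeps serving the old version. A blue/green update would instead bring up a complete second set of instances and switch the load balancer over in one step; that is not shown here.

```python
from typing import Callable, List


def rolling_update(instances: List[str],
                   deploy: Callable[[str], None],
                   is_healthy: Callable[[str], bool],
                   batch_size: int = 1) -> List[str]:
    """Update instances one batch at a time, halting at the first unhealthy batch.

    In a real rollout each batch would also be drained from the load balancer
    before `deploy` and re-added after it passes its health check; that is
    omitted here. Returns the instances updated, in order.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    updated: List[str] = []
    for start in range(0, len(instances), batch_size):
        batch = instances[start:start + batch_size]
        for inst in batch:
            deploy(inst)
        unhealthy = [inst for inst in batch if not is_healthy(inst)]
        if unhealthy:
            # Stop here so the remaining instances keep serving the old version.
            raise RuntimeError(f"rollout halted; unhealthy after update: {unhealthy}")
        updated.extend(batch)
    return updated
```

A larger `batch_size` finishes faster but removes more capacity at once and exposes more users if the new version is bad.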

  16. SECURITY POLICY
    • As more teams do “self-service” deployments…
    • Security scans increasingly need to happen at build time
    • Consistency is mandatory
    • Code-review checks need to be in place, not simply rubber stamps
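One small example of a build-time security check: a sketch of a script that scans source files for hardcoded secrets and returns a nonzero exit code so the CI build step fails. The two patterns (the `AKIA` prefix of AWS access key IDs and PEM private-key headers) are illustrative only; dedicated secret scanners ship far larger rule sets, and the function names are hypothetical.

```python
import re
from typing import Dict, List, Pattern, Tuple

# Illustrative patterns only; real scanners use many more rules.
SECRET_PATTERNS: Dict[str, Pattern[str]] = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}


def scan_text(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, rule_name) for every line matching a secret pattern."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, rule))
    return findings


def main(paths: List[str]) -> int:
    """Scan each file; return 1 if anything was found, so CI can fail the build."""
    failed = False
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as f:
            for lineno, rule in scan_text(f.read()):
                print(f"{path}:{lineno}: possible {rule}")
                failed = True
    return 1 if failed else 0
```

Wired into CI as `sys.exit(main(sys.argv[1:]))` over the changed files, the same rules apply to every team's build, which is the consistency the slide asks for.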

  17. CONTINUOUS INTEGRATION
    • Automatically build code when checked-in
    • Ideally: run unit tests as part of the build step.
    • Typically: Jenkins. Also Travis CI, CircleCI, TeamCity, Bamboo, others.
    • Dangers of inconsistent build job rules.

  18. CONTINUOUS DEPLOYMENT
    • Can’t get here overnight; this is a spectrum.
    • First requires a fully automated deploy and a solid C.I. setup
    • When C.I. completes, at minimum deploy to stage and run functional tests (FTs)
    • Next step: if FTs pass, consider a deploy to prod
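The gating described above can be sketched as an ordered list of steps where each must succeed before the next runs. This is a minimal sketch, assuming each step is a hypothetical callable returning True on success; real pipelines express the same idea in their CI/CD tool's own configuration.

```python
from typing import Callable, List, Tuple

# A pipeline step is a name plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_pipeline(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failure. Returns the names of steps that ran."""
    ran: List[str] = []
    for name, step in steps:
        ran.append(name)
        if not step():
            print(f"pipeline stopped: {name} failed")
            break
    return ran


if __name__ == "__main__":
    # Hypothetical stand-ins; real steps would call the build, deploy, and test tooling.
    steps: List[Step] = [
        ("build + unit tests", lambda: True),
        ("deploy to stage", lambda: True),
        ("functional tests", lambda: False),  # Simulated FT failure...
        ("deploy to prod", lambda: True),     # ...so this step never runs.
    ]
    print(run_pipeline(steps))
```

Moving along the spectrum mostly means removing manual approvals between these steps once each gate is trustworthy.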

  19. ADDITIONAL RESOURCES
    • Unfortunately, this field moves fast.
    • Latest tech, but advice of varying quality:
    • news.ycombinator.com
    • Reddit.com/r/devops
    • Reddit.com/r/sysadmin
