
OpenShift Updates and Release Process

Rob
August 17, 2020

Transcript

  1. Understanding over-the-air capabilities
     OpenShift Updates and Release Process
     Rob Szumski, OpenShift Product Management, @robszumski
     Scott Dodson, OpenShift Engineering, @sdodson

  2. Each OpenShift release is a collection of Operators
     • 30 Operators run every major part of the platform (see the quick CLI check after this list):
       ◦ Console, Monitoring, Authentication, Machine management, Kubernetes Control Plane, etcd, DNS, and more.
     • Operators constantly strive to meet the desired state, merging admin config and Red Hat recommendations.
     • CI testing constantly runs install, upgrade and stress tests against groups of Operators.

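     You can list these platform Operators and the release version each one is running straight from the CLI; a minimal check (exact columns vary by release):

     # every platform Operator with its version and Available/Progressing/Degraded status
     $ oc get clusteroperators

     # the overall release the cluster is currently running
     $ oc get clusterversion
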
  3. OpenShift release cadence
     A stream of updates that transitions from full feature development to critical bugs only.
     (Timeline: an x.1 y-stream spanning roughly a year of z-streams, from x.1.2 through x.1.24; N release: full support, RFEs, bugfixes, security; N-2 release: OTA pathway to the N release, critical bugs and security.)
     • Z-stream releases weekly:
       ◦ New installer binary
       ◦ Over-the-air upgrade package
       ◦ Release notes published
       ◦ Errata notice published
     • Active development happens while the y-stream is the latest.
     • Critical bugs and security remain fixed for the entire duration, including backports.
     • Each release includes Kubernetes software and RHCOS node software.

  4. Upgrading weekly z-stream releases
     What to expect when maintaining your clusters with the latest security patches:
     • Updates can be driven by the Console or programmatically through the API (see the CLI sketch after this list).
     • Upgrades happen in place; there is no re-provisioning of Nodes.
     • Apps using Kubernetes HA features should not have downtime.
     • Pods typically do not need to be rescheduled, although all Nodes will reboot in a serial fashion.
     • All user sessions will be reset.
     • Update duration depends on the size of the cluster and how long Pods take to evict themselves from your Nodes.

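     Driving the same update from the CLI is a thin wrapper over that API. A minimal sketch (the target version below is a placeholder, not a recommendation):

     # list the updates the cluster currently sees in its channel
     $ oc adm upgrade

     # start an in-place upgrade to one of the listed versions
     $ oc adm upgrade --to=4.4.12

     # watch progress
     $ oc get clusterversion -w
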
  5. Connected Clusters
     Clusters are given a set of happy paths through different versions.
     (Diagram: the admin selects a desired version from the available options; the OpenShift Update Service serves the Red Hat sourced update graph via the Cincinnati protocol, and the connected cluster pulls the Red Hat sourced update image from the Quay.io container registry.)

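     A connected cluster already records which channel and update service it is following; a quick way to confirm, assuming the default Red Hat hosted update service (the upstream field may be empty when the default is in use):

     # the channel the cluster pulls its update graph from
     $ oc get clusterversion version -o jsonpath='{.spec.channel}{"\n"}'

     # the update service endpoint, if explicitly overridden
     $ oc get clusterversion version -o jsonpath='{.spec.upstream}{"\n"}'
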
  6. OpenShift release channels
     Gain control over the pace of over-the-air updates.
     • candidate-4.5
       ◦ Best mechanism for testing compatibility with bleeding edge versions of OpenShift
       ◦ Can include versions for which there is no recommended update path
     • fast-4.5
       ◦ Always contains GA versions of OpenShift
       ◦ Fastest pace channel
       ◦ Use on at least 1 production cluster to catch issues specific to you
     • stable-4.5
       ◦ Always contains GA versions of OpenShift
       ◦ Slower paced channel
       ◦ Released after stability looks good on fast
       ◦ May lag fast during the first weeks of a new y-stream release by design
     Read more: documentation on updating your cluster. GitHub: look at the channel source data. (A channel comparison sketch follows.)

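     Because each channel is just a different view of the same update graph, you can compare what they currently contain against the public update service. A minimal sketch using curl and jq:

     # list which versions each 4.5 channel currently serves
     $ for c in candidate-4.5 fast-4.5 stable-4.5; do
         echo "== $c =="
         curl -sH 'Accept: application/json' \
           "https://api.openshift.com/api/upgrades_info/v1/graph?channel=$c" |
           jq -r '[.nodes[].version] | sort | unique | join("  ")'
       done
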
  7. OpenShift release process
     (Diagram: a build flows from Dev & CI through Release Candidates, Red Hat & Partner Testing, Update Signing and Final Testing to the GA build, then through the Pre-Release and General Availability channels.)
     • Dev & CI: feedback through CI; extra focus on upgrade testing; the version number is born here; the release is pulled if tests fail.
     • Pre-Release: promote to candidate channel; feedback through bugs; extra focus on real-world envs; pulled for real-world errors found outside CI.
     • General Availability: errata & docs published; promote to fast channel with extra focus on upgrade errors & platform stability; feedback through telemetry and support cases; edges blocked for bug count, upgrade error rates or degraded Operator health; promote to stable channel with extra focus on workload stability.
     • Promotions take ~2 days for a z-stream and ~weeks for a y-stream.

  8. Overlap of OCP support lifecycles
     A rolling N-2 support window keeps you secure and up to date with Kubernetes.
     (Timeline: releases x.1 through x.6 overlapping across Years 1–3, with an EUS extended support period for an EUS release; N release: full support, RFEs, bugfixes, security; N-2 release: OTA pathway to the N release, critical bugs and security; upgrade windows occur where lifecycles overlap.)

  9. Upgrading 4.4 to 4.5 using channels
     Cluster admins are always in control of when clusters update.
     (Diagram: a cluster running 4.4.11 on the fast-4.4 channel chooses its next channel — do I want to remain on fast (fast-4.5) or go to stable (stable-4.5) for 4.5? A worked CLI sequence follows.)

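     A minimal sketch of that decision from the CLI, assuming the cluster is on fast-4.4 and the admin chooses stable-4.5 (changing the channel does not itself start an upgrade; the cluster only updates when told to):

     # confirm the current version and channel
     $ oc get clusterversion version -o jsonpath='{.status.desired.version}  {.spec.channel}{"\n"}'

     # switch channels; 4.5 targets appear once the graph offers a path
     $ oc patch clusterversion version --type merge -p '{"spec":{"channel":"stable-4.5"}}'

     # review what is now available, then upgrade when ready
     $ oc adm upgrade
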
  10. How is this done safely?
     Red Hat curates the best sequence of updates through a graph database.

     $ curl -sH 'Accept: application/json' \
         'https://api.openshift.com/api/upgrades_info/v1/graph?channel=fast-4.4' |
         jq -r '[.nodes[].version] | sort | unique[]'
     4.3.12 4.3.13 4.3.18 4.3.19 4.3.21 4.3.22 4.3.23 4.3.25 4.3.26 4.3.27 4.3.28 4.3.29
     4.4.10 4.4.11 4.4.12 4.4.3 4.4.4 4.4.5 4.4.6 4.4.8 4.4.9

     • Paths are constantly being tweaked to get the best experience.
     • Feedback comes through telemetry, bugs and automated testing.
     • Paths can skip entire sections if safe.
     • Admins can force their way through the cluster’s protections, if desired.
     (Diagram: a simple view and the full update graph, showing the versions that can upgrade to 4.4 and the choices to upgrade to within 4.4; an edge query sketch follows.)

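     The same endpoint also returns the edges between those versions, so you can ask exactly which updates are offered from the release you are running. A minimal sketch, assuming the Cincinnati response shape of a nodes array plus edges expressed as [from, to] index pairs:

     # versions reachable in one hop from 4.4.11 on fast-4.4
     $ curl -sH 'Accept: application/json' \
         'https://api.openshift.com/api/upgrades_info/v1/graph?channel=fast-4.4' |
         jq -r --arg v 4.4.11 '.nodes as $n
           | [.edges[] | select($n[.[0]].version == $v) | $n[.[1]].version]
           | sort | unique[]'
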
  11. Blocked edges
     Paths that are unsafe can be “blocked” to force clusters through a safer alternative.
     (Diagram: a bug in 4.4.11, issue identified but no fix — the edges into it are blocked and clusters skip past it.)
     • Goal 1: route around the versions with bugs.
     • Goal 2: provide remediation for impacted clusters.
     • The threshold for blocking an edge can be low if the issue is widespread, or rare but severe.
     • The OpenShift fleet gets smarter every day.

  12. Threshold for blocking edges
     You’ll see a common set of questions on Bugzilla:
     • Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
       ◦ example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
       ◦ example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
     • What is the impact? Is it serious enough to warrant blocking edges?
       ◦ example: Up to 2 minute disruption in edge routing
       ◦ example: Up to 90 seconds of API downtime
       ◦ example: etcd loses quorum and you have to restore from backup
     • How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
       ◦ example: Issue resolves itself after five minutes
       ◦ example: Admin uses oc to fix things
       ◦ example: Admin must SSH to hosts, restore from backups, or perform other non-standard admin activities
     • Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
       ◦ example: No, it’s always been like this, we just never noticed
       ◦ example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1

  13. Encountering a blocked edge
     What should I do if I am faced with a blocked edge?

     Already running version X:
     • You’re running a supported release.
     • Red Hat is committed to supporting any debugging, recovery, and mitigation which may be required to get you through the update.
     • Not every [cluster × platform × workload × config] hits every issue.
     • Typical: bugs are fixed and a new path is published for you to follow.
     • Less common: mechanisms are in place to force your way through, after testing to understand the ramifications. When in doubt, ask Red Hat.

     Desire to run version X:
     • Blocked edges don’t affect upgrades once started.
     • Always test it out in your test environment.
     • Understand the bugs, errata and content within the release.
     • Once ready, find the image pull spec, e.g. via the errata: https://access.redhat.com/errata/RHBA-2020:1393

     $ oc adm upgrade --help
     ...
     Options:
       --allow-explicit-upgrade=false
       --allow-upgrade-with-warnings=false

     # use the CLI to grab release info
     $ oc adm release info quay.io/openshift-release-dev/ocp-release:4.3.12-x86_64 | grep 'Name:\|OS/Arch:\|Pull From:'
     Name:      4.3.12
     OS/Arch:   linux/amd64
     Pull From: quay.io/openshift-release-dev/ocp-release@sha256:75e8f20e9d5a8fcf5bba4b8f7d17057463e222e350bcfc3cf7ea2c47f7d8ba5d

     # upgrade to the content within the image
     $ oc adm upgrade --allow-explicit-upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:75e8f20e9d5a8fcf5bba4b8f7d17057463e222e350bcfc3cf7ea2c47f7d8ba5d

  14. Disconnected Clusters
     Designed to give you the same automation as connected clusters.
     (Diagram: the admin mirrors the Red Hat sourced update image from the Quay.io container registry to a local container registry; the disconnected OpenShift cluster is updated locally from that copy. A mirroring sketch follows.)
     Same as connected:
     • Release images & signatures
     • Release notes & bugs
     • Click a button in the GUI or upgrade via the API
     • Monitoring progress
     • Debugging issues
     Unique to disconnected:
     • Mirroring commands
     • Point CRI-O at the internal registry instead of quay.io

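     Mirroring the release payload into the local registry is the step connected clusters never see. A minimal sketch, assuming a reachable internal registry at registry.example.com (a hypothetical hostname) and a 4.4.11 x86_64 payload:

     # copy the release image and all of its referenced images into the local registry
     $ oc adm release mirror \
         --from=quay.io/openshift-release-dev/ocp-release:4.4.11-x86_64 \
         --to=registry.example.com/ocp4/openshift-release-dev \
         --to-release-image=registry.example.com/ocp4/openshift-release-dev:4.4.11-x86_64

     The command prints follow-up configuration (e.g. an ImageContentSourcePolicy) to apply so the cluster pulls from the internal registry instead of quay.io.
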
  15. Choosing a release to qualify for your clusters
     Disconnected clusters require a human to provide input alongside the Update Service.

     Check the upgrade paths in OpenShift Update Service:
     • Narrow your selection of versions to those that have upgrade paths from your current OpenShift version.
     • Coming Soon! A webpage to guide you through this.

     Understand any bugs that may be open against candidate versions:
     • Ask your Technical Account Manager for advice and bugs they may be tracking for you specifically.
     • Use Bugzilla advanced search for your desired y-stream version.

     Review the roadmap for your compute platforms, storage providers and networking plugins:
     • Enhancements may be coming that would make sense to integrate into an upgrade cycle, especially if it takes a longer amount of time to qualify a release.

     # render the update graph locally to inspect the available paths
     $ git clone https://github.com/openshift/cincinnati.git && cd cincinnati/hack
     $ curl -sH 'Accept:application/json' 'https://api.openshift.com/api/upgrades_info/v1/graph?channel=fast-4.4' | ./graph.sh | dot -Tpng >graph-fast-4.4.png

  16. Pinch points between y-streams
     Typically you must be on the last few releases to move to another y-stream.
     (Diagram: a simple view of upgrade paths across 4.3, 4.4 and 4.5, collapsing down to 4.3.28 before crossing into 4.4; a query sketch follows.)
     • Early z-streams are typically serial, creating wider branches.
     • Later z-streams collapse back down as the graph is enhanced with feedback.
     • Narrow paths between y-streams increase quality; these are highly tested.

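     You can see the pinch point directly in the graph data by asking which 4.3.z releases have an edge into any 4.4 release. A minimal jq sketch, again assuming edges are [from, to] index pairs into the nodes array:

     # which 4.3.z versions can cross into 4.4 on the fast-4.4 channel?
     $ curl -sH 'Accept: application/json' \
         'https://api.openshift.com/api/upgrades_info/v1/graph?channel=fast-4.4' |
         jq -r '.nodes as $n
           | [.edges[]
              | select(($n[.[0]].version | startswith("4.3."))
                       and ($n[.[1]].version | startswith("4.4.")))
              | $n[.[0]].version]
           | sort | unique[]'
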
  17. FAQ
     Sometimes builds don’t make it into a channel. Why?
     • Once a version number is minted, we don’t ever reuse it.
     • If a build has issues that are found immediately after it is built, it will not be promoted to any channel.
     • This can happen with the first release of a new y-stream:
       ◦ 4.4.0-4.4.2 had issues, so 4.4.3 became the first GA release on that channel.
       ◦ 4.5.1 had issues discovered related to RHCOS.

     4.4.4 is in stable and 4.4.5 isn’t — is 4.4.5 safe for us to start using?
     • All releases in fast are GA, just like stable. The only difference is timing.
     • You should be testing out newer releases on test and staging clusters.
     • This is how you will find issues specific to your environment before a release rolls out more widely.

     Why do we have to upgrade to 4.3.18 before we can go to 4.4?
     • Yes, the later releases on a channel are required to upgrade to the next y-stream.
     • This reduces the paths, which increases focus and quality.
     • Later, more paths may be added as the upgrade looks healthy and more testing is done.

     Why is there no release at the expected time?
     • Rarely, a release is skipped for build issues.
     • Red Hat maintains at least 3 different y-streams that are all shipping upgrades. Delaying one typically delays another.
       ◦ If there is no critical security content, we would rather skip than delay.

     Can I skip a release during an upgrade? Go from 4.3 to 4.5?
     • No, you will need to go through a 4.4 release, even if it is run only for a short amount of time.
     • Kubernetes is making several changes/migrations along the way:
       ◦ Many APIs moving to stable that require migrations.
       ◦ Storage and other plugins moving from in-tree to out-of-tree with migrations.

  18. More Resources
     • Blog posts about upgrades
       ◦ https://www.openshift.com/blog/red-hat-openshift-cluster-upgrades-and-application-operator-updates
     • OpenShift Continuous Integration and Testing
       ◦ https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/
     • Production Cincinnati endpoint
       ◦ https://api.openshift.com/api/upgrades_info/v1/graph

  19. Thank you
     Red Hat is the world’s leading provider of enterprise open source software solutions. Award-winning support, training, and consulting services make Red Hat a trusted adviser to the Fortune 500.
     linkedin.com/company/red-hat | youtube.com/user/RedHatVideos | facebook.com/redhatinc | twitter.com/RedHat