Our Journey to High Availability and Fast Load

mbroz2
April 02, 2019

When we started, our app was deployed in a single datacenter, which exposed us to outages we could not control, gave non-local users an unoptimized experience, and caused other developer pain points. Join us as we go over the (mis)steps we took to resolve this by deploying to datacenters across multiple continents and configuring a global load balancer to provide the high availability and improved performance we sought. The talk will also cover DNS configuration and testing, certificate management, and availability testing.

Transcript

  1. Our Journey to High Availability and Fast Load Times via Global Load Balancing
    Michal Broz, Kin Ueng
  2. Source Code (public GitHub repo)
    • https://github.com/OpenLiberty/openliberty.io, composed of:
    • Mostly frontend (HTML, JS, adoc)
    • Some Java backend (REST APIs, Filters, etc.)
    • Some backend infrastructure enablement (server.xml)
    • Some cloud infrastructure (IBM Cloud stuff)
    • Controls the CI/CD
    • Mapping of app to address (manifest.yaml)
    • Build & Deploy (various bash scripts)
  3. CI/CD
    • Deployment on every commit
    • Build WAR
    • Clones GitHub repos on every commit
    • Deployed as a Cloud Foundry (cf) app on the Liberty for Java buildpack
    • Blue/Green deployment (no downtime, minimal redundancy)
  4. OH NO: SITE OUTAGE!
    • Site outage due to network gateway going down
    • DNS: Akamai
    • Pointing to a static IP in IBM Cloud (Dallas datacenter)
    • Registrar: via AT&T
    • Quick fix: Update static IP to new static IP
    • Site used A record (IP address)
    • Proper fix: Switch to CNAME (domain)
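The difference between the quick fix and the proper fix on this slide can be sketched with a toy in-memory resolver. This is only an illustration: the hostnames and IP addresses are hypothetical placeholders, not the real openliberty.io records, and a real resolver also deals with TTLs and caching.

```python
# Toy in-memory DNS. All hostnames and IPs below are hypothetical
# (documentation ranges), not the real openliberty.io records.

def resolve(records, name, max_hops=8):
    """Follow CNAME aliases until an A record is found, then return its IP."""
    for _ in range(max_hops):
        record_type, value = records[name]
        if record_type == "A":
            return value
        name = value  # CNAME: look up the alias target next
    raise RuntimeError("CNAME chain too long")

# Records the hosting provider controls.
provider_zone = {"app.provider.example": ("A", "192.0.2.10")}

# Two ways our own zone could point at the app.
zone_with_a = {"www.site.example": ("A", "192.0.2.10")}
zone_with_cname = {"www.site.example": ("CNAME", "app.provider.example")}

# The provider moves the app to a new IP (for example after a gateway failure).
provider_zone["app.provider.example"] = ("A", "192.0.2.20")

ip_via_a = resolve({**provider_zone, **zone_with_a}, "www.site.example")
ip_via_cname = resolve({**provider_zone, **zone_with_cname}, "www.site.example")
print(ip_via_a)      # still the old address until we edit our own zone
print(ip_via_cname)  # follows the provider's new address
```

With the A record, every IP change on the provider side needs a matching change in our own zone; with the CNAME, the alias keeps following whatever the provider publishes.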
  5. OH NO: SITE OUTAGE!!
    • Cloud Foundry in Dallas datacenter goes down
    • Unable to use site during conference
    • Dallas had more incidents than other datacenters
    • Move datacenter to Washington
    • Change pipeline to also deploy to Washington datacenter
    • Pipeline remains in Dallas
    • Change CNAME to point at Washington
    • Too bad we initially moved only www.openliberty.io (openliberty.io still pointed at Dallas, but we fixed that soon thereafter)
  6. OH NO: We’re at 2 strikes, 3rd one and you’re…
    • What could cause our 3rd outage:
    • Long turnaround to make DNS changes via Akamai
    • Single datacenter means single point of failure
    • Degrading performance as you get further away from Washington
    • Lots of European/West Coast conferences
    • Hard deadline for making ‘risky’ environment changes
    • Huge conference coming up
    • ‘Conference season’ starting
  7. Our World Expands Further
    [Diagram labels: Sydney, Dallas, Washington, UK, Germany; Dallas Datacenter; Build; Deploy; New; Akamai; AT&T]
  8. How to test load balancing the wrong way
    • What we didn’t know
    • How to organize the Cloud Foundry routes
    • We were under the impression that an openliberty.io subdomain was unavailable for testing
    • The wrong way
    • Performing tests with a mocked-up DNS environment
    • The test setup
    • One global load balancer instance with hostname glb.openliberty.io
    • Leave production untouched
    • Hardcode DNS servers on laptop to do testing
  9. How we successfully tested the GLB
    • Get AT&T to create test.openliberty.io
    • Update Akamai to route test.openliberty.io on IBM Cloud
    • Tell Akamai to use Cloudflare to resolve test.openliberty.io
    • Real-world testing
  10. Multiple Certificate Issues (test.openliberty.io)
    • Incorrect certificates being sent to users
    • Domain mismatch (syd.ol.io)
    • GLB health check fails for all pools due to TLS security
    • Custom certificate not valid for *.test.openliberty.io subdomains
  11. Our World, Untrusted
    [Diagram labels: Dallas Datacenter (Build); Deploy to Sydney, Dallas, Washington, UK, and Germany Datacenters; Region Route; Pools; CIS GLB; Cloudflare; Incorrect Certificate (Proxy Off)]
  12. test.openliberty.io – GLB proxy off mode
    [Diagram labels: GLB Proxy Off; Incorrect Certificates: *.mybluemix.net; Correct Certificates: *.openliberty.io; *.mybluemix.net; Dallas Datacenter]
  13. test.openliberty.io – proxy on (routing directly to app)
    [Diagram labels: CIS GLB; Incorrect Certificates; Correct Certificates; 404; Dallas Datacenter]
  14. test.openliberty.io – proxy on (routing to correct datacenter endpoint)
    [Diagram labels: CIS GLB; Incorrect Certificates; Correct Certificates; 404; Dallas Datacenter]
    • us-south.test.openliberty.io != *.openliberty.io
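The mismatch on this slide follows from how wildcard certificate names are matched: the `*` stands in for exactly one leftmost label. Below is a minimal sketch of that rule. The function name `wildcard_matches` is hypothetical, and real TLS libraries apply further checks (partial wildcards, internationalized names, public-suffix restrictions) that are left out here.

```python
def wildcard_matches(pattern: str, hostname: str) -> bool:
    """Simplified check: '*' may replace only the single leftmost label."""
    pattern_labels = pattern.lower().rstrip(".").split(".")
    host_labels = hostname.lower().rstrip(".").split(".")
    # A wildcard never spans multiple labels, so label counts must agree.
    if len(pattern_labels) != len(host_labels):
        return False
    if pattern_labels[0] == "*":
        # The wildcard needs a non-empty label; the rest must match exactly.
        return host_labels[0] != "" and pattern_labels[1:] == host_labels[1:]
    return pattern_labels == host_labels

print(wildcard_matches("*.openliberty.io", "test.openliberty.io"))
print(wildcard_matches("*.openliberty.io", "us-south.test.openliberty.io"))
print(wildcard_matches("*.test.openliberty.io", "us-south.test.openliberty.io"))
```

Under this rule a certificate for `*.openliberty.io` covers `test.openliberty.io` but not a two-level name such as `us-south.test.openliberty.io`, which would need its own `*.test.openliberty.io` entry.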
  15. GLB on hold; Migrate openliberty.io to CIS!
    • We know migrating a domain from Akamai to CIS works from our experience with test.openliberty.io
    • Get rid of using *.test.openliberty.io domains
    • Keep openliberty.io hardcoded to point at a single datacenter
    • Validate the GLB on the side before routing openliberty.io to the GLB
  16. Our World Now
    [Diagram labels: Dallas Datacenter (Build); Deploy to Sydney, Dallas, Washington, UK, and Germany Datacenters; Region Route; Pools; CIS GLB; Cloudflare; Trusted Certificates (Proxy On)]
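The idea of a global load balancer steering each client to a nearby pool and failing over when that pool is unhealthy can be sketched as follows. This is an assumption-laden illustration, not the CIS GLB's actual algorithm or configuration: the region names, the fallback orders, and the `pick_pool` function are all hypothetical.

```python
# Hypothetical preference lists: for each client region, pools ordered
# from most to least preferred. Not taken from the real GLB setup.
FALLBACK_ORDER = {
    "oceania": ["sydney", "dallas", "washington", "uk", "germany"],
    "europe": ["uk", "germany", "washington", "dallas", "sydney"],
    "north-america": ["washington", "dallas", "uk", "germany", "sydney"],
}
DEFAULT_ORDER = ["washington", "dallas", "uk", "germany", "sydney"]

def pick_pool(client_region, healthy):
    """Return the first healthy pool in the client's preference order."""
    for pool in FALLBACK_ORDER.get(client_region, DEFAULT_ORDER):
        if healthy.get(pool, False):
            return pool
    return None  # every pool failed its health check

# Health-check results as the load balancer might see them.
health = {"sydney": True, "dallas": True, "washington": False,
          "uk": True, "germany": True}

print(pick_pool("oceania", health))        # nearby pool is healthy
print(pick_pool("north-america", health))  # preferred pool is down, so fail over
```

A single datacenter outage then degrades latency for some users instead of taking the site down, which is the property the earlier slides were missing.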
  17. Monitoring: Health Check
    • Supports 15 regions
    • Response & load times
    • Response codes for resources
    • Alerts
    • Tip: Append a query parameter to the URL for easy analytics filtering
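The tip about tagging health-check traffic can be sketched with the standard library. The parameter name `source` and value `healthcheck` are hypothetical; any marker that the analytics tool can filter on would do.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def tag_url(url, key="source", value="healthcheck"):
    """Append a marker query parameter, keeping any existing parameters."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((key, value))  # hypothetical marker for analytics filtering
    return urlunsplit(parts._replace(query=urlencode(query)))

print(tag_url("https://openliberty.io/"))
print(tag_url("https://openliberty.io/guides/?search=rest"))
```

Requests from the monitoring service then carry the marker in their URL, so they can be excluded from visitor statistics with a simple filter.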
  18. TODOs:
    ❑ Caching
    • Enabled itself automatically when we turned on GLB
    ❑ DDoS (IP whitelist)
    ❑ Replicate pipeline on other datacenters