Apidays Singapore 2024 - API Monitoring x SRE by Ryan Ashneil and Eugene Wong, GovTech Singapore

API Monitoring x SRE (Site Reliability Engineering)
Ryan Ashneil, Software Engineer - Government Technology Agency of Singapore
Eugene Wong, Senior DevOps Engineer - Government Technology Agency of Singapore

Apidays Singapore 2024: Connecting Customers, Business and Technology (April 17 & 18, 2024)

------

Check out our conferences at https://www.apidays.global/

Do you want to sponsor or talk at one of our conferences?
https://apidays.typeform.com/to/ILJeAaV8

Learn more on APIscene, the global media made by the community for the community:
https://www.apiscene.io

Explore the API ecosystem with the API Landscape:
https://apilandscape.apiscene.io/

apidays

May 04, 2024

Transcript

  1. Copyright of GovTech © Not to be reproduced unless with explicit consent by GovTech.

    API Monitoring x SRE
    Eugene Wong - Senior DevOps Engineer
    Ryan Ashneil - Software Engineer
    18 April 2024, apidays
  2. SYNOPSIS

    Discover the underlying technologies that power Singapore Government APIs. Peek into how we design our central API Monitoring Dashboards around the SRE principles.
  3. CONTENTS

    CONTEXT BUILDING
    1. Introduction to our platforms
    2. API Monitoring within the API Lifecycle
    DASHBOARD DEEP DIVE
    3. SRE-designed dashboards
    4. Dashboarding: for preparation of major events
    5. Dashboarding: for authentication of APIs
    CONCLUSION
    6. Q&A
  4. Introduction of our platforms | APEX

    A full-fledged API management platform of the Singapore Government Tech Stack (SGTS) that allows government agencies, businesses and developers to manage and share APIs:
    • Cross-zone bridging between the internet and intranet realms
    • A diverse range of APIs (from more than 30 agencies) for building meaningful government services
    • Secure, and regularly kept current with policy updates, standards and best practices
  5. Introduction of our platforms | StackOps

    Based on the Elastic Stack, StackOps is a key monitoring component of the SGTS, designed to boost observability and support SRE:
    • Simplified monitoring setup on the Government Commercial Cloud
    • Reduced operational overhead for running monitoring
    • Faster Mean Time to Resolution (MTTR)
    • Helps APEX monitor the 4 Golden Signals of SRE
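The four golden signals referred to on this slide are usually taken to be latency, traffic, errors and saturation. A minimal sketch of how one time window of API call records might be reduced to those four numbers is shown below. It is an illustration only, not how StackOps or APEX compute them: the `ApiCall` record, the `golden_signals` function, the choice of the 95th percentile and the decision to count only 5xx responses as errors are all assumptions.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class ApiCall:
    status: int        # HTTP status code returned to the consumer
    latency_ms: float  # time taken to serve the call, in milliseconds


def golden_signals(calls, window_seconds, saturation):
    """Summarise one time window of API calls into the four golden signals.

    `saturation` is passed in (for example a CPU or connection-pool usage
    fraction) because it comes from infrastructure metrics, not call logs.
    """
    total = len(calls)
    latencies = sorted(c.latency_ms for c in calls)
    if total >= 2:
        # n=20 yields 19 cut points; the last one is the 95th percentile.
        p95 = quantiles(latencies, n=20, method="inclusive")[-1]
    else:
        p95 = latencies[0] if latencies else None
    # Assumption: only 5xx responses count as errors for this signal.
    errors = sum(1 for c in calls if c.status >= 500)
    return {
        "latency_p95_ms": p95,
        "traffic_rps": total / window_seconds,
        "error_rate": errors / total if total else 0.0,
        "saturation": saturation,
    }
```

Keeping saturation as an input reflects the split described in the deck, where API traffic logs and infrastructure metrics arrive as separate streams.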
  6. API Monitoring within the API Lifecycle | Where our platforms intersect

    • As active applications consume published APIs, their traffic transactions are logged and piped into StackOps.
    • Within APEX itself, our API publishers can run their entire API lifecycle activities, which are also logged and monitored.
    • Critical metrics of our infrastructure are also captured and shipped to StackOps.
    [Diagram labels: API Lifecycle; Services, Portals, Gateways, Infrastructure]
  7. SRE-designed Dashboards | Credits and Reference

    The Site Reliability Workbook: Practical Ways to Implement SRE (Niall Richard Murphy, David K. Rensin, Kent Kawahara & Stephen Thorne). Published by O’Reilly (3rd Release).
  8. SRE-designed Dashboards | Monitoring in SRE

    Which of these SRE Principles are related to monitoring?
    • Monitoring
    • Risk (evaluating risk of unexpected failures)
    • SLOs
    • Eliminating Toil (automating work to increase reliability and productivity)
    • Automation (including testing, software deployment, incident response, team communication)
    • Release Engineering
    • Simplicity (simple to manage)
  9. SRE-designed Dashboards | Monitoring in SRE – Interfaces (cont’d)

    SRE Workbook Pg 63: Interfaces. You’ll likely need to offer different views of the same data based on audience… Be specific about creating dashboards that make sense to the people consuming the content.
  10. SRE-designed Dashboards | Monitoring in SRE – Interfaces
  11. SRE-designed Dashboards | Monitoring in SRE – Ownership & Tooling

    SRE Workbook Pg 7: Use the Same Tooling, Regardless of Function or Job Title. …teams minding a service should use the same tools, regardless of their role in the organization… The more divergence you have, the less your company benefits from each effort to improve each individual tool.
  12. SRE-designed Dashboards | Monitoring in SRE – Ownership & Tooling

    • We used a common monitoring tool to measure API performance (traffic, latency), business metrics (SLO, SLI), application logging and infrastructure metrics/logging.
    • Product Managers, Devs and DevOps, as well as API publishers (customers), all used the same monitoring software: StackOps.
    • This has enabled us to work together and link events across different areas, fostering enhancements and streamlining the monitoring tool. Monitoring spans multiple domains, allowing APEX to oversee system performance during events and assist API users in real-time troubleshooting.
    • It also incentivises us to improve the tool for the benefit of all parties.
  13. SRE-designed Dashboards | Monitoring in SRE – Speed

    SRE Workbook Pg 62: Speed. Data should be available when you need it… Data more than four to five minutes stale might significantly impact how quickly you can respond to an incident.

    • API metrics that needed additional treatment in the logging pipeline reached StackOps anywhere from a few minutes to a couple of hours after the API calls were made, so the data was not fresh. The logging agents had been deployed “out-of-the-box” from the vendor’s Kubernetes manifest.
    • We monitored every “hop” of the logging containers and re-architected the logging infrastructure so that each “hop” was right-sized and its configuration optimised for performance.
  14. SRE-designed Dashboards | Monitoring in SRE – Modelling

    SRE Workbook Pg 22: Draw a high-level architecture diagram of your system; show the key components; the request flow, the data flow, and the critical dependencies.
    SRE Workbook Pg 39: Modelling User Journeys. You can use critical user journeys to help capture the experience of your customers.
    SRE Workbook Pg 72: Implementing Purposeful Metrics. Each exposed metric should serve a purpose… When you write a postmortem, think about which additional metrics would have allowed you to diagnose the issue faster.
  15. SRE-designed Dashboards | Monitoring in SRE – Modelling (Cont’d)

    • How we define Critical Metrics: traffic that affects the bottom line (revenue) of APEX
    • Being intentional and concise about which metrics to display in the Critical Metrics graph
    • Mapped out the paths of critical flow interdependencies, including: K8s nodes Traffic Pods Database Load Balancers Forward Proxies API Traffic
  16. Dashboarding: for preparation of major events | Preparations for National Event

    • Devising a custom dashboard for the event based on the principles below:
      o The dashboard shows the business metrics (API metrics) that are important for the event (e.g., API status codes, API latency)
      o Links to other critical data needed for troubleshooting are also embedded in the dashboard
    • Participating in end-to-end load tests with the API consumer and publisher, both to test the usefulness of the dashboard and to ascertain the performance of the API backend server.

    SRE Workbook Pg 39: Modelling User Journeys. You can use critical user journeys to help capture the experience of your customers.
  17. Dashboarding: for preparation of major events | Preparations for National Event - Dashboarding
  18. Dashboarding: for preparation of major events | Preparations for National Event - Dashboarding
  19. Dashboarding: for preparation of major events | Preparations for National Event - Dashboarding

    SRE Workbook Pg 39: Modelling User Journeys. You can use critical user journeys to help capture the experience of your customers.
  20. Dashboarding: for authentication of APIs | JWT Authentication

    What is JWT Authentication and how is it used?
    • JWT Authentication is a client-assertion-based security mechanism for our APIs that incorporates authentication, authorisation, data integrity and non-repudiation.
    • It is loosely based on a JWT authorisation header: the API consumer signs the JWT, and the APEX system verifies the claims and signature of the signed JWT.
  21. Dashboarding: for authentication of APIs | JWT Authentication

    Opportunities
    Our monitoring setup and uniquely defined error codes allowed the APEX Operations team to stay lean and focus on higher-value work instead of being bogged down by the time-consuming work of studying logs to diagnose the root cause of issues.

    Problems
    As is common with API gateway systems, authentication and authorisation errors made up a good percentage of the troubleshooting tickets and were time-consuming to investigate.
  22. Dashboarding: for authentication of APIs | Customised Error Codes of JWT Authentication

    403/432  Invalid JWKS endpoint of API Key
    433      Invalid JWKS
    434      JWT header missing
    435      Invalid JWT format
    436      Missing ‘iss’ claim
    437      Unable to find matching Key Id
    438      Missing or invalid ‘alg’ claim
    439      Missing or invalid ‘typ’ claim
    440      Invalid ‘iss’ claim
    441      Missing or invalid ‘iat’ claim
    442      Missing or invalid ‘aud’ claim
    443      Missing or invalid ‘jti’ claim
    445      Missing or invalid ‘sub’ claim
    446      Missing or invalid ‘data’ claim
    447      Missing or invalid ‘exp’ claim
    448      Invalid JWT format
    449      Missing or invalid ‘iss’ claim
    450      Invalid API Key
    452      Invalid JWT Signature
    4XX      Reuse of nonce
    4XX      >2 API Keys detected in ‘iss’ claim
    4XX      Attempt to use none in ‘alg’ claim
    4XX      Attempt to use claims – ‘jku’, ‘x5u’, ‘x5t’, ‘x5c’
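To illustrate how a gateway might turn a malformed token into one of the customised codes listed on this slide, here is a minimal sketch that uses only the Python standard library. It is not the APEX implementation. The function names, the order of the checks, the set of claims treated as mandatory and the placement of ‘alg’ and ‘typ’ in the JWT header segment (as in standard JWTs) are assumptions; only the “missing” half of each check is covered, and signature and JWKS verification are deliberately left out.

```python
import base64
import json

# Codes copied from the slide. Which fields are mandatory is an assumption.
HEADER_CHECKS = (("alg", 438), ("typ", 439))
CLAIM_CHECKS = (("iss", 436), ("iat", 441), ("aud", 442),
                ("jti", 443), ("sub", 445), ("exp", 447))


def _decode_segment(segment):
    """Decode one base64url JWT segment into a Python object."""
    padded = segment + "=" * (-len(segment) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


def classify(token):
    """Return a customised error code, or None if the structure looks fine."""
    if not token:
        return 434  # JWT header missing
    parts = token.split(".")
    if len(parts) != 3:
        return 435  # Invalid JWT format
    try:
        header = _decode_segment(parts[0])
        claims = _decode_segment(parts[1])
    except ValueError:  # covers base64, UTF-8 and JSON decoding failures
        return 435
    if not isinstance(header, dict) or not isinstance(claims, dict):
        return 435
    for name, code in HEADER_CHECKS:
        if name not in header:
            return code
    for name, code in CLAIM_CHECKS:
        if name not in claims:
            return code
    return None  # signature verification would follow here


def make_demo_token(header, claims):
    """Build a structurally valid token with a placeholder signature."""
    def enc(obj):
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return f"{enc(header)}.{enc(claims)}.placeholder-signature"
```

As a usage example, `classify(make_demo_token({"typ": "JWT"}, {}))` returns 438 because the ‘alg’ field is absent, which is the kind of specific answer that lets an API consumer fix the request without anyone reading raw logs.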
  23. Dashboarding: for authentication of APIs | Error Codes of JWT Authentication Dashboarding
  24. Dashboarding: for authentication of APIs | Outcomes of JWT Authentication Dashboarding

    • This empowered users to diagnose and rectify errors through a self-serve model, and allowed our DevOps teams to identify issues without needing to read detailed logs.
    • The DevOps team was also able to spot potential security threats to the API gateway by looking out for the error codes reserved for security errors.

    SRE Workbook Pg 101: Engineer Toil Out of the System. The optimal strategy for handling toil is to eliminate it at the source… working with product development teams to develop operationally friendly software that is not only less toilsome, but also more scalable, secure, and resilient.
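The two outcomes on this slide can be pictured as a simple aggregation over authentication error events: a count per error code for the self-serve dashboard, plus a list of API keys that triggered any code reserved for security errors. The sketch below is hypothetical throughout. The event shape `(api_key, status_code)` and the function name are assumptions, and the set of security-reserved codes is passed in by the caller because the deck does not disclose those numbers.

```python
from collections import Counter


def summarise_auth_errors(events, security_codes):
    """Summarise authentication error events for a dashboard.

    `events` is an iterable of (api_key, status_code) pairs.
    `security_codes` is the caller-supplied set of codes reserved for
    security errors; the real values are not given in the deck.
    Returns (count per code, sorted API keys that hit a security code).
    """
    events = list(events)  # allow generators, since we iterate twice
    by_code = Counter(code for _, code in events)
    suspicious = sorted({key for key, code in events if code in security_codes})
    return by_code, suspicious
```

Keeping the security codes as a parameter means the same function serves both audiences: publishers see only the per-code counts, while the operations team can add the reserved set to surface keys worth investigating.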
  25. Thank You!

    API Monitoring X SRE
    Eugene Wong / Senior DevOps Engineer
    Ryan Ashneil / Software Engineer
    18 April 2024