API Monitoring with OpenAPI and Ecosystem using Schema

LINE Developers

March 11, 2021

Transcript

  1. Engineering / ABOUT ME: The role and experience
    name: Wataru Manji
    role: Software Engineer
    team: Verda Reliability Engineering team
    activities:
      - Development of the monitoring system
      - Direction of incident handling
      - Implementation of the on-call system
      - User support and training
      - and more
    manji0 / manji0#9999
  2. Agenda
    • What is Verda?
    • Motivation
    • Basic Idea
    • Deep Dive into API Monitoring
    • Practical Operation
    • Future Plans
    • Conclusion
  3. Verda is the Infra Platform: Of the LINER, by the LINER, for the LINER
    • Verda Web UI
    • Verda REST APIs
    • Server (VM/Baremetal)
    • LoadBalancer (L4/L7)
    • Storage (Object/Block)
    • Datastore (MySQL, Redis)
    • Kubernetes
    • Elasticsearch
    • ...
  4. The Scale is LARGE: We manage many resources
    • Baremetal & HV: 22,000+ EA ※1
    • VM: 65,000+ EA ※1
    • K8s cluster: 800+ EA ※1
    ※1: Count as of Dec. 2020
  5. Server Metrics Are Not Enough: Everything was unclear
    • Server monitoring was running on a metrics basis.
    • API monitoring for the services was only log-based.
    • It did not summarize which part of the microservices was failing a request.
    • There were periodic spikes in server resource usage, but we couldn't figure out why.
    • Some products had their own service monitoring, but this information was not known to other teams.
    • We needed a unified method to measure APIs' availability, throughput, and latency.
  6. We Already Have the Solution: That is a sidecar proxy
    • We can collect API metrics by implementing schemas for the APIs.
    • The same can be done for other services by introducing the proxy and a schema.
    • The Verda k8s team developed "verda-common-proxy", a simple HTTP proxy sidecar that supports exporting metrics defined by an OpenAPI schema.
    • Some components already use the proxy to collect access logs.
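The slides don't show how verda-common-proxy maps a concrete request onto the schema. As a minimal sketch, assuming the proxy uses the OpenAPI path template (e.g. /vms/{vm_id}) as the `path` label so that label cardinality stays bounded, a matcher could look like this; `compile_templates`, `path_label`, and the example paths are hypothetical.

```python
import re
from typing import Optional


def compile_templates(schema: dict) -> list:
    """Compile each OpenAPI path template into a regex, e.g. /vms/{vm_id} -> ^/vms/[^/]+$."""
    compiled = []
    for template in schema.get("paths", {}):
        # re.split with a capture group keeps the {param} pieces in the result list.
        parts = re.split(r"(\{[^/}]+\})", template)
        regex = "".join("[^/]+" if p.startswith("{") else re.escape(p) for p in parts)
        compiled.append((re.compile(f"^{regex}$"), template))
    # Try templates with fewer parameters first, so /vms/search wins over /vms/{vm_id}.
    compiled.sort(key=lambda item: item[1].count("{"))
    return compiled


def path_label(schema: dict, compiled: list, method: str, path: str) -> Optional[str]:
    """Return the template to use as the `path` label, or None if the spec has no match."""
    for pattern, template in compiled:
        if pattern.match(path) and method.lower() in schema["paths"][template]:
            return template
    return None
```

Mapping unmatched requests to None (for example, exported under an "unknown" label) keeps one-off URLs from creating new time series.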
  7. Schema Management: Split management is the basic principle
    The k8s manifest (e.g. Deployment) references two images, each built from its own repo, and fixes the version of each:
    • Application image
    • Nginx + schema files image
  8. Support for Modern Web Frameworks: The proxy can get a schema from the API server
    • VCP can get a schema from the target service.
      → Works well with frameworks that take a schema-first approach.
  9. Queries: What have we been able to observe?
    Metrics (with labels):
    • request_count_total {deployment, pod, path, method, status_code, error}
    • request_latency_second_total {deployment, pod, path, method}
    • request_inflight {deployment, pod, path, method}
    • request_size_bucket {deployment, pod, path, method}
    • response_size_bucket {deployment, pod, path, method}
    Example questions:
    • What percentage of requests resulted in a 5xx?
    • What percentage of requests violated the API spec?
    • Which paths and methods have lower throughput?
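The first question can be answered from request_count_total alone. A minimal sketch, assuming status_code is recorded as a string such as "503" and that two scrapes of the counter are at hand; the function names are hypothetical, and in practice the same ratio would normally be computed in PromQL, as the comment shows.

```python
from collections import defaultdict

# One scraped series: (labels, counter value), using the request_count_total label set.
Sample = tuple[dict[str, str], float]


def increases(start: list[Sample], end: list[Sample]) -> list[Sample]:
    """Per-series increase between two scrapes (counter resets ignored for brevity)."""
    before = {frozenset(labels.items()): value for labels, value in start}
    return [(labels, value - before.get(frozenset(labels.items()), 0.0)) for labels, value in end]


def error_ratio_by_path(start: list[Sample], end: list[Sample]) -> dict[tuple[str, str], float]:
    """Fraction of requests with a 5xx status_code, per (path, method).

    Roughly the PromQL:
      sum by (path, method) (increase(request_count_total{status_code=~"5.."}[5m]))
        / sum by (path, method) (increase(request_count_total[5m]))
    """
    total: dict[tuple[str, str], float] = defaultdict(float)
    errors: dict[tuple[str, str], float] = defaultdict(float)
    for labels, inc in increases(start, end):
        key = (labels["path"], labels["method"])
        total[key] += inc
        if labels["status_code"].startswith("5"):
            errors[key] += inc
    return {key: errors[key] / t for key, t in total.items() if t > 0}
```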
  10. Visualization: To observe information as a time series
    • The dashboard is implemented in Grafana.
    • Dashboards are added automatically by identifying the region and service from the metrics' labels.
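How the dashboards are generated isn't described. As a minimal sketch, assuming the metrics carry labels literally named `region` and `service`, new dashboards could be derived from the distinct label pairs seen in scraped series; `dashboards_to_create`, `dashboard_spec`, and the spec's shape are hypothetical, not Grafana's dashboard JSON model.

```python
def dashboards_to_create(series_labels: list[dict[str, str]],
                         existing: set[tuple[str, str]]) -> list[tuple[str, str]]:
    """(region, service) pairs that appear in the metrics but have no dashboard yet."""
    seen = {(labels["region"], labels["service"])
            for labels in series_labels
            if "region" in labels and "service" in labels}
    return sorted(seen - existing)


def dashboard_spec(region: str, service: str) -> dict:
    """Hypothetical minimal spec; a real tool would render this into a Grafana dashboard."""
    selector = f'region="{region}", service="{service}"'
    return {
        "title": f"{service} ({region})",
        "queries": [
            # Requests per second, broken down by path and method, for this region/service.
            f"sum by (path, method) (rate(request_count_total{{{selector}}}[5m]))",
        ],
    }
```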
  11. VCP: verda-common-proxy
    • VCP is the proxy that handles AuthN, validation, and access-log recording.
    • More details: https://engineering.linecorp.com/ja/blog/verda-common-proxy/
    Request to the pod → VCP → APP, and the response from the pod returns through VCP. VCP's responsibilities:
    • Validate the request with the schema
    • Validate the token and add the result as headers
    • Record access logs
    • Record some metrics
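The request path through VCP can be sketched as a chain of steps in front of the app. This is a minimal sketch, not VCP's implementation: it only checks exact paths and required query parameters from the OpenAPI `parameters` list, leaves out the metrics and access-log steps, and rejects violations with 400, although the slides don't say whether VCP rejects or only records them. `handle`, `spec_violation`, and the "X-Auth-Result" header are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Request:
    method: str
    path: str
    query: dict[str, str] = field(default_factory=dict)
    headers: dict[str, str] = field(default_factory=dict)


@dataclass
class Response:
    status: int
    body: str = ""


def spec_violation(req: Request, schema: dict) -> Optional[str]:
    """Return why the request violates the OpenAPI spec, or None if it conforms."""
    # Exact path lookup only; a real proxy would match path templates.
    operation = schema.get("paths", {}).get(req.path, {}).get(req.method.lower())
    if operation is None:
        return "unknown path or method"
    for param in operation.get("parameters", []):
        if param.get("in") == "query" and param.get("required") and param["name"] not in req.query:
            return f"missing required query parameter: {param['name']}"
    return None


def handle(req: Request, schema: dict,
           check_token: Callable[[str], bool],
           forward: Callable[[Request], Response]) -> Response:
    """Sidecar flow: validate against the schema, annotate the auth result, forward to the app."""
    violation = spec_violation(req, schema)
    if violation is not None:
        return Response(400, violation)  # this sketch rejects; recording only is also possible
    ok = check_token(req.headers.get("Authorization", ""))
    req.headers["X-Auth-Result"] = "valid" if ok else "invalid"  # hypothetical header name
    return forward(req)
```

Passing the token result to the app as a header, rather than rejecting, leaves the authorization decision with the application.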
  12. Working with Audit Logs: For investigation
    • Metrics alone do not provide information on specific requests.
      → Save access logs separately and use them for analysis.
    • We use fluentd for log transfer.
    • NOTE: This mechanism is for analyzing platform-side requests, not requests to apps built on the platform.
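Access logs fill the gap that metrics leave: one record per request. A minimal sketch, assuming the proxy writes one JSON object per line (a format that fluentd's JSON parser can read); the field names and `access_log_line` are hypothetical.

```python
import json
import time
import uuid
from typing import Optional


def access_log_line(method: str, path_template: str, raw_path: str, status: int,
                    latency_s: float, request_id: Optional[str] = None,
                    now: Optional[float] = None) -> str:
    """Build one single-line JSON access-log record (hypothetical field names)."""
    record = {
        "time": now if now is not None else time.time(),
        "request_id": request_id or str(uuid.uuid4()),
        "method": method,
        "path": path_template,  # same low-cardinality value as the metrics' path label
        "raw_path": raw_path,   # concrete path, useful when investigating a single request
        "status": status,
        "latency_ms": round(latency_s * 1000, 3),
    }
    # Compact separators keep the record on one line.
    return json.dumps(record, separators=(",", ":"), sort_keys=True)
```

Keeping the same path template in both the logs and the metrics lets an investigation jump from a spike on a dashboard to the matching log records.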
  13. CASE
    • The failure rate of the networking management API increased dramatically, greatly affecting the creation of VMs and other services.
    • It turned out that the last deployment had switched the API endpoint referenced by the HVs (hypervisors), and the load on the new endpoint's resources skyrocketed.
    • Solved by calculating the throughput from the API Monitoring values and scaling the pods to an appropriate count.
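The scaling step is simple arithmetic once throughput is measured. A minimal sketch, assuming the per-pod capacity is known from load testing and that pods should run below a target utilization; the numbers and `replicas_for_throughput` are hypothetical. The Kubernetes Horizontal Pod Autoscaler applies a similar ratio, desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue).

```python
import math


def replicas_for_throughput(observed_rps: float, per_pod_capacity_rps: float,
                            target_utilization: float = 0.7, min_replicas: int = 1) -> int:
    """Pods needed so each runs at or below target_utilization of its capacity."""
    if per_pod_capacity_rps <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("capacity must be positive and utilization must be in (0, 1]")
    usable_rps_per_pod = per_pod_capacity_rps * target_utilization
    return max(min_replicas, math.ceil(observed_rps / usable_rps_per_pod))


# e.g. 1,200 req/s observed, 100 req/s per pod, 70% target: 1200 / 70 = 17.1..., so 18 pods
print(replicas_for_throughput(1200, 100))
```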
  14. UX Issues
    • Verda's API spec is not exposed at a high enough level.
    • There is no portal that lists the status of Verda's services.
      → Such information is essential for users to trust and use Verda.
    • We would like to develop an interface for users to learn about the functions and their reliability.
    • The API Monitoring mechanism should serve as the foundation for publishing this information.
  15. API Documents Generated from Schemas: Implement more manageable user documentation
    • A schema is an API spec that can be exposed to users.
    • We can provide users with API documentation that automatically follows changes in the application.
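Because the schema is already the source of truth, documentation can be rendered from it on every release. A minimal sketch that turns an OpenAPI document (as a parsed dict) into a Markdown endpoint table using the standard `info.title`, path item, and operation `summary` fields; `render_endpoint_table` is hypothetical, and a real setup would more likely use an existing OpenAPI renderer.

```python
# HTTP methods allowed as keys of an OpenAPI 3 path item; other keys
# (e.g. "parameters", "summary") are skipped.
HTTP_METHODS = ("get", "put", "post", "delete", "options", "head", "patch", "trace")


def render_endpoint_table(schema: dict) -> str:
    """Render every operation in the schema as one row of a Markdown table."""
    title = schema.get("info", {}).get("title", "API")
    lines = [f"# {title}", "", "| Method | Path | Summary |", "| --- | --- | --- |"]
    for path, item in sorted(schema.get("paths", {}).items()):
        for method in HTTP_METHODS:
            if method in item:
                summary = item[method].get("summary", "")
                lines.append(f"| {method.upper()} | `{path}` | {summary} |")
    return "\n".join(lines)
```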
  16. Summary of Service Status: A mechanism to expose the health of the API to users
    • From the metrics collected by API Monitoring, we can calculate the service status and service level.
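One common way to turn request counts into a service level is availability plus the remaining error budget against an SLO. A minimal sketch, assuming "failed" means requests counted as failures (for example 5xx) over the same window; the SLO value, the two-state status, and the function names are hypothetical.

```python
def service_level(total_requests: int, failed_requests: int) -> float:
    """Availability over a window: the fraction of requests that did not fail."""
    if total_requests == 0:
        return 1.0  # no traffic is treated as fully available (a policy choice)
    return 1.0 - failed_requests / total_requests


def error_budget_remaining(total_requests: int, failed_requests: int, slo: float) -> float:
    """Share of the error budget still unspent: 1.0 = untouched, below 0 = SLO missed."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        return 1.0  # only happens with no traffic
    return 1.0 - failed_requests / allowed_failures


def status_label(level: float, slo: float) -> str:
    # Hypothetical two-state summary for a status portal.
    return "operational" if level >= slo else "degraded"
```

For example, 400 failures out of 1,000,000 requests gives a service level of 0.9996; against a 99.9% SLO, which allows 1,000 failures, 60% of the error budget remains.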
  17. Resource Monitoring
    • In some cases, the return code of an API is not enough to tell the user whether the function has done its job correctly.
      → Mainly operations that return 202 Accepted, e.g. Create VM.
    • To measure the reliability of those processes, we will implement some tracking and verification measures:
      • OpenStack request-id-based tracing
      • Simple testing of the created resources
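For a 202 Accepted operation, success is only known once the resource reaches a terminal state. A minimal sketch of that check, assuming status strings like those of OpenStack servers (which move from BUILD to ACTIVE or ERROR) and a caller-supplied `get_status` function; `wait_for_resource` and its parameters are hypothetical, and the clock and sleep are injectable so the logic can be tested without waiting.

```python
import time
from typing import Callable


def wait_for_resource(get_status: Callable[[], str],
                      timeout_s: float = 600.0,
                      interval_s: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic) -> str:
    """Poll until the resource is ACTIVE or ERROR; return "TIMEOUT" if the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in ("ACTIVE", "ERROR"):
            return status
        if clock() >= deadline:
            return "TIMEOUT"
        sleep(interval_s)
```

Counting the ACTIVE, ERROR, and TIMEOUT outcomes per operation would give the asynchronous APIs a success rate comparable to the 5xx ratio of synchronous ones.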
  18. Others
    • Implement an interface that allows developers to easily define monitoring items:
      • Metrics scraping targets and labeling
      • Alert definitions and routing
      • Logging targets, parsing rules, and routing
      → Currently under design and development.
    • An interface for users to easily submit requests for documentation and service specifications.
      → We will work on a design that is easy to use for both users and developers, in conjunction with GitHub features.
  19. What We've Accomplished
    • We can now measure health and demand for all API paths and methods:
      • Request count
      • Latency
      • Availability
    • We provide specific benefits and frameworks for schema-first development:
      • Automatic API monitoring
      • Labor-saving documentation
      • Clarification of development items and procedures
  20. Future Plan 1: Improving Verda's UX
    • Implement an ecosystem of user documentation, including the information needed to trust and use the service.
      • Automatic generation of API documents with API spec and permissions information
        (an API spec that includes information about the permissions required for execution)
      • More rigorous and measurable service-level definitions
        (the difference between the SLO and the current service level)
  21. Future Plan 2: Perfect Monitoring
    • State tracking of resources manipulated by asynchronous APIs
    • Tracking of processing with an ID per request
    • Perfect service-level measurement
    • Easy understanding of the cause and extent of trouble
  22. WE ARE HIRING!!! Are you interested in Verda?
    List of open positions: https://lin.ee/YwYLuGa
    About the Verda SRE team: https://lin.ee/4mgu0nn