API Monitoring with OpenAPI and Ecosystem using Schema

Slide 1

Slide 1 text

Engineering
API Monitoring with OpenAPI and Ecosystem using Schema
Wataru Manji, Verda -- LINE Corp.

Slide 2

Slide 2 text

ABOUT ME
The role and experience
name: Wataru Manji
role: Software Engineer
team: Verda Reliability Engineering team
activities:
- Development of the monitoring system
- Direction of incident handling
- Implementation of the on-call system
- User support and training
- and more
manji0
manji0#9999

Slide 3

Slide 3 text

Agenda
• What is Verda?
• Motivation
• Basic Idea
• Deep-dive to API Monitoring
• Practical Operation
• Future Plans
• Conclusion

Slide 4

Slide 4 text


Slide 5

Slide 5 text

Verda is the Infra Platform
Of the LINER, by the LINER, for the LINER
● Verda Web UI
● Verda REST APIs
● Server (VM/Baremetal)
● LoadBalancer (L4/L7)
● Storage (Object/Block)
● Datastore (MySQL, Redis)
● Kubernetes
● Elasticsearch
● ...

Slide 6

Slide 6 text

The Scale is LARGE
We manage a large number of resources
● Baremetal & HV: 22,000+ EA ※1
● VM: 65,000+ EA ※1
● K8s cluster: 800+ EA ※1
※1: Count as of Dec. 2020

Slide 7

Slide 7 text

Motivation
Background to the introduction of API monitoring

Slide 8

Slide 8 text

Server Metrics is Not Enough
Everything was so unclear
● Server monitoring was metrics-based.
● API monitoring for the services was only log-based.
● It did not summarize which of the microservices was failing a request.
● Server resource usage spiked periodically, but we could not figure out why.
● Some products had their own service monitoring, but other teams did not know about it.
● We needed a unified method to measure each API's availability, throughput, and latency.

Slide 9

Slide 9 text

We Already Have the Solution
That is a sidecar proxy
● We can collect API metrics by implementing schemas for the APIs.
● The same can be done for other services by introducing the proxy and a schema.
● The Verda k8s team developed "verda-common-proxy", a simple HTTP proxy sidecar that supports exporting metrics defined by an OpenAPI schema.
● Some components already use the proxy to collect access logs.

Slide 10

Slide 10 text

Basic Idea
Summary of API monitoring implementation

Slide 11

Slide 11 text

Overview
K8s native design

Slide 12

Slide 12 text

Schema Management
Split management is the basic principle
[Diagram: k8s manifest (e.g. Deployment), which fixes each version; Application image (Repo); Nginx + Schema files image (Repo)]

Slide 13

Slide 13 text

Support for Modern Web-Framework
The proxy can get a schema from api-server
● VCP (verda-common-proxy) can get a schema from the target service
  → Works well with frameworks that take a schema-first approach

Slide 14

Slide 14 text

Queries
What have we been able to observe
● request_count_total {deployment, pod, path, method, status_code, error}
● request_latency_second_total {deployment, pod, path, method}
● request_inflight {deployment, pod, path, method}
● request_size_bucket {deployment, pod, path, method}
● response_size_bucket {deployment, pod, path, method}
Questions these metrics can answer:
● What percentage of requests resulted in a 5xx?
● What percentage of requests violate the API spec?
● Which path & method have lower throughput?
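The 5xx question can be answered from request_count_total alone by summing over its status_code label. The following is a minimal sketch in Python, assuming the counter increases for some time window have already been fetched as (labels, value) pairs. The function name, the sample paths, and the numbers are hypothetical, and in practice this would more likely be expressed as a query in the metrics backend.

```python
from typing import Iterable, Mapping, Tuple

# (labels, increase of request_count_total over the window)
Sample = Tuple[Mapping[str, str], float]


def error_ratio_5xx(samples: Iterable[Sample]) -> float:
    """Return the fraction of requests whose status_code label is 5xx."""
    total = 0.0
    errors = 0.0
    for labels, value in samples:
        total += value
        if labels.get("status_code", "").startswith("5"):
            errors += value
    # An empty window has no failures to report.
    return errors / total if total else 0.0


# Hypothetical data: paths and counts are made up for illustration.
samples = [
    ({"path": "/v1/servers", "method": "GET", "status_code": "200"}, 950.0),
    ({"path": "/v1/servers", "method": "POST", "status_code": "202"}, 40.0),
    ({"path": "/v1/servers", "method": "POST", "status_code": "503"}, 10.0),
]
print(f"{error_ratio_5xx(samples):.1%}")
```

Grouping the same sums by the path and method labels would give the per-endpoint view needed for the throughput question.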

Slide 15

Slide 15 text

Visualization
To observe information in a time series
● The dashboard is implemented on Grafana.
● Dashboards are added automatically by identifying the region and service via the metric's labels.

Slide 16

Slide 16 text

Deep-dive to API Monitoring
The keyword is OpenAPI

Slide 17

Slide 17 text

VCP: verda-common-proxy
● VCP is the proxy that handles AuthN, validation, and access logging.
● More detail… https://engineering.linecorp.com/ja/blog/verda-common-proxy/
Request to the pod (VCP → APP):
- Validate request with schema
- Record some metrics
- Validate token and add the result as headers
Response from the pod (APP → VCP):
- Record access log
- Record some metrics
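To make the "validate request with schema" step concrete, here is a minimal sketch in Python of matching an incoming request against the `paths` object of an OpenAPI document. It is not VCP's implementation: the spec fragment, the function names, and the choice to return the path template are all assumptions for illustration. It also ignores parts of the specification such as the precedence of concrete paths over templated ones, and it does not validate parameters or bodies.

```python
import re
from typing import Optional

# Hypothetical fragment of an OpenAPI document (only the `paths` object).
SPEC_PATHS = {
    "/v1/servers": {"get": {}, "post": {}},
    "/v1/servers/{server_id}": {"get": {}, "delete": {}},
}


def template_to_regex(template: str) -> re.Pattern:
    """Turn a path template into a regex; each {param} matches one path segment."""
    parts = re.split(r"(\{[^/{}]+\})", template)
    pattern = "".join(
        "[^/]+" if part.startswith("{") else re.escape(part) for part in parts
    )
    return re.compile(f"^{pattern}$")


def match_operation(method: str, path: str) -> Optional[str]:
    """Return the matching path template, or None if the request is not in the spec."""
    for template, operations in SPEC_PATHS.items():
        if template_to_regex(template).match(path) and method.lower() in operations:
            return template
    return None


print(match_operation("GET", "/v1/servers/abc123"))
print(match_operation("PUT", "/v1/servers"))
```

Returning the template rather than the raw path is a design choice of this sketch: using it as the `path` label keeps the number of distinct label values bounded by the schema, and a `None` result is a natural place to count a spec violation.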

Slide 18

Slide 18 text

Data-Flow of the Metrics
● Separate data-source availability from data-store availability with remote-write

Slide 19

Slide 19 text

Working with Audit-log
For investigation
● Metrics alone do not provide information on specific requests.
  → Save access logs separately and use them for analysis
● We use fluentd for log transfer.
● NOTE: This is a mechanism for analyzing platform-side requests, not the requests of apps built on the platform.

Slide 20

Slide 20 text

Practical Operation
It helped us in these cases

Slide 21

Slide 21 text

CASE
● The failure rate of the networking management API increased dramatically, greatly affecting the creation of VMs and other services.
● It turned out that the last deployment had switched the API endpoint referenced by the HVs, and the load on the new endpoint's resources skyrocketed.
● Solved by calculating the throughput from the API Monitoring values and scaling the pods to an appropriate number.
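The scaling step in this case is simple arithmetic once the throughput is known. A minimal sketch in Python follows, assuming the peak request rate and the sustainable rate of a single pod have been read off the API Monitoring metrics. The function name, the headroom parameter, and the example numbers are hypothetical and do not come from the incident.

```python
import math


def required_replicas(peak_rps: float, per_pod_rps: float, headroom: float = 0.3) -> int:
    """Pods needed to serve peak_rps while leaving `headroom` spare capacity per pod."""
    if per_pod_rps <= 0 or not 0 <= headroom < 1:
        raise ValueError("per_pod_rps must be positive and headroom must be in [0, 1)")
    usable_rps = per_pod_rps * (1 - headroom)
    # Always keep at least one pod, even when there is no traffic.
    return max(1, math.ceil(peak_rps / usable_rps))


# Hypothetical numbers: 1200 req/s at peak, 150 req/s sustainable per pod.
print(required_replicas(peak_rps=1200, per_pod_rps=150))  # 12
```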

Slide 22

Slide 22 text

CASE

Slide 23

Slide 23 text

Future Plans
Provide value to the users of Verda

Slide 24

Slide 24 text

UX Issues
● Verda's API Spec is not exposed at a high enough level.
● There is no portal that lists the status of Verda's services.
  → Such information is essential for users to trust and use Verda
● We would like to develop an interface for users to learn about the functions and their reliability.
● The API Monitoring mechanism should serve as a foundation for publishing this information.

Slide 25

Slide 25 text

API Documents generated by Schema
Implement more manageable user documentation
● Schema is an API Spec that can be exposed to users
● We can provide users with an API Document that automatically follows changes in the application.

Slide 26

Slide 26 text

Summary of Service Status
A mechanism to expose the health of the API to users
● From the metrics collected by API Monitoring, we can calculate service status and service level.
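One way such a calculation could look is sketched below in Python. It derives availability from request counts and maps it to a coarse status by comparing it with an SLO. The status names, the default SLO of 99.9%, and the one-percentage-point "degraded" band are assumptions for illustration, not Verda's definitions.

```python
def availability(total: float, server_errors: float) -> float:
    """Fraction of requests that did not end in a server error."""
    return 1.0 if total == 0 else (total - server_errors) / total


def service_status(total: float, server_errors: float, slo: float = 0.999) -> str:
    """Map the measured availability to a coarse status (thresholds are hypothetical)."""
    level = availability(total, server_errors)
    if level >= slo:
        return "operational"
    if level >= slo - 0.01:
        return "degraded"
    return "outage"


# Hypothetical counts over one evaluation window.
print(service_status(total=10_000, server_errors=5))
print(service_status(total=10_000, server_errors=50))
print(service_status(total=10_000, server_errors=500))
```

The gap between `availability(...)` and `slo` is the "difference between SLO and current service level" that a status page could show directly.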

Slide 27

Slide 27 text

Resource Monitoring
● In some cases, the return code of an API is not enough to tell the user whether the function has done its job correctly or not.
  → Mainly operations that return 202 Accepted, e.g. Create VM
● In order to measure the reliability of those processes, we will implement some tracking and verification measures.
  ● openstack request-id based tracing
  ● simple testing of the created resources
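A simple verification of an asynchronous operation can be sketched as a poll loop that runs after the API has returned 202 Accepted. The Python example below is only an illustration under stated assumptions: `get_status` stands in for whatever call reports the resource state, and the state names "BUILD", "ACTIVE", and "ERROR" are placeholders rather than a confirmed part of Verda's API.

```python
import time
from typing import Callable


def wait_until_active(
    get_status: Callable[[], str],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll a resource until it becomes ACTIVE; return False on ERROR or timeout."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == "ACTIVE":
            return True
        if status == "ERROR" or clock() >= deadline:
            return False
        sleep(interval_s)


# Hypothetical usage: a canned sequence of states instead of a real API call.
states = iter(["BUILD", "BUILD", "ACTIVE"])
print(wait_until_active(lambda: next(states), sleep=lambda _: None))
```

Recording the boolean result as a metric would let asynchronous operations contribute to the measured service level in the same way as synchronous requests. The `sleep` and `clock` parameters are injected only to make the loop easy to test.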

Slide 28

Slide 28 text

Design of the Eco-System

Slide 29

Slide 29 text

Others
● Implement an interface that allows developers to easily define monitoring items.
  ● Metrics scraping targets and labeling
  ● Alert definitions and routing
  ● Logging targets, parsing rules, and routing
  → Currently under design and development
● An interface for users to easily submit requests for documentation and service specifications.
  → We will work on a design that is easy to use for both users and developers, in conjunction with GitHub features.

Slide 30

Slide 30 text

Conclusion
What we have achieved and overall future vision

Slide 31

Slide 31 text

What We've Accomplished
● We can now measure health and demand for all API paths and methods.
  ● Request count
  ● Latency
  ● Availability
● We provide specific benefits and frameworks for schema-first development.
  ● Enable automatic API monitoring
  ● Labor-saving documentation
  ● Clarification of development items and procedures

Slide 32

Slide 32 text

Future Plan 1: Improving Verda's UX
● Automatic generation of API documents with API Spec and permissions information
● More rigorous and measurable service-level definitions
● Implement an ecosystem of user documentation, including the information needed to trust and use the service.
  ● API Spec that includes the information about permissions for execution
  ● Difference between SLO and current service level

Slide 33

Slide 33 text

Future Plan 2: Perfect Monitoring
● State tracking of resources manipulated by asynchronous APIs.
● Track processing end to end with an ID per request.
● Perfect service level measurement
● Easy to understand the cause and extent of trouble

Slide 34

Slide 34 text

Are you interested in Verda? WE ARE HIRING!!!
List of open positions: https://lin.ee/YwYLuGa
About Verda SRE team: https://lin.ee/4mgu0nn