Gatus: An Automated Server Monitoring Tool

Bo-Yi Wu
July 26, 2022

1. Why not Prometheus Alertmanager, CloudWatch, or even Splunk?
2. Why I chose Gatus
3. How Gatus works
4. Gatus system architecture
5. Scalability (Distributed Approach)

Transcript

  1. Gatus: An Automated Server Monitoring Tool
    2022/07/26
    Bo-Yi Wu

  2. About me
    • Software Engineer at MediaTek (AIDE)
    • Member of Drone CI/CD Platform
    • Member of Gitea Platform
    • Member of Gin Golang Framework
    • Maintainer of several GitHub Actions plugins

  3. Why not Prometheus Alertmanager, CloudWatch, or even Splunk?

  4. Internal monitoring metrics rely mainly on existing traffic,
    so the team is notified only after customers run into a problem

  5. If the load balancer goes down,
    or an error occurs midway through the payment flow,
    will the team be notified?

  6. How can we learn about an error, and have it fixed,
    before customers notice it?

  7. How to achieve this
    • Build a service status page
    • Use a tool that actively monitors system status

  8. Building a service status page

  9. https://www.githubstatus.com
    GitHub system service status

  11. Tools that actively monitor system status

  12. Custom monitoring protocols
    • HTTP
    • DNS
    • ICMP/PING
    • TCP

  13. Why I chose Gatus
    https://github.com/ivbeg/awesome-status-pages

  14. Gatus advantages
    • Monitoring conditions (custom checks on the response)
    • Open source project (written in Go)
    • Simple status page

  15. Monitoring conditions

  16. Conditions
    • [STATUS] == 200
    • [STATUS] < 300
    • [RESPONSE_TIME] < 500
    • [BODY].user.name == John
    • len([BODY].data) < 10
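
A minimal Go sketch of how conditions like these could be evaluated. It is not Gatus's actual evaluator: the `result` type, the `evaluate` function, and the restriction to the two numeric placeholders `[STATUS]` and `[RESPONSE_TIME]` are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// result holds the values a health check produced (hypothetical type).
type result struct {
	Status         int   // HTTP status code
	ResponseTimeMs int64 // response time in milliseconds
}

// evaluate checks one condition of the form "[PLACEHOLDER] <op> <number>".
// Only [STATUS] and [RESPONSE_TIME] are handled in this sketch.
func evaluate(condition string, r result) (bool, error) {
	parts := strings.Fields(condition)
	if len(parts) != 3 {
		return false, fmt.Errorf("malformed condition %q", condition)
	}
	var actual int64
	switch parts[0] {
	case "[STATUS]":
		actual = int64(r.Status)
	case "[RESPONSE_TIME]":
		actual = r.ResponseTimeMs
	default:
		return false, fmt.Errorf("unsupported placeholder %q", parts[0])
	}
	expected, err := strconv.ParseInt(parts[2], 10, 64)
	if err != nil {
		return false, fmt.Errorf("expected value %q is not a number: %w", parts[2], err)
	}
	switch parts[1] {
	case "==":
		return actual == expected, nil
	case "!=":
		return actual != expected, nil
	case "<":
		return actual < expected, nil
	case "<=":
		return actual <= expected, nil
	case ">":
		return actual > expected, nil
	case ">=":
		return actual >= expected, nil
	}
	return false, fmt.Errorf("unsupported operator %q", parts[1])
}

func main() {
	r := result{Status: 200, ResponseTimeMs: 120}
	for _, c := range []string{"[STATUS] == 200", "[STATUS] < 300", "[RESPONSE_TIME] < 500"} {
		ok, err := evaluate(c, r)
		fmt.Println(c, ok, err)
	}
}
```

An endpoint would count as healthy only when every one of its conditions evaluates to true. Body conditions such as `[BODY].user.name` would also need JSON path handling, which is left out here.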

  18. GraphQL monitoring is supported

  20. Custom monitoring protocols
    • HTTP (REST API, GraphQL)
    • ICMP (ping)
    • DNS (A, AAAA, CNAME, MX, NS)
    • TCP (Database)
    • TLS (LDAP, HTTPS, mail servers)
    • STARTTLS (mail servers)

  21. Supports multiple alerting mechanisms
    • Discord
    • Email
    • Google Chat
    • Matrix
    • Mattermost
    • Slack
    • Teams
    • Telegram
    • Twilio
    • PagerDuty
    • Opsgenie
    • Custom

  22. Simple monitoring page

  25. Open source project, written in Go

  26. Contribute to an open source project

  27. Gatus system architecture flowchart

  30. endpoints:
        - name: monitoring
          group: internal
          url: "https://example.org/"
          interval: 5m
          conditions:
            - "[STATUS] == 200"
        - name: example-dns-query
          url: "1.1.1.1"
          interval: 5m
          dns:
            query-name: "example.com"
            query-type: "A"
          conditions:
            - "[BODY] == 93.184.216.34"
            - "[DNS_RCODE] == NOERROR"
        - name: icmp-ping
          url: "icmp://example.org"
          interval: 1m
          conditions:
            - "[CONNECTED] == true"
    Memory, SQLite, Postgres
    Add data

  31. endpoints:
        - name: monitoring
          group: internal
          url: "https://example.org/"
          interval: 5m
          conditions:
            - "[STATUS] == 200"
        - name: example-dns-query
          url: "1.1.1.1"
          interval: 5m
          dns:
            query-name: "example.com"
            query-type: "A"
          conditions:
            - "[BODY] == 93.184.216.34"
            - "[DNS_RCODE] == NOERROR"
    Memory, SQLite, Postgres
    Remove data

  32. err := store.Initialize(cfg.Storage)
      if err != nil {
          panic(err)
      }
      // Collect the keys of every endpoint still present in the configuration
      var keys []string
      for _, endpoint := range cfg.Endpoints {
          keys = append(keys, endpoint.Key())
      }
      // Drop stored statuses whose endpoint was removed from the configuration
      numberOfDeleted := store.Get().DeleteAllEndpointStatusesNotInKeys(keys)
      if numberOfDeleted > 0 {
      }
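
The cleanup step above can be sketched against a plain in-memory map. Only the method name `DeleteAllEndpointStatusesNotInKeys` comes from the slide. The `memoryStore` type, its string-valued statuses, and the key strings are assumptions for illustration, not Gatus's storage implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// memoryStore is a hypothetical in-memory store keyed by endpoint key.
type memoryStore struct {
	mu       sync.Mutex
	statuses map[string]string
}

// DeleteAllEndpointStatusesNotInKeys removes every status whose key is not
// listed in keys and returns how many entries were removed.
func (s *memoryStore) DeleteAllEndpointStatusesNotInKeys(keys []string) int {
	keep := make(map[string]struct{}, len(keys))
	for _, k := range keys {
		keep[k] = struct{}{}
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	deleted := 0
	for k := range s.statuses {
		if _, ok := keep[k]; !ok {
			delete(s.statuses, k) // deleting during range is safe in Go
			deleted++
		}
	}
	return deleted
}

func main() {
	// Made-up keys standing in for the three endpoints of the earlier config.
	s := &memoryStore{statuses: map[string]string{
		"monitoring":        "up",
		"example-dns-query": "up",
		"icmp-ping":         "up",
	}}
	// The new configuration no longer contains icmp-ping.
	n := s.DeleteAllEndpointStatusesNotInKeys([]string{"monitoring", "example-dns-query"})
	fmt.Println("deleted:", n)
}
```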

  34. WatchDog workflow

  35. endpoints:
        - name: front-end
          group: core
          url: "https://twin.sh/health"
          interval: 5m
          conditions:
            - "[STATUS] == 200"
            - "[BODY].status == UP"
            - "[RESPONSE_TIME] < 150"
        - name: back-end
          group: core
          url: "https://example.org/"
          interval: 5m
          conditions:
            - "[STATUS] == 200"
            - "[CERTIFICATE_EXPIRATION] > 48h"

  37. for _, endpoint := range cfg.Endpoints {
          if endpoint.IsEnabled() {
              go monitor(endpoint, cfg.Alerting, cfg.Maintenance, …)
          }
      }
    Read all endpoint data
    Process the data in the background

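  38. // Run it immediately on start
      execute(endpoint, alertingConfig …)
      // Loop for the next executions
      for {
          select {
          case <-ctx.Done():
              // Stop monitoring once the context is cancelled
              return
          case <-time.After(endpoint.Interval):
              // One interval has passed: run the check again
              execute(endpoint, alertingConfig …)
          }
      }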

  39. How to keep the service from sending
    a burst of requests all at once
    when it first starts

  40. for _, endpoint := range cfg.Endpoints {
          if endpoint.IsEnabled() {
              time.Sleep(777 * time.Millisecond)
              go monitor(endpoint, cfg.Alerting, cfg.Maintenance, …)
          }
      }

  41. After running for a while,
    there is still some chance that
    requests go out at the same moment
    and skew the response time

  42. Using sync.Mutex

  44. Endpoint A: interval changed to 5min

  45. When to disable the monitoring lock

  46. When to turn the lock off
    • Load testing
    • A large number of endpoints to monitor
    • Several endpoints with intervals < 5s
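
The monitoring lock and the option to turn it off can be sketched together in Go. The names `monitoringMutex`, `execute`, `disableLock` and `maxConcurrent` are hypothetical and chosen for this illustration. The sketch shows only the idea that one shared `sync.Mutex` forces checks to run one at a time so they cannot distort each other's response time.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

var monitoringMutex sync.Mutex // shared by every monitoring goroutine

// execute runs check while holding the shared lock, unless the lock is disabled.
func execute(disableLock bool, check func()) {
	if !disableLock {
		monitoringMutex.Lock()
		defer monitoringMutex.Unlock()
	}
	check()
}

// maxConcurrent starts n checks at once and reports the highest number
// observed running at the same time.
func maxConcurrent(n int, disableLock bool) int {
	var (
		mu      sync.Mutex // guards running and peak
		running int
		peak    int
		wg      sync.WaitGroup
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			execute(disableLock, func() {
				mu.Lock()
				running++
				if running > peak {
					peak = running
				}
				mu.Unlock()
				time.Sleep(10 * time.Millisecond) // stand-in for the real request
				mu.Lock()
				running--
				mu.Unlock()
			})
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println("peak with lock:", maxConcurrent(5, false))
	fmt.Println("peak without lock:", maxConcurrent(5, true))
}
```

With the lock enabled the peak is always 1. The cost is that checks queue up behind each other, which is why the cases listed above (load testing, many endpoints, very short intervals) are the ones where disabling it makes sense.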

  47. Gatus system architecture

  49. var router http.Handler = handler.CreateRouter(ui.StaticFolder, securityConfig …)
      server = &http.Server{
          Addr:         fmt.Sprintf("%s:%d", webConfig.Address, webConfig.Port),
          Handler:      router,
          ReadTimeout:  15 * time.Second,
          WriteTimeout: 15 * time.Second,
          IdleTimeout:  15 * time.Second,
      }
      log.Println("[controller][Handle] Listening on " + webConfig.SocketAddress())
      log.Println("[controller][Handle]", server.ListenAndServe())

  51. for {
          time.Sleep(30 * time.Second)
          if cfg.HasLoadedConfigurationFileBeenModified() {
              stop()
              time.Sleep(time.Second)
              save()
              updatedConfig, err := loadConfiguration()
              if err != nil {
                  if cfg.SkipInvalidConfigUpdate {
                      cfg.UpdateLastFileModTime()
                      continue
                  }
                  panic(err)
              }
              initializeStorage(updatedConfig)
              start(updatedConfig)
              return
          }
      }




  53. Restarting: Graceful Shutdown

  55. if server != nil {
          _ = server.Shutdown(context.TODO())
          server = nil
      }
    Shutdown Web Service

  56. ctx, cancel = context.WithCancel(context.Background())
      for _, endpoint := range cfg.Endpoints {
          if endpoint.IsEnabled() {
              time.Sleep(777 * time.Millisecond)
              go monitor(endpoint, cfg.Alerting, ctx)
          }
      }
    Shutdown WatchDog

  57. // Loop for the next executions
      for {
          select {
          case <-ctx.Done():
              // cancel() was called: leave the monitoring goroutine
              return
          case <-time.After(endpoint.Interval):
              execute(endpoint, alertingConfig …)
          }
      }
    Shutdown WatchDog

  58. Scalability: Distributed Approach
    https://github.com/TwiN/gatus/issues/…

  60. https://github.com/TwiN/gatus/issues/…

  61. Server/Agent proposal



  62. make request
    accept request
    Server/Agent proposal

  63. Server/Agent proposal
    send metric data

  64. Ensure that no two agents are handed the same endpoint
    When an endpoint changes, how is its agent told to stop?
    Handle graceful shutdown for both server and agents
    Server/Agent proposal

  65. Gatus Online Version
    (Why should I pay?)

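  66. Reasons to pay
    • The open source version can only be configured through YAML
    • The open source version needs someone to manage the host
    • If your own infrastructure has a problem, Gatus can no longer do anything either
    • Give me money and I will spend more time on Gatus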

  68. Thanks