Slide 1

Slide 1 text

Automated Server Monitoring Tool - Gatus 2022/07/26 Bo-Yi Wu

Slide 2

Slide 2 text

About me
• Software Engineer at MediaTek (AIDE)
• Member of the Drone CI/CD platform
• Member of the Gitea platform
• Member of the Gin Golang framework
• Maintainer of several GitHub Actions plugins

Slide 3

Slide 3 text

Why not Prometheus Alertmanager, CloudWatch, or even Splunk?

Slide 4

Slide 4 text

Internal monitoring metrics rely mainly on existing traffic, so the team is notified only after customers run into a problem.

Slide 5

Slide 5 text

If the load balancer goes down, or an error occurs partway through the payment flow, will the team be notified?

Slide 6

Slide 6 text

How can we learn about errors, and have them fixed, before customers notice?

Slide 7

Slide 7 text

How to achieve this
• Build a service status page
• Use a tool that actively monitors system status

Slide 8

Slide 8 text

Building a service status page

Slide 9

Slide 9 text

https://www.githubstatus.com GitHub system service status

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

Tools that actively monitor system status

Slide 12

Slide 12 text

Customizable monitoring protocols
• HTTP
• DNS
• ICMP/PING
• TCP

Slide 13

Slide 13 text

Why I chose Gatus https://github.com/ivbeg/awesome-status-pages

Slide 14

Slide 14 text

Gatus advantages
• Monitoring conditions (custom checks on the response)
• Open source project (written in Go)
• Simple status page

Slide 15

Slide 15 text

Monitoring conditions

Slide 16

Slide 16 text

Conditions
• [STATUS] == 200
• [STATUS] < 300
• [RESPONSE_TIME] < 500
• [BODY].user.name == John
• len([BODY].data) < 10
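To make the idea of a condition concrete, here is a minimal sketch of how strings such as "[STATUS] == 200" could be checked against a result. It is not Gatus's own evaluator: the Result type and the evaluate function are hypothetical names, and only the [STATUS] and [RESPONSE_TIME] placeholders with the ==, < and > operators are handled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Result holds the values produced by one check. It is a hypothetical type
// used for illustration only, not Gatus's own result struct.
type Result struct {
	Status         int   // HTTP status code
	ResponseTimeMs int64 // response time in milliseconds
}

// evaluate checks a condition of the form "[PLACEHOLDER] <op> <integer>".
// Only [STATUS] and [RESPONSE_TIME] with ==, < and > are supported here.
func evaluate(condition string, r Result) (bool, error) {
	parts := strings.Fields(condition)
	if len(parts) != 3 {
		return false, fmt.Errorf("malformed condition %q", condition)
	}
	var actual int64
	switch parts[0] {
	case "[STATUS]":
		actual = int64(r.Status)
	case "[RESPONSE_TIME]":
		actual = r.ResponseTimeMs
	default:
		return false, fmt.Errorf("unsupported placeholder %q", parts[0])
	}
	expected, err := strconv.ParseInt(parts[2], 10, 64)
	if err != nil {
		return false, fmt.Errorf("expected value %q is not an integer", parts[2])
	}
	switch parts[1] {
	case "==":
		return actual == expected, nil
	case "<":
		return actual < expected, nil
	case ">":
		return actual > expected, nil
	default:
		return false, fmt.Errorf("unsupported operator %q", parts[1])
	}
}

func main() {
	r := Result{Status: 200, ResponseTimeMs: 120}
	conditions := []string{"[STATUS] == 200", "[STATUS] < 300", "[RESPONSE_TIME] < 100"}
	for _, c := range conditions {
		ok, err := evaluate(c, r)
		fmt.Println(c, "=>", ok, err)
	}
}
```

An endpoint would count as healthy only when every one of its conditions evaluates to true; body-based conditions such as [BODY].user.name are out of scope for this sketch.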

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

Supports monitoring GraphQL

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

Customizable monitoring protocols
• HTTP (REST API, GraphQL)
• ICMP (ping)
• DNS (A, AAAA, CNAME, MX, NS)
• TCP (Database)
• TLS (LDAP, HTTPS, mail servers)
• STARTTLS (mail servers)

Slide 21

Slide 21 text

Supports multiple alerting mechanisms
• Discord
• Email
• Google Chat
• Matrix
• Mattermost
• Slack
• Teams
• Telegram
• Twilio
• PagerDuty
• Opsgenie
• Custom

Slide 22

Slide 22 text

Simple monitoring page

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

Open source project written in Go

Slide 26

Slide 26 text

Contribute to open source project

Slide 27

Slide 27 text

Gatus system architecture: flow diagram

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

endpoints:
  - name: monitoring
    group: internal
    url: "https://example.org/"
    interval: 5m
    conditions:
      - "[STATUS] == 200"
  - name: example-dns-query
    url: "1.1.1.1"
    interval: 5m
    dns:
      query-name: "example.com"
      query-type: "A"
    conditions:
      - "[BODY] == 93.184.216.34"
      - "[DNS_RCODE] == NOERROR"
  - name: icmp-ping
    url: "icmp://example.org"
    interval: 1m
    conditions:
      - "[CONNECTED] == true"

Memory, SQLite, Postgres: add data

Slide 31

Slide 31 text

endpoints:
  - name: monitoring
    group: internal
    url: "https://example.org/"
    interval: 5m
    conditions:
      - "[STATUS] == 200"
  - name: example-dns-query
    url: "1.1.1.1"
    interval: 5m
    dns:
      query-name: "example.com"
      query-type: "A"
    conditions:
      - "[BODY] == 93.184.216.34"
      - "[DNS_RCODE] == NOERROR"

Memory, SQLite, Postgres: remove data

Slide 32

Slide 32 text

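// Initialize the configured storage backend
err := store.Initialize(cfg.Storage)
if err != nil {
	panic(err)
}
// Collect the keys of every endpoint still present in the configuration
var keys []string
for _, endpoint := range cfg.Endpoints {
	keys = append(keys, endpoint.Key())
}
// Delete stored statuses whose endpoint is no longer configured
numberOfDeleted := store.Get().DeleteAllEndpointStatusesNotInKeys(keys)
if numberOfDeleted > 0 {
}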

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

WatchDog flow

Slide 35

Slide 35 text

endpoints:
  - name: front-end
    group: core
    url: "https://twin.sh/health"
    interval: 5m
    conditions:
      - "[STATUS] == 200"
      - "[BODY].status == UP"
      - "[RESPONSE_TIME] < 150"
  - name: back-end
    group: core
    url: "https://example.org/"
    interval: 5m
    conditions:
      - "[STATUS] == 200"
      - "[CERTIFICATE_EXPIRATION] > 48h"

Slide 36

Slide 36 text

endpoints:
  - name: front-end
    group: core
    url: "https://twin.sh/health"
    interval: 5m
    conditions:
      - "[STATUS] == 200"
      - "[BODY].status == UP"
      - "[RESPONSE_TIME] < 150"
  - name: back-end
    group: core
    url: "https://example.org/"
    interval: 5m
    conditions:
      - "[STATUS] == 200"
      - "[CERTIFICATE_EXPIRATION] > 48h"

Slide 37

Slide 37 text

for _, endpoint := range cfg.Endpoints {
	if endpoint.IsEnabled() {
		go monitor(endpoint, cfg.Alerting, cfg.Maintenance, …)
	}
}

Read all Endpoint data
Process the data in the background

Slide 38

Slide 38 text

// Run it immediately on start
execute(endpoint, alertingConfig …)
// Loop for the next executions
for {
	select {
	case <-ctx.Done():
		return
	case <-time.After(endpoint.Interval):
		execute(endpoint, alertingConfig …)
	}
}
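The loop on this slide elides its arguments, so it cannot be run as shown. The following is a self-contained sketch of the same pattern (run once immediately, then once per interval until the context is cancelled). The check callback and the runForMillis helper are hypothetical stand-ins for Gatus's execute function and for the code that starts and stops the watchdog.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// monitor runs check once immediately, then once per interval,
// and returns as soon as ctx is cancelled.
func monitor(ctx context.Context, interval time.Duration, check func()) {
	check()
	for {
		select {
		case <-ctx.Done():
			return
		case <-time.After(interval):
			check()
		}
	}
}

// runForMillis starts monitor in a goroutine, cancels it after totalMs
// milliseconds, waits for it to exit, and returns how many checks ran.
func runForMillis(totalMs, intervalMs int) int64 {
	var count int64
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})
	go func() {
		defer close(done)
		monitor(ctx, time.Duration(intervalMs)*time.Millisecond, func() {
			atomic.AddInt64(&count, 1)
		})
	}()
	time.Sleep(time.Duration(totalMs) * time.Millisecond)
	cancel()
	<-done // monitor has returned, so no further checks can run
	return atomic.LoadInt64(&count)
}

func main() {
	fmt.Println("checks executed:", runForMillis(100, 20))
}
```

Because the loop selects on ctx.Done(), cancelling the context stops a sleeping goroutine right away instead of waiting for the next interval to elapse.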

Slide 39

Slide 39 text

How to avoid sending a large burst of requests all at once when the service first starts

Slide 40

Slide 40 text

for _, endpoint := range cfg.Endpoints {
	if endpoint.IsEnabled() {
		// Stagger the start of each monitor to avoid an initial burst of requests
		time.Sleep(777 * time.Millisecond)
		go monitor(endpoint, cfg.Alerting, cfg.Maintenance, …)
	}
}

Slide 41

Slide 41 text

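After the service has been running for a while, there is some probability that requests are sent at the same moment, which affects the measured response time.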

Slide 42

Slide 42 text

Using sync.Mutex
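As a rough illustration of the idea, the sketch below serializes checks with a single sync.Mutex so that no two of them run at the same moment. All names here (monitoringMutex, execute, maxConcurrent, the useLock flag) are assumptions made for the example rather than a copy of the Gatus source, and the 5 ms sleep stands in for the real network request.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

var (
	// monitoringMutex ensures only one check runs at any given time.
	monitoringMutex sync.Mutex

	// stateMu guards the counters used to observe concurrency.
	stateMu    sync.Mutex
	running    int
	maxRunning int
)

// execute simulates one endpoint check. When useLock is true the whole
// check is serialized behind monitoringMutex.
func execute(useLock bool) {
	if useLock {
		monitoringMutex.Lock()
		defer monitoringMutex.Unlock()
	}
	stateMu.Lock()
	running++
	if running > maxRunning {
		maxRunning = running
	}
	stateMu.Unlock()

	time.Sleep(5 * time.Millisecond) // stand-in for the real request

	stateMu.Lock()
	running--
	stateMu.Unlock()
}

// maxConcurrent launches n checks at once and reports the highest number
// of checks that were in flight simultaneously.
func maxConcurrent(n int, useLock bool) int {
	stateMu.Lock()
	running, maxRunning = 0, 0
	stateMu.Unlock()

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			execute(useLock)
		}()
	}
	wg.Wait()

	stateMu.Lock()
	defer stateMu.Unlock()
	return maxRunning
}

func main() {
	fmt.Println("max concurrent checks with lock:", maxConcurrent(5, true))
}
```

With the lock enabled, at most one check is ever in flight, so one endpoint's request cannot inflate another's response time. The cost is throughput: checks queue behind each other, which is the trade-off behind turning the lock off for many endpoints or very short intervals.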

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

Endpoint A: interval adjusted to 5 min

Slide 45

Slide 45 text

When to disable Monitoring Lock

Slide 46

Slide 46 text

When to disable the lock
• Stress testing
• A large number of endpoints need to be monitored
• Several endpoints have intervals < 5s

Slide 47

Slide 47 text

Gatus system architecture

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

var router http.Handler = handler.CreateRouter(ui.StaticFolder, securityConfig …)
server = &http.Server{
	Addr:         fmt.Sprintf("%s:%d", webConfig.Address, webConfig.Port),
	Handler:      router,
	ReadTimeout:  15 * time.Second,
	WriteTimeout: 15 * time.Second,
	IdleTimeout:  15 * time.Second,
}
log.Println("[controller][Handle] Listening on " + webConfig.SocketAddress())
log.Println("[controller][Handle]", server.ListenAndServe())

Slide 50

Slide 50 text

No content

Slide 51

Slide 51 text

for {
	// Check the configuration file every 30 seconds
	time.Sleep(30 * time.Second)
	if cfg.HasLoadedConfigurationFileBeenModified() {
		stop()
		time.Sleep(time.Second)
		save()
		updatedConfig, err := loadConfiguration()
		if err != nil {
			if cfg.SkipInvalidConfigUpdate {
				// Keep running with the old configuration
				cfg.UpdateLastFileModTime()
				continue
			} else {
				panic(err)
			}
		}
		initializeStorage(updatedConfig)
		start(updatedConfig)
		return
	}
}
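The slide does not show how HasLoadedConfigurationFileBeenModified decides that the file changed. One plausible approach, sketched here purely as an assumption, is to remember the file's modification time at load and compare it on each poll. The watchedFile type, its methods, and the demo helper are hypothetical names for this example.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// watchedFile remembers the modification time seen when the file was loaded.
// Hypothetical type for illustration; not taken from the Gatus source.
type watchedFile struct {
	path        string
	lastModTime time.Time
}

// newWatchedFile records the file's current modification time.
func newWatchedFile(path string) (*watchedFile, error) {
	info, err := os.Stat(path)
	if err != nil {
		return nil, err
	}
	return &watchedFile{path: path, lastModTime: info.ModTime()}, nil
}

// modified reports whether the file's modification time differs from the
// one recorded at load. A file that cannot be read counts as unmodified.
func (w *watchedFile) modified() bool {
	info, err := os.Stat(w.path)
	if err != nil {
		return false
	}
	return !info.ModTime().Equal(w.lastModTime)
}

// demo creates a temporary file, checks it, bumps its modification time by
// one minute, and checks again.
func demo() (before, after bool, err error) {
	f, err := os.CreateTemp("", "config-*.yaml")
	if err != nil {
		return false, false, err
	}
	defer os.Remove(f.Name())
	f.Close()

	w, err := newWatchedFile(f.Name())
	if err != nil {
		return false, false, err
	}
	before = w.modified()

	newTime := w.lastModTime.Add(time.Minute)
	if err = os.Chtimes(f.Name(), newTime, newTime); err != nil {
		return before, false, err
	}
	after = w.modified()
	return before, after, nil
}

func main() {
	before, after, err := demo()
	fmt.Println("modified before touch:", before, "after touch:", after, "err:", err)
}
```

Polling a timestamp is cheap and needs no platform-specific file notification API, at the cost of noticing changes only on the next poll (every 30 seconds in the loop on this slide).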

Slide 52

Slide 52 text

Slide 53

Slide 53 text

Restart: Graceful Shutdown

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

if server != nil {
	_ = server.Shutdown(context.TODO())
	server = nil
}

Shutdown Web Service
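For readers who want to try the web-service shutdown in isolation, here is a small runnable sketch built on the standard library's http.Server.Shutdown. Unlike the slide, which passes context.TODO(), this version bounds the wait with a 5-second timeout; that timeout, the loopback listener, and the cleanShutdown helper are choices made for this example only.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"time"
)

// cleanShutdown starts an HTTP server on a random loopback port, shuts it
// down gracefully, and reports whether both steps finished as expected.
func cleanShutdown() bool {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false
	}
	server := &http.Server{Handler: http.NotFoundHandler()}

	// Serve blocks until the server is shut down, so run it in a goroutine.
	serveErr := make(chan error, 1)
	go func() { serveErr <- server.Serve(ln) }()

	// Give in-flight requests up to 5 seconds to finish.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := server.Shutdown(ctx); err != nil {
		return false
	}

	// After Shutdown, Serve returns http.ErrServerClosed.
	return errors.Is(<-serveErr, http.ErrServerClosed)
}

func main() {
	fmt.Println("clean shutdown:", cleanShutdown())
}
```

Shutdown stops accepting new connections and waits for active ones to become idle, which is what lets a configuration reload restart the web service without cutting off requests midway.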

Slide 56

Slide 56 text

ctx, cancel = context.WithCancel(context.Background())
for _, endpoint := range cfg.Endpoints {
	if endpoint.IsEnabled() {
		time.Sleep(777 * time.Millisecond)
		go monitor(endpoint, cfg.Alerting, ctx)
	}
}

Shutdown WatchDog

Slide 57

Slide 57 text

// Loop for the next executions
for {
	select {
	case <-ctx.Done():
		return
	case <-time.After(endpoint.Interval):
		execute(endpoint, alertingConfig …)
	}
}

Shutdown WatchDog

Slide 58

Slide 58 text

Scalability: Distributed Approach https://github.com/TwiN/gatus/issues/…

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

https://github.com/TwiN/gatus/issues/…

Slide 61

Slide 61 text

Server Agent proposal

Slide 62

Slide 62 text

make request
accept request
Server Agent proposal

Slide 63

Slide 63 text

Server Agent proposal
send metric data

Slide 64

Slide 64 text

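• Ensure that two Agents never receive the same Endpoint
• When an Endpoint changes, how should the Agent be notified to shut down?
• Handle the graceful shutdown mechanism for both the Server and the Agent
Server Agent proposal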

Slide 65

Slide 65 text

Gatus Online Version (Why should I pay?)

Slide 66

Slide 66 text

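Why pay?
• The open source version can only be configured through YAML
• The open source version needs someone to manage the host
• If your own infrastructure runs into problems, Gatus cannot do anything either.
• Give me money and I will spend more time on Gatus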

Slide 67

Slide 67 text

No content

Slide 68

Slide 68 text

Thanks