Slide 1

Slide 1 text

Building a Multi-Machine Scale-Out Architecture with Go
Bo-Yi Wu
2020/09/08

Slide 2

Slide 2 text

About me
• Software Engineer at MediaTek
• Member of the Drone CI/CD platform
• Member of the Gitea platform
• Member of the Gin Golang framework
• Maintainer of several GitHub Actions plugins
• Instructor on Udemy: Golang + Drone

Slide 3

Slide 3 text

NeuroPilot MediaTek Ecosystem for AI Development https://neuropilot.mediatek.com/

Slide 4

Slide 4 text

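Project requirements
• Customer standalone edition (Docker version)
• Built-in simple queue mechanism
• Internal company architecture (software + hardware)
• Multi-machine queue mechanism + hardware simulation
Each job consumes 2 cores and 8 GB of memory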

Slide 5

Slide 5 text

Why Go
• Constraints of the company environment
• Protecting the program logic
• Cross-platform compilation (Windows, Linux)
• Powerful concurrency

Slide 6

Slide 6 text

Customer standalone edition

Slide 7

Slide 7 text

Introducing a queue mechanism
RabbitMQ
NSQ

Slide 8

Slide 8 text

Service components
• Database: SQLite (no MySQL or Postgres needed)
• Cache: in-memory (no Redis needed)
• Queue: developed in-house

Slide 9

Slide 9 text

Customer IT environment

Slide 10

Slide 10 text

How to implement a simple queue mechanism
Each job consumes 2 cores and 8 GB of memory

Slide 11

Slide 11 text

First, understand channel blocking
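A minimal sketch of what "blocking" means for Go channels may help here. The helper names trySend and demoBlocking are hypothetical and exist only for this illustration; the timeout is just a way to observe a send that would otherwise block forever.

```go
package main

import (
    "fmt"
    "time"
)

// trySend reports whether v could be sent on ch before the timeout expired.
// The timeout stands in for "this send would block".
func trySend(ch chan int, v int, timeout time.Duration) bool {
    select {
    case ch <- v:
        return true
    case <-time.After(timeout):
        return false
    }
}

// demoBlocking returns the outcome of three sends:
// one on an unbuffered channel with no receiver, and two on a channel
// whose buffer holds a single value.
func demoBlocking() (unbufferedOK, firstOK, secondOK bool) {
    unbuffered := make(chan int)
    buffered := make(chan int, 1)

    // No goroutine is receiving, so this send cannot proceed.
    unbufferedOK = trySend(unbuffered, 1, 20*time.Millisecond)
    // The buffer has a free slot, so this send completes immediately.
    firstOK = trySend(buffered, 1, 20*time.Millisecond)
    // The buffer is full and nobody is receiving, so this send cannot proceed.
    secondOK = trySend(buffered, 2, 20*time.Millisecond)
    return
}

func main() {
    fmt.Println(demoBlocking()) // prints "false true false"
}
```

The same rule drives the concurrency-limit slides that follow: a send on a full buffered channel, or on an unbuffered channel without a receiver, stops the sending goroutine until something changes.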

Slide 12

Slide 12 text

https://utcc.utoronto.ca/~cks/space/blog/programming/GoConcurrencyStillNotEasy

Slide 13

Slide 13 text

Limit Concurrency Issue

Slide 14

Slide 14 text

found := make(chan int)
limitCh := make(chan struct{}, concurrencyProcesses)

for i := 0; i < jobCount; i++ {
    limitCh <- struct{}{}
    go func(val int) {
        defer func() {
            wg.Done()
            <-limitCh
        }()
        found <- val
    }(i)
}

jobCount = 100
concurrencyProcesses = 10

Slide 15

Slide 15 text

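found := make(chan int)
limitCh := make(chan struct{}, concurrencyProcesses)

for i := 0; i < jobCount; i++ {
    // blocks the loop once limitCh already holds concurrencyProcesses tokens
    limitCh <- struct{}{}
    go func(val int) {
        defer func() {
            wg.Done()
            <-limitCh
        }()
        // blocks until another goroutine receives from found
        found <- val
    }(i)
}

jobCount = 100
concurrencyProcesses = 10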

Slide 16

Slide 16 text

Solution
Move the limitCh send into a background goroutine?

Slide 17

Slide 17 text

found := make(chan int)
limitCh := make(chan struct{}, concurrencyProcesses)

for i := 0; i < jobCount; i++ {
    go func() {
        limitCh <- struct{}{}
    }()
    go func(val int) {
        defer func() {
            <-limitCh
            wg.Done()
        }()
        found <- val
    }(i)
}

jobCount = 100
concurrencyProcesses = 10

Slide 18

Slide 18 text

found := make(chan int)
limitCh := make(chan struct{}, concurrencyProcesses)

for i := 0; i < jobCount; i++ {
    // the token is acquired in a separate goroutine,
    // so it no longer gates the start of the job goroutine below
    go func() {
        limitCh <- struct{}{}
    }()
    go func(val int) {
        defer func() {
            <-limitCh
            wg.Done()
        }()
        found <- val
    }(i)
}

This does not solve the concurrency limit.

jobCount = 100
concurrencyProcesses = 10

Slide 19

Slide 19 text

Solution
Rewrite the architecture

Slide 20

Slide 20 text

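found := make(chan int)
queue := make(chan int)

// producer: push every job into the queue, then close it
go func(queue chan<- int) {
    for i := 0; i < jobCount; i++ {
        queue <- i
    }
    close(queue)
}(queue)

// a fixed number of workers pull jobs from the queue
for i := 0; i < concurrencyProcesses; i++ {
    go func(queue <-chan int, found chan<- int) {
        for val := range queue {
            found <- val
            // mark the job done right away; a defer inside the loop
            // would only run when the worker goroutine returns
            wg.Done()
        }
    }(queue, found)
}

jobCount = 100
concurrencyProcesses = 10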
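To make the rewritten architecture concrete, here is a self-contained sketch of the same producer plus fixed-size worker pool. It is an illustration under assumptions, not the talk's exact code: runPool is a hypothetical name, the 1 ms sleep stands in for real work, and the WaitGroup counts workers rather than jobs so that found can be closed once every worker has exited.

```go
package main

import (
    "fmt"
    "sync"
    "time"
)

// runPool pushes jobCount jobs through a pool of `workers` goroutines.
// It returns the number of results collected and the highest number of
// jobs that were observed running at the same time.
func runPool(jobCount, workers int) (int, int) {
    queue := make(chan int)
    found := make(chan int)

    var (
        wg      sync.WaitGroup
        mu      sync.Mutex
        running int
        peak    int
    )

    // Producer: feed all jobs, then close the queue so workers can stop.
    go func() {
        for i := 0; i < jobCount; i++ {
            queue <- i
        }
        close(queue)
    }()

    wg.Add(workers)
    for i := 0; i < workers; i++ {
        go func() {
            defer wg.Done() // one Done per worker, when its loop ends
            for val := range queue {
                mu.Lock()
                running++
                if running > peak {
                    peak = running
                }
                mu.Unlock()

                time.Sleep(time.Millisecond) // placeholder for real work

                mu.Lock()
                running--
                mu.Unlock()

                found <- val
            }
        }()
    }

    // Close found once every worker has finished.
    go func() {
        wg.Wait()
        close(found)
    }()

    results := 0
    for range found {
        results++
    }
    return results, peak
}

func main() {
    results, peak := runPool(100, 10)
    fmt.Println("results:", results, "peak concurrency:", peak)
}
```

Because only `workers` goroutines ever read from queue, the number of jobs in flight cannot exceed the pool size, which is the property the two earlier attempts failed to provide.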

Slide 21

Slide 21 text

Internal queue (standalone edition)

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

Setup Consumer

Slide 24

Slide 24 text

type Consumer struct {
    inputChan chan int
    jobsChan  chan int
}

const PoolSize = 200

func main() {
    // create the consumer
    consumer := Consumer{
        inputChan: make(chan int, 1),
        jobsChan:  make(chan int, PoolSize),
    }
}

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

func (c *Consumer) queue(input int) {
    fmt.Println("send input value:", input)
    c.jobsChan <- input
}

func (c *Consumer) worker(num int) {
    for job := range c.jobsChan {
        fmt.Println("worker:", num, " job value:", job)
    }
}

for i := 0; i < WorkerSize; i++ {
    go consumer.worker(i)
}

Slide 27

Slide 27 text

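Rewrite the queue func

func (c *Consumer) queue(input int) bool {
    fmt.Println("send input value:", input)
    select {
    case c.jobsChan <- input:
        return true
    default:
        return false
    }
}

This keeps users from flooding the queue with data: when jobsChan is full, the job is rejected instead of blocking the caller.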
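The effect of the select/default form can be seen in a small sketch. The Consumer type mirrors the slide; enqueueAll is a hypothetical helper added for the demonstration, and no worker drains the channel so that the buffer fills up.

```go
package main

import "fmt"

// Consumer mirrors the slide's type, reduced to the field used here.
type Consumer struct {
    jobsChan chan int
}

// queue tries to enqueue input without blocking.
// It returns false when the buffer is already full.
func (c *Consumer) queue(input int) bool {
    select {
    case c.jobsChan <- input:
        return true
    default:
        return false
    }
}

// enqueueAll submits n jobs to a consumer whose buffer holds poolSize jobs
// and reports how many were accepted. Nothing reads from jobsChan here.
func enqueueAll(poolSize, n int) int {
    c := &Consumer{jobsChan: make(chan int, poolSize)}
    accepted := 0
    for i := 0; i < n; i++ {
        if c.queue(i) {
            accepted++
        }
    }
    return accepted
}

func main() {
    fmt.Println("accepted:", enqueueAll(3, 5)) // prints "accepted: 3"
}
```

A caller can use the boolean result to report "queue full" back to the user, for example as an HTTP error, instead of holding the request open.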

Slide 28

Slide 28 text

Shutdown with Sigterm Handling

Slide 29

Slide 29 text

func WithContextFunc(ctx context.Context, f func()) context.Context {
    ctx, cancel := context.WithCancel(ctx)
    go func() {
        // signal.Notify never blocks when delivering a signal,
        // so the channel needs a buffer to avoid missing it
        c := make(chan os.Signal, 1)
        signal.Notify(c, syscall.SIGINT, syscall.SIGTERM)
        defer signal.Stop(c)

        select {
        case <-ctx.Done():
        case <-c:
            f()
            cancel()
        }
    }()

    return ctx
}

Slide 30

Slide 30 text

func (c Consumer) startConsumer(ctx context.Context) {
    for {
        select {
        case job := <-c.inputChan:
            if ctx.Err() != nil {
                close(c.jobsChan)
                return
            }
            c.jobsChan <- job
        case <-ctx.Done():
            close(c.jobsChan)
            return
        }
    }
}

select does not guarantee the order in which ready channels are read, which is why the job case checks ctx.Err() again.
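The ordering caveat can be observed directly. In this sketch (countSelections is a hypothetical name) both channels are always ready when select runs; the Go specification says one ready case is then chosen by uniform pseudo-random selection, so both counters end up non-zero.

```go
package main

import "fmt"

// countSelections runs `rounds` selects in which both channels are ready
// and counts how often each case was chosen.
func countSelections(rounds int) (a, b int) {
    chA := make(chan int, 1)
    chB := make(chan int, 1)

    for i := 0; i < rounds; i++ {
        chA <- 1
        chB <- 1

        select {
        case <-chA:
            a++
            <-chB // drain the other channel for the next round
        case <-chB:
            b++
            <-chA // drain the other channel for the next round
        }
    }
    return a, b
}

func main() {
    a, b := countSelections(1000)
    fmt.Println("case A:", a, "case B:", b)
}
```

Applied to startConsumer: when a job and the cancellation arrive together, either case may win, so the job branch cannot assume that the context is still alive.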

Slide 31

Slide 31 text

Cancel by ctx.Done() event

func (c *Consumer) worker(num int) {
    for job := range c.jobsChan {
        fmt.Println("worker:", num, " job value:", job)
    }
}

After a channel is closed, the data still buffered in it can be read until the channel is drained.
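A short sketch of that behaviour, with a hypothetical helper drainAfterClose: the channel is closed while three values are still buffered, and range delivers all of them before the loop ends.

```go
package main

import "fmt"

// drainAfterClose buffers three jobs, closes the channel, and then reads
// everything that is left. It also reports whether a receive after the
// drain still yields a real value.
func drainAfterClose() (got []int, okAfterDrain bool) {
    jobs := make(chan int, 3)
    jobs <- 1
    jobs <- 2
    jobs <- 3
    close(jobs) // no more sends allowed, buffered values remain readable

    for job := range jobs {
        got = append(got, job)
    }

    // On a closed, empty channel the receive returns the zero value
    // and ok == false.
    _, okAfterDrain = <-jobs
    return got, okAfterDrain
}

func main() {
    got, ok := drainAfterClose()
    fmt.Println(got, ok) // prints "[1 2 3] false"
}
```

This is what lets the workers finish the jobs already queued in jobsChan after startConsumer closes it.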

Slide 32

Slide 32 text

Graceful shutdown with worker sync.WaitGroup

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

wg := &sync.WaitGroup{}
wg.Add(WorkerSize)

// Start [WorkerSize] workers
for i := 0; i < WorkerSize; i++ {
    go consumer.worker(i)
}

Slide 35

Slide 35 text

[Diagram: four elements, each labeled "WaitGroup"]

Slide 36

Slide 36 text

func (c Consumer) worker(wg *sync.WaitGroup) {
    defer wg.Done()

    for job := range c.jobsChan {
        // handle the job event
    }
}
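As a runnable illustration of the WaitGroup pattern, the sketch below starts a few workers, closes the jobs channel, and waits for all of them to return. runWorkers and the atomic counter are assumptions made for the demo; the counter only stands in for "handle the job event".

```go
package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

// runWorkers starts workerSize workers, feeds them `jobs` jobs, closes the
// channel, and waits until every worker has returned.
// It reports how many jobs were handled.
func runWorkers(workerSize, jobs int) int64 {
    jobsChan := make(chan int, jobs)
    var processed int64

    wg := &sync.WaitGroup{}
    wg.Add(workerSize)
    for i := 0; i < workerSize; i++ {
        go func() {
            defer wg.Done() // signal that this worker has fully stopped
            for range jobsChan {
                atomic.AddInt64(&processed, 1) // placeholder for job handling
            }
        }()
    }

    for i := 0; i < jobs; i++ {
        jobsChan <- i
    }
    close(jobsChan) // workers drain what is left, then exit their loops

    wg.Wait() // returns only after every worker called Done
    return atomic.LoadInt64(&processed)
}

func main() {
    fmt.Println("processed:", runWorkers(4, 20)) // prints "processed: 20"
}
```

Waiting on the WaitGroup before exiting is what turns "channel closed" into "all queued jobs finished".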

Slide 37

Slide 37 text

Add WaitGroup after Cancel Function

Slide 38

Slide 38 text

func WithContextFunc(ctx context.Context, f func()) context.Context {
    ctx, cancel := context.WithCancel(ctx)
    go func() {
        // buffered so that a signal delivered by signal.Notify is not dropped
        c := make(chan os.Signal, 1)
        signal.Notify(c, syscall.SIGINT, syscall.SIGTERM)
        defer signal.Stop(c)

        select {
        case <-ctx.Done():
        case <-c:
            // cancel first so the consumer stops, then run f (which waits)
            cancel()
            f()
        }
    }()

    return ctx
}

Add WaitGroup after Cancel Function

Slide 39

Slide 39 text

wg := &sync.WaitGroup{}
wg.Add(numberOfWorkers)

ctx := signal.WithContextFunc(
    context.Background(),
    func() {
        wg.Wait()
        close(finishChan)
    },
)

go consumer.startConsumer(ctx)

Slide 40

Slide 40 text

End of Program

select {
case <-finished:
case err := <-errChannel:
    if err != nil {
        return err
    }
}
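The pieces from the last few slides can be put together in one sketch. To keep it testable, cancellation is triggered by calling cancel() directly instead of waiting for SIGINT or SIGTERM, and the workers report each handled job on a hypothetical handled channel; runAndShutdown is likewise a name chosen for this example.

```go
package main

import (
    "context"
    "fmt"
    "sync"
)

type Consumer struct {
    inputChan chan int
    jobsChan  chan int
}

// startConsumer forwards jobs until the context is cancelled,
// then closes jobsChan so the workers can finish.
func (c *Consumer) startConsumer(ctx context.Context) {
    for {
        select {
        case job := <-c.inputChan:
            if ctx.Err() != nil {
                close(c.jobsChan)
                return
            }
            c.jobsChan <- job
        case <-ctx.Done():
            close(c.jobsChan)
            return
        }
    }
}

// worker handles jobs until jobsChan is closed and drained.
func (c *Consumer) worker(wg *sync.WaitGroup, handled chan<- int) {
    defer wg.Done()
    for job := range c.jobsChan {
        handled <- job
    }
}

// runAndShutdown submits jobs, waits for them to be handled, cancels,
// and reports the handled count and whether the finished channel closed.
func runAndShutdown(workers, jobs int) (int, bool) {
    consumer := &Consumer{
        inputChan: make(chan int, 1),
        jobsChan:  make(chan int, jobs),
    }
    handled := make(chan int, jobs)
    finished := make(chan struct{})

    wg := &sync.WaitGroup{}
    wg.Add(workers)
    for i := 0; i < workers; i++ {
        go consumer.worker(wg, handled)
    }

    ctx, cancel := context.WithCancel(context.Background())
    go consumer.startConsumer(ctx)

    for i := 0; i < jobs; i++ {
        consumer.inputChan <- i
    }
    count := 0
    for i := 0; i < jobs; i++ {
        <-handled
        count++
    }

    // Stand-in for the signal handler: cancel, wait for workers, then finish.
    cancel()
    go func() {
        wg.Wait()
        close(finished)
    }()

    <-finished // the program may exit only after this point
    return count, true
}

func main() {
    count, done := runAndShutdown(3, 10)
    fmt.Println("handled:", count, "finished:", done) // prints "handled: 10 finished: true"
}
```

The order matters: cancelling first makes startConsumer close jobsChan, which lets the workers drain and call Done, which in turn unblocks wg.Wait and closes finished.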

Slide 41

Slide 41 text

Limitation of the standalone edition
Insufficient system resources

Slide 42

Slide 42 text

System architecture

Slide 43

Slide 43 text

Server - Agent

Slide 44

Slide 44 text

How the Server and Agent communicate
https://github.com/hashicorp/go-retryablehttp

Slide 45

Slide 45 text

r := e.Group("/rpc")
r.Use(rpc.Check()) // Check RPC Secret
{
    r.POST("/v1/healthz", web.RPCHeartbeat)
    r.POST("/v1/request", web.RPCRquest)
    r.POST("/v1/accept", web.RPCAccept)
    r.POST("/v1/details", web.RPCDetails)
    r.POST("/v1/updateStatus", web.RPCUpdateStatus)
    r.POST("/v1/upload", web.RPCUploadBytes)
    r.POST("/v1/reset", web.RPCResetStatus)
}
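The slide does not show what rpc.Check() does internally. One plausible shape for such a check is a middleware that compares a shared secret sent by the agent. The sketch below uses only net/http rather than Gin, and the header name X-RPC-Secret, the secret value, and the function names are all assumptions made for illustration.

```go
package main

import (
    "crypto/subtle"
    "fmt"
    "net/http"
    "net/http/httptest"
)

// secretHeader is a hypothetical header name; the talk does not specify one.
const secretHeader = "X-RPC-Secret"

// checkSecret rejects requests whose secret header does not match.
func checkSecret(secret string, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        got := r.Header.Get(secretHeader)
        // Constant-time comparison avoids leaking the secret through timing.
        if subtle.ConstantTimeCompare([]byte(got), []byte(secret)) != 1 {
            http.Error(w, "invalid RPC secret", http.StatusUnauthorized)
            return
        }
        next.ServeHTTP(w, r)
    })
}

// statusFor sends one in-memory request with the given secret and returns
// the resulting HTTP status code. No network is involved.
func statusFor(sent string) int {
    mux := http.NewServeMux()
    mux.HandleFunc("/rpc/v1/healthz", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    })
    handler := checkSecret("s3cret", mux)

    req := httptest.NewRequest(http.MethodPost, "/rpc/v1/healthz", nil)
    req.Header.Set(secretHeader, sent)
    rec := httptest.NewRecorder()
    handler.ServeHTTP(rec, req)
    return rec.Code
}

func main() {
    fmt.Println(statusFor("s3cret"), statusFor("wrong")) // prints "200 401"
}
```

Attaching the check to the whole /rpc group, as the slide does with r.Use, means every agent endpoint is protected by the same rule.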

Slide 46

Slide 46 text

/rpc/v1/accept

UPDATE jobs
SET version = (oldVersion + 1)
WHERE machine = "fooBar" AND version = oldVersion
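The UPDATE above is a form of optimistic locking: it only succeeds for the agent that still sees the old version. The sketch below imitates that behaviour in memory so it can run without a database; jobStore, accept, and raceToAccept are hypothetical names, and the mutex plays the role of the database's atomic row update.

```go
package main

import (
    "fmt"
    "sync"
)

// jobStore is an in-memory stand-in for the jobs table (machine -> version).
type jobStore struct {
    mu      sync.Mutex
    version map[string]int
}

// accept mimics the conditional UPDATE: the version is bumped only when it
// still equals oldVersion. The result corresponds to "one row affected".
func (s *jobStore) accept(machine string, oldVersion int) bool {
    s.mu.Lock()
    defer s.mu.Unlock()

    current, ok := s.version[machine]
    if !ok || current != oldVersion {
        return false // someone else already accepted the job
    }
    s.version[machine] = oldVersion + 1
    return true
}

// raceToAccept lets several agents try to accept the same job at the same
// version and returns how many of them succeeded.
func raceToAccept(agents int) int {
    store := &jobStore{version: map[string]int{"fooBar": 1}}

    var (
        wg      sync.WaitGroup
        mu      sync.Mutex
        winners int
    )
    for i := 0; i < agents; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            if store.accept("fooBar", 1) {
                mu.Lock()
                winners++
                mu.Unlock()
            }
        }()
    }
    wg.Wait()
    return winners
}

func main() {
    fmt.Println("winners:", raceToAccept(10)) // prints "winners: 1"
}
```

With a real database the agent would inspect the number of affected rows: one row means the job is its own, zero rows means another agent got there first.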

Slide 47

Slide 47 text

Create multiple workers

Slide 48

Slide 48 text

if r.Capacity != 0 {
    var g errgroup.Group
    for i := 0; i < r.Capacity; i++ {
        g.Go(func() error {
            return r.start(ctx, 0)
        })
        time.Sleep(1 * time.Second)
    }
    return g.Wait()
}

Standalone edition: configure multiple workers

Slide 49

Slide 49 text

for {
    var (
        id  int64
        err error
    )
    if id, err = r.request(ctx); err != nil {
        time.Sleep(1 * time.Second)
        continue
    }
    go func() {
        if err := r.start(ctx, id); err != nil {
            log.Error().Err(err).Msg("runner: cannot start the job")
        }
    }()
}

Internal company setup + submit job

Slide 50

Slide 50 text

Break for and select loop

func (r *Runner) start(ctx context.Context, id int64) error {
LOOP:
    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        default:
            r.poll(ctx, id)
            if r.Capacity == 0 {
                break LOOP
            }
        }
        time.Sleep(1 * time.Second)
    }
    return nil
}
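The label matters because a bare break inside select only leaves the select statement, not the surrounding for loop. A minimal sketch of the difference, using a hypothetical iterations helper:

```go
package main

import "fmt"

// iterations counts how many times the loop body runs (at most five).
// With useLabel the loop is left on the first pass; without it, the plain
// break only exits the select and the loop keeps going.
func iterations(useLabel bool) int {
    count := 0
LOOP:
    for count < 5 {
        count++
        select {
        default:
            if useLabel {
                break LOOP // leaves the for loop
            }
            break // leaves only the select
        }
    }
    return count
}

func main() {
    fmt.Println(iterations(true), iterations(false)) // prints "1 5"
}
```

In the slide's start function the labeled break is what lets a one-shot run (Capacity == 0) stop after a single poll.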

Slide 51

Slide 51 text

Cancel a running job immediately?

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

Context with Cancel or Timeout

ctx, cancel := context.WithCancel(context.Background())
defer cancel()

timeout, cancel := context.WithTimeout(ctx, 60*time.Minute)
defer cancel()

[Diagram label: Job03 context]

Slide 56

Slide 56 text

Context with Cancel or Timeout

ctx, cancel := context.WithCancel(context.Background())
defer cancel()

timeout, cancel := context.WithTimeout(ctx, 60*time.Minute)
defer cancel()

[Diagram labels: Job03 context, Job05 context]
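Giving every job its own context is what makes it possible to stop one job without touching the others. The sketch below is an illustration under assumptions: the variable names job03 and job05 follow the diagram labels, jobContexts is a hypothetical helper, and the 10 ms timeout replaces the 60 minutes from the slide so the example finishes quickly.

```go
package main

import (
    "context"
    "errors"
    "fmt"
    "time"
)

// jobContexts creates per-job contexts under one root context and reports:
//   - whether cancelling job03 marked it as cancelled,
//   - whether job05 was left untouched,
//   - whether a context with a short timeout expired on its own.
func jobContexts() (job03Cancelled, job05Alive, shortTimedOut bool) {
    root, stop := context.WithCancel(context.Background())
    defer stop()

    job03, cancel03 := context.WithTimeout(root, 60*time.Minute)
    defer cancel03()
    job05, cancel05 := context.WithTimeout(root, 60*time.Minute)
    defer cancel05()

    // Cancel only job03, as if the user stopped that job.
    cancel03()
    <-job03.Done()
    job03Cancelled = errors.Is(job03.Err(), context.Canceled)
    job05Alive = job05.Err() == nil

    // A job that runs past its deadline is stopped by the timeout instead.
    short, cancelShort := context.WithTimeout(root, 10*time.Millisecond)
    defer cancelShort()
    <-short.Done()
    shortTimedOut = errors.Is(short.Err(), context.DeadlineExceeded)
    return
}

func main() {
    fmt.Println(jobContexts()) // prints "true true true"
}
```

Cancelling the root context would end every job derived from it, which is the behaviour wanted at shutdown.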

Slide 57

Slide 57 text

Watch the Cancel event (Agent)

go func() {
    done, _ := r.Manager.Watch(ctx, id)
    if done {
        cancel()
    }
}()

Slide 58

Slide 58 text

Handle cancel event on Server

subscribers: make(map[chan struct{}]int64),
cancelled:   make(map[int64]time.Time),

Slide 59

Slide 59 text

User cancel running job

c.Lock()
c.cancelled[id] = time.Now().Add(time.Minute * 5)
for subscriber, build := range c.subscribers {
    if id == build {
        close(subscriber)
    }
}
c.Unlock()

Slide 60

Slide 60 text

Agent subscribe the cancel event

for {
    select {
    case <-ctx.Done():
        return false, ctx.Err()
    case <-time.After(time.Minute):
        c.Lock()
        _, ok := c.cancelled[id]
        c.Unlock()
        if ok {
            return true, nil
        }
    case <-subscriber:
        return true, nil
    }
}

Slide 61

Slide 61 text

case <-time.After(time.Minute):
    c.Lock()
    _, ok := c.cancelled[id]
    c.Unlock()
    if ok {
        return true, nil
    }

Slide 62

Slide 62 text

case <-time.After(time.Minute):
    c.Lock()
    _, ok := c.cancelled[id]
    c.Unlock()
    if ok {
        return true, nil
    }

[Diagram labels: 1. Cancel]

Slide 63

Slide 63 text

case <-time.After(time.Minute):
    c.Lock()
    _, ok := c.cancelled[id]
    c.Unlock()
    if ok {
        return true, nil
    }

[Diagram labels: 1. Cancel, 2. Reconnect Server]
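The server-side cancel and the agent-side watch can be combined into one runnable sketch. It is an approximation built from the fragments on the slides, not the talk's full implementation: the canceller type name, the configurable pollEvery interval (one minute on the slides, shortened here), and the two demo functions are assumptions, and expiry of old entries in cancelled is left out.

```go
package main

import (
    "context"
    "fmt"
    "sync"
    "time"
)

type canceller struct {
    sync.Mutex
    subscribers map[chan struct{}]int64
    cancelled   map[int64]time.Time
    pollEvery   time.Duration // assumption: configurable for the demo
}

func newCanceller(poll time.Duration) *canceller {
    return &canceller{
        subscribers: make(map[chan struct{}]int64),
        cancelled:   make(map[int64]time.Time),
        pollEvery:   poll,
    }
}

// Cancel records the cancellation and wakes every watcher of that job.
func (c *canceller) Cancel(id int64) {
    c.Lock()
    defer c.Unlock()
    c.cancelled[id] = time.Now().Add(5 * time.Minute)
    for subscriber, build := range c.subscribers {
        if id == build {
            close(subscriber)
            delete(c.subscribers, subscriber) // avoid closing it twice
        }
    }
}

// Watch blocks until the job is cancelled or ctx ends.
func (c *canceller) Watch(ctx context.Context, id int64) (bool, error) {
    subscriber := make(chan struct{})
    c.Lock()
    c.subscribers[subscriber] = id
    c.Unlock()
    defer func() {
        c.Lock()
        delete(c.subscribers, subscriber)
        c.Unlock()
    }()

    for {
        select {
        case <-ctx.Done():
            return false, ctx.Err()
        case <-time.After(c.pollEvery):
            // Fallback for a cancel that happened before this watcher
            // subscribed, for example while the agent was reconnecting.
            c.Lock()
            _, ok := c.cancelled[id]
            c.Unlock()
            if ok {
                return true, nil
            }
        case <-subscriber:
            return true, nil
        }
    }
}

// watchThenCancel: the agent is watching when the user cancels.
func watchThenCancel() bool {
    c := newCanceller(10 * time.Millisecond)
    ctx, stop := context.WithTimeout(context.Background(), 2*time.Second)
    defer stop()
    result := make(chan bool, 1)
    go func() {
        done, _ := c.Watch(ctx, 7)
        result <- done
    }()
    c.Cancel(7)
    return <-result
}

// cancelThenWatch: the cancel arrives first, the agent connects afterwards.
func cancelThenWatch() bool {
    c := newCanceller(10 * time.Millisecond)
    ctx, stop := context.WithTimeout(context.Background(), 2*time.Second)
    defer stop()
    c.Cancel(7)
    done, _ := c.Watch(ctx, 7)
    return done
}

func main() {
    fmt.Println(watchThenCancel(), cancelThenWatch()) // prints "true true"
}
```

The closed subscriber channel gives an immediate notification in the normal case, while the periodic lookup in cancelled covers an agent that was disconnected at the moment of the cancel.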

Slide 64

Slide 64 text

Thank you for attending