Taro Hirose
December 12, 2017

Getting Started Casually with a Container Scheduler Using ECS Events & Lambda / 20171212_jawsug-container-lt

JAWS-UG Container Branch #10 - connpass
https://jawsug-container.connpass.com/event/71130/

Transcript

  1. whoami
    Taro Hirose
    ▸ OPENREC.tv / CyberZ, Inc.
    ▸ Backend Engineer
    ▸ id: @uorat
    ▸ http://uorat.hatenablog.com
  2. What is ECS Event?
    CloudWatch Events that are emitted in response to state changes of resources inside an ECS Cluster
    ▸ State-change events can be received for:
      ▸ Container Instance
      ▸ Task
    ▸ Launched 2016.11.25
      ▸ "Monitor Cluster State with Amazon ECS Event Stream" | Amazon Web Services blog
      ▸ https://aws.amazon.com/jp/blogs/news/monitor-cluster-state-with-amazon-ecs-event-stream/
    [Diagram labels: Amazon ECS, Event Stream, CloudWatch Events, Events, Lambda, SNS, Kinesis]
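A minimal sketch of how these events might be selected and recognized inside a Lambda function. The pattern values (`aws.ecs`, `ECS Task State Change`) are the `source` and `detail-type` fields that appear in the deck's sample events; the helper name `is_task_launch` is a hypothetical name introduced for illustration.

```python
# Event pattern for a CloudWatch Events rule that matches ECS task state
# changes. Both values are copied from the deck's sample events.
ECS_TASK_EVENT_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
}


def is_task_launch(event):
    """Return True if the event describes a task that is being launched.

    Hypothetical helper: "launch" is taken to mean a task whose desired
    status is RUNNING while its last known status is still PENDING.
    """
    if event.get("source") != "aws.ecs":
        return False
    if event.get("detail-type") != "ECS Task State Change":
        return False
    detail = event.get("detail", {})
    return (detail.get("desiredStatus") == "RUNNING"
            and detail.get("lastStatus") == "PENDING")
```

The rule does the coarse filtering (only ECS task events reach the function), and the function itself inspects `detail` for the finer distinction.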
  3. What is ECS Event?
    e.g. Task launch (ECS Task Event)
    {
      "version": "0",
      "id": "451dda85-ca1a-9045-5121-7a12dfb9317f",
      "detail-type": "ECS Task State Change",
      "source": "aws.ecs",
      "account": "123456789012",
      "time": "2017-09-07T08:28:04Z",
      "region": "ap-northeast-1",
      "resources": [
        "arn:aws:ecs:ap-northeast-1:123456789012:task/b280d725-7382-43b8-a50d-ef909a36cb80"
      ],
      "detail": {
        "clusterArn": "arn:aws:ecs:ap-northeast-1:123456789012:cluster/uorat-ecs-event-test",
        "containerInstanceArn": "arn:aws:ecs:ap-northeast-1:123456789012:container-instance/ff83c4a8-67fc-4a13-8134-897c6dd2195a",
        ...
        "desiredStatus": "RUNNING",
        ...
        "lastStatus": "PENDING",
        ...
        "taskDefinitionArn": "arn:aws:ecs:ap-northeast-1:123456789012:task-definition/uorat-ecs-event-test:35",
        ...
      }
    }
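  4. What is ECS Event?
    What can you do with it?
    ▸ Quoted from "Monitor Cluster State with Amazon ECS Event Stream"
      ▸ https://aws.amazon.com/jp/blogs/news/monitor-cluster-state-with-amazon-ecs-event-stream/
    ▸ "You can also use this information to automate container placement and scaling, which makes it possible to 'right-size' a cluster at a very fine-grained level. By delivering cluster state information in near real time, event-driven rather than pull-based, the ECS event stream feature opens up a very wide range of possibilities for monitoring and scaling container infrastructure."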
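  5. What is ECS Event?
    Event-driven task deployment and system integration
    ▸ For example
      ▸ Store task run history in Elasticsearch or DynamoDB and use it for analysis
      ▸ Run some processing whenever a task or container instance starts or stops
      ▸ Integrate with a monitoring system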
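As a sketch of the first example, a handler could flatten each task event into a history record and write it to DynamoDB. The item layout (`task_arn` as partition key, `event_time` as sort key) and both function names are assumptions made for illustration. `table` is expected to behave like a boto3 DynamoDB `Table` resource; `put_item(Item=...)` is the only call used.

```python
def to_history_item(event):
    """Flatten an ECS task state change event into a flat record.

    The key layout is hypothetical, not taken from the deck.
    """
    detail = event["detail"]
    return {
        "task_arn": event["resources"][0],  # assumed partition key
        "event_time": event["time"],        # assumed sort key
        "cluster_arn": detail["clusterArn"],
        "desired_status": detail["desiredStatus"],
        "last_status": detail["lastStatus"],
    }


def record_history(event, table):
    """Write one history record and return it.

    `table` is, for example, boto3.resource("dynamodb").Table(name).
    """
    item = to_history_item(event)
    table.put_item(Item=item)
    return item
```

Passing the table in as a parameter keeps the flattening logic testable without AWS credentials.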
  6. What is ECS Event?
    e.g.
    ▸ Container Scheduler for Amazon ECS
      ▸ Open-source software written in golang, released at re:Invent 2016
      ▸ Lets you implement a custom scheduler for an ECS Cluster
        ▸ Detects ECS Cluster events
        ▸ Tracks ECS Cluster state
        ▸ Runs custom schedulers
        ▸ Provides a REST API
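  7. Case: OPENREC.tv
    A video streaming service specialized in games
    ▸ Low latency, high picture quality
    ▸ 90% of the content is UGC
      ▸ User-driven live streaming is the core
      ▸ The number of concurrent streams and their duration depend on the broadcasters
      ▸ Audience draw also depends on the broadcasters
      ▸ ∴ Load is hard to predict
    ▸ Unlimited concurrent viewers
      ▸ No so-called "slots"
      ▸ Every user must be able to watch live equally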
  8. Architecture of live transcoding system
    CloudWatch Events (scheduled/1min, ECS Event Stream) + Lambda + API
    [Diagram labels: Broadcaster (x3); ECS Cluster with Container Instances and Tasks (Containers); LIVE API on ELB, EC2, RDS (Aurora); CloudWatch Events & Lambda with ECS Events; API calls: ec2:RunInstances, ecs:RunTask, ecs:StopTask, ec2:StopInstances, …]
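  9. Architecture of live transcoding system
    ▸ EC2/ECS Auto Scaling is a poor fit
      ▸ RTMP = persistent connection
      ▸ Even when load is low, capacity cannot be scaled in while a stream is live
      ▸ A scheduler that scales on a custom metric, "streaming status", is needed
    ▸ The pain of Rolling Deploy
      ▸ When a stream ends is also up to the user
      ▸ Basically, a task cannot be taken down until its stream ends
      ▸ Some streams even run for 24 hours…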
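The idea of scaling on streaming status rather than on load can be sketched as a pure decision function. Everything here is an assumption made for illustration: the `TaskState` shape, the `spare` headroom parameter, and the rule that only tasks without an active stream may be stopped.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TaskState:
    task_arn: str
    streaming: bool  # True while a broadcaster is connected over RTMP


def plan_scaling(tasks: List[TaskState], spare: int) -> Tuple[int, List[str]]:
    """Decide how many tasks to start and which idle tasks may be stopped.

    Keeps `spare` idle tasks ready for new broadcasts and never selects a
    streaming task for stopping, however low the load is.
    """
    idle = [t for t in tasks if not t.streaming]
    if len(idle) < spare:
        # Not enough headroom: start the missing tasks, stop nothing.
        return spare - len(idle), []
    # Idle tasks beyond the headroom are safe to stop.
    surplus = idle[spare:]
    return 0, [t.task_arn for t in surplus]
```

A scheduled Lambda (the deck mentions a 1-minute schedule) could call such a function on every tick and translate the result into `ecs:RunTask` and `ecs:StopTask` calls.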
  10. Architecture of live transcoding system
    Stateful application, but disposability
    [Diagram labels: Broadcaster (x3); ECS Cluster with Container Instances and Tasks (Containers); LIVE API on ELB, EC2, RDS (Aurora); CloudWatch Events & Lambda with ECS Events; API calls: ec2:RunInstances, ecs:RunTask, ecs:StopTask, ec2:StopInstances, …]
  11. Architecture of live transcoding system
    Expire >> Drain >> Stop Container
    [Diagram labels: ECS Cluster with Broadcasters, Container Instances and Tasks (Containers); LIVE API on ELB, EC2, RDS (Aurora); CloudWatch Events & Lambda with ECS Events; API calls: ec2:RunInstances, ecs:RunTask, ecs:StopTask, ec2:StopInstances, …. Separately drawn node: Broadcaster, Container Instance, Task (Container)]
  12. Architecture of live transcoding system
    Expire >> Drain >> Stop Container
    [Diagram labels: ECS Cluster with Broadcasters, Container Instances and Tasks (Containers); LIVE API on ELB, EC2, RDS (Aurora); CloudWatch Events & Lambda with ECS Events; API calls: ec2:RunInstances, ecs:RunTask, ecs:StopTask, ec2:StopInstances, …. Separately drawn node: Container Instance, Task (Container), no Broadcaster]
  13. Architecture of live transcoding system
    Expire >> Drain >> Stop Container
    [Diagram labels: ECS Cluster with Broadcasters, Container Instances and Tasks (Containers); LIVE API on ELB, EC2, RDS (Aurora); CloudWatch Events & Lambda with ECS Events; API calls: ec2:RunInstances, ecs:RunTask, ecs:StopTask, ec2:StopInstances, …. Separately drawn node: Container Instance only]
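One way to read the Expire >> Drain >> Stop sequence is as a small state machine evaluated on every scheduler tick. This is an interpretation of the three slides, not the author's code: the `max_age_sec` threshold, the `Task` fields, and the action names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    age_sec: int           # seconds since the task started
    streaming: bool        # a broadcaster is currently connected
    expired: bool = False  # expired tasks are not offered to new broadcasters


def next_action(task: Task, max_age_sec: int) -> str:
    """Return "keep", "expire", "drain", or "stop" for one task."""
    if not task.expired:
        # Expire: a task past its lifetime stops accepting new broadcasts.
        return "expire" if task.age_sec >= max_age_sec else "keep"
    if task.streaming:
        # Drain: wait for the broadcaster to finish; never cut a live stream.
        return "drain"
    # Stop: nothing is connected any more, so the container can be stopped.
    return "stop"
```

Under this reading the application stays stateful while a stream is live, yet every container remains disposable once it has drained.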
  14. Architecture of live transcoding system
    Auto-scale Containers and Instances according to the number of streams and the streaming load
    ▸ For releases too, just run `docker image push` and the new image spreads on its own
    [Diagram labels: Broadcaster (x3); ECS Cluster with Container Instances and Tasks (Containers); LIVE API on ELB, EC2, RDS (Aurora); CloudWatch Events & Lambda with ECS Events; API calls: ec2:RunInstances, ecs:RunTask, ecs:StopTask, ec2:StopInstances, …]
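  15. Case1: Monitoring Tasks
    Automatic setup of Container/Application monitoring
    ▸ Turn monitoring ON/OFF according to Task state changes
      1. Task started → start monitoring the JVM and Wowza layers
      2. Task stopped normally → stop monitoring
      3. Task stopped abnormally → keep monitoring enabled and raise an alert
    ▸ The monitoring system reuses the existing Zabbix
      ▸ Call the Zabbix API in steps 1 and 2 above
      ▸ Cheap, because it uses what is already there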
  16. Case1: Monitoring Tasks
    e.g. running ecs:StopTask (ECS Task Event)
    {
      "version": "0",
      "id": "41f02974-8365-f955-8465-264ef8b189ca",
      "detail-type": "ECS Task State Change",
      "source": "aws.ecs",
      "account": "123456789012",
      "time": "2017-09-07T08:10:36Z",
      "region": "ap-northeast-1",
      "resources": [
        "arn:aws:ecs:ap-northeast-1:123456789012:task/55a0cfde-a377-4b97-…"
      ],
      "detail": {
        "clusterArn": "arn:aws:ecs:ap-northeast-1:123456789012:cluster/uorat-ecs-event-test",
        "containerInstanceArn": "arn:aws:ecs:ap-northeast-1:123456789012:container-instance/a04...",
        ...
        "desiredStatus": "STOPPED",
        ...
        "lastStatus": "RUNNING",
        ...
        "stoppedReason": "Task stopped by user",
        ...
      }
    }
    [Diagram labels: ECS Cluster, Container Instance (x2), Task (Container) (x2), ecs:StopTask]
  17. Case1: Monitoring Tasks
    e.g. running ecs:StopTask
    Lambda function: main.py
    def handle(event, context):
        ...
        if desire_status == "RUNNING" and last_status == "PENDING":
            # Task is starting: register the host in Zabbix
            logger.info("Enable monitoring by Zabbix: host=%s" % (tag_name))
            zabbix_register(tag_name, private_ip)
        elif desire_status == "STOPPED":
            logger.info("Found the stopped task: task_arn=%s, last_status=%s" % (
                task_arn, last_status
            ))
            stopped_reason = event["detail"]["stoppedReason"]
            if stopped_reason == "Task stopped by user" and last_status == "RUNNING":
                # Normal stop via ecs:StopTask: disable monitoring
                logger.info("Disable monitoring by Zabbix: host=%s" % (tag_name))
                zabbix_disable(tag_name)
            elif stopped_reason != "Task stopped by user":
                # Abnormal stop: leave monitoring enabled and respawn the task
                logger.warning("Found the failed task: host=%s, task_arn=%s, stopped_reason=%s" % (
                    tag_name, task_arn, stopped_reason
                ))
                respawn(task_arn, ec2_instance_id)
                logger.warning("Respawned and locked the task: host=%s, task_id=%s" % (
                    tag_name, task_arn
                ))
    [Diagram labels: ECS Cluster, Container Instance (x2), Task (Container) (x2), ecs:StopTask]
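  18. Case2: Respawn failed tasks
    Kick abnormally terminated Tasks back to life
    ▸ Because connections are persistent, a Task that is mid-stream must be rescued
      ▸ Re-run the Task when "StoppedReason" is not an expected value
      ▸ Necessary because these are RunTask tasks, not Service tasks
    ▸ Run the processing needed for the follow-up
      ▸ An abnormal termination must raise an alert; monitoring is not disabled automatically
      ▸ Notify internal staff about the affected streams
      ▸ Lock the Task / Instance for investigation and a permanent fix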
  19. Case2: Respawn failed tasks
    e.g. Task terminated abnormally (ECS Task Event)
    {
      "version": "0",
      "id": "faeb52d8-e2ef-a726-655a-80f2373046b9",
      "detail-type": "ECS Task State Change",
      "source": "aws.ecs",
      "account": "123456789012",
      "time": "2017-09-07T08:40:33Z",
      "region": "ap-northeast-1",
      "resources": [
        "arn:aws:ecs:ap-northeast-1:123456789012:task/d56d76f1-eb2a-42e5-..."
      ],
      "detail": {
        "clusterArn": "arn:aws:ecs:ap-northeast-1:123456789012:cluster/uorat-ecs-event-test",
        "containerInstanceArn": "arn:aws:ecs:ap-northeast-1:123456789012:container-instance/a04...",
        ...
        "desiredStatus": "STOPPED",
        ...
        "lastStatus": "STOPPED",
        ...
        "stoppedReason": "Essential container in task exited",
        ...
      }
    }
    [Diagram labels: ECS Cluster, Container Instance (x2), Task (Container)]
  20. Case2: Respawn failed tasks
    e.g. Task terminated abnormally
    Lambda function: main.py
    def handle(event, context):
        ...
        if desire_status == "RUNNING" and last_status == "PENDING":
            logger.info("Enable monitoring by Zabbix: host=%s" % (tag_name))
            zabbix_register(tag_name, private_ip)
        elif desire_status == "STOPPED":
            logger.info("Found the stopped task: task_arn=%s, last_status=%s" % (
                task_arn, last_status
            ))
            stopped_reason = event["detail"]["stoppedReason"]
            if stopped_reason == "Task stopped by user" and last_status == "RUNNING":
                logger.info("Disable monitoring by Zabbix: host=%s" % (tag_name))
                zabbix_disable(tag_name)
            elif stopped_reason != "Task stopped by user":
                # Abnormal stop: this is the branch exercised in Case2
                logger.warning("Found the failed task: host=%s, task_arn=%s, stopped_reason=%s" % (
                    tag_name, task_arn, stopped_reason
                ))
                respawn(task_arn, ec2_instance_id)
                logger.warning("Respawned and locked the task: host=%s, task_id=%s" % (
                    tag_name, task_arn
                ))
    [Diagram labels: ECS Cluster, Container Instance (x2), Task (Container), ecs:StartTask]
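The slides do not show the body of `respawn(...)`. The following is a rough sketch of what such a re-run step could look like, not the author's implementation. It assumes that "Task stopped by user" is the only expected stop reason (as in the Lambda code above) and that the replacement is started on the same container instance through the ECS `StartTask` API, which the diagram labels as `ecs:StartTask`. `ecs` is expected to be a boto3 ECS client; the function name `respawn_if_failed` is hypothetical.

```python
EXPECTED_STOP_REASON = "Task stopped by user"


def respawn_if_failed(event, ecs):
    """Start a replacement task if the original stopped unexpectedly.

    `ecs` is boto3.client("ecs") or any object that accepts the same
    start_task() keyword arguments. Returns the start_task() response,
    or None when nothing needs to be rescued.
    """
    detail = event["detail"]
    if detail.get("desiredStatus") != "STOPPED":
        return None
    if detail.get("stoppedReason") == EXPECTED_STOP_REASON:
        return None  # normal stop: nothing to rescue
    # Tasks that do not belong to an ECS service are not replaced by ECS,
    # so the function has to start the replacement itself.
    return ecs.start_task(
        cluster=detail["clusterArn"],
        taskDefinition=detail["taskDefinitionArn"],
        containerInstances=[detail["containerInstanceArn"]],
    )
```

Taking the client as a parameter makes it possible to exercise the branching with a stub before pointing it at a real cluster.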
  21. Summary
    ECS Events & Lambda is recommended if you:
    ▸ Find the standard ECS Task Placement slightly lacking
      ▸ k8s or Blox may be overkill
    ▸ Want to rescue abnormally terminated Tasks
    ▸ Want fine-grained, event-driven control over resources