
What We Learned from Three Years of Running a Load-Testing Tool on ECS Fargate Spot

2022.11.30 Kubernetes/Fargate Adoption and Use Cases: Kakaku.com × CROOZ × NAVITIME [TECHHILLS]

https://techplay.jp/event/877265

NAVITIME JAPAN

November 29, 2022

Transcript

  1. What we learned from three years of running a load-testing tool on ECS Fargate spot. 2022/11/30

  2. Self-introduction: 萱島 克英. From Oita. Hobbies: fishing, golf, running, baseball. Role: SRE PJ Manager

    (development, operations, management, etc.)
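  3. Company overview. Company name: NAVITIME JAPAN Co., Ltd. Employees: approx. 500. Headquarters: Minami-Aoyama, Minato-ku, Tokyo. Founded: 2000. (Photo: NAVITIME JAPAN's Minami-Aoyama office.)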

  4. Corporate philosophy: to serve the world's industries with route-search engine technology.

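  5. Business areas: B to C. Media business: general navigation services, marketing support. Travel business: travel services,

    tourism marketing support. Drive business: services for automobiles. Touring business: services for bicycles and motorcycles. Services for corporations and local governments. Bus and walking business: bus and health services. Carrier collaboration business. Business NAVITIME business: for large vehicles, fleet tracking.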
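  6. Business areas: B to B. Maas business: API provision for Maas apps, planning and operation of content tourism. Regional collaboration business:

    service development support for local governments, consulting on attracting tourists and similar. CASE business: development support for automotive services. Traffic data business: provision and analysis of traffic and tourism data. Location marketing business: cloud service for store data management, store marketing support. Solution business. Public transport business: solutions for transport operators. Provision of APIs, SDKs, tools and the like for corporations.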
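  7. New features and services: a multi-stop route search that shortens the time required to complete deliveries; a feature born from the harshness of this year's extreme heat; a focus on "charging", a pain point for EV users; solving mobility problems with "NAVITIME for Baby".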

  8. Today's topics: the container technologies we use, our deployment methods, and the issues we hit (ECS on fargate, ECS on ec2, EKS on self

    managed nodes); how has three years of using ECS on fargate gone?
  9. History of NAVITIME infrastructure. 2001~: everything run on-premises. 2015~: trial use of AWS. 2016~: cloud migration begins.

  10. Cloud vendors in use

  11. AWS accounts for the largest share of our usage. We do not limit ourselves to a single vendor and make effective use of each vendor's strengths.

  12. Container technologies: use cases

  13. ECS fargate: case study

  14. What is ECS fargate? A container orchestration service provided by AWS. There is no need to manage the nodes that run the containers.

  15. ECS fargate use case: Fargate is used for Locust (a load-testing tool). It is used for load tests against approx. 140 Services.

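  16. Features of Locust: scenarios can be implemented in Python (highly extensible); large-scale load tests are possible with a Master - Slave configuration; load-test results can be checked in the browser in real time.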

  17. Overall architecture. [Diagram: ECS Service: Master (master), ECS Service: Slave (three slave containers),

    Service Discovery, Access log, measurement target]
  18. Locust: role of the master. [Diagram: same architecture as slide 17]

    Role of the master: collects statistics from the slaves and visualizes them in real time on a web page for checking test results.
  19. Locust: role of the slave. [Diagram: same architecture as slide 17]

    Role of the slave: actually sends the requests to the server being measured.
  20. Processing flow. ① Press "Start swarming".

  21. Processing flow. [Diagram: same architecture as slide 17]

    ② Each Locust slave container fetches from s3 the list of requests to use in the load test.
  22. Processing flow. [Diagram: same architecture as slide 17]

    ③ Each Locust slave container sends requests to the server being measured. Response results are sent to the Master in real time.
  23. Highly extensible because scenarios are implemented in Python. [Diagram: same architecture as slide 17]

    Examples: fetch test requests from S3 in advance; rewrite requests (change /v1 to /v2); add request headers; adjust the interval between requests.
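The customizations listed on slide 23 can be sketched in plain Python. This is only an illustration, not the speaker's scenario code: the function names, the `X-Load-Test` header, and the idea of deriving the wait time from a target request rate are all assumptions. In a real Locust scenario these helpers would be called from a task before the request is sent.

```python
import re


def rewrite_request(path, headers=None):
    """Return (path, headers) with a leading /v1/ changed to /v2/ and one header added."""
    # Only a leading "/v1/" segment is rewritten; "/v10/..." is left untouched.
    new_path = re.sub(r"^/v1/", "/v2/", path)
    new_headers = dict(headers or {})
    new_headers["X-Load-Test"] = "true"  # hypothetical header name
    return new_path, new_headers


def wait_seconds(requests_per_second):
    """Interval between requests for one simulated user at the given rate."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    return 1.0 / requests_per_second


if __name__ == "__main__":
    print(rewrite_request("/v1/20221130/route?from=a"))
    print(wait_seconds(4))
```

Keeping the rewriting logic in small pure functions makes it easy to unit-test separately from the load-test run itself.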
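  24. Deploying to the ECS Fargate environment: all AWS resources are created through Cloudformation stacks using the AWS CLI. Because this is an internal tool, we do not do Blue/Green deployments; stacks are only created or deleted.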
  25. Deploying to ECS fargate. 1. The developer running the load test creates a configuration file and pushes it to Git. Example:

    ECSClusterName: locust
    VpcId: vpc-xxxxxxxx
    ECSSecurityGroupId: sg-xxxxxxx
    ECSTaskExecutionRole: arn:aws:iam::333333333:role/EC2_allow_locust
    ECSSubnetId1: subnet-22222222222
    ECSSubnetId2: subnet-44444444444
    ECSSubnetId3: subnet-11111111111
    ECSImageName: hoge:1.2.2
    TargetUrl: https://hoge.jp
    TargetService: xxx-truckapp-stg
    ApacheLogUrl: s3://hoge/archive/latest/latest/*.gz
    RequestUrlPattern: /v1/[0-9]{8}/(route)
    RequestUrlExcludePattern: ''
    RequestUrlReplace: ''
    MasquaradeUseragent: false
    RequestToEveryPath: false
    SetScaleInAlarm: false
    AddNLBToLocustMaster: false
    LocustMaster:
      ECSTaskCPUUnit: 512
      ECSTaskMemory: 1024
      LocustOpts: --loglevel=ERROR --csv=output
      MasterNamespace: hoge.jp
      PrivateNamespaceId: ns-xxxxx
    LocustSlave:
      ECSTaskCPUUnit: 256
      ECSTaskMemory: 512
      ECSTaskDesiredCount: 1
      LocustOpts: --loglevel=ERROR
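The slides do not show how `RequestUrlPattern` and `RequestUrlExcludePattern` are applied. One plausible reading is that they filter the request paths taken from the Apache access log at `ApacheLogUrl`. The sketch below assumes that reading, and also assumes a common Apache log line layout and `re.search` semantics; the function name is hypothetical.

```python
import re

# Matches the request part of a typical Apache access-log line, e.g.
# "GET /v1/20221130/route?from=a HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|HEAD|POST|PUT|DELETE) (\S+) HTTP/[0-9.]+"')


def extract_paths(log_lines, include_pattern, exclude_pattern=""):
    """Return request paths that match include_pattern and not exclude_pattern."""
    include = re.compile(include_pattern)
    exclude = re.compile(exclude_pattern) if exclude_pattern else None
    paths = []
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # not a parsable request line
        path = match.group(1)
        if not include.search(path):
            continue
        if exclude is not None and exclude.search(path):
            continue
        paths.append(path)
    return paths


if __name__ == "__main__":
    sample = [
        '10.0.0.1 - - [30/Nov/2022:10:00:00 +0900] "GET /v1/20221130/route?from=a HTTP/1.1" 200 512',
        '10.0.0.1 - - [30/Nov/2022:10:00:01 +0900] "GET /v1/20221130/health HTTP/1.1" 200 2',
    ]
    print(extract_paths(sample, r"/v1/[0-9]{8}/(route)"))
```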
  26. Deploying to ECS fargate. 2. The developer running the load test uses a Jenkins job to deploy the ECS Fargate

    Service. [Diagram: separate environments for service A, service B, and service C]
  27. Deploying to ECS fargate. 3. After the ECS Task starts, the developer accesses the Route53 domain issued for each Service and starts the load test.

  28. Why we adopted ECS Fargate: operating cost. This is an internal tool used by many products, so we did not want to spend much on initial setup or operations.

  29. Why we adopted ECS Fargate: we evaluated architectures in the order Lambda > ECS Fargate > ECS on ec2.

    The learning cost is relatively low (compared with ECS on ec2 and EKS).
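  30. Measures to reduce cost. Use ECS fargate spot: • discounted by up to 70% • no problems have occurred so far, not even once. Automatic Task shutdown:

    • Tasks that are not running a load test are detected and deleted automatically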
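The two cost measures on slide 30 can be illustrated with a small sketch. The slides do not say how idle Tasks are detected, so the "no request for N minutes" rule, the 30-minute default, and both function names are assumptions made for illustration only.

```python
from datetime import datetime, timedelta


def find_idle_tasks(last_request_at, now, idle_after=timedelta(minutes=30)):
    """Return the IDs of tasks whose last request is at least idle_after old.

    last_request_at maps a task ID to the datetime of its last load-test request.
    """
    return sorted(
        task_id for task_id, seen in last_request_at.items() if now - seen >= idle_after
    )


def spot_cost_floor(on_demand_cost, max_discount=0.70):
    """Lowest possible cost if the full 'up to 70%' discount applied."""
    return on_demand_cost * (1.0 - max_discount)


if __name__ == "__main__":
    now = datetime(2022, 11, 30, 12, 0)
    tasks = {
        "task-a": datetime(2022, 11, 30, 11, 0),   # idle for 60 minutes
        "task-b": datetime(2022, 11, 30, 11, 50),  # idle for 10 minutes
    }
    print(find_idle_tasks(tasks, now))
    print(spot_cost_floor(100.0))
```

Because the discount is stated as a maximum, `spot_cost_floor` gives a lower bound on cost rather than an expected value.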
  31. ECS on ec2: case study

  32. Examples of products running on ECS on ec2: NAVITIME, 乗換NAVITIME, カーナビタイム, 自転車NAVITIME, Japan Travel by NAVITIME, トラックカーナビ, ツーリングサポーター,

    ALKOO by NAVITIME
  33. Examples of products running on ECS on ec2: NAVITIME, 乗換NAVITIME, カーナビタイム, 自転車NAVITIME, Japan Travel by NAVITIME, トラックカーナビ, ツーリングサポーター,

    ALKOO by NAVITIME. ECS on ec2 is the most widely used option in the company.
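  34. Why we mainly use ECS on ec2: we want to keep infrastructure cost as low as possible; when Fargate was released, we already had in-house know-how for linking AutoscalingGroup and ECS; for some APIs there was a possibility that NAVITIME's performance requirements could not be met.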

  35. Deploy flow: deployment is carried out by Jenkins jobs. ① Build the container & ECR ② Canary release using Cloudformation

  36. Deploy: build the container with Jenkins. Build & Push ECR

  37. All resources are created with Cloudformation. [Diagram: each service's AWS environment with Availability Zone A, Availability Zone C, Availability Zone D;

    ECS Service with Tasks, ALB, ECR]
  38. All resources are created with Cloudformation. [Diagram: each service's AWS environment with Availability Zone A, Availability Zone C, Availability Zone D;

    service alerts, log data storage]
  39. Canary release flow. [Diagram: each service's AWS environment;

    Blue Service receives 100 % of traffic]
  40. Canary release flow. [Diagram: each service's AWS environment; Blue Service 100 %,

    Green Service 0 %]
  41. Canary: requests are shifted to Green with ALB weighted routing. [Diagram: each service's AWS environment; Blue Service 0 %,

    Green Service 100 %]
  42. Canary release flow. [Diagram: each service's AWS environment;

    Green Service 100 %]
  43. Canary release: we do not use CodeDeploy. Reason: we needed to configure the link between the ECS Service and the AutoscalingGroup.

    The Capacity Provider feature is available now, so CodeDeploy might also meet the requirement (unverified).
  44. Canary release: common scripts written by the infrastructure team are used in each product's deploy job (bash + aws-cli). Common logic: creating the AutoscalingGroup and ECS Service; changing the ALB weighted routing.

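  45. Issue during operations: the fluentd container used for log forwarding was putting log data to s3 at high frequency, so the s3 charges had become high.

    Countermeasure: change fluentd's flush_interval to a larger value.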
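A rough calculation shows why `flush_interval` matters for the S3 bill. The sketch assumes the worst case where every container issues one PUT per flush interval, and takes the price per 1,000 PUT requests as a parameter. The container count, the intervals, and the price in the example are made-up inputs, not figures from the talk.

```python
def monthly_put_requests(containers, flush_interval_seconds, days=30):
    """Upper bound on S3 PUTs per month: one PUT per container per flush interval."""
    seconds_per_month = days * 24 * 60 * 60
    return containers * (seconds_per_month // flush_interval_seconds)


def put_cost(requests, price_per_1000):
    """Request charge for the given number of PUTs at a caller-supplied price."""
    return requests / 1000 * price_per_1000


if __name__ == "__main__":
    # Hypothetical fleet of 100 fluentd containers.
    print(monthly_put_requests(100, 10))   # 25920000 PUTs with a 10 s interval
    print(monthly_put_requests(100, 600))  # 432000 PUTs with a 10 min interval
```

Under this model the number of PUT requests falls in proportion to the interval, at the price of logs reaching S3 later.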
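  46. Issue during operations: a long load test was run against an Application that was still emitting Debug logs, so the Cloudwatch Logs charges became high. Countermeasure: check the application's Logger settings when running a load test.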

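  47. Issue during operations: service-to-service traffic went through a public ALB and API responses were not compressed, so the DataTransfer-Regional-Byte charges had become high. Countermeasures: switch to an internal ALB; compress API responses.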

  48. Issue during operations: requests were routed to a TargetGroup whose HealthyHostCount was 0. Countermeasure: the script was modified to check the TargetGroup's HealthyHostCount before changing the weights.
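The guard described on slide 48 amounts to refusing a weight change when the destination has no healthy targets. The team's actual script is bash + aws-cli and is not shown in the slides; the Python below is only a sketch of the decision logic, with a hypothetical function name and the HealthyHostCount passed in as a plain number.

```python
def plan_weights(green_healthy_host_count, target_green_weight):
    """Return ALB weights for Blue and Green, refusing to route to an unhealthy Green.

    target_green_weight is a percentage from 0 to 100.
    """
    if not 0 <= target_green_weight <= 100:
        raise ValueError("target_green_weight must be between 0 and 100")
    if target_green_weight > 0 and green_healthy_host_count == 0:
        # Shifting traffic here would send requests to a group that cannot serve them.
        raise RuntimeError("Green TargetGroup has HealthyHostCount 0; aborting weight change")
    return {"blue": 100 - target_green_weight, "green": target_green_weight}


if __name__ == "__main__":
    print(plan_weights(green_healthy_host_count=3, target_green_weight=100))
```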

  49. EKS on self managed nodes: case study

  50. Services running on EKS. Full-text search API: approx. 30 APIs, approx. 100 instances. Jenkins: 33~ instances. https://note.com/navitime_tech/n/nc663cc1e866e

  51. Deploying to the EKS Cluster: we use ArgoCD. What is ArgoCD? A tool that performs continuous delivery to Kubernetes clusters using Gitops.

  52. Deploying the full-text search API: overview. [Diagram: Argo events, Argo workflow, Trigger; ① Webhook request; ② repository update;

    periodic fetch; ③ Sync (Deploy); Argo rollouts; Blue Service, Green Service]
  53. What is Argo rollouts? A Kubernetes controller that supports Blue/Green and Canary releases. Test cases and automatic rollback are easy to configure.

  54. Argo rollouts: example. After a Blue/Green deployment, a rollback runs automatically if the share of 2xx responses falls below 95%, or if the service is not attached to the load balancer.
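The rollback rule on slide 54 can be written as a simple predicate. In Argo rollouts this kind of check is configured declaratively rather than coded by hand, so the Python below is only an illustration of the condition itself. The function name and the choice to treat "no traffic yet" as not triggering a rollback are assumptions.

```python
def should_rollback(status_codes, attached_to_load_balancer, min_success_ratio=0.95):
    """Decide whether to roll back after a Blue/Green switch.

    status_codes is a sequence of HTTP status codes observed on the new containers.
    """
    if not attached_to_load_balancer:
        return True
    if not status_codes:
        return False  # assumption: no traffic yet means no evidence of failure
    successes = sum(1 for code in status_codes if 200 <= code < 300)
    return successes / len(status_codes) < min_success_ratio


if __name__ == "__main__":
    observed = [200] * 94 + [500] * 6  # 94% success
    print(should_rollback(observed, attached_to_load_balancer=True))
```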

  55. Argo rollouts: BlueGreen deployment flow. [Diagram labels: the new container about to go into service; the container currently referenced by the service]

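  56. Argo rollouts: BlueGreen deployment flow. [Diagram labels: the new container about to go into service; the container currently referenced by the service] Test requests are sent to the new container in advance, before it goes into service.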

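  57. Argo rollouts: BlueGreen deployment flow. Once the test requests confirm there are no problems, the routing target is changed from the old container to the new one. [Diagram labels: the new container is attached to the balancer; the old container is detached from the balancer]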

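  58. Argo rollouts: BlueGreen deployment flow. For example, after the Blue/Green switch, the system detects that the success rate on the new container (the share of 2xx requests) has fallen below a set threshold.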

  59. Argo rollouts: BlueGreen deployment flow. The routing target is changed back to the old container, and the deploy status becomes Degraded. [Diagram labels: old container, new container]

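  60. Problem with Gitops operations: a deployment runs as soon as something is pushed to the Git repository, so a faulty program can easily reach production. Manifests need to be tested before they are applied to production.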

  61. Two kinds of Manifest tests are run: tests with conftest before git push; tests with Gatekeeper before the change is applied to the kubernetes cluster.

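  62. Examples of test cases: Is a staging domain set on a production Ingress resource? Is a staging container set on a production Ingress resource? Is a container with the Latest tag specified? Are the HPA settings appropriate?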
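Two of the test cases on slide 62 are easy to express as checks over a parsed Manifest. Real conftest and Gatekeeper policies are written in Rego, and the team's policies are not shown in the slides. The Python below is only a language-neutral illustration: the function name, the `stg.example.com` suffix, and the set of workload kinds are all hypothetical.

```python
def check_manifest(manifest, staging_host_suffix="stg.example.com"):
    """Return a list of problems found in one parsed Kubernetes Manifest (a dict)."""
    problems = []
    kind = manifest.get("kind")
    if kind == "Ingress":
        for rule in manifest.get("spec", {}).get("rules", []):
            host = rule.get("host", "")
            if host.endswith(staging_host_suffix):
                problems.append(f"staging domain on Ingress: {host}")
    if kind in ("Deployment", "Rollout"):
        pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
        for container in pod_spec.get("containers", []):
            image = container.get("image", "")
            last_segment = image.rsplit("/", 1)[-1]
            # An image with no tag (or digest) implicitly resolves to "latest".
            tag = image.rsplit(":", 1)[1] if ":" in last_segment else "latest"
            if tag == "latest":
                problems.append(f"latest tag used: {image}")
    return problems


if __name__ == "__main__":
    deployment = {
        "kind": "Deployment",
        "spec": {"template": {"spec": {"containers": [{"image": "hoge:latest"}]}}},
    }
    print(check_manifest(deployment))
```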

  63. Issue during operations: EKS version upgrades take a long time. Countermeasure: switching to deployment with argocd shortened the time (previously we applied manifests with kubectl).

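  64. Issue during operations: EKS version upgrades take a long time. Countermeasure: switching to deployment with argocd shortened the time (previously we applied manifests with kubectl). Almost every K8s version upgrade requires Manifest changes. There are also many components other than Addons, so upgrading the staging and production EKS versions still takes

    approx. one and a half months.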
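  65. Issue during operations: Node-pressure Eviction occurred and all Pods running on a particular Node were forcibly terminated. Countermeasures: set upper limits under resources > limits;

    watch the kubelet_evictions metric in Grafana on a regular basis.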
  66. Ongoing issue in operations: a shortage of Kubernetes engineers

  67. How has three years of running ECS fargate gone?

  68. Conclusion: the best part is that operations and learning cost very little!

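  69. Compared with ECS on EC2: no need to create an AutoscalingGroup; no need to update the ECS Optimized AMI regularly; scaling configuration is simple (no Node scaling configuration needed);

    the learning cost is slightly lower; the running cost is higher.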
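  70. Compared with EKS on self managed nodes: EKS cluster version upgrades take a long time; the learning cost of EKS (k8s) is high (

    quite high… ); EKS (k8s) is highly extensible and can be used for a wide range of use cases.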
  71. Thank you for your attention.