
From Infrastructure Team to SRE / SRE in Mercari Developers Summit 2018

kazeburo
February 16, 2018


From Infrastructure Team to SRE

A New Approach to the Infrastructure Supporting Mercari

Developers Summit 2018/2/16


Transcript

  1. From Infrastructure Team to SRE
     A New Approach to the Infrastructure Supporting Mercari
     Masahiro Nagano @kazeburo
     Developers Summit 2018/2/16

  2. Me
     • Masahiro Nagano / 長野雅広
     • @kazeburo (twitter/github)
     • Mercari, Inc., Principal Engineer, Site Reliability Engineering (SRE) team
     • BASE, Inc, Technical Advisor
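  3. Me
     • ~ 2006: Joined a startup in Kyoto
       • A handful of engineers
       • Looked after the infrastructure while doing development. Also did DC work
       • A cycle of tuning the application and using the freed-up resources to add new features
     • 2006 ~: mixi
       • "Application operations team": operations that do not involve going to the DC
       • Large-scale image delivery and application tuning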
  4. Me
     • 2010 ~: livedoor (NHN Japan => LINE)
       • Improved infrastructure and performance across livedoor and LINE family services
       • MySQL tuning for livedoor Blog / Plack optimization
     • 2015/02 ~ : mercari
  5. Recent activities
     • Talks
       • AWS Dev Day Tokyo 2017
       • YAPC::Fukuoka 2017, YAPC::Hokkaido 2016
       • Scheduled to speak at YAPC::Okinawa 2018 and Manabiya Teratail Developer Days
     • Articles
       • WEB+DB PRESS Vol.88, Vol.92-97 (series), Vol.100
       • Nikkei SYSTEMS, July 2017 issue; ITPro
  6. AGENDA
     • Self-introduction
     • How I encountered SRE
     • About Mercari
     • What is SRE
     • Mercari SRE case studies and what comes next
  7. How I encountered SRE: why SRE?

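  8. Infrastructure engineer?
     • In my mixi days, the team was the "application operations team"
     • The infrastructure (data center) team was a separate team
     • Our (the team's) role was to draw out the capacity of the servers prepared by the data center team and to run the code written by application engineers in the best possible way
     • Service availability is the responsibility of the team that handles software, not the hardware team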
  9. Operations engineer?
     • "Web Operations", published in 2010
     • Essays on operations: continuous deployment, DevOps, automation, monitoring, and more
     • Many people regard operations as routine work

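  10. How I encountered SRE
     • 2012/7: Learned about it from a conversation with a friend on IRC
     • SRE is the team responsible for the operation and stability of Google's huge infrastructure and services
     • https://research.googleblog.com/2012/07/site-reliability-engineers-solving-most.html
       "Site Reliability Engineers: “solving the most interesting problems”" (this was around the time the article was published)
     • Added "Site Reliability" to my twitter bio and presentation slides to keep it in mind
       • https://www.slideshare.net/kazeburo/yapc2102mysql/2 (2012/9)
     • 2015/11: Proposed it as the team name at Mercari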
  11. About Mercari

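  12. Mercari
     • One of the largest flea market apps in Japan
     • List an item easily in 3 minutes
       1) Take a photo 2) Enter the item information 3) Press the listing button
     • Safe and secure payment and transactions
       • Escrow (the company acts as intermediary for the exchange of money)
       • Anonymous shipping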
  13. Expansion to the US and the UK: JP, UK, US

  14. KPI
     • Downloads: over 100 million (worldwide)
     • GMV (gross merchandise volume): over 10 billion yen per month
     • Listings: over 1 million items per day

  15. Mercari system overview
     [Diagram. Recoverable labels: Listing!, Purchase!, DB, Search, TL display, reflection in search, large volume of requests, request response, a few seconds to 30 seconds, from a few seconds, images, payment, AI]
     Handles a large volume of transactions quickly and in parallel
  16. Infrastructure
     • Multi-cloud configuration
     • JP is built mainly on Sakura Internet, US on AWS, and UK on GCP
     • In addition, JP and US combine GCP to build the foundation for "microservices"

  17. Infrastructure
     • DNS: Amazon Route53
     • CDN: Akamai, Fastly, ImageFlux
     • Storage: Amazon S3
     • Analysis: Google BigQuery / Monitoring: Mackerel, DataDog
     [Diagram: JP, UK, US]
  18. Microservices platform
     • Developed an API Gateway that wraps the existing API (the monolith API) and built it on GCP (GKE)
     • New features are developed outside the monolith API
     • The service is being decomposed into microservices step by step
     • Backend services called from the monolith API and the microservices also run on GKE
     [Diagram. Recoverable labels: Users, API Gateway, monolith API, search, offer, backend service, JP, US]
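A minimal sketch of the wrapping idea on this slide, assuming simple prefix-based routing. The path prefixes and backend names below are hypothetical; the slides do not show the actual gateway's routing rules. It only illustrates how requests for extracted features could reach microservices while everything else still falls through to the monolith API.

```python
from typing import Dict

MONOLITH = "monolith-api"  # hypothetical name for the existing (monolith) API

# Hypothetical routing table: path prefix -> backend service name.
ROUTES: Dict[str, str] = {
    "/search": "search-service",
    "/offer": "offer-service",
}


def pick_backend(path: str,
                 routes: Dict[str, str] = ROUTES,
                 default: str = MONOLITH) -> str:
    """Return the backend for a request path using longest-prefix matching.

    A prefix matches only on a path-segment boundary, so "/search" matches
    "/search" and "/search/items" but not "/searching".
    """
    best = ""
    for prefix in routes:
        matches = path == prefix or path.startswith(prefix + "/")
        if matches and len(prefix) > len(best):
            best = prefix
    return routes[best] if best else default
```

Under these assumptions, moving one more feature out of the monolith would amount to adding one entry to the routing table, which fits the step-by-step decomposition the slide describes.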
  19. What is SRE, revisited

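  20. What is SRE
     • Proposed by Ben Treynor, who led Google's operations team, as a methodology for system administration and service operation
     • Spreading among companies that operate large-scale IT infrastructure, mainly in the US
     • There is no precise definition, but it describes engineers, teams, and an organizational approach that "improve the availability, performance, and security of the infrastructure and the service as a whole through software engineering"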
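  21. Google SRE
     • Google's SREs are expected to have systems and operations skills in addition to software engineering
     • Software engineering effort is focused especially on "automation"
       • The number of SREs is not scaled in proportion to the size of the service (not realistically possible even at Google)
       • Elimination of "toil" (work that is done manually, could be automated, and has no value in being repeated)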
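  22. Google SRE
     • 50% of working time is spent on software engineering
       • It goes to automation (autonomy) and reliability improvements
       • If operations work takes more than the remaining 50%, the team is pressed to review its workload
     • SLAs and error budgets balance the interests of developers
       • An availability target (SLA) is set for each service together with the developer team. The target is not set too high
       • While within the error budget, developers release actively; when the budget is exceeded, they are expected to concentrate on development that restores reliability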
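The error budget arithmetic can be sketched as follows. This is an illustration under stated assumptions, not Google's or Mercari's actual tooling: the 30-day window, the function names, and the two policy strings are hypothetical. The budget is simply the share of the window that the availability target allows to be lost.

```python
def error_budget_minutes(target: float, window_days: int = 30) -> float:
    """Downtime (in minutes) that an availability target allows in a window.

    target is a fraction, e.g. 0.999 for "99.9% available".
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - target)


def release_policy(target: float, downtime_minutes: float,
                   window_days: int = 30) -> str:
    """Pick a stance from the slide: release actively while within budget,
    otherwise concentrate on restoring reliability."""
    budget = error_budget_minutes(target, window_days)
    if downtime_minutes < budget:
        return "release actively"
    return "focus on reliability"
```

For example, a 99.9% target over 30 days leaves roughly 43 minutes of budget, while 99.99% leaves only about 4 minutes, which is one reason the slide warns against setting the target too high.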
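  23. SRE in Japan
     • November 2015: SRE introduced on the Mercari engineering blog
     • Adoption of SRE is progressing, mainly at web companies such as Retty, Cybozu, CookPad, Mixi, and Hatena
     • SRE Tech Talk held
       • 1st: June 2016. 2nd: January 2017
       • Drew more than 100 participants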
  24. SRE in Japan
     • Books / magazines
       • O'Reilly, "SRE: Site Reliability Engineering" (Japanese edition)
       • Nikkei BP, "Nikkei SYSTEM 2017/7"

  25. Mercari SRE

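  26. Why SRE at Mercari
     • 2015/11: Renamed from infrastructure team to SRE
     • For customers to keep using Mercari for a long time, the reliability of being "always comfortable and safe to use" is essential. SRE includes this reliability
     • The core of the work is improving service performance and availability through software engineering, and automating deployment and similar tasks
     • Also intended as a forward-looking initiative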
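  27. Mercari SRE
     • 10 members as of 2018/2
     • Some engineers work on building the microservices platform and on Sys-ML (MLOps)
     • Most are mid-career hires with experience operating large-scale web services, but there are also new-graduate members
     • Individual members proactively find problems and solve them
     • Information is shared through discussion on Slack and GitHub and ticket management in Jira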
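  28. Creating tickets from Slack
     • Automates and simplifies Jira ticket creation
     • Create a ticket and share the issue whenever something comes to mind
     • Sometimes a team member who sees it on Slack resolves it right away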

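  29. Scope of work of Mercari SRE
     [Diagram spanning Operations, Software Eng., and platform building]
     • OnCall (incident response)
     • Handling requests
     • Scalability and availability improvements
     • Automation, DBA, middleware setup
     • Application design review
     • Building and operating the log collection and analysis platform
     • Server provisioning and deployment; development of the microservices and ML platforms
     • Security / fraudulent use detection
     The team is expected to build system operations as a "mechanism"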
  30. Recent case study

  31. CruiseControl: controlling the load of a Worker/Queue system

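  32. Problems with Worker/Queue systems
     • Job processing speed varies with many factors
       • Enqueue rate from batches
       • Number of Worker processes
       • What the job does
     • Processing that is too fast puts load on the system
     [Diagram: App, App, Queue, RDBMS, Worker ×4]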
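  33. Case study: the in-house CRM tool
     • Delivery speed is determined by the choice of delivery medium
       • Writes to the RDBMS (in-app PM)
       • Mail
       • Writes to the RDBMS (in-app notifications)
       • Push delivery
     • Processing speed is not constant and changes with each delivery
     • The number of Workers was adjusted by hand to shorten delivery time
     • Missed adjustments caused unexpected load and incidents
     [Diagram: CRM, Queue, RDBMS, Worker ×4, Mail, Push; labels: slow, fast]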
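  34. CruiseControl
     • Built "a service that controls speed"
     • A Worker queries CruiseControl before it starts processing
     • If processing is too fast, a wait is inserted
     • With enough Workers, the processing speed becomes constant
     • Achieves both fast delivery from the CRM tool and stable operation of the service
     [Diagram: CRM, App, Queue, Worker ×3, CruiseControl, RDBMS, Mail, Push]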
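The worker-side pattern can be sketched like this: ask a gate before every job, and let the gate decide how long to wait. This is a hedged illustration, not Mercari's worker code. `run_worker`, `LocalGate`, and `http_gate` are hypothetical names; `LocalGate` is an in-process stand-in so the sketch runs without a server, and `http_gate` assumes the gate is an HTTP endpoint that simply delays its response, with 503 treated as a rejection.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def http_gate(url: str, key: str, timeout: float = 60.0) -> None:
    """Ask an HTTP rate-limiting endpoint for permission; return once it answers."""
    req = urllib.request.Request(url, headers={"X-Limit-Req": key})
    try:
        with urllib.request.urlopen(req, timeout=timeout):
            pass
    except urllib.error.HTTPError as err:
        # Assumption: 503 means the limiter rejected the request; any other
        # status means the limiter has already let the request through.
        if err.code == 503:
            raise


class LocalGate:
    """In-process stand-in for the gate: spaces calls 1/rate seconds apart."""

    def __init__(self, rate_per_sec: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.interval = 1.0 / rate_per_sec
        self.clock = clock
        self.sleep = sleep
        self.next_slot: Optional[float] = None

    def __call__(self) -> None:
        now = self.clock()
        if self.next_slot is None or self.next_slot <= now:
            # Slow caller: no wait, the next slot opens one interval from now.
            self.next_slot = now + self.interval
            return
        # Fast caller: wait until the reserved slot, then reserve the next one.
        self.sleep(self.next_slot - now)
        self.next_slot += self.interval


def run_worker(jobs: Iterable[str], gate: Callable[[], None],
               handle: Callable[[str], None]) -> int:
    """Process jobs in order, asking the gate before each one."""
    done = 0
    for job in jobs:
        gate()  # blocks when jobs are being processed too fast
        handle(job)
        done += 1
    return done
```

With a real endpoint, the gate could be something like `lambda: http_gate("http://cruisecontrol:8080/r100", "push-msg")`, which mirrors the curl command shown in the slides. Keeping the wait inside the gate means the Worker code does not need to know the target rate.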
  35. CruiseControl = NGINX
     • Uses ngx_http_limit_req_module
     • Controls the rate by path and header

       # One shared-memory zone per target rate, keyed by the X-Limit-Req request header
       limit_req_zone $http_x_limit_req zone=r10:50m rate=10r/s;
       limit_req_zone $http_x_limit_req zone=r50:50m rate=50r/s;
       limit_req_zone $http_x_limit_req zone=r100:50m rate=100r/s;

       server {
           listen 8080;
           root /path/to/root;
           # The very large burst makes excess requests wait in line instead of being rejected
           location /r10  { limit_req zone=r10 burst=4294967296; }
           location /r50  { limit_req zone=r50 burst=4294967296; }
           location /r100 { limit_req zone=r100 burst=4294967296; }
       }

       % curl -H 'X-Limit-Req: push-msg' cruisecontrol:8080/r100
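To see why delaying requests yields a constant processing speed, here is a simplified model of the pacing, assuming each key is served at most once per 1/rate seconds and that excess requests wait in line. It is a sketch of the idea, not a reimplementation of nginx's algorithm, and `release_times` is a hypothetical helper name.

```python
from typing import Iterable, List


def release_times(arrivals: Iterable[float], rate_per_sec: float) -> List[float]:
    """Return when each request is let through, given its arrival time.

    Requests are served in arrival order, at least 1/rate seconds apart;
    a request that arrives after the previous slot has passed is not delayed.
    """
    interval = 1.0 / rate_per_sec
    released: List[float] = []
    next_free = float("-inf")  # earliest time the next request may pass
    for arrival in sorted(arrivals):
        start = max(arrival, next_free)
        released.append(start)
        next_free = start + interval
    return released
```

In this model, four Workers asking at the same instant against a 10 r/s limit are let through about 0.1 seconds apart, while a Worker that asks after a long pause passes immediately.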
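  36. What CruiseControl achieves
     • Turns the craftsman-like practice of changing the number of Workers while watching system load and queue length into a system
     • Also functions as an SLA for queue processing speed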
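Because the gate caps throughput, the time to drain a queue becomes predictable, which is one way to read the SLA remark above. The sketch below is a back-of-the-envelope estimate under simple assumptions: every job takes the same time, Workers are otherwise idle, and the function names are hypothetical.

```python
def effective_rate(gate_rate: float, workers: int, job_seconds: float) -> float:
    """Jobs per second actually achieved: the gate's limit, unless there are
    too few Workers to reach it."""
    worker_capacity = workers / job_seconds
    return min(gate_rate, worker_capacity)


def drain_seconds(queued_jobs: int, gate_rate: float,
                  workers: int, job_seconds: float) -> float:
    """Estimated time to process everything currently in the queue."""
    return queued_jobs / effective_rate(gate_rate, workers, job_seconds)
```

With a 100 r/s limit and 50 Workers that each need 0.1 seconds per job, the gate is the bottleneck, so 6000 queued jobs would take about 60 seconds; with only 5 Workers, the Workers themselves limit throughput to about 50 jobs per second.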

  37. Summary

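  38. Summary
     • From infrastructure team to SRE
     • SRE improves service performance and availability through knowledge of systems and software engineering
       • Turn system operations into a mechanism
       • Software is also what supports business growth, for example through the microservices and ML platforms
     • Provide customers with the reliability of being "always comfortable and safe to use"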
  39. SRE More!!! https://twitter.com/kazeburo/status/890131903529054210

  40. That's all => www.mercari.com/jp/jobs/ speakerdeck.com/kazeburo
