Slide 1

Slide 1 text

Monster Strike Monitoring Tools, Then and Now: Part 1
SRE Group, Development Division
小池知裕
mixi, Inc.

Slide 2

Slide 2 text

Self-introduction

Slide 3

Slide 3 text

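Self-introduction
‣ Name
  ‣ 小池知裕
‣ Career
  ‣ Joined mixi in 2008
  ‣ Infrastructure and system operations for services including the SNS "mixi"
  ‣ Later, operations for game apps, starting with Monster Strike
  ‣ Broad involvement in operating the various Monster Strike sites, the service operation tools used in-house, the physical infrastructure environment, and more
‣ Member of the SRE Group, Development Division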

Slide 4

Slide 4 text

Monster Strike

Slide 5

Slide 5 text

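Monster Strike
An action RPG that anyone can pick up easily and that makes use of what smartphones do well: you pull back and flick your own monsters so that they hit and defeat enemy monsters. The game is turn-based, and its distinguishing feature is cooperative play (multiplayer), in which up to four friends who are together can play at the same time.
Since its launch in October 2013, the game has passed 49 million cumulative users worldwide (as of December 2018).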

Slide 6

Slide 6 text

Agenda
‣ Monster Strike's server architecture
‣ How the monitoring system is structured
‣ Liveness and metrics monitoring
‣ Alert response
‣ Summary

Slide 7

Slide 7 text

Server architecture

Slide 8

Slide 8 text

Server architecture (system)
Components in the diagram: LoadBalancer, Unicorn, memcached, MariaDB, Redis, Fluentd, resque worker

Slide 9

Slide 9 text

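Server architecture (infrastructure)
‣ Number of servers in operation
  ‣ About 1,000
‣ Multi-cloud setup
  ‣ On-premises servers and cloud used together
  ‣ Two in-house data centers
  ‣ Several public clouds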

Slide 10

Slide 10 text

Server architecture
DataCenter 1
• DB
• memcached
DataCenter 2
• DB
• memcached
Cloud1, Cloud2, Cloud3, Cloud4
• application

Slide 11

Slide 11 text

Server architecture
‣ Application
  ‣ 13,000 to 26,000 cores
‣ DB
  ‣ 150 physical machines per set
  ‣ One set placed in each data center

Slide 12

Slide 12 text

Monitoring system architecture

Slide 13

Slide 13 text

Monitoring system
‣ Liveness monitoring
  ‣ Nagios
‣ Metrics monitoring
  ‣ CloudForecast
  ‣ Kibana + Elasticsearch
  ‣ Grafana + InfluxDB

Slide 14

Slide 14 text

Monitoring system
‣ Why we chose these
  ‣ They work the same way on-premises and in the cloud
  ‣ They are assets carried over from the SNS (mixi.jp)

Slide 15

Slide 15 text

Liveness monitoring

Slide 16

Slide 16 text

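Liveness monitoring
‣ Problems we ran into
  ‣ There are simply a lot of servers to monitor
  ‣ A monitoring server cannot monitor itself
  ‣ The "did the server die, or did the connection to the cloud drop?" problem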

Slide 17

Slide 17 text

Liveness monitoring
‣ Build a Nagios instance at each site
‣ Have the instances monitor one another
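To illustrate why per-site Nagios instances help with the "dead server or dropped link" question, here is a minimal sketch. It is not the tooling from the talk: the function name `classify`, the site labels, and the decision rule are all assumptions made for illustration.

```python
from typing import Dict


def classify(reports: Dict[str, bool]) -> str:
    """Combine reachability reports from several monitoring sites.

    `reports` maps a (hypothetical) site name such as "dc1" to whether
    the Nagios instance at that site can currently reach the target.
    """
    if not reports:
        return "unknown"
    if all(reports.values()):
        return "up"
    if not any(reports.values()):
        # Nobody can reach it, including the monitor closest to it.
        return "down"
    # Some sites reach it and some do not: the target is alive, so the
    # network path between sites is the more likely culprit.
    return "suspect network path"
```

For example, `classify({"dc1": False, "dc2": False, "cloud1": True})` points at the link between the data centers and the cloud rather than at the server itself.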

Slide 18

Slide 18 text

Liveness monitoring
Sites in the diagram: DC 1, DC 2, Cloud 1

Slide 19

Slide 19 text

Liveness monitoring
‣ Problems we ran into
  ‣ Nagios configuration files (cfg) are cumbersome
  ‣ Updating several monitoring servers is tedious

Slide 20

Slide 20 text

Liveness monitoring
‣ Nagios
  ‣ Generate the monitoring cfg files from YAML
  ‣ Built an in-house tool that updates every site in one go
    ‣ Update the cfg files
    ‣ syntax check && nagios restart
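The generate-then-reload flow above can be sketched roughly as follows. This is not the in-house tool: the YAML schema, the host names, the `generic-host` template, the cfg path, and the use of `systemctl` for the restart are assumptions. The parsed YAML is represented as a plain list of dicts (what a loader such as PyYAML's `yaml.safe_load` would return) so the sketch has no third-party dependency.

```python
from typing import Dict, List

# Stand-in for parsed YAML; every value here is hypothetical.
HOSTS: List[Dict[str, str]] = [
    {"use": "generic-host", "host_name": "app01", "address": "10.0.0.11"},
    {"use": "generic-host", "host_name": "db01", "address": "10.0.1.21"},
]


def render_host(host: Dict[str, str]) -> str:
    """Render one Nagios `define host` object from a dict of directives."""
    lines = ["define host {"]
    for key, value in host.items():
        lines.append(f"    {key} {value}")
    lines.append("}")
    return "\n".join(lines)


def render_cfg(hosts: List[Dict[str, str]]) -> str:
    """Render all host objects into the text of a single cfg file."""
    return "\n\n".join(render_host(h) for h in hosts) + "\n"


def reload_command(cfg_path: str = "/etc/nagios/nagios.cfg") -> str:
    """Build the 'syntax check && restart' shell command.

    `nagios -v` verifies the configuration; the restart only runs if
    verification succeeds. The restart mechanism is an assumption.
    """
    return f"nagios -v {cfg_path} && systemctl restart nagios"
```

Running the returned command on each site's monitoring server, after copying the rendered file there, gives the "update all sites at once" behaviour the slide describes.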

Slide 21

Slide 21 text

Liveness monitoring

Slide 22

Slide 22 text

Liveness monitoring
Sites in the diagram: DC 1, DC 2, Cloud 1

Slide 23

Slide 23 text

Liveness monitoring
‣ Problems we ran into
  ‣ We want to customize what gets monitored

Slide 24

Slide 24 text

Liveness monitoring
‣ Use the SNMP extend feature
  ‣ An extension mechanism in Net-SNMP
  ‣ Returns the output of an arbitrary command over SNMP
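As a rough illustration of the extend mechanism, the script below prints the host's uptime in seconds so that snmpd can return it. The script name, its install path, and the extend label `uptime_seconds` are assumptions; only the `extend NAME PROG` directive itself comes from Net-SNMP.

```python
# Hypothetical snmpd.conf entry that would expose this script:
#   extend uptime_seconds /usr/local/bin/uptime_seconds.py
# snmpd then runs the script and serves its output under
# NET-SNMP-EXTEND-MIB.


def uptime_seconds(proc_uptime_text: str) -> float:
    """Parse the first field of /proc/uptime (seconds since boot)."""
    return float(proc_uptime_text.split()[0])


if __name__ == "__main__":
    try:
        with open("/proc/uptime") as f:
            print(uptime_seconds(f.read()))
    except OSError:
        # /proc/uptime is Linux-specific; report that the value is unavailable.
        print("unknown")
```

Keeping the parsing in a small function makes the script testable without a running snmpd.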

Slide 25

Slide 25 text

Liveness monitoring
‣ We also wrote several check plugins in-house
‣ Examples:
  ‣ Check a server's uptime
  ‣ Check whether a filesystem has gone read-only
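A read-only filesystem check of the kind listed above might look like the following. This is a sketch, not the in-house plugin: the function names and the set of watched mount points are assumptions. It follows the usual Nagios plugin convention of exit code 0 for OK, 2 for CRITICAL and 3 for UNKNOWN.

```python
from typing import Iterable, List, Tuple

OK, CRITICAL, UNKNOWN = 0, 2, 3  # Nagios plugin exit codes


def readonly_mounts(mounts_text: str, watched: Iterable[str]) -> List[str]:
    """Return watched mount points whose options in /proc/mounts include 'ro'."""
    watched_set = set(watched)
    found = []
    for line in mounts_text.splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue  # not a well-formed mounts line
        mount_point, options = parts[1], parts[3].split(",")
        if mount_point in watched_set and "ro" in options:
            found.append(mount_point)
    return found


def check(mounts_text: str, watched: Iterable[str] = ("/",)) -> Tuple[int, str]:
    """Return (exit_code, status_line) in Nagios plugin style."""
    bad = readonly_mounts(mounts_text, watched)
    if bad:
        return CRITICAL, "CRITICAL - read-only: " + ", ".join(bad)
    return OK, "OK - no watched filesystem is mounted read-only"


def main(path: str = "/proc/mounts") -> int:
    """Read the mounts table, print the status line, return the exit code."""
    try:
        with open(path) as f:
            code, message = check(f.read())
    except OSError:
        code, message = UNKNOWN, "UNKNOWN - cannot read " + path
    print(message)
    return code

# When installed as a plugin script, the entry point would be:
#   import sys; sys.exit(main())
```

Nagios reads the exit code to decide the service state and shows the printed line as the status text.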

Slide 26

Slide 26 text

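Liveness monitoring
‣ Problems we ran into
  ‣ Managing multiple monitoring servers is cumbersome
  ‣ Monitoring requires a variety of software to be installed
    ‣ libmysqlclient, snmp…, etc.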

Slide 27

Slide 27 text

Liveness monitoring
‣ So
  ‣ We overhauled the monitoring system
  ‣ Details are covered in the second part

Slide 28

Slide 28 text

Metrics monitoring

Slide 29

Slide 29 text

Liveness monitoring
First

Slide 30

Slide 30 text

Metrics monitoring
‣ CloudForecast
  ‣ https://github.com/kazeburo/cloudforecast
  ‣ We wrote our own monitor plugins

Slide 31

Slide 31 text

Liveness monitoring
Next

Slide 32

Slide 32 text

Metrics monitoring
‣ Kibana + Elasticsearch
  ‣ Stores logs and similar data from the application servers
  ‣ Used for aggregation, search, and so on
  ‣ Collected with 1/100 sampling
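The 1/100 sampling can be pictured with the sketch below. It only illustrates the idea in Python and is not the pipeline from the talk; the function name `sample` and the choice of independent random sampling per record are assumptions.

```python
import random
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def sample(records: Iterable[T], rate: float = 0.01,
           rng: Optional[random.Random] = None) -> Iterator[T]:
    """Yield each record independently with probability `rate` (1/100 by default)."""
    rng = rng or random.Random()
    for record in records:
        if rng.random() < rate:
            yield record
```

Counts computed on the sampled stream need to be multiplied by 100 to estimate the real totals, and rare events may be missed entirely, which is the usual trade-off for the reduced storage and indexing load.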

Slide 33

Slide 33 text

Metrics monitoring
Components in the diagram: Application, Fluentd, Elasticsearch + Kibana
Label: sampling and data processing

Slide 34

Slide 34 text

Liveness monitoring
Next

Slide 35

Slide 35 text

Metrics monitoring
‣ Grafana + InfluxDB
  ‣ Aggregate various data and store it in InfluxDB
  ‣ Visualize it in Grafana
    ‣ Build dashboards
    ‣ Set up alerts as well
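A minimal sketch of the "aggregate, then store in InfluxDB" step is shown below. It only formats a point in InfluxDB line protocol and does not talk to a server. The measurement name, tag names, and the `aggregate` helper are assumptions, and the sketch assumes the measurement name contains no characters that need escaping.

```python
from typing import Dict, List, Union

FieldValue = Union[bool, int, float]


def aggregate(samples: List[float]) -> Dict[str, FieldValue]:
    """Reduce raw samples to a count and a mean."""
    return {"count": len(samples), "mean": sum(samples) / len(samples)}


def _escape(text: str) -> str:
    """Backslash-escape commas, equals signs and spaces in keys and tag values."""
    for ch in (",", "=", " "):
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value: FieldValue) -> str:
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integer fields carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    raise TypeError(f"unsupported field type: {type(value).__name__}")


def to_line(measurement: str, tags: Dict[str, str],
            fields: Dict[str, FieldValue], timestamp_ns: int) -> str:
    """Format one point as: measurement,tag=value field=value timestamp."""
    tag_part = "".join(f",{_escape(k)}={_escape(v)}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{_escape(k)}={_format_field(v)}" for k, v in fields.items())
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"
```

Tagging each point with the host and site keeps on-premises and cloud servers in one measurement, so a single Grafana dashboard can filter or group across both.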

Slide 36

Slide 36 text

Metrics monitoring
‣ Why we chose these
  ‣ They work the same way on-premises and in the cloud
  ‣ A wide variety of metrics can be added

Slide 37

Slide 37 text

Monitoring alerts

Slide 38

Slide 38 text

Monitoring alerts
‣ Alerts raised when monitoring detects an anomaly
  ‣ Monster Strike uses PagerDuty
‣ On-call duty rotation
  ‣ Server developers and SREs take turns

Slide 39

Slide 39 text

Monitoring alerts
‣ PagerDuty
  ‣ https://www.pagerduty.com/
  ‣ Integrates with various monitoring systems to send notifications
  ‣ Supports flexible escalation rules

Slide 40

Slide 40 text

Monitoring alerts
‣ On-call duty
  ‣ Standing by in case of system failures
  ‣ Two people per shift
  ‣ When an alert fires, response starts within 15 minutes

Slide 41

Slide 41 text

Monitoring alerts
Labels in the diagram: an incident occurs; today's on-call responders; developers and SREs; managers and business owners
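An escalation rule of the kind PagerDuty supports can be modelled as a list of (delay, recipient) steps. The sketch below is illustrative only: the step order is read off the diagram labels, and the 15- and 30-minute thresholds are hypothetical values, not the team's actual PagerDuty configuration.

```python
from typing import List, Tuple

# (minutes without acknowledgement, who gets notified at that point)
Policy = List[Tuple[int, str]]

# Hypothetical policy; the delays are assumptions for illustration.
HYPOTHETICAL_POLICY: Policy = [
    (0, "today's on-call pair"),
    (15, "developers and SREs"),
    (30, "managers and business owners"),
]


def notified(policy: Policy, minutes_unacknowledged: int) -> List[str]:
    """Return everyone who has been notified after the given unacknowledged time."""
    return [who for after, who in policy if minutes_unacknowledged >= after]
```

With this policy, `notified(HYPOTHETICAL_POLICY, 20)` returns the on-call pair plus the wider developer and SRE group, because the alert has gone 20 minutes without acknowledgement.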

Slide 42

Slide 42 text

Summary

Slide 43

Slide 43 text

Summary
‣ Monitoring of Monster Strike's infrastructure
  ‣ Liveness monitoring
  ‣ Metrics monitoring
  ‣ Alert response

Slide 44

Slide 44 text

Thank you!