Monster Strike improvement case studies by SRE / improvement-example-of-monster-strike-by-sre

hbstudy#76
76th session: SRE Compendium: XFLAG Studio edition
https://hbstudy.connpass.com/event/62338/

Speaker: 浜田 恭平 (Kyohei Hamada)

Transcript

  1. Monster Strike improvement case studies by SRE
    2017/08/24
    hbstudy #76: SRE Compendium: XFLAG Studio edition
    XFLAG Business Division, Game Development Office, SRE Group
    浜田 恭平 (Kyohei Hamada) @haman29
    XFLAG STUDIO

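  2. About me
    • 浜田 恭平 (Kyohei Hamada) @haman29
    • https://twitter.com/haman29
    • At my previous job I was a server-side engineer, with front-end and infrastructure work on the side.
    • Mainly worked on web services (a flea-market service, a coupon service, and so on).
    • Joined in 2016/07, when the SRE group was founded, and became a member of SRE.
    • Mainly responsible for operating and improving the Japanese version of Monster Strike.
    • Hobby: bouldering
    • https://codeiq.jp/magazine/2016/03/38660/ (third person featured; from my time at the previous job)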

  3. Agenda
    Case 1: Load mitigation using memcached
    Case 2: Speeding up Resque worker startup
    Case 3: Automating DB server builds
    Summary

  4. Case 1. Load mitigation using memcached

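  5. 3rd anniversary event
    • 3rd Anniversary "Bakuzetsu Kansha" (explosive thanks) gacha
      • A gacha that each user can pull only once
      • Five 6-star characters come out, and the user picks one of them
      • "Niji-dama" (rainbow orb), the item needed to pull the gacha, is handed out to every user
      • New users receive it too
    • Yume-dama Quest
      • Users replay event quests, collect Yume-dama (dream orbs), and spend them to spin the gacha
      • They repeat this until they have collected 99 of the character they are after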

  6. 3rd Anniversary Bakuzetsu Kansha gacha
    • Around 7:30, the DB for user items became congested
      • select/insert/update increased
      • os_waits spiked
    • Response
      • Tuned innodb_spin_wait_delay
      • Added slaves to spread out selects

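  7. Load mitigation: options considered
    • We wanted to reduce selects ahead of the Yume-dama Quest one week later
    • The SQL was already optimized (there are only simple queries)
    • Use monsterstrike/second_level_cache
      • A fork of hooopo/second_level_cache with extensions
      • When a query is issued through ActiveRecord, it is cached in memcached transparently
      • Already has a track record in Monster Strike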

  8. SecondLevelCache `.create`
    Model:
      class UserItem < ActiveRecord::Base
        acts_as_cached(version: 1, expires_in: 1.day)
      end
    app:
      UserItem.create(user_id: 101, item_id: 1) # => id: 10001
    not cached (MySQL):
      insert into user_items(user_id, item_id, cnt) values (101, 1, 2);
    memcached:
      {
        "user_item/10001" => "[10001, 101, 1, …]",
      }

  9. SecondLevelCache `.find`
    Model:
      class UserItem < ActiveRecord::Base
        acts_as_cached(version: 1, expires_in: 1.day)
      end
    app:
      UserItem.find(10001)
    cached (memcached):
      {
        "user_item/10001" => "[10001, 101, 1, …]"
      }
    not cached (MySQL):
      select * from user_items where id = 10001;

  10. SecondLevelCache `.fetch_by_uniq_keys`
    Model:
      class UserItem < ActiveRecord::Base
        acts_as_cached(version: 1, expires_in: 1.day)
      end
    app:
      UserItem.fetch_by_uniq_keys(user_id: 101, item_id: 1)
    cached (memcached):
      {
        "user_item/fbu/user_id_101_item_id_1" => 10001,   # basically never expires
        "user_item/10001" => "[10001, 101, 1, …]"         # expires frequently
      }
    not cached (MySQL):
      select * from user_items where user_id = 101 and item_id = 1;
    Note: by paying attention to the different lifecycles of these cache entries, memcached can be used efficiently and queries to MySQL can be reduced as well.
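The two-entry layout on this slide can be sketched in plain Ruby. This is a minimal illustration, not the gem's implementation: `TinySecondLevelCache` is a hypothetical class, a Hash stands in for memcached, and another Hash stands in for the `user_items` table. It only shows why the unique-key entry can stay put while the record entry is dropped on every update.

```ruby
# Hypothetical stand-in for the caching behavior described on the slide.
# @cache plays the role of memcached, @rows the role of MySQL.
class TinySecondLevelCache
  attr_reader :db_queries

  def initialize(rows)
    @rows = rows        # id => row hash
    @cache = {}         # cache key => value
    @db_queries = 0     # number of simulated MySQL queries
  end

  # Read-through lookup by primary key.
  def find(id)
    key = "user_item/#{id}"
    return @cache[key] if @cache.key?(key)
    @db_queries += 1
    @cache[key] = @rows.fetch(id)
  end

  # The unique-key entry stores only the id, so it survives record updates.
  def fetch_by_uniq_keys(user_id:, item_id:)
    key = "user_item/fbu/user_id_#{user_id}_item_id_#{item_id}"
    id = @cache[key]
    return find(id) if id
    @db_queries += 1
    row = @rows.values.find { |r| r[:user_id] == user_id && r[:item_id] == item_id }
    return nil unless row
    @cache[key] = row[:id]
    @cache["user_item/#{row[:id]}"] = row
  end

  # An update expires only the record entry, not the unique-key entry.
  def update(id, attrs)
    @rows[id] = @rows.fetch(id).merge(attrs)
    @cache.delete("user_item/#{id}")
  end
end

store = TinySecondLevelCache.new({ 10001 => { id: 10001, user_id: 101, item_id: 1, cnt: 2 } })
store.fetch_by_uniq_keys(user_id: 101, item_id: 1)  # miss: one simulated query
store.fetch_by_uniq_keys(user_id: 101, item_id: 1)  # hit: no query
store.update(10001, cnt: 3)
store.fetch_by_uniq_keys(user_id: 101, item_id: 1)  # id still cached, record re-read by primary key
```

After an update, the lookup falls back to a primary-key read rather than repeating the two-column query, which is the cheaper path.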

  11. SecondLevelCache `.fetch_by_index` (extension)
    Model:
      class UserItem < ActiveRecord::Base
        acts_as_cached(version: 1, expires_in: 1.day)
        acts_as_cached_by_index(:user_id)   # only indexed columns can be specified
      end
    app:
      UserItem.fetch_by_index(user_id: 101)
    cached (memcached):
      {
        "user_item/fbi/user_id/101" => [10001, 10002, 10003],   # basically never expires
        "user_item/10001" => "[10001, 101, 1, …]",              # expires frequently
        "user_item/10002" => "[10002, 101, 2, …]",
        "user_item/10003" => "[10003, 101, 3, …]"
      }
    not cached (MySQL):
      select id from user_items where user_id = 101;
      select * from user_items where id in (10001, 10002, 10003);
    Note: only the records that are not already cached are fetched from MySQL.
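The "fetch only the uncached difference" behavior can be sketched as follows. Everything here is an assumption made for illustration: `TinyIndexCache` is a hypothetical class, Hashes stand in for memcached and MySQL, and the logged strings are abbreviated placeholders rather than real SQL.

```ruby
# Hypothetical sketch: an id-list entry per indexed value plus one entry per record.
class TinyIndexCache
  attr_reader :queries

  def initialize(rows)
    @rows = rows      # simulated table: id => row hash
    @cache = {}       # simulated memcached
    @queries = []     # log of simulated MySQL queries
  end

  def fetch_by_index(user_id:)
    index_key = "user_item/fbi/user_id/#{user_id}"
    ids = @cache[index_key] ||= begin
      @queries << "select id ... where user_id = #{user_id}"
      @rows.values.select { |r| r[:user_id] == user_id }.map { |r| r[:id] }
    end

    # Ask the database only for records whose entries are missing.
    missing = ids.reject { |id| @cache.key?("user_item/#{id}") }
    unless missing.empty?
      @queries << "select * ... where id in (#{missing.join(', ')})"
      missing.each { |id| @cache["user_item/#{id}"] = @rows.fetch(id) }
    end

    ids.map { |id| @cache["user_item/#{id}"] }
  end

  # Simulates a record entry expiring (for example after an update).
  def expire_record(id)
    @cache.delete("user_item/#{id}")
  end
end

rows = {
  10001 => { id: 10001, user_id: 101, item_id: 1 },
  10002 => { id: 10002, user_id: 101, item_id: 2 },
  10003 => { id: 10003, user_id: 101, item_id: 3 }
}
cache = TinyIndexCache.new(rows)
cache.fetch_by_index(user_id: 101)   # cold: id list query plus one record query
cache.expire_record(10002)
cache.fetch_by_index(user_id: 101)   # warm: only record 10002 is re-read
puts cache.queries
```

The id list and the record bodies are cached separately, so one expired record costs a single narrow query instead of a full reload.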

  12. After introducing SecondLevelCache
    Selects did not drop as much as expected…

  13. Query investigation
    pt-query-digest ( Percona Toolkit )
    # Capture traffic with tcpdump

    # Narrow the results to select only

    $ pt-query-digest --type=tcpdump --limit=100 --filter \
      '$event->{arg} =~ m/^(select)/i' dumpfile > result_select

    # Profile
    # Rank Query ID           Response time Calls R/Call V/M   Item
    # ==== ================== ============= ===== ====== ===== ===============
    #    1 0x365FBDCB443D99A3  5.1165 82.3%  2523 0.0020  0.20 SELECT user_items
    #    2 0x5B79B47AB9093007  1.0197 16.4%   178 0.0057  0.79 SELECT user_items
    #    3 0xA6FF35DF18E85C6C  0.0655  1.1%   101 0.0006  0.00 SELECT user_items
    #    4 0x10259F2E34E9D7F1  0.0136  0.2%    32 0.0004  0.00 SELECT user_items
    #    5 0x28DA30E044AAF5E8  0.0022  0.0%     8 0.0003  0.00 SELECT user_items
    #    6 0x022C53131F50003E  0.0016  0.0%     9 0.0002  0.00 SELECT user_items
    # Query 1: 910.54 QPS, 1.85x concurrency, ID 0x365FBDCB443D99A3 at byte 5687530
    # Scores: V/M = 0.20
    # Time range: 2016-10-13 23:06:37.637168 to 23:06:40.408062
    # Attribute    pct   total     min     max     avg     95%  stddev  median
    # ============ === ======= ======= ======= ======= ======= ======= =======
    # Count         88    2523
    # Exec time     82      5s    96us   973ms     2ms     8ms    20ms   185us
    # Rows affecte 100     154       0       1    0.06    0.99    0.24       0
    # Query size    84 302.79k     120     124  122.89  118.34    0.00  118.34
    # Warning coun   0       0       0       0       0       0       0       0
    # String:
    # Hosts        10.53.6.53 (30/1%), 192.168.117.176 (20/0%)... 398 more
    # Query_time distribution
    #   1us
    #  10us  #
    # 100us  ################################################################
    #   1ms  #######
    #  10ms  ##
    # 100ms  #
    #    1s
    #  10s+
    # Tables
    #    SHOW TABLE STATUS LIKE 'user_items'\G
    #    SHOW CREATE TABLE `user_items`\G
    # EXPLAIN /*!50100 PARTITIONS*/
    SELECT `user_items`.* FROM `user_items` WHERE `user_items`.`user_id` = 12345 AND `user_items`.`item_id` = 101 LIMIT 1\G

    `.fetch_by_uniq_keys` looks applicable to this query
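The "910.54 QPS, 1.85x concurrency" header for Query 1 appears to follow from the other numbers in the report: calls divided by the capture window, and total execution time divided by the same window. The sketch below recomputes both under that assumption; `digest_rates` is a hypothetical helper written for this note, not part of Percona Toolkit.

```ruby
require "time"

# Recompute rate figures from a capture window, a call count, and total
# execution time in seconds. Returns [queries per second, concurrency].
def digest_rates(first_seen, last_seen, calls:, exec_time:)
  window = Time.parse(last_seen) - Time.parse(first_seen)  # seconds as Float
  [calls / window, exec_time / window]
end

qps, concurrency = digest_rates(
  "2016-10-13 23:06:37.637168",
  "2016-10-13 23:06:40.408062",
  calls: 2523,
  exec_time: 5.1165
)
puts format("%.2f QPS, %.2fx concurrency", qps, concurrency)
```

A concurrency near 2 over a window of under three seconds means this single query shape kept roughly two connections busy at all times, which is why it was the first candidate for caching.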

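  14. Result
    Selects were reduced by 70%. The high load during the event was handled without problems.
    [Graph] Labels: 3rd Anniversary Bakuzetsu Kansha gacha, Yume-dama Quest
    The load-mitigation changes were deployed in stages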

  15. Case 2. Speeding up Resque worker startup

  16. Resque architecture
    Work that can tolerate delay is processed in the background
    [Diagram] LB, app servers, enqueue, Redis 1 and Redis 2 (ports 1 to 4 each), dequeue, Batch 1 and Batch 2 (many workers each)
    [Diagram labels] (Redis 1 port 1), (Redis 2 port 4)

  17. resque:restart
    • A task (capistrano) that restarts all Resque worker processes
    • As a rule, it has to be run every time a change is rolled out
    • 3,500 workers in total
      • 6 Redis servers * 4 ports
      • 50 batch servers * 1 to 4 workers / port
    • Issues
      • Rolling out to all servers takes about 20 minutes
      • The number of workers and the queue names are not parameterized

  18. Investigation: how resque:restart behaves
    [Diagram, frame 1] Batch 1 and Batch 2 with their workers; labels (Redis 1 port 1), (Redis 2 port 4)

  19. Investigation: how resque:restart behaves
    [Diagram, frame 2] Same layout; a STOP marker appears

  20. Investigation: how resque:restart behaves
    [Diagram, frame 3] STOP marker; a "parent" worker process appears

  21. Investigation: how resque:restart behaves
    [Diagram, frame 4] STOP marker; the parent forks new workers

  22. Investigation: how resque:restart behaves
    [Diagram, frame 5] The first parent has forked its workers; a second STOP marker appears

  23. Investigation: how resque:restart behaves
    [Diagram, frame 6] Two STOP markers, two parents, each forking workers
    The remaining restarts are also executed serially

  24. Investigation: how resque:restart behaves
    [Diagram] Redis 1 port 1 serving queue 1 and queue 2, each with its own set of workers
    Pattern where multiple queues share the same Redis port
    queue 2 is executed serially after queue 1

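  25. Implementation: resque:restart
    • Run with a parallelism of 4 using grosser/parallel
      • Also run in parallel when multiple queues share a Redis port
    • Parameterize the number of workers and the queue names
      • Reduces the cost of making changes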
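The change from serial to four-way parallel restarts can be sketched with the standard library alone. This is an assumption-laden illustration, not the actual capistrano task: `restart_all` and the target strings are hypothetical, and the block stands in for whatever stops and re-forks workers on one host. With the grosser/parallel gem the same loop would be roughly `Parallel.each(targets, in_threads: 4) { |t| ... }`.

```ruby
# Run the given block for every target, with at most `concurrency`
# restarts in flight at once. Returns [target, result] pairs.
def restart_all(targets, concurrency: 4, &restart)
  jobs = Queue.new
  targets.each { |t| jobs << t }
  jobs.close                         # pop returns nil once the queue is drained

  results = Queue.new
  workers = Array.new(concurrency) do
    Thread.new do
      while (target = jobs.pop)
        results << [target, restart.call(target)]
      end
    end
  end
  workers.each(&:join)

  Array.new(results.size) { results.pop }
end

# Hypothetical targets: one per (batch host, queue) pair, with the worker
# count and queue name passed in as parameters instead of being hard-coded.
targets = %w[batch1:queue1 batch1:queue2 batch2:queue1 batch2:queue2]
done = restart_all(targets) { |t| "restarted #{t}" }
puts done.map(&:last).sort
```

Treating each (host, queue) pair as its own unit of work is what lets queues that share a Redis port restart side by side rather than one after another.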

  26. Result: resque:restart
    20 minutes → 3 minutes
    85% reduction
    We can now deliver value to users sooner

  27. Case 3. Automating DB server builds

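  28. DB server architecture
    • MariaDB
    • All on-premises
    • Redundant across data centers
    • Just under 150 servers per DC (including backup)
      → 300 in total (as of 2017/08)
    • Vertical and horizontal partitioning
    [Diagram] DC1 and DC2, with master, slave and backup servers linked by replication
    • DB servers have to be built frequently, because of configuration changes such as scale-ups and because of hardware failures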

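  29. From DB server build to service-in
    • Secure machines and change the configuration (in coordination with staff stationed at the DC)
    • OS install (PXE boot or Cobbler + koan)
    • Initial setup (Ansible)
    • Create the data volume ( mkfs.xfs, mount )
    • MariaDB setup (Chef)
    • Backup and restore with Percona XtraBackup
      • The data source is the backup DB server
    • Replication
    • Add Nagios monitoring
    • (For a master) switch over with MHA (mysql-master-ha) during maintenance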

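  30. From DB server build to service-in: before the improvement
    • Secure machines and change the configuration (in coordination with staff stationed at the DC)
    • OS install (PXE boot or Cobbler + koan)
    • Initial setup (Ansible)
    • Create the data volume ( mkfs.xfs, mount )
    • MariaDB setup (Chef)
    • Backup and restore with Percona XtraBackup
      • The data source is the backup DB server
    • Replication
    • Add Nagios monitoring
    • (For a master) switch over with MHA (mysql-master-ha) during maintenance
    [Annotation, shown three times on the slide] Manual work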

  31. Automating data volume creation ( mkfs.xfs, mount )
    • Add it to the existing Chef recipe
    • Requirements
      • Support combinations of SSD, ioDrive, and ioMemory
      • Infer the name of the device to mount
      • Bundle multiple disks together using LVM

  32. Inferring the target disk for the data volume
    • 1 SSD, 1 ioMemory
      • /dev/fioa is the data volume
    • 1 SSD
      • /dev/sda7 or similar is the data volume
    • 2 SSDs
      • /dev/sdb is the data volume
      • ( /dev/sda is the root volume )
    • 3 SSDs
      • /dev/sdb + /dev/sdc (LVM) is the data volume
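The mapping on this slide is small enough to express as a lookup table. The sketch below is only an illustration of that idea: `infer_data_volume` and `DATA_VOLUME_LAYOUTS` are hypothetical names, the real Chef recipe is not shown in the deck, and `/dev/sda7` is taken literally even though the slide says "or similar".

```ruby
# Hypothetical lookup: [ssd_count, iomemory_count] => data volume layout.
DATA_VOLUME_LAYOUTS = {
  [1, 1] => { devices: ["/dev/fioa"], lvm: false },             # ioMemory holds the data
  [1, 0] => { devices: ["/dev/sda7"], lvm: false },             # partition on the only SSD
  [2, 0] => { devices: ["/dev/sdb"], lvm: false },              # /dev/sda stays the root volume
  [3, 0] => { devices: ["/dev/sdb", "/dev/sdc"], lvm: true }    # bundled with LVM
}.freeze

# Returns the layout for a hardware combination, or raises for any
# combination the slide does not cover.
def infer_data_volume(ssd_count:, iomemory_count: 0)
  DATA_VOLUME_LAYOUTS.fetch([ssd_count, iomemory_count]) do
    raise ArgumentError, "unsupported layout: #{ssd_count} SSD, #{iomemory_count} ioMemory"
  end
end

layout = infer_data_volume(ssd_count: 3)
puts layout[:devices].join(" + ")
puts "LVM: #{layout[:lvm]}"
```

Failing loudly on an unknown combination seems safer than guessing, since the result would be fed to mkfs.xfs.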

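  33. From DB server build to service-in: after the improvement
    • Secure machines and change the configuration (in coordination with staff stationed at the DC)
    • OS install (PXE boot or Cobbler + koan)
    • Initial setup (Ansible)
    • Create the data volume ( mkfs.xfs, mount )
    • MariaDB setup (Chef)
    • Backup and restore with Percona XtraBackup
      • The data source is the backup DB server
    • Replication
    • Add Nagios monitoring
    • (For a master) switch over with MHA (mysql-master-ha) during maintenance
    [Annotations on the slide]
    Made it runnable interactively with a single command
    Turned it into a shell script
    Folded it into the Chef run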

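  34. Summary
    • To focus on the work that really matters:
      • (Case 3) Use automation to remove dependence on specific individuals, reduce operational mistakes, and make operations easier to scale
    • To deliver higher value to users:
      • (Case 1) Get ahead of load problems so that the service does not go down
      • (Case 2) Improve the deploy flow to shorten the time a release takes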

  35. Thank you very much