MogileFS: A Simple, Reliable Storage Solution

TWJUG November 2016 Meetup
http://twjug.kktix.cc/events/twjug201611

https://github.com/mogilefs-moji/moji

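In this talk I want to share some production experience with MogileFS, an open source distributed storage system.
For a senior engineer or architect working in an era of abundant UGC (user-generated content), big data, and IoT, planning a system's storage architecture cannot always be solved by attaching a few hard disks or a NAS; sometimes the stack needs more powerful storage. AWS S3 is the obvious candidate, but where sensitive data, client requirements, or legacy systems are a concern, the simpler MogileFS may be the better fit. Besides MogileFS itself, this talk focuses on moji, which lets Java users access MogileFS naturally, and it also covers some lessons on reliability. If you are interested in architecture design, or you are a full stack engineer who has to look after everything, come and join the conversation.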

Transcript

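  1. Quick facts
    “Open source distributed object storage”, a.k.a. cloud storage, software-defined storage…
    • High availability, horizontal scaling
    • Multi-replica file storage and repair
    • Simple architecture, easy to use
    • Many production deployments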
  2. [Diagram: client, tracker ×3, store ×3, mysql]
    1. Create open (client to tracker)
        create_open domain=toast&class=triple&debug_profile=0&fid=0&multi_dest=1&key=qoo3
        OK path_1=http://127.0.0.20:7500/dev2/0/000/000/0000000014.fid&path_3=http://127.0.0.25:7500/dev3/0/000/000/0000000014.fid&devid_1=2&devid_3=3&fid=14&path_2=http://127.0.0.25:7500/dev4/0/000/000/0000000014.fid&dev_count=3&devid_2=4
    2. Write data (WebDAV, client to store)
        PUT /dev208/0/068/050/0068050934.fid HTTP/1.0
        Content-length: 9
        some data
        200 OK
    3. Create close (client to tracker)
        create_close domain=toast&fid=14&devid=2&path=http://127.0.0.20:7500/dev2/0/000/000/0000000014.fid&size=1048576&key=qoo3&devid_2=3&path_2=http://127.0.0.25:7500/dev3/0/000/000/0000000014.fid&muti_dest=1
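
The `create_open` reply above is a single line of `&`-separated `key=value` pairs, so a client has to pick the `path_N` entries out of it before it can PUT any data. A minimal sketch of that parsing step follows; `TrackerResponse`, `parse`, and `destinations` are hypothetical names used for illustration and are not part of moji or MogileFS, and the sketch assumes the values are URL-encoded.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not an API of moji or MogileFS. */
final class TrackerResponse {

    /** Splits a reply such as "OK k1=v1&k2=v2" into a map, keeping the order of the pairs. */
    static Map<String, String> parse(String line) {
        if (!line.startsWith("OK ")) {
            throw new IllegalArgumentException("tracker did not answer OK: " + line);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : line.substring(3).trim().split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            // Assumption: keys and values are URL-encoded.
            fields.put(URLDecoder.decode(key, StandardCharsets.UTF_8),
                       URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return fields;
    }

    /** Returns path_1 .. path_N in index order, where N is the dev_count field. */
    static List<String> destinations(Map<String, String> fields) {
        int count = Integer.parseInt(fields.getOrDefault("dev_count", "0"));
        List<String> paths = new ArrayList<>();
        for (int i = 1; i <= count; i++) {
            String path = fields.get("path_" + i);
            if (path != null) {
                paths.add(path);
            }
        }
        return paths;
    }

    public static void main(String[] argv) {
        String reply = "OK path_1=http://127.0.0.20:7500/dev2/0/000/000/0000000014.fid"
                + "&path_3=http://127.0.0.25:7500/dev3/0/000/000/0000000014.fid"
                + "&devid_1=2&devid_3=3&fid=14"
                + "&path_2=http://127.0.0.25:7500/dev4/0/000/000/0000000014.fid"
                + "&dev_count=3&devid_2=4";
        for (String path : destinations(parse(reply))) {
            System.out.println(path);
        }
    }
}
```

Sorting by the numeric suffix rather than by arrival order matters here because the reply on the slide lists `path_3` before `path_2`.
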
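  3. KKBOX
    • More than 30 million songs (files)
    • More than 75 storage servers
    • More than 2,300 hard disks in total
    • More than 10 PB of total storage
    • 8 racks in use
    (KKBOX 的音樂檔案儲存技術 [KKBOX's music file storage technology], posted on August 2, 2016 by Chris Yuan)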
  4. Moji
    • A file-like MogileFS client for Java developers
    • Production-ready features
      – Connection pooling, load balancing, fault tolerance…
    • Quality
      – Spring friendly, integration tests, well documented, actively developed…
    https://github.com/mogilefs-moji/moji
  5. Configuration
    • Using plain-old-Java
        SpringMojiBean moji = new SpringMojiBean();
        moji.setAddressesCsv("192.168.0.1:7001,192.168.0.2:7001");
        moji.setDomain("testdomain");
        moji.initialise();
        moji.setTestOnBorrow(true);
    • Using the Spring framework
        moji.tracker.address=192.168.0.1:7001,192.168.0.2:7001
        moji.domain=testdomain
        <import resource="moji-context.xml" />
  6. Usage
    • Create/update a remote file
        MojiFile rickRoll = moji.getFile("rick-astley");
        moji.copyToMogile(new File("never-gonna-give-you-up.mp3"), rickRoll);
    • Download a remote file
        rickRoll.copyToFile(new File("foo-fighters.mp3"));
  7. Usage
    • IO stream
        MojiFile fooFighters = moji.getFile("stacked-actors");
        // Read: try-with-resources closes the stream even if opening or reading fails
        try (InputStream in = fooFighters.getInputStream()) {
          // Do something streamy
          // in.read();
        }
        // Write
        try (OutputStream out = fooFighters.getOutputStream()) {
          // Do something streamy
          // out.write(...);
          out.flush();
        }
  8. Call to action!
    • Set up the environment manually
      – MogileFS
          https://code.google.com/p/mogilefs/wiki/QuickStartGuide
      – Maven dependency
          <dependency>
            <groupId>fm.last</groupId>
            <artifactId>moji</artifactId>
            <version>2.0.0</version>
          </dependency>
    • Quickstart feat. Docker
        docker run -d --name mogile-node jeffutter/mogile-node
        docker run -it --link mogile-node:mogile-node hrchu/mogile-moji
  9. MogileFS reliability measures
    • Single copy ACK
    • Multiple host replication policy
    • MD5 checksum
    • Basic disk health check
    • Multiple zone plugin
    • Reaper/fsck
  10. Multiple Sites
    • Given a network of: 10.10.0.0/16
    • All of your machines are configured with a /16 netmask (255.255.0.0). When assigning IP addresses to machines, pick them from 10.10.5.0/24
    • Assign IPs
      – web1: 10.10.5.1 (netmask 255.255.0.0 or /16)
      – web2: 10.10.5.2
      – tracker1: 10.10.5.3
      – tracker2: 10.10.5.4
      – storage node 1: 10.10.5.5
      – storage node 2: 10.10.5.6
      – storage node 3: 10.10.8.1
    • MogileFS zones, you configure:
      – near=10.10.5.0/24 far=10.10.8.0/24
    [Diagram: web1, web2, tracker1, tracker2, node1, node2, node3 grouped into the near and far zones]
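
The zone settings above map each machine to a zone purely by which CIDR block its address falls into. The following sketch shows that matching with plain integer arithmetic; `ZoneLookup` and its methods are hypothetical names for illustration and do not reflect how the MogileFS multiple zone plugin is implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of CIDR-based zone matching; not the MogileFS plugin's code. */
final class ZoneLookup {
    // zone name -> {network address, netmask}, both as 32-bit integers
    private final Map<String, int[]> zones = new LinkedHashMap<>();

    /** Registers a zone such as addZone("near", "10.10.5.0/24"). */
    void addZone(String name, String cidr) {
        String[] parts = cidr.split("/");
        int prefix = Integer.parseInt(parts[1]);
        if (prefix < 0 || prefix > 32) {
            throw new IllegalArgumentException("bad prefix length: " + cidr);
        }
        // A shift by 32 would be a no-op in Java, so /0 is handled explicitly.
        int mask = prefix == 0 ? 0 : -1 << (32 - prefix);
        zones.put(name, new int[] { toInt(parts[0]) & mask, mask });
    }

    /** Returns the first zone whose block contains the address, or null if none matches. */
    String zoneOf(String ip) {
        int addr = toInt(ip);
        for (Map.Entry<String, int[]> entry : zones.entrySet()) {
            int[] zone = entry.getValue();
            if ((addr & zone[1]) == zone[0]) {
                return entry.getKey();
            }
        }
        return null;
    }

    /** Converts dotted-quad IPv4 text to a 32-bit integer. */
    private static int toInt(String dotted) {
        String[] octets = dotted.split("\\.");
        if (octets.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + dotted);
        }
        int value = 0;
        for (String text : octets) {
            int octet = Integer.parseInt(text);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("not an IPv4 address: " + dotted);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    public static void main(String[] argv) {
        ZoneLookup lookup = new ZoneLookup();
        lookup.addZone("near", "10.10.5.0/24");
        lookup.addZone("far", "10.10.8.0/24");
        System.out.println("storage node 1 -> " + lookup.zoneOf("10.10.5.5"));
        System.out.println("storage node 3 -> " + lookup.zoneOf("10.10.8.1"));
    }
}
```

With the addresses from the slide, storage nodes 1 and 2 fall into `near` and storage node 3 into `far`, which is what lets a replication policy place copies in different sites.
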
  11. Scrubber
    • Make use of routine FSCK as scrubber
        mogadm fsck status |grep " Yes " || (mogadm fsck reset; mogadm fsck clearlog; mogadm fsck start) >/var/log/mogadm.fsck 2>&1
    • Modified Algorithm
      – Remove exhaustive search
      – Improve performance at large scale
      https://github.com/mogilefs/MogileFS-Network/blob/master/lib/MogileFS/ReplicationPolicy/HostsPerNetwork.pm#L84
  12. Modern durable write
    • AS-IS
    [Diagram: client, tracker ×3, store ×3, mysql]
      4. Write other copies asynchronously
    Assume that a file must have at least three replicas in the system to meet the durability requirement.
  13. Modern durable write
    • TO-BE
    [Diagram: client, tracker ×3, store ×3, mysql]
      2. Write at least two copies before ACK
      4. Write other copies asynchronously
    Assume that a file must have at least three replicas in the system to meet the durability requirement.
    mogilefs-moji#25 mogilefs/MogileFS-Server#39
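
The TO-BE flow can be read as: keep uploading to the destinations returned by the tracker until a minimum number of copies is stored, and only then acknowledge the write. The sketch below illustrates that rule only. `DurableWrite`, `writeCopies`, and the `put` callback (a stand-in for the WebDAV PUT) are hypothetical, and this is not necessarily how the referenced moji or MogileFS-Server changes implement it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;

/** Hypothetical sketch of "write at least N copies before ACK"; not moji's implementation. */
final class DurableWrite {

    /**
     * Tries the destinations in order and stops once minCopies uploads have succeeded.
     *
     * @param put stand-in for the WebDAV PUT; returns true when the store accepted the data
     * @return the destinations that now hold a copy
     * @throws IllegalStateException if fewer than minCopies uploads succeeded
     */
    static List<String> writeCopies(List<String> destinations, byte[] data, int minCopies,
                                    BiPredicate<String, byte[]> put) {
        List<String> written = new ArrayList<>();
        for (String destination : destinations) {
            if (written.size() >= minCopies) {
                break; // enough copies stored; remaining replicas are left to async replication
            }
            if (put.test(destination, data)) {
                written.add(destination);
            }
        }
        if (written.size() < minCopies) {
            throw new IllegalStateException(
                    "only " + written.size() + " of " + minCopies + " required copies were stored");
        }
        return written;
    }

    public static void main(String[] argv) {
        List<String> destinations = Arrays.asList("store-a", "store-b", "store-c");
        // Simulated PUT: pretend store-b is down.
        List<String> written = writeCopies(destinations, "some data".getBytes(), 2,
                (destination, data) -> !destination.equals("store-b"));
        System.out.println("copies stored before ACK: " + written);
    }
}
```

In a real client the returned destinations would then be reported to the tracker in `create_close`, and the third replica would still be created by the asynchronous replication step.
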
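  14. Analysis
    • Disk failure pattern
      – MTTF?
      – Poisson distribution?
    • Mark-out: the window before a failure is detected
    • Rep latency: the window before asynchronous replication completes
    • Disk size and file size also affect the result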
  15. Analysis
    • Combinatorial analysis model
      – Assume that each disk fails independently
      – Assume that after x hours of operation each block is intact with probability P(x_i) = p
      – Probability of failure q = 1 - p
      – A naive formula for replication with n copies: 1 - q^n
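
To make the naive formula concrete: with n independent replicas the data is lost only if all n copies fail, which has probability q^n, so the data survives with probability 1 - q^n. The sketch below evaluates that expression; the class name and the value q = 0.01 are illustrative assumptions, not figures from the talk.

```java
/** Illustrative evaluation of the naive replication formula 1 - q^n. */
final class NaiveDurability {

    /** Probability that at least one of n independent replicas is still intact. */
    static double survival(double q, int n) {
        if (q < 0.0 || q > 1.0 || n < 1) {
            throw new IllegalArgumentException("need 0 <= q <= 1 and n >= 1");
        }
        return 1.0 - Math.pow(q, n);
    }

    public static void main(String[] argv) {
        double q = 0.01; // assumed per-replica failure probability, for illustration only
        for (int n = 1; n <= 3; n++) {
            System.out.println("n=" + n + " survival=" + survival(q, n));
        }
    }
}
```

The formula is naive because it ignores repair: it treats all n copies as exposed for the whole period, with no re-replication after a failure.
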
  16. Analysis
    • If we also consider
      – Non-Recoverable Errors (NREs)
      – drive failure events are Poisson
      – site failures (e.g. due to regional disasters)
      – rep latency, mark-out time
      – …
    • Analysis of system durability is commonly done with Markov models
  17. Analysis
    • Example of durable write
      – Assume mean disk life is 500K hrs
      – 2 replicas, no NRE
    [Chart: difference in MTTDL in hours (axis from 249960 to 250080) against replication rate mu (1, 0.041666667, 0.020833333, 0.013888889); series “diff disk life 5”]
    The lower the replication rate, the larger the improvement from durable write
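
One way to see where a number close to 250,000 hours can come from is a textbook three-state Markov chain for two replicas: each disk fails at rate lambda, and a lost replica is re-created at rate mu. Starting with two copies gives MTTDL = (3·lambda + mu) / (2·lambda²); starting with one copy (single copy ACK, second copy still pending) gives (2·lambda + mu) / (2·lambda²). This is an assumed, simplified model and may differ from the one behind the chart; the class and method names are hypothetical.

```java
/** Textbook two-replica Markov chain; an assumed model, not necessarily the one used in the talk. */
final class TwoReplicaMttdl {

    /** Mean time to data loss when the file starts with two healthy replicas. */
    static double fromTwoCopies(double lambda, double mu) {
        return (3 * lambda + mu) / (2 * lambda * lambda);
    }

    /** Mean time to data loss when the file starts with a single replica. */
    static double fromOneCopy(double lambda, double mu) {
        return (2 * lambda + mu) / (2 * lambda * lambda);
    }

    /** Gain of starting with two copies, relative to starting with one. */
    static double relativeGain(double lambda, double mu) {
        return (fromTwoCopies(lambda, mu) - fromOneCopy(lambda, mu)) / fromOneCopy(lambda, mu);
    }

    public static void main(String[] argv) {
        double lambda = 1.0 / 500_000; // failures per hour, from the 500K-hour mean disk life
        double[] mus = { 1.0, 1.0 / 24, 1.0 / 48, 1.0 / 72 }; // replication rates on the chart's axis
        for (double mu : mus) {
            double diff = fromTwoCopies(lambda, mu) - fromOneCopy(lambda, mu);
            System.out.println("mu=" + mu + " diff(hours)=" + diff
                    + " relativeGain=" + relativeGain(lambda, mu));
        }
    }
}
```

In this simplified chain the absolute difference is 1/(2·lambda), which is 250,000 hours for a 500K-hour disk life regardless of mu, so it does not reproduce the small variation shown on the chart. The relative gain, lambda / (2·lambda + mu), does grow as mu shrinks, which is at least consistent in direction with the slide's conclusion.
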
  18. Analysis
    • Example of probability of data loss
    [Chart: probability of data loss (axis from 0 to 8.0E-05) for x = 1 to 14; series “P of data loss 72”, “P of data loss 48”, “P of data loss 24”, “P of data loss 1”]