
Full-Text Search with FM-index

Sho Iizuka
February 02, 2015


Transcript

  1. Full-Text Search with FM-Index
    Computer Practicum E: Free Project

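  2. • Methods for searching a document for a string fall into two classes
    A. Methods that need no preprocessing (brute force, KMP, BM)
    B. Methods that need preprocessing (inverted index, suffix array)
    • B pays a preprocessing cost up front, but is faster than A
      when the same document is searched many times
    • FM-Index belongs to class B, and its search time
      does not depend on the length of the document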

  3. Preprocessing 1: Building the Suffix Array
    Document: mississippi
    Append the end marker $: mississippi$
    Enumerate the suffixes:
    mississippi$
    ississippi$
    ssissippi$
    sissippi$
    issippi$
    ssippi$
    sippi$
    ippi$
    ppi$
    pi$
    i$
    $

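  4. Preprocessing 1: Building the Suffix Array
    Sort in lexicographic order
    * $ is assumed to rank lower than any letter of the alphabet

    Before sorting        After sorting
     0 mississippi$       11 $
     1 ississippi$        10 i$
     2 ssissippi$          7 ippi$
     3 sissippi$           4 issippi$
     4 issippi$            1 ississippi$
     5 ssippi$             0 mississippi$
     6 sippi$              9 pi$
     7 ippi$               8 ppi$
     8 ppi$                6 sippi$
     9 pi$                 3 sissippi$
    10 i$                  5 ssippi$
    11 $                   2 ssissippi$

    The start positions in sorted order form the suffix array SA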
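The sorting step on slide 4 can be sketched in a few lines of Python. This is a naive version that sorts the suffixes directly, not the sais.hxx construction the deck mentions later; the function name `build_suffix_array` is an assumption made for illustration.

```python
def build_suffix_array(text):
    """Return the start positions of the suffixes of text + '$' in sorted order.

    Naive construction: sorts the suffixes as plain strings, which is fine
    for a short example but far slower than a dedicated library.
    """
    s = text + "$"  # '$' sorts before every ASCII letter
    return sorted(range(len(s)), key=lambda i: s[i:])


sa = build_suffix_array("mississippi")
print(sa)  # [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
```

The printed list matches the "After sorting" column on slide 4.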

  5. Preprocessing 2: BWT (Burrows-Wheeler Transform)
    Replace each suffix with the character that precedes it in the original string

    SA  Suffix          BWT string T
    11  $               i
    10  i$              p
     7  ippi$           s
     4  issippi$        s
     1  ississippi$     m
     0  mississippi$    $
     9  pi$             p
     8  ppi$            i
     6  sippi$          s
     3  sissippi$       s
     5  ssippi$         i
     2  ssissippi$      i
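As a rough sketch of slide 5, the BWT string can be read off the suffix array by taking the character just before each suffix. The helper name `bwt_from_suffix_array` is an assumption, and the suffix array is again built by naive sorting.

```python
def bwt_from_suffix_array(s, sa):
    """For each suffix start i (in sorted order), take the preceding character.

    For i == 0 the index i - 1 is -1, which wraps around to the final '$'.
    """
    return "".join(s[i - 1] for i in sa)


s = "mississippi$"
sa = sorted(range(len(s)), key=lambda i: s[i:])  # naive suffix array
t = bwt_from_suffix_array(s, sa)
print(t)  # ipssm$pissii
```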

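  6. Search
    • For the BWT string T = ipssm$pissii, define the following functions
    • Rank(c,p): returns the number of occurrences of the letter c
      in the range T[0,p)
    • RankLT(c): returns the number of occurrences, in all of T,
      of letters that rank lower than c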
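The two functions on slide 6 can be written directly from their definitions. This is a minimal linear-time sketch, not the wavelet-based version the deck describes later; passing `t` as the first argument and the lowercase names are choices made here.

```python
T = "ipssm$pissii"  # BWT string from slide 5


def rank(t, c, p):
    """Number of occurrences of character c in t[0:p]."""
    return t[:p].count(c)


def rank_lt(t, c):
    """Number of characters in all of t that sort before c."""
    return sum(1 for x in t if x < c)


print(rank(T, "s", 5))   # 2  ("ipssm" contains two 's')
print(rank_lt(T, "i"))   # 1  (only '$' sorts before 'i')
print(rank_lt(T, "s"))   # 8  ('$' x1, 'i' x4, 'm' x1, 'p' x2)
```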

  7. Search
    BWT string T   Suffix array SA
    i              $
    p              i$
    s              ippi$
    s              issippi$
    m              ississippi$
    $              mississippi$
    p              pi$
    i              ppi$
    s              sippi$
    s              sissippi$
    i              ssippi$
    i              ssissippi$

  8. Search
    [Figure: BWT string T beside the sorted suffixes (suffix array SA)]
    Where does 'i'+"ppi$" appear in the suffix array?

  9. Search
    [Figure: BWT string T beside the sorted suffixes (suffix array SA)]
    Where does 'i'+"ppi$" appear in the suffix array?

  10. Search
    [Figure: BWT string T beside the sorted suffixes (suffix array SA)]
    Where does 'i'+"ppi$" appear in the suffix array?
    LF-mapping
    The string made of c=T[p] followed by the suffix in row p
    appears in SA at position RankLT(c)+Rank(c,p)
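The question on slides 8 to 10 can be checked with a short sketch of LF-mapping. The function name `lf` and the `rows` list (the sorted suffixes, kept here only to verify the answer) are assumptions for illustration; a real FM-index would not store the suffixes.

```python
T = "ipssm$pissii"  # BWT of "mississippi$"


def rank(c, p):
    """Occurrences of c in T[0:p]."""
    return T[:p].count(c)


def rank_lt(c):
    """Occurrences in T of characters that sort before c."""
    return sum(1 for x in T if x < c)


def lf(p):
    """Row of the suffix obtained by prepending T[p] to the suffix in row p."""
    c = T[p]
    return rank_lt(c) + rank(c, p)


# Sorted suffixes, used only to check the result.
s = "mississippi$"
rows = sorted(s[i:] for i in range(len(s)))

p = rows.index("ppi$")              # row 7, where T[7] == 'i'
print(p, T[p], lf(p), rows[lf(p)])  # 7 i 2 ippi$
```

Here RankLT('i') = 1 and Rank('i', 7) = 1, so 'i'+"ppi$" sits in row 2 of the suffix array.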

  11. Search
    [Figure: BWT string T beside the sorted suffixes (suffix array SA)]
    Searching for "ssi"
    [RankLT('i')+Rank('i', 0), RankLT('i')+Rank('i', 12))
    Strings that start with 'i'

  12. Search
    [Figure: BWT string T beside the sorted suffixes (suffix array SA)]
    Searching for "ssi"
    [RankLT('s')+Rank('s', 1), RankLT('s')+Rank('s', 5))
    Strings that start with 's'+"i"

  13. Search
    [Figure: BWT string T beside the sorted suffixes (suffix array SA)]
    Searching for "ssi"
    [RankLT('s')+Rank('s', 8), RankLT('s')+Rank('s', 10))
    Strings that start with 's'+"si"
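The three steps for "ssi" on slides 11 to 13 can be traced with a small backward-search sketch. The function name `backward_search` and its trace output are assumptions made for illustration, and Rank and RankLT are the naive linear-time versions rather than wavelet-based ones.

```python
T = "ipssm$pissii"  # BWT of "mississippi$"


def rank(c, p):
    """Occurrences of c in T[0:p]."""
    return T[:p].count(c)


def rank_lt(c):
    """Occurrences in T of characters that sort before c."""
    return sum(1 for x in T if x < c)


def backward_search(query):
    """Process the query from its last character to its first.

    Returns (character, lo, hi) for each step, where [lo, hi) is the
    half-open range of suffix array rows that still match.
    """
    lo, hi = 0, len(T)
    trace = []
    for c in reversed(query):
        lo, hi = rank_lt(c) + rank(c, lo), rank_lt(c) + rank(c, hi)
        trace.append((c, lo, hi))
        if lo >= hi:  # empty range: the query does not occur
            break
    return trace


for c, lo, hi in backward_search("ssi"):
    print(c, lo, hi)
# i 1 5
# s 8 10
# s 10 12
```

The final range [10, 12) has two rows, so "ssi" occurs twice in "mississippi".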

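  14. Search
    • FM-index narrows down the positions that match the query string
      by repeating LF-mapping
    • LF-mapping can be done with Rank and RankLT
    • With a wavelet tree or wavelet matrix, both operations
      take O(log σ) time (σ is the number of distinct letters)
    • LF-mapping is repeated once per character of the query string Q,
      so one search takes O(m log σ) time (m is the number of characters in Q)
    • Search time does not depend on the length of the document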

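  15. What I Built
    • An incremental search, usable from a web browser,
      over 500 popular books on Aozora Bunko
    • Suffix array construction uses sais.hxx (a fast library)
    • The wavelet matrix and FM-Index are my own implementation (C++),
      turned into a Python extension module with boost-python
    • Called from Flask (Web App Framework@Python)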

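  16. What Didn't Go Well
    • I searched the literature for a way to implement fuzzy search
      → it apparently takes time exponential in the edit distance…
    • When loading the built index from a file with an existing library,
      memory usage exploded (cause unknown)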

  17. Summary
    • Implemented a fast string search algorithm
    • Made it usable from a browser
    • Reference
    • Daisuke Okanohara. 高速文字列解析の世界. Iwanami Shoten. 2012.

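  18. (Supplement) Wavelet Tree
    Level    Values                  Bits
    root     3101212213              1000101101       ← 2nd-lowest bit
    0 | 1    10111 | 32223           10111 | 10001    ← lowest bit
    leaves   0 | 1111 | 222 | 33
    (at each node, bit 0 goes to the left child and bit 1 to the right child)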
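The tree on slide 18 can be reproduced with a small sketch that stores one bit list per node and answers Rank by walking from the root to a leaf. The class name `WaveletTree` and the dictionary keyed by bit prefix are assumptions for illustration. The bit counting inside each node is done naively here; a practical implementation would use a bit vector with constant-time rank, which is what makes the whole query O(log σ).

```python
class WaveletTree:
    """Wavelet tree over small integers, one bit list per internal node."""

    def __init__(self, seq, bits):
        self.bits = bits
        self.nodes = {}  # bit prefix (tuple) -> list of bits at that node
        self._build(list(seq), bits - 1, ())

    def _build(self, seq, level, prefix):
        if level < 0 or not seq:
            return
        b = [(x >> level) & 1 for x in seq]
        self.nodes[prefix] = b
        # Values whose current bit is 0 go left, the others go right.
        self._build([x for x, bit in zip(seq, b) if bit == 0], level - 1, prefix + (0,))
        self._build([x for x, bit in zip(seq, b) if bit == 1], level - 1, prefix + (1,))

    def rank(self, c, p):
        """Number of occurrences of c among the first p values."""
        prefix = ()
        for level in range(self.bits - 1, -1, -1):
            node = self.nodes.get(prefix)
            if node is None:
                return 0
            bit = (c >> level) & 1
            p = node[:p].count(bit)  # naive rank on the bit list
            prefix += (bit,)
        return p


seq = [3, 1, 0, 1, 2, 1, 2, 2, 1, 3]
wt = WaveletTree(seq, bits=2)
print("".join(map(str, wt.nodes[()])))      # 1000101101
print("".join(map(str, wt.nodes[(0,)])))    # 10111
print("".join(map(str, wt.nodes[(1,)])))    # 10001
print(wt.rank(1, 10))                       # 4
```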