Full-Text Search with FM-index

Sho Iizuka
February 02, 2015

Transcript

 1. Full-Text Search with FM-Index
  Computer Lab E: free-choice assignment

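 2. • Methods for searching a document for a string fall into two classes
  A. Methods that need no preprocessing (brute force, KMP, BM)
  B. Methods that need preprocessing (inverted index, suffix array)
  • B spends time on preprocessing, but in return it is faster than A
  when the same document is searched many times
  • FM-Index belongs to class B and can search in time
  that does not depend on the length of the document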

 3. Preprocessing 1: Building the suffix array
  Document
  mississippi
  Append the end marker $
  mississippi$
  Enumerate the suffixes
  mississippi$
  ississippi$
  ssissippi$
  sissippi$
  issippi$
  ssippi$
  sippi$
  ippi$
  ppi$
  pi$
  i$
  $

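 4. Preprocessing 1: Building the suffix array
  Suffixes by start position    Sorted in lexicographic order
  0  mississippi$               11 $
  1  ississippi$                10 i$
  2  ssissippi$                 7  ippi$
  3  sissippi$                  4  issippi$
  4  issippi$                   1  ississippi$
  5  ssippi$                    0  mississippi$
  6  sippi$                     9  pi$
  7  ippi$                      8  ppi$
  8  ppi$                       6  sippi$
  9  pi$                        3  sissippi$
  10 i$                         5  ssippi$
  11 $                          2  ssissippi$
  Note: $ is assumed to rank below every character of the alphabet
  The start positions in sorted order form the suffix array SA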
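
The construction on slides 3 and 4 can be sketched in a few lines of Python. This is a naive illustration that sorts the suffixes directly, not the method used in the deck (which relies on sais.hxx); the function name `build_suffix_array` is an assumption made for this example.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order."""
    # Naive construction: sort the start positions by the suffix that begins there.
    # In ASCII, '$' sorts before every lowercase letter, which matches the rule
    # that the end marker ranks below every character of the alphabet.
    return sorted(range(len(text)), key=lambda i: text[i:])


sa = build_suffix_array("mississippi$")
print(sa)  # [11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
```

Sorting whole suffixes costs far more than a dedicated suffix-array algorithm, so this form is only suitable for small inputs such as the example string.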

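 5. Preprocessing 2: BWT (Burrows-Wheeler Transform)
  Replace each suffix with the character that precedes it in the original string
  SA  suffix         preceding character
  11  $              i
  10  i$             p
  7   ippi$          s
  4   issippi$       s
  1   ississippi$    m
  0   mississippi$   $
  9   pi$            p
  8   ppi$           i
  6   sippi$         s
  3   sissippi$      s
  5   ssippi$        i
  2   ssissippi$     i
  BWT string T = ipssm$pissii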
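
As a minimal sketch of this step, the BWT string can be read off the suffix array by taking the character just before each suffix. The helper name `bwt_from_suffix_array` is hypothetical, and the suffix array is built naively here so that the block is self-contained.

```python
def bwt_from_suffix_array(text, sa):
    """Return the BWT string: for each sorted suffix, the character preceding it."""
    # For the suffix starting at position 0, text[-1] wraps around to the
    # end marker '$', which is the character "before" the whole string.
    return "".join(text[i - 1] for i in sa)


text = "mississippi$"
sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
print(bwt_from_suffix_array(text, sa))  # ipssm$pissii
```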

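 6. Search
  • For the BWT string T = ipssm$pissii,
  define the following functions
  • Rank(c, p): returns the number of occurrences of character c
  in the range T[0, p)
  • RankLT(c): returns the number of characters in all of T
  that rank below c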
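
The two functions can be written directly from their definitions. The sketch below scans T linearly, which is fine for illustration but is not how an FM-index computes them in practice; the Python names `rank` and `rank_lt` are chosen for this example.

```python
T = "ipssm$pissii"  # BWT string of "mississippi$"


def rank(c, p, t=T):
    """Number of occurrences of character c in t[0:p]."""
    return t[:p].count(c)


def rank_lt(c, t=T):
    """Number of characters in all of t that sort before c."""
    return sum(1 for ch in t if ch < c)


print(rank("s", 5))   # 2: "ipssm" contains two 's'
print(rank_lt("i"))   # 1: only '$' sorts before 'i'
print(rank_lt("s"))   # 8: one '$', four 'i', one 'm', two 'p'
```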

 7. Search
  row  BWT string T  suffix array SA
  0    i             $
  1    p             i$
  2    s             ippi$
  3    s             issippi$
  4    m             ississippi$
  5    $             mississippi$
  6    p             pi$
  7    i             ppi$
  8    s             sippi$
  9    s             sissippi$
  10   i             ssippi$
  11   i             ssissippi$

 8. Search
  (BWT string T and suffix array SA as in the table on slide 7)
  Where does 'i' + "ppi$" appear in the suffix array?

 9. Search
  (BWT string T and suffix array SA as in the table on slide 7)
  Where does 'i' + "ppi$" appear in the suffix array?

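 10. Search
  (BWT string T and suffix array SA as in the table on slide 7)
  Where does 'i' + "ppi$" appear in the suffix array?
  LF-mapping
  The string that continues from c = T[p], that is, c followed by the suffix in row p,
  appears in SA at row
  RankLT(c) + Rank(c, p)
  Here "ppi$" is in row p = 7 and c = T[7] = 'i', so the row is
  RankLT('i') + Rank('i', 7) = 1 + 1 = 2, which holds ippi$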
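
The LF-mapping formula can be checked mechanically on the example. The sketch below uses naive linear-scan versions of Rank and RankLT and a hypothetical helper `lf`; it verifies that, for every row whose BWT character is not the end marker, the row returned by `lf` holds that character followed by the suffix of the starting row.

```python
text = "mississippi$"
suffixes = sorted(text[i:] for i in range(len(text)))  # rows of the table
# BWT character of a row: the character just before its suffix in text.
T = "".join(text[len(text) - len(s) - 1] for s in suffixes)


def rank(c, p):
    """Occurrences of c in T[0:p] (naive scan)."""
    return T[:p].count(c)


def rank_lt(c):
    """Number of characters in T that sort before c (naive scan)."""
    return sum(1 for ch in T if ch < c)


def lf(p):
    """Row of the string T[p] + (suffix in row p)."""
    c = T[p]
    return rank_lt(c) + rank(c, p)


print(lf(7), suffixes[lf(7)])  # 2 ippi$

# Check the property on every row except the one whose BWT character is '$'.
for p, suffix in enumerate(suffixes):
    if T[p] != "$":
        assert suffixes[lf(p)] == T[p] + suffix
```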

 11. Search
  (BWT string T and suffix array SA as in the table on slide 7)
  Searching for "ssi"
  [RankLT('i') + Rank('i', 0), RankLT('i') + Rank('i', 12)) = [1, 5)
  Strings that start with 'i'

 12. Search
  (BWT string T and suffix array SA as in the table on slide 7)
  Searching for "ssi"
  [RankLT('s') + Rank('s', 1), RankLT('s') + Rank('s', 5)) = [8, 10)
  Strings that start with 's' + "i"

 13. Search
  (BWT string T and suffix array SA as in the table on slide 7)
  Searching for "ssi"
  [RankLT('s') + Rank('s', 8), RankLT('s') + Rank('s', 10)) = [10, 12)
  Strings that start with 's' + "si"
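
The three steps for "ssi" can be reproduced with a short loop that applies the same formula to both ends of the current range, reading the query from its last character to its first. This is a sketch with naive Rank and RankLT; the function name `backward_search` is an assumption for the example.

```python
TEXT = "mississippi$"
SA = sorted(range(len(TEXT)), key=lambda i: TEXT[i:])  # naive suffix array
T = "".join(TEXT[i - 1] for i in SA)                   # BWT string "ipssm$pissii"


def rank(c, p):
    """Occurrences of c in T[0:p] (naive scan)."""
    return T[:p].count(c)


def rank_lt(c):
    """Number of characters in T that sort before c (naive scan)."""
    return sum(1 for ch in T if ch < c)


def backward_search(query):
    """Return the half-open SA row range after each step of the search."""
    lo, hi = 0, len(T)
    ranges = []
    for c in reversed(query):
        # Map both ends of the current range through the LF-mapping formula.
        lo, hi = rank_lt(c) + rank(c, lo), rank_lt(c) + rank(c, hi)
        ranges.append((lo, hi))
        if lo >= hi:  # empty range: the query does not occur
            break
    return ranges


ranges = backward_search("ssi")
print(ranges)              # [(1, 5), (8, 10), (10, 12)]
lo, hi = ranges[-1]
print(sorted(SA[lo:hi]))   # [2, 5]: start positions of "ssi" in the document
```

The final range holds one suffix-array row per occurrence, so looking those rows up in SA gives the positions in the original document.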

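 14. Search
  • FM-index narrows down the positions that match the query string
  by repeating LF-mapping
  • LF-mapping can be computed with Rank and RankLT
  • With a wavelet tree or a wavelet matrix, both operations take
  O(log σ) time (σ is the number of distinct characters in the alphabet)
  • LF-mapping is repeated once per character of the query string Q,
  so one search takes O(m log σ) time (m is the number of characters in Q)
  • Search time does not depend on the length of the document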

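 15. What was built
  • An incremental search, usable from a web browser,
  over 500 popular books on Aozora Bunko
  • The suffix array is built with sais.hxx (a fast library)
  • The wavelet matrix and the FM-Index are my own implementation (C++),
  converted into a Python extension module with boost-python
  • Called from Flask (a web application framework for Python)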

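 16. What did not go well
  • I searched the literature in order to implement fuzzy search
  → it apparently takes time exponential in the edit distance…
  • When loading the built index from a file with an existing library,
  memory usage exploded (cause unknown)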

 17. Summary
  • Implemented a fast string search algorithm
  • Made it usable from a browser
  • Reference
  • Daisuke Okanohara (岡野原 大輔). 高速文字列解析の世界. Iwanami Shoten, 2012.

 18. (Supplement) Wavelet tree
  Root sequence             3101212213
  2nd-lowest bit →          1000101101
  Children (branch 0 / 1)   10111        32223
  Lowest bit →              10111        10001
  Leaves (branch 0 1 0 1)   0   1111   222   33
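
The tree above can be rebuilt with a small recursive structure. This is a minimal sketch, assuming 2-bit values and plain Python lists as bit vectors; the class name `WaveletNode` is hypothetical, and counting bits with a linear scan stands in for the constant-time bit-vector rank that a real implementation would use to reach O(log σ) per query. It is a pointer-based wavelet tree, not the wavelet matrix mentioned on slide 15.

```python
class WaveletNode:
    """One node of a wavelet tree over small non-negative integers."""

    def __init__(self, seq, bit):
        self.bit = bit  # bit examined at this node; -1 marks a leaf
        if bit < 0:
            self.bits = []
            self.children = (None, None)
            return
        # One bit per element: the examined bit of each value, in order.
        self.bits = [(v >> bit) & 1 for v in seq]
        zeros = [v for v in seq if not (v >> bit) & 1]
        ones = [v for v in seq if (v >> bit) & 1]
        self.children = (WaveletNode(zeros, bit - 1), WaveletNode(ones, bit - 1))

    def rank(self, value, p):
        """Number of occurrences of value among the first p elements."""
        if self.bit < 0:
            return p  # every element that reaches this leaf equals value
        b = (value >> self.bit) & 1
        # Elements with bit b among the first p keep their order in child b.
        p_child = sum(1 for x in self.bits[:p] if x == b)
        return self.children[b].rank(value, p_child)


def bit_string(node):
    return "".join(str(b) for b in node.bits)


seq = [3, 1, 0, 1, 2, 1, 2, 2, 1, 3]
root = WaveletNode(seq, bit=1)  # values fit in 2 bits, so start at bit 1
print(bit_string(root))               # 1000101101
print(bit_string(root.children[0]))   # 10111
print(bit_string(root.children[1]))   # 10001
print(root.rank(2, 10))               # 3
```

Each rank query follows one root-to-leaf path, so it touches one bit vector per bit of the value.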
