Suffix Trees and Suffix Arrays

hsasakawa
September 15, 2020

M3 Tech Talk #150

A quick introduction to the suffix tree and suffix array data structures

Transcript

  1. Suffix Trees and Suffix Arrays M3 Tech Talk Hirohito Sasakawa,

    Data engineer, AI and ML team @M3, inc.
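  2. Suffix Tree and Suffix Array
    • Index data structures for strings.
    • Index: an auxiliary data structure that provides efficient access to the original data.
    • A suffix tree or suffix array lets you quickly compute substring search, substring frequency, the longest common substring, and so on for a string (or a set of strings).
    • They have a huge number of practical applications (compression, bioinformatics, and more).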
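  3. Suffix Tree
    • tl;dr: the tree obtained by arranging all suffixes of a string T in lexicographic order and merging their shared prefixes from the top.
    • T = cocoa
    • Suffixes: cocoa, ocoa, coa, oa, a
    • Sorted: a, coa, cocoa, oa, ocoa
    • Edge labels (figure): a; co → a, coa; o → a, coa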
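  4. Suffix Tree
    • tl;dr: the tree obtained by arranging all suffixes of a string T in lexicographic order and merging their shared prefixes from the top.
    • T = cocoa
    • Suffixes: cocoa, ocoa, coa, oa, a
    • Sorted: a, coa, cocoa, oa, ocoa
    • Suffix Tree (figure, edge labels): a; co → a, coa; o → a, coa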
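  5. Suffix Tree
    • Construction time: the naive approach takes O(n^2) time. Ukkonen's algorithm builds the tree online in O(n) time.
    • Memory: O(n). Note that it does not grow to O(n^2): each edge only needs to hold pointers to the start and end positions of its label in the original string.
    • Edge labels (figure): a; co → a, coa; o → a, coa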
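
The memory remark on slide 5 can be made concrete with a small sketch. This is the naive O(n^2) insertion of every suffix, not Ukkonen's algorithm, and the names `Node`, `build_suffix_tree`, and `edge_labels` are hypothetical helpers chosen for illustration. Each edge stores only a `(start, end)` pair into the original string, which is why the tree stays O(n) in size. The sketch assumes no suffix is a proper prefix of another path (true for `cocoa`; in general one would append a unique terminator such as `$`).

```python
class Node:
    def __init__(self):
        # first character of an edge label -> (start, end, child),
        # where text[start:end] is the label of that edge
        self.edges = {}


def build_suffix_tree(text):
    """Naive O(n^2) construction: insert each suffix, splitting edges on mismatch."""
    root = Node()
    n = len(text)
    for i in range(n):
        node, pos = root, i
        while pos < n:
            first = text[pos]
            if first not in node.edges:
                node.edges[first] = (pos, n, Node())  # new leaf edge
                break
            start, end, child = node.edges[first]
            # Walk along the edge while the characters agree.
            k = 0
            while start + k < end and pos + k < n and text[start + k] == text[pos + k]:
                k += 1
            if start + k == end:
                node, pos = child, pos + k  # whole edge consumed, descend
                continue
            if pos + k == n:
                break  # suffix ends inside an edge (no terminator): leave it implicit
            # Mismatch inside the edge: split it at offset k.
            mid = Node()
            mid.edges[text[start + k]] = (start + k, end, child)
            node.edges[first] = (start, start + k, mid)
            node, pos = mid, pos + k
    return root


def edge_labels(text, node):
    """Return nested (label, children) pairs, sorted by label, for inspection."""
    result = []
    for first in sorted(node.edges):
        start, end, child = node.edges[first]
        result.append((text[start:end], edge_labels(text, child)))
    return result
```

Calling `edge_labels("cocoa", build_suffix_tree("cocoa"))` yields the labels `a`, `co` (with children `a` and `coa`), and `o` (with children `a` and `coa`), matching the figure on the slide.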
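  6. Application: Longest Common Substring (LCS)
    • Input: a set S of two or more strings. Output: the longest substring that occurs in every string of S.
    • Build a suffix tree over S and, among the nodes reachable from the root, return the longest path whose subtree contains suffixes of both strings.
    • Example (figure): S = { s1 = banana, s2 = ana }, with the tree built over banana$ana and each node annotated with the strings (s1, s2) it covers.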
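
The idea on slide 6 can be sketched as follows. To keep the example short it uses an uncompressed suffix trie rather than a real suffix tree, so it takes quadratic time and space; only the "deepest node covered by every string" logic carries over. The function name `longest_common_substring` is a hypothetical choice for illustration.

```python
def longest_common_substring(strings):
    """Return one longest substring shared by all strings (naive suffix trie)."""
    # Each trie node is [children, ids]: children maps a character to a node,
    # ids is the set of string indices with a suffix passing through the node.
    root = [{}, set()]
    for sid, s in enumerate(strings):
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node[0].setdefault(ch, [{}, set()])
                node[1].add(sid)

    everyone = set(range(len(strings)))
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node[0].items():
            # Only descend into nodes that every input string reaches.
            if child[1] == everyone:
                label = path + ch
                if len(label) > len(best):
                    best = label
                stack.append((child, label))
    return best
```

With the slide's example, `longest_common_substring(["banana", "ana"])` returns `"ana"`.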
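  7. Suffix Tree
    • tl;dr: the tree obtained by arranging all suffixes of a string T in lexicographic order and merging their shared prefixes from the top.
    • T = cocoa
    • Suffixes: cocoa, ocoa, coa, oa, a
    • Sorted: a, coa, cocoa, oa, ocoa
    • Edge labels (figure): a; co → a, coa; o → a, coa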
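  8. Suffix Tree
    • tl;dr: the tree obtained by arranging all suffixes of a string T in lexicographic order and merging their shared prefixes from the top.
    • T = cocoa
    • Suffixes: cocoa, ocoa, coa, oa, a
    • Sorted: a, coa, cocoa, oa, ocoa
    • Suffix Tree (figure, edge labels): a; co → a, coa; o → a, coa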
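  9. Suffix Array
    • tl;dr: the array of start indices in the original string T, listed in the order obtained by sorting all suffixes of T.
    • T = cocoa
    • Suffixes (each tagged with its start index in T): cocoa, ocoa, coa, oa, a
    • Sorted: a, coa, cocoa, oa, ocoa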
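  10. Suffix Array
    • tl;dr: the array of start indices in the original string T, listed in the order obtained by sorting all suffixes of T.
    • T = cocoa
    • Suffixes (each tagged with its start index in T): cocoa, ocoa, coa, oa, a
    • Sorted: a, coa, cocoa, oa, ocoa
    • Suffix Array: the start indices read off in sorted order.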
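
As a minimal sketch of the definition above, the suffix array can be built by sorting the start indices by the suffix each one begins. This is the naive method, not SA-IS. The function name `build_suffix_array` is hypothetical, and 0-based indexing is an assumption, since the index values shown on the slide did not survive in the transcript.

```python
def build_suffix_array(text):
    """Naive construction: sort start indices by the suffix they begin.

    Each comparison may scan O(n) characters, so this costs O(n^2 log n)
    in the worst case.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


sa = build_suffix_array("cocoa")
print(sa)                          # [4, 2, 0, 3, 1]
print(["cocoa"[i:] for i in sa])   # ['a', 'coa', 'cocoa', 'oa', 'ocoa']
```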
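  11. Suffix Array
    • Construction time: the naive approach takes O(n^2 log n) time. The SA-IS algorithm builds the array in O(n) time (a curious method that sorts without picking up the log n factor).
    • Memory is trivially O(n).
    • Sorted suffixes (figure): a, coa, cocoa, oa, ocoa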
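  12. Pattern search using a Suffix Array
    • Binary search over the suffix array, comparing up to the pattern length m at each step: O(m log n) time (slow).
    • Using the LCP array (longest common prefix array) as an auxiliary data structure brings this down to O(m + log n) time.
    • The LCP array can be built from the suffix array in O(n) time.
    • Sorted suffixes (figure): a, coa, cocoa, oa, ocoa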
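
The two ingredients on slide 12 can be sketched as below. `find_occurrences` is the plain O(m log n) binary search; it does not use the LCP array to reach O(m + log n). `kasai_lcp` is an assumption about which linear-time LCP construction is meant: it follows Kasai's algorithm, which the slide does not name. All function names are hypothetical, the suffix array is built naively, and indices are 0-based.

```python
def build_suffix_array(text):
    # Naive construction, used here only to keep the example self-contained.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return sorted start positions of pattern, using two binary searches."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


def kasai_lcp(text, sa):
    """lcp[i] = common prefix length of suffixes sa[i-1] and sa[i]; lcp[0] = 0."""
    n = len(text)
    rank = [0] * n
    for i, start in enumerate(sa):
        rank[start] = i
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]  # suffix just before suffix i in sorted order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1  # the next suffix shares at least h - 1 characters
    return lcp
```

For `banana`, the suffix array is `[5, 3, 1, 0, 4, 2]`, `find_occurrences` reports `ana` at positions 1 and 3, and `kasai_lcp` gives `[0, 1, 3, 0, 0, 2]`.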
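  13. Relationship between Suffix Tree and Suffix Array
    • Historically, the suffix tree was discovered first, nearly two decades before the suffix array.
    • The suffix array was developed in a bioinformatics context, motivated by the need to reduce memory consumption.
    • Later, the LCP array was developed, making it possible to run many suffix tree algorithms on a suffix array.
    • For both ST and SA, many algorithms have been developed independently of the other structure.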
  14. Summary
    • Introduced the suffix tree and the suffix array.
    • Construction time O(n), memory O(n).
    • Problems such as search and longest common substring can be solved quickly.
    • Strings are fun!