hsasakawa
September 15, 2020

# Suffix Trees and Suffix Arrays

M3 Tech Talk #150

A quick introduction to the suffix tree and suffix array data structures

## Transcript

1. ### Suffix Trees and Suffix Arrays

M3 Tech Talk. Hirohito Sasakawa, Data engineer, AI and ML team @M3, inc.
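2. ### Suffix Tree and Suffix Array

- Index data structures for strings
- Index: an auxiliary data structure that provides efficient access to the original data
- For a string (or a set of strings), a suffix tree or suffix array quickly computes substring search, occurrence counts, the longest common substring, and more
- They have a huge number of practical applications (compression, bioinformatics, and so on)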
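3. ### By the way...

- Go's standard library includes a suffix array (`index/suffixarray`)
- The search part is just under 400 lines and simple, so it looks like good material for studying Go and algorithms
- The construction part is about 1,000 lines and implements a fast algorithm (SA-IS), so it looks worth reading slowly with the paper in hand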
4. ### Terminology

- Suffix: a string obtained by removing zero or more characters from the front of a string T
- Example: T = ababb
- Suffixes: ababb, babb, abb, bb, b, '' (the empty string)
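The definition can be checked with a few lines of Python. This is only an illustrative sketch; the function name `suffixes` is made up for the example and does not come from the talk.

```python
def suffixes(t):
    """Return every suffix of t, longest first, ending with the empty string."""
    # Dropping the first i characters for i = 0..len(t) gives all suffixes.
    return [t[i:] for i in range(len(t) + 1)]


print(suffixes("ababb"))
# ['ababb', 'babb', 'abb', 'bb', 'b', '']
```

A string of length n therefore has n + 1 suffixes when the empty string is counted.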

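7. ### Suffix Tree

- tl;dr: a tree made by sorting all suffixes of a string T in lexicographic order and bundling them together from the top, so that shared prefixes are merged
- T = cocoa
- Suffixes: cocoa, ocoa, coa, oa, a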
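8. ### Suffix Tree

- tl;dr: a tree made by sorting all suffixes of a string T in lexicographic order and bundling them together from the top, so that shared prefixes are merged
- T = cocoa
- Suffixes: cocoa, ocoa, coa, oa, a
- Sorted: a, coa, cocoa, oa, ocoa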
18. ### Suffix Tree

- T = cocoa
- Sorted suffixes: a, coa, cocoa, oa, ocoa
- Slides 11 to 18 draw the tree one edge at a time, in the order a, co, a, coa, o, a, coa
- Resulting suffix tree: the root has three outgoing edges a, co, and o; below co are the edges a and coa; below o are the edges a and coa
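19. ### Suffix Tree

- Construction time
  - O(n^2) time if done naively
  - O(n) time with Ukkonen's algorithm, which can also build the tree online
- Memory: O(n)
  - Note that it does not become O(n^2)
  - It is enough to store each edge as pointers to its start and end positions in the original string

Figure: the suffix tree of cocoa (edges a, co, a, coa, o, a, coa)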
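To make the memory argument concrete, here is a minimal Python sketch of the naive O(n^2) construction in which every edge is stored as a pair of indices into the text rather than as a copied substring. It is not Ukkonen's algorithm. The names `Node`, `build_suffix_tree`, and `edge_labels` are hypothetical, and the sketch assumes no suffix is a prefix of another suffix (true for cocoa; in general a unique terminator such as `$` is appended first).

```python
class Node:
    """A suffix tree node; the edge leading into it is text[start:end]."""

    def __init__(self, start=0, end=0):
        self.start = start   # edge label begins at this index of the text
        self.end = end       # edge label ends just before this index
        self.children = {}   # first character of an outgoing edge -> Node


def build_suffix_tree(text):
    """Insert the suffixes one by one (naive O(n^2) time, O(n) nodes)."""
    n = len(text)
    root = Node()
    for i in range(n):
        node, pos = root, i
        while pos < n:
            child = node.children.get(text[pos])
            if child is None:
                # No edge starts with this character: add a new leaf.
                node.children[text[pos]] = Node(pos, n)
                break
            # Walk along the edge while the characters agree.
            k = child.start
            while k < child.end and pos < n and text[k] == text[pos]:
                k += 1
                pos += 1
            if k == child.end:
                node = child        # whole edge matched, keep descending
                continue
            if pos == n:
                break               # suffix ended mid-edge (needs a terminator)
            # Mismatch inside the edge: split it and hang a new leaf there.
            middle = Node(child.start, k)
            node.children[text[child.start]] = middle
            child.start = k
            middle.children[text[k]] = child
            middle.children[text[pos]] = Node(pos, n)
            break
    return root


def edge_labels(text, node, depth=0):
    """List the edge labels in lexicographic order, indented by depth."""
    lines = []
    for ch in sorted(node.children):
        child = node.children[ch]
        lines.append("  " * depth + text[child.start:child.end])
        lines.extend(edge_labels(text, child, depth + 1))
    return lines


print("\n".join(edge_labels("cocoa", build_suffix_tree("cocoa"))))
```

Each node holds two integers and a child map, and the tree has at most 2n nodes, which is why the total stays O(n) even though the suffixes together contain O(n^2) characters.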
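20. ### Pattern search with a Suffix Tree

- Starting from the root node, just pick the matching edge and move down
- The number of leaves below the node you reach is the number of times the pattern occurs
- Search cost: O(m), where m is the pattern length
- → In other words, the search never scans the whole original data (nice!)

Figure: the suffix tree of cocoa (edges a, co, a, coa, o, a, coa)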
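The search can be sketched in Python as follows. For brevity the sketch uses an uncompressed suffix trie (one node per character, so O(n^2) space) instead of a real suffix tree, and it stores in each node how many suffixes pass through it, which equals the number of leaves below it. The names `build_suffix_trie` and `count_occurrences` are hypothetical.

```python
def build_suffix_trie(text):
    """Build a trie of all suffixes; each node counts the suffixes through it."""
    root = {"count": 0, "next": {}}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node["next"].setdefault(ch, {"count": 0, "next": {}})
            node["count"] += 1
    return root


def count_occurrences(root, pattern):
    """Follow the pattern from the root: O(m) steps for a pattern of length m."""
    node = root
    for ch in pattern:
        node = node["next"].get(ch)
        if node is None:
            return 0            # fell off the trie: the pattern does not occur
    return node["count"]


trie = build_suffix_trie("cocoa")
print(count_occurrences(trie, "co"))   # 2
print(count_occurrences(trie, "oa"))   # 1
print(count_occurrences(trie, "x"))    # 0
```

The loop in `count_occurrences` depends only on the pattern length, not on the length of the indexed text.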
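21. ### Application: Longest Common Substring (LCS)

- Input: a set S of two or more strings
- Output: the longest substring that appears in every string of S
- Build a suffix tree over S, then among the nodes reachable from the root return the longest path whose subtree contains suffixes of both strings
- Example: S = { s1 = banana, s2 = ana }

Figure: suffix tree over banana$ana, with edge labels such as `$ana`, `a`, `na`, `na$ana`, and `banana$ana`, and each node annotated with the strings it covers (s2, or s1, s2)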
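A compact way to try this idea in Python is sketched below. It is a simplification, not the construction drawn on the slide: instead of a suffix tree over strings joined with `$`, it builds an uncompressed trie of all suffixes of all strings and records in each node which strings reach it. The name `longest_common_substring` and the `owners` field are hypothetical.

```python
def longest_common_substring(strings):
    """Return a longest substring shared by every string in the list."""
    root = {"owners": set(), "next": {}}
    # Insert every suffix of every string, tagging nodes with the string id.
    for sid, s in enumerate(strings):
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node["next"].setdefault(ch, {"owners": set(), "next": {}})
                node["owners"].add(sid)

    everyone = set(range(len(strings)))
    best = ""
    stack = [(root, "")]
    # Depth-first walk restricted to nodes that all strings pass through.
    while stack:
        node, path = stack.pop()
        for ch, child in node["next"].items():
            if child["owners"] == everyone:
                candidate = path + ch
                if len(candidate) > len(best):
                    best = candidate
                stack.append((child, candidate))
    return best


print(longest_common_substring(["banana", "ana"]))   # ana
```

When several common substrings share the maximum length, this sketch returns whichever one it meets first.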

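38. ### Suffix Array

- tl;dr: the array of start positions in the original string T, in the order obtained by sorting all suffixes of T
- T = cocoa
- Suffixes with index (counting from 0): 0 cocoa, 1 ocoa, 2 coa, 3 oa, 4 a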
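39. ### Suffix Array

- tl;dr: the array of start positions in the original string T, in the order obtained by sorting all suffixes of T
- T = cocoa
- Suffixes with index (counting from 0): 0 cocoa, 1 ocoa, 2 coa, 3 oa, 4 a
- Sorted: 4 a, 2 coa, 0 cocoa, 3 oa, 1 ocoa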
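40. ### Suffix Array

- tl;dr: the array of start positions in the original string T, in the order obtained by sorting all suffixes of T
- T = cocoa
- Suffixes with index (counting from 0): 0 cocoa, 1 ocoa, 2 coa, 3 oa, 4 a
- Sorted: 4 a, 2 coa, 0 cocoa, 3 oa, 1 ocoa
- Suffix Array: the index column of the sorted list, 4, 2, 0, 3, 1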
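The definition translates almost word for word into Python. This is the naive construction, shown only to make the definition concrete; the function name `suffix_array` is hypothetical and positions are counted from 0.

```python
def suffix_array(text):
    """Sort the start positions 0..n-1 by the suffix that begins at each one."""
    return sorted(range(len(text)), key=lambda i: text[i:])


sa = suffix_array("cocoa")
print(sa)                                # [4, 2, 0, 3, 1]
print(["cocoa"[i:] for i in sa])         # ['a', 'coa', 'cocoa', 'oa', 'ocoa']
```

Only the integers are kept; the suffixes themselves are read back from the original string whenever they are needed.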
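41. ### Suffix Array

- Construction time
  - O(n^2 log n) time if done naively
  - O(n) time with the SA-IS algorithm (a curious method that sorts without picking up a log n factor)
- Memory is trivially O(n)

Figure: the sorted suffixes a, coa, cocoa, oa, ocoa with their indices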
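SA-IS itself is too long to show here. As a middle ground between the naive sort and SA-IS, the Python sketch below uses prefix doubling, a different and simpler technique that the talk does not cover: it sorts suffixes by their first k characters, then 2k, 4k, and so on, for O(n log^2 n) time with a comparison sort. The function name `suffix_array_doubling` is hypothetical.

```python
def suffix_array_doubling(text):
    """Build a suffix array by prefix doubling (not SA-IS)."""
    n = len(text)
    sa = list(range(n))
    rank = [ord(c) for c in text]     # rank by the first character
    k = 1
    while n > 1:
        def key(i):
            # Rank of the first k characters, then of the next k characters.
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        # Re-rank: suffixes with equal keys share a rank.
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:     # all ranks distinct: fully sorted
            break
        k *= 2
    return sa


print(suffix_array_doubling("cocoa"))   # [4, 2, 0, 3, 1]
```

Each round compares fixed-size rank pairs instead of whole suffixes, which removes the factor of n that string comparison adds to the naive method.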
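42. ### Pattern search with a Suffix Array

- Binary search over the suffix array, comparing up to m characters of the pattern at each step: O(m log n) time (slow)
- Using the LCP array (longest common prefix array) as an auxiliary data structure brings this down to O(m + log n) time
- The LCP array can be built from the suffix array in O(n) time

Figure: the sorted suffixes a, coa, cocoa, oa, ocoa with their indices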
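The two building blocks can be sketched in Python as below: the plain O(m log n) binary search, and a linear-time LCP construction in the style of Kasai's algorithm. The LCP-accelerated O(m + log n) search is not shown. The names `find_occurrences` and `lcp_array` are hypothetical, and `lcp[k]` is taken to mean the common prefix length of the suffixes at `sa[k - 1]` and `sa[k]`, with `lcp[0] = 0`.

```python
def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of pattern, using two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                       # first suffix whose prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:                       # first suffix whose prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


def lcp_array(text, sa):
    """Build the LCP array from the suffix array in O(n) time."""
    n = len(text)
    rank = [0] * n
    for pos, start in enumerate(sa):
        rank[start] = pos                # inverse of the suffix array
    lcp = [0] * n
    h = 0
    for i in range(n):                   # visit suffixes in text order
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]              # predecessor in sorted order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1                       # the next suffix keeps at least h - 1
    return lcp


sa = [4, 2, 0, 3, 1]                     # suffix array of "cocoa"
print(find_occurrences("cocoa", sa, "co"))   # [0, 2]
print(lcp_array("cocoa", sa))                # [0, 0, 2, 0, 1]
```

All occurrences of a pattern sit in one contiguous range of the suffix array, so counting them is just the difference between the two bounds.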
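43. ### Relationship between Suffix Tree and Suffix Array

- Historically, the suffix tree came first, roughly two decades before the suffix array
- The suffix array was developed in a bioinformatics context, motivated by the wish to cut memory consumption
- The LCP array was developed later, making it possible to run many suffix tree algorithms on a suffix array
- For both ST and SA, many algorithms specific to each structure have since been developed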
44. ### Summary

- Introduced the suffix tree and the suffix array
- Construction time O(n), memory O(n)
- Both solve search, longest common substring, and similar problems quickly
- Strings are fun!