Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Suffix Trees and Suffix Arrays
Search
hsasakawa
September 15, 2020
Science
0
120
Suffix Trees and Suffix Arrays
M3 Tech Talk #150
Quick introduction of suffix trees and arrays data structures
hsasakawa
September 15, 2020
Tweet
Share
More Decks by hsasakawa
See All by hsasakawa
行動ログ処理基盤の構築
hsasakawa
0
3.3k
冪等性を考慮したデータ連携ジョブの設計
hsasakawa
6
2.1k
Data platform development on M3 USA
hsasakawa
0
1.1k
Data Analysis Platform Development @M3, inc.
hsasakawa
0
3.4k
Other Decks in Science
See All in Science
Design of three-dimensional binary manipulators for pick-and-place task avoiding obstacles (IECON2024)
konakalab
0
220
モンテカルロDCF法による事業価値の算出(モンテカルロ法とベイズモデリング) / Business Valuation Using Monte Carlo DCF Method (Monte Carlo Simulation and Bayesian Modeling)
ikuma_w
0
180
Cross-Media Information Spaces and Architectures (CISA)
signer
PRO
3
31k
mathematics of indirect reciprocity
yohm
1
150
2025-06-11-ai_belgium
sofievl
1
130
データマイニング - グラフデータと経路
trycycle
PRO
1
150
3次元点群を利用した植物の葉の自動セグメンテーションについて
kentaitakura
2
1.3k
A Guide to Academic Writing Using Generative AI - A Workshop
ks91
PRO
0
120
機械学習 - ニューラルネットワーク入門
trycycle
PRO
0
810
IWASAKI Hideo
genomethica
0
110
メール送信サーバの集約における透過型SMTP プロキシの定量評価 / Quantitative Evaluation of Transparent SMTP Proxy in Email Sending Server Aggregation
linyows
0
940
データベース09: 実体関連モデル上の一貫性制約
trycycle
PRO
0
700
Featured
See All Featured
Being A Developer After 40
akosma
90
590k
The Psychology of Web Performance [Beyond Tellerrand 2023]
tammyeverts
48
2.9k
実際に使うSQLの書き方 徹底解説 / pgcon21j-tutorial
soudai
PRO
181
53k
How STYLIGHT went responsive
nonsquared
100
5.6k
Making the Leap to Tech Lead
cromwellryan
134
9.4k
A Modern Web Designer's Workflow
chriscoyier
694
190k
Optimizing for Happiness
mojombo
379
70k
[RailsConf 2023 Opening Keynote] The Magic of Rails
eileencodes
29
9.5k
Fashionably flexible responsive web design (full day workshop)
malarkey
407
66k
Code Reviewing Like a Champion
maltzj
524
40k
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
138
34k
The Invisible Side of Design
smashingmag
301
51k
Transcript
Suffix Trees and Suffix Arrays M3 Tech Talk Hirohito Sasakawa,
Data engineer, AI and ML team @M3, inc.
Suffix Tree ͱ Suffix Array • จࣈྻʹର͢ΔࡧҾσʔλߏ • ࡧҾ (index):
ݩσʔλʹର͢ΔΞΫηεΛޮΑ͘ఏڙ͢Δิॿσʔλߏ • Suffix (Tree | Array) ɼจࣈྻ ( or จࣈྻू߹) ʹରͯ͠෦จࣈྻ ͷݕࡧɼසɼ࠷ڞ௨෦จࣈྻͳͲΛߴʹܭࢉ͢Δ • ࣮༻తͳԠ༻͕ɼΊͪΌͪ͘Ό͋Δ (ѹॖͱ͔ɼόΠΦܥͱ͔)
ͪͳΈʹ… • Goͷඪ४ϥΠϒϥϦʹSuffix Arrayؚ͕·Ε͍ͯΔ • ݕࡧ෦400ߦऑͰγϯϓϧ • GoͱΞϧΰϦζϜͷษڧʹ ྑͦ͞͏ ߏங෦1000ߦఔͰɼߴͳ
ΞϧΰϦζϜ (SA-IS) ͕࣮͞Ε͍ͯΔ จยखʹΏͬ͘ΓಡΉͷ͕ྑͦ͞͏
༻ޠͷ४උ • Suffix (ඌࣙ): จࣈྻTʹରͯ͠ɼઌ಄0จࣈҎ্Λͬͨจࣈྻ ྫ: T = ababb ඌࣙ:
ababb, babb, abb, bb, b, ‘’ (ۭจࣈ)
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o a
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o a coa
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o a coa Suffix Tree
Suffix Tree • ߏஙʹ͔͔Δ࣌ؒ ී௨ʹΔͱO(n^2) ࣌ؒ Ukkonen’s Algorithm ͩͱ
Online࡞ՄೳͰ O(n) ࣌ؒ a co a coa o a coa • ϝϞϦ: O(n) O(n^2) ʹͳΒͳ͍͜ͱʹҙ ݩͷจࣈྻʹର͢Δ࢝ɼऴΛϙΠϯλͰอ࣋͢ΕΑ͍
Suffix TreeΛ༻͍ͨύλʔϯͷݕࡧ • rootϊʔυ͔ΒࢬΛબΜͰਐΉ͚ͩ • ౸ୡͨ͠ϊʔυͷؚ͕࣍·ΕΔසʹͳΔ • ݕࡧͷܭࢉྔ: O(m) ͜͜Ͱmύλʔϯ
ˠ ͭ·ΓݩσʔλΛશ෦ᢞΊͳ͍ (ͬͨͶ) a co a coa o a coa
Ԡ༻: Longest Common Substrings (LCS) • ೖྗ: 2ͭҎ্ͷจࣈྻू߹S ग़ྗ:
Sʹڞ௨ͯ͠ݱΕΔ෦จࣈྻͷ͏ͪ࠷ͷͷ • Sʹରͯ͠Suffix TreeΛߏங͠ɼ root͔ΒḷΕΔϊʔυͷ͏ͪɼ྆ऀΛؚΉ ࠷ͷ෦Λฦͤྑ͍ $ana a $ana na banana$ana na S = { s1 = banana, s2 = ana } $ana na$ana $ana na$ana s2 s2 s2 s1, s2 s1, s2 s2 s2 s2
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o a
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o a coa
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o a coa Suffix Tree
Suffix Array • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛιʔτͨ࣌͠ͷݩͷจࣈྻͷindexͷྻ
Suffix Array • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛιʔτͨ࣌͠ͷݩͷจࣈྻͷindexͷྻ T = cocoa
Suffix Array • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛιʔτͨ࣌͠ͷݩͷจࣈྻͷindexͷྻ T = cocoa ඌࣙs
with index cocoa ocoa coa oa a
Suffix Array • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛιʔτͨ࣌͠ͷݩͷจࣈྻͷindexͷྻ T = cocoa ඌࣙs
with index cocoa ocoa coa oa a a coa cocoa oa ocoa ιʔτ
Suffix Array • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛιʔτͨ࣌͠ͷݩͷจࣈྻͷindexͷྻ T = cocoa ඌࣙs
with index cocoa ocoa coa oa a a coa cocoa oa ocoa ιʔτ Suffix Array
Suffix Array • ߏஙʹ͔͔Δ࣌ؒ ී௨ʹΔͱ O(n^2 logn) ࣌ؒ SA-ISΞϧΰϦζϜʹΑΓ O(n)
࣌ؒͰߏஙͰ͖Δ (ιʔτ͢Δͷʹlogn͕͔ͭͳ͍ෆࢥٞͳํ๏) • ϝϞϦࣗ໌ʹO(n) a coa cocoa oa ocoa
Suffix ArrayΛ༻͍ͨύλʔϯͷݕࡧ • Suffix Array্Λύλʔϯ͚ͩ܁Γฦ͠ೋ୳ࡧ͢Δ O(m logn) ࣌ؒ (͍) •
LCP Array (࠷ڞ௨಄ࣙྻ) Λ ิॿσʔλߏͱͯͬͯ͠ O(m + log n) ࣌ؒΛୡͰ͖Δ • LCP ArraySuffix Array͔ΒO(n)࣌ؒͰߏஙՄೳ a coa cocoa oa ocoa
Suffix Tree ͱ Suffix Array ͷؔ • ྺ࢙తʹ Suffix Tree
ͷํ͕20Ҏ্ૣ͘ൃݟ͞Εͨ • όΠΦͷจ຺ͰϝϞϦফඅΛམͱ͍ͨ͠Ϟνϕʔγϣϯ͔Β Suffix Array ͕։ൃ͞Εͨ • ͷͪʹ LCP Array ͕։ൃ͞ΕɼSuffix Tree্ͷΞϧΰϦζϜͷଟ͘ ΛSuffix Array্Ͱ࣮ߦͰ͖ΔΑ͏ʹͳͬͨ • ST, SA྆ํͱɼݸผʹൃలͨ͠ΞϧΰϦζϜ͕ଟ։ൃ͞Ε͍ͯΔ
·ͱΊ • Suffix Tree ͱ Suffix Array ʹ͍ͭͯհͨ͠ • ߏங࣌ؒ
O(n)ɼϝϞϦO(n) • ݕࡧɼ࠷ڞ௨෦จࣈྻͳͲΛߴʹղ͘͜ͱ͕Ͱ͖Δ • จࣈྻָ͍͠Αʂ