Suffix Trees and Suffix Arrays
hsasakawa
September 15, 2020
M3 Tech Talk #150
A quick introduction to the suffix tree and suffix array data structures
Transcript
Suffix Trees and Suffix Arrays M3 Tech Talk Hirohito Sasakawa,
Data engineer, AI and ML team @M3, inc.
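Suffix Tree and Suffix Array
• Index data structures for strings
• Index: an auxiliary data structure that provides efficient access to the original data
• A suffix tree or suffix array is built over a string (or a set of strings) and quickly answers queries such as substring search, occurrence counts, and longest common substring
• They have a great many practical applications (compression, bioinformatics, and so on)

By the way…
• Go's standard library includes a suffix array (index/suffixarray)
• The search part is a little under 400 lines and simple, so it looks like good material for studying both Go and algorithms
• The construction part is about 1,000 lines and implements an advanced algorithm (SA-IS); it is probably best read slowly with the paper in hand

Terminology
• Suffix: the string obtained by removing zero or more characters from the front of a string T
• Example: T = ababb
• Suffixes: ababb, babb, abb, bb, b, '' (the empty string)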
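
The definition above can be checked with a few lines of Python. This is a minimal sketch; the function name `suffixes` is an illustrative choice and does not come from the talk.

```python
def suffixes(text):
    """Return every suffix of text, longest first, ending with the empty string."""
    # Dropping the first i characters (i = 0 .. len(text)) yields each suffix once.
    return [text[i:] for i in range(len(text) + 1)]


print(suffixes("ababb"))
```

For T = ababb this lists the same six suffixes as the slide, including the empty string.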
Suffix Tree
• tl;dr: a tree obtained by sorting all suffixes of a string T in lexicographic order and bundling their shared prefixes together from the top
• Example: T = cocoa
• Suffixes: cocoa, ocoa, coa, oa, a
• Sorted: a, coa, cocoa, oa, ocoa
• Resulting suffix tree (edge labels): root → a; root → co → (a | coa); root → o → (a | coa)
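Suffix Tree
• Construction time: O(n^2) if done naively; O(n) with Ukkonen's algorithm, which can also build the tree online
• Memory: O(n). Note that it does not become O(n^2): each edge only needs to hold pointers to the start and end of its label in the original string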
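
To make the pointer idea concrete, here is a minimal sketch of the naive O(n^2) construction (not Ukkonen's algorithm). Each edge stores only a `(start, end)` pair into the text. The names `Node`, `build_suffix_tree`, and `to_dict` are illustrative, and the sketch assumes the last character of the text occurs nowhere else (append a sentinel such as `$` otherwise).

```python
class Node:
    def __init__(self, start=0, end=0):
        # The incoming edge label is text[start:end]; only two integers are stored.
        self.start = start
        self.end = end
        self.children = {}  # first character of the child edge -> Node


def build_suffix_tree(text):
    """Naive construction: insert each suffix in turn, splitting edges on mismatch.

    Assumption: the last character of text is unique within text.
    """
    root = Node()
    n = len(text)
    for i in range(n):
        node, pos = root, i
        while pos < n:
            child = node.children.get(text[pos])
            if child is None:
                node.children[text[pos]] = Node(pos, n)  # new leaf
                break
            first = text[pos]
            k = child.start
            while k < child.end and pos < n and text[k] == text[pos]:
                k += 1
                pos += 1
            if k == child.end:
                node = child  # whole edge matched, keep descending
                continue
            # Mismatch inside the edge: split it at position k.
            mid = Node(child.start, k)
            child.start = k
            mid.children[text[k]] = child
            node.children[first] = mid
            if pos < n:
                mid.children[text[pos]] = Node(pos, n)
            break
    return root


def to_dict(text, node):
    """Expand the (start, end) pointers into edge labels for inspection."""
    return {text[c.start:c.end]: to_dict(text, c) for c in node.children.values()}


print(to_dict("cocoa", build_suffix_tree("cocoa")))
```

For T = cocoa the printed structure has the edges a, co → (a, coa), and o → (a, coa), matching the tree on the slide.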
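Pattern Search with a Suffix Tree
• Just start at the root node and follow the edges that match the pattern
• The number of leaves below the node you reach is the number of occurrences of the pattern (in the cocoa example this equals the node's degree)
• Search cost: O(m), where m is the pattern length
→ In other words, you never scan the whole original data (nice!)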
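
The search described above can be sketched as follows. To keep the example short it uses an uncompressed suffix trie (one character per edge, O(n^2) space) as a stand-in for a real suffix tree, and stores at each node how many suffixes pass through it, which equals the number of leaves below the matching suffix tree node. The names `build_suffix_trie` and `count_occurrences` are illustrative, and the pattern is assumed to be non-empty.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie, counting suffixes per node."""
    root = {"count": 0, "next": {}}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node["next"].setdefault(ch, {"count": 0, "next": {}})
            node["count"] += 1  # one more suffix passes through this node
    return root


def count_occurrences(trie, pattern):
    """Walk down from the root following pattern; O(m) steps for length m."""
    node = trie
    for ch in pattern:
        node = node["next"].get(ch)
        if node is None:
            return 0  # fell off the trie: the pattern does not occur
    return node["count"]


trie = build_suffix_trie("cocoa")
print(count_occurrences(trie, "co"))
```

The lookup touches only as many nodes as the pattern has characters, regardless of the length of the indexed text.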
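Application: Longest Common Substring (LCS)
• Input: a set S of two or more strings
• Output: the longest substring that appears in every string of S
• Build a suffix tree over S and, among the nodes reachable from the root, return the longest path whose subtree contains suffixes of all the strings
• Example: S = { s1 = banana, s2 = ana }
[Figure: suffix tree for banana$ana, with each node marked by the strings (s1, s2) that contain it]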
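
A minimal sketch of this idea, again using an uncompressed suffix trie rather than a true generalized suffix tree, so it takes O(n^2) time and space and is only meant for small inputs. Each node records which input strings pass through it; the deepest node shared by all of them spells the answer. The name `longest_common_substring` and the dictionary layout are illustrative assumptions.

```python
def longest_common_substring(strings):
    """Return one longest substring common to all strings ('' if none)."""
    root = {"owners": set(), "next": {}}
    for sid, s in enumerate(strings):
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node["next"].setdefault(ch, {"owners": set(), "next": {}})
                node["owners"].add(sid)  # string sid contains this path label

    everyone = set(range(len(strings)))
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for ch, child in node["next"].items():
            # Only descend into nodes that every input string reaches.
            if child["owners"] == everyone:
                stack.append((child, path + ch))
    return best


print(longest_common_substring(["banana", "ana"]))
```

With the slide's example S = { banana, ana } the deepest node shared by both strings is the one labelled ana.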
Suffix Array
• tl;dr: the sequence of start indexes in the original string T obtained when all suffixes of T are sorted
• Example: T = cocoa
• Suffixes (each with its start index): cocoa, ocoa, coa, oa, a
• Sorted: a, coa, cocoa, oa, ocoa
• The suffix array is the list of start indexes in this sorted order
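
As a minimal sketch of the definition, the suffix array can be built by sorting the start positions by the suffix that begins there. This is the naive construction, not SA-IS; the name `build_suffix_array` is illustrative, and positions are assumed to be 0-based.

```python
def build_suffix_array(text):
    """Naive construction: sort start positions by the suffix starting there."""
    # Each comparison may look at up to n characters, so this is O(n^2 log n).
    return sorted(range(len(text)), key=lambda i: text[i:])


sa = build_suffix_array("cocoa")
print(sa)                         # start positions in sorted-suffix order
print(["cocoa"[i:] for i in sa])  # the suffixes themselves, for comparison
```

For T = cocoa the sorted suffixes are a, coa, cocoa, oa, ocoa, so the array holds their start positions 4, 2, 0, 3, 1.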
Suffix Array
• Construction time: O(n^2 log n) if done naively; the SA-IS algorithm builds it in O(n) time (a curious method that sorts without paying the log n factor)
• Memory is trivially O(n)
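Pattern Search with a Suffix Array
• Binary search over the suffix array, comparing up to m pattern characters at each step: O(m log n) time
• Using an LCP array (longest common prefix array) as an auxiliary data structure brings this down to O(m + log n) time
• The LCP array can be built from the suffix array in O(n) time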
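
The following sketch illustrates two of the points above: the plain O(m log n) binary search, and a linear-time LCP array construction (Kasai's algorithm, chosen here as one well-known option; the talk does not name a specific method). The O(m + log n) search that combines the two is not shown. All function names are illustrative, and the suffix array is built naively for brevity.

```python
def build_suffix_array(text):
    """Naive suffix array, used here only to keep the example self-contained."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return sorted start positions of pattern, via two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


def kasai_lcp(text, sa):
    """lcp[r] = length of the common prefix of suffixes sa[r-1] and sa[r]; lcp[0] = 0."""
    n = len(text)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]  # suffix just before suffix i in sorted order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1  # the next suffix shares at least h - 1 characters
    return lcp


sa = build_suffix_array("cocoa")
print(find_occurrences("cocoa", sa, "co"))
print(kasai_lcp("cocoa", sa))
```

All suffixes that start with the pattern sit next to each other in the suffix array, so the two binary searches find the boundaries of one contiguous block.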
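Relationship between Suffix Trees and Suffix Arrays
• Historically, the suffix tree was discovered roughly 20 years earlier
• The suffix array was developed in a bioinformatics context, motivated by the desire to reduce memory consumption
• The LCP array was developed later, which made it possible to run many suffix tree algorithms on suffix arrays
• For both suffix trees and suffix arrays, many algorithms have since been developed independently

Summary
• Introduced suffix trees and suffix arrays
• Construction time O(n), memory O(n)
• They solve problems such as search and longest common substring quickly
• Strings are fun!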