Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Suffix Trees and Suffix Arrays
Search
hsasakawa
September 15, 2020
Science
0
120
Suffix Trees and Suffix Arrays
M3 Tech Talk #150
Quick introduction of suffix trees and arrays data structures
hsasakawa
September 15, 2020
Tweet
Share
More Decks by hsasakawa
See All by hsasakawa
行動ログ処理基盤の構築
hsasakawa
0
3.4k
冪等性を考慮したデータ連携ジョブの設計
hsasakawa
6
2.1k
Data platform development on M3 USA
hsasakawa
0
1.2k
Data Analysis Platform Development @M3, inc.
hsasakawa
0
3.4k
Other Decks in Science
See All in Science
データマイニング - グラフ構造の諸指標
trycycle
PRO
0
200
知能とはなにかーヒトとAIのあいだー
tagtag
0
110
データベース14: B+木 & ハッシュ索引
trycycle
PRO
0
500
KH Coderチュートリアル(スライド版)
koichih
1
50k
Optimization of the Tournament Format for the Nationwide High School Kyudo Competition in Japan
konakalab
0
110
Celebrate UTIG: Staff and Student Awards 2025
utig
0
310
データベース08: 実体関連モデルとは?
trycycle
PRO
0
970
データベース09: 実体関連モデル上の一貫性制約
trycycle
PRO
0
1k
baseballrによるMLBデータの抽出と階層ベイズモデルによる打率の推定 / TokyoR118
dropout009
2
600
生成AIと学ぶPythonデータ分析再入門-Pythonによるクラスタリング・可視化をサクサク実施-
datascientistsociety
PRO
4
1.8k
防災デジタル分野での官民共創の取り組み (1)防災DX官民共創をどう進めるか
ditccsugii
0
340
データマイニング - コミュニティ発見
trycycle
PRO
0
160
Featured
See All Featured
Let's Do A Bunch of Simple Stuff to Make Websites Faster
chriscoyier
508
140k
Principles of Awesome APIs and How to Build Them.
keavy
127
17k
Practical Tips for Bootstrapping Information Extraction Pipelines
honnibal
PRO
23
1.5k
Bash Introduction
62gerente
615
210k
Put a Button on it: Removing Barriers to Going Fast.
kastner
60
4.1k
A designer walks into a library…
pauljervisheath
209
24k
Dealing with People You Can't Stand - Big Design 2015
cassininazir
367
27k
Sharpening the Axe: The Primacy of Toolmaking
bcantrill
46
2.5k
Building a Modern Day E-commerce SEO Strategy
aleyda
44
7.9k
Agile that works and the tools we love
rasmusluckow
331
21k
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
140
34k
Build your cross-platform service in a week with App Engine
jlugia
234
18k
Transcript
Suffix Trees and Suffix Arrays M3 Tech Talk Hirohito Sasakawa,
Data engineer, AI and ML team @M3, inc.
Suffix Tree ͱ Suffix Array • จࣈྻʹର͢ΔࡧҾσʔλߏ • ࡧҾ (index):
ݩσʔλʹର͢ΔΞΫηεΛޮΑ͘ఏڙ͢Δิॿσʔλߏ • Suffix (Tree | Array) ɼจࣈྻ ( or จࣈྻू߹) ʹରͯ͠෦จࣈྻ ͷݕࡧɼසɼ࠷ڞ௨෦จࣈྻͳͲΛߴʹܭࢉ͢Δ • ࣮༻తͳԠ༻͕ɼΊͪΌͪ͘Ό͋Δ (ѹॖͱ͔ɼόΠΦܥͱ͔)
ͪͳΈʹ… • Goͷඪ४ϥΠϒϥϦʹSuffix Arrayؚ͕·Ε͍ͯΔ • ݕࡧ෦400ߦऑͰγϯϓϧ • GoͱΞϧΰϦζϜͷษڧʹ ྑͦ͞͏ ߏங෦1000ߦఔͰɼߴͳ
ΞϧΰϦζϜ (SA-IS) ͕࣮͞Ε͍ͯΔ จยखʹΏͬ͘ΓಡΉͷ͕ྑͦ͞͏
༻ޠͷ४උ • Suffix (ඌࣙ): จࣈྻTʹରͯ͠ɼઌ಄0จࣈҎ্Λͬͨจࣈྻ ྫ: T = ababb ඌࣙ:
ababb, babb, abb, bb, b, ‘’ (ۭจࣈ)
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o a
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o a coa
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o a coa Suffix Tree
Suffix Tree • ߏஙʹ͔͔Δ࣌ؒ ී௨ʹΔͱO(n^2) ࣌ؒ Ukkonen’s Algorithm ͩͱ
Online࡞ՄೳͰ O(n) ࣌ؒ a co a coa o a coa • ϝϞϦ: O(n) O(n^2) ʹͳΒͳ͍͜ͱʹҙ ݩͷจࣈྻʹର͢Δ࢝ɼऴΛϙΠϯλͰอ࣋͢ΕΑ͍
Suffix TreeΛ༻͍ͨύλʔϯͷݕࡧ • rootϊʔυ͔ΒࢬΛબΜͰਐΉ͚ͩ • ౸ୡͨ͠ϊʔυͷؚ͕࣍·ΕΔසʹͳΔ • ݕࡧͷܭࢉྔ: O(m) ͜͜Ͱmύλʔϯ
ˠ ͭ·ΓݩσʔλΛશ෦ᢞΊͳ͍ (ͬͨͶ) a co a coa o a coa
Ԡ༻: Longest Common Substrings (LCS) • ೖྗ: 2ͭҎ্ͷจࣈྻू߹S ग़ྗ:
Sʹڞ௨ͯ͠ݱΕΔ෦จࣈྻͷ͏ͪ࠷ͷͷ • Sʹରͯ͠Suffix TreeΛߏங͠ɼ root͔ΒḷΕΔϊʔυͷ͏ͪɼ྆ऀΛؚΉ ࠷ͷ෦Λฦͤྑ͍ $ana a $ana na banana$ana na S = { s1 = banana, s2 = ana } $ana na$ana $ana na$ana s2 s2 s2 s1, s2 s1, s2 s2 s2 s2
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o a
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o a coa
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o a coa Suffix Tree
Suffix Array • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛιʔτͨ࣌͠ͷݩͷจࣈྻͷindexͷྻ
Suffix Array • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛιʔτͨ࣌͠ͷݩͷจࣈྻͷindexͷྻ T = cocoa
Suffix Array • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛιʔτͨ࣌͠ͷݩͷจࣈྻͷindexͷྻ T = cocoa ඌࣙs
with index cocoa ocoa coa oa a
Suffix Array • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛιʔτͨ࣌͠ͷݩͷจࣈྻͷindexͷྻ T = cocoa ඌࣙs
with index cocoa ocoa coa oa a a coa cocoa oa ocoa ιʔτ
Suffix Array • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛιʔτͨ࣌͠ͷݩͷจࣈྻͷindexͷྻ T = cocoa ඌࣙs
with index cocoa ocoa coa oa a a coa cocoa oa ocoa ιʔτ Suffix Array
Suffix Array • ߏஙʹ͔͔Δ࣌ؒ ී௨ʹΔͱ O(n^2 logn) ࣌ؒ SA-ISΞϧΰϦζϜʹΑΓ O(n)
࣌ؒͰߏஙͰ͖Δ (ιʔτ͢Δͷʹlogn͕͔ͭͳ͍ෆࢥٞͳํ๏) • ϝϞϦࣗ໌ʹO(n) a coa cocoa oa ocoa
Suffix ArrayΛ༻͍ͨύλʔϯͷݕࡧ • Suffix Array্Λύλʔϯ͚ͩ܁Γฦ͠ೋ୳ࡧ͢Δ O(m logn) ࣌ؒ (͍) •
LCP Array (࠷ڞ௨಄ࣙྻ) Λ ิॿσʔλߏͱͯͬͯ͠ O(m + log n) ࣌ؒΛୡͰ͖Δ • LCP ArraySuffix Array͔ΒO(n)࣌ؒͰߏஙՄೳ a coa cocoa oa ocoa
Suffix Tree ͱ Suffix Array ͷؔ • ྺ࢙తʹ Suffix Tree
ͷํ͕20Ҏ্ૣ͘ൃݟ͞Εͨ • όΠΦͷจ຺ͰϝϞϦফඅΛམͱ͍ͨ͠Ϟνϕʔγϣϯ͔Β Suffix Array ͕։ൃ͞Εͨ • ͷͪʹ LCP Array ͕։ൃ͞ΕɼSuffix Tree্ͷΞϧΰϦζϜͷଟ͘ ΛSuffix Array্Ͱ࣮ߦͰ͖ΔΑ͏ʹͳͬͨ • ST, SA྆ํͱɼݸผʹൃలͨ͠ΞϧΰϦζϜ͕ଟ։ൃ͞Ε͍ͯΔ
·ͱΊ • Suffix Tree ͱ Suffix Array ʹ͍ͭͯհͨ͠ • ߏங࣌ؒ
O(n)ɼϝϞϦO(n) • ݕࡧɼ࠷ڞ௨෦จࣈྻͳͲΛߴʹղ͘͜ͱ͕Ͱ͖Δ • จࣈྻָ͍͠Αʂ