Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Suffix Trees and Suffix Arrays
Search
hsasakawa
September 15, 2020
Science
0
120
Suffix Trees and Suffix Arrays
M3 Tech Talk #150
Quick introduction of suffix trees and arrays data structures
hsasakawa
September 15, 2020
Tweet
Share
More Decks by hsasakawa
See All by hsasakawa
行動ログ処理基盤の構築
hsasakawa
0
3.4k
冪等性を考慮したデータ連携ジョブの設計
hsasakawa
6
2.1k
Data platform development on M3 USA
hsasakawa
0
1.2k
Data Analysis Platform Development @M3, inc.
hsasakawa
0
3.4k
Other Decks in Science
See All in Science
データベース11: 正規化(1/2) - 望ましくない関係スキーマ
trycycle
PRO
0
990
機械学習 - pandas入門
trycycle
PRO
0
360
データベース15: ビッグデータ時代のデータベース
trycycle
PRO
0
380
イロレーティングを活用した関東大学サッカーの定量的実力評価 / A quantitative performance evaluation of Kanto University Football Association using Elo rating
konakalab
0
110
機械学習 - 決定木からはじめる機械学習
trycycle
PRO
0
1.2k
A Guide to Academic Writing Using Generative AI - A Workshop
ks91
PRO
0
150
会社でMLモデルを作るとは @電気通信大学 データアントレプレナーフェロープログラム
yuto16
1
390
データマイニング - コミュニティ発見
trycycle
PRO
0
170
【論文紹介】Is CLIP ideal? No. Can we fix it?Yes! 第65回 コンピュータビジョン勉強会@関東
shun6211
5
1.6k
ド文系だった私が、 KaggleのNCAAコンペでソロ金取れるまで
wakamatsu_takumu
2
1.6k
データベース05: SQL(2/3) 結合質問
trycycle
PRO
0
840
データマイニング - ノードの中心性
trycycle
PRO
0
310
Featured
See All Featured
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
285
14k
[RailsConf 2023 Opening Keynote] The Magic of Rails
eileencodes
31
9.8k
Building Better People: How to give real-time feedback that sticks.
wjessup
370
20k
How To Stay Up To Date on Web Technology
chriscoyier
791
250k
Building an army of robots
kneath
306
46k
[RailsConf 2023] Rails as a piece of cake
palkan
57
6.1k
Put a Button on it: Removing Barriers to Going Fast.
kastner
60
4.1k
Why Our Code Smells
bkeepers
PRO
340
57k
Imperfection Machines: The Place of Print at Facebook
scottboms
269
13k
Become a Pro
speakerdeck
PRO
30
5.6k
Performance Is Good for Brains [We Love Speed 2024]
tammyeverts
12
1.3k
Agile that works and the tools we love
rasmusluckow
331
21k
Transcript
Suffix Trees and Suffix Arrays M3 Tech Talk Hirohito Sasakawa,
Data engineer, AI and ML team @M3, inc.
Suffix Tree ͱ Suffix Array • จࣈྻʹର͢ΔࡧҾσʔλߏ • ࡧҾ (index):
ݩσʔλʹର͢ΔΞΫηεΛޮΑ͘ఏڙ͢Δิॿσʔλߏ • Suffix (Tree | Array) ɼจࣈྻ ( or จࣈྻू߹) ʹରͯ͠෦จࣈྻ ͷݕࡧɼසɼ࠷ڞ௨෦จࣈྻͳͲΛߴʹܭࢉ͢Δ • ࣮༻తͳԠ༻͕ɼΊͪΌͪ͘Ό͋Δ (ѹॖͱ͔ɼόΠΦܥͱ͔)
ͪͳΈʹ… • Goͷඪ४ϥΠϒϥϦʹSuffix Arrayؚ͕·Ε͍ͯΔ • ݕࡧ෦400ߦऑͰγϯϓϧ • GoͱΞϧΰϦζϜͷษڧʹ ྑͦ͞͏ ߏங෦1000ߦఔͰɼߴͳ
ΞϧΰϦζϜ (SA-IS) ͕࣮͞Ε͍ͯΔ จยखʹΏͬ͘ΓಡΉͷ͕ྑͦ͞͏
༻ޠͷ४උ • Suffix (ඌࣙ): จࣈྻTʹରͯ͠ɼઌ಄0จࣈҎ্Λͬͨจࣈྻ ྫ: T = ababb ඌࣙ:
ababb, babb, abb, bb, b, ‘’ (ۭจࣈ)
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o a
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o a coa
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o a coa Suffix Tree
Suffix Tree • ߏஙʹ͔͔Δ࣌ؒ ී௨ʹΔͱO(n^2) ࣌ؒ Ukkonen’s Algorithm ͩͱ
Online࡞ՄೳͰ O(n) ࣌ؒ a co a coa o a coa • ϝϞϦ: O(n) O(n^2) ʹͳΒͳ͍͜ͱʹҙ ݩͷจࣈྻʹର͢Δ࢝ɼऴΛϙΠϯλͰอ࣋͢ΕΑ͍
Suffix TreeΛ༻͍ͨύλʔϯͷݕࡧ • rootϊʔυ͔ΒࢬΛબΜͰਐΉ͚ͩ • ౸ୡͨ͠ϊʔυͷؚ͕࣍·ΕΔසʹͳΔ • ݕࡧͷܭࢉྔ: O(m) ͜͜Ͱmύλʔϯ
ˠ ͭ·ΓݩσʔλΛશ෦ᢞΊͳ͍ (ͬͨͶ) a co a coa o a coa
Ԡ༻: Longest Common Substrings (LCS) • ೖྗ: 2ͭҎ্ͷจࣈྻू߹S ग़ྗ:
Sʹڞ௨ͯ͠ݱΕΔ෦จࣈྻͷ͏ͪ࠷ͷͷ • Sʹରͯ͠Suffix TreeΛߏங͠ɼ root͔ΒḷΕΔϊʔυͷ͏ͪɼ྆ऀΛؚΉ ࠷ͷ෦Λฦͤྑ͍ $ana a $ana na banana$ana na S = { s1 = banana, s2 = ana } $ana na$ana $ana na$ana s2 s2 s2 s1, s2 s1, s2 s2 s2 s2
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o a
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o a coa
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o a coa Suffix Tree
Suffix Array • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛιʔτͨ࣌͠ͷݩͷจࣈྻͷindexͷྻ
Suffix Array • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛιʔτͨ࣌͠ͷݩͷจࣈྻͷindexͷྻ T = cocoa
Suffix Array • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛιʔτͨ࣌͠ͷݩͷจࣈྻͷindexͷྻ T = cocoa ඌࣙs
with index cocoa ocoa coa oa a
Suffix Array • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛιʔτͨ࣌͠ͷݩͷจࣈྻͷindexͷྻ T = cocoa ඌࣙs
with index cocoa ocoa coa oa a a coa cocoa oa ocoa ιʔτ
Suffix Array • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛιʔτͨ࣌͠ͷݩͷจࣈྻͷindexͷྻ T = cocoa ඌࣙs
with index cocoa ocoa coa oa a a coa cocoa oa ocoa ιʔτ Suffix Array
Suffix Array • ߏஙʹ͔͔Δ࣌ؒ ී௨ʹΔͱ O(n^2 logn) ࣌ؒ SA-ISΞϧΰϦζϜʹΑΓ O(n)
࣌ؒͰߏஙͰ͖Δ (ιʔτ͢Δͷʹlogn͕͔ͭͳ͍ෆࢥٞͳํ๏) • ϝϞϦࣗ໌ʹO(n) a coa cocoa oa ocoa
Suffix ArrayΛ༻͍ͨύλʔϯͷݕࡧ • Suffix Array্Λύλʔϯ͚ͩ܁Γฦ͠ೋ୳ࡧ͢Δ O(m logn) ࣌ؒ (͍) •
LCP Array (࠷ڞ௨಄ࣙྻ) Λ ิॿσʔλߏͱͯͬͯ͠ O(m + log n) ࣌ؒΛୡͰ͖Δ • LCP ArraySuffix Array͔ΒO(n)࣌ؒͰߏஙՄೳ a coa cocoa oa ocoa
Suffix Tree ͱ Suffix Array ͷؔ • ྺ࢙తʹ Suffix Tree
ͷํ͕20Ҏ্ૣ͘ൃݟ͞Εͨ • όΠΦͷจ຺ͰϝϞϦফඅΛམͱ͍ͨ͠Ϟνϕʔγϣϯ͔Β Suffix Array ͕։ൃ͞Εͨ • ͷͪʹ LCP Array ͕։ൃ͞ΕɼSuffix Tree্ͷΞϧΰϦζϜͷଟ͘ ΛSuffix Array্Ͱ࣮ߦͰ͖ΔΑ͏ʹͳͬͨ • ST, SA྆ํͱɼݸผʹൃలͨ͠ΞϧΰϦζϜ͕ଟ։ൃ͞Ε͍ͯΔ
·ͱΊ • Suffix Tree ͱ Suffix Array ʹ͍ͭͯհͨ͠ • ߏங࣌ؒ
O(n)ɼϝϞϦO(n) • ݕࡧɼ࠷ڞ௨෦จࣈྻͳͲΛߴʹղ͘͜ͱ͕Ͱ͖Δ • จࣈྻָ͍͠Αʂ