Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Suffix Trees and Suffix Arrays
Search
Sponsored
·
Your Podcast. Everywhere. Effortlessly.
Share. Educate. Inspire. Entertain. You do you. We'll handle the rest.
→
hsasakawa
September 15, 2020
Science
130
0
Share
Suffix Trees and Suffix Arrays
M3 Tech Talk #150
Quick introduction of suffix trees and arrays data structures
hsasakawa
September 15, 2020
More Decks by hsasakawa
See All by hsasakawa
行動ログ処理基盤の構築
hsasakawa
0
3.4k
冪等性を考慮したデータ連携ジョブの設計
hsasakawa
6
2.2k
Data platform development on M3 USA
hsasakawa
0
1.2k
Data Analysis Platform Development @M3, inc.
hsasakawa
0
3.5k
Other Decks in Science
See All in Science
A Guide to Academic Writing Using Generative AI - A Workshop
ks91
PRO
0
250
(メタ)科学コミュニケーターからみたAI for Scienceの同床異夢
rmaruy
0
190
データベース12: 正規化(2/2) - データ従属性に基づく正規化
trycycle
PRO
0
1.1k
検索と推論タスクに関する論文の紹介
ynakano
1
180
防災デジタル分野での官民共創の取り組み (1)防災DX官民共創をどう進めるか
ditccsugii
0
580
【論文紹介】Is CLIP ideal? No. Can we fix it?Yes! 第65回 コンピュータビジョン勉強会@関東
shun6211
5
2.4k
データマイニング - グラフデータと経路
trycycle
PRO
2
490
2025-06-11-ai_belgium
sofievl
1
250
なぜ21は素因数分解されないのか? - Shorのアルゴリズムの現在と壁
daimurat
0
360
YouTubeにおける撤回論文の参照実態 / metascience-meetup2026
corgies
3
220
MCMCのR-hatは分散分析である
moricup
0
630
Algorithmic Aspects of Quiver Representations
tasusu
0
250
Featured
See All Featured
HDC tutorial
michielstock
1
590
My Coaching Mixtape
mlcsv
0
91
4 Signs Your Business is Dying
shpigford
187
22k
Building Experiences: Design Systems, User Experience, and Full Site Editing
marktimemedia
0
460
A brief & incomplete history of UX Design for the World Wide Web: 1989–2019
jct
1
340
Test your architecture with Archunit
thirion
1
2.2k
DBのスキルで生き残る技術 - AI時代におけるテーブル設計の勘所
soudai
PRO
64
53k
How to make the Groovebox
asonas
2
2.1k
Believing is Seeing
oripsolob
1
100
Introduction to Domain-Driven Design and Collaborative software design
baasie
1
690
Speed Design
sergeychernyshev
33
1.6k
Neural Spatial Audio Processing for Sound Field Analysis and Control
skoyamalab
0
240
Transcript
Suffix Trees and Suffix Arrays M3 Tech Talk Hirohito Sasakawa,
Data engineer, AI and ML team @M3, inc.
Suffix Tree ͱ Suffix Array • จࣈྻʹର͢ΔࡧҾσʔλߏ • ࡧҾ (index):
ݩσʔλʹର͢ΔΞΫηεΛޮΑ͘ఏڙ͢Δิॿσʔλߏ • Suffix (Tree | Array) ɼจࣈྻ ( or จࣈྻू߹) ʹରͯ͠෦จࣈྻ ͷݕࡧɼසɼ࠷ڞ௨෦จࣈྻͳͲΛߴʹܭࢉ͢Δ • ࣮༻తͳԠ༻͕ɼΊͪΌͪ͘Ό͋Δ (ѹॖͱ͔ɼόΠΦܥͱ͔)
ͪͳΈʹ… • Goͷඪ४ϥΠϒϥϦʹSuffix Arrayؚ͕·Ε͍ͯΔ • ݕࡧ෦400ߦऑͰγϯϓϧ • GoͱΞϧΰϦζϜͷษڧʹ ྑͦ͞͏ ߏங෦1000ߦఔͰɼߴͳ
ΞϧΰϦζϜ (SA-IS) ͕࣮͞Ε͍ͯΔ จยखʹΏͬ͘ΓಡΉͷ͕ྑͦ͞͏
༻ޠͷ४උ • Suffix (ඌࣙ): จࣈྻTʹରͯ͠ɼઌ಄0จࣈҎ্Λͬͨจࣈྻ ྫ: T = ababb ඌࣙ:
ababb, babb, abb, bb, b, ‘’ (ۭจࣈ)
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o a
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o a coa
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o a coa Suffix Tree
Suffix Tree • ߏஙʹ͔͔Δ࣌ؒ ී௨ʹΔͱO(n^2) ࣌ؒ Ukkonen’s Algorithm ͩͱ
Online࡞ՄೳͰ O(n) ࣌ؒ a co a coa o a coa • ϝϞϦ: O(n) O(n^2) ʹͳΒͳ͍͜ͱʹҙ ݩͷจࣈྻʹର͢Δ࢝ɼऴΛϙΠϯλͰอ࣋͢ΕΑ͍
Suffix TreeΛ༻͍ͨύλʔϯͷݕࡧ • rootϊʔυ͔ΒࢬΛબΜͰਐΉ͚ͩ • ౸ୡͨ͠ϊʔυͷؚ͕࣍·ΕΔසʹͳΔ • ݕࡧͷܭࢉྔ: O(m) ͜͜Ͱmύλʔϯ
ˠ ͭ·ΓݩσʔλΛશ෦ᢞΊͳ͍ (ͬͨͶ) a co a coa o a coa
Ԡ༻: Longest Common Substrings (LCS) • ೖྗ: 2ͭҎ্ͷจࣈྻू߹S ग़ྗ:
Sʹڞ௨ͯ͠ݱΕΔ෦จࣈྻͷ͏ͪ࠷ͷͷ • Sʹରͯ͠Suffix TreeΛߏங͠ɼ root͔ΒḷΕΔϊʔυͷ͏ͪɼ྆ऀΛؚΉ ࠷ͷ෦Λฦͤྑ͍ $ana a $ana na banana$ana na S = { s1 = banana, s2 = ana } $ana na$ana $ana na$ana s2 s2 s2 s1, s2 s1, s2 s2 s2 s2
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o a
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o a coa
Suffix Tree • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛࣙॻॱʹฒͯɼ্͔ΒଋͶͨ T = cocoa cocoa
ocoa coa oa a ඌࣙͨͪ a coa cocoa oa ocoa ιʔτ a co a coa o a coa Suffix Tree
Suffix Array • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛιʔτͨ࣌͠ͷݩͷจࣈྻͷindexͷྻ
Suffix Array • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛιʔτͨ࣌͠ͷݩͷจࣈྻͷindexͷྻ T = cocoa
Suffix Array • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛιʔτͨ࣌͠ͷݩͷจࣈྻͷindexͷྻ T = cocoa ඌࣙs
with index cocoa ocoa coa oa a
Suffix Array • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛιʔτͨ࣌͠ͷݩͷจࣈྻͷindexͷྻ T = cocoa ඌࣙs
with index cocoa ocoa coa oa a a coa cocoa oa ocoa ιʔτ
Suffix Array • tl;dr จࣈྻTͷͯ͢ͷඌࣙΛιʔτͨ࣌͠ͷݩͷจࣈྻͷindexͷྻ T = cocoa ඌࣙs
with index cocoa ocoa coa oa a a coa cocoa oa ocoa ιʔτ Suffix Array
Suffix Array • ߏஙʹ͔͔Δ࣌ؒ ී௨ʹΔͱ O(n^2 logn) ࣌ؒ SA-ISΞϧΰϦζϜʹΑΓ O(n)
࣌ؒͰߏஙͰ͖Δ (ιʔτ͢Δͷʹlogn͕͔ͭͳ͍ෆࢥٞͳํ๏) • ϝϞϦࣗ໌ʹO(n) a coa cocoa oa ocoa
Suffix ArrayΛ༻͍ͨύλʔϯͷݕࡧ • Suffix Array্Λύλʔϯ͚ͩ܁Γฦ͠ೋ୳ࡧ͢Δ O(m logn) ࣌ؒ (͍) •
LCP Array (࠷ڞ௨಄ࣙྻ) Λ ิॿσʔλߏͱͯͬͯ͠ O(m + log n) ࣌ؒΛୡͰ͖Δ • LCP ArraySuffix Array͔ΒO(n)࣌ؒͰߏஙՄೳ a coa cocoa oa ocoa
Suffix Tree ͱ Suffix Array ͷؔ • ྺ࢙తʹ Suffix Tree
ͷํ͕20Ҏ্ૣ͘ൃݟ͞Εͨ • όΠΦͷจ຺ͰϝϞϦফඅΛམͱ͍ͨ͠Ϟνϕʔγϣϯ͔Β Suffix Array ͕։ൃ͞Εͨ • ͷͪʹ LCP Array ͕։ൃ͞ΕɼSuffix Tree্ͷΞϧΰϦζϜͷଟ͘ ΛSuffix Array্Ͱ࣮ߦͰ͖ΔΑ͏ʹͳͬͨ • ST, SA྆ํͱɼݸผʹൃలͨ͠ΞϧΰϦζϜ͕ଟ։ൃ͞Ε͍ͯΔ
·ͱΊ • Suffix Tree ͱ Suffix Array ʹ͍ͭͯհͨ͠ • ߏங࣌ؒ
O(n)ɼϝϞϦO(n) • ݕࡧɼ࠷ڞ௨෦จࣈྻͳͲΛߴʹղ͘͜ͱ͕Ͱ͖Δ • จࣈྻָ͍͠Αʂ