2018-SPIRE-Block-Palindromes

kgoto
October 11, 2018

We propose a new generalization of palindromes and gapped palindromes called block palindromes. A block palindrome is a string that becomes a palindrome when identical substrings are replaced with a distinct character. We investigate several properties of block palindromes and, in particular, study substrings of a string that are block palindromes.

SPIRE 2018: https://eventos.spc.org.pe/spire2018/wp/
arXiv: https://arxiv.org/abs/1806.00198

Transcript

  1. 1.

    Block Palindromes: A New Generalization of Palindromes
    SPIRE 2018 in LIMA
    Keisuke Goto, Tomohiro I, Hideo Bannai, Shunsuke Inenaga
  2. 2.

    Palindromes
    - Standard palindromes: a b c b a (figure label on the two arms: "Same string")
    - Gapped palindromes: a b c d e b a, i.e., a b X b a with gap X = cde (figure label on the two arms: "Same string")
  3. 3.

    Why Palindromes?
    - Palindromes represent characteristic structures of strings. There is a body of research on properties of palindromes: maximal palindromes, palindrome factorization, ...
    - Gapped palindromes model hairpin structures of DNA and RNA sequences.
    [Figure: stem-loop with its gap, where G = C and U = A; source: https://en.wikipedia.org/wiki/Stem-loop]
  4. 4.

    Block Palindromes (BPs)
    - A factorization f = f_{-n} … f_{-1} f_0 f_1 … f_n of a string T is a block palindrome if f_{-i} = f_i for all 0 ≦ i ≦ n.
      * f_0 may be the empty string; f_{-i} and f_i for 0 < i ≦ n must not be empty.
    - BPs are a generalization of standard and gapped palindromes.
    - We call each factor a block.
    [Figure: blocks f_{-2} f_{-1} f_0 f_1 f_2 with matching blocks marked "Same string"; example string LIMAisn'tMALI with blocks f_{-2}, f_{-1}, f_0, f_1, f_2]
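The definition on this slide can be checked mechanically. The following is a minimal sketch, not code from the talk: the function name `is_block_palindrome` is hypothetical, and the split of LIMAisn'tMALI into LI | MA | isn't | MA | LI is one reading of the slide's example.

```python
def is_block_palindrome(factors):
    """Check a factorization f_{-n}, ..., f_0, ..., f_n given as a list of strings.

    Assumption: the list has odd length, with the center block f_0 in the
    middle. f_0 may be empty; every other block must be nonempty.
    """
    if len(factors) % 2 == 0:
        return False  # there is no middle position for f_0
    n = len(factors) // 2
    for i in range(1, n + 1):
        left, right = factors[n - i], factors[n + i]
        if left == "" or left != right:
            return False
    return True


print(is_block_palindrome(["LI", "MA", "isn't", "MA", "LI"]))  # True
print(is_block_palindrome(["a", "b", "", "b", "a"]))           # True (empty center)
print(is_block_palindrome(["ab", "c", "ba"]))                  # False
```

With single-character blocks this reduces to the standard palindrome test, and with a long center block it reduces to a gapped palindrome.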
  5. 5.

    Contributions
    - We study basic properties of BPs, introducing representatives of BPs:
      - Largest BPs (of a string)
      - Maximal BPs (in a string)
    - We propose an algorithm to enumerate all maximal BPs in a string T that runs in optimal O(|T| + ||MBP(T)||) time, where ||MBP(T)|| is the output size (i.e., the sum of the numbers of factors in the outputs).
  6. 6.

    # of BPs of T
    - For a string T of length N, there are O(2^(N/2)) BPs of T.
    - A unary string has 2^(N/2) BPs.
    [Figure: a unary string T = a a a ... a and its 2^(N/2) BPs]
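The count for unary strings can be verified by brute force on small inputs. This is a sketch under stated assumptions: the name `count_bps` is hypothetical, the string itself counts as a BP with a single center block, and the exponent is read as N/2 rounded down.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_bps(t):
    """Count the BPs of t by choosing the outermost block in every possible way."""
    total = 1  # t itself, taken as a single center block f_0
    for k in range(1, len(t) // 2 + 1):
        if t[:k] == t[-k:]:  # prefix/suffix of length k match and do not overlap
            total += count_bps(t[k:len(t) - k])
    return total


for n in range(9):
    print(n, count_bps("a" * n), 2 ** (n // 2))  # the last two columns agree
```

For a unary string every length k ≤ N/2 gives a valid outermost block, so the recurrence is c(N) = 1 + c(N-2) + c(N-4) + ..., which solves to a power of two.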
  7. 7.

    BPs and Borders
    - A string that is a (nonempty and proper) prefix and a suffix of T is called a border of T.
    - The outermost block of a BP of T is a border of T.
    - BPs can be obtained by stripping a border iteratively.
    [Figure: the BPs of T = o n i o n i n o n i o n]
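Border stripping translates directly into a recursive enumeration. The sketch below is illustrative only: `borders` and `block_palindromes` are hypothetical names, and only borders whose two occurrences do not overlap are used, since they have to serve as a matching pair of outer blocks.

```python
def borders(t):
    """Borders of t whose prefix and suffix occurrences do not overlap."""
    return [t[:k] for k in range(1, len(t) // 2 + 1) if t[:k] == t[-k:]]


def block_palindromes(t):
    """Yield every BP of t as a list of blocks, center block in the middle."""
    yield [t]  # no border stripped: t is the center block
    for b in borders(t):
        inner = t[len(b):len(t) - len(b)]
        for mid in block_palindromes(inner):
            yield [b] + mid + [b]


for bp in block_palindromes("onioninonion"):
    print(" | ".join(bp))
# onioninonion
# on | ioninoni | on
# on | i | oninon | i | on
# on | i | on | in | on | i | on
# onion | in | onion
```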
  8. 8.

    The largest BP of T
    [Figure: the BPs of T = o n i o n i n o n i o n, with the one having the largest # of blocks marked]
    Properties
    - The largest BP is unique (obtained by stripping the shortest border iteratively).
    - Each block is an unbordered string.
    - Any BP is represented by a factorization of the largest BP.
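Stripping the shortest border iteratively can be sketched as a simple loop. This is a quadratic-time illustration with hypothetical names (`shortest_border`, `largest_bp`), not the algorithm from the talk; it relies on the fact that the shortest border of a bordered string never exceeds half its length.

```python
def shortest_border(t):
    """Return the shortest border of t, or None if t is unbordered."""
    for k in range(1, len(t) // 2 + 1):
        if t[:k] == t[-k:]:
            return t[:k]
    return None


def largest_bp(t):
    """Strip the shortest border repeatedly; what remains is the center block."""
    left, right = [], []
    while True:
        b = shortest_border(t)
        if b is None:
            break
        left.append(b)
        right.append(b)
        t = t[len(b):len(t) - len(b)]
    return left + [t] + right[::-1]


print(largest_bp("onioninonion"))
# ['on', 'i', 'on', 'in', 'on', 'i', 'on']
```

Every block in the result is unbordered: an outer block with a border would itself have offered a shorter border to strip.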
  9. 9.

    Factorization of Largest BPs
    - Let f = f_{-n}, ..., f_n and g = g_{-m}, ..., g_m be BPs with f the largest.
    - In inductive steps on n > 0, we have 3 cases:
      - (1) |f_n| = |g_m|
      - (2) |f_n| > |g_m|
      - (3) |f_n| < |g_m|
  10. 10.

    Factorization of Largest BPs
    - Let f = f_{-n}, ..., f_n and g = g_{-m}, ..., g_m be BPs with f the largest.
    - In inductive steps on n > 0, we have 3 cases:
      - (1) |f_n| = |g_m|
      - (2) |f_n| > |g_m|
      - (3) |f_n| < |g_m|
    Case (1): By the inductive hypothesis, f_{-n+1}, ..., f_{n-1} is finer than g_{-m+1}, ..., g_{m-1}, which implies that f is finer than g.
    [Figure: f_{-n}, f_{-n+1}, ..., f_{n-1}, f_n aligned with g_{-m}, g_{-m+1}, ..., g_{m-1}, g_m]
  11. 11.

    Factorization of Largest BPs
    - Let f = f_{-n}, ..., f_n and g = g_{-m}, ..., g_m be BPs with f the largest.
    - In inductive steps on n > 0, we have 3 cases:
      - (1) |f_n| = |g_m|
      - (2) |f_n| > |g_m| (cannot happen)
      - (3) |f_n| < |g_m|
    Case (2): g_{-m}, s, g_m, f_{-n+1}, ..., f_{n-1}, g_m, s, g_m has a larger number of factors than f, a contradiction.
    [Figure: f_{-n} and f_n each split into g_m, s, g_m around f_{-n+1}, ..., f_{n-1}]
  12. 12.

    Factorization of Largest BPs
    - Let f = f_{-n}, ..., f_n and g = g_{-m}, ..., g_m be BPs with f the largest.
    - In inductive steps on n > 0, we have 3 cases:
      - (1) |f_n| = |g_m|
      - (2) |f_n| > |g_m| (cannot happen)
      - (3) |f_n| < |g_m|
    Case (3): By the inductive hypothesis, f_{-n+1}, ..., f_{n-1} is finer than s, f_n, g_{-m+1}, ..., g_{m-1}, f_n, s, which implies that f is finer than g.
    [Figure: g_{-m} and g_m each split into f_n, s, f_n around g_{-m+1}, ..., g_{m-1}]
  13. 13.

    Largest BPs in T
    - We study largest BPs that occur as substrings in T.
    - When we fix a center block (which should be an unbordered string), there could be many largest BPs expanded from it.
  14. 14.

    Maximal BPs
    - We choose the maximal one as a representative of them.
    - When we fix a center block (which should be an unbordered string), there could be many largest BPs expanded from it.
    [Figure: three largest BPs in T sharing a center block, labeled "maximal BP", "not maximal BP", "not maximal BP"]
  15. 15.

    Maximal BPs
    - We choose the maximal one as a representative of them.
    Properties
    - The maximal BP whose f_0 occurs at T[b ... e] is unique.
    - Any largest BP is represented by a substring of the maximal BP.
    - ||MBP(T)|| ≦ N(2N-1), where ||MBP(T)|| is the sum of the numbers of factors of the maximal BPs in T.
    [Figure: three largest BPs in T sharing a center block, labeled "maximal BP", "not maximal BP", "not maximal BP"]
  16. 16.

    Enumeration of Maximal BPs
    - A naïve approach would be to compute the maximal BP for every center block.
    - Using a data structure for constant-time longest common extension queries, it can be done in O(N^3) time in total.
    - We propose an algorithm running in O(N + ||MBP(T)||) time, which is optimal unless the maximal BPs can be represented more compactly.
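The naïve per-center computation might look like the sketch below. It rests on several assumptions that the slides do not spell out: "maximal" is read as "cannot be extended outward by another pair of matching blocks", each extension takes the shortest matching pair of arms (so every added block is unbordered), the center is given as a half-open range `t[b:e]` rather than the slide's T[b ... e], and plain slice comparison stands in for longest common extension queries. The name `maximal_bp` is hypothetical.

```python
def maximal_bp(t, b, e):
    """Grow a BP outward from the center block t[b:e].

    Returns (start, blocks), where start is the position in t at which the
    resulting BP begins. The caller is assumed to pass an unbordered center.
    """
    blocks = [t[b:e]]
    lo, hi = b, e  # the current BP occupies t[lo:hi]
    while True:
        for k in range(1, min(lo, len(t) - hi) + 1):
            if t[lo - k:lo] == t[hi:hi + k]:  # shortest matching pair of arms
                arm = t[hi:hi + k]
                blocks = [arm] + blocks + [arm]
                lo, hi = lo - k, hi + k
                break
        else:
            return lo, blocks  # no further extension is possible


print(maximal_bp("onioninonion", 5, 7))
# (0, ['on', 'i', 'on', 'in', 'on', 'i', 'on'])
```

Running this for every center block gives the naïve enumeration; the algorithm proposed in the talk avoids that per-center work.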
  17. 17.

    Enumeration of Maximal BPs
    Our algorithm consists of two steps:
    - Enumerate all pairs of occurrences of unbordered strings and sort them by the center position and beginning positions of the right arms.
  18. 18.

    Enumeration of Maximal BPs
    Our algorithm consists of two steps:
    - Enumerate all pairs of occurrences of unbordered strings and sort them by the center position and beginning positions of the right arms.
    - For each center position, build maximal BPs by concatenating the enumerated pairs that are adjacent.
  19. 19.

    Conclusions and Future Work
    - We define block palindromes (BPs), which are a new generalization of standard and gapped palindromes.
    - We introduce representatives of BPs:
      - Largest BPs (of strings)
      - Maximal BPs (in strings)
    - We study basic properties of these representatives and give efficient algorithms to compute/enumerate them.
    - Open problems:
      - Is it possible to represent the maximal BPs compactly? (What if we only consider BPs with empty center blocks?)
      - Can we utilize it to design faster enumeration algorithms?