
2018-SPIRE-Block-Palindromes

kgoto
October 11, 2018


We propose a new generalization of palindromes and gapped palindromes called block palindromes. A block palindrome is a string that becomes a palindrome when identical substrings are replaced with a distinct character. We investigate several properties of block palindromes and, in particular, study substrings of a string which are block palindromes.

SPIRE 2018: https://eventos.spc.org.pe/spire2018/wp/
arXiv: https://arxiv.org/abs/1806.00198


Transcript

  1. Block Palindromes
    A New Generalization of Palindromes
    SPIRE 2018 in LIMA
    Keisuke Goto, Tomohiro I, Hideo Bannai, Shunsuke Inenaga


  2. Palindromes
    • Standard palindromes
      a b c b a (figure: the two arms are labeled "Same string")
    • Gapped palindromes
      a b c d e b a = a b X b a with gap X = cde (figure: the two arms are labeled "Same string")

  3. Why Palindromes?
    • Palindromes represent characteristic structures of strings.
      There is much research on the properties of palindromes
      - maximal palindromes, palindrome factorization, ...
    • Gapped palindromes model hairpin structures of DNA and RNA sequences,
      where G = C and U = A (complementary bases are treated as matching)
    (figure: stem-loop structure with the gap marked; https://en.wikipedia.org/wiki/Stem-loop)

  4. Block Palindromes (BPs)
    • A factorization f = f_{-n}, ..., f_{-1}, f_0, f_1, ..., f_n of a string T is a
      block palindrome if f_{-i} = f_i for all 0 ≦ i ≦ n
      * f_0 may be the empty string; f_{-i} and f_i for 0 < i ≦ n must not be empty
    • We call a factor a block
    • BPs are a generalization of standard and gapped palindromes
    Example: LIMAisn’tMALI = LI | MA | isn’t | MA | LI,
    with f_{-2} = f_2 = LI, f_{-1} = f_1 = MA, and f_0 = isn’t
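The definition on this slide can be checked mechanically. The following is a minimal sketch, not taken from the talk: the function name `is_block_palindrome` and the list-of-strings representation of a factorization are assumptions made for illustration.

```python
def is_block_palindrome(factors):
    """Check whether the list f_{-n}, ..., f_0, ..., f_n is a block palindrome.

    `factors` is a list of strings (a hypothetical representation):
    n left blocks, one center block f_0, then n right blocks.
    """
    if len(factors) % 2 == 0:
        return False  # an odd number of blocks is required
    n = len(factors) // 2
    for i in range(1, n + 1):
        left, right = factors[n - i], factors[n + i]
        # f_{-i} and f_i must be nonempty and equal; only f_0 may be empty
        if left == "" or left != right:
            return False
    return True


example = ["LI", "MA", "isn't", "MA", "LI"]
print("".join(example), is_block_palindrome(example))
```

With the slide's example this prints `LIMAisn'tMALI True`, while the coarser split `["LIMA", "isn't", "MALI"]` is rejected because LIMA and MALI differ.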

  5. Contributions
    • We study basic properties of BPs, introducing representatives of BPs:
      - Largest BPs (of a string)
      - Maximal BPs (in a string)
    • We propose an algorithm to enumerate all maximal BPs in a string T that runs in
      optimal O(|T| + ||MBP(T)||) time, where ||MBP(T)|| is the output size
      (i.e., the total number of factors in the output)

  6. # of BPs of T
    • For a string T of length N, there are O(2^{N/2}) BPs of T
    • A unary string has 2^{N/2} BPs
    (figure: T = a a a a a a a a a drawn with its 2^{N/2} block factorizations)
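The count for unary strings can be verified with a short recursion: a BP of T is either T itself as a single center block, or a non-overlapping border of T wrapped around a BP of the remaining middle part. This is a sketch for illustration only; `count_bps` is a hypothetical name, and the recursion is exponential-size bookkeeping rather than anything proposed in the talk. For odd N the count comes out as 2 to the power floor(N/2).

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_bps(t):
    """Number of block palindromes of t, counting t itself as a single block."""
    total = 1  # t alone as the center block f_0 (f_0 may be empty)
    for k in range(1, len(t) // 2 + 1):
        if t[:k] == t[-k:]:  # t[:k] is a border whose two occurrences do not overlap
            total += count_bps(t[k:-k])
    return total


for n in range(1, 10):
    print(n, count_bps("a" * n))
```

For the unary string of length 9 drawn on the slide, this reports 16 BPs.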

  7. BPs and Borders
    • A string that is both a (nonempty and proper) prefix and a suffix of T is called a border of T
    • The outermost block of a BP of T is a border of T
    • BPs can be obtained by stripping borders iteratively
    (figure: T = o n i o n i n o n i o n and the BPs of T)
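Border stripping translates directly into a recursive enumeration. The sketch below is an illustration under the definitions on this slide, with a hypothetical function name `block_palindromes`; it tries every non-overlapping border as the outermost block and recurses on the middle part.

```python
def block_palindromes(t):
    """Yield every block palindrome of t as a list of blocks."""
    yield [t]  # t alone as the center block f_0
    for k in range(1, len(t) // 2 + 1):
        if t[:k] == t[-k:]:  # border of length k, usable as the outermost block
            for inner in block_palindromes(t[k:-k]):
                yield [t[:k]] + inner + [t[-k:]]


for bp in block_palindromes("onioninonion"):
    print(" | ".join(bp))
```

For T = onioninonion this prints five factorizations: `onioninonion`, `on | ioninoni | on`, `on | i | oninon | i | on`, `on | i | on | in | on | i | on`, and `onion | in | onion`.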

  8. The largest BP of T
    (figure: T = o n i o n i n o n i o n and the BPs of T; the largest BP is the one with the largest # of blocks)
    Properties
    • The largest BP is unique
      (obtained by stripping the shortest border iteratively)
    • Each block is an unbordered string
    • Any BP is represented by a factorization of the largest BP
      (i.e., it is obtained by merging consecutive blocks of the largest BP)
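The greedy construction named in the first property can be sketched as follows. The names `largest_bp` and `is_unbordered` are assumptions for illustration, and the border search is a plain quadratic scan rather than an efficient border-array computation.

```python
def is_unbordered(s):
    """True if no nonempty proper prefix of s is also a suffix of s."""
    return all(s[:k] != s[-k:] for k in range(1, len(s)))


def largest_bp(t):
    """Largest BP of t, built by stripping the shortest border repeatedly."""
    left = []
    while True:
        # A shortest border never overlaps itself, so lengths up to len(t) // 2 suffice.
        k = next((k for k in range(1, len(t) // 2 + 1) if t[:k] == t[-k:]), None)
        if k is None:
            break
        left.append(t[:k])
        t = t[k:-k]
    return left + [t] + left[::-1]


blocks = largest_bp("onioninonion")
print(" | ".join(blocks))
print(all(is_unbordered(b) for b in blocks))
```

For T = onioninonion this yields `on | i | on | in | on | i | on` (7 blocks), and every block passes the unbordered check.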

  9. Factorization of Largest BPs
    • Let f = f_{-n}, ..., f_n and g = g_{-m}, ..., g_m be BPs with f the largest
    • In the inductive step on n > 0, we have 3 cases:
      (1) |f_n| = |g_m|
      (2) |f_n| > |g_m|
      (3) |f_n| < |g_m|

  10. Factorization of Largest BPs
    • Let f = f_{-n}, ..., f_n and g = g_{-m}, ..., g_m be BPs with f the largest
    • In the inductive step on n > 0, we have 3 cases:
      (1) |f_n| = |g_m|
      (2) |f_n| > |g_m|
      (3) |f_n| < |g_m|
    Case (1): (figure: f = f_{-n}, [f_{-n+1}, ..., f_{n-1}], f_n drawn above g = g_{-m}, [g_{-m+1}, ..., g_{m-1}], g_m)
    By the inductive hypothesis, f_{-n+1}, ..., f_{n-1} is finer than g_{-m+1}, ..., g_{m-1},
    which implies that f is finer than g

  11. Factorization of Largest BPs
    • Let f = f_{-n}, ..., f_n and g = g_{-m}, ..., g_m be BPs with f the largest
    • In the inductive step on n > 0, we have 3 cases:
      (1) |f_n| = |g_m|
      (2) |f_n| > |g_m| (cannot happen)
      (3) |f_n| < |g_m|
    Case (2): (figure: f_{-n} and f_n are each split into g_m, s, g_m)
    g_{-m}, s, g_m, f_{-n+1}, ..., f_{n-1}, g_m, s, g_m has a larger number of factors than f,
    a contradiction

  12. Factorization of Largest BPs
    • Let f = f_{-n}, ..., f_n and g = g_{-m}, ..., g_m be BPs with f the largest
    • In the inductive step on n > 0, we have 3 cases:
      (1) |f_n| = |g_m|
      (2) |f_n| > |g_m| (cannot happen)
      (3) |f_n| < |g_m|
    Case (3): (figure: g_{-m} and g_m are each split into f_n, s, f_n)
    By the inductive hypothesis, f_{-n+1}, ..., f_{n-1} is finer than
    s, f_n, g_{-m+1}, ..., g_{m-1}, f_n, s, which implies that f is finer than g
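The relation "f is finer than g" used in this argument can be made concrete: every block boundary of g is also a block boundary of f, so g arises from f by merging consecutive blocks. The sketch below assumes that reading of "finer"; `is_finer` is a hypothetical name, and the factorizations of onioninonion are written out by hand.

```python
def is_finer(f, g):
    """True if g can be obtained from f by merging runs of consecutive blocks."""
    if "".join(f) != "".join(g):
        return False  # the two lists must factorize the same string
    cuts_f, pos = set(), 0
    for block in f:
        pos += len(block)
        cuts_f.add(pos)  # end position of each block of f
    pos = 0
    for block in g:
        pos += len(block)
        if pos not in cuts_f:
            return False  # g cuts where f does not
    return True


largest = ["on", "i", "on", "in", "on", "i", "on"]
others = [
    ["onioninonion"],
    ["on", "ioninoni", "on"],
    ["on", "i", "oninon", "i", "on"],
    ["onion", "in", "onion"],
]
print(all(is_finer(largest, g) for g in others))
```

The check prints `True`: the largest BP refines each of the other four BPs of onioninonion, as the induction claims in general.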

  13. Largest BPs in T
    • We study largest BPs that occur as substrings in T
    When we fix a center block (which must be an unbordered string),
    there can be many largest BPs expanded from it
    (figure: largest BPs in T expanded from the same center block)

  14. Maximal BPs
    • We choose the maximal one as a representative of them
    When we fix a center block (which must be an unbordered string),
    there can be many largest BPs expanded from it
    (figure: largest BPs in T expanded from the same center block, one labeled "maximal BP" and two labeled "not maximal BP")

  15. Maximal BPs
    • We choose the maximal one as a representative of them
    (figure: largest BPs in T expanded from the same center block, one labeled "maximal BP" and two labeled "not maximal BP")
    Properties
    • The maximal BP whose f_0 occurs at T[b ... e] is unique
    • Any largest BP is represented by a substring of the maximal BP
    • ||MBP(T)|| ≦ N(2N-1), where ||MBP(T)|| is the sum of # of factors of maximal BPs in T

  16. Enumeration of Maximal BPs
    • A naïve approach would be to compute the maximal BP for every center block
    • Using a data structure for constant-time longest common extension queries,
      this can be done in O(N^3) time in total
    We propose an algorithm running in O(N + ||MBP(T)||) time, which is optimal unless
    the maximal BPs can be represented more compactly
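For readers unfamiliar with the term, a longest common extension (LCE) query asks how far two suffixes of T agree. The sketch below only pins down what the query returns; it scans character by character and is therefore not the constant-time data structure the slide refers to. The name `lce` and 0-based indexing are assumptions for illustration.

```python
def lce(t, i, j):
    """Length of the longest common prefix of t[i:] and t[j:] (naive scan)."""
    k = 0
    while i + k < len(t) and j + k < len(t) and t[i + k] == t[j + k]:
        k += 1
    return k


t = "onioninonion"
# Two candidate arms t[i:i+l] and t[j:j+l] are equal exactly when lce(t, i, j) >= l.
print(lce(t, 0, 7), lce(t, 0, 3), lce(t, 0, 1))
```

This prints `5 3 0`: the prefix onion recurs at position 7, only oni recurs at position 3, and nothing matches at position 1.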

  17. Enumeration of Maximal BPs
    • Our algorithm consists of two steps:
      (1) Enumerate all pairs of occurrences of unbordered strings
          and sort them by the center position and beginning
          positions of the right arms
    (figure: pairs of arms in T)

  18. Enumeration of Maximal BPs
    • Our algorithm consists of two steps:
      (1) Enumerate all pairs of occurrences of unbordered strings
          and sort them by the center position and beginning
          positions of the right arms
      (2) For each center position, build maximal BPs by
          concatenating the enumerated pairs that are adjacent
    (figure: adjacent pairs of arms in T concatenated around a center)
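To make the first step concrete, here is a brute-force sketch of what it produces. It is not the authors' procedure and is far slower than the O(N + ||MBP(T)||) bound stated in the talk. The tuple layout, the function names, and the use of the doubled center `i + j + length` (which keeps centers integral) are all assumptions made for illustration; the second step is not sketched.

```python
def is_unbordered(s):
    """True if no nonempty proper prefix of s is also a suffix of s."""
    return all(s[:k] != s[-k:] for k in range(1, len(s)))


def unbordered_pairs(t):
    """All pairs of non-overlapping occurrences of the same unbordered string.

    Each pair is (doubled_center, right_start, left_start, length), so plain
    tuple sorting orders by center first and right-arm start second.
    """
    n = len(t)
    pairs = []
    for length in range(1, n // 2 + 1):
        for i in range(n - 2 * length + 1):
            arm = t[i:i + length]
            if not is_unbordered(arm):
                continue
            for j in range(i + length, n - length + 1):
                if t[j:j + length] == arm:
                    pairs.append((i + j + length, j, i, length))
    pairs.sort()
    return pairs


for center2, j, i, length in unbordered_pairs("abab"):
    print(center2 / 2, "abab"[i:i + length], "left at", i, "right at", j)
```

For T = abab the sketch finds three pairs: the arm a around center 1.5, the arm ab around center 2.0, and the arm b around center 2.5.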

  19. Conclusions and Future Work
    • We define block palindromes (BPs), a new generalization of
      standard and gapped palindromes
    • We introduce representatives of BPs
      - Largest BPs (of strings)
      - Maximal BPs (in strings)
    • We study basic properties of these representatives and
      give efficient algorithms to compute/enumerate them
    • Open problems:
      - Is it possible to represent the maximal BPs compactly?
        (what if we only consider BPs with empty center blocks?)
      - Can we utilize it to design faster enumeration algorithms?