Friday, April 27, 2012

Maximal palindromes

The algorithm for finding palindromic substrings described in my previous blog post has two major problems. First, if I want to find the palindromes in a string of substantial length, it will take a normal computer many years to calculate the result. I didn't mention the second problem explicitly in my previous blog post, but it is pretty obvious. The number of palindromic substrings returned by the algorithm is at least as large as the length of the string, since all single letters are palindromes, and there might be many more. If the string contains a palindrome of length $1,000,000$, then the algorithm will also return the palindrome of length $999,998$ obtained by removing the first and last element of the million letter palindrome, the palindromes of length $999,996$, $999,994$, and so on. This implies that the total number of palindromes occurring in a long string is huge, and I can easily drown in the palindromes returned.

Of the series of palindromes of length $1,000,000$, $999,998$, $999,996$ etc, all palindromes have the same center, and all shorter palindromes can be derived from the longest palindrome by removing equally many characters from the front and from the end. In the string "yabadabadoo", the palindromes around the center at the second occurrence of a 'b', are "b", "aba", and "dabad". "dabad" is the maximal palindrome around this center, since its extension "adabado" is not a palindrome. Maximality does not imply it is the longest palindromic substring, it only is the maximal palindrome around a particular center in the string. For example, in "yabadabadoo", "abadaba" is a longer palindrome, but with a different center. A center position is either on a letter, as in "dabad" on 'b', or in between two letters, as in "oo", where the center is in between the two o's. The string "yabadabadoo" has $23$ centers: one on each letter (of which there are eleven), one before each letter (another eleven), and one after the last letter of the string. If I assign $0$ to the center before the first letter, $1$ to the center on the first letter, $2$ to the center after the first letter, and so on, the maximal palindrome "dabad" has center $13$, and the maximal palindrome "abadaba" $9$. For a string of length $n$, there are $2n+1$ center positions in the string. Since we can derive all shorter palindromes around a center from the maximal palindrome around the center, it is sufficient to calculate the maximal palindromes around the centers in a string. So instead of calculating all palindromic substrings of a string, I calculate all maximal palindromic substrings of a string.

There are just as many maximal palindromes in a string as there are center positions, so there still are quite a few maximal palindromes. But the number of maximal palindromes might be substantially lower than the number of (not necessarily maximal) palindromic substrings in a string. Take for example the string of length $n$ just containing the character 'a'. The number of maximal palindromes is equal to the number of center positions in the string, $2n+1$. The total number of palindromes in the string is the number of substrings of the string, since all substrings consist of only 'a's, and hence all are palindromes. Since substrings is defined as concat . map tails . inits, I can calculate the total number of substrings if I know how many initial and final substrings appear in a string of length $n$. Function inits returns $n+1$ substrings, of length $0...n$, respectively. Similarly, for a string of length $n$, function tails returns $n+1$ substrings, also of length $0...n$. So the total number of substrings is: \[ \sum\limits_{i=0}^{n} (i+1) = \frac{n(n+1)}{2} + n+1 = \frac{(n+2)(n+1)}{2} = \frac{n^2}{2}+\frac{3}{2}n+1\] So in the string "aaaaaaaaaa" of length $10$, there are $21$ maximal palindromes, and $66$ palindromic substrings.

If I know the center and the length of a palindrome, I can recover the palindrome if I have the original string. For example, the maximal palindrome of length $5$ around center $13$ in the string "yabadabadoo" is the string "dabad", and the maximal palindrome of length $7$ around center $9$ is the string "abadaba". So to find maximal palindromes, it suffices to find their center and length.

Given a string, a center within the string, and a lengthp denoting the length of the maximal palindrome in the string around the center, I can use a Haskell program to check if the length is indeed the length of the maximal palindrome around the center in the string. To determine the substring denoted by a center and a length, I first split the string into the initial part of length div (center-lengthp) 2, the part before the maximal palindrome, and the following lengthp letters. Function div divides its first argument by its second argument, throwing away the remainder. Function splitAt takes a positive integer $n$ and a list, and splits the list into two lists. The first list contains the first $n$ letters, and the second list contains the rest of the letters. The lengthp letters after before should form a palindrome. The letters around it, the last letter of the part of the string before it and the head of the string after it, should be different to make it maximal. Functions last and head are predefined functions which return the last or head element of a list, respectively. The palindrome is maximal too if at least one of the strings before or after it is empty.

> maximalPalindrome string center lengthp  =
>   let (before,rest)  =  splitAt 
>                           (div (center-lengthp) 2) 
>                           string
>       (p,after)      =  splitAt lengthp rest
>   in    length p    ==  lengthp 
>      && odd center  ==  odd lengthp
>      && palindrome p 
>      && (  null before 
>         || null after 
>         || last before /= head after
>         )

The function maximalPalindrome uses two sanity checks. It first checks that there indeed exists a substring of length lengthp around center by means of length p == lengthp, where function length is a predefined function that returns the length of a list. The other sanity check checks that the maximal palindrome around an even center has even length, and a maximal palindrome around an odd center has odd length, by means of odd center == odd lengthp, which equals True only if both center and lengthp are odd or both are even. Function odd is a predefined function that determines whether or not a number is odd. The logic or operator || returns True if one of its arguments is True. If its left-hand side argument is True it doesn't evaluate its right-hand argument, which is just as well for the above definition, since evaluating last before if before is empty (null before) would lead to a undefined value.

In the next blog post I will show how to calculate the length of all maximal palindromes in a string.

No comments:

Post a Comment