Wednesday, April 18, 2012

A naive algorithm for finding palindromes

It is easy to check by hand that a sentence like "Madam, I'm Adam" is a palindrome if you ignore whitespace and punctuation characters. But checking that the world's longest palindrome, consisting of 17,826 words, is indeed a palindrome is another matter. For such a string, I need software to check that it is a palindrome.

I have already defined the necessary software to check that a string is a palindrome. The palindrome (or textPalindrome or dnaPalindrome) function defined in the previous blog post does the job. Given an implementation of the function reverse that takes about as many steps as the length of the string it reverses, the palindrome function determines whether or not a string is a palindrome in about the same amount of steps. It is impossible to do this significantly faster on a computer.

Problem solved?

Well, it depends on what you want. If you want to check for a complete string whether or not it is a palindrome, then the problem is solved. But often you want to find occurrences of palindromes inside another, possibly non-palindromic, string. For example, you might want to find the about 5,700,000 characters that together form eight palindromes, in a string of about 20,000,000 characters representing the male-specific region of the Y chromosome. Or you night want to find the longest palindrome in the Bible. These are different problems, for which we need more than just the palindrome function.

To find palindromes in the male-specific region of the Y chromosome, I have to consider all substrings of the DNA string, and check if each of the substrings is a palindrome. How can I find all substrings that are palindromes in a string? A string may contain many substrings that are palindromes. Just the string "abba" contains the palindromes "a" (twice), "b" (twice), "bb", and "abba", and we also consider the empty substring, which appears a lot, a palindrome.

Finding all palindromic substrings is specified by:


> palindromes  :: String -> [String]
> palindromes  =  filter palindrome . substrings

The first line gives the type of the function palindromes. Function palindromes takes a value of type String as argument, and returns a list of palindromic strings, so a value of type [String]. The argument type is to the left of the arrow ->, the result type to the right. Function substrings calculates all substrings of a string. The function substrings is composed with the function filter palindrome using the composition operator . represented by the dot symbol. filter palindrome removes all substrings that are not palindromes. It remains to define function substrings. Function substrings is defined in terms of two helper functions inits and tails. Function inits returns the list of all initial substring of a string. For example, inits "abba" gives ["", "a", "ab", "abb", "abba"]. Similarly, function tails returns the list of all final substrings of a string, so tails "abba" gives ["abba", "bba", "ba", "a", ""].

> inits []      =  [[]]
> inits (x:xs)  =  [[]] ++ map (x:) (inits xs)

> tails []      =  [[]]
> tails (x:xs)  =  [x:xs] ++ tails xs

The empty list has just one initial list, namely the empty list. The initial lists of a list consisting of an element x followed by a list xs are calculated by first calculating the initial lists of xs, prepending x to each of these lists by means of the function (x:), and adding the empty initial list. The empty list also has just one final list, namely the empty list. The final lists of a list consisting of an element x followed by a list xs are calculated by first calculating the final lists of xs, and adding the complete list x:xs. Function substrings is now defined by either taking all initial substrings of all final substrings, or all final substrings of all initial substrings. It doesn't matter much which I take, so I define

> substrings = concat . map tails . inits

The standard function concat flattens takes a list of lists, and turns that into a list. So concat [[3,1],[2,4]] results in [3,1,2,4]. It removes one nesting level. Applying function substrings to "abba" gives ["", "a", "", "ab", "b", "", "abb", "bb", "b", "", "abba", "bba", "ba", "a", ""].

Although we now have a program to find all palindromes in a string, it is of no use when the string in which you want to find palindromes is very large. You might just as well try to find the palindromes by hand, since waiting for the program to find them is easily going to take years. Just how impossible it is to find palindromes in a string of length 20,000,000 using this program, consider the amount of computation necessary to find them. Since we want to find palindromic substrings of this string, we have to calculate the substrings first. We have a single substring of length 20,000,000, two substrings of length 19,999,999 each, three of length 19,999,998, four of length 19,999,997, and so on. So the length of all substrings we need to consider when looking for palindromes is \[ 1 \times n + 2 \times (n-1) + 3 \times (n-2) + \ldots + n \times 1 \] where $n$ is the length of the string in which we are looking for palindromes. I will use some simple arithmetic to obtain a value that is easier to calculate. The above expression can be reformulated to: \[ \sum\limits_{i=1}^{n} i \times (n+1-i) \] Now I apply a law which says that constant factors can be moved outside a sum, so \[ \sum\limits_{i=1}^{n} i \times k = k \times \sum\limits_{i=1}^{n} i \] where I instantiate $k$ with $n+1$, and I apply laws for sums of consecutive integers, namely \[ \sum\limits_{k=1}^n k = \frac{n(n+1)}{2} \\ \sum\limits_{k=1}^n k^2 = \frac{2n^3 + 3n^2 + n}{6} \] and some simple arithmetic to find that the above expression equals: \[ \frac{n^3+3n^2+2n}{6} \] So given a string of length $n$, the length of all substrings that appear in the string is $\frac{n^3+3n^2+2n}{6}$, which is slightly less than $n^3$.

Let me return now to the DNA string mentioned at the start of this blog post. Since its length is around 20,000,000, the total length of the substrings that appear inside it amounts to 1,333,333,533,333,340,000,000. This is more than a trilliard characters to be inspected! Suppose, unrealistically, that I can use the fastest supercomputer available on earth as of 2011, the Fujitsi K computer, named after the japanese word for 10 billiard, kei.

The kei computer is actually not so much a computer, but a collection of over 80,000 computers put together. The speed of a supercomputer is specified as the number of basic operations it can perform per second. The speed of the K computer is 10.51 petaflops, where a petaflop is a billiard of such basic operations per second. So the K computer is appropriately but modestly named after its speed. If a single basic operation is enough to compare two characters (which it isn't), then the K computer would need $1,333,333$, the length of the substrings divided by a billiard, divided by $10.51$, the number of billiard operations the K computer can perform per second, seconds, or about one and a half day, to calculate these comparisons. A normal computer would spend its entire lifetime on this problem, and not finish it.

That is not good enough.

No comments:

Post a Comment