Sunday, October 28, 2012

The history of finding palindromes

I have given several programs to find palindromes in the blog posts on this blog. Who was the first to construct those? The history of finding palindromes is not as long as the history of palindromes themselves, since, obviously, computers did not exist in the time of the ancient Greeks. Using a computer to find palindromes has probably been done since the 1970s. Before that, palindromes have been used to illustrate the power of computing models. In this blogpost I will go through the history of finding palindromes using a computer.

The oldest references to papers in computer science I could find mentioning palindromes (or, actually, `strings satisfying the symmetry predicate') were published in 1965 in Russian by Freivald and Bārzdiņš. Since I cannot read Russian, I depend on the description of these papers in a review from Albert Mullin in the Journal of Symbolic Logic in 1970. Both Freivald and Bārzdiņš show how to determine whether or not a string is a palindrome using different variants of a Turing machine. Turing machines are named after Alan Turing, who was born exactly 100 years ago.

In the 1930s, Turing devised a model of a machine to prove a fundamental mathematical theorem about the Entscheidungsproblem. The gist of this theorem is that you cannot solve all problems by means of a computer. In particular, you cannot use a computer to determine if an arbitrary program returns it result in a reasonable amount of time. The Turing machine model has had a profound influence on real computing machines built in the 1940s and 1950s. It has also been used a lot in Computer Science, the scientific field that arose since the construction of the first computers. For a long time, the fields of algorithms and complexity theory, which study algorithms and their efficiency, have used Turing machines to show the efficiency of particular algorithms. Modern computers and Turing machines are very dissimilar, and the complexity of software is often determined by factors not present in Turing machines. Probably for that reason, Turing machines are not used much anymore in Computer Science, although they still appear as examples of computing models.

The most basic form of a Turing machine has a tape on which it can write symbols, a table that tells which actions to take, using a head that can move along the tape, and a state register, which contains the different states the machine can be in, such as a start state, an end state, or a state in between. The machine can read a symbol from the type, write a symbol on the tape, and move the head along the tape. Freivald and Bārzdiņš show that to test whether an input word is a palindrome requires a number of steps that is quadratic in the length of the input string on such a Turing machine. Eight years later in 1973, Slisenko showed that if you are allowed to use more than one head on a tape, or multiple tapes, than a palindrome can be determined in a number of steps linear in the length of the input string. The English translation of the paper `Recognizing a symmetry predicate by multihead Turing machines with input' is a 183 page, dense mathematical text.

Slisenko announced the result already in 1969, and given the form and the length of his paper (with 183 pages this is more a book than a paper), I find it highly likely that it took him a couple of years to write it. Slisenko's result was a surprisingly strong result, and people started to study and trying to improve upon it. One of these people was Zvi Galil, then a postdoc IBM Yorktown Heights.
He had a hard time understanding Slisenko's paper, which I fully understand, but in 1978 he obtained the same result, only he needed just 18 pages to describe it. The problem of finding palindromes efficiently started off an area within Computer Science now known as stringology, which studies strings and their properties, and algorithms on strings.

Freivald, Bārzdiņš, Slisenko, and Galil describe algorithms for recognizing palindromes on Turing machines with a varying number of tapes. On other machine models, palindromes have been a popular example too. Stephen Cole showed how to recognize palindromes on iterative arrays of finite-state machines in 1964. Alvy Ray Smith III used cellular automata to recognize palindromes in 1972. Also in the 1970s, Freivald and Yao, independently, looked at recognizing palindromes by means of probabilistic Turing machines. Looking at these results from a distance of around 40 years, it seems like it was a sport to study programming with various restrictions, like not using your left hand, or tying your legs together.

Some of the machine models on which algorithms for palindromes were developed have rather artificial restrictions, which gives rather artificial algorithms for finding palindromes. But the algorithms on some of the more realistic machine models, such as the Random Access Machine (RAM) model, contain the essential components of the algorithm I give in the blog post on finding palindromes efficiently. Manacher gives a linear-time algorithm on the RAM computing model finding the smallest initial palindrome of even length. He also describes how to adjust his algorithm in order to find the smallest initial palindrome of odd length longer than 3. He did not realize that his approach could be used to find all maximal palindromes in a string in linear time. Zvi Galil and Joel Seiferas did, in 1976. They wrote a paper, titled `Recognizing certain repetitions and reversals within strings', in which they develop an algorithm that finds all maximal palindromes in a string in linear time. As far as I know, this is the first description of the algorithm I give in the blog post on finding palindromes efficiently.

Twelve years later I rediscovered this algorithm. I published a paper on `The derivation of on-line algorithms, with an application to finding palindromes', in which I show how to obtain the efficient algorithm for finding palindromes from the naive algorithm for finding palindromes using algebraic reasoning. The method at which I arrive at the algorithm is completely different from the way Galil and Seiferas present their version. I presented the algorithm and the method I use to construct it to several audiences. I was only 22 years old at the time, and I am afraid my presentation skills were limited. Several people that saw me presenting the efficient algorithm for finding palindromes thought they could do better. An example can be found in the book `Beauty is our business' (which isn't about models, or escort services, but about Computer Science, and dedicated to the Dutch Turing award winning computer scientist Edsger Dijkstra on his sixtieth birthday), in which Jan Tijmen Udding derives the same algorithm using Dijkstra's calculus for calculating programs.

Zvi Galil developed algorithms for almost all problems related to palindromes in his series of papers on palindromes. In 1985, he even developed an algorithm for finding palindromes using a parallel machine. Turing machines, and the other machine models mentioned above, are sequential machines: computations are performed in sequence, and not at the same time. Parallel machines allow computations to be performed at the same time, in parallel. In the previous century few parallel machines were available, but nowadays, even the laptop on which I am typing this text has four cores, and can do many things in parallel. The importance of parallel machines has increased considerably, also because it is expected that increasing the speed of computers by increasing the clock speed is not going to work anymore in the next couple of years. Extra power of computers has to come from the possibilities offered by using multiple cores for computations. To put these cores to good use, we need efficient parallel algorithms for the problems we want to solve. Apostolico, Breslauer, and Galil improved upon Galil's first parallel algorithm for finding palindromes in their paper on `Parallel detection of all palindromes in a string' in 1995. When finding palindromes in DNA, such parallel algorithms might be very useful. I am not aware of any experience report on using parallel algorithms to find palindromes in DNA, but I hope to try this out myself soon, and report on it on this blog.

Since the discovery that palindromes appear in DNA, people working on bioinformatics have developed algorithms for finding palindromes in DNA. The first description of these algorithms I could find are in the book Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, by Dan Gusfield.

This is one of the standard works in computational biology. Gusfield shows how to compute all maximal palindromes in a string using a suffix tree. He then moves on to discuss gapped palindromes (called separate palindromes in Gusfield's book), and approximate palindromes, for which he leaves it to the reader to construct an algorithm that returns approximate palindrome in time linear in the product of the maximum number of errors and the length of the input. After Gusfield, several other scientist have worked on finding gapped and approximate palindromes, sometimes improving on or generalizing Gusfield's results. For example, as I already mentioned in the blog post on palindromes with errors or gaps, Porto and Barbosa have developed an efficient algorithm that finds approximate palindromes in which also symbols may be inserted or deleted.

Are all problems related to finding palindromes solved? I think many are, and for most practical purposes, efficient algorithms for finding palindromes are available. From an algorithm-design perspective I think there are still some open questions. The central concept I use in the design of the efficient algorithm for finding palindromes is the palindromes in palindromes property. I think this property should also play a central role in efficient algorithms for finding gapped and approximate palindromes. I have tried for quite a while to design algorithms for these problems using the palindromes in palindromes concept, but failed. Anyone?

Tuesday, September 25, 2012

Finding palindromes in the Y chromosome

To test the program I obtain by combining palindrome finding with errors and gaps, I downloaded a copy of the Y chromosome. The first paper that described the occurrences of huge palindromes in the male chromosome referred to in my blog post on palindromes in DNA is `The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes'. This paper appeared in Nature in 2003, and the Nature web pages provide a link to the Y chromosome studied in the paper. The paper caused quite a stir. It has been cited more than a thousand times in other scientific papers. A thousand citations is a lot: none of my direct colleagues in computer science ever wrote an article with so many citations.

Since it is computationally infeasible to test the program for finding palindromes with errors and gaps with large numbers of allowed errors, I looked up the information about palindromes in the paper. The paper contains a table that gives information about the eight palindromes the authors found in chromosome Y.

Palindrome Arm length (kb) Arm-to-arm identity (%) Spacer length (kb) Palindrome span (kb)
P1 1,450 99.97 2.1 2,902
P1.1 9,9 99.95 3.9 24
P1.2 9,9 99.95 3.9 24
P2 122 99.97 2.1 246
P3 283 99.94 170 736
P4 190 99.98 40 419
P5 496 99.98 3.5 996
P6 110 99.97 46 266
P7 8.7 99.97 12.6 30
P8 36 99.997 3.4 75

Palindrome P1 consists of almost 3 million symbols, and the arm-to-arm identity is 99.97 percent. An arm has length almost one and a half million, so the number of errors is around 450. I assume that the errors are more or less evenly distributed in the palindrome, which implies that in the central 10,000 symbols of an arm of a palindrome, after the gap, about 3 errors would occur. So instead of trying to find the long palindrome with possibly 450 errors and a gap of size around 2100 symbols, I instead try to find palindromes with a gap of size 2200 (to be on the safe side) of length at least 5000 (to be on the safe side) with at most 5 errors (to be on the safe side). To my surprise, I do not find a single palindromes that satisfies these requirements. This long palindrome with a gap of size 2100 does not seem to appear in the Y chromosome of the authors. Since P2 has the same gap size, it does not appear either. Hmm, what's wrong here? Have I made a mistake?

Lets have a look at the other palindromes reported in the table. The next smallest gap size reported in the above table is 3.4 kb. If I try to find palindromes with a gap of size 3400, I get two hits. Around position 18905096 I find a palindrome of length 32056 (including the gap) if I allow for 8 errors. If I try to reduce the gap length, I find that I can take a gap length of 2320, and still find the same palindrome. So maybe this is one of the palindromes with a gap of around 2.1 kb? The arm length is (32056-2320)/2 = 14868, which doesn't correspond at all with the reported arm lengths. I also find a palindrome around position 9806955. This palindrome has length 17304, and 5 errors. This palindrome indeed has a gap of size 3400: reducing the gap size leads to much shorter palindromes. But the palindrome around this position is much shorter than the reported length in Nature. I experimented with different gap sizes and numbers of allowed errors, and found the following palindromes in the Y chromosome:

Center Length Errors Gap length
9806955 13904 5 3400
12078263 43356 1 151498
13738581 34452 0 39262
14445793 64738 5 132950
17989037 14934 5 297114
18899672 3364 7 39540
18905096 29736 8 2320
20530881 24374 2 157760

Since my findings are very different from the results reported in Nature, I contacted the group responsible for the paper, the Whitehead Institute at the Department of Biology at the MIT. The research scientist that answered my questions was kind enough to provide me with the positions at which the palindromes occur in the Y chromosome. Given the positions, I tried to compare the arms by hand (or eye, actually). I got nowhere: the start and end of the arms did not resemble each other at all.

The final clue I needed was that it is not enough to discard errors constituted by two symbols that do not match in a palindrome. Sometimes I also have to delete symbols in one arm of a palindrome, or, equivalently, add symbols to the other arm. With this last clue, I selected the reported palindromic arms in the Y chromosome, and compared them. It turns out that in each case I only have to delete or insert a couple of DNA symbols to obtain a palindrome. The table of palindromes I find are the palindromes P8 to P1 reported in Nature, with P7 missing. My seventh palindrome does not appear in the list of positions I received, but this is probably one of P1.1 or P1.2, of which I didn't receive the positions.

After months of playing with the Y chromosome in some of my spare time, I now finally find the palindromes I was after. If I would have to do the same thing again for another DNA string, I would follow the same approach:

  • use the program that finds palindromes with gaps and errors with various gap sizes and numbers of allowed errors to find centers of palindrome candidates
  • hand align the sequences around the centers of palindrome candidates to determine the length of the palindromes around these centers, given some number of allowed deletions or insertions
I don't think another approach is feasible given the current speed of computers.

Wednesday, September 12, 2012

Palindromes with errors or gaps

Sometimes I not only want to find perfect palindromes, but also palindromes that contain a limited number of errors. A palindrome with a limited number of errors is often called an approximate palindrome. For example, in the book Judges in the King James Bible, verse 19:9 reads:

  And when the man rose up to depart, he, and his 
  concubine, and his servant, his father in law, the 
  damsel's father, said unto him, Behold, now the day 
  draweth toward evening, I pray you tarry all night: 
  behold, the day groweth to an end, lodge here, that 
  thine heart may be merry; and to morrow get you early 
  on your way, that thou mayest go home.
The substring "draweth toward" is a text palindrome with one error: the `e' and the `o' don't match. To be precise, a string s is an approximate palindrome with k errors, if it satisfies the predicate approximatePalindrome k s. I define approximatePalindrome as a program by taking the first half of both the input string and its reverse, and comparing each of the elements of these strings pairwise. I don't have to compare the second halves of these strings, since that would give the same result as the comparison of the first halves, and mismatches would be counted twice.

> approximatePalindrome k s = 
>   let half = div (length s) 2
>       approximatePalindrome' k []     []     = k>=0
>       approximatePalindrome' k (x:xs) (y:ys)  
>         | x == y    = approximatePalindrome' k     xs ys
>         | k >= 0    = approximatePalindrome' (k-1) xs ys
>         | otherwise = False
>   in  approximatePalindrome' k 
>         (take half s) 
>         (take half (reverse s)) 

A palindrome with a gap is a palindrome in which a gap of a particular size in the middle is ignored. An example of a palindrome with a gap is found in Revelations, where verses 20:7-8 read:

  And when the thousand years are expired, Satan shall 
  be loosed out of his prison, And shall go out to 
  deceive the nations which are in the four quarters of 
  the earth, Gog, and Magog, to gather them together to 
  battle: the number of whom is as the sand of the sea.
Here "Gog, and Magog" is a text palindrome with a gap of length three in the middle: the `n' and the `M' around the central `d' don't match. Since the gap appears in the middle of the string, the length of the gap is odd if the length of the palindrome is odd, and even if the length of the palindrome is even. To be precise, a string s is a palindrome with a gap of length g in the middle, if it satisfies the predicate gappedPalindrome g s:

> gappedPalindrome g s = 
>   let ls            = length s
>       removeGap g s = 
>           let armLength     = div (ls - g) 2
>               (before,rest) = splitAt armLength s
>               (gap,after)   = splitAt g rest
>           in  before ++ after 
>   in  if    g < ls && odd g == odd ls 
>       then  palindrome (removeGap g s)
>       else  error "gappedPalindrome: incorrect gap length"
 
If the gap is shorter than the length of the input string, and both the gap and input string have even or odd length, I remove the gap from the middle of the string, and check that the remaining string is a palindrome. I remove the gap by taking half of the elements of the input string, minus half of the gap elements, then dropping the gap, and then taking the rest of the string.

To find palindromes with errors or gaps, I adapt my software for finding palindromes.

It is quite easy to adapt the quadratic-time algorithm for finding palindromes to also find approximate palindromes, and palindromes with gaps. In this blog post I show how to adapt the quadratic-time algorithm that finds palindromes in DNA. Adapting the quadratic-time algorithms for finding exact or text palindromes is done in exactly the same way. To find approximate palindromes in DNA, I pass an extra argument k, representing the maximum number of errors allowed, to the function qMaximalPalindromesDNA, which I now call approximatePalindromesDNA. The only function I have to adapt is the function lengthPalindromeAroundDNA:


> lengthApproximatePalindromeAroundDNA 
>   input lastPos k start end  
>   |  start < 0 || end > lastPos                =  end-start-1
>   |  B.index input start =:= B.index input end =  
>        lengthApproximatePalindromeAroundDNA 
>          input 
>          lastPos 
>          k 
>          (start-1) 
>          (end+1) 
>   |  k > 0                                     =  
>        lengthApproximatePalindromeAroundDNA 
>          input 
>          lastPos 
>          (k-1) 
>          (start-1) 
>          (end+1) 
>   |  otherwise                                 =  end-start-1

The time complexity of approximatePalindromesDNA is related to the sum of the lengths of the approximate palindromes found in the input. As I described in the blog post on finding palindromes in DNA, this is not a problem for perfect palindromes in DNA. In the case of approximate palindromes, the number of allowed errors has a big influence on the time complexity. If I increase the number of allowed errors by one, the length of all palindromes increases with at least two, since a single error represents two symbols, and since after the error there might be matching symbols. For example, if I allow for 500 errors to find a palindrome consisting of almost $3$ million symbols, each approximate palindrome in the input has at least length $1000$, except for the palindromes at the start and the end of the string. The average approximate palindrome is quite a bit longer, since the chance for an accidental match is one in four. Since each palindrome has length on average more than $1250$, applying the function approximatePalindromesDNA to a DNA string of length $25$ million requires more than $30$ billion steps only to compute the lengths of the approximate palindromes. This number is very challenging for my laptop, and until now I haven't been able to convince my machine to calculate these values for me, despite letting it run for several nights.

To find palindromes with gaps in DNA, I use an extra argument g, which represents the length of the gap in the middle of the palindrome which I can ignore. A gap may be odd or even, and in the former case I get a palindrome around a center on a symbol in a DNA string, instead of in between two symbols. So where an exact palindromes in a DNA string has its center in between two symbols, a gapped palindrome may have a center in between two symbols or on a symbol, depending on whether or not the gap is even. The function qMaximalPalindromesDNA, which is now called gappedMaximalPalindromes, is the only function I have to adapt. For each center, I have to subtract half of the gap of the center for the start position, and add half of the gap to the end position. When doing so, I have to check that these positions do not point before the beginning of the string or after the end of the string. If they do, I have to replace half of the gap by the appropriate smaller number.


> gappedMaximalPalindromesDNA input g  =
>   let halfg       =  div g 2
>       lengthInput =  B.length input
>       centers     =  if   even g 
>                      then [0 .. lengthInput]
>                      else [0 .. lengthInput-1]
>       halfg' c    |  c < halfg             = c
>                   |  c+halfg > lengthInput = lengthInput-c
>                   |  otherwise             = halfg
>       left c      =  c-1-halfg' c
>       right c     =  if   even g
>                      then c+halfg' c
>                      else c+1+halfg' c
>   in  map 
>         (\c -> lengthPalindromeAroundDNA 
>                  input 
>                  (lengthInput-1)
>                  (left c)
>                  (right c)
>         ) 
>         centers

By combining the functions lengthApproximatePalindromeAroundDNA and gappedMaximalPalindromesDNA I obtain a function for determining palindromes with both gaps in the middle, and errors in the palindromic arms.

Adapting the quadratic-time algorithm for finding palindromes to also allow for errors and gaps in palindromes is not very hard. How about adapting the linear-time algorithm for finding palindromes?

Adapting the linear-time algorithm to account for errors and gaps is very hard. I spent a long time studying the problem of finding palindromes with a given maximum number of errors in time linear in some combination of the number of errors and the length of the input, but failed to find a satisfying solution. During a couple of long trips I drew many ideas in my notebook to adapt the main components of the linear-time algorithm, the longest tail palindrome and the list of maximal palindromes with centers before the longest tail palindrome, to allow for errors and gaps. In vain.

I tried to adapt the linear-time algorithm for finding palindromes to also find approximate palindromes before I had realized that the quadratic-time algorithm returns its results much faster on DNA strings than the linear-time algorithm. My frustrations about not being able to find a solution have almost disappeared now.

In the scientific literature I found two solutions to the problem of finding approximate palindromes in time linear in some combination of the input and the number of allowed errors. The first solution uses so-called suffix trees. A suffix tree is an advanced data structure that can be used to find all kinds of structures in strings, among which palindromes. I compared the efficiency of my linear-time algorithm for finding exact palindromes with an algorithm using suffix trees, and my linear-time algorithm was far superior. The solutions for finding approximate palindromes using suffix trees use even more arguments, and they will run even slower than the algorithm for finding exact palindromes in strings using suffix trees. The quadratic-time solution for finding approximate palindromes with gaps is no more complicated than the above programs, so although I did not perform any tests, I am pretty confident that the above quadratic-time algorithm for finding approximate palindromes runs much faster than the solutions using suffix trees when applied to DNA strings. The second solution uses advanced algorithms on graphs to determine edit actions that turn one string into another string. I'm afraid I don't understand the details sufficiently well to discuss the algorithm here. The time-complexity of this algorithm is related to $k^2 \times n$, where $k$ is the number of allowed errors, and $n$ the length of the input string. For the DNA string of length $25$ million and the $500$ errors used as examples above, this would amount to more than $6$ trillion steps. My machine would just laugh at it.

Thursday, August 16, 2012

Finding palindromes in DNA

Palindromes play an important role in DNA, and many geneticists want to find palindromes in the DNA they have. To find palindromes in DNA, I cannot directly apply the algorithm for finding palindromes described in a previous blog post, since the equality used for palindromes in DNA differs from the normal equality. A string is a palindrome in DNA if each symbol in the reversed string is the negation of the symbol in the original string, where `A' is the negation of `T' and vice versa, and `C' is the negation of `G', and vice versa. The operator =:= introduced in my blog post on What is a palindrome II? A specification in Haskell implements this equality. By negating the result of =:= I get the inequality operator =/= on DNA symbols.

A first attempt to find palindromes in DNA is to adapt the various components to deal with DNA in the functions extendPalindrome, moveCenter, and finalPalindromes introduced in the blog post on finding palindromes efficiently. In the function extendPalindrome the expression a!... /= a!..., where the dots should be replaced by the appropriate indices in the input, is replaced by the expression a!... =/= a!... using the DNA inequality operator. This is sufficient to turn the function extendPalindrome into a function that finds palindromes in DNA. The functions moveCenter and finalPalindromes do not use the (in)equality operator, so you might think that this is all that is needed to obtain a program for finding palindromes in DNA.

There is one more location in the algorithm for finding palindromes where I use a property of palindromes that does not hold for DNA. In the function moveCenter I call function extendPalindrome with a new "current palindrome" of length one. But a DNA string of length one is not a palindrome, since if I reverse it, I get the same string, and to be a palindrome, the string has to contain the complement symbol. Every DNA string of odd length is not a palindrome, since when such a string is reversed, the symbol in the middle is compared against itself, and DNA equality fails. To find palindromes in a DNA string, I only have to look at the even positions in the string, which correspond to the positions in between symbols. The palindromes around odd positions on the symbols always have length zero.

I adapt the algorithm for efficiently finding palindromes to only find palindromes around even positions. Besides adapting the function extendPalindrome with a different inequality operator, I adapt the function moveCenter by calling extendPalindrome with a new longest current palindrome of length zero, and by counting the number of centers down with steps of size two instead of one, since I am not interested in the odd center positions anymore. These are all changes needed, and I will not give the resulting algorithm. The interested reader can find them in my software package for finding palindromes.

In spring 2012 I was contacted by an Indian scientist who was developing an algorithm for finding palindromes in DNA. I soon found out that her algorithm was similar to my naive, quadratic-time, algorithm for finding maximal palindromes. What worried me was that her algorithm seemed to run faster than my efficient algorithm for finding palindromes, adapted to finding palindromes in DNA. By thinking that my efficient algorithm would run faster, I had made the classical mistake of confusing worst-case time complexity with practice.

The worst-case time complexity of my efficient algorithm for finding palindromes is linear. The number of steps the algorithm takes is linear in the length of the input string. The worst case time-complexity of the naive algorithm is quadratic. A string in which all substrings are palindromes leads to "worst case" time behaviour. An example of a DNA string satisfying this property is a string looking like "ATATATAT...". For a string consisting of three million copies of "AT", the efficient algorithm returns the length of all maximal palindromes about equally fast as for the the Amycolatopsis marina, which is about equally long, but the quadratic time algorithm never finishes on the repetitions of ATs. Strings like "ATATATAT..." appear in DNA, but they are short. For example, the longest maximal palindrome in chromosome 18 consisting of just "ATAT..." has length 66. The longest overall maximal palindrome has length 76. The majority of the maximal palindromes is much shorter. The time-complexity of the naive algorithm for finding palindromes is the sum of the length of the palindromes found in the argument string. If the maximum length of a palindrome is 76, then the number of steps performed by the naive algorithm is at most 76 times the length of the input string. Since chromosome 18 consists of almost 75 million symbols, this is more than five and a half billion, which is quite a lot, but doable for a modern laptop. Indeed, if I measure the running time of the quadratic and linear-time solutions on chromosome 18, I find that the quadratic solution is about four to five times faster than the linear solution. To be precise, on my MacBook Pro with 2.4 GHz Intel Core i5 with 4GB memory, with very few other processes running, the linear-time solution requires slightly more than 30 minutes to analyse the almost 75 megabytes of chromosome 18, whereas the quadratic solution uses around 6 minutes. Why is it faster?

It is hard to exactly determine what one program makes faster than another. The linear time solution uses quite a few arguments, in particular the list of maximal palindromes found until now. This is the central idea to obtain linear-time behaviour, but it also requires constructing, passing on, inspecting, and deconstructing this list to determine maximal palindromes with centers to the right of the current longest tail palindrome. I have experimented with several kinds of representations for the list of maximal palindromes used as argument to the functions extendPalindrome and moveCenter, but none of these solutions can compare to the quadratic solution when looking at running times on large DNA strings. An important motivation for developing a linear-time algorithm for finding palindromes turns out not to be valid at all...

The quadratic-time solution for finding palindromes in DNA is very similar to the naive algorithm for finding maximal palindromes. There are three differences. First, the input to the function maximalPalindromes, which I now call qMaximalPalindromesDNA, is of type ByteString instead of Array Int Char. The type ByteString is a data structure specifically designed to efficiently deal with strings of which the length is known. It is very similar to the type String, but functions like read for reading strings, map for mapping a function over a string, length for determining the length of a stringdefined in the module ByteString, all are usually much faster than their equivalents on String, or even on Array Int Char. I write B.ByteString, B.length, and so on, to make explicit that these components come from the module ByteString:


> import qualified Data.ByteString as B

Second, in the function qMaximalPalindromesDNA I only consider positions in between symbols around which to find maximal palindromes, since palindromes in DNA have even length. The length of the list of centers has length one plus the length of the input, to account for the center at the end of the input.

> qMaximalPalindromesDNA :: B.ByteString -> [Int]
> qMaximalPalindromesDNA input = 
>   let lengthInput =  B.length input
>       centers     =  [0 .. lengthInput]
>       lastPos     =  lengthInput - 1
>   in map 
>        (\c -> lengthPalindromeAroundDNA 
>                 input 
>                 lastPos 
>                 (c-1) 
>                 c
>        ) 
>        centers

The third difference is that function lengthPalindromeAroundDNA doesn't distinguish between even-length and odd-length palindromes anymore, since all palindromes have even length, and that it needs to use the equality operator for DNA.

> lengthPalindromeAroundDNA input lastPos start end  
>   | start < 0 || end > lastPos                 =  end-start-1
>   | B.index input start =:= B.index input end  =  
>       lengthPalindromeAroundDNA 
>         input 
>         lastPos 
>         (start-1) 
>         (end+1) 
>   | otherwise                                  =  end-start-1

This solves the problem of finding palindromes in DNA. But I cannot yet use this algorithm to find the huge palindromes in the Y chromosome, since those palindromes have some noise in the middle, and the palindromic arms contain a few errors. I will look into this in a following blog post.

Sunday, July 22, 2012

Palindromes in DNA

DNA strings are often millions of letters long. For example, the copy I have of the DNA of the Amycolatopsis marina, a bacteria discovered in 2009, consists of $6,503,724$ characters.

Bacteria are of course some of the smallest living beings. The DNA in the copy I have of the human chromosome 18 is an astounding $74,657,229$ characters long. Chromosome 18 is one of the 23 chromosomes of human beings, and is estimated to contain in between $300$ and $500$ genes. A gene is responsible for part of our functioning as human beings. An example of a gene contained within chromosome 18 is the NPC1 gene, named after the Niemann-Pick disease, type C1. This gene is named after the disease you get when you have an error on the gene. Experiments with mice show that this gene probably controls the appetite, and people with a mutation on the NPC1 gene often suffer from obesitas. Interestingly, the same mutation on NPC1 might make you immune for the ebola virus, one of the deadliest known viruses. So the mutation is partly a blessing in disguise.

If I search for the keyword palindrome in the electronic publications available at the library of my university, I get more than $500$ hits. The first ten of these hits are all about palindromes in DNA. My guess is that at least $90$% of the list of publications are about palindromes in DNA. So what is the interest in palindromes in DNA? Surely, if your strings only contain `A', `T', `C', and `G' you are likely to get many palindromes?

Let us look at how many palindromes we expect to find in the DNA of the Amycolatopsis marina. Suppose I want to find palindromes of length twelve. I calculate the chance that an arbitrary DNA string of length twelve is a palindrome as follows. The first six characters of the string don't matter. The seventh character needs to match with the sixth character, for which we have a chance of one in four. Remember that in DNA, `A' matches with `T' and vice versa, but both `A' and `T' do not match with themselves, or with `C' and `G'. The eighth character needs to match with the fifth character, for which we also have a chance of one in four. This goes on until the twelfth character, so I get a chance of \[ \frac{1}{4} \times \frac{1}{4} \times \frac{1}{4} \times \frac{1}{4} \times \frac{1}{4} \times \frac{1}{4} = (\frac{1}{4})^6 = \frac{1}{4^6} = \frac{1}{4096} \] Since the Amycolatopsis marina is $6,503,724$ characters long, there are $6,503,713$ substrings of length twelve. Multiplying this number with the chance that it is a palindrome, I expect to get $1,589$ palindromes. Using my algorithm for finding palindromes I get $1,784$ palindromes. This is slightly above $10$% more than expected, but that might be an accident. If I look at the palindromes of length fourteen in the $18$th human chromosome, using a similar calculation I expect to find $4,556$ palindromes. My algorithm finds $25,323$. More than five times as many palindromes as expected! This is not an accident anymore. Palindromes play a role in DNA. But what role?

Palindromes perform several tasks in human DNA. I will discuss one particularly intriguing task in this blog post.

Why do we humans have sex? I can answer this question from several perspectives. The most common answer will probably mention love, pleasure, or children. Looking at the question from a biological perspective, the answer discusses the biological advantages of having sex. Our children get their genes from two parents. They get two complete copies of human DNA, and merge these copies in some way to make them the way they are. In this merging process, damaged DNA from one parent can be replaced by functioning DNA from the other parent. There is a lot of DNA to repair: every day, each cell may be damaged at one million different places. Most of this damage is harmless, since a lot of DNA does not seem to have any function at all, but some kinds of damage may cause a lot of problems. Being able to repair non-functioning DNA when combining DNA from parents is essential for keeping the human race in a good state. The American geneticist Hermann Joseph Muller used this argument to explain why sexual reproduction is favored over asexual reproduction in organisms. When an organism reproduces asexually it passes all its DNA errors on to its offspring, without a possibility to repair them, eventually leading to the extinction of the population. The process has been dubbed Muller's ratchet, after the ratchet device, which can only turn in one direction. This is the theory, practice is slightly more complicated.

Muller's ratchet should already be at work in humans, since there is one human chromosome in which no combination of parental genes takes place: chromosome 23. Chromosome 23 determines whether we become a man or a woman. A man gets a copy of the chromosome called X from his mother and a copy called Y from his father. A woman gets an X from her mother and an X from her father. A woman can merge both X's and repair possible errors, but a man has two different copies of the chromosome, and has no possibility to combine, let alone repair, them. The Y is passed on from father to son, with no involvement of women. Muller's ratchet theory says the genes on the chromosome that make a man should deteriorate quickly, and men should soon become extinct.

There is no sign of men becoming extinct. Apparently there are some other mechanisms at work to repair DNA on the Y chromosome. If I cannot obtain a copy of some piece of DNA from my mother, maybe I can store copies of the DNA in the Y chromosome itself? If I maintain two copies, I can always check if a piece of DNA is correct by comparing it against the copy of this piece of DNA. This is the mechanism used by the Y chromosome, where the copies are stored as palindromes, with some noise in the middle of these palindromes. Such palindromes with gaps in the middle are often called inverted repeats in biology. The Y chromosome contains eight huge palindromes, the longest of which consists of almost three million characters. Around 25% of the Y chromosome consists of palindromes. The DNA in the palindromes carries genes for describing the male testes. So the mechanism by means of which men survive is called palindrome...

Sunday, July 1, 2012

Other implementations and solutions

Next year it is 25 years ago since I constructed an algorithm for finding palindromes efficiently. I was 22 years old then, and had just started as a PhD student. I was quite excited about having solved the problem of finding palindromes efficiently, and wrote a scientific paper about it. My excitement wasn't shared by the scientific community though. In the last twenty-five years this paper has been cited less than ten times, and appears at the bottom end of my most cited papers list.

In 2007 we developed the ICFP programming contest. The ICFP programing contest is a very challenging contest, in which thousands of programmers try to show off their programming talents in their programming language of choice. Our contest asked participants to transform the left picture below to the right picture using as few commands as possible. We included a problem related to palindromes in our contest, since this was my pet-problem ever since 1988. After the contest I explained how to solve the palindrome problem efficiently in a blog post.












When some years later I looked at the number of hits the contest pages received, I found that each month, thousands of people are reading the blog message about finding palindromes. Why does this page attract so many visitors?

The palindrome problem is a common question in interviews for jobs for software developers, so the blog post attracts software developers looking for a new job, and preparing themselves for an interview. Another reason the blog post attracts visitors is that I think quite a few people are genuinely interested in the problem and its solution. The concept of palindromes appears in almost all languages, which means that the question of finding palindromes is asked all over the world. The blog post indeed attracts visitors from all over the world. The last 100 (July 1, 2012) visitors come from all continents except Australia.

Many people ask the question of how to find palindromes, but also many people try to answer the question. You can find hundreds of solutions for finding palindromes on the internet. Some of these are variants of my linear-time solution, others are more naive quadratic or sometimes even cubic-time solutions. Below I give the list I found, ordered on the programming language used for the implementation. If there exists a linear-time implementation, I don't list less efficient solutions.

C The same linear-time solution as my blog post.
C++ 1 The same linear-time solution as my blog post. This post has an extensive description of the palindromes in palindromes property, including some pictures which try to explain it.
C++ 2 A C++ implementation of Manacher's algorithm.
C++ 3 The same linear-time solution as my blog post in C++ by Fernando Pelliccioni.
C# As far as I can see, this is a quadratic-time solution.
Factor A cubic-time solution.
F# A quadratic-time solution.
Go A quadratic-time solution, by Lars Björnfot.
Haskell Just as the program for finding palindromes described in my blog post, this blog post describes a linear-time program for finding palindromes in Haskell. The interesting aspect of this solution is that it returns results lazily: as soon as it finds a maximal palindrome, it writes it to the output.
Java 1 A quadratic-time solution.
Java 2 Another quadratic-time solution in Java, by Marcello de Sales.
Java 3 Yet another quadratic-time solution in Java, by Mohit Bhandari.
Matlab A quadratic-time solution, by Lalit Mohan.
PHP 1 As far as I can see this is a quadratic-time solution, by Marc Donaldson.
PHP 2 This is a quadratic- or cubic-time solution, by Joseba Bikandi. You can try it on-line.
Python The same linear-time solution as my blog post, developed by Fred Akalin. This post contains an extensive description of the palindromes in palindromes property too.
R This page describes the interface of functionality for finding palindromes in DNA strings. It doesn't say how efficient the software is.
Ruby 1 A quadratic-time solution, by Matthew Kerry.
Ruby 2 Another quadratic-time solution in Ruby, by Rick DeNatale. The post and the comments mention some more Ruby implementations.
Ruby 3 Yet another quadratic-time solution in Ruby, by Mitchell Fang.
Scheme A quadratic-time solution.
I could only find linear-time implementations for finding palindromes in C, C++, Haskell, and Python. Please let me know if you find better or alternative implementations for finding palindromes. This list easily gets outdated, so please keep me informed.

Friday, June 29, 2012

The Sator square

The Sator square is a square that reads the same forwards, backwards, upwards and downwards:

Sator
Arepo
Tenet
Opera
Rotas

The square has been found at many places around the world: on amulets, carved into walls or stones, written on disks used for extinguishing fire, and so on. Its first known appearance is in the ruins of Pompeii. The square was found in a building that was decorated in a style that became popular since the year 50. Pompeii was covered in ash from the Vesuvius in the year 79, and it follows that this example is almost 2000 years old. This makes the square one of the oldest known palindromes.

What does the Sator square mean? Except for `Arepo', scholars mostly agree on the meaning of the individual words. `Sator' is the seeder, sower, or begetter. It is also used as a metaphor for God, not necessarily Christian, but also Roman. `Tenet' is a verb, and means it, he or she holds. `Opera' is often considered to mean with effort. It is related to opus or opera, which means work. Finally, `Rotas' is generally considered a noun, meaning the wheels. But it might also be a verb meaning turn. Nobody has ever found a word in Roman, Greek, Etruscan, or any Indo-European language that explains `Arepo'. Some people think that it is a name, others that it stands for plough, misspelled to fit the palindrome, and yet others that it is just nonsense. For a long period, people thought the square had a Christian meaning by rearranging the letters into two Pater Nosters in the form of a cross. The Pre-Christian Pompeii find crushed this theory. If Arepo is a name, the Sator square might read: `The sower Arepo holds the wheels with effort', and if it stands for plough, you might get: `God holds the plough, but you turn the furrows'. According to John Cullen, this could have been a motto for farmers in Roman times. Most explanations I have read call the Sator square an ancient meme, with as much meaning as the sentence `all your base are belong to us'. In 1937, the Italian Antonio Ferrua probably gives the best explanation for what the square means: `Esattamente quello che si vuole'(!) E basta di questo argumento (it means exactly what you want it to mean. And so much for that argument!). The Sator square is an example of a meme that went viral long before the internet.

Here is an instance of the square I found in the library in Skara, Sweden, written by an unknown author probably in Stockholm, Sweden, in the year 1722. It starts with Rotas instead of Sator, just as the first appearances of the square. For some reason the version starting with Sator has become more popular.

Because it is a palindrome, the Sator square was thought to have many healing effects, curing snake-bites, headaches, jaundice, and many other illnesses. Medieval books mention the square as a cure for fever and insanity. Interestingly, although we now know that using words to cure an illness is of little help, we humans do use palindromes to repair our body. I will say more about this in a later blog post on palindromes in DNA.

The amount of human effort gone into explaining the meaning of the Sator square is unbelievable. Since 1889, when Francis John Haverfield described a find of the square in a Romano-Bristish building in Cirencester, there has been a steady stream of articles on the Sator square, and the total number of articles easily surpasses a hundred. Rose Mary Sheldon recently published a 54-page annotated biography of the literature on the Sator square: The Sator rebus: An unsolved cryptogram? Charles Douglas Gunn wrote his PhD thesis on the square at Yale in 1969. He suggests that the square was written by a Roman who wanted to take palindromic squares one step further from the misformed four-letter word square Roma tibi subito montibus ibit amor, meaning `For by my efforts you are about to reach Rome, the object of your travel'. He wrote software to generate all possible five-letter Latin word squares. These squares take up more than a hundred pages in his thesis. He concludes that the Sator square is the best.

There are not many other examples of squares like the Sator square, consisting of five words that together constitute a meaningful palindrome. Piet Burger discovered the following Dutch square in 1990.

Sneed
nette
etste
etten
deens

Its meaning is so far-fetched that I won't try to give an English translation.