In silico comparison of bacterial strains using mutual information
Fast-sequencing throughput methods have increased the number of completely sequenced bacterial genomes to about 400 by December 2006, with the number increasing rapidly. These include several strains. In silico methods of comparative genomics are of use in categorizing and phylogenetically sorting these bacteria. Various word-based tools have been used for quantifying the similarities and differences between entire genomes. The simple di-nucleotide frequency comparison, codon specificity and k-mer repeat detection are among some of the well-known methods. In this paper, we show that the Mutual Information function, which is a measure of correlations and a concept from Information Theory, is very effective in determining the similarities and differences among genome sequences of various strains of bacteria such as the plant pathogen Xylella fastidiosa, marine Cyanobacteria Prochlorococcus marinus or animal and human pathogens such as species of Ehrlichia and Legionella. The short-range three-base periodicity, small sequence repeats and long-range correlations taken together constitute a genome signature that can be used as a technique for identifying new bacterial strains with the help of strains already catalogued in the database. There have been several applications of using the Mutual Information function as a measure of correlations in genomics but this is the first whole genome analysis done to detect strain similarities and differences.