MEDLINE/PubMed Journal Browser Search

Pubmed for Handhelds

PUBMED FOR HANDHELDS

Search MEDLINE/PubMed

Title: Calculating the exact probability of language-like patterns in biomolecular sequences.
Author: Atteson K.
Journal: Proc Int Conf Intell Syst Mol Biol; 1998; 6():17-24. PubMed ID: 9783205.
Abstract:
We present algorithms for the exact computation of the probability that a random string of a certain length matches a given regular expression. These algorithms can be used to determine statistical significance in a variety of pattern searches such as motif searches and gene-finding. This work improves upon work of Kleffe and Langebacker (Kleffe & Langbecker 1990) and of Sewell and Durbin (Sewell & Durbin 1995) in several ways. First, in many cases of interest, the algorithms presented here are faster. In addition, the type of pattern considered here strictly includes those of both previous works but also allows, for instance, arbitrary length gaps. Also, the type of probability model which can be used is more general than that of Sewell and Durbin, allowing for Markov chains. The problem solved in this work is in fact in the class of NP-hard problems which are believed to be intractable. However, the problem is fixed-parameter tractable, meaning that it is tractable for small patterns. The is problem is also computationally feasible for many patterns which occur in practice. As a sample application, we consider calculating the statistical significance of most of the PROSITE patterns as in Sewell and Durbin. Whereas their method was only fast enough to exactly compute the probabilities for sequences of length 13 larger than the pattern length, we calculate these probabilities for sequences of up to length 2000. In addition, we calculate most of these probabilities using a first order Markov chain. Most of the PROSITE patterns have high significance at length 2000 under both the i.i.d. and Markov chain models. For further applications, we demonstrate the calculation of the probability of a PROSITE pattern occurring on either strand of a random DNA sequence of up to 500 kilo-bases and the probability of a simple gene model occurring in a random sequence of up to 1 megabase.

[Abstract] [Full Text] [Related] [New Search]