Posted by: thefunkymonk | May 18, 2009

## Patterns in DNA

Introduction:

In 1944, scientists showed that DNA is the carrier of hereditary information. It is a double-helical structure composed of two long chains of nucleotides. The bases of DNA are represented by the letters A, C, G, and T and vary from one nucleotide to another, giving a long coded message. The two chains are complementary: A always pairs with T, and C always pairs with G. Cytomegalovirus (CMV) can cause life-threatening disease in people with suppressed or deficient immune systems. Scientists are searching for a special place on the virus's DNA that contains instructions for its reproduction, the area of replication. The CMV DNA molecule contains 229,354 complementary base pairs, while human DNA has more than 3 billion base pairs. The data analyzed here come from the published DNA sequence of CMV. (Stat Labs pg. 76-79)

Results and Data:

Using Excel, I generated two sets of 296 random numbers between 0 and 229,354 to compare with the locations of the 296 palindromes recorded in the CMV DNA. To compare the results, I looked at the numbers themselves and at the differences between consecutive numbers. The table below gives the mean, skew, and standard deviation of those differences for the DNA data and for the two generated sets:

| Data set | Mean difference | Skew | Std. dev. |
|---|---|---|---|
| Palindrome | 775.511864 | 1.8913869 | 832.802799 |
| Rand1 | 772.966102 | 1.33197561 | 684.78196 |
| Rand2 | 766.552542 | 1.41364546 | 720.683734 |
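The random comparison sets can also be reproduced outside Excel. The following is a minimal sketch, assuming Python's `random` module as a stand-in for Excel's generator; the seed and the helper name `spacing_summary` are my own choices for illustration, and the skew formula is the adjusted one that Excel's SKEW uses. The numbers it prints will not match the table exactly, since every random draw is different.

```python
import random
import statistics

GENOME_LENGTH = 229_354   # base pairs in the CMV sequence (from the post)
N_SITES = 296             # number of palindromes (from the post)

def spacing_summary(locations):
    """Mean, sample standard deviation, and skew of the gaps
    between consecutive sorted locations."""
    ordered = sorted(locations)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    mean = statistics.mean(gaps)
    sd = statistics.stdev(gaps)
    n = len(gaps)
    # Adjusted Fisher-Pearson skewness, the same formula as Excel's SKEW.
    skew = n / ((n - 1) * (n - 2)) * sum(((g - mean) / sd) ** 3 for g in gaps)
    return mean, sd, skew

rng = random.Random(2009)  # arbitrary seed, only so the run is repeatable
rand_locations = [rng.randint(0, GENOME_LENGTH) for _ in range(N_SITES)]

mean, sd, skew = spacing_summary(rand_locations)
print(f"mean gap = {mean:.1f}, sd = {sd:.1f}, skew = {skew:.2f}")
```

With 296 locations there are 295 gaps, so the mean gap is always (largest location minus smallest location) / 295, which is why all three rows of the table have means close to 229,354 / 296.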

Based on the above information, the palindrome locations look very much like randomly generated numbers. The averages, standard deviations, and even the histograms and line plots show very little difference between the CMV DNA locations and the randomly generated numbers. The first graph simply shows each number plotted on a linear scale, the second is a histogram of the differences for the DNA and random numbers, and the last is a histogram of all DNA and random numbers grouped in bins of 4,000.

The Poisson Distribution:

The Poisson distribution gives the probability that you'll get k points in a unit interval:

$P(X = k) = \frac{\lambda^k}{k!} e^{-\lambda}$ for k = 0, 1, 2, …

$\lambda$ = average number of events per unit interval

$\lambda$ = 5.157894737 (in the case of the CMV DNA)

Lambda is also the expected value of the distribution.
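To make the formula concrete, here is a small sketch that evaluates it for the CMV value of $\lambda$. The function name `poisson_pmf` is just a label I chose; nothing here comes from the original spreadsheet.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with mean lam."""
    return lam ** k / math.factorial(k) * math.exp(-lam)

LAM = 5.157894737  # value of lambda quoted in the post for the CMV DNA

# Probability of seeing k points in one unit interval, for k = 0..10.
for k in range(11):
    print(k, round(poisson_pmf(k, LAM), 4))

# The probabilities sum to 1 and their mean is lambda
# (checked here over enough terms that the remaining tail is negligible).
total = sum(poisson_pmf(k, LAM) for k in range(60))
mean = sum(k * poisson_pmf(k, LAM) for k in range(60))
print(round(total, 6), round(mean, 6))
```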

Example of Poisson Distribution:

The next part of this experiment was to look at whether the palindrome locations behave like a Poisson process. If events occur at a known average rate and independently of the time since the last event, the process can be considered Poisson.

Using the palindrome data, I counted the number of points in each bin of width 4,000 (e.g., 0-4000, 4001-8000, etc.). The palindrome counts below give the number of data points in consecutive bins.

Palindrome count per bin: 7, 1, 5, 3, 8, 6, 1, 4, 5, 3, 6, 2, 5, 8, 2, 9, 6, 4, 9, 4, 1, 7, 7, 14, 4, 4, 4, 3, 5, 5, 3, 6, 5, 3, 9, 9, 4, 5, 6, 1, 7, 6, 7, 5, 3, 4, 4, 8, 11, 5, 3, 6, 3, 1, 4, 8, 6
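As a check on where $\lambda$ comes from, the sketch below averages the bin counts listed above and tallies how many bins hold each count. The variable names are mine. The 57 listed counts add up to 294 rather than 296, presumably because the last two palindromes fall past the final full bin.

```python
from collections import Counter

# Palindrome counts in the 57 consecutive bins, copied from the list above.
counts = [
    7, 1, 5, 3, 8, 6, 1, 4, 5, 3,
    6, 2, 5, 8, 2, 9, 6, 4, 9, 4,
    1, 7, 7, 14, 4, 4, 4, 3, 5, 5,
    3, 6, 5, 3, 9, 9, 4, 5, 6, 1,
    7, 6, 7, 5, 3, 4, 4, 8, 11, 5,
    3, 6, 3, 1, 4, 8, 6,
]

n_bins = len(counts)
lam_hat = sum(counts) / n_bins   # sample mean, used as the estimate of lambda
tally = Counter(counts)          # number of bins holding exactly k palindromes

print(n_bins, sum(counts), round(lam_hat, 9))
for k in sorted(tally):
    print(k, tally[k])
```

The tally is what feeds the "Observed" column of a goodness-of-fit table once the sparse counts at either end are grouped together.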

Goodness-of-Fit

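I used the chi-square test for goodness of fit to check whether this distribution is Poisson.

 chi-square statistic: $\chi^2 = \sum_{i=1}^k \left(\frac{X_i-\mu_i}{\sigma_i}\right)^2$, where $X_i$ is the observed number of bins in interval $i$, $\mu_i$ is the predicted number, and $\sigma_i = \sqrt{\mu_i}$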
| Chi-squared contribution | Predicted | Interval | Observed | Standardized residual |
|---|---|---|---|---|
| 0.06231539 | 6.36996293 | 0-2 | 7 | 0.2496305 |
| 0.03325119 | 7.50059656 | 3 | 8 | 0.1823491 |
| 0.01113553 | 9.67182189 | 4 | 10 | 0.10552503 |
| 0.09571912 | 9.97724785 | 5 | 9 | -0.3093851 |
| 0.03880769 | 8.57693237 | 6 | 8 | -0.1969967 |
| 0.27563819 | 6.31984491 | 7 | 5 | -0.5250126 |
| 0.00136715 | 4.07463685 | 8 | 4 | -0.0369751 |
| 0.5165489 | 4.47894899 | 9-14 | 6 | 0.71871336 |

Chi Squared=

 1.03478
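The calculation behind the table can be sketched in a few lines. This assumes each predicted value is 57 times the Poisson probability of the interval, with the standardized residual equal to (observed - predicted) / sqrt(predicted); the grouping and names below are mine. The total should land close to the 1.03 reported above, though small differences in the later decimals are possible depending on how the spreadsheet handled the end intervals.

```python
import math

LAM = 5.157894737   # estimated rate from the post
N_BINS = 57         # number of 4,000-base bins

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with mean lam."""
    return lam ** k / math.factorial(k) * math.exp(-lam)

# (interval label, palindrome counts covered, observed number of bins)
groups = [
    ("0-2", range(0, 3), 7),
    ("3", range(3, 4), 8),
    ("4", range(4, 5), 10),
    ("5", range(5, 6), 9),
    ("6", range(6, 7), 8),
    ("7", range(7, 8), 5),
    ("8", range(8, 9), 4),
    ("9-14", range(9, 15), 6),
]

chi_square = 0.0
for label, ks, observed in groups:
    expected = N_BINS * sum(poisson_pmf(k, LAM) for k in ks)
    residual = (observed - expected) / math.sqrt(expected)
    chi_square += residual ** 2   # squared residual = chi-squared contribution
    print(f"{label:>5}  obs={observed:2d}  exp={expected:7.4f}  resid={residual:+.4f}")

print(f"chi-square = {chi_square:.4f}")
```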

The test statistic has an approximate chi-square distribution with six degrees of freedom (eight intervals, minus one, minus one more for the estimated $\lambda$). With a value this small, it appears possible that this is a Poisson model. The standardized residuals for this test are all very small, unlike those for the uniform distribution seen below, which has some larger numbers and spikes that suggest a poorer fit.
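One way to put a number on "appears possible" is the upper-tail probability of the chi-square distribution, which was not computed in the original analysis. The sketch below uses the closed form that holds when the degrees of freedom are even; the function name is my own. A tail probability near 1 says the statistic is small compared with what chance variation alone would typically produce.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2m the tail has the closed form
    exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!.
    """
    if df <= 0 or df % 2:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

# Statistic and degrees of freedom as reported in the post.
p_value = chi2_sf_even_df(1.03478, 6)
print(f"p-value = {p_value:.3f}")
```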

Uniform Distribution:

| Standardized residual | Observed | Predicted |
|---|---|---|
| -0.1102822 | 29 | 29.6 |
| -1.5807114 | 21 | 29.6 |
| 0.44112877 | 32 | 29.6 |
| 0.07352146 | 30 | 29.6 |
| 0.44112877 | 32 | 29.6 |
| 0.25732512 | 31 | 29.6 |
| -0.2940858 | 28 | 29.6 |
| 0.44112877 | 32 | 29.6 |
| 0.80873608 | 34 | 29.6 |
| -0.4778895 | 27 | 29.6 |
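The residuals in this table follow the same recipe as before: with ten equal intervals, every predicted count is 296 / 10 = 29.6. A minimal sketch, with variable names of my choosing:

```python
import math

# Observed palindrome counts in the ten equal intervals (from the table above).
observed = [29, 21, 32, 30, 32, 31, 28, 32, 34, 27]

# Under a uniform model every interval is expected to hold the same share.
expected = sum(observed) / len(observed)

residuals = [(o - expected) / math.sqrt(expected) for o in observed]
chi_square_uniform = sum(r ** 2 for r in residuals)

for o, r in zip(observed, residuals):
    print(f"obs={o:2d}  exp={expected:.1f}  resid={r:+.4f}")
print(f"chi-square = {chi_square_uniform:.4f}")
```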

Conclusion: Although at first the CMV DNA data seemed as random as the randomly generated numbers, through the chi-square goodness-of-fit test I was able to determine that these sequences are not totally random and that the palindrome counts model a Poisson distribution fairly closely. This gives scientists some hope of finding the specific location of the area of replication of this virus.