Introduction:
In 1944, scientists showed that DNA is the carrier of hereditary information. DNA is a double helix composed of two long chains of nucleotides. The bases of DNA are represented by the letters A, C, G, and T and vary from one nucleotide to the next, giving a long coded message. The two chains are complementary: A pairs with T, and C pairs with G. Cytomegalovirus (CMV) infection is potentially life-threatening for people with suppressed or deficient immune systems. Scientists are searching for a special place on the virus's DNA that contains instructions for its reproduction, the origin of replication. The CMV DNA molecule contains 229,354 complementary base pairs, while human DNA has more than 3 billion. The data analyzed here come from the published DNA sequence of CMV. (Stat Labs pg. 76-79)
Results and Data:
Using Excel, two sets of 296 random numbers were generated between 0 and 229,354. These were compared with the locations of the 296 palindromes recorded in the CMV DNA sequence. To compare the results, I looked at the numbers themselves and at the differences between consecutive numbers. The table below gives the mean, skew, and standard deviation of those differences for the DNA data and for each generated set:
Data set | Mean Difference | Skew | Std. Dev. |
Palindrome | 775.511864 | 1.8913869 | 832.802799 |
Rand1 | 772.966102 | 1.33197561 | 684.78196 |
Rand2 | 766.552542 | 1.41364546 | 720.683734 |
Based on the above information, the palindrome locations look very much like randomly generated numbers. The averages, standard deviations, and even the histograms and linear charts show very little difference between the CMV DNA locations and the randomly generated numbers. The first graph simply shows each number plotted on a linear scale, the second is a histogram of the differences for the DNA and random numbers, and the last is a histogram of all DNA and random numbers grouped in bins of 4,000.
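The comparison described here can be sketched in Python instead of Excel. This is only an illustration under stated assumptions: the seed, the function name `spacing_summary`, and the choice of the adjusted Fisher-Pearson skewness (the formula behind Excel's SKEW function) are mine, and the random set it produces is not the one used for the table.

```python
import random
import statistics

GENOME_LENGTH = 229_354   # base pairs in the CMV sequence (from the text)
N_SITES = 296             # number of palindromes (from the text)

def spacing_summary(positions):
    """Return mean, sample standard deviation, and skewness of the gaps
    between consecutive sorted positions."""
    ordered = sorted(positions)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    mean = statistics.mean(gaps)
    sd = statistics.stdev(gaps)
    n = len(gaps)
    # Adjusted Fisher-Pearson skewness (assumed to match Excel's SKEW)
    skew = n / ((n - 1) * (n - 2)) * sum(((g - mean) / sd) ** 3 for g in gaps)
    return mean, sd, skew

rng = random.Random(0)  # arbitrary fixed seed so the run is repeatable
random_sites = [rng.randint(0, GENOME_LENGTH) for _ in range(N_SITES)]
mean, sd, skew = spacing_summary(random_sites)
print(f"mean gap = {mean:.1f}, std dev = {sd:.1f}, skew = {skew:.2f}")
```

Passing the real palindrome locations to `spacing_summary` in place of `random_sites` would give the corresponding figures for the DNA data.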
The Poisson Distribution:
The probability of getting k points in a unit interval is
$P(k) = \frac{\lambda^k e^{-\lambda}}{k!}$ for k = 0, 1, 2, …
λ = number of events per unit interval
λ = 5.157894737 (in the case of the CMV DNA)
Lambda is also the expected value of the distribution.
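To make the formula concrete, here is a minimal sketch that evaluates the Poisson probabilities for the CMV value of lambda. The function name `poisson_pmf` is just an illustrative choice; the only number taken from the text is lambda.

```python
import math

LAM = 5.157894737  # events per unit interval for the CMV DNA (from the text)

def poisson_pmf(k, lam):
    """Probability of exactly k points in a unit interval."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Probability of each count from 0 to 10 points in one interval
for k in range(11):
    print(k, round(poisson_pmf(k, LAM), 4))

# The expected value of the distribution should come back as lambda
expected_value = sum(k * poisson_pmf(k, LAM) for k in range(100))
print("expected value:", round(expected_value, 6))
```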
Example of Poisson Distribution:
The next part of this experiment was to look at the Poisson process. If events occur at a known average rate and independently of the time since the last event, the process can be considered Poisson.
Using the palindrome data, I counted the number of points in each bin of width 4,000 (e.g., 0-4000, 4001-8000, etc.). The palindrome count below shows the number of data points in consecutive bins.
Palindrome Count |
7 |
1 |
5 |
3 |
8 |
6 |
1 |
4 |
5 |
3 |
6 |
2 |
5 |
8 |
2 |
9 |
6 |
4 |
9 |
4 |
1 |
7 |
7 |
14 |
4 |
4 |
4 |
3 |
5 |
5 |
3 |
6 |
5 |
3 |
9 |
9 |
4 |
5 |
6 |
1 |
7 |
6 |
7 |
5 |
3 |
4 |
4 |
8 |
11 |
5 |
3 |
6 |
3 |
1 |
4 |
8 |
6 |
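The binning step and the tally of the counts above can be sketched as follows. The helper `count_per_bin` and its toy input are hypothetical, since the raw palindrome locations are not listed here; the `bin_counts` list is copied from the column above.

```python
from collections import Counter

def count_per_bin(positions, width, n_bins):
    """Count positions per bin using the 0-4000, 4001-8000, ... convention.
    Positions beyond the last bin are ignored."""
    counts = [0] * n_bins
    for pos in positions:
        index = max(pos - 1, 0) // width
        if index < n_bins:
            counts[index] += 1
    return counts

# Toy check with made-up positions (not real data)
toy = count_per_bin([1, 4000, 4001, 7999, 12001], width=4000, n_bins=4)

# Palindrome counts for the 57 consecutive bins, copied from the column above
bin_counts = [7, 1, 5, 3, 8, 6, 1, 4, 5, 3, 6, 2, 5, 8, 2, 9, 6, 4, 9, 4,
              1, 7, 7, 14, 4, 4, 4, 3, 5, 5, 3, 6, 5, 3, 9, 9, 4, 5, 6, 1,
              7, 6, 7, 5, 3, 4, 4, 8, 11, 5, 3, 6, 3, 1, 4, 8, 6]

lam = sum(bin_counts) / len(bin_counts)   # average palindromes per bin
frequency = Counter(bin_counts)           # how many bins hold each count

print("bins:", len(bin_counts), "palindromes counted:", sum(bin_counts))
print("lambda estimate:", lam)
for count in sorted(frequency):
    print(count, frequency[count])
```

The average number of palindromes per bin computed this way matches the lambda quoted in the Poisson section.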
Goodness-of-Fit
I used the chi-square test for goodness of fit to check whether this distribution is Poisson.
chi-square distribution |
Chi-Squared | Predicted | Interval | Observed | standardized residual |
0.06231539 | 6.36996293 | 0-2 | 7 | 0.2496305 |
0.03325119 | 7.50059656 | 3 | 8 | 0.1823491 |
0.01113553 | 9.67182189 | 4 | 10 | 0.10552503 |
0.09571912 | 9.97724785 | 5 | 9 | -0.3093851 |
0.03880769 | 8.57693237 | 6 | 8 | -0.1969967 |
0.27563819 | 6.31984491 | 7 | 5 | -0.5250126 |
0.00136715 | 4.07463685 | 8 | 4 | -0.0369751 |
0.5165489 | 4.47894899 | 9-14 | 6 | 0.71871336 |
Chi Squared = 1.03478316
Under the Poisson model, the test statistic has an approximate chi-square distribution with six degrees of freedom (eight intervals, minus one, minus one more for the estimated lambda). A value of about 1.03 is small for that distribution, so it appears possible that this is a Poisson model. The standardized residuals for this test are all very small, unlike those for the uniform distribution seen below, which has some larger values and spikes that suggest a lack of fit.
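The goodness-of-fit calculation can be reproduced with a short sketch. It assumes the lowest interval covers counts 0 to 2 and treats the top interval as "9 or more"; how the end intervals were computed for the table is not stated, so the statistic this produces may differ slightly from the tabulated 1.03. The closed-form tail probability used for the p-value holds only for an even number of degrees of freedom.

```python
import math

LAM = 5.157894737   # estimated rate per bin (from the text)
N_BINS = 57         # number of bins of width 4,000
observed = [7, 8, 10, 9, 8, 5, 4, 6]   # intervals 0-2, 3, 4, 5, 6, 7, 8, 9+

def poisson_pmf(k, lam):
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Probability of each interval under the Poisson model
probs = [sum(poisson_pmf(k, LAM) for k in range(3))]    # 0-2
probs += [poisson_pmf(k, LAM) for k in range(3, 9)]     # 3 through 8
probs.append(1.0 - sum(probs))                          # 9 or more (assumed)

expected = [N_BINS * p for p in probs]
residuals = [(o - e) / math.sqrt(e) for o, e in zip(observed, expected)]
chi_sq = sum(r * r for r in residuals)

# One degree of freedom lost for the fixed total, one for estimating lambda
df = len(observed) - 1 - 1

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable; valid for even df only."""
    return math.exp(-x / 2) * sum((x / 2) ** i / math.factorial(i)
                                  for i in range(df // 2))

p_value = chi2_sf_even_df(chi_sq, df)
print(f"chi-square = {chi_sq:.3f}, df = {df}, p-value = {p_value:.3f}")
```

A large p-value means the observed counts give no evidence against the Poisson model.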
Uniform Distribution:
Standardized residual | Observed | Predicted |
-0.1102822 | 29 | 29.6 |
-1.5807114 | 21 | 29.6 |
0.44112877 | 32 | 29.6 |
0.07352146 | 30 | 29.6 |
0.44112877 | 32 | 29.6 |
0.25732512 | 31 | 29.6 |
-0.2940858 | 28 | 29.6 |
0.44112877 | 32 | 29.6 |
0.80873608 | 34 | 29.6 |
-0.4778895 | 27 | 29.6 |
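The standardized residuals in this table follow from the observed counts alone. The sketch below assumes the ten rows are ten equal-length intervals of the genome, so that each is predicted to hold 296 / 10 = 29.6 palindromes; the chi-square total it prints does not appear in the table and is included only for comparison.

```python
import math

# Observed palindromes per interval, copied from the table above
observed = [29, 21, 32, 30, 32, 31, 28, 32, 34, 27]

# Uniform model: every interval is expected to hold the same share
predicted = sum(observed) / len(observed)

# Standardized residual: (observed - predicted) / sqrt(predicted)
residuals = [(o - predicted) / math.sqrt(predicted) for o in observed]
chi_sq = sum(r * r for r in residuals)
df = len(observed) - 1   # no parameters are estimated under the uniform model

for o, r in zip(observed, residuals):
    print(f"{r: .7f} | {o} | {predicted}")
print(f"chi-square = {chi_sq:.4f} on {df} degrees of freedom")
```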
Conclusion: Although at first the CMV DNA data seemed random, like the randomly generated numbers, the chi-square goodness-of-fit tests let me determine that these sequences are not totally random and follow a Poisson distribution fairly closely. This gives scientists some hope of finding the specific location of the origin of replication of this virus.