Posted by: thefunkymonk | May 18, 2009

Patterns in DNA

Introduction:

In 1944, scientists showed that DNA is the carrier of hereditary information. DNA is a double-helical structure composed of two long chains of nucleotides. The bases of DNA are represented by the letters A, C, G, and T and vary from one nucleotide to the next, giving a long coded message. The two chains are complementary: A pairs with T and C pairs with G. Cytomegalovirus (CMV) can cause a potentially life-threatening disease in people with suppressed or deficient immune systems. Scientists are searching for a special place on the virus's DNA that contains instructions for its reproduction, the origin of replication. The CMV DNA molecule contains 229,354 complementary base pairs, while human DNA has more than 3 billion base pairs. The data below come from the published DNA sequence of CMV. (Stat Labs pg. 76-79)
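As a small illustration of the pairing rule, here is a minimal Python sketch. The helper names are my own, and the definition of a "complementary palindrome" (a stretch of DNA that reads the same as its own reverse complement) is an assumption about what the palindromes in this data set mean, not something stated in the post.

```python
# Pairing rule from the introduction: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq: str) -> str:
    """Return the complementary strand, read in the opposite direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def is_complementary_palindrome(seq: str) -> bool:
    """Assumed definition: the sequence equals its own reverse complement."""
    return seq == reverse_complement(seq)

print(reverse_complement("GATTACA"))
print(is_complementary_palindrome("GGGCATGCCC"))
```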

Results and Data:

Using Excel, I generated two sets of 296 random numbers between 0 and 229,354 to compare with the locations of the 296 palindromes recorded in the CMV DNA sequence. To compare the results, I looked at the numbers themselves and at the differences between consecutive numbers. The table and figures below show the relationship between the DNA data and the generated data, including the mean, skew, and standard deviation of the differences:

             Mean Difference   Skew         Std. Dev.
Palindrome   775.511864        1.8913869    832.802799
Rand1        772.966102        1.33197561   684.78196
Rand2        766.552542        1.41364546   720.683734

 

[Figures: DNA, DNA2, DNA3]

Based on the above information, the palindrome locations look very much like randomly generated numbers. The averages, standard deviations, and even the histograms and linear chart show very little difference between the CMV DNA locations and the randomly generated numbers. The first graph simply shows each number plotted on a linear scale, the second is a histogram of the differences for the DNA and random numbers, and the last is a histogram of all DNA and random numbers grouped in bins of 4,000.
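The random comparison sets can be reproduced outside Excel. The sketch below is only an approximation of that step: the seed is arbitrary, the function name is mine, and the skew formula is the adjusted sample skewness that I am assuming matches a spreadsheet SKEW function. The numbers it prints will not match the Rand1 and Rand2 rows exactly, since those came from different random draws.

```python
import random
import statistics

GENOME_LENGTH = 229_354   # CMV base pairs, from the introduction
N_SITES = 296             # number of palindromes in the data

def spacing_summary(locations):
    """Mean, sample standard deviation, and skew of consecutive differences."""
    ordered = sorted(locations)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    n = len(gaps)
    mean = statistics.mean(gaps)
    sd = statistics.stdev(gaps)
    # Adjusted sample skewness (assumed equivalent to a spreadsheet SKEW).
    skew = n / ((n - 1) * (n - 2)) * sum(((g - mean) / sd) ** 3 for g in gaps)
    return mean, sd, skew

rng = random.Random(2009)  # arbitrary seed, only so the run is repeatable
simulated = [rng.randint(0, GENOME_LENGTH) for _ in range(N_SITES)]
mean, sd, skew = spacing_summary(simulated)
print(f"mean gap {mean:.1f}, std dev {sd:.1f}, skew {skew:.2f}")
```

Because the gaps of sorted locations add up to the distance between the smallest and largest location, the mean gap can never exceed 229,354 / 295, which is why all three rows of the table sit near 775.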

The Poisson Distribution:

The probability of getting k points in a unit interval:

P(k) = \frac{\lambda^k}{k!} e^{-\lambda} for k = 0, 1, 2, …

\lambda = number of events per unit interval

\lambda = 5.157894737 (in the case of the CMV DNA, with a unit interval of 4,000 base pairs)

Lambda is also the expected value of the distribution.
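To make the formula concrete, here is a minimal Python sketch that evaluates it for the CMV rate quoted above. The function name is my own; it also checks numerically that the probabilities sum to one and that the mean equals lambda.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(k) = lam^k / k! * exp(-lam)."""
    return lam ** k / math.factorial(k) * math.exp(-lam)

lam = 5.157894737  # rate quoted above for the CMV data

for k in range(11):
    print(f"P({k:2d}) = {poisson_pmf(k, lam):.4f}")

# Truncating at k = 59 is plenty for a rate this small.
total = sum(poisson_pmf(k, lam) for k in range(60))
mean = sum(k * poisson_pmf(k, lam) for k in range(60))
print(f"sum of probabilities ~ {total:.6f}, mean ~ {mean:.6f}")
```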

Example of Poisson Distribution:

The next part of this experiment was to check whether the palindrome locations behave like a Poisson process. If events occur at a known average rate and independently of the time (here, the distance along the DNA) since the last event, the process can be considered Poisson.

Using the palindrome data, we counted the number of points in each bin of 4,000 base pairs (e.g., 0-4000, 4001-8000, etc.). The palindrome counts below are the numbers of data points in consecutive bins.

Palindrome Count

7, 1, 5, 3, 8, 6, 1, 4, 5, 3,
6, 2, 5, 8, 2, 9, 6, 4, 9, 4,
1, 7, 7, 14, 4, 4, 4, 3, 5, 5,
3, 6, 5, 3, 9, 9, 4, 5, 6, 1,
7, 6, 7, 5, 3, 4, 4, 8, 11, 5,
3, 6, 3, 1, 4, 8, 6
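The counts above are enough to recover both the rate and the observed frequencies used in the goodness-of-fit table. A minimal sketch follows; the variable names are mine, and the interval grouping (0-2, 3, ..., 9-14) is copied from that table.

```python
from collections import Counter

# Palindrome counts per 4,000-base-pair bin, copied from the list above.
counts = [
    7, 1, 5, 3, 8, 6, 1, 4, 5, 3,
    6, 2, 5, 8, 2, 9, 6, 4, 9, 4,
    1, 7, 7, 14, 4, 4, 4, 3, 5, 5,
    3, 6, 5, 3, 9, 9, 4, 5, 6, 1,
    7, 6, 7, 5, 3, 4, 4, 8, 11, 5,
    3, 6, 3, 1, 4, 8, 6,
]

n_bins = len(counts)
lam = sum(counts) / n_bins   # average number of palindromes per bin
print(f"{sum(counts)} palindromes in {n_bins} bins, lambda = {lam:.9f}")

# How many bins hold 0-2 palindromes, exactly 3, and so on.
tally = Counter(counts)
groups = {
    "0-2": range(0, 3), "3": [3], "4": [4], "5": [5],
    "6": [6], "7": [7], "8": [8], "9-14": range(9, 15),
}
observed = {label: sum(tally[k] for k in ks) for label, ks in groups.items()}
for label, n in observed.items():
    print(f"{label:>5}: {n}")
```

Note that the listed bins account for 294 of the 296 palindromes: 57 bins of 4,000 cover the first 228,000 base pairs, so the short stretch at the end of the sequence is not part of the count.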

     

 

Goodness-of-Fit

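I used the chi-square test for goodness of fit to check whether this distribution is Poisson.

Chi-square statistic: \sum_{i=1}^k \left(\frac{X_i-\mu_i}{\sigma_i}\right)^2, where X_i is the observed count in interval i, \mu_i is the predicted count, and \sigma_i = \sqrt{\mu_i}. Each term is the square of that interval's standardized residual.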
Interval   Observed   Predicted     Standardized residual   Chi-Squared
0-2        7          6.36996293     0.2496305              0.06231539
3          8          7.50059656     0.1823491              0.03325119
4          10         9.67182189     0.10552503             0.01113553
5          9          9.97724785    -0.3093851              0.09571912
6          8          8.57693237    -0.1969967              0.03880769
7          5          6.31984491    -0.5250126              0.27563819
8          4          4.07463685    -0.0369751              0.00136715
9-14       6          4.47894899     0.71871336             0.5165489

Chi Squared = 1.03478316
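The table can be rebuilt from the Poisson formula and the observed frequencies. The sketch below is one way to do it with the standard library only; the helper names are mine. It should land close to the reported statistic of about 1.03, though the later digits may differ slightly because the predicted value for the 0-2 interval comes out a little different when computed this way. The upper-tail probability uses a closed form that holds only for an even number of degrees of freedom, which happens to apply here.

```python
import math

lam = 5.157894737   # palindromes per bin
n_bins = 57         # number of 4,000-base-pair bins

# (interval label, counts covered by the interval, observed number of bins)
intervals = [
    ("0-2", range(0, 3), 7), ("3", [3], 8), ("4", [4], 10), ("5", [5], 9),
    ("6", [6], 8), ("7", [7], 5), ("8", [8], 4), ("9-14", range(9, 15), 6),
]

def poisson_pmf(k, lam):
    return lam ** k / math.factorial(k) * math.exp(-lam)

def chi2_upper_tail_even_df(x, df):
    """P(chi-square > x); this closed form is valid only when df is even."""
    half = x / 2
    return math.exp(-half) * sum(half ** j / math.factorial(j)
                                 for j in range(df // 2))

chi_sq = 0.0
for label, ks, obs in intervals:
    predicted = n_bins * sum(poisson_pmf(k, lam) for k in ks)
    residual = (obs - predicted) / math.sqrt(predicted)
    chi_sq += residual ** 2
    print(f"{label:>5}  observed {obs:2d}  predicted {predicted:7.4f}  "
          f"residual {residual:+.4f}")

# 8 intervals, minus 1, minus 1 more because lambda was estimated from the data.
df = len(intervals) - 2
p_value = chi2_upper_tail_even_df(chi_sq, df)
print(f"chi-squared = {chi_sq:.4f}, df = {df}, upper-tail p = {p_value:.3f}")
```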

 

The test statistic has an approximate chi-square distribution with six degrees of freedom (eight intervals, minus one, minus one more for the estimated \lambda). It appears possible that this is a Poisson model. The standardized residuals for this test are all very small, unlike those for the uniform distribution shown below, which has some larger values and spikes that represent a lack of fit.

 

Uniform Distribution:

 

[Figure: standardized residual plot]

Standardized residual   Observed   Predicted
-0.1102822              29         29.6
-1.5807114              21         29.6
 0.44112877             32         29.6
 0.07352146             30         29.6
 0.44112877             32         29.6
 0.25732512             31         29.6
-0.2940858              28         29.6
 0.44112877             32         29.6
 0.80873608             34         29.6
-0.4778895              27         29.6
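The residual column in this table follows from the observed counts alone, since the predicted value is just 296 palindromes spread evenly over ten intervals. Here is a minimal sketch of that calculation (variable names are mine), which also sums the squared residuals into a chi-squared statistic for comparison with the Poisson fit:

```python
import math

# Observed palindromes in ten equal-length intervals, from the table above.
observed = [29, 21, 32, 30, 32, 31, 28, 32, 34, 27]

# Under a uniform model every interval expects the same share.
predicted = sum(observed) / len(observed)

residuals = [(o - predicted) / math.sqrt(predicted) for o in observed]
chi_sq = sum(r ** 2 for r in residuals)

for o, r in zip(observed, residuals):
    print(f"observed {o:2d}  predicted {predicted:.1f}  residual {r:+.7f}")
print(f"chi-squared = {chi_sq:.4f} on {len(observed) - 1} degrees of freedom")
```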

 

Conclusion: Although at first the CMV DNA data seemed random like the randomly generated numbers, the chi-squared goodness-of-fit test let me determine that the palindrome counts are not totally random and model a Poisson distribution fairly closely. This gives scientists some hope of finding the specific location of the origin of replication of this virus.
