The most common measure of “correlation” or “predictability” is Pearson’s coefficient of correlation. Pearson’s r can take any value between -1 and 1. The larger r is, ignoring sign, the stronger the association between the two variables and the more accurately you can predict one variable from knowledge of the other.
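$$r = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$
where
$\bar{x}$ = avg of x
$\bar{y}$ = avg of y
$s_x$ = standard deviation of x
$s_y$ = standard deviation of y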
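As a rough illustration, Pearson's r can be computed in Python as the average product of standardized values. This is a sketch, not the Excel calculation used in the lab: the function name `pearson_r` and the sample sizes are made up for illustration, and the standard deviations are assumed to use the divisor n.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r as the average product of standardized values."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Assumption: population standard deviations (divide by n).
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sum(((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
               for x, y in zip(xs, ys)) / n

# Hypothetical premolt/postmolt sizes, invented for illustration only.
premolt = [113.6, 118.1, 122.0, 129.5, 140.2]
postmolt = [127.7, 133.2, 135.3, 143.3, 153.0]
r = pearson_r(premolt, postmolt)
print(f"r = {r:.4f}")
```

A value of r near 1 or -1 means one variable predicts the other well; a value near 0 means it hardly helps at all.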
Background: Dungeness crabs are fished on the west coast and are important to Pacific fishermen. The population of Dungeness crabs has been threatened by overfishing, and some believe that including the female population in the fishery, with restrictions, may be the answer. The entire male population is fished out each year. To control the large fluctuations in the number of male crabs fished out yearly, biologists have considered the fishing of female crabs. The Canadian fishing industry allows fishing of female crabs, and its crab population doesn’t seem to have the fluctuations the U.S. shows.
The above data shows premolt size plotted against postmolt size. Adding a linear trendline and Pearson’s coefficient shows that the two are very closely correlated.
The actual and predicted values correspond very closely. The predictor is an excellent fit, based on the high Pearson coefficient. It is based on the equation from the last graph: the linear fit of the data gives y = 0.914x + 25.803, and plugging the premolt size of the crab in for “x” gives a predicted postmolt size of the crab for “y”.
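The prediction step can be sketched in Python using the slope and intercept quoted in the text. The helper `least_squares_line` shows one standard way such a trendline is fitted; it is an illustrative assumption about what the spreadsheet trendline does, and the function names are hypothetical.

```python
SLOPE = 0.914       # from the trendline equation quoted in the text
INTERCEPT = 25.803  # from the trendline equation quoted in the text

def predict_postmolt(premolt_size):
    """Predicted postmolt size from y = 0.914x + 25.803."""
    return SLOPE * premolt_size + INTERCEPT

def least_squares_line(xs, ys):
    """Ordinary least-squares slope and intercept for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Example: a crab with a premolt size of 130 (same units as the data).
print(predict_postmolt(130.0))
```

Comparing `predict_postmolt` against the measured postmolt sizes is what the actual-versus-predicted plot does visually.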
Above, the actual data is plotted against a line forced through zero. It shows how close this correlation is to being absolutely linear and with perfect correlation.
After taking a much smaller sample and comparing the actual premolt versus the calculated premolt, the correlation is much weaker: this random sample shows no correlation at all. The sample must be much larger for the information to be substantial.
Conclusion: This data is closely correlated according to Pearson’s coefficient; however, a sufficient amount of data must be used for the result to be accurate. Using the linear trendline, you can very closely predict the postmolt size from the premolt size.
Team: Scott O’Meara, Jason Motta, Jeff Richard
In 1944, scientists showed that DNA was the carrier of hereditary information. It is a double-helical structure composed of two long chains of nucleotides. The bases of DNA are represented by the letters A, C, G, and T and vary from one nucleotide to another, giving a long coded message. The bases pair in a fixed way: A with T and C with G. CMV (cytomegalovirus) is potentially life-threatening for people with suppressed or deficient immune systems. Scientists are searching for a special place on the virus’ DNA that contains instructions for its reproduction, the area of replication. The CMV DNA molecule contains 229,354 complementary base pairs, while human DNA has more than 3 billion base pairs. The following data is the published DNA sequence of CMV. (Stat Labs, pp. 76-79)
Results and Data:
Using Excel, 2 sets of 296 random numbers were generated between 0 and 229,354. This data was used to compare with the 296 palindromes recorded in the pattern of CMV DNA. To compare the results, I looked at the numbers themselves and the differences between them. The figure below shows the relationship between the DNA data and the generated data along with tables of their means, standard deviation, and skew:
            Mean Difference  Skew        Std. Dev.
Palindrome  775.511864       1.8913869   832.802799
Rand1       772.966102       1.33197561  684.78196
Rand2       766.552542       1.41364546  720.683734
Based on the above information, the palindrome locations look very much like randomly generated numbers. Averages, standard deviations, and even histograms and linear charting show very limited difference between the CMV DNA locations and randomly generated numbers. The first graph simply shows each number plotted on a linear scale; the second is a histogram of the differences between DNA and random numbers; and the last is a histogram of all DNA and random numbers grouped in bins of 4000.
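The random comparison sets were generated in Excel. A comparable simulation could look like the Python sketch below, which draws 296 random locations and summarizes the gaps between neighbors. The function names, the choice of sampling without replacement, and the use of the population standard deviation are all assumptions made for illustration.

```python
import random
import statistics

GENOME_LENGTH = 229_354  # base pairs in the CMV DNA, from the text
N_SITES = 296            # number of palindromes, from the text

def random_locations(n_sites, genome_length, seed=None):
    """Sorted random positions; no two sites share a position (assumption)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, genome_length + 1), n_sites))

def spacing_summary(locations):
    """Mean and population standard deviation of gaps between neighbors."""
    ordered = sorted(locations)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return statistics.mean(gaps), statistics.pstdev(gaps)

simulated = random_locations(N_SITES, GENOME_LENGTH, seed=1)
mean_gap, sd_gap = spacing_summary(simulated)
print(f"mean gap = {mean_gap:.1f}, sd = {sd_gap:.1f}")
```

Running it with several seeds gives a feel for how much the mean and spread of the gaps vary by chance alone.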
The Poisson Distribution:
The probability that you’ll get k points in a unit interval is
$$P(k) = \frac{\lambda^{k} e^{-\lambda}}{k!} \quad \text{for } k = 0, 1, 2, \ldots$$
$\lambda$ = number of events per unit interval
$\lambda$ = 5.157894737
(In the case of the CMV DNA)
Lambda is also the expected value of the distribution.
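The Poisson probabilities for the CMV rate can be tabulated with a few lines of Python. This is a minimal sketch: `poisson_pmf` is a hypothetical helper name, and the rate is the value quoted above.

```python
import math

LAMBDA = 5.157894737  # rate quoted in the text for the CMV data

def poisson_pmf(k, lam):
    """P(exactly k points in a unit interval) for a Poisson rate lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Probability of seeing 0, 1, ..., 14 palindromes in one interval.
for k in range(15):
    print(k, round(poisson_pmf(k, LAMBDA), 4))
```

Summing k times `poisson_pmf(k, LAMBDA)` over all k recovers the rate, which is the sense in which lambda is the expected value.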
Example of Poisson Distribution:
The next part of this experiment was to look at this Poisson Process. If these events occur with a known average rate and independently of the time since the last event, the process can be considered Poisson.
Using the palindrome data, we counted the number of points in each bin of width 4,000 (e.g., 0-4000, 4001-8000, etc.). The palindrome count below shows the number of data points in the consecutive bins.
Palindrome Count (consecutive bins, read left to right):
7, 1, 5, 3, 8, 6, 1, 4, 5, 3,
6, 2, 5, 8, 2, 9, 6, 4, 9, 4,
1, 7, 7, 14, 4, 4, 4, 3, 5, 5,
3, 6, 5, 3, 9, 9, 4, 5, 6, 1,
7, 6, 7, 5, 3, 4, 4, 8, 11, 5,
3, 6, 3, 1, 4, 8, 6
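The bin counts listed above can be tallied into the categories used by the goodness-of-fit test (0 to 2, then 3 through 8 individually, then 9 or more). The Python sketch below does this; the function name and the label "9+" are illustrative choices, not part of the original spreadsheet.

```python
from collections import Counter

# Palindrome counts in the consecutive bins, copied from the list above.
BIN_COUNTS = [
    7, 1, 5, 3, 8, 6, 1, 4, 5, 3,
    6, 2, 5, 8, 2, 9, 6, 4, 9, 4,
    1, 7, 7, 14, 4, 4, 4, 3, 5, 5,
    3, 6, 5, 3, 9, 9, 4, 5, 6, 1,
    7, 6, 7, 5, 3, 4, 4, 8, 11, 5,
    3, 6, 3, 1, 4, 8, 6,
]

def tally_categories(bin_counts):
    """Count how many bins fall in each category: 0-2, 3, ..., 8, 9+."""
    tally = Counter()
    for count in bin_counts:
        if count <= 2:
            label = "0-2"
        elif count >= 9:
            label = "9+"
        else:
            label = str(count)
        tally[label] += 1
    return tally

observed = tally_categories(BIN_COUNTS)
rate = sum(BIN_COUNTS) / len(BIN_COUNTS)  # average palindromes per bin
print(dict(observed), rate)
```

The average count per bin computed this way agrees with the rate of about 5.158 used for the Poisson model.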
Goodness of Fit
We use the chi-square test for goodness of fit to check whether this distribution is Poisson.
Chi-square distribution
Chi-Squared  Predicted   Interval  Observed  Standardized Residual
0.06231539   6.36996293  0-2       7          0.2496305
0.03325119   7.50059656  3         8          0.1823491
0.01113553   9.67182189  4         10         0.10552503
0.09571912   9.97724785  5         9         -0.3093851
0.03880769   8.57693237  6         8         -0.1969967
0.27563819   6.31984491  7         5         -0.5250126
0.00136715   4.07463685  8         4         -0.0369751
0.5165489    4.47894899  9-14      6          0.71871336
Chi-Squared = 1.03478316
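The chi-square statistic is the sum of (observed - predicted)^2 / predicted over the categories. A short Python sketch using the observed and predicted counts from the table (the function name is hypothetical):

```python
OBSERVED = [7, 8, 10, 9, 8, 5, 4, 6]
PREDICTED = [6.36996293, 7.50059656, 9.67182189, 9.97724785,
             8.57693237, 6.31984491, 4.07463685, 4.47894899]

def chi_square_statistic(observed, predicted):
    """Sum of (observed - predicted)^2 / predicted over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, predicted))

chi_squared = chi_square_statistic(OBSERVED, PREDICTED)

# Degrees of freedom: categories, minus 1, minus 1 more because the
# Poisson rate was estimated from the same data.
degrees_of_freedom = len(OBSERVED) - 1 - 1
print(chi_squared, degrees_of_freedom)
```

A statistic this small relative to its degrees of freedom gives no evidence against the Poisson model.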
The test statistic has an approximate chi-square distribution with six degrees of freedom. It appears possible that this is a Poisson model. The standardized residuals for this test are all very small, unlike those for the uniform distribution seen below, which has some large values and spikes that represent a lack of goodness of fit.
Uniform Distribution:
Standardized residual plot
Standardized Residual  Observed  Predicted
-0.1102822             29        29.6
-1.5807114             21        29.6
 0.44112877            32        29.6
 0.07352146            30        29.6
 0.44112877            32        29.6
 0.25732512            31        29.6
-0.2940858             28        29.6
 0.44112877            32        29.6
 0.80873608            34        29.6
-0.4778895             27        29.6
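Each standardized residual is (observed - predicted) / sqrt(predicted). Under a uniform model every segment has the same predicted count, so the residuals for the table above can be sketched in Python as follows (names are illustrative):

```python
import math

# Observed palindrome counts in the ten equal segments, from the table.
OBSERVED_SEGMENTS = [29, 21, 32, 30, 32, 31, 28, 32, 34, 27]

def standardized_residuals(observed):
    """(observed - expected) / sqrt(expected), with equal expected counts."""
    expected = sum(observed) / len(observed)
    return [(o - expected) / math.sqrt(expected) for o in observed]

residuals = standardized_residuals(OBSERVED_SEGMENTS)
largest = max(residuals, key=abs)  # the residual furthest from zero
print([round(res, 4) for res in residuals], round(largest, 4))
```

The sign shows whether a segment holds fewer (negative) or more (positive) palindromes than the uniform model predicts.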
Conclusion: Although at first the CMV DNA data seemed as random as the randomly generated numbers, the chi-square goodness-of-fit tests showed that these sequences are not totally random and follow a Poisson distribution fairly closely. This gives scientists some hope of finding the specific location of the area of replication of this virus.
Time: number of hours played in the week prior to the survey
Like to Play: 1=Never played, 2=Very much, 3=Somewhat, 4=Not really, 5=Not at all
Where they Play:
1=Arcade
2=Home on a system
3=Home on a computer
4=Home on computer and system
5=Arcade and Home(system or computer)
6=Arcade and home (both system and computer)
Frequency of Play: 1=Daily, 2=Weekly, 3=Monthly, 4=Semesterly
Play if Busy: 0=no, 1=yes
Are Video Games Educational?: 0=no, 1=yes
Sex: 0=Female, 1=Male
Age
Computer at Home: 0=no, 1=yes
Hate Math: 0=no, 1=yes
# of hours worked in week prior to survey
Own a PC: 0=no, 1=yes
PC has a CD-ROM: 0=no, 1=yes
Have an Email: 0=no, 1=yes
Grade expect in course: 4=A, 3=B, 2=C, 1=D, 0=F
The sample provided to us is 95 students out of 314, the only ones to complete the survey. However, only data from 91 of them were used, because of errors in the survey or because the responses were not totally completed.
The survey asked the question, “How much time (hrs.) did you spend playing video games last week?” This graph shows a clear relationship between how much students played the week earlier and how frequently they said they played. The average student plays 1.45 hrs. of video games a week; however, for students who admit to playing daily, the average jumps to about 3 times this, around 4.5 hrs. a week. Correspondingly, as students report playing less frequently, the average number of hours played the week before decreases substantially. However, each category has 1 or 2 students who play substantially more than the rest in the category, which significantly increased the averages. I broke this information down into the following table.
            Mean        Mean*       Total Mean  Total Mean*  Percent
Daily       4.44444444  1.71428571  1.24285714  0.62613636   9.89%
Weekly      2.53928571  1.52222222  –           –            30.77%
Monthly     0.05555556  –           –           –            19.78%
Semesterly  0.04347826  –           –           –            25.27%
In statistics, interval estimation is the use of sample data to calculate an interval of plausible values for an unknown population parameter, in contrast to point estimation, which gives a single value.
The confidence level describes how reliable the interval procedure is: at a 95% level, about 95% of intervals built this way from repeated samples would contain the true parameter. Confidence intervals are used to indicate the reliability of an estimate. Increasing the desired confidence level will widen the confidence interval. Below is work done in class in Mathematica to determine a mean given a confidence level of 0.99.
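The widening effect can be seen in a small Python sketch. Note the assumptions: this uses a normal-approximation interval (mean plus or minus z times the standard error), which is simpler than the interval a dedicated statistics package would typically report, and the hours in `hours` are invented for illustration rather than taken from the survey.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(values, level=0.95):
    """Normal-approximation interval: mean +/- z * s / sqrt(n)."""
    n = len(values)
    center = mean(values)
    standard_error = stdev(values) / math.sqrt(n)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    return center - z * standard_error, center + z * standard_error

# Hypothetical hours played, made up for illustration only.
hours = [0.0, 0.5, 1.0, 2.0, 2.0, 3.5, 5.0, 14.0]
ci_95 = mean_confidence_interval(hours, 0.95)
ci_99 = mean_confidence_interval(hours, 0.99)
print(ci_95, ci_99)
```

The 99% interval contains the 95% interval: asking for more confidence costs precision.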
An estimate of the proportion of students who played a video game the week before the survey was obtained by dividing the number of students who played video games that week, 34, by the total number of students surveyed, 91, giving approximately 37%. From here we were asked to calculate a 95% confidence interval for the students who claimed to have played a video game the week prior. For this, Mathematica was used (as above) and reported the interval (1.35992, 5.29302).
The interval estimate below shows that the average amount of time spent playing video games varies greatly. Based on Mathematica’s interval output, the data can’t be used to accurately represent the total population.
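The 37% figure is simply 34/91. As a side illustration that is not part of the original lab output, the Python sketch below also attaches a normal-approximation 95% interval to that proportion. It treats the 91 students as a simple random sample and ignores that they come from a class of only 314, both of which are simplifying assumptions; the function name is hypothetical.

```python
import math
from statistics import NormalDist

PLAYED = 34    # students who played the week before, from the text
SURVEYED = 91  # students whose surveys were used, from the text

def proportion_interval(successes, n, level=0.95):
    """Sample proportion with a normal-approximation confidence interval."""
    p = successes / n
    z = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, p - margin, p + margin

share, low, high = proportion_interval(PLAYED, SURVEYED)
print(f"{share:.1%} played (approx. 95% interval {low:.1%} to {high:.1%})")
```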
In[2]:= <<HypothesisTesting`
MeanCI[Time,ConfidenceLevel->.95]
Out[20]= {1.35992, 5.29302}
Playing Video Games While You are Busy:
The largest group of people who played video games are those who were busy, which seems backwards. However, personally, I find that I will always find something else to do rather than what I know has to be done. Procrastination is the student’s best friend.
Below is the number of students expecting to achieve an A in their classes, in relation to their game play. It shows that, among students who play video games, those who restrict their game play to around three hours or less weekly have a better chance of still attaining an A.
The above information is the number of students who are expecting to get an A in their classes, compared to the students’ frequency of game play.
Frequency of Play: 1=Daily, 2=Weekly, 3=Monthly, 4=Semesterly
An analysis of the students who work for pay and those who do not shows that, among students who play video games, the two groups are split 50/50.
This pie graph shows the percentage of students who find video games educational. More people regard video games as non-educational, but it is nearly 50/50.
Legend:
1=Arcade
2=Home on a system
3=Home on a computer
4=Home on computer and system
5=Arcade and Home(system or computer)
6=Arcade and home (both system and computer)
The largest group of people who play games are home computer users.
Legend (Blue = Boys, Red = Girls):
1=Never played, 2=Very much, 3=Somewhat, 4=Not really, 5=Not at all
Surprisingly, the data for girls and boys is very similar in this case. However, more girls answered that they do not like video games at all or “not really.”
Conclusion: There is a substantial amount of information in this Stats Lab; however, a lot of this data is uncorrelated. Also, the data can’t represent the total population because the confidence interval is too wide. Lastly, here are some pie charts that show a few of the random questions asked on this survey.
This pie chart shows the types of video games the students in the survey like to play.
This pie chart shows the reasons why people don’t like to play video games.
This chart shows the reasons why people play video games.
Worked with – Jeff Richards & Jason Motta
Results 1: Based on these results, babies of smoking mothers had a lower maximum birthweight, but the lowest minimum birthweight actually belonged to a nonsmoker’s baby. They also had a lower average (mean) birthweight. This data shows that nonsmoking mothers’ babies have a higher mean birthweight, by roughly 4 ounces, which is not a substantial difference. Based on the data in Table 1, of the 1236 newborn babies, 485 were born to mothers who smoked and 742 were born to mothers who didn’t (10 cases were unknown). If we use the statistical definition of “low birth weight,” which is roughly under 90 ounces, we can see from this data that 22/742 of the nonsmoking mothers’ babies, 2.96%, were of low birth weight, and 36/485 of the smoking mothers’ babies, or 7.42%, were of low birth weight. All figures show negative skew, which means the mass of the distribution is concentrated to the right (higher values), but none of these numbers are substantial.
Using random sampling of nonsmoking mothers to match the 485 smoking mothers, I got the following statistical information:
Results:
         Min  Max  Mean       St. Dev.  Skew
Rand S1  55   174  123.26349  16.98804  0.23700
Rand S2  62   176  123.96058  16.50857  0.10841
Smokers  58   163  114.10950  18.09895  0.03370
This data shows that, even when comparing the 485 smokers in this survey to 485 randomly selected nonsmokers, the average birthweight of babies is lower when the mother smokes. Also, using the statistical definition of low birthweight as less than 90 ounces, we found 7.42% low-birthweight births among smoking mothers and 2.96% among nonsmokers. The randomly selected nonsmoking samples showed 16/485 and 12/485, or 3.29% and 2.47%, which is much less than for the smoking mothers.
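The low-birthweight percentages come straight from the counts quoted in the text. A minimal Python sketch (the group labels and names are illustrative):

```python
# (low-birthweight births, total births) for each group, from the text.
LOW_WEIGHT_COUNTS = {
    "smokers": (36, 485),
    "nonsmokers (all)": (22, 742),
    "nonsmokers (random sample, first figure)": (16, 485),
    "nonsmokers (random sample, second figure)": (12, 485),
}

def percent(count, total):
    """Share of births under 90 ounces, as a percentage."""
    return 100.0 * count / total

rates = {group: percent(count, total)
         for group, (count, total) in LOW_WEIGHT_COUNTS.items()}
ratio = rates["smokers"] / rates["nonsmokers (all)"]

for group, rate in rates.items():
    print(f"{group}: {rate:.2f}%")
print(f"smokers vs. all nonsmokers: {ratio:.1f} times the rate")
```

Expressing the comparison as a ratio makes the gap between the groups easier to see than the raw percentages do.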
Equations:
Mean:
Standard Deviation:
Skewness:
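The formulas themselves did not survive in this copy of the post. For reference, one common convention for the three statistics named above is shown here. This is an assumption: it divides by n (the population form), and the original may instead have used the sample form that divides by n - 1.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
s = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2}
\qquad
\text{skewness} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s}\right)^3
```

Here $x_i$ are the individual birthweights and $n$ is the number of babies in the group; a negative skewness means the long tail is on the low-weight side.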
Conclusion:
It is difficult to answer the question of whether smoking during pregnancy affects the birth weight of the child. There are more than 45% more nonsmokers than smokers in the given data, and the nonsmokers had skewness closest to that of a normal distribution. The data shows that the mean weight for smokers’ babies is nearly 9 ounces less than the average weight for nonsmokers’ babies. This is probably the most significant statistic in this data. Also, using the randomly sampled data, each outcome showed that smokers had more low-birthweight births. This evidence shows that it is strongly likely that smoking during pregnancy can cause low birthweight in your children. Don’t do it!