|F71SC3 Statistics 3 2006
Practical Sheet 5
This sheet contains questions from the last few practical sheets that you may not have had time to complete.
1. Obtain and attach the data set “cats” by typing the following:
Have a look at the data by typing cats and also use the ?cats command.
Plot Hwt (Heart Weight) against Bwt (Body Weight) in the usual way.
Obtain the best straight line through the points using
Write down the equation of this line. It will be of the form Hwt = a + b*Bwt. You can obtain the values of a and b using
Now obtain a plot of the residuals against Bwt using
What do they show?
Also obtain a plot of residuals against “fitted values” of Hwt (These are the values that the equation would predict).
For a good “fit”, look for spread around the line y=0 and no obvious trend.
Find the correlation coefficient and comment.
Finally for this example, try the commands:
After you’ve observed the graphs and cut-and pasted them if desired, return the graphics window to normal using
So far, the analysis has not looked at all at the gender of the cat.
Type in the R code
>boxplot(Hwt~Sex, xlab='gender', ylab='heart weight')
which clearly shows that heart weight depends on gender.
Similarly, the R code
> plot(Hwt~Bwt, xlab='body weight (kg)', ylab='heart weight (g)', type='n')
> points(Hwt[Sex=='F']~Bwt[Sex=='F'], pch=3)
produces a plot of heart weight against body weight for male cats (denoted by o) and female cats (denoted by +).
2. A total of k letters fall out of their envelopes, and are replaced at random. We are interested in the distribution of the random variable X, defined as the total number of correctly replaced letters. Except where k is very small, this distribution is very
awkward to calculate, so we will use simulation to estimate it. Let the correctly
matched envelopes and letters be numbered 1 to k.
(a) Take k = 10. Generate a random assignment of letters to envelopes, and count the number of correct matches with
> a = sample(10) # generate random permutation
> a # examine it!
> a==1:10 # check correctness of letter in each envelope
> sum(a==1:10) # count number of correct matches
Now use a for loop to simulate 10000 realisations of X with
> x = numeric(10000) # set up vector of correct length to store results
> for(i in 1:10000) x[i] = sum(sample(10)==1:10)
Use the table function to estimate the probability function of the distribution of X.
Estimate also the mean and variance of this distribution.
(b) It is not difficult to show that, for k 2, both the mean and the variance of X are
exactly equal to 1 (so the average number of correctly replaced letters does not depend on k). Further, theory suggests that, provided k is not too small, X should have an approximate Poisson distribution with parameter (mean) 1. (The quality of the approximation becomes ever better as k increases in size.) Investigate this in the case k = 10 considered above. Investigate it also in the case k = 20.
3. (new) This famous data set gives the measurements in centimetres of the variables sepal length, sepal width, petal length, and petal width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, Iris versicolor, and Iris virginica. The data set is made available as an R data frame iris, directly from within R by simply typing
Type help(iris) for more information, and type iris to display the data. Finally type
to make available the five separate R vectors Petal.Length, Petal.Width, Sepal.Length, Sepal.Width, and Species, each of length 150.
(a) Consider first the variable petal length and its dependence on the variable species. Use the following R code to construct histograms of petal length for each of the three species on a common scale:
Also make the comparison more simply with
> boxplot(Petal.Length~Species, xlab='species', ylab='petal length', main='Boxplots of petal length against species')
Similarly, for each of the remaining three measurement variables, use boxplots to compare their distributions for the three species. Is there definite evidence of difference between the species?
(b) To what extent are the comparisons helped by first taking logarithms of the four variables?
(c) Now consider the relation between petal width and petal length. Plot one variable against the other:
> plot(Petal.Length, Petal.Width, main='Plot of petal width against petal length')
Comment on the observed statistical association.
Now repeat the plot, but use a different plotting symbol for each of the three species. This is a little complicated, but can be accomplished with the following code:
> plot(Petal.Length, Petal.Width, type='n', main='Plot of petal width against petal length, distinguishing species')
> points(Petal.Length[Species=='setosa'], Petal.Width[Species=='setosa'], pch=1)
> points(Petal.Length[Species=='versicolor'], Petal.Width[Species=='versicolor'], pch=2)
> points(Petal.Length[Species=='virginica'], Petal.Width[Species=='virginica'], pch=3)
Give an explanation for the previously observed statistical association and re-evaluate its significance