Week 1 Lab

In this week’s Lab, we will:
◦ run R and access the course data set;
◦ see how R stores data;
◦ look at some simple arithmetic operations;
◦ see how to plot generic functions using R;
◦ see how to plot functions over different ranges;
◦ use the plot of the p.d.f. or p.m.f. to compare distributions.

In these lab notes, text in the typewriter font is to be typed in.

1.1 Running R

Those who attended the Lab100 lab sessions can refresh their memories of how to run R from the Lab100 Week 15 session material. We quickly go through the setup and download a data file for later use.

Getting started with R:
1. Log in to the Department Windows Server.
2. Open the R package.
3. Open up the Tinn-R editor for typing code.
4. Click on File/SaveAs to give your R code a filename, e.g. lab105wk1

Run:
1. Highlight (select) all or part of your R code.
2. Click on the button SubmitSelection in the top tool bar.

Save output:
1. Highlight the output of interest in the R window.
2. Click on File/SaveToFile and give it a name.
3. The output is then saved in a text file.

Quit:
1. Save the final version of your R code.
2. Click on File/Exit from Tinn-R.
3. Click on File/Exit from R.
4. Alternatively type q() in R. Don’t forget the brackets. You will be asked:
    Save workspace image? [y/n/c]:
to which you should respond y.

At the end of every session, you should quit using q(), and answer y to the question “Save workspace image?”. Saving the “workspace image” means the work you have completed will still be there the next time you use R.

Accessing the course dataset:
Throughout the session, we will use some data examples, which you can download from the course webpage.
1. Save the file in your working directory, using the filename m105data.r.
2. Then in R, when you want to use the data, type
    attach("YOURPATH/m105data.r")
3. You will need to do this in each new R session.

1.2 How R stores data

R stores data in vectors and arrays.
You can enter data into a vector as follows:
    c(1,6,2,4,3,8)
To save this vector, for use later on, assign it using the = operator (pronounced “gets”):
    x = c(1,6,2,4,3,8)
To look at the values in your vector x, just type x at the prompt and you should see
    x
    [1] 1 6 2 4 3 8

1.3 Some arithmetic operations

You can do simple arithmetic as follows. Find out what the following commands achieve:
    x + 6
    x^2
    x^(1/2)
    x * 2
    sqrt(x)
R has some helpful functions that calculate summary statistics for us. Use mean(x) to find the mean value of the vector x, and check this by hand.

1.4 Plotting functions

We can plot functions such as y = f(x) by
1. deciding what range of x to use;
2. setting up a vector of closely spaced x values spanning that range;
3. calculating y for each x value;
4. plotting y against x.

For x ∈ [−5, 5], create a plot of $y = -3x^2 + 6$, using the following commands:
    x = seq(-5,5,length=11)   # creates a sequence of values between -5 and 5, of length 11.
    y = -3*x^2 + 6            # calculates y for each x value.
    plot(x,y)                 # plots the 11 points (x,y).
We can make this look more like the continuous function $y = -3x^2 + 6$ by:
1. calculating the function at more points: use a higher value of length when you define x;
2. joining up the points: use the following extra argument in the plot function:
    plot(x,y,type="l")
here "l" stands for lines.

What value of length needs to be used to make your plot look really smooth?

What commands do you need to plot your function on the range x ∈ [−1, 3]?
    x = seq(-1,3,length=50)
rest as before...

1.5 Plotting probability density functions

We can use R to examine the shape of a distribution by plotting the p.m.f. or p.d.f.

The Exponential(θ) distribution has p.d.f.
    $f(x; \theta) = \theta \exp(-\theta x)$ for $x \ge 0$,
and c.d.f.
    $F(x; \theta) = 1 - \exp(-\theta x)$.
For x ∈ [0, 10], create the p.d.f. of Exponential(θ) for θ = 1. What commands do you use to plot the function?
    theta = 1
    x = seq(0,10,length=101)
    y = theta*exp(-theta*x)
    plot(x,y,type="l")
Comment on the shape of the p.d.f. for different values of θ. How does the shape of the function change?

The Exponential(θ) distribution satisfies
    $E[X] = \frac{1}{\theta}, \qquad \mathrm{Var}[X] = \frac{1}{\theta^2}$.
What is the mean of the Exponential(θ = 1) distribution? Where is it located on the plot?
    E[X] =
Add a vertical line at the mean by
    theta = 1
    abline(v=1/theta, col="red")
Repeat the procedure for θ = 0.5 and θ = 2. How does the shape of the p.d.f. change in relation to the mean? To overlay a curve on the existing plot, use
    lines(x,y, col="red")
How close are those two functions?

Now for x ∈ [0, 10], create the c.d.f. of Exponential(θ) for θ = 1. Repeat the procedure for θ = 0.5 and θ = 2. How does the shape of the c.d.f. change?

Calculate the mean and the variance for each θ. Which one has the largest and the smallest mean? Which one has the largest and the smallest standard deviation?

The Geometric(θ) distribution has p.m.f.
    $p(x; \theta) = \theta^x (1 - \theta)$ for $x = 0, 1, \dots$
For x ∈ {0, 1, · · · , 10}, create the p.m.f. of Geometric(θ) for θ = 0.5. What commands do you use to plot the function?
    theta = 0.5
    x = 0:10   # or x = seq(0,10,length=11)
    y = theta^x*(1-theta)
    plot(x,y)
Repeat the procedure for θ = 0.25 and θ = 0.75. How does the shape of the p.m.f. change? Comment on the shape of the p.m.f.s for different values of θ.

It is easy to create the c.d.f. for a discrete random variable by summing the probabilities. Find the cumulative sum by using
    ynew = cumsum(y)
Check the numbers stored in ynew by comparing with y. Plot the c.d.f. by using
    plot(x,ynew)
How would you plot the c.d.f. for θ = 0.25 and θ = 0.75? How does the shape of the c.d.f. change?
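As a check on the cumulative sums for the Geometric c.d.f., the values in ynew can be compared with the closed form of the geometric sum, 1 − θ^(x+1), and with R's built-in pgeom. The following is only a sketch: the names closed and builtin are illustrative, and it assumes that R's prob argument corresponds to 1 − θ in the notation of these notes.

```r
# Sketch: c.d.f. of Geometric(theta) by cumulative sum, checked two ways.
theta = 0.5
x = 0:10
y = theta^x * (1 - theta)              # p.m.f. as defined in these notes
ynew = cumsum(y)                       # c.d.f. at x = 0, 1, ..., 10
closed = 1 - theta^(x + 1)             # closed form of the partial geometric sum
builtin = pgeom(x, prob = 1 - theta)   # assumption: R's prob is 1 - theta here
max(abs(ynew - closed))                # should be essentially zero
max(abs(ynew - builtin))               # should be essentially zero
plot(x, ynew, type = "s", ylim = c(0, 1))   # step plot suits a discrete c.d.f.
```

Changing theta to 0.25 or 0.75 and re-running the block gives the other two c.d.f.s asked for above.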
Week 2 Lab

In this week’s Lab, we will:
◦ see how to simulate from various probability distributions;
◦ transform uniform to exponential random variables;
◦ sketch the pdf of the exponential random variable;
◦ illustrate with simulation;
◦ plot the classic bell shaped curve, the Normal pdf.

2.1 Random variables and pdfs

Recall that a discrete rv has a pmf and a continuous rv has a pdf. The rv X has the pdf f(x) if
    $P(a < X < b) = \int_a^b f(x)\,dx, \qquad a < b,$
so that it is the area under the curve. For example, a random variable X has the Uniform distribution on the interval (0,1) if f(x) = 1 for 0 < x < 1, and 0 otherwise. We write that as X ∼ Uniform(0, 1). The shaded area represents P(0.2 < X < 0.5).

[Figure: Uniform(0,1) density plotted against x, with the area P(0.2<X<0.5) shaded between x = 0.2 and x = 0.5.]

Q 2.1 Uniform. Simulate 1000 realisations of the rv X ∼ Uniform(0, 1) using runif. Draw the histogram. Plot the pdf on the range (−.5, 1.4) using the function dunif and
    range = seq(-.5,1.4,length=100)
to give 100 points. Find the probability P(0.2 < X < 0.5) using the function punif.

Ans 2.1
    x = runif(1000)
    hist(x, 20, col='yellow')
    range = seq(-.5,1.4,length=100)
    pdf = dunif(range)
    plot(range, pdf, type='n')
    lines(range, pdf)
    punif(0.5) - punif(0.2)

Q 2.2 Uniform example. If X is uniform then the probability that it lies between 0.1 and 0.3 is 0.2. Verify this by simulation.
    x = runif(1000)
    y = (0.1<x) & (x<0.3)
    sum(y)   # the frequency of 1's
This should be near 200 = 1000 × 0.2. Repeat this script with 0.3 replaced by 0.6. What value do you expect?

Ans 2.2 500 or so.

Q 2.3 Simulating Exponential rvs. Theory states that if U is Uniform then the transformation Y = − log(U) has the Exponential(1) distribution, with pdf f(x) = exp(−x), 0 < x < ∞. Verify this empirically:
    u = runif(1000) ; hist(u,20)
    y = -log(u) ; hist(y,20)
Not bad! Describe the shape of the histograms of the following transforms: (i) $Y = -3\log(U)$, (ii) $Y = \log(U)\sin(1 - U)$, (iii) $Y = U^2 (1 - U)^2$.
Ans 2.3
    u = runif(1000) ; hist(u,20)
    y = -3*log(u) ; hist(y,20)
    y = log(u)*sin(1-u) ; hist(y,20)
    y = u^2*(1-u)^2 ; hist(y,20)
(i) Exponentially shaped, (ii) J shaped, (iii) U shaped.

Q 2.4 The bell shaped curve, the Normal pdf. When X ∼ Normal(0, 1) it has the pdf
    $f(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} x^2\right), \qquad -\infty < x < \infty.$
Simulate 1000 realisations of the rv X ∼ Normal(0, 1) using rnorm. Draw the histogram. Plot the pdf on the range (−2.5, 4.4) using the function dnorm and
    range = seq(-2.5,4.4,length=100)
to give 100 points. Find the probability P(0.2 < X < 0.5) using the function pnorm.

Ans 2.4
    x = rnorm(1000)
    hist(x, 20, col='yellow')
    range = seq(-2.5,4.4,length=100)
    pdf = dnorm(range)
    plot(range, pdf, type='n')
    lines(range, pdf)
    pnorm(0.5) - pnorm(0.2)

Q 2.5 A shifted scaled Normal. When X ∼ Normal(3, 4) it has the pdf
    $f(x) = \frac{1}{\sqrt{2\pi \cdot 4}} \exp\left(-\frac{1}{2} \frac{(x - 3)^2}{4}\right), \qquad -\infty < x < \infty.$
Simulate 1000 realisations of the rv X ∼ Normal(3, 4) by modifying the arguments of rnorm:
    x = rnorm(1000,mean=3,sd=2)   # note sqrt 4
Draw the histogram. Plot the pdf on the range (−2.5, 4.4) using the function dnorm with the new arguments but the same range. Find the probability P(0.2 < X < 0.5) using the function pnorm.

Ans 2.5
    x = rnorm(1000,mean=3,sd=2)   # note sqrt 4
    hist(x, 20, col='yellow')
    range = seq(-2.5,4.4,length=100)
    pdf = dnorm(range,mean=3,sd=2)
    plot(range, pdf, type='n')
    lines(range, pdf)
    pnorm(0.5,mean=3,sd=2) - pnorm(0.2,mean=3,sd=2)

2.2 Simulating data

Often it is handy to do experiments with simulated data. With simulated data, we know the true underlying distribution, which unfortunately is never exactly the case with real-life data!

Sample 100 Uniform[0,1] variates and store your values in the vector u:
    u = runif(100,min=0,max=1)
You can see your data by typing u. Do a scaled histogram of your sample and comment on the shape of your histogram:
    hist(u,prob=TRUE)
Calculate the mean and 0, 0.25, 0.5, 0.75, and 1.00 quantiles of your sample.
    mean(u)
    quantile(u)
Now sample 200 Normal(2,4) variates (note that R takes the standard deviation, not the variance, as an argument to rnorm()), then do a scaled histogram of your sample:
    n = rnorm(200,mean=2,sd=2)
    hist(n,prob=TRUE)
Comment on the shape of your histogram.

We can overlay the true underlying p.d.f. to make a comparison. What is the range of n in the plot? For the lower limit a and the upper limit b that you decided to use for the plot, define
    x = seq(a, b, length=101)
Recall that the Normal(µ, σ²) distribution has density
    $f(x; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^2\right\}$
where θ = (µ, σ).
    fx = 1/(sqrt(2*pi)*2)*exp(-0.5*((x-2)/2)^2)
    lines(x,fx)
We can also compare the empirical c.d.f. to the true c.d.f. First create the empirical c.d.f.:
    plot(ecdf(n))
It is a bit messy to integrate the p.d.f. of the Normal(µ, σ²) distribution. Fortunately, R already provides the c.d.f. as the function pnorm:
    Fx = pnorm(x,2,2)
To overlay it on the empirical c.d.f., use
    lines(x,Fx)
How close are those two functions? To differentiate the curves better, you can use colour, as in
    lines(x,Fx,col="red")
Calculate the mean and variance of your sample. Do these seem close to the true values?

Calculate the 0.4 and 0.99 quantiles of your Normal data sample. Where do these values appear on the histogram of the data?

2.3 More simulation

We used the functions runif() and rnorm() to simulate from Uniform and Normal distributions.
Simulation from other distributions is as easy:

    Command                        Generates
    runif(50,min=0,max=1)          50 observations from a Uniform[0,1] distribution
    rnorm(20,mean=0,sd=5)          20 observations from a Normal(0,25) distribution
    rexp(100,0.5)                  100 observations from an Exponential(0.5) distribution
    rpois(200,3)                   200 observations from a Poisson(3) distribution
    rbinom(35,size=6,prob=0.2)     35 observations from a Binomial(6,0.2) distribution
    rgeom(150,1-0.2)               150 observations from a Geometric(0.2) distribution

The reason for the 1-0.2 in the Geometric case is that, unfortunately, in R the probability you specify is the success probability, whereas we define the parameter θ in the p.m.f. of a Geometric random variable as the failure probability.

Experiment with these functions by generating samples from different distributions and plotting histograms and empirical c.d.f.s of your samples.

Week 3 Lab

In this week’s Lab, we will:
◦ draw histograms, and scaled histograms;
◦ plot the empirical distribution function for different data sets;
◦ use the empirical distribution function to estimate quantiles and probabilities;
◦ calculate a range of summary statistics for different data sets;
◦ use exploratory methods to look at bivariate data;
◦ calculate correlations and covariances for the ozone data.

3.1 Histograms – heights of offshore waves at Newlyn

Coastal engineers at the port of Newlyn, in the south west of England, require a detailed understanding of oceanographic processes in order to estimate overtopping rates of the sea wall protecting the town. They can then assess whether the existing sea wall is adequate, or whether further protection should be built. Offshore waves are induced by meteorological conditions, and though they are complex, they can be summarised by their height and their period. Here we will concentrate on the excess heights of these waves over a threshold.
The specific problem that the engineers want to solve is:

    Given a small probability of exceedance, what is the wave height that is exceeded with that probability? How accurate is this estimate?

[Figure: map showing the location of Newlyn; axes are Eastings and Northings.]

Data collected to address this problem are the maximal levels (in metres) recorded over consecutive 15 hour windows, throughout the period 1971-77.

The data of offshore wave heights (in metres) at Newlyn are stored in the vector waves. Typing in waves shows you the whole vector. Find the length of this vector:
    length(waves)
Find the mean of the offshore wave heights.
    mean(waves)   # 2.866
Use the hist function to produce a histogram of the offshore wave heights:
    hist(waves)
Describe the shape of this distribution and the range of this variable.

[Figure 3.1: Histogram of offshore wave heights (m). Title: Offshore waves; x-axis: Wave height; y-axis: Frequency.]

What does the y-axis of this plot represent?

You can scale the histogram so that it has area 1 (and therefore approximates a probability density function) using
    hist(waves,prob=TRUE)
What does the y-axis of this plot represent now?

The wave heights are measured in metres. Convert these into feet by multiplying by 3.28:
    scaledWaves = waves * 3.28
What is the mean of your scaled waves? How does this relate to the mean of the waves measured in metres?
    mean(scaledWaves)   # 9.400
    3.28 * mean(waves)
Produce a scaled histogram of the wave heights measured in feet. What do you notice about the shape of your new histogram?

3.2 Empirical distribution functions

In the lectures we have seen empirical c.d.f.s estimated for the ozone data. Look at the ozone data by typing:
    ozone.summer
    ozone.winter
These are arrays of data. Each row represents a different day, and the column names show the measurements of the air pollutants at each site.

To reproduce the empirical c.d.f.
of the Leeds ozone:
    plot(ecdf(ozone.summer$Leeds.O3))   # careful: the O3 uses a capital “O” for “Ozone”
Now plot the empirical c.d.f. for the wave heights at Newlyn, contained in waves:
    plot(ecdf(waves))
There are almost 3000 wave measurements. How does this affect the appearance of the estimated c.d.f. compared with that for the ozone example?

3.3 Estimating quantiles and probabilities from the e.c.d.f.

Use your plot of the empirical c.d.f. for the waves data to answer the following questions:

What is the median of the wave height distribution?

What are the 0.1 and 0.9 quantiles of the wave height distribution?

Estimate the probability of a randomly selected wave being less than 1.7m.

What wave height is exceeded by 25% of the waves?

3.4 Calculating summary statistics

The functions mean() and var() can be used to calculate the mean and variance of a vector (we saw mean in last week’s lab). Use these functions to answer the following questions, making sure you give your answers in the correct units:

What is the sample mean of the wave height data?
    mean(waves)   # 2.866
What is the sample variance of the wave height data?
    var(waves)   # 2.564049
Use the sqrt() function to derive the sample standard deviation of the wave height data:
    sqrt(2.564)
or
    sqrt(var(waves))   # 1.601265
The function quantile() will calculate quantiles of a vector:
    quantile(waves)
What are the minimum, maximum and median values of this sample?

Is the median higher or lower than the mean? How does this relate to the shape of the waves distribution (remember your histogram from Section 3.1)?

Find the sample 0.8 quantile of the waves data:
    quantile(waves,0.8)   # 4.03
A graphical display of the quartiles can be made by a box-plot using the function boxplot().
    boxplot(waves)
This should look like Figure 3.2. The thick line in the middle of the box is the median, the upper line of the box is the 75% quantile and the lower line is the 25% quantile. The minimum and maximum are easily identified.
Skewness is shown as asymmetry of the box. Points appearing outside the limits are considered outliers. Summarise your findings.

[Figure 3.2: Boxplot of offshore wave heights (m). y-axis: wave heights (m).]

3.5 Bivariate data

When data are more than one-dimensional, it is important to look for dependence between variables. The ozone data are four dimensional. Type ozone.summer and ozone.winter to see these data. What are the dimensions of the data?
    names(ozone.summer)
Create a scatterplot of the summer Leeds O3 measurements against the corresponding measurements taken at Ladybower reservoir, and comment on any apparent relationship.
    plot(ozone.summer$Ladybower.O3,ozone.summer$Leeds.O3)

3.6 Covariances and correlation

Calculate the mean and standard deviation of the summer ozone measurements at each site.
    mean(ozone.summer$Leeds.O3)       # 31.78
    sd(ozone.summer$Leeds.O3)         # 9.28
    mean(ozone.summer$Ladybower.O3)   # 43.63
    sd(ozone.summer$Ladybower.O3)     # 11.8
Calculate the sample correlation between these variables:
    cor(ozone.summer$Ladybower.O3,ozone.summer$Leeds.O3)   # 0.76
Comment on the value of this sample correlation, referring to your scatterplot.

Calculate the sample covariance between the variables:
    cov(ozone.summer$Ladybower.O3,ozone.summer$Leeds.O3)   # 83.66
What happens when you calculate the correlation and covariance of these variables after multiplying them by 2?
    cov(2*ozone.summer$Ladybower.O3,2*ozone.summer$Leeds.O3)   # 334.64
    cor(2*ozone.summer$Ladybower.O3,2*ozone.summer$Leeds.O3)   # 0.76
Now calculate the covariance and correlation between the winter ozone measurements at the two sites.
    cov(ozone.winter$Ladybower.O3,ozone.winter$Leeds.O3)   # 64.38
    cor(ozone.winter$Ladybower.O3,ozone.winter$Leeds.O3)   # 0.71
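The effect of multiplying both variables by 2 can also be checked without the ozone data. The sketch below uses simulated stand-in vectors a and b (hypothetical names and values, not the ozone measurements) to illustrate that the covariance is multiplied by 2 × 2 = 4 while the correlation is unchanged, because the correlation is the covariance divided by the product of the two standard deviations.

```r
# Sketch with simulated stand-in data (not the ozone measurements).
set.seed(1)                            # arbitrary seed, for reproducibility only
a = rnorm(100, mean = 40, sd = 10)     # hypothetical "site 1" values
b = 0.6 * a + rnorm(100, mean = 5, sd = 6)   # hypothetical "site 2", related to a
cov(2 * a, 2 * b) / cov(a, b)          # ratio of covariances: 4, up to rounding
cor(2 * a, 2 * b) - cor(a, b)          # difference of correlations: 0, up to rounding
cov(a, b) / (sd(a) * sd(b))            # agrees with cor(a, b) by definition
```

The same three lines can be run on the summer or winter ozone columns in place of a and b.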
Week 4 Lab

In this week’s Lab, we will:
◦ explore underlying random variation in our samples using repeated simulation;
◦ look at the random variation in summary statistics over repeated samples.

4.1 Underlying variation – repeated sampling

Any sample statistic, such as the sample mean, sample standard deviation, sample quantiles or sample correlation, exhibits random variability. We will explore this property further with a simulation experiment.

Now simulate 30 realisations of an Exponential(0.5) random variable, draw a histogram of your sample, and comment on the shape of your histogram:
    u = rexp(30,0.5)
    hist(u,prob=TRUE)
Repeat this 5 or 6 times and comment on the differences between the histograms you obtain each time.

Now put your sample size up to 100 and repeat your experiment. What do you see now?

4.2 Random variability in sample mean

Sample 10 realisations from a Normal(3,9) distribution, and record the sample mean.
    x = rnorm(10, mean=3, sd=3)
    mean(x)
Do the above two steps again ten times, writing down the sample mean each time. Comment on the variability of the values you obtain for your sample means.

Repeat 100 times and construct a histogram and a boxplot.
    xbar = rep(0, 100)               # to store the 100 means
    for (i in 1:100){                # repeat for 100 times
      x = rnorm(10, mean=3, sd=3)    # sample 10 from N(3,3^2)
      xbar[i] = mean(x)              # compute sample mean
    }
    xbar                             # print 100 means
    hist(xbar)
    hist(xbar, prob=TRUE, xlim=c(0,6))
    boxplot(xbar)
Repeat the above experiment using a sample size of 50 each time, instead of 10. How variable are your sample means now?
    x=rnorm(50, mean=3, sd=3)

4.3 Random variability in sample maximum

Sample 20 realisations from a Uniform[-1,1] distribution. Use the max() function to calculate the sample maximum and write this down:
    u = runif(20, min=-1, max=1)
    max(u)
Do the above two steps again ten times, writing down the sample maximum each time. Comment on the variability of the values you obtain for your sample maxima.
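If writing the ten maxima down by hand becomes tedious, the repetition can be sketched with R's replicate() function, which evaluates an expression a given number of times and collects the results. This is only one possible shortcut, not the procedure the notes require, and the name maxima is illustrative.

```r
# Sketch: ten sample maxima, each computed from 20 Uniform[-1,1] draws.
maxima = replicate(10, max(runif(20, min = -1, max = 1)))
maxima          # the ten sample maxima
range(maxima)   # smallest and largest of the ten; none can exceed 1
```

Because each run draws fresh random numbers, the printed values will differ from run to run, which is exactly the variability being studied.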
Increase the sample size to 50 and repeat the experiment. Construct a histogram and a boxplot. How variable are your sample maxima now? Are they close to symmetric?
    xmax = rep(0, 10)          # to store the 10 maxima
    for (i in 1:10){           # repeat 10 times
      x = runif(50, -1, 1)     # sample 50 from Uniform[-1,1]
      xmax[i] = max(x)         # compute sample maximum
    }
    xmax                       # print the 10 maxima
    hist(xmax)
    hist(xmax, prob=TRUE, xlim=c(0,1))
    boxplot(xmax)

Week 5 Lab

In this week’s Lab, we will:
◦ use the standard errors of parameter estimates to obtain approximate confidence intervals for parameter estimates;
◦ do a class experiment to look at the coverage of confidence intervals.

5.1 Confidence intervals for Diseased trees

Diseased trees: We assume the parametric model we have identified for the diseased trees example to be the true model for the run length variable:
    Run length = X ∼ Geometric(0.343).
We will simulate from this model, where we know the true value of θ = 0.343. Using our simulated data, we compute the method of moments estimate and estimate the 95% confidence interval for θ.

Obtain your own simulated data from the model and a confidence interval for θ as follows:

1) Simulate 30 i.i.d. variables from a Geometric(0.343) distribution:
    theta = 0.343
    x = rgeom(30, 1-theta)   # R uses 1-theta for Geometric distribution
2) Calculate x̄ for your simulated data:
    mean(x)
3) Find the method of moments estimate θ̂. (You may consult your notes!)
    thetaHat = mean(x)/(mean(x)+1)
4) Find the standard error of the estimate, where n = 30 is the sample size. (You may consult your notes!)
    $\mathrm{StdError}(\hat\theta) = \frac{\sqrt{\theta}\,(1 - \theta)}{\sqrt{n}}.$
    stdErr=sqrt(thetaHat)*(1-thetaHat)/sqrt(30)
5) Obtain the approximate 95% confidence interval: θ̂ ± 1.96 × StdError(θ̂)
    lowerC = thetaHat-1.96*stdErr
    upperC = thetaHat+1.96*stdErr
6) Is the true value of θ = 0.343 contained in your interval?
7) Repeat steps 1)-6) to obtain another estimate of the 95% interval for θ using different simulated data. Collect 50 estimates.
    theta = 0.343
    lowerC = rep(0, 50)                                # store 50 estimates of lower limit
    upperC = rep(0, 50)                                # store 50 estimates of upper limit
    for (i in 1:50){
      x = rgeom(30, 1-theta)                           # sample of 30
      thetaHat = mean(x)/(mean(x)+1)                   # method of moments estimate
      stdErr = sqrt(thetaHat)*(1-thetaHat)/sqrt(30)    # standard error, n = 30
      lowerC[i] = thetaHat - 1.96*stdErr               # lower limit of CI
      upperC[i] = thetaHat + 1.96*stdErr               # upper limit of CI
    }
8) Some intervals contain the true value of θ (i.e. θ = 0.343) and some don’t. What proportion do contain the true value of θ?
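One way to answer question 8) without counting by eye is to let R test each interval and average the results. The sketch below repeats the simulation so that it runs on its own. It assumes the standard error divides by the square root of the sample size n = 30, and the names n and covered and the seed are illustrative choices, not part of the original notes.

```r
# Sketch: proportion of the 50 intervals that cover the true theta.
set.seed(105)                   # arbitrary seed so the run can be repeated
theta = 0.343
n = 30                          # assumption: sample size behind each interval
lowerC = rep(0, 50)
upperC = rep(0, 50)
for (i in 1:50) {
  x = rgeom(n, 1 - theta)                              # sample of n run lengths
  thetaHat = mean(x) / (mean(x) + 1)                   # method of moments estimate
  stdErr = sqrt(thetaHat) * (1 - thetaHat) / sqrt(n)   # standard error
  lowerC[i] = thetaHat - 1.96 * stdErr
  upperC[i] = thetaHat + 1.96 * stdErr
}
covered = (lowerC < theta) & (theta < upperC)   # TRUE where theta is inside
sum(covered)                                    # number of intervals covering theta
mean(covered)                                   # proportion covering theta
```

For a 95% interval the proportion would be expected to be somewhere near 0.95, although with only 50 intervals and an approximate standard error it can differ noticeably from run to run.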