STA 248 – Winter 2005 – Assignment 1
Due: Thursday, January 27 at beginning of lecture.
(Late assignments will be subject to a deduction of 10% of the total marks for the
assignment for each day late.) Please hand in your R code when used. On future
assignments I won’t be typing out the textbook problems. Let me know if you have any
difficulty getting a copy of the text.
Problems to be handed in for marking:
Chapter 6: 12, 17, 25
Chapter 7: 9, 23, 25, 30, 59
Additional problems: 3
Problems from the textbook:
Chapter 6:
8. Acute exposure to cadmium produces respiratory distress and kidney and liver
damage, and may even result in death. For this reason, the level of airborne
cadmium dust and cadmium oxide fume in the air is monitored. This level is
measured in milligrams cadmium per cubic meter of air. A sample of 35 readings
yields the given data (available on the web).
(a) Construct a stem-and-leaf diagram for these data. Use the numbers 02, 03,
04, 05, 06, and 07 as stems. (Do by hand.)
(b) Would you be surprised to hear someone claim that the random variable X,
the cadmium level in the air, is normally distributed? Explain.
(d) Use R to construct a relative frequency histogram for these data. Does the
histogram exhibit the bell-shape characteristic of a normal density?
(e) Construct a relative cumulative frequency ogive for these data. Use the ogive
to approximate that point above which 50% of the readings should fall.
12. (Percentiles.) Let X be a random variable. The point $p_{k/100}$ such that
$P[X < p_{k/100}] \le k/100$
and
$P[X \le p_{k/100}] \ge k/100$
is called the kth percentile for X. For example, let X be binomial with n = 20
and p = .5. The 25th percentile for X is the point $p_{25/100} = 8$ since
P [X < 8] = .1316 ≤ .25
and
P [X ≤ 8] = .2517 ≥ .25
(a) Let X be binomial with n = 20 and p = .5. Find the 60th percentile for X.
(b) Let X be Poisson with λ = 10. Find the 30th percentile for X.
(d) Let X be exponentially distributed with β = 1. Show that the 20th percentile
for X is −ln .80. Hint: Find the point p such that
$\int_0^p e^{-x}\,dx = .20$
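The percentile definition in Problem 12 can be checked numerically for a discrete distribution. The following is a minimal Python sketch (the course itself uses R) that reproduces only the worked example from the problem statement, the 25th percentile of a binomial with n = 20 and p = .5. The function names `binom_cdf` and `binom_percentile` are illustrative and do not come from the textbook.

```python
from math import comb

def binom_cdf(x, n, p):
    """P[X <= x] for X ~ Binomial(n, p); gives 0 when x < 0."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(0, x + 1))

def binom_percentile(k, n, p):
    """Smallest integer q with P[X < q] <= k/100 and P[X <= q] >= k/100."""
    target = k / 100
    for q in range(n + 1):
        below = binom_cdf(q - 1, n, p)    # P[X < q]
        at_or_below = binom_cdf(q, n, p)  # P[X <= q]
        if below <= target <= at_or_below:
            return q
    raise ValueError("no percentile found")

# Worked example from the problem statement: n = 20, p = .5, k = 25.
q25 = binom_percentile(25, 20, 0.5)
print(q25)                               # 8
print(round(binom_cdf(7, 20, 0.5), 4))   # 0.1316, i.e. P[X < 8]
print(round(binom_cdf(8, 20, 0.5), 4))   # 0.2517, i.e. P[X <= 8]
```

The search simply walks the support and tests both inequalities from the definition, so it mirrors the hand calculation rather than relying on a library quantile function.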
17. Consider the two given data sets (available on web).
(a) Find the sample mean and sample median for each data set.
(b) Find the sample range for each data set.
(c) Find the sample variance and sample standard deviation for each data set.
(d) Would you be surprised to hear someone claim that these data were drawn
from the same population? Explain. Hint: Consider the shape of the distribution
as well as the observed values of the sample statistics.
20. Use the data of Exercise 8 to approximate the mean, variance, and standard deviation of the random variable X, the level of airborne cadmium dust and cadmium
oxide fumes. Assume that these approximations are fairly accurate. Between what
two values would you expect approximately 95% of the readings to fall? Explain.
25. (Approximating σ via the range.) The range can play an important role in the design
of statistical studies. To obtain a prespecified degree of accuracy when estimating
population parameters, an adequately sized sample must be drawn. Most formulas
used to determine sample size require knowledge of σ, the population standard
deviation. Often the researcher will not have an estimate of σ available but will
have an idea of the expected range of his or her data. When sampling from a
normal distribution,
$P[-2\sigma < X - \mu < 2\sigma] \doteq .95$
If X is not normally distributed, then Chebyshev’s inequality can be applied to
conclude that
P [−3σ < X − µ < 3σ] ≥ .89
That is, X always lies within at most 3 standard deviations of its mean with high
probability. From this it can be concluded that the estimated range covers an
interval of roughly 4σ for normally distributed random variables and 6σ otherwise.
In the normal case an estimate of σ can be obtained by solving the equation
$4\sigma \doteq \text{estimated range}$
for σ. If X is not normally distributed, then
$\sigma \doteq (\text{estimated range})/6$
Data are given (available on the web) for the random variable X, the cpu time in
seconds required to run a program using a statistical package.
(a) Construct a stem-and-leaf diagram for these data. Is the assumption justified
that X is normally distributed?
(b) Approximate σ via the sample standard deviation s.
(c) Find the sample range for these data, and use it to approximate σ. Compare
your result to that obtained in part (b).
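The two probability statements behind the range rule in Problem 25 can be checked directly. The sketch below, in Python rather than the course's R, computes the normal probability of falling within 2σ of the mean and the Chebyshev lower bound for 3σ, then applies the range/4 and range/6 rules to a hypothetical range of 12.0. That value is made up for illustration and is not the cpu-time data, and the function names are assumptions.

```python
from math import erf, sqrt

def normal_within(k):
    """P[-k*sigma < X - mu < k*sigma] when X is normally distributed."""
    return erf(k / sqrt(2))

def chebyshev_bound(k):
    """Chebyshev lower bound on P[|X - mu| < k*sigma] for any distribution."""
    return 1 - 1 / k**2

def sigma_from_range(estimated_range, normal=True):
    """Range-based approximation of sigma: range/4 if normal, range/6 otherwise."""
    return estimated_range / (4 if normal else 6)

print(round(normal_within(2), 4))    # 0.9545, roughly the .95 quoted above
print(round(chebyshev_bound(3), 4))  # 0.8889, the ">= .89" bound

hypothetical_range = 12.0  # made-up value, not the assignment data
print(sigma_from_range(hypothetical_range))                # 3.0
print(sigma_from_range(hypothetical_range, normal=False))  # 2.0
```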
27. Let X be normally distributed with mean µ and variance $\sigma^2$.
(a) Verify that $q_3 = \mu + .67\sigma$ and that $q_1 = \mu - .67\sigma$.
(b) Find the interquartile range for X.
(c) Verify that the inner fences for X are $f_1 = \mu - 2.68\sigma$ and $f_3 = \mu + 2.68\sigma$.
(d) Verify that the probability that X will fall beyond the inner fences is approximately .007.
28. Temperature differences between the warm upper surface of the ocean and the
colder deeper levels can be utilized to convert thermal energy to mechanical energy.
This mechanical energy can in turn be used to produce electrical power using a
vapor turbine. Let X denote the difference in temperature between the surface
of the water and the water at a depth of 1 kilometer. Measurements are taken at
15 randomly selected sites in the Gulf of Mexico. The measured temperatures are
available on the web. Use R to do the following.
(a) Construct a double stem-and-leaf diagram for these data.
(b) Find the sample mean, sample median, and sample standard deviation for
these data.
(c) Note that the observation with value 10.1 is very different from the others. It
is a potential outlier. Construct a boxplot for these data to verify that the value
10.1 does appear to be an outlier.
(d) To see the effect of this outlier, drop it from the data set and calculate the
sample mean, median, and standard deviation for the remaining 14 observations.
Which measure is least affected by the presence of the outlier?
36. It is known that power surges or line spikes can damage sensitive electronic equipment. A study of these surges is conducted. The purpose of the study is to ascertain whether or not there are differences in the frequency of these surges among
the seven days of the week. Data for the study is found on the website. Variables
are observation number; day, with m = Monday, t = Tuesday, w = Wednesday, th
= Thursday, f = Friday, s = Saturday, and sn = Sunday; and number of spikes
per day. Use R to do the following.
(a) Obtain descriptive statistics on the number of spikes per day for each day of
the week. Discuss any differences among days that appear to exist.
(b) Construct boxplots for each day, and use the boxplots for a visual comparison
of the days.
Chapter 7:
1. Let $X_1, X_2, \ldots, X_{20}$ be a random sample from a distribution with mean 8 and variance 5. Find the mean and variance of $\bar{X}$.
5. Let $X_1, X_2, X_3, X_4, X_5$ be a random sample from a binomial distribution with n = 10
and p unknown.
(a) Show that $\bar{X}/10$ is an unbiased estimator for p.
(b) Estimate p based on these data: 3, 4, 4, 5, 6.
9. (Weighted means.) Assume that one has k independent random samples of sizes
$n_1, n_2, \ldots, n_k$ from the same distribution. These samples generate k unbiased
estimators for the mean, namely, $\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_k$.
(a) Show that the arithmetic average of these estimators, $(\bar{X}_1 + \bar{X}_2 + \cdots + \bar{X}_k)/k$,
is also unbiased for µ.
(b) Certain mineral elements required by plants are classed as macronutrients.
Macronutrients are measured in terms of their percentage of the dry weight of
the plant. Proportions of each element vary in different species and in the same
species grown under differing conditions. One macronutrient is sulfur. In a
study of winter cress, a member of the mustard family, these data, based on three
independent random samples, are obtained:
$\bar{x}_1 = .8$, $n_1 = 9$
$\bar{x}_2 = .95$, $n_2 = 3$
$\bar{x}_3 = .7$, $n_3 = 200$
Use the result of part (a) to obtain an unbiased estimate for µ, the mean proportion
of sulfur by dry weight in winter cress. By averaging the three values .8, .95, and
.7 to obtain the estimate for µ, each sample is being given equal importance or
“weight”. Does this seem reasonable in this problem? Explain.
(c) To take sample sizes into account, a “weighted” mean is used. This estimator,
$\hat{\mu}_W$, is given by
$\hat{\mu}_W = \frac{n_1 \bar{X}_1 + \cdots + n_k \bar{X}_k}{n_1 + \cdots + n_k}$
Show that $\hat{\mu}_W$ is an unbiased estimator for µ.
(d) Use the data of part (b) to find the weighted estimate for the mean proportion
of sulfur by dry weight in winter cress. Compare your answer to the estimate
found in part (b).
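The difference between the equal-weight average of part (a) and the size-weighted mean of part (c) can be sketched in a few lines of Python (the course uses R). The sample means and sizes below are hypothetical values chosen for easy arithmetic, not the winter cress data, and the function names are illustrative.

```python
def unweighted_mean(means):
    """Arithmetic average of the k sample means, each given equal weight."""
    return sum(means) / len(means)

def weighted_mean(means, sizes):
    """Size-weighted mean: (n_1*xbar_1 + ... + n_k*xbar_k) / (n_1 + ... + n_k)."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)

# Hypothetical example: two samples with very different sizes.
means = [2.0, 4.0]
sizes = [10, 30]
print(unweighted_mean(means))       # 3.0
print(weighted_mean(means, sizes))  # 3.5, pulled toward the larger sample
```

The weighted version equals the mean of all observations pooled together, which is why the larger sample dominates.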
16. Let $X_1, X_2, \ldots, X_m$ be a random sample of size m from a binomial distribution with
parameters n, assumed to be known, and p. Show that the method of moments
estimator for p is $\hat{p} = \bar{X}/n$.
17. Let $X_1, X_2, \ldots, X_n$ be a random sample from a Poisson distribution with parameter
λ. Find the method of moments estimate for λ.
23. Find the method of moments estimator for the parameter p of a geometric distribution.
25. Using the method of moments estimator for p found in Exercise 23, find an estimator
for $\sigma^2$ for the geometric distribution. (You don’t have to do the rest of this question
that is in the text.)
27. Carbon dioxide is an odorless, colorless gas that constitutes about .035% by volume
of the atmosphere. It affects the heat balance by acting as a one-way screen. It lets
in the sun’s heat to warm the oceans and the land but blocks some of the infrared
heat that is radiated from the earth. This reflected heat is absorbed into the
lower atmosphere, producing a greenhouse effect which causes the earth’s surface
to become warmer than it would be otherwise. Systematic measurements of CO2
began in 1957 with Charles D. Keeling monitoring at Mauna Loa in Hawaii.
(a) Given the data (available on the web) that are CO2 readings in ppm, construct
a stem-and-leaf plot (by hand) for these data using 31, 32, 32, 33, 33, 34, 34, 35 as
stems. Graph leaves 0-4 on the first of each repeated stem and leaves 5-9 on the
other. Is it reasonable to assume that the CO2 level in the atmosphere is normally
distributed? Explain.
(b) Estimate µ and $\sigma^2$ using the method of moments estimators.
(c) Find an unbiased estimate for $\sigma^2$.
29. Based on the data of Exercise 27, what are the maximum likelihood estimates for
the mean and variance of the atmospheric CO2 level?
30. Let $X_1, X_2, \ldots, X_m$ be a random sample of size m from a binomial distribution
with parameters n, assumed to be known, and p. Find the maximum likelihood
estimator for p. Does it differ from the method of moments estimator found in
Exercise 16?
31. Let W be an exponential random variable with parameter β unknown. Find the
maximum likelihood estimator for β based on a sample of size n. Does it differ
from the method of moments estimator (derived in lecture)?
34. Computer terminals have a battery pack that maintains the configuration of the
terminal. These packs must be replaced occasionally. Let X denote the life span in
years of such a battery. Assume that X is exponentially distributed with unknown
parameter β. Find the maximum likelihood estimate for β based on the given data
(available on the web).
35. To estimate the proportion of defective microprocessor chips being produced by a
particular maker, samples of five chips are selected at 10 randomly selected times
during the day. These chips are inspected, and X, the number of defective chips
in each batch of size 5, is recorded. Assume that X is binomially distributed with
n = 5 and p unknown. Use the data given (available on the web) to find the
maximum likelihood estimate for p.
54. Let X denote the unit price of a 3.5-inch floppy diskette. Observations are obtained
from a random sample of 10 suppliers. (Data are available on web.)
(a) Find an unbiased estimate for the mean price of these diskettes.
(b) Find an unbiased estimate for the variance in the price of these diskettes.
(c) Find the sample standard deviation. Is this an unbiased estimate for σ?
(d) Assume that X is normally distributed. Find the maximum likelihood estimate
for $\sigma^2$. Does this agree with your answer to (b)?
59. Consider the random variable X with density given by
$f(x) = (1/\theta^2)\, x e^{-x/\theta}, \quad x > 0$
(b) Show that E(X) = 2θ.
(c) Find the method of moments estimator for θ.
(d) Find the maximum likelihood estimator for θ based on a random sample of
size n. Does this estimator differ from that found in part (c)?
(e) Estimate θ based on these data:
3 5 2 3 4
1 4 3 3 3
(f) Are the estimators found in parts (c) and (d) unbiased estimators for θ?
Additional problems:
1. Which of the following statistics can be made arbitrarily large by making one
number out of a batch of 100 numbers arbitrarily large: the mean, the median,
the 10% trimmed mean, the standard deviation, the interquartile range?
2. Suppose $X_1, \ldots, X_n$ are n identically distributed random variables with $E(X_i) = \mu$,
$i = 1, \ldots, n$. Show that $(\bar{X})^2$ is not an unbiased estimate of $\mu^2$.
3. What general features are evident in a boxplot of data from a normal distribution?
from a skewed distribution? from a distribution that is symmetric and bell-shaped
like the normal distribution, but has less probability in the tails (the extreme
values)? from a distribution that is symmetric and bell-shaped like the normal
distribution, but has more probability in the tails (the extreme values)?
4. In data compression of text, a probability model is used where the probability
of the next letter is heavily influenced by the preceding letters. In a first-order
Markov model, the probability of the next letter depends only on the one letter
immediately preceding it. Suppose we are interested in a model for the compression of a binary string. I’ll label the values “b” for black and “w” for white. For
a first-order Markov model we need the following probabilities for the value of a
letter given the value preceding it:
$P(w|w) = p_w$, $P(b|w) = 1 - p_w$, $P(b|b) = p_b$, $P(w|b) = 1 - p_b$
Suppose $X_i$ is the random variable that is 1 if the ith letter is w and 0 if the ith
letter is b. Then given that the (i − 1)th letter is w (say), the probability function
of $X_i$ is $P(X_i = x \mid X_{i-1} = 1) = p_w^x (1 - p_w)^{1-x}$. Suppose the string
bbbbwwwbbbbbwwbbbbbbwwwwb
is observed. Use maximum likelihood to estimate the parameters $p_w$ and $p_b$.
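For a first-order Markov model of this kind, one common approach is to treat the first letter as given, so the likelihood factors into one Bernoulli-type term per transition and each parameter is estimated by a proportion of transition counts. The Python sketch below (the course uses R) illustrates that bookkeeping on a short hypothetical string, not the string from the problem. Conditioning on the first letter and the function name `markov_mle` are assumptions of this sketch.

```python
from collections import Counter

def markov_mle(string):
    """Estimate p_w = P(w|w) and p_b = P(b|b), conditioning on the first letter.

    Assumes the string contains at least one transition out of each letter;
    otherwise the corresponding proportion is undefined (division by zero).
    """
    # Count each (previous letter, next letter) pair in the string.
    transitions = Counter(zip(string, string[1:]))
    from_w = transitions[("w", "w")] + transitions[("w", "b")]
    from_b = transitions[("b", "b")] + transitions[("b", "w")]
    p_w = transitions[("w", "w")] / from_w
    p_b = transitions[("b", "b")] / from_b
    return p_w, p_b

# Hypothetical string: transitions are ww, wb, bw, wb, bb, bw.
p_w, p_b = markov_mle("wwbwbbw")
print(p_w, p_b)  # one of three w-transitions stays w; one of three b-transitions stays b
```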