Lecture 7
Describing Data; Growth Rates and
Measures of Dispersion
7.1 Describing Data
We have seen how to capture data from the web and download data to Excel.
The next few sessions look at some of the methods used for analysing data.
There are various types of objective that might be relevant here including:
• Presenting data in a more user-friendly or informative way
• Describing data more precisely
• Compute and interpret various measures of centre and dispersion for
ungrouped data
• Explain the characteristics, uses, advantages and disadvantages of
the various measures of centre and location
• Use the normal distribution
Particularly if you have not had much opportunity to work with real data in
economics previously you should practise with the sorts of methods covered
in these sessions to develop confidence in handling data. This is particularly
useful for the group-produced project.
7.1.3 Describing data in Excel
We will now return to Excel for the rest of the lecture. It is also possible to do
some of these statistical tasks in Excel, but it involves using the ‘Data
Analysis’ Add-in. To arrange raw data into an Excel Frequency Distribution
and Histogram, see Demo 7.1 which makes use of this feature of Excel.
7.2 Transforming data
It is very common to get time series data expressed in absolute numbers that
you would prefer to express in some other format, one that makes the numbers
much easier to absorb and makes trends, and deviations from trends, easier to
identify. There are two techniques commonly used for transforming data into a
more digestible form.
7.2.1 Calculating growth rates
A growth rate is the change in a variable between time t and time (t+1) in
relation to its initial value in t. For example, if income rises from 200 to 208
during the course of 12 months the (annual) growth rate is (208-200)/200 =
0.04 or 4%. Calculations of this kind are very easily done in spreadsheets, but
care is needed with the time units being used. With quarterly data, for
example, you need to multiply each quarterly growth rate by four to get the
(approximate) annual growth rate. Also, it is conventional to express growth
rates in percentage terms. The most reliable way to do this is by multiplying the
answer by 100. Sometimes you can use the % icon, but beware: although this
will display the answer as a percentage, arithmetic operations will still treat
the cell as a decimal!
A very popular technique (among economists) to calculate growth rates is to
use logs.
It can be shown that $(x_t - x_{t-1})/x_{t-1} \approx \ln x_t - \ln x_{t-1}$
and, if you want percentage growth rates, it becomes
$((x_t - x_{t-1})/x_{t-1}) \times 100 \approx (\ln x_t - \ln x_{t-1}) \times 100$.
When calculating growth rates in Excel you normally 'lose' the first
observation, since the change cannot be defined until you are at the second
of two points. So, when putting in the formula to compute the rate of growth,
leave the first cell of your new column blank.
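The two calculations above can be sketched outside Excel as well. The following is a minimal Python sketch, not part of the original demos; the function names `pct_growth` and `log_growth` are illustrative assumptions. It uses the income example from the text (200 rising to 208) and leaves the first entry empty, mirroring the blank first cell.

```python
import math

def pct_growth(series):
    """Percentage growth between consecutive observations.

    The first entry is None because no change can be defined for it.
    """
    rates = [None]
    for prev, curr in zip(series, series[1:]):
        rates.append((curr - prev) / prev * 100)
    return rates

def log_growth(series):
    """Log-difference approximation to the percentage growth rate."""
    rates = [None]
    for prev, curr in zip(series, series[1:]):
        rates.append((math.log(curr) - math.log(prev)) * 100)
    return rates

income = [200, 208]  # the example used in the text
exact = pct_growth(income)
approx = log_growth(income)
print(exact)   # second entry is the 4% growth rate
print(approx)  # second entry is slightly below 4
```

The log version comes out a little below the exact rate; the two are close only when the change between periods is small.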
7.2.2 Index Numbers
The idea of an index number is to express a sequence of numbers in relation
to a base value of 100. Thus if we had a series of 3 numbers, say, 50, 80 and
150 we could express them as index numbers, using the first number as the
base. In index form the series would become 100, 160 and 300. Whatever
multiplication or division operation you perform to convert the first number to a
value of 100 you repeat for the other numbers as well. Thus to convert 50 to
100 you multiply by 2 (=100/50), so this operation is done for the other
numbers as well. A common application of index numbers is to convert very
large numbers, such as the value of Gross Domestic Product, to a more
manageable format. Thus if GDP at current prices is £668,225 million in
1994 and £700,890 million in 1995, we convert these to index numbers thus:
1994: 668,225*(100/668,225) becomes 100
and
1995: 700,890*(100/668,225) becomes 104.89
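As a rough illustration of the rescaling step, here is a minimal Python sketch. The helper name `to_index` and its `base_position` parameter are assumptions made for this example. It reproduces the 50, 80, 150 series and the GDP figures as they appear in the calculation lines above.

```python
def to_index(series, base_position=0):
    """Rescale a series so that the chosen base observation equals 100."""
    base = series[base_position]
    return [value * 100 / base for value in series]

small_index = to_index([50, 80, 150])
print(small_index)

# GDP at current prices, £ million, 1994 and 1995 (figures from the calculation above)
gdp_index = to_index([668225, 700890])
print([round(value, 2) for value in gdp_index])
```

Choosing a different `base_position` changes which observation is set to 100 but leaves the ratios between observations unchanged.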
7.2.3 Rates of change
There are occasions when it is conventional to express data in index number
format but then also to calculate a rate of change. Price inflation is a classic
example. Raw price information (compiled from market prices for a particular
bundle of commodities) is first transformed into an index such as the Retail
Price Index or the Tax and Price Index. Rates of change in the value of this
index are then calculated and used as a measure of the rate of inflation, i.e.
the rate at which prices are changing.
7.2.4 ‘Deflating’: Expressing data in 'real terms' or 'constant prices'
Often in macroeconomics it is useful to 'deflate' a series of numbers in order
to extract the influence of price changes and to see how a variable has moved
in 'real terms'. A 30-year run of data on Aggregate Consumption or GDP
expressed in 'current prices', for example, will exaggerate the rate at which
consumption or incomes have been increasing. In order to eliminate this price
effect it is common to deflate a series by a price index. This requires data on a
price index (such as the RPI) covering the same period as the data for the
variable in question. The value of the variable in period t is then divided by the
value of the price index in t and multiplied by 100 (so that the base year keeps
its original value), and the resulting variable is referred to as being
measured in 'real' or 'constant price' terms. The year for which the price index
takes a value of 100 is then specified as part of the name of the new series:
e.g. 'Consumption at constant 1995 prices'.
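A minimal sketch of deflating a series in Python follows. The consumption and price index figures are hypothetical, invented only for illustration, and the helper name `deflate` is an assumption. The result is multiplied by 100 so that the base year (index = 100) keeps its current-price value.

```python
def deflate(nominal, price_index):
    """Convert a current-price series to constant prices of the base year."""
    if len(nominal) != len(price_index):
        raise ValueError("both series must cover the same periods")
    return [value / index * 100 for value, index in zip(nominal, price_index)]

# Hypothetical data: consumption at current prices and a price index (base year = 100)
consumption = [500.0, 550.0, 605.0]
price_index = [100.0, 105.0, 110.0]

real_consumption = deflate(consumption, price_index)
print([round(value, 1) for value in real_consumption])
```

In this made-up example nominal consumption rises by 21% over the two years, but real consumption rises by only 10%, because the rest of the increase reflects higher prices.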
Definitions
Nominal GDP (or GDP in Current Prices) = GDP at prices prevailing
when income was earned
Real GDP (GDP in Constant Prices) = GDP adjusted for the effects of
inflation
Per Capita real GDP = Real GDP divided by the total population
Economic Growth = The percentage change in Real GDP per year
Demo 7.2 is an extensive demo that runs over four sheets. Click here for a
demonstration of calculating UK data on Growth Rates (Sheet 1), Index
Numbers (Sheet 2), the GDP Deflator (Sheet 3) and Real GDP (Sheet 4).
Try to follow how each sheet develops from the one before.
7.2.5 Graphical representation of data
Very often you will have to play around with data to get them into the format
that makes the most vivid picture. There are no hard and fast rules about how
to do this: it is best to experiment a bit. The final product will depend on both
the format in which numbers are expressed and the choices you make about
how to present the data. Be willing to change both in the search for the best
output.
7.3 Measures of Centre
7.3.1. Arithmetic Mean: $\bar{x} = \sum x / n$
Excel function =AVERAGE(range:range)
“X-bar” is the sum of the scores divided by the number of people (or items)
being measured (i.e. the sample size).
Properties
• Every set of interval-level and ratio-level data has a mean.
• All the values are included in computing the mean.
• A set of data has a unique mean.
• The mean is affected by outliers (unusually large or small data values);
it is not a ‘resistant’ measure of centre.
• If all the values are summed and an equal number of means are
subtracted, the total will be zero: $\sum (x - \bar{x}) = 0$
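Two of these properties can be checked numerically. The sketch below uses hypothetical scores chosen only for illustration: it confirms that the deviations from the mean sum to zero, and shows how a single outlier pulls the mean.

```python
# Hypothetical scores, chosen only for illustration
scores = [4, 8, 15, 16, 23, 42]

mean = sum(scores) / len(scores)

# Subtract the mean from every value and add up the deviations
total_deviation = sum(x - mean for x in scores)

# Add one unusually large value and recompute the mean
with_outlier = scores + [400]
mean_with_outlier = sum(with_outlier) / len(with_outlier)

print(mean, total_deviation, round(mean_with_outlier, 2))
```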
7.3.2. The Geometric Mean: $GM = \sqrt[n]{(x_1)(x_2)(x_3)\cdots(x_n)}$
The geometric mean (GM) of a set of n numbers is defined as the nth root of
the product of the n numbers.
Excel function =GEOMEAN(range:range)
Basic textbooks do not mention this measure, but it is very important in
measuring financial market movements. If a share goes up from 200 to 250,
the standard growth rate formula gives (250-200)/200 = 25%
But if it goes down from 250 to 200, the standard growth rate gives (200-250)/250 = -20%
The share has gone up and down the same amount but the growth rates are
different.
The Geometric Mean is used to overcome this problem. Multiply the 2
numbers together and take the Square Root
=SQRT(200*250) = 223.60679
Measured against this base, the change of 50 gives a growth rate of
50/223.6 = 22.4% regardless of whether the share is going up or down
(only the sign differs).
For the geometric mean of 3 numbers, take the cube root of the 3 numbers
multiplied together.
For the geometric mean of n numbers, take the nth root of the n numbers
multiplied together.
Comparing the Arithmetic Mean and Geometric Mean
The interest rates on three bonds were 5, 7, and 4 percent.
GM = 5.192.
X-bar = 5.333.
The GM gives a more conservative profit figure because it is less heavily
weighted by the rate of 7 percent.
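The figures in this section can be reproduced with Python's standard `statistics` module (`geometric_mean` is available from Python 3.8). This is a minimal sketch; taking the geometric mean of the two share prices as the base for the growth rate is this example's reading of the text, stated here as an assumption.

```python
from statistics import geometric_mean, mean

# Share price example: 200 and 250
gm_price = geometric_mean([200, 250])
rate_up = (250 - 200) / gm_price     # share rises from 200 to 250
rate_down = (200 - 250) / gm_price   # share falls from 250 to 200
print(round(gm_price, 3), round(rate_up, 4), round(rate_down, 4))

# Bond example: interest rates of 5, 7 and 4 percent
bonds = [5, 7, 4]
gm_bonds = geometric_mean(bonds)
am_bonds = mean(bonds)
print(round(gm_bonds, 3), round(am_bonds, 3))
```

With the geometric mean as the base, the rise and the fall have the same size and differ only in sign.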
7.3.3. Median
Excel function =MEDIAN(range:range)
The middle value in a distribution which has been arranged in numerical order
(equal number of values above and below the median)
The Mth value in the distribution is found by $M = \frac{n + 1}{2}$
Rule of thumb: If n is odd, the median is the centre value. If n is even, the
median is the average of the two centre observations
Properties of the Median
• There is a unique median for each data set.
• It is a robust measure of centre (i.e. unaffected by outlying values)
• It can be computed for ratio-level, interval-level, and ordinal-level data.
• It can be computed for an open-ended frequency distribution provided
the median does not lie in an open-ended class.
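The position rule M = (n + 1)/2 and the rule of thumb can be sketched as follows. The function name `median_by_position` and the sample values are assumptions made for this example; the result is compared with `statistics.median` from the standard library.

```python
import math
import statistics

def median_by_position(values):
    """Median located via the 1-based position M = (n + 1) / 2."""
    ordered = sorted(values)
    n = len(ordered)
    m = (n + 1) / 2
    # For odd n, m is a whole number; for even n it falls halfway
    # between the two centre observations, which are then averaged.
    lower = ordered[math.floor(m) - 1]
    upper = ordered[math.ceil(m) - 1]
    return (lower + upper) / 2

odd_case = median_by_position([7, 1, 5])          # n = 3, M = 2
even_case = median_by_position([7, 1, 5, 3])      # n = 4, M = 2.5
robust = median_by_position([1, 3, 5, 7, 1000])   # outlier barely matters
print(odd_case, even_case, robust)
```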
7.3.4. Mode
The most frequently observed data value
Excel function =MODE(range:range)
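A data set can have more than one most frequent value. The short sketch below (the helper name `modes` and the data are hypothetical) returns every value that shares the highest frequency, rather than a single one.

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

single = modes([1, 2, 2, 3])
double = modes([2, 3, 3, 5, 7, 7])
print(single, double)
```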
Demo 7.1 - Calculating Measures of Centre (click here for Excel output
results)
7.4 Measuring Dispersion: Ungrouped Data
7.4.1. Variance
$VAR = s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
The variance of a sample equals the sum of the squared deviations of the
scores from the mean score, divided by (n-1).
Because of the squaring involved (which is done to eliminate negatives), the
variance can get quite large. If we “unsquare” it (take the square root), we
get the Std Deviation.
7.4.2. Standard Deviation: $s = \sqrt{VAR}$
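The sample variance and standard deviation can be computed step by step as in this minimal sketch. The data are hypothetical and the function name `sample_variance` is an assumption. The result is checked against `statistics.variance`, which also divides by n - 1.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

variance = sample_variance(data)
std_dev = math.sqrt(variance)
print(round(variance, 4), round(std_dev, 4))
```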
Coefficient of Variation
This is another measure of the variability of a series and can be used as a
comparison between different series, where the units of measurement differ
between the series. It is defined as:
$\text{Coefficient of Variation} = \frac{\text{Standard Deviation}}{\text{Mean}} \times 100$
It is expressed as a percentage and the series with the highest percentage
has the greater variability.
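To illustrate comparing series measured in different units, the sketch below computes the coefficient of variation for two hypothetical series (heights in cm and weights in kg, invented for this example). The function name is an assumption.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

heights_cm = [160, 170, 180]  # hypothetical
weights_kg = [50, 70, 90]     # hypothetical

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)
print(round(cv_heights, 2), round(cv_weights, 2))
```

Here the weights are relatively more variable than the heights, even though the two standard deviations are in different units and cannot be compared directly.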
7.4.3. Range
The range is simply the difference between the highest and lowest values in a
set of data.
Demo 7.2 - Calculating Measures of Dispersion: Variance and Standard
Deviation (click here for Excel output results)
7.4.4. Percentiles and Quartiles
The pth percentile of the distribution is the value such that p percent of
the observations fall at or below it. Its position i in the ordered data is
given by $i = \frac{p}{100}(n + 1)$
So the value at the 20th percentile has 80% of observations above it, 20% at
or below. The Median, as defined above, is identical with the 50th percentile.
The most commonly used percentiles other than the median are the quartiles.
The first quartile (Q1) is the 25th percentile, the second quartile is the median
(Q2) and the third quartile is the 75th percentile (Q3).
Demo 7.553 - Calculating Percentiles (click here for Excel output results)
The interquartile range is simply the difference between the quartiles: IQR =
Q3 - Q1
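The position formula i = (p/100)(n + 1) can be turned into a small routine. This is a sketch under stated assumptions: when the position is not a whole number, the routine interpolates linearly between the two neighbouring observations, and positions outside the data are clamped to the smallest or largest value. Spreadsheet functions may follow a different convention, so results can differ slightly. The data and the function name `percentile` are hypothetical.

```python
def percentile(values, p):
    """Value at the 1-based position i = p/100 * (n + 1) in the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    i = p / 100 * (n + 1)
    i = min(max(i, 1), n)      # keep the position inside the data
    lower = int(i)             # whole part of the position
    fraction = i - lower       # distance towards the next observation
    if lower >= n:
        return float(ordered[-1])
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

data = [3, 5, 7, 8, 9, 11, 13, 15, 16, 18, 20]  # hypothetical, n = 11

q1 = percentile(data, 25)   # position 3
q2 = percentile(data, 50)   # position 6, the median
q3 = percentile(data, 75)   # position 9
iqr = q3 - q1
print(q1, q2, q3, iqr)
```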
7.5.3 The Shape of Distributions
The pattern of variation of a variable (or a set of data) is called its distribution.
A frequency distribution is a convenient summary of a large set of data. It
shows the frequency of items in each class/interval.
When we construct frequency histograms, they often display low frequencies
on the left, building up to a peak, and then dropping steadily down to low
frequencies again on the right.
If the peak is in the centre of the histogram and the slopes on both sides are
the same, then the distribution is said to be symmetrical. If the peak lies to
one or the other side of the histogram or polygon, the distribution is skewed.
The Normal Distribution
All normal distributions have the same overall shape. We will see that the
skewness is equal to zero, i.e., it is perfectly symmetrical.
Its main property is that: mode = median = mean
[Figure: normal distribution curve; vertical axis: Frequency; horizontal axis: Values of the Observations, centred on μ]
If X is a variable having a normal distribution then we say “X is normal with
mean mu and std deviation sigma”: $X \sim N(\mu, \sigma)$
The ‘68-95-99.7’ Rule for Normal Distributions
If we are sure that the distribution is approximately normal, as with IQ, we find
that
 68% of observations occur within 1 Std Deviation of the Mean
 95% of observations occur within 2 Std Deviations of the Mean
 99.7% of observations occur within 3 Std Deviations of the Mean
The properties of normal distributions are used in determining the probabilities
of rejecting or accepting hypotheses. We often invoke the principle that if an
observation is more than 2 standard deviations from the mean, then it is
unlikely to have occurred by chance.
The “standard normal distribution” Z is normal with mean 0 and std deviation
1: $Z \sim N(0, 1)$
Any normal distribution can be transformed to the standard normal distribution
using the transformation: $z = \frac{x - \mu}{\sigma}$
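The usual standardising transformation subtracts the mean and divides by the standard deviation, z = (x - μ)/σ. The sketch below applies it to an IQ score. The mean of 100 and standard deviation of 15 are assumptions made for this example, not figures taken from the lecture.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Distance of x from the mean, in standard deviations."""
    return (x - mu) / sigma

# Assumed IQ scale for illustration: mean 100, std deviation 15
iq = NormalDist(mu=100, sigma=15)

z = z_score(130, mu=100, sigma=15)

# The probability of scoring at or below 130 is the same whether we use
# the original distribution or the standard normal with the z value.
p_original = iq.cdf(130)
p_standard = NormalDist().cdf(z)
print(z, round(p_original, 4), round(p_standard, 4))
```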
Skewness
If the peak lies to one or the other side of the histogram or polygon, the
distribution is skewed.
Right Skewed Distribution
Positively skewed: Mean and Median are to
the right of the Mode.
Mode<Median<Mean
Left Skewed Distribution
Negatively Skewed: Mean and Median are to the left of the Mode.
Mean<Median<Mode
There are a number of ways of measuring skewness. The simplest is the
Pearson coefficient of skewness: $SK = \frac{3(\text{Mean} - \text{Median})}{\text{Standard Deviation}}$
(For the example using mean and median earlier, the SK =3*(58.8-67.4)/6.39
= -4.0 (negatively skewed))
How much is this coefficient if your data are normally distributed?
Income and wealth distributions are often positively skewed, which means
that statistics on “average” incomes have to be treated with caution.
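The Pearson coefficient is easy to compute once the mean, median and standard deviation are known. The sketch below first reproduces the worked figure quoted above, then applies the formula to a hypothetical set of incomes (invented for illustration) in which one large value drags the mean above the median.

```python
import statistics

def pearson_skewness(mean, median, std_dev):
    """Pearson coefficient of skewness: 3 * (mean - median) / std deviation."""
    return 3 * (mean - median) / std_dev

# Figures quoted in the text
sk_example = pearson_skewness(58.8, 67.4, 6.39)

# Hypothetical incomes (in thousands) with one very high earner
incomes = [18, 20, 22, 25, 30, 45, 120]
sk_incomes = pearson_skewness(
    statistics.mean(incomes),
    statistics.median(incomes),
    statistics.stdev(incomes),
)
print(round(sk_example, 2), round(sk_incomes, 2))
```

In the income example the mean (40) is well above the median (25), so the coefficient is positive and the "average" income overstates what a typical person earns.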
Kurtosis
In addition to skewness, we also have kurtosis.
Kurtosis measures the tallness or flatness of the distribution. We will not
derive the formula for the kurtosis (the formula is complex). It is enough to
know that for a normal distribution the value equals 3.
Demo 7.5.4 - Calculating Measures of Shape (click here for Excel output
results)
7.6 More about graphical representation of data
In Week 5 we saw how univariate data could be displayed in histograms and
boxplots. But what if you have two variables – “bivariate” data?
The display you use depends on the “level” of the data – in particular, whether
it is “nominal” (sometimes called “categorical”) or “interval / ratio” level – that
is to say, real measurable data for which you can calculate means and
5-number summaries.
Displaying Bivariate Data
Cross-tabulations are used when you have two Nominal (or Categorical) level
variables and you want to see how scores on one relate to scores on the
other. For instance, the following demo cross-analyses choice of degree
course by gender.
Demo 7.6.1 - Cross-tabulations (click here for Excel output results)
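A cross-tabulation simply counts how many observations fall into each combination of categories. The sketch below builds one from hypothetical survey records (the data are invented and are not those used in the demo), using only the Python standard library.

```python
from collections import Counter

# Hypothetical records: (gender, degree course)
records = [
    ("Female", "Economics"), ("Male", "Economics"), ("Female", "History"),
    ("Male", "Economics"), ("Female", "Economics"), ("Male", "History"),
    ("Female", "History"), ("Male", "Economics"),
]

counts = Counter(records)
genders = sorted({gender for gender, _ in records})
courses = sorted({course for _, course in records})

# Rows are genders, columns are courses, cells are counts
table = {g: {c: counts[(g, c)] for c in courses} for g in genders}

print(f"{'':10}" + "".join(f"{c:>12}" for c in courses))
for g in genders:
    print(f"{g:10}" + "".join(f"{table[g][c]:>12}" for c in courses))
```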
Scatterplots, or Scattergrams, or XY Charts are used when we have two
interval-level or ratio-level variables – real data with means and variances –
which we want to cross-analyse.
Demo 7.6.2 shows how student marks in Year 2 relate to student marks in
Year 3. To what extent can we predict student performance from one year to
the next? (Click here for Excel output results)
Click here to go to Exercise 7