Download Lecture 7

Lecture 7 Describing Data; Growth Rates and Measures of Dispersion

7.1 Describing Data

We have seen how to capture data from the web and download data to Excel. The next few sessions look at some of the methods used for analysing data. Various objectives might be relevant here, including:

• Presenting data in a more user-friendly or informative way
• Describing data more precisely
• Computing and interpreting various measures of centre and dispersion for ungrouped data
• Explaining the characteristics, uses, advantages and disadvantages of the various measures of centre and location
• Using the normal distribution

Particularly if you have not had much opportunity to work with real data in economics previously, you should practise with the sorts of methods covered in these sessions to develop confidence in handling data. This is particularly useful for the group-produced project.

7.1.3 Describing data in Excel

We will now return to Excel for the rest of the lecture. It is also possible to do some of these statistical tasks in Excel, but it involves using the ‘Data Analysis’ Add-in. To arrange raw data into an Excel Frequency Distribution and Histogram, see Demo 7.1, which makes use of this feature of Excel.

7.2 Transforming data

It is very common to get time series data expressed in absolute numbers that you would prefer to express in some other format, one that makes the numbers much easier to absorb and makes trends, and deviations from trends, easier to identify. There are two techniques commonly used for transforming data into a more digestible form.

7.2.1 Calculating growth rates

A growth rate is the change in a variable between time t and time (t+1) relative to its initial value at t. For example, if income rises from 200 to 208 during the course of 12 months, the (annual) growth rate is (208-200)/200 = 0.04, or 4%. Calculations of this kind are very easily done in spreadsheets, but care is needed with the time units being used.
With quarterly data, for example, you need to multiply each quarterly increase by four to get the (approximate) annual growth rate. Also, it is conventional to express growth rates in percentage terms. The most reliable way to do this is by multiplying the answer by 100. Sometimes you can use the % icon, but beware: although this will display the answer as a percentage, arithmetic operations will still treat the cell as a decimal!

A very popular technique (among economists) for calculating growth rates is to use logs. It can be shown that, for small changes,

$$\frac{x_t - x_{t-1}}{x_{t-1}} \approx \ln x_t - \ln x_{t-1}$$

and, if you want percentage growth rates, this becomes

$$\frac{x_t - x_{t-1}}{x_{t-1}} \times 100 \approx (\ln x_t - \ln x_{t-1}) \times 100$$

When calculating growth rates in Excel you normally 'lose' the first observation, since the change cannot be defined until you are at the second of two points. So, when putting in the formula to compute the rate of growth, leave the first cell of your new column blank.

7.2.2 Index Numbers

The idea of an index number is to express a sequence of numbers in relation to a base value of 100. Thus if we had a series of 3 numbers, say 50, 80 and 150, we could express them as index numbers, using the first number as the base. In index form the series would become 100, 160 and 300. Whatever multiplication or division operation you perform to convert the first number to a value of 100, you repeat for the other numbers as well. Thus to convert 50 to 100 you multiply by 2 (=100/50), so this operation is applied to the other numbers as well. A common application of index numbers is to convert very large numbers, such as the value of Gross Domestic Product, to a more manageable format.
Thus if GDP at current prices is £668,225 million in 1994 and £700,890 million in 1995, we convert these to index numbers as follows:

1994: 668,225*(100/668,225) = 100
1995: 700,890*(100/668,225) = 104.89

7.2.3 Rates of change

There are occasions when it is conventional to express data in index number format but then also to calculate a rate of change. Price inflation is a classic example. Raw price information (compiled from market prices for a particular bundle of commodities) is first transformed into an index such as the Retail Price Index or the Tax and Price Index. Rates of change in the value of this index are then calculated and used as a measure of the rate of inflation, i.e. the rate at which prices are changing.

7.2.4 ‘Deflating’: Expressing data in 'real terms' or 'constant prices'

Often in macroeconomics it is useful to 'deflate' a series of numbers in order to extract the influence of price changes and to see how a variable has moved in 'real terms'. A 30-year run of data on Aggregate Consumption or GDP expressed in 'current prices', for example, will exaggerate the rate at which consumption or incomes have been increasing. In order to eliminate this price effect it is common to deflate a series by a price index. This requires data on a price index (such as the RPI) covering the same period as the data for the variable in question. The value of the variable in period t is then divided by the value of the price index in t, and the resulting variable is referred to as being measured in 'real' or 'constant price' terms. The year for which the price index takes a value of 100 is then specified as part of the name of the new series: e.g. 'Consumption at constant 1995 prices'.
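The transformations in sections 7.2.1 to 7.2.4 can also be sketched outside Excel. The short Python example below is only an illustration: the function names are invented for this sketch, the nominal series and price index in the deflating example are hypothetical, and multiplying by 100 after dividing by the price index is an assumed convention so that base-year values are left unchanged.

```python
import math


def growth_rates(series):
    """Percentage growth between consecutive observations.

    The first entry is None because no change can be defined
    until the second observation (the 'lost' first cell in Excel).
    """
    rates = [None]
    for prev, curr in zip(series, series[1:]):
        rates.append((curr - prev) / prev * 100)
    return rates


def log_growth_rates(series):
    """Approximate percentage growth using differences of natural logs."""
    rates = [None]
    for prev, curr in zip(series, series[1:]):
        rates.append((math.log(curr) - math.log(prev)) * 100)
    return rates


def to_index(series, base_position=0):
    """Rescale a series so the observation at base_position equals 100."""
    base = series[base_position]
    return [value * 100 / base for value in series]


def deflate(nominal, price_index):
    """Divide each nominal value by the price index for the same period.

    Assumed convention: the index equals 100 in the base year, so the
    result is multiplied by 100 to express it in base-year prices.
    """
    return [value / p * 100 for value, p in zip(nominal, price_index)]


# Examples taken from the text
print(growth_rates([200, 208]))        # income example: 4% growth
print(log_growth_rates([200, 208]))    # log approximation, slightly below 4%
print(to_index([50, 80, 150]))         # index numbers with the first value as base
print(to_index([668225, 700890]))      # GDP at current prices, 1994 = 100

# Hypothetical figures, for illustration only
nominal_consumption = [500, 550]
price_index = [100, 105]
print(deflate(nominal_consumption, price_index))
```

The log version gives a slightly smaller figure than the simple formula for the 200 to 208 example, which illustrates that the two agree only approximately and that the gap widens as the change gets larger.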
Definitions

• Nominal GDP (or GDP in Current Prices) = GDP at prices prevailing when income was earned
• Real GDP (GDP in Constant Prices) = GDP adjusted for the effects of inflation
• Per Capita real GDP = Real GDP divided by the total population
• Economic Growth = The percentage change in Real GDP per year

Demo 7.2 is an extensive demo that runs over four sheets. Click here for a demonstration of calculating UK data on Growth Rates (Sheet 1), Index Numbers (Sheet 2), the GDP Deflator (Sheet 3) and Real GDP (Sheet 4). Try to follow how each sheet develops from the one before.

7.2.5 Graphical representation of data

Very often you will have to play around with data to get them into the format that makes the most vivid picture. There are no hard and fast rules about how to do this: it is best to experiment a bit. The final product will depend on both the format in which numbers are expressed and the choices you make about how to present the data. Be willing to change both in the search for the best output.

7.3 Measures of Centre

7.3.1. Arithmetic Mean

$$\bar{x} = \frac{\sum x}{n}$$

Excel function =AVERAGE(range:range)

“X-bar” is the sum of the scores divided by the number of people (or items) being measured (i.e. the sample size).

Properties

• Every set of interval-level and ratio-level data has a mean.
• All the values are included in computing the mean.
• A set of data has a unique mean.
• The mean is affected by outliers (unusually large or small data values); it is not a ‘resistant’ measure of centre.
• If all the values are summed and an equal number of means are subtracted, the total will be zero: $\sum (x - \bar{x}) = 0$

7.3.2. The Geometric Mean

$$GM = \sqrt[n]{(x_1)(x_2)(x_3)\cdots(x_n)}$$

The geometric mean (GM) of a set of n numbers is defined as the nth root of the product of the n numbers.

Excel function =GEOMEAN(range:range)

Basic textbooks do not mention this measure, but it is very important in measuring financial market movements.
If a share goes up from 200 to 250, the standard growth rate formula gives (250-200)/200 = 25%. But if it goes down from 250 to 200, the standard growth rate gives (200-250)/250 = -20%. The share has gone up and down by the same amount, but the growth rates are different. The Geometric Mean is used to overcome this problem. Multiply the 2 numbers together and take the square root: =SQRT(200*250) = 223.60679. Measured against this geometric mean, the change of 50 is 50/223.61 ≈ 22.4% in size, regardless of whether the share is going up or down.

For the geometric mean of 3 numbers, take the cube root of the 3 numbers multiplied together. For the geometric mean of n numbers, take the nth root of the n numbers multiplied together.

Comparing the Arithmetic Mean and Geometric Mean

The interest rates on three bonds were 5, 7, and 4 percent. GM = 5.192. X-bar = 5.333. The GM gives a more conservative profit figure because it is less heavily weighted by the rate of 7 percent.

7.3.3. Median

Excel function =MEDIAN(range:range)

The median is the middle value in a distribution which has been arranged in numerical order (there is an equal number of values above and below the median). The position M of the median in the ordered distribution is found by

$$M = \frac{n + 1}{2}$$

Rule of thumb: If n is odd, the median is the centre value. If n is even, the median is the average of the two centre observations.

Properties of the Median

• There is a unique median for each data set.
• It is a robust measure of centre (i.e. largely unaffected by outlying values).
• It can be computed for ratio-level, interval-level, and ordinal-level data.
• It can be computed for an open-ended frequency distribution provided the median does not lie in an open-ended class.

7.3.4. Mode

The most frequently observed data value.

Excel function =MODE(range:range)

Demo 7.1 - Calculating Measures of Centre (click here for Excel output results)

7.4 Measuring Dispersion: Ungrouped Data

7.4.1. Variance

$$VAR = s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

The variance of a sample equals the sum of the squared deviations of the scores from the mean score, divided by (n-1). Because of the squaring involved (which is done to eliminate negatives), the variance can get quite large. If we “unsquare” it, we get the Standard Deviation.

7.4.2. Standard Deviation

$$s = \sqrt{VAR}$$

Coefficient of Variation

This is another measure of the variability of a series and can be used to compare different series where the units of measurement differ between the series. It is defined as:

$$\text{Coefficient of Variation} = \frac{\text{Standard Deviation}}{\text{Mean}} \times 100$$

It is expressed as a percentage, and the series with the highest percentage has the greatest variability.

7.4.3. Range

The range is simply the difference between the highest and lowest values in a set of data.

Demo 7.2 - Calculating Measures of Dispersion: Variance and Standard Deviation (click here for Excel output results)

7.4.4. Percentiles and Quartiles

The pth percentile of the distribution is the value such that p percent of the observations fall at or below it. Its position i in the ordered data is

$$i = \frac{p}{100}(n + 1)$$

So the value at the 20th percentile has 80% of observations above it and 20% at or below it. The Median, as defined above, is identical with the 50th percentile. The most commonly used percentiles other than the median are the quartiles. The first quartile (Q1) is the 25th percentile, the second quartile (Q2) is the median and the third quartile (Q3) is the 75th percentile.

Demo 7.5.3 - Calculating Percentiles (click here for Excel output results)

The interquartile range is simply the difference between the quartiles: IQR = Q3 - Q1

7.5.3 The Shape of Distributions

The pattern of variation of a variable (or a set of data) is called its distribution. A frequency distribution is a convenient summary of a large set of data. It shows the frequency of items in each class/interval.
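As a rough illustration of what a frequency distribution records, the Python sketch below counts how many observations fall in each class and prints a crude text histogram. The marks, the class limits and the function name are all hypothetical, chosen only for this example, and each class is assumed to include its lower limit but not its upper limit.

```python
def frequency_distribution(values, lower, width, n_classes):
    """Count the observations falling in each class.

    Class k covers lower + k*width up to (but not including)
    lower + (k+1)*width. Values outside all classes are ignored.
    """
    counts = [0] * n_classes
    for v in values:
        k = int((v - lower) // width)
        if 0 <= k < n_classes:
            counts[k] += 1
    return counts


# Hypothetical exam marks, for illustration only
marks = [42, 55, 58, 61, 63, 64, 67, 68, 71, 75, 79, 88]
counts = frequency_distribution(marks, lower=40, width=10, n_classes=5)

# Print one row of stars per class as a simple text histogram
for k, count in enumerate(counts):
    low = 40 + 10 * k
    print(f"{low}-{low + 9}: {'*' * count} ({count})")
```

With these made-up marks the counts rise to a peak in the 60-69 class and fall away on either side.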
When we construct frequency histograms, they often display low frequencies on the left, building up to a peak, and then dropping steadily down to low frequencies again on the right. If the peak is in the centre of the histogram and the slopes on both sides are the same, then the distribution is said to be symmetrical. If the peak lies to one or the other side of the histogram or polygon, the distribution is skewed.

The Normal Distribution

All normal distributions have the same overall shape. We will see that the skewness is equal to zero, i.e., the distribution is perfectly symmetrical. Its main property is that: mode = median = mean

[Figure: normal curve. Vertical axis: Frequency; horizontal axis: Values of the Observations, with μ marked.]

If X is a variable having a normal distribution then we say “X is normal with mean mu and std deviation sigma”:

$$X \sim N(\mu, \sigma)$$

The ‘68-95-99.7’ Rule for Normal Distributions

If we are sure that the distribution is approximately normal, as with IQ, we find that

• 68% of observations occur within 1 Std Deviation of the Mean
• 95% of observations occur within 2 Std Deviations of the Mean
• 99.7% of observations occur within 3 Std Deviations of the Mean

The properties of normal distributions are used in determining the probabilities of rejecting or accepting hypotheses. We often invoke the principle that if an observation is more than 2 standard deviations from the mean, then it is unlikely to have occurred by chance.

The “standard normal distribution” Z is normal with mean 0 and std deviation 1:

$$Z \sim N(0, 1)$$

Any normal distribution can be transformed to the standard normal distribution using the transformation:

$$z = \frac{x - \mu}{\sigma}$$

Skewness

If the peak lies to one or the other side of the histogram or polygon, the distribution is skewed.

Right Skewed Distribution

Positively skewed: Mean and Median are to the right of the Mode. Mode < Median < Mean

Left Skewed Distribution

Negatively skewed: Mean and Median are to the left of the Mode. Mean < Median < Mode

There are a number of ways of measuring skewness.
The simplest is the Pearson coefficient of skewness.

$$SK = \frac{3(\text{Mean} - \text{Median})}{\text{Standard Deviation}}$$

(For the example using mean and median earlier, SK = 3*(58.8-67.4)/6.39 = -4.0, i.e. negatively skewed.)

How much is this coefficient if your data are normally distributed?

Income and wealth distributions are often positively skewed, which means that statistics on “average” incomes have to be treated with caution.

Kurtosis

In addition to skewness, we also have kurtosis. Kurtosis measures the tallness or flatness of the distribution. We will not derive the formula for the kurtosis (the formula is complex). It is enough to know that for a normal distribution the value equals 3.

Demo 7.5.4 - Calculating Measures of Shape (click here for Excel output results)

7.6 More about graphical representation of data

Very often you will have to play around with data to get them into the format that makes the most vivid picture. There are no hard and fast rules about how to do this: it is best to experiment a bit. The final product will depend on both the format in which numbers are expressed and the choices you make about how to present the data. Be willing to change both in the search for the best output.

In Week 5 we saw how univariate data could be displayed in histograms and boxplots. But what if you have two variables, i.e. “bivariate” data? The display you use depends on the “level” of the data: in particular, whether it is “nominal” (sometimes called “categorical”) or “interval / ratio” level, that is to say, real measurable data for which you can calculate means and 5-number summaries.

Displaying Bivariate Data

Cross-tabulations are used when you have two Nominal (or Categorical) level variables and you want to see how scores on one relate to scores on the other. For instance, the following demo cross-analyses choice of degree course by gender.
Demo 7.6.1 - Cross-tabulations (click here for Excel output results)

Scatterplots, or Scattergrams, or XY Charts are used when we have two interval-level or ratio-level variables (real data with means and variances) which we want to cross-analyse.

Demo 7.6.2 shows how student marks in Year 2 relate to student marks in Year 3. To what extent can we predict student performance from one year to the next? (Click here for Excel output results)

Click here to go to Exercise 7
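For readers who want to check the measures from sections 7.3 to 7.5 outside Excel, the following Python sketch uses the standard `statistics` module. It is only an illustration: the sample is hypothetical (apart from the three bond rates quoted in section 7.3.2), the variable names are invented for this example, and the quartiles use the module's default "exclusive" method, which places the pth percentile at position p(n+1)/100 and so may differ from what a particular Excel function returns.

```python
import statistics as st

# Hypothetical sample, for illustration only (n = 11)
data = [3, 5, 5, 6, 8, 9, 10, 12, 13, 15, 24]

# Measures of centre
mean = st.mean(data)
median = st.median(data)
mode = st.mode(data)

# Measures of dispersion
variance = st.variance(data)          # sample variance: divides by n - 1
std_dev = st.stdev(data)              # square root of the sample variance
cv = std_dev / mean * 100             # coefficient of variation, in percent
data_range = max(data) - min(data)
q1, q2, q3 = st.quantiles(data, n=4)  # quartiles, default 'exclusive' method
iqr = q3 - q1

# Pearson coefficient of skewness
pearson_sk = 3 * (mean - median) / std_dev

# Bond example from section 7.3.2: geometric vs arithmetic mean
bond_rates = [5, 7, 4]
gm = st.geometric_mean(bond_rates)
am = st.mean(bond_rates)

# Share of a normal distribution within k standard deviations of the mean
standard_normal = st.NormalDist()     # mean 0, standard deviation 1


def share_within(k):
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


print("mean, median, mode:", mean, median, mode)
print("variance, std dev, CV%:", variance, round(std_dev, 3), round(cv, 1))
print("range, Q1, Q3, IQR:", data_range, q1, q3, iqr)
print("Pearson skewness:", round(pearson_sk, 3))
print("bond GM and mean:", round(gm, 3), round(am, 3))
for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.4f}")
```

In this made-up sample the single large value (24) pulls the mean above the median, which in turn lies above the mode, so the Pearson coefficient is positive, matching the Mode < Median < Mean pattern described for right-skewed distributions.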
