Chapter 4: Continuous probability functions

4.1 Introduction

A continuous variable is a variable that can take any value within a given range. Put differently, between any two values of a continuous variable there is always a third possible value. A continuous variable is always measured at a quantitative level (interval or ratio). Examples of continuous variables are time, length and weight. With a continuous variable, one does not look at the probability of one particular value (which equals 0), but at the probability of a range of values (smaller than a certain value or larger than a certain value). The most important continuous probability function, the normal distribution, is explained in this chapter. We will also briefly pay attention to the exponential probability function.

4.2 Normal probability function

A normal distribution is characterized by its mean µ, also called the expected value E[X], and its standard deviation σ.

[Graph 4.1: Normal distributions. Three probability density curves A, B and C; horizontal axis: X, vertical axis: probability density]

Graphs A and B have a mean (µ) of 4 and a standard deviation (σ) of 1 and 2 respectively, whereas graph C has a mean (µ) of 6 and a standard deviation (σ) of 3. Graphs A, B and C are different but have certain similarities, since they are all graphs of a normal distribution. Characteristics of normal distributions are:

1. The probability density is bell-shaped and symmetrical. Values that are more than σ above µ occur just as often as values that are more than σ below µ.
2. The arithmetic mean, the median and the mode are equal to each other.
3. A strong central tendency. Values close to the mean occur most often.
4. Relatively large and relatively small values seldom occur. 95.4% of all observations lie within two standard deviations of µ, and 99.7% lie within three standard deviations of µ.
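The percentages in characteristic 4 and the symmetry in characteristic 1 can be checked numerically. The following is a minimal sketch in Python, assuming the standard library class `statistics.NormalDist`; the helper name `share_within` is introduced here for illustration only and is not part of the course material.

```python
from statistics import NormalDist


def share_within(k: float, mu: float, sigma: float) -> float:
    """Return P(mu - k*sigma < X < mu + k*sigma) for a normal variable X."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


# Graph B from the text: mean 4, standard deviation 2.
MU, SIGMA = 4.0, 2.0

for k in (1, 2, 3):
    print(f"share within {k} standard deviation(s): {share_within(k, MU, SIGMA):.4f}")

# Symmetry: values more than sigma above the mean are as likely
# as values more than sigma below the mean.
graph_b = NormalDist(MU, SIGMA)
upper_tail = 1.0 - graph_b.cdf(MU + SIGMA)
lower_tail = graph_b.cdf(MU - SIGMA)
print(f"upper tail: {upper_tail:.4f}, lower tail: {lower_tail:.4f}")
```

The result does not depend on the chosen µ and σ, because the interval is expressed in standard deviations around the mean.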
The probability density function f(X) of a normal distribution can be calculated with the following formula:

$$f(X) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}}$$

Where:
σ: standard deviation of the population
µ: arithmetic mean of the population
π: mathematical constant approximately equal to 3.14159
e: mathematical constant approximately equal to 2.71828

When you want to calculate the probability that a value X of a normal distribution lies between the values a and b, P(a < X < b), you have to determine the area under the probability density function between a and b.

[Graph 4.2: Normal distribution with mean µ and standard deviation σ, with the area between a and b marked; horizontal axis: X, vertical axis: f(X)]

This area can be calculated by integrating the probability density function over the interval [a,b], or by standardizing and using the standard normal probability table. A normal distribution can be transformed to a standard normal distribution. A standard normal distribution is a normal distribution with mean µ = 0 and standard deviation σ = 1. Standardizing gives a so-called z-value, of which the probability can be found in a standard normal table. The term z-value is explained further below.

Converting a normal distribution to a standard normal distribution, also called standardizing, takes two steps, which are illustrated with the normal distribution of graph B, with µ = 4 and σ = 2.

1. The mean is subtracted from all the X-values (X - µ). This shifts the graph so that the mean lies at 0.

[Graph 4.3: Normal distribution with mean 4 and standard deviation 2. Graph 4.4: Normal distribution with mean 0 and standard deviation 2. Horizontal axis: X, vertical axis: probability density]

2. The new X-values are then divided by the standard deviation σ, which gives (X - µ) / σ.
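Both routes to P(a < X < b) mentioned above, integrating the density and standardizing, can be compared in a short Python sketch. It assumes the standard library class `statistics.NormalDist` for the standard normal table; the function names and the interval from 2 to 6 are chosen here for illustration.

```python
import math
from statistics import NormalDist


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Probability density of a normal distribution, written out from the formula."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def prob_by_standardizing(a: float, b: float, mu: float, sigma: float) -> float:
    """P(a < X < b): convert a and b to z-values, then use the standard normal cdf."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    z_a = (a - mu) / sigma
    z_b = (b - mu) / sigma
    return standard_normal.cdf(z_b) - standard_normal.cdf(z_a)


def prob_by_integrating(a: float, b: float, mu: float, sigma: float,
                        steps: int = 10_000) -> float:
    """P(a < X < b): area under the density, approximated with the midpoint rule."""
    width = (b - a) / steps
    heights = (normal_pdf(a + (i + 0.5) * width, mu, sigma) for i in range(steps))
    return sum(heights) * width


# Graph B from the text: mean 4, standard deviation 2; interval from 2 to 6.
print(prob_by_standardizing(2, 6, mu=4, sigma=2))
print(prob_by_integrating(2, 6, mu=4, sigma=2))
```

The two results agree to many decimals, which is why a single standard normal table is enough for every normal distribution.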
This way you will get the standard normal distribution, where the mean equals 0 and the standard deviation equals 1. With the standard normal distribution we no longer speak of X-values; we call them z-values. The value of z indicates how many standard deviations the value X lies from the mean µ. When the mean (µ) of a normal distribution equals 4 and the standard deviation (σ) equals 2, the following applies:

With X = 6 the z-value equals 1 (6 lies 1 standard deviation above the mean of 4);
With X = 0 the z-value equals -2 (0 lies 2 standard deviations below the mean of 4);
With X = 3 the z-value equals -½ (3 lies ½ standard deviation below the mean of 4);
With X = 8 the z-value equals 2 (8 lies 2 standard deviations above the mean of 4).

So the probability that X < 6, for a normal distribution with a mean of 4 and a standard deviation of 2, is:

P(X < 6 | µ = 4 ; σ = 2) = P(z < 1) = 0.8413 [10]

[10] To be looked up in a table with left-tail probabilities of the standard normal distribution.

4.2.1 Probabilities of normal distributions with Excel

Excel has five functions for a (standard) normal probability distribution.

I. NORMDIST(x;mean;standard_dev;cumulative)

This function calculates the probability of a value smaller than X for a normal distribution with parameters µ and σ. When you fill in FALSE for the last argument you will get the probability density.

Working method:
1. Select the cell where you would like the normal probability to be calculated;
2. Press Insert in the menu bar and press Function in the scroll menu;
3. With Or select a category, press Statistical; with Select a function, press NORMDIST, and then press OK. The function's input screen appears;
4. With X you fill in the value for which you would like the left-tail probability to be calculated.
With Mean you fill in the population mean; with Standard_dev you fill in the standard deviation of the population; with Cumulative you fill in TRUE (you want to calculate the probability and not the probability density).

Example
For a normal distribution with a mean of 8 and a standard deviation of 3, how big is the probability that X is smaller than 6?

[Graph 4.6: Normal distribution with mean 8 and standard deviation 3; horizontal axis: X, vertical axis: probability density]

P(X < 6 | µ = 8 ; σ = 3) = NORMDIST(6;8;3;TRUE) = 0.2525

The probability above can also be determined in two steps in Excel: first standardize the probability distribution (determine the z-value, see function II), and then determine the left-tail probability of this z-value (see function III).

II. STANDARDIZE(x;mean;standard_dev)

This function calculates the z-value for a normal distribution with parameters µ and σ.

Working method:
1. Select the cell where you would like the z-value to be calculated;
2. Press Insert in the menu bar and press Function in the scroll menu;
3. With Or select a category, press Statistical; with Select a function, press STANDARDIZE, and then press OK. The function's input screen appears;
4. With X you fill in the value for which you would like the z-value to be calculated; with Mean you fill in the population mean; with Standard_dev you fill in the standard deviation of the population.

Example
What is the z-value belonging to an X-value of 6 for a normal distribution with a mean of 8 and a standard deviation of 3?

STANDARDIZE(6;8;3) = -0.6667

III. NORMSDIST(z)

This function calculates the probability of a value smaller than z (the left-tail probability).

Working method:
1. Select the cell where you would like the left-tail probability of the z-value to be calculated;
2. Press Insert in the menu bar and press Function in the scroll menu;
3.
With Or select a category, press Statistical; with Select a function, press NORMSDIST, and then press OK. The function's input screen appears;
4. With Z you fill in the value of z for which you would like the left-tail probability to be calculated.

Example
What is the probability that z is smaller than -0.6667?

NORMSDIST(-0.6667) = 0.2525

IV. NORMINV(probability;mean;standard_dev)

This function calculates the X-value belonging to a given probability for a normal distribution with parameters µ and σ. So this function is the inverse of the function NORMDIST(X;µ;σ;TRUE), where the probability belonging to an X-value is calculated.

Working method:
1. Select the cell where you would like the X-value to be calculated;
2. Press Insert in the menu bar and press Function in the scroll menu;
3. With Or select a category, press Statistical; with Select a function, press NORMINV, and then press OK. The function's input screen appears;
4. With Probability you fill in a number between 0 and 1: the left-tail probability for which you would like to calculate the X-value; with Mean you fill in the population mean; with Standard_dev you fill in the standard deviation of the population.

Example
Which X-value belongs to a normal distribution with a mean of 25, a standard deviation of 4 and a left-tail probability of 0.75?

[Graph 4.7: Normal distribution with mean 25 and standard deviation 4, with the left 75% of the area marked; horizontal axis: X, vertical axis: probability density]

P(X < ? | µ = 25 ; σ = 4) = 0.75
NORMINV(0.75;25;4) = 27.70

V. NORMSINV(probability)

This function calculates the z-value belonging to a given probability. So the function is the inverse of the function NORMSDIST(Z), where the probability belonging to a z-value is calculated.

Working method:
1. Select the cell where you would like the z-value to be calculated;
2. Press Insert in the menu bar and then press Function in the scroll menu;
3.
With Or select a category, press Statistical; with Select a function, press NORMSINV, and then press OK. The function's input screen appears.

Example
What is the z-value belonging to a probability of 0.75?

NORMSINV(0.75) = 0.6745

Don't forget to use the following probability rules with normal distributions:
a. P(X ≤ a) = P(X < a), because P(X = a) = 0
b. P(X > a) = 1 - P(X < a) (complement rule)
c. P(a < X < b) = P(X < b) - P(X < a)

Below you will find an Excel worksheet that you can use for normal distributions to calculate left- and right-tail probabilities and interval probabilities, and to determine the X-value for a given probability. In cell A7, amongst others, you will encounter the sign "&". It joins text with the content of a cell, so when the X-value (in cell B5) changes, the label in this cell changes too.

Row | Column A | Column B
1 | Calculating normal probabilities |
3 | Arithmetic mean | xxx
4 | Standard deviation | xxx
5 | First X-value | xxx
6 | Left-tail probability |
7 | ="P(X<="&B5&")" | =NORMDIST(B5,B3,B4,TRUE)
8 | Right-tail probability |
9 | ="P(X>="&B5&")" | =1-B7
10 | Interval |
11 | Second X-value | xxx
12 | ="P(X<="&B11&")" | =NORMDIST(B11,B3,B4,TRUE)
13 | ="P("&B5&"<X<"&B11&")" | =ABS(B12-B7)
14 | Find X-value |
15 | Cumulative percentage | .xx
16 | X-value | =NORMINV(B15;B3;B4)

ABS is a mathematical function that takes the absolute value of a number.

When inserting an arithmetic mean of 75, a standard deviation of 6, a first X-value of 69, a second X-value of 81 and a cumulative percentage of 10%, you will get the following result in Excel.
Calculating normal probabilities
Arithmetic mean | 75
Standard deviation | 6
First X-value | 69
Left-tail probability
P(X<=69) | 0.158655254
Right-tail probability
P(X>=69) | 0.841344746
Interval
Second X-value | 81
P(X<=81) | 0.841344746
P(69<X<81) | 0.682689492
Find X-value
Cumulative percentage | 0.1
X-value | 67.31069061

When you change the value in cell B5 to 72, the following will appear in cell A7: P(X<=72)

4.3 Exponential probability distributions

The exponential probability distribution is used in waiting time theory to model the time between two arrivals. To calculate the probability that the next arrival takes place within a certain time X, you can use the following formula:

$$P(\text{arrival time} < X) = 1 - e^{-\lambda X}$$

Where:
e: mathematical constant approximately equal to 2.71828
λ: the average number of arrivals per time unit in the population
X: a continuous variable where 0 < X < +∞

The exponential probability distribution is determined by only one parameter, λ, which equals the average number of arrivals per time unit. The average time between two arrivals then equals 1/λ. For instance, when the average number of arrivals equals 5 per hour, the average time between two arrivals is 1/5 of an hour, or 12 minutes.

[Graph 4.9: Exponential distribution for different values of λ (5, 10 and 20); horizontal axis: X, vertical axis: P(< X)]

Example
When an average of 30 customers per hour come to an office window at an NS station, how big is the probability that, when one customer has just arrived at the office window, the next customer will arrive within three minutes? Make sure you use the same time units!
λ (lambda) = 30 per hour = 0.5 per minute
X = 3 minutes = 0.05 hour

Unit minutes: $P(X < 3) = 1 - e^{-0.5 \cdot 3} = 1 - e^{-1.5} = 1 - 0.2231 = 0.7769$
Unit hours: $P(X < 0.05) = 1 - e^{-30 \cdot 0.05} = 1 - e^{-1.5} = 1 - 0.2231 = 0.7769$

4.3.1 Probabilities of exponential distributions with Excel

Excel uses the following function to calculate exponential probabilities:

EXPONDIST(x;lambda;cumulative)

Working method:
1. Select the cell where you would like to calculate the exponential probability;
2. Press Insert in the menu bar and press Function in the scroll menu;
3. With Or select a category, press Statistical; with Select a function, press EXPONDIST, and then press OK. The function's input screen appears;
4. With X you fill in the value for which you would like to calculate the left-tail probability. With Lambda you fill in λ, the average number of arrivals per time unit. With Cumulative you fill in TRUE.

Example
When an average of 20 people per hour come to an office window of an NS station, how big is the probability that the next customer will arrive within 6 minutes? Make sure you use the same time units!

λ (lambda) = 20 per hour
x = 6 minutes = 0.1 hour
P(X < 0.1 | λ = 20) = EXPONDIST(0.1;20;TRUE) = 0.8647

Again, use the probability rules mentioned earlier when you want to calculate probabilities other than the left-tail probability.

Example
We continue with the previous example of an office window at an NS station. How big is the probability that it takes longer than six minutes for the next customer to arrive?

P(X > 0.1) = 1 - P(X < 0.1) = 1 - 0.8647 = 0.1353

Below you will find an Excel worksheet that can be used to calculate left- and right-tail probabilities for exponential distributions.
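For readers who want to check the exponential examples outside Excel, here is a minimal Python sketch of the same formula. The function name `expon_cdf` is an assumption made for this illustration; it mirrors what EXPONDIST returns with Cumulative set to TRUE.

```python
import math


def expon_cdf(x: float, lam: float) -> float:
    """Left-tail probability P(X < x) = 1 - e^(-lam * x) for an exponential distribution.

    lam is the average number of arrivals per time unit;
    x must be expressed in that same time unit.
    """
    if x < 0:
        return 0.0  # waiting times cannot be negative
    return 1.0 - math.exp(-lam * x)


# 30 customers per hour, next arrival within 3 minutes, in both units.
in_minutes = expon_cdf(3, 0.5)    # lambda = 0.5 per minute
in_hours = expon_cdf(0.05, 30)    # lambda = 30 per hour
print(f"{in_minutes:.4f} {in_hours:.4f}")

# 20 customers per hour, next arrival within or after 6 minutes (0.1 hour).
left_tail = expon_cdf(0.1, 20)
right_tail = 1.0 - left_tail      # complement rule
print(f"{left_tail:.4f} {right_tail:.4f}")
```

Because only the product of λ and X enters the formula, the answer is the same in minutes and in hours as long as both values use the same unit.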
Row | Column A | Column B
1 | Calculating exponential probabilities |
3 | Arithmetic mean | xx
4 | X-value | xx
6 | Left-tail probability |
7 | ="P(X<="&B4&")" | =EXPONDIST(B4,B3,TRUE)
8 | Right-tail probability |
9 | ="P(X>="&B4&")" | =1-B7

When you fill in 20 as the mean in cell B3 and 0.1 as the X-value in cell B4, you will get the following result:

Calculating exponential probabilities
Arithmetic mean | 20
X-value | 0.1
Left-tail probability
P(X<=0.1) | 0.864664717
Right-tail probability
P(X>=0.1) | 0.135335283

4.4 Assignments

A private delivery service that delivers "Metro" to a number of NS stations in the morning takes an average of two hours to do this, with a standard deviation of ten minutes. Assuming that the delivery time is normally distributed, determine:
a. The probability that, on any given day, the private delivery service finishes deliveries within 1 ½ hours;
b. The probability that, on any given day, the private delivery service takes longer than 2 ¼ hours to deliver "Metro";
c. The probability that, on any given day, the delivery of "Metro" takes a minimum of 1 ¾ hours and a maximum of 2 ¼ hours;
d. The duration for which the probability is 75% that the delivery is completed within this time span.

An average of 23 travelers per hour come to a ticket machine at a certain NS station during the morning rush hour. Assuming that the time between arrivals of train travelers at this ticket machine is exponentially distributed, determine the probability that:
a. The next traveler will arrive within five minutes;
b. The next traveler will arrive after ten minutes;
c. It takes a minimum of three minutes and a maximum of seven minutes for the next traveler to arrive.

The time that train travelers take to get a ticket out of the ticket machine is normally distributed with a mean of two minutes and a standard deviation of 15 seconds. Determine:
a. The probability that a random traveler finishes within 1 ½ minutes at the ticket machine;
b.
The probability that a random traveler needs more than two minutes and twenty seconds to get a ticket out of the ticket machine;
c. The probability that a random traveler needs a minimum of two minutes and a maximum of 2 ½ minutes to get a ticket out of the ticket machine;
d. The time within which 95% of the train travelers get a ticket out of the ticket machine.

The central information service of the NS about travelling times is called 18 times on average during the morning rush hour. The morning rush hour takes one hour. Assuming that the time between incoming phone calls at the central information service is exponentially distributed, determine:
a. The probability that the next phone call will come in within three minutes;
b. The probability that it takes more than five minutes for the next phone call to come in;
c. The probability that the next phone call comes in between two and four minutes from now.

The company VeelInvest BV has two investment projects, P1 and P2. They both have an expected profitability of 15%, i.e. 0.15. The standard deviation for P1 equals 0.02 and for P2 it equals 0.06.
1. How big is the probability for project P1 that the actual profitability is lower than 10%?
2. How big is the probability for project P2 that the actual profitability is lower than 10%?
3. Which project should be favored when the risk needs to be controlled as much as possible?
4. VeelInvest would want to take a risk of 80%. What is the profitability of both projects with that risk? Which project is preferable in this case?

Chapter 5: Simple linear regression and correlation analysis

5.1 Introduction

Regression analysis is about developing models that explain the relation between a variable to be explained and one or several explanatory variables, with the purpose of predicting the variable to be explained with the help of the explanatory variable(s).
The variable to be explained in a regression model is also called the dependent variable, and the explanatory variable(s) the independent variable(s). For example, the "Price" that advertisers are willing to pay for an ad in "Spits" will depend on the "Number of prints" of "Spits". In this case "Price" is the dependent (explained) variable, and "Number of prints" is the independent (explanatory) variable.

In practice, the variable to be explained will often depend on more than one explanatory variable. If more than one explanatory variable is incorporated in the investigation, we speak of multiple regression analysis. When the investigation confines itself to one explanatory variable, we speak of simple regression analysis. One then works under the ceteris paribus condition, which means that all remaining explanatory variables that are not incorporated in the investigation are assumed to be constant.

Correlation analysis is about investigating the strength of the relation between two variables. With correlation analysis as well as with regression analysis, the variables should be quantitative. When one or several variables are qualitative, another statistical technique has to be used to investigate the relation between the variables.

5.2 Regression models

In Fiction2000 there is a relation between the variable "Age" and the variable "Income". The "Income" depends on the "Age", so the variable "Income" is shown on the vertical (Y) axis and the variable "Age" on the horizontal (X) axis. The relation between variables can be described by very simple as well as very complicated mathematical functions. The simplest form is a straight line.
The mathematical function of a straight line is:

y = a*x + b

Where:
x: independent (explanatory) variable
y: dependent (explained) variable
a: slope (the increase in units of y when x increases by one unit)
b: intercept with the Y-axis (the value of y when x equals 0)

In graph 5.1 we see that the linear regression line between the variables 'Age' and 'Income' can be presented as follows:

y = 0.4935x - 2.2733

When someone is one year older, the income will be 0.4935 * € 100 = € 49.35 higher.

[Graph 5.1: Dot diagram of age and income with regression line y = 0.4935x - 2.2733; horizontal axis: Age, vertical axis: Income (* € 100)]

Whether you should use a linear function or a more complex mathematical function for your model depends on the distribution of the (x,y)-values in your dot diagram. Examples of different forms of regression (relations) between x and y can be seen in graphs A to F.

With graph A the values of y increase more or less linearly in proportion to the increase of the values of x. An example of this can be seen in graph 5.1, where the relation between 'Age' and 'Income' is presented.

With graph B the values of y decrease more or less linearly in proportion to the increase of the values of x. A declining regression line means that the relation is negative. An example of a negative linear relation is the relation between the 'Price of a product' and the 'Sales of that product'.

With graph C there is no relation. High and low y-values can be found with all the x-values.

With graph D the values of y increase when the x-values increase. In the beginning these increases are more than proportional, at the end less than proportional. An example of a positive nonlinear relation is the relation between 'Expenses for the commercials for a product' and the 'Sales of that product'. In the beginning the sales will increase considerably, up to the point where you have reached your entire target group with your commercial, after which the sales stay more or less constant.
With graph E the values of y decrease when the x-values increase. In the beginning these decreases are more than proportional and at the end they are less than proportional. As an example you may think of the relation between the variables 'Depreciation' and 'Years'. The reduction in value of cars is considerably larger in the first years than later on.

With graph F the values of y initially decrease when the x-values increase. Subsequently the y-values increase when the x-values increase. Here you may think of the relation between the variables 'Time' and the 'Number of mistakes someone makes at a certain job'. With practice you will quickly make fewer mistakes, but if you practice a certain task for too long the number of mistakes will increase again, because of fatigue or boredom.

[Graphs A to F, each with horizontal axis X and vertical axis Y. Graph A: Positive linear correlation. Graph B: Negative linear correlation. Graph C: No correlation. Graph D: Positive nonlinear correlation. Graph E: Negative nonlinear correlation. Graph F: U-shaped correlation]

5.3 The linear regression model

When you look at graph 5.1 or graph A, not all points lie on one straight line. However, you can still draw a line in such a way that many of these points lie fairly close to it. The line that best describes the relation between x and y is called the regression line. The regression line is determined by the least squares method. According to this method, the regression line is the line for which the sum of the squares of the vertical distances from the points of the dot diagram to that line is minimal. Just like with the variance, we look at the squared distances, because there are points above as well as below the line. The vertical distance from a point of the dot diagram to the regression line is called a residual. These residuals are squared and added up. The regression line is therefore the line for which the residual sum of squares is minimal.
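The least squares idea can be made concrete with a small Python sketch. The five (age, income) pairs below are made up for illustration and are not taken from Fiction2000; the function names are likewise assumptions. The slope is computed with the standard closed-form least squares solution, after which the sketch shows that nudging the line in any direction makes the residual sum of squares larger.

```python
def fit_line(xs, ys):
    """Return slope a and intercept b of the least squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def residual_sum_of_squares(xs, ys, a, b):
    """Sum of the squared vertical distances between the points and the line."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


# Hypothetical sample: ages and incomes (* 100 euro), invented for this sketch.
AGES = [20, 30, 40, 50, 60]
INCOMES = [8, 12, 19, 22, 29]

slope, intercept = fit_line(AGES, INCOMES)
best = residual_sum_of_squares(AGES, INCOMES, slope, intercept)
print(f"y = {slope:.2f}x + {intercept:.2f}, residual sum of squares = {best:.2f}")

# Any other line gives a larger sum of squared residuals.
for da, db in [(0.05, 0.0), (-0.05, 0.0), (0.0, 1.0), (0.0, -1.0)]:
    other = residual_sum_of_squares(AGES, INCOMES, slope + da, intercept + db)
    print(f"slope {slope + da:.2f}, intercept {intercept + db:.2f}: {other:.2f}")
```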
[Graph 5.2: Dot diagram with regression line y = ax + b, showing points y1 to y5 and their residuals r1 to r5; horizontal axis: X, vertical axis: Y]

The vertical distance between a point of the dot diagram (yi) and the regression line (axi + b) is presented in graph 5.2 by ri. The following applies: r1 = y1 - ax1 - b; r2 = y2 - ax2 - b; r3 = y3 - ax3 - b; and so on. All these residuals then have to be squared and added up:

$$S = \sum_{i=1}^{n} (y_i - a x_i - b)^2$$

S is a function of the two variables a and b. To find the minimum of S, you have to take the partial derivatives of the function. This is very mathematical and is therefore left out. The minimum is found for the following values of a and b, where $\sigma_x^2$ is the variance of the x-values:

$$a = \frac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y}}{\sigma_x^{2}} \qquad b = \bar{y} - a\,\bar{x}$$

5.4 Explained and unexplained variance

Before you can use the model to make predictions for the dependent variable by using the independent variable, you have to investigate whether the model is fit for this purpose. A linear model is fit when the observed points lie not too far from the linear regression line. For this you need a measure for the distance of the points to the regression line. Such a measure is the determination coefficient (r²), which is defined as follows:

$$r^2 = \frac{\text{explained variance}}{\text{total variance}}$$

With the total variance, we look at the sum of the squared distances between the observed y-values and the mean of y ($\bar{y}$). With the explained variance, we look at the sum of the squared distances between the values of y on the regression line and the mean of y ($\bar{y}$). Next to this there is the sum of the squared distances between the values of y on the regression line and the observed y-values, which gives the unexplained variance. In other words: the fact that not all observed values of y are equal to the mean of y can partially be explained by the relation between x and y, by using the regression line. But not all points lie on the regression line, because of other (unexplained) causes.
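The split into explained and unexplained variation can be illustrated with a short Python sketch. The data are again a made-up sample, not the Fiction2000 file, and all names are chosen for this illustration. The slope follows the formula for a given above, with the variance of x computed by dividing by n.

```python
def determination_coefficient(xs, ys):
    """Return (total, explained, unexplained, r2) sums of squares for the fitted line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Slope and intercept as in the text: a = (mean of x*y - mean_x*mean_y) / var_x.
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    a = (sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y) / var_x
    b = mean_y - a * mean_x
    fitted = [a * x + b for x in xs]

    total = sum((y - mean_y) ** 2 for y in ys)                    # observed vs. mean
    explained = sum((f - mean_y) ** 2 for f in fitted)            # line vs. mean
    unexplained = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # observed vs. line
    return total, explained, unexplained, explained / total


# Hypothetical sample: ages and incomes (* 100 euro), invented for this sketch.
ages = [20, 30, 40, 50, 60]
incomes = [8, 12, 19, 22, 29]

total_ss, explained_ss, unexplained_ss, r2 = determination_coefficient(ages, incomes)
print(f"total {total_ss:.2f} = explained {explained_ss:.2f} + unexplained {unexplained_ss:.2f}")
print(f"determination coefficient r2 = {r2:.4f}")

# With only two points the line passes through both, so r2 is 1.
print(determination_coefficient([20, 60], [8, 29])[3])
```

The last line shows why the number of observed points matters when interpreting r².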
[Graph 5.3: Levels of variance with regression, showing the total, explained and unexplained variance for a point (xi, yi) relative to the mean of y; horizontal axis: X, vertical axis: Y]

The SST, or total sum of squares, equals the sum of the squared distances between $y_i$ and $\bar{y}$:

$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

The SSR, or regression sum of squares, equals the sum of the squared distances between $\hat{y}_i$ (the point on the regression line) and the mean of y ($\bar{y}$):

$$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$

The SSE, or error sum of squares, equals the sum of the squared distances between the observed y-values ($y_i$) and the y-values on the regression line ($\hat{y}_i$):

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

For the sums of squares, the following applies:

SST = SSR + SSE

When you divide the sums of squares by n-1, you speak of variance, and the following applies:

Total variance = explained variance + unexplained variance

In other words: the variance of y can be split into a part explained variance, caused by the relation with x, and a part own (unexplained) variance, which is independent of the relation with x and is probably caused by other factors that were not incorporated in the investigation.

How much value you should attach to the determination coefficient depends on the number of observed points. When you only incorporate two points in your investigation, this will produce an r² of 100%, because a straight line can always be drawn through two points.

5.5 Linear regression with Excel

With Excel you can determine the equation of the linear regression line by three different methods:
1. By Functions;
2. By Graphs with a dot diagram;
3. By Data analysis with Regression.

5.5.1 Linear regression with the functions SLOPE and INTERCEPT

Simple linear regression analysis is a technique to determine the linear (straight-line) relation between two quantitative variables, with the aim of predicting the size of a dependent variable (y) from the size of an independent variable (x).
This linear relation can then be described by the function:

y = ax + b

We will use the variables 'Age' and 'Income' from the file Fiction2000 as our example. You could expect that one gets a higher income as one gets older. The independent variable (x) therefore is 'Age' and the dependent variable (y) is 'Income'. To determine the regression line, the slope (a) and the intercept with the y-axis (b) have to be determined.

Working method:
1. Open the file "Fiction2000";
2. Open an empty worksheet and type Slope (a) in cell A1 and Intercept (b) in cell A2;
3. Type the following in cell A3: ="y="&ROUND(B1,2)&"*x+"&ROUND(B2,2)&"" [11]
4. The value of the slope is calculated in cell B1 with the statistical function SLOPE(y,x). In this example y is 'Income' (dependent on 'Age') and x is 'Age' (the independent variable). Select cell B1. Press Insert in the menu bar and then press Function in the scroll menu. With Or select a category, press Statistical; with Select a function, press SLOPE, and finally press OK.
With Y, select the range of the dependent variable 'Income', so Data!F2:F301.
With X, select the range of the independent variable 'Age', so Data!C2:C301.
In cell B1, the value of a will appear: 1.000109128;
5. In cell B2, the value of the intercept with the y-axis is calculated with the statistical function INTERCEPT(y,x). Select cell B2. Press Insert in the menu bar and press Function in the scroll menu. With Or select a category, press Statistical; with Select a function, select INTERCEPT. Then press OK. You will get the same input screen as with SLOPE (step 4).
With Y, select the range of the dependent variable 'Income', so Data!F2:F301.
With X, select the range of the independent variable 'Age', so Data!C2:C301.
In cell B2, the value of b will appear: -5.126899282;

[11] The mathematical function ROUND is used here to avoid too many decimals in the regression line.

6.
In cell A3, you will get the function that describes the linear relation between the income and the age: y=1*x+-5.13, or y = 1x - 5.13. An increase in age of 1 year means an increase in income of 1 * € 100.

Slope (a) | 1.000109
Intercept (b) | -5.1269
y=1*x+-5.13

Linear regression is used to make a prediction, by means of a function, for the dependent variable y (here 'Income') using the independent variable x (here 'Age'). Excel has the following function for making predictions:

FORECAST(x; known_y; known_x)

For instance, if you want to predict the income of a train traveler in the morning rush hour who is 34 years old, you can do this with Excel as follows:
1. In cell A5 type Age, and in cell A6 type Predicted income (* € 100);
2. Select cell B6. Press Insert in the menu bar and then press Function in the scroll menu. With Or select a category, press Statistical; with Select a function, press FORECAST, and finally press OK. The function's input screen appears;
3. With X, fill in B5; with known_Y, F2:F301; and with known_X, C2:C301; and press OK;
4. Select B5 and fill in 34. You will get the following result:

Slope (a) | 1.000109
Intercept (b) | -5.1269
y=1*x+-5.13
Age | 34
Predicted income | 28.87681

The age in cell B5 can be changed; the predicted value of the income in cell B6 will change at the same time.

5.5.2 Linear regression analysis by using the graph: dot diagram

The equation of the regression line, which describes the relation between 'Age' and 'Income', can also be obtained from a dot diagram.

Working method:
1. Open the file "Fiction2000";
2. Determine which one of the two variables is the dependent variable; the dependent variable is 'Income';
3. Select the values of the variable 'Income', which means Data!$F$2:$F$301;
4. Press Insert and then press Chart;
5. In Step 1 of 4: From all the different chart types, select Scatter, then select the first Subtype, and press Next;
6.
In Step 2 of 4: Press the tab Series and, with X-values, select the values of the (independent) variable 'Age', which means =Data!$C$2:$C$301, and press Next;
7. In Step 3 of 4: Press the tab Titles and fill in the titles. Press the tab Legend and uncheck the legend, then press Next;
8. In Step 4 of 4: Select the place where you would like the chart.

You will get the following chart:

[Graph 5.4: Dot diagram of income and age; horizontal axis: Age, vertical axis: Income (* € 100). Source: "Fiction2000"]

To get the regression line, press Chart in the menu bar and then press Add Trendline in the scroll menu. You can also get this by a right click on one of the dots in the chart. In the screen that appears, select Linear and press the tab Options. Check Display equation and Display R-squared value, and press OK. You will get the following result:

[Graph 5.4: Dot diagram of income and age with trendline y = 0.4935x - 2.2733 and R² = 0.6635; horizontal axis: Age, vertical axis: Income (* € 100). Source: "Fiction2000"]

R² is the determination coefficient [12]. This is a measure for the fraction of the variance of 'Income' that is explained by 'Age'. In other words, 66.35% of the variation in income can be explained by the variation in age. The remaining 33.65% of the variation in income depends on other factors that are not investigated in this case, for example education, the sector in which one works, etc.

5.5.3 Linear regression analysis by using Data analysis: Regression

A third method to obtain the regression line is Data Analysis from the menu Tools. Press the tab Data of the file Fiction2000. With Tools, Data Analysis, choose Regression.
12 The determination coefficient can also be calculated with the statistical function RSQ(known_y's; known_x's).

You will get a screen, which you should complete as follows:

The SUMMARY OUTPUT on the next page can be divided into 4 parts:
I Data for the regression: from this you can deduce, among other things, the correlation coefficient (r) and the determination coefficient (R-square).
II Variance analysis.
III Regression line: here you can see the intersection with the y-axis (b) and the slope (a) of the regression line.
IV Residuals: here you can see the predicted income for the 300 observations based on the regression line, and how far the observations lie from the regression line (the residuals).
Furthermore you will get a dot diagram with the regression line and a chart of the residuals. See below:

[Chart: residual plot for Age. X-axis: Age; y-axis: Residuals.]
[Chart: line fit plot for Age. X-axis: Age; y-axis: Income (* € 100); series: Income (* € 100) and Predicted income (* € 100).]

5.6 Correlation-Analysis
The strength of the relation between two variables in a population is generally measured by the correlation coefficient r. The meaning of r can be read from the table below:

Value of r                             Strength
-0.2 < r < 0.2                         Negligible correlation
0.2 < r < 0.4 or -0.4 < r < -0.2       Weak correlation
0.4 < r < 0.7 or -0.7 < r < -0.4       Average correlation
0.7 < r < 0.9 or -0.9 < r < -0.7       Strong correlation
0.9 < r < 1 or -1 < r < -0.9           Very strong correlation
r = 1 or r = -1                        Entire correlation

A positive correlation means that an increase of the independent (explanatory) variable goes together with an increase of the dependent (explained) variable. A negative correlation means that an increase of the independent variable goes together with a decrease of the dependent variable. This corresponds to a positive or a negative slope of the regression line, respectively. With entire correlation, all points of the dot diagram lie on the regression line.
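The classes of the table can be written down as a small lookup. The Python sketch below is one possible reading: the function name correlation_strength is hypothetical, and because the table uses strict inequalities and leaves the boundary values open, the sketch assumes that each class includes its lower bound.

```python
def correlation_strength(r):
    """Classify a correlation coefficient r using the classes of the table above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)
    # Assumption: each class includes its lower bound (the table leaves this open)
    if size == 1.0:
        strength = "entire"
    elif size >= 0.9:
        strength = "very strong"
    elif size >= 0.7:
        strength = "strong"
    elif size >= 0.4:
        strength = "average"
    elif size >= 0.2:
        strength = "weak"
    else:
        # Below 0.2 in absolute value the direction is not reported
        return "negligible correlation"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"


print(correlation_strength(0.8289))
print(correlation_strength(-0.35))
```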
The strength of the correlation is, of course, also determined by the reliability of r, that is, by the number of points in the dot diagram.

5.6.1 Correlation-analysis with Excel
To calculate the correlation coefficient (r), Excel offers the statistical function CORREL. Unlike in linear regression analysis, it does not matter for the calculation of the correlation coefficient which variable you select as independent and which as dependent. To calculate the correlation coefficient in the example of the ages and incomes, you can choose Data!C2:C301 for matrix1 and Data!F2:F301 for matrix2. This will produce an r-value of 0.8289 (strong positive correlation). Switching the two matrices has no influence on the value of r.
When you execute a simple linear regression analysis by using a dot diagram, the correlation coefficient is not displayed, but it can be calculated by taking the square root of the determination coefficient; r has the same sign as the slope. With regression analysis by Data analysis, the correlation coefficient is displayed. See section 5.5.3.

5.7 Linear regression with time series
Linear regression is also often applied in time series analysis, where time (for instance the year) is taken as the independent variable. Take the data on the turnover of national transport over the years 1983-1996 as an example. The Dutch Railways expect a linear relation between the years (independent variable) and the achieved turnover (dependent variable). To find this linear relation you can construct a dot diagram.
Working method:
1. Open the file “N.S. Traveler Transport”;
2. Press Insert and press Chart (or the chart icon);
3. In Step 1 of 4: Choose Scatter from all the different chart types, then select the first Subtype and press Next;
4. In Step 2 of 4: With Data range, select =Data!$D$2:$D$15 (the dependent variable 'Turnover'). Press the tab Series and select =Data!$A$2:$A$15 as X-values (the independent variable 'Year') and press Next;
5.
In Step 3 of 4: Fill in the titles, uncheck Show legend and press Next;
6. In Step 4 of 4: Select As new sheet (Chart1), and press Finish;
7. Right-click one of the dots, press Add Trendline, select Linear from the types, and press the tab Options. With Options, check Display equation on chart and Display R-squared value on chart, and press OK.

[Graph 5.5: Dot diagram of the turnover of the N.S. over the years 1983-1996, with the linear trendline y = 87.27x - 172265 and R2 = 0.9378. X-axis: Year; y-axis: Turnover (* 1 million). Source: Dutch Railways N.V.]

By using the statistical function TREND(known_y's; known_x's; new_x's; const) you can make predictions for several coming years at once, based on the (linear) regression line. When, for instance, you want to predict the turnover of national transport for the years 1997 until 2002, first insert these years in the cells A16:A21 (so 1997 in A16, 1998 in A17, and so on). Now select the cells in which you would like the predicted turnover to appear, for example D16:D21. Press Insert on the menu bar and press Function in the scroll menu. Press Statistical and then press TREND. With known_Y fill in D2:D15, with known_X fill in A2:A15 and with new_X fill in A16:A21. With Const you can fill in nothing, or TRUE. See the following screen: Do not press OK, but simultaneously press the Ctrl, Shift and Enter keys.
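What TREND computes can be sketched outside Excel. The Python snippet below fits the least-squares line to the 1983-1996 turnover figures from the file “N.S. Traveler Transport”, as listed in the result table of this section, and extends it to 1997-2002. The function name trend is a hypothetical stand-in that mirrors the argument order of the Excel function; it is not Excel's implementation.

```python
def trend(known_ys, known_xs, new_xs):
    """Predict y for each new x from the least-squares line through the known points."""
    n = len(known_xs)
    mean_x = sum(known_xs) / n
    mean_y = sum(known_ys) / n
    sxx = sum((x - mean_x) ** 2 for x in known_xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(known_xs, known_ys))
    slope = sxy / sxx
    # Working relative to the means avoids a large intersection term for year values
    return [mean_y + slope * (x - mean_x) for x in new_xs]


years = list(range(1983, 1997))
# Turnover national transport (* 1 million), 1983-1996, from the table in this section
turnover = [935, 970, 1022, 1033, 1072, 1126, 1184,
            1285, 1449, 1553, 1631, 1755, 1986, 2033]

new_years = list(range(1997, 2003))
forecast = trend(turnover, years, new_years)
for year, value in zip(new_years, forecast):
    print(year, round(value))
```

Rounding inside the loop gives whole numbers directly, which is how predicted turnover figures are usually reported.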
You will get the following result:

Year    Number of trips    Kilometers traveled    Turnover national transport
1983    200                8886                   935
1984    203                8997                   970
1985    206                9007                   1022
1986    210                8919                   1033
1987    222                9396                   1072
1988    230                9664                   1126
1989    240                10162                  1184
1990    256                11060                  1285
1991    330                15195                  1449
1992    333                14980                  1553
1993    320                14788                  1631
1994    312                14439                  1755
1995    305                13977                  1986
1996    306                14091                  2033
1997                                              2014.098901
1998                                              2101.369231
1999                                              2188.63956
2000                                              2275.90989
2001                                              2363.18022
2002                                              2450.450549

Obviously, it is better to round the values in the cells D16:D21 to whole numbers.

5.8 Non linear regression models
In all the relations described so far, a straight-line (linear) relation between two variables was presumed. Often, however, you deal with a non-linear relation, for instance the chart of the product life cycle (introduction, growth, maturity, saturation, decline), where the turnover depends on time. This chart is certainly not a straight line; it shows an obvious polynomial character. With Excel it is relatively simple to find non-linear regression models (polynomial, exponential, logarithmic, power) from a dot diagram. When adding the trend line you then have to choose a different Type. As an example, a polynomial relation between the variables 'Age' and 'Income', taken from the data of the file “Fiction2000”, is shown below.

[Graph 5.6: Dot diagram of age and income from 300 respondents, with the polynomial trendline y = -0.0133x2 + 1.4551x - 17.444 and R2 = 0.741. X-axis: Age; y-axis: Income (* € 100). Source: Fiction2000]

When you compare the determination coefficient of this polynomial model (0.741) with that of the linear model (0.6635), you will conclude that the polynomial model gives a better description of the relation between the variables 'Age' and 'Income'.

5.9 Predictions
The value of a prediction depends on:
1.
Whether you are predicting for a value that lies within the range of the x-values, so between the lowest and the highest observation of x (interpolating), or for a value that lies outside this range (extrapolating). Interpolating can provide a good prediction. Extrapolating, especially when the value lies far outside the range, generally does not. So predicting the income of a 45-year-old train traveler in the morning rush hour (interpolating) based on the regression line is more reliable than predicting the income of a 70-year-old train traveler in the morning rush hour (extrapolating).
2. The determination coefficient (R2). The closer it lies to 1, the more accurate the prediction is.
3. The number of points in the dot diagram. In the example the regression line is based on 300 observations (points). If you determine the regression line from, for instance, only the first 10 respondents (points), you will get a higher determination coefficient, but the 95% confidence intervals for the intersection with the y-axis and for the slope become considerably wider.

Number of        Determination    Intersection    Intersection    Slope         Slope
observations     coefficient      lowest 95%      highest 95%     lowest 95%    highest 95%
300              0.663            -3.674          -0.873          0.453         0.534
10               0.831            -14.673         1.701           0.419         0.905

5.10 Assignments
1. One wants to investigate a possible correlation between the variable 'Income' and the variable 'Traveling time' by using the data of the file Fiction2000. The expectation is that people who have a higher income are prepared to travel further for their job.
b. Investigate the correlation between 'Income' and 'Traveling time'.
c. Determine the linear regression line that describes the relation between the (independent) variable 'Income' and the (dependent) variable 'Traveling time'.
d. What traveling time would you expect for someone with an income of 20 (* € 100) based on the regression line?
e.
How do you feel about the reliability of the prediction made in part d?
2. One wants to investigate a possible correlation between the variable 'Kilometers traveled' and the variable 'Turnover national transport' by using the data over the years 1983 until 1996 of the file “N.S. Traveler Transport”. The expectation is that when the number of traveler kilometers rises, the turnover will rise as well.
a. Determine the correlation between 'Kilometers traveled' and 'Turnover national transport'.
b. Determine the linear regression line that describes the relation between the variable 'Kilometers traveled' and the variable 'Turnover national transport'.
c. Based on the correlation between the variables 'Year' and 'Kilometers traveled', one expects travelers to travel 17,633 (* 1 million) kilometers with NS in the year 2000. What turnover can be expected for 2000, based on the linear relation between the variables 'Kilometers traveled' and 'Turnover national transport'?
3. One wants to predict the number of trips made with the N.S. in the years 1997 until 2000, based on the possible linear relation between the variable 'Year' and the variable 'Number of trips' in the data over the years 1983 until 1996 from the file “N.S. Traveler Transport”.
a. Determine the correlation between 'Year' and 'Number of trips'.
b. Determine the linear regression line that describes the relation between the variable 'Year' and the variable 'Number of trips'.
c. Make a prediction based on the linear relation between 'Year' and 'Number of trips' for the years 1997 up to and including 2000.
d. What comments can you make concerning the prediction?

The cost of capital of investment project i is determined by:
ri = rf + βi (rm - rf)
where:
rf = risk free rate, the fee for risk free deposits like government bonds.
βi = the systematic risk of the investment
rm = profitability of the market portfolio
The systematic risk βi shows to what extent the required profitability of an investment reacts to a change in the profitability of the market portfolio. Apply this to the following situation:
4. The profitability of the share Superinfest compared to the market data is presented in the table below:

Profitability of the market portfolio and the share Superinfest
Month    Market portfolio in %    Superinfest in %
1        11                       13
2        9                        12
3        11                       11
4        14                       13
5        16                       18
6        12                       15
7        8                        10
8        10                       13
9        7                        10
10       11                       14
11       6                        10
12       4                        8
13       5                        7
14       8                        8
15       10                       9

Furthermore it is given that the risk free rate equals 7%.
a. Determine the systematic risk βi, that is, the slope of the regression line when the profitability of Superinfest is set against that of the market portfolio. Note: what is the dependent variable (the y-axis) and what is the independent variable (the x-axis)?
b. A project is started when the profitability of the market portfolio equals 12%. What is the cost of capital of the project?
c. Same scenario, but the profitability of the market portfolio equals 9%. What is the cost of capital of the project?
d. Same scenario, but the profitability of the market portfolio equals 6%. What is the cost of capital of the project?
e. What is the difference between the cost of capital at a market portfolio profitability of 12%, 9% and 6%? What will the decision of the management be?

Annex: Answers

Answers Chapter 2: Characteristics
1. Income
Arithmetic means: Mathematical average 13.9; Median 14; Mode 18
Measures of dispersion: Quartile distance 10; Standard deviation 7.5
Form measures: Skewness 0.11; Kurtosis -0.42

2. Traveling time
First sort the data in Fiction2000 by gender.
Arithmetic mean (Mathematical average, Median, Mode; Male, Female): 39.8 35.7 35 35 23 22
Measures of dispersion (Quartile distance, Average absolute deviation, Standard deviation): 32 20.2 28 14.4 26.2 18.1
Form measures (Skewness, Kurtosis): 1.18 1.56 0.56 -0.13

3.
Number of travel days by train per week

Arithmetic means                Male    Female
Mathematical average            4.4     4.1
Median                          5       4
Mode                            5       4

Measures of dispersion          Male    Female
Minimum                         1       2
Maximum                         5       5
Variance width                  4       3
First quartile                  4       4
Third quartile                  5       5
Quartile distance               1       1
Average absolute deviation      0.7     0.7
Standard deviation              0.9     0.9
Variance                        0.7     0.7
Variance coefficient            19%     21%

Form measures (Skewness, Kurtosis): -1.69 2.94 -0.56 -0.54

4. Traveling time

Traveling time by      Number    Class       Cum.        f*midpoint    f*(midpoint - average)^2
train (in min.)        (f)       midpoint    relative
0 -< 20                68        10          22.67%      680           62430.12
20 -< 30               56        25          41.33%      1400          13109.04
30 -< 40               57        35          60.33%      1995          1601.13
40 -< 50               25        45          68.67%      1125          552.25
50 -< 60               45        55          83.67%      2475          9724.05
60 -< 100              38        80          96.33%      3040          59891.42
100 -< 150             11        125         100.00%     1375          78914.99
Total                  300                               12090         226223

Mathematical average      40.3
Median (30+)              34.6
First quartile (20+)      21.3
Third quartile (50+)      54.2
Quartile distance         33.0
Variance                  756.6
Standard deviation        27.5

5. Risks of projects:
                      P1                 P2                 P3
Weighted average      € 74,000           € 53,000           € 53,000
Variance              € 3,684,000,000    € 1,281,000,000    € 201,000,000
Standard deviation    € 60,696           € 35,791           € 14,177
The first project could bring in the most, but the revenues are not guaranteed. It could just as well bring in € 130,000 as € 14,000. The other two projects bring in the same amount of money; the only difference is that the risk is smaller for the third project. If the risk should be as low as possible, the third project would be the favorite.

6. Risks of projects:
                      P1                 P2                 P3
Weighted average      € 62,000           € 60,000           € 60,000
Variance              € 2,676,000,000    € 3,775,000,000    € 5,488,000,000
Standard deviation    € 51,730           € 61,441           € 74,081
P2 and P3 bring in the same amount of money, but the uncertainty of P3 is bigger because of the larger standard deviation. As far as risk management is concerned, a combination of P1 and P2 is the favorite.
Answers Chapter 3: Probabilities

Count of Daily paper
Daily paper          Total
Algemeen Dagblad     8.3%
NRC                  4.0%
Telegraaf            16.7%
Trouw                6.3%
Volkskrant           20.0%
Other                17.7%
None                 27.0%
Total                100.0%

Count of Daily paper, by Gender
Daily paper          Male      Female    Total
Algemeen Dagblad     6.0%      2.3%      8.3%
NRC                  3.0%      1.0%      4.0%
Telegraaf            14.7%     2.0%      16.7%
Trouw                5.3%      1.0%      6.3%
Volkskrant           13.3%     6.7%      20.0%
Other                12.3%     5.3%      17.7%
None                 19.7%     7.3%      27.0%
Total                74.3%     25.7%     100.0%

Answer 1a Answer 1b Count of age Age2 < 20 >= 20 Total Answer 1c 20.0% Total 15.0% 85.0% 100.0% 85.0% 14.7% Count of income (* € 100) Income (* € 100) 2 < 15 >= 15 Grand total Answer 1d 33.0% Count of daily paper Daily paper Total Algemeen Dagblad 10.7% NRC 2.0% Telegraaf 25.3% Trouw 3.3% Volkskrant 7.3% Other 12.0% None 39.3% Total 100.0% Count of opinion Spits Answer 2a Answer 2b Answer 2c Count of age Opinion Spits Very bad Bad Reasonable Good Very good Total 10.7% 36.0% Age2 < 30 >= 30 4.0% 6.0% 21.3% 23.3% 8.7% 63.3% Gender Male Female Total 33.0% 17.7% 50.7% 41.3% 8.0% 49.3% 74.3% 25.7% 100.0% Opinion Spits Very bad Bad Reasonable Good Very good Total 1 8.0% 11.3% 34.0% 34.7% 12.0% 100.0% 19.3% Total 4.0% 5.3% 12.7% 11.3% 3.3% 36.7% Total 8.0% 11.3% 34.0% 34.7% 12.0% 100.0% Answer 2d 11.3%

3.
a. P(Metro | Spits) = 73 / 150 = 48.7%
b. P(Spits | Metro) = 73 / 163 = 44.8%
c. P(Metro and Spits) = 73 / 300 = 24.3%
d. P(Metro or Spits) = (300 - 60) / 300 = 80.0%, or 163 / 300 + 150 / 300 - 73 / 300 = 240 / 300 = 80.0%
e. No, because P(Metro) * P(Spits) = 163 / 300 * 150 / 300 = 27.2%, and this is not equal to the answer to c, P(Metro and Spits) = 24.3%.

7.
a. P(Volkskrant | Female) = 20 / 77 = 26.0%
b. P(Volkskrant | Male) = 40 / 223 = 17.9%
c. P(Volkskrant) = 60 / 300 = 20%
d. No, the percentage of females having a subscription to the Volkskrant is higher than the percentage for males.
P(Volkskrant) * P(Female) = 60 / 300 * 77 / 300 = 5.1%. This differs from P(Volkskrant and Female) = 20 / 300 = 6.7%.

Answers Chapter 4: Continuous probability distribution
1.
a. P(X < 90) = 0.0013
b. P(X > 135) = 0.0668
c. P(105 < X < 135) = 0.8664
d. 126.7 minutes
2.
a. P(X < 0.0833) = 0.8529
f. P(X > 0.1667) = 0.0216
g. P(0.050 < X < 0.1167) = 0.2483
3.
a. P(X < 90) = 0.0228
b. P(X > 140) = 0.0912
c. P(120 < X < 150) = 0.4772
d. 144.7 seconds
4.
a. P(X < 0.0500) = 0.5934
b. P(X > 0.0833) = 0.2231
c. P(0.0333 < X < 0.0667) = 0.2476
5.
a. P1: P(X < 0.10) = 0.0062 = 0.62%
b. P2: P(X < 0.10) = 0.2023 = 20.23%
c. P1: profitability = 16.68%; P2: profitability = 20.05%
d. P1 is preferable.

Answers Chapter 5: Singular linear regression- and correlation analysis
5.
a. Correlation (r) = 0.4952
b. y = 1.6073x + 16.51
c. Traveling time is 49 minutes
g. We speak of average correlation (0.4 < r < 0.7), while the number of observation points is fairly large (n = 300). So the reliability of the prediction is average.

[Graph 5.1: Dot diagram of income and traveling time, with the linear trendline y = 1.6073x + 16.51 and R2 = 0.2452. X-axis: Income (* € 100); y-axis: Traveling time (minutes). Source: Fiction2000]

6.
a. Correlation (r) = 0.87
b. y = 0.1224x - 70.398
c. 2088 (* 1 million)

[Graph 5.4: Dot diagram of kilometers traveled and turnover, with the linear trendline y = 0.1224x - 70.398 and R2 = 0.7569. X-axis: Kilometers traveled (* 1 million); y-axis: Turnover (* 1 million). Source: Nederlandse Spoorwegen]

7.
a. Correlation (r) = 0.89
b. y = 11.198x - 22016
c.
Year    Number of trips (* 1 million)
1997    346
1998    358
1999    369
2000    380
d. The introduction of the OV-year card for students in 1991 causes a break in the trend compared with the years before. It is better to execute the regression analysis on the data from 1991 onwards. However, you will then have only a few data points at your disposal.
[Graph 5.5: Dot diagram of year and number of trips, with the linear trendline y = 11.198x - 22016 and R2 = 0.8008. X-axis: Year; y-axis: Number of trips (* 1 million). Source: Nederlandse Spoorwegen N.V.]

8.
a. y = 0.7894x + 3.927, so the systematic risk βi equals 0.7894.
b. When the profitability of the market portfolio equals 12%, the cost of capital ri = 7% + 0.7894 * (12% - 7%) = 10.95%.
c. When the profitability of the market portfolio equals 9%, the cost of capital ri = 7% + 0.7894 * (9% - 7%) = 8.58%.
d. When the profitability of the market portfolio equals 6%, the cost of capital ri = 7% + 0.7894 * (6% - 7%) = 6.21%.
e. The profitability of the company is lower than the market one. A larger market profitability gives a smaller profitability of Superinfest. The investments aren't that super after all. With a profitability of 6% it is better to invest in bonds.
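The calculation behind answer 8 can be retraced with a short Python sketch. It assumes that the flattened table of assignment 4 lists first the fifteen market portfolio values and then the fifteen Superinfest values, and it evaluates ri = rf + βi (rm - rf) entirely in percentage points. The function names slope_and_intersection and cost_of_capital are hypothetical.

```python
def slope_and_intersection(xs, ys):
    """Least-squares slope and intersection of y regressed on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cost_of_capital(risk_free, beta, market_return):
    """ri = rf + beta * (rm - rf), with every rate expressed in percent."""
    return risk_free + beta * (market_return - risk_free)


# Monthly profitability in %, months 1-15 (column order is an assumption, see lead-in)
market = [11, 9, 11, 14, 16, 12, 8, 10, 7, 11, 6, 4, 5, 8, 10]
superinfest = [13, 12, 11, 13, 18, 15, 10, 13, 10, 14, 10, 8, 7, 8, 9]

# Market portfolio is the independent variable (x), Superinfest the dependent one (y)
beta, intersection = slope_and_intersection(market, superinfest)
print(f"beta = {beta:.4f}, intersection = {intersection:.3f}")

for rm in (12, 9, 6):
    print(f"rm = {rm}%: cost of capital = {cost_of_capital(7, beta, rm):.2f}%")
```

Keeping every rate in the same unit matters here: mixing 7% written as 0.07 with a risk premium in percentage points would shift each result by almost 7 points.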