Qualitative Variables in a Regression Model using Dummy Variables



Differences in Two Population Means
Differences Among More Than Two Means
Mixtures of Quantitative and Qualitative Independent Variables
1. Differences in Two Population Means
1.1 Ways to express the values of two means
o Two means can be expressed as two values or as one mean and the difference
between the means.
o Example: You are measuring the starting pay of two majors. The average
starting pay of major 1 is $25 per hour and the average starting pay of major
2 is $15 per hour. This could also be expressed as: the average of major 1 is $25,
and when you go from major 1 to major 2 the mean is reduced by $10.
1.2 Using regression to model these values.
o The intercept is the average value of y when X = 0 and the slope is the change
in the mean of y when X increases by 1.
o In the example above:
The intercept would be the average starting salary of major 1 and the slope is the
change in the average starting salary when you go from major 1 to major 2.
The following population model can be used to express this:
Mean of Y = β₀ + β₁X = $25 − $10X
where X = 0 for major 1 and X = 1 for major 2
1.3 Interpretation of the coefficients
o β₀ = μ₁, which is the average value of y for the first population
o β₀ + β₁ = μ₂, which is the average value of y for the second population
o β₁ = μ₂ − μ₁, the mean of the second population minus the mean of the first
o Example:
$25 is the average starting salary for major 1
$25 − $10 = $15 is the average starting salary for major 2
−$10 is the average starting salary of the second major minus the average
starting salary of the first major
1.4 Estimates
o Least squares equation
ŷ = b₀ + b₁X
X = 0 or 1
b₀ is the estimated mean of the first population (the first sample mean)
b₁ is the estimated difference in means (second sample mean minus the first)
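The claim that the least squares intercept and slope equal the first sample mean and the difference in sample means can be checked numerically. The following is a minimal sketch using NumPy; the two samples are made-up numbers chosen only for illustration and are not data from these notes.

```python
import numpy as np

# Made-up samples for illustration only (not data from the notes).
group1 = np.array([24.0, 26.0, 25.0, 27.0, 23.0])   # population 1, X = 0
group2 = np.array([14.0, 16.0, 15.0, 17.0])         # population 2, X = 1

y = np.concatenate([group1, group2])
x = np.concatenate([np.zeros(len(group1)), np.ones(len(group2))])

# Design matrix: a column of ones (intercept) and the dummy variable.
design = np.column_stack([np.ones_like(x), x])
(b0, b1), *_ = np.linalg.lstsq(design, y, rcond=None)

print(f"b0 = {b0:.3f}, first sample mean = {group1.mean():.3f}")
print(f"b1 = {b1:.3f}, difference in sample means = "
      f"{group2.mean() - group1.mean():.3f}")
```

The two printed pairs should agree up to rounding, whatever the sample sizes are.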
o Example:
Suppose you wanted to compare the average time spent by males and females
watching a particular cable channel. You found the following least squares line:
yˆ  6  2 X
where X = 0 for males and X = 1 for females.
The sample intercept is the __________________________________________________
while the sample slope is the _________________________________________________
__________________________________________________
1.5 Inferences:
o Requires the same assumptions and has the same degrees of freedom as simple
linear regression
o The test and confidence interval for the slope are identical in value and meaning to
the test and confidence interval for the difference in means (independent samples
case) found in an earlier chapter.
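As a rough numerical check of this equivalence, the sketch below computes the t statistic for the slope from the regression and compares it with the two-sample t statistic. It assumes the independent-samples test meant here is the pooled-variance (equal variances) version, which is the one that shares the n1 + n2 − 2 degrees of freedom; the data are made up for illustration.

```python
import numpy as np

group1 = np.array([5.0, 7.0, 6.0, 8.0, 4.0])   # X = 0 (made-up data)
group2 = np.array([9.0, 8.0, 10.0, 7.0])       # X = 1 (made-up data)
n1, n2 = len(group1), len(group2)

# Regression of y on the dummy variable.
y = np.concatenate([group1, group2])
x = np.concatenate([np.zeros(n1), np.ones(n2)])
design = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

residuals = y - design @ coef
df = n1 + n2 - 2                                # same as simple linear regression
mse = residuals @ residuals / df
cov = mse * np.linalg.inv(design.T @ design)    # estimated covariance of (b0, b1)
t_slope = coef[1] / np.sqrt(cov[1, 1])

# Pooled-variance two-sample t statistic (second mean minus first).
pooled_var = ((n1 - 1) * group1.var(ddof=1)
              + (n2 - 1) * group2.var(ddof=1)) / df
t_pooled = (group2.mean() - group1.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

print(f"t for slope      = {t_slope:.6f}")
print(f"pooled two-sample t = {t_pooled:.6f}")
```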
2. More than two means
2.1 Number of differences
o Two means require one mean and one difference; i.e., one dummy variable
o Three means require one mean and two differences; i.e., two dummy variables
o k means require one mean and k − 1 differences; i.e., k − 1 dummy variables
o Example: the average number of sick days per month is 4 for type 1 workers, 6 for type 2,
and 1 for type 3. This can also be expressed as:
The mean for the first type of worker is 4,
when you go from the first worker population to the second the mean increases by 2, and
when you go from the first to the third the mean decreases by 3.
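The two ways of writing the same three means can be converted into each other mechanically. A small sketch in plain Python, using the sick-day means from the example; the dictionary keys and variable names are illustrative only.

```python
# Means from the example: sick days per month by worker type.
means = {"type 1": 4.0, "type 2": 6.0, "type 3": 1.0}

# Express them as one baseline mean plus k - 1 differences from the baseline.
baseline_label = "type 1"
baseline = means[baseline_label]
differences = {
    label: mean - baseline
    for label, mean in means.items()
    if label != baseline_label
}

# Going back: baseline plus each difference recovers the original means.
recovered = {baseline_label: baseline}
recovered.update({label: baseline + diff for label, diff in differences.items()})

print("baseline:", baseline)
print("differences:", differences)
print("recovered means:", recovered)
```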
2.2 Modeling using dummy variables
2.2.1 Notation and interpretation for three means
o The intercept is the average value of y when X₁ = 0 and X₂ = 0, the first slope is the
change in the mean of y when X₁ increases by 1, and the second slope is the change in
the mean of y when X₂ increases by 1.
E(y) = β₀ + β₁X₁ + β₂X₂
β₀ = μ₁, which is the average value of y for the first population
β₀ + β₁ = μ₂, which is the average value of y for the second population
β₀ + β₂ = μ₃, which is the average value of y for the third population
β₁ = μ₂ − μ₁, the mean of the second population minus the mean of the first
β₂ = μ₃ − μ₁, the mean of the third population minus the mean of the first
o In the example above, the following population model can be used to express this:
Mean days absent = β₀ + β₁X₁ + β₂X₂ = 4 + 2X₁ − 3X₂
where
X₁ = 1 indicates the second type of worker and 0 if not
X₂ = 1 indicates the third type of worker and 0 if not
Mean for worker 1 is ____ (neither worker 2 nor 3)
Mean for worker 2 is _______ (worker 2 and not 3)
Mean for worker 3 is _______ (worker 3 and not 2)
Differences in means =
2.2.2 Notation and interpretation for k means
o The intercept is the average value of y when all k − 1 dummy variables are zero, and
the slope of the ith (i = 1, 2, …, k − 1) dummy variable is the change in the mean of y
when Xᵢ increases by 1.
E(y) = β₀ + β₁X₁ + β₂X₂ + … + βₖ₋₁Xₖ₋₁
β₀ = μ₁, which is the average value of y for the first population
β₀ + βᵢ = μᵢ₊₁, which is the average value of y for the (i + 1)th population
βᵢ = μᵢ₊₁ − μ₁, the mean of the (i + 1)th population minus the mean of the first
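The k-means coding can be illustrated with a small NumPy sketch that builds the k − 1 dummy columns and checks that the fitted intercept is the first group's sample mean and each slope is a difference from it. The helper name `dummy_design`, the level labels, and the data are all hypothetical, chosen only for this example.

```python
import numpy as np

def dummy_design(labels, levels):
    """Intercept column plus one 0/1 column for each level after the first."""
    labels = np.asarray(labels)
    columns = [np.ones(len(labels))]
    for level in levels[1:]:
        columns.append((labels == level).astype(float))
    return np.column_stack(columns)

# Made-up data: k = 4 levels, level "A" is the baseline (first population).
levels = ["A", "B", "C", "D"]
labels = ["A", "A", "A", "B", "B", "C", "C", "C", "D", "D"]
y = np.array([4.0, 5.0, 3.0, 6.0, 7.0, 1.0, 2.0, 0.0, 9.0, 8.0])

design = dummy_design(labels, levels)            # shape (n, k): 1 + (k - 1) columns
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

labels_arr = np.asarray(labels)
group_means = np.array([y[labels_arr == level].mean() for level in levels])

print("b0 and baseline mean:", coef[0], group_means[0])
print("slopes:              ", coef[1:])
print("mean differences:    ", group_means[1:] - group_means[0])
```

Changing which level comes first in `levels` changes the baseline, and with it the meaning of every slope, but not the fitted group means.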
2.3 Inferences
o Requires the same assumptions and uses the same degrees of freedom as a regression
model with k − 1 variables
o The F test for regression tests the null hypothesis that all the slope coefficients are
zero. Here, if all the slope coefficients are zero, then all the means are equal.
o A t-test or a confidence interval for βᵢ makes inferences about the difference between
the mean of the level indicated by the ith dummy variable and the mean of the first level.
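To see why the regression F test is a test of equal means, the sketch below computes the regression F statistic from a dummy-variable fit and the usual one-way analysis of variance F statistic (between-group over within-group mean squares) on the same made-up data. The comparison with one-way ANOVA is an illustration added here, not something stated in these notes.

```python
import numpy as np

# Made-up samples for k = 3 groups.
groups = [
    np.array([4.0, 5.0, 3.0, 6.0]),
    np.array([6.0, 7.0, 8.0, 5.0]),
    np.array([1.0, 2.0, 0.0, 3.0]),
]
k = len(groups)
n = sum(len(g) for g in groups)
y = np.concatenate(groups)

# Design matrix: intercept plus k - 1 dummies (first group is the baseline).
columns = [np.ones(n)]
start = len(groups[0])
for g in groups[1:]:
    col = np.zeros(n)
    col[start:start + len(g)] = 1.0
    columns.append(col)
    start += len(g)
design = np.column_stack(columns)

coef, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ coef

# F test for regression: k - 1 and n - k degrees of freedom.
ss_regression = np.sum((fitted - y.mean()) ** 2)
ss_error = np.sum((y - fitted) ** 2)
f_regression = (ss_regression / (k - 1)) / (ss_error / (n - k))

# One-way ANOVA F computed directly from the group means.
ss_between = sum(len(g) * (g.mean() - y.mean()) ** 2 for g in groups)
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)
f_anova = (ss_between / (k - 1)) / (ss_within / (n - k))

print(f"F from regression: {f_regression:.6f}")
print(f"F from ANOVA:      {f_anova:.6f}")
```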
Example: Suppose you wanted to compare the average time spent by adult males,
adult females, and children watching a particular cable channel. From a sample of
30, you found the following least squares line ŷ = 4.7 + 1.5X₁ + 4X₂, where X₁ = 1 if
males and 0 otherwise and X₂ = 1 if children and 0 otherwise.
The slope estimate of 1.5 could be interpreted as:
The slope estimate of 4 could be interpreted as:
Additionally, using multiple regression to test all the coefficients, you found an F
test value of 4.5. Complete the following hypothesis test:
H0
H1
Test Statistic
Rejection Region
Conclusion:
3. Mixtures of Quantitative and Qualitative Variables
Consider the following: Y = time spent watching a cable channel, X₁ = total time
spent on all channels during the same time period, and category (adult males, adult
females, and children) is a qualitative variable.
Examine the following model:
E(y) = β₀ + β₁X₁ + β₂X₂ + β₃X₃
where
X₂ = 1 if males and 0 otherwise
X₃ = 1 if children and 0 otherwise

What is the equation for males?
What is the equation for children?
What is the equation for females?
What is the interpretation of β₁?
How would you test for the effect of category?
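One common way to explore a model of this form is to fit it with and without the category dummies and compare the two fits with a partial (nested-model) F statistic. The sketch below does this with NumPy on simulated data. The choice of the partial F test is an assumption about what is intended here, and the coefficients used to simulate the data are hypothetical values picked only to generate an example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_group = 20
categories = np.repeat(["female", "male", "child"], n_per_group)

x1 = rng.uniform(5, 40, size=categories.size)    # total viewing time
x2 = (categories == "male").astype(float)        # 1 if male, 0 otherwise
x3 = (categories == "child").astype(float)       # 1 if child, 0 otherwise

# Hypothetical coefficients, used only to simulate example data.
y = 1.0 + 0.2 * x1 + 1.5 * x2 + 4.0 * x3 + rng.normal(0, 1, size=categories.size)

def fit_sse(design, response):
    """Least squares coefficients and the sum of squared residuals."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    resid = response - design @ coef
    return coef, float(resid @ resid)

ones = np.ones_like(x1)
full = np.column_stack([ones, x1, x2, x3])       # with category dummies
reduced = np.column_stack([ones, x1])            # without category dummies
coef_full, sse_full = fit_sse(full, y)
_, sse_reduced = fit_sse(reduced, y)

q = 2                                            # dummy coefficients being tested
df_error = len(y) - full.shape[1]
f_partial = ((sse_reduced - sse_full) / q) / (sse_full / df_error)

# Each category gets its own intercept; the slope on X1 is shared.
b0, b1, b2, b3 = coef_full
intercepts = {"female": b0, "male": b0 + b2, "child": b0 + b3}

print("fitted intercepts by category:", intercepts)
print("common slope on X1:", b1)
print(f"partial F = {f_partial:.3f} on {q} and {df_error} degrees of freedom")
```

Because the dummies only shift the intercept, the three fitted lines are parallel; the statistic would be compared with an F distribution on 2 and n − 4 degrees of freedom.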
4. Examples from Bureau of Labor Statistics:
Pricing of College Textbooks
http://www.bls.gov/cpi/cpictb.htm
Pricing of Microwave ovens
http://www.bls.gov/cpi/cpimwo.htm
Creating Occupational Pay relatives
http://www.bls.gov/news.release/ncspay.tn.htm