The Analysis of 2 × K Contingency Tables
with Different Statistical Approaches
Hassan Salah M.
Thebes Higher Institute for Management
and Information Technology
drhassn_242@yahoo.com
Abstract
The main objective of this paper is to analyze the 2 × K contingency tables with
three statistical approaches (regression analysis, multinomial logistic regression analysis
and linguistic fuzzy model). We compare these methods for evaluating the association
between a risk factor and a disease. These statistical methods measure the association
between the numeric levels of a risk factor and a disease in different ways. They have
been applied to a set of data of childhood cancer risk from prenatal x-ray exposure.
Regression and multinomial logistic regression analyses show similar results for a data
set of 16226 children whereas the fuzzy analysis yields a different result.
Keywords
Contingency table, Multinomial logistic regression, Linguistic fuzzy model, Data of
childhood cancer, X-ray exposure.
1. Introduction
The 2 × K contingency table is an important extension of the 2 × 2 table, which is a
basic tool for epidemiological investigation. In a 2 × K contingency table, the presence or
absence of a disease is recorded at K levels of a risk factor. The 2 × K contingency
table can be viewed from the perspective of a K-level variable (risk factor) or from the
perspective of a binary variable (disease) [4]. In this paper, we use three different
statistical approaches for analyzing the 2 × K contingency table: regression analysis,
multinomial logistic regression analysis and a linguistic fuzzy model.
Data on malignancies in children under 10 years of age and information on the
mother's exposure to x-ray provide an example for the discussion and analysis of a
2 × K table [2], [3]. Table 1 shows the numbers of prenatal x-rays received by
mothers of children with a malignant disease and by mothers of a series of controls
(healthy children of the same age, sex, and similar areas of residence).
Table 1
Observed numbers of cases and controls by recorded
number of maternal x-ray films during pregnancy

Films    Cases (Y = 0)    Controls (Y = 1)    Total    Proportion of cases
0        7332             7673                15005    .489
1        287              239                 526      .546
2        199              154                 353      .564
3        96               65                  161      .596
4        59               28                  87       .678
≥5*      65               29                  94       .691
Total    8038             8188                16226

* For simplicity, the values greater than five were coded as 5.
2. Regression Analysis
A 2 × K contingency table can be viewed as a set of K pairs of values. An
estimated probability is generated for each value of X, producing K pairs $(x_j, \hat{p}_j)$,
where $\hat{p}_j$ is the estimated probability that Y = 0 associated with each level represented
by $x_j$. In order to analyze the K pairs of values, a straight line which summarizes the
relationship between X and Y is estimated, and the slope of the estimated line is used
as a summary of the relationship between X and Y. For a simple linear regression, three
quantities are necessary to derive the basic statistical measures: the sum of squares for
X ($S_{xx}$), the sum of squares for Y ($S_{yy}$), and the sum of cross-products for X and Y
($S_{xy}$). These expressions calculated from a 2 × K contingency table are [7]:

$$S_{xx} = \sum_{j=1}^{K} n_{.j} (x_j - \bar{x})^2, \quad \text{where } \bar{x} = \sum_{j=1}^{K} n_{.j} x_j / n \tag{1}$$

$$S_{yy} = n_{1.} n_{2.} / n \tag{2}$$

$$S_{xy} = (\bar{x}_1 - \bar{x}_2) S_{yy}, \quad \text{where } \bar{x}_i = \sum_{j=1}^{K} n_{ij} x_j / n_{i.} \tag{3}$$

Now, the regression coefficient can be estimated as

$$\hat{b}_{y/x} = S_{xy} / S_{xx} \tag{4}$$

and the variance of the estimated regression coefficient can be estimated as

$$\widehat{\mathrm{var}}(\hat{b}_{y/x}) = S_{yy} / [(n - 1) S_{xx}] \tag{5}$$

On the other hand, a correlation coefficient measuring the degree of linear association
between X and Y calculated in the usual way is

$$r_{xy} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \tag{6}$$
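As an illustration, expressions (1) to (6) can be evaluated directly from the cell counts of a 2 × K table. The following is a minimal Python sketch, not the procedure used by the author: the function and variable names are hypothetical, the counts are those of Table 1, and the ≥5 category is coded as 5. The S_xx and S_yy it produces can be checked against the values quoted in the text; no expected value is asserted here for S_xy or the slope.

```python
import math

# Counts from Table 1; exposure levels 0..5 (the >=5 category is coded as 5).
levels = [0, 1, 2, 3, 4, 5]
cases = [7332, 287, 199, 96, 59, 65]       # row 1, Y = 0
controls = [7673, 239, 154, 65, 28, 29]    # row 2, Y = 1


def sums_of_squares(x, row1, row2):
    """Return S_xx, S_yy, S_xy and n as in expressions (1) to (3)."""
    col = [a + b for a, b in zip(row1, row2)]      # column totals n_.j
    n1, n2 = sum(row1), sum(row2)                  # row totals n_1. and n_2.
    n = n1 + n2
    xbar = sum(c * xj for c, xj in zip(col, x)) / n
    xbar1 = sum(a * xj for a, xj in zip(row1, x)) / n1
    xbar2 = sum(b * xj for b, xj in zip(row2, x)) / n2
    s_xx = sum(c * (xj - xbar) ** 2 for c, xj in zip(col, x))
    s_yy = n1 * n2 / n
    s_xy = (xbar1 - xbar2) * s_yy
    return s_xx, s_yy, s_xy, n


def regression_summary(s_xx, s_yy, s_xy, n):
    """Return slope (4), its estimated variance (5) and the correlation (6)."""
    slope = s_xy / s_xx
    variance = s_yy / ((n - 1) * s_xx)
    r = s_xy / math.sqrt(s_xx * s_yy)
    return slope, variance, r


s_xx, s_yy, s_xy, n = sums_of_squares(levels, cases, controls)
slope, variance, r = regression_summary(s_xx, s_yy, s_xy, n)
print(f"S_xx = {s_xx:.3f}, S_yy = {s_yy:.3f}, var(b) = {variance:.6f}")
```

Keeping the sums of squares separate from the summary statistics mirrors the order of the expressions, so each intermediate quantity can be inspected on its own.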
For the data in Table 1, these quantities for the malignant disease are
$S_{xx} = 6733.581$, $S_{yy} = 4056.155$ and $S_{xy} = 328.548$. Using (4) and (5), the estimated
coefficient of regression and its variance are 0.049 and 0.000037 respectively. The
correlation between the case/control status and the x-ray exposure is 0.063. A 95 %
confidence interval of the association coefficient is (0.0577, 0.0683).
Moreover, the expected numbers of cases and controls by recorded
number of maternal x-ray films during pregnancy are estimated using an estimated
linear response $\hat{p} = 0.489 + 0.049 x_i$, as shown in Table 2 below.
Table 2
Expected numbers of cases and controls by recorded
number of maternal x-ray films during pregnancy

Films    Cases (Y = 0)    Controls (Y = 1)    Total    Proportion
0        7433.14          7571.86             15005    .495
1        260.57           265.43              526      .531
2        174.87           178.13              353      .573
3        79.76            81.24               161      .615
4        43.10            43.90               87       .657
≥5*      46.57            47.43               94       .699
Total    8038             8188                16226

* For simplicity, the values greater than five were coded as 5.
The observed and expected proportions of cases shown in Table 1 and Table 2 are
plotted in Figure 1 below.
Figure 1: Proportion of cases of childhood cancer by exposure to maternal x-ray
during pregnancy (vertical axis: proportion P; horizontal axis: X-ray; series: observed
proportion, Obs. P, and expected proportion, Exp. P)
Figure 1 indicates that the estimated line fits the observed proportions of cases
reasonably well. An additional assessment of the dose-response relationship is
accomplished by partitioning the total chi-square value. The chi-square statistic that
measures homogeneity (H0: the proportion of cases is the same regardless of the degree
of maternal x-ray exposure) is χ² = 47.286. A chi-square value of this magnitude
indicates the presence of some sort of nonhomogeneous pattern of response
(p-value = 0.001) [7].
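The homogeneity statistic can be recomputed from the observed counts. The sketch below is an illustration under stated assumptions rather than the author's computation: it uses the standard Pearson chi-square with expected counts (row total × column total) / n, the counts of Table 1, and a hypothetical function name. It does not compute a p-value, which would need a chi-square distribution function.

```python
# Observed counts from Table 1 (rows: cases, controls; columns: 0 to >=5 films).
observed = [
    [7332, 287, 199, 96, 59, 65],
    [7673, 239, 154, 65, 28, 29],
]


def chi_square_homogeneity(table):
    """Pearson chi-square for homogeneity, with its degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count in each cell if the proportion of cases is the same
    # at every exposure level (the null hypothesis H0).
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, df, expected


chi2, df, expected = chi_square_homogeneity(observed)
print(f"chi-square = {chi2:.3f} on {df} degrees of freedom")
```

For a 2 × 6 table the statistic has 5 degrees of freedom, and the value obtained this way should land close to the 47.286 quoted in the text.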
3. Multinomial Logistic Regression Analysis
Multinomial logistic regression analysis is useful for situations in which we
want to classify subjects based on the values of a set of predictor variables. This
type of regression is similar to logistic regression, but it is more general. In regression
analysis, we use the numeric levels of a risk factor (the number of x-ray exposures) as
an independent variable and the corresponding proportion of cases as the dependent
variable, but in multinomial logistic regression there is a need to consider a large number
of records (frequencies) to establish an association between a risk factor and a disease [5].
In order to analyze a 2 × K contingency table using multinomial logistic
regression analysis, the data in Table 1 were processed using SPSSWIN, and the
numeric results were similar to those obtained by regression analysis [1]. That is, the
association coefficient between risk factor and disease is 0.053 with a standard error of
0.008. A 95 % confidence interval of the association coefficient is (0.0481, 0.0579).
4. Fuzzy Analysis
In bioscience there are several levels of uncertainty, vagueness and imprecision,
particularly in the medical and epidemiological areas, where the best and most useful
description of disease entities often comprises linguistic terms that are inevitably vague.
The theory of fuzzy logic has been developed to deal with the concept of partial truth
values, ranging from completely true to completely false, and has become a powerful
tool for dealing with imprecision and uncertainty, aiming at tractability, robustness and
low-cost solutions for real-world problems.
These features and the ability to deal with linguistic terms could explain the increasing
number of works applying fuzzy logic to biomedical problems. In fact, the theory of
fuzzy sets has become an important mathematical approach in diagnostic systems, the
treatment of medical images and, more recently, in epidemiology and public health [5],
[6]. For more background on fuzzy logic theory, the book by Yen and Langari [8]
is recommended.
A linguistic fuzzy model consists of a set of fuzzy rules and an inference
method. The most common inference method is the Mamdani minimum method, whose
output is a fuzzy set. The fuzzy linguistic model to evaluate the childhood cancer risk
from prenatal x-ray exposure has two antecedents: malignancies in children under 10
years of age and information on the mother's exposure to x-ray.
The model defines five fuzzy sets for the variable number of x-ray films to which
the mothers were exposed (very low, low, medium, high and very high) and two fuzzy
sets (cases and controls) for the variable that distinguishes children with a malignant
disease from the series of controls (healthy children of the same age).
The consequent of the model is the association between x-ray films and the
malignancies in children under 10 years of age. We considered three fuzzy sets for this
linguistic variable: weak, medium and strong. The rule base consists of the following
rules:
1. If x-ray is very low and case then association is weak.
2. If x-ray is low and case then association is weak.
3. If x-ray is medium and case then association is weak.
4. If x-ray is high and case then association is medium.
5. If x-ray is very high and case then association is strong.
The association between the children's malignancies and x-ray films is
determined by inference on the fuzzy rule set and defuzzification of the fuzzy output.
The system was implemented in C++. The fuzzy sets for the input variable number of x-ray
films and for the output variable association between children's malignancies and x-ray are
displayed in Figure 2 and Figure 3 below.
Figure 2: Fuzzy sets for the input variable number of X-ray films
(membership functions: VLOW, LOW, MEDIUM, HIGH, VHIGH; horizontal axis: X-ray)
Figure 3: Fuzzy sets for the output variable association between
children's malignancies and X-ray (membership functions: WEAK, MEDIUM, STRONG)
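To make the inference step concrete, the five rules can be sketched with Mamdani minimum inference, maximum aggregation and centroid defuzzification. This is only an illustrative Python sketch and does not reproduce the author's C++ system: the triangular membership shapes, all breakpoints, the 0 to 20 output scale, the treatment of the "case" antecedent as fully true, and every name in the code are assumptions, since the exact membership functions cannot be recovered from the figures.

```python
def tri(x, a, b, c):
    """Triangular membership with feet a and c and peak b (a <= b <= c)."""
    if x == b:
        return 1.0
    if x <= a or x >= c:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical input sets for the number of x-ray films (assumed breakpoints).
INPUT_SETS = {
    "very low": (0, 0, 1),
    "low": (0, 1, 2),
    "medium": (1, 2, 4),
    "high": (2, 4, 5),
    "very high": (4, 5, 5),
}

# Hypothetical output sets for the association, on an assumed 0-20 scale.
OUTPUT_SETS = {
    "weak": (0, 0, 10),
    "medium": (0, 10, 20),
    "strong": (10, 20, 20),
}

# The five rules from the text: input label -> output label.
RULES = {
    "very low": "weak",
    "low": "weak",
    "medium": "weak",
    "high": "medium",
    "very high": "strong",
}


def infer_association(films, case_membership=1.0, steps=200):
    """Defuzzified association for a film count in the range 0 to 5."""
    # Firing strength of each rule: minimum of the two antecedent memberships.
    strengths = {
        label: min(tri(films, *INPUT_SETS[label]), case_membership)
        for label in RULES
    }
    numerator = denominator = 0.0
    for i in range(steps + 1):
        z = 20 * i / steps
        # Clip each consequent at its rule strength (min), aggregate with max.
        mu = max(
            min(strengths[label], tri(z, *OUTPUT_SETS[RULES[label]]))
            for label in RULES
        )
        numerator += z * mu
        denominator += mu
    if denominator == 0.0:
        raise ValueError("no rule fires for this input")
    return numerator / denominator  # centroid of the aggregated fuzzy set


for films in range(6):
    print(films, round(infer_association(films), 2))
```

With these assumed sets, an input of 0 films fires only the first rule and an input of 5 films fires only the last, so the defuzzified output moves from the weak end of the scale to the strong end as exposure increases.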
We note that combining all possible inputs would yield 10 rules, but only
5 rules were considered because some situations cannot occur. For
example, it is considered impossible for the children of mothers who were not exposed
to x-ray to have the disease (if they do, it is for another reason). Although this is
mathematically possible, it was removed from the rule base, reducing the number of
rules.
The fuzzy sets related to the linguistic variables are presented in Figure 2. The
membership function represents the degree of compatibility of an input with each of the
categories. In fact, the membership degree represents the possibility that the input
belongs to the set. Figure 3 shows the membership function of the output. It is clear
that the association increases monotonically when the number of x-ray films increases.
It was 16 % for weak, 17 % for medium and 18 % for strong associations respectively.
Also, the weighted mean of the association between x-ray and the disease was 0.125 and
the standard error was 0.0026. A 95 % confidence interval of the association coefficient
is (0.1178, 0.1322).
Discussion
In regression analysis, we use the numeric levels of a risk factor (the number of
x-ray exposures) as an independent variable and the corresponding proportion of cases
as a dependent variable. Furthermore, in multinomial logistic regression there is a need
for a considerable number of records (frequencies) to establish an association between
a risk factor and a disease. In a fuzzy linguistic model, there is no such need.
The point biserial correlation coefficient ($r_{xy}$) and the regression coefficient
($\hat{b}_{y/x}$) are interrelated when calculated from a 2 × K table. For example, each has an
expected value of zero when the variables X and Y are unrelated. The two statistics
measure the association between the numeric levels of a risk factor and a disease in
different ways but, in terms of probability, lead to the same inference.
A measure of association assesses the strength of a relationship, while a
statistical test gives an idea of the likelihood that such an association occurs by chance.
Whereas regression and multinomial logistic regression give similar results, the fuzzy
model gives rather different results for evaluating the association between the risk factor
and the disease (see Table 3).
Table 3
Comparison between the results of the three methods

                           Regression        Multinomial logistic    Fuzzy model
                                             regression
Association coefficient    0.063             0.053                   0.125
Standard error             0.0061            0.0082                  0.0026
95 % CI                    (.0577, .0683)    (.0481, .0579)          (.1178, .1322)
p-value                    0.001             0.000
We note from Table 3 that, of the three statistical methods for evaluating the
association between a risk factor and a disease, regression and multinomial logistic
regression show similar results for a data set of 16226 children, whereas the
results from the fuzzy model are rather different.
References
[1] Ashour, S. K. and Salem, S. A. (2005). Statistical Presentation and Analysis using
SPSSWIN, Part two: Advanced Applied Statistics. Cairo University: ISSR.
[2] Bithell, J. F., and Stewart, A. M. (1975). Prenatal Irradiation and Childhood
Malignancy: A Review of British Data from the Oxford Survey. Brit. J. of Cancer
(31): 271-87.
[3] Breslow, N. E., and Day, N. E. (1987). Statistical Methods in Cancer Research,
Volume II. Oxford University Press. Oxford, UK.
[4] Hardeo Sahai and Anwer Khurshid (1996). Statistics in Epidemiology, Methods,
Techniques and Applications. CRC Press, New York.
[5] Luiz Fernando C. Nascimento and Neli Regina S Ortega (2002). Fuzzy Linguistic
Model for Evaluating the Risk of Neonatal Death. Rev Saude Publica, 36 (6): 686-92.
[6] Schwarzer G., Nagata T., Mattern D., Schmelzeisen R. and Schumacher (2003).
Comparison of Fuzzy Inference, Logistic Regression, and Classification Trees (CART).
Methods Inf Med; 42: 572-7.
[7] Selvin, S. (1996). Statistical Analysis of Epidemiologic Data, 2nd ed. Oxford
University Press, Oxford.
[8] Yen J. and Langari R. (1999). Fuzzy Logic: Intelligence, Control, and Information.
Upper Saddle River (NJ), Prentice-Hall.