Journal of Data Science 6(2008), 389-414
Data Mining and Hotspot Detection in an
Urban Development Project
Chamont Wang1 and Pin-Shuo Liu2
1 The College of New Jersey and 2 William Paterson University
Abstract: Modern statistical analysis often involves large amounts of data from many application areas with diverse data types and complicated data structures. This paper gives a brief survey of certain large-scale applications. In addition, this paper compares a number of data mining tools in the study of a specific data set which has 1.4 million cases, 14 predictors and a binary response variable. The study focuses on predictive models that include Classification Tree, Neural Network, Stochastic Gradient Boosting, and Multivariate Adaptive Regression Splines. The study found that the variable importance scores generated by different data mining tools exhibit wide variability and that users need to be cautious in the application of these scores. On the other hand, the response surfaces and the classification accuracies of most models are relatively similar, yet the financial implications can be very profound when the models select the top 10% of cases and when cost and profit are incorporated in the calculation. Finally, the Decision Tree, Predictor Importance, and Geographic Information Systems (GIS) are used for Hotspot Detection to further enhance the profit to 95.5% of its full potential.
Key words: Case selection, data mining, geographic information systems, marginal effect, predictive modeling, profit, variable importance.
1. Introduction
Modern statistical analysis often involves large amounts of data with tens, hundreds or thousands of variables. Case studies involving large data sets arise in biological applications, web mining, political campaigns, government services, crime-fighting, and the detection of financial fraud, to name just a few.
For example, in biomarker pattern analysis, each data set usually contains hundreds of thousands of predictors in the form of mass-to-charge ratios (m/z) to generate sets of biomarker classifiers (Conrads et al., 2004; Yu et al., 2005).
For another example, it is well known that genome projects and other large-scale biological research projects are producing enormous quantities of biological
data. The entire human genome, for instance, with its sequence represented by the letters A, T, C and G, would fill approximately 1000 books of 1000 pages each when printed. Another line of development is microarray technology, which allows scientists to study the behavior patterns of several thousand genes simultaneously (Amaratunga and Cabrera, 2004).
In a different area of application, Google uses data mining techniques extensively on the vast universe of their web data. The techniques include probabilistic
models for page rank, text mining, spell check, and statistical language translation that may involve hundreds of languages around the globe. A New York
Times article (October 30, 2005) reported that Google utilizes millions of variables about its users and advertisers in its predictive modeling to deliver the
message to which each user is most likely to respond. The Times reported that
because of this technology, users click ads 50 percent to 100 percent more often on
Google than they do on Yahoo, and that is a powerful driver of Google’s growth
and profits.
In political campaigns, a series of articles in the New York Times and other news outlets indicates that both the Republican and the Democratic parties rely heavily on data mining tools to micro-target potential voters. When well done, these get-out-the-vote programs typically raise a candidate's Election Day performance by two to four percentage points. Whether you like it or not, data mining tools indeed assisted George W. Bush in his presidential election in 2004 and perhaps in 2000 as well. As for the data banks, according to a New York Times article (June 9, 2003), the Democratic National Committee boasts electronic files on 158 million Americans, and the Republican National Committee says it is way ahead, with files on 165 million (see references and other details in Section 4, Variable Importance). The US Federal Government has also undertaken very extensive data mining efforts, spanning government services and homeland security; about 40% of federal agencies use or plan to use data mining tools in their services and operations (see, for example, documents prepared by the US General Accounting Office in May 2004 and by the US Department of Defense in March 2004, available at http://www.gao.gov/new.items/d04548.pdf and http://www.epic.org/privacy/profiling/tia/tapac_report.pdf). A special tool for crime-fighting and the detection of financial fraud is called link analysis (see Remark 1).
A current hot issue in the data mining community is the Netflix predictive modeling competition. The company offers $50,000 each year for five years and an additional $1,000,000 at the end of the 5-year competition for the best model that improves their prediction accuracy by reducing the RMSE by 10%. The data involve more than 100 million ratings from over 480 thousand randomly-chosen, anonymous customers. The competition has so far attracted
21,865 contestants on 17,690 teams from 147 different countries, and it will probably produce some good models at the end of the 5-year saga (see http://www.netflixprize.com/leaderboard).
For other large data sets, an online repository supported by a grant from the Information and Data Management Program at the National Science Foundation can be found online. The data bank has a total of 32 data sets, including E. Coli Genes, El Nino Data, the Insurance Company Benchmark, and the Reuters-21578 Text Categorization Collection. On the other hand, statisticians who are interested in large health data sets or links may want to visit the vitalnet site (http://www.ehdp.com/vitalnet/datasets.htm), which includes birth data from various States, Medicaid datasets, mortality data, and a variety of other health data.
In a recent volume of New Directions for Institutional Research titled "Data Mining in Action: Case Studies of Enrollment Management" (2006; editors: Luan and Zhao), Decision Trees, Neural Networks, and other data mining tools are used on the following tasks: advanced placement, student mobility, graduation rates, predicting college admissions yield, estimating student retention and degree-completion time, and developing a learner typology for managing enrollment and course offerings. One of the papers (Luan, 2006) uses a combination of clustering analysis and traditional statistical techniques in a very skillful manner and is available at http://www.airweb.org/page.asp?page=915. Other articles can be tracked down at http://www.wiley.com/WileyCDA/WileyTitle/productCd-IR.html.
Tools for mining large data sets include traditional statistical methods and new techniques from the machine-learning community. The tools are commonly grouped into the following categories: supervised methods, unsupervised methods, and everything else in between. The first category involves a target variable in predictive modeling and covers regression, decision trees, and neural networks. The second category includes multivariate methods such as cluster analysis, principal component analysis, data visualization, pattern discovery, and novel dimension-reduction techniques that do not involve target variables in the prediction. The third category includes market basket analysis (a.k.a. association rules), which is commonly used by Amazon.com, Blockbuster.com, and countless other online stores. For market basket analysis, a clever use of conditional probability and the a priori algorithm can be found in Chapter 10 of the book by Larose (2005).
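As a small illustration of the conditional-probability idea behind association rules, the following sketch (a toy example in Python; the transactions are made up) computes the support and confidence of a hypothetical rule {bread} => {butter}, where confidence is simply P[butter | bread]:

    # Toy market-basket data; each transaction is a set of items.
    transactions = [
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"bread", "jam"},
        {"milk", "jam"},
    ]

    def support(itemset):
        # Fraction of transactions containing every item in itemset.
        return sum(itemset <= t for t in transactions) / len(transactions)

    # Confidence of {bread} => {butter} is the conditional probability
    # P[butter | bread] = support({bread, butter}) / support({bread}).
    confidence = support({"bread", "butter"}) / support({"bread"})
    print(confidence)   # 0.666..., i.e., 2 of the 3 bread baskets contain butter

The a priori principle then prunes the search for frequent itemsets: an itemset can be frequent only if all of its subsets are frequent.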
The major difference between the traditional statistical techniques and the tools from the machine learning community is that the former category puts more emphasis on the underlying mechanism that generated the data, while the latter is more concerned with the utility and predictive power of the models on the
hold-out data that is not used in the modeling process. Study after study has shown that the two approaches can be used to complement each other in a very productive manner (see, e.g., Conrads et al., 2004; Yu et al., 2005).
In this paper, the focus is placed on certain aspects of the supervised methods. Specifically, we consider a number of predictive models such as regression, neural networks, boosted trees, and support vector machines in the context of an urban development project. The data set has 1.4 million cases, 14 predictors, and a binary response variable. Section 2 compares the model accuracies of various data mining tools. In addition, the section discusses a number of possible ways to decrease the misclassification, false positive, or false negative rates. Section 3 uses a unique feature of SAS Enterprise Miner to produce profit charts, a hallmark of SAS-EM; the chart is indeed a money maker in commercial applications of data mining. Section 4 discusses the current state of predictor importance
and urges users to exercise caution in the application of these scores. Section 5
compares marginal effects and response surfaces of various models. Sections 4
and 5 together reveal weaknesses and limitations in the original data that are
common in the mining of large data sets. Section 6 uses Geographic Information
Systems (GIS) to screen cases that enhance profit to 95.5% of its full potential.
Specifically, this study applies GIS to create an urban growth database of
Memphis and Shelby County, Tennessee to estimate the possible relationship between urban development and several growth, stimulant, and deterrent variables.
These variables were collected and constructed for the study, including land-use change between 1990 and 2000 in the study area. The dependent variable for
each geographic unit has two possible values: 1 indicates that the development
status of the site changed from vacant to developed, while 0 indicates that it has
remained vacant. The data set has 1,420,287 cases and includes the following 14
predictors:
Interval variables:
ELEV (elevation, classified into several categories),
EMPDIS (distance to employment center, ratio scale),
INTDIS (proximity to highway intersection, ratio scale data),
POPCHG (population change between 1990 and 2000, ratio scale),
ROADIS (proximity to major road, ratio scale data),
SLOPE (percent of slope, classified into several categories).
Ordinal variable:
SCHDIS (school performance, 1 = poor, 2 = better, 3 = best school performance).
Categorical variables:
FLOODWAY (either in the floodway or not),
FP100 (100-year flood plain, either in the 100-year flood plain or not),
FP500 (500-year flood plain, either in the 500-year flood plain or not),
PUB (Public land, either public land or not),
SEWER (either in the sewer service area or not),
SURWAT (either in surface water or not),
WET (either in the wetland area or not).
In statistical modeling, it is recommended that the investigator examine the distribution of each variable before building models. In this paper, the results will be shown first, and the distributional and statistical details will be presented when necessary.
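As a minimal sketch of this pre-modeling examination in Python (assuming the data sit in a CSV file with the column names listed above plus the binary response DEV; the file name is hypothetical):

    import pandas as pd

    df = pd.read_csv("urban_growth.csv")   # hypothetical file name

    # Interval variables: summary statistics catch odd ranges and outliers.
    interval = ["ELEV", "EMPDIS", "INTDIS", "POPCHG", "ROADIS", "SLOPE"]
    print(df[interval].describe())

    # Ordinal and categorical variables: frequency counts.
    for col in ["SCHDIS", "FLOODWAY", "FP100", "FP500",
                "PUB", "SEWER", "SURWAT", "WET"]:
        print(df[col].value_counts(normalize=True))

    # Class balance of the binary response.
    print(df["DEV"].value_counts(normalize=True))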
2. Model Accuracies of Various Data Mining Tools
For binary prediction, model accuracies can be assessed by a variety of criteria. In SAS-EM, the tools include the following: (a) summary statistics: AIC, BIC, misclassification rates for training, validation and hold-out data, root average squared error, maximum absolute error, root final prediction error, etc.; (b)
graphical comparisons: %Response, %Captured Response, lift chart, profit chart,
ROI (Return On Investment) chart, confusion matrix plot, ROC chart, sensitivity
and specificity charts, and response threshold chart. In Microarray and biomarker
pattern analysis, ROC, sensitivity and specificity are the most popular (see e.g.,
Draghici, 2003; Conrads et al., 2004; Yu et al., 2005). In the literature, other criteria and terms such as precision, recall, and F-measure are used. The countless terms make the field very colorful or very confusing, depending on the perspective of the user.
In this section, the focus will be the classification accuracy, which is defined as one minus the misclassification rate (a.k.a. the risk estimate). The data mining
tools we used in this section include the following.
The List of Predictive Models:
1. Logistic Regression (via SAS-EM)
2. Neural Network (SAS-EM and STATISTICA)
3. Decision Tree (SAS-EM, STATISTICA C&RT, and SALFORD SYSTEMS
CART)
4. Random Forest (SALFORD and STATISTICA)
5. Stochastic Boosting Trees (SALFORD TreeNet and STATISTICA boosted
trees)
6. Multivariate Adaptive Regression Splines (SALFORD MARS and STATISTICA MARSplines)
7. Support Vector Machine (STATISTICA and Equbits SVM)
8. Genetic Algorithm (DISCIPULUS Genetic Algorithm)
Explanations of these models (except the Genetic Algorithm) can be found in Hastie et al. (2001) or in the electronic textbook at StatSoft.com. Explanations of Genetic Algorithms can be found in Berry and Linoff (2004), Spector (2004), Tomassini (2005) or Tan, Khor, and Lee (2005).
In this list, certain data mining algorithms can function as universal approximators which may be able to model a variety of linear or nonlinear relationships or interactions. Arguably, the more of these tools we have from reputable sources, the better the chance of finding a good predictive model.
Take the example of the Netflix $1M competition: if you have time or an assistant, you may want to try all of the tools to find the best root mean squared error (RMSE), as required by the competition. The same argument applies to other high-stake situations.
In the list, the first three models (regression, tree, and neural network) are the major tools of SAS-EM predictive modeling. The fourth tool, Random Forest, is the brainchild of the late Leo Breiman, Department of Statistics, University of California, Berkeley, 1980-2005. Yu et al. (2005) reported that Random Forest outperformed other methods like linear discriminant analysis, quadratic discriminant analysis, k-nearest neighbor, and boosted classification trees. De Veaux (2005) compared various data mining tools in a study of 1,618 mammograms, with the following results:
Table 1: Random forest vs. radiologists

                                  False Positives    False Negatives
    Simple Tree                   32.20%             33.70%
    Neural Network                25.50%             31.70%
    Boosted Trees                 24.90%             32.50%
    Bagged Trees (Random Forest)  19.30%             28.80%
    Radiologists                  22.40%             35.80%
The De Veaux study indicates that Random Forest may perform better than radiologists and better than many other data mining tools. In our study (Table 2), this tool is indeed a very competitive algorithm.
The Boosted Tree algorithm is widely credited as an invention of Jerome Friedman, Department of Statistics, Stanford University, but Friedman (2006) himself stated that the technique was first proposed by Freund and Schapire (1996). The tool is called TreeNet by Salford Systems and was the winner of the 2003 Duke University Data Mining Competition, sweeping all four categories. In 2004, the TreeNet submission placed 2nd in the KDD Cup, a cut-throat data mining competition. The honor goes to the software and to a Salford client who used the 3-day evaluation copy to accomplish the feat (see http://www.salford-systems.com/press1.php and http://www.salford-systems.com/press8.php).
Multivariate Adaptive Regression Splines (MARS) is another invention of
Jerome Friedman. The tool, together with CART, HotSpotDetector and TreeNet,
helped Salford win the 2000 KDD Cup International Web Mining Competition.
The Support Vector Machine (SVM) methodology is one of the most important tools in the machine learning community. Bastos and Wolfinger (2004) reported a 4% error rate using SVM, as compared to a 27% error rate on the same set of data in a 2002 paper in The New England Journal of Medicine. Yu (2005) used SVM and achieved a 2% error rate in a case study on cloud detection, as compared to a 53% error rate from expert labels. Adnan and Bastos (2005) reported substantial advantages of SVM over regression and neural networks. Furthermore, EQUBITS SVM achieved a stunning 99.6% accuracy in the 2004 UCSD data mining competition (see http://www.siam.org/meetings/sdm05/binyu.htm, http://equbits.com/casestudies/SAS%20Case%20Study.pdf, http://www.equbits.com/).
The last tool in our list is the DISCIPULUS Genetic Algorithm (GA), which is based on biological inspirations such as selection, cross-over, and mutation in the search for an optimal solution. In recent years, evolutionary algorithms have also been used to enhance automatic quantum computer programming (see, e.g., Spector, 2004). In addition, much to the delight of students in data mining classes, Artificial Evolution has been used in conjunction with Neural Networks to animate tricky stunts in blockbuster movies such as Troy and The Lord of the Rings (see, e.g., http://www.naturalmotion.com/downloads.htm, http://www.wired.com/wired/archive/12.01/stuntbots.html).
Table 2 displays the classification rates of each model. All models use the
same 10,000 cases that were selected at random from the full data set of 1.4
million records. Of the 10,000 cases, 80% of the data are used to build the
model, and the remaining 20% are used as hold-out data to test the classification
accuracy. In the modeling process, certain algorithms further split the 80% data
into training and validation data sets to prevent over-fitting, while others (such
as Equbits SVM) use v-fold cross-validation to search for the best fit. The layout
of Table 2 follows the list sequence at the beginning of this Section.
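The sampling scheme itself is easy to reproduce outside the commercial packages; the following sketch uses scikit-learn models as stand-ins (the file name, column names, and model settings are illustrative assumptions, not the vendors' implementations):

    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Draw 10,000 cases at random from the full data set.
    df = pd.read_csv("urban_growth.csv").sample(n=10_000, random_state=1)
    X, y = df.drop(columns="DEV"), df["DEV"]

    # 80% for model building, 20% held out for an honest accuracy estimate.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=1, stratify=y)

    models = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "random forest": RandomForestClassifier(random_state=1),
        "boosted trees": GradientBoostingClassifier(random_state=1),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        print(f"{name}: hold-out accuracy = {acc:.3f}")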
Table 2: Model accuracies of various data mining tools (see Remark 2 at the end of the paper)

    n = 10,000                                          Accuracy           Accuracy
                                                        (Hold-out data)    (Training + Validation data)
    No Model                                            52.8%              52.8%
    SAS Enterprise Miner, Logistic Regression           76.3%              75.9% (T), 75.2% (V)
    SAS Enterprise Miner, Neural Network                76.6%              75.8% (T), 75.8% (V)
    STATISTICA, Neural Network                          76.0%              80.0% (T)
    SALFORD, CART (Gini index)                          78.3%              78.1% (T)
    SAS Enterprise Miner, Decision Tree (Chi-square)    75.4%              75.1% (T), 75.6% (V)
    SAS Enterprise Miner, Decision Tree (entropy)       76.2%              78.6% (T), 77.6% (V)
    SALFORD, Random Forest                              80%                78.2% (T)
    STATISTICA, Random Forest                           76.9%              75.4% (T), 73.3% (V)
    SALFORD, TreeNet (1000 trees)                       80%                80.1% (T)
    STATISTICA, Boosted Trees (1000 trees)              77.9%              83.5% (T), 79.2% (V)
    SALFORD, MARS                                       N/A                76.5% (T)
    STATISTICA, MARSplines                              75.9%              75.9% (T)
    EQUBITS, Support Vector Machine                     77.4%              77.4% (T)
    STATISTICA, Support Vector Machine                  75.0%              75.6% (T)
    DISCIPULUS, Genetic Algorithm                       N/A                80.4% (T), 80.2% (V)
The second row of Table 2 is labeled "No Model," indicating that 52.8% of the past records were "developed" while 47.2% were "not developed." In comparison, black-box models such as TreeNet and the Genetic Algorithm increase the classification accuracy by almost 30%. This is indeed a genuine contribution of data mining tools to real-world predictive modeling.
In our study, the SVM models were built by experts at Equbits and StatSoft, respectively. Hence we are somewhat disappointed with the 77.4% and 75% accuracies. On the other hand, the Genetic Algorithm gave us 80.4% accuracy for the training data and 80.2% for the validation data, which are higher than the rates of most tools we tried. A cautionary note, however, is that Genetic Algorithms may remember their historical runs and hence have a tendency to over-fit both the training and validation data. Consequently, one may have to reserve part of the original data that is completely distinct from the model building process for the assessment of model accuracy. In our study, we used the free demo software which does not allow the deployment of the third data set.
Note that Table 2 does not rank the models by their classification accuracies. The standard errors of the risk estimates are roughly 0.5% for the training data and 0.6% to 0.8% for the hold-out data, respectively (see Remark 3). Consequently, the difference between the SAS-EM Logistic Regression and Neural Network (76.3% and 76.6%, respectively) is not statistically significant. The same can be said of many other pairs of models.
3. Classification with a Cost Structure
This section discusses binary classification from a profit-cost point of view. In Table 2, the difference in classification accuracy between the SAS-EM logistic regression and neural network appears small (76.3% vs. 76.6% on the hold-out data), but the financial implication can be very substantial, as will be shown in this section.
To simplify the discussion, assume that the land value in the study area will increase 500% over a 10-year period when the land changes from a vacant lot to residential use, and that the land value will stay the same when vacant land remains vacant over the same period. In addition, the cost of the investment would be 50% of the purchase price, in terms of investment loss due to the cost of interest over the 10-year period.
Furthermore, assume that a model produced the following classification matrix (n = 10,000):

Table 3: Classification matrix for profit analysis

                         Predicted DEV = 1    Predicted DEV = 0
                         (Invest)             (Do Not Invest)
    Observed DEV = 1     4310 (76%)           974
    Observed DEV = 0     1374 (24%)           3342
    Column total         5684 (100%)          4316
Then the expected profit over the 10-year period would be

    (5 × 4310 − 0.5 × 5684)/5684 = 3.3.

Recall that in the original data, P[DEV = 1] = 53% and P[DEV = 0] = 47%, hence the net profit of a blind investment would be 5 × 0.53 − 0.5 = 2.15.
The difference is about 3.3 − 2.15 = 1.15 investment units. However, in the original data about 9.1% of the cases are public land, which would prohibit investment. After the deletion of these cases, the data size was reduced from the original 1,420,287 to 1,290,897. A random sample (n = 129,089) of the reduced data (N = 1,290,897) includes 58.4% developed cases, so the net profit of a blind investment becomes 5 × 0.584 − 0.5 = 2.42.
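The profit arithmetic above is easy to package as a function; a minimal sketch under the stated assumptions (a 500% gain on developed parcels and a 50% carrying cost on every purchase):

    def profit_per_unit(true_pos, false_pos, gain=5.0, cost=0.5):
        # Expected profit per unit invested: each correctly picked parcel
        # returns `gain` units; every purchased parcel costs `cost` units.
        invested = true_pos + false_pos
        return (gain * true_pos - cost * invested) / invested

    print(profit_per_unit(4310, 1374))   # about 3.29, the figure from Table 3
    print(5 * 0.53 - 0.5)                # 2.15, the blind bet on the full data
    print(5 * 0.584 - 0.5)               # 2.42, the blind bet after screening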
A special feature in SAS-EM allows the user to pick the top candidates (e.g., the top 10%, 20%, etc.) to improve the investment profit. This is shown in the following charts:
Figure 1: Profit charts, cumulative
Figure 2: Profit charts, non-cumulative
The chart uses 10 bins and ranks, from left to right, the best 10% of the lands, the second-best 10% of the lands, etc. The chart indicates that the Neural Network model is the preferred model, that the profit from selecting only the top 10% of the lands via the Neural Network model would be about 4 investment units, that of the top 20% about 3.85 units, etc. In other words, if each investment unit is $100,000, then the top 10% of the land as selected by the neural network model would outperform the blind bet by (4 − 2.42) × $100,000 = $158,000 for the investment on one piece of land after 10 years. In the event that the top 10% are not available to the investor, the profit for the second group via the Neural Network would be 3.7 units, as shown in Figure 2.
The chart in Figure 1 is cumulative while the one in Figure 2 is non-cumulative. The second chart warns that investment in the bottom 35% or so would be worse than the average of blind bets. In other words, the models predict that the probabilities of development for the bottom 35% are very low, and investors should try to avoid those pieces of land.
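The decile logic behind Figures 1 and 2 can be reproduced with any model that outputs P[DEV = 1]; a sketch, assuming p_hat and y are NumPy arrays of predicted probabilities and observed 0/1 outcomes:

    import numpy as np

    def decile_profits(p_hat, y, gain=5.0, cost=0.5):
        # Non-cumulative (Figure 2): profit per unit within each decile,
        # ranked from the best candidates to the worst.
        order = np.argsort(-p_hat)
        return [gain * b.mean() - cost for b in np.array_split(y[order], 10)]

    def cumulative_profits(p_hat, y, gain=5.0, cost=0.5):
        # Cumulative (Figure 1): profit per unit when investing in the
        # top 10%, top 20%, ..., top 100% of the ranked cases.
        y_sorted = y[np.argsort(-p_hat)]
        n = len(y_sorted)
        return [gain * y_sorted[: n * k // 10].mean() - cost
                for k in range(1, 11)]

Within a bin whose development rate is r, the profit per unit invested is simply 5r − 0.5, the same formula used for the blind bet above.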
4. Variable Importance
Given a dozen or thousands of predictors, a natural question is how important a variable (or a set of variables) is in the prediction of the target. For instance, in the emerging technology of biomarker pattern analysis, data from high-resolution mass spectrometry are often used to generate a set of biomarker classifiers (Conrads et al., 2004; Yu et al., 2005). Each data set usually contains hundreds of thousands of predictors in the form of mass-to-charge ratios (m/z) that need to be binned to reduce the computational complexity. Statistical techniques such as the Kolmogorov-Smirnov test, the Wilcoxon test, the Bonferroni correction, and other tools are used to further reduce the dimension of the feature space (i.e., predictor space) without losing its biological meaning. The success of the technology depends on the ability of a selected set of features to transcend the biologic variability, process variations, and methodologically related background noise (Conrads et al., 2004) (see Remark 6). For another example, Google has created an automated way to search for talent among the more than 100,000 job applications it receives each month (New York Times, 01/03/2007). The data mining process involves extensive surveys that explore an applicant's attitudes, behavior, personality and biographical details going back to high school. The survey has about 300 items, including questions such as: Is your work space messy or neat? Are you an extrovert or an introvert? What magazines do you subscribe to? What pets do you have? The answers are fed into a series of formulas to predict how well a person will fit into Google's chaotic and competitive culture. The Google studies found that certain traditional yardsticks are not reliable predictors, while other variables can help find candidates in several areas. The Times also reported that the use of surveys similar to Google's new system is on the rise, which in turn may present new challenges and new opportunities for statistical prediction (see http://www.nytimes.com/2007/01/03/technology/03google.html?pagewanted=2&ei=5094&en=4d3171ddca1dab7d&hp&ex=1167886800&partner=homepage). In our study, all tree-based data mining tools offer predictor importance scores. SPSS logistic regression also ranks the relative importance of independent variables.
S-Plus uses Wald statistics to calculate importance scores. Other tools for predictor screening include information gain, expected cross entropy, the weight of evidence of text, odds ratio, term frequency, mutual information, and a modified Gini Index (Shang et al., 2007).
A disturbing fact is that there seems to be considerable disagreement in the definitions and in the execution of the algorithms. Variable importance in data mining is indeed a tricky business. For example, in Breiman et al. (1984, p. 147), the measure of the importance of a variable x_m is defined as

    M(x_m) = \sum_{t \in T} \Delta I(\tilde{s}_m, t),

where \tilde{s}_m is the best surrogate split (p. 40) with variable x_m, and \Delta I(\tilde{s}_m, t) is the drop in Gini impurity at node t. STATISTICA and certain programs such
as Quest by Loh and Shi, on the other hand, compute variable importance by
summing the drop in node impurity over all nodes in the tree(s) (see STATISTICA
help file).
The difference between STATISTICA and Breiman et al. appears minor, but the STATISTICA manual goes to great lengths to emphasize the differences, and the two approaches indeed yield vastly different scores for variable importance, as can be seen in Table 4:
Table 4: Scores of variable importance

    Variable   Salford   Salford   Salford   Salford   Statistica   Statistica   Statistica   Statistica   SAS-EM     Median
               TreeNet   Random    CART      MARS      Boosted      Random       C&RT         MARS         Decision
                         Forest                        Tree         Forest                                 Tree
    PUB        100       81        100       100       48*          100          53*          62*          100        100
    POPCHG     98*       72        45*       48*       100          73           29*          93*          22*        72
    SEWER      97        100       100       58*       43*          72*          100          90           97         97
    ELEV       85*       59        52        29*       88*          70           63           100*         59         63
    ROADIS     72*       43        29        13*       79*          50           36           79*          0*         43
    INTDIS     70        69        63        29*       85*          49           64           100*         34*        64
    EMPDIS     61*       32        8*        13        53*          35           27           55*          20         32
    SLOPE      50*       15        13        25        64*          36           16           65*          0*         25
    SCHDIST    41        53        65        6*        56           34           57           74*          0*         53
    WET        38        19        36        19        48           44           64*          74*          29         38
    FP100      36        14*       29        29        53           41           55           66*          3*         36
    FP500      30*       4         0         23        43           19           11           36*          0          19
    FLOODWY    29        6*        17        27        41           28           31           56*          0*         28
    SURWAT     12        1         10        0         22           9            12           26           0          10
In Table 4, reading horizontally, a score is marked with an asterisk if it is 20 or more points away from the median (see Remark 7). The table shows wide variability and inconsistency among the different tools from different data mining vendors. In fact, even among the tools offered by the same company, the scores of variable importance can differ dramatically. For example, using SALFORD TreeNet and CART, the scores for EMPDIS are 61 and 8, respectively. For another example, using STATISTICA Boosted Trees and Random Forest, the scores for PUB are 48 and 100, respectively. Furthermore, when we extended the structure of the SAS Decision Tree, the scores changed considerably.
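Part of the disagreement stems from the vendors measuring different things (impurity drops, surrogate splits, sensitivity analysis, and so on). The following sketch contrasts two common definitions on one and the same fitted model, with scikit-learn as a neutral stand-in and the train/test split from the Section 2 sketch:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance

    rf = RandomForestClassifier(random_state=1).fit(X_train, y_train)

    # Definition 1: mean decrease in node impurity, summed over all
    # trees and nodes (in the spirit of the STATISTICA/Quest approach).
    impurity = rf.feature_importances_

    # Definition 2: permutation importance, i.e., the drop in hold-out
    # accuracy when a single predictor is randomly shuffled.
    perm = permutation_importance(rf, X_test, y_test,
                                  n_repeats=10, random_state=1)

    for name, s1, s2 in zip(X_train.columns, impurity, perm.importances_mean):
        print(f"{name}: impurity = {s1:.3f}, permutation = {s2:.3f}")

Even on a single model, the two columns typically rank the predictors differently, which is exactly the phenomenon displayed in Table 4.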
Another observation is that the bottom five variables (WET, FP100, FP500, FLOODWY, and SURWAT) may be overwhelmed by the other variables in the models, but they may be important in their own right, especially in the decision-making process regarding whether a piece of land should be developed. As a matter of fact, these bottom variables can be used, in conjunction with the Decision Tree, to boost the profit discussed in Section 3.
Specifically, we used a leaf of the SAS Decision Tree to identify a subset of 585,640 sites (from the original 1.4 million geographic locations) that has the highest probabilities of being developed. Within this subset, we deleted cases with WET = 1, FP100 = 1, FP500 = 1, FLOODWY = 1, or SURWAT = 1. The resulting data set contains 535,076 cases. We then rebuilt the predictive models; the Neural Network model enhanced the profit for the top 10% from 4 to 4.15 units. Recall that each investment unit is equivalent to $100,000, hence the monetary increase from case screening via the Decision Tree and Predictor Importance would be (4.15 − 4.00) × $100,000 = $15,000 for the investment on each piece of land over the 10-year period.
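A sketch of this screening step in Python (in_hot_leaf, a flag recording membership in the chosen tree leaf, is a hypothetical column; the five flags are named as in the text):

    # Keep only sites in the high-probability leaf of the Decision Tree.
    hot = df[df["in_hot_leaf"] == 1]           # 585,640 sites in our data

    # Drop sites carrying any environmental/regulatory flag.
    flags = ["WET", "FP100", "FP500", "FLOODWY", "SURWAT"]
    screened = hot[(hot[flags] == 0).all(axis=1)]
    print(len(screened))                       # 535,076 sites remain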
A cautionary note on the Decision Tree methodology is that if the seed of the randomization is changed in the construction of the training and validation data sets, the structure of the tree, and hence the resulting profit chart, will be different each time. This is common in data mining and is a sharp departure from the traditional statistical analysis of tightly controlled experiments. And there is no way to tell which model actually captures the reality. Some may regard this as a disadvantage of data mining tools, but others may consider it a reflection of real-life observational data. Furthermore, experience indicates that many predictive models tend to produce decent profits for the investment. This situation reminds us of what G.E.P. Box once said: "All models are wrong, but some are useful." (See Remarks 8-10)
5. Response Surface, Frequency Table, and Marginal Effect
Given the original data of 1.4 million records, one can use standard tools such as the SAS frequency table and a 2D plot to produce the following chart for the predictor EMPDIS (distance to employment center):

Figure 5: Percent of [DEV = 1] vs. EMPDIS
Figure 5 shows that when EMPDIS is between 2 and 8 units, the probabilities of development are close to 60%, but the probabilities are lower at the two ends. The chart of the frequency table for ELEV, on the other hand, revealed a strange pattern that led to the detection of contaminated data (outliers). After the elimination of the outliers, the chart for ELEV shows a non-linear pattern similar to that in Figure 5. Figure 6 shows the response surface of the STATISTICA 2nd-order MARSplines model, with the x-axis being ELEV, the y-axis EMPDIS, and the z-axis P[DEV = 1]:
Figure 6: STATISTICA 2nd order MARSplines
Figure 6 shares a similar shape with the response surfaces generated by the SAS-EM neural network, SAS-EM logistic regression, STATISTICA neural network, STATISTICA boosted trees, and STATISTICA random forest, but their heights vary from model to model. Nevertheless, they appear consistent with the non-linear patterns in Figure 5.
Theoretically, it is possible to quantify Figure 6 by the calculation of marginal effects at each given point. Specifically, let p = f(x_1, x_2, \ldots, x_k); then the marginal effect of x_i at (x_{1*}, x_{2*}, \ldots, x_{k*}) is defined to be

    \frac{\partial f}{\partial x_i}(x_{1*}, x_{2*}, \ldots, x_{k*}).    (5.1)
Note that none of the software packages in this study provides an easy answer to (5.1), hence we need to do it by hand. For Figure 6, the MARSplines equation is a bit complicated (see Remark 11); consequently, we will skip the calculation for this model. Instead, we will compare the marginal effects of the SAS-EM neural network and logistic regression. To begin with, the equation of the SAS neural network model takes the following form:
    f(x_1, x_2, \ldots, x_k) = \beta_0 + \sum_i \beta_i \tanh\Big(\sum_j \beta_{ij} x_j\Big),    (5.2)

where \tanh(x) = (e^x − e^{−x})/(e^x + e^{−x}) is the hyperbolic tangent function. For this data set, the SAS neural network model has 19 hidden units with 145 coefficients \beta_0, \beta_i and \beta_{ij}, which presents a considerable chore in the calculation of the partial derivative in (5.1). Consequently, we resorted to the following approximation:

    f(x_1, x_2, \ldots, x_i + 1, \ldots, x_k) − f(x_1, x_2, \ldots, x_i, \ldots, x_k).    (5.3)
A combination of (5.3) and the SAS-EM neural network produced the following table, where the last entry of the last column gives the desired quantity:
Table 5: Marginal effect (SAS-EM neural network)

    ROADIS   INTDIS   POPCHG   EMPDIS   SCHDIST   SLOPE   ELEV    P[DEV = 1]   Increment of P
    4.26     7.769    38.3     7        2         2.24    7.857   0.659583
    4.26     7.769    38.3     8        2         2.24    7.857   0.638335     -0.021248
    4.26     7.769    38.3     9        2         2.24    7.857   0.614647     -0.023688

Turning to the logistic regression

    \log\Big(\frac{p}{1-p}\Big) = a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k,    (5.4)

the marginal effect would be

    \frac{b_i e^{a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k}}{(1 + e^{a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k})^2},    (5.5)

which is −0.0408525 at the given point.
Table 6: Marginal effect at the chosen point

    SAS-EM neural network         -2.4%
    Logistic regression           -4.1%
    Frequency Table (Figure 5)    -5%
The results in Table 6 are relatively similar and are consistent with the graph
in Figure 5 where the change (from EMPDIS = 8 to EMPDIS = 9) is about −5%.
A cautionary note is that Table 6 gives only a snapshot of the overall picture.
In addition, it does not give the effect on the profit as discussed in Section 3.
The hand calculations behind Table 5 are tedious and we hope new software will
soon be available for this important operation.
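Until such software arrives, the finite-difference approximation (5.3) is straightforward to automate for any fitted model; a sketch, where predict_p is an assumed callable returning P[DEV = 1] for a vector of predictor values:

    import numpy as np

    def marginal_effect(predict_p, x, i, h=1.0):
        # Approximate the marginal effect of predictor i at point x by the
        # finite difference (5.3): f(..., x_i + h, ...) - f(..., x_i, ...).
        x_plus = np.array(x, dtype=float)
        x_plus[i] += h
        return predict_p(x_plus) - predict_p(np.array(x, dtype=float))

    # The point from Table 5, in the order ROADIS, INTDIS, POPCHG,
    # EMPDIS, SCHDIST, SLOPE, ELEV; index 3 is EMPDIS.
    x0 = [4.26, 7.769, 38.3, 8.0, 2.0, 2.24, 7.857]
    # marginal_effect(predict_p, x0, i=3)  # about -0.024 for the SAS-EM net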
Recall that the prediction accuracy in Section 2 is close to 80%. So a question is: why can the models not capture the remaining 20%? After an examination of all the variables, we found that, among the predictors related to General Growth and Development, the original data contain only population change. Other pieces of important information are simply missing from the data.
Take the example of the employment rate. The US government does provide employment rate data for large areas such as the entire State of Tennessee, but our study area is confined to Memphis and Shelby County, and information on the employment rate for the entire State provides no help in the predictive modeling. Other variables and vital statistics involve timely information related to both business growth and political fallout, and they are hard to come by. These variables may include anti-sprawl laws, environmental groups, local politics, new business development, new shopping complexes, and new policies and initiatives under new administrations, all coming into play in a rapid-fire, dynamic fashion. Consequently, the 1.4 million records cannot really address all questions in urban growth.
In short, it may sound like a tired cliché that in order to make a good prediction, you need to know all the relevant variables. But this is exactly the situation one faces in data mining.
6. Probability Maps and GIS for Hotspot Detection
In this Section, we use logistic regression to help plot a probability map under
the GIS environment. The map gives future development probabilities for all
undeveloped parcels in the study area. The map assumes a continuation of recent
development trends and assumes that the geographic influence on development
will be similar to the influence in the past.
Figure 7: Probability map
Figure 8: Profit charts, cumulative
The overall picture of Figure 7 is consistent with the urban development trend in the study area. For example, the red area in the lower-right corner indicates a high probability of urban development, while areas in other parts of the map show low probabilities of future activity.
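In spirit, the map assigns each undeveloped parcel its predicted P[DEV = 1] and colors the parcel's grid cell accordingly; a minimal sketch, assuming hypothetical integer grid coordinates ROW and COL on each record and a fitted model with a predict_proba method:

    import matplotlib.pyplot as plt
    import numpy as np

    # `df`, `model`, and `predictors` come from an earlier fit (hypothetical names).
    vacant = df[df["DEV"] == 0].copy()
    vacant["p_dev"] = model.predict_proba(vacant[predictors])[:, 1]

    # Rasterize the parcel probabilities onto the study-area grid.
    grid = np.full((vacant["ROW"].max() + 1, vacant["COL"].max() + 1), np.nan)
    grid[vacant["ROW"], vacant["COL"]] = vacant["p_dev"]

    plt.imshow(grid, cmap="RdYlGn_r")   # red = high development probability
    plt.colorbar(label="P[DEV = 1]")
    plt.show()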
Intuitively, the area with a higher probability of development would be the area with a higher return on investment. Among the original 1.4 million records, 35,535 cases belong to this specific region; they are used to re-build the models in SAS-EM and to re-calculate the profits, as displayed in Figure 8.
Comparing Figure 8 with Figure 1, the profit from selecting only the top 10% of the lands via the GIS Neural Network would increase from 4 to 4.3 investment units, or (4.3 − 4) × $100,000 = $30,000 for the investment on one piece of land.
The next step to push the profit further upward would be a combination of GIS, the Decision Tree, and Case Selection. This action results in 18,300 cases (out of the original 1.4 million records). However, the profit for the top 10% tier remains at 4.3 units (or 95.5% of the maximum profit of 4.5 units), the same as that of using GIS alone. The result appears to echo a golden rule in real estate investment: Location, Location, Location.
7. Concluding Remarks
Data Mining is a relatively new field that was described by an article in Amstat News (9/2003) as a defining event that will impact the future of statistics. MIT Technology Review ranked data mining (Jan/Feb 2001) and Bayesian machine learning (Feb 2004) among the ten emerging technologies that will change the world. One area of application is Microarray analysis: it was observed that in the 1999 Joint Statistical Meetings there was only one paper on DNA Microarrays, but there were over a hundred in 2002 (Amaratunga and Cabrera, 2004) and some 1,200 papers in 2003. This is exponential growth at an astonishing rate. At SAS.com, one can find about 300 case studies on large-scale real-life applications of data mining in big companies. Other success stories can be found at SPSS.com, StatSoft.com, the IBM Intelligent Miner website, and Google.com.
A Google search on “data mining software” resulted in hundreds of thousands
of links. In the wild world of data mining, one can expect to see a huge variety
of data mining tools with varied accuracies and quality. In this study, we focused
on a handful of tools for predictive modeling in an urban development project.
Our study and the related observations indicate the following:
1. The classification accuracies of the tools in this study are relatively similar. The standard errors of the misclassification rates are roughly 0.5%
for training data and 0.8% for hold-out data, respectively. In comparison
to blind guessing (no model), statistical models enhance the classification
accuracy by almost 30% (Section 2).
2. This study concerns only one specific data set; it is likely that different approaches work best for different data, as shown in other examples (Section 2). In fact, various data mining algorithms can function as universal approximators which may be able to model certain kinds of linear or nonlinear relationships or interactions. The more of these tools we have from reputable sources, the better the chance of finding a good predictive model.
3. Adjustment of classification cutoff (threshold) may reduce the misclassification rate, false positive, or false negative rate, depending on the specific
concern of the study (Section 2).
4. Small differences in classification accuracies by different models may result
in substantial financial gains when the models select the top 10% of cases
and when the cost and profit are incorporated in the calculation (Section
3).
5. Complicated models such as Neural Networks appear to suffer from a lack of explanation capability. Nevertheless, the response surfaces, frequency tables, and marginal effects of these models may shed light on the inner workings of the non-linear phenomenon. Furthermore, traditional statistical graphs such as the charts of frequency tables may help reveal outliers in the data (Section 5).
6. Many data mining algorithms generate predictor importance scores. Our investigation indicates that tree-based predictor importance scores work in our study and in certain text mining cases (Section 4). But taken as a whole, there exists wide variability among different tools from different data mining vendors. In fact, even among the tools offered by the same company, the scores of predictor importance can differ dramatically.
7. A change of seed in the randomization can significantly change the structure
of the Decision Tree or other models, and there is no way to tell which model
captures the reality. Nevertheless, many predictive models tend to produce
decent, similar profits for the investment.
8. It can be very fruitful to use a combination of data mining tools and software
of GIS (Geographic Information Systems) to generate probability maps for
visual case selection, and for higher return in the investment that involves
spatial information (Section 6).
In conclusion, data mining is a fast-growing field with countless opportunities and challenges that may re-shape our profession and the world as a whole. With effort and luck, a better future may unfold in our lifetime.
Additional Remarks
1. (Link Analysis): A powerful package for link analysis is the Analyst’s Notebook by a software company called i2. One of the authors of this paper attended an i2 workshop with other attendees from business and government
agencies such as IRS, UBS, AT&T, Pfizer Inc., N.Y. Automobile Insurance
Plan, Commerce Bank, US Army Corps of Engineers, and U.S. Attorney’s
Office, SDNY. Other users of i2 software include the Federal Bureau of Investigation, Drug Enforcement Administration, U.S. Customs Service, U.S.
Postal Service and U.S. Department of Treasury. Note that the commercial
value of the i2 package is about $50,000 per copy, but they are willing to give
it free to academic institutions so that the students can further spread the
software to the commercial world. The software is mainly charts-oriented
but can be teamed up with regression, text mining, GIS and other statistical
tools (http://www.i2.co.uk/Products/Analysts Notebook/default.asp
2. About Table 2:
(a) n = 10,000: For hot-spot detection such as Figure 7, we use the full data set of 1.4 million records. For predictive modeling, n = 500 would be enough in most cases. For certain engineering problems such as 2-dimensional motion tracking, our neural network needs only 30 data points, which correspond to 30 minutes of observations (a rather long time for many tracking problems). For the urban development case, the full data set of 1.4 million records slows down the calculation quite a bit and does not improve the prediction accuracy in any manner. We picked 10,000 for good coverage and for speed when running neural networks and other complicated models.
(b) N/A cases: The DISCIPULUS Genetic Algorithm uses hold-out data, but as noted at the end of Section 2, "In our study, we used the free demo software which does not allow the deployment of the third data set." Salford MARS does not use hold-out data; STATISTICA MARSplines does. In our opinion, hold-out data is necessary for neural networks and most machine learning tools but not really so for MARS, which is a generalization of regression (and there is no need for hold-out data in most cross-sectional regression problems).
(c) Equbits SVM: RBF kernel, tuning parameter = 0.1687, 3-fold cross-validation. Note that 5-fold or 10-fold cross-validation did not improve model accuracy.
(d) SAS-EM regression includes both standard regression and logistic regression. SAS-EM automatically picks one of them according to the target-variable type.
3. (Classification accuracy-I): In the calculation of the classification accuracy, it is desirable to report the confidence interval of the estimate. STATISTICA reports the standard error of the risk estimate in its boosted tree and random forest modules. For other models, one can repeat the experiments with different randomization seeds to get a sense of the variability of the misclassification rates. For example, six runs of the SAS-EM default neural network model produced the following misclassification rates for the hold-out data: 23.25%, 23.4%, 23.65%, 23.9%, 24.95%, 24.05%. Therefore SE(risk estimate) is about 0.61%.
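The calculation is just the sample standard deviation of the re-seeded runs:

    import statistics

    # Hold-out misclassification rates (%) from the six re-seeded runs.
    runs = [23.25, 23.4, 23.65, 23.9, 24.95, 24.05]
    print(statistics.mean(runs))    # about 23.87
    print(statistics.stdev(runs))   # about 0.61, the SE of the risk estimate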
4. (Classification accuracy-II): A special trick for boosting the classification accuracy is to set up a spreadsheet that contains columns of False Positives, False Negatives, Hit Rates, and various classification cutoffs (thresholds). The user then selects the cutoff interactively to hunt for the best rate. Salford provides this kind of spreadsheet. SAS-EM takes this approach one step further and provides threshold-based charts on which the user can move the cursor to explore the impact of different classification cutoffs in a fun, interactive manner. In the chart, the back curve is Sensitivity, the middle curve is Specificity, and the front curve is the overall accuracy.
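The same cutoff hunt is easy to script; a sketch that sweeps the threshold and reports the three curves, assuming p_hat and y are NumPy arrays of predicted probabilities and observed 0/1 labels:

    import numpy as np

    def cutoff_table(p_hat, y, thresholds=np.arange(0.05, 1.0, 0.05)):
        rows = []
        for c in thresholds:
            pred = (p_hat >= c).astype(int)
            tp = ((pred == 1) & (y == 1)).sum()
            tn = ((pred == 0) & (y == 0)).sum()
            fp = ((pred == 1) & (y == 0)).sum()
            fn = ((pred == 0) & (y == 1)).sum()
            rows.append((c,
                         tp / (tp + fn),          # sensitivity (back curve)
                         tn / (tn + fp),          # specificity (middle curve)
                         (tp + tn) / len(y)))     # overall accuracy (front curve)
        return rows

    for c, sens, spec, acc in cutoff_table(p_hat, y):
        print(f"cutoff {c:.2f}: sens {sens:.3f}, spec {spec:.3f}, acc {acc:.3f}")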
5. One possibility for boosting the hit rate is the use of higher-order interaction terms in logistic regression and in MARSplines. Among the 14 variables in our study, the 6 interval predictors have various distributions, as shown in the scatterplot matrix: the graphs on the diagonal are histograms, while the charts off the diagonal are scatterplots with a quadratic line fit. The histogram of ELEV is close to normal, while the other histograms are oddly shaped. The figure also indicates that these continuous variables appear uncorrelated or only weakly correlated. Nevertheless, interactions may exist among other variables, and when we considered 2nd-order interactions, the overall hit rate of STATISTICA MARSplines increased from 75.8% (1st order, no interaction) to 77.1% when running in interactive mode. On the other hand, we tried transformations of the variables (log, square root, inverse, square, exponential, standardize, maximize normality, maximize correlation with target) and trimming/filtering/binning of the non-normal variables, but none of these techniques really improved the predictive power in this particular study.
For each individual tool, it is usually possible to calibrate the model parameters to increase the prediction accuracy. For example, the default neural network in SAS-EM is a multilayer perceptron (MLP) model, but the user has the flexibility to adjust the number of hidden layers, the objective functions, convergence parameters, plus a list of other options. Or the user can change the architecture of the model from an MLP to an RBF (radial
basis function) model altogether. For the second example, the mutation
rate of the DISCIPULUS Genetic Algorithm is often fixed at 70% or higher, which would be a shock to biologists. But the fact of the matter is that the
DISCIPULUS Genetic Algorithm gave us the best rates for the training
and validation data. For the third example, the default of boosted trees is
usually set at 200 trees, but 1,000 trees were used in TreeNet when Salford
Systems captured the grand prize of the 2003 Duke University data mining
competition.
A few years from now, someone may come up with a meta-algorithm and a
super computing machine to incorporate all the strengths of these models
for the best outcomes under various criteria. It is conceivable that the
mathematicians and engineers at Google Inc. might have been working
on that without any disclosure to the public. The field of data mining
is rapidly unfolding, with countless holes in popular models and software
packages (see, e.g., Section 4). When it comes to the statistical analysis of
large data sets, the best is yet to come.
6. (Variable Importance-I): For another example, a series of articles in the New York Times indicates that both the Republican and the Democratic parties rely heavily on data mining tools to predict voters' political leanings and "to identify the issues that most motivate them." It is conceivable that statistical techniques for variable importance, if well calibrated, can be very useful for ranking the issues that are closest to the hearts of the voters (and perhaps for reducing the harassment perceived by voters as well). (See http://www.nytimes.com/2006/10/28/nyregion/28bloomberg.html?hp&ex=1162094400&en=487b3f42dc55b77d&ei=5094&partner=homepage, http://select.nytimes.com/gst/abstract.html?res=F50B1EF63E5B0C7A8EDDA90994DE404482, http://time-blog.com/allen_report/2006/10/why_some_top_republicans_think.html, http://www.washingtonpost.com/wp-dyn/content/article/2006/03/07/AR2006030701860_pf.html).
7. (Variable Importance-II): In Table 4, the scores in the second column (Salford TreeNet) are sorted in descending order. Scores in all other columns
are then entered accordingly.
8. (Variable Importance-III): In one experiment, we used the SAS-EM default neural network to compare classification accuracies in the following manner: (1) each time we delete one predictor from the full model of 14 predictors and compare the accuracy with that of the full model; (2) each time we use only one predictor in the model and compare the accuracy with that of the model without any predictor. The experiment indicates that PUB and SEWER are the two top-ranking predictors, which matches the results of the following tools (see Table 4): SALFORD TreeNet, Random Forest, CART, STATISTICA Random Forest, and the SAS-EM Decision Tree. This is a small comfort we have in this section. Nevertheless, if the predictors are highly correlated, then this scheme may create other problems. A sketch of the two schemes is given below.
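A sketch of the two schemes, with a generic scikit-learn classifier standing in for the SAS-EM neural network:

    from sklearn.base import clone
    from sklearn.metrics import accuracy_score

    def drop_one_accuracies(model, X_train, y_train, X_test, y_test):
        # Scheme (1): refit with each predictor deleted in turn and record
        # the hold-out accuracy of the reduced model.
        out = {}
        for col in X_train.columns:
            m = clone(model).fit(X_train.drop(columns=col), y_train)
            out[col] = accuracy_score(y_test,
                                      m.predict(X_test.drop(columns=col)))
        return out

    def only_one_accuracies(model, X_train, y_train, X_test, y_test):
        # Scheme (2): refit with a single predictor at a time.
        out = {}
        for col in X_train.columns:
            m = clone(model).fit(X_train[[col]], y_train)
            out[col] = accuracy_score(y_test, m.predict(X_test[[col]]))
        return out

A large accuracy drop in scheme (1), or a high stand-alone accuracy in scheme (2), flags a predictor as important; in our data both point to PUB and SEWER.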
9. (Variable Importance-IV): In the statistical literature, the analysis of variable importance is often called Sensitivity Analysis. The STATISTICA Help file cautions that "there may be interdependent variables that are useful only if included as a set," and that "sensitivity analysis does not rate the 'usefulness' of variables in modeling in a reliable or absolute manner." The STATISTICA Help file urges the user to be "cautious in the conclusions you draw about the importance of variables" but maintains that "nonetheless, in practice it is extremely useful" and that "if a number of models are studied, it is often possible to identify key variables that are always of high sensitivity, others that are always of low sensitivity, and 'ambiguous' variables that change ratings and probably carry mutually redundant information." The above cautionary notes may help in many case studies; however, in light of the inconsistent scores in Table 4, those cautionary notes are correct only in some cases but amount to sweeping the dust under the carpet in other applications.
10. (Variable Importance-V): A perspective similar to G.E.P. Box's can be drawn from a proteomic study (Conrads et al., 2004, p. 177, Figure 8) that involves some 350,000 predictors. The study revealed four distinct models, each with a handful of predictors, that all produced 100% accuracy as measured by sensitivity and specificity in testing and validation. Consequently, in certain applications it may be more meaningful to focus on a set of important variables than to consider the ranking or the scores of predictor importance. In short, we believe the importance scores are useful in many applications, but the current state of affairs is rather chaotic and deserves further scrutiny from the statistical community.
11. The output takes more than one page and is available from the authors. For binary prediction, the MARSplines module does not use any link function in the model-building process. Instead, the module treats the binary target as a continuous variable. The end result is remarkably accurate, but the equation may yield probabilities outside the [0, 1] range.
Acknowledgement
The authors would like to thank the reviewers for their comments on the
earlier drafts which led to substantial improvement of the manuscript. In addition, the authors would like to thank the software vendors for their installation
assistance and for their prompt replies to our questions.
References
Adnan, A. and Bastos, E. (2005). A comparative estimation of machine learning methods on QSAR data sets. SUGI-30 Proceedings.
Amaratunga, D. and Cabrera, J. (2004). Exploration and Analysis of DNA Microarray
and Protein Array Data. Wiley.
Bastos, E. and Wolfinger, R. (2004). Data mining in clinical and genomics data. Presented at M2004, the SAS 7th Annual Data Mining Technology Conference.
Berry, M. J. A. and Linoff, G. S. (2004). Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, 2nd edition. Wiley.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification
and Regression Trees. Wadsworth International Group.
Conrads, T. P., Fusaro, V. A., Ross, S., Johann, D., Rajapakse, V., Hitt, B. A., Steinberg, S. M., Kohn, E. C., Fishman, D. A., Whiteley, G., Barrett, J. C., Liotta,
L. A., Petricoin III, E. F., and Veenstra, T. D. (2004). High-resolution serum
proteomic features for ovarian cancer detection. Endocrine-Related Cancer 11,
163-178.
De Veaux, R. D. (2005). Data mining in the real world: five lessons learned in the
pit. Presented at the 26th Spring Symposium on Statistical Data Mining, the New
Jersey Chapter of the American Statistical Association.
Draghici, S. (2003). Data Analysis Tools for DNA Microarrays. Chapman & Hall/CRC.
Freund, Y. and Schapire, R. (1996). Experiments with a new boosting algorithm. Machine Learning: Proceedings of the Thirteenth International Conference, 148-156.
Friedman, J. H. (2006). Recent advances in predictive (machine) learning. Journal of
Classification 23, 175-197.
Hastie, T., Tibshirani, R. and Friedman, J. H. (2001). The Elements of Statistical
Learning: Data Mining, Inference, and Prediction. Springer.
Larose, D. T. (2005). Discovering Knowledge in Data. Wiley.
Luan, J. (2006). Using academic behavior index (AB-index) to develop a learner typology for managing enrollment and course offerings - a data mining approach. Institutional Research Applications 10.
Shang, W., Huang, H., Zhu, H., Lin, Y., Qu, Y. and Wang, Z. (2007). A novel feature
selection algorithm for text categorization. Expert Systems with Applications 33,
1-5.
Spector, L. (2004). Automatic Quantum Computer Programming: A Genetic Programming Approach. Kluwer Academic Publishers.
Tomassini, M. (2005). Spatially Structured Evolutionary Algorithms: Artificial Evolution in Space and Time. Springer.
Tan, K. C., Khor, E. F., and Lee, T. H. (2005). Multi-objective Evolutionary Algorithms and Applications. Springer.
Yu, B. (2005). Mining earth science data for geophysical structure: a case study in
cloud detection. Presented at the 5th SIAM International Conference on Data
Mining.
Yu, J. S. and Chen, X. W. (2005). Bayesian neural network approaches to ovarian
cancer identification from high-resolution mass spectrometry data. Bioinformatics
21, Suppl. 1, i487-i494.
Yu, J. S., Ongarello, S., Fiedler, R., Chen, X. W., Toffolo, G., Cobelli, C., and Trajanoski, Z. (2005). Ovarian cancer identification based on dimensionality reduction
for high-throughput mass spectrometry data. Bioinformatics 21, 2200-2209.
Received April 15, 2007; accepted October 7, 2007.
Chamont Wang
Department of Mathematics and Statistics
The College of New Jersey
2000 Pennington Road
Ewing, NJ 08628-4700, USA
wang@tcnj.edu
Pin-Shuo Liu
Department of Geography and Urban Studies
William Paterson University
300 Pompton Road, Wayne, NJ 07470, USA
LiuP@wpunj.edu