Journal of Data Science 6(2008), 389-414

Data Mining and Hotspot Detection in an Urban Development Project

Chamont Wang^1 and Pin-Shuo Liu^2
^1 The College of New Jersey and ^2 William Paterson University

Abstract: Modern statistical analysis often involves large amounts of data from many application areas with diverse data types and complicated data structures. This paper gives a brief survey of certain large-scale applications. In addition, this paper compares a number of data mining tools in the study of a specific data set which has 1.4 million cases, 14 predictors and a binary response variable. The study focuses on predictive models that include Classification Tree, Neural Network, Stochastic Gradient Boosting, and Multivariate Adaptive Regression Splines. The study found that the variable importance scores generated by different data mining tools exhibit wide variability and that users need to be cautious in the application of these scores. On the other hand, the response surfaces and the classification accuracies of most models are relatively similar, yet the financial implications can be very profound when the models select the top 10% of cases and when cost and profit are incorporated in the calculation. Finally, the Decision Tree, Predictor Importance, and Geographic Information Systems (GIS) are used for Hotspot Detection to further enhance the profit to 95.5% of its full potential.

Key words: Case selection, data mining, geographic information systems, marginal effect, predictive modeling, profit, variable importance.

1. Introduction

Modern statistical analysis often involves large amounts of data with tens, hundreds or thousands of variables. Case studies that involve large data sets range from biological applications, web mining, political campaigns, government services, crime-fighting and the detection of financial fraud, to name just a few.
For example, in biomarker pattern analysis, each data set usually contains hundreds of thousands of predictors in the form of mass-to-charge ratios (m/z) used to generate sets of biomarker classifiers (Conrads et al., 2004; Yu et al., 2005). For another example, it is well known that genome projects and other large-scale biological research projects are producing enormous quantities of biological data. The entire human genome, for instance, with its sequence represented by the letters A, T, C and G, would fill approximately 1000 books with 1000 pages in each book when printed. Another line of development is Microarray technology, which allows scientists to study the behavior patterns of several thousands of genes simultaneously (Amaratunga and Cabrera, 2004).

In a different area of application, Google uses data mining techniques extensively on the vast universe of its web data. The techniques include probabilistic models for page rank, text mining, spell check, and statistical language translation that may involve hundreds of languages around the globe. A New York Times article (October 30, 2005) reported that Google utilizes millions of variables about its users and advertisers in its predictive modeling to deliver the message to which each user is most likely to respond. The Times reported that because of this technology, users click ads 50 percent to 100 percent more often on Google than they do on Yahoo, and that is a powerful driver of Google's growth and profits.

In political campaigns, a series of articles in the New York Times and other news outlets indicate that both the Republican and the Democratic parties rely heavily on data mining tools to micro-target potential voters. When well done, these get-out-the-vote programs typically raise a candidate's Election Day performance by two to four percentage points. Whether you like it or not, data mining tools indeed assisted George W.
Bush in his presidential election in 2004 and perhaps in 2000 as well. According to a New York Times article (June 9, 2003), the Democratic National Committee boasts electronic files on 158 million Americans, and the Republican National Committee says it is way ahead, with files on 165 million (see references and other details in Section 4, Variable Importance).

The US Federal Government has also undertaken very extensive efforts on data mining, including government services and homeland security, with about 40% of Federal Agencies using or planning to use data mining tools in their services and operations (see, for example, documents prepared by the US General Accounting Office in May 2004 and by the US Department of Defense in March 2004^2). A special tool for crime-fighting and the detection of financial fraud is called link analysis.

A current hot issue in the data mining community is the Netflix predictive modeling competition. The company offers $50,000 each year for five years and an additional $1,000,000 at the end of the 5-year competition for the best model that improves prediction accuracy by reducing their RMSE by 10%. The data involves more than 100 million ratings from over 480 thousand randomly-chosen, anonymous customers. The competition has so far attracted 21,865 contestants on 17,690 teams from 147 different countries and probably will produce some good models at the end of the 5-year saga^3.

For other large data sets, an online repository can be found^4. The site is supported by a grant from the Information and Data Management Program at the National Science Foundation. The data bank has a total of 32 data sets, including E. Coli Genes, El Nino Data, the Insurance Company Benchmark, and the Reuters-21578 Text Categorization Collection.

^2 See http://www.gao.gov/new.items/d04548.pdf, http://www.epic.org/privacy/profiling/tia/tapac report.pdf
On the other hand, statisticians who are interested in large health data sets or links may want to visit the vitalnet site^5, which includes birth data from various States, Medicaid datasets, mortality data, and a variety of other health data.

In a recent volume titled "Data Mining in Action: Case Studies of Enrollment Management," of New Directions for Institutional Research (2006, editors: Luan and Zhao), Decision Trees, Neural Networks, and other data mining tools are used for the following tasks: advanced placement, student mobility, graduation rates, predicting college admissions yield, estimating student retention and degree-completion time, and developing a learner typology for managing enrollment and course offerings. One of the papers (Luan, 2006) uses a combination of clustering analysis and traditional statistical techniques in a very skillful manner and is available online^6. Other articles can also be tracked down^7.

Tools for mining large data sets include traditional statistical methods and new techniques from the machine-learning community. The tools are commonly grouped in the following categories: supervised methods, unsupervised methods, and everything else in between. The first category involves a target variable in predictive modeling and covers regression, decision trees, and neural networks. The second category includes multivariate methods such as cluster analysis, principal component analysis, data visualization, pattern discovery, and novel dimension-reduction techniques that do not involve target variables in prediction. The third category includes market basket analysis (a.k.a. association rules), which is commonly used by Amazon.com, Blockbuster.com, and countless other online stores. For market basket analysis, a clever use of conditional probability and the a priori algorithm can be found in Chapter 10 of the book by Larose (2005).
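The support and confidence computations at the heart of the a priori algorithm can be sketched in a few lines. This is only an illustration: the transactions and item names below are hypothetical stand-ins, not data from any of the stores mentioned.

```python
from itertools import combinations

# Toy transactions (hypothetical items, for illustration only)
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Conditional probability P(consequent | antecedent)."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / support(antecedent, transactions)

# The a priori pruning step: keep only item pairs with minimum support 0.4
items = {"bread", "milk", "diapers", "beer"}
pairs = [frozenset(c) for c in combinations(items, 2)]
frequent = [p for p in pairs if support(p, transactions) >= 0.4]

print(confidence({"diapers"}, {"beer"}, transactions))  # prints 0.75
```

Here 3 of the 4 diaper baskets also contain beer, so the rule "diapers implies beer" has confidence 0.75; the a priori algorithm avoids enumerating all itemsets by only extending itemsets that already meet the support threshold.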
The major difference between the traditional statistical techniques and the tools from the machine learning community is that the former category puts more emphasis on the underlying mechanism that generated the data, while the latter is more concerned with the utility and predictive power of the models on the hold-out data that is not used in the modeling process. Study after study has shown that the two approaches can complement each other in a very productive manner (see, e.g., Conrads et al., 2004; Yu et al., 2005).

In this paper, the focus is placed on certain aspects of the supervised methods. Specifically, we consider a number of predictive models such as regression, neural networks, boosted trees, and support vector machines in the context of an urban development project. The data has 1.4 million cases, 14 predictors, and a binary response variable.

Section 2 compares model accuracies of various data mining tools. In addition, the section discusses a number of possible ways to decrease misclassification, false positive, or false negative rates. Section 3 uses a unique feature of SAS Enterprise Miner to produce profit charts, the hallmark of SAS-EM. The chart is indeed a money maker in the commercial applications of data mining. Section 4 discusses the current state of predictor importance and urges users to exercise caution in the application of these scores. Section 5 compares marginal effects and response surfaces of various models. Sections 4 and 5 together reveal weaknesses and limitations in the original data that are common in the mining of large data sets.

^3 See http://www.netflixprize.com/leaderboard
^5 See http://www.ehdp.com/vitalnet/datasets.htm
^6 See http://www.airweb.org/page.asp?page=915
^7 See http://www.wiley.com/WileyCDA/WileyTitle/productCd-IR.html
Section 6 uses Geographic Information Systems (GIS) to screen cases, enhancing profit to 95.5% of its full potential. Specifically, this study applies GIS to create an urban growth database of Memphis and Shelby County, Tennessee to estimate the possible relationship between urban development and several growth, stimulant, and deterrent variables. These variables, including land-use change between 1990 and 2000 in the study area, were collected and constructed for the study. The dependent variable for each geographic unit has two possible values: 1 indicates that the development status of the site changed from vacant to developed, while 0 indicates that it has remained vacant. The data set has 1,420,287 cases and includes the following 14 predictors:

Interval variables: ELEV (elevation, classified into several categories), EMPDIS (distance to employment center, ratio scale), INTDIS (proximity to highway intersection, ratio scale), POPCHG (population change between 1990 and 2000, ratio scale), ROADIS (proximity to major road, ratio scale), SLOPE (percent of slope, classified into several categories).

Ordinal variable: SCHDIS (school performance, 1 = poor, 2 = better, 3 = best school performance).

Categorical variables: FLOODWAY (either in the floodway or not), FP100 (either in the 100-year flood plain or not), FP500 (either in the 500-year flood plain or not), PUB (either public land or not), SEWER (either in the sewer service area or not), SURWAT (either in surface water or not), WET (either in the wetland area or not).

In statistical modeling, it is recommended that the investigator examine the distribution of each variable before building models. In this paper, the results will be shown first, and the distributional and statistical details will be presented when necessary.

2.
Model Accuracies of Various Data Mining Tools

For binary prediction, model accuracies can be assessed by a variety of criteria. In SAS-EM, the tools include the following: (a) summary statistics: AIC, BIC, misclassification rates for training, validation and hold-out data, root average squared error, maximum absolute error, root final prediction error, etc.; (b) graphical comparisons: %Response, %Captured Response, lift chart, profit chart, ROI (Return On Investment) chart, confusion matrix plot, ROC chart, sensitivity and specificity charts, and response threshold chart. In Microarray and biomarker pattern analysis, ROC, sensitivity and specificity are the most popular (see, e.g., Draghici, 2003; Conrads et al., 2004; Yu et al., 2005). In the literature, other criteria and terms such as precision, recall, and F-measure are used. The countless terms make the field very colorful or confusing, depending on the perspective of the user. In this section, the focus will be the classification accuracy, which is defined as one minus the misclassification rate (a.k.a. risk estimate).

The data mining tools we used in this section include the following.

The List of Predictive Models:
1. Logistic Regression (via SAS-EM)
2. Neural Network (SAS-EM and STATISTICA)
3. Decision Tree (SAS-EM, STATISTICA C&RT, and SALFORD SYSTEMS CART)
4. Random Forest (SALFORD and STATISTICA)
5. Stochastic Boosting Trees (SALFORD TreeNet and STATISTICA boosted trees)
6. Multivariate Adaptive Regression Splines (SALFORD MARS and STATISTICA MARSplines)
7. Support Vector Machine (STATISTICA and Equbits SVM)
8. Genetic Algorithm (DISCIPULUS Genetic Algorithm)

Explanations of these models (except the Genetic Algorithm) can be found in Hastie et al. (2001) or in the electronic textbook at StatSoft.com. Explanations of Genetic Algorithms can be found in Berry and Linoff (2004), Spector (2004), Tomassini (2005) or Tan, Khor, and Lee (2005).
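Since accuracy, misclassification, sensitivity, specificity, and precision all derive from the same 2x2 confusion matrix, a small helper makes the definitions concrete. This is only a sketch; the counts fed in below are illustrative cell counts of the kind produced by any binary classifier.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard classification criteria from a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,           # 1 - misclassification rate
        "misclassification": (fp + fn) / total,  # the "risk estimate"
        "sensitivity": tp / (tp + fn),           # true positive rate (recall)
        "specificity": tn / (tn + fp),           # true negative rate
        "precision": tp / (tp + fp),
    }

# Illustrative counts (these happen to match the cells of Table 3 below)
m = binary_metrics(tp=4310, fp=1374, fn=974, tn=3342)
print(round(m["accuracy"], 3))  # prints 0.765
```

Sweeping the classification threshold and recomputing sensitivity and specificity at each cutoff is exactly what produces the ROC chart mentioned above.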
In this list, certain data mining algorithms can function as universal approximators that may be able to model a variety of linear or nonlinear relationships and interactions. Arguably, the more of those tools we have from reputable sources, the better the chance of finding a good predictive model. Take the example of the Netflix $1M competition: if you have time or an assistant, you may want to try all the tools to find the best root mean squared error (RMSE) as required by the competition. The same argument would apply to other high-stakes situations.

In the list, the first three models (regression, tree, and neural network) are the major tools of SAS-EM predictive modeling. The fourth tool, Random Forest, is the brainchild of the late Leo Breiman, Department of Statistics, University of California, Berkeley, 1980-2005. Yu et al. (2005) reported that the method of Random Forest outperformed other methods like linear discriminant analysis, quadratic discriminant analysis, k-nearest neighbor, and boosting classification trees. DeVeaux (2005) compared various data mining tools in a study of 1,618 mammograms with the following results:

Table 1: Random forest vs. radiologists

                                False Positives   False Negatives
Simple Tree                     32.20%            33.70%
Neural Network                  25.50%            31.70%
Boosted Trees                   24.90%            32.50%
Bagged Trees (Random Forest)    19.30%            28.80%
Radiologists                    22.40%            35.80%

The DeVeaux study indicates that Random Forest may perform better than radiologists and better than many other data mining tools. In our study (Table 2), this tool is indeed a very competitive algorithm.

The Boosted Tree algorithm is widely credited as an invention of Jerome Friedman, Department of Statistics, Stanford University. But Friedman (2006) himself stated that the technique was first proposed by Freund and Schapire (1996).
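A rough sense of the single-tree vs. bagging vs. boosting comparison in Table 1 can be had on synthetic data. This is only a sketch with illustrative scikit-learn defaults, not a reproduction of the DeVeaux study; the hyperparameters and data generator are assumptions.

```python
# Compare a single tree, bagged trees (random forest), and boosted trees
# on synthetic data; accuracies will vary with the data and settings.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=14, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "simple tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "boosted trees": GradientBoostingClassifier(n_estimators=200, random_state=0),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
print(scores)
```

On most draws the two ensemble methods outperform the single tree, echoing the pattern in Table 1, although the ordering of the ensembles themselves depends on the data.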
The tool is called TreeNet by Salford Systems and was the winner of the 2003 Duke University Data Mining competition, sweeping all four categories. In 2004, the TreeNet submission placed 2nd in the KDD Cup, a cut-throat data mining competition. The honor goes to the software and to a Salford client who used the 3-day evaluation copy to accomplish the feat^8.

Multivariate Adaptive Regression Splines (MARS) is another invention of Jerome Friedman. The tool, together with CART, HotSpotDetector and TreeNet, helped Salford win the 2000 KDD Cup International Web Mining Competition.

The Support Vector Machine (SVM) methodology is one of the most important tools in the machine learning community. Bastos and Wolfinger (2004) reported a 4% error rate using SVM, as compared to a 27% error rate on the same set of data in a 2002 paper in The New England Journal of Medicine. Yu (2005) used SVM and achieved a 2% error rate in a case study on cloud detection, as compared to a 53% error rate by expert labels. Adnan and Bastos (2005) reported substantial advantages of SVM over regression and neural networks. Furthermore, EQUBITS SVM achieved a stunning 99.6% accuracy in the 2004 UCSD data mining competition^9.

The last tool in our list is the DISCIPULUS Genetic Algorithm (GA), which is based on biological inspirations such as selection, cross-over, and mutation in the search for the optimal solution. In recent years, evolutionary algorithms have also been used to enhance automatic quantum computer programming (see, e.g., Spector, 2004). In addition, much to the delight of students in data mining classes, Artificial Evolution is used in conjunction with Neural Networks to animate tricky stunts in blockbuster movies such as Troy and The Lord of the Rings^10.

Table 2 displays the classification rates of each model. All models use the same 10,000 cases that were selected at random from the full data set of 1.4 million records.
Of the 10,000 cases, 80% of the data are used to build the model, and the remaining 20% are used as hold-out data to test the classification accuracy. In the modeling process, certain algorithms further split the 80% data into training and validation data sets to prevent over-fitting, while others (such as Equbits SVM) use v-fold cross-validation to search for the best fit. The layout of Table 2 follows the list sequence at the beginning of this section.

^8 See http://www.salford-systems.com/press1.php, http://www.salford-systems.com/press8.php
^9 See http://www.siam.org/meetings/sdm05/binyu.htm, http://equbits.com/casestudies/SAS%20Case%20Study.pdf, http://www.equbits.com/
^10 See, e.g., http://www.naturalmotion.com/downloads.htm, http://www.wired.com/wired/archive/12.01/stuntbots.html

Table 2: Model accuracies of various data mining tools (see Remark 2 at end of paper), n = 10,000

Model                                              Accuracy          Accuracy
                                                   (Hold-out data)   (Training + Validation data)
No Model                                           52.8%             52.8%
SAS Enterprise Miner, Logistic Regression          76.3%             75.9% (T), 75.2% (V)
SAS Enterprise Miner, Neural Network               76.6%             75.8% (T), 75.8% (V)
STATISTICA, Neural Network                         76.0%             80.0% (T)
SALFORD, CART (Gini index)                         78.3%             78.1% (T)
SAS Enterprise Miner, Decision Tree (Chi-square)   75.4%             75.1% (T), 75.6% (V)
SAS Enterprise Miner, Decision Tree (entropy)      76.2%             78.6% (T), 77.6% (V)
SALFORD, Random Forest                             80%               78.2% (T)
STATISTICA, Random Forest                          76.9%             75.4% (T), 73.3% (V)
SALFORD, TreeNet (1000 trees)                      80%               80.1% (T)
STATISTICA, Boosted Trees (1000 trees)             77.9%             83.5% (T), 79.2% (V)
SALFORD, MARS                                      N/A               76.5% (T)
STATISTICA, MARSplines                             75.9%             75.9% (T)
EQUBITS, Support Vector Machine                    77.4%             77.4% (T)
STATISTICA, Support Vector Machine                 75.0%             75.6% (T)
DISCIPULUS, Genetic Algorithm                      N/A               80.4% (T), 80.2% (V)

The second row of Table 2 is labeled "No Model," indicating that 52.8% of the past records were "developed" while 47.2% were "not developed." In comparison, the black-box models such as TreeNet and
Genetic Algorithm increase the classification accuracy by almost 30 percentage points. This is indeed a genuine contribution of data mining tools to real-world predictive modeling. In our study, the SVM models were built respectively by experts at Equbits and StatSoft. Hence we are somewhat disappointed with the 77.4% and 75% accuracies. On the other hand, the Genetic Algorithm gave us 80.4% accuracy for training data and 80.2% for validation data, which are higher than the rates of most tools we have tried. A cautionary note, however, is that Genetic Algorithms may remember their historical runs and hence have a tendency to over-fit both the training and validation data. Consequently, one may have to reserve a part of the original data that is completely distinct from the model building process for the assessment of model accuracy. In our study, we used the free demo software, which does not allow the deployment of a third data set.

Note that Table 2 does not rank the models by their classification accuracies. The standard errors of the risk estimates are roughly 0.5% for training data and 0.6% to 0.8% for hold-out data, respectively (see Remark 3). Consequently, the differences between the Logistic Regression and Neural Network by SAS-EM (76.3% and 76.6%, respectively) are not statistically significant. The same can be said among many other models.

3. Classification with a Cost Structure

This section discusses binary classification from a profit-cost point of view. In Table 2, the difference in classification accuracy between SAS-EM logistic regression and neural network appears small (76.3% vs. 76.6% for hold-out data), but the financial implication can be very substantial, as will be shown in this section.

To simplify the discussion, assume that the land value in the study area will increase 500% in a 10-year period when the land changes from a vacant lot into residential use.
The land value will stay the same when a vacant lot stays vacant in the same period. In addition, the cost of the investment would be 50% of the purchase price, in terms of investment loss due to the cost of interest over the 10-year period. Furthermore, assume that a model produced the following classification matrix (n = 10,000):

Table 3: Classification Matrix for Profit Analysis

                              Predicted DEV = 1 (Invest)   Predicted DEV = 0 (Do Not Invest)
Observed DEV = 1              4310 (76%)                   974
Observed DEV = 0              1374 (24%)                   3342
Column total & percentages    5684 (100%)                  4316

Then the expected profit in the 10-year period would be (5 × 4310 − 0.5 × 5684)/5684 = 3.3. Recall that in the original data, P[DEV = 1] = 53% and P[DEV = 0] = 47%, hence the net profit of the blind investment would be 5 × 0.53 − 0.5 = 2.15. The difference is about 3.3 − 2.15 = 1.15 investment units. However, in the original data about 9.1% of cases are public land, which would prohibit investment. After the deletion of these cases, the data size was reduced from the original 1,420,287 to 1,290,897. A random sample (n = 129,089) of the reduced data (N = 1,290,897) includes 58.4% developed cases, hence the corresponding net profit of the blind investment would be 5 × 0.584 − 0.5 = 2.42.

A special feature in SAS-EM allows the user to pick the top candidates (e.g., 10%, 20%, etc.) to improve the investment profit. This is shown in the following charts:

Figure 1: Profit charts, cumulative
Figure 2: Profit charts, non-cumulative

The chart uses 10 bins and ranks from left to right the best 10% of the lands, the second best 10% of the lands, etc. The chart indicates that the Neural Network model is the preferred model, that the profit of selecting only the top 10% of the lands via the Neural Network model would be about 4 investment units, that of the top 20% would be 3.85 units, etc.
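The decile ranking behind the profit charts can be sketched as follows, using the assumed payoff of 5 units per developed case and an 0.5-unit cost per investment. The predicted probabilities and outcomes below are made up for illustration; a real run would use a model's scores on the hold-out data.

```python
def decile_profits(probs, outcomes, gain=5.0, cost=0.5, bins=10):
    """Rank cases by predicted probability (best first), split into `bins`
    groups, and return the cumulative profit per invested case for each."""
    ranked = [o for _, o in sorted(zip(probs, outcomes), key=lambda t: -t[0])]
    n = len(ranked)
    out = []
    for b in range(1, bins + 1):
        top = ranked[: round(n * b / bins)]            # best b/bins of cases
        out.append((gain * sum(top) - cost * len(top)) / len(top))
    return out

# Tiny illustration: a well-ranked model concentrates DEV = 1 cases up front
probs    = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1,   1,   1,   0,   1,   0,   0,   1,   0,   0]
profits = decile_profits(probs, outcomes)
# profits[0] == 4.5 (top bin), profits[-1] == 2.0 (investing in everything)
```

The first entries correspond to the left side of the cumulative chart (Figure 1); differencing adjacent bins would give the non-cumulative view of Figure 2.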
In other words, if each investment unit is $100,000, then the top 10% of the land as selected by the neural network model would outperform the blind bet by (4 − 2.42) × $100,000 = $158,000 for the investment on one piece of land after 10 years. In the event that the top 10% are not available to the investor, the profit for the second group via the Neural Network would be 3.7 units, as shown in Figure 2.

The chart in Figure 1 is cumulative while the one in Figure 2 is non-cumulative. The second chart gives a warning that investment in the bottom 35% or so would be worse than the average of blind bets. In other words, the models predict that the probabilities of development for the bottom 35% are very low and investors should try to avoid those pieces of land.

4. Variable Importance

Given a dozen or thousands of predictors, a natural question is how important a variable (or a set of variables) is in the prediction of the target. For instance, in the emerging technology of biomarker pattern analysis, data from high-resolution mass spectrometry are often used to generate a set of biomarker classifiers (Conrads et al., 2004; Yu et al., 2005). Each data set usually contains hundreds of thousands of predictors in the form of mass-to-charge ratios (m/z) that need to be binned to reduce the computational complexity. Statistical techniques such as the Kolmogorov-Smirnov test, the Wilcoxon test, the Bonferroni correction, and other tools are used to further reduce the dimension of the feature space (i.e., predictor space) without losing its biological meaning. The success of the technology depends on the ability of a selected set of features to transcend the biologic variability, process variations, and methodologically related background noise (Conrads et al., 2004; see Remark 6).
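The univariate screening step just described can be sketched with scipy. The data below are synthetic stand-ins for binned m/z features, and the 0.05 level divided by the number of features (the Bonferroni correction) is an illustrative choice.

```python
# Feature screening with the Kolmogorov-Smirnov test plus Bonferroni:
# keep a feature only if its two-class distributions differ significantly.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_features = 50
X = rng.normal(size=(200, n_features))       # synthetic "binned m/z" features
y = rng.integers(0, 2, size=200)             # two biological classes
X[:, 0] += 1.5 * y                           # feature 0 genuinely separates them

pvals = np.array([
    stats.ks_2samp(X[y == 0, j], X[y == 1, j]).pvalue
    for j in range(n_features)
])
keep = np.where(pvals < 0.05 / n_features)[0]   # Bonferroni-corrected threshold
print(keep)
```

Only features that survive the corrected threshold move on to the model-building stage, which is how the feature space is cut from hundreds of thousands of bins to a workable size.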
For another example, Google has created an automated way to search for talent among the more than 100,000 job applications it receives each month (New York Times, 01/03/2007). The data mining process involves extensive surveys that explore an applicant's attitudes, behavior, personality and biographical details going back to high school. The survey has about 300 items, including questions such as: Is your work space messy or neat? Are you an extrovert or an introvert? What magazines do you subscribe to? What pets do you have? The answers are fed into a series of formulas to predict how well a person will fit into Google's chaotic and competitive culture. The Google studies found that certain traditional yardsticks are not reliable predictors, while other variables can help find candidates in several areas. The Times also reported that the use of surveys similar to Google's new system is on the rise, which in turn may present new challenges and new opportunities for statistical prediction^11.

^11 See http://www.nytimes.com/2007/01/03/technology/03google.html?pagewanted=2&ei=5094&en=4d3171ddca1dab7d&hp&ex=1167886800&partner=homepage

In our study, all tree-based data mining tools offer predictor importance scores. SPSS logistic regression also ranks the relative importance of independent variables. S-Plus uses Wald statistics to calculate importance scores. Other tools for predictor screening include information gain, expected cross entropy, the weight of evidence of text, odds ratio, term frequency, mutual information, and the modified Gini Index (Shang et al., 2007). A disturbing fact is that there seems to be considerable disagreement in the definitions and in the execution of the algorithms.

Variable importance in data mining is indeed tricky business. For example, in Breiman et al. (1984, p. 147), the measure of the importance of a variable x_m is defined as

    M(x_m) = Σ_{t ∈ T} ΔI(s̃_m, t),

where s̃_m is the best surrogate split (p.
40) with variable x_m, and ΔI(s̃_m, t) is the drop in Gini impurity at node t. STATISTICA and certain programs such as QUEST by Loh and Shih, on the other hand, compute variable importance by summing the drop in node impurity over all nodes in the tree(s) (see the STATISTICA help file). The difference between STATISTICA and Breiman et al. appears minor, but the STATISTICA manual goes to great lengths to emphasize the differences, and the two approaches indeed yield vastly different scores of variable importance, as can be seen in Table 4:

Table 4: Scores of variable importance

Variable   Salford   Salford   Salford   Salford   Statistica   Statistica   Statistica   Statistica   SAS-EM     Median
           TreeNet   Random    CART      MARS      Boosted      Random       C&RT         MARS         Decision
                     Forest                        Tree         Forest                                 Tree
PUB        100       81        100       100       48*          100          53*          62*          100        100
POPCHG     98*       72        45*       48*       100          73           29*          93*          22*        72
SEWER      97        100       100       58*       43*          72*          100          90           97         97
ELEV       85*       59        52        29*       88*          70           63           100*         59         63
ROADIS     72*       43        29        13*       79*          50           36           79*          0*         43
INTDIS     70        69        63        29*       85*          49           64           100*         34*        64
EMPDIS     61*       32        8*        13        53*          35           27           55*          20         32
SLOPE      50*       15        13        25        64*          36           16           65*          0*         25
SCHDIST    41        53        65        6*        56           34           57           74*          0*         53
WET        38        19        36        19        48           44           64*          74*          29         38
FP100      36        14*       29        29        53           41           55           66*          3*         36
FP500      30*       4         0         23        43           19           11           36*          0          19
FLOODWY    29        6*        17        27        41           28           31           56*          0*         28
SURWAT     12        1         10        0         22           9            12           26           0          10

In Table 4, reading horizontally, a score is marked with an asterisk if it is 20 points away from the median (see Remark 7). The table shows wide variability and inconsistency among the different tools by different data mining vendors. In fact, even among the tools offered by the same company, the scores of variable importance can differ dramatically. For example, using SALFORD TreeNet and CART, the scores for EMPDIS are 61 and 8, respectively. For another example, using STATISTICA Boosted Trees and Random Forest, the scores for PUB are 48 and 100, respectively.
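Part of the disagreement in Table 4 traces to the differing definitions. Even on a single fitted model, impurity-based and permutation-based importance scores need not agree, as this scikit-learn sketch on synthetic data illustrates; the data generator and settings are illustrative assumptions.

```python
# Two importance definitions computed on the same fitted random forest:
# summed impurity drops vs. permutation importance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           n_redundant=2, random_state=1)
rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

gini_scores = rf.feature_importances_                  # summed impurity drops
perm_scores = permutation_importance(rf, X, y, n_repeats=5,
                                     random_state=1).importances_mean

# Rescale both so the top predictor scores 100, as the packages in Table 4 do
gini_100 = 100 * gini_scores / gini_scores.max()
perm_100 = 100 * perm_scores / perm_scores.max()
print(np.round(gini_100), np.round(perm_100))
```

Redundant predictors are the usual culprit: impurity-based scores split credit among correlated variables, while permutation scores can assign almost none to a variable whose information is available elsewhere, so the two rankings diverge just as the columns of Table 4 do.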
Furthermore, when we extended the structure of the SAS Decision Tree, the scores changed considerably.

Another observation is that the bottom 5 variables (WET, FP100, FP500, FLOODWY, and SURWAT) may be overwhelmed by other variables in the models, but they may be important in their own right, especially in the decision-making process regarding whether a piece of land should be developed. As a matter of fact, these bottom variables can be used, in conjunction with the Decision Tree, to boost the profit as discussed in Section 3. Specifically, we used a leaf of the SAS Decision Tree to identify a subset of 585,640 sites (from the original 1.4 million geographical locations) that has the highest probabilities of being developed. Within this subset, we deleted cases that are associated with WET = 1, FP100 = 1, FP500 = 1, FLOODWY = 1, or SURWAT = 1. The resulting data set contains 535,076 cases. We then rebuilt predictive models; the Neural Network model enhanced the profit for the top 10% from 4 to 4.15 units. Recall that each investment unit is equivalent to $100,000, hence the monetary increase from case screening via Decision Tree and Predictor Importance would be (4.15 − 4.00) × $100,000 = $15,000 for the investment on each piece of land in the 10-year period.

Figure 5: Percent of [DEV = 1] vs. EMPDIS

A cautionary note on the Decision Tree methodology is that if the seed of the randomization is changed in the construction of the training and validation data sets, the structure of the tree, and hence the resulting profit chart, will be different each time. This is common in data mining and is a sharp departure from traditional statistical analysis of tightly controlled experiments. And there is no way to tell which model actually captures the reality. Some may regard this as a disadvantage of data mining tools, but others may consider it a reflection of real-life observational data.
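The seed-sensitivity caveat is easy to demonstrate: refitting a tree on training sets drawn with different seeds can reshuffle the apparent importance ordering. A sketch on synthetic data, with illustrative settings throughout:

```python
# Change only the seed that draws the training split; the fitted tree,
# and hence its importance ranking, can change with it.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=8, n_informative=4,
                           random_state=0)

rankings = []
for seed in (1, 2, 3):
    X_tr, _, y_tr, _ = train_test_split(X, y, train_size=0.8, random_state=seed)
    tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
    order = tree.feature_importances_.argsort()[::-1]   # most important first
    rankings.append(tuple(int(i) for i in order))

print(rankings)  # the orderings often differ from seed to seed
```

The strongest predictors usually stay near the top across seeds; it is the middle and bottom of the ranking, and the tree structure below the first few splits, that move around.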
Furthermore, experience indicates that many predictive models tend to produce decent profits for the investment. This situation reminds us of what G.E.P. Box once said: "All models are wrong, but some are useful." (See Remarks 8-10)

5. Response Surface, Frequency Table, and Marginal Effect

Given the original data of 1.4 million records, one can use standard tools such as the SAS frequency table and 2D plot to produce the chart in Figure 5 for the predictor EMPDIS (distance to employment center). Figure 5 shows that when EMPDIS is between 2-8 units the probabilities of development are close to 60%, but the probabilities are lower at the two ends. The chart of the frequency table for ELEV, on the other hand, revealed a strange pattern that led to the detection of contaminated data (outliers). After the elimination of the outliers, the chart for ELEV shows a non-linear pattern that is similar to Figure 5.

Figure 6 shows the response surface of the STATISTICA 2nd order MARSplines, with the x-axis being ELEV, the y-axis being EMPDIS and the z-axis being P[DEV = 1]:

Figure 6: STATISTICA 2nd order MARSplines

Figure 6 shares a similar shape with the response surfaces generated by the SAS-EM neural network, SAS-EM logistic regression, STATISTICA neural network, STATISTICA boosted trees, and STATISTICA random forest, but their heights vary from model to model. Nevertheless, they appear consistent with the nonlinear patterns in Figure 5.

Theoretically, it is possible to quantify Figure 6 by the calculation of marginal effects at each given point. Specifically, let p = f(x_1, x_2, ..., x_k); then the marginal effect of x_i at (x_1*, x_2*, ..., x_k*) is defined to be

    ∂f(x_1*, x_2*, ..., x_k*)/∂x_i    (5.1)

Note that none of the software packages in this study provides an easy answer to (5.1), hence we need to do it by hand. For Figure 6, the MARSplines equation is a bit complicated (see Remark 11). Consequently we will skip the calculation for this model.
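One generic way around the missing software support is a finite-difference approximation: increment one predictor by one unit, hold the others fixed, and take the change in predicted probability. A sketch, with `model_prob` standing in for any fitted model; the intercept and coefficients below are made-up numbers, not the paper's fitted values.

```python
import math

def model_prob(x):
    """Hypothetical fitted logistic model, p = 1/(1 + exp(-(a + b.x)))."""
    a, b = -1.0, [0.3, -0.2, 0.1]          # made-up coefficients
    z = a + sum(bi * xi for bi, xi in zip(b, x))
    return 1.0 / (1.0 + math.exp(-z))

def marginal_effect(f, x, i):
    """Finite-difference marginal effect of predictor i at point x."""
    x_plus = list(x)
    x_plus[i] += 1.0                        # one-unit increment
    return f(x_plus) - f(x)

x0 = [2.0, 1.0, 0.5]
me = marginal_effect(model_prob, x0, i=1)   # me is about -0.045

# For a logistic model the closed-form marginal effect is b_i * p * (1 - p),
# which the unit-step difference approximates
p0 = model_prob(x0)
closed_form = -0.2 * p0 * (1 - p0)
```

Here the finite difference and the closed form agree closely; the gap is the usual discretization error of taking a unit step rather than an infinitesimal one.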
Instead, we will compare the marginal effects of the SAS-EM neural network and logistic regression. To begin with, the equation of the SAS neural network model takes the following form:

    f(x_1, x_2, \ldots, x_k) = \beta_0 + \sum_i \beta_i \tanh\Big(\sum_j \beta_{ij} x_j\Big),    (5.2)

where tanh(x) = (e^x − e^{−x})/(e^x + e^{−x}) is the hyperbolic tangent function. For this data set, the SAS neural network model has 19 hidden units with 145 coefficients \beta_0, \beta_i, and \beta_{ij}, which presents a considerable chore in the calculation of the partial derivative in (5.1). Consequently, we resorted to the following approximation:

    f(x_1, x_2, \ldots, x_i + 1, \ldots, x_k) − f(x_1, x_2, \ldots, x_i, \ldots, x_k).    (5.3)

A combination of (5.3) and the SAS-EM neural network produced the following table, where the last entry of the last column gives the desired quantity:

Table 5: Marginal effect (SAS-EM neural network)

ROADIS  INTDIS  POPCHG  EMPDIS  SCHDIST  SLOPE  ELEV   P[DEV=1]   Increment of P
4.26    7.769   38.3    7       2        2.24   7.857  0.659583
4.26    7.769   38.3    8       2        2.24   7.857  0.638335   -0.021248
4.26    7.769   38.3    9       2        2.24   7.857  0.614647   -0.023688

Turning to the logistic regression,

    \log\Big(\frac{p}{1-p}\Big) = a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k,    (5.4)

the marginal effect would be

    \frac{b_i\, e^{a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k}}{(1 + e^{a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k})^2},    (5.5)

which is −0.0408525 at the given point.

Table 6: Marginal effect at the chosen point

SAS-EM neural network       -2.4%
Logistic regression         -4.1%
Frequency Table (Figure 5)  -5%

The results in Table 6 are relatively similar and are consistent with the graph in Figure 5, where the change (from EMPDIS = 8 to EMPDIS = 9) is about −5%. A cautionary note is that Table 6 gives only a snapshot of the overall picture. In addition, it does not give the effect on the profit as discussed in Section 3. The hand calculations behind Table 5 are tedious, and we hope new software will soon be available for this important operation. Recall that the prediction accuracy in Section 2 is close to 80%.
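The two marginal-effect calculations can be mirrored in a short script: the closed form (5.5) for the logistic model, and the unit-increment difference (5.3) that stands in for the derivative of a fitted model. The coefficients and evaluation point below are hypothetical, chosen only to show that the two estimates agree in sign and rough magnitude, as the entries of Table 6 do:

```python
import math

def logistic_p(x, a, b):
    """P[DEV = 1] from the logistic model (5.4)."""
    z = a + sum(bi * xi for bi, xi in zip(b, x))
    return 1.0 / (1.0 + math.exp(-z))

def logistic_marginal(x, a, b, i):
    """Closed-form marginal effect (5.5): b_i * e^z / (1 + e^z)^2."""
    z = a + sum(bi * xi for bi, xi in zip(b, x))
    return b[i] * math.exp(z) / (1.0 + math.exp(z)) ** 2

def unit_increment(f, x, i):
    """Approximation (5.3): bump x_i by one unit and difference f."""
    bumped = list(x)
    bumped[i] += 1.0
    return f(bumped) - f(x)

# Hypothetical coefficients and point, for illustration only.
a, b = 0.5, [-0.1, 0.03]
x = [8.0, 2.0]

exact = logistic_marginal(x, a, b, 0)
approx = unit_increment(lambda v: logistic_p(v, a, b), x, 0)
# Both are small negative numbers of similar size.
```

The same `unit_increment` helper applied to a fitted neural network's scoring function is exactly the hand calculation behind Table 5.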
So a question is: why can't the models capture the remaining 20%? After an examination of all the variables, we found that among the predictors related to General Growth and Development, the original data contains only population change. Other pieces of important information are indeed missing in the data. Take the example of the employment rate. The US government does provide employment-rate data for large areas such as the entire State of Tennessee, but our study area is confined to Memphis and Shelby County, and information on the employment rate for the entire State provides no help in the predictive modeling. Other missing variables and vital statistics include timely information related to both business growth and political fallouts, and they are hard to come by. These variables may include anti-sprawling laws, environmental groups, local politics, new business development, new shopping complexes, and new policies and initiatives after new administrations, all coming into play in a rapid-fire, dynamic fashion. Consequently, the 1.4 million records cannot really address all questions in urban growth. In short, it may sound like a tired cliché that in order to make a good prediction, you need to know all the variables. But this is exactly the situation in data mining.

6. Probability Maps and GIS for Hotspot Detection

In this section, we use logistic regression to help plot a probability map under the GIS environment. The map gives future development probabilities for all undeveloped parcels in the study area. The map assumes a continuation of recent development trends and assumes that the geographic influence on development will be similar to its influence in the past.

Figure 7: Probability map

Figure 8: Profit charts, cumulative

The overall picture of Figure 7 is consistent with the urban development trend in the study area.
For example, the red area in the lower-right corner indicates a high probability of urban development, while areas in other parts of the map predict a low probability of future activities. Intuitively, the area with a higher probability of development would be the area with a higher return on the investment. Among the original 1.4 million records, 35,535 cases belong to this specific region; these cases were used to re-build the models in SAS-EM and to re-calculate the profits displayed in Figure 8. In the comparison of Figure 8 and Figure 1, the profit of selecting only the top 10% of the lands via the GIS-based Neural Network would increase from 4 to 4.3 investment units, or (4.3 − 4) × $100,000 = $30,000 for the investment on one piece of land. The next step to further push the profit upward would be a combination of GIS, Decision Tree, and Case-Selection. This action results in 18,300 cases (out of the original 1.4 million records). However, the profit for the top 10% tier remains at 4.3 units (or 95.5% of 4.5 units, the maximum profit), the same as that of using GIS alone. The result appears to echo a golden rule in real estate investment: Location, Location, Location.

7. Concluding Remarks

Data Mining is a relatively new field that was described by an article in Amstat News (9/2003) as a defining event that will impact the future of statistics. MIT Technology Review (Jan/Feb 2001) and Bayesian Machine Learning (Feb 2004) ranked data mining as one of the ten emerging technologies that will change the world. One area of application is microarray analysis: in the 1999 Joint Statistical Meetings there was only one paper on DNA microarrays, but there were over a hundred in 2002 (Amaratunga and Cabrera, 2004) and some 1,200 papers in 2003. This was exponential growth at an astonishing rate.
At SAS.com, one can find about 300 case studies on large-scale real-life applications of data mining in big companies. Other success stories can be found at SPSS.com, StatSoft.com, the IBM Intelligent Miner website, and Google.com. A Google search on "data mining software" resulted in hundreds of thousands of links. In the wild world of data mining, one can expect to see a huge variety of data mining tools of varied accuracy and quality. In this study, we focused on a handful of tools for predictive modeling in an urban development project. Our study and the related observations indicate the following:

1. The classification accuracies of the tools in this study are relatively similar. The standard errors of the misclassification rates are roughly 0.5% for the training data and 0.8% for the hold-out data, respectively. In comparison to blind guessing (no model), statistical models enhance the classification accuracy by almost 30% (Section 2).

2. This study concerns only one specific data set; it is likely that different approaches work best for different data, as shown in other examples (Section 2). In fact, various data mining algorithms can function as universal approximators that may be able to model certain kinds of linear or nonlinear relationships or interactions. The more such tools we have from reputable sources, the better the chance of finding a good predictive model.

3. Adjustment of the classification cutoff (threshold) may reduce the misclassification rate, the false positive rate, or the false negative rate, depending on the specific concern of the study (Section 2).

4. Small differences in classification accuracies by different models may result in substantial financial gains when the models select the top 10% of cases and when the cost and profit are incorporated in the calculation (Section 3).

5.
Complicated models such as Neural Networks appear to suffer from a lack of explanation capability. Nevertheless, the response surfaces, frequency tables, and marginal effects of these models may shed light on the inner workings of the non-linear phenomena. Furthermore, traditional statistical graphs such as the charts of frequency tables may help reveal outliers in the data (Section 5).

6. Many data mining algorithms generate predictor importance scores. Our investigation indicates that tree-based predictor importance scores work in our study and in certain text mining cases (Section 4). But taken as a whole, there exists wide variability among different tools by different data mining vendors. In fact, even among the tools offered by the same company, the scores of predictor importance can differ dramatically.

7. A change of seed in the randomization can significantly change the structure of the Decision Tree or other models, and there is no way to tell which model captures the reality. Nevertheless, many predictive models tend to produce decent, similar profits for the investment.

8. It can be very fruitful to use a combination of data mining tools and GIS (Geographic Information Systems) software to generate probability maps for visual case selection and for a higher return on investments that involve spatial information (Section 6).

In conclusion, data mining is a fast-growing field with countless opportunities and challenges that may re-shape our profession and the world as a whole. With effort and luck, a better future may unfold in our lifetime.

Additional Remarks

1. (Link Analysis): A powerful package for link analysis is the Analyst's Notebook by a software company called i2. One of the authors of this paper attended an i2 workshop with other attendees from business and government agencies such as the IRS, UBS, AT&T, Pfizer Inc., the N.Y. Automobile Insurance Plan, Commerce Bank, the US Army Corps of Engineers, and the U.S.
Attorney's Office, SDNY. Other users of i2 software include the Federal Bureau of Investigation, the Drug Enforcement Administration, the U.S. Customs Service, the U.S. Postal Service, and the U.S. Department of the Treasury. Note that the commercial value of the i2 package is about $50,000 per copy, but the company is willing to give it free to academic institutions so that students can further spread the software to the commercial world. The software is mainly chart-oriented but can be teamed up with regression, text mining, GIS, and other statistical tools (http://www.i2.co.uk/Products/Analysts Notebook/default.asp).

2. About Table 2:

(a) N = 10,000. For hot-spot detection such as Figure 7, we use the full data set of 1.4 million records. For predictive modeling, n = 500 would be enough in most cases. For certain engineering problems such as 2-dimensional motion tracking, our neural network needs only 30 data points, which correspond to 30 minutes of observations (a rather long time for many tracking problems). For the Urban Development case, the full data of 1.4 million records would slow down the calculation quite a bit and does not improve the prediction accuracy in any manner. We picked 10,000 for good coverage and for speed when running neural networks and other complicated models.

(b) N/A cases: The DISCIPULUS Genetic Algorithm uses hold-out data, but as said in the last sentence of p.8, "In our study, we used the free demo software which does not allow the deployment of the third data set." Salford MARS does not use hold-out data; STATISTICA MARSplines does. In our opinion, hold-out data is necessary for neural networks and most machine learning tools but not really so for MARS, which is a generalization of regression (and there is no need for hold-out data in most cross-sectional regression problems).

(c) Equbits SVM: RBF kernel, tuning parameter = 0.1687, 3-fold cross validation.
Note that 5-fold or 10-fold cross validation did not improve model accuracy.

(d) SAS-EM regression includes both standard regression and logistic regression. SAS-EM automatically picks one of them according to the target-variable type.

3. (Classification accuracy-I): In the calculation of the classification accuracy, it is desirable to report the confidence interval of the estimate. STATISTICA reports the standard error of the risk estimate in its boosted tree and random forest modules. For other models, one can repeat the experiments by changing the seeds of the randomization to get a feeling for the variability of the misclassification rates. For example, six runs of the SAS-EM default neural network model produced the following misclassification rates for the hold-out data: 23.25%, 23.4%, 23.65%, 23.9%, 24.95%, 24.05%. Therefore SE(risk estimate) is about 0.61%.

4. (Classification accuracy-II): A special trick to boost the classification accuracy is to set up a spreadsheet that contains columns of False Positives, False Negatives, Hit Rates, and various Classification Cutoffs (thresholds). The user then selects the cutoff interactively to hunt for the best rate. Salford provides this kind of spreadsheet. SAS-EM takes this approach one step further to provide threshold-based charts. On such a chart, the user can move the cursor to explore the impact of different classification cutoffs in a fun, interactive manner. In the chart, the back curve is Sensitivity, the middle curve is Specificity, and the front curve is the overall accuracy.

5. One possibility for boosting the hit rate is the use of higher-order interaction terms in logistic regression and in MARSplines.
Among the 14 variables in our study, the 6 interval predictors have various distributions. In the corresponding figure, the graphs on the diagonal are histograms, while the charts off the diagonal are scatterplots with a quadratic line fit. The histogram of ELEV is close to normal, while the other histograms are odd-shaped. The figure also indicates that these continuous variables appear uncorrelated or weakly correlated. Nevertheless, interactions may exist among other variables, and when we considered 2nd order interactions, the overall hit rate of STATISTICA MARSplines increased from 75.8% (1st order, no interaction) to 77.1% when running in interactive mode. On the other hand, we tried transformations of the variables (log, square root, inverse, square, exponential, standardize, maximize normality, maximize correlation with target) and trimming/filtering/binning of non-normal variables, but none of these techniques really improved the predictive power in this particular study.

For each individual tool, it is usually possible to calibrate the model parameters to increase the prediction accuracy. For example, the default neural network in SAS-EM is a multilayer perceptron (MLP) model, but the user has the flexibility to adjust the number of hidden layers, the objective functions, the convergence parameters, plus a list of other options. Or the user can change the architecture of the model from an MLP to an RBF (radial basis function) model altogether. For the second example, the mutation rate of the DISCIPULUS Genetic Algorithm is often fixed at 70% or higher, which would be a shock to biologists. But the fact of the matter is that the DISCIPULUS Genetic Algorithm gave us the best rates for the training and validation data.
For the third example, the default for boosted trees is usually set at 200 trees, but 1,000 trees were used in TreeNet when Salford Systems captured the grand prize of the 2003 Duke University data mining competition. A few years from now, someone may come up with a meta-algorithm and a supercomputing machine to incorporate all the strengths of these models for the best outcomes under various criteria. It is conceivable that the mathematicians and engineers at Google Inc. might have been working on that without any disclosure to the public. The field of data mining is rapidly unfolding, with countless holes in popular models and software packages (see, e.g., Section 4). When it comes to the statistical analysis of large data sets, the best is yet to come.

6. (Variable Importance-I): For another example, a series of articles in the New York Times indicates that both the Republican and the Democratic parties rely heavily on data mining tools to predict voters' political leanings and "to identify the issues that most motivate them." It is conceivable that statistical techniques on variable importance, if well-calibrated, can be very useful for ranking the issues that are closest to the hearts of the voters (and perhaps for reducing the harassment as perceived by voters as well). (http://www.nytimes.com/2006/10/28/nyregion/28bloomberg.html?hp&ex=1162094400&en=487b3f42dc55b77d&ei=5094&partner=homepage, http://select.nytimes.com/gst/abstract.html?res=F50B1EF63E5B0C7A8EDDA90994DE404482, http://time-blog.com/allen report/2006/10/why some top republicans think.html, http://www.washingtonpost.com/wp-dyn/content/article/2006/03/07/AR2006030701860 pf.html)

7. (Variable Importance-II): In Table 4, the scores in the second column (Salford TreeNet) are sorted in descending order. Scores in all other columns are then entered accordingly.

8.
(Variable Importance-III): In one experiment, we used the SAS-EM default neural network to compare classification accuracies in the following manner: (1) each time we deleted one predictor from the full model of 14 predictors and then compared the accuracy with that of the full model; (2) each time we used only one predictor in the model and then compared the accuracy with that of the model without any predictor. The experiment indicates that PUB and SEWER are the two top-ranking predictors, which matches the results of the following tools (see Table 4): Salford TreeNet, Random Forest, CART, STATISTICA Random Forest, and the SAS-EM Decision Tree. This is a small comfort we have in this section. Nevertheless, if the predictors are highly correlated, then this scheme may create other problems.

9. (Variable Importance-IV): In the statistical literature, the analysis of variable importance is often called Sensitivity Analysis. The STATISTICA Help file cautions that "there may be interdependent variables that are useful only if included as a set," and that "sensitivity analysis does not rate the 'usefulness' of variables in modeling in a reliable or absolute manner." The STATISTICA Help file urges the user to be "cautious in the conclusions you draw about the importance of variables" but maintains that "nonetheless, in practice it is extremely useful" and that "if a number of models are studied, it is often possible to identify key variables that are always of high sensitivity, others that are always of low sensitivity, and 'ambiguous' variables that change ratings and probably carry mutually redundant information." The above cautionary notes may help in many case studies; however, in light of the inconsistent scores in Table 4, those cautionary notes are correct only in some cases but amount to sweeping the dust under the carpet in other applications.

10. (Variable Importance-V): A similar perspective to G.E.P.
Box can be drawn from a proteomic study (Conrads et al., 2004, p. 177, Figure 8) that involves some 350,000 predictors. The study revealed four distinct models, each with a handful of predictors, that all produced 100% accuracy as measured by sensitivity and specificity in testing and validation. Consequently, in certain applications it may be more meaningful to focus on a set of important variables than to consider the ranking or the scores of predictor importance. In short, we believe the importance scores are useful in many applications, but the current state of affairs is rather chaotic and deserves further scrutiny from the statistical community.

11. The output takes more than one page and is available from the authors. For binary prediction, the MARSplines module does not use any link function in the model-building process. Instead, the module treats the binary target as a continuous variable. The end result is remarkably accurate, but the equation may yield probabilities that are outside the [0, 1] range.

Acknowledgement

The authors would like to thank the reviewers for their comments on the earlier drafts, which led to substantial improvement of the manuscript. In addition, the authors would like to thank the software vendors for their installation assistance and for their prompt replies to our questions.

References

Adnan, A. and Bastos, E. (2005). A comparative estimation of machine learning methods on QSAR data sets. SUGI-30 Proceedings.

Amaratunga, D. and Cabrera, J. (2004). Exploration and Analysis of DNA Microarray and Protein Array Data. Wiley.

Bastos, E. and Wolfinger, R. (2004). Data mining in clinical and genomics data. Presented at M2004, the SAS 7th Annual Data Mining Technology Conference.

Berry, M. J. A. and Linoff, G. S. (2004). Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, 2nd Edition. Wiley.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984).
Classification and Regression Trees. Wadsworth International Group.

Conrads, T. P., Fusaro, V. A., Ross, S., Johann, D., Rajapakse, V., Hitt, B. A., Steinberg, S. M., Kohn, E. C., Fishman, D. A., Whiteley, G., Barrett, J. C., Liotta, L. A., Petricoin III, E. F., and Veenstra, T. D. (2004). High-resolution serum proteomic features for ovarian cancer detection. Endocrine-Related Cancer 11, 163-178.

De Veaux, R. D. (2005). Data mining in the real world: five lessons learned in the pit. Presented at the 26th Spring Symposium on Statistical Data Mining, the New Jersey Chapter of the American Statistical Association.

Draghici, S. (2003). Data Analysis Tools for DNA Microarrays. Chapman & Hall/CRC.

Freund, Y. and Schapire, R. (1996). Experiments with a new boosting algorithm. Machine Learning: Proceedings of the Thirteenth International Conference, 148-156.

Friedman, J. H. (2006). Recent advances in predictive (machine) learning. Journal of Classification 23, 175-197.

Hastie, T., Tibshirani, R. and Friedman, J. H. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.

Larose, D. T. (2005). Discovering Knowledge in Data. Wiley.

Luan, J. (2006). Using academic behavior index (AB-index) to develop a learner typology for managing enrollment and course offerings - a data mining approach. Institutional Research Applications 10, ****page numbers ****

Shang, W., Huang, H., Zhu, H., Lin, Y., Qu, Y. and Wang, Z. (2007). A novel feature selection algorithm for text categorization. Expert Systems with Applications 33, 1-5.

Spector, L. (2004). Automatic Quantum Computer Programming: A Genetic Programming Approach. Kluwer Academic Publishers.

Tomassini, M. (2005). Spatially Structured Evolutionary Algorithms: Artificial Evolution in Space and Time. Springer.

Tan, K. C., Khor, E. F., and Lee, T. H. (2005). Multi-objective Evolutionary Algorithms and Applications. Springer.

Yu, B. (2005).
Mining earth science data for geophysical structure: a case study in cloud detection. Presented at the 5th SIAM International Conference on Data Mining.

Yu, J. S. and Chen, X. W. (2005). Bayesian neural network approaches to ovarian cancer identification from high-resolution mass spectrometry data. Bioinformatics 21, Suppl. 1, i487-i494.

Yu, J. S., Ongarello, S., Fiedler, R., Chen, X. W., Toffolo, G., Cobelli, C., and Trajanoski, Z. (2005). Ovarian cancer identification based on dimensionality reduction for high-throughput mass spectrometry data. Bioinformatics 21, 2200-2209.

Received April 15, 2007; accepted October 7, 2007.

Chamont Wang
Department of Mathematics and Statistics
The College of New Jersey
2000 Pennington Road
Ewing, NJ 08628-4700, USA
wang@tcnj.edu

Pin-Shuo Liu
Department of Geography and Urban Studies
William Paterson University
300 Pompton Road
Wayne, NJ 07470, USA
LiuP@wpunj.edu