Hard Hats for Data Miners: Myths and Pitfalls of Data Mining

By Tom Khabaza

The intrepid data miner runs many risks, including being buried under mountains of data. Some risks are just myths that need to be debunked. Others, however, are real. In this article, I will debunk several of these myths and misconceptions and then describe some problems and pitfalls commonly encountered when conducting data mining, along with steps that you can take to protect yourself from them.

A critical point to note is that data mining is a business process: a way of finding patterns in your data that provide insight you can use to conduct your business more effectively. Data mining also makes predictions to guide customer interactions and other business decisions. You'll see these points reinforced numerous times in the information that follows.

Myths and misconceptions about data mining

Myth #1: Data mining is all about algorithms

A businessperson attending a typical data mining conference or reading its proceedings might form the impression that data mining is all about advanced data analysis algorithms. This misconception might be summarized as follows: "All you need for data mining is good algorithms. The better your algorithms, the better your data mining; advancing the effectiveness of data mining means advancing our knowledge of algorithms."

To hold this view is to misunderstand the data mining process. Data mining is a process consisting of many elements, such as formulating business goals, mapping business goals to data mining goals, acquiring, understanding, and pre-processing the data, evaluating and presenting the results of analysis and deploying these results to achieve business benefits.

This is not to minimize the importance of new or improved data mining algorithms. The problem occurs when data miners focus too much on the algorithms and ignore the other 90-95 percent of the data mining process.
The consequences of this misconception can be disastrous for a data mining project, possibly resulting in a failure to produce any useful results. Experienced data miners recognize the need for a broader view of the data mining process.

Myth #2: Data mining is all about predictive accuracy

While data mining is not all about data analysis algorithms, there is a part of data mining that is about algorithms. This raises the question, "How can you judge the quality of an algorithm?"

You might think that the main criterion would be the predictive accuracy of the models it generates. This view, however, misrepresents the role of algorithms in the data mining process.

It is true that a predictive model should have some degree of accuracy, because this demonstrates that it has truly discovered patterns in the data. However, the usefulness of an algorithm or model is also determined by a number of other properties, one of which is whether understanding the resulting model requires deep technical knowledge or is something that can be understood by a typical analyst.

Data miners who believe that predictive accuracy is the primary criterion of algorithm evaluation might use algorithms that can only be used by technology experts. These algorithms will then play only the most limited role, because data mining is a process that is driven by business expertise; it relies on the input and involvement of non-technical business professionals in order to be successful.

Myth #3: Data mining requires a data warehouse

Business people often think that a data warehouse is a prerequisite for data mining. This is a subtle misconception about the relationship between the two technologies.

It is true that data mining can benefit from warehoused data that is well organized, relatively clean, and easy to access. This is particularly true if the warehouse has been constructed with data mining specifically in mind and with knowledge of the requirements of the data mining project. If this has not been the case, however, the warehoused data may be less useful for data mining than the source or operational data. In the worst case, warehoused data may be completely useless (for example, if only summary data are stored).

A more accurate depiction of the relationship between the two would be that data mining benefits from a properly designed data warehouse, and that constructing such a warehouse often benefits from first doing some exploratory data mining.

Myth #4: Data mining is all about vast quantities of data

Early explanations of data mining often began with statements like, "We now collect more data than ever, yet how are we to benefit from these vast data stores?" Focusing on the size of data stores provided a convenient introduction to the topic of data mining, but subtly misrepresented its nature.

While there are many large datasets that organizations can benefit from mining, it would be a mistake to believe that these should be the sole focus of data mining. Many useful data mining projects are performed on small or medium-sized datasets; some, for example, contain only a few hundred or a few thousand records.

Subscribing to the erroneous belief that data mining is only appropriate for vast data stores would lead organizations to choose tools that sacrifice usability for scalability when, in fact, both attributes are essential. To quote a customer of a leading data mining tool: "Other data mining tools optimize machine time, but this tool optimizes my time." Whether the datasets are large or small, organizations should choose a data mining tool that optimizes the user's time.

Myth #5: Data mining should be done by a technology expert

Data mining uses advanced technology, and its workings, particularly those of modeling techniques, are unlikely to be understood by the wider IT community. Does this mean that data mining should be conducted only by those who understand every nuance of the technology that is involved?

Quite the opposite is true, due to the paramount importance of business knowledge in data mining. When performed without business knowledge, data mining can produce nonsensical or useless results (see pitfall #4, below), so it is essential that data mining be performed by someone with extensive knowledge of the business problem. Very seldom is this the same person with extensive knowledge of the data mining technology. It is the responsibility of data mining tool providers to ensure that tools are accessible to business users.

Pitfalls of data mining and how to avoid them

Pitfall #1: Buried under mountains of data

Data mining should be an interactive, iterative process in which the analyst applies substantial business knowledge and is "engaged" with the data. However, those who hold myth #4 (that data mining is about vast quantities of data) often suppose that this process must be applied to all of the available data.

This can lead to attempts to mine volumes of data for which the available hardware and software cannot provide an acceptable interactive response. In these situations, the data mining process becomes sluggish, and by the time a question is answered, the analyst cannot remember why it was asked.

The way to avoid this pitfall is to employ some form of sampling. For example, if we have a million customers and a 20 percent annual attrition (or "churn") rate, we need not plot our graphs or build our models using the full million examples, or even 500,000. Consider the following questions and answers:

Q: How many churn profiles do we expect to find?
A: Maybe ten.
Q: How many examples of each profile do we need?
A: Maybe a thousand.

Therefore, a sample of ten or twenty thousand churners and an equivalent number of non-churners is likely to be sufficient for this analysis.

Note that this does not mean that data miners will never encounter the need to build models from millions of examples; only that they should not assume that they must do so just because the data are available.

Pitfall #2: The Mysterious Disappearing Terabyte

This is a common phenomenon, but not always a pitfall. It refers to the fact that, for a given data mining problem, the amount of available and relevant data may be much less than initially supposed.

Consider the following scenario: You are a data mining consultant, and your client is a large bank, which wishes to mine its customer data to determine credit risk. The bank holds terabytes of data on its customers and is concerned that the available computing resources may be inadequate to mine this volume of data.

Here's how the situation might unfold. Different types of credit (personal loans, business loans, overdrafts) present different patterns of credit risk, so each data mining project will concentrate on just one type of borrower. The bank's domain experts judge a number of factors to be relevant, and the bank, planning ahead, began collecting data on these factors about 18 months ago. Since then, almost a thousand cases of bad debt have occurred.

Thus, the relevant data consist of fewer than a thousand cases of bad debt plus a sample from a plentiful supply of cases of good debt; let's say 3,000 records in all. Somehow, the need to mine terabytes of data has disappeared "mysteriously".

Pitfall #3: Disorganized data mining

Data mining can occasionally, despite the best of intentions, take place in an ad hoc manner, with no clear goals and no idea of how the results will be used. This leads to wasted time and unusable results.

To produce useful results, it is critical to have clearly defined business and data mining goals, formulated early in the project, and clearly articulated deployment plans. A simple way of ensuring this is to use a standard process such as the Cross-Industry Standard Process for Data Mining (CRISP-DM) [1]. Such a process ensures the correct preparation for data mining and provides a common language for communicating methods and results. Data mining tools should support standard process models.

Pitfall #4: Insufficient business knowledge

On a number of occasions this article has mentioned the crucial role that business knowledge plays in data mining. Without it, organizations can neither achieve useful results nor guide the data mining process towards them.

It is sometimes supposed that the end user can reasonably tell the data miner: "Here are the data, please go away, do your data mining, and come back with the answers." If this were to happen, the project would, at best, take many long and costly iterations to produce useful results. At worst, the results would be gibberish, and the project would fail.

This pitfall can only be avoided by involving, at every stage of the data mining process, both the end user and someone with a detailed knowledge of the business. Ideally, the data miner or data mining consultant would have the business knowledge. Lacking it, the data miner should literally sit next to someone with the required business knowledge who understands the question under consideration. For this to work effectively, a highly interactive data mining environment with good response time is required.

Pitfall #5: Insufficient data knowledge

In order to perform data mining, we must be able to answer questions like "What do the codes in this field mean?" and "Can there be more than one record per customer in this table?" In some cases, this information is surprisingly hard to come by. It might be that the data expert has left the organization or moved to another department or, in the case of legacy systems, there may be no data expert at all. This problem is exacerbated when the database or data warehouse management is outsourced: the external supplier is even less motivated than the user organization to maintain this information "just in case it might be needed in future."

There is no simple resolution to this problem. IT departments should be made aware of the need to maintain information about their organization's databases. Also, when a data mining project is proposed, data miners should consider how much data knowledge is available and evaluate any risks caused by its absence or scarcity.

Pitfall #6: Erroneous assumptions, courtesy of the experts

Business and data experts are crucial resources, but this does not mean that the data miner should unquestioningly accept every statement they make. The data miner should seek to confirm the validity of experts' statements.

Typical examples of erroneous or misleading statements might include:

- No customer can hold accounts of both these types
- No case will include more than one event of this type
- Only the following codes will be present in this field

Data miners should verify statements like these by examining the data. This is particularly important when processing of the data will depend on their accuracy. Ideally, mistakes in assumptions about data can be spotted before they lead to errors in the treatment of data. Data mining tools should make this easy to accomplish.

Pitfall #7: Incompatibility of data mining tools

The data mining process requires a wide range of capabilities, so it's not unusual that during a single project a wide variety of tools might be used. This can, however, lead to high overhead costs due to the time and resources required to switch contexts and convert data from one format to another. At its worst, this can lead to the omission of necessary steps in the data mining process and can seriously interfere with the exploratory character of data mining.

The best solution is to use a data mining toolkit that integrates all the required capabilities. However, no toolkit will provide every possible capability, especially when the individual preferences of analysts are taken into account, so the toolkit should also be "open", that is, able to interface easily with other available tools and third-party options.

Pitfall #8: Locked in the data jail-house

In addition to openness with regard to tools, data mining solutions should also be open with regard to data. Some data mining tools require the data to be held in a proprietary format that is not compatible with commonly used database systems. (This is sometimes referred to as the "data jail-house.") This can result in high overhead costs, due to the need for transferring data into the required format, and lead to difficulty in deploying the results into an organization's operational systems. A good data mining tool will interface with your data via common standards.

Conclusion

Data mining is a business process, requiring extensive business knowledge. It is best practiced by business experts or by data mining experts in close collaboration with business experts.

Data mining uses a variety of techniques and should not focus only on modeling algorithms and their predictive accuracy. Each technique can play a variety of roles.

During the data mining process, data miners interact and engage with the data in an iterative fashion. A standard data mining process model, such as CRISP-DM [1], helps to ensure the correct preparation for and use of data mining. Data mining tools should be evaluated based on their accessibility to business users, their scalability and usability, and their support for standard processes.

Data miners should make intelligent decisions about the amount of data required, assuming neither that all of an organization's data will be relevant nor that all the available data will be required.

Effective data mining requires flexible and interoperable techniques. This requirement is best met by integrated, open toolkits that can interface to data by means of open standards.

References

[1] Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., and Wirth, R. CRISP-DM 1.0 Step-by-step data mining guide, CRISP-DM Consortium, 2000, available at http://www.crisp-dm.org.

For more information about SPSS, visit www.spss.ch
SPSS Schweiz AG, Schneckenmannstrasse 25, 8044 Zürich
Telephone +41 (0) 1 266 90 30, Fax +41 (0) 1 266 90 39

SPSS is a registered trademark and the other SPSS products named are trademarks of SPSS Inc. All other names are trademarks of their respective owners. © 2005 SPSS Inc. All rights reserved. DamiD/0404
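
As a worked illustration of the sampling arithmetic in Pitfall #1, the following Python sketch multiplies the expected number of churn profiles by the examples wanted per profile, then draws a balanced sample of churners and non-churners. The function names, the (customer_id, churned) record layout, and the synthetic demo table are assumptions made for illustration only; they are not taken from the article or from any particular data mining tool.

```python
import random


def churners_needed(expected_profiles, examples_per_profile):
    """Rough sample-size estimate: one batch of examples per expected profile."""
    return expected_profiles * examples_per_profile


def balanced_sample(records, is_churner, n_per_class, seed=0):
    """Draw up to n_per_class churners and as many non-churners, without replacement."""
    rng = random.Random(seed)
    churners = [r for r in records if is_churner(r)]
    stayers = [r for r in records if not is_churner(r)]
    # Never ask for more records than the smaller class can supply.
    n = min(n_per_class, len(churners), len(stayers))
    return rng.sample(churners, n), rng.sample(stayers, n)


# Figures from the text: a million customers, 20 percent annual churn.
total_customers = 1_000_000
churners_available = total_customers * 20 // 100

# "Maybe ten" profiles, "maybe a thousand" examples of each.
needed = churners_needed(expected_profiles=10, examples_per_profile=1_000)
fraction_of_churners_used = needed / churners_available

# Hypothetical stand-in for a customer table: (customer_id, churned) pairs,
# with every fifth customer marked as a churner.
demo_customers = [(i, i % 5 == 0) for i in range(100_000)]
churn_sample, stay_sample = balanced_sample(
    demo_customers, is_churner=lambda rec: rec[1], n_per_class=needed
)
```

Under these assumptions the estimate uses only a small fraction of the available churners. Doubling the examples per profile gives the upper end of the "ten or twenty thousand" range mentioned in the text.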
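
The three kinds of expert statement listed under Pitfall #6 can each be checked directly against the data. The sketch below is one possible way to do so in plain Python, not a prescribed method: the function names, the tuple layouts, and the sample values are all hypothetical, and a real project would run the same checks against its own tables.

```python
from collections import Counter


def customers_with_both_types(accounts, type_a, type_b):
    """Return customer ids that hold at least one account of each type."""
    types_by_customer = {}
    for customer_id, account_type in accounts:
        types_by_customer.setdefault(customer_id, set()).add(account_type)
    return sorted(
        c for c, types in types_by_customer.items()
        if type_a in types and type_b in types
    )


def cases_with_repeated_event(events, event_type):
    """Return case ids in which event_type occurs more than once."""
    counts = Counter(case_id for case_id, kind in events if kind == event_type)
    return sorted(case_id for case_id, n in counts.items() if n > 1)


def unexpected_codes(values, allowed_codes):
    """Return each code outside allowed_codes with its number of occurrences."""
    return {
        code: n for code, n in Counter(values).items()
        if code not in allowed_codes
    }


# Hypothetical sample data; every name and value here is invented for illustration.
accounts = [(1, "savings"), (1, "current"), (2, "savings"), (3, "current")]
events = [("case-1", "default"), ("case-1", "default"), ("case-2", "default")]
status_field = ["A", "B", "A", "X", "X", "C"]

both = customers_with_both_types(accounts, "savings", "current")
repeats = cases_with_repeated_event(events, "default")
strays = unexpected_codes(status_field, allowed_codes={"A", "B", "C"})
```

An empty result is consistent with the expert's statement for the data examined. A non-empty result lists the counterexamples to take back to the expert before any processing depends on the assumption.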