Hard Hats for Data Miners:
Myths and Pitfalls of Data Mining
By Tom Khabaza
The intrepid data miner runs many risks, including being buried under mountains of data. Some
risks are just myths that need to be debunked. Others, however, are real. In this article, I will
debunk several of these myths and misconceptions and then describe some problems and pitfalls commonly encountered when conducting data mining, along with steps that you can take
to protect yourself from them.
A critical point to note is that data mining is a business process, a way of finding patterns in your
data that provide insight you can use to conduct your business more effectively. Data mining
also makes predictions to guide customer interactions and other business decisions. You'll see
these points reinforced numerous times in the information that follows.
Myths and misconceptions about data mining
Myth #1: Data mining is all about algorithms
A businessperson attending a typical data mining conference or reading its proceedings might
form the impression that data mining is all about advanced data analysis algorithms. This misconception might be summarized as follows: "All you need for data mining is good algorithms.
The better your algorithms, the better your data mining; advancing the effectiveness of data
mining means advancing our knowledge of algorithms."
To hold this view is to misunderstand the data mining process. Data mining is a process consisting
of many elements: formulating business goals; mapping business goals to data mining goals;
acquiring, understanding, and pre-processing the data; evaluating and presenting the results of
analysis; and deploying these results to achieve business benefits.
This is not to minimize the importance of new or improved data mining algorithms. The problem
occurs when data miners focus too much on the algorithms and ignore the other 90-95 percent
of the data mining process.
The consequences of this misconception can be disastrous for a data mining project, possibly
resulting in a failure to produce any useful results. Experienced data miners recognize the need
for a broader view of the data mining process.
Myth #2: Data mining is all about predictive accuracy
While data mining is not all about data analysis algorithms, there is a part of data mining that is
about algorithms. This raises the question, "How can you judge the quality of an algorithm?"
You might think that the main criterion would be the predictive accuracy of the models it generates.
This view, however, misrepresents the role of algorithms in the data mining process.
It is true that a predictive model should have some degree of accuracy, because this demonstrates
that it has truly discovered patterns in the data. However, the usefulness of an algorithm or model
is also determined by a number of other properties, one of which is whether understanding the
resulting model requires deep technical knowledge or is something that can be understood by a
typical analyst.
Data miners who believe that predictive accuracy is the primary criterion of algorithm evaluation
might use algorithms that can only be used by technology experts. These algorithms will then play
only the most limited role, because data mining is a process that is driven by business expertise;
it relies on the input and involvement of non-technical business professionals in order to be
successful.
Myth #3: Data mining requires a data warehouse
Business people often think that a data warehouse is a prerequisite for data mining. This is a
subtle misconception about the relationship between the two technologies.
It is true that data mining can benefit from warehoused data that is well organized, relatively
clean, and easy to access. This is particularly true if the warehouse has been constructed with
data mining specifically in mind and with knowledge of the requirements of the data mining project.
If this has not been the case, however, the warehoused data may be less useful for data mining than
the source or operational data. In the worst case, warehoused data may be completely useless (for
example, if only summary data are stored).
A more accurate depiction of the relationship between the two would be that data mining benefits
from a properly designed data warehouse; and that constructing such a warehouse often benefits from
first doing some exploratory data mining.
Myth #4: Data mining is all about vast quantities of data
Early explanations of data mining often began with statements like, "We now collect more data than
ever, yet how are we to benefit from these vast data stores?" Focusing on the size of data stores
provided a convenient introduction to the topic of data mining, but subtly misrepresented its
nature.
While there are many large datasets that organizations can benefit from mining, it would be a
mistake to believe that these should be the sole focus of data mining. Many useful data mining
projects are performed on small or medium-sized datasets, some, for example, containing only a few
hundreds or thousands of records.
Subscribing to the erroneous belief that data mining is only appropriate for vast data stores would
lead organizations to choose tools that sacrifice usability for scalability when, in fact, both
attributes are essential. To quote a customer of a leading data mining tool: "Other data mining
tools optimize machine time, but this tool optimizes my time." Whether the datasets are large or
small, organizations should choose a data mining tool that optimizes the user's time.
Myth #5: Data mining should be done by a technology expert
Data mining uses advanced technology, and its workings, particularly those of modeling techniques,
are unlikely to be understood by the wider IT community. Does this mean that data mining should be
conducted only by those who understand every nuance of the technology that is involved?
Quite the opposite is true, due to the paramount importance of business knowledge in data mining.
When performed without business knowledge, data mining can produce nonsensical or useless results
(see pitfall #3, below), so it is essential that data mining be performed by someone with extensive
knowledge of the business problem. Very seldom is this the same person with extensive knowledge of
the data mining technology. It is the responsibility of data mining tool providers to ensure that
tools are accessible to business users.
Pitfalls of data mining and how to avoid them
Pitfall #1: Buried under mountains of data
Data mining should be an interactive, iterative process in which the analyst applies substantial
business knowledge and is "engaged" with the data. However, those who hold myth #4 (that data mining
is about vast quantities of data) often suppose that this process must be applied to all of the
available data.
This can lead to attempts to mine volumes of data for which the available hardware and software
cannot provide an acceptable interactive response. In these situations, the data mining process
becomes sluggish, and by the time a question is answered, the analyst cannot remember why it was
asked.
The way to avoid this pitfall is to employ some form of sampling. For example, if we have a million
customers and a 20 percent annual attrition (or "churn") rate, we need not plot our graphs or build
our models using the full million examples, or even 500,000.
Consider the following questions and answers:
Q: How many churn profiles do we expect to find?
A: Maybe ten
Q: How many examples of each profile do we need?
A: Maybe a thousand
Therefore, a sample of ten or twenty thousand churners and an equivalent number of non-churners is
likely to be sufficient for this analysis.
Note that this does not mean that data miners will never encounter the need to build models from
millions of examples; only that they should not assume that they must do so, just because the data
are available.
Pitfall #2: The Mysterious Disappearing Terabyte
This is a common phenomenon, but not always a pitfall. It refers to the fact that, for a given data
mining problem, the amount of available and relevant data may be much less than initially supposed.
Consider the following scenario: You are a data mining consultant, and your client is a large bank,
which wishes to mine its customer data to determine credit risk. The bank holds terabytes of data on
its customers and is concerned that the available computing resources may be inadequate to mine this
volume of data.
Here's how the situation might unfold. Different types of credit (personal loans, business loans,
overdrafts) present different patterns of credit risk, so each data mining project will concentrate
on just one type of borrower. The bank's domain experts judge a number of factors to be relevant,
and the bank, planning ahead, began collecting data on these factors about 18 months ago. Since
then, almost a thousand cases of bad debt have occurred.
Thus, the relevant data consist of less than a thousand cases of bad debt plus a sample from a
plentiful supply of cases of good debt: let's say 3,000 records in all.
Somehow, the need to mine terabytes of data has disappeared "mysteriously".
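The sampling arithmetic from Pitfall #1 can be sketched in a few lines of Python. This is a minimal
illustration rather than a prescribed method: the function names, the `churned` field, and the
synthetic customer table (scaled down to 100,000 rows so that it runs quickly) are all assumptions
made for the example. It multiplies the expected number of churn profiles by the examples wanted for
each, then draws that many churners and an equal number of non-churners at random.

```python
import random


def required_per_class(expected_profiles, examples_per_profile):
    """Rough sample size per class: profiles times examples wanted for each."""
    return expected_profiles * examples_per_profile


def balanced_sample(records, is_positive, per_class, seed=0):
    """Draw up to `per_class` positive records and an equal number of negatives."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    positives = [r for r in records if is_positive(r)]
    negatives = [r for r in records if not is_positive(r)]
    # Never ask for more records than either class can supply.
    n = min(per_class, len(positives), len(negatives))
    return rng.sample(positives, n) + rng.sample(negatives, n)


# Hypothetical customer table: one customer in five has churned (20 percent).
customers = [{"id": i, "churned": i % 5 == 0} for i in range(100_000)]

# "Maybe ten" profiles, "maybe a thousand" examples of each.
per_class = required_per_class(expected_profiles=10, examples_per_profile=1_000)
sample = balanced_sample(customers, lambda c: c["churned"], per_class)

print(len(sample), sum(c["churned"] for c in sample))  # prints: 20000 10000
```

With ten profiles and a thousand examples each, the sketch keeps 10,000 churners and 10,000
non-churners, a fifth of the synthetic table.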
Pitfall #3: Disorganized data mining
Data mining can occasionally, despite the best of intentions, take place in an ad hoc manner, with
no clear goals and no idea of how the results will be used. This leads to wasted time and unusable
results.
To produce useful results, it is critical to have clearly defined business and data mining goals,
formulated early in the project, and clearly articulated deployment plans. A simple way of ensuring
this is to use a standard process such as the CRoss-Industry Standard Process for Data Mining
(CRISP-DM) [1]. Such a process ensures the correct preparation for data mining and provides a common
language for communicating methods and results. Data mining tools should support standard process
models.
Pitfall #4: Insufficient business knowledge
On a number of occasions this article has mentioned the crucial role that business knowledge plays
in data mining. Without it, organizations can neither achieve useful results nor guide the data
mining process towards them.
It is sometimes supposed that the end user can reasonably tell the data miner: "Here are the data,
please go away, do your data mining, and come back with the answers." If this were to happen, the
project would, at best, take many long and costly iterations to produce useful results. At worst,
the results would be gibberish, and the project would fail.
This pitfall can only be avoided by involving, at every stage of the data mining process, both the
end user and someone with a detailed knowledge of the business. Ideally, the data miner or data
mining consultant would have the business knowledge. Lacking it, the data miner should literally sit
next to someone with the required business knowledge who understands the question under
consideration. For this to work effectively, a highly interactive data mining environment with good
response time is required.
Pitfall #5: Insufficient data knowledge
In order to perform data mining, we must be able to answer questions like "What do the codes in this
field mean?" and "Can there be more than one record per customer in this table?" In some cases, this
information is surprisingly hard to come by. It might be that the data expert has left the
organization or moved to another department or, in the case of legacy systems, there may be no data
expert at all. This problem is exacerbated when the database or data warehouse management is
outsourced: the external supplier is even less motivated than the user organization to maintain this
information "just in case it might be needed in future."
There is no simple resolution to this problem. IT departments should be made aware of the need to
maintain information about their organization's databases. Also, when a data mining project is
proposed, data miners should consider how much data knowledge is available and evaluate any risks
caused by its absence or scarcity.
Pitfall #6: Erroneous assumptions, courtesy of the experts
Business and data experts are crucial resources, but this does not mean that the data miner should
unquestioningly accept every statement they make. The data miner should seek to confirm the validity
of experts' statements.
Typical examples of erroneous or misleading statements might include:
"No customer can hold accounts of both these types."
"No case will include more than one event of this type."
"Only the following codes will be present in this field."
Data miners should verify statements like these by examining the data. This is particularly
important when processing of the data will depend on their accuracy. Ideally, mistakes in
assumptions about data can be spotted before they lead to errors in the treatment of data. Data
mining tools should make this easy to accomplish.
Pitfall #7: Incompatibility of data mining tools
The data mining process requires a wide range of capabilities, so it's not unusual that during a
single project a wide variety of tools might be used. This can, however, lead to high overhead costs
due to the time and resources required to switch contexts and convert data from one format to
another. At its worst, this can lead to the omission of necessary steps in the data mining process
and can seriously interfere with the exploratory character of data mining.
The best solution is to use a data mining toolkit that integrates all the required capabilities.
However, no toolkit will provide every possible capability, especially when the individual
preferences of analysts are taken into account, so the toolkit should also be "open", that is, able
to interface easily with other available tools and third-party options.
Pitfall #8: Locked in the data jail-house
In addition to openness with regard to tools, data mining solutions should also be open with regard
to data. Some data mining tools require the data to be held in a proprietary format that is not
compatible with commonly used database systems. (This is sometimes referred to as the "data
jail-house.") This can result in high overhead costs, due to the need for transferring data into the
required format, and lead to difficulty in deploying the results into an organization's operational
systems. A good data mining tool will interface with your data via common standards.
Conclusion
Data mining is a business process, requiring extensive business knowledge. It is best practiced by
business experts or by data mining experts in close collaboration with business experts.
Data mining uses a variety of techniques and should not focus only on modeling algorithms and their
predictive accuracy. Each technique can play a variety of roles. During the data mining process,
data miners interact and engage with the data in an iterative fashion. A standard data mining
process model, such as CRISP-DM [1], helps to ensure the correct preparation for and use of data
mining. Data mining tools should be evaluated based on their accessibility to business users, their
scalability and usability, and their support for standard processes.
Data miners should make intelligent decisions about the amount of data required, assuming neither
that all of an organization's data will be relevant nor that all the available data will be
required.
Effective data mining requires flexible and interoperable techniques. This requirement is best met
by integrated, open toolkits that can interface to data by means of open standards.
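As a closing illustration of the verification step recommended under Pitfall #6, the three example
statements given there, and the record-per-customer question from Pitfall #5, can each be tested
directly against the data. The sketch below shows one possible way to do this in Python; the
function names and the tiny datasets are hypothetical, invented only to illustrate the checks.

```python
from collections import Counter, defaultdict


def customers_with_both_types(accounts, type_a, type_b):
    """Return ids of customers who hold accounts of both given types."""
    types_by_customer = defaultdict(set)
    for customer_id, account_type in accounts:
        types_by_customer[customer_id].add(account_type)
    return sorted(
        cid for cid, types in types_by_customer.items()
        if {type_a, type_b} <= types
    )


def repeated_keys(keys):
    """Return keys that occur more than once, e.g. cases with several events."""
    return sorted(k for k, count in Counter(keys).items() if count > 1)


def unexpected_codes(values, allowed):
    """Return codes found in the data but missing from the expert's list."""
    return sorted(set(values) - set(allowed))


# Hypothetical data, invented for illustration only.
accounts = [(1, "savings"), (1, "loan"), (2, "savings"), (3, "loan"), (3, "loan")]
event_case_ids = [101, 102, 102, 103]
status_codes = ["A", "B", "A", "X"]

# A non-empty result contradicts the corresponding expert statement.
print(customers_with_both_types(accounts, "savings", "loan"))    # [1]
print(repeated_keys(event_case_ids))                             # [102]
print(unexpected_codes(status_codes, allowed={"A", "B", "C"}))   # ['X']
```

Each function returns the counterexamples rather than a yes or no answer, so that any violation can
be taken back to the expert for discussion.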
References
[1] Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., and Wirth, R. CRISP-DM 1.0:
Step-by-step data mining guide. CRISP-DM Consortium, 2000. Available at http://www.crisp-dm.org
For more information about SPSS, visit www.spss.ch
SPSS Schweiz AG, Schneckenmannstrasse 25, 8044 Zürich
Telephone +41 (0) 1 266 90 30, Fax +41 (0) 1 266 90 39
SPSS is a registered trademark and the other SPSS
products named are trademarks of SPSS Inc. All
other names are trademarks of their respective owners.
© 2005 SPSS Inc. All rights reserved. DamiD/0404