FROM BUSINESS OBJECTIVES TO DATA
MINING: TOWARDS A SYSTEMATIC WAY OF
DATA MINING PROJECT DEVELOPMENT
Ernestina Menasalvas
Facultad de Informática
Universidad Politecnica de Madrid. Spain
emenasalvas@fi.upm.es
November 2004
Background(I)
• 1995: doctoral student.
– Visit University of Regina (Prof. Ziarko)
– Visit Warsaw University (Prof. Pawlak)
• 1998: Defended thesis: Data Mining process model
(Anita Wasilewska & C. Fernandez-Baizan)
• Since then:
– Data Bases Professor: Data bases, data mining
– Coordinator of the Data Mining group at Facultad de
Informática UPM
• Techniques: Rough Sets, Bayes, …
• Methodologies for data mining process management
– Evaluation in Data Mining
– Experimentation in Web Mining
• Web Mining: Web Goal Mining
Background(II)
• Projects developed:
– Pure Research:
• Data Mining to be integrated on RDBMS
• Web Profiler
• Methodology for Data Mining process management
– Research and application:
• Data Mining applied on different domains:
– Car dealers
– Travel agency
– ….
Data Mining Project Development
• Methodologies for Data Mining project
development
– Is Data Mining really a science?
– Are we developing projects as an art?
– Has research achieved the same results in all areas?
• Algorithms
• Data preparation
• Data enrichment
• Conceptualization of Data Mining problems
Data Mining: an art, a science?
• Since data mining appeared, a lot of algorithms have been
programmed
• Standards:
– Crisp-DM
– SEMMA
– PMML 3.0
• Process depends on the expertise of the data miner
• User speaks about business problems
• Data Miner speaks about algorithms
Data Mining as a project
• Data Mining is a data-intensive activity
– Data understanding
– Data Preparation
• Database manager:
– Transactional databases
– Data warehouses
• The end result of a data mining project is a tool
(software project) for a better decision-making
process:
– Software development project
• IT department has to be involved
Project Management
• Why?
– To organize the development process and to
produce a project plan
• How? (LIFECYCLE MODELS)
– Establish how the process is going to be developed:
• Sequential
• Incremental
– A way of doing things, independent of the
process being developed
• What? (METHODOLOGY)
– Establish how the process is split into phases and
define the tasks to be carried out in each step:
• RUP
• XP
• CommonKADS
– Particular tasks, with the detail of the tasks to be
carried out
Common pitfalls of data mining
implementation
• The common pitfalls of data mining implementation are the
following:
– Not being able to efficiently communicate mining results
within an organization.
– Not having the right data to conduct effective analysis.
– Not using existing data correctly.
– Not being able to evaluate results
• Questions that arise:
– Can the adequateness of a set of data for a problem be
established when preparing the project plan?
– How can the set of data be used to produce the expected
results?
– How can we evaluate the results?
– Cost estimation?
Data Mining Approaches
• Vendor independent:
– CRISP-DM: a process model, not a real methodology
• Based on commercial tools:
– CATs: based on CRISP-DM
– SEMMA
• CRM methodology:
– CRM Catalyst: global CRM process; does not concentrate on
the data mining step
Cross-Industry Standard Process for
Data Mining: CRISP-DM
Data Mining as a project: CATs
• CATs: Clementine Application Templates [CATs]
– Specific libraries of best practices that provide immediate
value right out of the box
– They follow the CRISP-DM standard: every CAT stream is
assigned to a CRISP-DM phase
– They provide long-term value, as they can always be used
with a new data set for new insight in other projects.
• Available as add-on modules to Clementine, they include:
– Telco CAT - improve retention and cross-selling efforts for
telecommunications
– CRM CAT - understand and predict customer migration
between segments
– Microarray CAT - accelerate biological discoveries, find
genes
– Fraud CAT - predict and detect instances of fraud in
financial transactions, claims, tax returns …
– Web CAT
What is a CAT?
[CATs]
SEMMA(1)
• SEMMA (Sample, Explore, Modify, Model, Assess):
[SEMMA]
– Is not a data mining methodology
– Rather a logical organization of the functional tool set of
SAS Enterprise Miner for carrying out the core tasks of
data mining.
– Enterprise Miner can be used as part of any iterative
data mining methodology adopted by the client.
– Naturally steps such as formulating a well defined
business or research problem and assembling quality
representative data sources are critical to the overall
success of any data mining project.
SEMMA(2)
• SEMMA is focused on the model development aspects of data
mining: [SEMMA]
– Sample the data to extract a portion of a large data set big
enough to contain significant information, yet small enough to
manipulate quickly.
– Explore the data by searching for unanticipated trends and
anomalies in order to gain understanding and ideas.
– Modify the data by creating, selecting and transforming the
variables to focus the model selection problem.
– Model the data by allowing the software to search automatically
for a combination of data that reliably predicts a desired outcome.
Modelling techniques include neural networks, tree classifiers,
statistical models, etc.
– Assess the data by evaluating the usefulness and reliability of
the findings from the data mining process and estimate how well
it performs.
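As an illustration of the Sample step only, the following is a minimal sketch of a simple random sample without replacement. It does not use SAS Enterprise Miner; the function name `sample_rows`, the 5% fraction and the synthetic purchase amounts are assumptions made for the example.

```python
import random
from statistics import mean


def sample_rows(rows, fraction, seed=0):
    """Simple random sample without replacement (the SEMMA 'Sample' idea)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(len(rows) * fraction))
    # A dedicated, seeded generator keeps the sample reproducible.
    return random.Random(seed).sample(rows, k)


# Hypothetical data set: 10,000 synthetic purchase amounts.
rng = random.Random(42)
amounts = [rng.uniform(5.0, 500.0) for _ in range(10_000)]

subset = sample_rows(amounts, fraction=0.05)
print(len(subset))  # 500
# A first "Explore" check: the sample mean should be close to the full mean.
print(round(mean(amounts), 2), round(mean(subset), 2))
```

Whether 5% is "big enough to contain significant information" depends on the data and on the goal; the fraction is a parameter to be agreed, not a recommendation.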
Methods for Project Management:
CRM Catalyst(1)
• Developed jointly by CustomISe, MACS and SalesPathways.
Together they have formed the Catalyst Foundation
http://www.crmmethodology.com/
Motivations:
• CRM projects are difficult to execute successfully because of the
wide range of factors influencing their success. So it can take a
long time to make CRM work properly for an organisation.
• Solution: CRM Catalyst.
• Methodology acts as a catalyst for CRM projects enabling them
to achieve their objectives more reliably and in less time.
• It gives a project life cycle with a set of defined phases broken
down into steps with clearly stated inputs and outputs.
Methods for Project Management:
CRM Catalyst(2)
• Implementation requires a data mining development
process
• Progressive lifecycle model: the results are obtained in
a progressive way
• Implementation is knowledge intensive in some steps:
a knowledge-intensive methodology could
be appropriate
Main steps in a Data Mining Project
1. Define the goals:
– Business and data mining experts together have to define
the goals
– Each goal must be defined with measurements for success
2. Obtain the models:
– Apply data mining algorithms.
– Preprocessing is important
3. Evaluate results:
– Ascertain the value of an object according to specified
criteria, operationalised in terms of measures.
4. Deploy:
– Decide which patterns and models can be deployed
5. Evaluate after deployment:
– Once the product is working, its results should be checked
1. Define the goals
• Distinguish between :
– Data Mining goals
– Business goals
• How do we translate?
– Business goal: increase the lifetime value of valuable customers
– Data mining goal: classification? Estimation? Association?
• It has to be solved in the Business
Understanding step of CRISP-DM
Business Understanding
in the CRISP-DM Process
• Determine Business Objectives
– Background
– Business Objectives
– Business Success Criteria
• Assess Situation
– Inventory of Resources
– Requirements, Assumptions & Constraints
– Risks & Contingencies
– Terminology
– Costs & Benefits
• Determine Data Mining Goals
– Data Mining Goals
– Data Mining Success Criteria
• Produce Project Plan
– Project Plan
– Initial Assessment of Tools & Techniques
1.1 Determine Business
objectives and success criteria
• Not only business objectives but also measures have to be
established, so that the results can be evaluated
• Business objectives:
– What is the customer's primary objective?
• Increase the number of loyal customers
• Sell more of a certain product
• Have a positive marketing campaign
• Business success criteria:
– What constitutes a successful outcome of the project?
– Objective measures so that success can be established
– ROI
1.2 Costs & Benefits
• Perform a cost-benefits analysis
• Compute the benefits of the project
– Which measures do we have?
– ROI
– CAPEX
– OPEX....
• Compute the costs of the project (equipment, human
resources...)
– Which methodology do we have?
– COCOMO for software
• Quantify the risk that the project fails
– Knowledge not available
– Data not available
– Proper tools not available
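The cost-benefit reasoning above can be sketched with the usual ROI definition, (benefits - costs) / costs. The risk adjustment is an assumption made for illustration: a failed project is taken to yield no benefit while still incurring all costs. All figures are hypothetical.

```python
def roi(benefits: float, costs: float) -> float:
    """Return on investment as a fraction: (benefits - costs) / costs."""
    if costs <= 0:
        raise ValueError("costs must be positive")
    return (benefits - costs) / costs


def expected_roi(benefits: float, costs: float, p_failure: float) -> float:
    """ROI when benefits only materialise if the project does not fail.

    Assumption: a failed project yields no benefit but still costs everything.
    """
    if not 0.0 <= p_failure <= 1.0:
        raise ValueError("p_failure must be in [0, 1]")
    return roi(benefits * (1.0 - p_failure), costs)


# Hypothetical figures, for illustration only.
costs = 80_000.0      # equipment + human resources
benefits = 120_000.0  # estimated gain from better decisions

print(roi(benefits, costs))                 # 0.5
print(expected_roi(benefits, costs, 0.25))  # 0.125
```

With a 25% chance of failure (for example, because the required data may turn out not to be available), the expected return drops from 50% to 12.5%, which is why the risk has to be quantified alongside costs and benefits.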
Data Mining Estimation Model
• Establishing a parametric estimation model for Data
Mining (Marban’03)
DMCOMO
(Data Mining COst MOdel)
Data Mining Cost Estimation
• Main factors in a Data Mining project
– Data Sources (number, kind, nature, …)
– Data mining problem to be solved (descriptive,
predictive, …)
– Development platform
– Available tools
– Expertise of the development team
• Drivers
– Data drivers
– Model drivers
– Platform drivers
– Tools and techniques drivers
– Project drivers
– People drivers
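To show what a parametric model built on drivers can look like, here is a minimal sketch in which a nominal effort is scaled by one multiplier per driver group, in the spirit of COCOMO effort multipliers. This is not the DMCOMO equation, which these slides do not give; the function name, the nominal effort and every multiplier value are invented for the example.

```python
from math import prod
from typing import Mapping

# Hypothetical effort multipliers, one per driver group (1.0 = nominal).
# The values are invented for illustration; they are NOT DMCOMO coefficients.
MULTIPLIERS = {
    "data": 1.25,      # e.g. several heterogeneous data sources
    "model": 1.0,
    "platform": 1.0,
    "tools": 0.75,     # e.g. mature tools already available
    "project": 1.0,
    "people": 1.5,     # e.g. team with little data mining experience
}


def estimate_effort(nominal_person_months: float,
                    multipliers: Mapping[str, float]) -> float:
    """Scale a nominal effort by the product of all driver multipliers."""
    if nominal_person_months < 0 or any(m <= 0 for m in multipliers.values()):
        raise ValueError("effort must be non-negative and multipliers positive")
    return nominal_person_months * prod(multipliers.values())


print(estimate_effort(8.0, MULTIPLIERS))  # 11.25
```

In a real parametric model the multipliers would be calibrated on data from finished projects rather than chosen by hand.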
1.3 Data Mining goals and success criteria
• Data mining goals:
– Translate the customer's primary objective into a data
mining goal, e.g.
• Loyalty program translated into a segmentation problem
• Decreasing the attrition rate transformed into a classification
problem
• Data mining success criteria:
– Determine success in technical terms
• Translate the notion of success into confidence, support,
lift and other parameters
• Determine the cost of errors
• How do we make the translation?
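Confidence, support and lift, the technical measures mentioned above, can be computed as in the following minimal sketch for an association rule A -> C. The function name `rule_measures` and the market baskets are assumptions made for the example.

```python
from typing import Iterable, Sequence, Set, Tuple


def rule_measures(transactions: Sequence[Set[str]],
                  antecedent: Iterable[str],
                  consequent: Iterable[str]) -> Tuple[float, float, float]:
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)         # contain A
    n_c = sum(1 for t in transactions if c <= t)         # contain C
    n_ac = sum(1 for t in transactions if (a | c) <= t)  # contain A and C
    if n_a == 0 or n_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = n_ac / n            # P(A and C)
    confidence = n_ac / n_a       # P(C | A)
    lift = confidence / (n_c / n) # P(C | A) / P(C)
    return support, confidence, lift


# Hypothetical market baskets, for illustration only.
baskets = [{"bread", "milk"}] * 3 + [{"bread"}, {"milk"}] + [{"eggs"}] * 3

print(rule_measures(baskets, {"bread"}, {"milk"}))  # (0.375, 0.75, 1.5)
```

A lift above 1 means the antecedent raises the probability of the consequent; a success criterion could then be stated as minimum values for these three measures.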
Methodology
• What methodology should be followed to
translate business objectives into data mining
objectives?
• Unfortunately, there is no such methodology. First we
have to solve:
– How is a business objective expressed?
– What is a data mining goal?
– How are data mining goals achieved?
– What are the requirements of data mining functions?
• In order to describe everything in a standard way:
conceptualize the problem
Conceptualization in other disciplines
• Data Bases:
– E/R diagrams
– Independent of the domain
– A tool for business understanding and for the database
designer
– Translation from E/R to implementation
– Three-level database architecture: external views (1 to n),
conceptual schema, internal schema
Proposed 3-level architecture
• Business problems
• Conceptual schema: requirements of algorithms will
be solved at this level
• Internal schema: tool requirements to
be solved (SAS, WEKA, Clementine…)
3 layers architecture for data mining
• It is the bridge:
– Between business goals and the final tool
– Independent of the domain
• Provides independence:
– Changes in the tool do not affect the solution
• It has to be decided what to model in the
conceptualization
• Automatic translation of business goals into data
mining goals
• Data Mining goals + constraints = feasible data
mining goals
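One way to picture the conceptual layer is as a tool-independent lookup from business goals to candidate data mining goals, filtered by constraints on the available data. The sketch below is only an assumption about how such a layer might be represented: the two goal translations are the examples given earlier in the talk, while the class, the function and the requirement names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List


@dataclass(frozen=True)
class DataMiningGoal:
    problem_type: str          # e.g. "segmentation", "classification"
    requires: FrozenSet[str]   # data requirements, independent of any tool


# Conceptual layer: business goal -> candidate data mining goals.
# The two translations are the examples from the talk; the requirement
# names are hypothetical.
CONCEPTUAL_SCHEMA: Dict[str, List[DataMiningGoal]] = {
    "loyalty program": [
        DataMiningGoal("segmentation", frozenset({"customer_attributes"})),
    ],
    "decrease attrition rate": [
        DataMiningGoal("classification",
                       frozenset({"customer_attributes", "attrition_label"})),
    ],
}


def feasible_goals(business_goal: str,
                   available_data: Iterable[str]) -> List[DataMiningGoal]:
    """Data mining goals + constraints = feasible data mining goals."""
    available = set(available_data)
    return [g for g in CONCEPTUAL_SCHEMA.get(business_goal, [])
            if g.requires <= available]


available = {"customer_attributes"}
print([g.problem_type for g in feasible_goals("loyalty program", available)])
# ['segmentation']
print(feasible_goals("decrease attrition rate", available))
# [] (no attrition label in the data, so this goal is not feasible yet)
```

Nothing in this layer names SAS, WEKA or Clementine, which is the independence the architecture aims for: the choice of tool belongs to the internal level.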
Elements to conceptualize
• Elements to be taken into account:
– Data:
• Quality from data mining point of view
• Adequateness for the problem
• Classification for data mining purposes
– Knowledge:
• Related to the process being analyzed
• Related to the data used
– People
• Owners of data
• Experts in the process
– Data mining problems requirements
– Data mining methods requirements
Proposed process
DMMO
• Data Mining Modelling Objects:
– Data
– Knowledge
– Constraints of data and applications
– Data Mining objects
• Algorithms
• Measures
• Methods
• To bridge the gap between data miners and
business users
Are data adequate for analysis?
• The adequateness of the data is analyzed taking into
account the goals to fulfil.
• Data, together with the knowledge extracted from the
experts, can be transformed so that, simply by being fed to
a certain data mining algorithm, they produce the
required patterns.
• Quality of the data, in this context:
– is not only related to technical quality (proper model,
percentage of null values),
– but also has to do with:
• the meaning of the attributes,
• where each piece of data comes from,
• the relationships among data, and
• finally, how the data fulfil the requirements of the data mining
functions
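The technical side of data quality mentioned above, the percentage of null values, is straightforward to measure. The following is a minimal sketch; the function name, the customer records and the 50% threshold are assumptions made for the example, and it says nothing about the semantic aspects (meaning, provenance, relationships), which need expert knowledge.

```python
from typing import Dict, Iterable, List, Mapping


def null_percentage(rows: List[Mapping[str, object]],
                    attributes: Iterable[str]) -> Dict[str, float]:
    """Percentage of rows in which each attribute is missing (None or absent)."""
    if not rows:
        raise ValueError("rows must not be empty")
    return {
        attr: 100.0 * sum(1 for r in rows if r.get(attr) is None) / len(rows)
        for attr in attributes
    }


# Hypothetical customer records.
rows = [
    {"age": 34, "city": "Madrid", "income": None},
    {"age": None, "city": "Regina", "income": 2100},
    {"age": 51, "city": "Warsaw", "income": None},
    {"age": 29, "city": "Pisa"},  # "income" is absent altogether
]

MAX_NULL_PCT = 50.0  # hypothetical threshold agreed for this goal

report = null_percentage(rows, ["age", "city", "income"])
print(report)  # {'age': 25.0, 'city': 0.0, 'income': 75.0}
print([a for a, pct in report.items() if pct > MAX_NULL_PCT])  # ['income']
```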
2. Data Mining: obtain models
• Apply data mining process model
• Associated problems solved by the 3 layers
architecture:
– Comparison of approaches
– Evaluate costs
– Pros and cons of approaches
• Only experience or a conceptualization can help
• The conceptual model will help to establish the
process to obtain each feasible model.
• Requirements and transformations implicit in the
model
2.1 Determine type of problem
– What are data mining problems?
• Classification
• Estimation
• Association
• Segmentation
– In the conceptual model, the requirements for each type will
be established
2.2 Apply CRISP-DM process model
– The data mining problem has to be settled before going into
the modeling step
– Requirements will be established in Business
Understanding
– Requirements will be checked in Data Understanding
and Data Preparation
– Preparation will be guided by the conceptual model
– Feasibility can be evaluated before applying
the model
[Figure: CRISP-DM phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment]
3. Evaluate results
[Spiliopoulou, Berendt]
• Evaluation: the act of ascertaining the value of an object
according to specified criteria, operationalised in terms of
measures.
• Object= model already obtained
• Criteria and measures have to do with the goals
• Evaluation requires a well-defined notion of success, which
must be in place before
– the evaluation takes place
– the data mining phase starts
– any work with the data starts
• i.e. already during the business understanding process.
• Here once again conceptualization plays its role
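The requirement that the notion of success be in place before any work with the data starts can be made concrete by writing the criteria down as data and checking the model's measures against them afterwards. This is a minimal sketch under that assumption; the class, the measure names and all thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Mapping


@dataclass(frozen=True)
class Criterion:
    measure: str
    threshold: float
    higher_is_better: bool = True


def evaluate(measures: Mapping[str, float],
             criteria: Iterable[Criterion]) -> Dict[str, bool]:
    """Check each criterion against the measured value of the model."""
    results = {}
    for c in criteria:
        value = measures[c.measure]
        results[c.measure] = (value >= c.threshold if c.higher_is_better
                              else value <= c.threshold)
    return results


# Fixed during business understanding, before any data is touched.
criteria = [
    Criterion("confidence", 0.7),
    Criterion("lift", 1.2),
    Criterion("cost_of_errors", 5000.0, higher_is_better=False),
]

# Obtained much later, once a model exists (hypothetical values).
measures = {"confidence": 0.75, "lift": 1.5, "cost_of_errors": 6200.0}

print(evaluate(measures, criteria))
# {'confidence': True, 'lift': True, 'cost_of_errors': False}
```

Because the criteria are frozen before modeling, a model that misses one of them cannot be rescued by redefining success after the fact.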
Evaluation in the CRISP-DM Process
• The CRISP-DM process is
– a non-ending circle of iterations
– a non-sequential process, where backtracking to previous
phases is usually necessary
• In each sequential instantiation evaluation takes place:
– Business Understanding, Data Understanding, Data Preparation,
Modeling, Evaluation, Deployment
• But it is a cycle
• In all the iterations all the steps should be revisited
• Results have to be evaluated!!
4. Deployment
• All the models that have a positive evaluation can
be deployed
• For the measurements of success to be trusted,
deployment has to follow the rules established at the
beginning of the project
– The real evaluation has not yet been performed
5. Evaluate after deployment
• After deployment there is the need to prove that
the improvements are really due to the actions
taken after a data mining discovery and not to
any other factor or action carried out in the
company
• None of the obvious claims about success of data
mining have ever been systematically tested.
• Experiments are crucial to establish if the impact
of the deployment is really positive or negative
• Experiments have to be designed at the
beginning of the project
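One common way to separate the effect of a deployed model from other factors is a controlled experiment: a treatment group handled using the mined model and a control group handled as before, compared with a significance test. The sketch below uses a two-proportion z-test as one possible choice; the slides do not prescribe a particular test, and the group sizes and response counts are hypothetical.

```python
from math import erfc, sqrt
from typing import Tuple


def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both groups share one rate."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        raise ValueError("no variation in the outcomes; the test is undefined")
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    return z, erfc(abs(z) / sqrt(2.0))


# Hypothetical campaign: 1000 customers contacted using the mined model
# (treatment) versus 1000 contacted the usual way (control).
z, p_value = two_proportion_z_test(120, 1000, 90, 1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

The groups have to be assigned at random and the sizes decided in advance, which is why the experiment has to be designed at the beginning of the project rather than after deployment.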
Conclusions
• Data mining projects are being developed more as
an art than as a science
• Many algorithms have been implemented, but there is no
systematic proof, after deployment in real cases, that
one is better than another
• Conceptual model is required:
– To map business goals to the model
– To map data mining algorithms to a conceptual model
• Achievements of the model:
– Will be used along the process to guide the project
– Evaluation tool
Future work
• Conceptual model
– Define DMMO objects
• Evaluation techniques related to the model:
– Evaluate data mining goals
– Evaluate business goals
• Experimentation methods:
– obtrusively and
– non-obtrusively
References
• Bettina Berendt, Myra Spiliopoulou, Ernestina Menasalvas:
Evaluation in Web Mining. Tutorial at ECML/PKDD 2004, Pisa, Italy,
20th September 2004
• Menasalvas, Millán, Gonzalez-Aranda, Segovia: Towards a
Methodology for Data Mining Project Development: The
Importance of Abstraction
• Bettina Berendt, Andreas Hotho, Dunja Mladenic, Maarten van
Someren, Myra Spiliopoulou, Gerd Stumme: Web Mining: From Web
to Semantic Web, First European Web Mining Forum, EWMF 2003,
Cavtat-Dubrovnik, Croatia, September 22, 2003, Revised Selected
and Invited Papers. Springer 2004
• Myra Spiliopoulou, Carsten Pohle: Modelling and Incorporating
Background Knowledge in the Web Mining Process. Pattern Detection
and Discovery 2002: 154-169
• www.crisp-dm.org
• www.spss.com/clementine/cats.htm
• www.sas.com/technologies/analytics/datamining/miner/semma.html
• www.crmmethodology.com
• www.emetrics.org/articles/whitepaper.html
THANKS