DATA MINING
LECTURE 1
INTRODUCTION TO DATA MINING
Data Mining Outline
–Introduction
–Related Concepts
–Data Mining Techniques
DATA MINING VESIT M.VIJAYALAKSHMI
Introduction Outline
Goal: Provide an overview of data mining.
• Define data mining
• Data mining vs. databases
• Basic data mining tasks
• Data mining development
• Data mining issues
Introduction
• Data is growing at a phenomenal rate
• Users expect more sophisticated
information
• How? Uncover hidden information through data mining
Data Mining Definition
• Finding hidden information in a huge
store of data
• Fit data to a model
• Similar terms
– Exploratory data analysis
– Data driven discovery
– Deductive learning
What Is Data Mining?
• Data mining (knowledge discovery in databases):
– Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) information or patterns
from data in large databases
• Alternative names and their “inside stories”:
– Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence, etc.
• What is not data mining?
– (Deductive) query processing.
– Expert systems or small ML/statistical programs
Potential Applications
• Market analysis and management
– target marketing, CRM, market basket
analysis, cross selling, market segmentation
• Risk analysis and management
– Forecasting, customer retention, quality
control, competitive analysis
• Fraud detection and management
• Text mining (news group, email, documents)
and Web analysis.
– Intelligent query answering
Market Analysis and Management (1)
• Where are the data sources for analysis?
– Credit card transactions, loyalty cards, discount
coupons, customer complaint calls,
– Target marketing (Find clusters of “model”
customers who share the same characteristics:
interest, income level, spending habits, etc.)
• Determine customer purchasing patterns over
time
• Cross-market analysis
– Associations/co-relations between product sales
– Prediction based on the association information
Market Analysis and Management (2)
• Customer profiling
– data mining can tell you what types of customers buy
what products (clustering or classification)
• Identifying customer requirements
– identifying the best products for different customers
– use prediction to find what factors will attract new
customers
• Provides summary information
– various multidimensional summary reports
– statistical summary information (data central tendency
and variation)
Fraud Detection and Management
• Applications
– widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
• Approach
– use historical data to build models of fraudulent behavior
and use data mining to help identify similar instances
• Examples
– auto insurance: detect a group of people who stage
accidents to collect on insurance
– money laundering: detect suspicious money transactions
(US Treasury's Financial Crimes Enforcement Network)
– medical insurance: detect professional patients and ring of
doctors and ring of references
Other Applications
• Game statistics to gain competitive advantage
• Astronomy: JPL and the Palomar Observatory discovered
22 quasars with the help of data mining
• IBM Surf-Aid applies data mining algorithms
to Web access logs for market-related pages
to discover customer preferences and behavior,
analyze the effectiveness of Web marketing,
improve Web site organization, etc.
Data Mining Algorithm
• Objective: Fit Data to a Model
– Descriptive
– Predictive
• Preference – Technique to choose the best
model
• Search – Technique to search the data
– “Query”
Database Processing vs. Data Mining Processing
• Database processing
– Query: well defined; SQL
– Data: operational data
– Output: precise; subset of database
• Data mining processing
– Query: poorly defined; no precise query language
– Data: not operational data
– Output: fuzzy; not a subset of database
Query Examples
• Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than
$10,000 in the last month.
– Find all customers who have purchased milk
• Data Mining
– Find all credit applicants who are poor credit risks.
(classification)
– Identify customers with similar buying habits.
(Clustering)
– Find all items which are frequently purchased with
milk. (association rules)
Data Mining: On What Kind of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
– Object-oriented and object-relational databases
– Spatial databases
– Time-series data and temporal data
– Text databases and multimedia databases
– Heterogeneous and legacy databases
– WWW
Data Mining Models And Tasks
Data Mining Tasks
• Prediction Methods
– Use some variables to predict unknown or future
values of other variables.
• Description Methods
– Find human-interpretable patterns that describe the
data.
• Concept description: Characterization and
discrimination
– Generalize, summarize, and contrast data
characteristics, e.g., dry vs. wet regions
Basic Data Mining Tasks
• Classification & Prediction
• maps data into predefined groups or classes
• Finds models (functions) that describe and distinguish
classes or concepts for future prediction
• E.g., classify countries based on climate, or classify cars
based on gas mileage
• Presentation: decision-tree, classification rule, neural
network
• Prediction: Predict some unknown or missing numerical
values
• 3 methods
– Supervised learning
– Pattern recognition
– Prediction
Basic Data Mining Tasks
• Regression
– is used to map a data item to a real valued prediction
variable.
– Learning a function that best fits the target data
• Clustering
– groups similar data together into clusters.
– Class label is unknown: Group data to form new classes,
e.g., cluster houses to find distribution patterns
– Segmentation
– Partitioning
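As an illustration of the regression task (learning a function that best fits the target data), here is a minimal least-squares line fit. It is only a sketch: the function name `fit_line` and the sample points are made up for this example and are not from the lecture.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Hypothetical data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The fitted function maps any new data item x to a real-valued prediction slope * x + intercept.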
Basic Data Mining Tasks
• Summarization maps data into subsets with associated
simple descriptions.
– Characterization
– Generalization
• Link Analysis uncovers relationships among data.
– Affinity Analysis
– Association Rules
– age(X, “20..29”) ^ income(X, “20..29K”) → buys(X, “PC”)
[support = 2%, confidence = 60%]
– contains(T, “computer”) → contains(T, “software”) [support = 1%,
confidence = 75%]
– Sequential Analysis determines sequential patterns.
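The support and confidence figures attached to association rules can be computed directly from a set of transactions. The sketch below assumes the usual definitions (support = fraction of transactions containing all items in the rule, confidence = support of the whole rule divided by support of the left-hand side). The baskets and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the transactions.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical market baskets
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter"},
]

# Rule: milk -> bread
rule_support = support(baskets, {"milk", "bread"})
rule_confidence = confidence(baskets, {"milk"}, {"bread"})
print(rule_support, rule_confidence)
```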
Sequence Discovery
• Given a set of objects, each associated
with its own timeline of events,
find rules that predict strong sequential
dependencies among different events.
• Rules are formed by first discovering
patterns.
• Event occurrences in the patterns are
governed by timing constraints.
• Patterns are similar to association rules, but the
events are related by time
Are All the “Discovered” Patterns Interesting?
• A data mining system/query may generate thousands of
patterns; not all of them are interesting.
• Interestingness measures: A pattern is interesting if it is
easily understood by humans, valid on new or test data
with some degree of certainty, potentially useful, novel,
or validates some hypothesis that a user seeks to confirm
• Objective vs. subjective interestingness measures:
– Objective: based on statistics and structures of
patterns, e.g., support, confidence, etc.
– Subjective: based on user’s belief in the data, e.g.,
unexpectedness, novelty, etc.
Can We Find All and Only Interesting
Patterns?
• Find all the interesting patterns: Completeness
– Association vs. classification vs. clustering
• Search for only interesting patterns:
Optimization
– Approaches
• First generate all the patterns and then filter
out the uninteresting ones.
• Generate only the interesting patterns
Data Mining vs. KDD
• Knowledge Discovery in Databases (KDD):
process of finding useful information and
patterns in data.
• Data Mining: Use of algorithms to extract
the information and patterns derived by the
KDD process.
Data Mining and Business Intelligence
[Figure: layered pyramid; potential to support business decisions increases toward the top. Layers from top to bottom:]
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles shown beside the layers, from top to bottom: End User, Business Analyst, Data Analyst, DBA
Visualization Techniques
• Graphical
• Geometric
• Icon-based
• Pixel-based
• Hierarchical
• Hybrid
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, drawing on the surrounding disciplines]
• Database Technology
• Machine Learning
• Statistics
• Information Science
• Visualization
• Other Disciplines
Data Mining Development
•Relational Data Model
•SQL
•Association Rule Algorithms
•Data Warehousing
•Scalability Techniques
•Algorithm Design Techniques
•Algorithm Analysis
•Data Structures
•Similarity Measures
•Hierarchical Clustering
•IR Systems
•Imprecise Queries
•Textual Data
•Web Search Engines
•Bayes Theorem
•Regression Analysis
•EM Algorithm
•K-Means Clustering
•Time Series Analysis
•Neural Networks
•Decision Tree Algorithms
Data Mining Issues
• Human Interaction
• Overfitting
• Outliers
• Interpretation
• Visualization
• Large Datasets
• High Dimensionality
• Multimedia Data
• Missing Data
• Irrelevant Data
• Noisy Data
• Changing Data
• Integration
• Application
Major Issues in Data Mining (1)
• Mining methodology and user interaction
– Mining different kinds of knowledge in databases
– Interactive mining of knowledge at multiple levels of
abstraction
– Incorporation of background knowledge
– Data mining query languages and ad-hoc data mining
– Expression and visualization of data mining results
– Handling noise and incomplete data
– Pattern evaluation: the interestingness problem
• Performance and scalability
– Efficiency and scalability of data mining
algorithms
– Parallel, distributed and incremental mining
methods
Major Issues in Data Mining (2)
• Issues relating to the diversity of data types
– Handling relational and complex types of data
– Mining information from heterogeneous
databases and global information systems
(WWW)
• Issues related to applications and social impacts
– Application of discovered knowledge
• Domain-specific data mining tools
• Intelligent query answering
• Process control and decision making
– Integration of the discovered knowledge with
existing knowledge: A knowledge fusion problem
– Protection of data security, integrity, and
privacy
Social Implications of DM
• Privacy
• Profiling
• Unauthorized use
Data Mining Metrics
• Usefulness
• Return on Investment (ROI)
• Accuracy
• Space/Time
Related Concepts Outline
Goal: Examine some areas which are related to
data mining.
• Database/OLTP Systems
• Fuzzy Sets and Logic
• Information Retrieval (Web Search Engines)
• Dimensional Modeling
• Data Warehousing
• OLAP/DSS
• Web Search Engines
• Statistics
• Machine Learning
• Pattern Matching
DB & OLTP Systems
• Schema
– (ID,Name,Address,Salary,JobNo)
• Data Model
– ER
– Relational
• Transaction
• Query:
SELECT Name
FROM T
WHERE Salary > 100000
DM: Only imprecise queries; output is a KDD
object, say a rule, a cluster, or a classification
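To see the precise, well-defined nature of a database query, the SQL statement from this slide can be run against a small in-memory table. This is a sketch using Python's standard sqlite3 module. The table name T and the schema come from the slide, while the three rows and the added ORDER BY clause (used only to make the output order predictable) are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE T (ID INTEGER, Name TEXT, Address TEXT, Salary REAL, JobNo INTEGER)"
)
# Hypothetical rows, not from the lecture
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Smith", "Dallas", 120000, 10),
        (2, "Jones", "Houston", 80000, 20),
        (3, "Lee", "Chicago", 150000, 10),
    ],
)
# The result is an exact subset of the stored data
names = [
    row[0]
    for row in conn.execute("SELECT Name FROM T WHERE Salary > 100000 ORDER BY ID")
]
conn.close()
print(names)
```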
Fuzzy Sets and Logic
• Fuzzy Set: Set membership function is a real valued
function with output in the range [0,1].
• f(x): Probability x is in F.
• 1-f(x): Probability x is not in F.
• EX:
– T = {x | x is a person and x is tall}
– Let f(x) be the probability that x is tall
– Here f is the membership function
DM: Prediction and classification are fuzzy.
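A membership function for the "tall" example might look like the sketch below. The piecewise-linear shape and the 160 cm and 190 cm thresholds are assumptions chosen for illustration; the slide only requires that f return a value in [0,1].

```python
def tall_membership(height_cm, low=160.0, high=190.0):
    """Degree in [0, 1] to which a person of this height belongs to T (tall).

    Below `low` the degree is 0, above `high` it is 1,
    and it rises linearly in between (hypothetical thresholds).
    """
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    return (height_cm - low) / (high - low)

f = tall_membership(175.0)   # f(x): degree to which x is in T
not_f = 1.0 - f              # 1 - f(x): degree to which x is not in T
print(f, not_f)
```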
Information Retrieval
• Information Retrieval (IR): retrieving desired
information from textual data.
• Library Science
• Digital Libraries
• Web Search Engines
• Traditionally keyword based
• Sample query:
Find all documents about “data mining”.
DM: Similarity measures;
Mine text/Web data.
Information Retrieval (cont’d)
• Similarity: measure of how close a query is to
a document.
• Documents which are “close enough” are
retrieved.
• Metrics:
– Precision = |Relevant and Retrieved| / |Retrieved|
– Recall = |Relevant and Retrieved| / |Relevant|
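The two metrics translate directly into set operations. In this sketch the document identifiers are hypothetical, and returning 0.0 for an empty set is a convention chosen here, not something the slide specifies.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for a query result."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)          # |Relevant and Retrieved|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical document ids
relevant_docs = {"d1", "d2", "d3", "d4"}
retrieved_docs = {"d2", "d3", "d5"}
p, r = precision_recall(relevant_docs, retrieved_docs)
print(p, r)
```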
IR Query Result Measures and Classification
[Figure: two panels, "IR" and "Classification"]
Dimensional Modeling
• View data in a hierarchical manner, more as
business executives might
• Useful in decision support systems and mining
• Dimension: collection of logically related
attributes; axis for modeling data.
• Facts: data stored
• Ex: Dimensions – products, locations, date
Facts – quantity, unit price
DM: May view data as dimensional.
Relational View of Data
ProdID  LocID       Date    Quantity  UnitPrice
123     Dallas      022900  5         25
123     Houston     020100  10        20
150     Dallas      031500  1         100
150     Dallas      031500  5         95
150     Fort Worth  021000  5         80
150     Chicago     012000  20        75
200     Seattle     030100  5         50
300     Rochester   021500  200       5
500     Bradenton   022000  15        20
500     Chicago     012000  10        25
Dimensional Modeling Queries
• Roll Up: more general dimension
• Drill Down: more specific dimension
• Dimension (Aggregation) Hierarchy
• SQL uses aggregation
• Decision Support Systems (DSS): Computer
systems and tools to assist managers in
making decisions and solving problems.
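A roll-up can be sketched as a group-and-sum over one dimension, which is roughly what an SQL GROUP BY with SUM does. The rows below are taken from the Relational View of Data slide as best they can be read from the transcript. The function name `roll_up` and the choice to total the Quantity fact are assumptions for illustration.

```python
from collections import defaultdict

# (ProdID, LocID, Date, Quantity, UnitPrice)
sales = [
    (123, "Dallas", "022900", 5, 25),
    (123, "Houston", "020100", 10, 20),
    (150, "Dallas", "031500", 1, 100),
    (150, "Dallas", "031500", 5, 95),
    (150, "Fort Worth", "021000", 5, 80),
    (150, "Chicago", "012000", 20, 75),
    (200, "Seattle", "030100", 5, 50),
    (300, "Rochester", "021500", 200, 5),
    (500, "Bradenton", "022000", 15, 20),
    (500, "Chicago", "012000", 10, 25),
]

def roll_up(rows, key_index):
    """Total Quantity (column 3) per value of the chosen dimension column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[3]
    return dict(totals)

quantity_by_product = roll_up(sales, 0)    # aggregate away location and date
quantity_by_location = roll_up(sales, 1)   # aggregate away product and date
print(quantity_by_product)
print(quantity_by_location)
```

Drilling down is the reverse direction: grouping by more columns (for example product and location together) to recover finer detail.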
Data Warehousing
• Operational Data: Data used in day to day
needs of company.
• Informational Data: Supports other
functions such as planning and forecasting.
• Data mining tools often access data
warehouses rather than operational data.
DM: May access data in warehouse
& could use OLAP queries
Web Search Engines
• Web Search Engines are similar to IR systems
• Conventional Search Engines suffer from
several problems
– Abundance
– Limited Coverage
– Limited Query
– Limited Customization
• Concept of “Web Mining”
Statistics
• Simple descriptive models
• Statistical inference: generalizing a model
created from a sample of the data to the
entire dataset.
• Exploratory Data Analysis:
– Data can actually drive the creation of the
model
– Opposite of traditional statistical view.
• Data mining targeted to business user
DM: Many data mining methods come
from statistical techniques.
Machine Learning
• Machine Learning: area of AI that examines how to
write programs that can learn.
• Often used in classification and prediction
• Supervised Learning: learns by example.
• Unsupervised Learning: learns without knowledge of
correct answers.
• Machine learning often deals with small static datasets.
DM: Uses many machine learning
techniques.
Pattern Matching (Recognition)
• Pattern Matching: finds occurrences of
a predefined pattern in the data.
• Applications include speech recognition,
information retrieval, time series
analysis.
DM: Type of classification.
Data Mining Techniques Outline
• Statistical
– Point Estimation
– Models Based on Summarization
– Bayes Theorem
– Hypothesis Testing
– Regression and Correlation
• Similarity Measures
• Decision Trees
• Neural Networks
– Activation Functions
• Genetic Algorithms
Similarity Measures
• Determine similarity between two objects.
• Similarity characteristics:
• Alternatively, distance measures measure how
unlike or dissimilar objects are.
Distance Measures
• Measure dissimilarity between objects
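The formulas from the similarity and distance slides did not survive in this transcript. As a stand-in, the sketch below shows three commonly used measures for numeric vectors: Euclidean distance, Manhattan distance, and cosine similarity. These are typical choices and not necessarily the ones presented in the lecture.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical points
u, v = (0.0, 0.0), (3.0, 4.0)
print(euclidean(u, v), manhattan(u, v))
print(cosine_similarity((1.0, 0.0), (0.0, 1.0)))
```

Larger distance means less alike, while larger similarity means more alike, so the two kinds of measure run in opposite directions.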
Decision Trees
• Decision Tree (DT):
– Tree where the root and each internal node is
labeled with a question.
– The arcs represent each possible answer to the
associated question.
– Each leaf node represents a prediction of a
solution to the problem.
• Popular technique for classification; Leaf node
indicates class to which the corresponding
tuple belongs.
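The structure described above (questions at the root and internal nodes, answers on the arcs, class labels at the leaves) can be sketched with nested dictionaries. The tree itself is hand-built and hypothetical: the attributes `income` and `has_defaulted`, the 50000 threshold, and the class labels are invented for illustration, and no tree-learning algorithm is shown.

```python
# Internal node: {"question": function of a tuple, "branches": {answer: subtree}}
# Leaf node: a class label (string)
tree = {
    "question": lambda t: "high" if t["income"] > 50000 else "low",
    "branches": {
        "high": "good risk",
        "low": {
            "question": lambda t: "yes" if t["has_defaulted"] else "no",
            "branches": {"yes": "poor risk", "no": "good risk"},
        },
    },
}

def classify(node, tup):
    """Follow the arc matching each answer until a leaf is reached."""
    while isinstance(node, dict):
        answer = node["question"](tup)
        node = node["branches"][answer]
    return node

applicant = {"income": 30000, "has_defaulted": True}
print(classify(tree, applicant))
```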
Decision Tree Example
Neural Networks
• Based on observed functioning of human
brain.
• Artificial Neural Networks (ANN)
• Our view of neural networks is very simplistic.
• We view a neural network (NN) from a
graphical viewpoint.
• Alternatively, a NN may be viewed from the
perspective of matrices.
• Used in pattern recognition, speech
recognition, computer vision, and
classification.
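The matrix viewpoint can be made concrete with a small forward pass: each layer is a weight matrix plus a bias vector, and each node applies an activation function to its weighted input sum. This is a sketch only. The sigmoid activation, the layer sizes, and all weight values are assumptions, and no training step is shown.

```python
import math

def sigmoid(z):
    """A common activation function, squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """weights[j][i] is the weight on the arc from input i to node j."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(inputs, network):
    """Propagate the inputs through each (weights, biases) layer in turn."""
    values = inputs
    for weights, biases in network:
        values = layer(values, weights, biases)
    return values

# Hypothetical network: 2 inputs, 2 hidden nodes, 1 output node
network = [
    ([[0.5, -0.5], [0.3, 0.8]], [0.0, 0.1]),
    ([[1.0, -1.0]], [0.0]),
]
output = forward([1.0, 1.0], network)
print(output)
```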
Neural Network Example
Genetic Algorithms
• Optimization search type algorithms.
• Creates an initial feasible solution and iteratively
creates new “better” solutions.
• Based on human evolution and survival of the
fittest.
• Must represent a solution as an individual.
• Individual: string I=I1,I2,…,In where Ij is in given
alphabet A.
• Each character Ij is called a gene.
• Population: set of individuals.
Genetic Algorithms
• A Genetic Algorithm (GA) is a computational model
consisting of five parts:
– A starting set of individuals, P.
– Crossover: technique to combine two parents to
create offspring.
– Mutation: randomly change an individual.
– Fitness: determine the best individuals.
– Algorithm which applies the crossover and mutation
techniques to P iteratively using the fitness function
to determine the best individuals in P to keep.
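The five parts listed above can be sketched in a few lines. Everything specific here is an assumption made for illustration: the alphabet A = {0, 1}, the toy fitness function (count of 1 genes), single-point crossover, the 5% mutation rate, and the policy of keeping the fitter half of P each generation.

```python
import random

def fitness(individual):
    """Toy fitness (assumed for illustration): number of 1 genes."""
    return sum(individual)

def crossover(parent_a, parent_b, rng):
    """Single-point crossover: combine two parents into one offspring."""
    point = rng.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:]

def mutate(individual, rng, rate=0.05):
    """Flip each gene independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in individual]

def genetic_algorithm(n_genes=16, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    # Starting set of individuals, P
    population = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    history = [max(fitness(ind) for ind in population)]
    for _ in range(generations):
        # Keep the fitter half of P, then refill with mutated offspring
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        offspring = [
            mutate(crossover(rng.choice(survivors), rng.choice(survivors), rng), rng)
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + offspring
        history.append(max(fitness(ind) for ind in population))
    best = max(population, key=fitness)
    return best, history

best, history = genetic_algorithm()
print(fitness(best), history)
```

Because the survivors are carried over unchanged, the best fitness recorded in `history` never decreases from one generation to the next.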