Mining Knowledge in Data Explosion Age
廖宜恩 (I-En Liao)
Department of Computer Science and Engineering, National Chung Hsing University
Outline
• Some News Reports
• Why Data Mining
• What is Data Mining
• Knowledge Discovery Process
• Data Mining Functionalities
• Data Mining Process
• Data Mining Tools
• Trends in Data Mining
• Some Research Results on Data Mining
• Conclusions
Some News Reports
• Time's Person of the Year for 2006
• 12 IT skills that employers can't say no to
• F.B.I. Data Mining Reached Beyond Initial
Targets
• MIT names its top 10 emerging technologies
for 2008
• Effect of US Recession on Data Mining
Demand (July 2008)
Why Data Mining
• Data Explosion Problem
– Data in the world doubles every 20 months!
– NASA's Earth Observing System: 46 megabytes of data per second
• About 4,000,000,000,000 bytes a day (4 terabytes/day; 20 × 200 GB hard disks)
– FBI fingerprint image library:
• 200,000,000,000,000 bytes (200 TB)
– In-line image analysis for particle detection: 1 megabyte per second
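The daily volume quoted for the NASA example follows from the per-second rate. A quick check of that arithmetic, assuming decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes), which the slide does not state:

```python
import math

# Assumption: decimal units, as commonly used for disk capacities.
MB = 10**6
GB = 10**9
TB = 10**12
SECONDS_PER_DAY = 24 * 60 * 60

rate_bytes_per_second = 46 * MB                      # 46 MB/s, from the slide
bytes_per_day = rate_bytes_per_second * SECONDS_PER_DAY
disks_needed = math.ceil(bytes_per_day / (200 * GB))  # 200 GB hard disks

print(bytes_per_day)                  # 3974400000000, i.e. roughly 4 TB
print(round(bytes_per_day / TB, 2))   # 3.97
print(disks_needed)                   # 20
```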
Why Data Mining? Commercial Viewpoint
• Lots of data is being collected
and warehoused
– Web data, e-commerce
– purchases at department/
grocery stores
– Bank/Credit Card
transactions
• Competitive Pressure is Strong
– Provide better, customized services for
an edge (e.g. in Customer Relationship
Management)
Why Data Mining? Scientific Viewpoint
• Data collected and stored at
enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene
expression data
– scientific simulations
generating terabytes of data
• Traditional techniques infeasible for
raw data
Mining Large Data Sets - Motivation
• There is often information "hidden" in the data that is not readily evident
• Human analysts may take weeks to discover useful information
• Much of the data is never analyzed at all
• We are drowning in data, but starving for knowledge!
[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts, 1995 to 1999; vertical axis from 0 to 4,000,000.]
What is Data Mining?
• Data Mining (Knowledge Discovery in Databases, KDD):
– Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules
Knowledge Discovery Process
• Data mining: the core of the knowledge discovery process.
[Figure: knowledge discovery pipeline. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
– Enormity of data
– Curse of high dimensionality
– Heterogeneous, distributed nature of data
[Figure: Data Mining drawn at the intersection of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]
Data Mining Functionalities
1. Concept description: Characterization and discrimination
2. Classification
3. Association rule mining
4. Clustering
5. Sequence analysis
6. Anomaly detection
Concept description: Characterization and
discrimination
• Concept description:
– Characterization: provides a concise
summarization of the given collection of data
• Example: Describe general characteristics of
graduate students in the NCHU database
– Discrimination: provides descriptions
comparing two or more collections of data
• Example: Compare graduate and undergraduate students of NCHU using discriminant rules
Classification
• Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
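The training/test protocol described above can be sketched in a few lines. This is a minimal illustration, not the method used in the talk: the records, the 30% test fraction, and the majority-class baseline "model" are all hypothetical choices made here.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (training, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def majority_class_model(training_set):
    """Baseline model: always predict the most frequent training class."""
    labels = [label for _, label in training_set]
    most_common = max(sorted(set(labels)), key=labels.count)
    return lambda attributes: most_common

def accuracy(model, test_set):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(1 for attributes, label in test_set if model(attributes) == label)
    return correct / len(test_set)

# Hypothetical records: (attributes, class label).
records = [({"income": income}, "Yes" if income > 80 else "No")
           for income in (125, 100, 70, 120, 95, 60, 220, 85, 75, 90)]

training_set, test_set = train_test_split(records)
model = majority_class_model(training_set)   # built from the training set only
print(len(training_set), len(test_set))      # 7 3
print(accuracy(model, test_set))             # measured on unseen records
```

Any real classifier can replace the baseline as long as it is fitted on the training set only and scored on the test set.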
Decision Tree Classification Task
Training set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Training set → Learn Model → Decision Tree

Test set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Decision Tree → Apply Model → Test set
Example of a Decision Tree
Training Data
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund?
  Yes → NO
  No → MarSt?
    Married → NO
    Single, Divorced → TaxInc?
      < 80K → NO
      > 80K → YES
Apply Model to Test Data
Test Data
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree.
Refund?
  Yes → NO
  No → MarSt?
    Married → NO
    Single, Divorced → TaxInc?
      < 80K → NO
      > 80K → YES
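Applying the model means walking the tree from the root for each record. A minimal sketch of the tree above as a Python function, checked against the ten training records and then applied to the test record. The slide labels the income branches "< 80K" and "> 80K"; sending exactly 80K to the "Yes" branch is an assumption made here, and it does not affect the test record.

```python
def classify(refund, marital_status, taxable_income_k):
    """Predict the Cheat class by following Refund -> MarSt -> TaxInc."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income (in thousands).
    # Assumption: exactly 80K is treated as the "> 80K" branch.
    return "No" if taxable_income_k < 80 else "Yes"

# Training data from the slide: (Refund, Marital Status, Taxable Income, Cheat)
training = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

errors = [row for row in training if classify(*row[:3]) != row[3]]
print(len(errors))                      # 0: the tree fits every training record

# Test record: Refund = No, Marital Status = Married, Taxable Income = 80K
print(classify("No", "Married", 80))    # No
```

For the test record the walk is Refund = No, then MarSt = Married, which ends at the leaf NO without ever consulting the income.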
Examples of Classification Task
• Predicting tumor cells as benign or
malignant
• Classifying credit card transactions
as legitimate or fraudulent
• Classifying secondary structures of
protein as alpha-helix, beta-sheet, or
random coil
• Categorizing news stories as finance,
weather, entertainment, sports, etc.
Association rule mining
• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
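Rules like these are usually scored by support and confidence. Those two measures are standard in association rule mining but are not defined on the slide, so the sketch below is an illustration under that assumption, computed on the five market-basket transactions.

```python
# Market-basket transactions from the slide, TID 1 to 5.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(antecedent, consequent):
    """Fraction of all transactions containing antecedent and consequent."""
    return count(antecedent | consequent) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions with the antecedent, the fraction with the consequent."""
    return count(antecedent | consequent) / count(antecedent)

print(support({"Diaper"}, {"Beer"}))             # 0.6
print(confidence({"Diaper"}, {"Beer"}))          # 0.75
print(confidence({"Beer", "Bread"}, {"Milk"}))   # 0.5
```

{Diaper} → {Beer} holds in 3 of the 4 transactions that contain Diaper, which is why it is the strongest of the rules checked here.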
Association Rule Discovery: Application 1
• Marketing and Sales Promotion:
– Let the rule discovered be
{Beer, … } --> {Potato Chips}
– Potato Chips as consequent => Can be used to determine
what should be done to boost its sales.
– Beer in the antecedent => Can be used to see which
products would be affected if the store discontinues selling
beer.
– Beer in antecedent and Potato chips in consequent => Can
be used to see what products should be sold with Beer to
promote sale of Potato chips!
Clustering
• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.
Illustrating Clustering
• Euclidean distance-based clustering in 3-D space.
– Intracluster distances are minimized
– Intercluster distances are maximized
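One common way to obtain such clusters is k-means, which repeatedly assigns each point to its nearest centroid and recomputes the centroids. The slide does not name an algorithm, so k-means is a choice made here for illustration, and the 3-D points are hypothetical.

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means with Euclidean distance; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 3-D data: two well-separated groups of four points each.
points = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1),
          (10, 10, 10), (10, 11, 10), (11, 10, 10), (11, 11, 11)]

centroids, clusters = kmeans(points, k=2)
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

Distances inside each recovered group are at most about 1.7, while the two groups are more than 15 apart, which matches the "small intracluster, large intercluster" picture.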
Clustering: Applications
• Market Segmentation:
– Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
• Document Clustering:
– Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
Clustering of Microarray Data
[Figure: clustering of microarray data]
Sequence analysis
• Customer (sequence database)
– Sequence: Purchase history of a given customer
– Element (Transaction): A set of items bought by a customer at time t
– Event (Item): Books, dairy products, CDs, etc.
• Web Data
– Sequence: Browsing activity of a particular Web visitor
– Element (Transaction): A collection of files viewed by a Web visitor after a single mouse click
– Event (Item): Home page, index page, contact info, etc.
• Event data
– Sequence: History of events generated by a given sensor
– Element (Transaction): Events triggered by a sensor at time t
– Event (Item): Types of alarms generated by sensors
• Genome sequences
– Sequence: DNA sequence of a particular species
– Element (Transaction): An element of the DNA sequence
– Event (Item): Bases A, T, G, C
[Figure: a sequence drawn as an ordered series of elements (transactions), each containing one or more events (items) labeled E1 to E4]
Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001
How does the human genome stack up?
Organism                             Genome Size (Bases)  Estimated Genes
Human (Homo sapiens)                 3 billion            25,000
Laboratory mouse (M. musculus)       2.6 billion          30,000
Mustard weed (A. thaliana)           100 million          25,000
Roundworm (C. elegans)               97 million           19,000
Fruit fly (D. melanogaster)          137 million          13,000
Yeast (S. cerevisiae)                12.1 million         6,000
Bacterium (E. coli)                  4.6 million          3,200
Human immunodeficiency virus (HIV)   9700                 9
Why Is Finding a (15,4) Motif Difficult?
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa
AgAAgAAAGGttGGG
..|..|||.|..|||
cAAtAAAAcGGcGGG
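The difficulty shows up when two implanted instances are compared directly. In the usual reading of a (15,4) motif (length 15, up to 4 mutations per instance, an interpretation not spelled out on the slide), two instances can differ from each other in as many as 8 positions. A small sketch that reproduces the alignment line above for the two highlighted instances:

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ (case-insensitive)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a.upper(), b.upper()))

def match_line(a, b):
    """'|' where the bases agree and '.' where they differ."""
    return "".join("|" if x == y else "." for x, y in zip(a.upper(), b.upper()))

first = "AgAAgAAAGGttGGG"    # instance from the first sequence
second = "cAAtAAAAcGGcGGG"   # instance from the second sequence

print(match_line(first, second))        # ..|..|||.|..|||
print(hamming_distance(first, second))  # 7
```

With only 8 of 15 positions agreeing, the two instances look barely more alike than random 15-mers, so pairwise comparison alone does not reveal the motif.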
Anomaly Detection
• Detect significant deviations from normal behavior
• Applications:
– Credit Card Fraud Detection
– Network Intrusion Detection
• Typical network traffic at the university level may reach over 100 million connections per day
Social Network Analysis (Link Mining)
• From the stage play Six Degrees of Separation: "I learned somewhere that on this earth, any two people are separated by only six others. Six degrees of separation is precisely the interpersonal distance of this planet."
• Link: relationship among data objects
• Link-Based Object Ranking (LBR): Exploit the link structure of a graph to order or prioritize the set of objects within the graph
• Web information analysis methods such as PageRank and HITS are typical LBR approaches
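As an illustration of link-based object ranking, here is a minimal power-iteration sketch of PageRank. The four-page graph, the damping factor of 0.85, and the iteration count are assumptions chosen for the example, not values from the talk.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank nodes of a directed graph given as {node: [nodes it links to]}."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives a baseline share from the random-jump term.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, outgoing in links.items():
            if outgoing:
                share = damping * rank[node] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A node without outgoing links spreads its rank evenly.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank

# Hypothetical web graph: every link target also appears as a key.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C comes out on top because three of the four pages link to it, and page A follows because it is the only page C links to.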
Complex Network
• A complex network is a network (graph) that has certain
non-trivial topological features that do not occur in simple
networks.
• Such non-trivial features include: a heavy-tail in the degree
distribution; a high clustering coefficient; assortativity (a
correlation between two nodes) or disassortativity among
vertices; and evidence of a hierarchical structure.
Web Mining
• Web Usage Mining
• Web Structure Mining
• Web Content Mining
– Google has a precious asset: the Database of Intentions (a database of human intentions)
Graph Mining
• Find frequent subgraphs in a given graph database
• Graphs are ubiquitous
– Web databases, XML databases
– Cheminformatics (chemical compound)
– Bioinformatics (protein structure, pathway)
– Workflow analysis
– Social network analysis
Example (Chemistry-informatics)
[Figure: a graph dataset of three chemical compounds (A), (B), (C) and the two frequent patterns (1), (2) found with a minimum support of 2]
Data Mining Process
• Define the problem
• Build data mining database
• Explore data
• Prepare data for modeling
• Build model
• Evaluate model
• Deploy model
Examples of data mining in science &
engineering
• Data mining in Biomedical Engineering
– “Robotic Arm Control Using Data Mining
Techniques”
Data Mining Process: 1. Define the problem
• Control a robotic arm by means of EMG signals from biceps and triceps muscles.
• Electromyography (EMG) is a medical technique for evaluating and recording physiologic properties of muscles at rest and while contracting.

Muscle Contraction  Biceps  Triceps
Supination          H       H
Pronation           L       L
Flexion             H       L
Extension           L       H

[Figure: illustrations of the four arm movements: Supination, Pronation, Flexion, Extension]
Data Mining Process: 2. Build a data mining database
The dataset includes 80 records.
There are two input variables: biceps signal and triceps signal.
There is one output variable with four possible values: supination, pronation, flexion, and extension.
Data Mining Process: 3. Explore data
[Figure: scatter plot of the Triceps signal against record number, grouped by Flexion, Extension, Supination, and Pronation]
Data Mining Process: 3. Explore data (cont.)
[Figure: scatter plot of the Biceps signal against record number, grouped by Flexion, Extension, Supination, and Pronation]
Data Mining Process: 4. Prepare data for modeling
Build a dataset in the ARFF format:
@relation EMG
@attribute Triceps real
@attribute Biceps real
@attribute Move {Flexion,Extension,Pronation,Supination}
@data
13,31,Flexion
14,30,Flexion
10,31,Flexion
13,29,Flexion
……
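To show how such a file maps to records, here is a minimal reader for the subset of ARFF used above (real/numeric and nominal attributes, comma-separated data rows). It is a sketch only: it is not WEKA's loader, it ignores quoting, sparse rows, and missing values, and it uses just the four data rows visible on the slide.

```python
ARFF_TEXT = """\
@relation EMG
@attribute Triceps real
@attribute Biceps real
@attribute Move {Flexion,Extension,Pronation,Supination}
@data
13,31,Flexion
14,30,Flexion
10,31,Flexion
13,29,Flexion
"""

def parse_arff(text):
    """Return (relation name, [(attribute, kind)], [row dicts])."""
    relation, attributes, rows, in_data = None, [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):      # skip blanks and comments
            continue
        keyword = line.split(None, 1)[0].lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                row[name] = float(value) if kind == "numeric" else value
            rows.append(row)
        elif keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):              # nominal: list of allowed values
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            elif spec.lower() in ("real", "numeric", "integer"):
                kind = "numeric"
            else:
                kind = "string"
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows

relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation)       # EMG
print(rows[0])        # {'Triceps': 13.0, 'Biceps': 31.0, 'Move': 'Flexion'}
```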
Data Mining Process: 5. Build Model
Classification
OneR
Decision Tree
Naïve Bayesian
K-Nearest Neighbors
Neural Networks
Linear Discriminant Analysis
Support Vector Machines
…
Data Mining Process: 5. Decision Tree
1. Find the attribute that best classifies the training data.
2. Use this attribute as the root of the decision tree.
3. Repeat the process for each subtree.
Triceps?
  <=37 → Triceps?
    <=14 → Flexion
    >14 → Pronation
  >37 → Biceps?
    <=17 → Extension
    >17 → Supination
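The learned tree is small enough to write as nested conditions. The sketch below assumes the flattened figure reads left to right (Triceps at the root, a second Triceps test on the "<=37" branch and a Biceps test on the ">37" branch); that branch assignment is a reconstruction, not something the slide states explicitly. It is checked against the four Flexion records shown in the ARFF sample.

```python
def classify_move(triceps, biceps):
    """Predict the arm movement from the two EMG signal values."""
    if triceps <= 37:
        # Low triceps signal: a second triceps threshold separates the classes.
        return "Flexion" if triceps <= 14 else "Pronation"
    # High triceps signal: the biceps signal decides.
    return "Extension" if biceps <= 17 else "Supination"

# (Triceps, Biceps) pairs from the ARFF sample, all labeled Flexion there.
samples = [(13, 31), (14, 30), (10, 31), (13, 29)]
predictions = [classify_move(triceps, biceps) for triceps, biceps in samples]
print(predictions)   # ['Flexion', 'Flexion', 'Flexion', 'Flexion']
```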
Data Mining Process: 6. Evaluate Models
• Simple validation: training set and test set
• n-fold cross-validation
• Leave-one-out

Accuracy with 10-fold cross-validation:
OneR                  76%
Decision Tree         90%
Naïve Bayesian        98%
1-Nearest Neighbors   100%
Neural Networks       100%
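The fold bookkeeping behind n-fold cross-validation can be sketched as follows. This is a generic illustration with the dataset size from the talk (80 records); it does not reproduce the accuracies above, and in practice the records are usually shuffled before folding.

```python
def k_fold_indices(n_records, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_records))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(80, 10))          # 10-fold on 80 records
print(len(folds))                             # 10
print(len(folds[0][0]), len(folds[0][1]))     # 72 8

leave_one_out = list(k_fold_indices(80, 80))  # leave-one-out is the case k = n
print(len(leave_one_out))                     # 80
```

Each record is used for testing exactly once, and the reported accuracy is the average over the folds.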
Data Mining Process: 7. Deploy Model
The neural network model was
successfully implemented inside the
robotic arm.
Data Mining Tools
• Commercial tools: SAS Enterprise Miner, IBM Intelligent Miner, SPSS Clementine
• Open source tools:
– WEKA: http://www.cs.waikato.ac.nz/ml/weka
– RapidMiner: http://rapid-i.com/index.php?lang=en
• Poll: Data mining/analytic tools you used in 2006
• Good portals for data mining: KDnuggets
Trends in Data Mining
• Application exploration
– Development of application-specific data mining systems
– Invisible data mining (mining as built-in function)
• Scalable data mining methods
– Constraint-based mining: use of constraints to guide
data mining systems in their search for interesting
patterns
• Integration of data mining with database systems,
data warehouse systems, and Web database
systems
Trends in Data Mining
• Web mining
• Social network analysis
• Recommender systems:
– Netflix: US$1 million prize for a 10% improvement on the Cinematch movie recommender system
– If You Liked This, You're Sure to Love That (New York Times, Nov. 21, 2008)
Trends in Data Mining
• Spam filters:
– Cost of spam: How much does spam cost you? Google will calculate
– http://www.google.com/a/help/intl/en/security/roi_calculator.html
• Privacy protection and information security in data mining
• Bioinformatics
Some Research Results on DM
• Localization system for WLAN
• Rogue Access Point Detection System
Based on Packet Analysis
• Library Recommender System Based
on Personal Ontology Model
Localization system for WLAN
• Enhancing the Accuracy of WLAN-based
Location Determination Systems Using
Predicted Orientation Information
(Information Sciences, Vol. 178, No. 4,
Feb. 15, 2008, pp. 1049–1068.)
• We proposed the Accumulated Orientation Strength (AOS) algorithm, based on a Bayesian classifier, to predict the orientation of a mobile user and thereby improve the accuracy of the localization system.
Rogue Access Point Detection System
• A paper entitled "Detecting Rogue Access
Points Using Client-side Bottleneck
Bandwidth Analysis" has been accepted for
publication in Computers & Security.
Rogue Access Point Detection System
• A big challenge in managing APs on a university campus: NCHU is a class B network with more than 50 departmental networks
Rogue Access Point Detection System:
Intruders from the Air
Rogue Access Point Detection System
• Proposed a novel approach for detecting rogue access points by estimating client-side bottleneck bandwidth based on the ACK packet-pair technique.
• The system is implemented and tested in the Computer and Information Network Center at NCHU.
• Experimental results show that the accuracy is higher than 90%.
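The general packet-pair idea is that two packets sent back to back get spread out by the slowest link on the path, so the packet size divided by the arrival gap estimates the bottleneck bandwidth. The sketch below illustrates only that principle; it is not the detection algorithm from the paper, and the packet size, the gap values, and the use of a median are hypothetical choices.

```python
import statistics

def bottleneck_bandwidth_bps(packet_size_bytes, gaps_seconds):
    """Estimate bottleneck bandwidth (bits per second) from packet-pair gaps.

    The median gap is used so that a few gaps inflated by cross traffic
    do not dominate the estimate.
    """
    if not gaps_seconds or min(gaps_seconds) <= 0:
        raise ValueError("gaps must be positive")
    typical_gap = statistics.median(gaps_seconds)
    return packet_size_bytes * 8 / typical_gap

# Hypothetical measurements: 1500-byte packets, gaps between paired ACKs in seconds.
gaps = [0.0012, 0.0013, 0.0012, 0.0030, 0.0012]
estimate = bottleneck_bandwidth_bps(1500, gaps)
print(round(estimate / 1e6, 1), "Mbit/s")   # about 10 Mbit/s for these numbers
```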
Library Recommender System Based on
Personal Ontology Model (PORE)
• A paper entitled "PORE: A Personal
Ontology Recommender System for Digital
Library" has been accepted for publication in
The Electronic Library.
• Proposed a personal ontology model for recommending books to library patrons based on keywords extracted from the books borrowed by the user
Library Recommender System Based on
Personal Ontology Model (PORE)
• Collaborative filtering techniques are also incorporated into the PORE system
• The PORE system is in service at the NCHU Library
Conclusions
• We are drowning in data, but starving for
knowledge!
• Data mining is the key to knowledge
discovery.
• Applications of data mining techniques can
be found in almost every research area of
computer science and engineering.
• Even in a recession, data mining services
are still in strong demand.