CI6227: Data Mining
Introduction (2nd-half)
Sinno Jialin PAN
School of Computer Engineering,
NTU, Singapore
Homepage: http://www3.ntu.edu.sg/home/sinnopan/
General Information
Office Hours/Consultations
 After class or during breaks.
 Q&A via email: sinnopan@ntu.edu.sg.
 Please email me to make an appointment.
 My office location can be found on my homepage.
Course Webpage
 NTULearn
 www3.ntu.edu.sg/home/sinnopan/courses/ntu/CI6227.htm
Content
Data Mining Tasks
Descriptive
Association Rule Mining
Clustering
Predictive
Classification
Sequence Pattern Mining
Regression
Outlier Detection
Breadth and Depth
 Classification Algorithms (through lectures):
 Decision Tree
 Rule-based Classifier
 Nearest-Neighbor Classifier
 Bayesian Classifiers (Naïve Bayes & Bayesian Networks)
 Artificial Neural Networks
 Support Vector Machines
 Real-world Applications (through course projects)
 One course project on data mining applications.
Breadth and Depth …
 Focus on introducing the basic concepts, motivations,
and algorithms of classification approaches.
 Most students should be able to follow.
 For those who want to learn more, some up-to-date
techniques and advanced issues will be mentioned.
 Details cannot be covered in lecture, so some additional
materials for reading will be suggested (optional).
Course Evaluation
 Two assignments/Projects (40%)
 Assignment/Project 1 (1st-half semester) – 20%
 Assignment/Project 2 (2nd-half semester) – 20%
 Final Exam (60% – Closed Book)
 Content taught in first-half semester – 30%
 Content taught in second-half semester – 30%
Pre-lecture & Post-lecture Slides
 On pre-lecture slides (file name starting with
“PreLecture”), I may pose some questions for
you to work out.
 On post-lecture slides (file name starting with
“ci6227”), the answers will be released.
 To avoid confusion, when post-lecture slides are
uploaded to NTULearn after each lecture, the
corresponding pre-lecture slides will be deleted.
Outline
 Overview of classification
 Project description
 Classification I: Decision Tree
Classification
 The task of assigning objects to one of
several predefined categories.
 Can an object be assigned to more than one
category?
 Multi-label classification
Classification via Data Mining
 Given a collection of records (training set)
 Each record contains a set of attributes, one of
which is the class.
 Goal: find a model for the class attribute as a function
of the values of the other attributes,
 such that previously unseen records are
assigned a class as accurately as possible.
 A test set is used to determine the accuracy of the
model.
 Usually, the given data set is divided into training and
test sets, with training set used to build the model and
test set used to validate it.
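The division into training and test sets described above can be sketched as a simple hold-out split. This is only an illustration: the function name `holdout_split`, the 30% test fraction and the fixed seed are assumptions, not part of the course material.

```python
import random

def holdout_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and return (training_set, test_set)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(records)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Ten record ids standing in for a small data set (hypothetical data).
records = list(range(1, 11))
train, test = holdout_split(records)
print(len(train), len(test))  # 7 3
```

The model is built only from `train`; `test` is kept aside so that accuracy is measured on records the model has not seen.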
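Classification (in Mathematics)
 In mathematics, given a set of pairs $(\boldsymbol{x}_i, y_i)$ for
$i = 1, \ldots, N$, where $\boldsymbol{x}_i = [x_{i1}, x_{i2}, \ldots, x_{im}]$, the
goal is to learn a mapping $f: \boldsymbol{x} \to y$ by requiring
$f(\boldsymbol{x}_i) = y_i$. The learned mapping $f$ is expected to
be able to make precise predictions on any unseen
$\boldsymbol{x}^*$ as $f(\boldsymbol{x}^*)$.
 Here $f$ is the classifier, $\boldsymbol{x}_i$ is a set of attributes, $y_i$ is the class,
and the pairs $(\boldsymbol{x}_i, y_i)$ form the training set.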
Classification vs. Regression
 For classification, y is discrete
 If y is binary, then binary classification
 If y is nominal but not binary, then multi-class classification
 If y is ordinal, then ordinal classification
 For regression, y is continuous
Evaluation of Performance
 Focus on the predictive capability of a model
 Rather than the time taken to classify or build
models, scalability, etc.
 Confusion Matrix for a binary-class problem:
                     Predicted Class=1   Predicted Class=0
  Actual Class=1     f11                 f10
  Actual Class=0     f01                 f00
f11: TP (true positive)
f10: FN (false negative)
f01: FP (false positive)
f00: TN (true negative)
Evaluation of Performance …
                     Predicted Class=1   Predicted Class=0
  Actual Class=1     f11                 f10
  Actual Class=0     f01                 f00
Most widely-used metric:
$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}$$
$$\text{Error rate} = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} = 1 - \text{Accuracy}$$
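As a small worked example of these definitions, the sketch below counts the four confusion-matrix cells from lists of actual and predicted labels and derives accuracy and error rate. The function names and the six example labels are made up for illustration; labels are assumed to be the integers 0 and 1.

```python
def confusion_counts(actual, predicted):
    """Count f11, f10, f01, f00, where the first index is the actual class
    and the second index is the predicted class (labels must be 0 or 1)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    counts = {"f11": 0, "f10": 0, "f01": 0, "f00": 0}
    for a, p in zip(actual, predicted):
        counts[f"f{a}{p}"] += 1
    return counts

def accuracy(counts):
    """(f11 + f00) divided by the total number of predictions."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no predictions to evaluate")
    return (counts["f11"] + counts["f00"]) / total

def error_rate(counts):
    """(f10 + f01) divided by the total number of predictions."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no predictions to evaluate")
    return (counts["f10"] + counts["f01"]) / total

# Hypothetical labels for six records.
actual    = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
counts = confusion_counts(actual, predicted)
print(counts)  # {'f11': 2, 'f10': 1, 'f01': 1, 'f00': 2}
```

Here four of the six predictions are correct, so the accuracy is 4/6 and the error rate is 2/6.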
An Illustrative Classification Task
 Consider the problem of predicting whether
a loan applicant will repay his/her loan obligation
(no cheat) or become delinquent (cheat).
 Object: the loan applicant
 Predefined categories: no cheat, cheat
An Illustrative Classification Task …
 Training set: constructed by examining the records of
previous borrowers
Tid  Home Owner  Marital Status  Taxable Income  Cheat
1    Yes         Single          125K            No
2    No          Married         100K            No
3    No          Single          70K             No
4    Yes         Married         120K            No
5    No          Divorced        95K             Yes
6    No          Married         60K             No
7    Yes         Divorced        220K            No
8    No          Single          85K             Yes
9    No          Married         75K             No
10   No          Single          90K             Yes
An Illustrative Classification Task …
Training Set
Tid  Home Owner  Marital Status  Taxable Income  Cheat
1    Yes         Single          125K            No
2    No          Married         100K            No
3    No          Single          70K             No
4    Yes         Married         120K            No
5    No          Divorced        95K             Yes
6    No          Married         60K             No
7    Yes         Divorced        220K            No
8    No          Single          85K             Yes
9    No          Married         75K             No
10   No          Single          90K             Yes
Induction: a classification algorithm learns a model from the training set.
Test Set
Tid  Home Owner  Marital Status  Taxable Income  Cheat
11   No          Single          55K             ?
12   Yes         Divorced        80K             ?
13   Yes         Married         110K            ?
14   No          Single          95K             ?
15   No          Married         67K             ?
Deduction: the learned model is applied to the test set to predict the unknown Cheat values.
Decision Tree
Refund?
  Yes: NO
  No: MarSt?
    Married: NO
    Single, Divorced: TaxInc?
      < 80K: NO
      > 80K: YES
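A decision tree of this shape can be read as nested tests on the attributes. The sketch below hard-codes the tree from this slide and applies it to the five test records of the loan example. Two assumptions are made here: the tree's Refund attribute is taken to correspond to the Home Owner column, and an income of exactly 80K (which the slide leaves unspecified) is sent down the YES branch. The function name `classify` is illustrative.

```python
def classify(refund, marital_status, taxable_income_k):
    """Walk the tree from the root (Refund) down to a leaf label."""
    if refund == "Yes":
        return "NO"
    if marital_status == "Married":
        return "NO"
    # Single or Divorced: decide on taxable income (in thousands).
    # Assumption: exactly 80K is treated like "> 80K".
    return "NO" if taxable_income_k < 80 else "YES"

# Test set: (Tid, Home Owner used as Refund, Marital Status, Taxable Income in K).
test_set = [
    (11, "No", "Single", 55),
    (12, "Yes", "Divorced", 80),
    (13, "Yes", "Married", 110),
    (14, "No", "Single", 95),
    (15, "No", "Married", 67),
]
predictions = {tid: classify(ho, ms, ti) for tid, ho, ms, ti in test_set}
print(predictions)  # {11: 'NO', 12: 'NO', 13: 'NO', 14: 'YES', 15: 'NO'}
```

How such a tree is learned from the training set is the subject of the decision tree lectures; this sketch only shows how a learned tree is applied.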
Rule-based Classifier
R1: (Home Owner = yes) ∧ (Taxable Income > 100k) → Cheat = No
R2: (Home Owner = no) ∧ (Marital Status = Divorced) → Cheat = Yes
R3: (Home Owner = no) ∧ (Marital Status = Single) ∧ (Taxable Income < 10k) → Cheat = Yes
…
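The three rules listed on this slide can be sketched as a function that returns the label of the first rule whose condition holds. The function name `apply_rules` and the choice to return `None` when no listed rule covers a record are assumptions for illustration; the slide's rule list continues beyond R3, and those further rules are not reproduced here.

```python
def apply_rules(home_owner, marital_status, taxable_income_k):
    """Return the Cheat label of the first matching rule, or None if no
    rule among R1 to R3 covers the record (income is in thousands)."""
    if home_owner == "Yes" and taxable_income_k > 100:                 # R1
        return "No"
    if home_owner == "No" and marital_status == "Divorced":            # R2
        return "Yes"
    if (home_owner == "No" and marital_status == "Single"
            and taxable_income_k < 10):                                # R3
        return "Yes"
    return None  # not covered by R1, R2 or R3

print(apply_rules("Yes", "Single", 125))   # No   (R1 fires, Tid 1)
print(apply_rules("No", "Divorced", 95))   # Yes  (R2 fires, Tid 5)
print(apply_rules("No", "Single", 85))     # None (no rule fires, Tid 8)
```

R1, R2 and R3 have mutually exclusive conditions, so the order in which they are checked does not change the result for these three rules.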
Nearest-Neighbor Classifier
Training Set
Tid  Home Owner  Marital Status  Taxable Income  Cheat
1    Yes         Single          125K            No
2    No          Married         100K            No
3    No          Single          70K             No
4    Yes         Married         120K            No
5    No          Divorced        95K             Yes
6    No          Married         60K             No
7    Yes         Divorced        220K            No
8    No          Single          85K             Yes
9    No          Married         75K             No
10   No          Single          90K             Yes
Record to classify
Home Owner  Marital Status  Taxable Income  Cheat
No          Married         80K             ?
Seek the most similar record(s) in the training set and
predict the majority of their classes.
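The idea can be sketched with a k-nearest-neighbor vote over the ten training records of the loan example. The slide does not define "similar", so the distance below is an assumption made only for illustration: one unit for each categorical attribute that differs, plus the income difference divided by the income range of the training set. The names `distance` and `knn_predict` and the choice k = 3 are likewise illustrative.

```python
from collections import Counter

# (Tid, Home Owner, Marital Status, Taxable Income in K, Cheat)
TRAINING_SET = [
    (1, "Yes", "Single", 125, "No"),   (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),     (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),  (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"), (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),    (10, "No", "Single", 90, "Yes"),
]
INCOMES = [record[3] for record in TRAINING_SET]
INCOME_RANGE = max(INCOMES) - min(INCOMES)  # 220 - 60 = 160

def distance(a, b):
    """Assumed mixed-attribute distance between two
    (home_owner, marital_status, income_k) tuples."""
    mismatches = (a[0] != b[0]) + (a[1] != b[1])
    return mismatches + abs(a[2] - b[2]) / INCOME_RANGE

def knn_predict(query, k=3):
    """Return (majority class of the k nearest records, their Tids)."""
    ranked = sorted(TRAINING_SET, key=lambda r: distance(query, r[1:4]))
    neighbors = ranked[:k]
    votes = Counter(record[4] for record in neighbors)
    return votes.most_common(1)[0][0], [record[0] for record in neighbors]

label, neighbor_tids = knn_predict(("No", "Married", 80))
print(label, neighbor_tids)  # No [9, 2, 6]
```

With this distance the three closest records are all married non-owners who did not cheat, so the query record is predicted as Cheat = No. A different distance or a different k could change the neighbors and the prediction.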
Bayesian Classifiers: Naïve Bayes &
Bayesian Belief Networks
Marital Status: P(MS=Married) = 0.5, P(MS=Single) = 0.4
Home Owner: P(HO=Yes) = 0.4
Taxable Income: P(TI > 80K) = 0.4
Cheat (depends on Marital Status, Home Owner and Taxable Income):
P(C=Yes | MS=Married, HO=Yes, TI>80K) = 0.8
P(C=Yes | MS=Married, HO=No, TI>80K) = 0.6
P(C=Yes | MS=Married, HO=Yes, TI≤80K) = 0.7
P(C=Yes | MS=Married, HO=No, TI≤80K) = 0.5
…
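To see how the numbers on this slide combine, the sketch below multiplies the probabilities along the network. It assumes that Marital Status, Home Owner and Taxable Income are mutually independent parent nodes of Cheat, which the slide's layout suggests but does not state, and it uses only the four conditional entries listed for MS=Married. The variable and function names are illustrative.

```python
P_MARRIED = 0.5       # P(MS=Married)
P_HOME_OWNER = 0.4    # P(HO=Yes)
P_HIGH_INCOME = 0.4   # P(TI > 80K)

# P(C=Yes | MS=Married, HO, TI), keyed by (HO == Yes, TI > 80K).
P_CHEAT_GIVEN_MARRIED = {
    (True, True): 0.8,
    (False, True): 0.6,
    (True, False): 0.7,
    (False, False): 0.5,
}

def prob(p_true, value):
    """Probability of a binary variable taking `value`, given P(value is True)."""
    return p_true if value else 1.0 - p_true

def joint_cheat_married(home_owner, high_income):
    """P(C=Yes, MS=Married, HO, TI), assuming independent parent nodes."""
    return (P_MARRIED
            * prob(P_HOME_OWNER, home_owner)
            * prob(P_HIGH_INCOME, high_income)
            * P_CHEAT_GIVEN_MARRIED[(home_owner, high_income)])

# Joint probability of one full assignment: 0.5 * 0.4 * 0.4 * 0.8.
joint = joint_cheat_married(True, True)

# P(C=Yes | MS=Married): sum out HO and TI, then divide by P(MS=Married).
p_cheat_given_married = sum(
    joint_cheat_married(ho, ti) for ho in (True, False) for ti in (True, False)
) / P_MARRIED

print(round(joint, 3), round(p_cheat_given_married, 2))  # 0.064 0.62
```

Summing out the unobserved parents in this way is the basic inference step in a Bayesian belief network; Naïve Bayes uses a different structure, with the class as the parent of the attributes.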
Artificial Neural Networks
Network: Input Layer (x1, x2, x3, x4, x5) → Hidden Layer (hidden nodes) → Output Layer (y)
Neuron i: inputs I1, I2, I3 with weights wi1, wi2, wi3 → weighted sum Si → activation function g(Si), with threshold t → output Oi
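The single-neuron diagram can be sketched in a few lines. The sketch assumes that Si is the weighted sum of the inputs minus the threshold t and that g is a step function; the slide names these components but does not give their formulas, and the weights and threshold used below are made-up values.

```python
def step(s):
    """Step activation (an assumed choice for g): fire when s is positive."""
    return 1 if s > 0 else 0

def neuron_output(inputs, weights, threshold, activation=step):
    """Compute O_i = g(S_i), where S_i = sum_j w_ij * I_j - t (assumed form)."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    s = sum(w * x for w, x in zip(weights, inputs)) - threshold
    return activation(s)

weights = [0.3, 0.3, 0.3]   # hypothetical w_i1, w_i2, w_i3
threshold = 0.4             # hypothetical t

print(neuron_output([1, 0, 1], weights, threshold))  # 1
print(neuron_output([1, 0, 0], weights, threshold))  # 0
```

With these values the neuron fires only when at least two of its three inputs are 1. In a multilayer network, the outputs of the hidden neurons become the inputs of the output layer.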
Support Vector Machines
Ensemble Learning
Advanced Classification Issues
 Class Imbalance Problem
 Data sets with imbalanced class distributions. E.g., the
number of loan applicants who repaid their loan
obligation is much larger than the number who were
delinquent.
 Multi-class Problem
 Some classification techniques are designed for binary
classification problems.
 How to extend them to multi-class problems?
Course Schedule (Tentative)
Date              Topics                                   Note
Week 8 (7/10)     Introduction, Project Description,       Chapter 4
                  Classification I: Decision Tree (a)
Week 9 (14/10)    Classification II: Decision Tree (b),    Chapter 4 & 5
                  Rule-based Classifier
Week 10 (21/10)   Classification III: Nearest-Neighbor     Chapter 5
                  Classifier, Naïve Bayes Classifier,
                  Bayesian Belief Networks
Week 11 (28/10)   Classification IV: Artificial Neural     Chapter 5
                  Networks, Ensemble Learning
Week 12 (4/11)    Classification V: Support Vector         Chapter 5
                  Machines, Advanced Classification
                  Issues, Review (2nd Half Semester)
Week 13 (11/11)   No Lecture                               I will be at LT20 for Q&A
                                                           and project discussion
Textbook: Introduction to Data Mining, by Pang-Ning Tan, Michael Steinbach,
and Vipin Kumar, Addison Wesley, 2005.