Information Visualization:
Data Mining - 1
Matt Cooper
Big Data

Books
1. “Principles of Data Mining”
  • David Hand, Heikki Mannila, Padhraic Smyth
  • Mostly about data mining algorithms
2. “Data Preparation for Data Mining”
  • Dorian Pyle
  • Concentrates on data preparation

Part 1: What is the problem?
• Motivation: what is the goal of data mining?
• What is data mining?
• How is it used?
• How does data mining relate to:
  • InfoViz
  • Knowledge discovery
  • VDM – Visual Data Mining

What is InfoViz
• Q. What is Visualization?
• A. Using some medium/media to convey a representation of some data so that the user can form a cognitive understanding of the data
• It is *not* making pictures!

Visualization
• Often displayed like this:
  Data -> Transform -> New data -> Mapping -> Representation -> Display -> Perception
• Transform = data filtering
• Mapping?
• Representation?

For Scientific Visualization:
• Representation: false ‘picture’ of physical qualities
  • Molecules
  • Fluid flows
  • Body bits
• Primarily 3D -> volume displays
• Sometimes with time -> ‘animation’
• Very occasionally higher dimensionality

For InfoViz
• Data has no ‘real’ representation
• Data isn’t 3D - it’s often quite abstract
• No ‘spatial’ relationships at all
• Data items comprise many different fields
• Imagine characterizing a person
  • Sciviz – 3D or maybe 4D
  • InfoViz – A zillion dimensions
• What representation?

Data Mining
• Having an (enormous) amount of data
• Wonder what it can tell us
• Isolate (unexpected) relationships
• (Hopefully) find some which are
  • Interesting
  • Novel
  • Informative
  • Helpful
• “Secondary data analysis”

Data gathering
• We generate enormous amounts of data.
• Every time we:
  • Bank
  • Shop
  • Vote
  • Drive
  • Fly
  • Phone…
• This data is collected.

Data gathering (2)
• All this data is collectable!
  • Easy to collect and believed to have value
• We never throw anything away!
  • Easy to keep and believed to have value.
• Technologies to gather new information are growing rapidly.

e.g. census data
• 2011 UK census
  • ~63 Million people
  • ~35 questions each
    • more than three pages
  • ~2+ Billion data items

What is ‘Data Mining’
• ‘Statistics’ versus ‘data mining’
• Statistics
  • Want to know the answer to a question
  • Gather suitable data (ask the question)
  • Analyse the answers
  • Gain (probabilistic?) insight into the answer

Database Query & Data mining
• Given a database of shoe-buyers…
  • Database: What size shoes do people in the income bracket 20000Kr-25000Kr buy?
  • Data mining: What common factors (if any) affect the size of shoes people buy?
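
The contrast between a database query and a data mining question on the shoe-buyer example can be sketched in a few lines of Python. Everything in this sketch is an assumption made for illustration: the records, the field names (`income`, `height_cm`, `age`, `shoe_size`) and the use of a Pearson correlation as the measure of a "common factor" are not taken from the slides.

```python
from statistics import mean

# Hypothetical shoe-buyer records; all fields and values are made up.
buyers = [
    {"income": 21000, "height_cm": 165, "age": 34, "shoe_size": 38},
    {"income": 24000, "height_cm": 180, "age": 51, "shoe_size": 43},
    {"income": 52000, "height_cm": 172, "age": 29, "shoe_size": 41},
    {"income": 23000, "height_cm": 158, "age": 45, "shoe_size": 37},
    {"income": 61000, "height_cm": 188, "age": 38, "shoe_size": 45},
]


def query_sizes(records, low, high):
    """Database query: a precise question with a precise answer."""
    return [r["shoe_size"] for r in records if low <= r["income"] <= high]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def rank_factors(records, target="shoe_size"):
    """Data mining: rank every other field by how strongly it varies with the target."""
    ys = [r[target] for r in records]
    fields = [k for k in records[0] if k != target]
    scores = {f: pearson([r[f] for r in records], ys) for f in fields}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


print(query_sizes(buyers, 20000, 25000))
print(rank_factors(buyers))
```

The query returns exactly what was asked for, while the ranking step looks across all fields without being told which one matters.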

Motivation
• “Everyone spoke of an information overload but what there was in fact was a non-information overload”
  • Richard Saul Wurman, “What-If, Could-be”, Philadelphia, 1976.
  • (Wrote the book “Information Anxiety”)

What is data mining?
• Extraction of interesting (non-trivial), previously unknown (and potentially useful) information or patterns from data in ((very) large) databases.
  • Inmon

Alternative names
• Knowledge discovery in databases (KDD)
• Knowledge extraction
• Data/pattern analysis
• Data archeology
• Information harvesting
• Business intelligence

What is not data mining?
• (Deductive) query processing.
• Expert systems
• Statistical analysis

Data Mining: What Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories:
  • Object-oriented and object-relational databases
  • Time-series data and temporal data
  • Text databases and multimedia databases
  • Heterogeneous and legacy databases
  • WWW
  • Security data (images? video?...)

Data Mining: What Data? (continued)
• Each of a (large) number (n) of data items is a ‘tuple’
  • Sometimes called a ‘feature vector’
• Tuple: a (large?) number (p) of items
• Each item may be:
  • Numeric
  • Textual
  • Another tuple (e.g. fingerprints, images, etc.)
• May be discrete or continuous
• Result is n points in a p-dimensional space
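
As a minimal sketch of "n points in a p-dimensional space", a data set can be held as a list of n tuples that each contain p items. The helper names `shape` and `column` and the sample values are illustrative assumptions, not part of the slides.

```python
def shape(records):
    """Return (n, p): the number of tuples and the number of items per tuple."""
    n = len(records)
    p = len(records[0]) if records else 0
    if any(len(r) != p for r in records):
        raise ValueError("every tuple must have the same number of items")
    return n, p


def column(records, j):
    """All n values of item j, i.e. one dimension of the p-dimensional space."""
    return [r[j] for r in records]


# Illustrative tuples of (age, sex, income): a mix of numeric and textual items.
records = [
    (54, "M", 100000),
    (9, "M", 0),
    (85, "F", 56348),
]

print(shape(records))      # n tuples, p items each
print(column(records, 0))  # the values along the first dimension
```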

Example data set
ID    AGE   SEX   Education    Income
248   54    M     School       100 000
249   ??    F     Degree       127 831
250   9     M     Incomplete   0
251   85    F     PhD          56 348
252   32    ??    Degree       48 326
253   45    M     ??           ??
• What are the characteristics of the data?

Problems with data
• Holes
  • Missing data values
• Errors and ‘estimates’
  • Income of *exactly* 100000?
• Sample inconsistencies:
  • E.g. medical records with different numbers of readings for the same person
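
The holes and the suspicious "estimate" in the example data set can be found mechanically. This is a minimal sketch: the rows are the six records from the example table with `??` stored as `None`, while the function names and the choice of 10 000 as the "too round" threshold are assumptions made for illustration.

```python
# The six example records; "??" in the table is represented as None.
rows = [
    {"id": 248, "age": 54, "sex": "M", "education": "School", "income": 100000},
    {"id": 249, "age": None, "sex": "F", "education": "Degree", "income": 127831},
    {"id": 250, "age": 9, "sex": "M", "education": "Incomplete", "income": 0},
    {"id": 251, "age": 85, "sex": "F", "education": "PhD", "income": 56348},
    {"id": 252, "age": 32, "sex": None, "education": "Degree", "income": 48326},
    {"id": 253, "age": 45, "sex": "M", "education": None, "income": None},
]


def holes(rows):
    """Map each record ID to the fields whose value is missing."""
    return {
        r["id"]: [k for k, v in r.items() if v is None]
        for r in rows
        if any(v is None for v in r.values())
    }


def suspiciously_round(rows, field="income", step=10000):
    """IDs whose value is a non-zero exact multiple of `step` (a possible estimate)."""
    return [
        r["id"]
        for r in rows
        if r[field] is not None and r[field] != 0 and r[field] % step == 0
    ]


print(holes(rows))
print(suspiciously_round(rows))
```

A flagged value is only a candidate for inspection: an income of exactly 100 000 may be genuine, and deciding that is part of data preparation.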

Objectives of DM
• Identifying patterns in data:
  • For representation
  • Because they are ‘interesting’
  • Unexpected!

Data Mining tasks
1. Exploratory Data Analysis
2. Descriptive Modelling
3. Predictive Modelling
  • Classification and Regression
4. Discovering Patterns and Rules
5. Retrieval by content

Aside: Models and Patterns
• Model:
  • A global summary of an entire data set.
  • Makes statements about any point in the full measurement space.
• Pattern:
  • Makes statements about relationships between variables only in localized regions of the measurement space.

1. Exploratory Data Analysis
• Pure data mining
  • “Explore the data with no clear idea of what we are looking for”
• Typically very visual approach
  • Very tied to ‘Visual Data Mining’
• Problems with:
  • Large number of data points
  • Large numbers of dimensions in data

2. Descriptive Modelling
• Attempt to describe all of the data
• Perhaps use:
  • Model of overall probability distribution in the p-dimensional space

Descriptive modelling (2)
• Partitioning into groups e.g.:
  • Cluster analysis for natural grouping
  • Segmentation for user-desired groups
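
Partitioning into natural groups can be illustrated with a very small clustering routine. The slides do not name a specific algorithm; k-means is used here only as one familiar choice, and the sample points and starting centroids are invented for the sketch.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Invented 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

Segmentation for user-desired groups would differ mainly in that the group boundaries are chosen by the user rather than found from the data.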

3. Predictive modelling
• Form a model of the data set which allows prediction of a variable based on the known values of the others
• (Prediction does not mean future here)

Predictive modelling (2)
• Classification
  • Prediction of a discrete variable
• Regression analysis
  • Prediction of a continuous variable
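
The difference between regression and classification can be sketched with two tiny predictors. Both methods are assumptions chosen for brevity (a least-squares straight line and a one-nearest-neighbour rule); the slides do not prescribe either, and the sample data are invented.

```python
def fit_line(xs, ys):
    """Regression: least-squares slope and intercept for predicting a continuous y from x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def nearest_neighbour(train, features):
    """Classification: predict the discrete label of the closest training example."""
    closest = min(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], features)),
    )
    return closest[1]


# Invented data: y = 2x + 1 for the regression, two labelled points for the classifier.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope * 5 + intercept)  # predicted continuous value at x = 5

train = [((1.0, 1.0), "small"), ((5.0, 5.0), "large")]
print(nearest_neighbour(train, (4.0, 4.5)))  # predicted discrete label
```

In both cases the "prediction" is of an unknown value of one variable from the known values of the others, not of a future event.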

Descriptive and Predictive Modelling
• Q: “Why is predictive modelling (PM) not the same as descriptive modelling (DM)?”
  • Strong similarities, some similar methods
• A: The goals are subtly different:
  • DM is associated with the grouping in the variable space itself and identifying the groups.
  • PM is associated with predicting one variable.

4. Discovering Rules and Patterns
• Concerned with the identification of local patterns in sub-sets of the space.
• Examples:
  • Frequently occurring sets of transactions
  • Finding patterns of action indicating fraud
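
Finding frequently occurring sets in transaction data can be sketched by counting item pairs. This is only a brute-force illustration under assumed data: the baskets, the `min_support` threshold and the restriction to pairs are choices made here, not something the slides specify.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return item pairs that occur together in at least `min_support` transactions."""
    counts = Counter()
    for basket in transactions:
        # sorted(set(...)) gives each pair one canonical order and ignores repeats.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}


# Invented shopping baskets.
transactions = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
    ["bread", "milk"],
]
print(frequent_pairs(transactions, min_support=3))
```

The result is a local pattern in the sense used above: it says something about the baskets containing bread and milk, and nothing about the rest of the data.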

5. Retrieval by content
• Using a pattern of interest to locate similar patterns
• Examples: Automatically…
  • Finding images with similar content
  • Finding text documents with similar content

Score functions
• All of the preceding classes of task share a common feature:
  • The notion of “is like” or “similarity”
  • Or difference (dissimilarity)
• Defined through a ‘scoring function’
• In numerical or categorical data this is often easy
• In general it is not…
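
For numerical or categorical data a scoring function really can be short. The sketch below assumes two common choices, Euclidean distance for numeric tuples and the fraction of matching fields for categorical tuples, and uses the first to retrieve the stored item most like a query. The function names and sample values are illustrative only.

```python
import math


def euclidean(a, b):
    """Dissimilarity of two numeric tuples: straight-line distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def simple_matching(a, b):
    """Similarity of two categorical tuples: fraction of positions that agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def most_similar(query, items, score=euclidean):
    """Retrieval by content: the stored item with the smallest dissimilarity to the query."""
    return min(items, key=lambda item: score(query, item))


print(euclidean((0, 0), (3, 4)))
print(simple_matching(("M", "Degree"), ("F", "Degree")))
print(most_similar((2, 2), [(0, 0), (3, 3), (10, 10)]))
```

For images, free text or mixed records there is no such obvious formula, which is the difficulty the last bullet points to.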

Scoring functions (2)
• Is an orange like an apple?
• Yes:
  • Both are fruit.
  • Both grow on trees.
• No:
  • One is citrus, one isn’t.
  • One is orange, one is green/red

Scoring functions (3)
• Is this picture
• Like this one?
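
The apple and orange question shows that the answer depends on which properties the scoring function looks at. A minimal sketch, assuming each fruit is described by a set of properties taken from the bullets above and that similarity is measured with the Jaccard index (shared properties divided by all properties considered); the `similarity` helper and its `features` parameter are hypothetical.

```python
apple = {"fruit", "grows on trees", "green/red"}
orange = {"fruit", "grows on trees", "citrus", "orange"}


def similarity(a, b, features=None):
    """Jaccard index of two property sets, optionally restricted to chosen features."""
    if features is not None:
        a, b = a & features, b & features
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# All listed properties: somewhat alike.
print(similarity(apple, orange))
# Only "what kind of thing is it": identical.
print(similarity(apple, orange, {"fruit", "grows on trees"}))
# Only colour and citrus: nothing in common.
print(similarity(apple, orange, {"citrus", "orange", "green/red"}))
```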

Scoring functions (4)
• Specification of the scoring function(s) is crucial to the effectiveness of the system.
• One of the biggest contributions the user has to make!

Example applications (1)
• Segmentation of sales data is extensively used to classify customers by purchasing patterns and demographic data (age, income etc.)
  • Used to target marketing
• Example of descriptive modelling

Example applications (2)
• The Advanced Scout system
  • Analyses basketball game logs
  • Identifies features of players’ behaviour
    • Circumstances when they play well/badly
    • Which opposing players they are good or bad against.
• An example of discovering rules and patterns

Example applications (3)
• Dr. John Snow’s Cholera diagram
• Example of Exploratory Data Analysis
  • Also Visual Data Mining
• Done without knowing what caused Cholera!

Example applications (4)
• SKICAT
  • Classifies stars and galaxies automatically from digital image data
  • Uses a 40-dimensional feature vector
  • Works as well as human experts
• Predictive modelling

Example Applications (5)
• Image searching on the web
  • Both Altavista and Google had such functions ~2000
  • Both removed them
  • Google now has one again (2014)
• Face recognition for security (spotting terrorists)
  • Been trialled at several airports in the US
  • Very limited success to date
• Both examples of retrieval by content.

Altavista Image Search (2000) [screenshot]
Google Image Search (2015) [screenshots, including results labelled 2nd, 5th and 15th]

Example applications (6)
• Searching text documents for lies on CVs
• Example of a retrieval-by-content method

Fraud Detection and Management
• Detecting inappropriate medical treatment
  • Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving Australia $1m/yr).
• Example of Descriptive/Predictive modelling

Summary (1)
• Data mining: discovering interesting models and patterns in data
  • ‘Simplifications’ enabling understanding!
• A natural evolution of database technology, in great demand, with wide applications
• Mining can be performed in a variety of information repositories

Summary (2)
• Information expert’s input still vital
  • Defining methods
  • Defining scoring functions

End of Part 1