COMP313/ 513
DATA MINING
Unit organization
❙ Class times:
   Lectures: Tuesday and Wednesday 11-11:50 B251
   Tutorials: Thursday 10-11:50 MCL3 (starts in Week 2)
❙ Lecturer: Neil Dunstan MC207 neil@cs.une.edu.au
❙ Textbook:
   Data Mining: Concepts and Techniques,
   J. Han and M. Kamber, Morgan Kaufmann, 2nd Edition.
❙ Web site: http://mcs.une.edu.au/~comp513
❙ Note: This unit is being rewritten in 2008. Updated
   material will appear regularly on the web site.
Assessment
❙ Comp313
   3 Assignments, 10, 10 and 15%
   Exam 65%
❙ Comp513
   Literature Review 35%
   Exam 65%
❙ Submissions will be routinely checked by the Turnitin
   plagiarism detection system.
Unit schedule in 2008
Week  Starts   Topic
1     Feb 18   Introduction
2     Feb 25   Data Warehouses
3     Mar 3    Online Analytical Processes
4     Mar 10   Data Cubes
5     Mar 17   Associations and Correlations
6     Mar 24   Classification and Prediction
7     Mar 31   ..continued..
               mid semester break
8     Apr 28   Neural Networks
9     May 5    Clustering
10    May 12   ..continued..
11    May 19   Outlier Analysis
12    May 26   Text and Web Mining
13    Jun 2    Review
               Exam

Assessment ф
comp313 A1 set
comp313 A1 due
comp313 A2 set
comp513 Report Proposals due
comp313 A2 due
comp313 A3 set
comp513 Report due
comp313 A3 due

ф Assessment due dates are the Saturday at the end of the week.
Week 1 lecture slides:
❙ Topics:
   ❙ Data Mining Definition
   ❙ Enabling Technologies
   ❙ Evolution of Database and Data Analysis Technologies
   ❙ The Knowledge Discovery Process
   ❙ Types of Data
   ❙ Knowledge Discovery Methods
   ❙ Data Mining Systems
❙ Text Reference: Chapter 1.
What is data mining?
Knowledge Discovery in Large Databases
“Data Mining is the process of discovering meaningful
new correlations, patterns and trends by sifting
through large amounts of data stored in repositories
and by using pattern recognition technologies as well
as statistical and mathematical techniques”
Some other terms:
Machine Learning
Data Analysis
Enabling technologies
❙ Accumulated Historical Transaction Processing Data
❙ Database Technology
❙ Tertiary Storage
❙ High-speed processing - Parallel Processing
❙ Commercial Data Mining Packages
That is,
❙ the availability of large volumes of data
❙ data organization methods
❙ data storage technologies
❙ high-speed data processing
❙ development of knowledge discovery algorithms
Evolution of database technology
❙ Primitive File Processing – Electronic Data Processing
❙ Relational Database Systems
❙ Query Languages – SQL
❙ Hierarchical and Networked Database Systems
❙ Indexing and Access Methods – B-trees, Hashing
❙ Data Modelling – Entity-Relationship Models
❙ User Interfaces – Forms and Reports
❙ Transactions, Concurrency Control
❙ Online Transaction Processing
❙ Object-oriented Databases
❙ Spatial, Temporal and Multimedia Data
❙ Heterogeneous Database Systems - Global Schemas
❙ Web-based Database Systems – XML, The Semantic Web
Evolution of data analysis
❙ Ad-hoc querying of file systems and databases
❙ Data Warehouse – Accumulation of data for analysis
❙ Online Analytical Processing – Interactive analysis
❙ Data Mining Algorithms – Correlations, Predictions, Clustering
❙ Multimedia, Stream, Time-series, Text and Web mining
Operational versus data mining systems
Online Transaction Processing   Data Mining
--------------------------------------------------------------------------
Reports on recent data          Analysis on historical data
Predictable and periodic        Unpredictable, depends on need
Limited data                    The more data the better (generally)
Focus on transaction entity     Focus on actionable entity, region, class
Response time in seconds        Response in days or weeks
System of records for data      Copy of data
Descriptive                     Creative
Steps in knowledge discovery
❙ Data Cleansing – Remove noise, inconsistencies, errors
❙ Data Integration – Combine data from heterogeneous sources
❙ Data Transformation – Data Selection, Reduction, Aggregation
❙ Data Mining – Apply Data Analysis Techniques
❙ Evaluation – Interestingness measures.
❙ Presentation – Visualization of results.
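The first four steps above can be sketched as a small pipeline. This is only an illustration on made-up records: the function names, the sample data, and the choice of "customers spending at least a threshold" as the mining task are all assumptions, not part of the unit material.

```python
from collections import defaultdict

# Hypothetical raw records from two sources: (customer, amount).
source_a = [("alice", 20.0), ("bob", None), ("alice", 35.0)]
source_b = [("bob", 15.0), ("carol", -5.0), ("carol", 40.0)]

def cleanse(records):
    """Data cleansing: drop records with missing or negative amounts."""
    return [(c, a) for c, a in records if a is not None and a >= 0]

def integrate(*sources):
    """Data integration: combine records from several sources."""
    combined = []
    for source in sources:
        combined.extend(source)
    return combined

def transform(records):
    """Data transformation: aggregate total spend per customer."""
    totals = defaultdict(float)
    for customer, amount in records:
        totals[customer] += amount
    return dict(totals)

def mine(totals, threshold):
    """Data mining (trivially): customers spending at least `threshold`."""
    return sorted(c for c, t in totals.items() if t >= threshold)

totals = transform(integrate(cleanse(source_a), cleanse(source_b)))
big_spenders = mine(totals, threshold=50.0)
print(big_spenders)
```

Evaluation and presentation would follow: deciding whether the result is interesting, then showing it to the user in a suitable form.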
Data mining primitives
❙ Task-relevant data. What data is required?
❙ Data mining functions. What kinds of knowledge does the user want
   to discover?
❙ Background knowledge of the domain, in particular concept
   hierarchies, e.g.
   A 1 litre carton of full cream milk is a sub-category of full cream milk, which is
   a sub-category of milk. Such a concept hierarchy can be useful in
   summarization and association analysis at different levels of abstraction.
❙ Interestingness measures and evaluation methods. How can you
   assess the value of the data mining?
❙ Representation of discovered patterns and results. How can the
   results be presented to the user?
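A concept hierarchy like the milk example can be represented as a mapping from each category to its parent. The sketch below is a minimal illustration: the extra categories, the sales counts, and the names `parent`, `generalize` and `roll_up` are assumptions made for the example.

```python
# Hypothetical concept hierarchy: each category maps to its parent.
parent = {
    "1 litre carton of full cream milk": "full cream milk",
    "2 litre bottle of full cream milk": "full cream milk",
    "full cream milk": "milk",
    "skim milk": "milk",
}

def generalize(item, levels):
    """Climb the hierarchy by up to `levels` steps, stopping at the top."""
    for _ in range(levels):
        if item not in parent:
            break
        item = parent[item]
    return item

# Made-up sales counts at the most specific level.
sales = {
    "1 litre carton of full cream milk": 12,
    "2 litre bottle of full cream milk": 7,
    "skim milk": 5,
}

def roll_up(sales, levels):
    """Summarize sales at a higher level of abstraction."""
    summary = {}
    for item, count in sales.items():
        key = generalize(item, levels)
        summary[key] = summary.get(key, 0) + count
    return summary

print(roll_up(sales, 1))
print(roll_up(sales, 2))
```

Rolling up one level merges the two full cream milk products; rolling up two levels merges everything into milk.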
Types of data
Relational Databases
e.g. Customer(C_id#, Name, Address, Credit_Rating, ..)
Supplier(S_id#,Name, Address, .. )
Item(I_id#,Name,S_id, .. )
Data Warehouse
e.g. A combination of data from different databases
Transaction records e.g. (T_id#, attribute details.. )
Spatial Data
e.g. Maps, Geographic Information Systems. Raster or Vector representation.
Temporal and Time-Series Data
e.g. Mouse-click sequences, Transaction sequences, Stock Market Records
Text e.g. Documents
Multimedia e.g. Graphics, audio, video
Stream Data e.g. Video surveillance, Continuous output from Sensors
World Wide Web e.g. Web usage, Web logs, Linkage (Hypertext) Structures
Data terminology
In this unit, data will usually refer to data in relational databases.
e.g. Customer(C_id#, Name, Address, Credit_Rating, ..)
is a table containing records (or tuples) with values for each of the
attributes C_id#, Name, Address, Credit_Rating, ..
For example:
C_id#   Name         Address                      Credit_Rating   ..
101     Joe Blogs    1 Alpha St. Armidale, NSW    Good
121     Bill Blick   22 Beta Rd. Armidale, NSW    Poor
...
Directed knowledge discovery
❙ Targets some specific attribute in the data set, e.g.
What items sell well with bread?
❙ Tests hypotheses, e.g. Are women most likely to shop
during the daytime?
❙ Seeks explanations for known patterns, e.g. Why are
overseas students concentrated in Brisbane?
Undirected knowledge discovery
❙ Uses all available data
❙ Seeks patterns or structures in the data set
❙ Has unspecific goals, e.g. What items sell well
together?
❙ May lead to hypotheses
❙ May precede more directed knowledge discovery
Examples of results of data mining
❙ Classification, e.g. Of loan applications into high, medium
   or low risk
❙ Prediction, e.g. Stock prices in 12 months' time
❙ Association, e.g. What items seem to sell well together
❙ Grouping, e.g. Customers into different market segments
❙ Explanation, e.g. Of some pattern, by visualization or
   generalization
❙ Summarization, e.g. Of items sold across all branches.
Descriptive patterns
Descriptive patterns characterize the data
❙ Class/ concept descriptions summarize the
attributes of a target class, e.g.
A general profile of customers who spend
more than $1000 per month in a store.
❙ Data discrimination compares the common
attributes of a target class with those of other
classes, e.g.
Typical differences between different classes of
customers by buying patterns.
Descriptive patterns
❙ Cluster Analysis attempts to find related groups of
data by..
Maximizing the similarity of data within groups and
minimizing the similarity of data from different groups.
❙ Outlier Analysis finds data that doesn’t seem to
comply with the rest of the data. Hence it may be
noise or errors. In some applications it may indicate
fraud or identity theft.
❙ Evolution Analysis is applied to time-series data in
order to discover trends.
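As a rough illustration of outlier analysis, the sketch below flags values that lie far from the mean in units of standard deviation (a z-score test). This is just one simple technique chosen for the example, not necessarily the method used later in the unit; the data and the threshold of 2 are assumptions.

```python
import statistics

# Made-up daily transaction amounts; 95 looks out of place.
amounts = [10, 12, 11, 13, 12, 95]

def find_outliers(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds `threshold`."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]

outliers = find_outliers(amounts)
print(outliers)
```

Whether a flagged value is noise, an error, or something like fraud still has to be decided from the application context.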
Descriptive patterns
❙ Frequent Itemsets are sets of items that commonly occur together in
transactional data sets
❙ Association Rules are based on frequent itemsets. e.g.
buys(X,computer) => buys(X,printer)
that is, if a customer buys a computer they usually buy a printer as
well.
❙ Association rules have interestingness measures
❙ Support (how often computer with printer occurs in the data)
❙ Confidence (how often printer occurs when computer occurs)
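Support and confidence for the rule buys(X,computer) => buys(X,printer) can be computed directly from a list of transactions. The sketch below uses five made-up transactions; the data and the helper names `count`, `support` and `confidence` are assumptions for illustration.

```python
# Made-up transactions: each is the set of items bought together.
transactions = [
    {"computer", "printer", "paper"},
    {"computer", "printer"},
    {"computer", "mouse"},
    {"printer", "paper"},
    {"computer", "printer", "mouse"},
]

def count(itemset, transactions):
    """Number of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain `itemset`."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions with `antecedent`, the fraction that also have `consequent`."""
    both = count(set(antecedent) | set(consequent), transactions)
    return both / count(antecedent, transactions)

rule_support = support({"computer", "printer"}, transactions)
rule_confidence = confidence({"computer"}, {"printer"}, transactions)
print(rule_support, rule_confidence)
```

Here computer and printer occur together in 3 of 5 transactions (support 0.6), and 3 of the 4 transactions containing a computer also contain a printer (confidence 0.75).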
Predictive patterns
Predictive data mining attempts to develop models based
on current data, in order to make predictions
❙ Classification models attempt to predict which of a
given set of classes a new data object should
belong to, e.g.
A decision tree.
❙ Predictive models output a numerical estimate. That
is, the prediction is a number rather than a class,
e.g.
A linear model based on weighted attribute values.
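The difference between the two kinds of model can be seen in their outputs: a class label versus a number. The sketch below hand-writes a tiny decision tree and a linear model; the attributes, thresholds and weights are invented for illustration and are not learned from any data.

```python
def classify_loan(income, has_defaulted):
    """A hand-written decision tree: returns a risk class."""
    if has_defaulted:
        return "high"
    if income < 30000:
        return "medium"
    return "low"

def predict_spend(income, visits_per_month):
    """A linear model: an intercept plus weighted attribute values."""
    return 50.0 + 0.01 * income + 20.0 * visits_per_month

risk = classify_loan(income=25000, has_defaulted=False)
spend = predict_spend(income=40000, visits_per_month=3)
print(risk, spend)
```

In practice the tree structure and the weights would be derived from training data rather than written by hand.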
Evaluation of data mining
❙ It's easier to measure the results of projects with
precise goals than those with vague goals
❙ In classification and prediction the data sets are
divided into independent
❙ training set, for developing the model
❙ tuning set, for fine tuning and
❙ evaluation set, for final evaluation
❙ Association has Confidence and Support measures
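Dividing a data set into independent training, tuning and evaluation sets can be sketched as a shuffle followed by slicing. The 60/20/20 proportions, the fixed seed and the function name `three_way_split` are assumptions made for this example.

```python
import random

def three_way_split(records, train=0.6, tune=0.2, seed=0):
    """Shuffle records and split them into training, tuning and evaluation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train)
    n_tune = round(len(shuffled) * tune)
    training = shuffled[:n_train]
    tuning = shuffled[n_train:n_train + n_tune]
    evaluation = shuffled[n_train + n_tune:]  # whatever remains
    return training, tuning, evaluation

training, tuning, evaluation = three_way_split(range(10))
print(len(training), len(tuning), len(evaluation))
```

Because the three slices do not overlap, the evaluation set is never seen while the model is developed or tuned.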
Evaluation of prediction models
Predictive models provide a numerical estimate, so they can be
evaluated by comparing predicted figures with actual figures, e.g.
Actual   Predicted   Difference
2020     2040        -20
1900     1880        +20
3000     3050        -50
Sum of Differences = -20 + 20 - 50 = -50 (+ve and -ve figures cancel out)
Average Absolute Difference = (20 + 20 + 50)/ 3 = 30
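The two figures above can be reproduced in a few lines. The sketch uses the three actual and predicted values from the example; the variable names are chosen for illustration.

```python
actual = [2020, 1900, 3000]
predicted = [2040, 1880, 3050]

# Signed differences: positive and negative errors can cancel out.
differences = [a - p for a, p in zip(actual, predicted)]
sum_of_differences = sum(differences)

# Averaging the absolute values avoids the cancellation.
average_absolute_difference = sum(abs(d) for d in differences) / len(differences)

print(differences, sum_of_differences, average_absolute_difference)
```

The signed sum understates the error because the -20 and +20 cancel, which is why the average is taken over absolute values.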
Data mining systems
❙ Data mining systems may be classified according to:
   ❙ The type of data mined
   ❙ The kind of knowledge mined
   ❙ The techniques used
   ❙ The application domain
❙ The integration of the data mining system and the data can
   be classified as:
   ❙ No coupling. Data is sourced from an external database.
   ❙ Loose coupling. Uses features of the database to extract data
   ❙ Semitight coupling. Some preprocessing of the data.
   ❙ Tight coupling. Total integration of the database/warehouse
      and data mining system.
Research issues in data mining
❙ Data mining query languages. Generic and for specific
   domains.
❙ Efficiency and scalability of knowledge discovery
   algorithms.
❙ Algorithms for streamed data.
❙ Algorithms applied to multimedia data.
❙ Parallel and distributed algorithms.
❙ Knowledge discovery across the internet. Semantic Web?