Ch. Eick: Introduction Data Mining and Course Information
Introduction --- Part 2
1. Another Introduction to Data Mining
2. Course Information
Knowledge Discovery in Data [and Data Mining] (KDD)
Let us find something interesting!
Definition: “KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad)
Frequently, the term data mining is used to refer to KDD.
Many commercial and experimental tools and tool suites are available (see http://www.kdnuggets.com/siftware.html).
The field is dominated more by industry than by research institutions.
Motivation: “Necessity is the Mother of Invention”
Data explosion problem: automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories.
We are drowning in data, but starving for knowledge!
Solution: data warehousing and data mining
Data warehousing and on-line analytical processing (“analyzing and mining the raw data rarely works”); idea: mine summarized, aggregated data
Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data collections
YAHOO!’s View of Data Mining
[Figure: the “ACME CORP ULTIMATE DATA MINING BROWSER” with the options “What’s New?”, “What’s Interesting?”, and “Predict for me”; see http://www.sigkdd.org/kdd2008/]
Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process.
[Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Steps of a KDD Process

1. Learning the application domain: relevant prior knowledge and goals of application
2. Creating a target data set: data selection
3. Data cleaning and preprocessing
4. Data reduction and transformation (the first 4 steps may take 75% of effort!): find useful features, dimensionality/variable reduction, invariant representation
5. Choosing functions of data mining: summarization, classification, regression, association, clustering
6. Choosing the mining algorithm(s)
7. Data mining: search for patterns of interest
8. Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
9. Use of discovered knowledge
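The steps listed on this slide can be sketched as a small pipeline. This is only a minimal illustration in Python, not course material: the toy records, the function names (clean, transform, mine, evaluate), and the 0.7 support threshold are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical raw records: (customer id, item, quantity).
# None marks a missing item; a negative quantity is treated as invalid.
RAW = [
    (1, "bread", 2), (1, "milk", 1), (2, "bread", 1),
    (2, None, 3), (3, "milk", 2), (3, "bread", 1), (3, "beer", -1),
]

def clean(records):
    """Data cleaning: drop records with a missing item or invalid quantity."""
    return [r for r in records if r[1] is not None and r[2] > 0]

def transform(records):
    """Data reduction/transformation: one set of items per customer."""
    baskets = {}
    for cust, item, _qty in records:
        baskets.setdefault(cust, set()).add(item)
    return baskets

def mine(baskets):
    """Data mining: count in how many baskets each item occurs."""
    return Counter(item for items in baskets.values() for item in items)

def evaluate(counts, n_baskets, min_support=0.7):
    """Pattern evaluation: keep items whose support reaches the threshold."""
    return {item: c / n_baskets for item, c in counts.items()
            if c / n_baskets >= min_support}

baskets = transform(clean(RAW))
patterns = evaluate(mine(baskets), len(baskets))
print(patterns)  # {'bread': 1.0}
```

In a real project the early steps (selection, cleaning, transformation) are far more involved than here, which is consistent with the slide's remark about where most of the effort goes.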
Data Mining and Business Intelligence
[Figure: pyramid of layers; the potential to support business decisions increases toward the top]
Making Decisions
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts: OLAP, MDA
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
Are All the “Discovered” Patterns Interesting?
A data mining system/query may generate thousands of patterns; not all of them are interesting.
Suggested approach: human-centered, query-based, focused mining
Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm.
Objective vs. subjective interestingness measures:
Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.
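To make the two objective measures named above concrete, here is a minimal Python sketch using the standard definitions: the support of an itemset is the fraction of transactions that contain it, and the confidence of a rule X → Y is support(X ∪ Y) / support(X). The transactions and function names are invented for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # rule never applies; treated as 0 in this sketch
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / base

# Hypothetical market-basket data
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
]

s = support(TRANSACTIONS, {"diapers", "beer"})        # 3 of 4 transactions
c = confidence(TRANSACTIONS, {"diapers"}, {"beer"})   # 0.75 / 0.75
c2 = confidence(TRANSACTIONS, {"bread"}, {"milk"})    # 0.5 / 0.75
print(s, c, round(c2, 3))
```

Subjective measures such as unexpectedness depend on what the user already believes, so they cannot be computed from the data alone in this way.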
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Visualization, High-Performance Computing, and Applications]
KDD Process: A Typical View from ML and Statistics
Input Data → Data Pre-Processing → Data Mining → Post-Processing
Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
Data Mining: association analysis, classification, clustering, outlier analysis, summary generation, …
Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
This is a typical view from the machine learning and statistics communities.
Data Mining Competitions
Netflix Prize: http://www.netflixprize.com//index
KDD Cup 2009: http://www.kddcuporange.com/
KDD Cup 2011: http://www.kdd.org/kdd2011/kddcup.shtml
Summary
Data mining: discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
Classification of data mining systems
COSC 6335 in a Nutshell
[Figure: Preprocessing → Data Mining → Post Processing; topics shown: Association Analysis, Clustering, Classification & Prediction, Pattern Evaluation, Visualization, Summarization]
Prerequisites
The course is basically self-contained; however, the following skills are important for success in this course:
• Basic knowledge of programming
• Java or a language of your own choice, together with data mining tools, will be used in the programming projects; basic knowledge of Java is sufficient!
• Basic knowledge of statistics
• Basic knowledge of data structures
Course Objectives
Students
will know what the goals and objectives of data mining are
will have a basic understanding of how to conduct a data mining project
will obtain practical experience in data analysis and making sense out of data
will have sound knowledge of popular classification techniques, such as decision trees, support vector machines and nearest-neighbor approaches
will know the most important association analysis techniques
will have detailed knowledge of popular clustering algorithms, such as K-means, DBSCAN, grid-based, hierarchical and supervised clustering
will have some knowledge of R, an open source statistics/data mining environment
will obtain practical experience in designing data mining algorithms and in applying data mining techniques to real world data sets
will have some exposure to more advanced topics, such as sequence mining, spatial data mining, and web page ranking algorithms
Data Mining Course Organization
I Introduction to Data Mining and Data Mining Basics (Chapter 1 and 2.1)
II Exploratory Data Analysis (Chapter 3) moved!
III Introduction to Classification --- Basic Concepts and Decision Trees (Chapter 4)
IV Introduction to Similarity Assessment and Clustering (Other material 2.3 and Chapter 8 in part)
V Introduction to Data Cubes (Section 3.4) moved!
VI Association Analysis (Chapter 6)
VII Spatial Data Mining
VIII More on Classification: Regression, Instance-based Learning and Support Vector Machines (Chapter 5)
IX Data Preprocessing, Data Cubes, and Data Warehouses (Chapter 2 and …l)
X More on Clustering (Chapter 8 and Chapter 9 in part)
XI Sequence and Graph Mining (Chapter 7 in part)
XII PageRank and other Top 10 Data Mining Algorithms (Journal Paper)
XIII Final Words
Order of Coverage
Introduction → Exploratory Data Analysis → Similarity Assessment → Clustering → Association Analysis → Classification → Spatial Data Mining → More on Classification → OLAP and Data Warehousing → Preprocessing → More on Clustering → Sequence and Graph Mining → Top 10 Data Mining Algorithms → Summary
Also: an introductory tutorial on R (2-3 classes)
In particular, R will be used for most course projects; the exception is the third project, which will use spatial clustering algorithms that are part of Cougar^2.
The bad news is that it is more challenging to get started with R than with Weka (but Weka is a "dead" language), although you should be okay after you have used R for a few weeks. The good news is that R continues to grow quickly in popularity: a recent poll at KDnuggets found that 34% of respondents do at least half of their data mining in R. Although it is a domain-specific language, it is versatile.
As we have not used R in the course before, we expect some startup problems and ask for your patience. On the positive side, knowing R will be a plus when conducting research projects and when looking for jobs after you graduate, due to R's completeness and rising popularity.
Where to Find References?
Data mining and KDD:
  Conference proceedings: ICDM, KDD, PKDD, PAKDD, SDM, ADMA, etc.
  Journal: Data Mining and Knowledge Discovery
Database field (SIGMOD member CD ROM):
  Conference proceedings: VLDB, ICDE, ACM-SIGMOD, CIKM
  Journals: ACM-TODS, J. ACM, IEEE-TKDE, JIIS, etc.
AI and Machine Learning:
  Conference proceedings: ICML, AAAI, IJCAI, ECML, etc.
  Journals: Machine Learning, Artificial Intelligence, etc.
Statistics:
  Conference proceedings: Joint Stat. Meeting, etc.
  Journals: Annals of statistics, etc.
Visualization:
  Conference proceedings: CHI, etc.
  Journals: IEEE Trans. visualization and computer graphics, etc.
Textbooks
Required Text: P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining, Addison Wesley, Link to Book HomePage
Mildly Recommended Text: Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, second edition.
Link to Data Mining Book Home Page
Tentative Schedule for
• Exams: October 25, December 6
• Reviews:
Plan for the First Half of the Fall 2011 Semester:
August 23+25: Introduction to DM
August 30: Exploratory Data Analysis (Dr. Chen)
September 1+22: Lab (Zechun Cao)
September 6+8+15+20: Clustering I
September 27+29+October 4: Association Analysis
October 6+11+13: Classification and Prediction
October 18+20: Spatial Data Mining
October 25: Midterm Exam
October 27+November 1: More on Classification and Prediction
2011 Course Projects
Project 1: Exploratory Data Analysis
Project 2: Traditional Clustering with K-means and DBSCAN
Project 3: Spatial Clustering with CLEVER
Project 4: Group Project (different topics, no programming)
Project 5: TBDL (something with SVMs and/or regression)
TA/Students of my Research Group:
Duties:
1. Grading of programming projects, homework, and exams (in part)
2. Run 2/3 labs
3. Help students with homework, programming projects, and problems with the course material
4. Teach a class (two to three times)
Office:
Office Hours:
E-mail:
Meet our TA: Thursday
Web
Course Webpage (http://www2.cs.uh.edu/~ceick/DM/DM11.html)
UH-DMML Webpage (http://www2.cs.uh.edu/~UH-DMML/index.html)
Where to Find References? DBLP, CiteSeer, Google
Data mining and KDD (SIGKDD: CDROM)
  Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
  Journals: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
Database systems (SIGMOD: ACM SIGMOD Anthology, CD ROM)
  Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
  Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
AI & Machine Learning
  Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
  Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
Web and IR
  Conferences: SIGIR, WWW, CIKM, etc.
  Journals: WWW: Internet and Web Information Systems,
Statistics
  Conferences: Joint Stat. Meeting, etc.
  Journals: Annals of statistics, etc.
Visualization
  Conference proceedings: CHI, ACM-SIGGraph, etc.
  Journals: IEEE Trans. visualization and computer graphics, etc.
Teaching Philosophy and Advice
The first 8 weeks will give a basic introduction to data mining and follow the textbook somewhat closely.
Read the sections of the textbook before you come to the lecture; if you work continuously for the class you will do better and lectures will be more enjoyable. Starting to review the material that is covered in this class 1 week before the next exam is not a good idea.
Do not be afraid to ask questions! I really like interactions with students in the lectures… If you do not understand something at all, send me an e-mail before the next lecture!
If you have a serious problem, talk to me before the problem gets out of hand.
Course Planning for Research in Data Mining
This course “Data Mining”
I also suggest taking at least one, preferably two, of the following courses: Pattern Classification (COSC 6343), Artificial Intelligence (COSC 6368), and Machine Learning (COSC 6342).
Moreover, having basic knowledge in data structures, software design, and databases is important when conducting data mining projects; therefore, taking COSC 6320, COSC 6318 and COSC 6340 is a good choice.
Taking a course that teaches high performance computing is also a good choice, because data mining algorithms are very time-consuming.
Because a lot of data mining projects have to deal with images, I suggest taking at least one of the many biomedical image processing courses that are offered in our curriculum.
Finally, having knowledge in evolutionary computing, data visualization, statistics, solving optimization problems, and GIS (geographical information systems) is a plus!