SATOMGI Data Mining and Matching
Lecture 1: Module overview and introduction to data mining
Module overview
• Four hours per day over three days
• Day 1: Two hours lectures (data mining process / data issues), two hours practical sessions
• Day 2: One hour lecture (clustering), one hour practical session, one hour lecture (association rules mining), one hour written assessment
• Day 3: One hour lecture (classification and prediction), one hour practical session, one hour lecture (data integration and matching), one hour written assessment
• Module lecturer: Dr Peter Christen
• Senior lecturer, ANU Department of Computer Science
• E-mail: peter.christen@anu.edu.au
• Phone: 6125 5690
• Module Web site: http://cs.anu.edu.au/people/Peter.Christen/SATOMGI
• Lecture slides, practical sessions material, links to further resources
Lecture outline
• Module overview
• Very short introduction to data mining
• Example applications of data mining
• Definitions of data mining
• The data mining process
• Data mining is multi-disciplinary
• Data mining challenges
• Short history of data mining
• Some data mining resources
• Data mining books
Very short introduction to data mining (1)
• Many government agencies, businesses, and research
projects collect massive amounts of data
• Ten largest decision support databases range from 17 to 100 Terabytes
(1 Terabyte = 1,024 Gigabytes = 1,048,576 Megabytes)
• Ten largest transaction-processing databases range from 6 to 23 Terabytes
• Sizes have tripled between 2003 and end of 2005!
• Source: http://wintercorp.com/VLDB/2005_TopTen_Survey/TopTenProgram.html
• Questions arise:
• Is there any new, unexpected and potentially useful information in such
large data collections?
• Can we use historical data to predict future outcomes (such as customer
behaviour, or whether a transaction is fraudulent)?
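The unit arithmetic quoted for the database sizes above can be checked with a minimal sketch. It assumes binary (power-of-two) units, as the slide does; the function name `terabytes_to_megabytes` is purely illustrative.

```python
# Binary storage units: 1 Terabyte = 1,024 Gigabytes, 1 Gigabyte = 1,024 Megabytes.
GB_PER_TB = 1024
MB_PER_GB = 1024


def terabytes_to_megabytes(terabytes: float) -> float:
    """Convert Terabytes to Megabytes using binary units (hypothetical helper)."""
    return terabytes * GB_PER_TB * MB_PER_GB


if __name__ == "__main__":
    # 17 and 100 Terabytes are the decision support database sizes from the slide.
    for size_tb in (1, 17, 100):
        print(f"{size_tb} TB = {terabytes_to_megabytes(size_tb):,.0f} MB")
```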
Very short introduction to data mining (2)
• Data mining involves:
• Database and data warehouse technologies
• Machine learning and artificial intelligence
• Statistics
• Numerical mathematics
• Parallel and high-performance computing
• Visualisation
• Data mining techniques:
• Data cleaning and pre-processing (lecture 2)
• Data integration and matching (lecture 6)
• Cluster analysis (lecture 3)
• Frequent patterns and associations (lecture 4)
• Classification and prediction (lecture 5)
• Outlier detection
Very short introduction to data mining (3)
• Data mining is applied in many areas:
• Retail
• Bioinformatics and health
• Governments (statistics, census, taxation, social welfare)
• Credit card and insurance companies
• Terror, crime and fraud detection, national security
• Networking and telecommunications
• Data mining applications:
• Spatial and temporal data mining
• Text and Web data mining
• Data stream and time-series mining
• Sequence mining (e.g. DNA, proteins)
• Graph and network data mining
• Multimedia data mining (audio, images, video)
Example application 1: Telecommunication
• Huge amounts of data are collected on a daily basis
• Transactional data (about each phone call)
(data on mobile phones, land-line phones, Internet, etc.)
• Customer data (billing, personal information, etc.)
• Additional data (network load, faults, etc.)
• Possible questions
• Which customer group is highly profitable, which one is not?
• To which customers should we advertise what kind of special offers?
• What kind of call rates would increase profit without losing good
customers?
• How do customer profiles change over time?
• Fraud detection (stolen mobile phones)
• Network load predictions
Example application 2: Health
• Different aspects of the health system
• Personal health records (at general practitioners and specialists)
• Hospital data (e.g. admission data, midwives data, surgery data, etc.)
• Nursing homes and death data (admissions, causes, medications, etc.)
• Billing information (Medicare, Pharmaceutical Benefit Scheme)
• Private health insurance and ambulance/emergency data
• Possible questions
• Are doctors following the procedures (e.g. prescription of medication)?
• Can we predict adverse drug reactions (analysis of multiple linked data
collections to find correlations)?
• Are people committing fraud (e.g. doctor shoppers)?
• Are there correlations between social and environmental issues and
people's health (temporal and spatial analysis of linked data collections)?
Example application 3: Astronomy
• Terabytes of images and other data from telescopes and
satellites
• Large-area sky surveys in optical, infrared, and radio wavelengths
• Time-series data
• Possible questions
• Classification of objects (stars, galaxies, pulsars, quasars, etc.)
• Detect (large scale) structures in the data
• Find rare, unusual, or even previously unknown types of astronomical
objects and phenomena
• MACHO (MAssive Compact Halo Objects) (ANU and US)
• Search for dark matter, objects like brown dwarfs or planets in the Milky Way
Definitions of data mining
• Knowledge discovery in databases is the non-trivial process of
identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
(Fayyad, Piatetsky-Shapiro and Smyth, 1996)
• An information extraction activity whose goal is to discover hidden
facts contained in databases. Using a combination of machine
learning, statistical analysis, modeling techniques and database
technology, data mining finds patterns and subtle relationships in data
and infers rules that allow the prediction of future results. Typical
applications include market segmentation, customer profiling, fraud
detection, evaluation of retail promotions, and credit risk analysis.
(http://www.twocrows.com/glossary.htm)
• Try also: http://www.google.com, search term: "define: data mining"
Definitions of data mining (2)
• Essential in definitions is:
• ... non-trivial extraction ...
• ... previously unknown or novel ...
• ... potentially useful information ...
• ... understandable and interesting ...
• ... large amounts of data ...
• ... prediction and modelling ...
• Data mining is often also called Knowledge Discovery in
Databases (KDD)
• Some say data mining is only one essential step in the KDD process
The data mining / KDD process
• Data mining is an interactive process
• Data mining = Build Model(s)
• Typically up to 90% of time and effort are spent in the
first three steps!
(Follows: CRoss Industry Standard Process for Data Mining, http://www.crisp-dm.org/)
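The slide does not name the steps of the process. As a minimal sketch, assuming "the first three steps" refers to the first three of the six phases defined by CRISP-DM, and that "Build Model(s)" corresponds to the modelling phase, the split could be written down as follows. The names `CrispDmPhase` and `is_preparatory` are illustrative, not part of any library.

```python
from enum import IntEnum


class CrispDmPhase(IntEnum):
    """The six CRISP-DM phases, numbered in their usual order."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELLING = 4  # assumed to be the "Data mining = Build Model(s)" step
    EVALUATION = 5
    DEPLOYMENT = 6


def is_preparatory(phase: CrispDmPhase) -> bool:
    """True for the phases that come before any model is built."""
    return phase <= CrispDmPhase.DATA_PREPARATION


if __name__ == "__main__":
    for phase in CrispDmPhase:
        label = "preparation" if is_preparatory(phase) else "modelling and after"
        print(f"{phase.value}. {phase.name.replace('_', ' ').title()}: {label}")
```

The process is not strictly linear: in practice, later phases often send the analyst back to earlier ones.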
The data mining / KDD process (2)
Source: Han and Kamber, DM Book, 2nd Ed. (Copyright © 2006 Elsevier Inc.)
Data mining and business intelligence
• Layers, from top to bottom (increasing potential to support business decisions towards the top):
• Decision Making (End User)
• Data Presentation: Visualization Techniques (Business Analyst)
• Data Mining: Information Discovery (Data Analyst)
• Data Exploration: Statistical Summary, Querying, and Reporting (Data Analyst)
• Data Preprocessing/Integration, Data Warehouses (DBA)
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)
Source: Han and Kamber, DM Book, 2nd Ed. (Copyright © 2006 Elsevier Inc.)
Data mining is multi-disciplinary
• Data Mining draws on: Database Technology, Statistics, Machine Learning,
Visualisation, Pattern Recognition, Algorithms, Other Disciplines
Source: Han and Kamber, DM Book, 2nd Ed. (Copyright © 2006 Elsevier Inc.)
Major challenges in data mining
• Data size
• Size of data collections grows faster than linearly, doubling around every
18 months (similar to Moore's law of processor speed)
• Scalable algorithms are needed
• Data complexity
• Different types of data (database tables, free text, HTML, XML, multimedia)
• Dimensionality of the data increases (more attributes)
• The curse of dimensionality affects many algorithms (for example finding
nearest neighbours in high dimensions)
• Privacy and confidentiality
• Data mining can reveal details about people that are not available
otherwise
• Linking and matching data is especially critical / controversial
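One common way to illustrate the curse of dimensionality mentioned above is distance concentration: as the number of attributes grows, the nearest and the farthest neighbour of a query point end up at almost the same distance, so "nearest" carries less information. The sketch below is only an illustration on uniformly random points; the function name, the number of points and the seed are arbitrary assumptions.

```python
import math
import random


def nearest_to_farthest_ratio(dimensions: int, n_points: int = 200, seed: int = 42) -> float:
    """Ratio of nearest to farthest Euclidean distance from a random query point.

    All points are drawn uniformly from the unit hypercube. A ratio close to 1
    means the nearest and farthest neighbours are almost equally far away.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dimensions)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dimensions)]
        distances.append(math.dist(query, point))
    return min(distances) / max(distances)


if __name__ == "__main__":
    for d in (2, 10, 100, 500):
        print(f"{d:>4} dimensions: nearest/farthest = {nearest_to_farthest_ratio(d):.3f}")
```

The ratio should move towards 1 as the dimensionality increases, which is why nearest-neighbour methods often need dimensionality reduction or attribute selection first.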
Ten grand challenges in data mining (U. Fayyad)
• Technical challenges
• How does the data grow?
• Scalability (of algorithms)
• Complexity/understandability trade-off
• Interestingness
• A theory for what we do
• Pragmatic challenges
• Where is the data?
• Embedding algorithms and solutions within operational systems
• Integrating domain knowledge
• Managing and maintaining models
• Effectiveness measurement
(Source: http://www.acm.org/sigs/sigkdd/explorations/, Editorial, vol 5, no 2, Dec. 2003)
Short history of data mining
• The term data mining was first mentioned by statisticians
several decades ago, but with a different meaning
compared to today: data dredging (the inappropriate, sometimes
deliberately so, search for statistically significant relationships in large
quantities of data; from Wikipedia)
• First workshops on knowledge discovery in databases in the
late 1980s and early 1990s (part of the IJCAI (Artificial Intelligence) and
ACM SIGMOD (Management of Data) conferences)
• First data mining conferences in the mid-1990s
• Many more conferences since the early 2000s
• So data mining is now in its teen years (around 18 years old)
Data mining resources (1)
• Conferences
• ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining (since 1995)
• European Conference on Principles and Practice of Knowledge Discovery
in Databases (PKDD) (since 1997)
• Pacific-Asia Conference on Knowledge Discovery and Data Mining
(PAKDD) (since 1997)
• SIAM (Society for Industrial and Applied Mathematics) International
Conference on Data Mining (since 2001)
• IEEE (Institute of Electrical and Electronics Engineers) International
Conference on Data Mining (ICDM) (since 2001)
• Australasian Data Mining Conference (AusDM) (workshop since 2002,
conference since 2004)
Data mining resources (2)
• Journals
• Springer Data Mining and Knowledge Discovery
http://www.springerlink.com/content/1573-756X
• Springer Knowledge and Information Systems
http://www.springerlink.com/content/0219-3116
• IEEE Transactions on Knowledge and Data Engineering
http://www.computer.org/tkde/
• ACM SIGKDD Explorations
http://www.acm.org/sigs/sigkdd/explorations
• ACM Transactions on Knowledge Discovery from Data
http://tkdd.cs.uiuc.edu/
Data mining resources (3)
• Web resources
• http://www.kdnuggets.com/ (News, software, jobs, courses, conferences,
data repositories, polls, and more)
• http://www.kmining.com (news, definitions, people, conferences)
• http://www.iapa.org.au (Institute of Analytics Professionals of Australia)
• http://www.togaware.com/analytics/ (Canberra Analytics Group)
• http://www.acm.org/sigs/sigkdd/ (ACM Special Interest group on KDD)
• http://www.dmg.org (Data mining group, PMML)
• http://www.togaware.com/ (Graham Williams, ATO)
• http://datamining.anu.edu.au/
• http://kdd.ics.uci.edu/ (UCI Knowledge Discovery in Databases Archive)
Lecture summary
• Data mining is concerned with finding novel, valid and
potentially useful information in large data collections
• It is a relatively new field that draws from many different
disciplines
• Data mining is an iterative process
• Business and data understanding, as well as data
preparation, are major components of data mining
• Major challenges in data mining are the growing size and
complexity of data collections, privacy issues, interestingness and understandability
• Data mining is being applied in many areas
Data mining books
• There are many different books on data mining available,
each with a different focus (statistics, science, business, etc.)