MSIT-116C:
Data Warehousing and
Data Mining
_____________________________________________________________
Course Design and Editorial Committee
Prof. M.G.Krishnan
Vice Chancellor
Karnataka State Open University
Mukthagangotri, Mysore – 570 006
Prof. Vikram Raj Urs
Dean (Academic) & Convener
Karnataka State Open University
Mukthagangotri, Mysore – 570 006
Head of the Department and Course Co-Ordinator
Rashmi B.S
Assistant Professor & Chairperson
DoS in Information Technology
Karnataka State Open University
Mukthagangotri, Mysore – 570 006
Course Editor
Ms. Nandini H.M
Assistant Professor of Information Technology
DoS in Information Technology
Karnataka State Open University
Mukthagangotri, Mysore – 570 006
Course Writers
Dr. B. H. Shekar, Associate Professor, Department of Computer Science, Mangalagangothri, Mangalore
Dr. Manjaiah, Professor, Department of Computer Science, Mangalagangothri, Mangalore
Publisher
Registrar
Karnataka State Open University
Mukthagangotri, Mysore – 570 006
Developed by Academic Section, KSOU, Mysore
Karnataka State Open University, 2014
All rights reserved. No part of this work may be reproduced in any form, by mimeograph or
any other means, without permission in writing from the Karnataka State Open University.
Further information on the Karnataka State Open University Programmes may be obtained
from the University's Office at Mukthagangotri, Mysore – 6.
Printed and Published on behalf of Karnataka State Open University, Mysore-6 by the
Registrar (Administration)
Karnataka State
Open University
Mukthagangothri, Mysore – 570 006
Third Semester M.Sc in Information Technology
MSIT-116C: Data Warehousing and Data Mining
Module 1
Unit-1: Basics of Data Mining and Data Warehousing
Unit-2: Data Warehouse and OLAP Technology: An Overview
Unit-3: Data Cubes and Implementation
Unit-4: Basics of Data Mining
Module 2
Unit-5: Frequent Patterns for Data Mining
Unit-6: FP Growth Algorithms
Unit-7: Classification and Prediction
Unit-8: Approaches for Classification
Module 3
Unit-9: Classification Techniques
Unit-10: Genetic Algorithms, Rough Set and Fuzzy Sets
Unit-11: Prediction Theory of Classifiers
Unit-12: Algorithms for Data Clustering
Module 4
Unit-13: Cluster Analysis
Unit-14: Spatial Data Mining
Unit-15: Text Mining
Unit-16: Multimedia Data Mining
PREFACE
The objective of data mining is to extract relevant information from a large collection of
data. Large amounts of data now exist in many scientific disciplines owing to advances in sensors,
information technology, and high-performance computing. These data sets are
not only very large, being measured in terabytes and petabytes, but are also quite complex. This
complexity arises as the data are collected by different sensors, at different times, at different
frequencies, and at different resolutions. Further, the data are usually in the form of images or
meshes, and often have both a spatial and a temporal component. These data sets arise in diverse
fields such as astronomy, medical imaging, remote sensing, nondestructive testing, physics,
materials science, and bioinformatics. This increasing size and complexity of data in scientific
disciplines has resulted in a challenging problem. Many of the traditional techniques from
visualization and statistics that were used for the analysis of these data are no longer suitable.
Visualization techniques, even for moderate-sized data, are impractical due to their subjective
nature and human limitations in absorbing detail, while statistical techniques do not scale up to
massive data sets. As a result, much of the data collected are never even looked at, and the full
potential of our advanced data collecting capabilities is only partially realized.
Data mining is the process concerned with uncovering patterns, associations, anomalies, and
statistically significant structures in data. It is an iterative and interactive process involving data
preprocessing, search for patterns, and visualization and validation of the results. It is a
multidisciplinary field, borrowing and enhancing ideas from domains including image understanding,
statistics, machine learning, mathematical optimization, high-performance computing, information
retrieval, and computer vision. Data mining techniques hold the promise of assisting scientists and
engineers in the analysis of massive, complex data sets, enabling them to make scientific discoveries,
gain fundamental insights into the physical processes being studied, and advance their
understanding of the world around us.
We introduce the basic concepts and models of data mining (DM) systems from a computer science
perspective. The course focuses on different approaches to data mining, the models used in the
design of DM systems, search issues, and text and multimedia data clustering techniques. Different
types of clustering and classification techniques, which find applications in diverse fields, are
also discussed. The course will enable students to design data mining systems, and an in-depth
analysis is provided for designing multimedia-based data mining systems.
This concise textbook provides an accessible introduction to data mining, organized to support a
foundation or module course on data mining and data warehousing covering a broad selection of the
sub-disciplines within this field. The textbook presents concrete algorithms and applications in
areas such as business data processing, multimedia data processing, and text mining.
Organization of the material: The book introduces its topics in ascending order of complexity and is
divided into four modules, containing four units each.
In the first module, we begin with an introduction to data mining, highlighting its applications and
techniques. The basics of data mining and data warehousing, along with OLAP technology, are
discussed in detail.
In the second module, we discuss approaches to data mining. The frequent pattern mining
approach is presented in detail. The role of classification and association-rule-based classification
is also presented, together with the prediction model of classification and different approaches to
classification.
The third module covers the basics of soft computing paradigms such as fuzzy theory, rough sets and
genetic algorithms, which form the basis for designing data mining algorithms. Data clustering
algorithms, which are central to data mining, are also presented in detail in this module.
In the fourth module, metrics for cluster analysis are discussed. In addition, data mining concepts
for spatial, textual and multimedia data are presented in detail.
Every module covers a distinct problem and includes a quick summary at the end, which can be used
as reference material while studying data mining and data warehousing. Much of the material
found here is interesting as a view into how data mining works, even if you do not need it for a
specific task.
Happy reading to all the students.
UNIT-1: BASICS OF DATA MINING AND DATA WAREHOUSING
Structure
1.1 Objectives
1.2 Introduction
1.3 Data warehouse
1.4 Operational data store
1.5 Extraction transformation loading
1.6 Data warehouse Meta data
1.7 Summary
1.8 Keywords
1.9 Exercises
1.10 References
1.1 Objectives
The objectives covered under this unit include:
Introduction to data mining and data warehousing
Techniques for data mining
Basics of operational data stores (ODS)
Basics of Extraction transformation loading (ETL)
Building the data warehouses
Role of metadata.
1.2 Introduction
What is data mining?
The amount of data collected by organizations grows by leaps and bounds, year after year,
and there may be payoffs in uncovering the information hidden in
these data. Data mining is a way to gain market intelligence from this huge amount of
data. The problem today is not the lack of data, but how to learn from it. Data mining mainly
deals with structured data organized in a database. It uncovers anomalies, exceptions,
patterns, irregularities or trends that may otherwise remain undetected under the immense
volumes of data.
What is data warehousing?
A data warehouse is a database designed to support decision making in an organization. Data
from the production databases are copied to the data warehouse so that queries can be
performed without disturbing the performance or the stability of the production systems.
For data mining to occur, it is crucial that data warehousing is present.
Walmart is an example of how well data warehousing and data mining can be utilized.
Walmart maintains a 7.5 TB data warehouse. Retailers capture Point of Sale (POS)
transaction data from over 2,900 stores across 6 countries and transmit them to Walmart's
data warehouse. Walmart then allows its suppliers to access the data on their own
products and analyse how they can improve their sales.
These suppliers can then better understand customer buying patterns, manage local store
inventory, and so on.
Data mining techniques: What are they and how are they used?
Data mining is not a method of attacking the data; on the contrary, it is a way of learning
from the data and then using that information. For that reason, we need a new mind-set in
data mining. We must be open to finding relationships and patterns that we never imagined
existed. We let data tell us the story rather than impose a model on the data that we feel will
replicate the actual patterns.
There are four categories of data mining techniques/tools (Keating, 2008):
1. Prediction
2. Classification
3. Clustering Analysis
4. Association Rules Discovery
Prediction Tools: These are methods derived from traditional statistical forecasting for
predicting a variable's value. The most common and important applications in data mining
involve prediction. This technique uses traditional statistics such as regression analysis and
multiple discriminant analysis. Non-traditional methods used in prediction tools come from
Artificial Intelligence and Machine Learning.
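As a rough illustration of regression-based prediction, the following sketch fits a straight line by ordinary least squares and uses it to predict a new value. The advertising and sales figures are invented for illustration, and `fit_line` is a hypothetical helper rather than a function from any package named in this unit.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend versus units sold.
spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(spend, sales)
predicted = slope * 5.0 + intercept   # predict sales for a spend of 5.0
print(slope, intercept, predicted)
```

Real prediction tools handle many variables and noisy data; the single-variable case is used here only because it can be checked by hand.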
Classification Tools: These are the most commonly used tools in data mining. Classification tools
attempt to distinguish different classes of objects or actions. For example, given a credit card
transaction, these tools could classify it as either legitimate or fraudulent. Catching fraudulent
transactions in this way can save the credit card company a considerable amount of money.
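One very simple way to classify a transaction can be sketched as a nearest-centroid classifier. This is only an illustrative technique chosen for its brevity, not the method used by any particular tool. The class labels, the two features and all figures below are assumptions made for the example.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def train(labelled):
    """Group training examples by label and compute one centroid per class."""
    by_label = {}
    for features, label in labelled:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(pts) for label, pts in by_label.items()}

def classify(model, features):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(model, key=lambda label: math.dist(model[label], features))

# Hypothetical transactions: (amount, distance in km from the cardholder's home).
training = [
    ((20.0, 2.0), "legitimate"),
    ((35.0, 5.0), "legitimate"),
    ((900.0, 4000.0), "fraudulent"),
    ((1200.0, 6000.0), "fraudulent"),
]
model = train(training)
small_local = classify(model, (25.0, 3.0))
large_distant = classify(model, (1000.0, 5000.0))
print(small_local, large_distant)
```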
Clustering Analysis Tools: These are very powerful tools for clustering products into groups
that naturally fall together. These groups are identified by the program and not by the
researchers. Most of the clusters discovered may be of little use in business decisions.
However, one or two of them may be extremely important and can be exploited
to give the business an edge over its competitors. The most common use for
clustering tools is probably in what economists refer to as "market segmentation."
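To show how a program, rather than the researcher, can identify groups, here is a minimal sketch of k-means clustering on one-dimensional data. K-means is used only as a familiar example of a clustering method. The customer spend figures and the starting centroids are hypothetical.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on one-dimensional data with given starting centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break   # assignments are stable, so stop early
        centroids = new_centroids
    return centroids, clusters

# Hypothetical annual spend per customer.
spend = [100, 120, 110, 900, 950, 1000]
centroids, clusters = kmeans_1d(spend, centroids=[100, 1000])
print(centroids)
print(clusters)
```

The two groups that emerge (low spenders and high spenders) are a toy version of the market segments mentioned above.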
Association Rules Discovery: Here the data mining tools discover associations; e.g., what
kinds of books certain groups of people read, what products certain groups of people
purchase, what movies certain groups of people watch, etc. Businesses can use this
information to target their markets. Online retailers like Netflix and Amazon use these tools
quite intensively. For example, Netflix recommends movies based on movies people have
watched and rated in the past. Amazon does something similar in recommending books when
you re-visit their website.
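Association rules are commonly scored with two standard measures, support and confidence. The sketch below computes both for a rule of the form "customers who buy bread also buy milk". The market baskets are invented, and the helper names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Share of transactions containing the antecedent that also contain the consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
print(rule_support, rule_confidence)
```

Here two of the four baskets contain both items, and two of the three baskets with bread also contain milk.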
The two major pieces of software used at the moment for data mining are PASW Modeller
(formerly known as SPSS Clementine) and SAS Enterprise Miner. Both software packages
include an array of capabilities that enable the data mining tools mentioned above. Newcomers to
data mining can use an Excel add-in called XLMiner, available from Resampling Stats, Inc.
This Excel add-in lets potential data miners not only examine the usefulness of such a
program but also get familiar with some of the data mining techniques. Although Excel is
quite limited in the number of observations it can handle, it can give the user a taste of how
valuable data mining can be without incurring much cost up front.
Examples of use of information extracted from data mining exercises
Data mining has been used to help in the credit scoring of customers in the financial industry
(Peng, 2004). Credit scoring can be defined as a technique that helps credit providers decide
whether to grant credit to customers. Its most common use is in making credit decisions for
loan applications. Beyond decisions on personal loan applications, credit scoring is also applied
to set credit limits, manage existing accounts and forecast the profitability of
consumers and customers (Punch, 2000).
Data mining and data warehousing have been particularly successful in the realm of customer
relationship management. By utilizing a data warehouse, retailers can embark on customer-specific
strategies like customer profiling, customer segmentation, and cross-selling. By
using the information in the data warehouse, the business can divide its customers into four
quadrants of customer segmentation: (1) customers that should be eliminated (i.e., they cost
more than what they generate in revenues); (2) customers with whom the relationship should
be re-engineered (i.e., those that have the potential to be valuable, but may require the
company's encouragement, cooperation, and/or management); (3) customers that the
company should engage; and (4) customers in which the company should invest (Buttle, 1999;
Verhoef & Donkers, 2001). The company could then use the corresponding strategies to
manage the customer relationships (Cunningham et al, 2006).
Data mining can also help in the detection of spam in electronic mail (email) (Shih et al,
2008).
Data mining has also been used in healthcare and acute care. A medical center in the US used
data mining technology to help its physicians work more efficiently and reduce mistakes
(Veluswamy, 2008).
There are other examples which we will not deal with here that have been flagship success
stories of data mining – the beer and diaper association; Harrah; Amazon and Netflix.
Essentials before you data mine
Apart from management buy-in and financial backing, certain basics must be in place before you
embark on a data mining project. Data mining can only uncover patterns already present in
the data, so you must already have the target dataset, residing in a data
warehouse or a data mart. It must be large enough to contain these patterns while
remaining concise enough to be mined in an acceptable timeframe. The target set then needs
to be "cleaned". This process removes the observations with noise and missing data. The
cleaned data is then reduced into feature vectors, one vector per observation. A feature vector
is a summarised version of the raw data observation.
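The cleaning and feature-vector steps can be sketched as follows. The field names, the plausibility rule used to detect noise and the choice of summary features are all assumptions made for this example; a real project would define them from its own data.

```python
def is_clean(obs):
    """Keep an observation only if nothing is missing and the age is plausible."""
    if any(value is None for value in obs.values()):
        return False                      # missing data
    if not 0 < obs["age"] < 120:
        return False                      # treated here as noise (assumed rule)
    return len(obs["purchases"]) > 0      # need at least one purchase to summarise

def to_feature_vector(obs):
    """Summarise one raw observation as a fixed-length numeric vector."""
    purchases = obs["purchases"]
    return [
        float(obs["age"]),
        float(len(purchases)),            # number of purchases
        sum(purchases) / len(purchases),  # average purchase amount
    ]

# Hypothetical raw observations, one per customer.
raw = [
    {"age": 34, "purchases": [20.0, 40.0]},
    {"age": None, "purchases": [15.0]},          # missing age: dropped
    {"age": 250, "purchases": [10.0]},           # implausible age: dropped
    {"age": 51, "purchases": [5.0, 10.0, 15.0]},
]
vectors = [to_feature_vector(o) for o in raw if is_clean(o)]
print(vectors)
```

Each surviving observation yields exactly one vector, matching the "one vector per observation" rule.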
Limitations of data mining
The quality of data mining applications depends on the quality and availability of data. As the
data set to be mined should be of a certain quality, time and expense may be
needed to "clean" it.
In addition, the amount of data to be mined should be sufficiently large for the
software to extract meaningful patterns and associations.
Also, as data mining requires huge amounts of resources, both man-hours and money, the
user must be a domain specialist who understands the business problems and is familiar with
data mining tools and techniques, so that resources are not wasted on a data mining project
that will fail at the start.
Finally, once data have been mined, it is up to the management and decision makers to use the
information that has been extracted. Data mining is not the be-all and end-all, nor a magic wand
that points the organization to what it should do. The human intellect and business acumen of the
decision makers are still very much required to make sense of the information
extracted from a data mining exercise.
Some issues surrounding data mining and data warehousing
1. You've data mined, but will the bosses take proper and appropriate action? The dichotomy
between the use of sophisticated data mining software and techniques and
the conventionality of how organizations make decisions
Brydon and Gemino (2008) highlighted the dichotomy between the use of sophisticated data
mining software and techniques as opposed to the conventionality of how organisations make
decisions. They believed, rightly so, that "tools and techniques for data mining and decision
making integration are still in their infancy. Firms must be willing to reconsider the ways in
which they make decisions if they are to realize a payoff from their investments in data
mining technology."
2. One-size-fits-all data mining packages for industry. Do they fit the purpose of data mining
at all?
There are now "one size fits all" vertical applications for certain industries or industry
segments, developed by consultants who market these packages to all competitors
within that segment. This poses a potential risk for companies that are new to data mining:
when they explore the technique through these vertical "off the shelf" solutions, they are using
the same tools that their competitors can also easily obtain.
Nevertheless, the application of this technology is limited only by our
imagination, so it is up to the companies to show how and why they wish to use the
technology. They should also be aware that data mining is a long and resource-intensive
exercise, which an "off the shelf" solution deceptively presents as easy and
affordable. Only companies that learn to be comfortable using these tools on all
varieties of company data will benefit.
3. The use of data mining for prediction in non-commercial and "problematic" areas,
e.g. prediction of terrorist acts
In 2002, the US government embarked on a massive data mining effort called Total
Information Awareness. The basic idea was to collect as much data as possible on everyone, sift
it through massive computers, and investigate patterns that might indicate terrorist plots
(Schneier, 2006). However, a backlash of public opinion drove the US Congress to stop
funding the programme. Nevertheless, there is a belief that the programme just changed its
name and moved inside the walls of the US Defence Department (Harris, 2006).
According to Schneier (2006), data mining will fail in such a situation because
terrorist plots are different from credit card fraud: terrorist acts have no well-defined profile,
and attacks are very rare. "Taken together, these facts mean that data-mining systems won't
uncover any terrorist plots until they are very accurate, and that even very accurate
systems would be so flooded with false alarms that they will be useless."
This highlights the principle pointed out earlier in this unit: data mining is not a panacea for
all information problems and is not a magic wand to guide anyone out of the wilderness.
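The false-alarm argument can be made concrete with a little arithmetic. The figures below (one million screened events, ten real plots, a detector that is right 99% of the time in both directions) are purely hypothetical and are not taken from Schneier; they only show how rarity swamps accuracy.

```python
def alarm_breakdown(total_events, real_plots, true_positive_rate, false_positive_rate):
    """Expected true alarms, false alarms, and the share of alarms that are real."""
    innocent = total_events - real_plots
    true_alarms = real_plots * true_positive_rate
    false_alarms = innocent * false_positive_rate
    share_real = true_alarms / (true_alarms + false_alarms)
    return true_alarms, false_alarms, share_real

# Hypothetical scenario: very rare events, very accurate detector.
true_alarms, false_alarms, share_real = alarm_breakdown(
    total_events=1_000_000,
    real_plots=10,
    true_positive_rate=0.99,
    false_positive_rate=0.01,
)
print(true_alarms, false_alarms, share_real)
```

Under these assumptions roughly ten thousand false alarms accompany about ten real ones, so fewer than one alarm in a thousand points at a real plot.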
4. Ethical concerns over data warehousing and data mining – do you have any? Should
companies be concerned?
Data mining produces results only if it has large volumes of information at its
disposal. With such large amounts of data needing to be gathered, should we also be
concerned with the ethics behind the collection and use of that data?
As highlighted by Linstedt (2004), the implementers of the technology are simply told to
integrate data and the project manager builds a project to make it happen – these people
simply do not have the time to ponder whether the data had been handled ethically. Linstedt
proposes a checklist for project managers and technology implementers to address ethical
concerns over data:
Develop SLAs with end users that define who has access to what levels of information.
Have end users involved in defining the ethical standards of use for the data that will be delivered.
Define the bounds around the integration efforts of public data, where it will be integrated and where it will not, so as to avoid conflicts of interest.
Do not use "live" or real data for testing purposes, or lock down the test environment; too often test environments are left wide open and accessible to too many individuals.
Define where, how, and who will be using data mining; restrict the mining efforts to specific sets of information. Build a notification system to monitor data mining usage.
Allow customers to "block" the integration of their own information (this one is questionable), depending on whether the customer information will be made available on the web after integration.
Remember that any efforts made are still subject to governmental laws.
Nothing is sacred. If a government wants access to the information, it will get it.
1.3 Data warehouse
In computing, a data warehouse (DW, DWH), or an enterprise data warehouse (EDW), is
a database used for reporting (1) and data analysis (2). Integrating data from one or more
disparate sources creates a central repository of data, a data warehouse (DW). Data
warehouses store current and historical data and are used for creating trending reports for
senior management reporting such as annual and quarterly comparisons.
The data stored in the warehouse is uploaded from the operational systems (such as
marketing, sales, etc., shown in the figure to the right). The data may pass through an
operational data store for additional operations before it is used in the DW for reporting.
The typical extract, transform, load (ETL)-based data warehouse uses staging, data integration,
and access layers to house its key functions. The staging layer or staging database stores raw
data extracted from each of the disparate source data systems. The integration layer integrates
the disparate data sets by transforming the data from the staging layer often storing this
transformed data in an operational data store (ODS) database. The integrated data are then
moved to yet another database, often called the data warehouse database, where the data is
arranged into hierarchical groups often called dimensions and into facts and aggregate facts.
The combination of facts and dimensions is sometimes called a star schema. The access layer
helps users retrieve data.[1]
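The layers described above can be sketched with Python's built-in sqlite3 module and an in-memory database. All table and column names are hypothetical, the sample rows are invented, and the operational data store step is folded into the integration step to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Staging layer: raw rows as extracted from a source system.
    CREATE TABLE staging_sales (store_name TEXT, product_name TEXT, amount REAL);

    -- Warehouse layer: two dimension tables and one fact table (a star schema).
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, store_name   TEXT UNIQUE);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT UNIQUE);
    CREATE TABLE fact_sales  (store_id INTEGER, product_id INTEGER, amount REAL);
""")

conn.executemany(
    "INSERT INTO staging_sales VALUES (?, ?, ?)",
    [(" Mysore ", "pen", 10.0), ("Mysore", "book", 50.0), ("Mangalore", "pen", 12.0)],
)

# Integration layer: trim the names, then load dimensions and facts.
conn.executescript("""
    INSERT OR IGNORE INTO dim_store (store_name)
        SELECT DISTINCT TRIM(store_name) FROM staging_sales;
    INSERT OR IGNORE INTO dim_product (product_name)
        SELECT DISTINCT TRIM(product_name) FROM staging_sales;
    INSERT INTO fact_sales
        SELECT s.store_id, p.product_id, g.amount
        FROM staging_sales AS g
        JOIN dim_store   AS s ON s.store_name   = TRIM(g.store_name)
        JOIN dim_product AS p ON p.product_name = TRIM(g.product_name);
""")

# Access layer: an aggregate query over facts grouped by a dimension.
totals = conn.execute("""
    SELECT s.store_name, SUM(f.amount)
    FROM fact_sales AS f JOIN dim_store AS s USING (store_id)
    GROUP BY s.store_name ORDER BY s.store_name
""").fetchall()
print(totals)
```

The transformation here is deliberately trivial (trimming whitespace so that two spellings of one store collapse into a single dimension row); real integration layers apply far richer rules.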
A data warehouse constructed from integrated data source systems does not require ETL,
staging databases, or operational data store databases. The integrated data source systems
may be considered to be a part of a distributed operational data store layer. Data federation
methods or data virtualization methods may be used to access the distributed integrated
source data systems to consolidate and aggregate data directly into the data warehouse
database tables.
Unlike the ETL-based data warehouse, the integrated source data systems and the data
warehouse are all integrated since there is no transformation of dimensional or reference data.
This integrated data warehouse architecture supports the drill down from the aggregate data
of the data warehouse to the transactional data of the integrated source data systems.
A data mart is a small data warehouse focused on a specific area of interest. Data warehouses
can be subdivided into data marts for improved performance and ease of use within that area.
Alternatively, an organization can create one or more data marts as first steps towards a larger
and more complex enterprise data warehouse.
This definition of the data warehouse focuses on data storage. The main source of the data is
cleaned, transformed, catalogued and made available for use by managers and other business
professionals for data mining, online analytical processing, market research and decision
support (Marakas & O'Brien 2009). However, the means to retrieve and analyze data, to
extract, transform and load data, and to manage the data dictionary are also considered
essential components of a data warehousing system. Many references to data warehousing
use this broader context. Thus, an expanded definition for data warehousing includes business
intelligence tools, tools to extract, transform and load data into the repository, and tools to
manage and retrieve metadata.
Difficulties of Implementing Data Warehouses
Some significant operational issues arise with data warehousing: construction, administration,
and quality control. Project management—the design, construction, and implementation of
the warehouse—is an important and challenging consideration that should not be
underestimated. The building of an enterprise-wide warehouse in a large organization is a
major undertaking, potentially taking years from conceptualization to implementation.
Because of the difficulty and amount of lead time required for such an undertaking, the
widespread development and deployment of data marts may provide an attractive alternative,
especially to those organizations with urgent needs for OLAP, DSS, and/or data mining
support. The administration of a data warehouse is an intensive enterprise, proportional to the
size and complexity of the warehouse. An organization that attempts to administer a data
warehouse must realistically understand the complex nature of its administration. Although
designed for read access, a data warehouse is no more a static structure than any of its
information sources. Source databases can be expected to evolve. The warehouse's schema
and acquisition component must be expected to be updated to handle these evolutions.
A significant issue in data warehousing is the quality control of data. Both quality and
consistency of data are major concerns. Although the data passes through a cleaning
function during acquisition, quality and consistency remain significant issues for the
database administrator. Melding data from heterogeneous and disparate sources is a
major challenge given differences in naming, domain definitions, identification
numbers, and the like. Every time a source database changes, the data warehouse
administrator must consider the possible interactions with other elements of the
warehouse.
Usage projections should be estimated conservatively prior to construction of the data
warehouse and should be revised continually to reflect current requirements. As
utilization patterns become clear and change over time, storage and access paths can
be tuned to remain optimized for support of the organization's use of its warehouse.
This activity should continue throughout the life of the warehouse in order to remain
ahead of demand. The warehouse should also be designed to accommodate the
addition and attrition of data sources without major redesign. Sources and source data
will evolve, and the warehouse must accommodate such change. Fitting the available
source data into the data model of the warehouse will be a continual challenge, a task
that is as much art as science. Because there is continual rapid change in technologies,
both the requirements and capabilities of the warehouse will change considerably over
time. Additionally, data warehousing technology itself will continue to evolve for
some time so that component structures and functionalities will continually be
upgraded. This certain change is excellent motivation for having fully modular design
of components. Administration of a data warehouse will require far broader skills than
are needed for traditional database administration. A team of highly skilled technical
experts with overlapping areas of expertise will likely be needed, rather than a single
individual. Like database administration, data warehouse administration is only partly
technical; a large part of the responsibility requires working effectively with all the
members of the organization with an interest in the data warehouse. However difficult
that can be at times for database administrators, it is that much more challenging for
data warehouse administrators, as the scope of their responsibilities is considerably
broader. Design of the management function and selection of the management team
for a data warehouse are crucial. Managing the data warehouse in a large
organization will surely be a major task. Many commercial tools are available to
support management functions. Effective data warehouse management will certainly
be a team function, requiring a wide set of technical skills, careful coordination, and
effective leadership. Just as we must prepare for the evolution of the warehouse, we
must also recognize that the skills of the management team will, of necessity, evolve
with it.
Data Warehouse Guidelines: Building Data Warehouses
Embarking on a data warehouse project is a daunting task. Many data warehouse projects fail
because they are underfunded or unfocused, because end-users are not trained to access data
effectively, or because of organizational issues. In fact, a large number of data warehousing
projects fail during the first year.
According to Mitch Kramer, consulting editor at Patricia Seybold Group, a strategic
technologies, best practices, and business solutions consulting group based in Boston, there
are many ways to make a data warehouse successful.
Here are a few of the areas to be aware of when creating and implementing a data warehouse:
1. Keep things focused.
"Try not to create a global solution." Kramer suggests that a good practice is to "focus on
what you need. A small data warehouse or data mart which addresses a single subject or that
is focused on a single department is much more efficient than a large data warehouse. You
will see measurable results much faster from a data mart than a data warehouse. A focused
data mart will get funding and gain organizational consensus a lot easier, too."
2. Don't worry about integration, keep things small.
"Integration can be an issue, but it has always been a problem when organizations try to
take a small filing system and integrate it into an organizational system. There are always
coding problems of some sort." Kramer then added, "Global systems always tend to fold, so
keep it small."
3. Spend the extra money if you need help designing your system.
Kramer commented, "Systems designing is the best place to spend the money on hiring
consultants. They know the problems, and know how to deal with them. It is possible to
design your own data warehouse system, but it is a lot less frustrating to hire out the design
process."
4. Keep things simple.
"Buy one single product from one vendor. This minimizes, or possibly eliminates any tool
integration issues," Kramer advised.
5. Be in tune with the users.
"Know your users," Kramer warned. "If you are not careful, you will wind up giving the
right users the wrong tools, and that only leads one place - frustration. Find out who your
end-users are, and work backward to the operational data. This will tell you what tools your
data warehouse needs."
6. Consider your platforms.
Kramer said "there really are no right platforms out there. You can start with a UNIX
system or NT. Keep in mind that the NT has a ceiling in terms of scalability, but it works
well with data marts, and most other small warehouses, just not global data warehouses."
7. Think before you data mine.
"Data mining is a solution in search of a problem," Kramer said. "Know what you want to
find before you select the tool. Data mining software simply relieves some of the burden
from the analyst."
1.4 Operational Data Stores (ODS)
An operational data store (ODS) is a type of database that's often used as an interim logical
area for a data warehouse.
While in the ODS, data can be scrubbed, resolved for redundancy and checked for
compliance with the corresponding business rules. An ODS can be used for integrating
disparate data from multiple sources so that business operations, analysis and reporting can
be carried out while business operations are occurring. This is the place where most of the
data used in current operation is housed before it's transferred to the data warehouse for
longer term storage or archiving.
An ODS is designed for relatively simple queries on small amounts of data (such as finding
the status of a customer order), rather than the complex queries on large amounts of data
typical of the data warehouse. An ODS is similar to your short term memory in that it stores
only very recent information; in comparison, the data warehouse is more like long term
memory in that it stores relatively permanent information.
Operational data store (ODS) fact build
During the ETL process, the builds extract data from the operational system and map the data
to the operational data store area in the data warehouse.
Extracting data
The source data is extracted through the XML ODBC driver from data services or XML data
files. In most cases, data is loaded directly from the data sources into the operational data
store area of the data warehouse. In some cases, however, data is extracted through staging:
small ETL builds extract the data, and store it into temporary tables. Other ETL builds
retrieve the data, transform it, and map it to the operational data store area of the data
warehouse. For products that support delta loads, extraction from data services is through
delta loads. The structure of source data is specific to the data source. The attributes are
extracted according to the measurement objectives. Therefore, not all attributes of the data
sources are loaded to the data warehouse.
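The two ideas in the paragraph above, delta loads and loading only selected attributes, can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual mechanism of any product: the row layout, the `modified` field, and the `extract_delta` helper are all hypothetical.

```python
from datetime import datetime

# Hypothetical source rows; the field names are illustrative assumptions.
SOURCE_ROWS = [
    {"id": 1, "status": "open", "owner": "a", "modified": datetime(2014, 1, 5)},
    {"id": 2, "status": "closed", "owner": "b", "modified": datetime(2014, 2, 1)},
    {"id": 3, "status": "open", "owner": "c", "modified": datetime(2014, 3, 9)},
]

def extract_delta(rows, last_load, wanted=("id", "status", "modified")):
    """Return only rows changed after last_load, keeping only the wanted attributes."""
    return [
        {name: row[name] for name in wanted}
        for row in rows
        if row["modified"] > last_load
    ]

# Only rows modified since the previous load are extracted.
delta = extract_delta(SOURCE_ROWS, last_load=datetime(2014, 1, 31))
```

Here the `owner` attribute is deliberately left out of `wanted`, mirroring the point that not every source attribute is loaded.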
Transforming data
The transformation models do not contain complex business rules, or aggregations and
calculations. The transformation of attributes happens in the following manner:
Attributes that describe the entity itself are loaded directly to the data warehouse with
the Attribute element.
Attributes that describe a relationship between an entity and another are transformed,
using lookup dimensions and derivations, into the surrogate key of the associated
entity. For example, in the case of the dbid attribute of a defect in a ClearQuest®
project, the lookup dimension takes the natural key (dbid of the project) and searches
the PROJECT table in operational data store area in the data warehouse to find a
matching record. The derivation checks the result of the lookup dimension. If a match
is found, the derivation returns the surrogate key of the project record. If a match is
not found, which indicates that no project is associated with this defect, the derivation
returns a value of -1. The result of the derivation is delivered to the data warehouse.
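The lookup-and-derivation step described above can be sketched in a few lines. This is only an illustration: the dictionary standing in for the PROJECT table, the sample keys, and the function names are assumptions, not part of any real tool.

```python
# Hypothetical contents of the PROJECT table in the operational data store:
# natural key (dbid of the project) -> surrogate key assigned by the warehouse.
PROJECT_BY_NATURAL_KEY = {"33554433": 101, "33554434": 102}

NO_MATCH = -1  # value the text says is returned when no project is associated

def lookup_project(natural_key):
    """Lookup dimension: find the surrogate key for a natural key, or None."""
    return PROJECT_BY_NATURAL_KEY.get(natural_key)

def derive_project_key(natural_key):
    """Derivation: return the surrogate key if a match was found, else -1."""
    surrogate = lookup_project(natural_key)
    return surrogate if surrogate is not None else NO_MATCH
```

Returning -1 instead of a missing value keeps the column usable as a key even when a defect has no project.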
Delivering data
Similar data from different data sources is mapped to the same table in the data warehouse.
The data is stored according to the subject or business domain. For example, a defect from
Rational® ClearQuest and a defect from Rational Team Concert™ are mapped to the same
REQUEST table in the operational data store. The most common mappings are:
Record identity:
This control attribute provided by Data Manager is for a unique number for each row
and must be mapped to the surrogate key column in the data warehouse table.
Last update date
This control attribute provided by Data Manager is for the date on which an existing
row was updated and must be mapped to the REC_TIMESTAMP column in the data
warehouse table.
SOURCE_ID
This column in the data warehouse must be used to store the GUID of the data source,
which can be used for differentiating data of different sources. For data sources where
the data is extracted through the XML ODBC driver, a GUID is automatically
assigned to each resource group and the value is put in each table in the column
DATASOURCE_ID, which must be mapped to the SOURCE_ID column in the data
warehouse table. For other data sources where the XML ODBC driver is not used, the
value needs to be supplied manually.
EXTERNAL_KEY1/EXTERNAL_KEY2
An attribute to store the integer or character type of the natural key from the data
source.
REFERENCE_ID
An attribute to store a user-visible identifier, if the data source has one.
URL
An attribute to store the URL of an XML resource of a data source.
Classification ID
An attribute for some commonly used artifacts such as projects, requests,
requirements, tasks, activities, and components. This attribute is used for further
classifying the data in these tables. For each artifact, a table with _CLASSIFICATION
in the name is defined in the data warehouse and the IDs and values are predefined
when the data warehouse is created. The ETL builds that deliver these artifacts into
the data warehouse must specify the value of the classification ID and map it to the
corresponding column with _CLASS_ID in the name.
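As a rough sketch of the mappings listed above, the following maps two defects from different sources into one shared row layout. The column names REQUEST_ID and REQUEST_CLASS_ID, the input field names, and the GUID strings are hypothetical; only REC_TIMESTAMP, SOURCE_ID, EXTERNAL_KEY1, REFERENCE_ID and URL come from the text.

```python
from itertools import count

_next_id = count(1)  # stands in for the "record identity" control attribute

def to_request_row(record, source_guid, class_id, updated_on):
    """Map one source defect to the shared REQUEST layout (some column names assumed)."""
    return {
        "REQUEST_ID": next(_next_id),           # surrogate key (hypothetical column name)
        "REC_TIMESTAMP": updated_on,            # last update date
        "SOURCE_ID": source_guid,               # GUID of the data source
        "EXTERNAL_KEY1": record["natural_key"], # natural key from the source
        "REFERENCE_ID": record.get("display_id"),
        "URL": record.get("url"),
        "REQUEST_CLASS_ID": class_id,           # hypothetical *_CLASS_ID column
    }

# Two made-up defects from two different sources land in the same layout.
rows = [
    to_request_row({"natural_key": 42, "display_id": "CQ-42"},
                   "guid-cq", 1, "2014-03-01"),
    to_request_row({"natural_key": 7, "display_id": "RTC-7",
                    "url": "https://example.invalid/7"},
                   "guid-rtc", 1, "2014-03-02"),
]
```

Because both rows share the same columns, SOURCE_ID is what distinguishes data that came from different sources.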
1.5 Extraction Transformation Loading (ETL)
You must load your data warehouse regularly so that it can serve its purpose of facilitating
business analysis. To do this, data from one or more operational systems must be extracted
and copied into the data warehouse. The challenge in data warehouse environments is to
integrate, rearrange and consolidate large volumes of data over many systems, thereby
providing a new unified information base for business intelligence.
The process of extracting data from source systems and bringing it into the data warehouse is
commonly called ETL, which stands for extraction, transformation, and loading. Note that
ETL refers to a broad process, and not three well-defined steps. The acronym ETL is perhaps
too simplistic, because it omits the transportation phase and implies that each of the other
phases of the process is distinct. Nevertheless, the entire process is known as ETL.
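A minimal sketch of the overall idea follows, assuming two made-up operational systems held as in-memory lists. The function names, field names, and sample values are all hypothetical, and transportation is omitted because everything lives in one process.

```python
def extract(source):
    """Extraction: copy the rows out of one operational system."""
    return list(source)

def transform(rows, system_name):
    """Transformation: bring rows from different systems into one shape."""
    return [
        {
            "customer": row["cust"].strip().upper(),
            "amount": float(row["amt"]),
            "system": system_name,
        }
        for row in rows
    ]

def load(warehouse, rows):
    """Loading: append the unified rows to the warehouse table."""
    warehouse.extend(rows)

# Two hypothetical operational systems with slightly different conventions.
billing = [{"cust": " acme ", "amt": "120.50"}]
web_shop = [{"cust": "Acme", "amt": 80}]

warehouse = []
for name, source in (("billing", billing), ("web_shop", web_shop)):
    load(warehouse, transform(extract(source), name))

# After consolidation, one query covers both systems.
total_for_acme = sum(r["amount"] for r in warehouse if r["customer"] == "ACME")
```

In practice the three functions are rarely this cleanly separated, which is the point the text makes about the acronym being too simplistic.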
The methodology and tasks of ETL have been well known for many years, and are not
necessarily unique to data warehouse environments: a wide variety of proprietary
applications and database systems are the IT backbone of any enterprise. Data has to be
shared between applications or systems, trying to integrate them, giving at least two
applications the same picture of the world. This data sharing was mostly addressed by
mechanisms similar to what is now called ETL.
ETL Basics in Data Warehousing
What happens during the ETL process? The following tasks are the main actions in the
process.
Extraction of Data
During extraction, the desired data is identified and extracted from many different sources,
including database systems and applications. Very often, it is not possible to identify the
specific subset of interest, therefore more data than necessary has to be extracted, so the
identification of the relevant data will be done at a later point in time. Depending on the
source system's capabilities (for example, operating system resources), some transformations
may take place during this extraction process. The size of the extracted data varies from
hundreds of kilobytes up to gigabytes, depending on the source system and the business
situation. The same is true for the time delta between two (logically) identical extractions: the
time span may vary between days/hours and minutes to near real-time. Web server log files,
for example, can easily grow to hundreds of megabytes in a very short period.
Transportation of Data
After data is extracted, it has to be physically transported to the target system or to an
intermediate system for further processing. Depending on the chosen way of transportation,
some transformations can be done during this process, too. For example, a SQL statement
which directly accesses a remote target through a gateway can concatenate two columns as
part of the SELECT statement.
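The column concatenation mentioned above can be illustrated as follows. This is a sketch only: an in-memory SQLite database stands in for the remote system reached through a gateway, and the table and column names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for the remote source
conn.execute("CREATE TABLE customers (first_name TEXT, last_name TEXT)")
conn.execute("INSERT INTO customers VALUES ('Ada', 'Lovelace')")

# The || operator concatenates the two columns while the rows are being read,
# so the transformation happens as part of moving the data.
rows = conn.execute(
    "SELECT first_name || ' ' || last_name AS full_name FROM customers"
).fetchall()
conn.close()
```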
The emphasis in many of the examples in this section is scalability. Many long-time users of
Oracle Database are experts in programming complex data transformation logic using
PL/SQL. These chapters suggest alternatives for many such data manipulation operations,
with a particular emphasis on implementations that take advantage of Oracle's new SQL
functionality, especially for ETL and the parallel query infrastructure.
ETL Tools for Data Warehouses
Designing and maintaining the ETL process is often considered one of the most difficult and
resource-intensive portions of a data warehouse project. Many data warehousing projects use
ETL tools to manage this process. Oracle Warehouse Builder, for example, provides ETL
capabilities and takes advantage of inherent database abilities. Other data warehouse builders
create their own ETL tools and processes, either inside or outside the database.
Besides the support of extraction, transformation, and loading, there are some other tasks that
are important for a successful ETL implementation as part of the daily operations of the data
warehouse and its support for further enhancements. Besides the support for designing a data
warehouse and the data flow, these tasks are typically addressed by ETL tools such as Oracle
Warehouse Builder.
Oracle is not an ETL tool and does not provide a complete solution for ETL. However,
Oracle does provide a rich set of capabilities that can be used by both ETL tools and
customized ETL solutions. Oracle offers techniques for transporting data between Oracle
databases, for transforming large volumes of data, and for quickly loading new data into a
data warehouse.
Daily Operations in Data Warehouses
The successive loads and transformations must be scheduled and processed in a specific
order. Depending on the success or failure of the operation or parts of it, the result must be
tracked and subsequent, alternative processes might be started. The control of the progress as
well as the definition of a business workflow of the operations are typically addressed by
ETL tools such as Oracle Warehouse Builder.
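The ordering and tracking requirement can be sketched without any ETL tool. The example below is an assumption-laden illustration, not how Oracle Warehouse Builder works: the step names and the `run_loads` helper are hypothetical, and the only policy shown is to stop at the first failure.

```python
def run_loads(steps):
    """Run load steps in order; record each result and stop at the first failure."""
    log = []
    for name, action in steps:
        try:
            action()
            log.append((name, "ok"))
        except Exception as exc:  # a failed step blocks the steps that depend on it
            log.append((name, f"failed: {exc}"))
            break
    return log

def load_dimensions():
    pass  # placeholder for a real load

def load_facts():
    raise RuntimeError("source unavailable")  # simulated failure

def refresh_summaries():
    pass  # never reached in this run

log = run_loads([
    ("dimensions", load_dimensions),
    ("facts", load_facts),
    ("summaries", refresh_summaries),
])
```

A fuller workflow would start an alternative process on failure instead of simply stopping.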
Evolution of the Data Warehouse
As the data warehouse is a living IT system, sources and targets might change. Those
changes must be maintained and tracked through the lifespan of the system without
overwriting or deleting the old ETL process flow information. To build and keep a level of
trust about the information in the warehouse, the process flow of each individual record in the
warehouse can be reconstructed at any point in time in the future in an ideal case.
1.6 Data Warehouse Metadata
Metadata is simply defined as data about data: data that are used to represent other data are
known as metadata. For example, the index of a book serves as metadata for the contents of
the book. In other words, metadata is the summarized data that leads us to the detailed data.
In terms of a data warehouse, we can define metadata as follows.
Metadata is a road map to data warehouse.
Metadata in data warehouse define the warehouse objects.
The metadata act as a directory. This directory helps the decision support system to
locate the contents of data warehouse.
Categories of Metadata
The metadata can be broadly categorized into three categories:
Business Metadata - This metadata has the data ownership information, business
definition and changing policies.
Technical Metadata - Technical metadata includes database system names, table and
column names and sizes, data types and allowed values. Technical metadata also
includes structural information such as primary and foreign key attributes and indices.
Operational Metadata - This metadata includes currency of data and data
lineage. Currency of data means whether data is active, archived or purged. Lineage of
data means history of data migrated and transformation applied on it.
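One way to picture the three categories is as three small records attached to a single warehouse column. The classes, field names, and sample values below are illustrative assumptions only.

```python
from dataclasses import dataclass

@dataclass
class BusinessMetadata:
    owner: str        # data ownership information
    definition: str   # business definition

@dataclass
class TechnicalMetadata:
    table: str
    column: str
    data_type: str
    is_primary_key: bool = False

@dataclass
class OperationalMetadata:
    currency: str     # "active", "archived", or "purged"
    lineage: list     # ordered history of migrations and transformations

@dataclass
class ColumnMetadata:
    business: BusinessMetadata
    technical: TechnicalMetadata
    operational: OperationalMetadata

# A made-up entry describing one column of a made-up SALES table.
sales_amount = ColumnMetadata(
    BusinessMetadata(owner="Finance", definition="Invoice total in dollars"),
    TechnicalMetadata(table="SALES", column="AMOUNT", data_type="DECIMAL(12,2)"),
    OperationalMetadata(
        currency="active",
        lineage=["extracted from billing", "converted to dollars"],
    ),
)
```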
Role of Metadata
Metadata has a very important role in a data warehouse. Its role differs from that of the
warehouse data itself, yet it is just as important. The various roles of metadata are
explained below.
The metadata act as a directory.
This directory helps the decision support system to locate the contents of data
warehouse.
Metadata helps in decision support system for mapping of data when data are
transformed from operational environment to data warehouse environment.
Metadata helps in summarization between current detailed data and highly
summarized data.
Metadata also helps in summarization between lightly summarized data and highly
summarized data.
Metadata are also used for query tools.
Metadata are used in reporting tools.
Metadata are used in extraction and cleansing tools.
Metadata are used in transformation tools.
Metadata also play an important role in loading functions.
Diagram to understand role of Metadata.
Metadata Repository
The metadata repository is an integral part of a data warehouse system. The metadata
repository contains the following metadata:
Definition of data warehouse - This includes the description of structure of data
warehouse. The description is defined by schema, view, hierarchies, derived data
definitions, and data mart locations and contents.
Business Metadata - This metadata has the data ownership information, business
definition and changing policies.
Operational Metadata - This metadata includes currency of data and data lineage.
Currency of data means whether data is active, archived or purged. Lineage of data
means history of data migrated and transformation applied on it.
Data for mapping from operational environment to data warehouse - This
metadata includes source databases and their contents, data extraction, data partition,
cleaning, transformation rules, data refresh and purging rules.
The algorithms for summarization - This includes dimension algorithms, data on
granularity, aggregation, summarizing etc.
Challenges for Metadata Management
The importance of metadata cannot be overstated. Metadata helps in driving the accuracy of
reports, validates data transformation and ensures the accuracy of calculations. The metadata
also enforces the consistent definition of business terms to business end users. Despite all
these uses, managing metadata presents challenges. Some of these challenges are discussed
below.
The metadata in a big organization is scattered across the organization. This metadata
is spread across spreadsheets, databases, and applications.
The metadata could be present in text files or multimedia files. To use this data for
information management solutions, it needs to be correctly defined.
There are no industry wide accepted standards. The data management solution
vendors have narrow focus.
There are no easy and accepted methods of passing metadata.
1.7 Summary
In this unit we have presented the basics of data mining and data warehousing. The
following concepts have been presented in brief.
The amount of data collected by organizations grows by leaps and bounds. The
amount of data is increasing year after year and there may be pay offs in uncovering
hidden information behind these data. Data mining is a way to gain market
intelligence from this huge amount of data. There are four categories of data mining
techniques/tools: Prediction, Classification, Clustering Analysis, and Association
Rules Discovery.
A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data organized in support of management decision making. Several
factors distinguish data warehouses from operational databases. Because the two
systems provide quite different functionalities and require different kinds of data, it is
necessary to maintain data warehouses separately from operational databases. Data
warehouse metadata are data defining the warehouse objects.
An operational data store (ODS) is a type of database that's often used as an interim
logical area for a data warehouse.
The process of extracting data from source systems and bringing it into the data
warehouse is commonly called ETL, which stands for extraction, transformation, and
loading.
Metadata is simply defined as data about data. Metadata has a very important role in a
data warehouse. A metadata repository provides details regarding the warehouse
structure, data history, the algorithms used for summarization, mappings from the
source data to warehouse form, system performance, and business terms and issues.
1.8 Keywords
Data mining, Prediction, Classification, Clustering Analysis, Operational data store (ODS),
Extraction Transformation Loading, Data Warehouses, Metadata
1.9 Exercises
a) What is data mining?
b) What is data warehousing?
c) What are data mining techniques? How is it used?
d) Explain issues in data mining and data warehousing?
e) Define Data warehouse?
f) What are the Difficulties in Implementing Data Warehouses?
g) Explain process of building Data Warehouses?
h) Briefly explain Operational Data Stores (ODS)?
i) Briefly explain Extraction Transformation Loading (ETL)?
j) What is Data Warehouse Metadata? What are its Categories?
k) Explain role of Metadata?
l) Write a note on challenges for Metadata Management?
1.10 References
1. Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber,
Morgan Kaufmann Publisher, Second Edition, 2006.
2. Research and Trends in Data Mining Technologies and Applications, edited by David
Taniar, Idea Group Publications.
3. Data Mining Techniques by Arun K Pujari, University Press, Second Edition, 2009.
Unit-2: Data Warehouse and OLAP Technology: An Overview
Structure
2.1 Objectives
2.2 Introduction
2.3 Data Warehouse and OLAP Technology
2.4 A Multidimensional Data Model
2.5 Data Warehouse Architecture
2.6 Data Warehouse Implementation
2.7 Data Warehousing to Data Mining
2.8 Summary
2.9 Keywords
2.10 Exercises
2.11 References
2.1 Objectives
The objectives covered under this unit include:
The introduction to Data Warehouse
OLAP Technology
A Multidimensional Data Model
Data Warehouse Architecture
Data Warehouse Implementation
Data Warehousing to Data Mining.
2.2 Introduction
What is a Data Warehouse?
Data warehouses generalize and consolidate data in multidimensional space. The
construction of data warehouses involves data cleaning, data integration and data
transformation and can be viewed as an important preprocessing step for data mining.
Moreover, data warehouses provide on-line analytical processing (OLAP) tools for the
interactive analysis of multidimensional data of varied granularities, which facilitates
effective data generalization and data mining. Many other data mining functions, such as
association, classification, prediction, and clustering, can be integrated with OLAP operations
to enhance interactive mining of knowledge at multiple levels of abstraction. Hence, the data
warehouse has become an increasingly important platform for data analysis and on-line
analytical processing and will provide an effective platform for data mining. Therefore, data
warehousing and OLAP form an essential step in the knowledge discovery process.
Data warehousing provides architectures and tools for business executives to systematically
organize, understand, and use their data to make strategic decisions. Data warehouses have
been defined in many ways, making it difficult to formulate a rigorous definition. Loosely
speaking, a data warehouse refers to a database that is maintained separately from an
organization's operational databases. Data warehouse systems allow for the integration of a
variety of application systems. They support information processing by providing a solid
platform of consolidated historical data for analysis.
According to William H. Inmon, a leading architect in the construction of data warehouse
systems, "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management's decision making process." This short, but
comprehensive definition presents the major features of a data warehouse. The four
keywords, subject-oriented, integrated, time-variant, and nonvolatile, distinguish data
warehouses from other data repository systems, such as relational database systems,
transaction processing systems, and file systems. Let's take a closer look at each of these key
features.
Subject-oriented: A data warehouse is organized around major subjects, such as
customer, supplier, product, and sales. Rather than concentrating on the day-to-day
operations and transaction processing of an organization, a data warehouse focuses on
the modeling and analysis of data for decision makers. Hence, data warehouses
typically provide a simple and concise view around particular subject issues by
excluding data that are not useful in the decision support process.
Integrated: A data warehouse is usually constructed by integrating multiple
heterogeneous sources, such as relational databases, flat files, and on-line transaction
records. Data cleaning and data integration techniques are applied to ensure
consistency in naming conventions, encoding structures, attribute measures, and so
on.
Time-variant: Data are stored to provide information from a historical perspective
(e.g., the past 5–10 years). Every key structure in the data warehouse contains, either
implicitly or explicitly, an element of time.
Nonvolatile: A data warehouse is always a physically separate store of data
transformed from the application data found in the operational environment. Due to
this separation, a data warehouse does not require transaction processing, recovery,
and concurrency control mechanisms. It usually requires only two operations in data
accessing: initial loading of data and access of data.
Based on this information, we view data warehousing as the process of constructing and
using data warehouses.
2.3 Data Warehouse and OLAP Technology
The construction of a data warehouse requires data cleaning, data integration, and data
consolidation. The utilization of a data warehouse often necessitates a collection of decision
support technologies. This allows "knowledge workers" (e.g., managers, analysts, and
executives) to use the warehouse to quickly and conveniently obtain an overview of the data,
and to make sound decisions based on information in the warehouse. Some authors use the
term "data warehousing" to refer only to the process of data warehouse construction, while
the term "warehouse DBMS" is used to refer to the management and utilization of data
warehouses.
Data warehousing is also very useful from the point of view of heterogeneous database
integration. Many organizations typically collect diverse kinds of data and maintain large
databases from multiple, heterogeneous, autonomous, and distributed information sources.
The traditional database approach to heterogeneous database integration is to build wrappers
and integrators (or mediators), on top of multiple, heterogeneous databases. When a query is
posed to a client site, a metadata dictionary is used to translate the query into queries
appropriate for the individual heterogeneous sites involved. These queries are then mapped
and sent to local query processors. The results returned from the different sites are integrated
into a global answer set. This query-driven approach requires complex information filtering
and integration processes, and competes for resources with processing at local sources. It is
inefficient and potentially expensive for frequent queries, especially for queries requiring
aggregations.
Data warehousing employs an update-driven approach in which information from multiple,
heterogeneous sources is integrated in advance and stored in a warehouse for direct querying
and analysis. Unlike on-line transaction processing databases, data warehouses do not contain
the most current information. However, a data warehouse brings high performance to the
integrated heterogeneous database system because data are copied, preprocessed, integrated,
annotated, summarized, and restructured into one semantic data store. Furthermore, query
processing in data warehouses does not interfere with the processing at local sources.
Moreover, data warehouses can store and integrate historical information and support complex
multidimensional queries. As a result, data warehousing has become popular in industry.
Differences between Operational Database Systems and Data Warehouses
Because most people are familiar with commercial relational database systems, it is easy to
understand what a data warehouse is by comparing these two kinds of systems. The major
task of on-line operational database systems is to perform on-line transaction and query
processing. These systems are called on-line transaction processing (OLTP) systems. They
cover most of the day-to-day operations of an organization, such as purchasing, inventory,
manufacturing, banking, payroll, registration, and accounting. Data warehouse systems, on
the other hand, serve users or knowledge workers in the role of data analysis and decision
making. Such systems can organize and present data in various formats in order to
accommodate the diverse needs of the different users. These systems are known as on-line
analytical processing (OLAP) systems.
The major distinguishing features between OLTP and OLAP are summarized as follows:
Users and system orientation: An OLTP system is customer-oriented and is used for
transaction and query processing by clerks, clients, and information technology
professionals. An OLAP system is market-oriented and is used for data analysis by
knowledge workers, including managers, executives, and analysts.
Data contents: An OLTP system manages current data that, typically, are too detailed
to be easily used for decision making. An OLAP system manages large amounts of
historical data, provides facilities for summarization and aggregation, and stores and
manages information at different levels of granularity. These features make the data
easier to use in informed decision making.
Database design: An OLTP system usually adopts an entity-relationship (ER) data
model and an application-oriented database design. An OLAP system typically adopts
either a star or snowflake model (discussed in Section 2.4) and a
subject-oriented database design.
View: An OLTP system focuses mainly on the current data within an enterprise or
department, without referring to historical data or data in different organizations. In
contrast, an OLAP system often spans multiple versions of a database schema, due to
the evolutionary process of an organization. OLAP systems also deal with information
that originates from different organizations, integrating information from many data
stores.
Access patterns: The access patterns of an OLTP system consist mainly of short,
atomic transactions. Such a system requires concurrency control and recovery
mechanisms.
However, accesses to OLAP systems are mostly read-only operations.
Table 2.1 Comparison between OLTP and OLAP systems.
But, Why Have a Separate Data Warehouse?
Because operational databases store huge amounts of data, you may wonder, "why not
perform on-line analytical processing directly on such databases instead of spending
additional time and resources to construct a separate data warehouse?" A major reason for
such a separation is to help promote the high performance of both systems. An operational
database is designed and tuned from known tasks and workloads, such as indexing and
hashing using primary keys, searching for particular records, and optimizing "canned"
queries. On the other hand, data warehouse queries are often complex. They involve the
computation of large groups of data at summarized levels, and may require the use of special
data organization, access, and implementation methods based on multidimensional views.
Processing OLAP queries in operational databases would substantially degrade the
performance of operational tasks.
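The contrast between the two workloads can be seen in a small sketch. It uses an in-memory SQLite database with a made-up `orders` table; the table, columns, and values are assumptions chosen only to show the shape of each query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, region TEXT, quarter TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "East", "Q1", 100.0), (2, "East", "Q2", 150.0), (3, "West", "Q1", 200.0)],
)

# OLTP-style access: fetch one record through its primary key.
status = conn.execute(
    "SELECT amount FROM orders WHERE order_id = ?", (2,)
).fetchone()

# OLAP-style access: scan and summarize many records by a dimension.
by_region = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
conn.close()
```

The first query touches a single row through an index, while the second must read every row, which is why running many such summaries against an operational database slows its day-to-day transactions.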
2.4 A Multidimensional Data Model
Data warehouses and OLAP tools are based on a multidimensional data model. This model
views data in the form of a data cube.
From Tables and Spreadsheets to Data Cubes
"What is a data cube?" A data cube allows data to be modeled and viewed in multiple dimensions. It is defined
by dimensions and facts. In general terms, dimensions are the perspectives or entities with
respect to which an organization wants to keep records. For example, AllElectronics may
create a sales data warehouse in order to keep records of the store's sales with respect to the
dimensions time, item, branch, and location. These dimensions allow the store to keep track
of things like monthly sales of items and the branches and locations at which the items were
sold. Each dimension may have a table associated with it, called a
dimension table, which further describes the dimension. For example, a dimension table for
item may contain the attributes item name, brand, and type. Dimension tables can be
specified by users or experts, or automatically generated and adjusted based on data
distributions
A multidimensional data model is typically organized around a central theme, like sales, for
instance. This theme is represented by a fact table. Facts are numerical measures. Think of
them as the quantities by which we want to analyze relationships between dimensions.
Examples of facts for a sales data warehouse include dollars sold (sales amount in dollars),
units sold (number of units sold), and amount budgeted. The fact table contains the names of
the facts, or measures, as well as keys to each of the related dimension tables. You will soon
get a clearer picture of how this works when we look at multidimensional schemas.
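Before turning to schemas, the relationship between a fact table and its dimension tables can be sketched with plain dictionaries. All keys, attribute values, and the `describe` helper are made up for illustration.

```python
# Dimension tables: key -> descriptive attributes (all values are made up).
item_dim = {1: {"item_name": "Laptop X", "brand": "Acme", "type": "computer"}}
time_dim = {10: {"quarter": "Q1", "year": 2014}}
location_dim = {100: {"city": "Vancouver", "country": "Canada"}}

# Fact table: keys into each dimension table plus the numeric measures.
sales_fact = [
    {"item_key": 1, "time_key": 10, "location_key": 100,
     "dollars_sold": 1200.0, "units_sold": 2},
]

def describe(fact):
    """Join one fact row to its dimension rows."""
    return {
        "item": item_dim[fact["item_key"]]["item_name"],
        "quarter": time_dim[fact["time_key"]]["quarter"],
        "city": location_dim[fact["location_key"]]["city"],
        "dollars_sold": fact["dollars_sold"],
    }

first = describe(sales_fact[0])
```

The fact row itself stores only keys and measures; the descriptive detail lives in the dimension tables.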
Table 2.2 A 2-D view of sales data for AllElectronics according to the dimensions time and item,
where the sales are from branches located in the city of Vancouver. The measure displayed is dollars
sold (in thousands).
Although we usually think of cubes as 3-D geometric structures, in data warehousing the data
cube is n-dimensional. To gain a better understanding of data cubes and the multidimensional
data model, let's start by looking at a simple 2-D data cube that is, in fact, a table or
spreadsheet for sales data from AllElectronics. In particular, we will look at the
AllElectronics sales data for items sold per quarter in the city of Vancouver. These data are
shown in Table 2.2. In this 2-D representation, the sales for Vancouver are shown with
respect to the time dimension (organized in quarters) and the item dimension (organized
according to the types of items sold). The fact or measure displayed is dollars sold (in
thousands).
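A 2-D view of this kind can be computed from individual sales records by fixing the city and summing over time and item. The figures below are invented for illustration and are not the values of Table 2.2; the `two_d_view` helper is likewise hypothetical.

```python
from collections import defaultdict

# Made-up fact rows: (quarter, item type, city, dollars sold in thousands).
facts = [
    ("Q1", "computer", "Vancouver", 10),
    ("Q1", "phone", "Vancouver", 4),
    ("Q2", "computer", "Vancouver", 12),
    ("Q1", "computer", "Toronto", 7),
    ("Q1", "computer", "Vancouver", 5),
]

def two_d_view(rows, city):
    """Sum dollars sold per (quarter, item type) for one city."""
    table = defaultdict(lambda: defaultdict(int))
    for quarter, item, row_city, dollars in rows:
        if row_city == city:
            table[quarter][item] += dollars
    return {quarter: dict(items) for quarter, items in table.items()}

vancouver = two_d_view(facts, "Vancouver")
```

Each outer key is a row of the table (a quarter) and each inner key is a column (an item type).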
Now, suppose that we would like to view the sales data with a third dimension. For instance,
suppose we would like to view the data according to time and item, as well as location for the
cities Chicago, New York, Toronto, and Vancouver. These 3-D data are shown in Table 2.3.
The 3-D data of Table 2.3 are represented as a series of 2-D tables. Conceptually, we may
also represent the same data in the form of a 3-D data cube, as in Figure 2.1.
A data cube such as this is often referred to as a cuboid. Given a set of dimensions, we can generate a cuboid for each of the possible
subsets of the given dimensions. The result would form a lattice of cuboids, each showing the
data at a different level of summarization, or group by. The lattice of cuboids is then referred
to as a data cube. Figure 2.3 shows a lattice of cuboids forming a data cube for the
dimensions time, item, location, and supplier.
The cuboid that holds the lowest level of summarization is called the base cuboid. For
example, the 4-D cuboid in Figure 2.2 is the base cuboid for the given time, item, location,
and supplier dimensions. Figure 2.1 is a 3-D (non base) cuboid for time, item, and location,
summarized for all suppliers. The 0-D cuboid, which holds the highest level of
summarization, is called the apex cuboid. In our example, this is the total sales, or dollars
sold, summarized over all four dimensions. The apex cuboid is typically denoted by all.
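The lattice described above can be enumerated mechanically. The following is a minimal Python sketch, assuming each dimension has a single level (no concept hierarchies); the function name cuboid_lattice is illustrative and not part of any OLAP product.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Group every subset of the dimensions (one cuboid each) by its size."""
    lattice = {}
    for k in range(len(dimensions) + 1):
        lattice[k] = [tuple(c) for c in combinations(dimensions, k)]
    return lattice

dims = ["time", "item", "location", "supplier"]
lattice = cuboid_lattice(dims)

apex = lattice[0][0]             # the 0-D apex cuboid, written () here and "all" in the text
base = lattice[len(dims)][0]     # the 4-D base cuboid
total = sum(len(level) for level in lattice.values())  # 2**4 cuboids in all
```

With four dimensions the lattice holds 2^4 = 16 cuboids: one apex cuboid, four 1-D cuboids, six 2-D cuboids, four 3-D cuboids, and one base cuboid.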
Table 2.3 A 3-D view of sales data for AllElectronics, according to the dimensions time, item, and
location. The measure displayed is dollars sold (in thousands).
Figure 2.1 A 3-D data cube representation of the data in Table 2.3, according to the dimensions time,
item, and location. The measure displayed is dollars sold (in thousands).
Suppose that we would now like to view our sales data with an additional fourth dimension,
such as supplier. Viewing things in 4-D becomes tricky. However, we can think of a 4-D
cube as being a series of 3-D cubes, as shown in Figure 2.2. If we continue in this way, we
may display any n-D data as a series of (n-1)-D "cubes."
Figure 2.2 A 4-D data cube representation of sales data, according to the dimensions time, item,
location, and supplier. The measure displayed is dollars sold (in thousands).
For improved readability, only some of the cube values are shown.
Figure 2.3 Lattice of cuboids, making up a 4-D data cube for the dimensions time, item, location, and
supplier. Each cuboid represents a different degree of summarization.
Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Databases
The entity-relationship data model is commonly used in the design of relational databases,
where a database schema consists of a set of entities and the relationships between them.
Such a data model is appropriate for on-line transaction processing. A data warehouse,
however, requires a concise, subject-oriented schema that facilitates on-line data analysis.
The most popular data model for a data warehouse is a multidimensional model. Such a
model can exist in the form of a star schema, a snowflake schema, or a fact constellation
schema. Let's look at each of these schema types.
Star schema: The most common modeling paradigm is the star schema, in which the data
warehouse contains (1) a large central table (fact table) containing the bulk of the data, with
no redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each
dimension. The schema graph resembles a starburst, with the dimension tables displayed in a
radial pattern around the central fact table.
Example 2.1 Star schema: A star schema for AllElectronics sales is shown in Figure 2.4.
Sales are considered along four dimensions, namely, time, item, branch, and location. The
schema contains a central fact table for sales that contains keys to each of the four
dimensions, along with two measures: dollars sold and units sold. To minimize the size of the
fact table, dimension identifiers (such as time key and item key) are system-generated
identifiers.
Notice that in the star schema, each dimension is represented by only one table, and each
table contains a set of attributes. For example, the location dimension table contains the
attribute set {location key, street, city, province or state, country}. This constraint may
introduce some redundancy. For example, "Vancouver" and "Victoria" are both cities in the
Canadian province of British Columbia. Entries for such cities in the location dimension table
will create redundancy among the attributes province or state and country, that is, (...,
Vancouver, British Columbia, Canada) and (..., Victoria, British Columbia, Canada). Moreover,
the attributes within a dimension table may form either a hierarchy (total order) or a lattice
(partial order).
Figure 2.4 Star schema of a data warehouse for sales.
Snowflake schema: The snowflake schema is a variant of the star schema model, where
some dimension tables are normalized, thereby further splitting the data into additional tables.
The resulting schema graph forms a shape similar to a snowflake.
The major difference between the snowflake and star schema models is that the dimension
tables of the snowflake model may be kept in normalized form to reduce redundancies. Such
a table is easy to maintain and saves storage space. However, this saving of space is
negligible in comparison to the typical magnitude of the fact table. Furthermore, the
snowflake structure can reduce the effectiveness of browsing, since more joins will be needed
to execute a query. Consequently, the system performance may be adversely impacted.
Hence, although the snowflake schema reduces redundancy, it is not as popular as the star
schema in data warehouse design.
Example 2.2 Snowflake schema: A snowflake schema for AllElectronics sales is given in
Figure 2.5. Here, the sales fact table is identical to that of the star schema in Figure 2.4. The
main difference between the two schemas is in the definition of dimension tables. The single
dimension table for item in the star schema is normalized in the snowflake schema, resulting
in new item and supplier tables. For example, the item dimension table now contains the
attributes item key, item name, brand, type, and supplier key, where supplier key is linked to
the supplier dimension table, containing supplier key and supplier type information.
Similarly, the single dimension table for location in the star schema can be normalized into
two new tables: location and city. The city key in the new location table links to the city
dimension. Notice that further normalization can be performed on province or state and
country in the snowflake schema shown in Figure 2.5, when desirable.
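To make the trade-off concrete, the sketch below normalizes a star-schema location table into the location and city tables described above. It is a minimal Python illustration: the street values and the helper names snowflake_location and lookup_country are hypothetical, and real warehouses would do this in the database rather than in application code.

```python
# Star-schema location dimension: province and country repeat for every street in a city.
star_location = [
    {"location_key": 1, "street": "1 Main St", "city": "Vancouver",
     "province_or_state": "British Columbia", "country": "Canada"},
    {"location_key": 2, "street": "9 Oak Ave", "city": "Vancouver",
     "province_or_state": "British Columbia", "country": "Canada"},
    {"location_key": 3, "street": "5 Bay Rd", "city": "Victoria",
     "province_or_state": "British Columbia", "country": "Canada"},
]

def snowflake_location(rows):
    """Split the location dimension into a location table and a city table."""
    city_keys, city_table, location_table = {}, [], []
    for row in rows:
        identity = (row["city"], row["province_or_state"], row["country"])
        if identity not in city_keys:
            city_keys[identity] = len(city_keys) + 1
            city_table.append({"city_key": city_keys[identity], "city": row["city"],
                               "province_or_state": row["province_or_state"],
                               "country": row["country"]})
        location_table.append({"location_key": row["location_key"],
                               "street": row["street"],
                               "city_key": city_keys[identity]})
    return location_table, city_table

def lookup_country(location_key, location_table, city_table):
    """Answering a country question now needs an extra join through city_key."""
    loc = next(r for r in location_table if r["location_key"] == location_key)
    city = next(c for c in city_table if c["city_key"] == loc["city_key"])
    return city["country"]

location_table, city_table = snowflake_location(star_location)
```

The city table stores each (city, province or state, country) combination once, which removes the redundancy, but every query that needs the country must now follow city key from location to city.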
Figure 2.5 Snowflake schema of a data warehouse for sales.
Fact constellation: Sophisticated applications may require multiple fact tables to share
dimension tables. This kind of schema can be viewed as a collection of stars, and hence is
called a galaxy schema or a fact constellation.
Example 2.3 Fact constellation: A fact constellation schema is shown in Figure 2.6. This
schema specifies two fact tables, sales and shipping. The sales table definition is identical to
that of the star schema (Figure 2.4). The shipping table has five dimensions, or keys: item
key, time key, shipper key, from location, and to location, and two measures: dollars cost and
units shipped. A fact constellation schema allows dimension tables to be shared between fact
tables. For example, the dimensions tables for time, item, and location are shared between
both the sales and shipping fact tables.
In data warehousing, there is a distinction between a data warehouse and a data mart. A data
warehouse collects information about subjects that span the entire organization, such as
customers, items, sales, assets, and personnel, and thus its scope is enterprise-wide. For data
warehouses, the fact constellation schema is commonly used, since it can model multiple,
interrelated subjects. A data mart, on the other hand, is a departmental subset of the data
warehouse that focuses on selected subjects, and thus its scope is department-wide. For data
marts, the star or snowflake schemas are commonly used, since both are geared toward
modeling single subjects, although the star schema is more popular and efficient.
Figure 2.6 Fact constellation schema of a data warehouse for sales and shipping.
Measures: Their Categorization and Computation
"How are measures computed?" To answer this question, we first study how measures can
be categorized. Note that a multidimensional point in the data cube space can be defined by a
set of dimension-value pairs, for example, (time = "Q1", location = "Vancouver", item
= "computer"). A data cube measure is a numerical function that can be evaluated at each
point in the data cube space. A measure value is computed for a given point by aggregating
the data corresponding to the respective dimension-value pairs defining the given point. We
will look at concrete examples of this shortly.
Measures can be organized into three categories (i.e., distributive, algebraic, holistic), based
on the kind of aggregate functions used.
Distributive: An aggregate function is distributive if it can be computed in a distributed
manner as follows. Suppose the data are partitioned into n sets. We apply the function to each
partition, resulting in n aggregate values. If the result derived by applying the function to the
n aggregate values is the same as that derived by applying the function to the entire data set
(without partitioning), the function can be computed in a distributed manner. For example,
count() can be computed for a data cube by first partitioning the cube into a set of subcubes,
computing count() for each subcube, and then summing up the counts obtained for each
subcube. Hence, count() is a distributive aggregate function.
Algebraic: An aggregate function is algebraic if it can be computed by an algebraic function
with M arguments (where M is a bounded positive integer), each of which is obtained by
applying a distributive aggregate function. For example, avg() (average) can be computed by
sum()/count(), where both sum() and count() are distributive aggregate functions.
Holistic: An aggregate function is holistic if there is no constant bound on the storage size
needed to describe a subaggregate. That is, there does not exist an algebraic function with M
arguments (where M is a constant) that characterizes the computation. Common examples of
holistic functions include median(), mode(), and rank(). A measure is holistic if it is obtained
by applying a holistic aggregate function.
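The three categories can be checked on a small partitioned data set. The sketch below is a minimal Python illustration; the numbers and variable names are made up for the example.

```python
from statistics import median

# The data set, split into three partitions (illustrative values).
partitions = [[4, 8, 15], [16, 23], [42]]
all_values = [v for part in partitions for v in part]

# Distributive: combining per-partition counts (or sums) gives the global result.
total_count = sum(len(part) for part in partitions)
total_sum = sum(sum(part) for part in partitions)

# Algebraic: avg() is computed from two distributive results, sum() and count().
average = total_sum / total_count

# Holistic: per-partition medians are not enough to recover the global median.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)
```

Here the counts and sums combine exactly, the average follows from them, but the median of the partition medians differs from the median of the whole data set, which is why median() cannot be assembled from a bounded number of subaggregates.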
Example 2.7 Interpreting measures for data cubes. Many measures of a data cube can be
computed by relational aggregation operations. In Figure 2.4, we saw a star schema for
AllElectronics sales that contains two measures, namely, dollars sold and units sold. In
Example 2.4, the sales star data cube corresponding to the schema was defined using DMQL
commands.
"But how are these commands interpreted in order to generate the specified data cube?"
Suppose that the relational database schema of AllElectronics is the following:
time(time_key, day, day_of_week, month, quarter, year)
item(item_key, item_name, brand, type, supplier_type)
branch(branch_key, branch_name, branch_type)
location(location_key, street, city, province_or_state, country)
sales(time_key, item_key, branch_key, location_key, number_of_units_sold, price)
The DMQL specification of Example 2.4 is translated into the following SQL query, which
generates the required sales star cube. Here, the sum aggregate function is used to compute
both dollars sold and units sold:
select s.time_key, s.item_key, s.branch_key, s.location_key,
       sum(s.number_of_units_sold * s.price), sum(s.number_of_units_sold)
from time t, item i, branch b, location l, sales s
where s.time_key = t.time_key and s.item_key = i.item_key
  and s.branch_key = b.branch_key and s.location_key = l.location_key
group by s.time_key, s.item_key, s.branch_key, s.location_key
The cube created in the above query is the base cuboid of the sales star data cube. It contains
all of the dimensions specified in the data cube definition, where the granularity of each
dimension is at the join key level. A join key is a key that links a fact table and a dimension
table. The fact table associated with a base cuboid is sometimes referred to as the base fact
table.
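The base cuboid query can be tried out on a toy database. The following Python sketch uses the standard sqlite3 module with an in-memory database; the dimension tables are cut down to one descriptive column each, and all rows are invented for illustration. The column aliases dollars_sold and units_sold and the ORDER BY clause are additions made here for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time     (time_key INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE sales    (time_key INTEGER, item_key INTEGER, branch_key INTEGER,
                       location_key INTEGER, number_of_units_sold INTEGER, price REAL);

INSERT INTO time     VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO item     VALUES (1, 'computer');
INSERT INTO branch   VALUES (1, 'B1');
INSERT INTO location VALUES (1, 'Vancouver');
INSERT INTO sales    VALUES (1, 1, 1, 1, 2, 100.0),
                            (1, 1, 1, 1, 3, 100.0),
                            (2, 1, 1, 1, 1, 100.0);
""")

# One output row per distinct combination of join keys: the base cuboid.
BASE_CUBOID_SQL = """
SELECT s.time_key, s.item_key, s.branch_key, s.location_key,
       SUM(s.number_of_units_sold * s.price) AS dollars_sold,
       SUM(s.number_of_units_sold)           AS units_sold
FROM time t, item i, branch b, location l, sales s
WHERE s.time_key = t.time_key AND s.item_key = i.item_key
  AND s.branch_key = b.branch_key AND s.location_key = l.location_key
GROUP BY s.time_key, s.item_key, s.branch_key, s.location_key
ORDER BY s.time_key
"""
base_cuboid = conn.execute(BASE_CUBOID_SQL).fetchall()
```

The two Q1 sales rows collapse into a single cell of the cuboid, while the Q2 row forms its own cell, because grouping is done at the join key level.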
Most of the current data cube technology confines the measures of multidimensional
databases to numerical data. However, measures can also be applied to other kinds of data,
such as spatial, multimedia, or text data.
Concept Hierarchies
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to
higher-level, more general concepts. Consider a concept hierarchy for the dimension location.
City values for location include Vancouver, Toronto, New York, and Chicago. Each city,
however, can be mapped to the province or state to which it belongs. For example,
Vancouver can be mapped to British Columbia, and Chicago to Illinois. The provinces and
states can in turn be mapped to the country to which they belong, such as Canada or the USA.
These mappings form a concept hierarchy for the dimension location, mapping a set of
low-level concepts (i.e., cities) to higher-level, more general concepts (i.e., countries). The
concept hierarchy described above is illustrated in Figure 2.7.
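The sequence of mappings in the location hierarchy can be written down directly as lookup tables. The sketch below is a minimal Python illustration covering only the four cities named in the text; the function name generalize and the level names are chosen for this example.

```python
# Each table climbs one level: city < province_or_state < country.
CITY_TO_PROVINCE_OR_STATE = {
    "Vancouver": "British Columbia",
    "Toronto": "Ontario",
    "New York": "New York",
    "Chicago": "Illinois",
}
PROVINCE_OR_STATE_TO_COUNTRY = {
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "New York": "USA",
    "Illinois": "USA",
}
LEVELS = ["city", "province_or_state", "country"]

def generalize(city, level):
    """Map a city to its ancestor at the requested level of the hierarchy."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    value = city
    if LEVELS.index(level) >= 1:
        value = CITY_TO_PROVINCE_OR_STATE[value]
    if LEVELS.index(level) >= 2:
        value = PROVINCE_OR_STATE_TO_COUNTRY[value]
    return value
```

For example, generalize("Vancouver", "province_or_state") follows one mapping, while generalize("Chicago", "country") follows two.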
Many concept hierarchies are implicit within the database schema. For example, suppose that
the dimension location is described by the attributes number, street, city, province or state,
zipcode, and country. These attributes are related by a total order, forming a concept
hierarchy such as "street < city < province or state < country". This hierarchy is shown in
Figure 2.8(a). Alternatively, the attributes of a dimension may be organized in a partial order,
forming a lattice. An example of a partial order for the time dimension based on the attributes
day, week, month, quarter, and year is "day < {month < quarter; week} < year". This lattice
structure is shown in Figure 2.8(b). A concept hierarchy that is a total or partial order among
attributes in a database schema is called a schema hierarchy.
Figure 2.7 A concept hierarchy for the dimension location. Due to space limitations, not all
of the nodes of the hierarchy are shown (as indicated by the use of "ellipsis" between nodes).
Figure 2.8 Hierarchical and lattice structures of attributes in warehouse dimensions:
(a) a hierarchy for location; (b) a lattice for time.
OLAP Operations in the Multidimensional Data Model
In the multidimensional model, data are organized into multiple dimensions, and each
dimension contains multiple levels of abstraction defined by concept hierarchies. This
organization provides users with the flexibility to view data from different perspectives. A
number of OLAP data cube operations exist to materialize these different views, allowing
interactive querying and analysis of the data at hand. Hence, OLAP provides a user-friendly
environment for interactive data analysis.
Roll-up: The roll-up operation (also called the drill-up operation by some vendors) performs
aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by
dimension reduction. Figure 2.10 shows the result of a roll-up operation performed on the
central cube by climbing up the concept hierarchy for location. This hierarchy was defined as
the total order "street < city < province or state < country." The roll-up operation shown
aggregates the data by ascending the location hierarchy from the level of city to the level of
country. In other words, rather than grouping the data by city, the resulting cube groups the
data by country.
When roll-up is performed by dimension reduction, one or more dimensions are removed
from the given cube. For example, consider a sales data cube containing only the two
dimensions location and time. Roll-up may be performed by removing, say, the time
dimension, resulting in an aggregation of the total sales by location, rather than by location
and by time.
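Both forms of roll-up amount to re-grouping and summing cells. The following Python sketch models a small two-dimensional cube as a dictionary keyed by (city, quarter); the sales figures and function names are invented for illustration.

```python
from collections import defaultdict

CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada",
                   "Chicago": "USA", "New York": "USA"}

# (city, quarter) -> dollars sold in thousands; figures are made up.
sales = {
    ("Vancouver", "Q1"): 10, ("Vancouver", "Q2"): 20,
    ("Toronto", "Q1"): 30, ("Chicago", "Q1"): 40, ("New York", "Q2"): 50,
}

def roll_up_location(cube):
    """Roll-up by climbing the location hierarchy from city to country."""
    result = defaultdict(int)
    for (city, quarter), dollars in cube.items():
        result[(CITY_TO_COUNTRY[city], quarter)] += dollars
    return dict(result)

def roll_up_drop_time(cube):
    """Roll-up by dimension reduction: remove time, keep location."""
    result = defaultdict(int)
    for (city, _quarter), dollars in cube.items():
        result[city] += dollars
    return dict(result)

by_country = roll_up_location(sales)
by_city = roll_up_drop_time(sales)
```

In both cases the grand total is unchanged; only the grouping of the cells becomes coarser.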
Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed
data. Drill-down can be realized by either stepping down a concept hierarchy for a dimension or
introducing additional dimensions. Figure 2.10 shows the result of a drill-down operation performed
on the central cube by stepping down a concept hierarchy for time defined as "day < month < quarter
< year." Drill-down occurs by descending the time hierarchy from the level of quarter to the more
detailed level of month. The resulting data cube details the total sales per month rather than
summarizing them by quarter.
Because a drill-down adds more detail to the given data, it can also be performed by adding
new dimensions to a cube. For example, a drill-down on the central cube of Figure 2.10 can
occur by introducing an additional dimension, such as customer group.
Slice and dice: The slice operation performs a selection on one dimension of the given cube,
resulting in a subcube. Figure 2.10 shows a slice operation where the sales data are selected
from the central cube for the dimension time using the criterion time = "Q1". The dice
operation defines a subcube by performing a selection on two or more dimensions. Figure
2.10 shows a dice operation on the central cube based on the following selection criteria that
involve three dimensions: (location = "Toronto" or "Vancouver") and (time = "Q1" or
"Q2") and (item = "home entertainment" or "computer").
Figure 2.10 Examples of typical OLAP operations on multidimensional data
Pivot (rotate): Pivot (also called rotate) is a visualization operation that rotates the data axes
in view in order to provide an alternative presentation of the data. Figure 2.10 shows a pivot
operation where the item and location axes in a 2-D slice are rotated. Other examples include
rotating the axes in a 3-D cube, or transforming a 3-D cube into a series of 2-D planes.
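Slice, dice, and pivot can be sketched on a cube stored as a dictionary keyed by (location, time, item). This is a minimal Python illustration with invented cell values; the function names are chosen for this example, and the slice here keeps the selected dimension in the key rather than dropping it.

```python
LOCATION, TIME, ITEM = 0, 1, 2  # positions of the dimensions in each key

# (location, time, item) -> dollars sold; values are made up.
cube = {
    ("Toronto", "Q1", "computer"): 1,
    ("Toronto", "Q3", "phone"): 2,
    ("Vancouver", "Q2", "home entertainment"): 3,
    ("Chicago", "Q1", "computer"): 4,
}

def slice_cube(cells, axis, value):
    """Select on a single dimension, e.g. time = 'Q1'."""
    return {key: v for key, v in cells.items() if key[axis] == value}

def dice_cube(cells, criteria):
    """Select on two or more dimensions; criteria maps axis -> allowed values."""
    return {key: v for key, v in cells.items()
            if all(key[axis] in allowed for axis, allowed in criteria.items())}

def pivot_cube(cells, order):
    """Reorder the axes of every cell key to present the data differently."""
    return {tuple(key[axis] for axis in order): v for key, v in cells.items()}

q1_slice = slice_cube(cube, TIME, "Q1")
diced = dice_cube(cube, {
    LOCATION: {"Toronto", "Vancouver"},
    TIME: {"Q1", "Q2"},
    ITEM: {"home entertainment", "computer"},
})
rotated = pivot_cube(cube, (ITEM, LOCATION, TIME))
```

The dice criteria mirror the three-dimension selection quoted above. Pivot changes no values; it only changes which dimension comes first in the presentation.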
Other OLAP operations: Some OLAP systems offer additional drilling operations. For
example, drill-across executes queries involving (i.e., across) more than one fact table. The
drill-through operation uses relational SQL facilities to drill through the bottom level of a
data cube down to its back-end relational tables. Other OLAP operations may include ranking
the top N or bottom N items in lists, as well as computing moving averages, growth
rates, interests, internal rates of return, depreciation, currency conversions, and statistical
functions.
OLAP offers analytical modeling capabilities, including a calculation engine for deriving
ratios, variance, and so on, and for computing measures across multiple dimensions. It can
generate summarizations, aggregations, and hierarchies at each granularity level and at every
dimension intersection. OLAP also supports functional models for forecasting, trend analysis,
and statistical analysis. In this context, an OLAP engine is a powerful data analysis tool.
OLAP Systems versus Statistical Databases
Many of the characteristics of OLAP systems, such as the use of a multidimensional data
model and concept hierarchies, the association of measures with dimensions, and the notions
of roll-up and drill-down, also exist in earlier work on statistical databases (SDBs). A
statistical database is a database system that is designed to support statistical applications.
Similarities between the two types of systems are rarely discussed, mainly due to differences
in terminology and application domains.
OLAP and SDB systems, however, have distinguishing differences. While SDBs tend to
focus on socioeconomic applications, OLAP has been targeted for business applications.
Privacy issues regarding concept hierarchies are a major concern for SDBs. For example,
given summarized socioeconomic data, it is controversial to allow users to view the
corresponding low-level data. Finally, unlike SDBs, OLAP systems are designed for handling
huge amounts of data efficiently.
A Starnet Query Model for Querying Multidimensional Databases
The querying of multidimensional databases can be based on a starnet model. A starnet
model consists of radial lines emanating from a central point, where each line represents a
concept hierarchy for a dimension. Each abstraction level in the hierarchy is called a
footprint. These represent the granularities available for use by OLAP operations such as
drill-down and roll-up.
Example 2.9 Starnet. A starnet query model for the AllElectronics data warehouse is shown
in Figure 2.11. This starnet consists of four radial lines, representing concept hierarchies
for the dimensions location, customer, item, and time, respectively. Each line consists of
footprints representing abstraction levels of the dimension. For example, the time line has
four footprints: "day," "month," "quarter," and "year." A concept hierarchy may involve a
single attribute (like date for the time hierarchy) or several attributes (e.g., the concept
hierarchy for location involves the attributes street, city, province or state, and country).
Figure 2.11 Modeling business queries: a starnet model.
2.5 Data Warehouse Architecture
2.3.1 Steps for the Design and Construction of Data Warehouses
This subsection presents a business analysis framework for data warehouse design. The basic
steps involved in the design process are also described.
The Design of a Data Warehouse: A Business Analysis Framework
First, having a data warehouse may provide a competitive advantage by presenting relevant
information from which to measure performance and make critical adjustments in order to
help win over competitors. Second, a data warehouse can enhance business productivity
because it is able to quickly and efficiently gather information that accurately describes the
organization. Third, a data warehouse facilitates customer relationship management because
it provides a consistent view of customers and items across all lines of business, all
departments, and all markets. Finally, a data warehouse may bring about cost reduction by
tracking trends, patterns, and exceptions over long periods in a consistent and reliable
manner.
Four different views regarding the design of a data warehouse must be considered: the
top-down view, the data source view, the data warehouse view, and the business query view.
The top-down view allows the selection of the relevant information necessary for the
data warehouse. This information matches the current and future business needs.
The data source view exposes the information being captured, stored, and managed by
operational systems. This information may be documented at various levels of detail
and accuracy, from individual data source tables to integrated data source tables. Data
sources are often modeled by traditional data modeling techniques, such as the
entity-relationship model or CASE (computer-aided software engineering) tools.
The data warehouse view includes fact tables and dimension tables. It represents the
information that is stored inside the data warehouse, including precalculated totals
and counts, as well as information regarding the source, date, and time of origin,
added to provide historical context.
Finally, the business query view is the perspective of data in the data warehouse from
the viewpoint of the end user.
Building and using a data warehouse is a complex task because it requires business skills,
technology skills, and program management skills. Regarding business skills, building a
data warehouse involves understanding how such systems store and manage their data,
how to build extractors that transfer data from the operational system to the data
warehouse, and how to build warehouse refresh software that keeps the data warehouse
reasonably up-to-date with the operational system's data. Regarding technology skills,
data analysts are required to understand how to make assessments from quantitative
information and derive facts based on conclusions from historical information in the data
warehouse. These skills include the ability to discover patterns and trends, to extrapolate
trends based on history and look for anomalies or paradigm shifts, and to present coherent
managerial recommendations based on such analysis. Finally, program management
skills involve the need to interface with many technologies, vendors, and end users in
order to deliver results in a timely and cost-effective manner.
The Process of Data Warehouse Design
A data warehouse can be built using a top-down approach, a bottom-up approach, or a
combination of both. The top-down approach starts with the overall design and planning. It is
useful in cases where the technology is mature and well known, and where the business
problems that must be solved are clear and well understood. The bottom-up approach starts
with experiments and prototypes. This is useful in the early stage of business modeling and
technology development. It allows an organization to move forward at considerably less
expense and to evaluate the benefits of the technology before making significant
commitments. In the combined approach, an organization can exploit the planned and
strategic nature of the top-down approach while retaining the rapid implementation and
opportunistic application of the bottom-up approach.
From the software engineering point of view, the design and construction of a data warehouse
may consist of the following steps: planning, requirements study, problem analysis,
warehouse design, data integration and testing, and finally deployment of the data
warehouse. In general, the warehouse design process consists of the following steps:
1. Choose a business process to model, for example, orders, invoices, shipments,
inventory, account administration, sales, or the general ledger. If the business process
is organizational and involves multiple complex object collections, a data warehouse
model should be followed. However, if the process is departmental and focuses on the
analysis of one kind of business process, a data mart model should be chosen.
2. Choose the grain of the business process. The grain is the fundamental, atomic level
of data to be represented in the fact table for this process, for example, individual
transactions, individual daily snapshots, and so on.
3. Choose the dimensions that will apply to each fact table record. Typical dimensions
are time, item, customer, supplier, warehouse, transaction type, and status.
4. Choose the measures that will populate each fact table record. Typical measures are
numeric additive quantities like dollars sold and units sold.
Because data warehouse construction is a difficult and long-term task, its implementation
scope should be clearly defined. The goals of an initial data warehouse implementation
should be specific, achievable, and measurable. This involves determining the time and
budget allocations, the subset of the organization that is to be modeled, the number of data
sources selected, and the number and types of departments to be served.
Once a data warehouse is designed and constructed, the initial deployment of the warehouse
includes initial installation, roll-out planning, training, and orientation. Platform upgrades and
maintenance must also be considered. Data warehouse administration includes data
refreshment, data source synchronization, planning for disaster recovery, managing access
control and security, managing data growth, managing database performance, and data
warehouse enhancement and extension. Scope management includes controlling the number
and range of queries, dimensions, and reports; limiting the size of the data warehouse; or
limiting the schedule, budget, or resources.
Various kinds of data warehouse design tools are available. Data warehouse development
tools provide functions to define and edit metadata repository contents (such as schemas,
scripts, or rules), answer queries, output reports, and ship metadata to and from relational
database system catalogues. Planning and analysis tools study the impact of schema changes
and of refresh performance when changing refresh rates or time windows.
2.3.2. Three-Tier Data Warehouse Architecture
Data warehouses often adopt a three-tier architecture, as presented in Figure 2.12.
1. The bottom tier is a warehouse database server that is almost always a relational
database system. Back-end tools and utilities are used to feed data into the bottom tier
from operational databases or other external sources (such as customer profile
information provided by external consultants). These tools and utilities perform data
extraction, cleaning, and transformation (e.g., to merge similar data from different
sources into a unified format), as well as load and refresh functions to update the data
warehouse (Section 2.3.3). The data are extracted using application program
interfaces known as gateways. A gateway is supported by the underlying DBMS and
allows client programs to generate SQL code to be executed at a server. Examples of
gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking
and Embedding, Database) by Microsoft and JDBC (Java Database Connectivity).
This tier also contains a metadata repository, which stores information about the data
warehouse and its contents. The metadata repository is further described in Section
2.3.4.
2. The middle tier is an OLAP server that is typically implemented using either (1) a
relational OLAP (ROLAP) model, that is, an extended relational DBMS that maps
operations on multidimensional data to standard relational operations; or (2) a
multidimensional OLAP (MOLAP) model, that is, a special-purpose server that
directly implements multidimensional data and operations. OLAP servers are
discussed in Section 2.3.5.
3. The top tier is a front-end client layer, which contains query and reporting tools,
analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
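The difference between the two middle-tier models can be illustrated in a few lines. The Python sketch below is a deliberately simplified assumption of how each might answer "units sold per quarter": the ROLAP-style function groups relational rows, while the MOLAP-style functions store the measure in an array addressed by dimension positions. The data and all names are hypothetical; real OLAP servers are far more elaborate.

```python
from collections import defaultdict

# Hypothetical fact rows: (quarter, item, units_sold).
fact_rows = [("Q1", "computer", 5), ("Q1", "phone", 3), ("Q2", "computer", 7)]

def rolap_units_by_quarter(rows):
    """ROLAP style: keep facts as relational rows and answer with a group-by."""
    totals = defaultdict(int)
    for quarter, _item, units in rows:
        totals[quarter] += units
    return dict(totals)

QUARTERS = ["Q1", "Q2"]
ITEMS = ["computer", "phone"]

def build_molap_array(rows):
    """MOLAP style: store the measure in an array indexed by dimension positions."""
    array = [[0] * len(ITEMS) for _ in QUARTERS]
    for quarter, item, units in rows:
        array[QUARTERS.index(quarter)][ITEMS.index(item)] += units
    return array

def molap_units_by_quarter(array):
    """Aggregate along the item axis of the array."""
    return {quarter: sum(array[i]) for i, quarter in enumerate(QUARTERS)}

molap_array = build_molap_array(fact_rows)
```

Both routes give the same totals. The array form reserves a cell even for the empty (Q2, phone) combination, which hints at why sparse cubes need special handling in array-based storage.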
From the architecture point of view, there are three data warehouse models: the
enterprise warehouse, the data mart, and the virtual warehouse.
Figure 2.12 A three-tier data warehousing architecture.
Enterprise warehouse: An enterprise warehouse collects all of the information about
subjects spanning the entire organization. It provides corporate-wide data integration, usually
from one or more operational systems or external information providers, and is
cross-functional in scope. It typically contains detailed data as well as summarized data, and can
range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond. An
enterprise data warehouse may be implemented on traditional mainframes, computer
superservers, or parallel architecture platforms. It requires extensive business modeling and may
take years to design and build.
Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific
group of users. The scope is confined to specific selected subjects. For example, a marketing
data mart may confine its subjects to customer, item, and sales. The data contained in data
marts tend to be summarized.
Depending on the source of data, data marts can be categorized as independent or dependent.
Independent data marts are sourced from data captured from one or more operational systems
or external information providers, or from data generated locally within a particular
department or geographic area. Dependent data marts are sourced directly from enterprise
data warehouses.
Virtual warehouse: A virtual warehouse is a set of views over operational databases. For
efficient query processing, only some of the possible summary views may be materialized. A
virtual warehouse is easy to build but requires excess capacity on operational database
servers.
The top-down development of an enterprise warehouse serves as a systematic solution and
minimizes integration problems. However, it is expensive, takes a long time to develop, and
lacks flexibility due to the difficulty in achieving consistency and consensus for a common
data model for the entire organization. The bottom-up approach to the design, development,
and deployment of independent data marts provides flexibility, low cost, and rapid return on
investment. It, however, can lead to problems when integrating various disparate data marts
into a consistent enterprise data warehouse.
Figure 2.13 A recommended approach for data warehouse development.
A recommended method for the development of data warehouse systems is to implement the
warehouse in an incremental and evolutionary manner, as shown in Figure 2.13. First, a
high-level corporate data model is defined within a reasonably short period (such as one or two
months) that provides a corporate-wide, consistent, integrated view of data among different
subjects and potential usages.
2.3.3 Data Warehouse Back-End Tools and Utilities
Data warehouse systems use back-end tools and utilities to populate and refresh their data
(Figure 2.12). These tools and utilities include the following functions:
Data extraction, which typically gathers data from multiple, heterogeneous, and external
sources
Data cleaning, which detects errors in the data and rectifies them when possible
Data transformation, which converts data from legacy or host format to warehouse
format
Load, which sorts, summarizes, consolidates, computes views, checks integrity, and
builds indices and partitions
Refresh, which propagates the updates from the data sources to the warehouse
Besides cleaning, loading, refreshing, and metadata definition tools, data warehouse systems
usually provide a good set of data warehouse management tools.
Data cleaning and data transformation are important steps in improving the quality of the data
and, subsequently, of the data mining results.
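The back-end functions listed above can be illustrated with a minimal Python sketch. The record layout, the field names (cust, amt), and the cleaning rule are assumptions made only for illustration; they do not describe any particular warehouse product.

```python
# Hypothetical source records from two heterogeneous systems (assumed layout).
source_a = [{"cust": " Alice ", "amt": "120.50"}, {"cust": "Bob", "amt": "n/a"}]
source_b = [{"cust": "carol", "amt": "75"}]

def extract(*sources):
    """Extraction: gather records from multiple sources into one list."""
    return [row for src in sources for row in src]

def clean(rows):
    """Cleaning: drop rows whose amount cannot be parsed as a number."""
    cleaned = []
    for row in rows:
        try:
            float(row["amt"])
        except ValueError:
            continue  # the error cannot be rectified here, so skip the record
        cleaned.append(row)
    return cleaned

def transform(rows):
    """Transformation: convert source format to the (assumed) warehouse format."""
    return [{"customer": r["cust"].strip().title(), "amount": float(r["amt"])}
            for r in rows]

def load(rows):
    """Load: sort the rows and build a simple index on customer."""
    table = sorted(rows, key=lambda r: r["customer"])
    index = {r["customer"]: i for i, r in enumerate(table)}
    return table, index

table, index = load(transform(clean(extract(source_a, source_b))))
print(table)
```

A refresh step would rerun the same pipeline on the records that changed at the sources since the last load.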
2.3.4 Metadata Repository
Metadata are data about data. When used in a data warehouse, metadata are the data that
define warehouse objects. Figure 2.12 showed a metadata repository within the bottom tier of
the data warehousing architecture. Metadata are created for the data names and definitions of
the given warehouse. Additional metadata are created and captured for time stamping any
extracted data, the source of the extracted data, and missing fields that have been added by
data cleaning or integration processes.
A metadata repository should contain the following:
A description of the structure of the data warehouse, which includes the warehouse
schema, view, dimensions, hierarchies, and derived data definitions, as well as data
mart locations and contents
Operational metadata, which include data lineage (history of migrated data and the
sequence of transformations applied to it), currency of data (active, archived, or
purged), and monitoring information (warehouse usage statistics, error reports, and
audit trails)
The algorithms used for summarization, which include measure and dimension definition
algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and
predefined queries and reports
The mapping from the operational environment to the data warehouse, which
includes source databases and their contents, gateway descriptions, data partitions,
data extraction, cleaning, transformation rules and defaults, data refresh and purging
rules, and security (user authorization and access control)
Data related to system performance, which include indices and profiles that improve
data access and retrieval performance, in addition to rules for the timing and
scheduling of refresh, update, and replication cycles
Business metadata, which include business terms and definitions, data ownership
information, and charging policies
A data warehouse contains different levels of summarization, of which metadata is one type.
Other types include current detailed data (which are almost always on disk), older detailed
data (which are usually on tertiary storage), lightly summarized data and highly summarized
data (which may or may not be physically housed).
Metadata play a very different role than other data warehouse data and are important for
many reasons. For example, metadata are used as a directory to help the decision support
system analyst locate the contents of the data warehouse, as a guide to the mapping of data
when the data are transformed from the operational environment to the data warehouse
environment, and as a guide to the algorithms used for summarization between the current
detailed data and the lightly summarized data, and between the lightly summarized data and
the highly summarized data. Metadata should be stored and managed persistently (i.e., on
disk).
Types of OLAP Servers: ROLAP versus MOLAP versus HOLAP
Logically, OLAP servers present business users with multidimensional data from data
warehouses or data marts, without concerns regarding how or where the data are stored.
However, the physical architecture and implementation of OLAP servers must consider data
storage issues. Implementations of a warehouse server for OLAP processing include the
following:
Relational OLAP (ROLAP) servers: These are the intermediate servers that stand in
between a relational back-end server and client front-end tools. They use a relational or
extended relational DBMS to store and manage warehouse data, and OLAP middleware to
support missing pieces. ROLAP servers include optimization for each DBMS back end,
implementation of aggregation navigation logic, and additional tools and services. ROLAP
technology tends to have greater scalability than MOLAP technology. The DSS server of
MicroStrategy, for example, adopts the ROLAP approach.
Multidimensional OLAP (MOLAP) servers: These servers support multidimensional views
of data through array-based multidimensional storage engines. They map multidimensional
views directly to data cube array structures. The advantage of using a data cube is that it
allows fast indexing to precomputed summarized data. Notice that with multidimensional
data stores, the storage utilization may be low if the data set is sparse. In such cases, sparse
matrix compression techniques should be explored. Many MOLAP servers adopt a two-level
storage representation to handle dense and sparse data sets: denser subcubes are identified
and stored as array structures, whereas sparse subcubes employ compression technology for
efficient storage utilization.
Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and
MOLAP technology, benefiting from the greater scalability of ROLAP and the faster
computation of MOLAP. For example, a HOLAP server may allow large volumes of detail
data to be stored in a relational database, while aggregations are kept in a separate MOLAP
store. The Microsoft SQL Server 2000 supports a hybrid OLAP server.
Specialized SQL servers: To meet the growing demand of OLAP processing in relational
databases, some database system vendors implement specialized SQL servers that provide
advanced query language and query processing support for SQL queries over star and
snowflake schemas in a read-only environment.
Example 2.10 A ROLAP data store. Table 2.4 shows a summary fact table that contains both
base fact data and aggregated data. The schema of the table is "<record identifier (RID), item,
day, month, quarter, year, dollars_sold>", where day, month, quarter, and year define the date
of sales, and dollars_sold is the sales amount. Consider the tuples with an RID of 1001 and
1002, respectively. The data of these tuples are at the base fact level, where the date of sales is
October 15, 2003, and October 23, 2003, respectively. Consider the tuple with an RID of
5001. This tuple is at a more general level of abstraction than the tuples 1001
and 1002. The day value has been generalized to all, so that the corresponding time value is
October 2003. That is, the dollars sold amount shown is an aggregation representing the
entire month of October 2003, rather than just October 15 or 23, 2003. The special value all is
used to represent subtotals in summarized data.
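The idea of storing base facts and subtotals in one table can be sketched in Python as follows. The RIDs and dates follow Example 2.10; the item name and the dollar amounts are illustrative assumptions, and the helper monthly_summary is hypothetical.

```python
# Base fact rows: (RID, item, day, month, quarter, year, dollars_sold).
# The item name and the dollar amounts are illustrative assumptions.
base_facts = [
    (1001, "TV", 15, 10, "Q4", 2003, 250.0),
    (1002, "TV", 23, 10, "Q4", 2003, 175.0),
]

def monthly_summary(rows, rid):
    """Generalize day to 'all' and sum dollars_sold per (item, month, quarter, year)."""
    totals = {}
    for _, item, _day, month, quarter, year, dollars in rows:
        key = (item, month, quarter, year)
        totals[key] = totals.get(key, 0.0) + dollars
    summary = []
    for (item, month, quarter, year), total in sorted(totals.items()):
        summary.append((rid, item, "all", month, quarter, year, total))
        rid += 1
    return summary

# The summary fact table stores base rows and aggregated rows side by side.
summary_fact_table = base_facts + monthly_summary(base_facts, rid=5001)
for row in summary_fact_table:
    print(row)
```

In this sketch the aggregated row covers only the two base rows shown; in a real warehouse it would cover every sale in the month.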
MOLAP uses multidimensional array structures to store data for on-line analytical
processing. This structure is discussed in the following section on data warehouse
implementation.
Most data warehouse systems adopt a client-server architecture. A relational data store
always resides at the data warehouse/data mart server site. A multidimensional data store can
reside at either the database server site or the client site.
2.6 Data Warehouse Implementation
At the core of multidimensional data analysis is the efficient computation of aggregations
across many sets of dimensions. In SQL terms, these aggregations are referred to as group-by's. Each group-by can be represented by a cuboid, where the set of group-by's forms a lattice
of cuboids defining a data cube. In this section, we explore issues relating to the efficient
computation of data cubes.
The compute cube Operator and the Curse of Dimensionality
One approach to cube computation extends SQL so as to include a compute cube operator.
The compute cube operator computes aggregates over all subsets of the dimensions specified
in the operation. This can require excessive storage space, especially for large numbers of
dimensions. We start with an intuitive look at what is involved in the efficient computation of
data cubes.
Example 2.11 A data cube is a lattice of cuboids. Suppose that you would like to create a
data cube for AllElectronics sales that contains the following: city, item, year, and sales in
dollars. You would like to be able to analyze the data, with queries such as the following:
"Compute the sum of sales, grouping by city and item."
"Compute the sum of sales, grouping by city."
"Compute the sum of sales, grouping by item."
What is the total number of cuboids, or group-by's, that can be computed for this data cube?
Taking the three attributes, city, item, and year, as the dimensions for the data cube, and sales
in dollars as the measure, the total number of cuboids, or group-by's, that can be computed
for this data cube is 2^3 = 8. The possible group-by's are the following: {(city, item, year),
(city, item), (city, year), (item, year), (city), (item), (year), ()}, where () means that the group-by is empty (i.e., the dimensions are not grouped). These group-by's form a lattice of cuboids
for the data cube, as shown in Figure 2.14. The base cuboid contains all three dimensions,
city, item, and year. It can return the total sales for any combination of the three dimensions.
The apex cuboid, or 0-D cuboid, refers to the case where the group-by is empty.
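The lattice in Example 2.11 can be enumerated with a short Python sketch. The sales rows are illustrative assumptions, and the function names are hypothetical; the sketch simply computes every group-by by brute force, which is the full-cube computation that the rest of this section tries to make more efficient.

```python
from itertools import combinations

# Illustrative sales rows: (city, item, year, sales_in_dollars); values are assumptions.
sales = [
    ("Vancouver", "TV", 2003, 100.0),
    ("Vancouver", "phone", 2003, 50.0),
    ("Toronto", "TV", 2004, 200.0),
]
dimensions = ("city", "item", "year")

def all_group_bys(dims):
    """Return every subset of the dimensions, from the base cuboid down to ()."""
    return [combo for k in range(len(dims), -1, -1)
            for combo in combinations(dims, k)]

def compute_cuboid(rows, dims, group_by):
    """Sum the measure for each distinct combination of the group-by dimensions."""
    positions = [dims.index(d) for d in group_by]
    cuboid = {}
    for row in rows:
        key = tuple(row[p] for p in positions)
        cuboid[key] = cuboid.get(key, 0.0) + row[-1]
    return cuboid

cube = {gb: compute_cuboid(sales, dimensions, gb) for gb in all_group_bys(dimensions)}
print(len(cube))        # number of cuboids in the lattice
print(cube[("city",)])  # the 1-D cuboid grouped by city
print(cube[()])         # the apex cuboid: the grand total
```

With three dimensions the dictionary holds 2^3 = 8 cuboids; each added dimension doubles that number.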
Figure 2.14 Lattice of cuboids, making up a 3-D data cube. Each cuboid represents a different group-by. The base cuboid contains the three dimensions city, item, and year.
Partial Materialization: Selected Computation of Cuboids
There are three choices for data cube materialization given a base cuboid:
1. No materialization: Do not precompute any of the "nonbase" cuboids. This leads to
computing expensive multidimensional aggregates on the fly, which can be extremely
slow.
2. Full materialization: Precompute all of the cuboids. The resulting lattice of
computed cuboids is referred to as the full cube. This choice typically requires huge
amounts of memory space in order to store all of the precomputed cuboids.
3. Partial materialization: Selectively compute a proper subset of the whole set of
possible cuboids. Alternatively, we may compute a subset of the cube, which contains
only those cells that satisfy some user-specified criterion, such as where the tuple
count of each cell is above some threshold. We will use the term subcube to refer to
the latter case, where only some of the cells may be precomputed for various cuboids.
Partial materialization represents an interesting trade-off between storage space and
response time.
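The second form of partial materialization, keeping only cells that meet a count criterion, can be sketched as follows. The rows, the threshold min_count, and the function name subcube are assumptions for illustration; the sketch counts every cell first and then filters, so it shows the result rather than an efficient way of obtaining it.

```python
from collections import Counter
from itertools import combinations

# Illustrative rows: (city, item); the values and the threshold are assumptions.
rows = [
    ("Vancouver", "TV"), ("Vancouver", "TV"), ("Vancouver", "phone"),
    ("Toronto", "TV"),
]
dims = ("city", "item")
min_count = 2  # user-specified criterion: keep cells with at least this many tuples

def subcube(rows, dims, min_count):
    """For every group-by, keep only the cells whose tuple count meets the threshold."""
    result = {}
    for k in range(len(dims) + 1):
        for group_by in combinations(range(len(dims)), k):
            counts = Counter(tuple(r[i] for i in group_by) for r in rows)
            kept = {cell: n for cell, n in counts.items() if n >= min_count}
            result[tuple(dims[i] for i in group_by)] = kept
    return result

partial = subcube(rows, dims, min_count)
for group_by, cells in partial.items():
    print(group_by, cells)
```

Cells such as (Toronto) or (Vancouver, phone) fall below the threshold and are not stored, which is where the storage saving comes from.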
2.4.2 Indexing OLAP Data
To facilitate efficient data accessing, most data warehouse systems support index structures
and materialized views (using cuboids). The bitmap indexing method is popular in OLAP
products because it allows quick searching in data cubes. The bitmap index is an alternative
representation of the record ID (RID) list. In the bitmap index for a given attribute, there is a
distinct bit vector, B_v, for each value v in the domain of the attribute. If the domain of a given
attribute consists of n values, then n bits are needed for each entry in the bitmap index (i.e.,
there are n bit vectors). If the attribute has the value v for a given row in the data table, then
the bit representing that value is set to 1 in the corresponding row of the bitmap index. All
other bits for that row are set to 0.
Example 2.12 Bitmap indexing. In the AllElectronics data warehouse, suppose the dimension
item at the top level has four values (representing item types): "home entertainment,"
"computer," "phone," and "security." Each value (e.g., "computer") is represented by a bit
vector in the bitmap index table for item. Suppose that the cube is stored as a relation table
with 100,000 rows. Because the domain of item consists of four values, the bitmap index
table requires four bit vectors (or lists), each with 100,000 bits. Figure 2.15 shows a base
(data) table containing the dimensions item and city, and its mapping to bitmap index tables
for each of the dimensions.
Figure 2.15 Indexing OLAP data using bitmap indices.
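A bitmap index of the kind described in Example 2.12 can be sketched in Python. The four-row base table is an illustrative assumption (not the contents of Figure 2.15), and build_bitmap_index is a hypothetical helper; bit vectors are kept as plain lists of 0s and 1s for readability.

```python
# Illustrative base table: (RID, item, city); the rows are assumptions.
base_table = [
    ("R1", "home entertainment", "Vancouver"),
    ("R2", "computer", "Vancouver"),
    ("R3", "phone", "Toronto"),
    ("R4", "computer", "Toronto"),
]

def build_bitmap_index(table, column):
    """Return {value: bit vector}, one bit per row, 1 where the row has that value."""
    values = sorted({row[column] for row in table})
    return {v: [1 if row[column] == v else 0 for row in table] for v in values}

item_index = build_bitmap_index(base_table, column=1)
city_index = build_bitmap_index(base_table, column=2)

# A selection on two attributes becomes a bitwise AND of two bit vectors.
mask = [a & b for a, b in zip(item_index["computer"], city_index["Toronto"])]
matching_rids = [row[0] for row, bit in zip(base_table, mask) if bit]
print(item_index["computer"])
print(matching_rids)
```

Because comparisons turn into bit operations, this representation works best for attributes with few distinct values.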
The join indexing method gained popularity from its use in relational database query
processing. Traditional indexing maps the value in a given column to a list of rows having
that value. In contrast, join indexing registers the joinable rows of two relations from a
relational database. For example, if two relations R(RID, A) and S(B, SID) join on the
attributes A and B, then the join index record contains the pair (RID, SID), where RID and
SID are record identifiers from the R and S relations, respectively. Hence, the join index
records can identify joinable tuples without performing costly join operations. Join indexing
is especially useful for maintaining the relationship between a foreign key and its matching
primary keys, from the joinable relation.
Example 2.13 Join indexing. In Example 2.4, we defined a star schema for AllElectronics of
the form "sales_star [time, item, branch, location]: dollars_sold = sum(sales_in_dollars)". An
example of a join index relationship between the sales fact table and the dimension tables for
location and item is shown in Figure 2.16. For example, the "Main Street" value in the
location dimension table joins with tuples T57, T238, and T884 of the sales fact table.
Similarly, the "Sony-TV" value in the item dimension table joins with tuples T57 and T459 of
the sales fact table. The corresponding join index tables are shown in Figure 2.17.
Figure 2.16 Linkages between a sales fact table and dimension tables for location and item.
Suppose that there are 360 time values, 100 items, 50 branches, 30 locations, and 10 million
sales tuples in the sales star data cube. If the sales fact table has recorded sales for only 30
items, the remaining 70 items will obviously not participate in joins. If join indices are not
used, additional I/Os have to be performed to bring the joining portions of the fact table and
dimension tables together.
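The join indices of Example 2.13 can be sketched with Python dictionaries. The linkages for "Main Street" and "Sony-TV" follow the example; the remaining attribute values ("other item", "other location") are placeholders assumed for illustration, and build_join_index is a hypothetical helper.

```python
from collections import defaultdict

# Fact tuples: tuple ID -> (location, item). The "Main Street" and "Sony-TV"
# linkages follow Example 2.13; the other attribute values are assumptions.
sales_facts = {
    "T57": ("Main Street", "Sony-TV"),
    "T238": ("Main Street", "other item"),
    "T459": ("other location", "Sony-TV"),
    "T884": ("Main Street", "other item"),
}

def build_join_index(facts, position):
    """Map each dimension value to the list of fact tuple IDs that join with it."""
    index = defaultdict(list)
    for tid, values in facts.items():
        index[values[position]].append(tid)
    return dict(index)

location_index = build_join_index(sales_facts, position=0)
item_index = build_join_index(sales_facts, position=1)

# A composite join index over (location, item) pairs.
location_item_index = defaultdict(list)
for tid, (location, item) in sales_facts.items():
    location_item_index[(location, item)].append(tid)

print(location_index["Main Street"])
print(item_index["Sony-TV"])
print(location_item_index[("Main Street", "Sony-TV")])
```

Looking up a dimension value returns the joinable fact tuples directly, without scanning the fact table.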
2.4.3 Efficient Processing of OLAP Queries
The purpose of materializing cuboids and constructing OLAP index structures is to speed up
query processing in data cubes. Given materialized views, query processing should proceed
as follows:
1. Determine which operations should be performed on the available cuboids: This involves
transforming any selection, projection, roll-up (group-by), and drill-down operations
specified in the query into corresponding SQL and/or OLAP operations. For example, slicing
and dicing a data cube may correspond to selection and/or projection operations on a
materialized cuboid.
2. Determine to which materialized cuboid(s) the relevant operations should be applied: This
involves identifying all of the materialized cuboids that may potentially be used to answer the
query, pruning the above set using knowledge of "dominance" relationships among the
cuboids, estimating the costs of using the remaining materialized cuboids, and selecting the
cuboid with the least cost.
Example 2.14 OLAP query processing. Suppose that we define a data cube for
AllElectronics of the form "sales_cube [time, item, location]: sum(sales_in_dollars)". The
dimension hierarchies used are "day < month < quarter < year" for time, "item_name < brand
< type" for item, and "street < city < province_or_state < country" for location.
Suppose that the query to be processed is on {brand, province_or_state}, with the selection
constant "year = 2004". Also, suppose that there are four materialized cuboids available, as
follows:
cuboid 1: {year, item_name, city}
cuboid 2: {year, brand, country}
cuboid 3: {year, brand, province_or_state}
cuboid 4: {item_name, province_or_state} where year = 2004
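Which of these cuboids can be used follows from one rule: finer-granularity data cannot be generated from coarser-granularity data. Cuboid 2 therefore cannot answer the query, because country is a more general concept than province_or_state, whereas cuboids 1, 3, and 4 hold data at the required level or finer. The Python sketch below checks this rule mechanically. The representation of a cuboid as a pair (levels, applied selection) and the function names are assumptions for illustration, and the sketch does not estimate costs.

```python
# Concept hierarchies from Example 2.14, ordered from finest to coarsest level.
hierarchies = {
    "time": ["day", "month", "quarter", "year"],
    "item": ["item_name", "brand", "type"],
    "location": ["street", "city", "province_or_state", "country"],
}

# Each cuboid: (set of levels it is grouped on, selection already applied to it).
cuboids = {
    "cuboid 1": ({"year", "item_name", "city"}, {}),
    "cuboid 2": ({"year", "brand", "country"}, {}),
    "cuboid 3": ({"year", "brand", "province_or_state"}, {}),
    "cuboid 4": ({"item_name", "province_or_state"}, {"year": 2004}),
}
query_group_by = {"brand", "province_or_state"}
query_selection = {"year": 2004}

def is_finer_or_equal(level_a, level_b):
    """True if both levels belong to one hierarchy and level_a is not coarser."""
    for levels in hierarchies.values():
        if level_a in levels and level_b in levels:
            return levels.index(level_a) <= levels.index(level_b)
    return False

def derivable(level, cuboid_levels):
    """True if some level stored in the cuboid can be rolled up to `level`."""
    return any(is_finer_or_equal(c, level) for c in cuboid_levels)

def can_answer(cuboid, group_by, selection):
    levels, applied = cuboid
    # A cuboid filtered on a condition the query does not share has lost data.
    if any(selection.get(k) != v for k, v in applied.items()):
        return False
    group_ok = all(derivable(q, levels) for q in group_by)
    select_ok = all(applied.get(k) == v or derivable(k, levels)
                    for k, v in selection.items())
    return group_ok and select_ok

usable = [name for name, c in cuboids.items()
          if can_answer(c, query_group_by, query_selection)]
print(usable)
```

Among the usable cuboids, the one chosen in practice is the one with the lowest estimated cost, as described in step 2 above.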
2.5 Data Warehousing to Data Mining
2.5.1 Data Warehouse Usage
Data warehouses and data marts are used in a wide range of applications. Business executives
use the data in data warehouses and data marts to perform data analysis and make strategic
decisions. In many firms, data warehouses are used as an integral part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are used
extensively in banking and financial services, consumer goods and retail distribution sectors,
and controlled manufacturing, such as demand based production.
The data warehouse is used for strategic purposes, performing multidimensional analysis and
sophisticated slice-and-dice operations. Finally, the data warehouse may be employed for
knowledge discovery and strategic decision making using data mining tools. In this context,
the tools for data warehousing can be categorized into access and retrieval tools, database
reporting tools, data analysis tools, and data mining tools.
Business users need to have the means to know what exists in the data warehouse (through
metadata), how to access the contents of the data warehouse, how to examine the contents
using analysis tools, and how to present the results of such analysis.
There are three kinds of data warehouse applications: information processing, analytical
processing, and data mining:
Information processing supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts, or graphs. A current trend in data warehouse
information processing is to construct low-cost Web-based accessing tools that are
then integrated with Web browsers.
Analytical processing supports basic OLAP operations, including slice-and-dice,
drill-down, roll-up, and pivoting. It generally operates on historical data in both
summarized and detailed forms. The major strength of on-line analytical processing
over information processing is the multidimensional data analysis of data warehouse
data.
Data mining supports knowledge discovery by finding hidden patterns and
associations, constructing analytical models, performing classification and prediction,
and presenting the mining results using visualization tools.
On-line analytical processing comes a step closer to data mining because it can derive
information summarized at multiple granularities from user-specified subsets of a data
warehouse.
The functionalities of OLAP and data mining can be viewed as disjoint: OLAP is a data
summarization/aggregation tool that helps simplify data analysis, while data mining allows
the automated discovery of implicit patterns and interesting knowledge hidden in large
amounts of data. OLAP tools are targeted toward simplifying and supporting interactive data
analysis, whereas the goal of data mining tools is to automate as much of the process as
possible, while still allowing users to guide the process. In this sense, data mining goes one
step beyond traditional on-line analytical processing.
Data mining is not confined to the analysis of data stored in data warehouses. It may analyze
data existing at more detailed granularities than the summarized data provided in a data
warehouse. It may also analyze transactional, spatial, textual, and multimedia data that are
difficult to model with current multidimensional database technology. In this context, data
mining covers a broader spectrum than OLAP with respect to data mining functionality and
the complexity of the data handled.
2.5.2. From On-Line Analytical Processing to On-Line Analytical Mining
In the field of data mining, substantial research has been performed for data mining on
various platforms, including transaction databases, relational databases, spatial databases, text
databases, time-series databases, flat files, data warehouses, and so on. On-line analytical
mining (OLAM) (also called OLAP mining) integrates on-line analytical processing (OLAP)
with data mining and mining knowledge in multidimensional databases. Among the many
different paradigms and architectures of data mining systems, OLAM is particularly
important for the following reasons:
High quality of data in data warehouses: Most data mining tools need to work on
integrated, consistent, and cleaned data, which requires costly data cleaning, data
integration and data transformation as preprocessing steps. A data warehouse
constructed by such preprocessing serves as a valuable source of high quality data for
OLAP as well as for data mining. Notice that data mining may also serve as a
valuable tool for data cleaning and data integration as well.
Available information processing infrastructure surrounding data warehouses:
Comprehensive information processing and data analysis infrastructures have been or
will be systematically constructed surrounding data warehouses, which include
accessing, integration, consolidation, and transformation of multiple heterogeneous
databases, ODBC/OLE DB connections, Web-accessing and service facilities, and
reporting and OLAP analysis tools. It is prudent to make the best use of the available
infrastructures rather than constructing everything from scratch.
OLAP-based exploratory data analysis: Effective data mining needs exploratory
data analysis. A user will often want to traverse through a database, select portions of
relevant data, and analyze them at different granularities, and present knowledge /
results in different forms. On-line analytical mining provides facilities for data mining
on different subsets of data and at different levels of abstraction, by drilling, pivoting,
filtering, dicing, and slicing on a data cube and on some intermediate data mining
results. This, together with data/knowledge visualization tools, will greatly enhance
the power and flexibility of exploratory data mining.
On-line selection of data mining functions: Often a user may not know what kinds
of knowledge she would like to mine. By integrating OLAP with multiple data mining
functions, on-line analytical mining provides users with the flexibility to select
desired data mining functions and swap data mining tasks dynamically.
Architecture for On-Line Analytical Mining
An OLAM server performs analytical mining in data cubes in a similar manner as an OLAP
server performs on-line analytical processing. An integrated OLAM and OLAP architecture
is shown in Figure 2.18, where the OLAM and OLAP servers both accept user on-line
queries (or commands) via a graphical user interface API and work with the data cube in the
data analysis via a cube API. A metadata directory is used to guide the access of the data
cube.
Figure 2.18 An integrated OLAM and OLAP architecture.
2.8 Summary
In this unit, we have given a detailed discussion of data warehousing concepts. The
following concepts have been presented.
A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data organized in support of management decision making. Several
factors distinguish data warehouses from operational databases. Because the two
systems provide quite different functionalities and require different kinds of data, it is
necessary to maintain data warehouses separately from operational databases.
A multidimensional data model is typically used for the design of corporate data
warehouses and departmental data marts. Such a model can adopt a star schema,
snowflake schema, or fact constellation schema. The core of the multidimensional
model is the data cube, which consists of a large set of facts (or measures) and a
number of dimensions.
Concept hierarchies organize the values of attributes or dimensions into gradual levels
of abstraction. They are useful in mining at multiple levels of abstraction.
On-line analytical processing (OLAP) can be performed in data warehouses/marts
using the multidimensional data model. Typical OLAP operations include rollup,
drill-(down, across, through), slice-and-dice, pivot (rotate), as well as statistical
operations such as ranking and computing moving averages and growth rates. OLAP
operations can be implemented efficiently using the data cube structure.
Data warehouses often adopt a three-tier architecture. Data warehouse metadata are
data defining the warehouse objects.
2.9 Keywords
Data Warehouse, OLAP, Star schema, Snowflake schema, Fact constellation, Distributive,
Algebraic, Holistic, Hierarchies, Multidimensional Data Model, Data Warehouse
Architecture, Data Warehouse Design, Enterprise warehouse, ROLAP versus MOLAP versus
HOLAP, Partial Materialization.
2.10 Exercises
1. What is a data warehouse? Explain.
2. What are the key features of a data warehouse? Explain.
3. Differentiate between operational database systems and data warehouses.
4. Write a note on the multidimensional data model.
5. What are the two schemas of the multidimensional data model? Explain.
6. Explain OLAP operations.
7. Explain the steps for the design and construction of data warehouses.
8. Explain the three-tier data warehouse architecture.
9. What are the types of OLAP servers? Explain.
10. What are the three choices for data cube materialization?
11. Explain the architecture for on-line analytical mining.
2.11 References
1. Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber,
Morgan Kaufmann Publisher, Second Edition, 2006.
2. Introduction to Data Mining with Case Studies by G. K. Gupta, Eastern Economy
Edition (PHI, New Delhi), Third Edition, 2009.
3. Data Mining Techniques by Arun K Pujari, University Press, Second Edition, 2009.
4. Gartner: Data Warehouses, Operational Data Stores, Data Marts and Data Outhouses,
Dec 2005.
5. Data Warehousing Fundamentals for IT Professionals, 2ed, Paulraj Ponniah.
UNIT-3: DATA CUBES AND IMPLEMENTATION
Structure
3.1 Objectives
3.2 Introduction
3.3 Data Cube Implementations
3.4 Data Cube operations
3.5 Implementation of OLAP
3.6 Overview on OLAP Software
3.7 Summary
3.8 Keywords
3.9 Exercises
3.10 References
3.1 Objectives
The objectives covered under this unit include:
An introduction to data cubes
Data Cube Implementations
Conceptual Modeling of Data Warehousing
OLAP Operations in a Multidimensional Data Model
OLAP Operations
OLAP implementations
3.2 Introduction
The cube is used to represent data along some measure of interest. Although called a "cube",
it can be 2-dimensional, 3-dimensional, or higher-dimensional. Each dimension represents
some attribute in the database and the cells in the data cube represent the measure of interest.
For example, they could contain a count for the number of times that attribute combination
occurs in the database, or the minimum, maximum, sum or average value of some attribute.
Queries are performed on the cube to retrieve decision support information.
3.3 Data Cube Implementations
Implementation of the data cube is one of the most important, albeit "expensive," processes
in On-Line Analytical Processing (OLAP). It involves the computation and storage of the
results of aggregate queries grouping on all possible dimension-attribute combinations over a
fact table in a data warehouse. Such precomputation and materialization of (parts of) the
cube is critical for improving the response time of OLAP queries and of operators such as
roll-up, drill-down, slice-and-dice, and pivot, which use aggregation extensively [Chaudhuri
and Dayal 1997; Gray et al. 1996]. Materializing the entire cube is ideal for fast access to
aggregated data but may pose considerable costs both in computation time and in storage
space. To balance the tradeoff between query response times and cube resource requirements,
several efficient methods have been proposed, whose study is the main purpose of this article.
As a running example, consider a fact table R consisting of three dimensions (A, B, C), and
one measure M (Figure 1a). Figure 1b presents the corresponding cube. Each view that
belongs to the data cube (also called cube node hereafter) materializes a specific group-by
query as illustrated in Figure 1b. Clearly, if D is the number of dimensions of a fact table, the
number of all possible group-by queries is 2^D, which implies that the data cube size is
exponentially larger with respect to D than the size of the original data (in the worst case). In
typical applications, this is in the order of gigabytes, so development of efficient data-cube
implementation algorithms is extremely critical.
The data-cube implementation algorithms that have been proposed in the literature can be
partitioned into four main categories, depending on the format they use in order to compute
and store a data cube: Relational-OLAP (ROLAP) methods use traditional materialized
views; Multidimensional-OLAP (MOLAP) methods use multidimensional arrays; Graph-Based methods take advantage of specialized graphs that usually take the form of tree-like
data structures; finally, approximation methods exploit various in-memory representations
(like histograms), borrowed mainly from statistics. Our focus in this article is on algorithms
for ROLAP environments, due to several reasons: (a) Most existing publications share this
focus; (b) ROLAP methods can be easily incorporated into existing relational servers, turning
them into powerful OLAP tools with little effort; by contrast, MOLAP and Graph-Based
methods construct and store specialized data structures, making them incompatible, in any
direct sense, with conventional database engines; (c) ROLAP methods generate and store
precise results, which are much easier to manage at run time compared to approximations.
Implementation of the data cube consists of two subproblems: one concerning the actual
computation of the cube and one concerning the particulars of storing parts of the results of
that computation. The set of algorithms that are applicable to each subproblem is intimately
dependent on the particular approach that has been chosen: ROLAP, MOLAP, Graph-Based,
or Approximate. Specifically for ROLAP, which is the focus of this article, the two
subproblems take on the following specialized forms:
—Data cube computation is the problem of scanning the original data, applying the required
aggregate function to all groupings, and generating relational views with the corresponding
cube contents.
—Data cube selection is the problem of determining the subset of the data cube views that
will actually be stored. Selection methods avoid storing some data cube pieces according to
certain criteria, so that what is finally materialized balances the tradeoff between query
response time and cube resource requirements.
Definitions
A data warehouse is based on a multidimensional data model which views data in the form of
a data cube. This is not a 3-dimensional cube: it is an n-dimensional cube. Dimensions of the
cube are the equivalent of entities in a database, e.g., how the organization wants to keep
records.
Examples: Product; Dates; Locations
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions:
o Dimension tables, such as item (item_name, brand, type) or time (day, week,
month, quarter, year)
o Fact table, which contains measures (such as dollars_sold) and keys to each of
the related dimension tables
In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D
cuboid, which holds the highest level of summarization, is called the apex cuboid. The
lattice of cuboids forms a data cube.
Conceptual Modeling of Data Warehousing
Star schema: A fact table in the middle connected to a set of dimension tables. The star
schema architecture is the simplest data warehouse schema. It is called a star schema because
the diagram resembles a star, with points radiating from a center. The center of the star
consists of the fact table, and the points of the star are the dimension tables.
Fact Tables
A fact table typically has two types of columns: foreign keys to dimension tables and
measures, which contain numeric facts. A fact table can contain fact data at a detailed or
aggregated level.
Dimension Tables
A dimension is a structure usually composed of one or more hierarchies that categorizes data.
If a dimension has no hierarchies or levels, it is called a flat dimension or list. The
primary keys of each of the dimension tables are part of the composite primary key of the fact
table. Dimensional attributes help to describe the dimensional value. They are normally
descriptive, textual values. Dimension tables are generally smaller than the fact table.
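A star schema of this kind can be sketched with Python's built-in sqlite3 module. The table names (item_dim, time_dim, sales_fact), the key columns, and all row values are assumptions chosen for illustration; the point is only the shape, with one fact table keyed by foreign keys to two dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
                       quarter TEXT, year INTEGER);
-- The fact table holds a measure plus foreign keys; the keys form its primary key.
CREATE TABLE sales_fact (
    item_key INTEGER REFERENCES item_dim(item_key),
    time_key INTEGER REFERENCES time_dim(time_key),
    dollars_sold REAL,
    PRIMARY KEY (item_key, time_key)
);
""")
conn.executemany("INSERT INTO item_dim VALUES (?, ?, ?, ?)", [
    (1, "Sony-TV", "Sony", "home entertainment"),
    (2, "Laptop-X", "BrandX", "computer"),
])
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?)", [
    (1, 15, 10, "Q4", 2003),
    (2, 23, 10, "Q4", 2003),
])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", [
    (1, 1, 250.0), (1, 2, 175.0), (2, 1, 900.0),
])

# Join the fact table to its dimensions and aggregate the measure.
rows = conn.execute("""
    SELECT i.type, t.year, SUM(s.dollars_sold)
    FROM sales_fact AS s
    JOIN item_dim AS i ON s.item_key = i.item_key
    JOIN time_dim AS t ON s.time_key = t.time_key
    GROUP BY i.type, t.year
    ORDER BY i.type
""").fetchall()
print(rows)
```

The descriptive attributes (type, year) come from the small dimension tables, while the numeric measure comes from the fact table.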
Snowflake schema:
A refinement of the star schema where some dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar to a snowflake. The snowflake schema
architecture is a more complex variation of the star schema used in a data warehouse, because
the tables which describe the dimensions are normalized.
Fact constellations:
Multiple fact tables share dimension tables; the schema can be viewed as a collection of stars and is
therefore called a galaxy schema or fact constellation. For each star schema it is possible to construct a fact
constellation schema (for example, by splitting the original star schema into several star
schemas, each of which describes facts at a different level of the dimension hierarchies). The fact
constellation architecture contains multiple fact tables that share many dimension tables.
Data Measures
A data cube function is a numerical function that can be evaluated at each
point in the data cube space. Given a data point in the data cube space:
Entry (v1, v2, …, vn)
where vi is the value corresponding to dimension di,
we need to apply the aggregate measures to the dimension values v1, v2, …, vn.
Measures fall into three categories:
Distributive: The result derived by applying the function to n aggregate values (one per partition
of the data) is the same as that derived by applying the function to all the data without partitioning.
Example: count(), sum(), min(), max().
Algebraic: The function can be computed by an algebraic
function with M arguments (where M is a bounded integer), each of which is obtained by
applying a distributive aggregate function.
Example: avg(), min_N(), standard_deviation().
Holistic: There is no constant bound on the storage size needed to describe a subaggregate.
Example: median(), mode(), rank().
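The three categories can be illustrated with a minimal Python sketch. The data values and the helper names partial_stats and combine are assumptions made for illustration only. Each partition is reduced to a bounded set of distributive sub-aggregates (count, sum, sum of squares); avg() and the population standard deviation are then derived from them, whereas the median cannot be recovered from per-partition medians.

```python
import math
import statistics

def partial_stats(chunk):
    """Distributive sub-aggregates for one partition: count, sum, sum of squares."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)

def combine(parts):
    """Merge per-partition sub-aggregates by simple addition."""
    n = sum(p[0] for p in parts)
    s = sum(p[1] for p in parts)
    ss = sum(p[2] for p in parts)
    return n, s, ss

# Hypothetical measure values, split into two partitions of unequal size.
partitions = [[1, 2, 9], [3, 4, 5, 6]]
all_values = [x for chunk in partitions for x in chunk]

n, s, ss = combine([partial_stats(c) for c in partitions])

# Distributive: the sum of the partition sums equals the sum over all data.
total = s
# Algebraic: avg() and standard deviation need only three sub-aggregates.
average = s / n
std_dev = math.sqrt(ss / n - average ** 2)  # population standard deviation
# Holistic: the median needs the full data; partition medians do not combine.
median_all = statistics.median(all_values)
median_of_medians = statistics.median(statistics.median(c) for c in partitions)
```

In this sketch the median of the partition medians differs from the true median, which is why holistic measures are the expensive case for cube computation.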
How to compute data cube measures?
How do we evaluate dollars_sold and units_sold in the star schema of the previous example?
Assume that the relational database schema corresponding to our example is the following:
time (time_key, day, day_of_week, month, quarter, year)
item (item_key, item_name, brand, type, supplier (supplier_key, supplier_type))
branch (branch_key, branch_name, branch_type)
location (location_key, street, city, province_or_state, country)
sales (time_key, item_key, branch_key, location_key, number_of_units_sold, price)
Let us then compute the two measures we have in our data cube, dollars_sold and units_sold:
select s.time_key, s.item_key, s.branch_key, s.location_key,
       sum(s.number_of_units_sold * s.price) as dollars_sold,
       sum(s.number_of_units_sold) as units_sold
from time t, item i, branch b, location l, sales s
where s.time_key = t.time_key
  and s.item_key = i.item_key
  and s.branch_key = b.branch_key
  and s.location_key = l.location_key
group by s.time_key, s.item_key, s.branch_key, s.location_key
Relationship between “data cube” and “group by”?
The above query corresponds to the base cuboid. By changing the group by clause in our
query, we may generate other cuboids.
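The relationship between cuboids and the group by clause can be tried out with Python's built-in sqlite3 module. This is only a sketch: the fact rows are made up, the dimension tables are left out because the measures depend only on the sales table, and the helper name cuboid is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (time_key, item_key, branch_key, location_key, "
    "number_of_units_sold, price)"
)
# Hypothetical fact rows: (time, item, branch, location, units, price).
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 10, 100, 1000, 2, 5.0),
        (1, 10, 100, 1000, 1, 5.0),
        (1, 20, 100, 1000, 3, 2.0),
        (2, 10, 200, 2000, 4, 5.0),
    ],
)

def cuboid(group_by):
    """Return dollars_sold and units_sold grouped by the given key columns."""
    keys = ", ".join(group_by)
    select_keys = keys + ", " if group_by else ""
    group_clause = f" GROUP BY {keys} ORDER BY {keys}" if group_by else ""
    sql = (
        f"SELECT {select_keys}SUM(number_of_units_sold * price), "
        f"SUM(number_of_units_sold) FROM sales{group_clause}"
    )
    return conn.execute(sql).fetchall()

# Base cuboid: group by all four dimension keys.
base = cuboid(["time_key", "item_key", "branch_key", "location_key"])
# A 1-D cuboid: group by item only.
by_item = cuboid(["item_key"])
# Apex cuboid: no group by at all, a single grand total.
apex = cuboid([])
```

Only the list of grouping columns changes between the three calls, which is the sense in which each cuboid corresponds to one group by.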
A Concept Hierarchy
A concept hierarchy is an order relation between a set of attributes of a concept or dimension.
It can be generated manually (by users or experts) or automatically (by statistical analysis).
Multidimensional data is usually organized into dimensions, and each dimension is further
defined at lower levels of abstraction by concept hierarchies.
[Figure: concept hierarchy for the dimension location]
The order can be either partial or total:
Location dimension (total order): street < city < state < country
Time dimension (partial order): day < {month < quarter; week} < year
Set-grouping hierarchy:
It is a concept hierarchy among groups of values.
Example: {1..10} < inexpensive
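A concept hierarchy can be represented in code as a chain of mappings from each level to the next higher level. The sketch below is illustrative only: the street names, the "moderate" and "expensive" price groups, and the function names are assumptions; only the street < city < state < country ordering and the {1..10} < inexpensive grouping come from the text.

```python
# Hypothetical location hierarchy: each level maps a value to its parent value.
city_of_street = {"Main St": "Mysore", "MG Road": "Bangalore"}
state_of_city = {"Mysore": "Karnataka", "Bangalore": "Karnataka"}
country_of_state = {"Karnataka": "India"}

# Total order street < city < state < country, expressed as a chain of mappings.
LOCATION_LEVELS = [city_of_street, state_of_city, country_of_state]

def generalize(street, steps):
    """Climb `steps` levels up the location hierarchy starting from a street."""
    value = street
    for mapping in LOCATION_LEVELS[:steps]:
        value = mapping[value]
    return value

# Set-grouping hierarchy: ranges of values grouped under a label.
# Only {1..10} < inexpensive is from the text; the other groups are assumed.
PRICE_GROUPS = [(range(1, 11), "inexpensive"), (range(11, 101), "moderate")]

def price_group(price):
    """Return the label of the group that contains an integer price."""
    for values, label in PRICE_GROUPS:
        if price in values:
            return label
    return "expensive"
```

For example, generalize("Main St", 1) yields the city and generalize("Main St", 3) yields the country.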
3.4 Data Cube operations
OLAP Operations on Multidimensional Data
Sales volume as a function of product, time, and region.
Dimensions and their concept hierarchies: Product, Location, Time
Product: Industry > Category > Product
Location: Region > Country > City > Office
Time: Year > Quarter > Month > Day, with Week as an alternative level between Day and Year
Sales volume as a function of product, month, and region.
[Figure: a sample data cube with axes Product, Month, and Region]
[Figure: cuboids of the sample cube]
[Figure: querying a data cube]
OLAP Operations
Objectives:
OLAP is a powerful analysis tool for:
o Forecasting
o Statistical computations
o Aggregations, etc.
Roll up (drill-up): It is performed by climbing up the concept hierarchy of a dimension or by dimension
reduction (reducing the cube by one or more dimensions). For example, a roll up
on location to the country level is equivalent to grouping the data by country.
Drill down (roll down): It is the reverse of roll-up. It is performed by stepping down a
concept hierarchy for a dimension or introducing new dimensions.
Slice: Slice is the act of picking a rectangular subset of a cube by choosing a single value for
one of its dimensions, creating a new cube with one fewer dimension.
Dice: The dice operation produces a sub cube by allowing the analyst to pick specific values
of multiple dimensions.
Pivot (rotate): Re-orients the cube for an alternative presentation of the data, for example by
transforming a 3-D view into a series of 2-D planes. Pivot allows an analyst to rotate the cube in space to see its
various faces. For example, cities could be arranged vertically and products horizontally
while viewing data for a particular quarter. Pivoting could replace products with time periods
to see data across time for a single product.
Other operations:
Drill across: Executes queries involving (across) more than one fact table.
Drill through: Goes through the bottom level of the cube to its back-end relational tables
(using SQL).
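The roll up, slice, dice, and pivot operations can be sketched on a tiny in-memory cube. This is a minimal illustration in Python, not how an OLAP server implements them: the cell values, the city-to-country mapping, and the function names are all assumptions.

```python
from collections import defaultdict

# Hypothetical cube cells: (product, city, quarter) -> sales volume.
cube = {
    ("TV", "Mysore", "Q1"): 10,
    ("TV", "Bangalore", "Q1"): 5,
    ("TV", "Mysore", "Q2"): 20,
    ("TV", "Toronto", "Q1"): 30,
    ("PC", "Mysore", "Q1"): 40,
    ("PC", "Toronto", "Q2"): 50,
}
country_of = {"Mysore": "India", "Bangalore": "India", "Toronto": "Canada"}

def roll_up_location(cells):
    """Roll up: climb the location hierarchy from city to country."""
    out = defaultdict(int)
    for (product, city, quarter), value in cells.items():
        out[(product, country_of[city], quarter)] += value
    return dict(out)

def slice_quarter(cells, quarter):
    """Slice: fix one dimension to a single value and drop that dimension."""
    return {(p, c): v for (p, c, q), v in cells.items() if q == quarter}

def dice(cells, products, cities, quarters):
    """Dice: keep cells whose coordinates fall in the chosen value sets."""
    return {
        key: v for key, v in cells.items()
        if key[0] in products and key[1] in cities and key[2] in quarters
    }

def pivot(two_d):
    """Pivot: swap the two axes of a 2-D view."""
    return {(b, a): v for (a, b), v in two_d.items()}

rolled = roll_up_location(cube)
q1 = slice_quarter(cube, "Q1")
diced = dice(cube, {"TV"}, {"Mysore", "Toronto"}, {"Q1"})
rotated = pivot(q1)
```

Drill down is the inverse of roll_up_location and needs the more detailed cells, so it cannot be computed from the rolled-up result alone.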
Starnet Query Model for Multidimensional Databases
Each radial line represents a dimension. Each abstraction level in a concept hierarchy is
called a footprint.
3.5 Implementation of OLAP
Multi Dimensional OLAP and Relational OLAP
In this section, we will compare OLAP implementations using traditional relational star
schemas and multidimensional databases.
Multi Dimensional OLAP
MOLAP (Multidimensional OLAP): OLAP done using an MDBMS (the traditional
approach).
Multidimensional Database Management Systems (MDBMS) are used to define
and store data cubes in special data structures: dimensions and cubes.
MDBMS have special storage structures and access routines (typically array based)
to efficiently store and retrieve data cubes.
Advantages: Powerful, efficient database engines for manipulating data cubes
(including indexes, access routines, etc.).
Disadvantages: MDBMS use proprietary database structures and DML (e.g., not SQL).
They require a different skill set, modeling tools, etc., and another vendor’s DBMS (or
feature subset) must be maintained. They are not designed for transaction processing; for
example, updating existing data is inefficient.
Many commercial MOLAP systems are tightly integrated with reporting and analysis
tools (BI tool sets).
Some commercial MOLAP servers also support Relational OLAP.
Example proprietary MOLAP databases and BI tool sets:
o Oracle Essbase (also supports Relational OLAP)
o IBM Cognos TM1
o MicroStrategy (also supports Relational OLAP)
Relational OLAP
ROLAP (Relational OLAP): OLAP done on a relational DBMS.
Advantages:
Uses familiar RDBMS technologies and products.
Uses familiar SQL.
Existing skill base and tools.
(Possibly) deal with only one vendor’s DBMS for OLTP and OLAP.
Disadvantages:
Historically inefficient implementations (although they have improved
considerably over time).
Example ROLAP databases and BI tool sets:
o Oracle OLAP
o Microsoft Analysis Services
o Oracle Essbase (also supports MOLAP)
o Mondrian (Open Source) offered by Pentaho
o MicroStrategy (also supports MOLAP)
Oracle OLAP Implementations
Oracle RDBMS has supported OLAP structures in both a proprietary MOLAP
implementation (within the relational database system) and as relational OLAP cubes.
[Figure: Oracle OLAP architecture]
Implementation Techniques for OLAP
Data Warehouse Implementation
Objectives:
o Monitoring: Sending data from sources
o Integrating: Loading, cleansing, ...
o Processing: Efficient cube computation, query processing in general,
indexing, ...
Cube Computation: One approach extends SQL with a compute cube operator. The cube
operator is the n-dimensional generalization of the SQL group-by clause. OLAP needs to
compute the cuboid corresponding to each input query. Pre-computation: for fast response time,
it seems a good idea to pre-compute data for all cuboids, or at least a subset of cuboids, since
the number of cuboids is 2^n for an n-dimensional cube without hierarchies and, when
dimension i has L_i hierarchy levels, T = (L_1 + 1) × (L_2 + 1) × … × (L_n + 1).
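How quickly the number of cuboids grows can be checked with a short Python sketch. It assumes the standard counting result that a cube whose dimension i has L_i hierarchy levels contains (L_1 + 1) × … × (L_n + 1) cuboids, where the extra 1 stands for the virtual top level "all". The function name and the example sizes are illustrative.

```python
from math import prod

def count_cuboids(levels_per_dimension):
    """Total cuboids: product of (L_i + 1) over all dimensions.

    L_i is the number of hierarchy levels of dimension i; the +1 accounts
    for the virtual top level "all" (the dimension aggregated away).
    """
    return prod(levels + 1 for levels in levels_per_dimension)

# Three dimensions with no hierarchy (one level each): 2^3 = 8 cuboids.
flat = count_cuboids([1, 1, 1])
# Assumed example: 10 dimensions with 4 levels each gives 5^10 cuboids.
deep = count_cuboids([4] * 10)
```

With ten dimensions of four levels each the count is already close to ten million, which is why full materialization is rarely practical.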
Materialization of data cube:
Store in the warehouse results that are useful for common queries.
Pre-compute some cuboids: this is equivalent to defining new warehouse relations
using SQL expressions.
Materialize every cuboid (full materialization), none (no materialization), or
some (partial materialization).
Select which cuboids to materialize based on size, sharing, access frequency,
etc.
Cube Operation
Cube definition and computation in DMQL
define cube sales[item, city, year]: sum(sales_in_dollars)
compute cube sales
Transform it into an SQL-like language (with a new operator, cube by, introduced by
Gray et al. ’96)
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year
We need to compute the following group-bys:
(item, city, year),
(item, city), (item, year), (city, year),
(item), (city), (year),
()
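For three dimensions the cube operator implies 2^3 = 8 group-bys, one per subset of the dimensions. The following Python sketch enumerates them and computes SUM(amount) for each one naively, by one pass over the rows per group-by. The rows and function names are assumptions; real systems reuse previously computed aggregates instead of rescanning the base table.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("item", "city", "year")

# Hypothetical rows of the SALES table: (item, city, year, amount).
rows = [
    ("TV", "Mysore", 2013, 100),
    ("TV", "Mysore", 2014, 150),
    ("PC", "Mangalore", 2014, 200),
]

def all_group_bys(dims):
    """Every subset of the dimensions, from the base cuboid down to ()."""
    return [c for k in range(len(dims), -1, -1) for c in combinations(dims, k)]

def compute_cube(rows, dims):
    """Compute SUM(amount) for every group-by; returns {group_by: {key: sum}}."""
    cube = {}
    for group_by in all_group_bys(dims):
        positions = [dims.index(d) for d in group_by]
        totals = defaultdict(int)
        for row in rows:
            key = tuple(row[p] for p in positions)
            totals[key] += row[-1]  # the amount is the last column
        cube[group_by] = dict(totals)
    return cube

cube = compute_cube(rows, DIMENSIONS)
```

The entry cube[()] is the apex cuboid (the grand total), and cube[("item", "city", "year")] is the base cuboid.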
Cube Computation Methods
ROLAP-based cubing: Sorting, hashing, and grouping operations are applied to the
dimension attributes in order to reorder and cluster related tuples. Grouping is performed on
some subaggregates as a “partial grouping step.” Aggregates may be computed from
previously computed aggregates, rather than from the base fact table.
MOLAP Approach
Uses an array-based algorithm.
The base cuboid is stored as a multidimensional array.
A number of cells are read in to compute partial cuboids.
Indexing OLAP Data: Bitmap Index
Approach:
Index on a particular column
Each value in the column has a bit vector: bit-op is fast
The length of the bit vector: # of records in the base table
The i-th bit is set if the i-th row of the base table has the value for the indexed column
Not suitable for high cardinality domains
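The approach above can be sketched in Python by using an integer as the bit vector, with bit i standing for row i of the base table. The table contents and function names are hypothetical, and production systems use compressed bitmaps rather than plain integers.

```python
def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int used as a bit vector.

    Bit i is set when row i of the base table holds that value.
    """
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index

def matching_rows(bits):
    """Decode a bit vector back into a list of row numbers."""
    return [i for i in range(bits.bit_length()) if bits >> i & 1]

# Hypothetical base table.
base_table = [
    {"cust": "C1", "region": "Asia", "type": "Retail"},
    {"cust": "C2", "region": "Europe", "type": "Dealer"},
    {"cust": "C3", "region": "Asia", "type": "Dealer"},
    {"cust": "C4", "region": "America", "type": "Retail"},
    {"cust": "C5", "region": "Europe", "type": "Dealer"},
]

region_index = build_bitmap_index(base_table, "region")
type_index = build_bitmap_index(base_table, "type")

# region = 'Asia' AND type = 'Dealer' becomes a single bitwise AND.
asia_dealers = matching_rows(region_index["Asia"] & type_index["Dealer"])
```

Each distinct value costs one bit per row of the base table, which illustrates why the technique does not suit high-cardinality columns.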
[Figure: example base table for a bitmap index]
Indexing OLAP Data: Join Indices
Join index:
JI(R-id, S-id)
where R (R-id, …) ⋈ S (S-id, …)
Traditional indices map the values to a list of record ids. A join index materializes the relational join in a JI
file and speeds up the relational join, a rather costly operation. In data warehouses, a join index
relates the values of the dimensions of a star schema to rows in the fact table.
E.g., fact table Sales and two dimensions, city and product:
A join index on city maintains, for each distinct city, a list of R-IDs of the tuples
recording the sales in that city.
Join indices can span multiple dimensions.
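A join index on city can be sketched as a precomputed mapping from each city to the R-IDs of its fact rows. In this Python sketch the table contents, key names, and the function build_join_index are assumptions, and the position of a row in the list plays the role of its R-ID.

```python
from collections import defaultdict

# Hypothetical star-schema fragment: a city dimension and a Sales fact table.
city_dim = {1: "Mysore", 2: "Mangalore"}  # city_key -> city name
sales_fact = [  # the list position serves as the R-ID
    {"city_key": 1, "product_key": 7, "amount": 100},
    {"city_key": 2, "product_key": 7, "amount": 250},
    {"city_key": 1, "product_key": 9, "amount": 50},
]

def build_join_index(fact_rows, dimension, key):
    """Precompute the join: dimension value -> list of fact-table R-IDs."""
    join_index = defaultdict(list)
    for rid, row in enumerate(fact_rows):
        join_index[dimension[row[key]]].append(rid)
    return dict(join_index)

city_ji = build_join_index(sales_fact, city_dim, "city_key")

# Answer "total sales in Mysore" by following R-IDs, with no join at query time.
mysore_total = sum(sales_fact[rid]["amount"] for rid in city_ji["Mysore"])
```

A join index spanning two dimensions could use (city, product) pairs as keys in the same way.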
3.6 Overview on OLAP Software
What is Online Analytical Processing (OLAP)?
Online Analytical Processing (OLAP) databases facilitate business-intelligence queries.
OLAP is a database technology that has been optimized for querying and reporting, instead of
processing transactions. The source data for OLAP is Online Transactional Processing
(OLTP) databases that are commonly stored in data warehouses. OLAP data is derived from
this historical data, and aggregated into structures that permit sophisticated analysis. OLAP
data is also organized hierarchically and stored in cubes instead of tables. It is a sophisticated
technology that uses multidimensional structures to provide rapid access to data for analysis.
This organization makes it easy for a PivotTable report or PivotChart report to display high-level
summaries, such as sales totals across an entire country or region, and also display the
details for sites where sales are particularly strong or weak.
OLAP databases are designed to speed up the retrieval of data. Because the OLAP server,
rather than Microsoft Office Excel, computes the summarized values, less data needs to be
sent to Excel when you create or change a report. This approach enables you to work with
much larger amounts of source data than you could if the data were organized in a traditional
database, where Excel retrieves all of the individual records and then calculates the
summarized values.
OLAP databases contain two basic types of data: measures, which are numeric data, the
quantities and averages that you use to make informed business decisions, and dimensions,
which are the categories that you use to organize these measures. OLAP databases help
organize data by many levels of detail, using the same categories that you are familiar with to
analyze the data.
The following sections describe each of these components in more detail:
Cube: A data structure that aggregates the measures by the levels and hierarchies of
each of the dimensions that you want to analyze. Cubes combine several dimensions,
such as time, geography, and product lines, with summarized data, such as sales or
inventory figures. Cubes are not "cubes" in the strictly mathematical sense because
they do not necessarily have equal sides. However, they are an apt metaphor for a
complex concept.
Measure: A set of values in a cube that are based on a column in the cube's fact table
and that are usually numeric values. Measures are the central values in the cube that
are preprocessed, aggregated, and analyzed. Common examples include sales, profits,
revenues, and costs.
Member: An item in a hierarchy representing one or more occurrences of data. A
member can be either unique or non-unique. For example, 2007 and 2008 represent
unique members in the year level of a time dimension, whereas January represents
non-unique members in the month level because there can be more than one January
in the time dimension if it contains data for more than one year.
Calculated member: A member of a dimension whose value is calculated at run time
by using an expression. Calculated member values may be derived from other
members' values. For example, a calculated member, Profit, can be determined by
subtracting the value of the member, Costs, from the value of the member, Sales.
Dimension: A set of one or more organized hierarchies of levels in a cube that a user
understands and uses as the base for data analysis. For example, a geography
dimension might include levels for Country/Region, State/Province, and City. Or, a
time dimension might include a hierarchy with levels for year, quarter, month, and
day. In a PivotTable report or PivotChart report, each hierarchy becomes a set of
fields that you can expand and collapse to reveal lower or higher levels.
Hierarchy: A logical tree structure that organizes the members of a dimension such
that each member has one parent member and zero or more child members. A child is
a member in the next lower level in a hierarchy that is directly related to the current
member. For example, in a Time hierarchy containing the levels Quarter, Month, and
Day, January is a child of Qtr1. A parent is a member in the next higher level in a
hierarchy that is directly related to the current member. The parent value is usually a
consolidation of the values of all of its children. For example, in a Time hierarchy that
contains the levels Quarter, Month, and Day, Qtr1 is the parent of January.
Level: Within a hierarchy, data can be organized into lower and higher levels of
detail, such as Year, Quarter, Month, and Day levels in a Time hierarchy.
OLAP features in Excel
Retrieving OLAP data: You can connect to OLAP data sources just as you do to other
external data sources. You can work with databases that are created with Microsoft SQL
Server OLAP Services version 7.0, Microsoft SQL Server Analysis Services version 2000,
and Microsoft SQL Server Analysis Services version 2005, the Microsoft OLAP server
products. Excel can also work with third-party OLAP products that are compatible with
OLE-DB for OLAP.
You can display OLAP data only as a PivotTable report or PivotChart report or in a
worksheet function converted from a PivotTable report, but not as an external data range.
You can save OLAP PivotTable reports and PivotChart reports in report templates, and you
can create Office Data Connection (ODC) files (.odc) to connect to OLAP databases for
OLAP queries. When you open an ODC file, Excel displays a blank PivotTable report, which
is ready for you to lay out.
Creating cube files for offline use: You can create an offline cube file (.cub) with a subset
of the data from an OLAP server database. Use offline cube files to work with OLAP data
when you are not connected to your network. A cube enables you to work with larger
amounts of data in a PivotTable report or PivotChart report than you could otherwise, and
speeds retrieval of the data. You can create cube files only if you use an OLAP provider, such
as Microsoft SQL Server Analysis Services version 2005, that supports this feature.
Server Actions: A server action is an optional but useful feature that an OLAP cube
administrator can define on a server that uses a cube member or measure as a parameter into
a query to obtain details in the cube, or to start another application, such as a browser. Excel
supports URL, Report, Rowset, Drill Through, and Expand to Detail server actions, but it
does not support Proprietary, Statement, and Dataset server actions.
KPIs: A KPI is a special calculated measure that is defined on the server that allows you to
track "key performance indicators" including status (Does the current value meet a specific
number?) and trend (what is the value over time?). When these are displayed, the Server can
send related icons that are similar to the new Excel icon set to indicate above or below status
levels (such as a Stop light icon) or whether a value is trending up or down (such as a
directional arrow icon).
Server Formatting: Cube administrators can create measures and calculated members with
color formatting, font formatting, and conditional formatting rules, that may be designated as
a corporate standard business rule. For example, a server format for profit might be a number
format of currency, a cell color of green if the value is greater than or equal to 30,000 and red
if the value is less than 30,000, and a font style of bold if the value is less than 30,000 and
regular if greater than or equal to 30,000.
Office display language: A cube administrator can define translations for data and errors on
the server for users who need to see PivotTable information in another language. This feature
is defined as a file connection property and the user's computer country/regional setting must
correspond to the display language.
Software components that you need to access OLAP data sources
An OLAP provider: To set up OLAP data sources for Excel, you need one of the
following OLAP providers:
Microsoft OLAP provider: Excel includes the data source driver and client
software that you need to access databases created with Microsoft SQL Server OLAP
Services version 7.0, Microsoft SQL Server Analysis Services version 2000 (8.0), and
Microsoft SQL Server Analysis Services version 2005 (9.0).
Third-party OLAP providers: For other OLAP products, you need to install
additional drivers and client software. To use the Excel features for working with
OLAP data, the third-party product must conform to the OLE-DB for OLAP standard
and be Microsoft Office compatible. For information about installing and using a
third-party OLAP provider, consult your system administrator or the vendor for your
OLAP product.
Server databases and cube files: The Excel OLAP client software supports connections to
two types of OLAP databases. If a database on an OLAP server is available on your network,
you can retrieve source data from it directly. If you have an offline cube file that contains
OLAP data or a cube definition file, you can connect to that file and retrieve source data from
it.
Data sources: A data source gives you access to all of the data in the OLAP database or
offline cube file. After you create an OLAP data source, you can base reports on it, and return
the OLAP data to Excel in the form of a PivotTable report or PivotChart report, or in a
worksheet function converted from a PivotTable report.
Microsoft Query: You can use Query to retrieve data from an external database such as
Microsoft SQL Server or Microsoft Access. You do not need to use Query to retrieve data from an
OLAP PivotTable that is connected to a cube file.
IBM OLAP
The online analytical processing (OLAP) in IBM® Cognos® Enterprise software makes data
available for users to explore, query and analyze on their own in interactive workspaces. The
Cognos platform, the foundation for Cognos Enterprise, offers different OLAP options to
meet different needs while providing a consistent user experience and accelerated
performance.
Cognos Enterprise software provides OLAP capabilities for:
Write-back, what-if analysis, planning and budgeting, or other specialized
applications. IBM Cognos TM1® is a 64-bit, in-memory OLAP engine designed to
meet these needs.
Querying a data warehouse that is structured in a star or snowflake schema. A Cognos
dynamic cube is an OLAP component designed to accelerate performance over
terabytes of data in relational databases.
3.7 Summary
The cube is used to represent data along some measure of interest. Although called a
"cube", it can be 2-dimensional, 3-dimensional, or higher-dimensional. Each
dimension represents some attribute in the database and the cells in the data cube
represent the measure of interest.
Implementation of the data cube is one of the most important, albeit “expensive,”
processes in On-Line Analytical Processing (OLAP). A data warehouse is based on a
multidimensional data model which views data in the form of a data cube.
Online Analytical Processing (OLAP) databases facilitate business-intelligence
queries. OLAP is a database technology that has been optimized for querying and
reporting, instead of processing transactions. The source data for OLAP is Online
Transactional Processing (OLTP) databases that are commonly stored in data
warehouses. OLAP data is derived from this historical data, and aggregated into
structures that permit sophisticated analysis.
3.8 Keywords
Data Cube, Star schema, Snowflake schema, Multi Dimensional OLAP, Relational OLAP,
Cube Operation, ROLAP, MOLAP, IBM OLAP.
3.9 Exercises
1. Explain conceptual modeling of data warehousing.
2. What are data measures? Explain.
3. How are data cube measures computed?
4. Explain OLAP operations.
5. Explain the implementation of OLAP.
6. What are the advantages and disadvantages of Relational OLAP?
7. Write a note on OLAP software.
3.10 References
1. Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber,
Morgan Kaufmann Publisher, Second Edition, 2006.
2. Data Mining Techniques by Arun K Pujari, University Press, Second Edition,
2009.
UNIT-4: BASICS OF DATA MINING
Structure
4.1 Objectives
4.2 Introduction
4.3 Challenges of Data Mining
4.4 Data Mining Tasks
4.5 Types of Data
4.6 Data Pre-processing
4.7 Measures of Similarity and Dissimilarity
4.8 Data Mining Applications
4.7 Summary
4.8 Keywords
4.9 Exercises
4.10 References
4.1 Objectives
The objectives covered under this unit include:
Introduction to data mining
Challenges of Data Mining
Data Mining Tasks
Types of Data
Data Pre-processing
Measures of Similarity and Dissimilarity
Data Mining Applications
4.2 Introduction
Data mining is “the process of discovering meaningful new correlations, patterns and trends by
sifting through large amounts of data stored in repositories, using pattern recognition
technologies as well as statistical and mathematical techniques.” There are other definitions:
Data mining is the analysis of (often large) observational data sets to find unsuspected
relationships and to summarize the data in novel ways that are both understandable
and useful to the data owner.
Data mining is an interdisciplinary field bringing together techniques from machine
learning, pattern recognition, statistics, databases, and visualization to address the
issue of information extraction from large data bases.
Data mining is predicted to be “one of the most revolutionary developments of the next
decade,” according to the online technology magazine ZDNET. In fact, the MIT Technology
Review chose data mining as one of 10 emerging technologies that will change the world.
“Data mining expertise is the most sought after...” among information technology
professionals, according to the 1999 Information Week National Salary Survey. The survey
reports: “Data mining skills are in high demand this year, as organizations increasingly put
data repositories online. Effectively analyzing information from customers, partners, and
suppliers has become important to more companies. ‘Many companies have implemented a
data warehouse strategy and are now starting to look at what they can do with all that data.’”
4.3 Challenges in Data Mining
Developing a Unifying Theory of Data Mining:
Several respondents feel that the current state of the art of data mining research is too “ad hoc.”
Many techniques are designed for individual problems, such as classification or
clustering, but there is no unifying theory. A theoretical framework that unifies
different data mining tasks, including clustering, classification, and association rules, as
well as different data mining approaches (such as statistics, machine learning, and database
systems), would help the field and provide a basis for future research.
Scaling Up for High Dimensional Data and High Speed Data Streams:
One challenge is how to design classifiers to handle ultra-high dimensional
classification problems. There is a strong need now to build useful classifiers with
hundreds of millions or billions of features, for applications such as text mining and drug
safety analysis. Such problems often begin with tens of thousands of features and also with
interactions between the features, so the number of implied features gets huge quickly.
One important problem is mining data streams in extremely large databases (e.g. 100TB).
Satellite and computer network data can easily be of this scale. However, today’s data
mining technology is still too slow to handle data of this scale. In addition, data mining
should be a continuous, online process, rather than an occasional one-shot process.
Organizations that can do this will have a decisive advantage over ones that do not. Data
streams present a new challenge for data mining researchers.
Mining Sequence Data and Time Series Data:
Sequential and time series data mining remains an important problem. Despite progress in
other related fields, how to efficiently cluster, classify and predict the trends of these data
is still an important open topic. A particularly challenging problem is the noise in time
series data, which remains an important open issue. Many time series used for predictions
are contaminated by noise, making it difficult to do accurate short-term and long-term
predictions. Examples of these applications include the predictions of financial time series
and seismic time series. Although signal processing techniques, such as wavelet analysis
and filtering, can be applied to remove the noise, they often introduce lags in the filtered
data. Such lags reduce the accuracy of predictions because the predictor must overcome
the lags before it can predict into the future. Existing data mining methods also have
difficulty in handling noisy data and learning meaningful information from the data.
Mining Complex Knowledge from Complex Data:
One important type of complex knowledge is in the form of graphs. Recent research has
touched on the topic of discovering graphs and structured patterns from large data, but
clearly, more needs to be done. Another form of complexity is from data that are non-i.i.d.
(independent and identically distributed). This problem can occur when mining data from
multiple relations. In most domains, the objects of interest are not independent of each
other, and are not of a single type. We need data mining systems that can soundly mine the
rich structure of relations among objects, such as interlinked Web pages, social networks,
metabolic networks in the cell, etc. Yet another important problem is how to mine
non-relational data. A great majority of most organizations’ data is in text form, not databases,
and in more complex data formats including Image, Multimedia, and Web data. Thus,
there is a need to study data mining methods that go beyond classification and clustering.
Some interesting questions include how to perform better automatic summarization of text
and how to recognize the movement of objects and people from Web and Wireless data
logs in order to discover useful spatial and temporal knowledge.
Distributed Data Mining and Mining Multi-Agent Data:
The problem of distributed data mining is very important in network problems. In a
distributed environment (such as a sensor or IP network), one has distributed probes
placed at strategic locations within the network. The problem here is to be able to correlate
the data seen at the various probes, and discover patterns in the global data seen at all the
different probes. There could be different models of distributed data mining here, but one
could involve a network operations center (NOC) that collects data from the distributed sites, and another in which all
sites are treated equally. The goal here obviously would be to minimize the amount of data
shipped between the various sites — essentially, to reduce the communication overhead.
Security, Privacy, and Data Integrity:
Several researchers considered privacy protection in data mining as an important topic,
that is, how to ensure the users’ privacy while their data are being mined. Related to this
topic is data mining for protection of security and privacy.
One respondent states that if we do not solve the privacy issue, data mining will become a
derogatory term to the general public. Some respondents consider the problem of
knowledge integrity assessment to be important. We quote their observations: “Data
mining algorithms are frequently applied to data that have been intentionally modified
from their original version, in order to misinform the recipients of the data or to counter
privacy and security threats. Such modifications can distort, to an unknown extent, the
knowledge contained in the original data. As a result, one of the challenges facing
researchers is the development of measures not only to evaluate the knowledge integrity of
a collection of data, but also of measures to evaluate the knowledge integrity of individual
patterns. Additionally, the problem of knowledge integrity assessment presents several
challenges.”
4.4 Data Mining Tasks
The following list shows the most common data mining tasks.
Description
Estimation
Prediction
Classification
Clustering
Association
Description:
Sometimes, researchers and analysts are simply trying to find ways to describe patterns and
trends lying within data. For example, a pollster may uncover evidence that those who have
been laid off are less likely to support the present incumbent in the presidential election.
Descriptions of patterns and trends often suggest possible explanations for such patterns and
trends. For example, those who are laid off are now less well off financially than before the
incumbent was elected, and so would tend to prefer an alternative. Data mining models
should be as transparent as possible. That is, the results of the data mining model should
describe clear patterns that are amenable to intuitive interpretation and explanation. Some
data mining methods are more suited than others to transparent interpretation. For example,
decision trees provide an intuitive and human-friendly explanation of their results. On the
other hand, neural networks are comparatively opaque to nonspecialists, due to the
nonlinearity and complexity of the model.
Estimation:
Estimation is similar to classification except that the target variable is numerical rather than
categorical. Models are built using "complete" records, which provide the value of the target
variable as well as the predictors. Then, for new observations, estimates of the value of the
target variable are made, based on the values of the predictors.
For example, we might be interested in estimating the systolic blood pressure reading of a
hospital patient, based on the patient's age, gender, body-mass index, and blood sodium
levels. The relationship between systolic blood pressure and the predictor variables in the
training set would provide us with an estimation model. We can then apply that model to new
cases.
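The idea of building an estimation model from complete records and applying it to a new case can be sketched with a one-predictor least-squares line. This is only an illustration: the ages and blood pressure readings are hypothetical, and a real model would use all the predictors mentioned above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical training set: "complete" records with predictor and target.
ages = [30, 40, 50, 60]
pressures = [115, 120, 125, 130]

intercept, slope = fit_line(ages, pressures)

# Apply the fitted model to a new patient whose target value is unknown.
estimate = intercept + slope * 45
print(estimate)
```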
Examples of estimation tasks in business and research include:
Estimating the amount of money a randomly chosen family of four will spend for
back-to-school shopping this fall.
Estimating the percentage decrease in rotary-movement sustained by a National
Football League running back with a knee injury.
Estimating the number of points per game that Patrick Ewing will score when double-teamed in the playoffs.
Prediction:
Prediction is similar to classification and estimation, except that for prediction, the results lie
in the future. Examples of prediction tasks in business and research include:
Predicting the price of a stock three months into the future.
Predicting the percentage increase in traffic deaths next year if the speed limit is
increased
Predicting the winner of this fall's baseball World Series, based on a comparison of
team statistics
Predicting whether a particular molecule in drug discovery will lead to a profitable
new drug for a pharmaceutical company
Any of the methods and techniques used for classification and estimation may also be used,
under appropriate circumstances, for prediction. These include the traditional statistical
methods of point estimation and confidence interval estimation, simple linear regression and
correlation, and multiple regression.
Classification:
In classification, there is a target categorical variable, such as income bracket, which, for
example, could be partitioned into three classes or categories: high income, middle income,
and low income. The data mining model examines a large set of records, each record
containing information on the target variable as well as a set of input or predictor variables.
For example, consider the excerpt from a data set.
Suppose that the researcher would like to be able to classify the income brackets of persons
not currently in the database, based on other characteristics associated with that person, such
as age, gender, and occupation. This task is a classification task, very nicely suited to data
mining methods and techniques. The algorithm would proceed roughly as follows. First,
examine the data set containing both the predictor variables and the (already classified) target
variable, income bracket. In this way, the algorithm (software) "learns about" which
combinations of variables are associated with which income brackets. For example, older
females may be associated with the high-income bracket. This data set is called the training
set. Then the algorithm would look at new records, for which no information about income
bracket is available. Based on the classifications in the training set, the algorithm would
assign classifications to the new records. For example, a 63-year-old female professor might
be classified in the high-income bracket.
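The two phases described above, learning from a training set and then labelling new records, can be sketched with a small nearest-neighbour classifier. The text does not prescribe this algorithm; it is used here only as one simple possibility, and the records, the age scale of 50 years, and the choice k = 3 are all assumptions.

```python
from collections import Counter

# Hypothetical training set: (age, gender, occupation, income_bracket).
training_set = [
    (47, "F", "software engineer", "high"),
    (28, "M", "marketing consultant", "middle"),
    (35, "M", "unemployed", "low"),
    (61, "F", "professor", "high"),
    (24, "F", "student", "low"),
    (52, "M", "manager", "middle"),
]

def distance(a, b):
    """Scaled age gap plus one unit per categorical mismatch."""
    age_gap = abs(a[0] - b[0]) / 50.0  # 50 years is an assumed scale
    mismatches = (a[1] != b[1]) + (a[2] != b[2])
    return age_gap + mismatches

def classify(record, k=3):
    """Assign the majority income bracket among the k nearest training records."""
    nearest = sorted(training_set, key=lambda row: distance(record, row))[:k]
    votes = Counter(row[3] for row in nearest)
    return votes.most_common(1)[0][0]

# A new record with no income bracket recorded.
print(classify((63, "F", "professor")))
```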
Examples of classification tasks in business and research include:
Determining whether a particular credit card transaction is fraudulent
Placing a new student into a particular track with regard to special needs
Assessing whether a mortgage application is a good or bad credit risk
Diagnosing whether a particular disease is present
Determining whether a will was written by the actual deceased, or fraudulently by
someone else
Identifying whether or not certain financial or personal behaviour indicates a possible
terrorist threat.
Clustering:
Clustering refers to the grouping of records, observations, or cases into classes of similar
objects. A cluster is a collection of records that are similar to one another, and dissimilar to
records in other clusters. Clustering differs from classification in that there is no target
variable for clustering. The clustering task does not try to classify, estimate, or predict the
value of a target variable. Instead, clustering algorithms seek to segment the entire data set
into relatively homogeneous subgroups or clusters, where the similarity of the records within
the cluster is maximized and the similarity to records outside the cluster is minimized.
Examples of clustering tasks in business and research include:
Target marketing of a niche product for a small-capitalization business that does not
have a large marketing budget
For accounting auditing purposes, to segment financial behaviour into benign and
suspicious categories
As a dimension-reduction tool when the data set has hundreds of attributes
For gene expression clustering, where very large quantities of genes may exhibit
similar behaviour.
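As a rough illustration of segmenting records without a target variable, the sketch below runs k-means on a single numeric attribute. k-means is only one of many clustering algorithms and is not named in the text; the income values and the starting centroids are hypothetical.

```python
def k_means_1d(values, centroids, iterations=10):
    """Lloyd's algorithm on one numeric attribute, from given starting centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign each value to its closest centroid.
            nearest = min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters, centroids

incomes = [21, 23, 25, 58, 60, 62]  # hypothetical incomes, in thousands
clusters, centroids = k_means_1d(incomes, centroids=[21, 62])
print(clusters, centroids)
```

Records inside each resulting group are close to one another and far from the other group, which is the homogeneity property described above.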
Association:
The association task for data mining is the job of finding which attributes "go together."
Most prevalent in the business world, where it is known as affinity analysis or market basket
analysis, the task of association seeks to uncover rules for quantifying the relationship
between two or more attributes. Association rules are of the form "If antecedent, then
consequent," together with a measure of the support and confidence associated with the rule.
For example, a particular supermarket may find that of the 1000 customers shopping on a
Thursday night, 200 bought diapers, and of those 200 who bought diapers, 50 bought beer.
Thus, the association rule would be "If buy diapers, then buy beer" with a support of
50/1000 = 5% (the share of all customers who bought both items) and a confidence of
50/200 = 25%.
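The supermarket counts can be checked with a short sketch. It assumes the usual definitions: the support of a rule is the fraction of all transactions containing both antecedent and consequent, and the confidence is the fraction of antecedent transactions that also contain the consequent. The fraction of transactions containing the antecedent alone (200/1000) is reported separately. The "milk" baskets are hypothetical filler for the customers who bought no diapers.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support and confidence for the rule antecedent -> consequent."""
    n = len(transactions)
    with_antecedent = [t for t in transactions if antecedent <= t]
    with_both = [t for t in with_antecedent if consequent <= t]
    return {
        "antecedent_support": len(with_antecedent) / n,
        "rule_support": len(with_both) / n,
        "confidence": len(with_both) / len(with_antecedent),
    }

# 1000 baskets: 200 contain diapers, and 50 of those also contain beer.
transactions = (
    [{"diapers", "beer"}] * 50
    + [{"diapers"}] * 150
    + [{"milk"}] * 800
)

metrics = rule_metrics(transactions, {"diapers"}, {"beer"})
print(metrics)
```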
Examples of association tasks in business and research include:
Investigating the proportion of subscribers to a company's cell phone plan that
respond positively to an offer of a service upgrade
Examining the proportion of children whose parents read to them who are themselves
good readers
Predicting degradation in telecommunications networks
Finding out which items in a supermarket are purchased together and which items are
never purchased together
Determining the proportion of cases in which a new drug will exhibit dangerous side
effects.
4.5 Types of Data
Data: As a general technology, data mining can be applied to any kind of data as long as the
data are meaningful for a target application.
Database Data:
A database system, also called a database management system (DBMS), consists of a
collection of interrelated data, known as a database, and a set of software programs to
manage and access the data. The software programs provide mechanisms for defining
database structures and data storage; for specifying and managing concurrent, shared, or
distributed data access; and for ensuring consistency and security of the information stored
despite system crashes or attempts at unauthorized access.
A relational database is a collection of tables, each of which is assigned a unique name. Each
table consists of a set of attributes (columns or fields) and usually stores a large set of tuples
(records or rows). Each tuple in a relational table represents an object identified by a unique
key and described by a set of attribute values. A semantic data model, such as an
entity-relationship (ER) data model, is often constructed for relational databases. An ER data model
represents the database as a set of entities and their relationships.
Data Warehouses:
A data warehouse is a repository of information collected from multiple sources, stored under
a unified schema, and usually residing at a single site. Data warehouses are constructed via a
process of data cleaning, data integration, data transformation, data loading, and periodic data
refreshing. A data warehouse is usually modeled by a multidimensional data structure, called
a data cube, in which each dimension corresponds to an attribute or a set of attributes in the
schema, and each cell stores the value of some aggregate measure such as count or
sum(sales_amount). A data cube provides a multidimensional view of data and allows the
precomputation and fast access of summarized data. Although data warehouse tools help support
data analysis, additional tools for data mining are often needed for in-depth analysis.
Multidimensional data mining (also called exploratory multidimensional data mining)
performs data mining in multidimensional space in an OLAP style. That is, it allows the
exploration of multiple combinations of dimensions at varying levels of granularity in data
mining, and thus has greater potential for discovering interesting patterns representing
knowledge.
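To make the data cube idea concrete, the following sketch sums a sales measure over different combinations of dimensions, each combination giving one summarized view of the same facts. The rows, the dimension names, and the function `aggregate` are hypothetical; a real warehouse would precompute and store such aggregates.

```python
from collections import defaultdict

# Hypothetical fact rows: (item, branch, quarter, sales_amount).
sales = [
    ("TV", "Mysore", "Q1", 400.0),
    ("TV", "Mysore", "Q2", 300.0),
    ("TV", "Mangalore", "Q1", 250.0),
    ("Phone", "Mysore", "Q1", 150.0),
]
DIMENSIONS = ("item", "branch", "quarter")

def aggregate(rows, group_by):
    """Sum sales_amount over the chosen dimensions; each key is one cell."""
    positions = [DIMENSIONS.index(d) for d in group_by]
    cells = defaultdict(float)
    for row in rows:
        key = tuple(row[p] for p in positions)
        cells[key] += row[3]
    return dict(cells)

by_item = aggregate(sales, ("item",))                    # coarse granularity
by_item_quarter = aggregate(sales, ("item", "quarter"))  # finer granularity
grand_total = aggregate(sales, ())                       # all dimensions rolled up
print(by_item, by_item_quarter, grand_total)
```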
Transactional Data:
In general, each record in a transactional database captures a transaction, such as a customer's
purchase, a flight booking, or a user's clicks on a web page. A transaction typically includes a
unique transaction identity number (trans_ID) and a list of the items making up the
transaction, such as the items purchased in the transaction. A transactional database may have
additional tables, which contain other information related to the transactions, such as item
description, information about the salesperson or the branch, and so on.
Other kinds of Data:
Besides relational database data, data warehouse data, and transaction data, there are many
other kinds of data that have versatile forms and structures and rather different semantic
meanings. Such kinds of data can be seen in many applications: time-related or sequence data
(e.g., historical records, stock exchange data, and time-series and biological sequence data),
data streams (e.g., video surveillance and sensor data, which are continuously transmitted),
spatial data (e.g., maps), engineering design data (e.g., the design of buildings, system
components, or integrated circuits), hypertext and multimedia data (including text, image,
video, and audio data), graph and networked data (e.g., social and information networks), and
the Web (a huge, widely distributed information repository made available by the Internet).
These applications bring about new challenges, like how to handle data carrying special
structures (e.g., sequences, trees, graphs, and networks) and specific semantics (such as
ordering, image, audio and video contents, and connectivity), and how to mine patterns that
carry rich structures and semantics.
4.6 Data Pre-processing
The major steps involved in data pre-processing are data cleaning, data integration, data
reduction, and data transformation.
1. Data Cleaning:
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data
cleansing) routines attempt to fill in missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data. In this section, you will study basic methods
for data cleaning.
1.1 Missing Values:
Imagine that you need to analyze AllElectronics sales and customer data. You note
that many tuples have no recorded value for several attributes such as customer income. How
can you go about filling in the missing values for this attribute? Let's look at the following
methods.
Ignore the tuple:
This is usually done when the class label is missing (assuming the mining task
involves classification). This method is not very effective, unless the tuple contains
several attributes with missing values. It is especially poor when the percentage of
missing values per attribute varies considerably. By ignoring the tuple, we do not
make use of the remaining attributes' values in the tuple. Such data could have been
useful to the task at hand.
Fill in the missing value manually:
In general, this approach is time consuming and may not be feasible given a large data
set with many missing values.
Use a global constant to fill in the missing value:
Replace all missing attribute values by the same constant such as a label like
"Unknown" or −∞. If missing values are replaced by, say, "Unknown," then the
mining program may mistakenly think that they form an interesting concept, since
they all have a value in common, that of "Unknown."
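Two of the strategies above, ignoring tuples and filling with a global constant, can be sketched as follows. The customer tuples and function names are hypothetical, and `None` is assumed to mark a missing value.

```python
MISSING = None  # assumed marker for a missing value

# Hypothetical customer tuples: (customer_id, income, class_label).
customers = [
    (1, 52000, "buys"),
    (2, MISSING, "buys"),
    (3, 31000, MISSING),
    (4, MISSING, "does_not_buy"),
]

def ignore_tuples_without_label(rows):
    """Drop tuples whose class label is missing (the 'ignore the tuple' method)."""
    return [r for r in rows if r[2] is not MISSING]

def fill_income_with_constant(rows, constant="Unknown"):
    """Replace every missing income by the same global constant."""
    return [(cid, constant if income is MISSING else income, label)
            for cid, income, label in rows]

kept = ignore_tuples_without_label(customers)
filled = fill_income_with_constant(customers)
print(kept)
print(filled)
```

Note that the first method discards customer 3 even though its income value is present, which is the loss of useful data mentioned above.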
1.2 Noisy Data:
"What is noise?" Noise is a random error or variance in a measured variable. In Chapter 2,
we saw how some basic statistical description techniques (e.g., boxplots and scatter plots),
and methods of data visualization can be used to identify outliers, which may represent noise.
Given a numeric attribute such as, say, price, how can we "smooth" out the data to remove
the noise? Let's look at the following data smoothing techniques.
Binning: Binning methods smooth a sorted data value by consulting its "neighborhood," that
is, the values around it. The sorted values are distributed into a number of "buckets," or bins.
Because binning methods consult the neighborhood of values, they perform local smoothing.
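One common variant, smoothing by bin means with equal-frequency bins, can be sketched as below. The variant and the price values are assumptions chosen for illustration; other variants replace each value by the bin median or by the nearest bin boundary.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, replace each by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative prices
smoothed = smooth_by_bin_means(prices, 3)
print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```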
2. Data Integration:
Data mining often requires data integration—the merging of data from multiple data stores.
Careful integration can help reduce and avoid redundancies and inconsistencies in the
resulting data set. This can help improve the accuracy and speed of the subsequent data
mining process.
Entity Identification Problem
It is likely that your data analysis task will involve data integration, which combines
data from multiple sources into a coherent data store, as in data warehousing. These sources
may include multiple databases, data cubes, or flat files.
There are a number of issues to consider during data integration. Schema integration and
object matching can be tricky. How can equivalent real-world entities from multiple data
sources be matched up? This is referred to as the entity identification problem. For example,
how can the data analyst or the computer be sure that customer_id in one database and
cust_number in another refer to the same attribute? Examples of metadata for each attribute
include the name, meaning, data type, and range of values permitted for the attribute, and null
rules for handling blank, zero, or null values (Section 3.2). Such metadata can be used to
help avoid errors in schema integration. The metadata may also be used to help transform the
data (e.g., where data codes for pay_type in one database may be "H" and "S" but 1 and 2 in
another). Hence, this step also relates to data cleaning, as described earlier.
χ² Correlation Test for Nominal Data:
For nominal data, a correlation relationship between two attributes, A and B, can be
discovered by a χ² (chi-square) test. Suppose A has c distinct values, namely a1, a2, …, ac. B
has r distinct values, namely b1, b2, …, br. The data tuples described by A and B can be
shown as a contingency table, with the c values of A making up the columns and the r values
of B making up the rows. Let (Ai, Bj) denote the joint event that attribute A takes on value ai
and attribute B takes on value bj, that is, where (A = ai, B = bj). Each and every possible (Ai,
Bj) joint event has its own cell (or slot) in the table. The χ² value (also known as the Pearson
χ² statistic) is computed as
$$\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}, \qquad e_{ij} = \frac{\mathrm{count}(A = a_i) \times \mathrm{count}(B = b_j)}{n}$$
where o_ij is the observed frequency (actual count) of the joint event (Ai, Bj), e_ij is its
expected frequency, and n is the number of data tuples.
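The Pearson χ² statistic for a contingency table can be sketched as follows, using the standard definition: the sum over all cells of (observed − expected)² / expected, where the expected count of a cell is its row total times its column total divided by the number of tuples. The 2 × 2 counts are hypothetical, and the function assumes every row and column total is positive.

```python
def chi_square(observed):
    """observed[j][i] is the count for (B = b_j, A = a_i): rows are B, columns are A."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    statistic = 0.0
    for j, row in enumerate(observed):
        for i, o in enumerate(row):
            e = row_totals[j] * col_totals[i] / n  # expected count under independence
            statistic += (o - e) ** 2 / e
    return statistic

# Hypothetical counts: two values of A (columns) against two values of B (rows).
observed = [
    [250, 200],
    [50, 1000],
]
statistic = chi_square(observed)
print(round(statistic, 2))
```

A large value relative to the χ² distribution with (r − 1)(c − 1) degrees of freedom suggests that the two attributes are not independent.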
Tuple Duplication:
In addition to detecting redundancies between attributes, duplication should also be
detected at the tuple level (e.g., where there are two or more identical tuples for a given
unique data entry case). The use of denormalized tables (often done to improve performance
by avoiding joins) is another source of data redundancy. Inconsistencies often arise between
various duplicates, due to inaccurate data entry or updating some but not all data occurrences.
For example, if a purchase order database contains attributes for the purchaser's name and
address instead of a key to this information in a purchaser database, discrepancies can occur,
such as the same purchaser's name appearing with different addresses within the purchase
order database.
3. Data Reduction
Imagine that you have selected data from the AllElectronics data warehouse for analysis. The
data set will likely be huge! Complex data analysis and mining on huge amounts of data can
take a long time, making such analysis impractical or infeasible.
Data reduction techniques can be applied to obtain a reduced representation of the data set
that is much smaller in volume, yet closely maintains the integrity of the original data. That
is, mining on the reduced data set should be more efficient yet produce the same (or almost
the same) analytical results. In this section, we first present an overview of data reduction
strategies, followed by a closer look at individual techniques.
3.1 Wavelet Transforms:
The discrete wavelet transform (DWT) is a linear signal processing technique that,
when applied to a data vector X, transforms it to a numerically different vector, X′, of
wavelet coefficients. The two vectors are of the same length. When applying this
technique to data reduction, we consider each tuple as an n-dimensional data vector,
that is, X = (x1, x2, …, xn), depicting n measurements made on the tuple from n
database attributes.
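A minimal sketch of a discrete wavelet transform is given below, assuming the Haar wavelet (the simplest choice; the text does not fix a particular wavelet) and an input length that is a power of two. It shows that the coefficient vector has the same length as the data vector. The `truncate` helper illustrates the usual reduction step of keeping only the strongest coefficients; the data values are hypothetical.

```python
import math

def haar_dwt(vector):
    """Full orthonormal Haar transform; the length must be a power of two."""
    n = len(vector)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two (pad with zeros otherwise)")
    coeffs = [float(v) for v in vector]
    length = n
    while length > 1:
        half = length // 2
        # Pairwise scaled sums (smooth part) and differences (detail part).
        averages = [(coeffs[2 * k] + coeffs[2 * k + 1]) / math.sqrt(2) for k in range(half)]
        details = [(coeffs[2 * k] - coeffs[2 * k + 1]) / math.sqrt(2) for k in range(half)]
        coeffs[:length] = averages + details
        length = half  # recurse on the smooth part only
    return coeffs

def truncate(coeffs, keep):
    """Zero all but the `keep` largest-magnitude coefficients (ties may keep more)."""
    threshold = sorted((abs(c) for c in coeffs), reverse=True)[keep - 1]
    return [c if abs(c) >= threshold else 0.0 for c in coeffs]

x = [2, 2, 0, 2, 3, 5, 4, 4]  # hypothetical tuple with n = 8 attribute values
coeffs = haar_dwt(x)
print(len(coeffs), truncate(coeffs, 3))
```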
Histograms:
Histograms use binning to approximate data distributions and are a popular form of data
reduction. Histograms were introduced in Section 2.2.3. A histogram for an attribute, A,
partitions the data distribution of A into disjoint subsets, referred to as buckets or bins. If
each bucket represents only a single attribute–value/frequency pair, the buckets are
called singleton buckets. Often, buckets instead represent continuous ranges for the
given attribute.
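The two kinds of buckets can be sketched as follows. The price list is hypothetical, and equal-width ranges are assumed for the second histogram; other partitioning rules, such as equal-frequency buckets, are also possible.

```python
from collections import Counter

def singleton_buckets(values):
    """One bucket per distinct value: value -> frequency."""
    return dict(sorted(Counter(values).items()))

def equal_width_histogram(values, low, width):
    """Count values per bucket [low + k*width, low + (k+1)*width)."""
    counts = Counter((v - low) // width for v in values)
    return {(low + k * width, low + (k + 1) * width): counts[k] for k in sorted(counts)}

prices = [1, 1, 5, 8, 10, 12, 15, 15, 21, 25, 30]  # hypothetical prices
singles = singleton_buckets(prices)
ranges = equal_width_histogram(prices, low=0, width=10)
print(singles)
print(ranges)
```

The eleven original values are reduced to four bucket counts, at the cost of losing the exact values inside each range.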
4. Data Transformation Strategies Overview
In data transformation, the data are transformed or consolidated into forms appropriate for
mining. Strategies for data transformation include the following:
• Smoothing, which works to remove noise from the data. Techniques include binning,
regression, and clustering.
• Attribute construction (or feature construction), where new attributes are constructed
and added from the given set of attributes to help the mining process.
• Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual
total amounts. This step is typically used in constructing a data cube for data analysis at
multiple abstraction levels.
• Normalization, where the attribute data are scaled so as to fall within a smaller range,
such as −1.0 to 1.0, or 0.0 to 1.0.
• Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by
interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior).
The labels, in turn, can be recursively organized into higher-level concepts, resulting in
a concept hierarchy for the numeric attribute.
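Normalization and discretization from the list above can be sketched as follows. Min-max scaling is assumed as the normalization method (one of several in use), and the cut points for the conceptual labels are hypothetical. Ages are assumed to be non-negative integers.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values to [new_min, new_max]; assumes max(values) > min(values)."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [new_min + (v - old_min) / span * (new_max - new_min) for v in values]

def interval_label(age, width=10):
    """Interval labels in the style 0-10, 11-20, 21-30, ..."""
    if age <= width:
        return f"0-{width}"
    k = (age - 1) // width
    return f"{k * width + 1}-{(k + 1) * width}"

def conceptual_label(age):
    """Conceptual labels; the cut points 20 and 60 are assumptions."""
    if age <= 20:
        return "youth"
    if age <= 60:
        return "adult"
    return "senior"

ages = [10, 20, 30]
print(min_max_normalize(ages))             # scaled to 0.0 .. 1.0
print(min_max_normalize(ages, -1.0, 1.0))  # scaled to -1.0 .. 1.0
print([interval_label(a) for a in ages], [conceptual_label(a) for a in ages])
```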
4.7 Measures of Similarity and Dissimilarity
Measuring Data Similarity and Dissimilarity
In data mining applications, such as clustering, outlier analysis, and nearest-neighbor
classification, we need ways to assess how alike or unalike objects are in comparison to one
another. For example, a store may want to search for clusters of customer objects, resulting in
groups of customers with similar characteristics (e.g., similar income, area of residence, and
age). Such information can then be used for marketing. A cluster is a collection of data
objects such that the objects within a cluster are similar to one another and dissimilar to the
objects in other clusters. Outlier analysis also employs clustering-based techniques to identify
potential outliers as objects that are highly dissimilar to others. Knowledge of object
similarities can also be used in nearest-neighbor classification schemes where a given object
(e.g., a patient) is assigned a class label (relating to, say, a diagnosis) based on its similarity
toward other objects in the model.
Data Matrix versus Dissimilarity Matrix:
In Section 2.2, we looked at ways of studying the central tendency, dispersion, and spread of
observed values for some attribute X. Our objects there were one-dimensional, that is,
described by a single attribute. In this section, we talk about objects described by multiple
attributes. Therefore, we need a change in notation. Suppose that we have n objects (e.g.,
persons, items, or courses) described by p attributes (also called measurements or features,
such as age, height, weight, or gender). The objects are x1 = (x11, x12, …, x1p),
x2 = (x21, x22, …, x2p), and so on, where xij is the value for
object xi of the jth attribute. For brevity, we hereafter refer to object xi as object i. The objects
may be tuples in a relational database, and are also referred to as data samples or feature
vectors.
Data matrix (or object-by-attribute structure): This structure stores the n data objects in the
form of a relational table, or n-by-p matrix (n objects × p attributes)
Dissimilarity matrix (or object-by-object structure): This structure stores a collection of
proximities that are available for all pairs of n objects. It is often represented by an n-by-n
table.
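The relationship between the two structures can be sketched by deriving an n-by-n dissimilarity matrix from an n-by-p data matrix. The three objects are hypothetical, and the city block distance is used only as an example of a proximity measure; any distance function could be passed in.

```python
def dissimilarity_matrix(data, distance):
    """Build the symmetric n-by-n matrix of pairwise distances, with zeros on the diagonal."""
    n = len(data)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            matrix[i][j] = matrix[j][i] = distance(data[i], data[j])
    return matrix

def city_block(x, y):
    """Sum of absolute attribute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Data matrix: n = 3 objects (rows) described by p = 2 numeric attributes (columns).
data = [(1, 2), (3, 5), (2, 0)]
matrix = dissimilarity_matrix(data, city_block)
for row in matrix:
    print(row)
```

Because the matrix is symmetric with a zero diagonal, only the entries below the diagonal need to be stored in practice.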
Dissimilarity of Numeric Data: Minkowski Distance
In this section, we describe distance measures that are commonly used for computing the
dissimilarity of objects described by numeric attributes. These measures include the
Euclidean, Manhattan, and Minkowski distances.
The most popular distance measure is Euclidean distance (i.e., straight line or "as the crow
flies"). Let i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) be two objects described by p
numeric attributes. The Euclidean distance between objects i and j is defined as
$$d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}$$
Another well-known measure is the Manhattan (or city block) distance, named so because it
is the distance in blocks between any two points in a city (such as 2 blocks down and 3
blocks over for a total of 5 blocks). It is defined as
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
Both the Euclidean and the Manhattan distance satisfy the following mathematical properties:
Non-negativity: Distance is a non-negative number.
Identity of indiscernibles: The distance of an object to itself is 0.
Symmetry: Distance is a symmetric function.
Triangle inequality: Going directly from object i to object j in space is no more than
making a detour over any other object k.
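The Minkowski distance generalizes both measures: with order h it is the h-th root of the sum of |xif − xjf|^h over all attributes f, so h = 1 gives the Manhattan distance and h = 2 the Euclidean distance. The sketch below computes it and checks the four properties on three hypothetical points.

```python
def minkowski(x, y, h):
    """Minkowski distance of order h >= 1 between two numeric vectors."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

# Three hypothetical objects with p = 2 numeric attributes.
i, j, k = (1, 2), (3, 5), (2, 0)

manhattan = minkowski(i, j, 1)  # 2 + 3
euclidean = minkowski(i, j, 2)  # square root of 4 + 9
print(manhattan, euclidean)

for h in (1, 2):
    assert minkowski(i, j, h) >= 0                                         # non-negativity
    assert minkowski(i, i, h) == 0                                         # identity
    assert minkowski(i, j, h) == minkowski(j, i, h)                        # symmetry
    assert minkowski(i, j, h) <= minkowski(i, k, h) + minkowski(k, j, h)   # triangle inequality
```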
Cosine Similarity:
A document can be represented by thousands of attributes, each recording the frequency of a
particular word (such as a keyword) or phrase in the document. Thus, each document is an
object represented by what is called a term-frequency vector. For example, in Table 2.5, we
see that Document1 contains five instances of the word team, while hockey occurs three
times. The word coach is absent from the entire document, as indicated by a count value of 0.
Such data can be highly asymmetric.
Cosine similarity is a measure of similarity that can be used to compare documents or, say,
give a ranking of documents with respect to a given vector of query words. Let x and y be two
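vectors for comparison. Using the cosine measure as a similarity function, we have
$$\mathrm{sim}(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$$
where ||x|| is the Euclidean norm of vector x = (x1, x2, …, xp), that is, the square root of
x1² + x2² + … + xp², and ||y|| is defined in the same way.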
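The cosine measure can be sketched for two term-frequency vectors as follows. The vectors are hypothetical word counts for two documents over the same ten terms, and the function assumes neither vector is all zeros.

```python
import math

def cosine_similarity(x, y):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical term-frequency vectors for two documents.
doc1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
doc2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

similarity = cosine_similarity(doc1, doc2)
print(round(similarity, 2))  # 0.94
```

A value close to 1 means the two vectors point in nearly the same direction, so the documents use the shared terms in similar proportions; a value of 0 means they have no terms in common.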
4.8 Data Mining Applications
This section describes some applications of data mining and the techniques used in each
of them.
o Data Mining Applications in Healthcare
Data mining applications in health can have tremendous potential and usefulness.
However, the success of healthcare data mining hinges on the availability of clean
healthcare data. In this respect, it is critical that the healthcare industry look into how
data can be better captured, stored, prepared and mined. Possible directions include the
standardization of clinical vocabulary and the sharing of data across organizations to
enhance the benefits of healthcare data mining applications.
o Data mining is used for market basket analysis
Data mining techniques are used in market basket analysis (MBA). When customers buy
products, this technique helps find the associations between the different items that they
put in their shopping baskets. The discovery of such associations helps retailers promote
their business: by identifying the buying patterns and intentions of customers, retailers
can increase profits and encourage the purchase of related items.
o Data mining is an emerging trend in education systems throughout the world
In India, many parents have had little formal education. The main aim of the Indian
government is quality of education rather than quantity. Education systems are changing
day by day, and in the 21st century a large number of universities have been established
by order of the UGC. As the number of universities grows, a very large number of students
enroll across the country every day. With such a large number of higher education
aspirants, we believe that data mining technology can help bridge the knowledge gap in
higher educational systems. The hidden patterns, associations, and anomalies that data
mining techniques discover in educational data can improve decision-making processes in
higher educational systems. This improvement can bring advantages such as maximizing
educational system efficiency; decreasing the student drop-out rate; increasing the
student promotion, retention, and transition rates; increasing the educational
improvement ratio, student success, and student learning outcomes; and reducing the cost
of system processes. At present, KDD and data mining tools are used to extract knowledge
that can be used to improve the quality of education. Decision tree classification is used
in this type of application.
o Data mining is now used in many different areas of manufacturing engineering
Data retrieved from a manufacturing system can be used for different purposes, such as
finding errors in the data, enhancing the design methodology, improving the quality of the
data, and determining how best the data can support decision making. In most cases, the
data are first analyzed to find hidden patterns, which are then used to control the
manufacturing process and further enhance the quality of the products. Since the
importance of data mining in manufacturing has clearly increased over the last 20 years,
it is now appropriate to critically review its history and applications.
o Data mining applications can be generic or domain specific
A data mining system can be generic or domain specific. Some generic data mining
applications cannot take decisions on their own but guide users in the selection of data,
the selection of a data mining method, and the interpretation of the results. A
multi-agent-based data mining application is capable of automatically selecting the data
mining technique to be applied. The multi-agent system is used at different levels [8]:
first at the level of concept hierarchy definition, and then at the result level, to present
the best-adapted decision to the user. This decision is stored in a knowledge base for use
in later decision making. A multi-agent system tool used for generic data mining system
development uses different agents to perform different tasks.
4.7 Summary
Database technology has evolved from primitive file processing to the development of
database management systems with query and transaction processing.
Data mining is the task of discovering interesting patterns from large amounts of data,
where the data can be stored in databases, data warehouses, or other information
repositories. It is a young interdisciplinary field, drawing from areas such as database
systems, data warehousing, statistics, machine learning, data visualization,
information retrieval, and high-performance computing.
Data pre-processing is an important issue for both data warehousing and data mining,
as real-world data tend to be incomplete, noisy, and inconsistent. Data pre-processing
includes data cleaning, data integration, data transformation, and data reduction. Data
cleaning routines attempt to fill in missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data. Data integration combines data from
multiple sources to form a coherent data store. Data transformation routines convert
the data into appropriate forms for mining.
4.8 Keywords
Data mining, Prediction, Clustering, data cleaning, data integration, data transformation, and
data reduction, Missing Values, Noisy Data, Wavelet Transforms, Histograms, Smoothing,
Aggregation, Normalization, Discretization, Similarity, Dissimilarity, Applications.
4.9 Exercises
1. Define data mining.
2. Explain the challenges in data mining.
3. Explain the various data mining tasks.
4. What are the types of data? Explain.
5. What is data pre-processing?
6. Describe the various stages of data pre-processing.
7. Explain the measures of similarity and dissimilarity.
8. Explain any four data mining applications.
4.10 References
1. Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber,
Morgan Kaufmann Publisher, Second Edition, 2006.
2. Introduction to Data Mining with Case Studies by G. K. Gupta, Eastern Economy
Edition (PHI, New Delhi), Third Edition, 2009.
3. Data Mining Techniques by Arun K Pujari, University Press, Second Edition,
2009.
4. Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy
Uthurasamy, "Advances in Knowledge Discovery and Data Mining", AAAI Press/
The MIT Press, 1996.
5. Michael Berry and Gordon Linoff, "Data Mining Techniques (For Marketing,
Sales, and Customer Support)", John Wiley & Sons, 1997.
UNIT-5: FREQUENT PATTERNS FOR DATA MINING
Structure
5.1 Objectives
5.2 Introduction
5.3 Basic Concepts and Algorithms of Mining Frequent Patterns
5.4 Associations and Correlations
5.5 Frequent Item set Generation
5.6 Rule Generation
5.7 Compact Representation of Frequent Item sets
5.8 Summary
5.9 Keywords
5.10 Exercises
5.11 References
5.1 Objectives
In this unit, you will study the following concepts:
The basic concepts of mining frequent patterns and the associated algorithms that we
can employ
An example: market basket analysis
Several algorithms related to the mining of frequent patterns
Some association rules
Correlations
Frequent item set generation
Rule generation
Compact representation of frequent item sets
5.2 Introduction
The topic of frequent pattern mining is indeed rich. This unit is dedicated to methods of
frequent item set mining. We delve into the following questions: How can we find frequent
item sets from large amounts of data, where the data are either transactional or relational?
How can we mine association rules in multilevel and multidimensional space? Which
association rules are the most interesting? How can we help or guide the mining procedure to
discover interesting associations or correlations? How can we take advantage of user
preferences or constraints to speed up the mining process? The techniques learned in this unit
may also be extended for more advanced forms of frequent pattern mining, such as from
sequential and structured data sets, as we will study in later units.
In this section, you will learn methods for mining the simplest form of frequent patterns,
such as those discussed earlier for market basket analysis. We begin by presenting
Apriori, the basic algorithm for finding frequent item sets, and then look at how to generate
strong association rules from frequent item sets. This unit describes several variations of
the Apriori algorithm for improved efficiency and scalability. It presents pattern-growth
methods for mining frequent item sets that confine the subsequent search space to only the
data sets containing the current frequent item sets. It also presents methods for mining
frequent item sets that take advantage of the vertical data format.
Finally, we discuss how the results of sequence mining can be applied in a real application
domain. The sequence mining task is to discover a set of attributes, shared across time among
a large number of objects in a given database. For example, consider the sales database of a
bookstore, where the objects represent customers and the attributes represent authors or
books. Let's say that the database records the books bought by each customer over a period
of time. The discovered patterns are the sequences of books most frequently bought by the
customers. Consider another example of a web access database at a popular site, where an
object is a web user and an attribute is a web page. The discovered patterns are the sequences
of most frequently accessed pages at that site. This kind of information can be used to
restructure the web-site, or to dynamically insert relevant links in web pages based on user
access patterns.
5.3 Basic concepts and algorithms of mining frequent patterns
Frequent patterns are patterns (such as item sets, subsequences, or substructures) that
appear in a data set frequently. For example, a set of items, such as milk and bread that
appear frequently together in a transaction data set is a frequent itemset. A subsequence, such
as buying first a PC, then a digital camera, and then a memory card, if it occurs frequently in
a shopping history database, is a (frequent) sequential pattern. A substructure can refer to
different structural forms, such as sub-graphs, sub-trees, or sub-lattices, which may be
combined with itemsets or subsequences. If a substructure occurs frequently, it is called a
(frequent) structured pattern. Finding such frequent patterns plays an essential role in mining
associations, correlations, and many other interesting relationships among data.
Moreover, it helps in data classification, clustering, and other data mining tasks as well. Thus,
frequent pattern mining has become an important data mining task and a focused theme in
data mining research. In this unit, we introduce the concepts of frequent patterns,
associations, and correlations, and study how they can be mined efficiently.
Frequent pattern mining searches for recurring relationships in a given data set. This section
introduces the basic concepts of frequent pattern mining for the discovery of interesting
associations and correlations between itemsets in transactional and relational databases. We
begin by presenting an example of market basket analysis, the earliest form of frequent
pattern mining for association rules. The basic concepts of mining frequent patterns and
associations are then given, followed by a road map to the different kinds of frequent
patterns, association rules, and correlation rules that can be mined.
Market Basket Analysis: A Motivating Example
Frequent itemset mining leads to the discovery of associations and correlations among items
in large transactional or relational data sets. With massive amounts of data continuously
being collected and stored, many industries are becoming interested in mining such patterns
from their databases. The discovery of interesting correlation relationships among huge
amounts of business transaction records can help in many business decision-making
processes, such as catalog design, cross-marketing, and customer shopping behavior analysis.
A typical example of frequent itemset mining is market basket analysis. This process
analyzes customer buying habits by finding associations between the different items that
customers place in their "shopping baskets". The discovery of such associations can help
retailers develop marketing strategies by gaining insight into which items are frequently
purchased together by customers. For instance, if customers are buying milk, how likely are
they to also buy bread (and what kind of bread) on the same trip to the supermarket? Such
information can lead to increased sales by helping retailers do selective marketing and plan
their shelf space.
Market basket analysis: Suppose, as manager of an All Electronics branch, you would like to
learn more about the buying habits of your customers. Specifically, you wonder, "Which
groups or sets of items are customers likely to purchase on a given trip to the store?" To
answer your question, market basket analysis may be performed on the retail data of
customer transactions at your store. You can then use the results to plan marketing or
advertising strategies, or in the design of a new catalog. For instance, market basket analysis
may help you design different store layouts. In one strategy, items that are frequently
purchased together can be placed in proximity in order to further encourage the sale of such
items together. If customers who purchase computers also tend to buy antivirus software at
the same time, then placing the hardware display close to the software display may help
increase the sales of both items. In an alternative strategy, placing hardware and software at
opposite ends of the store may entice customers who purchase such items to pick up other
items along the way. For instance, after deciding on an expensive computer, a customer may
observe security systems for sale while heading toward the software display to purchase
antivirus software and may decide to purchase a home security system as well. Market basket
analysis can also help retailers plan which items to put on sale at reduced prices. If customers
tend to purchase computers and printers together, then having a sale on printers may
encourage the sale of printers as well as computers.
If we think of the universe as the set of items available at the store, then each item has a
Boolean variable representing the presence or absence of that item. Each basket can then be
represented by a Boolean vector of values assigned to these variables. The Boolean vectors
can be analyzed for buying patterns that reflect items that are frequently associated or
purchased together. These patterns can be represented in the form of association rules. For
example, the information that customers who purchase computers also tend to buy antivirus
software at the same time is represented in Association Rule (5.1) below:
computer ⇒ antivirus software [support = 2%; confidence = 60%]    (5.1)
Rule support and confidence are two measures of rule interestingness. They respectively
reflect the usefulness and certainty of discovered rules. A support of 2% for Association Rule
(5.1) means that 2% of all the transactions under analysis show that computer and antivirus
software are purchased together. A confidence of 60% means that 60% of the customers who
purchased a computer also bought the software. Typically, association rules are considered
interesting if they satisfy both a minimum support threshold and a minimum confidence
threshold. Such thresholds can be set by users or domain experts. Additional analysis can be
performed to uncover interesting statistical correlations between associated items.
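The Boolean-vector view of baskets and the two interestingness measures can be illustrated
with a minimal Python sketch. The five baskets below are hypothetical toy data, and the
helper names (to_boolean_vector, support, confidence) are assumptions introduced only for
this illustration.

```python
# Hypothetical toy data: each basket is the set of items bought in one transaction.
baskets = [
    {"computer", "antivirus software"},
    {"computer", "printer"},
    {"computer", "antivirus software", "printer"},
    {"milk", "bread"},
    {"computer"},
]

# The "universe" of items, in a fixed order, defines one Boolean variable per item.
items = sorted(set().union(*baskets))


def to_boolean_vector(basket):
    """Represent a basket as a tuple of True/False values, one per item."""
    return tuple(item in basket for item in items)


vectors = [to_boolean_vector(b) for b in baskets]


def support(itemset):
    """Fraction of all baskets that contain every item in `itemset`."""
    idx = [items.index(i) for i in itemset]
    hits = sum(all(v[j] for j in idx) for v in vectors)
    return hits / len(vectors)


def confidence(antecedent, consequent):
    """Fraction of baskets containing the antecedent that also contain the consequent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


rule_support = support({"computer", "antivirus software"})
rule_confidence = confidence({"computer"}, {"antivirus software"})
print(f"computer => antivirus software: "
      f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

In this toy data two of the five baskets contain both items and four contain a computer, so
the rule would pass a minimum support threshold of, say, 30% but fail a minimum confidence
threshold of 60%.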
Mining frequent patterns without candidate generation
Mining frequent patterns in transaction databases, time-series databases, and many other
kinds of databases has been studied extensively in data mining research. Most earlier
studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate
set generation is still costly, especially when there exist prolific patterns and/or long patterns.
Here we discuss the frequent pattern tree (FP-tree) structure, an extended prefix-tree
structure for storing compressed, crucial information about frequent patterns, and the
FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by
pattern fragment growth. Efficiency of mining is achieved with three techniques: (1) a large
database is compressed into a highly condensed, much smaller data structure, which avoids
costly, repeated database scans; (2) FP-tree-based mining adopts a pattern fragment growth
method to avoid the costly generation of a large number of candidate sets; and (3) a
partitioning-based, divide-and-conquer method is used to decompose the mining task into a
set of smaller tasks for mining confined patterns in conditional databases, which
dramatically reduces the search space. The performance study reported by the authors of the
method shows that FP-growth is efficient and scalable for mining both long and short
frequent patterns, is about an order of magnitude faster than the Apriori algorithm, and is
also faster than some other frequent pattern mining methods reported at the time.
Algorithms:
FP-Growth Algorithm:
In data mining, the task of finding frequent patterns in large databases is very important and
has been studied extensively in the past few years. Unfortunately, this task is
computationally expensive, especially when a large number of patterns exist.
The FP-Growth algorithm, proposed by Han, is an efficient and scalable method for mining
the complete set of frequent patterns by pattern fragment growth, using an extended
prefix-tree structure, named the frequent-pattern tree (FP-tree), for storing compressed and
crucial information about frequent patterns. In his study, Han showed that this method
outperforms other popular methods for mining frequent patterns, e.g. the Apriori algorithm
and Tree Projection.
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs
frequently in a data set.
Motivation: Finding inherent regularities in data
What products were often purchased together? — Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalogue design, sale campaign analysis, Web log
(click stream) analysis, and DNA sequence analysis
Why is Frequent Pattern mining important?
Discloses an intrinsic and important property of data sets
Forms the foundation for many essential data mining tasks
Association, correlation, and causality analysis
Sequential, structural (e.g., sub-graph) patterns
Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
Classification: associative classification
Cluster analysis: frequent pattern-based clustering
Data warehousing: iceberg cube and cube-gradient
Semantic data compression: fascicles
Association Rules Mining: A Recent Overview
Association rule mining is one of the most important and well-researched techniques of data
mining. It aims to extract interesting correlations, frequent patterns, associations, or causal
structures among sets of items in transaction databases or other data repositories.
Association rules are widely used in areas such as telecommunication networks, market and
risk management, and inventory control. Various association mining techniques and
algorithms are briefly introduced and compared below.
Association rule mining finds the association rules that satisfy a predefined minimum
support and confidence in a given database. The problem is usually decomposed into two
sub-problems. The first is to find those itemsets whose occurrences exceed a predefined
threshold in the database; these itemsets are called frequent or large itemsets. The second is
to generate association rules from those large itemsets under the constraint of minimal
confidence. Suppose one of the large itemsets is Lk = {I1, I2, … , Ik}. Association rules
from this itemset are generated in the following way: the first rule is
{I1, I2, … , Ik-1} ⇒ {Ik}; checking its confidence determines whether the rule is
interesting. Further rules are generated by deleting the last item of the antecedent and
inserting it into the consequent, and the confidence of each new rule is checked to determine
its interestingness. This process is iterated until the antecedent becomes empty. Since the
second sub-problem is quite straightforward, most research focuses on the first
sub-problem.
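The rule generation procedure just described can be sketched in Python as follows. This is
only an illustration of that simple procedure: the function name generate_rules, the
support_of lookup table, and the support values are hypothetical. Note that this procedure
checks only k-1 rules per itemset; a complete rule generator would consider every non-empty
proper subset of the itemset as a possible antecedent.

```python
def generate_rules(itemset, support_of, min_confidence):
    """Generate rules from one large itemset by repeatedly moving the last
    antecedent item into the consequent.

    itemset        -- ordered list of items [I1, ..., Ik]
    support_of     -- hypothetical lookup table: frozenset -> support value
    min_confidence -- minimum confidence threshold
    """
    rules = []
    whole = frozenset(itemset)
    antecedent = list(itemset[:-1])
    consequent = [itemset[-1]]
    while antecedent:
        # confidence(X => Y) = support(X u Y) / support(X)
        conf = support_of[whole] / support_of[frozenset(antecedent)]
        if conf >= min_confidence:
            rules.append((tuple(antecedent), tuple(consequent), conf))
        # Move the last antecedent item to the front of the consequent.
        consequent.insert(0, antecedent.pop())
    return rules


# Hypothetical supports for the large itemset {bread, butter, milk} and its prefixes.
support_of = {
    frozenset(["bread"]): 0.6,
    frozenset(["bread", "butter"]): 0.4,
    frozenset(["bread", "butter", "milk"]): 0.3,
}

rules = generate_rules(["bread", "butter", "milk"], support_of, min_confidence=0.7)
for antecedent, consequent, conf in rules:
    print(antecedent, "=>", consequent, f"confidence = {conf:.2f}")
```

With these numbers only {bread, butter} ⇒ {milk} survives (confidence about 0.75); the rule
{bread} ⇒ {butter, milk} has confidence 0.5 and is dropped.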
The first sub-problem can be further divided into two sub-problems: the candidate large
itemset generation process and the frequent itemset generation process. Itemsets whose
support exceeds the support threshold are called large or frequent itemsets. In many cases,
the algorithms generate an extremely large number of association rules, often thousands or
even millions. Further, the association rules themselves are sometimes very large. It is
nearly impossible for end users to comprehend or validate such a large number of complex
association rules, which limits the usefulness of the data mining results. Several strategies
have been proposed to reduce the number of association rules, such as generating only
"interesting" rules, generating only "non-redundant" rules, or generating only those rules
that satisfy certain other criteria such as coverage, leverage, lift, or strength.
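Two of the criteria named above can be illustrated briefly. Lift is commonly defined as
support(X ∪ Y) / (support(X) · support(Y)) and leverage as
support(X ∪ Y) − support(X) · support(Y); the text itself does not give these formulas, so
the sketch below assumes these usual definitions, and the support values are hypothetical.

```python
def lift(supp_xy, supp_x, supp_y):
    """Ratio of observed co-occurrence to that expected under independence."""
    return supp_xy / (supp_x * supp_y)


def leverage(supp_xy, supp_x, supp_y):
    """Difference between observed co-occurrence and that expected under independence."""
    return supp_xy - supp_x * supp_y


# Hypothetical supports: X in 50% of transactions, Y in 40%, both together in 30%.
rule_lift = lift(0.3, 0.5, 0.4)
rule_leverage = leverage(0.3, 0.5, 0.4)
print(f"lift = {rule_lift:.2f}, leverage = {rule_leverage:.2f}")
```

A lift above 1 (or a leverage above 0) indicates that X and Y occur together more often than
they would if they were independent, so a filter such as "keep only rules with lift > 1" is
one way to reduce the number of reported rules.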
Association Rules
Apriori is an algorithm for frequent item set mining and association rule learning over
transactional databases. It proceeds by identifying the frequent individual items in the
database and extending them to larger and larger item sets as long as those item sets appear
sufficiently often in the database. The frequent item sets determined by Apriori can be used to
determine association rules which highlight general trends in the database: this has
applications in domains such as market basket analysis.
Apriori is designed to operate on databases containing transactions (for example, collections
of items bought by customers, or details of visits to a website). Other algorithms are
designed for finding association rules in data having no transactions (Winepi and Minepi), or
having no time stamps (DNA sequencing). Each transaction is seen as a set of items
(an itemset). Given a threshold C, the Apriori algorithm identifies the item sets which are
subsets of at least C transactions in the database.
Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time
(a step known as candidate generation), and groups of candidates are tested against the data.
The algorithm terminates when no further successful extensions are found.
Apriori uses breadth-first search and a hash tree structure to count candidate item sets
efficiently. It generates candidate item sets of length k from item sets of length k-1. Then
it prunes the candidates which have an infrequent sub-pattern. According to the downward
closure lemma, the candidate set contains all frequent k-length item sets. After that, it scans
the transaction database to determine frequent item sets among the candidates.
Method:
Initially, scan DB once to get frequent 1-itemset
Generate length (k+1) candidate itemsets from length k frequent itemsets
Test the candidates against DB
Terminate when no frequent or candidate set can be generated
Pseudo-code
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do
begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
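The pseudo-code above can be turned into a small runnable Python sketch. This is a
simplified illustration rather than an optimized implementation: candidates are counted with
a plain subset test instead of a hash tree, the function name apriori and the min_count
parameter (an absolute support count) are assumptions, and the nine transactions are toy
data.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {frozenset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # L1: count single items in one scan of the database.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 1
    while current:
        # C(k+1): join pairs of frequent k-itemsets and keep only unions of size
        # k+1 whose k-subsets are all frequent (downward closure).
        candidates = set()
        for a, b in combinations(current, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in current for sub in combinations(union, k)
            ):
                candidates.add(union)

        # Scan the database and count every candidate contained in each transaction.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        # L(k+1): candidates that reach the minimum support count.
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent


# Toy transaction database (nine transactions over items I1..I5).
transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
frequent = apriori(transactions, min_count=2)
for itemset, count in sorted(frequent.items(), key=lambda p: (len(p[0]), sorted(p[0]))):
    print(sorted(itemset), count)
```

The loop stops as soon as no candidate of the next size reaches the minimum count, which
mirrors the termination condition Lk != ∅ in the pseudo-code.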
Important Details of Apriori
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
How to count supports of candidates?
Example of Candidate-generation
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4={abcd}
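The self-join and pruning steps of this example can be reproduced with a short Python
sketch. The function name generate_candidates is an assumption, and itemsets are represented
as sorted tuples of single-character items purely for convenience.

```python
from itertools import combinations


def generate_candidates(frequent_k):
    """Self-join and prune steps of Apriori candidate generation.

    frequent_k -- collection of frequent k-itemsets, each a sorted tuple
    Returns the list of candidate (k+1)-itemsets as sorted tuples.
    """
    frequent_k = sorted(frequent_k)
    known = set(frequent_k)
    k = len(frequent_k[0])
    candidates = []
    for i, a in enumerate(frequent_k):
        for b in frequent_k[i + 1:]:
            # Self-join: merge two itemsets that share their first k-1 items.
            if a[:-1] == b[:-1]:
                joined = a + (b[-1],)
                # Prune: every k-subset of the candidate must itself be frequent.
                if all(sub in known for sub in combinations(joined, k)):
                    candidates.append(joined)
    return candidates


L3 = [tuple(s) for s in ["abc", "abd", "acd", "ace", "bcd"]]
C4 = generate_candidates(L3)
print(["".join(c) for c in C4])
```

Here abc and abd join to abcd, which survives because abc, abd, acd and bcd are all in L3;
acd and ace join to acde, which is pruned because ade is not in L3.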
SPADE Algorithm
Most existing approaches to sequence mining use very complicated internal data structures
which have poor locality and add space and computational overheads.
SPADE is an algorithm for the fast discovery of sequential patterns. SPADE utilizes
combinatorial properties to decompose the original problem into smaller sub-problems that
can be independently solved in main memory using efficient lattice search techniques and
simple join operations. All sequences are discovered in only three database scans.
Experiments show that SPADE outperforms the best previous algorithm by a factor of two,
and by an order of magnitude with some pre-processed data. It also has linear scalability with
respect to the number of input sequences and a number of other database parameters.
The task of discovering all frequent sequences in large databases is quite challenging. The
search space is extremely large. For example, with m attributes there are O(m^k) potentially
frequent sequences of length k. With millions of objects in the database, the problem of I/O
minimization becomes paramount. However, most current algorithms are iterative in nature,
requiring as many full database scans as the length of the longest frequent sequence; clearly
a very expensive process. Some of the methods, especially those using some form of
sampling, can be sensitive to data skew, which can adversely affect performance. SPADE
(Sequential Pattern Discovery using Equivalence classes) is an algorithm for discovering the
set of all frequent sequences.
SPADE not only minimizes I/O costs by reducing database scans, but also minimizes
computational costs by using efficient search schemes. The vertical id-list based approach is
also insensitive to data skew. An extensive set of experiments shows that SPADE
outperforms previous approaches by a factor of two, and by an order of magnitude if some
additional off-line information is available. Furthermore, SPADE scales linearly in the
database size and a number of other database parameters.
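The vertical id-list idea and the "simple join operations" mentioned above can be illustrated
with a minimal Python sketch. This is not the full SPADE algorithm (there is no lattice
decomposition into equivalence classes here); it only shows how the support of a two-item
sequence can be obtained by joining two id-lists. The event data and the names
temporal_join and support are hypothetical.

```python
from collections import defaultdict

# Hypothetical sequence database: (sequence id, event time, items in the event).
events = [
    (1, 10, {"A", "B"}),
    (1, 20, {"C"}),
    (2, 10, {"A"}),
    (2, 30, {"C"}),
    (3, 10, {"C"}),
    (3, 20, {"A"}),
]

# Vertical format: one id-list of (sid, eid) pairs per item.
id_lists = defaultdict(list)
for sid, eid, items in events:
    for item in items:
        id_lists[item].append((sid, eid))


def temporal_join(first, second):
    """Id-list of the sequence 'first, then second': keep the (sid, eid) pairs of
    `second` that occur strictly after some occurrence of `first` in the same
    sequence."""
    earliest = {}
    for sid, eid in first:
        earliest[sid] = min(eid, earliest.get(sid, eid))
    return [(sid, eid) for sid, eid in second
            if sid in earliest and earliest[sid] < eid]


def support(id_list):
    """Number of distinct sequences in which the pattern occurs."""
    return len({sid for sid, _ in id_list})


a_then_c = temporal_join(id_lists["A"], id_lists["C"])
c_then_a = temporal_join(id_lists["C"], id_lists["A"])
print("A -> C:", support(a_then_c), " C -> A:", support(c_then_a))
```

Because the join only touches the two id-lists involved, the original database does not need
to be rescanned to count the longer sequence.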
Challenges of Frequent Pattern Mining
Challenges
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for candidates
Improving Apriori: general ideas
Reduce passes of transaction database scans
Shrink number of candidates
Facilitate support counting of candidates
Basic Concepts & Basic Association Rules Algorithms
Let I = {I1, I2, … , Im} be a set of m distinct attributes, T be a transaction that contains a
set of items such that T ⊆ I, and D be a database of transaction records T. An association
rule is an implication of the form X ⇒ Y, where X, Y ⊂ I are sets of items called itemsets,
and X ∩ Y = ∅. X is called the antecedent and Y the consequent; the rule means that X
implies Y.
There are two important basic measures for association rules: support (s) and confidence (c).
Since the database is large and users are concerned only with frequently purchased items,
thresholds of support and confidence are usually predefined by users to drop those rules that
are not interesting or useful. The two thresholds are called minimal support and minimal
confidence respectively. The support (s) of an association rule is defined as the
percentage/fraction of records that contain X ∪ Y out of the total number of records in the
database. For example, if the support of an item is 0.1%, only 0.1 percent of the transactions
contain a purchase of this item.
The confidence of an association rule is defined as the percentage/fraction of the number of
transactions that contain X ∪ Y out of the total number of records that contain X.
Confidence is a measure of the strength of an association rule. If the confidence of the
association rule X ⇒ Y is 80%, then 80% of the transactions that contain X also
contain Y.
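Written as formulas over the database D, the two measures defined above read as follows. The
numbers in the worked instance are hypothetical and chosen only to reproduce the 80%
confidence mentioned in the text.

```latex
\[
\mathrm{support}(X \Rightarrow Y)
  = \frac{\lvert \{\, T \in D : X \cup Y \subseteq T \,\} \rvert}{\lvert D \rvert},
\qquad
\mathrm{confidence}(X \Rightarrow Y)
  = \frac{\lvert \{\, T \in D : X \cup Y \subseteq T \,\} \rvert}
         {\lvert \{\, T \in D : X \subseteq T \,\} \rvert}
\]
% Worked instance (hypothetical counts): |D| = 1000 transactions,
% 200 of them contain X, and 160 of them contain X \cup Y.
\[
\mathrm{support}(X \Rightarrow Y) = \frac{160}{1000} = 16\%,
\qquad
\mathrm{confidence}(X \Rightarrow Y) = \frac{160}{200} = 80\%
\]
```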
In general, a set of items (such as the antecedent or the consequent of a rule) is called an
itemset. The number of items in an itemset is called the length of an itemset. Itemsets of some
length k are referred to as k-itemsets.
Generally, an association rule mining algorithm contains the following steps:
• The set of candidate k-itemsets is generated by 1-extensions of the large (k-1)-itemsets
generated in the previous iteration.
• Supports for the candidate k-itemsets are generated by a pass over the database.
• Itemsets that do not have the minimum support are discarded and the remaining
itemsets are called large k-itemsets. This process is repeated until no more large
itemsets are found.
The AIS algorithm was the first algorithm proposed for mining association rules. In this
algorithm only association rules with a one-item consequent are generated; for example, we
generate rules like X ∪ Y ⇒ Z but not rules like X ⇒ Y ∪ Z. The main drawback of the AIS
algorithm is that it generates too many candidate itemsets that finally turn out to be small,
which requires more space and wastes much effort. At the same time, this algorithm
requires too many passes over the whole database.
Apriori is more efficient during the candidate generation process. Apriori uses pruning
techniques to avoid measuring certain itemsets, while guaranteeing completeness. These are
the itemsets that the algorithm can prove will not turn out to be large. However, the Apriori
algorithm has two bottlenecks. One is the complex candidate generation process, which
uses most of the time, space, and memory. The other is the multiple scans of the
database. Many new algorithms have been designed as modifications or improvements of
the Apriori algorithm.
Increasing the Efficiency of Association Rules Algorithms
The computational cost of association rule mining can be reduced in four ways:
• by reducing the number of passes over the database
• by sampling the database
• by adding extra constraints on the structure of patterns
• through parallelization.
In recent years much progress has been made in all these directions.
Reducing the number of passes over the database
FP-Tree (frequent pattern tree) mining is another milestone in the development of
association rule mining, which breaks the main bottlenecks of Apriori. The frequent itemsets
are generated with only two passes over the database and without any candidate generation
process. The FP-tree is an extended prefix-tree structure storing crucial, quantitative
information about frequent patterns. Only frequent length-1 items have nodes in the tree, and
the tree nodes are arranged in such a way that more frequently occurring nodes have better
chances of sharing nodes than less frequently occurring ones.
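The two-pass construction of an FP-tree can be sketched in Python as follows. This is a
minimal illustration under simplifying assumptions: the class FPNode and the functions
build_fp_tree and count_nodes are hypothetical names, the header table and node links used
by the full mining procedure are omitted, and the five transactions are toy data.

```python
from collections import Counter


class FPNode:
    """One node of the prefix tree: an item, a count, and child nodes."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    """Build an FP-tree with two passes over `transactions`."""
    # Pass 1: count items and keep only the frequent length-1 items.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}

    root = FPNode(None)
    # Pass 2: insert each transaction with its frequent items sorted by
    # descending frequency (ties broken alphabetically), so that common
    # prefixes share nodes.
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent


def count_nodes(node):
    """Number of item nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


transactions = [["a", "b"], ["b", "c", "d"], ["a", "b", "c"], ["a", "b", "d"], ["e", "b"]]
root, frequent = build_fp_tree(transactions, min_count=2)
print("frequent items:", frequent)
print("tree nodes:", count_nodes(root))
```

In this toy example the twelve frequent item occurrences in the database are stored in six
tree nodes, because every transaction shares the prefix b and three of them share the
prefix b, a.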
FP-Tree scales much better than Apriori because, as the support threshold goes down, the
number as well as the length of frequent itemsets increases dramatically. The candidate sets
that Apriori must handle become extremely large, and matching a large number of
candidates against the transactions becomes very expensive. The frequent pattern
generation process includes two sub-processes: constructing the FP-Tree, and
generating frequent patterns from the FP-Tree. The mining result is the same as that of the
Apriori family of algorithms. To sum up, the efficiency of the FP-Tree algorithm has three
reasons.
First, the FP-Tree is a compressed representation of the original database because
only the frequent items are used to construct the tree; other, irrelevant information
is pruned.
Second, the algorithm scans the database only twice.
Third, FP-Tree uses a divide-and-conquer method that considerably reduces the size
of the subsequent conditional FP-Trees.
Every algorithm has its limitations; FP-Tree is difficult to use in an interactive
mining system. During the interactive mining process, users may change the support
threshold according to the rules found. For FP-Tree, however, a change of support may lead
to a repetition of the whole mining process. Another limitation of FP-Tree is that it is not
suitable for incremental mining. Databases keep changing as time goes on, and new data sets
may be inserted into the database; those insertions may also lead to a repetition of the whole
process if we employ the FP-Tree algorithm. Tree Projection is another efficient algorithm
that has been proposed. The general idea of Tree Projection is that it constructs a
lexicographical tree and projects a large database into a set of reduced, item-based
sub-databases based on the frequent patterns mined so far. The number of nodes in its
lexicographic tree is exactly that of the frequent itemsets. The efficiency of Tree Projection
can be explained by two main factors: (1) the transaction projection limits the support
counting to a relatively small space; and (2) the lexicographical tree facilitates the
management and counting of candidates and provides the flexibility of picking an efficient
strategy during the tree generation and transaction projection phases.
5.8 Summary
In this unit, we learnt about the following concepts:
Frequent patterns, which are patterns (such as itemsets, subsequences, or
substructures) that appear in a data set frequently.
Market basket analysis, which is just one form of frequent pattern mining. In fact, there
are many kinds of frequent patterns, association rules, and correlation relationships.
Apriori, the basic algorithm for finding frequent itemsets, together with how to generate
strong association rules from frequent itemsets and several variations of the Apriori
algorithm for improved efficiency and scalability.
5.9 Keywords
Apriori, Itemset, Pattern, Market basket analysis, Frequent Pattern mining
5.10 Exercises
1. Explain the Market Basket Analysis.
2. Explain Association and Correlation.
3. Explain the Apriori algorithm.
4. Discuss frequent pattern mining concept and algorithm.
5. Write short notes on association rules.
6. Why is frequent pattern mining important?
7. How do you mine frequent patterns without candidate generation?
8. Explain SPADE algorithm.
5.11 References
1. Data Mining Concepts and techniques by Jiawei Han and Micheline Kamber.
2. Fast Algorithms for Mining Association Rules by Rakesh Agrawal, Ramakrishnan
Srikant
3. Discovering Frequent Closed Itemsets for Association Rules Nicolas Pasquier, Yves
Bastide, Rafik Taouil, and Lotfi Lakhal
4. Introduction to Data Mining by Tan, Steinbach, Kumar
UNIT-6: FP GROWTH ALGORITHMS
Structure
6.1 Objectives
6.2 Introduction
6.3 Alternative methods for generating Frequent Itemsets
6.4 FP Growth Algorithm
6.5 Evaluation of Association Patterns
6.6 Summary
6.7 Keywords
6.8 Exercises
6.9 References
6.1 Objectives
In this unit, we will learn alternative methods for generating frequent itemsets and a
description of the FP-growth algorithm. We will also learn how association patterns are
evaluated.
6.2 Introduction
This section highlights the alternative methods that are available for generating frequent
itemsets. The Apriori algorithm explained earlier is one of the most widely used algorithms
in association mining, but it is not without its limitations. Its performance degrades on dense
data sets, and it has a high overhead.
Butter → Bread, Chocolate → Teddy Bear, Beer → Diapers: which of these three seems
interesting to you? Which of these three might affect the way you do business? We can all
already assume that most people who buy bread will buy butter, and if I were to tell you that
I have analysis to show that most customers who buy chocolate also buy a teddy bear, you
wouldn't be surprised. But what if we told you about a link between beer and diapers; would
that not spark your interest?
6.3 Alternative methods for generating frequent item sets
The way that the transaction data set is represented can also affect the performance of the
algorithm. The more popular representation is the horizontal data layout, which has been
shown in the examples of previous sections; an alternative representation is the vertical data
layout, which stores for each item the list of transactions that contain it. The vertical data
layout can only be used for smaller data sets because the initial layout of the transaction
data set might not fit into main memory. The traversal of the itemset lattice is another
crucial area that has improved over the last couple of years. The rest of this section
introduces alternative methods for generating frequent itemsets that take the
aforementioned limitations into account and try to improve on the efficiency of the Apriori
algorithm.
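The two layouts can be compared with a short Python sketch. Here "horizontal" means one
record per transaction listing its items, and "vertical" means one record per item listing
the identifiers of the transactions that contain it (a TID-list). The data and the function
name support_count are hypothetical, and the function assumes a non-empty itemset.

```python
# Horizontal layout: transaction id -> items in that transaction (toy data).
horizontal = {
    1: {"a", "b"},
    2: {"b", "c", "d"},
    3: {"a", "b", "c"},
    4: {"a", "b", "d"},
}

# Vertical layout: item -> set of ids of the transactions containing it (TID-list).
vertical = {}
for tid, items in horizontal.items():
    for item in items:
        vertical.setdefault(item, set()).add(tid)


def support_count(itemset):
    """Support count of a non-empty itemset, by intersecting the TID-lists of its items."""
    tid_lists = [vertical.get(item, set()) for item in itemset]
    return len(set.intersection(*tid_lists))


print(sorted(vertical["b"]))            # TID-list of item b
print(support_count({"a", "b"}))        # transactions containing both a and b
print(support_count({"b", "c", "d"}))   # transactions containing b, c and d
```

With the vertical layout, support counting needs no scan over the transactions, but the
complete set of TID-lists must be available in memory, which is the limitation on data set
size noted above.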
Mining Closed Frequent Itemsets
Frequent itemset mining may generate a huge number of frequent itemsets, especially when
the min_sup threshold is set low or when there exist long patterns in the data set. Closed
frequent itemsets can substantially reduce the number of patterns generated in frequent
itemset mining while preserving the complete information regarding the set of frequent
itemsets. That is, from the set of closed frequent itemsets, we can easily derive the set of
frequent itemsets and their support. Thus in practice, it is more desirable to mine the set of
closed frequent itemsets rather than the set of all frequent itemsets in most cases.
"How can we mine closed frequent itemsets?" A naïve approach would be to first mine the
complete set of frequent itemsets and then remove every frequent itemset that is a proper
subset of, and carries the same support as, an existing frequent itemset. However, this is quite
costly. For a data set containing a length-100 frequent itemset, this method would have to
first derive 2^100 − 1 frequent itemsets in order to obtain that itemset, all before it could
begin to eliminate redundant itemsets. This is prohibitively expensive, even though there
may exist only a very small number of closed frequent itemsets in such a data set.
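The naïve filtering approach, and the way supports can be recovered from the closed itemsets
alone, can be illustrated with a small Python sketch. The function names and the support
counts are hypothetical; the counts correspond to the four toy transactions {a, b, c},
{a, b, c}, {a, b} and {a}.

```python
def closed_itemsets(frequent):
    """Naive filter: keep the itemsets that have no proper superset with equal support."""
    return {
        s: c for s, c in frequent.items()
        if not any(s < other and c == frequent[other] for other in frequent)
    }


def support_from_closed(itemset, closed):
    """The support of a frequent itemset equals the largest support among the
    closed itemsets that contain it."""
    return max(c for s, c in closed.items() if itemset <= s)


# Support counts of all frequent itemsets of the toy transactions (min count 1).
frequent = {
    frozenset("a"): 4, frozenset("b"): 3, frozenset("c"): 2,
    frozenset("ab"): 3, frozenset("ac"): 2, frozenset("bc"): 2,
    frozenset("abc"): 2,
}

closed = closed_itemsets(frequent)
print({"".join(sorted(s)): c for s, c in closed.items()})
print(support_from_closed(frozenset("b"), closed))
```

Seven frequent itemsets shrink to three closed ones ({a}, {a, b} and {a, b, c}), and the
support of every discarded itemset, such as {b}, can still be derived from them.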
A recommended methodology is to search for closed frequent itemsets directly during the
mining process. This requires us to prune the search space as soon as we can identify the
case of closed itemsets during mining. Pruning strategies include the following:
Item merging: If every transaction containing a frequent itemset X also contains an itemset Y
but not any proper superset of Y, then X ∪ Y forms a frequent closed itemset and there is no
need to search for any itemset containing X but no Y. For example, if the projected
conditional database for prefix itemset {I5:2} is {{I2, I1}, {I2, I1, I3}}, we can see that each
of its transactions contains itemset {I2, I1} but no proper superset of {I2, I1}. Itemset
{I2, I1} can be merged with {I5} to form the closed itemset {I5, I2, I1: 2}, and we do not
need to mine for closed itemsets that contain I5 but not {I2, I1}.
Traversal of Itemset Lattice Methods:
General-to-Specific
This is the strategy that is employed by the Apriori algorithm. It uses prior frequent itemsets
to generate future itemsets; by this we mean that it uses frequent (k-1)-itemsets to generate
candidate k-itemsets. This strategy is effective when the maximum length of a frequent
itemset is not too long.
Specific-to-General
This strategy is the reverse of the general-to-specific and does as the name suggests. It starts
with more specific frequent itemsets before finding general frequent itemsets. This strategy
helps us discover maximal frequent itemsets in dense transactions where the frequent itemset
border is located near the bottom of the lattice.
Bi-directional
This strategy is a combination of the previous two, and it is useful because it can rapidly
identify the frequent itemset border. It is highly efficient when the frequent itemset border
isn't located at either extreme, cases which are conveniently handled by one of the previous
strategies. The only limitation is that it requires more space to store the candidate itemsets.
Equivalence Classes
This strategy breaks the lattice into equivalence classes as shown in the figure, and it then
employs a frequent itemset generation algorithm that searches through each class thoroughly
before moving to the next class. This is beneficial when a certain itemset is known to be a
frequent itemset. For instance, if we knew that the first two items are predominant in most
transactions, partitioning the lattice based on the prefix as shown in the diagram might prove
advantageous. We can also partition the lattice based on its suffix.
Breadth-First
As the name suggests this strategy applies a breadth-first approach to the lattice. This method
searches through the 1-itemset row first, discovers all the frequent 1-itemsets before going to
the next level to find frequent 2-itemsets.
Depth-First
This strategy traverses the lattice in a depth-first manner. The algorithm picks a certain
1-itemset and follows it all the way down until an infrequent itemset is found with the
1-itemset as its prefix. This method is useful because it helps us determine the border
between frequent and infrequent itemsets more quickly than the breadth-first approach.
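A depth-first traversal of the itemset lattice can be sketched as below. This is a simplified illustration under assumed names (`support`, `dfs_frequent`) and a made-up transaction list. It counts support by rescanning the transactions, which a practical algorithm would avoid.

```python
def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def dfs_frequent(prefix, remaining, transactions, min_count, out):
    """Extend `prefix` one item at a time, going deep before going wide."""
    for i, item in enumerate(remaining):
        candidate = prefix | {item}
        count = support(candidate, transactions)
        if count >= min_count:
            out[frozenset(candidate)] = count
            # Recurse with only the later items so each itemset is visited once.
            dfs_frequent(candidate, remaining[i + 1:], transactions, min_count, out)
        # An infrequent candidate marks the border on this branch: do not extend it.

# Hypothetical data: five transactions over items A, B, C.
transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
found = {}
dfs_frequent(set(), ["A", "B", "C"], transactions, 3, found)
print(len(found))
```

With a minimum support count of 3, the search follows A down to {A, B}, finds that {A, B, C} is infrequent, and backtracks, so the border is reached on the first branch.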
6.4 FP Growth Algorithm
The FP-Growth algorithm is an alternative way to find frequent itemsets without
candidate generation, thus improving performance. To do so, it uses a divide-and-conquer
strategy. The core of this method is the use of a special data structure named the
frequent-pattern tree (FP-tree), which retains the itemset association information.
In simple words, the algorithm works as follows. First, it compresses the input database
into an FP-tree that represents the frequent items. It then divides the
compressed database into a set of conditional databases, each one associated with one
frequent pattern. Finally, each such database is mined separately. Using this strategy, FP-Growth
reduces the search cost by recursively looking for short patterns and then concatenating
them into long frequent patterns, offering good selectivity.
In large databases, it may not be possible to hold the FP-tree in main memory. A strategy to
cope with this problem is to first partition the database into a set of smaller databases
(called projected databases), and then construct an FP-tree from each of these smaller
databases.
The next subsections describe the FP-tree structure and the FP-Growth algorithm. Finally, an
example is presented to make these concepts easier to understand.
FP-Tree structure
The frequent-pattern tree (FP-tree) is a compact structure that stores quantitative information
about frequent patterns in a database.
Han defines the FP-tree as the tree structure described below:
• One root labelled "null", with a set of item-prefix subtrees as children, and a
frequent-item-header table (shown on the left side of Figure 6.1);
• Each node in the item-prefix subtree consists of three fields:
  • Item-name: registers which item is represented by the node;
  • Count: the number of transactions represented by the portion of the path
    reaching the node;
  • Node-link: links to the next node in the FP-tree carrying the same item-name,
    or null if there is none.
• Each entry in the frequent-item-header table consists of two fields:
  • Item-name: the same as in the node;
  • Head of node-link: a pointer to the first node in the FP-tree carrying the item-name.
Additionally, the frequent-item-header table can store the support count of each item.
Figure 6.1 below shows an example of an FP-tree.
Figure 6.1: An example of an FP-tree
The original algorithm to construct the FP-tree, as defined by Han, is presented below.
Algorithm 1: FP-tree construction
Input: A transaction database DB and a minimum support threshold ξ.
Output: FP-tree, the frequent-pattern tree of DB.
Method: The FP-tree is constructed as follows.
1. Scan the transaction database DB once. Collect F, the set of frequent items,
   and the support of each frequent item. Sort F in support-descending order as
   FList, the list of frequent items.
2. Create the root of an FP-tree, T, and label it as "null". For each transaction
   Trans in DB do the following:
   Select the frequent items in Trans and sort them according to the order of
   FList. Let the sorted frequent-item list in Trans be [p | P], where p is the
   first element and P is the remaining list. Call insert_tree([p | P], T).
   The function insert_tree([p | P], T) is performed as follows. If T has a child
   N such that N.item-name = p.item-name, then increment N's count by 1;
   else create a new node N, with its count initialized to 1, its parent link
   linked to T, and its node-link linked to the nodes with the same item-name
   via the node-link structure. If P is nonempty, call insert_tree(P, N)
   recursively.
By using this algorithm, the FP-tree is constructed in two scans of the database.
The first scan collects and sorts the set of frequent items, and the second
constructs the FP-tree.
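The two scans of Algorithm 1 can be sketched in Python as follows. This is one possible reading of the algorithm, not Han's implementation: the class name `FPNode`, the function `build_fp_tree`, the alphabetical tie-break in FList, and the nine illustrative transactions over items I1 to I5 are all assumptions made for the example. The recursive insert_tree is written as a loop.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item = item          # item-name (None for the root)
        self.count = 0            # transactions sharing the path to this node
        self.parent = parent
        self.children = {}        # item -> FPNode
        self.node_link = None     # next node carrying the same item

def build_fp_tree(transactions, min_count):
    # Scan 1: count item supports and keep the frequent ones in FList order.
    support = Counter(item for t in transactions for item in set(t))
    ordered = sorted(support.items(), key=lambda kv: (-kv[1], kv[0]))
    flist = [item for item, count in ordered if count >= min_count]
    rank = {item: r for r, item in enumerate(flist)}
    root = FPNode(None, None)
    header = {item: None for item in flist}   # head of each node-link chain
    # Scan 2: insert each transaction's frequent items in FList order.
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                # Thread the new node onto the front of this item's chain.
                child.node_link = header[item]
                header[item] = child
            child.count += 1
            node = child
    return root, header, flist

# Illustrative transaction database (item names are placeholders).
db = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
      ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
      ["I1", "I2", "I3"]]
root, header, flist = build_fp_tree(db, min_count=2)
print(flist)
```

Following the node-link chain from `header[item]` visits every node that carries `item`, which is what the mining step uses to collect prefix paths.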
Algorithm 2: FP-Growth
Input: A database DB, represented by the FP-tree constructed according to Algorithm 1,
and a minimum support threshold ξ.
Output: The complete set of frequent patterns.
Method: Call FP-growth(FP-tree, null).
Procedure FP-growth(Tree, α) {
(01) if Tree contains a single prefix path then { // Mining single prefix-path FP-tree
(02)   let P be the single prefix-path part of Tree;
(03)   let Q be the multipath part with the top branching node replaced by a null root;
(04)   for each combination (denoted as β) of the nodes in the path P do
(05)     generate pattern β ∪ α with support = minimum support of nodes in β;
(06)   let freq_pattern_set(P) be the set of patterns so generated;
     }
(07) else let Q be Tree;
(08) for each item a_i in Q do { // Mining multipath FP-tree
(09)   generate pattern β = a_i ∪ α with support = a_i.support;
(10)   construct β's conditional pattern-base and then β's conditional FP-tree Tree_β;
(11)   if Tree_β ≠ Ø then
(12)     call FP-growth(Tree_β, β);
(13)   let freq_pattern_set(Q) be the set of patterns so generated;
     }
(14) return(freq_pattern_set(P) ∪ freq_pattern_set(Q) ∪ (freq_pattern_set(P) ×
     freq_pattern_set(Q)))
}
When the FP-tree contains a single prefix-path, the complete set of frequent patterns can be
generated in three parts: the single prefix-path P, the multipath Q, and their combinations
(lines 01 to 03 and 14). The resulting patterns for a single prefix path are the enumerations of
its subpaths that have the minimum support (lines 04 to 06). Thereafter, the multipath Q is
defined (line 03 or 07) and the resulting patterns from it are processed (lines 08 to 13).
Finally, in line 14 the combined results are returned as the frequent patterns found.
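The recursive mining step can be sketched compactly in Python. This is a simplified illustration, not the procedure exactly as listed: it omits the single prefix-path shortcut (lines 01 to 06) and always takes the multipath branch, which yields the same patterns. The names `Node`, `build_tree` and `fp_growth`, the use of (items, weight) pairs to represent a conditional pattern base, and the sample transactions are assumptions made for the example.

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(weighted_transactions, min_count):
    """Build an FP-tree from (items, weight) pairs and return its header table."""
    support = defaultdict(int)
    for items, weight in weighted_transactions:
        for item in set(items):
            support[item] += weight
    frequent = {i: c for i, c in support.items() if c >= min_count}
    root = Node(None, None)
    header = defaultdict(list)                # item -> all nodes carrying it
    for items, weight in weighted_transactions:
        node = root
        kept = (i for i in set(items) if i in frequent)
        # Insert in support-descending (FList) order, ties broken by name.
        for item in sorted(kept, key=lambda i: (-frequent[i], i)):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += weight
    return header

def fp_growth(weighted_transactions, min_count, suffix=frozenset()):
    """Return {frequent itemset: support count} for patterns containing `suffix`."""
    patterns = {}
    header = build_tree(weighted_transactions, min_count)
    for item, nodes in header.items():
        beta = suffix | {item}
        patterns[beta] = sum(n.count for n in nodes)
        # Conditional pattern base: the prefix path above each node of `item`.
        base = []
        for n in nodes:
            path, p = [], n.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                base.append((path, n.count))
        # Recurse on the conditional FP-tree built from that base.
        patterns.update(fp_growth(base, min_count, beta))
    return patterns

# Illustrative transaction database (item names are placeholders).
db = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
      ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
      ["I1", "I2", "I3"]]
result = fp_growth([(t, 1) for t in db], min_count=2)
print(len(result))
```

Rebuilding a small tree for every conditional pattern base keeps the sketch short. An implementation tuned for large databases would reuse node-links and apply the single-path shortcut to avoid part of this work.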
6.5 Evaluation of association patterns
After the creation of association rules we must decide which rules are actually interesting and
of use to us. Market basket data with about 10 transactions and 5 items can yield on the order of
100 association rules, and we need to be able to sift through all these patterns and identify
the most interesting ones. Interestingness is the term used for the property that makes a pattern
worth our attention; it can be assessed by subjective and objective measures.
Subjective measures are those that depend on the class of users who examine the pattern. Take
the examples from the introduction, Teddy Bears → Chocolate and Beer → Diapers: the pattern
Teddy Bears → Chocolate can be considered subjectively uninteresting because it does not reveal
any information that is not already expected.
Incorporating subjective knowledge into pattern evaluation is a complex task and is beyond
the scope of this introductory course.
An objective measure on the other hand uses statistical information which can be derived
from the data to determine whether a particular pattern is interesting; support and confidence
are both examples of objective measures of interestingness. These measures can be applied
independently of a particular application. But there are limitations that we encounter when we
try to use just the numerical support and confidence to determine the usefulness of a
particular rule and because of these limitations other measures have been used to evaluate the
quality of an association pattern. The rest of this section covers the details of objective
measures of interestingness.
Objective Measures of Interestingness
Lift
This is the most popular objective measure of interestingness. It computes the ratio between
the rule's confidence and the support of the itemset in the rule's consequent.
$$\mathrm{Lift}(A \rightarrow B) = \frac{c(A \rightarrow B)}{s(B)}$$
Interest Factor
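The interest factor is the equivalent of lift for binary variables. It compares the frequency of a
pattern against a baseline frequency computed under the assumption of statistical independence.
$$I(A, B) = \frac{s(A, B)}{s(A)\, s(B)} = \frac{N f_{11}}{f_{1+}\, f_{+1}}$$
Here N is the total number of transactions, f11 is the number of transactions containing both A
and B, and f1+ and f+1 are the numbers of transactions containing A and B, respectively.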
The interest factor lets you know if the itemsets are independent of each other, positively
correlated or negatively correlated.
$$I(A, B) \;\begin{cases} = 1, & \text{A and B are independent} \\ > 1, & \text{A and B are positively correlated} \\ < 1, & \text{A and B are negatively correlated} \end{cases}$$
This measure is not without its own limitation. When dealing with association rules in which
the itemsets have high support, the interest factor ends up being close to 1, which suggests
that the itemsets are independent. This can be a false conclusion, so in situations such as these
the confidence measure is a better choice.
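The interest factor and its high-support limitation can be checked with a few lines of Python. The function name `interest_factor` and all counts below are hypothetical, chosen only to illustrate the formula I(A, B) = N f11 / (f1+ f+1).

```python
def interest_factor(n, f11, f1_plus, f_plus1):
    """I(A, B) = N * f11 / (f1+ * f+1), computed from transaction counts."""
    return n * f11 / (f1_plus * f_plus1)

# Hypothetical counts: 1000 transactions, 200 contain A, 800 contain B, 150 contain both.
low = interest_factor(1000, 150, 200, 800)

# Hypothetical high-support pair: A and B each occur in 930 transactions, 880 contain both.
high = interest_factor(1000, 880, 930, 930)

print(round(low, 4), round(high, 4))
```

In the first case the confidence of A → B is 150/200 = 0.75, yet the interest factor is below 1 because B alone already occurs in 80% of transactions. In the second case the two items almost always occur together (confidence about 0.95), yet the interest factor stays close to 1, which is the limitation described above.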
Correlation Analysis
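Correlation analysis is another objective measure, used to analyze the relationship between a pair
of variables. For binary variables, correlation can be measured using the φ-coefficient:
$$\phi = \frac{f_{11} f_{00} - f_{01} f_{10}}{\sqrt{f_{1+}\, f_{+1}\, f_{0+}\, f_{+0}}}$$
The value of φ is interpreted as follows:
$$\phi = \begin{cases} 0, & \text{independent} \\ +1, & \text{perfectly positively correlated} \\ -1, & \text{perfectly negatively correlated} \end{cases}$$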
Limitations: The correlation measure does not remain invariant when there are proportional
changes to the sample size. Another limitation is that it gives equal importance to both
co-presence and co-absence of items in a transaction, and so it is more suitable for the analysis of
symmetric binary variables.
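The φ-coefficient can be computed directly from the four cells of a 2×2 contingency table. The sketch below uses an assumed function name `phi` and hypothetical counts. The second call swaps presence and absence to show that co-presence and co-absence are treated alike.

```python
from math import sqrt

def phi(f11, f10, f01, f00):
    """Phi coefficient from the cells of a 2x2 contingency table."""
    f1_plus, f0_plus = f11 + f10, f01 + f00   # row totals (A present / absent)
    f_plus1, f_plus0 = f11 + f01, f10 + f00   # column totals (B present / absent)
    return (f11 * f00 - f01 * f10) / sqrt(f1_plus * f_plus1 * f0_plus * f_plus0)

# Hypothetical counts: both, A only, B only, neither.
value = phi(150, 50, 650, 150)
# Swapping the roles of presence and absence leaves phi unchanged.
swapped = phi(150, 650, 50, 150)
print(value, swapped)
```

Because the swapped table gives the same value, φ cannot distinguish a pair of items that are bought together from a pair that are merely both absent from most baskets.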
IS Measure
IS is an objective measure of interestingness that was proposed to help deal with the limitation of
the correlation measure for asymmetric binary variables. It is defined as follows:
$$IS(A, B) = \sqrt{I(A, B) \times s(A, B)} = \frac{s(A, B)}{\sqrt{s(A)\, s(B)}}$$
The limitation is that the value of the measure can be large even for uncorrelated and
negatively correlated patterns.
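The limitation can be seen in a small numeric sketch. It assumes the commonly used definition IS(A, B) = s(A, B) / sqrt(s(A) s(B)), and the function name `is_measure` and the counts are hypothetical.

```python
from math import sqrt

def is_measure(f11, f1_plus, f_plus1):
    """IS(A, B) = s(A, B) / sqrt(s(A) * s(B)); N cancels, so counts suffice."""
    return f11 / sqrt(f1_plus * f_plus1)

# Hypothetical counts out of N = 100: A and B each occur in 90 transactions
# and co-occur in 81, exactly what independence would predict (0.9 * 0.9).
n, f11, f1_plus, f_plus1 = 100, 81, 90, 90
is_value = is_measure(f11, f1_plus, f_plus1)
interest = n * f11 / (f1_plus * f_plus1)
print(is_value, interest)
```

The interest factor is 1, so the items are statistically independent, yet IS is 0.9, close to its maximum of 1.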
6.6 Summary
In this unit, we have studied the FP-Growth algorithm and alternative methods for generating
frequent itemsets. We have also discussed the evaluation of association rules
and the measures used for that purpose.
6.7 Keywords
FP-Growth algorithm, Apriori algorithm, Association patterns.
6.8 Exercises
1. Discuss in brief alternative methods of generating frequent item sets.
2. Explain traversal of itemsets based on lattice methods.
3. Explain FP-growth algorithm.
4. Write short notes on FP tree structure.
5. Devise an algorithm to construct an FP-tree.
6. How do you evaluate association patterns?
6.9 References
1. Introduction to Data Mining with Case Studies, by Gupta G. K
2. Data & Text Mining - Business Applications Approach by Thomas W Miller
UNIT-7: CLASSIFICATION AND PREDICTION
Structure
7.1 Objectives
7.2 Introduction
7.3 Basics of Classification
7.4 General approach to solve classification problem
7.5 Prediction
7.6 Issues Regarding Classification and Prediction
7.7 Summary
7.8 Keywords
7.9 Exercises
7.10 References
7.1 Objectives
In this unit we will learn the basics of classification, followed by a brief introduction to the
general approach to solving a classification problem, a description of prediction, and a description of
issues regarding classification and prediction.
7.2 Introduction
Databases are rich with hidden information that can be used for making intelligent business
decisions. Classification and prediction are two forms of data analysis that can be used to
extract models describing important data classes or to predict future data trends. Whereas
classification predicts categorical labels, prediction models continuous-valued functions. For
example, a classification model may be built to categorize bank loan applications as either safe
or risky, while a prediction model may be built to predict the expenditures of potential
customers on computer equipment given their income and occupation. Many classification
and prediction methods have been proposed by researchers in machine learning, expert
systems, statistics, and neurobiology. Most algorithms are memory resident, typically
assuming a small data size. Recent database mining research has built on such work,
developing scalable classification and prediction techniques capable of handling large disk-resident
data. These techniques often consider parallel and distributed processing.
Classification may refer to categorization, the process in which ideas and objects are
recognized, differentiated, and understood. It is the process of assigning data to
predefined classes. Modern systems analysis, which is a tool for complex analysis of objects,
is based on the technology of data mining as a tool for identification of structures and laws
under not only adequate but also incomplete information. Data mining algorithms primarily
include methods for the reconstruction of dependences in identification, classification, and
clustering of objects.
What is classification?
The following are examples of cases where the data analysis task is classification:
• A bank loan officer wants to analyze the data in order to know which customers (loan
  applicants) are risky and which are safe.
• A marketing manager at a company needs to analyze the data to guess whether a customer
  with a given profile will buy a new computer.
In both of the above examples a model or classifier is constructed to predict
categorical labels. These labels are risky or safe for the loan application data and yes or
no for the marketing data.
What is prediction?
The following is an example of a case where the data analysis task is prediction:
Suppose the marketing manager needs to predict how much a given customer will
spend during a sale at his company. In this example we want to predict a
numeric value. Therefore the data analysis task is an example of numeric prediction. In
this case a model or predictor is constructed that predicts a continuous-valued function or
ordered value.
7.3 Basics of Classification
Definition: Given a collection of records (the training set), where each record contains a set of
attributes and one of the attributes is the class, find a model for the class attribute as a function of
the values of the other attributes.
Goal: Previously unseen records should be assigned a class as accurately as possible. A test
set is used to determine the accuracy of the model. Usually, the given data set is divided into
training and test sets, with training set used to build the model and test set used to validate it.
When the class is numerical, the problem is a regression problem where the model
constructed predicts a continuous-valued function, or ordered value, as opposed to a class
label. Such a model is a predictor. Regression analysis is a statistical methodology that is most
often used for numeric prediction.
Classification is a two-step process:
1. Model construction: Describing a set of predetermined classes. Each tuple/sample is
   assumed to belong to a predefined class, as determined by the class label attribute.
   The set of tuples used for model construction is the training set. The model is represented
   as classification rules, decision trees, or mathematical formulae.
2. Model usage: Classifying future or unknown objects. First, the accuracy of the
   model is estimated: the known label of each test sample is compared with the classified
   result from the model. The accuracy rate is the percentage of test set samples that are
   correctly classified by the model. The test set must be independent of the training set;
   otherwise the accuracy estimate will be overly optimistic because of over-fitting.
Classification vs. Prediction:
A bank loans officer needs analysis of her data in order to learn which loan applicants are
"safe" and which are "risky" for the bank. A marketing manager at All Electronics needs data
analysis to help guess whether a customer with a given profile will buy a new computer. A
medical researcher wants to analyze breast cancer data in order to predict which one of three
specific treatments a patient should receive. In each of these examples, the data analysis task
is classification, where a model or classifier is constructed to predict categorical labels, such
as "safe" or "risky" for the loan application data; "yes" or "no" for the marketing data; or
"treatment A," "treatment B," or "treatment C" for the medical data. These categories can be
represented by discrete values, where the ordering among values has no meaning. For
example, the values 1, 2, and 3 may be used to represent treatments A, B, and C, where there
is no ordering implied among this group of treatment regimes.
Suppose that the marketing manager would like to predict how much a given customer will
spend during a sale at All Electronics. This data analysis task is an example of numeric
prediction, where the model constructed predicts a continuous-valued function, or ordered
value, as opposed to a categorical label. This model is a predictor. Regression analysis is a
statistical methodology that is most often used for numeric prediction; hence the two terms
are often used synonymously. We do not treat the two terms as synonyms, however, because
several other methods can be used for numeric prediction, as we shall see later in this chapter.
Classification and numeric prediction are the two major types of prediction problems. For
simplicity, when there is no ambiguity, we will use the shortened term of prediction to refer
to numeric prediction.
7.4 General approach to solve classification problem
"How does classification work?"
Data classification is a two-step process, consisting of a learning step (where a classification
model is constructed) and a classification step (where the model is used to predict class labels
for given data). The process is shown for the loan application data of Figure 7.1. (The data
are simplified for illustrative purposes. In reality, we may expect many more attributes to be
considered.)
Figure 7.1. Data Classification Process
The data classification process involves:
(a) Learning: Training data are analyzed by a classification algorithm. Here, the class
label attribute is loan_decision, and the learned model or classifier is represented in
the form of classification rules.
(b) Classification: Test data are used to estimate the accuracy of the classification rules. If
the accuracy is considered acceptable, the rules can be applied to the classification of
new data tuples.
In the first step, a classifier is built describing a predetermined set of data classes or concepts.
This is the learning step (or training phase), where a classification algorithm builds the
classifier by analyzing or "learning from" a training set made up of database tuples and their
associated class labels. A tuple, X, is represented by an n-dimensional attribute vector,
X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n database attributes,
respectively, A1, A2, ..., An. Each
tuple, X, is assumed to belong to a predefined class as determined by another database
attribute called the class label attribute. The class label attribute is discrete-valued and
unordered. It is categorical (or nominal) in that each value serves as a category or class. The
individual tuples making up the training set are referred to as training tuples and are
randomly sampled from the database under analysis. In the context of classification, data
tuples can be referred to as samples, examples, instances, data points, or objects.
Because the class label of each training tuple is provided, this step is also known as
supervised learning (i.e., the learning of the classifier is "supervised" in that it is told to
which class each training tuple belongs). It contrasts with unsupervised learning (or
clustering), in which the class label of each training tuple is not known, and the number or set
of classes to be learned may not be known in advance. For example, if we did not have the
loan_decision data available for the training set, we could use clustering to try to determine
"groups of like tuples," which may correspond to risk groups within the loan application data.
This first step of the classification process can also be viewed as the learning of a mapping or
function, y = f(X), that can predict the associated class label y of a given tuple X. In this view, we
wish to learn a mapping or function that separates the data classes. Typically, this mapping is
represented in the form of classification rules, decision trees, or mathematical formulae. The
rules can be used to categorize future data tuples, as well as provide deeper insight into the
data contents. They also provide a compressed data representation.
"What about classification accuracy?"
In the second step, the model is used for classification. First, the predictive accuracy of the
classifier is estimated. If we were to use the training set
to measure the classifier's accuracy, this estimate would likely be optimistic, because the
classifier tends to overfit the data (i.e., during learning it may incorporate some particular
anomalies of the training data that are not present in the general data set overall). Therefore, a
test set is used, made up of test tuples and their associated class labels. They are independent
of the training tuples, meaning that they were not used to construct the classifier. The
accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly
classified by the classifier. The associated class label of each test tuple is compared with the
learned classifier's class prediction for that tuple.
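The two steps and the accuracy estimate can be sketched with a deliberately simple learner. Everything here is an assumption made for illustration: the one-attribute rule learner `train_one_rule`, the attribute `income`, and the toy loan tuples are not taken from Figure 7.1.

```python
from collections import Counter, defaultdict

def train_one_rule(rows, attribute, label):
    """Learning step: map each attribute value to its most common class."""
    votes = defaultdict(Counter)
    for row in rows:
        votes[row[attribute]][row[label]] += 1
    default = Counter(r[label] for r in rows).most_common(1)[0][0]
    rules = {value: c.most_common(1)[0][0] for value, c in votes.items()}
    return lambda row: rules.get(row[attribute], default)

def accuracy(classifier, rows, label):
    """Fraction of tuples whose predicted class matches the known label."""
    return sum(1 for r in rows if classifier(r) == r[label]) / len(rows)

def loan(income, decision):
    return {"income": income, "loan_decision": decision}

# Hypothetical training and test tuples, kept disjoint.
train = [loan("low", "risky"), loan("low", "risky"), loan("medium", "safe"),
         loan("medium", "risky"), loan("medium", "safe"), loan("high", "safe"),
         loan("high", "safe")]
test = [loan("low", "risky"), loan("high", "safe"),
        loan("medium", "risky"), loan("low", "safe")]

classifier = train_one_rule(train, "income", "loan_decision")
train_accuracy = accuracy(classifier, train, "loan_decision")
test_accuracy = accuracy(classifier, test, "loan_decision")
print(train_accuracy, test_accuracy)
```

On this toy data the accuracy measured on the training tuples (6 of 7) is higher than the accuracy on the independent test tuples (2 of 4), which is the optimistic bias the text warns about.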
If the accuracy of the classifier is considered acceptable, the classifier can be used to classify
future data tuples for which the class label is not known. (Such data are also referred to in the
machine learning literature as "unknown" or "previously unseen" data.)
7.5 Prediction
Many forms of data mining are predictive. For example, a model might predict income based
on education and other demographic factors. Predictions have an associated probability (How
likely is this prediction to be true?). Prediction probabilities are also known as confidence
(How confident can I be of this prediction?).
Some forms of predictive data mining generate rules, which are conditions that imply a given
outcome. For example, a rule might specify that a person who has a bachelor's degree and
lives in a certain neighbourhood is likely to have an income greater than the regional average.
Rules have an associated support (What percentage of the population satisfies the rule?).
In the narrower sense used in this unit, prediction models continuous-valued functions, i.e., it
predicts unknown or missing numeric values.
7.6 Issues Regarding Classification and Prediction
This section describes issues regarding pre-processing the data for classification and
prediction. Criteria for the comparison and evaluation of classification methods are also
described.
Preparing the Data for Classification and Prediction
The following pre-processing steps may be applied to the data to help improve the accuracy,
efficiency, and scalability of the classification or prediction process.
Data cleaning: This refers to the pre-processing of data in order to remove or reduce
noise (by applying smoothing techniques, for example) and the treatment of missing
values (e.g., by replacing a missing value with the most commonly occurring value for
that attribute, or with the most probable value based on statistics). Although most
classification algorithms have some mechanisms for handling noisy or missing data,
this step can help reduce confusion during learning.
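One of the missing-value treatments mentioned above, replacing a missing value with the most commonly occurring value for that attribute, can be sketched as follows. The function name `fill_missing_with_mode`, the use of `None` to mark a missing value, and the sample records are assumptions made for the example.

```python
from collections import Counter

def fill_missing_with_mode(rows, attribute):
    """Return copies of `rows` with None in `attribute` replaced by the mode."""
    observed = [r[attribute] for r in rows if r[attribute] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [{**r, attribute: mode if r[attribute] is None else r[attribute]}
            for r in rows]

# Hypothetical records with one missing occupation.
records = [{"occupation": "clerk"}, {"occupation": "clerk"},
           {"occupation": None}, {"occupation": "engineer"}]
cleaned = fill_missing_with_mode(records, "occupation")
print([r["occupation"] for r in cleaned])
```

The original records are left unchanged, so the cleaned copy can be compared with the raw data when checking how much the imputation altered the class distribution.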
Relevance analysis: Many of the attributes in the data may be redundant. Correlation
analysis can be used to identify whether any two given attributes are statistically
related. For example, a strong correlation between attributes A1 and A2 would suggest
that one of the two could be removed from further analysis. A database may also
contain irrelevant attributes. Attribute subset selection can be used in these cases to find
a reduced set of attributes such that the resulting probability distribution of the data
classes is as close as possible to the original distribution obtained using all attributes.
Hence, relevance analysis, in the form of correlation analysis and attribute subset
selection, can be used to detect attributes that do not contribute to the classification or
prediction task. Including such attributes may otherwise slow down, and possibly
mislead, the learning step. Ideally, the time spent on relevance analysis, when added to
the time spent on learning from the resulting "reduced" attribute (or feature) subset,
should be less than the time that would have been spent on learning from the original
set of attributes. Hence, such analysis can help improve classification efficiency and
scalability.
Data transformation and reduction: The data may be transformed by normalization,
particularly when neural networks or methods involving distance measurements are
used in the learning step. Normalization involves scaling all values for a given attribute
so that they fall within a small specified range, such as −1.0 to 1.0, or 0.0 to 1.0. In
methods that use distance measurements, for example, this would prevent attributes
with initially large ranges (like, say, income) from outweighing attributes with initially
smaller ranges (such as binary attributes).
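A common way to do this scaling is min-max normalization, sketched below. The text does not prescribe a particular method, so the choice of min-max scaling, the function name `min_max_normalize`, and the income figures are assumptions made for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale `values` so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no range to scale; map it to new_min.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

# Hypothetical incomes.
incomes = [12000, 73600, 98000]
scaled = min_max_normalize(incomes)
symmetric = min_max_normalize(incomes, -1.0, 1.0)
print([round(v, 3) for v in scaled])
```

After scaling, income spans the same range as a binary attribute, so neither dominates a distance computation purely because of its units.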
The data can also be transformed by generalizing it to higher-level concepts. Concept
hierarchies may be used for this purpose. This is particularly useful for continuous
valued attributes. For example, numeric values for the attribute income can be
generalized to discrete ranges, such as low, medium, and high. Similarly, categorical
attributes, like street, can be generalized to higher-level concepts, like city. Because
generalization compresses the original training data, fewer input/output operations may
be involved during learning. Data can also be reduced by applying many other methods,
ranging from wavelet transformation and principal components analysis to
discretization techniques, such as binning, histogram analysis, and clustering.
Comparing Classification and Prediction Methods
Here are the criteria for comparing methods of Classification and Prediction:
Accuracy - The accuracy of a classifier refers to its ability to predict the class label
correctly; the accuracy of a predictor refers to how well it can guess
the value of the predicted attribute for new data.
Speed - This refers to the computational cost of generating and using the classifier or
predictor.
Robustness - This refers to the ability of the classifier or predictor to make correct
predictions from noisy data.
Scalability - This refers to the ability to construct the classifier or predictor
efficiently given a large amount of data.
Interpretability - This refers to the level of understanding and insight that the classifier or
predictor provides.
7.7 Summary
In this unit, we studied classification and prediction methods. The pre-processing issues
for classification and prediction were also discussed in brief, and a comparison was provided to
help understand the classification and prediction processes.
7.8 Keywords
Prediction, Classification, Data cleaning, Data pre-processing.
7.9 Exercises
1. Explain the basic concepts of Classification.
2. Discuss the issues regarding data preparation in classification process.
3. What are the criteria used for comparing classification and prediction methods?
4. Discuss the general strategy of classification process.
7.10 References
1. Data Mining Techniques, Arun K Pujari, 1st Edition.
2. Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber.
3. Data Mining: Introductory and Advanced topics, Margaret H Dunham PEA.
4. The Data Warehouse lifecycle toolkit, Ralph Kimball Wiley Student Edition.
UNIT-8: APPROACHES FOR CLASSIFICATION
Structure
8.1 Objectives
8.2 Introduction
8.3 Basics of Probability Theory
8.4 Statement and Interpretation
8.5 Examples and Applications
8.6 Advantages and disadvantages of Bayesian methods
8.7 Bayesian Classifier
8.8 Classification by decision tree induction
8.9 Rule based classification
8.10 Summary
8.11 Keywords
8.12 Exercises
8.13 References
8.1 Objectives
In this unit we will learn about
• The basics of probability theory
• The statement and examples of Bayes' theorem
• The applications of Bayes' theorem
• Bayesian classification
• Decision trees and their application to classification
• Rule-based classification
• Tree pruning
8.2 Introduction
Bayesian classifiers: Bayesian classifiers are statistical classifiers. They can predict class
membership probabilities, such as the probability that a given tuple belongs to a particular
class. Bayesian classification is based on Bayes' theorem, described below. Studies
comparing classification algorithms have found a simple Bayesian classifier known as the
naive Bayesian classifier to be comparable in performance with decision tree and selected
neural network classifiers. Bayesian classifiers have also exhibited high accuracy and speed
when applied to large databases.
Rule-Based Classification: Here we look at rule-based classifiers, where the learned model is
represented as a set of IF-THEN rules. We first examine how such rules are used for
classification. We then study ways in which they can be generated, either from a decision tree
or directly from the training data using a sequential covering algorithm.
8.3 Basics of Probability Theory
In the logic-based approaches we have so far assumed that everything is either believed false or
believed true. However, it is often useful to represent the fact that we believe that something
is probably true, or true with probability (say) 0.65. This is useful for dealing with problems
where there is genuine randomness and unpredictability in the world (such as in games of
chance) and also for dealing with problems where we could, if we had sufficient information,
work out exactly what is true in the world, but where this is impractical. The concept of
probability provides a way to handle such problems.
Probability theory is the branch of mathematics concerned with probability, the analysis of
random phenomena. The central objects of probability theory are random variables,
stochastic processes, and events: mathematical abstractions of non-deterministic events or
measured quantities that may either be single occurrences or evolve over time in an
apparently random fashion. If an individual coin toss or the roll of dice is considered to be a
random event, then if repeated many times the sequence of random events will exhibit certain
patterns, which can be studied and predicted. Two representative mathematical results
describing such patterns are the law of large numbers and the central limit theorem.
As a mathematical foundation for statistics, probability theory is essential to many human
activities that involve quantitative analysis of large sets of data. Methods of probability
theory also apply to descriptions of complex systems given only partial knowledge of their
state, as in statistical mechanics. A great discovery of twentieth century physics was the
probabilistic nature of physical phenomena at atomic scales, described in quantum
mechanics.
In probability theory and statistics, Bayes's theorem (alternatively Bayes's law or Bayes's
rule) is a theorem with two distinct interpretations. In the Bayesian interpretation, it
expresses how a subjective degree of belief should rationally change to account for evidence.
In the frequentist interpretation, it relates inverse representations of the probabilities
concerning two events. In the Bayesian interpretation, Bayes' theorem is fundamental to
Bayesian statistics, and has applications in fields including science, engineering, medicine
and law. The application of Bayes' theorem to update beliefs is called Bayesian inference.
Introductory example: If someone told you they had a nice conversation in the train, the
probability it was a woman they spoke with is 50%. If they told you the person they spoke to
was going to visit a quilt exhibition, it is far more likely than 50% it is a woman. Call W the
event "they spoke to a woman", and Q, the event "a visitor of the quilt exhibition". Then:
P(W) = 0.50; but with the knowledge of Q, the updated value is P(W|Q), which may be
calculated with Bayes' formula as:
$$P(W \mid Q) = \frac{P(Q \mid W)\,P(W)}{P(Q \mid W)\,P(W) + P(Q \mid M)\,P(M)}$$
in which M (man) is the complement of W.
As P(M) = P(W) = 0.5 and P(Q|W) >> P(Q|M), the updated value will be quite close to 1.
8.4 Statement and Interpretation
Mathematically, Bayes' theorem gives the relationship between the probabilities of A and B,
P(A) and P(B), and the conditional probabilities of A given B and B given A, P(A|B) and
P(B|A). In its most common form, it is:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
The meaning of this statement depends on the interpretation of probability ascribed to the
terms:
Bayesian interpretation:
In the Bayesian (or epistemological) interpretation, probability measures a degree of belief.
Bayes' theorem then links the degree of belief in a proposition before and after accounting for
evidence. For example, suppose somebody proposes that a biased coin is twice as likely to
land heads as tails. Degree of belief in this might initially be 50%. The coin is then flipped a
number of times to collect evidence. Belief may rise to 70% if the evidence supports the
proposition.
For proposition A and evidence B,
P(A), the prior, is the initial degree of belief in A.
P(A|B), the posterior, is the degree of belief having accounted for B.
P(B|A)/P(B) represents the support B provides for A.
Frequentist interpretation:
In the frequentist interpretation, probability is defined with respect to a large number of trials,
each producing one outcome from a set of possible outcomes, Ω. An event is a subset of Ω.
The probability of event A, P(A), is the proportion of trials producing an outcome in A.
Similarly for the probability of B, P(B). If we consider only trials in which A occurs, the
proportion in which B also occurs is P(B|A). If we consider only trials in which B occurs, the
proportion in which A also occurs is P(A|B). Bayes' theorem is a fixed relationship between
these quantities.
This situation may be more fully visualized with two tree diagrams, one branching first on A
and the other first on B. The two diagrams represent the same information in different ways.
For example, suppose that A is having a risk factor for a medical condition, and B is having
the condition. In a population, the proportion with the condition depends on whether those
with or without the risk factor are examined. The proportion having the risk factor depends
on whether those with or without the condition are examined. Bayes' theorem links these
inverse representations.
Bayesian forms:
Simple form
For events A and B, provided that P(B) ≠ 0,
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
In a Bayesian inference step, the probability of evidence B is constant for all models A_n. The
posterior may then be expressed as proportional to the numerator:
$$P(A_n \mid B) \propto P(B \mid A_n)\,P(A_n)$$
Extended form
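Often, for some partition {A_i} of the event space, the event space is given or conceptualized
in terms of P(A_i) and P(B|A_i). It is then useful to eliminate P(B) using the law of total
probability:
$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_j P(B \mid A_j)\,P(A_j)}$$
In the special case of a binary partition,
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A)}$$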
Three or more events
Extensions to Bayes' theorem may be found for three or more events. For example, for three
events, two possible tree diagrams branch in the order BCA and ABC. By repeatedly
applying the definition of conditional probability:
$$P(A \mid B, C) = \frac{P(A)\,P(B \mid A)\,P(C \mid A, B)}{P(B)\,P(C \mid B)}$$
As previously, the law of total probability may be substituted for unknown marginal
probabilities.
For random variables
Figure 13.1 Diagram illustrating the meaning of Bayes' theorem as applied to an event space
generated by continuous random variables X and Y. Note that there exists an instance of Bayes'
theorem for each point in the domain. In practice, these instances might be parameterised by writing
the specified probability densities as a function of x and y.
Consider a sample space Ω generated by two random variables X and Y. In principle, Bayes'
theorem applies to the events A = {X=x} and B = {Y=y}. However, terms become 0 at points
where either variable has finite probability density. To remain useful, Bayes' theorem may be
formulated in terms of the relevant densities (see Derivation).
Simple form
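If X is continuous and Y is discrete,
$$f_X(x \mid Y = y) = \frac{P(Y = y \mid X = x)\,f_X(x)}{P(Y = y)}$$
If X is discrete and Y is continuous,
$$P(X = x \mid Y = y) = \frac{f_Y(y \mid X = x)\,P(X = x)}{f_Y(y)}$$
If both X and Y are continuous,
$$f_X(x \mid Y = y) = \frac{f_Y(y \mid X = x)\,f_X(x)}{f_Y(y)}$$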
Extended form
A continuous event space is often conceptualized in terms of the numerator terms. It is then
useful to eliminate the denominator using the law of total probability. For f_Y(y), this becomes
an integral:
$$f_Y(y) = \int_{-\infty}^{\infty} f_Y(y \mid X = \xi)\,f_X(\xi)\,d\xi$$
Bayes' rule
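Under the Bayesian interpretation of probability, Bayes' rule may be thought of as Bayes'
theorem in odds form:
$$O(A_1 : A_2 \mid B) = \Lambda(A_1 : A_2 \mid B) \cdot O(A_1 : A_2)$$
where
$$\Lambda(A_1 : A_2 \mid B) = \frac{P(B \mid A_1)}{P(B \mid A_2)}$$
is the likelihood ratio (Bayes factor), and O(A_1 : A_2) = P(A_1)/P(A_2) and
O(A_1 : A_2 | B) = P(A_1|B)/P(A_2|B) are the odds before and after accounting for B.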
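The odds form can be checked numerically against the ordinary form of the theorem. The sketch below is illustrative only: the function names and the probabilities (0.2, 0.8, 0.9, 0.3) are assumptions chosen for the example, and the conversion from odds back to a probability is valid only because A1 and A2 are taken to be complementary events.

```python
def posterior_odds(prior_a1, prior_a2, lik_b_given_a1, lik_b_given_a2):
    """Posterior odds of A1 against A2 after observing B (odds form of Bayes' rule)."""
    prior_odds = prior_a1 / prior_a2                 # O(A1 : A2)
    bayes_factor = lik_b_given_a1 / lik_b_given_a2   # likelihood ratio
    return bayes_factor * prior_odds                 # O(A1 : A2 | B)


def odds_to_probability(odds):
    """Convert odds of A1 against its complement into P(A1)."""
    return odds / (1.0 + odds)


# Hypothetical numbers: P(A1) = 0.2, P(A2) = 0.8, P(B|A1) = 0.9, P(B|A2) = 0.3
odds = posterior_odds(0.2, 0.8, 0.9, 0.3)
p_from_odds = odds_to_probability(odds)

# The same posterior from the binary-partition form of Bayes' theorem
p_direct = (0.9 * 0.2) / (0.9 * 0.2 + 0.3 * 0.8)

print(f"posterior odds = {odds:.3f}")
print(f"P(A1 | B) from odds = {p_from_odds:.4f}, directly = {p_direct:.4f}")
```

Both routes give the same posterior; the odds form is convenient because the evidence term P(B) never has to be computed.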
Derivation of Bayes Theorem:
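For general events: Bayes' theorem may be derived from the definition of conditional
probability:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \ \text{if } P(B) \neq 0, \qquad P(B \mid A) = \frac{P(A \cap B)}{P(A)} \ \text{if } P(A) \neq 0$$
$$\Rightarrow P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$$
$$\Rightarrow P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \ \text{if } P(B) \neq 0$$
For random variables: For two continuous random variables X and Y, Bayes' theorem may
be analogously derived from the definition of conditional density:
$$f_X(x \mid Y = y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}, \qquad f_Y(y \mid X = x) = \frac{f_{X,Y}(x, y)}{f_X(x)}$$
$$\Rightarrow f_X(x \mid Y = y) = \frac{f_Y(y \mid X = x)\,f_X(x)}{f_Y(y)}$$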
8.5. Examples and Applications
(a) Frequentist example
An entomologist spots what might be a rare subspecies of beetle, due to the pattern on its
back. In the rare subspecies, 98% have the pattern. In the common subspecies, 5% have the
pattern. The rare subspecies accounts for only 0.1% of the population. How likely is the
beetle to be rare?
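From the extended form of Bayes' theorem,
$$P(\text{Rare} \mid \text{Pattern}) = \frac{P(\text{Pattern} \mid \text{Rare})\,P(\text{Rare})}{P(\text{Pattern} \mid \text{Rare})\,P(\text{Rare}) + P(\text{Pattern} \mid \text{Common})\,P(\text{Common})} = \frac{0.98 \times 0.001}{0.98 \times 0.001 + 0.05 \times 0.999} \approx 1.9\%$$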
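The beetle calculation can be reproduced in a few lines. This is a minimal sketch; the function name `posterior` and its parameter names are assumptions introduced for illustration, and it covers only the binary-partition case of the extended form.

```python
def posterior(prior, likelihood, likelihood_given_complement):
    """P(A|B) for the binary partition {A, not A}.

    prior                       -- P(A)
    likelihood                  -- P(B|A)
    likelihood_given_complement -- P(B|not A)
    """
    numerator = likelihood * prior
    evidence = numerator + likelihood_given_complement * (1.0 - prior)
    return numerator / evidence


# Values from the example: 0.1% rare, 98% of rare and 5% of common beetles show the pattern
p_rare_given_pattern = posterior(prior=0.001,
                                 likelihood=0.98,
                                 likelihood_given_complement=0.05)
print(f"P(Rare | Pattern) = {p_rare_given_pattern:.4f}")
```

Even with a 98% match rate for the rare subspecies, the small prior keeps the posterior below 2%.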
(b) Drug testing
Suppose a drug test is 99% sensitive and 99% specific. That is, the test will produce 99% true
positive results for drug users and 99% true negative results for non-drug users. Suppose that
0.5% of people are users of the drug. If a randomly selected individual tests positive, what is
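the probability he or she is a user?
$$P(\text{User} \mid +) = \frac{P(+ \mid \text{User})\,P(\text{User})}{P(+ \mid \text{User})\,P(\text{User}) + P(+ \mid \text{Non-user})\,P(\text{Non-user})} = \frac{0.99 \times 0.005}{0.99 \times 0.005 + 0.01 \times 0.995} \approx 33.2\%$$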
Despite the apparent accuracy of the test, if an individual tests positive, it is more likely that
they do not use the drug than that they do.
This surprising result arises because the number of non-users is very large compared to the
number of users, such that the number of false positives (0.995% of those tested) outweighs
the number of true positives (0.495%). To use concrete numbers, if 1000 individuals are
tested, there are expected to be 995 non-users and 5 users. From the 995 non-users,
0.01 × 995 ≈ 10 false positives are expected. From the 5 users, 0.99 × 5 ≈ 5 true positives
are expected. Out of 15 positive results, only 5, about 33%, are genuine.
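The drug-testing figures can be verified both as probabilities and as expected head counts. The variable names below are assumptions made for this sketch; the numbers are those given in the example.

```python
sensitivity = 0.99   # P(positive | user)
specificity = 0.99   # P(negative | non-user)
prevalence = 0.005   # P(user)

# Joint probabilities of the two ways to obtain a positive result
true_positive = sensitivity * prevalence
false_positive = (1.0 - specificity) * (1.0 - prevalence)

p_user_given_positive = true_positive / (true_positive + false_positive)

# The same reasoning with expected counts in a population of 1000
population = 1000
expected_true_positives = population * true_positive
expected_false_positives = population * false_positive

print(f"P(user | positive) = {p_user_given_positive:.3f}")
print(f"expected true positives:  {expected_true_positives:.2f}")
print(f"expected false positives: {expected_false_positives:.2f}")
```

The unrounded expected counts (4.95 and 9.95) round to the 5 and 10 quoted in the text.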
Applications: Bayesian inference has applications in artificial intelligence and expert
systems. Bayesian inference techniques have been a fundamental part of computerized
pattern recognition techniques since the late 1950s. There is also an ever growing connection
between Bayesian methods and simulation-based Monte Carlo techniques since complex
models cannot be processed in closed form by a Bayesian analysis, while a graphical model
structure may allow for efficient simulation algorithms like the Gibbs sampling and other
Metropolis–Hastings algorithm schemes. Recently Bayesian inference has gained popularity
amongst the phylogenetics community for these reasons; a number of applications allow
many demographic and evolutionary parameters to be estimated simultaneously. In the areas
of population genetics and dynamical systems theory, approximate Bayesian computation
(ABC) is also becoming increasingly popular. As applied to statistical classification,
Bayesian inference has been used in recent years to develop algorithms for identifying e-mail
spam. Applications which make use of Bayesian inference for spam filtering include
DSPAM, Bogofilter, SpamAssassin, SpamBayes, and Mozilla. Spam classification is a standard
application of the naive Bayes classifier. Solomonoff's inductive inference is
the theory of prediction based on observations; for example, predicting the next symbol based
upon a given series of symbols. The only assumption is that the environment follows some
unknown but computable probability distribution. It combines two well-studied principles of
inductive inference: Bayesian statistics and Occam's razor. Solomonoff's universal prior
probability of any prefix p of a computable sequence x is the sum of the probabilities of all
programs (for a universal computer) that compute something starting with p. Given some p
and any computable but unknown probability distribution from which x is sampled, the
universal prior and Bayes' theorem can be used to predict the yet unseen parts of x in optimal
fashion.
8.6 Advantages and Disadvantages of Bayesian Methods
The Bayesian methods have a number of advantages that indicate their suitability for
uncertainty management.
Most significant is their sound theoretical foundation in probability theory. Thus, they are
currently the most mature of all of the uncertainty reasoning methods.
While Bayesian methods are more developed than the other uncertainty methods, they are not
without faults.
1. They require a significant amount of probability data to construct a knowledge base.
Furthermore, human experts are normally uncertain and uncomfortable about the
probabilities they are providing.
2. What are the relevant prior and conditional probabilities based on? If they are
statistically based, the sample sizes must be sufficient so the probabilities obtained are
accurate. If human experts have provided the values, are the values consistent and
comprehensive?
3. Often the type of relationship between the hypothesis and evidence is important in
determining how the uncertainty will be managed. Reducing these associations to
simple numbers removes relevant information that might be needed for successful
reasoning about the uncertainties. For example, Bayesian-based medical diagnostic
systems have failed to gain acceptance because physicians distrust systems that cannot
provide explanations describing how a conclusion was reached (a feature difficult to
provide in a Bayesian-based system).
4. The reduction of the associations to numbers also prevents this knowledge from being
used within other tasks. For example, the associations that would enable the system to
explain its reasoning to a user are lost, as is the ability to browse through the hierarchy
of evidences to hypotheses.
8.7 BAYESIAN CLASSIFIER
A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem
(from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive
term for the underlying probability model would be "independent feature model".
In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a
particular feature of a class is unrelated to the presence (or absence) of any other feature. For
example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter.
Even if these features depend on each other or upon the existence of the other features, a
naive Bayes classifier considers all of these properties to independently contribute to the
probability that this fruit is an apple.
Depending on the precise nature of the probability model, naive Bayes classifiers can be
trained very efficiently in a supervised learning setting. In many practical applications,
parameter estimation for naive Bayes models uses the method of maximum likelihood; in
other words, one can work with the naive Bayes model without believing in Bayesian
probability or using any Bayesian methods.
In spite of their naive design and apparently over-simplified assumptions, naive Bayes
classifiers have worked quite well in many complex real-world situations. An advantage of
the naive Bayes classifier is that it requires a small amount of training data to estimate the
parameters (means and variances of the variables) necessary for classification. Because
independent variables are assumed, only the variances of the variables for each class need to
be determined and not the entire covariance matrix.
Probabilistic model
Abstractly, the probability model for a classifier is a conditional model
$$p(C \mid F_1, \dots, F_n)$$
over a dependent class variable C with a small number of outcomes or classes, conditional on
several feature variables F_1 through F_n. The problem is that if the number of features n is
large or when a feature can take on a large number of values, then basing such a model on
probability tables is infeasible. We therefore reformulate the model to make it more tractable.
Using Bayes' theorem, this can be written
$$p(C \mid F_1, \dots, F_n) = \frac{p(C)\,p(F_1, \dots, F_n \mid C)}{p(F_1, \dots, F_n)}$$
In plain English the above equation can be written as
$$\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}$$
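In practice, there is interest only in the numerator of that fraction, because the denominator
does not depend on C and the values of the features F_i are given, so that the denominator is
effectively constant. The numerator is equivalent to the joint probability model
$$p(C, F_1, \dots, F_n)$$
which can be rewritten as follows, using the chain rule for repeated applications of the
definition of conditional probability:
$$\begin{aligned}
p(C, F_1, \dots, F_n) &= p(C)\,p(F_1, \dots, F_n \mid C) \\
&= p(C)\,p(F_1 \mid C)\,p(F_2, \dots, F_n \mid C, F_1) \\
&= p(C)\,p(F_1 \mid C)\,p(F_2 \mid C, F_1) \cdots p(F_n \mid C, F_1, \dots, F_{n-1})
\end{aligned}$$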
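Now the "naive" conditional independence assumptions come into play: assume that each
feature F_i is conditionally independent of every other feature F_j for j ≠ i given the category C.
This means that
$$p(F_i \mid C, F_j) = p(F_i \mid C), \quad p(F_i \mid C, F_j, F_k) = p(F_i \mid C), \quad p(F_i \mid C, F_j, F_k, F_l) = p(F_i \mid C),$$
and so on, for i ≠ j, k, l, and so the joint model can be expressed as
$$p(C \mid F_1, \dots, F_n) \propto p(C, F_1, \dots, F_n) \propto p(C) \prod_{i=1}^{n} p(F_i \mid C)$$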
This means that under the above independence assumptions, the conditional distribution over
the class variable C is:
$$p(C \mid F_1, \dots, F_n) = \frac{1}{Z}\,p(C) \prod_{i=1}^{n} p(F_i \mid C)$$
where Z = p(F_1, ..., F_n) (the evidence) is a scaling factor dependent only on F_1, ..., F_n,
that is, a constant if the values of the feature variables are known.
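The factorised model above translates almost directly into code for categorical features. The following is a minimal sketch, not a reference implementation: the function names, the toy data (blocks described by size, colour and shape) and the use of add-one (Laplace) smoothing to avoid zero probabilities are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train(rows, labels):
    """Estimate the counts needed for p(C) and p(F_i | C)."""
    class_counts = Counter(labels)
    # feature_counts[(i, c)][v] = number of class-c rows whose i-th feature equals v
    feature_counts = defaultdict(Counter)
    values = defaultdict(set)  # distinct values seen for each feature index
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            feature_counts[(i, c)][v] += 1
            values[i].add(v)
    return class_counts, feature_counts, values


def posterior(model, row):
    """Return p(C | F_1..F_n) for every class, using the naive factorisation."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior p(C)
        for i, v in enumerate(row):
            # add-one smoothed estimate of p(F_i = v | C = c) (an assumption of this sketch)
            score *= (feature_counts[(i, c)][v] + 1) / (n_c + len(values[i]))
        scores[c] = score
    z = sum(scores.values())  # evidence Z, the normalising constant
    return {c: s / z for c, s in scores.items()}


# Toy training data: (size, colour, shape) -> class
rows = [("medium", "blue", "brick"), ("small", "red", "sphere"),
        ("large", "green", "pillar"), ("large", "green", "sphere"),
        ("small", "red", "wedge"), ("large", "red", "wedge"),
        ("large", "red", "pillar")]
labels = ["yes", "yes", "yes", "yes", "no", "no", "no"]

model = train(rows, labels)
post = posterior(model, ("large", "red", "wedge"))
print(post)
```

Because Z is obtained by summing the unnormalised scores over the classes, the joint distribution of the features never has to be estimated.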
Bayes’ Theorem
Bayes' theorem is named after Thomas Bayes, a nonconformist English clergyman who did
early work in probability and decision theory during the 18th century. Let X be a data tuple.
In Bayesian terms, X is considered "evidence." As usual, it is described by measurements
made on a set of n attributes. Let H be some hypothesis, such as that the data tuple X belongs
to a specified class C. For classification problems, we want to determine P(H|X), the
probability that the hypothesis H holds given the "evidence" or observed data tuple X. In
other words, we are looking for the probability that tuple X belongs to class C, given that we
know the attribute description of X. P(H|X) is the posterior probability, or a posteriori
probability, of H conditioned on X. For example, suppose our world of data tuples is confined
to customers described by the attributes age and income, respectively, and that X is a
35-year-old customer with an income of Rs. 40,000. Suppose that H is the hypothesis that our
customer will buy a computer. Then P(H|X) reflects the probability that customer X will buy
a computer given that we know the customer's age and income.
In contrast, P(H) is the prior probability, or a priori probability, of H. For our example, this is
the probability that any given customer will buy a computer, regardless of age, income, or
any other information, for that matter. The posterior probability, P(H|X), is based on more
information (e.g., customer information) than the prior probability, P(H), which is
independent of X. Similarly, P(X|H) is the posterior probability of X conditioned on H. That
is, it is the probability that a customer, X, is 35 years old and earns Rs. 40,000, given that we
know the customer will buy a computer. P(X) is the prior probability of X. Using our
example, it is the probability that a person from our set of customers is 35 years old and earns
Rs. 40,000.
"How are these probabilities estimated?"
P(H), P(X|H), and P(X) may be estimated from the given data. Bayes' theorem is useful in
that it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X|H),
and P(X):
$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$
8.8 CLASSIFICATION BY DECISION TREE INDUCTION
Decision tree learning uses a decision tree as a predictive model which maps observations
about an item to conclusions about the item's target value. It is one of the predictive
modelling approaches used in statistics, data mining and machine learning. More descriptive
names for such tree models are classification trees or regression trees. In these tree
structures, leaves represent class labels and branches represent conjunctions of features that
lead to those class labels.
For example, we might have a decision tree to help a financial institution decide whether a
person should be offered a loan:
We wish to be able to induce a decision tree from a set of data about instances together with
the decisions or classifications for those instances.
Example Instance Data
Attributes and their possible values:
size:   small medium large
colour: red blue green
shape:  brick wedge sphere pillar

%% yes
medium  blue   brick
small   red    sphere
large   green  pillar
large   green  sphere
%% no
small   red    wedge
large   red    wedge
large   red    pillar
• In this example, there are 7 instances, described in terms of three features or
attributes (size, colour, and shape), and the instances are classified into two classes,
%% yes and %% no.
• We shall now describe an algorithm for inducing a decision tree from such a
collection of classified instances.
• Originally termed CLS (Concept Learning System), it has been successively enhanced.
Tree Induction Algorithm
• The algorithm operates over a set of training instances, C.
• If all instances in C are in class P, create a node P and stop; otherwise select a feature
or attribute F and create a decision node.
• Partition the training instances in C into subsets according to the values of F.
• Apply the algorithm recursively to each of the subsets of C.
Output of Tree Induction Algorithm
This can easily be expressed as a nested if-statement:
if (shape == wedge)
    return no;
if (shape == brick)
    return yes;
if (shape == pillar)
{
    if (colour == red)
        return no;
    if (colour == green)
        return yes;
}
if (shape == sphere)
    return yes;
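The nested if-statement can be written as a runnable function and checked against the seven example instances. This is a small sketch; the function name `classify` and the list `instances` are assumptions for illustration. The size attribute is carried along in the data but is never tested by the tree.

```python
def classify(shape, colour):
    """Decision tree from the text; returns None for combinations the tree does not cover."""
    if shape == "wedge":
        return "no"
    if shape == "brick":
        return "yes"
    if shape == "pillar":
        if colour == "red":
            return "no"
        if colour == "green":
            return "yes"
    if shape == "sphere":
        return "yes"
    return None


# The seven training instances: (size, colour, shape, class)
instances = [
    ("medium", "blue", "brick", "yes"),
    ("small", "red", "sphere", "yes"),
    ("large", "green", "pillar", "yes"),
    ("large", "green", "sphere", "yes"),
    ("small", "red", "wedge", "no"),
    ("large", "red", "wedge", "no"),
    ("large", "red", "pillar", "no"),
]

predictions = [classify(shape, colour) for _, colour, shape, _ in instances]
print(predictions)
```

A blue pillar never occurs in the training data, so the tree has no branch for it and the sketch returns None in that case.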
Classification by Decision Tree Induction
Decision tree induction is the learning of decision trees from class-labelled training tuples. A
decision tree is a flowchart-like tree structure, where each internal node (non-leaf node)
denotes a test on an attribute, each branch represents an outcome of the test, and each leaf
node (or terminal node) holds a class label. The topmost node in a tree is the root node.
A typical decision tree is shown above. Internal nodes are denoted by rectangles, and leaf
nodes are denoted by ovals. Some decision tree algorithms produce only binary trees (where
each internal node branches to exactly two other nodes), whereas others can produce
non-binary trees.
"How are decision trees used for classification?"
Given a tuple, X, for which the associated class label is unknown, the attribute values of the
tuple are tested against the decision tree. A path is traced from the root to a leaf node, which
holds the class prediction for that tuple. Decision trees can easily be converted to
classification rules.
"Why are decision tree classifiers so popular?"
The construction of decision tree classifiers does not require any domain knowledge or
parameter setting, and therefore is appropriate for exploratory knowledge discovery. Decision
trees can handle high dimensional data. Their representation of acquired knowledge in tree
form is intuitive and generally easy to assimilate by humans. The learning and classification
steps of decision tree induction are simple and fast. In general, decision tree classifiers have
good accuracy. However, successful use may depend on the data at hand. Decision tree
induction algorithms have been used for classification in many application areas, such as
medicine, manufacturing and production, financial analysis, astronomy, and molecular
biology. Decision trees are the basis of several commercial rule induction systems.
Decision Tree Algorithm:
Algorithm: Generate_decision_tree. Generate a decision tree from the training tuples of data
partition D.
Input: Data partition, D, which is a set of training tuples and their associated class labels;
attribute_list, the set of candidate attributes;
Attribute_selection_method, a procedure to determine the splitting criterion that "best"
partitions the data tuples into individual classes. This criterion consists of a splitting attribute
and, possibly, either a split point or splitting subset.
Output: A decision tree.
Method:
(1) create a node N;
(2) if tuples in D are all of the same class, C then
(3)     return N as a leaf node labeled with the class C;
(4) if attribute_list is empty then
(5)     return N as a leaf node labeled with the majority class in D; // majority voting
(6) apply Attribute_selection_method(D, attribute_list) to find the "best" splitting_criterion;
(7) label node N with splitting_criterion;
(8) if splitting_attribute is discrete-valued and multi-way splits allowed then // not restricted to binary trees
(9)     attribute_list ← attribute_list − splitting_attribute; // remove splitting_attribute
(10) for each outcome j of splitting_criterion
        // partition the tuples and grow subtrees for each partition
(11)    let Dj be the set of data tuples in D satisfying outcome j; // a partition
(12)    if Dj is empty then
(13)        attach a leaf labeled with the majority class in D to node N;
(14)    else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
     endfor
(15) return N;
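The pseudocode can be sketched in Python for the simplest setting. The version below is an illustration under stated assumptions, not the textbook's implementation: all attributes are discrete, multi-way splits are always allowed (so step 8 is always true), information gain is plugged in as the attribute selection method, and the tree is stored as nested dictionaries. The function names, the `domains` argument and the toy data are all hypothetical.

```python
import math
from collections import Counter


def majority_class(rows):
    """Most common class label among (features, label) pairs."""
    return Counter(label for _, label in rows).most_common(1)[0][0]


def entropy(rows):
    """Entropy (in bits) of the class labels in rows."""
    counts = Counter(label for _, label in rows)
    return -sum((c / len(rows)) * math.log2(c / len(rows)) for c in counts.values())


def select_by_information_gain(rows, attribute_list):
    """Pick the attribute whose split leaves the lowest expected entropy."""
    def expected_entropy(attr):
        parts = {}
        for features, label in rows:
            parts.setdefault(features[attr], []).append((features, label))
        return sum(len(p) / len(rows) * entropy(p) for p in parts.values())
    return min(attribute_list, key=expected_entropy)


def generate_decision_tree(rows, attribute_list, domains, select=select_by_information_gain):
    labels = {label for _, label in rows}
    if len(labels) == 1:                                    # steps 2-3
        return labels.pop()
    if not attribute_list:                                  # steps 4-5
        return majority_class(rows)
    attr = select(rows, attribute_list)                     # steps 6-7
    remaining = [a for a in attribute_list if a != attr]    # steps 8-9
    node = {"attribute": attr, "branches": {}}
    for value in domains[attr]:                             # step 10
        subset = [(f, l) for f, l in rows if f[attr] == value]   # step 11
        if not subset:                                      # steps 12-13
            node["branches"][value] = majority_class(rows)
        else:                                               # step 14
            node["branches"][value] = generate_decision_tree(subset, remaining, domains, select)
    return node                                             # step 15


def classify(tree, features):
    """Follow branches from the root until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][features[tree["attribute"]]]
    return tree


domains = {"size": ["small", "medium", "large"],
           "colour": ["red", "blue", "green"],
           "shape": ["brick", "wedge", "sphere", "pillar"]}
raw = [("medium", "blue", "brick", "yes"), ("small", "red", "sphere", "yes"),
       ("large", "green", "pillar", "yes"), ("large", "green", "sphere", "yes"),
       ("small", "red", "wedge", "no"), ("large", "red", "wedge", "no"),
       ("large", "red", "pillar", "no")]
training = [({"size": s, "colour": c, "shape": sh}, label) for s, c, sh, label in raw]

tree = generate_decision_tree(training, list(domains), domains)
print(tree)
```

On the seven block instances used earlier in this section, the sketch splits first on shape and then on colour under the pillar branch, which matches the nested if-statement shown there.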
The algorithm starts with a training set of tuples and their associated class labels. The
training set is recursively partitioned into smaller subsets as the tree is being built. A basic
decision tree algorithm is summarized above. The strategy is as follows. The algorithm is
called with three parameters: D, attribute_list, and Attribute_selection_method. We refer to
D as a data partition. Initially, it is the complete set of training tuples and their associated
class labels. The parameter attribute_list is a list of attributes describing the tuples.
Attribute_selection_method specifies a heuristic procedure for selecting the attribute that
"best" discriminates the given tuples according to class. This procedure employs an attribute
selection measure, such as information gain or the gini index. Whether the tree is strictly
binary is generally driven by the attribute selection measure. Some attribute selection
measures, such as the gini index, force the resulting tree to be binary. Others, like
information gain, do not, thereby allowing multi-way splits (i.e., two or more branches to be
grown from a node).
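The two measures named above can be computed directly from class labels. This is a hedged sketch using the standard definitions of entropy, information gain and Gini impurity; the function names and the example partitions (the seven block instances from earlier in this section, split by shape and by size) are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(labels, partitions):
    """Reduction in entropy obtained by splitting labels into the given partitions."""
    n = len(labels)
    remainder = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(labels) - remainder


# Class labels of the seven block instances: 4 yes, 3 no
labels = ["yes"] * 4 + ["no"] * 3
# Labels grouped by shape (brick, wedge, sphere, pillar) and by size (small, medium, large)
by_shape = [["yes"], ["no", "no"], ["yes", "yes"], ["yes", "no"]]
by_size = [["yes", "no"], ["yes"], ["yes", "yes", "no", "no"]]

gain_shape = information_gain(labels, by_shape)
gain_size = information_gain(labels, by_size)
print(f"gain(shape) = {gain_shape:.3f}, gain(size) = {gain_size:.3f}")
print(f"gini(D) = {gini(labels):.3f}")
```

Shape yields a much larger gain than size, which is why a selection method based on information gain would test shape first.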
The tree starts as a single node, N, representing the training tuples in D (step 1). If the tuples
in D are all of the same class, then node N becomes a leaf and is labelled with that class
(steps 2 and 3). Note that steps 4 and 5 are terminating conditions. All of the terminating
conditions are explained at the end of the algorithm. Otherwise, the algorithm calls
Attribute_selection_method to determine the splitting criterion. The splitting criterion tells us
which attribute to test at node N by determining the "best" way to separate or partition the
tuples in D into individual classes (step 6). The splitting criterion also tells us which branches
to grow from node N with respect to the outcomes of the chosen test. More specifically, the
splitting criterion indicates the splitting attribute and may also indicate either a split-point or
a splitting subset. The splitting criterion is determined so that, ideally, the resulting partitions
at each branch are as "pure" as possible. A partition is pure if all of the tuples in it belong to
the same class. In other words, if we were to split up the tuples in D according to the mutually
exclusive outcomes of the splitting criterion, we hope for the resulting partitions to be as pure
as possible. The node N is labelled with the splitting criterion, which serves as a test at the
node (step 7). A branch is grown from node N for each of the outcomes of the splitting
criterion. The tuples in D are partitioned accordingly (steps 10 to 11). There are three
possible scenarios. Let A be the splitting attribute. A has v distinct values, {a1, a2, ..., av},
based on the training data.
1. A is discrete-valued: In this case, the outcomes of the test at node N correspond directly to
the known values of A. A branch is created for each known value, aj, of A and labelled with
that value. Partition Dj is the subset of class-labelled tuples in D having value aj of A.
Because all of the tuples in a given partition have the same value for A, then A need not be
considered in any future partitioning of the tuples. Therefore, it is removed from attribute_list
(steps 8 to 9).
2. A is continuous-valued: In this case, the test at node N has two possible outcomes,
corresponding to the conditions A ≤ split_point and A > split_point, respectively, where
split_point is the split-point returned by Attribute_selection_method as part of the splitting
criterion. (In practice, the split-point, a, is often taken as the midpoint of two known adjacent
values of A and therefore may not actually be a pre-existing value of A from the training
data.) Two branches are grown from N and labelled according to the above outcomes. The
tuples are partitioned such that D1 holds the subset of class-labelled tuples in D for which
A ≤ split_point, while D2 holds the rest.
3. A is discrete-valued and a binary tree must be produced (as dictated by the attribute
selection measure or algorithm being used): The test at node N is of the form "A ∈ S_A?". S_A is
the splitting subset for A, returned by Attribute_selection_method as part of the splitting
criterion. It is a subset of the known values of A. If a given tuple has value aj of A and if
aj ∈ S_A, then the test at node N is satisfied. Two branches are grown from N. By convention,
the left branch out of N is labelled yes so that D1 corresponds to the subset of class-labelled
tuples in D that satisfy the test. The right branch out of N is labelled no so that D2
corresponds to the subset of class-labelled tuples from D that do not satisfy the test. The
algorithm uses the same process recursively to form a decision tree for the tuples at each
resulting partition, Dj, of D (step 14).
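For the continuous-valued case (scenario 2), candidate split-points taken as midpoints of adjacent values, and the resulting two-way partition, can be sketched as follows. The function names and the sample ages are hypothetical; choosing which candidate is "best" is left to the attribute selection measure and is not shown.

```python
def candidate_split_points(values):
    """Midpoints between each pair of adjacent distinct values, in ascending order."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def partition(rows, attr, split_point):
    """Split rows into D1 (attr <= split_point) and D2 (attr > split_point)."""
    d1 = [r for r in rows if r[attr] <= split_point]
    d2 = [r for r in rows if r[attr] > split_point]
    return d1, d2


# Hypothetical tuples with a continuous attribute "age"
rows = [{"age": 25}, {"age": 35}, {"age": 35}, {"age": 45}, {"age": 60}]

points = candidate_split_points([r["age"] for r in rows])
d1, d2 = partition(rows, "age", points[1])
print(points)
print(len(d1), len(d2))
```

Note that none of the candidate split-points needs to be a value that actually occurs in the data, as the text points out.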
The recursive partitioning stops only when any one of the following terminating conditions is
true:
1. All of the tuples in partition D (represented at node N) belong to the same class (steps 2
and 3), or
2. There are no remaining attributes on which the tuples may be further partitioned (step 4).
In this case, majority voting is employed (step 5). This involves converting node N into a leaf
and labelling it with the most common class in D. Alternatively, the class distribution of the
node tuples may be stored.
3. There are no tuples for a given branch, that is, a partition Dj is empty (step 12).
In this case, a leaf is created with the majority class in D (step 13).
The resulting decision tree is returned (step 15).
The computational complexity of the algorithm given training set D is O(n × |D| × log(|D|)),
where n is the number of attributes describing the tuples in D and |D| is the number of
training tuples in D. This means that the computational cost of growing a tree grows at most
as n × |D| × log(|D|) with |D| tuples.
Tree Pruning
When a decision tree is built, many of the branches will reflect anomalies in the training data
due to noise or outliers. Tree pruning methods address this problem of overfitting the data.
Such methods typically use statistical measures to remove the least reliable branches. Pruned
trees tend to be smaller and less complex and, thus, easier to comprehend. They are usually
faster and better at correctly classifying independent test data (i.e., of previously unseen
tuples) than un-pruned trees.
"How does tree pruning work?"
There are two common approaches to tree pruning: pre-pruning and post-pruning.
In the pre-pruning approach, a tree is "pruned" by halting its construction early (e.g., by
deciding not to further split or partition the subset of training tuples at a given node). Upon
halting, the node becomes a leaf. The leaf may hold the most frequent class among the subset
tuples or the probability distribution of those tuples. When constructing a tree, measures such
as statistical significance, information gain, Gini index, and so on can be used to assess the
goodness of a split. If partitioning the tuples at a node would result in a split that falls below a
pre-specified threshold, then further partitioning of the given subset is halted. There are
difficulties, however, in choosing an appropriate threshold. High thresholds could result in
oversimplified trees, whereas low thresholds could result in very little simplification. The
second and more common approach is post-pruning, which removes sub-trees from a "fully
grown" tree. A sub-tree at a given node is pruned by removing its branches and replacing it
with a leaf. The leaf is labelled with the most frequent class among the sub-tree being
replaced.
In the pruned version of the tree, the sub-tree in question is pruned by replacing it with the
leaf "class B." The cost complexity pruning algorithm used in CART is an example of the
post-pruning approach. This approach considers the cost complexity of a tree to be a function
of the number of leaves in the tree and the error rate of the tree (where the error rate is the
percentage of tuples misclassified by the tree). It starts from the bottom of the tree. For each
internal node, N, it computes the cost complexity of the sub-tree at N, and the cost complexity
of the sub-tree at N if it were to be pruned (i.e., replaced by a leaf node). The two values are
compared. If pruning the sub-tree at node N would result in a smaller cost complexity, then
the sub-tree is pruned. Otherwise, it is kept. A pruning set of class-labelled tuples is used to
estimate cost complexity. This set is independent of the training set used to build the
un-pruned tree and of any test set used for accuracy estimation.
8.9 Rule-Based Classification
IF-THEN Rules: A rule-based classifier makes use of a set of IF-THEN rules for classification.
We can express a rule in the following form:
IF condition THEN conclusion
Let us consider a rule R1,
R1: IF age = youth AND student = yes
THEN buys_computer = yes
• The IF part of the rule is called the rule antecedent or precondition.
• The THEN part of the rule is called the rule consequent.
• In the antecedent part, the condition consists of one or more attribute tests, and these tests
are logically ANDed.
• The consequent part consists of the class prediction.
We can also write rule R1 as follows:
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a
decision tree. Points to remember when extracting rules from a decision tree:
• One rule is created for each path from the root to a leaf node.
• To form the rule antecedent, each splitting criterion along the path is logically ANDed.
• The leaf node holds the class prediction, forming the rule consequent.
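The three points above can be sketched as a short recursive function. The nested-dictionary tree representation and the function name `extract_rules` are assumptions made for this illustration; the tree itself is the shape/colour tree from the nested if-statement in Section 8.8.

```python
def extract_rules(tree, conditions=()):
    """Return one IF-THEN rule (as a string) for every root-to-leaf path."""
    if not isinstance(tree, dict):
        # Leaf: the class label becomes the rule consequent
        antecedent = " AND ".join(f"{attr} = {value}" for attr, value in conditions) or "TRUE"
        return [f"IF {antecedent} THEN class = {tree}"]
    rules = []
    for value, subtree in tree["branches"].items():
        # Each test along the path is ANDed into the antecedent
        rules += extract_rules(subtree, conditions + ((tree["attribute"], value),))
    return rules


# Decision tree corresponding to the nested if-statement in Section 8.8
tree = {
    "attribute": "shape",
    "branches": {
        "wedge": "no",
        "brick": "yes",
        "pillar": {"attribute": "colour", "branches": {"red": "no", "green": "yes"}},
        "sphere": "yes",
    },
}

rules = extract_rules(tree)
for rule in rules:
    print(rule)
```

The tree has five leaves, so five rules are produced; because each rule follows a distinct path, the rules are mutually exclusive.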
Using IF-THEN Rules for Classification
Rules are a good way of representing information or bits of knowledge. A rule-based
classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression
of the form
IF condition THEN conclusion.
An example is rule R1,
R1: IF age = youth AND student = yes THEN buys_computer = yes.
The "IF"-part (or left-hand side) of a rule is known as the rule antecedent or precondition. The
"THEN"-part (or right-hand side) is the rule consequent. In the rule antecedent, the condition
consists of one or more attribute tests (such as age = youth, and student = yes) that are
logically ANDed. The rule's consequent contains a class prediction (in this case, we are
predicting whether a customer will buy a computer). R1 can also be written as
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes).
If the condition (that is, all of the attribute tests) in a rule antecedent holds true for a given
tuple, we say that the rule antecedent is satisfied (or simply, that the rule is satisfied) and that
the rule covers the tuple.
A rule R can be assessed by its coverage and accuracy. Given a tuple, X, from a class-labeled
data set, D, let n_covers be the number of tuples covered by R; n_correct be the number of
tuples correctly classified by R; and |D| be the number of tuples in D. We can define the
coverage and accuracy of R as
coverage(R) = n_covers / |D|
accuracy(R) = n_correct / n_covers
That is, a rule's coverage is the percentage of tuples that are covered by the rule (i.e., whose
attribute values hold true for the rule's antecedent). For a rule's accuracy, we look at the
tuples that it covers and see what percentage of them the rule can correctly classify.
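The two measures can be computed directly from their definitions. The following is a minimal sketch, assuming a small made-up data set D (the five tuples are hypothetical and only serve to illustrate the arithmetic) and a dictionary representation of rule R1 that is our own choice, not part of the course text.

```python
# Toy class-labelled data set D (hypothetical values, for illustration only).
D = [
    {"age": "youth",  "student": "yes", "buys_computer": "yes"},
    {"age": "youth",  "student": "yes", "buys_computer": "no"},
    {"age": "youth",  "student": "no",  "buys_computer": "no"},
    {"age": "senior", "student": "yes", "buys_computer": "yes"},
    {"age": "senior", "student": "no",  "buys_computer": "no"},
]

# Rule R1: IF age = youth AND student = yes THEN buys_computer = yes
R1 = {"antecedent": {"age": "youth", "student": "yes"},
      "consequent": ("buys_computer", "yes")}


def covers(rule, x):
    """True if every attribute test in the antecedent holds for tuple x."""
    return all(x.get(att) == val for att, val in rule["antecedent"].items())


def coverage_and_accuracy(rule, data):
    """Return (coverage, accuracy) of a rule on a class-labelled data set."""
    covered = [x for x in data if covers(rule, x)]
    label, value = rule["consequent"]
    n_covers = len(covered)
    n_correct = sum(1 for x in covered if x[label] == value)
    coverage = n_covers / len(data)
    # A rule that covers nothing has no meaningful accuracy; report 0.0 here.
    accuracy = n_correct / n_covers if n_covers else 0.0
    return coverage, accuracy


cov, acc = coverage_and_accuracy(R1, D)
print(cov, acc)  # 0.4 0.5
```

Here R1 covers two of the five tuples (coverage 40%) and classifies one of those two correctly (accuracy 50%).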
Rule Extraction from a Decision Tree
Earlier, we learned how to build a decision tree classifier from a set of training data. Decision tree
classifiers are a popular method of classification: it is easy to understand how decision trees
work, and they are known for their accuracy. However, decision trees can become large and difficult to
interpret. In this subsection, we look at how to build a rule-based classifier by extracting
IF-THEN rules from a decision tree. In comparison with a decision tree, the IF-THEN rules may
be easier for humans to understand, particularly if the decision tree is very large.
To extract rules from a decision tree, one rule is created for each path from the root to a leaf
node. Each splitting criterion along a given path is logically ANDed to form the rule
antecedent ("IF" part). The leaf node holds the class prediction, forming the rule consequent
("THEN" part).
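The path-to-rule procedure can be sketched as a short recursive traversal. The tree below is a hypothetical buys_computer tree written as nested tuples and dictionaries; this representation and the attribute values are assumptions made for illustration, not a structure defined in this unit.

```python
# A decision tree as nested (attribute, {value: subtree}) pairs.
# A plain string is a leaf holding the class prediction.
# The tree itself is hypothetical and only illustrates the traversal.
tree = ("age", {
    "youth": ("student", {"no": "no", "yes": "yes"}),
    "middle_aged": "yes",
    "senior": ("credit_rating", {"excellent": "no", "fair": "yes"}),
})


def extract_rules(node, path=()):
    """Return one (antecedent, class) pair per root-to-leaf path."""
    if isinstance(node, str):            # leaf: the path so far is the antecedent
        return [(list(path), node)]
    attribute, branches = node
    rules = []
    for value, child in branches.items():
        # Each splitting criterion on the path becomes one ANDed attribute test.
        rules.extend(extract_rules(child, path + ((attribute, value),)))
    return rules


rules = extract_rules(tree)
for antecedent, label in rules:
    condition = " AND ".join(f"{att} = {val}" for att, val in antecedent)
    print(f"IF {condition} THEN buys_computer = {label}")
```

The tree has five leaves, so five rules are produced, one per path.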
Rule Induction Using a Sequential Covering Algorithm
IF-THEN rules can be extracted directly from the training data (i.e., without having to
generate a decision tree first) using a sequential covering algorithm. The name comes from
the notion that the rules are learned sequentially (one at a time), where each rule for a given
class will ideally cover many of the tuples of that class (and hopefully none of the tuples of
other classes). Sequential covering algorithms are the most widely used approach to mining
disjunctive sets of classification rules, and form the topic of this subsection. Note that in a
newer alternative approach, classification rules can be generated using associative
classification algorithms, which search for attribute-value pairs that occur frequently in the
data. These pairs may form association rules, which can be analyzed
and used in classification. Since this latter approach is based on association rule mining, we
defer its treatment until later.
There are many sequential covering algorithms.
Popular variations include AQ, CN2, and the more recent RIPPER. The general strategy is as
follows. Rules are learned one at a time. Each time a rule is learned, the tuples covered by the
rule are removed, and the process repeats on the remaining tuples. This sequential learning of
rules is in contrast to decision tree induction. Because the path to each leaf in a decision tree
corresponds to a rule, we can consider decision tree induction as learning a set of rules
simultaneously.
A basic sequential covering algorithm is given below. Here, rules are learned for one class at a time. Ideally,
when learning a rule for a class, Ci, we would like the rule to cover all (or many) of the
training tuples of class Ci and none (or few) of the tuples from other classes. In this way, the
rules learned should be of high accuracy. The rules need not necessarily be of high coverage.
This is because we can have more than one rule for a class, so that different rules may cover
different tuples within the same class. The process continues until the terminating condition is
met, such as when there are no more training tuples or the quality of a rule returned is below
a user-specified threshold. The Learn_One_Rule procedure finds the "best" rule for the current
class, given the current set of training tuples.
Algorithm: Sequential covering. Learn a set of IF-THEN rules for classification.
Input: D, a data set of class-labeled tuples; Att_vals, the set of all attributes and their possible
values.
Output: A set of IF-THEN rules.
Method:
Rule_set = { }; // initial set of rules learned is empty
for each class c do
    repeat
        Rule = Learn_One_Rule(D, Att_vals, c);
        remove tuples covered by Rule from D;
        Rule_set = Rule_set + Rule; // add new rule to rule set
    until terminating condition;
endfor
return Rule_set;
How are rules learned?
Typically, rules are grown in a general-to-specific manner. We can think of this as a beam
search, where we start off with an empty rule and then gradually keep appending attribute
tests to it. We append by adding the attribute test as a logical conjunct to the existing
condition of the rule antecedent. Suppose our training set, D, consists of loan application
data. Attributes regarding each applicant include their age, income, education level,
residence, credit rating, and the term of the loan. The classifying attribute is loan_decision,
which indicates whether a loan is accepted (considered safe) or rejected (considered risky).
To learn a rule for the class "accept," we start off with the most general rule possible, that is,
the condition of the rule antecedent is empty. The rule is:
IF THEN loan_decision = accept.
We then consider each possible attribute test that may be added to the rule. These can be
derived from the parameter Att_vals, which contains a list of attributes with their associated
values. For example, for an attribute-value pair (att, val), we can consider attribute tests such
as att = val, att ≤ val, att > val, and so on. Typically, the training data will contain many
attributes, each of which may have several possible values. Finding an optimal rule set
becomes computationally explosive. Instead, Learn_One_Rule adopts a greedy depth-first
strategy. Each time it is faced with adding a new attribute test (conjunct) to the current rule, it
picks the one that most improves the rule quality, based on the training samples. For the
moment, let's say we use rule accuracy as our quality measure. Suppose Learn_One_Rule
finds that the attribute test income = high
best improves the accuracy of our current (empty) rule. We append it to the condition, so that
the current rule becomes
IF income = high THEN loan_decision = accept.
Each time we add an attribute test to a rule, the tuples covered by the resulting rule should
contain a larger proportion of "accept" tuples. During the next iteration, we again consider the possible attribute tests and
end up selecting credit_rating = excellent. Our current rule grows to become
IF income = high AND credit_rating = excellent THEN loan_decision = accept.
The process repeats, where at each step, we continue to greedily grow rules until the resulting
rule meets an acceptable quality level.
Greedy search does not allow for backtracking. At each step, we heuristically add what
appears to be the best choice at the moment. What if we unknowingly made a poor choice
along the way? To lessen the chance of this happening, instead of selecting the best attribute
test to append to the current rule, we can select the best k attribute tests. In this way, we
perform a beam search of width k wherein we maintain the k best candidates overall at each
step, rather than a single best candidate.
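The greedy strategy and the outer covering loop can be put together in a short sketch. It makes several simplifying assumptions that are ours rather than the text's: only equality tests (att = val) are considered, rule accuracy is the quality measure, the beam width is 1, the loop stops when the rule quality falls below a hypothetical threshold min_quality, and the five loan tuples are invented for illustration.

```python
def covers(antecedent, x):
    """True if tuple x satisfies every attribute test in the antecedent."""
    return all(x[att] == val for att, val in antecedent.items())


def accuracy(antecedent, data, target, cls):
    """Fraction of covered tuples that belong to class cls (0.0 if none covered)."""
    covered = [x for x in data if covers(antecedent, x)]
    if not covered:
        return 0.0
    return sum(x[target] == cls for x in covered) / len(covered)


def learn_one_rule(data, att_vals, target, cls):
    """Greedy general-to-specific search; no backtracking (beam width 1)."""
    antecedent = {}                              # most general rule: empty condition
    best = accuracy(antecedent, data, target, cls)
    while best < 1.0:
        candidates = [
            (accuracy({**antecedent, att: val}, data, target, cls), att, val)
            for att, vals in att_vals.items() if att not in antecedent
            for val in vals
        ]
        if not candidates:
            break
        score, att, val = max(candidates)        # ties broken by attribute name
        if score <= best:                        # no test improves the rule
            break
        antecedent[att] = val                    # append the test as a conjunct
        best = score
    return antecedent, best


def sequential_covering(data, att_vals, target, classes, min_quality=0.8):
    rule_set = []
    remaining = list(data)
    for cls in classes:
        while any(x[target] == cls for x in remaining):
            antecedent, quality = learn_one_rule(remaining, att_vals, target, cls)
            if quality < min_quality:            # terminating condition
                break
            rule_set.append((antecedent, cls))
            remaining = [x for x in remaining if not covers(antecedent, x)]
    return rule_set


# Hypothetical loan application tuples.
D = [
    {"income": "high", "credit_rating": "excellent", "loan_decision": "accept"},
    {"income": "high", "credit_rating": "excellent", "loan_decision": "accept"},
    {"income": "high", "credit_rating": "fair",      "loan_decision": "reject"},
    {"income": "low",  "credit_rating": "excellent", "loan_decision": "reject"},
    {"income": "low",  "credit_rating": "fair",      "loan_decision": "reject"},
]
att_vals = {"income": ["high", "low"], "credit_rating": ["excellent", "fair"]}

rules = sequential_covering(D, att_vals, "loan_decision", ["accept", "reject"])
print(rules)
```

On this toy data the first rule learned is IF income = high AND credit_rating = excellent THEN loan_decision = accept, grown in the same two steps as in the example above. Once its tuples are removed, only "reject" tuples remain, so the second rule has an empty condition and acts as a default rule. Replacing max(candidates) with the k best candidates would turn this into the beam search of width k described above.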
Rule Induction Using Sequential Covering Algorithm
A sequential covering algorithm can be used to extract IF-THEN rules from the training data.
We do not need to generate a decision tree first. In this algorithm, each rule for a given
class ideally covers many of the tuples of that class.
Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general
strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered by
the rule are removed and the process continues on the rest of the tuples.
Note: Decision tree induction, in contrast, can be considered as learning a set of rules
simultaneously, because the path to each leaf in a decision tree corresponds to a rule.
The following is the sequential covering algorithm, where rules are learned
for one class at a time. When learning a rule for a class Ci, we want the rule to cover all the
tuples of class Ci and no tuples of any other class.
Algorithm: Sequential Covering
Input:
D, a data set of class-labeled tuples,
Att_vals, the set of all attributes and their possible values.
Output: A set of IF-THEN rules.
Method:
Rule_set = { }; // initial set of rules learned is empty
for each class c do
    repeat
        Rule = Learn_One_Rule(D, Att_vals, c);
        remove tuples covered by Rule from D;
        Rule_set = Rule_set + Rule; // add a new rule to rule set
    until termination condition;
end for
return Rule_set;
Rule Pruning
Points to remember about rule pruning:
• Assessments of rule quality are made on the original set of training data. A rule may
perform well on the training data but less well on subsequent data. That is why rule pruning is
required.
• A rule is pruned by removing a conjunct. The rule R is pruned if the pruned version of R has
greater quality, as assessed on an independent set of tuples.
FOIL uses a simple and effective method for rule pruning. For a given rule R,
FOIL_Prune(R) = (pos - neg) / (pos + neg)
where pos and neg are the numbers of positive and negative tuples covered by R, respectively.
Note: This value will increase with the accuracy of R on a pruning set. Hence, if the
FOIL_Prune value is higher for the pruned version of R, then we prune R.
Learn One Rule does not employ a test set when evaluating rules. Assessments of rule quality
as described above are made with tuples from the original training data. Such assessment is
optimistic because the rules will likely overfit the data. That is, the rules may perform well on
the training data, but less well on subsequent data. To compensate for this, we can prune the
rules. A rule is pruned by removing a conjunct (attribute test). We choose to prune a rule, R,
if the pruned version of R has greater quality, as assessed on an independent set of tuples. As
in decision tree pruning, we refer to this set as a pruning set. Various pruning strategies can
be used, such as the pessimistic pruning approach described in the previous section. FOIL
uses a simple yet effective method.
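As an illustration of the FOIL_Prune measure, the sketch below evaluates a rule on a small pruning set and drops a conjunct whenever doing so raises the value. The order in which conjuncts are tried, the handling of rules that cover nothing, and the six pruning tuples are all assumptions made for this example.

```python
def foil_prune_value(antecedent, cls, pruning_set, target):
    """FOIL_Prune(R) = (pos - neg) / (pos + neg) on the pruning set."""
    covered = [x for x in pruning_set
               if all(x[att] == val for att, val in antecedent.items())]
    pos = sum(x[target] == cls for x in covered)
    neg = len(covered) - pos
    if pos + neg == 0:
        return float("-inf")      # assumption: a rule covering nothing is worst
    return (pos - neg) / (pos + neg)


def prune(antecedent, cls, pruning_set, target):
    """Remove conjuncts one at a time while FOIL_Prune strictly increases."""
    current = dict(antecedent)
    current_value = foil_prune_value(current, cls, pruning_set, target)
    improved = True
    while improved and current:
        improved = False
        for att in list(current):
            candidate = {a: v for a, v in current.items() if a != att}
            value = foil_prune_value(candidate, cls, pruning_set, target)
            if value > current_value:
                current, current_value, improved = candidate, value, True
                break
    return current, current_value


# Hypothetical pruning set, independent of the training data.
pruning_set = [
    {"income": "high", "age": "youth",  "loan_decision": "accept"},
    {"income": "high", "age": "youth",  "loan_decision": "reject"},
    {"income": "high", "age": "senior", "loan_decision": "accept"},
    {"income": "high", "age": "senior", "loan_decision": "accept"},
    {"income": "low",  "age": "youth",  "loan_decision": "reject"},
    {"income": "low",  "age": "senior", "loan_decision": "reject"},
]

rule = {"income": "high", "age": "youth"}
before = foil_prune_value(rule, "accept", pruning_set, "loan_decision")
pruned, after = prune(rule, "accept", pruning_set, "loan_decision")
print(before, pruned, after)  # 0.0 {'income': 'high'} 0.5
```

The original rule covers one positive and one negative tuple (value 0). Removing the test age = youth leaves a rule covering three positives and one negative (value 0.5), so the pruned version is kept.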
8.10 Summary
The significance of Bayes theorem in classification is discussed in detail with suitable
examples. We have also discussed the advantages and disadvantages of Bayesian methods.
The basics of decision tree and its application in classification are presented in brief. The rule
based classification is discussed in detail.
8.11 Keywords
Bayesian Classification, Decision Tree Induction, Rule-Based Classification.
8.12 Exercises
1. Give the statement of Bayes theorem.
2. Discuss the significance of Bayes theorem in classification.
3. List and explain the applications of Bayes theorem.
4. What are the advantages and disadvantages of Bayes theorem?
5. Explain classification by decision tree induction.
6. Explain rule based classification.
7. Write short notes on pruning.
8. Suppose a drug test is 88% sensitive and 88% specific. That is, the test will produce
88% true positive results for drug users and 88% true negative results for non-drug
users. Suppose that 0.6% of people are users of the drug. If a randomly selected
individual tests positive, what is the probability he or she is a user?
8.13 References
1. Introduction to Data Mining with Case Studies, by Gupta G. K.
2. Applications Of Data Mining by T Sudha, M Usha Rani
Unit 9: CLASSIFICATION TECHNIQUES
Structure
9.1 Objectives
9.2 Introduction
9.3 Classification by Back propagation
9.4 Support Vector Machines
9.5 Associative Classification
9.6 Decision Trees
9.7 Lazy Learners (K-NN)
9.8 Summary
9.9 Keywords
9.10 Exercises
9.11 References
9.1 Objectives
The objectives covered under this unit include:
Classification by Back propagation
Support Vector Machines
Associative Classification
Decision Trees
Lazy Learners
9.2 Introduction
Back propagation, an abbreviation for "backward propagation of errors", is a common
method of training artificial neural networks. From a desired output, the network learns from
many inputs, similar to the way a child learns to identify a dog from examples of dogs.
Neurocomputing is computer modeling based, in part, upon simulation of the structure and
function of the brain. Neural networks excel in pattern recognition, that is, the ability to
recognize a set of previously learned data. Although their use is rapidly growing in
engineering, they are new to the pharmaceutical community.
Although the long-term goal of the neural-network community remains the design of
autonomous machine intelligence, the main modern application of artificial neural networks
is in the field of pattern recognition. In the sub-field of data classification, neural-network
methods have been found to be useful alternatives to statistical techniques such as those
which involve regression analysis or probability density estimation. The potential utility of
neural networks in the classification of multisource satellite-imagery databases has been
recognized for well over a decade, and today neural networks are an established tool in the
field of remote sensing.
The most widely applied neural network algorithm in image classification remains the feed
forward back propagation algorithm.
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a
separating hyper plane. In other words, given labelled training data (supervised learning), the
algorithm outputs an optimal hyper plane which categorizes new examples.
A Decision Tree takes as input an object described by a set of properties and outputs a Boolean value
(yes/no decision). Each internal node in the tree corresponds to a test of one of the properties.
Branches are labelled with the possible values of the test.
Lazy learners store training examples and delay the processing ("lazy evaluation") until a
new instance must be classified.
Eager learners construct a classification model from the training tuples before they receive
any test tuples. Imagine a contrasting lazy approach, in which the learner instead waits until the last minute
before doing any model construction to classify a given test tuple. That is, when given a
training tuple, a lazy learner simply stores it (or does only a little minor processing) and waits
until it is given a test tuple. Only when it sees the test tuple does it perform generalization to
classify the tuple based on its similarity to the stored training tuples. Unlike eager learning
methods, lazy learners do less work when a training tuple is presented and more work when
making a classification or numeric prediction. Because lazy learners store the training tuples
or "instances," they are also referred to as instance-based learners, even though all learning is
essentially based on instances.
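The store-first, generalize-later behaviour can be seen in a very small nearest-neighbour sketch. It assumes numeric attributes, Euclidean distance, and a majority vote among the k closest stored tuples; the class name LazyKNN and the six training points are hypothetical.

```python
import math
from collections import Counter


class LazyKNN:
    """A lazy learner: training only stores tuples; work happens at query time."""

    def __init__(self, k=3):
        self.k = k
        self.instances = []

    def fit(self, tuples, labels):
        # No model is built here; the training tuples are simply stored.
        self.instances = list(zip(tuples, labels))

    def predict(self, x):
        # Generalization happens only now: find the k most similar stored tuples.
        nearest = sorted(self.instances,
                         key=lambda inst: math.dist(inst[0], x))[:self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


# Hypothetical 2-D training tuples with class labels.
X = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
y = ["no", "no", "no", "yes", "yes", "yes"]

knn = LazyKNN(k=3)
knn.fit(X, y)
print(knn.predict((2, 2)), knn.predict((8.5, 8.5)))  # no yes
```

Note that fit is trivial while predict scans all stored tuples, which mirrors the trade-off described above.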
9.3 Classification by Back propagation
Back propagation is a neural network learning algorithm. The field of neural networks was
originally kindled by psychologists and neurobiologists who sought to develop and test computational
analogues of neurons. Roughly speaking, a neural network is a set of connected input/output
units where each connection has a weight associated with it. During the learning phase, the
network learns by adjusting the weights so as to be able to predict the correct class label of
the input samples. Neural network learning is also referred to as connectionist learning due to
the connections between units.
Neural networks involve long training times and are therefore more suitable for applications
where this is feasible. They require a number of parameters that are typically best determined
empirically, such as the network topology or ―structure.‖ Neural networks have been
criticized for their poor interpretability, since it is difficult for humans to interpret the
symbolic meaning behind the learned weights. These features initially made neural networks
less desirable for data mining.
Advantages of neural networks, however, include their high tolerance to noisy data as well as
their ability to classify patterns on which they have not been trained. In addition, several
algorithms have recently been developed for the extraction of rules from trained neural
networks. These factors contribute towards the usefulness of neural networks for
classification in data mining.
The most popular neural network algorithm is the back propagation algorithm, proposed in
the 1980s.
A Multilayer Feed-Forward Neural Network
The back propagation algorithm performs learning on a multilayer feed-forward neural
network. The inputs correspond to the attributes measured for each training sample. The
inputs are fed simultaneously into a layer of units making up the input layer. The weighted
outputs of these units are, in turn, fed simultaneously to a second layer of neuron-like units,
known as a hidden layer. The hidden layer's weighted outputs can be input to another hidden
layer, and so on. The number of hidden layers is arbitrary, although in practice, usually only
one is used. The weighted outputs of the last hidden layer are input to units making up the
output layer, which emits the network's prediction for given samples.
The units in the hidden layers and output layer are sometimes referred to as neurodes, due to
their symbolic biological basis, or as output units. Multilayer feed-forward networks of linear
threshold functions, given enough hidden units, can closely approximate any function.
Defining Network Topology
Before training can begin, the user must decide on the network topology by specifying the
number of units in the input layer, the number of hidden layers (if more than one), the
number of units in each hidden layer, and the number of units in the output layer.
Normalizing the input values for each attribute measured in the training samples will help
speed up the learning phase. Typically, input values are normalized so as to fall between 0.0
and 1.0. Discrete-valued attributes may be encoded such that there is one input unit per
domain value. For example, if the domain of an attribute A is {a0, a1, a2}, then we may assign
three input units to represent A. That is, we may have, say, I0, I1, I2 as input units. Each unit is
initialized to 0. If A = a0, then I0 is set to 1. If A = a1, then I1 is set to 1, and so on. One output unit
may be used to represent two classes (where the value 1 represents one class, and the value 0
represents the other). If there are more than two classes, then one output unit per class is used.
There are no clear rules as to the "best" number of hidden layer units. Network design is a
trial-and-error process and may affect the accuracy of the resulting trained network. The
initial values of the weights may also affect the resulting accuracy.
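The two input-preparation steps described above, scaling continuous values into the range 0.0 to 1.0 and using one input unit per domain value of a discrete attribute, can be sketched as follows. Min-max scaling is one common way to normalize and is assumed here; the income figures are made up.

```python
def min_max_normalize(values):
    """Scale a list of numbers linearly so that they fall between 0.0 and 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant attribute: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_input_unit_per_value(value, domain):
    """Encode a discrete value with one input unit per domain value."""
    return [1 if value == d else 0 for d in domain]


incomes = [20000, 35000, 50000]       # hypothetical attribute values
normalized = min_max_normalize(incomes)
encoded = one_input_unit_per_value("a1", ["a0", "a1", "a2"])
print(normalized)  # [0.0, 0.5, 1.0]
print(encoded)     # [0, 1, 0]
```

For A = a1, only the second of the three input units is set to 1.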
Back propagation
Back propagation learns by iteratively processing a set of training samples, comparing the
network‘s prediction for each sample with the actual known class label. For each training
sample, the weights are modified so as to minimize the mean squared error between the
network‘s prediction and the actual class.
These modifications are made in the "backwards" direction, that is, from the output layer
through each hidden layer down to the first hidden layer (hence the name back propagation).
Although it is not guaranteed, in general the weights will eventually converge, and the
learning process stops. The algorithm is summarized below.
Initialize the weights:
The weights in the network are initialized to small random numbers (e.g., ranging from -1.0 to 1.0,
or -0.5 to 0.5). Each unit has a bias associated with it, as explained below. The biases are
similarly initialized to small random numbers.
Each training sample, X, is processed by the following steps.
Propagate the inputs forward:
In this step, the net input and output of each unit in the hidden and output layers are
computed. First, the training sample is fed to the input layer of the network. Note that for a unit j
in the input layer, its output is equal to its input, that is, Oj = Ij. The net input to each unit
in the hidden and output layers is computed as a linear combination of its inputs, that is, of
the outputs of the units connected to it in the previous layer. To compute the net input to the
unit, each input connected to the
unit is multiplied by its corresponding weight, and this is summed. Given a unit j in a hidden
or output layer, the net input, Ij, to unit j is
Ij = ∑i Wij Oi + θj
where Wij is the weight of the connection from unit i in the previous layer to unit j; Oi is the
output of unit i from the previous layer; and θj is the bias of the unit. The bias acts as a
threshold in that it serves to vary the activity of the unit.
Each unit in the hidden and output layers takes its net input and then applies an activation
function to it. The function symbolizes the activation of the neuron represented by the unit.
The logistic, or sigmoid, function is used. Given the net input Ij to unit j, the output Oj of
unit j is computed as
Oj = 1 / (1 + e^(-Ij))
This function is also referred to as a squashing
function, since it maps a large input domain onto the smaller range of 0 to 1. The logistic
function is non-linear and differentiable, allowing the back propagation algorithm to model
classification problems that are linearly inseparable.
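The forward computation for a single unit, a weighted sum plus bias followed by the logistic function, is small enough to write out in full. The input values, weights, and bias below are hypothetical numbers chosen only to show the arithmetic.

```python
import math


def sigmoid(net_input):
    """Logistic function: squashes any real number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-net_input))


def unit_output(inputs, weights, bias):
    """Output of one hidden or output unit given the previous layer's outputs."""
    net_input = sum(w * o for w, o in zip(weights, inputs)) + bias
    return sigmoid(net_input)


# Hypothetical unit with three incoming connections.
inputs = [1.0, 0.0, 1.0]
weights = [0.2, 0.4, -0.5]
bias = -0.4

out = unit_output(inputs, weights, bias)
print(round(out, 3))  # net input is -0.7, so the output is about 0.332
```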
Back propagate the error:
The error is propagated backwards by updating the weights and biases to reflect the error of
the network's prediction. For a unit j in the output layer, the error Errj is computed by
Errj = Oj (1 - Oj)(Tj - Oj)
where Oj is the actual output of unit j, and Tj is the true output, based on the known class
label of the given training sample.
To compute the error of a hidden layer unit j, the weighted sum of the errors of the units
connected to unit j in the next layer is considered. The error of a hidden layer unit j is
Errj = Oj (1 - Oj) ∑k Errk Wjk
where Wjk is the weight of the connection from unit j to a unit k in the next higher layer, and
Errk is the error of unit k. The weights and biases are updated to reflect the propagated
errors. Weights are updated by the following equations, where ΔWij is the change in weight
Wij.
ΔWij = (L) Errj Oi
Wij = Wij + ΔWij
The variable L is the learning rate, a constant typically having a value between 0.0 to 1.0.
Back propagation learns using a method of gradient descent to search for a set of weights that
can model the given classification problem so as to minimize the mean squared distance
between the network's class prediction and the actual class label of the samples. The learning
rate helps to avoid getting stuck at a local minimum in decision space (i.e., where the weights
appear to converge, but are not the optimum solution) and encourages finding the global
minimum. If the learning rate is too small, then learning will occur at a very slow pace. If the
learning rate is too large, then oscillation between inadequate solutions may occur. A rule
of thumb is to set the learning rate to 1/t, where t is the number of iterations through the
training set so far.
Biases are updated by the following equations, where Δθj is the change in bias θj.
Δθj = (L) Errj
θj = θj + Δθj
Note that here we are updating the weights and biases after the presentation of each sample.
This is referred to as case updating. Alternatively, the weight and bias increments could be
accumulated in variables, so that the weights and biases are updated after all of the samples in
the training set have been presented. This latter strategy is called epoch updating, where one
iteration through the training set is an epoch. In theory, the mathematical derivation of back
propagation employs epoch updating, yet in practice, case updating is more common since it
tends to yield more accurate results.
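The forward pass, error computation, and case updating can be combined into one function for a small network. The sketch assumes a single hidden layer and a single output unit; the function name train_on_sample, the initial weights, the learning rate of 0.9, and the training sample are all hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_on_sample(x, target, w_hidden, b_hidden, w_out, b_out, rate):
    """One case update. w_hidden[i][j] connects input i to hidden unit j;
    w_out[j] connects hidden unit j to the single output unit.
    Lists are updated in place; the new output bias is returned."""
    n_in, n_hidden = len(x), len(b_hidden)

    # Propagate the inputs forward.
    hidden = [sigmoid(sum(w_hidden[i][j] * x[i] for i in range(n_in)) + b_hidden[j])
              for j in range(n_hidden)]
    out = sigmoid(sum(w_out[j] * hidden[j] for j in range(n_hidden)) + b_out)

    # Back propagate the error (hidden errors use the weights before updating).
    err_out = out * (1 - out) * (target - out)
    err_hidden = [hidden[j] * (1 - hidden[j]) * err_out * w_out[j]
                  for j in range(n_hidden)]

    # Update weights and biases: delta = rate * Err_j * O_i.
    for j in range(n_hidden):
        w_out[j] += rate * err_out * hidden[j]
        b_hidden[j] += rate * err_hidden[j]
        for i in range(n_in):
            w_hidden[i][j] += rate * err_hidden[j] * x[i]
    b_out += rate * err_out
    return out, b_out


# Hypothetical initial weights and biases (small numbers).
w_hidden = [[0.2, -0.3], [0.4, 0.1]]
b_hidden = [-0.4, 0.2]
w_out = [-0.3, -0.2]
b_out = 0.1

outputs = []
for _ in range(20):                   # present the same sample repeatedly
    out, b_out = train_on_sample([1.0, 0.0], 1.0, w_hidden, b_hidden,
                                 w_out, b_out, rate=0.9)
    outputs.append(out)
print(round(outputs[0], 3), round(outputs[-1], 3))
```

With each presentation the network's output for this sample moves closer to the target value of 1. For epoch updating, the increments would instead be accumulated over all samples and applied once per pass through the training set.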
Terminating condition
Training stops when
• All ΔWij in the previous epoch were so small as to be below some specified
threshold, or
• The percentage of samples misclassified in the previous epoch is below some
threshold, or
• A pre-specified number of epochs has expired.
In practice, several hundreds of thousands of epochs may be required before the weights will
converge.
9.4 Support Vector Machines
The Support Vector Machine (SVM) is a promising new method for the classification of both linear and
nonlinear data. In a nutshell, an SVM is an algorithm that works
as follows. It uses a nonlinear mapping to transform the original training data into a higher
dimension. Within this new dimension, it searches for the linear optimal separating
hyperplane (that is, a "decision boundary" separating the tuples of one class from another).
With an appropriate nonlinear mapping to a sufficiently high dimension, data from two
classes can always be separated by a hyperplane. The SVM finds this hyperplane using
support vectors ("essential" training tuples) and margins (defined by the support vectors).
Why have SVMs attracted a great deal of attention lately? Although the training time of
even the fastest SVMs can be extremely slow, they are highly accurate, owing to their ability
to model complex nonlinear decision boundaries. They are much less prone to overfitting
than other methods. The support vectors found also provide a compact description of the
learned model. SVMs can be used for prediction as well as classification. They have been
applied to a number of areas, including handwritten digit recognition, object recognition, and
speaker identification, as well as benchmark time-series prediction tests.
The Case When the Data Are Linearly Separable
To explain the mystery of SVMs, let‘s first look at the simplest case—a two-class problem
where the classes are linearly separable. Let the data set D be given as (X1, y1), (X2, y2), ...,
(X|D|, y|D|), where each Xi is a training tuple with associated class label yi. Each yi can
take one of two values, either +1 or -1 (i.e., yi ∈ {+1, -1}), corresponding to the classes
buys_computer = yes and buys_computer = no, respectively. To aid in visualization, consider
an example based on two input attributes, A1 and A2, as shown in the figure.
From the graph, we see that the 2-D data are linearly separable (or "linear," for short) because
a straight line can be drawn to separate all of the tuples of class +1 from all of the tuples of
class -1. There are an infinite number of separating lines that could be drawn. We want to find the
"best" one, that is, one that (we hope) will have the minimum classification error on
previously unseen tuples.
How can we find this best line? Note that if our data were 3-D (i.e., with three attributes), we would
want to find the best separating plane. Generalizing to n dimensions, we want to find the best
hyperplane. We will use the term "hyperplane" to refer to the decision boundary that we are
seeking, regardless of the number of input attributes. So, in other words, how can we find the
best hyperplane? An SVM approaches this problem by searching for the maximum marginal
hyperplane. Consider the figure, which shows two possible separating hyperplanes and their
associated margins. Before we get into the definition of margins, let's take an intuitive look at
this figure.
Both hyperplanes can correctly classify all of the given data tuples. Intuitively, however, we
expect the hyperplane with the larger margin to be more accurate at classifying future data tuples than
the hyperplane with the smaller margin. This is why (during the learning or training phase)
the SVM searches for the hyperplane with the largest margin, that is, the maximum marginal
hyperplane (MMH). The associated margin gives the largest separation between classes.
Getting to an informal definition of margin, we can say that the shortest distance from a hyperplane
to one side of its margin is equal to the shortest distance from the hyperplane to the other
side of its margin, where the "sides" of the margin are parallel to the hyperplane. When
dealing with the MMH, this distance is, in fact, the shortest distance from the MMH to the
closest training tuple of either class.
A separating hyper plane can be written as
W · X + b = 0,
where W is a weight vector, namely, W = {w1, w2, ..., wn}; n is the number of attributes;
and b is a scalar, often referred to as a bias. To aid in visualization, let's consider two input
attributes, A1 and A2, as in Figure (b). Training tuples are 2-D, e.g., X = (x1, x2), where x1
and x2 are the values of attributes A1 and A2, respectively, for X. If we think of b as an
additional weight, w0, we can rewrite the above separating hyperplane as
w0 + w1x1 + w2x2 = 0.
Thus, any point that lies above the separating hyperplane satisfies
w0 + w1x1 + w2x2 > 0.
Similarly, any point that lies below the separating hyperplane satisfies
w0 + w1x1 + w2x2 < 0.
The weights can be adjusted so that the hyperplanes defining the "sides" of the margin can be
written as
H1: w0 + w1x1 + w2x2 ≥ 1 for yi = +1, and
H2: w0 + w1x1 + w2x2 ≤ -1 for yi = -1.
That is, any tuple that falls on or above H1 belongs to class +1, and any tuple that falls on or
below H2 belongs to class -1. Combining the two inequalities for H1 and H2, we get
yi (w0 + w1x1 + w2x2) ≥ 1, ∀i.
Any training tuples that fall on hyperplanes H1 or H2 (i.e., the "sides" defining the margin)
satisfy the combined inequality above with equality and are called support vectors. That is, they are equally close to the
(separating) MMH. Essentially, the support vectors are the most difficult tuples to classify
and give the most information regarding classification. From the above, we can obtain a
formula for the size of the maximal margin. The distance from the separating hyperplane to
any point on H1 is 1/||W||, where ||W|| is the Euclidean norm of W, that is, √(W · W).
If W = {w1, w2, ..., wn}, then √(W · W) = √(w1² + w2² + ... + wn²). By
definition, this is equal to the distance from any point on H2 to the separating hyperplane.
Therefore, the maximal margin is 2/||W||.
So, how does an SVM find the MMH and the support vectors? Using some "fancy math
tricks," we can rewrite the combined inequality above so that it becomes what is known as a constrained
(convex) quadratic optimization problem, which is solved using a Lagrangian formulation.
If the data are small (say, less than 2,000 training tuples), any optimization software package
for solving constrained convex quadratic problems can then be used to find the support
vectors and MMH. For larger data, special and more efficient algorithms for training SVMs
can be used instead.
Once we have found the support vectors and MMH (note that the support vectors define the
MMH!), we have a trained support vector machine. The MMH is a linear class boundary, and
so the corresponding SVM can be used to classify linearly separable data. We refer to such a
trained SVM as a linear SVM.
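The margin formula and the support-vector condition can be checked numerically. The sketch below assumes a hypothetical, already trained 2-D linear SVM with W = (1, 1) and b = -3 and three made-up training tuples; it does not perform the optimization itself.

```python
import math


def margin_width(w):
    """Maximal margin 2 / ||W||, where ||W|| is the Euclidean norm of W."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


def margin_value(w, b, x, y):
    """y * (W . X + b): at least 1 for every training tuple, exactly 1 on H1 or H2."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hypothetical trained hyperplane x1 + x2 - 3 = 0.
w, b = [1.0, 1.0], -3.0

# Hypothetical training tuples (X, y).
training = [((2.0, 2.0), +1), ((3.0, 3.0), +1), ((1.0, 1.0), -1)]

width = margin_width(w)
support_vectors = [x for x, y in training if margin_value(w, b, x, y) == 1.0]
print(round(width, 4))      # 1.4142
print(support_vectors)      # [(2.0, 2.0), (1.0, 1.0)]
```

The tuples (2, 2) and (1, 1) lie exactly on H1 and H2, so they are the support vectors, and the distance between them equals the margin width of √2.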
"Once we have got a trained support vector machine, how do we use it to classify test (i.e.,
new) tuples?" Based on the Lagrangian formulation mentioned above, the MMH can be
rewritten as the decision boundary
d(X^T) = ∑(i = 1 to l) yi αi Xi · X^T + b0,
where yi is the class label of support vector Xi; X^T is a test tuple; αi and b0 are numeric
parameters that were determined automatically by the optimization or SVM algorithm above;
and l is the number of support vectors.
Note that the αi are Lagrangian multipliers. For linearly separable data, the support vectors
are a subset of the actual training tuples (although there will be a slight twist regarding this
when dealing with nonlinearly separable data, as we shall see below).
Given a test tuple, X^T, we plug it into the equation above and then check the sign of the
result. This tells us on which side of the hyperplane the test tuple falls. If the sign is positive,
then X^T falls on or above the MMH, and so the SVM predicts that X^T belongs to class +1
(representing buys_computer = yes, in the buys_computer case). If the sign is negative, then X^T
falls on or below the MMH and the class prediction is -1 (representing buys_computer = no).
Notice that the Lagrangian formulation contains a dot product between support vector Xi and
test tuple X^T. This will prove very useful for finding the MMH and support vectors for the
case when the given data are nonlinearly separable, as described further below.
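As a minimal sketch of the classification step just described, the following Python code evaluates the decision function (the sum over the support vectors of yi αi (Xi · X), plus b0) and uses its sign as the predicted class. The two support vectors, the multipliers (0.25 each), and b0 = 0 are made-up toy values, chosen so that both support vectors lie exactly on the margin; a real SVM would obtain them from the optimization step. The function names are illustrative only.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def svm_decision(x, support_vectors, labels, alphas, b0):
    """Evaluate d(x) = sum_i y_i * alpha_i * (X_i . x) + b0."""
    return sum(y * a * dot(sv, x)
               for sv, y, a in zip(support_vectors, labels, alphas)) + b0


def svm_predict(x, support_vectors, labels, alphas, b0):
    """Return +1 if x falls on or above the MMH, otherwise -1."""
    return 1 if svm_decision(x, support_vectors, labels, alphas, b0) >= 0 else -1


# Toy values (assumptions for illustration, not learned from data).
SV = [[1.0, 1.0], [-1.0, -1.0]]   # support vectors X_i
Y = [1, -1]                       # class labels y_i
ALPHA = [0.25, 0.25]              # Lagrangian multipliers alpha_i
B0 = 0.0

print(svm_decision([1.0, 1.0], SV, Y, ALPHA, B0))   # a support vector: d = 1.0
print(svm_predict([-3.0, 1.0], SV, Y, ALPHA, B0))   # negative side: -1
```

Note that only the support vectors enter the computation, which is why the cost of classifying a tuple depends on the number of support vectors rather than on the size of the training set.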
Before we move on to the nonlinear case, there are two more important things to note. The
complexity of the learned classifier is characterized by the number of support vectors rather
than the dimensionality of the data. Hence, SVMs tend to be less prone to overfitting than
some other methods. The support vectors are the essential or critical training tuples: they lie
closest to the decision boundary (MMH). If all other training tuples were removed and
training were repeated, the same separating hyperplane would be found. Furthermore, the
number of support vectors found can be used to compute an (upper) bound on the expected
error rate of the SVM classifier, which is independent of the data dimensionality. An SVM
with a small number of support vectors can have good generalization, even when the
dimensionality of the data is high.
The Case When the Data Are Linearly Inseparable
So far we have studied SVMs for classifying linearly separable data, but what if the data are
not linearly separable, as in the figure?
In such cases, no straight line can be found that would separate the classes. The linear SVMs
we studied would not be able to find a feasible solution here.
The approach described for linear SVMs can be extended to create nonlinear SVMs for the
classification of linearly inseparable data (also called nonlinearly separable data, or nonlinear
data, for short). Such SVMs are capable of finding nonlinear decision boundaries (i.e.,
nonlinear hypersurfaces) in input space. We obtain a nonlinear SVM by extending the
approach for linear SVMs as follows. There are two main steps.
In the first step, we transform the original input data into a higher dimensional space using a
nonlinear mapping. Several common nonlinear mappings can be used in this step. Once the
data have been transformed into the new higher space, the second step searches for a linear
separating hyperplane in the new space. We again end up with a quadratic optimization
problem that can be solved using the linear SVM formulation. The maximal marginal
hyperplane found in the new space corresponds to a nonlinear separating hypersurface in the
original space. For example, suppose a 3-D input vector X = (x1, x2, x3) is mapped into a 6-D
space, Z, using z1 = x1, z2 = x2, z3 = x3, z4 = (x1)^2, z5 = x1 x2, and z6 = x1 x3. The linear
decision hyperplane in the new (Z) space then corresponds to a nonlinear second-order
polynomial in the original 3-D input space:
\[ d(Z) = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 (x_1)^2 + w_5 x_1 x_2 + w_6 x_1 x_3 + b = w_1 z_1 + w_2 z_2 + w_3 z_3 + w_4 z_4 + w_5 z_5 + w_6 z_6 + b. \]
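The correspondence between the two forms of d(Z) can be sketched in Python. The mapping below takes a 3-D vector to the six coordinates that appear in the polynomial (x1, x2, x3, x1 squared, x1 x2, x1 x3); the weight vector W and bias B are hypothetical values chosen only for illustration, since in practice they come from training.

```python
def phi(x):
    """Map a 3-D input (x1, x2, x3) to the 6-D space Z used in the text."""
    x1, x2, x3 = x
    return (x1, x2, x3, x1 ** 2, x1 * x2, x1 * x3)


def decision_in_z(x, w, b):
    """Linear decision function in Z; a second-order polynomial in x."""
    return sum(wi * zi for wi, zi in zip(w, phi(x))) + b


# Hypothetical weights: d(Z) = z1 - z4, i.e. x1 - x1^2 in the input space.
W = (1.0, 0.0, 0.0, -1.0, 0.0, 0.0)
B = 0.0

print(phi((2, 3, 4)))
print(decision_in_z((0.5, 0.0, 0.0), W, B))
```

The function is linear in the z coordinates but curved in x1, which is the point of the two-step approach.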
Two problems arise, however. First, how do we choose the nonlinear mapping to a higher
dimensional space? Second, the computation involved will be costly. Refer back to the equation
for d(X^T) for the classification of a test tuple, X^T. Given the test tuple, we have to compute its
dot product with every one of the support vectors. In training, we have to compute a similar dot
product several times in order to find the MMH. This is especially expensive, so the dot product
computation required is very heavy and costly. It so happens that in solving the quadratic
optimization problem of the linear SVM (i.e., when searching for a linear SVM in the new higher
dimensional space), the training tuples appear only in the form of dot products, ɸ(Xi) · ɸ(Xj),
where ɸ(X) is simply the nonlinear mapping function applied to transform the training tuples.
Instead of computing the dot product on the transformed data tuples, it turns out that it is
mathematically equivalent to apply a kernel function, K(Xi, Xj), to the original input data.
That is,
K(Xi, Xj) = ɸ(Xi) · ɸ(Xj).
In other words, everywhere that ɸ(Xi) · ɸ(Xj) appears in the training algorithm, we can
replace it with K(Xi, Xj). In this way, all calculations are made in the original input space,
which is of potentially much lower dimensionality. We can safely avoid the mapping; it
turns out that we don't even have to know what the mapping is! We will talk more later about
what kinds of functions can be used as kernel functions for this problem.
After applying this trick, we can proceed to find a maximal separating hyperplane. The procedure
is similar to that described above for the linearly separable case, although it involves placing a
user-specified upper bound, C, on the Lagrange multipliers, αi. This upper bound is best
determined experimentally.
"What are some of the kernel functions that could be used?" Properties of the kinds of kernel
functions that could be used to replace the dot product scenario described above have been
studied. Three admissible kernel functions are:
Polynomial kernel of degree h: \( K(X_i, X_j) = (X_i \cdot X_j + 1)^h \)
Gaussian radial basis function kernel: \( K(X_i, X_j) = e^{-\|X_i - X_j\|^2 / 2\sigma^2} \)
Sigmoid kernel: \( K(X_i, X_j) = \tanh(\kappa X_i \cdot X_j - \delta) \)
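The three kernels can be written directly in Python, as in the sketch below. To illustrate the claim K(Xi, Xj) = ɸ(Xi) · ɸ(Xj), the code also includes an explicit mapping for the degree-2 polynomial kernel on 2-D inputs. The function names are illustrative, and the particular 6-D mapping is one standard choice that reproduces this kernel; it is not taken from the text.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(xi, xj, h):
    """K(Xi, Xj) = (Xi . Xj + 1)^h"""
    return (dot(xi, xj) + 1) ** h


def gaussian_rbf_kernel(xi, xj, sigma):
    """K(Xi, Xj) = exp(-||Xi - Xj||^2 / (2 sigma^2))"""
    squared_norm = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-squared_norm / (2 * sigma ** 2))


def sigmoid_kernel(xi, xj, kappa, delta):
    """K(Xi, Xj) = tanh(kappa * Xi . Xj - delta)"""
    return math.tanh(kappa * dot(xi, xj) - delta)


def phi_poly2(x):
    """Explicit 6-D mapping whose dot product equals the degree-2
    polynomial kernel on 2-D inputs (an illustrative assumption)."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, r2 * x1 * x2, x2 * x2)


a, b = (1, 2), (3, 4)
print(polynomial_kernel(a, b, 2))          # computed in the 2-D input space
print(dot(phi_poly2(a), phi_poly2(b)))     # same value, computed in 6-D
```

The kernel gives the same number as the dot product in the mapped space while only ever touching the original 2-D vectors, which is why the mapping never has to be carried out.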
There are no golden rules for determining which admissible kernel will result in the most
accurate SVM. In practice, the kernel chosen does not generally make a large difference in
resulting accuracy. SVM training always finds a global solution. So far, we have described linear
and nonlinear SVMs for binary (i.e., two-class) classification. SVM classifiers can be combined
for the multiclass case. A simple and effective approach, given m classes, trains m classifiers,
one for each class (where classifier j learns to return a positive value for class j and a negative
value for the rest). A test tuple is assigned the class corresponding to the largest positive distance.
Aside from classification, SVMs can also be designed for linear and nonlinear regression.
Here, instead of learning to predict discrete class labels (like the yi ∈ {+1, −1} above), SVMs
for regression attempt to learn the input-output relationship between input training tuples, Xi,
and their corresponding continuous-valued outputs, yi ∈ R.
An approach similar to SVMs for classification is followed. Additional user-specified
parameters are required. A major research goal regarding SVMs is to improve the speed in
training and testing so that SVMs may become a more feasible option for very large data sets
(e.g., of millions of support vectors). Other issues include determining the best kernel for a
given data set and finding more efficient methods for the multiclass case.
9.5 Associative Classification
Association rule mining is an important and highly active area of data mining research.
Recently, data mining techniques have been developed that apply concepts used in
association rule mining to the problem of classification. We study three such methods in historical
order. The first two, ARCS and associative classification, use association rules for
classification. The third method, CAEP, mines "emerging patterns" that consider the concept
of support used in mining associations.
The first method mines association rules based on clustering and then employs the rules for
classification. ARCS, the Association Rule Clustering System, mines association rules of
the form Aquan1 ∧ Aquan2 ⇒ Acat, where Aquan1 and Aquan2 are tests on quantitative
attribute ranges (where the ranges are dynamically determined), and Acat assigns a class
label for a categorical attribute from the given training data.
Association rules are plotted on a 2-D grid. The algorithm scans the grid, searching for
rectangular clusters of rules. In this way, adjacent ranges of the quantitative attributes
occurring within a rule cluster may be combined. The clustered association rules generated by
ARCS were empirically found to be slightly more accurate than C4.5 when there are outliers
in the data. The accuracy of ARCS is related to the degree of discretization used. In terms of
scalability, ARCS requires "a constant amount of memory", regardless of the database size.
C4.5 has exponentially higher execution times than ARCS, requiring the entire database,
multiplied by some factor, to fit entirely in main memory.
The second method is referred to as associative classification. It mines rules of the form
condset ⇒ y, where condset is a set of items (or attribute-value pairs) and y is a class label.
Rules that satisfy a pre-specified minimum support are frequent, where a rule has support s if
s% of the samples in the given data set contain condset and belong to class y. A rule
satisfying minimum confidence is called accurate, where a rule has confidence c if c% of the
samples in the given data set that contain condset belong to class y. If a set of rules has the
same condset, then the rule with the highest confidence is selected as the possible rule (PR) to
represent the set.
The associative classification method consists of two steps. The first step finds the set of all
PRs that are both frequent and accurate. It uses an iterative approach, where prior knowledge
is used to prune the rule search. The second step uses a heuristic method to construct the
classifier, where the discovered rules are organized according to decreasing precedence based
on their confidence and support. The algorithm may require several passes over the data set,
depending on the length of the longest rule found. When classifying a new sample, the first
rule satisfying the sample is used to classify it. The classifier also contains a default rule,
having lowest precedence, which specifies a default class for any new sample that is not
satisfied by any other rule in the classifier. In general, the associative classification method
was empirically found to be more accurate than C4.5 on several data sets. Each of the above
two steps was shown to have linear scale-up.
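The definitions of support and confidence for a rule condset ⇒ y, and the use of an ordered rule list with a default class, can be sketched as follows. The five-sample data set, the attribute names, and the function names are made up for illustration; the rule mining and pruning steps are not shown.

```python
def matches(items, condset):
    """True if the sample contains every attribute-value pair in condset."""
    return all(items.get(attr) == value for attr, value in condset.items())


def rule_support_confidence(samples, condset, y):
    """Return (support %, confidence %) of the rule condset => y."""
    covered = [label for items, label in samples if matches(items, condset)]
    both = sum(1 for label in covered if label == y)
    support = 100.0 * both / len(samples)
    confidence = 100.0 * both / len(covered) if covered else 0.0
    return support, confidence


def classify(items, ranked_rules, default_class):
    """Use the first rule (in precedence order) that the sample satisfies."""
    for condset, y in ranked_rules:
        if matches(items, condset):
            return y
    return default_class


# Hypothetical training samples: (attribute-value pairs, class label).
SAMPLES = [
    ({"age": "youth", "student": "yes"}, "yes"),
    ({"age": "youth", "student": "yes"}, "yes"),
    ({"age": "youth", "student": "no"}, "no"),
    ({"age": "senior", "student": "yes"}, "no"),
    ({"age": "senior", "student": "no"}, "no"),
]

support, confidence = rule_support_confidence(SAMPLES, {"student": "yes"}, "yes")
print(support, confidence)
```

Here two of the five samples contain student = yes and belong to class yes, and three samples contain student = yes in total, so the rule has 40% support and roughly 66.7% confidence.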
The third method, CAEP (classification by aggregating emerging patterns), uses the notion of
itemset support to mine emerging patterns (EPs), which are used to construct a classifier.
Roughly speaking, an EP is an itemset (or set of items) whose support increases significantly
from one class of data to another. The ratio of the two supports is called the growth rate of the
EP. For example, suppose that we have a data set of customers with the classes
buys_computer = "yes", or C1, and buys_computer = "no", or C2. The itemset {age =
"≤30", student = "no"} is a typical EP, whose support increases from 0.2% in C1 to 57.6%
in C2, a growth rate of 57.6%/0.2% = 288. Note that an item is either a simple equality test on a
categorical attribute or a test of whether a numeric attribute is in an interval. Each EP is a
multiattribute test and can be very strong at differentiating instances of one class from another.
For instance, if a new sample X contains the above EP, then with odds of 99.6% we can claim
that X belongs to C2. In general, the differentiating power of an EP is roughly proportional to
its growth rate and its support in the target class.
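The arithmetic in this example can be checked with a short Python sketch. The growth rate is the ratio of the two supports. The odds formula used here, growth rate divided by (growth rate + 1), is an assumption: the text does not state it, but it gives about 0.9965, in line with the quoted 99.6%. The function names are illustrative.

```python
def growth_rate(support_from, support_to):
    """Ratio of an itemset's support in the target class to its support
    in the other class; infinite if the itemset never occurs in the other class."""
    if support_from == 0:
        return float("inf") if support_to > 0 else 0.0
    return support_to / support_from


def odds_from_growth_rate(rate):
    """Assumed formula: rate / (rate + 1); tends to 1 as the rate grows."""
    if rate == float("inf"):
        return 1.0
    return rate / (rate + 1)


rate = growth_rate(0.2, 57.6)        # supports in C1 and C2, in percent
print(rate)                          # close to 288
print(odds_from_growth_rate(rate))   # close to 0.9965
```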
For each class C, CAEP finds EPs satisfying given support and growth rate thresholds, where
the growth rate is computed with respect to the set of all non-C samples versus the target set of
all C samples. "Border-based" algorithms can be used for this purpose. When classifying a new
sample, X, for each class C, the differentiating power of the EPs of class C that occur in X is
aggregated to derive a score for C that is then normalized. The class with the largest
normalized score determines the class label of X.
CAEP has been found to be more accurate than C4.5 and association-based classification on
several data sets. It also performs well on data sets where the main class of interest is in the
minority. It scales up on data volume and dimensionality. An alternative classifier, called the
JEP-classifier, was proposed based on jumping emerging patterns (JEPs). A JEP is a special
type of EP, defined as an itemset whose support increases abruptly from zero in one data set
to nonzero in another data set. The two classifiers are considered complementary.
9.6 Decision Trees
A decision tree is a classifier expressed as a recursive partition of the instance space. The
decision tree consists of nodes that form a rooted tree, meaning it is a directed tree with a
node called "root" that has no incoming edges. All other nodes have exactly one incoming
edge. A node with outgoing edges is called an internal or test node. All other nodes are called
leaves (also known as terminal or decision nodes). In a decision tree, each internal node splits
the instance space into two or more sub-spaces according to a certain discrete function of the
input attribute values. In the simplest and most frequent case, each test considers a single
attribute, such that the instance space is partitioned according to the attribute's value. In the
case of numeric attributes, the condition refers to a range.
Each leaf is assigned to one class representing the most appropriate target value.
Alternatively, the leaf may hold a probability vector indicating the probability of the target
attribute having a certain value. Instances are classified by navigating them from the root of
the tree down to a leaf, according to the outcome of the tests along the path. Figure below
describes a decision tree that reasons whether or not a potential customer will respond to a
direct mailing. Internal nodes are represented as circles, whereas leaves are denoted as
triangles. Note that this decision tree incorporates both nominal and numeric attributes. Given
this classifier, the analyst can predict the response of a potential customer (by sorting it down
the tree), and understand the behavioural characteristics of the entire population of potential
customers regarding direct mailing. Each node is labelled with the attribute it tests, and its
branches are labelled with the attribute's corresponding values.
In the case of numeric attributes, decision trees can be geometrically interpreted as a collection of
hyperplanes, each orthogonal to one of the axes. Naturally, decision-makers prefer less
complex decision trees, since they may be considered more comprehensible. Furthermore, the
tree complexity has a crucial effect on its accuracy. The tree complexity is explicitly
controlled by the stopping criteria used and the pruning method employed. Usually the tree
complexity is measured by one of the following metrics: the total number of nodes, total
number of leaves, tree depth and number of attributes used. Decision tree induction is closely
related to rule induction. Each path from the root of a decision tree to one of its leaves can be
transformed into a rule simply by conjoining the tests along the path to form the antecedent
part, and taking the leaf's class prediction as the class value.
For example, one of the paths in the figure above can be transformed into the rule: "If customer
age is less than or equal to 30, and the gender of the customer is 'Male', then the
customer will respond to the mail". The resulting rule set can then be simplified to improve
its comprehensibility to a human user, and possibly its accuracy.
Algorithmic Framework for Decision Trees
Decision tree inducers are algorithms that automatically construct a decision tree from a
given dataset. Typically the goal is to find the optimal decision tree by minimizing the
generalization error. However, other target functions can be also defined, for instance,
minimizing the number of nodes or minimizing the average depth. Induction of an optimal
decision tree from given data is considered to be a hard task. It has been shown that finding
a minimal decision tree consistent with the training set is NP–hard. Moreover, it has been
shown that constructing a minimal binary tree with respect to the expected number of tests
required for classifying an unseen instance is NP–complete. Even finding the minimal
equivalent decision tree for a given decision tree or building the optimal decision tree from
decision tables is known to be NP–hard.
The above results indicate that using optimal decision tree algorithms is feasible only in small
problems. Consequently, heuristic methods are required for solving the problem. Roughly
speaking, these methods can be divided into two groups: top–down and bottom–up with clear
preference in the literature to the first group.
There are various top–down decision tree inducers, such as ID3 (1986), C4.5 (1993), and
CART (1984). Some consist of two conceptual phases: growing and pruning (C4.5 and
CART). Other inducers perform only the growing phase. At each node, the inducer considers
partitioning the training set according to a discrete function of the input attributes; the
selection of the most appropriate function is made according to some splitting measure. After
the selection of an appropriate split, each node further subdivides the training set into smaller
subsets, until no split gains sufficient splitting measure or a stopping criterion is satisfied.
Uni-variate Splitting Criteria
In most of the cases, the discrete splitting functions are univariate. Univariate means that an
internal node is split according to the value of a single attribute. Consequently, the inducer
searches for the best attribute upon which to split. There are various univariate criteria. These
criteria can be characterized in different ways, such as:
According to the origin of the measure: information theory, dependence, and distance.
According to the measure structure: impurity-based criteria, normalized impurity-based
criteria, and binary criteria.
The following sections describe the most common criteria.
Impurity-based Criteria
Given a random variable x with k discrete values, distributed according to P = (p1, p2, ..., pk),
an impurity measure is a function ɸ: [0, 1]^k → R that satisfies the following conditions:
ɸ(P) ≥ 0.
ɸ(P) is minimum if ∃i such that component pi = 1.
ɸ(P) is maximum if ∀i, 1 ≤ i ≤ k, pi = 1/k.
ɸ(P) is symmetric with respect to the components of P.
ɸ(P) is smooth (differentiable everywhere) in its range.
Note that if the probability vector has a component of 1 (the variable x gets only one value),
then the variable is defined as pure. On the other hand, if all components are equal, the level
of impurity reaches maximum.
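Given a training set S, the probability vector of the target attribute y is defined as:
\[ P_y(S) = \left( \frac{|\sigma_{y=c_1} S|}{|S|}, \ldots, \frac{|\sigma_{y=c_{|dom(y)|}} S|}{|S|} \right) \]
Here σ with a condition as subscript denotes the subset of S whose instances satisfy that condition.
The goodness-of-split due to discrete attribute ai is defined as the reduction in impurity of the
target attribute after partitioning S according to the values vi,j ∈ dom(ai):
\[ \Delta\phi(a_i, S) = \phi(P_y(S)) - \sum_{j=1}^{|dom(a_i)|} \frac{|\sigma_{a_i=v_{i,j}} S|}{|S|} \cdot \phi\left(P_y(\sigma_{a_i=v_{i,j}} S)\right) \]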
Information Gain
Information gain is an impurity-based criterion that uses the entropy measure (originating from
information theory) as the impurity measure:
\[ \mathrm{InformationGain}(a_i, S) = \mathrm{Entropy}(y, S) - \sum_{v_{i,j} \in dom(a_i)} \frac{|\sigma_{a_i=v_{i,j}} S|}{|S|} \cdot \mathrm{Entropy}(y, \sigma_{a_i=v_{i,j}} S) \]
where
\[ \mathrm{Entropy}(y, S) = \sum_{c_j \in dom(y)} -\frac{|\sigma_{y=c_j} S|}{|S|} \cdot \log_2 \frac{|\sigma_{y=c_j} S|}{|S|} \]
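As a minimal sketch, entropy and information gain can be computed in Python as shown below. The training set is represented as a list of dictionaries; the four-row data set and the attribute names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (base 2) of the class distribution in a list of labels."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n)
               for count in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting
    the rows on the values of the given attribute."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(part) / len(rows) * entropy(part)
                    for part in partitions.values())
    return entropy([row[target] for row in rows]) - remainder


# Hypothetical training set.
ROWS = [
    {"student": "yes", "age": "youth", "buys": "yes"},
    {"student": "yes", "age": "senior", "buys": "yes"},
    {"student": "no", "age": "youth", "buys": "no"},
    {"student": "no", "age": "senior", "buys": "no"},
]

print(information_gain(ROWS, "student", "buys"))  # split yields pure subsets
print(information_gain(ROWS, "age", "buys"))      # split tells us nothing
```

In this toy set, splitting on student separates the classes perfectly (gain of 1 bit), whereas splitting on age leaves each subset as mixed as before (gain of 0).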
Gini Index
Gini index is an impurity-based criterion that measures the divergence between the
probability distributions of the target attribute's values. The Gini index has been used in
various works and it is defined as:
\[ \mathrm{Gini}(y, S) = 1 - \sum_{c_j \in dom(y)} \left( \frac{|\sigma_{y=c_j} S|}{|S|} \right)^2 \]
Consequently, the evaluation criterion for selecting the attribute ai is defined as:
\[ \mathrm{GiniGain}(a_i, S) = \mathrm{Gini}(y, S) - \sum_{v_{i,j} \in dom(a_i)} \frac{|\sigma_{a_i=v_{i,j}} S|}{|S|} \cdot \mathrm{Gini}(y, \sigma_{a_i=v_{i,j}} S) \]
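The Gini index and the Gini gain follow the same pattern as entropy and information gain, as this minimal Python sketch shows. The data set and the names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(rows, attribute, target):
    """Gini index of the target minus the weighted Gini index of the
    subsets obtained by splitting on the attribute."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(part) / len(rows) * gini(part)
                    for part in partitions.values())
    return gini([row[target] for row in rows]) - remainder


# Hypothetical training set.
ROWS = [
    {"student": "yes", "age": "youth", "buys": "yes"},
    {"student": "yes", "age": "senior", "buys": "yes"},
    {"student": "no", "age": "youth", "buys": "no"},
    {"student": "no", "age": "senior", "buys": "no"},
]

print(gini([row["buys"] for row in ROWS]))   # two balanced classes: 0.5
print(gini_gain(ROWS, "student", "buys"))    # pure subsets: gain of 0.5
```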
Gain Ratio
The gain ratio "normalizes" the information gain as follows:
\[ \mathrm{GainRatio}(a_i, S) = \frac{\mathrm{InformationGain}(a_i, S)}{\mathrm{Entropy}(a_i, S)} \]
Note that this ratio is not defined when the denominator is zero. Also, the ratio may tend to
favour attributes for which the denominator is very small. Consequently, it is suggested that the
selection be carried out in two stages. First, the information gain is calculated for all attributes.
Then, taking into consideration only attributes that have performed at least as well as the average
information gain, the attribute that has obtained the best gain ratio is selected. It has been
shown that the gain ratio tends to outperform simple information gain criteria, both in
accuracy and in classifier complexity.
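The two-stage selection can be sketched in Python as follows. The helper functions are repeated so that the example is self-contained. The data set is hypothetical and includes an identifier-like attribute, id, to show how the normalization penalizes attributes with many values. Returning None for an undefined ratio is a choice made for this sketch.

```python
import math
from collections import Counter


def entropy(values):
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(values).values())


def information_gain(rows, attribute, target):
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy([row[target] for row in rows]) - remainder


def gain_ratio(rows, attribute, target):
    """Information gain divided by the entropy of the attribute itself;
    None when the denominator is zero (ratio undefined)."""
    denominator = entropy([row[attribute] for row in rows])
    if denominator == 0:
        return None
    return information_gain(rows, attribute, target) / denominator


def select_attribute(rows, attributes, target):
    """Stage 1: keep attributes with at least average information gain.
    Stage 2: among those, pick the one with the best gain ratio."""
    gains = {a: information_gain(rows, a, target) for a in attributes}
    average = sum(gains.values()) / len(gains)
    ratios = {a: gain_ratio(rows, a, target)
              for a in attributes if gains[a] >= average}
    ratios = {a: r for a, r in ratios.items() if r is not None}
    return max(ratios, key=ratios.get) if ratios else None


# Hypothetical training set; "id" is unique per row.
ROWS = [
    {"id": 1, "student": "yes", "buys": "yes"},
    {"id": 2, "student": "yes", "buys": "yes"},
    {"id": 3, "student": "no", "buys": "no"},
    {"id": 4, "student": "no", "buys": "no"},
]

print(select_attribute(ROWS, ["id", "student"], "buys"))
```

Both attributes have the same information gain here, but id has entropy 2 while student has entropy 1, so the gain ratio prefers student.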
Multivariate Splitting Criteria
In multivariate splitting criteria, several attributes may participate in a single node split test.
Obviously, finding the best multivariate criterion is more complicated than finding the best
univariate split. Furthermore, although this type of criterion may dramatically improve the
tree's performance, these criteria are much less popular than the univariate criteria.
Most of the multivariate splitting criteria are based on a linear combination of the input
attributes. Finding the best linear combination can be performed using a greedy search, linear
programming, or linear discriminant analysis.
Stopping Criteria
The growing phase continues until a stopping criterion is triggered. The following conditions
are common stopping rules:
• All instances in the training set belong to a single value of y.
• The maximum tree depth has been reached.
• The number of cases in the terminal node is less than the minimum number of cases
for parent nodes.
• If the node were split, the number of cases in one or more child nodes would be less
than the minimum number of cases for child nodes.
• The best splitting criterion is not greater than a certain threshold.
Pruning Methods
Employing tight stopping criteria tends to create small and under-fitted decision trees. On
the other hand, using loose stopping criteria tends to generate large decision trees that are
over-fitted to the training set. Pruning methods were developed for solving this dilemma.
According to this methodology, a loose stopping criterion is used, letting the decision tree
overfit the training set. Then the over-fitted tree is cut back into a smaller tree by
removing sub-branches that are not contributing to the generalization accuracy. Employing
pruning methods can improve the generalization performance of a decision tree, especially in
noisy domains. Another key motivation of pruning is "trading accuracy for simplicity". When
the goal is to produce a sufficiently accurate compact concept description, pruning is highly
useful. Within this process, the initial decision tree is seen as a completely accurate one. Thus
the accuracy of a pruned decision tree indicates how close it is to the initial tree.
There are various techniques for pruning decision trees. Most of them perform top-down or
bottom-up traversal of the nodes. A node is pruned if this operation improves a certain
criterion.
Reduced Error Pruning
A simple procedure for pruning decision trees is known as reduced error pruning. While
traversing over the internal nodes from the bottom to the top, the procedure checks for each
internal node whether replacing it with the most frequent class would reduce the tree's
accuracy. If it would not, the node is pruned. The procedure continues until any further pruning
would decrease the accuracy.
In order to estimate the accuracy, a pruning set is used. It can be shown that this procedure ends
with the smallest accurate subtree with respect to a given pruning set.
Minimum Error Pruning (MEP)
The minimum error pruning performs bottom–up traversal of the internal nodes. In each node
it compares the l-probability error rate estimation with and without pruning. The l-probability
error rate estimation is a correction to the simple probability estimation using frequencies. If
St denotes the instances that have reached a leaf t, then the expected error rate in this leaf is:
\[ \varepsilon'(t) = 1 - \max_{c_i \in dom(y)} \frac{|\sigma_{y=c_i} S_t| + l \cdot p_{apr}(y = c_i)}{|S_t| + l} \]
where papr(y = ci) is the a-priori probability of y getting the value ci, and l denotes the
weight given to the a-priori probability.
The error rate of an internal node is the weighted average of the error rate of its branches. The
weight is determined according to the proportion of instances along each branch. The
calculation is performed recursively up to the leaves. If an internal node is pruned, then it
becomes a leaf and its error rate is calculated directly using the last equation. Consequently,
we can compare the error rate before and after pruning a certain internal node. If pruning this
node does not increase the error rate, the pruning should be accepted.
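A minimal Python sketch of the error estimate and the pruning test is given below. To keep it short, it handles only an internal node whose children are all leaves; a full implementation would apply the weighted average recursively. The class priors, the weight l = 2, and the label lists are made-up values.

```python
from collections import Counter


def leaf_error_estimate(labels, priors, l):
    """l-probability error estimate of a leaf:
    1 - max_c (count_c + l * prior_c) / (n + l)."""
    counts = Counter(labels)
    n = len(labels)
    return 1.0 - max((counts.get(c, 0) + l * p) / (n + l)
                     for c, p in priors.items())


def subtree_error_estimate(children, priors, l):
    """Weighted average of the children's error estimates, with weights
    given by the proportion of instances along each branch."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * leaf_error_estimate(child, priors, l)
               for child in children)


def should_prune(children, priors, l):
    """Accept pruning if turning the node into a leaf does not
    increase the estimated error rate."""
    merged = [label for child in children for label in child]
    return (leaf_error_estimate(merged, priors, l)
            <= subtree_error_estimate(children, priors, l))


PRIORS = {"yes": 0.5, "no": 0.5}   # assumed a-priori class probabilities

# A split that separates nothing: both branches look like the parent.
print(should_prune([["yes"] * 4 + ["no"], ["yes"] * 4 + ["no"]], PRIORS, 2))
# A split that separates the classes perfectly.
print(should_prune([["yes"] * 5, ["no"] * 5], PRIORS, 2))
```

In the first case the merged leaf has an estimated error of 0.25 against about 0.29 for the subtree, so pruning is accepted; in the second the subtree is clearly better and is kept.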
9.7 Lazy Learners (or Learning from Your Neighbours)
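The classification methods discussed so far are eager learners: given a set of training tuples,
they construct a classification model before receiving new tuples to classify.
We can think of the learned model as being ready and eager to classify previously unseen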
tuples. Imagine a contrasting lazy approach, in which the learner instead waits until the last
minute before doing any model construction in order to classify a given test tuple. That is,
when given a training tuple, a lazy learner simply stores it (or does only a little minor
processing) and waits until it is given a test tuple. Only when it sees the test tuple does it
perform generalization in order to classify the tuple based on its similarity to the stored
training tuples. Unlike eager learning methods, lazy learners do less work when a training
tuple is presented and more work when making a classification or prediction. Because lazy
learners store the training tuples or "instances," they are also referred to as instance-based
learners, even though all learning is essentially based on instances.
When making a classification or prediction, lazy learners can be computationally expensive.
They require efficient storage techniques and are well-suited to implementation on parallel
hardware. They offer little explanation or insight into the structure of the data. Lazy learners,
however, naturally support incremental learning. They are able to model complex decision
spaces having hyperpolygonal shapes that may not be as easily describable by other learning
algorithms (such as hyper-rectangular shapes modelled by decision trees). We look at one
example of lazy learners: k-nearest-neighbor classifiers.
k-Nearest-Neighbour Classifiers
The k-nearest-neighbor method was first described in the early 1950s. The method is labor
intensive when given large training sets, and did not gain popularity until the 1960s when
increased computing power became available. It has since been widely used in the area of
pattern recognition.
Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given
test tuple with training tuples that are similar to it. The training tuples are described by n
attributes. Each tuple represents a point in an n-dimensional space. In this way, all of the
training tuples are stored in an n-dimensional pattern space. When given an unknown tuple, a
k-nearest-neighbour classifier searches the pattern space for the k training tuples that are
closest to the unknown tuple. These k training tuples are the k "nearest neighbours" of the
unknown tuple. "Closeness" is defined in terms of a distance metric, such as Euclidean
distance.
The Euclidean distance between two points or tuples, say, X1 = (x11, x12, ..., x1n) and X2 =
(x21, x22, ..., x2n), is
\[ dist(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2} \]
In other words, for each numeric attribute, we take the difference between the corresponding
values of that attribute in tuple X1 and in tuple X2, square this difference, and accumulate it.
The square root is taken of the total accumulated distance count. Typically, we normalize the
values of each attribute before using the above equation. This helps prevent attributes with
initially large ranges (such as income) from outweighing attributes with initially smaller
ranges (such as binary attributes). Min-max normalization, for example, can be used to
transform a value v of a numeric attribute A to v' in the range [0, 1] by computing
\[ v' = \frac{v - min_A}{max_A - min_A}, \]
where minA and maxA are the minimum and maximum values of attribute A.
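Min-max normalization and the Euclidean distance can be sketched in Python as follows. The income values are hypothetical, and mapping a constant attribute to 0.0 is a choice made here to avoid division by zero; the text does not cover that case.

```python
import math


def min_max_normalize(values):
    """Rescale the values of one numeric attribute to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant attribute (assumed handling)
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def euclidean_distance(x1, x2):
    """Square root of the accumulated squared attribute differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))


incomes = [20000, 50000, 80000]       # hypothetical attribute values
print(min_max_normalize(incomes))     # [0.0, 0.5, 1.0]
print(euclidean_distance((0, 0), (3, 4)))   # 5.0
```

Without normalization, a difference of a few thousand in income would swamp any difference in a 0/1 attribute; after rescaling, both contribute on the same scale.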
For k-nearest-neighbor classification, the unknown tuple is assigned the most common class
among its k nearest neighbors. When k = 1, the unknown tuple is assigned the class of the
training tuple that is closest to it in pattern space. Nearest neighbor classifiers can also be
used for prediction, that is, to return a real-valued prediction for a given unknown tuple. In
this case, the classifier returns the average value of the real-valued labels associated with the
k nearest neighbors of the unknown tuple.
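A minimal sketch of both uses, classification by majority vote and prediction by averaging, is given below. The training tuples are hypothetical and assumed to be already normalized. The neighbours are found by a plain linear scan, and ties in the vote are broken by whichever class is encountered first, which is a simplification.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance (ranks neighbours like the distance itself)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_nearest(train, query, k):
    """The k stored (tuple, label) pairs closest to the query tuple."""
    return sorted(train, key=lambda pair: squared_distance(pair[0], query))[:k]


def knn_classify(train, query, k):
    """Most common class among the k nearest neighbours."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]


def knn_predict(train, query, k):
    """Average of the real-valued labels of the k nearest neighbours."""
    neighbours = k_nearest(train, query, k)
    return sum(value for _, value in neighbours) / len(neighbours)


# Hypothetical, already normalized training tuples.
TRAIN = [
    ((0.1, 0.2), "no"), ((0.2, 0.1), "no"),
    ((0.8, 0.9), "yes"), ((0.9, 0.8), "yes"), ((0.85, 0.85), "yes"),
]
TRAIN_NUMERIC = [((0.0,), 1.0), ((1.0,), 3.0), ((10.0,), 100.0)]

print(knn_classify(TRAIN, (0.75, 0.8), 3))
print(knn_predict(TRAIN_NUMERIC, (0.5,), 2))
```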
"But how can distance be computed for attributes that are not numeric, but categorical, such as
color?" The above discussion assumes that the attributes used to describe the tuples are all
numeric. For categorical attributes, a simple method is to compare the corresponding value of
the attribute in tuple X1 with that in tuple X2. If the two are identical (e.g., tuples X1 and X2
both have the color blue), then the difference between the two is taken as 0. If the two are
different (e.g., tuple X1 is blue but tuple X2 is red), then the difference is considered to be 1.
Other methods may incorporate more sophisticated schemes for differential grading (e.g.,
where a larger difference score is assigned, say, for blue and white than for blue and black).
"What about missing values?" In general, if the value of a given attribute A is missing in
tuple X1 and/or in tuple X2, we assume the maximum possible difference. Suppose that each
of the attributes has been mapped to the range [0, 1]. For categorical attributes, we take the
difference value to be 1 if either one or both of the corresponding values of A are missing. If
A is numeric and missing from both tuples X1 and X2, then the difference is also taken to be
1. If only one value is missing and the other (which we'll call v') is present and normalized,
then we can take the difference to be either |1 − v'| or |0 − v'| (i.e., 1 − v' or v'), whichever is
greater.
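The rules for categorical attributes and missing values can be collected into a single per-attribute difference function, sketched below. Representing a missing value as None and passing a numeric flag are conventions chosen for this sketch; numeric values are assumed to be already normalized to [0, 1].

```python
def attribute_difference(v1, v2, numeric):
    """Difference between two values of one attribute, in [0, 1].
    None stands for a missing value (a convention of this sketch)."""
    if numeric:
        if v1 is None and v2 is None:
            return 1.0                          # both missing: maximum difference
        if v1 is None or v2 is None:
            present = v2 if v1 is None else v1
            return max(abs(1.0 - present), abs(0.0 - present))
        return abs(v1 - v2)
    if v1 is None or v2 is None:
        return 1.0                              # categorical, any value missing
    return 0.0 if v1 == v2 else 1.0             # identical: 0, different: 1


print(attribute_difference("blue", "red", numeric=False))   # 1.0
print(attribute_difference(0.25, None, numeric=True))       # 0.75
```

These per-attribute differences can then be squared and summed in place of the plain numeric differences in the Euclidean distance.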
"How can we determine a good value for k, the number of neighbours?" This can be
determined experimentally. Starting with k = 1, we use a test set to estimate the error rate of
the classifier. This process can be repeated each time by incrementing k to allow for one
more neighbor. The k value that gives the minimum error rate may be selected. In general,
the larger the number of training tuples is, the larger the value of k will be (so that
classification and prediction decisions can be based on a larger portion of the stored tuples).
As the number of training tuples approaches infinity and k = 1, the error rate can be no worse
than twice the Bayes error rate (the latter being the theoretical minimum).
If k also approaches infinity, the error rate approaches the Bayes error rate.
Nearest-neighbour classifiers use distance-based comparisons that intrinsically assign equal
weight to each attribute. They therefore can suffer from poor accuracy when given noisy or
irrelevant attributes. The method, however, has been modified to incorporate attribute
weighting and the pruning of noisy data tuples. The choice of a distance metric can be critical.
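The experimental choice of k can be sketched as a loop over candidate values that keeps the one with the lowest error rate on a held-out set. The one-dimensional data below are made up and include one deliberately mislabeled-looking training tuple at 0.28. Keeping the smallest k among equally good candidates is an assumption of this sketch.

```python
from collections import Counter


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_classify(train, query, k):
    nearest = sorted(train, key=lambda pair: squared_distance(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def choose_k(train, held_out, candidate_ks):
    """Return (k, error rate) with the lowest held-out error rate;
    the first (smallest) candidate wins ties."""
    best_k, best_rate = None, None
    for k in candidate_ks:
        errors = sum(1 for x, label in held_out
                     if knn_classify(train, x, k) != label)
        rate = errors / len(held_out)
        if best_rate is None or rate < best_rate:
            best_k, best_rate = k, rate
    return best_k, best_rate


# Hypothetical 1-D data; the tuple at 0.28 acts as noise inside class "a".
TRAIN = [((0.0,), "a"), ((0.1,), "a"), ((0.2,), "a"), ((0.28,), "b"),
         ((0.8,), "b"), ((0.9,), "b"), ((1.0,), "b")]
HELD_OUT = [((0.26,), "a"), ((0.85,), "b")]

print(choose_k(TRAIN, HELD_OUT, [1, 3, 5]))
```

With k = 1 the noisy tuple misleads the classifier on the first held-out tuple; with k = 3 its vote is outnumbered, so the larger k gives the lower error rate.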
Nearest-neighbor classifiers can be extremely slow when classifying test tuples. If D is a
training database of |D| tuples and k = 1, then O(|D|) comparisons are required in order to
classify a given test tuple. By presorting and arranging the stored tuples into search trees, the
number of comparisons can be reduced to O(log |D|). Parallel implementation can reduce the
running time to a constant, that is O(1), which is independent of |D|. Other techniques to
speed up classification time include the use of partial distance calculations and editing the
stored tuples. In the partial distance method, we compute the distance based on a subset of the
n attributes. If this distance exceeds a threshold, then further computation for the given stored
tuple is halted, and the process moves on to the next stored tuple. The editing method
removes training tuples that prove useless. This method is also referred to as pruning or
condensing because it reduces the total number of tuples stored.
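The partial distance method can be illustrated with a short sketch. It assumes squared Euclidean distance and uses the best distance found so far as the threshold; both choices are assumptions made for the example, and the function name is hypothetical.

```python
def nearest_with_partial_distance(stored, query):
    """Return (index, squared distance) of the stored tuple nearest to `query`.

    The distance is accumulated one attribute at a time. As soon as the
    partial sum reaches the best distance found so far (the threshold),
    the remaining attributes of that stored tuple are skipped.
    """
    best_index, best_sq = -1, float("inf")
    for i, tup in enumerate(stored):
        partial = 0.0
        for a, b in zip(tup, query):
            partial += (a - b) ** 2
            if partial >= best_sq:      # cannot beat the current best: halt
                break
        else:                           # loop finished without halting
            best_index, best_sq = i, partial
    return best_index, best_sq


stored = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (0.1, 0.0, 0.1)]
index, sq_dist = nearest_with_partial_distance(stored, (0.1, 0.1, 0.1))
print(index, sq_dist)
```

For the second stored tuple the computation stops after the first attribute, because that attribute alone already exceeds the distance to the first tuple.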
9.8 Summary
• Backpropagation, an abbreviation for "backward propagation of errors", is a common
method of training artificial neural networks. Backpropagation is a neural network
learning algorithm.
• Neurocomputing is computer modeling based, in part, upon simulation of the
structure and function of the brain. Neural networks excel in pattern recognition.
• The backpropagation algorithm performs learning on a multilayer feed-forward neural
network. The inputs correspond to the attributes measured for each training sample.
The inputs are fed simultaneously into a layer of units making up the input layer.
• Backpropagation learns by iteratively processing a set of training samples, comparing
the network's prediction for each sample with the actual known class label. For each
training sample, the weights are modified so as to minimize the mean squared error
between the network's prediction and the actual class.
• Support Vector Machines are a promising new method for the classification of both
linear and nonlinear data.
9.9 Keywords
Support Vector Machines, Back-propagation, Decision trees.
9.10 Exercises
1. Explain classification by Back Propagation.
2. Discuss associative classification in brief.
3. Explain decision trees.
4. What are lazy learners? Explain.
9.11 References
1. Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber,
Morgan Kaufmann Publisher, Second Edition, 2006.
2. Introduction to Data Mining (ISBN: 0321321367) by Pang-Ning Tan, Michael
Steinbach, Vipin Kumar, Addison-Wesley Publisher, 2005.
3. Data Mining Techniques by Arun K Pujari, University Press, Second Edition,
2009.
Unit 10: Genetic Algorithms, Rough Set and Fuzzy Sets
Structure
10.1 Objectives
10.2 Introduction
10.3 Genetic Algorithms
10.4 Rough Set Approach
10.5 Fuzzy set Approach
10.6 Summary
10.7 Keywords
10.8 Exercises
10.9 References
10.1 Objectives
The objectives covered under this unit include basic concepts about:
Genetic Algorithms for data mining
Rough Set Approach based data mining
Fuzzy Set Approach based data mining.
10.2 Introduction
In the computer science field of artificial intelligence, a genetic algorithm (GA) is
a search heuristic that mimics the process of natural selection. This heuristic (also sometimes
called a meta-heuristic) is routinely used to generate useful solutions
to optimization and search problems. Genetic algorithms belong to the larger class
of evolutionary algorithms (EA), which generate solutions to optimization problems using
techniques inspired by natural evolution, such as inheritance, mutation, selection,
and crossover.
In a genetic algorithm, a population of candidate solutions (called individuals, creatures,
or phenotypes) to an optimization problem is evolved toward better solutions. Each candidate
solution has a set of properties (its chromosomes or genotype) which can be mutated and
altered; traditionally, solutions are represented in binary as strings of 0s and 1s, but other
encodings are also possible.
The evolution usually starts from a population of randomly generated individuals, and is
an iterative process, with the population in each iteration called a generation. In each
generation, the fitness of every individual in the population is evaluated; the fitness is usually
the value of the objective function in the optimization problem being solved. The more fit
individuals are stochastically selected from the current population, and each individual's
genome is modified (recombined and possibly randomly mutated) to form a new generation.
The new generation of candidate solutions is then used in the next iteration of the algorithm.
Commonly, the algorithm terminates when either a maximum number of generations has
been produced, or a satisfactory fitness level has been reached for the population.
A rough set, first described by Polish computer scientist Zdzisław I. Pawlak, is a formal
approximation of a crisp set (i.e., conventional set) in terms of a pair of sets which give
the lower and the upper approximation of the original set. In the standard version of rough set
theory, the lower- and upper-approximation sets are crisp sets, but in other variations, the
approximating sets may be fuzzy sets.
The following section contains an overview of the basic framework of rough set theory, as
originally proposed by Zdzisław I. Pawlak, along with some of the key definitions. The initial
and basic theory of rough sets is sometimes referred to as "Pawlak Rough Sets" or "classical
rough sets", as a means to distinguish from more recent extensions and generalizations.
Information system framework
Let $I = (U, A)$ be an information system (attribute-value system), where $U$ is a non-empty,
finite set of objects (the universe) and $A$ is a non-empty, finite set of attributes such
that $a : U \to V_a$ for every $a \in A$. $V_a$ is the set of values that attribute $a$ may take. The
information table assigns a value $a(x)$ from $V_a$ to each attribute $a$ and object $x$ in the
universe $U$.
With any $P \subseteq A$ there is an associated equivalence relation $\mathrm{IND}(P)$:
$$\mathrm{IND}(P) = \{(x, y) \in U^2 \mid \forall a \in P,\ a(x) = a(y)\}$$
The relation $\mathrm{IND}(P)$ is called a $P$-indiscernibility relation. The partition of $U$ is a family of
all equivalence classes of $\mathrm{IND}(P)$ and is denoted by $U/\mathrm{IND}(P)$ (or $U/P$).
If $(x, y) \in \mathrm{IND}(P)$, then $x$ and $y$ are indiscernible (or indistinguishable) by
attributes from $P$.
Definition of a rough set
Let $X \subseteq U$ be a target set that we wish to represent using attribute subset $P$; that is, we
are told that an arbitrary set of objects $X$ comprises a single class, and we wish to express
this class (i.e., this subset) using the equivalence classes induced by attribute subset $P$. In
general, $X$ cannot be expressed exactly, because the set may include and exclude objects
which are indistinguishable on the basis of attributes $P$.
For example, consider a target set $X$ that includes object O3 but excludes objects O7 and O10,
and let the attribute subset $P$ be the full available set of features. If objects O3, O7 and O10
are indiscernible in $[x]_P$, the set $X$ cannot be expressed exactly: there is no way to represent
any set $X$ which includes O3 but excludes objects O7 and O10.
However, the target set $X$ can be approximated using only the information contained
within $P$ by constructing the $P$-lower and $P$-upper approximations of $X$:
$$\underline{P}X = \{x \mid [x]_P \subseteq X\}$$
$$\overline{P}X = \{x \mid [x]_P \cap X \neq \emptyset\}$$
Lower approximation and positive region
The $P$-lower approximation $\underline{P}X$, or positive region, is the union of all equivalence classes
in $[x]_P$ which are contained by (i.e., are subsets of) the target set; in the
example, it is the union of the two equivalence classes in $[x]_P$
which are contained in the target set. The lower approximation is the complete set of objects
in $U/P$ that can be positively (i.e., unambiguously) classified as belonging to target set $X$.
Upper approximation and negative region
The $P$-upper approximation $\overline{P}X$ is the union of all equivalence classes in $[x]_P$ which have
non-empty intersection with the target set; in the example, it is the union of the three
equivalence classes in $[x]_P$ that have non-empty intersection with the target set. The upper
approximation is the complete set of objects in $U/P$ that cannot be positively (i.e.,
unambiguously) classified as belonging to the complement ($U - X$) of the target set $X$. In other
words, the upper approximation is the complete set of objects that are possibly members of
the target set $X$.
The set $U - \overline{P}X$ therefore represents the negative region, containing the set of objects that
can be definitely ruled out as members of the target set.
Boundary region
The boundary region, given by set difference $\overline{P}X - \underline{P}X$, consists of those objects that can
neither be ruled in nor ruled out as members of the target set $X$.
In summary, the lower approximation of a target set is a conservative approximation
consisting of only those objects which can positively be identified as members of the set.
(These objects have no indiscernible "clones" which are excluded by the target set.) The
upper approximation is a liberal approximation which includes all objects that might be
members of target set. (Some objects in the upper approximation may not be members of the
target set.) From the perspective of U/P, the lower approximation contains objects that are
members of the target set with certainty (probability = 1), while the upper approximation
contains objects that are members of the target set with non-zero probability (probability > 0).
The rough set
The tuple $\langle \underline{P}X, \overline{P}X \rangle$ composed of the lower and upper approximation is called a rough
set; thus, a rough set is composed of two crisp sets, one representing a lower boundary of the
target set $X$, and the other representing an upper boundary of the target set $X$.
The accuracy of the rough-set representation of the set $X$ can be given (Pawlak 1991) by the
following:
$$\alpha_P(X) = \frac{|\underline{P}X|}{|\overline{P}X|}$$
That is, the accuracy of the rough set representation of $X$, $\alpha_P(X)$, $0 \le \alpha_P(X) \le 1$, is
the ratio of the number of objects which can positively be placed in $X$ to the number of
objects that can possibly be placed in $X$; this provides a measure of how closely the rough
set is approximating the target set. Clearly, when the upper and lower approximations are
equal (i.e., boundary region empty), then $\alpha_P(X) = 1$, and the approximation is perfect; at
the other extreme, whenever the lower approximation is empty, the accuracy is zero
(regardless of the size of the upper approximation).
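The definitions above translate almost directly into set operations. The sketch below is illustrative only: the information table, its object names and its attribute values are made up, and the function names are hypothetical.

```python
from collections import defaultdict


def partition(table, attrs):
    """Equivalence classes of IND(attrs): objects with equal values on attrs."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj)
    return list(classes.values())


def approximations(table, attrs, target):
    """Return the (lower, upper) approximations of `target` under `attrs`."""
    lower, upper = set(), set()
    for block in partition(table, attrs):
        if block <= target:         # block lies entirely inside the target set
            lower |= block
        if block & target:          # block overlaps the target set
            upper |= block
    return lower, upper


# Made-up information table: six objects described by two attributes.
table = {
    "O1": {"P1": 1, "P2": 2},
    "O2": {"P1": 1, "P2": 2},
    "O3": {"P1": 2, "P2": 0},
    "O4": {"P1": 0, "P2": 1},
    "O5": {"P1": 2, "P2": 0},
    "O6": {"P1": 0, "P2": 0},
}
target = {"O1", "O2", "O3", "O4"}

lower, upper = approximations(table, ["P1", "P2"], target)
boundary = upper - lower
accuracy = len(lower) / len(upper)   # assumes a non-empty upper approximation
print(sorted(lower), sorted(upper), sorted(boundary), accuracy)
```

Here O3 and O5 are indiscernible, so O3 cannot be placed in the target set with certainty; both objects end up in the boundary region and the accuracy drops below 1.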
In mathematics, fuzzy sets are sets whose elements have degrees of membership. Fuzzy sets
were introduced by Lotfi A. Zadeh and Dieter Klaua in 1965 as an extension of the classical
notion of set. At the same time, Salii (1965) defined a more general kind of structure
called L-relations, which he studied in an abstract algebraic context. Fuzzy
relations, which are used now in different areas, such as linguistics, decision-making
and clustering are special cases of L-relations when L is the unit interval [0, 1].
In classical set theory, the membership of elements in a set is assessed in binary terms
according to a bivalent condition — an element either belongs or does not belong to the set.
By contrast, fuzzy set theory permits the gradual assessment of the membership of elements
in a set; this is described with the aid of a membership function valued in the real unit interval
[0, 1]. Fuzzy sets generalize classical sets, since the indicator functions of classical sets are
special cases of the membership functions of fuzzy sets, if the latter only take values 0 or
1. In fuzzy set theory, classical bivalent sets are usually called crisp sets. The fuzzy set theory
can be used in a wide range of domains in which information is incomplete or imprecise,
such as bioinformatics.
Definition
A fuzzy set is a pair $(U, \mu)$ where $U$ is a set and $\mu : U \to [0, 1]$.
For each $x \in U$, the value $\mu(x)$ is called the grade of membership of $x$ in $(U, \mu)$. For a
finite set $U = \{x_1, \dots, x_n\}$, the fuzzy set $(U, \mu)$ is often denoted by
$\{\mu(x_1)/x_1, \dots, \mu(x_n)/x_n\}$.
Let $x \in U$. Then $x$ is called not included in the fuzzy set $(U, \mu)$ if $\mu(x) = 0$, $x$ is
called fully included if $\mu(x) = 1$, and $x$ is called a fuzzy member if $0 < \mu(x) < 1$. The
set $\{x \in U \mid \mu(x) > 0\}$ is called the support of $(U, \mu)$ and the set
$\{x \in U \mid \mu(x) = 1\}$ is called its kernel. The function $\mu$ is called the membership
function of the fuzzy set $(U, \mu)$.
Sometimes, more general variants of the notion of fuzzy set are used, with membership
functions taking values in a (fixed or variable) algebra or structure $L$ of a given kind; usually
it is required that $L$ be at least a poset or lattice. These are usually called L-fuzzy sets, to
distinguish them from those valued over the unit interval. The usual membership functions
with values in [0, 1] are then called [0, 1]-valued membership functions.
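For a finite universe, a [0, 1]-valued fuzzy set can be represented simply as a mapping from elements to grades of membership. The sketch below illustrates support, kernel and fuzzy members; the example set ("warm temperature") and its grades are made up, and the function names are hypothetical.

```python
def support(mu):
    """Elements with a non-zero grade of membership."""
    return {x for x, grade in mu.items() if grade > 0}


def kernel(mu):
    """Elements that are fully included (grade exactly 1)."""
    return {x for x, grade in mu.items() if grade == 1}


def fuzzy_members(mu):
    """Elements whose grade lies strictly between 0 and 1."""
    return {x for x, grade in mu.items() if 0 < grade < 1}


# Made-up membership function for the fuzzy set "warm" over five readings.
warm = {"10C": 0.0, "18C": 0.4, "22C": 1.0, "25C": 1.0, "30C": 0.6}

print(sorted(support(warm)))
print(sorted(kernel(warm)))
print(sorted(fuzzy_members(warm)))
```

A crisp set is the special case in which every grade is 0 or 1, so that the support and the kernel coincide and there are no fuzzy members.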
10.3 Genetic Algorithms
Genetic algorithms are one of the commonly used approaches in data mining. One
genetic algorithm approach to classification problems adopts binary coding, in which an
individual in a population consists of a fixed number of rules that stand for a solution
candidate. The evaluation function considers four important factors: error rate,
entropy measure, rule consistency and whole ratio.
More generally, genetic algorithms are a data mining technique used to winnow relevant data from
large data sets to produce the fittest data or, in the context of a proposed problem, the fittest
solution.
For those of us who are not computer scientists or mathematicians, a genetic algorithm may
best be understood as computer based calculations based on the idea that—as in evolutionary
biology, and genetics—entities in a population will over time evolve through natural
selection to their optimal condition.
Genetic algorithms—or sets of rules--use genetic concepts of reproduction, selection,
inheritance and so forth. If you begin with a large set of data, the application of genetic
algorithms will eventually have them winnowed down to those that are the most "fit." Fitness
will be defined in terms of the particular problem.
Genetic algorithms have been proposed, in the realm of counter terrorism, to:
• Extract the fittest nodes (or connection points) in terrorist networks, in order to
analyze and act on that knowledge;
• Determine the optimal military or other strategy to use in a particular scenario
("fitness" in this case is determined as the ability to resolve a violent conflict scenario);
• Create models of new threat scenarios by 'evolving' the most dangerous scenarios
from component parts (fitness in this case means the ability to survive existing
strategies for their defeat).
Genetic Algorithms (GAs) are adaptive procedures derived from Darwin's principle of
survival of the fittest in natural genetics. A GA maintains a population of potential solutions of
the candidate problem, termed individuals. By manipulation of these individuals through
genetic operators such as selection, crossover and mutation, the GA evolves towards better
solutions over a number of generations. Implementation of a genetic algorithm is shown in a
flowchart in Figure-1.
Figure-1: Flowchart of a genetic algorithm
Genetic algorithms start with a randomly created initial population of individuals, which involves
encoding every variable. A string of variables makes a chromosome or individual. In the
beginning phase of implementation of genetic algorithms in the early seventies, they were applied to
solve continuous optimization problems with binary coding of variables. Binary variables are
mapped to real numbers in numerical problems.
Later, GAs were used to solve many combinatorial optimization problems such as the 0/1
knapsack problem, the travelling salesperson problem, scheduling problems, etc.
Binary coding has not been found suitable to solve many of these problems. Therefore,
codings other than binary have also been utilized. Continuous function optimization uses
real-number coding.
Problems such as the travelling salesperson problem and graph coloring use permutation coding.
Genetic programming applications use tree coding. GAs use a fitness function derived from the
objective function of the optimization problem to evaluate the individuals in a population.
The fitness function is the measure of an individual's fitness, which is used to select individuals
for reproduction. Many real-world problems may not have a well-defined objective
function and require the user to define a fitness function.
The selection method in a GA selects parents from the population on the basis of the fitness of
individuals. High-fitness individuals are selected with higher probability to
reproduce offspring for the next population. Selection methods assign a probability P(x) to
each individual in the population at the current generation, which is proportional to the fitness of
individual x relative to the rest of the population.
Fitness-proportionate selection is the most commonly used selection method. Given $f_i$ as the
fitness of the ith individual, P(x) in fitness-proportionate selection is calculated as:
$$P(x) = \frac{f_x}{\sum_i f_i}$$
After the expected values P(x) are calculated, the individuals are selected using
roulette wheel sampling in the following steps:
1. Let C be the sum of expected values of individuals in the population.
2. Repeat two or more times to select the parents for mating: draw a random number r
between 0 and C, step through the individuals while accumulating their expected values, and
select the individual at which the running sum first exceeds r.
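Roulette wheel sampling can be sketched as follows. The sketch works directly on raw fitness values, which gives the same selection probabilities as working on P(x); it assumes non-negative fitness values with a positive sum, and the function name is hypothetical.

```python
import random


def roulette_wheel_select(fitness, rng):
    """Return the index of one individual, chosen with probability f_x / sum(f_i)."""
    total = sum(fitness)                 # C, the size of the whole wheel
    r = rng.random() * total             # spin: a point in [0, C)
    running = 0.0
    for index, f in enumerate(fitness):
        running += f
        if running > r:                  # the spin landed in this slice
            return index
    return len(fitness) - 1              # guard against floating-point rounding


rng = random.Random(42)
fitness = [1.0, 3.0, 6.0]
parents = [roulette_wheel_select(fitness, rng) for _ in range(2)]
print(parents)
```

Each individual owns a slice of the wheel proportional to its fitness, so an individual with fitness 6 is selected about six times as often as one with fitness 1.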
Fitness-proportionate selection is strongly biased towards the fit individuals in
the population and exerts high selection pressure. It causes premature convergence of the GA, as the
population is made up of highly fit individuals after a few generations and there is no fitness
bias left for the selection procedure to work with. Therefore, other selection methods such as tournament
selection and rank selection are used to avoid this bias.
Tournament selection compares two or more randomly selected individuals and selects the
better individual with a pre-specified probability. Rank selection calculates probability of
selection of individuals on the basis of ranking according to increasing fitness values in a
population.
In a standard genetic algorithm, two parents are selected at a time and are used
to create two new children to take part in the next generation. The offspring are
subject to the crossover operator with a pre-specified probability of crossover.
Single-point crossover is the most common form of this operator. It marks a random
crossover spot within the size of chromosome and exchanges the bits (in binary coding) on
the right of the spot as shown below.
Mutation operator is applied to all the children after crossover. It flips each bit in the
individual with a pre-specified probability of mutation. An example of mutation is given
below where fifth bit has been mutated.
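Both operators are easy to express on bit strings. The following is a minimal sketch assuming binary coding with chromosomes stored as Python strings of '0' and '1'; the function names and the example chromosomes are made up.

```python
import random


def single_point_crossover(parent_a, parent_b, point):
    """Exchange the bits to the right of position `point` between two parents."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def mutate(bits, p_mutation, rng):
    """Flip each bit independently with probability `p_mutation`."""
    result = []
    for bit in bits:
        if rng.random() < p_mutation:
            bit = "1" if bit == "0" else "0"
        result.append(bit)
    return "".join(result)


rng = random.Random(1)
point = rng.randint(1, 7)                      # random crossover spot
child_a, child_b = single_point_crossover("11111111", "00000000", point)
print(point, child_a, child_b)
print(mutate(child_a, 0.01, rng))
```

With a crossover spot of 3, for instance, the parents 11111111 and 00000000 produce the children 11100000 and 00011111.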
The procedure is repeated until the new population has its full number of individuals, which completes
one generation of the genetic algorithm. The GA is run until a stopping criterion is satisfied, which may be
defined in many ways. A pre-specified number of generations is the most used criterion. Other
criteria are the desired quality of solution, the number of generations without any
improvement in the results, etc.
A standard genetic algorithm utilizes three genetic operators: reproduction (selection),
crossover and mutation. Elitism in genetic algorithms is used to ensure that the best
individual in a population is passed on unperturbed by genetic operators to the population at
next generation.
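Putting the pieces together, one possible shape of a complete GA run is sketched below. Everything specific in it is an assumption made for illustration: the toy fitness function (count the 1 bits), tournament selection of size 2 that always keeps the fitter contender, the parameter values, and the function names.

```python
import random


def one_max(bits):
    """Toy fitness function: the number of 1 bits in the chromosome."""
    return sum(bits)


def tournament(population, fitness, rng, size=2):
    """Return the fittest of `size` randomly chosen individuals."""
    contenders = rng.sample(population, size)
    return max(contenders, key=fitness)


def run_ga(n_bits=20, pop_size=30, p_cross=0.9, p_mut=0.01,
           max_generations=200, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    for generation in range(max_generations):
        best = max(population, key=one_max)
        if one_max(best) == n_bits:           # desired solution quality reached
            return best, generation
        next_population = [best[:]]           # elitism: best passes on unchanged
        while len(next_population) < pop_size:
            a = tournament(population, one_max, rng)[:]
            b = tournament(population, one_max, rng)[:]
            if rng.random() < p_cross:        # single-point crossover
                point = rng.randint(1, n_bits - 1)
                a, b = a[:point] + b[point:], b[:point] + a[point:]
            for child in (a, b):
                for i in range(n_bits):       # bit-flip mutation
                    if rng.random() < p_mut:
                        child[i] = 1 - child[i]
                if len(next_population) < pop_size:
                    next_population.append(child)
        population = next_population
    return max(population, key=one_max), max_generations


best, generations = run_ga()
print(one_max(best), generations)
```

Because the best individual is copied into each new population before any operator is applied, the best fitness in the population can never decrease from one generation to the next.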
Values of genetic parameters such as population size, crossover probability, mutation
probability and total number of generations affect the convergence properties of genetic
algorithms. Values of these parameters are generally decided before the start of GA execution on
the basis of previous experience. Experimental studies recommend the following values:
population size of 20 to 30, crossover probability between 0.75 and 0.95,
and mutation probability between 0.005 and 0.01. The parameters may also be fixed by tuning in
trial GA runs before the start of the actual run of the GA.
Deterministic control and adaptation of the parameter values to a particular application have
also been used to determine values of genetic parameters. In deterministic control, the value of a
genetic parameter is altered by some deterministic rule during the GA run. Adaptation of
parameters allows change in their values during the GA run on the basis of the performance of
previous generations in the genetic algorithm. In self-adaptation, the operator settings are
encoded into each individual in the population, which evolves the values of the parameters during the
GA run.
Applications in Data Mining
Data mining has been used to analyze large datasets and establish useful classification and
patterns in the datasets. Agricultural and biological research studies have used various
techniques of data mining including natural trees, statistical machine learning and other
analysis methods. Genetic algorithm has been widely used in data mining applications such
as classification, clustering, feature selection, etc.
Two applications of GA in data mining are described below.
Effectiveness of the classification algorithms - Genetic algorithm, Fuzzy classification and
Fuzzy clustering are compared and analyzed on the collected supervised and unsupervised
soil data.
Soil classification deals with the categorization of soils based on distinguishing
characteristics as well as criteria that dictate choices in use.
Genetic algorithm for feature selection for mining SNPs in association studies. Genomic
studies provide large volumes of data with thousands of single nucleotide polymorphisms
(SNPs). The analysis of SNPs determines relationships between genotypic and phenotypic
information. It helps in the identification of SNPs related to a disease. An approach for predicting
drug effectiveness has been developed that is based on data mining and genetic algorithms.
10.4 Rough Set Approach
Rough set theory is a new mathematical approach to data analysis and data mining. After 15
years of pursuing rough set theory and its applications, the theory has reached a certain degree
of maturity. In recent years we have witnessed a rapid growth of interest in rough set theory and its
applications, worldwide.
Many international workshops, conferences and seminars included rough sets in their
programs. A large number of high quality papers have been published recently on various
aspects of rough sets.
Various real-life applications of rough set theory have shown its usefulness in many
domains. Very promising new areas of application of the rough set concept seem likely to emerge
in the near future. They include rough control, rough databases, rough information retrieval,
rough neural networks and others. Rough set theory can undoubtedly also contribute to
material sciences.
Rough set theory was created by Z. Pawlak at the beginning of the 1980s and is useful
in the process of data mining. It offers mathematical tools for discovering hidden patterns in
data through the identification of partial and total dependencies in data. It also enables
work with null or missing values.
Rough sets can be used separately, but usually they are used together with other methods
such as fuzzy sets, statistical methods, genetic algorithms, etc. Rough set theory takes a
different approach to uncertainty. Like fuzzy set theory, it is embedded in classical set
theory rather than being an alternative to it.
BASIC CONCEPTS
Rough set philosophy is founded on the assumption that with every object of the universe of
discourse we associate some information (data, knowledge). Objects characterized by the
same information are indiscernible (similar) in view of the available information about them.
The indiscernibility relation generated in this way is the mathematical basis of rough set
theory.
Any set of all indiscernible (similar) objects is called an elementary set, and forms a basic
granule (atom) of knowledge about the universe. Any union of some elementary sets is
referred to as a crisp (precise) set; otherwise the set is rough (imprecise, vague).
Obviously rough sets, in contrast to precise sets, cannot be characterized in terms of
information about their elements. In the proposed approach, with any rough set a pair of
precise sets, called the lower and the upper approximation of the rough set, is associated. The
lower approximation consists of all objects which surely belong to the set and the upper
approximation contains all objects which possibly belong to the set.
The difference between the upper and the lower approximation constitutes the boundary
region of the rough set. Approximations are two basic operations used in rough set theory.
Data are often presented as a table, columns of which are labeled by attributes, rows by
objects of interest and entries of the table are attribute values. Such tables are known as
information systems, attribute-value tables, data tables or information tables.
Usually we distinguish in information tables two kinds of attributes, called condition and
decision attributes. Such tables are known as decision tables. Rows of a decision table are
referred to as "if...then..." decision rules, which give conditions necessary to make decisions
specified by the decision attributes. An example of a decision table is shown in Table 1.
The table contains data concerning six cast iron pipes exposed to a high-pressure endurance
test. In the table C, S and P are condition attributes, displaying the percentage content in the
pig-iron of coal, sulfur and phosphorus respectively, whereas the attribute Cracks reveals the
result of the test. The values of condition attributes are as follows: (C, high) > 3.6%, 3.5% ≤
(C, avg.) ≤ 3.6%, (C, low) < 3.5%, (S, high) ≥ 0.1%, (S, low) < 0.1%, (P, high) ≥ 0.3%, (P,
low) < 0.3%.
The main problem we are interested in is how the endurance of the pipes depends on the
compounds C, S and P comprised in the pig-iron, or in other words, whether there is a functional
dependency between the decision attribute Cracks and the condition attributes C, S and P. In
the language of rough set theory this boils down to the question of whether the set {2, 4, 5} of all pipes
having no cracks after the test (or the set {1, 3, 6} of pipes having cracks) can be uniquely
defined in terms of condition attribute values.
It can be easily seen that this is impossible, since pipes 2 and 3 display the same features in
terms of attributes C, S and P, but they have different values of the attribute Cracks. Thus
the information given in Table 1 is not sufficient to solve our problem. However, we can give a
partial solution. Let us observe that if the attribute C has the value high for a certain pipe,
then the pipe has cracks, whereas if the value of the attribute C is low, then the pipe has no
cracks. Hence, employing attributes C, S and P, we can say that pipes 1 and 6 surely have cracks,
i.e., surely belong to the set {1, 3, 6}, whereas pipes 1, 2, 3 and 6 possibly have cracks, i.e.,
possibly belong to the set {1, 3, 6}. Thus the sets {1, 6}, {1, 2, 3, 6} and {2, 3} are the lower
approximation, the upper approximation and the boundary region of the set {1, 3, 6}.
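The approximations for the pipe example can be checked mechanically. The attribute values in the sketch below are assumptions: they are illustrative values chosen to agree with every statement made in the text about the six pipes, and are not a copy of Table 1. The function names are hypothetical.

```python
# Illustrative decision table (assumed values, consistent with the text).
pipes = {
    1: {"C": "high", "S": "high", "P": "low",  "Cracks": "yes"},
    2: {"C": "avg",  "S": "high", "P": "low",  "Cracks": "no"},
    3: {"C": "avg",  "S": "high", "P": "low",  "Cracks": "yes"},
    4: {"C": "low",  "S": "low",  "P": "low",  "Cracks": "no"},
    5: {"C": "avg",  "S": "low",  "P": "high", "Cracks": "no"},
    6: {"C": "high", "S": "low",  "P": "high", "Cracks": "yes"},
}


def elementary_sets(table, attrs):
    """Group pipes that are indiscernible with respect to `attrs`."""
    groups = {}
    for pipe, row in table.items():
        groups.setdefault(tuple(row[a] for a in attrs), set()).add(pipe)
    return list(groups.values())


def lower_upper(table, attrs, target):
    """Lower and upper approximation of `target` built from elementary sets."""
    lower, upper = set(), set()
    for group in elementary_sets(table, attrs):
        if group <= target:
            lower |= group
        if group & target:
            upper |= group
    return lower, upper


cracked = {pipe for pipe, row in pipes.items() if row["Cracks"] == "yes"}
lower, upper = lower_upper(pipes, ["C", "S", "P"], cracked)
print(sorted(lower), sorted(upper), sorted(upper - lower))
```

Pipes 2 and 3 form a single elementary set that only partly overlaps the set of cracked pipes, which is why they make up the boundary region.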
This means that the quality of pipes cannot be determined exactly by the content of coal,
sulfur and phosphorus in the pig-iron, but can be determined only with some approximation.
In fact approximations determine the dependency (total or partial) between condition and
decision attributes, i.e., express functional relationship between values of condition and
decision attributes.
The degree of dependency between condition and decision attributes can be defined as a
consistency factor of the decision table, which is the ratio of the number of non-conflicting decision rules to
the number of all decision rules in the table. By conflicting decision rules we mean rules having the same
conditions but different decisions. For example, the consistency factor for Table 1 is 4/6 =
2/3, hence the degree of dependency between cracks and the composition of the pig-iron is
2/3. That means that four out of six (ca. 67%) pipes can be properly classified as good or not
good on the basis of their composition.
We might also be interested in reducing some of the condition attributes, i.e., in knowing whether
all conditions are necessary to make the decisions specified in a table. To this end we will
employ the notion of a reduct (of condition attributes). By a reduct we understand a minimal
subset of condition attributes which preserves the consistency factor of the table. It is easy to
compute that in Table 1 we have two reducts, {C, S} and {C, P}. The intersection of all reducts is
called the core. In our example the core is the attribute C.
That means that, in view of the data, coal is the most important factor causing cracks and
cannot be eliminated from our considerations, whereas sulfur and phosphorus play a minor
role and can be mutually exchanged as factors causing cracks.
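The consistency factor, the reducts and the core can be computed by brute force for a table this small. As before, the attribute values below are assumptions chosen to agree with the statements in the text rather than a copy of Table 1, and the function names are hypothetical. The exhaustive search over attribute subsets is only practical for a handful of attributes.

```python
from fractions import Fraction
from itertools import combinations

# Illustrative decision table (assumed values, consistent with the text).
pipes = {
    1: {"C": "high", "S": "high", "P": "low",  "Cracks": "yes"},
    2: {"C": "avg",  "S": "high", "P": "low",  "Cracks": "no"},
    3: {"C": "avg",  "S": "high", "P": "low",  "Cracks": "yes"},
    4: {"C": "low",  "S": "low",  "P": "low",  "Cracks": "no"},
    5: {"C": "avg",  "S": "low",  "P": "high", "Cracks": "no"},
    6: {"C": "high", "S": "low",  "P": "high", "Cracks": "yes"},
}
CONDITIONS = ("C", "S", "P")


def consistency_factor(table, attrs, decision="Cracks"):
    """Share of rules whose conditions never occur with a different decision."""
    decisions_seen = {}
    for row in table.values():
        key = tuple(row[a] for a in attrs)
        decisions_seen.setdefault(key, set()).add(row[decision])
    consistent = sum(
        1 for row in table.values()
        if len(decisions_seen[tuple(row[a] for a in attrs)]) == 1
    )
    return Fraction(consistent, len(table))


def reducts(table, conditions):
    """Minimal attribute subsets that preserve the full consistency factor."""
    full = consistency_factor(table, conditions)
    found = []
    for size in range(1, len(conditions) + 1):
        for subset in combinations(conditions, size):
            if any(set(r) <= set(subset) for r in found):
                continue                      # contains a smaller reduct
            if consistency_factor(table, subset) == full:
                found.append(subset)
    return found


all_reducts = reducts(pipes, CONDITIONS)
core = set.intersection(*(set(r) for r in all_reducts))
print(consistency_factor(pipes, CONDITIONS), all_reducts, core)
```

Attribute C alone classifies only three of the six pipes consistently, so it is not a reduct by itself, but it belongs to every reduct and therefore forms the core.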
Now we present the basic concepts more formally.
Suppose we are given two finite, non-empty sets U and A, where U is the universe and A is a
set of attributes. With every attribute $a \in A$ we associate a set $V_a$ of its values, called the domain
of a.
Any subset B of A determines a binary relation I(B) on U, which will be called an
indiscernibility relation, and is defined as follows:
$x\,I(B)\,y$ if and only if $a(x) = a(y)$ for every $a \in B$,
where a(x) denotes the value of attribute a for element x.
Obviously I(B) is an equivalence relation. The family of all equivalence classes of I(B), i.e., the
partition determined by B, will be denoted by U/I(B), or simply U/B; an equivalence class of
I(B), i.e., the block of the partition U/B containing x, will be denoted by B(x).
If (x, y) belongs to I(B) we will say that x and y are B-indiscernible. Equivalence classes of the
relation I(B) (or blocks of the partition U/B) are referred to as B-elementary sets. In the rough
set approach the elementary sets are the basic building blocks of our knowledge about reality.
The indiscernibility relation will be used next to define basic concepts of rough set theory.
Let us now define the following two operations on sets:
$$B_*(X) = \{x \in U : B(x) \subseteq X\}$$
$$B^*(X) = \{x \in U : B(x) \cap X \neq \emptyset\}$$
assigning to every subset X of the universe U two sets $B_*(X)$ and $B^*(X)$, called the B-lower and
the B-upper approximation of X, respectively. The set
$$BN_B(X) = B^*(X) - B_*(X)$$
will be referred to as the B-boundary region of X.
If the boundary region of X is the empty set, i.e., $BN_B(X) = \emptyset$, then the set X is crisp (exact)
with respect to B; in the opposite case, i.e., if $BN_B(X) \neq \emptyset$, the set X is referred to as rough
(inexact) with respect to B.
A rough set can also be characterized numerically by the following coefficient
$$\alpha_B(X) = \frac{|B_*(X)|}{|B^*(X)|}$$
called the accuracy of approximation, where |X| denotes the cardinality of X. Obviously
$0 \le \alpha_B(X) \le 1$. If $\alpha_B(X) = 1$, X is crisp with respect to B (X is precise with respect to B),
and otherwise, if $\alpha_B(X) < 1$, X is rough with respect to B.
Approximations can be employed to define dependencies (total or partial) between attributes,
reduction of attributes, decision rule generation and others, but we will not discuss these issues
here. For details we refer the reader to the references.
APPLICATIONS
Rough set theory has found many interesting applications. The rough set approach seems to
be of fundamental importance to AI and cognitive sciences, especially in the areas of
machine learning, knowledge acquisition, decision analysis, knowledge discovery from
databases, expert systems, inductive reasoning and pattern recognition. It seems of particular
importance to decision support systems and data mining.
The main advantage of rough set theory is that it does not need any preliminary or additional
information about data, like probability in statistics, basic probability assignment in
Dempster-Shafer theory, or grade of membership or the value of possibility in fuzzy set
theory.
The rough set theory has been successfully applied in many real-life problems in medicine,
pharmacology, engineering, banking, financial and market analysis and others. Some
exemplary applications are listed below.
There are many applications in medicine. In pharmacology the analysis of relationships
between the chemical structure and the antimicrobial activity of drugs has been successfully
investigated. Banking applications include evaluation of a bankruptcy risk and market
research.
Very interesting results have also been obtained in speaker-independent speech recognition
and acoustics. The rough set approach also seems important for various engineering
applications, like diagnosis of machines using vibroacoustic symptoms (noise, vibrations)
and process control. Applications in linguistics, environment and databases are other
important domains.
Application of rough sets requires suitable software. Many software systems for
workstations and personal computers based on rough set theory have been developed.
The best known include LERS, Rough DAS, Rough Class and DATALOGIC. Some of
them are available commercially.
The proposed approach has the following advantages:
it provides efficient algorithms for finding hidden patterns in data,
it finds minimal sets of data (data reduction),
it evaluates the significance of data,
it generates sets of decision rules from data,
it is easy to understand,
it offers a straightforward interpretation of the obtained results,
most algorithms based on rough set theory are particularly suited for parallel processing.
10.5 Fuzzy Set Approach
Lotfi Zadeh proposed a completely new, elegant approach to vagueness called fuzzy set
theory.
In his approach an element can belong to a set to a degree k (0 ≤ k ≤ 1), in contrast to classical
set theory, where an element must definitely belong or not belong to a set.
For example, in classical set theory one is definitely either ill or healthy, whereas in fuzzy set
theory we can say that someone is ill (or healthy) to a degree of 60 percent (i.e. to the degree
0.6). Of course, the question immediately arises of where the value of the degree comes from.
This issue has raised a lot of discussion, but we will refrain from considering it here.
Thus the fuzzy membership function can be presented as
$\mu_X(x) \in [0, 1]$
where X is a set and x is an element.
Let us observe that the definition of a fuzzy set involves more advanced mathematical
concepts (real numbers and functions), whereas in classical set theory the notion of a set is
a fundamental notion of the whole of mathematics and is used to derive other
mathematical concepts, e.g. numbers and functions.
Consequently, fuzzy set theory cannot replace classical set theory because, in fact, classical
set theory is needed to define fuzzy sets.
The fuzzy membership function has the following properties.
a) $\mu_{U - X}(x) = 1 - \mu_X(x)$ for any $x \in U$
b) $\mu_{X \cup Y}(x) = \max(\mu_X(x), \mu_Y(x))$ for any $x \in U$
c) $\mu_{X \cap Y}(x) = \min(\mu_X(x), \mu_Y(x))$ for any $x \in U$
This means that the membership of an element in the union and intersection of sets is
uniquely determined by its membership in the constituent sets. This is a very nice property and
allows very simple operations on fuzzy sets, which is a very important feature both
theoretically and practically.
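The three properties above can be sketched in a few lines of Python. This is only an illustration: the universe U, the element names and the membership degrees are made up, and a fuzzy set is assumed to be stored as a dictionary mapping each element of U to its degree.

```python
# Fuzzy sets over a small universe U, stored as {element: membership degree}.
# The names and degrees are hypothetical illustration values.

def complement(mu_x):
    """Property a): membership in U - X is 1 - mu_X(x)."""
    return {x: 1.0 - degree for x, degree in mu_x.items()}

def union(mu_x, mu_y):
    """Property b): membership in X ∪ Y is max(mu_X(x), mu_Y(x))."""
    return {x: max(mu_x[x], mu_y[x]) for x in mu_x}

def intersection(mu_x, mu_y):
    """Property c): membership in X ∩ Y is min(mu_X(x), mu_Y(x))."""
    return {x: min(mu_x[x], mu_y[x]) for x in mu_x}

ill = {"anna": 0.6, "ben": 0.1, "carl": 1.0}
tired = {"anna": 0.3, "ben": 0.8, "carl": 0.5}

healthy = complement(ill)               # anna is healthy to degree 0.4
ill_or_tired = union(ill, tired)
ill_and_tired = intersection(ill, tired)

for person in ill:
    print(person, round(healthy[person], 2),
          ill_or_tired[person], ill_and_tired[person])
```

Each result depends only on the memberships in the constituent sets, which is the property the text describes.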
Fuzzy set theory and its applications have developed very extensively over recent years and
have attracted the attention of practitioners, logicians and philosophers worldwide.
FUZZY INFORMATION
Fuzzy sets are sets whose elements have degrees of membership. Fuzzy sets were introduced
simultaneously by Lotfi A. Zadeh and Dieter Klaua in 1965 as an extension of the classical
notion of set. In classical set theory, the membership of elements in a set is assessed in binary
terms according to a bivalent condition — an element either belongs or does not belong to the
set.
By contrast, fuzzy set theory permits the gradual assessment of the membership of elements
in a set; this is described with the aid of a membership function valued in the real unit interval
[0, 1]. Fuzzy sets generalize classical sets, since the indicator functions of classical sets are
special cases of the membership functions of fuzzy sets, if the latter only take values 0 or 1.
In fuzzy set theory, classical bivalent sets are usually called crisp sets. The fuzzy set theory
can be used in a wide range of domains in which information is incomplete or imprecise,
such as bioinformatics.
FUZZY LOGIC
Fuzzy logic is a form of many-valued logic or probabilistic logic; it deals with reasoning that
is approximate rather than fixed and exact. In contrast with traditional logic theory, where
binary sets have two-valued logic (true or false), fuzzy logic variables may have a truth value
that ranges in degree between 0 and 1.
Fuzzy logic has been extended to handle the concept of partial truth, where the truth value
may range between completely true and completely false. Furthermore, when linguistic
variables are used, these degrees may be managed by specific functions.
Typical Applications of Fuzzy Set Theory
The tools and technologies that have been developed in fuzzy set theory (FST) have the
potential to support all of the steps that comprise a process of model induction or knowledge
discovery. In particular, FST can already be used in the data selection and preparation phase,
e.g., for modeling vague data in terms of fuzzy sets, to "condense" several crisp observations
into a single fuzzy one, or to create fuzzy summaries of the data.
As the data to be analyzed thus becomes fuzzy, one subsequently faces a problem of
analyzing fuzzy data, i.e., of fuzzy data analysis.
The problem of analyzing fuzzy data can be approached in at least two principally different
ways. First, standard methods of data analysis can be extended in a rather generic way by
means of an extension principle, that is, by "fuzzifying" the mapping from data to models.
A second, often more sophisticated approach is based on embedding the data into more
complex mathematical spaces, such as fuzzy metric spaces, and carrying out data analysis in
these spaces.
If fuzzy methods are not used in the data preparation phase, they can still be employed in a
later stage in order to analyze the original data. Thus, it is not the data to be analyzed that is
fuzzy, but rather the methods used for analyzing the data (in the sense of resorting to tools
from FST). Subsequently, we shall focus on this type of fuzzy data analysis.
Fuzzy Cluster Analysis
Many conventional clustering algorithms, such as the prominent k-means algorithm, produce
a clustering structure in which every object is assigned to one cluster in an unequivocal way.
Consequently, the individual clusters are separated by sharp boundaries. In practice, such
boundaries are often not very natural or even counterintuitive. Rather, the boundary of single
clusters and the transition between different clusters are usually "smooth". This is the main
motivation underlying fuzzy extensions to clustering algorithms.
In fuzzy clustering, an object may belong to different clusters at the same time, at least to
some extent, and the degree to which it belongs to a particular cluster is expressed in terms of
a fuzzy membership. The membership functions of the different clusters (defined on the set
of observed data points) are usually assumed to form a partition of unity, i.e. the memberships
of each data point sum to 1 over all clusters. This version, often called probabilistic
clustering, can be generalized further by weakening this constraint as, e.g., in possibilistic
clustering.
Fuzzy clustering has proved to be extremely useful in practice and is now routinely applied
also outside the fuzzy community (e.g., in recent bioinformatics applications).
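To make the idea of graded cluster membership concrete, the sketch below computes membership degrees for one-dimensional points given two fixed cluster centres. The text does not name a particular algorithm; the membership formula used here is the one from fuzzy c-means, chosen as an assumption, and the centres, the points and the fuzzifier m = 2 are hypothetical values.

```python
def fuzzy_memberships(point, centres, m=2.0):
    """Membership of `point` in each cluster (fuzzy c-means style formula).

    u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1)), where d_i is the distance
    from the point to centre i. The degrees always sum to 1.
    """
    dists = [abs(point - c) for c in centres]
    # A point lying exactly on a centre belongs fully to that cluster.
    if 0.0 in dists:
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** exponent for d_j in dists)
            for d_i in dists]

centres = [1.0, 5.0]  # hypothetical cluster centres
for x in [1.0, 2.0, 3.0, 4.5]:
    u = fuzzy_memberships(x, centres)
    print(x, [round(v, 3) for v in u], "sum =", round(sum(u), 3))
```

A point halfway between the centres belongs to both clusters to degree 0.5, whereas a crisp method such as k-means would have to assign it wholly to one of them.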
Learning Fuzzy Rule-Based Systems
The most frequent application of FST in machine learning is the induction or the adaptation
of rule-based models. This is hardly astonishing, since rule-based models have always been a
cornerstone of fuzzy systems and a central aspect of research in the field, not only in
machine learning and data mining (ML&DM) but also in many other subfields, notably
approximate reasoning and fuzzy control.
(Often, the term fuzzy system implicitly refers to a fuzzy rule-based system.)
Fuzzy rule-based systems can represent both classification and regression functions, and
different types of fuzzy models have been used for these purposes. In order to realize a
regression function, a fuzzy system is usually wrapped in a "fuzzifier" and a "defuzzifier": the
former maps a crisp input to a fuzzy one, which is then processed by the fuzzy system, and
the latter maps the (fuzzy) output of the system back to a crisp value. For so-called
Takagi-Sugeno models, which are quite popular for modeling regression functions, the
defuzzification step is unnecessary, since these models output crisp values directly.
In the case of classification learning, the consequent of single rules is usually a class
assignment (i.e. a singleton fuzzy set). Evaluating a rule base thus becomes trivial and
simply amounts to "maximum matching", that is, searching for the maximally supporting rule
for each class. Thus, much of the appealing interpolation and approximation capability of fuzzy
inference is lost, and fuzziness only means that rules can be activated to a certain degree.
There are, however, alternative methods which combine the predictions of several rules into a
classification of the query.
In methods of that kind, the degree of activation of a rule provides important information.
Besides, activation degrees can be very useful, e.g., for characterizing the uncertainty
involved in a classification decision.
Fuzzy Decision Tree Induction
Fuzzy variants of decision tree induction have been developed for quite a while and seem to
remain a topic of interest even today (see for a recent approach and a comprehensive
overview of research in this field). In fact, these approaches provide a typical example of the
"fuzzification" of standard machine learning methods.
In the case of decision trees, it is primarily the "crisp" thresholds used for defining splitting
predicates (constraints), such as size ≤ 181, at inner nodes that have been criticized: such
thresholds lead to hard decision boundaries in the input space, which means that a slight
variation of an attribute (e.g. size = 182 instead of size = 181) can entail a completely
different classification of an object (e.g., of a person characterized by size, weight, gender, ...).
Moreover, the learning process becomes unstable in the sense that a slight variation of the
training examples can change the induced decision tree drastically.
In order to make the decision boundaries "soft", an obvious idea is to apply fuzzy predicates at
the inner nodes of a decision tree, such as size ∈ TALL, where TALL is a fuzzy set (rather
than an interval). In other words, a fuzzy partition instead of a crisp one is used for the
splitting attribute (here size) at an inner node.
Since an example can satisfy a fuzzy predicate to a certain degree, the examples are
partitioned in a fuzzy manner as well. That is, an object is not assigned to exactly one
successor node in a unique way, but perhaps to several successors with a certain degree.
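The contrast between a crisp and a fuzzy split can be sketched as follows. The membership function for TALL is an assumption made for illustration (a linear ramp from 170 to 190); the text does not specify its shape, and the branch names are hypothetical.

```python
def tall(size_cm, low=170.0, high=190.0):
    """Assumed membership function for the fuzzy set TALL (linear ramp)."""
    if size_cm <= low:
        return 0.0
    if size_cm >= high:
        return 1.0
    return (size_cm - low) / (high - low)

def crisp_split(size_cm, threshold=181):
    """Crisp predicate size <= threshold: exactly one successor node."""
    return "left" if size_cm <= threshold else "right"

def fuzzy_split(size_cm):
    """Fuzzy predicate size in TALL: each successor gets a degree."""
    degree = tall(size_cm)
    return {"TALL": degree, "not TALL": 1.0 - degree}

for size in (181, 182):
    print(size, crisp_split(size), fuzzy_split(size))
```

Moving from 181 to 182 flips the crisp branch completely, while the fuzzy degrees change only slightly, which is the "soft" boundary described above.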
10.6 Summary
Genetic algorithms (GA's) are search algorithms that work via the process of natural
selection. They begin with a sample set of potential solutions which then evolves toward a set
of more optimal solutions. Within the sample set, solutions that are poor tend to die out while
better solutions mate and propagate their advantageous traits, thus introducing more solutions
into the set that boast greater potential (the total set size remains constant; for each new
solution added, an old one is removed). A little random mutation helps guarantee that a set
won't stagnate and simply fill up with numerous copies of the same solution.
In general, genetic algorithms tend to work better than traditional optimization algorithms
because they're less likely to be led astray by local optima. This is because they don't make
use of single-point transition rules to move from one single instance in the solution space to
another. Instead, GA's take advantage of an entire set of solutions spread throughout the
solution space, all of which are experimenting upon many potential optima.
However, in order for genetic algorithms to work effectively, a few criteria must be met:
It must be relatively easy to evaluate how "good" a potential solution is relative to
other potential solutions.
It must be possible to break a potential solution into discrete parts that can vary
independently. These parts become the "genes" in the genetic algorithm.
Finally, genetic algorithms are best suited for situations where a "good" answer will
suffice, even if it's not the absolute best answer.
Rough set theory is a new mathematical approach to imperfect knowledge. The problem of
imperfect knowledge has been tackled for a long time by philosophers, logicians and
mathematicians. Recently it became also a crucial issue for computer scientists, particularly
in the area of artificial intelligence.
There are many approaches to the problem of how to understand and manipulate imperfect
knowledge. The most successful one is, no doubt, the fuzzy set theory proposed by Zadeh.
Rough set theory has found many interesting applications. The rough set approach seems to
be of fundamental importance to AI and cognitive sciences, especially in the areas of
machine learning, knowledge acquisition, decision analysis, knowledge discovery from
databases, expert systems, inductive reasoning and pattern recognition.
The areas of fuzzy sets and rough sets have become topics of great research interest,
particularly in the last 20 or so years. The integration or hybridization of such techniques has
also attracted much attention, due mainly to the fact that these distinct approaches to data and
knowledge modeling are complementary when attempting to deal with uncertainty and noise.
A large body of the work on fuzzy-rough set hybridization, however, has tended to focus on
formal aspects of the theory and thus has been framed in that context.
10.7 Keywords
Genetic Algorithms, Set, Rough set, Fuzzy Set.
10.8 Exercises
1. Give an example of a combinatorial problem. What is the most difficult aspect of solving
such problems?
2. Write short notes on genetic algorithms.
3. Write short notes on rough sets.
4. Give the flowchart of a genetic algorithm.
5. Write short notes on fuzzy theory and its applications.
6. Explain fuzzy clustering.
7. Name and describe the main features of Genetic Algorithms (GA).
8. How can rough sets be applied in data mining?
9. Describe the fuzzy methods for rule learning.
10.9 References
1. Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing Slezak,
D., Szczuka, Duentsch, I., Yao, Y. (Eds.)
2. Polkowski, L. (2002). "Rough sets: Mathematical foundations". Advances in Soft
Computing.
3. Dubois, D.; Prade, H. (1990). "Rough fuzzy sets and fuzzy rough sets". International
Journal of General Systems 17 (2–3): 191–209. doi:10.1080/03081079008935107.
4. Didier Dubois, Henri M. Prade, ed. (2000). Fundamentals of fuzzy sets. The
Handbooks of Fuzzy Sets Series 7. Springer. ISBN 978-0-7923-7732-0.
UNIT 11: PREDICTION THEORY OF CLASSIFIERS
Structure
11.1 Objectives
11.2 Introduction
11.3 Estimating the Predictive Accuracy of a Classifier
11.4 Evaluating the Accuracy of a Classifier
11.5 Multiclass Problem
11.6 Summary
11.7 Keywords
11.8 Exercises
11.9 References
11.1 Objectives
The objectives covered under this unit include:
Estimating the Predictive accuracy of a Classifier
Evaluating the accuracy of a Classifier
Multiclass Problem
11.2 Introduction
Classifiers are functions which partition a set into two classes (for example, the set of rainy
days and the set of sunny days). Classifiers appear to be the simplest nontrivial
decision-making elements, so their study often has implications for other learning systems.
Classifiers are sufficiently complex that many phenomena observed in machine learning
(theoretically or experimentally) can be observed in the classification setting. Yet classifiers
are simple enough to make their analysis easy to understand. This combination of sufficient yet
minimal complexity for capturing phenomena makes the study of classifiers especially fruitful.
The multi-class classification problem is an extension of the traditional binary class problem
in which a dataset consists of k classes instead of two. In the binary case, imbalance is said to
exist when one class severely outnumbers the other; extended to multiple classes, the effects
of imbalance are even more problematic. That is, given k classes, there are multiple ways for
class imbalance to manifest itself in the dataset.
One typical way is that there is one "super majority" class which contains most of the
instances in the dataset. Another typical example of class imbalance in multi-class datasets is
the result of a single minority class. In such cases k − 1 classes each make up roughly
1/(k − 1) of the dataset, and the "minority" class makes up the rest.
11.3 Estimating the Predictive Accuracy of a Classifier
Any algorithm which assigns a classification to unseen instances is called a classifier. A
decision tree is one of the very popular types of classifier, but there are several others, some
of which are described elsewhere in this book.
This chapter is concerned with estimating the performance of a classifier of any kind but will
be illustrated using decision trees generated with attribute selection using information gain.
Although data compression can sometimes be important, in practice the principal reason
for generating a classifier is to enable unseen instances to be classified. However, we have
already seen that many different classifiers can be generated from a given dataset. Each one is
likely to perform differently on a set of unseen instances.
The most obvious criterion to use for estimating the performance of a classifier is predictive
accuracy, i.e. the proportion of a set of unseen instances that it correctly classifies. This is
often seen as the most important criterion but other criteria are also important, for example
algorithmic complexity, efficient use of machine resources and comprehensibility.
For most domains of interest the number of possible unseen instances is potentially very large
(e.g. all those who might develop an illness, the weather for every possible day in the future
or all the possible objects that might appear on a radar display), so it is not possible ever to
establish the predictive accuracy beyond dispute. Instead, it is usual to estimate the predictive
accuracy of a classifier by measuring its accuracy for a sample of data not used when it was
generated. There are three main strategies commonly used for this: dividing the data into
a training set and a test set, k-fold cross-validation and N-fold (or leave-one-out)
cross-validation.
Method 1: Separate Training and Test Sets
For the 'train and test' method the available data is split into two parts called a training set
and a test set (Figure 11.1). First, the training set is used to construct a classifier (decision
tree, neural net etc.). The classifier is then used to predict the classification for the instances
in the test set. If the test set contains N instances of which C are correctly classified, the
predictive accuracy of the classifier for the test set is p = C/N. This can be used as an estimate
of its performance on any unseen dataset.
Figure 11.1 Training and Testing
NOTE. For some datasets in the UCI Repository (and elsewhere) the data is provided as two
separate files, designated as the training set and the test set. In such cases we will consider the
two files together as comprising the 'dataset' for that application. In cases where the dataset
is only a single file we need to divide it into a training set and a test set before using Method
1. This may be done in many ways, but a random division into two parts in proportions such
as 1:1, 2:1, 70:30 or 60:40 would be customary.
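Method 1 can be sketched as below: a random 70:30 division followed by the calculation p = C/N on the test set. The dataset is made up, and the "classifier" is a deliberately trivial stand-in (it always predicts the largest training class) rather than the decision tree learner used in this unit; all function names are hypothetical.

```python
import random
from collections import Counter

def train_test_split(instances, test_fraction=0.3, seed=0):
    """Randomly divide instances into (training_set, test_set)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def train_majority_classifier(training_set):
    """Stand-in learner: always predicts the largest training class."""
    majority = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda features: majority

def predictive_accuracy(classifier, test_set):
    """p = C / N, where C is the number of correctly classified instances."""
    correct = sum(1 for features, label in test_set
                  if classifier(features) == label)
    return correct / len(test_set)

# Hypothetical data: 100 instances, 75 labelled "yes" and 25 labelled "no".
dataset = [((i,), "no" if i % 4 == 0 else "yes") for i in range(100)]

training_set, test_set = train_test_split(dataset, test_fraction=0.3)
classifier = train_majority_classifier(training_set)
p = predictive_accuracy(classifier, test_set)
print(f"N = {len(test_set)}, p = {p:.2f}")
```

Only the training set is passed to the learner; the test set is used solely to measure accuracy.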
Standard Error
It is important to bear in mind that the overall aim is not (just) to classify the instances in the
test set but to estimate the predictive accuracy of the classifier for all possible unseen
instances, which will generally be many times the number of instances contained in the test
set.
If the predictive accuracy calculated for the test set is p and we go on to use the classifier to
classify the instances in a different test set, it is very likely that a different value for predictive
accuracy would be obtained. All that we can say is that p is an estimate of the true predictive
accuracy of the classifier for all possible unseen instances.
We cannot determine the true value without collecting all the instances and running the
classifier on them, which is usually an impossible task. Instead, we can use statistical
methods to find a range of values within which the true value of the predictive accuracy lies,
with a given probability or 'confidence level'.
To do this we use the standard error associated with the estimated value p. If p is calculated
using a test set of N instances the value of its standard error is
$\sqrt{p(1 - p)/N}$.
The significance of the standard error is that it enables us to say that with a specified
probability (which we can choose) the true predictive accuracy of the classifier is within so
many standard errors above or below the estimated value p. The more certain we wish to be, the
greater the number of standard errors. The probability is called the confidence level, denoted
by CL, and the number of standard errors is usually written as $Z_{CL}$. Table 11.1 shows the
relationship between commonly used values of CL and $Z_{CL}$.
Table 11.1 Values of $Z_{CL}$ for certain confidence levels
Confidence Level (CL) | 0.9 | 0.95 | 0.99
$Z_{CL}$ | 1.64 | 1.96 | 2.58
If the predictive accuracy for a test set is p, with standard error S, then using this table we can
say that with probability CL (or with a confidence level CL) the true predictive accuracy lies
in the interval $p \pm Z_{CL} \times S$.
Example
If the classifications of 80 instances out of a test set of 100 instances were predicted
accurately, the predictive accuracy on the test set would be 80/100 = 0.8. The standard error
would be $\sqrt{0.8 \times 0.2/100} = \sqrt{0.0016} = 0.04$. We can say that with probability
0.95 the true predictive accuracy lies in the interval 0.8 ± 1.96 × 0.04, i.e. between 0.7216
and 0.8784 (to four decimal places).
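The calculation in this example can be reproduced with a short script. The function names are hypothetical; the Z values are those of Table 11.1.

```python
import math

# Number of standard errors for each confidence level (Table 11.1).
Z_CL = {0.90: 1.64, 0.95: 1.96, 0.99: 2.58}

def standard_error(p, n):
    """Standard error of an accuracy estimate p from n test instances."""
    return math.sqrt(p * (1 - p) / n)

def confidence_interval(correct, n, cl=0.95):
    """Interval p ± Z_CL × S for the true predictive accuracy."""
    p = correct / n
    margin = Z_CL[cl] * standard_error(p, n)
    return p - margin, p + margin

low, high = confidence_interval(80, 100, cl=0.95)
print(round(low, 4), round(high, 4))  # 0.7216 0.8784
```

Choosing cl=0.99 instead widens the interval, since more standard errors are needed for greater certainty.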
Instead of a predictive accuracy of 0.8 (or 80%) we often refer to an error rate of 0.2 (or
20%). The standard error for the error rate is the same as that for predictive accuracy.
The value of CL to use when estimating predictive accuracy is a matter of choice, although it
is usual to choose a value of at least 0.9. The predictive accuracy of a classifier is often
quoted in technical papers as just $p \pm \sqrt{p(1 - p)/N}$ without any multiplier $Z_{CL}$.
Repeated Train and Test:
Here the classifier is used to classify k test sets, not just one. If all the test sets are of the same
size, N, the predictive accuracy values obtained for the k test sets are then averaged to
produce an overall estimate p.
As the total number of instances in the test sets is kN, the standard error of the estimate p
is $\sqrt{p(1 - p)/(kN)}$.
If the test sets are not all of the same size the calculations are slightly more complicated.
If there are $N_i$ instances in the ith test set (1 ≤ i ≤ k) and the predictive accuracy calculated
for the ith test set is $p_i$, the overall predictive accuracy p is
$\sum_{i=1}^{k} p_i N_i / T$, where $\sum_{i=1}^{k} N_i = T$,
i.e. p is the weighted average of the $p_i$ values. The standard error is
$\sqrt{p(1 - p)/T}$.
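A minimal sketch of the unequal-size case follows. The three (accuracy, size) pairs are made-up numbers and the function name is hypothetical.

```python
import math

def pooled_accuracy(results):
    """Combine per-test-set results given as (p_i, N_i) pairs.

    Returns the weighted average p = sum(p_i * N_i) / T, where T = sum(N_i),
    together with its standard error sqrt(p * (1 - p) / T).
    """
    total = sum(n for _, n in results)
    p = sum(p_i * n for p_i, n in results) / total
    return p, math.sqrt(p * (1 - p) / total)

# Hypothetical results for k = 3 test sets of different sizes.
results = [(0.80, 100), (0.90, 50), (0.70, 50)]
p, se = pooled_accuracy(results)
print(f"p = {p:.3f}, standard error = {se:.4f}")
```

When all the test sets have the same size N, the weighted average reduces to the plain average and T equals kN, so the equal-size formula is recovered.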
Method 2: k-fold Cross-validation
An alternative approach to 'train and test' that is often adopted when the number of instances
is small (and which many prefer to use regardless of size) is known as k-fold cross-validation
(Figure 11.2).
If the dataset comprises N instances, these are divided into k equal parts, k typically being a
small number such as 5 or 10. (If N is not exactly divisible by k, the final part will have fewer
instances than the other k − 1 parts.) A series of k runs is now carried out. Each of the k parts
in turn is used as a test set and the other k − 1 parts are used as a training set.
The total number of instances correctly classified (in all k runs combined) is divided by the
total number of instances N to give an overall level of predictive accuracy p, with standard
error $\sqrt{p(1 - p)/N}$.
Figure 11.2 k-fold Cross-validation
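The procedure can be sketched as follows. Two assumptions are made: when N is not divisible by k the remainder is spread over the first parts (a slight variation on the description above, which avoids empty parts), and the learner is a trivial majority-class stand-in rather than a decision tree. The data and all function names are hypothetical.

```python
import math
from collections import Counter

def k_fold_parts(n, k):
    """Split indices 0..n-1 into k consecutive parts of near-equal size."""
    base, extra = divmod(n, k)
    parts, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        parts.append(list(range(start, start + size)))
        start += size
    return parts

def train_majority(training_set):
    """Stand-in learner: remembers the largest class in the training set."""
    return Counter(label for _, label in training_set).most_common(1)[0][0]

def cross_validate(instances, k):
    """Return (p, standard error) from k-fold cross-validation."""
    correct = 0
    for test_idx in k_fold_parts(len(instances), k):
        held_out = set(test_idx)
        training = [inst for i, inst in enumerate(instances)
                    if i not in held_out]
        predicted = train_majority(training)
        correct += sum(1 for i in test_idx if instances[i][1] == predicted)
    n = len(instances)
    p = correct / n
    return p, math.sqrt(p * (1 - p) / n)

# Hypothetical data: 20 instances, 15 labelled "yes" and 5 labelled "no".
data = [((i,), "no" if i % 4 == 0 else "yes") for i in range(20)]
p, se = cross_validate(data, k=5)
print(f"p = {p:.2f} ± {se:.2f}")
```

Every instance is used exactly once for testing, so p is based on all N instances. Setting k equal to the number of instances gives the leave-one-out case.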
Method 3: N-fold Cross-validation
N-fold cross-validation is an extreme case of k-fold cross-validation, often known as
'leave-one-out' cross-validation or jack-knifing, where the dataset is divided into as many parts
as there are instances, each instance effectively forming a test set of one.
N classifiers are generated, each from N − 1 instances, and each is used to classify a single test
instance. The predictive accuracy p is the total number correctly classified divided by the
total number of instances. The standard error is
$\sqrt{p(1 - p)/N}$.
The large amount of computation involved makes N-fold cross-validation unsuitable for use
with large datasets. For other datasets, it is not clear whether any gain in the accuracy of the
estimates produced by using N-fold cross-validation justifies the additional computation
involved. In practice, the method is most likely to be of benefit with very small datasets
where as much data as possible needs to be used to train the classifier.
Experimental Results I
In this section we look at experiments to estimate the predictive accuracy of classifiers
generated for four datasets. All the results in this section were obtained using the TDIDT tree
induction algorithm, with information gain used for attribute selection. Basic information
about the datasets is given in Table 11.2 below. Further information about these and most of
the other datasets mentioned in this book is given in Appendix B.
Table 11.2 Four datasets
Dataset | Description | Classes | Attributes+ (categ) | Attributes+ (cts) | Instances (training set) | Instances (test set)
vote | Voting in US Congress in 1984 | 2 | 16 | | 300 | 135
pima-indians | Prevalence of diabetes in Pima Indian women | 2 | | 8 | 768 |
chess | Chess endgame | 2 | 7 | | 647 |
glass | Glass identification | 7 | | 9* | 214 |
+ categ: categorical; cts: continuous
* Plus one 'ignore' attribute
The vote, pima-indians and glass datasets are all taken from the UCI Repository. The chess
dataset was constructed for a well-known series of machine learning experiments (Quinlan,
1979).
The vote dataset has separate training and test sets. The other three datasets were first divided
into two parts, with every third instance placed in the test set and the other two placed in the
training set in each case.
The result for the vote dataset illustrates the point that TDIDT (along with some but not all
other classification algorithms) is sometimes unable to classify an unseen instance (Table
11.3).
Table 11.3 Train and Test Results for Four Datasets
Dataset | Test set (instances) | Correctly classified | Incorrectly classified | Unclassified
vote | 135 | 126 (93% ± 2%) | 7 | 2
pima-indians | 256 | 191 (75% ± 3%) | 65 |
chess | 215 | 214 (99.5% ± 0.5%) | 1 |
glass | 71 | 50 (70% ± 5%) | 21 |
Unclassified instances can be dealt with by giving the classifier a 'default strategy', such as
always allocating them to the largest class, and that will be the approach followed for the
remainder of this chapter. It could be argued that it might be better to leave unclassified
instances as they are, rather than risk introducing errors by assigning them to a specific class
or classes. In practice the number of unclassified instances is generally small and how they
are handled makes little difference to the overall predictive accuracy.
Table 11.4 gives the 'train and test' result for the vote dataset modified to incorporate the
'default to largest class' strategy. The difference is slight.
Table 11.4 Train and Test Results for vote Dataset (Modified)
Dataset | Test set (instances) | Correctly classified | Incorrectly classified
vote | 135 | 127 (94% ± 2%) | 8
Table 11.5 and Table 11.6 show the results obtained using 10-fold and N-fold
cross-validation for the four datasets.
For the vote dataset the 300 instances in the training set are used. For the other three
datasets all the available instances are used.
Table 11.5 10-fold Cross-validation Results for Four Datasets
Dataset | Instances | Correctly classified | Incorrectly classified
vote | 300 | 275 (92% ± 2%) | 25
pima-indians | 768 | 536 (70% ± 3%) | 232
chess | 647 | 645 (99.7% ± 0.2%) | 2
glass | 214 | 149 (70% ± 3%) | 65
Table 11.6 N-fold Cross-validation Results for Four Datasets
Dataset | Instances | Correctly classified | Incorrectly classified
vote | 300 | 278 (93% ± 2%) | 22
pima-indians | 768 | 517 (67% ± 2%) | 251
chess | 647 | 646 (99.8% ± 0.2%) | 1
glass | 214 | 144 (67% ± 3%) | 70
All the figures given in the tables in this section are estimates. The 10-fold cross-validation
and N-fold cross-validation results for all four datasets are based on considerably more
instances than those in the corresponding test sets for the 'train and test' experiments and so
are more likely to be reliable.
Experimental Results II: Datasets with Missing Values
We now look at experiments to estimate the predictive accuracy of a classifier in the case of
datasets with missing values. As before we will generate all the classifiers using the TDIDT
algorithm, with Information Gain for attribute selection.
Three datasets were used in these experiments, all from the UCI Repository. Basic
information about each one is given in Table 11.7 below.
Table 11.7 Three Datasets with Missing Values
Dataset | Description | Classes | Attributes (categ) | Attributes (cts) | Instances (training set) | Instances (test set)
crx | Credit Card Application | 2 | 9 | 6 | 690 (37) | 200 (12)
hypo | Hypothyroid Disorders | 5 | 22 | 7 | 2514 (2514) | 1258 (371)
labor-ne | Labor Negotiations | 2 | 8 | 8 | 40 (39) | 17 (17)
Each dataset has both a training set and a separate test set. In each case, there are missing
values in both the training set and the test set. The values in parentheses in the 'training set'
and 'test set' columns show the number of instances that have at least one missing value.
The 'train and test' method was used for estimating predictive accuracy.
Strategy 1: Discard Instances
This is the simplest strategy: delete all instances where there is at least one missing value and
use the remainder. This strategy has the advantage of avoiding introducing any data errors. Its
main disadvantage is that discarding data may damage the reliability of the resulting
classifier.
A second disadvantage is that the method cannot be used when a high proportion of the
instances in the training set have missing values, as is the case for example with both the
hypo and the labor-ne datasets. A final disadvantage is that it is not possible with this strategy
to classify any instances in the test set that have missing values.
Together these weaknesses are quite substantial. Although the 'discard instances' strategy
may be worth trying when the proportion of missing values is small, it is not recommended in
general.
Of the three datasets listed in Table 11.7, the 'discard instances' strategy can only be applied
to crx. Doing so gives the possibly surprising result in Table 11.8.
Table 11.8 Discard Instances Strategy with crx Dataset
Dataset | MV strategy | Rules | Test set: correct | Test set: incorrect
crx | Discard instances | 118 | 188 | 0
Clearly discarding the 37 instances with at least one missing value from the training set
(5.4%) does not prevent the algorithm constructing a decision tree capable of classifying the
188 instances in the test set that do not have missing values correctly in every case.
Strategy 2: Replace by Most Frequent/Average Value
With this strategy any missing values of a categorical attribute are replaced by its most
commonly occurring value in the training set. Any missing values of a continuous attribute
are replaced by its average value in the training set. Table 11.9 shows the result of applying
the 'Most Frequent/Average Value' strategy to the crx dataset. As for the 'Discard Instances'
strategy all instances in the test set are correctly classified, but this time all 200 instances in
the test set are classified, not just the 188 instances in the test set that do not have missing
values.
Table 11.9 Comparison of Strategies with crx Dataset
Dataset | MV strategy | Rules | Test set: correct | Test set: incorrect
crx | Discard instances | 118 | 188 | 0
crx | Most Frequent/Average Value | 139 | 200 | 0
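The two strategies for missing values can be sketched as below. This is an illustration under stated assumptions: instances are lists of attribute values, a missing value is represented by None, and the tiny training set and all function names are hypothetical.

```python
from collections import Counter

MISSING = None  # assumed marker for a missing attribute value

def discard_incomplete(rows):
    """Strategy 1: keep only instances with no missing value."""
    return [row for row in rows if MISSING not in row]

def fit_replacements(training_rows, categorical):
    """Strategy 2: choose one replacement value per attribute.

    Categorical attributes use the most frequent training value;
    continuous attributes use the training average.
    """
    replacements = {}
    for col in range(len(training_rows[0])):
        present = [row[col] for row in training_rows
                   if row[col] is not MISSING]
        if col in categorical:
            replacements[col] = Counter(present).most_common(1)[0][0]
        else:
            replacements[col] = sum(present) / len(present)
    return replacements

def fill_missing(rows, replacements):
    """Replace each missing value by the value chosen for its attribute."""
    return [[replacements[c] if v is MISSING else v
             for c, v in enumerate(row)] for row in rows]

# Attribute 0 is categorical, attribute 1 is continuous.
training = [["a", 1.0], ["b", MISSING], ["a", 3.0], [MISSING, 2.0]]
replacements = fit_replacements(training, categorical={0})

print(discard_incomplete(training))                 # [['a', 1.0], ['a', 3.0]]
print(fill_missing([[MISSING, MISSING]], replacements))  # [['a', 2.0]]
```

The replacement values are computed from the training set only and then applied to both training and test instances, which is why every test instance can be classified under Strategy 2.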
With this strategy we can also construct classifiers from the hypo and labor-ne datasets.
In the case of the hypo dataset, we get a decision tree with just 15 rules. The average number
of terms per rule is 4.8. When applied to the test data this tree is able to classify correctly
1251 of the 1258 instances in the test set (99%; Table 11.10). This is a remarkable result with
so few rules, especially as there are missing values in every instance in the training set. It
gives considerable credence to the belief that using entropy for constructing a decision tree is
an effective approach.
Table 11.8 Most Frequent Value/Average Strategy with hypo Dataset
Dataset   MV strategy                   Rules   Test set: Correct   Test set: Incorrect
hypo      Most Frequent/Average Value   15      1251                7
In the case of the labor-ne dataset, we obtain a classifier with five rules, which correctly
classifies 14 out of the 17 instances in the test set (Table 11.9).
Table 11.9 Most Frequent Value/Average Strategy with labor-ne Dataset
Dataset    MV strategy                   Rules   Test set: Correct   Test set: Incorrect
labor-ne   Most Frequent/Average Value   5       14                  3
Missing Classifications
It is worth noting that for each dataset given in Table 7 the missing values are those of
attributes, not classifications. Missing classifications in the training set are a far larger
problem than missing attribute values. One possible approach would be to replace them all by
the most frequently occurring classification but this is unlikely to prove successful in most
cases. The best approach is probably to discard any instances with missing classifications.
Confusion Matrix
As well as the overall predictive accuracy on unseen instances it is often helpful to see a
breakdown of the classifier's performance, i.e. how frequently instances of class X were
correctly classified as class X or misclassified as some other class. This information is given
in a confusion matrix.
The confusion matrix in Table 11.10 gives the results obtained in 'train and test' mode from
the TDIDT algorithm (using information gain for attribute selection) for the vote test set,
which has two possible classifications: 'republican' and 'democrat'.
Table 11.10 Example of a Confusion Matrix
Correct            Classified as
classification     Democrat      Republican
Democrat           81 (97.6%)    2 (2.4%)
Republican         6 (11.5%)     46 (88.5%)
The body of the table has one row and column for each possible classification. The rows
correspond to the correct classifications. The columns correspond to the predicted
classifications.
The value in the ith row and jth column gives the number of instances for which the correct
classification is the ith class which are classified as belonging to the jth class. If all the
instances were correctly classified, the only non-zero entries would be on the 'leading
diagonal' running from top left (i.e. row 1, column 1) down to bottom right.
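As a hedged illustration of the row and column convention, the following sketch builds a confusion matrix from lists of correct and predicted labels. The label lists are synthetic: they are constructed only so that the counts match those of Table 11.10, and the function names are chosen for this example.

```python
def confusion_matrix(actual, predicted, classes):
    """matrix[i][j] = number of instances whose correct class is classes[i]
    and which were classified as classes[j]."""
    index = {c: n for n, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Proportion of instances on the leading diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Synthetic labels reproducing the counts of the vote example.
actual = ["democrat"] * 83 + ["republican"] * 52
predicted = (["democrat"] * 81 + ["republican"] * 2
             + ["democrat"] * 6 + ["republican"] * 46)

m = confusion_matrix(actual, predicted, ["democrat", "republican"])
overall = accuracy(m)  # 127 correct out of 135
```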
To demonstrate that the use of a confusion matrix is not restricted to datasets with two
classifications, Table 11.11 shows the results obtained using 10-fold cross-validation with
the TDIDT algorithm (using information gain for attribute selection) for the glass dataset,
which has six classifications: 1, 2, 3, 5, 6 and 7 (there is also a class 4 but it is not used for the
training data).
Table 11.11 Confusion Matrix for glass Dataset
Correct            Classified as
classification      1    2    3    5    6    7
1                  52   10    7    0    0    1
2                  15   50    6    2    1    2
3                   5    6    6    0    0    0
5                   0    2    0   10    0    1
6                   0    1    0    0    7    1
7                   1    3    0    1    0   24
True and False Positives
When a dataset has only two classes, one is often regarded as 'positive' (i.e. the class of
principal interest) and the other as 'negative'. In this case the entries in the two rows and
columns of the confusion matrix are referred to as true and false positives and true and false
negatives (Table 11.12).
Table 11.12 True and False Positives and Negatives
Correct            Classified as
classification     +                  -
+                  True positives     False negatives
-                  False positives    True negatives
When there are more than two classes, one class is sometimes important enough to be
regarded as positive, with all the other classes combined treated as negative. For example we
might consider class 1 for the glass dataset as the 'positive' class and classes 2, 3, 5, 6 and 7
combined as 'negative'. The confusion matrix given as Table 11.11 can then be rewritten as
shown in Table 11.13.
Of the 73 instances classified as positive, 52 genuinely are positive (true positives) and the
other 21 are really negative (false positives). Of the 141 instances classified as negative, 18
are really positive (false negatives) and the other 123 are genuinely negative (true negatives).
With a perfect classifier there would be no false positives or false negatives.
Table 11.13 Revised Confusion Matrix for glass Dataset
Correct            Classified as
classification       +      -
+                   52     18
-                   21    123
False positives and false negatives may not be of equal importance, e.g. we may be willing to
accept some false positives as long as there are no false negatives or vice versa.
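The collapse of the six-class glass matrix into a two-class matrix can be checked mechanically. The sketch below takes the counts of Table 11.11 and treats one class as positive; the function name `collapse` is an assumption made for this example.

```python
# Rows: correct class, columns: classified as (class order 1, 2, 3, 5, 6, 7).
GLASS = [
    [52, 10, 7, 0, 0, 1],
    [15, 50, 6, 2, 1, 2],
    [5, 6, 6, 0, 0, 0],
    [0, 2, 0, 10, 0, 1],
    [0, 1, 0, 0, 7, 1],
    [1, 3, 0, 1, 0, 24],
]


def collapse(matrix, pos):
    """Treat class index `pos` as positive and all other classes as negative."""
    n = len(matrix)
    tp = matrix[pos][pos]                                  # positive, classified positive
    fn = sum(matrix[pos]) - tp                             # positive, classified negative
    fp = sum(matrix[i][pos] for i in range(n)) - tp        # negative, classified positive
    tn = sum(sum(row) for row in matrix) - tp - fn - fp    # everything else
    return tp, fn, fp, tn


tp, fn, fp, tn = collapse(GLASS, pos=0)  # class 1 is the positive class
```

With class 1 as the positive class this gives 52 true positives, 18 false negatives, 21 false positives and 123 true negatives, in agreement with Table 11.13.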
11.4 Evaluating the Accuracy of a Classifier or Predictor
Holdout, random sub-sampling, cross-validation, and the bootstrap are common techniques
for assessing accuracy based on randomly sampled partitions of the given data. The use of
such techniques to estimate accuracy increases the overall computation time, yet is useful for
model selection.
Figure 11.1 Estimating accuracy with the holdout method
Holdout Method and Random Subsampling
The holdout method is what we have alluded to so far in our discussions about accuracy. In
this method, the given data are randomly partitioned into two independent sets, a training set
and a test set. Typically, two-thirds of the data are allocated to the training set, and the
remaining one-third is allocated to the test set. The training set is used to derive the model,
whose accuracy is estimated with the test set (Figure 11.1). The estimate is pessimistic because
only a portion of the initial data is used to derive the model.
Random sub-sampling is a variation of the holdout method in which the holdout method is
repeated k times. The overall accuracy estimate is taken as the average of the accuracies
obtained from each iteration. (For prediction, we can take the average of the predictor error
rates.)
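A minimal sketch of the holdout method and random sub-sampling follows. The learner is deliberately trivial (it always predicts the majority class of the training set) and is only a hypothetical stand-in for a real classifier; the function names and the toy data are likewise assumptions for illustration.

```python
import random
from collections import Counter


def holdout_split(data, train_fraction=2 / 3, rng=random):
    """Randomly partition data into a training set and a test set."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def random_subsampling(data, evaluate, k, rng=random):
    """Repeat the holdout method k times and average the accuracies."""
    accuracies = []
    for _ in range(k):
        train, test = holdout_split(data, rng=rng)
        accuracies.append(evaluate(train, test))
    return sum(accuracies) / k


def majority_class_accuracy(train, test):
    """Hypothetical stand-in learner: always predict the majority training class."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return sum(label == majority for _, label in test) / len(test)


# Toy data: (attribute value, class label) pairs.
data = [(x, "yes" if x % 3 else "no") for x in range(30)]
rng = random.Random(0)  # fixed seed so that the run is repeatable
train, test = holdout_split(data, rng=rng)
estimate = random_subsampling(data, majority_class_accuracy, k=5, rng=rng)
```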
Cross-validation
In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive
subsets or "folds," D1, D2, . . . , Dk, each of approximately equal size. Training and testing are
performed k times. In iteration i, partition Di is reserved as the test set, and the remaining
partitions are collectively used to train the model. That is, in the first iteration, subsets D2, . . . ,
Dk collectively serve as the training set in order to obtain a first model, which is tested on
D1; the second iteration is trained on subsets D1, D3, . . . , Dk, and tested on D2; and so on.
Unlike the holdout and random subsampling methods above, here, each sample is used the
same number of times for training and once for testing. For classification, the accuracy
estimate is the overall number of correct classifications from the k iterations, divided by the
total number of tuples in the initial data. For prediction, the error estimate can be computed as
the total loss from the k iterations, divided by the total number of initial tuples.
Leave-one-out is a special case of k-fold cross-validation where k is set to the number of
initial tuples. That is, only one sample is "left out" at a time for the test set. In stratified
cross-validation, the folds are stratified so that the class distribution of the tuples in each fold is
approximately the same as that in the initial data.
In general, stratified 10-fold cross-validation is recommended for estimating accuracy (even
if computation power allows using more folds) due to its relatively low bias and variance.
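The k-fold procedure can be sketched as follows. This is an unstratified version, and the learner is again a hypothetical majority-class stand-in; a real experiment would plug in an actual classifier and, as recommended above, stratify the folds. Setting k to the number of tuples gives leave-one-out.

```python
import random
from collections import Counter


def k_fold_indices(n, k, rng=random):
    """Randomly partition the indices 0..n-1 into k folds of nearly equal size."""
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(data, k, count_correct, rng=random):
    """Accuracy = correct classifications over all k iterations / number of tuples."""
    folds = k_fold_indices(len(data), k, rng)
    correct = 0
    for i, test_idx in enumerate(folds):
        # fold i is the test set; all other folds together form the training set
        train = [data[j] for f, fold in enumerate(folds) if f != i for j in fold]
        test = [data[j] for j in test_idx]
        correct += count_correct(train, test)
    return correct / len(data)


def majority_count_correct(train, test):
    """Hypothetical stand-in learner: predict the majority class of the training set."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return sum(label == majority for _, label in test)


data = [(x, "yes" if x % 4 else "no") for x in range(40)]  # 30 'yes', 10 'no'
folds = k_fold_indices(23, 10, random.Random(1))
acc = cross_validate(data, 10, majority_count_correct, random.Random(0))
```

Because 'yes' is the majority class in every possible training set of this toy data, the stand-in learner classifies exactly the 30 'yes' tuples correctly, so the estimate is 30/40.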
Bootstrap
Unlike the accuracy estimation methods mentioned above, the bootstrap method samples the
given training tuples uniformly with replacement. That is, each time a tuple is selected, it is
equally likely to be selected again and re-added to the training set. For instance, imagine a
machine that randomly selects tuples for our training set. In sampling with replacement, the
machine is allowed to select the same tuple more than once.
There are several bootstrap methods. A commonly used one is the .632 bootstrap, which
works as follows. Suppose we are given a data set of d tuples. The data set is sampled d
times, with replacement, resulting in a bootstrap sample or training set of d samples. It is very
likely that some of the original data tuples will occur more than once in this sample. The data
tuples that did not make it into the training set end up forming the test set. Suppose we were
to try this out several times. As it turns out, on average, 63.2% of the original data tuples will
end up in the bootstrap, and the remaining 36.8% will form the test set (hence, the name, .632
bootstrap.)
"Where does the figure, 63.2%, come from?" Each tuple has a probability of 1/d of being
selected, so the probability of not being chosen is $(1 - 1/d)$. We have to select d times, so the
probability that a tuple will not be chosen during this whole time is $(1 - 1/d)^d$. If d is large, the
probability approaches $e^{-1} = 0.368$. Thus, 36.8% of tuples will not be selected for training
and thereby end up in the test set, and the remaining 63.2% will form the training set.
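The limit can be checked numerically. The short sketch below evaluates the probability of never being selected for increasing d and also draws one bootstrap sample to see how many tuples are left for the test set; the function names and the choice d = 10000 are assumptions for illustration.

```python
import math
import random


def prob_never_selected(d):
    """Probability that a given tuple is missed in d draws with replacement."""
    return (1 - 1 / d) ** d


def bootstrap_split(d, rng):
    """Draw d tuple indices with replacement; unselected indices form the test set."""
    train = [rng.randrange(d) for _ in range(d)]
    test = sorted(set(range(d)) - set(train))
    return train, test


for d in (10, 100, 1000, 100000):
    print(d, round(prob_never_selected(d), 4))
print("limit e^-1:", round(math.exp(-1), 4))

train, test = bootstrap_split(10000, random.Random(0))
test_fraction = len(test) / 10000  # expected to be close to 0.368
```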
We can repeat the sampling procedure k times, where in each iteration, we use the current test
set to obtain an accuracy estimate of the model obtained from the current bootstrap sample.
The overall accuracy of the model is then estimated as
$$Acc(M) = \sum_{i=1}^{k} \left( 0.632 \times Acc(M_i)_{test\_set} + 0.368 \times Acc(M_i)_{train\_set} \right) \qquad (1)$$
where $Acc(M_i)_{test\_set}$ is the accuracy of the model obtained with bootstrap sample i when it is
applied to test set i. $Acc(M_i)_{train\_set}$ is the accuracy of the model obtained with bootstrap
sample i when it is applied to the original set of data tuples. The bootstrap method works well
with small data sets.
Figure 11.2 Increasing model accuracy: Bagging and boosting each generate a set of
classification or prediction models M1,M2,. . . Mk. Voting strategies are used to combine the
predictions for a given unknown tuple.
Ensemble Methods—Increasing the Accuracy
Bagging and boosting are two techniques to improve the accuracy (Figure 11.2). They are
examples of ensemble methods, or methods that use a combination of models. Each combines
a series of k learned models (classifiers or predictors), M1, M2, . . ., Mk, with the aim of
creating an improved composite model, M*. Both bagging and boosting can be used for
classification as well as prediction.
Bagging
We first take an intuitive look at how bagging works as a method of increasing accuracy.
For ease of explanation, we will assume at first that our model is a classifier. Suppose that
you are a patient and would like to have a diagnosis made based on your symptoms. Instead
of asking one doctor, you may choose to ask several. If a certain diagnosis occurs more than
any of the others, you may choose this as the final or best diagnosis. That is, the final
diagnosis is made based on a majority vote, where each doctor gets an equal vote. Now
replace each doctor by a classifier, and you have the basic idea behind bagging. Intuitively, a
majority vote made by a large group of doctors may be more reliable than a majority vote
made by a small group.
Given a set, D, of d tuples, bagging works as follows. For iteration i (i = 1, 2, . . ., k), a
training set, Di, of d tuples is sampled with replacement from the original set of tuples, D.
Note that the term bagging stands for bootstrap aggregation. Each training set is a bootstrap
sample, as described in the Bootstrap section above. Because sampling with replacement is used, some
of the original tuples of D may not be included in Di, whereas others may occur more than
once. A classifier model, Mi, is learned for each training set, Di. To classify an unknown
tuple, X, each classifier, Mi, returns its class prediction, which counts as one vote. The
bagged classifier, M*, counts the votes and assigns the class with the most votes to X.
Bagging can be applied to the prediction of continuous values by taking the average value of
each prediction for a given test tuple. The algorithm is summarized as follows.
Algorithm: Bagging. The bagging algorithm creates an ensemble of models (classifiers
or predictors) for a learning scheme where each model gives an equally weighted prediction.
Input:
- D, a set of d training tuples;
- k, the number of models in the ensemble;
- a learning scheme (e.g., decision tree algorithm, backpropagation, etc.)
Output: A composite model, M*.
Method:
for i = 1 to k do // create k models:
    create bootstrap sample, Di, by sampling D with replacement;
    use Di to derive a model, Mi;
end for
To use the composite model on a tuple, X:
if classification then
    let each of the k models classify X and return the majority vote;
if prediction then
    let each of the k models predict a value for X and return the average predicted value;
The bagged classifier often has significantly greater accuracy than a single classifier
derived from D, the original training data. It will not be considerably worse and is more
robust to the effects of noisy data. The increased accuracy occurs because the composite
model reduces the variance of the individual classifiers. For prediction, it was theoretically
proven that a bagged predictor will always have improved accuracy over a single predictor
derived from D.
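The bagging procedure described above can be sketched in Python. The learning scheme is passed in as a function; the one used here, a 1-nearest-neighbour rule on a single numeric attribute, is only a hypothetical stand-in chosen to keep the example self-contained, and the toy data are invented for illustration.

```python
import random
from collections import Counter


def bagging(D, k, learn, rng=random):
    """Create k models, each learned from a bootstrap sample of D."""
    models = []
    for _ in range(k):
        Di = [rng.choice(D) for _ in range(len(D))]  # sample D with replacement
        models.append(learn(Di))
    return models


def classify(models, x):
    """Classification: every model gets one equal vote; return the majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


def predict(models, x):
    """Prediction of a continuous value: return the average predicted value."""
    return sum(model(x) for model in models) / len(models)


def learn_1nn(Di):
    """Hypothetical learning scheme: 1-nearest neighbour on one numeric attribute."""
    def model(x):
        return min(Di, key=lambda t: abs(t[0] - x))[1]
    return model


# Toy data: (attribute value, class label) pairs forming two well-separated groups.
D = [(x, "low") for x in range(10)] + [(x, "high") for x in range(20, 30)]
models = bagging(D, k=11, learn=learn_1nn, rng=random.Random(0))
label_a = classify(models, 2)
label_b = classify(models, 27)
```

An odd value of k is used here only to avoid tied votes in a two-class problem.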
Boosting
We now look at the ensemble method of boosting. As in the previous section, suppose
that as a patient, you have certain symptoms. Instead of consulting one doctor, you choose to
consult several. Suppose you assign weights to the value or worth of each doctor's diagnosis,
based on the accuracies of previous diagnoses they have made. The final diagnosis is then a
combination of the weighted diagnoses. This is the essence behind boosting.
In boosting, weights are assigned to each training tuple. A series of k classifiers is
iteratively learned. After a classifier Mi is learned, the weights are updated to allow the
subsequent classifier, Mi+1, to "pay more attention" to the training tuples that were
misclassified by Mi. The final boosted classifier, M*, combines the votes of each individual
classifier, where the weight of each classifier's vote is a function of its accuracy. The
boosting algorithm can be extended for the prediction of continuous values.
Adaboost is a popular boosting algorithm. Suppose we would like to boost the accuracy
of some learning method. We are given D, a data set of d class-labeled tuples, (X1, y1), (X2,
y2), . . ., (Xd, yd), where yi is the class label of tuple Xi. Initially, Adaboost assigns each
training tuple an equal weight of 1/d. Generating k classifiers for the ensemble requires k
rounds through the rest of the algorithm. In round i, the tuples from D are sampled to form a
training set, Di, of size d. Sampling with replacement is used, so the same tuple may be selected
more than once. Each tuple's chance of being selected is based on its weight. A classifier
model, Mi, is derived from the training tuples of Di. Its error is then calculated using Di as a
test set. The weights of the training tuples are then adjusted according to how they were
classified. If a tuple was incorrectly classified, its weight is increased. If a tuple was correctly
classified, its weight is decreased. A tuple's weight reflects how hard it is to classify: the
higher the weight, the more often it has been misclassified. These weights will be used to
generate the training samples for the classifier of the next round. The basic idea is that when
we build a classifier, we want it to focus more on the misclassified tuples of the previous
round. Some classifiers may be better at classifying some "hard" tuples than others. In this
way, we build a series of classifiers that complement each other. The algorithm is
summarized as follows.
Algorithm: Adaboost. A boosting algorithm that creates an ensemble of classifiers. Each one
gives a weighted vote.
Input:
- D, a set of d class-labeled training tuples;
- k, the number of rounds (one classifier is generated per round);
- a classification learning scheme.
Output: A composite model.
Method:
initialize the weight of each tuple in D to 1/d;
for i = 1 to k do // for each round:
    sample D with replacement according to the tuple weights to obtain Di;
    use training set Di to derive a model, Mi;
    compute error(Mi), the error rate of Mi (Equation 2)
    if error(Mi) > 0.5 then
        reinitialize the weights to 1/d
        go back to the sampling step and try again;
    end if
    for each tuple in Di that was correctly classified do
        multiply the weight of the tuple by error(Mi)/(1 - error(Mi)); // update weights
    normalize the weight of each tuple;
end for
To use the composite model to classify tuple, X:
initialize weight of each class to 0;
for i = 1 to k do // for each classifier:
    wi = log((1 - error(Mi)) / error(Mi)); // weight of the classifier's vote
    c = Mi(X); // get class prediction for X from Mi
    add wi to weight for class c
end for
return the class with the largest weight;
Now, let's look at some of the math that's involved in the algorithm. To compute the
error rate of model Mi, we sum the weights of each of the tuples in Di that Mi misclassified.
That is,
$$error(M_i) = \sum_{j=1}^{d} w_j \times err(X_j) \qquad (2)$$
where err(Xj) is the misclassification error of tuple Xj: if the tuple was misclassified,
then err(Xj) is 1. Otherwise, it is 0. If the performance of classifier Mi is so poor that its error
exceeds 0.5, then we abandon it. Instead, we try again by generating a new Di training set,
from which we derive a new Mi.
The error rate of Mi affects how the weights of the training tuples are updated. If a tuple
in round i was correctly classified, its weight is multiplied by error (Mi)/(1-error(Mi)). Once
the weights of all of the correctly classified tuples are updated, the weights for all tuples
(including the misclassified ones) are normalized so that their sum remains the same as it was
before. To normalize a weight, we multiply it by the sum of the old weights, divided by the
sum of the new weights. As a result, the weights of misclassified tuples are increased and the
weights of correctly classified tuples are decreased, as described above.
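A small worked example may make the update concrete. The sketch below applies one round of the weight update to four tuples of equal weight, one of which is misclassified; the function name and the numbers are assumptions chosen for illustration.

```python
def update_weights(weights, misclassified, error):
    """One round of the Adaboost weight update.

    weights       -- current tuple weights
    misclassified -- parallel list of booleans (True if the tuple was misclassified)
    error         -- error(Mi), assumed to lie strictly between 0 and 0.5
    """
    factor = error / (1 - error)
    # only correctly classified tuples are multiplied by error/(1 - error)
    new = [w if miss else w * factor for w, miss in zip(weights, misclassified)]
    # normalize: multiply by (sum of old weights) / (sum of new weights)
    scale = sum(weights) / sum(new)
    return [w * scale for w in new]


weights = [0.25, 0.25, 0.25, 0.25]
misclassified = [False, False, False, True]
# Equation 2: sum of the weights of the misclassified tuples
error = sum(w for w, miss in zip(weights, misclassified) if miss)
new_weights = update_weights(weights, misclassified, error)
```

Here error(Mi) = 0.25, so the three correctly classified tuples are multiplied by 1/3. After normalization their weights fall from 1/4 to 1/6, while the weight of the misclassified tuple rises from 1/4 to 1/2.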
"Once boosting is complete, how is the ensemble of classifiers used to predict the class
label of a tuple, X?" Unlike bagging, where each classifier was assigned an equal vote,
boosting assigns a weight to each classifier's vote, based on how well the classifier
performed. The lower a classifier's error rate, the more accurate it is, and therefore, the
higher its weight for voting should be. The weight of classifier Mi's vote is
$$\log \frac{1 - error(M_i)}{error(M_i)} \qquad (3)$$
For each class, c, we sum the weights of each classifier that assigned class c to X. The
class with the highest sum is the "winner" and is returned as the class prediction for tuple X.
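The weighted vote can be illustrated with three classifiers. In this sketch the predictions and error rates are invented for the example, and the natural logarithm is assumed; the base of the logarithm scales all votes equally and so does not change the winner.

```python
import math
from collections import defaultdict


def vote_weight(error):
    """Weight of a classifier's vote, log((1 - error) / error)."""
    return math.log((1 - error) / error)


def boosted_classify(predictions, errors):
    """Sum the vote weights per predicted class and return the class with the largest sum."""
    totals = defaultdict(float)
    for c, e in zip(predictions, errors):
        totals[c] += vote_weight(e)
    return max(totals, key=totals.get)


# One accurate classifier says 'a'; two weaker classifiers say 'b'.
winner = boosted_classify(["a", "b", "b"], [0.1, 0.3, 0.4])
# With equal error rates the result reduces to a simple majority vote.
majority = boosted_classify(["a", "b", "b"], [0.3, 0.3, 0.3])
```

In the first call class 'a' wins although it receives only one of the three votes, because log(0.9/0.1) is larger than log(0.7/0.3) + log(0.6/0.4).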
"How does boosting compare with bagging?" Because of the way boosting focuses on
the misclassified tuples, it risks overfitting the resulting composite model to such data.
Therefore, sometimes the resulting "boosted" model may be less accurate than a single model
derived from the same data. Bagging is less susceptible to model overfitting. While both can
significantly improve accuracy in comparison to a single model, boosting tends to achieve
greater accuracy.
Model Selection
Suppose that we have generated two models, M1 and M2 (for either classification or
prediction), from our data. We have performed 10-fold cross-validation to obtain a mean
error rate for each. How can we determine which model is best? It may seem intuitive to
select the model with the lowest error rate; however, the mean error rates are just estimates of
error on the true population of future data cases. There can be considerable variance between
error rates within any given 10-fold cross-validation experiment. Although the mean error
rates obtained for M1 and M2 may appear different, that difference may not be statistically
significant. What if any difference between the two may just be attributed to chance? This
section addresses these questions.
Estimating Confidence Intervals
To determine if there is any "real" difference in the mean error rates of two models, we
need to employ a test of statistical significance. In addition, we would like to obtain some
confidence limits for our mean error rates so that we can make statements like "any observed
mean will not vary by +/- two standard errors 95% of the time for future samples" or "one
model is better than the other by a margin of error of +/- 4%."
What do we need in order to perform the statistical test? Suppose that for each model, we
did 10-fold cross-validation, say, 10 times, each time using a different 10-fold partitioning of
the data. Each partitioning is independently drawn. We can average the 10 error rates
obtained each for M1 and M2, respectively, to obtain the mean error rate for each model. For a
given model, the individual error rates calculated in the cross-validations may be considered
as different, independent samples from a probability distribution. In general, they follow a t
distribution with k-1 degrees of freedom where, here, k = 10. (This distribution looks very
similar to a normal, or Gaussian, distribution even though the functions defining the two are
quite different. Both are unimodal, symmetric, and bell-shaped.) This allows us to do
hypothesis testing where the significance test used is the t-test, or Student's t-test. Our
hypothesis is that the two models are the same, or in other words, that the difference in mean
error rate between the two is zero. If we can reject this hypothesis (referred to as the null
hypothesis), then we can conclude that the difference between the two models is statistically
significant, in which case we can select the model with the lower error rate.
In data mining practice, we may often employ a single test set, that is, the same test set
can be used for both M1 and M2. In such cases, we do a pairwise comparison of the two
models for each 10-fold cross-validation round. That is, for the ith round of 10-fold
cross-validation, the same cross-validation partitioning is used to obtain an error rate for M1 and an
error rate for M2. Let err(M1)i (or err(M2)i) be the error rate of model M1 (or M2) on round i.
The error rates for M1 are averaged to obtain a mean error rate for M1, denoted $\overline{err}(M_1)$.
Similarly, we can obtain $\overline{err}(M_2)$. The variance of the difference between the two models is
denoted var(M1 - M2). The t-test computes the t-statistic with k-1 degrees of freedom for k
samples. In our example we have k = 10 since, here, the k samples are our error rates
obtained from ten 10-fold cross-validations for each model. The t-statistic for pairwise
comparison is computed as follows:
$$t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}{\sqrt{var(M_1 - M_2)/k}} \qquad (4)$$
where
$$var(M_1 - M_2) = \frac{1}{k} \sum_{i=1}^{k} \left[ err(M_1)_i - err(M_2)_i - \left( \overline{err}(M_1) - \overline{err}(M_2) \right) \right]^2 \qquad (5)$$
To determine whether M1 and M2 are significantly different, we compute t and select a
significance level, sig. In practice, a significance level of 5% or 1% is typically used. We then
consult a table for the t distribution, available in standard textbooks on statistics. This table is
usually shown arranged by degrees of freedom as rows and significance levels as columns.
Suppose we want to ascertain whether the difference between M1 and M2 is significantly
different for 95% of the population, that is, sig = 5% or 0.05. We need to find the t
distribution value corresponding to k-1 degrees of freedom (or 9 degrees of freedom for our
example) from the table. However, because the t distribution is symmetric, typically only the
upper percentage points of the distribution are shown. Therefore, we look up the table value
for z = sig/2, which in this case is 0.025, where z is also referred to as a confidence limit. If t
> z or t < -z, then our value of t lies in the rejection region, within the tails of the distribution.
This means that we can reject the null hypothesis that the means of M1 and M2 are the same
and conclude that there is a statistically significant difference between the two models.
Otherwise, if we cannot reject the null hypothesis, we then conclude that any difference
between M1 and M2 can be attributed to chance.
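The pairwise test can be sketched as follows, computing the t-statistic of Equation 4 with the variance of Equation 5 (which divides by k). The ten pairs of error rates are invented for illustration only, and the value 2.262 is the usual tabulated upper 2.5% point of the t distribution with 9 degrees of freedom.

```python
import math


def paired_t(err1, err2):
    """t-statistic for the pairwise comparison of two models over k rounds.

    Assumes the per-round differences are not all identical (non-zero variance).
    """
    k = len(err1)
    mean_diff = sum(err1) / k - sum(err2) / k
    # Equation 5: variance of the per-round differences
    var = sum((a - b - mean_diff) ** 2 for a, b in zip(err1, err2)) / k
    # Equation 4
    return mean_diff / math.sqrt(var / k)


T_CRITICAL = 2.262  # t table, 9 degrees of freedom, sig/2 = 0.025

# Hypothetical error rates from ten rounds, same partitioning for both models.
err_m1 = [0.20, 0.22, 0.19, 0.24, 0.21, 0.23, 0.20, 0.22, 0.21, 0.25]
err_m2 = [0.18, 0.21, 0.19, 0.20, 0.20, 0.21, 0.19, 0.20, 0.18, 0.22]

t = paired_t(err_m1, err_m2)
significant = t > T_CRITICAL or t < -T_CRITICAL
```

For these invented figures the mean difference is 0.019 and t is roughly 5.3, which lies in the rejection region, so the null hypothesis would be rejected at the 5% significance level.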
If two test sets are available instead of a single test set, then a nonpaired version of the
t-test is used, where the variance between the means of the two models is estimated as
$$var(M_1 - M_2) = \frac{var(M_1)}{k_1} + \frac{var(M_2)}{k_2} \qquad (6)$$
and k1 and k2 are the number of cross-validation samples (in our case, 10-fold
cross-validation rounds) used for M1 and M2, respectively. When consulting the table of t
distribution, the number of degrees of freedom used is taken as the minimum number of
degrees of the two models.
ROC Curves
ROC curves are a useful visual tool for comparing two classification models. The name
ROC stands for Receiver Operating Characteristic. ROC curves come from signal detection
theory that was developed during World War II for the analysis of radar images. An ROC
curve shows the trade-off between the true positive rate or sensitivity (proportion of positive
tuples that are correctly identified) and the false-positive rate (proportion of negative tuples
that are incorrectly identified as positive) for a given model. That is, given a two-class
problem, it allows us to visualize the trade-off between the rate at which the model can
accurately recognize 'yes' cases versus the rate at which it mistakenly identifies 'no' cases as
'yes' for different "portions" of the test set. Any increase in the true positive rate occurs at the
cost of an increase in the false-positive rate. The area under the ROC curve is a measure of
the accuracy of the model.
In order to plot an ROC curve for a given classification model, M, the model must be
able to return a probability or ranking for the predicted class of each test tuple. That is, we
need to rank the test tuples in decreasing order, where the one the classifier thinks is most
likely to belong to the positive or 'yes' class appears at the top of the list. Naive Bayesian and
back propagation classifiers are appropriate, whereas others, such as decision tree classifiers,
can easily be modified so as to return a class probability distribution for each prediction. The
vertical axis of an ROC curve represents the true positive rate. The horizontal axis represents
the false-positive rate. An ROC curve for M is plotted as follows. Starting at the bottom
left-hand corner (where the true positive rate and false-positive rate are both 0), we check the
actual class label of the tuple at the top of the list. If we have a true positive (that is, a positive
tuple that was correctly classified), then on the ROC curve, we move up and plot a point. If,
instead, the tuple really belongs to the 'no' class, we have a false positive. On the ROC curve,
we move right and plot a point. This process is repeated for each of the test tuples, each time
moving up on the curve for a true positive or toward the right for a false positive.
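The plotting procedure can be sketched as a function that returns the points of the curve. The scores and labels below are invented for illustration, tied scores are not given any special treatment, and each step is scaled so that both axes run from 0 to 1.

```python
def roc_points(scores, labels):
    """Return the ROC curve as (false-positive rate, true-positive rate) points.

    scores -- estimated probability (or ranking score) of the positive class
    labels -- True for tuples whose actual class is positive
    Assumes at least one positive and one negative tuple.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]  # start at the bottom left-hand corner
    for _, is_positive in ranked:
        if is_positive:
            tp += 1  # true positive: move up
        else:
            fp += 1  # false positive: move right
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical test tuples: three positives and three negatives.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5]
labels = [True, True, False, True, False, False]
points = roc_points(scores, labels)
```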
Figure 11.3 The ROC curves of two classification models
Figure 11.3 shows the ROC curves of two classification models. The plot also shows a
diagonal line where for every true positive of such a model, we are just as likely to encounter
a false positive. Thus, the closer the ROC curve of a model is to the diagonal line, the less
accurate the model. If the model is really good, initially we are more likely to encounter true
positives as we move down the ranked list. Thus, the curve would move steeply up from zero.
Later, as we start to encounter fewer and fewer true positives, and more and more false
positives, the curve eases off and becomes more horizontal.
To assess the accuracy of a model, we can measure the area under the curve. Several
software packages are able to perform such calculations. The closer the area is to 0.5, the less
accurate the corresponding model is. A model with perfect accuracy will have an area of 1.0.
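One simple way to measure the area, assuming the curve is available as a list of points ordered by false-positive rate, is the trapezoidal rule. The sketch below is only one possible calculation, not the method of any particular software package.

```python
def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


diagonal = [(0.0, 0.0), (1.0, 1.0)]             # no better than chance
perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # all positives ranked first

auc_diagonal = area_under_curve(diagonal)
auc_perfect = area_under_curve(perfect)
```

The diagonal line gives an area of 0.5 and the perfect ranking gives 1.0, matching the two reference values quoted above.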
11.5 Multiclass Problem
In machine learning, multiclass or multinomial classification is the problem of
classifying instances into more than two classes.
While some classification algorithms naturally permit the use of more than two classes,
others are by nature binary algorithms; these can, however, be turned into multinomial
classifiers by a variety of strategies.
Multiclass classification should not be confused with multi-label classification, where
multiple classes are to be predicted for each problem instance.
General strategies
One-vs.-all
Among these strategies are the one-vs.-all (or one-vs.-rest, OvA or OvR) strategy, where
a single classifier is trained per class to distinguish that class from all other classes. Prediction
is then performed by predicting using each binary classifier, and choosing the prediction with
the highest confidence score (e.g., the highest probability of a classifier such as naive Bayes).
In pseudocode, the training algorithm for an OvA learner constructed from a binary
classification learner L is as follows:
Inputs:
- L, a learner (training algorithm for binary classifiers)
- samples X
- labels y where yᵢ ∈ {1, … K} is the label for the sample Xᵢ
Output:
- a list of classifiers fk for k ∈ {1, … K}
Procedure:
for each k in {1 … K}:
    Construct a new label vector y' where yᵢ' = 1 where yᵢ = k, 0 (or -1) elsewhere
    Apply L to X, y' to obtain fk
end for
Making decisions proceeds by applying all classifiers to an unseen sample x and
predicting the label k for which the corresponding classifier reports the highest confidence
score:
$$\hat{y} = \arg\max_{k \in \{1, \ldots, K\}} f_k(x)$$
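The one-vs.-all training and decision steps can be sketched in Python. The binary learner used here, which scores a sample by how much closer it lies to the centroid of the positive samples than to that of the negative samples, is a hypothetical stand-in for L, and the six sample points are invented for illustration.

```python
import math


def train_ova(L, X, y, classes):
    """Train one binary classifier per class k (class k against all the rest)."""
    classifiers = {}
    for k in classes:
        y_k = [1 if label == k else 0 for label in y]  # new label vector for class k
        classifiers[k] = L(X, y_k)
    return classifiers


def predict_ova(classifiers, x):
    """Return the label whose classifier reports the highest confidence score."""
    return max(classifiers, key=lambda k: classifiers[k](x))


def centroid_learner(X, y):
    """Hypothetical binary learner L returning a confidence-scoring function."""
    def centroid(points):
        return [sum(c) / len(points) for c in zip(*points)]
    pos = centroid([x for x, t in zip(X, y) if t == 1])
    neg = centroid([x for x, t in zip(X, y) if t == 0])
    # larger score = closer to the positive centroid than to the negative one
    return lambda x: math.dist(x, neg) - math.dist(x, pos)


X = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
y = [1, 1, 2, 2, 3, 3]
fs = train_ova(centroid_learner, X, y, classes=[1, 2, 3])
labels = [predict_ova(fs, p) for p in [(1, 0), (5, 4), (9, 1)]]
```

Any binary learner that returns a comparable confidence score could be substituted for the centroid rule.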
11.6 Summary
We discussed basic prediction theory and its impact on classification success evaluation,
implications for learning algorithm design, and uses in learning algorithm execution. There
are several important aspects of learning which the theory here casts light on. Perhaps the
most important of these is the problem of performance reporting for classifiers. Many people
use some form of empirical variance to estimate upper and lower bounds.
Databases are rich with hidden information that can be used for making intelligent business
decisions. Classification and prediction are two forms of data analysis that can be used to
extract models describing important data classes or to predict future data trends. Whereas
classification predicts categorical labels, prediction models continuous-valued functions. For
example, a prediction model may be built to predict the expenditures of potential
customers on computer equipment given their income and occupation. Many classification
and prediction methods have been proposed by researchers in machine learning, expert
systems, statistics, and neurobiology. Most algorithms are memory resident, typically
assuming a small data size. Recent database mining research has built on such work,
developing scalable classification and prediction techniques capable of handling large
disk-resident data. These techniques often consider parallel and distributed processing.
One of the fundamental problems in data mining classification is that of class
imbalance. In the typical binary class imbalance problem one class (negative class) vastly
outnumbers the other (positive class). The difficulty of learning under such conditions lies in
the inductive bias of most learning algorithms. That is, most learning algorithms, when
presented with a dataset in which there is a severely underrepresented class, ignore the
minority class. This is due to the fact that one can achieve very high accuracy by always
predicting the majority class, especially if the majority class represents 95+% of the dataset.
11.7 Key words
Classification, Multi class problem, Predictive accuracy, Adaboost
11.8 Exercises
1. Explain the multi-class classification problem.
2. Explain how the predictive accuracy of classification methods is estimated.
3. How are ROC curves useful in comparing classification models?
4. Explain the Adaboost algorithm.
5. Explain the holdout method and random sub-sampling.
6. Explain k-fold cross-validation with an example.
7. Explain n-fold cross-validation with an example.
8. Write short notes on the confusion matrix.
11.9 References
1. Quinlan, J.R. "Discovering Rules by Induction from Large Collections of Examples".
In Michie, D. (ed.), Expert Systems in the Microelectronic Age. Edinburgh University
Press, pp. 168–201. 1979.
2. Han, Jiawei, Kamber, Micheline and Pei, Jian. "Data Mining: Concepts and
Techniques". Morgan Kaufmann. 2006.
UNIT 12: ALGORITHMS FOR DATA CLUSTERING
Structure
12.1 Objectives
12.2 Introduction
12.3 Overview of cluster analysis
12.4 Distance measures
12.5 Different Algorithms for data clustering
12.6 Partitional Methods
12.7 Hierarchical methods
12.8 Summary
12.9 Keywords
12.10 Exercises
12.11 References
12.1 Objectives
The objectives covered under this unit include:
Overview of cluster analysis
Types of Data and Computing Distance
Different algorithms for data clustering
Partitional Methods and Hierarchical methods
12.2 Introduction
Cluster analysis is a statistical technique used to identify how various units (people, groups,
or societies) can be grouped together because of characteristics they have in common. It is an
exploratory data analysis tool that aims to sort different objects into groups in such a way that
when they belong to the same group they have a maximal degree of association and when
they do not belong to the same group their degree of association is minimal.
Cluster analysis is typically used in the exploratory phase of research when the researcher
does not have any pre-conceived hypotheses. It is commonly not the only statistical method
used, but rather is done toward the beginning phases of a project to help guide the rest of the
analysis. For this reason, significance testing is usually neither relevant nor appropriate.
12.3 Overview of Cluster Analysis
Cluster analysis or clustering is the task of grouping a set of objects in such a way that
objects in the same group (called a cluster) are more similar (in some sense or another) to
each other than to those in other groups (clusters). It is a main task of exploratory data
mining, and a common technique for statistical data analysis, used in many fields,
including machine learning, pattern recognition, image analysis, information retrieval, and
bioinformatics.
The term cluster analysis (first used by Tryon, 1939) encompasses a number of
different algorithms and methods for grouping objects of similar kind into respective
categories. A general question facing researchers in many areas of inquiry is how to organize
observed data into meaningful structures, that is, to develop taxonomies. In other words
cluster analysis is an exploratory data analysis tool which aims at sorting different objects
into groups in a way that the degree of association between two objects is maximal if they
belong to the same group and minimal otherwise. Given the above, cluster analysis can be
used to discover structures in data without providing an explanation / interpretation. In other
words, cluster analysis simply discovers structures in data without explaining why they exist.
We deal with clustering in almost every aspect of daily life. For example, a group of diners
sharing the same table in a restaurant may be regarded as a cluster of people. In food stores
items of similar nature, such as different types of meat or vegetables are displayed in the
same or nearby locations.
Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be
achieved by various algorithms that differ significantly in their notion of what constitutes a
cluster and how to efficiently find them. Popular notions of clusters include groups with
small distances among the cluster members, dense areas of the data space, intervals or
particular statistical distributions. Clustering can therefore be formulated as a multi-objective
optimization problem. The appropriate clustering algorithm and parameter settings (including
values such as the distance function to use, a density threshold or the number of expected
clusters) depend on the individual data set and intended use of the results. Cluster analysis as
such is not an automatic task, but an iterative process of knowledge discovery or interactive
multi-objective optimization that involves trial and failure. It will often be necessary to
modify data pre-processing and model parameters until the result achieves the desired
properties.
12.4 Distance Measures
The joining or tree clustering method uses the dissimilarities or distances between objects
when forming the clusters. These distances can be based on a single dimension or multiple
dimensions. For example, if we were to cluster fast foods, we could take into account the
number of calories they contain, their price, subjective ratings of taste, etc. The most
straightforward way of computing distances between objects in a multi-dimensional space is
to compute Euclidean distances. If we had a two- or three-dimensional space this measure is
the actual geometric distance between objects in the space (i.e., as if measured with a ruler).
However, the joining algorithm does not "care" whether the distances that are "fed" to it are
actual real distances, or some other derived measure of distance that is more meaningful to
the researcher; and it is up to the researcher to select the right method for his/her specific
application.
Euclidean distance. This is probably the most commonly chosen type of distance. It simply
is the geometric distance in the multidimensional space. It is computed as:
$$\mathrm{distance}(x, y) = \Big\{ \sum_i (x_i - y_i)^2 \Big\}^{1/2}$$
Note that Euclidean (and squared Euclidean) distances are usually computed from raw data,
and not from standardized data. This method has certain advantages (e.g., the distance
between any two objects is not affected by the addition of new objects to the analysis, which
may be outliers). However, the distances can be greatly affected by differences in scale
among the dimensions from which the distances are computed. For example, if one of the
dimensions denotes a measured length in centimeters, and you then convert it to millimeters
(by multiplying the values by 10), the resulting Euclidean or squared Euclidean distances
(computed from multiple dimensions) can be greatly affected, and consequently, the results
of cluster analyses may be very different.
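The scale effect described above can be made concrete with a small Python sketch. The three objects and their (length, weight) values are hypothetical and chosen only for illustration; `math.dist` (Python 3.8 or later) computes the Euclidean distance.

```python
import math

# Hypothetical objects described by (length in cm, weight); values are illustrative only.
objects_cm = {"a": (10.0, 1.0), "b": (11.0, 5.0), "c": (14.0, 1.0)}

def nearest_to_a(objects):
    """Return the name of the object closest to 'a' under Euclidean distance."""
    others = [name for name in objects if name != "a"]
    return min(others, key=lambda name: math.dist(objects["a"], objects[name]))

# The same objects, with the first dimension converted from centimeters to millimeters.
objects_mm = {name: (length * 10.0, weight)
              for name, (length, weight) in objects_cm.items()}

print(nearest_to_a(objects_cm))  # c
print(nearest_to_a(objects_mm))  # b
```

Only the unit of one dimension changed, yet the nearest neighbor of object a is different, so a clustering built on these distances may differ as well.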
Squared Euclidean distance. One may want to square the standard Euclidean distance in
order to place progressively greater weight on objects that are further apart. This distance is
computed as (see also the note in the previous paragraph):
$$\mathrm{distance}(x, y) = \sum_i (x_i - y_i)^2$$
City-block (Manhattan) distance. This distance is simply the sum of the absolute differences across
dimensions. In most cases, this distance measure yields results similar to the simple
Euclidean distance. However, note that in this measure, the effect of single large differences
(outliers) is dampened (since they are not squared). The city-block distance is computed as:
$$\mathrm{distance}(x, y) = \sum_i |x_i - y_i|$$
Chebychev distance. This distance measure may be appropriate in cases when one wants to
define two objects as "different" if they are different on any one of the dimensions. The
Chebychev distance is computed as:
$$\mathrm{distance}(x, y) = \max_i |x_i - y_i|$$
Power distance. Sometimes one may want to increase or decrease the progressive weight
that is placed on dimensions on which the respective objects are very different. This can be
accomplished via the power distance. The power distance is computed as:
$$\mathrm{distance}(x, y) = \Big( \sum_i |x_i - y_i|^p \Big)^{1/r}$$
Where r and p are user-defined parameters. A few example calculations may demonstrate
how this measure "behaves." Parameter p controls the progressive weight that is placed on
differences on individual dimensions; parameter r controls the progressive weight that is
placed on larger differences between objects. If r and p are equal to 2, then this distance is
equal to the Euclidean distance.
Percent disagreement. This measure is particularly useful if the data for the dimensions
included in the analysis are categorical in nature. This distance is computed as:
$$\mathrm{distance}(x, y) = \frac{\text{number of dimensions } i \text{ with } x_i \neq y_i}{\text{total number of dimensions}}$$
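The distance measures of this section can be written as short Python functions. This is a minimal sketch that assumes each object is a sequence of equal length; the function names and the sample objects are illustrative choices, not part of the original text.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def city_block(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def chebychev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def power_distance(x, y, p, r):
    # p weights differences on individual dimensions, r weights the overall sum.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / r)

def percent_disagreement(x, y):
    # Fraction of dimensions on which two categorical objects differ.
    return sum(1 for a, b in zip(x, y) if a != b) / len(x)

x, y = (0.0, 0.0), (3.0, 4.0)
print(euclidean(x, y))             # 5.0
print(squared_euclidean(x, y))     # 25.0
print(city_block(x, y))            # 7.0
print(chebychev(x, y))             # 4.0
print(power_distance(x, y, 2, 2))  # equals the Euclidean distance when p = r = 2
print(percent_disagreement(("red", "small", "round"), ("red", "large", "round")))
```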
Amalgamation or Linkage Rules
At the first step, when each object represents its own cluster, the distances between those
objects are defined by the chosen distance measure. However, once several objects have been
linked together, how do we determine the distances between those new clusters? In other
words, we need a linkage or amalgamation rule to determine when two clusters are
sufficiently similar to be linked together. There are various possibilities: for example, we
could link two clusters together when any two objects in the two clusters are closer together
than the respective linkage distance. Put another way, we use the "nearest neighbors" across
clusters to determine the distances between clusters; this method is called single linkage. This
rule produces "stringy" types of clusters, that is, clusters "chained together" by only single
objects that happen to be close together. Alternatively, we may use the neighbors across
clusters that are furthest away from each other; this method is called complete linkage. There
are numerous other linkage rules such as these that have been proposed.
Single linkage (nearest neighbor). As described above, in this method the distance between
two clusters is determined by the distance of the two closest objects (nearest neighbors) in the
different clusters. This rule will, in a sense, string objects together to form clusters, and the
resulting clusters tend to represent long "chains."
Complete linkage (furthest neighbor). In this method, the distances between clusters are
determined by the greatest distance between any two objects in the different clusters (i.e., by
the "furthest neighbors"). This method usually performs quite well in cases when the objects
actually form naturally distinct "clumps". If the clusters tend to be somehow elongated or of a
"chain" type nature, then this method is inappropriate.
Unweighted pair-group average. In this method, the distance between two clusters is
calculated as the average distance between all pairs of objects in the two different clusters.
This method is also very efficient when the objects form natural distinct "clumps"; however,
it performs equally well with elongated, "chain" type clusters. Note that in their book, Sneath
and Sokal (1973) introduced the abbreviation UPGMA to refer to this method as unweighted
pair-group method using arithmetic averages.
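The single linkage, complete linkage and unweighted pair-group average rules described above differ only in how they combine the pairwise distances between two clusters. The following Python sketch shows this under the assumption that Euclidean distance is the chosen measure; the two example clusters are hypothetical.

```python
import math
from itertools import product

def single_linkage(A, B):
    # Distance of the two closest objects (nearest neighbors).
    return min(math.dist(a, b) for a, b in product(A, B))

def complete_linkage(A, B):
    # Distance of the two objects that are furthest apart (furthest neighbors).
    return max(math.dist(a, b) for a, b in product(A, B))

def average_linkage(A, B):
    # Average distance over all pairs of objects, one from each cluster (UPGMA).
    return sum(math.dist(a, b) for a, b in product(A, B)) / (len(A) * len(B))

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(3.0, 0.0), (5.0, 0.0)]

print(single_linkage(A, B))    # 2.0
print(complete_linkage(A, B))  # 5.0
print(average_linkage(A, B))   # 3.5
```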
Weighted pair-group average. This method is identical to the unweighted pair-group
average method, except that in the computations, the size of the respective clusters (i.e., the
number of objects contained in them) is used as a weight. Thus, this method (rather than the
previous method) should be used when the cluster sizes are suspected to be greatly uneven.
Note that in their book, Sneath and Sokal (1973) introduced the abbreviation WPGMA to
refer to this method as weighted pair-group method using arithmetic averages.
Unweighted pair-group centroid. The centroid of a cluster is the average point in the
multidimensional space defined by the dimensions. In a sense, it is the center of gravity for
the respective cluster. In this method, the distance between two clusters is determined as the
difference between centroids. Sneath and Sokal (1973) use the abbreviation UPGMC to refer
to this method as unweighted pair-group method using the centroid average.
Weighted pair-group centroid (median). This method is identical to the previous one,
except that weighting is introduced into the computations to take into consideration
differences in cluster sizes (i.e., the number of objects contained in them). Thus, when there
are (or one suspects there to be) considerable differences in cluster sizes, this method is
preferable to the previous one. Sneath and Sokal (1973) use the abbreviation WPGMC to refer
to this method as weighted pair-group method using the centroid average.
Ward's method. This method is distinct from all other methods because it uses an analysis of
variance approach to evaluate the distances between clusters. In short, this method attempts
to minimize the Sum of Squares (SS) of any two (hypothetical) clusters that can be formed at
each step. Refer to Ward (1963) for details concerning this method. In general, this method is
regarded as very efficient; however, it tends to create clusters of small size.
12.5 Different Algorithms for Data Clustering
Affinity propagation
In statistics and data mining, affinity propagation (AP) is a clustering algorithm based on
the concept of "message passing" between data points. Unlike clustering algorithms such
as k-means or k-medoids, AP does not require the number of clusters to be determined or
estimated before running the algorithm. Like k-medoids, AP finds "exemplars", members of
the input set that are representative of clusters.
BIRCH (data clustering)
BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data
mining algorithm used to perform hierarchical clustering over particularly large data sets. An
advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multidimensional metric data points in an attempt to produce the best quality clustering for a given
set of resources (memory and time constraints). In most cases, BIRCH requires only a single
scan of the database. In addition, BIRCH is recognized as the "first clustering algorithm
proposed in the database area to handle 'noise' (data points that are not part of the underlying
pattern) effectively".
Canopy clustering algorithm
The canopy clustering algorithm is an unsupervised pre-clustering algorithm, often used as a
pre-processing step for the K-means algorithm or the hierarchical clustering algorithm. It is
intended to speed up clustering operations on large data sets, where using another algorithm
directly may be impractical due to the size of the data set.
The algorithm proceeds as follows:
1. Cheaply partition the data into overlapping subsets (called "canopies").
2. Perform more expensive clustering, but only within these canopies.
Since the algorithm uses distance functions and requires the specification of distance
thresholds, its applicability for high-dimensional data is limited by the curse of
dimensionality. The produced canopies preserve the clusters produced by K-means only when
a cheap, approximate, low-dimensional distance function is available.
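The first step, forming the canopies, can be sketched in Python as follows. The sketch follows the commonly described two-threshold scheme: a loose threshold `t1` and a tight threshold `t2` with `t1 > t2`. These names, the example points and the absolute-difference distance are assumptions made for illustration; the text above only states that distance thresholds are required.

```python
def canopies(points, t1, t2, cheap_distance):
    """Group points into (possibly overlapping) canopies using a cheap distance."""
    assert t1 > t2, "the loose threshold must exceed the tight threshold"
    remaining = list(points)
    result = []
    while remaining:
        center = remaining[0]
        # Every remaining point within the loose threshold joins this canopy.
        members = [p for p in remaining if cheap_distance(center, p) <= t1]
        result.append((center, members))
        # Points within the tight threshold are removed and cannot start or
        # join another canopy; the others may appear in several canopies.
        remaining = [p for p in remaining if cheap_distance(center, p) > t2]
    return result

points = [0.0, 1.0, 2.0, 8.0, 9.0]
found = canopies(points, t1=3.0, t2=1.5, cheap_distance=lambda a, b: abs(a - b))
for center, members in found:
    print(center, members)
```

In this example the point 2.0 belongs to two canopies, which shows the overlap. The second step (running a more expensive algorithm such as K-means within the canopies) is not shown.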
Benefits
The number of instances of training data that must be compared at each step is reduced.
There is some evidence that the resulting clusters are improved.
Cobweb (clustering)
COBWEB is an incremental system for hierarchical conceptual clustering. COBWEB was
invented by Professor Douglas H. Fisher, currently at Vanderbilt University.
COBWEB incrementally organizes observations into a classification tree. Each node in a
classification tree represents a class (concept) and is labeled by a probabilistic concept that
summarizes the attribute-value distributions of objects classified under the node. This
classification tree can be used to predict missing attributes or the class of a new object.
There are four basic operations COBWEB employs in building the classification tree. Which
operation is selected depends on the category utility of the classification achieved by
applying it. The operations are:
Merging two nodes: Merging two nodes means replacing them by a node whose children are the
union of the original nodes' sets of children and which summarizes the attribute-value
distributions of all objects classified under them.
Splitting a node: A node is split by replacing it with its children.
Inserting a new node: A node is created corresponding to the object being inserted into
the tree.
Passing an object down the hierarchy: Effectively calling the COBWEB algorithm on the
object and the subtree rooted in the node.
The COBWEB Algorithm
Algorithm COBWEB
COBWEB(root, record):
  Input: A COBWEB node root, an instance to insert record
  if root has no children then
    children := {copy(root)}
    newcategory(record)   // adds child with record's feature values
    insert(record, root)  // update root's statistics
  else
    insert(record, root)
    for child in root's children do
      calculate Category Utility for insert(record, child),
      set best1, best2 children w. best CU.
    end for
    if newcategory(record) yields best CU then
      newcategory(record)
    else if merge(best1, best2) yields best CU then
      merge(best1, best2)
      COBWEB(root, record)
    else if split(best1) yields best CU then
      split(best1)
      COBWEB(root, record)
    else
      COBWEB(best1, record)
    end if
  end
12.6 Partitioning Methods
Partitioning methods relocate instances by moving them from one cluster to another, starting
from an initial partitioning. Such methods typically require that the number of clusters be
pre-set by the user. To achieve global optimality in partition-based clustering, an
exhaustive enumeration of all possible partitions is required. Because this is not
feasible, certain greedy heuristics are used in the form of iterative optimization. Namely, a
relocation method iteratively relocates points between the k clusters. The following
subsections present various types of partitioning methods.
Error Minimization Algorithms. These algorithms, which tend to work well with isolated
and compact clusters, are the most intuitive and frequently used methods. The basic idea is to
find a clustering structure that minimizes a certain error criterion which measures the
"distance" of each instance to its representative value. The most well-known criterion is the
Sum of Squared Error (SSE), which measures the total squared Euclidean distance of
instances to their representative values. SSE may be globally optimized by exhaustively
enumerating all partitions, which is very time-consuming, or by giving an approximate
solution (not necessarily leading to a global minimum) using heuristics. The latter option is
the most common alternative.
The simplest and most commonly used algorithm employing a squared error criterion is the
K-means algorithm. This algorithm partitions the data into K clusters ($C_1, C_2, \ldots, C_K$),
represented by their centers or means. The K-means algorithm starts with an initial set of
cluster centers, chosen at random or according to some heuristic procedure. In each iteration,
each instance is assigned to its nearest cluster center according to the Euclidean distance
between the two. Then the cluster centers are re-calculated.
The center of each cluster is calculated as the mean of all the instances belonging to that
cluster:
$$\mu_k = \frac{1}{N_k} \sum_{q=1}^{N_k} x_q$$
where $N_k$ is the number of instances belonging to cluster k, $x_q$ is the q-th instance of
that cluster, and $\mu_k$ is the mean of cluster k.
A number of convergence conditions are possible. For example, the search may stop when
the partitioning error is not reduced by the relocation of the centers. This indicates that the
present partition is locally optimal. Other stopping criteria can also be used, such as exceeding
a pre-defined number of iterations.
Input: S (instance set), K (number of clusters)
Output: clusters
1: Initialize K cluster centers.
2: while termination condition is not satisfied do
3: Assign instances to the closest cluster center.
4: Update cluster centers based on the assignment.
5: end while
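The steps listed above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation. Two simplifying assumptions are made: the initial centers are the first K instances (the text allows any random or heuristic choice), and the loop stops when the assignment no longer changes or after `max_iterations` passes. The function name and data are hypothetical.

```python
import math

def k_means(instances, k, max_iterations=100):
    # Step 1: initialize K cluster centers (here simply the first K instances).
    centers = [tuple(p) for p in instances[:k]]
    assignment = None
    for _ in range(max_iterations):
        # Step 3: assign each instance to the closest center (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in instances
        ]
        # Step 2: terminate when the assignment no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: recompute each center as the mean of its assigned instances.
        for c in range(k):
            members = [p for p, a in zip(instances, assignment) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, assignment

data = [(1.0, 1.0), (5.0, 5.0), (1.0, 2.0), (5.0, 6.0), (2.0, 1.0), (6.0, 5.0)]
centers, assignment = k_means(data, k=2)
print(assignment)  # [0, 1, 0, 1, 0, 1]
print(centers)
```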
The K-means algorithm may be viewed as a gradient-descent procedure, which begins with an
initial set of K cluster centers and iteratively updates it so as to decrease the error function. A
rigorous proof of the finite convergence of K-means type algorithms is given in (Selim
and Ismail, 1984). The complexity of T iterations of the K-means algorithm performed on a
sample of m instances, each characterized by N attributes, is $O(T \cdot K \cdot m \cdot N)$. This
linear complexity is one of the reasons for the popularity of the K-means algorithm. Even if
the number of instances is substantially large (which is often the case nowadays), this
algorithm is computationally attractive. Thus, the K-means algorithm has an advantage in
comparison to other clustering methods (e.g. hierarchical clustering methods), which have
non-linear complexity. Other reasons for the algorithm's popularity are its ease of
interpretation, simplicity of implementation, speed of convergence and adaptability to sparse
data.
The Achilles heel of the K-means algorithm involves the selection of the initial partition. The
algorithm is very sensitive to this selection, which may make the difference between reaching a
global and a local minimum.
Being a typical partitioning algorithm, the K-means algorithm works well only on data sets
having isotropic clusters, and is not as versatile as single link algorithms, for instance.
In addition, this algorithm is sensitive to noisy data and outliers (a single outlier can increase
the squared error dramatically); it is applicable only when the mean is defined (namely, for
numeric attributes); and it requires the number of clusters in advance, which is not trivial
when no prior knowledge is available.
The use of the K-means algorithm is often limited to numeric attributes. Huang (1998)
presented the K-prototypes algorithm, which is based on the K-means algorithm but removes
numeric data limitations while preserving its efficiency. The algorithm clusters objects with
numeric and categorical attributes in a way similar to the K-means algorithm. The similarity
measure on numeric attributes is the squared Euclidean distance; the similarity measure on the
categorical attributes is the number of mismatches between objects and the cluster prototypes.
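A mixed dissimilarity of this kind can be sketched in Python as below. The weight `gamma` that balances the categorical part against the numeric part is an assumption added for illustration, as are the function name and the example object; the text above does not specify how the two parts are combined.

```python
def k_prototypes_dissimilarity(obj, prototype, numeric_idx, categorical_idx, gamma=1.0):
    # Numeric attributes: squared Euclidean distance.
    numeric = sum((obj[i] - prototype[i]) ** 2 for i in numeric_idx)
    # Categorical attributes: number of mismatches with the prototype.
    mismatches = sum(1 for i in categorical_idx if obj[i] != prototype[i])
    # gamma (hypothetical) controls the influence of the categorical part.
    return numeric + gamma * mismatches

obj = (1.0, 2.0, "red", "small")
prototype = (2.0, 4.0, "red", "large")

print(k_prototypes_dissimilarity(obj, prototype, [0, 1], [2, 3]))             # 6.0
print(k_prototypes_dissimilarity(obj, prototype, [0, 1], [2, 3], gamma=0.5))  # 5.5
```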
Another partitioning algorithm, which attempts to minimize the SSE, is the K-medoids or
PAM algorithm (partition around medoids; Kaufman and Rousseeuw, 1987). This algorithm is very
similar to the K-means algorithm. It differs from the latter mainly in its representation of the
different clusters. Each cluster is represented by the most central object in the cluster, rather
than by the implicit mean that may not belong to the cluster.
The K-medoids method is more robust than the K-means algorithm in the presence of noise
and outliers because a medoid is less influenced by outliers or other extreme values than a
mean. However, its processing is more costly than the K-means method. Both methods
require the user to specify K, the number of clusters.
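The robustness claim can be checked on a small one-dimensional example. In this Python sketch the cluster values and the outlier are hypothetical, and the medoid is taken to be the member with the smallest total distance to all other members.

```python
def mean(values):
    return sum(values) / len(values)

def medoid(values):
    # The member with the smallest total distance to all members
    # (ties are resolved in favor of the first such member).
    return min(values, key=lambda c: sum(abs(c - v) for v in values))

cluster = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = cluster + [100.0]

print(mean(cluster), medoid(cluster))            # 3.0 3.0
print(mean(with_outlier), medoid(with_outlier))  # the mean jumps, the medoid stays at 3.0
```

A single extreme value pulls the mean far away from the bulk of the cluster, while the medoid remains one of the central objects.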
Other error criteria can be used instead of the SSE. Estivill-Castro (2000) analyzed the total
absolute error criterion. Namely, instead of summing up the squared error, he suggests
summing up the absolute error. While this criterion is superior in regard to robustness, it
requires more computational effort.
Graph-Theoretic Clustering. Graph-theoretic methods are methods that produce clusters via
graphs. The edges of the graph connect the instances represented as nodes. A well-known
graph-theoretic algorithm is based on the Minimal Spanning Tree (MST) (Zahn, 1971): the MST
of the data is constructed, and clusters are obtained by deleting its inconsistent edges.
Inconsistent edges are edges whose weight (in the case of clustering, the length) is significantly
larger than the average of nearby edge lengths. Another graph-theoretic approach constructs
graphs based on limited neighborhood sets.
There is also a relation between hierarchical methods and graph theoretic clustering:
Single-link clusters are subgraphs of the MST of the data instances. Each subgraph is
a connected component, namely a set of instances in which each instance is connected
to at least one other member of the set, so that the set is maximal with respect to this
property. These subgraphs are formed according to some similarity threshold.
Complete-link clusters are maximal complete subgraphs, formed using a similarity
threshold. A maximal complete subgraph is a subgraph such that each node is
connected to every other node in the subgraph and the set is maximal with respect to
this property.
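The relation between the MST and single-link clusters can be sketched in Python. As a simplifying assumption, an MST edge is treated as inconsistent when its length exceeds a fixed `threshold`, rather than being compared with nearby edge lengths; the function name and the example points are hypothetical.

```python
import math
from itertools import combinations

def mst_clusters(points, threshold):
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Kruskal's algorithm: scan edges by increasing length; an edge that joins
    # two different components belongs to the minimal spanning tree.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j) for i, j in combinations(range(n), 2)
    )
    mst = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            mst.append((w, i, j))

    # Delete the long MST edges; the remaining connected components are the clusters.
    parent = list(range(n))
    for w, i, j in mst:
        if w <= threshold:
            parent[find(i)] = find(j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(points[i])
    return sorted(clusters.values())

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
print(mst_clusters(points, threshold=2.0))
```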
12.7 Hierarchical clustering Methods
In data mining, hierarchical clustering is a method of cluster analysis which seeks to build
a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:
Agglomerative: This is a "bottom up" approach: each observation starts in its own
cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: This is a "top down" approach: all observations start in one cluster, and
splits are performed recursively as one moves down the hierarchy.
In general, the merges and splits are determined in a greedy manner. The results of
hierarchical clustering are usually presented in a dendrogram.
In the general case, the complexity of agglomerative clustering is $O(n^3)$, which makes it
too slow for large data sets. Divisive clustering with an exhaustive search is $O(2^n)$, which is
even worse. However, for some special cases, optimal efficient agglomerative methods (of
complexity $O(n^2)$) are known: SLINK for single-linkage and CLINK for complete-linkage
clustering.
Cluster dissimilarity
In order to decide which clusters should be combined (for agglomerative), or where a cluster
should be split (for divisive), a measure of dissimilarity between sets of observations is
required. In most methods of hierarchical clustering, this is achieved by use of an
appropriate metric (a measure of distance between pairs of observations), and a linkage
criterion which specifies the dissimilarity of sets as a function of the pairwise distances of
observations in the sets.
Metric
The choice of an appropriate metric will influence the shape of the clusters, as some elements
may be close to one another according to one distance and farther away according to another.
For example, in a 2-dimensional space, the distance between the point (1,0) and the origin
(0,0) is always 1 according to the usual norms, but the distance between the point (1,1) and
the origin (0,0) can be 2, $\sqrt{2}$ or 1 under Manhattan distance, Euclidean distance or maximum
distance respectively.
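Some commonly used metrics for hierarchical clustering are:
Names                       Formula
Euclidean distance          $\|a - b\|_2 = \sqrt{\sum_i (a_i - b_i)^2}$
Squared Euclidean distance  $\|a - b\|_2^2 = \sum_i (a_i - b_i)^2$
Manhattan distance          $\|a - b\|_1 = \sum_i |a_i - b_i|$
Maximum distance            $\|a - b\|_\infty = \max_i |a_i - b_i|$
Mahalanobis distance        $\sqrt{(a - b)^\top S^{-1} (a - b)}$, where S is the covariance matrix
Cosine similarity           $\frac{a \cdot b}{\|a\| \, \|b\|}$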
For text or other non-numeric data, metrics such as the Hamming distance or Levenshtein
distance are often used.
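Two of the measures mentioned here, cosine similarity and the Hamming distance, can be sketched in a few lines of Python. The function names and example inputs are illustrative; the Levenshtein distance, which also allows insertions and deletions, is not shown.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def hamming_distance(s, t):
    # Number of positions at which two equal-length sequences differ.
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(1 for x, y in zip(s, t) if x != y)

print(cosine_similarity((1.0, 0.0), (0.0, 1.0)))  # 0.0 (orthogonal vectors)
print(cosine_similarity((1.0, 2.0), (2.0, 4.0)))  # close to 1 (same direction)
print(hamming_distance("karolin", "kathrin"))     # 3
```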
A review of cluster analysis in health psychology research found that the most common
distance measure in published studies in that research area is the Euclidean distance or the
squared Euclidean distance.
Linkage criteria
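The linkage criterion determines the distance between sets of observations as a function of
the pairwise distances between observations.
Some commonly used linkage criteria between two sets of observations A and B are:
Names                                         Formula
Maximum or complete linkage clustering        $\max \{ d(a, b) : a \in A, b \in B \}$
Minimum or single-linkage clustering          $\min \{ d(a, b) : a \in A, b \in B \}$
Mean or average linkage clustering, or UPGMA  $\frac{1}{|A| \, |B|} \sum_{a \in A} \sum_{b \in B} d(a, b)$
Minimum energy clustering                     $\frac{2}{nm} \sum_{i,j=1}^{n,m} \|a_i - b_j\|_2 - \frac{1}{n^2} \sum_{i,j=1}^{n} \|a_i - a_j\|_2 - \frac{1}{m^2} \sum_{i,j=1}^{m} \|b_i - b_j\|_2$ (with n = |A|, m = |B|)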
Where d is the chosen metric. Other linkage criteria include:
The sum of all intra-cluster variance.
The increase in variance for the cluster being merged (Ward's criterion).
The probability that candidate clusters spawn from the same distribution function (V-linkage).
The product of in-degree and out-degree on a k-nearest-neighbor graph (graph degree
linkage).
The increment of some cluster descriptor (i.e., a quantity defined for measuring the
quality of a cluster) after merging two clusters.
Example for Agglomerative Clustering
For example, suppose this data is to be clustered, and the Euclidean distance is the distance
metric.
[Figure: Raw data]
The hierarchical clustering dendrogram would be as such:
[Figure: Traditional representation]
Cutting the tree at a given height will give a partitioning clustering at a selected precision. In
this example, cutting after the second row of the dendrogram will yield clusters {a} {b c} {d
e} {f}. Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser
clustering, with a smaller number of larger clusters.
This method builds the hierarchy from the individual elements by progressively merging
clusters. In our example, we have six elements {a} {b} {c} {d} {e} and {f}. The first step is
to determine which elements to merge in a cluster. Usually, we want to take the two closest
elements, according to the chosen distance.
Optionally, one can also construct a distance matrix at this stage, where the number in the i-th
row j-th column is the distance between the i-th and j-th elements. Then, as clustering
progresses, rows and columns are merged as the clusters are merged and the distances
updated. This is a common way to implement this type of clustering, and has the benefit of
caching distances between clusters. A simple agglomerative clustering algorithm for single
linkage can easily be adapted to different types of linkage (see below).
Suppose we have merged the two closest elements b and c. We now have the following
clusters {a}, {b, c}, {d}, {e} and {f}, and want to merge them further. To do that, we need to
take the distance between {a} and {b c}, and therefore define the distance between two
clusters. Usually the distance between two clusters A and B is one of the following:
The maximum distance between elements of each cluster (also called complete-linkage
clustering): $\max \{ d(x, y) : x \in A, y \in B \}$
The minimum distance between elements of each cluster (also called single-linkage
clustering): $\min \{ d(x, y) : x \in A, y \in B \}$
The mean distance between elements of each cluster (also called average linkage
clustering, used e.g. in UPGMA): $\frac{1}{|A| \cdot |B|} \sum_{x \in A} \sum_{y \in B} d(x, y)$
The sum of all intra-cluster variance.
The increase in variance for the cluster being merged (Ward's method).
The probability that candidate clusters spawn from the same distribution
function (V-linkage).
Each agglomeration occurs at a greater distance between clusters than the previous
agglomeration, and one can decide to stop clustering either when the clusters are too far apart
to be merged (distance criterion) or when there is a sufficiently small number of clusters
(number criterion).
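The merging procedure with the number criterion can be sketched in Python as follows. This is a naive illustration, not an efficient implementation: it recomputes all cluster distances at every step instead of caching them in a distance matrix. The coordinates given to the six elements a to f are hypothetical, chosen so that the merges follow the order discussed in the example; passing `min` or `max` as `linkage` selects single or complete linkage.

```python
import math
from itertools import combinations, product

def agglomerate(points, num_clusters, linkage=min):
    """Merge the two closest clusters until only num_clusters remain."""
    clusters = [[p] for p in points]         # every element starts in its own cluster
    while len(clusters) > num_clusters:      # number criterion
        def cluster_distance(pair):
            A, B = clusters[pair[0]], clusters[pair[1]]
            return linkage(math.dist(x, y) for x, y in product(A, B))
        # Pick the pair of clusters (i < j) with the smallest linkage distance.
        i, j = min(combinations(range(len(clusters)), 2), key=cluster_distance)
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return clusters

# Hypothetical coordinates for the elements a, b, c, d, e, f.
a, b, c = (0.0, 0.0), (4.0, 0.0), (5.0, 0.0)
d, e, f = (9.0, 0.0), (10.5, 0.0), (13.0, 0.0)
elements = [a, b, c, d, e, f]

print(agglomerate(elements, 4))  # {a} {b c} {d e} {f}
print(agglomerate(elements, 3))  # {a} {b c} {d e f}
```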
k-means clustering
k-means clustering is a method of vector quantization, originally from signal processing,
that is popular for cluster analysis in data mining. k-means clustering aims to
partition n observations into k clusters in which each observation belongs to the cluster with
the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data
space into Voronoi cells.
The problem is computationally difficult (NP-hard); however, there are efficient heuristic
algorithms that are commonly employed and converge quickly to a local optimum. These are
usually similar to the expectation-maximization algorithm for mixtures of Gaussian
distributions via an iterative refinement approach employed by both algorithms. Additionally,
they both use cluster centers to model the data; however, k-means clustering tends to find
clusters of comparable spatial extent, while the expectation-maximization mechanism allows
clusters to have different shapes.
Description
Given a set of observations ($x_1, x_2, \ldots, x_n$), where each observation is a d-dimensional real
vector, k-means clustering aims to partition the n observations into k sets
(k ≤ n) $S = \{S_1, S_2, \ldots, S_k\}$ so as to minimize the within-cluster sum of squares (WCSS):
$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \|x - \mu_i\|^2$$
where $\mu_i$ is the mean of points in $S_i$.
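The within-cluster sum of squares of a given partition can be computed directly from this definition. The Python sketch below compares two hypothetical partitions of the same four points; the function name and the data are illustrative only.

```python
def wcss(clusters):
    """Within-cluster sum of squares of a partition given as a list of point lists."""
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        # Mean of the cluster, computed coordinate by coordinate.
        mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        # Squared Euclidean distance of every point to its cluster mean.
        total += sum((p[d] - mean[d]) ** 2 for p in points for d in range(dim))
    return total

compact = [[(1.0, 1.0), (1.0, 3.0)], [(5.0, 5.0), (7.0, 5.0)]]
mixed = [[(1.0, 1.0), (5.0, 5.0)], [(1.0, 3.0), (7.0, 5.0)]]

print(wcss(compact))  # 4.0
print(wcss(mixed))    # 36.0
```

The partition that groups nearby points has the smaller WCSS, which is the quantity k-means tries to minimize.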
Algorithms
Standard algorithm
The most common algorithm uses an iterative refinement technique. Due to its ubiquity it is
often called the k-means algorithm; it is also referred to as Lloyd's algorithm, particularly
in the computer science community.
Given an initial set of k means $m_1^{(1)}, \ldots, m_k^{(1)}$ (see below), the algorithm proceeds by
alternating between two steps:
Assignment step: Assign each observation to the cluster whose mean yields the least within-cluster sum of squares (WCSS). Since the sum of squares is the squared Euclidean distance,
this is intuitively the "nearest" mean. (Mathematically, this means partitioning the
observations according to the Voronoi diagram generated by the means.)
$$S_i^{(t)} = \{ x_p : \|x_p - m_i^{(t)}\|^2 \le \|x_p - m_j^{(t)}\|^2 \ \forall j, 1 \le j \le k \}$$
where each $x_p$ is assigned to exactly one $S^{(t)}$, even if it could be assigned to two or more
of them.
Update step: Calculate the new means to be the centroids of the observations in the new clusters:
\[ m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j \]
Since the arithmetic mean is a least-squares estimator, this also minimizes the within-cluster
sum of squares (WCSS) objective.
The algorithm has converged when the assignments no longer change. Since both steps
optimize the WCSS objective, and there only exists a finite number of such partitionings, the
algorithm must converge to a (local) optimum. There is no guarantee that the global optimum
is found using this algorithm.
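The two alternating steps can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not a reference implementation: the function names (squared_distance, centroid, k_means, wcss), the max_iter cap, the rule of keeping the old mean when a cluster becomes empty, and the sample points are all choices made here for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Arithmetic mean of a non-empty list of points, per dimension."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, initial_means, max_iter=100):
    """Lloyd-style iteration: alternate assignment and update steps."""
    means = [tuple(m) for m in initial_means]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to the index of its nearest mean.
        new_assignment = [
            min(range(len(means)), key=lambda i: squared_distance(p, means[i]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: the assignments no longer change
        assignment = new_assignment
        # Update step: each mean becomes the centroid of its cluster.
        for i in range(len(means)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # assumption: an empty cluster keeps its old mean
                means[i] = centroid(members)
    return means, assignment


def wcss(points, means, assignment):
    """Within-cluster sum of squares for a given clustering."""
    return sum(squared_distance(p, means[a]) for p, a in zip(points, assignment))


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
means, assignment = k_means(points, initial_means=[points[0], points[3]])
print(assignment, means, wcss(points, means, assignment))
```

Because the result depends on the initial means, a common practice is to call such a routine several times with different starting points and keep the run with the smallest WCSS.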
The algorithm is often presented as assigning objects to the nearest cluster by distance. This
is slightly inaccurate: the algorithm aims at minimizing the WCSS objective, and thus assigns
by "least sum of squares". Using a different distance function other than (squared) Euclidean
distance may stop the algorithm from converging. It is correct that the smallest Euclidean
distance yields the smallest squared Euclidean distance and thus also yields the smallest sum
of squares. Various modifications of k-means such as spherical k-means and k-medoids have
been proposed to allow using other distance measures.
Initialization methods
Commonly used initialization methods are Forgy and Random Partition. The Forgy method
randomly chooses k observations from the data set and uses these as the initial means. The
Random Partition method first randomly assigns a cluster to each observation and then
proceeds to the update step, thus computing the initial mean to be the centroid of the cluster's
randomly assigned points. The Forgy method tends to spread the initial means out, while
Random Partition places all of them close to the center of the data set. According to Hamerly
et al., the Random Partition method is generally preferable for algorithms such as the k-harmonic means and fuzzy k-means. For expectation maximization and standard k-means
algorithms, the Forgy method of initialization is preferable.
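The difference between the two initialization methods is easy to see in code. The following Python sketch is illustrative only: the function names forgy_init and random_partition_init, the explicit random generator argument, and the fallback used when a randomly formed cluster is empty are assumptions made for this example.

```python
import random


def forgy_init(points, k, rng):
    """Forgy: pick k distinct observations and use them as the initial means."""
    return rng.sample(points, k)


def random_partition_init(points, k, rng):
    """Random Partition: label every point at random, then take cluster centroids."""
    labels = [rng.randrange(k) for _ in points]
    means = []
    for i in range(k):
        members = [p for p, lab in zip(points, labels) if lab == i]
        if not members:
            # Assumption for this sketch: an empty cluster falls back to one random point.
            members = [rng.choice(points)]
        means.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return means


rng = random.Random(0)
data = [(rng.random(), rng.random()) for _ in range(200)]
forgy_means = forgy_init(data, 3, rng)
partition_means = random_partition_init(data, 3, rng)
print(forgy_means)
print(partition_means)
```

With many points, each Random Partition mean is an average over roughly a third of the data, so all three land near the overall center, whereas the Forgy means are actual observations and are therefore spread over the data domain.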
Demonstration of the standard algorithm
1) k initial "means" (in this case k = 3) are randomly generated within the data domain (shown in color).
2) k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.
3) The centroid of each of the k clusters becomes the new mean.
4) Steps 2 and 3 are repeated until convergence has been reached.
As it is a heuristic algorithm, there is no guarantee that it will converge to the global
optimum, and the result may depend on the initial clusters. As the algorithm is usually very
fast, it is common to run it multiple times with different starting conditions. However, in the
worst case, k-means can be very slow to converge: in particular it has been shown that there
exist certain point sets, even in 2 dimensions, on which k-means takes exponential time, that
is 2^{Ω(n)}, to converge. These point sets do not seem to arise in practice: this is corroborated by
the fact that the smoothed running time of k-means is polynomial.
The "assignment" step is also referred to as the expectation step, and the "update" step as the maximization step, making this algorithm a variant of the generalized expectation-maximization algorithm.
Variations
k-medians clustering uses the median in each dimension instead of the mean, and this way minimizes the L1 norm (Taxicab geometry).
k-medoids (also: Partitioning Around Medoids, PAM) uses the medoid instead of the mean, and this way minimizes the sum of distances for arbitrary distance functions.
Fuzzy C-Means Clustering is a soft version of k-means, where each data point has a fuzzy degree of belonging to each cluster.
Gaussian mixture models trained with the expectation-maximization algorithm (EM algorithm) maintain probabilistic assignments to clusters, instead of deterministic assignments, and multivariate Gaussian distributions instead of means.
Several methods have been proposed to choose better starting clusters. One recent proposal is k-means++.
The filtering algorithm uses kd-trees to speed up each k-means step.
Some methods attempt to speed up each k-means step using coresets or the triangle inequality.
Some methods escape local optima by swapping points between clusters.
The Spherical k-means clustering algorithm is suitable for directional data.
The Minkowski metric weighted k-means deals with irrelevant features by assigning cluster-specific weights to each feature.
The two key features of k-means which make it efficient are often regarded as its biggest
drawbacks:
Euclidean distance is used as a metric and variance is used as a measure of cluster scatter.
The number of clusters k is an input parameter: an inappropriate choice of k may yield
poor results. That is why, when performing k-means, it is important to run diagnostic
checks for determining the number of clusters in the data set.
Convergence to a local minimum may produce counterintuitive ("wrong") results (see
example in Fig.).
A key limitation of k-means is its cluster model. The concept is based on spherical clusters that are separable in a way so that the mean value converges towards the cluster center. The clusters are expected to be of similar size, so that the assignment to the nearest cluster center is the correct assignment. When for example applying k-means with a value of k = 3 onto the well-known Iris flower data set, the result often fails to separate the three Iris species contained in the data set. With k = 2, the two visible clusters (one containing two species) will be discovered, whereas with k = 3 one of the two clusters will be split into two even parts. In fact, k = 2 is more appropriate for this data set, despite the data set containing 3 classes. As with any other clustering algorithm, the k-means result relies on the data set to satisfy the assumptions made by the clustering algorithm. It works well on some data sets, while failing on others.
Cluster analysis
In cluster analysis, the k-means algorithm can be used to partition the input data set
into k partitions (clusters). However, the pure k-means algorithm is not very flexible, and as
such of limited use (except for when vector quantization as above is actually the desired use
case!). In particular, the parameter k is known to be hard to choose (as discussed below)
when not given by external constraints. In contrast to other algorithms, k-means also cannot be used with arbitrary distance functions or on non-numerical data. For these use cases, many other algorithms have been developed since.
Feature learning
k-means clustering has been used as a feature learning (or dictionary learning) step, which can be used for (semi-)supervised learning or unsupervised learning. The basic approach is first to train a k-means clustering representation, using the input training data (which need not be labelled). Then, to project any input datum into the new feature space, we have a choice of "encoding" functions: for example, the thresholded matrix-product of the datum with the centroid locations, the distance from the datum to each centroid, simply an indicator function for the nearest centroid, or some smooth transformation of the distance. Alternatively, by transforming the sample-cluster distance through a Gaussian RBF, one effectively obtains the hidden layer of a radial basis function network.
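Three of the encoding functions mentioned above can be sketched in Python as follows. This is a minimal illustration that assumes the centroids have already been learned by k-means; the function names, the gamma width parameter of the RBF, and the sample centroids are hypothetical choices for the example.

```python
import math


def distances_to_centroids(x, centroids):
    """Encode x as its Euclidean distance to every centroid."""
    return [math.dist(x, c) for c in centroids]


def nearest_indicator(x, centroids):
    """Encode x as a 0/1 indicator vector marking the nearest centroid."""
    d = distances_to_centroids(x, centroids)
    nearest = d.index(min(d))
    return [1 if i == nearest else 0 for i in range(len(centroids))]


def rbf_features(x, centroids, gamma=1.0):
    """Encode x by passing each distance through a Gaussian RBF."""
    return [math.exp(-gamma * d * d) for d in distances_to_centroids(x, centroids)]


centroids = [(0.0, 0.0), (3.0, 4.0)]   # assumed output of a previous k-means run
x = (0.0, 0.0)
dist_code = distances_to_centroids(x, centroids)
indicator_code = nearest_indicator(x, centroids)
rbf_code = rbf_features(x, centroids)
print(dist_code, indicator_code, rbf_code)
```

Each encoding maps a datum of any dimension to a vector with one entry per centroid, which can then be passed to a downstream learner.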
Relation to other statistical machine learning algorithms
k-means clustering, and its associated expectation-maximization algorithm, is a special case
of a Gaussian mixture model, specifically, the limit of taking all covariances as diagonal,
equal, and small. It is often easy to generalize a k-means problem into a Gaussian mixture
model. Another generalization of the k-means algorithm is the K-SVD algorithm, which
estimates data points as a sparse linear combination of "codebook vectors". K-means
corresponds to the special case of using a single codebook vector, with a weight of 1.
Mean shift clustering
Basic mean shift clustering algorithms maintain a set of data points the same size as the input
data set. Initially, this set is copied from the input set. Then this set is iteratively replaced by
the mean of those points in the set that are within a given distance of that point. By
contrast, k-means restricts this updated set to k points usually much less than the number of
points in the input data set, and replaces each point in this set by the mean of all points in
the input set that are closer to that point than any other (e.g. within the Voronoi partition of
each updating point). A mean shift algorithm that is similar then to k-means, called likelihood
mean shift, replaces the set of points undergoing replacement by the mean of all points in the
input set that are within a given distance of the changing set. One of the advantages of mean
shift over k-means is that there is no need to choose the number of clusters, because mean
shift is likely to find only a few clusters if indeed only a small number exist. However, mean
shift can be much slower than k-means, and still requires selection of a bandwidth parameter.
Mean shift has soft variants much as k-means does.
Principal component analysis (PCA)
It has been asserted that the relaxed solution of k-means clustering, specified by the cluster indicators, is given by the PCA (principal component analysis) principal components, and that the PCA subspace spanned by the principal directions is identical to the cluster centroid subspace. However, that PCA is a useful relaxation of k-means clustering was not a new result, and it is straightforward to uncover counterexamples to the statement that the cluster centroid subspace is spanned by the principal directions.
Bilateral filtering
k-means implicitly assumes that the ordering of the input data set does not matter.
The bilateral filter is similar to K-means and mean shift in that it maintains a set of data
points that are iteratively replaced by means. However, the bilateral filter restricts the
calculation of the (kernel weighted) mean to include only points that are close in the ordering
of the input data. This makes it applicable to problems such as image denoising, where the
spatial arrangement of pixels in an image is of critical importance.
12.8 Summary
Identifying groups of individuals or objects that are similar to each other but different from
individuals in other groups can be intellectually satisfying, profitable, or sometimes both.
Using your customer base, you may be able to form clusters of customers who have similar
buying habits or demographics. You can take advantage of these similarities to target offers
to subgroups that are most likely to be receptive to them. Based on scores on psychological
inventories, you can cluster patients into subgroups that have similar response patterns. This
may help you in targeting appropriate treatment and studying typologies of diseases. By
analyzing the mineral contents of excavated materials, you can study their origins and
spread.
Distance metrics play an important role in data mining. A distance metric gives a numerical value that measures the dissimilarity between two data objects: the greater the distance, the less similar the objects, and the smaller the distance, the more similar they are. In classification, the class of a new data object with an unknown class label is predicted as the class of its most similar objects. In clustering, similar objects are grouped together. The most common distance metrics are Euclidean distance, Manhattan distance, and Max distance. Other distances, such as Canberra distance, Chord distance, and Chi-squared distance, are used for some specific purposes.
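As a brief illustration, the three most common distance metrics named above can be computed as in the Python sketch below. The function names and the sample vectors are assumptions made for the example; the Max distance is implemented here as the largest coordinate-wise difference.

```python
import math


def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def max_distance(p, q):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(p, q))


p, q = (1, 2, 3), (4, 6, 3)
print(euclidean(p, q), manhattan(p, q), max_distance(p, q))  # 5.0 7 4
```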
Another strategy for dealing with large state spaces is to treat them as a hierarchy of learning
problems. In many cases, hierarchical solutions introduce slight sub-optimality in
performance, but potentially gain a good deal of efficiency in execution time, learning time,
and space.
12.9 Key words
Cluster analysis, computing distance, Partitional methods, Hierarchical method.
12.10 Exercises
1. Describe Cluster analysis.
2. Discuss the different distance measures.
3. Discuss the different types of linkages.
4. Explain COBWEB algorithm.
5. Compare agglomerative clustering to divisive clustering.
6. What are the metrics used in hierarchical clustering?
7. Compare hierarchical clustering to partitional clustering.
8. Explain different linkage criteria.
9. Explain agglomerative clustering with a suitable example.
10. Explain k-means clustering with a suitable example.
11. Write short notes on cluster analysis.
12. Write short notes on other machine learning algorithms.
12.11 References
1. Data Mining and Analysis: Fundamental Concepts and Algorithms, by Mohammed J.
Zaki, and Wagner Meira Jr.,
2. An Introduction to Data Mining by Dr. Saed Sayad, Publisher: University of Toronto
UNIT-13: CLUSTER ANALYSIS
Structure
13.1 Objectives
13.2 Introduction
13.3 Types of Clustering Methods
13.4 Clustering High-Dimensional Data
13.5 Constraint-Based Cluster Analysis
13.6 Outlier Analysis
13.7 Cluster Validation Techniques
13.8 Summary
13.9 Keywords
13.10 Exercises
13.11 References
13.1 Objectives
The objectives covered under this unit include:
The introduction to the Cluster Analysis
Different Categories of Clustering Methods
Clustering High-Dimensional Data
Constraint-Based Cluster Analysis
Outlier Analysis
Cluster Validation Techniques
13.2 Introduction
What Is Cluster Analysis?
The process of grouping a set of physical or abstract objects into classes of similar objects is
called clustering. A cluster is a collection of data objects that are similar to one another
within the same cluster and are dissimilar to the objects in other clusters. A cluster of data
objects can be treated collectively as one group and so may be considered as a form of data
compression. Although classification is an effective means for distinguishing groups or
classes of objects, it requires the often costly collection and labeling of a large set of training
tuples or patterns, which the classifier uses to model each group. It is often more desirable to
proceed in the reverse direction: First partition the set of data into groups based on data
similarity (e.g., using clustering), and then assign labels to the relatively small number of
groups. Additional advantages of such a clustering-based process are that it is adaptable to
changes and helps single out useful features that distinguish different groups.
Clustering is also called data segmentation in some applications because clustering partitions
large data sets into groups according to their similarity.
Clustering can also be used for outlier detection, where outliers (values that are "far away" from any cluster) may be more interesting than common cases. Applications of outlier detection include the detection of credit card fraud and the monitoring of criminal activities in electronic commerce.
13.3 Types of Clustering Methods
Han and Kamber (2001) suggest categorizing the methods into three additional main categories:
Density-based methods,
Model-based clustering
Grid based methods.
13.3.1 Density-based Methods
Density-based methods assume that the points that belong to each cluster are drawn from a
specific probability distribution. The overall distribution of the data is assumed to be a
mixture of several distributions.
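The aim of these methods is to identify the clusters and their distribution parameters. These methods are designed for discovering clusters of arbitrary shape, which are not necessarily convex. That is, for a cluster C_k,
\[ x_i, x_j \in C_k \]
does not necessarily imply that
\[ \alpha \cdot x_i + (1 - \alpha) \cdot x_j \in C_k, \quad 0 \le \alpha \le 1. \]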
The idea is to continue growing the given cluster as long as the density (number of objects or
data points) in the neighborhood exceeds some threshold. Namely, the neighborhood of a
given radius has to contain at least a minimum number of objects. When each cluster is
characterized by local mode or maxima of the density function, these methods are called
mode-seeking. Much work in this field has been based on the underlying assumption that the
component densities are multivariate Gaussian (in case of numeric data) or multinomial (in
case of nominal data).
An acceptable solution in this case is to use the maximum likelihood principle. According to
this principle, one should choose the clustering structure and parameters such that the
probability of the data being generated by such clustering structure and parameters is
maximized. The expectation-maximization (EM) algorithm, which is a general-purpose maximum likelihood algorithm for missing-data problems, has been applied to the problem of parameter estimation.
This algorithm begins with an initial estimate of the parameter vector and then alternates between two steps (Fraley and Raftery, 1998): an "E-step", in which the conditional expectation of the complete data likelihood given the observed data and the current parameter estimates is computed, and an "M-step", in which parameters that maximize the expected likelihood from the E-step are determined. This algorithm was shown to converge to a local maximum of the observed data likelihood. The K-means algorithm may be viewed as a degenerate EM algorithm, in which:
\[ p(k \mid x) = \begin{cases} 1 & \text{if } k = \arg\max_{k'} \hat{p}(k' \mid x) \\ 0 & \text{otherwise} \end{cases} \]
Assigning instances to clusters in the K-means may be considered as the E-step; computing
new cluster centers may be regarded as the M-step. The DBSCAN algorithm (density-based
spatial clustering of applications with noise) discovers clusters of arbitrary shapes and is
efficient for large spatial databases. The algorithm searches for clusters by searching the
neighborhood of each object in the database and checks if it contains more than the minimum
number of objects.
Density-based clustering may also employ nonparametric methods, such as searching for bins
with large counts in a multidimensional histogram of the input instance space (Jain et al.,
1999).
DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently High Density: DBSCAN is a density-based clustering algorithm. It grows regions with sufficiently high density into clusters and discovers clusters of arbitrary shape in spatial databases with noise. It defines a cluster as a maximal set of density-connected points.
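The region-growing idea behind DBSCAN can be sketched in Python as follows. This is a simplified illustration, not an efficient implementation: the neighborhood search is a linear scan rather than a spatial index, and the names region_query, dbscan, eps, min_pts, and the NOISE label are assumptions chosen for this example. A point is treated as a core point when its eps-neighborhood (including itself) contains at least min_pts points.

```python
import math

NOISE = -1  # label used in this sketch for points that belong to no cluster


def region_query(points, idx, eps):
    """Indices of all points within distance eps of points[idx] (itself included)."""
    return [j for j, q in enumerate(points) if eps >= math.dist(points[idx], q)]


def dbscan(points, eps, min_pts):
    labels = [None] * len(points)      # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if min_pts > len(neighbors):
            labels[i] = NOISE          # may later be relabeled as a border point
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        seeds = [j for j in neighbors if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not None:
                continue               # already processed
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is also a core point: keep growing
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

In this small example the two dense groups receive the labels 0 and 1, and the isolated point (5, 5) is marked as noise.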
OPTICS: Ordering Points to Identify the Clustering Structure: Although DBSCAN can cluster objects given input parameters such as ε and MinPts, it still leaves the user with the responsibility of selecting parameter values that will lead to the discovery of acceptable clusters. Actually, this is a problem associated with many other clustering algorithms. Such parameter settings are usually empirically set and difficult to determine, especially for real-world, high-dimensional data sets. Most algorithms are very sensitive to such parameter values: slightly different settings may lead to very different clusterings of the data. Moreover, high-dimensional real data sets often have very skewed distributions, such that their intrinsic clustering structure may not be characterized by global density parameters.
To help overcome this difficulty, a cluster analysis method called OPTICS was proposed.
Rather than produce a data set clustering explicitly, OPTICS computes an augmented cluster
ordering for automatic and interactive cluster analysis. This ordering represents the density-based clustering structure of the data. It contains information that is equivalent to density-based clustering obtained from a wide range of parameter settings. The cluster ordering can
be used to extract basic clustering information (such as cluster centers or arbitrary-shaped
clusters) as well as provide the intrinsic clustering structure.
DENCLUE: Clustering Based on Density Distribution Functions: It is a clustering method
based on a set of density distribution functions.
The method is built on the following ideas:
The influence of each data point can be formally modeled using a mathematical
function, called an influence function, which describes the impact of a data point
within its neighborhood.
The overall density of the data space can be modeled analytically as the sum of the influence function applied to all data points.
Clusters can then be determined mathematically by identifying density attractors,
where density attractors are local maxima of the overall density function.
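The first two ideas can be made concrete with a small Python sketch. The text does not fix a particular influence function, so a Gaussian influence function with width sigma is assumed here for illustration; the function names and the sample data are likewise hypothetical.

```python
import math


def gaussian_influence(x, y, sigma=1.0):
    """Assumed influence of data point y at location x (Gaussian kernel)."""
    return math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2))


def density(x, data, sigma=1.0):
    """Overall density at x: the sum of the influences of all data points."""
    return sum(gaussian_influence(x, y, sigma) for y in data)


data = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5),
        (10.0, 10.0), (10.5, 10.0), (10.0, 10.5)]
inside = density((0.0, 0.0), data)    # location inside a dense group
between = density((5.0, 5.0), data)   # location in the empty region between groups
print(inside, between)
```

The density is high near each group of points and close to zero in between; a density attractor would be found by climbing this density function from a starting point until a local maximum is reached, a step this sketch leaves out.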
13.3.2 Grid-based Methods
The grid-based clustering approach uses a multiresolution grid data structure. It quantizes the object space into a finite number of cells that form a grid structure on which all of the operations for clustering are performed. The main advantage of the approach is its fast processing time (Han and Kamber, 2001), which is typically independent of the number of data objects and depends only on the number of cells in each dimension of the quantized space.
Some typical examples of the grid-based approach include STING, which explores statistical information stored in the grid cells; WaveCluster, which clusters objects using a wavelet transform method; and CLIQUE, which represents a grid- and density-based approach for clustering in high-dimensional data space.
STING: Statistical Information Grid: STING is a grid-based multiresolution clustering technique in which the spatial area is divided into rectangular cells. There are usually several levels of such rectangular cells corresponding to different levels of resolution, and these cells form a hierarchical structure: each cell at a high level is partitioned to form a number of cells at the next lower level. Statistical information regarding the attributes in each grid cell (such as the mean, maximum, and minimum values) is precomputed and stored. These statistical parameters are useful for query processing.
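The precomputation of cell statistics at two levels of resolution might look like the following Python sketch. It is an illustration in the spirit of STING rather than the published algorithm: the two-dimensional layout, the rule that each higher-level cell covers a 2 × 2 block of lower-level cells, and the names cell_statistics and parent_statistics are assumptions made here.

```python
from collections import defaultdict


def cell_statistics(points, cell_size):
    """Bottom level: group (x, y, value) records by grid cell and summarize each cell."""
    cells = defaultdict(list)
    for x, y, value in points:
        cells[(int(x // cell_size), int(y // cell_size))].append(value)
    return {
        key: {"count": len(v), "mean": sum(v) / len(v), "min": min(v), "max": max(v)}
        for key, v in cells.items()
    }


def parent_statistics(child_stats):
    """Next level up: derive each parent's statistics from its children only."""
    groups = defaultdict(list)
    for (cx, cy), stats in child_stats.items():
        groups[(cx // 2, cy // 2)].append(stats)   # assumed 2 x 2 children per parent
    parents = {}
    for key, children in groups.items():
        count = sum(s["count"] for s in children)
        parents[key] = {
            "count": count,
            "mean": sum(s["mean"] * s["count"] for s in children) / count,
            "min": min(s["min"] for s in children),
            "max": max(s["max"] for s in children),
        }
    return parents


records = [(0.5, 0.5, 10.0), (0.2, 0.8, 20.0), (1.5, 0.5, 30.0), (3.5, 3.5, 40.0)]
level0 = cell_statistics(records, cell_size=1.0)
level1 = parent_statistics(level0)
print(level0)
print(level1)
```

Because the parent statistics are computed from the stored child statistics and not from the raw records, a query can be answered at a coarse level first and refined only in the cells that look relevant.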
WaveCluster: Clustering Using Wavelet Transformation: WaveCluster is a multiresolution clustering algorithm that first summarizes the data by imposing a multidimensional grid structure onto the data space. It then uses a wavelet transformation to transform the original feature space, finding dense regions in the transformed space.
In this approach, each grid cell summarizes the information of a group of points that map into the cell. This summary information typically fits into main memory for use by the multiresolution wavelet transform and the subsequent cluster analysis. A wavelet transform is a signal processing technique that decomposes a signal into different frequency subbands. The wavelet model can be applied to d-dimensional signals by applying a one-dimensional wavelet transform d times. In applying a wavelet transform, data are transformed so as to preserve the relative distance between objects at different levels of resolution. This allows the natural clusters in the data to become more distinguishable. Clusters can then be identified by searching for dense regions in the new domain.
"Why is wavelet transformation useful for clustering?"
It offers the following advantages:
• It provides unsupervised clustering.
• The multiresolution property of wavelet transformations can help detect clusters at varying levels of accuracy.
• Wavelet-based clustering is very fast.
13.3.3 Model-Based Clustering Methods
Model-based clustering methods attempt to optimize the fit between the given data and some mathematical model. Such methods are often based on the assumption that the data are generated by a mixture of underlying probability distributions. Unlike conventional clustering, which only identifies groups of objects, model-based clustering methods also find characteristic descriptions for each group, where each group represents a concept or class.
The most frequently used induction methods are:
Decision trees
Neural networks.
Decision Trees: In decision trees, the data is represented by a hierarchical tree, where each
leaf refers to a concept and contains a probabilistic description of that concept. Several
algorithms produce classification trees for representing the unlabelled data.
The most well-known algorithms are:
COBWEB: this algorithm assumes that all attributes are independent (an often too naive
assumption). Its aim is to achieve high predictability of nominal variable values, given a
cluster. This algorithm is not suitable for clustering large database data (Fisher, 1987).
CLASSIT: an extension of COBWEB for continuous-valued data, which unfortunately has problems similar to those of the COBWEB algorithm.
Neural Networks: This type of algorithm represents each cluster by a neuron or "prototype".
The input data is also represented by neurons, which are connected to the prototype neurons.
Each such connection has a weight, which is learned adaptively during learning. A very
popular neural algorithm for clustering is the self-organizing map (SOM). This algorithm
constructs a single-layered network. The learning process takes place in a "winner-takes-all"
fashion:
The prototype neurons compete for the current instance.
The winner is the neuron whose weight vector is closest to the instance currently
presented.
The winner and its neighbors learn by having their weights adjusted.
The SOM algorithm is successfully used for vector quantization and speech recognition. It is
useful for visualizing high-dimensional data in 2D or 3D space. However, it is sensitive to the
initial selection of weight vector, as well as to its different parameters, such as the learning
rate and neighborhood radius.
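The competitive learning loop described above can be sketched in Python for a one-dimensional chain of prototype neurons. This is a minimal illustration under stated assumptions: the linear decay of the learning rate and neighborhood radius, the Gaussian neighborhood function, the lower bound of 0.5 on the radius, and all function and parameter names are choices made for this example.

```python
import math
import random


def train_som(data, n_units, epochs=50, lr0=0.5, radius0=None, seed=0):
    """Train a 1-D self-organizing map; returns one weight vector per unit."""
    rng = random.Random(seed)
    dim = len(data[0])
    if radius0 is None:
        radius0 = n_units / 2
    # Initial weights are random; the result is sensitive to this choice.
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                        # assumed linear decay
        radius = max(radius0 * (1 - frac), 0.5)      # assumed linear decay with a floor
        for x in rng.sample(data, len(data)):        # present instances in random order
            # Competition: the winner is the unit whose weights are closest to x.
            winner = min(range(n_units), key=lambda u: math.dist(weights[u], x))
            # Cooperation: the winner and its neighbors move towards x.
            for u in range(n_units):
                h = math.exp(-((u - winner) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    weights[u][d] += lr * h * (x[d] - weights[u][d])
    return weights


def best_matching_unit(weights, x):
    """Index of the prototype neuron closest to x."""
    return min(range(len(weights)), key=lambda u: math.dist(weights[u], x))


data = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05), (0.9, 1.0), (1.0, 0.9), (0.95, 0.95)]
weights = train_som(data, n_units=4)
print(weights)
print([best_matching_unit(weights, x) for x in data])
```

Changing the seed, lr0, or radius0 changes the resulting map, which reflects the sensitivity to initial weights and parameters noted above.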
13.4 Clustering High-Dimensional Data
Most clustering methods are designed for clustering low-dimensional data and encounter challenges when the dimensionality of the data grows very high (say, over 10 dimensions, or even over thousands of dimensions for some tasks). This is because when the
dimensionality increases, usually only a small number of dimensions are relevant to certain
clusters, but data in the irrelevant dimensions may produce much noise and mask the real
clusters to be discovered. Moreover, when dimensionality increases, data usually become
increasingly sparse because the data points are likely located in different dimensional
subspaces. When the data become really sparse, data points located at different dimensions
can be considered as all equally distanced, and the distance measure, which is essential for
cluster analysis, becomes meaningless.
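The loss of contrast between distances can be observed with a small experiment. The Python sketch below is illustrative only: it assumes points drawn uniformly at random from the unit hypercube, and the measure used, (farthest distance minus nearest distance) divided by the nearest distance, is one simple way to express how distinguishable near and far neighbors are. The function name and the chosen dimensionalities are assumptions.

```python
import math
import random


def relative_contrast(n_points, dim, seed=0):
    """(max - min) / min over the distances from one random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


low_dim = relative_contrast(200, dim=2)
high_dim = relative_contrast(200, dim=500)
print(low_dim, high_dim)
```

In two dimensions the farthest point is many times farther away than the nearest one, while in several hundred dimensions all distances fall within a narrow band, so that the nearest and the farthest neighbors become hard to tell apart.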
To overcome this difficulty, we may consider using feature (or attribute) transformation and
feature (or attribute) selection techniques. Feature transformation methods, such as principal
component analysis and singular value decomposition, transform the data onto a smaller
space while generally preserving the original relative distance between objects. They
summarize data by creating linear combinations of the attributes, and may discover hidden
structures in the data. However, such techniques do not actually remove any of the original
attributes from analysis. This is problematic when there are a large number of irrelevant
attributes. The irrelevant information may mask the real clusters, even after transformation.
Moreover, the transformed features (attributes) are often difficult to interpret, making the
clustering results less useful. Thus, feature transformation is only suited to data sets where
most of the dimensions are relevant to the clustering task. Unfortunately, real-world data sets
tend to have many highly correlated, or redundant, dimensions.
Another way of tackling the curse of dimensionality is to try to remove some of the
dimensions. Attribute subset selection (or feature subset selection) is commonly used for data
reduction by removing irrelevant or redundant dimensions (or attributes).
Given a set of attributes, attribute subset selection finds the subset of attributes that are most
relevant to the data mining task. Attribute subset selection involves searching through various
attribute subsets and evaluating these subsets using certain criteria. It is most commonly
performed by supervised learning—the most relevant set of attributes are found with respect
to the given class labels. It can also be performed by an unsupervised process, such as
entropy analysis, which is based on the property that entropy tends to be low for data that
contain tight clusters. Other evaluation functions, such as category utility, may also be used.
Subspace clustering is an extension to attribute subset selection that has shown its strength at
high-dimensional clustering. It is based on the observation that different subspaces may
contain different, meaningful clusters. Subspace clustering searches for groups of clusters
within different subspaces of the same data set. The problem becomes how to find such
subspace clusters effectively and efficiently.
In this section, we introduce three approaches for effective clustering of high-dimensional
data:
Dimension-growth subspace clustering, represented by CLIQUE,
Dimension-reduction projected clustering, represented by PROCLUS,
Frequent pattern-based clustering, represented by pCluster.
CLIQUE: A Dimension-Growth Subspace Clustering Method: CLIQUE (CLustering In QUEst) was the first algorithm proposed for dimension-growth subspace clustering in high-dimensional space. In dimension-growth subspace clustering, the clustering process starts at
single-dimensional subspaces and grows upward to higher-dimensional ones. Because
CLIQUE partitions each dimension like a grid structure and determines whether a cell is
dense based on the number of points it contains, it can also be viewed as an integration of
density-based and grid-based clustering methods. However, its overall approach is typical of
subspace clustering for high-dimensional space, and so it is introduced in this section.
The ideas of the CLIQUE clustering algorithm are outlined as follows.
• Given a large set of multidimensional data points, the data space is usually not uniformly occupied by the data points. CLIQUE's clustering identifies the sparse and the "crowded" areas in space (or units), thereby discovering the overall distribution patterns of the data set.
• A unit is dense if the fraction of total data points contained in it exceeds an input model parameter. In CLIQUE, a cluster is defined as a maximal set of connected dense units.
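The dimension-growth idea can be sketched in Python for the first two levels: dense one-dimensional units are found first, and only combinations of dense one-dimensional units from different dimensions are examined as candidate two-dimensional units. This is a simplified illustration, not the full CLIQUE algorithm: it assumes data scaled to the range [0, 1], equal-width intervals, and the hypothetical names dense_units_1d, dense_units_2d, n_intervals, and tau (the density threshold); the final step of joining connected dense units into clusters is not shown.

```python
from collections import Counter
from itertools import combinations


def interval_of(value, n_intervals):
    """Index of the equal-width interval of [0, 1] that contains value."""
    return min(int(value * n_intervals), n_intervals - 1)


def dense_units_1d(points, n_intervals, tau):
    """Units (dimension, interval) holding more than a fraction tau of all points."""
    n = len(points)
    dense = set()
    for d in range(len(points[0])):
        counts = Counter(interval_of(p[d], n_intervals) for p in points)
        dense |= {(d, i) for i, c in counts.items() if c / n > tau}
    return dense


def dense_units_2d(points, n_intervals, tau, dense_1d):
    """Grow to two dimensions, testing only pairs of dense 1-D units."""
    n = len(points)
    dense = set()
    for (d1, i1), (d2, i2) in combinations(sorted(dense_1d), 2):
        if d1 == d2:
            continue  # a 2-D unit needs two different dimensions
        count = sum(
            1 for p in points
            if interval_of(p[d1], n_intervals) == i1
            and interval_of(p[d2], n_intervals) == i2
        )
        if count / n > tau:
            dense.add(((d1, i1), (d2, i2)))
    return dense


points = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.2), (0.2, 0.2),
          (0.9, 0.9), (0.8, 0.9), (0.9, 0.8), (0.8, 0.8)]
units_1d = dense_units_1d(points, n_intervals=2, tau=0.3)
units_2d = dense_units_2d(points, n_intervals=2, tau=0.3, dense_1d=units_1d)
print(sorted(units_1d))
print(sorted(units_2d))
```

Restricting the candidates in this way relies on the observation that a unit can only be dense in two dimensions if its projections onto each single dimension are dense as well.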
PROCLUS: A Dimension-Reduction Subspace Clustering Method: PROCLUS (PROjected
CLUStering) is a typical dimension-reduction subspace clustering method. That is, instead of
starting from single-dimensional spaces, it starts by finding an initial approximation of the
clusters in the high-dimensional attribute space. Each dimension is then assigned a weight for
each cluster, and the updated weights are used in the next iteration to regenerate the clusters.
This leads to the exploration of dense regions in all subspaces of some desired dimensionality
and avoids the generation of a large number of overlapped clusters in projected dimensions of
lower dimensionality.
PROCLUS finds the best set of medoids by a hill-climbing process similar to that used in
CLARANS, but generalized to deal with projected clustering. It adopts a distance measure
called Manhattan segmental distance, which is the Manhattan distance on a set of relevant
dimensions. The PROCLUS algorithm consists of three phases: initialization, iteration, and
cluster refinement. In the initialization phase, it uses a greedy algorithm to select a set of
initial medoids that are far apart from each other so as to ensure that each cluster is
represented by at least one object in the selected set.
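A short Python sketch of the Manhattan segmental distance follows. One detail is an assumption beyond the description above: following the usual definition in the PROCLUS literature, the Manhattan distance over the relevant dimensions is divided by the number of those dimensions, so that distances computed on subspaces of different sizes remain comparable. The function name and sample vectors are hypothetical.

```python
def manhattan_segmental_distance(x, y, dims):
    """Average absolute difference between x and y over the dimensions in dims."""
    return sum(abs(x[d] - y[d]) for d in dims) / len(dims)


x = (1.0, 5.0, 9.0)
y = (4.0, 5.0, 1.0)
print(manhattan_segmental_distance(x, y, dims=[0, 2]))     # 5.5
print(manhattan_segmental_distance(x, y, dims=[0, 1, 2]))  # average over all three dimensions
```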
More concretely, it first chooses a random sample of data points proportional to the number
of clusters we wish to generate, and then applies the greedy algorithm to obtain an even
smaller final subset for the next phase. The iteration phase selects a random set of k medoids
from this reduced set (of medoids), and replaces "bad" medoids with randomly chosen new
medoids if the clustering is improved. For each medoid, a set of dimensions is chosen whose
average distances are small compared to statistical expectation. The total number of
dimensions associated with the medoids must be k × l, where l is an input parameter that selects the
average dimensionality of cluster subspaces.
The refinement phase computes new dimensions for each medoid based on the clusters found,
reassigns points to medoids, and removes outliers. Experiments on PROCLUS show that the
method is efficient and scalable at finding high-dimensional clusters. Unlike CLIQUE, which
outputs many overlapped clusters, PROCLUS finds nonoverlapping partitions of points. The
discovered clusters may help better understand the high-dimensional data and facilitate other
subsequent analyses.
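The Manhattan segmental distance used by PROCLUS can be sketched in a few lines of Python. The text defines it as the Manhattan distance on a set of relevant dimensions. Dividing by the number of dimensions follows the usual formulation and is an assumption here, since the text does not state it. The function name and sample values are illustrative only.

```python
def manhattan_segmental_distance(x, y, dims):
    """Average absolute difference between x and y over the dimensions in dims."""
    if not dims:
        raise ValueError("dims must contain at least one dimension index")
    return sum(abs(x[i] - y[i]) for i in dims) / len(dims)

# Hypothetical medoid whose relevant dimensions are 0, 1 and 3.
medoid = (1.0, 5.0, 9.0, 2.0)
point = (2.0, 5.5, 0.0, 2.0)
d = manhattan_segmental_distance(point, medoid, dims=[0, 1, 3])
print(d)
```

Dimension 2, where the two objects differ widely, is ignored because it is not among the dimensions associated with this medoid. The normalization lets distances be compared across medoids that have different numbers of relevant dimensions.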
Frequent Pattern–Based Clustering Methods: Frequent pattern mining, as the name
implies, searches for patterns (such as sets of items or objects) that occur frequently in large
data sets. Frequent pattern mining can lead to the discovery of interesting associations and
correlations among data objects. Methods for frequent pattern mining were introduced in
Chapter 5. The idea behind frequent pattern–based cluster analysis is that the frequent
patterns discovered may also indicate clusters. Frequent pattern–based cluster analysis is well
suited to high-dimensional data. It can be viewed as an extension of the dimension-growth
subspace clustering approach. However, the boundaries of different dimensions are not
obvious, since here they are represented by sets of frequent itemsets.
That is, rather than growing the clusters dimension by dimension, we grow sets of frequent
itemsets, which eventually lead to cluster descriptions. Typical examples of frequent pattern–
based cluster analysis include the clustering of text documents that contain thousands of
distinct keywords, and the analysis of microarray data that contain tens of thousands of
measured values or "features."
In this section, we examine two forms of frequent pattern–based cluster analysis:
Frequent term–based text clustering and
Clustering by pattern similarity in microarray data analysis.
In frequent term–based text clustering, text documents are clustered based on the frequent
terms they contain. Using the vocabulary of text document analysis, a term is any sequence of
characters separated from other terms by a delimiter. A term can be made up of a single word
or several words. In general, we first remove non text information (such as HTML tags and
punctuation) and stop words. Terms are then extracted.
A stemming algorithm is then applied to reduce each term to its basic stem. In this way, each
document can be represented as a set of terms. Each set is typically large. Collectively, a
large set of documents will contain a very large set of distinct terms. If we treat each term as
a dimension, the dimension space will be of very high dimensionality! This poses great
challenges for document cluster analysis. The dimension space can be referred to as term
vector space, where each document is represented by a term vector.
This difficulty can be overcome by frequent term–based analysis. That is, by using an
efficient frequent itemset mining algorithm, we can mine a set of frequent terms from the set
of text documents. Then, instead of clustering on high-dimensional term vector space, we
need only consider the low-dimensional frequent term sets as "cluster candidates." Notice
that a frequent term set is not a cluster but rather the description of a cluster. The
corresponding cluster consists of the set of documents containing all of the terms of the
frequent term set. A well-selected subset of the set of all frequent term sets can be considered
as a clustering.
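The relationship between a frequent term set and its cluster can be sketched as follows. This is a brute-force illustration under stated assumptions, not an efficient frequent itemset miner: the function name, the `max_size` limit, and the toy documents are all hypothetical.

```python
from itertools import combinations

def frequent_term_sets(docs, min_support, max_size=2):
    """Map each frequent term set to its cover: the documents containing all its terms."""
    n = len(docs)
    vocabulary = sorted(set().union(*docs.values()))
    result = {}
    for size in range(1, max_size + 1):
        for terms in combinations(vocabulary, size):
            cover = frozenset(d for d, t in docs.items() if set(terms) <= t)
            if len(cover) / n >= min_support:
                result[frozenset(terms)] = cover
    return result

# Each document is already reduced to a set of stemmed terms (assumed input).
docs = {
    "d1": {"data", "mining", "cluster"},
    "d2": {"data", "mining", "pattern"},
    "d3": {"spatial", "map", "data"},
    "d4": {"spatial", "map", "region"},
}
candidates = frequent_term_sets(docs, min_support=0.5)
for terms, cover in candidates.items():
    print(sorted(terms), "->", sorted(cover))
```

Each key is a cluster description and each value is the corresponding cluster candidate. Choosing a subset of these candidates, for example {data, mining} and {map, spatial}, yields a clustering of the four documents.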
13.5 Constraint-Based Cluster Analysis
In the above discussion, we assume that cluster analysis is an automated, algorithmic
computational process, based on the evaluation of similarity or distance functions among a
set of objects to be clustered, with little user guidance or interaction. However, users often
have a clear view of the application requirements, which they would ideally like to use to
guide the clustering process and influence the clustering results. Thus, in many applications,
it is desirable to have the clustering process take user preferences and constraints into
consideration.
Examples of such information include the expected number of clusters, the minimal or
maximal cluster size, weights for different objects or dimensions, and other desirable
characteristics of the resulting clusters. Moreover, when a clustering task involves a rather
high-dimensional space, it is very difficult to generate meaningful clusters by relying solely
on the clustering parameters. User input regarding important dimensions or the desired results
will serve as crucial hints or meaningful constraints for effective clustering. In general, we
contend that knowledge discovery would be most effective if one could develop an
environment for human-centered, exploratory mining of data, that is, where the human user is
allowed to play a key role in the process.
Foremost, a user should be allowed to specify a focus—directing the mining algorithm
toward the kind of "knowledge" that the user is interested in finding. Clearly, user-guided
mining will lead to more desirable results and capture the application semantics.
Constraint-based clustering finds clusters that satisfy user-specified preferences or constraints.
Depending on the nature of the constraints, constraint-based clustering may adopt rather
different approaches. Here are a few categories of constraints.
Constraints on individual objects: We can specify constraints on the objects to be
clustered. In a real estate application, for example, one may like to spatially cluster
only those luxury mansions worth over a million dollars. This constraint confines the
set of objects to be clustered. It can easily be handled by preprocessing (e.g.,
performing selection using an SQL query), after which the problem reduces to an
instance of unconstrained clustering.
Constraints on the selection of clustering parameters: A user may like to set a
desired range for each clustering parameter. Clustering parameters are usually quite
specific to the given clustering algorithm. Examples of parameters include k, the
desired number of clusters in a k-means algorithm; or ε (the radius) and MinPts (the
minimum number of points) in the DBSCAN algorithm. Although such user-specified
parameters may strongly influence the clustering results, they are usually confined to
the algorithm itself. Thus, their fine tuning and processing are usually not considered
a form of constraint-based clustering.
Constraints on distance or similarity functions: We can specify different distance
or similarity functions for specific attributes of the objects to be clustered or different
distance measures for specific pairs of objects. When clustering sportsmen, for
example, we may use different weighting schemes for height, body weight, age, and
skill level. Although this will likely change the mining results, it may not alter the
clustering process per se. However, in some cases, such changes may make the
evaluation of the distance function nontrivial, especially when it is tightly intertwined
with the clustering process.
13.6 Outlier Analysis
An outlier is a data point which is significantly different from the remaining data.
Hawkins formally defined [205] the concept of an outlier as follows:
"An outlier is an observation which deviates so much from the other observations as to
arouse suspicions that it was generated by a different mechanism."
Outliers are also referred to as abnormalities, discordants, deviants, or anomalies in the data
mining and statistics literature. In most applications, the data is created by one or more
generating processes, which could either reflect activity in the system or observations
collected about entities.
When the generating process behaves in an unusual way, it results in the creation of outliers.
Therefore, an outlier often contains useful information about abnormal characteristics of the
systems and entities, which impact the data generation process. The recognition of such
unusual characteristics provides useful application-specific insights.
Some examples are as follows:
• Intrusion Detection Systems: In many host-based or networked computer systems,
different kinds of data are collected about the operating system calls, network traffic,
or other activity in the system. This data may show unusual behavior because of
malicious activity. The detection of such activity is referred to as intrusion detection.
• Credit Card Fraud: Credit card fraud is quite prevalent, because of the ease with
which sensitive information such as a credit card number may be compromised. This
typically leads to unauthorized use of the credit card. In many cases, unauthorized use
may show different patterns, such as a buying spree from geographically obscure
locations. Such patterns can be used to detect outliers in credit card transaction data.
• Interesting Sensor Events: Sensors are often used to track various environmental
and location parameters in many real applications. The sudden changes in the
underlying patterns may represent events of interest. Event detection is one of the
primary motivating applications in the field of sensor networks.
• Medical Diagnosis: In many medical applications the data is collected from a variety
of devices such as MRI scans, PET scans or ECG time-series. Unusual patterns in
such data typically reflect disease conditions.
• Law Enforcement: Outlier detection finds numerous applications to law
enforcement, especially in cases where unusual patterns can only be discovered over
time through multiple actions of an entity. Determining fraud in financial transactions,
trading activity, or insurance claims typically requires the determination of unusual
patterns in the data generated by the actions of the criminal entity.
• Earth Science: A significant amount of spatiotemporal data about weather patterns,
climate changes, or land cover patterns is collected through a variety of mechanisms
such as satellites or remote sensing. Anomalies in such data provide significant
insights about hidden human or environmental trends, which may have caused such
anomalies.
In all these applications, the data has a "normal" model, and anomalies are recognized as
deviations from this normal model. In many cases such as intrusion or fraud detection, the
outliers can only be discovered as a sequence of multiple data points, rather than as an
individual data point. For example, a fraud event may often reflect the actions of an
individual in a particular sequence. The specificity of the sequence is relevant to identifying
the anomalous event. Such anomalies are also referred to as collective anomalies, because
they can only be inferred collectively from a set or sequence of data points. Such collective
anomalies typically represent unusual events, which need to be discovered from the data. This
book will address these different kinds of anomalies.
The output of an outlier detection algorithm can be one of two types:
• Most outlier detection algorithms output a score about the level of "outlierness" of a
data point. This can be used in order to determine a ranking of the data points in terms
of their outlier tendency. This is a very general form of output, which retains all the
information provided by a particular algorithm, but does not provide a concise
summary of the small number of data points which should be considered outliers.
• A second kind of output is a binary label indicating whether a data point is an outlier
or not. While some algorithms may directly return binary labels, the outlier scores can
also be converted into binary labels. This is typically done by imposing thresholds on
outlier scores, based on their statistical distribution. A binary labeling contains less
information than a scoring mechanism, but it is the final result which is often needed
for decision making in practical applications.
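The two kinds of output can be illustrated with a small Python sketch. It assumes a very simple scoring model, the absolute z-score of each value, and an assumed threshold of 2.0. Neither choice comes from the text, and real detectors use richer models.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Outlier score: distance from the mean in units of standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [abs(v - mu) / sigma for v in values]

def binary_labels(scores, threshold):
    """Convert scores into outlier / non-outlier labels by thresholding."""
    return [s > threshold for s in scores]

# Hypothetical sensor readings with one unusual value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0, 10.1]
scores = z_scores(readings)
labels = binary_labels(scores, threshold=2.0)
# Ranking of data points by outlier tendency, most suspicious first.
ranking = sorted(range(len(readings)), key=lambda i: scores[i], reverse=True)
print(ranking[0], labels)
```

The scores keep the full ranking information, whereas the labels only record which side of the threshold each point falls on.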
The problem of outlier detection finds applications in numerous domains, where it is
desirable to determine interesting and unusual events in the activity which generates such
data. The core of all outlier detection methods is the creation of a probabilistic, statistical, or
algorithmic model which characterizes the normal behavior of the data. The deviations from this
model are used to determine the outliers. Good domain-specific knowledge of the underlying data
is often crucial in order to design simple and accurate models which do not overfit the
underlying data. The problem of outlier detection becomes especially challenging when significant
relationships exist among the different data points.
This is the case for time-series and network data in which the patterns in the relationships
among the data points (whether temporal or structural) play the key role in defining the
outliers. Outlier analysis has tremendous scope for research, especially in the area of
structural and temporal analysis.
13.7 Clustering Validation Techniques
The correctness of clustering algorithm results is verified using appropriate criteria and
techniques. Since clustering algorithms define clusters that are not known a priori,
irrespective of the clustering methods, the final partition of data requires some kind of
evaluation in most applications. One of the most important issues in cluster analysis is the
evaluation of clustering results to find the partitioning that best fits the underlying data. This
is the main subject of cluster validity. In what follows, we discuss the fundamental concepts of
this area and present the various cluster validity approaches proposed in the literature.
Fundamental concepts of cluster validity
The procedure of evaluating the results of a clustering algorithm is known under the term
cluster validity. In general terms, there are three approaches to investigate cluster validity.
The first is based on external criteria. This implies that we evaluate the results of a clustering
algorithm based on a pre-specified structure, which is imposed on a data set and reflects our
intuition about the clustering structure of the data set.
The second approach is based on internal criteria. We may evaluate the results of a clustering
algorithm in terms of quantities that involve the vectors of the data set themselves (e.g.
proximity matrix).
The third approach to cluster validity is based on relative criteria. Here the basic idea is
the evaluation of a clustering structure by comparing it to other clustering schemes, produced
by the same algorithm but with different parameter values. There are two criteria proposed
for clustering evaluation and selection of an optimal clustering scheme.
External criteria: In this approach the basic idea is to test whether the points of the data set
are randomly structured or not. This analysis is based on the null hypothesis, H0, expressed
as a statement of random structure of a data set, say X. To test this hypothesis we rely on
statistical tests, which lead to a computationally complex procedure. Monte Carlo
techniques are then used to cope with the high computational cost.
Internal criteria: Using this approach of cluster validity our goal is to evaluate the clustering
result of an algorithm using only quantities and features inherent to the dataset. There are two
cases in which we apply internal criteria of cluster validity depending on the clustering
structure: a) hierarchy of clustering schemes, and b) single clustering scheme.
Relative criteria: The basis of the above described validation methods is statistical testing.
Thus, the major drawback of techniques based on internal or external criteria is their high
computational demands. A different validation approach is discussed in this section. It is
based on relative criteria and does not involve statistical tests. The fundamental idea of this
approach is to choose the best clustering scheme of a set of defined schemes according to a
pre-specified criterion.
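As an illustration of the relative approach, the sketch below scores several candidate clustering schemes of the same data with one pre-specified criterion and keeps the best. The Dunn index (smallest between-cluster distance divided by largest cluster diameter) is used here as an example criterion. It is not named in the text, and the data and candidate labelings are hypothetical.

```python
from itertools import combinations
from math import dist

def dunn_index(points, labels):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    groups = list(clusters.values())
    if len(groups) < 2:
        raise ValueError("need at least two clusters")
    diameter = max(
        (dist(a, b) for g in groups for a, b in combinations(g, 2)),
        default=0.0,
    )
    separation = min(
        dist(a, b) for g, h in combinations(groups, 2) for a in g for b in h
    )
    return float("inf") if diameter == 0 else separation / diameter

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
# Candidate schemes, e.g. produced by one algorithm under different parameters.
schemes = {
    "two compact groups": [0, 0, 0, 1, 1, 1],
    "split first group": [0, 0, 2, 1, 1, 1],
    "mixed groups": [0, 1, 0, 1, 0, 1],
}
scores = {name: dunn_index(points, labels) for name, labels in schemes.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

No statistical test is involved: the schemes are simply ranked by the chosen index, with larger values indicating compact, well-separated clusters.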
13.8 Summary
A cluster is a collection of data objects that are similar to one another within the same cluster
and are dissimilar to the objects in other clusters. The process of grouping a set of physical or
abstract objects into classes of similar objects is called clustering. Cluster analysis has wide
applications, including market or customer segmentation, pattern recognition, biological
studies, spatial data analysis, Web document classification, and many others. Clustering methods can be
categorized into partitioning methods, hierarchical methods, density-based methods, grid-based
methods, model-based methods, methods for high-dimensional data (including frequent
pattern–based methods), and constraint-based methods.
13.9 Keywords
Density Based Methods, Grid-Based Methods, Model-Based Clustering Methods, Clustering
High-Dimensional Data, Constraint-Based Cluster Analysis, Outlier Analysis, Cluster
Validation Techniques
13.10 Exercises
a) What is clustering? Explain.
b) Explain the different types of clustering.
c) What are the different types of model-based clustering methods?
d) Give four examples of applications of outlier analysis.
e) Briefly describe the following approaches to clustering:
a. Partitioning methods,
b. Hierarchical methods,
c. Density-based methods,
d. Grid-based methods,
e. Model-based methods,
f. Methods for high-dimensional data,
g. Constraint-based methods.
Give examples in each case.
f) Why is outlier mining important?
g) What are the different approaches to clustering validation?
13.11 References
1. Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber, Morgan
Kaufmann Publisher, Second Edition, 2006.
2. Introduction to Data Mining with Case Studies by G. K. Gupta, Eastern Economy
Edition (PHI, New Delhi), Third Edition, 2009.
3. Data Mining Techniques by Arun K Pujari, University Press, Second Edition, 2009.
UNIT-14: SPATIAL DATA MINING
Structure
14.1 Objectives
14.2 Introduction
14.3 Spatial Data Cube Construction and Spatial OLAP
14.4 Mining Spatial Association and Co-location Patterns
14.5 Spatial Clustering Methods
14.6 Spatial Classification and Spatial Trend Analysis
14.7 Mining Raster Databases
14.8 Summary
14.9 Keywords
14.10 Exercises
14.11 References
14.1 Objectives
The objectives covered under this unit include:
The introduction to Spatial Data Mining
Spatial Data Cube Construction
Mining Spatial Association and Co-location Patterns
Spatial Clustering Methods
Spatial Classification and Spatial Trend Analysis
14.2 Introduction
What is Spatial Data Mining
A spatial database stores a large amount of space-related data, such as maps,
preprocessed remote sensing or medical imaging data, and VLSI chip layout data. Spatial
databases have many features distinguishing them from relational databases. They carry
topological and/or distance information, usually organized by sophisticated, multidimensional
spatial indexing structures that are accessed by spatial data access methods and often require
spatial reasoning, geometric computation, and spatial knowledge representation techniques.
Spatial data mining refers to the extraction of knowledge, spatial relationships, or other
interesting patterns not explicitly stored in spatial databases. Such mining demands an
integration of data mining with spatial database technologies. It can be used for understanding spatial data, discovering spatial relationships and relationships between spatial and
non spatial data, constructing spatial knowledge bases, reorganizing spatial databases, and
optimizing spatial queries. It is expected to have wide applications in geographic information
systems, geomarketing, remote sensing, image database exploration, medical imaging,
navigation, traffic control, environmental studies, and many other areas where spatial data are
used. A crucial challenge to spatial data mining is the exploration of efficient spatial data
mining techniques due to the huge amount of spatial data and the complexity of spatial data
types and spatial access methods.
"What about using statistical techniques for spatial data mining?" Statistical spatial data
analysis has been a popular approach to analyzing spatial data and exploring geographic
information. The term geostatistics is often associated with continuous geographic space,
whereas the term spatial statistics is often associated with discrete space. In a statistical
model that handles non spatial data, one usually assumes statistical independence among
different portions of data. However, different from traditional data sets, there is no such
independence among spatially distributed data because in reality, spatial objects are often
interrelated, or more exactly spatially co-located, in the sense that the closer the two objects
are located, the more likely they share similar properties.
For example, natural resources, climate, temperature, and economic situations are likely
to be similar in geographically closely located regions. People even consider this as the first
law of geography: "Everything is related to everything else, but nearby things are more
related than distant things." Such a property of close interdependency across nearby space
leads to the notion of spatial autocorrelation. Based on this notion, spatial statistical modeling
methods have been developed with good success. Spatial data mining will further develop
spatial statistical analysis methods and extend them for huge amounts of spatial data, with
more emphasis on efficiency, scalability, cooperation with database and data warehouse
systems, improved user interaction, and the discovery of new types of knowledge.
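Spatial autocorrelation can be quantified. One widely used statistic is Moran's I. The text does not name it, so it is introduced here only as an illustration. Values near +1 indicate that neighbouring regions carry similar values, and values near -1 indicate that neighbours tend to differ. The adjacency matrix and the two value series below are hypothetical.

```python
def morans_i(values, weights):
    """Moran's I for values observed at n locations with an n x n weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_total = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_total) * (num / den)

# Four regions in a row; weight 1 when two regions share a border.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
smooth = [10.0, 11.0, 19.0, 20.0]       # nearby regions mostly alike
alternating = [10.0, 20.0, 10.0, 20.0]  # every neighbour differs
print(morans_i(smooth, adjacency))
print(morans_i(alternating, adjacency))
```

The first series yields a positive value and the second a negative one, in line with the observation that nearby things tend to be more related than distant things only in the first case.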
14.3 Spatial Data Cube Construction and Spatial OLAP
"Can we construct a spatial data warehouse?" Yes, as with relational data, we can
integrate spatial data to construct a data warehouse that facilitates spatial data mining. A
spatial data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of both spatial and nonspatial data in support of spatial data mining and
spatial-data-related decision-making processes. Let's look at the following example.
Example 1 Spatial data cube and spatial OLAP.
There are about 3,000 weather probes distributed in British Columbia (BC), Canada, each
recording daily temperature and precipitation for a designated small area and transmitting
signals to a provincial weather station. With a spatial data warehouse that supports spatial
OLAP, a user can view weather patterns on a map by month, by region, and by different
combinations of temperature and precipitation, and can dynamically drill down or roll up
along any dimension to explore desired patterns, such as "wet and hot regions in the Fraser
Valley in summer 1999."
There are several challenging issues regarding the construction and utilization of spatial data
warehouses. The first challenge is the integration of spatial data from heterogeneous sources
and systems. Spatial data are usually stored in different industry firms and government
agencies using various data formats. Data formats are not only structure-specific (e.g., raster- vs. vector-based spatial data, object-oriented vs. relational models, different spatial storage
and indexing structures), but also vendor-specific (e.g., ESRI, MapInfo, Intergraph).
Figure 4: A star schema of the BC weather spatial data warehouse and corresponding
BC weather probes map
There has been a great deal of work on the integration and exchange of heterogeneous
spatial data, which has paved the way for spatial data integration and spatial data warehouse
construction.
The second challenge is the realization of fast and flexible on-line analytical processing
in spatial data warehouses. The star schema model is a good choice for modeling spatial data
warehouses because it provides a concise and organized warehouse structure and facilitates
OLAP operations. However, in a spatial warehouse, both dimensions and measures may
contain spatial components.
There are three types of dimensions in a spatial data cube:
• A nonspatial dimension contains only nonspatial data. Nonspatial dimensions
temperature and precipitation can be constructed for the warehouse in Example 1,
since each contains nonspatial data whose generalizations are nonspatial
(such as "hot" for temperature and "wet" for precipitation).
• A spatial-to-nonspatial dimension is a dimension whose primitive-level data
are spatial but whose generalization, starting at a certain high level, becomes
nonspatial. For example, the spatial dimension city relays geographic data for the
U.S. map. Suppose that the dimension's spatial representation of, say, Seattle is
generalized to the string "pacific northwest." Although "pacific northwest" is a
spatial concept, its representation is not spatial (since, in our example, it is a
string). It therefore plays the role of a nonspatial dimension.
• A spatial-to-spatial dimension is a dimension whose primitive level and all of
its high-level generalized data are spatial. For example, the dimension
equi_temperature_region contains spatial data, as do all of its generalizations,
such as with regions covering 0-5 degrees (Celsius), 5-10 degrees, and so on.
We distinguish two types of measures in a spatial data cube:
• A numerical measure contains only numerical data. For example, one measure
in a spatial data warehouse could be the monthly revenue of a region, so that a
roll-up may compute the total revenue by year, by county, and so on. Numerical
measures can be further classified into distributive, algebraic, and holistic, as
discussed in Chapter 3.
• A spatial measure contains a collection of pointers to spatial objects. For
example, in a generalization (or roll-up) in the spatial data cube of Example 1,
the regions with the same range of temperature and precipitation will be grouped
into the same cell, and the measure so formed contains a collection of pointers to
those regions.
A non-spatial data cube contains only non-spatial dimensions and numerical measures. If a
spatial data cube contains spatial dimensions but no spatial measures, its OLAP operations,
such as drilling or pivoting, can be implemented in a manner similar to that for non-spatial
data cubes. "But what if I need to use spatial measures in a spatial data cube?" This notion
raises some challenging issues on efficient implementation, as shown in the following
example.
Example 2: Numerical versus spatial measures.
A star schema for the BC weather warehouse of Example 1 is shown in Figure 4. It consists of
four dimensions: region, temperature, time, and precipitation, and three measures: region map,
area, and count. A concept hierarchy for each dimension can be created by users or experts, or
generated automatically by data clustering analysis. Figure 5 presents hierarchies for each of
the dimensions in the BC weather warehouse.
Of the three measures, area and count are numerical measures that can be computed
similarly as for nonspatial data cubes; region map is a spatial measure that represents a
collection of spatial pointers to the corresponding regions. Since different spatial OLAP
operations result in different collections of spatial objects in region map, it is a major
challenge to compute the merges of a large number of regions flexibly and dynamically.
Figure 5: Hierarchies for each dimension of the BC weather data warehouse.
For example, two different roll-ups on the BC weather map data (Figure 4) may produce two
different generalized region maps, as shown in Figure 6, each being the result of merging a
large number of small (probe) regions from Figure 4.
Figure 6: Generalized regions after different roll-up operations.
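The difference between numerical and spatial measures during a roll-up can be sketched as follows. The cell layout, the measure values, and the region identifiers (standing in for spatial pointers) are hypothetical and are not taken from the BC weather warehouse. The sketch only shows that area and count are summed, while the region map of a rolled-up cell is a collection of pointers whose actual geometric merge remains to be computed.

```python
from collections import defaultdict

# Hypothetical base cells of a small cube with two dimensions and three measures.
base_cells = [
    {"temperature": "hot", "precipitation": "wet",
     "area": 120.0, "count": 3, "region_map": ["r1", "r2", "r3"]},
    {"temperature": "hot", "precipitation": "dry",
     "area": 80.0, "count": 2, "region_map": ["r4", "r5"]},
    {"temperature": "cold", "precipitation": "wet",
     "area": 200.0, "count": 4, "region_map": ["r6", "r7", "r8", "r9"]},
]

def roll_up(cells, dims):
    """Aggregate cells onto the given dimensions."""
    grouped = defaultdict(lambda: {"area": 0.0, "count": 0, "region_map": []})
    for cell in cells:
        key = tuple(cell[d] for d in dims)
        agg = grouped[key]
        agg["area"] += cell["area"]                  # numerical measure: sum
        agg["count"] += cell["count"]                # numerical measure: sum
        agg["region_map"].extend(cell["region_map"])  # spatial measure: pointers
    return dict(grouped)

by_temperature = roll_up(base_cells, dims=["temperature"])
print(by_temperature)
```

Summing the numerical measures is cheap. Turning the collected pointers into one merged region is the expensive step, and it differs for every roll-up.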
"Can we pre-compute all of the possible spatial merges and store them in the corresponding
cuboid cells of a spatial data cube?" The answer is: probably not. Unlike a numerical measure, where each aggregated value requires only a few bytes of space, a merged region map of
BC may require multi-megabytes of storage. Thus, we face a dilemma in balancing the cost
of on-line computation and the space overhead of storing computed measures: the substantial
computation cost for on-the-fly computation of spatial aggregations calls for precomputation,
yet substantial overhead for storing aggregated spatial values discourages it.
There are at least three possible choices in regard to the computation of spatial measures in
spatial data cube construction:
• Collect and store the corresponding spatial object pointers but do not perform
precomputation of spatial measures in the spatial data cube. This can be
implemented by storing, in the corresponding cube cell, a pointer to a collection
of spatial object pointers, and invoking and performing the spatial merge (or
other computation) of the corresponding spatial objects, when necessary, on the
fly. This method is a good choice if only spatial display is required (i.e., no real
spatial merge has to be performed), or if there are not many regions to be merged
in any pointer collection (so that the on-line merge is not very costly), or if
on-line spatial merge computation is fast (recently, some efficient spatial merge
methods have been developed for fast spatial OLAP). Since OLAP results are
often used for on-line spatial analysis and mining, it is still recommended to
precompute some of the spatially connected regions to speed up such analysis.
• Precompute and store a rough approximation of the spatial measures in the
spatial data cube. This choice is good for a rough view or coarse estimation of
spatial merge results under the assumption that it requires little storage space. For
example, a minimum bounding rectangle (MBR), represented by two points, can
be taken as a rough estimate of a merged region. Such a precomputed result is
small and can be presented quickly to users. If higher precision is needed for
specific cells, the application can either fetch precomputed high-quality results, if
available, or compute them on the fly.
• Selectively precompute some spatial measures in the spatial data cube. This can
be a smart choice. The question becomes, "Which portion of the cube should be
selected for materialization?" The selection can be performed at the cuboid level,
that is, either precompute and store each set of mergeable spatial regions for each
cell of a selected cuboid, or precompute none if the cuboid is not selected. Since a
cuboid usually consists of a large number of spatial objects, it may involve
precomputation and storage of a large number of mergeable spatial objects, some
of which may be rarely used. Therefore, it is recommended to perform selection
at a finer granularity level: examining each group of mergeable spatial objects in
a cuboid to determine whether such a merge should be precomputed. The
decision should be based on the utility (such as access frequency or access
priority), shareability of merged regions, and the balanced overall cost of space
and on-line computation.
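The second choice, storing a rough approximation, can be illustrated with a minimum bounding rectangle. The sketch below assumes that each region is given as a list of (x, y) vertices. The function names and coordinates are hypothetical.

```python
def mbr(polygon):
    """Minimum bounding rectangle of a polygon as (lower-left, upper-right)."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys)), (max(xs), max(ys))

def merge_mbrs(boxes):
    """MBR enclosing several MBRs: a rough estimate of the merged region."""
    lows, highs = zip(*boxes)
    return (
        (min(x for x, _ in lows), min(y for _, y in lows)),
        (max(x for x, _ in highs), max(y for _, y in highs)),
    )

# Three hypothetical probe regions that fall into the same cube cell.
regions = [
    [(0, 0), (2, 0), (2, 1), (0, 1)],
    [(2, 0), (4, 1), (3, 3)],
    [(1, 2), (2, 4), (0, 3)],
]
approx = merge_mbrs([mbr(r) for r in regions])
print(approx)
```

However many regions are merged, the stored approximation stays at two points, which is why it is cheap to precompute and keep in a cube cell.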
With efficient implementation of spatial data cubes and spatial OLAP, generalization-based descriptive spatial mining, such as spatial characterization and discrimination, can be
performed efficiently.
14.4 Mining Spatial Association and Co-location Patterns
Similar to the mining of association rules in transactional and relational databases, spatial
association rules can be mined in spatial databases. A spatial association rule is of the form
A ⇒ B [s%, c%]
where A and B are sets of spatial or non spatial predicates, s% is the support of the rule,
and c% is the confidence of the rule.
For example, the following is a spatial association rule:
is_a(X, "school") ∧ close_to(X, "sports_center") ⇒ close_to(X, "park") [0.5%, 80%].
This rule states that 80% of schools that are close to sports centers are also close to
parks, and 0.5% of the data belongs to such a case.
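To make the two percentages concrete, the following minimal sketch computes support and confidence for a rule of this shape over a small, made-up table of objects. The records, the precomputed predicate flags, and the helper name rule_stats are illustrative assumptions; in a real system the close_to results would come from spatial computation.

```python
# Hypothetical objects with precomputed spatial predicate results.
objects = [
    {"type": "school", "near_sports_center": True,  "near_park": True},
    {"type": "school", "near_sports_center": True,  "near_park": True},
    {"type": "school", "near_sports_center": True,  "near_park": True},
    {"type": "school", "near_sports_center": True,  "near_park": False},
    {"type": "school", "near_sports_center": False, "near_park": True},
    {"type": "shop",   "near_sports_center": True,  "near_park": True},
    {"type": "shop",   "near_sports_center": False, "near_park": False},
    {"type": "bank",   "near_sports_center": False, "near_park": True},
]


def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    n_total = len(records)
    n_antecedent = sum(1 for r in records if antecedent(r))
    n_both = sum(1 for r in records if antecedent(r) and consequent(r))
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


def is_school_near_sports_center(r):
    # A: is_a(X, "school") and close_to(X, "sports_center")
    return r["type"] == "school" and r["near_sports_center"]


def is_near_park(r):
    # B: close_to(X, "park")
    return r["near_park"]


support, confidence = rule_stats(objects, is_school_near_sports_center, is_near_park)
print(f"support = {support:.1%}, confidence = {confidence:.1%}")
```

Here support is the fraction of all objects that satisfy both sides of the rule, and confidence is the fraction of objects satisfying A that also satisfy B.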
Various kinds of spatial predicates can constitute a spatial association rule. Examples
include distance information (such as close_to and far_away), topological relations (like
intersect, overlap, and disjoint), and spatial orientations (like left_of and west_of).
Since spatial association mining needs to evaluate multiple spatial relationships among a
large number of spatial objects, the process could be quite costly. An interesting mining
optimization method called progressive refinement can be adopted in spatial association
analysis. The method first mines large data sets roughly using a fast algorithm and then
improves the quality of mining in a pruned data set using a more expensive algorithm.
To ensure that the pruned data set covers the complete set of answers when applying the
high-quality data mining algorithms at a later stage, an important requirement for the rough
mining algorithm applied in the early stage is the superset coverage property: that is, it
preserves all of the potential answers. In other words, it should allow a false-positive test,
which might include some data sets that do not belong to the answer sets, but it should not
allow a false-negative test, which might exclude some potential answers.
For mining spatial associations related to the spatial predicate close_to, we can first
collect the candidates that pass the minimum support threshold by
applying certain rough spatial evaluation algorithms, for example, using an MBR
structure (which registers only two spatial points rather than a set of complex
polygons), and
evaluating the relaxed spatial predicate, g_close_to, which is a generalized close_to
covering a broader context that includes close_to, touch, and intersect.
If two spatial objects are closely located, their enclosing MBRs must be closely located,
matching g_close_to. However, the reverse is not always true: if the enclosing MBRs are
closely located, the two spatial objects may or may not be located so closely. Thus, the MBR
pruning is a false-positive testing tool for closeness: only those that pass the rough test need
to be further examined using more expensive spatial computation algorithms. With this
preprocessing, only the patterns that are frequent at the approximation level will need to be
examined by more detailed and finer, yet more expensive, spatial computation.
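The filter-and-refine idea can be sketched as follows. This is a minimal illustration, not the method of any particular system: the object names, coordinates, and threshold are made up, and the "expensive" step is simplified to the closest pair of vertices instead of a true polygon distance. Because every vertex lies inside its MBR, the MBR distance never exceeds the vertex distance, so the rough test can produce false positives but no false negatives.

```python
import math


def mbr(points):
    """Minimum bounding rectangle as (min_x, min_y, max_x, max_y)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def mbr_distance(a, b):
    """Smallest distance between two axis-aligned rectangles (0 if they overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)


def exact_distance(p, q):
    """Stand-in for the expensive test: distance of the closest vertex pair."""
    return min(math.dist(u, v) for u in p for v in q)


def close_pairs(objects, threshold):
    boxes = {name: mbr(pts) for name, pts in objects.items()}
    names = sorted(objects)
    # Step 1: cheap rough test on MBRs (plays the role of g_close_to).
    candidates = [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if mbr_distance(boxes[a], boxes[b]) <= threshold
    ]
    # Step 2: expensive test, applied only to the surviving candidates.
    confirmed = [
        (a, b)
        for a, b in candidates
        if exact_distance(objects[a], objects[b]) <= threshold
    ]
    return candidates, confirmed


# Hypothetical objects given as vertex lists.
objects = {
    "school": [(0, 0), (0, 10), (10, 10)],
    "sports_center": [(8, 1), (9, 1), (9, 2)],
    "park": [(11, 10), (12, 11)],
}
candidates, confirmed = close_pairs(objects, threshold=2.0)
print("passed rough test:", candidates)
print("passed exact test:", confirmed)
```

In this toy data the sports center lies inside the school's MBR, so the pair passes the rough test, but the refinement step rejects it.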
Besides mining spatial association rules, one may like to identify groups of particular
features that appear frequently close to each other in a geospatial map. Such a problem is
essentially the problem of mining spatial co-locations. Finding spatial co-locations can be
considered as a special case of mining spatial associations. However, based on the property of
spatial autocorrelation, interesting features likely coexist in closely located regions. Thus
spatial co-location can be just what one really wants to explore. Efficient methods can be
developed for mining spatial co-locations by exploiting methodologies such as Apriori and
progressive refinement, similar to what has been done for mining spatial association rules.
14.5 Spatial Clustering Methods
Spatial data clustering identifies clusters, or densely populated regions, according to
some distance measurement in a large, multidimensional data set. Spatial clustering is a
process of grouping a set of spatial objects into groups called clusters. Objects within a
cluster show a high degree of similarity, whereas objects in different clusters are as
dissimilar as possible. Clustering is a well-known technique in statistics, and clustering
algorithms are widely used to deal with large geographical datasets. Clustering algorithms
can be separated into four general categories:
Partitioning method,
Hierarchical method,
Density-based method,
Grid-based method.
The categorization is based on different cluster definition techniques.
Partitioning Method
A partitioning algorithm organizes the objects into clusters such that the total
deviation of each object from its cluster center is minimized. At the beginning, the objects
are assigned to an initial set of k clusters. In the next steps, the data points are iteratively
reallocated among the clusters until a stopping criterion is met. K-Means is a commonly
used fundamental partitioning algorithm.
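A minimal K-Means sketch shows the reallocation loop. The sample points are made up, and taking the first k points as the initial centers is a simplifying assumption (practical implementations usually use random or smarter seeding).

```python
import math


def kmeans(points, k, max_iter=100):
    """Return (labels, centers) for a list of coordinate tuples."""
    centers = [points[i] for i in range(k)]  # naive initialization (assumption)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # stopping criterion: no point changed cluster
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centers


points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

Each pass reduces (or leaves unchanged) the total squared deviation of the points from their cluster centers, which is why the loop terminates once the assignment stops changing.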
Hierarchical Method
This method hierarchically decomposes the dataset by splitting or merging clusters until a
stopping criterion is met. Well-known hierarchical clustering algorithms include BIRCH
(Balanced Iterative Reducing and Clustering using Hierarchies) and CURE (Clustering Using
Representatives).
Density-Based Method
The method regards clusters as dense regions of objects that are separated by regions of low
density (representing noise). In contrast to partitioning methods, clusters of arbitrary shapes
can be discovered. Density-based methods can be used to filter out noise and outliers.
Grid-Based Method
Grid-based clustering algorithms first quantize the clustering space into a finite number of
cells and then perform the required operations on the grid structure. Cells that contain more
than a certain number of points are treated as dense. The main advantage of the approach is
its fast processing time, since the time is independent of the number of data objects and
depends only on the number of cells.
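The grid-based idea can be sketched in a few lines. This is an illustrative simplification under stated assumptions: the cell size, the density threshold, and the rule of merging edge-adjacent dense cells are choices made for the example and are not taken from a specific published algorithm.

```python
from collections import Counter, deque


def grid_clusters(points, cell_size, min_points):
    """Group dense grid cells into clusters; returns a list of cell lists."""
    # Step 1: quantize the space. Each point falls into one integer-indexed cell.
    counts = Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)
    # Step 2: cells holding more than min_points points are treated as dense.
    dense = {cell for cell, n in counts.items() if n > min_points}
    # Step 3: merge edge-adjacent dense cells with a breadth-first search.
    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        queue, group = deque([start]), []
        while queue:
            cx, cy = queue.popleft()
            group.append((cx, cy))
            for nb in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        clusters.append(sorted(group))
    return clusters


points = [
    (0.1, 0.2), (0.5, 0.5), (0.9, 0.3),   # cell (0, 0)
    (1.2, 0.4), (1.5, 0.5), (1.7, 0.1),   # cell (1, 0)
    (5.1, 5.2), (5.5, 5.5), (5.3, 5.9),   # cell (5, 5)
    (3.3, 3.3),                           # isolated point, treated as noise
]
clusters = grid_clusters(points, cell_size=1.0, min_points=2)
print(clusters)
```

After the single pass that counts points per cell, all remaining work operates on cells only, which is the source of the speed advantage described above.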
14.6 Spatial Classification and Spatial Trend Analysis
Spatial classification analyzes spatial objects to derive classification schemes in
relevance to certain spatial properties, such as the neighborhood of a district, highway, or
river.
Example 3 Spatial classification: Suppose that you would like to classify regions in a
province into rich versus poor according to the average family income. In doing so, you
would like to identify the important spatial-related factors that determine a region's
classification.
Many properties are associated with spatial objects, such as hosting a university,
containing interstate highways, being near a lake or ocean, and so on. These properties can be
used for relevance analysis and to find interesting classification schemes. Such classification
schemes may be represented in the form of decision trees or rules.
Spatial trend analysis deals with another issue: the detection of changes and trends along
a spatial dimension. Typically, trend analysis detects changes with time, such as the changes
of temporal patterns in time-series data. Spatial trend analysis replaces time with space and
studies the trend of nonspatial or spatial data changing with space. For example, we may
observe the trend of changes in economic situation when moving away from the center of a
city, or the trend of changes of the climate or vegetation with the increasing distance from an
ocean. For such analyses, regression and correlation analysis methods are often applied by
utilization of spatial data structures and spatial access methods.
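As a small illustration of regression along a spatial dimension, the sketch below fits a straight line to income as a function of distance from a city center and reports the correlation coefficient. The distance and income figures are invented for the example; a real analysis would obtain the distances through spatial access methods.

```python
def linear_trend(xs, ys):
    """Least-squares line y = intercept + slope * x, plus Pearson correlation r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


# Hypothetical observations: distance from the city center (km) and
# average family income (thousands) of the region at that distance.
distance_km = [1, 2, 3, 4, 5]
income = [90, 82, 71, 64, 52]

slope, intercept, r = linear_trend(distance_km, income)
print(f"income changes by {slope:.1f} per km (r = {r:.3f})")
```

A clearly negative slope with r close to -1 would indicate a strong downward trend in income when moving away from the center.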
14.7 Mining Raster Databases
Spatial database systems usually handle vector data that consist of points, lines, polygons
(regions), and their compositions, such as networks or partitions. Typical examples of such
data include maps, design graphs, and 3-D representations of the arrangement of the chains of
protein molecules. However, a huge amount of space-related data is in digital raster (image)
forms, such as satellite images, remote sensing data, and computer tomography. It is
important to explore data mining in raster or image databases. Methods for mining raster and
image data are examined in the following section regarding the mining of multimedia data.
There are also many applications where patterns are changing with both space and time.
For example, traffic flows on highways and in cities are both time and space related. Weather
patterns are also closely related to both time and space. Although there have been a few
interesting studies on spatial classification and spatial trend analysis, the investigation of
spatiotemporal data mining is still in its early stage. More methods and applications of spatial
classification and trend analysis, especially those associated with time, need to be explored.
Multimedia Data Mining: "What is a multimedia database?" A multimedia database system
stores and manages a large collection of multimedia data, such as audio, video, image,
graphics, speech, text, document, and hypertext data, which contain text, text markups, and
linkages. Multimedia database systems are increasingly common owing to the popular use of
audio-video equipment, digital cameras, CD-ROMs, and the Internet. Typical multimedia
database systems include NASA's EOS (Earth Observing System), various kinds of image
and audio-video databases, and Internet databases.
Similarity Search in Multimedia Data: "When searching for similarities in multimedia
data, can we search on either the data description or the data content?" That is correct. For
similarity searching in multimedia data, we consider two main families of multimedia
indexing and retrieval systems:
(1) description-based retrieval systems, which build indices and perform object retrieval
based on image descriptions, such as keywords, captions, size, and time of creation.
(2) content-based retrieval systems, which support retrieval based on the image content, such
as color histogram, texture, pattern, image topology, and the shape of objects and their
layouts and locations within the image.
Description-based retrieval is labor-intensive if performed manually. If automated, the
results are typically of poor quality. For example, the assignment of keywords to images can
be a tricky and arbitrary task.
Multidimensional Analysis of Multimedia Data: "Can we construct a data cube for
multimedia data analysis?" To facilitate the multidimensional analysis of large multimedia
databases, multimedia data cubes can be designed and constructed in a manner similar to that
for traditional data cubes from relational data. A multimedia data cube can contain additional
dimensions and measures for multimedia information, such as color, texture, and shape.
Mining Associations in Multimedia Data: "What kinds of associations can be mined in
multimedia data?" Association rules involving multimedia objects can be mined in image and
video databases. At least three categories can be observed:
Associations between image content and nonimage content features: A rule like "If at least
50% of the upper part of the picture is blue, then it is likely to represent sky" belongs to this
category since it links the image content to the keyword sky.
Associations among image contents that are not related to spatial relationships: A rule like
"If a picture contains two blue squares, then it is likely to contain one red circle as well"
belongs to this category since the associations are all regarding image contents.
Associations among image contents related to spatial relationships: A rule like "If a red
triangle is between two yellow squares, then it is likely a big oval-shaped object is
underneath" belongs to this category since it associates objects in the image with spatial
relationships.
14.8 Summary
Vast amounts of data are stored in various complex forms, such as structured or unstructured,
hypertext, and multimedia. Thus, mining complex types of data, including object data,
spatial data, multimedia data, text data, and Web data, has become an increasingly important
task in data mining. Spatial data mining is the discovery of interesting patterns from large
geospatial databases. Spatial data cubes that contain spatial dimensions and measures can be
constructed. Spatial OLAP can be implemented to facilitate multidimensional spatial data
analysis. Spatial data mining includes mining spatial association and co-location patterns,
clustering, classification, and spatial trend and outlier analysis. Multimedia data mining is the
discovery of interesting patterns from multimedia databases that store and manage large
collections of multimedia objects, including audio data, image data, video data, sequence
data, and hypertext data containing text, text markups, and linkages. Issues in multimedia
data mining include content-based retrieval and similarity search, and generalization and
multidimensional analysis.
14.9 Keywords
Multimedia Data Mining, Raster Databases, Spatial Classification, Spatial Clustering
Methods, Mining Spatial Association, Co-location Patterns, Spatial Data Cube Construction,
Spatial OLAP, Spatial Data Mining
14.10 Exercises
1. Define spatial data mining.
2. Define spatial data cube construction and spatial OLAP. Give an example.
3. Explain the mining of spatial associations.
4. What are the different types of spatial clustering methods? Explain.
5. What are spatial classification and spatial trend analysis?
6. Explain multimedia data mining.
14.11 References
1. Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber,
Morgan Kaufmann Publisher, Second Edition, 2006.
2. Introduction to Data Mining with Case Studies by G. K. Gupta, Eastern Economy
Edition (PHI, New Delhi), Third Edition, 2009.
3. Data Mining Techniques by Arun K Pujari, University Press, Second Edition, 2009.
UNIT-15: TEXT MINING
Structure
15.1 Objectives
15.2 Introduction
15.3 Mining Text Data
15.4 Text Data Analysis and Information Retrieval
15.5 Dimensionality Reduction for Text
15.6 Text Mining Approaches
15.7 Summary
15.8 Keywords
15.9 Exercises
15.10 References
15.1 Objectives
The objectives covered under this unit include:
An introduction to Text Mining
Techniques for Text Mining
Text Data Analysis
Information Retrieval
Dimensionality Reduction for Text
Text Mining Approaches.
15.2 Introduction
What is text mining?
Text mining, also referred to as text data mining, roughly equivalent to text analytics, refers
to the process of deriving high-quality information from text. High-quality information is
typically derived through the devising of patterns and trends through means such as statistical
pattern learning. Text mining usually involves the process of structuring the input text
(usually parsing, along with the addition of some derived linguistic features and the removal
of others, and subsequent insertion into a database), deriving patterns within the structured
data, and finally evaluation and interpretation of the output.
15.3 Mining Text Data
'High quality' in text mining usually refers to some combination of relevance, novelty, and
interestingness.
Typical text mining tasks include text categorization, text clustering, concept/entity
extraction, production of granular taxonomies, sentiment analysis, document
summarization, and entity relation modeling (i.e., learning relations between named
entities).
Text analysis involves information retrieval, lexical analysis to study word frequency
distributions, pattern recognition, tagging/annotation, information extraction, data
mining techniques including link and association analysis, visualization, and predictive
analytics. The overarching goal is, essentially, to turn text into data for analysis, via
application of natural language processing (NLP) and analytical methods.
A key element of text mining is its focus on the document collection. At its simplest, a
document collection can be any grouping of text-based documents. Practically speaking,
however, most text mining solutions are aimed at discovering patterns across very large
document collections. The number of documents in such collections can range from the many
thousands to the tens of millions.
Document collections can be either static, in which case the initial complement of documents
remains unchanged, or dynamic, which is a term applied to document collections
characterized by their inclusion of new or updated documents over time. Extremely large
document collections, as well as document collections with very high rates of document
change, can pose performance optimization challenges for various components of a text
mining system.
Data stored in most text databases are semistructured data in that they are neither completely
unstructured nor completely structured. For example, a document may contain a few
structured fields, such as title, authors, publication date, category, and so on, but also contain
some largely unstructured text components, such as abstract and contents. There has been a
great deal of study on the modeling and implementation of semistructured data in recent
database research. Moreover, information retrieval techniques, such as text indexing methods,
have been developed to handle unstructured documents.
Traditional information retrieval techniques become inadequate for the increasingly vast
amounts of text data. Typically, only a small fraction of the many available documents will
be relevant to a given individual user. Without knowing what could be in the documents, it is
difficult to formulate effective queries for analyzing and extracting useful information from
the data. Users need tools to compare different documents, rank the importance and relevance
of the documents, or find patterns and trends across multiple documents. Thus, text mining
has become an increasingly popular and essential theme in data mining.
15.4 Text Data Analysis and Information Retrieval
The meaning of the term information retrieval can be very broad. Just getting a credit card
out of your wallet so that you can type in the card number is a form of information retrieval.
Information retrieval (IR) is a field that has been developing in parallel with database systems
for many years. Unlike the field of database systems, which has focused on query and
transaction processing of structured data, information retrieval is concerned with the
organization and retrieval of information from a large number of text-based documents. Since
information retrieval and database systems each handle different kinds of data, some database
system problems are usually not present in information retrieval systems, such as
concurrency control, recovery, transaction management, and update. Also, some common
information retrieval problems are usually not encountered in traditional database systems,
such as unstructured documents, approximate search based on keywords, and the notion of
relevance.
Due to the abundance of text information, information retrieval has found many applications.
There exist many information retrieval systems, such as on-line library catalog systems, on-line document management systems, and the more recently developed Web search engines.
A typical information retrieval problem is to locate relevant documents in a document
collection based on a user's query, which is often some keywords describing an information
need, although it could also be an example relevant document. In such a search problem, a
user takes the initiative to "pull" the relevant information out from the collection; this is most
appropriate when a user has some ad hoc (i.e., short-term) information need, such as finding
information to buy a used car. When a user has a long-term information need (e.g., a
researcher's interests), a retrieval system may also take the initiative to "push" any newly
arrived information item to a user if the item is judged as being relevant to the user's
information need. Such an information access process is called information filtering, and the
corresponding systems are often called filtering systems or recommender systems. From a
technical viewpoint, however, search and filtering share many common techniques. Below
we briefly discuss the major techniques in information retrieval with a focus on search
techniques.
Basic Measures for Text Retrieval: Precision and Recall
Suppose that a text retrieval system has just retrieved a number of documents based on input in
the form of a query. How do we assess how accurate or correct the retrieval was? Let the
set of documents relevant to a query be denoted as {Relevant}, and the set of documents
retrieved be denoted as {Retrieved}. The set of documents that are both relevant and
retrieved is denoted as {Relevant}∩{Retrieved}, as shown below.
Figure: Relationship between the set of relevant documents and the set of retrieved
documents.
There are two basic measures for assessing the quality of text retrieval:
Precision: This is the percentage of retrieved documents that are in fact relevant to the query
(i.e., "correct" responses). It is formally defined as
\[ precision = \frac{|\{Relevant\} \cap \{Retrieved\}|}{|\{Retrieved\}|} \]
Recall: This is the percentage of documents that are relevant to the query and were, in fact,
retrieved. It is formally defined as
\[ recall = \frac{|\{Relevant\} \cap \{Retrieved\}|}{|\{Relevant\}|} \]
An information retrieval system often needs to trade off recall for precision or vice versa.
One commonly used trade-off is the F-score, which is defined as the harmonic mean of recall
and precision:
\[ F\_score = \frac{recall \times precision}{(recall + precision)/2} \]
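The three measures can be computed directly from the two sets. The sketch below is a minimal illustration; the document identifiers are made up, and returning 0 when a denominator is empty is a convention chosen for the example.

```python
def precision_recall_f(relevant, retrieved):
    """Return (precision, recall, F-score) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)            # |{Relevant} ∩ {Retrieved}|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    mean = (precision + recall) / 2
    f_score = precision * recall / mean if mean else 0.0
    return precision, recall, f_score


relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}
precision, recall, f_score = precision_recall_f(relevant, retrieved)
print(f"precision={precision:.3f} recall={recall:.3f} F={f_score:.3f}")
```

Two of the three retrieved documents are relevant (precision 2/3), and two of the four relevant documents were found (recall 1/2), which gives an F-score of 4/7.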
Precision, recall, and F-score are the basic measures of a retrieved set of documents. These
three measures are not directly useful for comparing two ranked lists of documents because
they are not sensitive to the internal ranking of the documents in a retrieved set. In order to
measure the quality of a ranked list of documents, it is common to compute an average of
precisions at all the ranks where a new relevant document is returned. It is also common to
plot a graph of precisions at many different levels of recall; a higher curve represents a better-quality information retrieval system.
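The averaging described above can be sketched as follows. The ranked lists are invented for illustration. Note that this version averages over the relevant documents that actually appear in the list, as the text describes; some evaluation conventions divide by the total number of relevant documents instead.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)   # precision of the top `rank` results
    return sum(precisions) / len(precisions) if precisions else 0.0


relevant = {"a", "b", "c"}
good_ranking = ["a", "b", "x", "c", "y"]   # relevant documents near the top
poor_ranking = ["x", "a", "y", "b", "c"]   # same documents, worse order

ap_good = average_precision(good_ranking, relevant)
ap_poor = average_precision(poor_ranking, relevant)
print(f"{ap_good:.3f} vs {ap_poor:.3f}")
```

Both lists contain the same documents and therefore have identical set-based precision and recall, yet the average precision separates them because it is sensitive to the internal ranking.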
Text Retrieval Methods
Text retrieval methods fall into two categories: They generally either view the retrieval
problem as a document selection problem or as a document ranking problem.
In document selection methods, the query is regarded as specifying constraints for selecting
relevant documents. A typical method of this category is the Boolean retrieval model, in
which a document is represented by a set of keywords and a user provides a Boolean
expression of keywords, such as "car and repair shops," "tea or coffee," or "database systems
but not Oracle." The retrieval system would take such a Boolean query and return documents
that satisfy the Boolean expression. Because of the difficulty in prescribing a user's
information need exactly with a Boolean query, the Boolean retrieval method generally only
works well when the user knows a lot about the document collection and can formulate a
good query in this way.
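With each document represented as a set of keywords, the three example queries map directly onto set operations. The tiny collection below is hypothetical and the helper name containing is an assumption made for the sketch.

```python
# Hypothetical collection: document id -> set of keywords.
docs = {
    "d1": {"car", "repair", "shops"},
    "d2": {"tea", "house"},
    "d3": {"database", "systems", "oracle"},
    "d4": {"database", "systems", "postgres"},
    "d5": {"coffee", "car"},
}


def containing(term):
    """Set of ids of the documents whose keyword set includes the term."""
    return {doc_id for doc_id, words in docs.items() if term in words}


# AND -> intersection, OR -> union, BUT NOT -> set difference.
car_and_repair = containing("car") & containing("repair")
tea_or_coffee = containing("tea") | containing("coffee")
db_not_oracle = (containing("database") & containing("systems")) - containing("oracle")

print(sorted(car_and_repair), sorted(tea_or_coffee), sorted(db_not_oracle))
```

A document either satisfies the expression or it does not, so the result is an unordered set with no notion of one match being better than another.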
Document ranking methods use the query to rank all documents in the order of relevance. For
ordinary users and exploratory queries, these methods are more appropriate than document
selection methods. Most modern information retrieval systems present a ranked list of
documents in response to a user's keyword query. There are many different ranking methods
based on a large spectrum of mathematical foundations, including algebra, logic, probability,
and statistics. The common intuition behind all of these methods is that we may match the
keywords in a query with those in the documents and score each document based on how
well it matches the query. The goal is to approximate the degree of relevance of a document
with a score computed based on information such as the frequency of words in the document
and the whole collection. Notice that it is inherently difficult to provide a precise measure of
the degree of relevance between a set of keywords. For example, it is difficult to quantify the
distance between data mining and data analysis. Comprehensive empirical evaluation is thus
essential for validating any retrieval method.
The basic idea of the vector space model is to represent a document and a query both as
vectors in a high-dimensional space corresponding to all the keywords and use an appropriate
similarity measure to compute the similarity between the query vector and the document
vector. The similarity values can then be used for ranking documents.
Tokenize text
The first step in most retrieval systems is to identify keywords for representing documents,
a preprocessing step often called tokenization. To avoid indexing useless words, a text
retrieval system often associates a stop list with a set of documents. A stop list is a set of
words that are deemed "irrelevant." For example, a, the, of, for, with, and so on are stop
words, even though they may appear frequently. Stop lists may vary per document set. For
example, database systems could be an important keyword in a newspaper. However, it may
be considered as a stop word in a set of research papers presented in a database systems
conference.
A group of different words may share the same word stem. A text retrieval system needs to
identify groups of words where the words in a group are small syntactic variants of one
another and collect only the common word stem per group. For example, the group of words
drug, drugged, and drugs, share a common word stem, drug, and can be viewed as different
occurrences of the same word.
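The preprocessing steps above (tokenization, stop-word removal, and stemming) can be sketched together. The stop list and the suffix-stripping rules are deliberately crude assumptions for illustration; production systems typically use an established stemmer such as Porter's algorithm.

```python
import re

# Illustrative stop list; a real one would be chosen per document collection.
STOP_WORDS = {"a", "the", "of", "for", "with", "and", "were", "by"}


def crude_stem(word):
    """Very rough stemmer: strip one common suffix, then undo consonant doubling."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # e.g. "drugged" -> "drugg" -> "drug"
    if len(word) >= 4 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


def keywords(text):
    """Lowercase, split into alphabetic tokens, drop stop words, and stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


stems = [crude_stem(w) for w in ("drug", "drugged", "drugs")]
terms = keywords("The drugged patients were treated with the drugs of a new drug trial")
print(stems)
print(terms)
```

All three variants of "drug" collapse to the same stem, so they are counted as occurrences of one index term.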
Model a document to facilitate information retrieval
Starting with a set of d documents and a set of t terms, we can model each document as a
vector v in the t-dimensional space R^t, which is why this method is called the vector-space
model. Let the term frequency be the number of occurrences of term t in the document d, that
is, freq(d, t). The (weighted) term-frequency matrix TF(d, t) measures the association of a
term t with respect to the given document d: it is generally defined as 0 if the document does
not contain the term, and nonzero otherwise. There are many ways to define the term-weighting
for the nonzero entries in such a vector. For example, we can simply set TF(d, t) =
1 if the term t occurs in the document d, or use the term frequency freq(d, t ), or the relative
term frequency, that is, the term frequency versus the total number of occurrences of all the
terms in the document. There are also other ways to normalize the term frequency. For
example, the Cornell SMART system uses the following formula to compute the
(normalized) term frequency:
\[ TF(d, t) = \begin{cases} 0 & \text{if } freq(d, t) = 0 \\ 1 + \log(1 + \log(freq(d, t))) & \text{otherwise.} \end{cases} \]
Besides the term frequency measure, there is another important measure, called inverse
document frequency (IDF), that represents the scaling factor, or the importance, of a term t .
If a term t occurs in many documents, its importance will be scaled down due to its reduced
discriminative power. For example, the term database systems may likely be less important if
it occurs in many research papers in a database system conference. According to the same
Cornell SMART system, IDF(t) is defined by the following formula:
\[ IDF(t) = \log \frac{1 + |d|}{|d_t|} \]
where d is the document collection, and d_t is the set of documents containing term t. If
|d_t| ≪ |d|, the term t will have a large IDF scaling factor and vice versa.
In a complete vector-space model, TF and IDF are combined together, which forms the TF-IDF
measure:
TF-IDF(d, t) = TF(d, t) × IDF(t).
Let us examine how to compute similarity among a set of documents based on the notions of
term frequency and inverse document frequency.
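One way to do this is sketched below, using the TF and IDF formulas given above and cosine similarity as the similarity measure. The three-document collection is invented, the natural logarithm is an assumption (the formulas do not fix the base), and assigning an IDF of 0 to terms absent from the collection is a convention chosen for the example.

```python
import math
from collections import Counter

# Hypothetical collection: document id -> list of (already preprocessed) terms.
docs = {
    "d1": ["data", "mining", "text", "mining"],
    "d2": ["data", "warehouse", "olap"],
    "d3": ["text", "retrieval", "index"],
}


def tf(freq):
    """Normalized term frequency, following the formula in the text."""
    return 0.0 if freq == 0 else 1 + math.log(1 + math.log(freq))


def idf(term):
    """Inverse document frequency; 0 for terms that occur in no document."""
    d_t = sum(1 for words in docs.values() if term in words)
    return math.log((1 + len(docs)) / d_t) if d_t else 0.0


def tfidf_vector(words):
    """Sparse TF-IDF vector as a dict: term -> weight."""
    return {t: tf(f) * idf(t) for t, f in Counter(words).items()}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


query = tfidf_vector(["text", "mining"])
scores = {d: cosine(query, tfidf_vector(words)) for d, words in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The query is treated as a very short document, so the same weighting applies to both sides, and documents sharing no term with the query receive a similarity of 0.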
Text Indexing Techniques
Text indexing is the act of processing a text in order to extract statistics considered important
for representing the information available and to allow fast search on its content. Text
indexing operations can be performed not only on natural language texts, but virtually on any
type of textual information, such as source code of computer programs, DNA or protein
databases and textual data stored in traditional database systems. There are several popular
text retrieval indexing techniques, including inverted indices and signature files.
Text index compression is the problem of designing a reduced-space data structure that
provides fast search of a text collection, seen as a set of documents. In Information Retrieval
(IR) the searches to support are usually for whole words or phrases, either to retrieve the list
of all documents where they appear (full-text searching) or to retrieve a ranked list of the
documents where those words or phrases are most relevant according to some criterion
(relevance ranking). Inverted indexes (sometimes also called inverted lists or inverted
files) are by far the most popular type of text index in IR, and the techniques used to
compress them differ depending on whether they are oriented to full-text searching or to
relevance ranking.
Query Processing Techniques
Once an inverted index is created for a document collection, a retrieval system can answer a
keyword query quickly by looking up which documents contain the query keywords.
Specifically, we will maintain a score accumulator for each document and update these
accumulators as we go through each query term. For each query term, we will fetch all of the
documents that match the term and increase their scores.
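The accumulator scheme can be sketched as follows. The documents are made up, whitespace splitting stands in for real tokenization, and adding the raw term frequency to the accumulator is an assumed scoring rule chosen for simplicity (a real system might add TF-IDF weights instead).

```python
from collections import defaultdict

# Hypothetical collection: document id -> text.
docs = {
    "d1": "text mining finds patterns in text",
    "d2": "data mining and data warehousing",
    "d3": "inverted index for text retrieval",
}


def build_inverted_index(collection):
    """Map each term to a posting list: {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in collection.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, query):
    """Rank documents by accumulated score, best first (ties broken by id)."""
    scores = defaultdict(int)          # one score accumulator per matching document
    for term in query.lower().split():
        for doc_id, freq in index.get(term, {}).items():
            scores[doc_id] += freq     # increase the score of each matching document
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


index = build_inverted_index(docs)
results = search(index, "text mining")
print(results)
```

Only the posting lists of the query terms are read, so documents that contain none of the query keywords are never touched.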
When examples of relevant documents are available, the system can learn from such
examples to improve retrieval performance. This is called relevance feedback and has proven
to be effective in improving retrieval performance. When we do not have such relevant
examples, a system can assume the top few retrieved documents in some initial retrieval
results to be relevant and extract more related keywords to expand a query. Such feedback is
called pseudo-feedback or blind feedback and is essentially a process of mining useful
keywords from the top retrieved documents. Pseudo-feedback also often leads to improved
retrieval performance.
One major limitation of many existing retrieval methods is that they are based on exact
keyword matching. However, due to the complexity of natural languages, keyword-based
retrieval can encounter two major difficulties. The first is the synonymy problem: two words
with identical or similar meanings may have very different surface forms. For example, a
user's query may use the word "automobile," but a relevant document may use "vehicle"
instead of "automobile." The second is the polysemy problem: the same keyword, such as
mining, or Java, may mean different things in different contexts.
We now discuss some advanced techniques that can help solve these problems as well as
reduce the index size.
15.5 Dimensionality Reduction for Text
Text-based queries can then be represented as vectors, which can be used to search for their
nearest neighbors in a document collection. However, for any nontrivial document database,
the number of terms T and the number of documents D are usually quite large. Such high
dimensionality leads to the problem of inefficient computation, since the resulting frequency
table will have size T × D. Furthermore, the high dimensionality also leads to very sparse
vectors and increases the difficulty in detecting and exploiting the relationships among terms
(e.g., synonymy). To overcome these problems, dimensionality reduction techniques such as
latent semantic indexing, probabilistic latent semantic analysis, and locality preserving
indexing can be used.
We now briefly introduce these methods. To explain the basic idea beneath latent semantic
indexing and locality preserving indexing, we need to use some matrix and vector notations.
In the following part, we use x_1, ..., x_n ∈ R^m to represent the n documents with m
features (words). They can be represented as a term-document matrix X = [x_1, x_2, ..., x_n].
Latent Semantic Indexing
Latent semantic indexing (LSI) is one of the most popular algorithms for document
dimensionality reduction. It is fundamentally based on SVD (singular value decomposition).
Suppose the rank of the term-document matrix X is r. Then LSI decomposes X using SVD as
follows:
$$X = U \Sigma V^T$$
where $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ and $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r$ are the singular values of X,
$U = [a_1, \ldots, a_r]$ and each $a_i$ is called a left singular vector, and $V = [v_1, \ldots, v_r]$ and each
$v_i$ is called a right singular vector. LSI uses the first k vectors in U as the transformation
matrix to embed the original documents into a k-dimensional subspace. It can be easily
checked that the column vectors of U are eigenvectors of $X X^T$. The basic idea of LSI is to
extract the most representative features while minimizing the reconstruction error.
Let a be the transformation vector. The objective function of LSI can be stated as follows:
$$a_{opt} = \arg\min_a \| X - a a^T X \|^2 = \arg\max_a a^T X X^T a$$
with the constraint
$$a^T a = 1.$$
Since $X X^T$ is symmetric, the basis functions of LSI are orthogonal.
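The equivalence between minimizing the reconstruction error and maximizing $a^T X X^T a$ can be checked numerically. The following is a minimal sketch in plain Python, not a production LSI implementation: it finds only the first transformation vector by power iteration on $X X^T$ (a technique swapped in here for a full SVD), and the toy term-document counts and term labels are invented for illustration.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def top_left_singular_vector(X, iters=200):
    """Power iteration on X X^T; returns a unit vector a of length m."""
    Xt = transpose(X)
    a = [1.0] * len(X)
    for _ in range(iters):
        a = matvec(X, matvec(Xt, a))            # a <- X X^T a
        norm = math.sqrt(sum(v * v for v in a))
        a = [v / norm for v in a]               # enforce a^T a = 1
    return a

def captured_energy(X, a):
    """a^T X X^T a, the quantity maximized by LSI."""
    y = matvec(transpose(X), a)                 # y_i = a^T x_i (1-D embedding)
    return sum(v * v for v in y)

def reconstruction_error(X, a):
    """Squared Frobenius norm of X - a a^T X."""
    y = matvec(transpose(X), a)
    return sum((X[r][c] - a[r] * y[c]) ** 2
               for r in range(len(X)) for c in range(len(X[0])))

# Hypothetical toy data: rows are terms, columns are documents.
X = [
    [2.0, 1.0, 0.0, 0.0],   # "automobile"
    [1.0, 2.0, 1.0, 0.0],   # "vehicle"
    [0.0, 0.0, 2.0, 3.0],   # "mining"
]
a = top_left_singular_vector(X)
total = sum(v * v for row in X for v in row)
print(reconstruction_error(X, a), total - captured_energy(X, a))
```

For any unit vector a the two printed values agree, because the error equals the total squared mass of X minus $a^T X X^T a$; this is why the minimizer of the first expression is the maximizer of the second.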
Locality Preserving Indexing
Different from LSI, which aims to extract the most representative features, Locality
Preserving Indexing (LPI) aims to extract the most discriminative features. The basic idea of
LPI is to preserve the locality information (i.e., if two documents are near each other in the
original document space, LPI tries to keep these two documents close together in the reduced
dimensionality space). Since the neighboring documents (data points in high-dimensional
space) probably relate to the same topic, LPI is able to map the documents related to the same
semantics as close to each other as possible.
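Given the document set $x_1, \ldots, x_n \in \mathbb{R}^m$, LPI constructs a similarity matrix $S \in \mathbb{R}^{n \times n}$. The
transformation vectors of LPI can be obtained by solving the following minimization problem:
$$a_{opt} = \arg\min_a \sum_{i,j} \left( a^T x_i - a^T x_j \right)^2 S_{ij} = \arg\min_a a^T X L X^T a$$
with the constraint $a^T X D X^T a = 1$,
where $L = D - S$ is the graph Laplacian and $D_{ii} = \sum_j S_{ij}$. $D_{ii}$ measures the local density
around $x_i$. LPI constructs the similarity matrix S as
$$S_{ij} = \begin{cases} \dfrac{x_i^T x_j}{\|x_i\| \, \|x_j\|}, & \text{if } x_i \text{ is among the } p \text{ nearest neighbors of } x_j \text{ or } x_j \text{ is among the } p \text{ nearest neighbors of } x_i \\ 0, & \text{otherwise} \end{cases}$$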
Thus, the objective function in LPI incurs a heavy penalty if neighboring points $x_i$ and $x_j$ are
mapped far apart. Therefore, minimizing it is an attempt to ensure that if $x_i$ and $x_j$ are
"close" then $y_i$ ($= a^T x_i$) and $y_j$ ($= a^T x_j$) are close as well. Finally, the basis functions of
LPI are the eigenvectors associated with the smallest eigenvalues of the following
generalized eigen-problem:
$$X L X^T a = \lambda X D X^T a$$
LSI aims to find the best subspace approximation to the original document space in the sense
of minimizing the global reconstruction error. In other words, LSI seeks to uncover the most
representative features. LPI aims to discover the local geometrical structure of the document
space. Since the neighboring documents (data points in high-dimensional space) probably
relate to the same topic, LPI can have more discriminating power than LSI. Theoretical
analysis of LPI shows that LPI is an unsupervised approximation of the supervised Linear
Discriminant Analysis (LDA). Therefore, for document clustering and document
classification, we might expect LPI to have better performance than LSI. This was confirmed
empirically.
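To make the LPI construction concrete, the sketch below builds a p-nearest-neighbor similarity matrix S, the degree values $D_{ii}$ and the graph Laplacian L = D − S for a handful of toy documents, and then checks that the pairwise penalty equals twice the quadratic form $a^T X L X^T a$ (so both have the same minimizer). It assumes cosine similarity for the edge weights and uses invented document vectors and an arbitrary projection vector a; it does not solve the generalized eigen-problem itself.

```python
import math

def cosine(u, v):
    dot = sum(p * q for p, q in zip(u, v))
    return dot / (math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v)))

def similarity_matrix(docs, p):
    """S_ij = cosine(x_i, x_j) if either document is among the other's p nearest neighbors."""
    n = len(docs)
    cos = [[cosine(docs[i], docs[j]) for j in range(n)] for i in range(n)]
    nbrs = [set(sorted((j for j in range(n) if j != i),
                       key=lambda j: -cos[i][j])[:p]) for i in range(n)]
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and (j in nbrs[i] or i in nbrs[j]):
                S[i][j] = cos[i][j]
    return S

def laplacian(S):
    """Return (D, L) where D holds the diagonal entries D_ii and L = D - S."""
    n = len(S)
    D = [sum(row) for row in S]
    L = [[(D[i] if i == j else 0.0) - S[i][j] for j in range(n)] for i in range(n)]
    return D, L

def project(docs, a):
    return [sum(ak * xk for ak, xk in zip(a, x)) for x in docs]   # y_i = a^T x_i

def pairwise_objective(docs, S, a):
    y = project(docs, a)
    n = len(y)
    return sum((y[i] - y[j]) ** 2 * S[i][j] for i in range(n) for j in range(n))

def quadratic_form(docs, L, a):
    y = project(docs, a)                       # a^T X L X^T a = y^T L y
    n = len(y)
    return sum(y[i] * L[i][j] * y[j] for i in range(n) for j in range(n))

# Hypothetical documents (each a term-count vector) and projection vector.
docs = [[2.0, 1.0, 0.0, 0.0], [1.0, 2.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 2.0], [0.0, 1.0, 2.0, 1.0]]
a = [0.5, -0.2, 0.1, 0.3]
S = similarity_matrix(docs, p=1)
D, L = laplacian(S)
print(pairwise_objective(docs, S, a), 2 * quadratic_form(docs, L, a))
```

In practice the minimizing vectors would come from a generalized eigen-solver applied to $X L X^T$ and $X D X^T$; the check above only illustrates why the two forms of the objective are interchangeable.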
Probabilistic Latent Semantic Indexing
The probabilistic latent semantic indexing (PLSI) method is similar to LSI, but achieves
dimensionality reduction through a probabilistic mixture model. Specifically, we assume
there are k latent common themes in the document collection, and each is characterized by a
multinomial word distribution. A document is regarded as a sample of a mixture model with
these theme models as components. We fit such a mixture model to all the documents, and
the obtained k component multinomial models can be regarded as defining k new semantic
dimensions. The mixing weights of a document can be used as a new representation of the
document in the low latent semantic dimensions.
Formally, let $C = \{d_1, d_2, \ldots, d_n\}$ be a collection of n documents. Let $\theta_1, \ldots, \theta_k$ be k
theme multinomial distributions. A word w in document $d_i$ is regarded as a sample of the
following mixture model:
$$p_{d_i}(w) = \sum_{j=1}^{k} \pi_{d_i,j} \, p(w \mid \theta_j)$$
where $\pi_{d_i,j}$ is a document-specific mixing weight for the j-th aspect theme, and
$$\sum_{j=1}^{k} \pi_{d_i,j} = 1.$$
The log-likelihood of the collection C is
$$\log p(C \mid \Lambda) = \sum_{i=1}^{n} \sum_{w \in V} \left[ c(w, d_i) \log \sum_{j=1}^{k} \pi_{d_i,j} \, p(w \mid \theta_j) \right]$$
where V is the set of all the words (i.e., the vocabulary), $c(w, d_i)$ is the count of word w in
document $d_i$, and $\Lambda = (\{\theta_j\}_{j=1}^{k}, \{\pi_{d_i,j}\}_{i=1,j=1}^{n,k})$ is the set of all the theme model parameters.
The model can be estimated using the Expectation-Maximization (EM) algorithm, which
computes the following maximum likelihood estimate:
$$\hat{\Lambda} = \arg\max_{\Lambda} \log p(C \mid \Lambda)$$
Once the model is estimated, $\theta_1, \ldots, \theta_k$ define k new semantic dimensions and $\pi_{d_i,j}$ gives a
representation of $d_i$ in this low-dimensional space.
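The EM estimation of this mixture model can be sketched in a few lines of plain Python. The version below is a minimal illustration under stated assumptions: the count matrix is a made-up toy example, the starting point is an arbitrary deterministic one chosen only to break symmetry, and no smoothing or prior knowledge is used. The names `plsi_em`, `pi` and `theta` are ours, standing for the mixing weights and theme word distributions.

```python
import math

def plsi_em(counts, k, iters=50):
    """counts[i][w] = c(w, d_i). Returns (pi, theta, log-likelihood trace)."""
    n, V = len(counts), len(counts[0])
    pi = [[1.0 / k] * k for _ in range(n)]
    theta = []
    for j in range(k):                      # asymmetric start (arbitrary choice)
        row = [1.0 + ((w + j) % k) for w in range(V)]
        s = sum(row)
        theta.append([v / s for v in row])
    trace = []
    for _ in range(iters):
        new_pi = [[0.0] * k for _ in range(n)]
        new_theta = [[0.0] * V for _ in range(k)]
        ll = 0.0
        for i in range(n):
            for w in range(V):
                c = counts[i][w]
                if c == 0:
                    continue
                joint = [pi[i][j] * theta[j][w] for j in range(k)]
                p_w = sum(joint)            # p_{d_i}(w) under current parameters
                ll += c * math.log(p_w)
                for j in range(k):
                    r = c * joint[j] / p_w  # E-step: expected count owed to theme j
                    new_pi[i][j] += r
                    new_theta[j][w] += r
        trace.append(ll)                    # log-likelihood before this update
        # M-step: renormalize the expected counts
        pi = [[v / sum(row) for v in row] for row in new_pi]
        theta = [[v / sum(row) for v in row] for row in new_theta]
    return pi, theta, trace

# Hypothetical collection: 4 documents over a 4-word vocabulary.
counts = [[4, 3, 0, 0], [3, 4, 1, 0], [0, 0, 4, 3], [0, 1, 3, 4]]
pi, theta, trace = plsi_em(counts, k=2)
for i, weights in enumerate(pi):
    print(i, [round(v, 3) for v in weights])
```

Each row of `pi` is the low-dimensional representation of one document, and the recorded log-likelihood never decreases from one iteration to the next, which is the defining property of EM.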
15.6 Text Mining Approaches
There are many approaches to text mining, which can be classified from different
perspectives, based on the inputs taken in the text mining system and the data mining tasks to
be performed. In general, the major approaches, based on the kinds of data they take as input,
are: (1) the keyword-based approach, where the input is a set of keywords or terms in the
documents, (2) the tagging approach, where the input is a set of tags, and (3) the
information-extraction approach, which inputs semantic information, such as events, facts,
or entities uncovered by information extraction. A simple keyword-based approach may
only discover relationships at a relatively shallow level, such as rediscovery of compound
nouns (e.g., "database" and "systems") or co-occurring patterns with less significance (e.g.,
"terrorist" and "explosion"). It may not bring much deep understanding to the text. The
tagging approach may rely on tags obtained by manual tagging (which is costly and is
unfeasible for large collections of documents) or by some automated categorization algorithm
(which may process a relatively small set of tags and require defining the categories
beforehand). The information-extraction approach is more advanced and may lead to the
discovery of some deep knowledge, but it requires semantic analysis of text by natural
language understanding and machine learning methods. This is a challenging knowledge
discovery task.
Various text mining tasks can be performed on the extracted keywords, tags, or semantic
information. These include document clustering, classification, information extraction,
association analysis, and trend analysis. We examine a few such tasks in the following
discussion.
Keyword-Based Association Analysis
Like most analyses of text databases, association analysis first preprocesses the text data
by parsing, stemming, removing stop words, and so on, and then invokes association mining
algorithms. In a document database, each document can be viewed as a transaction, while a
set of keywords in the document can be considered as a set of items in the transaction. That
is, the database is in the format
{Document id, a set of keywords}.
The problem of keyword association mining in document databases is thereby mapped to
item association mining in transaction databases, where many interesting methods have been
developed.
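The mapping from documents to transactions can be illustrated with a short sketch. It assumes a toy document database in the {Document id, a set of keywords} format and simply counts the support of every keyword pair by brute force; a real system would use an association mining algorithm such as Apriori instead. The document ids and keywords are invented.

```python
from collections import Counter
from itertools import combinations

def frequent_keyword_sets(docs, min_support, size=2):
    """Return {keyword tuple: support} for keyword sets of the given size."""
    counts = Counter()
    for keywords in docs.values():
        # Each document is one transaction; its keywords are the items.
        for itemset in combinations(sorted(keywords), size):
            counts[itemset] += 1
    n = len(docs)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

# Hypothetical document database: {document id: set of keywords}
docs = {
    "d1": {"stanford", "university", "research"},
    "d2": {"stanford", "university", "tuition"},
    "d3": {"dollars", "shares", "exchange"},
    "d4": {"stanford", "university", "shares"},
}
frequent_pairs = frequent_keyword_sets(docs, min_support=0.5)
print(frequent_pairs)
```

With a minimum support of 50%, only the pair (stanford, university) survives, since it occurs in three of the four documents.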
A set of frequently occurring consecutive or closely located keywords may form a term or a
phrase. The association mining process can help detect compound associations, that is,
domain-dependent terms or phrases, such as [Stanford, University] or [U.S., President,
George W. Bush], or noncompound associations, such as [dollars, shares, exchange, total,
commission, stake, securities]. Mining based on these associations is referred to as "term-level
association mining" (as opposed to mining on individual words). Term recognition and
term-level association mining enjoy two advantages in text analysis: (1) terms and phrases
are automatically tagged so there is no need for human effort in tagging documents; and (2)
the number of meaningless results is greatly reduced, as is the execution time of the mining
algorithms.
With such term and phrase recognition, term-level mining can be invoked to find associations
among a set of detected terms and keywords. Some users may like to find associations
between pairs of keywords or terms from a given set of keywords or phrases, whereas others
may wish to find the maximal set of terms occurring together. Therefore, based on user
mining requirements, standard association mining or max-pattern mining algorithms may be
invoked.
Document Classification Analysis
Automated document classification is an important text mining task because, with the
existence of a tremendous number of on-line documents, it is tedious yet essential to be able
to automatically organize such documents into classes to facilitate document retrieval and
subsequent analysis. Document classification has been used in automated topic tagging (i.e.,
assigning labels to documents), topic directory construction, identification of the document
writing styles (which may help narrow down the possible authors of anonymous documents),
and classifying the purposes of hyperlinks associated with a set of documents.
A general procedure is as follows: First, a set of pre-classified documents is taken as the
training set. The training set is then analyzed in order to derive a classification scheme. Such
a classification scheme often needs to be refined with a testing process. The so-derived
classification scheme can be used for classification of other on-line documents.
This process appears similar to the classification of relational data. However, there is a
fundamental difference. Relational data are well structured: each tuple is defined by a set of
attribute-value pairs. For example, in the tuple {sunny, warm, dry, not windy, play tennis},
the value "sunny" corresponds to the attribute weather outlook, "warm" corresponds to the
attribute temperature, and so on. The classification analysis decides which set of
attribute-value pairs has the greatest discriminating power in determining whether a person is
going to play tennis. On the other hand, document databases are not structured according to
attribute-value pairs. That is, a set of keywords associated with a set of documents is not organized
into a fixed set of attributes or dimensions. If we view each distinct keyword, term, or feature
in the document as a dimension, there may be thousands of dimensions in a set of documents.
Therefore, commonly used relational data-oriented classification methods, such as decision
tree analysis, may not be effective for the classification of document databases.
According to the vector-space model, two documents are similar if they share similar
document vectors. This model motivates the construction of the k-nearest-neighbor classifier,
based on the intuition that similar documents are expected to be assigned the same class label.
We can simply index all of the training documents, each associated with its corresponding
class label. When a test document is submitted, we can treat it as a query to the IR system and
retrieve from the training set k documents that are most similar to the query, where k is a
tunable constant. The class label of the test document can be determined based on the class
label distribution of its k nearest neighbors. This class label distribution can also be refined,
for example by using weighted counts instead of raw counts, or by setting aside a portion of
the labeled documents for validation. By tuning k and incorporating the suggested refinements,
this kind of classifier can achieve accuracy comparable with that of the best classifiers.
However, since the
method needs nontrivial space to store (possibly redundant) training information and
additional time for inverted index lookup, it has additional space and time overhead in
comparison with other kinds of classifiers.
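A minimal sketch of such a k-nearest-neighbor document classifier follows. It assumes documents are sparse term-weight dictionaries compared by cosine similarity, and it supports the weighted-count refinement mentioned above by letting each neighbor vote with its similarity. The training documents and labels are invented, and the linear scan stands in for the inverted-index lookup an IR system would use.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(query, training, k=3, weighted=True):
    """training is a list of (vector, label) pairs; returns the winning label."""
    ranked = sorted(training, key=lambda ex: cosine(query, ex[0]), reverse=True)[:k]
    votes = defaultdict(float)
    for vec, label in ranked:
        # Weighted counts: a closer neighbor contributes a larger vote.
        votes[label] += cosine(query, vec) if weighted else 1.0
    return max(votes, key=votes.get)

# Hypothetical labeled training documents.
training = [
    ({"stock": 2, "market": 1}, "finance"),
    ({"shares": 1, "stock": 1}, "finance"),
    ({"goal": 2, "match": 1}, "sports"),
    ({"match": 1, "team": 2}, "sports"),
]
print(knn_classify({"stock": 1, "shares": 1}, training))
print(knn_classify({"team": 1, "match": 1}, training))
```

The value of k and the choice between raw and weighted votes are the tunable parts; both would normally be selected on a held-out validation portion of the labeled documents.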
The vector-space model may assign a large weight to rare terms regardless of their class
distribution characteristics. Such rare terms may lead to ineffective classification. Let's
examine an example in the TF-IDF measure computation. Suppose there are two terms t1 and
t2 in two classes C1 and C2, each having 100 training documents. Term t1 occurs in five
documents in each class (i.e., 5% of the overall corpus), but t2 occurs in 20 documents in
class C1 only (i.e., 10% of the overall corpus). Term t1 will have a higher TF-IDF value
because it is rarer, but it is obvious that t2 has stronger discriminative power in this case. A
feature selection process can be used to remove terms in the training documents that are
statistically uncorrelated with the class labels. This will reduce the set of terms to be used in
classification, thus improving both efficiency and accuracy.
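The numbers in this example can be worked through directly. The sketch below computes the inverse document frequency of t1 and t2, and then scores each term with a chi-square statistic against the class labels. The chi-square test is our choice of correlation measure for illustration; the text does not prescribe a particular feature selection statistic.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency, log(N / df)."""
    return math.log(n_docs / doc_freq)

def chi_square(present_c1, present_c2, size_c1, size_c2):
    """Chi-square statistic of the 2x2 table (term present/absent vs. class C1/C2)."""
    a, b = present_c1, present_c2
    c, d = size_c1 - a, size_c2 - b
    n = size_c1 + size_c2
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

# t1: 5 documents in each class; t2: 20 documents in C1 only (100 documents per class)
idf_t1 = idf(200, 10)
idf_t2 = idf(200, 20)
chi_t1 = chi_square(5, 5, 100, 100)
chi_t2 = chi_square(20, 0, 100, 100)
print(round(idf_t1, 3), round(idf_t2, 3))   # t1 gets the larger IDF
print(chi_t1, round(chi_t2, 2))             # t2 gets the larger class correlation
```

Term t1 has the higher IDF, yet its chi-square score is exactly zero because it is spread evenly over both classes, so a feature selection step based on this statistic would discard t1 and keep t2.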
After feature selection, which removes nonfeature terms, the resulting "cleansed" training
documents can be used for effective classification. Bayesian classification is one of several
popular techniques that can be used for effective document classification. Since document
classification can be viewed as the calculation of the statistical distribution of documents in
specific classes, a Bayesian classifier first trains the model by calculating a generative
document distribution P(d|c) for each class c of document d and then tests which class is most
likely to generate the test document. Since both the k-nearest-neighbor and Bayesian methods
handle high-dimensional data sets, they can be used for effective document classification.
Other classification methods have also been used in document classification. For example, if
we represent classes by numbers and construct a direct mapping function from term space to
the class variable, support vector machines can be used to perform effective classification
since they work well in high-dimensional space. The least-square linear regression method is
also used as a method for discriminative classification.
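As an illustration of the Bayesian approach, the following sketch trains a multinomial naive Bayes model, which is one common way to define the generative distribution P(d|c); the text does not say which Bayesian model is intended, so this choice, the add-one smoothing, and the toy training documents are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (list of words, class label)."""
    class_docs = Counter()               # number of training documents per class
    word_counts = defaultdict(Counter)   # word counts per class
    vocab = set()
    for words, label in labeled_docs:
        class_docs[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_docs, word_counts, vocab

def classify_nb(words, model):
    """Return the class c maximizing log P(c) + sum over words of log P(w|c)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best, best_score = None, -math.inf
    for c in class_docs:
        total = sum(word_counts[c].values())
        score = math.log(class_docs[c] / total_docs)
        for w in words:
            if w in vocab:               # words unseen in training are ignored
                # add-one (Laplace) smoothing avoids zero probabilities
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical training set.
model = train_nb([
    (["stock", "market", "shares"], "finance"),
    (["stock", "exchange"], "finance"),
    (["match", "goal", "team"], "sports"),
    (["team", "goal"], "sports"),
])
print(classify_nb(["stock", "shares"], model))
print(classify_nb(["goal", "team"], model))
```

Training amounts to counting, and classification asks which class is most likely to have generated the test document, which mirrors the two steps described above.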
Association-based classification classifies documents based on a set of associated,
frequently occurring text patterns. Notice that very frequent terms are likely to be poor
discriminators. Thus only those terms that are not very frequent and that have good
discriminative power will be used in document classification. Such an association-based
classification method proceeds as follows: First, keywords and terms can be extracted by
information retrieval and simple association analysis techniques. Second, concept hierarchies
of keywords and terms can be obtained using available term classes, such as WordNet, or
relying on expert knowledge, or some keyword classification systems. Documents in the
training set can also be classified into class hierarchies. A term association mining method
can then be applied to discover sets of associated terms that can be used to maximally
distinguish one class of documents from others. This derives a set of association rules
associated with each document class. Such classification rules can be ordered based on their
discriminative power and occurrence frequency, and used to classify new documents. This
kind of association-based document classifier has been shown to be effective.
Document Clustering Analysis
Document clustering is one of the most crucial techniques for organizing documents in an
unsupervised manner. When documents are represented as term vectors, the clustering
methods can be applied. However, the document space is always of very high dimensionality,
ranging from several hundreds to thousands. Due to the curse of dimensionality, it makes
sense to first project the documents into a lower-dimensional subspace in which the semantic
structure of the document space becomes clear. In the low-dimensional semantic space, the
traditional clustering algorithms can then be applied. To this end, spectral clustering, mixture
model clustering, clustering using Latent Semantic Indexing, and clustering using Locality
Preserving Indexing are the most well-known techniques. We discuss each of these methods
here.
The spectral clustering method first performs spectral embedding (dimensionality reduction)
on the original data, and then applies the traditional clustering algorithm (e.g., k-means) on
the reduced document space. Recently, work on spectral clustering shows its capability to
handle highly nonlinear data (the data space has high curvature at every local area). Its strong
connections to differential geometry make it capable of discovering the manifold structure of
the document space. One major drawback of these spectral clustering algorithms might be
that they use the nonlinear embedding (dimensionality reduction), which is only defined on
"training" data. They have to use all of the data points to learn the embedding. When the data
set is very large, it is computationally expensive to learn such an embedding. This restricts
the application of spectral clustering on large data sets.
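The second stage of this pipeline, running a traditional clustering algorithm on the reduced document space, can be sketched as follows. The example assumes the documents have already been embedded into two dimensions (the coordinates are invented) and uses a plain k-means with a naive deterministic initialization that takes the first k points as centers; the embedding step itself is not shown.

```python
def kmeans(points, k, iters=20):
    """Plain k-means on low-dimensional points; returns (labels, centers)."""
    centers = [list(p) for p in points[:k]]    # naive init: first k points
    labels = [0] * len(points)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for idx, p in enumerate(points):
            # assign each point to the nearest center (squared Euclidean distance)
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            labels[idx] = j
            clusters[j].append(p)
        # move each center to the mean of its cluster (keep it if the cluster is empty)
        centers = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return labels, centers

# Hypothetical 2-D embeddings of six documents drawn from two topics.
embedded = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.1), (0.8, 0.9), (0.15, 0.15), (0.85, 0.85)]
labels, centers = kmeans(embedded, k=2)
print(labels)
```

The result of k-means depends on the initialization, so practical systems usually restart from several random initial centers and keep the best run.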
The mixture model clustering method models the text data with a mixture model, often
involving multinomial component models. Clustering involves two steps: (1) estimating the
model parameters based on the text data and any additional prior knowledge, and (2)
inferring the clusters based on the estimated model parameters. Depending on how the
mixture model is defined, these methods can cluster words and documents at the same time.
Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) are
two examples of such techniques. One potential advantage of such clustering methods is that
the clusters can be designed to facilitate comparative analysis of documents.
We can acquire the transformation vectors (embedding function) in LSI and LPI. Such
embedding functions are defined everywhere; thus, we can use part of the data to learn the
embedding function and embed all of the data to low-dimensional space. With this trick,
clustering using LSI and LPI can handle a large document corpus.
As discussed in the previous section, LSI aims to find the best subspace approximation to the
original document space in the sense of minimizing the global reconstruction error. In other
words, LSI seeks to uncover the most representative features rather than the most
discriminative features for document representation. Therefore, LSI might not be optimal in
discriminating documents with different semantics, which is the ultimate goal of clustering.
LPI aims to discover the local geometrical structure and can have more discriminating power.
Experiments show that for clustering, LPI as a dimensionality reduction method is more
suitable than LSI. Compared with LSI and LPI, the PLSI method reveals the latent semantic
dimensions in a more interpretable way and can easily be extended to incorporate any prior
knowledge or preferences about clustering.
15.7 Summary
A substantial portion of the available information is stored in text or document databases that
consist of large collections of documents, such as news articles, technical papers, books,
digital libraries, e-mail messages, and Web pages. Text information retrieval and data mining
have thus become increasingly important. Precision, recall, and the F-score are three basic
measures from Information Retrieval (IR). Various text retrieval methods have been
developed. These typically either focus on document selection (where the query is regarded
as providing constraints) or document ranking (where the query is used to rank documents in
order of relevance). The vector-space model is a popular example of the latter kind. Latent
Semantic Indexing (LSI), Locality Preserving Indexing (LPI), and Probabilistic LSI can be
used for text dimensionality reduction. Text mining goes one step beyond keyword-based and
similarity-based information retrieval and discovers knowledge from semistructured text data
using methods such as keyword-based association analysis, document classification, and
document clustering.
15.8 Keywords
Text mining, F-score, Recall, Precision, Information retrieval, Text Indexing, Dimensionality
Reduction, Document Clustering Analysis, Probabilistic Latent Semantic Indexing, Locality
Preserving Indexing, Latent Semantic Indexing.
15.9 Exercises
1. What is text mining?
2. Explain the mining of text data.
3. Briefly explain information retrieval.
4. What are the basic measures for text retrieval? Explain.
5. Explain text retrieval methods.
6. Write a note on text indexing techniques.
7. Write a note on query processing techniques.
8. Write a note on dimensionality reduction for text.
9. Why is dimensionality reduction for text required?
10. Explain text mining approaches.
11. Explain document classification analysis and document clustering analysis.
15.10 References
1. Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber,
Morgan Kaufmann Publishers, Second Edition, 2006.
2. Introduction to Data Mining with Case Studies by G. K. Gupta, Eastern Economy
Edition (PHI, New Delhi), Third Edition, 2009.
UNIT-16: MULTIMEDIA DATA MINING
Structure
16.1 Objectives
16.2 Introduction
16.3 Mining Multimedia Data
16.4 Similarity Search in Multimedia Data
16.5 Multidimensional Analysis of Multimedia Data
16.6 Mining Associations in Multimedia Data
16.7 Summary
16.8 Keywords
16.9 Exercises
16.10 References
16.1 Objectives
The objectives covered under this unit include:
Introduction to Multimedia Data
Techniques for Mining Multimedia Data
Similarity Search in Multimedia Data
Multidimensional Analysis of Multimedia Data
Mining Associations in Multimedia Data
16.2 Introduction
What is multimedia data?
"What is a multimedia database?" A multimedia database system stores and manages a large
collection of multimedia data, such as audio, video, image, graphics, speech, text, document,
and hypertext data, which contain text, text mark-ups, and linkages. Multimedia database
systems are increasingly common owing to the popular use of audio/video equipment, digital
cameras, CD-ROMs, and the Internet. Typical multimedia database systems include NASA's
EOS (Earth Observation System), various kinds of image and audio-video databases, and
Internet databases. Rapid progress in digital data acquisition and storage technology has
led to a tremendous and fast-growing amount of data stored in databases. Although valuable
information may be hiding behind the data, the overwhelming data volume makes it difficult
(if not impossible) for human beings to extract it without powerful tools. Multimedia
mining systems that can automatically extract semantically meaningful information
(knowledge) from multimedia files are increasingly in demand. For this reason, a large
number of techniques have been proposed, ranging from simple measures (e.g., color
histograms for images, energy estimates for audio signals) to more sophisticated systems such as
speaker emotion recognition in audio and automatic summarization of TV programs. Generally,
multimedia database systems store and manage a large collection of multimedia objects, such
as image, video, audio and hypertext data.
16.3 Multimedia Data Mining
In multimedia documents, knowledge discovery deals with non-structured information. For
this reason, we need tools for discovering relationships between objects or segments within
multimedia document components, such as classifying images based on their content,
extracting patterns in sound, categorizing speech and music, and recognizing and tracking
objects in video streams.
In general, the multimedia files from a database must be first pre-processed to improve their
quality. Subsequently, these multimedia files undergo various transformations and features
extraction to generate the important features from the multimedia files. With the generated
features, mining can be carried out using data mining techniques to discover significant
patterns. These resulting patterns are then evaluated and interpreted in order to obtain the
final application's knowledge. In Figure 1, we present the model of applying multimedia
mining in different multimedia types. Data collection is the starting point of a learning
system, as the quality of raw data determines the overall achievable performance. Then, the
goal of data pre-processing is to discover important features from raw data. Data
pre-processing includes data cleaning, normalization, transformation, feature selection, etc.
Learning can be straightforward if informative features can be identified at the pre-processing
stage. The detailed procedure depends highly on the nature of the raw data and the problem's
domain. In some cases, prior knowledge can be extremely valuable. For many systems, this
stage is still primarily conducted by domain experts. The product of data pre-processing is the
training set. Given a training set, a learning model has to be chosen to learn from it. It must be
mentioned that the steps of multimedia mining are often iterative. The analyst can also jump
back and forth between major tasks in order to improve the results.
Multimedia mining reaches much higher complexity resulting from:
• The huge volume of data,
• The variability and heterogeneity of the multimedia data (e.g., diversity of sensors,
time or conditions of acquisition, etc.), and
• The subjectivity of the meaning of multimedia content.
The high dimensionality of the feature spaces and the size of the multimedia datasets make
the feature extraction a challenging problem. In the following section, we analyze the feature
extraction process for multimedia data.
Feature extraction:
There are two kinds of features: description-based and content-based. The former uses
metadata, such as keywords, caption, size and time of creation. The latter is based on the
content of the object itself.
Feature extraction from text: Text categorization is a conventional classification problem
applied to the textual domain. It solves the problem of assigning text content to predefined
categories. In the learning stage, the labelled training data are first pre-processed to remove
unwanted details and to "normalize" the data. For example, in text documents punctuation
symbols and non-alphanumeric characters are usually discarded, because they do not help in
classification. Moreover, all characters are usually converted to lower case to simplify
matters. The next step is to compute the features that are useful to distinguish one class from
another. For a text document, this usually means identifying the keywords that summarize the
contents of the document. How are these keywords learned? One way is to look for words
that occur frequently in the document. These words tend to be what the document is about. Of
course, words that occur too frequently, such as "the", "is", "in", "of" are no help at all, since
they are prevalent in every document. These common English words may be removed using a
"stop-list" of words during the pre-processing stage. From the remaining words, a good
heuristic is to look for words that occur frequently in documents of the same class, but rarely
in documents of other classes. In order to cope with documents of different lengths, relative
frequency is preferred over absolute frequency. Some authors used phrases, rather than
individual words, as indexing terms, but the experimental results found to date have not been
uniformly encouraging. Another problem with text is variants, that is, the
different forms of the same word, e.g. "go", "goes", "went", "gone", "going". This may be
solved by stemming, which means replacing all variants of a word by a standard one.
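The pre-processing steps described above (lower-casing, discarding punctuation, removing stop-list words, stemming, and computing relative frequencies) can be put together in a short sketch. The stop-list here is deliberately tiny, and the stemming step is a hypothetical lookup table covering only the "go" example; a real system would use a full stop-list and a proper stemming algorithm.

```python
import re
from collections import Counter

STOP_LIST = {"the", "is", "in", "of", "a", "to", "and"}   # tiny illustrative stop-list
# Hypothetical lookup table standing in for a real stemmer.
STEMS = {"goes": "go", "went": "go", "gone": "go", "going": "go"}

def extract_features(text):
    """Return {term: relative frequency} after normalization."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())        # lower-case, drop punctuation
    tokens = [STEMS.get(t, t) for t in tokens if t not in STOP_LIST]
    counts = Counter(tokens)
    total = sum(counts.values())
    # Relative frequency copes with documents of different lengths.
    return {t: c / total for t, c in counts.items()} if total else {}

features = extract_features("The train goes; the train went. Going, gone!")
print(features)
```

After normalization the sentence reduces to the two terms "train" and "go", with relative frequencies of one third and two thirds.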
Feature extraction from images: Image categorization classifies images into semantic
databases that are manually precategorized. In the same semantic databases, images may have
large variations with dissimilar visual descriptions (e.g. images of persons, images of
industries etc.). In addition, images from different semantic databases might share a common
background (some flowers and sunset have similar colours). Authors distinguish three types
of feature vectors for image description:
1. Pixel level features,
2. Region level features, and
3. Tile level features.
Pixel level features store spectral and textural information about each pixel of the image. For
example, the fraction of the end members, such as concrete or water, can describe the content
of the pixels. Region level features describe groups of pixels. Following the segmentation
process, each region is described by its boundary and a number of attributes, which present
information about the content of the region in terms of the end members and texture, shape,
size, fractal scale, etc. Tile level features present information about whole images
using texture, percentages of end members, fractal scale and others. Moreover, other
researchers proposed an information-driven framework that aims to highlight the role of
information at various levels of representation. This framework adds one more level of
information: the Pattern and Knowledge Level that integrates domain, related alphanumeric
data and the semantic relationships discovered from the image data.
Feature extraction from Audio: Audio data play an important role in multimedia
applications. Music information has two main branches: symbolic and audio information.
In symbolic information, the attack, duration, volume, velocity and instrument type of every
single note are available. Therefore, it is possible to easily access statistical measures such as tempo and
mean key for each music item. Moreover, it is possible to attach to each item high-level
descriptors, such as instrument kind and number. On the other hand, audio information deals
with real world signals and any features need to be extracted through signal analysis.
Some of the most frequently used features for audio classification are:
Total Energy: The temporal energy of an audio frame is defined by the root mean square (RMS)
of the audio signal magnitude within each frame.
Zero Crossing Rate (ZCR): ZCR is also a commonly used temporal feature. ZCR counts the
number of times that an audio signal crosses its zero axis.
Frequency Centroid (FC): It indicates the weighted average of all frequency components of
a frame.
Bandwidth (BW): Bandwidth is the weighted average of the squared differences between
each frequency component and the frequency centroid.
Pitch Period: It is a feature that measures the fundamental frequency of an audio signal.
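Several of these features can be computed with a few lines of plain Python. The sketch below is illustrative only: it uses a naive discrete Fourier transform rather than an FFT, weights the frequency components by their power (an assumption, since the weighting is not specified above), reports the zero crossing count as a rate per sample pair, and is applied to a synthetic 1000 Hz sine frame.

```python
import cmath
import math

def rms_energy(frame):
    """Root mean square of the signal within the frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for the non-negative frequency bins."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]

def centroid_and_bandwidth(frame, sample_rate):
    """Power-weighted mean frequency and power-weighted mean squared deviation from it."""
    power = [m * m for m in magnitude_spectrum(frame)]
    freqs = [k * sample_rate / len(frame) for k in range(len(power))]
    total = sum(power)
    fc = sum(f * p for f, p in zip(freqs, power)) / total
    bw = sum((f - fc) ** 2 * p for f, p in zip(freqs, power)) / total
    return fc, bw

# Synthetic frame: 64 samples of a 1000 Hz sine sampled at 8000 Hz (phase offset 0.3 rad).
sample_rate = 8000
frame = [math.sin(2 * math.pi * 1000 * t / sample_rate + 0.3) for t in range(64)]
energy = rms_energy(frame)
zcr = zero_crossing_rate(frame)
fc, bw = centroid_and_bandwidth(frame, sample_rate)
print(round(energy, 4), round(zcr, 4), round(fc, 1), round(bw, 6))
```

For a pure tone the centroid sits at the tone's frequency and the bandwidth is essentially zero; speech and music frames spread their energy over many frequencies and give larger bandwidths.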
Feature extraction from Video:
In video mining, there are three types of videos:
a) The produced (e.g. movies, news videos, and dramas),
b) The raw (e.g. traffic videos, surveillance videos etc), and
c) The medical video (e.g. ultra sound videos including echocardiogram).
Higher-level information from video includes:
• detecting trigger events (e.g. any vehicles entering a particular area, people exiting or
entering a particular building),
• determining typical and anomalous patterns of activity, generating person-centric or
object-centric views of an activity,
• classifying activities into named categories (e.g. walking, riding a bicycle).
The first stage for mining raw video data is grouping input frames to a set of basic units,
which are relevant to the structure of the video. In produced videos, the most widely used
basic unit is a shot, which is defined as a collection of frames recorded from a single camera
operation. Shot detection methods can be classified into many categories: pixel based,
statistics based, transform based, feature based and histogram based. Color or grayscale
histograms (such as in image mining) can also be used. To segment video, color histograms,
as well as motion and texture features can be used. Generally, if the difference between the
two consecutive frames is larger than a certain threshold value, then a shot boundary is
considered between two corresponding frames. The difference can be determined by
comparing the corresponding pixels of two images.
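The thresholding rule for shot boundaries can be sketched as follows. This example uses the histogram-based variant: each frame is reduced to a coarse grayscale histogram, and a boundary is declared when the normalized histogram difference between consecutive frames exceeds a threshold. The frames (flat lists of gray values), the number of bins and the threshold are all made-up illustrative values.

```python
def gray_histogram(frame, bins=4, max_value=256):
    """Coarse grayscale histogram of a frame given as a flat list of pixel values."""
    hist = [0] * bins
    for px in frame:
        hist[px * bins // max_value] += 1
    return hist

def shot_boundaries(frames, threshold, bins=4):
    """Return indices i where a shot boundary lies between frame i-1 and frame i."""
    boundaries = []
    prev = gray_histogram(frames[0], bins)
    for idx in range(1, len(frames)):
        cur = gray_histogram(frames[idx], bins)
        # Sum of absolute bin differences, normalized by the number of pixels.
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(frames[idx])
        if diff > threshold:
            boundaries.append(idx)
        prev = cur
    return boundaries

# Hypothetical 4-pixel frames: two dark frames followed by two bright frames.
frames = [
    [10, 20, 30, 40],
    [12, 22, 28, 41],
    [200, 210, 220, 230],
    [205, 215, 225, 235],
]
cuts = shot_boundaries(frames, threshold=0.5)
print(cuts)
```

A pixel-by-pixel comparison could be substituted for the histogram difference; histograms are less sensitive to small camera or object motion within a shot.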
Data pre-processing: In a multimedia database, there are numerous objects that have many
different dimensions of interest. For example, the color attribute alone can have 256
dimensions, each counting the frequency of a given color in images. The image may still
have other dimensions. Selecting a subset of features is one method for reducing the problem
size. This reduces the dimensionality of the data and enables learning algorithms to operate
faster and more effectively. The problem of feature interaction can also be addressed by
constructing new features from the basic feature set. This technique is called feature
construction/transformation. Sampling is also well accepted by the statistics community, which
argues that "a powerful computationally intense procedure operating on a sub-sample of the data
may in fact provide superior accuracy than a less sophisticated one using the entire data
base". Moreover, discretization can significantly reduce the number of possible values of a
continuous feature, since a large number of possible feature values contributes to a slow and
ineffective machine learning process. Furthermore, normalization (a "scaling down"
transformation of the features) is also beneficial, since there is often a large difference
between the maximum and minimum values of the features.
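As a small illustration of the last two steps, the sketch below applies min-max normalization and equal-width discretization to one feature. Both are common choices but are assumptions here, since the text does not name a specific scheme; the function names and sample values are hypothetical.

```python
def min_max_normalize(values):
    """Scale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: nothing to scale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_discretize(values, n_bins):
    """Map each continuous value to the index of an equal-width interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


brightness = [0, 64, 128, 255]  # assumed sample feature values
normalized = min_max_normalize(brightness)
bins = equal_width_discretize(brightness, 4)
print(normalized, bins)
```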
16.4 Similarity Search in Multimedia Data
"When searching for similarities in multimedia data, can we search on either the data
description or the data content?" That is correct. For similarity searching in multimedia data,
we consider two main families of multimedia indexing and retrieval systems:
• Description-based retrieval systems, which build indices and perform object
retrieval based on image descriptions, such as keywords, captions, size, and time of
creation;
• Content-based retrieval systems, which support retrieval based on the image
content, such as color histogram, texture, pattern, image topology, and the shape of
objects and their layouts and locations within the image.
Description-based retrieval is labour-intensive if performed manually. If automated, the
results are typically of poor quality. For example, the assignment of keywords to images can
be a tricky and arbitrary task. Recent development of Web-based image clustering and
classification methods has improved the quality of description-based Web image retrieval,
because the text surrounding an image, as well as Web linkage information, can be used
to extract a proper description and to group images describing a similar theme together.
Content-based retrieval uses visual features to index images and promotes object retrieval
based on feature similarity, which is highly desirable in many applications.
In a content-based image retrieval system, there are often two kinds of queries: image-sample-based
queries and image feature specification queries. Image-sample-based queries find all of
the images that are similar to the given image sample. This search compares the feature
vector (or signature) extracted from the sample with the feature vectors of images that have
already been extracted and indexed in the image database. Based on this comparison, images
that are close to the sample image are returned. Image feature specification queries specify or
sketch image features like color, texture, or shape, which are translated into a feature vector
to be matched with the feature vectors of the images in the database. Content-based retrieval
has wide applications, including medical diagnosis, weather prediction, TV production, Web
search engines for images, and e-commerce.
Some systems, such as QBIC (Query By Image Content), support both sample-based and
image feature specification queries. There are also systems that support both content-based
and description-based retrieval.
Several approaches have been proposed and studied for similarity-based retrieval in image
databases, based on image signature:
Color histogram-based signature: In this approach, the signature of an image
includes color histograms based on the color composition of an image regardless of its scale
or orientation. This method does not contain any information about shape, image topology, or
texture. Thus, two images with similar color composition but that contain very different
shapes or textures may be identified as similar, although they could be completely unrelated
semantically.
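The weakness just described can be seen in a small sketch. It assumes images are given as grids of color labels and uses an L1 distance between normalized histograms; these representations and the function names are illustrative assumptions, not part of any particular retrieval system.

```python
from collections import Counter


def color_histogram(image):
    """Signature: the fraction of pixels having each color label."""
    counts = Counter(pixel for row in image for pixel in row)
    total = sum(counts.values())
    return {color: n / total for color, n in counts.items()}


def histogram_distance(sig_a, sig_b):
    """L1 distance between two histograms (0 means identical color composition)."""
    colors = set(sig_a) | set(sig_b)
    return sum(abs(sig_a.get(c, 0.0) - sig_b.get(c, 0.0)) for c in colors)


# Assumed toy images: same colors in different layouts, and a different palette.
stripes = [["red", "blue"], ["red", "blue"]]
checker = [["red", "blue"], ["blue", "red"]]
sunset = [["red", "red"], ["red", "orange"]]

d_same = histogram_distance(color_histogram(stripes), color_histogram(checker))
d_diff = histogram_distance(color_histogram(stripes), color_histogram(sunset))
print(d_same, d_diff)
```

The striped and checkered images have distance zero even though their layouts differ, which is exactly the case in which a purely color-based signature reports unrelated images as similar.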
Multifeature composed signature: In this approach, the signature of an image
includes a composition of multiple features: color histogram, shape, image topology, and
texture. The extracted image features are stored as metadata, and images are indexed based
on such metadata. Often, separate distance functions can be defined for each feature and
subsequently combined to derive the overall results. Multidimensional content-based search
often uses one or a few probe features to search for images containing such (similar) features.
It can therefore be used to search for similar images. This is the most widely used
approach in practice.
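One way to combine separate per-feature distances is a weighted sum, sketched below. The weights, the feature names, and the use of Euclidean distance for every feature are assumptions made for illustration; a real system may use a different distance function per feature.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def combined_distance(sig_a, sig_b, weights):
    """Weighted sum of per-feature distances (the weights are an assumed design choice)."""
    return sum(w * euclidean(sig_a[name], sig_b[name]) for name, w in weights.items())


# Hypothetical metadata: each image stores one small vector per feature.
database = {
    "img1": {"color": [0.9, 0.1], "texture": [0.2]},
    "img2": {"color": [0.1, 0.9], "texture": [0.2]},
}
query = {"color": [0.8, 0.2], "texture": [0.3]}
weights = {"color": 0.7, "texture": 0.3}

ranked = sorted(database, key=lambda name: combined_distance(database[name], query, weights))
print(ranked)
```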
Wavelet-based signature: This approach uses the dominant wavelet coefficients of an image
as its signature. Wavelets capture shape, texture, and image topology information in a single
unified framework. This improves efficiency and reduces the need for providing multiple
search primitives (unlike the second method above). However, since this method computes a
single signature for an entire image, it may fail to identify images containing similar objects
where the objects differ in location or size.
Wavelet-based signature with region-based granularity: In this approach, the computation
and comparison of signatures are at the granularity of regions, not the entire image. This is
based on the observation that similar images may contain similar regions, but a region in one
image could be a translation or scaling of a matching region in the other. Therefore, a
similarity measure between the query image Q and a target image T can be defined in terms
of the fraction of the area of the two images covered by matching pairs of regions from Q and
T. Such a region-based similarity search can find images containing similar objects, where
these objects may be translated or scaled.
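The area-fraction measure described above might be sketched as follows. The sketch assumes each region is summarized by its area and a short signature vector, and that two regions match when their signatures differ by at most a tolerance `epsilon` in every component; both the representation and the matching rule are hypothetical simplifications.

```python
def region_similarity(regions_q, regions_t, epsilon=0.1):
    """Fraction of the total area of Q and T covered by matching region pairs.

    Each region is an (area, signature) pair. Position and size are ignored
    when matching, so translated or scaled regions can still match.
    """
    def matches(sig_a, sig_b):
        return max(abs(a - b) for a, b in zip(sig_a, sig_b)) <= epsilon

    matched_q = sum(area for area, sig in regions_q
                    if any(matches(sig, sig_t) for _, sig_t in regions_t))
    matched_t = sum(area for area, sig in regions_t
                    if any(matches(sig, sig_q) for _, sig_q in regions_q))
    total = sum(a for a, _ in regions_q) + sum(a for a, _ in regions_t)
    return (matched_q + matched_t) / total


# Assumed data: Q and T share one similar region, which appears smaller in T.
query_regions = [(40, (0.9, 0.1)), (60, (0.2, 0.2))]
target_regions = [(10, (0.9, 0.15)), (90, (0.5, 0.8))]
similarity = region_similarity(query_regions, target_regions)
print(similarity)
```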
The representation of multidimensional points and objects, and the development of
appropriate indexing methods that enable them to be retrieved efficiently is a well-studied
subject. Most of these methods were designed for use in application domains where the data
usually has a spatial component which has a relatively low dimension. Examples of such
application domains include geographic information systems (GIS), spatial databases, solid
modelling, computer vision, computational geometry, and robotics. However, there are many
application domains where the data is of considerably higher dimensionality, and is not
necessarily spatial. This is especially true in multimedia databases where the data is a set of
objects and the high dimensionality is a direct result of trying to describe the objects via a
collection of features (also known as a feature vector). In the case of images, examples of
features include color, color moments, textures, shape descriptions, etc. expressed using
scalar values.
The goal in these applications is often expressed more generally as one of the following:
• Find objects whose feature values fall within a given range or where the distance from
some query object falls into a certain range (range queries).
• Find objects whose features have values similar to those of a given query object or set
of query objects (nearest neighbour queries).
These queries are collectively referred to as similarity searching.
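Both query types can be answered, without any index, by a sequential scan over the feature vectors. The sketch below is such a baseline, assuming Euclidean distance and a small dictionary of hypothetical objects; indexing methods aim to return the same answers while examining fewer objects.

```python
import math


def range_query(points, query, radius):
    """Names of all objects whose distance from the query is at most radius."""
    return [name for name, vec in points.items() if math.dist(vec, query) <= radius]


def nearest_neighbours(points, query, k):
    """Names of the k objects closest to the query, nearest first."""
    return sorted(points, key=lambda name: math.dist(points[name], query))[:k]


# Assumed two-dimensional feature vectors.
points = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (5.0, 5.0)}
query = (0.9, 0.0)

in_range = range_query(points, query, radius=1.0)
closest = nearest_neighbours(points, query, k=1)
print(in_range, closest)
```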
Curse of dimensionality: An apparently straightforward solution to finding the nearest
neighbour is to compute a Voronoi diagram for the data points (i.e., a partition of the space
into regions where all points in the region are closer to the region's associated data point than
to any other data point), and then locate the Voronoi region corresponding to the query point.
The problem with this solution is that the combinatorial complexity of the Voronoi diagram in
high dimensions is prohibitive; that is, it grows exponentially with its dimension k so that for
N points, the time to build and the space requirements can grow as rapidly as $\Theta(N^{k/2})$.
This renders its applicability moot. The
above is typical of the problems that we must face when dealing with high-dimensional data.
Generally speaking, multidimensional queries become increasingly more difficult as the
dimensionality increases. The problem is characterized as the curse of dimensionality. This
term is used to indicate that the number of samples needed to estimate an arbitrary function
with a given level of accuracy grows exponentially with the number of variables (i.e.,
dimensions) that comprise it. For similarity searching (i.e., finding nearest neighbours), this
means that the number of objects (i.e., points) in the data set that need to be examined in
deriving the estimate grows exponentially with the underlying dimension. The curse of
dimensionality has a direct bearing on similarity searching in high dimensions as it raises the
issue of whether or not nearest neighbour searching is even meaningful in such a domain. In
particular, letting 'd' denote a distance function which need not necessarily be a metric, it
has been pointed out that nearest neighbour searching is not meaningful when the ratio of the
variance of the distance between two random points p and q, drawn from the data and query
distributions, and the expected distance between them converges to zero as the dimension 'k'
goes to infinity, that is,
$$\lim_{k \to \infty} \frac{\mathrm{Variance}[d(p, q)]}{\mathrm{Expected}[d(p, q)]} = 0$$
In other words, the distance to the nearest neighbour and the distance to the farthest
neighbour tend to converge as the dimension increases.
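This convergence can be observed empirically. The sketch below is only a simulation under assumed conditions (points drawn uniformly from the unit hypercube, Euclidean distance, a fixed random seed), not a proof; the function name `relative_contrast` is hypothetical.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)


# The contrast between nearest and farthest neighbour shrinks as dimension grows.
for dim in (2, 20, 500):
    print(dim, round(relative_contrast(dim), 3))
```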
Multidimensional indexing: Assuming that the curse of dimensionality does not come into
play, query responses are facilitated by sorting the objects on the basis of some of their
feature values and building appropriate indexes. The high-dimensional feature space is
indexed using some multidimensional data structure (termed multidimensional indexing) with
appropriate modifications to fit the high-dimensional problem environment. Similarity search which finds objects
similar to a target object can be performed with a range search or a nearest neighbor search in
the multidimensional data structure. However, unlike applications in spatial databases where
the distance function between two objects is usually Euclidean, this is not necessarily the case
in the high-dimensional feature space where the distance function may even vary from query
to query on the same feature. Searching in high-dimensional spaces is time-consuming.
Performing range queries in high dimensions is considerably easier, from the standpoint of
computational complexity, than performing similarity queries as range queries do not involve
the computation of distance. In particular, searches through an indexed space usually involve
relatively simple comparison tests. However, if we have to examine all of the index nodes,
then the process is again time-consuming. In contrast, computing similarity in terms of
nearest neighbour search makes use of distance and the process of computing the distance can
be computationally complex. For example, computing the Euclidean distance between two
points in a high-dimensional space of dimension 'd' requires 'd' multiplication operations and 'd-1'
addition operations, as well as a square root operation (which can be omitted). Note also that
computing similarity requires the definition of what it means for two objects to be similar,
which is not always so obvious.
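The remark that the square root can be omitted follows because the square root is monotonic, so ranking by squared distance gives the same order as ranking by true distance. A minimal sketch with assumed sample points:

```python
import math


def squared_euclidean(u, v):
    """Sum of squared coordinate differences; no square root is taken."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Assumed three-dimensional feature vectors.
points = {"a": (1.0, 2.0, 3.0), "b": (4.0, 0.0, 1.0), "c": (1.5, 2.5, 2.0)}
query = (1.0, 2.0, 2.0)

nearest_squared = min(points, key=lambda n: squared_euclidean(points[n], query))
nearest_true = min(points, key=lambda n: math.dist(points[n], query))
print(nearest_squared, nearest_true)
```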
Distance based indexing: Often, the only information that we have available is a distance
function that indicates the degree of similarity (or dissimilarity) between all pairs of the N
objects. Usually the distance function 'd' is required to obey the triangle inequality, be
nonnegative, and be symmetric, in which case it is known as a metric and also referred to as a
distance metric. However, at times, the distance function is not a metric. Often, the degree of
similarity is expressed using a similarity matrix which contains interobject distance values,
for all possible pairs of the N objects. Given a distance function, we usually index the objects
with respect to their distance from a few selected objects. We use the term distance-based
indexing to describe such methods. There are two basic partitioning schemes: ball
partitioning and generalized hyperplane partitioning. In ball partitioning, the data set is
partitioned based on distances from one distinguished object, sometimes called a vantage
point, into the subset that is inside and the subset that is outside a ball around the object. In
generalized hyperplane partitioning, two distinguished objects p1 and p2 are
chosen and the data set is partitioned based on which of the two distinguished objects is the
closest; that is, all the objects in subset A are closer to p1 than to p2, while the objects in
subset B are closer to p2. The asymmetry of ball partitioning is a potential drawback of this
method, as the outer shell tends to be very narrow for metric spaces typically used in
similarity search. In contrast, generalized hyperplane partitioning is more symmetric, in that
both partitions form a "ball" around an object. The advantage of distance-based indexing
methods is that distance computations are used to build the index, but once
the index has been built, similarity queries can often be performed with a significantly lower
number of distance computations than a sequential scan of the entire dataset. Of course, in
situations where we may want to apply several different distance metrics, the drawback
of the distance-based indexing techniques is that they require that the index be rebuilt for
each different distance metric, which may be nontrivial. This is not the case for the
multidimensional indexing methods which have the advantage of supporting arbitrary
distance metrics (however, this comparison is not entirely fair, since the assumption, when
using distance-based indexing, is that often we do not have any feature values as for example
in DNA sequences).
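The two partitioning schemes can be sketched as a single partitioning step each. The sketch assumes the ball radius is the median distance from the vantage point (a common choice, but not stated in the text) and uses Euclidean distance only as a stand-in for an arbitrary metric; the function names and sample objects are hypothetical.

```python
import math
import statistics


def ball_partition(objects, vantage, dist):
    """Split objects into those inside and outside a ball around the vantage point."""
    others = [o for o in objects if o != vantage]
    radius = statistics.median(dist(vantage, o) for o in others)  # assumed radius choice
    inside = [o for o in others if dist(vantage, o) <= radius]
    outside = [o for o in others if dist(vantage, o) > radius]
    return radius, inside, outside


def hyperplane_partition(objects, p1, p2, dist):
    """Split objects by which of the two distinguished objects is closer."""
    closer_to_p1, closer_to_p2 = [], []
    for o in objects:
        if o in (p1, p2):
            continue
        if dist(p1, o) <= dist(p2, o):
            closer_to_p1.append(o)
        else:
            closer_to_p2.append(o)
    return closer_to_p1, closer_to_p2


# Assumed objects; only the distance function is used, never the coordinates directly.
objects = [(0, 0), (1, 0), (2, 0), (8, 0), (9, 0), (10, 0)]
radius, inside, outside = ball_partition(objects, (0, 0), math.dist)
subset_a, subset_b = hyperplane_partition(objects, (0, 0), (10, 0), math.dist)
print(radius, inside, outside)
print(subset_a, subset_b)
```

Applying either step recursively to the resulting subsets yields a tree; at query time the triangle inequality lets whole subsets be skipped without computing distances to their members.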
16.5 Multidimensional Analysis of Multimedia Data
Multidimensional Analysis and Descriptive Mining of Complex Data Objects: Many
advanced, data-intensive applications, such as scientific research and engineering design,
need to store, access, and analyze complex but relatively structured data objects. These
objects cannot be represented as simple and uniformly structured records (i.e., tuples) in data
relations. Such application requirements have motivated the design and development of
object-relational and object-oriented database systems. Both kinds of systems deal with the
efficient storage and access of vast amounts of disk-based complex structured data objects.
These systems organize a large set of complex data objects into classes, which are in turn
organized into class/subclass hierarchies.
Each object in a class is associated with
• an object-identifier,
• a set of attributes that may contain sophisticated data structures, set- or list-valued
data, class composition hierarchies, multimedia data, and
• a set of methods that specify the computational routines or rules associated with the
object class.
There has been extensive research in the field of database systems on how to efficiently
index, store, access, and manipulate complex objects in object-relational and object-oriented
database systems. Technologies handling these issues are discussed in many books on
database systems, especially on object-oriented and object-relational database systems.
One step beyond the storage and access of massive-scaled, complex object data is the
systematic analysis and mining of such data. This includes two major tasks:
(1) constructing multidimensional data warehouses for complex object data and performing
online analytical processing (OLAP) in such data warehouses, and
(2) developing effective and scalable methods for mining knowledge from object databases
and/or data warehouses.
The second task is largely covered by the mining of specific kinds of data (such as spatial,
temporal, sequence, graph- or tree-structured, text, and multimedia data), since these data
form the major new kinds of complex data objects. Thus, our focus in this section will be
mainly on how to construct object data warehouses and perform OLAP analysis on data
warehouses for such data. A major limitation of many commercial data warehouse and OLAP
tools for multidimensional database analysis is their restriction on the allowable data types
for dimensions and measures. Most data cube implementations confine dimensions to
nonnumeric data, and measures to simple, aggregated values. To introduce data mining and
multidimensional data analysis for complex objects, this section examines how to perform
generalization on complex structured objects and construct object cubes for OLAP and
mining in object databases. To facilitate generalization and induction in object-relational and
object-oriented databases, it is important to study how each component of such databases can
be generalized, and how the generalized data can be used for multidimensional data analysis
and data mining.
Generalization of Structured Data
An important feature of object-relational and object-oriented databases is their capability of
storing, accessing, and modelling complex structure-valued data, such as set- and list-valued
data and data with nested structures.
"How can generalization be performed on such data?" Let's start by looking at the
generalization of set-valued, list-valued, and sequence-valued attributes.
A set-valued attribute may be of homogeneous or heterogeneous type. Typically, set-valued
data can be generalized by
(1) generalization of each value in the set to its corresponding higher-level concept, or
(2) derivation of the general behavior of the set, such as the number of elements in the
set, the types or value ranges in the set, the weighted average for numerical data, or the major
clusters formed by the set. Moreover, generalization can be performed by applying different
generalization operators to explore alternative generalization paths. In this case, the result of
generalization is a heterogeneous set.
Example 1: Generalization of a set-valued attribute. Suppose that the hobby of a person is
a set-valued attribute containing the set of values {tennis, hockey, soccer, violin, SimCity}.
This set can be generalized to a set of high-level concepts, such as {sports, music, computer
games} or into the number 5 (i.e., the number of hobbies in the set). Moreover, a count can be
associated with a generalized value to indicate how many elements are generalized to that
value, as in {sports(3), music(1), computer games(1)}, where sports(3) indicates three
kinds of sports, and so on.
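Example 1 can be reproduced in a few lines. The concept hierarchy is written here as a plain dictionary, which is an assumed representation; the mapping itself follows the example.

```python
from collections import Counter

# Assumed one-level concept hierarchy: each hobby maps to its higher-level concept.
CONCEPT_OF = {
    "tennis": "sports",
    "hockey": "sports",
    "soccer": "sports",
    "violin": "music",
    "SimCity": "computer games",
}


def generalize_set(values, hierarchy):
    """Replace each value by its higher-level concept and count the occurrences."""
    return Counter(hierarchy[v] for v in values)


hobbies = {"tennis", "hockey", "soccer", "violin", "SimCity"}
generalized = generalize_set(hobbies, CONCEPT_OF)  # concepts with counts
concepts = set(generalized)                        # the set of high-level concepts
size = len(hobbies)                                # generalization to a single number
print(generalized, concepts, size)
```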
A set-valued attribute may be generalized to a set-valued or a single-valued attribute; a
single-valued attribute may be generalized to a set-valued attribute if the values form a lattice
or "hierarchy" or if the generalization follows different paths. Further generalizations on such
a generalized set-valued attribute should follow the generalization path of each value in the
set. List-valued attributes and sequence-valued attributes can be generalized in a manner
similar to that for set-valued attributes except that the order of the elements in the list or
sequence should be preserved in the generalization. Each value in the list can be generalized
into its corresponding higher-level concept. Alternatively, a list can be generalized according
to its general behaviour, such as the length of the list, the type of list elements, the value
range, the weighted average value for numerical data, or by dropping unimportant elements
in the list. A list may be generalized into a list, a set, or a single value.
Example 2: Generalization of list-valued attributes. Consider the following list or
sequence of data for a person's education record: "((B.Sc. in Electrical Engineering, U.B.C.,
Dec., 1998), (M.Sc. in Computer Engineering, U. Maryland, May, 2001), (Ph.D. in Computer
Science, UCLA, Aug., 2005))". This can be generalized by dropping less important
descriptions (attributes) of each tuple in the list, such as by dropping the month attribute to
obtain "((B.Sc., U.B.C., 1998), ...)", and/or by retaining only the most important tuple(s) in
the list, e.g., "(Ph.D. in Computer Science, UCLA, 2005)".
A complex structure-valued attribute may contain sets, tuples, lists, trees, records, and their
combinations, where one structure may be nested in another at any level. In general, a
structure-valued attribute can be generalized in several ways, such as
1. Generalizing each attribute in the structure while maintaining the shape of the
structure,
2. Flattening the structure and generalizing the flattened structure,
3. Summarizing the low-level structures by high-level concepts or aggregation, and
4. Returning the type or an overview of the structure.
In general, statistical analysis and cluster analysis may help in deciding on the directions
and degrees of generalization to perform, since most generalization processes aim to retain
main features and remove noise, outliers, or fluctuations.
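The two list generalizations of Example 2 can be sketched as below. Treating the most recent record as the "most important" tuple is an assumption made for illustration, and the function names are hypothetical. Note that list order is preserved by the first operation.

```python
education = [
    ("B.Sc. in Electrical Engineering", "U.B.C.", "Dec.", 1998),
    ("M.Sc. in Computer Engineering", "U. Maryland", "May", 2001),
    ("Ph.D. in Computer Science", "UCLA", "Aug.", 2005),
]


def drop_less_important(records):
    """Keep (degree, school, year); drop the field of study and the month."""
    return [(degree.split(" in ")[0], school, year)
            for degree, school, _month, year in records]


def keep_most_recent(records):
    """Retain one tuple only, assuming the latest record is the most important."""
    degree, school, _month, year = max(records, key=lambda r: r[3])
    return (degree, school, year)


shortened = drop_less_important(education)
highest = keep_most_recent(education)
print(shortened)
print(highest)
```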
Aggregation and Approximation in Spatial and Multimedia Data Generalization:
Aggregation and approximation are another important means of generalization. They are
especially useful for generalizing attributes with large sets of values, complex structures, and
spatial or multimedia data. Let's take spatial data as an example. We would like to generalize
detailed geographic points into clustered regions, such as business, residential, industrial, or
agricultural areas, according to land usage. Such generalization often requires the merge of a
set of geographic areas by spatial operations, such as spatial union or spatial clustering
methods. Aggregation and approximation are important techniques for this form of
generalization. In a spatial merge, it is necessary to not only merge the regions of similar
types within the same general class but also to compute the total areas, average density, or
other aggregate functions while ignoring some scattered regions with different types if they
are unimportant to the study. Other spatial operators, such as spatial-union,
spatial-overlapping, and spatial-intersection (which may require the merging of scattered small
regions into large, clustered regions) can also use spatial aggregation and approximation as
data generalization operators.
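A spatial merge with aggregation and approximation, as described above, might look like the following sketch. It ignores actual geometry and works only with areas and land-use labels; the `min_share` threshold of 0.8 and the sample regions are assumptions, not values from the text.

```python
def spatial_merge(regions, general_class, min_share=0.8):
    """Merge regions into one area and label it by approximation.

    Each region is a (detailed_use, general_class, area) tuple. Scattered
    regions of other classes are ignored if the chosen class dominates.
    """
    total_area = sum(area for _, _, area in regions)
    class_area = sum(area for _, cls, area in regions if cls == general_class)
    share = class_area / total_area
    label = general_class if share >= min_share else "mixed"
    return {"label": label, "total_area": total_area, "share": share}


# Assumed land parcels (areas in arbitrary units).
parcels = [
    ("vegetables", "agricultural", 40),
    ("grains", "agricultural", 35),
    ("fruits", "agricultural", 15),
    ("highway", "transport", 6),
    ("houses", "residential", 4),
]
merged = spatial_merge(parcels, "agricultural")
print(merged)
```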
Example 3 Spatial aggregation and approximation: Suppose that we have different pieces
of land for various purposes of agricultural usage, such as the planting of vegetables, grains,
and fruits. These pieces can be merged or aggregated into one large piece of agricultural land
by a spatial merge. However, such a piece of agricultural land may contain highways, houses,
and small stores. If the majority of the land is used for agriculture, the scattered regions for
other purposes can be ignored, and the whole region can be claimed as an agricultural area by
approximation.
A multimedia database may contain complex texts, graphics, images, video
fragments, maps, voice, music, and other forms of audio/video information. Multimedia data
are typically stored as sequences of bytes with variable lengths, and segments of data are
linked together or indexed in a multidimensional way for easy reference.
Generalization on multimedia data can be performed by recognition and extraction of the
essential features and/or general patterns of such data. There are many ways to extract such
information. For an image, the size, color, shape, texture, orientation, and relative positions
and structures of the contained objects or regions in the image can be extracted by
aggregation and/or approximation. For a segment of music, its melody can be summarized
based on the approximate patterns that repeatedly occur in the segment, while its style can be
summarized based on its tone, tempo, or the major musical instruments played. For an article,
its abstract or general organizational structure (e.g., the table of contents, the subject and
index terms that frequently occur in the article, etc.) may
serve as its generalization. In general, it is a challenging task to generalize spatial data and
multimedia data in order to extract interesting knowledge implicitly stored in the data.
Technologies developed in spatial databases and multimedia databases, such as spatial data
accessing and analysis techniques, pattern recognition, image analysis, text analysis,
content-based image/text retrieval and multidimensional indexing methods, should be integrated with
data generalization and data mining techniques to achieve satisfactory results. Techniques for
mining such data are further discussed in the following sections.
Generalization of Object Identifiers and Class/Subclass Hierarchies: "How can object
identifiers be generalized?" At first glance, it may seem impossible to generalize an object
identifier. It remains unchanged even after structural reorganization of the data. However,
since objects in an object-oriented database are organized into classes, which in turn are
organized into class/subclass hierarchies, the generalization of an object can be performed by
referring to its associated hierarchy. Thus, an object identifier can be generalized as follows.
First, the object identifier is generalized to the identifier of the lowest subclass to which the
object belongs. The identifier of this subclass can then, in turn, be generalized to a higher
level class/subclass identifier by climbing up the class/subclass hierarchy. Similarly, a class
or a subclass can be generalized to its corresponding superclass(es) by climbing up its
associated class/subclass hierarchy. "Can inherited properties of objects be generalized?"
Since object-oriented databases are organized into class/subclass hierarchies, some attributes
or methods of an object class are not explicitly specified in the class but are inherited from
higher-level classes of the object. Some object-oriented database systems allow multiple
inheritance, where properties can be inherited from more than one superclass when the
class/subclass "hierarchy" is organized in the shape of a lattice. The inherited properties of an
object can be derived by query processing in the object-oriented database. From the data
generalization point of view, it is unnecessary to distinguish which data are stored within the
class and which are inherited from its superclass. As long as the set of relevant data is
collected by query processing, the data mining process will treat the inherited data in the
same manner as the data stored in the object class, and perform generalization accordingly.
Methods are an important component of object-oriented databases. They can also be inherited
by objects. Many behavioural data of objects can be derived by the application of methods.
Since a method is usually defined by a computational procedure/function or by a set of
deduction rules, it is impossible to perform generalization on the method itself. However,
generalization can be performed on the data derived by application of the method. That is,
once the set of task-relevant data is derived by application of the method, generalization can
then be performed on these data.
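The object-identifier generalization described at the start of this discussion (first to the lowest subclass, then up the class/subclass hierarchy) can be sketched as follows. The class names, the object identifier, and the dictionary-based hierarchy are all hypothetical, and only single inheritance is modelled.

```python
# Hypothetical single-inheritance hierarchy: each class maps to its superclass.
SUPERCLASS = {"graduate_student": "student", "student": "person", "person": None}

# Hypothetical catalogue mapping object identifiers to their lowest subclass.
OBJECT_CLASS = {"oid_1042": "graduate_student"}


def generalize_identifier(oid, levels):
    """Generalize an object identifier by climbing the class/subclass hierarchy."""
    current = OBJECT_CLASS[oid]  # first step: identifier of the lowest subclass
    for _ in range(levels):
        parent = SUPERCLASS.get(current)
        if parent is None:       # already at the root of the hierarchy
            break
        current = parent
    return current


print(generalize_identifier("oid_1042", 0))
print(generalize_identifier("oid_1042", 1))
print(generalize_identifier("oid_1042", 5))
```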
Generalization of Class Composition Hierarchies: An attribute of an object may be
composed of or described by another object, some of whose attributes may be in turn
composed of or described by other objects, thus forming a class composition hierarchy.
Generalization on a class composition hierarchy can be viewed as generalization on a set of
nested structured data (which are possibly infinite, if the nesting is recursive).
In principle, the reference to a composite object may traverse via a long sequence of
references along the corresponding class composition hierarchy. However, in most cases, the
longer the sequence of references traversed, the weaker the semantic linkage between the
original object and the referenced composite object. For example, an attribute vehicles owned
of an object class student could refer to another object class car, which may contain an
attribute auto dealer, which may refer to attributes describing the dealer's manager and
children. Obviously, it is unlikely that any interesting general regularities exist between a
student and her car dealer's manager's children. Therefore, generalization on a class of
objects should be performed on the descriptive attribute values and methods of the class, with
limited reference to its closely related components via its closely related linkages in the class
composition hierarchy. That is, in order to discover interesting knowledge, generalization
should be performed on the objects in the class composition hierarchy that are closely related
in semantics to the currently focused class(es), but not on those that have only remote and
rather weak semantic linkages.
Construction and Mining of Object Cubes: In an object database, data generalization and
multidimensional analysis are not applied to individual objects but to classes of objects. Since
a set of objects in a class may share many attributes and methods, and the generalization of
each attribute and method may apply a sequence of generalization operators, the major issue
becomes how to make the generalization processes cooperate among different attributes and
methods in the class(es).
"So, how can class-based generalization be performed for a large set of objects?" For
class-based generalization, the attribute-oriented induction method for mining characteristics of
relational databases can be extended to mine data characteristics in object databases.
Consider that a generalization-based data mining process can be viewed as the application of
a sequence of class-based generalization operators on different attributes. Generalization can
continue until the resulting class contains a small number of generalized objects that can be
summarized as a concise, generalized rule in high-level terms. For efficient implementation,
the generalization of multidimensional attributes of a complex object class can be performed
by examining each attribute (or dimension), generalizing each attribute to simple-valued data,
and constructing a multidimensional data cube, called an object cube. Once an object cube is
constructed, multidimensional analysis and data mining can be performed on it in a manner
similar to that for relational data cubes. Notice that from the application point of view, it is
not always desirable to generalize a set of values to single-valued data. Consider the attribute
keyword, which may contain a set of keywords describing a book. It does not make much
sense to generalize this set of keywords to one single value. In this context, it is difficult to
construct an object cube containing the keyword dimension. We will address some progress
in this direction in the next section when discussing spatial data cube construction. However,
it remains a challenging research issue to develop techniques for handling set-valued data
effectively in object cube construction and object-based multidimensional analysis.
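The construction outlined above (generalize each attribute to a simple value, then aggregate) can be sketched with a count measure. The generalization functions, the attribute names, and the sample objects are assumptions; in particular, reducing the set-valued hobbies attribute to a size category is just one possible way around the set-valued difficulty noted above.

```python
from collections import Counter


def build_object_cube(objects, generalizers):
    """Generalize every dimension to a simple value and count objects per cell."""
    dims = sorted(generalizers)
    cube = Counter()
    for obj in objects:
        cell = tuple(generalizers[dim](obj[dim]) for dim in dims)
        cube[cell] += 1
    return dims, cube


def roll_up(cube, keep_index):
    """Aggregate the cube onto a single dimension by summing the other one out."""
    rolled = Counter()
    for cell, count in cube.items():
        rolled[cell[keep_index]] += count
    return rolled


# Hypothetical student objects with one numeric and one set-valued attribute.
students = [
    {"age": 23, "hobbies": {"tennis", "violin"}},
    {"age": 27, "hobbies": {"soccer"}},
    {"age": 34, "hobbies": {"hockey", "tennis", "SimCity"}},
]
generalizers = {
    "age": lambda a: f"{a // 10 * 10}s",
    "hobbies": lambda h: "1-2 hobbies" if len(h) <= 2 else "3+ hobbies",
}
dims, cube = build_object_cube(students, generalizers)
by_age = roll_up(cube, dims.index("age"))
print(dims, cube, by_age)
```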
Generalization-Based Mining of Plan Databases by Divide-and-Conquer: To show how
generalization can play an important role in mining complex databases, we examine a case of
mining significant patterns of successful actions in a plan database using a divide-and-conquer
strategy. A plan consists of a variable sequence of actions. A plan database, or
simply a plan base, is a large collection of plans. Plan mining is the task of mining significant
patterns or knowledge from a plan base. Plan mining can be used to discover travel patterns
of business passengers in an air flight database or to find significant patterns from the
sequences of actions in the repair of automobiles. Plan mining is different from sequential
pattern mining, where a large number of frequently occurring sequences are mined at a very
detailed level. Instead, plan mining is the extraction of important or significant generalized
(sequential) patterns from a plan base.
Let's examine the plan mining process using an air travel example.
Example 4 An air flight plan base: Suppose that the air travel plan base shown in Table 1
stores customer flight sequences, where each record corresponds to an action in a sequential
database, and a sequence of records sharing the same plan number is considered as one plan
with a sequence of actions. The columns departure and arrival specify the codes of the
airports involved. Table 2 stores information about each airport. There could be many
patterns mined from a plan base like Table 1. For example, we may discover that most flights
from cities in the Atlantic United States to Midwestern cities have a stopover at ORD in
Chicago, which could be because ORD is the principal hub for several major airlines. Notice
that the airports that act as airline hubs (such as LAX in Los Angeles, ORD in Chicago, and
JFK in New York) can easily be derived from Table 2 based on airport size. However, there
could be hundreds of hubs in a travel database. Indiscriminate mining may result in a large
number of "rules" that lack substantial support, without providing a clear overall picture.
Figure 2: A multidimensional view of a database
Table 1
Table 2
Multidimensional Analysis of Multimedia Data
"Can we construct a data cube for multimedia data analysis?" To facilitate the
multidimensional analysis of large multimedia databases, multimedia data cubes can be
designed and constructed in a manner similar to that for traditional data cubes from relational
data. A multimedia data cube can contain additional dimensions and measures for multimedia
information, such as color, texture, and shape. Let's examine a multimedia data mining
system prototype called MultiMediaMiner, which extends the DBMiner system by handling
multimedia data. The example database tested in the MultiMediaMiner system is constructed
as follows.
Each image contains two descriptors: a feature descriptor and a layout descriptor.
The original image is not stored directly in the database; only its descriptors are stored. The
description information encompasses fields like image file name, image URL, image type
(e.g., gif, tiff, jpeg, mpeg, bmp, avi), a list of all known Web pages referring to the image
(i.e., parent URLs), a list of keywords, and a thumbnail used by the user interface for image
and video browsing. The feature descriptor is a set of vectors for each visual characteristic.
The main vectors are a color vector containing the color histogram quantized to 512 colors
(8x8x8 for RxGxB), an MFC (Most Frequent Color) vector, and an MFO (Most Frequent
Orientation) vector. The MFC and MFO contain five color centroids and five edge orientation
centroids for the five most frequent colors and five most frequent orientations, respectively.
The edge orientations used are 0°, 22.5°, 45°, 67.5°, 90°, and so on. The layout descriptor contains
a color layout vector and an edge layout vector. Regardless of their original size, all images
are assigned an 8x8 grid. The most frequent color for each of the 64 cells is stored in the
color layout vector, and the number of edges for each orientation in each of the cells is stored
in the edge layout vector. Other sizes of grids, like 4x4, 2x2, and 1x1, can easily be derived.
The Image Excavator component of MultiMediaMiner uses image contextual information,
like HTML tags in Web pages, to derive keywords. By traversing on-line directory structures,
like the Yahoo! directory, it is possible to create hierarchies of keywords mapped onto the
directories in which the image was found. These graphs are used as concept hierarchies for
the dimension keyword in the multimedia data cube.
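The color layout vector described above can be illustrated with a minimal sketch. It assumes the image has already been reduced to a grid of quantized color ids and that its height and width are multiples of the grid size; the function name color_layout and the toy image are hypothetical and are not part of MultiMediaMiner.

```python
from collections import Counter

def color_layout(image, grid=8):
    """Return the most frequent color of each grid cell, row by row.

    `image` is a list of rows of quantized color ids. Its height and
    width are assumed to be multiples of `grid` (an assumption made
    here only to keep the sketch short).
    """
    height, width = len(image), len(image[0])
    cell_h, cell_w = height // grid, width // grid
    layout = []
    for gr in range(grid):
        for gc in range(grid):
            # Count the colors of all pixels that fall inside this cell.
            counts = Counter(
                image[r][c]
                for r in range(gr * cell_h, (gr + 1) * cell_h)
                for c in range(gc * cell_w, (gc + 1) * cell_w)
            )
            layout.append(counts.most_common(1)[0][0])
    return layout

# A 16x16 toy image: left half is color 1, right half is color 2.
toy = [[1] * 8 + [2] * 8 for _ in range(16)]
layout_8x8 = color_layout(toy, grid=8)  # 64 entries, one per cell
layout_2x2 = color_layout(toy, grid=2)  # 4 entries, a coarser grid
```

Here the coarser grid is recomputed from the pixels rather than from the 8x8 vector, which is one simple way to obtain the 4x4, 2x2, and 1x1 layouts mentioned in the text.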
"What kind of dimensions can a multimedia data cube have?" A multimedia data cube can
have many dimensions. The following are some examples: the size of the image or video in
bytes; the width and height of the frames (or pictures), constituting two dimensions; the date
on which the image or video was created (or last modified); the format type of the image or
video; the frame sequence duration in seconds; the image or video Internet domain; the
Internet domain of pages referencing the image or video (parent URL); the keywords; a color
dimension; an edge-orientation dimension; and so on. Concept hierarchies for many
numerical dimensions may be automatically defined. For other dimensions, such as for
Internet domains or color, predefined hierarchies may be used. The construction of a
multimedia data cube will facilitate multidimensional analysis of multimedia data primarily
based on visual content, and the mining of multiple kinds of knowledge, including
summarization, comparison, classification, association, and clustering. The Classifier module
of MultiMediaMiner and its output are presented in Figure 3.
Figure 3: An output of the Classifier module of MultiMediaMiner
The multimedia data cube seems to be an interesting model for multidimensional analysis of
multimedia data. However, we should note that it is difficult to implement a data cube
efficiently given a large number of dimensions. This curse of dimensionality is especially
serious in the case of multimedia data cubes. We may like to model color, orientation,
texture, keywords, and so on, as multiple dimensions in a multimedia data cube. However,
many of these attributes are set-oriented instead of single-valued.
For example, one image may correspond to a set of keywords. It may contain a set of objects,
each associated with a set of colors. If we use each keyword as a dimension or each detailed
color as a dimension in the design of the data cube, it will create a huge number of
dimensions. On the other hand, not doing so may lead to the modelling of an image at a rather
rough, limited, and imprecise scale. More research is needed on how to design a multimedia
data cube that may strike a balance between efficiency and the power of representation.
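The growth described above can be made concrete with a small calculation. A cube with n dimensions, where dimension i has L_i hierarchy levels (not counting the virtual top level "all"), has (L_1 + 1) x (L_2 + 1) x ... x (L_n + 1) cuboids. The sketch below applies this formula; the dimension counts are illustrative assumptions, not figures from MultiMediaMiner.

```python
from math import prod

def cuboid_count(levels_per_dimension):
    """Number of cuboids in a data cube whose i-th dimension has L_i
    hierarchy levels (excluding the virtual top level 'all')."""
    return prod(levels + 1 for levels in levels_per_dimension)

# Five single-level dimensions (for example size, format, date,
# color, and edge orientation): 2**5 cuboids.
base = cuboid_count([1] * 5)

# The same cube after 20 individual keywords are each turned into
# a separate dimension: 2**25 cuboids.
with_keywords = cuboid_count([1] * 25)
```

Treating only 20 keywords as separate dimensions already multiplies the number of cuboids by about a million, which is why set-valued attributes cannot simply be expanded into one dimension per value.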
"So, how should we go about mining a plan base?" We would like to find a small number of
general (sequential) patterns that cover a substantial portion of the plans, and then we can
divide our search efforts based on such mined sequences. The key to mining such patterns is
to generalize the plans in the plan base to a sufficiently high level. A multidimensional
database model, such as the one shown in Figure 2 for the air flight plan base, can be used to
facilitate such plan generalization. Since low-level information may never share enough
commonality to form succinct plans, we should do the following:
(1) Generalize the plan base in different directions using the multidimensional model
(2) Observe when the generalized plans share common, interesting, sequential patterns with
substantial support
(3) Derive high-level, concise plans.
Let's examine this plan base. By combining tuples with the same plan number, the sequences
of actions (shown in terms of airport codes) may appear as follows:
ALB - JFK - ORD - LAX - SAN
SPI - ORD - JFK - SYR
...
Table 3
Table 4
These sequences may look very different. However, they can be generalized in multiple
dimensions. When they are generalized based on the airport size dimension, we observe
some interesting sequential patterns, like S-L-L-S, where L represents a large airport (i.e., a
hub), and S represents a relatively small regional airport, as shown in Table 3. The
generalization of a large number of air travel plans may lead to some rather general but
highly regular patterns. This is often the case if the merge and optional operators are applied
to the generalized sequences, where the former merges (and collapses) consecutive identical
symbols into one using the transitive closure notation "+" to represent a sequence of actions
of the same type, whereas the latter uses the notation "[ ]" to indicate that the object or action
inside the square brackets is optional. Table 4 shows the result of applying the merge
operator to the plans of Table 3. By merging and collapsing similar actions, we can derive
generalized sequential patterns, such as the pattern shown below:
[S] - L+ - [S]   [98.5%]   (10.1)
The pattern states that 98.5% of travel plans have the pattern [S] - L+ - [S], where [S] indicates
that action S is optional, and L+ indicates one or more repetitions of L. In other words, the
travel pattern consists of flying first from possibly a small airport, hopping through one to
many large airports, and finally reaching a large (or possibly, a small) airport. After a
sequential pattern is found with sufficient support, it can be used to partition the plan base.
We can then mine each partition to find common characteristics. For example, from a
partitioned plan base, we may find
flight(x, y) ∧ airport_size(x, S) ∧ airport_size(y, L) ⇒ region(x) = region(y) [75%],
which means that for a direct flight from a small airport x to a large
airport y, there is a 75% probability that x and y belong to the same region. This example
demonstrates a divide-and-conquer strategy, which first finds interesting, high-level concise
sequences of plans by multidimensional generalization of a plan base, and then partitions the
plan base based on mined patterns to discover the corresponding characteristics of sub-plan
bases. This mining approach can be applied to many other applications. For example, in
Weblog mining, we can study general access patterns from the Web to identify popular Web
portals and common paths before digging into detailed subordinate patterns. The plan mining
technique can be further developed in several aspects.
For instance, a minimum support threshold similar to that in association rule mining can be
used to determine the level of generalization and ensure that a pattern covers a sufficient
number of cases. Additional operators in plan mining can be explored, such as less than.
Other variations include extracting associations from subsequences, or mining sequence
patterns involving multidimensional attributes—for example, the patterns involving both
airport size and location. Such dimension-combined mining also requires the generalization
of each dimension to a high level before examination of the combined sequence patterns.
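The generalization, merge, and pattern-matching steps of plan mining can be sketched as follows. The airport-size lookup stands in for Table 2 and its values are assumptions made for illustration; the function names are hypothetical. The sketch also assumes that the merge operator writes "+" only for runs of two or more identical symbols.

```python
import re
from itertools import groupby

# Hypothetical airport sizes: "L" = large hub, "S" = small regional airport.
AIRPORT_SIZE = {
    "ALB": "S", "JFK": "L", "ORD": "L", "LAX": "L",
    "SAN": "S", "SPI": "S", "SYR": "S",
}

def generalize(plan):
    """Replace each airport code by its size class."""
    return [AIRPORT_SIZE[code] for code in plan]

def merge(symbols):
    """Collapse runs of identical symbols; a run longer than one gets '+'."""
    merged = []
    for symbol, run in groupby(symbols):
        merged.append(symbol + "+" if len(list(run)) > 1 else symbol)
    return merged

# The pattern [S] - L+ - [S] written as a regular expression over
# the string of size symbols of one plan.
PATTERN = re.compile(r"S?L+S?")

def support(plans):
    """Fraction of plans whose generalized sequence matches PATTERN."""
    hits = sum(bool(PATTERN.fullmatch("".join(generalize(p)))) for p in plans)
    return hits / len(plans)

plans = [
    ["ALB", "JFK", "ORD", "LAX", "SAN"],
    ["SPI", "ORD", "JFK", "SYR"],
]
merged = [" - ".join(merge(generalize(p))) for p in plans]
```

Both example plans generalize to different sequences (S-L-L-L-S and S-L-L-S) but merge to the same string, S - L+ - S, which is how a single high-level pattern comes to cover many detailed plans. A minimum support threshold could then be compared against the value returned by support.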
16.6 Mining Associations in Multimedia Data
"What kinds of associations can be mined in multimedia data?" Association rules involving
multimedia objects can be mined in image and video databases. At least three categories can
be observed:
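Associations between image content and non-image content features: A rule like "If at least
50% of the upper part of the picture is blue, then it is likely to represent sky" belongs to this
category since it links the image content to the keyword sky.
Associations among image contents that are not related to spatial relationships: A rule like "If
a picture contains two blue squares, then it is likely to contain one red circle as well" belongs
to this category since the associations are all regarding image contents.
Associations among image contents related to spatial relationships: A rule like "If a red
triangle is between two yellow squares, then it is likely a big oval-shaped object is
underneath" belongs to this category since it associates objects in the image with spatial
relationships.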
To mine associations among multimedia objects, we can treat each image as a transaction and
find frequently occurring patterns among different images.
"What are the differences between mining association rules in multimedia databases versus
in transaction databases?" There are some subtle differences. First, an image may contain
multiple objects, each with many features such as color, shape, texture, keyword, and spatial
location, so there could be many possible associations. In many cases, a feature may be
considered as the same in two images at a certain level of resolution, but different at a finer
resolution level. Therefore, it is essential to promote a progressive resolution refinement
approach. That is, we can first mine frequently occurring patterns at a relatively rough
resolution level, and then focus only on those that have passed the minimum support
threshold when mining at a finer resolution level. This is because the patterns that are not
frequent at a rough level cannot be frequent at finer resolution levels. Such a multiresolution
mining strategy substantially reduces the overall data mining cost without loss of the quality
and completeness of data mining results. This leads to an efficient methodology for mining
frequent item sets and associations in large multimedia databases.
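The progressive resolution refinement idea can be sketched for single features as follows. The mapping from fine-resolution to coarse-resolution features and the toy images are assumptions made for illustration; a full implementation would mine itemsets of any size rather than single items.

```python
from collections import Counter

# Hypothetical mapping from fine-resolution features to coarse ones.
COARSE = {
    "navy": "blue", "sky_blue": "blue",
    "crimson": "red", "scarlet": "red",
    "lime": "green",
}

def frequent(transactions, min_support):
    """Items contained in at least `min_support` of the transactions."""
    counts = Counter(item for t in transactions for item in set(t))
    threshold = min_support * len(transactions)
    return {item for item, n in counts.items() if n >= threshold}

def progressive_mine(images, min_support):
    # Pass 1: mine at the coarse resolution level.
    coarse_images = [{COARSE[f] for f in img} for img in images]
    coarse_frequent = frequent(coarse_images, min_support)
    # Pass 2: refine only features whose coarse parent was frequent.
    pruned = [{f for f in img if COARSE[f] in coarse_frequent} for img in images]
    return coarse_frequent, frequent(pruned, min_support)

images = [
    {"navy", "crimson"},
    {"sky_blue", "lime"},
    {"navy", "scarlet"},
    {"navy"},
]
coarse, fine = progressive_mine(images, min_support=0.5)
```

In this toy data, green fails the threshold at the coarse level, so lime is never counted at the fine level. The pruning is safe because a fine feature can appear in at most as many images as its coarse parent.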
Second, because a picture containing multiple recurrent objects is an important feature in
image analysis, recurrence of the same objects should not be ignored in association analysis.
For example, a picture containing two golden circles is treated quite differently from that
containing only one. This is quite different from that in a transaction database, where the fact
that a person buys one gallon of milk or two may often be treated the same as "buys milk."
Therefore, the definition of multimedia association and its measurements, such as support and
confidence, should be adjusted accordingly.
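One possible way to adjust the support measure so that recurrence counts is to treat each image as a multiset of objects. The sketch below is an assumption about how such a measure could be defined, not a definition given in the text; the object names are illustrative.

```python
from collections import Counter

def support(images, pattern):
    """Fraction of images containing every object in `pattern` at least
    as many times as the pattern requires (multiset containment)."""
    needed = Counter(pattern)
    hits = sum(
        all(Counter(img)[obj] >= n for obj, n in needed.items())
        for img in images
    )
    return hits / len(images)

images = [
    ["golden_circle", "golden_circle", "blue_square"],
    ["golden_circle", "blue_square"],
    ["golden_circle", "golden_circle"],
    ["blue_square"],
]
# "Contains a golden circle" holds for 3 of the 4 images.
one_circle = support(images, ["golden_circle"])
# "Contains two golden circles" holds for only 2 of the 4 images.
two_circles = support(images, ["golden_circle", "golden_circle"])
```

Under an ordinary set-based definition both patterns would have the same support, so the distinction between one and two golden circles would be lost.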
Third, there often exist important spatial relationships among multimedia objects, such as
above, beneath, between, nearby, left-of, and so on. These features are very useful for
exploring object associations and correlations. Spatial relationships together with other
content-based multimedia features, such as color, shape, texture, and keywords, may form
interesting associations. Thus, spatial data mining methods and properties of topological
spatial relationships become important for multimedia mining.
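Spatial relationships such as above or left-of can be derived from object positions before association mining. The following is a minimal sketch that assumes each object is described by an axis-aligned bounding box in image coordinates, with y growing downward; the Box type, the relations function, and the strict non-overlap definitions of the predicates are all assumptions made for illustration.

```python
from typing import NamedTuple

class Box(NamedTuple):
    """Axis-aligned bounding box in image coordinates (y grows downward)."""
    left: float
    top: float
    right: float
    bottom: float

def relations(a: Box, b: Box) -> set:
    """Spatial predicates that hold for box a relative to box b."""
    found = set()
    if a.bottom <= b.top:
        found.add("above")      # a ends before b starts, vertically
    if a.top >= b.bottom:
        found.add("beneath")    # a starts after b ends, vertically
    if a.right <= b.left:
        found.add("left-of")    # a ends before b starts, horizontally
    return found

sun = Box(left=10, top=0, right=30, bottom=20)
tree = Box(left=40, top=50, right=60, bottom=90)
sun_vs_tree = relations(sun, tree)
tree_vs_sun = relations(tree, sun)
```

Each derived predicate, for example (sun, above, tree), can then be added to the transaction for the image as an item, alongside features such as color, shape, texture, and keywords.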
16.7 Summary
A multimedia database system stores and manages a large collection of multimedia data, such
as audio, video, image, graphics, speech, text, document, and hypertext data, which contain
text, text mark-ups, and linkages. In multimedia documents, knowledge discovery deals with
non-structured information. There are two forms of feature extraction: description-based and
content-based. We considered two main families of multimedia indexing and retrieval systems:
description-based retrieval systems and content-based retrieval systems. To facilitate the
multidimensional analysis of large multimedia databases, multimedia data cubes can be
designed and constructed in a manner similar to that for traditional data cubes from relational
data. A multimedia data cube can contain additional dimensions and measures for multimedia
information, such as color, texture, and shape. Association rules involving multimedia
objects can be mined in image and video databases. At least three categories can be observed:
associations between image content and non-image content features, associations among
image contents that are not related to spatial relationships, and associations among image
contents related to spatial relationships.
16.8 Keywords
Multimedia database, Multimedia Data Mining, Description-based, Content-based, Color
histogram-based, Multifeature composed, Wavelet-based, Wavelet-based signature, Mining
Associations in Multimedia Data.
16.9 Exercises
1. What is multimedia data?
2. Explain multimedia data mining.
3. How is feature extraction done in the case of text?
4. How is feature extraction done in the case of images?
5. What features are used for audio classification?
6. Briefly explain data pre-processing for multimedia data.
7. What are the two types of retrieval in multimedia data?
8. Explain multidimensional analysis of multimedia data.
9. What are the two types of image descriptors?
10. Explain the mining of associations in multimedia data.
16.10 References
1. Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber,
Morgan Kaufmann Publisher, Second Edition, 2006.
2. Introduction to Data Mining (ISBN: 0321321367) by Pang-Ning Tan, Michael
Steinbach, Vipin Kumar, Addison-Wesley Publisher, 2005.
3. Introduction to Data Mining with Case Studies by G. K. Gupta, Eastern Economy
Edition (PHI, New Delhi), Third Edition, 2009.
4. Data Mining Techniques by Arun K Pujari, University Press, Second Edition,
2009.