Department of Computer Engineering
Lab Manual
Final Year Semester-VIII
Subject: Data Warehouse and Mining
Even Semester
Institutional Vision, Mission and Quality Policy
Our Vision
To foster and permeate higher and quality education with value-added engineering and technology
programs, providing all facilities in terms of technology and platforms for all-round development with
societal awareness, and to nurture the youth with international competencies and an exemplary level of
employability even under a highly competitive environment, so that they are innovative, adaptable and
capable of handling problems faced by our country and the world at large.
RAIT’s firm belief in new form of engineering education that lays equal stress on academics and
leadership building extracurricular skills has been a major contribution to the success of RAIT as one
of the most reputed institutions of higher learning. The challenges faced by our country and the world in
the 21st century need a whole new range of thought and action leaders, which a conventional
educational system in engineering disciplines is ill equipped to produce. Our reputation in providing
good engineering education with additional life skills ensures that high-grade and highly motivated
students join us. Our laboratories and practical sessions reflect the latest that is being followed in the
Industry. The project works and summer projects make our students adept at handling the real life
problems and be Industry ready. Our students are well placed in the Industry and their performance
makes reputed companies visit us with renewed demands and vigour.
Our Mission
The Institution is committed to mobilize the resources and equip itself with men and materials of
excellence thereby ensuring that the Institution becomes pivotal center of service to Industry,
academia, and society with the latest technology. RAIT engages different platforms such as
technology enhancing Student Technical Societies, Cultural platforms, Sports excellence centers,
Entrepreneurial Development Center and Societal Interaction Cell. To develop the college to become
an autonomous Institution & deemed university at the earliest with facilities for advanced research
and development programs on par with international standards. To invite international and reputed
national Institutions and Universities to collaborate with our institution on the issues of common
interest of teaching and learning sophistication.
RAIT’s Mission is to produce engineering and technology professionals who are innovative and
inspiring thought leaders, adept at solving problems faced by our nation and world by providing
quality education.
The Institute is working closely with all stakeholders, such as industry and academia, to foster knowledge
generation, acquisition and dissemination using the best available resources to address the great challenges
being faced by our country and the world. RAIT is fully dedicated to providing its students with skills that make
them leaders and solution providers, and industry ready when they graduate from the Institution.
We at RAIT assure our main stakeholders, our students, of 100% quality in the programmes we deliver.
This quality assurance stems from the teaching and learning processes we have at work at our campus
and the teachers who are handpicked from reputed institutions (IIT/NIT/MU, etc.) and who inspire the
students to be innovative in thinking and practical in approach. We have installed internal procedures
to improve the skill sets of instructors by sending them to training courses, workshops, seminars and
conferences. We also have a full-fledged course curriculum and deliveries planned in advance for a
structured semester-long programme. We have a well-developed feedback system from employers, alumni,
students and parents to fine-tune the Learning and Teaching processes. These tools help us to ensure the
same quality of teaching independent of any individual instructor. Each classroom is equipped with
Internet and other digital learning resources.
The effective learning process in the campus comprises a clean and stimulating classroom
environment and availability of lecture notes and digital resources prepared by instructor from the
comfort of home. In addition, the student is provided with a good number of assignments that trigger
the thinking process. The testing process involves an objective test paper that gauges the
understanding of concepts by the students. The quality assurance process also ensures that the
learning process is effective. The summer internships and project work based training ensure learning
process to include practical and industry relevant aspects. Various technical events, seminars and
conferences make the student learning complete.
Our Quality Policy
It is our earnest endeavour to produce high quality engineering professionals who are
innovative and inspiring, thought and action leaders, competent to solve problems
faced by society, nation and the world at large by striving towards very high standards in
learning, teaching and training methodologies.
Our Motto: If it is not of quality, it is NOT RAIT!
Dr. Vijay D. Patil
President, RAES
Departmental Vision, Mission
Vision
To impart higher and quality education in computer science with value added engineering and
technology programs to prepare technically sound, ethically strong engineers with social awareness.
To extend the facilities, to meet the fast changing requirements and nurture the youths with
international competencies and exemplary level of employability and research under highly
competitive environments.
Mission
To mobilize the resources and equip the institution with men and materials of excellence to provide
knowledge and develop technologies in the thrust areas of computer science and Engineering. To
provide the diverse platforms of sports, technical, cocurricular and extracurricular activities for the
overall development of student with ethical attitude. To prepare the students to sustain the impact of
computer education for social needs encompassing industry, educational institutions and public
service. To collaborate with IITs, reputed universities and industries for the technical and overall
upliftment of students for continuing learning and entrepreneurship.
Departmental Program Educational Objectives
(PEOs)
1. Learn and Integrate
To provide Computer Engineering students with a strong foundation in the mathematical,
scientific and engineering fundamentals necessary to formulate, solve and analyze
engineering problems and to prepare them for graduate studies.
2. Think and Create
To develop an ability to analyze the requirements of the software and hardware, understand
the technical specifications, create a model, design, implement and verify a computing system
to meet specified requirements while considering real-world constraints to solve real world
problems.
3. Broad Base
To provide broad education necessary to understand the science of computer engineering and
the impact of it in a global and social context.
4. Techno-leader
To provide exposure to emerging cutting edge technologies, adequate training &
opportunities to work as teams on multidisciplinary projects with effective communication
skills and leadership qualities.
5. Practice citizenship
To provide knowledge of professional and ethical responsibility and to contribute to society
through active engagement with professional societies, schools, civic organizations or other
community activities.
6. Clarify Purpose and Perspective
To provide strong in-depth education through electives and to promote student awareness on the
life-long learning to adapt to innovation and change, and to be successful in their professional
work or graduate studies.
Departmental Program Outcomes (POs)
Pa. Foundation of computing - An ability to apply knowledge of computing, applied
mathematics, and fundamental engineering concepts appropriate to the discipline.
Pb. Experiments & Data Analysis - An ability to understand, identify, analyze and design the
problem, implement and validate the solution including both hardware and software.
Pc. Current Computing Techniques – An ability to use current techniques, skills, and tools
necessary for computing practice .
Pd. Teamwork – An ability to have leadership and management skills to accomplish a common
goal.
Pe. Engineering Problems – An ability to identify, formulate, and solve engineering problems.
Pf. Professional Ethics – An understanding of professional, ethical, legal, security and social
issues and responsibilities.
Pg. Communication – An ability to communicate effectively with a range of audiences in both
verbal and written form.
Ph. Impact of Technology – An ability to analyze the local and global impact of computing on
individuals, organizations, and society.
Pi. Life-long learning – An ability to recognize the need for, and an ability to engage in life-long
learning.
Pj. Contemporary Issues – An ability to exploit gained skills and knowledge of contemporary
issues.
Pk. Professional Development – Recognition of the need for and an ability to engage in
continuing professional development and higher studies.
Pl. Employment - An ability to get an employment to the international repute industries through
the training programs, internships, projects, workshops and seminars.
Index
1. List of Experiments
2. Course Objectives, Course Outcomes and Experiment Plan
3. CO-PO Mapping
4. Study and Evaluation Scheme
5. Experiment No. 1
6. Experiment No. 2
7. Experiment No. 3
8. Experiment No. 4
9. Experiment No. 5
10. Experiment No. 6
11. Experiment No. 7
12. Experiment No. 8
13. Experiment No. 9
14. Experiment No. 10
15. Experiment No. 11
List of Experiments
1. Case study on a data mart / data warehouse system (dimensional modelling, fact and dimension tables, OLAP operations).
2. Implementation of the decision tree algorithm in JAVA.
3. Implementation of the ID3 algorithm using the WEKA tool.
4. Implementation of K-means clustering in JAVA.
5. Implementation of the K-means clustering algorithm using the WEKA tool.
6. Study and implementation of the Apriori algorithm in JAVA.
7. Implementation of the Apriori algorithm in WEKA.
8. Study of the R tool.
9. Study of a BI tool (SPSS Clementine).
10. Study of different OLAP operations.
11. Study of different pre-processing steps for a data warehouse.
12. Mini Project
Course Objectives & Course Outcome,
Experiment Plan
Course Objectives:
1.
2.
3.
4.
5.
To study the methodology of engineering legacy databases for data warehousing.
To study the design modeling of data warehouse.
To study the preprocessing and online analytical processing of data.
To study the methodology of engineering legacy of datamining to derive business
rules for decision support systems.
To analyze the data, identify the problems, and choose the relevant modelsand
algorithms toapply.
Course Outcomes:
CO1
CO2
CO3
CO4
CO5
Student will be able to understand data warehouse and design model of data
warehouse.
Students will be able to learned steps of preprocessing
Students will be able to understand the analytical operations on data.
Students will be able to discover patterns and knowledge from data warehouse.
Students will be able to understand and implement classical algorithms in data
9
Experiment Plan
1. W1-W2: One case study given to a group of 3/4 students on a data mart / data warehouse (CO1)
2. W3: Implementation of a classifier like decision tree using Java (CO5)
3. W4: Use WEKA to implement a classifier like decision tree (CO5)
4. W5: Implementation of a clustering algorithm like K-means using Java (CO5)
5. W6: Use WEKA to implement the K-means clustering algorithm (CO5)
6. W7: Implementation of association mining like Apriori using Java (CO5)
7. W8: Use WEKA to implement association mining like Apriori (CO5)
8. W9: Use the R tool to implement clustering / association rule / classification algorithms (CO3)
9. W10: Detailed study of a BI tool - SPSS Clementine (CO3)
10. W11: Study different OLAP operations (CO4)
11. W12: Study different pre-processing steps for a data warehouse (CO2)
Mapping of Course Outcomes (CO) to Program Outcomes (PO)

Subject weight: Practical, 50%.

Each course outcome contributes to the program outcomes Pa-Pl with a strength value of 1 to 3:

CO1  Students will be able to understand the data warehouse and the design model of a data warehouse.
CO2  Students will be able to learn the steps of pre-processing.
CO3  Students will be able to understand the analytical operations on data.
CO4  Students will be able to discover patterns and knowledge from a data warehouse.
CO5  Students will be able to understand and implement classical algorithms in data mining and data warehousing; students will be able to assess the strengths and weaknesses of the algorithms, identify the application areas of the algorithms, and apply them.
Study and Evaluation Scheme
Course Code: CPC801
Course Name: Data Warehouse and Mining

Teaching Scheme: Theory 04, Practical 02, Tutorial --
Credits Assigned: Theory 04, Practical 01, Tutorial --, Total 05

Examination Scheme: Term Work 25, Practical 25, Total 50
Term Work:
Internal assessment consists of two tests. Test 1, an institution-level central test, is for 20
marks and is to be based on a minimum of 40% of the syllabus. Test 2 is also for 20 marks
and is to be based on the remaining syllabus. Test 2 may be either a class test, an assignment
on live problems, or a course project.
Practical & Oral:
The oral examination is to be conducted by a pair of internal and external examiners based on the
syllabus.
Data Warehouse and Mining
Experiment No. : 1
Case study on Data warehouse System.
Experiment No. 1
1. Aim: One case study on a Data Warehouse System.
A. Write a detailed problem statement and create the dimensional model (star and snowflake schema); implement the dimensional model.
B. Implement all dimension tables and the fact table.
C. Implement OLAP operations.
2. Objectives: From this experiment, the student will be able to
 Understand the basics of Data Warehouse
 Understand the design model of Data Warehouse
 Study methodology of engineering legacy databases for data warehousing
3. Outcomes: The learner will be able to
 Apply knowledge of legacy databases in creating data warehouse
 Understand, identify, analyse and design the warehouse
 Use current techniques, skills and tools necessary for designing a data
warehouse
4. Software Required: Oracle 11g
5. Theory:
In computing, online analytical processing (OLAP) is an approach to answering multi-dimensional analytical (MDA) queries swiftly. OLAP is part of the broader category
of business intelligence, which also encompasses relational databases, report writing and data
mining. Typical applications of OLAP include business reporting for sales, marketing,
management reporting, business process management (BPM), budgeting and similar areas,
with new applications coming up, such as agriculture. The term OLAP was created as a slight
modification of the traditional database term online transaction processing (OLTP).
Dimensional modelling
Dimensional modelling (DM) names a set of techniques and concepts used in data warehouse design. It is
considered to be different from entity-relationship (ER) modelling. Dimensional modelling does not
necessarily involve a relational database; the same modelling approach, at the logical level,
can be used for any physical form, such as a multidimensional database or even flat files. DM
is a design technique for databases intended to support end-user queries in a data warehouse.
It is oriented around understandability and performance.
Star Schema
- The fact table is in the middle and the dimension tables are arranged around the fact table.
Snowflake Schema
Normalization and expansion of the dimension tables in a star schema result in a snowflake design.
Snowflaking in the dimensional model can impact the understandability of the dimensional model and
result in a decrease in performance, because more tables need to be joined to satisfy queries.
6. Conclusion:
We have studied the different schemas of a data warehouse and, using the methodology of
engineering a legacy database, a new data warehouse was built. Normalization was applied to the
star schema wherever required, and a snowflake schema was designed.
7. Viva Questions:
 What is data warehouse?
 What is multi-dimensional data?
 What is difference between star and snowflake schema?
8. References:
 Paulraj Ponniah, "Data Warehousing: Fundamentals for IT Professionals", Wiley India
 Reema Thareja, "Data Warehousing", Oxford University Press
Data Warehouse and Mining
Experiment No. : 2
Implementation of decision tree
algorithm in JAVA.
Experiment No. 2
1. Aim: Implementation of the decision tree algorithm in JAVA.
2. Objectives: From this experiment, the student will be able to
 Analyse the data, identify the problem and choose relevant algorithm to apply
 Understand and implement classical algorithms in data mining
 Identify the application of classification algorithm in data mining
3. Outcomes: The learner will be able to
 Assess the strengths and weaknesses of algorithms
 Identify, formulate and solve engineering problems
 Analyse the local and global impact of data mining on individuals, organizations and society
4. Software Required: JDK for JAVA
5. Theory:
Decision tree learning is one of the most widely used and practical methods for inductive
inference over supervised data. A decision tree represents a procedure for classifying
categorical data based on their attributes. It is also efficient for processing large amounts of
data, and so is often used in data mining operations. The construction of a decision tree does not
require any domain knowledge or parameter setting, and is therefore appropriate for exploratory
knowledge discovery.
A decision tree builds classification or regression models in the form of a tree structure. It
breaks down a dataset into smaller and smaller subsets while at the same time an associated
decision tree is incrementally developed. The final result is a tree with decision
nodes and leaf nodes.
The core algorithm for building decision trees, called ID3 and developed by J. R. Quinlan, employs a
top-down, greedy search through the space of possible branches with no backtracking. ID3
uses Entropy and Information Gain to construct a decision tree.
Entropy: A decision tree is built top-down from a root node and involves partitioning the
data into subsets that contain instances with similar values (homogeneous). The ID3 algorithm uses
entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous
the entropy is zero, and if the sample is equally divided it has an entropy of one. To build a
decision tree, we need to calculate two types of entropy using frequency tables: the entropy of the
target alone, Entropy(S) = -Σ p(c) log2 p(c) over the classes c, and the entropy of the target for each
value of a candidate attribute, weighted by how often that value occurs.
Information Gain: The information gain is based on the decrease in entropy after a dataset is
split on an attribute. Constructing a decision tree is all about finding the attribute that returns the
highest information gain (i.e., the most homogeneous branches).
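To make the two measures concrete, here is a small Java sketch that computes entropy and information gain for a categorical target split on one attribute; the tiny outlook/play arrays are invented sample data, not taken from the manual.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch: entropy and information gain as used by ID3 over categorical data.
public class EntropyDemo {

    // Entropy(S) = -sum over classes c of p(c) * log2 p(c)
    static double entropy(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String l : labels) counts.merge(l, 1, Integer::sum);
        double e = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / labels.size();
            e -= p * (Math.log(p) / Math.log(2));
        }
        return e;
    }

    // Gain(S, A) = Entropy(S) - sum over values v of A of |S_v|/|S| * Entropy(S_v)
    static double informationGain(List<String> attribute, List<String> labels) {
        Map<String, List<String>> partitions = new HashMap<>();
        for (int i = 0; i < labels.size(); i++)
            partitions.computeIfAbsent(attribute.get(i), k -> new ArrayList<>()).add(labels.get(i));
        double weighted = 0.0;
        for (List<String> part : partitions.values())
            weighted += ((double) part.size() / labels.size()) * entropy(part);
        return entropy(labels) - weighted;
    }

    public static void main(String[] args) {
        // hypothetical sample: outlook attribute versus the play decision
        List<String> outlook = List.of("sunny", "sunny", "overcast", "rain", "rain", "overcast");
        List<String> play    = List.of("no",    "no",    "yes",      "yes",  "no",   "yes");
        System.out.println("Entropy(play)       = " + entropy(play));
        System.out.println("Gain(play, outlook) = " + informationGain(outlook, play));
    }
}

The attribute with the largest gain over all candidate attributes becomes the decision node, exactly as described in the procedure below.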
6. Procedure/Program:
1. Calculate the entropy of the target.
2. The dataset is then split on the different attributes. The entropy for each branch is
calculated. Then it is added proportionally, to get the total entropy for the split. The
resulting entropy is subtracted from the entropy before the split. The result is the
Information Gain, or decrease in entropy.
3. Choose the attribute with the largest information gain as the decision node.
4. a. A branch with entropy of 0 is a leaf node.
   b. A branch with entropy more than 0 needs further splitting.
5. The ID3 algorithm is run recursively on the non-leaf branches, until all data is classified.
7. Results:
Decision Tree to Decision Rules
A decision tree can easily be transformed to a set of rules by mapping from the root node
to the leaf nodes one by one
8. Conclusion:
The different classification algorithms of data mining were studied and one among them
named decision tree (ID3) algorithm was implemented using JAVA. The need for
classification algorithm was recognized and understood.
9. Viva Questions:
 What are the various classification algorithms?
 What is entropy?
 How do you find information gain?
10. References:
 Han and Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, 3rd Edition
 M. H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education
Data Warehouse and Mining
Experiment No. : 3
Implementation of ID3 algorithm using
WEKA tool.
Experiment No. 3
1. Aim: Implementation of the ID3 algorithm using the WEKA tool.
2. Objectives: From this experiment, the student will be able to
 Analyse the data, identify the problem and choose relevant algorithm to apply
 Understand and implement classical algorithms in data mining
 Identify the application of classification algorithm in data mining
3. Outcomes: The learner will be able to
 Assess the strengths and weaknesses of algorithms
 Identify, formulate and solve engineering problems
 Analyse the local and global impact of data mining on individuals, organizations and society
4. Software Required: WEKA tool
5. Theory:
Decision tree learning is a method for assessing the most likely outcome value by taking into
account the known values of the stored data instances. This learning method is among the
most popular inductive inference algorithms and has been successfully applied to a broad
range of tasks, such as assessing the credit risk of applicants and improving the loyalty of regular
customers.
6. Procedure:
1. Download a dataset for implementation of the ID3 algorithm (.csv or .arff file). Here the bank-data.csv dataset has been taken for decision tree analysis.
2. Load the data in the WEKA tool.
3. Select the "Classify" tab and click the "Choose" button to select the ID3 classifier
4. Specify the various parameters. These can be specified by clicking in the text box to
the right of the "Choose" button. In this example we accept the default values. The
default version does perform some pruning (using the subtree raising approach), but
does not perform error pruning
5. Under the "Test options" in the main panel we select 10-fold cross-validation as our
evaluation approach. Since we do not have separate evaluation data set, this is
necessary to get a reasonable idea of accuracy of the generated model. We now click
"Start" to generate the model.
6. We can view this information in a separate window by right clicking the last result
set (inside the "Result list" panel on the left) and selecting "View in separate
window" from the pop-up menu.
7. WEKA also provides a graphical rendition of the classification tree. This can
be done by right clicking the last result set (as before) and selecting "Visualize tree"
from the pop-up menu.
We will now use our model to classify the new instances. However, in the data section, the
value of the "pep" attribute is "?" (or unknown).
In the main panel, under "Test options" click the "Supplied test set" radio button, and then
click the "Set..." button. This will pop up a window which allows you to open the file
containing test instances.
In this case, we open the file "bank-new.arff" and upon returning to the main window, we
click the "start" button. This, once again generates the models from our training data, but this
time it applies the model to the new unclassified instances in the "bank-new.arff" file in order
to predict the value of "pep" attribute.
The summary of the results in the right panel does not show any statistics. This is because in
our test instances the value of the class attribute ("pep") was left as "?", thus WEKA has no
actual values to which it can compare the predicted values of new instances.
The GUI version of WEKA is used to create a file containing all the new instances along with their
predicted class values resulting from the application of the model.
First, right-click the most recent result set in the left "Result list" panel. In the resulting popup window select the menu item "Visualize classifier errors". This brings up a separate
window containing a two-dimensional graph.
8. To save the file: in the new window, we click on the "Save" button and save the
result as the file "bank-predicted.arff".
This file contains a copy of the new instances along with an additional column for the
predicted value of "pep". The top portion of the file can be seen in below figure.
7. Conclusion:
The different classification algorithms of data mining were studied and one among them,
the decision tree (ID3) algorithm, was implemented using the WEKA tool. The need for a
classification algorithm was recognized and understood.
8. Viva Questions:
 What is the use of WEKA tool?
9. References:
 Han and Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, 3rd Edition
 M. H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education
Data Warehouse and Mining
Experiment No. : 4
Implementation of K-means clustering
in JAVA.
Experiment No. 4
1. Aim: Implementation of K-means clustering in JAVA.
2. Objectives: From this experiment, the student will be able to
 Analyse the data, identify the problem and choose relevant algorithm to apply
 Understand and implement classical clustering algorithms in data mining
 Identify the application of clustering algorithm in data mining
3. Outcomes: The learner will be able to
 Assess the strengths and weaknesses of algorithms
 Identify, formulate and solve engineering problems
 Analyse the local and global impact of data mining on individuals, organizations and society
4. Software Required: JDK for JAVA
5. Theory:
Clustering is dividing data points into homogeneous classes or clusters:


Points in the same group are as similar as possible
Points in different group are as dissimilar as possible
When a collection of objects is given, we put objects into group based on similarity.
Clustering Algorithms:
A Clustering Algorithm tries to analyse natural groups of data on the basis of some
similarity. It locates the centroid of the group of data points. To carry out effective
clustering, the algorithm evaluates the distance between each point from the centroid of
the cluster. The goal of clustering is to determine the intrinsic grouping in a set of
unlabelled data.
K-means Clustering
K-means (Macqueen, 1967) is one of the simplest unsupervised learning algorithms
that solve the well-known clustering problem. K-means clustering is a method of
vector quantization, originally from signal processing, that is popular for cluster
analysis in data mining.
6. Procedure:
Input:
K: the number of clusters
D: a data set containing n objects.
Output: A set of k clusters.
1. Arbitrarily choose K objects from D as the initial cluster centers.
2. Partition the objects into k non-empty subsets.
3. Identify the cluster centroids (mean points) of the current partition.
4. Assign each point to a specific cluster.
5. Compute the distances from each point and allot points to the cluster where the distance from the centroid is minimum.
6. After re-allotting the points, find the centroid of the new cluster formed.
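A compact Java sketch of this procedure is given below for two-dimensional points. The sample points and the choice of K = 2 are illustrative assumptions, and the initial centroids are simply the first K points.

import java.util.Arrays;

// Minimal sketch: K-means on 2-D points with Euclidean distance.
public class KMeansDemo {
    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1.5, 2}, {3, 4}, {5, 7}, {3.5, 5}, {4.5, 5}, {3.5, 4.5}};
        int k = 2, n = points.length;

        // Step 1: choose K objects from D as the initial cluster centers.
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = points[c].clone();

        int[] assignment = new int[n];
        Arrays.fill(assignment, -1);
        boolean moved = true;
        while (moved) {                       // iterate until no object changes its cluster
            moved = false;
            // Steps 4-5: allot each point to the cluster whose centroid is nearest.
            for (int i = 0; i < n; i++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist(points[i], centroids[c]) < dist(points[i], centroids[best])) best = c;
                if (assignment[i] != best) { assignment[i] = best; moved = true; }
            }
            // Steps 3 and 6: recompute the centroid (mean point) of every cluster.
            for (int c = 0; c < k; c++) {
                double sx = 0, sy = 0;
                int count = 0;
                for (int i = 0; i < n; i++)
                    if (assignment[i] == c) { sx += points[i][0]; sy += points[i][1]; count++; }
                if (count > 0) centroids[c] = new double[]{sx / count, sy / count};
            }
        }
        System.out.println("Assignments: " + Arrays.toString(assignment));
        System.out.println("Centroids:   " + Arrays.deepToString(centroids));
    }

    static double dist(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }
}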
7. Conclusion:
The different clustering algorithms of data mining were studied and one among them named
k-means clustering algorithm was implemented using JAVA. The need for clustering
algorithm was recognized and understood.
8. Viva Questions:
 What are different clustering techniques?
 What is difference between K-means and K-medoids?
 What is a dendrogram?
9. References:
 Han and Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, 3rd Edition
 M. H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education
Data Warehouse and Mining
Experiment No. : 5
To implement the clustering algorithm –
K-means using WEKA tool.
Experiment No. 5
1. Aim:To implement the clustering algorithm, K-means using WEKA tool.
2. Objectives: From this experiment, the student will be able to
 Analyse the data, identify the problem and choose relevant algorithm to apply
 Understand and implement classical clustering algorithms in data mining
 Identify the application of clustering algorithm in data mining
3. Outcomes: The learner will be able to
 Assess the strengths and weaknesses of algorithms
 Identify, formulate and solve engineering problems
 Analyse the local and global impact of data mining on individuals, organizations and society
4. Software Required: WEKA tool
5. Theory:
Weka is a landmark system in the history of the data mining and machine learning
research communities, because it is the only toolkit that has gained such widespread
adoption and survived for an extended period of time.
The key features responsible for Weka's success are:
• It provides many different algorithms for data mining and machine learning.
• It is open source and freely available.
• It is platform-independent.
• It is easily usable by people who are not data mining specialists.
• It provides flexible facilities for scripting experiments.
• It has been kept up-to-date, with new algorithms added over time.
WEKA INTERFACE
The GUI Chooser consists of four buttons—one for each of the four major Weka
applications—and four menus.The buttons can be used to start the following applications:
• Explorer : An environment for exploring data with WEKA .
• Experimenter : An environment for performing experiments and conducting statistical tests
between learning schemes.
• KnowledgeFlow : This environment supports essentially the same functions as the Explorer
but with a drag-and-drop interface. One advantage is that it supports incremental learning.
• SimpleCLI : Provides a simple command-line interface that allows direct execution of
WEKA commands for operating systems that do not provide their own command line
interface.
WEKA CLUSTERER
It contains “clusterers” for finding groups of similar instances in a dataset. Some
implemented schemes are: k-Means, EM, Cobweb, X-means, FarthestFirst .Clusters can be
visualized and compared to “true” clusters.
6. Procedure:
The basic step of k-means clustering is simple. In the beginning, we determine number of
cluster K and we assume the centroid or center of these clusters. We can take any random
objects as the initial centroids or the first K objects can also serve as the initial centroids.
Then the K means algorithm will do the three steps below until convergence. Iterate until
stable (= no object move group):
1. Determine the centroid coordinate
2. Determine the distance of each object to the centroids
3. Group the object based on minimum distance (find the closest centroid)
K-means in WEKA 3.7
The sample data set used is based on the "bank data" available in comma-separated format as
bank-data.csv. The resulting data file is "bank.arff" and includes 600 instances. As an
illustration of performing clustering in WEKA, we will use its implementation of the K-means
algorithm to cluster the customers in this bank data set, and to characterize the resulting
customer segments.
To perform clustering, select the "Cluster" tab in the Explorer and click on the "Choose"
button. This results in a drop down list of available clustering algorithms. In this case we
select "SimpleKMeans".
Next, click on the text box to the right of the "Choose" button to get the pop-up window
shown below, for editing the clustering parameter.
In the pop-up window we enter 2 as the number of clusters and we leave the value of "seed"
as is.
The seed value is used in generating a random number which is, in turn, used for making the
initial assignment of instances to clusters.
Once the options have been specified, we can run the clustering algorithm. Here we make
sure that in the "Cluster Mode" panel, the "Use training set" option is selected, and we click
"Start".
We can right click the result set in the "Result list" panel and view the results of clustering in
a separate window.
We can even visualize the assigned cluster as below
You can choose the cluster number and any of the other attributes for each of the three
different dimensions available (x-axis, y-axis, and color). Different combinations of choices
will result in a visual rendering of different relationships within each cluster.
Note that in addition to the "instance_number" attribute, WEKA has also added "Cluster"
attribute to the original data set. In the data portion, each instance now has its assigned cluster
as the last attribute value (as shown below).
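For reference, the same clustering run can be reproduced from Java through the WEKA API. This is a minimal sketch; the path to bank.arff, the two clusters and the seed value mirror the settings used in the Explorer above.

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch: SimpleKMeans with 2 clusters over the bank data via the WEKA API.
public class WekaKMeansDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank.arff");   // assumed path to the dataset

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(2);        // same value entered in the parameter pop-up
        km.setSeed(10);              // the seed drives the initial assignment of instances
        km.buildClusterer(data);

        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(km);
        eval.evaluateClusterer(data);                  // "use training set" mode
        System.out.println(eval.clusterResultsToString());
    }
}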
7. Conclusion:
The different clustering algorithms of data mining were studied and one among them,
the K-means clustering algorithm, was implemented using the WEKA tool. The need for a clustering
algorithm was recognized and understood.
8. References:
 Han and Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, 3rd Edition
 M. H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education
Data Warehouse and Mining
Experiment No. : 6
To study and implement Apriori
Algorithm.
Experiment No. 6
1. Aim: To study and implement Apriori Algorithm.
2. Objectives: From this experiment, the student will be able to
 Analyse the data, identify the problem and choose relevant algorithm to apply
 Understand and implement classical association mining algorithms
 Identify the application of association mining algorithms
3. Outcomes: The learner will be able to
 Assess the strengths and weaknesses of algorithms
 Identify, formulate and solve engineering problems
 Analyse the local and global impact of data mining on individuals, organizations and society
4. Software Required: JDK for JAVA
5. Theory:
The Apriori algorithm is a well-known association rule algorithm used in most
commercial products. It uses the large itemset property: any subset of a large itemset must be
large.
6. Procedure:
Input:
I   // itemset of items
D   // database of transactions
s   // support threshold
Output:
L   // set of large itemsets

Apriori algorithm:
k = 0;                 // level counter
L = ∅;
C1 = I;                // initial candidates are the single items
repeat
    k = k + 1;
    Lk = ∅;
    for each Ii ∈ Ck do
        ci = 0;                        // support count for candidate Ii
    for each tj ∈ D do
        for each Ii ∈ Ck do
            if Ii ⊆ tj then ci = ci + 1;
    for each Ii ∈ Ck do
        if ci >= (s × |D|) then Lk = Lk ∪ {Ii};
    L = L ∪ Lk;
    Ck+1 = Apriori-Gen(Lk);
until Ck+1 = ∅;
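As a compact illustration of this pseudocode, the Java sketch below runs the level-wise generate-and-count loop over a tiny invented transaction database; the item names and the 60% support threshold are assumptions for the example, not data from the manual.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Minimal sketch: level-wise Apriori over a small hypothetical transaction database.
public class AprioriDemo {
    public static void main(String[] args) {
        List<Set<String>> db = List.of(
                Set.of("bread", "milk"),
                Set.of("bread", "butter", "milk"),
                Set.of("butter", "milk"),
                Set.of("bread", "butter"),
                Set.of("bread", "butter", "milk"));
        double minSupport = 0.6;   // s: minimum fraction of transactions

        // C1 = all single items.
        Set<Set<String>> candidates = new HashSet<>();
        for (Set<String> t : db)
            for (String item : t) candidates.add(Set.of(item));

        List<Set<String>> large = new ArrayList<>();   // L = union of all large itemsets
        while (!candidates.isEmpty()) {
            // Count the support of each candidate and keep the large ones (Lk).
            Set<Set<String>> lk = new HashSet<>();
            for (Set<String> c : candidates) {
                long count = db.stream().filter(t -> t.containsAll(c)).count();
                if (count >= minSupport * db.size()) lk.add(c);
            }
            large.addAll(lk);
            // Apriori-Gen: join Lk with itself and prune candidates with a non-large subset.
            Set<Set<String>> next = new HashSet<>();
            for (Set<String> a : lk)
                for (Set<String> b : lk) {
                    Set<String> union = new TreeSet<>(a);
                    union.addAll(b);
                    if (union.size() == a.size() + 1
                            && lk.containsAll(subsetsMissingOneItem(union)))
                        next.add(union);
                }
            candidates = next;
        }
        large.forEach(s -> System.out.println("large itemset: " + s));
    }

    // All subsets obtained by dropping one item, used for the Apriori pruning step.
    static Set<Set<String>> subsetsMissingOneItem(Set<String> itemset) {
        Set<Set<String>> subsets = new HashSet<>();
        for (String drop : itemset) {
            Set<String> s = new HashSet<>(itemset);
            s.remove(drop);
            subsets.add(s);
        }
        return subsets;
    }
}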
7. Conclusion:
The different association mining algorithms of data mining were studied and one among them
named Apriori association mining algorithm was implemented using JAVA. The need for
association mining algorithm was recognized and understood.
8. Viva Questions:
 What are support and confidence?
 What are the different types of association mining algorithms?
 What is the disadvantage of the Apriori algorithm?
9. References:
 Han and Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, 3rd Edition
 M. H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education
Data Warehouse and Mining
Experiment No. : 7
Implementation of Apriori algorithm in
WEKA.
Experiment No. 7
1. Aim: Implementation of Apriori algorithm in WEKA.
2. Objectives: From this experiment, the student will be able to
 Analyse the data, identify the problem and choose relevant algorithm to apply
 Understand and implement classical association mining algorithms
 Identify the application of association mining algorithms
3. Outcomes: The learner will be able to
 Assess the strengths and weaknesses of algorithms
 Identify, formulate and solve engineering problems
 Analyse the local and global impact of data mining on individuals, organizations and society
4. Software Required: WEKA tool
5. Theory:
The Apriori algorithm is an influential algorithm for mining frequent itemsets for boolean
association rules. Some key concepts for the Apriori algorithm are:
 Frequent itemsets: the sets of items which have minimum support (denoted by Li for the ith itemset).
 Apriori property: any subset of a frequent itemset must be frequent.
 Join operation: to find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
6. Procedure:
WEKA implementation:
To learn the system, TEST_ITEM_TRANS.arff has been used.
Using the Apriori Algorithm we want to find the association rules that have
minSupport=50% and minimum confidence=50%. After we launch the WEKA
application and open the TEST_ITEM_TRANS.arff file as shown in below figure.
Then we move to the Associate tab and we set up the configuration as shown below
After the algorithm is finished, we get the following results:
=== Run information ===
Scheme: weka.associations.Apriori -N 20 -T 0 -C 0.5 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: TEST_ITEM_TRANS
Instances: 15
Attributes: 8
A B C D E F G H
=== Associator model (full training set) ===
Apriori
=======
Minimum support: 0.5 (7 instances)
Minimum metric: 0.5
Number of cycles performed: 10
Generated sets of large itemsets:
Size of set of large itemsets L(1): 10
Size of set of large itemsets L(2): 12
Size of set of large itemsets L(3): 3
Best rules found
1. E=TRUE 11 ==> H=TRUE 11 conf:(1)
2. B=TRUE 10 ==> H=TRUE 10 conf:(1)
3. C=TRUE 10 ==> H=TRUE 10 conf:(1)
4. A=TRUE 9 ==> H=TRUE 9 conf:(1)
5. G=FALSE 9 ==> H=TRUE 9 conf:(1)
6. D=TRUE 8 ==> H=TRUE 8 conf:(1)
7. F=FALSE 8 ==> H=TRUE 8 conf:(1)
8. D=FALSE 7 ==> H=TRUE 7 conf:(1)
9. F=TRUE 7 ==> H=TRUE 7 conf:(1)
10. B=TRUE E=TRUE 7 ==> H=TRUE 7 conf:(1)
11. C=TRUE G=FALSE 7 ==> H=TRUE 7 conf:(1)
12. E=TRUE G=FALSE 7 ==> H=TRUE 7 conf:(1)
13. G=FALSE 9 ==> C=TRUE 7 conf:(0.78)
14. G=FALSE 9 ==> E=TRUE 7 conf:(0.78)
15. G=FALSE H=TRUE 9 ==> C=TRUE 7 conf:(0.78)
16. G=FALSE 9 ==> C=TRUE H=TRUE 7 conf:(0.78)
17. G=FALSE H=TRUE 9 ==> E=TRUE 7 conf:(0.78)
18. G=FALSE 9 ==> E=TRUE H=TRUE 7 conf:(0.78)
19. H=TRUE 15 ==> E=TRUE 11 conf:(0.73)
20. B=TRUE 10 ==> E=TRUE 7 conf:(0.7)
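The run above can also be reproduced programmatically. The sketch below is a minimal, assumed workflow against the WEKA Java API, using the same TEST_ITEM_TRANS.arff file and the same minimum support and minimum confidence of 50%.

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch: mining association rules with WEKA's Apriori from Java.
public class WekaAprioriDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("TEST_ITEM_TRANS.arff");   // assumed path

        Apriori apriori = new Apriori();
        apriori.setNumRules(20);                // -N 20
        apriori.setLowerBoundMinSupport(0.5);   // minimum support = 50%
        apriori.setMinMetric(0.5);              // minimum confidence = 50% (-C 0.5)
        apriori.buildAssociations(data);

        System.out.println(apriori);            // prints the best rules found
    }
}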
7. Conclusion:
The different association mining algorithms of data mining were studied and one among them,
the Apriori association mining algorithm, was implemented using the WEKA tool. The need for an
association mining algorithm was recognized and understood.
8. References:
 Han and Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, 3rd Edition
 M. H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education
Data Warehouse and Mining
Experiment No. : 8
Study of R Tool
Experiment No. 8
1. Aim: Study of R Tool.
2. Objectives: From this experiment, the student will be able to
 Learn the basics of the R mining tool
 Use the R tool to implement clustering, association rule and classification algorithms
 Study the methodology of engineering legacy of data mining
3. Outcomes: The learner will be able to
 Use current techniques, skills and tools for mining.
 Engage them in life-long learning.
 Able to match industry requirements in domains of data mining
4. Software Required: R tool
5. Theory:
R tool is "a programming “environment”, object-oriented similar to S-Plus freeware that
provides calculations on matrices, excellent graphics capabilities and supported by a large
user network.
Installing R:
1) Download from CRAN
2) Select a download site
3) Download the base package at a minimum
4) Download contributed packages as needed
R Basics / Components Of R:
 Objects
 Naming convention
 Assignment
 Functions
 Workspace
 History
Objects




names
types of objects: vector, factor, array, matrix, data.frame, ts, list
attributes
o mode: numeric, character, complex, logical
o length: number of elements in object
creation
o assign a value
o create a blank object
Naming Convention




must start with a letter (A-Z or a-z)
can contain letters, digits (0-9), and/or periods “.”
case-sensitive
o eg. mydata different from MyData
do not use underscore “_”
Assignment
 "<-" is used to indicate assignment
   o e.g. x <- c(1,2,3,4,5,6,7)
     x <- c(1:7)
     x <- 1:4
Functions




actions can be performed on objects using functions (note: a function is itself an
object)
have arguments and options, often there are defaults
provide a result
parentheses () are used to specify that a function is being called.
Workspace




during an R session, all objects are stored in a temporary, working memory
list objects
o ls()
remove objects
o rm()
objects that you want to access later must be saved in a “workspace”
o from the menu bar: File->save workspace
o from the command line: save(x,file=“MyData.Rdata”)
History



command line history
can be saved, loaded, or displayed
o savehistory(file="MyData.Rhistory")
o loadhistory(file="MyData.Rhistory")
o history(max.show=Inf)
during a session you can use the arrow keys to review the command history
Two most common object types for statistics:
A. Matrix
A matrix is a vector with an additional attribute (dim) that defines the number of columns
and rows. Only one mode (numeric, character, complex, or logical) is allowed. A matrix can be
created using matrix():
x <- matrix(data=0, nr=2, nc=2)
or
o x <- matrix(0, 2, 2)
B. Data Frame
Several modes are allowed within a single data frame. A data frame can be created using data.frame():
L <- LETTERS[1:4]   # A B C D
x <- 1:4            # 1 2 3 4
data.frame(x, L)    # create data frame
attach() and detach()
o the database is attached to the R search path so that the database is searched by
R when it is evaluating a variable.
o objects in the database can be accessed by simply giving their names
Data Elements:





select only one element
eg. x[2]
select range of elements
eg. x[1:3]
select all but one element
eg. x[-3]
slicing: including only part of the object
eg. x[c(1,2,5)]
select elements based on logical operator
eg. x(x>3)
Data Import & Entry:
Importing Data





read.table(): reads in data from an external file
data.entry(): create object first, then enter data
c(): concatenate
scan(): prompted data entry
R has ODBC for connecting to other programs.
Data entry & editing



start editor and save changes
o data.entry(x)
start editor, changes not saved
o de(x)
start text editor
o edit(x)
Useful Functions



length(object) # number of elements or components
str(object) # structure of an object
class(object) # class or type of an object









names(object) # names
c(object,object,...) # combine objects into a vector
cbind(object, object, ...) # combine objects as columns
rbind(object, object, ...) # combine objects as rows
ls()
# list current objects
rm(object) # delete an object
newobject<- edit(object) # edit copy and save a
newobject
fix(object)
Exporting Data
 To a tab-delimited text file
   o write.table(mydata, "c:/mydata.txt", sep="\t")
 To an Excel spreadsheet
   o library(xlsReadWrite)
     write.xls(mydata, "c:/mydata.xls")
 To SAS
   o library(foreign)
     write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sas", package="SAS")
Viewing Data
There are a number of functions for listing the contents of an object or dataset:
•list objects in the working environment: ls()
•list the variables in mydata: names(mydata)
•list the structure of mydata: str(mydata)
•list levels of factor v1 in mydata: levels(mydata$v1)
•dimensions of an object: dim(object)
•class of an object (numeric, matrix, dataframe, etc): class(object)
•print mydata :mydata
•print first 10 rows of mydata: head(mydata, n=10)
•print last 5 rows of mydata: tail(mydata, n=5)
Interfacing with R







CSV Files
Excel Files
Binary Files
XML Files
JSON Files
Web data
Database
We can also create






pie charts
bar charts
box plots
histograms
line graphs
scatterplots
DataTypesIn R Tool






Vectors
Lists
Matrices
Arrays
Factors
Data Frames
Input



The source( ) function runs a script in the current session.
If the filename does not include a path, the file is taken from the current working
directory.
#input a script
source("myfile")
Output



The sink( ) function defines the direction of the output.
o # direct output to a file
 sink("myfile", append=FALSE, split=FALSE)
o # return output to the terminal
 sink()
The append option controls whether output overwrites or adds to a file.
The split option determines if output is also sent to the screen as well as the output
file.
Creating new variables

Use the assignment operator <- to create new variables. A wide array of operators and
functions are available here.
# Three examples for doing the same computations
1. mydata$sum<- mydata$x1 + mydata$x2
mydata$mean<- (mydata$x1 + mydata$x2)/2
2. attach(mydata)
mydata$sum<- x1 + x2
mydata$mean<- (x1 + x2)/2
detach(mydata)
3. mydata<- transform( mydata,sum = x1 + x2,
mean = (x1 + x2)/2 )
Renaming variables

You can rename variables programmatically or interactively.
o # rename interactively
o fix(mydata) # results are saved on close
o # rename programmatically
library(reshape)
mydata<- rename(mydata, c(oldname="newname"))


Sorting



To sort a dataframe in R, use the order( ) function.
By default, sorting is ASCENDING.
Prepend the sorting variable by a minus sign to indicate DESCENDING order.
Merging


To merge two dataframes (datasets) horizontally, use the merge function.
In most cases, you join two dataframes by one or more common key variables (i.e., an
inner join).
Examples:
# merge two dataframes by ID
total <- merge(dataframeA, dataframeB, by="ID")
# merge two dataframes by ID and Country
total <- merge(dataframeA, dataframeB, by=c("ID","Country"))
6. Conclusion:
The R tool, a free software environment for statistical computing and graphics, was studied.
Using the R tool, various data mining algorithms were implemented. R and its packages,
functions and task views for the data mining process, and popular data mining techniques, were
learnt.
7. Viva Questions:
 How is the R tool used for mining big data?
8. References:
 Han and Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, 3rd Edition
 M. H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education
Data Warehouse and Mining
Experiment No. : 9
Study ofBI Tool
Experiment No. 9
1. Aim: Study of Business Intelligence tools such as SPSS Clementine, XLMiner, etc.
2. Objectives: From this experiment, the student will be able to
 Learn the basics of business intelligence
 Study a business intelligence tool such as SPSS Clementine
 Study the methodology of engineering legacy of data mining
3. Outcomes: The learner will be able to
 Use current techniques, skills and tools for mining.
 Engage them in life-long learning.
 Able to match industry requirements in domains of data mining
4. Software Required: BI tool - SPSS Clementine
5. Theory:
IBM SPSS Modeler is a data mining and text analytics software application built by IBM.
It is used to build predictive models and conduct other analytic tasks. It has a visual
interface which allows users to leverage statistical and data mining algorithms without
programming. IBM SPSS Modeler was originally named Clementine by its creators.
Applications:
SPSS Modeler has been used in these and other industries:
• Customer analytics and Customer relationship management (CRM)
• Fraud detection and prevention
• Optimizing insurance claims
• Risk management
• Manufacturing quality improvement
• Healthcare quality improvement
• Forecasting demand or sales
• Law enforcement and border security
• Education
• Telecommunications
• Entertainment: e.g., predicting movie box office receipts
SPSS Modeler is available in two separate bundles of features, called editions:
1. SPSS Modeler Professional
2. SPSS Modeler Premium
The Premium edition additionally includes:
o text analytics
o entity analytics
o social network analysis
Both editions are available in desktop and server configurations.
Earlier it was Unix based and designed as a consulting tool and not for sale to the
customers. Originally developed by a UK Company called Integral Solutions in
collaboration with Artificial Intelligence researchers at Sussex University. It mainly uses
two of the Poplog languages, Pop11 and Prolog. It was the first data mining tool to use an
icon based graphical user interface rather than writing programming languages.
Clementine is a data mining software for business solutions.
The previous version was a stand-alone application architecture, while the new version is a distributed
architecture.
Fig. Previous version (stand-alone architecture)
Fig. New version (distributed architecture)
Multiple model building techniques in Clementine:
 Rule Induction
 Graph
 Clustering
 Association Rules
 Linear Regression
 Neural Networks
Functionalities:
 Classification: Rule Induction, neural Networks
 Association: Rule Induction, Apriori
 Clustering: Kohonen Networks, Rule Induction
 Sequence: Rule Induction, Neural Networks, Linear Regression
 Prediction: Rule Induction, Neural Networks
Applications:





Predict market share
Detect possible fraud
Locate new retail sites
Assess financial risk
Analyze demographic trends and patterns
6. Conclusion:
IBM SPSS Modeler, a data mining and text analytics software application, was
studied. It was understood that it has a visual interface which allows users to leverage statistical
and data mining algorithms without programming.
7. Viva Questions:

What are the functionalities of SPSS Clementine?
8. References:
 Han and Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, 3rd Edition
 M. H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education
Data Warehouse and Mining
Experiment No. : 10
Study different OLAP operations
Experiment No. 10
1. Aim: Study different OLAP operations.
2. Objectives: From this experiment, the student will be able to
 Discover patterns from data warehouse
 Online analytical processing of data
 Obtain knowledge from data warehouse
3. Outcomes: The learner will be able to
 Recognize the need of online analytical processing.
 Identify, formulate and solve engineering problems.
 Able to match industry requirements in domains of data warehouse
4. Theory:
Following are the different OLAP operations:
 Roll up (drill-up): summarize data
   o by climbing up a hierarchy or by dimension reduction
 Drill down (roll down): the reverse of roll-up
   o from a higher-level summary to a lower-level summary or detailed data, or by
     introducing new dimensions
 Slice and dice:
   o project and select
 Pivot (rotate):
   o reorient the cube; visualization; 3D to a series of 2D planes
Fig. Fact table view and the corresponding multi-dimensional cube (dimension = 3)
Fig. Cube aggregation example - roll up and drill down
Fig. Slicing example
Fig. Slicing and pivoting example
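To make roll-up and slice concrete, the small Java sketch below applies both operations to a tiny in-memory sales cube indexed as cube[time][item][location]; the cube values and the dimension names are invented for illustration.

// Minimal sketch: roll-up and slice on a tiny in-memory sales cube; all numbers are made up.
public class OlapDemo {
    public static void main(String[] args) {
        int[][][] cube = {
            { {10, 20}, {30, 40} },   // time 0 (e.g. Q1): [item][location]
            { {50, 60}, {70, 80} }    // time 1 (e.g. Q2): [item][location]
        };

        // Roll-up: climb the time hierarchy by summing the time dimension away.
        int[][] byItemLocation = new int[2][2];
        for (int t = 0; t < 2; t++)
            for (int i = 0; i < 2; i++)
                for (int l = 0; l < 2; l++)
                    byItemLocation[i][l] += cube[t][i][l];

        // Slice: select the 2-D sub-cube for a single location (location index 0).
        int[][] sliceAtLocation0 = new int[2][2];
        for (int t = 0; t < 2; t++)
            for (int i = 0; i < 2; i++)
                sliceAtLocation0[t][i] = cube[t][i][0];

        System.out.println("Roll-up over time, cell (item 0, location 0): " + byItemLocation[0][0]);   // 10 + 50
        System.out.println("Slice at location 0, cell (time 1, item 1):   " + sliceAtLocation0[1][1]); // 70
    }
}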
5. Conclusion:
OLAP, which performs multidimensional analysis of business data and provides the
capability for complex calculations, trend analysis, and sophisticated data modeling, was studied.
6. Viva Questions:
 What are OLAP operations?
 What is the difference between OLTP and OLAP?
 What is the difference between slicing and dicing?
7. References:
 Han and Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, 3rd Edition
 M. H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education
Data Warehouse and Mining
Experiment No. : 11
Study different pre-processing steps of
data warehouse
Experiment No. 11
1. Aim: Study different pre-processing steps of a data warehouse.
2. Objectives: From this experiment, the student will be able to
 Discover patterns from data warehouse
 Learn steps of pre-processing of data
 Obtain knowledge from data warehouse
3. Outcomes: The learner will be able to
 Recognize the need of data pre-processing.
 Identify, formulate and solve engineering problems.
 Able to match industry requirements in domains of data warehouse
4. Theory:
Data pre-processing is an often neglected but important step in the data mining process. The
phrase "Garbage In, Garbage Out" is particularly applicable to data mining and machine
learning. Data gathering methods are often loosely controlled, resulting in out-of-range
values (e.g., Income: -100), impossible data combinations (e.g., Gender: Male, Pregnant:
Yes), missing values, etc.
If there is much irrelevant and redundant information present or noisy and unreliable data,
then knowledge discovery during the training phase is more difficult. Data preparation and
filtering steps can take considerable amount of processing time. Data pre-processing includes
cleaning, normalization, transformation, feature extraction and selection, etc. The product of
data pre-processing is the final training set.
Data Pre-processing Methods
Raw data is highly susceptible to noise, missing values, and inconsistency. The quality of
data affects the data mining results. In order to help improve the quality of the data and,
consequently, of the mining results raw data is pre-processed so as to improve the efficiency
and ease of the mining process. Data pre-processing is one of the most critical steps in a data
mining process which deals with the preparation and transformation of the initial dataset.
Data pre-processing methods are divided into following categories:
1) Data Cleaning 2)Data Integration 3)Data Transformation 4)Data Reduction
Fig. Forms of data Preprocessing
Data Cleaning
Data that is to be analyzed by data mining techniques can be incomplete (lacking attribute
values or certain attributes of interest, or containing only aggregate data), noisy (containing
errors, or outlier values which deviate from the expected), and inconsistent (e.g., containing
discrepancies in the department codes used to categorize items). Incomplete, noisy, and
inconsistent data are commonplace properties of large, real-world databases and data
warehouses. Incomplete data can occur for a number of reasons. Attributes of interest may
not always be available, such as customer information for sales transaction data. Other data
may not be included simply because it was not considered important at the time of entry.
Relevant data may not be recorded due to a misunderstanding, or because of equipment
malfunctions. Data that were inconsistent with other recorded data may have been deleted.
Furthermore, the recording of the history or modifications to the data may have been
overlooked. Missing data, particularly for tuples with missing values for some attributes, may
need to be inferred. Data can be noisy, having incorrect attribute values, owing to the
following. The data collection instruments used may be faulty. There may have been human
or computer errors occurring at data entry. Errors in data transmission can also occur. There
may be technology limitations, such as limited buffer size for coordinating synchronized data
transfer and consumption. Incorrect data may also result from inconsistencies in naming
conventions or data codes used. Duplicate tuples also require data cleaning. Data cleaning
routines work to “clean" the data by filling in missing values, smoothing noisy data,
identifying or removing outliers, and resolving inconsistencies. Dirty data can cause
confusion for the mining procedure. Although most mining routines have some procedures
for dealing with incomplete or noisy data, they are not always robust. Instead, they may
concentrate on avoiding over fitting the data to the function being modelled. Therefore, a
useful pre-processing step is to run your data through some data cleaning routines.
Missing Values: If it is noted that there are many tuples that have no recorded value for
several attributes, then the missing values can be filled in for the attribute by the various
methods described below:
1. Ignore the tuple: This is usually done when the class label is missing (assuming the
mining task involves classification or description). This method is not very
effective, unless the tuple contains several attributes with missing values. It is
especially poor when the percentage of missing values per attribute varies
considerably.
2. Fill in the missing value manually: In general, this approach is time-consuming and
may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values
by the same constant, such as a label like "Unknown", or -∞. If missing values are
replaced by, say, "Unknown", then the mining program may mistakenly think that
they form an interesting concept, since they all have a value in common, that of
"Unknown". Hence, although this method is simple, it is not recommended.
4. Use the attribute mean to fill in the missing value (a small sketch follows this list).
5. Use the attribute mean for all samples belonging to the same class as the given tuple.
6. Use the most probable value to fill in the missing value: This may be determined with
inference-based tools using a Bayesian formalism or decision tree induction.
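As a minimal sketch of method 4 above (filling a missing value with the attribute mean), the Java example below uses Double.NaN as the missing-value marker; the income values are invented.

// Minimal sketch: replace missing values (NaN) in one attribute with the attribute mean.
public class MeanImputationDemo {
    public static void main(String[] args) {
        double[] income = {45000, Double.NaN, 52000, 61000, Double.NaN, 48000};

        double sum = 0;
        int present = 0;
        for (double v : income)
            if (!Double.isNaN(v)) { sum += v; present++; }
        double mean = sum / present;                        // mean over the recorded values only

        for (int i = 0; i < income.length; i++)
            if (Double.isNaN(income[i])) income[i] = mean;  // fill in the missing entries

        System.out.println(java.util.Arrays.toString(income));
    }
}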
Inconsistent data: There may be inconsistencies in the data recorded for some transactions.
Some data inconsistencies may be corrected manually using external references.
For example, errors made at data entry may be corrected by performing a paper trace. This
may be coupled with routines designed to help correct the inconsistent use of codes.
Knowledge engineering tools may also be used to detect the violation of known data
constraints. For example, known functional dependencies between attributes can be used to
find values contradicting the functional constraints.
Data Integration
It is likely that your data analysis task will involve data integration, which combines data
from multiple sources into a coherent data store, as in data warehousing. These sources may
include multiple databases, data cubes, or flat files. There are a number of issues to consider
during data integration. Schema integration can be tricky. How can like real world entities
from multiple data sources be 'matched up'? This is referred to as the entity identification
problem. For example, how can the data analyst or the computer be sure that customer id in
one database, and cust_number in another refer to the same entity? Databases and data
warehouses typically have metadata - that is, data about the data. Such metadata can be used
to help avoid errors in schema integration. Redundancy is another important issue. An
attribute may be redundant if it can be “derived" from another table, such as annual revenue.
Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting
data set.
Data Transformation
In data transformation, the data are transformed or consolidated into forms appropriate for
mining. Data transformation can involve the following:
1. Normalization, where the attribute data are scaled so as to fall within a small specified
range, such as -1.0 to 1.0, or 0 to 1.0 (a small sketch follows this list).
2. Smoothing, which works to remove the noise from data. Such techniques include binning,
clustering, and regression.
3. Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual
total amounts. This step is typically used in constructing a data cube for analysis of the data at
multiple granularities.
4. Generalization of the data, where low-level or 'primitive' (raw) data are replaced by higher-
level concepts through the use of concept hierarchies. For example, categorical attributes, like
street, can be generalized to higher-level concepts, like city or country. Similarly, values for
numeric attributes, like age, may be mapped to higher-level concepts, like young, middle-aged, and senior.
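As a small example of the normalization step listed above, the Java sketch below applies min-max scaling to bring an attribute into the range 0 to 1; the input values are invented.

// Minimal sketch: min-max normalization of an attribute to the range [0, 1].
public class MinMaxNormalizationDemo {
    public static void main(String[] args) {
        double[] age = {18, 25, 32, 47, 60};

        double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (double v : age) {
            if (v < min) min = v;
            if (v > max) max = v;
        }

        double[] scaled = new double[age.length];
        for (int i = 0; i < age.length; i++)
            scaled[i] = (age[i] - min) / (max - min);   // v' = (v - min) / (max - min)

        System.out.println(java.util.Arrays.toString(scaled));
    }
}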
Data Reduction
Data reduction techniques have been helpful in analyzing reduced representation of the
dataset without compromising the integrity of the original data and yet producing the quality
knowledge. The concept of data reduction is commonly understood as either reducing the
volume or reducing the dimensions (number of attributes). There are a number of methods
that have facilitated in analyzing a reduced volume or dimension of data and yet yield useful
knowledge. Certain partition based methods work on partition of data tuples. That is, mining
on the reduced data set should be more efficient yet produce the same (or almost the same)
analytical results
Strategies for data reduction include the following.
1. Data cube aggregation, where aggregation operations are applied to the data in the
construction of a data cube.
2. Dimension reduction, where irrelevant, weakly relevant, or redundant attributes or
dimensions may be detected and removed.
3. Data compression, where encoding mechanisms are used to reduce the data set size. The
methods used for data compression are the wavelet transform and Principal Component
Analysis.
4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data
representations such as parametric models (which need store only the model parameters
instead of the actual data, e.g. regression and log-linear models), or nonparametric methods
such as clustering, sampling, and the use of histograms.
5. Discretization and concept hierarchy generation, where raw data values for attributes
are replaced by ranges or higher conceptual levels. Concept hierarchies allow the mining of
data at multiple levels of abstraction, and are a powerful tool for data mining.
5. Conclusion:
Data preprocessing is a data mining technique that involves transforming raw data
into an understandable format. Real-world data is often incomplete, inconsistent, and/or
lacking in certain behaviours or trends, and is likely to contain many errors. Data
preprocessing is a proven method of resolving such issues. The different pre-processing steps
were thus studied.
6. Viva Questions:
 What is pre-processing of data?
 What is the need for data pre-processing?
 What kind of data can be cleaned?
7. References:
 Han and Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, 3rd Edition
 M. H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education