732A20 Data Mining and Statistical Learning
Department of Computer and Information Science
Computer lab 11: Clustering
Learning objective
The main objective of this computer lab is to make the student acquainted with the
Clustering node in SAS Enterprise Miner.
After completing the lab the student shall be able to:
(i)
Explain the main features of k-means clustering.
(ii)
Run the SOM/Kohonen node, which fits so-called self-organizing maps (Kohonen maps), and
interpret the obtained results.
Recommended reading
Chapter 13.1-13.3 in Hastie et al.
Assignment 1: k-means clustering
Your task is to:
(i) create and run a data flow diagram containing a Clustering node;
(ii) investigate how the results of the clustering are influenced by the number of
input variables and the presence of outliers.
More specifically, you shall use the Clustering node to undertake a centroid-based
clustering called k-means clustering. The main idea behind that method is to form clusters
by drawing spheres around k centers and optimizing the location and radii of these
spheres so that the clusters become as homogeneous as possible. The k-value is normally
specified by the user. Algorithms in which the k-value is determined by the data can be
computationally very cumbersome.
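For orientation, the basic k-means iteration can be sketched outside Enterprise Miner as follows. This is a minimal illustration in Python with NumPy, not the algorithm as implemented in the Clustering node; the function name kmeans, its parameters, and the synthetic two-group data are assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iteration: alternate cluster assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every observation to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # centers no longer move: converged
        centers = new_centers
    return labels, centers

# Synthetic data (an assumption for illustration): two well-separated groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
labels, centers = kmeans(X, k=2)
print(centers)
```

The user supplies k, as in the Clusters tab of the node. Each observation is assigned to its nearest center, so the clusters are bounded by planes halfway between neighbouring centers.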
Prepare for running the Clustering node
Create a workflow diagram with an Input Data Source node and a Clustering node.
Import and assign the data in ‘lakesurvey.xls’ to the Input Data Source node. This Excel
document ‘lakesurvey.xls’ contains water quality data from a survey of 2782 Swedish
lakes that was carried out in 2005. Further information about this data set was given in
computer lab 1, exercise 1.
Open the Variables tab of the Clustering node and specify that you would like to use
conductivity (COND_MS_M25_C), total nitrogen (TOT_N_PS_UG_L), and total
organic carbon (TOC_MG_L) as input variables. (Conductivity represents the total
amount of ions in the water, total nitrogen is an indicator of agricultural influence, and
the level of total organic carbon is high in brown-water lakes often found in forests and
peaty areas.)
The Clustering node can operate on nonstandardized or standardized data. Decide
whether or not the data in this exercise should be standardized. (Hint: The clustering aims
at identifying spherical subsets.)
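To see what standardization does to the distances that the clustering relies on, consider the following sketch in Python with NumPy. The four rows of numbers are made up for illustration and are not taken from 'lakesurvey.xls'; the helper names standardize and share_of_squared_distance are likewise assumptions. The sketch measures how much each variable contributes to the squared Euclidean distances between observations, before and after rescaling to z-scores.

```python
import numpy as np

def standardize(X):
    """Rescale each column to mean 0 and standard deviation 1 (z-scores)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def share_of_squared_distance(X):
    """Fraction of the total squared pairwise distance contributed by each column."""
    diffs = X[:, None, :] - X[None, :, :]
    per_column = (diffs ** 2).sum(axis=(0, 1))
    return per_column / per_column.sum()

# Made-up values on very different scales (hypothetical, for illustration only).
# Columns: conductivity, total nitrogen, total organic carbon.
X = np.array([
    [ 2.0,  300.0,  5.0],
    [ 4.0,  320.0, 15.0],
    [ 3.0, 1200.0,  6.0],
    [20.0,  310.0,  7.0],
])

raw_share = share_of_squared_distance(X)
std_share = share_of_squared_distance(standardize(X))
print("raw:         ", raw_share.round(4))
print("standardized:", std_share.round(4))
```

On the raw scale the variable with the largest numerical spread accounts for almost all of the distance, whereas after standardization every variable contributes equally.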
Click the Clusters tab and specify that you would like to divide the entire data set into k =
10 clusters.
Run the Clustering node and interpret the results
Run the Clustering node and examine the graphs in the Partition tab. How many large
clusters were formed?
Proceed to the Variables tab. Which variable had the strongest influence on the
clustering?
Click the Distances tab. Which of the clusters were very different from the other clusters?
Finally, click the Statistics tab and use View → Statistics Plots to illustrate the main
features of the different clusters.
How would you characterize the three largest clusters?
Were there any clusters with very few observations?
How did the presence of outliers influence the clustering process?
Repeat the analysis above using all fifteen water quality variables (pH to silica) as inputs.
Did the results of the cluster analysis change dramatically?
Assignment 2: Clustering using self-organizing maps
This exercise is based on the same data set as the previous exercise. Your task is to
create and run a data flow diagram containing a SOM/Kohonen clustering node. In
contrast to k-means clustering, for which the order of the clusters is arbitrary, self-organizing
maps create clusters that are related to each other so that adjacent clusters are
more similar than clusters far apart.
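The idea that neighbouring map units are pulled towards similar observations can be sketched as follows. This is a minimal online self-organizing map in Python with NumPy, not the procedure used by the SOM/Kohonen node; the function names train_som and assign, the linearly decaying learning rate, the Gaussian neighbourhood, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def train_som(X, rows=2, cols=5, n_epochs=20, lr0=0.5, sigma0=1.5, seed=0):
    """Fit a rows x cols self-organizing map with online updates."""
    rng = np.random.default_rng(seed)
    # Grid coordinates (row, column) of each map unit.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise the unit weights from randomly chosen observations.
    weights = X[rng.choice(len(X), size=rows * cols, replace=True)].astype(float)
    n_steps = n_epochs * len(X)
    step = 0
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)              # learning rate decays linearly
            sigma = sigma0 * (1.0 - frac) + 0.1  # neighbourhood shrinks over time
            # Best-matching unit: the unit whose weights are closest to X[i].
            bmu = ((weights - X[i]) ** 2).sum(axis=1).argmin()
            # Gaussian neighbourhood on the grid, centred at the best-matching unit.
            g2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-g2 / (2.0 * sigma ** 2))
            # Move the winner and, to a lesser degree, its grid neighbours.
            weights += lr * h[:, None] * (X[i] - weights)
            step += 1
    return weights.reshape(rows, cols, -1)

def assign(X, weights):
    """Index (0 .. rows*cols - 1) of the closest map unit for each observation."""
    flat = weights.reshape(-1, weights.shape[-1])
    return ((X[:, None, :] - flat[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)

# Synthetic data with three variables (an assumption for illustration).
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
weights = train_som(X)        # 2 rows x 5 columns = 10 clusters
labels = assign(X, weights)
print(np.bincount(labels, minlength=10).reshape(2, 5))
```

Unlike in k-means, each update also moves the grid neighbours of the winning unit, which is what ties the position of a cluster on the map to its content.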
Prepare for running the SOM/Kohonen node
Create a workflow diagram with an Input Data Source node and a SOM/Kohonen node.
Open the Variables tab of the SOM/Kohonen node and specify that you would like to use
conductivity (COND_MS_M25_C), total nitrogen (TOT_N_PS_UG_L), and total
organic carbon (TOC_MG_L) as input variables. Decide whether or not the data in this
exercise should be standardized.
Click the General tab and specify that you would like to have a map with 2 rows and 5
columns (10 clusters).
Run the SOM/Kohonen node and interpret the results
Run the SOM/Kohonen node and examine the graphs in the Map tab. How many large
clusters were formed?
Proceed to the Variables tab. Which variable had the strongest influence on the
clustering?
Click the Distances tab. Which of the clusters were very different from the other clusters?
Finally, click the Statistics tab and use View → Statistics Plots to illustrate the main
features of the different clusters.
How would you characterize the three largest clusters?
Were there any clusters with very few observations?
How did the presence of outliers influence the clustering process?