732A20 Data Mining and Statistical Learning
Department of Computer and Information Science

Computer lab 11: Clustering

Learning objective

The main objective of this computer lab is to make the student acquainted with the Clustering node in SAS Enterprise Miner. After completing the lab the student shall be able to:
(i) Explain the main features of k-means clustering.
(ii) Run the SOM/Kohonen node for so-called self-organizing maps (Kohonen maps) and interpret the obtained results.

Recommended reading

Chapter 13.1-13.3 in Hastie et al.

Assignment 1: k-means clustering

Your task is to:
(i) create and run a data flow diagram containing a Clustering node;
(ii) investigate how the results of the clustering are influenced by the number of input variables and the presence of outliers.

More specifically, you shall use the Clustering node to undertake a centroid-based clustering called k-means clustering. The main idea behind that method is to form clusters by drawing spheres around k centers and optimizing the location and radii of these spheres so that the clusters become as homogeneous as possible. The k-value is normally specified by the user. Algorithms in which the k-value is determined by the data can be computationally very cumbersome.

Prepare for running the Clustering node

Create a workflow diagram with an Input Data Source node and a Clustering node. Import the data in ‘lakesurvey.xls’ and assign it to the Input Data Source node. This Excel file contains water quality data from a survey of 2782 Swedish lakes that was carried out in 2005. Further information about this data set was given in computer lab 1, exercise 1.

Open the Variables tab of the Clustering node and specify that you would like to use conductivity (COND_MS_M25_C), total nitrogen (TOT_N_PS_UG_L), and total organic carbon (TOC_MG_L) as input variables.
(Conductivity represents the total amount of ions in the water, total nitrogen is an indicator of agricultural influence, and the level of total organic carbon is high in brown-water lakes, which are often found in forests and peaty areas.)

The Clustering node can operate on nonstandardized or standardized data. Decide whether or not the data in this exercise should be standardized. (Hint: The clustering aims at identifying spherical subsets.)

Click the Clusters tab and specify that you would like to divide the entire data set into k = 10 clusters.

Run the Clustering node and interpret the results

Run the Clustering node and examine the graphs in the Partition tab. How many large clusters were formed?

Proceed to the Variables tab. Which variable had the strongest influence on the clustering?

Click the Distances tab. Which of the clusters were very different from the other clusters?

Finally, click the Statistics tab and use View Statistics Plots to illustrate the main features of the different clusters. How would you characterize the three largest clusters? Were there any clusters with very few observations? How did the presence of outliers influence the clustering process?

Repeat the analysis above using all fifteen water quality variables (pH to silica) as inputs. Did the results of the cluster analysis change dramatically?

Assignment 2: Clustering using self-organizing maps

This exercise is based on the same data set as the previous exercise. Your task is to create and run a data flow diagram containing a SOM/Kohonen clustering node. In contrast to k-means clustering, for which the order of the clusters is arbitrary, self-organizing maps create clusters that are related to each other so that adjacent clusters are more similar than clusters far apart.

Prepare for running the SOM/Kohonen node

Create a workflow diagram with an Input Data Source node and a SOM/Kohonen node.
Open the Variables tab of the SOM/Kohonen node and specify that you would like to use conductivity (COND_MS_M25_C), total nitrogen (TOT_N_PS_UG_L), and total organic carbon (TOC_MG_L) as input variables. Decide whether or not the data in this exercise should be standardized.

Click the General tab and specify that you would like to have a map with 2 rows and 5 columns (10 clusters).

Run the SOM/Kohonen node and interpret the results

Run the SOM/Kohonen node and examine the graphs in the Map tab. How many large clusters were formed?

Proceed to the Variables tab. Which variable had the strongest influence on the clustering?

Click the Distances tab. Which of the clusters were very different from the other clusters?

Finally, click the Statistics tab and use View Statistics Plots to illustrate the main features of the different clusters. How would you characterize the three largest clusters? Were there any clusters with very few observations? How did the presence of outliers influence the clustering process?
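
Illustration outside SAS Enterprise Miner

For readers who want to see the two methods of this lab side by side, the following is a minimal NumPy sketch, not the algorithm used by the Clustering or SOM/Kohonen nodes. It standardizes three variables, runs a plain k-means with k = 10, and trains a small self-organizing map with 2 rows and 5 columns. The data are synthetic (skewed random numbers standing in for conductivity, total nitrogen and total organic carbon), not the lake survey, and all function names, learning rates and iteration counts are assumptions chosen for illustration.

```python
import numpy as np

def standardize(X):
    """Scale each column to mean 0 and standard deviation 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def nearest(X, centres):
    """Index of the closest centre (Euclidean distance) for every row of X."""
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    return d.argmin(axis=1)

def kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means: assign rows to the nearest centre, then recompute centres."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = nearest(X, centres)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centre if a cluster is empty
                centres[j] = members.mean(axis=0)
    return centres

def train_som(X, rows=2, cols=5, n_epochs=20, lr0=0.5, sigma0=1.5, seed=0):
    """Online SOM: the winning unit and its grid neighbours move towards each row."""
    rng = np.random.default_rng(seed)
    # Position of every map unit on the rows x cols grid, shape (rows*cols, 2).
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    weights = X[rng.choice(len(X), size=rows * cols, replace=False)].copy()
    n_steps = n_epochs * len(X)
    step = 0
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                # learning rate shrinks over time
            sigma = sigma0 * (1.0 - frac) + 0.1    # neighbourhood width shrinks too
            winner = np.linalg.norm(weights - X[i], axis=1).argmin()
            grid_d2 = ((grid - grid[winner]) ** 2).sum(axis=1)
            h = np.exp(-grid_d2 / (2.0 * sigma ** 2))  # 1 at the winner, less far away
            weights += lr * h[:, None] * (X[i] - weights)
            step += 1
    return weights.reshape(rows, cols, -1)

# Synthetic, right-skewed stand-in data (NOT the lake survey): 500 rows, 3 variables.
rng = np.random.default_rng(1)
X_raw = np.exp(rng.normal(loc=[1.5, 6.0, 2.0], scale=[0.6, 0.5, 0.7], size=(500, 3)))
X = standardize(X_raw)

centres = kmeans(X, k=10)
km_sizes = np.bincount(nearest(X, centres), minlength=10)

som = train_som(X, rows=2, cols=5)
som_sizes = np.bincount(nearest(X, som.reshape(-1, 3)), minlength=10).reshape(2, 5)

print("k-means cluster sizes:", km_sizes)
print("SOM cluster sizes laid out on the 2 x 5 map:")
print(som_sizes)
```

In the k-means output the cluster numbers carry no meaning, whereas the SOM sizes are printed in map layout, so neighbouring cells correspond to neighbouring map units. Replacing `X = standardize(X_raw)` with `X = X_raw`, or appending a few extreme rows to `X_raw`, is one way to explore the standardization and outlier questions asked in the assignments.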