International Journal of Scientific Research Engineering & Technology (IJSRET), Volume 2, Issue 7, pp. 393-396, October 2013, www.ijsret.org, ISSN 2278-0882

Novel Approach for Cluster Analysis of Similar Binary Variables using Normal Approximation of Binomial Probability Distribution

Makwana Jay, Makwana Pratik (MCA students, Gujarat Technological University, ISTAR, Vallabh Vidhyanagar, Anand)

ABSTRACT
One approach to determining similarity or dissimilarity between discrete random binary variables is the contingency table [3]. A cluster is a set of meaningful subclasses whose members share similar characteristics; clustering is unsupervised classification [7]. This paper presents a novel approach that enhances this computation by approximating the binomial probability distribution of a binary variable with the normal distribution, i.e., by treating it as a continuous random variable.

Keywords - binomial distribution, continuity correction factor, continuous variable, discrete variable, normal approximation of the binomial probability distribution.

1 INTRODUCTION
A cluster of data objects can be treated collectively as one group and so may be considered a form of data compression. It is often more desirable to proceed in the reverse direction: first partition the set of data into groups based on data similarity (e.g., using clustering), and then assign labels to the relatively small number of groups [1]. Clustering is a multivariate technique of grouping rows together that share similar values. The goal of clustering is to organize data by finding a sensible grouping of the data items. Clustering is unsupervised learning because it does not use predefined category labels associated with data items [4]. Clustering algorithms are engineered to discover characteristics in the data.

2 TYPES OF DATA IN CLUSTER ANALYSIS
The following are the different types of data/variables [3]:
1. Interval-Scaled Variables
2. Binary Variables
3. Categorical, Ordinal, and Ratio-Scaled Variables
4. Variables of Mixed Types
5. Vector Objects

2.1 Binary Variables
A binary variable has only two states, 0 or 1, where 0 means that the variable is absent and 1 means that it is present. Treating binary variables as if they were interval-scaled can lead to misleading clustering results [1]. Therefore, methods specific to binary data are necessary for computing similarities or dissimilarities. Binary variables are of two types, symmetric or asymmetric [3]. In statistics, binary data is a statistical data type described by binary variables, which can take only two possible values. Binary data is used to represent the outcomes of Bernoulli trials, statistical experiments with only two possible outcomes. In modern computers, almost all data is ultimately represented in binary form. Although the binary numeral system is usually cited as the main reason for this, many (if not most) data in modern computers are not numbers [5].

3 BINOMIAL PROBABILITY DISTRIBUTION [1][6]
Binomial probability typically deals with the probability of several successive decisions, each of which has two possible outcomes. In probability theory and statistics, the binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p. The binomial distribution is frequently used to model the number of successes in a sample of size n drawn with replacement from a population of size N. If the sampling is carried out without replacement, the draws are not independent, so the resulting distribution is a hypergeometric distribution, not a binomial one. However, for N much larger than n, the binomial distribution is a good approximation and is widely used.

An experiment has a binomial probability distribution if three conditions are satisfied:
a. There is a fixed number of trials. The number of trials is denoted by n.
b. The trials are independent.
c. The only outcomes of the experiment can be classified as "success" or "failure" (equivalently "yes" or "no"), and the probability of success is fixed. The probability of success is denoted by p.

Consider the following table [1]:

Table 1
Name    Gender
A       M
B       F
C       M
D       M

If we compare two objects gender-wise (a symmetric binary variable), we can apply the binomial mass function [3]:

P(x) = C(n, x) * p^x * q^(n-x)

Coding Table 1 with 1 for Male and 0 for Female gives:

Table 2
Name    Gender
A       1
B       0
C       1
D       1

Here p = 1/2, q = 1 - p = 1/2, and n = 2 (two objects are compared at a time). The binomial probability distribution table for n = 2 is:

Table 3
x (success = getting Male)               Probability of x                   Similarity
0 (both zeros - same)                     C(2,0) (1/2)^0 (1/2)^2 = 0.25      Similar
1 (one 0 and one 1, in either order)      C(2,1) (1/2)^1 (1/2)^1 = 0.5       Not similar
2 (both ones - same)                      C(2,2) (1/2)^2 (1/2)^0 = 0.25      Similar

From Table 2 and Table 3:
Measurement (Gender of A=1 and B=0) = 0.5   ... (a)
Measurement (Gender of A=1 and C=1) = 0.25  ... (b)
Measurement (Gender of A=1 and D=1) = 0.25  ... (c)
Measurement (Gender of B=0 and C=1) = 0.5   ... (d)
Measurement (Gender of B=0 and D=1) = 0.5   ... (e)
Measurement (Gender of C=1 and D=1) = 0.25  ... (f)

Each result is compared with 1/2^n, where n is the number of objects compared (here n = 2); if the measurement equals 1/2^n, the pair has the same value. Among the measurements above, (b), (c) and (f) equal 1/2^n [1].
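To make the pairwise measurement above concrete, the following is a minimal Python sketch (not part of the original paper) that reproduces measurements (a)-(f) from the Table 2 data and flags a pair as similar when its probability equals 1/2^n; the function and variable names are illustrative only.

```python
from math import comb

def binomial_measurement(a, b, p=0.5):
    """Binomial mass function applied to a pair of binary values.

    x is the number of 1s (successes) among the two values; the pair is
    judged 'similar' when the resulting probability equals (1/2)^n.
    """
    n = 2                      # two objects are compared at a time
    q = 1 - p
    x = a + b                  # number of successes (1s) in the pair
    prob = comb(n, x) * p**x * q**(n - x)
    return prob, prob == (1 / 2) ** n

# Table 2: 1 = Male, 0 = Female
gender = {"A": 1, "B": 0, "C": 1, "D": 1}

names = list(gender)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        a, b = names[i], names[j]
        prob, similar = binomial_measurement(gender[a], gender[b])
        print(f"{a}-{b}: {prob:.2f} {'Similar' if similar else 'Not similar'}")
```

Running this prints 0.50 for the mixed pairs and 0.25 for the matching pairs, in agreement with measurements (a)-(f).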
4 NORMAL APPROXIMATION OF BINOMIAL PROBABILITY DISTRIBUTION [2]
There is a problem with approximating the binomial with the normal: the binomial distribution is a discrete distribution, while the normal distribution is a continuous distribution.

4.1 Continuity Correction Factor
1. The basic difference is that with discrete values we are talking about heights but no widths, while with the continuous distribution we are talking about both heights and widths.
2. The correction is to either add or subtract 0.5 of a unit from each discrete x-value.
3. This fills in the gaps to make the distribution continuous. This is very similar to expanding limits to form class boundaries, as is done with grouped frequency distributions.

Steps for working a normal approximation to the binomial distribution [7]:
1. Identify success, the probability of success, the number of trials, and the desired number of successes. Since this is a binomial problem, these are the same things identified when working a binomial problem.
2. Convert the discrete x to a continuous x. Some people would argue that step 3 should be done before this step, but go ahead and convert the x before you forget about it and miss the problem.
3. Find the smaller of np or nq. If the smaller one is at least five, then the larger must also be, so the approximation is considered good. When you find np, you are actually finding the mean, mu, so denote it as such.
4. Find the standard deviation, sigma = sqrt(npq). It might be easier to find the variance and just put the square root in the final calculation, so that you do not have to work with all of the decimal places.
5. Compute the z-score using the standard formula for an individual score (not the one for a sample mean).
6. Calculate the desired probability.

5 PROPOSED WORK
When the number of trials becomes large, evaluating the binomial probability function by hand or with a calculator is difficult. Hence, when we encounter a binomial distribution problem with a large number of trials, we may want to approximate the binomial distribution. In cases where the number of trials is greater than 20, np >= 5, and n(1 - p) >= 5, the normal distribution provides an easy-to-use approximation of binomial probabilities. When using the normal approximation to the binomial, we set mu = np and sigma^2 = np(1 - p) in the definition of the normal curve.

Let us illustrate the normal approximation to the binomial.

Example: A coin is tossed 12 times. What is the probability of getting exactly 7 heads?

Solution using the normal approximation to the binomial distribution:

Formulas:
mu = np
sigma = (np(1 - p))^(1/2)
Z = (x - mu) / sigma

The values are n = 12 and p = 0.5, and with the continuity correction we want P(6.5 < X < 7.5).
mu = np = 12 * 0.5 = 6
sigma = (np(1 - p))^(1/2) = (12 * 0.5 * (1 - 0.5))^(1/2) = 1.73

For x = 6.5: z = (6.5 - 6) / 1.73 = 0.29
For x = 7.5: z = (7.5 - 6) / 1.73 = 0.87

Looking up z = 0.29 and z = 0.87 in the Z-table gives 0.6141 and 0.8078 respectively.
Therefore, 0.8078 - 0.6141 = 0.1937.
The probability of getting exactly 7 heads is approximately 0.19.
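The worked example can also be reproduced programmatically. The following is a minimal Python sketch (not part of the original paper) that applies the continuity correction and the normal approximation to the 12-toss example and prints the exact binomial probability for comparison; the function name is illustrative, and the standard normal CDF comes from Python's statistics module.

```python
from math import comb, sqrt
from statistics import NormalDist

def normal_approx_exact_k(n, p, k):
    """P(X = k) for X ~ Binomial(n, p), via the normal approximation
    with a continuity correction of +/- 0.5 around k."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    z = NormalDist()                     # standard normal: mean 0, sd 1
    return z.cdf((k + 0.5 - mu) / sigma) - z.cdf((k - 0.5 - mu) / sigma)

n, p, k = 12, 0.5, 7
approx = normal_approx_exact_k(n, p, k)
exact = comb(n, k) * p**k * (1 - p)**(n - k)   # exact binomial pmf
print(f"normal approximation: {approx:.4f}")   # about 0.1932
print(f"exact binomial:       {exact:.4f}")    # about 0.1934
```

The small difference from the 0.1937 obtained above comes from rounding the z-scores to two decimal places before the table lookup.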
6 CONCLUSION
In this paper we describe the normal approximation of binomial probabilities for finding the similarity or dissimilarity between two binary variables by treating the discrete measure as a continuous random variable. This novel technique may improve computation and can be applied to any number of records without constraint.

REFERENCES:

Journal Papers:
[1] Parag M. Moteria and Dr. Y. R. Ghodasara, "Application of Binomial Probability Distribution in Cluster Analysis of Similar Categorical Variables", ISSN 2250-2459, Volume 2, Issue 6, June 2012, pp. 133-135.

Books:
[2] Anderson, Sweeney, Williams, "Statistics for Business and Economics", 9th edition, Thompson Publication.
[3] Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Second Edition, Elsevier/Morgan Kaufmann, 2011, pp. 383, 389, 390.

Others:
[4] http://www.jmp.com/support/help/Introdction_to_Clustering_Methods.shtml
[5] http://en.wikipedia.org/wiki/Binary_data
[6] http://en.wikipedia.org/wiki/Binomial_distribution
[7] https://people.richland.edu/james/lecture/m170/ch07-bin.html
[8] http://webdocs.cs.ualberta.ca