ISSN No: 2309-4893
International Journal of Advanced Engineering and Global Technology I Vol-04, Issue-06, November 2016

Big Data Mining: A Study

Hitesh Kataria1, Shubham Grover1
1 Student, Computer Science Engineering, Maharaja Agrasen Institute of Technology, India

Abstract: The last decade has seen explosive growth in data. Our pace of analysing data is far slower than the rate at which it is produced. Data mining is the technique used to discover useful insights in data and to make predictions from it, which makes the data valuable and powerful. This paper presents an overall study of this process, the methods used and its major applications in today’s world.

I. INTRODUCTION

Data mining can be viewed as a result of the natural evolution of information technology [1]. Data mining is a knowledge discovery process involving the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data. It is also referred to as knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archaeology, data dredging, information harvesting, business intelligence, etc. It is basically a two-step process: rules are developed from the behaviour of a given system (the data sets), and these rules are then used to evaluate the behaviour or outcome for given circumstances.

TABLE I. STEPS IN THE EVOLUTION OF DATA MINING [2]

Evolutionary step: Data Collection (1960s)
  Business question: "What was my total revenue in the last five years?"
  Enabling technologies: Computers, tapes, disks
  Product providers: IBM, CDC
  Characteristics: Retrospective, static data delivery

Evolutionary step: Data Access (1980s)
  Business question: "What were unit sales in New England last March?"
  Enabling technologies: Relational databases (RDBMS), Structured Query Language (SQL), ODBC
  Product providers: Oracle, Sybase, Informix, IBM, Microsoft
  Characteristics: Retrospective, dynamic data delivery at record level

Evolutionary step: Data Warehousing & Decision Support (1990s)
  Business question: "What were unit sales in New England last March? Drill down to Boston."
  Enabling technologies: On-line analytic processing (OLAP), multidimensional databases, data warehouses
  Product providers: Pilot, Comshare, Arbor, Cognos, Microstrategy
  Characteristics: Retrospective, dynamic data delivery at multiple levels

Evolutionary step: Data Mining (Emerging Today)
  Business question: "What’s likely to happen to Boston unit sales next month? Why?"
  Enabling technologies: Advanced algorithms, multiprocessor computers, massive databases
  Product providers: Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry)
  Characteristics: Prospective, proactive information delivery

Data mining is an iterative process, consisting of the following major steps:

Figure 1. Data mining steps

A. Selection: Data relevant to the task are retrieved from appropriate sources.
B. Preprocessing:
   Data cleaning: Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
   Data integration: Integrate multiple databases, data cubes, or files.
   Data transformation: Normalization and aggregation.
   Data reduction: Obtain a representation that is reduced in volume but produces the same or similar analytical results.
   Data discretization: Part of data reduction, but of particular importance, especially for numerical data.
C. Transformation: Transform or consolidate the data into a new format for processing [3].
D. Data mining: The essential process in which intelligent methods are applied in order to extract useful results.
E. Interpretation/evaluation: Interpret the result or query to give a meaningful report or information.

Several major data mining techniques have been developed.
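The cleaning, transformation and discretization operations listed under the Preprocessing step can be illustrated with a minimal Python sketch. The paper does not prescribe any particular method, so the choices here (mean imputation, min-max scaling, equal-width bins) and the function names and sample values are assumptions made only for illustration.

```python
from statistics import mean

def fill_missing(values):
    """Data cleaning: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_normalize(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def discretize(values, bins):
    """Data discretization: map values in [0, 1] to equal-width bin indices."""
    return [min(int(v * bins), bins - 1) for v in values]

# Hypothetical numeric attribute with one missing value.
ages = [20, None, 40, 60]
cleaned = fill_missing(ages)          # [20, 40, 40, 60]
scaled = min_max_normalize(cleaned)   # [0.0, 0.5, 0.5, 1.0]
binned = discretize(scaled, bins=2)   # [0, 1, 1, 1]
```

Each function maps to one preprocessing sub-step, so they can be chained in the order the steps are listed. Real data sets would usually call for more careful handling of outliers and of attributes that are missing entirely.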
II. TECHNIQUES

This section briefly examines the major techniques to give an overview of them [4].

A. Clustering
Clustering is the process of grouping related records together. Records are grouped on the basis of having similar values for their attributes. This process can be very effective if the data form distinct clusters, but not if the data are “smeared”. The technique is based on unsupervised learning (i.e. the desired output for a given input is not known). The most commonly used algorithms are:
   Enhanced K-Means
   Orthogonal Partitioning
   Expectation Maximization

B. Classification
Classification is the process of assigning an object to a certain class based on its similarity to previous examples of other objects. It can be done with reference to the original data or based on a model of that data. Classification is similar to clustering in that it also segments customer records into distinct segments, called classes. Unlike clustering, however, a classification analysis requires that the end user or analyst know ahead of time how the classes are defined. For example, classes can be defined to represent the likelihood that a customer defaults on a loan (Yes/No). Classification is a supervised learning process. The most commonly used algorithms are:
   Logistic Regression
   Naive Bayes
   Support Vector Machine
   Decision Tree

C. Summarization
Summarization is the generalization or abstraction of data. It is the process of reducing a large volume of information to a summary or abstract that preserves only the most essential items. For example, the long-distance calls of a customer can be summarized into total minutes, total calls, total spending, etc., instead of a detailed list of calls.

D. Association Rules
Association is one of the most popular data mining techniques; it finds the most frequent item sets. Association strives to discover patterns in data that are based upon relationships between items in the same transaction. Because of its nature, association is sometimes referred to as the “relation technique”. These types of findings are often used for targeting coupons, deals or advertising. Apriori is the commonly used algorithm.

Figure 2

E. Anomaly detection
In a large data set it is possible to get a picture of what the data tend to look like in a typical case. Statistics can be used to determine whether something is notably different from this pattern. For instance, the IRS could model typical tax returns and use anomaly detection to identify specific returns that differ from this pattern for review and audit.

F. Regression
Regression is finding a function that models the data with minimal error. It is the statistical methodology most often used for numeric prediction. Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. Regression analysis is also used to understand which of the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However, this can lead to illusory or false relationships, so caution is advisable [5].

G. Prediction
Prediction, as its name implies, is a data mining technique that discovers relationships among independent variables and between dependent and independent variables. For instance, prediction analysis can be used in sales to predict future profit: if sales is treated as the independent variable, profit is the dependent variable.

H. Time Series Analysis
A time series is an ordered sequence of data points. It consists of sequences of values or events changing with time. Time series analysis is the process of using statistical techniques to generate predictions (forecasts) of future events based on known past events.

I. Attribute Importance
Attribute importance provides an automated solution for improving the speed, and possibly the accuracy, of classification models built on data tables with a large number of attributes. Using this technique, the analyst can determine which of the attributes are most useful or relevant for building models to solve a particular problem [6]. By extracting as much information as possible from a given data table using the smallest number of attributes, a user can save significant computing time and often build better models.

J. Sequence Discovery
Sequential pattern analysis is a data mining technique that seeks to discover or identify similar patterns, regular events or trends in transaction data over a business period. For example, in sales, businesses can use historical transaction data to identify sets of items that customers buy together at different times of the year. Businesses can then use this information to recommend those items to customers with better deals, based on their past purchasing frequency.

III. APPLICATIONS

Various industries have been adopting data mining in their mission-critical business processes to gain competitive advantages [4]:

A. Intrusion detection
Data mining can help improve intrusion detection by adding a level of focus to anomaly detection. By identifying bounds for valid network activity, data mining helps an analyst distinguish attack activity from common everyday traffic on the network.

B. Retail industry
The retail industry collects large amounts of data on sales and customer shopping history. The quantity of data collected continues to expand rapidly, especially due to the increasing ease, availability and popularity of business conducted on the web, or e-commerce. The retail industry therefore provides a rich source for data mining. Retail data mining can help identify customer behavior, discover customer shopping patterns and trends, improve the quality of customer service, achieve better customer retention and satisfaction, enhance goods consumption ratios, design more effective goods transportation and distribution policies, and reduce the cost of business.

C. Telecommunications
Due to the development of new computer and communication technologies, the telecommunication industry is rapidly expanding. This is why data mining has become very important in helping to understand the business. Data mining in the telecommunication industry helps to identify telecommunication patterns, catch fraudulent activities, make better use of resources, and improve quality of service.

D. Finance
With increasing economic globalization and improvements in IT, large amounts of financial data are being generated and stored. These can be subjected to data mining techniques to discover hidden patterns and to obtain predictions of future trends and of the behaviour of the financial markets. This in turn would result in improved marketplace responsiveness and awareness, leading to reduced costs and increased revenue. Analytics can contribute to solving business problems in banking and finance by finding patterns, causalities and correlations in business information and market prices that are not immediately apparent to managers, because the volume of data is too large or the data are generated too quickly for experts to screen. Bank managers may go a step further and find the sequences, episodes and periodicity of the transaction behaviour of their customers, which may help them in better segmenting, targeting, acquiring, retaining and maintaining a profitable customer base.

E. Cloud computing
Data mining techniques are used in cloud computing. Implementing data mining techniques through cloud computing allows users to retrieve meaningful information from a virtually integrated data warehouse, which reduces the costs of infrastructure and storage [7]. Cloud computing uses Internet services that rely on clouds of servers to handle tasks. Data mining helps cloud computing providers deliver efficient, reliable and secure services to their users [8].

F. Biomedical analysis
In recent years, data mining has been widely used in areas of medical science such as biomedicine, DNA, genetics and medicine. In genetics, an important goal is to understand the mapping between variation in human DNA sequences and disease susceptibility. Data mining is a very important tool for improving the diagnosis, prevention and treatment of diseases [9].

G. Biological Data Analysis
Nowadays there is vast growth in fields of biology such as genomics, proteomics, functional genomics and biomedical research. Biological data mining is a very important part of bioinformatics. Data mining contributes to biological data analysis in the following aspects [10]:
   Semantic integration of heterogeneous, distributed genomic and proteomic databases.
   Alignment, indexing, similarity search and comparative analysis of multiple nucleotide sequences.
   Discovery of structural patterns and analysis of genetic networks and protein pathways.

H. Agriculture
Data mining is emerging in the agricultural field for crop yield analysis with respect to four parameters, namely year, rainfall, production and area of sowing. Yield prediction is a very important agricultural problem that remains to be solved on the basis of the available data. The yield prediction problem can be approached with data mining techniques such as K-Means, K-nearest neighbor (KNN), artificial neural networks and support vector machines (SVM) [11].

IV. CONCLUSIONS

In this paper, we discussed the data mining process and briefly presented the major steps involved in it. The most frequently used techniques for knowledge discovery were also described. Lastly, we gave a short description of the major applications of this field.

REFERENCES

[1] Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques, 2nd edition.
[2] Dr. Borne, 2005, UMUC Data Mining Lecture 21, Data Mining UMUC CSMN 667 Lecture #2.
[3] Nong Ye, Data Mining: Theories, Algorithms, and Examples.
[4] Aakanksha Bhatnagar, Shweta P. Jadye, Madan Mohan Nagar, “Data Mining Techniques & Distinct Applications: A Literature Review”, International Journal of Engineering Research & Technology (IJERT), Vol. 1, Issue 9, November 2012.
[5] R. Kaur, S. Kaur, A. Kaur, R. Kaur, A. Kaur, “An Overview of Database Management System, Data Warehousing and Data Mining”, IJARCCE, Vol. 2, Issue 7, July 2013.
[6] Mark F. Hornick, Erik Marcadé, Sunil Venkayala, Java Data Mining: Strategy, Standard, and Practice: A Practical Guide for
[7] Ruxandra-Ştefania Petre, “Data Mining in Cloud Computing”, Database Systems Journal, vol. III, no. 3/2012.
[8] Smita, Priti Sharma, “Use of Data Mining in Various Field: A Survey Paper”, IOSR Journal of Computer Engineering (IOSR-JCE), e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 16, Issue 3, Ver. V (May-Jun. 2014), pp. 18-21.
[9] Simmi Bagga, Dr. G.N. Singh, “Applications of Data Mining”, International Journal for Science and Emerging Technologies with Latest Trends, 2012.
[10] S.D. Gheware, A.S. Kejkar, S.M. Tondare, “Data Mining: Task, Tools, Techniques and Applications”, International Journal of Advanced Research in Computer and Communication Engineering, Vol. 3, Issue 10, October 2014.
[11] D. Ramesh, B. Vishnu Vardhan, “Data Mining Techniques and Applications to Agricultural Yield Data”, International Journal of Advanced Research in Computer and Communication Engineering, Vol. 2, Issue 9, September 2013.
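As an illustration of the association technique from Section II-D, the following Python sketch mines frequent item sets in the style of Apriori and computes the confidence of a rule. It is a simplified teaching version under stated assumptions, not the implementation used in any cited work: the function names, the sample baskets and the 0.5 support threshold are all hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    items = {item for b in baskets for item in b}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that contain exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent from stored supports."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]

# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = apriori(transactions, min_support=0.5)
rule_conf = confidence(frequent, {"butter"}, {"bread"})  # 1.0
```

With these four baskets, the pairs bread/milk and bread/butter reach the threshold while milk/butter does not, so the three-item set is pruned without being counted. A rule such as butter -> bread with confidence 1.0 is the kind of finding the section describes as useful for targeting coupons or deals.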