International Journal On Advanced Computer Theory And Engineering (IJACTE)

Data Mining With Big Data: A Survey

Subik Pokharel
Information Technology, Sikkim Manipal Institute of Technology, Sikkim (INDIA)
Email: subikpokharel93@gmail.com

Abstract: Big Data is a new term for a data set so large and complicated that it becomes difficult to process using traditional data management tools. Big Data is now growing rapidly in all science and engineering domains, including the biological, physical, and biomedical sciences. Big Data mining is the art of extracting useful information from these data sets, which was not possible before because of their volume, variability, and velocity. Big Data is becoming one of the most exciting and challenging opportunities of the coming years.

Keywords: Big Data, Big Data mining, Data mining

I. INTRODUCTION

In the last few years we have witnessed a technology revolution in which millions of people generate tremendous amounts of data from various sensors and devices, in different formats, and from independent or connected applications. This tremendous volume of data is referred to as Big Data. With the help of Big Data, many things that once seemed impossible become possible, such as preventing the spread of disease, preventing crime, personalizing healthcare, quickly identifying business opportunities, and protecting the homeland.

As discussed by The Economist [2], “Managed well, the data can be used to unlock new sources of economic value, provide fresh insights into science and hold governments to account”. Enormous amounts of data are generated every day in every field. For example, Google processes nearly 1 billion queries per day, Twitter handles nearly 250 million tweets per day, YouTube serves more than 4 billion views per day, and the volumes elsewhere are comparable. Traditional DBMSs, once very successful, are no longer able to meet the increasing demands of Big Data. These challenges call for a new stack of highly scalable computing models, tools, frameworks, and platforms. Mining Big Data has opened many new challenges and opportunities for data mining.

II. DATA MINING

Data mining, regarded as a branch of computer science and artificial intelligence, is the process of discovering interesting knowledge, such as patterns, associations, changes, anomalies, and significant structures, from large amounts of data stored in databases, data warehouses, or other information repositories. In modern business, data mining is seen as an important tool for transforming data into business intelligence, giving an informational advantage. Association rule mining, sequential pattern mining, clustering, and classification are among the techniques used in data mining, and different algorithms have been developed for each of them.

III. BIG DATA

Big Data is the term for data sets so large and complicated that they become difficult to process using traditional data management tools or processing applications [4]. There are two types of Big Data: structured data and unstructured data.

Structured data consist of numbers and words that can be easily categorized and analyzed. Such data are generated by, for example, network sensors embedded in electronic devices, smartphones, and global positioning system (GPS) devices. Structured data also include sales figures, account balances, and transaction data.

Unstructured data contain more complex information, such as customer reviews from commercial websites, photos and other multimedia, and comments on social networking sites. These data cannot easily be separated into categories or analyzed numerically.
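
The difference between the two types of data can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not a method taken from the survey: the sample records, the sample reviews, and the helper names `total_by_region` and `word_counts` are all hypothetical. Structured records share a schema and can be aggregated directly, whereas free text has to be turned into countable features (here, simple word counts) before it can be analyzed numerically.

```python
import re
from collections import Counter

# Hypothetical structured records: every row has the same fields,
# so the data can be categorized and aggregated directly.
sales = [
    {"region": "north", "amount": 120.0},
    {"region": "south", "amount": 80.0},
    {"region": "north", "amount": 50.0},
]


def total_by_region(records):
    """Sum the 'amount' field per 'region' (relies on the shared schema)."""
    totals = {}
    for row in records:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


# Hypothetical unstructured data: free-text customer reviews with no schema.
reviews = [
    "Great phone, great battery life!",
    "Battery died after a week. Not great.",
]


def word_counts(texts):
    """Derive numeric features from free text by lowercasing and tokenizing."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


if __name__ == "__main__":
    print(total_by_region(sales))
    print(word_counts(reviews).most_common(3))
```

Word counting is only one of many possible ways to extract features from text; it is used here because it needs nothing beyond the standard library.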
The term 'Big Data' appeared for the first time in 1998, in a Silicon Graphics (SGI) slide deck by John Mashey titled "Big Data and the Next Wave of InfraStress". Big Data mining was relevant from the beginning: the first book mentioning 'Big Data' is a data mining book by Weiss and Indurkhya, which also appeared in 1998 [6]. However, the first academic paper with the words 'Big Data' in the title appeared a bit later, in 2000, in a paper by Diebold [7].

Doug Laney was the first person to talk about the 3 V's of Big Data management, which are as follows:

Volume: the amount of data. It increases continuously every day and ranges from terabytes to zettabytes.
Variety: the different types of data and data sources, such as text, images, blogs, video, and sensor data.
Velocity: data in motion. Data arrive continuously as high-speed streams and are processed to meet demands and the challenges that lie ahead in the path of growth and development.

Nowadays, there are two more V's:

Variability: changes in the structure of the data as required by users and in how they want to interpret the data.
Value: whether the data being used are valuable to society.

There are many applications of Big Data, for example in business, technology, health, smart cities, online transactions, and education.

Some of the features of Big Data are:

The size of Big Data is huge.
The data keep increasing and changing over time.
The data come from different sources.
It is hard to handle because it is complex in nature.
It is free from the influence, guidance, or control of anyone.

IV. CHALLENGES IN BIG DATA

There are many challenges that must be addressed to realize the full potential of Big Data, and meeting them will be difficult. Some of the challenges in Big Data are given below:

A. Meeting the need for speed: In today's era, we not only have to find and analyze the data but must also do so quickly. For example, in an organization, decisions must be made ever more rapidly because this is an era of competition. One solution is hardware: some organizations use increased memory and more powerful processing to crunch large volumes of data quickly.

B. Understanding the data: Big Data refers to huge amounts of data. To work on such data, it is very important to understand the data and put them into the right shape for mining operations. For example, if the data come from social media content, we need to know, in a general way, who the users are.

C. Displaying meaningful results: Data mining on Big Data reveals hidden information and patterns. Plotting this information as points on a graph becomes very difficult at Big Data scale, so grouping the data into smaller groups can be helpful.

D. Privacy: Big Data contains huge amounts of data that are usually not stored in the same place. For mining purposes, the data therefore need to be transported from one place to another, so privacy plays an important role in Big Data mining.

Presently, parallel-computing-based approaches such as MapReduce are used to mine information from Big Data.

V. FORECAST TO THE FUTURE

As the era of the petabyte nears its end and we enter the era of the exabyte, data will play a vital role in decision making in the near future. In the coming years, the challenges in Big Data will grow as the data grow. The following are some of the challenges that researchers may have to deal with during the next few years:

The optimal architecture of an analytical system that deals with historical data and real-time data at the same time is still unclear. The Lambda architecture, proposed by Nathan Marz, is an interesting architecture that addresses the problem of computing arbitrary functions on arbitrary data in real time.
Big Data contains huge amounts of data, so it is important to achieve statistical significance and not be misled by randomness.
Data mining techniques are used to extract patterns or hidden information from Big Data, and many of these techniques are not trivial to parallelize. Hence a lot of research is needed in this field.
Technologies are evolving rapidly, so research must keep pace with them as well.
Data visualization is becoming an increasingly important component of analytics in the age of Big Data.
Big Data deals with huge amounts of data stored in different places or warehouses, and the amount of data is increasing day by day. Hence compression is an important factor in storing these data.

VI. CONCLUSION

Big Data is going to become more diverse, larger, and faster in the coming years. This paper discussed the term 'Big Data', its challenges, and its forecast for the future. The coming years are going to be challenging for researchers working on Big Data, as well as for organizations.

ACKNOWLEDGMENT

I would like to thank Mr. Ashis Datta and Mr. Joyashri Datta for their support and guidance throughout this period. I will always be thankful for their support; without their guidance, this work would not have been possible.

REFERENCES

[1] Elisa Bertino, “Big Data – Opportunities and Challenges,” IEEE 37th Annual Computer Software and Applications Conference, 2013.
[2] “Data, data everywhere,” The Economist, 25 February 2010, available at http://www.economist.com/node/15557443.
[3] Wei Fan and Albert Bifet, “Mining Big Data: Current Status, and Forecast to the Future,” Vol. 14, Issue 2, 2013.
[4] Bo Li, “Survey of Recent Research Progress and Issues in Big Data,” December 10, 2013, available at http://www.cse.wustl.edu/~jain/cse57013/index.html.
[5] F. Diebold, “On the Origin(s) and Development of the Term 'Big Data',” PIER working paper archive, Penn Institute for Economic Research, Department of Economics, University of Pennsylvania, 2012.
[6] S. M. Weiss and N. Indurkhya, Predictive Data Mining: A Practical Guide, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1998.
[7] F. Diebold, “'Big Data' Dynamic Factor Models for Macroeconomic Measurement and Forecasting,” discussion read to the Eighth World Congress of the Econometric Society, 2000.
[8] Rohit Pitre and Vijay Kolekar, “A Survey Paper on Data Mining With Big Data,” International Journal of Innovative Research in Advanced Engineering (IJIRAE), Volume 1, Issue 1, April 2014.
[9] Manika Verma and Dr. Devarshi Mehta, “A Comparative study of Techniques in Data Mining,” International Journal of Emerging Technology and Advanced Engineering, Volume 4, Issue 4, April 2014.
[10] A. N. Pathak, Manu Sehgal and Divya Christopher, “A Study on Selective Data Mining Algorithms,” International Journal of Computer Science Issues, Vol. 8, Issue 2, March 2011.
[11] Bharti Thakur and Manish Mann, “Data Mining for Big Data: A Review,” International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 5, May 2014.
[12] Dunren Che, Mejdl Safran and Zhiyong Peng, “From Big Data to Big Data Mining: Challenges, Issues, and Opportunities,” Springer Berlin Heidelberg, 2013.
[13] Juha K. Laurila, Daniel Gatica-Perez, Imad Aad, Jan Blom, Olivier Bornet, Trinh-Minh-Tri Do, Olivier Dousse, Julien Eberle and Markus Miettinen, “The Mobile Data Challenge: Big Data for Mobile Computing Research,” unknown.
[14] Subana Shanmuganathan, “From Data Mining and Knowledge Discovery to Big Data Analytics and Knowledge Extraction for Applications in Science,” Journal of Computer Science, 2014.
[15] Vitthal Yenkar and Prof. Mahip Bartere, “Review on Data Mining with Big Data,” IJCSMC, Vol. 3, Issue 4, April 2014, pp. 97-102.
[16] Manish Kumar Kakhani, Sweeti Kakhani and S. R. Biradar, “Research Issues in Big Data Analytics,” International Journal of Application or Innovation in Engineering & Management (IJAIEM), Volume 2, Issue 8, August 2013.

ISSN (Print): 2319-2526, Volume 4, Issue 3, 2015