Survey
An Approachable Analytical Study on Big Educational Data Mining

Saeed Aghabozorgi1, Hamidreza Mahroeian2, Ashish Dutt1, Teh Ying Wah1, and Tutut Herawan1,3

1 Department of Information System, University of Malaya, 50603 Pantai Valley, Kuala Lumpur, Malaysia
2 University of Otago, New Zealand
3 AMCS Research Center, Yogyakarta, Indonesia
{saeed,teh,tutut}@um.edu.my, hamidreza.mahroeian@postgrad.otago.ac.nz, ashish_dutt@siswa.um.edu.my

Abstract. The persistent growth of data in education continues. More institutes now store terabytes and even petabytes of educational data. Data complexity in education is increasing as people store both structured data in relational format and unstructured data such as Word or PDF files, images, videos and geo-spatial data. Indeed, learning developers, universities, and other educational sectors confirm that a tremendous amount of the data captured is in unstructured or semi-structured format. Educators, students, instructors, tutors, research developers and others who deal with educational data are also challenged by the velocity of different data types. Organizations and institutes that process streaming data, such as click streams from web sites, need to update data in real time to serve the right advert or present the right offers to their customers. This analytical study is oriented to the challenges and analysis of big educational data, which involve uncovering or extracting knowledge from large data sets by using different educational data mining approaches and techniques.

Keywords: Big Data; Educational Data; Educational Data Mining; Data Mining; Analytical Study.

1 Introduction

Big data can be viewed as the study of voluminous amounts of didactic data, whether physical or digital, stored in diverse repositories ranging from an educational institution's tangible bookkeeping records to class test or examination records to alumni records [1]. These records continue to grow in size and variety.
As the old adage goes, we learn from our mistakes. In a similar fashion, businesses today are operated on decisions drawn from the data they collect. Predictions, associations, clustering, and many other common business decisions are made each day by corporations to enhance productivity and mutual growth [2]. These significant decisions depend exclusively on the data collected during business operations and on human judgment. The concept of big data has now been applied in various sectors such as government, business, and hospital management, to name a few, but little research has been done on its application in the educational sector. This is what we aim to explore through this research work. Tomes have been written on the efficacy of big data and on the technologies that can be used to harness its sheer strength, yet research on the application of big data in the educational sector remains very limited.

Using data to make decisions is not a new concept; corporations apply complicated calculations to customer-generated data for business intelligence and analytics. Various business intelligence techniques can distinguish historical trends and customer patterns in data and can generate models that predict future patterns and trends [3]. These techniques consist of proven methodologies from computer science, mathematics, and statistics for deriving non-redundant information from large-scale datasets (big data) [4]. One clear example of exploiting data to discover online behavior is Web analytics, whose methods log and report visits to a Web page, a specific region, or particular domains, as well as the links that were clicked through.
Web analytics is applied to understand how people use the Web, but corporations have adopted more sophisticated approaches and tools to track more complex user interactions with their websites [5, 6]. Examples of web analytics include analyzing consumers' purchasing habits and applying recommendation algorithms in commercial websites and search engines so that they can recommend the product a consumer is most likely to want; notable examples are Netflix and Amazon. The same concept is now being applied to various e-learning systems. For example, Edmodo is a free LMS that is able to suggest similar books or resources based on the learner’s web activity on the e-LMS [7]. New approaches and methods are considered imperative for the extraction and analysis involved in the aforementioned tasks, so that they integrate seamlessly with the unstructured data that these information systems generate.

Big data is voluminous, and it would be futile to bound it by a specific size. One way to define it could be its net usability worth. According to Manyika et al. [10], a data set whose size exceeds the processing limit of software can be categorized as big data. Several past studies have provided detailed insights into the application of traditional data mining algorithms such as clustering, prediction, and association to tame the sheer volume of big data. Recent advances in the machine learning field have provided unique approaches to knowledge discovery in datasets. These algorithms have been successful in finding correlations in unstructured data, and one of their applications is predictive modeling. Such models can be treated as virtual prototypes of a real working model. Feeding real datasets into such models can help identify problems that can then be promptly addressed, thus reducing the operational costs of both human and machine labor.
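Returning to the recommendation example earlier in this section, the idea of suggesting resources from a learner's activity can be illustrated with a minimal co-occurrence sketch. It is not the method used by Edmodo or by any system cited here; the activity log, the function names, and the scoring rule are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(sessions):
    """Count how often two resources appear in the same learner's history."""
    counts = defaultdict(Counter)
    for resources in sessions.values():
        for a, b in combinations(sorted(set(resources)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(counts, seen, top_n=2):
    """Rank unseen resources by total co-occurrence with the seen ones."""
    scores = Counter()
    for item in seen:
        for other, n in counts[item].items():
            if other not in seen:
                scores[other] += n
    # Sort by score (descending), then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


# Hypothetical activity log: learner id -> resources opened.
sessions = {
    "s1": ["algebra", "geometry", "statistics"],
    "s2": ["algebra", "statistics"],
    "s3": ["geometry", "poetry"],
    "s4": ["algebra", "statistics", "probability"],
}
counts = build_cooccurrence(sessions)
suggestions = recommend(counts, seen={"algebra"})
```

A production recommender would normally weight recency, normalize for resource popularity, and handle learners with no history; the sketch only shows how raw interaction data can be turned into a ranked suggestion.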
Two specific fields that are significant to the exploitation of big data in education are educational data mining and learning analytics. Although there is no hard and fast distinction between these two areas, they have had somewhat different research histories and are developing as discrete research areas. In general, educational data mining tries to uncover new patterns in captured data and to build new algorithms or new models, whilst learning analytics applies known predictive models in educational systems [1, 4]. As can be seen in Figure 1, educational data such as log files, user (learner) interaction data, and social network data are expected to grow in the near future.

This research study is oriented to the challenges and analysis of big educational data, which involve uncovering or extracting knowledge from large data sets by using different educational data mining approaches and techniques. It is arranged as follows. The next section describes the background of the study, including the importance of education and educational data, the nature of big data, and a basic understanding of data mining or knowledge discovery techniques. Section 3 discusses, from the big educational data mining perspective, the concepts of educational data mining, big educational data, and big data mining. Section 4 details the major challenges of big educational data mining, and finally the related discussion and conclusion are outlined.

Fig 1. Growth of different Educational Data

2 Rudimentary

2.1 Education

Learning providers, institutes, universities, schools and colleges have always had the ability to generate huge amounts of educational data [8]. Even a small kindergarten that caters only to a play group of children aged 4 to 6 can produce enormous quantities of data, ranging from academics to peer activities, classroom activities, and so forth.
After the explosion of the buzzword “big data” across industrial sectors, researchers and industry workers are converging on the areas that could presumably be affected by this surge [9]. Recent advances in technology have made it possible to explore previously unknown information that lay buried deep in heaps of data sets [10]. However, the most basic question that must be answered first is: “Is there really any big data in education?” Or are we simply looking at an impasse?

2.2 Big Data

There are a number of similar definitions of big data. Perhaps the best-known version comes from IBM, which proposed that big data can be characterized by any or all of three V words: volume, variety, and velocity [1, 9, 11].

Volume refers to the larger quantities of data being produced from a wide range of sources. For instance, big data can comprise data captured from the Internet of Things (IoT). As originally envisioned, the IoT refers to data collected from a range of devices and sensors networked together over the Internet [12]. Big data can also refer to the explosion of information available on common social media such as Facebook and Twitter [13].

Variety refers to the use of numerous types of data to investigate a situation or event. On the IoT, millions of devices generating a steady flow of data result in not only a large volume of data but also different kinds of data characteristic of different situations. Furthermore, people on the Internet produce a highly varied set of structured, semi-structured, and unstructured data [9].

Velocity refers to the rapid increase in both structured and unstructured data over time and to the need for more frequent decision making about that data [1].
As the world becomes more global and developed, and as the IoT grows, data about things is captured, and decisions about them are made, with growing frequency as they move through the world. Additionally, the velocity of social media use is on a clear upward trend; one example is the 250 million tweets posted per day. As decisions are made using big data, those decisions can eventually have a substantial impact on the next data that is captured and analyzed, adding another dimension to the velocity of big data [1, 10].

2.3 Data Mining

Data mining, also known as knowledge discovery in databases (KDD), is the automatic extraction of implicit and interesting patterns from vast amounts of data [14]. Data mining is a multidisciplinary field in which a number of computing paradigms converge, such as decision tree construction, rule induction, artificial neural networks, instance-based learning, Bayesian learning, and logic programming. In addition, commonly used data mining techniques and methods include statistics, visualization, clustering, classification, and association rule mining [15,16]. These techniques discover new, implicit, and practical knowledge based on students’ usage data.

Data mining has been broadly applied in different kinds of educational systems. On one hand, there are traditional classroom environments such as special education [17] and higher education [18]. On the other hand, there is computer-based and web-based education, such as the well-known learning management systems (LMSs) [19], web-based adaptive hypermedia systems [20], and intelligent tutoring systems (ITS) [21]. The major difference between the two is the data accessible in each system. Traditional classrooms only have information about student attendance, the basic course syllabus, course objectives, and learners’ plan data.
However, web-based and computer-based education has much more information readily available, because these systems can record all the data pertaining to specific students’ actions and interactions in log files and databases [22].

[Figure 2 components: Educational system (traditional classrooms, e-learning systems, LMSs, intelligent tutoring systems, web-based adaptive systems); Educational data (knowledge components, user transactions, log files, user performance, identity-type data); Users (instructors, learners, students, course administrators, academic researchers, educators); Data mining techniques (visualization, clustering, classification, statistics, association rule mining, sequence mining)]
Fig. 2. Applying data mining to the design of educational systems

In order to improve learning effectiveness, the application of data mining approaches and techniques to educational systems can be viewed as a formative evaluation technique, that is, the evaluation of an educational program while it is still in development, with the purpose of continually enhancing the program. Auditing the way students use the educational system is perhaps one common way to assess instructional design in this manner; it helps learning developers improve instructional materials and yields different data types such as log files, performance data, and transaction data [23]. Data mining techniques should be applied to collect information that helps instructional designers and developers build an educational foundation for judgments when designing or improving an environment’s instructive approach. The application of data mining to the design of educational systems is an iterative cycle of hypothesis formation, testing, and refinement (see Figure 2). Extracted knowledge should go through the loop towards guiding, facilitating, and enhancing learning as a whole.
In this process, the aim is not just to turn data into knowledge, but also to filter mined knowledge for decision making [16,24]. As represented in Figure 2, educators and educational designers (whether in school districts, curriculum companies, or universities) design, plan, create, and maintain educational systems. Students use those educational systems to learn. Building on the available information about courses, students, usage, and interaction, data mining techniques can be applied to discover useful knowledge that helps improve educational designs. The discovered knowledge can be used not only by educational designers and teachers but also by the end users, the students. Hence, the application of data mining in educational systems can be oriented to supporting the specific needs of each of these categories of stakeholders [23].

3 Big Educational Data Mining

3.1 Educational Data Mining

Educational Data Mining (EDM) is a field that applies statistical, machine-learning, and data-mining (DM) algorithms to different types of educational data. Its major objective is to analyze these types of data in order to resolve educational research issues [25,26]. EDM is concerned with developing methods to explore the relationships between the unique types of data produced in educational settings and with using these methods to better understand students and the settings in which they learn. The increase in both instrumented educational software and state databases of student information has created large repositories of data reflecting how students learn [26]. In addition, the use of the Internet in education has created a new context, known as e-learning or web-based education, in which large amounts of data about teaching–learning interaction are endlessly generated and ubiquitously available [23]. All this information provides a gold mine of educational data.
EDM seeks to tap these untouched data repositories to better understand learners and learning abilities, and to develop computational approaches that combine data and theory to transform practice to benefit learners. EDM has emerged as a prolific research area in recent years for researchers all over the world from different and related research areas [7]. Educational data mining can be extremely helpful in drawing inferences and making predictions about students’ behavior, attitude, and focus on their educational goals. The results obtained by applying traditional data mining algorithms in an educational context can help enhance the educational system, as all stakeholders can examine the trends found once analytic reasoning is applied to student-related parameters [25]. Regression techniques are commonly used to analyze such data. When the data are reduced to statistical figures for analysis, the results can usually be plotted on graphs, and trends can be found as lines or as clusters of data points that reflect the concentration of some student behavior related to learning, research, or a similar activity [25].

EDM involves various groups of users such as learning developers, instructors, educators, and researchers. Different groups consider educational data from different angles, based on their mission, vision, and major purpose for using data mining, as depicted in Table 1.

Table 1.
EDM Users/Stakeholders

User/Actors | Objectives for using data mining
Learners/Students/Pupils | To personalize e-learning; to recommend to learners activities, resources and learning tasks that could further improve their learning; to suggest interesting learning experiences to the students [27]
Educators/Instructors/Teachers/Tutors | To get objective feedback about instructions; to analyze students’ learning and behavior; to detect which students need support; to predict student performance; to classify learners into groups [28]
Course Developers/Educational Researchers | To evaluate and maintain courseware; to improve student learning; to evaluate the structure of course content and its effectiveness in the learning process [29]
Organizations/Learning Providers/Universities/Private Training Companies | To enhance the decision processes in higher learning institutions; to streamline the efficiency of the decision making process; to achieve specific objectives [30]
Administrators/School District Administrators/Network Administrators/System Administrators | To develop the best way to organize institutional resources and their educational offer; to utilize available resources more effectively; to enhance educational program offers and determine the effectiveness of the distance learning approach [31]

Today, there exists a wide variety of educational data sets that can be downloaded for free from the Internet.
Some widely acclaimed and used repositories are PSLC DataShop (the world’s largest repository of learning interaction data); Data.gov (the official website of the United States Government for educational data sets); NCES data sets (NCES is the primary federal entity for collecting and analyzing data related to education in the United States) [26,32]; the Barro-Lee data set (provided by the researchers Barro and Lee, whose contribution has been discussed in section 1); the UNISTATS dataset (a website that provides comparable sets of information about full-time or part-time undergraduate courses and is designed to meet the information needs of prospective students); SABINS (the School Attendance Boundary Information System, which provides, free of charge, aggregate census data and GIS-compatible boundary files for school attendance areas, or school catchment areas, for selected areas in the United States for the 2009-10, 2010-11 and 2011-12 school years); UIS (a UNESCO initiative); EdStats (a World Bank initiative); the Education Human Development Network (a World Bank initiative); the IPEDS Data Center (the primary source for data on colleges, universities, and technical and vocational postsecondary institutions in the United States); and TLRP [33].

3.1.1 Analysis of current tools being used for educational data sets

At present, statistical tools are predominantly used to quantify and assess educational data sets. Prominent ones are RapidMiner, SAS, IBM SPSS, and KEEL [34], a knowledge extraction tool based on evolutionary learning. Programming languages like R are mostly used for statistical analysis and play a pivotal role in programming custom tests that may not be available in commercial software packages. There are also some online, web-based data exploration tools, typically Java-based, that give the user the freedom to choose from varied dataset types and see a graphical representation of them.
One of these is the Education Data Explorer provided by the Oregon Department of Education, United States. Another is the Educational Data Analysis Tool (EDAT), which allows users to download NCES survey datasets to their computers. EDAT guides users through selecting a survey, population, and variables relevant to their analysis [30].

3.1.2 Educational data set problem and possible solutions

Does the problem really exist, or are we chasing a chimera? “Education for All”, a global monitoring report prepared by the United Nations, is the prime instrument for assessing global progress towards its goals. There is a flurry of activity around big data and how it is touching and transforming every aspect of our lives. Analysis of these large-scale datasets can help improve the robustness and generalizability of educational research. The problem with most large-scale secondary datasets used in higher education research is that they are constructed using complex sample designs that often cluster lower-level units (students) within higher-level units (colleges) to achieve efficiencies in the sampling process [35]. As shown in Figure 3, the term “Big Educational Data Mining” (BEDM) can be proposed for the extraction of useful big educational data from vast quantities of different large data sets.

Fig 3. Big Educational Data Mining (BEDM), Extracting new knowledge from Big Data Sets

3.2 Big Educational Data

Education has always had the capacity to produce a tremendous amount of data compared to any other industry. First, academic study requires many hours of schoolwork and homework over a number of years. These extended interactions with materials produce a huge quantity of data. Second, education content is tailor-made for big data, generating cascading effects of insights thanks to the high correlation between concepts [31]. Recent advancements in technology and data science have made it possible to unlock and explore these large data sets [15].
The benefits range from more effective self-paced learning to tools that enable instructors to pinpoint interventions, create productive peer groups, and free up class time for creativity and problem solving. For instance, as represented in Table 2, educational data can be grouped into five categories: one pertaining to student identity and onboarding, and four student activity-based data sets that have the potential to improve learning outcomes. They are listed below in order of how difficult they are to attain:

Table 2. Educational Data Type classes

No. | Data Type | Description
1 | Identity Data | Personal information, authority, domain rights, geographical information
2 | User Interaction Data | Engagement metrics, click rate, page views, bounce rate, etc.
3 | Inferred Content Data | How well does a piece of content perform across a group, or for any one subgroup, of students? What measurable student proficiency gains result when a certain type of student interacts with a certain piece of content?
4 | System-Wide Data | Rosters, grades, disciplinary records, and attendance information are all examples of system-wide data.
5 | Inferred Student Data | Exactly what concepts does a student know, at exactly what percentile of proficiency? What is the probability that a student will pass next week’s quiz, and what can she do right this moment to increase it?

Two areas that are specific to the use of big data in education are educational data mining and learning analytics. Although there is no hard and fast distinction between these two fields, they have had somewhat different research histories and are developing as distinct research areas. Generally, educational data mining looks for new patterns in data and develops new algorithms and/or new models, while learning analytics applies known predictive models in instructional systems.

Practical examples of big data in an educational context include the following. A clear example is an education initiative.
Analysts estimate that £16 billion is wasted in productivity due to under-educated citizens. In response, the UK government gathered data on outcomes of Kindergarten-12 education (elementary and high school) as well as higher education (university). The data pertained to student school performance and “success” afterwards as measured by employment[36]. The government increasingly contributes to the open data movement; it is okay with releasing “dirty data,” which is raw and not cleansed. Open data enables individuals and entrepreneurs to use public data to innovate. Data visualization tools enable parents to understand schools’ outcomes, so they can select appropriate schools for their children. Universities can use data in exciting ways; they analyze students’ social media sharing, patterns in checking out library materials, what courses they take (and outcomes they achieve). This data helps them steer students to courses that are aligned with their goals. It helps with student retention. Big data enables interesting insights and correlations such as students that have high library fines tend to perform worse on tests. Universities also correlate performance data with socioeconomic and email data, so they can learn what student characteristics predict the best performance at their schools, and they use this to guide their recruitment. They are also starting to be able to predict which students will drop out before graduating, which helps them give additional support [9]. Cost drivers [of education] are keys in big data adoption in the UK, which has developed the most comprehensive database of pupils (schoolchildren) in the world. It traces 600,000 pupils' performance from 3,000 elementary schools through career. It has ten years of data on pupils’ exams, tests, socioeconomic status, geography, transport, free meals, behavior issues and many others. It is a rich dataset from which the government can learn and improve schools. It can answer political questions. 
The government is also combining its data with health, crime and welfare datasets. It studies what students’ lives are like outside school, to try to develop a fuller picture of factors that affect performance. This can help challenge conventional thinking and guide policy. This initiative is teaching us many things. Socioeconomic status is not as important as we thought; school performance and responsiveness are very important. Schools can use data to change. For example, science, technology, engineering and math courses are far more important than we thought, even when students don’t intend to pursue STEM careers. Privacy is an issue with these databases, but the government believes that the advantages outweigh the pupils' compromised privacy [37]. Another traditional belief is that poor pupils do poorly and that schools need more money to increase performance. The data are showing that how the money is invested is more important than how much money is in the school’s budget. We are starting to be able to measure return on outcomes. The UK example is more complex, but it effectively illustrates how internal and external data can be mashed up to address complex problems such as school performance. It's an excellent example of big data.

3.3 Big Data Mining

In typical data mining systems, the mining procedures require computationally intensive computing units for data analysis and comparisons. A computing platform therefore needs efficient access to at least two types of resources: data and computing processors. For small-scale data mining tasks, a single desktop computer, which contains a hard disk and CPU processors, is sufficient to fulfill the data mining goals. Indeed, many data mining algorithms are designed for this type of problem setting. For medium-scale data mining tasks, data are typically large (and possibly distributed) and cannot fit into main memory.
Common solutions are to rely on parallel computing [43], [33] or collective mining [12] to sample and aggregate data from different sources and then use parallel programming (such as the Message Passing Interface) to carry out the mining process. For Big Data mining, because the data scale is far beyond the capacity of a single personal computer (PC), a typical Big Data processing framework relies on cluster computers with a high-performance computing platform, with a data mining task deployed by running parallel programming tools, such as MapReduce or Enterprise Control Language (ECL), on a large number of computing nodes (i.e., clusters). The role of the software component is to make sure that a single data mining task, such as finding the best match for a query in a database with billions of records, is split into many small tasks, each of which runs on one or multiple computing nodes. For example, as of this writing, the world's most powerful supercomputer, Titan, which is deployed at Oak Ridge National Laboratory in Tennessee, contains 18,688 nodes, each with a 16-core CPU. Such a Big Data system, which blends both hardware and software components, is hardly available without the support of key industrial stakeholders. In fact, for decades, companies have been making business decisions based on transactional data stored in relational databases [10]. Big Data mining offers opportunities to go beyond traditional relational databases and rely on less structured data: weblogs, social media, e-mail, sensors, and photographs that can be mined for useful information [1]. Major business intelligence companies, such as IBM, Oracle, Teradata, and so on, have all featured their own products to help customers acquire and organize these diverse data sources and coordinate them with customers’ existing data to find new insights and capitalize on hidden relationships.
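The splitting of a single best-match query into many small tasks, as described in this section, can be sketched in a few lines. The following is only an illustration of the map and reduce pattern under stated assumptions: a thread pool stands in for the computing nodes of a cluster, the word-overlap score is a toy similarity measure, and all function names and records are hypothetical. It is not the API of MapReduce, Hadoop, or ECL.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def score(query, record):
    """Toy similarity: number of words shared by query and record."""
    return len(set(query.lower().split()) & set(record.lower().split()))


def split(records, n_chunks):
    """Partition the records into roughly equal chunks, one per worker."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_best(query, chunk):
    """Map step: each worker returns its local best (score, record)."""
    return max((score(query, r), r) for r in chunk)


def reduce_best(a, b):
    """Reduce step: keep the better of two local results."""
    return a if a >= b else b


def best_match(query, records, n_workers=3):
    """Run the map step on every chunk in parallel, then reduce."""
    chunks = split(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        local = list(pool.map(lambda c: map_best(query, c), chunks))
    return reduce(reduce_best, local)


# Hypothetical catalogue of course titles.
records = [
    "intro to algebra",
    "linear algebra for engineers",
    "world history",
    "organic chemistry lab",
    "algebra and geometry basics",
    "modern poetry",
]
result = best_match("linear algebra basics", records)
```

Because each chunk is scored independently and only the small local results are combined, the same structure carries over to a real cluster, where the chunks would live on different nodes and only the (score, record) pairs would cross the network.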
4 Major Challenges in Big Educational Data Mining

4.1 Is education data big enough to call it big data?

Startups like Knewton [38] and Desire2Learn [10] have been founded on the concept of big data. Similar e-commerce startups appeared during the e-commerce boom of the early nineties, but history is a mute witness to the fate of some of them; a few have perished by now. However, startups founded on big data in an educational context should not face a similar fate, because their foundation rests on didactic unstructured data that is already present in information systems. Perhaps one of the reasons some of the ill-fated e-commerce startups failed was that their business model did not rest on the availability of a constant flow of data from which information could be mined. That is not the case here. All we need are specialized algorithms designed to work with educational datasets, because we already have the data. Companies such as Yahoo, Google, Dell, and HP, to name a few, have now ventured into open-source development of big data software such as the Apache Software Foundation's Hadoop, and facilitate collective learning through contests such as hack days or hackathons [25,9]. We also need to understand that there is a gap between the application of big data in commerce and in the education sector. While the former has seen various advances, for the latter we are still dependent on traditional data mining algorithms. The problem with using such algorithms is that they may not fit the dataset, which can cause the loss of valuable predictions that could otherwise have been obtained with data mining algorithms suited to the educational dataset.
Educational experts have identified various deployment and implementation barriers to harnessing the power of big data in education and learning analytics with general data mining algorithms; the most important are technical lacunae, institutional velocity, and legal and, quite often, ethical issues. For big data to be meaningful, it will require the seamless integration of specifically tailored algorithms that can harness the power of this raging beast and tame it into knowledge that is useful to both the learner and the educator [39].

4.2 BDM: Challenges in applying DM approaches on Big Data (from the educational perspective)

A conceptual view of the Big Data processing framework is depicted in Fig. 4. It includes three tiers, from the inside out, with considerations on data accessing and computing (Tier I), data privacy and domain knowledge (Tier II), and Big Data mining algorithms (Tier III).

Fig. 4. A conceptual view of the Big Data processing framework [4]

The challenges at Tier I focus on data accessing and actual computing procedures. Because Big Data are often stored at different locations and data volumes may continuously grow, an effective computing platform has to take distributed large-scale data storage into consideration for computing [4,11]. For example, typical data mining algorithms require all data to be loaded into main memory. This is becoming a clear technical barrier for Big Data, because moving data across different locations is expensive (e.g., subject to intensive network communication and other I/O costs), even if a very large main memory is available to hold all the data for computing.

The challenges at Tier II center on semantics and domain knowledge for different Big Data applications [40]. Such information can provide additional benefits to the mining process, but it can also add technical barriers to Big Data access (Tier I) and to the mining algorithms (Tier III).
For example, depending on the application domain, the data privacy and information sharing mechanisms between data producers and data consumers can differ significantly. Sharing sensor network data for applications like water quality monitoring may not be discouraged, whereas releasing and sharing mobile users’ location information is clearly not acceptable for the majority of applications, if not all [41]. In addition to these privacy issues, the application domain can also provide information that benefits or guides the design of Big Data mining algorithms. For example, in market basket transaction data, each transaction is considered independent, and the discovered knowledge is typically represented by highly correlated items, possibly with respect to different temporal and/or spatial restrictions. In a social network, on the other hand, users are linked and share dependency structures. The knowledge is then represented by user communities, leaders in each group, social influence models, and so on. Therefore, understanding semantics and application knowledge is important both for low-level data access and for high-level mining algorithm design [16].

At Tier III, the data mining challenges concentrate on algorithm designs that tackle the difficulties raised by Big Data volumes, distributed data distributions, and complex and dynamic data characteristics. The circle at Tier III contains three stages [4]. First, sparse, heterogeneous, uncertain, incomplete, and multi-source data are preprocessed by data fusion techniques. Second, complex and dynamic data are mined after preprocessing. Third, the global knowledge obtained by local learning and model fusion is tested, and relevant information is fed back to the preprocessing stage. The model and parameters are then adjusted according to the feedback.
In the whole process, information sharing is not only a guarantee of the smooth development of each stage, but also a purpose of Big Data processing [30].

4.3 EDM: Challenges in applying DM approaches on education data

Recent advances in information technology have seen the proliferation of software that can build a completely functional website, complete with a backend database system, in less than an hour. This has led to rampant growth of e-learning systems, most of them based on cloud technology. Most of these systems have incorporated recommendation features like those used by their business-oriented counterparts, and both are generating voluminous amounts of data. Online learning systems have offered educators, developers and researchers opportunities to create personalized learning systems, but these personalizations still use traditional data mining algorithms [7]. So what is the problem, one might ask? One problem is that most e-learning systems fail to account, from an educational point of view, for the fact that they are used by learners who each have an individual learning style. When a learner interacts with an LMS, he or she leaves behind a trail of breadcrumbs in the form of log text files, for example records of interactions within the LMS forum with other students or with the course facilitator [33]. It follows that, to mine this data, it is imperative to identify the correct dataset to use so that sound conclusions can be derived from it. Until now, there have been few instances where data mining methods [44-48] have been introduced within e-learning systems to facilitate learner progress. A further problem, from a developer’s point of view, is how to classify the individual learning style of a learner so as to provide a truly personalized learning environment.
Another challenge, as noted repeatedly in previous sections, is how to develop specific data mining algorithms [49-52] that cater to the learning analytics domain. What really matters at this point is to find methodologies that help clean educational datasets so that they can be processed further [23, 42].

5 Discussion and Conclusion

This analytical study has discussed the importance of education, the growth of educational data into big data, and the big data mining tools and techniques used to mine these vast amounts of data. Moreover, the challenges involved in big educational data mining and in the extraction of big educational data have been addressed from different educational data mining perspectives. Working with big data using data mining and analytics is rapidly becoming common in the commercial sector. Tools and techniques once confined to research laboratories are being adopted by forward-looking industries, most notably those serving end users through online systems [4,43]. Higher education institutions are applying learning analytics to improve the services they provide and to improve visible and measurable targets such as grades and retention. K-12 schools and school districts are starting to adopt such institution-level analyses for detecting areas for improvement, setting policies, and measuring results. Now, with advances in adaptive learning systems, possibilities exist to harness the power of feedback loops at the level of individual teachers and students [40]. Measuring and making visible students’ learning and assessment activities opens up the possibility for learners to develop skills in monitoring their own learning and to see directly how their effort contributes to their success. Teachers gain views into students’ performance that help them adapt their teaching or initiate interventions in the form of tutoring, tailored assignments, and the like.
Personalized adaptive learning systems enable educators to quickly see the effectiveness of their adaptations and interventions, providing feedback for continuous improvement. The practical application of open source data mining tools in an educational setting can help both researchers and developers compare distinct prototypes that share the same design functionality. The results obtained could then be integrated into the existing in-house educational frameworks used by institutions, so as to keep pace with the rapid adoption of blended learning environments. Open source tools for adaptive learning systems, commercial offerings, and an increased understanding of what data reveal are leading to fundamental shifts in teaching and learning systems. As content moves online and mobile devices for interacting with content enable teaching to be always on, educational data mining and learning analytics will enable learning to be always assessed. Educators at all levels will benefit from understanding the possibilities of the developments in the use of big data described herein. Beyond the challenges of this new field, introduced here as big educational data mining and concerned with big identified educational data, further studies should consider the importance of analyzing big educational data captured and extracted from large-scale data sets using multiple big data and data mining approaches.

Acknowledgments. This work is supported by University of Malaya High Impact Research Grant, vote no. UM.C/625/HIR/MOHE/SC/13/2, from the Ministry of Education Malaysia.

References

1. S. Sagiroglu and D. Sinanc, "Big data: A review," in Collaboration Technologies and Systems (CTS), 2013 International Conference on, 2013, pp. 42-47.
2. A. Peña-Ayala, "Educational data mining: A survey and a data mining-based analysis of recent works," Expert Systems with Applications, 2013.
3. G. Siemens and P. Long, "Penetrating the fog: Analytics in learning and education," Educause Review, vol. 46, pp. 30-32, 2011.
4. X. Wu, X. Zhu, G. Wu, and W. Ding, "Data mining with big data," 2012.
5. C. Bizer, P. Boncz, M. L. Brodie, and O. Erling, "The meaningful use of big data: four perspectives--four challenges," ACM SIGMOD Record, vol. 40, pp. 56-60, 2012.
6. A. Abraham, "Business intelligence from web usage mining," Journal of Information & Knowledge Management, vol. 2, pp. 375-390, 2003.
7. C. Romero and S. Ventura, "Educational data mining: A survey from 1995 to 2005," Expert Systems with Applications, vol. 33, pp. 135-146, 2007.
8. J. Gobert, M. Sao Pedro, R. Baker, E. Toto, and O. Montalvo, "Leveraging educational data mining for real time performance assessment of scientific inquiry skills within microworlds," Journal of Educational Data Mining (accepted), 2012.
9. S. LaValle, E. Lesser, R. Shockley, M. S. Hopkins, and N. Kruschwitz, "Big data, analytics and the path from insights to value," MIT Sloan Management Review, vol. 52, pp. 21-31, 2011.
10. J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, et al., "Big data: The next frontier for innovation, competition, and productivity," 2011.
11. O. Trelles, P. Prins, M. Snir, and R. C. Jansen, "Big data, but are we ready?," Nature Reviews Genetics, vol. 12, pp. 224-224, 2011.
12. G. Bathuriya and M. Sai Nandhinee, "Implementation of Big Data for Future Education Developement Data Mining Data Analytics."
13. D. Centola, "The spread of behavior in an online social network experiment," Science, vol. 329, pp. 1194-1197, 2010.
14. L. A. Kurgan and P. Musilek, "A survey of Knowledge Discovery and Data Mining process models," Knowledge Engineering Review, vol. 21, pp. 1-24, 2006.
15. C. Romero, S. Ventura, and E. García, "Data mining in course management systems: Moodle case study and tutorial," Computers & Education, vol. 51, pp. 368-384, 2008.
16. S.-H. Liao, P.-H. Chu, and P.-Y. Hsiao, "Data mining techniques and applications–A decade review from 2000 to 2011," Expert Systems with Applications, vol. 39, pp. 11303-11311, 2012.
17. L. Tsantis and J. Castellani, "Enhancing learning environments through solution-based knowledge discovery tools: Forecasting for self-perpetuating systemic reform," Journal of Special Education Technology, vol. 16, pp. 39-52, 2001.
18. C. Romero, S. Ventura, A. Zafra, and P. d. Bra, "Applying Web usage mining for personalizing hyperlinks in Web-based adaptive educational systems," Computers & Education, vol. 53, pp. 828-840, 2009.
19. C. Romero, S. Ventura, and P. De Bra, "Knowledge discovery with genetic programming for providing feedback to courseware authors," User Modeling and User-Adapted Interaction, vol. 14, pp. 425-464, 2004.
20. Y. Wang, "Web mining and knowledge discovery of usage patterns," CS 748T Project, 2000.
21. S. Cetintas, L. Si, Y. P. Xin, and C. Hord, "Automatic detection of off-task behaviors in intelligent tutoring systems with machine learning techniques," Learning Technologies, IEEE Transactions on, vol. 3, pp. 228-236, 2010.
22. I. H. Witten and E. Frank, Data Mining: Practical machine learning tools and techniques: Morgan Kaufmann, 2005.
23. C. Romero, S. Ventura, M. Pechenizkiy, and R. S. Baker, Handbook of educational data mining: Taylor & Francis US, 2011.
24. J. Han, M. Kamber, and J. Pei, Data mining: concepts and techniques: Morgan Kaufmann, 2006.
25. C. Romero and S. Ventura, "Educational data mining: a review of the state of the art," Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, vol. 40, pp. 601-618, 2010.
26. R. Baker and K. Yacef, "The state of educational data mining in 2009: A review and future visions," Journal of Educational Data Mining, vol. 1, pp. 3-17, 2009.
27. N. Sael, A. Marzak, and H. Behja, "Multilevel clustering and association rule mining for learners’ profiles analysis," 2013.
28. N. Anozie and B. W. Junker, "Predicting end-of-year accountability assessment scores from monthly student records in an online tutoring system," in Proceedings of the American Association for Artificial Intelligence Workshop on Educational Data Mining (AAAI-06), July 17, 2006, Boston, MA, 2006, pp. 1-6.
29. L. Razzaq, M. Feng, N. T. Heffernan, K. R. Koedinger, B. Junker, G. Nuzzo-Jones, et al., "A web-based authoring tool for intelligent tutors: blending assessment and instructional assistance," in Intelligent Educational Machines, ed: Springer, 2007, pp. 23-49.
30. A. Peña-Ayala and L. Cárdenas, "How Educational Data Mining Empowers State Policies to Reform Education: The Mexican Case Study," in Educational Data Mining, ed: Springer, 2014, pp. 65-101.
31. J. A. Lara, D. Lizcano, M. A. Martínez, J. Pazos, and T. Riera, "A System for Knowledge Discovery in E-Learning Environments within the European Higher Education Area - Application to student data from Open University of Madrid, UDIMA," Computers & Education, 2013.
32. M. J. Berry and G. Linoff, Data mining techniques: For marketing, sales, and customer support: John Wiley & Sons, Inc., 1997.
33. M.-S. Chen, J. S. Park, and P. S. Yu, "Data mining for path traversal patterns in a web environment," in Distributed Computing Systems, 1996., Proceedings of the 16th International Conference on, 1996, pp. 385-392.
34. J. Alcalá-Fdez, L. Sánchez, S. García, M. J. del Jesús, S. Ventura, J. Garrell, et al., "KEEL: a software tool to assess evolutionary algorithms for data mining problems," Soft Computing, vol. 13, pp. 307-318, 2009.
35. U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "The KDD process for extracting useful knowledge from volumes of data," Communications of the ACM, vol. 39, pp. 27-34, 1996.
36. G. Siemens and R. S. d. Baker, "Learning analytics and educational data mining: Towards communication and collaboration," in Proceedings of the 2nd International Conference on Learning Analytics and Knowledge, 2012, pp. 252-254.
37. A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke, "Privacy preserving mining of association rules," Information Systems, vol. 29, pp. 343-364, 2004.
38. D. Agrawal, S. Das, and A. El Abbadi, "Big data and cloud computing: current state and future opportunities," in Proceedings of the 14th International Conference on Extending Database Technology, 2011, pp. 530-533.
39. H. Chen, R. H. Chiang, and V. C. Storey, "Business Intelligence and Analytics: From Big Data to Big Impact," MIS Quarterly, vol. 36, pp. 1165-1188, 2012.
40. P. Zikopoulos, C. Eaton, D. DeRoos, T. Deutsch, and G. Lapis, "Understanding big data," New York et al: McGraw-Hill, 2012.
41. M. Bienkowski, M. Feng, and B. Means, "Enhancing teaching and learning through educational data mining and learning analytics: An issue brief," Washington, DC: SRI International, 2012.
42. R. Nisbet, J. Elder IV, and G. Miner, Handbook of statistical analysis and data mining applications: Access Online via Elsevier, 2009.
43. P. Guide, "Getting Started with Big Data," 2013.
44. H. Kalia, S. Dehuri, and A. Ghosh, "A Survey on Fuzzy Association Rule Mining," International Journal of Data Warehousing and Mining, vol. 9, no. 1, pp. 1-27, 2013.
45. F. Waas, R. Wrembel, T. Freudenreich, M. Thiele, C. Koncilia, and P. Furtado, "On-Demand ELT Architecture for Right-Time BI: Extending the Vision," International Journal of Data Warehousing and Mining, vol. 9, no. 2, pp. 21-38, 2013.
46. A. Abelló, J. Darmont, L. Etcheverry, M. Golfarelli, J.-N. Mazon, F. Naumann, T. Bach Pedersen, S. Rizzi, J. Trujillo, P. Vassiliadis, and G. Vossen, "Fusion Cubes: Towards Self-Service Business Intelligence," International Journal of Data Warehousing and Mining, vol. 9, no. 2, pp. 66-88, 2013.
47. P. Williams, C. Soares, and J. E. Gilbert, "A Clustering Rule Based Approach for Classification Problems," International Journal of Data Warehousing and Mining, vol. 8, no. 1, pp. 1-23, 2012.
48. R. V. Priya and A. Vadivel, "User Behaviour Pattern Mining from Weblog," International Journal of Data Warehousing and Mining, vol. 8, no. 2, pp. 1-22, 2012.
49. T. Kwok, K. A. Smith, S. Lozano, and D. Taniar, "Parallel Fuzzy c-Means Clustering for Large Data Sets," in Proceedings of the 8th International Euro-Par Conference (Euro-Par 2002), Lecture Notes in Computer Science, vol. 2400, Springer, 2002, pp. 365-374.
50. O. Daly and D. Taniar, "Exception Rules Mining Based on Negative Association Rules," in Proceedings of the International Conference on Computational Science and Its Applications (ICCSA 2004), Part IV, Lecture Notes in Computer Science, vol. 3046, Springer, 2004, pp. 543-552.
51. D. Taniar, W. Rahayu, V. C. S. Lee, and O. Daly, "Exception rules in association rule mining," Applied Mathematics and Computation, vol. 205, no. 2, pp. 735-750, 2008.
52. M. Z. Ashrafi, D. Taniar, and K. A. Smith, "Redundant association rules reduction techniques," International Journal of Business Intelligence and Data Mining, vol. 2, no. 1, pp. 29-63, 2007.