Article

Analyzing Consumer Reviews with Text Mining Approach: A Case Study on Samsung Galaxy S3

Paradigm 20(1) 56–68
© 2016 IMT
SAGE Publications
sagepub.in/home.nav
DOI: 10.1177/0971890716637700
http://par.sagepub.com

Subhasis Dasgupta1
Kalyan Sengupta2

Abstract
In the era of the Internet, it is not necessary to run an expensive market survey to explore what users are saying about a product and to find out whether any modifications to the product are required. There are several sites where users from different parts of the world post their comments after using a product. These comments can be analyzed scientifically through text mining to understand how the users have used different words in relation to the said product. The current study focuses on finding the word associations for the Samsung Galaxy S3 (a high-end smartphone). It also deals with how a few keywords are related to other words through correlation analysis.

Keywords
Text mining, word cloud, document clustering, association rule mining

Introduction
In a hypercompetitive market, it is essential to innovate products and to create a positive image of the product as well as the brand in the minds of customers. Product development is a tedious process. Once the product is developed and marketed, it is important to receive feedback on the product from the market. Such feedback is, at times, crucial to prevent the premature death of the product in the market. In the case of electronic goods in particular, the life cycle of individual products is quite short in comparison to other markets. Hence, it becomes even more important to understand the acceptability of a newly launched product in this market so that corrective measures can be taken before it is too late. That is why gathering market information is an indispensable activity for any company that wants to remain competitive.
Information can flow through different channels, and the Internet has become a prominent channel for both gathering and distributing information. Social media sites, blogging sites and product review sites are the prime sources of customer feedback for global companies about their products. The data and other information available on these sites can give companies meaningful insight into the popularity of their products and their acceptance by target customers. In the current study, the reviews of only one product, the Samsung Galaxy S3, have been chosen, and text mining approaches are used to identify what customers have mostly said about this product and how different words are used in relation to other words.

1 PhD scholar, School of Management, RK University and Assistant Professor, Praxis Business School, Kolkata, West Bengal, India.
2 Professor & Head of Computer Dept, Indian Institute of Social Welfare and Business Management, Kolkata, West Bengal, India.
Corresponding author: Subhasis Dasgupta, Assistant Professor, Praxis Business School, Kolkata 700104, West Bengal, India. E-mail: subhasis@praxis.ac.in

Literature Review
The Internet is a collection of a huge amount of data that is expanding every day. Data can be broadly classified into two types, that is, quantitative data and qualitative data. Both of these types can further be divided into structured data and unstructured data. Lean, Wang and Lai (2005) quoted a survey by the Delphi Group which said that around 80 per cent of data are stored in an unstructured manner. That is why data mining techniques are gaining importance as a way to recover the critical information underlying these huge masses of data.
Specifically, text mining is gaining importance as a way to analyze unstructured textual data and retrieve critical information about customer feedback. Roughly a decade ago, the structured questionnaire was the main tool for collecting customer feedback. The structured questionnaire is no doubt a very strong tool in market research, but gathering quality responses has always been a challenge. Cost of survey, respondents' fatigue (Hess, Hensher & Daly, 2012) and reliability of responses (Ferber, 2012) play a critical role in getting usable data for analysis. However, when a reviewer posts comments spontaneously on a review portal, respondent fatigue is absent. Reliability of responses cannot be guaranteed in such reviews, because paid reviews are also possible, in which intentionally good or bad reviews are posted. To reduce these tendencies, many review sites apply their own checks so that their sites are populated mostly with genuine reviews. Hence, if a larger number of reviews is collected, the effect of biased reviews will be minimized. Collecting reviews from such websites is much cheaper than collecting responses through structured questionnaire surveys. That is why efforts must be made to capture first-hand information from such reviews, so that a more focused analysis can be done with a structured questionnaire at a later stage, based on the information retrieved from the review analysis. In this regard, researchers have provided empirical evidence that online reviews have a significant impact on businesses (Dellarocas, Zhang & Awad, 2007; Eliashberg & Shugan, 1997; Godes & Mayzlin, 2004; Netzer, Feldman, Goldenberg & Fresko, 2012), and Glazer (1991) noted that market knowledge is an asset for any business. Since reviews are collections of text, and hence unstructured, text mining approaches must be used to extract meaningful information from them.
Consumer reviews are a good source of market response data, and text mining on these reviews can provide significant insight into how consumers view the product. That is why many researchers have used text mining methods in different business contexts to extract meaningful insights. For example, Leong, Ewing and Pitt (2004) used a text mining approach to understand how competitors make different promotional communications. In a separate study, Lau, Lee, Ho and Lam (2004) combined web and text mining approaches to analyze how potential customers can be acquired from the vast amount of data available on the Internet. Na, Thet and Khoo (2010) performed sentiment analysis using the text mining approach to understand the differences in sentiment about popular movies across genres. Chou, Sinha and Zhao (2008) used the same text mining approach to detect Internet abuse. Although text mining is still being developed to realize the full potential of analyzing unstructured textual data, it is gaining more and more importance in business. The current study focuses on how to use simple text mining techniques to extract meaningful insights from consumer reviews of a product like the Galaxy S3.

Background
Text analysis is a relatively new area of research. One of the most important tasks in text mining is to convert unstructured text into a structured form. That is why unstructured texts are first converted to a collection of words in a vector space model (VSM; Salton & McGill, 1983). The terms extracted into the VSM can be individual words, selected index terms or stemmed words extracted through different stemming algorithms (Baeza-Yates & Ribeiro-Neto, 1999).
One of the issues in forming such a vector model is that the order in which words are combined is ignored, which means that the model cannot differentiate between 'soldiers fire bullets' and 'bullets fire soldiers'. That is why the VSM is also called a 'bag of words': it contains only words and cannot relate or differentiate combinations of words as a human can. Hence, each word needs to be weighted in some way to reflect its importance in the document. There are a few ways this can be done: words can be weighted by their term frequency (tf), by their binary occurrence, or by their tf–idf value. Tf–idf stands for term frequency–inverse document frequency and is considered one of the most useful techniques for weighting individual words. The idf factor reduces the importance of words that occur in almost all the documents and gives a high value to words that occur in a limited number of documents. Hence, the product of term frequency and inverse document frequency tends to give a better weight to each word. Mathematically, tf–idf is given by

$$\text{TF-IDF} = TF_{ij} \times \log \frac{N}{n_i} \tag{1}$$

where $TF_{ij}$ is the term frequency of word $i$ in document $j$, $n_i$ is the number of documents containing the word $i$ and $N$ is the total number of documents. Longer documents introduce bias if simple term frequencies are used. That is why the term frequencies are normalized by dividing the frequency of occurrence of word $i$ in document $j$ by the frequency of the word $d$ that occurs most often in the same document. Mathematically, this is given by

$$TF_{ij} = \frac{f(i, j)}{\max \{ f(d, j) : d \in j \}} \tag{2}$$

where $f(i, j)$ is the simple frequency of occurrence of word $i$ in document $j$ and the denominator is the frequency of occurrence of the word $d$ that has the maximum frequency of occurrence in the same document.
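Equations (1) and (2) can be illustrated with a minimal Python sketch. The function name `tf_idf` and the toy reviews are hypothetical, and the natural logarithm is an assumption, since the text does not state the base of the logarithm.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Weight each token by the Eq. (2) normalised TF times the Eq. (1) log(N / n_i)."""
    n_docs = len(documents)
    # n_i: number of documents that contain word i
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        max_count = max(counts.values())  # frequency of the most common word d in document j
        weights.append({
            word: (count / max_count) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Hypothetical toy reviews, already tokenised and lower-cased.
toy_docs = [
    ["battery", "life", "battery", "weak"],
    ["screen", "big", "battery"],
    ["screen", "awesome"],
]
toy_weights = tf_idf(toy_docs)
```

In this sketch a word that occurs in every document receives a weight of zero, because log(N / n_i) is then log(1), which matches the stated purpose of the idf factor.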
The TF–IDF value can be used for many statistical and heuristic analyses. Later in this study, the same TF–IDF values are used for correlation analysis of words. Correlation analysis is a good way of finding out how different words are related, and association rule mining is another. Correlation analysis is statistical, whereas association rule mining is heuristic in nature.

Association rule mining aims at finding the strength of association between two objects or entities that may occur together. It was initially developed for analyzing how one product is bought along with other products in retail shops. That is why it is also called market basket analysis: products are put into baskets, and knowing which products are bought together makes a lot of business sense in retail. The same approach can easily be applied to finding word associations, because words are never used randomly in a text. Association rule mining is a heuristic technique that tries to find the relation X → Y on the basis of the frequency of occurrence of X and Y. The problem of mining association rules was introduced by Agrawal, Imielinski and Swami (1993), and many modifications were made to it later. They introduced the concepts of support and confidence. The minimum support is a threshold frequency: only items whose frequencies of occurrence are above it are allowed to form a rule. The support of the rule X → Y is the percentage of transactions that contain both X and Y (Srikant & Agrawal, 1997). The confidence of the rule X → Y is the percentage of transactions containing Y among the transactions that contain X (Lin, Alvarez & Ruiz, 2002). Association rules are generated from frequent item sets with a given minimum support value, and Frequent Set Counting (FSC) is the most time-consuming activity before the rules are generated.
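The definitions of support and confidence can be made concrete with a short Python sketch. The helper names and the toy reviews are hypothetical; each review is treated as one transaction, represented as a set of words.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, premise, conclusion):
    """Fraction of the transactions containing the premise that also contain the conclusion."""
    both = set(premise) | set(conclusion)
    return support(transactions, both) / support(transactions, premise)


# Hypothetical toy reviews, each reduced to its set of distinct words.
toy_reviews = [
    {"battery", "life", "weak"},
    {"battery", "life", "good"},
    {"screen", "big"},
    {"battery", "screen"},
]
rule_support = support(toy_reviews, {"battery", "life"})          # 2 of 4 reviews
rule_confidence = confidence(toy_reviews, {"battery"}, {"life"})  # 2 of the 3 "battery" reviews
```

For the rule battery → life, two of the four toy reviews contain both words, and two of the three reviews containing "battery" also contain "life".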
There are quite a few algorithms available for FSC. The Apriori algorithm (Agrawal & Srikant, 1994) is one of the earliest and most famous algorithms in this context. Apriori works iteratively to search for frequent item sets. At each iteration k, the algorithm forms a set Fk that contains all the frequent item sets of k items, in other words, the frequent k-itemsets. However, to generate Fk, a candidate set Ck is generated first, which acts as a superset of Fk. Generating the candidate set Ck is computationally intensive because, for each k, the algorithm computes the support of every candidate by scanning the entire database. Hence, the computational requirements depend on both the size of the candidate set Ck and the size of the database. That is why, in text mining, when the size of the documents as well as the number of documents increases, the Apriori algorithm takes too much time to produce results. Moreover, physical memory requirements also increase because of the large candidate sets Ck and frequent sets Fk generated at each iteration. A different approach that deals with this shortcoming is the generation of a Frequent Pattern Tree (FP-Tree; Han, Pei & Yin, 2000). The Frequent Pattern Growth (or FP-Growth) algorithm does not produce candidate sets as the popular Apriori algorithm does. FP-Growth builds a compact data structure called the FP-Tree in two passes through the data set. Once the tree is built, frequent item sets are extracted directly from it. That is why FP-Growth is considered to be one of the most popular and fastest algorithms for extracting frequent sets (Christian, 2005). Hence, the FP-Growth algorithm is more applicable to association rule mining on large data sets than Apriori-type candidate-generating algorithms.
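The level-wise candidate generation that makes Apriori expensive can be seen in a minimal Python sketch. It illustrates the Apriori idea only, not FP-Growth and not the implementation used in the study; the function name `apriori` and the toy data are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset: support} for every itemset that meets min_support."""
    n = len(transactions)

    def frequent_only(candidates):
        # One full scan of all transactions per level: the costly step.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    # F_1: frequent single items
    level = frequent_only({frozenset([item]) for t in transactions for item in t})
    frequent = dict(level)
    k = 2
    while level:
        # C_k: join frequent (k-1)-itemsets, then prune any candidate
        # that has an infrequent (k-1)-subset.
        prev = set(level)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = frequent_only(candidates)  # F_k
        frequent.update(level)
        k += 1
    return frequent


# Hypothetical toy reviews, each reduced to its set of distinct words.
toy_transactions = [
    {"battery", "life", "weak"},
    {"battery", "life", "good"},
    {"screen", "big"},
    {"battery", "screen"},
]
toy_frequent = apriori(toy_transactions, min_support=0.5)
```

Every pass of the `while` loop rescans all transactions once per candidate, which is why the running time grows with both the number of candidates and the number of documents.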
Readers can find the FP-Growth algorithm in the research article by Han, Pei and Yin (2000), which was presented at the ACM SIGMOD International Conference on Management of Data. Since text mining involves the generation of a large sparse data matrix, the FP-Growth algorithm is considered more appropriate for association rule mining than Apriori-like algorithms.

Methodology
To collect review data, gsmarena.com was visited, where people from all around the world post their remarks about mobile phones and other wireless gadgets. Reviews of the Samsung Galaxy S3 were taken from this site. Apart from gsmarena.com, other sites were also visited to collect reviews of the Galaxy S3: techradar.com, review.cnet.com and techcrunch.com. A total of 201 reviews were collected from these sites for analysis through the text mining approach.

To apply data mining techniques to unstructured texts, the texts must be converted to a structured form. This is achieved through tokenization. Tokenization breaks the linguistic pattern, because the entire text is converted to a bag of words. Moreover, texts contain many words that occur very frequently but carry no significant information, such as 'a', 'an', 'is' and 'are'. These words are called stop words. Hence, after the texts were tokenized, stop words were removed. It is also considered good practice to transform all words to lower case so that the same word in a different case is never treated as a separate token. Hence, all the texts were converted to lower case. Afterward, each word in each document was weighted with its TF–IDF value. TF–IDF gives a numerical value for the importance of a word in a document. This importance value is metric in nature and can be used for other statistical as well as heuristic analyses.
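The preprocessing steps described above (tokenization, stop-word removal and case conversion) can be sketched in a few lines of Python. The stop-word list and the regular-expression tokenizer are illustrative assumptions; the study does not report which stop-word list or tokenizer settings were used.

```python
import re

# A deliberately tiny stop-word list for illustration only.
STOPWORDS = {"a", "an", "is", "are", "the", "and", "but", "of", "to"}


def preprocess(text):
    """Lower-case the text, split it into letter/digit runs and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


sample = "The Screen is BIG but the battery is a disappointment."
sample_tokens = preprocess(sample)
# sample_tokens is ['screen', 'big', 'battery', 'disappointment']
```

The sketch lower-cases before removing stop words so that 'The' and 'the' are treated alike. A tokenizer this simple would also split hyphenated terms such as 'Wi-Fi' into two tokens, so a real pipeline may need a more careful pattern.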
In this way, a document term matrix was created with TF–IDF scores for each word in the matrix. After the document term matrix was formed, the documents were clustered using the X-means algorithm (Pelleg & Moore, 2000), which estimates the optimum number of clusters; it produced two clusters. The entire example set was split on the basis of cluster membership so that a better analysis could be done for each cluster. Experimental evidence shows that using only the top 10 per cent of the most frequently occurring words does not reduce the performance of a text classifier (Feldman & Sanger, 2006). However, in this analysis, removal of tokens was not considered, because the total number of words extracted from the 201 reviews was only 2,507. Throughout this analysis, RapidMiner was used as the analytics software.

Analysis and Findings

Cluster Analysis of Words
Document clustering is important in analyzing texts. In this analysis, X-means clustering was employed. It produced two clusters, which were identified as emotional feedback and technical feedback. Some of the words whose centroids were found to be well separated are given in Table 1. Clearly, Cluster_1 contains documents that talk more about the technical aspects of the Samsung Galaxy S3, whereas Cluster_0 contains documents in which customers have given their emotional feedback. Another interesting feature of Cluster_1 is that both Apple and iPhone appear there, which suggests that at least a few documents compare the features of the Galaxy S3 with those of iPhones. Cluster_1 does contain a few negative words such as drawback, sad, ugly and insecurity; on identifying the respective documents, it was found that many people who had used iPhones and iPads described the Samsung Galaxy S3 as ugly in terms of looks. In Cluster_0, by contrast, people have given mostly positive feedback about the Samsung Galaxy S3.
However, at least three documents were found in which people bade goodbye to Samsung because of their prior experiences with the brand.

Table 1. Selected Important Words in Two Clusters

Cluster_0       Cluster_1
Interesting     Android
Excellent       GB
Looks           Processor
Love            Battery
Innovative      Core
Thank           Devices
Amazing         OS
Beat            RAM
Goodbye         Memory
Beautiful       Card
Hope            Apple
Boring          iPhone, iPad
Worried         Browser
Listen          Movie
Wi-Fi           iOS
Social          Applications
Performance     Game
Awesome         Display
Uglier          Drawback

Source: Authors' own.

Word Cloud Analysis
Word counts can also be used to identify what people are talking about. Saiz and Simonsohn (2008) gave very strong evidence that the frequency of occurrence of words on the web represents the true likelihood of the phenomenon. Hence, word counts were also taken into consideration in the output. Since all the texts were reviews of the Samsung Galaxy S3, the words Samsung, S and Galaxy occurred very frequently.

An interesting point in the word counts was that, along with the Galaxy S3, people had talked about iPhone, HTC and, to a lesser extent, the Sony Xperia™ brand. Reviewers also talked a great deal about the screen and the battery. People have also talked about the processor and the type of core, more specifically about quad core. The Samsung Galaxy S3 has a plastic back, and people have talked about that as well. When the documents containing the word plastic were identified, it was found that people were dissatisfied with the plastic back of the Samsung Galaxy S3. People also found the phone a little too big to hold comfortably in the hand. However, they were satisfied with the quality of the display. When people talked about camera, megapixel (MP) and gaming, they also talked about the Sony Xperia™ brand and its quality. Regarding video and overall performance, people have mostly given positive feedback about the product.
The selected important words and their frequency counts are given in Table 2.

Table 2. Word Counts

Words      Freq.   Words      Freq.   Words      Freq.   Words       Freq.   Words         Freq.
S          414     Design     49      Apps       28      Card        21      Memory        18
Samsung    165     Looks      45      MP         28      Hand        21      Size          18
iPhone     127     Apple      43      Big        27      Mobile      21      Ugly          18
T          114     m          43      Display    27      Processor   21      Using         18
Screen     87      GB         39      RAM        27      Quad        21      Market        17
Galaxy     85      Features   38      Love       26      Bad         20      Review        17
Battery    82      Great      38      Make       24      Plastic     20      Game          16
X          71      Use        37      Want       24      Play        20      Xperia        16
HTC        67      Core       34      Video      23      SIII        20      Price         15
Good       61      Quality    33      Device     22      AMOLED      18      Storage       15
Android    58      Think      31      Feature    22      Devices     18      UI            13
Camera     56      Buy        29      Feels      22      Happy       18      Performance   12

Source: Authors' own.

Word Association Analysis
A different way of representing word association is through association rule mining. Association rule mining, in this case, suggests how different words have occurred in relation to other words and their probability of occurrence. Association rule mining is particularly important in market basket analysis, but in text analysis not all rules are important. Association rules only show how words appear in the documents and how different words have occurred in relation to the occurrence of other word(s) in the entire document. That is why a large number of rules are generated, of which only a few bear any meaningful insight. In this analysis, over 9 lakh (900,000) such rules were generated. Hence, only a few of the important rules are shown in Table 3.

Table 3. Selected Association Rules Along with Support, Confidence and Lift Values

Premises                                   Conclusion                                 Support    Confidence   Lift
S, iPhone, design                          Phone, Android, Samsung, HTC and Apple     0.083333   0.833333     10
S, Samsung, iPhone                         Phone, Apple and looks                     0.083333   0.555555     6.6667
Android, design                            Phone, iPhone, HTC and Apple               0.083333   0.555555     6.6667
S, Phone, Android, T, iPhone and think     Battery, Apple and use                     0.083333   1            10
Phone, screen and feels                    Good, camera and display                   0.083333   0.833333     10
X, Looks                                   Good, look                                 0.083333   0.833333     10
Cheap                                      Android, Samsung, Galaxy and display       0.083333   0.833333     10
Cheap                                      Android, Samsung, screen and display       0.083333   0.833333     10
Cheap                                      Android, Samsung, quality and display      0.083333   0.833333     10
Screen, quality and cheap                  Android, Galaxy and display                0.083333   1            12
Screen, camera, quality and hand           Galaxy, build                              0.083333   1            12
S, Android, Samsung, screen and display    Galaxy, cheap                              0.083333   1            12

Source: Authors' own.

In association rule mining, one of the important parameters is the lift value of an individual rule. A higher lift value indicates a stronger rule, and lift also helps in pruning redundant lower-strength rules. While developing the rules, the minimum support was intentionally kept low, at the 6 per cent level, so that even less frequently occurring but important associations could be found.
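The lift values can be related to the confidence values through the standard definition of lift, namely the confidence of X → Y divided by the support of the conclusion Y. The short Python sketch below works through that arithmetic for the confidence and lift values reported in Table 3. It assumes that the reported lift follows this standard definition; the function names are hypothetical.

```python
def lift(confidence, conclusion_support):
    """Standard lift of X -> Y: confidence(X -> Y) / support(Y)."""
    return confidence / conclusion_support


def implied_conclusion_support(confidence, lift_value):
    """Rearranged definition: support(Y) = confidence(X -> Y) / lift(X -> Y)."""
    return confidence / lift_value


# (confidence, lift) pairs as reported in Table 3
reported = [(0.833333, 10), (0.555555, 6.6667), (1, 12), (1, 10)]
implied = [implied_conclusion_support(c, l) for c, l in reported]
```

Under this assumption, the first three pairs imply that the conclusion appears in roughly 8.3 per cent of the documents, and the last pair implies 10 per cent. A lift of 10 or 12 therefore means that the conclusion is ten to twelve times more likely in documents containing the premise than in the collection as a whole.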
As seen in Table 3, the rules could identify the association between brands (Samsung, Apple) and products (Galaxy S, iPhone), but they could not find the association of individual brands with their respective products. This is a weakness of association rules: they often fail to identify important word associations. However, in this study, the high lift values suggest that at least some people have associated the word cheap with Samsung, with the Galaxy S, with the screen or perhaps with the display of this product. After going through the respective documents, it was found that people who were loyal to the iPhone had attributed the word cheap both to the product as a whole and to the display of the Galaxy S3. The curved edges of the Galaxy S3 were not liked by many people, and loyal iPhone customers had also considered the entire design quite ugly.

Word Correlation Analysis
Correlation analysis helps in finding words whose occurrences are correlated with each other. In this analysis, not all the words were taken into consideration. Words were filtered on the basis of variance. Since the data table generated through word tokenization was a sparse table with many cells containing zero, only those words whose variance was above 0.03 were taken into consideration. This threshold was chosen arbitrarily so that around 10 per cent of the important words (based on TF–IDF value) were kept in the analysis. With this filter, 271 words were extracted from Cluster_1 and 300 words from Cluster_0. Hence, two sparse correlation matrices, of dimensions 271 × 271 and 300 × 300, were obtained from this analysis. Both matrices were filtered further to retain only those words with a correlation coefficient above 0.4. Using a simple VBA program in MS Excel, a separate table was prepared showing the extracted words and the words correlated with them.
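The two filtering steps (variance above 0.03, then correlation above 0.4) can be sketched in plain Python. This is an illustration rather than the VBA program used in the study; the function names and toy TF–IDF columns are hypothetical, and the use of the sample variance and the Pearson coefficient is an assumption.

```python
import math


def variance(xs):
    """Sample variance of a list of scores."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlated_words(columns, min_variance=0.03, min_corr=0.4):
    """columns maps each word to its TF-IDF scores, one score per document."""
    # Step 1: keep only words whose scores vary enough across documents.
    kept = {w: col for w, col in columns.items() if variance(col) > min_variance}
    # Step 2: for each kept word, list the other kept words correlated above the cut-off.
    result = {}
    for word in kept:
        for other in kept:
            if word != other:
                r = pearson(kept[word], kept[other])
                if r > min_corr:
                    result.setdefault(word, []).append((other, round(r, 4)))
    return result


# Hypothetical TF-IDF columns for four documents.
toy_columns = {
    "battery": [0.9, 0.0, 0.8, 0.0],
    "life": [0.7, 0.0, 0.9, 0.1],
    "screen": [0.0, 0.8, 0.0, 0.9],
    "the": [0.01, 0.01, 0.02, 0.01],
}
toy_table = correlated_words(toy_columns)
```

In the toy data, 'the' is dropped by the variance filter, 'battery' and 'life' are reported as correlated with each other, and 'screen' has no entry because its correlations with the other kept words are negative.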
A few important correlations are shown in Appendix A. The numbers in parentheses denote the correlation coefficients of those words with the main words. The analysis was done on both clusters to see how various words are related to each other. Documents that belong to Cluster_0 are feedback that is more emotional in nature, whereas documents belonging to Cluster_1 talk about the technical aspects. Another set of correlated words, important from the business point of view, can be found in Appendix B. For example, reviewers in Cluster_0 have said that the battery backup of the Galaxy S3 is horrible, but most of the reviewers have attached the word awesome to the performance of the CPU. The correlation analysis of the words of Cluster_1 gives a few interesting outcomes. First, it showed that the word account is correlated with the word Google and that the word browser is correlated with chrome (i.e., Google Chrome, a browser developed by Google). MB stands for megabyte, a unit of memory; the analysis also found it to be correlated with RAM. While talking about the CPU of the Galaxy S3, people have used the word awesome very frequently (suggested by a high correlation coefficient of 0.8936). When they talked about the camera, they also talked about its quality and its specification in megapixels (MP), and compared it with the Sony Xperia™. In Cluster_0, there are definitely a few reviewers who did not like the ergonomic aspect of the Galaxy S3 and related the word sucks to it. Moreover, Apple did come out with a product called fanboy, which is correctly captured by this analysis. However, there was not much discussion of that product, and hence no comparison could be made.

Managerial Implications and Conclusion
Text mining cannot uncover the meaning of entire texts the way a human understands it.
But it gives an indication of what the text is trying to say and which texts need to be read for more specific information. Text mining reduces human effort by identifying the documents that are relevant to the subject at hand. Hence, not all 201 reviews had to be read thoroughly to understand what customers are saying about the Samsung Galaxy S3. A few things have definitely come out of this study. First, the product has been greatly appreciated by most of the reviewers. However, they did not like the plastic back cover, and even though most people liked the performance of the CPU and the size of the memory, they chose the Sony Xperia™ as the point of comparison for gaming experience and camera quality. Quite a few respondents did not like the battery life and backup, and a few of them compared it with Nokia. An interesting finding was that people who were more loyal to the iPhone called the design ugly in comparison to the iPhone, and that holding the phone was an issue because of its bigger screen size. Hence, a few have remarked that the phone is not very good ergonomically. The study did bring out some critical customer feedback using simple statistical and heuristic analysis. This information can be used for product upgrades or design modifications. One critical limitation faced during this analysis was the shortage of physical memory, because of which a larger set of reviews could not be analyzed. Nevertheless, this study shows that simple text mining techniques can uncover important and critical information effectively, which in many cases can reduce the cost of running expensive marketing research.
Appendix A

Table A1. Words which Are Correlated with the Main Word

Sr. No. and Main Word:
1 Account; 2 Browser; 3 Android; 4 g; 5 Galaxy; 6 Hand; 7 HTC; 8 MB; 9 Facebook; 10 Screen; 11 Battery; 12 App; 13 Camera; 14 Curved; 15 RAM; 16 Awesome; 17 Feel; 18 Wi-Fi; 19 Camera; 20 Performance; 21 Big; 22 Network

Correlated Words, Cluster_1 (entries in source order; the row alignment with the main words is not recoverable from the source layout):
Compared (0.6129); Beat (0.5072); Quality (0.4173); (0.48); Direct (0.8016); Angry (0.4786); CPU (0.8936); GB (0.4357); im (0.432); (0.4167); Card (0.4019); Life (0.454); Big (0.4795); Twitter (0.5666); RAM (0.6403); x (0.5697); Feels (0.5309); Love (0.5555); Speed (0.4895); OS (0.5545); Chrome (0.8007); Google (0.6871)
Reception (0.5248); Feature (0.4403); Sammy (0.4895); MP (0.621); Laptops (0.906); Insulted (0.5472); Rest (0.694); MB (0.6403); Looks (0.4236); MP (0.529); SD card (0.7406); Signal (0.5248); Innovative (0.6019); Slow (0.4821); Quality (0.4198); Works (0.7446); Speak (0.5529); sg (0.454)
Speed (0.4164); Interesting (0.5114); Uglier (0.5344); Xperia (0.453); Weak (0.651); Screen (0.4795)

Notes: (i) Numbers in the parentheses show the correlation of the word with the main word. For example, the word 'weak' in Sr. No. 22 has a correlation coefficient of 0.651 with the word 'network'. (ii) sg: Samsung Galaxy; im: internet messaging.

Appendix B

Table B1. Words which Are Correlated with the Main Word

Main Word:
Disappointed; Gmail; Horrible; Awesome; Beijing; Bigger; Camera; CPU; iPhones; Account; Apple; Congratulate; Ergonomic; HTC

Correlated Words, Cluster_0 (entries in source order; the row alignment with the main words is not recoverable from the source layout):
Display (0.5309); Account (0.5203); Battery (0.5453); Big (0.4027); Fake (0.7834); Grips (0.5244); (0.48); Awesome (0.9999); Boring (0.642); Contact (0.5352); Bring (0.6764); Interesting (0.4138); Impressive (0.692); x (0.646)
Lets (0.7046); Life (0.5738); Carefully (0.4167); Gmail (0.5203); Finally (1); Big (0.4027); MP (0.621); Puzzled (0.5244); Goods (1); CPU (0.9999); Life (0.4602); Contact (0.8752); Pretty (0.5228); Making (0.9563)
Fanboy (0.7371); Google (0.8578); Prized (1); Rest (0.706); Quality (0.4198); Shovel (0.5244); Nasty (0.911); Rest (0.5168); Simple (1); Laughing (0.4167); Settings (0.4509); Rise (1); sg (0.5616); Xperia (0.453); Width (0.8099); Top (0.4745); sg (0.5616)
Rest (0.706)
Samples (1); Samsung (1); Listen (0.4162); Sign (0.473); Sucks (0.7816); Wait (0.4478)

Note: sg: Samsung Galaxy; Lets: allows.

Acknowledgement
This is academic research, and all the data were collected from openly available consumer reviews at various sites. No private or confidential data were used in any form to complete the research.

References
Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large database. Paper presented at the ACM SIGMOD Conference on Management of Data, Washington, DC.
Agrawal, R., & Srikant, R. (1994). Fast algorithm for mining association rule in large database. Paper presented at the 20th VLDB Conference, Chile.
Baeza-Yates, R. A., & Ribeiro-Neto, B. (1999). Modern information retrieval. New York: Addison-Wesley.
Chou, C. H., Sinha, A. P., & Zhao, H. (2008). A text mining approach to Internet abuse detection. Information Systems and e-Business Management, 6(4), 419–439.
Christian, B. (2005). An implementation of the FP-growth algorithm. Paper presented at the 1st International Workshop on Open Source Data Mining: Frequent Pattern Mining Implementations, New York, USA.
Dellarocas, C., Zhang, X. M., & Awad, N. (2007). Exploring the value of online product reviews in forecasting sales: The case of motion pictures. Journal of Interactive Marketing, 21(4), 23–45.
Eliashberg, J., & Shugan, S. M. (1997). Film critics: Influencers or predictors?
Journal of Marketing, 61(April), 68–78.
Feldman, R., & Sanger, J. (2006). The text mining handbook: Advanced approaches in analyzing unstructured data. Cambridge: Cambridge University Press.
Ferber, R. (2012). On the reliability of responses secured in sample surveys. Journal of the American Statistical Association, 50(271), 788–810.
Glazer, R. (1991). Marketing in an information-intensive environment: Strategic implications of knowledge as an asset. Journal of Marketing, 55(4), 1–19.
Godes, D., & Mayzlin, D. (2004). Using online conversations to study word-of-mouth communication. Marketing Science, 23(4), 545–560.
Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. Paper presented at the ACM SIGMOD international conference on Management of data, New York, USA.
Hess, S., Hensher, D. A., & Daly, A. (2012). Not bored yet – Revisiting respondent fatigue in stated choice experiments. Transportation Research Part A: Policy and Practice, 46(3), 626–644.
Lau, K., Lee, K., Ho, Y., & Lam, P. (2004). Mining the web for business intelligence: Homepage analysis in the internet era. Journal of Database Marketing and Customer Strategy Management, 12(1), 32–54.
Lean, Y., Wang, S., & Lai, K. K. (2005). A rough-set-refined text mining approach for crude oil market tendency forecasting. International Journal of Knowledge and Systems Sciences, 2(1), 33–46.
Leong, E. K. F., Ewing, M. T., & Pitt, L. F. (2004). Analysing competitors' online persuasive themes with text mining. Marketing Intelligence and Planning, 22(2/3), 187–200.
Lin, W., Alvarez, S. A., & Ruiz, C. (2002). Efficient adaptive-support association rule mining for recommender system. Data Mining and Knowledge Discovery, 6(1), 83–105.
Na, J. C., Thet, T. T., & Khoo, C. S. G. (2010). Comparing sentiment expression in movie reviews from four online genres. Online Information Review, 34(2), 317–338.
Netzer, O., Feldman, R., Goldenberg, J., & Fresko, M. (2012).
Mind your own business: Market structure surveillance through text mining. Marketing Science, 31(3), 521–543.
Pelleg, D., & Moore, A. (2000). X-means: Extending K-means with efficient estimation of the number of clusters. Paper presented at the Proceedings of the Seventeenth International Conference on Machine Learning, CA, USA.
Saiz, A., & Simonsohn, U. (2008). Downloading wisdom from online crowds (IZA Discussion Paper Series, Paper No. 3809, pp. 1–44). The Wharton School, University of Pennsylvania, USA.
Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.
Srikant, R., & Agrawal, R. (1997). Mining generalized association rules. Future Generation Computer System, 13(2–3), 161–180.

Authors' bio-sketch
Subhasis Dasgupta has been in academics for close to 4 years, teaching subjects like Business Research Method, Quantitative Analysis with MS-Excel, Text Mining, Business Process Modeling and Simulation. Subhasis has worked in the industry for 4 years and was involved in Planning and Operations at HPCL. He has a strong inclination towards quantitative management. Currently he is pursuing his PhD on applied text mining in businesses.

Kalyan Sengupta is an Electrical Engineer and also a Post Graduate from Warwick University, UK. Sengupta earned his PhD in Business Management from Calcutta University. Professor Sengupta has also been a visiting professor at reputed business schools like IIM, IIFT, IMI, VGSOM and others. His areas of teaching interest are Market Analytics, Business Intelligence, Large Scale Data Analysis, Advanced Excel and R Programming.