Mining Interval Time Series

... If A1 and A2 and … and Ah occur within V units of time, then B occurs within time T. This rule format is different from the containment relationship defined in the current paper. The mining strategies are also different. The technique in [7] uses a sliding window to limit the comparisons to only the ...

Approximate algorithms for efficient indexing, clustering

Using text clustering to predict defect resolution time: a conceptual

... Five different algorithms are tested, and logistic regression yielded the best results and provided the best prediction accuracy, i.e., 34.9 %, for the defect reports in the test set. The author concludes that “there are other attributes or metrics that may have greater influence of the resolution t ...

New Method to Improve Mining of Multi

... Marwa Fouad Al-Rouby. Abstract: Class imbalance is one of the challenging problems for data mining and machine learning techniques. Data in real-world applications often have an imbalanced class distribution. This occurs when most examples belong to a majority class and few examples belong to a m ...

Discovering Frequent Closed Itemsets for Association Rules

... reduced to the problem of determining frequent itemsets and their support. Recent works demonstrated that the frequent itemset discovery is also the key stage in the search for episodes from sequences and in finding keys or inclusion as well as functional dependencies from a relation [12]. All existi ...

An Architecture for High-Performance Privacy-Preserving

... services to ensure flexibility and extensibility. This dissertation first develops a comprehensive example algorithm, a privacy-preserving Probabilistic Neural Network (PNN), which serves a basis for analysis of the difficulties of DDM/PPDM development. The privacy-preserving PNN is the first such ...

Hybrid Self-Organizing Modeling System based on GMDH

... The Group Method of Data Handling (GMDH) was invented by A.G. Ivakhnenko in the late 1960s [18]. He was looking for computational instruments allowing him to model real world systems characterized by data with many inputs (dimensions) and few records. Such ill-posed problems could not be solved trad ...

Scaling Up All Pairs Similarity Search

... Google, Inc. srikant@google.com ...

Chi-square-based Scoring Function for Categorization of MEDLINE

... with the SVM penalty parameter C were optimized by nested cross-validation over d values {1, 2, 3} and C values {0.01, 1, 100} [27]. For each learning algorithm we conducted four experiments with the following inputs for each MEDLINE citation: i) title, ii) abstract, iii) title and abstract, and iv) ...

Let's Get in the Mood: An Exploration of Data Mining

... Such systems are called OLTP (OnLine Transaction Processing) systems. • The systems are mostly relational database systems designed for transaction processing. • The performance of OLTP systems is usually very important, since such systems are used to support users (i.e., staff) who provide service to ...

Querying and Mining of Time Series Data

Steven F. Ashby Center for Applied Scientific

... Partitioning of data only – large number of classification tree nodes gives high communication cost ...

On the relationships between user profiles and navigation sessions

... The profile is made of 14 fields: the first is the nickname, i.e. a personal ID characterizing uniquely each single user, while the other 13 fields specify, respectively, the age, the gender, the spoken language, the job, the country, the zodiac sign, the favorite place to live, the favorite music, ...

The ethics of algorithms: Mapping the debate

Biclustering Algorithms for Biological Data Analysis: A Survey

... According to this criterion, a perfect bicluster is a sub-matrix with variance equal to ...

Evolutionary Model Tree Induction

... which attempt to take advantage of the unstable induction of models by growing a forest of trees from the data and later averaging their predictions. While presenting very good predictive performance, ensemble methods fail to produce a single-tree solution, operating also in a black-box fashion. We ...

Algorithm development for physiological signals

Density-based Cluster Analysis for Identification of Fire Hot Spots in

... This study identified regions that are fire hot spots in Kenya’s protected areas by performing a density-based cluster analysis on the Moderate Resolution Imaging Spectroradiometer (MODIS) MCD14ML active fire data set for a 12 year period between 2003 and 2014. Feature subset selection was done usin ...

Cyberbullying Detection based on Text

Application Of Data Mining Technology To Support Fraud Protection

... CONCLUSION AND RECOMMENDATIONS; 6.1 Conclusion ...

Mining Frequent Approximate Sequential Patterns.

... REPuter [15] is the closest effort toward mining frequent approximate sequential patterns under the Hamming distance model. Unfortunately, REPuter achieves its efficiency by strictly relying on the suffix tree for constant-time longest common prefix computation in seed extension. Consequently, the t ...

Finding Cyclic Frequent Itemsets

... underlying problem is to find frequent sequential patterns in temporal databases. Mannila et al. [16] discuss the problem of recognizing frequent episodes in an event sequence, where an episode is defined as a collection of events that occur during time intervals of a specific size. The ass ...

Rank Based Anomaly Detection Algorithms - SUrface

K-means clustering

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.

The problem is computationally difficult (NP-hard); however, efficient heuristic algorithms are commonly employed and converge quickly to a local optimum. These heuristics resemble the expectation-maximization algorithm for mixtures of Gaussian distributions in that both use an iterative refinement approach and both use cluster centers to model the data. However, k-means clustering tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism allows clusters to have different shapes.

The algorithm has a loose relationship to the k-nearest neighbor classifier, a popular machine learning technique for classification that is often confused with k-means because of the k in the name. One can apply the 1-nearest neighbor classifier to the cluster centers obtained by k-means to classify new data into the existing clusters. This is known as the nearest centroid classifier or Rocchio algorithm.
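The iterative refinement described above can be sketched in a few lines of Python. This is a minimal illustration of the standard heuristic (often called Lloyd's algorithm), not a reference implementation: the function names `assign`, `kmeans`, and `classify`, the random initialization from data points, and the `iters` and `seed` parameters are all assumptions made for this example.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points]


def kmeans(points, k, iters=100, seed=0):
    """Heuristic k-means: alternate assignment and mean-update steps.

    Converges to a local optimum only; the result depends on the
    (assumed) initialization, which here samples k data points.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = None
    for _ in range(iters):
        new_labels = assign(points, centers)
        if new_labels == labels:
            break  # assignments are stable, so the centers will not move
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return centers, labels


def classify(point, centers):
    """Nearest centroid rule: 1-nearest neighbor over the cluster centers."""
    return assign([point], centers)[0]


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centers, labels = kmeans(data, k=2)
    print(centers, labels)
    print(classify((9.0, 9.0), centers))
```

The `classify` helper is the nearest centroid step mentioned in the text: a new observation is assigned to whichever existing cluster center is closest. Because only a local optimum is reached, practical use typically repeats the procedure from several initializations and keeps the best result.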

Top subcategories
