1 Introduction
The classification of objects or elements according to their similarities is one of the fundamental bases of learning and understanding. Classifying elements arises in human beings from childhood, for example, when sorting objects by color or shape. Cluster analysis supports the development of methods and algorithms to group and classify. The problem of clustering data is also widely studied in data mining and machine learning, with applications that include summarization, learning, image segmentation, and marketing.
There are different ways to classify clustering algorithms, in particular by the type of clusters obtained. [14] proposes the following classification: disjoint, when an element belongs to exactly one cluster, for example, clustering movies by their content rating (AA, A, B, B15, C, and D); fuzzy, when an element belongs to all clusters but with a certain degree of membership, for example, clustering a range of a million colors; and finally overlapping, where an element may belong to more than one cluster, for example, people's food preferences.
Another categorization, which distinguishes exclusive from non-exclusive classification, is also indicated in [14]. The first considers disjoint clusters, while the second allows overlaps. Within the exclusive classification there is the intrinsic approach, which uses a proximity matrix and is also known as unsupervised learning; the extrinsic classification instead uses labels for the elements.
The intrinsic classification is sub-classified into hierarchical and partitional, depending on the structure imposed on the data. Hierarchical clustering can be agglomerative (forming clusters by merging existing ones) or divisive (forming clusters by splitting existing ones), considering some similarity measure. Partitional clustering takes a parameter k, which indicates the number of clusters. This taxonomy is shown in Figure 1.
Regarding hierarchical algorithms, different authors have developed algorithms of this kind, working on different domains [7, 17]: covering subgraphs to cluster documents [2], density subgraphs [3], suffix trees [22], center-based approaches [9], density [11], objective functions and dendrograms [12], using the nearest neighbor for the identification of duplicates [5] and for predictions [10], among others.
Among the iterative algorithms, there are works where researchers use maximum likelihood [16], EM with Gaussian mixtures [21], hybrid algorithms combining GSA and K-Means [13], etc. The techniques used are varied and produce different results; they have been applied to different types of data, such as text, images, and discrete, numeric, and categorical data.
The present study focuses on clustering algorithms with overlap. The work is organized as follows: it starts with an explanation of the chosen algorithms, continues in the next section with an analysis of the computational behavior of the presented algorithms and the specification of the data tested with each technique, and the last section contains the conclusions.
2 Clustering Algorithms with Overlap
This section explains the solutions that different authors have proposed for the clustering problem when overlapping clusters are allowed.
2.1 ADditive CLUStering (ADCLUS)
One of the first works is [1], where a new clustering model is described. In this model, the restriction that clustered objects fall into exhaustive or mutually exclusive categories is relaxed, allowing overlapping clusters. It is well known that many datasets to be grouped do not require exclusive clusters, hence the need for a solution with overlap; however, creating all possible overlapping subsets of $n$ objects gives a total of $2^n - 1$ candidate clusters, so the model must restrict itself to a small number of them.
ADCLUS is the proposed model, which clusters elements that share some property, each property having a certain weight. ADCLUS considers $n$ objects to be grouped and a symmetric proximity matrix $S = [s_{ij}]$, whose entries are approximated as
$$ \hat{s}_{ij} = \sum_{k=1}^{m} w_k \, p_{ik} \, p_{jk}, $$
where $m$ is the number of clusters, $w_k \geq 0$ is the weight of cluster $k$, and $p_{ik} \in \{0, 1\}$ indicates whether object $i$ belongs to cluster $k$.
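As an illustration, the following minimal Python sketch (not taken from [1]; the function names and the squared-error loss are our own assumptions) computes the similarities implied by a given ADCLUS solution and its fit to an observed matrix:

```python
import numpy as np

def adclus_reconstruction(P, w):
    """Similarity matrix implied by an ADCLUS solution.

    P : (n, m) binary membership matrix, P[i, k] = 1 if object i is in cluster k.
    w : (m,) non-negative cluster weights.
    Returns the (n, n) matrix with entries sum_k w_k * P[i, k] * P[j, k].
    """
    return P @ np.diag(w) @ P.T

def adclus_loss(S, P, w):
    """Sum of squared residuals between observed and reconstructed similarities."""
    return np.sum((S - adclus_reconstruction(P, w)) ** 2)

# Toy example: 4 objects, 2 overlapping clusters (object 1 belongs to both).
P = np.array([[1, 0],
              [1, 1],
              [0, 1],
              [0, 1]])
w = np.array([0.7, 0.4])
S_hat = adclus_reconstruction(P, w)
print(adclus_loss(S_hat, P, w))  # 0.0: perfect fit to its own reconstruction
```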
2.2 Overlapped K-Means (OKM)
The motivation for an overlapping algorithm based on K-Means [4] comes from applications in information retrieval, natural language processing, chemistry, biology, and medicine, among others, where an overlapping data coverage is required; thus an objective criterion associated with the OKM algorithm is proposed that generalizes the K-Means criterion.
The objective criterion is defined as follows: given a set of data vectors $X = \{x_1, \ldots, x_n\}$, find a set of $k$ (possibly overlapping) clusters $\{\pi_1, \ldots, \pi_k\}$ with prototypes $\{m_1, \ldots, m_k\}$ that minimizes
$$ J = \sum_{i=1}^{n} \| x_i - \phi(x_i) \|^2, $$
where $\phi(x_i)$ is the image of $x_i$, defined as the mean of the prototypes of the clusters to which $x_i$ is assigned. When each element belongs to exactly one cluster, $\phi(x_i)$ reduces to the centroid of that cluster and the criterion coincides with that of K-Means.
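A minimal Python sketch of this criterion (assuming a binary assignment matrix; the names are ours, not from [4]):

```python
import numpy as np

def okm_objective(X, centers, A):
    """OKM objective: squared distance of each point to its image.

    X       : (n, d) data matrix.
    centers : (k, d) cluster prototypes.
    A       : (n, k) binary matrix, A[i, j] = 1 if x_i is assigned to cluster j
              (each row must have at least one 1; overlaps are allowed).
    """
    # Image of each point: the mean of the prototypes of its clusters.
    images = (A @ centers) / A.sum(axis=1, keepdims=True)
    return float(np.sum((X - images) ** 2))
```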
2.3 Dynamic Overlapping Algorithm based on Relevance (DClustR)
The DClustR algorithm [18] allows overlap between its clusters and is presented as an alternative for analysis in social networks, information retrieval, and bioinformatics. This algorithm is based on graph theory and introduces strategies for building more precise overlapping clusters, and for updating them when the collection changes.
The main idea is to generate a set of clusters that forms a cover of the collection. The collection is represented as a similarity graph whose edges connect objects with similarity of at least a given threshold, and clusters are built from ws-graphs (weighted star-shaped sub-graphs): sub-graphs in which one vertex, called the center, is connected to all the other vertices. Since a ws-graph is determined by its center, the problem is to build the set of centers whose ws-graphs cover every vertex of the graph. To avoid analyzing all vertices in the graph when the collection changes, DClustR only processes the most relevant ones, where the relevance of a vertex is estimated from properties of its neighborhood in the graph; the formal definitions are given in [18].
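The following Python sketch is a simplification, not the actual DClustR strategy: a greedy degree-based choice of centers stands in for the relevance criterion of [18], but it illustrates covering a thresholded similarity graph with star-shaped sub-graphs:

```python
import numpy as np

def star_cover(S, beta):
    """Greedy cover of a thresholded similarity graph with star sub-graphs.

    S    : (n, n) symmetric similarity matrix.
    beta : similarity threshold defining the graph's edges.
    Returns a list of clusters, each the set {center} | {its neighbors};
    a vertex may appear in several stars, so clusters can overlap.
    """
    n = S.shape[0]
    adj = (S >= beta) & ~np.eye(n, dtype=bool)  # thresholded graph
    uncovered = set(range(n))
    clusters = []
    while uncovered:
        # Crude stand-in for DClustR's relevance: the uncovered vertex
        # with the highest degree becomes the next star center.
        center = max(uncovered, key=lambda v: int(np.count_nonzero(adj[v])))
        star = {center} | {int(u) for u in np.flatnonzero(adj[center])}
        clusters.append(star)
        uncovered -= star
    return clusters
```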
2.4 Overlapping Clustering based on Density and Compactness (OCDC)
The OCDC algorithm [19] introduces a new graph cover and a new filtering strategy, with which a small set of overlapping clusters can be obtained. The collection of objects is represented as a weighted similarity graph with a threshold: vertices are objects, and two objects are connected when their similarity reaches the threshold. In the initialization phase, an initial set of clusters covering the vertices of the graph is built, guided by the density and the compactness of each vertex. The density of a vertex measures how strongly it is connected to its neighborhood, while the compactness of a vertex measures how well its neighborhood holds together around it; the formal definitions of both measures are given in [19]. The filtering strategy then removes the least useful clusters, keeping a small cover.
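As an illustration of the kind of per-vertex measure involved, here is a simplified density in Python (our own definition for illustration, not the exact one from [19]):

```python
import numpy as np

def vertex_density(S, beta, v):
    """Average similarity of vertex v to its neighbors in the thresholded graph.

    A simplified stand-in for OCDC's density: vertices whose neighborhoods
    are strongly similar to them score high and make good cluster seeds.
    """
    neighbors = np.flatnonzero((S[v] >= beta) & (np.arange(len(S)) != v))
    if neighbors.size == 0:
        return 0.0
    return float(S[v, neighbors].mean())
```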
2.5 MCLC Algorithm
The MCLC algorithm is proposed to discover overlapping communities [6], using a random walk on a line graph together with an attraction intensity. Unlike the traditional random walk, which starts from a node, it starts from a link. First, the network graph is transformed into a weighted line graph, and the random walk on this line graph is associated with a Markov chain. From the transition probabilities of this Markov chain, a similarity between pairs of links is obtained. Then, the links can be grouped into "link communities", whose nodes can overlap.
The link communities are then converted into "node communities", and an attraction intensity is defined to control the size of the overlap. Finally, the communities that allow overlapping are detected.
The distance (or similarity) between pairs of links is obtained by calculating the transition probabilities of random walks on the line graph. A transition matrix over the links is built, and the number of random-walk steps starting from a given link determines how far the walk spreads through the network. Cluster analysis can then group similar links into candidate link communities, using a symmetric similarity between links (and the corresponding distance) defined from these transition probabilities; the exact definitions can be found in [6].
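A minimal Python sketch of the first step (our own construction of an unweighted line graph and its random-walk transition matrix; the weighting scheme of [6] is omitted):

```python
import numpy as np
from itertools import combinations

def line_graph_transition(edges):
    """Random-walk transition matrix over the links of a graph.

    edges : list of (u, v) pairs. Two links are adjacent in the line graph
            when they share an endpoint.
    Returns the row-stochastic transition matrix over the links.
    """
    m = len(edges)
    A = np.zeros((m, m))
    for i, j in combinations(range(m), 2):
        if set(edges[i]) & set(edges[j]):  # shared endpoint
            A[i, j] = A[j, i] = 1.0
    deg = A.sum(axis=1, keepdims=True)
    return np.divide(A, deg, out=np.zeros_like(A), where=deg > 0)

# Toy example: a triangle plus a pendant edge.
P = line_graph_transition([(0, 1), (1, 2), (0, 2), (2, 3)])
print(P)
```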
2.6 Clustering Method with Incremental Overlapping based on Trees
The clustering method with incremental overlap based on trees [20] uses three-way decision theory. Representative points organized in a tree improve the relevance of the search results. The overlapping clusters are represented by three-way decisions with interval sets. Three-way decision strategies are designed to update the clusters as the data grows. Furthermore, with this method it is possible to determine the number of clusters during the process.
To define three-way decision clustering, each cluster is described by an interval set: a core region of objects that certainly belong to the cluster, a fringe region of objects that possibly belong to it, and the remaining objects, which do not belong to it. The algorithm starts by calculating the (Euclidean) distance between objects; the similarity between two objects is obtained as the complement of their distance. Subsequently, the representative points are selected using a threshold condition on this similarity (the formal condition is given in [20]). The next step is the construction of an undirected graph over the representative points, from which the trees representing the clusters are obtained.
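A minimal Python sketch of the three-way assignment itself (the thresholds alpha and beta and all names are ours; [20] derives its thresholds differently):

```python
def three_way_assign(sim_to_cluster, alpha=0.7, beta=0.4):
    """Assign an object to a cluster's core, fringe, or exterior.

    sim_to_cluster : similarity of the object to the cluster (in [0, 1]).
    alpha, beta    : thresholds with alpha > beta (illustrative values).
    """
    if sim_to_cluster >= alpha:
        return "core"      # certainly belongs
    if sim_to_cluster >= beta:
        return "fringe"    # possibly belongs; this is the source of overlap
    return "exterior"      # does not belong

# An object may fall in the fringe of several clusters, which yields overlap.
print(three_way_assign(0.8), three_way_assign(0.5), three_way_assign(0.1))
```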
2.7 INDCLUS
This section examines the scalability of the ADCLUS and INDCLUS models [8], which are techniques that can be used to extract overlapping clusters from similarity data. In [8], the ADCLUS and INDCLUS models were suitably adapted, and different metaheuristic extensions were designed to obtain more relaxed models.
For the INDCLUS model, $N$ elements are considered, with a similarity matrix $S_k = [s_{ij}^{(k)}]$ for each of $K$ data sources, where $s_{ij}^{(k)}$ is the similarity of the elements $i$ and $j$ according to source $k$. Each matrix is approximated as
$$ S_k \approx P \, W_k \, P^{T} + c_k \mathbf{1}\mathbf{1}^{T}, $$
where $P$ is a binary $N \times m$ membership matrix shared by all sources, $W_k$ is a diagonal matrix of non-negative cluster weights specific to source $k$, and $c_k$ is an additive constant. ADCLUS is the particular case with a single source ($K = 1$).
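For illustration, a Python sketch of the fit implied by this model (our own code, following the equation above; it is not the fitting procedure of [8]):

```python
import numpy as np

def indclus_fit(S_list, P, W_list, c_list):
    """Total squared error of an INDCLUS solution across K sources.

    S_list : list of (N, N) observed similarity matrices, one per source.
    P      : (N, m) binary membership matrix shared by all sources.
    W_list : list of (m,) non-negative weight vectors, one per source.
    c_list : list of additive constants, one per source.
    """
    ones = np.ones((P.shape[0], P.shape[0]))
    err = 0.0
    for S, w, c in zip(S_list, W_list, c_list):
        S_hat = P @ np.diag(w) @ P.T + c * ones  # model reconstruction
        err += np.sum((S - S_hat) ** 2)
    return float(err)
```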
The heuristics used with these algorithms are: an alternating least-squares approach (SINDCLUS), a symmetric approach applied to SINDCLUS (SYMPRES), simulated annealing (SA-SINDCLUS), tabu search (TABU-SINDCLUS), and a relaxed solution space (SMC-Relax). The tests were performed on medium-sized real datasets; SMC-Relax performed better than SINDCLUS and SYMPRES. The use of heuristics makes the ADCLUS and INDCLUS models scalable.
2.8 Hybrid K-Means
In [15], an algorithm (HKM-OKM) is described that combines harmonic k-means (HKM) with overlapped k-means (OKM). OKM, being an extension of k-means, is sensitive to the initial cluster centroids, but when it is combined with harmonic k-means this limitation can be overcome.
The main idea of this method is to use the output of the HKM method to initialize the centroids of the OKM method. The OKM method was explained in Section 2.2. The HKM algorithm introduces a bias (using weights) to move the cluster centers towards the data points that are most important according to some criterion.
Similar to the k-means algorithm, the HKM method can be formulated as an optimization problem whose objective is to minimize the harmonic mean of the distances from each point to all centers:
$$ J_{HKM} = \sum_{i=1}^{n} \frac{k}{\sum_{j=1}^{k} \dfrac{1}{\| x_i - c_j \|^{2}}}, $$
where $x_1, \ldots, x_n$ are the data points, $c_1, \ldots, c_k$ are the cluster centers, and $k$ is the number of clusters. Because every center contributes to the cost of every point, this criterion is much less sensitive to the initial centers than the k-means objective.
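A short Python sketch of this objective (a direct transcription of the formula above; the epsilon guard is our own addition):

```python
import numpy as np

def hkm_objective(X, centers, eps=1e-12):
    """Harmonic k-means objective: sum over points of the harmonic mean
    of their squared distances to all centers.

    X       : (n, d) data matrix.
    centers : (k, d) cluster centers.
    eps     : guard against division by zero when a point equals a center.
    """
    # Squared distances from every point to every center: shape (n, k).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    k = centers.shape[0]
    return float(np.sum(k / np.sum(1.0 / (d2 + eps), axis=1)))
```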
The HKM-OKM algorithm starts by finding centers using HKM and then initializes OKM with the found centers. Medical datasets are used because they require modeling elements with overlap; HKM-OKM improves the results obtained by OKM.
3 Algorithms Analysis
The analyzed algorithms have a maximum time complexity of quadratic order, namely OCDC, MCLC, the tree-based method, INDCLUS with its heuristics, and hybrid k-means; the particular case of OKM maintains the order of the algorithm on which it is based (k-means), and only the ADCLUS algorithm is of cubic order. This information is summarized in Table 1.
Table 1. Analyzed algorithms: data type, amount of data, and time complexity.

| Algorithm | Data type | Amount of data | Complexity |
| --- | --- | --- | --- |
| ADCLUS | Discrete | 105 | $O(n^3)$ |
| OKM | Qualitative, documents | 1,308 | Same as k-means |
| DClustR | Qualitative, documents | 16,006 | $O(n^2)$ |
| OCDC | Documents | 16,006 | $O(n^2)$ |
| MCLC | Discrete | 1,133 | $O(n^2)$ |
| Based on trees | Discrete | 5,473 | $O(n^2)$ |
| INDCLUS | Qualitative, documents | 102,294 | $O(n^2)$ |
| Hybrid K-Means | Qualitative | 699 | $O(n^2)$ |
In the experiments, the number of instances used with these algorithms varies, with a minimum of 105 and a maximum of 102,294 instances. Further, the objects are of different types: discrete, qualitative, or documents. All experiments were run on stable datasets; for example, from the UCI Machine Learning repository the cancer, heart disease, and Parkinson's datasets were tested, among others; the Zachary karate club, KDD, and ISOLET datasets were also used, while Reuters-21578 and TDT2 were mainly used for the experiments with documents.
In general, clustering algorithms with overlap are based on other algorithms that do not support overlap, and they even improve some aspects of them.
4 Conclusion and Future Work
In this article, clustering algorithms with overlap were analyzed. Over recent years, interest in the development and improvement of this type of algorithm has been constant, and researchers continue to seek to improve the obtained results.
Different techniques have been used in this type of algorithm, from basic algorithms such as k-means, through heuristics that provide scalability, to graph theory; finally, combinations of algorithms have been used to counteract some of their deficiencies.
The number of elements handled by these algorithms is, in general, not very large; standardized datasets are used, and the quality of the algorithms is verified with standard measures such as F-measure or F-BCubed.