1 Introduction
Many similarity and dissimilarity measures have been proposed for probability distributions [1,2]. Recently, an involutive negation of probability distributions [3] and a measure of correlation between distributions [4, 5] were introduced. This correlation measure uses a co-symmetric distance between probability distributions based on the involutive negation of probability distributions. Co-symmetric similarity and dissimilarity measures are important for applications because they take into account the symmetry of data with respect to the involution operation [6]. In this paper, four new co-symmetric dissimilarity functions for probability distributions are constructed and compared with the other co-symmetric distances between probability distributions considered in [7].
In Section 2, a brief outline of the underlying theory is given. In Section 3, four distances are used to create four dissimilarity functions that satisfy the co-symmetry property. In Sections 4 and 5, they are compared with four other co-symmetric distances introduced in [7]. Sections 6 and 7 contain the results and the conclusion.
2 Preliminary Definitions
2.1 Negator and Negation of Probability Distributions
Let
The first example of a negation of probability distributions was introduced in [8]. In [9], the general properties of negations of probability distributions and the class of linear negations of probability distributions are considered. An involutive negation of probability distributions was introduced in [3]. Relationships between negation and the entropy of probability distributions are studied in [10]. The interpretation of probability distributions as fuzzy distribution sets and the extension of parametric negations of fuzzy sets to probability distributions are considered in [11, 12]. A negator
A negation is called involutive if
where
2.2 Co-symmetric Dissimilarity Functions
Suppose
symmetry:
irreflexivity:
A dissimilarity function is co-symmetric if for all probability distributions
2.3 Correlation Coefficients
Pearson's correlation coefficient, commonly used in statistical analyses, allows the evaluation of the presence and strength of a linear relationship between two quantitative variables. It varies between -1 and 1.
A value of 1 indicates a perfect positive correlation, -1 indicates a perfect negative correlation, and 0 suggests no linear correlation. On the other hand, Spearman and Kendall correlations are useful tools to investigate monotonic relationships between variables.
While Spearman's coefficient is based on the ranks of the observations, Kendall's focuses on the agreement (concordance) of data pairs. Both coefficients vary between -1 and 1, are robust to outliers, and do not assume specific distributions.
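The difference between linear and monotonic association can be illustrated with a minimal sketch in plain Python. The helper names (`pearson`, `ranks`, `spearman`, `kendall`) are our own, the Kendall variant shown is tau-a and assumes no ties, and the data are a toy example rather than data from this paper.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(v):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs (no ties assumed)."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

# Toy data: y is a strictly increasing but nonlinear function of x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [v ** 3 for v in x]
print(pearson(x, y), spearman(x, y), kendall(x, y))
```

On this toy data the rank-based coefficients reach their maximum while Pearson's stays below 1, which is why rank-based coefficients suit questions about monotone dependence. In practice, `scipy.stats.pearsonr`, `scipy.stats.spearmanr`, and `scipy.stats.kendalltau` provide these coefficients with tie handling.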
3 New Co-Symmetric Dissimilarity Functions
In [1], different similarity and dissimilarity measures commonly used to compare probability distributions are considered. They are not co-symmetric. We apply the method of co-symmetrization of similarity and dissimilarity functions proposed in [13] to create new dissimilarity measures of probability distributions that satisfy the co-symmetry property:
where * is the product of real numbers. It is easy to show that the distances obtained from (1) are co-symmetric dissimilarity functions. Table 2 shows co-symmetric dissimilarity functions obtained from the four known [1] dissimilarity functions presented in Table 1.
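Table 1. Dissimilarity functions from [1]
Name | Distance |
Soergel | |
Sørensen | |
Jaccard | |
Dice | |

Table 2. Co-symmetric dissimilarity functions obtained from (1)
Distance | |
Soergel | |
Sørensen | |
Jaccard | |
Dice | |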
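The construction can be sketched in Python under explicitly stated assumptions, since the formulas themselves are not reproduced here. We assume (i) the commonly used textbook forms of the Soergel, Sørensen, Jaccard, and Dice distances, (ii) an involutive negation of the form (max + min - p_i) / (n (max + min) - 1), and (iii) a product co-symmetrization that multiplies the distance between two distributions by the distance between their negations. All function names are hypothetical, and the exact forms of equation (1) and of the negation in [3] may differ.

```python
def negation(p):
    """Assumed involutive negation: (max + min - p_i) / (n * (max + min) - 1)."""
    m = max(p) + min(p)
    denom = len(p) * m - 1.0
    return [(m - pi) / denom for pi in p]

def soergel(p, q):
    """Assumed form: sum |p_i - q_i| / sum max(p_i, q_i)."""
    return sum(abs(a - b) for a, b in zip(p, q)) / sum(max(a, b) for a, b in zip(p, q))

def sorensen(p, q):
    """Assumed form: sum |p_i - q_i| / sum (p_i + q_i)."""
    return sum(abs(a - b) for a, b in zip(p, q)) / sum(a + b for a, b in zip(p, q))

def jaccard(p, q):
    """Assumed form: sum (p_i - q_i)^2 / (sum p_i^2 + sum q_i^2 - sum p_i q_i)."""
    num = sum((a - b) ** 2 for a, b in zip(p, q))
    den = sum(a * a for a in p) + sum(b * b for b in q) - sum(a * b for a, b in zip(p, q))
    return num / den

def dice(p, q):
    """Assumed form: sum (p_i - q_i)^2 / (sum p_i^2 + sum q_i^2)."""
    num = sum((a - b) ** 2 for a, b in zip(p, q))
    den = sum(a * a for a in p) + sum(b * b for b in q)
    return num / den

def co_pro(d):
    """Assumed product co-symmetrization: d(P, Q) * d(neg(P), neg(Q))."""
    def wrapped(p, q):
        return d(p, q) * d(negation(p), negation(q))
    return wrapped

# Toy distributions (not data from the paper).
p = [0.5, 0.3, 0.2]
q = [0.1, 0.6, 0.3]
for name, d in [("Soergel", soergel), ("Sorensen", sorensen),
                ("Jaccard", jaccard), ("Dice", dice)]:
    cd = co_pro(d)
    # Co-symmetry: the value is unchanged when both arguments are negated.
    print(name, cd(p, q), cd(negation(p), negation(q)))
```

Under these assumptions, co-symmetry follows directly from involutivity: negating both arguments only swaps the two factors of the product.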
4 Comparative Analysis of New Co-Symmetric Dissimilarity Functions
For the comparative analysis of the new dissimilarity measures, we used one thousand randomly generated probability distributions, each with 10 elements. For the first analysis, dissimilarity matrices were constructed for the four new co-symmetric dissimilarity measures created by equation (1).
Subsequently, each dissimilarity matrix is transformed into a vector, and the correlation is calculated between two vectors corresponding to two dissimilarity matrices obtained for two different methods.
Scatter plots for each pair of vectors are created to observe graphically the correlation between the dissimilarity functions, see Fig. 1. In the same way, the correlation between the dissimilarity functions is calculated using the Pearson, Kendall, and Spearman correlation coefficients.
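The procedure described in this section can be sketched as follows. This is an illustration under assumptions, not the code used for the reported results: the sampling scheme (normalizing uniform random weights), the use of the upper triangle of each matrix as the vector, the reduced sample size, and all function names are our own choices. For brevity, two base distances in their commonly used forms stand in for the co-symmetric functions, and only Pearson's coefficient is computed.

```python
import math
import random

def random_distribution(n, rng):
    """Assumed sampling scheme: n uniform random weights normalized to sum to 1."""
    w = [rng.random() for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]

def sorensen(p, q):
    return sum(abs(a - b) for a, b in zip(p, q)) / sum(a + b for a, b in zip(p, q))

def soergel(p, q):
    return sum(abs(a - b) for a, b in zip(p, q)) / sum(max(a, b) for a, b in zip(p, q))

def dissimilarity_matrix(dists, d):
    """Square matrix of pairwise dissimilarities d(P_i, P_j)."""
    m = len(dists)
    return [[d(dists[i], dists[j]) for j in range(m)] for i in range(m)]

def upper_triangle(matrix):
    """Flatten the entries above the diagonal into a vector (each pair once)."""
    m = len(matrix)
    return [matrix[i][j] for i in range(m) for j in range(i + 1, m)]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

rng = random.Random(0)  # fixed seed so the run is repeatable
# 50 distributions keep the sketch fast; the text above uses one thousand.
dists = [random_distribution(10, rng) for _ in range(50)]
vec_a = upper_triangle(dissimilarity_matrix(dists, sorensen))
vec_b = upper_triangle(dissimilarity_matrix(dists, soergel))
r = pearson(vec_a, vec_b)
print(len(vec_a), r)
```

Using only the upper triangle avoids counting each symmetric pair twice and excludes the zero diagonal, which would otherwise inflate the correlation.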
5 Comparing Similarity Functions with Higher Similarity
In [7], new co-symmetric dissimilarity measures based on the average operation were obtained. The analysis carried out in that paper shows that the most highly correlated dissimilarity measures were the pair Sorensen Co-Avg and Soergel Co-Avg and the pair Jaccard Co-Avg and Dice Co-Avg.
In this paper, we obtained the same result for the product-based co-symmetrization (1) of these pairs of distances, see Table 3 and Fig. 1, where the scatter plots demonstrate an almost strictly monotone dependence between the Sorensen Co-Pro and Soergel Co-Pro distances, and between the Jaccard Co-Pro and Dice Co-Pro distances.
Table 3. Correlations between the Co-Pro distances
Distance pair | Pearson | Kendall | Spearman |
Sorensen Co-Pro vs. Soergel Co-Pro | 0.9919 | 0.9613 | 0.9976 |
Sorensen Co-Pro vs. Jaccard Co-Pro | 0.9398 | 0.7828 | 0.9346 |
Sorensen Co-Pro vs. Dice Co-Pro | 0.9310 | 0.7705 | 0.9268 |
Soergel Co-Pro vs. Jaccard Co-Pro | 0.9372 | 0.7750 | 0.9299 |
Soergel Co-Pro vs. Dice Co-Pro | 0.9137 | 0.7580 | 0.9187 |
Jaccard Co-Pro vs. Dice Co-Pro | 0.9903 | 0.9634 | 0.9978 |
For further analysis of the correlation-based similarity of obtained co-symmetric distances, we divided them into two groups of most correlated distances. The first group contains co-symmetric dissimilarity functions (distances) obtained from Sorensen and Soergel distances, and the second group contains co-symmetric dissimilarity functions obtained from Jaccard and Dice distances.
The results are presented in Tables 4 and 5 and in Figs. 2a and 2b. As expected, for most co-symmetric dissimilarity functions, different co-symmetrizations of the same distance usually give co-symmetric distances whose correlation is below 0.99. We have paid more attention to the results of the Spearman correlation, which is a measure of monotonic relationship. Only one unexpected result was obtained, for the Soergel Co-Avg and Sorensen Co-Pro co-symmetric dissimilarity functions (Spearman correlation 0.9934), see Table 4.
Table 4. Correlations between co-symmetric distances obtained from the Soergel and Sorensen distances
Distance pair | Pearson | Kendall | Spearman |
Soergel Co-Avg vs. Soergel Co-Pro | 0.9758 | 0.9137 | 0.9867 |
Soergel Co-Avg vs. Sorensen Co-Avg | 0.9906 | 0.9446 | 0.9933 |
Soergel Co-Pro vs. Sorensen Co-Pro | 0.9919 | 0.9613 | 0.9976 |
Soergel Co-Pro vs. Sorensen Co-Avg | 0.9656 | 0.8494 | 0.9628 |
Soergel Co-Avg vs. Sorensen Co-Pro | 0.9587 | 0.9411 | 0.9934 |
Sorensen Co-Avg vs. Sorensen Co-Pro | 0.9629 | 0.8825 | 0.9766 |
Table 5. Correlations between co-symmetric distances obtained from the Jaccard and Dice distances
Distance pair | Pearson | Kendall | Spearman |
Jaccard Co-Avg vs. Jaccard Co-Pro | 0.9566 | 0.8788 | 0.9781 |
Jaccard Co-Avg vs. Dice Co-Avg | 0.988 | 0.9375 | 0.9939 |
Jaccard Co-Pro vs. Dice Co-Pro | 0.9903 | 0.9634 | 0.9978 |
Jaccard Co-Pro vs. Dice Co-Avg | 0.9437 | 0.8164 | 0.9501 |
Jaccard Co-Avg vs. Dice Co-Pro | 0.9318 | 0.9153 | 0.9894 |
Dice Co-Avg vs. Dice Co-Pro | 0.9365 | 0.8529 | 0.9688 |
6 Results
We applied the procedure of co-symmetrization based on product aggregation to the four most popular distances between probability distributions [1]. The correlation analysis of these distances shows high similarity between them, with the highest correlation between the Soergel- and Sorensen-based co-symmetric distances, and between the Jaccard- and Dice-based co-symmetric distances. These two pairs of distances are considered as two classes of similar co-symmetric distances, with mutual Spearman correlation greater than 0.997 between distances from the same class.
Although we applied three correlation coefficients, Pearson, Spearman, and Kendall correlation, we paid more attention to the Spearman correlation, which is a measure of monotonic relationship. This property is important in the comparison of similarity and dissimilarity measures [14, 15].
Further, we compared the co-symmetric distances in each class based on product co-symmetrization with the co-symmetric distances obtained in our previous paper [7] based on average co-symmetrization of the same initial distances. The Spearman correlation between distances from the same class based on different co-symmetrizations is higher than 0.95.
7 Conclusion
We introduced new co-symmetric dissimilarity functions that can serve as distances between probability distributions. These dissimilarity functions take into account the symmetry of the space of finite probability distributions with respect to the uniform distribution
Co-symmetrization of four popular distance measures and further correlation analysis of the resulting functions showed the highest correlation between the Soergel- and Sorensen-based co-symmetric distances, and between the Jaccard- and Dice-based co-symmetric distances. The same results were obtained for the co-symmetric distances obtained previously with another co-symmetrization method.
The obtained results give us a better understanding of known distances and co-symmetric distances obtained from them, which can be used to select suitable distances between probability distributions.