1. Introduction
In the data-driven era, data generation, data measurement and data processing are key steps of computation. With information technology now percolating into the bottom-most layers of our daily lives, enormous amounts of data are generated effortlessly and inadvertently. To add to the difficulty, relevant and not-so-relevant data are generated indistinguishably and measured together, and extra effort is needed to separate the relevant data from the irrelevant.
In that sense, the data we access is of enormous volume but may contain relatively little of the information we actually need. The problem arises in many aspects of life and in many disciplines of scientific research. Typical real-life situations are mixtures of simultaneous sounds or human voices picked up by several microphones, brain signal measurements from multiple EEG sensors, several radio signals arriving at a mobile phone, or multiple parallel time series obtained from some industrial process.
A well-known example, the noisy-room scenario known as the cocktail party problem, is worth stating here. Suppose that at a cocktail party many people are talking at the same time and isolation of the individual signals is of interest. A guest must focus on one person's voice in a room filled with competing voices and other noises. This 'cocktail-party problem' is solved effortlessly by humans with binaural hearing. Another example, from image processing, is the problem of removing blur from an image due to camera motion.
A photographer tries to take a photo, but the camera is not steady while the aperture is open. Each pixel in the sensor array records the combination of all light arriving during the integration period from the intended image along the camera motion trajectory. Thus, in a blurred image, each recorded pixel is a mixture of multiple image pixels. De-blurring an image requires recovering the original image as well as the underlying camera motion trajectory from the blurry image. Both the cocktail party and de-blurring problems are ill-posed, and additional information must be employed to recover a solution.
The term Blind Source Separation (BSS) was coined to characterize this problem. Simply speaking, any real-life measurement process records the combination of relevant and irrelevant data arising from their respective independent sources, and it is therefore necessary to separate the data of the relevant source from the data of the other sources. Progress has been made when the interactions between signals are simple, in particular linear, as in both examples above. When the combination of two signals results in a superposition of the signals, we call this a linear mixture problem.
In mathematical terms, we need to find a suitable multivariate representation of random vectors. For simplicity, the representation is also referred to as a linear transformation of the initial data; in other words, each representative component is a linear blend of the initial variables. Well-known linear transformation methods include Factor Analysis [1], Principal Component Analysis (PCA) [2, 3], and Projection Pursuit [4].
Independent Component Analysis (ICA) is a data transformation technique that finds independent sources of activity in recorded mixtures of sources. It is a computational technique for revealing hidden factors that underlie sets of measurements or signals. ICA assumes a statistical model whereby the observed multivariate data, typically given as a large database of samples, are linear or nonlinear mixtures of some unknown latent variables.
ICA was introduced in 1986 [5]. However, that paper presented no theoretical explanation, and the proposed algorithm was not applicable in several cases; in 1991 a partial theoretical structure was laid down [6]. The technique thus remained mostly unknown until 1994, when the name ICA appeared and the method was introduced as a new concept [7], with the suggestion that the source signals are independent.
Several algorithms have since been proposed for computing ICA; they differ in how they handle statistical independence, how they estimate the separation matrix, and how they use higher-order statistics. For BSS, the source signals may presumably be combined linearly or nonlinearly. ICA is ideal if the signals are assumed to be combined linearly; several other methods exist for BSS under a nonlinear mixture assumption [8, 9, 10]. ICA's linear mixture model attempts to separate source signals under the following assumptions:
The source vectors are statistically independent.
The mixing matrix (A, as defined in the next section) should be square and of full rank.
The source matrix (S, as defined in the next section) does not have any external noise.
The data are centered (zero mean).
The source signals should have non-Gaussian probability density functions (pdfs); at most one source may be Gaussian.
Independent Component Analysis (ICA) has been employed for nearly 30 years for unmixing of complex signals.
Unmixing signals without any background knowledge about the source signals or how they were mixed is generally known as Blind Source Separation (BSS). ICA is the best-known BSS technique developed within signal processing. The key concern of ICA is the extraction of the 'source signals' and their mixing coefficients (proportions) from a set of observed signal mixtures, so that the derived information can be interpreted directly.
ICA is a probabilistic method whose goal is to extract, from mixed observed signals, underlying component signals that are maximally independent and non-Gaussian. The mixing coefficients are also unknown. The latent variables are non-Gaussian and mutually independent, and they are called the independent components of the observed data. ICA finds these independent components, also called sources or factors.
Thus, ICA can be seen as an extension to Principal Component Analysis and Factor Analysis. ICA is a much richer technique, however, capable of finding the sources when these classical methods fail completely. In many cases, the measurements are given as a set of parallel signals or time series.
Since it aims to maximize non-Gaussianity (or, equivalently, minimize Gaussianity) in order to make the recovered sources as independent as possible, ICA is an optimization problem. The independence hypothesis has to be approximated, turning the estimation of the sources into an optimization problem described by a contrast (cost) function that attains its optimum when the estimated sources are as independent as possible.
In essence, a contrast function is a measure of independence. By definition, it is a criterion whose maximization leads to an acceptable solution of the BSS problem. When each row of the mixing-separating system is extracted one by one, i.e., the source signals are recovered component by component, the approach is called deflation. When all the source signals (multi-unit) are recovered simultaneously, it is called the symmetric approach. An ICA method can thus be expressed as the combination of a contrast function and an optimization algorithm.
More than 30 different ICA algorithms are already available [11]. They basically fall into two classes: a single optimization method applied to different contrast functions, or different optimization methods applied to a single contrast function.
The widespread and interdisciplinary applications of ICA in image processing, text mining, data mining, audio signal processing, biomedical signal processing, and time series analysis motivate us to present ICA theory and its most widely used methods in one article.
The goal of this review is to explain ICA and to present some of the widely used algorithms for ICA computation, as well as some additional contrast functions, complementing Aapo Hyvarinen's much earlier survey from 1999 [12].
The remainder of the survey is organized as follows. Section 2 gives an introduction to ICA. Section 3 addresses the higher-order statistical notions that are useful in ICA. Section 4 presents six different ICA algorithms. Section 5 describes various ICA contrast functions. Section 6 gives real-world applications of ICA. Finally, Section 7 concludes the survey, followed by the references.
2. Independent Component Analysis (ICA)
Independent Component Analysis (ICA) [12, 13, 14, 15, 16] is a statistical tool for transforming an observed multidimensional random vector into statistically independent components; it is used to separate mixed signals. PCA relies only on second-order statistics and is optimal for Gaussian-distributed data sets. ICA is an extension of PCA designed to maximize non-Gaussianity (equivalently, minimize Gaussianity) of the data. ICA attempts to find independent components by exploiting their higher-order statistical properties.
The random vectors x and s represent the observed data and the independent components, respectively. ICA has many algorithms such as FastICA [17], projection pursuit [15], and Infomax [15, 18].
The main goal of these algorithms is to extract independent components by (1) maximizing non-Gaussianity, (2) minimizing mutual information, or (3) using the maximum likelihood (ML) estimation method [19]. However, ICA also faces difficulties in some settings, such as the over-complete and under-complete cases.
Let us consider an observed m-dimensional column vector x(m) = [x1, x2, …, xm]T that represents a linear combination of n (n ≤ m) source components s(n) = [s1, s2, …, sn]T, which are statistically independent (or as independent as possible). The ICA model is then:

x(m) = A s(n)   (1)

where A is the m × n mixing matrix. The observed variables are usually statistically dependent because of the mixing, even though the original sources are not. Both the mixing matrix and the independent components (ICs) si, i = 1, 2, …, n are unknown. If a demixing matrix W can be found that produces

y(m) = W x(m)   (2)

then the components of y(m) are statistically independent.
The model assumes that the observed data variables are mixtures of these latent variables and that the mixing itself is unknown. The latent variables are non-Gaussian and mutually independent; they are referred to as the independent components of the observed data.
The approach is called blind because little is known about either the mixing matrix A or the source matrix s. The ICA method can thus be described as finding a linear transformation that maximizes the non-Gaussianity of the estimate ŝ. The demixing matrix W is obtained by optimizing a cost function; specific cost functions such as negentropy, kurtosis, etc. can be used, and accordingly various methods for computing W exist.
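As a minimal numerical sketch of the model in equations (1) and (2) (assuming two toy sources and a hypothetical 2 × 2 mixing matrix; this numpy example is illustrative, not taken from any cited algorithm):

import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 1000)
# Two toy independent sources: a sinusoid and uniform noise (rows of s).
s = np.vstack([np.sin(2 * np.pi * 5 * t), rng.uniform(-1, 1, t.size)])
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])        # hypothetical mixing matrix
x = A @ s                         # observed mixtures, x = A s   (eq. 1)
W = np.linalg.inv(A)              # ideal demixing matrix (unknown in practice)
y = W @ x                         # recovered components, y = W x   (eq. 2)
print(np.allclose(y, s))          # True: exact recovery with the true inverse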
The goal of ICA is to solve BSS problems that arise from such a linear mixture.
Furthermore, metrics based on cumulants, the likelihood function, negentropy, kurtosis, and mutual information have been developed to obtain the demixing matrix in different ICA-based algorithms. FastICA [18, 16] was developed to maximize non-Gaussianity with relative speed and simplicity. More recently, Zarzoso and Comon [20, 21] proposed the Robust Independent Component Analysis (RobustICA) method for better convergence performance.
They used a truncated polynomial expansion, rather than the output marginal probability density functions, to simplify the estimation process. Moreover, in [19] the authors developed a rapid ICA algorithm that exploits multi-step past information within a fixed-point scheme in order to increase the non-Gaussianity of the estimated signals. In [7, 22, 23], ICA methods based on mutual information are presented; the formulation minimizes the difference between the joint entropy and the marginal entropies of the estimated sources. In addition, the Euclidean distance divergence (ED-DIV) and the Kullback divergence (KL-DIV) were used as measure functions for nonnegative matrix factorization (NMF) problems in [24].
3. Definition of Independence and Higher Order Statistics
The various ICA algorithms can be divided into two classes according to how they characterize independence: algorithms that maximize the non-Gaussianity of the components, and algorithms that minimize their mutual information. ICA is meaningful only when one looks for components that are as non-Gaussian as possible.
In fact, for a random variable with a Gaussian distribution, all cumulants of order higher than two are null [15, 25]. Locating the ICs therefore involves the moments and cumulants of order higher than two.
Therefore, different notations need to be introduced to present and define contrast functions used in ICA.
3.1 Moments
For a variable x, the i-th moment mi is equal to:

mi = E{x^i}

where E denotes the expectation; for i = 1, m1 = mean(x).
A variable's moments characterize its probability density function, that is, its distribution.
3.2 Central Moments
For a variable x, the i-th central moment μi is equal to the i-th moment of the centered variable x − m1, i.e.:

μi = E{(x − m1)^i}

Hence the first central moment is μ1 = 0, and the second central moment is the variance, μ2 = σ².
The third central moment μ3 = E{(x − m1)³} is the (unnormalized) skewness and measures the asymmetry of the distribution. Skewness may be positive or negative, and it is null for a Gaussian distribution.
For a variable x, the fourth central moment is μ4 = E{(x − m1)⁴}. It is related to the kurtosis, which reflects the peakedness or flatness of the distribution.
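A short plug-in numerical sketch of these sample moments (the exponential sample is only an illustrative, right-skewed example):

import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=100_000)    # an asymmetric (right-skewed) sample

m1 = x.mean()                                   # first moment: the mean
mu = lambda i: np.mean((x - m1) ** i)           # i-th central moment E{(x - m1)^i}
print(mu(2), mu(3), mu(4))                      # variance, third (skewness) and fourth central moments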
3.3 Kurtosis
The central limit theorem states that a linear combination of independent random variables with finite-variance probability density functions (pdfs) tends toward a Gaussian distribution. Higher-order statistics, such as the fourth-order cumulant or kurtosis, are therefore generally used to measure non-Gaussianity. When the data are preprocessed to have unit variance, the kurtosis is determined by the fourth moment of the data.
Kurtosis is defined for a centered variable x as:

kurt(x) = E{x⁴} − 3(E{x²})²

which is a measure of the non-Gaussianity of the distribution. When the data are whitened and centered, i.e., E{x²} = 1, the kurtosis simplifies to:

kurt(x) = E{x⁴} − 3
When kurt(x) = 0 the distribution is declared Gaussian; when kurt(x) > 0 or kurt(x) < 0 the distribution is declared super-Gaussian or sub-Gaussian, respectively. The peak of the probability density function is very sharp for a super-Gaussian distribution and rather flat for a sub-Gaussian one.
It is thus possible to find non-Gaussian components by maximizing the absolute value of their kurtosis. One can optimize the components' independence by maximizing each individual kurtosis (maximum non-Gaussianity) while minimizing their mutual (cross) kurtosis, which may be described through the fourth-order cumulant function.
Thanks to its computational and mathematical simplicity, kurtosis has long been used in ICA and related fields as a measure of non-Gaussianity. It has a linearity property: for independent variables x1 and x2,

kurt(x1 + x2) = kurt(x1) + kurt(x2)

and

kurt(αx) = α⁴ kurt(x)

where α is a constant.
Kurtosis is easy to compute but statistically not very reliable. Therefore, a better measure of non-Gaussianity than kurtosis is often required.
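A small sketch of the excess-kurtosis computation for standardized data, comparing a sub-Gaussian (uniform), a Gaussian, and a super-Gaussian (Laplacian) sample:

import numpy as np

def kurt(x):
    # Excess kurtosis of a standardized sample: E{x^4} - 3 (E{x^2})^2.
    x = (x - x.mean()) / x.std()
    return np.mean(x ** 4) - 3 * np.mean(x ** 2) ** 2

rng = np.random.default_rng(2)
print(kurt(rng.uniform(-1, 1, 100_000)))    # < 0 : sub-Gaussian (flat pdf)
print(kurt(rng.standard_normal(100_000)))   # ~ 0 : Gaussian
print(kurt(rng.laplace(size=100_000)))      # > 0 : super-Gaussian (peaked pdf)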
3.4 Cumulants
Cumulants generalize the covariances used in second-order statistics. For centered variables, the first three cumulants are equal to the corresponding moments, i.e.:

κ1 = E{x} = 0,  κ2 = E{x²},  κ3 = E{x³}

The fourth-order cumulant is expressed as:

κ4 = E{x⁴} − 3(E{x²})²

and hence coincides with the kurtosis.
The fourth-order auto- and cross-cumulants of four zero-mean variables ui, uj, uk and ul are given by:

Cum(ui, uj, uk, ul) = E{ui uj uk ul} − E{ui uj}E{uk ul} − E{ui uk}E{uj ul} − E{ui ul}E{uj uk}
In general, the fourth-order auto-cumulant of a centered variable is identical to its kurtosis. Cumulants can be arranged in a tensor; the cumulant tensor is the generalization of the covariance matrix, with the auto-cumulants on its diagonal. An auto-cumulant involves a single variable and plays the role of a variance, whereas a cross-cumulant involves several variables and plays the role of a covariance. For statistically independent variables, the auto-cumulants are maximal and the off-diagonal elements of the cumulant tensor are null.
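A sketch of the plug-in estimate of the fourth-order auto-/cross-cumulant formula above for zero-mean samples (the Laplacian and uniform sources are arbitrary illustrative choices):

import numpy as np

def cum4(ui, uj, uk, ul):
    # Sample fourth-order cumulant of four zero-mean variables.
    E = lambda a: np.mean(a)
    return (E(ui * uj * uk * ul)
            - E(ui * uj) * E(uk * ul)
            - E(ui * uk) * E(uj * ul)
            - E(ui * ul) * E(uj * uk))

rng = np.random.default_rng(3)
s1, s2 = rng.laplace(size=100_000), rng.uniform(-1, 1, 100_000)
s2 = s2 - s2.mean()                     # center the uniform sample
print(cum4(s1, s1, s1, s1))             # auto-cumulant: the kurtosis of s1 (clearly non-zero)
print(cum4(s1, s1, s2, s2))             # cross-cumulant of independent sources (close to zero)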
Kurtosis is only a rough approximation of non-Gaussianity and is highly sensitive to outliers; it is therefore a non-robust estimate of non-Gaussianity. Negentropy is another measure of (non-)Gaussianity that is robust, and hence it is preferred over kurtosis.
3.5 Negentropy
Negentropy, which is based on the information-theoretic quantity of (differential) entropy, can be used to measure non-Gaussianity. The entropy of a discrete variable is the negative sum, over all its possible values, of the probability of each value times its log-probability. As Hyvarinen explains, the entropy of a random variable may be understood as the degree of information provided by observing the variable: the more 'random', i.e., unstructured and unpredictable, the variable is, the greater its entropy, and a Gaussian variable has the largest entropy among all random variables of equal variance [26].
Hence, entropy may also be a valid criterion for estimating a variable's non-Gaussianity. For a discrete variable x with possible values ai, it is defined as:

H(x) = −Σi P(x = ai) log P(x = ai)

(note the leading negative sign). For a continuous variable, the analogous quantity is called the differential entropy, given by the integral of the density times the log of the density:

H(x) = −∫ f(x) log f(x) dx
Negentropy is always nonnegative, and it is zero if and only if x has a Gaussian distribution. It is defined as:

J(x) = H(xGauss) − H(x)

where xGauss is a Gaussian random variable with the same covariance matrix as x. Negentropy has the valuable property of being invariant under invertible linear transformations, and it is a robust measure of non-Gaussianity. The more 'non-Gaussian' the variable, the higher its negentropy; hence one seeks to maximize the negentropy of a component when looking for ICs. The main downside of negentropy is that it is very hard to compute, so in practice one works with simpler approximations. Hyvarinen [26] gives a number of such approximations, for example:

J(x) ≈ (1/12) E{x³}² + (1/48) kurt(x)²

where x is a variable of zero mean and unit variance. This estimate, however, is based on kurtosis, which is not a reliable estimator. A further approximation may instead be used:

J(x) ≈ k [E{G(x)} − E{G(v)}]²

where v is a Gaussian variable with mean 0 and unit variance, k is a positive constant, and G is a non-quadratic function. Hyvarinen suggests two important choices of G:

G1(u) = (1/a1) log cosh(a1 u), with 1 ≤ a1 ≤ 2

and

G2(u) = −exp(−u²/2)
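A sketch of the last approximation, J(x) ≈ k [E{G(x)} − E{G(v)}]², with the logcosh choice G1 and k set to 1 for illustration (E{G(v)} is itself estimated from a Gaussian sample):

import numpy as np

def negentropy_approx(x, a1=1.0, k=1.0, seed=4):
    # J(x) ~ k * (E{G(x)} - E{G(v)})^2 with G(u) = (1/a1) log cosh(a1 u).
    x = (x - x.mean()) / x.std()                             # zero mean, unit variance
    G = lambda u: np.log(np.cosh(a1 * u)) / a1
    v = np.random.default_rng(seed).standard_normal(x.size)  # standardized Gaussian reference
    return k * (np.mean(G(x)) - np.mean(G(v))) ** 2

rng = np.random.default_rng(5)
print(negentropy_approx(rng.standard_normal(100_000)))       # close to 0 for Gaussian data
print(negentropy_approx(rng.laplace(size=100_000)))          # clearly positive for non-Gaussian data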
3.6 Maximum Likelihood Estimation
Maximum likelihood is a traditional method for estimating the independent components. It builds on well-known results for the density of a linear transform. Taking the basic ICA model into consideration, x(m) = A s(n), the density px of the observed mixture signal may be formulated as:

px(x) = |det W| ps(s) = |det W| ∏i pi(si)

where W = A⁻¹ and the last equality uses the fact that the sources are statistically independent, so that their joint density is the product of the marginal densities. Writing this density as a function of W = (w1, w2, …, wn)T and x gives:

px(x) = |det W| ∏i pi(wiT x)

Suppose we have T observations of x(m); the likelihood is then obtained as the product of this density evaluated at the T points. The likelihood of the matrix W is given by:

L(W) = ∏t ∏i pi(wiT x(t)) |det W|,  t = 1, …, T,  i = 1, …, n

Very often it is more practical to use the logarithm of the likelihood, since it is algebraically simpler; this makes no difference, because the maximum of the logarithm is attained at the same point as the maximum of the likelihood. The log-likelihood as a function of the parameter W is:

log L(W) = Σt Σi log pi(wiT x(t)) + T log |det W|

Simplifying the notation and dividing the log-likelihood by T, we get:

(1/T) log L(W) = E{Σi log pi(wiT x)} + log |det W|
The log-likelihood is thus a function of the separation matrix W and of the marginal densities of the estimated sources. Estimating those source densities is a non-parametric problem; measures of non-Gaussianity are used to address it.
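A sketch of the normalized log-likelihood (1/T) log L(W) above, assuming for illustration a fixed super-Gaussian model log-density log pi(s) = −2 log cosh(s) (additive constants dropped); the 2 × 2 mixture is a toy example:

import numpy as np

def ica_loglik(W, X):
    # (1/T) log L(W) = E{ sum_i log p_i(w_i^T x) } + log|det W|,
    # with the assumed log-density log p(s) = -2 log cosh(s) (constant dropped).
    S = W @ X                                   # rows are the estimated sources; X is m x T
    return np.mean(np.sum(-2 * np.log(np.cosh(S)), axis=0)) + np.log(abs(np.linalg.det(W)))

rng = np.random.default_rng(6)
S = rng.laplace(size=(2, 5000))
A = np.array([[1.0, 0.5], [0.3, 1.0]])
X = A @ S
print(ica_loglik(np.linalg.inv(A), X))          # log-likelihood at the true demixing matrix
print(ica_loglik(np.eye(2), X))                 # typically lower for an arbitrary W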
4. Different ICA Methods
4.1 FastICA
The FastICA algorithm was first introduced by Hyvarinen et al. [19]. FastICA is a fixed-point iterative algorithm for maximizing non-Gaussianity; it is an alternative to gradient-based methods and exhibits rapid (cubic) convergence.
The approach can optimize various forms of contrast functions, such as kurtosis or negentropy. Unlike gradient-based methods, FastICA has no learning rate or other user-tuned parameters. This is a major advantage, since a poor choice of learning rate can in general destroy convergence.
Hyvarinen's algorithm converges quickly as it seeks the components one by one. For independent component estimation, FastICA uses kurtosis [19]. Whitening is generally performed on the data before the algorithm is executed; this ensures that all correlation within the data is eliminated, i.e., the data are uncorrelated.
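A sketch of the usual centering and whitening step (via the eigen-decomposition of the sample covariance), after which the data have approximately identity covariance:

import numpy as np

def whiten(X):
    # Center X (m x T) and whiten it using the eigen-decomposition of its covariance.
    Xc = X - X.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(Xc))           # covariance C = E diag(d) E^T
    V = E @ np.diag(d ** -0.5) @ E.T            # whitening matrix V = C^(-1/2)
    return V @ Xc, V

X = np.array([[1.0, 0.7], [0.2, 1.0]]) @ np.random.default_rng(7).laplace(size=(2, 5000))
Z, V = whiten(X)
print(np.round(np.cov(Z), 2))                   # approximately the identity matrix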
Negentropy, based on the information-theoretic quantity of entropy, is robust but computationally more complicated than kurtosis. Nevertheless, computationally simple approximations of negentropy are available to relieve the complexity of its computation.
The boxes below give the basic one-unit iteration and two different algorithms, deflation and parallel, for performing FastICA.
4.2 INFOMAX
This approach maximizes the entropy of the nonlinear outputs (the information flow) of a neural network and is called Infomax [18]. Infomax locates ICs by maximizing the joint entropy of the outputs.
Bell and colleagues formulated a method based on Linsker's Infomax principle [27] to derive unsupervised neural-network learning rules, and this was successful in solving the blind source separation (BSS) problem.
One-unit algorithm
1. Center the data to zero mean and then whiten it, giving x.
2. Choose an initial p-vector w with unit norm.
3. Let G be a non-quadratic contrast function with first and second derivatives g and g′.
4. Let w ← E{x g(wT x)} − E{g′(wT x)} w.
5. Let w ← w / ‖w‖.
6. Iterate steps 4 and 5. Stop once convergence has been reached.
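A minimal numerical sketch of the one-unit fixed-point iteration above, assuming whitened data and the logcosh nonlinearity (g = tanh, g′ = 1 − tanh²); this only illustrates the updates in steps 4-5, not a full implementation:

import numpy as np

def fastica_one_unit(Z, max_iter=200, tol=1e-6, seed=0):
    # One-unit FastICA on whitened data Z (p x T) with g = tanh.
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(Z.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        wz = w @ Z                                             # projections w^T z
        w_new = (Z * np.tanh(wz)).mean(axis=1) - (1 - np.tanh(wz) ** 2).mean() * w
        w_new /= np.linalg.norm(w_new)                         # step 5: renormalize
        if abs(abs(w_new @ w) - 1) < tol:                      # converged: direction unchanged (up to sign)
            return w_new
        w = w_new
    return w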
Deflation algorithm
1. Center the data to zero mean and then whiten it, giving x.
2. Choose the number m of independent components to be extracted.
3. For l = 1, 2, …, m:
– Initialize (e.g., randomly) the p-vector wl to have unit norm.
– Let wl ← E{x g(wlT x)} − E{g′(wlT x)} wl.
– Use Gram-Schmidt to orthogonalize wl against the previously found vectors: wl ← wl − Σj<l (wlT wj) wj.
– Let wl ← wl / ‖wl‖.
– Iterate the three previous steps until wl has converged.
4. Set l ← l + 1. If l ≤ m, return to step 3.
Parallel algorithm
1. Center the data to zero mean and then whiten it, giving x.
2. Choose the number m of independent components to be extracted.
3. Initialize (e.g., randomly) the p-vectors wl, l = 1, …, m, each with unit norm, and let W be the matrix with rows wlT.
4. Conduct a symmetric orthogonalization of W by W ← (WWT)−1/2 W.
5. For each l = 1, 2, …, m, let wl ← E{x g(wlT x)} − E{g′(wlT x)} wl.
6. Update W with the new component vectors wl.
7. Conduct another symmetric orthogonalization of W.
8. If convergence has not been reached, return to step 5.
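A sketch of the symmetric orthogonalization step W ← (WWT)−1/2 W used in steps 4 and 7 of the parallel algorithm above:

import numpy as np

def sym_orthogonalize(W):
    # Symmetric (simultaneous) orthogonalization: W <- (W W^T)^(-1/2) W.
    d, E = np.linalg.eigh(W @ W.T)
    return E @ np.diag(d ** -0.5) @ E.T @ W

W = np.random.default_rng(8).standard_normal((3, 3))
Wo = sym_orthogonalize(W)
print(np.round(Wo @ Wo.T, 6))                   # identity: the rows are now orthonormal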
Infomax algorithm
1. Initialize W*(0) (e.g., randomly).
2. W*(t + 1) = W*(t) + η(t)(I − f(Y)YT)W*(t), where Y = W*(t)x.
3. If not converged, go back to step 2.
JADE algorithm
1. Form the sample covariance Rx and calculate a whitening matrix W+.
2. Form the sample fourth-order cumulant Qz of the whitened process z(m) = W+x(m); calculate the n most significant eigenpairs {λr, Mr | 1 ≤ r ≤ n}.
3. Jointly diagonalize the set N = {λrMr | 1 ≤ r ≤ n} with a unitary matrix U.
4. A+ = W+U is the estimate of A.
Kernel ICA-KGV algorithm
Input: data vectors y1, y2, …, yN and a kernel K(x, y).
1. Whiten the data.
2. Minimize the contrast function C(W) (with respect to W), defined through the following steps:
a. Compute the centered Gram matrices K1, K2, …, Km of the estimated sources {x1, x2, …, xN}, where xi = W yi.
b. From these Gram matrices, define the kernel canonical correlation (KCC) or kernel generalized variance (KGV) contrast (see [31]).
Output: W
The non-linearities in the transfer function can capture higher-order moments of the input distribution and reduce redundancy, which helps the neural network identify components that are statistically independent in the input data. This method has also been shown to be equivalent to maximum-likelihood methods [15]. Amari et al. (1996) proposed the algorithm shown in the Infomax box above to calculate the unmixing matrix W [28].
Here η(t) is a learning-rate function and f(⋅) is a function related to the nature of the source distributions (i.e., super-Gaussian or sub-Gaussian). Note that the initial value W*(0) is usually a random matrix [22]. For more detail on the Infomax procedure, see [22, 28].
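A sketch of the update rule W*(t + 1) = W*(t) + η(t)(I − f(Y)YT)W*(t) with a constant learning rate and f = tanh (a common choice for super-Gaussian sources); the batch average f(Y)YT/T is used in place of the expectation:

import numpy as np

def infomax(X, eta=0.01, n_iter=500, seed=0):
    # Natural-gradient Infomax-style update: W <- W + eta * (I - f(Y) Y^T / T) W, f = tanh.
    rng = np.random.default_rng(seed)
    m, T = X.shape
    W = np.eye(m) + 0.1 * rng.standard_normal((m, m))   # random initialization near the identity
    for _ in range(n_iter):
        Y = W @ X
        W = W + eta * (np.eye(m) - np.tanh(Y) @ Y.T / T) @ W
    return W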
4.3 JADE
Joint Approximate Diagonalization of Eigenmatrices (JADE) [20, 29] is a method based on the joint diagonalization of cumulant matrices, used especially in signal processing applications in chemometrics [30]. Cumulants of orders two and four are involved, and the joint diagonalization is carried out with the Jacobi technique. The JADE algorithm has no adjustable parameters, which makes it robust. However, the approach is computationally intensive, since all cumulant matrices are diagonalized at once.
The algorithm works well in low dimensions but poorly in high-dimensional spaces. The matrix X is first converted to a reduced set of PCA loadings, which are then centered and whitened to equal variances. The auto- and cross-cumulants of these loadings are collected in a fourth-order tensor of dimension n × n × n × n (n is the number of loadings). The tensor is projected onto its orthogonal eigenmatrices and diagonalized by a rotation matrix (using the Jacobi algorithm). This rotation matrix is combined with the whitening matrix from the pre-processing stage, providing the computation of the matrix W.
Equations (2) and (1) then give, respectively, the independent components and the mixing matrix. The description of the algorithm, summarized in the box above, can be found in [29, 30].
4.4 KERNEL ICA
Kernel ICA [31], a non-parametric approach, works by defining a contrast function on a reproducing kernel Hilbert space. The contrast function may be chosen either as a kernel canonical correlation (KCC) or as a kernel generalized variance (KGV). The mixtures are mapped to a higher-dimensional space, and the demixing matrix W is obtained by minimizing pairwise correlations in that space. For reproducing kernel Hilbert spaces based on Gaussian kernels, this can be shown to guarantee that the sources are independent. In addition, this approach is argued to be more robust to outliers than previous ICA algorithms.
Bach and Jordan's Kernel ICA-KGV algorithm [31] is outlined in the box above.
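A sketch of step (a) of the Kernel ICA procedure in the box above: the centered Gram matrix of one estimated source under a Gaussian kernel (the bandwidth sigma is a free parameter chosen here only for illustration):

import numpy as np

def centered_gram(x, sigma=1.0):
    # Centered Gram matrix for a 1-D sample x with kernel k(a, b) = exp(-(a - b)^2 / (2 sigma^2)).
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))
    N = x.size
    H = np.eye(N) - np.ones((N, N)) / N         # centering matrix
    return H @ K @ H

x = np.random.default_rng(9).standard_normal(200)
K = centered_gram(x)
print(K.shape, np.allclose(K.sum(axis=0), 0))   # rows and columns sum to ~0 after centering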
RADICAL algorithm
Input: data vectors X1, X2, …, XN, assumed whitened.
Parameters:
m: size of spacing, typically equal to √N.
R: number of replicated points per original data point.
K: number of angles at which the cost function is evaluated.
Procedure:
1. Create X′ by replicating R points with Gaussian noise for each original point.
2. For each angle θ, rotate the data (Y = W(θ)X′) and evaluate the cost function.
3. Output the W corresponding to the optimal θ.
Output: W (demixing matrix).
All parameters and notation are taken from Learned-Miller and Fisher III (2003) [32].
4.5 RADICAL
RADICAL (Robust, Accurate and Direct ICA algorithm) is an efficient entropy-estimator-based ICA algorithm [32].
The approach is based on the direct minimization of a measure of departure from independence: the estimated Kullback-Leibler divergence between the joint distribution and the product of the marginal distributions. RADICAL's entropy estimator is a function of the order statistics.
The entropy estimator used is consistent and exhibits rapid convergence; it is computationally efficient and claimed to be robust to outliers.
The RADICAL procedure, proposed by E. G. Learned-Miller and J. W. Fisher III [32], is outlined in the box above.
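A two-dimensional sketch of the RADICAL idea: estimate each marginal entropy with a spacings (order-statistics) estimator and search over rotation angles for the minimum of the summed marginal entropies. The Vasicek-style m-spacing estimator below is a stand-in for the exact estimator of [32]; the angle grid and the spacing m = √N follow the parameter descriptions in the box above:

import numpy as np

def spacing_entropy(x, m=None):
    # m-spacing (Vasicek-style) entropy estimate from order statistics.
    x = np.sort(x)
    N = x.size
    m = m or int(np.sqrt(N))
    gaps = x[m:] - x[:-m]
    return np.mean(np.log((N + 1) / m * np.maximum(gaps, 1e-12)))

def radical_2d(X, n_angles=150):
    # Search rotation angles in [0, pi/2) for the most independent (minimum-entropy) rotation.
    best_cost, best_W = np.inf, None
    for theta in np.linspace(0, np.pi / 2, n_angles, endpoint=False):
        c, s = np.cos(theta), np.sin(theta)
        W = np.array([[c, -s], [s, c]])
        Y = W @ X
        cost = spacing_entropy(Y[0]) + spacing_entropy(Y[1])
        if cost < best_cost:
            best_cost, best_W = cost, W
    return best_W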
4.6 ICA with PSO
Particle Swarm Optimization (PSO) is a well-known population-based search technique. PSO explores the search space of a given problem to find the settings or parameters that maximize or minimize a specific objective. The algorithm works by maintaining several candidate solutions in the search space at the same time. PSO proceeds in three steps: first, it evaluates the fitness of each moving particle; second, it updates the individual and global best fitness values and positions; and finally, it updates the velocity and position of each particle.
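A minimal sketch of the three PSO steps just described, for a generic objective to be maximized (the inertia and acceleration constants are typical textbook values, not those used in [33]):

import numpy as np

def pso_maximize(f, dim, n_particles=30, n_iter=100, seed=0, w=0.7, c1=1.5, c2=1.5):
    # Basic PSO: (1) evaluate fitness, (2) update personal/global bests, (3) update velocity and position.
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-1, 1, (n_particles, dim))
    vel = np.zeros_like(pos)
    pbest, pbest_val = pos.copy(), np.array([f(p) for p in pos])
    gbest = pbest[pbest_val.argmax()].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)   # step 3
        pos = pos + vel
        vals = np.array([f(p) for p in pos])                                 # step 1
        improved = vals > pbest_val                                          # step 2
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[pbest_val.argmax()].copy()
    return gbest

print(pso_maximize(lambda p: -np.sum((p - 0.3) ** 2), dim=2))   # converges near [0.3, 0.3]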
There has been considerable work on solving Independent Component Analysis with Particle Swarm Optimization. One such algorithm was presented in [33] for optimizing an objective function defined in terms of the fourth-order auto-cumulant of the separator outputs.
The objective function J is the kurtosis [34, 35], so it can be written as a function of an orthogonal matrix U to be determined by the optimization method. It is not easy to work directly with this kurtosis objective, so the objective function is later modified with the help of a reference vector, and a reference-based contrast function is defined. Reference signals are simply signals artificially introduced to facilitate the maximization of the contrast function.
Because the reference signals are only indirectly involved in the iterative optimization, these reference-based contrast functions share an appealing feature: the corresponding optimization problems are quadratic with respect to the searched parameters:
where E{⋅} denotes the expectation value and z is the reference signal. Considering another (reference) separation matrix V with z(m) = Vx(m), the contrast function can be written specifically in terms of W and V as follows:
where y(m) = Wx(m) and z(m) = Vx(m).
Gradient-based PSO
Input: x(m): observed signal, S: swarm size, δ: trade-off parameter
Output: Ubest: separation vector
Initialize U0 and the corresponding reference signal.
for k = 0, 1, …, kmax − 1 do
(particle fitness, best-position and velocity updates as given in [33])
end
Gradient-based PSO with fixed-point update
Input: x(m): observed signal, S: swarm size, δ: trade-off parameter
Output: Ubest: separation vector
Initialize U0 and the corresponding reference signal.
for k = 0, 1, …, kmax − 1 do
for l = 0, 1, …, lmax − 1 do
(particle and fixed-point update steps as given in [33])
end
end
Earlier suggestions for solving the ICA problem with PSO exist [36, 37]. The method of [33] differs in combining swarm search with gradient-based optimization: at each iteration, the particle velocity is modified using both the gradient direction and the direction of the global best. The two algorithms presented by Pati et al. [33] are shown in the boxes above.
5. Contrast Function for ICA
In independent component analysis, the estimation of the data model is generally carried out by formulating an objective function and then minimizing or maximizing it. This objective function is called the contrast function; many researchers use the terms 'loss function' or 'cost function' instead. In simpler terms, a contrast function is any function whose optimization makes it possible to estimate the independent components.
When the objective function is formulated explicitly, classical optimization methods can be used to optimize it, such as gradient methods, Newton methods and other iterative methods. In certain cases, however, the theory of the algorithm and of the estimation is more difficult.
Early work on BSS contrasts focused on Shannon entropy and the Kullback-Leibler divergence (KLD), i.e., information-theoretic definitions of independence and their approximations through higher-order statistics. The other important group of contrasts came from non-Gaussianity-based definitions of independence and their approximations [15]. More information on these commonly used, conventional contrast functions can be found in [12, 20].
5.1 General Contrast Functions
A one-unit contrast function was developed in [38] that has attractive statistical properties (in contrast to cumulant-based criteria), requires no prior knowledge of the densities of the independent components, and allows a simple algorithmic implementation. Optimizing this so-called one-unit contrast function makes it possible to estimate a single independent component rather than the entire ICA model. A family of such non-normality measures can be built from virtually any function G by considering the gap between the expectation of G over the actual data and its expectation over Gaussian data. Put another way, a contrast function JG can be defined that measures the non-normality of a zero-mean random variable y using essentially any non-quadratic, sufficiently smooth function G, as follows:

JG(y) = |Ey{G(y)} − Ev{G(v)}|^p

where v is a standardized Gaussian random variable, y is assumed to be normalized to unit variance, and the exponent is usually p = 1, 2. The subscripts indicate expectation with respect to y and v. (The JG notation should not be confused with the negentropy J.)
Clearly, JG can be regarded as a generalization of (the modulus of) kurtosis: for G(y) = y⁴, JG becomes simply the modulus of the kurtosis of y. Note that G must not be quadratic, because JG would then be trivially zero for all distributions. Thus JG can serve as a contrast function just like kurtosis, and it can indeed be shown to be a contrast function in an appropriate (local) sense.
In [39], the finite-sample statistical properties of the estimators obtained by optimizing such a general contrast function were evaluated. It was found that, for a suitable choice of G, the statistical properties of the estimator (asymptotic variance and robustness) are significantly better than those of cumulant-based estimators. The following choices of G were suggested:

G1(u) = (1/a1) log cosh(a1 u)  and  G2(u) = exp(−a2 u²/2)

where a1, a2 ≥ 1 are suitable constants. Without detailed information about the distribution of the independent components or about outliers, these two functions appear to be good general-purpose contrast functions in most cases. It was experimentally observed that values 1 ≤ a1 ≤ 2 with a2 = 1 give good approximations. One explanation is that G1 above corresponds to the log-density of a super-Gaussian distribution and is therefore closely connected to maximum likelihood estimation.
In the BSS problem, one considers linear combinations of the observed mixtures x(m), say wT x(m), where the weight vector w is constrained so that E{(wT x(m))²} = 1. The algorithms are then based on extrema of the squared kurtosis K²(wT x(m)) = (E{(wT x(m))⁴} − 3)² of such linear combinations [7, 23]. The squared kurtosis can be viewed as an approximation to the negentropy of wT x(m). One may show that the squared kurtosis of wT x(m) is maximized exactly when the linear combination equals, up to sign, one of the ICs, i.e., wT x(m) = ±si.
A contrast function can be built in the same way from essentially any non-quadratic, well-behaved even function G instead of kurtosis. Such a contrast function may generally be written as:

JG(w) = [E{G(wT x(m))} − E{G(v)}]²

where v is a standardized Gaussian variable. JG can be considered a generalization of the squared kurtosis, since for G(u) = u⁴, JG reduces to the squared kurtosis of wT x(m). JG is locally maximized when wT x(m) = ±si, and can therefore be used as a contrast function just like the squared kurtosis. Widely used one-unit contrast functions are:
Skew: g(x) = x²
Pow3: g(x) = x³
(with G(x) = x⁴/4, x²/2)
Gauss: g(x) = x exp(−x²/2)
Tanh: g(x) = tanh(x), G(x) = log cosh(x)
The key advantages of the FastICA algorithm are its speed (superior to gradient-based schemes), its ease of use (it requires no assumed probability distribution and no tuning of other parameters), and its flexibility: performance can be optimized through the choice of the contrast function G(x), or equivalently g(x) = G′(x).
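A sketch that evaluates the one-unit contrast JG(w) = [E{G(wT z)} − E{G(v)}]² for a candidate direction w on whitened data, using the logcosh choice G(x) = log cosh(x) from the list above; the unit-variance Laplacian signals are an arbitrary toy example:

import numpy as np

def contrast_JG(w, Z, seed=0):
    # J_G(w) = (E{G(w^T z)} - E{G(v)})^2 with G(u) = log cosh(u) and v a standardized Gaussian.
    G = lambda u: np.log(np.cosh(u))
    w = w / np.linalg.norm(w)                       # unit norm, so w^T z keeps unit variance
    v = np.random.default_rng(seed).standard_normal(100_000)
    return (np.mean(G(w @ Z)) - np.mean(G(v))) ** 2

rng = np.random.default_rng(10)
Z = rng.laplace(size=(2, 20_000))
Z /= Z.std(axis=1, keepdims=True)                   # two unit-variance non-Gaussian signals
print(contrast_JG(np.array([1.0, 0.0]), Z))         # larger: aligned with a single source
print(contrast_JG(np.array([1.0, 1.0]), Z))         # smaller: a mixture, closer to Gaussian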
5.2 Contrast Function without Permutation Ambiguity
Under the pre-whitening assumption, a linear combination of the fourth-order marginal cumulants (kurtoses) of the separator outputs is a valid contrast function for ICA if the weights have the same signs as the source kurtoses.
If the weights are equal to the source kurtoses, the contrast is a cumulant criterion based on the maximum likelihood principle.
If the weights differ from the source kurtoses (even without matching them), the contrast additionally eliminates the permutation ambiguity, since the estimated sources appear at the separator output ordered according to their kurtosis values, in the same order as the weights. For more details, see [40].
5.3 Non-differentiable Contrast Functions
For ICA, the contrast function can be global (multi-unit) or component-wise (single-unit). In the multi-unit case, the function C(y) summarizes the level of independence between all pairs of components in one scalar value. A single-unit contrast function measures a quantity associated with the i-th component of y that is typically higher for independent signals than for mixtures of signals.
A multi-unit algorithm follows the symmetric approach (all sources are extracted simultaneously), whereas a single-unit algorithm follows the deflation approach (the sources are extracted one after another). The component-wise contrast function C(y, i) is written as C(wi z), where wi is the i-th row of W and wi is orthogonal to every other row wj. Positive and negative angular variations of wi that preserve the unit norm are defined and denoted as follows:
The corresponding contrast values may be written as C(wi↑j z) and C(wi↓j z).
The maximization of the contrast function here relies on the following assumptions:
The contrast function should be continuous, or at least almost continuous, with respect to α.
All maxima of the contrast function are differentiable with respect to α.
More information on non-differentiable contrast functions can be found in [41].
5.4 Quadratic Contrast Function
An appealing approach to the blind equalization problem is the use of a suitable contrast function. A contrast function plays the role of an objective function in the sense that its (global) maximization makes it possible to solve the problem. In [42], a contrast function for i.i.d. source signals is defined as follows:
Definition 1:
Let C(⋅) be a real function of the signal:
where
P1. ∃ l ∈ Z such that, for all possible outputs y(n) of the equalizer:
P2. If equality holds in P1, then
Definition 1 cannot be used for non-i.i.d. source signals, since for such signals the independence property only leads to a source being extracted up to a scalar filter. A generalization of Definition 1 is therefore required for non-i.i.d. source signals.
Definition 2:
The real function C(⋅) is called a contrast function when there exists i0 ∈ {1, ..., N} such that:
P1. For all possible equalizer outputs:
5.5 Reference Based Contrast Function
A Singular Value Decomposition (SVD) based maximization algorithm is substantially faster than other maximization algorithms.
However, because of its sensitivity to rank estimation, that method frequently requires the filter orders to be known accurately. A kurtosis-based gradient optimization method with reference signals uses an optimal step size and requires no such estimation.
The drawback of the SVD-based methods can thus be managed. During the optimization process the reference signals involved in this method are fixed, which can result in poor separation performance if the corresponding reference signals are initialized inappropriately. The reference signals are usually injected artificially into the algorithm so that the contrast function can be maximized.
We consider a linear separator whose output is defined as y(m) = Wx(m), where W is the n × m separation matrix and y(m) is the estimate of s(n). The 'parameters searched' here are the rows of W, under the usual assumption of independent sources and the general definition of Cum{⋅}. The signals may be real-valued or complex-valued. Considering real-valued signals, for any jointly stationary signals y(m) and z(m), let:

where E{⋅} denotes the expectation value.
Introducing reference signals, one can consider, analogously to y(m) = Wx(m), another n × m separating matrix denoted by V. The corresponding output is:

z(m) = V x(m)

where the components of z(m) are the reference signals. The reference signals directly enter the reference-based contrast functions, and their values, in particular their initialization, affect the optimization results.
With the criteria:

where J is the well-known kurtosis contrast function and I is the reference-based contrast function.
As described in [24], ∇ denotes the gradient operator, and the partial gradient operators with respect to the first and second arguments are denoted ∇1 and ∇2, respectively. More precisely, ∇J(w) is the vector of all partial derivatives of J(w), whereas ∇1I(w, v) and ∇2I(w, v) are the vectors of partial derivatives of I(w, v) with respect to w and v. Combining these criteria:

Because x(m) is prewhitened and w, v are normalized, it follows that
More details on the reference-based contrast can be found at [24, 42, 43].
6. Applications of ICA in Real World
ICA has been applied successfully to a number of practical problems beyond signal processing in telecommunications [7, 26]. Its use has now expanded to a wide variety of domains. Some of these applications include:
6.1 Face Recognition by ICA
Today, attacks on data security remain a top concern, and reliable recognition of human faces has become a major research area of computer science, artificial intelligence and machine learning.
Building a face recognition system that replicates the human ability to recognize faces is a non-trivial task, and modeling the varying, uncertain and imprecise conditions has posed considerable challenges to researchers over the past few decades.
The problem of face recognition can be stated as follows: given a database of face images and a query face image, find the most similar face images in the database. Bartlett et al. (2002) [66] proposed an ICA-based method for face recognition. Two architectures are presented: spatially localized basis images for the faces, and a factorial face code; both are shown to be superior to PCA.
7. Conclusion
This review provides basic information about ICA and its methodology; in addition to the earlier surveys, a few more contrast functions and recent algorithms for ICA have been brought together. ICA is a general term covering a wide variety of applications in neural computation, signal processing and statistics. ICA provides a systematic transformation or representation of multidimensional data for subsequent information processing.
The transformation helps to examine the data and to discover interesting structures, rules and patterns. It is clear from the discussion that ICA rests mainly on two ingredients: a contrast function and its optimization algorithm. Reference-based contrast functions are especially attractive, since the corresponding maximization problem is quadratic with respect to the searched parameters. Non-differentiable contrast functions are useful when the sources are extracted one by one (the deflation approach).
Maximizing a non-differentiable contrast function relies on the assumptions that the contrast function is continuous, or at least almost continuous, and that all of its maxima are differentiable. Evolutionary computing techniques are common population-based optimization methods; genetic algorithms and swarm intelligence are the most widely applied among them.
Particle swarm optimization (PSO) has been used within ICA, and various other biologically inspired optimization algorithms are now being applied to the ICA method as well.