IEEE SIGNAL PROCESSING MAGAZINE, VOLUME 30, ISSUE 4, 2013 1 Kernel Multivariate Analysis Framework for Supervised Subspace Learning: A Tutorial on Linear and Kernel Multivariate Methods arXiv:1310.5089v1 [stat.ML] 18 Oct 2013 Jerónimo Arenas-Garcı́a, Senior Member, IEEE, Kaare Brandt Petersen, Gustavo Camps-Valls, Senior Member, IEEE, and Lars Kai Hansen Abstract: Feature extraction and dimensionality reduction are important tasks in many fields of science dealing with signal processing and analysis. The relevance of these techniques is increasing as current sensory devices are developed with ever higher resolution, and problems involving multimodal data sources become more common. A plethora of feature extraction methods are available in the literature collectively grouped under the field of Multivariate Analysis (MVA). This paper provides a uniform treatment of several methods: Principal Component Analysis (PCA), Partial Least Squares (PLS), Canonical Correlation Analysis (CCA) and Orthonormalized PLS (OPLS), as well as their non-linear extensions derived by means of the theory of reproducing kernel Hilbert spaces. We also review their connections to other methods for classification and statistical dependence estimation, and introduce some recent developments to deal with the extreme cases of large-scale and low-sized problems. To illustrate the wide applicability of these methods in both classification and regression problems, we analyze their performance in a benchmark of publicly available data sets, and pay special attention to specific real applications involving audio processing for music genre prediction and hyperspectral satellite images for Earth and climate monitoring. I. I NTRODUCTION As sensory devices develop with ever higher resolution and the combination of diverse data sources is more common, feature extraction and dimensionality reduction become increasingly important in automatic learning. This is especially true in fields dealing with intrinsically high-dimensional signals, such as those acquired for image analysis, spectroscopy, neuroimaging, and remote sensing, but also for situations when many heterogeneous features are computed from a signal and stacked together for classification, clustering, or prediction. Multivariate analysis (MVA) constitutes a family of methods for dimensionality reduction successfully used in several scientific areas [1]. The goal of MVA algorithms is to exploit correlations among the variables to find a reduced set of features that are relevant for the learning task. Among the most well-known MVA methods are Principal Component Analysis (PCA), Partial Least Squares (PLS), and Canonical Correlation Analysis (CCA). PCA disregards the target data and exploits correlations between the input variables to maximize the variance of the projections, while PLS and CCA look for projections that maximize, respectively, the covariance and the correlation between the features and the target data. Therefore, they should in principle be preferred to PCA for regression or classification problems. In this paper, we consider also a fourth MVA method known as Orthonormalized PLS (OPLS) that is also well-suited to supervised problems, with certain optimality in least-squares (LS) multiregression. A common advantage of these methods is that they can be formulated using standard linear algebra, and can be implemented as standard or generalized eigenvalue problems. Furthermore, implementations and variants of these methods have been proposed that operate either in a blockwise or iterative manner to improve speed or numerical stability. No matter how refined the various methods of MVA are, they are still constrained to account for linear input-output relations. Hence, they can be severely challenged when features exhibit non-linear relations. In order to address these problems, non-linear versions of MVA methods have been developed and these can be classified into two fundamentally different approaches [2]: 1) The modified methods in which the linear relations among the latent variables are substituted by non-linear relations [3], [4]; and 2) Variants in which the algorithms are reformulated to fit a kernel-based approach [5]– [7]. In this paper, we will review the second approach, in which the input data is mapped by a non-linear function into a high-dimensional space where the ordinary MVA problems are stated. A central property of this kernel approach is the exploitation of the so-called kernel trick, by which the inner products in the transformed space are replaced with a kernel function working solely with input space data so the explicit non-linear mapping is not explicitly necessary. An appealing property of the resulting kernel algorithms is that they obtain the flexibility of non-linear expressions using straightforward methods from linear algebra. However, supervised kernel MVA methods are hampered in applications involving large datasets or a small number of labeled samples. Sparse and incremental versions have been presented to deal with the former problem, while the field of semisupervised learning has recently emerged for the latter. Remedies to these problems involve a particular kind of regularization, guided either by selection of a reduced number of basis functions or by considering the information about the manifold conveyed by the unlabeled samples. Both approaches will be also reviewed in this paper. Concretely, we aim: 1) To review linear and kernel MVA algorithms, providing their theoretical characterization and comparing their main properties under a common framework. 2) To present relations between kernel MVA and other 2 IEEE SIGNAL PROCESSING MAGAZINE, VOLUME 30, ISSUE 4, 2013 TABLE I ACRONYMS AND NOTATION USED IN THE PAPER . AR ECMWF GMM HSIC IASI (k)CCA kFD (k)MVA (k)OPLS (k)PCA (k)PLS MSE LDA LS OA RBF rkCCA rkOPLS rkPCA RMSE RTM skPLS UCI Autoregressive European Centre for Medium-Range Weather Forecasts Gaussian Mixture Model Hilbert-Schmidt Independence Criterion Infrared Atmospheric Sounding Interferometer (Kernel) Canonical Correlation Analysis Kernel Fisher Discriminant (Kernel) Multivariate Analysis (Kernel) Orthonormalized Partial Least Squares (Kernel) Principal Component Analysis (Kernel) Partial Least Squares Mean-square error Linear Discriminant Analysis Least-squares Overall Accuracy Radial Basis Function reduced complexity kCCA reduced complexity kOPLS reduced complexity kPCA Root-mean-square Error Radiative transfer models Sparse kPLS University of California, Irvine feature extraction methods based on Fisher’s discriminant analysis and nonparametric kernel dependence estimates. 3) To review sparse and semisupervised approaches that make the kernel variants practical for large-scale or undersampled labeled datasets. We will illustrate how these approaches overcome some of the difficulties that may have limited a more widespread use of the most powerful kernel MVA methods. 4) To illustrate the wide applicability of these methods, for which we consider several publicly available data sets, and two real scenarios involving audio processing for music genre prediction and hyperspectral satellite images for Earth and climate monitoring. Methods will be assessed in terms of accuracy and robustness to the number of extracted features. We continue the paper with a brief review of linear and kernel MVA algorithms, as well as connections to other methods. Then, Section III introduces some extensions that increase the applicability of kernel MVA methods in real applications. Section IV provides illustrative evidence of method’s performance. Finally, we conclude the paper in Section V with some discussion and future lines of research. II. M ULTIVARIATE A NALYSIS IN R EPRODUCING K ERNEL H ILBERT S PACES This section reviews the framework of MVA both in the linear case and with kernel methods. Interesting connections are also pointed out between particular classification and dependence estimation kernel methods. Table I provides a list of acronyms, and basic notation and variables that will be used throughout the paper. l u n=l+u d m X Y Cx , Cy Cxy X† W I kAkF nf ui , vi U, V X′ , Y ′ F φ(x) k(xi , xj ) Φ Kx = ΦΦ⊤ A Size of labelled dataset Number of unlabeled data Total number of training samples Dimension of input space Dimension of output space Centered input data (size l × d) Centered output data (size l × m) Input, output data sample covariance matrices Input-output sample cross-covariance matrix Moore-Penrose pseudoinverse of matrix X Regression coefficients matrix Identity matrix Frobenius norm of matrix A Number of extracted features ith projection vector for the input, output data [u1 , . . . , unf ], [v1 , . . . , vnf ]. Projection matrices Extracted features for the input, output data Reproducing kernel Hilbert Space Mapping of x in feature space hφ(xi ), φ(xj )iF . Kernel function [φ(x1 ), . . . , φ(xl )]⊤ . Input data in feature space Gram Matrix [α1 , · · · , αnf ]. Coefficients for U = Φ⊤ A A. Problem statement and notation Let us consider a supervised regression or classification problem, and let X and Y be columnwise-centered input and target data matrices of sizes l × d and l × m, respectively. Here, l is the number of training data points in the problem, and d and m are the dimensions of the input and output spaces, respectively. The target data can be either a set of variables that need to be approximated, or a matrix that encodes the class membership information. The sample covariance matrices are given by Cx = 1l X⊤ X and Cy = 1l Y⊤ Y, whereas the covariance between the input and output data is Cxy = 1l X⊤ Y. The objective of standard linear multiregression is to adjust a linear model for predicting the output variables from the b = XW, where W contains the regression input features, Y model coefficients. Ordinary least-squares (LS) regression solution is W = X† Y, where X† = (X⊤ X)−1 X⊤ is the Moore-Penrose pseudoinverse of X. Highly correlated input variables can result in rank-deficient Cx , making the inversion of this matrix unfeasible. The same situation is encountered in the small-sample-size case (i.e., when l < d, which is usually the case when using kernel extensions). Including a Tikhonov’s regularization term leads to better conditioned solutions: for instance, by also minimizing the Frobenius norm of the weights matrix kWk2F , one obtains the regularized LS solution W = (X⊤ X + λI)−1 X⊤ Y, where parameter λ controls the amount of regularization. The solution suggested by MVA to the above problem consists in projecting the input data into a subspace that preserves the most relevant information for the learning problem. Therefore, MVA methods obtain a transformed set of features via a linear transformation of the original data, X′ = XU, where U = [u1 , u2 , . . . , unf ] will be referred hereafter as the projection matrix, ui being the ith projection vector and nf the ARENAS-GARCÍA ET AL.: KERNEL MULTIVARIATE ANALYSIS FRAMEWORK 3 number of extracted features1 . Some MVA methods consider also a feature extraction in the output space, Y′ = YV, with V = [v1 , v2 , . . . , vnf ]. Generally speaking, MVA methods look for projections of the input data that are “maximally aligned” with the targets, and the different methods are characterized by the particular objectives they maximize. Table II provides a summary of the MVA methods that we will discuss in the rest of the section. An interesting property of linear MVA methods is that they are based on first and second order moments, and that their solutions can be formulated in terms of (generalized) eigenvalue problems. Thus, standard linear algebra methods can be readily applied. sensing, where signals typically are acquired in a range of highly correlated spectral wavelengths. Rather than maximizing covariance, CCA maximizes the correlation between projected input and output data [17]. In this way, CCA can more conveniently deal with directions of the input or output spaces that present very high variance, and that would therefore be over-emphasized by PLS, even if the correlation between the projected input and output data is not very significant. A final method we will pay attention to is Orthonormalized PLS (OPLS), also known as multilinear regression [18] or semi-penalized CCA [19]. OPLS is optimal for performing multilinear LS regression on the features extracted from the training data, i.e., U = arg min kY − X′ Wk2F . B. Linear Multivariate Analysis Principal Component Analysis, which is also known as the Hotelling transform or the Karhunen-Loève transform [9], is probably the most widely used MVA method and the oldest dating back to 1901 [10]. PCA selects the maximum variance projections of the input data, imposing an orthonormality constraint for the projection vectors (see Table II). PCA works under the hypothesis that high variance projections contain the relevant information for the learning task at hand. PCA is based on the input data alone, and is therefore an unsupervised feature extraction method. Methods that explicitly look for the projections that better explain the target data should in principle be preferred in a supervised setup. Nevertheless, PCA and its kernel version, kPCA, are used as a preprocessing stage in many supervised problems, likely because of their simplicity and ability to discard irrelevant directions [11], [12]. The PLS algorithm [13] is based on latent variables that account for the information in Cxy . In order to do so, PLS extracts the projections that maximize the covariance between the projected input and output data, again under orthonormality constraints for the projection vectors. This is done either as an iterative procedure or by solving an eigenvalue problem. In the iterative schemes, the data sets X and Y are recursively transformed in a process which subtracts the information contained in the already estimated latent variables. This process, which is often referred to as deflation, can be done in a number of ways that define the many variants of PLS. Perhaps the most popular PLS method was presented in [14]. The algorithm, hereafter referred to as PLS2, assumes a linear relation between X and Y that implies a certain deflation scheme, where the latent variable of X is used to deflate also Y [7, p. 182]. Several other variants of PLS exist such as ‘PLS Mode A’ and PLS-SB; see [15] for a discussion of the early history of PLS and [16] for a well-written contemporary overview. Among its many advantages, PLS does not involve matrix inversion and deals efficiently with highly correlated data. This has justified its very extensive use in fields such as chemometrics and remote 1 Note that strictly speaking U is not a projection operator, since it implies a transformation from Rd to Rnf and does not satisfy the idempotent property of projection operators. Nevertheless, if the columns of U are linearly independent, vectors ui constitute a basis of the subspace of Rd where the data is projected, and it is in this sense that we refer to ui and U as projection vectors and matrix, and to X′ = XU as projected data. This nomenclature has been widely adopted in the machine learning field [8]. (1) U with W = X′† Y being the matrix containing the optimal regression coefficients. It can be shown that this optimization problem is equivalent to the one stated in Table II. Alternatively, this problem can also be associated with the maximization of a Rayleigh coefficient involving projections (u⊤ Cxy v)2 of both input and output data, (u⊤ Cx u)(v ⊤ v) . It is in this sense that this method is called semi-penalized CCA, since it disregards variance of the projected input data, but rewards those input features that better predict large variance projections of the target data. This asymmetry makes sense in supervised subspace learning where matrix Y contains target values to be approximated from the extracted input features. In fact, it has been shown that for classification problems OPLS is equivalent to Linear Discriminant Analysis (LDA), provided an appropriate labeling scheme is used for Y [19]. However, in “two-view learning” problems in which X and Y represent different views of the data [7, Sec. 6.5], one would like to extract features that can predict both data representations simultaneously, and CCA could be preferred to OPLS. A common framework for PCA, PLS, CCA and OPLS was proposed in [18], where it was shown that these methods can be reformulated as (generalized) eigenvalue problems, so that linear algebra packages can be used to solve them. Concretely: PCA : Cx u = λu PLS :  0 Cxy C⊤ 0 xy  u v  =λ  u v  Cx 0 0 Cy  (2) OPLS : Cxy C⊤ xy u = λCx u CCA :  0 Cxy C⊤ xy 0  u v  =λ  u v  We can see that CCA and OPLS require the inversion of matrices Cx and Cy . If these are rank-deficient, then it becomes necessary to first extract the dimensions with non-zero variance using PCA, and then solve the CCA or OPLS problems. A very common approach is to solve the above problems using a two-step iterative procedure. In the first step, the projection vectors corresponding to the largest (generalized) eigenvalue are chosen, for which there exist 4 IEEE SIGNAL PROCESSING MAGAZINE, VOLUME 30, ISSUE 4, 2013 TABLE II S UMMARY OF LINEAR AND KERNEL MVA METHODS . F OR EACH METHOD IT IS STATED THE OBJECTIVE TO MAXIMIZE (1 ST ROW ), CONSTRAINTS FOR THE OPTIMIZATION (2 ND ROW ), AND MAXIMUM NUMBER OF FEATURES ( LAST ROW ). PCA ⊤ u Cx u ⊤ U U=I r(X) PLS ⊤ u Cxy v U⊤ U = I V⊤ V = I r(X) CCA ⊤ u Cxy v U ⊤ Cx U = I ⊤ V Cy V = I r(Cxy ) OPLS u ⊤ Cxy C⊤ xy u ⊤ kPCA α ⊤ K2x α ⊤ U Cx U = I A Kx A = I r(Cxy ) r(Kx ) kPLS kCCA ⊤ ⊤ α Kx Yv A ⊤ Kx A = I V⊤ V = I r(Kx ) kOPLS α Kx Yv A⊤ K2x A = I V ⊤ Cy V = I r(Kx Y) ⊤ α Kx YY ⊤ Kx α A⊤ K2x A = I r(Kx Y) Vectors u and α are column vectors in matrices U and A, respectively. r(·) denotes the rank of a matrix. efficient methods such as the power method. The second step is known as deflation, and consists in removing from the data or covariance matrices the variance that can be already explained with the features extracted in the first step. Alternative solutions for these methods can be obtained by reformulating them as regularized least squares minimization problems. For instance, the work in [20]–[22] introduced sparse versions of PCA, CCA and OPLS by adding sparsity promotion terms, such as LASSO or ℓ1 -norm on the projection vectors, to the LS functional. Figure 1 illustrates the features extracted by the methods for a toy classification problem with three classes. The data was generated from three noisy sinusoid fragments, so that a certain overlap exists between classes. For the application of supervised methods, class membership is defined in matrix Y using 1-of-c encoding [23]. Above each scatter plot we provide, for the first extracted feature, its sample variance, the largest covariance and correlation that can be achieved with any linear transformation of the output data, and the optimum mean-square-error (MSE) when that feature is used to 1 approximate the target data, i.e., lm kY−Xu1 (Xu1 )† Yk2F . As expected, the first projection of PCA, PLS and CCA maximize, respectively, the variance, covariance and correlation, whereas OPLS finds the projection that minimizes the MSE. However, since these methods can just perform linear transformations of the data, they are not able to capture any non-linear relations between the input variables. being the new argument for the optimization2 This is typically referred to as the kernel trick and has been used to develop kernel versions of the previous linear MVA, as indicated in the last four columns of Table II. For PCA, it was Schölkopf, Smola and Müller who in 1998 introduced a kernel version denoted kPCA [6]. Lai and Fyfe in 2000 first introduced the kernel version of CCA denoted kCCA [24] (see also [7]). Later, Rosipal and Trejo presented a non-linear kernel variant of PLS in [25]. In that paper, Kx and the Y matrix are deflated the same way, which goes more in line with the PLS2 variant than to the traditional algorithm ‘PLS Mode A’, and therefore we will denote it as kPLS2. A kernel variant of Orthonormalized PLS was presented in [26] and is here referred to as kOPLS. Specific versions of kernel methods to deal with signal processing applications have also been proposed, such as the temporal kCCA of [27], that is designed to exploit temporal structure in the data. As for the linear case, kernel MVA methods can be implemented as (generalized) eigenvalue problems: kPCA : Kx α = λα kPLS :  The framework of kernel MVA (kMVA) algorithms is aimed at extracting nonlinear projections while actually working with linear algebra. Let us first consider a function φ : Rd → F that maps input data into a Hilbert feature space F. The new mapped data set is defined as Φ = [φ(x1 ), · · · , φ(xl )]⊤ , and the features extracted from the input data will now be given by Φ′ = ΦU, where matrix U is of size dim(F) × nf . The direct application of this idea suffers from serious practical limitations when the dimension of F is very large, which is typically the case. To implement practical kernel MVA algorithms we need to rewrite the equations in the first half of Table II in terms of inner products in F only. For doing so, we rely on the availability of a kernel matrix Kx = ΦΦ⊤ of dimension l × l, and on the Representer’s Theorem [7], which states that the projection vectors can be written as a linear combination of the training samples, i.e, U = Φ⊤ A, matrix A = [α1 , . . . , αnf ]  α v  =λ  α v  Kx Kx 0 0 Cy  (3) kOPLS : Kx YY⊤ Kx α = λKx Kx α kCCA : C. Kernel Multivariate Analysis 0 Kx Y YKx 0  0 Kx Y YKx 0  α v  =λ  α v  It should be noted that the output data could also be mapped to some feature space H, as it was considered for kCCA in [24] for a multi-view learning case. Here, we consider that it is the actual labels in Y which need to be well-represented by the extracted input features, so we will deal with the original representation of the output data. For illustrative purposes, we have incorporated to Fig. 1 the projections obtained in the toy problem by kMVA methods using the radial basis function (RBF) kernel, k(xi , xj ) =  exp −kxi − xj k2 /(2σ 2 ) . Input data was normalized to zero mean and unit variance, and the kernel width σ was selected as the median of all pairwise distances between samples [28]. The same σ has been used for all methods, so that features are extracted from the same mapping of the input data. We can see that the non-linear mapping improves class separability. As expected, kPCA, kPLS and kCCA maximize in F the same variance, covariance and correlation objectives, respectively, 2 In this paper, we assume that data is centered in feature space, what can easily be done through a simple modification of the kernel matrix [7]. ARENAS-GARCÍA ET AL.: KERNEL MULTIVARIATE ANALYSIS FRAMEWORK 5 PCA PLS OPLS CCA var = 1.272; MSE = 0.153 var = 1.134; MSE = 0.139 var = 1.001; MSE = 0.135 var = 0.957; MSE = 0.136 cov = 0.353; corr = 0.299 cov = 0.404; corr = 0.401 cov = 0.391; corr = 0.431 cov = 0.381; corr = 0.433 Original data kPCA kPLS kOPLS kCCA var = 0.161; MSE = 0.136 var = 0.161; MSE = 0.135 var = 3.41e−6; MSE = 0.084 var = 2.45e−5; MSE = 0.124 cov = 0.156; corr = 0.421 cov = 0.157; corr = 0.43 cov = 0.001; corr = 0.856 cov = 0.002; corr = 0.915 Fig. 1. Features extracted by different MVA methods in a three-class problem. For the first feature extracted by each method we show its variance (var), the mean-square-error when the projected data is used to approximate Y (MSE), and the largest covariance (cov) and correlation (corr) that can be achieved with any linear projection of the target data. as their linear counterparts. kOPLS looks for the directions of data in F that can provide the best approximation of Y in the MSE sense. This example illustrates also that maximizing the variance or even the covariance may not be the best choice for supervised learning. Although kernel MVA methods can still be described in terms of linear equations, their direct solution faces several problems. In particular, it is well-known that kOPLS and kCCA can easily overfit the training data, so regularization is normally required to alleviate numerical instabilities [7], [26]. A second important problem is related to the computational cost. Since Kx is of size l × l, methods’ complexity scales quadratically with l in terms of memory, and cubically with respect to the computation time. Further, the solution of the maximization problem (matrix A) is not sparse, so that feature extraction for new data requires the evaluation of l kernel functions per pattern, becoming computationally expensive for large l. Finally, it is worth mentioning the opposite situation: when l is small, the extracted features may be useless, especially for high dimensional F [12]. Actually, the information content of the features is elusive and has not been characterized so far. These issues limit the applicability of supervised kMVA in real-life scenarios with either very large or very small labeled data sets. In Section III, we describe sparse and semisupervised approaches for kMVA that tackle both difficulties. D. Relations with other methods As already stated, close connections have been established among Fisher’s LDA, CCA, PLS, and OPLS for classification [19]; such links extend to their kernel counterparts as well. Under the framework of Rayleigh coefficients, Mika et al. [29], [30] extended LDA to its kernel version for binary problems, and Baudat and Anouar [31] proposed the generalized discriminant analysis (GDA) for multiclass problems. A great many kernel discriminants have appeared since then, focused on alleviating problems such as those induced by highdimensional small-sized datasets, the presence of noise, high levels of collinearity, or unbalanced classes. The number and heterogeneity of these methods makes difficult their unified treatment. Besides, in the recent years interesting connections of kMVA and statistical dependence estimates have been established. For instance, the Hilbert-Schmidt Independence Criterion (HSIC) [32] is a simple yet very effective method to estimate statistical dependence between random variables. HSIC corresponds to estimating the norm of the crosscovariance in F, whose empirical (biased) estimator is 1 HSIC := (l−1) 2 Tr(Kx Ky ), where Kx works with samples in the source domain and Ky = YY⊤ . It can be shown that, if the RKHS kernel is universal, HSIC asymptotically tends to zero when the input and output data are independent. The so-called Hilbert-Schmidt Component Analysis (HSCA) method iteratively seeks for projections that maximize dependence with the target variables and simultaneously minimize the dependence with previously extracted features, both in HSIC terms. This objective translates into the iterative resolution of the generalized eigen-decomposition problem Kx Ky Kx α = λKx Kf Kx α, where Kf is a kernel matrix of already extracted features in the previous iteration. If one is only interested in maximizing source-target dependence in HSIC terms, the problem boils down to solving kOPLS. Similarly, the connection between kCCA and other kernel measures of dependence, such as the kernel Generalized Variance (kGV) or the kernel Mutual Information (kMI), was introduced in [33]. The empirical kGV estimates dependence between input-output data with a function that depends on the entire spectrum of the associated correlation operator in RKHS, kGV(θ) = − 21 log(Π(1 − λ2i )), where λi are the i 6 IEEE SIGNAL PROCESSING MAGAZINE, VOLUME 30, ISSUE 4, 2013 solutions to the generalized eigenvalue problem and was later used in [35] to directly approximate the feature mapping itself rather than the kernel, thus giving rise to sparse α 0 Kx Ky = versions of kPLS and kCCA. v Ky Kx 0 Among the reduced set methods, a sparse kPCA (skPCA)    α θKx Kx + η(1 − θ)Kx 0 was proposed by Tipping in [36], where the sparsity in the , λ v 0 θKy Ky + η(1 − θ)Ky representation is obtained by assuming a generative model for where Kx and Ky are defined using RKHS kernels obtained the data in F that follows a normal distribution and includes via convolution of the associated Parzen windows, η is a a noise term with variance vn . The maximum likelihood scaling factor, and θ is a parameter in the range [0, 1]. Gretton estimation of the covariance matrix is shown to depend on et al. [33] showed that, under certain conditions, kGV reduces just a subset of the training data, and so it does the resulting solution. A sparse kPLS (skPLS) was introduced in [37]. The to kMI for θ = 0 and to kCCA for θ = 1 (cf. Eq. 3). It is worth noting that the previous kernel measures of statis- method uses a fraction of the training samples for computing tical dependence hold connections with Information Theoretic the projections. Each projection vector is found through a sort Learning concepts as well. For instance, it can be shown that of ε-insensitive loss similar to the one used in the support HSIC is intimately related to the concept of correntropy [34]. vector regression method. The sparsification is induced via All these connections could shed light in the future about the a multi-step adaptation with high computational burden. In informative content of the extracted features in a principled spite of obtaining sparse solutions, the algorithms from [36] and [37] still require the computation of the full kernel matrix way. during the training. A reduced complexity kOPLS (rkOPLS) was proposed III. E XTENSIONS FOR LARGE SCALE AND in [26] by imposing sparsity in the projection vectors repSEMISUPERVISED PROBLEMS resentation a priori, U = Φ⊤ r β, where Φr is a subset of the Supervised kernel MVA methods are hampered either by training data containing r samples (r ≪ l) and β is the new the wealth or the scarcity of labeled samples, which can make argument for the maximization problem, which now becomes: these methods impractical for many applications. We next summarize some extensions to deal with large scale problems max β ⊤ Krl YY⊤ K⊤ rl β (4) and semisupervised situations in which few labeled data is ⊤ ⊤ subject to : β Krl Krl β = 1, available. Since kernel matrix Krl = Φr Φ⊤ involves the inner products in F of all training points with the patterns in the reduced set, A. Sparse Kernel Feature Extraction this method still takes into account all data during the training A critical bottleneck of kernel methods is that for a dataset phase, and is therefore different from simple subsampling. of l samples, the kernel matrices are l × l, which, even for a This sparsification procedure avoids the computation of the moderate number of samples, quickly becomes a problem with full kernel matrix at any step of the algorithm. An additional respect to both memory and computation time. Furthermore, advantage of this method is that matrices Krl YY⊤ K⊤ rl and in kernel MVA this is also an issue during the extraction Krl K⊤ are both of size r × r, and can be expressed as sums rl of features for test data, since kernel MVA solutions will over the training data, so the storage requirements become just in general depend on all training data (i.e., matrix A will quadratic with r. Furthermore, the sparsity constraint acts as a generally be dense): evaluating thousands of kernels for every regularizer that can significantly improve the generalization new input vector is, in most applications, simply not acceptability of the method. In the experiments section, we will able. Furthermore, these so-called dense solutions may result apply the same sparsification procedure to kPCA and kCCA, in severe problems of overfitting, which is particularly true obtaining reduced complexity versions of these methods to for kOPLS and kCCA [7], [26]. To address these problems, which we will refer to as rkPCA and rkCCA. Interestingly, the several solutions have been proposed to obtain sparse solutions extension to kPLS2 is not straightforward, since the deflation that can be expressed as a combination of a subset of the step would still require the full kernel matrix Kll . training data, and therefore require only r kernel evaluations Alternatively, two sparse kPLS schemes were presented per sample (with r ≪ l) for feature extraction. Note that, in [38] under the name of Sparse Maximal Aligment (SMA) in contrast to the many linear MVA algorithms that induce and Sparse Maximal Covariance (SMC). Here kPLS iteratively sparsity with respect to the original variables, in this subsection estimates projections that either maximize the kernel alignwe review methods that obtain sparse solutions in terms of the ment (c.1) or the covariance (c.2) of the projected data and samples (i.e., sparsity in the αi vectors). the true labels: The approaches to obtain sparse solutions can be broadly max β ⊤ Kj Y divided into low rank approximation methods, that aim at working with reduced r × r matrices (r ≪ l), and reduced (5) subject to (c.1) : β ⊤ K2j β = 1, set methods that work with l × r matrices. Following the first ⊤ subject to (c.2) : β Kj β = 1, approach, the Nyström low-rank approximation of an l × l −1 kernel matrix Kll is expressed as K̃ll = Klr Krr Krl , where where K1 = Kx and Kj denotes the deflated kernel matrix subscripts indicate row and column dimensions. The method at iteration j, according to [38, Eq. (3)]. The method imposes was originally exploited in the context of Gaussian processes, the additional constraint that the cardinality of β is 1. This    ARENAS-GARCÍA ET AL.: KERNEL MULTIVARIATE ANALYSIS FRAMEWORK restriction explicitly enforces sparsity through an ℓ0 -norm in the weights space. At each iteration, β is obtained by performing an exhaustive search over all training patterns. However, the complexity of the algorithm can be significantly reduced by constraining the search to just p randomly chosen samples. In Table III, we summarize some computational and implementation issues of the aforementioned sparse kMVA methods, and of standard non-sparse kMVA and linear methods. An analysis of the properties of each algorithm provides some hints that can help us choose the algorithm for a particular application. Firstly, a critical step when using kernel methods is the selection of an appropriate kernel function and tuning its parameters. To avoid overfitting, kMVA methods can be adjusted using cross-validation at the cost of higher computational cost. Sparse methods can help in this respect by regularizing the solution. Secondly, most methods can be implemented as either eigenvalue or generalized eigenvalue problems, whose complexity typically scales cubically with the size of the analyzed matrices. Therefore, both for memory and computational reasons, only linear MVA and the sparse approaches from [26] and [38] are affordable when dealing with large data sets. A final advantage of sparse kMVA is the reduced number of kernel evaluations to extract features for new data. B. Semisupervised Kernel Feature Extraction When few labeled samples are available, the extracted features do not capture the structure of the data manifold well, and hence using them for classification or regression may lead to very poor results. Recently, semisupervised learning approaches have been introduced to alleviate these problems. Two approaches are encountered: the information conveyed by the unlabeled samples is either modeled with graphs or via kernel functions derived from generative clustering models. Notationally, we are given l labeled and u unlabeled samples, a total of n = l + u. The semisupervised kCCA (sskCCA) has been recently introduced in [28] by using the graph Laplacian. The method essentially solves the standard kCCA using kernel matrices computed with both labeled and unlabeled data, which are further regularized with the graph Laplacian:    α 0 Kxnl Kyln = y v 0 Kxnl Kln λ  Kxnl Kxln + Rxnn 0 0 Kynl Kyln + Rynn  α v  , (6) where Rxnn = αx Kxnn + γx Kxnn Lxnn Kxnn and Rynn = αy Kynn + γy Kynn Lynn Kynn . For notation compactness, subindexes here indicate the size of the corresponding matrices while superscripts denote whether they involve input or output data. Parameters αx , αy , γx and γy trade off the contribution of labeled and unlabeled samples, and L = D−1/2 (D−M)D1/2 represents the (normalized) graph Laplacian for the input and target domains, where D is the degree matrix whose entries are the sums of the rows of the corresponding similarity matrix 7 P M, i.e. Dii = j Mij . It should be noted that for n = l and null regularization, one obtains the standard kCCA (cf. Eq. 3). Note also that this form of regularization through the graph Laplacian can be applied to any kMVA method. A drawback of this approach is that it involves tuning several parameters and working with larger matrices of size 2n × 2n, which can make its application difficult in practice. Alternatively, cluster kernels have been used to develop semisupervised versions of kernel methods in general, and of kMVA methods in particular. The approach was used for kPLS and kOPLS in [39]. Essentially, the method relies on combining a kernel function based on labeled information only, ks (xi , xj ), and a generative kernel directly learned by clustering all (labeled and unlabeled) data, kc (xi , xj ). Building kc requires first running a clustering algorithm, such as Expectation-Maximization assuming a Gaussian mixture model (GMM) with different initializations, q = 1, . . . , Q, and with different number of clusters, g = 2, . . . , G + 1. This results in Q · G cluster assignments where each sample xi has its corresponding posterior probability vector π i (q, g) ∈ Rg . The probabilistic cluster kernel kc is computed by averaging all the dot products between posterior probability vectors, kc (xi , xj ) = Q G+1 1 XX π i (q, g)⊤ π j (q, g), Z q=1 g=2 where Z is a normalization factor. The final kernel function is defined as the weighted sum of both kernels, k(xi , xj ) = βks (xi , xj )+(1−β)kc (xi , xj ), where β ∈ [0, 1] is a scalar parameter to be adjusted. Intuitively, the cluster kernel accounts for probabilistic similarities at small and large scales (number of clusters) between all samples along the data manifold. The method does not require computationally demanding procedures (e.g. current GMM clustering algorithms scale linearly with n), and the kMVA still relies just on the labeled data, and thus requires an l × l kernel matrix. All these properties are quite appealing from the practitioner’s point of view. IV. E XPERIMENTAL R ESULTS In this section, we illustrate through different application examples the use and capabilities of the supervised kernel multivariate feature extraction framework. We start by comparing the performance of the linear and kernel MVA methods in a benchmark of classification problems from the publicly available Machine Learning Repository at University of California, Irvine (UCI)3 . We then consider two real applications to show the potential of these algorithms: satellite image processing [40] and audio processing for music genre prediction [41]. The size of the data set used in this second scenario is sufficiently large to make standard kMVA methods impractical, a situation that we will use to illustrate the benefits of the sparse extensions. A. UCI repository benchmark Our first battery of experiments deals with standard benchmark data sets taken from the UCI repository, and will be 3 8 IEEE SIGNAL PROCESSING MAGAZINE, VOLUME 30, ISSUE 4, 2013 TABLE III M AIN PROPERTIES OF ( K )MVA METHODS . C OMPUTATIONAL COMPLEXITY AND IMPLEMENTATION ISSUES ARE CATEGORIZED FOR THE CONSIDERED DENSE AND SPARSE METHODS , INDICATING FROM LEFT TO RIGHT: THE FREE PARAMETERS , NUMBER OF KERNEL EVALUATIONS (KE) DURING TRAINING , STORAGE REQUIREMENTS , WHETHER AN EIGENPROBLEM (EIG) OR GENERALIZED EIGENPROBLEM (GEV) NEEDS TO BE SOLVED , AND THE NUMBER OF KERNELS THAT NEED TO BE EVALUATED TO EXTRACT PROJECTIONS FOR NEW DATA . Method PCA PLS CCA OPLS kPCA kPLS kCCA kOPLS skPCA [36] skPLS [37] rkPCA rkCCA rkOPLS [26] SMA / SMC [38]‡ Parameters none none none none kernel kernel kernel kernel kernel, vn kernel, ν, ε kernel, r kernel, r kernel, r kernel, r KE tr. none none none none l2 l2 l2 l2 l2 l2 rl rl rl l2 Storage Req. O(d2 ) O((d + m)2 ) O((d + m)2 ) O(d2 ) O(l2 ) O((l + m)2 ) O((l + m)2 ) O(l2 ) O(l2 ) O(l2 ) O(r2 ) O((r + m)2 ) O(r2 ) O(l2 ) EIG / GEV / Other EIG EIG GEV GEV EIG EIG GEV GEV ML + EIG† ν-SVR EIG GEV GEV Ex. search KE test none none none none l l l l vn dependent ν, ε dependent r r r r † A maximum likelihood estimation step is required. ‡ By constraining the search to p random samples at each step of the algorithm, kernel evaluations and storage requirements during training can be reduced to rlp and O(lp), respectively. oriented to discuss some important properties of the standard linear and kernel MVA methods. The main properties of the selected problems are given in Table IV, namely the number of training and test patterns (l, ltest ), the dimensionality of the input space (d), the number of classes (c), the ratio of training patterns per dimension, and the Kullback-Leibler divergence (KL) between the sample probabilities of each class and a uniform distribution, that can be seen as an indicator of balance among classes. The train/test partition has left 60% of the total data (or alternatively a maximum of 500 samples) in the training set, so that standard kMVA complexity is kept under control. Since all selected problems involve a classification task, matrix Y was used to store the class information using 1-of-c coding. To obtain the overall accuracies (OA) displayed in the table, we have trained an LS model to approximate Y, followed by a “winner-takes-all” activation function. We start the discussion by comparing the performance of linear methods. For OPLS and CCA we have used the maximum number of features (c − 1), whereas for PCA nf has been fixed through a 10-fold cross-validation scheme on the training set, and is indicated next to the results of the method. When the maximum number of features are extracted, linear OPLS and CCA become identical. We can see that in most cases they perform similarly to PCA, but requiring significantly fewer features. It is important to see the very poor performance of OPLS and CCA in two particular problems: semeion and sonar. We see that the ratio l/d is very small for these problems, so the input covariance matrix is likely to be ill-conditioned. Then, very low variance directions of the input space are used by CCA and OPLS to overfit the data. To avoid this problem, it becomes necessary to regularize these algorithms, e.g., by loading the main diagonal of the covariance matrix Cx or by executing the method on the projections already extracted by PCA [11]. The results of regularized OPLS and CCA following the latter approach, and using the maximum of (c − 1) features, are given in the last two columns of Table IV, where we can see that the regularization does indeed help to overcome the “small- sample-size” problem. For completeness, we present also the results of the PLS2 approach. To get a fair comparison with the other supervised schemes, PLS2 is also trained to extract (c − 1) features. Thus, we can conclude that OPLS and CCA allow to obtain more discriminative features than the other methods, but one needs to be aware of the likely need to regularize the solution. Next, we turn our attention to non-linear versions, whose results have been displayed in the bottom half of the table. An RBF kernel has been used in all cases, selecting the kernel width with 10-fold cross-validation. A first consideration is that kernel approaches considerably outperform the linear schemes. Since the RBF kernel implies a mapping to a very high dimensional space, it is not surprising that standard kOPLS and kCCA are even more prone to overfitting than before, this being a well-known problem of these methods. As before, regularized solutions allow kOPLS and kCCA to achieve comparable performance to kPCA, but retaining a much smaller number of features (c − 1), which demonstrates the superior discriminative capabilities of the features extracted by these methods. For completeness, kPLS2 was used to extract the same number of features as kOPLS and kCCA, achieving considerably smaller OAs in all problems. Nevertheless, with kPLS2 it is possible to extract a larger number of projections, and this method is known to be more robust to overfitting than kOPLS and kCCA. B. Remote sensing image analysis The last few hundred years human activities have precipitated an environmental crisis on Earth. In the last decade, advanced statistical methods have been introduced to quantify our impact on the land/vegetation and atmosphere, to better understand their interactions. Nowadays, multi- and hyperspectral sensors mounted on satellite or airborne platforms may acquire the reflected energy by the Earth with high spatial detail and in several wavelengths. Recent infrared sounders also allow us to estimate the profiles of atmospheric parameters with unprecedented accuracy and vertical resolution. Here, ARENAS-GARCÍA ET AL.: KERNEL MULTIVARIATE ANALYSIS FRAMEWORK 9 TABLE IV M AIN CHARACTERISTICS OF THE DATA SETS THAT COMPOSE THE UCI BENCHMARK AND PERFORMANCE OF THE DIFFERENT ( K )MVA FEATURE EXTRACTION METHODS . A S A FIGURE OF MERIT WE USE THE OVERALL ACCURACY (OA, [%]) ± THE BINOMIAL STANDARD DEVIATION . B EST RESULTS FOR EACH PROBLEM ARE HIGHLIGHTED IN BOLDFACE . T HE NUMBER OF EXTRACTED FEATURES IS INDICATED FOR PCA AND K PCA, WHEREAS ALL OTHER METHODS USE c − 1 FEATURES . data set car glass optdigits semeion sonar vehicle vowel yeast data set car glass optdigits semeion sonar vehicle vowel yeast l 500 128 500 500 125 500 500 500 ltest 1228 86 5120 1093 83 346 490 984 d 6 9 62 256 60 18 13 8 c 4 6 10 10 2 4 11 10 l/d nf,PCA 83.3 5 14.2 9 8.06 27 1.95 78 2.08 7 27.8 18 38.5 13 62.5 7 KL nf,kPCA 0.55 197 0.28 17 0 330 0 321 0 97 0 229 0 310 0.58 56 PCA 79 ±1.2 57 ±5.3 88.8 ±0.4 83.8 ±1.1 74.7 ±4.8 78 ±2.2 48.8 ±2.3 55.9 ±1.6 kPCA 93 ±0.7 60.5 ±5.3 95.4 ±0.3 89.5 ±0.9 84.3 ±4 81.5 ±2.1 92.7 ±1.2 58.8 ±1.6 PLS2 79.3 ±1.2 50 ±5.4 86.6 ±0.5 82.4 ±1.2 67.5 ±5.1 63.9 ±2.6 46.9 ±2.3 55 ±1.6 kPLS2 80.8 ±1.1 60.5 ±5.3 82.2 ±0.5 79 ±1.2 67.5 ±5.1 49.7 ±2.7 53.1 ±2.3 56.8 ±1.6 we pay attention to the performance of several kMVA methods for both image segmentation of hyperspectral images, and estimation of climate parameters from infrared sounders. a) Hyperspectral image classification: The first case study deals with image segmentation of hyperspectral images [42]. We have used the standard AVIRIS image taken over NW Indiana’s Indian Pine test site in June 19924 . We removed 20 noisy bands covering the region of water absorption, and finally worked with 200 spectral bands. The high number of narrow spectral bands induce a high collinearity among features. Discriminating among the major crop classes in the area can be very difficult (in particular, given the moderate spatial resolution of 20 meters), which has made the scene a challenging benchmark to validate classification accuracy of hyperspectral imaging algorithms. The image is 145 × 145 pixels and contains 16 quite unbalanced classes (ranging from 20 − 2468 pixels). Among the available 10366 labeled pixels, 20% were used for training the feature extractors, and the remaining 80% for testing. The discriminative power of all extracted features was tested using a simple classifier consisting of a linear least squares model followed by a “winner-takes all” activation function. Figure 2 shows the test classification accuracy for a varying number of extracted features, nf . For linear models, OPLS performs better than all other methods for any number of extracted features. Even though CCA provides similar results for nf = 10, it involves a slightly more complex generalized eigenproblem. When the maximum number of projections is used, all methods result in the same error. Nevertheless, while PCA and PLS2 require 200 features (i.e., the dimensionality of the input space), CCA and OPLS only need 15 features to achieve virtually the same performance. We also considered non-linear kPCA, kPLS2, kOPLS and kCCA, using an RBF kernel whose width was adjusted using 5-fold cross-validation in the training set. The same conclu4 The calibrated data is available online (along with detailed ground-truth information) from∼biehl/MultiSpec. OPLS 79.6 ±1.1 57 ±5.3 90.3 ±0.4 69.1 ±1.4 65.1 ±5.2 78 ±2.2 48.8 ±2.3 55 ±1.6 kOPLS 92.1 ±0.8 41.9 ±5.3 95.2 ±0.3 89.3 ±0.9 80.7 ±4.3 76.6 ±2.3 92.4 ±1.2 48.1 ±1.6 CCA 79.6 ±1.1 57 ±5.3 90.3 ±0.4 69.1 ±1.4 65.1 ±5.2 78 ±2.2 48.8 ±2.3 55 ±1.6 kCCA 71.8 ±1.3 29.1 ±4.9 93.3 ±0.4 89.4 ±0.9 80.7 ±4.3 75.1 ±2.3 92 ±1.2 33.8 ±1.5 reg. OPLS 79 ±1.2 57 ±5.3 88.8 ±0.4 83.8 ±1.1 74.7 ±4.8 78 ±2.2 48.8 ±2.3 55.9 ±1.6 reg. kOPLS 93 ±0.7 62.8 ±5.2 95.3 ±0.3 89 ±0.9 84.3 ±4 82.1 ±2.1 93.1 ±1.1 58.7 ±1.6 reg. CCA 79 ±1.2 57 ±5.3 88.8 ±0.4 83.8 ±1.1 74.7 ±4.8 78 ±2.2 48.8 ±2.3 55.9 ±1.6 reg. kCCA 91.6 ±0.8 60.5 ±5.3 88.8 ±0.4 83.9 ±1.1 49.4 ±5.5 72.8 ±2.4 88.4 ±1.4 58.9 ±1.6 sions obtained for the linear case apply also to MVA methods in kernel feature space. The features extracted by kOPLS allow to achieve a slightly better Overall Accuracy (OA) than kCCA, and both methods perform significantly better than kPLS2 and kPCA. In the limit of nf , all methods achieve similar accuracy. The classification maps obtained for nf = 10 confirm these conclusions: higher accuracies lead to smoother maps and smaller error in large spatially homogeneous vegetation covers. b) Temperature estimation from infrared sounding data: The second case study focuses on the estimation of temperature from spaceborne very high spectral resolution infrared sounding instruments. Despite the constant advances in sensor design and retrieval techniques, it is not trivial to invert the full information of the atmospheric state contained by such hyperspectral measurements. Statistical regression and feature extraction methods have overcome the numerical difficulties of radiative transfer models (RTMs), and enable fast retrievals from high volumes of data [43]. We concentrate here on the Infrared Atmospheric Sounding Interferometer (IASI) onboard the MetOp-A satellite data. IASI spectra consist of 8461 spectral channels (input features) with a spatial resolution of 25 km at nadir. Due to its large spatial coverage and low radiometric noise, IASI provides twice daily global measurements of key atmospheric species such as ozone, carbon monoxide, methane and methanol. Due to the impossibility to obtain real radiosound data for the whole atmospheric column, we resorted to the standard hybrid approach for developing the prediction models: we used synthetic data for training the models, and then applied it to a full IASI orbit (91800 ‘pixels’ on March 4th, 2008). A total amount of 67475 synthetic samples were simulated with an infrared RTM according to input profiles of temperature at 90 pressure levels. We are confronted here with a challenging multi-output regression problem (xi ∈ R8461 and yi ∈ R90 ), which needs fast methods for retrieval (prediction). Note that the IASI mission delivers approximately 1.3 × 106 spectra per day, 10 IEEE SIGNAL PROCESSING MAGAZINE, VOLUME 30, ISSUE 4, 2013 Overall Accuracy (%) 90 RGB composite PLS2 (61.5%) CCA (67.5%) kOPLS (80.4%) 80 70 PCA PLS2 CCA OPLS kPCA kPLS2 kCCA kOPLS 60 50 40 0 1 10 10 Number of features 2 10 Fig. 2. Average classification accuracy (%) for linear and kernel MVA methods as a function of the number of extracted features, along some classification maps for the case of 10 extracted features. ECMWF world temperature map [K] p [hPa] 10 10 10 1 ECMWF PCA (2.54) kPCA (1.72) kOPLS (1.55) PCA PLS2 CCA OPLS kPCA kPLS2 kCCA kOPLS 2 3 0 2 4 RMSE [K] 6 8 Fig. 3. RMSE atmospheric temperature profiles (left); surface temperature [K] world map provided by the official ECMWF model,, on March 4, 2008 (middle); and estimated surface temperature maps in California/Mexico area for several methods along with the averaged RMSE across the whole atmospheric column given in brackets. which gives a rate of about 29 Gbytes/day to be processed. We compared several MVA methods followed by LS in terms of the root-MSE (RMSE) computed as the discrepancy to the official European Centre for Medium-Range Weather Forecasts (ECMWF) estimations. We obtain the RMSE for all spatial ‘pixels’ in the orbit and for all layers of the atmosphere. We fixed a maximum nf = 100, and used only l = 2000 samples for learning the transformation and the LS regression weights. For the kernel approaches, an RBF kernel is used, selecting the width using 5-fold cross validation. The left panel of Fig. 3 illustrates the RMSE obtained for different pressure levels, using a spatial average over all the pixels. Results show that kernel methods outperform linear approaches in RMSE terms, specially in the lower parts of the atmosphere, probably due to the presence of clouds and haze. Estimated surface temperature maps for a particular area are also given in the right panel of Fig. 3 for PCA, kPCA, and kOPLS. These plots reveal that kernel methods yield closer maps to those provided by the ECMWF in averaged RMSE terms. These results can be of high value because the kernel-based estimations are obtained with a drastic reduction in computational time compared to the physical-inversion methods used by the ECMWF. C. Analysis of integrated short-time music features for genre prediction In this subsection, we consider the problem of predicting the genre of a song using the audio data only, a task which has recently been a subject of much interest. The data set we analyze has been previously used in [26], [41], and consists of 1317 snippets of 30 seconds distributed evenly over 11 music genres: alternative, country, easy listening, electronica, jazz, latin, pop&dance, rap&hip-hop, r&b, reggae and rock. This is a rather complex data set with an average of 1.83 songs per artist. An estimate of human performance on this data set has been carried out via subjective tests, providing an average accuracy rate around 55%. The music snippets are MP3 (MPEG1-layer3) encoded music with a bitrate of 128 kbps or higher, down sampled to 22050 Hz, and they are processed following the method in [41]: Mel Frequency Cepstral Coefficients (MFCC) features are extracted from overlapping frames of the song, using a ARENAS-GARCÍA ET AL.: KERNEL MULTIVARIATE ANALYSIS FRAMEWORK 11 window size of 20 ms. Then, a multivariate autoregressive (AR) model is adjusted for every 1.2 seconds of the song to capture temporal correlation, and finally the parameters of the AR model are stacked into a 135 length feature vector for every such frame. For training and testing the system we have split the data set into two subsets with 817 and 500 songs, respectively. After processing the audio data, we have 57388 and 36556 135-dimensional vectors in the training and test partitions, an amount which for standard kernel MVA methods is prohibitively large. For this reason, in this subsection we study the performance of linear MVA methods, as well as of sparse kernel methods that promote sparsity following the approach of [26]: rkPCA, rkOPLS, and rkCCA. For completeness, we have also considered the kPLS2 method of [7]; in this case the deflation scheme does not allow to use reduced set methods, so we applied mere subsampling. As in the previous subsections, we use an RBF kernel, with a 10-fold validation scheme over the training data to adjust the kernel width. We also resorted to an LS scheme followed by “winner-takes-all” to carry out the classification. Figure 4 illustrates the performance of all methods for different number of features. For the sparse methods, the influence of the machine size, measured as the number of kernel evaluations required to extract features for new data, is also analyzed. In rkPCA, rkOPLS and rkCCA this is given by the cardinality of the reduced set, whereas in kPLS2 it coincides with the number of samples used to train the feature extractor. Since every song consists of about 70 AR vectors, we can measure the classification accuracy in two different ways: 1) On the level of individual AR vectors, or 2) by majority voting across the AR vectors of a given song. The left panel of Fig. 4 illustrates the discriminative power of the features extracted by all considered methods. Overall, the best performance is obtained by OPLS- and CCA-type methods, with the kernel schemes outperforming the linear ones for nf > 5. The poor performance of PCA and rkPCA makes evident the need of exploiting label information during the feature extraction step. Finally, it is evident that a mere subsampling does not provide kPLS2 with enough data to extract relevant features, and this method performs worse than its linear counterpart. On the right panel of Fig. 4 we analyze the accuracy of kernel methods as a function of machine size. As before, it is clear that rkOPLS and rkCCA are the best performing methods both at the AR and song levels. Increasing the size of the machine results in better performance, although the improvement is not very noticeable in excess of 250 nodes. Altogether, these results allow us to conclude that sparse methods can be used to enhance the applicability of kMVA for large data sets. In this subsection, we have focused on the sparse-promotion technique of [26], but similar advantages can be expected from other sparse approaches. of these techniques in real world applications is becoming increasingly popular. Beyond standard PCA, there is a plethora of linear and kernel methods that are generally better suited for supervised applications since they seek for projections that maximize the alignment with the target variables. We analyzed the commonalities and basic differences of the most representative MVA approaches in the literature, as well as the relationships to existing kernel discriminative feature extraction and statistical dependence estimation approaches. We also studied recent methods to make kernel MVA more suitable to real life applications, both for large scale data sets and for problems with few labeled data. In such approaches, sparse and semi-supervised learning extensions have been successfully introduced for most of the models. Actually, seeking for the appropriate features that facilitate classification or regression cuts to the heart of manifold learning. We have illustrated MVA methods in challenging real problems that typically exhibit complex manifolds. A representative subset of the UCI repository has been used to illustrate the general applicability of the standard MVA implementations in moderate data sizes. We have completed the panorama with challenging real-life applications: the prediction of music genre from recorded signals, and the classification of land-cover classes and estimation of climate variables to monitor our Planet. This tutorial presented the framework of kernel multivariate methods and outlined relations and possible extensions. The adoption of the methods in many disciplines of science and engineering is nowadays a fact. New exciting advances in the theory and applications are yet to come. V. C ONCLUSIONS We reviewed the field of supervised feature extraction from the unified framework of multivariate analysis. The use ACKNOWLEDGMENTS The authors would like to thank the reviewers and the special issue editors for useful comments that improved the presentation of the present manuscript significantly. This work was partially supported by Banco Santander and Universidad Carlos III de Madrid’s Excellence Chair programme, and by the Spanish Ministry of Economy and Competitiveness (MINECO) under projects TIN2012-38102C03-01, TEC2011-22480 and PRI-PIBIN-2011-1266. R EFERENCES [1] H. Wold, “Estimation of principal components and related models by iterative least squares,” in P. R. Krishnaiah (ed.) Multivariate Analysis, pp. 391–420, Academic Press, 1966. [2] R. Rosipal, “Nonlinear Partial Least Squares: An Overview,” in Chemoinformatics and Advanced Machine Learning Perspectives: Complex Computational Methods and Collaborative Techniques, H. Lodhi and Y. Yamanishi (eds.), ACCM, IGI Global, pp. 169–189, 2011. [3] S. Wold, N. Kettaneh-Wold, and B. Skagerberg, “Nonlinear PLS Modeling,” Chemometrics and Intelligent Laboratory Systems, vol. 7, pp. 53–65, 1989. [4] S. Qin and T. McAvoy, “Non-linear PLS modelling using neural networks,” Computers & Chemical Engineering, vol. 16, pp. 379–391, 1992. [5] B. E. Boser, I. Guyon, and V. N. Vapnik, “A training algorithm for optimal margin classifiers,” in Proc. COLT’92, Pittsburgh, PA, 1992, pp. 144–152. [6] B. Schölkopf, A. Smola and K.-R. Muller. “Nonlinear Component Analysis as a Kernel Eigenvalue Problem,” Neural Computation, vol. 10, pp. 1299–1319, 1998. [7] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004. 12 IEEE SIGNAL PROCESSING MAGAZINE, VOLUME 30, ISSUE 4, 2013 45 45 Accuracy rates (%) Accuracy rates (%) 40 35 30 25 20 15 Baseline (Random Guess) 10 5 0 10 1 10 Number of Features PCA PLS2 CCA OPLS rkPCA kPLS2 rkCCA rkOPLS 10 40 35 rkPCA, AR kPLS2, AR rkCCA, AR rkOPLS, AR rkPCA, song kPLS2, song rkCCA, song rkOPLS, song 30 25 2 20 100 250 500 Machine size 750 Fig. 4. Genre classification accuracy of linear and sparse kernel MVA methods. Left: Accuracy at the song level as a function of the number of extracted features. Right: Accuracy of the sparse methods measured as percentage of correctly classified AR patterns and songs for different machine sizes. [8] IEEE Signal Process. Mag., special issue on Dimensionality Reduction via Subspace and Submanifold Learning, vol. 28, Mar. 2011. [9] I. T. Jollife, Principal Component Analysis, Springer-Verlag, 1986. [10] K. Pearson, “On lines and planes of closest fit to systems of points in space,” Philosophical Mag., vol. 2 pp. 559–572, 1901. [11] M. L. Braun, J. M. Buhmann, and K.-R. Müller, “On Relevant Dimensions in Kernel Feature Spaces,” Journal of Machine Learning Research, vol. 9, 1875–1908, 2008. [12] T. J. Abrahamsen and L. K. Hansen, “A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis,” Journal of Machine Learning Research, vol. 12, 2027–2044, 2011. [13] H. Wold, “Non-linear estimation by iterative least squares procedures,” in F. David (ed.) Research papers in Statistics, pp. 411–444, New York, NY: Wiley, 1966. [14] S. Wold, et al., “Multivariate Data Analysis in Chemistry,” in B. R. Kowalski (ed.) Chemometrics, Mathematics and Statistics in Chemistry, pp. 17–95. Reidel Publishing Company, Holland, 1984. [15] P. Geladi, “Notes on the history and nature of partial least squares (PLS) modelling”, Journal of Chemometrics, vol. 2, pp. 231–246, 1988. [16] N. Kramer and R. Rosipal, “Overview and recent advances in partial least squares,” in C. Saunders et al. (eds.) Subspace, Latent Structure and Feature Selection Techniques, pp. 34–51, Springer-Verlag, 2006. [17] H. Hotelling, “Relations between two sets of variates,” Biometrika, 28, 321–377, 1936. [18] M. Borga, T. Landelius, and H. Knutsson, “A unified approach to PCA, PLS, MLR and CCA,” Tech. Report LiTH-ISY-R-1992, Linköping, Sweden, 1997. [19] M. Barker and W. Rayens, “Partial least squares for discrimination,” Journal of Chemometrics, vol. 17, pp. 166–173, 2003. [20] H. Zou, T. Hastie, and R. Tibshirani, “Sparse Principal Component Analysis,” Journal of Computational and Graphical Statistics, vol. 15, pp. 265–286, 2006. [21] D. R. Hardoon and J. Shawe-Taylor, “Sparse Canonical Correlation Analysis,” Machine Learning, vol. 83, pp. 331–353, 2011. [22] M. van Gerven, Z. Chao, and T. Heskes “On the decoding of intracranial data using sparse orthonormalized partial least squares,” Journal of Neural Engineering, vol. 9, 2012. [23] C. M. Bishop, Neural networks for pattern recognition, Oxford University Press, 1995. [24] P. L. Lai and C. Fyfe, “Kernel and non-linear Canonical Correlation Analysis,” Intl. Journal of Neural Systems, vol. 10, pp. 365–377, 2000. [25] R. Rosipal and L. J. Trejo, “Kernel partial least squares regression in reproducing Hilbert spaces,” Journal of Machine Learning Research, 2, 97–123, 2001. [26] J. Arenas-Garcı́a, K. B. Petersen, and L. K. Hansen, “Sparse kernel orthonormalized PLS for feature extraction in large data sets,” in NIPS, 19, MIT Press, 2007. [27] F. Biessmann,, “Temporal kernel CCA and its application in multimodal neuronal data analysis,” Machine Learning, vol. 79, pp. 5–27, 2010. [28] M. Blaschko, J. Shelton, A. Bartels, C. Lampert, and A. Gretton, “Semisupervised Kernel Canonical Correlation Analysis with Application to Human fMRI”, Patt. Recogn. Lett. vol. 32, pp.1572–1583, 2011. [29] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller, “Fisher discriminant analysis with kernels,” in Proc. IEEE Neural Networks for Signal Processing Workshop, Madison, WI, Aug. 1999, pp. 41–48. [30] S. Mika et al., “Constructing Descriptive and Discriminative Nonlinear Features: Rayleigh Coefficients in Kernel Feature Spaces,” IEEE Trans. Patt. Anal. Mach. Intell., vol. 25, pp. 623–628, 2003. [31] G. Baudat and F. Anouar, “Generalized Discriminant Analysis Using a Kernel Approach,” Neural Computation, vol. 12, pp. 2385–2404, 2000. [32] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf, “Measuring statistical dependence with Hilbert-Schmidt norms,” in Proc. 16th Intl. Conf. Algorithmic Learning Theory, Springer, 2005, pp. 63–77. [33] A. Gretton, R. Herbrich and A. Hyvärinen, “Kernel methods for measuring independence”, Journal of Machine Learning Research, vol. 6, 2075–2129, 2005. [34] J. C. Principe, Information Theoretic Learning: Renyi’s Entropy and Kernel Perspectives, Springer, 2010. [35] L. Hoegaerts, J. A. K. Suykens, J. Vanderwalle, and B. De Moor, “Subset based least squares subspace regression in RKHS,” Neurocomputing, vol. 63, pp. 293–323, 2005. [36] M. E. Tipping, “Sparse Kernel Principal Component Analysis,” in NIPS, 13, MIT Press, 2001. [37] M. Momma and K. Bennett, “Sparse kernel partial least squares regression,” in Proc. Conf. on Learning Theory, 2003. [38] C. Dhanjal, S. R. Gunn, and J. Shawe-Taylor, “Efficient sparse kernel feature extraction based on partial least squares,” IEEE Trans. Patt. Anal. and Mach. Intell., vol. 31, pp. 1347–1360, 2009. [39] E. Izquierdo-Verdiguier, J. Arenas-Garcı́a, S. Muñoz-Romero, L. Gómez-Chova and G. Camps-Valls, “Semisupervised Kernel Orthonormalized Partial Least Squares,” in Proc. IEEE Mach. Learn. Sign. Proc. Workshop, Santander, Spain, 2012. [40] J. Arenas-Garcı́a and G. Camps-Valls, “Efficient Kernel OPLS for remote sensing applications,” IEEE Trans. Geosc. Rem. Sens., 44, 2872– 2881, 2008. [41] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, “Temporal Feature Integration for Music Genre Classification,” IEEE Trans. Audio, Speech and Lang. Process., vol. 15, pp. 1654–1664, 2007. [42] J. Arenas-Garcı́a and K. B. Petersen, “Kernel multivariate analsis in remote sensing feature extraction,” in G. Camps-Valls and L. Bruzzone (eds.) Kernel methods for Remote Sensing Data Analysis, Wiley, 2009. [43] G. Camps-Valls, J. Muñoz-Marı́, L. Gómez-Chova, L. Guanter, and X. Calbet, “Nonlinear Statistical Retrieval of Atmospheric Profiles from MetOp-IASI and MTG-IRS Infrared Sounding Data,” IEEE Trans. Geoscience and Remote Sensing, 2012.