pca (pyleoclim.utils.decomposition.pca)
- pyleoclim.utils.decomposition.pca(ys, n_components=None, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None)
Principal Component Analysis (Empirical Orthogonal Functions)
Decomposition of a signal or data set in terms of orthogonal basis functions.
This function wraps scikit-learn's PCA: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
- Parameters
ys (array) – Time series data; rows are treated as observations (samples) and columns as variables (features).
n_components (int, None, or str) – [default: None] Number of components to keep:
- If n_components is not set, all components are kept.
- If n_components == ‘mle’ and svd_solver == ‘full’, Minka’s MLE is used to guess the dimension; using n_components == ‘mle’ causes svd_solver == ‘auto’ to be interpreted as ‘full’.
- If 0 < n_components < 1 and svd_solver == ‘full’, select the smallest number of components such that the fraction of variance explained exceeds n_components.
- If svd_solver == ‘arpack’, the number of components must be strictly less than min(n_features, n_samples).
copy (bool, optional) – [default: True] If False, data passed to fit are overwritten, and running fit(X).transform(X) will not yield the expected results; use fit_transform(X) instead.
whiten (bool, optional) – [default: False] When True, the components_ vectors are multiplied by the square root of n_samples and then divided by the singular values, ensuring uncorrelated outputs with unit component-wise variances.
svd_solver (str {‘auto’, ‘full’, ‘arpack’, ‘randomized’}) –
- If auto :
The solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.
- If full :
Run exact full SVD, calling the standard LAPACK solver via scipy.linalg.svd, and select the components by postprocessing.
- If arpack :
Run SVD truncated to n_components, calling the ARPACK solver via scipy.sparse.linalg.svds. It requires strictly 0 < n_components < min(X.shape).
- If randomized :
run randomized SVD by the method of Halko et al.
tol (float >= 0 ,optional) – [default: 0] Tolerance for singular values computed by svd_solver == ‘arpack’.
iterated_power (int >= 0, or string {'auto'}) – [default: ‘auto’] Number of iterations for the power method computed by svd_solver == ‘randomized’.
random_state (int, RandomState instance, or None, optional) – [default: None] If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when svd_solver == ‘arpack’ or ‘randomized’.
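The 0 < n_components < 1 rule described above can be illustrated with a small NumPy sketch. This is illustrative only, not the pyleoclim or scikit-learn implementation, and the synthetic data are made up; it mimics what svd_solver == ‘full’ does internally: take the full SVD of the centered data and keep the fewest components whose cumulative explained-variance fraction exceeds the threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 samples, 5 features; the first feature carries most of the variance
X = rng.normal(size=(100, 5)) * np.array([10.0, 3.0, 1.0, 0.5, 0.1])

# Center the data, then take the full SVD (as svd_solver='full' does)
Xc = X - X.mean(axis=0)
_, s, _ = np.linalg.svd(Xc, full_matrices=False)

# Explained variance of each component, and its fraction of the total
explained_variance = s**2 / (X.shape[0] - 1)
ratio = explained_variance / explained_variance.sum()

# n_components = 0.95: keep the smallest number of components whose
# cumulative explained-variance fraction exceeds 95%
threshold = 0.95
n_keep = int(np.searchsorted(np.cumsum(ratio), threshold) + 1)
print(n_keep, np.cumsum(ratio)[n_keep - 1])
```

With data this strongly dominated by one feature, only a few of the five components are needed to cross the 95% threshold.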
- Returns
A dictionary of the attributes and values of the fitted scikit-learn PCA object:
- components_ (array, shape (n_components, n_features))
Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.
- explained_variance_ (array, shape (n_components,))
The amount of variance explained by each of the selected components. Equal to the n_components largest eigenvalues of the covariance matrix of X. (New in scikit-learn 0.18.)
- explained_variance_ratio_ (array, shape (n_components,))
Fraction of variance explained by each of the selected components. If n_components is not set, then all components are stored and the sum of the ratios is equal to 1.0.
- singular_values_ (array, shape (n_components,))
The singular values corresponding to each of the selected components. The singular values are equal to the 2-norms of the n_components variables in the lower-dimensional space. (New in scikit-learn 0.19.)
- mean_ (array, shape (n_features,))
Per-feature empirical mean, estimated from the training set. Equal to X.mean(axis=0).
- n_components_ (int)
The estimated number of components. When n_components is set to ‘mle’ or a number between 0 and 1 (with svd_solver == ‘full’), this number is estimated from the input data. Otherwise it equals the parameter n_components, or min(n_features, n_samples) if n_components is None.
- n_features_ (int)
Number of features in the training data.
- n_samples_ (int)
Number of samples in the training data.
- noise_variance_ (float)
The estimated noise covariance following the Probabilistic PCA model of Tipping and Bishop (1999). See “Pattern Recognition and Machine Learning” by C. Bishop, section 12.2.1, p. 574, or http://www.miketipping.com/papers/met-mppca.pdf. It is required to compute the estimated data covariance and to score samples. Equal to the average of the (min(n_features, n_samples) - n_components) smallest eigenvalues of the covariance matrix of X.
- Return type
dict
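To make the shape of the returned dictionary concrete, here is a minimal NumPy sketch that reproduces the documented keys via a full SVD. The function name pca_dict and the synthetic data are hypothetical stand-ins, not the actual pyleoclim code path, and the noise_variance_ line assumes the averaging rule quoted above.

```python
import numpy as np

def pca_dict(ys, n_components=None):
    """NumPy sketch of the documented return value: a dict of PCA
    attributes computed via full SVD (illustrative only)."""
    ys = np.asarray(ys, dtype=float)
    n_samples, n_features = ys.shape
    if n_components is None:
        n_components = min(n_samples, n_features)
    mean_ = ys.mean(axis=0)                       # per-feature empirical mean
    _, s, Vt = np.linalg.svd(ys - mean_, full_matrices=False)
    var = s**2 / (n_samples - 1)                  # eigenvalues of the covariance matrix
    return {
        'components_': Vt[:n_components],         # principal axes, sorted by variance
        'explained_variance_': var[:n_components],
        'explained_variance_ratio_': var[:n_components] / var.sum(),
        'singular_values_': s[:n_components],
        'mean_': mean_,
        'n_components_': n_components,
        'n_features_': n_features,
        'n_samples_': n_samples,
        # average of the discarded eigenvalues, per the rule quoted above
        'noise_variance_': (var[n_components:].mean()
                            if n_components < min(n_samples, n_features) else 0.0),
    }

rng = np.random.default_rng(42)
res = pca_dict(rng.normal(size=(50, 4)), n_components=2)
print(res['components_'].shape)   # (2, 4)
```

Accessing the result through dictionary keys such as res['explained_variance_ratio_'] mirrors how the attributes of the fitted scikit-learn PCA object are exposed by this function.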