ikpls.fast_cross_validation.numpy_ikpls

Contains the PLS class which implements fast cross-validation with partial least-squares regression using Improved Kernel PLS by Dayal and MacGregor: https://arxiv.org/abs/2401.13185 https://doi.org/10.1002/(SICI)1099-128X(199701)11:1%3C73::AID-CEM435%3E3.0.CO;2-%23

The implementation is written using NumPy and allows for parallelization of the cross-validation process using joblib.

Author: Ole-Christian Galbo Engstrøm E-mail: ocge@foss.dk

Classes

PLS(algorithm, center_X, center_Y, scale_X, ...)

Implements fast cross-validation with partial least-squares regression using Improved Kernel PLS by Dayal and MacGregor: https://arxiv.org/abs/2401.13185 https://doi.org/10.1002/(SICI)1099-128X(199701)11:1%3C73::AID-CEM435%3E3.0.CO;2-%23

class ikpls.fast_cross_validation.numpy_ikpls.PLS(algorithm: int = 1, center_X: bool = True, center_Y: bool = True, scale_X: bool = True, scale_Y: bool = True, ddof: int = 1, copy: bool = True, dtype: type[floating] = <class 'numpy.float64'>)

Bases: object

Implements fast cross-validation with partial least-squares regression using Improved Kernel PLS by Dayal and MacGregor: https://arxiv.org/abs/2401.13185 https://doi.org/10.1002/(SICI)1099-128X(199701)11:1%3C73::AID-CEM435%3E3.0.CO;2-%23

Parameters:
  • algorithm (int, default=1) – Whether to use Improved Kernel PLS Algorithm #1 or #2. Generally, Algorithm #1 is faster if X has fewer rows than columns, while Algorithm #2 is faster if X has more rows than columns.

  • center_X (bool, default=True) – Whether to center X before fitting by subtracting its row of column-wise means from each row. The row of column-wise means is computed on the training set for each fold to avoid data leakage.

  • center_Y (bool, default=True) – Whether to center Y before fitting by subtracting its row of column-wise means from each row. The row of column-wise means is computed on the training set for each fold to avoid data leakage.

  • scale_X (bool, default=True) – Whether to scale X before fitting by dividing each row by the row of X’s column-wise standard deviations. The row of column-wise standard deviations is computed on the training set for each fold to avoid data leakage.

  • scale_Y (bool, default=True) – Whether to scale Y before fitting by dividing each row by the row of Y’s column-wise standard deviations. The row of column-wise standard deviations is computed on the training set for each fold to avoid data leakage.

  • ddof (int, default=1) – The delta degrees of freedom to use when computing the sample standard deviation. A value of 0 corresponds to the biased estimate of the sample standard deviation, while a value of 1 corresponds to Bessel’s correction for the sample standard deviation.

  • dtype (type[np.floating], default=numpy.float64) – The float datatype to use in computation of the PLS algorithm. This should be numpy.float32 or numpy.float64. Using a lower precision than float64 will yield significantly worse results as the number of components increases, because numerical errors propagate from one component to the next.

  • copy (bool, default=True) – Whether to copy X, Y, and weights when cross-validating. If True, or if the data are not already arrays of the specified dtype, the data are copied. Otherwise, the data are not copied, and changes made to X, Y, and weights outside this class may affect the results of the cross-validation.
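The rule of thumb given for the algorithm parameter can be sketched as a small helper. The function name choose_algorithm is hypothetical and not part of ikpls, and the choice of Algorithm #2 when X is square is an arbitrary assumption, since the documentation only says "generally" and does not cover that case.

```python
import numpy as np


def choose_algorithm(X):
    """Hypothetical helper (not part of ikpls): pick 1 or 2 from the shape of X."""
    N, K = np.asarray(X).shape  # N rows (samples), K columns (features)
    # Fewer rows than columns: Algorithm #1 is generally the faster one.
    # Otherwise fall back to Algorithm #2 (the tie N == K is an assumption).
    return 1 if N < K else 2


wide = np.zeros((5, 20))   # fewer rows than columns
tall = np.zeros((20, 5))   # more rows than columns
algorithm_for_wide = choose_algorithm(wide)
algorithm_for_tall = choose_algorithm(tall)
```

The returned value could then be passed as the algorithm argument of PLS. Timing both algorithms on the actual data remains the more reliable way to decide.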

Raises:

ValueError – If algorithm is not 1 or 2.

Notes

Any centering and scaling is undone before returning predictions to ensure that predictions are on the original scale. If both centering and scaling are True, then the data is first centered and then scaled.
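The per-fold preprocessing described above can be illustrated with plain NumPy. This is a minimal sketch of the idea, not the library's internal code: the data, the fold labels, and the stand-in "prediction" are all made up for illustration. Statistics are computed on the training rows only, centering is applied before scaling, and the transformation is undone in reverse order.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(loc=5.0, scale=2.0, size=(6, 2))  # illustrative targets, shape (N, M)
folds = np.array([0, 0, 1, 1, 2, 2])             # illustrative fold labels

# Hold out fold 0; every statistic below uses the training rows only.
val_mask = folds == 0
Y_train, Y_val = Y[~val_mask], Y[val_mask]

# Row of column-wise means and sample standard deviations (ddof=1).
Y_mean = Y_train.mean(axis=0, keepdims=True)
Y_std = Y_train.std(axis=0, ddof=1, keepdims=True)

# Center first, then scale.
Y_train_scaled = (Y_train - Y_mean) / Y_std

# Stand-in for a model output on the scaled scale (here: a perfect prediction).
Y_pred_scaled = (Y_val - Y_mean) / Y_std

# Undo scaling, then centering, to return to the original scale.
Y_pred = Y_pred_scaled * Y_std + Y_mean
```

Because the validation rows never contribute to Y_mean or Y_std, nothing about the held-out data leaks into the preprocessing.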

Setting any of center_X, center_Y, scale_X, or scale_Y to True while using multiple jobs will increase memory consumption, as each job will then have to keep its own copy of \(\mathbf{X}^{\mathbf{T}}\mathbf{X}\) and \(\mathbf{X}^{\mathbf{T}}\mathbf{Y}\) with its specific centering and scaling.
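A rough back-of-the-envelope estimate of this extra memory can be sketched as follows. The helper extra_gram_bytes is hypothetical and not part of ikpls. It assumes each job holds one dense K × K matrix and one dense K × M matrix and ignores every other per-job allocation, so the actual footprint may differ.

```python
def extra_gram_bytes(K, M, n_jobs, itemsize=8):
    """Hypothetical estimate: bytes held across jobs for X^T X and X^T Y copies.

    K: number of columns of X, M: number of columns of Y,
    itemsize: bytes per element (8 for float64, 4 for float32).
    """
    per_job = (K * K + K * M) * itemsize  # one K x K matrix plus one K x M matrix
    return n_jobs * per_job


# Example: 1000 predictors, 1 target, 8 parallel jobs, float64.
estimate = extra_gram_bytes(K=1000, M=1, n_jobs=8)  # 64_064_000 bytes, about 64 MB
```

The K × K term dominates when X has many columns, which is the situation where limiting n_jobs or using float32 has the largest effect on memory.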

cross_validate(X: ArrayLike, Y: ArrayLike, A: int, folds: Iterable[Hashable], metric_function: Callable[[ArrayLike, ArrayLike], Any], weights: ArrayLike | None = None, n_jobs=-1, verbose=10) dict[Hashable, Any]

Cross-validates the PLS model on X and Y using the splits defined by folds and A components, evaluating the results with metric_function.

Parameters:
  • X (Array of shape (N, K)) – Predictor variables.

  • Y (Array of shape (N, M) or (N,)) – Target variables.

  • A (int) – Number of components in the PLS model.

  • folds (Iterable of Hashable with N elements) – An iterable defining cross-validation splits. Each unique value in folds corresponds to a different fold.

  • metric_function (Callable receiving arrays Y_val of shape (N_val, M) and Y_pred of shape (A, N_val, M) and, if weights is not None, also weights_val of shape (N_val,), and returning Any) – Computes a metric from the true values Y_val and the predicted values Y_pred. Y_pred contains one set of predictions for each number of components from 1 to A.

  • weights (Array of shape (N,) or None, optional, default=None) – Weights for each observation. If None, then all observations are weighted equally.

  • n_jobs (int, default=-1) – Number of parallel jobs to use. A value of -1 will use the minimum of all available cores and the number of unique values in folds.

  • verbose (int, default=10) – Controls verbosity of parallel jobs.
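A metric function matching the shapes listed above might look like the following sketch. The name mse_per_component is made up for illustration. The optional third argument is one way to let the same function serve both the weighted and the unweighted case, assuming weights_val is passed positionally after Y_pred.

```python
import numpy as np


def mse_per_component(Y_val, Y_pred, weights_val=None):
    """Illustrative metric: (weighted) mean squared error for each slice of Y_pred.

    Y_val: (N_val, M), Y_pred: (A, N_val, M), weights_val: (N_val,) or None.
    Returns an array of shape (A,).
    """
    Y_val = np.asarray(Y_val)
    Y_pred = np.asarray(Y_pred)
    # (A, N_val, M) minus (N_val, M) broadcasts over the leading component axis.
    squared_error = (Y_pred - Y_val) ** 2
    per_sample = squared_error.mean(axis=2)  # average over targets -> (A, N_val)
    # np.average falls back to a plain mean when weights is None.
    return np.average(per_sample, axis=1, weights=weights_val)


# Tiny check with made-up data: A = 3 slices, N_val = 4 samples, M = 2 targets.
Y_val = np.zeros((4, 2))
Y_pred = np.stack([np.full((4, 2), float(k)) for k in range(3)])
mse = mse_per_component(Y_val, Y_pred)  # array([0., 1., 4.])
```

Such a function would be passed as the metric_function argument of cross_validate. Returning one value per slice makes it straightforward to pick the number of components with the lowest validation error afterwards.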

Returns:

metrics – A dictionary mapping each unique value in folds to the result of evaluating metric_function on the validation set corresponding to that value.

Return type:

dict of Hashable to Any

Raises:

ValueError – If weights are provided and any weight is negative.

Notes

The order of cross-validation folds is determined by the order of the unique values in folds. The keys and values of metrics will be sorted in the same order.
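How fold labels map to validation sets can be sketched in a few lines of plain Python. The helper group_indices_by_fold is hypothetical, and the sketch assumes that "the order of the unique values" means order of first appearance in folds. That reading is an assumption and should be checked against the library before relying on it.

```python
def group_indices_by_fold(folds):
    """Hypothetical helper: map each fold label to the row indices it validates on."""
    groups = {}
    for index, label in enumerate(folds):
        # dicts preserve insertion order, so keys follow first appearance in folds
        groups.setdefault(label, []).append(index)
    return groups


folds = ["b", "a", "b", "c", "a"]  # one hashable label per sample
val_indices = group_indices_by_fold(folds)
# {'b': [0, 2], 'a': [1, 4], 'c': [3]}
```

Under that assumption, the keys of the metrics dictionary returned by cross_validate would follow the same order as the keys of val_indices. Any hashable values work as labels, for example integers, strings, or tuples identifying a batch.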