mars.learn.wrappers.ParallelPostFit#

class mars.learn.wrappers.ParallelPostFit(estimator: Optional[BaseEstimator] = None, scoring: Optional[Union[str, Callable]] = None)[source]#

Meta-estimator for parallel predict and transform.

Parameters
  • estimator (Estimator) – The underlying estimator that is fit.

  • scoring (string or callable, optional) –

    A single string (see scoring_parameter) or a callable (see scoring) to evaluate the predictions on the test set.

    For evaluating multiple metrics, either give a list of (unique) strings or a dict with names as keys and callables as values.

    Note that when using custom scorers, each scorer should return a single value. Metric functions that return a list or array of values can be wrapped into multiple scorers that each return one value.

    See multimetric_grid_search for an example.

    Warning

    If None, the estimator’s default scorer (if available) is used. Most scikit-learn estimators will convert a large Mars tensor to a single NumPy array, which may exhaust the memory of your worker. You probably always want to specify scoring.
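A single-value callable of the kind described above can be built with scikit-learn's make_scorer. The sketch below uses plain scikit-learn (no Mars required); the names f1_scorer, est, X, and y are illustrative, not part of the Mars API.

```python
# Sketch: building a single-value callable scorer with scikit-learn's
# make_scorer, suitable for passing as ``scoring=``.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, f1_score

# A scorer must return a single value per call.
f1_scorer = make_scorer(f1_score)

X, y = make_classification(n_samples=200, random_state=0)
est = LogisticRegression().fit(X, y)

# A scorer is called as scorer(estimator, X, y) and returns one float.
score = f1_scorer(est, X, y)
```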

Notes

Warning

This class is not appropriate for parallel or distributed training on large datasets. For that, see Incremental, which provides distributed (but sequential) training. If you’re doing distributed hyperparameter optimization on larger-than-memory datasets, see mars.learn.model_selection.IncrementalSearch.

This estimator does not parallelize the training step. It simply calls the underlying estimator’s fit method and copies the learned attributes over to self afterwards.

It is helpful for situations where your training dataset is relatively small (fits on a single machine) but you need to predict or transform a much larger dataset. predict, predict_proba and transform will be done in parallel (potentially distributed if you’ve connected to a Mars cluster).

Note that many scikit-learn estimators already predict and transform in parallel. This meta-estimator may still be useful in those cases when your dataset is larger than memory, as the distributed scheduler will ensure the data isn’t all read into memory at once.
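The block-wise idea described above can be sketched with plain NumPy and scikit-learn, without a Mars cluster. This is an illustration of the concept only, not Mars’s actual implementation:

```python
# Sketch of block-wise prediction: fit on a small dataset, then predict a
# larger array chunk by chunk and concatenate the results. Conceptually,
# this is the work that ParallelPostFit schedules in parallel over the
# chunks of a Mars tensor.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_train, y_train = make_classification(n_samples=1000, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)

X_big, _ = make_classification(n_samples=10000, random_state=0)

# Predict each chunk independently; a distributed scheduler never needs
# all chunks in memory at the same time.
chunks = np.array_split(X_big, 10)
pred = np.concatenate([clf.predict(chunk) for chunk in chunks])

# The result is identical to predicting the whole array at once.
assert (pred == clf.predict(X_big)).all()
```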

See also

Incremental, mars.learn.model_selection.IncrementalSearch

Examples

>>> from sklearn.ensemble import GradientBoostingClassifier
>>> from sklearn.datasets import make_classification
>>> import mars.tensor as mt
>>> from mars.learn.wrappers import ParallelPostFit

Make a small 1,000-sample training dataset and fit normally.

>>> X, y = make_classification(n_samples=1000, random_state=0)
>>> clf = ParallelPostFit(estimator=GradientBoostingClassifier(),
...                       scoring='accuracy')
>>> clf.fit(X, y)
ParallelPostFit(estimator=GradientBoostingClassifier(...))
>>> clf.classes_
array([0, 1])

Transform and predict return Mars outputs for Mars inputs.

>>> X_big, y_big = make_classification(n_samples=100000,
...                                    random_state=0)
>>> X_big, y_big = mt.tensor(X_big), mt.tensor(y_big)
>>> clf.predict(X_big)
array([1, 0, 0, ..., 1, 0, 0])

Which can be computed in parallel.

>>> clf.predict_proba(X_big)
array([[0.01780031, 0.98219969],
       [0.62199242, 0.37800758],
       [0.89059934, 0.10940066],
       ...,
       [0.03249968, 0.96750032],
       [0.951434  , 0.048566  ],
       [0.99527114, 0.00472886]])
__init__(estimator: Optional[BaseEstimator] = None, scoring: Optional[Union[str, Callable]] = None)[source]#

Methods

__init__([estimator, scoring])

fit(X[, y])

Fit the underlying estimator.

get_params([deep])

Get parameters for this estimator.

partial_fit(X[, y])

predict(X[, execute])

Predict for X.

predict_log_proba(X[, execute])

Log of probability estimates.

predict_proba(X[, execute])

Probability estimates.

score(X, y)

Returns the score on the given data.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Transform block or partition-wise for Mars inputs.