Integrate with XGBoost#
This is an introduction to using XGBoost for training and prediction in Mars.
Installation#
If you are trying to use Mars on a single machine, e.g. on your laptop, make sure XGBoost is installed.
You can install XGBoost via pip:
pip install xgboost
Visit the XGBoost installation guide for more information.
On the other hand, if you are using Mars on a cluster, make sure XGBoost is installed on each worker.
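If you want to confirm that a worker can actually import XGBoost, one option is to spawn a small function on the cluster with mars.remote; the snippet below is a minimal sketch, and check_xgboost is just an illustrative helper name:
import mars.remote as mr

def check_xgboost():
    # executed on a worker; raises ImportError if xgboost is missing there
    import xgboost
    return xgboost.__version__

# run the check on the cluster and fetch the result
print(mr.spawn(check_xgboost).execute().fetch())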
Prepare data#
First, we use scikit-learn to load the Boston Housing dataset.
In [1]: from sklearn.datasets import load_boston
In [2]: boston = load_boston()
Then create a Mars DataFrame from the dataset.
In [3]: import mars.dataframe as md
In [4]: data = md.DataFrame(boston.data, columns=boston.feature_names)
Explore the top 5 rows of the DataFrame.
In [5]: data.head().execute()
Out[5]:
CRIM ZN INDUS CHAS NOX ... RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 ... 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 ... 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 ... 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 ... 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 ... 3.0 222.0 18.7 396.90 5.33
[5 rows x 13 columns]
mars.dataframe.DataFrame.describe() gives summary statistics of the columns.
In [6]: data.describe().execute()
Out[6]:
CRIM ZN INDUS ... PTRATIO B LSTAT
count 506.000000 506.000000 506.000000 ... 506.000000 506.000000 506.000000
mean 3.613524 11.363636 11.136779 ... 18.455534 356.674032 12.653063
std 8.601545 23.322453 6.860353 ... 2.164946 91.294864 7.141062
min 0.006320 0.000000 0.460000 ... 12.600000 0.320000 1.730000
25% 0.082045 0.000000 5.190000 ... 17.400000 375.377500 6.950000
50% 0.256510 0.000000 9.690000 ... 19.050000 391.440000 11.360000
75% 3.677083 12.500000 18.100000 ... 20.200000 396.225000 16.955000
max 88.976200 100.000000 27.740000 ... 22.000000 396.900000 37.970000
[8 rows x 13 columns]
We can shuffle the data and split it into train and test parts.
In [7]: from mars.learn.model_selection import train_test_split
In [8]: X_train, X_test, y_train, y_test = \
...: train_test_split(data, boston.target, train_size=0.7, random_state=0)
Now we can create a MarsDMatrix, which is very similar to xgboost.DMatrix.
In [9]: from mars.learn.contrib import xgboost as xgb
In [10]: train_dmatrix = xgb.MarsDMatrix(data=X_train, label=y_train)
In [11]: test_dmatrix = xgb.MarsDMatrix(data=X_test, label=y_test)
Training#
We can train the data in two ways:

- Call train(), which accepts a MarsDMatrix.
- Use the scikit-learn API, including XGBClassifier and XGBRegressor.
For train(), you can run the snippet below.
In [12]: params = {'objective': 'reg:squarederror', 'colsample_bytree': 0.3, 'learning_rate': 0.1,
   ...:           'max_depth': 5, 'alpha': 10, 'n_estimators': 10}
In [13]: booster = xgb.train(dtrain=train_dmatrix, params=params)
On the other hand, run the snippet below for the scikit-learn API.
In [14]: xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3,
...: learning_rate=0.1, max_depth=5, alpha=10, n_estimators=10)
In [15]: xg_reg.fit(X_train, y_train)
Out[15]:
XGBRegressor(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.3, gamma=0,
importance_type='gain', learning_rate=0.1, max_delta_step=0,
max_depth=5, min_child_weight=1, missing=None, n_estimators=10,
n_jobs=1, nthread=None, objective='reg:squarederror',
random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
seed=None, silent=None, subsample=1, verbosity=1)
Prediction#
For prediction, there are also two ways:

- Call predict(), which accepts a MarsDMatrix as well.
- Call XGBClassifier.predict() or XGBRegressor.predict() on a fitted model.
For predict(), we call it with the trained model.
In [16]: xgb.predict(booster, X_test)
Out[16]:
476 12.694860
490 9.062592
304 19.793633
216 14.832405
256 24.101620
...
250 16.733646
224 21.917801
500 14.239252
134 11.500128
248 15.969764
Name: predictions, Length: 152, dtype: float32
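predict() accepts the MarsDMatrix created earlier as well, so the same prediction can be sketched as:
# equivalent call using the MarsDMatrix built above
xgb.predict(booster, test_dmatrix)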
For XGBRegressor.predict(), you can run the snippet below.
In [17]: xg_reg.predict(X_test)
Out[17]:
476 12.059338
490 8.448854
304 20.644527
216 14.706422
256 23.231501
...
250 16.597778
224 22.945301
500 13.720667
134 11.226119
248 15.548668
Name: predictions, Length: 152, dtype: float32
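To gauge prediction quality, you can fetch the results locally and compute an error metric; the sketch below uses NumPy, assuming fetch() returns local pandas/NumPy objects as in the examples above:
import numpy as np

# pull predictions and true labels back to the client
pred = xg_reg.predict(X_test).fetch()
truth = y_test.fetch()

# root mean squared error on the test set
rmse = np.sqrt(np.mean((np.asarray(pred) - np.asarray(truth)) ** 2))
print(rmse)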
Distributed training and prediction#
Refer to the Run on Clusters section for deployment, or the Run on Kubernetes section for running Mars on Kubernetes.
Once a cluster exists, you can either set the session as the default, so that the training
and prediction shown above will be submitted to the cluster, or you can specify
session=*** explicitly.
Take XGBRegressor.fit() as an example.
# A cluster has been configured, and web UI is started on <web_ip>:<web_port>
import mars
# set the session as the default one
sess = mars.new_session('http://<web_ip>:<web_port>')
reg = xgb.XGBRegressor()
# training will be submitted to the cluster by default
reg.fit(X_train, y_train)
# or, the session can be specified explicitly
reg.fit(X_train, y_train, session=sess)
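Prediction behaves the same way; the sketch below assumes predict() accepts the same session keyword as fit():
# prediction is submitted to the cluster by default, too
pred = reg.predict(X_test)

# or with an explicit session
pred = reg.predict(X_test, session=sess)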