
Add xgboost documentation #78

Merged: 4 commits, Dec 3, 2019
16 changes: 8 additions & 8 deletions docs/api.md
@@ -101,18 +101,18 @@ will automatically place weights into the `smd.CollectionKeys.WEIGHTS` collection
| `GRADIENTS` | TensorFlow, PyTorch, MXNet | Matches all gradients tensors. In TensorFlow non-DLC, must use `hook.wrap_optimizer()`. |
| `LOSSES` | TensorFlow, PyTorch, MXNet | Matches all loss tensors. |
| `SCALARS` | TensorFlow, PyTorch, MXNet | Matches all scalar tensors, such as loss or accuracy. |
| `METRICS` | TensorFlow, XGBoost | Evaluation metrics computed by the algorithm. |
| `INPUTS` | TensorFlow | Matches all inputs to a layer (outputs of the previous layer). |
| `OUTPUTS` | TensorFlow | Matches all outputs of a layer (inputs of the following layer). |
| `SEARCHABLE_SCALARS` | TensorFlow | Scalars that will go to SageMaker Metrics. |
| `OPTIMIZER_VARIABLES` | TensorFlow | Matches all optimizer variables. |
| `HYPERPARAMETERS` | XGBoost | [Booster parameters](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html) |
| `PREDICTIONS` | XGBoost | Predictions on the validation set (if provided) |
| `LABELS` | XGBoost | Labels of the validation set (if provided) |
| `FEATURE_IMPORTANCE` | XGBoost | Feature importance given by [get_score()](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.Booster.get_score) |
| `FULL_SHAP` | XGBoost | A matrix of shape (nsamples, nfeatures + 1), where each record holds the feature contributions ([SHAP values](https://github.com/slundberg/shap)) for that prediction. Computed on training data with [predict()](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.Booster.predict) |
| `AVERAGE_SHAP` | XGBoost | The sum of SHAP value magnitudes over all samples. Represents the impact each feature has on the model output. |
| `TREES` | XGBoost | Boosted tree model given by [trees_to_dataframe()](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.Booster.trees_to_dataframe) |
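
Once a job has written tensors, these collections can be inspected offline with `smdebug`'s trial API. The sketch below assumes a local output path and a `validation-rmse` metric name; both are illustrative, not fixed by the library.

```python
from smdebug.trials import create_trial

# Create a trial from the smdebug output directory (a local path or s3:// URI).
trial = create_trial("/opt/ml/output/tensors")  # assumed path

# List the tensors saved in the METRICS and AVERAGE_SHAP collections.
print(trial.tensor_names(collection="metrics"))
print(trial.tensor_names(collection="average_shap"))

# Read one metric at the last saved step (the tensor name is an assumption).
print(trial.tensor("validation-rmse").value(trial.steps()[-1]))
```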

98 changes: 97 additions & 1 deletion docs/xgboost.md
@@ -1,3 +1,99 @@
# XGBoost

## Contents

- [SageMaker Example](#sagemaker-example)
- [Full API](#full-api)

## SageMaker Example

### Use XGBoost as a built-in algorithm

The XGBoost algorithm can be used (1) as a built-in algorithm or (2) as a framework, in the same way as MXNet, PyTorch, or TensorFlow.
If SageMaker XGBoost is used as a built-in algorithm in container version `0.90-2` or later, Amazon SageMaker Debugger is available by default (i.e., a zero-code-change experience).
See the [XGBoost Algorithm AWS documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) for more information on how to use XGBoost as a built-in algorithm.
See [Amazon SageMaker Debugger examples](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger) for sample notebooks that demonstrate the debugging and monitoring capabilities of Amazon SageMaker Debugger.
See the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/) for more information on how to configure Amazon SageMaker Debugger from the Python SDK.
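
With a recent (v2) SageMaker Python SDK, enabling Debugger for the built-in algorithm amounts to attaching a hook configuration to the estimator. The following is a minimal sketch; the region, bucket paths, and collection names are placeholders, not prescribed values.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig

role = sagemaker.get_execution_role()  # assumes a SageMaker execution context
container = sagemaker.image_uris.retrieve("xgboost", "us-east-1", "0.90-2")

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    debugger_hook_config=DebuggerHookConfig(
        s3_output_path="s3://my-bucket/debugger-output",  # placeholder bucket
        collection_configs=[
            CollectionConfig(name="metrics"),
            CollectionConfig(name="feature_importance"),
        ],
    ),
)
estimator.fit({"train": "s3://my-bucket/train", "validation": "s3://my-bucket/validation"})
```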

### Use XGBoost as a framework

When SageMaker XGBoost is used as a framework, we recommend configuring the hook from the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/).
Using the SageMaker Python SDK, you can run different jobs (e.g., Processing jobs) on the SageMaker platform.
You can retrieve the hook as follows.
```python
import xgboost as xgb
from smdebug.xgboost import Hook

params = {"objective": "reg:squarederror"}  # example booster parameters

dtrain = xgb.DMatrix("train.libsvm")
dvalid = xgb.DMatrix("validation.libsvm")

hook = Hook.create_from_json_file()
hook.train_data = dtrain  # required
hook.validation_data = dvalid  # optional
hook.hyperparameters = params  # optional

bst = xgb.train(
    params,
    dtrain,
    callbacks=[hook],
    evals=[(dtrain, "train"), (dvalid, "validation")],
)
```

Alternatively, you can also create the hook from `smdebug`'s Python API as shown in the next section.

### Use the Debugger hook

If you are in a non-SageMaker environment, or if (even on SageMaker) you want to configure the hook yourself in script mode, you can use the full Debugger hook API as follows.
```python
import xgboost as xgb
from smdebug.xgboost import Hook

dtrain = xgb.DMatrix("train.libsvm")
dvalid = xgb.DMatrix("validation.libsvm")

out_dir = "/opt/ml/output/tensors"  # example path; any local or S3 location works
hyperparameters = {"max_depth": 5}  # example booster parameters

hook = Hook(
    out_dir=out_dir,  # required
    train_data=dtrain,  # required
    validation_data=dvalid,  # optional
    hyperparameters=hyperparameters,  # optional
)
```
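
The hook is then passed to `xgboost.train()` as a callback, just as in the SageMaker example above (a short sketch continuing the snippet):

```python
bst = xgb.train(
    hyperparameters,
    dtrain,
    evals=[(dtrain, "train"), (dvalid, "validation")],
    callbacks=[hook],
)
```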

## Full API

```python
def __init__(
self,
out_dir,
export_tensorboard = False,
tensorboard_dir = None,
dry_run = False,
reduction_config = None,
save_config = None,
include_regex = None,
include_collections = None,
save_all = False,
include_workers = "one",
hyperparameters = None,
train_data = None,
validation_data = None,
)
```
Initializes the hook. Pass this object as a callback to `xgboost.train()`.
* `out_dir` (str): A path into which tensors and metadata will be written.
* `export_tensorboard` (bool): Whether to also export tensors as TensorBoard logs.
* `tensorboard_dir` (str): Where to save TensorBoard logs.
* `dry_run` (bool): If true, evaluations are not actually saved to disk.
* `reduction_config` (ReductionConfig object): Not supported in XGBoost and will be ignored.
* `save_config` (SaveConfig object): See the [Common API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md).
* `include_regex` (list[str]): List of additional regexes to save.
* `include_collections` (list[str]): List of collections to save.
* `save_all` (bool): Saves all tensors and collections. **WARNING: May be memory-intensive and slow.**
* `include_workers` (str): Used in distributed training; `"one"` (the default) saves tensors from a single worker, while `"all"` saves from every worker.
* `hyperparameters` (dict): Booster params.
* `train_data` (DMatrix object): The training data.
* `validation_data` (DMatrix object): Validation set on which metrics will be evaluated during training.
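
For instance, a hook that saves the metrics and SHAP collections every 10 steps could be assembled as below. This is a sketch under the assumption that `SaveConfig` is importable from the top-level `smdebug` package, as in the Common API docs; the output path is a placeholder.

```python
from smdebug import SaveConfig
from smdebug.xgboost import Hook

hook = Hook(
    out_dir="/opt/ml/output/tensors",          # placeholder output path
    save_config=SaveConfig(save_interval=10),  # save every 10 steps
    include_collections=["metrics", "average_shap", "feature_importance"],
    train_data=dtrain,
    validation_data=dvalid,
)
```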

See the [Common API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md) page for details about Collection, SaveConfig, and ReductionConfig.\
See the [Analysis](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/analysis.md) page for details about analyzing a training job.