This repo contains scripts to train models for the Nirdizati predictive process monitoring engine. The following instructions are for standalone use. For use with Apromore, please refer to these instructions.
Tested with Python 3.5 and Python 3.6. Install the necessary packages with
pip install -r requirements.txt
To train a model, run:
export PYTHONPATH=$PYTHONPATH:/wherever/you/keep/nirdizati-training-backend
cd core/
python train.py training-config-ID
training-config-ID - JSON file (without extension) that contains the training configuration (see below). It must be placed under the core/training_params/ directory.
Example:
python train.py myconfig_remtime
This script assumes that you have a training configuration file core/training_params/myconfig_remtime.json
with the following structure:
{
  "target": {
    "bucketing_type": {
      "encoding_type": {
        "learner_type": {
          "learner_param1": "value1",
          "learner_param2": "value2",
          ...
        }
      }
    }
  },
  "ui_data": {
    "log_file": "/wherever/you/keep/your/log.csv"
  }
}
bucketing_type - zero, cluster, state or prefix. Use tooltips in the Nirdizati tool (advanced training mode) for short explanations
encoding_type - agg, laststate, index or combined
learner_type - rf for random forest, gbm for gradient boosting, dt for decision tree or xgb for extreme gradient boosting
learner_param's - most important hyperparameters for each learner, taken from sklearn lib: n_estimators and max_features for rf; n_estimators, max_features and learning_rate for gbm; max_features and max_depth for dt; n_estimators, max_depth, learning_rate, colsample_bytree and subsample for xgb
target - variable that you want to predict (see below). The prediction problem type (classification or regression) is determined automatically based on the number of unique levels of a target variable and whether or not it can be parsed as a numeric series.
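The automatic choice between classification and regression can be pictured with the sketch below. It is only an illustration of the rule as described above, not the code used by train.py: the function name `infer_problem_type` and the cut-off `max_classes` are hypothetical, and the actual threshold on the number of unique levels is not stated in this document.

```python
def infer_problem_type(values, max_classes=10):
    """Guess 'classification' or 'regression' for a target column.

    max_classes is a hypothetical cut-off chosen for this sketch only;
    the real threshold used by the training scripts may differ.
    """
    unique_values = set(values)

    def is_number(value):
        # A value counts as numeric if float() can parse it.
        try:
            float(value)
            return True
        except (TypeError, ValueError):
            return False

    numeric = all(is_number(v) for v in unique_values)
    # Numeric target with many distinct levels: treat as regression.
    if numeric and len(unique_values) > max_classes:
        return "regression"
    # Non-numeric target, or only a few distinct levels: classification.
    return "classification"
```

Under these assumptions, a column of activity names or a 0/1 label would be treated as classification, while a column of many distinct remaining-time values would be treated as regression.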
Example of a training configuration file
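As a rough illustration, a configuration file with the structure above could be generated from Python as follows. This is a sketch under assumptions: it assumes every placeholder in the template (target, bucketing_type, encoding_type, learner_type) is replaced by its chosen value, and that hyperparameters may be given as numbers, although the template shows string values. The helper names `build_config` and `write_config` are hypothetical and not part of this repo.

```python
import json
from pathlib import Path


def build_config(target, bucketing_type, encoding_type, learner_type,
                 learner_params, log_file):
    """Nest the choices as target > bucketing > encoding > learner."""
    return {
        target: {
            bucketing_type: {
                encoding_type: {
                    learner_type: dict(learner_params),
                }
            }
        },
        "ui_data": {"log_file": log_file},
    }


def write_config(config, config_id, params_dir="core/training_params"):
    """Save the configuration as <params_dir>/<config_id>.json."""
    path = Path(str(params_dir)) / (config_id + ".json")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2))
    return path


if __name__ == "__main__":
    # Remaining time, no bucketing, frequency encoding, XGBoost.
    config = build_config(
        target="remtime",
        bucketing_type="zero",
        encoding_type="agg",
        learner_type="xgb",
        learner_params={"n_estimators": 300, "learning_rate": 0.04,
                        "max_depth": 5},
        log_file="/wherever/you/keep/your/log.csv",
    )
    print(json.dumps(config, indent=2))
```

Calling `write_config(config, "myconfig_remtime")` from the repository root would then place the file where `python train.py myconfig_remtime` expects to find it.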
The following targets are supported:
- Remaining cycle time. Use the remtime keyword as the target argument for train.py
- Binary case outcome based on the expected case duration (whether case duration will exceed a specified threshold). Use a positive threshold value for target, or "-1" if you want the labeling to be based on the median case duration.
- Next activity to be executed. Use next for target
- Any static, i.e. case, attribute that is already available in the log as a column. In this case, target is the name of the corresponding column.
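The duration-based binary outcome can be sketched as follows. This is an illustration of the labeling rule described in the list above, not the repo's implementation: the function `label_by_duration` is hypothetical, and whether a case that lasts exactly as long as the cut-off counts as exceeding it is an assumption (here it does not).

```python
from statistics import median


def label_by_duration(case_durations, threshold):
    """Map each case id to True if its duration exceeds the cut-off.

    case_durations: dict of case id -> total case duration.
    threshold: a positive number, or -1 to use the median duration.
    """
    threshold = float(threshold)  # the config may hold it as a string
    if threshold == -1:
        cutoff = median(case_durations.values())
    elif threshold > 0:
        cutoff = threshold
    else:
        raise ValueError("threshold must be positive or -1")
    # Strict comparison is an assumption made for this sketch.
    return {case_id: duration > cutoff
            for case_id, duration in case_durations.items()}
```

For example, with case durations of 1, 5 and 10 days, a threshold of 4 labels the last two cases as positive, while "-1" uses the median (5) and labels only the longest case as positive.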
Output:
- Fitted model - pkl/
- Validation results by prefix length - results/validation/
- Detailed validation results - results/detailed/
- Data on feature importance - results/feature_importance/
Default training configuration:
Bucketing - No bucketing (zero)
Encoding - Frequency (agg)
Predictor - XGBoost
Default hyperparameters for each predictor:
- Random forest: Number of estimators 300, max_features 0.5
- Gradient boosting: Number of estimators 300, max_features 0.5, learning rate 0.1
- Decision tree: max_features 0.8, max_depth 5
- XGBoost: Number of estimators 300, learning rate 0.04, subsample row ratio 0.7, subsample column ratio 0.7, max_depth 5
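The defaults listed above can be written down as a lookup table keyed by learner_type, which makes it easy to override single values. This is a sketch, not code from the repo: `DEFAULT_PARAMS` and `resolve_params` are hypothetical names, and mapping "subsample row ratio" to subsample and "subsample column ratio" to colsample_bytree is an assumption based on the parameter names listed for xgb earlier.

```python
# Default hyperparameters per learner, copied from the list above.
DEFAULT_PARAMS = {
    "rf": {"n_estimators": 300, "max_features": 0.5},
    "gbm": {"n_estimators": 300, "max_features": 0.5,
            "learning_rate": 0.1},
    "dt": {"max_features": 0.8, "max_depth": 5},
    "xgb": {"n_estimators": 300, "learning_rate": 0.04,
            "subsample": 0.7,          # assumed: row ratio
            "colsample_bytree": 0.7,   # assumed: column ratio
            "max_depth": 5},
}


def resolve_params(learner_type, overrides=None):
    """Return the defaults for a learner, updated with any overrides."""
    if learner_type not in DEFAULT_PARAMS:
        raise ValueError("unknown learner_type: {}".format(learner_type))
    params = dict(DEFAULT_PARAMS[learner_type])  # copy, keep defaults intact
    params.update(overrides or {})
    return params
```

For instance, `resolve_params("xgb", {"max_depth": 3})` keeps the other four XGBoost defaults and changes only the tree depth.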
To make a prediction for a single test prefix with a trained model, run:
export PYTHONPATH=$PYTHONPATH:/wherever/you/keep/nirdizati-training-backend
cd core/
python predict_trace.py path_to_single_test_prefix.json path_to_pickle_model_filename
Example:
python predict_trace.py ../logdata/1170931347.json ../pkl/bpi17_sample_myconfig_remtime.pkl
The output should be printed to stdout, for example:
[{"remtime":15.9565439224}]
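If the prediction is needed inside another Python program, the script can be called as a subprocess and its output parsed as JSON. The sketch below assumes that stdout contains nothing but the JSON list shown above and that PYTHONPATH is already set as described; `parse_prediction` and `predict` are hypothetical helper names, not part of this repo.

```python
import json
import subprocess
import sys


def parse_prediction(stdout_text):
    """Decode the JSON list printed by predict_trace.py."""
    predictions = json.loads(stdout_text)
    if not isinstance(predictions, list):
        raise ValueError("expected a JSON list of predictions")
    return predictions


def predict(prefix_json, model_pkl, core_dir="core"):
    """Run predict_trace.py inside core_dir and return the parsed output.

    Paths are interpreted relative to core_dir, as in the example above.
    """
    completed = subprocess.run(
        [sys.executable, "predict_trace.py", prefix_json, model_pkl],
        cwd=core_dir,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # decode stdout as text
        check=True,               # raise if the script exits with an error
    )
    return parse_prediction(completed.stdout)
```

With the example output above, `parse_prediction('[{"remtime":15.9565439224}]')[0]["remtime"]` gives the predicted remaining time as a float.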