Hi,

When running inference with a LightGBM model that has categorical features, we experienced much higher latencies than when treating all features as numerical, especially as the vocabulary size increased. After a bit of looking around, it seems that nodes with categorical split conditions (such as value in value_1|value_2|value_3, i.e. a set-contains split) are expanded into "==" nodes, which can grow the node count by an amount that depends on the vocabulary size. I guess this is because onnxruntime does not support set-contains splits natively.
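Roughly, the expansion looks like the following minimal sketch (illustrative only, not the converter's actual code): a single set-contains split has to be emulated with one equality node per value in the set.

def expand_set_split(feature, values):
    # one "feature in {v1, ..., vk}" split becomes k equality nodes,
    # so the tree grows with the size of the value set
    return [{"feature": feature, "mode": "BRANCH_EQ", "value": v} for v in values]

print(len(expand_set_split(0, [1.0, 2.0, 3.0])))  # 3 nodes instead of 1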
I was wondering if you have any plans to support such categorical split conditions natively? We were looking forward to using categorical features, but due to the unpredictable latencies we were not able to use them on a latency-critical path.
Here is a toy example. The slowdown may depend on the setup, but using a few categorical feature columns was enough to double the average runtime.
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split
from onnxruntime import InferenceSession
from onnxmltools.convert.common.data_types import FloatTensorType
from onnxmltools.convert import convert_lightgbm
import lightgbm as lgb
import numpy as np

data = fetch_covtype()
X, y = data.data, data.target
X = X.astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape)  # (435759, 54)

# Variant 1: fit all features as numerical
model = lgb.LGBMClassifier(max_depth=6, n_estimators=100, seed=0)
model.fit(X_train, y_train)

# Variant 2: take the 3 most important features and fit them as categorical
categorical_features = np.argsort(model.booster_.feature_importance())[-3:].tolist()
cardinalities = [len(np.unique(X[:, c])) for c in categorical_features]
print(cardinalities)  # [1978, 5827, 5785]
cat_model = lgb.LGBMClassifier(max_depth=6, n_estimators=100, seed=0)
cat_model.fit(X_train, y_train, categorical_feature=categorical_features)

# assumed definition of `init`, which the snippet used without defining
init = [('input', FloatTensorType([None, X.shape[1]]))]
onx = convert_lightgbm(model, None, init, zipmap=False)
s = InferenceSession(onx.SerializeToString())
onx_cat = convert_lightgbm(cat_model, None, init, zipmap=False)
s_cat = InferenceSession(onx_cat.SerializeToString())
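To make the expansion visible, one can count the branch modes in the two converted graphs and time both sessions. A rough sketch, assuming the converter emits BRANCH_EQ nodes for the emulated splits and using the 'input' name declared above:

from collections import Counter
import timeit

def branch_modes(model_proto):
    # pull the nodes_modes attribute off the TreeEnsembleClassifier node
    node = next(n for n in model_proto.graph.node
                if n.op_type == "TreeEnsembleClassifier")
    attr = next(a for a in node.attribute if a.name == "nodes_modes")
    return Counter(s.decode() for s in attr.strings)

print(branch_modes(onx))      # numerical model: mostly BRANCH_LEQ
print(branch_modes(onx_cat))  # categorical model: many extra BRANCH_EQ nodes

batch = X_test[:1000]
print(timeit.timeit(lambda: s.run(None, {"input": batch}), number=100))
print(timeit.timeit(lambda: s_cat.run(None, {"input": batch}), number=100))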
It is a known issue. A new rule must be added to the definitions of the TreeEnsembleRegressor and TreeEnsembleClassifier operators to support that scenario. That is the first step before implementing the feature in onnxruntime and updating the converter.
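For illustration, such a rule would let a categorical split stay a single node. A purely hypothetical sketch (the actual rule would be defined in the ONNX spec; the mode name here is made up):

def evaluate_branch(node, x):
    # hypothetical "BRANCH_MEMBER" mode: one set-contains test per split
    if node["mode"] == "BRANCH_MEMBER":
        return x[node["feature"]] in node["values"]
    # today's emulation needs len(values) separate BRANCH_EQ nodes instead
    if node["mode"] == "BRANCH_EQ":
        return x[node["feature"]] == node["value"]
    raise ValueError(node["mode"])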