Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create template classes for model representations to support multiple data types (Part of #196) #198

Merged
merged 4 commits into from
Sep 25, 2020

Conversation

hcho3
Copy link
Collaborator

@hcho3 hcho3 commented Sep 10, 2020

Extracted from #196 for easier reviewing.
(The CI tests in this PR won't pass due to missing parts. See #196 to look up the test results.)

  • The ModelImpl class (created in Refactor struct Model -> class Model + class ModelImpl (Part of #196) #201) becomes the template class ModelImpl<ThresholdType, LeafOutputType>.
  • Implement template classes ModelImpl<ThresholdType, LeafOutputType> and TreeImpl<ThresholdType, LeafOutputType> that contain the details of the tree ensemble model. The template classes are parameterized by the types of the thresholds and leaf outputs. Currently the following combinations are allowed:
Threshold type Leaf output type
float32 float32
float32 uint32
float64 float64
float64 uint32
  • Revise the zero-copy serialization protocol, to prepend the type information (threshold_type, leaf_output_type) to the serialized types so that the recipient will choose the correct ModelImpl<ThresholdType, LeafOutputType> to deserialize to.

  • A run-time type dispatching system using the enum type TypeInfo. Users are able to dispatch a correct version of ModelImpl<ThresholdType, LeafOutputType> by specifying a pair of TypeInfo values. We also implement a set of convenient functions, such as InferTypeInfoOf<T> that converts the template arg T into TypeInfo enum.

@canonizer @levsnv @jakirkham @wphicks Can you review this PR?

@hcho3 hcho3 changed the base branch from mainline to multi_type_refactor_breakup September 10, 2020 07:01
Copy link
Contributor

@wphicks wphicks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one grammatical suggestion. Otherwise, everything looks great!

include/treelite/tree.h Outdated Show resolved Hide resolved
@jakirkham
Copy link

There's a few things being passed as pointers. Would it make sense to pass them by reference instead? Or would there be issues with that?

@hcho3
Copy link
Collaborator Author

hcho3 commented Sep 10, 2020

@jakirkham

Would it make sense to pass them by reference instead? Or would there be issues with that?

Cpplint and Google C++ style guide warn against using non-const reference parameters in functions.

@jakirkham
Copy link

Interesting, I'm curious why Google thinks a potentially null pointer is safer than a reference 😄

@hcho3
Copy link
Collaborator Author

hcho3 commented Sep 11, 2020

@jakirkham The idea is to explicitly mark "output" parameters. FooBar(&a) clearly shows that a is modified, whereas FooBar(a) isn't clear whether a is being modified or not.

include/treelite/tree.h Outdated Show resolved Hide resolved
include/treelite/tree.h Outdated Show resolved Hide resolved
return "float64";
default:
throw std::runtime_error("Unrecognized type");
return "";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is return required after throw?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without it, some old compilers like GCC 5 complains about a missing return value.

* \return Whatever that's returned by the dispatcher
*/
template <template<class> class Dispatcher, typename ...Args>
inline auto DispatchWithTypeInfo(TypeInfo type, Args&& ...args) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to understand what the intended usage of DispatchWithTypeInfo() / DispatchWithModelTypes() is. Could you add an example (or reference a place in code) demonstrating it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the idea. Let's say we are given a TypeInfo variable type at runtime. We want to call Dispatcher<T>::Dispatch() method where the template arg T corresponds to type, and Dispatcher<T> is a class that contains a custom function that we'd like to run.

For a concrete example, see https://github.com/dmlc/treelite/pull/198/files#diff-409f38fcc1a201a94e70922bbf7c44a6R777-R788 and https://github.com/dmlc/treelite/pull/196/files#diff-65416f46630c396d5db3df0ac92a0a02R130-R148.

include/treelite/typeinfo.h Outdated Show resolved Hide resolved
} else if (std::is_same<T, double>::value) {
return TypeInfo::kFloat64;
} else {
throw std::runtime_error(std::string("Unrecognized Value type") + typeid(T).name());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is runtime type information already used in treelite?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean the built-in RTTI? Yes, since we use dynamic_cast<> to downcast Model* into ModelImpl<>*.

include/treelite/tree.h Show resolved Hide resolved
include/treelite/tree.h Outdated Show resolved Hide resolved
include/treelite/tree.h Outdated Show resolved Hide resolved
include/treelite/tree_impl.h Outdated Show resolved Hide resolved
@hcho3
Copy link
Collaborator Author

hcho3 commented Sep 18, 2020

@canonizer Can I get another round of review for this PR?

Copy link
Contributor

@canonizer canonizer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved, provided that the remaining comments are addressed.

include/treelite/tree.h Outdated Show resolved Hide resolved
include/treelite/tree.h Outdated Show resolved Hide resolved
Model(const Model&) = delete;
Model& operator=(const Model&) = delete;
Model(Model&&) = default;
Model& operator=(Model&&) = default;

template <typename ThresholdType, typename LeafOutputType>
inline static std::unique_ptr<Model> Create();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider moving this function into ModelImpl, as it depends on ThresholdType and LeafOutputType.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is actually not possible to move the function, since it needs access the base class's members threshold_type_ and leaf_output_type_. The members were moved to the base class per your previous suggestion.

return leaf_output_type_;
}
template <typename Func>
inline auto Dispatch(Func func);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these Dispatch() functions used anywhere?

Copy link
Collaborator Author

@hcho3 hcho3 Sep 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's used in the C code generator (excluded from this PR). See #196 to search for actual uses.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hcho3 hcho3 merged commit 269dd2e into dmlc:multi_type_refactor_breakup Sep 25, 2020
@hcho3 hcho3 deleted the new_model_rep branch September 25, 2020 08:04
hcho3 added a commit that referenced this pull request Oct 9, 2020
… data types (Part of #196) (#198)

* The ModelImpl class (created in #201) becomes the template class ModelImpl<ThresholdType, LeafOutputType>.
* Implement template classes ModelImpl<ThresholdType, LeafOutputType> and TreeImpl<ThresholdType, LeafOutputType> that contain the details of the tree ensemble model. The template classes are parameterized by the types of the thresholds and leaf outputs. Currently the following combinations are allowed:

| Threshold type | Leaf output type |
|----------------|------------------|
| float32        | float32          |
| float32        | uint32           |
| float64        | float64          |
| float64        | uint32           |
|----------------|------------------|

* Revise the zero-copy serialization protocol, to prepend the type information (threshold_type, leaf_output_type) to the serialized types so that the recipient will choose the correct ModelImpl<ThresholdType, LeafOutputType> to deserialize to.
* A run-time type dispatching system using the enum type TypeInfo. Users are able to dispatch a correct version of ModelImpl<ThresholdType, LeafOutputType> by specifying a pair of TypeInfo values. We also implement a set of convenient functions, such as InferTypeInfoOf<T> that converts the template arg T into TypeInfo enum.

Co-authored-by: William Hicks <[email protected]>
Co-authored-by: Andy Adinets <[email protected]>
hcho3 added a commit that referenced this pull request Oct 9, 2020
Addresses #95 and #111.
Follow-up to #198, #199, #201

Trying again, since #130 failed. This time, I made the Model class to be polymorphic. This way, the amount of pointer indirection is minimized.

Summary: Model is an opaque container that wraps the polymorphic handle ModelImpl<ThresholdType, LeafOutputType>. The handle in turn stores the list of trees Tree<ThresholdType, LeafOutputType>. To unbox the Model container and obtain ModelImpl<ThresholdType, LeafOutputType>, use Model::Dispatch(<lambda expression>).

Also, upgrade to C++14 to access the generic lambda feature, which proved to be very useful in the dispatching logic for the polymorphic Model class.

* Turn the Model and Tree classes into template classes
* Revise the string templates so that correct data types are used in the generated C code
* Rewrite the model builder class
* Revise the zero-copy serializer
* Create an abstract matrix class that supports multiple data types (float32, float64 for now).
* Move the DMatrix class to the runtime.
* Extend the DMatrix class so that it can hold float32 and float64.
* Redesign the C runtime API using the DMatrix class.
* Ensure accuracy of scikit-learn models. To achieve the best results, use float32 for the input matrix and float64 for the split thresholds and leaf outputs.
* Revise the JVM runtime.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants