update C# API to optimize inference latency #3171
Conversation
    {
        NativeMethods.OrtReleaseValue(outputValuesArray[i]); // For elementary type Tensors, this should not release the buffer, but should delete the native tensor object.
                                                             // For string tensors, this releases the native memory allocated for the tensor, including the buffer
        pinnedOutputBufferHandles[i].Dispose();
So the output buffer still contains the result of the (native) Run() but just gets unpinned (and the native OrtValue gets released along with it). Is this correct ?
The output is a NamedOnnxValue object, which is usually created from a DenseTensor with the pre-allocated buffer inside.
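For readers following along, a minimal sketch of that pattern against the existing public API (the output name and shape below are placeholders, not taken from any particular model):

    using Microsoft.ML.OnnxRuntime;
    using Microsoft.ML.OnnxRuntime.Tensors;

    static class OutputBufferSketch
    {
        static NamedOnnxValue CreateOutputValue()
        {
            // caller-owned buffer that will receive the results of Run()
            var buffer = new float[10];

            // DenseTensor wraps the buffer without copying it
            var tensor = new DenseTensor<float>(buffer, new[] { 1, 10 });

            // NamedOnnxValue holds no native resources; it just carries the name and the
            // managed tensor, whose buffer gets pinned for the duration of Run()
            return NamedOnnxValue.CreateFromTensor("output", tensor);
        }
    }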
    @@ -206,32 +209,108 @@ private void CanRunInferenceOnAModel(GraphOptimizationLevel graphOptimizationLev
                    validateRunResults(results);
                }
            }

            // Run inference with pinned inputs and empty outputs
"Empty outputs" seems a bit misleading -> I guess you mean the outputs returned (DisposableNamedOnnxValue)?
    using (var pinnedInputs = new DisposableList<PinnedOnnxValue>())
    {
        var inputNames = container.Select(i => i.Name).ToArray();
        pinnedInputs.AddRange(container.Select(i => PinnedOnnxValue.CreateFromTensor(i.AsTensor<float>())));
should we add a test for string type as well ?
What's the speed-up due to this change?
The overhead is significant in constructing …
Is there a way to quantify the speed-up? E.g. for a particular model the speed-up is x%. That will help to know if the performance is moving in the right direction (and hopefully not in the opposite direction).
    public IDisposableReadOnlyCollection<DisposableNamedOnnxValue> Run(
        IReadOnlyCollection<string> inputNames,
        IReadOnlyCollection<PinnedOnnxValue> inputValues)
    {
This creates one more public Run() method (!) -- if there are too many Run() methods, ultimately a user will be confused about when to use which one
Not sure how to consider/weigh this new API. It would need to be permanently supported :/ Is the benefit of this new API considerable? -- if yes, then it might be worth the user confusion. If the benefits are not substantial, we should reconsider. Any thoughts?
In my opinion, the number of overloads is not a concern. Take the commonly used .NET API Task.Factory.StartNew() for example: it has 16 overloads to fit all kinds of requirements from different users. The more important thing is to make sure users can easily understand how to use the method(s).
The overload count is large because it is a combination of input style (NamedOnnxValue/PinnedOnnxValue), output style (NamedOnnxValue/PinnedOnnxValue/omit, with DisposableNamedOnnxValue as the return value), and options (pass/omit). Users can find this pattern without much effort: pass the inputs/outputs/options, and that's it.
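For illustration, the combination reads like this at the call sites (a sketch only; the PinnedOnnxValue overload shapes are assumed from this PR, not a final API reference):

    using System.Collections.Generic;
    using Microsoft.ML.OnnxRuntime;

    static class RunOverloadPatternSketch
    {
        static void Demo(
            InferenceSession session,
            IReadOnlyCollection<NamedOnnxValue> namedInputs,
            IReadOnlyCollection<string> inputNames,
            IReadOnlyCollection<PinnedOnnxValue> pinnedInputs,
            IReadOnlyCollection<string> outputNames,
            IReadOnlyCollection<PinnedOnnxValue> pinnedOutputs)
        {
            // 1) original style: NamedOnnxValue inputs, DisposableNamedOnnxValue outputs returned
            using (var results = session.Run(namedInputs)) { /* read results */ }

            // 2) pinned inputs, outputs still created and returned by the runtime
            using (var results = session.Run(inputNames, pinnedInputs)) { /* read results */ }

            // 3) pinned inputs and pre-allocated pinned outputs: nothing is returned,
            //    results are written into the caller-owned output buffers
            session.Run(inputNames, pinnedInputs, outputNames, pinnedOutputs);
        }
    }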
With Task.Factory.StartNew, however, there is plenty of documentation, and the combinations are legitimate (i.e. if an overload were not present, some users would be blocked).
Based on the performance numbers, it looks like the new API yields a 15% reduction in running time for small models (e.g. MNIST) in a multi-threaded scenario.
If this new API is added, it doesn't unblock any new user scenarios, but adds new functions (similar to the originals) which could potentially be faster.
The issue with new APIs is that they have to be supported permanently -- once released, they cannot be withdrawn. Which is fine, except that the motivation for this API is not at the same level as for new SessionOptions values (i.e. users will be blocked if C# doesn't expose new SessionOptions values).
For performance related changes, is it possible to make changes under-the-hood, instead of creating a new hood?
I've thought about this possibility, but the fundamental difference between NamedOnnxValue and PinnedOnnxValue is that the first does not hold any native resources, while the latter holds pinned memory (and an underlying native OrtValue). We do not want to modify the existing type (and we should not), so we need a type that implements IDisposable to hold the pinned memory. One cannot be replaced by the other because of this difference.
If we had a chance to re-design the whole C# API, I might think about merging DisposableNamedOnnxValue with PinnedOnnxValue (having a name field in a reusable ONNX value may not be a good idea, but that is another topic). Given the current state, however, it's hard to reuse the existing type. That is why we have this new class PinnedOnnxValue, and with it, every added method overload has a usage that cannot be fulfilled by the other overloads.
"every added method overload has a usage that cannot be fulfilled by the other overloads"
Just for context, is there a particular user scenario that is unblocked by this new API?
I.e. in what situation (besides small-model performance gains) would a user be required to use the new API because the original API would not work?
If a user's requirement is to reach a high performance target, then it's possible that they have to use the new API because the original API would not be fast enough.
Functionally, the original API will always work. There is no scenario where a user has to use the new API to get correct output because the original API somehow does not support it. The purpose of this change is to resolve the latency issue, and for that goal we introduced the new type PinnedOnnxValue. Without PinnedOnnxValue, users are still able to run the model, but may not reach their performance target.
Performance data on a 4-core E3-1230 for different ways to call (spamming): [table not preserved]
        session.Run(inputNames, pinnedInputs, expectedOutputNames, pinnedOutputs);
        validateRunResultData(outputTensor);
    }
For the 4 validateRunResultData() checks above, if any of checks fail, the overall unit test will fail. Seems cleaner to create separate unit tests to pinpoint which check is failing.
    @@ -78,8 +78,7 @@ private DisposableNamedOnnxValue(string name, Object value, NativeMemoryHandler
        /// <param name="onnxValue"></param>
        /// <param name="pinnedMemoryHandle"></param>
        /// <param name="disposeOnnxValueAfterUse"></param>
    -   internal override void ToNativeOnnxValue(out IntPtr onnxValue,
    -       out MemoryHandle pinnedMemoryHandle)
    +   internal override void ToNativeOnnxValue(out IntPtr onnxValue, out MemoryHandle pinnedMemoryHandle, out bool disposeOnnxValueAfterUse)
The param disposeOnnxValueAfterUse is added back because it resolves both @hariharans29's and @jignparm's concerns:
- A boolean value indicating Run()'s behavior is conceptually correct: the code relies on a class's behavior rather than its name to figure out the right thing to do.
- The worry about making the interface more complicated is unnecessary: first, this is internal scope, so it does not affect the public interface; second, the core logic is moved to a helper class, and there is no such flag there.
Now it's clean: NamedOnnxValue (and its subclasses) just produce the correct handle and a boolean value indicating how these resources are to be used and released; the helper class focuses on how to create the native resources.
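A rough caller-side sketch of that contract follows. Only the ToNativeOnnxValue(out IntPtr, out MemoryHandle, out bool) signature comes from this diff; the helper itself and its exact cleanup order are an illustration, not the PR's code:

    using System;
    using System.Buffers;

    namespace Microsoft.ML.OnnxRuntime
    {
        // Hypothetical helper showing how the out-parameter contract would be consumed.
        internal static class NativeRunHelperSketch
        {
            internal static void WithNativeValue(NamedOnnxValue value, Action<IntPtr> use)
            {
                value.ToNativeOnnxValue(out IntPtr ortValue,
                                        out MemoryHandle pinnedBuffer,
                                        out bool disposeOnnxValueAfterUse);
                try
                {
                    // e.g. hand the handle to the native Run() call
                    use(ortValue);
                }
                finally
                {
                    // a value that created a temporary OrtValue for this call reports true;
                    // a value that owns its native OrtValue long-term reports false and
                    // releases it in its own Dispose()
                    if (disposeOnnxValueAfterUse)
                    {
                        NativeMethods.OrtReleaseValue(ortValue);
                    }
                    // unpin the managed buffer (a no-op if nothing was pinned)
                    pinnedBuffer.Dispose();
                }
            }
        }
    }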
Thanks, I am okay with this. I went per @jignparm 's review, but not relying on class name does seem cleaner to me in general.
The disposeOnnxValueAfterUse parameter in the ToNativeOnnxValue() method makes it seem like this method has knowledge of whether or not to dispose the onnx values. The onnx values should always be disposed; the only question is when: before session.Run(), after session.Run(), when the user is done with an object, or at some other time?
Having the two classes NamedOnnxValue and DisposableNamedOnnxValue give guidance to the caller about when the onnx values should be disposed seems a bit weird -- they should just return a handle to the onnx value. That would keep the ToNativeOnnxValue API clean and push the disposal logic off to the calling functions.
    namespace Microsoft.ML.OnnxRuntime
    {
        /// <summary>
        /// Represents an Onnx Value with its underlying buffer pinned
Should we add a comment about needing to know the exact dimensions of the Tensor (length of the flattened buffer) to use this class ?
I agree we should add this. Where do you think is the best place to add this note: in FixedBufferOnnxValue's class summary, in CreateFromTensor's summary, or in Run()'s argument documentation (this may duplicate)?
CreateFromTensor's summary ?
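For example, the note could read roughly like this in CreateFromTensor's XML summary (suggested wording only; the signature is as proposed in this PR and the body is elided):

    /// <summary>
    /// Creates a PinnedOnnxValue from a tensor, pinning its underlying buffer and creating
    /// the native OrtValue once so that it can be reused across multiple Run() calls.
    /// The tensor's dimensions (the length of the flattened buffer) are fixed at creation
    /// time and must match what the model expects on every run.
    /// </summary>
    public static PinnedOnnxValue CreateFromTensor<T>(Tensor<T> value)
        => throw new System.NotImplementedException(); // body elided; only the summary wording matters here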
    for (var i = 0; i < 1000; i++)
    {
        // feed inputs
        var inputs = new int[] { 1, 2, 3, 4, 5 };
Also, in the case of string type (variable lengths between runs 1..1000), is there any way to throw a friendly exception if that scenario is not handled? If so, can we add a test case to check for the exception message?
If there is no way to detect that scenario and a user tries it anyway, what does the user see (invalid output, or some other exception)?
For output, pre-allocated string tensors will cause an ArgumentException to be thrown in Run(); this is already covered in the other test case.
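A minimal sketch of such a check, assuming xUnit (which the existing C# tests use) and the names proposed in this PR; the model path, shapes, and input/output names are placeholders:

    using System;
    using Microsoft.ML.OnnxRuntime;
    using Microsoft.ML.OnnxRuntime.Tensors;
    using Xunit;

    public class StringOutputRejectionSketch
    {
        [Fact]
        public void PreAllocatedStringOutputThrows()
        {
            // placeholder model; a real test would use a model with a string output
            using (var session = new InferenceSession("model_with_string_output.onnx"))
            using (var pinnedInput = PinnedOnnxValue.CreateFromTensor(
                       new DenseTensor<float>(new float[4], new[] { 1, 4 })))
            using (var pinnedStringOutput = PinnedOnnxValue.CreateFromTensor(
                       new DenseTensor<string>(new[] { "", "" }, new[] { 1, 2 })))
            {
                // pre-allocated string outputs are rejected with an ArgumentException
                Assert.Throws<ArgumentException>(() =>
                    session.Run(new[] { "input" }, new[] { pinnedInput },
                                new[] { "output" }, new[] { pinnedStringOutput }));
            }
        }
    }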
    {
        outputValuesArray[outputIndex] = output.Value;

        // for pre-allocated output, only numberic tensors are supported
nit: numeric
    // for pre-allocated output, only numberic tensors are supported
    if (onnxValueType != OnnxValueType.ONNX_TYPE_TENSOR || elementType == TensorElementType.String)
    {
        throw new ArgumentException("Only numberic tensors can be used as pre-allocated output.", nameof(outputs));
nit: numeric
    disposeOnnxValueAfterUse = false;

    // set onnx value type and tensor element type
    onnxValueType = _onnxValueType;
Now that we have onnxValueType visible to the class, maybe we can make the check on line 101 just check whether the onnxValueType is a plain Tensor?
Left some minor comments. LGTM overall.
    outputValuesArray[outputIndex] = output.Value;

    // for pre-allocated output, only numberic tensors are supported
    if (output.OnnxValueType != OnnxValueType.ONNX_TYPE_TENSOR || output.ElementType == TensorElementType.String)
Both inputs and outputs are of type FixedBufferOnnxValue -- why check for string type only for outputs? If the input is also a fixed-buffer value (pre-allocated buffers for speedup), then shouldn't there be an exception if the input is a string type?
I think - FixedBufferOnnxValue can be created from string tensors and the underlying native OrtValue (created only once at object construction) can be used across multiple inference runs. There is no reason to prevent FixedBufferOnnxValue inputs that hold string content, as I see it. In fact, it seems advantageous to support especially when the same string tensor is going to be scored multiple times. This would prevent all the OrtValue creation overhead each time.
There are 2 reasons why string tensors should be disallowed as pre-allocated output:
1. There is no performance benefit. The underlying native implementation for pre-allocated output string tensors uses the assignment operator of std::string to a newly created value, so even with pre-allocated string tensors the memory used for the strings is not actually "pre-allocated".
2. It creates an inconsistency for C# if we support pre-allocated string tensors as output. A C# string is not a view of the native string; marshaling is needed between native and managed strings, so the value of the string tensor will not be updated to the expected output.
In short, there is no real pre-allocation for non-numeric tensors (strings are always allocated separately, as their length is unknown when allocating), so there is no need to support them.
For inputs, however, as @hariharans29 mentioned, there is a valid requirement, so we support them.
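A sketch of the supported input side, assuming the names in this PR (model path and input name are placeholders): the managed strings are marshaled into a native OrtValue once, at creation time, and that native value is reused across runs.

    using Microsoft.ML.OnnxRuntime;
    using Microsoft.ML.OnnxRuntime.Tensors;

    static class ReusedStringInputSketch
    {
        static void Demo()
        {
            using (var session = new InferenceSession("string_input_model.onnx"))  // placeholder model
            using (var stringInput = PinnedOnnxValue.CreateFromTensor(
                       new DenseTensor<string>(new[] { "xxx" }, new[] { 1, 1 })))
            {
                // string marshaling and native OrtValue creation happened once, above;
                // each Run() below just reuses that native value
                for (var i = 0; i < 4; i++)
                {
                    using (var results = session.Run(new[] { "input" }, new[] { stringInput }))
                    {
                        // read results
                    }
                }
            }
        }
    }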
"it seems advantageous to support especially when the same string tensor is going to be scored multiple times"
- Scenario 1: A user scores the same string tensor many times against a model that has only 1 string input, e.g. {"xxx"}, {"xxx"}, {"xxx"}, {"xxx"} ... {"xxx"}
I assume scenario 1 is supported using FixedBufferOnnxValue, but there's no point in supporting it because the model will produce the same output for every example. An application can just cache the model score if there's a high-frequency string.
- Scenario 2: A user would like to score 4 different string examples back to back, e.g. {"xxx"}, {"yyy"}, {"zzzzz"}, {"xxx"}
Is scenario 2 also supported by using FixedBufferOnnxValue? If not supported, then which strings should be removed (minimum number) to make the scenario supportable?
If scenario 1 is tweaked so that an app scores multiple models that take the same input, would it be beneficial to the overall perf if we did not have to do the string marshaling and create the native OrtValue in each model's Run()?
"If scenario 1 is tweaked so that an app scores multiple models that take the same input, would it be beneficial to the overall perf if we did not have to do the string marshaling"
- The tweaked scenario (assuming there's some benefit) seems quite uncommon. At least it's not used in any of the ORT examples, UTs or tutorials :/
- More importantly, have a look at the new unit test TestReusingFixedBufferOnnxValueNonStringTypeMultiInferences(), at the line inputs.CopyTo(bufferInput, 0);
That looks like the primary usage pattern for FixedBufferOnnxValue to get a speedup. If that's the main pattern, how would it apply to string types? If there's a different pattern for string-type speedup, there should be a UT/example for it -- otherwise users won't know about it.
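For reference, that usage pattern looks roughly like this (a sketch; the model path, I/O names, and buffer contents are placeholders, and the test's actual data differs):

    using Microsoft.ML.OnnxRuntime;
    using Microsoft.ML.OnnxRuntime.Tensors;

    static class FixedBufferReuseSketch
    {
        static void Demo()
        {
            var bufferInput = new int[5];
            var bufferOutput = new int[5];

            using (var session = new InferenceSession("identity_int32.onnx"))  // placeholder model
            using (var pinnedInput = PinnedOnnxValue.CreateFromTensor(
                       new DenseTensor<int>(bufferInput, new[] { 1, 5 })))
            using (var pinnedOutput = PinnedOnnxValue.CreateFromTensor(
                       new DenseTensor<int>(bufferOutput, new[] { 1, 5 })))
            {
                for (var i = 0; i < 1000; i++)
                {
                    // refresh the input by writing into the same pinned buffer; the pinned
                    // values and their native OrtValues were created once, outside the loop
                    var inputs = new int[] { 1, 2, 3, 4, 5 };
                    inputs.CopyTo(bufferInput, 0);

                    session.Run(new[] { "input" }, new[] { pinnedInput },
                                new[] { "output" }, new[] { pinnedOutput });

                    // results are written straight into bufferOutput
                }
            }
        }
    }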
Btw - I think this change needs an update to https://github.com/microsoft/onnxruntime/blob/master/docs/CSharp_API.md to introduce the new class.
Listing technical concerns/feedback here. Attaching approval to the PR due to business requirements.
- Unrequired API changes: Should instead look to improve performance without API changes. Instead of adding lots of new API surface and the new class FixedBufferOnnxValue, simply update the existing API (NamedOnnxValue and DisposableNamedOnnxValue) to support fixed buffers. This would save a lot of confusion-by-too-many-choices for users, streamline the code base, reduce API changes, and still support fixed-size buffers.
- 11 more InferenceSession.Run() overloads: Originally there were 3 InferenceSession.Run() methods. The PR adds 11 new ones, giving a total of 14 Run() methods, with a lot of repeated code in them. The worry is that they all do the same thing; the only difference is that some might be faster (e.g. if a user has tiny models and uses multiple threads). Usually, new overloads should support new scenarios, not the same scenario (only faster).
- Changes in a high-visibility unit test: There are 8 validateRunResult*() checks in the unit test InferenceTest.cs:CanRunInferenceOnAModel(), with blocks of repeating code, making it look quite complex. This is an important function that also serves as documentation for users -- users will see a much more complex 'getting-started-style' example instead of the short, intuitive one, and are likely to become confused.
  - The extra validateRunResult*() checks should be refactored out into separate unit tests.
  - This unit test is now less useful: we cannot tell which of the 8 checks is failing (we will only see the first one that fails).
Description: This change updates the ONNX Runtime C# API to support more overloads of InferenceSession.Run().
Features: adds PinnedOnnxValue to allow users to reuse the underlying native OrtValue multiple times.
Motivation and Context: The overhead is significant in constructing OrtValue objects in InferenceSession.Run(). However, this can be avoided by calling the C API with a pre-allocated output list and pinned memory, as discussed in the design review meeting.