
update C# API to optimize inference latency #3171

Merged · 12 commits into master · Apr 8, 2020

Conversation

@fs-eire (Contributor) commented Mar 9, 2020:

Description: This change updates the ONNX Runtime C# API to support more overloads of InferenceSession.Run().

Features:

  • Support pre-allocated outputs with known types and dimensions.
  • Support a new type, PinnedOnnxValue, which lets users reuse the underlying native OrtValue across multiple runs.
  • Added overloads combining the above, without breaking any existing interface.

Motivation and Context

  • ONNX Runtime C# does not perform well for small models in multi-threaded scenarios. The main reason is that the current C# API always allocates output buffers and creates native OrtValue objects inside InferenceSession.Run(). This can be avoided by calling the C API with a pre-allocated output list and pinned memory, as discussed in the design review meeting; a usage sketch follows below.
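For illustration, here is a minimal sketch of the calling pattern this change enables. It is a sketch only: the model path and the tensor names "x"/"y" are invented, and PinnedOnnxValue is the new type described above (later comments in this thread refer to it as FixedBufferOnnxValue after a rename). The 4-argument Run() overload mirrors the one exercised in the tests further down.

```csharp
using System;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;   // DenseTensor<T>; older packages expose it via System.Numerics.Tensors

// Sketch only: buffers are allocated and pinned once, then reused for every Run() call,
// so no output buffers or native OrtValue objects are created per inference.
static class PreAllocatedRunSketch
{
    public static void RunManyTimes()
    {
        var inputBuffer = new float[3];        // caller-owned buffers, reused every run
        var outputBuffer = new float[2];
        var inputTensor = new DenseTensor<float>(inputBuffer, new[] { 1, 3 });
        var outputTensor = new DenseTensor<float>(outputBuffer, new[] { 1, 2 });

        using (var session = new InferenceSession("model.onnx"))                 // hypothetical model
        using (var input = PinnedOnnxValue.CreateFromTensor(inputTensor))        // pins the buffer once
        using (var output = PinnedOnnxValue.CreateFromTensor(outputTensor))
        {
            for (int i = 0; i < 1000; i++)
            {
                inputBuffer[0] = i;            // write fresh input data in place
                session.Run(new[] { "x" }, new[] { input },
                            new[] { "y" }, new[] { output });
                // this run's results are now in outputBuffer; no managed output objects were allocated
            }
        }
    }
}
```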

fs-eire requested a review from a team as a code owner on March 9, 2020 20:50
{
    NativeMethods.OrtReleaseValue(outputValuesArray[i]); // For elementary type Tensors, this should not release the buffer, but should delete the native tensor object.
    // For string tensors, this releases the native memory allocated for the tensor, including the buffer
    pinnedOutputBufferHandles[i].Dispose();
Member:

So the output buffer still contains the result of the (native) Run() but just gets unpinned (and the native OrtValue gets released along with it). Is this correct?

Contributor Author:

The output is a NamedOnnxValue object, which is usually created from a DenseTensor with the pre-allocated buffer inside (see the sketch below).
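For context, a minimal sketch of that construction (the name "output0" and the 1x10 shape are assumptions):

```csharp
// Sketch: wrap a caller-owned buffer (length 10 here) in a DenseTensor, then in a NamedOnnxValue,
// so Run() writes the result into the existing buffer instead of allocating a new one.
static NamedOnnxValue MakePreAllocatedOutput(float[] outputBuffer)
{
    var outputTensor = new DenseTensor<float>(outputBuffer, new[] { 1, 10 });   // shape is an assumption
    return NamedOnnxValue.CreateFromTensor("output0", outputTensor);            // "output0" is an assumption
}
```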

@@ -206,32 +209,108 @@ private void CanRunInferenceOnAModel(GraphOptimizationLevel graphOptimizationLev
validateRunResults(results);
}
}

// Run inference with pinned inputs and empty outputs
Member:

"empty outputs" seems abit misleading -> I guess you mean outputs returned (DisposableNamedOnnxValue) ?

using (var pinnedInputs = new DisposableList<PinnedOnnxValue>())
{
    var inputNames = container.Select(i => i.Name).ToArray();
    pinnedInputs.AddRange(container.Select(i => PinnedOnnxValue.CreateFromTensor(i.AsTensor<float>())));
Member:

Should we add a test for the string type as well?

@jignparm (Contributor):

> does not perform well for small models in multi-threaded scenarios.

What's the speed-up due to this change?

@fs-eire (Contributor Author) commented Mar 17, 2020:

> does not perform well for small models in multi-threaded scenarios.

> What's the speed-up due to this change?

The overhead is significant in constructing DisposableNamedOnnxValue when the model is small. There are quite a lot of steps (including 10+ P/Invokes) that we can avoid by allowing pre-allocated ONNX values to be passed as outputs.

@jignparm (Contributor):

> The overhead is significant in constructing DisposableNamedOnnxValue

Is there a way to quantify the speed-up? E.g. for a particular model the speed-up is x%. That will help to know if the performance is moving in the right direction (and hopefully not in the opposite direction).

public IDisposableReadOnlyCollection<DisposableNamedOnnxValue> Run(
    IReadOnlyCollection<string> inputNames,
    IReadOnlyCollection<PinnedOnnxValue> inputValues)
{
Contributor:

This creates one more public Run() method (!) -- if there are too many Run() methods, a user will ultimately be confused about when to use which one.

Not sure how to consider/weigh this new API. It would need to be permanently supported :/ Is the benefit of this new API considerable? -- if yes, then it might be worth the user confusion. If the benefits are not substantial, we should reconsider. Any thoughts?

Contributor Author:

In my opinion, the number of overloads is not a concern. Take the commonly used .NET API Task.Factory.StartNew() as an example: it has 16 overloads to fit all kinds of requirements from different users. The more important thing is to make sure users can easily understand how to use the method(s).

The number of overloads is big because it is the combination of input style (NamedOnnxValue/PinnedOnnxValue), output style (NamedOnnxValue/PinnedOnnxValue/omitted, with DisposableNamedOnnxValue as the return value), and options (passed/omitted); see the sketch below. Users can find this pattern without much effort: pass the input/output/options, and that's it.
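To make the combination pattern concrete, here is an illustrative sketch of how the overload family is organized (signatures are simplified and not exhaustive; only the second one mirrors a signature visible in this PR):

```csharp
using System.Collections.Generic;
using Microsoft.ML.OnnxRuntime;

// Illustrative only: every overload is a point in "input style x output style x options".
internal interface IRunOverloadPatternSketch
{
    // NamedOnnxValue inputs; outputs are allocated and returned as DisposableNamedOnnxValue.
    IDisposableReadOnlyCollection<DisposableNamedOnnxValue> Run(
        IReadOnlyCollection<NamedOnnxValue> inputs);

    // PinnedOnnxValue inputs; outputs are still allocated and returned.
    IDisposableReadOnlyCollection<DisposableNamedOnnxValue> Run(
        IReadOnlyCollection<string> inputNames,
        IReadOnlyCollection<PinnedOnnxValue> inputValues);

    // PinnedOnnxValue inputs and pre-allocated PinnedOnnxValue outputs, with RunOptions passed explicitly.
    void Run(
        IReadOnlyCollection<string> inputNames,
        IReadOnlyCollection<PinnedOnnxValue> inputValues,
        IReadOnlyCollection<string> outputNames,
        IReadOnlyCollection<PinnedOnnxValue> outputValues,
        RunOptions options);
}
```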

Contributor:

With Task.Factory.StartNew, however, there is plenty of documentation, and the combinations are legitimate (i.e. if a given overload were not present, some users would be blocked).

Based on the performance numbers, it looks like the new API yields roughly a 15% reduction in running time for small models (e.g. MNIST) in a multi-threaded scenario.

If this new API is added, it doesn't unblock any new user scenarios, but adds a new function (similar to the original) which could potentially be faster.

The issue with new APIs is that they have to be supported permanently -- once released, they cannot be withdrawn. Which is fine, except that the motivation for this API is not at the same level as for new SessionOptions values (i.e. users would be blocked if C# didn't expose new SessionOptions values).

For performance related changes, is it possible to make changes under-the-hood, instead of creating a new hood?

Contributor Author:

I've thought about this possibility, but the fundamental difference between NamedOnnxValue and PinnedOnnxValue is that the former does not hold any native resources, while the latter holds pinned memory (and an underlying native OrtValue). We do not want to modify the existing type (and we should not), so we need a type that implements IDisposable to hold the pinned memory. Neither can replace the other because of this difference.

If we had a chance to redesign the whole C# API, I might consider merging DisposableNamedOnnxValue with PinnedOnnxValue (having a name field in a reusable ONNX value may not be a good idea, but that is another topic). Given the current status, though, it's hard to reuse the existing type. That is why we have the new class PinnedOnnxValue, and with it, each added overload has usage that cannot be fulfilled by the others.

Contributor:

> each added overload has usage that cannot be fulfilled by the others.

Just for context, is there a particular user scenario that is unblocked by this new API?

I.e., under what situation (besides small-model performance gains) would a user be required to use the new API because the original API would not work?

Contributor Author:

If the user's requirement is to reach a high performance target, then it's possible that they have to use the new API because the original API would not be fast enough.

Functionally, the original API will always work. There is no scenario where a user has to use the new API to get correct output because the original API somehow does not support it. The purpose of this change is to resolve the latency issue; for that goal, we introduced the new type 'PinnedOnnxValue'. Without 'PinnedOnnxValue', users can still run the model, but they may not reach their performance target.

@fs-eire (Contributor Author) commented Mar 19, 2020:

> The overhead is significant in constructing DisposableNamedOnnxValue

> Is there a way to quantify the speed-up? E.g. for a particular model the speed-up is x%. That will help to know if the performance is moving in the right direction (and hopefully not in the opposite direction).

Performance data on a 4-core E3-1230, spamming inference requests, for the different ways to call:

| Mode | Input | Output | Total transactions | Total inferences | ORT C# time | ORT output dispose time | Each batch time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Origin | NamedOnnxValue | DisposableOnnxValue | 3016 | 1809600 | 0.0586 ms | 0.0006 ms | 9.9202 ms |
| Opt_input | PinnedOnnxValue | DisposableOnnxValue | 3089 | 1853400 | 0.0570 ms | 0.0008 ms | 9.7041 ms |
| Opt_output | NamedOnnxValue | PinnedOnnxValue | 3377 | 2026200 | 0.0527 ms | 0.0000 ms | 8.8792 ms |
| Opt_both | PinnedOnnxValue | PinnedOnnxValue | 3561 | 2136600 | 0.0497 ms | 0.0000 ms | 8.4158 ms |
| Opt_output_named | NamedOnnxValue | NamedOnnxValue | 3335 | 2001000 | 0.0535 ms | 0.0000 ms | 8.9887 ms |


session.Run(inputNames, pinnedInputs, expectedOutputNames, pinnedOutputs);
validateRunResultData(outputTensor);
}
Contributor:

For the 4 validateRunResultData() checks above, if any of the checks fails, the overall unit test will fail. It seems cleaner to create separate unit tests to pinpoint which check is failing.

@@ -78,8 +78,7 @@ private DisposableNamedOnnxValue(string name, Object value, NativeMemoryHandler
/// <param name="onnxValue"></param>
/// <param name="pinnedMemoryHandle"></param>
/// <param name="disposeOnnxValueAfterUse"></param>
internal override void ToNativeOnnxValue(out IntPtr onnxValue,
out MemoryHandle pinnedMemoryHandle)
internal override void ToNativeOnnxValue(out IntPtr onnxValue, out MemoryHandle pinnedMemoryHandle, out bool disposeOnnxValueAfterUse)
Contributor Author:

The param disposeOnnxValueAfterUse is added back because it resolves both @hariharans29's and @jignparm's concerns:
- A boolean value indicating Run()'s behavior is conceptually correct: the code relies on the class's behavior rather than the class's name to figure out the right thing to do.
- The worry about making the interface more complicated is unnecessary: first, this is internal scope, so it does not affect the public interface; second, the core logic has moved to a helper class, and that class has no such flag.

Now it's clean: NamedOnnxValue (and its subclasses) just surface the correct handle plus a boolean value indicating how these resources are to be used and released, while the helper class focuses on how to create native resources (see the sketch below).
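A simplified sketch of the contract being described (not the actual onnxruntime source; apart from the ToNativeOnnxValue signature, the names are invented):

```csharp
using System;
using System.Buffers;

// Simplified sketch: a managed-only value creates a temporary native OrtValue per call and asks the
// caller to release it, while a pinned value hands out its long-lived OrtValue and keeps ownership.
internal class ManagedValueSketch
{
    internal virtual void ToNativeOnnxValue(out IntPtr onnxValue, out MemoryHandle pinnedMemoryHandle, out bool disposeOnnxValueAfterUse)
    {
        pinnedMemoryHandle = default;        // would pin the managed tensor buffer here
        onnxValue = IntPtr.Zero;             // would hold a freshly created OrtValue handle
        disposeOnnxValueAfterUse = true;     // caller releases the temporary OrtValue after Run()
    }
}

internal class PinnedValueSketch : ManagedValueSketch
{
    private readonly IntPtr _ortValueCreatedAtConstruction = IntPtr.Zero;   // created once, owned by this object

    internal override void ToNativeOnnxValue(out IntPtr onnxValue, out MemoryHandle pinnedMemoryHandle, out bool disposeOnnxValueAfterUse)
    {
        pinnedMemoryHandle = default;                    // buffer stays pinned for the object's lifetime
        onnxValue = _ortValueCreatedAtConstruction;      // reuse the OrtValue created in the constructor
        disposeOnnxValueAfterUse = false;                // released in Dispose(), not after each Run()
    }
}
```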

Member:

Thanks, I am okay with this. I went with @jignparm's review, but not relying on the class name does seem cleaner to me in general.

Contributor:

The 'disposeOnnxValueAfterUse' parameter in the ToNativeOnnxValue() method makes it seem like this method has knowledge of whether or not to dispose the ONNX values. The ONNX values should always be disposed; the only question is when: before Session.Run(), after Session.Run(), when the user is done with an object, or at some other time?

The two classes 'NamedOnnxValue' and 'DisposableNamedOnnxValue' giving guidance to the caller about whether the ONNX values should be disposed seems a bit weird -- they should just return a handle to the ONNX value. That would keep the ToNativeOnnxValue API clean and push the disposal logic off to the calling functions.

namespace Microsoft.ML.OnnxRuntime
{
/// <summary>
/// Represents an Onnx Value with its underlying buffer pinned
Member:

Should we add a comment about needing to know the exact dimensions of the Tensor (the length of the flattened buffer) in order to use this class?

@fs-eire (Contributor Author) Mar 26, 2020:

I agree to add this. Where do you think is the best place for the note: in FixedBufferOnnxValue's class summary, in CreateFromTensor's summary, or on Run()'s argument (which may duplicate it)?

Member:

CreateFromTensor's summary?

for (var i = 0; i < 1000; i++)
{
    // feed inputs
    var inputs = new int[] { 1, 2, 3, 4, 5 };
Contributor:

Also, in the case of the string type (variable lengths between runs 1..1000), is there any way to throw a friendly exception if that scenario is not handled? If so, can we add a test case that checks the exception message?

If there is no way to detect that scenario and a user tries it anyway, what does the user see (invalid output, or some other exception)?

Contributor Author:

For output, pre-allocated string tensors will cause an ArgumentException to be thrown in Run(); this is already covered in the other test case. A sketch of such a test follows below.
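A sketch of what such a test could look like in the style of InferenceTest.cs (the model, tensor names, and shapes are assumptions, and the exception message is deliberately not asserted):

```csharp
// xUnit sketch, assuming the usual InferenceTest.cs usings (System, Xunit, Microsoft.ML.OnnxRuntime, Tensors)
// and a hypothetical model with one float input and one string output.
[Fact]
private void ThrowsWhenStringTensorIsUsedAsPreAllocatedOutput()
{
    var inputTensor = new DenseTensor<float>(new float[] { 1f, 2f, 3f }, new[] { 1, 3 });
    var outputTensor = new DenseTensor<string>(new[] { string.Empty, string.Empty }, new[] { 1, 2 });

    using (var session = new InferenceSession("model_with_string_output.onnx"))     // hypothetical model
    using (var input = FixedBufferOnnxValue.CreateFromTensor(inputTensor))
    using (var output = FixedBufferOnnxValue.CreateFromTensor(outputTensor))
    {
        // pre-allocated outputs only support numeric tensors, so Run() is expected to throw
        Assert.Throws<ArgumentException>(
            () => session.Run(new[] { "input" }, new[] { input },
                              new[] { "output" }, new[] { output }));
    }
}
```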

{
outputValuesArray[outputIndex] = output.Value;

// for pre-allocated output, only numberic tensors are supported
Member:

nit: numeric

// for pre-allocated output, only numberic tensors are supported
if (onnxValueType != OnnxValueType.ONNX_TYPE_TENSOR || elementType == TensorElementType.String)
{
throw new ArgumentException("Only numberic tensors can be used as pre-allocated output.", nameof(outputs));
Member:

nit: numeric

disposeOnnxValueAfterUse = false;

// set onnx value type and tensor element type
onnxValueType = _onnxValueType;
Member:

Now that we have onnxValueType visible to the class, maybe we can change the check on line 101 to just check whether onnxValueType is a plain tensor?

hariharans29 previously approved these changes Mar 31, 2020
@hariharans29 (Member) left a comment:

Left some minor comments. LGTM overall.

outputValuesArray[outputIndex] = output.Value;

// for pre-allocated output, only numberic tensors are supported
if (output.OnnxValueType != OnnxValueType.ONNX_TYPE_TENSOR || output.ElementType == TensorElementType.String)
Contributor:

Both inputs and outputs are of type FixedBufferOnnxValue -- why check for the string type only for outputs? If the input is also a fixed-buffer value (pre-allocated buffers for speedup), shouldn't there be an exception if the input is of string type?

Member:

I think FixedBufferOnnxValue can be created from string tensors, and the underlying native OrtValue (created only once at object construction) can be used across multiple inference runs. As I see it, there is no reason to prevent FixedBufferOnnxValue inputs that hold string content. In fact, it seems advantageous to support, especially when the same string tensor is going to be scored multiple times -- this would avoid the OrtValue creation overhead each time (see the sketch below).
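A sketch of the input-side reuse being described (the model path, input name, and tensor content are assumptions):

```csharp
// Sketch: the string tensor is wrapped once, so the native OrtValue creation and string
// marshaling cost are paid at construction time only, then reused as input for many runs.
static void ScoreStringInputManyTimes()
{
    var stringTensor = new DenseTensor<string>(new[] { "xxx" }, new[] { 1 });    // content/shape invented

    using (var session = new InferenceSession("string_model.onnx"))              // hypothetical model
    using (var input = FixedBufferOnnxValue.CreateFromTensor(stringTensor))
    {
        for (int i = 0; i < 100; i++)
        {
            using (var results = session.Run(new[] { "text" }, new[] { input }))
            {
                // consume results; the input OrtValue is not recreated on each iteration
            }
        }
    }
}
```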

Contributor Author:

There are two reasons why string tensors should be disallowed as pre-allocated output:

1. There is no performance benefit. The underlying native implementation for pre-allocated output string tensors uses std::string's assignment operator on a newly created value, so even with pre-allocated string tensors the memory used for the strings is not actually "pre-allocated".

2. It creates an inconsistency for C# if we support pre-allocated string tensors as output. A C# string is not a view over a native string; marshaling is needed between native and managed strings, so the value of the string tensor would not be updated with the expected output.

In short, there is no real pre-allocation for non-numeric tensors (strings are always allocated separately because their length is unknown at allocation time), so there is no need to support them.

For inputs, however, as @hariharans29 mentioned, there is a valid requirement, so we support them.

Contributor:

> it seems advantageous to support, especially when the same string tensor is going to be scored multiple times

  • Scenario 1: A user scores the same string tensor many times against a model that has only one string input: e.g. {"xxx"}, {"xxx"}, {"xxx"}, {"xxx"} ... {"xxx"}

I assume scenario 1 is supported using FixedBufferOnnxValue, but there's no point in supporting it because the model will produce the same output for every example. An application can just cache the model score if there's a high-frequency string.

  • Scenario 2: A user would like to score 4 different string examples back to back: e.g. {"xxx"}, {"yyy"}, {"zzzzz"}, {"xxx"}

Is scenario 2 also supported by using FixedBufferOnnxValue? If not, then which strings should be removed (minimum number) to make the scenario supportable?

Member:

What if scenario 1 is tweaked so that an app scores multiple models that take the same input? Would it be beneficial to the overall perf if we did not have to do the string marshaling and create a native OrtValue in each model's Run()?

@jignparm (Contributor) Apr 1, 2020:

> What if scenario 1 is tweaked so that an app scores multiple models that take the same input? Would it be beneficial to the overall perf if we did not have to do the string marshaling

  1. The tweaked scenario (assuming there's some benefit) seems quite uncommon. At least it's not used in any of the ORT examples, unit tests, or tutorials :/

  2. More importantly, have a look at the new unit test TestReusingFixedBufferOnnxValueNonStringTypeMultiInferences(), at the line inputs.CopyTo(bufferInput, 0);
     That looks like the primary usage pattern for FixedBufferOnnxValue to get a speedup (see the sketch below). If that's the main pattern, how would it apply to string types? If there's a different pattern for string-type speedup, there should be a unit test or example for it -- otherwise users won't know about it.
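For reference, a sketch of that usage pattern (the model path and tensor names are assumptions; the point is that the buffer behind each FixedBufferOnnxValue is rewritten in place before every run):

```csharp
// Sketch of the fixed-buffer reuse pattern; model path and names are invented.
static void ReuseFixedBuffersAcrossInferences()
{
    var bufferInput = new int[5];
    var bufferOutput = new int[5];
    var tensorInput = new DenseTensor<int>(bufferInput, new[] { 1, 5 });
    var tensorOutput = new DenseTensor<int>(bufferOutput, new[] { 1, 5 });

    using (var session = new InferenceSession("identity_int.onnx"))                  // hypothetical model
    using (var valueInput = FixedBufferOnnxValue.CreateFromTensor(tensorInput))
    using (var valueOutput = FixedBufferOnnxValue.CreateFromTensor(tensorOutput))
    {
        for (var i = 0; i < 1000; i++)
        {
            // feed inputs: copy fresh data into the same pinned buffer
            var inputs = new int[] { i, i + 1, i + 2, i + 3, i + 4 };
            inputs.CopyTo(bufferInput, 0);

            session.Run(new[] { "input" }, new[] { valueInput },
                        new[] { "output" }, new[] { valueOutput });

            // bufferOutput now holds this iteration's results
        }
    }
}
```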

@hariharans29 (Member):

Btw - I think this change needs an update to https://github.com/microsoft/onnxruntime/blob/master/docs/CSharp_API.md to introduce the new class.

docs/CSharp_API.md (review thread resolved, now outdated)
jignparm previously approved these changes Apr 7, 2020
@jignparm (Contributor) left a comment:

Listing technical concerns/feedback here. Attaching approval to the PR due to business requirements.

  • Unneeded API changes
    We should instead look to improve performance without API changes. Instead of adding lots of new APIs and the new class FixedBufferOnnxValue, simply update the existing API (NamedOnnxValue and DisposableNamedOnnxValue) to support fixed buffers. This would save users a lot of confusion-by-too-many-choices, streamline the code base, reduce API changes, and still support fixed-size buffers.

  • 11 more InferenceSession.Run() overloads
    Originally there were 3 InferenceSession.Run() methods. The PR adds 11 new ones, giving a total of 14 Run() methods with a lot of repeated code in them. The worry is that they all do the same thing; the only difference is that some might be faster (e.g. if a user has tiny models and uses multiple threads). Usually, overloads should support new scenarios, not the same scenario (only faster).

  • Changes in a high-visibility unit test
    There are 8 validateRunResult*() checks in the unit test InferenceTest.cs:CanRunInferenceOnAModel(), with blocks of repeating code, making it look quite complex. This is an important function that also serves as documentation for users -- they will see a much more complex 'getting-started-style' example instead of the short, intuitive one, and are likely to become confused.

    • The extra validateRunResult*() checks should be refactored out into separate unit tests (see the sketch below).
    • This unit test is now less useful: we cannot tell which of the 8 checks is failing (we will only see the first one that fails).
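As an illustration of that suggestion, the new calling styles could be covered by their own facts (names and bodies are placeholders in the style of InferenceTest.cs):

```csharp
// Placeholder sketch: one fact per calling style, so a failure pinpoints the broken overload.
[Fact]
private void CanRunInferenceWithFixedBufferInputs()
{
    // arrange fixed-buffer inputs, call session.Run(inputNames, fixedInputs),
    // then apply the same validateRunResults(...) assertions as the original test
}

[Fact]
private void CanRunInferenceWithFixedBufferInputsAndOutputs()
{
    // arrange fixed-buffer inputs and pre-allocated outputs,
    // call session.Run(inputNames, fixedInputs, outputNames, fixedOutputs),
    // then assert on the output buffers with validateRunResultData(...)
}
```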

fs-eire merged commit 718068f into master on Apr 8, 2020
fs-eire deleted the fs-eire/C#-api-opt branch on April 8, 2020 18:57