[Performance]: Strange cpu utilization and high latency #26449
Comments
What's this?
It appears that this Reorder is indeed converting fp32 to bf16. This is quite awkward: I originally intended to use bf16 for acceleration, but the conversion to bf16 has itself become the new bottleneck.
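To make concrete what an fp32-to-bf16 conversion has to do for every element, here is a minimal sketch in plain Python. It is not OpenVINO's Reorder implementation; the function names `f32_to_bf16_bits` and `bf16_bits_to_f32` are hypothetical, and the rounding mode (round to nearest even) is an assumption about how such a conversion is typically done. NaN inputs are not handled.

```python
import struct


def f32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit bf16 pattern for x (hypothetical helper, NaN not handled)."""
    # Reinterpret the value as its 32-bit IEEE 754 single-precision pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 bits that are about to be dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_f32(b: int) -> float:
    """Expand a 16-bit bf16 pattern back to a float by zero-filling the low bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


if __name__ == "__main__":
    for value in (1.0, 3.14159, -0.1):
        b = f32_to_bf16_bits(value)
        print(f"{value!r} -> 0x{b:04X} -> {bf16_bits_to_f32(b)!r}")
```

bf16 keeps the sign and the 8 exponent bits of fp32 and only 7 mantissa bits, so each element needs very little arithmetic. The cost of such a step is mostly reading 4 bytes and writing 2 bytes per element, which is consistent with the idea that it is limited by memory traffic rather than compute.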
In terms of resource scheduling, one of the CPU cores has excessively high utilization, which seems likely to become a bottleneck.
@wangleis Is this related to core binding? Thanks!
@LinGeLin For the low CPU utilization issue, it looks like the performance is limited by the memory bandwidth. Could you try
You might be right. I tried setting nthreads=12 and nstreams=2, and the CPU utilization increased from 17% to 30%. The other issue is the time spent converting the input from FP32 to BF16: how should I address it? Should I modify the model? The -ip parameter of benchmark_app currently does not support BF16. Is there a plan to add support for it?
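The two settings compared in this thread can be laid out side by side with a small sketch. It only uses flags that appear in the issue (`taskset -c`, `-m`, `-hint none`, `-nthreads`, `-nstreams`); the helper names `benchmark_cmd` and `threads_per_stream` are hypothetical, and the even split of threads across streams is an assumption about how the CPU plugin distributes work, not something confirmed here.

```python
import shlex


def threads_per_stream(nthreads: int, nstreams: int) -> int:
    """Threads available to each stream, ASSUMING an even split across streams."""
    if nstreams <= 0 or nthreads % nstreams != 0:
        raise ValueError("nthreads must be a positive multiple of nstreams")
    return nthreads // nstreams


def benchmark_cmd(model_path, nthreads, nstreams, cpu_list=None):
    """Build a benchmark_app argument list from the flags quoted in the issue."""
    cmd = []
    if cpu_list:
        # Pin the process to a CPU range, as the original command does.
        cmd += ["taskset", "-c", cpu_list]
    cmd += [
        "benchmark_app", "-m", model_path,
        "-hint", "none",
        "-nthreads", str(nthreads),
        "-nstreams", str(nstreams),
    ]
    return cmd


if __name__ == "__main__":
    # (24, 4) is the setting from the issue body, (12, 2) the one tried later.
    for nthreads, nstreams in [(24, 4), (12, 2)]:
        cmd = benchmark_cmd("/xxx/model.xml", nthreads, nstreams, cpu_list="0-23")
        print(shlex.join(cmd))
        print("  threads per stream:", threads_per_stream(nthreads, nstreams))
```

Under that assumption both settings give each stream the same number of threads, so the difference between them would come from the total number of busy cores competing for memory bandwidth rather than from per-stream parallelism.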
@LinGeLin can you share the details of the CPU you're using with
Yes. Architecture: x86_64
Okay. I suppose that you've already tried out
Tested. The boss wants us to migrate from TensorFlow to OpenVINO because we can use BF16, but the current test results do not show much of an advantage.
Can you share the model with us? We can take a look.
Due to confidentiality requirements, I need to hide the input and output information, but modifying the saved_model is too painful. Modifying...
OpenVINO Version
2024.0.0
Operating System
Ubuntu 20.04 (LTS)
Device used for inference
None
OpenVINO installation
PyPi
Programming Language
Python
Hardware Architecture
x86 (64 bits)
Model used
Rec model
Model quantization
Yes
Target Platform
No response
Performance issue description
taskset -c 0-23 benchmark_app -m /xxx/model.xml -report_type detailed_counters -report_folder /xxxbenchmark_app_report/ -dump_config /xxxdump_config.json -hint none -nthreads 24 -nstreams 4
With nstreams=4, latency is twice as high as with nstreams=2, yet there is still only one CPU core at 100% utilization per stream. What is this fully loaded core doing? What logic is the OpenVINO engine running on it?
In addition, the benchmark report file contains many Reorder nodes, which take a lot of time. They look like f32-to-bf16 conversions. Can you confirm that for me? Is there a way to optimize this?
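One way to quantify how much time the Reorder nodes take is to sum the per-layer times in the detailed counters report by layer type. The sketch below is only an illustration: the column names `layerType` and `realTime (ms)`, the `;` delimiter, and the helper name `time_by_layer_type` are assumptions about the report layout and should be checked against the actual file. The sample rows are made up for the example and are not measurements from this issue.

```python
import csv
import io
from collections import defaultdict


def time_by_layer_type(report_text, type_col="layerType",
                       time_col="realTime (ms)", delimiter=";"):
    """Sum the time column per layer type, largest first (column names are assumed)."""
    totals = defaultdict(float)
    reader = csv.DictReader(io.StringIO(report_text), delimiter=delimiter)
    for row in reader:
        try:
            totals[row[type_col]] += float(row[time_col])
        except (KeyError, TypeError, ValueError):
            # Skip rows that do not carry a usable type or time value.
            continue
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical rows, invented purely to exercise the function.
SAMPLE = """layerName;execStatus;layerType;realTime (ms)
input_convert;EXECUTED;Reorder;0.5
dense_1;EXECUTED;FullyConnected;0.25
concat_convert;EXECUTED;Reorder;0.25
"""

if __name__ == "__main__":
    for layer_type, total_ms in time_by_layer_type(SAMPLE):
        print(f"{layer_type}: {total_ms:.2f} ms")
```

Running the same aggregation on the real report would show what share of the per-inference time goes to Reorder compared with the compute layers.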
Step-by-step reproduction
No response
Issue submission checklist