
Add cooldown interval logic for the profiling functional #11465

Merged
merged 3 commits into apache:main on Jun 28, 2022

Conversation

Icemist
Contributor

@Icemist Icemist commented May 25, 2022

Motivation:

Sometimes, when running the script several times, we get a strong scatter in the results.

from tvm.contrib.debugger import debug_runtime as graph_executor
...
m = graph_executor.create(graph, lib, ctx)
...
time_f = m.module.time_evaluator("run", dev, number=1000, repeat=1)
mean = time_f().mean
print(mean)

First try:

0.0846912 secs/iteration

Second try:

0.105601 secs/iteration

In general, devices have an acceptable temperature range. If the system heats up to the upper limit for some time, it can force cooling by whatever means are available, for example by lowering the frequency.
In the following example the temperature reaches 90 degrees; after some time the frequency drops, and it is restored after cooling.

from tvm.contrib.debugger import debug_runtime as graph_executor
...
m = graph_executor.create(graph, lib, ctx)
...
time_f = m.module.time_evaluator("run", dev, number=1)
one = time_f().results[0]
nums = round(1/one)
time_f = m.module.time_evaluator("run", dev, number=1, repeat=nums*200)
t = time_f()
mean = t.mean
print(mean)
0.105511 secs/iteration

print(t.results)
[0.083143681, ... , 0.147928064, ... , 0.147806464]

the results of device profiling: [image: 66_over]

the plot drawn from t.results: [image: p_orig]

In the proposed change, we preemptively pause the calculations to keep the device temperature within comfortable limits. As a result, the temperature does not exceed 70 degrees.

from tvm.contrib.debugger import debug_runtime as graph_executor
...
m = graph_executor.create(graph, lib, ctx)
...
time_f = m.module.time_evaluator("run", dev, number=1, repeat=200,
                                 run_delay=1000, repeats_to_cooldown=12)
t = time_f()
mean = t.mean
print(mean)
0.0832038 secs/iteration

print(t.results)
[0.083007232, ... , 0.083067136, ... , 0.083169024]

the results of device profiling: [image: 66_patch]

the plot drawn from t.results: [image: p_patch]

from tvm.contrib.debugger import debug_runtime as graph_executor
...
m = graph_executor.create(graph, lib, ctx)
...
m.run(number=100, repeat=2) 

Node Name  Ops  Time(us)  Time(%)  Shape   Inputs  Outputs  Measurements
---------  ---  --------  -------  -----   ------  -------  ------------
f1         f1   9451.75   9.822    (2, 2)  3       1        [9.456, 9.465, ..., 9.463]
f2         f2   5255.25   5.461    (2, 2)  4       1        [5.261, 5.308, ..., 5.302]
...
fn         fn   5214.75   5.419    (2, 2)  4       1        [5.266, 5.192, ..., 5.199]

Changes:

  • Added cooldown_interval_ms and repeats_to_cooldown parameters to the following functions:
    python/tvm/runtime/vm.py:benchmark
    python/tvm/runtime/module.py:time_evaluator
    python/tvm/contrib/graph_executor.py:benchmark
    python/tvm/contrib/debugger/debug_executor.py:run
    python/tvm/contrib/debugger/debug_executor.py:run_individual
    python/tvm/contrib/debugger/debug_executor.py:run_individual_node

    These parameters pause measurement so the device can cool down.
  • For python/tvm/contrib/debugger/debug_executor.py:run, the number, repeat, min_repeat_ms, and repeats_to_cooldown parameters were added.
  • Fixed RunIndividual logging.
  • Removed USE_PROFILER, which is no longer in use.
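To make the behavior of the two new parameters concrete, here is a pure-Python sketch (a hypothetical helper, not TVM's actual C++ implementation) of the scheduling they describe: after every repeats_to_cooldown measured repeats, the evaluator sleeps for cooldown_interval_ms milliseconds so the device can cool down.

```python
import time

def time_with_cooldown(func, repeat, cooldown_interval_ms=0, repeats_to_cooldown=1):
    """Measure `func` `repeat` times, pausing for cooldown_interval_ms
    milliseconds after every `repeats_to_cooldown` measurements."""
    results = []
    for i in range(repeat):
        start = time.perf_counter()
        func()
        results.append(time.perf_counter() - start)
        # Cooldown pause: sit idle so the device temperature can drop.
        if cooldown_interval_ms > 0 and (i + 1) % repeats_to_cooldown == 0:
            time.sleep(cooldown_interval_ms / 1000.0)
    return results
```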

@Icemist Icemist force-pushed the avoronov/detailed_profiling branch 11 times, most recently from 7439e61 to 20e6d32 Compare June 2, 2022 10:25
@Icemist Icemist changed the title Add run_delay to profiling functional and provide detailed data about runs Add cooldown_interval_ms and internal measurements for profiling Jun 2, 2022
@Icemist Icemist force-pushed the avoronov/detailed_profiling branch from 20e6d32 to 7dc790c Compare June 2, 2022 11:00
@Icemist Icemist marked this pull request as ready for review June 2, 2022 11:46
@Icemist Icemist force-pushed the avoronov/detailed_profiling branch from 7dc790c to 659dead Compare June 2, 2022 15:45
@Icemist Icemist force-pushed the avoronov/detailed_profiling branch 2 times, most recently from 8e3e7fe to 1d96f78 Compare June 9, 2022 00:05
@Icemist
Contributor Author

Icemist commented Jun 9, 2022

@areusch
Contributor

areusch commented Jun 9, 2022

@Icemist i like the direction here--but, i am wondering if it's possible to put some teeth behind this by allowing an application-specific way to verify the cooldown actually happened. for example, perhaps one could pass a PackedFunc that knows how to check the die temperature and continue cooldown if it is still too hot. what do you think of this? otherwise, it strikes me as a feature that would be hard to use in practice if you weren't already watching this separately.

@Icemist
Contributor Author

Icemist commented Jun 9, 2022

> @Icemist i like the direction here--but, i am wondering if it's possible to put some teeth behind this by allowing an application-specific way to verify the cooldown actually happened. for example, perhaps one could pass a PackedFunc that knows how to check the die temperature and continue cooldown if it is still too hot. what do you think of this? otherwise, it strikes me as a feature that would be hard to use in practice if you weren't already watching this separately.

This cooldown_interval logic is already used by RPCRunner, LocalRunner, etc.
The other part of the patch only lets you verify the quality of the data.

As for your suggestion: in theory, one could reuse the existing f_preproc hook. Do one repeat, call f_preproc with code that checks the temperature and cools the device, then do the next repeat.

for (i : repeat)
  call(f_preproc)  // sleep here until the temperature is right
  timer.start()
  call(pf)
  timer.end()

It seems to me that both options have a place.
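The idea could be sketched in Python like this; make_cooldown_preproc and read_temp_c are hypothetical names for illustration, not existing TVM APIs:

```python
import time

def make_cooldown_preproc(read_temp_c, max_temp_c=70.0, poll_s=0.5):
    """Build a preprocessor that blocks until the device has cooled.

    read_temp_c: user-supplied callable returning the current die
    temperature in Celsius (e.g. read from a sysfs thermal zone).
    """
    def f_preproc():
        # Poll until the reported temperature is within limits.
        while read_temp_c() > max_temp_c:
            time.sleep(poll_s)
    return f_preproc
```

One repeat would then run between calls to f_preproc, so each timed region starts from an acceptable temperature.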

@AndrewZhaoLuo
Contributor

cc @tkonolige

Contributor

@tkonolige tkonolige left a comment


@Icemist thank you for the PR. Having a cooldown interval during timing is very useful.

However, you've moved the timer in time_evaluator into the inner loop. The timer is specifically in the outer loop so that we can time functions with a very small runtime or use less precise timers. I'll need you to revert this change before we can accept this PR.
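For context, the two timer placements can be contrasted with a rough Python illustration (function names are illustrative): with the timer in the outer loop, one start/stop pair covers all `number` calls, so timer resolution and overhead are amortized; with the timer in the inner loop, every call pays the full timer error, which swamps very short functions.

```python
import time

def outer_loop_time(func, number):
    """Outer-loop timing: one start/stop pair around all `number`
    calls; timer resolution and overhead are amortized over the batch."""
    start = time.perf_counter()
    for _ in range(number):
        func()
    return (time.perf_counter() - start) / number

def inner_loop_time(func, number):
    """Inner-loop timing: each call is timed separately, so for very
    short functions the result is dominated by the timer itself."""
    total = 0.0
    for _ in range(number):
        start = time.perf_counter()
        func()
        total += time.perf_counter() - start
    return total / number
```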

Comment on lines 158 to 163
starting_times[1:] = np.cumsum(
[
np.mean([np.mean(repeat_data) for repeat_data in node_data])
for node_data in self._time_list
]
)
Contributor


This doesn't seem right. The variable is called starting_times but its values are the mean of mean time.

Contributor Author


Before my changes, this was the average over a single repeat. Here I return not only the first repeat of each node but all of them, as arrays of raw measurements.
Since I now return all repeats for each node, not just the first, I take the mean here. Below, where the duration is calculated, I also take the mean.
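As a small worked example with made-up numbers (pure Python rather than the numpy in the snippet, to keep it self-contained), the cumulative sum of per-node mean times gives each node's starting offset on the timeline:

```python
# Hypothetical raw data: for each node, a list of repeats, each repeat
# a list of individual measurements (seconds).
time_list = [
    [[1.0, 1.2], [1.1, 1.1]],  # node 0: per-repeat means 1.1, 1.1 -> node mean 1.1
    [[2.0, 2.2], [2.1, 1.9]],  # node 1: per-repeat means 2.1, 2.0 -> node mean 2.05
]

# Mean over repeats of (mean over measurements), one value per node.
node_means = [
    sum(sum(r) / len(r) for r in node) / len(node) for node in time_list
]

# Running cumulative sum: node i starts when nodes 0..i-1 have finished.
starting_times = [0.0]
for m in node_means:
    starting_times.append(starting_times[-1] + m)
# starting_times ~= [0.0, 1.1, 3.15] (up to float rounding)
```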

@@ -205,12 +210,18 @@ def _dump_graph_json(self, graph):

def get_debug_result(self, sort_by_time=True):
"""Return the debugger result"""
header = ["Node Name", "Ops", "Time(us)", "Time(%)", "Shape", "Inputs", "Outputs"]
lines = ["---------", "---", "--------", "-------", "-----", "------", "-------"]
header = ["Node Name", "Ops", "Time(us)", "Time(%)", "Shape", "Inputs", "Outputs", "Times"]
Contributor


We have Time(us) and Times here. What is the difference?

Contributor Author


I changed it to return all repeats, not just the first, and renamed the column to "Measurements(us)".

src/runtime/crt/common/crt_runtime_api.c (resolved)
src/runtime/graph_executor/debug/graph_executor_debug.cc (resolved)
src/runtime/graph_executor/debug/graph_executor_debug.cc (resolved)
src/runtime/profiling.cc (resolved)
src/runtime/profiling.cc (resolved)
Contributor

@tkonolige tkonolige left a comment


Changing my review to "Comment" so it doesn't block the PR while I am out of town for the next couple of weeks. Moving the timers into the inner loop still needs to be reverted though.

@tkonolige tkonolige dismissed their stale review June 10, 2022 18:26

out of town

@areusch
Contributor

areusch commented Jun 16, 2022

@Icemist it'd be great if you could address @tkonolige 's comments.

> In your suggestion, in theory, you could try to use a ready-made f_preproc. Do 1 repeat, call f_preproc with some code to check and cool the temperature, do the next repeat.

just wondering, would it be possible to achieve same effect as this PR with that strategy?

@Icemist Icemist force-pushed the avoronov/detailed_profiling branch 4 times, most recently from 2cb6fc6 to b140a6f Compare June 17, 2022 11:40
@Icemist Icemist force-pushed the avoronov/detailed_profiling branch from b140a6f to 58c20ea Compare June 20, 2022 02:13
@Icemist
Contributor Author

Icemist commented Jun 20, 2022

> @Icemist it'd be great if you could address @tkonolige 's comments.

> In your suggestion, in theory, you could try to use a ready-made f_preproc. Do 1 repeat, call f_preproc with some code to check and cool the temperature, do the next repeat.

> just wondering, would it be possible to achieve same effect as this PR with that strategy?

It seems to me that with that strategy it would be harder to get a quick, generalized solution.
It would probably require a general mechanism for adding such check functions for different devices, written by developers familiar with those systems.

@Icemist Icemist changed the title Add cooldown_interval_ms and internal measurements for profiling Add cooldown interval logic for the profiling functional Jun 20, 2022
@Icemist Icemist force-pushed the avoronov/detailed_profiling branch 2 times, most recently from f16f5b4 to eaeccc3 Compare June 20, 2022 10:58
@Icemist Icemist requested a review from tkonolige June 22, 2022 11:00
@Icemist
Contributor Author

Icemist commented Jun 22, 2022

@areusch and @AndrewZhaoLuo, what do you think? I addressed all the remarks. I also followed up on @tkonolige's request.

Contributor

@tkonolige tkonolige left a comment


Looks good @Icemist, just a couple questions.

include/tvm/runtime/profiling.h (resolved)
include/tvm/runtime/profiling.h
src/runtime/crt/common/crt_runtime_api.c (resolved)
* \param repeats_to_cooldown The number of repeats before the
* cooldown is activated.
* \return Comma separated string containing the elapsed time per op for
* the last iteration only, because returning a long string over rpc can be expensive.
Contributor


It seems like you've changed this to return all elapsed time per ops instead of just the last one? The comment seems to indicate this may be problematic?

Contributor Author


I wonder what caused that comment. I didn't have any problems even when there were tens of thousands of numbers here.
I would guess the right thing to do is to pack the data the way time_evaluator does, and then extract it with struct.unpack.
I would like to know the reason for this comment:
// Have problems returning FloatImm so serialize to string results as hack.
@AndrewZhaoLuo can you help with that?

In a previous version of the patch I did dynamic array packing, and it was not a problem for time_evaluator.

Contributor


I forget the exact reason why FloatImm can't be serialized but what you are doing is the string serialization hack which is fine.

"returning a long string over rpc can be expensive." i don't have context on but if you aren't having any problems it's probably fine?

src/runtime/graph_executor/debug/graph_executor_debug.cc (resolved)
@Icemist Icemist force-pushed the avoronov/detailed_profiling branch from eaeccc3 to 66f08ba Compare June 23, 2022 01:20
@Icemist Icemist force-pushed the avoronov/detailed_profiling branch from 66f08ba to b19bb4d Compare June 23, 2022 10:54
// do-while structure ensures we run even when `min_repeat_ms` isn't set (i.e., is 0).
do {
if (curr_res_seconds > 0.0) {
double a = (min_repeat_seconds / (curr_res_seconds / g_time_evaluator_state.number) + 1);
double b = g_time_evaluator_state.number * 1.618; // 1.618 is chosen by random
Contributor


I know the previous code said 1.618 was random, but its actually not :). Its the golden ratio!
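For what it's worth, the growth rule in the quoted C snippet can be sketched in Python (a hypothetical helper for illustration): the next call count is the larger of a linear extrapolation that would fill min_repeat_ms and a golden-ratio scale-up of the previous count, so the count keeps growing even when the extrapolation stalls.

```python
def next_number(curr_res_seconds, number, min_repeat_ms):
    """Sketch of the call-count update from the quoted C code.

    a: how many calls would fill min_repeat_ms at the measured
       per-call cost (plus one for margin).
    b: grow the previous count by the golden ratio (~1.618) so the
       count still increases when `a` is too conservative.
    """
    min_repeat_seconds = min_repeat_ms / 1000.0
    per_call_seconds = curr_res_seconds / number
    a = min_repeat_seconds / per_call_seconds + 1
    b = number * 1.618
    return int(max(a, b))

# With exact binary fractions: 8 calls took 0.5 s (0.0625 s/call);
# filling 1000 ms needs 16 calls, plus 1 -> 17; 8 * 1.618 = 12.944.
```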

time = 0;
for (const auto& repeat_data : time_sec_per_op[index]) {
// To have good behavior when calculating total time, etc.
os << (std::isnan(repeat_data) ? std::to_string(0) : std::to_string(repeat_data)) << ",";
Contributor

@AndrewZhaoLuo AndrewZhaoLuo Jun 23, 2022


Probably want full precision so want something like

// use maximum precision available and use fixed representation
os << std::fixed;
os.precision(std::numeric_limits<double>::max_digits10);

and feed every double manually into the stream.


@AndrewZhaoLuo
Contributor

Leave the rest up to @tkonolige

@Icemist Icemist force-pushed the avoronov/detailed_profiling branch 2 times, most recently from 9355cfa to 3004b89 Compare June 27, 2022 19:08
@Icemist
Contributor Author

Icemist commented Jun 27, 2022

@tkonolige I added a commit to this PR 3e3b2cd where I change the way the RunIndividual and RunIndividualNode functions get results. I think this should solve the issue above.

@Icemist Icemist force-pushed the avoronov/detailed_profiling branch from 3004b89 to 4a44bd5 Compare June 27, 2022 22:13
Contributor

@tkonolige tkonolige left a comment


Looks great! Thanks @Icemist!

@tkonolige tkonolige merged commit 6c8a353 into apache:main Jun 28, 2022
blackkker pushed a commit to blackkker/tvm that referenced this pull request Jul 7, 2022
* Add cooldown interval logic for the profiling functional.

* Remove string serialize hack from RunIndividual functions

* Update src/runtime/graph_executor/debug/graph_executor_debug.cc

Co-authored-by: Tristan Konolige <[email protected]>
masahi pushed a commit to masahi/tvm that referenced this pull request Jul 15, 2022
mikeseven pushed a commit to mikeseven/tvm that referenced this pull request Sep 27, 2023