This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Switch PVF preparation to CPU time #4217

Closed
pepyakin opened this issue Nov 4, 2021 · 16 comments · Fixed by #6282

pepyakin (Contributor) commented Nov 4, 2021

For controlling the execution time of PVF preparation (and potentially execution), it seems better to use CPU time rather than wall-clock time. The hope here is that CPU time is less influenced by the load on the machine.

The CPU time can be obtained with either clock_gettime or getrusage for the calling process. #4123 makes the PVF compilation single-threaded, so there should be only one thread compiling or executing. There should, however, be a watchdog thread that wakes up from time to time (roughly every 1 s) and terminates the process in case it has reached the deadline. That means that control of the timeouts shifts from the Polkadot host into the worker, which is not ideal but acceptable.
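
A minimal sketch of reading the calling process's CPU time, i.e. the value such a watchdog thread would poll (assuming Linux and the libc crate; function names and error handling are illustrative only):

```rust
use std::time::Duration;

// CPU time of the calling process via clock_gettime: CLOCK_PROCESS_CPUTIME_ID
// measures CPU time consumed by all threads of the process.
fn process_cpu_time() -> Duration {
    let mut ts: libc::timespec = unsafe { std::mem::zeroed() };
    let ret = unsafe { libc::clock_gettime(libc::CLOCK_PROCESS_CPUTIME_ID, &mut ts) };
    assert_eq!(ret, 0, "clock_gettime failed");
    Duration::new(ts.tv_sec as u64, ts.tv_nsec as u32)
}

// The same reading via getrusage(RUSAGE_SELF): user time + system time.
fn process_cpu_time_rusage() -> Duration {
    let mut ru: libc::rusage = unsafe { std::mem::zeroed() };
    let ret = unsafe { libc::getrusage(libc::RUSAGE_SELF, &mut ru) };
    assert_eq!(ret, 0, "getrusage failed");
    let tv = |t: libc::timeval| Duration::new(t.tv_sec as u64, t.tv_usec as u32 * 1000);
    tv(ru.ru_utime) + tv(ru.ru_stime)
}
```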

Alternatively, we could leave it as is, perhaps just increasing the current timeout. Then we would wait4 on the child worker and get its rusage (not unlike the well-known time program). This would keep the preparation process completely single-threaded and leave control of the timeouts on the host. However, that seems more complicated in terms of implementation.

slumber (Contributor) commented Nov 4, 2021

Is another thread necessary? I think it's possible to wrap this syscall into a future that checks the usage every second when polled. Then simply select it with the worker future, whichever finishes first.
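
A rough sketch of that select pattern (using tokio for illustration and the process_cpu_time() helper from the sketch above; whether this runs in the host or in the worker is exactly the question discussed below, and all names here are placeholders rather than the actual host code):

```rust
use std::time::Duration;

// Hypothetical placeholders for the worker future and its result.
struct Artifact;
struct TimeoutError;
async fn run_worker() -> Artifact { Artifact } // stands in for the real work

// Resolves once the process has burned through `limit` of CPU time,
// re-checking roughly once per second.
async fn cpu_time_deadline(limit: Duration) {
    while process_cpu_time() < limit {
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}

// Whichever future finishes first decides the outcome.
async fn prepare_with_timeout(limit: Duration) -> Result<Artifact, TimeoutError> {
    tokio::select! {
        artifact = run_worker() => Ok(artifact),
        _ = cpu_time_deadline(limit) => Err(TimeoutError),
    }
}
```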

drahnr (Contributor) commented Nov 4, 2021

According to the manual, wait4 should be avoided in favor of waitpid. Other than that I think the repeated polling check should work just fine.

@slumber the wasmtime compilation is a non-async blob, so I assume you mean a future that wraps the wasm-compiling thread and internally uses a timer to poll that thread (there needs to be some source of a wakeup to poll every ~1 s)? The bottom line is that one extra thread is still required; what runs in that thread depends on the usage semantics, async or not.

pepyakin (Contributor, Author) commented Nov 4, 2021

wasmtime's Module::new will compile the code and won't return until it finishes. That's, presumably, one thread. While that's doing its business, something should look at the clock and terminate the process if necessary. Be it threaded code or an async runtime, it has to have one thread.

slumber (Contributor) commented Nov 4, 2021

According to the manual, wait4 should be avoided in favor of waitpid. Other than that I think the repeated polling check should work just fine.

@slumber the wasmtime compilation is a non-async blob, so I assume you mean a future that wraps the wasm-compiling thread and internally uses a timer to poll that thread (there needs to be some source of a wakeup to poll every ~1 s)? The bottom line is that one extra thread is still required; what runs in that thread depends on the usage semantics, async or not.

This select polls a worker process and a timer; the latter can be replaced with a syscall-wrapping future.

pepyakin (Contributor, Author) commented Nov 5, 2021

Ok, I see where the misunderstanding is.

See the quote:

The CPU time can be obtained with either clock_gettime or getrusage for the calling process. #4123 makes the PVF compilation single-threaded, so there should be only one thread compiling or executing. There should, however, be a watchdog thread that wakes up from time to time (roughly every 1 s) and terminates the process in case it has reached the deadline. That means that control of the timeouts shifts from the Polkadot host into the worker, which is not ideal but acceptable.

This assumes the context of the child worker (that's where the compilation of the PVF happens at the moment, and that's why I mentioned Module::new). Given parallel_compilation=false, the child worker currently has, presumably, only one thread. In order for the mechanism to rely on CPU time, we would need to introduce a second thread into the child process, which would check the time and, after reaching the deadline, terminate the worker. So the timeouts would be controlled by the child worker, and not by the Polkadot validation host as is the case now.

OTOH, the code that you linked is the second case, that is:

Alternatively, we could leave it as is, perhaps just increasing the current timeout. Then we would wait4 on the child worker and get its rusage (not unlike the well-known time program). This would keep the preparation process completely single-threaded and leave control of the timeouts on the host. However, that seems more complicated in terms of implementation.

In that case, we do not change anything in the child process but change the Polkadot validation host process. This way the child worker won't get a new thread.

We cannot rely solely on waitpid/wait4 because they only react to the child process exiting. However, since we do not have a timer there, that will only happen after it finishes the compilation, which somewhat defeats the purpose of setting the timer. As an aside, we cannot just wrap a syscall in a future without involving a new thread, because those syscalls are blocking (and I am not aware of non-blocking alternatives to them).

So the idea I expressed there is that we would still leave the timer as is (i.e. as a wall clock), but just increase the timeout. After either the process finishes or the deadline timer triggers, we check how much CPU time was spent in the child process.
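
A rough sketch of that host-side flow (assuming Unix and the libc crate; names are illustrative): after the generous wall-clock deadline fires, or once the worker exits on its own, the host reaps the child and reads its CPU time from the rusage that wait4 fills in.

```rust
use std::time::Duration;

// Reap the child worker and return the CPU time (user + system) it consumed.
// On the deadline path the host would first kill `pid`, then call this.
fn reap_child_cpu_time(pid: libc::pid_t) -> Duration {
    let mut status: libc::c_int = 0;
    let mut ru: libc::rusage = unsafe { std::mem::zeroed() };
    let ret = unsafe { libc::wait4(pid, &mut status, 0, &mut ru) };
    assert!(ret == pid, "wait4 failed");
    let tv = |t: libc::timeval| Duration::new(t.tv_sec as u64, t.tv_usec as u32 * 1000);
    tv(ru.ru_utime) + tv(ru.ru_stime)
}
```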

I hope this is more clear.

eskimor (Member) commented Oct 30, 2022

@m-cat this is also an interesting ticket that I would really love to see implemented some time soon.

mrcnski self-assigned this Nov 7, 2022

mrcnski (Contributor) commented Nov 7, 2022

I've been investigating this and @eskimor asked me to post my findings.

Code here.

Goal

We wanted to determine how CPU time was affected by load, and to what degree it
is a better metric than wall clock time.

I ran an experiment by launching simultaneous processes and counting how many
random hashes each could do within 5 seconds of CPU time.

Parameters

I had another thread in each process that would run and terminate the process after 5 seconds of CPU time, to simulate option 1.

Another parameter was a 1-second sleep between checks in the polling thread, again taken from the issue description.

Ran on an 8-core machine.
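
A simplified sketch of what each child process in the experiment does (the real code is linked above; the hashing and reporting details here are illustrative, and process_cpu_time() is the helper sketched earlier in the thread):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::{Duration, Instant};

fn main() {
    let cpu_limit = Duration::from_secs(5);
    let hashes = Arc::new(AtomicU64::new(0));
    let started = Instant::now();

    // Watchdog thread: wake roughly once per second; once 5 s of CPU time has
    // been spent, report the counters and terminate the whole process.
    let counter = Arc::clone(&hashes);
    std::thread::spawn(move || loop {
        let cpu = process_cpu_time();
        if cpu >= cpu_limit {
            println!(
                "cpu: {:?}, wall: {:?}, hashes: {}",
                cpu,
                started.elapsed(),
                counter.load(Ordering::Relaxed)
            );
            std::process::exit(0);
        }
        std::thread::sleep(Duration::from_secs(1));
    });

    // Main thread: hash as fast as possible until the watchdog kills us.
    use std::hash::{Hash, Hasher};
    let mut x: u64 = 0;
    loop {
        let mut h = std::collections::hash_map::DefaultHasher::new();
        x.hash(&mut h);
        x = h.finish();
        hashes.fetch_add(1, Ordering::Relaxed);
    }
}
```

With a 1 s polling granularity the watchdog overshoots the 5 s budget a little, which is consistent with the ~5-6 s average CPU times in the table below.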

Data

+------------+--------------------+---------------------+---------------+-------------------+--------------------+
| # children | avg. cpu time (ms) | avg. wall time (ms) | avg. # hashes | hashes / cpu ms   | hashes / wall ms   |
+================================================================================================================+
| 1          | 5011               | 5013                | 367710        | 73                | 73                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 2          | 5017               | 5020                | 374605        | 74                | 74                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 3          | 5007               | 5015                | 371256        | 74                | 74                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 4          | 5501               | 5512                | 407122        | 74                | 73                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 5          | 5971               | 6005                | 426302        | 71                | 70                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 6          | 5976               | 6007                | 430291        | 72                | 71                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 7          | 5973               | 6010                | 417470        | 69                | 69                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 8          | 5987               | 6013                | 418483        | 69                | 69                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 9          | 5988               | 6011                | 386170        | 64                | 64                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 10         | 5995               | 6010                | 361615        | 60                | 60                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 11         | 5007               | 5014                | 284096        | 56                | 56                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 12         | 5969               | 6006                | 309027        | 51                | 51                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 13         | 5979               | 6008                | 296396        | 49                | 49                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 14         | 5939               | 6014                | 280787        | 47                | 46                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 15         | 5822               | 6011                | 262843        | 45                | 43                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 16         | 5756               | 6017                | 254496        | 44                | 42                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 17         | 5514               | 6011                | 243369        | 44                | 40                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 18         | 5131               | 6016                | 226493        | 44                | 37                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 19         | 5700               | 7016                | 251827        | 44                | 35                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 20         | 5482               | 7011                | 242054        | 44                | 34                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 21         | 5118               | 7019                | 225823        | 44                | 32                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 22         | 5186               | 7202                | 234281        | 45                | 32                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 23         | 5516               | 8028                | 243682        | 44                | 30                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 24         | 5293               | 8019                | 234046        | 44                | 29                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 25         | 5093               | 8028                | 224726        | 44                | 27                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 26         | 5477               | 9021                | 241853        | 44                | 26                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 27         | 5315               | 9027                | 234688        | 44                | 25                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 28         | 5111               | 9031                | 225914        | 44                | 25                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 29         | 5487               | 10030               | 242330        | 44                | 24                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 30         | 5156               | 10030               | 228029        | 44                | 22                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 31         | 5033               | 10025               | 222578        | 44                | 22                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 32         | 5386               | 11032               | 238045        | 44                | 21                 |
+------------+--------------------+---------------------+---------------+-------------------+--------------------+

Results

  • The extra polling thread in option 1 has a negligible effect because it is sleeping most of the time.

  • As long as each child process has its own core, there is little degradation in performance (up to 8 processes on an 8-core machine). I had other processes running on my machine, which probably caused the slight degradation.

  • Once I exceed 8 processes, the number of hashes per process starts to degrade more significantly. This is probably due to context-switching overhead from contention for cores, which is something that is likely to occur in systems under load.

  • It seems we can still do more work per unit of CPU time than per unit of wall-clock time.

Next steps

@eskimor suggests a hybrid approach:

It sounds like we will need some hybrid approach: having a thread that kills it after the CPU time is exceeded - we need that, otherwise we cannot really enforce the timeout - but then we should check the CPU time in the calling process to determine whether the timeout was exceeded or not. (This should be more precise than the checking thread.) By just killing based on wall clock time and then checking CPU time, we introduce a stupid edge case: wall clock exceeded, so we killed it, but CPU time was actually fine ... what do we do now?

I will next be investigating this question:

Can we really only determine CPU time at the parent after the fact? Or can we, for example, pause the execution after the wall clock time, check the CPU time, and if it has not yet been exceeded, let it run for some additional time ... repeat.

eskimor (Member) commented Nov 8, 2022

Awesome, thanks @m-cat! What we can see is that wall-clock time is almost as good/bad a metric up to 16 processes; only beyond that does the number of hashes per unit of wall-clock time decline further, while per unit of CPU time it stays stable.

This means CPU time is likely only a significant improvement under very high load. We reduce the factor from 73/21 = 3.47 to 73/44 = 1.659 ... so the variation under heavy load can be pretty much halved, which is still quite an improvement - not as good as we hoped (perfectly stable) - but still seems worthwhile to explore!

Is there maybe any other timing information available that is even less susceptible to system load?

eskimor (Member) commented Nov 8, 2022

Probably not. Given that the number of hashes per unit of CPU time stays stable once all hardware threads (16) are saturated, I would assume that the slowdown comes from CPU throttling to prevent overheating. Usually CPUs can run at higher clock speeds on a single thread and have to reduce clock speed once more cores get busy.

eskimor (Member) commented Nov 8, 2022

TL;DR: CPU time indeed helps reduce variance due to load. Variance remains, likely due to thermal management of CPUs.

mrcnski (Contributor) commented Nov 9, 2022

Can we really only determine CPU time at the parent after the fact? Or can we, for example, pause the execution after the wall clock time, check the CPU time, and if it has not yet been exceeded, let it run for some additional time ... repeat.

@eskimor To get the child's CPU time in the parent before execution has ended, it seems we have two main options:

  1. In the parent process, use the /proc filesystem, which is a non-portable facility (i.e. available on Linux but not macOS). If we can assume Linux, this would seem pretty simple. I could not find any portable solutions. (See the sketch after this list.)

  2. Using some form of IPC, query the CPU time from within the child and communicate it to the host. We are already using a socket to communicate between the host and a worker; we would need another one.
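
A non-portable (Linux-only) sketch of option 1, reading the child's CPU time from /proc/<pid>/stat while it is still running (field layout per proc(5): utime and stime are the 14th and 15th fields, in clock ticks; error handling is illustrative):

```rust
use std::time::Duration;

fn child_cpu_time(pid: u32) -> std::io::Result<Duration> {
    let stat = std::fs::read_to_string(format!("/proc/{pid}/stat"))?;
    // The process name (2nd field) may contain spaces, so parse after the
    // closing parenthesis that terminates it.
    let rest = &stat[stat.rfind(')').unwrap() + 2..];
    let fields: Vec<&str> = rest.split_whitespace().collect();
    // `rest` starts at field 3 (state), so utime/stime are at indices 11/12.
    let utime: u64 = fields[11].parse().unwrap();
    let stime: u64 = fields[12].parse().unwrap();
    let ticks_per_sec = unsafe { libc::sysconf(libc::_SC_CLK_TCK) } as u64;
    Ok(Duration::from_secs_f64((utime + stime) as f64 / ticks_per_sec as f64))
}
```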

vstakhov (Contributor) commented

Well, the portable way to get this information should be a getrusage call with the RUSAGE_CHILDREN variant. However, it can only be used from the parent side after a child has terminated. Another way is, indeed, keeping a socketpair open between parent and children, where the children are expected to send heartbeat messages with their own getrusage call results.
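
A child-side sketch of the socketpair/heartbeat idea (assuming a Unix socket already shared with the host and the process_cpu_time() helper from earlier in the thread; the wire format is made up for illustration):

```rust
use std::io::Write;
use std::os::unix::net::UnixStream;
use std::time::Duration;

// Runs in the child: periodically report our own CPU time to the host.
fn cpu_time_heartbeat(mut stream: UnixStream) {
    loop {
        let cpu = process_cpu_time();
        // Hypothetical wire format: CPU time in milliseconds, little-endian u64.
        let _ = stream.write_all(&(cpu.as_millis() as u64).to_le_bytes());
        std::thread::sleep(Duration::from_secs(1));
    }
}
```

The host would read these heartbeats alongside its existing worker socket and enforce the deadline itself.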

eskimor (Member) commented Nov 10, 2022

For the platform-dependent option: we could go that way if there are alternatives on the other platforms. In any case, if a platform-independent solution is available (e.g. the open socket), we should go with that one.

burdges (Contributor) commented Nov 10, 2022

It appears some BSDs have procfs, btw; not sure if it's the same, and I guess macOS does not have it.

mrcnski (Contributor) commented Nov 10, 2022

I had a quick call with @eskimor today, where I proposed we simply have the child process self-terminate after a CPU-time timeout, instead of the child communicating the CPU time to the host. I think the idea with the latter was to avoid the variance of a 1 s polling delay, since we don't want to poll more often than that. But having the child report its CPU time to the host doesn't really gain anything in the end.

So having the child poll the CPU time and exit on its own was deemed acceptable,
and already an improvement over wall clock time. We'd still have a wall clock
timeout on the host, but it would be generous (i.e. a factor of 3) and only
expected to be hit if the child process happens to stall.

Please let me know if I missed any important details @eskimor, as most of the
call took place in your head. :P

Now, I thought about this problem a bit and believe we can simplify further: I'm not sure why we need to poll at all. I.e. if the CPU timeout is 5 seconds, why sleep in 1 s intervals of wall-clock time? In my understanding, the CPU clock cannot advance faster than the wall clock, so we can just sleep for 5 seconds plus a bit to account for overhead, check the remaining CPU time, and repeat. We would just need a minimum sleep time of, say, 50 or 100 ms. This seems more accurate and should also cause fewer thread wakeups.
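
A sketch of that monitor loop (running in the worker, reusing the process_cpu_time() helper from earlier in the thread; the exact constants and exit behaviour are illustrative):

```rust
use std::time::Duration;

// Worker-side CPU-time monitor: sleep for however much of the CPU budget is
// left (CPU time cannot advance faster than wall-clock time), then re-check.
fn cpu_time_monitor(limit: Duration) -> ! {
    const MIN_SLEEP: Duration = Duration::from_millis(50);
    loop {
        let spent = process_cpu_time();
        if spent >= limit {
            // Deadline reached: report the timeout and terminate the worker.
            std::process::exit(1);
        }
        // Never sleep less than MIN_SLEEP, to bound wakeups near the deadline.
        std::thread::sleep(std::cmp::max(limit - spent, MIN_SLEEP));
    }
}
```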

burdges (Contributor) commented Nov 10, 2022

We do want the running time in the future, but the child could supply this when done.
