This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Switch PVF preparation to CPU time #4217

Closed
pepyakin opened this issue Nov 4, 2021 · 16 comments · Fixed by #6282

pepyakin (Contributor) commented Nov 4, 2021

For controlling the execution time of PVF preparation (and potentially execution), it seems better to use CPU time rather than wall-clock time. The hope here is that CPU time is less influenced by the load on the machine.

The CPU time can be obtained with either clock_gettime or getrusage for the calling process. #4123 makes the PVF compilation single-threaded, so there should be only one thread compiling or executing. There should, however, be a watchdog thread that wakes up from time to time (roughly every 1 s) and terminates the process in case it has reached the deadline. That means that control of the timeouts shifts from the Polkadot host into the worker, which is not ideal but acceptable.
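
A minimal sketch of reading the calling process's CPU time, i.e. the value such a watchdog thread would poll (assuming Linux and the libc crate; function names and error handling are illustrative only):

```rust
use std::time::Duration;

// CPU time of the calling process via clock_gettime: CLOCK_PROCESS_CPUTIME_ID
// measures CPU time consumed by all threads of the process.
fn process_cpu_time() -> Duration {
    let mut ts: libc::timespec = unsafe { std::mem::zeroed() };
    let ret = unsafe { libc::clock_gettime(libc::CLOCK_PROCESS_CPUTIME_ID, &mut ts) };
    assert_eq!(ret, 0, "clock_gettime failed");
    Duration::new(ts.tv_sec as u64, ts.tv_nsec as u32)
}

// The same reading via getrusage(RUSAGE_SELF): user time + system time.
fn process_cpu_time_rusage() -> Duration {
    let mut ru: libc::rusage = unsafe { std::mem::zeroed() };
    let ret = unsafe { libc::getrusage(libc::RUSAGE_SELF, &mut ru) };
    assert_eq!(ret, 0, "getrusage failed");
    let tv = |t: libc::timeval| Duration::new(t.tv_sec as u64, t.tv_usec as u32 * 1000);
    tv(ru.ru_utime) + tv(ru.ru_stime)
}
```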

Alternatively, we could leave it as is, perhaps just increasing the current timeout. Then we would wait4 on the child worker and get its rusage (not unlike the well-known time program). This would keep the preparation process completely single-threaded and leave control of the timeouts on the host. However, that seems more complicated in terms of implementation.

slumber (Contributor) commented Nov 4, 2021

Is another thread necessary? I think it's possible to wrap this syscall into a future that checks the usage every second when polled. Then simply select it with the worker future, whichever finishes first.
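
A rough sketch of that select pattern (using tokio for illustration and the process_cpu_time() helper from the sketch above; whether this runs in the host or in the worker is exactly the question discussed below, and all names here are placeholders rather than the actual host code):

```rust
use std::time::Duration;

// Hypothetical placeholders for the worker future and its result.
struct Artifact;
struct TimeoutError;
async fn run_worker() -> Artifact { Artifact } // stands in for the real work

// Resolves once the process has burned through `limit` of CPU time,
// re-checking roughly once per second.
async fn cpu_time_deadline(limit: Duration) {
    while process_cpu_time() < limit {
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}

// Whichever future finishes first decides the outcome.
async fn prepare_with_timeout(limit: Duration) -> Result<Artifact, TimeoutError> {
    tokio::select! {
        artifact = run_worker() => Ok(artifact),
        _ = cpu_time_deadline(limit) => Err(TimeoutError),
    }
}
```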

drahnr (Contributor) commented Nov 4, 2021

According to the manual, wait4 should be avoided in favor of waitpid. Other than that I think the repeated polling check should work just fine.

@slumber the wasmtime compilation is a non-async blob, so I assume you mean a future that wraps the wasm-compiling thread and internally uses a timer to poll that thread (there needs to be some source of a wakeup to poll every ~1 s)? The bottom line is that one extra thread is still required; what runs in that thread depends on the usage semantics, async or not.

pepyakin (Contributor, Author) commented Nov 4, 2021

wasmtime's Module::new will compile the code and won't return until it finishes. That's, presumably, one thread. While that's doing its business, something should look at the clock and terminate the process if necessary. Be it threaded code or an async runtime, it has to have one thread.

slumber (Contributor) commented Nov 4, 2021

According to the manual, wait4 should be avoided in favor of waitpid. Other than that I think the repeated polling check should work just fine.

@slumber the wasmtime compilation is a non-async blob, so I assume you mean a future that wraps the wasm-compiling thread and internally uses a timer to poll that thread (there needs to be some source of a wakeup to poll every ~1 s)? The bottom line is that one extra thread is still required; what runs in that thread depends on the usage semantics, async or not.

This select polls a worker process and a timer; the latter can be replaced with a syscall-wrapping future.

pepyakin (Contributor, Author) commented Nov 5, 2021

Ok, I see where the misunderstanding is.

See the quote:

The CPU time can be obtained with either clock_gettime or getrusage for the calling process. #4123 makes the PVF compilation single-threaded, so there should be only one thread compiling or executing. There should, however, be a watchdog thread that wakes up from time to time (roughly every 1 s) and terminates the process in case it has reached the deadline. That means that control of the timeouts shifts from the Polkadot host into the worker, which is not ideal but acceptable.

This assumes the context of the child worker (that's where the compilation of the PVF happens at the moment, and that's why I mentioned Module::new). Given parallel_compilation=false, the child worker currently has, presumably, only one thread. In order for the mechanism to rely on CPU time, we would need to introduce a second thread into the child process, which would check the time and, after reaching the deadline, terminate the worker. So the timeouts would be controlled by the child worker, and not by the Polkadot validation host as is the case now.

OTOH, the code that you linked is the second case, that is:

Alternatively, we could leave it as is, perhaps just increasing the current timeout. Then we would wait4 on the child worker and get its rusage (not unlike the well-known time program). This would keep the preparation process completely single-threaded and leave control of the timeouts on the host. However, that seems more complicated in terms of implementation.

In that case, we do not change anything in the child process but change the Polkadot validation host process. This way the child worker won't get a new thread.

We cannot rely solely on waitpid/wait4 because they only react to the child process exiting. However, since we do not have a timer there, that will only happen after it finishes the compilation, which somewhat defeats the purpose of setting the timer. As an aside, we cannot just wrap a syscall in a future without involving a new thread, because those syscalls are blocking (and I am not aware of non-blocking alternatives to them).

So the idea I expressed there is that we would still leave the timer as is (i.e. as a wall clock), but just increase the timeout. After either the process finishes or the deadline timer triggers, we check how much CPU time was spent in the child process.
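
A rough sketch of that host-side flow (assuming Unix and the libc crate; names are illustrative): after the generous wall-clock deadline fires, or once the worker exits on its own, the host reaps the child and reads its CPU time from the rusage that wait4 fills in.

```rust
use std::time::Duration;

// Reap the child worker and return the CPU time (user + system) it consumed.
// On the deadline path the host would first kill `pid`, then call this.
fn reap_child_cpu_time(pid: libc::pid_t) -> Duration {
    let mut status: libc::c_int = 0;
    let mut ru: libc::rusage = unsafe { std::mem::zeroed() };
    let ret = unsafe { libc::wait4(pid, &mut status, 0, &mut ru) };
    assert!(ret == pid, "wait4 failed");
    let tv = |t: libc::timeval| Duration::new(t.tv_sec as u64, t.tv_usec as u32 * 1000);
    tv(ru.ru_utime) + tv(ru.ru_stime)
}
```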

I hope this is more clear.

eskimor (Member) commented Oct 30, 2022

@m-cat this is also an interesting ticket that I would really love to see implemented some time soon.

mrcnski self-assigned this Nov 7, 2022

mrcnski (Contributor) commented Nov 7, 2022

I've been investigating this and @eskimor asked me to post my findings.

Code here.

Goal

We wanted to determine how CPU time was affected by load, and to what degree it
is a better metric than wall clock time.

I ran an experiment by launching simultaneous processes and counting how many
random hashes each could do within 5 seconds of CPU time.

Parameters

I had another thread in each process that would run and terminate the process after 5 seconds of CPU time, to simulate option 1.

Another parameter was a 1-second sleep between checks in the polling thread, again taken from the issue description.

Ran on an 8-core machine.
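
A simplified sketch of what each child process in the experiment does (the real code is linked above; the hashing and reporting details here are illustrative, and process_cpu_time() is the helper sketched earlier in the thread):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::{Duration, Instant};

fn main() {
    let cpu_limit = Duration::from_secs(5);
    let hashes = Arc::new(AtomicU64::new(0));
    let started = Instant::now();

    // Watchdog thread: wake roughly once per second; once 5 s of CPU time has
    // been spent, report the counters and terminate the whole process.
    let counter = Arc::clone(&hashes);
    std::thread::spawn(move || loop {
        let cpu = process_cpu_time();
        if cpu >= cpu_limit {
            println!(
                "cpu: {:?}, wall: {:?}, hashes: {}",
                cpu,
                started.elapsed(),
                counter.load(Ordering::Relaxed)
            );
            std::process::exit(0);
        }
        std::thread::sleep(Duration::from_secs(1));
    });

    // Main thread: hash as fast as possible until the watchdog kills us.
    use std::hash::{Hash, Hasher};
    let mut x: u64 = 0;
    loop {
        let mut h = std::collections::hash_map::DefaultHasher::new();
        x.hash(&mut h);
        x = h.finish();
        hashes.fetch_add(1, Ordering::Relaxed);
    }
}
```

With a 1 s polling granularity the watchdog overshoots the 5 s budget a little, which is consistent with the ~5-6 s average CPU times in the table below.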

Data

+------------+--------------------+---------------------+---------------+-------------------+--------------------+
| # children | avg. cpu time (ms) | avg. wall time (ms) | avg. # hashes | hashes / cpu ms   | hashes / wall ms   |
+================================================================================================================+
| 1          | 5011               | 5013                | 367710        | 73                | 73                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 2          | 5017               | 5020                | 374605        | 74                | 74                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 3          | 5007               | 5015                | 371256        | 74                | 74                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 4          | 5501               | 5512                | 407122        | 74                | 73                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 5          | 5971               | 6005                | 426302        | 71                | 70                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 6          | 5976               | 6007                | 430291        | 72                | 71                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 7          | 5973               | 6010                | 417470        | 69                | 69                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 8          | 5987               | 6013                | 418483        | 69                | 69                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 9          | 5988               | 6011                | 386170        | 64                | 64                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 10         | 5995               | 6010                | 361615        | 60                | 60                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 11         | 5007               | 5014                | 284096        | 56                | 56                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 12         | 5969               | 6006                | 309027        | 51                | 51                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 13         | 5979               | 6008                | 296396        | 49                | 49                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 14         | 5939               | 6014                | 280787        | 47                | 46                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 15         | 5822               | 6011                | 262843        | 45                | 43                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 16         | 5756               | 6017                | 254496        | 44                | 42                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 17         | 5514               | 6011                | 243369        | 44                | 40                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 18         | 5131               | 6016                | 226493        | 44                | 37                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 19         | 5700               | 7016                | 251827        | 44                | 35                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 20         | 5482               | 7011                | 242054        | 44                | 34                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 21         | 5118               | 7019                | 225823        | 44                | 32                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 22         | 5186               | 7202                | 234281        | 45                | 32                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 23         | 5516               | 8028                | 243682        | 44                | 30                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 24         | 5293               | 8019                | 234046        | 44                | 29                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 25         | 5093               | 8028                | 224726        | 44                | 27                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 26         | 5477               | 9021                | 241853        | 44                | 26                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 27         | 5315               | 9027                | 234688        | 44                | 25                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 28         | 5111               | 9031                | 225914        | 44                | 25                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 29         | 5487               | 10030               | 242330        | 44                | 24                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 30         | 5156               | 10030               | 228029        | 44                | 22                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 31         | 5033               | 10025               | 222578        | 44                | 22                 |
|------------+--------------------+---------------------+---------------+-------------------+--------------------|
| 32         | 5386               | 11032               | 238045        | 44                | 21                 |
+------------+--------------------+---------------------+---------------+-------------------+--------------------+

Results

  • The extra polling thread in option 1 has a negligible effect because it is sleeping most of the time.

  • As long as each child process has its own core, there is little degradation in performance (up to 8 processes on an 8-core machine). I had other processes running on my machine, which probably caused the slight degradation.

  • Once I exceed 8 processes, the number of hashes per process starts to degrade more significantly. This is probably due to context-switching overhead from contention for cores, which is something that is likely to occur in systems under load.

  • It seems we can still do more work per unit of CPU time than per unit of wall-clock time.

Next steps

@eskimor suggests a hybrid approach:

It sounds like we will need some hybrid approach: having a thread that kills it after the CPU time is exceeded - we need that, otherwise we cannot really enforce the timeout - but then we should check the CPU time in the calling process to determine whether the timeout was exceeded or not. (This should be more precise than the checking thread.) By just killing based on wall clock time and then checking CPU time, we introduce a stupid edge case: wall clock exceeded, so we killed it, but CPU time was actually fine ... what do we do now?

I will next be investigating this question:

Can we really only determine CPU time at the parent after the fact? Or can we, for example, pause the execution after the wall clock time, check the CPU time, and if it has not yet been exceeded, let it run for some additional time ... repeat.

eskimor (Member) commented Nov 8, 2022

Awesome, thanks @m-cat! What we can see is that wall-clock time is almost as good/bad a metric up to 16 processes; only beyond that does the number of hashes per unit of wall-clock time decline further, while per unit of CPU time it stays stable.

This means CPU time is likely only a significant improvement under very high load. We reduce the factor from 73/21 = 3.47 to 73/44 = 1.659 ... so the variation under heavy load can be pretty much halved, which is still quite an improvement - not as good as we hoped (perfectly stable) - but still seems worthwhile to explore!

Is there maybe any other timing information available that is even less susceptible to system load?

eskimor (Member) commented Nov 8, 2022

Probably not. Given that the number of hashes per unit of CPU time stays stable once all hardware threads (16) are saturated, I would assume that the slowdown comes from CPU throttling to prevent overheating. Usually CPUs can run at higher clock speeds on a single thread and have to reduce clock speed once more cores get busy.

eskimor (Member) commented Nov 8, 2022

TL;DR: CPU time indeed helps reduce variance due to load. Variance remains, likely due to thermal management of CPUs.

mrcnski (Contributor) commented Nov 9, 2022

Can we really only determine CPU time at the parent after the fact? Or can we, for example, pause the execution after the wall clock time, check the CPU time, and if it has not yet been exceeded, let it run for some additional time ... repeat.

@eskimor To get the child's CPU time in the parent before execution has ended, it seems we have two main options:

  1. In the parent process, use the /proc filesystem, which is a non-portable facility (i.e. available on Linux but not macOS). If we can assume Linux, this would seem pretty simple. I could not find any portable solutions. (See the sketch after this list.)

  2. Using some form of IPC, query the CPU time from within the child and communicate it to the host. We are already using a socket to communicate between the host and a worker; we would need another one.
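
A non-portable (Linux-only) sketch of option 1, reading the child's CPU time from /proc/<pid>/stat while it is still running (field layout per proc(5): utime and stime are the 14th and 15th fields, in clock ticks; error handling is illustrative):

```rust
use std::time::Duration;

fn child_cpu_time(pid: u32) -> std::io::Result<Duration> {
    let stat = std::fs::read_to_string(format!("/proc/{pid}/stat"))?;
    // The process name (2nd field) may contain spaces, so parse after the
    // closing parenthesis that terminates it.
    let rest = &stat[stat.rfind(')').unwrap() + 2..];
    let fields: Vec<&str> = rest.split_whitespace().collect();
    // `rest` starts at field 3 (state), so utime/stime are at indices 11/12.
    let utime: u64 = fields[11].parse().unwrap();
    let stime: u64 = fields[12].parse().unwrap();
    let ticks_per_sec = unsafe { libc::sysconf(libc::_SC_CLK_TCK) } as u64;
    Ok(Duration::from_secs_f64((utime + stime) as f64 / ticks_per_sec as f64))
}
```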

vstakhov (Contributor) commented

Well, the portable way to get this information should be a getrusage call with the RUSAGE_CHILDREN variant. However, it can only be used from the parent side after a child has terminated. Another way is, indeed, keeping a socketpair open between parent and children, where the children are expected to send heartbeat messages with their own getrusage call results.
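
A child-side sketch of the socketpair/heartbeat idea (assuming a Unix socket already shared with the host and the process_cpu_time() helper from earlier in the thread; the wire format is made up for illustration):

```rust
use std::io::Write;
use std::os::unix::net::UnixStream;
use std::time::Duration;

// Runs in the child: periodically report our own CPU time to the host.
fn cpu_time_heartbeat(mut stream: UnixStream) {
    loop {
        let cpu = process_cpu_time();
        // Hypothetical wire format: CPU time in milliseconds, little-endian u64.
        let _ = stream.write_all(&(cpu.as_millis() as u64).to_le_bytes());
        std::thread::sleep(Duration::from_secs(1));
    }
}
```

The host would read these heartbeats alongside its existing worker socket and enforce the deadline itself.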

eskimor (Member) commented Nov 10, 2022

For the platform-dependent option: we could go that way if there are alternatives on the other platforms. In any case, if a platform-independent solution is available (e.g. the open socket), we should go with that one.

burdges (Contributor) commented Nov 10, 2022

It appears some BSDs have procfs, btw; not sure if it's the same, and I guess macOS does not have it.

mrcnski (Contributor) commented Nov 10, 2022

I had a quick call with @eskimor today, where I proposed we simply have the child process self-terminate after a CPU-time timeout, instead of the child communicating the CPU time to the host. I think the idea with the latter was to avoid the variance of a 1 s polling delay, since we don't want to poll more often than that. But having the child report its CPU time to the host doesn't really gain anything in the end.

So having the child poll the CPU time and exit on its own was deemed acceptable,
and already an improvement over wall clock time. We'd still have a wall clock
timeout on the host, but it would be generous (i.e. a factor of 3) and only
expected to be hit if the child process happens to stall.

Please let me know if I missed any important details @eskimor, as most of the
call took place in your head. :P

Now, I thought about this problem a bit and believe we can simplify further: I'm not sure why we need to poll at all. I.e. if the CPU timeout is 5 seconds, why sleep in 1 s intervals of wall-clock time? In my understanding, the CPU clock cannot advance faster than the wall clock, so we can just sleep for 5 seconds plus a bit to account for overhead, check the remaining CPU time, and repeat. We would just need a minimum sleep time of, say, 50 or 100 ms. This seems more accurate and should also cause fewer thread wakeups.
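
A sketch of that monitor loop (running in the worker, reusing the process_cpu_time() helper from earlier in the thread; the exact constants and exit behaviour are illustrative):

```rust
use std::time::Duration;

// Worker-side CPU-time monitor: sleep for however much of the CPU budget is
// left (CPU time cannot advance faster than wall-clock time), then re-check.
fn cpu_time_monitor(limit: Duration) -> ! {
    const MIN_SLEEP: Duration = Duration::from_millis(50);
    loop {
        let spent = process_cpu_time();
        if spent >= limit {
            // Deadline reached: report the timeout and terminate the worker.
            std::process::exit(1);
        }
        // Never sleep less than MIN_SLEEP, to bound wakeups near the deadline.
        std::thread::sleep(std::cmp::max(limit - spent, MIN_SLEEP));
    }
}
```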

burdges (Contributor) commented Nov 10, 2022

We do want the running time in the future, but the child could supply this when done.
