
[PAL-SGX] alternative gettimeofday() implementation for gramine #1799

Open
jonathan-sha opened this issue Mar 7, 2024 · 10 comments · May be fixed by #1834

Comments

@jonathan-sha

Description of the feature

Looking at the gettimeofday() implementation used by Gramine, there are many similarities with an implementation we did for "native" SGX apps that we develop where I work. However, we also took some different approaches, listed below.

I have working, heavily tested code I can share with you, if you're interested in my implementation. It's currently written in C++, but if there's interest I'll be happy to port it to C and add it to Gramine.

  • Our implementation also uses RDTSC when it's available, and falls back to an OCALL when there's no other choice.
  • We don't rely on CPUID heuristics to read the clock frequency (Gramine has bare-metal and hypervisor implementations of this, which are partial).
  • Instead, we calculate the clock frequency ourselves.
    • For this we need two "ground truth" readings of timeval + TSC pairs; we can then calculate the clock frequency by counting how many cycles passed in a given time.
    • We don't need to "sleep" for this to work; we use consecutive calls to gettimeofday() and the time elapsed between them to make the calculation.
    • The clock frequency is periodically refreshed and recalibrated (by performing 2 "extra" OCALLs that are far enough apart), in order to stay close to the host's time.
  • The code was written to give consistent time across all threads in a thread-safe and lockless manner, using atomics (see the sketch after this list).
    • There is a "pair" of global ground-truth states which we switch between round-robin.
    • A single thread gets to "update and advance" the state.
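To make the lockless part concrete, here is a minimal sketch under my reading of the description above; all names (ground_truth_t, g_states, g_cur, etc.) are illustrative, not taken from the actual code:

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    uint64_t tsc;          /* RDTSC value at the ground-truth reading */
    uint64_t usec;         /* host wall-clock time at the same instant */
    double   tsc_per_usec; /* calibrated clock frequency */
} ground_truth_t;

static ground_truth_t   g_states[2]; /* the round-robin pair */
static _Atomic uint32_t g_cur;       /* index of the published state */

/* Frequency = cycles elapsed / wall-clock time elapsed between two
 * ground-truth readings. */
static double calc_freq(const ground_truth_t* a, const ground_truth_t* b) {
    return (double)(b->tsc - a->tsc) / (double)(b->usec - a->usec);
}

/* Fast path: no OCALL, no lock; one atomic load plus RDTSC math. */
static uint64_t fast_time_usec(uint64_t now_tsc) {
    const ground_truth_t* s =
        &g_states[atomic_load_explicit(&g_cur, memory_order_acquire) & 1];
    return s->usec + (uint64_t)((double)(now_tsc - s->tsc) / s->tsc_per_usec);
}

/* The single winning thread writes the *inactive* slot, then advances
 * g_cur; the release store publishes the new slot to all readers at once. */
static void publish_state(uint64_t tsc, uint64_t usec, double freq) {
    uint32_t cur = atomic_load_explicit(&g_cur, memory_order_relaxed);
    ground_truth_t* next = &g_states[(cur + 1) & 1];
    next->tsc = tsc;
    next->usec = usec;
    next->tsc_per_usec = freq;
    atomic_store_explicit(&g_cur, cur + 1, memory_order_release);
}
```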

This code was stressed and tested on Azure. I know that in the past Gramine had some issues with the fast path not working on Azure?

Please let me know if this is interesting to you.

Why Gramine should implement it?

  1. Do not rely on a "fixed" clock_frequency of the CPU and on the different hypervisor implementations for reporting it in CPUID.
  2. There are probably some targets that should support the fast-path gettimeofday() using RDTSC but currently use the slow path.
  3. This implementation is completely lockless - Gramine currently uses a spinlock to update the ground truth.

Our implementation is measured to run in ~47 cycles (with the -O3 compile flag), which is faster than the ~60 cycles it takes to run gettimeofday() natively on the host!

Also, as an unrelated note, we do handle getting the (deprecated) timezone struct when recalibrating our ground truth, just in case some old code uses this parameter. I saw an issue regarding this still open on Gramine.

@mkow
Member

mkow commented Mar 8, 2024

I like this idea, seems more portable than what we currently have and still simple.
@dimakuv, @kailun-qin: Do you see any potential problems with going this way?

@dimakuv
Contributor

dimakuv commented Mar 8, 2024

I don't see any specific problems with this approach. The description by @jonathan-sha seems solid, I appreciate the details!

> we don't need to "sleep" for this to work, we use consecutive calls to gettimeofday() and the time elapsed between them to make the calculation.

Do you mean you exit the SGX enclave and then perform two gettimeofday() calls? How do you calculate the latency of each of these calls?

I guess if you do two such double-gettimeofday readings, you can learn the latency of each call, and thus you can calibrate TSC based on some math formula?

> the clock frequency is periodically refreshed and recalibrated (by performing 2 "extra" OCALLs that are far enough apart) ...

What does this "performing 2 extra ocalls that are far enough apart" mean exactly?

> do not rely on "fixed" clock_frequency of the CPU and different hypervisor implementations for reporting it in CPUID

Yes, this turned out to be a mess. As soon as you go to virtualized environments, things become hairy. Each hypervisor/VMM uses a different way to report Invariant TSC...

> Also, as an unrelated note, we do handle getting the (deprecated) timezone struct when recalibrating our ground truth, just in case some old code uses this parameter.

I'm not sure how clock recalibration and timezones are related. Can you elaborate on this more?

@mkow
Member

mkow commented Mar 8, 2024

>> we don't need to "sleep" for this to work, we use consecutive calls to gettimeofday() and the time elapsed between them to make the calculation.

> Do you mean you exit the SGX enclave and then perform two gettimeofday() calls?

I think he meant just doing a gettimeofday-like ocall twice.

> How do you calculate the latency of each of these calls?

I think you don't need to? You can do rdtsc -> gettimeofday() -> wait some time -> rdtsc -> gettimeofday() and then subtract and divide.
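In code form, the idea would look roughly like this (a sketch only; __rdtsc() and sleep() stand in for whatever TSC-reading and waiting primitives the enclave actually uses):

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>
#include <x86intrin.h> /* __rdtsc() */

/* Sketch: calibrate cycles-per-microsecond from two paired readings. */
int main(void) {
    uint64_t tsc1 = __rdtsc();     /* 1st RDTSC reading */
    struct timeval tv1, tv2;
    gettimeofday(&tv1, NULL);      /* 1st host reading (an OCALL in Gramine) */

    sleep(1);                      /* "wait some time" (any work would do) */

    uint64_t tsc2 = __rdtsc();     /* 2nd RDTSC reading */
    gettimeofday(&tv2, NULL);      /* 2nd host reading (an OCALL in Gramine) */

    /* Subtract and divide. */
    uint64_t usec = (uint64_t)(tv2.tv_sec - tv1.tv_sec) * 1000000
                    + (tv2.tv_usec - tv1.tv_usec);
    printf("%.1f cycles/usec\n", (double)(tsc2 - tsc1) / (double)usec);
    return 0;
}
```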

>> the clock frequency is periodically refreshed and recalibrated (by performing 2 "extra" OCALLs that are far enough apart) ...

> What does this "performing 2 extra ocalls that are far enough apart" mean exactly?

See above (at least that's how I understand the idea).

>> Also, as an unrelated note, we do handle getting the (deprecated) timezone struct when recalibrating our ground truth, just in case some old code uses this parameter.

> I'm not sure how clock recalibration and timezones are related. Can you elaborate on this more?

It's unrelated, as stated by the OP ;) I think he just implemented it while working on this part of the codebase.

@dimakuv
Contributor

dimakuv commented Mar 11, 2024

>> How do you calculate the latency of each of these calls?

> I think you don't need to? You can do rdtsc -> gettimeofday() -> wait some time -> rdtsc -> gettimeofday() and then subtract and divide.

How do we account for the latency of the gettimeofday() ocall itself? But I guess, if we "wait some time" and this time is long enough, then we can ignore the latency of the OCALLs and still have sufficient precision.

@jonathan-sha
Author

>>> How do you calculate the latency of each of these calls?

>> I think you don't need to? You can do rdtsc -> gettimeofday() -> wait some time -> rdtsc -> gettimeofday() and then subtract and divide.

> How do we account for the latency of the gettimeofday() ocall itself? But I guess, if we "wait some time" and this time is long enough, then we can ignore the latency of the OCALLs and still have sufficient precision.

Correct. From my experimentation, taking two time points about 1 second apart is sufficient for getting sub-millisecond accuracy over time. I did a lot of experimentation with this: increasing the "calibration time", using clock_gettime() with nanosecond accuracy instead of gettimeofday(). None of it noticeably improved the accuracy of the clock frequency. I also tried accounting for the OCALL latency by adding the calibration RDTSC read to the OCALL itself, but that didn't help much either. Overall I think the current accuracy is acceptable.

> I'm not sure how clock recalibration and timezones are related. Can you elaborate on this more?

gettimeofday() returns an optional timezone struct, which is deprecated and shouldn't be used. When I take the "ground truth" timepoints (i.e., OCALL gettimeofday()), I retrieve this timezone and cache it in case it's ever needed by a caller.
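A hypothetical sketch of what this caching could look like (fast_gettimeofday, take_ground_truth and g_cached_tz are made-up names, not from the actual code):

```c
#include <stddef.h>
#include <sys/time.h>

/* Cache the deprecated timezone whenever a ground-truth reading is taken,
 * so later callers that pass a non-NULL tz get a value for free. */
static struct timezone g_cached_tz;

static void take_ground_truth(struct timeval* tv) {
    /* In Gramine this would be the gettimeofday OCALL to the host. */
    gettimeofday(tv, &g_cached_tz);
}

static int fast_gettimeofday(struct timeval* tv, struct timezone* tz) {
    take_ground_truth(tv);  /* or the RDTSC fast path, once calibrated */
    if (tz != NULL)
        *tz = g_cached_tz;  /* serve the cached copy, no extra OCALL */
    return 0;
}
```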

@jonathan-sha
Author

Some more details about the implementation...

I implemented this as a state machine with the following states. Calling gettimeofday() at any point will call into the current state's handler.

  1. INIT - the first call to gettimeofday() by the app: OCALL to get the first timepoint used to calculate the clock_frequency, and check whether RDTSC is available.
  2. CALIBRATING - we don't have a calculated clock_frequency yet, so all calls to gettimeofday() in this state are OCALLs. Once at least CALIBRATION_TIME has elapsed between two gettimeofday() calls, we can calculate the clock_frequency and advance to RDTSC.
  3. RDTSC - the fast path: RDTSC to calculate the current time. Periodically (after the long RECALIBRATION_TIME elapses) we OCALL to make sure we stay in sync with the host time, which also gives us a new "first timepoint" for a new clock_frequency calculation.
  4. RDTSC_RECALIBRATING - similar to CALIBRATING: we took a timepoint when transitioning here from RDTSC, and we can recalculate the clock_frequency once CALIBRATION_TIME elapses. But in this case, all calls prior to CALIBRATION_TIME can use the fast path with the old clock_frequency.
  5. DISABLED - we always OCALL.

This state machine guarantees that we use the fast path whenever we can and OCALL when we need to, and that over time we stay in sync with the host.
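For illustration, a rough sketch of those states in C (the names are hypothetical, not necessarily the ones in the PR):

```c
/* Each gettimeofday() call dispatches to the handler of the current state. */
typedef enum {
    FC_INIT,                /* 1st call: OCALL, record 1st timepoint, probe RDTSC */
    FC_CALIBRATING,         /* no clock_frequency yet: every call OCALLs; move to
                               RDTSC once CALIBRATION_TIME separates two OCALLs   */
    FC_RDTSC,               /* fast path; after RECALIBRATION_TIME, OCALL for a
                               fresh timepoint and move to RDTSC_RECALIBRATING    */
    FC_RDTSC_RECALIBRATING, /* keep the fast path with the old frequency until
                               CALIBRATION_TIME elapses, then recompute it        */
    FC_DISABLED,            /* RDTSC unusable: always OCALL */
} fast_clock_state_t;
```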

@dimakuv
Contributor

dimakuv commented Mar 11, 2024

Thanks @jonathan-sha! This looks very good. Feel free to submit a PR. (Though depending on the complexity of the PR, it may take a long time to review it.)

@mkow
Member

mkow commented Mar 11, 2024

> How do we account for the latency of the gettimeofday() ocall itself? But I guess, if we "wait some time" and this time is long enough, then we can ignore the latency of the OCALLs and still have sufficient precision.

But the method I explained already accounts for that? ("rdtsc -> gettimeofday() -> wait some time -> rdtsc -> gettimeofday()")
Both RDTSC and gettimeofday() measure exactly the same events if you look at the detailed timeline:

  1. RDTSC // 1st RDTSC reading here
  2. OCALL gettimeofday entry
  3. host gettimeofday() call // 1st gettimeofday reading
  4. OCALL gettimeofday return
  5. some time elapses
  6. RDTSC // 2nd RDTSC reading here
  7. OCALL gettimeofday entry
  8. host gettimeofday() call // 2nd gettimeofday reading
  9. OCALL gettimeofday return

Notice that both intervals include the same number of each type of step: RDTSC measures points 1-6 and gettimeofday measures points 3-8, so each interval includes exactly one OCALL entry execution and one OCALL return execution.
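Put as a formula (a sketch: let $c_i$ be the two RDTSC readings, $t_i$ the two host gettimeofday() readings, and $d$ the roughly constant cycle offset between each RDTSC and the host reading that follows it), the $i$-th host reading happens at TSC value $c_i + d$, so

$$f = \frac{(c_2 + d) - (c_1 + d)}{t_2 - t_1} = \frac{c_2 - c_1}{t_2 - t_1},$$

i.e. the constant per-call offset cancels; only the variation of $d$ between the two rounds limits accuracy, and a long enough wait makes that negligible.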

@jonathan-sha jonathan-sha linked a pull request Apr 2, 2024 that will close this issue
@jonathan-sha
Author

Hey, sorry for the delay with this. I had some other things to take care of first.

I opened a draft PR - I still have a few things to fix:

  1. Not sure about is_rdtsc_available() - the current implementation was done for my company's code, but Gramine has RDTSC emulation, so maybe we can forgo this logic entirely?
  2. fast_clock__get_time_zone() is unimplemented - we can add this to support gettimeofday() with a timezone.
  3. The long explanation comment needs to be updated.
  4. Some minor fixes.
  5. Regression tests - I'm contemplating what we should do there. For my company's codebase we added multi-threaded stress tests that run for about 10 minutes, measure fast-path vs. OCALL vs. native gettimeofday() cycles, measure time drift over time, etc. We could add this to the existing gettimeofday test.

@mkow
Member

mkow commented Apr 12, 2024

@jonathan-sha: Let's move the discussion about these points to the PR review, so we can discuss them next to the code (and in a single place).
