Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

System time corrupted across reboots (IDFGH-7930) #9448

Closed
kriegste opened this issue Jul 27, 2022 · 37 comments
Closed

System time corrupted across reboots (IDFGH-7930) #9448

kriegste opened this issue Jul 27, 2022 · 37 comments
Assignees
Labels
Resolution: Done Issue is done internally Status: Done Issue is done internally

Comments

@kriegste
Copy link

Environment

  • Module or chip used: ESP32 on TTGO, but happens on other devices as well
  • IDF version: 4.4.1 release
  • Build System: Standard Windows VSCode installation via the tutorial
  • Power Supply: USB on the TTGO, but other devices with their own supply behave the same

Problem Description

After leaving it running for a while (1-3 days) if the device is soft-reset using esp_restart() the system time is wrong. Sometimes only by seconds, sometimes by hours, in both directions. SNTP will correct this later, however during the initial phase system time cannot be relied on.
The phenomenon gets worse the longer the device is on (= the more time has passed since the last hard-reset).

Interestingly, esp_rtc_get_time_us() which indicates the time since the last hard-reset also jumps the same amount.

There is a four year old forum thread indicating the same problem:
https://www.esp32.com/viewtopic.php?t=7544

Expected Behavior

System time running continuously across reboots.

Code to reproduce this issue

Make sure to switch on system time stamps for logging!

Leave the example running for 1 day or more and from time to time look at the output. I used PuTTY since the IDF monitor causes a hard-reset on reconnect. An automated reboot will happen once per hour. Compare the time stamps right before and after the restarts.

#include "esp_log.h"
#include "esp_sntp.h"
#include "esp_wifi.h"
#include "nvs_flash.h"
#include "esp32/rtc.h"
 
static const char *TAG = "main";
 
void time_sync_notification_cb(struct timeval *tv) {
    ESP_LOGI(TAG, "time_sync_notification_cb");
}
 
void log_rtc(void) {
    int u = esp_rtc_get_time_us() / 1000000;
    ESP_LOGI(TAG, "rtc=%d seconds (%d days %02d:%02d:%02d)", u, u / (24*3600), (u % (24*3600)) / 3600, (u % 3600) / 60, u % 60);
}
 
void app_main(void) {
    log_rtc();
    ESP_LOGI(TAG, "start");
 
    ESP_ERROR_CHECK(nvs_flash_init());
    ESP_ERROR_CHECK(esp_netif_init());
    ESP_ERROR_CHECK(esp_event_loop_create_default());
    esp_netif_create_default_wifi_sta();
    wifi_init_config_t wifi_init_config = WIFI_INIT_CONFIG_DEFAULT();
    ESP_ERROR_CHECK(esp_wifi_init(&wifi_init_config));
 
    wifi_config_t conf_sta = {
        .sta.ssid = "xxxxx",
    };
    esp_wifi_set_config(WIFI_IF_STA, &conf_sta);
 
    ESP_ERROR_CHECK(esp_wifi_start());
    esp_wifi_connect();
 
    sntp_setoperatingmode(SNTP_OPMODE_POLL);
    sntp_setservername(0, "pool.ntp.org");
    sntp_set_time_sync_notification_cb(time_sync_notification_cb);
    sntp_init();
 
    // Wait an hour
    for (int i = 0; i < 60; i++) {
        ESP_LOGI(TAG, "%d", i);
        vTaskDelay(60000 / portTICK_PERIOD_MS);
    }
 
    ESP_LOGI(TAG, "end");
    log_rtc();
 
    esp_restart();
//  esp_abort(); // TODO: Try the other "restart methods" and see if there is a difference
//  esp_deep_sleep(1000000);
}

Debug Logs

I (08:07:52.302) main: 58
I (08:08:52.302) main: 59
I (08:09:52.302) main: end
I (08:09:52.303) main: rtc=32399 seconds (0 days 08:59:59)
[...]
ets Jun 8 2016 00:22:57
rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
[...]
I (08:09:15.296) main: rtc=32362 seconds (0 days 08:59:22)
I (08:09:15.301) main: start

Restarting takes maybe one second or so, but as you can see in this case time jumped backwards more than half a minute. Which should be impossible.

@espressif-bot espressif-bot added the Status: Opened Issue is new label Jul 27, 2022
@github-actions github-actions bot changed the title System time corrupted across reboots System time corrupted across reboots (IDFGH-7930) Jul 27, 2022
@someburner
Copy link
Contributor

I have seen this as well.

@kriegste
Copy link
Author

I tried both suggestions from the forum thread mentioned above:

  • calling esp_sync_counters_rtc_and_frc() before restart
  • the menuconfig option to just use RTC for gettimeofday

Neither worked.

@kriegste
Copy link
Author

Could this be some kind of re-calibration attempt which then, based on the result, also changes the already counted RTC ticks (by a correction factor I guess). This would explain why the phenomenon gets worse the longer the device runs.

In esp_clk.h it says about esp_clk_rtc_time():

  • @attention The value returned by this function may change abruptly when
  • calibration value of RTC counter is updated via esp_clk_slowclk_cal_set
  • function. This should not happen unless application calls esp_clk_slowclk_cal_set.
  • In ESP-IDF, esp_clk_slowclk_cal_set is only called in startup code.

Maybe this calibration shouldn't happen on every boot, but only after a ESP_RST_POWERON (only when the RTC clock is 0)?

@kriegste
Copy link
Author

Setting the config option CONFIG_ESP32_RTC_CLK_CAL_CYCLES to 0 to disable calibration causes a big time jump in the next restart, but after that, the problem is gone! So this confirms that how the calibration is done is the root cause.

@espressif-bot espressif-bot added Status: In Progress Work is in progress and removed Status: Opened Issue is new labels Aug 3, 2022
@KonstantinKondrashov
Copy link
Collaborator

Hi @kriegste!
Thank you for the test example. I will check it and fix this issue.

@someburner
Copy link
Contributor

@KonstantinKondrashov Any updates here?

@kriegste
Copy link
Author

Can you fix it? The problem persists in version 4.4.2.

@kriegste
Copy link
Author

...or life sign...

@KonstantinKondrashov
Copy link
Collaborator

Hi! Sorry for the delay, It might be linked to this issue #9455 as well.
I will revisit the whole timekeeping functionality, as I remember, I saw something weird. I will provide a patch here before merging it.

@KonstantinKondrashov
Copy link
Collaborator

I run this example just without wifi and sntp. It worked for an hour, but the issue is not shown.

I (3420336) main: 57
I (3480336) main: 58
I (3540336) main: 59
I (3600336) main: end
I (3600336) main: rtc=3638 seconds (0 days 01:00:38)

ets Jul 29 2019 12:21:46
rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)

I (0) cpu_start: Starting scheduler on APP CPU.
I (326) main: rtc=3639 seconds (0 days 01:00:39)
I (336) main: start
I (336) main: 0
I (60336) main: 1

Seems like when wifi app works, we get that chip becomes warmer and the freq of the RC source goes.
I will try with wifi enabled.

@kriegste
Copy link
Author

Leave it running for a few hours at least. Maybe heating/freezing the chip just before the restart can also provoke the effect.

It has something to do with the calibration value esp_clk_slowclk_cal_get() changing and then "correcting" the RTC clock (the ticks which already passed since hardware reset) and this in turn changes the system time.

This also explains why disabling the calibration altogether (CONFIG_ESP32_RTC_CLK_CAL_CYCLES = 0) is a workaround.

@KonstantinKondrashov
Copy link
Collaborator

Finally, I got the log that shows this issue.

1) boot
I (3540336) main: 59
I (3600336) main: end
I (3600336) main: rtc=3638 seconds (0 days 01:00:38)
ets Jul 29 2019 12:21:46
rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
I (326) main: rtc=3639 seconds (0 days 01:00:39) ------------------------------------   = +1sec
I (336) main: start
I (336) main: 0

2) boot
I (3540336) main: 59
I (3600336) main: end
I (3600336) main: rtc=7240 seconds (0 days 02:00:40)
ets Jul 29 2019 12:21:46
rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
I (326) main: rtc=7199 seconds (0 days 01:59:59) ------------------------------------   = -41sec
I (336) main: start
I (336) main: 0

3) boot
I (3540336) main: 59
I (3600336) main: end
I (3600336) main: rtc=10802 seconds (0 days 03:00:02)
ets Jul 29 2019 12:21:46
rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
I (326) main: rtc=10804 seconds (0 days 03:00:04) ------------------------------------   = +2sec
I (336) main: start
I (336) main: 0

4) boot
I (3540336) main: 59
I (3600336) main: end
I (3600336) main: rtc=14405 seconds (0 days 04:00:05)
ets Jul 29 2019 12:21:46
rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
I (326) main: rtc=14395 seconds (0 days 03:59:55) ------------------------------------   = -10sec
I (336) main: start
I (336) main: 0

@KonstantinKondrashov
Copy link
Collaborator

Hi @kriegste!
There is the patch that resolve this issue. The first 4 hours look good. I did not run all related tests yet, only this example, but I think it is the root case of this issue.
Could you try this fix?
Thanks.

diff --git a/components/esp_hw_support/esp_clk.c b/components/esp_hw_support/esp_clk.c
index 752a3e987ef..c1dff5f8764 100644
--- a/components/esp_hw_support/esp_clk.c
+++ b/components/esp_hw_support/esp_clk.c
@@ -53,7 +53,7 @@ extern uint32_t g_ticks_per_us_app;
 static portMUX_TYPE s_esp_rtc_time_lock = portMUX_INITIALIZER_UNLOCKED;
 
 // TODO: IDF-4239
-static RTC_DATA_ATTR uint64_t s_esp_rtc_time_us = 0, s_rtc_last_ticks = 0;
+static RTC_NOINIT_ATTR uint64_t s_esp_rtc_time_us, s_rtc_last_ticks;
 
 inline static int IRAM_ATTR s_get_cpu_freq_mhz(void)
 {
@@ -100,6 +100,10 @@ uint64_t esp_rtc_get_time_us(void)
 #endif
     portENTER_CRITICAL_SAFE(&s_esp_rtc_time_lock);
     const uint32_t cal = esp_clk_slowclk_cal_get();
+    if (cal == 0) {
+        s_esp_rtc_time_us = 0;
+        s_rtc_last_ticks = 0;
+    }
     const uint64_t rtc_this_ticks = rtc_time_get();
     const uint64_t ticks = rtc_this_ticks - s_rtc_last_ticks;
     /* RTC counter result is up to 2^48, calibration factor is up to 2^24,

@kriegste
Copy link
Author

Thanks. I am already testing on two devices. Tomorrow I'll post results.

@AxelLin
Copy link
Contributor

AxelLin commented Oct 28, 2022

@KonstantinKondrashov
Can you also take a look at this issue? #6687

@kriegste
Copy link
Author

Looks very good. No problems so far.

@espressif-bot espressif-bot added Resolution: NA Issue resolution is unavailable Status: Done Issue is done internally Resolution: Done Issue is done internally and removed Status: In Progress Work is in progress Resolution: NA Issue resolution is unavailable labels Nov 2, 2022
chipweinberger pushed a commit to chipweinberger/esp-idf that referenced this issue Nov 2, 2022
@KonstantinKondrashov
Copy link
Collaborator

The backports are ready and will be merged soon.

espressif-bot pushed a commit that referenced this issue Nov 8, 2022
@someburner
Copy link
Contributor

@KonstantinKondrashov - release/v5.0 too?

@KonstantinKondrashov
Copy link
Collaborator

Hi @someburner! Sure. The v5.0 backport is ready and will be merged soon. Thanks.

@KonstantinKondrashov
Copy link
Collaborator

Sorry, our internal release/v5.0 branch is still locked. This fix will be merged when possible.

@natsco-sbottoms
Copy link

natsco-sbottoms commented Feb 28, 2023

Hello @kriegste and @KonstantinKondrashov ! Thanks for your work on this. I hope it's ok to resurrect this in the initial issue -- I wanted to alert you to what I believe may be an issue regarding this change in the context of OTA firmware updates.

The issue we're seeing is that changing s_esp_rtc_time_us and s_rtc_last_ticks to be RTC_NOINIT_ATTR and initializing them in esp_rtc_get_time_us() only works if the slow clock calibration register is zero. That is fine on power-up, but I believe that after an OTA update that register will still be set from before the update. Consequently, s_esp_rtc_time_us and s_rtc_last_ticks will not be initialized, and if variables in RTC RAM have been moved around by the linker, they will be filled with garbage data. Any reset other than OTA would work fine because the variables would still be initialized from the initial bootup.

I think we saw this on our firmware, and it resulted in incorrect values being reported by esp_rtc_get_time_us(). I'd be curious if you agree with this observation or if I'm off-base somehow.

Ideally during an OTA our system clock could stay valid, but at the least I'd like to avoid the undefined behavior we're currently experiencing.

@boribosnjak
Copy link

Hello @kriegste and @KonstantinKondrashov ! Thanks for your work on this. I hope it's ok to resurrect this in the initial issue -- I wanted to alert you to what I believe may be an issue regarding this change in the context of OTA firmware updates.

The issue we're seeing is that changing s_esp_rtc_time_us and s_rtc_last_ticks to be RTC_NOINIT_ATTR and initializing them in esp_rtc_get_time_us() only works if the slow clock calibration register is zero. That is fine on power-up, but I believe that after an OTA update that register will still be set from before the update. Consequently, s_esp_rtc_time_us and s_rtc_last_ticks will not be initialized, and if variables in RTC RAM have been moved around by the linker, they will be filled with garbage data. Any reset other than OTA would work fine because the variables would still be initialized from the initial bootup.

I think we saw this on our firmware, and it resulted in incorrect values being reported by esp_rtc_get_time_us(). I'd be curious if you agree with this observation or if I'm off-base somehow.

Ideally during an OTA our system clock could stay valid, but at the least I'd like to avoid the undefined behavior we're currently experiencing.

We see exactly the same problem. @kriegste and @KonstantinKondrashov from our point of view the change fixes the issue discussed in this thread. Unfortunately it created an issue after a firmware update, when addresses of the variables have changed.

@KonstantinKondrashov
Copy link
Collaborator

@kriegste @boribosnjak,
Sorry for the flaw, yes, it is possible. I will try to find a way to fix it.
I reopen this issue.

@natsco-sbottoms
Copy link

For what it's worth, I was able to add a shutdown handler with esp_register_shutdown_handler()

All this handler does is zero out the clock calibration:
esp_clk_slowclk_cal_set(0);

That, by itself, seems to be enough to preserve the clock and avoid this new bug on our configuration (ESP32-C3 with 32k XTAL). I'm not sure if it would be effective in all cases. There would be a loss of precision during a reset, but for our application we don't really care, and we only need to do this if it is a OTA update.

@boribosnjak
Copy link

@natsco-sbottoms Thank you for the hint. Sounds like a reasonable solution to me.
At the moment I patched the esp-idf v4.4.4 changing the variables back to RTC_DATA_ATTR and initialising it to 0.
I will consider adding your solution into our reboot actions after OTA. Thanks!

@KonstantinKondrashov
Copy link
Collaborator

Hi!
The fix is on review now - fix.txt. Soon will be merged.
It places the s_esp_rtc_time_us and s_rtc_last_ticks vars at a certain place in the .rtc_noinit memory that will not be used by any other things.

@AxelLin
Copy link
Contributor

AxelLin commented Apr 10, 2023

Hi! The fix is on review now - fix.txt. Soon will be merged.

Thanks for the update @KonstantinKondrashov
Hope to get the fix for all impacted branches (v4.3+).

@AxelLin
Copy link
Contributor

AxelLin commented Apr 15, 2023

Hi! The fix is on review now - fix.txt. Soon will be merged.

@KonstantinKondrashov
Any chance to raise the priority for applying this fix?
It has impacted on multiple stable branches but I still cannot find the fix for this regression in any branches.

@espressif-bot espressif-bot added Status: Reviewing Issue is being reviewed and removed Status: Done Issue is done internally labels Apr 24, 2023
@AxelLin
Copy link
Contributor

AxelLin commented May 18, 2023

Hi! The fix is on review now - fix.txt. Soon will be merged.

@KonstantinKondrashov
I'm not sure what is the reasonable time for reviewing a fix in espressif.
But the review time is long enough that people cannot differentiate if it is lost or something else.
I feel there must be something worong here.
In the meantime, all stable branches have this issue!

@AxelLin
Copy link
Contributor

AxelLin commented Jun 9, 2023

Hi! The fix is on review now - fix.txt. Soon will be merged.

@KonstantinKondrashov I'm not sure what is the reasonable time for reviewing a fix in espressif. But the review time is long enough that people cannot differentiate if it is lost or something else. I feel there must be something worong here. In the meantime, all stable branches have this issue!

@igrr @Alvin1Zhang
Can someone help to take a look at the fix #9448 (comment) ?
It was mentiond as in review for more than 2 months and then no response.

@KonstantinKondrashov
Copy link
Collaborator

Hi @AxelLin! Sorry for the delay. It was on review for a long time, now it is merged and soon will appear here (backports as well).

@AxelLin
Copy link
Contributor

AxelLin commented Jul 4, 2023

Hi @AxelLin! Sorry for the delay. It was on review for a long time, now it is merged and soon will appear here (backports as well).

The fix is not available in any release branch so far.

@espressif-bot espressif-bot added Status: Done Issue is done internally and removed Status: Reviewing Issue is being reviewed labels Jul 4, 2023
espressif-bot pushed a commit that referenced this issue Jul 7, 2023
…last_ticks were moved around

The commit fixes the case:
If variables in RTC RAM have been moved around by the linker,
they will be filled with garbage data. Any reset other than OTA would work fine
because the variables would still be initialized from the initial bootup.

So now system time will be valid even after OTA.

Closes #9448
@AxelLin
Copy link
Contributor

AxelLin commented Jul 21, 2023

@KonstantinKondrashov
I still cannot find the fix in 5.0, 4.4, 4.3 branches. No idea why it takes such long time to get fix.

@KonstantinKondrashov
Copy link
Collaborator

Hi! v5.0 is not merged yet but it is ready to go. For the rest (4.4, 4.3) branches, the backport will not be provided according to backport policy.

@boribosnjak
Copy link

thank you for the update. that is no good news. I was hoping to see the fix in upcoming 4.4.6. In that case we will prepare an update to v5.1

@AxelLin
Copy link
Contributor

AxelLin commented Jul 21, 2023

For the rest (4.4, 4.3) branches, the backport will not be provided according to backport policy.

This is a BUGFIX. no idea why your policy does not support bugfix in maintenance period.
I use esp-idf for several years, my experience is the releases in service period are not really stable.
Once a release reach more stable state it's almost in maintenance period, however it's difficult to get bug fix for releases in maintenance period.

BTW, the regression was reported on Feb 28 (#9448 (comment)) , (I have been waiting the fix for v4.3 for 3 months #9448 (comment))

espressif-bot pushed a commit that referenced this issue Aug 28, 2023
…last_ticks were moved around

The commit fixes the case:
If variables in RTC RAM have been moved around by the linker,
they will be filled with garbage data. Any reset other than OTA would work fine
because the variables would still be initialized from the initial bootup.

So now system time will be valid even after OTA.

Closes #9448
@AxelLin
Copy link
Contributor

AxelLin commented Sep 4, 2023

For the rest (4.4, 4.3) branches, the backport will not be provided according to backport policy.

Please add this to known issue of upcoming v4.3.x and v4.4.x release notes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Resolution: Done Issue is done internally Status: Done Issue is done internally
Projects
None yet
Development

No branches or pull requests

7 participants