Inappropriate implementation of `LTC_CLEAN_STACK` using `burn_stack()` api #486

b49020 · 2019-06-06T08:50:57Z

Description

Some parts of LibTomCrypt have been imported in an open source project OP-TEE [1] for implementing crypto in software (imported code can be found here [2]). But recently we have enabled LTC_CLEAN_STACK so that LibTomCrypt wipes out key material and other sensitive data once no longer used.

Given stack space is quiet expensive for optee_os and currently its limited to 2k only. So while testing on ARM64 machine with LTC_CLEAN_STACK enabled, burn_stack() behaved inappropriately burning stack space 2 times the actual requirement leading to corruption of stack canaries.

Looking deeper into burn_stack() api [3], it seems to be a recursive function which assumes that only 32 bytes of stack space is utilized in every recursive call. But this assumption is incorrect as it doesn't take into account the usage of stack space during function calls. Specifically for function return pointer, stack frame pointer, any padding compiler introduces etc.

[1] https://optee.readthedocs.io/
[2] https://github.com/OP-TEE/optee_os/tree/master/core/lib/libtomcrypt/src
[3] https://github.com/libtom/libtomcrypt/blob/develop/src/misc/burn_stack.c

Steps to Reproduce

Simply add a debug print to dump address of 32 bytes buffer kept on stack in burn_stack() api as follows:

 void burn_stack(unsigned long len)
 {
    unsigned char buf[32];
+   printf("buf addr: %p\n", buf);
    zeromem(buf, sizeof(buf));
    if (len > (unsigned long)sizeof(buf)) {
       burn_stack(len - sizeof(buf));
    }
 }

Following are buf addr dumps which are taken on ARM64 machine where burn_stack() api is used to clean stack for _sha512_compress() api. Size to be cleared: 724 bytes but actual size cleared from below dump: 1472 bytes.

buf addr: 0xfc0b6380
buf addr: 0xfc0b6340
buf addr: 0xfc0b6300
buf addr: 0xfc0b62c0
buf addr: 0xfc0b6280
buf addr: 0xfc0b6240
buf addr: 0xfc0b6200
buf addr: 0xfc0b61c0
buf addr: 0xfc0b6180
buf addr: 0xfc0b6140
buf addr: 0xfc0b6100
buf addr: 0xfc0b60c0
buf addr: 0xfc0b6080
buf addr: 0xfc0b6040
buf addr: 0xfc0b6000
buf addr: 0xfc0b5fc0
buf addr: 0xfc0b5f80
buf addr: 0xfc0b5f40
buf addr: 0xfc0b5f00
buf addr: 0xfc0b5ec0
buf addr: 0xfc0b5e80
buf addr: 0xfc0b5e40
buf addr: 0xfc0b5e00

Version

This could be reproduced on latest upstream develop branch.

Suggestion

We would suggest to clean stack in corresponding function only where it is being used. As it seems one can't assume anything about what stack space has been consumed by the compiler for a given function call.

The text was updated successfully, but these errors were encountered:

b49020 · 2019-06-06T13:04:31Z

Looking at reason behind introduction of burn_stack() api [1], it seems like in case of a lot of stack variables, clearing every variable inside function makes it look bit messier and may be bit slow due to multiple zeromem calls.

So I have tried to come up with below approach to clear stack in corresponding function only and avoid multiple zeromem calls:

Approach 1

void *min_addr(int arg_count, ...)
{
   int i;
   void *min, *a;
   va_list ap;

   va_start(ap, arg_count);
   min = va_arg(ap, void *);

   for(i = 2; i <= arg_count; i++) {
     if((a = va_arg(ap, void *)) < min)
       min = a;
   }

   va_end(ap);
   return min;
}

static int  sha512_compress(hash_state * md, unsigned char *buf)
{
     ulong64 S[8], W[80], t0, t1;
     int i;
<snip>
#ifdef LTC_CLEAN_STACK
    zeromem(min_addr(5, S, W, &t0, &t1, &i),
            sizeof(ulong64) * 90 + sizeof(int));
#endif
    return CRYPT_OK;
}

Above approach works if compiler allocates stack variables at contiguous stack memory locations which might not be true in case padding involved which could lead to some stack memory uncleaned (equal to size of padding).

Approach 2

Multiple zeromem calls:

static int  sha512_compress(hash_state * md, unsigned char *buf)
{
     ulong64 S[8], W[80], t0, t1;
     int i;
<snip>
#ifdef LTC_CLEAN_STACK
    zeromem(S, sizeof(S));
    zeromem(W, sizeof(W));
    zeromem(&t0, sizeof(t0));
    zeromem(&t1, sizeof(t1));
    zeromem(&i, sizeof(i));
#endif
    return CRYPT_OK;
}

I would be happy to hear any feedback/suggestions regarding which of above approach seems most suitable.

[1]

libtomcrypt/changes

Line 1283 in 3fb0eea

    
           -- Added "burn_stack" function [idea taken from another source of crypto code].  The idea is if a function has

ksherlock · 2019-06-06T20:19:31Z

You could also stick everything in a struct...

static int  sha512_compress(hash_state * md, unsigned char *buf)
{
    struct {
        ulong64 S[8], W[80], t0, t1;
        int i;
    } sensitive;
#define S sensitive.S
#define W sensitive.W
...
#ifdef LTC_CLEAN_STACK
    zeromem(&sensitive, sizeof(sensitive));
#endif
    return CRYPT_OK;
}

jenswi-linaro · 2019-06-07T06:11:11Z

I prefer "Approach 2" as it keeps it simple without making any assumptions about the stack layout.

Regarding performance I think either approach will beat the old burnstack approach.

b49020 · 2019-06-07T06:35:51Z

@ksherlock Performance wise approach suggested by you looks better than "Approach 2". The only thing that doesn't looks good is a define for every variable which makes function look a bit messier even if compared with "Approach 2".

b49020 · 2019-06-10T07:06:55Z

Any more preferences among above 3 approaches (Approach 3: suggested by @ksherlock)? Or any further suggestions to address this issue?

rofl0r · 2019-06-10T16:30:39Z

i like the zeromem struct approach best. i wonder why there's not been any comment by @sjaeckel or @karel-m since this seems to be a serious issue.

sjaeckel · 2019-06-10T16:46:49Z

i wonder why there's not been any comment by @sjaeckel or @karel-m since this seems to be a serious issue.

because I'm not certain yet on how to solve it best...

I like approach 3, but prefer 2 for its simplicity and no macro magic

I already thought about an approach 4 where we leave the burn_stack() approach but increase the buffer size, so the overshooting isn't that severe.

b49020 · 2019-06-11T05:40:56Z

I already thought about an approach 4 where we leave the burn_stack() approach but increase the buffer size, so the overshooting isn't that severe.

What should be the appropriate buffer size? Since we can't have too big buffer for APIs that only need to clean few bytes and too small buffer that caused this overshoot. Overall this approach still seems to be fragile.

buggywhip · 2019-06-11T05:54:32Z

I prefer #2 for its simplicity.

sjaeckel · 2019-06-11T06:04:30Z

@karel-m 2 fine for you as well?

karel-m · 2019-06-12T08:54:03Z

I do not use LTC_CLEAN_STACK in my builds. Approach 2 is fine.

b49020 · 2019-06-12T08:58:01Z

Thanks everyone for your feedback. Will use "Approach 2" to create a fix patch.

sjaeckel · 2019-06-12T08:59:08Z

Does anyone think it's worth to keep the old approach alive? (probably as LTC_BURN_STACK)

rofl0r · 2019-06-12T16:14:39Z

Does anyone think it's worth to keep the old approach alive? (probably as LTC_BURN_STACK)

if the initial analysis posted by OP is correct, the old approach uses undefined behaviour. i don't think that should be kept.

Will use "Approach 2" to create a fix patch.

explicit zeromem calls for every single (sensitive) local variable seems to be an approach that leads to bigger functions (both in code as in binary), as well as error-prone because the programmer needs to manually take care of every single variable. if we instead stuff everything into a struct, there's only a single call to zeromem, and one can simply put new vars into the struct and forget about adding another call with the right arguments.

b49020 · 2019-06-27T08:53:48Z

It seems likely that we can't get agreement on one of above discussed approaches. So how about following simple change to use variable length array instead of recursive call in burn_stack api?

diff --git a/core/lib/libtomcrypt/src/misc/burn_stack.c b/core/lib/libtomcrypt/src/misc/burn_stack.c
index c5ae8d3..1722be0 100644
--- a/core/lib/libtomcrypt/src/misc/burn_stack.c
+++ b/core/lib/libtomcrypt/src/misc/burn_stack.c
@@ -49,10 +49,8 @@
 */
 void burn_stack(unsigned long len)
 {
-   unsigned char buf[32];
+   unsigned char buf[len];
    zeromem(buf, sizeof(buf));
-   if (len > (unsigned long)sizeof(buf))
-      burn_stack(len - sizeof(buf));
 }

It seems to resolve stack overflow issue and does stack clean operation too. This change is something that was suggested by @daniel-thompson.

jenswi-linaro · 2019-06-27T09:22:25Z

Except that it still relies on undefined behavior.

rofl0r · 2019-06-27T14:49:10Z

using vla's is the worst possible choice. there have been countless numbers of CVE's due to it.
also it doesn't make the code simpler at all, now every call site has to calculate correctly the length passed, which asks for trouble (whenever you add a variable, you have to adjust the burn_stack calls).

daniel-thompson · 2019-06-27T15:20:42Z

@rofl0r: I agree that in general VLAs should be avoided... but I don't agree that they are the worst possible choice since I think that spot is already occupied by the existing burn_stack() code ;-) !

I've got no skin in the game here but FWIW I like approach 3 best, either with macros or just making it such a common idiomatic construct across the library that nobody minds giving it a single letter variable name.

b49020 · 2019-06-28T06:52:53Z

@rofl0r Agree but with a minimal change like VLA, we could improve the burn_stack() API atleast. It was just an effort to make current code workable rather than making developer's life easier.

Anyway, let me try to work on a common variable declaration construct that may make approach 3 look more better.

sjaeckel · 2019-06-30T09:11:37Z

Sorry for the delay, but I was busy in the last weeks.

if the initial analysis posted by OP is correct, the old approach uses undefined behaviour. i don't think that should be kept.

If that's the case then I also think that the entire burn_stack() approach as it is implemented now should be removed, but can you please explain why this is UB?

The VLA approach would indeed be the least intrusive change, but then it uses VLA's which officially exist only since c99 ... so a clear no.

now every call site has to calculate correctly the length passed, which asks for trouble (whenever you add a variable, you have to adjust the burn_stack calls).

that's already the case, so this wouldn't change ;-)

but that's also one of the main reasons why I like approach 3, automatic determination of the size of what to clear.

Anyway, let me try to work on a common variable declaration construct that may make approach 3 look more better.

I'm looking forward to an improved version of approach 3!

As discussed in issue #486 [1] the current behavior shouldn't be used anymore. [1] #486

b49020 mentioned this issue Jun 6, 2019

Stack overflow for primary boot thread on ARM64 OP-TEE/optee_os#3064

Closed

b49020 mentioned this issue Jul 1, 2019

core: crypto: libtomcrypt: fix LTC_CLEAN_STACK bug OP-TEE/optee_os#3102

Merged

sjaeckel added this to the next milestone Oct 26, 2020

sjaeckel added a commit that referenced this issue Nov 11, 2020

don't allow LTC_CLEAN_STACK to be enabled until fixed

90d8d56

As discussed in issue #486 [1] the current behavior shouldn't be used anymore. [1] #486

sjaeckel added a commit that referenced this issue Dec 21, 2020

don't allow LTC_CLEAN_STACK to be enabled until fixed

6b12bc5

As discussed in issue #486 [1] the current behavior shouldn't be used anymore. [1] #486

alexjba mentioned this issue May 8, 2023

[Performance] Slow sqlite queries due to burn_stack and zeromem status-im/status-desktop#10572

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inappropriate implementation of `LTC_CLEAN_STACK` using `burn_stack()` api #486

Inappropriate implementation of `LTC_CLEAN_STACK` using `burn_stack()` api #486

b49020 commented Jun 6, 2019

b49020 commented Jun 6, 2019

ksherlock commented Jun 6, 2019

jenswi-linaro commented Jun 7, 2019

b49020 commented Jun 7, 2019 •

edited

Loading

b49020 commented Jun 10, 2019

rofl0r commented Jun 10, 2019

sjaeckel commented Jun 10, 2019

b49020 commented Jun 11, 2019

buggywhip commented Jun 11, 2019 via email

sjaeckel commented Jun 11, 2019

karel-m commented Jun 12, 2019

b49020 commented Jun 12, 2019

sjaeckel commented Jun 12, 2019

rofl0r commented Jun 12, 2019

b49020 commented Jun 27, 2019

jenswi-linaro commented Jun 27, 2019

rofl0r commented Jun 27, 2019

daniel-thompson commented Jun 27, 2019

b49020 commented Jun 28, 2019

sjaeckel commented Jun 30, 2019

Inappropriate implementation of LTC_CLEAN_STACK using burn_stack() api #486

Inappropriate implementation of LTC_CLEAN_STACK using burn_stack() api #486

Comments

b49020 commented Jun 6, 2019

Description

Steps to Reproduce

Version

Suggestion

b49020 commented Jun 6, 2019

Approach 1

Approach 2

ksherlock commented Jun 6, 2019

jenswi-linaro commented Jun 7, 2019

b49020 commented Jun 7, 2019 • edited Loading

b49020 commented Jun 10, 2019

rofl0r commented Jun 10, 2019

sjaeckel commented Jun 10, 2019

b49020 commented Jun 11, 2019

buggywhip commented Jun 11, 2019 via email

sjaeckel commented Jun 11, 2019

karel-m commented Jun 12, 2019

b49020 commented Jun 12, 2019

sjaeckel commented Jun 12, 2019

rofl0r commented Jun 12, 2019

b49020 commented Jun 27, 2019

jenswi-linaro commented Jun 27, 2019

rofl0r commented Jun 27, 2019

daniel-thompson commented Jun 27, 2019

b49020 commented Jun 28, 2019

sjaeckel commented Jun 30, 2019

Inappropriate implementation of `LTC_CLEAN_STACK` using `burn_stack()` api #486

Inappropriate implementation of `LTC_CLEAN_STACK` using `burn_stack()` api #486

b49020 commented Jun 7, 2019 •

edited

Loading