Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inappropriate implementation of LTC_CLEAN_STACK using burn_stack() api #486

Open
b49020 opened this issue Jun 6, 2019 · 20 comments
Open
Milestone

Comments

@b49020
Copy link

b49020 commented Jun 6, 2019

Description

Some parts of LibTomCrypt have been imported in an open source project OP-TEE [1] for implementing crypto in software (imported code can be found here [2]). But recently we have enabled LTC_CLEAN_STACK so that LibTomCrypt wipes out key material and other sensitive data once no longer used.

Given stack space is quiet expensive for optee_os and currently its limited to 2k only. So while testing on ARM64 machine with LTC_CLEAN_STACK enabled, burn_stack() behaved inappropriately burning stack space 2 times the actual requirement leading to corruption of stack canaries.

Looking deeper into burn_stack() api [3], it seems to be a recursive function which assumes that only 32 bytes of stack space is utilized in every recursive call. But this assumption is incorrect as it doesn't take into account the usage of stack space during function calls. Specifically for function return pointer, stack frame pointer, any padding compiler introduces etc.

[1] https://optee.readthedocs.io/
[2] https://github.com/OP-TEE/optee_os/tree/master/core/lib/libtomcrypt/src
[3] https://github.com/libtom/libtomcrypt/blob/develop/src/misc/burn_stack.c

Steps to Reproduce

Simply add a debug print to dump address of 32 bytes buffer kept on stack in burn_stack() api as follows:

 void burn_stack(unsigned long len)
 {
    unsigned char buf[32];
+   printf("buf addr: %p\n", buf);
    zeromem(buf, sizeof(buf));
    if (len > (unsigned long)sizeof(buf)) {
       burn_stack(len - sizeof(buf));
    }
 }

Following are buf addr dumps which are taken on ARM64 machine where burn_stack() api is used to clean stack for _sha512_compress() api. Size to be cleared: 724 bytes but actual size cleared from below dump: 1472 bytes.

buf addr: 0xfc0b6380
buf addr: 0xfc0b6340
buf addr: 0xfc0b6300
buf addr: 0xfc0b62c0
buf addr: 0xfc0b6280
buf addr: 0xfc0b6240
buf addr: 0xfc0b6200
buf addr: 0xfc0b61c0
buf addr: 0xfc0b6180
buf addr: 0xfc0b6140
buf addr: 0xfc0b6100
buf addr: 0xfc0b60c0
buf addr: 0xfc0b6080
buf addr: 0xfc0b6040
buf addr: 0xfc0b6000
buf addr: 0xfc0b5fc0
buf addr: 0xfc0b5f80
buf addr: 0xfc0b5f40
buf addr: 0xfc0b5f00
buf addr: 0xfc0b5ec0
buf addr: 0xfc0b5e80
buf addr: 0xfc0b5e40
buf addr: 0xfc0b5e00

Version

This could be reproduced on latest upstream develop branch.

Suggestion

We would suggest to clean stack in corresponding function only where it is being used. As it seems one can't assume anything about what stack space has been consumed by the compiler for a given function call.

@b49020
Copy link
Author

b49020 commented Jun 6, 2019

Looking at reason behind introduction of burn_stack() api [1], it seems like in case of a lot of stack variables, clearing every variable inside function makes it look bit messier and may be bit slow due to multiple zeromem calls.

So I have tried to come up with below approach to clear stack in corresponding function only and avoid multiple zeromem calls:

Approach 1

void *min_addr(int arg_count, ...)
{
   int i;
   void *min, *a;
   va_list ap;

   va_start(ap, arg_count);
   min = va_arg(ap, void *);

   for(i = 2; i <= arg_count; i++) {
     if((a = va_arg(ap, void *)) < min)
       min = a;
   }

   va_end(ap);
   return min;
}

static int  sha512_compress(hash_state * md, unsigned char *buf)
{
     ulong64 S[8], W[80], t0, t1;
     int i;
<snip>
#ifdef LTC_CLEAN_STACK
    zeromem(min_addr(5, S, W, &t0, &t1, &i),
            sizeof(ulong64) * 90 + sizeof(int));
#endif
    return CRYPT_OK;
}

Above approach works if compiler allocates stack variables at contiguous stack memory locations which might not be true in case padding involved which could lead to some stack memory uncleaned (equal to size of padding).

Approach 2

Multiple zeromem calls:

static int  sha512_compress(hash_state * md, unsigned char *buf)
{
     ulong64 S[8], W[80], t0, t1;
     int i;
<snip>
#ifdef LTC_CLEAN_STACK
    zeromem(S, sizeof(S));
    zeromem(W, sizeof(W));
    zeromem(&t0, sizeof(t0));
    zeromem(&t1, sizeof(t1));
    zeromem(&i, sizeof(i));
#endif
    return CRYPT_OK;
}

I would be happy to hear any feedback/suggestions regarding which of above approach seems most suitable.

[1]

-- Added "burn_stack" function [idea taken from another source of crypto code]. The idea is if a function has

@ksherlock
Copy link
Collaborator

You could also stick everything in a struct...

static int  sha512_compress(hash_state * md, unsigned char *buf)
{
    struct {
        ulong64 S[8], W[80], t0, t1;
        int i;
    } sensitive;
#define S sensitive.S
#define W sensitive.W
...
#ifdef LTC_CLEAN_STACK
    zeromem(&sensitive, sizeof(sensitive));
#endif
    return CRYPT_OK;
}

@jenswi-linaro
Copy link
Collaborator

I prefer "Approach 2" as it keeps it simple without making any assumptions about the stack layout.

Regarding performance I think either approach will beat the old burnstack approach.

@b49020
Copy link
Author

b49020 commented Jun 7, 2019

@ksherlock Performance wise approach suggested by you looks better than "Approach 2". The only thing that doesn't looks good is a define for every variable which makes function look a bit messier even if compared with "Approach 2".

@b49020
Copy link
Author

b49020 commented Jun 10, 2019

Any more preferences among above 3 approaches (Approach 3: suggested by @ksherlock)? Or any further suggestions to address this issue?

@rofl0r
Copy link

rofl0r commented Jun 10, 2019

i like the zeromem struct approach best. i wonder why there's not been any comment by @sjaeckel or @karel-m since this seems to be a serious issue.

@sjaeckel
Copy link
Member

i wonder why there's not been any comment by @sjaeckel or @karel-m since this seems to be a serious issue.

because I'm not certain yet on how to solve it best...

I like approach 3, but prefer 2 for its simplicity and no macro magic

I already thought about an approach 4 where we leave the burn_stack() approach but increase the buffer size, so the overshooting isn't that severe.

@b49020
Copy link
Author

b49020 commented Jun 11, 2019

I already thought about an approach 4 where we leave the burn_stack() approach but increase the buffer size, so the overshooting isn't that severe.

What should be the appropriate buffer size? Since we can't have too big buffer for APIs that only need to clean few bytes and too small buffer that caused this overshoot. Overall this approach still seems to be fragile.

@buggywhip
Copy link
Contributor

buggywhip commented Jun 11, 2019 via email

@sjaeckel
Copy link
Member

@karel-m 2 fine for you as well?

@karel-m
Copy link
Member

karel-m commented Jun 12, 2019

I do not use LTC_CLEAN_STACK in my builds. Approach 2 is fine.

@b49020
Copy link
Author

b49020 commented Jun 12, 2019

Thanks everyone for your feedback. Will use "Approach 2" to create a fix patch.

@sjaeckel
Copy link
Member

Does anyone think it's worth to keep the old approach alive? (probably as LTC_BURN_STACK)

@rofl0r
Copy link

rofl0r commented Jun 12, 2019

Does anyone think it's worth to keep the old approach alive? (probably as LTC_BURN_STACK)

if the initial analysis posted by OP is correct, the old approach uses undefined behaviour. i don't think that should be kept.

Will use "Approach 2" to create a fix patch.

explicit zeromem calls for every single (sensitive) local variable seems to be an approach that leads to bigger functions (both in code as in binary), as well as error-prone because the programmer needs to manually take care of every single variable. if we instead stuff everything into a struct, there's only a single call to zeromem, and one can simply put new vars into the struct and forget about adding another call with the right arguments.

@b49020
Copy link
Author

b49020 commented Jun 27, 2019

It seems likely that we can't get agreement on one of above discussed approaches. So how about following simple change to use variable length array instead of recursive call in burn_stack api?

diff --git a/core/lib/libtomcrypt/src/misc/burn_stack.c b/core/lib/libtomcrypt/src/misc/burn_stack.c
index c5ae8d3..1722be0 100644
--- a/core/lib/libtomcrypt/src/misc/burn_stack.c
+++ b/core/lib/libtomcrypt/src/misc/burn_stack.c
@@ -49,10 +49,8 @@
 */
 void burn_stack(unsigned long len)
 {
-   unsigned char buf[32];
+   unsigned char buf[len];
    zeromem(buf, sizeof(buf));
-   if (len > (unsigned long)sizeof(buf))
-      burn_stack(len - sizeof(buf));
 }

It seems to resolve stack overflow issue and does stack clean operation too. This change is something that was suggested by @daniel-thompson.

@jenswi-linaro
Copy link
Collaborator

Except that it still relies on undefined behavior.

@rofl0r
Copy link

rofl0r commented Jun 27, 2019

using vla's is the worst possible choice. there have been countless numbers of CVE's due to it.
also it doesn't make the code simpler at all, now every call site has to calculate correctly the length passed, which asks for trouble (whenever you add a variable, you have to adjust the burn_stack calls).

@daniel-thompson
Copy link

@rofl0r: I agree that in general VLAs should be avoided... but I don't agree that they are the worst possible choice since I think that spot is already occupied by the existing burn_stack() code ;-) !

I've got no skin in the game here but FWIW I like approach 3 best, either with macros or just making it such a common idiomatic construct across the library that nobody minds giving it a single letter variable name.

@b49020
Copy link
Author

b49020 commented Jun 28, 2019

@rofl0r Agree but with a minimal change like VLA, we could improve the burn_stack() API atleast. It was just an effort to make current code workable rather than making developer's life easier.

Anyway, let me try to work on a common variable declaration construct that may make approach 3 look more better.

@sjaeckel
Copy link
Member

Sorry for the delay, but I was busy in the last weeks.

if the initial analysis posted by OP is correct, the old approach uses undefined behaviour. i don't think that should be kept.

If that's the case then I also think that the entire burn_stack() approach as it is implemented now should be removed, but can you please explain why this is UB?

The VLA approach would indeed be the least intrusive change, but then it uses VLA's which officially exist only since c99 ... so a clear no.

now every call site has to calculate correctly the length passed, which asks for trouble (whenever you add a variable, you have to adjust the burn_stack calls).

that's already the case, so this wouldn't change ;-)

but that's also one of the main reasons why I like approach 3, automatic determination of the size of what to clear.

Anyway, let me try to work on a common variable declaration construct that may make approach 3 look more better.

I'm looking forward to an improved version of approach 3!

@sjaeckel sjaeckel added this to the next milestone Oct 26, 2020
sjaeckel added a commit that referenced this issue Nov 11, 2020
As discussed in issue #486 [1] the current behavior shouldn't be used
anymore.

[1] #486
sjaeckel added a commit that referenced this issue Dec 21, 2020
As discussed in issue #486 [1] the current behavior shouldn't be used
anymore.

[1] #486
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants