Over the past two days the Internet has been full of news stories about Meltdown and Spectre and how horrible and devastating these issues are. CPU vendors scramble to update their microcode, operating systems get patched and even web browsers get an update. What I found missing in pretty much all articles, however, was how these attacks, which can extract data from the kernel and from other programs, actually work. So I resorted to reading the lengthy but very informative whitepapers on Meltdown and Spectre and, since I haven’t found a good source that gives an abbreviated and easier to understand version, I will attempt to do so myself. I was tempted to call this post the ‘Technical Elevator Pitch’ but quite frankly the elevator would have to stop for a little while to be able to finish the story. But I think it can be told over lunchtime…
Note: To make this brief I have made a number of simplifications in the description. Despite the simplifications, however, I think my text still covers the main points that are essential if you want to understand how the exploits work. For the full story, go and read the research papers.
The Problem With Meltdown and Why It’s Really Really Bad
O.k. so here we go: The basic problem with Meltdown is that secret data can be extracted from the kernel or from other running programs by a normal user level program that doesn’t require any special privileges such as root. In the case of Meltdown, extracting secret information does not require exploiting a software bug anywhere in the system! Instead, the Meltdown attack leverages the effects of hardware and operating system optimizations that allow unprivileged code to access any (!) memory address in the system and retrieve its content.
How Does This Work?
And here is how the magic above, that shouldn’t be possible at all, actually works in practice:
In the distant past, all memory in a computer was accessible to all programs (referred to as tasks from now on). This was simple and fast but had the disadvantage that a task could read the data of other tasks and, even worse, accidentally or maliciously change data that didn’t belong to it. This is why all modern processors and operating system kernels that are built for multitasking use a concept known as virtual memory. Instead of allowing a task to write to physical memory directly, each task running on the system gets its own virtual memory space, and each task uses the same virtual addresses. The CPU has a Memory Management Unit (MMU) that uses a separate lookup table, referred to as a page table, for each task and translates the virtual memory addresses used by a task into physical addresses.

Example: Let’s say there is task A (web browser) and task B (word processor) and both have data at virtual memory address 50,000. When task A addresses this memory location the CPU translates this virtual address to a physical address, e.g. 27,948, by using the page table of task A. When the kernel stops task A and allows task B to run, the operating system loads the page table of task B into the memory management unit of the CPU. When task B then accesses its virtual memory address 50,000, the other page table will translate this to a different physical address, e.g. 73,299. In other words, by changing the page tables when switching from task A to task B, the two tasks access the same virtual address but end up at totally different physical memory addresses. In addition, the operating system kernel ensures that the page table of task A does not contain any entries that point to physical memory of task B. And voila, there is no way that task A could read or write any physical memory of task B.
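The translation described above can be sketched as a toy model. Page tables are modeled here as plain dictionaries; real MMUs use multi-level, hardware-walked tables, so this is only an illustration of the principle, using the example addresses from the text:

```python
# Toy model of per-task virtual-to-physical address translation.
# Page tables as dictionaries are an illustrative simplification.
page_table_a = {50_000: 27_948}  # task A (web browser)
page_table_b = {50_000: 73_299}  # task B (word processor)

def translate(page_table, virtual_addr):
    """What the MMU does conceptually: look the virtual address up
    in the page table currently loaded for the running task."""
    return page_table[virtual_addr]

# Same virtual address, completely different physical addresses:
print(translate(page_table_a, 50_000))  # 27948
print(translate(page_table_b, 50_000))  # 73299
```

Swapping the dictionary that `translate` uses corresponds to the kernel loading a different page table on a task switch.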
But Meltdown can!
The Kernel and Its Address Space
One performance problem with having per-task page tables that translate virtual memory addresses into physical memory addresses is that whenever a task switch occurs the kernel has to swap the page tables, which takes time. The kernel also has its own virtual memory space, so when a task requests operations from the kernel (such as, for example, opening a file) a page table swap would be required every time a task calls code in the kernel. As this happens quite often, this is rather inefficient. It was thus decided to map the kernel memory into the virtual memory space of each task. This way the page table does not have to be swapped when a user task jumps into the kernel and when the kernel jumps back to user code.
In addition to mapping kernel memory into the virtual address space of a task, kernel memory gets marked in the page table as only accessible while the processor executes kernel code. This way the user program can’t read or write kernel memory despite it being mapped into the task’s virtual memory. It’s there but it can’t be accessed. And voila, there is no way that a user level program could read or write any kernel memory.
But Meltdown can!
One additional performance enhancement that Meltdown exploits is that the kernel must sometimes access not only its own memory but also the memory of user tasks. That’s why all physical memory is also mapped into the kernel’s virtual memory space, which (as described just above) is in turn mapped into the virtual memory space of each task. A task can’t access the kernel space as described above, so it also can’t access the memory of other tasks that way. And voila, there is no way that task A could read or write any data of task B.
But Meltdown can!
Out of Order Execution
In theory, and as sanity would dictate, a processor executes one machine code instruction after another in strict order. Unfortunately this is quite inefficient, as some machine code instructions take a very long time to complete. For example, getting a byte out of memory can take hundreds of processor cycles because access to main memory is very slow. During these cycles the processor would sit idle. As this is very inefficient, modern processors can execute machine code instructions out of order. If the instructions that come after an instruction that is currently waiting for data from main memory are independent of that instruction (i.e. they work with different data that is already in the processor), then the processor executes them instead of waiting for the previous instruction to finish.
Speculative Execution and Branch Prediction
Another performance optimization concerns program branches. Depending on a comparison, different code paths are taken. As the processor does not know the outcome of a comparison and the subsequent branch straight away, it would have to wait until the decision is made. As this can take quite some time (e.g. due to lots of cycles lost because data has to be fetched from memory), modern processors go ahead and speculatively execute the branches that could follow an ‘if condition’ and store the results in temporary registers. Once the processor knows which branch it should really take, it copies these results from the temporary registers to the real registers and silently discards the results of the branch that was not taken. From a high level point of view it looks as if this branch was never taken; nothing (i.e. no CPU registers, no memory, etc.) was changed by running down this path. Well, almost nothing, as will become apparent shortly…
And yet another optimization of modern computers is the introduction of memory caches between the CPU and main memory. This is done because, as described above, the memory bus is very slow compared to the CPU and hence it takes a long time to fetch data from memory. Memory caches are much smaller than main memory but much, much faster. As programs usually don’t work on all of memory but only on a small part of it, most data that is used by a task is not only in main memory but is also copied into the cache when it is first requested. This speeds up processing enormously, because data that is accessed again a short while later is already in the cache and the processor has to wait much less for it than if it were only to be found in main memory. Note that the cache is transparent to the machine instruction that loads a value from memory, i.e. the machine instruction is just executed a bit faster.
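The timing effect of the cache can be sketched with a toy model. The cache is modeled as a set of addresses, and the cycle counts are illustrative (real latencies vary by CPU and cache level):

```python
# Toy model of cache vs. main memory access times.
# Cycle counts are illustrative assumptions, not measurements.
CACHE_HIT_CYCLES = 200
MAIN_MEMORY_CYCLES = 1000

cache = set()  # addresses currently held in the (simulated) cache

def load(addr):
    """Simulate a memory load: return its cost in cycles. Any
    access leaves the address in the cache for next time."""
    cost = CACHE_HIT_CYCLES if addr in cache else MAIN_MEMORY_CYCLES
    cache.add(addr)
    return cost

first = load(42)   # has to go to main memory: slow
second = load(42)  # now in the cache: fast
```

The loaded value itself is the same both times; only the time it takes differs, which is exactly the property the exploit will measure later.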
Bringing it All Together
Congratulations if you made it down to here because now all things are in place to understand how Meltdown works:
The Meltdown code runs in a normal user program and contains an ‘if’ condition that can go two ways: in the branch that is NEVER taken by the program, because the program sets the condition up to NEVER be fulfilled, a memory address in kernel memory space is read. Remember that the kernel memory space is mapped into the virtual memory space of the task. If this code was ever executed, the CPU would trap and the operating system would stop and end the task immediately due to an unauthorized memory access. But this will not happen, because the condition is NEVER fulfilled and thus the instruction to read the memory in the kernel address space is never executed.
Never Say Never!
O.k., now remember the saying that one should never say never? From a logical point of view the instruction to read kernel space memory never gets executed. But remember the speculative execution optimization above? Because of it, the instructions to read kernel space DO get executed, because the processor executes the code of both branches of the ‘if condition’. At this point the execution is still speculative, so accessing the kernel memory does not raise the CPU trap that would kill the program, and the instruction goes ahead and reads kernel memory (or memory of another task mapped into kernel space) from a user space program. No harm done, one might think, because all results are discarded once the processor knows that the branch was not taken after all, and the user program never sees any of it. But the important point is that, despite everything, kernel memory was accessed from the user task. I repeat this because it is the first step of the exploit.
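The control flow of this first step can be sketched as follows. Python cannot model speculative execution, so this toy only shows the architectural view; `read_kernel_memory` and `encode_in_cache` are hypothetical stand-ins for what the real machine code does:

```python
# Sketch of the never-taken branch. The two helper functions are
# hypothetical stand-ins, not real APIs.
cache = set()  # simulated cache footprint

def read_kernel_memory(addr):
    # Stand-in: in the real attack this is an ordinary load from a
    # kernel address that would trap if executed architecturally.
    return 5

def encode_in_cache(value):
    # Stand-in for touching a memory address derived from the value
    # (described in the next section), leaving it in the cache.
    cache.add(value)

condition = False  # the program ensures this is NEVER true

if condition:  # a speculative CPU may still run this body anyway
    encode_in_cache(read_kernel_memory(0xFFFF0000))

# Architecturally nothing happened. On a vulnerable CPU, however,
# the speculative run of the body would have left a cache footprint.
```

The point of the sketch is the last comment: as far as the program's visible results go, the branch never ran; the exploit lives entirely in the invisible side effect.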
The Side Channel
Now the next step of the Meltdown exploit is to figure out how the code that was speculatively executed (but never should have been) can transmit information to some place where it can be picked up by other code, e.g. by the code that runs in the other branch of the ‘if condition’. The speculatively executed code has to do that before its speculative execution is stopped and the information in its temporary registers is discarded. What the code needs is a ‘side channel’, i.e. it has to exploit an effect of the optimizations described above.
Note again, because this is important, that the optimizations discussed above are transparent to machine code instructions. The programmer’s machine code is the same for loading a data byte from memory independent of whether the data byte is only in main memory or already in the memory cache of the processor. The only difference is the time it takes for the instruction to be executed, i.e. the number of processor cycles it takes.
So let’s go back to the speculatively executed code that makes an illegal memory access to kernel memory. Let’s say this 8-bit memory cell contains the value 5 (it can hold values from 0 to 255). In the next step the speculatively executed code READs memory address number 5, i.e. it translates the value contained in the original memory cell into a memory address. Notice that I wrote READ in upper case because this is important. The code doesn’t care what the content of memory address 5 is and it also doesn’t write anything into it, it just loads the content of memory address 5 into a CPU register. A side effect of this is that the content of memory address 5 is now in the cache, while all other memory addresses from 0 to 255 are NOT in the cache (because they have not been read before). It’s important for what follows next that memory addresses 0 to 255 are part of the virtual address space of the user space task, i.e. the user space task can access this memory without raising a trap!
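The encoding step above can be sketched in isolation: the secret byte is turned into *which* address gets loaded, and therefore cached. The set-based cache is an illustrative assumption:

```python
# Toy model of the encoding step: value 5 becomes "address 5 is cached".
cache = set()  # simulated cache

def load(addr):
    """Simulate a memory load; the only side effect modeled here is
    that the address ends up in the cache."""
    cache.add(addr)

secret = 5     # value the speculatively executed code read from kernel memory
load(secret)   # READ memory address number 5; its content is irrelevant

# Exactly one of the 256 possible addresses is now in the cache:
cached = [addr for addr in range(256) if addr in cache]
print(cached)  # [5]
```

(The real attack spaces the 256 addresses a whole cache line or page apart so that each value maps to its own cache entry, a detail this sketch omits.)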
At some later point the speculative code is interrupted because the CPU has figured out it was not the right branch and all actions are silently discarded. HOWEVER, and that’s the crucial point, memory address 5 remains in the memory cache. No harm done, one would think…
But this is not quite so because in another place of the task, Meltdown runs through a loop with 256 iterations that does the following:
- Read memory address 0, 1, 2, 3, 4, 5, 6…. The content of the memory address is NOT important, the code does NOTHING with it!
- Get the number of clock cycles it took to read each memory address. (The CPU has a cycle counter register that can be read to obtain this value)
The result is that, for example, it takes 1000 cycles for each memory read instruction to finish, because the content of each memory address has to be fetched from the slow system memory. But wait, that’s not quite true: the content of memory address 5 was already in the cache (due to the actions of the speculatively executed code…) and thus it took only 200 cycles for this memory address to be read. And voilà, the code knows that the kernel memory address it is not allowed to read contains the value 5! And that is the SIDE CHANNEL! You can’t get at the value directly, but you use the memory access timing difference, produced for one of the 256 possible values by the speculatively executed and then discarded code, to figure out the value!
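Putting the pieces together, the whole side channel can be modeled end to end: the transient code has cached exactly one of 256 addresses, and the recovery loop finds it by timing. The cycle counts and the set-based cache are illustrative assumptions:

```python
# End-to-end toy model of the timing side channel.
CACHE_HIT_CYCLES = 200     # illustrative: read served from the cache
MAIN_MEMORY_CYCLES = 1000  # illustrative: read served from main memory

cache = set()
secret = 5          # byte the speculatively executed code read from kernel space
cache.add(secret)   # the side effect that survived the discarded branch

def timed_load(addr):
    """Model timing a load (on real hardware: read the cycle counter
    before and after); only the hit-vs-miss difference matters."""
    return CACHE_HIT_CYCLES if addr in cache else MAIN_MEMORY_CYCLES

# Time all 256 candidate addresses; the single fast one is the secret.
timings = [timed_load(addr) for addr in range(256)]
recovered = timings.index(min(timings))
print(recovered)  # 5
```

Repeating this for one kernel address after another lets the attacker dump protected memory byte by byte, which is exactly what the Meltdown paper demonstrates.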
How Can Meltdown Be Fixed?
So how can this be fixed without changing the hardware? There seem to be several ways to do it: One way is to modify the operating system kernel so that it no longer maps the kernel memory space into the virtual memory space of each task. Whenever the kernel is entered and left, page tables have to be swapped, but there is no way anymore for a speculatively executed piece of code to access memory of the kernel or of another task – there is simply no entry for it in the virtual to physical memory translation table.
Another way to prevent this problem is to deny speculatively executed code in user space access to memory that can only be accessed by kernel code. The way I understand the issue, Intel processors are vulnerable to this attack because they don’t do this and only catch the issue after the fact. No harm done, apart from the value now being in the cache and thus accessible faster. As AMD and ARM processors do not seem to be affected by Meltdown (but by Spectre, see below), they might halt execution of the speculative path before the illegal memory address is accessed and thus put into the cache. But that is just a guess on my part.
And What About Spectre?
If you have made it this far you are now ready to go and read the papers on Meltdown (which I have described here) and on Spectre, to which I have linked above. To make an even longer story short, Spectre is a more generalized version of Meltdown that also measures memory access timing differences, but after speculative code execution in other tasks or the kernel, using code it has not planted there itself like Meltdown does. This way, Spectre can overcome even the removal of the kernel memory space from the virtual memory space of the user tasks.
And if you think this is all very theoretical… think again, and take into account that there is probably a reason why everybody from CPU vendors to kernel hackers to web browser developers is scrambling to put ugly and performance degrading fixes into place at warp speed. Welcome to the real world!