Saturday, November 27, 2010

Network detection of x86 buffer overflow shellcode

Note: This is a reprint of a posting I made for my company, NetWitness (www.netwitness.com). This is unchanged from the original, however that copy can be found at: http://www.networkforensics.com/2010/05/16/network-detection-of-x86-buffer-overflow-shellcode/
Overview

This technique can detect overflow exploits against software running on the x86 platform, meaning it applies to Windows, Unix, and Mac shellcode. It not only works independently of OS, but it also works for finding both stack and heap based overflows. Most interestingly, it catches most forms of polymorphic shellcode as well. (Actually, it exceeds at finding special shellcodes like polymorphic decryption engines, egg hunters, etc.)  While this definitely doesn’t work for all shellcode, it works for a lot of it.

The reason this technique applies to any operating system on x86 is simple. Shellcode is typically written in machine code (commonly called assembly, although it’s not actually the same thing), meaning shellcode is written using processor instructions – something independent of the OS it’s running on. Of course, the entire purpose of shellcode is manipulation of the OS, so shellcode is ultimately OS specific (even patch specific), but its basic primitives are independent of the OS.

One classic problem with shellcoding is addressing. Because shellcode is [typically] nefariously injected via exploitation into a process’s memory segment, and program execution is “hijacked” (without the benefit of setting up proper address pointers), the coder doesn’t know where in memory their code will be. The problem is, very little can be accomplished without knowing the logical memory address of parameters within the shellcode.

The simplest way around this issue is use of a CALL instruction. More information is available in the “Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2A: Instruction Set Reference, A-M” (and 2B: N-Z) located here: http://www.intel.com/products/processor/manuals/.

The CALL is used as a way to branch processor execution to another location in memory. It has the minor benefit of being able to use relative addressing, but it has the major benefit of PUSH’ing procedure linking information on the stack before branching to the target location. This is commonly referred to as Call Stack Setup. When executing a near call, the processor pushes the value of the EIP register (which contains the offset of the instruction following the CALL instruction) on the stack (for use later as a return-instruction pointer). The processor then branches to the address in the current code segment specified by the target operand.

There are several versions of the CALL instruction, but the one we’re interested in for this purpose is opcode 0xE8. This is a near call (near, meaning within the current memory segment) using relative address displacement with a negative offset (eg: backwards displacement). The actual instruction is 5 bytes long, with the last four bytes used for a relative offset (a signed displacement relative to the current value of the instruction pointer in the EIP register; this value points to the instruction following the CALL instruction). The CS register is not changed on near calls, so the results of these branches can be safely predicted (from a shellcoders perspective).

A section of a disassembled binary is shown here with an actual CALL. Notice the instruction is given as an 0xe8 plus a double word (32 bit) displacement pointer.


The CALL is usually needed early in shellcode execution to PUSH the virtual address contained in the IP onto the stack. (This is done because it’s not possible to access the IP directly, so it needs to be put on the stack to utilize parameters within the shellcode). However, the problem with the use of CALLs for call stack setup in buffer overflow shellcode is the CALL is generally located at an offset needing to serve as a return address after other instructions have already been executed. In other words, the CALL is generally located later in the shellcode and the processor executes the instructions sequentially from the start of the shellcode – unless a branching instruction is encountered.

Which is precisely how to solve the problem in shellcode – early in the execution of the shellcode, you simply JMP to the CALL in question, then call back into the shellcode and continue execution.
JMPs are simple instructions and easy to visibly identify and dissect. They are simply the opcode 0xEB followed by a byte indicating the number of bytes to jump.

The example below is taken from an MDaemon Pre Authentication Heap Overflow exploit:



In the first example above (the egghunter shellcode), we see a “\xeb\x21” which means, “Jump 0×21 (or decimal 33) bytes.” When you jump those bytes, you hit the green box, a CALL. The CALL performs the call stack setup, then branches backwards back into the shellcode and picks up just after the JMP (because of the negative displacement). The actual offset is [0xFF – 0xDA = 0x25]. 0×25 is 37 in decimal, however, you subtract 5 from that since the offset starts at the end of the 5-byte CALL. That lands us just after the JMP.

Simple, yet effective. Even analysis of polymorphic shellcode generators shows this technique applies to almost all them as well.

To summarize all this rambling, the technique (show in the FelxParser below) is simply to search for a JMP straight to a NEAR CALL with a short and negative displacement.


Evasion

Call with no offset

Evasion of JMP/CALL detection can be accomplished a number of ways. The most interesting evasions are techniques used in advanced NOP sleds obfuscation leveraging CALLs that started surfacing around the mid-2000’s.

One of the simplest CALL-based NOP substitutions worked as follows:

00000000    E800000000  call 0×5
00000005    58                   pop eax

In that example we have a CALL with no offset, which basically translates to “branch to the instruction after this CALL,” in this case an opcode that simply POPs the EIP into the EAX register. (Remember, when the CALL is hit, the processor runs through the call stack setup, meaning the EIP was just PUSHED onto the stack.) From a NOP perspective, this leaves the stack unchanged, but for a method to grab the EIP, this is a simple and efficient (although the use of NULL bytes makes this more difficult to use in a wide range of shellcode).

As that byte sequence is very rare in binaries, detecting this is much simpler since we have the benefit of a continuous 6-byet token to watch for. In the case the EIP is poped to EAX, the token is simply

0xE8 0×00 0×00 0×00 0×00 0×58

The above pattern should be extended to include all the general purpose POPs, including:

0xE8 0×00 0×00 0×00 0×00 0×58
0xE8 0×00 0×00 0×00 0×00 0×8F
0xE8 0×00 0×00 0×00 0×00 0×0F 0×1A
0xE8 0×00 0×00 0×00 0×00 0×0F 0xA9

Noir’s no JMP/CALL

This next technique was first described by noir@gsu.linux.org.tr  on the vuln-dev mailing list. It works as follows:

00000000    D9EE            fldz
00000002    D97424F4    fnstenv [esp-0xc]
00000006    58                 pop eax

In this case, the technique is to use FNSTENV to get the EIP of the last FPU instruction evaluated, then POP it from the stack. In the example above, the FLDZ FPU instruction is issued, then its EIP is POP’ed. This very cool technique allows for many permutations since any number of floating point instructions can be used.  Several dozen pages in the Intel Developers Instruction Reference A-M (starting around page 430) cover instructions that can be used in place of FLDZ.

Gera’s CALL into self

The final one we’ll look at is a crafty method to avoid JMP/CALLs, and works like this:

00000000    E8FFFFFFFF  call 0×4
00000005    C3                        ret
00000006    58                        pop eax

The interesting thing is the code above does not perform the actions the disassembler has labeled them as doing. In reality, the CALL (E8FFFFFFFF) is calling backwards into itself by a single byte. Therefore, the processor will hit the byte 0xFF (the tail end of the CALL) and interpret that byte as an instruction. In this case, the instruction is an INC/DEC (increment by 1 or decrement by 1). The 0xC3 is actually an operand to the interpreted 0xFF instruction, so it’s not a RET (return, normally used for call stack unwinding) in this case – it’s actually a pointer to the value stored in the EBX register as an operand for the INC/DEC instruction! After this step has been taken (the equivalent of a NOP really), the value on the stack is POP’ed into the EAX register using the 0×58 instruction. The value POPed is the EIP since it was PUSHed onto the stack when the CALL called back into itself.

While this is a very cool technique, it also provides a number of simple tokens to match on, similar to the Call with no offset example.

False positives and benign triggers

In testing of 55 GB of data (network and host based) no false positives were encountered searching for a JMP to short and near negative CALL. However, benign triggers were encountered (meaning the condition was detected, but it was a valid use of the condition). The condition was only detected inside some valid PE files, and because of that fact, they can be filtered using a number of simple and easy techniques depending on the technology used to discover them.

Flex Parser

Currently, the parser engine does not allow for one-byte tokens, so this parser is not functional as-is. (The concept presented here can easily be extended to identifying percent-encoded shellcodes, which is supported since they are represented as multi-byte tokens.) Nonetheless, and more importantly, the technique is annotated here in Flex so the reader can see how simple it is to write FlexParsers to discover a wide array of very complex conditions – such as universal shellcode detection.

No comments:

Post a Comment