Thursday, December 16, 2010

VM Detection by In-The-Wild Malware

Note: This is a reprint of a posting I made for my company, NetWitness (www.netwitness.com). This is unchanged from the original, however that copy can be found at: http://www.networkforensics.com/2010/12/13/vm-detection-by-in-the-wild-malware/

Motivation


A large number of security researchers use Virtual Machines when analyzing malware and/or setting up active and passive honeynets. There are numerous reasons for this, including scalability, manageability, configuration and state snapshots, and the ability to run diverse operating systems.

Malware that attempts to detect whether it's running in a Virtual Machine (and then changes its behavior accordingly to prevent analysis by security people) is not a subject of academic fancy. A recent search of VirusTotal showed they receive at least 1,000 unique samples a week with VM-detection capabilities. (The search was performed using known function import names from non-standard DLLs.) Personally, my first encounter with malware that behaved completely differently inside a Virtual Machine than on a real host was approximately eight years ago.

VM detection is not limited to the realm of APT-level malware. Agobot/Gaobot/PhatBot is a massively deployed malware family, first released in 2004, with the ability to detect whether it is running in either VMware or VirtualPC and change its behavior accordingly. If malware this old and this unsophisticated, with such a massive deployment, performs these checks, our attention to the subject should be especially keen.

Notes


  1. This post contains a number of techniques for VM detection used by malware, along with code demonstrating how simple these techniques are to implement. Except where noted, all techniques are currently used in the wild.
  2. Most of this post (but not all!) is a summary of other people's work, not mine – except where noted. References are given and should be accurate. If not, email me and I'll correct.
  3. Examples where simple code samples could not be produced are not considered here.
  4. Only techniques that are difficult to mitigate are examined here. I'm sure there are hundreds of other ways to detect VMs. Of the methods I'm familiar with, these were the ones that stood out in my mind as being difficult to fight.

Types of Virtual Machines


Generally speaking, there are three types of Virtual Machines. They are:
  1. Hardware Assisted – aka: Hypervisors – These VMs use processor-specific instructions to cause the Host OS to [in effect] "fork," where the original copy of the OS stays in a suspended state while the newly spawned "Guest copy" continues to run as if nothing happened. The important thing to keep in mind for this topic is that when the Guest executes machine-level instructions, the actual hardware CPU executes them.
  2. Reduced Privilege – These are the VMs most people are familiar with and use regularly. Here, the Host takes a more active "proxy" role for the Guest by virtualizing important data structures and registers, then performing some level of translation for certain machine-level instructions. The important thing to note here is that the Guest, in effect, runs at a lower privilege than if it truly controlled the CPU.
  3. Pure Software – Software VMs act as full proxies to the CPU by implementing a truly virtual CPU that the Guest interacts with.

Hypervisors (Hardware Assisted VMs)


Xen 3.x and later and Virtual Server 2005 are a couple of examples of hardware-assisted virtual machines.
Low-level detection of being virtualized in one of these environments is extremely difficult; many people still call it impossible. While several people have talked publicly for years about proof-of-concept code developed to detect these environments, none has been released or found in the wild (that I'm aware of). Because of this, I will not discuss hypervisors beyond describing why detection is so difficult. (Since we have no code with which to examine how simple it is – the point of this post.)
A hypervisor "guest" can be launched at any point after the OS has loaded. In preparation for launching a guest copy of the OS, the "host" sets up some basic CPU-specific control structures, then uses a single instruction (opcode) to cause the CPU to place the Host OS in a virtualized state, leaving the Guest as a "forked copy" of the originally running OS. Once a hypervisor has started running, the Guest OS has essentially zero knowledge of that fact, since all access to hardware is direct. Even though hardware access is direct, the hypervisor itself still has the ability to intercept interesting events – before the Host OS has seen them. In this sense, a hypervisor is more powerful than both the Host and Guest OSes because it sees everything before either of them. Also, once a hypervisor is running, no other can become active; the first hypervisor has absolute control.

All methods for detecting the presence of hypervisors depend on timing functions; however, they are useful only in theory, because creating a good baseline against which to compare timing results (in order to make a pass/fail decision) is infeasible. Another technique uses context switching to cause Translation Lookaside Buffers filled with a predetermined pattern of data to be flushed when a hypervisor is running. Describing that technique is far beyond the scope of this post since there is no exploit code to examine, but based on my understanding of the following article, I'm not sure the technique is still relevant anyway: http://download.intel.com/technology/itj/2006/v10i3/v10-i3-art01.pdf

VMware


The non-ESX versions of VMware are reduced-privilege VMs and, because of that, are trivial to detect. Because the critical data structures set up by the Operating System in critical regions of memory during OS start-up are already in use by the Host OS, VMware must relocate virtual copies of them for use by the Guest OS. This fact alone presents several powerful opportunities to detect when running inside a VMware image.

The first example simply checks the base address of the Interrupt Descriptor Table, as shown below. If the IDT is at a location much higher than its normal location, the process is likely inside a VM. This technique is generally attributed to Joanna Rutkowska, and is described here.
Code:

void detectVMwareIDT (void)
{
    unsigned char idtr[6];
    unsigned long idt_base = 0;

    _asm sidt idtr

    idt_base = *((unsigned long *)&idtr[2]);

    if ((idt_base >> 24) == 0xff)
    {
        printf ("IDT at high relocation – inside VM\n");
    }
    else
    {
        printf ("IDT at a normal location\n");
    }
}

In the above example, the single _asm line is all it takes to get the base address of the IDT, which is then tested a few lines below. SIDT is an instruction that stores the contents of the interrupt descriptor table register (IDTR) in the destination operand. It's important to note this is an unprivileged instruction that can be executed at any privilege level. However, according to the paper "Detecting the Presence of Virtual Machines Using the Local Data Table," checking the IDT on multi-processor systems can fail, because each processor has its own IDT.

If that detection technique wasn't simple enough, the next one is even simpler. VMware builds a Local Descriptor Table in memory; Windows does not. Therefore, simply checking for a non-zero LDT selector when running in Windows is enough to identify VMware.

void detectVMwareLDT (void)
{
    unsigned char ldtr[4] = {0xef, 0xbe, 0xad, 0xde};
    unsigned long ldt_base = 0;

    _asm sldt ldtr

    ldt_base = *((unsigned long *)&ldtr[0]);

    if (ldt_base == 0xdead0000)
    {
        printf ("LDT at normal offset\n");
    }
    else
    {
        printf ("LDT found: inside VMware\n");
    }
}

SLDT is the assembly instruction that stores the segment selector from the local descriptor table register (LDTR) in the destination operand. It writes only the low two bytes of the buffer, so when Windows reports a zero selector (no LDT), the marker value reads 0xdead0000; anything else means an LDT exists. It's important to note that this instruction, too, is unprivileged and can be executed at any privilege level!

Another interesting feature of VMware is seen when executing the IN instruction from user-land (more precisely, from ring 3) of common OSs like Linux and Windows. IN is the "Input from Port" instruction: it copies the value from the I/O port specified by the source operand to the destination operand. IN is a privileged instruction that cannot be run from ring 3, so executing it there should throw an exception. However, when VMware is running, no exception is generated if a special input port is specified. That port is 0x5658, aka "VX." This technique is described in much more detail in the original posting here.

Example code is below, with comments added to explain each step.

void detectVMwareINcmd (void)
{
  // initialized so the test below is well-defined if the exception fires
  unsigned int a = 0, b = 0;

  __try
  {
    __asm
    {
        // standard stack setup, nothing special
        push eax
        push ebx
        push ecx
        push edx

        // first, put the value "VMXh" into EAX. If the IN command
        // is successful (meaning we're in VMware), this fingerprint
        // will be placed into the EBX register by VMware.
        mov eax, 'VMXh'

        // ECX stores the VMware "backdoor" command we're trying
        // to issue. 0x0a is one of several commands. This one gets
        // the version of VMware running.
        mov ecx, 0Ah

        // This is the port operand to the IN instruction and must
        // be "VX" to activate this "backdoor" feature.
        mov dx, 'VX'

        // Key instruction is issued
        in eax, dx

        // If we got here, then we're in VMware, since an exception
        // would have been thrown otherwise.
        mov a, ebx
        mov b, ecx

        // standard stack tear down, nothing special
        pop edx
        pop ecx
        pop ebx
        pop eax
    }
  }
  __except (EXCEPTION_EXECUTE_HANDLER) {}

  // if the fingerprint was pushed into the EBX register,
  // we can check the results
  if (a == 'VMXh')
  {
    printf ("VMware found. Version is: ");

    // the ECX register actually received the version code, tested next:
    if (b == 1)
      printf ("Express\n");
    else if (b == 2)
      printf ("ESX\n");
    else if (b == 3)
      printf ("GSX\n");
    else if (b == 4)
      printf ("Workstation\n");
    else
      printf ("some other version\n");
  }
  else
  {
    printf ("fingerprint never made it to EBX, not VMware\n");
  }
}
Some people writing for SANS have said that disabling certain configuration options in VMware will defeat this type of detection mechanism. Unfortunately, the real IN instruction would never change any register other than EAX in the first place, so the other register changes that take place when VMware executes the instruction are still detectable. Other counter-measures have been proposed, but they are too unstable and unusable in the real world to consider here.

VirtualPC

VirtualPC is also a reduced-privilege VM, like the non-ESX versions of VMware, and is just as trivial to detect. The IDT and LDT tests described in the VMware section apply to VirtualPC as-is. In fact, those tests apply to all the big-name VMs people are most familiar with in the reduced-privilege category.

VirtualPC has functionality similar to VMware's use of the IN instruction; however, it uses illegal instructions to trigger exceptions the kernel will catch. For example, issuing the machine code 0F 3F 0A 00 would normally cause an exception because it's an undefined opcode. With VirtualPC running, no exception is generated, because this sequence is part of VirtualPC's guest-to-host communication protocol. The code in the VMware example can simply be modified to issue this opcode, then test for the lack of an exception.

A more interesting feature of VirtualPC is its use of "buffered code emulation": the practice of copying an instruction from a Guest into a host-controlled buffer, executing it there, then returning the results to the Guest. Since VirtualPC is intercepting every instruction and deciding what to return to the caller, it will sometimes alter or craft its own results – as it does with the CPUID instruction. Normal values returned are "GenuineIntel" and "AuthenticAMD." With VirtualPC, the result is "ConnectixCPU."

But I like the following example of VirtualPC detection best, since it uses a high-level and easy-to-use language, C#. Consider the following:

private void VirtualPCviaWMI()
{
    ManagementObjectSearcher objMOS =
        new ManagementObjectSearcher(
            @"root\CIMV2", "SELECT * FROM Win32_BaseBoard"
            );

    foreach (ManagementObject objManagement in objMOS.Get())
    {
        if (objManagement.GetPropertyValue("Manufacturer").ToString() ==
            "Microsoft Corporation")
        {
            Console.WriteLine("VirtualPC detected");
        }
    }
}
The above example pulls the manufacturer name of the motherboard and tests if it’s “Microsoft Corporation.” If it is, then VirtualPC has just been detected.

Software VMs


Example software VMs include Bochs, Hydra, QEMU, Atlantis, and Norman Sandbox, among many others. Because software VMs try to fully emulate hardware, there are too many detection techniques to list here. Since it would be nearly impossible to implement every instruction and match the quirks each instruction has across different processor families, most tests for software VMs revolve around exercising some of the more arcane instructions.

Sandboxes


I personally love sandboxes because of the sheer volume of work they automate and standardization of data they return. I can’t even imagine life before they existed anymore! :-) However, we need to be realistic about the fact that most (but not all!) are trivial to detect, regardless of hardware platform. This doesn’t mean we should avoid them – it just means we need to ensure we’re compensating for that fact.

A technique I have not seen elsewhere, and have used in my own "research" malware in the past, is to check the DLLs that have been loaded into my program's address space. To use this technique, you must first create a "fingerprint" of the DLLs your program loads. This is easily accomplished with a couple of lines of debug code after you have finished your program. The technique is easy and can be used even in high-level languages like C#. Consider the following:

private bool iveBeenInjected()
{
    bool retVal = false;

    Process pp = Process.GetCurrentProcess();
    ProcessModuleCollection pma = pp.Modules;

    foreach (ProcessModule pm in pma)
    {
        // checkName is a simple function that uses a switch() statement
        // to see if the name is in the "fingerprint" list of DLLs loaded
        // by my program. (In this case, it's 39 entries.)
        if (!checkName(pm.ModuleName) && (pm.ModuleName != pp.ProcessName))
        {
            retVal = true;

            Console.WriteLine(
                pm.ModuleName +
                " is injected from " +
                pm.BaseAddress.ToString() +
                " file: " +
                pm.FileName);
        }
    }

    return retVal;
}
bool checkName(string nameOfLoadedDLL) is a method that returns true if the DLL mapped into the program's process space belongs to the program's "fingerprint." If the DLL is unknown to the program, checkName returns false and the detection logic inside the loop executes.

This simple function is enough to catch many sandboxes (again, not all of them), even when you've followed their best practices and renamed their monitoring DLL (typically injected into all new processes). Other examples of sandbox detection include turning the hook-detection techniques employed by numerous security programs against the sandbox itself, counting hooks, etc.

Unfortunately, in the case of sandboxes, malware doesn't even need to go through all that trouble to defeat them. The only thing malware needs to do is ensure its persistence through a reboot, then wait for one to occur. That alone is enough to defeat most sandboxes' automated analysis!

Summary


In short, you have seen that while many people quibble over VM detection being as simple as looking for registry keys and MAC addresses (all easily mitigated from a security perspective), VM detection is actually:
  1. Much easier than programmatically dealing with the registry
  2. Much harder [nearly impossible] to mitigate when the behavior of hardware is the target of testing
  3. Happening on a massive scale in malware in the wild

While the use of Virtual Machines has many advantages for research purposes, their selection and limitations should be carefully weighed against your actual objectives.

Saturday, November 27, 2010

Identifying the country of origin for a malware PE executable

Note: This is a reprint of a posting I made for my company, NetWitness (www.netwitness.com). This is unchanged from the original, however that copy can be found at: http://www.networkforensics.com/2010/11/25/identifying-the-country-of-origin-for-a-malware-pe-executable/

Have you ever wondered how people writing reports about malware can say where the malware was [likely] developed? Sometimes the malware tells us itself, as in this log line written by one sample:
11/16/2009 6:41:48 PM –>  Hook instalate lsass.exe

We can use Google Translate's "language detect" feature to help us determine the language used:




Of course, it’s not often we get THAT lucky!

A more interesting method is the examination of certain structures known as the Resource Directory within the executable file itself. For the purpose of this post, I will not be describing the Resource Directory structure. It’s a complicated beast, making it a topic I will save for later posts that actually warrant and/or require a low-level understanding of it. Suffice it to say, the Resource Directory is where embedded resources like bitmaps (used in GUI graphics), file icons, etc. are stored. The structure is frequently compared to the layout of files on a file system, although I think it’s insulting to file systems to say such a thing. For those more graphically inclined, I took the following image from http://www.devsource.com/images/stories/PEFigure2.jpg.





For the sake of example, here are some images showing just a few of the resources embedded inside notepad.exe: (using CFF Explorer from: http://www.ntcore.com/exsuite.php)




Now it’s important to note that an executable may have only a few or even zero resources – especially in the case of malware. Consider the following example showing a recent piece of malware with only a single resource called “BINARY.”





Moving on, let’s look at another piece of malware… Below, we see this piece of malware has five resource directories.


We could pick any of the five for this analysis, but I’ll pick RCData – mostly because it’s typically an interesting directory to examine when reverse engineering malware. (This is because RCData defines a raw data resource for an application. Raw data resources permit the inclusion of any binary data directly in the executable file.) Under RCData, we see three separate entries:


The first one to catch my eye is the one called IE_PLUGIN. I’ll show a screenshot of it below, but am saving the subject of executables embedded within executables for a MUCH more technical post in the near future (when it’s not 1:30 am and I actually feel like writing more!). ;-)





Going back to the entry structure itself, the IE_PLUGIN entry will have at least one Directory Entry underneath it describing the size(s) and offset(s) of the data contained within that resource. I have expanded it as shown next:


And that's where things get interesting – as it relates to answering the question at the start of this post, anyway. Notice the ID: 1055. That's our money shot for determining what country this binary was compiled in – or, more specifically, the default locale of the computer used to compile it. These IDs have very legitimate uses: for example, you can have the same dialog in English, French, and German localized forms, and the system will choose which to load based on the thread's locale. However, when resources are added to a binary without explicitly setting different locale IDs, they are assigned the default locale ID of the compiling computer.

So in the example above, what does 1055 mean?

It means this piece of malware was likely developed (or at least compiled) in Turkey.

How do we know that one resource wasn't added with a custom ID? Because we see the same ID on almost all the other resources in the file (an ID of zero just means "use the default locale"):


In this case, we are also lucky enough to have other strings in the binary (once unpacked) to help solidify the assertion this binary is from Turkey. One such string is “Aktif Pencere,” which Google’s Translation detection engine shows as:





However, as you can see, this technique is very useful even when no strings are present – in logs or the binary itself.

So is this how the default binary locale identification works normally (eg: non-malware executable files)?

Not exactly. The above techniques are generally used with malware (if the malware even has exposed resources), but not generally with normal/legitimate binaries. Consider the following legitimate binary. What is the source locale for the following example?


As you see in the green box, we have some cursor resources with the ID for the United States. (I’m including a lookup table at the bottom of this post.) In the orange box, there are additional cursor resources with the ID for Germany. In the red box is RCData, like we examined before, but all of these resources have the ID specifying the default language of the computer executing the application.

As it turns out, the normal value to examine is the ID for the Version Information Table resource (in the blue box). In the case above, it's the Czech Republic. The Version Information Table contains the “metadata” you normally see depicted in locations like this:


In the above screenshot, Windows is identifying the source/target locale as English, and specifically, United States English (as opposed to UK English, Australian English, etc.). That information is not stored within the Version Information table itself, but rather is determined by the ID of the Version Information Table resource.

However, in malware, the Version Information table is almost always stripped or mangled, as is the case with our original example from earlier:


Because of that, the earlier techniques are more applicable to malware.

Below, I’m including a table to help you translate Resource Entry IDs to locales (sorted by decimal ID number).


Locale | Language | LCID | Decimal | Codepage
Arabic – Saudi Arabiaarar-sa10251256
Bulgarianbgbg10261251
Catalancaca10271252
Chinese – Taiwanzhzh-tw1028
Czechcscs10291250
Danishdada10301252
German – Germanydede-de10311252
Greekelel10321253
English – United Statesenen-us10331252
Spanish – Spain (Traditional)eses-es10341252
Finnishfifi10351252
French – Francefrfr-fr10361252
Hebrewhehe10371255
Hungarianhuhu10381250
Icelandicisis10391252
Italian – Italyitit-it10401252
Japanesejaja1041
Koreankoko1042
Dutch – Netherlandsnlnl-nl10431252
Norwegian – Bokmlnbno-no10441252
Polishplpl10451250
Portuguese – Brazilptpt-br10461252
Raeto-Romancermrm1047
Romanian – Romaniaroro10481250
Russianruru10491251
Croatianhrhr10501250
Slovaksksk10511250
Albaniansqsq10521250
Swedish – Swedensvsv-se10531252
Thaithth1054
Turkishtrtr10551254
Urduurur10561256
Indonesianidid10571252
Ukrainianukuk10581251
Belarusianbebe10591251
Slovenianslsl10601250
Estonianetet10611257
Latvianlvlv10621257
Lithuanianltlt10631257
Tajiktgtg1064
Farsi – Persianfafa10651256
Vietnamesevivi10661258
Armenianhyhy1067
Azeri – Latinazaz-az10681254
Basqueeueu10691252
Sorbiansbsb1070
FYRO Macedoniamkmk10711251
Sesotho (Sutu)1072
Tsongatsts1073
Setsuanatntn1074
Venda1075
Xhosaxhxh1076
Zuluzuzu1077
Afrikaansafaf10781252
Georgianka1079
Faroesefofo10801252
Hindihihi1081
Maltesemtmt1082
Sami Lappish1083
Gaelic – Scotlandgdgd1084
Yiddishyiyi1085
Malay – Malaysiamsms-my10861252
Kazakhkkkk10871251
Kyrgyz – Cyrillic10881251
Swahiliswsw10891252
Turkmentktk1090
Uzbek – Latinuzuz-uz10911254
Tatartttt10921251
Bengali – Indiabnbn1093
Punjabipapa1094
Gujaratigugu1095
Oriyaoror1096
Tamiltata1097
Telugutete1098
Kannadaknkn1099
Malayalammlml1100
Assameseasas1101
Marathimrmr1102
Sanskritsasa1103
Mongolianmnmn11041251
Tibetanbobo1105
Welshcycy1106
Khmerkmkm1107
Laololo1108
Burmesemymy1109
Galiciangl11101252
Konkani1111
Manipuri1112
Sindhisdsd1113
Syriac1114
Sinhala; Sinhalesesisi1115
Amharicamam1118
Kashmiriksks1120
Nepalinene1121
Frisian – Netherlands1122
Filipino1124
Divehi; Dhivehi; Maldiviandvdv1125
Edo1126
Igbo – Nigeria1136
Guarani – Paraguaygngn1140
Latinlala1142
Somalisoso1143
Maorimimi1153
HID (Human Interface Device)1279
Arabic – Iraqarar-iq20491256
Chinese – Chinazhzh-cn2052
German – Switzerlanddede-ch20551252
English – Great Britainenen-gb20571252
Spanish – Mexicoeses-mx20581252
French – Belgiumfrfr-be20601252
Italian – Switzerlanditit-ch20641252
Dutch – Belgiumnlnl-be20671252
Norwegian – Nynorsknnno-no20681252
Portuguese – Portugalptpt-pt20701252
Romanian – Moldovaroro-mo2072
Russian – Moldovaruru-mo2073
Serbian – Latinsrsr-sp20741250
Swedish – Finlandsvsv-fi20771252
Azeri – Cyrillicazaz-az20921251
Gaelic – Irelandgdgd-ie2108
Malay – Bruneimsms-bn21101252
Uzbek – Cyrillicuzuz-uz21151251
Bengali – Bangladeshbnbn2117
Mongolianmnmn2128
Arabic – Egyptarar-eg30731256
Chinese – Hong Kong SARzhzh-hk3076
German – Austriadede-at30791252
English – Australiaenen-au30811252
French – Canadafrfr-ca30841252
Serbian – Cyrillicsrsr-sp30981251
Arabic – Libyaarar-ly40971256
Chinese – Singaporezhzh-sg4100
German – Luxembourgdede-lu41031252
English – Canadaenen-ca41051252
Spanish – Guatemalaeses-gt41061252
French – Switzerlandfrfr-ch41081252
Arabic – Algeriaarar-dz51211256
Chinese – Macau SARzhzh-mo5124
German – Liechtensteindede-li51271252
English – New Zealandenen-nz51291252
Spanish – Costa Ricaeses-cr51301252
French – Luxembourgfrfr-lu51321252
Bosnianbsbs5146
Arabic – Moroccoarar-ma61451256
English – Irelandenen-ie61531252
Spanish – Panamaeses-pa61541252
French – Monacofr61561252
Arabic – Tunisiaarar-tn71691256
English – Southern Africaenen-za71771252
Spanish – Dominican Republiceses-do71781252
French – West Indiesfr7180
Arabic – Omanarar-om81931256
English – Jamaicaenen-jm82011252
Spanish – Venezuelaeses-ve82021252
Arabic – Yemenarar-ye92171256
English – Caribbeanenen-cb92251252
Spanish – Colombiaeses-co92261252
French – Congofr9228
Arabic – Syriaarar-sy102411256
English – Belizeenen-bz102491252
Spanish – Perueses-pe102501252
French – Senegalfr10252
Arabic – Jordanarar-jo112651256
English – Trinidadenen-tt112731252
Spanish – Argentinaeses-ar112741252
French – Cameroonfr11276
Arabic – Lebanonarar-lb122891256
English – Zimbabween122971252
Spanish – Ecuadoreses-ec122981252
French – Cote d’Ivoirefr12300
Arabic – Kuwaitarar-kw133131256
English – Phillippinesenen-ph133211252
Spanish – Chileeses-cl133221252
French – Malifr13324
Arabic – United Arab Emiratesarar-ae143371256
Spanish – Uruguayeses-uy143461252
French – Moroccofr14348
Arabic – Bahrainarar-bh153611256
Spanish – Paraguayeses-py153701252
Arabic – Qatararar-qa163851256
English – Indiaenen-in16393
Spanish – Boliviaeses-bo163941252
Spanish – El Salvadoreses-sv174181252
Spanish – Honduraseses-hn184421252
Spanish – Nicaraguaeses-ni194661252
Spanish – Puerto Ricoeses-pr204901252

Network Forensics and Reversing Part 1 – gzip web content, java malware, and a little JavaScript

Note: This is a reprint of a posting I made for my company, NetWitness (www.netwitness.com). This is unchanged from the original, however that copy can be found at: http://www.networkforensics.com/2010/11/14/network-forensics-and-reversing-part-1-gzip-web-content-java-malware-and-a-little-javascript/

Something I’ve found unsettling for some time now is the drastically increased usage of gzip as a Content-Encoding transfer type from web servers. By default now, Yahoo, Google, Facebook, Twitter, Wikipedia, and many other organizations compress the content they send to your users. From that list alone, you can infer that most of the HTTP traffic on any given network is not transferred in plaintext, but rather as compressed bytes.

That means web content you’d expect to look like this on the wire (making it easily searchable for policy violations and security threats):


In reality, looks like this:


As it turns out, the two screenshots above are of the exact same network session; the latter, from Wireshark, shows that the data sent by the webserver really is compressed and not discernible.

By extension, you can likely say that most real-time network forensics/monitoring tools are [realistically] "blind" to [plausibly] a majority of the web traffic flowing into your organization.

Combined with the fact that a vast majority of compromises are delivered to clients via HTTP (at this time, typically through the use of JavaScript), my use of the word "unsettling" should be an understatement. This includes everything from "APT" types of threats (or whatever soapbox you stand on to describe the same thing) down to drive-bys and mass exploitation.

The good news: Current trends in exploitation have given us very powerful methods for generic detection (eg: without needing "signatures," or more precisely – preexisting knowledge of the details of particular vulnerabilities or exploits) by examining traits of JavaScript, iframes, HTML, PDFs, etc.

The bad news: Webservers are reducing the chance that network technologies can detect those conditions, via compression-based transfer (obfuscation).

I find no fault with organizations choosing to use gzip as their transfer encoding. HTML is a horribly repetitive and redundant language (read: bloated): every opening <tag> has an identical closing </tag>, and XML is even worse. For massive sites with massive traffic, the redundancy and bloat of formats like HTML and XML translate directly to lost revenue via extremely large amounts of wasted bandwidth.

Nonetheless, as forensic engineers, our challenge is to discover and compensate for all the things proactive security technologies like AV, firewalls, IPS, etc. continually fail to identify and stop. Recently, I added the following rule on a customer’s network in NetWitness:


If you’re not familiar with the NetWitness rule syntax, the rule above does the following:

If the server application/version (as extracted by the protocol parsing engine) contains the string: “nginx,”
AND
If the Content-Encoding used by the server is gzip
THEN
Create a tag labeled “http_gzip_from_nginx” in a key called “monitors.”
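The engine evaluates this rule internally, but the logic is simple enough to sketch. A rough Python equivalent, where the session dictionaries and field names are hypothetical stand-ins for the meta the NetWitness parsing engine extracts:

```python
# Hypothetical stand-in for the rule logic: tag sessions whose server
# string contains "nginx" AND whose Content-Encoding is gzip.
def tag_session(session):
    tags = []
    if "nginx" in session.get("server", "") and session.get("content_encoding") == "gzip":
        tags.append(("monitors", "http_gzip_from_nginx"))
    return tags

sessions = [
    {"server": "nginx/0.8.53", "content_encoding": "gzip"},
    {"server": "Apache/2.2.3", "content_encoding": "gzip"},
    {"server": "nginx/0.7.67", "content_encoding": "identity"},
]

flagged = [s for s in sessions if tag_session(s)]
print(len(flagged))  # 1
```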

In the Investigator GUI, you would see something like this in the “monitors” key:


Why nginx? As it turns out, a lot of hackers tend to use nginx webservers, so this seemed like a good place to start experimenting. The question I was trying to answer is:

If the content body of a web response is gzip’ed (so we can’t examine traits of “suspiciousness” inside the body), then what can we see outside the body to indicate this gzip’ed traffic is worth examining further?

We’ll revisit this question in later blog posts, but for now, nginx as a webserver is an amazingly powerful place to start! We’ll examine one such example in this post, with an additional post using the gzip + nginx combination. As the small screenshot above shows, 33 sessions (out of about 50,000) met the criteria of gzip + nginx. With only 33 sessions, it would be possible to drill into the packets of each one, one-by-one (i.e., brute-force forensic examination), but that would be poor forensic technique and defeat the entire point of a technical and educational network forensics blog! Instead, this series of posts will employ “correlative techniques,” giving us a good idea of what is inside the packet contents before we ever drill that deeply into the network data.

The first pivot point we’ll examine is countries. Keep in mind, this is after we used the rule above to include only network sessions where the server returned gzip compressed content and where the webserver was some type of nginx. We could have done the same manually by first pivoting on the content type of gzip:


Doing the first pivot reduces the number of sessions we’re examining from about 50,000 down to 2,878. Then we can apply a custom filter to include only servers with the string “nginx” within those 2,878 sessions. Doing so gives us the same 33 sessions mentioned above.
In those 33 sessions, the countries communicated with are:


Not only do we tend to see a higher degree of malicious traffic from countries like Latvia, but it also immediately looks suspicious simply because it’s an outlier in the list. (Don’t worry Latvia, we’ll pick on our own country in the next post!) Additionally, there’s only a single session to examine here, meaning drilling into the packet-level detail is a reasonable decision at this point.
In the request, we see the client requested the file “/th/inyrktgsxtfwylf.php” from the host “ertyi.net,” as shown next:


As expected, based on the meta information NetWitness already extracted, we see the gzip’ed reply from an nginx server:


Fortunately, Investigator makes it easy for us to examine gzip’ed content by right-clicking in the session display and selecting decode as compressed data:


Doing so shows us a MUCH different story!


The traffic appears to be obfuscated javascript. We can extract it from NetWitness (a few different ways) to clean it up and examine it. I’ll skip those steps and just show the cleaned-up and nicely formatted content the webserver returned.


There are a few things to notice in here. At the very bottom of the image above, we clearly see encoded javascript, a trait extremely common to client-side exploit delivery and malicious webpages. We’ll save full javascript reverse engineering for another blog post.


But the worst (or most interesting) part is that the decoding and evaluation logic for this encoded data, while implemented in javascript, is stored inside a TextArea HTML object! This technique makes the real logic invisible and indiscernible to most automated javascript reverse engineering tools.


Indeed, if we upload this webpage to one of my favorite js reversing sites (jsunpack, located at: http://jsunpack.jeek.org/dec/go), we see the following results when the site attempts to automatically reverse engineer the javascript:


Without going further into the process of reverse engineering the javascript (for now – we have an endless supply of blog posts coming!), we can be quite sure we’re looking at something suspicious. At the very least, we know for a fact we’re looking at something that does not make it easy to discern what it’s doing!

The telltale signs of “badness” don’t stop there. At the top of the decoded body data we saw an embedded java applet, as follows:


While we don’t know (yet) what the applet does, there’s a pretty strong indication it’s a downloader or C&C (command and control) application of some type. How can we make such a guess without knowing anything about it?

Look closely at the embedded parameter passed into the applet:


We can make a guess that the string contained in the “value” parameter is encoded data using a simple substitution cypher where “S”[parm] = “T”[actual] and “T”[parm] = “/”[actual]. If we made such a guess, then it’s possible the decoded parameter value actually starts with the string “http://”.
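Lining up the start of the encoded value against “http://” already yields a handful of parameter-to-plaintext pairs. A quick Python sketch of that partial guess (the mapping below is derived only from this alignment; the rest of the cipher is unknown at this point):

```python
# Partial substitution table inferred by aligning the first seven
# characters of the applet parameter ("RSS=,TT...") with "http://".
partial_map = {"R": "h", "S": "t", "=": "p", ",": ":", "T": "/"}

def decode_prefix(encoded):
    # Decode only the characters we have guesses for; mark the rest.
    return "".join(partial_map.get(c, "?") for c in encoded)

print(decode_prefix("RSS=,TT"))  # http://
```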

Of course, because we have the download of the jar file within our full packet capture and storage database, we’ll just extract it from NetWitness to validate our hunch and possibly learn more. In the below screenshot, I already performed the following steps:
  1. Switched to the session with the jar file download. (Simply clicked on the next session between that same client and server.)
  2. Extracted the jar file by saving the raw data from the server using the “Extract Payload Side 2” option in NetWitness.
  3. Opened the jar file using the following java decompiler:


The first line of code in the java applet takes the parameter passed to it (the encoded value we identified above), and hands it to a function called “b.” The result of that function is stored in a string variable called str1.


Following the decompiled java code to function “b,” we see the following:


It turns out the applet actually is using a simple substitution cypher, replacing one given character with another. When the parameter “RSS=,TT!;LBIB@STSRTYG$I=R=” is decoded, we end up with the string “http://uijn.net/th/fs7.php?i=1.”

The java malware then continues with additional string functions as shown next:


First, we see the declaration of str2 through str5, with values assigned to each.

Then, str6 through str8 are simply the reversals of str2 through str4, resulting in the following strings:

str6 = .exe
str7 = java.io.tmpdir
str8 = os.name
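Storing string literals reversed is a trivial but popular obfuscation, keeping strings like “.exe” out of casual inspection. The same trick, reproduced in Python from the values above:

```python
# The reversed literals the applet stores (reconstructed from the
# decoded values shown by the decompiler).
str2, str3, str4 = "exe.", "ridpmt.oi.avaj", "eman.so"

# Reversing each one recovers the meaningful strings.
str6 = str2[::-1]   # ".exe"
str7 = str3[::-1]   # "java.io.tmpdir"
str8 = str4[::-1]   # "os.name"
print(str6, str7, str8)
```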

Combining that with the last three lines of code shown above, we see the following:

str10 is a filename ending in “.exe” where the actual filename is a randomly generated number.
str11 is the path to the current user’s temporary files.
str12 is the name of the Operating System the java malware is currently running on.
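In Python terms, using the standard-library analogs of the Java properties the applet reads (the variable names mirror the decompiled code, but this is only an illustration, not the applet itself):

```python
import os
import random
import tempfile
import platform

# Python analogs of the Java properties the applet reads.
str10 = str(random.randint(0, 10**8)) + ".exe"  # random numeric filename
str11 = tempfile.gettempdir()                   # analog of java.io.tmpdir
str12 = platform.system()                       # analog of os.name

# Where the downloaded executable would be dropped.
target_path = os.path.join(str11, str10)
print(target_path)
```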

The last part of this java malware (that we’ll examine here anyways) is shown next:


First, it tests whether the string “Windows” is contained anywhere in the name of the Operating System. If so, it opens a connection to the URL we decoded above, downloads the file, saves it to the temporary directory, and then executes it.
This applet appears to be a first-stage downloader for other executables that are likely far more malicious.

Pre-Summary


Even though a large amount of web traffic comes into your organization gzip compressed, making most inline/real-time security products totally “blind” to what’s inside, we can use standard forensic principles to identify which of those sessions are worth examining. In this case, we combined the following traits to reduce 50,000 network sessions to a single one:
  1. Gzip’ed web content
  2. Suspicious country
  3. Uncommon webserver application
Once we drilled into that single session, we saw how trivial it was to use NetWitness to automatically decompress the content, extract it, and then validate it as “bad.”

Epilogue


Does the process stop there? Of course not! If you had to repeat this process every time, not only would it make your job boring as heck, but it would also call into question the value you and your tools are really providing the organization in the first place! There are many ways to maximize the intelligence gained from the process just shown. I’ll highlight one method here, while saving others for later blog posts.

There are several interesting “indicators” gathered from this traffic so far. The ones I’ll focus on here are host names. In the request made by the client, we saw the following tag in the HTTP Request header:

Host: ertyi.net

In the java malware we decompiled, after decoding the encoded parameter value, we saw the executable to be downloaded was from the host “uijn.net.”

At this point, network rules should be added to firewalls, proxies, NetWitness intelligence feeds, and any other technology you have that can alert to other hosts going to either of those servers – preferably blocking all traffic to those servers.

But, can we extend our security perimeter in relation to the hackers using those servers?
Interestingly, we find both those domains are hosted on the same IP block: 194.8.250.60 and 194.8.250.61.

That leads to the question, “What other domains are hosted on those servers?”

Normally I use http://www.robtex.com/ to answer questions like that, but in this case, robtex does not provide a lot of information. It’s possible the hackers are bringing up and tearing down DNS records as needed for the domain names they manage.

Another source of helpful information is the “Passive DNS replication” database hosted at: http://www.bfk.de/bfk_dnslogger.html. Here, we can find an audit trail of all historically observed DNS replies pointing to the IPs we query. In this case, we do indeed find valuable information, including about 40 unique host names that have been hosted on those two IPs. A shortened list is included below showing some of the names that have been hosted there.

aeriklin.com
aijkl.net
asdfiz.net
asuyr.net
campag.net
iifgn.net
jhgi.net
jugv.net
kobqq.com
krclear.com
lilif.net
nadwq.com
oiuhx.net
pokiz.net
uijn.net

As we can see, none of them look immediately legitimate, so we can infer this is a hacking group using a set of servers with domains registered simply to be “thrown away” if any of those names are discovered and end up on a blacklist somewhere.

The Real Summary


By combining a few pivot points and looking inside compressed web traffic that most products ignore, we proactively increased the security posture of your organization from a single network session by creating an intelligence feed of nearly 40 host names and 2 IPs. You could now audit DNS queries made by all hosts in your organization to see if other clients are compromised and performing look-ups when trying to communicate with those hosts.
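Auditing DNS activity against the indicator list can be as simple as a set intersection. A sketch, where the log entries and internal addresses are hypothetical:

```python
# Hypothetical DNS query log entries: (client_ip, queried_name).
dns_log = [
    ("10.1.1.20", "www.example.com"),
    ("10.1.1.31", "uijn.net"),
    ("10.1.1.31", "ertyi.net"),
]

# Indicators gathered from the session above and the passive DNS lookups.
bad_hosts = {"ertyi.net", "uijn.net", "aijkl.net", "pokiz.net"}

# Any client querying an indicator is worth a closer look.
suspects = sorted({client for client, name in dns_log if name in bad_hosts})
print(suspects)  # ['10.1.1.31']
```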

For the truly paranoid (or safe, depending on how you look at it), you could also blackhole all traffic to those apparently malicious networks:

route: 194.8.250.0/23
origin: AS29557

Considering the Google Safe Browsing report for that AS, it’s probably not a bad idea!