Saturday, November 27, 2010

Identifying the country of origin for a malware PE executable

Note: This is a reprint of a posting I made for my company, NetWitness (www.netwitness.com). It is unchanged from the original, which can be found at: http://www.networkforensics.com/2010/11/25/identifying-the-country-of-origin-for-a-malware-pe-executable/

Have you ever wondered how people writing reports about malware can say where the malware was [likely] developed? Sometimes a stray string gives it away, as in this log line left behind by a piece of malware:

11/16/2009 6:41:48 PM –>  Hook instalate lsass.exe

We can use Google Translate’s “language detect” feature to help us determine the language used:




Of course, it’s not often we get THAT lucky!

A more interesting method is the examination of certain structures known as the Resource Directory within the executable file itself. For the purpose of this post, I will not be describing the Resource Directory structure. It’s a complicated beast, making it a topic I will save for later posts that actually warrant and/or require a low-level understanding of it. Suffice it to say, the Resource Directory is where embedded resources like bitmaps (used in GUI graphics), file icons, etc. are stored. The structure is frequently compared to the layout of files on a file system, although I think it’s insulting to file systems to say such a thing. For those more graphically inclined, I took the following image from http://www.devsource.com/images/stories/PEFigure2.jpg.





For the sake of example, here are some images showing just a few of the resources embedded inside notepad.exe: (using CFF Explorer from: http://www.ntcore.com/exsuite.php)




Now it’s important to note that an executable may have only a few or even zero resources – especially in the case of malware. Consider the following example showing a recent piece of malware with only a single resource called “BINARY.”





Moving on, let’s look at another piece of malware… Below, we see this piece of malware has five resource directories.


We could pick any of the five for this analysis, but I’ll pick RCData – mostly because it’s typically an interesting directory to examine when reverse engineering malware. (This is because RCData defines a raw data resource for an application. Raw data resources permit the inclusion of any binary data directly in the executable file.) Under RCData, we see three separate entries:


The first one to catch my eye is the one called IE_PLUGIN. I’ll show a screenshot of it below, but am saving the subject of executables embedded within executables for a MUCH more technical post in the near future (when it’s not 1:30 am and I actually feel like writing more!). ;-)





Going back to the entry structure itself, the IE_PLUGIN entry will have at least one Directory Entry underneath it to describe the size(s) and offset(s) to the data contained within that resource. I have expanded it as shown next:


And that’s where things get interesting – as it relates to answering the question at the start of this post, anyway. Notice the ID: 1055. That’s our money shot for helping to determine what country this binary was compiled in – or, more specifically, the default locale of the computer used to compile it. These IDs have very legitimate uses; for example, you can have the same dialog in English, French, and German localized forms, and the system will choose which dialog to load based on the thread’s locale. However, when resources are added to the binary without explicitly setting them to different locale IDs, those resources will be assigned the default locale ID of the compiling computer.

So in the example above, what does 1055 mean?

It means this piece of malware was likely developed in (or at least compiled in) Turkey.
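That lookup comes straight from the table at the bottom of this post. In script form, a small subset of it might look like this (only a handful of entries are shown; the full table follows below):

```python
# Subset of the Resource Entry ID (LCID) lookup table at the end of this post
LCID_TO_LOCALE = {
    1033: "English – United States",
    1049: "Russian",
    1055: "Turkish",
    2052: "Chinese – China",
}

def locale_name(lcid):
    """Map a resource entry locale ID to a human-readable locale."""
    return LCID_TO_LOCALE.get(lcid, "unknown LCID %d" % lcid)

print(locale_name(1055))  # the ID found on the IE_PLUGIN resource entry
```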

How do we know that one resource wasn’t added with a custom ID? Because we see the same ID when looking at almost all the other resources in the file (anything with an ID of zero just means “use the default locale”):


In this case, we are also lucky enough to have other strings in the binary (once unpacked) to help solidify the assertion this binary is from Turkey. One such string is “Aktif Pencere,” which Google’s Translation detection engine shows as:





However, as you can see, this technique is very useful even when no strings are present – in logs or the binary itself.

So is this how the default binary locale identification works normally (eg: non-malware executable files)?

Not exactly. The techniques above are generally needed for malware (if the malware even has exposed resources), but not for normal/legitimate binaries. Consider the following legitimate binary – what is its source locale?


As you see in the green box, we have some cursor resources with the ID for the United States. (I’m including a lookup table at the bottom of this post.) In the orange box, there are additional cursor resources with the ID for Germany. In the red box is RCData, like we examined before, but all of these resources have the ID specifying the default language of the computer executing the application.

As it turns out, the normal value to examine is the ID for the Version Information Table resource (in the blue box). In the case above, it's the Czech Republic. The Version Information Table contains the “metadata” you normally see depicted in locations like this:


In the above screenshot, Windows is identifying the source/target locale as English – specifically, United States English (as opposed to UK English, Australian English, etc.). That information is not stored within the Version Information Table, but rather is determined by the ID of the Version Information Table resource.

However, in malware, the Version Information table is almost always stripped or mangled, as is the case with our original example from earlier:


Because of that, the earlier techniques are more applicable to malware.

Below, I’m including a table to help you translate Resource Entry IDs to locales (sorted by decimal ID number).


Locale | Language | LCID | Decimal | Codepage
Arabic – Saudi Arabia | ar | ar-sa | 1025 | 1256
Bulgarian | bg | bg | 1026 | 1251
Catalan | ca | ca | 1027 | 1252
Chinese – Taiwan | zh | zh-tw | 1028 |
Czech | cs | cs | 1029 | 1250
Danish | da | da | 1030 | 1252
German – Germany | de | de-de | 1031 | 1252
Greek | el | el | 1032 | 1253
English – United States | en | en-us | 1033 | 1252
Spanish – Spain (Traditional) | es | es-es | 1034 | 1252
Finnish | fi | fi | 1035 | 1252
French – France | fr | fr-fr | 1036 | 1252
Hebrew | he | he | 1037 | 1255
Hungarian | hu | hu | 1038 | 1250
Icelandic | is | is | 1039 | 1252
Italian – Italy | it | it-it | 1040 | 1252
Japanese | ja | ja | 1041 |
Korean | ko | ko | 1042 |
Dutch – Netherlands | nl | nl-nl | 1043 | 1252
Norwegian – Bokmål | nb | no-no | 1044 | 1252
Polish | pl | pl | 1045 | 1250
Portuguese – Brazil | pt | pt-br | 1046 | 1252
Raeto-Romance | rm | rm | 1047 |
Romanian – Romania | ro | ro | 1048 | 1250
Russian | ru | ru | 1049 | 1251
Croatian | hr | hr | 1050 | 1250
Slovak | sk | sk | 1051 | 1250
Albanian | sq | sq | 1052 | 1250
Swedish – Sweden | sv | sv-se | 1053 | 1252
Thai | th | th | 1054 |
Turkish | tr | tr | 1055 | 1254
Urdu | ur | ur | 1056 | 1256
Indonesian | id | id | 1057 | 1252
Ukrainian | uk | uk | 1058 | 1251
Belarusian | be | be | 1059 | 1251
Slovenian | sl | sl | 1060 | 1250
Estonian | et | et | 1061 | 1257
Latvian | lv | lv | 1062 | 1257
Lithuanian | lt | lt | 1063 | 1257
Tajik | tg | tg | 1064 |
Farsi – Persian | fa | fa | 1065 | 1256
Vietnamese | vi | vi | 1066 | 1258
Armenian | hy | hy | 1067 |
Azeri – Latin | az | az-az | 1068 | 1254
Basque | eu | eu | 1069 | 1252
Sorbian | sb | sb | 1070 |
FYRO Macedonia | mk | mk | 1071 | 1251
Sesotho (Sutu) | | | 1072 |
Tsonga | ts | ts | 1073 |
Setsuana | tn | tn | 1074 |
Venda | | | 1075 |
Xhosa | xh | xh | 1076 |
Zulu | zu | zu | 1077 |
Afrikaans | af | af | 1078 | 1252
Georgian | ka | | 1079 |
Faroese | fo | fo | 1080 | 1252
Hindi | hi | hi | 1081 |
Maltese | mt | mt | 1082 |
Sami Lappish | | | 1083 |
Gaelic – Scotland | gd | gd | 1084 |
Yiddish | yi | yi | 1085 |
Malay – Malaysia | ms | ms-my | 1086 | 1252
Kazakh | kk | kk | 1087 | 1251
Kyrgyz – Cyrillic | | | 1088 | 1251
Swahili | sw | sw | 1089 | 1252
Turkmen | tk | tk | 1090 |
Uzbek – Latin | uz | uz-uz | 1091 | 1254
Tatar | tt | tt | 1092 | 1251
Bengali – India | bn | bn | 1093 |
Punjabi | pa | pa | 1094 |
Gujarati | gu | gu | 1095 |
Oriya | or | or | 1096 |
Tamil | ta | ta | 1097 |
Telugu | te | te | 1098 |
Kannada | kn | kn | 1099 |
Malayalam | ml | ml | 1100 |
Assamese | as | as | 1101 |
Marathi | mr | mr | 1102 |
Sanskrit | sa | sa | 1103 |
Mongolian | mn | mn | 1104 | 1251
Tibetan | bo | bo | 1105 |
Welsh | cy | cy | 1106 |
Khmer | km | km | 1107 |
Lao | lo | lo | 1108 |
Burmese | my | my | 1109 |
Galician | gl | | 1110 | 1252
Konkani | | | 1111 |
Manipuri | | | 1112 |
Sindhi | sd | sd | 1113 |
Syriac | | | 1114 |
Sinhala; Sinhalese | si | si | 1115 |
Amharic | am | am | 1118 |
Kashmiri | ks | ks | 1120 |
Nepali | ne | ne | 1121 |
Frisian – Netherlands | | | 1122 |
Filipino | | | 1124 |
Divehi; Dhivehi; Maldivian | dv | dv | 1125 |
Edo | | | 1126 |
Igbo – Nigeria | | | 1136 |
Guarani – Paraguay | gn | gn | 1140 |
Latin | la | la | 1142 |
Somali | so | so | 1143 |
Maori | mi | mi | 1153 |
HID (Human Interface Device) | | | 1279 |
Arabic – Iraq | ar | ar-iq | 2049 | 1256
Chinese – China | zh | zh-cn | 2052 |
German – Switzerland | de | de-ch | 2055 | 1252
English – Great Britain | en | en-gb | 2057 | 1252
Spanish – Mexico | es | es-mx | 2058 | 1252
French – Belgium | fr | fr-be | 2060 | 1252
Italian – Switzerland | it | it-ch | 2064 | 1252
Dutch – Belgium | nl | nl-be | 2067 | 1252
Norwegian – Nynorsk | nn | no-no | 2068 | 1252
Portuguese – Portugal | pt | pt-pt | 2070 | 1252
Romanian – Moldova | ro | ro-mo | 2072 |
Russian – Moldova | ru | ru-mo | 2073 |
Serbian – Latin | sr | sr-sp | 2074 | 1250
Swedish – Finland | sv | sv-fi | 2077 | 1252
Azeri – Cyrillic | az | az-az | 2092 | 1251
Gaelic – Ireland | gd | gd-ie | 2108 |
Malay – Brunei | ms | ms-bn | 2110 | 1252
Uzbek – Cyrillic | uz | uz-uz | 2115 | 1251
Bengali – Bangladesh | bn | bn | 2117 |
Mongolian | mn | mn | 2128 |
Arabic – Egypt | ar | ar-eg | 3073 | 1256
Chinese – Hong Kong SAR | zh | zh-hk | 3076 |
German – Austria | de | de-at | 3079 | 1252
English – Australia | en | en-au | 3081 | 1252
French – Canada | fr | fr-ca | 3084 | 1252
Serbian – Cyrillic | sr | sr-sp | 3098 | 1251
Arabic – Libya | ar | ar-ly | 4097 | 1256
Chinese – Singapore | zh | zh-sg | 4100 |
German – Luxembourg | de | de-lu | 4103 | 1252
English – Canada | en | en-ca | 4105 | 1252
Spanish – Guatemala | es | es-gt | 4106 | 1252
French – Switzerland | fr | fr-ch | 4108 | 1252
Arabic – Algeria | ar | ar-dz | 5121 | 1256
Chinese – Macau SAR | zh | zh-mo | 5124 |
German – Liechtenstein | de | de-li | 5127 | 1252
English – New Zealand | en | en-nz | 5129 | 1252
Spanish – Costa Rica | es | es-cr | 5130 | 1252
French – Luxembourg | fr | fr-lu | 5132 | 1252
Bosnian | bs | bs | 5146 |
Arabic – Morocco | ar | ar-ma | 6145 | 1256
English – Ireland | en | en-ie | 6153 | 1252
Spanish – Panama | es | es-pa | 6154 | 1252
French – Monaco | fr | | 6156 | 1252
Arabic – Tunisia | ar | ar-tn | 7169 | 1256
English – Southern Africa | en | en-za | 7177 | 1252
Spanish – Dominican Republic | es | es-do | 7178 | 1252
French – West Indies | fr | | 7180 |
Arabic – Oman | ar | ar-om | 8193 | 1256
English – Jamaica | en | en-jm | 8201 | 1252
Spanish – Venezuela | es | es-ve | 8202 | 1252
Arabic – Yemen | ar | ar-ye | 9217 | 1256
English – Caribbean | en | en-cb | 9225 | 1252
Spanish – Colombia | es | es-co | 9226 | 1252
French – Congo | fr | | 9228 |
Arabic – Syria | ar | ar-sy | 10241 | 1256
English – Belize | en | en-bz | 10249 | 1252
Spanish – Peru | es | es-pe | 10250 | 1252
French – Senegal | fr | | 10252 |
Arabic – Jordan | ar | ar-jo | 11265 | 1256
English – Trinidad | en | en-tt | 11273 | 1252
Spanish – Argentina | es | es-ar | 11274 | 1252
French – Cameroon | fr | | 11276 |
Arabic – Lebanon | ar | ar-lb | 12289 | 1256
English – Zimbabwe | en | | 12297 | 1252
Spanish – Ecuador | es | es-ec | 12298 | 1252
French – Cote d’Ivoire | fr | | 12300 |
Arabic – Kuwait | ar | ar-kw | 13313 | 1256
English – Philippines | en | en-ph | 13321 | 1252
Spanish – Chile | es | es-cl | 13322 | 1252
French – Mali | fr | | 13324 |
Arabic – United Arab Emirates | ar | ar-ae | 14337 | 1256
Spanish – Uruguay | es | es-uy | 14346 | 1252
French – Morocco | fr | | 14348 |
Arabic – Bahrain | ar | ar-bh | 15361 | 1256
Spanish – Paraguay | es | es-py | 15370 | 1252
Arabic – Qatar | ar | ar-qa | 16385 | 1256
English – India | en | en-in | 16393 |
Spanish – Bolivia | es | es-bo | 16394 | 1252
Spanish – El Salvador | es | es-sv | 17418 | 1252
Spanish – Honduras | es | es-hn | 18442 | 1252
Spanish – Nicaragua | es | es-ni | 19466 | 1252
Spanish – Puerto Rico | es | es-pr | 20490 | 1252

Network Forensics and Reversing Part 1 – gzip web content, java malware, and a little JavaScript

Note: This is a reprint of a posting I made for my company, NetWitness (www.netwitness.com). It is unchanged from the original, which can be found at: http://www.networkforensics.com/2010/11/14/network-forensics-and-reversing-part-1-gzip-web-content-java-malware-and-a-little-javascript/

Something I’ve found unsettling for some time now is the drastically increased usage of gzip as a Content-Encoding transfer type from web servers. By default now, Yahoo, Google, Facebook, Twitter, Wikipedia, and many other organizations compress the content they send to your users. From that list alone, you can infer that most of the HTTP traffic on any given network is not transferred in plaintext, but rather as compressed bytes.

That means web content you’d expect to look like this on the wire (making it easily searchable for policy violations and security threats):


In reality, looks like this:


As it turns out, the two screenshots above are from the exact same network session, the latter being from Wireshark and showing that the data sent by the webserver really is compressed and not discernible.

By extension, you can likely say that most real-time network forensics/monitoring tools are [realistically] “blind” to [plausibly] a majority of the web traffic flowing into your organization.

Combined with the fact that a vast majority of compromises are delivered to clients via HTTP (at this time, typically through the use of javascript), my use of the word “unsettling” should be considered an understatement. This includes everything from “APT” types of threats (or whatever soapbox you stand on to describe the same thing) down to drive-bys and mass exploitations.

The good news: Current trends in exploitation have given us very powerful methods for generic detection (eg: without needing “signatures,” or more precisely – preexisting knowledge about the details of particular vulnerabilities or exploits) by examining traits of javascript, iframes, html, pdfs, etc.

The bad news: Webservers are reducing the chance that network technologies will detect those conditions by using compression-based transfer (obfuscation).

I find no fault with organizations choosing to use gzip as their transfer encoding. HTML is a horribly repetitive and redundant language (read: bloated). Every opening <tag> has a matching closing </tag>. XML is even worse. For massive sites with massive traffic, the redundancy and bloat of formats like HTML and XML translate directly to lost revenue via extremely large amounts of wasted bandwidth.

Nonetheless, as forensic engineers, our challenge is to discover and compensate for all the things proactive security technologies like AV, firewalls, IPS, etc. continually fail to identify and stop. Recently, I added the following rule on a customer’s network in NetWitness:


If you’re not familiar with the NetWitness rule syntax, the rule above does the following:

If the server application/version (as extracted by the protocol parsing engine) contains the string: “nginx,”
AND
If the Content-Encoding used by the server is gzip
THEN
Create a tag labeled “http_gzip_from_nginx” in a key called “monitors.”
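If you are not using NetWitness, the rule’s logic is easy to reproduce over any parsed HTTP metadata. A minimal Python sketch (the session records here are invented for illustration; NetWitness evaluates the real rule natively):

```python
# Hypothetical per-session metadata, as a protocol parser might extract it
sessions = [
    {"server": "nginx/0.8.53", "content_encoding": "gzip"},
    {"server": "Apache/2.2.3", "content_encoding": "gzip"},
    {"server": "nginx", "content_encoding": "identity"},
]

def monitors_tag(session):
    """Mirror of the rule: tag gzip responses served by any nginx variant."""
    if "nginx" in session["server"] and session["content_encoding"] == "gzip":
        return "http_gzip_from_nginx"
    return None

print([monitors_tag(s) for s in sessions])
```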

In the Investigator GUI, you would see something like this in the “monitors” key:


Why nginx? As it turns out, a lot of hackers tend to use nginx webservers, so this seemed like a good place to start experimenting. The question I was trying to answer is:

If the content body of a web response is gzip’ed (so we can’t examine traits of “suspiciousness” inside the body), then what can we see outside the body to indicate this gzip’ed traffic is worth examining further?

We’ll revisit this question in later blog posts, but for now, nginx as a webserver is an amazingly powerful place to start! We’ll examine one such example in this post, with an additional post using the gzip + nginx combination. As the small screenshot above shows, 33 sessions met the criteria of gzip + nginx (out of about 50,000 sessions). With only 33 sessions, it would be possible to drill into the packets of all 33 and examine each one-by-one (eg: brute-force forensic examination), but that would be poor forensic technique and defeat the entire point of a technical and educational network forensics blog! The examples in this series will employ good forensic practice using “correlative techniques,” giving us a good idea of what is inside the packet contents before we ever drill that deeply into the network data.

The first pivot point we’ll examine is countries. Keep in mind, this is after we used the rule above to include only network sessions where the server returned gzip compressed content and where the webserver was some type of nginx. We could have done the same manually by first pivoting on the content type of gzip:


Doing the first pivot reduces the number of sessions we’re examining from about 50,000 down to 2,878. Then we can apply a custom filter to include only servers with the string “nginx” within those 2,878 sessions. Doing so gives us the same 33 sessions mentioned above.
In those 33 sessions, the countries communicated with are:


Not only do we tend to see a higher degree of malicious traffic from countries like Latvia, but it also immediately looks suspicious simply because it’s an outlier in the list. (Don’t worry Latvia, we’ll pick on our own country in the next post!) Additionally, there’s only a single session to examine here, meaning drilling into the packet-level detail is an OK decision at this point.
In the request, we see the client requested the file “/th/inyrktgsxtfwylf.php” from the host “ertyi.net,” as shown next:


As expected, based on the meta information NetWitness already extracted, we see the gzip’ed reply from an nginx server:


Fortunately, Investigator makes it easy for us to examine gzip’ed content by right-clicking in the session display and selecting decode as compressed data:


Doing so shows us a MUCH different story!


The traffic appears to be obfuscated javascript. We can extract it from NetWitness (a few different ways) to clean it up and examine. I’ll skip those steps and just show the cleaned-up and nicely formatted content the webserver returned.
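If you carve the compressed body out yourself, Python’s standard library will inflate a gzip response body just as easily. A sketch (the payload below is a stand-in, not the actual captured content):

```python
import gzip

# Stand-in for the compressed response body carved out of the session;
# in practice this would be the raw bytes following the HTTP headers.
compressed = gzip.compress(b"<html><body><script>document.write('x')</script></body></html>")

html = gzip.decompress(compressed).decode("ascii")
print(html)
```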


There are a few things to notice in here. At the very bottom of the image above, we clearly see encoded javascript, a trait extremely common to client-side exploit delivery and malicious webpages. We’ll save full javascript reverse engineering for another blog post.


But the worst (or most interesting) part is that the decoding and evaluation logic for this encoded data, while implemented in javascript, is stored inside a TextArea HTML object! This technique makes the real logic invisible and indiscernible to most automated javascript reverse engineering tools.


Indeed, if we upload this webpage to one of my favorite js reversing sites (jsunpack, located at: http://jsunpack.jeek.org/dec/go), we see the following results when the site attempts to automatically reverse engineer the javascript:


Without going further into the process of reverse engineering the javascript (for now – we have an endless supply of blog posts coming!), we can be quite sure we’re looking at something suspicious. At the very least, we know for a fact we’re looking at something that does not make it easy to discern what it’s doing!

The telltale signs of “badness” don’t stop there. At the top of the decoded body data we saw an embedded java applet, as follows:


While we don’t know (yet) what the applet does, there’s a pretty strong indication it’s a downloader or C&C (command and control) application of some type. How can we make such a guess without knowing anything about it?

Look closely at the embedded parameter passed into the applet:


We can make a guess that the string contained in the “value” parameter is data encoded with a simple substitution cipher, where “S”[parm] = “T”[actual] and “T”[parm] = “/”[actual]. If we made such a guess, then it’s possible the decoded parameter value actually starts with the string “http://”.
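A decoder of that style is just a per-character translation table. A minimal sketch, with the mapping inferred by lining up the start of the parameter with “http://” (the applet’s real table covers far more characters than shown here):

```python
# Mapping guessed by aligning the parameter's prefix with "http://";
# everything beyond these five characters is left as an exercise.
DECODE = str.maketrans({"R": "h", "S": "t", "=": "p", ",": ":", "T": "/"})

print("RSS=,TT".translate(DECODE))
```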

Of course, because we have the download of the jar file within our full packet capture and storage database, we’ll just extract it from NetWitness to validate our hunch and possibly learn more. In the below screenshot, I already performed the following steps:
  1. Switched to the session with the jar file download. (Simply clicked on the next session between that same client and server.)
  2. Extracted the jar file by saving the raw data from the server using the “Extract Payload Side 2” option in NetWitness.
  3. Opened the jar file using the following java decompiler:


The first line of code in the java applet takes the parameter passed to it (the encoded value we identified above), and hands it to a function called “b.” The result of that function is stored in a string variable called str1.


Following the decompiled java code to function “b,” we see the following:


It turns out the applet actually is using a simple substitution cipher, replacing one given character with another. When the parameter “RSS=,TT!;LBIB@STSRTYG$I=R=” is decoded, we end up with the string “http://uijn.net/th/fs7.php?i=1.”

The java malware then continues with additional string functions as shown next:


First, we see the declaration of str2 through str5, with values assigned to each.

Then, str6 through str8 are simply the reversals of str2 through str4, resulting in the following strings:

str6 = .exe
str7 = java.io.tmpdir
str8 = os.name

Combining that with the last three lines of code shown above, we see the following:

str10 is a filename ending in “.exe” where the actual filename is a randomly generated number.
str11 is the path to temporary files for the current user.
str12 is the name of the operating system the java malware is currently running on.
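The string reversals can be sanity-checked in a couple of lines of Python mirroring the decompiled Java (the pre-reversal values are reconstructed from their stated results):

```python
# The applet stores its key strings reversed; reconstructed source values:
str2, str3, str4 = "exe.", "ridpmt.oi.avaj", "eman.so"

# Reversing each recovers the strings the malware actually uses
str6, str7, str8 = str2[::-1], str3[::-1], str4[::-1]

print(str6, str7, str8)
```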

The last part of this java malware (that we’ll examine here anyways) is shown next:


First, it tests whether the string “Windows” is contained anywhere in the name of the operating system. If so, it opens a connection to the URL (the one we decoded above), downloads the file, saves it to the temporary directory, and then executes the file.
This appears to be a first-stage downloader for other executables that are likely far more malicious.

Pre-Summary


Even though a large amount of web traffic is coming into your organization gzip compressed, making most inline/real-time security products totally “blind” to what’s inside, we can use standard forensic principles to identify which of those sessions are worth examining. In this case, we combined the following traits to reduce 50,000 network sessions to a single one:
  1. Gzip’ed web content
  2. Suspicious country
  3. Uncommon webserver application
Once we drilled into that single session, we saw how trivial it was to use NetWitness to automatically decompress the content, extract it, then validate it as “bad.”

Epilogue


Does the process stop there? Of course not! If you had to repeat this process every time, not only would it make your job boring as heck, but it would also call into question the value you and your tools are really providing the organization in the first place! There are many ways to maximize the intelligence gained from the process just shown. I’ll highlight one method here, while saving others for later blog posts.

There are several interesting “indicators” gathered from this traffic so far. The ones I’ll focus on here are host names. In the request made by the client, we saw the following tag in the HTTP Request header:

Host: ertyi.net

In the java malware we decompiled, after decoding the encoded parameter value, we saw the executable to be downloaded was from the host “uijn.net.”

At this point, network rules should be added to firewalls, proxies, NetWitness intelligence feeds, and any other technology you have that can alert to other hosts going to either of those servers – preferably blocking all traffic to those servers.

But, can we extend our security perimeter in relation to the hackers using those servers?
Interestingly, we find both those domains are hosted on the same IP block: 194.8.250.60 and 194.8.250.61.

That leads to the question, “What other domains are hosted on those servers?”

Normally I use http://www.robtex.com/ to answer questions like that, but in this case, robtex does not provide a lot of information about that question. It’s possible the hackers are bringing up and tearing down DNS records as needed for the domain names they manage.

Another source of helpful information can be found by querying the “Passive DNS replication” database hosted at http://www.bfk.de/bfk_dnslogger.html. Here, we can find an audit trail of all historically observed DNS replies pointing to the IPs you submit queries about. In this case, we do indeed find valuable information, including about 40 unique host names that have been hosted on those two IPs. A shortened list is included below showing some of the names that have been hosted there.

aeriklin.com
aijkl.net
asdfiz.net
asuyr.net
campag.net
iifgn.net
jhgi.net
jugv.net
kobqq.com
krclear.com
lilif.net
nadwq.com
oiuhx.net
pokiz.net
uijn.net

As we can see, none of them look immediately legitimate, so we can infer this is a hacking group using a set of servers for domains they have registered simply to be “thrown away” if any of those domain names are discovered and end up on a blacklist somewhere.

The Real Summary


By combining a few pivot points and looking inside compressed web traffic most products ignore, from a single network session we proactively increased the security posture of your organization by creating an intelligence feed of nearly 40 host names and two IPs. You could now audit DNS queries made by all hosts in your organization to see if other clients are compromised and doing look-ups when trying to communicate with those hosts.

For the truly paranoid (or safe, depending on how you look at it), you could also blackhole all traffic to those apparently malicious networks:

route: 194.8.250.0/23
origin: AS29557

Considering the Google Safe Browsing report for that AS, it’s probably not a bad idea!

It’s Malware!

Note: This is a reprint of a posting I made for my company, NetWitness (www.netwitness.com). It is unchanged from the original, which can be found at: http://www.networkforensics.com/2010/10/18/its-malware/

Zeus is evolving. Regarding a new release, one Anti-Virus vendor recently noted:



“[the new exe] uses techniques designed to avoid automatic heuristics-based detection.”
The discussion then proceeds to examine how the exe is different from previous versions of the malware.

Should we be alarmed that Zeus is getting so sophisticated that it evades heuristics-based detection mechanisms?

I suppose if it actually evaded heuristics-based detection mechanisms, that would be alarming. I’m sure the version of Zeus in question evades the mechanisms of certain AV vendors. However, when looking at the exact sample in question (verified by MD5) using the techniques we use for malware identification here, we see the sample stands out like a sore thumb.

Using our own internally-developed heuristic malware identification methods (also used by components of NextGen), we see the exe has traits such as the following (not a complete list!):



  1. The binary contains packed sections, indicative of packed, obfuscated, and/or encrypted malware.
  2. The size of the binary is abnormally small considering the conditions and context in which it was found.
  3. The PE checksum fails to validate, something malware packers are notoriously bad about.
  4. The binary does not have any information normally found within the version info table in the resource section of the PE.
But… Why get overly wrapped up in the minutiae related to the abnormal facets of this particular sample of Zeus? There’s a more important note to be made here. That is, Zeus is malware, so it does the things that malware does! You can’t get more “heuristically obvious” than that!

From the same vendor as above:
“…common ZeuS 2.0 variants contain relatively few imported external APIs… By contrast, [this version] imports many external APIs. To a heuristic scanner, this changes the appearance of the file and lowers the possibility of detection.”
Finding a binary that has very few external imports is generally a sign that something is suspicious. Specifically, it’s generally a sign the file is packed, obfuscated, and/or encrypted and the real imports are likely hidden inside. Such is the case when finding binaries that only import between two and five specific APIs from kernel32.dll (in the more obvious cases).

However, when finding a binary with a lot of imports, that’s even better, since you get to see the full range of imports needed by the binary/malware! Without even running the sample or doing deep low-level reverse engineering, you can start to make assumptions about the functionality of the binary based on the APIs it uses. Further, it’s a simple matter to separate malware from legitimate binaries by comparing the APIs it uses to the ones it doesn’t need/use.
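As a toy illustration of that kind of reasoning (the API names are real Windows exports, but this watch set is an arbitrary example, not the actual heuristic logic used by NextGen):

```python
# An illustrative watch set of APIs common in hooking/injection code;
# a real classifier would weigh many more signals than membership alone.
HOOKING_APIS = {"SetWindowsHookExA", "WriteProcessMemory", "CreateRemoteThread"}

def flag_imports(imports):
    """Return the imported APIs that fall into the watch set."""
    return sorted(HOOKING_APIS.intersection(imports))

print(flag_imports(["CreateFileW", "CreateRemoteThread", "CreateMutexA"]))
```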

As is the case with this sample of Zeus, we see that it (like the thousands of different types of malware not related to Zeus) imports APIs related to hooking the Windows API, creating mutexes, and managing services – without importing the functions used by legitimate binaries that also use the same functions.

So, should we be alarmed some people say Zeus is getting so sophisticated that it evades heuristics-based detection mechanisms?

If your security vendor is looking for Zeus, then yes. You should be alarmed. However, if your security vendor is looking for malware, generic signs of infection, and so on, then no… Fortunately Zeus is still malware, just like all the others…

Network detection of x86 buffer overflow shellcode

Note: This is a reprint of a posting I made for my company, NetWitness (www.netwitness.com). It is unchanged from the original, which can be found at: http://www.networkforensics.com/2010/05/16/network-detection-of-x86-buffer-overflow-shellcode/
Overview

This technique can detect overflow exploits against software running on the x86 platform, meaning it applies to Windows, Unix, and Mac shellcode. It not only works independently of the OS, but it also works for finding both stack and heap based overflows. Most interestingly, it catches most forms of polymorphic shellcode as well. (Actually, it excels at finding special shellcodes like polymorphic decryption engines, egg hunters, etc.) While this definitely doesn’t work for all shellcode, it works for a lot of it.

The reason this technique applies to any operating system on x86 is simple. Shellcode is typically written in machine code (commonly called assembly, although it’s not actually the same thing), meaning shellcode is written using processor instructions – something independent of the OS it’s running on. Of course, the entire purpose of shellcode is manipulation of the OS, so shellcode is ultimately OS specific (even patch specific), but its basic primitives are independent of the OS.

One classic problem with shellcoding is addressing. Because shellcode is [typically] nefariously injected via exploitation into a process’s memory segment, and program execution is “hijacked” (without the benefit of setting up proper address pointers), the coder doesn’t know where in memory their code will be. The problem is, very little can be accomplished without knowing the logical memory address of parameters within the shellcode.

The simplest way around this issue is use of a CALL instruction. More information is available in the “Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2A: Instruction Set Reference, A-M” (and 2B: N-Z) located here: http://www.intel.com/products/processor/manuals/.

The CALL is used as a way to branch processor execution to another location in memory. It has the minor benefit of being able to use relative addressing, but it has the major benefit of PUSH’ing procedure linking information on the stack before branching to the target location. This is commonly referred to as Call Stack Setup. When executing a near call, the processor pushes the value of the EIP register (which contains the offset of the instruction following the CALL instruction) on the stack (for use later as a return-instruction pointer). The processor then branches to the address in the current code segment specified by the target operand.

There are several versions of the CALL instruction, but the one we’re interested in for this purpose is opcode 0xE8. This is a near call (near, meaning within the current memory segment) using relative address displacement – and, in the pattern we care about, a negative offset (eg: backwards displacement). The actual instruction is 5 bytes long, with the last four bytes used for a relative offset (a signed displacement relative to the current value of the instruction pointer in the EIP register; this value points to the instruction following the CALL instruction). The CS register is not changed on near calls, so the results of these branches can be safely predicted (from a shellcoder’s perspective).
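The displacement arithmetic can be stated concretely. A small sketch (the example bytes and addresses are arbitrary, chosen only to show a backwards branch):

```python
def near_call_target(call_addr, insn):
    """Compute the branch target of a 5-byte E8 near CALL located at call_addr."""
    if len(insn) != 5 or insn[0] != 0xE8:
        raise ValueError("not a 5-byte near CALL")
    # The displacement is relative to the instruction FOLLOWING the CALL
    disp = int.from_bytes(insn[1:5], "little", signed=True)
    return call_addr + 5 + disp

# A CALL at offset 0x23 with displacement bytes DA FF FF FF (-38) lands at 0x02
print(hex(near_call_target(0x23, b"\xe8\xda\xff\xff\xff")))
```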

A section of a disassembled binary is shown here with an actual CALL. Notice the instruction is given as an 0xe8 plus a double word (32 bit) displacement pointer.


The CALL is usually needed early in shellcode execution to PUSH the virtual address contained in the IP onto the stack. (This is done because the IP cannot be accessed directly; placing it on the stack lets the shellcode locate its own embedded parameters.) However, the problem with using CALLs for call stack setup in buffer overflow shellcode is that the CALL generally sits at an offset that must serve as a return address after other instructions have already executed. In other words, the CALL is generally located later in the shellcode, while the processor executes instructions sequentially from the start of the shellcode unless a branching instruction is encountered.

Which is precisely how the problem is solved in shellcode: early in execution, you simply JMP forward to the CALL in question, which then CALLs backwards into the shellcode, and execution continues just past the JMP.
JMPs are simple instructions and easy to visually identify and dissect. They are simply the opcode 0xEB followed by a signed byte giving the number of bytes to jump (forward or backward).
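The same decoding idea applies to the short JMP; a minimal sketch (again a hypothetical helper, with the sign handling spelled out):

```python
def decode_short_jmp(code: bytes, offset: int) -> int:
    """Return the branch target of an EB short JMP at `offset`.

    The single displacement byte is signed, so values 0x80-0xFF
    jump backwards; the displacement counts from the byte after
    the 2-byte instruction.
    """
    if code[offset] != 0xEB:
        raise ValueError("not an EB short JMP")
    disp = code[offset + 1]
    if disp >= 0x80:
        disp -= 0x100   # interpret as a signed byte
    return offset + 2 + disp

print(hex(decode_short_jmp(b"\xEB\x21", 0)))  # 0x23: 0x21 bytes past the JMP's end
print(decode_short_jmp(b"\xEB\xFE", 0))       # 0: jmp $, the classic infinite loop
```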

The example below is taken from an MDaemon Pre Authentication Heap Overflow exploit:



In the first example above (the egghunter shellcode), we see a “\xeb\x21” which means, “Jump 0x21 (or decimal 33) bytes.” When you jump those bytes, you hit the green box, a CALL. The CALL performs the call stack setup, then branches backwards into the shellcode and picks up just after the JMP (because of the negative displacement). The displacement here is 0xFFFFFFDA, which as a signed value is -0x26 (or -38 decimal); counting back 38 bytes from the end of the 5-byte CALL (33 bytes back from its start) lands us just after the JMP.
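That arithmetic can be checked in a few lines of Python. The offsets below are a hypothetical layout reproducing the example: a JMP at offset 0 and the CALL 0x21 bytes past the JMP's end:

```python
jmp_off = 0
jmp_disp = 0x21                       # \xEB\x21: jump 33 bytes forward
call_off = jmp_off + 2 + jmp_disp     # the JMP lands on the CALL at offset 0x23
call_disp = 0xFFFFFFDA - (1 << 32)    # rel32 0xFFFFFFDA as signed: -0x26 (-38)
landing = call_off + 5 + call_disp    # the displacement counts from the CALL's end
assert landing == jmp_off + 2         # right back to the byte after the JMP
print(call_off, landing)              # 35 2
```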

Simple, yet effective. Even analysis of polymorphic shellcode generators shows this technique applies to almost all of them as well.

To summarize all this rambling, the technique (shown in the FlexParser below) is simply to search for a JMP straight to a near CALL with a short, negative displacement.
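As an illustration of the idea, here is a sketch of that search in Python rather than Flex. This is not the actual FlexParser; the “short, negative” displacement bound of -0x80 is an assumption for the sketch:

```python
import struct

def find_jmp_to_negative_call(buf: bytes):
    """Flag offsets where a short forward JMP (EB xx) lands directly on a
    near CALL (E8) whose rel32 displacement is short and negative --
    the classic GetPC call stack setup described above."""
    hits = []
    for i in range(len(buf) - 1):
        if buf[i] != 0xEB:
            continue
        disp = buf[i + 1]
        if disp >= 0x80:               # only forward short jumps
            continue
        tgt = i + 2 + disp             # the JMP's landing point
        if tgt + 5 > len(buf) or buf[tgt] != 0xE8:
            continue                   # must land right on a near CALL
        rel = struct.unpack_from("<i", buf, tgt + 1)[0]
        if -0x80 <= rel < 0:           # short, negative displacement
            hits.append(i)
    return hits

# jmp +2 over two NOPs, straight onto a CALL with displacement -38
sample = b"\xEB\x02\x90\x90\xE8\xDA\xFF\xFF\xFF"
print(find_jmp_to_negative_call(sample))   # [0]
```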


Evasion

Call with no offset

Evasion of JMP/CALL detection can be accomplished in a number of ways. The most interesting evasions are techniques used in advanced NOP sled obfuscation leveraging CALLs, which started surfacing around the mid-2000s.

One of the simplest CALL-based NOP substitutions worked as follows:

00000000    E800000000    call 0x5
00000005    58            pop eax

In that example we have a CALL with no offset, which basically translates to “branch to the instruction after this CALL,” in this case an opcode that simply POPs the saved EIP into the EAX register. (Remember, when the CALL is executed, the processor runs through the call stack setup, meaning the EIP was just PUSHed onto the stack.) From a NOP perspective this leaves the stack unchanged, and as a method to grab the EIP it is simple and efficient (although the NULL bytes make it harder to use in a wide range of shellcode).

As that byte sequence is very rare in binaries, detecting this is much simpler, since we have the benefit of a continuous 6-byte token to watch for. In the case where the EIP is POPed to EAX, the token is simply

0xE8 0x00 0x00 0x00 0x00 0x58

The above pattern should be extended to include all the general purpose POPs, including:

0xE8 0x00 0x00 0x00 0x00 0x58         (POP EAX)
0xE8 0x00 0x00 0x00 0x00 0x8F         (POP r/m32)
0xE8 0x00 0x00 0x00 0x00 0x0F 0xA1    (POP FS)
0xE8 0x00 0x00 0x00 0x00 0x0F 0xA9    (POP GS)
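These tokens translate directly into a byte-string search; a Python sketch (assuming the two-byte segment-register POPs are 0F A1 for POP FS and 0F A9 for POP GS, per the Intel opcode tables):

```python
GETPC_TOKENS = [
    bytes.fromhex("E80000000058"),      # call $+5 / pop eax
    bytes.fromhex("E8000000008F"),      # call $+5 / pop r/m32 (8F /0)
    bytes.fromhex("E8000000000FA1"),    # call $+5 / pop fs
    bytes.fromhex("E8000000000FA9"),    # call $+5 / pop gs
]

def find_getpc_tokens(buf: bytes):
    """Report (offset, token) pairs for every no-offset-CALL GetPC match."""
    hits = []
    for tok in GETPC_TOKENS:
        start = 0
        while (i := buf.find(tok, start)) != -1:
            hits.append((i, tok.hex()))
            start = i + 1
    return hits

sample = b"\x90" + bytes.fromhex("E80000000058") + b"\xCC"
print(find_getpc_tokens(sample))   # [(1, 'e80000000058')]
```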

Noir’s no JMP/CALL

This next technique was first described by noir@gsu.linux.org.tr  on the vuln-dev mailing list. It works as follows:

00000000    D9EE        fldz
00000002    D97424F4    fnstenv [esp-0xc]
00000006    58          pop eax

In this case, the technique is to use FNSTENV to get the EIP of the last FPU instruction evaluated, then POP it from the stack. In the example above, the FLDZ FPU instruction is issued, then its EIP is POPed. This very cool technique allows for many permutations, since any number of floating-point instructions can be used. Several dozen pages of the Intel Developer’s Manual Instruction Set Reference A-M (starting around page 430) cover instructions that can be used in place of FLDZ.
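A detection sketch for this variant, in Python. This only covers the single-byte D9-prefixed FPU family (like FLDZ) immediately preceding the exact fnstenv [esp-0xc] encoding shown above; the full permutation space is much larger, so treat this as an assumption-laden illustration:

```python
FNSTENV_ESP_0C = bytes.fromhex("D97424F4")   # fnstenv [esp-0xc]

def find_fpu_getpc(buf: bytes):
    """Flag offsets where a 2-byte D9-group FPU instruction (e.g. D9 EE,
    fldz) is immediately followed by fnstenv [esp-0xc]."""
    hits = []
    start = 0
    while (i := buf.find(FNSTENV_ESP_0C, start)) != -1:
        if i >= 2 and buf[i - 2] == 0xD9:    # preceding 2-byte D9 xx instruction
            hits.append(i - 2)
        start = i + 1
    return hits

# nop / fldz / fnstenv [esp-0xc] / pop eax
sample = bytes.fromhex("90D9EED97424F458")
print(find_fpu_getpc(sample))   # [1]
```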

Gera’s CALL into self

The final one we’ll look at is a crafty method to avoid JMP/CALLs, and works like this:

00000000    E8FFFFFFFF    call 0x4
00000005    C3            ret
00000006    58            pop eax

The interesting thing is the code above does not perform the actions the disassembler has labeled. In reality, the CALL (E8FFFFFFFF, a displacement of -1) calls backwards into itself by a single byte. The processor therefore lands on the byte 0xFF (the tail end of the CALL itself) and interprets it as an opcode, in this case the INC/DEC group (increment or decrement by 1). The 0xC3 is consumed as the ModR/M byte of that 0xFF instruction, so it’s not a RET (return, normally used for call stack unwinding) in this case; it encodes the EBX register as the operand, making the executed instruction INC EBX. After this step has been taken (effectively a NOP from the shellcode’s perspective), the value on the stack is POPed into the EAX register by the 0x58 instruction. The value POPed is the EIP, since it was PUSHed onto the stack when the CALL called back into itself.

While this is a very cool technique, it also provides a number of simple tokens to match on, similar to the Call with no offset example.
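For instance, the five-byte call-into-self sequence itself makes a usable token; a Python sketch:

```python
GERA_TOKEN = bytes.fromhex("E8FFFFFFFF")   # call $-1: re-executes its own 0xFF tail

def find_call_into_self(buf: bytes):
    """Report every offset of the call-into-self token in `buf`."""
    hits, start = [], 0
    while (i := buf.find(GERA_TOKEN, start)) != -1:
        hits.append(i)
        start = i + 1
    return hits

# nop / call $-1 / (0xC3 consumed as ModR/M) / pop eax
print(find_call_into_self(bytes.fromhex("90E8FFFFFFFFC358")))   # [1]
```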

False positives and benign triggers

In testing against 55 GB of data (network and host-based), no false positives were encountered when searching for a JMP to a short, near, negative CALL. However, benign triggers were encountered (meaning the condition was detected, but it was a valid use of the condition). The condition was only detected inside some valid PE files, and because of that, such hits can be filtered using a number of simple techniques depending on the technology used to discover them.

Flex Parser

Currently, the parser engine does not allow for one-byte tokens, so this parser is not functional as-is. (The concept presented here can easily be extended to identifying percent-encoded shellcodes, which is supported since they are represented as multi-byte tokens.) Nonetheless, and more importantly, the technique is annotated here in Flex so the reader can see how simple it is to write FlexParsers to discover a wide array of very complex conditions – such as universal shellcode detection.