Ever since I started using computers, I have been intrigued by how computer viruses work. Without thinking about the consequences they caused in the world of computers, I tried to learn more and more to achieve my goal of writing my own virus. Fortunately, writing a virus required advanced knowledge of assembly language and a deep understanding of the internal workings and data structures of the DOS operating system.
By the time I had the necessary knowledge (several years later), the operating system in use was Microsoft Windows. I could finally create my own viruses, which I never released, since by then I was aware of the damage these programs could do, even when made as “benevolent” as possible.
Although coding a virus was and remains a real challenge, it is more difficult to create antivirus software. Such a tool has to detect and block thousands of viruses before they act on the system. All these actions have to be performed in very short time slices, so that the user feels comfortable and secure at the same time. Besides, viruses can enter the system by various means, hidden in many different forms, and activate their payload only under certain conditions, in a totally unexpected manner. As if all this were not enough, many types of viruses have emerged as a response from virus programmers to antivirus software developers. In addition, many new viruses appear every day and are distributed mainly over the Internet.
In this article, I intend to outline the ideas and concepts used by developers of antivirus and antispyware software. Moreover, I will explain why signatures are still useful. Given the complexity of many of these concepts, the interested reader is directed to links containing comprehensive information about the topics. I will also assume that the reader has some knowledge of computer viruses.
A signature is any sequence of bits that can be used to accurately identify the presence of a particular virus in a given file or range of memory.
Once we get a sample of a virus, its type (worm, rootkit, simple infector, etc.) should be determined. Only then can a signature be extracted from the binary code. In many cases (e.g. EXE infectors, COM infectors, polymorphic viruses, stealth viruses, etc.) this will be possible and enough to detect the virus in the future. However, recent viruses that are much more complex (e.g. metamorphic viruses) require other techniques, such as behavior-based analysis. A full team of people is likely to be required to analyze these viruses meticulously. They would also need to write custom detection routines by hand, a very time-consuming task.
Despite all this, many believe that signatures were used only in the antivirus software of the 80s and 90s and are no longer used. This is totally untrue. The truth is that signatures still play a fundamental role in the various virus detection algorithms used by current antivirus products. Let’s see a typical example of a signature. Suppose the following sequence of bits (in hexadecimal) corresponds to a signature for a virus called Doctor Evil:
A6 7C FD 1B 45 82 90 1D 6F 3C 8A 0F 96 18 A4 C3 4F FF 0F 1D
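To make the idea concrete, here is a minimal Python sketch of the simplest possible check: looking for this exact byte sequence inside a buffer. The function name and the sample buffers are my own illustrative assumptions, and a real scanner would of course do far more than a single substring search.

```python
# The "Doctor Evil" signature from the example, as raw bytes.
DOCTOR_EVIL = bytes.fromhex(
    "A6 7C FD 1B 45 82 90 1D 6F 3C 8A 0F 96 18 A4 C3 4F FF 0F 1D"
)


def find_signature(data: bytes, signature: bytes) -> int:
    """Return the offset of the first match, or -1 if the signature is absent.

    Hypothetical helper for illustration only.
    """
    return data.find(signature)


# Made-up sample data: a "clean" buffer and a copy with the signature
# spliced in at offset 100.
clean = bytes(range(256)) * 4
infected = clean[:100] + DOCTOR_EVIL + clean[100:]

if __name__ == "__main__":
    print("clean:", find_signature(clean, DOCTOR_EVIL))
    print("infected:", find_signature(infected, DOCTOR_EVIL))
```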
One question you are probably asking is: how is a signature chosen for a given virus?
The answer is not simple. It depends mainly on the type of virus. For instance, if the virus is a simple EXE file infector, we just need to look for a sequence of bytes (like the one shown above) within the binary code of the virus. We must select a signature that is long enough to generate as few false positives as possible. For instance, choosing the following signature:
A3 B7 11 00
is probably not a good idea, because the signature is too short. Such a short sequence of bits is likely to be present in other executable programs that are not actually infected. That is why the signature should be considerably long (more than 50 bytes). The additional problem is which signature to choose, because for an arbitrary virus we could find plenty of potential signatures. Nevertheless, the longest is not always the best, at least not in the case of signatures!
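A back-of-the-envelope calculation shows why length matters. The sketch below assumes, unrealistically, that the scanned bytes are uniformly random and independent; real executables are far from random, so treat the numbers only as an order-of-magnitude illustration. The function name is a hypothetical one chosen for this example.

```python
def expected_chance_matches(scanned_bytes: int, signature_length: int) -> float:
    """Expected number of accidental matches of one fixed signature.

    Assumes every byte is uniformly random and independent, which is a
    simplification: a signature of n bytes then matches at any given
    position with probability 1 / 256**n.
    """
    positions = max(scanned_bytes - signature_length + 1, 0)
    return positions / 256 ** signature_length


if __name__ == "__main__":
    hundred_gb = 10 ** 11
    for length in (4, 20, 50):
        print(length, expected_chance_matches(hundred_gb, length))
```

Under this assumption, a 4-byte signature is expected to turn up by pure chance about 23 times in 100 GB of data, while for 20 bytes or more the expected count is vanishingly small.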
People at IBM invented an excellent technique based on Markov models. I spent several hours studying their article, which is neither extremely complex nor simple to understand. After that, I created a trigram generator and an automatic signature extractor in C#. For a given virus, this tool can automatically extract the signature with the lowest likelihood of false positives. Using a virtual machine and the tool I developed, I could extract signatures for thousands of viruses within a few hours. I was delighted to see hundreds of wicked programs working hard to infect my virtual machine. All the infected files were isolated and then analyzed by the tool in order to extract valid signatures. Finally, the tool stored all the signatures in a MySQL database.
I will describe the tool in more detail in a forthcoming article. I strongly recommend that you read the excellent article from IBM to get started.
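To give a rough idea of what trigram-based scoring can look like, here is a minimal Python sketch. It is neither the C# tool nor IBM's actual algorithm: it simply counts byte trigrams in a corpus of clean programs, treats the trigrams of a candidate window as independent (a crude simplification of a Markov model), and picks the window of the virus that looks least likely to occur in clean code. All function names and the add-one smoothing are my own assumptions.

```python
import math
from collections import Counter


def count_trigrams(samples):
    """Count every 3-byte sequence in a corpus of clean programs."""
    counts = Counter()
    for data in samples:
        for i in range(len(data) - 2):
            counts[data[i:i + 3]] += 1
    return counts


def log_probability(window, counts, total):
    """Sum of smoothed log-probabilities of the trigrams in the window.

    Add-one smoothing over the 256**3 possible trigrams keeps unseen
    trigrams from producing log(0).
    """
    vocabulary = 256 ** 3
    return sum(
        math.log((counts[window[i:i + 3]] + 1) / (total + vocabulary))
        for i in range(len(window) - 2)
    )


def pick_signature(virus, counts, length):
    """Return (offset, bytes) of the window least likely to appear in clean code."""
    total = sum(counts.values())
    best_offset = min(
        range(len(virus) - length + 1),
        key=lambda off: log_probability(virus[off:off + length], counts, total),
    )
    return best_offset, virus[best_offset:best_offset + length]
```

The lower the estimated probability of a window in clean code, the less likely it is to cause a false positive, which is why the sketch minimizes rather than maximizes the score.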
It is relatively easy to detect the presence of a simple infector within an infected file. We only need to analyze certain areas of the file for known signatures. However, things get more complicated when the virus changes its form on each infection (polymorphism), or when it encrypts or compresses itself on each infection. The task gets even harder when these mechanisms are combined several times, even recursively. In these cases, the signatures must be carefully extracted from the clean (uncompressed, decrypted, etc.) image of the evil program.
To detect complex viruses of this type, the technique used is known as generic emulation. This technique (among others) was patented by Symantec. Carey Nachenberg, a chief architect in Symantec’s antivirus labs, is known as its primary inventor.
The idea is simple and efficient: in order to scan a program, its execution is emulated for a number C of instructions. All memory pages altered by the emulated instructions are then analyzed. This makes sense, since those instructions could be part of a decryption or decompression routine that is reconstructing the original virus, and it is precisely there that we must search for known signatures.
Thus, contrary to what many believe, signatures are still being used to detect these complex threats. Emulation simply gives the virus time to reconstruct itself in memory.
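The following Python sketch illustrates the principle on a toy machine. Everything in it is an assumption made for illustration: the five-instruction set, the 256-byte page size, the XOR "decryptor" and all names are invented, and a real emulator interprets actual x86 code. The point is only to show the three steps: run for at most C instructions, remember which pages were written, and search those pages for signatures.

```python
PAGE_SIZE = 256  # assumed page size for the toy machine

SIGNATURES = {
    "Doctor Evil": bytes.fromhex(
        "A6 7C FD 1B 45 82 90 1D 6F 3C 8A 0F 96 18 A4 C3 4F FF 0F 1D"
    ),
}


def emulate(program, memory, max_instructions):
    """Run the toy program for at most max_instructions steps.

    Returns the set of page numbers that were written to.
    """
    regs, dirty_pages = {}, set()
    pc = executed = 0
    while pc < len(program) and executed < max_instructions:
        op, *args = program[pc]
        executed += 1
        pc += 1
        if op == "set":            # set reg, value
            regs[args[0]] = args[1]
        elif op == "xor_mem":      # memory[reg] ^= key
            addr = regs[args[0]]
            memory[addr] ^= args[1]
            dirty_pages.add(addr // PAGE_SIZE)
        elif op == "inc":          # reg += 1
            regs[args[0]] += 1
        elif op == "jlt":          # jump to target if reg < limit
            if regs[args[0]] < args[1]:
                pc = args[2]
        elif op == "halt":
            break
    return dirty_pages


def scan_dirty_pages(memory, dirty_pages, signatures):
    """Search only the altered pages (plus a small overlap) for signatures."""
    overlap = max(len(sig) for sig in signatures.values()) - 1
    hits = set()
    for page in sorted(dirty_pages):
        start = page * PAGE_SIZE
        chunk = bytes(memory[start:start + PAGE_SIZE + overlap])
        hits.update(name for name, sig in signatures.items() if sig in chunk)
    return hits


def make_demo():
    """Build a toy 'encrypted virus': body XORed with 0x5A plus its decryptor."""
    memory = bytearray(1024)
    body = SIGNATURES["Doctor Evil"]
    memory[300:300 + len(body)] = bytes(b ^ 0x5A for b in body)
    program = [
        ("set", "i", 300),
        ("xor_mem", "i", 0x5A),
        ("inc", "i"),
        ("jlt", "i", 300 + len(body), 1),
        ("halt",),
    ]
    return program, memory


if __name__ == "__main__":
    program, memory = make_demo()
    dirty = emulate(program, memory, max_instructions=1000)
    print(dirty, scan_dirty_pages(memory, dirty, SIGNATURES))
```

Note that the instruction budget matters: if C is too small, the decryptor has not finished and the signature is not yet visible in memory.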
At this point, you may be wondering how antivirus products scan a file so fast even when they have to search for thousands of signatures. There are several answers, and you will find most of them in Symantec’s patents. For instance, Norton Antivirus uses only signatures that begin with one of a small subset of all the possible bytes. This trick allows a super-fast search, because knowing the possible prefixes makes it possible to cut the search space considerably. The bytes are selected according to their frequency of use in 80x86 machine code. Besides, not all files are actually emulated. More information can be found here.
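The prefix trick can be sketched in a few lines of Python. This is not Norton's implementation, only my own illustration of the idea: signatures are indexed by their first byte, so the scanner can skip every file position whose byte does not start any signature. The function names and the sample signatures are assumptions.

```python
from collections import defaultdict


def build_index(signatures):
    """Group (non-empty) signatures by their first byte."""
    index = defaultdict(list)
    for name, sig in signatures.items():
        index[sig[0]].append((name, sig))
    return index


def scan(data, index):
    """Return (offset, name) for every signature match in data.

    Positions whose byte is not a known first byte are skipped at once;
    a full comparison happens only for the few candidates that remain.
    """
    hits = []
    for offset, byte in enumerate(data):
        candidates = index.get(byte)
        if not candidates:
            continue
        for name, sig in candidates:
            if data.startswith(sig, offset):
                hits.append((offset, name))
    return hits


if __name__ == "__main__":
    # Made-up signatures; two of them share the first byte 0xA6.
    index = build_index({
        "A": b"\xA6\x7C\xFD",
        "B": b"\xA6\x11",
        "C": b"\x4F\xFF",
    })
    print(scan(b"\x00\xA6\x7C\xFD\x00\x4F\xFF\xA6\x11", index))
```

The fewer distinct first bytes the signature set uses, and the rarer those bytes are in ordinary machine code, the more positions the scanner can skip without any comparison.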
Reposted from: http://www.agusblog.com/wordpress/what-is-a-virus-signature-are-they-still-used-3.htm