Wednesday, April 15, 2009

What is a virus signature and how it is used

Signatures
A virus signature is any series of bits that can be used to accurately identify the presence of a particular virus in a given file or range of memory.

Once we get a section of a virus, the type of the virus (worm, rootkit, simple infector, etc.) should be determined. Only after that step, a signature can be extracted from the binary code. In many cases (e.g. EXE infectors, COM infectors, polymorphic viruses.) this will be possible and enough to notice the virus in the future. However, in recent viruses which are much more complex (e.g. metamorphic viruses) other techniques are required.

Despite all this, and although many believe that signatures were used only in antivirus software of the 80’s, 90’s, and that they are no longer used, this is totally untrue. The truth is that signatures still play a fundamental role in the various virus detection algorithms used by current antivirus products. Let’s see a typical example 0f a signature. Suppose the following sequence of bits (in hexadecimal) corresponds to a signature for a virus called Doctor Evil:

A6 7C FD 1B 45 82 90 1D 7F 3C 8A OF 96 18 A4 D3 5F FF 0F 1D

One question that you’re probably doing is: How is a signature chosen for a given virus?
The answer is not simple. It depends mainly on the type of virus. For instance, if the virus is a simple EXE file infector, we just need to look for a sequence of bytes (as the one shown above) within the binary code of the virus. We must select a signature which is long enough to generate as fewest false positives as possible. For instance, choosing the following signature:

A4 B7 11 01

is probably not a good idea. This is due to the short length of the signature. Such a short sequence of bits is likely to be present in other executable programs that are actually not infected. That is why the length should be considerably long (more than 50 bytes). The additional problem is what signature to choose, because for an arbitrary virus we could find plenty of potential signatures. Nevertheless, not always the longest is the best… at least not in the case of signatures…!
People at IBM invented an excellent technique based on Markov models. I studied for several hours the contents of their article which is neither something extremely complex to understand, nor something simple. After that, I created a trigram generator and an automatic signature extractor in C#. For a given virus, this tool can automatically extract the signature with less likelihood of false positives. I could extract signatures for thousands 0f viruses within a few hours by using a virtual machine and the tool I developed. I was delighted to see hundreds 0f wicked programs working hard to contaminate my virtual machine. All the infected files were isolated and then analyzed by the tool in order to extract valid signatures. Finally, the tool stored all the signatures in a MySQL database.

I will describe the tool with more detail in a forthcoming article. I strongly recommend you to read the excellent article from IBM to get started.

Generic Emulation
It is relatively easy to detect the presence of a simple infector within an infected file. We only need to analyze certain areas of the file for known signatures. Even so, things get more complicated when the virus changes its form on each infection (polymorphism), or if it encrypts/compresses itself on each infection. The task gets even harder when these mechanisms are combined several times, even recursively. In these cases, the signatures must be carefully extracted from the clean (uncompressed/decrypted, etc.) image of the evil program.
To detect this type of complex viruses, the technique used is known as generic emulation. This technique (among others) was patented by the firm Symantec. Carey Nachenberg is known as the primary inventor and a chief architect in Symantec’s antivirus labs.

The idea is simple and efficient: in order to scan a program, its execution is emulated during a quantity of C instructions. All memory pages altered by instructions involved in the emulation process are analyzed. This has sense, since those instructions could be part 0f a decryption/decompression routine, etc., which is reconstructing the original virus and is precisely there, where we must search for known signatures.

Thus, unlike what many believe, signatures are still being used to detect these complex threats. The special support from emulation gives time for the virus to reconstruct itself in memory.

Optimizations
At this point, you may be wondering how antivirus products scan a file so fast even when they have to search for thousands of signatures. There are several answers and you will find the majority of them on Symantec patents. For instance, Norton Antivirus uses signatures beginning only with a subset of all the possible bytes. This trick allows a super-fast search because knowing the possible prefixes it is possible to cut the search space considerably. The bytes are selected according to their frequency of use in 80×86 machine code. Besides, not all files are actually emulated.

Related Search
antivirus technical support
top 10 virus removal
virus removal software
Password-Manipulating Virus Spreading
How To Secure Yourself Against Conficker Worm

No comments:

Post a Comment