Today, HackInTheBox published a Yankee Group webcast How to Detect and Remove Malicious Software Without Signatures or Scanning". Anyway, it happened to catch my eye, so I (despite my better judgement) registered with the webcast sponsor (Sana) and watched the broadcast in its entirety. And it turns out that Yankee was right on target about the future of malware scanning - although there's more to the story than they go into.
Yankee's point seems to be that malware scanning can't continue to rely on signature-based techniques because of the fact that zero-day vulnerabilities are on the increase and that signatures can't keep up. According to their analysis, zero-day threats will continue to get more and more prevalent until signature-based scanning isn't feasible as a countermeasure. That's certainly possible, although they don't really give much in the way of hard numbers or empirical evidence to back this up. All in all, it's likely by not certain. They then go on to describe that the rate of threats is increasing and that, looking forward, one can project a time where the signature volume will be voluminous to the point of being unmanageable. They speculate that the future will bring about a time when signatures just can't be done... because of delays in publication of signatures, vendors just can't keep up.
So they're right - partially. But it's not speculation; the death of signature-based scanning is predicted precisely by well-understood laws of computer science; we can tell EXACTLY when it will happen and why. I've made this point before, so if you've heard it, sorry for the repetition. However, I think it's an important point, so allow me to state it again. So let's all put on our uber-dork hats and take a trip down memory lane to our "Algorithms" class... It'll take a while to get there, so bear with me.
So, computer scientists analyze performance of a given search algorithm by representing the performance mathematically as a function of total operations required to complete the search. For example, if you were going to search the phone book linearly for a particular person's last name, you would start at the beginning of the phone book and look at each name until you found the one you want (maybe there's a mis-print and they put the entry you're looking for somewhere that's non-alphabetic.) This is an example of a "worst case n" search - the total time it takes to do the search would take - in the worst case - the number of items in the list multiplied by the constant amount of time that it takes you to complete an individual examination. To say that in an extra-fancy way, you might say that the search time is O(n) [big-Oh n] - "big O" technically means "asymptotic upper bound", but that's just a fancy way of saying "worst case".
If you were going to search the phone book to find all the entries of a particular item (all the people with the last name "Smith", for example), it changes the performance equation since now you have to examine all the entries in teh book to find all the occurances. In that case, the performance is an "asymptotically tight bound" with the number of entires - that's a fancy way of saying "exactly": you have to search exactly the number of entries - no more, no less. Formally, that'd be Θ(n). Doesn't it seem like checking for a virus signature against every file on your hard drive is like looking for a given name in the phone book? Algorithmically, it's the same thing: a Θ(n) search.
Now, say you were looking in the phone book for all the people that had either the last name "Smith" or the last name "Jones". This time your search is more complicated because you have to check each name twice - you have to check it once to see if it's "Jones" and you have to check it once to see if it's "Smith" - in effect, you're doing the same search twice. In this search, it's not Θ(n) but insead it's Θ(2n). If you have more names, you have to do the search once per item per name. Using q as an arbitrary letter to represent the number of entries, it'd be something like Θ(qn). Algorithmically, that's the same as current malware-scanning products. Really, it is - double-check me if you don't believe it.
So, the reason that signatures are dead (and right soon) is that both the q and the n values for malware are increasing exponentially over time - everybody's research seems to agree on this; the number of malware signatures is increasing exponentially and so is the number of files on a given disk. Since the seach time is a product of the two, performance will appear to be acceptable for a given period of time (the flat part of the exponent curve) and then will go from "zero" to "nigh on impossible" in a heartbeat - one day your AV scanner seems "a bit slow" and a week later it takes the age of the universe to complete a scan - at least until the search parameters are changed. That's the way exponential curves work. Now, the reality is a bit more complicated since there are things you can do to try to "cheat" the curve (not search every file or not look for every signature) - in fact, some vendors have already started "cheating" to combat the exponentially increasing scan times. But cheating won't work long-term: since the numbers are exponential, it only shaves a bit off the curve.
So can you beat the curve? Not with signatures you can't. As long as search times stay constant (or increase arithmetically according to Moore's law), the fact that the two relevant search variables are exponential means the handwriting is on the wall. Or you could listen to Yankee and use Sana.
Posted by Ed at July 18, 2006 09:57 PM | TrackBack