I read an article (http://www.omgili.com/captcha.php), linked from Slashdot, that illustrates an alternative way of generating Captcha to prevent spam bot attacks.

The idea in the original article is to use HTML elements (e.g. tables, colored borders, etc.) to form the actual Captcha (authentication image). So on the webpage there are no actual images (e.g. a JPEG or GIF image, or a reference to an HTTP handler that generates the image).

There is no doubt this approach makes it more difficult for attackers to decipher what is encoded in the HTML page. However, I would argue that the effort (e.g. time) needed to decode such an HTML Encoded Captcha (HEC) is not significantly greater than for a traditional image-based Captcha. HEC also has some significant performance problems, which I will discuss shortly.

All we really need to do to decipher the HEC is to first identify the block of the webpage where the HEC is encoded (e.g. the <table></table> element). This is not very difficult, because the table used to create an HEC is distinctive enough: it is large, carries no real text content, and contains lots of color attributes. Once the HEC block is identified, we can use an HTML parser to help render the block into a temporary bitmap of the Captcha. The principles of pattern recognition can then be applied to this bitmap, just as they could be to a traditional Captcha image.
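To make the decoding step concrete, here is a minimal sketch in Python using only the standard library's html.parser. I do not know the exact markup the original article generates, so this assumes one <td> per pixel, colored through a bgcolor attribute, with #ffffff as the background; the names TableBitmapParser, html_to_bitmap, and render are made up for illustration.

```python
from html.parser import HTMLParser


class TableBitmapParser(HTMLParser):
    """Collects the bgcolor of every <td> into a grid, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            # Assumption: the pixel color lives in a bgcolor attribute.
            color = dict(attrs).get("bgcolor") or ""
            self.rows[-1].append(color.lower())


def html_to_bitmap(html, background="#ffffff"):
    """Turn the table cells into a 0/1 bitmap (1 = non-background pixel)."""
    parser = TableBitmapParser()
    parser.feed(html)
    return [
        [0 if color in ("", background) else 1 for color in row]
        for row in parser.rows
    ]


def render(bitmap):
    """Draw the bitmap as text so the recovered shape can be inspected."""
    return "\n".join("".join("#" if p else "." for p in row) for row in bitmap)


# Hypothetical 3x3 HEC block whose dark cells form the letter "T".
SAMPLE = (
    "<table>"
    "<tr><td bgcolor='#000000'></td><td bgcolor='#000000'></td><td bgcolor='#000000'></td></tr>"
    "<tr><td bgcolor='#FFFFFF'></td><td bgcolor='#000000'></td><td bgcolor='#FFFFFF'></td></tr>"
    "<tr><td bgcolor='#FFFFFF'></td><td bgcolor='#000000'></td><td bgcolor='#FFFFFF'></td></tr>"
    "</table>"
)

bitmap = html_to_bitmap(SAMPLE)
print(render(bitmap))
```

The resulting grid is an ordinary bitmap, so whatever recognition technique works on image-based Captcha can take over from here. A real page would also need the first step described above, picking the right table out of the page, and might color cells through CSS or borders instead of bgcolor.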

So the HEC approach cannot prevent sophisticated bots from harvesting the Captcha information.

Further, typical Captcha images are generated using compression schemes like JPEG, so they are usually small (e.g. a few kilobytes). HTML tags are not an efficient way to encode image data, so a small image (like the one the author illustrated) can easily take tens of kilobytes, and a reasonably sized Captcha with a reasonable amount of sophistication could easily reach a hundred kilobytes as HEC. Generating an HTML Encoded Captcha is also arguably more costly than generating a traditional Captcha image, because another level of indirection (converting the image data to HTML) is involved. Because of the larger Captcha, the larger resulting webpage, and the higher server utilization during Captcha generation, the process is more likely than an image-based Captcha to become the target of Denial of Service attacks.
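A rough back-of-the-envelope check of the size argument, again as a sketch under assumptions: one <td bgcolor="..."> cell per pixel (a hypothetical encoding, not necessarily the article's), a 100 x 30 two-color image, and a plain 1-bit-per-pixel packing standing in for the image file. A real JPEG or GIF would have a different size, but the order of magnitude of the HTML is what matters here.

```python
def bitmap_to_html(bitmap):
    """Encode a 0/1 bitmap as an HTML table, one colored cell per pixel."""
    rows = []
    for row in bitmap:
        cells = "".join(
            '<td bgcolor="%s"></td>' % ("#000000" if p else "#ffffff")
            for p in row
        )
        rows.append("<tr>" + cells + "</tr>")
    return "<table>" + "".join(rows) + "</table>"


def packed_bits(bitmap):
    """Pack pixels 8 per byte (the last byte may be only partially filled)."""
    flat = [p for row in bitmap for p in row]
    out = bytearray()
    for i in range(0, len(flat), 8):
        byte = 0
        for bit in flat[i:i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)


# Hypothetical 100 x 30 checkerboard standing in for a Captcha image.
WIDTH, HEIGHT = 100, 30
bitmap = [[(x + y) % 2 for x in range(WIDTH)] for y in range(HEIGHT)]

html_size = len(bitmap_to_html(bitmap).encode("ascii"))
raw_size = len(packed_bits(bitmap))

print("HTML table: %d bytes" % html_size)
print("1-bit raw:  %d bytes" % raw_size)
print("ratio:      %.0fx" % (html_size / raw_size))
```

Each pixel costs a full tag in the HTML version, so the markup lands in the tens of kilobytes for an image this small, and the server has to build that string on every request on top of drawing the Captcha itself.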
