Welcome to the Hardenize blog. This is where we will document our journey as we make the Internet a more secure place and have some fun and excitement along the way.
Update (3 April 2019): We've just launched our research project that implements automated phishing detection based on the techniques described here. We call it Hardenize Labs: Confusables.
As you're probably already aware, phishing is a type of social engineering attack designed to lure users to web pages that look like legitimate web sites, but are actually phony setups designed to obtain their sensitive information. Certificate Transparency, which added auditability to the PKI ecosystem, is the latest tool in the fight to detect phishing sites reliably and quickly. The premise is simple: we monitor all public certificates and analyze every newly discovered hostname for signs of illegitimate intent and other indications of deception. In this blog post we're going to share some of the insights we obtained as we started to incorporate phishing site detection into Hardenize.
Today, public certificates are recorded to a global network of append-only logs specifically designed to make the certificates available to the public for inspection. This system is known as Certificate Transparency (CT). It was created in order to make the CA system auditable, which was deemed necessary after a large number of incidents where CAs were found to have generated certificates for suspect purposes. Each logged certificate contains cryptographic signatures that enable browsers and other clients to verify that the logging has taken place.
One of the side effects of this system is that every hostname you generate a certificate for becomes public knowledge within 24 hours (that's the maximum allowed delay between when a certificate is registered and when it has to be made available to the public), and in practice much sooner. This has implications for good actors and bad actors equally.
At Hardenize, we put the information recorded to CT logs to good use. Working on behalf of our customers, we detect new certificates and hostnames as they appear in the endless stream of CT data. In other words, we provide instant visibility of new infrastructure. For example, you might have example.com in your inventory. If a certificate for a hostname under that domain appears in the CT logs, we can connect that certificate with your organisation. Newly discovered certificates and hostnames are automatically added to the inventory for monitoring.
One of the things we intend to use this source of data for is discovering phishing attacks against your users or employees before they start. People who wish to attack your users will register domain names that can be confused with one of your real web sites. They will then host phishing web sites on those domains. There are many different ways of doing this. For example, if we owned example.com, somebody could attack our users by setting up a web site at one of the following:
https://www.examp1e.com: Using the digit 1 instead of the letter l.
A hostname in which the letter a is replaced by an alternative Unicode character, 𝖺, which looks very similar, but isn't the same. Such a hostname also has a punycode-encoded version.
A hostname in which the letter m is replaced by the rn combination, which can sometimes be very effective with the right font and small letter size.
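The substitutions described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SUBSTITUTIONS` table and the helper names are hypothetical, and the Cyrillic letter а (U+0430) stands in as the lookalike for the letter a. Python's built-in `punycode` codec performs only the raw encoding and does not apply the full IDNA validation rules.

```python
# Hypothetical substitution table with a few well-known swaps.
# It is an assumption for illustration, not a list taken from the post.
SUBSTITUTIONS = {
    "l": ["1"],       # digit one in place of lowercase L
    "m": ["rn"],      # "rn" set tightly can read as "m"
    "a": ["\u0430"],  # Cyrillic small a (U+0430), assumed lookalike
}


def lookalike_labels(label):
    """Return variants of `label` with exactly one substitution applied."""
    variants = set()
    for i, ch in enumerate(label):
        for repl in SUBSTITUTIONS.get(ch, []):
            variants.add(label[:i] + repl + label[i + 1:])
    return variants


def to_ascii(label):
    """Return the ASCII form of a label, adding the xn-- prefix when needed."""
    if label.isascii():
        return label
    # Raw punycode encoding only; no IDNA validation is performed here.
    return "xn--" + label.encode("punycode").decode("ascii")


variants = lookalike_labels("example")
for v in sorted(variants):
    print(to_ascii(v) + ".com")
```

Each variant differs from the original label by a single substitution. A real generator would also combine substitutions and draw on a much larger table.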
Homoglyph attacks (sometimes also called homograph attacks) are where it all gets particularly interesting, because there are many characters that look alike, especially in the Unicode character set. There is no definitive list of mappings from one character to all other characters that look similar. For illustration purposes, consider the following characters that all look like the ASCII letter A but are in fact different characters:
A Α А Ꭺ ᗅ ᴀ ꓮ Ａ 𐊠 𝐀 𝐴 𝑨 𝒜 𝓐 𝔄 𝔸 𝕬 𝖠 𝗔 𝘈 𝘼 𝙰 𝚨 𝛢 𝜜 𝝖 𝞐 À Á Â Ã Ä Å
Are there other characters we could use? Maybe. Probably. And perhaps if we stick two characters together, they might look like an A too? The Unicode Consortium does publish a list of so-called confusables. It's by no means a complete set, but it's a good starting point.
We have already experimented with searching for various keywords in our database of hostnames where at least one of the characters used is a homoglyph rather than the original character from the keyword, and it has already turned up some interesting results. Our data set consists of about 300 million unique hostnames that we have extracted from over 1 billion CT log entries.
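A search of this kind could be sketched as follows. This is a minimal illustration, not the implementation used for the experiment. The `CONFUSABLES` map is a tiny hand-picked sample (the published confusables list is far larger), and the function names are hypothetical. The idea is to fold every character to a plain ASCII "skeleton" and report hostnames where the keyword appears only after folding.

```python
import unicodedata

# Tiny hand-picked sample of lookalike characters (an assumption for
# illustration); the Unicode confusables data is much more extensive.
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u03b1": "a",  # Greek small alpha
    "1": "l",       # digit one
    "0": "o",       # digit zero
}


def decode_label(label):
    """Decode a punycode (xn--) label to Unicode; return others unchanged."""
    if label.lower().startswith("xn--"):
        try:
            return label[4:].encode("ascii").decode("punycode")
        except UnicodeError:
            return label
    return label


def skeleton(text):
    """Fold compatibility forms (NFKC), lowercase, then map known lookalikes."""
    folded = unicodedata.normalize("NFKC", text).lower()
    return "".join(CONFUSABLES.get(ch, ch) for ch in folded)


def is_homoglyph_match(hostname, keyword):
    """True if the keyword appears only after folding lookalike characters."""
    host = ".".join(decode_label(part) for part in hostname.lower().split("."))
    return skeleton(keyword) in skeleton(host) and keyword not in host
```

With these definitions, `is_homoglyph_match("www.examp1e.com", "example")` is true, while the genuine `www.example.com` is not reported. Multi-character lookalikes such as rn for m are not covered by this single-character map, and a hostname that contains both the genuine keyword and a lookalike copy would be missed.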
For example, when searching for hostnames containing the keyword
There were nearly 300 when doing the same for the word
Our initial research indicates that large datasets obtained via CT can be put to good use, and most certainly can assist with detection of phishing web sites. We're planning to incorporate continuous monitoring of this type into our product.
With some creativity to match that of
the attackers, we believe that we can detect most suspicious hostnames. Where the task gets
more difficult is dealing with false positives, which are legitimate sites that happen to match. This is
challenging when dealing with large organisations or noisy keywords. We discovered this
already when looking for suspicious domain names such as
sage, which resulted in tens and even hundreds of thousands of hits.
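To see why a short keyword such as sage is noisy, compare plain substring matching with matching on whole tokens. The sketch below uses made-up hostnames and hypothetical helper names. Splitting on dots and hyphens is just one possible way to cut the noise, not a description of how Hardenize handles it.

```python
import re


def substring_hits(hostnames, keyword):
    """Hostnames that contain the keyword anywhere, even inside other words."""
    return [h for h in hostnames if keyword in h]


def token_hits(hostnames, keyword):
    """Hostnames where the keyword is a whole token between dots or hyphens."""
    return [h for h in hostnames if keyword in re.split(r"[.-]", h)]


# Made-up hostnames for illustration only.
hostnames = [
    "message.example.net",
    "usage-stats.example.org",
    "sage-login.example.com",
    "massage.example.info",
]

print(substring_hits(hostnames, "sage"))  # all four hostnames match
print(token_hits(hostnames, "sage"))      # only sage-login.example.com matches
```

Token matching trades recall for precision: it would miss a hostname such as sagelogin.example.com, so in practice it would likely be one signal among several.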