22 Jan 2019

Detecting Phishing Sites Using Certificate Transparency Monitoring

by Mike Cardwell

Update (3 April 2019): We've just launched our research project that implements automated phishing detection based on the techniques described here. We call it Hardenize Labs: Confusables.

As you're probably already aware, phishing is a type of social engineering attack that lures users to web pages that look like legitimate web sites but are actually fakes designed to obtain their sensitive information. Certificate Transparency, which added auditability to the PKI ecosystem, is the latest tool in the fight to detect phishing sites reliably and quickly. The premise is simple: we monitor all public certificates and analyze every newly discovered hostname for signs of illegitimate intent and other indications of deception. In this blog post we're going to share some of the insights we obtained as we started to incorporate phishing site detection into Hardenize.

Today, public certificates are recorded in a global network of append-only logs specifically designed to make certificates available to the public for inspection. This system is known as Certificate Transparency (CT). It was created to make the CA system auditable, which was deemed necessary after a large number of incidents in which CAs were found to have issued certificates for suspect purposes. Each logged certificate contains cryptographic signatures that enable browsers and other clients to verify that the logging has taken place.

One side effect of this system is that every hostname you generate a certificate for becomes public knowledge within 24 hours (the maximum delay allowed between a certificate being registered with a log and being made available to the public), and in practice much sooner. This has implications for good actors and bad actors equally.

How We Use CT for Discovery and Monitoring

At Hardenize, we put the information recorded to CT logs to good use. Working on behalf of our customers, we detect new certificates and hostnames as they appear in the endless stream of CT data. In other words, we provide instant visibility of new infrastructure. For example, you may have example.com in your inventory. If a certificate for chat-server.example.com appears in the CT logs, we can connect that certificate with your organisation. Newly discovered matching certificates and hostnames are automatically added to the inventory for monitoring and alerting.
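To make the matching step concrete, here is a minimal Python sketch of connecting a newly seen hostname to a domain in an inventory. It is an illustration only and not a description of the Hardenize implementation; the names `normalize` and `match_inventory` are hypothetical, and the inventory is assumed to be a simple set of domain names.

```python
from typing import Optional, Set


def normalize(hostname: str) -> str:
    """Lower-case a hostname and drop a trailing dot and wildcard prefix."""
    hostname = hostname.strip().lower().rstrip(".")
    if hostname.startswith("*."):
        hostname = hostname[2:]
    return hostname


def match_inventory(hostname: str, inventory: Set[str]) -> Optional[str]:
    """Return the inventory domain that `hostname` belongs to, or None.

    Candidates are built from whole labels only, so "www-example.com"
    never matches "example.com".
    """
    labels = normalize(hostname).split(".")
    # Try the full name first, then each shorter parent domain.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in inventory:
            return candidate
    return None


inventory = {"example.com"}
for name in [
    "chat-server.example.com",
    "www-example.com",
    "www.example.com.index.html.attackers-domain-name.com",
]:
    print(name, "->", match_inventory(name, inventory))
```

Comparing whole labels rather than raw string suffixes matters here: a plain "ends with example.com" test would wrongly claim www-example.com as part of the inventory.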

Detection of Phishing Sites

One of the things we intend to use this data source for is discovering phishing attacks against your users or employees before they start. People who wish to attack your users will register domain names that can be confused with one of your real web sites and then host phishing web sites on those domains. There are many different ways of doing this. For example, if we owned example.com, somebody could attack our users by setting up a web site at one of the following URLs:

  • Using our domain name as part of their own, most commonly as a suffix:
    • https://www-example.com
    • https://support-desk-example.com
    • https://account-example.com
  • Prefixing their domain name with ours, trying to exploit the fact that long domain names won't fit on small screens (e.g., mobile devices):
    • https://www.example.com.index.html.attackers-domain-name.com
  • Employing character substitution and homoglyph attacks to create domain names that look like the real ones and are sometimes indistinguishable from the real ones, depending on exactly which display fonts are used:
    • https://www.examp1e.com   Using the digit 1 instead of the letter l.
    • https://www.ex𝖺mple.com   The letter a is replaced by an alternative Unicode character, 𝖺, which looks very similar but isn't the same. The Punycode-encoded version of this hostname is www.xn--exmple-qo00e.com.
    • https://www.exarnple.com   The letter m is replaced by the rn combination, which can sometimes be very effective with the right font and small letter size.
  • Typosquatting, which relies on incorrectly spelled names:
    • https://www.exampel.com
    • https://www.examplle.com
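Two of the patterns listed above can be checked with very little code. The following Python sketch is a rough illustration under simplifying assumptions, not a production detector: `embeds_brand`, `osa_distance` and `looks_like_typo` are hypothetical names, the brand label is taken to be the first label of the brand domain, and typosquatting is approximated with the optimal string alignment distance (insertions, deletions, substitutions and adjacent transpositions).

```python
def embeds_brand(hostname: str, brand_domain: str) -> bool:
    """True if the brand name appears in a hostname outside the brand's domain."""
    hostname = hostname.lower().rstrip(".")
    brand_domain = brand_domain.lower()
    if hostname == brand_domain or hostname.endswith("." + brand_domain):
        return False  # genuinely under the brand's own domain
    brand_label = brand_domain.split(".")[0]
    return brand_label in hostname


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between two strings."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "le" <-> "el".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def looks_like_typo(hostname: str, brand_domain: str, max_distance: int = 1) -> bool:
    """True if any label is close to, but not equal to, the brand label."""
    brand_label = brand_domain.lower().split(".")[0]
    for label in hostname.lower().rstrip(".").split("."):
        if 0 < osa_distance(label, brand_label) <= max_distance:
            return True
    return False


for host in ["www-example.com", "www.exampel.com", "www.examplle.com", "www.example.com"]:
    print(host, embeds_brand(host, "example.com"), looks_like_typo(host, "example.com"))
```

A small distance threshold keeps the check cheap, but short brand names will produce many accidental near matches, so the threshold would need tuning per keyword.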

Homoglyphs and Confusables

Homoglyph attacks (sometimes also called homograph attacks) are where it all gets particularly interesting, because there are many characters that look alike, especially in the Unicode character set. There is no definitive list of mappings from one character to all other characters that look similar. For illustration purposes, consider the following characters that all look like the ASCII letter A, but are in fact different characters:

A Α А Ꭺ ᗅ ᴀ ꓮ A 𐊠 𝐀 𝐴 𝑨 𝒜 𝓐 𝔄 𝔸 𝕬 𝖠 𝗔 𝘈 𝘼 𝙰 𝚨 𝛢 𝜜 𝝖 𝞐 À Á Â Ã Ä Å

Are there other characters we could use? Maybe. Probably. And perhaps if we stick two characters together, they might look like an A too? The Unicode Consortium does publish a list of so-called confusables. It's by no means a complete set, but it's a good starting point.
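As a rough illustration of how such a list can be used, the Python sketch below parses a few entries written in a "source ; target ; type # comment" layout, which is our understanding of how the published confusables data is organised, and then reduces a string to a "skeleton" by replacing each character with its look-alike target. The three sample entries and the names `parse_confusables` and `skeleton` are assumptions made for this example; a real deployment would load the full published file.

```python
import unicodedata

# Illustrative entries only; the field layout is an assumption of this sketch.
SAMPLE = """\
0430 ; 0061 ; MA  # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
0391 ; 0041 ; MA  # GREEK CAPITAL LETTER ALPHA -> LATIN CAPITAL LETTER A
0410 ; 0041 ; MA  # CYRILLIC CAPITAL LETTER A -> LATIN CAPITAL LETTER A
"""


def parse_confusables(text):
    """Build a {source character: target string} table from the text."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue
        source = chr(int(fields[0], 16))
        # The target may be a sequence of several code points.
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        table[source] = target
    return table


def skeleton(text, table):
    """Decompose, map each character to its target, then decompose again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(table.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


TABLE = parse_confusables(SAMPLE)
for candidate in ["\u0391", "\u0410", "\u0430pple"]:
    print(repr(candidate), "->", skeleton(candidate, TABLE))
```

Two strings that produce the same skeleton are candidates for visual confusion. Accented letters such as À decompose into a base letter plus a combining mark, so the mark survives in the skeleton unless it is stripped separately.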

Initial Results and Analysis

We have already experimented with searching our database of hostnames for various keywords, matching only hostnames in which at least one character of the keyword is replaced by a homoglyph, and this has turned up some interesting results. Our data set consists of about 300 million unique hostnames that we have extracted from over 1 billion CT log entries.

For example, when searching for hostnames containing the keyword twitter, where at least one of the characters is a homoglyph, we found 65 unique hostnames. At the time of this writing, 16 from the list are still configured in the DNS:

  • abs.tw1tter.com
  • abs.twltters.xyz
  • api.twltters.xyz
  • client.tw1tter.com
  • plctwltter.com
  • premium.tw1tter.com
  • tw1tter.com
  • tw1tterpicasso.com
  • twltter.bid
  • twltter.com.mbdoge.club
  • twltter.gq
  • twltter.live
  • twltter.me
  • twltter.nl
  • twltter.pw
  • twltters.xyz
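A search of this kind can be sketched in a few lines of Python. This is a simplified illustration, not the query we ran: the `FOLD` table is a tiny hand-picked set of look-alike classes, the function names are hypothetical, and only one-to-one character substitutions are handled (so a pair such as rn standing in for m would be missed).

```python
# Hand-picked look-alike classes; each character folds to one representative.
FOLD = {
    "1": "l", "i": "l", "\u01c0": "l",  # digit one, i, dental click -> l
    "0": "o",                           # digit zero -> o
    "\u0251": "a", "\u0430": "a",       # Latin alpha, Cyrillic a -> a
    "\u0261": "g",                      # script g -> g
    "\u0455": "s",                      # Cyrillic dze -> s
}


def fold(text):
    """Replace every character with the representative of its class."""
    return "".join(FOLD.get(ch, ch) for ch in text.lower())


def homoglyph_hits(keyword, hostnames):
    """Return hostnames containing a look-alike of `keyword`.

    A window of the hostname counts as a hit only if it folds to the same
    string as the keyword while differing from the keyword itself, so at
    least one character must have been substituted.
    """
    keyword = keyword.lower()
    target = fold(keyword)
    size = len(keyword)
    hits = []
    for host in hostnames:
        lowered = host.lower()
        for start in range(len(lowered) - size + 1):
            window = lowered[start:start + size]
            if window != keyword and fold(window) == target:
                hits.append(host)
                break
    return hits


names = ["tw1tter.com", "abs.twltters.xyz", "twitter.com", "example.com"]
print(homoglyph_hits("twitter", names))
```

Folding both the keyword and the candidate to a shared representative avoids enumerating every possible variant of the keyword up front.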

The same search for the word instagram returned nearly 300 hostnames. Here are some of the more interesting ones (Unicode on the left, ASCII on the right):

  • instagrɑm.com   (xn--instagrm-1od.com)
  • instaɡram.com   (xn--instaram-3sd.com)
  • inѕtаgrаm.tk   (xn--intgrm-5nfc24a.tk)
  • ǀnstagram.com   (xn--nstagram-kmc.com)
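The ASCII forms in the list above are what actually appears in certificates, so they need to be decoded before any visual comparison. Here is a minimal Python sketch using the standard library's `punycode` codec; the helper names `to_unicode` and `describe_non_ascii` are hypothetical, and the sketch deliberately skips the validation steps that a full IDNA implementation performs.

```python
import unicodedata


def to_unicode(hostname):
    """Decode every "xn--" label of a hostname into its Unicode form."""
    labels = []
    for label in hostname.lower().split("."):
        if label.startswith("xn--"):
            # Strip the ACE prefix and decode the rest as Punycode.
            label = label[4:].encode("ascii").decode("punycode")
        labels.append(label)
    return ".".join(labels)


def describe_non_ascii(hostname):
    """List each non-ASCII character in the decoded hostname with its name."""
    decoded = to_unicode(hostname)
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in decoded if ord(ch) > 127]


for ascii_name in ["xn--instagrm-1od.com", "xn--instaram-3sd.com"]:
    print(ascii_name, "->", to_unicode(ascii_name), describe_non_ascii(ascii_name))
```

Printing the Unicode character names makes the substitution obvious even when the rendered glyph is indistinguishable from the ASCII letter it imitates.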

Future Plans

Our initial research indicates that large datasets obtained via CT can be put to good use and can most certainly assist with the detection of phishing web sites. We're planning to incorporate continuous monitoring of this type into our product.

With some creativity to match that of the attackers, we believe that we can detect most suspicious hostnames. Where the task gets more difficult is dealing with false positives, that is, legitimate sites that happen to match. This is challenging when dealing with large organisations or noisy keywords. We discovered this already when looking for suspicious domain names using keywords such as apple, facebook, and sage, which resulted in tens and even hundreds of thousands of hits.