Wednesday, August 21, 2024

[Rephrasing] Assessing Browser-level Defense against IDN-based Phishing

💡This is a reader's perspective on the paper written by Hang Hu (from the University of Illinois at Urbana-Champaign) and published at the USENIX Security Symposium 2021.

Brief Description

This paper presents a large-scale study of IDNs (Internationalized Domain Names), which are domain names not written entirely in ASCII. Because Unicode covers many scripts, characters from different scripts can have distinct code points yet look nearly identical. The paper studies in depth how browsers handle these IDNs, how users perceive them, and how social media platforms and email providers treat them.
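As a concrete illustration (my own sketch, not from the paper): Python's built-in IDNA codec shows how a domain containing a single Cyrillic "а" (U+0430) encodes to a punycode form that differs from the genuine ASCII domain, even though the two look identical on screen.

```python
# Sketch (not from the paper): how a homograph IDN encodes under IDNA.
# "аpple.com" uses Cyrillic а (U+0430) in place of Latin a (U+0061).
homograph = "\u0430pple.com"  # renders like "apple.com"
genuine = "apple.com"

# Python's built-in "idna" codec (IDNA 2003) converts each label to punycode.
print(homograph.encode("idna"))  # b'xn--pple-43d.com'
print(genuine.encode("idna"))    # b'apple.com'
```

The two strings are visually confusable but resolve to entirely different domains, which is exactly the attack surface the paper measures.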

Observations

While I like the study of mobile browser behavior on IDNs, user perception of phishing domains in mobile browsers is still a research gap (one could reuse the automated infrastructure from this paper to test it, e.g., with LambdaTest).

One thing I am still aiming to understand is the skeleton-rule classification system they created. I wonder whether it involves any image processing behind the scenes (Section 4.1).
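For reference, Unicode Technical Standard #39 defines a "skeleton" transform: map each character through a confusables table, then compare the results; no image processing is required. A minimal sketch (my own toy version, using a tiny hand-picked mapping rather than the full confusables.txt table) could look like:

```python
# Toy sketch of a UTS #39-style skeleton comparison (mine, not the paper's
# code). A real implementation would load the full Unicode confusables.txt
# table and normalize to NFD first.
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic а -> Latin a
    "\u0435": "e",  # Cyrillic е -> Latin e
    "\u043e": "o",  # Cyrillic о -> Latin o
    "\u0440": "p",  # Cyrillic р -> Latin p
    "\u04cf": "l",  # Cyrillic ӏ -> Latin l
}

def skeleton(domain: str) -> str:
    """Map every character to its confusable prototype."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in domain)

def looks_like(candidate: str, target: str) -> bool:
    """Two strings are confusable if their skeletons match."""
    return skeleton(candidate) == skeleton(target)

print(looks_like("\u0430pple.com", "apple.com"))  # True
print(looks_like("example.com", "apple.com"))     # False
```

If the paper's skeleton rule follows this design, it is purely a table-driven character mapping; OCR-style image processing would only come in for the Tesseract experiments mentioned later.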

The paper could also propose a protocol for browsers to follow, which would avoid the inconsistent results across browsers when interpreting these domains.

In Section 5.2 they mention that Chrome fails to enforce the rules it claims to apply. I still wonder whether this was an experimental error, since Google presumably has plenty of automated testing tools to enforce those rules.

The paper mentions some prior studies on phishing, but I did not see an analysis of how these homograph domains are actually used for phishing. Plenty to explore.

Initial Questions

At first, I asked myself how they found 1,855 homograph IDNs in a set of 900,000; it turns out they developed an algorithm to do this automatically. That leads to one of my disagreements regarding the "effectiveness" of Chromium in detecting homograph IDNs. It is not that Chrome is bad; its algorithm is simply different. Google's developers may have decided how close a domain must be to a real one to count as "homographic," in which case the reported false negatives might not really be false negatives. Personally, I found most of the false negatives quite different from the real domains, different enough that a regular user might notice something is wrong.
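The disagreement above comes down to a threshold choice, which a small sketch (mine, with hypothetical example domains, not the paper's code) makes concrete: whether a candidate counts as a "homograph" of a popular target depends on how much similarity the detector demands.

```python
# Toy sketch (mine, not the paper's algorithm): a detector that flags a
# candidate as a homograph when its edit distance to a target domain is
# within a threshold. Moving the threshold moves the false-negative line.

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def is_homograph(candidate: str, target: str, max_dist: int) -> bool:
    return edit_distance(candidate, target) <= max_dist

# The same pair is judged differently under a strict vs. a loose threshold:
print(is_homograph("goog1e.com", "google.com", 0))  # False: strict rule misses it
print(is_homograph("goog1e.com", "google.com", 1))  # True: looser rule flags it
```

Under this framing, a domain Chromium "misses" may simply fall outside its chosen threshold rather than expose a bug, which is the point I am making about the paper's false-negative counts.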

Where do the experiment ideas come from?

At first, I had no idea where the idea of studying IDNs came from, but they may have drawn it from the 2018 eCrime paper "Large scale detection of IDN domain name masquerading." It is a nice inspiration for experiments that further explore this area.

What are the interesting ideas/results?

Nice testing with already existing datasets.

Nice "category" creation for the experiments; it made the results much clearer.

Nice usage of Google Tesseract to perform character recognition.

Verifying network traffic and source code is a nice way to understand browser behavior.

Time-series studies are always amazing.

Nice that they mention the testing plan for the user studies.
