Wednesday, August 14, 2024

[Rephrasing]: Knowledge Expansion and Counterfactual Interaction for Reference-Based Phishing Detection

💡 This is a reader's perspective on the paper written by Ruofan Liu (from National University of Singapore) and published at USENIX Security 2023.

Brief Description

This paper proposes some very interesting tools and techniques to enhance phishing detection, and all of them are open-sourced. The key contributions are a way to incrementally enhance reference-based detection of phishing websites, a way of creating a repository of phishing kits (even though they are arguably weak phishing kits), and a tool that dynamically interacts with phishing pages to infer phishing intentions.
A final, less interesting aspect is the implementation of a Brand Knowledge Expansion module, which is fairly simple in comparison to the rest of the paper.

Observations

The first observation concerns the overly mathematical formulations in the paper. While it can be useful to formalize some explanations clearly, the paper overdoes it in my opinion, introducing mathematical symbols and relations where they are not necessary.

In Section 4.3 they mention that precision is more valuable than recall, but I am not convinced, since a high False Negative rate has severe consequences (e.g., the classifier missing a phishing page).
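To make the trade-off concrete, here is a minimal sketch of how precision and recall are computed from confusion counts. The two classifiers and all the counts below are hypothetical and are not taken from the paper; they only illustrate how a high-precision detector can still miss many phishing pages.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts over 100 real phishing pages (assumption, not paper data).
# "cautious" rarely raises false alarms but misses 40 phishing pages.
cautious = precision_recall(tp=60, fp=1, fn=40)
# "eager" catches 95 phishing pages but flags 20 benign pages as phishing.
eager = precision_recall(tp=95, fp=20, fn=5)

print(f"cautious: precision={cautious[0]:.3f} recall={cautious[1]:.3f}")
print(f"eager:    precision={eager[0]:.3f} recall={eager[1]:.3f}")
```

In this made-up scenario the cautious detector looks better on precision, yet 40 out of 100 phishing pages reach the user, which is the False Negative cost discussed above.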

Another concern I had is with the creation of the phishing dataset. To collect phishing kits they use the Miteru tool, which looks for directory listings and ZIP files hosted on the phishing infrastructure, something I believe happens only in simpler phishing campaigns. A dataset that covers only simple campaigns might not mean much, because those may no longer be the real problem; the more advanced phishing pages are. By the way, understanding how different "Complexities" of phishing pages affect users is still a very interesting study GAP.
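The kind of check described above, finding ZIP archives exposed through an open directory listing, can be sketched roughly as follows. This is not Miteru's actual code: the class and function names and the sample listing are assumptions made for illustration, and the sketch works on an HTML string rather than fetching anything from the network.

```python
from html.parser import HTMLParser


class ZipLinkFinder(HTMLParser):
    """Collects href values of <a> tags that point to .zip files."""

    def __init__(self) -> None:
        super().__init__()
        self.zip_links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Case-insensitive match on the file extension.
            if name == "href" and value and value.lower().endswith(".zip"):
                self.zip_links.append(value)


def find_zip_links(listing_html: str) -> list[str]:
    """Return the ZIP links found in a directory-listing page."""
    finder = ZipLinkFinder()
    finder.feed(listing_html)
    return finder.zip_links


# Hypothetical open directory listing (assumption, for illustration only).
SAMPLE_LISTING = """
<html><head><title>Index of /</title></head><body>
<a href="login.php">login.php</a>
<a href="kit.zip">kit.zip</a>
<a href="assets/">assets/</a>
<a href="BACKUP.ZIP">BACKUP.ZIP</a>
</body></html>
"""

found = find_zip_links(SAMPLE_LISTING)
print(found)
```

A campaign that disables directory listing or never leaves its kit archive on the server would return nothing here, which is exactly why this collection method may be biased toward simpler campaigns.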

They also mention that the tool is still too slow to be used on the fly by the user, taking over 5 seconds to analyze a page. Therefore, I guess that a faster tool is still a study GAP.

A last observation is that they did not experiment with cloaking, which is still a huge study GAP.

Initial Questions

One of the first questions I had, in the beginning, was "What might a dynamic phishing dataset be?". Well, I guess that it is just a way to describe a dataset that can grow automatically, which is interesting.

Where do the experiment ideas come from?

It is clear that every experiment revolves around the idea of dynamically improving a reference-based classifier, even though that is not the best contribution of the paper. Creating a dataset was a result of having to evaluate the tool. Creating the Webpage Interaction module was a result of having to handle pages that are brandless.

What are the interesting ideas/results?

An interesting idea the authors had was to specify the threat model of reference-based phishing classifiers regarding some phishing pages.

They demonstrate that Webpage Injection is still a study GAP, and I am interested in discovering more about how it works.

They demonstrate that their Webpage Interaction module is useful for understanding phishing intentions in brandless websites. However, it still only handles login fields, which leaves a study GAP for a more general website interaction tool. Besides that, it is interesting to see the tool they created to identify how to navigate from the homepage to the login page.
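As a rough illustration of what "only handles login fields" means in practice, the sketch below flags a page as credential-taking when a form contains a password input. This is not the paper's method, only a much cruder static check; the class name, function name, and sample pages are hypothetical.

```python
from html.parser import HTMLParser


class CredentialFormDetector(HTMLParser):
    """Sets has_password_field when a <form> contains <input type="password">."""

    def __init__(self) -> None:
        super().__init__()
        self.in_form = False
        self.has_password_field = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form:
            # The type attribute may be missing or valueless, hence the fallback.
            if (attr.get("type") or "").lower() == "password":
                self.has_password_field = True

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def asks_for_credentials(page_html: str) -> bool:
    """Return True if the page has a form with a password field."""
    detector = CredentialFormDetector()
    detector.feed(page_html)
    return detector.has_password_field


# Hypothetical pages (assumptions, for illustration only).
LOGIN_PAGE = '<form action="/post.php"><input type="text" name="u"><input type="PASSWORD" name="p"></form>'
SURVEY_PAGE = '<form action="/vote"><input type="text" name="card_number"></form>'

print(asks_for_credentials(LOGIN_PAGE), asks_for_credentials(SURVEY_PAGE))
```

The second page asks for a card number but has no password field, so a login-only check lets it through. That is the kind of case a more general interaction tool would need to cover.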

A genius-like move they made when creating the Phishing Dataset was to leverage a taint-based approach to automatically de-weaponize the phishing kits. This is a great strategy that I have to try at least once to see the results.

Another thing that I like is the way they evaluate the different modules of the tool separately to best represent the results, which is a very nice idea.

One last thing I like about the tool is its ability to identify the brand that is being impersonated by the phishing website, which could be used in a future study to enhance statistics.

