[Rephrasing] Phishing URL Detection: A Network-based Approach Robust to Evasion

💡This is a reader's perspective on the paper written by Taeri Kim (from Hanyang University) and published at the ACM Conference on Computer and Communications Security 2022.

Brief Description

First of all, this paper is highly mathematical, and I don't have (nor did I intend to acquire) the knowledge required to assess or build on its content. I also skipped some parts that may well have been very important, because I didn't intend to spend the time needed to fully understand them, so reading them would not have changed what I took away.

The authors propose a URL-based classifier built on network theory: they split URLs into separate words to better associate those words with phishing-related features, such as the associated IP addresses. They also argue that URL-based detection is worth studying because it can serve as a first line of defense that does not need to access the content of phishing pages, which can be hidden by cloaking techniques (reasonable).
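To make the word-splitting idea concrete for myself, here is a minimal sketch of how URLs, their words, and their IP addresses could be linked in one graph. This is not the authors' pipeline: the tokenizer (a simple split on non-alphanumeric characters), the function names `url_words` and `build_graph`, and the sample URL-to-IP mapping are all my own assumptions for illustration.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit


def url_words(url):
    """Split a URL's host, path and query into lowercase alphanumeric words.

    Assumption: a plain regex split; the paper may segment URLs differently.
    """
    parts = urlsplit(url)
    text = " ".join([parts.netloc, parts.path, parts.query])
    return [w for w in re.split(r"[^A-Za-z0-9]+", text.lower()) if w]


def build_graph(url_to_ip):
    """Link each URL node to its word nodes and its IP node (undirected)."""
    graph = defaultdict(set)
    for url, ip in url_to_ip.items():
        url_node = ("url", url)
        neighbors = [("word", w) for w in set(url_words(url))] + [("ip", ip)]
        for node in neighbors:
            graph[url_node].add(node)
            graph[node].add(url_node)
    return graph


# Hypothetical sample data (reserved example domains and documentation IPs).
sample = {
    "http://secure-login.example.com/paypal/verify": "203.0.113.5",
    "http://paypal-update.example.net/login": "203.0.113.5",
    "https://www.example.org/about": "198.51.100.7",
}
graph = build_graph(sample)

# Words that appear in more than one URL connect those URLs in the graph.
shared = sorted(
    name for (kind, name), urls in graph.items()
    if kind == "word" and len(urls) > 1
)
print(shared)
```

In this toy graph, two URLs end up connected both through shared words and through a shared IP node, which is the kind of association the word-splitting step seems meant to expose. How labels then spread over such a graph is the mathematical part I skipped.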

Observations

I understand that this paper is a feast for anyone who is into network theory, but I wonder whether there was a better way of explaining it, with examples.

One thing that I recognize here is that I really need to understand deep learning better and get into the TensorFlow framework.

Initial Questions

My first question came up in Section 2.2, where I wondered which brands are targeted the most. I didn't find the answer in this paper, but I think it might be a nice gap for an experiment in future research.

Where do the experiment ideas come from?

I suppose that one of the authors is very into network theory and had a student who really likes computer security and saw URL-based classification as a great mix of the two. They were also creative in how they tested the model (even though it might not be the most creative paper I have ever read).

What are the interesting ideas/results?

I like their point that PhishTank is not fully reliable because it can be manipulated by attackers. Even though this can be done automatically, I wonder how serious the problem is in practice. Still, I like the idea of cross-checking the set of URLs against VirusTotal to explore it further.

I also like the idea of comparing different methods to check whether the data collection system is robust, which they do in Section 6.2.

Finally, I like the time complexity analysis they provide for the network model. I also like their explanation of the transductive approach they took.

João Pedro Favoretti - Blog

Tuesday, August 20, 2024