💡 This is a reader's perspective on the paper by Marzieh Bitaab et al., published at IEEE S&P 2023.
Brief Description
The paper collects Fraudulent e-Commerce Websites (FCWs) by mining Reddit, uses that dataset to evaluate existing detection tools such as Google Safe Browsing, and proposes a classifier to detect FCWs.
Observations
One thing on my mind is the NLP classification method they used to evaluate answer sentiment: would it have been possible to use ready-to-use models (e.g., something built around spaCy) instead of training their own model on public data?
Another thing was the sentiment dataset they used. Section IV-B says they trained the sentiment classifier on the Stanford Sentiment Treebank, but then applied it to Reddit comments. That made me wonder whether the slang and misspelled words that often appear in Reddit comments might affect the classifier's output.
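On both of these points, here is a minimal sketch of what the "ready-to-use" route could look like. As far as I know, spaCy's stock pipelines do not ship a sentiment component, so NLTK's VADER analyzer (which is tuned for social-media text, slang included) stands in here; the comments are invented, and this is an illustration of my question, not the paper's pipeline.

```python
# Hypothetical sketch: off-the-shelf sentiment on Reddit-style comments,
# instead of a custom classifier trained on the Stanford Sentiment Treebank.
# Requires: pip install nltk
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time lexicon download
sia = SentimentIntensityAnalyzer()

# Invented comments mimicking Reddit slang and misspellings
comments = [
    "def a scam, they took my money and ghosted me lol",
    "ordered twice, legit site, shipping was kinda slow tho",
]

for text in comments:
    scores = sia.polarity_scores(text)  # dict with neg/neu/pos/compound scores
    label = "scam-leaning" if scores["compound"] < 0 else "legit-leaning"
    print(f"{label:13s} {scores['compound']:+.2f}  {text}")
```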
It was a very nice idea to first provide, in Section IV, some statistical information about the features that the classifier uses later in the paper.
I like the idea of building the classifier, even though it relies on "heuristic features" rather than a more generalized approach. Since this is a first-of-its-kind paper, that seems clearly acceptable.
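To make the "heuristic features" point concrete, here is a minimal sketch of what such a classifier could look like. The feature names and the toy data are invented for illustration; they are not the features from the paper.

```python
# Hypothetical sketch of a heuristic-feature classifier for FCW detection.
# The feature names and data below are made up; they are NOT the paper's features.
# Requires: pip install scikit-learn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

FEATURES = ["in_alexa_top_100k", "domain_age_days", "has_contact_page", "discount_pct"]

# Toy rows in FEATURES order; label 1 = fraudulent, 0 = benign
X = [
    [1, 4200, 1, 10],
    [0,   15, 0, 85],
    [1, 3650, 1, 25],
    [0,    7, 0, 90],
    [0,   30, 1, 70],
    [1, 5000, 1,  5],
    [0,   12, 0, 80],
    [1, 2900, 1, 15],
]
y = [0, 1, 0, 1, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test), zero_division=0))
```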
Even though phishing classifiers were only used to compare classification performance, I like the effort it took to implement all of that for just half a page of content in the paper.
Initial Questions
While reading the first lines of the paper, I asked myself what the difference might be between fake online shopping websites and Fraudulent eCommerce Websites (FCW). I then learned that FCW is a general label for scam websites that aim to sell anything, so one category contains the other.
One thing on my mind while reading the explanation of the classifier's feature engineering was the "Alexa top 100k" feature. At first I thought it would be cheating to use it if the benign data came only from Alexa's top URLs, but later in the paper they clearly state that they removed this feature from the classifier when that data was used.
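A toy illustration of that concern (with invented data, not the paper's): if the benign class is sampled from Alexa's top list, an "in Alexa top 100k" feature essentially encodes the label, so it has to be dropped before training on that data.

```python
# Hypothetical illustration of label leakage: when benign sites are sampled from
# Alexa's top list, the "in_alexa_top_100k" column mirrors the label and must be dropped.
# Requires: pip install pandas scikit-learn
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "in_alexa_top_100k": [1, 1, 1, 0, 0, 0],   # benign rows come from Alexa -> always 1
    "domain_age_days":   [4000, 3500, 5000, 10, 25, 8],
    "label":             [0, 0, 0, 1, 1, 1],   # 1 = fraudulent
})

X_leaky = df[["in_alexa_top_100k", "domain_age_days"]]
X_clean = df.drop(columns=["in_alexa_top_100k", "label"])
y = df["label"]

leaky = DecisionTreeClassifier(random_state=0).fit(X_leaky, y)
clean = DecisionTreeClassifier(random_state=0).fit(X_clean, y)

# With the leaky feature, the tree can "cheat" by splitting on it alone.
print("leaky model feature importances:", dict(zip(X_leaky.columns, leaky.feature_importances_)))
print("clean model feature importances:", dict(zip(X_clean.columns, clean.feature_importances_)))
```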
Where do the experiment ideas come from?
As I read through the paper, I think that building a classifier as an experiment on top of the proposed data-collection technique is a very natural follow-up for the authors. The same likely applies to the testing of existing tools such as Google Safe Browsing, which is used to detect phishing and malware.
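For reference, checking a URL against Google Safe Browsing can be done through its Lookup API. This is a minimal sketch from my reading of the v4 documentation; the API key and URL are placeholders, and the request shape should be double-checked against the official docs.

```python
# Rough sketch of a Google Safe Browsing v4 Lookup API query.
# YOUR_API_KEY and the example URL are placeholders.
# Requires: pip install requests
import requests

API_KEY = "YOUR_API_KEY"  # placeholder
ENDPOINT = f"https://safebrowsing.googleapis.com/v4/threatMatches:find?key={API_KEY}"

body = {
    "client": {"clientId": "fcw-reader-test", "clientVersion": "0.1"},
    "threatInfo": {
        "threatTypes": ["MALWARE", "SOCIAL_ENGINEERING"],
        "platformTypes": ["ANY_PLATFORM"],
        "threatEntryTypes": ["URL"],
        "threatEntries": [{"url": "http://example-suspicious-shop.test/"}],
    },
}

resp = requests.post(ENDPOINT, json=body, timeout=10)
resp.raise_for_status()
# An empty JSON object means no match; a "matches" key means the URL is flagged.
print(resp.json().get("matches", "no matches"))
```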
What are the interesting ideas/results?
The most interesting idea in the paper is the use of Reddit for dataset collection, which is completely off the beaten track and genius at the same time.
I like how simply the research questions are structured: the first is about collecting the dataset, then existing tools are evaluated, and finally a classification solution is proposed.
Another interesting idea in the paper was to enumerate the parties that benefit from the study.
Another nice idea that was tested was measuring how many scams are created using Shopify, which made me wonder how much the company is doing to prevent this (a naive fingerprint check for Shopify-built sites is sketched below). Another GAP that might arise is a comparison between different e-commerce creation platforms such as WooCommerce. Besides that, there is the question of distinguishing websites that were purposely made to be scams from websites whose developers were simply unable to cope with the demand (both cases might be labeled as scams, but a line has to be drawn between them).
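Out of curiosity about the Shopify measurement, a very naive check might look like this. It is my own heuristic, not the paper's method; the markers are just common fingerprints and the URL is a placeholder.

```python
# Naive, hypothetical heuristic for spotting Shopify-built storefronts.
# This is NOT the paper's methodology, just a quick fingerprint check.
# Requires: pip install requests
import requests

SHOPIFY_MARKERS = ("cdn.shopify.com", "myshopify.com", "Shopify.theme")

def looks_like_shopify(url: str) -> bool:
    """Fetch the landing page and look for common Shopify fingerprints."""
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        return False
    return any(marker in html for marker in SHOPIFY_MARKERS)

# Example with a placeholder URL:
print(looks_like_shopify("https://example.com"))
```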
The other thing I can't leave unmentioned is the idea of creating a bot to post comments on Reddit and relying on Reddit users to evaluate the answers. I do wonder whether the users were aware of what they were contributing to, and whether knowing would have changed their behavior (this might be an interesting GAP).
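For context on how such a bot might work mechanically, here is a minimal PRAW sketch. The credentials, submission id, and wording are placeholders, and, as noted above, the awareness/ethics question matters more than the code itself.

```python
# Hypothetical sketch of a Reddit bot that replies to a post and asks users
# for feedback. Credentials and the submission id are placeholders.
# Requires: pip install praw
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    username="YOUR_BOT_ACCOUNT",
    password="YOUR_PASSWORD",
    user_agent="fcw-feedback-bot/0.1 (research sketch)",
)

submission = reddit.submission(id="abc123")  # placeholder submission id
comment = submission.reply(
    body="Is this shop a scam in your experience? Upvote if yes, downvote if no."
)
print("posted comment:", comment.id)
```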
I also like the way the economics of scam websites were taken into account in the insights on Model Robustness (Section VI-F), and how honest the authors were when trying to break the classifier by manually adjusting the features.