Sunday, October 20, 2024

Why The First Chocolate Piece Is Better Than The Second



Have you ever finished a delicious lunch and immediately started craving a piece of chocolate? That feeling arises because the enjoyment of the meal is over, and your brain, trying to keep you stimulated, starts searching its memory for the next thing that will do the job.

Humans tend to do things that make them feel better. In fact, the brain quantifies "feeling better" as the amount of hormones flowing through it, and it constantly guides the body toward actions that increase that amount. Just as in the chocolate scenario: when the meal is over, the brain starts processing which related activity will provide the most hormones, and after thinking for a while, a bar of chocolate seems like a very good option.

However, one might remember that the first square of chocolate tastes much better than the last square in the bar. That is how the brain manipulates itself to send the message "Hey, that is enough, go do something else". This is an important behavior from an evolutionary perspective, since a Homo sapiens that only enjoys building houses would fare far worse than one that enjoys building houses, hunting, farming, and socializing. Therefore, the brain evolved to decrease the amount of hormones released by any one specific task after some time.

Something similar happens in daily activities like doing homework. At first, one might be excited to start the homework just to be done thinking about it, but after a while the excitement fades and the brain starts looking for anything else to do, like calling a friend, gardening, or eating something. All this common behavior is caused by the constant work the brain does to keep you stimulated, with enough hormones flowing to promote synapses.

Nowadays it is a little easier for the brain to figure out what to do next. From the first time you open Instagram Reels and see the funniest cat video, the brain stores the fact that Instagram is a great source of hormones. That kind of "storing" behavior also makes sense because the brain cannot afford to go without stimulus for too long, so the first thing it does is try to remember what worked in the past. In the end, that is what explains the constant desire to open whatever entertainment app we have available: the brain senses that your current activity is providing far fewer hormones than you could get by simply searching for cat videos on Instagram, which is very easy to do.

Besides, one might also have experienced working on a regular day and suddenly having to deal with a lot of problems all at once. That is a very common feeling for programmers, because bugs often reveal themselves long after we have written them. At this sad moment, when the code is not working as you intended, the brain cannot generate hormones from sadness, so it works its hardest not at finding the solution to the bugs, but at figuring out what it can do to increase the amount of hormones in your system. Eventually, it remembers that very funny cat video you saw a while back and instantly moves your arms and fingers toward the first social media app your eyes recognize.

After some time receiving the notification from your phone showing your screen time on entertainment apps, you recognize that it is wasting a huge amount of your time. To solve that, you can try forcing yourself to do only your work, which you might manage for a day; but on the next one, you are so tired that you cannot even think about how to solve your daily tasks.

Knowing a better way to get stimulus, your brain induces this progressively heavier tiredness, which eventually convinces you that using the cellphone might not be such a bad option after all. Therefore, it is nonsensical to fight your own brain that way: it would basically be like trying to chop wood with the axe edge pointing backward.

A far better option is to learn why the axe was built that way and then use it accordingly. In that sense, it is better to convince your brain that Instagram is not a great source of happiness before trying to uninstall it. The main problem is that convincing your brain of that is usually not straightforward, because your neurons are already wired that way, so you have to be deeply convinced to avoid catching yourself scrolling through Instagram again. Because remember, the brain HAS to find a stimulus; if it has no immediate source of it, it will create one out of a boring task.

As a side comment, I got the idea of writing this because I recently heard someone say that he "wants" to watch TikTok videos to recharge from the day's tiredness. That got me thinking about the rabbit hole entertainment apps can put us in, because the only thing he doesn't know is that what causes the tiredness is TikTok itself.

Wednesday, October 16, 2024

How Technology Makes Humans Dumber

At first, this title makes no sense. How could something that helped humans travel to the Moon be bad for us? And no, I will not try to sell the even dumber argument that storing telephone numbers on a smartphone makes us less capable of memorizing a number ourselves. Hang on for a while; I will try to show you something.

As my high school sociology teacher used to say, technology is anything that helps a living being do something. One can think of cars as a technology that helps us with transportation, or cell phones that help us communicate. The dark side, however, is that any technology turns something that takes time into something fast. Even though this speed is required for us to develop more complex things, did God make us to work that fast?

Think about it: the first Homo sapiens ever discovered dates back 300,000 years, though to be fair, the oldest remains showing creative capabilities such as art, trade, ornaments, or burial rites are about 30,000 years old. At a time when not even agriculture existed, I assume we had just a few basic daily activities, such as standing guard at the edge of the community, walking for hours to get food, and collecting sticks for a fire. The same brain used only for those basic activities is used today to create abstractions like algorithms or marketing strategies, even though it never evolved to do so. Evolution shaped our brain under the assumption that we only had to do basic reasoning and had plenty of time to plan things.

Back then we spent a lot of time on simple tasks, which created a lot of idle time that could be used to consider options, plan the future, and have ideas. But as more technology is created, the fewer boring tasks we have during the day, and therefore the less idle time to occupy our brains with planning. Nowadays, I would argue that the peacefulness the Church brings its followers is another name for the feeling of spending one weekly hour in boredom, which might be the only idle time a Christian has in the whole week. All the remaining time might be filled with video calls, social media interaction, and entertainment consumption.

As time went on, we seem to have lost the idea that spending time doing nothing, or just doing some boring manual labor, can actually be a good thing, because it allows us to think about life in a more general way. That kind of idleness might also be essential to give us a feeling of control and safety.

Right now we have so much to do at all times that the only moment left to reflect on something might be Sunday mass, probably the only time spent doing nothing. Doing nothing used to be basically intrinsic to human life, and even though we get a little of it while washing the dishes, how would a human being living today know that it needs seven hours of doing nothing, just as its predecessors had when they spent the whole day manually farming the land?

Without it, I argue that we will see an increasing amount of psychological problems, caused by the lack of confidence and planning that being bored allows us to develop. Even though we have known about this since WALL-E came out, nobody seems to pay attention to it and actually spend time doing nothing to help make sense of life. Because of that, we essentially keep spending our time on trivial, immediate decisions and never think anything through.

Monday, October 7, 2024

(Preliminary Results) Inside the Phishing Reel: Identifying phishing kits at scale

Introduction

In the first quarter of 2024, the APWG reported an average of 300,000 unique phishing attacks per month. The Microsoft Digital Defense Report also showed that the problem is fueled by the low cost barrier of the phishing market, driven by the commercialization of phishing kits: code repositories built to mimic benign websites while redirecting information typed into credential fields to a centralized server, to be sold on the dark market later. These phishing kits are designed to be easily deployed by attackers at scale.

Despite all the efforts to classify phishing pages based on visual differences between malicious websites and their benign counterparts, little research exists on the distribution of phishing kits and how they are deployed over time. One recent study tackled this by collecting a handful of phishing kits and manually identifying fingerprintable information in them, which was later used to match websites in the wild to specific kits. Since the kits analyzed in that study are freely distributed in Telegram channels, we argue that this approach only captures simple phishing samples, which might not be responsible for the majority of effective phishing attacks.

We improve on that by creating a mechanism that identifies phishing kits at large scale by measuring their behavior at crawling time. This enables the security analyst to group URLs that run the same code and, consequently, are deployed from the same phishing kit.

To that end, we built a tool that bypasses code obfuscation entirely, regardless of the obfuscation technique used. It also provides a broad view of the phishing landscape by analyzing crawler data over a long period of time (TODO), which, besides revealing common practices and the lifetime of phishing pages, lets the analyst track the evolution of phishing pages and, hopefully, attribute different phishing pages to the same possible authors.

Finally, the tool also lets the analyst check which samples were never seen before in terms of client-side behavior. That saves valuable human resources: analysts receive thousands of phishing websites to review daily, and automatically filtering out the ones already seen reduces wasted human time.

Crawler


The most important part of the crawler is the browser. We used an instrumented version of Chromium with slight modifications to the bytecode generation pipeline and the runtime function-calling machinery, in order to log JavaScript property accesses and function calls. This browser lets us see exactly which operations each website executed at runtime, in a way that is impossible to detect from the website itself (which is the main advantage over the regular CDP approach). Our modifications follow previous studies that applied the same technique to older versions of the browser.

To log function calls, we hook the runtime function-calling method inside Chromium. To log property accesses, we instrumented the bytecode generation process, adding an extra bytecode that invokes a logging function, with the relevant information, whenever a property is loaded or stored. We then packaged this instrumented build of Chromium in a Docker container to make it scalable.

The log format produced by the browser has the following characteristics. For each process created by the visited website, one log file is written to disk containing the sequence of instructions executed by the browser. Besides the instructions, the log file also records the source code that was loaded before the instructions were executed. It is also possible to attribute the execution of each instruction to one specific script, loaded from a specific origin inside the running V8 context.
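A minimal parser for this kind of per-process log might look like the following sketch. The line format used here (`script_url | operation | target`) is an assumption made for illustration only; the instrumented browser's actual format is not specified in this post.

```python
# Parse a simplified instruction log where each line is assumed to carry
# the originating script's URL, the operation kind (call/load/store), and
# its target. This is an illustrative stand-in for the real log format.
def parse_log(lines):
    entries = []
    for line in lines:
        script_url, op, target = (field.strip() for field in line.split("|"))
        entries.append({"script": script_url, "op": op, "target": target})
    return entries

log = [
    "https://cdn.example/jquery.js | call  | document.getElementById",
    "https://evil.example/kit.js   | store | window.location",
]
parsed = parse_log(log)
print(parsed[1]["op"])  # store
```

Keeping the script URL on every entry is what later allows each instruction to be attributed to a specific source.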

A key problem with phishing websites is that phishing kits are often obfuscated, sometimes with different tools. With our approach, we bypass obfuscation entirely: even though the source code might look different under different obfuscation techniques, the browser sees the exact same sequence of instructions being executed.

Another problem phishing crawlers often have to handle is cloaking, which we conveniently do not need to tackle here. Since we want to cluster together websites that behave the same way, even when our browser is redirected by cloaking, we can still cluster together the websites that use cloaking in the same way. In the end, we argue that this is enough information to identify phishing pages deployed from the same phishing kit (TODO).

In general, the full crawler works as follows. A Trigger module runs twice a day, at uniformly random times, which is important to avoid being cloaked by platforms that do not support crawling. When it runs, it calls three other modules sequentially. First, it runs the Downloader modules, which fetch public sources of phishing URLs reported and validated by security analysts (PhishTank, OpenPhish, and PhishStats). It then checks the new URLs against a database of previously crawled URLs to avoid crawling the same URL twice. If there are any new URLs, the Trigger runs the Analyzer module on them, which consists of running the Instrumented Browser on each URL for 10 s, then a tool called Resources Saver to save the page's public resources (such as images, icons, and source code), and then KitPhishR, a tool that looks for phishing kits stored on the deployed phishing server. The browser runs in a Docker container created to analyze batches of 10 URLs, which is then destroyed and replaced by another Docker instance to analyze more samples.
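The batch handling described above can be sketched as follows; the helper name is illustrative, not the authors' actual code, but the batch size of 10 per container matches the text.

```python
# Split the daily URL feed into fixed-size batches; in the pipeline, each
# batch is analyzed inside one short-lived Docker container that is
# destroyed and replaced after the batch completes.
def batches(urls, size=10):
    for i in range(0, len(urls), size):
        yield urls[i:i + size]

feed = [f"https://site{n}.example/" for n in range(23)]
groups = list(batches(feed))
print(len(groups), len(groups[-1]))  # 3 containers; the last one gets 3 URLs
```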

Information on the resources and the phishing kits (obtained with KitPhishR) is not used in the processing pipeline. It will be used in a validation phase to verify whether the phishing URLs assigned to the same cluster indeed come from the same phishing kit.

After the analysis is completed, we store the results of each URL in a separate directory named after the hash of the crawled domain, which also ensures that the same domain is not crawled twice.
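The directory naming above doubles as a deduplication key, which can be sketched like this. The function name and the choice of SHA-256 are assumptions for illustration; the post does not say which hash the crawler uses.

```python
import hashlib

# Hypothetical sketch: name each result directory after the hash of the
# crawled URL's domain, so two URLs on the same domain map to the same
# directory and the domain is only crawled once.
def result_dir(url: str) -> str:
    domain = url.split("//", 1)[-1].split("/", 1)[0].lower()
    return hashlib.sha256(domain.encode("utf-8")).hexdigest()

a = result_dir("https://evil.example/login.php")
b = result_dir("https://evil.example/reset.php")
print(a == b)  # True: same domain, same directory, crawled only once
```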

After the tool finishes analyzing the URLs, due to storage limitations on the server, we upload all the sample results to Google Drive, organized by source, day, and hour of the crawl.

Projector


With the logs in hand, we run the projector once a day. For each day, this tool verifies which samples are similar to ones seen before and which appear to be new.

Since we cannot keep all the data on disk at the same time, the projector downloads the samples one day at a time, uses the Log Parser module to transform each sample's log into an Intermediary Representation, and then adds it to the dataset through the Dataset Parser module.

The Log Parser handles the logs under the assumption that each sample contains JavaScript code from n different sources during execution. It therefore breaks the single sequence of instructions produced by the browser into n sequences of instructions, separated by source. We call each of these broken-out sequences an Instruction Block; each has a unique URL associated with it.
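The split into Instruction Blocks can be sketched as below, assuming each parsed instruction is tagged with the URL of the script it came from (the field names are illustrative):

```python
# Split one interleaved instruction stream into per-script sequences,
# preserving execution order within each script. Each resulting
# Instruction Block is keyed by its source URL.
def instruction_blocks(instructions):
    blocks = {}  # insertion-ordered in Python 3.7+
    for inst in instructions:
        blocks.setdefault(inst["script"], []).append(f"{inst['op']} {inst['target']}")
    return blocks

stream = [
    {"script": "https://cdn.example/lib.js",  "op": "call",  "target": "Math.random"},
    {"script": "https://evil.example/kit.js", "op": "store", "target": "form.action"},
    {"script": "https://cdn.example/lib.js",  "op": "load",  "target": "navigator.userAgent"},
]
blocks = instruction_blocks(stream)
print(len(blocks))  # 2 Instruction Blocks, one per script source
```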

We do that because our main goal is not to identify the exact same behavior across phishing websites, but to identify common patterns in the behavior. With this division, it is easier to identify similar files across phishing websites, which can be used to understand the evolution of a phishing kit. That is why computing Instruction Blocks is better than handling a single thread of logs.

After all the data is gathered in the Dataset, we transform the text-based representation of the logs into an embedding representation so that regular clustering techniques can be applied on top of it. The Dataset Embedding module was built to be adaptable and can use either DOC2VEC or SBERT. In our experiments, SBERT produces less noise in the embeddings and places similar samples closer together. Besides, SBERT's attention layer yields a larger importance window than the DOC2VEC approach, and it is faster because it is based on a pre-trained model. In the end, the Dataset Embedding module embeds each Instruction Block, creating a variable number of embeddings per sample. One can think of the result as an n x F matrix, with n being the variable number of Instruction Blocks of each sample and F the number of features of the embedding technique, which is the same for all samples.
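The shape of the result can be illustrated with the sketch below. The hash-based embedder is a deterministic toy stand-in for SBERT (which the pipeline actually uses, at 1024 dimensions), chosen only so the example is self-contained; it carries no semantic meaning.

```python
import hashlib
import numpy as np

F = 16  # toy dimensionality; the real pipeline uses 1024-dim SBERT embeddings

def toy_embed(text: str) -> np.ndarray:
    # Deterministic stand-in for a sentence embedder: hash the text and
    # scale the first F bytes into [0, 1]. NOT semantically meaningful.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return np.frombuffer(digest[:F], dtype=np.uint8).astype(float) / 255.0

def sample_matrix(instruction_blocks):
    # One embedding per Instruction Block -> an (n, F) matrix per sample,
    # where n varies between samples but F is fixed.
    return np.stack([toy_embed(block) for block in instruction_blocks])

m = sample_matrix(["call Math.random", "store form.action", "load navigator.userAgent"])
print(m.shape)  # (3, 16): n = 3 blocks, F = 16 features
```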

In the Clusterizer module, each sample is represented by a matrix with a different number of rows, which makes comparing samples non-trivial. Ultimately, we want to compute a Representative Vector for each sample, derived from the group of embeddings calculated for that sample, and use it to compare samples against each other. This module was also built to be generic, which let us try different approaches. The first was simply averaging the rows of the matrix, which ended up creating more noise than information. We then tried finding the single most important row in the matrix, the one representing the file that best describes the sample, which ended up highlighting JavaScript libraries like jQuery, Bootstrap, and Popper.

Finally, we realized that, even though the number of rows varies between matrices, we can multiply the transpose of each sample matrix by the matrix itself, yielding an F x F matrix for every sample, which makes the samples comparable. The problem is that the SBERT model produces 1024-dimensional embeddings, which would give each sample a resulting matrix with more than one million features. We solve that by applying Incremental PCA with 256 components and a batch size of 256 (chosen due to memory limitations), which leaves us with a Representative Vector of size 256 for each sample in the dataset.
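The transpose trick and the PCA reduction can be sketched as follows. The dimensions are shrunk (F = 8 instead of 1024, 2 components instead of 256) so the example runs on random toy data; the structure mirrors the description above.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

F = 8  # small stand-in for SBERT's 1024 embedding dimensions
rng = np.random.default_rng(0)
# Three hypothetical samples with different numbers of Instruction Blocks.
samples = [rng.normal(size=(n, F)) for n in (3, 5, 12)]

# X^T X yields an F x F matrix for every sample, regardless of n.
grams = [x.T @ x for x in samples]
flat = np.stack([g.ravel() for g in grams])  # shape (num_samples, F*F)

# Reduce the F*F features to a fixed-size Representative Vector
# (the pipeline uses 256 components with batch size 256).
ipca = IncrementalPCA(n_components=2)
vectors = ipca.fit_transform(flat)
print(vectors.shape)  # (3, 2): one Representative Vector per sample
```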

This approach computes much more accurate clusters for similar execution traces with roughly the same number of instruction blocks. However, large differences in the number of instruction blocks still affect the clustering, which is expected given the matrix multiplication approach. We have yet to find an alternative that completely disregards the number of instruction blocks in a sample.

Finally, the Clusterizer module takes the embeddings as input and accepts many different clustering algorithms, such as DBSCAN, OPTICS, and HDBSCAN. Our initial tests used DBSCAN, but the final pipeline will be chosen based on performance.
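A minimal DBSCAN run over Representative Vectors looks like this; the `eps` and `min_samples` values and the synthetic two-blob data are illustrative, not the study's tuned parameters.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two synthetic groups of Representative Vectors standing in for two
# distinct phishing-kit behaviors.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=0.05, size=(20, 4))
group_b = rng.normal(loc=5.0, scale=0.05, size=(20, 4))
vectors = np.vstack([group_a, group_b])

# Density-based clustering: samples within eps of enough neighbors form a
# cluster; points that fit nowhere get the noise label -1.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(vectors)
print(len(set(labels) - {-1}))  # two dense groups -> 2 clusters
```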

All of this composes the Projector module, which runs every day in parallel with the crawler. While the crawler runs daily to obtain new samples, the projector runs daily to consume them. It uses the samples from the days before today to build the embedding space and then projects today's samples into the previously known clusters, verifying which of today's samples are new and which are similar to ones seen before. That tells the security analyst which samples need manual inspection and which are not worth inspecting.
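The projection step can be sketched as a nearest-centroid assignment with a distance cutoff; this is a simplified illustration of the idea, not the authors' actual method, and the threshold value is an assumption.

```python
import numpy as np

# Assign each new Representative Vector to the nearest known cluster
# centroid, or flag it as unseen (-1) when no centroid is close enough.
def project(new_vectors, centroids, threshold=1.0):
    assignments = []
    for v in new_vectors:
        dists = np.linalg.norm(centroids - v, axis=1)
        best = int(np.argmin(dists))
        assignments.append(best if dists[best] <= threshold else -1)
    return assignments

centroids = np.array([[0.0, 0.0], [5.0, 5.0]])  # clusters from earlier days
new = np.array([[0.1, -0.1],   # near cluster 0 -> a known phishing kit
                [9.0, 9.0]])   # far from everything -> new behavior
print(project(new, centroids))  # [0, -1]
```

Samples flagged `-1` are the ones worth the analyst's manual inspection.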

This tool also lets us see how phishing kits progress over time and how long a phishing kit remains deployed in the wild. It can additionally be used to blacklist samples whose behavior matches samples that were already validated.

Initial Results

Malicious Clustering Subsample Experiment

To verify whether the algorithm correctly groups different phishing samples into the same cluster, we clustered 249 valid samples by their behavior. To validate the clustering, we built a tool to manually inspect each cluster with respect to the code that was executed, the number of blocks in each sample, and the domains requested.

Considering the clusters with more than one sample, we mostly saw clusters that correctly grouped different samples with the same behavior, suggesting that the pipeline is performing well.

We did find some clusters containing exact duplicates, which might be due to redirection or a failure to filter equal samples. Even so, we obtained interesting clusters containing slightly different execution blocks or domains, but with enough shared structure to make us believe the overall code was developed by the same author.

Reliability of the Crawler

To test whether the exact same URL is visited consistently by the browser and correctly assigned to the same cluster, we propose the following experiment.

We selected the 10 most accessed URLs from the Alexa Top 100 index and used our crawler to visit each of them 10 times sequentially. Our hypothesis is that the clustering will produce 10 clusters, each containing all the visits to one URL.

List of URLs:

  • https://www.google.com/
  • https://www.facebook.com/
  • https://www.youtube.com/
  • https://www.wikipedia.org/
  • https://www.amazon.com.br/
  • https://www.instagram.com/
  • https://www.linkedin.com/
  • https://www.reddit.com/
  • https://www.whatsapp.com/
  • https://openai.com/

In the end, WhatsApp, Instagram, OpenAI, Wikipedia, YouTube, and Facebook were grouped 100% correctly into their own clusters, while Google and LinkedIn were clustered 90% correctly. The misplaced samples had fewer instruction blocks, meaning that either assets failed to load during the 10 s crawl or the browser was evaded. Amazon was split across 3 different clusters, with 10, 11-12, and 7 instruction blocks respectively, while Reddit could not be parsed by the crawler.

After manually analyzing the sample embeddings by projecting them into a 3D space, we saw that even though some samples were not assigned to the same cluster, they were not far from each other, meaning a small adjustment to the clustering algorithm would group them together. However, we would have to verify that such an adjustment does not make the pipeline more prone to grouping different samples in the same cluster, which is equally bad.

Overall, this means the processing tool still needs some tweaking of the embedding technique and clustering parameters before deployment.

Effectiveness of the Projector


Although this is a preliminary experiment, we ran the projector on a rather small sample set spanning September 25th to September 29th, with September 29th as the target date to be mapped.
Overall, we used 7882 samples of old data and 1407 samples of new data. Of those new points, 471 were assigned to a cluster, and 409 of those landed in clusters containing old points, indicating a repeated phishing kit deployment. While the clustering algorithm still needs tweaking so that all new samples land in a cluster, we found that 86% of the phishing URLs discovered in a day were not new, but redeployments of an already known phishing kit.

Next Steps

Another experiment we would like to run concerns progressive log files, that is, understanding where a sample sits in the projection space while it is still loading. We still have to implement this experiment, but it would allow us to build an instrumented version of Chromium that uses this approach as a blacklist for known phishing kits.

We also have a crawler constantly obtaining new samples, which will give us insight into the distribution of phishing kits over time. However, we will have to leave it running for a while before drawing any conclusions.

Besides, we are also considering using the dynamic behavior of phishing kits to build a classification tool that differentiates benign and malicious websites based on that behavior.