Monday, May 20, 2024

Don't Lose the Thread: How to create programs that run in parallel


It is always nice to be able to execute multiple instructions at the same time. Unfortunately, it doesn't reduce the time complexity of a program but will make it execute in less time. Well, it should, if the programmer doesn't use it wrongly and end up with a slower program. In a way, the idea of threads is similar to the idea of processes. In summary, it is just an abstraction that the operating system creates that will allow the user to separate instructions that have different purposes inside a program. 

For instance, the operating system has a lot of programs that the user might want to execute, like Chrome, File Explorer, and Paint, so it will separate them into separate processes. Similarly, a program like Paint has to execute many different functions, like one that captures what the user is drawing, another that calculates how to smooth curves or even one that checks which pixels are included in some fill operation. Any of those functions need to execute in order, therefore they can be divided into separate threads.



In general, a process can have any number of threads to execute whatever function they want in parallel.


The idea of processes was implemented in the Linux version v0.1 and each new process was created with the system call fork(), which copies all the data from the father process to the child process, duplicating the amount of used memory. With that idea, if Paint would like to handle a fill operation in parallel, it would have to fork it, duplicate the amount of memory used by Paint, execute it, and then shrink again. 




The main problem is the overhead caused by using twice as much RAM as the process is already using, even though the code that will be executed and the canvas memory are already in RAM. That was when someone came up with the idea of making two processes point to the same memory location. Well, it makes a lot of sense, code is already in memory, why would you need to duplicate it to be able to execute it? In the end, it was an easy idea to implement because threads behave just like processes already behave at the OS level, so they had to create a new system call (called clone(2)) that works just like fork does, but referencing memory already allocated instead of copying it. However, while threads can share Code, Data and Heap memory, they can't share Stack Memory, therefore each Thread and Process has its own Stack.

On Linux, both Threads and Processes are stored in the same tasks array. However, it is very common to refer to tasks that own the memory as a Process, and tasks that reference others' memory as Threads



In fact, the implementation was stored at the file /kernel/fork.c, because the clone implementation was handled by the same function as the fork system call with some adjustments. The main difference is when the function calls the macro copy_vm, because when it understands that fork was called, then it executes the function copy_page_tables to allocate new memory pages, but if clone was called, then it executes the function clone_page_tables to copy the addresses of memory pages that are already allocated to the newly created task.


/kernel/fork.c

#define IS_CLONE (regs.orig_eax == __NR_clone)

#define copy_vm(p) ((clone_flags & COPYVM)?copy_page_tables(p):clone_page_tables(p))


Threads became so common that a Standard of functions was created so that any operating system could create threads equally. That standard is now known as POSIX Threads Standard or pthreads for short. Also, details about those functions are on the pthreads(7) manual page.

Even though it is easy to get overwhelmed by the number of functions defined in the POSIX Threads standard, it is simple to use. Of those functions, the two most common are the pthread_create function, which is used to create a new thread starting at a specific function, and pthread_join which is used by the main process to wait for any running thread. 

To understand how to use those two functions, first consider the following example: 


factorial_example.c

#include <stdio.h>

void factorial(int n) {
    int result = 1;
    for (int i = 1; i <= n; i++) {
        result *= i;
    }
    printf("Factorial of %d is %d\n", n, result);
}

int main(void) {
    int n1 = 5, n2=6;
    factorial(n1);
    factorial(n2);
    return 0;
}


In the previous example, the program executes the same function twice, and in order. However, since they are not attached, it is possible to execute them in different threads by using the pthread_create and pthread_join functions. Those functions are defined in the pthread.h header file.


/usr/include/pthread.h

/* Create a new thread, starting with execution of START-ROUTINE

   getting passed ARG.  Creation attributed come from ATTR.  The new

   handle is stored in *NEWTHREAD.  */

extern int pthread_create (pthread_t *__restrict __newthread,

                           const pthread_attr_t *__restrict __attr,

                           void *(*__start_routine) (void *),

                           void *__restrict __arg) __THROWNL __nonnull ((1, 3));

...

/* Make calling thread wait for termination of the thread TH.  The

   exit status of the thread is stored in *THREAD_RETURN, if THREAD_RETURN

   is not NULL.



   This function is a cancellation point and therefore not marked with

   __THROW.  */

extern int pthread_join (pthread_t __th, void **__thread_return);


More information about the behavior of both functions is written on the pthread_create(3) and pthread_join(3) manual page. In summary, pthread_create is a function that creates a new thread attached to the main process and begins the execution from a function. It has four parameters, the first is a variable to store the identifier of the thread that is being created, the second is a struct you can use to change the way the operating system will handle the thread, the third is a pointer to the function that is going to be executed, and the fourth is the argument your function will receive.

The function pointer is the most important of the parameters because the pthread_create function only accepts functions with types void *(void *), which means the function can only receive one void * parameter and will return a void * address. Since C is a strictly typed language, the pthread_create function needs to specify only one function type that it accepts, and that is the format that can be used generally in most cases.

Even though the factorial function was not defined with that type, it is possible to adjust it so that it follows the type required by the pthread_create function.


factorial_parallel_example.c

#include <stdio.h>
#include <pthread.h>

void *factorial(void *arg) {
    int n = *(int*)arg;
    int result = 1;

    for (int i = 1; i <= n; i++) {
        result *= i;
    }

    printf("Factorial of %d is %d\n", n, result);
    
    return NULL;
}

int main(void) {
    int n1 = 5, n2 = 6;
    pthread_t tid1, tid2;

    pthread_create(&tid1, NULL, factorial, &n1);
    pthread_create(&tid2, NULL, factorial, &n2);

    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);
    
    return 0;
}


To compile the last code using the pthread function, GCC requires the flag -pthread at compile time, in the same way that flag -m has to be used to include math functions. 


$ gcc -pthread factorial_parallel_example.c


In the last example, two different threads were created to handle each call to the function factorial with the corresponding parameter. Besides that, NULL was used in the second argument to specify that the threads were to be defined in the default way, and we had to send the address of both n1 and n2 variables as parameters because the factorial function now receives a pointer instead of a value.

Besides the function pthread_create, the last example used the function pthread_join which has two parameters. The function can be used to make the main process wait for unfinished threads, just like the wait system call works. The first parameter is the thread ID created from the pthread_create function, while the second parameter is a pointer to store the return value of the thread, that has type void *.

Since pthread_create restricts the function so intensively, two main problems may arise:

  • How to retrieve information from the thread function? 
  • How to access more than one parameter in the thread function?

Even though the main process can retrieve the return value of a thread function using the pthread_join function, it is much easier to perform thread communication using shared memory between the threads. Since all threads share the Data and Heap Memory, they can communicate by reading and writing information in that memory. The following example shows how both to give and receive information to the thread using global variables.


factorial_shared_example.c

#include <stdio.h>
#include <pthread.h>

int factorial_n;
int factorial_return;

void *factorial(void *arg) {
    int result = 1;

    for (int i = 1; i <= factorial_n; i++) {
        result *= i;
    }

    factorial_return = result;

    return NULL;
}

int main(void) {
    pthread_t tid;

    factorial_n = 5;
    pthread_create(&tid, NULL, factorial, NULL);

    pthread_join(tid, NULL);

    printf("Factorial of %d is %d\n", factorial_n, factorial_return);

    return 0;
}


Since the last example does not use the fourth argument to pthread_create, a NULL value was used instead.

An alternative to global variables as a solution to send more information to the thread, it is also possible to use a struct variable as a parameter. The following example calculates the area of a rectangle and uses a struct as the thread function parameter. Besides that, it will use the pthread_exit function to correctly end the thread, which is an alternative to the more common return operation, used in the previous examples.


area_rectangle.c

#include <stdio.h>
#include <pthread.h>

typedef struct {
    int width;
    int height;
} rectangle;

void* calculate_area(void* arg) {
    rectangle* rect = (rectangle*)arg;
    int area = rect->width * rect->height;
    printf("Area of rectangle with width %d and height %d is: %d\n", rect->width, rect->height, area);
    pthread_exit(NULL);
}

int main() {
    pthread_t tid1, tid2;

    rectangle rect1 = {
        .width = 5,
        .height = 10
    }; 

    rectangle rect2 = {
        .width = 8,
        .height = 15
    }; 

    pthread_create(&tid1, NULL, calculate_area, (void*)&rect1);
    pthread_create(&tid2, NULL, calculate_area, (void*)&rect2);


    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);

    return 0;
}


The main problem of working with threads is that sometimes two threads try to access the same memory region, at the same time. This problem is known as a synchronization problem. To demonstrate it, consider the following code, which is a sequential program to sum numbers from 1 to n.


sum_sequential.c

#include <stdio.h>

long long sum (int n) {
    long long sum_result = 0;
    
    for (int i = 1; i <= n; i++) {
        sum_result += i;
    }

    return sum_result;
}

int main(void) {
    int n = 10000000;

    printf("Sum of 1 to %d is %lld\n", n, sum(n));

    return 0;
}


To optimize the solution, it is possible to separate this computation into two separate ones, one that sums the even numbers and the other that sums the odd ones. Achieving that using threads is not hard based on the previous examples.


sum_parallel.c

#include <stdio.h>
#include <pthread.h>

long long sum = 0;

void *sum_even(void *arg) {
    int n = *(int *)arg;

    for (int i = 2; i <= n; i += 2) {
        sum += i;
    }

    pthread_exit(NULL);
}

void *sum_odd(void *arg) {
    int n = *(int *)arg;

    for (int i = 1; i <= n; i += 2) {
        sum += i;
    }

    pthread_exit(NULL);
}

int main(void) {
    int n = 10000000;
    pthread_t t1, t2;

    pthread_create(&t1, NULL, sum_even, &n);
    pthread_create(&t2, NULL, sum_odd, &n);

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("Sum of 1 to %d is %lld\n", n, sum);

    return 0;
}


Unfortunately, that solution does not work. The problem occurs because there is an interval between getting the value the sum and writing a new value to it. Both threads are basically summing to the wrong value of sum, therefore the output is wrong.




As an analogy, think of a traffic intersection where multiple cars are crashing into each other to cross the road.



To solve that, there is a structure called semaphore. It works just like traffic lights and helps the threads avoid accessing the same place at the same time.






Semaphores are structures declared in the semaphore.h header file, and must be initialized by the function sem_init and freed by the function sem_destroy. Before initializing the semaphore, a variable with type sem_t must be created. 


/usr/include/semaphore.h

typedef union
{
  char __size[__SIZEOF_SEM_T];
  long int __align;
} sem_t;
  
  /* Initialize semaphore object SEM to VALUE.  If PSHARED then share it
   with other processes.  */
extern int sem_init (sem_t *__sem, int __pshared, unsigned int __value)
  __THROW __nonnull ((1));

/* Free resources associated with semaphore object SEM.  */
extern int sem_destroy (sem_t *__sem) __THROW __nonnull ((1));


The important part is that all the threads must be able to access that semaphore variable, so it must be located in the Data Memory or Heap Memory. In the following example, the semaphore is a global variable and is initialized in the main function before the threads are created. After the threads stop executing, the semaphore is properly freed with the function sem_destroy.


sum_parallel_semaphore.c

#include <stdio.h>
#include <pthread.h>
#include <semaphore.h>

long long sum = 0;
sem_t sem;

void *sum_even(void *arg) {
    int n = *(int *)arg;

    for (int i = 2; i <= n; i += 2) {
        sem_wait(&sem);
        sum += i;
        sem_post(&sem);
    }

    pthread_exit(NULL);
}

void *sum_odd(void *arg) {
    int n = *(int *)arg;

    for (int i = 1; i <= n; i += 2) {
        sem_wait(&sem);
        sum += i;
        sem_post(&sem);
    }

    pthread_exit(NULL);
}

int main(void) {
    int n = 10000000;
    pthread_t t1, t2;

    sem_init(&sem, 0, 1);

    pthread_create(&t1, NULL, sum_even, &n);
    pthread_create(&t2, NULL, sum_odd, &n);

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("Sum of 1 to %d is %lld\n", n, sum);

    sem_destroy(&sem);

    return 0;
}


As mentioned before sem_init function initializes the semaphore variable, it requires three parameters, the first parameter is the address of the semaphore variable that is being initialized, the third parameter is the initial value of the semaphore, and the second is the type of semaphore that is being initialized:

  • 0: Semaphore that will be used between threads
  • 1: Semaphore that will be used between processes
Semaphores work by adding and subtracting 1 from a variable. In the last example, the semaphore is initialized with the value 1, then, both threads are created. The sem_wait function will only execute if the semaphore is 1, otherwise, it will wait. If the first thread executes sem_wait, it will decrement the semaphore from 1 to 0.  Therefore, when the second thread executes sem_wait will hang. After the first thread updates the value of sum, it will call the function sem_post, which will increment the value of the semaphore from 0 to 1. At this time, the second thread will have its turn, blocking the first thread.


At first, the same synchronization problem that occurred to the sum variable also seems to occur to the semaphore variable. However, the sem_wait and sem_post are called Atomic Functions because they work with no interruption from other threads. In other words, the operating system guarantees that while it is handling the sem_wait and sem_post functions, no other thread is accessing the semaphore variable.

In order to test the speed of both programs, it is possible to create a macro that will calculate the time spent by the computer at some instructions, and then print it to the screen.


timeit.h

#ifndef TIMEIT_H
#define TIMEIT_H

#include <time.h>
#include <stdio.h>

#define TIME_IT(code) \
    do { \
        clock_t start = clock(); \
        code; \
        clock_t end = clock(); \
        double time_spent = ((double)(end - start)) / CLOCKS_PER_SEC; \
        printf("Time spent: %f\n", time_spent); \
    } while (0)
#endif


For example, to measure how long it takes for the parallel program to run, it is as easy as involving the pthread_create and pthread_join functions using the TIMEIT macro.


sum_parallel_semaphore.c

#include <stdio.h>
#include <pthread.h>
#include <semaphore.h>
#include "timeit.h"

long long sum = 0;
sem_t sem;

void *sum_even(void *arg) {
    int n = *(int *)arg;

    for (int i = 2; i <= n; i += 2) {
        sem_wait(&sem);
        sum += i;
        sem_post(&sem);
    }

    pthread_exit(NULL);
}

void *sum_odd(void *arg) {
    int n = *(int *)arg;

    for (int i = 1; i <= n; i += 2) {
        sem_wait(&sem);
        sum += i;
        sem_post(&sem);
    }

    pthread_exit(NULL);
}

int main(void) {
    int n = 10000000;
    pthread_t t1, t2;

    sem_init(&sem, 0, 1);

    TIME_IT(
        pthread_create(&t1, NULL, sum_even, &n);
        pthread_create(&t2, NULL, sum_odd, &n);

        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
    );

    printf("Sum of 1 to %d is %lld\n", n, sum);

    sem_destroy(&sem);

    return 0;
}


As a real example, some years ago I implemented a game, inspired by the Atari game Pong, as a single-player game using threads and ncurses to run it in the terminal. It was written in such a way that each player and ball was handled by a different thread. Take the code with some grain of salt because it was 3 years ago and my C language experience was close to zero. Nevertheless, you can find it in this repository.