Saturday, March 30, 2024

How to create a Shell: Linux Pipes

 


Once the shell can execute other programs, the next step is to add features to it. One such feature is the pipe, which should behave like this:
$ ls -l
total 52
drwxrwxr-x 2 joao joao  4096 mar 11 19:09 include
-rw-rw-r-- 1 joao joao   137 mar 11 19:17 Makefile
-rwxrwxr-x 1 joao joao 28768 mar 30 08:49 shell
drwxrwxr-x 2 joao joao  4096 mar 29 18:19 src
drwxrwxr-x 2 joao joao  4096 mar 30 08:12 tests
$ ls -l | wc -l
6
$ ls -l | rev 
25 latot
edulcni 90:91 11 ram 6904  oaoj oaoj 2 x-rxwrxwrd
elifekaM 71:91 11 ram 731   oaoj oaoj 1 --r-wr-wr-
llehs 94:80 03 ram 86782 oaoj oaoj 1 x-rxwrxwr-
crs 91:81 92 ram 6904  oaoj oaoj 2 x-rxwrxwrd
stset 21:80 03 ram 6904  oaoj oaoj 2 x-rxwrxwrd
$ ls -l | rev | rev  
total 52
drwxrwxr-x 2 joao joao  4096 mar 11 19:09 include
-rw-rw-r-- 1 joao joao   137 mar 11 19:17 Makefile
-rwxrwxr-x 1 joao joao 28768 mar 30 08:49 shell
drwxrwxr-x 2 joao joao  4096 mar 29 18:19 src
drwxrwxr-x 2 joao joao  4096 mar 30 08:12 tests

In the first example that uses a pipe, all the output of the ls -l command goes to the wc -l command. Therefore, the wc command can count the number of lines printed by the ls command.


To do that, the shell has to create a pipe that receives data at one end and delivers it at the other end, so that another process can read it. To be more explicit, the output of ls -l is written into one end of the pipe, and wc -l reads its input from the other end.


Fortunately, all the work that is associated with the information flow is handled by the operating system. The only thing the shell has to do is to create the pipe and define what is at each end.

But how do we create a pipe? A pipe is easy to create and a little harder to use. To create one, call the pipe(2) syscall; its manual page includes an example program. For details about how pipes work, refer to the pipe(7) manual page.

One thing at a time. First, how do we call it?

#include <unistd.h>

int pipe(int pipefd[2]);

So, taking only the information from the beginning of the manual page, we could use it like this:

#include <unistd.h>

int main(int argc, char *argv[]) {
    int pipefd[2];

    /* Ask the OS for a new pipe; it fills in both file descriptors. */
    pipe(pipefd);

    return 0;
}

It still doesn't do much, but first, let us understand what the array pipefd is. Remember from the first post that File Descriptors are just integers that we use to tell the operating system where to read from or write to. The same idea applies to pipes: a File Descriptor tells the OS which end of the pipe we want to access. Each pipe has two ends, a Read End and a Write End.

So, the OS opens two different File Descriptors (stored at pipefd[0] and pipefd[1]), one for each end. We use pipefd[1] if we want to Write something to the pipe, and pipefd[0] if we want to Read something from the pipe.


First, here is an example using error handling and printing out each file descriptor returned by the pipe syscall:

#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    int pipefd[2];

    /* pipe() returns -1 on failure and sets errno. */
    if (pipe(pipefd) == -1) {
        perror("pipe");
        return 1;
    }

    printf("pipefd[0]: %d\n", pipefd[0]);
    printf("pipefd[1]: %d\n", pipefd[1]);

    return 0;
}

Next, here is a simple example of how to use the write and read syscalls with a pipe:

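#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    char write_msg[] = "Hello, World!";
    char read_msg[sizeof(write_msg)];

    int pipefd[2];

    if (pipe(pipefd) == -1) {
        perror("pipe");
        return 1;
    }

    /* sizeof counts the terminating '\0', so read_msg ends up a valid string. */
    write(pipefd[1], write_msg, sizeof(write_msg));

    read(pipefd[0], read_msg, sizeof(read_msg));

    printf("read_msg: \"%s\"\n", read_msg);

    return 0;
}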

* You can only write to pipefd[1] and read from pipefd[0].

Alternatively, you can think of a pipe as a buffer inside the OS that stores data and gives you two file descriptors to access it. What makes pipes special is that they can be shared between processes, so if you adapt the fork example from the first post, you can send data from one process to another:

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    int pipefd[2];
    pipe(pipefd);

    if (!fork()) {
        /* Child: writes into the pipe. */
        char write_buffer[] = "Hello!";
        write(pipefd[1], write_buffer, sizeof(write_buffer));
        printf("Child wrote: \"%s\"\n", write_buffer);
    } else {
        /* Parent: reads what the child wrote. */
        char buffer[1024] = {0};
        read(pipefd[0], buffer, sizeof(buffer));
        printf("Parent read: \"%s\"\n", buffer);
        wait(NULL);
    }

    return 0;
}

That example works: the Parent Process uses the Read End of the pipe and the Child Process uses the Write End, so the two processes can communicate.

Since the fork(2) syscall copies the process, each process has its own copy of pipefd[0] and pipefd[1]. However, neither process uses both File Descriptors, so each one should close the File Descriptor it does not use:

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    int pipefd[2];
    pipe(pipefd);

    if (!fork()) {
        char write_buffer[] = "Hello!";
        close(pipefd[0]); /* the child never reads */
        write(pipefd[1], write_buffer, sizeof(write_buffer));
        printf("Child wrote: \"%s\"\n", write_buffer);
    } else {
        char buffer[1024] = {0};
        close(pipefd[1]); /* the parent never writes */
        read(pipefd[0], buffer, sizeof(buffer));
        printf("Parent read: \"%s\"\n", buffer);
        wait(NULL);
    }

    return 0;
}

It is always good practice to close unused file descriptors, especially when handling multiprocessing and pipes. That is because if any process still has pipefd[1] (the Write End) open, the OS assumes that more data might still be written, so a reader never sees end-of-file and the program hangs, causing errors that are difficult to track because of multiprocessing.

Taking this a few steps further, suppose you want to reproduce the pipe operation we ran in the shell earlier:

$ ls -l | wc -l

To do that, you have to create one pipe, execute both ls and wc, redirect everything that ls writes to stdout into the Write End of the pipe, and use whatever arrives at the Read End of the pipe as stdin for wc. The dup2(2) syscall performs the redirection by making file descriptor 1 (stdout) or 0 (stdin) refer to the chosen end of the pipe:

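#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    int pipefd[2];
    pipe(pipefd);

    if (!fork()) {
        close(pipefd[0]);

        /* stdout (fd 1) now refers to the Write End of the pipe. */
        dup2(pipefd[1], 1);

        execl("/bin/ls", "ls", "-l", (char *)NULL);
    } else {
        close(pipefd[1]);

        /* stdin (fd 0) now refers to the Read End of the pipe. */
        dup2(pipefd[0], 0);

        execl("/usr/bin/wc", "wc", "-l", (char *)NULL);
    }

    /* Only reached if execl fails. */
    printf("This line should not execute!\n");

    return 0;
}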

If you don't close the unused File Descriptors, pipes start to become a problem. Take a moment to try the following code and see that it does not finish:

#include <stdio.h>
#include <unistd.h>

/* Deliberately broken: no unused File Descriptor is closed. */
int main(int argc, char *argv[]) {
    int pipefd[2];
    pipe(pipefd);

    if (!fork()) {
        dup2(pipefd[1], 1);

        execl("/bin/ls", "ls", "-l", (char *)NULL);
    } else {
        dup2(pipefd[0], 0);

        execl("/usr/bin/wc", "wc", "-l", (char *)NULL);
    }

    printf("This line should not execute!\n");

    return 0;
}

Since the parent process, which became wc, still has pipefd[1] (the Write End) open, the pipe never reports end-of-file. The wc program therefore keeps waiting for more input and never finishes.

The version that closes the unused File Descriptors executes the pipe operation correctly, but what if we want our program to continue after running ls -l | wc -l? Both processes were replaced by execl, so nothing of our own program is left running. To keep it alive, you have to fork() one more time, like this:

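#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[]) {

    if (!fork()) {
        /* This child runs the whole pipeline. */
        int pipefd[2];
        pipe(pipefd);

        if (!fork()) {
            close(pipefd[0]);

            dup2(pipefd[1], 1);

            execl("/bin/ls", "ls", "-l", (char *)NULL);
        } else {
            close(pipefd[1]);

            dup2(pipefd[0], 0);

            execl("/usr/bin/wc", "wc", "-l", (char *)NULL);
        }
    } else {
        /* The original process survives and waits for wc. */
        wait(NULL);
    }

    printf("This line should execute!\n");

    return 0;
}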

The downside of that solution is that the if-else blocks nest one level deeper with every fork, so adding one more pipe quickly becomes messy. To avoid that, you can get the same result as the last example by forking each command directly from the parent:

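#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    int pipefd[2];
    pipe(pipefd);

    if (!fork()) {
        close(pipefd[0]);

        dup2(pipefd[1], 1);

        execl("/bin/ls", "ls", "-l", (char *)NULL);
        _exit(1); /* only reached if execl fails; never fall through */
    }

    if (!fork()) {
        close(pipefd[1]);

        dup2(pipefd[0], 0);

        execl("/usr/bin/wc", "wc", "-l", (char *)NULL);
        _exit(1); /* only reached if execl fails */
    }

    /* The parent uses neither end of the pipe. */
    close(pipefd[0]);
    close(pipefd[1]);

    /* Wait for both children, not just the first one to finish. */
    wait(NULL);
    wait(NULL);

    printf("This line should execute!\n");

    return 0;
}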

So now, if you want to chain more pipes, such as:

ls -l | rev | wc -l

You would have to create two pipes, doing as follows:

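#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    int pipefd1[2];
    pipe(pipefd1);

    if (!fork()) {
        close(pipefd1[0]);

        dup2(pipefd1[1], 1);

        execl("/bin/ls", "ls", "-l", (char *)NULL);
        _exit(1); /* only reached if execl fails */
    }

    int pipefd2[2];
    pipe(pipefd2);

    if (!fork()) {
        close(pipefd1[1]);
        close(pipefd2[0]);

        dup2(pipefd1[0], 0);
        dup2(pipefd2[1], 1);

        execl("/usr/bin/rev", "rev", (char *)NULL);
        _exit(1);
    }

    /* Close the first pipe before forking wc, so wc does not inherit it. */
    close(pipefd1[0]);
    close(pipefd1[1]);

    if (!fork()) {
        close(pipefd2[1]);

        dup2(pipefd2[0], 0);

        execl("/usr/bin/wc", "wc", "-l", (char *)NULL);
        _exit(1);
    }

    close(pipefd2[0]);
    close(pipefd2[1]);

    /* One wait() per child: ls, rev and wc. */
    wait(NULL);
    wait(NULL);
    wait(NULL);

    printf("This line should execute!\n");

    return 0;
}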

And that should be enough information to extend your shell with the pipe feature.

