Any Linux user has already used a shell to get something compiled or installed. But just some have already built a shell of their own. Therefore, in this post, I would like to explain how processes are created and how to manipulate process file descriptors. With that said, you can create your own shell that might work just like this:
cmd> /usr/bin/ls
1.in Makefile hey hey.c in.txt include out.txt output.txt shell src
cmd> /usr/bin/ls -l
total 72
-rw-rw-r-- 1 joao joao 76 Mar 12 19:50 1.in
-rw-rw-r-- 1 joao joao 137 Mar 11 19:17 Makefile
-rwxrwxr-x 1 joao joao 16064 Mar 15 14:21 hey
-rw-rw-r-- 1 joao joao 280 Mar 15 14:21 hey.c
-rw-rw-r-- 1 joao joao 5 Mar 12 19:47 in.txt
drwxrwxr-x 2 joao joao 4096 Mar 11 19:09 include
-rw-rw-r-- 1 joao joao 8 Mar 12 19:50 out.txt
-rw-rw-r-- 1 joao joao 11 Mar 11 23:13 output.txt
-rwxrwxr-x 1 joao joao 23320 Mar 12 19:48 shell
drwxrwxr-x 2 joao joao 4096 Mar 11 19:09 src
cmd> ./hey
Enter your name: Joao
Hello Joao!
cmd> ./hey Joao
Hello Joao
cmd> ./hey Joao > out.txt
cmd> /usr/bin/cat out.txt
Hello Joao
* Note that in that simple shell, the program path is used (e.g. /usr/bin/ls instead of the alias ls). That is done to simplify the shell, but a good exercise would be to use the PATH environment variable to allow users to use those aliases.
During some of the explanations, I find it easier to understand certain functions if we know how they are implemented. Thus, I will take examples from the very first version of Linux, v0.01 [1]. That is due to its simplicity, and even though the main algorithm might be the same nowadays, the code might not be as friendly.
What is a System Call
In this section, I want to take a second and explain what a system call is, and how a user might use it.
When you are writing code in C, you might have to call your own functions to solve specific tasks, such as a program that might calculate the power of a number:
int power(int a, int b) {
int result = 1;
for (int i = 0; i < b; i++) {
result *= a;
}
return result;
}
int main() {
int a = 2;
int b = 3;
printf("2 ^ 3 = %d\n", power(a, b));
return 0;
}
However, there are more complex tasks that you may want to use in a program, such as telling the shell to display a letter on the screen. Particularly, that is not a complex task to think about, but it is complex system-wise because you need to make your program communicate the letter to the shell so that it can display it on the screen.
Fortunately, someone very smart had this problem way before us, because they created a fundamental function called write() that allows your program to communicate that letter correctly. But how would I use it in a C code?
First of all, you would have to look at the manual page for that function to figure out what parameter it accepts, and which header files you have to include in your program to use it. You would face something like this:
#include <unistd.h>
ssize_t write(int fd, const void *buf, size_t count);
It simply means that, to call that system call you would have to indicate what file should the operating system make this write to, what buffer contains the data that will be used, and how many bytes should be used. As an example, there is a very simple program that uses the write() system call.
#include <unistd.h>
int main() {
char buffer[] = "Hello, World!\n";
write(1, buffer, sizeof(buffer));
}
After the buffer was defined, we simply told the operating system to write it to the stdout, where the shell might be expecting something as input to show the user.
Besides that, there are many other system calls shown on the syscalls manual page. Another example that can be used is the read() which has a similar reason to be used as the write() system call, that is why I am not spending time with it. Instead, I will show how the nanosleep() system call can be used.
Repeating the process that was done before, and looking at the nanosleep() manual page, we can see the parameters expected by that function.
#include <time.h>
int nanosleep(const struct timespec *req,
struct timespec * rem);
That function declaration means that the function must receive a struct containing the time that the process was requested to sleep for, and a struct that the OS will use in case there are any problems during execution but there is still some time remaining. Next, all that is left is knowing what is inside the struct timespec which is easily solved if you are using an IDE. Nevertheless, if you are not, you can check the time.h header file and figure out for yourself:
#ifndef _STRUCT_TIMESPEC
#define _STRUCT_TIMESPEC
struct timespec {
__kernel_old_time_t tv_sec; /* seconds */
long tv_nsec; /* nanoseconds */
};
#endif
Finally, with all variable types searched, using the syscall is as simple as that:
#include <time.h>
int main() {
struct timespec req = {2, 0};
struct timespec rem;
nanosleep(&req, &rem);
}
All things considered, we can go back to the idea of creating a shell.
Process Creation
In the example shown in the beginning, the program /usr/bin/ls was used to display the files in the current working directory. In this section, I will explain how to create a brand-new process with the content that is stored at /usr/bin/ls.
First of all, in the Linux operating system, the information about all processes is stored in a struct called task_struct:
struct task_struct {
/* these are hardcoded - don't touch */
long state; /* -1 unrunnable, 0 runnable, >0 stopped */
long counter;
long priority;
long signal;
fn_ptr sig_restorer;
fn_ptr sig_fn[32];
/* various fields */
int exit_code;
unsigned long end_code,end_data,brk,start_stack;
long pid,father,pgrp,session,leader;
unsigned short uid,euid,suid;
unsigned short gid,egid,sgid;
long alarm;
long utime,stime,cutime,cstime,start_time;
unsigned short used_math;
/* file system info */
int tty; /* -1 if no tty, so it must be signed */
unsigned short umask;
struct m_inode * pwd;
struct m_inode * root;
unsigned long close_on_exec;
struct file * filp[NR_OPEN];
/* ldt for this task 0 - zero 1 - cs 2 - ds&ss */
struct desc_struct ldt[3];
/* tss for this task */
struct tss_struct tss;
};
Looking at that struct you can find out what are all the features that are stored by the OS for each process that is created, such as the state, priority, id. All that is needed to be known by the system to decide which process is executed at each time.
You don't really need to understand what all of those fields mean, I put it there just for you to understand deeply what each system call is doing. Additionally, as you would expect of a mu
nniprogramming operating system, there is an array of task_struct, called task, to store all the created processes.
struct task_struct * task[NR_TASKS] = {&(init_task.task), };
Having said that, we should seek an operating system API (also known as system call) that adds a new process to that array. If you care to search for it, you will find the fork() system call, defined inside the Linux Kernel, which is the only routine that adds a newly allocated variable inside the array that stores all processes. For us to understand what it does, there is the assembly code that is executed when a user executes the fork() system call:
_sys_fork:
call _find_empty_process
testl %eax,%eax
js 1f
push %gs
pushl %esi
pushl %edi
pushl %ebp
pushl %eax
call _copy_process
addl $20,%esp
ret
Although there are many operations, we can focus just on the call operations that execute the functions _find_empty_process() and _copy_process(). Those function names are very interesting so we can find out what the syscall does by looking at them.
First, the OS looks for an empty place on the tasks array and then it defines a new task_struct to place it in there with exactly the same content that is inside the process that is called the fork() syscall. That behavior also gives the syscall its name, because it works like a "process mitosis".
Furthermore, it is possible to check even how the find_empty_process and copy_process functions work. However, besides some error-checking code, they work just like you would have thought they work:
int find_empty_process(void)
{
int i;
repeat:
if ((++last_pid)<0) last_pid=1;
for(i=0 ; i<NR_TASKS ; i++)
if (task[i] && task[i]->pid == last_pid) goto repeat;
for(i=1 ; i <NR_TASKS ; i++)
if (!task[i])
return i;
return -EAGAIN;
}
/*
* Ok, this is the main fork-routine. It copies the system process
* information (task[nr]) and sets up the necessary registers. It
* also copies the data segment in it's entirety.
*/
int copy_process(int nr,long ebp,long edi,long esi,long gs,long none,
long ebx,long ecx,long edx,
long fs,long es,long ds,
long eip,long cs,long eflags,long esp,long ss)
{
struct task_struct *p;
int i;
struct file *f;
p = (struct task_struct *) get_free_page();
if (!p)
return -EAGAIN;
*p = *current; /* NOTE! this doesn't copy the supervisor stack */
p->state = TASK_RUNNING;
p->pid = last_pid;
p->father = current->pid;
p->counter = p->priority;
p->signal = 0;
p->alarm = 0;
p->leader = 0; /* process leadership doesn't inherit */
p->utime = p->stime = 0;
p->cutime = p->cstime = 0;
p->start_time = jiffies;
p->tss.back_link = 0;
p->tss.esp0 = PAGE_SIZE + (long) p;
p->tss.ss0 = 0x10;
p->tss.eip = eip;
p->tss.eflags = eflags;
p->tss.eax = 0;
p->tss.ecx = ecx;
p->tss.edx = edx;
p->tss.ebx = ebx;
p->tss.esp = esp;
p->tss.ebp = ebp;
p->tss.esi = esi;
p->tss.edi = edi;
p->tss.es = es & 0xffff;
p->tss.cs = cs & 0xffff;
p->tss.ss = ss & 0xffff;
p->tss.ds = ds & 0xffff;
p->tss.fs = fs & 0xffff;
p->tss.gs = gs & 0xffff;
p->tss.ldt = _LDT(nr);
p->tss.trace_bitmap = 0x80000000;
if (last_task_used_math == current)
__asm__("fnsave %0"::"m" (p->tss.i387));
if (copy_mem(nr,p)) {
free_page((long) p);
return -EAGAIN;
}
for (i=0; i < NR_OPEN;i++)
if (f=p->filp[i])
f->f_count++;
if (current->pwd)
current->pwd->i_count++;
if (current->root)
current->root->i_count++;
set_tss_desc(gdt+(nr<<1)+FIRST_TSS_ENTRY,&(p->tss));
set_ldt_desc(gdt+(nr<< 1)+FIRST_LDT_ENTRY,&(p->ldt));
task[nr] = p; /* do this last, just in case */
return last_pid;
}
That might lead to a very interesting question: If all processes are forked, where does the first process come from? Well, if you look at your Linux Operating system to check which is the process with ID 1, you can see that it came from a program called init.
$ ps aux | head -n 2
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.1 0.0 167660 12848 ? Ss 11:57 0:43 /sbin/init splash
Moreover, you can look inside the Linux source code and find what the init.c file really is, but it essentially initializes some code and runs the first shell of the system:
void main(void) /* This really IS void, no error here. */
{ /* The startup routine assumes (well, ...) this */
/*
* Interrupts are still disabled. Do necessary setups, then
* enable them
*/
time_init();
tty_init();
trap_init();
sched_init();
buffer_init();
hd_init();
sti();
move_to_user_mode();
if (!fork()) { /* we count on this going ok */
init();
}
/*
* NOTE!! For any other task 'pause()' would mean we have to get a
* signal to awaken, but task0 is the sole exception (see 'schedule()')
* as task 0 gets activated at every idle moment (when no other tasks
* can run). For task0 'pause()' just means we go check if some other
* task can run, and if not we return here.
*/
for(;;) pause();
}
void init(void)
{
int i,j;
setup();
if (!fork())
_exit(("/bin/update",NULL,NULL));
(void) open("/dev/tty0",O_RDWR,0);
(void) dup(0);
(void) dup(0);
printf("%d buffers = %d bytes buffer space\n\r",NR_BUFFERS,
NR_BUFFERS*BLOCK_SIZE);
printf(" Ok.\n\r");
if ((i=fork()) < 0)
printf("Fork failed in init\r\n");
else if (!i) {
close(0);close(1);close(2);
setsid();
(void) open("/dev/tty0",O_RDWR,0);
(void) dup(0);
(void) dup(0);
_exit(execve("/bin/sh",argv,envp));
}
j=wait(&i);
printf("child %d died with code %04x\n",j,i);
sync();
_exit(0); /* NOTE! _exit, not exit() */
}
That process is the father of all processes that are executed. In fact, the process that executes the fork() operation is called "father" and the process that is created is called "child". Furthermore, that relationship is so important for the OS, that it is stored inside the task_struct as the parameter "father".
Back to the track, how do you use the fork() system call? Repeating the important process of looking at the manual page we can find the following declaration:
#include <unistd.h>
pid_t fork(void);
Quite easy, it means that I program like that executes correctly the fork system call:
#include <sys/types.h>
#include <unistd.h>
int main() {
fork();
return 0;
}
Right, but after the child process is created, where is the next line that the father process executes? And what about the child process? If we try adding a print function just behind the fork call, we can find that out:
#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>
int main() {
fork();
printf("Hey!\n");
return 0;
}
As we can see in the result, the print function will be executed twice, once by the father-process and later by the child-process. Interestingly, it is not executed necessarily in that order. In other words, the child-process can be executed first if the OS decides so.
But if both processes continue the execution just below the fork() function, it does not help us much if we can't control what each process executes independently. That is why the return value of the fork() function is used. If we take a look at what is return, we can see that there are different return values on each execution, so that we can control what each execution is doing.
int main() {
pid_t pid = fork();
printf("Hey! %d\n", pid);
return 0;
}
If we execute it many times we will see that one of the PIDs printed changes, but there is one that is always zero. Therefore, to understand which process receives each value, we can go back to the man page at the RETURN VALUE section from the manual page:
RETURN VALUE
On success, the PID of the child process is returned in the
parent, and 0 is returned in the child. On failure, -1 is
returned in the parent, no child process is created, and errno is
set to indicate the error.
All that means that the process that receives 0 is the child-process, and the process that something else is the father-process. So to control what each process executes, we can do something like that:
int main() {
pid_t pid = fork();
if (pid == 0) {
printf("Child-Process\n");
} else {
printf("Parent-Process\n");
}
return 0;
}
Also, if you look closely at the init function, you can see that this exact strategy was used to create a new shell. But now there is a tricky part, what happens when we want the child-process to ask for the user input?
int main() {
pid_t pid = fork();
if (pid == 0) {
printf("Child-Process\n");
int number;
printf("Enter a number: ");
scanf("%d", &number);
printf("You entered: %d\n", number);
} else {
printf("Parent-Process\n");
}
return 0;
}
You will figure out that the execution is canceled before you can try to type anything in the terminal. That is because both father and child processes are executed in parallel, but as soon as the father process is done, the shell prompt is executed again. Thus, to solve that we can execute another syscall in the father process for it to wait for the child to execute, such as the nanoseconds() syscall:
int main() {
pid_t pid = fork();
if (pid == 0) {
printf("Child-Process\n");
int number;
printf("Enter a number: ");
scanf("%d", &number);
printf("You entered: %d\n", number);
} else {
printf("Parent-Process\n");
struct timespec req = {10, 0};
struct timespec rem;
nanosleep(&req, &rem);
}
return 0;
}
Great, now you have 10 seconds to enter something inside the child process. However, the use case of having the father-process to wait for the child-process seems too regular for a solution like that. In fact, the solution to that process is to use another syscall called wait(), which was built exactly for this use case:
int main() {
pid_t pid = fork();
if (pid == 0) {
printf("Child-Process\n");
int number;
printf("Enter a number: ");
scanf("%d", &number);
printf("You entered: %d\n", number);
} else {
printf("Parent-Process\n");
wait(NULL);
}
return 0;
}
In that program, the father-process executes the print function and hangs until the child process is terminated.
Although using fork() that way is useful for parallel processing, there is another usage of process creation which is to create a new process with the content of another binary file. Going back to the shell idea, the process being executed is the shell process, so if the user might want to execute a binary to list the files in the current working directory (e.g. /usr/bin/ls) or a binary to print out the content of a specific file (e.g. /usr/bin/cat), we have to create a new process using fork() with the instructions inside those binary files.
Lucky for us, there is a syscall called execve() that can do just that. Referencing the manual page again we can find out the function declaration:
#include <unistd.h>
int execve(const char *pathname, char *const argv[], char *const envp[]);
That function will replace all the process memory with the content of the binary that you reference using the pathname variable. Consider the following example that uses the execve() syscall:
int main() {
char *const argv[] = {"/usr/bin/ls", "-l", NULL};
execve("/usr/bin/ls", argv, NULL);
printf("This line should not be printed\n");
return 0;
}
That function receives the path from the binary that you want it to use, and then it allows the user to give arguments to the binary using the argv array, and environment variables using the envp array. Those variables are the same ones that you can access if you create a C program like that:
int main(int argc, char *argv[], char *envp[]) {
/* Your Code Here */
return 0;
}
* Note that both arrays argv and envp arrays must finish with a NULL pointer, because otherwise, the program that is being executed will not be able to know when the array ends.
* Also, note that the process changes itself completely when the execve is executed, and even though the execve() example has a print function to execute after the execve, it will never be called.
As a result of that, we can improve the fork() example by making the child process execute another binary, just like that:
int main() {
pid_t pid = fork();
if (pid == 0) {
char pathname[] = "/usr/bin/ls";
char *argv[] = {pathname, "-l", NULL};
char *envp[] = {NULL};
execve(pathname, argv, envp);
} else {
wait(NULL);
}
return 0;
}
Manipulating File Descriptors
Next, the second feature of your Mini-Shell is being able to use the content of a file as input or redirect the output of the process to a certain file, just like in the example:
cmd> /bin/cat in.txt
Joao
cmd> ./hey < in.txt
Enter your name: Hello Joao!
cmd> ./hey < in.txt > out.txt
cmd> /bin/cat out.txt
Enter your name: Hello Joao!
Every time you open a file to write or read it, you receive a File Descriptor representing that file that can be used just inside the process that called open(). That File Descriptor is an integer number that serves as an identifier of that opened file. With that ID, you can tell the OS which file it is supposed to "read from" or "write to". Take the following example:
int main() {
int fd = open("in.txt", O_RDONLY);
printf("fd: %d\n", fd);
return 0;
}
If you run the example above you will see that the opened file is now identified as 3 to the process, and you will use that number on the read/write calls. Additionally, that value means that File Descriptors 0, 1, and 2 already exist. Indeed they do exist and they are:
- 0: Standard Input.
- 1: Standard Output.
- 2: Standard Error.
Those are not exactly files with bytes stored in your disk but they are files to the OS. And you use them when:
- 0: You want to read something from the user. So you can only read from it.
- 1: You want to write something to the user. So you can only write to it.
- 2: You want to write an error to the user. So you can only write to it.
There is a code to show an example of that usage:
int main() {
char buffer[1024] = {0};
ssize_t count = read(0, buffer, 1023);
if (count == -1) {
write(2, "An error occurred in the read.\n", 31);
} else {
write(1, buffer, count);
}
}
In summary, the next opened File Descriptor was 3 because it was the next on the list of available identifiers.
Furthermore, you can go back to the Linux source code to check exactly what happens to the task_struct of the process that calls the syscall to open a file:
int sys_open(const char * filename,int flag,int mode)
{
struct m_inode * inode;
struct file * f;
int i,fd;
mode &= 0777 & ~current->umask;
for(fd=0 ; fd<NR_OPEN ; fd++)
if (!current->filp[fd])
break;
if (fd>=NR_OPEN)
return -EINVAL;
current->close_on_exec &= ~(1<<fd);
f=0+file_table;
for (i=0 ; i<NR_FILE ; i++,f++)
if (!f->f_count) break;
if (i>=NR_FILE)
return -EINVAL;
(current->filp[fd]=f)->f_count++;
if ((i=open_namei(filename,flag,mode,&inode))<0) {
current->filp[fd]=NULL;
f->f_count=0;
return i;
}
/* ttys are somewhat special (ttyxx major==4, tty major==5) */
if (S_ISCHR(inode->i_mode))
if (MAJOR(inode->i_zone[0])==4) {
if (current->leader && current->tty<0) {
current->tty = MINOR(inode->i_zone[0]);
tty_table[current->tty].pgrp = current->pgrp;
}
} else if (MAJOR(inode->i_zone[0])==5)
if (current->tty<0) {
iput(inode);
current->filp[fd]=NULL;
f->f_count=0;
return -EPERM;
}
f->f_mode = inode->i_mode;
f->f_flags = flag;
f->f_count = 1;
f->f_inode = inode;
f->f_pos = 0;
return (fd);
}
It is pretty simple, inside the task_struct of the process, there is a variable called filp, which is an array that will store information about each file that is opened by that process. So the filp[0] will store information about the Standard Input File, the filp[1] will store information about the Standard Output File, and so on. And that is what the File Descriptor is, just an index in the filp array that is returned to the user.
When you call the open system call, that function looks for the next available spot in the filp array and creates a struct file with important information about the file that you requested to open. In the future, the OS will look at this information to access the correct memory addresses on your disk.
However, a certain file on your disk can be opened more than once by the same process, so that it has the same struct file in two different spots of the filp array:
int main() {
int fd = open("in.txt", O_RDONLY);
printf("fd: %d\n", fd);
int fd2 = open("in.txt", O_RDONLY);
printf("fd2: %d\n", fd2);
return 0;
}
That same process is done by the dup() system call, in the following way:
int main() {
int fd = open("in.txt", O_RDONLY);
printf("fd: %d\n", fd);
int fd2 = dup(fd);
printf("fd2: %d\n", fd2);
return 0;
}
The previous code will get the next free index at the array filp and duplicate the same struct file in there. Furthermore, there is a dup2() system call, that is an extension of dup() because it allows the user to choose which File Descriptor to duplicate it to. Consider the following code:
int main() {
int fd = open("in.txt", O_RDONLY);
printf("fd: %d\n", fd);
int fd2 = dup2(fd, 5);
printf("fd2: %d\n", fd2);
return 0;
}
In that code, we said that we want index 5 to store the struct from the file that we opened before. That is very serious, what might happen if we say that we want to duplicate it to the index 0?
int main() {
int fd = open("in.txt", O_RDONLY);
printf("fd: %d\n", fd);
int fd2 = dup2(fd, 0);
printf("fd2: %d\n", fd2);
return 0;
}
Remember that inside the index 0 of the filp array, there is information about the Standard Input File? It means that every time you use a function that tries to ask for information from the user (e.g. scanf()), it will use the File Descriptor 0 for it. Therefore, if you duplicate the File Descriptor of an ordinary file to that specific index 0, you will modify where those functions get the information from. For example:
int main() {
char buffer[1024] = {0};
int fd = open("in.txt", O_RDONLY);
int fd2 = dup2(fd, 0);
scanf("%[^\n]", buffer);
printf("Buffer: %s\n", buffer);
return 0;
}
In that example, you won't have to write anything to the console, because the scanf() will get the information it needs from the "in.txt" file.
The same result can be done both to the Standard Output FIle and the Standard Error File. Using the same logic, the following code example stores all the information printed to the "out.txt" file:
int main() {
int fd = open("out.txt", O_WRONLY | O_CREAT, 0644);
dup2(fd, 1);
printf("Hello, world!\n");
return 0;
}
Conclusion
That sums it all up, with all that information and some string parsing you will be able to create something that looks just like that simple shell I showed you in the beginning.
References