The global resource isolated by PID namespaces is the process ID number space. This means that processes in different PID namespaces can have the same process ID.
As with processes on a traditional Linux (or UNIX) system, the process IDs within a PID namespace are unique, and are assigned sequentially starting with PID 1. Likewise, as on a traditional Linux system, PID 1—the init process—is special: it is the first process created within the namespace, and it performs certain management tasks within the namespace.
Support for live container migration: PID namespaces are used to implement containers that can be migrated between host systems while keeping the same process IDs for the processes inside the container.
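As for the first point: suppose that there were no PID namespaces between containers, so that all containers shared a single PID namespace. Every PID in one container would then be visible in all the others. This causes a serious problem: process operations in one container (commands such as ps, pstree, kill, and killall) would affect the others. The problem is especially dangerous when different workloads are co-located on one host, and it is hard to track down.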
2.1. First investigations
A new PID namespace is created by calling clone() with the CLONE_NEWPID flag. We'll show a simple example program that creates a new PID namespace using clone() and use that program to map out a few of the basic concepts of PID namespaces. The complete source of the program (pidns_init_sleep.c) can be found here. As with the previous article in this series, in the interests of brevity, we omit the error-checking code that is present in the full versions of the example program when discussing it in the body of the article.
The main program creates a new PID namespace using clone(), and displays the PID of the resulting child:
    child_pid = clone(childFunc,
                      child_stack + STACK_SIZE,  /* Points to start of
                                                    downwardly growing stack */
                      CLONE_NEWPID | SIGCHLD, argv);

    printf("PID returned by clone(): %ld\n", (long) child_pid);
The new child process starts execution in childFunc(), which receives the last argument of the clone() call (argv) as its argument. The purpose of this argument will become clear later.
The childFunc() function displays the process ID and parent process ID of the child created by clone() and concludes by executing the standard sleep program:
    printf("childFunc(): PID = %ld\n", (long) getpid());
    printf("childFunc(): PPID = %ld\n", (long) getppid());
    ...
    execlp("sleep", "sleep", "1000", (char *) NULL);
The main virtue of executing the sleep program is that it provides us with an easy way of distinguishing the child process from the parent in process listings.
When we run this program, the first lines of output are as follows:
    $ su                          # Need privilege to create a PID namespace
    Password:
    # ./pidns_init_sleep /proc2
    PID returned by clone(): 27656
    childFunc(): PID = 1
    childFunc(): PPID = 0
    Mounting procfs at /proc2
The first two lines of output from pidns_init_sleep show the PID of the child process from the perspective of two different PID namespaces: the namespace of the caller of clone() and the namespace in which the child resides. In other words, the child process has two PIDs: 27656 in the parent namespace, and 1 in the new PID namespace created by the clone() call.
The next line of output shows the parent process ID of the child, within the context of the PID namespace in which the child resides (i.e., the value returned by getppid()). The parent PID is 0, demonstrating a small quirk in the operation of PID namespaces. As we detail below, PID namespaces form a hierarchy: a process can "see" only those processes contained in its own PID namespace and in the child namespaces nested below that PID namespace. Because the parent of the child created by clone() is in a different namespace, the child cannot "see" the parent; therefore, getppid() reports the parent PID as being zero.
For an explanation of the last line of output from pidns_init_sleep, we need to return to a piece of code that we skipped when discussing the implementation of the childFunc() function.
2.2. /proc/PID and PID namespaces
Each process on a Linux system has a /proc/PID directory that contains pseudo-files describing the process. This scheme translates directly into the PID namespaces model. Within a PID namespace, the /proc/PID directories show information only about processes within that PID namespace or one of its descendant namespaces.
However, in order to make the /proc/PID directories that correspond to a PID namespace visible, the proc filesystem ("procfs" for short) needs to be mounted from within that PID namespace. From a shell running inside the PID namespace (perhaps invoked via the system() library function), we can do this using a mount command of the following form:
    # mount -t proc proc /mount_point
Alternatively, a procfs can be mounted using the mount() system call, as is done inside our program's childFunc() function:
    mkdir(mount_point, 0555);       /* Create directory for mount point */
    mount("proc", mount_point, "proc", 0, NULL);
    printf("Mounting procfs at %s\n", mount_point);
The mount_point variable is initialized from the string supplied as the command-line argument when invoking pidns_init_sleep.
In our example shell session running pidns_init_sleep above, we mounted the new procfs at /proc2. In real world usage, the procfs would (if it is required) usually be mounted at the usual location, /proc, using either of the techniques that we describe in a moment. However, mounting the procfs at /proc2 during our demonstration provides an easy way to avoid creating problems for the rest of the processes on the system: since those processes are in the same mount namespace as our test program, changing the filesystem mounted at /proc would confuse the rest of the system by making the /proc/PID directories for the root PID namespace invisible.
Thus, in our shell session the procfs mounted at /proc will show the PID subdirectories for the processes visible from the parent PID namespace, while the procfs mounted at /proc2 will show the PID subdirectories for processes that reside in the child PID namespace. In passing, it's worth mentioning that although the processes in the child PID namespace will be able to see the PID directories exposed by the /proc mount point, those PIDs will not be meaningful for the processes in the child PID namespace, since system calls made by those processes interpret PIDs in the context of the PID namespace in which they reside.
Having a procfs mounted at the traditional /proc mount point is necessary if we want various tools such as ps to work correctly inside the child PID namespace, because those tools rely on information found at /proc. There are two ways to achieve this without affecting the /proc mount point used by the parent PID namespace. First, if the child process is created using the CLONE_NEWNS flag, then the child will be in a different mount namespace from the rest of the system. In this case, mounting the new procfs at /proc would not cause any problems. Alternatively, instead of employing the CLONE_NEWNS flag, the child could change its root directory with chroot() and mount a procfs at /proc.
Let's return to the shell session running pidns_init_sleep. We stop the program and use ps to examine some details of the parent and child processes within the context of the parent namespace:
    ^Z                              Stop the program, placing in background
    + Stopped                       ./pidns_init_sleep /proc2
    # ps -C sleep -C pidns_init_sleep -o "pid ppid stat cmd"
      PID  PPID STAT CMD
    27655 27090 T    ./pidns_init_sleep /proc2
    27656 27655 S    sleep 600
The "PPID" value (27655) in the last line of output above shows that the parent of the process executing sleep is the process executing pidns_init_sleep.
By using the readlink command to display the (differing) contents of the /proc/PID/ns/pid symbolic links (explained in last week's article), we can see that the two processes are in separate PID namespaces:
    # readlink /proc/27655/ns/pid
    pid:
    # readlink /proc/27656/ns/pid
    pid:
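The same comparison can be made from a program. The following is a minimal sketch, not part of the article's example programs: the function name same_pid_ns() is hypothetical, and it relies on the documented rule that two processes are in the same namespace exactly when their /proc/PID/ns/pid links refer to the same device ID and inode number.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Return 1 if processes 'a' and 'b' are in the same PID namespace,
   0 if they are in different ones, and -1 if either
   /proc/PID/ns/pid file cannot be examined. */
int same_pid_ns(pid_t a, pid_t b)
{
    char path_a[64], path_b[64];
    struct stat sa, sb;

    snprintf(path_a, sizeof(path_a), "/proc/%ld/ns/pid", (long) a);
    snprintf(path_b, sizeof(path_b), "/proc/%ld/ns/pid", (long) b);

    /* stat() follows the symbolic link to the namespace object itself */
    if (stat(path_a, &sa) == -1 || stat(path_b, &sb) == -1)
        return -1;

    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

In the shell session above, a call such as same_pid_ns(27655, 27656) would be expected to return 0. Examining another process's ns files may require privilege, in which case the sketch returns -1.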
At this point, we can also use our newly mounted procfs to obtain information about processes in the new PID namespace, from the perspective of that namespace. To begin with, we can obtain a list of PIDs in the namespace using the following command:
    # ls -d /proc2/[1-9]*
    /proc2/1
As can be seen, the PID namespace contains just one process, whose PID (inside the namespace) is 1. We can also use the /proc/PID/status file as a different method of obtaining some of the same information about that process that we already saw earlier in the shell session:
    # cat /proc2/1/status | egrep '^(Name|PP*id)'
    Name:   sleep
    Pid:    1
    PPid:   0
The PPid field in the file is 0, matching the fact that getppid() reports that the parent process ID for the child is 0.
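As an illustration of reading these fields programmatically, here is a small sketch. The helper names status_field() and status_matches_syscalls() are hypothetical. The second function checks the correspondence described above, which should hold only when the procfs being read was mounted from the caller's own PID namespace.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Look up a numeric field such as "Pid" or "PPid" in a procfs status
   file. Returns 0 and stores the value in *value on success, or -1 if
   the file cannot be read or the field is absent. */
int status_field(const char *path, const char *field, long *value)
{
    char line[256];
    size_t len = strlen(field);
    int found = -1;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;

    while (fgets(line, sizeof(line), fp) != NULL) {
        /* The line must start with "<field>:", so "Pid" skips "PPid" */
        if (strncmp(line, field, len) == 0 && line[len] == ':') {
            if (sscanf(line + len + 1, "%ld", value) == 1)
                found = 0;
            break;
        }
    }

    fclose(fp);
    return found;
}

/* Return 1 if /proc/self/status agrees with getpid() and getppid(),
   0 if it does not, and -1 on a read error. */
int status_matches_syscalls(void)
{
    long pid, ppid;

    if (status_field("/proc/self/status", "Pid", &pid) == -1 ||
        status_field("/proc/self/status", "PPid", &ppid) == -1)
        return -1;

    return pid == (long) getpid() && ppid == (long) getppid();
}
```

Pointing status_field() at "/proc2/1/status" in the session above would be expected to yield 1 for "Pid" and 0 for "PPid".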
2.3. Nested PID namespaces
As noted earlier, PID namespaces are hierarchically nested in parent-child relationships. Within a PID namespace, it is possible to see all other processes in the same namespace, as well as all processes that are members of descendant namespaces. Here, "see" means being able to make system calls that operate on specific PIDs (e.g., using kill() to send a signal to a process). Processes in a child PID namespace cannot see processes that exist (only) in the parent PID namespace (or further removed ancestor namespaces).
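One way to probe this notion of visibility is to call kill() with signal number 0, which performs the usual lookup and permission checks without delivering a signal. The sketch below is illustrative only, and the function name pid_visible() is hypothetical.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Return 1 if 'pid' names a process that is visible from the caller's
   PID namespace, and 0 otherwise. Signal 0 is never delivered; the
   kernel performs only the lookup and permission checks. */
int pid_visible(pid_t pid)
{
    if (kill(pid, 0) == 0)
        return 1;

    /* EPERM: the process exists, but we are not allowed to signal it */
    return errno == EPERM;
}
```

A PID is interpreted in the caller's own namespace, so a number that belongs to a process in a parent namespace either fails with ESRCH here or refers to some unrelated process inside the caller's namespace.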
A process will have one PID in each of the layers of the PID namespace hierarchy starting from the PID namespace in which it resides through to the root PID namespace. Calls to getpid() always report the PID associated with the namespace in which the process resides.
We can use the program shown here (multi_pidns.c) to show that a process has different PIDs in each of the namespaces in which it is visible. In the interests of brevity, we will simply explain what the program does, rather than walking through its code.
The program recursively creates a series of child processes in nested PID namespaces. The command-line argument specified when invoking the program determines how many children and PID namespaces to create:
    # ./multi_pidns 5
In addition to creating a new child process, each recursive step mounts a procfs filesystem at a uniquely named mount point. At the end of the recursion, the last child executes the sleep program. The above command line yields the following output:
    Mounting procfs at /proc4
    Mounting procfs at /proc3
    Mounting procfs at /proc2
    Mounting procfs at /proc1
    Mounting procfs at /proc0
    Final child sleeping
Looking at the PIDs in each procfs, we see that each successive procfs "level" contains fewer PIDs, reflecting the fact that each PID namespace shows only the processes that are members of that PID namespace or its descendant namespaces:
    ^Z                              Stop the program, placing in background
    + Stopped                       ./multi_pidns 5
    # ls -d /proc4/[1-9]*           Topmost PID namespace created by program
    /proc4/1  /proc4/2  /proc4/3  /proc4/4  /proc4/5
    # ls -d /proc3/[1-9]*
    /proc3/1  /proc3/2  /proc3/3  /proc3/4
    # ls -d /proc2/[1-9]*
    /proc2/1  /proc2/2  /proc2/3
    # ls -d /proc1/[1-9]*
    /proc1/1  /proc1/2
    # ls -d /proc0/[1-9]*           Bottommost PID namespace
    /proc0/1
A suitable grep command allows us to see the PID of the process at the tail end of the recursion (i.e., the process executing sleep in the most deeply nested namespace) in all of the namespaces where it is visible:
    # grep -H 'Name:.*sleep' /proc?/[1-9]*/status
    /proc0/1/status:Name:       sleep
    /proc1/2/status:Name:       sleep
    /proc2/3/status:Name:       sleep
    /proc3/4/status:Name:       sleep
    /proc4/5/status:Name:       sleep
In other words, in the most deeply nested PID namespace (/proc0), the process executing sleep has the PID 1, and in the topmost PID namespace created (/proc4), that process has the PID 5.
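On kernels newer than those this article was written against (Linux 4.1 and later), /proc/PID/status also carries an NSpid line that lists the PID of a process in each nested namespace, starting with the namespace of the procfs mount being read. The parser below is a sketch under that assumption, and the function name parse_nspid() is hypothetical. For the sleep process viewed through /proc4, the line would presumably contain the values 5, 4, 3, 2, and 1.

```c
#include <stdlib.h>
#include <string.h>

/* Parse an "NSpid:" line from a procfs status file into pids[],
   outermost namespace first. Returns the number of PIDs stored
   (at most 'max'), or -1 if 'line' is not an NSpid line. */
int parse_nspid(const char *line, long *pids, int max)
{
    const char *p;
    char *end;
    int n = 0;

    if (strncmp(line, "NSpid:", 6) != 0)
        return -1;

    p = line + 6;
    while (n < max) {
        long v = strtol(p, &end, 10);   /* Skips leading tabs/spaces */

        if (end == p)                   /* No more numbers on the line */
            break;
        pids[n++] = v;
        p = end;
    }
    return n;
}
```

The last value stored is the PID in the namespace in which the process actually resides, that is, the value its own getpid() call reports.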
If you run the test programs shown in this article, it's worth mentioning that they will leave behind mount points and mount directories. After terminating the programs, shell commands such as the following should suffice to clean things up:
    # umount /proc?
    # rmdir /proc?
2.4. The PID namespace init process
The first process created inside a PID namespace gets a process ID of 1 within the namespace. This process has a similar role to the init process on traditional Linux systems. In particular, the init process can perform initializations required for the PID namespace as a whole (e.g., perhaps starting other processes that should be a standard part of the namespace) and becomes the parent for processes in the namespace that become orphaned.
In order to explain the operation of PID namespaces, we'll make use of a few purpose-built example programs. The first of these programs, ns_child_exec.c, has the following command-line syntax:
    ns_child_exec [options] command [arguments]
The ns_child_exec program uses the clone() system call to create a child process; the child then executes the given command with the optional arguments. The main purpose of the options is to specify new namespaces that should be created as part of the clone() call. For example, the -p option causes the child to be created in a new PID namespace, as in the following example:
    $ su                          # Need privilege to create a PID namespace
    Password:
    # ./ns_child_exec -p sh -c 'echo $$'
    1
That command line creates a child in a new PID namespace to execute a shell echo command that displays the shell's PID. With a PID of 1, the shell was the init process for the PID namespace that (briefly) existed while the shell was running.
Our next example program, simple_init.c, is a program that we'll execute as the init process of a PID namespace. This program is designed to allow us to demonstrate some features of PID namespaces and the init process.
The simple_init program performs the two main functions of init. One of these functions is "system initialization". Most init systems are more complex programs that take a table-driven approach to system initialization. Our (much simpler) simple_init program provides a simple shell facility that allows the user to manually execute any shell commands that might be needed to initialize the namespace; this approach also allows us to freely execute shell commands in order to conduct experiments in the namespace. The other function performed by simple_init is to reap the status of its terminated children using waitpid().
Thus, for example, we can use the ns_child_exec program in conjunction with simple_init to fire up an init process that runs in a new PID namespace:
    # ./ns_child_exec -p ./simple_init
    init$
The init$ prompt indicates that the simple_init program is ready to read and execute a shell command.
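The reaping duty of an init process described above can be sketched in a few lines. This is not the code of simple_init (which, judging by its verbose output, reaps children from a SIGCHLD handler); it is a simplified blocking loop, and the function names spawn_short_lived() and reap_children() are hypothetical.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create 'n' children that exit immediately. Returns the number of
   children that were successfully created. */
int spawn_short_lived(int n)
{
    int created = 0;

    for (int i = 0; i < n; i++) {
        pid_t pid = fork();

        if (pid == 0)
            _exit(0);               /* Child: terminate at once */
        if (pid > 0)
            created++;
    }
    return created;
}

/* Reap every child of the caller, blocking until none remain.
   Returns the number of children reaped. */
int reap_children(void)
{
    int reaped = 0;

    for (;;) {
        if (waitpid(-1, NULL, 0) > 0)
            reaped++;
        else if (errno != EINTR)    /* Typically ECHILD: none left */
            break;
    }
    return reaped;
}
```

Calling spawn_short_lived(3) followed by reap_children() should report three reaped children. A real init would keep running after the loop drains, since orphans adopted later also need to be reaped.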
We'll now use the two programs we've presented so far in conjunction with another small program, orphan.c, to demonstrate that processes that become orphaned inside a PID namespace are adopted by the PID namespace init process, rather than the system-wide init process.
The orphan program performs a fork() to create a child process. The parent process then exits while the child continues to run; when the parent exits, the child becomes an orphan. The child executes a loop that continues until it becomes an orphan (i.e., getppid() returns 1); once the child becomes an orphan, it terminates. The parent and the child print messages so that we can see when the two processes terminate and when the child becomes an orphan.
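The logic of such a program might look like the sketch below. It is not the source of orphan.c, and the function name orphan_sees_new_parent() is hypothetical. Instead of testing for a parent PID of 1, it waits until getppid() stops returning the original parent's PID. That is an assumption made so that the sketch also behaves sensibly outside a PID namespace, where the adopting process need not have PID 1.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Fork a "parent" that forks a "child" and then exits. The child polls
   getppid() until it no longer names the parent, then reports back
   through a pipe. Returns 1 if the child saw itself being reparented,
   0 if it did not, and -1 on error. */
int orphan_sees_new_parent(void)
{
    int fds[2];
    char flag = 0;
    pid_t parent;
    ssize_t n;

    if (pipe(fds) == -1)
        return -1;

    parent = fork();
    if (parent == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (parent == 0) {                      /* The "parent" process */
        pid_t self = getpid();

        if (fork() == 0) {                  /* The "child" process */
            struct timespec delay = { 0, 1000000 };     /* 1 ms */

            close(fds[0]);
            while (getppid() == self)       /* Not yet an orphan */
                nanosleep(&delay, NULL);
            flag = 1;
            _exit(write(fds[1], &flag, 1) == 1 ? 0 : 1);
        }
        _exit(0);                           /* Orphans the child */
    }

    close(fds[1]);
    waitpid(parent, NULL, 0);               /* Reap the "parent" */
    n = read(fds[0], &flag, 1);             /* Wait for child's report */
    close(fds[0]);
    return n == 1 && flag == 1;
}
```

Inside a PID namespace, the new parent observed by the child would be the namespace's init process (PID 1), which is also the process that eventually reaps the child.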
In order to see that our simple_init program reaps the orphaned child process, we'll employ that program's -v option, which causes it to produce verbose messages about the children that it creates and the terminated children whose status it reaps:
    # ./ns_child_exec -p ./simple_init -v
            init: my PID is 1
    init$ ./orphan
            init: created child 2
    Parent (PID=2) created child with PID 3
    Parent (PID=2; PPID=1) terminating
            init: SIGCHLD handler: PID 2 terminated
    init$                   # simple_init prompt interleaved with output from child
    Child (PID=3) now an orphan (parent PID=1)
    Child (PID=3) terminating
            init: SIGCHLD handler: PID 3 terminated
In the above output, the indented messages prefixed with init: are printed by the simple_init program's verbose mode. All of the other messages (other than the init$ prompts) are produced by the orphan program. From the output, we can see that the child process (PID 3) becomes an orphan when its parent (PID 2) terminates. At that point, the child is adopted by the PID namespace init process (PID 1), which reaps the child when it terminates.
2.5. Signals and init
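For example, if the init process has not established a handler for SIGTERM, other processes cannot deliver SIGTERM to it: by default, the kernel discards every such "unexpected" signal sent to init. This prevents the init process, whose presence is essential for the stable operation of the system, from being accidentally killed, even by the superuser.
Likewise, in implementing PID namespaces, the kernel preserves analogous behavior for the init process of a namespace (the process whose PID is 1). Other processes in the same namespace, even privileged ones, can send the init process only those signals for which it has explicitly established a handler. This prevents members of the namespace from inadvertently killing a process that has an essential role in the namespace.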
Note, however, that (as for the traditional init process) the kernel can still generate signals for the PID namespace init process in all of the usual circumstances (e.g., hardware exceptions, terminal-generated signals such as SIGTTOU, and expiration of a timer).
Signals can also (subject to the usual permission checks) be sent to the PID namespace init process by processes in ancestor PID namespaces. Again, only the signals for which the init process has established a handler can be sent, with two exceptions: SIGKILL and SIGSTOP. When a process in an ancestor PID namespace sends these two signals to the init process, they are forcibly delivered (and can't be caught). The SIGSTOP signal stops the init process; SIGKILL terminates it. Since the init process is essential to the functioning of the PID namespace, if the init process is terminated by SIGKILL (or it terminates for any other reason), the kernel terminates all other processes in the namespace by sending them a SIGKILL signal.
Normally, a PID namespace will also be destroyed when its init process terminates. However, there is an unusual corner case: the namespace won't be destroyed as long as a /proc/PID/ns/pid file for one of the processes in that namespace is bind mounted or held open. However, it is not possible to create new processes in the namespace (via setns() plus fork()): the lack of an init process is detected during the fork() call, which fails with an ENOMEM error (the traditional error indicating that a PID cannot be allocated). In other words, the PID namespace continues to exist, but is no longer usable.
2.6. Mounting a procfs filesystem (revisited)
In the previous article in this series, the /proc filesystems (procfs) for the PID namespaces were mounted at various locations other than the traditional /proc mount point. This allowed us to use shell commands to look at the contents of the /proc/PID directories that corresponded to each of the new PID namespaces while at the same time using the ps command to look at the processes visible in the root PID namespace.
However, tools such as ps rely on the contents of the procfs mounted at /proc to obtain the information that they require. Therefore, if we want ps to operate correctly inside a PID namespace, we need to mount a procfs for that namespace. Since the simple_init program permits us to execute shell commands, we can perform this task from the command line, using the mount command:
    # ./ns_child_exec -p -m ./simple_init
    init$ mount -t proc proc /proc
    init$ ps a
      PID TTY      STAT   TIME COMMAND
        1 pts/8    S      0:00 ./simple_init
        3 pts/8    R+     0:00 ps a
The ps a command lists all processes accessible via /proc. In this case, we see only two processes, reflecting the fact that there are only two processes running in the namespace.
When running the ns_child_exec command above, we employed that program's -m option, which places the child that it creates (i.e., the process running simple_init) inside a separate mount namespace. As a consequence, the mount command does not affect the /proc mount seen by processes outside the namespace.
2.7. unshare() and setns()
In the second article in this series, we described two system calls that are part of the namespaces API: unshare() and setns(). Since Linux 3.8, these system calls can be employed with PID namespaces, but they have some idiosyncrasies when used with those namespaces.
Specifying the CLONE_NEWPID flag in a call to unshare() creates a new PID namespace, but does not place the caller in the new namespace. Rather, any children created by the caller will be placed in the new namespace; the first such child will become the init process for the namespace.
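The behavior just described can be checked with a short sketch along the following lines. It is illustrative only: the function name first_child_pid_is_one() is hypothetical, and the call to unshare() is expected to fail (typically with EPERM) unless the caller has the necessary privilege.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create a new PID namespace with unshare() and fork a child into it.
   Returns 1 if the child saw its own PID as 1, 0 if it saw another
   value, and -1 if unshare(), fork(), or waitpid() failed or the
   caller's own PID unexpectedly changed. */
int first_child_pid_is_one(void)
{
    pid_t before = getpid();
    pid_t child;
    int status;

    if (unshare(CLONE_NEWPID) == -1)
        return -1;

    if (getpid() != before)         /* The caller itself has not moved */
        return -1;

    child = fork();
    if (child == -1)
        return -1;
    if (child == 0)                 /* Report our PID via exit status */
        _exit(getpid() == 1 ? 1 : 2);

    if (waitpid(child, &status, 0) == -1 || !WIFEXITED(status))
        return -1;

    return WEXITSTATUS(status) == 1;
}
```

Note that the child created here is the init process of the new namespace. Once it has exited, further fork() calls in the caller can be expected to fail with ENOMEM, for the reason given in the discussion of terminated init processes.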
The setns() system call now supports PID namespaces:
    setns(fd, 0);   /* Second argument can be CLONE_NEWPID to force a
                       check that 'fd' refers to a PID namespace */
The fd argument is a file descriptor that identifies a PID namespace that is a descendant of the PID namespace of the caller; that file descriptor is obtained by opening the /proc/PID/ns/pid file for one of the processes in the target namespace. As with unshare(), setns() does not move the caller to the PID namespace; instead, children that are subsequently created by the caller will be placed in the namespace.
We can use an enhanced version of the ns_exec.c program that we presented in the second article in this series to demonstrate some aspects of using setns() with PID namespaces that appear surprising until we understand what is going on. The new program, ns_run.c, has the following syntax:
    ns_run [-f] [-n /proc/PID/ns/FILE]... command [arguments]
The program uses setns() to join the namespaces specified by the /proc/PID/ns files contained within -n options. It then goes on to execute the given command with optional arguments. If the -f option is specified, it uses fork() to create a child process that is used to execute the command.
Suppose that, in one terminal window, we fire up our simple_init program in a new PID namespace in the usual manner, with verbose logging so that we are informed when it reaps child processes:
    # ./ns_child_exec -p ./simple_init -v
            init: my PID is 1
    init$
Then we switch to a second terminal window where we use the ns_run program to execute our orphan program. This will have the effect of creating two processes in the PID namespace governed by simple_init:
    # ps -C sleep -C simple_init
      PID TTY          TIME CMD
     9147 pts/8    00:00:00 simple_init
    # ./ns_run -f -n /proc/9147/ns/pid ./orphan
    Parent (PID=2) created child with PID 3
    Parent (PID=2; PPID=0) terminating
    #
    Child (PID=3) now an orphan (parent PID=1)
    Child (PID=3) terminating
Looking at the output from the "Parent" process (PID 2) created when the orphan program is executed, we see that its parent process ID is 0. This reflects the fact that the process that started the orphan process (ns_run) is in a different namespace—one whose members are invisible to the "Parent" process. As already noted in the previous article, getppid() returns 0 in this case.
The following diagram shows the relationships of the various processes before the orphan "Parent" process terminates. The arrows indicate parent-child relationships between processes.
Returning to the window running the simple_init program, we see the following output:
            init: SIGCHLD handler: PID 3 terminated
The "Child" process (PID 3) created by the orphan program was reaped by simple_init, but the "Parent" process (PID 2) was not. This is because the "Parent" process was reaped by its parent (ns_run) in a different namespace. The following diagram shows the processes and their relationships after the orphan "Parent" process has terminated and before the "Child" terminates.
It's worth emphasizing that setns() and unshare() treat PID namespaces specially. For other types of namespaces, these system calls do change the namespace of the caller. The reason that these system calls do not change the PID namespace of the calling process is that becoming a member of another PID namespace would cause the process's idea of its own PID to change, since getpid() reports the process's PID with respect to the PID namespace in which the process resides. Many user-space programs and libraries rely on the assumption that a process's PID (as reported by getpid()) is constant (in fact, the GNU C library getpid() wrapper function caches the PID); those programs would break if a process's PID changed. To put things another way: a process's PID namespace membership is determined when the process is created, and (unlike other types of namespace membership) cannot be changed thereafter.