Namespaces series 2: mnt namespace

Time: January 16, 2016
Category: namespaces

mnt namespaces are one of the core building blocks of container filesystems: a mnt namespace gives a container its own independent view of the filesystem.

Besides mnt namespaces themselves, this post also covers shared subtrees, which allow mount and unmount events to be propagated between mount namespaces in an automatic, controlled fashion.

1. Introduction

The mnt namespace was the first namespace type added to Linux, appearing in 2002 in Linux 2.4.19. Its purpose is to isolate the set of mount points seen by different groups of processes, meaning that processes in different namespaces see and are able to manipulate different views of the single directory hierarchy.

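Isolation here means several things:

  1. All processes in the same mnt namespace always see the same mount points
  2. Processes in different mnt namespaces do not necessarily see the same mount points
  3. mount and umount operations performed in different mnt namespaces are invisible to each other

The "not necessarily" in point 2 is there because a newly created mnt namespace starts out with a copy of everything in its parent namespace. As point 3 says, though, any mount or umount performed afterwards has nothing to do with the parent namespace or with any other namespace.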

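When the operating system boots, the kernel sets up a root mnt namespace, also called the "initial namespace". Every later mnt namespace is created by the clone() system call with the CLONE_NEWNS flag (unshare() accepts the same flag). When a new mount namespace is created, it receives a copy of the mount point list replicated from the namespace of the caller of clone().

Once a new mnt namespace exists, any mount or umount operation takes effect only inside that mnt namespace. Only processes in that namespace see the resulting changes to the mount points; processes in other mnt namespaces do not.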
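One way to check which mnt namespace a process belongs to is to read the /proc/PID/ns/mnt symlink: two processes are in the same mnt namespace exactly when the link targets match. The sketch below assumes a kernel new enough to expose these links (Linux 3.8 or later) and a mounted /proc; the helper name mnt_ns_id is made up for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: writes the mnt namespace identifier of process
 * `pid` (a string of the form "mnt:[<inode number>]") into buf.
 * Returns 0 on success, -1 on error (errno is set by readlink). */
int mnt_ns_id(pid_t pid, char *buf, size_t len)
{
    char path[64];

    assert(buf != NULL && len > 1);
    snprintf(path, sizeof(path), "/proc/%ld/ns/mnt", (long)pid);

    /* readlink() does not NUL-terminate, so leave room and do it here. */
    ssize_t n = readlink(path, buf, len - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return 0;
}
```

Calling mnt_ns_id(getpid(), ...) in a parent and in a child created with CLONE_NEWNS should yield two different strings, while two processes in the same namespace yield the same one.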

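mnt namespaces make a lot of things possible, for example:

  1. Giving each xx its own independent filesystem view, where xx can be a user, a "container", and so on
  2. Combining a pid namespace with a remounted proc filesystem to isolate pids
  3. And so on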

2. Basic usage

2.1. Creating a namespace

Creating a new mnt namespace takes just one thing: call the clone() system call with the CLONE_NEWNS flag, as follows:

    child_pid = clone(childFunc,
                    child_stack + STACK_SIZE,   /* Points to start of
                                                   downwardly growing stack */
                    CLONE_NEWNS | SIGCHLD, argv[1]);

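You can start a bash process from childFunc, which makes it easier to see how the mnt namespace affects mount points: childFunc, the bash process it starts, and every command you run in that bash all live in the new mnt namespace created by the code above.

You will find that (as long as the mount points involved are private; 2.3 covers the shared case):

  1. mount and umount operations performed inside this namespace are not visible on the host
  2. Likewise, mount and umount operations performed on the host are not visible to processes in this mnt namespace (your bash process, for example)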
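For experimenting, the clone() fragment above can be wrapped into a complete helper. This is only a sketch under a few assumptions: the name run_in_new_mnt_ns is invented here, the stack size is an arbitrary 1 MiB, and the caller needs CAP_SYS_ADMIN (typically root), otherwise clone() fails with EPERM.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)   /* arbitrary size for the child stack */

/* Runs inside the new mnt namespace; arg is the program to exec. */
static int childFunc(void *arg)
{
    char *const argv[] = { (char *)arg, NULL };

    execvp(argv[0], argv);
    perror("execvp");              /* only reached if exec fails */
    return 127;
}

/* Hypothetical helper: starts `prog` (for example "bash") in a new mnt
 * namespace and waits for it. Returns the child's exit status, or -1 if
 * clone() or waitpid() failed or the child did not exit normally. */
int run_in_new_mnt_ns(const char *prog)
{
    assert(prog != NULL);

    char *child_stack = malloc(STACK_SIZE);
    if (child_stack == NULL)
        return -1;

    pid_t child_pid = clone(childFunc,
                            child_stack + STACK_SIZE,  /* stack grows down */
                            CLONE_NEWNS | SIGCHLD, (void *)prog);
    if (child_pid == -1) {
        int saved = errno;         /* keep clone()'s errno across free() */
        free(child_stack);
        errno = saved;
        return -1;
    }

    int status = 0;
    pid_t w = waitpid(child_pid, &status, 0);
    free(child_stack);             /* child has its own copy (no CLONE_VM) */
    if (w == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_in_new_mnt_ns("bash") as root drops you into a shell where mount and umount can be tried out; exiting the shell returns to the caller.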

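2.2. How do we build an independent filesystem view?

Now that we know how a mnt namespace is created and how it basically works, consider a simple requirement: how do we build an independent filesystem view?

The requirement might come from a multi-tenant server, such as a shared development machine. We want to isolate users' filesystems from one another without installing virtual machines such as kvm, which are heavyweight and waste resources. All we need is for filesystem operations by different users not to affect each other.

What to do?

Simple: chroot + mnt namespace is enough.

Step 1: Prepare an operating system image. You can use a ready-made one from docker.io or build one yourself. Building one is easy: create a directory and copy the current system's /usr, /lib, /lib64, /var, /etc, /sbin and similar directories into it.

Step 2: Use the code above to create a new mnt namespace, start a bash process (handy for debugging), and finally chroot into the directory you just created.

Step 3: Mount the basic kernel filesystems, such as procfs, devfs, and sysfs.

Step 4: At this point you are essentially inside a filesystem view that is fully isolated from other users and from the host.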
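Steps 2 and 3 can be sketched in C roughly as follows. Several things here are assumptions rather than part of the recipe above: the helper name enter_rootfs is invented, unshare() is used instead of clone() so that the calling process itself moves into the new namespace, the image is assumed to contain empty /proc, /sys and /dev directories, devtmpfs stands in for "devfs", and everything is first marked private as a precaution so the new mounts cannot propagate back to the host (propagation is the topic of 2.3). It needs root.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <sys/mount.h>
#include <unistd.h>

/* Hypothetical helper: moves the caller into a new mnt namespace rooted
 * at `rootfs`. Returns 0 on success, or -1 at the first failing step
 * (errno is set by the failing call). */
int enter_rootfs(const char *rootfs)
{
    assert(rootfs != NULL);

    /* Step 2a: give the calling process its own mnt namespace. */
    if (unshare(CLONE_NEWNS) == -1)
        return -1;

    /* Precaution (assumption, see lead-in): recursively mark every mount
     * private, the equivalent of `mount --make-rprivate /`. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
        return -1;

    /* Step 2b: switch the root directory to the prepared image. */
    if (chroot(rootfs) == -1 || chdir("/") == -1)
        return -1;

    /* Step 3: mount the basic kernel filesystems inside the image. */
    if (mount("proc", "/proc", "proc", 0, NULL) == -1 ||
        mount("sysfs", "/sys", "sysfs", 0, NULL) == -1 ||
        mount("devtmpfs", "/dev", "devtmpfs", 0, NULL) == -1)
        return -1;

    return 0;
}
```

After enter_rootfs() returns 0, the process can exec a shell from the image; the three mounts exist only in the new namespace and disappear with it.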

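2.3. shared subtrees

Once mnt namespaces are in use, users often run into this problem: because mnt namespaces are fully isolated, routine operations performed on the host (the initial namespace) are completely invisible inside other mnt namespaces, which hurts usability.

For example:

  1. A new disk is mounted on the machine, but the other mnt namespaces never see it unless the device is mounted again inside each of them.

But if a mount or umount operation in one mnt namespace could be made visible to all (or some) other mnt namespaces, the problem becomes very simple: the new disk only needs to be mounted once on the host, and every other mnt namespace can see it.

shared subtrees were designed to solve exactly this problem. The shared subtrees feature was added in Linux 2.6.15 (in early 2006, around three years after the initial implementation of mount namespaces).

The biggest benefit of shared subtrees is automatic propagation of mount and umount events between mnt namespaces: mount and umount operations performed in one mnt namespace become visible in the other "associated" mnt namespaces.

Propagation type and peer group are the two key concepts in shared subtrees.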

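2.3.1. propagation type

Every mount point has a propagation type, which determines whether mount points created or removed under it are propagated to the other mount points in the same peer group (explained in detail below), that is, whether the corresponding mount points are automatically created or removed under the other members of the peer group.

shared subtrees define four propagation types: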

  • MS_SHARED: This mount point shares mount and unmount events with other mount points that are members of its "peer group" (which is described in more detail below). When a mount point is added or removed under this mount point, this change will propagate to the peer group, so that the mount or unmount will also take place under each of the peer mount points. Propagation also occurs in the reverse direction, so that mount and unmount events on a peer mount will also propagate to this mount point
  • MS_SLAVE: This propagation type sits midway between shared and private. A slave mount has a master—a shared peer group whose members propagate mount and unmount events to the slave mount. However, the slave mount does not propagate events to the master peer group.
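  • MS_PRIVATE: The kernel's default propagation type, used when no propagation flag is given at mount time and the parent mount is not shared. A private mount has no peer group, so it neither sends nor receives mount and umount events.
  • MS_UNBINDABLE: Like a private mount, but it additionally cannot be used as the source of a bind mount. It is rarely needed, so it is not covered further here.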

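Also note:

  1. The propagation type is an attribute of a mount point, not of a mnt namespace. Within one mnt namespace, some mount points can be marked shared and others private.
  2. The propagation type only determines whether mount and umount events directly under the current mount point are propagated to its peers. So if we create a mount point Y under a shared mount point X, Y immediately appears under every mount point in X's peer group. How mount and umount events under Y propagate, however, is governed only by Y's own propagation type, not by X's. Similarly, when a mount point is unmounted, whether that event is propagated depends only on the propagation type of its parent mount point.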

Finally, it is possible for a mount to be both the slave of a master peer group as well as sharing events with a set of peers of its own—a so-called slave-and-shared mount. In this case, the mount might receive propagation events from the master, and those events would then be propagated to its peers.

The propagation type of a mount point can be set directly with the mount command:

mount --make-shared mountpoint
mount --make-slave mountpoint
mount --make-private mountpoint
mount --make-unbindable mountpoint
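From a program, the same change goes through mount(2) with one of the MS_SHARED, MS_SLAVE, MS_PRIVATE or MS_UNBINDABLE flags; adding MS_REC applies it to the whole subtree, like the --make-rshared family of options. The wrapper below is a minimal sketch: set_propagation is a name made up here, and the path must be an existing mount point.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

/* Hypothetical wrapper: changes the propagation type of the mount at
 * `path`. `type` must be exactly one of MS_SHARED, MS_SLAVE, MS_PRIVATE
 * or MS_UNBINDABLE; a non-zero `recursive` adds MS_REC.
 * Returns 0 on success, -1 on error with errno set. */
int set_propagation(const char *path, unsigned long type, int recursive)
{
    assert(path != NULL);

    if (type != MS_SHARED && type != MS_SLAVE &&
        type != MS_PRIVATE && type != MS_UNBINDABLE) {
        errno = EINVAL;
        return -1;
    }

    /* source, filesystem type and data are ignored for this operation */
    return mount(NULL, path, NULL, type | (recursive ? MS_REC : 0), NULL);
}
```

For example, set_propagation("/mnt", MS_SHARED, 0) corresponds to `mount --make-shared /mnt`, and set_propagation("/", MS_PRIVATE, 1) to `mount --make-rprivate /`.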

The following shared mount and slave mount examples show how the propagation type affects the propagation of mount and umount events.

1) shared mount

A shared mount can be replicated to any number of mount points, and all the replicas continue to be exactly the same.

    Here is an example:

    Let's say /mnt has a mount that is shared.
    mount --make-shared /mnt

    Note: mount(8) command now supports the --make-shared flag,
    so the sample 'smount' program is no longer needed and has been
    removed.

    # mount --bind /mnt /tmp
    The above command replicates the mount at /mnt to the mountpoint /tmp
    and the contents of both the mounts remain identical.

    #ls /mnt
    a b c

    #ls /tmp
    a b c

    Now let's say we mount a device at /tmp/a
    # mount /dev/sd0  /tmp/a

    #ls /tmp/a
    t1 t2 t3

    #ls /mnt/a
    t1 t2 t3

    Note that the mount has propagated to the mount at /mnt as well.

    And the same is true even when /dev/sd0 is mounted on /mnt/a. The
    contents will be visible under /tmp/a too.

2) slave mount

A slave mount is like a shared mount except that mount and umount events only propagate towards it.

    All slave mounts have a master mount which is shared.

    Here is an example:

    Let's say /mnt has a mount which is shared.
    # mount --make-shared /mnt

    Let's bind mount /mnt to /tmp
    # mount --bind /mnt /tmp

    the new mount at /tmp becomes a shared mount and it is a replica of
    the mount at /mnt.

    Now let's make the mount at /tmp a slave of /mnt
    # mount --make-slave /tmp

    let's mount /dev/sd0 on /mnt/a
    # mount /dev/sd0 /mnt/a

    #ls /mnt/a
    t1 t2 t3

    #ls /tmp/a
    t1 t2 t3

    Note the mount event has propagated to the mount at /tmp

    However let's see what happens if we mount something on the mount at /tmp

    # mount /dev/sd1 /tmp/b

    #ls /tmp/b
    s1 s2 s3

    #ls /mnt/b

    Note how the mount event has not propagated to the mount at /mnt

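2.3.2. peer group

Put simply, a peer group is a set of "identical" mount points. Within a peer group, a mount or umount performed under any member is visible under all the other members, as the shared mount example in 2.3.1 shows.

A peer group comes into being in two ways:

  1. mount --bind automatically puts the source mount point and the destination mount point into the same peer group; further bind mounts add the new mount points to the same group. Mount points in one peer group are therefore essentially the same "mount point".
  2. When a namespace is created, the kernel clones every mount point of the original namespace for the new one, so each mount point in the new mnt namespace automatically joins the peer group of its counterpart in the original mnt namespace. Note that this does not mean all mount points of the two namespaces belong to a single peer group: each mount point has its own peer group, shared with the same mount point in the original mnt namespace.

An important precondition applies in both cases: whether via mount --bind or via clone + CLONE_NEWNS, the source mount point's propagation type must be shared, otherwise no peer group is formed. For details see Documentation/filesystems/sharedsubtree.txt

When a mount point is unmounted, it automatically leaves its peer group. Likewise, when the last process in a mnt namespace exits, the kernel removes that namespace's mount points from their peer groups (and destroys them).

For example, suppose our shell runs in the initial mnt namespace. We mark the / mount point private and then create two new mount points: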

sh1# mount --make-private / 
sh1# mount --make-shared /dev/sda3 /X 
sh1# mount --make-shared /dev/sda5 /Y

Then, in another shell, we use the unshare command to create a new mnt namespace:

sh2# unshare -m --propagation unchanged sh

(The -m option creates a new mount namespace; the purpose of the --propagation unchanged option is explained later.)

The situation is a little complex. From the kernel's perspective, the default when a new device mount is created is as follows:

  • If the mount point has a parent (i.e., it is a non-root mount point) and the propagation type of the parent is MS_SHARED, then the propagation type of the new mount is also MS_SHARED.
  • Otherwise, the propagation type of the new mount is MS_PRIVATE.
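The two rules can be written down as a tiny model. This is purely illustrative and not kernel code: the struct and function names are invented, and the model only tracks which default type each newly created mount would get, given the type of the root mount.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mount.h>

/* Illustrative model of a mount (not a kernel structure). */
struct mnt {
    int parent;            /* index of the parent mount, -1 for the root */
    unsigned long type;    /* MS_SHARED or MS_PRIVATE */
};

/* Applies the default rule to mounts[1..n-1]: a new mount is MS_SHARED
 * if its parent is MS_SHARED, and MS_PRIVATE otherwise. mounts[0] is the
 * root, whose type is taken as given; parents must precede children. */
void apply_default_propagation(struct mnt *mounts, size_t n)
{
    assert(mounts != NULL);

    for (size_t i = 1; i < n; i++) {
        int p = mounts[i].parent;

        assert(p >= 0 && (size_t)p < i);
        mounts[i].type = (mounts[p].type == MS_SHARED) ? MS_SHARED
                                                       : MS_PRIVATE;
    }
}
```

With a private root every descendant ends up private, and with a shared root every descendant ends up shared, so the type chosen for the root mount decides the default for the whole tree.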

According to these rules, the root mount would be MS_PRIVATE, and all descendant mounts would by default also be MS_PRIVATE. However, MS_SHARED would arguably have been a better default, since it is the more commonly employed propagation type. For that reason, systemd sets the propagation type of all mount points to MS_SHARED. Thus, on most modern Linux distributions, the default propagation type is effectively MS_SHARED. This is not the final word on the subject, however, since the util-linux unshare utility also has something to say. When creating a new mount namespace, unshare assumes that the user wants a fully isolated namespace, and makes all mount points private by performing the equivalent of the following command (which recursively marks all mounts under the root directory as private):

    mount --make-rprivate /

To prevent this, we can use an additional option when creating the new namespace:

    unshare -m --propagation unchanged <cmd>

Back in the first shell, we create a bind mount:

sh1# mkdir /Z 
sh1# mount --bind /X /Z 

The result looks like this:

[Shared mount point peer groups example]

Here there are two peer groups:

  • The first peer group contains the mount points X, X' (the duplicate of mount point X that was created when the second namespace was created), and Z (the bind mount created from the source mount point X in the initial namespace).
  • The second peer group contains the mount points Y and Y' (the duplicate of mount point Y that was created when the second namespace was created).

Note that the bind mount Z, which was created in the initial namespace after the second namespace was created, was not replicated in the second namespace because the parent mount (/) was marked private.

How can we check that propagation types and peer groups are what we expect?

The /proc/PID/mountinfo file (documented in the proc(5) manual page) displays a range of information about the mount points for the mount namespace in which the process PID resides. All processes that reside in the same mount namespace will see the same view in this file. This file was designed to provide more information about mount points than was possible with the older, non-extensible /proc/PID/mounts file. Included in each record in this file is a (possibly empty) set of so-called "optional fields", which display information about the propagation type and peer group (for shared mounts) of each mount.

For a shared mount, the optional fields in the corresponding record in /proc/PID/mountinfo will contain a tag of the form shared:N. Here, the shared tag indicates that the mount is sharing propagation events with a peer group. The peer group is identified by N, an integer value that uniquely identifies the peer group. These IDs are numbered starting at 1, and may be recycled when a peer group ceases to exist because all of its members departed the group. All mount points that are members of the same peer group will show a shared:N tag with the same N in the /proc/PID/mountinfo file.

Thus for example, if we list the contents of /proc/self/mountinfo in the first of the shells discussed in the example above, we see the following (with a little bit of sed filtering to trim some irrelevant information from the output):

    sh1# cat /proc/self/mountinfo | sed 's/ - .*//'
    61 0 8:2 / / rw,relatime
    81 61 8:3 / /X rw,relatime shared:1
    124 61 8:5 / /Y rw,relatime shared:2
    228 61 8:3 / /Z rw,relatime shared:1

From this output, we first see that the root mount point is private. This is indicated by the absence of any tags in the optional fields. We also see that the mount points /X and /Z are shared mount points in the same peer group (with ID 1), which means that mount and unmount events under either of these two mounts will propagate to the other. The mount /Y is a shared mount in a different peer group (ID 2), which, by definition, does not propagate events to or from the mounts in peer group 1.

The /proc/PID/mountinfo file also enables us to see the parental relationship between mount points. The first field in each record is a unique ID for each mount point. The second field is the ID for the parent mount. From the above output, we can see that the mount points /X, /Y, and /Z are all children of the root mount because their parent IDs are all 61.

Running the same command in the second shell (in the second namespace), we see:

    sh2# cat /proc/self/mountinfo | sed 's/ - .*//'
    147 146 8:2 / / rw,relatime
    221 147 8:3 / /X rw,relatime shared:1
    224 147 8:5 / /Y rw,relatime shared:2

Again, we see that the root mount point is private. Then we see that /X is a shared mount in peer group 1, the same peer group as the mounts /X and /Z in the initial mount namespace. Finally, we see that /Y is a shared mount in peer group 2, the same peer group as the mount /Y in the initial mount namespace. One final point to note is that the mount points that were replicated in the second namespace have their own unique IDs that differ from the IDs of the corresponding mounts in the initial namespace.
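The same check can be automated by parsing one mountinfo record at a time. The sketch below is an illustration only: the struct and function names are invented, it relies on the proc(5) layout of six fixed fields followed by the optional fields and a " - " separator, and it also picks up the master:N tag that proc(5) uses for slave mounts, which the output above does not happen to contain.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record type holding the fields discussed above. */
struct mount_rec {
    int id;          /* field 1: unique mount ID */
    int parent_id;   /* field 2: ID of the parent mount */
    int shared;      /* N from a shared:N tag, 0 if the tag is absent */
    int master;      /* N from a master:N tag, 0 if the tag is absent */
};

/* Parses one /proc/PID/mountinfo line (with or without the part after
 * " - "). Returns 0 on success, -1 if the line has no leading IDs. */
int parse_mountinfo_line(const char *line, struct mount_rec *out)
{
    char copy[4096];
    int field = 0;

    assert(line != NULL && out != NULL);
    memset(out, 0, sizeof(*out));
    if (sscanf(line, "%d %d", &out->id, &out->parent_id) != 2)
        return -1;

    /* Work on a copy and cut it at the " - " separator, if present. */
    snprintf(copy, sizeof(copy), "%s", line);
    char *sep = strstr(copy, " - ");
    if (sep != NULL)
        *sep = '\0';

    /* Fields 1-6 are fixed; anything after them is an optional field.
     * Spaces inside paths are escaped in this file, so splitting on
     * spaces is safe. */
    for (char *tok = strtok(copy, " \n"); tok != NULL;
         tok = strtok(NULL, " \n")) {
        if (++field <= 6)
            continue;
        sscanf(tok, "shared:%d", &out->shared);
        sscanf(tok, "master:%d", &out->master);
    }
    return 0;
}
```

Feeding each line of /proc/self/mountinfo to this function in both shells and comparing the shared values is one way to confirm that /X and /Z (and the copy of /X in the second namespace) report the same peer group ID.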
