Original paper: http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf

This is a good overview paper on the Linux scheduler. It first explains how the Linux scheduler works, then walks through the history of improvements to the scheduling algorithm, and finally analyzes four bugs in the current scheduler's load-balancing code in detail. As the title, "a Decade of Wasted Cores", suggests, these bugs make it easy for the kernel to end up with an unbalanced multi-core load in certain scenarios, for example with some cores extremely busy while other cores sit idle.

These bugs are quite interesting, so let me go through them briefly.

1. The Group Imbalance bug

First, some background:

  1. The Linux kernel supports group scheduling, e.g. cgroup control groups and thread groups (by default, a process and the threads it creates belong to the same scheduling group; readers should first get familiar with the concepts of scheduling domains and scheduling groups, as that background matters here; a rough sketch follows this list)
  2. The multi-core load-balancing algorithm bases its decisions on the average load of each scheduling group
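
For readers new to those two terms, here is a minimal conceptual sketch in C. All names are made up for illustration; the kernel's real struct sched_domain and struct sched_group are far more involved:

```c
/* Conceptual sketch only, not the kernel's actual data structures. */
struct group_sketch {
    int cores[64];          /* cores belonging to this scheduling group */
    int nr_cores;
    double load;            /* the single figure the balancer compares  */
};

struct domain_sketch {
    struct group_sketch *groups;  /* groups competing at this level     */
    int nr_groups;
    struct domain_sketch *parent; /* next level up, e.g. the NUMA level */
};
```

The point to keep in mind: when the balancer works at a given domain level, it compares whole groups by one aggregate load figure, not individual cores.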

The authors then set up the following scenario: on an 8-node, 64-core NUMA machine, run two kinds of applications:

  1. make -j 64 compiling the kernel, which is certainly CPU-hungry
  2. two ordinary R processes (programs written in the R language, as I understand it)

The results show that the cores running the R programs are under little pressure while the other cores are nearly saturated. So the question is: why don't the compile threads get scheduled onto the idle cores of the nodes where the R processes run? (The 64 make threads end up running on only 62 cores.)

When a core attempts to steal work from another node, or, in other words, from another scheduling group, it does not examine the load of every core in that group, it only looks at the group’s average load. If the average load of the victim scheduling group is greater than that of its own, it will attempt to steal from that group; otherwise it will not. This is the exact reason why in our situation the underloaded cores fail to steal from the overloaded cores on other nodes. They observe that the average load of the victim node’s scheduling group is not any greater than their own. The core trying to steal work runs on the same node as the high-load R thread; that thread skews up the average load for that node and conceals the fact that some cores are actually idle. At the same time, cores on the victim node, with roughly the same average load, have lots of waiting threads.

The fix the authors propose is that when comparing the load of two scheduling groups, the balancer should compare minimum loads rather than average loads, because:

If the minimum load of one scheduling group is lower than the minimum load of another scheduling group, it means that the first scheduling group has a core that is less loaded than all cores in the other group, and thus a core in the first group must steal from the second group.
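
Here is a minimal sketch of the two decision rules in plain C. The load numbers are hypothetical, chosen to mimic the paper's scenario of one heavy R thread next to saturated make nodes; this is not the kernel's actual code:

```c
#include <stdio.h>

#define N 8 /* cores per node */

static double avg(const double l[], int n)
{
    double s = 0;
    for (int i = 0; i < n; i++)
        s += l[i];
    return s / n;
}

static double min(const double l[], int n)
{
    double m = l[0];
    for (int i = 1; i < n; i++)
        if (l[i] < m)
            m = l[i];
    return m;
}

int main(void)
{
    /* Stealer's node: one heavy R thread, seven near-idle cores. */
    double stealer[N] = { 800, 1, 1, 1, 1, 1, 1, 1 };
    /* Victim node: every core saturated, threads waiting in runqueues. */
    double victim[N]  = { 100, 100, 100, 100, 100, 100, 100, 100 };

    /* Old rule: steal only if the victim's AVERAGE load is higher.
     * The R thread inflates the stealer's average (~101 vs 100),
     * so no steal happens even though seven cores sit idle. */
    printf("average rule: steal=%d\n", avg(victim, N) > avg(stealer, N));

    /* Paper's fix: steal if our MINIMUM load is lower, i.e. we own a
     * core less loaded than every core of the victim group. */
    printf("minimum rule: steal=%d\n", min(stealer, N) < min(victim, N));
    return 0;
}
```

With the average rule, the R thread's weight conceals seven idle cores; with the minimum rule, the stealer can prove it owns a core less loaded than every core of the victim group.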

2. The Scheduling Group Construction bug

The second bug relates to how scheduling groups are constructed: the grouping does not exactly match the NUMA topology you would expect. In short, the kernel constructs scheduling groups from the perspective of core 0, not from the perspective of the core that is actually performing load balancing. This can result in the same node appearing in multiple scheduling groups.

In the paper's figure (not reproduced here), {0,1,2,4,6} is one scheduling group and {1,2,3,4,5,7} is another, yet both groups contain nodes 1, 2, and 4.

What problem does that cause?

Suppose a process with N threads is pinned to run on Node 1 and Node 2. Because of the kernel's placement policy at thread creation time (a new thread runs on the local core, i.e. its parent's core, by default), all the threads actually end up running on Node 1, and the load between Node 1 and Node 2 never balances out. The mechanism is the one we saw earlier: load balancing only compares the average load between scheduling groups, and since both scheduling groups contain Node 1 and Node 2, their average loads come out the same...
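
A minimal sketch in plain C of why the comparison is a wash. The thread count and the simplified sum-based group load are assumptions for illustration (the kernel normalizes group load by capacity), not the kernel's actual computation:

```c
#include <stdio.h>

/* 64 threads pinned to nodes 1 and 2, all created on node 1. */
static const double node_load[8] = { 0, 64, 0, 0, 0, 0, 0, 0 };

/* Simplified group load: sum over member nodes. */
static double group_load(const int nodes[], int n)
{
    double s = 0;
    for (int i = 0; i < n; i++)
        s += node_load[nodes[i]];
    return s;
}

int main(void)
{
    const int g1[] = { 0, 1, 2, 4, 6 };    /* built from core 0's view */
    const int g2[] = { 1, 2, 3, 4, 5, 7 };

    /* An idle core on node 2 (member of g2) compares its group with g1.
     * Node 1's 64 threads count toward BOTH loads, so the victim group
     * never looks busier and nothing is stolen. */
    printf("g1=%.0f g2=%.0f -> steal=%d\n",
           group_load(g1, 5), group_load(g2, 6),
           group_load(g1, 5) > group_load(g2, 6));
    return 0;
}
```

Because node 1 is a member of both groups, its load inflates both sides of the comparison equally, so the idle cores of node 2 never see a busier victim group.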

3. The Overload-on-Wakeup bug

This one is simpler. In short, to maximize cache hit rates, when a thread is woken up the kernel schedules it onto a core of the same node it slept on, even when that node is busy and another node is idle.

When a thread goes to sleep on Node X and the thread that wakes it up later is running on that same node, the scheduler only considers the cores of Node X for scheduling the awakened thread. If all cores of Node X are busy, the thread will wake up on an already busy core and miss opportunities to use idle cores on other nodes.
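
A sketch of the behavior described above, as illustrative C pseudocode. The function name and types are made up; this is not the kernel's actual wakeup path:

```c
#include <stdbool.h>
#include <stdio.h>

#define CORES_PER_NODE 8

struct core { int id; bool idle; };

/* Hypothetical helper mimicking the behavior the paper describes:
 * only cores of the waker's node X are candidates. */
static struct core *select_core_on_wakeup(struct core local[])
{
    for (int i = 0; i < CORES_PER_NODE; i++)
        if (local[i].idle)
            return &local[i];
    /* Every local core is busy: queue on one of them anyway, even if
     * another node has idle cores the search never looked at. */
    return &local[0];
}

int main(void)
{
    struct core node_x[CORES_PER_NODE];  /* waker's node: all busy */
    for (int i = 0; i < CORES_PER_NODE; i++)
        node_x[i] = (struct core){ .id = i, .idle = false };

    /* An idle node Y exists, but it is outside the search scope. */
    printf("woken thread placed on busy core %d\n",
           select_core_on_wakeup(node_x)->id);
    return 0;
}
```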

Cache hit rate mostly affects latency, whereas multi-core parallelism mostly affects throughput. For some offline workloads, latency is not the key performance metric; throughput is. Yet in its load balancing, the kernel's multi-core scheduling bakes in strong assumptions about things like cache hit rates.
