文章归档

linux内核调度:cfs(Completely Fair Scheduler) 的简单原理

cfs为完全公平调度提供了一种统一的抽象:虚拟时钟vruntime

不关心进程是否sleep,不关心是否为交互式进程,它只基于虚拟时钟,通过一棵RB树维护所有的进程。RB树的key就是进程的虚拟运行时间。

cfs提供了一种可扩展的进程调度架构,实现了机制和策略的分离

因为真实的硬件上同一时刻只能允许一个任务在运行,所以为了模拟并行,需要引入一个虚拟时钟的概念。虚拟时钟将进程睡眠补偿,优先级,权限等各种参数映射到同一维度。

schedule(void)

进程让出cpu有两种方式:

  1. 主动让出
    1. 用户态sched_yield系统调用,sched_yield()主要是通过schedule()函数来完整的
    2. 内核态在需要等待设备或者资源的时候,也会主动调用schedule()
  2. 时钟中断,scheduler_tick()

schedule()函数的主要作用就是,保存当前进程的上下文,将当前进程push进队列,从队列中选出下一个进程,切换上下文继续执行

/*
 * schedule() is the main scheduler function.
 * linux-2.6.32.68/kernel/sched.c
 */

asmlinkage void __sched schedule(void)
{
        // ....
	pre_schedule(rq, prev);

	if (unlikely(!rq->nr_running))
		idle_balance(cpu, rq);

	put_prev_task(rq, prev);
	next = pick_next_task(rq);

	if (likely(prev != next)) {
		sched_info_switch(prev, next);

		rq->nr_switches++;
		rq->curr = next;
		++*switch_count;

		context_switch(rq, prev, next); /* unlocks the rq */
		/*
		 * the context switch might have flipped the stack 
                 * from under us, hence refresh the local variables.
		 */
		cpu = smp_processor_id();
		rq = cpu_rq(cpu);
	} else
		spin_unlock_irq(&rq->lock);
	post_schedule(rq);
        // ....
}

虚拟时钟vruntime计算

每次进程调度的时候,都会更新当前进程的虚拟时钟,计算公式是:

static inline unsigned long
calc_delta_fair(unsigned long delta, struct sched_entity *se)
{
	if (unlikely(se->load.weight != NICE_0_LOAD))
            /*
             * delta *= weight / lw
             */            
            delta = calc_delta_mine(delta, NICE_0_LOAD, &se->load);
	return delta;
}

NICE_0_LOAD是默认进程的load weight值1024,从这里我们可以看出来,默认load weight值对应的进程的vruntime等同于进程实际的运行时间。而load weight值越大,运行时间就折算的越小(自然也就得到更多运行的机会)

nice值和进程load weight的对应关系是:

/*
 * Nice levels are multiplicative, with a gentle 10% change for every
 * nice level changed. I.e. when a CPU-bound task goes from nice 0 to
 * nice 1, it will get ~10% less CPU time than another CPU-bound task
 * that remained on nice 0.
 *
 * The "10% effect" is relative and cumulative: from _any_ nice level,
 * if you go up 1 level, it's -10% CPU usage, if you go down 1 level
 * it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.
 * If a task goes up by ~10% and another task goes down by ~10% then
 * the relative distance between them is ~25%.)
 */
static const int prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

总的来说nice值越小进程就越快得到运行

时钟补偿

有两种case:

  1. 新建进程:补偿时钟以便获得更高的运行机会
  2. 睡眠唤醒

对于睡眠中的进程,wake_up之后也会得到一定的vruntime补偿,目的是缩小wake_up进程和当前cfs_rq队列里的min_vruntime的差距。使得wake_up进程有更多的机会优先运行。

具体逻辑可参考place_entity()函数

组调度

cfs支持组调度的概念,进程组由struct task_group来描述,不考虑多核的情况下,struct task_group和struct task_struct都包含sched_entity成员,而task_group相对task_struct则多了一个cfs_rq队列。

/* task group related information */
struct task_group {
    struct cgroup_subsys_state css;

#ifdef CONFIG_FAIR_GROUP_SCHED
    /* schedulable entities of this group on each cpu */
    struct sched_entity **se;
    /* runqueue "owned" by this group on each cpu */
    struct cfs_rq **cfs_rq;
    unsigned long shares;
#endif

    // ...
};

scheduler在选择进程的时候,如果发现当前的调度实体sched_entity为进程组,则继续递归向下从进程组里选择一个新的调度实体sched_entity,直到选择出一个普通进程为止,代码如下:

static struct task_struct *pick_next_task_fair(struct rq *rq)
{
	struct task_struct *p;
	struct cfs_rq *cfs_rq = &rq->cfs;
	struct sched_entity *se;

	if (unlikely(!cfs_rq->nr_running))
		return NULL;

	do {
		se = pick_next_entity(cfs_rq);
		set_next_entity(cfs_rq, se);
                // 这里如果se的cfs_rq为真,则说明它是一个进程组,否则,就为普通进程
                // 区别可以具体看看进程组和普通进程相应sched_entity初始化的地方
                cfs_rq = group_cfs_rq(se);
	} while (cfs_rq);

	p = task_of(se);
	hrtick_start_fair(rq, p);

	return p;
}

Leave a Reply

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>