
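The kernel OOM path for cgroup.memory

This post uses kernel 3.10.79 as the reference. It walks through how the kernel handles OOM for a cgroup.memory process group, and what kind of OOM policy a co-located (mixed online/offline) environment needs.

1. When OOM is triggered

The kernel keeps one counter per memory cgroup that tracks how much memory the cgroup is currently using. Whenever a process in the cgroup allocates pages, their size is added to the counter by res_counter_charge_locked(). If the charge would push usage past the threshold set in memory.limit_in_bytes, the charge fails and the function returns -ENOMEM (unless force is set, in which case the usage is still added).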

int res_counter_charge_locked(
struct res_counter *counter, unsigned long val, bool force)
{
	int ret = 0;
	if (counter->usage + val > counter->limit) {
		counter->failcnt++;
		ret = -ENOMEM;
		if (!force)
			return ret;
	}
	counter->usage += val;
	if (counter->usage > counter->max_usage)
		counter->max_usage = counter->usage;
	return ret;
}
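The fields of this counter are visible from user space through the cgroup v1 control files memory.usage_in_bytes, memory.limit_in_bytes, memory.max_usage_in_bytes and memory.failcnt. The following is a minimal sketch for reading them; the helper names read_counter() and dump_memcg() are made up for illustration, and the directory passed in is an assumption about where the memory hierarchy is mounted.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a single unsigned decimal value from a cgroup control file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
static int read_counter(const char *path, unsigned long long *out)
{
	FILE *f;
	int ok;

	assert(path != NULL && out != NULL);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fscanf(f, "%llu", out) == 1;
	fclose(f);
	return ok ? 0 : -1;
}

/* Print usage, limit, peak usage and failure count for one memory cgroup. */
static int dump_memcg(const char *dir)
{
	static const char *files[] = {
		"memory.usage_in_bytes", "memory.limit_in_bytes",
		"memory.max_usage_in_bytes", "memory.failcnt",
	};
	char path[4096];
	unsigned long long v;
	size_t i;

	for (i = 0; i < sizeof(files) / sizeof(files[0]); i++) {
		snprintf(path, sizeof(path), "%s/%s", dir, files[i]);
		if (read_counter(path, &v) != 0)
			return -1;
		printf("%s = %llu\n", files[i], v);
	}
	return 0;
}
```

Calling dump_memcg("/sys/fs/cgroup/memory/mygroup") (a hypothetical path) before and after a workload shows whether failcnt has grown, that is, whether any charge hit the limit.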

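One more point to keep in mind: the memory subsystem is controlled hierarchically, not as a flat set of groups. With hierarchical accounting enabled (memory.use_hierarchy), a charge against a child cgroup is also applied to every ancestor, so a child is effectively bound by the tightest limit_in_bytes on its path to the root. The children's limits are not required to fit inside the parent's limit; when they add up to more, the parent's limit is what takes effect at charge time.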

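So, when memory is accounted:

  1. Pages newly allocated by a process are charged upward, level by level, to every ancestor cgroup.
  2. The count at the root of the memory subsystem is therefore the total memory used by all processes on the system. (Note that cgroup.memory and the proc filesystem account memory differently, so the numbers reported by the two do not match exactly.)

A child cgroup may still have headroom in its own quota while its parent has none. During the upward charge, whichever cgroup goes over its limit is the one that gets OOM-handled, and a process somewhere under that cgroup is chosen and killed. This means a process can be killed, seemingly for no reason, even though its own cgroup is well under its limit.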

static int __res_counter_charge(struct res_counter *counter, unsigned long val,
				struct res_counter **limit_fail_at, bool force)
{
	int ret, r;
	unsigned long flags;
	struct res_counter *c, *u;

	r = ret = 0;
	*limit_fail_at = NULL;
	local_irq_save(flags);
	for (c = counter; c != NULL; c = c->parent) {
		spin_lock(&c->lock);
		r = res_counter_charge_locked(c, val, force);
		spin_unlock(&c->lock);
		if (r < 0 && !ret) {
			ret = r;
			*limit_fail_at = c;
			if (!force)
				break;
		}
	}

	if (ret < 0 && !force) {
		for (u = counter; u != c; u = u->parent) {
			spin_lock(&u->lock);
			res_counter_uncharge_locked(u, val);
			spin_unlock(&u->lock);
		}
	}
	local_irq_restore(flags);

	return ret;
}
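To make the upward walk and the rollback concrete, here is a small user-space model of the same logic. It is only a sketch: struct counter and charge() are hypothetical stand-ins for struct res_counter and __res_counter_charge(), with locking, failcnt and the force flag left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct res_counter (no locking). */
struct counter {
	unsigned long usage;
	unsigned long limit;
	struct counter *parent;
};

/*
 * Charge val to c and to every ancestor. On failure, roll back the
 * levels already charged and return the counter whose limit was hit.
 * Returns NULL when every level accepted the charge.
 */
static struct counter *charge(struct counter *c, unsigned long val)
{
	struct counter *it, *undo;

	assert(c != NULL);
	for (it = c; it != NULL; it = it->parent) {
		if (it->usage + val > it->limit)
			break;
		it->usage += val;
	}
	if (it == NULL)
		return NULL;

	/* Uncharge the levels below the one that failed. */
	for (undo = c; undo != it; undo = undo->parent)
		undo->usage -= val;
	return it;
}
```

With a parent at usage 90 and limit 100, and a child at usage 10 and limit 50, charging 20 to the child fails at the parent even though the child itself would only reach 30 of its 50. The returned pointer plays the role of limit_fail_at.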

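When the kernel finds that a cgroup on the charge path (possibly an ancestor) is over its limit, it first tries to reclaim memory with mem_cgroup_reclaim(). In the code below, mem_over_limit is the cgroup that hit its limit. If reclaim frees enough memory, the caller is told to retry the charge; otherwise OOM handling is triggered.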

	ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
		return CHARGE_RETRY;
	/*
	 * Even though the limit is exceeded at this point, reclaim
	 * may have been able to free some pages.  Retry the charge
	 * before killing the task.
	 *
	 * Only for regular pages, though: huge pages are rather
	 * unlikely to succeed so close to the limit, and we fall back
	 * to regular pages anyway in case of failure.
	 */
	if (nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER) && ret)
		return CHARGE_RETRY;

	/*
	 * At task move, charge accounts can be doubly counted. So, it's
	 * better to wait until the end of task_move if something is going on.
	 */
	if (mem_cgroup_wait_acct_move(mem_over_limit))
		return CHARGE_RETRY;

	if (invoke_oom)
		mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(csize));

	return CHARGE_NOMEM;

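mem_cgroup_oom() does not actually run the OOM killer. It only records the over-limit cgroup in current->memcg_oom for the current task, and the charge then fails with ENOMEM. The caller decides, depending on the situation, whether to go ahead with OOM handling; once it does, the kernel calls mem_cgroup_oom_synchronize() to carry out the OOM process for that cgroup.

2. The OOM process

The function that performs OOM for a memory cgroup is mem_cgroup_out_of_memory(). It is quite simple: it follows the same pattern as out_of_memory() in oom_kill.c and reuses the same helper functions.

mem_cgroup_out_of_memory() walks every task in the over-limit cgroup and in its descendant cgroups. For each task it calls oom_scan_process_thread() to decide whether the task takes part in scoring; if it does, oom_badness() computes its score. The task with the highest score is killed by sending it SIGKILL (signal 9).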

	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
	for_each_mem_cgroup_tree(iter, memcg) {
		struct cgroup *cgroup = iter->css.cgroup;
		struct cgroup_iter it;
		struct task_struct *task;

		cgroup_iter_start(cgroup, &it);
		while ((task = cgroup_iter_next(cgroup, &it))) {
			switch (oom_scan_process_thread(task, totalpages, NULL,
							false)) {
			case OOM_SCAN_SELECT:
				if (chosen)
					put_task_struct(chosen);
				chosen = task;
				chosen_points = ULONG_MAX;
				get_task_struct(chosen);
				/* fall through */
			case OOM_SCAN_CONTINUE:
				continue;
			case OOM_SCAN_ABORT:
				cgroup_iter_end(cgroup, &it);
				mem_cgroup_iter_break(memcg, iter);
				if (chosen)
					put_task_struct(chosen);
				return;
			case OOM_SCAN_OK:
				break;
			};
			points = oom_badness(task, memcg, NULL, totalpages);
			if (points > chosen_points) {
				if (chosen)
					put_task_struct(chosen);
				chosen = task;
				chosen_points = points;
				get_task_struct(chosen);
			}
		}
		cgroup_iter_end(cgroup, &it);
	}

	if (!chosen)
		return;
	points = chosen_points * 1000 / totalpages;
	oom_kill_process(chosen, gfp_mask, order, points, totalpages, memcg,
			 NULL, "Memory cgroup out of memory");

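An important job of oom_scan_process_thread() here is to filter out tasks that do not need to be scored, which speeds up OOM handling. Which tasks are skipped?

  1. Tasks that have already exited
  2. Kernel threads
  3. Tasks whose memory has already been released, detected by task->mm being NULL, which means the task is in the middle of exiting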
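The filter-then-pick-the-maximum pattern described above can be sketched in user space as follows. struct task_info, eligible() and pick_victim() are hypothetical names for illustration only; the real code works on task_struct and also handles the OOM_SCAN_SELECT and OOM_SCAN_ABORT cases, which this sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a task; not the kernel's task_struct. */
struct task_info {
	int pid;
	bool exited;		/* task has already exited */
	bool kernel_thread;	/* task is a kernel thread */
	bool has_mm;		/* false once the address space is gone */
	unsigned long points;	/* precomputed badness score */
};

/* Apply the three skip rules from the list above. */
static bool eligible(const struct task_info *t)
{
	return !t->exited && !t->kernel_thread && t->has_mm;
}

/* Return the eligible task with the highest score, or NULL if none. */
static const struct task_info *pick_victim(const struct task_info *tasks,
					   size_t n)
{
	const struct task_info *chosen = NULL;
	size_t i;

	assert(tasks != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (!eligible(&tasks[i]))
			continue;
		if (chosen == NULL || tasks[i].points > chosen->points)
			chosen = &tasks[i];
	}
	return chosen;
}
```

Skipped tasks never reach the scoring step, so a kernel thread with a large nominal score still cannot be selected.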

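3. Badness scoring policy

Three factors influence the score computed by oom_badness():

  1. A process can set its OOM parameter through /proc/${pid}/oom_score_adj. The larger the value, the more likely the process is to be killed; the minimum value, OOM_SCORE_ADJ_MIN (-1000), exempts the process entirely.
  2. Memory usage (RSS, page tables and swap entries). The more memory a process uses, the more likely it is to be killed.
  3. The kernel slightly lowers the risk for root processes (those with CAP_SYS_ADMIN) by taking 3% off their memory usage before the adjustment is applied.

The relevant part of oom_badness() is: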

	adj = (long)p->signal->oom_score_adj;
	if (adj == OOM_SCORE_ADJ_MIN) {
		task_unlock(p);
		return 0;
	}

	/*
	 * The baseline for the badness score is the proportion of RAM that each
	 * task's rss, pagetable and swap space use.
	 */
	points = get_mm_rss(p->mm) + p->mm->nr_ptes +
		 get_mm_counter(p->mm, MM_SWAPENTS);
	task_unlock(p);

	/*
	 * Root processes get 3% bonus, just like the __vm_enough_memory()
	 * implementation used by LSMs.
	 */
	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
		points -= (points * 3) / 100;

	/* Normalize to oom_score_adj units */
	adj *= totalpages / 1000;
	points += adj;

	/*
	 * Never return 0 for an eligible task regardless of the root bonus and
	 * oom_score_adj (oom_score_adj can't be OOM_SCORE_ADJ_MIN here).
	 */
	return points > 0 ? points : 1;
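A worked example may help here. The sketch below reproduces the arithmetic of the excerpt above in user space; badness() and its parameter list are hypothetical, and the numbers are chosen only for illustration (a limit of 262144 pages, i.e. 1 GiB with 4 KiB pages, so totalpages / 1000 is 262).

```c
#include <assert.h>
#include <stdbool.h>

#define OOM_SCORE_ADJ_MIN (-1000)

/*
 * User-space model of the oom_badness() arithmetic.
 * rss, ptes and swapents are page counts; admin models CAP_SYS_ADMIN.
 */
static unsigned long badness(unsigned long rss, unsigned long ptes,
			     unsigned long swapents, long adj,
			     unsigned long totalpages, bool admin)
{
	long points;

	assert(totalpages > 0);
	if (adj == OOM_SCORE_ADJ_MIN)
		return 0;			/* never selected */

	points = (long)(rss + ptes + swapents);
	if (admin)
		points -= (points * 3) / 100;	/* 3% root bonus */

	/* Convert oom_score_adj (thousandths of the limit) into pages. */
	adj *= (long)(totalpages / 1000);
	points += adj;

	return points > 0 ? (unsigned long)points : 1;
}
```

With these inputs, an ordinary task using 100200 pages scores 100200, the same task with CAP_SYS_ADMIN scores 97194, and a task using only 50000 pages but with oom_score_adj set to 500 scores 181000 and would be chosen first. A task with oom_score_adj of -999 and 1000 pages ends up negative and is clamped to 1, so it remains eligible but is picked last.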

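4. Co-location

Co-location mainly means deploying online services and offline (batch) jobs on the same machines. So far, however, single-machine resource isolation is far from perfect: an offline job that is not properly contained can easily cause a machine-wide OOM or disturb the online services. Memory in particular needs special care.

Offline jobs have very low priority by nature. When the machine runs short of memory, rather than computing badness process by process and killing selectively, it is better to simply kill all the offline jobs at once. That saves a great deal of time and lets the affected online services recover as quickly as possible.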
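One way such a "kill everything offline" policy could look from user space is sketched below, assuming all offline jobs live under a single cgroup v1 directory and that a user-space agent, not the kernel, applies the policy. The helper names (for_each_pid(), print_one(), kill_one(), kill_cgroup()) are hypothetical; the sketch reads pids from the cgroup's cgroup.procs file and sends SIGKILL to each.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/* Callback type: act on one pid, return 0 on success. */
typedef int (*pid_fn)(pid_t pid, void *arg);

/*
 * Call fn for every pid listed (one per line) in an already opened
 * cgroup.procs-style file. Returns the number of successful calls.
 */
static size_t for_each_pid(FILE *f, pid_fn fn, void *arg)
{
	size_t done = 0;
	long pid;

	assert(f != NULL && fn != NULL);
	while (fscanf(f, "%ld", &pid) == 1) {
		if (pid > 0 && fn((pid_t)pid, arg) == 0)
			done++;
	}
	return done;
}

/* Dry-run callback: only report what would be killed. */
static int print_one(pid_t pid, void *arg)
{
	(void)arg;
	printf("would kill %ld\n", (long)pid);
	return 0;
}

/* Real callback: send SIGKILL to one process. */
static int kill_one(pid_t pid, void *arg)
{
	(void)arg;
	return kill(pid, SIGKILL);
}

/*
 * Kill every process listed in the given cgroup.procs file.
 * Returns the number of processes signalled, or -1 if the file
 * cannot be opened.
 */
static long kill_cgroup(const char *procs_path)
{
	FILE *f = fopen(procs_path, "r");
	long n;

	if (!f)
		return -1;
	n = (long)for_each_pid(f, kill_one, NULL);
	fclose(f);
	return n;
}
```

A call such as kill_cgroup("/sys/fs/cgroup/memory/offline/cgroup.procs") uses a made-up path. A single pass can miss processes forked while the file is being read, so a real agent would probably repeat the pass until the file is empty, or freeze the group with the freezer cgroup first.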
