cgroup kernel implementation

February 20, 2015
cgroup

The most important data structures in the cgroup subsystem are struct cgroup, struct css_set, and struct cg_cgroup_link. Once the relationships among these three are clear, you basically understand how the kernel manages cgroups. For background material on cgroups, see Red Hat's introduction:

>> Relationships Between Subsystems, Hierarchies, Control Groups and Tasks
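
As a rough illustration, the three structures form an M:N relationship through link objects that sit on two lists at once. The sketch below is greatly simplified and invented for illustration (the field names follow the 2.6-era kernel source, but everything else — locking, reference counts, the subsystem state pointers — is stripped away):

```c
#include <stddef.h>

/* Minimal doubly-linked list, standing in for the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };
static void list_init(struct list_head *h) { h->next = h->prev = h; }
static void list_add(struct list_head *n, struct list_head *h) {
    n->next = h->next; n->prev = h;
    h->next->prev = n; h->next = n;
}

struct cgroup  { struct list_head css_sets; };  /* list of cg_cgroup_link */
struct css_set { struct list_head cg_links; };  /* list of cg_cgroup_link */

/* One link object joins one cgroup with one css_set. It is chained on
 * both the cgroup's css_sets list and the css_set's cg_links list,
 * which is what makes the relationship many-to-many. */
struct cg_cgroup_link {
    struct list_head cgrp_link_list;  /* chained on cgroup->css_sets   */
    struct cgroup   *cgrp;
    struct list_head cg_link_list;    /* chained on css_set->cg_links  */
    struct css_set  *cg;
};

static void link_css_set(struct cg_cgroup_link *link,
                         struct css_set *cg, struct cgroup *cgrp)
{
    link->cg = cg;
    link->cgrp = cgrp;
    list_add(&link->cgrp_link_list, &cgrp->css_sets);
    list_add(&link->cg_link_list, &cg->cg_links);
}
```

Walking cgroup->css_sets and applying container_of()-style pointer arithmetic to each node recovers the link and hence the css_set on the other side, and vice versa from css_set->cg_links.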


Hotfix: kernel live-patching techniques

February 2, 2015
Kernel

I recently looked into kernel live-patching techniques and gave a talk about them to my team.

https://github.com/pipul/coutune/blob/master/papers/kpatch.pdf

References:

  1. Alibaba virtualization technology http://pan.baidu.com/s/1eQcSPTw
  2. Safe and Automatic Live Update for Operating Systems  https://www.cs.vu.nl/~giuffrida/papers/asplos-2013.pdf
  3. LUCOS http://pan.baidu.com/s/1kTmSd4F
  4. ksplice http://www.ksplice.com/doc/ksplice.pdf
  5. Taobao kernel monthly report http://kernel.taobao.org/index.php/%E5%86%85%E6%A0%B8%E6%9C%88%E6%8A%A52014-05-06
  6. kgraft http://lwn.net/Articles/596776/
  7. kpatch https://github.com/dynup/kpatch

memory allocator and coredump analysis

December 22, 2014
Performance Analysis and Tools

These past few days I was completely worn out by a production bug. We were about to roll out a new module, and during the canary phase the system turned out to be extremely unstable: it dumped core every few tens of minutes, and the coredump locations were fairly regular. Here is a review of how this core was analyzed.

Coredump symptoms

Our release process is: stability testing + small-traffic canary + full rollout.
The stability-test machines run CentOS 6, while production runs CentOS 5.4. For the stability test we replayed production traffic with all modules running; after two days with no problems, we moved on to the small-traffic canary. (As it later turned out, the environment was an important factor in triggering the coredump.)

Problems appeared during the small-traffic canary: the process dumped core from time to time, and the crash sites were fairly stable, almost always the same few places. Since the crashes were stable, the code near those sites seemed to be at fault.

Information gathering

From the coredump stack frames, we could establish the following:

  • The crash was inside the STL, so it was not an obvious NULL dereference or the like
  • Our user-space code rarely calls malloc/free directly to allocate and release memory
  • The coredump sites were stable, so the nearby code was the most likely culprit

Static code analysis

Starting from the initial analysis of the code around the coredump sites:

  • We looked for suspicious code near the crash sites, such as array out-of-bounds accesses, multi-thread race conditions, and mismatched malloc/free; nothing was found.
  • We reviewed the newly deployed code for anything that could introduce or trigger the coredump; nothing was found.
  • We checked for thread-safety issues and race conditions, e.g. concurrent reads and writes of a map.

Troubleshooting

After the initial analysis turned up nothing obvious, we began narrowing the problem down step by step. By configuring which modules ran in production, we isolated one module that seemed to be at fault. But we had already reviewed that code during static analysis, so pinpointing the bug was hard. Worse, even unloading the suspicious module did not fully resolve the problem; it only changed the interval between coredumps.

This suggested that the coredump site was probably not where the damage happened; the memory had likely been corrupted earlier, somewhere else entirely.

Valgrind

To confirm this, we ran a long Valgrind session over all the modules, and finally found a bug in a module that had already been running for a long time. The bug was very hard to catch: it came from incorrect use of a JSON library on a rarely taken execution path.

Why had this problem never blown up before? An error like a double free behaves inconsistently depending on how the memory allocator is implemented. If you corrupt a block of memory only slightly, and the corrupted bytes do not affect any other block, then as long as the freed block is never reused the bug is very hard to reproduce. This is why the coredump site is not necessarily where the problem originates.
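
To make the "crash far from the corruption" behavior concrete, here is a small self-contained sketch. It is a toy free-list allocator invented purely for illustration (not our real code, nor any real allocator): overflowing one block silently tramples the allocator's metadata in the neighboring free block, and nothing goes wrong until a later allocation reads that metadata.

```c
#include <stdint.h>
#include <string.h>

/* Toy free-list allocator: like many real allocators, it keeps its
 * metadata (the index of the next free block) inside the heap itself,
 * in the first bytes of each free block. */
#define NBLK  4
#define BLKSZ 32
static _Alignas(int) unsigned char heap[NBLK * BLKSZ];
static int free_head;     /* index of the first free block, -1 if none */

static void heap_init(void)
{
    for (int i = 0; i < NBLK; i++)  /* chain every block on the free list */
        *(int *)&heap[i * BLKSZ] = (i + 1 < NBLK) ? i + 1 : -1;
    free_head = 0;
}

static void *toy_alloc(void)
{
    if (free_head < 0)
        return NULL;
    void *p = &heap[free_head * BLKSZ];
    free_head = *(int *)p;          /* trusts in-heap metadata blindly */
    return p;
}

/* A corrupted free list shows up as an out-of-range block index. */
static int metadata_ok(void)
{
    return free_head >= -1 && free_head < NBLK;
}
```

Writing BLKSZ + 4 bytes into the first allocated block overwrites the next block's free-list index, yet metadata_ok() still passes; only the *next* toy_alloc() loads the garbage. That delay is exactly why the eventual crash site tells you so little about the bug.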

Descartes: Discourse on the Method

October 7, 2014
Life

Some basic rules and maxims of conduct for the pursuit of truth.

Rules

  1. Never accept anything as true that I do not clearly know to be such
  2. Divide each of the difficulties under examination into as many parts as possible and as may be required for its adequate solution
  3. Conduct my thoughts in an orderly fashion, beginning with the simplest and most easily known objects and ascending little by little to the most complex, and assign an order even to those objects that have no natural order of precedence
  4. In every case, make enumerations so complete and reviews so comprehensive that I can be sure nothing is omitted

Maxims of conduct

  1. Obey the laws and customs of my country, hold firmly to the religion in which, by God's grace, I was instructed from childhood, and in all other matters govern myself by the most moderate and least extreme opinions commonly accepted in practice by the most sensible of those around me
  2. Be as firm and resolute in my actions as I can, and once I have adopted an opinion, however doubtful, follow it as unwaveringly as if it were quite certain
  3. Always seek to conquer myself rather than fortune, and to change my own desires rather than the order of the world

proxyio performance test tools

August 13, 2014
proxyio

The test tools live in the perf/ directory of the source tree; see: https://github.com/proxyio/xio/tree/master/perf

There is a throughput test and a latency test:

  • Throughput test: thr_sender + thr_recver
  • Latency test: lat_sender + lat_recver

Throughput test

thr_sender sends messages continuously while thr_recver receives them, recording the message size and the elapsed time of the whole run. At the end you get a result like this:

message size: 100000000 [B]
throughput: 90497 [msg/s]
throughput: 707.013 [Mb/s]
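
Figures like these reduce to simple arithmetic; the helper below and its names are mine, not part of proxyio (note it takes megabit as 1e6 bits, which may differ from the tool's own convention):

```c
/* Throughput derived from a message count, the per-message payload size
 * and the elapsed wall-clock time. mbits_per_sec uses 1 Mb = 1e6 bits. */
struct throughput {
    double msgs_per_sec;
    double mbits_per_sec;
};

static struct throughput
throughput_of(long msg_count, long msg_size_bytes, double elapsed_sec)
{
    struct throughput t;
    t.msgs_per_sec  = msg_count / elapsed_sec;
    t.mbits_per_sec = msg_count * (double)msg_size_bytes * 8.0
                      / elapsed_sec / 1e6;
    return t;
}
```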

thr_sender usage:

usage: thr_sender <connect-to> <msg-size> <msg-count>

where:

  1. connect-to is the address of the receiver
  2. msg-size is the size of each message (excluding protocol headers; user payload only)
  3. msg-count is the number of messages

thr_recver usage:

usage: thr_recver <bind-to> <msg-count>

where:

  1. bind-to is the socket address to listen on
  2. msg-count is the number of messages


Latency test

In ping-pong mode, the test measures the round-trip time of each message. The procedure: record a timestamp, send a message, receive the response, record another timestamp, and take the latency as rtt / 2.

lat_sender usage:

usage: lat_sender <connect-to> <msg-size> <roundtrips>

where:

  1. connect-to is the address of the receiver
  2. msg-size is the size of each message (excluding protocol headers; user payload only)
  3. roundtrips is the number of round trips

lat_recver usage:

usage: lat_recver <bind-to> <msg-count>

where:

  1. bind-to is the socket address to listen on
  2. msg-count is the number of messages

The output looks like this:

message size: 1000 [B]
roundtrip count: 100000
average latency: 67.626 [us]

High performance Network Programming

August 9, 2014
network programming

TCP Nagle algorithm

August 3, 2014
tcp/ip internals

Sockets have Nagle's algorithm enabled by default. It mainly improves throughput on the wire, but it increases the transmission latency of small packets. For latency-sensitive communication, Nagle's algorithm needs to be disabled:

int flags = 1;
rc = setsockopt (rs->ss_fd, IPPROTO_TCP, TCP_NODELAY, &flags, sizeof (flags));
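
A complete, standalone version of the same call (rs->ss_fd above comes from the surrounding code; here the socket is created locally, and the option is read back just to confirm the kernel stored it):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Disable Nagle on a TCP socket. Returns the value the kernel now
 * reports for TCP_NODELAY (expected non-zero), or -1 on error. */
static int disable_nagle(int fd)
{
    int flags = 1;
    socklen_t len = sizeof(flags);

    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flags, sizeof(flags)) < 0)
        return -1;
    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flags, &len) < 0)
        return -1;
    return flags;
}
```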

The kernel configures a socket's Nagle behavior through the following call chain:

setsockopt()    // net/socket.c
tcp_setsockopt() // net/ipv4/tcp.c

/*
 *	Socket option code for TCP.
 */
static int do_tcp_setsockopt(struct sock *sk, int level,       // net/ipv4/tcp.c
		int optname, char __user *optval, unsigned int optlen)
{
	//...
	switch (optname) {
	case TCP_NODELAY:
		if (val) {
			/* TCP_NODELAY is weaker than TCP_CORK, so that
			 * this option on corked socket is remembered, but
			 * it is not activated until cork is cleared.
			 *
			 * However, when TCP_NODELAY is set we make
			 * an explicit push, which overrides even TCP_CORK
			 * for currently queued segments.
			 */
			tp->nonagle |= TCP_NAGLE_OFF|TCP_NAGLE_PUSH;
			tcp_push_pending_frames(sk);
		} else {
			tp->nonagle &= ~TCP_NAGLE_OFF;
		}
		break;
	}

The main work of transmitting TCP packets happens in the tcp_write_xmit() function:

/* This routine writes packets to the network.  It advances the
 * send_head.  This happens as incoming acks open up the remote
 * window for us.
 *
 * LARGESEND note: !tcp_urg_mode is overkill, only frames between
 * snd_up-64k-mss .. snd_up cannot be large. However, taking into
 * account rare use of URG, this is not a big flaw.
 *
 * Returns 1, if no segments are in flight and we have queued segments, but
 * cannot send anything now because of SWS or another problem.
 */
static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
              int push_one, gfp_t gfp)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *skb;
    unsigned int tso_segs, sent_pkts;
    int cwnd_quota;
    int result;

    sent_pkts = 0;

    if (!push_one) {
        /* Do MTU probing. */
        result = tcp_mtu_probe(sk);
        if (!result) {
            return 0;
        } else if (result > 0) {
            sent_pkts = 1;
        }
    }

    while ((skb = tcp_send_head(sk))) {
        unsigned int limit;

        // Initialize the skb's TSO state. One skb may consist of several
        // packets; tso_segs is the number of packets in the skb
        tso_segs = tcp_init_tso_segs(sk, skb, mss_now);
        BUG_ON(!tso_segs);

        cwnd_quota = tcp_cwnd_test(tp, skb);
        if (!cwnd_quota)
            break;

        // Check the send window
        if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now)))
            break;

        // Nagle check, explained in detail below
        if (tso_segs == 1) {
            if (unlikely(!tcp_nagle_test(tp, skb, mss_now,
                             (tcp_skb_is_last(sk, skb) ?
                              nonagle : TCP_NAGLE_PUSH))))
                break;
        } else {
            if (!push_one && tcp_tso_should_defer(sk, skb))
                break;
        }
        // TCP segmentation
        limit = mss_now;
        if (tso_segs > 1 && !tcp_urg_mode(tp))
            limit = tcp_mss_split_point(sk, skb, mss_now,
                            cwnd_quota);

        if (skb->len > limit &&
            unlikely(tso_fragment(sk, skb, limit, mss_now)))
            break;

        TCP_SKB_CB(skb)->when = tcp_time_stamp;

        // Transmit the packet
        if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
            break;

        /* Advance the send_head.  This one is sent out.
         * This call will increment packets_out.
         */
        tcp_event_new_data_sent(sk, skb);

        tcp_minshall_update(tp, mss_now, skb);
        sent_pkts += tcp_skb_pcount(skb);

        if (push_one)
            break;
    }
    if (inet_csk(sk)->icsk_ca_state == TCP_CA_Recovery)
        tp->prr_out += sent_pkts;

    if (likely(sent_pkts)) {
        tcp_cwnd_validate(sk);
        return 0;
    }
    return !tp->packets_out && tcp_send_head(sk);
}

As tcp_write_xmit() shows, sending a packet involves three main steps:

  1. State checks: tests against various TCP features and states that decide whether and how the packet should be sent, e.g. Nagle, congestion control, and buffer constraints
  2. Packet segmentation
  3. Packet transmission

Let's focus on tcp_nagle_test() and tcp_tso_should_defer(): a packet is sent immediately when tcp_nagle_test() allows it (the single-segment case) or when tcp_tso_should_defer() declines to defer it (the TSO case).
So under what conditions does tcp_nagle_test() allow a send?

/* Return non-zero if the Nagle test allows this packet to be
 * sent now.
 */
static inline int tcp_nagle_test(struct tcp_sock *tp, struct sk_buff *skb,
                 unsigned int cur_mss, int nonagle)
{
    /* Nagle rule does not apply to frames, which sit in the middle of the
     * write_queue (they have no chances to get new data).
     *
     * This is implemented in the callers, where they modify the 'nonagle'
     * argument based upon the location of SKB in the send queue.
     */
    if (nonagle & TCP_NAGLE_PUSH)
        return 1;

    /* Don't use the nagle rule for urgent data (or for the final FIN).
     * Nagle can be ignored during F-RTO too (see RFC4138).
     */
    if (tcp_urg_mode(tp) || (tp->frto_counter == 2) ||
        (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN))
        return 1;

    if (!tcp_nagle_check(tp, skb, cur_mss, nonagle))
        return 1;

    return 0;
}

There are several cases:

  1. The caller explicitly passes the TCP_NAGLE_PUSH flag; the sendmsg() path normally uses tp->nonagle instead — see tcp_push_pending_frames()
  2. Urgent (out-of-band) data, which is high priority and should not be deferred
  3. F-RTO is in progress (frto_counter == 2; I haven't dug into the details)
  4. A FIN packet, which should be sent immediately
  5. The packet is large, i.e. at least one MSS
  6. The socket is not corked (TCP_NAGLE_CORK is not set) and Nagle does not otherwise block it
  7. And so on

Meanwhile, tcp_tso_should_defer() applies John Heffner's TSO heuristics to decide whether to hold the packet back and wait for more data.

Correct usage of select

July 27, 2014
tcp/ip internals

On Linux, the main header files are:
/usr/include/sys/select.h
/usr/include/bits/select.h

The select-related APIs are:

       int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds,
               struct timeval *timeout);

       void FD_CLR(int fd, fd_set *set);
       int  FD_ISSET(int fd, fd_set *set);
       void FD_SET(int fd, fd_set *set);
       void FD_ZERO(fd_set *set);

According to the man page, nfds is described as:

nfds is the highest-numbered file descriptor in any of the three sets, plus 1.

Here nfds does not mean the size of the fd_set (it could not express that anyway); it means the highest-numbered file descriptor plus one, where the highest-numbered file descriptor is the largest fd present in any of the sets. Moreover, no fd may reach FD_SETSIZE:

An fd_set is a fixed size buffer. Executing FD_CLR() or FD_SET() with a value of fd that is negative or is equal to or larger than FD_SETSIZE will result in undefined behavior.

The reason an fd must stay below FD_SETSIZE is that fd_set is a fixed-size bitmap whose size is derived from FD_SETSIZE. For an fd greater than or equal to FD_SETSIZE, FD_SET would illegally modify memory that does not belong to the set.

So every call to FD_SET should first check that the fd is within range.
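
A minimal sketch of the safe pattern (the helper name is mine; a pipe stands in for real sockets in the usage below — the FD_SETSIZE check before FD_SET is the point):

```c
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Wait until fd is readable. Returns 1 if readable, 0 on timeout,
 * -1 if fd cannot legally go into an fd_set at all. */
static int wait_readable(int fd, long timeout_ms)
{
    fd_set rfds;
    struct timeval tv;

    if (fd < 0 || fd >= FD_SETSIZE)   /* the check argued for above */
        return -1;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec  = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* nfds is the highest fd in any set, plus one */
    int rc = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (rc > 0 && FD_ISSET(fd, &rfds))
        return 1;
    return rc;
}
```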

Asymmetric TCP buffer sizes

May 12, 2014
tcp/ip internals

While writing unit tests today, I found that setsockopt/getsockopt disagree on the SO_SNDBUF/SO_RCVBUF options: the value getsockopt returns is not the value previously passed to setsockopt.
The following code illustrates the problem:

    on = 1024;
    BUG_ON(tcp_setopt(sfd, TP_NOBLOCK, &on, sizeof(on)));
    BUG_ON(tcp_getopt(sfd, TP_NOBLOCK, &on, &optlen));
    BUG_ON(on != 1024);

The value the kernel finally stores is adjusted; for example, it may be doubled. A related report in Red Hat's bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=170694
My kernel is CentOS 6.5, 2.6.32-431.5.1.el6.x86_64 #1 SMP.
Digging through the kernel source (linux-2.6.32-358.6.2.el6/net/core/):


/*
 *	This is meant for all protocols to use and covers goings on
 *	at the socket level. Everything here is generic.
 */

int sock_setsockopt(struct socket *sock, int level, int optname,
		    char __user *optval, unsigned int optlen)
{
	// ...
	case SO_SNDBUF:
		/* Don't error on this BSD doesn't and if you think
		   about it this is right. Otherwise apps have to
		   play 'guess the biggest size' games. RCVBUF/SNDBUF
		   are treated in BSD as hints */

		if (val > sysctl_wmem_max)
			val = sysctl_wmem_max;
set_sndbuf:
		sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
		if ((val * 2) < SOCK_MIN_SNDBUF)
			sk->sk_sndbuf = SOCK_MIN_SNDBUF;
		else
			sk->sk_sndbuf = val * 2;

		/*
		 *	Wake up sending tasks if we
		 *	upped the value.
		 */
		sk->sk_write_space(sk);
		break;
	// ...

So that explains it: the kernel adjusts the requested value before storing it.
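
The doubling is easy to observe from user space. A small sketch (assumes Linux; the exact value getsockopt() returns may also be clamped by the kernel minimum and by sysctl wmem_max, so treat the factor of two as the typical case rather than a guarantee):

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Set SO_SNDBUF to `request` bytes and return what the kernel actually
 * stored. On Linux this is roughly request * 2 (the extra half covers
 * kernel bookkeeping overhead), subject to min/max clamping. */
static int sndbuf_roundtrip(int fd, int request)
{
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF,
                   &request, sizeof(request)) < 0)
        return -1;

    int stored = 0;
    socklen_t len = sizeof(stored);
    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &stored, &len) < 0)
        return -1;
    return stored;
}
```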

Event-driven architecture, state machines et al.

March 23, 2014
network programming

http://250bpm.com/blog:25

In my previous blog post I've described the problems with callback-based architectures and hinted that the solution may be replacing the callbacks by events and state machines. In this post I would like to discuss the proposed solution in more detail. Specifically, I am going to define what the events and state machines actually are and explain why they are useful.

While the article may be used as an intro to nanomsg's internal architecture, it can also be thought of as an opinion piece of possible interest to anybody dealing with event-driven architectures and/or state machines.

Event-driven architecture

First, let's define the very notion of an "event".

In classic object-oriented paradigm "event" translates directly to "message". Here's what Wikipedia says: "In OOP, each object is capable of receiving messages, processing data, and sending messages to other objects."

However, when you look at actual implementation of object-oriented languages like C++ or Java, "messages" are implemented as "methods", which in the end turn out to be classic function calls.

So, it seems that events are actually function calls and replacing callbacks (= function calls) by events (= function calls) makes no sense. Obviously, there's some kind of terminological mess here.

In reality, of course, events are not exactly the same as function calls, but in majority of cases they behave in the same way so it's hard to tell the difference. To sort it out, let's illustrate the difference using an example.

Imagine method foo() of object A calling method bar() of object B. Object B in turn calls method baz() of object A. It's a classic example of a callback. The sequence diagram for the scenario looks like this:
events1.png

The obvious problem here is that A::baz() is executed while A::foo() is still in progress. What this means is that, on one hand, A::baz() may be executed while object A is in an inconsistent state (keep in mind that only half of A::foo() has been executed so far!) and on the other hand, the state of A changes mysteriously under the feet of A::foo() while it calls B::bar().

Events, in contrast, are executed in a sequential manner. The next event is processed only after the previous one has been fully processed. The sequence diagram for the same scenario would thus look like this:

events2.png

Technically, instead of calling a function, an event is enqueued. The main event loop then processes the events in the queue one by one.

As can be seen the problem encountered previously doesn't happen any more.

For those who haven't used event-driven model yet, it may look somewhat hard to implement. However, the implementation is pretty straightforward. Here's a sample code in C++:

struct event {
    void (*fn) (void *self);
    void *self;
}; 
queue <event> events;

void event_loop () {
    while (!events.empty ()) {
        events.front ().fn (events.front ().self); events.pop ();
    }
}

void raise (event e) {
    events.push (e);
}

A small technical footnote for implementers of event-driven systems in C: Storing enqueued events in intrusive containers (queues) seems to be a good idea, especially when doing low-level programming. Put aside the fact that it's extremely fast, intrusive containers enforce that no event is enqueued twice in parallel. That in turn means that there's no way to overflow the event queue — the number of events in the queue is strictly limited by the amount of business objects at hand. It's kind of similar to keeping the stack size limited in kernel-level programming.
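
The footnote's intrusive-queue idea looks roughly like this in C (a sketch under my own names and layout; nanomsg's real implementation differs in detail). Because the queue node is embedded in the event itself, an event that is already queued is detectable and is never enqueued twice:

```c
#include <stddef.h>

struct event {
    void (*fn)(struct event *self);
    struct event *next;      /* intrusive link */
    int queued;              /* 1 while on the queue */
};

static struct event *q_head, *q_tail;

/* Enqueue at most once: a second raise of the same event is a no-op,
 * so the queue can never hold more entries than there are event
 * objects — it cannot overflow. */
static void raise_event(struct event *e)
{
    if (e->queued)
        return;
    e->queued = 1;
    e->next = NULL;
    if (q_tail)
        q_tail->next = e;
    else
        q_head = e;
    q_tail = e;
}

static void event_loop(void)
{
    while (q_head) {
        struct event *e = q_head;
        q_head = e->next;
        if (!q_head)
            q_tail = NULL;
        e->queued = 0;
        e->fn(e);            /* process fully before the next event */
    }
}

/* Demo handler: counts how many times it actually ran. */
static int demo_hits;
static void demo_handler(struct event *e) { (void)e; demo_hits++; }
```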

Combining events and functions

All the above being said, if you are working with a language where events are not natively supported, turning all the function calls into events is an overkill. In particular, storing all the arguments when an event is raised, to be retrieved when it is processed, tends to be pretty annoying.

What turned out to be a pretty good solution in nanomsg is arranging all the objects into an ownership tree and using functions when moving from root to leaves and events when moving from leaves to the root:

events4.png

The solution has two nice features:

  1. Most of the invocations in any system are root-to-leaf-directed, which means that most invocations in the codebase are simple function calls. The events are relatively rare and don't overwhelm the system.
  2. Given that posting an event is somewhat more complex than invoking a function, it makes the developer stop and think twice before adding a new leaf-to-root-directed invocation. That in turn helps keep interaction patterns inside the codebase as simple and tree-like as possible.

State machines

As several readers have noted, events alone are sufficient to solve the callback problem and there's no actual need for full-blown state machines. The question thus is why use state machines at all.

First, let's define what we mean by state machine. The "state machines" in computer engineering are not "finite state machines" of automata theory. Forget about Turing-incompleteness of finite state machines or their handiness for implementing regular expressions and tokenisers.

When we refer to state machines we mean objects with small part of their state (typically a single enum) singled out and referred to as the state:

events5.png

As can be seen, nothing changes from the technical point of view. What really happens is that the developer decides what part of the object is going to be the state and communicates that in a formal manner (by naming the member 'state'), so that any other developer can check the code and understand which member is supposed to be the state.
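
Concretely, "singling out the state" can be as small as this (a made-up connection object for illustration, not nanomsg code): one enum member named state, and one handler that switches on it.

```c
/* A tiny hand-rolled state machine: the whole "formalism" is just the
 * member called `state` and the switch that drives its transitions. */
enum conn_state { CONN_CLOSED, CONN_CONNECTING, CONN_ESTABLISHED };
enum conn_event { EV_CONNECT, EV_ACK, EV_CLOSE };

struct conn {
    enum conn_state state;   /* the singled-out part of the object */
    int retries;             /* the rest is ordinary, unremarkable data */
};

static void conn_handle(struct conn *c, enum conn_event ev)
{
    switch (c->state) {
    case CONN_CLOSED:
        if (ev == EV_CONNECT)
            c->state = CONN_CONNECTING;
        break;
    case CONN_CONNECTING:
        if (ev == EV_ACK)
            c->state = CONN_ESTABLISHED;
        else if (ev == EV_CLOSE)
            c->state = CONN_CLOSED;
        break;
    case CONN_ESTABLISHED:
        if (ev == EV_CLOSE)
            c->state = CONN_CLOSED;
        break;
    }
}
```

Reading the switch top to bottom generates exactly the kind of narrative discussed below: "a closed connection gets a connect request, waits for the ack, becomes established, later closes."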

Singling out the state has nothing to do with automata theory or even with computer science per se. Rather, it has to do with psychology of programming. And given that most developers are not accustomed to think in terms of psychology, let me give a couple of short warm-up examples.

Think of object-oriented programming. There's no technical reason to split the application into a set of discrete objects. After all, classic procedural programming works no worse than OO programming. The reason why OO programming is preferred is that the human brain thinks in terms of objects and manipulating objects, not in terms of object-less actions. By re-arranging a program into objects — which, in reality, means just singling out the "this" parameter in a function call — the whole program suddenly becomes much easier to grasp intuitively.

As an analogy, think of human sense of sight. The AI research made it clear that what initially looked like a relatively simple sense actually involves extremely sophisticated, albeit unconscious, computing machinery inside the brain. Using sight we are able to immediately and effortlessly spot minor details that we would have extremely hard time to notice if we were presented with the binary format of the same picture.

The point to take from the analogy is that human ability to deal with certain tasks varies by whole orders of magnitude depending on the presentation of the problem (e.g. picture vs. binary dump of the bitmap). In the case of OO programming we take advantage of the fact that we are very good at manipulating objects. The machinery for that is hard-wired into the brain and we can use it automatically, without thinking about it, or even noticing that any work was done.

But let's get back to the state machines now. Is there a similar unconscious mechanism in play there?

Well, it turns out that human brain is extremely good at grasping narratives (stories) and fares relatively poor at dealing with almost anything else. The fact is widely acknowledged in social sciences. To give a current example, press deals pretty well with reporting the story (a narrative) of Edward Snowden fleeing American government, but doesn't do nearly as well with reporting the immensely more important fact (not a narrative) of global surveillance. The audience readily absorbs the story but becomes distracted and even disinterested when presented with abstract facts about inner working of NSA.

That being said, the state machines are an instrument for turning abstract processing mechanisms (objects) into narratives.

All you have to do is to draw all the possible states as boxes and connect them by arrows representing valid state transitions. Let's have a look at the TCP protocol state machine, for example:

events3.png

If you start with the CLOSED state and follow the arrows you can generate life stories (narratives) of different TCP endpoints. Do that for two minutes and you'll get a good idea of how a TCP endpoint works.

Now imagine that instead of the abstracted state machine you would be presented with the full state of a TCP endpoint (including all the RTOs, CWNDs etc.) and the algorithms to manipulate that state. Because of the missing narrative structure, you would have a very hard time to figure out what's going on.

To throw in a bit of personal experience, I've done some coding with events alone, with no explicit state machines. It turned out to be very hard to get it right. Thinking about it in retrospect, the problem wasn't caused, interestingly, by the lack of a narrative. Rather, it was the fact that the human brain is so good at generating narratives that new short-lived narratives popped into my head all the time. And with no formally-defined narrative to rely on, they were often mutually incompatible, even contradictory, which in turn led to confusion and incoherent, buggy code.

Combining events and state machines

While it is possible to use events without state machines (as already discussed) or state machines without events (using standard methods instead of events to drive the state machine), the combination of the two results in an extremely powerful tool for implementing asynchronous systems. In the next blog post I would like to dive into the technicalities of such an implementation, including topics such as structuring of the code, handling of arguments and return values, nested state machines, etc.

Martin Sústrik, August 1st, 2013