On the CAP Theorem

April 16, 2011
Distributed system

As for the theory's weak consistency and eventual consistency, and how to understand them, Werner Vogels puts it this way:

  • Weak consistency. The system does not guarantee that subsequent accesses will return the updated value. A number of conditions need to be met before the value will be returned. The period between the update and the moment when it is guaranteed that any observer will always see the updated value is dubbed the inconsistency window.
  • Eventual consistency. This is a specific form of weak consistency; the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value. If no failures occur, the maximum size of the inconsistency window can be determined based on factors such as communication delays, the load on the system, and the number of replicas involved in the replication scheme. The most popular system that implements eventual consistency is DNS (Domain Name System). Updates to a name are distributed according to a configured pattern and in combination with time-controlled caches; eventually, all clients will see the update.

But Werner Vogels only expounds; certain details are left unsaid. At least in his description I cannot see the essential difference between weak consistency and eventual consistency. Moreover, most of the views in China derive from this source, so I suspect many people's understanding has the same possible problems as mine: the definition of consistency, and of partition tolerance too.

Reading Dan Weinreb's blog post was very enlightening, and it is worth discussing here.

 

What Does the Proof of the "CAP theorem" Mean?

Several years back, Eric Brewer of U.C. Berkeley presented the "CAP conjecture", which he explained in the slides from his keynote speech at the PODC conference in 2000. The conjecture says that a system cannot be consistent, available, and partition-tolerant; that is, it can have two of these properties, but not all three. This idea has been very influential.

Seth Gilbert and Nancy Lynch, of MIT, in 2002, wrote a now-famous paper called "Brewer's Conjecture and the Feasibility of Consistent Available Partition-Tolerant Web Services". It is widely said that this paper proves the conjecture, which is now considered a theorem. Gilbert and Lynch clearly proved something, but what does the proof mean by "consistency", "availability", and "partition-tolerance"?

Many people refer to the proof, but not all of them have actually read the paper, thinking that it's all obvious. I wasn't so sure, and wanted to get to the bottom of it. There's something about my personality that drives me to look at things all the way down to the details before I feel I understand. (This is not always a good thing: I sometimes lose track of what I originally intended to do, as I "dive down a rat-hole", wasting time.) For at least a year, I have wanted to really figure this out.

A week ago, I came across a blog entry called "Availability and Partition Tolerance" by Jeff Darcy. You can't imagine how happy I was to find someone who agreed that there is confusion about the terms, and that they need to be clarified. Reading Jeff's post inspired me to finally read Gilbert and Lynch's paper carefully and write these comments.

I had an extensive email conversation with Jeff, without whose help I could not have written this. I am very grateful for his generous assistance. I also thank Seth Gilbert for helping to clarify his paper for me. I am solely responsible for all mistakes.

I will now explain it all for you. First I'll lay out the basic concepts and terminology. Then I'll discuss what "C", "A", and "P" mean, and the "CAP theorem". Next I'll discuss "weak consistency", and summarize the meaning of the proof for practical purposes.

Basic Concepts

The paper has terminology and axioms that must be laid out before the proof can be presented.

A distributed system is built of "nodes" (computers), which can (attempt to) send messages to each other over a network. But the network is not entirely reliable. There is no bound on how long a message might take to arrive. This implies that a message might "get lost", which is effectively the same as taking an extremely long time to arrive. If a node sends a message (and does not see an acknowledgment), it has no way to know whether the message was received and processed or not, because either the request or the response might have been lost.

There are "objects", which are abstract resources that reside on nodes. Objects can perform "operations" on other objects. Operations are synchronous: some thread issues a request and expects a response. Operations do not request other operations, so they do not do any messaging themselves.

There can be replicas of an object on more than one node, but for the most part that doesn't affect the following discussion. An operation could "read X and return the value", "write X", "add X to the beginning of a queue", etc. I'll just say "read" for an operation that has no side-effects and returns some part of the state of the object, and "write" to mean an operation that performs side-effects.

A "client" is a thread running on some node, which can "request" an object (on any node) to perform an operation. The request is sent in a message, and the sender expects a response message, which might return a value, and which confirms that the operation was performed. In general, more than one thread could be performing operations on one object. That is, there can be concurrent requests.

The paper says: "In this note we will not consider stopping failures, though in some cases a stopping failure can be modeled as a node existing in its own unique component of a partition." Of course in any real distributed system, nodes can crash. But for purposes of this paper, a crash is considered to be a network failure, because from the point of view of another node, there's no way to distinguish between the two. A crashed node behaves exactly like a node that's off the network.

You might say that if a node goes off the network and comes back, that's not the same as a crash because the node loses its volatile state. However, this paper does not concern itself with a distinction between volatile and durable memory.  There's no problem with that; issues of what is "in RAM" versus "on disk" are orthogonal to what this paper is about.

Consistent

The paper says that consistency "is equivalent to requiring requests of the distributed shared memory to act as if they were executing on a single node, responding to operations one at a time." They explain this more explicitly by saying that consistency is equivalent to requiring all operations (in the whole distributed system) to be "linearizable".

"Linearizability" is a formal criterion presented in the paper "Linearizability: A Correctness Condition for Concurrent Objects", by Maurice Herlihy and Jeannette Wing. It means (basically) that operations behave as if there were no concurrency.

The linearizability concept is based on a model in which there is a set of threads, each of which can send an operation to an object, and later receive a response. Despite the fact that the operations from the different threads can overlap in time in various ways, the responses are as if each operation took place instantaneously, in some order. The order must be consistent with each thread's own order, so that a read operation in a thread always sees the results of that thread's own writes.

Linearizability per se does not include failure atomicity, which is the "A" ("atomic") in "ACID". But Gilbert and Lynch assume no node failures. So operations are atomic: they always run to completion, even if their response messages get lost.

So by "consistent" ("C"), the paper means that every object is linearizable. (That's not what the "C" in "ACID" means, by the way, but that's not important.) Very loosely, "consistent" means that if you get a response, it has the right answer, despite concurrency.

This is not what the "C" in "ACID transaction" means. It's what the "I" means, namely "isolation" from concurrent operations. This is probably a source of confusion sometimes.

Furthermore, the paper says nothing about transactions, which would have a beginning, a sequence of operations, and an end, which may commit or abort. "ACID" is talking about the entire transaction. The "linearizability" criterion only talks about individual operations on objects. (So the whole "ACID versus BASE" business, while cute, can be misleading.)

Available

"Available" is defined as "every request received by a non-failing node in the system must result in a response." The phrase "non-failing node" seemed to imply that some nodes might be failing and others not. But since the paper postulates that nodes never fail, I believe the phrase is redundant, and can be ignored. After the definition, the paper says "That is, any algorithm used by the service must eventually terminate."

The problem here is that "eventually" could mean a trillion years. This definition of "available" is only useful if it includes some kind of real-time limit: the response must arrive within a period of time, which I'll call the maximum latency.

Next, it's very important to notice that "A" says nothing about the content of the response. It could be anything, as far as "A" is concerned; it need not be "successful" or "correct". (If you think otherwise, see section 3.2.3.)

So "available" ("A") means: If a client sends a request to a node, it always gets back some response within L time, but there is no guarantee about contents of the response.

Partition Tolerant

There is no definition, per se, of the term "partition-tolerant", not even in section 2.3, "Partition Tolerance".

First, what is a "partition"? They first define it to mean that there is a way to sort all the nodes into separate sets, which they call "components", such that all messages sent from a node in one component to a node in a different component are lost. But then they go on to say "And any pattern of message loss can be modeled as a temporary partition separating the communicating nodes at the exact instant the message is lost." For their formal purposes, "partition" simply means that a message can be lost. (The whole "component" business can be disregarded.) That's probably not what you had in mind!

In real life, some messages are lost and some aren't, and it's not exactly clear when a "partition" situation starts, is happening, or ends. I realize that for practical purposes, we usually know what a partition means, but if we're going to do formal proofs and understand what was proved, one must be completely clear about these terms.

Even in a local-area network, packets can be dropped. Protocols like TCP re-transmit packets until the destination acknowledges that they have arrived. If that happens, it's clearly not a network failure from the point of view of the application. "Losing messages" must have something to do with nodes entirely unable to communicate for a "long" time compared to the latency requirements of the system.

Furthermore, remember that node failure is treated as a network failure.

So "partition-tolerant" ("P") means that any guarantee of consistency or availability still holds even if there is a partition. In other words, if a system is not partition-tolerant, then if the network can lose messages or any node can fail, any guarantee of availability or consistency is voided.

CAP

The CAP theorem says that a distributed system as described above cannot have properties C, A, and P all at the same time. You can only have two of them. There are three cases:

AP: You are guaranteed to get back responses promptly (even with network partitions), but you aren't guaranteed anything about the value/contents of the response. (See section 3.2.3.) A system like this is entirely useless, since any answer can be wrong.

CP: You are guaranteed that any response you get (even with network partitions) has a consistent (linearizable) result. But you might not get any responses whatsoever. (See section 3.2.1.) This guarantee is also completely useless, since the entire system might always behave as if it were totally down.

CA: If the network never fails (and nodes never crash, as they postulated earlier), then, unsurprisingly, life is good. But if messages could be dropped, all guarantees are off. So a CA guarantee is only useful in a totally reliable system.

At first, this seems to mean that practical, large distributed systems (which aren't entirely reliable) can't make any useful guarantees! What's going on here?

Weak Consistency

Large-scale distributed systems that must be highly available can provide some kind of "weaker" consistency guarantee than linearizability. Most such systems provide what they call "eventual consistency" and may return "stale data".

For some applications, that's OK. Google search is an obvious case: the search is already known to be using "stale" data (data from the last time Google looked at the web page), so as long as partitions are fixed quickly relative to the speed of Google's updating everything (and even if sometimes not, for that matter), nobody is going to complain.

Just saying that results "might be stale" and will be "eventually consistent" is unfortunately vague. How stale can it be, and how long is "eventually"? If there's no limit, then there's no useful guarantee.

For a staleness-type weak consistency guarantee, you'd like to be able to say something like: "operations (that read) will always return a result that was consistent with all the other operations (that write) no longer ago than time X". And this implies that "write" operations are never lost, i.e. always happen within a fixed time bound.

t-Connected Consistency

Gilbert and Lynch discuss "weakened consistency" in section 4. It's also about stale data, but with "formal requirements on the quality of stale data returned". They call it "t-Connected Consistency".

It makes three assumptions. (a) Every node has a clock that can be used to do timeouts. The clocks don't have to be synchronized with each other. (b) There's some time period after which you can assume that an unanswered message must be lost. (c) Every node processes a received message within a given, known time.

The real definition of "t-Connected Consistency" is too formal for me to explain here (see section 4.4). It (basically) guarantees that (1) when there is no partition, the system is fully consistent; (2) if a partition happens, requests can see stale data; and (3) after the partition is fixed, there's a time limit on how long it takes for consistency to return.

Are the assumptions OK in practice? Every real computer can do timeouts, so (a) is no problem. You can always ignore any responses to messages after the time period, so (b) is OK. It's not obvious that every system will obey (c), but some will.

I have two reservations. First, if the network is so big that it's never entirely working at any one time, what would guarantee (3) mean? Second, in the algorithm in section 4.4, in the second step ("write at node A"), it retries as long as necessary to get a response. But that could exceed L, violating the availability guarantee.

So it's not clear how attractive t-Connected Consistency really is. It can be hard to come up with formal proofs of more complicated, weakened consistency guarantees. Most working software engineers don't think much about formal proofs, but don't underrate them. Sometimes they can help you identify bugs, before they happen, that would otherwise be hard to track down.

Jeff Darcy wrote a blog posting about "eventual consistency" about half a year ago, which I recommend. And there are other kinds of weak consistency guarantees, such as the one provided by Amazon's Dynamo key-value store, which is worth examining.

Reliable Networks

Can't you just make the network reliable, so that messages are never lost? ("Never" meaning that the probability of losing a message is as low as the other failure modes that you're not protecting against.)

Lots and lots of experience has shown that in a network with lots of routers and such, no matter how much redundancy you add, you will experience lost messages, and you will see partitions that last for a significant amount of time. I don't have a citation to prove this, but, ask around and that's what experienced operators of distributed systems will always tell you.

How many routers is "lots"? How reliable is it if you have no routers (layer 3 switches), only hubs (layer 2 switches)? What if you don't even have hubs? I don't have answers to all this. But if you're going to build a distributed system that depends on a reliable network, you had better ask experienced people about these questions. If it involves thousands of nodes and/or is geographically distributed, you can be sure that the network will have failures.

And again, as far as the proof of the CAP theorem is concerned, node failure is treated as a network failure. Having a perfect network does you no good if machines can crash, so you'd also need each node to be highly-available in and of itself. That would cost a lot more than using "commercial off-the-shelf" computers.

The Bottom Line

My conclusion is that the proof of the CAP theorem means, in practice: if you want to build a distributed system that is (1) large enough that nodes can fail and the network can't be guaranteed to never lose messages, and (2) you want to get a useful response to every request within a specified maximum latency, then the best you can guarantee about the meaning of the response is that it is guaranteed to have some kind of "weak consistency", which you had better carefully define in such a way that it's useful.

P.S.

After writing this but just before posting it, Alex Feinberg added a comment to my previous blog post with a link to this excellent post by Henry Robinson, which discusses many of the same issues and links to even more posts. If you want to read more about all this, take a look.

References

Dan Weinreb (2010), What Does the Proof of the "CAP theorem" Mean?

Self-Balancing Strategies for Balanced Binary Trees

February 11, 2011
Programming practices

The goal of this article is to offer some inspirational ideas for the design of data structures and algorithms.

Array --> binary search --> linked list --> binary search tree

Binary Search

Binary search is a very important algorithm, most commonly applied to sorted arrays. But one property of arrays makes binary search hard to realize in discrete storage structures:

  • A contiguous memory region, where offsets are computed automatically from subscripts

In other words, operations on a sorted array can be carried out with subscripts alone. But in real systems, when we cannot provide a large enough contiguous block of memory at once, or when we must process massive amounts of data stored discretely in memory (linked lists and the like), this approach does not apply.

Recall binary search on a sorted array: every bisection happens exactly at the middle element. How could a linked list do that?

The answer is: it can't. Discreteness is random by nature; you cannot locate the middle node directly. The binary search tree solves this problem: start from the middle node and decide the next branch from there.

So the question now is: who gets to be the middle node?

  1. For static data, recursively elect middle nodes to build an optimal binary search tree.
  2. For dynamic data, the crux is how to rebalance the bisecting structure of the whole tree whenever a node is inserted or deleted.

Hence self-balancing, which brings the idea of binary search to discrete storage structures.

Self-Balancing Strategies

Self-balancing means performing a series of operations that maintain some invariant, an invariant chosen to better realize the idea of binary search: the balance factor of AVL trees, the five constraints of red-black trees, and so on.

Rotation is the most common strategy for achieving self-balance: nodes migrate between the left and right subtrees to keep the structure dynamically balanced.

A left rotation, roughly speaking, migrates nodes from the right subtree over to the left subtree; a right rotation is symmetric. Below, taking the red-black tree as an example, we discuss how self-balancing strategies are implemented.

When to Rotate

When the left and right subtrees hold unequal numbers of nodes, the structure is unbalanced, and imbalance calls for rotation. The concrete implementation depends on the scenario and is determined by the structure's own invariants. An AVL tree, for example, rotates when the heights of its two subtrees differ by more than one. (To be continued.)

Wrapping epoll in an Asynchronous API

January 9, 2011
Programming practices

epoll's design follows the Unix KISS philosophy: keep it simple, stupid.

But KISS does not equal convenient. epoll offers three simple functions for asynchronous event handling: epoll_create(), epoll_ctl(), and epoll_wait(). They are not friendly; faced with complex business logic they leave you without a clear structure, and they call for a further layer of wrapping.

What does a developer usually need?

  1. The object to be handled asynchronously: a file descriptor
  2. The events to monitor: readable or writable (EPOLLIN | EPOLLOUT)
  3. A handler to run when an event fires, which epoll does not provide (mechanism separated from policy)

So I wrapped epoll's interface into something simple and easy to use, suited to more general scenarios.

Data Structures

C code

    typedef void eProc(struct eloop *el, int fd, void *argv, int mask);

    /* File event structure */
    typedef struct efile {
        int mask;
        eProc *rProc;
        eProc *wProc;
        void *data;
    } efile_t;

    typedef struct eloop {
        int efd;
        int stop;
        efile_t ev[AE_SETSIZE];
        efile_t rd[AE_SETSIZE];
    } EL;

efile_t is the basic asynchronous I/O object: it stores the handler to call when an event fires and the data passed along to it. eloop is the principal structure. efd is the epoll descriptor returned by epoll_create(), in charge of the many epoll_events, and ev[AE_SETSIZE] holds all the events currently being monitored: if eloop is the supervisor, then efd is the executor and ev the objects under watch. The results of each epoll_wait() call are stored in rd[AE_SETSIZE].

Programming Interface

C code

    EL  *el_open(void);
    int  el_close(EL *el);
    int  el_stop(EL *el);
    int  el_start(EL *el, int flags);
    int  el_addFileEvent(EL *el, int fd, int mask, eProc *proc, void *data);
    int  el_delFileEvent(EL *el, int fd, int mask);

This API is convenient to use: el_start() starts the polling loop, el_stop() pauses it, el_addFileEvent() adds an event to the EL, with support for registering a callback, and el_delFileEvent() removes an event. The code has not been tested; use it with caution.

>> See the source code for the concrete design

Object-Style Interfaces for Data Structures

November 23, 2010
Programming practices

The C language suits the development of large core infrastructure such as platforms and systems; it is procedural and does not support object-oriented semantics.

Still, I have always held that object orientation is a way of thinking, not some established technology, so OO has nothing to do with the language: in any language you can write object-oriented code. The OO mindset is certainly not just inheritance and polymorphism; languages such as C++ or Java support object-oriented semantics and are therefore better suited to expressing that mindset, and indeed Java was designed precisely to address the question of such semantics.

When working on a project, one painful problem is the design of interface names, both internal and external. Small projects are one thing, but once the architecture grows large, interfaces easily fall into chaos. C-style code is not as easy to read as C++ or Java; after all, C programmers mostly write modules, and the more complex the functionality, the more function naming tends toward disorder, until it is hard to see what is being written. The greatest advantage of languages with object-oriented semantics is that it is easy to write layered, clearly structured code; in a language like C that is not easy, so a good function-naming convention matters a great deal to a project.

Back to the topic. First, what is this "objectification" technique? Making data structures object-like means that each family of structures carries its own suitable management interface. In essence it is all the same; only the style of writing differs. Taking the red-black tree as an example, let us see how the kernel implements it (partial interface):

C code

    /* Macros that define a red-black tree */
    #define RB_HEAD(name, type)                                     \
    struct name {                                                   \
        struct type *rbh_root; /* root of the tree */               \
    }

    #define RB_INITIALIZER(root)                                    \
        { NULL }

    #define RB_INIT(root) do {                                      \
        (root)->rbh_root = NULL;                                    \
    } while (/*CONSTCOND*/ 0)

    #define RB_BLACK    0
    #define RB_RED      1

    #define RB_ENTRY(type)                                          \
    struct {                                                        \
        struct type *rbe_left;      /* left element */              \
        struct type *rbe_right;     /* right element */             \
        struct type *rbe_parent;    /* parent element */            \
        int rbe_color;              /* node color */                \
    }

    /* Main rb operation.
     * Moves node close to the key of elm to top
     */
    #define RB_GENERATE(name, type, field, cmp)                     \
        RB_GENERATE_INTERNAL(name, type, field, cmp,)
    #define RB_GENERATE_STATIC(name, type, field, cmp)              \
        RB_GENERATE_INTERNAL(name, type, field, cmp, __unused static)
    #define RB_GENERATE_INTERNAL(name, type, field, cmp, attr)      \
    attr void                                                       \
    name##_RB_INSERT_COLOR(struct name *head, struct type *elm) {......} \
    attr void                                                       \
    name##_RB_REMOVE_COLOR(struct name *head, struct type *parent,  \
        struct type *elm) {......}                                  \
    attr struct type *                                              \
    name##_RB_REMOVE(struct name *head, struct type *elm) {......}  \
    /* Inserts a node into the RB tree */                           \
    attr struct type *                                              \
    name##_RB_INSERT(struct name *head, struct type *elm) {......}  \
    /* Finds the node with the same key as elm */                   \
    attr struct type *                                              \
    name##_RB_FIND(struct name *head, struct type *elm) {......}    \
    /* Finds the first node greater than or equal to the search key */ \
    attr struct type *                                              \
    name##_RB_NFIND(struct name *head, struct type *elm) {......}   \
    /* ARGSUSED */                                                  \
    attr struct type *                                              \
    name##_RB_NEXT(struct type *elm) {......}                       \
    /* ARGSUSED */                                                  \
    attr struct type *                                              \
    name##_RB_PREV(struct type *elm) {......}                       \
    attr struct type *                                              \
    name##_RB_MINMAX(struct name *head, int val) {......}

The key parts are RB_ENTRY and RB_GENERATE_STATIC. In the kernel's red-black tree implementation, the kernel does not care about the concrete shape of your data: data and structure are separated, and the red-black tree operations are realized through RB_ENTRY and a compare function. The entire interface is implemented as macros, and the crucial element is name. For example:

C code

    #define RB_HEAD(name, type)                       \
    struct name {                                     \
        struct type *rbh_root; /* root of the tree */ \
    }

So what is name? name is a kind of alias for the structure; in substance it amounts to creating an RB "object", and objects are distinguished from one another by name. The benefit is that the red-black tree implementation becomes highly reusable. The drawback is that simulating object-oriented behavior by macro substitution is no different from the abuse of inline. An example:

C code

    /* Tree of chunks. */
    typedef struct chunk_node_s chunk_node_t;
    struct chunk_node_s {
        /* Linkage for the chunk tree. */
        RB_ENTRY(chunk_node_s) link;
        /* Pointer to the chunk that this tree node is responsible for. */
        void    *chunk;
        /* Total chunk size. */
        size_t  size;
    };
    typedef struct chunk_tree_s chunk_tree_t;
    RB_HEAD(chunk_tree_s, chunk_node_s);

    static int
    chunk_comp(chunk_node_t *a, chunk_node_t *b)
    {
        if (a->size > b->size)
            return (1);
        else if (a->size == b->size)
            return (0);
        else
            return (-1);
    }

    /* Generate red-black tree code for chunks. */
    RB_GENERATE_STATIC(chunk_tree_s, chunk_node_s, link, chunk_comp);
Back to the kernel red-black tree interface: what does RB_GENERATE_STATIC actually do? At the preprocessing stage it generates, for the object named by name, a whole series of related operations. If a program contains several red-black tree objects, each looks like just one more RB_GENERATE_STATIC macro, but in substance the original code has been copied once again, differing only in name and the compare interface.

There is in fact a very elegant alternative: function pointers, which many of the kernel-level implementations we use all the time also employ, bundling a data structure's behavior together with the structure itself. The implementation of the Berkeley DB database is a fine example. Look:

C code

    typedef struct {
        DBTYPE type;
        int (*close)(const DB *db);
        int (*del)(const DB *db, const DBT *key, u_int flags);
        int (*fd)(const DB *db);
        int (*get)(const DB *db, DBT *key, DBT *data, u_int flags);
        int (*put)(const DB *db, DBT *key, const DBT *data, u_int flags);
        int (*sync)(const DB *db, u_int flags);
        int (*seq)(const DB *db, DBT *key, DBT *data, u_int flags);
    } DB;

The DB database object encapsulates all of its operations inside its own struct. From many angles, compared with interfaces that convey program structure only through naming conventions, this approach is plainly clearer.

This World Is Ours

September 25, 2010
Life

Potato is my brother. We grew up together, through primary school, junior high, and high school; to put it jokingly, we grew up wearing the same pair of underpants.

Potato was gifted. As a small child he would draw all over the house with chalk, and draw remarkably well, but his parents were farmers, there were few educated people around, and Potato never met anyone who could recognize his talent. His calligraphy was good too; he won a national prize while still in primary school, but with nobody to cultivate it, it is only passable now. He also read a great deal: he once took, bit by bit, more than a thousand yuan from home and spent it all on books, not yet knowing what a large sum that was. What he showed me most often were Carnegie's works, such as How to Win Friends and Influence People; sadly I didn't know many characters then and read little. Those books had a great influence on how Potato grew up.

Potato began as a village kid running wild in the streets. He got into a third-rate primary school in town, transferred in fourth grade to the town's best central primary school, graduated into the city's worst vocational school, and three years later was admitted to the city's best high school. His university, of course, was rather poor. That is the curve of his life; there are far too many Potato stories.

Potato was unusually drawn to mathematics. I once leafed through his manuscripts: a thick stack, yellowed covers, water-blotted ink, recording six primary-school years of his thinking on all kinds of math problems; for instance, what relation holds among the radii of a chain of circles whose centers lie on one line?

In junior high Potato was once entered under another school's name in a national competition and won a prize; what came of it afterwards I never learned.

Potato loved books and did not care to make friends; not exactly withdrawn, but a loner. He had many nicknames: "Huagulong", because he was so thin, and "Water-Skimmer of the Light Industry School", because he hated wasting time on walking, so at mealtimes he ran astonishingly fast, never bumping into anyone no matter how crowded it was. I gave him a nickname too, because he smiled just like a woman; but he rarely smiled, and most of the time he quietly did his own things.

I once asked him about the people in our class, this one and that one, and he could rarely answer; they left little impression on him. Potato did not even know his own teachers' names. The math teacher he did know. In those days Potato had a long stretch of glory: he knew far more mathematics than anyone else, always found the teacher slow and dawdling, and went off to read on his own. The first time Potato sat the school competition, a large part of the paper was beyond the syllabus, yet it gave him no trouble, and he took first place in the whole school. Second place, four points behind, was from our own class, and when the results went up on big red paper the school burst into an uproar, because third place trailed by more than twenty points. Through this, Potato and I made our first friend, Afei. From then until graduation it was always Potato first and Afei second, though Afei's English was the better. Later Potato went on to Gaozhou Middle School while Afei jumped to Maoming No. 1 Middle School, and after that we had even less contact.

In others' eyes Potato was mature, full of sense, different in how he treated people; so although he had few friends, many classmates liked to confide in him quietly, even about their broken hearts. One thing influenced him deeply, and it goes back to primary school, when several classmates often bullied him, one of whom followed him into junior high. Potato later told me that over some small matter, one that was not actually his fault, he took the initiative to write that boy a letter, setting out many things in detail and offering a sincere apology. I asked him: did you ever think that reading it he might find you all the easier to bully? Potato said he didn't know; he had learned it from a book and wanted to try. Soon a reply came, slipped quietly into his pencil box, likewise recounting many things, and saying: I never thought you would write to me. I imagine the boy was moved that someone would step forward and apologize for a fault not his own; that is a kind of humility. Potato said: to forgive others is to be kind to yourself.

Potato had ideals, even if he did not know what his ideal was; precisely because of that, he worked hard to grow, so that one day he might see the outside world. I believe nobody who got up in the middle of the night and walked past the bicycle shed to the toilet can forget that chapter of struggle: a group of students burning the midnight oil under the street lamps, Potato, Minkai, Zusen, and of course me.

Potato also once had a few very good friends, the so-called Gang of Four: Potato, Jinhai, Jianqiang, and Wen Jiazhou. He met Jiazhou on the first evening of school. The dorm was lively; Jiazhou and Potato had neighboring beds and neither spoke, Potato listening to "Map of the Soul", Jiazhou playing Beyond songs. Then something marvelous happened: Jiazhou stretched out both legs and kicked straight up, arms and legs flailing like a baby's. Potato must have thought Jiazhou was a complete goof; the two exchanged a look, and that is how they met.

Afterwards Potato and I went to Gaozhou Middle School, Minkai to No. 2, Zusen to No. 1, and Weiqi to No. 4; with the rest we lost contact from then on.

In his first year of high school Potato was madly obsessed with hacking. He often ignored the lessons, sitting through class with a copy of Hacker X Files or Hacker's Handbook. I remember he even took on a junior, Chen Wei, and that story is rather funny.

Potato thought his skills were decent, so he founded a team with some name like the Extreme Deep-Freeze Binary Network Security Alliance, the name borrowed from Evil Octal. Awei happened to be reading the same kind of books, and the two hit it off as partners. Potato decided a team alone was not enough: there had to be team rules, a team charter, and of course membership dues. So he went off to the market and bought all sorts of materials, even had a seal carved, spending a few hundred yuan in all. When everything was ready he went to Awei, explained the membership procedure, asked for a photo, and announced a fee of five yuan. He hadn't really wanted the money; he just feared that charging nothing would make Awei suspect a scam, so he charged five yuan symbolically, and told Awei there were members from several other teams he would introduce someday. In fact there was nobody; Potato simply wanted the team to seem a little mysterious. In those days, after all, being a hacker was a gloriously swaggering thing. I still laugh whenever I think of it.

Potato's grades slid terribly; at the final exams his nine subjects together did not reach 400 points, and I think that was a large part of the reason. I went to him and said: is what you see now really what you want? I said you shouldn't waste your time here; if you truly love this, you should pursue it at a higher level, or at least wait until university. We talked knee to knee for a long time, about ideals, and in the end Potato accepted my advice, turned his mind back to his studies, and set out to win a place at a good university.

Potato went back to the three-points-one-line routine, quieter, it seemed, and busier than before. He still could not remember his classmates' names and seldom spoke, only thinking through his own problems. He had his own way of studying: unlike the others, he did very few exercises yet got the same results. Rather than blindly drilling the formulas, Potato preferred to uncover the inner relations among these natural laws; most of his time went into thinking and summarizing. Once he thought his way into a problem he could not get out of and went to ask the teacher, who was astonished: the reasoning was sound, but it was a calculus problem, and they had not learned calculus yet.

Slowly Potato's grades climbed back to the upper middle of the class, and by senior year he sat securely in the top ten; getting into the school he wanted was no longer in question.

But the world does play tricks on people. Potato's life rose and fell as if to a rhythm. That year, I think, had it not been for the accident at the college entrance exam, his life would not have turned so sharply; perhaps today he too would be striding through ACM contest halls, or have become the darling of some laboratory.

None of this, it seems, was fated for Potato. After the exam he left no contact information, did not even stay for the graduation photo; he shouldered his pack and left alone, and for two years there was no word of him.

When I saw Potato again I was nearly through my second year of university. When a person is lost, shouldn't he go back to the starting point and think things through? He said he was grateful to have finally found his ideal. In passing I learned that he was in love. I asked where she was from; Henan, he said. I laughed and told him: good, two people striving together are stronger than one.

Potato came to find me several times after that. I could sense his drive and his tenacity. I sometimes teased him as the civilian scientist. Potato asked: away from the laboratory, can a person really achieve nothing at all? Of course not, I said. For Potato, science is a faith; it is the freedom of inquiry.

On his latest visit we ran ourselves dripping with sweat around the courts in the school gymnasium. After dinner I walked him off, and coming back I paced alone, quietly, through the campus. In all my years at university I have rarely had such a state of mind, and thinking over our past twenty years, I was filled with a thousand feelings.

Then a text arrived from Potato: this world does not belong to you all alone; I too can change it.

Yes. As long as we do not give up, this world is ours.

Thoughts on Text Compression

December 7, 2009
Programming practices

Lately I have spent a lot of time, on and off, studying text compression. I say text rather than data because in particular settings text allows much larger improvements and performance gains; a design based on ASCII, for instance, can differ completely from one based on Unicode, and serving different scenarios with a single scheme exacts a price, such as performance. Moreover, different languages behave in different ways, which poses a new challenge for vertically specialized compression techniques.

Driven by business needs, I hoped to build a faster compression algorithm, one with outstanding speed on the decoding side in particular.

I tried some detail-level improvements to Huffman coding, without much visible effect. Huffman coding's virtue is that even in the worst case it guarantees some compression, though how much is another matter. Encoding is fairly fast; the hard part is the bit-stream storage format: decoding has to advance a single bit at a time, which takes far too long for me to accept. Studying algorithms is not armchair strategy; you do not solve the problem, or grasp the secrets of information theory, by sketching a binary tree on paper. Research divorced from real application means little, so we must be clear about the requirements we face and what improvements to make to the algorithm.

 

The lz77 algorithm may take somewhat longer than traditional canonical Huffman: Huffman needs only one pass to collect the coding model, while lz77 spends more time building the dictionary and searching backwards through it. But lz77's great distinction is decoding speed; compared with other codes its decoding is not even in the same order of magnitude. Most implementations of lz77 and its variants use fixed-length codes and work at the byte level rather than on bit streams, which is inherently much faster. Better still, lz77 decodes each character with a single look back into the sliding buffer, so the total time complexity is Θ(n), linear.

Gzip is a very successful example of lz77, and what is said about it online is far too shallow to be worth mentioning. The exquisite part of Gzip is the many profitable refinements made to the algorithm. That is my guess, because no matter how I design and optimize, I can never catch up with gzip's speed; the evidence shows that gzip really is excellent.

For the sliding window, that is the dictionary, I did not use a hash table but a two-dimensional array to index arbitrary characters: partly because I could not be bothered with the complexity, and partly because the array seemed more direct since I work mainly on single characters. The array keeps an index of every character appearing in the sliding window, with insertion sort used to improve query performance.

Up to now, the most debatable questions in the whole implementation are the size of the sliding window and the maximum match length it is designed to admit; they matter most for compression effectiveness.

From a linguistic angle, Chinese differs from English. The commonest English tokens, the, of, to, and, match well even in a very short sliding window, but Chinese does not; it also depends on how the character set is implemented, so the window must be larger. Here a performance bottleneck appears: window size and maximum match length scale together linearly. If the sliding window is made too large, then even when good match lengths exist the program spends too much time on matching. And a large body of data shows that not every string finds an optimal match: under small windows, say 31 or 63, the average length of a successful match is around one to two characters, meaning 2 bits are enough to store the match length. I default to 3 bits, because I found that a 3-bit length with a 31-slot window has the edge in compression.

The final benchmark: compressing a text corpus of about 80 MB, my implementation averages 1.413 s to compress and 0.248 s to decompress, with a compression ratio of 48.5%; gzip averages 0.998 s and 0.131 s, with a ratio of 25%.

I keep wondering how gzip manages it, and whether multithreading would make my numbers better.

As far as I know, at the hardware level a disk array can raise I/O efficiency, which also benefits compression.

The Mathematics of Block-Sorting Compression

November 14, 2009
Programming practices

Also called the BWT (Burrows-Wheeler Transform), this is strictly speaking not a compression algorithm but an auxiliary technique of the compression field: the sole purpose of BWT is to transform the information into a form more amenable to compression. Its ingenuity lies in the fact that, carefully engineered, it adds only a small amount of computation and no other overhead at all, because its output is nothing more than a particular rearrangement of the original characters.

For ease of description, take a simple example. Let the sequence to be processed be S = abraca, with length N = 6. Rotate S right by one position, moving the last character of S to the front, to form a new sequence S'; rotate S' right once more to produce S''; repeat until you obtain an N x N matrix M. Sort the rows of M lexicographically to get M', and take the last column of M' as the output, as follows:

Note the red # above: it is an auxiliary end-of-sequence marker whose purpose is to record which row of M' holds the original sequence S. This matters: without the end marker you must maintain a separate index value, and the sort result will differ, though the finally recovered character sequence is the same.

The inversion is even more interesting, and very simple. We know the output sequence is L = #rcaaba, the last column of matrix M'; likewise F is the first column. Inversion is the process of recovering the whole of M' from L; the row containing # is the original sequence S, and since M' is lexicographically sorted, F is naturally in order.

Theorem: for any sequence S of N characters, produce the matrix M by the BWT rule and sort M lexicographically to obtain M'. Let F be the first column of M' and L the last column. If F is in sorted order, then L is uniquely determined.

From the construction we know that every row and every column of M' contains the same set of elements. That is, setting lexicographic order aside, the characters of F and L stand in a one-to-one correspondence T. Here L = #rcaaba, so F = aaabcr#, and T = {6,5,4,0,1,3,2}.

Computing T has one subtlety: not every character in L and F is unique; N characters may share the same value while differing in index. The order of equal characters in L is preserved exactly in F. That is, for the character a, with L = #rcaaba and F = aaabcr#, T cannot be {6,5,4,1,0,3,2}. We can prove why. Let ai be the sequence formed by the occurrences of a in L, i = 1, 2, 3, ..., where i numbers the appearances: a1 is the first a to appear in L, a2 the second, and so on; likewise let a'i, i = 1, 2, 3, ..., be the sequence of occurrences of a in F. One can prove that a'i is exactly ai.

Suppose pi is the character that immediately follows ai in the sequence S. For any row r of M' whose last character is ai, the first character of r must be pi. Without loss of generality, take a1 and a2 in L, lying in rows m and n, whose first characters are p1 and p2. Since index(a1) < index(a2), clearly index(p1) < index(p2). Now consider the a'i in F: because the values a'i are all equal, what the lexicographic order actually ranks is the pi rather than the ai, i.e. a'i pi = ai pi, hence a'i = ai. QED.

With the correspondence T, recovering the original sequence from F becomes quite easy: S[N-1-i] = L[T^i[I]], i = 0, 1, 2, ..., N-1, where I is the position of the end marker # in L, T^0[x] = x, and T^(i+1)[x] = T[T^i[x]]. In this example I = 0, so S[N-1] = L[T^0[I]] = L[I] = L[0] = F[6] = #, and S[N-2] = L[T^1[I]] = L[T[0]] = L[6] = F[2] = a; finally S[0] = a, and the whole sequence comes out.

Michael Burrows and David Wheeler really were clever, blazing a trail by thinking about the compression process from an entirely different angle. It is a great help in thinking about algorithm design: learn how to look for the breakthrough point. Have you ever wondered what a multi-stage BWT transform would look like?

References:

Burrows, M. and Wheeler, D. (1994), A Block-sorting Lossless Data Compression Algorithm, Technical Report 124, Digital Equipment Corporation.

Crash

November 12, 2009
Life

This is a nuanced work that probes human psychology and emotional reactions from many angles.

The story follows a district attorney and his family, a Persian shopkeeper, a Mexican locksmith, two carjacking hustlers, a rookie cop, a middle-aged Korean couple, and other such characters who, in the span of barely 36 hours, through indirect, direct, even unintentional interactions, reveal people's emotional reactions in different moments and settings, and how the chain reactions among those emotions shape their fates, attitudes, and views...

"... You know, in most cities you brush past people; now and then somebody bumps into you. But in Los Angeles nobody ever touches you; we are always hiding behind steel and glass. Perhaps we miss that sense of touch so much that we need a violent crash just to feel something..."

That is the theme this film asks us to think about.

Shannon Entropy and How the Brain Learns

March 6, 2009
Life

Information is an abstract notion. We often say there is a lot of information, or very little, but it is hard to say how much exactly; for instance, how much information does an 800-character article carry?

Not until 1948 did C. E. Shannon, the father of information theory, introduce the entropy of thermodynamics into information theory in his paper A Mathematical Theory of Communication, stating in mathematical language the relation between probability and redundancy.

In communications, as everyone knows, links carry bit streams, and the two parties agree on a convention in advance: say, 00 stands for "I", 01 for "love", 10 for "my", 11 for "wife". If A sends B the bits 00011011, B knows that A said "I love my wife".

Shannon pointed out that all information contains redundancy, and the amount of redundancy depends on the probability, that is the uncertainty, of each symbol (digit, letter, or word) in the message. What is redundancy? Whatever is superfluous and repeated. If A sends B the sequence 00011011 a thousand times in a row, 999 of them are superfluous: A need only send it once and tell B how many times to repeat it.

For example, suppose a system S contains events S = {E1, ..., En}, each occurring with probability P = {p1, ..., pn}. Then the information content of each event itself is: Ie = -log2(pi) (logarithm base 2, unit: bit).

English has 26 letters; if every letter appeared in text with equal frequency, the information content of each letter would be: log2(26) ≈ 4.7 bit

Chinese has roughly 2500 common characters; if every character appeared in text with equal frequency, the information content of each character would be: log2(2500) ≈ 11.3 bit

This is Shannon's theoretical limit for information. Compared with transmission in the UTF-8 character set, the limiting saving is 8 - 4.7 = 3.3 bits per English character, and 24 - 11.3 = 12.7 bits per Chinese character (note: UTF-8 represents an English character in 8 bits and a common Chinese character in 24 bits).

-------------------------------------------------------------------------------------

OK, if you have followed the above, you can see how similar information processing is to how the brain learns. The brain has two functional parts: one stores information, the other processes it. Today I only want to discuss how the brain takes in new knowledge. Suppose the unit of information is the "knowledge point"; then learning is simply the continual storing of new knowledge points.

But different brains take in new knowledge in different ways, just like the communication conventions above; in a computer program it corresponds to which information-processing algorithm you choose. Different algorithms give different results. You need not let 00 stand for "I": you could let 0 stand for "I", 100 for "love", 101 for "my", and 110 for "wife", making the transmitted stream 0100101110, two bits more than under the first convention.

Mapped onto learning, the information-processing algorithm is a person's capacity for taking in new knowledge. That is: the stronger your capacity, the closer your brain's processing of the same material comes to the informational limit of the content itself, and the less you need to store compared with everyone else.

The capacity meant here has two senses: memory, and the ability to reveal the relations between pieces of knowledge.

There is no real difference between people's powers of memory; keep that firmly in mind. Memory is in essence the replaying of knowledge. To make your knowledge more solid, replay it constantly: reading a word once and reading it a thousand times have utterly different effects, and what matters more, the more a knowledge point is replayed, the easier the place the brain moves it to for finding. (Because someone saw this, the MTF transform was invented.)

Whether you can reveal the relations between pieces of knowledge, and to what depth, is the key measure of the differences in capacity from person to person; arithmetic coding is a fine example. Note also that the more complex the relation, the more brain capacity it consumes, sometimes more than the knowledge itself; this part I am not so clear about myself. One thing is certain: the more you dig, the more agile your mind becomes.

So capacity very often matters far more than knowledge.

Format Strings

October 23, 2008
Programming practices

This is a problem I found in the Mozilla open-source code; I am still not sure whether it is a bug.

C99 introduced standard types such as uintptr_t and uintmax_t. As we know, printf() can format all the common data types (d for int, the t length modifier for ptrdiff_t), but for a type like pid_t there is no corresponding length modifier. In the C89 world, the usual solution is to cast to the largest plausible integer type; for pid_t, POSIX only requires that it be a signed integer type, so casting to long is fine:

C code

    pid_t pid;
    printf("%ld\n", (long)pid);

C99, however, introduced another approach: macros in place of length modifiers. For uint32_t and uintmax_t there are PRIu32 and PRIuMAX, defined in <inttypes.h> (in the glibc source, sysdeps/generic/inttypes.h). If you are interested, read the source; they are simply macro wrappers around the existing format conversions. The second way is clearly more elegant, but recently I ran into an even nicer one, though it looks like a bit of a detour:

C code

    /* Make sure BUFSIZE is large enough. */
    #define UMAX2S_BUFSIZE 21
    static char *
    umax2s(uintmax_t x, char *s)
    {
        unsigned i;

        /* LINTED */
        assert(sizeof(uintmax_t) <= 8);
        i = UMAX2S_BUFSIZE - 1;
        s[i] = '\0';
        do {
            i--;
            s[i] = "0123456789"[x % 10];
            x /= (uintmax_t)10LL;
        } while (x > 0);
        return (&s[i]);
    }

Taking uintmax_t as the example, the function converts the integer digit by digit from the end into characters in the buffer, filling from the right. A quick test with the number 2334532234523 shows how the trailing digits are handled.

Note, though, that uintmax_t is an unsigned integer type; to handle signed values you must peel off the sign and work with the absolute value. I only mention this here; the rest is easy to improvise. Mozilla, by the way, uses the same method.

We don’t want to depend on vsnprintf() for production builds, since that can cause unnecessary bloat for static binaries.  umax2s() provides minimal integer printing functionality, so that malloc_printf() use can be limited to MALLOC_STATS code.

C code

    /* LINTED */
    assert(sizeof(uintmax_t) <= 8);
    i = UMAX2S_BUFSIZE - 1;
    s[i] = '\0';
    do {
        i--;
        s[i] = "0123456789"[(int)x % 10];
        x /= (uintmax_t)10LL;
    } while (x > 0);

What strikes me as problematic is the line s[i] = "0123456789"[(int)x % 10]; above: Mozilla casts x to int. Since uintmax_t can represent a far wider range than int, if the value of x exceeds what int can express and the bit that lands in int's sign position happens to be 1, the cast produces a negative number. And what does a negative number modulo 10 give? Probably a bug.