Odds and Ends

January 2, 2012
Life

I've started hitting a bottleneck: studying, research, no progress at all.

About the CET-4

The CET-4 is over. Too simple. I've sat it a few times before, which is a bit of a joke.

Hard to believe I still haven't passed it; it's rather baffling. Honestly I never prepared for the CET-4: both my vocabulary and my listening come from ordinary day-to-day accumulation, through materials, books, blogs, movies, and so on. Then it struck me that English has already seeped into every corner of my life.

Value the method of learning, don't neglect the process; persist, and wait for the moment to ripen.

As a child I was frail and had little strength; in PE class I could never strain like an ox the way my classmates did. The year I took the high-school entrance exam, a physical test was added. My training showed no improvement, so I let it go and aimed merely to pass. I trained by the book, carefully following the prescribed method. Then, in the final month, my results shot up, and I ended up scoring remarkably well.

Ever since then I have believed that principle without reservation.

Internship

A friend asked me the other day, hoping also to recruit a few classmates, to intern at his company on Hadoop work, building an application similar to Google Earth.

The only part that really interests me is the HBase used in the project, a distributed storage system and exactly what I'm researching right now, though I lean toward theory and implementation rather than applications.

So I'm hesitating over whether to go. Honestly, working alongside someone deep in the industry would add a great deal to my experience. My friend asked whether I'd consider building a career in Beijing: stay at his company after the internship, everything else easily arranged, with a salary of 5k once hired full-time.

A few reasons to decline:

  1. I have some ideas of my own that I dearly want to realize, however limited my ability
  2. Besides, though the future is hard to predict, I have basically resolved to return to the south; my beloved is there
  3. If it's a good company, I don't much care about pay; learning what I want matters, and prospects matter more. But if the company is mediocre, I'd merely be burning my youth, and working behind closed doors is of little value
  4. My expected salary is somewhat higher, and I prefer Internet companies

For now my plan is to wait and see what exactly everyone will be working on; if it looks promising I'll reconsider.

Thermal underwear

The weather has been bitterly cold these days; at dinner my legs and feet are freezing.

A few days ago I went through the closet looking for winter clothes, but my only sweater had gone stiff and smelled odd. No wonder: after years of wear it had just been left there mildewing in the closet.

Looking back, the warmest winter of recent years was when I was in Kaifeng, where my beloved was doing her undergraduate studies. I visited her over the holidays; she bought me wool pants and wool socks and bundled me up like a rice dumpling every day, which felt thoroughly unnatural. ^_^

Coming back from the exam at noon today, I switched on my phone and a delivery notice arrived at once.

It turned out to be a set of thermal underwear my beloved had picked for me. I opened it up: fluffy and warm, and unlike a sweater it doesn't make you look like a tubby kid, which really doesn't suit us southerners.

Without further ado, here's a picture.

Reading source code on a Kindle

December 23, 2011
Programming practices

I recently bought a Kindle 4. Great device, but sadly useless for reading papers; conversions come out too garbled to read.

So I hit on the idea of reading source code on the Kindle instead. Here's the method.

Step 1: Merge the source files

Scan a source directory: walk all the .h and .c files under src and merge them into one output file, one chapter per file, so the result keeps a hierarchy. If there are multiple subdirectories, wrap the walk in a function and recurse (a sketch of that variant follows at the end of this step).

C code
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <dirent.h>
#include <stdlib.h>

#define BUFFSIZE 1024

int main(int argc, char **argv)
{
    char buf[BUFFSIZE], *p = buf;
    FILE *fp, *nfp;
    DIR *dp;
    struct dirent *dir;
    char chapter[BUFFSIZE];
    int cp = 1;

    if (argc != 3) {
        printf("Usage: %s <src_dir> <out_file>\n", argv[0]);
        exit(1);
    }
    /* buf holds "<src_dir>/"; p points just past the trailing slash */
    strcpy(buf, argv[1]);
    p = p + strlen(argv[1]);
    *p++ = '/';
    if ((nfp = fopen(argv[2], "w")) == NULL) {
        printf("cannot open output file: %s\n", argv[2]);
        exit(1);
    }
    if ((dp = opendir(argv[1])) == NULL) {
        printf("cannot open directory: %s\n", argv[1]);
        exit(1);
    }
    while ((dir = readdir(dp)) != NULL) {
        if (strcmp(dir->d_name, ".") == 0 || strcmp(dir->d_name, "..") == 0)
            continue;
        strcpy(p, dir->d_name);          /* buf is now "<src_dir>/<file>" */
        if ((fp = fopen(buf, "r")) == NULL)
            continue;
        int ch;                          /* int, not char: fgetc() returns EOF */
        sprintf(chapter, "%d", cp++);
        /* one <h1> heading per file, so each file becomes a chapter */
        fputs("<h1>The ", nfp);
        fputs(chapter, nfp);
        fputs(" Chapter   ", nfp);
        fputs(dir->d_name, nfp);
        fputs("</h1>\n\n", nfp);
        while ((ch = fgetc(fp)) != EOF)
            fputc(ch, nfp);
        fputs("\n\n\n\n\n", nfp);
        fclose(fp);
    }
    fclose(nfp);
    closedir(dp);
    return 0;
}

The <h1></h1> tags are added to make later locating and editing easier; of course you can leave them out and work with regexes directly.

The output looks like this:

<h1>The 1 Chapter   syncio.c</h1>
/* Synchronous socket and file I/O operations useful across the core.
 *
 * Copyright (c) 2009-2010, Salvatore Sanfilippo <antirez at gmail dot com>
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are met:
Clearly, <h1>The 1 Chapter   syncio.c</h1> is not C code, so vim cannot highlight it.
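As for the note in step 1 about subdirectories, the recursive variant might look like this (a sketch, untested; merge_file() is a hypothetical helper standing in for the per-file copying logic of the program above):

C code
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <dirent.h>

/* Hypothetical helper: copies one source file into the output,
 * emitting the <h1> chapter heading as in the program above. */
void merge_file(const char *path, FILE *nfp);

/* Recursively walk dirpath, merging every regular file found. */
void merge_dir(const char *dirpath, FILE *nfp)
{
    char path[1024];
    DIR *dp = opendir(dirpath);
    struct dirent *dir;
    struct stat st;

    if (dp == NULL)
        return;
    while ((dir = readdir(dp)) != NULL) {
        if (strcmp(dir->d_name, ".") == 0 || strcmp(dir->d_name, "..") == 0)
            continue;
        snprintf(path, sizeof(path), "%s/%s", dirpath, dir->d_name);
        if (stat(path, &st) != 0)
            continue;
        if (S_ISDIR(st.st_mode))
            merge_dir(path, nfp);     /* descend into the subdirectory */
        else if (S_ISREG(st.st_mode))
            merge_file(path, nfp);
    }
    closedir(dp);
}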

Step 2: vim's trusty TOhtml

Use vim's TOhtml command to export the source file as syntax-highlighted HTML. For a large file this takes a little while.

Note that the Kindle is too narrow to display line numbers comfortably, so strip them:

:%s/<span>.\{6}<\/span>//g

Then replace every &lt; with < and every &gt; with >.
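As ex commands, that is:

:%s/&lt;/</g
:%s/&gt;/>/g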

Save the result as an HTML file. See the example: redis.c.html


Step 3: Convert with Calibre

Calibre can convert HTML files directly into mobi, epub, and other formats, which is very convenient.
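Its bundled command-line tool does this in one step, e.g. (assuming calibre is installed):

ebook-convert redis.c.html redis.c.mobi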

If you have the time, study the mobi file format and write the conversion yourself; that would do away with the Calibre step altogether.

But I searched for a long time and found little material on it. So be it, ha.

GitHub's '\n' filename bug

December 16, 2011
Programming practices

While setting up GitHub the other day I hit a small problem: one of my files happened to have a name containing the newline character '\n', and after pushing it the whole project's web pages became inaccessible, always showing:

An unexpected error seems to have occurred. Why not try refreshing your page? Or you can contact us if the problem persists.

It took a long stretch of digging to find that this was the cause.

In ASCII, '\n' is a control character, yet we habitually treat filenames as plain character streams; in communication between applications, for instance, '\n' is quite likely to be handled as a special delimiter. Filesystems such as ext, meanwhile, allow filenames containing '\n' by default. Hence what follows.
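Creating such a file takes one call (a minimal sketch):

C code
#include <stdio.h>

int main(void)
{
    /* a perfectly legal filename on ext2/3/4, awkward everywhere else */
    FILE *fp = fopen("hello\nworld.txt", "w");
    if (fp)
        fclose(fp);
    return 0;
}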

I quickly found that many applications cannot handle this case correctly and fail in all kinds of baffling ways:

  • p7zip keeps only the characters before the '\n' and writes them into the resulting archive, yet the result file's own name is unchanged
  • gzip compresses and decompresses normally, but file-roller cannot read gzip-processed files: Segmentation fault. A quick debugging session shows the stack corrupts inside gtk_main(), and I couldn't trace further
  • jar, zip, and the like replace the '\n' with a character such as ^J
  • rar fails outright: it cannot compress at all. Oddly, calling rar through file-roller does produce an archive, though an unreadable one
  • ...

 

My current guess is that this is purely a web-page problem, unrelated to the backend server programs.

Smaz: small-object compression

October 29, 2011
Programming practices

A compression algorithm for small objects: a project called Smaz, https://github.com/antirez/smaz, currently used in the Redis database.

Unlike traditional compression algorithms, Smaz is geared to small objects, from a few bytes to a few dozen. Setting aside the hard-coded dictionary, compression and decompression together take about 120 lines of code: remarkably lean, yet it performs well, averaging roughly 40%~50% compression in tests.

Granted, the code is not perfect in every respect, nor robust, but it is practical; antirez insists on speed above all and trades away some safety. For related discussion, see

http://news.ycombinator.com/item?id=540048

 

Core code

Smaz's core code has only two parts:

  1. The dictionary design
  2. The hash function

For the dictionary, Smaz hard-codes everything; antirez presumably ran statistics over a large relevant corpus and distilled the most frequent character combinations

C code
/* Reverse compression codebook, used for decompression */
static char *Smaz_rcb[254] = {
" ", "the", "e", "t", "a", "of", "o", "and", "i", "n", "s", "e ", "r", " th",
" t", "in", "he", "th", "h", "he ", "to", "\r\n", "l", "s ", "d", " a", "an",
"er", "c", " o", "d ", "on", " of", "re", "of ", "t ", ", ", "is", "u", "at",
"   ", "n ", "or", "which", "f", "m", "as", "it", "that", "\n", "was", "en",
"  ", " w", "es", " an", " i", "\r", "f ", "g", "p", "nd", " s", "nd ", "ed ",
"w", "ed", "http://", "for", "te", "ing", "y ", "The", " c", "ti", "r ", "his",
"st", " in", "ar", "nt", ",", " to", "y", "ng", " h", "with", "le", "al", "to ",
"b", "ou", "be", "were", " b", "se", "o ", "ent", "ha", "ng ", "their", "\"",
"hi", "from", " f", "in ", "de", "ion", "me", "v", ".", "ve", "all", "re ",
"ri", "ro", "is ", "co", "f t", "are", "ea", ". ", "her", " m", "er ", " p",
"es ", "by", "they", "di", "ra", "ic", "not", "s, ", "d t", "at ", "ce", "la",
"h ", "ne", "as ", "tio", "on ", "n t", "io", "we", " a ", "om", ", a", "s o",
"ur", "li", "ll", "ch", "had", "this", "e t", "g ", "e\r\n", " wh", "ere",
" co", "e o", "a ", "us", " d", "ss", "\n\r\n", "\r\n\r", "=\"", " be", " e",
"s a", "ma", "one", "t t", "or ", "but", "el", "so", "l ", "e s", "s,", "no",
"ter", " wa", "iv", "ho", "e a", " r", "hat", "s t", "ns", "ch ", "wh", "tr",
"ut", "/", "have", "ly ", "ta", " ha", " on", "tha", "-", " l", "ati", "en ",
"pe", " re", "there", "ass", "si", " fo", "wa", "ec", "our", "who", "its", "z",
"fo", "rs", ">", "ot", "un", "<", "im", "th ", "nc", "ate", "><", "ver", "ad",
" we", "ly", "ee", " n", "id", " cl", "ac", "il", "</", "rt", " wi", "div",
"e, ", " it", "whi", " ma", "ge", "x", "e c", "men", ".com"
};

These strings are then spread as evenly as possible across the slots to speed up lookups. The hash design looks quite simple

C code
h1 = h2 = in[0]<<3;
if (inlen > 1) h2 += in[1];
if (inlen > 2) h3 = h2^in[2];

Working out why it behaves well is another matter. Look at the decoding dictionary: the combinations are distributed very evenly across the buckets, which shows the hash does its job

C code
static char *Smaz_cb[241] = {
"\002s,\266",
"\003had\232\002leW",
"\003on \216",
"",
"\001yS",
"\002ma\255\002li\227",
"\003or \260",
"",
"\002ll\230\003s t\277",
"\004fromg\002mel",
"",
"\003its\332",
"\001z\333",
"\003ingF",
"\001>\336",
"\001 \000\003   (\002nc\344",
"\002nd=\003 on\312",
"\002ne\213\003hat\276\003re q",
"",
/* ...... */
"\003, a\224"
};

To quote Marc Lehmann: "... the hashing function might seem strange, just believe me, it works!"

The hash design is presumably tied to the statistical analysis, and h1, h2, and so on are somewhat independent of each other. As a hypothesis: settle h1 first and examine the dictionary's distribution, then tune h2 so the distribution grows more uniform, then h3, and so on, arriving at a very good result. Of course, many of the finer techniques can't be grasped at one stroke; the beauty of the design lies in the process, not the finished artifact
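A round-trip through the library's two entry points looks like this (a sketch assuming the signatures declared in smaz.h; untested):

C code
#include <stdio.h>
#include <string.h>
#include "smaz.h"   /* int smaz_compress(char*, int, char*, int);
                       int smaz_decompress(char*, int, char*, int); */

int main(void)
{
    char in[] = "this is a small string, the kind Smaz is tuned for";
    char comp[256], back[256];

    int clen = smaz_compress(in, strlen(in), comp, sizeof(comp));
    int dlen = smaz_decompress(comp, clen, back, sizeof(back));

    printf("%zu bytes -> %d bytes (%.0f%%)\n",
           strlen(in), clen, 100.0 * clen / strlen(in));
    printf("round-trip: %.*s\n", dlen, back);
    return 0;
}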

 

Use cases

What good is small-object compression?

The benefit is plain, and the settings are many; traditional compression doesn't solve everything. Take socket communication in a cluster, or a persistent service of enormous scale handling hundreds of thousands of concurrent client connections, each message carrying around 50 bytes. At, say, one million messages per second (10 × 100,000), a 50% compression ratio saves about:

(10 × 100000 × 50 / 1024 / 1024) × 50% ≈ 24 MB/s

For a machine room where bandwidth is scarce, that is a considerable saving.

 

And this one is tame; the modified LZF algorithm in Redis is stranger still

Following your own heart

October 6, 2011
Life

Yunfeng (云风) has left NetEase. Surprise, settling into calm. I'm reminded of Professor Anderson: "Having lived to this age, I find history beginning to repeat itself, and it is simply tedious. Really, it's time for me to go."

Ten years. What drives a person to spend ten years putting everything into doing one thing well? If it were you, would you look back after a decade and ask what had truly mattered to you? Given another ten years, what would you choose to do to fill them? Everyone has a different answer, I suppose

I haven't listened to music in a long while, because melodies that seize the ear at once have grown rare. I don't know much about music, but I often wonder whether music works like writing: no phrase, however ornate, beats the plainest, simplest expression. Clean piano tones, bright scores, sorrowful melodies; listen to 石进 (Shi Jin)'s 夜的钢琴曲 (Night Piano Pieces)

I have seen two kinds of people, scientists and engineers: scientists uncover natural phenomena and laws, engineers build the world and improve life. Few walk the whole road from the wellspring of research to a shipped product; it takes more than a few decades. If I remember correctly, Yunfeng began the Fengshen (风魂) 2D game engine around his junior year. From its development to Westward Journey Online (大话西游) and on to the wildly popular Fantasy Westward Journey (梦幻西游), Fengshen itself may be gone today, but Yunfeng has left real treasure for those who follow, just as he himself put it: leaving something behind for later generations is my wish

 

At this moment I seem to be facing the me of ten years from now. Three years ago I asked myself: must a person wait for failure before regretting the past? So I chose to fight for a path of my own. Today, then, with what frame of mind should I face my future, and the me of ten and twenty years on

B.S. Kushwah: Life Is Too Short ...... life is short; while young, do what you love and finish your dreams, without grudge or regret. Isn't that how I have lived all along?

I don't know whether our futures will look alike. I hope the me of ten years hence holds high hopes for the me of today, for with dreams there is hope, and with dreams there is passion

Hope may not be strictly necessary to us, nor passion; but keeping a dreaming heart surely makes life more interesting, and makes the eventual leaving less lonely, doesn't it?

Reality and dreams need not clash; a dream has many roads to realization, and different people read it in different ways

A dream should be a quiet refuge within one's own heart. Different stages of life bring different ideals and pursuits, but giving some things up does not mean discarding the dream; something always stays, something utterly pure that walks the long road ahead beside us

My heartfelt blessings to Yunfeng; this piece is written in remembrance of his ten-year journey of the heart

Every heart is an island, at every moment / even with the beloved at your side; you close your eyes, wishing to hold on to this moment of boundless longing. / In the end, what we choose to confide, and to hear, are only the monologues of our own hearts.

Textpress development notes

August 23, 2011
Programming practices

Beauty is what I have always pursued: simplicity and efficiency, for instance.

Over the years I have used plenty of blog systems, first WordPress, MovableType, Drupal, and Plone, later Textpattern, TextPress, and others, yet never found one that truly fit, perhaps because an excessive pursuit of perfection made me excessively harsh about small flaws. From then on I had the idea of writing my own. I do have one quirk: as long as something still meets my needs within what I can tolerate, I absolutely refuse to change it, which is how I've used vim for three years without even learning auto-completion

I truly got moving during my recent month at the Shenzhen Graduate School (hereafter, the graduate school). I spent most of my free time there thinking through the blog's architecture and design, and read the source code of quite a few other blog systems, including WordPress and Textpattern. Textpattern is simply excellent; I once spent a long stretch studying it, especially its textile writing syntax.

After nearly a month of continual debugging and revision, my first blog program was born, and I named it Textpress. No tags, no TrackBack, only simple journaling and basic admin, but I like it very much: it meets my needs, and I can easily extend it whenever I need to. Primitive and plain!

As a developer, the best way to learn which product suits you is to practice on it until you are satisfied

This is one of the few photos from my time at the graduate school; thanks to my girlfriend for recording them

Postscript: I will always miss that month at the graduate school; every day was full. Never before had I thought so deeply and earnestly about whether I understand this industry, what I am doing, and where I am going. By the time I left, I had found myself an answer.

I am deeply grateful to my girlfriend for all the support and encouragement she gave me that month. Some nights I got home and still worked late, costing her a sound sleep, and she never complained; sometimes I was too busy to bother with meals, and she gave up a good nap to walk all the way to the canteen to buy me food. The power of love is great indeed, it even adds weight: under her care I gained four jin (about two kilograms) this month, truly unprecedented and unheard of. I hope I don't slim right back down at school, or I'll be heartbroken.

Today my girlfriend began her first year of graduate school. I'm delighted; a kiss for her

That's all for now; off to get a tooth filled

On the CAP theorem

April 16, 2011
Distributed system

How should one understand weak consistency and eventual consistency in theory? Werner Vogels puts it this way:

  • Weak consistency. The system does not guarantee that subsequent accesses will return the updated value. A number of conditions need to be met before the value will be returned. The period between the update and the moment when it is guaranteed that any observer will always see the updated value is dubbed the inconsistency window.
  • Eventual consistency. This is a specific form of weak consistency; the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value. If no failures occur, the maximum size of the inconsistency window can be determined based on factors such as communication delays, the load on the system, and the number of replicas involved in the replication scheme. The most popular system that implements eventual consistency is DNS (Domain Name System). Updates to a name are distributed according to a configured pattern and in combination with time-controlled caches; eventually, all clients will see the update.

But Werner Vogels is only expounding, and some details go unexplained; at least from his description I cannot see the essential difference between weak consistency and eventual consistency. Most domestic commentary traces back to this source as well, so I suspect many people's understanding has the same problems mine did: the definition of consistency, and of partition tolerance too.

Dan Weinreb's blog post, reproduced below, was a real eye-opener and is worth discussing

 

What Does the Proof of the "CAP theorem" Mean?

Several years back, Eric Brewer of U.C. Berkeley presented the "CAP conjecture", which he explained in these slides from his keynote speech at the PODC conference in 2000. The conjecture says that a system cannot be consistent, available, and partition-tolerant; that is, it can have two of these properties, but not all three. This idea has been very influential.

Seth Gilbert and Nancy Lynch, of MIT, in 2002, wrote a now-famous paper called "Brewer's Conjecture and the Feasibility of Consistent Available Partition-Tolerant Web Services". It is widely said that this paper proves the conjecture, which is now considered a theorem. Gilbert and Lynch clearly proved something, but what does the proof mean by "consistency", "availability", and "partition-tolerance"?

Many people refer to the proof, but not all of them have actually read the paper, thinking that it's all obvious. I wasn't so sure, and wanted to get to the bottom of it. There's something about my personality that drives me to look at things all the way down to the details before I feel I understand. (This is not always a good thing: I sometimes lose track of what I originally intended to do, as I "dive down a rat-hole", wasting time.) For at least a year, I have wanted to really figure this out.

A week ago, I came across a blog entry called "Availability and Partition Tolerance" by Jeff Darcy. You can't imagine how happy I was to find someone who agreed that there is confusion about the terms, and that they need to be clarified. Reading Jeff's post inspired me to finally read Gilbert and Lynch's paper carefully and write these comments.

I had an extensive email conversation with Jeff, without whose help I could not have written this. I am very grateful for his generous assistance. I also thank Seth Gilbert for helping to clarify his paper for me. I am solely responsible for all mistakes.

I will now explain it all for you. First I'll lay out the basic concepts and terminology. Then I'll discuss what "C", "A", and "P" mean, and the "CAP theorem". Next I'll discuss "weak consistency", and summarize the meaning of the proof for practical purposes.

Basic Concepts

The paper has terminology and axioms that must be laid out before the proof can be presented.

A distributed system is built of "nodes" (computers), which can (attempt to) send messages to each other over a network. But the network is not entirely reliable. There is no bound on how long a message might take to arrive. This implies that a message might "get lost", which is effectively the same as taking an extremely long time to arrive. If a node sends a message (and does not see an acknowledgment), it has no way to know whether the message was received and processed or not, because either the request or the response might have been lost.

There are "objects", which are abstract resources that reside on nodes. Objects can perform "operations" on other objects. Operations are synchronous: some thread issues a request and expects a response. Operations do not request other operations, so they do not do any messaging themselves.

There can be replicas of an object on more than one node, but for the most part that doesn't affect the following discussion. An operation could "read X and return the value", "write X", "add X to the beginning of a queue", etc. I'll just say "read" for an operation that has no side-effects and returns some part of the state of the object, and "write" to mean an operation that performs side-effects.

A "client" is a thread running on some node, which can "request" an object (on any node) to perform an operation. The request is sent in a message, and the sender expects a response message, which might returns a value, and which confirms that the operation was performed. In general, more than one thread could be performing operations on one object. That is, there can be concurrent requests.

The paper says: "In this note we will not consider stopping failures, though in some cases a stopping failure can be modeled as a node existing in its own unique component of a partition." Of course in any real distributed system, nodes can crash. But for purposes of this paper, a crash is considered to be a network failure, because from the point of view of another node, there's no way to distinguish between the two. A crashed node behaves exactly like a node that's off the network.

You might say that if a node goes off the network and comes back, that's not the same as a crash because the node loses its volatile state. However, this paper does not concern itself with a distinction between volatile and durable memory.  There's no problem with that; issues of what is "in RAM" versus "on disk" are orthogonal to what this paper is about.

Consistent

The paper says that consistency "is equivalent to requiring requests of the distributed shared memory to act as if they were executing on a single node, responding to operations one at a time." They explain this more explicitly by saying that consistency is equivalent to requiring all operations (in the whole distributed system) to be "linearizable".

"Linearizability" is a formal criterion presented in the paper "Linearizability: A Correctness Condition for Concurrent Objects", by Maurice Herlihy and Jeannette Wing. It means (basically) that operations behave as if there were no concurrency.

The linearizability concept is based on a model in which there is a set of threads, each of which can send an operation to an object, and later receive a response. Despite the fact that the operations from the different threads can overlap in time in various ways, the responses are as if each operation took place instantaneously, in some order. The order must be consistent with each thread's own order, so that a read operation in a thread always sees the results of that thread's own writes.

Linearizability per se does not include failure atomicity, which is the "A" ("atomic") in "ACID". But Gilbert and Lynch assume no node failures. So operations are atomic: they always run to completion, even if their response messages get lost.

So by "consistent" ("C"), the paper means that every object is linearizable. (That's not what the "C" in "ACID" means, by the way, but that's not important.) Very loosely, "consistent" means that if you get a response, it has the right answer, despite concurrency.

This is not what the "C" in "ACID transaction" means. It's what the "I" means, namely "isolation" from concurrent operations. This is probably a source of confusion sometimes.

Furthermore, the paper says nothing about transactions, which would have a beginning, a sequence of operations, and an end, which may commit or abort. "ACID" is talking about the entire transaction. The "linearizability" criterion only talks about individual operations on objects. (So the whole "ACID versus BASE" business, while cute, can be misleading.)

Available

"Available" is defined as "every request received by a non-failing node in the system must result in a response." The phrase "non-failing node" seemed to imply that some nodes might be failing and others not. But since the paper postulates that nodes never fail, I believe the phrase is redundant, and can be ignored. After the definition, the paper says "That is, any algorithm used by the service must eventually terminate."

The problem here is that "eventually" could mean a trillion years. This definition of "available" is only useful if it includes some kind of real-time limit: the response must arrive within a period of time, which I'll call the maximum latency.

Next, it's very important to notice that "A" says nothing about the content of the response. It could be anything, as far as "A" is concerned; it need not be "successful" or "correct". (If you think otherwise, see section 3.2.3.)

So "available" ("A") means: If a client sends a request to a node, it always gets back some response within L time, but there is no guarantee about contents of the response.

Partition Tolerant

There is no definition, per se, of the term "partition-tolerant", not even in section 2.3, "Partition Tolerance".

First, what is a "partition"? They first define it to mean that there is a way to assort all the nodes into separate sets, which they call "components", and all messages sent from a node in one component to another nodes in a separate component are lost. But then they go on to say "And any pattern of message loss can be modeled as a temporary partition separating the communicating nodes at the exact instance the message is lost." or their formal purposes, "partition" simply means that a message can be lost. (The whole "component" business can be disregarded.)  That's probably not what you had in mind!

In real life, some messages are lost and some aren't, and it's not exactly clear when a "partition" situation starts, is happening, or ends. I realize that for practical purposes, we usually know what a partition means, but if we're going to do formal proofs and understand what was proved, one must be completely clear about these terms.

Even in a local-area network, packets can be dropped. Protocols like TCP re-transmit packets until the destination acknowledges that they have arrived. If that happens, it's clearly not a network failure from the point of view of the application. "Losing messages" must have something to do with nodes entirely unable to communicate for a "long" time compared to the latency requirements of the system.

Furthermore, remember that node failure is treated as a network failure.

So "partition-tolerant" ("P") means that any guarantee of consistency or availability is still guaranteed even if there is a partition. In other words, if a system is not partition-tolerant, that means that if the network can lose messages or any nodes can fail, then any guarantee of atomicity or consistency is voided.

CAP

The CAP theorem says that a distributed system as described above cannot have properties C, A, and P all at the same time. You can only have two of them. There are three cases:

AP: You are guaranteed to get back responses promptly (even with network partitions), but you aren't guaranteed anything about the value/contents of the response. (See section 3.2.3.) A system like this is entirely useless, since any answer can be wrong.

CP: You are guaranteed that any response you get (even with network partitions) has a consistent (linearizable) result. But you might not get any responses whatsoever. (See section 3.2.1.) This guarantee is also completely useless, since the entire system might always behave as if it were totally down.

CA: If the network never fails (and nodes never crash, as they postulated earlier), then, unsurprisingly, life is good. But if messages could be dropped, all guarantees are off. So a CA guarantee is only useful in a totally reliable system.

At first, this seems to mean that practical, large distributed systems (which aren't entirely reliable) can't make any useful guarantees! What's going on here?

Weak Consistency

Large-scale distributed systems that must be highly available can provide some kind of "weaker" consistency guarantee than linearizability. Most such systems provide what they call "eventual consistency" and may return "stale data".

For some applications, that's OK. Google search is an obvious case: the search is already specified/known to be using "stale" data (data since the last time Google looked at the web page), so as long as partitions are fixed quickly relative to the speed of Google's updating everything, (and even if sometimes not, for that matter), nobody is going to complain.

Just saying that results "might be stale" and will be "eventually consistent" is unfortunately vague. How stale can it be, and how long is "eventually"? If there's no limit, then there's no useful guarantee.

For a staleness-type weak consistency guarantee, you'd like to be able to say something like: "operations (that read) will always return a result that was consistent with all the other operations (that write) no longer ago than time X". And this implies that "write" operations are never lost, i.e. always happen within a fixed time bound.

t-Connected Consistency

Gilbert and Lynch discuss "weakened consistency" in section 4. It's also about stale data, but with "formal requirements on the quality of stale data returned". They call it "t-Connected Consistency".

It makes three assumptions. (a) Every node has a clock that can be used to do timeouts. The clocks don't have to be synchronous with each other. (b) There's some time period after which you can assume that an unanswered message must be lost. (c) Every node processes a received message within a given, known time.

The real definition of "t-Connected Consistency" is too formal for me to explain here (see section 4.4). It (basically) guarantees (1) when there is no partition, the system is fully consistent; (2) if a partition happens, requests can see stale data; and (3) and after the partition is fixed, there's a time limit on how long it takes for consistency to return.

Are the assumptions OK in practice? Every real computer can do timeouts, so (a) is no problem. You can always ignore any responses to messages after the time period, so (b) is OK. It's not obvious that every system will obey (c), but some will.

I have two reservations. First, if the network is so big that it's never entirely working at any one time, what would guarantee (3) mean? Second, in the algorithm in section 4.4, in the second step ("write at node A"), it retries as long as necessary to get a response. But that could exceed L, violating the availability guarantee.

So it's not clear how attractive t-Connected Consistency really is. It can be hard to come up with formal proofs of more complicated, weakened consistency guarantees. Most working software engineers don't think much about formal proofs, but don't underrate them. Sometimes they can help you identify bugs that would otherwise be hard to track down, before they happen.

Jeff Darcy wrote a blog posting about "eventual consistency" about a half year ago, which I recommend. And there are other kinds of weak consistency guarantees, such as the one provided by Amazon's Dynamo key-value store, which is worth examining.

Reliable Networks

Can't you just make the network reliable, so that messages are never lost? ("Never" meaning that the probability of losing a message is as low as other failure modes that you're not protecting against.)

Lots and lots of experience has shown that in a network with lots of routers and such, no matter how much redundancy you add, you will experience lost messages, and you will see partitions that last for a significant amount of time. I don't have a citation to prove this, but, ask around and that's what experienced operators of distributed systems will always tell you.

How many routers is "lots"? How reliable is it if you have no routers (layer 3 switches), only hubs (layer 2 switches)? What if you don't even have hubs? I don't have answers to all this. But if you're going to build a distributed system that depends on a reliable network, you had better ask experienced people about these questions. If it involves thousands of nodes and/or is geographically distributed, you can be sure that the network will have failures.

And again, as far as the proof of the CAP theorem is concerned, node failure is treated as a network failure. Having a perfect network does you no good if machines can crash, so you'd also need each node to be highly-available in and of itself. That would cost a lot more than using "commercial off-the-shelf" computers.

The Bottom Line

My conclusion is that the proof of the CAP theorem means, in practice: if you want to build a distributed system that is (1) large enough that nodes can fail and the network can't be guaranteed to never lose messages, and (2) you want to get a useful response to every request within a specified maximum latency, then the best you can guarantee about the meaning of the response is that it is guaranteed to have some kind of "weak consistency", which you had better carefully define in such a way that it's useful.

P.S.

After writing this but just before posting it, Alex Feinberg added a comment to my previous blog post with a link to this excellent post by Henry Robinson, which discusses many of the same issues and links to even more posts. If you want to read more about all this, take a look.

References

Dan Weinreb, 2010: What Does the Proof of the “CAP theorem” Mean?

Self-balancing strategies in balanced binary trees

February 11, 2011
Programming practices

This post aims to offer some heuristic lines of thought for designing data structures and algorithms.

Array --> binary search --> linked list --> binary search tree

Binary search

Binary search is an important algorithm, most familiar from its use on sorted arrays. Yet one particular property of arrays makes binary search hard to reproduce in discrete storage structures

  • A contiguous region of memory, where the subscript alone determines the offset

In other words, operations on a sorted array can be driven entirely by subscripts. But in real systems we sometimes cannot provide a large enough contiguous block up front, or we must process masses of data stored discretely in memory, as with linked lists; there this method no longer applies.
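For reference, here is the subscript-driven array version that this discussion starts from (a standard sketch):

C code
/* Binary search over a sorted int array; returns the index of key,
 * or -1 if absent. Every probe lands exactly in the middle of the
 * remaining range, which works only because arr[mid] is O(1). */
int bsearch_int(const int *arr, int n, int key)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;   /* avoids overflow of lo + hi */
        if (arr[mid] == key)
            return mid;
        else if (arr[mid] < key)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return -1;
}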

Recall binary search over a sorted array: every split lands exactly in the middle. How could a linked list manage that?

It can't. Discreteness means the placement is effectively random: you cannot jump to the middle node. The binary search tree resolves this by starting from the middle node itself and deciding the next branch from there.

So the question becomes: who serves as the middle node?

  1. For static data, elect middle nodes recursively, yielding an optimal binary search tree
  2. For dynamic data, the crux is how to rebalance the tree's bisecting structure whenever a node is inserted or deleted.

Hence self-balancing, which carries the idea of binary search into discrete storage structures.

Self-balancing strategies

Self-balancing means performing a series of operations that maintain some invariant, an invariant chosen so as to realize the idea of binary search well: AVL's balance factor, the red-black tree's five mandatory constraints, and so on.

Rotation is the most common tactic for achieving self-balance: nodes migrate between the left and right subtrees to keep the structure dynamically balanced

A left rotation effectively shifts nodes from the right subtree over to the left (a right rotation is the mirror image); see the sketch just below. Taking the red-black tree as the example, let's discuss how self-balancing strategies are implemented
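A minimal sketch of a left rotation on a node with parent pointers (field names are illustrative, untested):

C code
struct node {
    struct node *left, *right, *parent;
};

/* Left-rotate around x: x's right child y takes x's place,
 * x becomes y's left child, and y's old left subtree moves
 * under x as its new right subtree. */
static void rotate_left(struct node **root, struct node *x)
{
    struct node *y = x->right;

    x->right = y->left;
    if (y->left)
        y->left->parent = x;
    y->parent = x->parent;
    if (!x->parent)
        *root = y;                  /* x was the root */
    else if (x == x->parent->left)
        x->parent->left = y;
    else
        x->parent->right = y;
    y->left = x;
    x->parent = y;
}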

When to rotate

When the left and right subtrees are unequal, the structure is unbalanced, and imbalance calls for rotation. The concrete policy, though, depends on the scenario and the demands, dictated by the structure's own invariants: an AVL tree, for instance, rotates when the heights of its two subtrees differ by more than 1. (To be continued)

Wrapping epoll in an asynchronous API

January 9, 2011
Programming practices

epoll's design holds to the Unix philosophy's KISS principle: keep it simple, stupid

But KISS does not mean convenient. epoll gives applications three plain functions for asynchronous event handling: epoll_create(), epoll_ctl(), and epoll_wait(). They are not friendly, and facing complex business logic they leave no clear line of attack, as the sketch after the list below shows; further wrapping is needed.

What does a developer usually need?

  1. The object to handle asynchronously: a file descriptor
  2. The events to watch: readable or writable (EPOLLIN|EPOLLOUT)
  3. A handler to run when an event fires, which epoll itself does not supply (mechanism separated from policy)
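For contrast, the bare pattern epoll imposes looks roughly like this (a sketch, error checks trimmed):

C code
#include <sys/epoll.h>

void raw_epoll_loop(int listenfd)
{
    int efd = epoll_create(1024);   /* size hint, ignored since Linux 2.6.8 */
    struct epoll_event ev, events[64];

    ev.events = EPOLLIN;
    ev.data.fd = listenfd;
    epoll_ctl(efd, EPOLL_CTL_ADD, listenfd, &ev);

    for (;;) {
        int n = epoll_wait(efd, events, 64, -1);
        for (int i = 0; i < n; i++) {
            /* dispatch by hand: no callback slot, no per-fd user data
             * beyond the ev.data union -- this is what the wrapper adds */
            int fd = events[i].data.fd;
            (void)fd;
        }
    }
}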

So I wrapped epoll's interface into something simple and easy to use, suited to more general needs.

Data structures

C code
typedef void eProc(struct eloop *el, int fd, void *argv, int mask);

/* File event structure */
typedef struct efile {
    int mask;          /* events being watched: EPOLLIN|EPOLLOUT */
    eProc *rProc;      /* read handler */
    eProc *wProc;      /* write handler */
    void *data;        /* user data handed to the handlers */
} efile_t;

typedef struct eloop {
    int efd;                  /* descriptor returned by epoll_create() */
    int stop;
    efile_t ev[AE_SETSIZE];   /* all registered events */
    efile_t rd[AE_SETSIZE];   /* events fired by the last epoll_wait() */
} EL;

efile_t is the basic asynchronous I/O object, holding the handlers called when an event fires and the data passed along to them. eloop is the main structure: efd is the dedicated descriptor returned by epoll_create(), in charge of many epoll_events, and ev[AE_SETSIZE] holds all events currently being watched. If eloop is the overseer, efd is the executor and the events are the overseen. The results of each epoll_wait() call land in rd[AE_SETSIZE].

Programming interface

C code
EL *el_open(void);
int el_close(EL *el);
int el_stop(EL *el);
int el_start(EL *el, int flags);
int el_addFileEvent(EL *el, int fd, int mask, eProc *pro, void *data);
int el_delFileEvent(EL *el, int fd, int mask);

This API is easy to use: el_start() begins polling, el_stop() pauses it, el_addFileEvent() adds an event to the EL and registers a callback, and el_delFileEvent() removes one. The code has not been tested; use with caution.
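A hypothetical echo server built on these six calls might look like this (a sketch; onRead and the socket setup are illustrative, and el_start()'s flags are assumed ignorable):

C code
#include <unistd.h>
#include <sys/epoll.h>

/* Illustrative callback: echo whatever arrives, drop the fd on EOF/error. */
static void onRead(struct eloop *el, int fd, void *argv, int mask)
{
    char buf[512];
    ssize_t n = read(fd, buf, sizeof(buf));

    (void)argv; (void)mask;
    if (n <= 0) {
        el_delFileEvent(el, fd, EPOLLIN);
        close(fd);
        return;
    }
    write(fd, buf, n);
}

int main(void)
{
    EL *el = el_open();
    int connfd = /* socket()/bind()/listen()/accept() elided */ 0;

    el_addFileEvent(el, connfd, EPOLLIN, onRead, NULL);
    el_start(el, 0);    /* assumed: blocks, dispatching until el_stop() */
    el_close(el);
    return 0;
}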

>> See the source code for the concrete design

Object-style interfaces for data structures

November 23, 2010
Programming practices

C suits the development of large core architectures, platforms, systems, and the like; it is procedural and does not support object-oriented semantics.

Still, I have always held that object orientation is a way of thinking, not some fixed technology; OO is independent of language, and object-oriented code can be written in any language. The OO mindset is far more than inheritance and polymorphism; languages such as C++ or Java merely support OO semantics and so express the mindset more readily, and Java in particular was designed precisely to address those semantics

On a project, one real headache is the design of interface names, internal and external alike. Small projects are fine, but once the architecture grows large, interfaces tend toward chaos. C-style code is not as plainly readable as C++ or Java; C is mostly used to write modules, and the more complex the functionality, the more disordered the function names become, until it is hard to tell what the code is doing. The great strength of languages with OO semantics is that layered, clearly structured code comes easily; in a language like C it does not. So a good function naming convention matters a great deal to a project

Back to the topic: what is this object-style technique? Making data structures object-like means giving each structure its own suitable management interface; the essence is unchanged, only the notation differs. Take the red-black tree and see how the kernel implements it (part of the interface):

C code
/* Macros that define a red-black tree */
#define RB_HEAD(name, type) \
struct name { \
    struct type *rbh_root; /* root of the tree */ \
}

#define RB_INITIALIZER(root) \
    { NULL }

#define RB_INIT(root) do { \
    (root)->rbh_root = NULL; \
} while (/*CONSTCOND*/ 0)

#define RB_BLACK    0
#define RB_RED      1

#define RB_ENTRY(type) \
struct { \
    struct type *rbe_left;      /* left element */ \
    struct type *rbe_right;     /* right element */ \
    struct type *rbe_parent;    /* parent element */ \
    int rbe_color;              /* node color */ \
}

/* Main rb operation.
 * Moves node close to the key of elm to top
 */
#define RB_GENERATE(name, type, field, cmp) \
    RB_GENERATE_INTERNAL(name, type, field, cmp,)
#define RB_GENERATE_STATIC(name, type, field, cmp) \
    RB_GENERATE_INTERNAL(name, type, field, cmp, __unused static)
#define RB_GENERATE_INTERNAL(name, type, field, cmp, attr) \
attr void \
name##_RB_INSERT_COLOR(struct name *head, struct type *elm) {......} \
attr void \
name##_RB_REMOVE_COLOR(struct name *head, struct type *parent, \
struct type *elm) {......} \
attr struct type * \
name##_RB_REMOVE(struct name *head, struct type *elm) {......} \
/* Inserts a node into the RB tree */ \
attr struct type * \
name##_RB_INSERT(struct name *head, struct type *elm) {......} \
/* Finds the node with the same key as elm */ \
attr struct type * \
name##_RB_FIND(struct name *head, struct type *elm) {......} \
/* Finds the first node greater than or equal to the search key */ \
attr struct type * \
name##_RB_NFIND(struct name *head, struct type *elm) {......} \
/* ARGSUSED */ \
attr struct type * \
name##_RB_NEXT(struct type *elm) {......} \
/* ARGSUSED */ \
attr struct type * \
name##_RB_PREV(struct type *elm) {......} \
attr struct type * \
name##_RB_MINMAX(struct name *head, int val) {......}
The key pieces are RB_ENTRY and RB_GENERATE_STATIC. In the kernel's red-black tree implementation, the kernel does not care about the concrete form of your data: data and structure are separated, and the tree's operations are realized through RB_ENTRY plus a compare function, with the whole interface implemented as macros. The crucial element is name. For instance

C code
#define RB_HEAD(name, type) \
struct name { \
    struct type *rbh_root; /* root of the tree */ \
}

What is name? It is an alias for the structure; in essence, it creates an RB object, and objects are told apart by their names. The upside is that the RB implementation becomes highly reusable; the downside is that simulating object-oriented behavior by macro substitution is no different from abusing inline. An example

C code
/* Tree of chunks. */
typedef struct chunk_node_s chunk_node_t;
struct chunk_node_s {
    /* Linkage for the chunk tree. */
    RB_ENTRY(chunk_node_s) link;
    /*
     * Pointer to the chunk that this tree node is responsible for.
     */
    void    *chunk;
    /* Total chunk size. */
    size_t  size;
};
typedef struct chunk_tree_s chunk_tree_t;
RB_HEAD(chunk_tree_s, chunk_node_s);

static int
chunk_comp(chunk_node_t *a, chunk_node_t *b)
{
    if (a->size > b->size)
        return (1);
    else if (a->size == b->size)
        return (0);
    else
        return (-1);
}

/* Generate red-black tree code for chunks. */
RB_GENERATE_STATIC(chunk_tree_s, chunk_node_s, link, chunk_comp);

Back to the kernel red-black tree interface: what does RB_GENERATE_STATIC actually do? At preprocessing time it generates the full set of operations for the object named by name. If a program holds several red-black tree objects, each extra RB_GENERATE_STATIC may look like a single additional macro, but in substance it copies the entire implementation again, differing only in name and compare function
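To make that concrete, here is roughly how the generated chunk tree would be used (a sketch against the BSD sys/tree.h macro API; untested):

C code
#include <stdlib.h>

/* One tree object; RB_GENERATE_STATIC above already emitted
 * chunk_tree_s_RB_INSERT(), chunk_tree_s_RB_FIND(), etc. */
static struct chunk_tree_s chunks = RB_INITIALIZER(&chunks);

void chunk_record(void *chunk, size_t size)
{
    chunk_node_t *n = malloc(sizeof(*n));
    n->chunk = chunk;
    n->size = size;
    RB_INSERT(chunk_tree_s, &chunks, n);   /* expands to chunk_tree_s_RB_INSERT */
}

chunk_node_t *chunk_lookup(size_t size)
{
    chunk_node_t key;
    key.size = size;                       /* chunk_comp() compares only size */
    return RB_FIND(chunk_tree_s, &chunks, &key);
}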

There is in fact a far more elegant approach: function pointers, used all the time in the kernel-grade implementations we rely on, bundling a data structure's behavior together with the structure itself. The Berkeley DB database is a fine example; look:

C code
typedef struct __db DB;    /* forward typedef so members can refer to DB */

struct __db {
    DBTYPE type;
    int (*close)(const DB *db);
    int (*del)(const DB *db, const DBT *key, u_int flags);
    int (*fd)(const DB *db);
    int (*get)(const DB *db, DBT *key, DBT *data, u_int flags);
    int (*put)(const DB *db, DBT *key, const DBT *data, u_int flags);
    int (*sync)(const DB *db, u_int flags);
    int (*seq)(const DB *db, DBT *key, DBT *data, u_int flags);
};

The DB object packs all of its operations into its own struct. From several angles this is plainly clearer than conveying a program's structure through naming conventions that merely constrain function interfaces.
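Call sites then read like method dispatch on the object (a sketch against the classic dbopen() interface of Berkeley DB 1.85; error handling elided):

C code
#include <sys/types.h>
#include <fcntl.h>
#include <db.h>

void db_demo(void)
{
    DB *db = dbopen("test.db", O_CREAT | O_RDWR, 0644, DB_BTREE, NULL);
    DBT key = { "name", 4 };
    DBT val = { "smaz", 4 };

    db->put(db, &key, &val, 0);    /* behavior travels with the object */
    db->get(db, &key, &val, 0);
    db->close(db);
}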