goroutine scheduler

March 25, 2013
Compiler and vm

some references

  • Analysis of the Go runtime scheduler
  • Dmitry Vyukov.  Scalable Go Scheduler Design Doc

go runtime and go scheduler

The runtime keeps track of each goroutine, and will schedule them to run in turn on a pool of threads belonging to the process. Goroutines are separate from threads but rely upon them to run, and scheduling goroutines onto threads effectively is crucial for the efficient performance of Go programs. The idea behind goroutines is that they are capable of running concurrently, like threads, but are also extremely lightweight in comparison. So, while there might be multiple threads created for a process running a Go program, the ratio of goroutines to threads should be much higher than 1-to-1. Multiple threads are often necessary to ensure that goroutines are not unnecessarily blocked. When one goroutine makes a blocking call, the thread running it must block. Therefore, at least one more thread should be created by the runtime to continue the execution of other goroutines that are not in blocking calls. Multiple threads are allowed to run in parallel up to a programmer-defined maximum, which is stored in the variable GOMAXPROCS.
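As a small, hedged illustration (my own, not from the referenced paper): the program below caps the number of OS threads executing Go code with runtime.GOMAXPROCS and then starts far more goroutines than that, which is exactly the many-goroutines-to-few-threads mapping described above.

[go]package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	// At most 2 OS threads execute Go code at the same time.
	runtime.GOMAXPROCS(2)

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ { // 100 goroutines share those 2 threads
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			fmt.Println("goroutine", id)
		}(i)
	}
	wg.Wait()
}[/go]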

It is important to keep in mind that all the OS sees is a single user level process requesting and running multiple threads. The concept of scheduling goroutines onto these threads is merely a construct in the virtual environment of the runtime. When we refer to the Go runtime and scheduler we are referring to these higher level entities, which are completely separate from the operating system.

data structures

A G struct represents a single goroutine[9]. It contains the fields necessary to keep track of its stack and current status. It also contains references to the code that it is responsible for running.
[c]struct G
{
	byte* stackguard; // stack guard information
	byte* stackbase;  // base of stack
	byte* stack0;     // current stack pointer
	byte* entry;      // initial function
	void* param;      // passed parameter on wakeup
	int16 status;     // status
	int32 goid;
	M*    lockedm;    // used for locking M's and G's
	...
};[/c]
The M struct is the Go runtime’s representation of an OS thread. It has pointers to fields such as the global queue of G’s, the G that it is currently running, its own cache, and a handle to the scheduler.
[c]struct M
{
	G*      curg;            // currently running goroutine
	int32   id;
	int32   locks;           // locks held by this M
	MCache* mcache;          // cache for this thread
	G*      lockedg;         // used for locking M's and G's
	uintptr createstack[32]; // stack that created this thread
	M*      nextwaitm;       // next M waiting for lock
	...
};[/c]
the sched struct

The Sched struct is a single, global struct[9] that keeps track of the different queues of G’s and M’s and some other information the scheduler needs in order to run, such as the global Sched lock. There are two queues containing G structs, one is the runnable queue where M’s can find work, and the other is a free list of G’s. There is only one queue pertaining to M’s that the scheduler maintains; the M’s in this queue are idle and waiting for work. In order to modify these queues, the global Sched lock must be held.
[c]struct Sched {
	Lock;           // global sched lock; must be held to edit G or M queues
	G*    gfree;    // available g's (status == Gdead)
	G*    ghead;    // g's waiting to run
	G*    gtail;    // tail of the waiting-to-run queue
	int32 gwait;    // number of g's waiting to run
	int32 gcount;   // number of g's that are alive
	int32 grunning; // number of g's running on cpu or in syscall
	M*    mhead;    // m's waiting for work
	int32 mwait;    // number of m's waiting for work
	int32 mcount;   // number of m's that have been created
	...
};[/c]

fundamentals

The runtime starts out with several G's. One is in charge of garbage collection, another is in charge of scheduling, and one represents the user's Go code. Initially, one M is created to kick off the runtime. As the program progresses, more G's may be created by the user's Go program, and more M's may become necessary to run all the G's. As this happens, the runtime may provision additional threads, but at any given time at most GOMAXPROCS M's are actively executing Go code (M's blocked in system calls do not count against this limit).

Since M’s represent threads, an M is required to run a goroutine. An M without a currently associated G will pick up a G from the global runnable queue and run the Go code belonging to that G. If the Go code requires the M to block, for instance by invoking a system call, then another M will be woken up from the global queue of idle M’s. This is done to ensure that goroutines, still capable of running, are not blocked from running by the lack of an available M.
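The pattern is easiest to see as a shared work queue that idle M's drain. The toy program below is only an analogy of my own (ordinary goroutines and a channel stand in for M's, G's and the Sched lock), but it shows the same shape: a small, fixed pool of workers pulling runnable items off one global queue.

[go]package main

import (
	"fmt"
	"sync"
)

func main() {
	// The channel stands in for the global runnable queue of G's.
	runq := make(chan int, 16)
	for g := 0; g < 16; g++ {
		runq <- g
	}
	close(runq)

	const workers = 3 // stands in for a small pool of M's
	var wg sync.WaitGroup
	for m := 0; m < workers; m++ {
		wg.Add(1)
		go func(m int) {
			defer wg.Done()
			for g := range runq { // an idle "M" picks up the next "G"
				fmt.Printf("worker %d ran item %d\n", m, g)
			}
		}(m)
	}
	wg.Wait()
}[/go]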

System calls force the calling thread to trap to the kernel, causing it to block for the duration of the system call execution. If the code associated with a G makes a blocking system call, the M running it will be unable to run it or any other G until the system call returns. M’s do not exhibit the same blocking behavior for channel communication, even though goroutines block on channel communication. The operating system does not know about channel communication, and the intricacies of channels are handled purely by the runtime. If a goroutine makes a channel call, it may need to block, but there is no reason that the M running that G should be forced to block as well. In a case such as this, the G’s status is set to waiting and the M that was previously running it continues running other G’s until the channel communication is complete. At that point the G’s status is set back to runnable and will be run as soon as there is an M capable of running it.
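A small, hedged example of that last point (mine, not the paper's): the receiving goroutine below is parked by the runtime while it waits on a channel, yet the program keeps making progress on a single OS thread, because blocking on a channel never blocks the M.

[go]package main

import (
	"fmt"
	"runtime"
)

func main() {
	runtime.GOMAXPROCS(1) // one thread executing Go code is enough

	done := make(chan struct{})
	woke := make(chan struct{})
	go func() {
		<-done // this G is marked waiting; no OS thread blocks here
		fmt.Println("receiver woke up")
		close(woke)
	}()

	for i := 0; i < 3; i++ {
		fmt.Println("main still running:", i)
	}
	close(done) // make the receiver runnable again
	<-woke      // main itself now parks until the receiver finishes
}[/go]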

Some study materials on systems-related fields

March 18, 2013
Programming practices

Original article: “架构相关领域的一些学习材料” (study materials on architecture-related fields) by Lin Shiding (林仕鼎)

Engineers often hit a growth plateau at some point. To break through it, you need to go deeper into your own technical field: understand its essential problems, its methodology and design philosophy, its history, and so on. Below is a set of study materials for architecture-related fields, each with a short comment, for engineers who are interested. I hope that by getting to know these areas you will pick up more system design principles, handle your own work with greater ease, and move toward real mastery.

1. Operating Systems

Mach [Intro: http://www-2.cs.cmu.edu/afs/cs/project/mach/public/www/mach.html, Paper: http://www-2.cs.cmu.edu/afs/cs/project/mach/public/www/doc/publications.html]

In traditional kernel implementations, interrupt handling lives in one "big function": a single control flow runs from the interrupt's entry point to its exit, and once nested interrupts come into play the logic becomes very complicated. Most OSes, UNIX included, use this monolithic kernel architecture. The Mach project, started in 1985, proposed a completely new microkernel structure. Academia, which had felt it had nowhere left to go after UNIX's development peaked in the 1970s, suddenly had something to be excited about, and the long, noisy debate between monolithic kernels and microkernels began. A side story: Mach's leader, Richard Rashid, then a professor at CMU, was asked by Bill Gates to persuade Jim Gray to join Microsoft. He ended up getting pulled in himself and went on to build Microsoft Research. He has come to China several times to give keynotes at the 21st Century Computing conference.

Exokernel [Intro: http://pdos.csail.mit.edu/exo/, Paper: http://pdos.csail.mit.edu/PDOS-papers.html#Exokernels]

Elegant as the microkernel structure is, it never saw wide adoption in practice because its performance is poor, and people gradually realized that the real issue with OSes is not implementation complexity but how much flexibility applications have in using resources. That is why, once kernel extensions appeared (loadable modules in Linux, for example), the debate over OS kernel architecture slowly faded from view.

Exokernel, proposed by MIT, emerged against this background. It does not provide the abstractions of a traditional OS (processes, virtual memory, and so on); instead it focuses on resource isolation and multiplexing. On top of the exokernel sits a library, the well-known libOS, which implements the interfaces of various OSes. This structure gives applications maximum flexibility: different applications can choose to emphasize scheduling fairness, real-time responsiveness, or resource efficiency for better performance. Viewed with today's eyes, an exokernel looks a lot like a virtual machine monitor.

Singularity [Intro: http://research.microsoft.com/os/Singularity/, Paper: http://www.research.microsoft.com/os/singularity/publications/HotOS2005_BroadNewResearch.pdf]

Singularity, from Microsoft Research, appeared in the early 2000s, when viruses and spyware seemed endless and impossible to stamp out. Academia and industry alike were discussing how to provide a trustworthy computing environment and how to make computer systems more manageable. Singularity's view is that solving these problems requires hard isolation from the underlying system, and that the hardware virtual-memory mechanism everyone had relied on cannot deliver both high flexibility and good performance. With the arrival of runtimes such as .NET and Java, a software-level solution became possible.

On top of a microkernel, Singularity uses .NET to build a type-safe assembly language as its ABI and prescribes message passing as the only way to exchange data, which rules out modifying isolated data at the root. Combined with safety checks on applications, this yields a controllable, manageable operating system. Thanks to continuous optimization of the .NET CLR and advances in hardware, the performance cost of these checks is acceptable given the good properties they buy.

This design is still at the laboratory stage; whether it can ultimately win out will take the kind of historical opportunity UNIX once had.

2. Virtual Machines

VMware [“Memory Resource Management in VMware ESX Server”, OSDI’02, Best paper award]

Everyone knows VMware; no further comment needed.

Xen [“Xen and the Art of Virtualization”, SOSP’03]

An extremely fast VMM, from Cambridge.

Denali [“Scale and Performance in the Denali Isolation Kernel”, OSDI’02, UW]

An application-level virtual machine designed for Internet services; thousands of VMs can run on an ordinary machine. Its VMM is built on an isolation kernel, which provides isolation but does not insist on strictly fair resource allocation, trading that for lower performance overhead.

Entropia [“The Entropia Virtual Machine for Desktop Grids”, VEE’05]

To harvest the desktop machines inside a company for computation, the compute tasks have to be packaged carefully so that they neither disturb normal use of the machine nor touch the user's data. Entropia provides such an environment: an application-level virtual machine implemented on Windows. Its basic technique is to redirect the syscalls made by the compute task in order to guarantee isolation. Similar work includes FVM: “A Feather-weight Virtual Machine for Windows Applications”.

3. Design Revisited

“Are Virtual Machine Monitors Microkernels Done Right?”, HotOS’05

The title sounds baffling at first; what it argues is that VMMs are in fact microkernels done properly. The paper discusses VMMs and microkernels in detail and is an excellent reference for understanding both concepts.

“Thirty Years Is Long Enough: Getting Beyond C”, HotOS’05

C may be the most successful programming language in the world, but its shortcomings are just as plain. It has no notion of threads, for example, which leaves it struggling on today's highly parallel hardware, exactly where functional programming languages shine. How to combine the strengths of the two is a promising area.

4. Programming Model

Why Threads Are a Bad Idea

A server built purely on threads can hardly reach truly high performance, because of memory usage, context-switch overhead, synchronization overhead, and the programming complexity of getting locking right.

“SEDA: An Architecture for Well-Conditioned, Scalable Internet Services”, SOSP’01

Threads have problems, but events cannot solve everything either, so we look for a middle way. SEDA splits an application into multiple stages connected by queues; each stage can run several threads to process the events in its queue, and the number of threads is adjusted automatically through feedback.

Software Transactional Memory

If memory offered transactional semantics, the world we face would look completely different: languages, compilers, OSes, and runtimes would all change fundamentally. Intel is working on hardware transactional memory, but it is unlikely to be commercially available in the foreseeable future, so people turn to software solutions. Such a scheme obviously cannot be built on native assembly; implementations currently exist for languages such as C# and Haskell. There is plenty of material; see Wikipedia.

5. Distributed Algorithms

Logical clock, [“Time, clocks, and the ordering of events in a distributed system”, Leslie Lamport, 1978]

A classic paper on logical clocks, timestamps, and distributed synchronization.

Byzantine [“The Byzantine Generals Problem”, Leslie Lamport, 1982]

Failures in a distributed system come in many flavors: some nodes stop when they fail, some limp along and slow everyone else down, and the worst behave maliciously when they fail. That last kind of malicious behavior, like a general turning traitor on campaign, can do serious damage to the system. For this class of problems Lamport proposed the Byzantine failure model: in a state machine built from 3f+1 replicas, the whole state machine keeps working correctly as long as no more than f replicas turn traitor.

Paxos [“The part-time parliament”, Leslie Lamport, 1998]

How to reach consensus in an asynchronous distributed environment is the most fundamental question in distributed algorithms, and Paxos is the pinnacle of that line of work. The paper is notoriously difficult; the joke is that only 3.5 people in the world understood it, so Lamport later wrote a popularized version, “Paxos Made Simple”, which is still hard going. Also worth reading is Butler Lampson's “The ABCD’s of Paxos” (PODC’01); its treatment of replicated state machines will seriously sharpen your understanding of what the concurrent world is really about. That is what a Turing Award is made of.

One name keeps coming up above: Leslie Lamport, who has toiled away in distributed computing and become a grandmaster of the field. A couple of anecdotes: his old page at MSR reportedly said something like, "When I was working on logical clocks, Bill Gates was still in diapers..." (roughly; the original can no longer be found). He also likes to slip other famous researchers' names, lightly disguised, into his papers, which may be why he has not received the Turing Award yet.

For more of Lamport's achievements, see the paper written in honor of his 60th birthday: “Lamport on mutual exclusion: 27 years of planting seeds”, PODC’01.

6. Overlay Networking, and P2P DHT

RON [“Resilient Overlay Networks”, SOSP’01]

RON shows how to build an application-layer overlay that recovers from wide-area network failures within seconds, whereas recovering connectivity through routing protocols takes at least tens of minutes. This fast recovery and flexibility are why overlay networking is so widely used today.

Application Level Multicast

“End System Multicast”, SIGMETRICS’00

“Scalable Application Layer Multicast”, SIGCOMM’02

There are many ALM papers. Essentially they all describe how to build a mesh network for robustly distributing control information, plus a multicast tree for efficiently delivering the data, with some layered delivery on top tuned to the characteristics of multimedia streams. Systems that appeared a few years ago, such as CoolStreaming and PPLive, are the commercial offspring of this line of work.

P2P

P2P changed the network. By structure, P2P networks fall into three kinds.

1.    Napster-style: a centralized directory service, with data transferred peer to peer.

2.    Gnutella-style: queries are gossiped among neighbors; also known as unstructured P2P.

3.    DHT: unlike unstructured P2P, a DHT lookup comes with a guarantee: if the data exists, it is found within a bounded number of hops, usually log N, where N is the number of nodes in the system.

Typical DHTs are CAN, Chord, Pastry, and Tapestry. This research is mostly at the algorithm level; the systems work on top of it is mainly about building wide-area storage systems. Some people also work at the mechanism level, for example how to give users incentives to share and how to prevent cheating.

7. Distributed Systems

GFS/MapReduce/BigTable/Chubby/Sawzall

Google's well-known series of papers; everyone is familiar with them, so no more needs to be said.

Storage

There are far too many papers on distributed storage systems; a few of the most relevant are listed below.

“Chain Replication for Supporting High Throughput and Availability”, OSDI’04.

“Dynamo: Amazon’s Highly Available Key-value Store”, SOSP’07.

“BitVault: a Highly Reliable Distributed Data Retention Platform”, SIGOPS OSR’07.

Distributed simulation

“Simulating Large-Scale P2P Systems with the WiDS Toolkit”, MASCOTS’05. What makes distributed simulation interesting is that the simulated protocol is distributed and the simulation engine itself is distributed as well; logical and physical time and events are interleaved throughout the system and need careful handling.

8. Controversial Computing Models

Today's software systems have grown too complex for anyone to fully grasp. Many systems ship with plenty of deterministic and non-deterministic bugs and can only be patched endlessly. Since we humans are simply not precise enough to ever fix every bug, we have to look for other ways to keep systems working in this discouraging situation. It is much like a distributed system: failures cannot be avoided, so we let the system as a whole provide high reliability.

The three items below are typical representatives. The main research questions are: 1) how to save state correctly; 2) how to catch errors and recover the state; 3) how to keep unit-level recovery from affecting the whole.

Recovery Oriented Computing

Failure oblivious computing, OSDI’04

Treating Bugs as Allergies, SOSP’05

9. Debugging

Systems are so complex that humans cannot analyze them head-on by pure logic; we can only observe them at a macro level through data mining.

Black box debugging [“Performance debugging for distributed systems of black boxes”, SOSP’03]

Performance debugging of large systems is very hard, because many of the problems are non-deterministic and cannot be reproduced. The only option is to mine the logs, pair up calls and messages, and locate the problem from there.

CP-miner [“A Tool for Finding Copy-paste and Related Bugs in Operating System Code”, OSDI’04]

Many people copy-paste when reusing code, but a careless copy-paste can introduce serious bugs, such as local variables whose names should have been changed but were not. CP-Miner analyzes the code, builds syntax trees, and then mines them for this class of bugs.

git in a nutshell

March 17, 2013
Programming practices

some references

  1. kernel docs: https://www.kernel.org/pub/software/scm/git/docs
  2. git data format: http://git.rsbx.net/Documents/Git_Data_Formats.txt
  3. source code

git's storage is simply a key-value database. The stored objects include blobs (files), trees (directories), commits, and so on.
An object's key is the SHA-1 of its value.
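As a small illustration of that content addressing (my own sketch, not from the post): git hashes a "blob <size>\0" header followed by the file content, so the key can be recomputed from the bytes alone. Running git hash-object on the same content should print the same id.

[go]package main

import (
	"crypto/sha1"
	"fmt"
)

// blobKey returns the object id git would assign to this file content.
func blobKey(content []byte) string {
	h := sha1.New()
	// git hashes "<type> <size>\x00" + content, not the raw content alone.
	fmt.Fprintf(h, "blob %d\x00", len(content))
	h.Write(content)
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	fmt.Println(blobKey([]byte("hello world\n")))
}[/go]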

git organizes the working tree as a hash tree, with SHA-1 as the hash. The leaves of the hash tree are blobs, i.e. files; the interior nodes are trees, i.e. directories. When the content of any node changes, the change ripples all the way up to the root: the node's SHA-1 changes, so its parent's SHA-1 changes as well (a directory entry stores the SHA-1 of its child, and that stored value changes when the child is modified); in fact the SHA-1 of every node on the path from the modified node up to the root changes.

Also, git never modifies an existing node in place; it simply stores a new node whose key is the SHA-1 of the modified content (the content of the whole object, not just the part that changed). At an appropriate time the storage layer garbage-collects these nodes (many of them share most of their content) and packs them into a pack file to save space.

the commit process

A commit takes all the files modified so far (recorded by git add) and builds a new hash tree. Note that most nodes of this new tree are shared with the previous commit's tree, as the figure shows: when file C is modified, git creates new nodes E/F/G for every node on the path A/B/C. A commit itself only needs to record the current root node and its parent commit, which chains the commits into a list (git rev-list --all prints it).

the push process

1. Talk to the server to learn which commit its current branch is at, compare that with the local current commit, and work out the list of commits that need to be pushed. Assume a single commit needs to be pushed, as in the figure above (multiple commits are handled the same way).
Let commit1 be the current commit of the branch on the server.
Let commit2 be the current commit of the branch locally.

2. Find all the nodes reachable from commit1 (this does not use much memory, because the content of the leaf nodes, i.e. the file data, is never read into memory) and mark them.
Then find all the nodes reachable from commit2 and drop the marked ones; whatever is left is what this push has to upload to the server. This step is also fast, because once an interior node (such as node H in the figure) is marked, nothing below it needs to be examined. A toy sketch of this step follows step 3 below.

3. Pack these nodes (key-value pairs) into a pack file, exactly the way git builds a pack file during gc (where every key-value pair takes part).
The pack file is self-describing. The server unpacks it and then updates the commit that its current branch points to.
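Here is the promised sketch of step 2 (my own toy, with made-up object ids): mark everything reachable from the server's commit, then walk the local commit and collect only the unmarked nodes, skipping a whole subtree the moment a marked node is hit.

[go]package main

import "fmt"

// A toy object graph: each id maps to the ids it references
// (a commit references a tree, a tree references blobs and subtrees).
type graph map[string][]string

// mark records every id reachable from root in seen.
func mark(g graph, root string, seen map[string]bool) {
	if seen[root] {
		return
	}
	seen[root] = true
	for _, child := range g[root] {
		mark(g, child, seen)
	}
}

// toUpload returns the ids reachable from newRoot but not from oldRoot.
func toUpload(g graph, oldRoot, newRoot string) []string {
	old := map[string]bool{}
	mark(g, oldRoot, old)

	visited := map[string]bool{}
	var out []string
	var walk func(id string)
	walk = func(id string) {
		if old[id] || visited[id] {
			return // the server already has this whole subtree (node H case)
		}
		visited[id] = true
		out = append(out, id)
		for _, child := range g[id] {
			walk(child)
		}
	}
	walk(newRoot)
	return out
}

func main() {
	g := graph{
		"commit1": {"A"}, "A": {"B"}, "B": {"C"},
		"commit2": {"E"}, "E": {"F", "B"}, "F": {"G"},
	}
	fmt.Println(toUpload(g, "commit1", "commit2")) // [commit2 E F G]
}[/go]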

Linux Performance Analysis and Tools

Source: Linux performance analysis and tools

 

A spring ride to Century Park

March 10, 2013
Life

Spring is here and the weather is great, so I rounded up a few friends to bike to Century Park and enjoy it.

Linux: The Journaling Block Device

February 17, 2013
Kernel

Kedar Sovani on  kerneltrap

Atomicity is a property of an operation either to succeed or fail completely. Disks assure atomicity at the sector level. This means that a write to a sector either goes through completely or not at all. But when an operation spans over multiple sectors of the disk, a higher-level mechanism is needed. This mechanism should ensure that modifications to the entire set of sectors are handled atomically. Failure to do so leads to inconsistencies. This document talks about the implementation of the Journaling Block Device in Linux.

Let's look at how these inconsistencies could be introduced to a filesystem. Say we have an application that creates a file. The filesystem internally has to decrease the number of free inodes by one, initialize the inode on the disk and add an entry to the parent directory for the newly created file. But what happens if the machine crashes after only the first operation is executed? In this circumstance, an inconsistency has been introduced in the filesystem. The number of free inodes has decreased, but no initialisation of the inode has been performed on the disk.

The only way to detect these inconsistencies is by scanning the entire filesystem. This task is called fsck, filesystem consistency check. In large installations, the consistency check requires a significant amount of time (many hours) to check and fix inconsistencies. As you might have guessed, such downtime is not desirable. A better approach to solve this problem is to avoid introducing inconsistencies in the first place, and this could be accomplished by providing atomicity to operations. Journaling is such a way to provide atomicity to operations.

Simply stated, using journaling is like using a scratch pad. You perform operations on the scratch pad, and once you are satisfied that the operations are correct, you reflect them in a fairer copy.

In the case of filesystems, all the metadata and data are stored on the block device for the filesystem. Journaling filesystems use a journal or the log area as the scratch pad. A journal may be a part of the same block device or it may be a separate device in itself. A journaling filesystem first records all the operations it has performed in the journal. Once the set of operations that is part of one single atomic operation has completed and been recorded in the journal, only then is it written to the actual block device. Henceforth, the term disk is used to indicate the actual block device, whereas the term journal is used for the log area.

Journal Recovery Scenarios

The example operation from above requires that three blocks be modified—the inode count block, the block containing the on-disk inode and the block holding the directory where the entry is to be added. All of these blocks first are written to the journal. After that, a special block, called the commit record, is written to the journal. The commit record is used to indicate that all the blocks belonging to a single atomic operation are written to the journal.

Given journaling behavior, then, here is how a journaling filesystem reacts in the following three basic scenarios:

  • The machine crashes after only the first block is flushed to the journal. In this case, when the machine comes back up again and checks the journal, it finds an operation with no commit record at the end. This indicates that it may not be a completed operation. Hence, no modifications are done to the disk, preserving the consistency.
  • The machine crashes after the commit record is flushed to the journal. In this case, when the machine comes back up again and checks the journal, it finds an operation with the commit record at the end. The commit record indicates that this is a completed operation and could be written to the disk. All the blocks belonging to this operation are written at their actual locations on the disk, replaying the journal.
  • The machine crashes after all the three blocks are flushed to the journal but the commit record is not yet flushed to the journal. Even in this case, because of the absence of the commit record, no modifications are done to the disk. The scenario thus is reduced to the scenario described in the first case.

Likewise, any other crash scenario could be reduced to any of the scenarios listed above.

Thus, journaling guarantees consistency for the filesystem. The time required for looking up the journal and replaying the journal is minimal as compared to that taken by the filesystem consistency check.

Journaling Block Device

The Linux Journaling Block Device (JBD) provides this scratch pad for providing atomicity in operations. Thus, a filesystem controlling a block device can make use of JBD on the same or on another block device in order to maintain consistency. The JBD is a modular implementation that exposes a set of APIs for the use of such applications. The following sections describe the concepts and implementation of the Linux JBD as is present in the Linux 2.6 kernel.

Before we move on to the implementation details of the JBD, an understanding of some of the objects that JBD uses is required. A journal is a log that internally manages updates for a single block device. As mentioned above, the updates first are stored in the journal and then are reflected to their real locations on the disk. The area belonging to the journal is managed like a circular-linked list. That is, the journal reuses its area when the journal is full.

A handle represents a single atomic update. The entire set of changes/writes that should be performed atomically are carried out with reference to a single handle.

It may not be an efficient approach to flush each atomic update (handle) to the journal, however. To achieve better performance, the JBD bunches a set of handles together into a transaction and flushes this transaction to the journal. The JBD ensures that the transaction is atomic in nature. Hence, the handles, which are the subcomponents of the transaction, also are guaranteed to be atomic.

The most important property of a transaction is its state. When a transaction is being committed, it follows the lifecycle of states listed below.

  1. Running: the transaction currently is live and can accept new handles. In a system only one transaction can be in the running state.
  2. Locked: the transaction does not accept any new handles but existing handles are not complete. Once all the existing handles are completed, the transaction goes to the next state.
  3. Flush: all the handles in a transaction are complete. The transaction is writing itself to the journal.
  4. Commit: the entire transaction log has been written to the journal. The transaction is writing a commit block indicating that the transaction log in the journal is complete.
  5. Finished: the transaction is written completely to the journal. It has to remain there until the blocks are updated to the actual locations on the disk.

Transaction Committing and CheckPointing

A running transaction is written to the journal area after a certain period. Thus, a transaction can be either in-memory (running) or on-disk. Flushing a transaction to the journal and marking that particular transaction as finished is a process called transaction commit.

The journal has a limited area under its control, and it needs to reuse this area. As for committed transactions, those having all their blocks written to the disk, they no longer need to be kept in the journal. Checkpointing, then, is the process of flushing the finished transactions to the disk and reclaiming the corresponding space in the journal. It is discussed in more detail later in this article.
Implementation Briefs

The JBD layer performs journaling of the metadata, during which the data simply is written to the disk without being journaled. But this does not stop applications from journaling the data, as it could be presented to the JBD as metadata itself. This document takes the Linux kernel version 2.6.0 as a reference.


Commit

[journal_commit_transaction(journal object)]

A kjournald thread is associated with every journaled device; it ensures that the running transaction is committed after a specific interval. The transaction commit code is divided into eight different phases, described below. Figure 1 shows a logical layout of a journal.

  1. moves the transaction from running state (T_RUNNING) to locked state (T_LOCKED), meaning the transaction no longer can issue new handles. The transaction waits until all the existing handles have completed. A transaction always has a set of buffers reserved for when the transaction is initiated. Some of these buffers may be unused and are unfiled in this phase. The transaction now is ready to be committed with no outstanding handles.
  2. the transaction enters into the flush state (T_FLUSH). The transaction is marked as a currently committing transaction for the journal. This phase also marks that no running transaction exists for the journal; therefore, new requests for handles initiate a new transaction.
  3. the actual buffers of the transaction are flushed to the disk. Data buffers go first. There are no complications here, as data buffers are not saved in the log area. Instead, they are flushed directly to their actual positions on the disk. This phase ends when the I/O completion notifications for all such buffers are received.
  4. all the data buffers are written to a disk but their metadata still is in the volatile memory. Metadata flushing is not as straightforward as data buffer flushing, because metadata needs to be written to the log area and the actual positions on the disk need to be remembered. This phase starts with flushing these metadata buffers, for which a journal descriptor block is acquired. The journal descriptor block stores the mapping of each metadata buffer in the journal to its actual location on the disk in the form of tags. After this, metadata buffers are flushed to the journal. Once the journal descriptor is full of tags or all metadata buffers are flushed to the journal, the journal descriptor also is flushed to the journal. Now we have all the metadata buffers in the journal, and their actual positions on the disk are remembered. This data, being persistent, can be used for recovery if failure occurs.
  5. wait on I/O completion notifications of metadata buffers and journal descriptor blocks, respectively. The buffers are unfiled from in-memory lists once I/O completion is received.
  6. all the data and metadata is on safe storage, data at its actual locations and metadata in the journal. Now transactions need to be marked as committed so that it can be known that all the updates are safe in the journal. For this reason, a journal descriptor block again is allocated. A tag is written stating that the transaction has committed successfully, and the block is synchronously written to its position in the journal. After this, the transaction is moved to the committed state, T_COMMIT.
  7. occurs when a number of transactions are present in the journal, without yet being flushed to the disk. Some of the metadata buffers in this transaction already may be a part of some previous transaction. These need not be kept in the older transactions as we have their latest copy in the current committed transaction. Such buffers are removed from older transactions.
  8. the transaction is marked as being in the finished state, T_FINISHED. The journal structure is updated to reflect this particular transaction as the latest committed transaction. It also is added to the list of transactions to be checkpointed.

Checkpointing

Checkpointing is initiated when the journal is being flushed to the disk (think of unmount) or when a new handle is started. A new handle can fall short of the guaranteed number of buffers, so it may be necessary to carry out a checkpointing process in order to free some space in the journal.

The checkpointing process flushes the metadata buffers of a transaction not yet written to its actual location on the disk. The transaction then is removed from the journal. The journal can have multiple checkpointing transactions, and each checkpointing transaction can have multiple buffers. The process considers each committing transaction, and for each transaction, it finds the metadata buffers that need to be flushed to the disk. All these buffers are flushed in one batch. Once all the transactions are checkpointed, their log is removed from the journal.


Recovery

[journal_recover(journal object)]

When the system comes up after a crash and sees that the log entries are not null, it knows that the last unmount was not successful or never happened. At this point, a recovery must be attempted. Figure 2 depicts a sample physical layout of the journal. The recovery takes place in three phases.

  1. PASS_SCAN: the end of the log is found.
  2. PASS_REVOKE: a list of revoked blocks is prepared from the log.
  3. PASS_REPLAY: unrevoked blocks are rewritten (replayed) in order to guarantee the consistency of the disk.

For recovery, the available information is provided in terms of the journal. But the exact state of the journal is unknown, as we do not know the point at which the system crashed. Hence, the last transaction could be in the checkpointing or committing state. A running transaction cannot be found, as it was only in the memory.

For committing transactions, we have to forget the updates made, as all of the updates may not be in place. So in the PASS_SCAN phase, the last log entry in the log is found. From here, the recovery process knows which transactions need to be replayed.

Every transaction can have a set of revoked blocks. This is important to know in order to prevent older journal records from being replayed on top of newer data using the same block. In PASS_REVOKE, a hash table of all these revoked blocks is prepared. This table is used every time we need to find out whether a particular block should get written to a disk through a replay.

In the last phase, all the blocks that need to be replayed are considered. Each block is tested for its presence in the revoked blocks' hash table. If the block is not in there, it is safe to write the block to its actual location on the disk. If the block is there, only the newest version of the block is written to the disk. Notice that we have not changed anything in the on-disk journal. Hence, even if the system crashes again while the recovery is in progress, no harm is done. The same journal is present for the recovery next time, and no non-idempotent operation is performed during the process of recovery.
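A toy sketch of the three passes (my own; plain Go values stand in for journal blocks, and none of this is the kernel's code): find the last commit record, collect the revoked block numbers, then replay only unrevoked blocks from committed transactions, keeping the newest copy of each.

[go]package main

import "fmt"

// A journal record in this toy is either a data block destined for blockNo,
// a revoke record for blockNo, or a commit record closing a transaction.
type rec struct {
	kind    string // "block", "revoke", or "commit"
	blockNo int
	data    string
}

// recoverJournal scans a toy journal and returns blockNo -> data to replay.
func recoverJournal(journal []rec) map[int]string {
	// PASS_SCAN: find the last commit record; records after it belong to an
	// uncommitted transaction and must be ignored.
	end := -1
	for i, r := range journal {
		if r.kind == "commit" {
			end = i
		}
	}

	// PASS_REVOKE: build the set of revoked block numbers.
	revoked := map[int]bool{}
	for i := 0; i <= end; i++ {
		if journal[i].kind == "revoke" {
			revoked[journal[i].blockNo] = true
		}
	}

	// PASS_REPLAY: later copies overwrite earlier ones, so only the newest
	// version of each unrevoked block gets written back to the disk.
	replay := map[int]string{}
	for i := 0; i <= end; i++ {
		if journal[i].kind == "block" && !revoked[journal[i].blockNo] {
			replay[journal[i].blockNo] = journal[i].data
		}
	}
	return replay
}

func main() {
	journal := []rec{
		{kind: "block", blockNo: 7, data: "old copy"},
		{kind: "commit"},
		{kind: "block", blockNo: 7, data: "new copy"},
		{kind: "revoke", blockNo: 9},
		{kind: "block", blockNo: 9, data: "stale"},
		{kind: "commit"},
		{kind: "block", blockNo: 3, data: "half-written"}, // no commit record
	}
	fmt.Println(recoverJournal(journal)) // map[7:new copy]
}[/go]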

Amey Inamdar (www.geocities.com/amey_inamdar) is a kernel developer working at Kernel Corporation. His interest areas include filesystems and distributed systems.

Kedar Sovani (www.geocities.com/kedarsovani) works for Kernel Corporation as a kernel developer. His areas of interest include filesystems and storage technologies.

Copyright (c) 2004-2006 Kedar Sovani and Amey Inamdar

Haystack

January 30, 2013
storage

Paper: Finding a needle in Haystack: Facebook's photo storage

1. Overview

  1. 260 billion photos, 20 PB of data
  2. 60 TB and 1 billion new photos per week
  3. a retrieval rate of one million images per second
  4. object storage: photos are write-once, never modified, and rarely deleted
  5. a traditional POSIX filesystem had become the bottleneck

The paper attributes the bottleneck to the way traditional POSIX filesystems manage files through directories and metadata: fetching a single photo usually takes several I/O operations:

  1. translate the filename to an inode number
  2. read the inode in from disk
  3. read the file contents.

With a slightly more complex layout, step 1 can cost even more I/Os.
Haystack's design goals:

  1. high throughput and low latency
  2. fault tolerance
  3. low cost, measured along two dimensions: cost per terabyte of usable data and read rate per terabyte of usable data (where "usable" accounts for the underlying filesystem, RAID, redundancy and other factors)
  4. simplicity.

2. Background

  1. the typical design: a storage cluster plus a CDN
  2. NFS volumes

 

A CDN is good at caching hot content, but a site like Facebook also generates a huge amount of not-so-hot content, and it is precisely this lukewarm content that drives a large share of the traffic back to the origin; Facebook invokes the long-tail effect to explain this. In the NFS volumes, Facebook initially stored thousands of files per directory by default, but because of the way NAS devices manage directory metadata, keeping thousands of files in one directory turned out to be very painful: the directory's block map became too large. In that setup, reading one photo could take as many as 10 I/Os, and even after cutting directories down to about 100 files each, it still took around 3 I/Os to read a photo.
To cut disk operations further, Facebook also had the Photo Store servers cache the file handles returned by the NAS devices, which required adding a kernel API (open_by_filehandle). The lesson Facebook took away from the NAS approach is that caching can only reduce disk I/O so far.
One might suggest caching all the file handles in memcache. Facebook concedes that this might be a solution, but only to part of the problem, since it also requires the NAS devices to keep all the inode information cached, which amounts to nothing more than an expensive traditional storage setup.

3. Design and implementation

The approach: let the CDN handle the hot data, and let Haystack deal with the long tail.
Haystack's main goal is to eliminate the heavy disk I/O of the NFS-based design. It does so by shrinking the filesystem metadata footprint so that all of the metadata can be kept in memory, for example by storing many small files inside one large file.

1) Three core components: Haystack Directory, Haystack Cache, Haystack Store

The Store is made up of physical volumes, and capacity is controlled by the number of physical volumes: 100 physical volumes provide 10 TB of storage. Physical volumes are grouped into logical volumes; all physical volumes within one logical volume hold the same data, which provides redundancy. The Directory holds application-level information, such as which logical volume each photo lives on, the mapping from logical volumes to physical volumes, and the free space left on each logical volume. The Cache works as an internal CDN.

2) URL format

  http://CDN/Cache/Machine-id/<Logical-volume,Photo>

3) Reading a photo

(as above)

4) Uploading a photo

5) Directory

  1. maintains the mapping from logical volumes to physical volumes
  2. load-balances writes across logical volumes and reads across physical volumes
  3. decides whether a request should be handled by the CDN or by the Cache
  4. marks logical volumes as writable or read-only

6) Cache

The Cache is a distributed hash table keyed by photo id. It caches a photo only when both of the following hold:

  1. the request comes directly from a user rather than from the CDN
  2. the photo was read from a writable Store machine

On 1: the CDN and the Cache are effectively the same kind of cache, so what misses in one would miss in the other.
On 2: this shields the writable Store machines, for two reasons: first, newly uploaded photos get the heaviest traffic; second, the filesystem as designed performs better when a volume is doing only reads or only writes.
Given all this, the Directory's scheduling decisions become quite interesting.

7) Store

The Store consists of physical volumes, stored as hay/haystack_<logical volume id>.
Each physical volume's file handle is kept in memory, so looking up a photo's filename/offset/size on the Store requires no disk I/O.
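A hedged sketch of that in-memory index (the terms needle, offset and size come from the paper; the Go types here are my own illustration): one map lookup yields the byte range, so serving a read is a single positioned read on an already-open volume file.

[go]package main

import (
	"fmt"
	"os"
)

// needle locates one photo inside a volume file.
type needle struct {
	offset int64 // byte offset of the photo inside the volume file
	size   int64 // length of the photo data
}

// A toy Store: one open volume file plus an in-memory index per photo key.
type store struct {
	volume *os.File
	index  map[uint64]needle
}

// read serves a photo with a single positioned read; no metadata I/O.
func (s *store) read(key uint64) ([]byte, error) {
	n, ok := s.index[key]
	if !ok {
		return nil, fmt.Errorf("photo %d not found", key)
	}
	buf := make([]byte, n.size)
	_, err := s.volume.ReadAt(buf, n.offset)
	return buf, err
}

func main() {
	// Illustrative only: a real Store builds this map by scanning the volume
	// and its index file at startup.
	f, _ := os.CreateTemp("", "haystack_volume")
	defer os.Remove(f.Name())
	f.WriteString("hello-photo-bytes")

	s := &store{volume: f, index: map[uint64]needle{42: {offset: 6, size: 5}}}
	b, _ := s.read(42)
	fmt.Printf("%s\n", b) // "photo"
}[/go]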

Cache-to-Store read:
request parameters: logical volume id, key, alternate key, cookie
Cache-to-Store write:
the same parameters plus the data. Data is append-only. For a few modifications such as rotation the needle may be updated in place; every other modification appends a new needle with the same key and alternate key. If the new needle is appended to a different logical volume, the Directory learns about it and the needle on the old volume will never be read again.
Cache-to-Store delete:
just set the deleted flag.

8) The index

To speed up startup, a Store machine maintains an index file for each volume; it is essentially a checkpoint of the in-memory index at some point in time.

This introduces a problem: the index file may be stale. For example, deleting a photo only sets a flag and does not update the index; index updates are asynchronous so that writes can return quickly.

Two side effects have to be handled: a needle may exist with no corresponding index record, and an index record may not reflect the delete flag.
The fixes (a toy sketch follows the list):

  1. For the first side effect, at startup the machine scans the volume from the tail toward the head and adds the missing index records; needles without index entries can only be the ones appended last.
  2. For the second side effect, handle it lazily: as long as nobody reads the photo, do nothing; when someone does read it, check the flag then and update the index.
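A toy sketch of the two fixes (my own; in-memory slices stand in for the volume and its index file, and the field names are made up): at startup, append index records for the trailing needles the checkpoint missed, and let reads detect the delete flag lazily.

[go]package main

import "fmt"

type needle struct {
	key     uint64
	deleted bool // delete only sets this flag; the index is not rewritten
}

type record struct{ key uint64 }

// rebuildIndex appends records for needles written after the last checkpoint;
// needles missing from the index are always at the end of the volume.
func rebuildIndex(volume []needle, index []record) []record {
	for i := len(index); i < len(volume); i++ {
		index = append(index, record{key: volume[i].key})
	}
	return index
}

// read checks the delete flag lazily, at read time, instead of keeping the
// index file perfectly in sync with deletions.
func read(volume []needle, i int) (uint64, error) {
	if volume[i].deleted {
		return 0, fmt.Errorf("photo %d was deleted", volume[i].key)
	}
	return volume[i].key, nil
}

func main() {
	volume := []needle{{key: 1}, {key: 2, deleted: true}, {key: 3}} // key 3 not indexed yet
	index := []record{{key: 1}, {key: 2}}

	index = rebuildIndex(volume, index)
	fmt.Println("index now has", len(index), "records")

	if _, err := read(volume, 1); err != nil {
		fmt.Println(err) // the deletion is noticed lazily, at read time
	}
}[/go]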

9) Filesystem

Given Haystack's design, what it needs from the filesystem is: little memory spent on per-file metadata, and fast random seeks within large files.
XFS was chosen, because:

  1. its block maps are small
  2. preallocation reduces fragmentation, among other benefits.

10) Fault tolerance

Failure detection: monitor network connections, check that volume files are accessible, and try reading data; once a problem is detected, the volume is marked read-only.
Failure recovery: at that point recovery essentially relies on the replicas.

how goroutines are implemented

Goroutines are implemented mainly on top of the APIs below; the linux/386 versions live in runtime/asm_386.s. Only the API comments are shown here:
[c]// void gosave(Gobuf*)
// save state in Gobuf; setjmp

// void gogo(Gobuf*, uintptr)
// restore state from Gobuf; longjmp

// void gogocall(Gobuf*, void (*fn)(void))
// restore state from Gobuf but then call fn.
// (call fn, returning to state in Gobuf)

// void mcall(void (*fn)(G*))
// Switch to m->g0's stack, call fn(g).
// Fn must never return. It should gogo(&g->sched)
// to keep running g.[/c]
The scheduler itself is implemented in runtime/proc.c.

gosave is similar to setjmp, except that it saves the state of the currently running G, namely its PC and SP; gogo is similar to longjmp and restores a G's SP and PC. The SP and PC are kept in a Gobuf structure. gogocall does roughly what gogo does, but after jumping to the target G it goes on to call a function fn. mcall is slightly more involved, but it is similar to gogocall.

Normally, jumping from an M to a G is done with gogo, and jumping from a G back to its M is done with mcall. The reason is that the only thing running on the M's stack is the mstart() function, and mstart() mostly just initializes some data and then finally kicks off the scheduler, schedule(); so if gogo were used to return from a G to the M, the M would simply run to completion and exit.

The whole mechanism is driven by a single global Sched structure that holds all of the scheduling state. It is a bit more complex than libtask's implementation, and quite similar to the proc implementation in xv6.

the lua 5.1 virtual machine

January 19, 2013
Compiler and vm
  1. The Implementation of Lua 5.0
  2. manual: http://www.lua.org/manual/5.1/manual.html
  3. reading the Lua source (Lua源码欣赏): http://www.codingnow.com/temp/readinglua.pdf
  4. instruction set: A No Frills Intro To Lua51 VM Instructions.pdf

Quite useful material; it helps a lot in understanding the 5.1 VM.

Building reliable distributed systems in the presence of software errors

January 12, 2013
Distributed system

Making reliable distributed systems in the presence of software errors

1 Problem domain

  • concurrency
  • soft real-time
  • distribution
  • hardware interaction
  • large software systems
  • complex functionality
  • continuous operation
  • quality requirements
  • fault tolerance

2 Philosophy

Fault tolerance and fault isolation, for example processes and message-based interaction.
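As a loose analogy in Go (my own sketch; the thesis itself is about Erlang): keep each unit of work isolated in its own goroutine, talk to it only through messages on a channel, and let a supervisor restart it when it fails, so one software error does not bring down the rest of the system.

[go]package main

import "fmt"

// worker processes messages in isolation; a panic here must not take down
// the rest of the system.
func worker(jobs <-chan int) {
	for j := range jobs {
		if j == 3 {
			panic("bad input") // a simulated software error
		}
		fmt.Println("processed", j)
	}
}

// runOnce runs the worker once and reports whether it finished normally.
func runOnce(jobs <-chan int) (finished bool) {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("worker crashed:", r, "- restarting")
		}
	}()
	worker(jobs)
	return true
}

// supervise keeps restarting the worker until it drains the job channel.
func supervise(jobs <-chan int) {
	for !runOnce(jobs) {
	}
}

func main() {
	jobs := make(chan int)
	done := make(chan struct{})
	go func() { supervise(jobs); close(done) }()

	for j := 1; j <= 5; j++ {
		jobs <- j
	}
	close(jobs)
	<-done
}[/go]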

3 System requirements

  • concurrency
  • error encapsulation: an error in one process must not be able to corrupt other processes in the system
  • fault detection, both local and across the network
  • fault identification
  • code upgrade
  • stable storage, so that a crashed system can be recovered

4 Language requirements

  • encapsulation primitives
  • concurrency
  • fault-detection primitives
  • location transparency
  • dynamic code upgrade

5 Library requirements

  • stable storage
  • device drivers
  • code upgrade
  • an infrastructure for running the system