Coix Seed and Mung Bean Congee

May 9, 2013
Life

It rained today, so I skipped my run and cooked a bowl of congee instead.

Profiling Go Program


Performance Analysis and Tools

- http://blog.golang.org/2011/06/profiling-go-programs.html

At Scala Days 2011 a few weeks ago, Robert Hundt presented a paper titled “Loop Recognition in C++/Java/Go/Scala.” The paper implemented a specific loop finding algorithm, such as you might use in a flow analysis pass of a compiler, in C++, Go, Java, Scala, and then used those programs to draw conclusions about typical performance concerns in these languages. The Go program presented in that paper runs quite slowly, making it an excellent opportunity to demonstrate how to use Go's profiling tools to take a slow program and make it faster.

By using Go's profiling tools to identify and correct specific bottlenecks, we can make the Go loop finding program run an order of magnitude faster and use 6x less memory.

Hundt's paper does not specify which versions of the C++, Go, Java, and Scala tools he used. In this blog post, we will be using the most recent weekly snapshot of the 6g Go compiler and the version of g++ that ships with the Ubuntu Natty distribution. (We will not be using Java or Scala, because we are not skilled at writing efficient programs in either of those languages, so the comparison would be unfair. Since C++ was the fastest language in the paper, the comparisons here with C++ should suffice.)

$ 6g -V
6g version weekly.2011-06-16 8787
$ g++ --version
g++ (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2
...
$

The programs are run on a Lenovo X201s with a 2.13 GHz Core i7-based CPU and 4 GB of RAM running Ubuntu Natty's Linux 2.6.38-8-generic kernel. The machine is running with CPU frequency scaling disabled via

$ sudo bash
# for i in /sys/devices/system/cpu/cpu[0-9]
do
    echo performance > $i/cpufreq/scaling_governor
done
#

We've taken Hundt's benchmark programs in C++ and Go, combined each into a single source file, and removed all but one line of output. We'll time the program using Linux's time utility with a format that shows user time, system time, real time, and maximum memory usage:

$ cat xtime
#!/bin/sh
/usr/bin/time -f '%Uu %Ss %er %MkB %C' "$@"
$

$ make havlak1cc
g++ -O3 -o havlak1cc havlak1.cc
$ xtime havlak1cc
# of loops: 76002 (total 3800100)
loop-0, nest: 0, depth: 0
27.37u 0.08s 27.47r 716864kB havlak1cc
$

$ make havlak1
6g havlak1.go
6l -o havlak1 havlak1.6
$ xtime havlak1
# of loops: 76000 (including 1 artificial root node)
56.63u 0.26s 56.92r 1642640kB havlak1
$

The C++ program runs in 27.47 seconds and uses 700 MB of memory. The Go program runs in 56.92 seconds and uses 1604 MB of memory. (These measurements are difficult to reconcile with the ones in the paper, but the point of this post is to explore how to use gopprof, not to reproduce the results from the paper.)

To start tuning the Go program, we have to enable profiling. If the code used the Go testing package's benchmarking support, we could use gotest's standard -cpuprofile and -memprofile flags. In a standalone program like this one, we have to import runtime/pprof and add a few lines of code:

var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to file")

func main() {
    flag.Parse()
    if *cpuprofile != "" {
        f, err := os.Create(*cpuprofile)
        if err != nil {
            log.Fatal(err)
        }
        pprof.StartCPUProfile(f)
        defer pprof.StopCPUProfile()
    }
    ...

The new code defines a flag named cpuprofile, calls the Go flag library to parse the command line flags, and then, if the cpuprofile flag has been set on the command line, starts CPU profiling redirected to that file. The profiler requires a final call to StopCPUProfile to flush any pending writes to the file before the program exits; we use defer to make sure this happens as main returns.

After adding that code, we can run the program with the new -cpuprofile flag and then run gopprof to interpret the profile.

$ make havlak1.prof
havlak1 -cpuprofile=havlak1.prof
# of loops: 76000 (including 1 artificial root node)
$ gopprof havlak1 havlak1.prof
Welcome to pprof!  For help, type 'help'.
(pprof)

The gopprof program is a slight variant of Google's pprof C++ profiler. The most important command is topN, which shows the top N samples in the profile:

(pprof) top10
Total: 5758 samples
    1028  17.9%  17.9%     1161  20.2% hash_lookup
     696  12.1%  29.9%      697  12.1% scanblock
     565   9.8%  39.8%     1042  18.1% hash_insert_internal
     420   7.3%  47.0%     4278  74.3% main.FindLoops
     225   3.9%  51.0%     1149  20.0% main.DFS
     225   3.9%  54.9%      226   3.9% memhash
     198   3.4%  58.3%      437   7.6% sweep
     172   3.0%  61.3%     1902  33.0% runtime.mallocgc
     102   1.8%  63.1%      500   8.7% runtime.MCache_Alloc
     102   1.8%  64.8%      102   1.8% runtime.memmove
(pprof)

When CPU profiling is enabled, the Go program stops about 100 times per second and records a sample consisting of the program counters on the currently executing goroutine's stack. The profile has 5758 samples, so it was running for a bit over 57 seconds. In the gopprof output, there is a row for each function that appeared in a sample. The first two columns show the number of samples in which the function was running (as opposed to waiting for a called function to return), as a raw count and as a percentage of total samples. The hash_lookup function was running during 1028 samples, or 17.9%. The top10 output is sorted by this sample count. The third column shows the running total during the listing: the first three rows account for 39.8% of the samples. The fourth and fifth columns show the number of samples in which the function appeared (either running or waiting for a called function to return). The main.FindLoops function was running in 7.3% of the samples, but it was on the call stack (it or functions it called were running) in 74.3% of the samples.

To sort by the fourth and fifth columns, use the -cum (for cumulative) flag:

(pprof) top5 -cum
Total: 5758 samples
       0   0.0%   0.0%     4301  74.7% main.main
       0   0.0%   0.0%     4301  74.7% runtime.initdone
       0   0.0%   0.0%     4301  74.7% runtime.mainstart
       0   0.0%   0.0%     4278  74.3% main.FindHavlakLoops
     420   7.3%   7.3%     4278  74.3% main.FindLoops
(pprof)

In fact the total for main.FindLoops and main.main should have been 100%, but each stack sample only includes the bottom 100 stack frames; during about a quarter of the samples, the recursive main.DFS function was more than 100 frames deeper than main.main so the complete trace was truncated.

The stack trace samples contain more interesting data about function call relationships than the text listings can show. The web command writes a graph of the profile data in SVG format and opens it in a web browser. (There is also a gv command that writes PostScript and opens it in Ghostview. For either command, you need graphviz installed.)

(pprof) web

A small fragment of the full graph looks like:

Each box in the graph corresponds to a single function, and the boxes are sized according to the number of samples in which the function was running. An edge from box X to box Y indicates that X calls Y; the number along the edge is the number of times that call appears in a sample. If a call appears multiple times in a single sample, such as during recursive function calls, each appearance counts toward the edge weight. That explains the 69206 on the self-edge from main.DFS to itself.

Just at a glance, we can see that the program spends much of its time in hash operations, which correspond to use of Go's map values. We can tell web to use only samples that include a specific function, such as hash_lookup, which clears some of the noise from the graph:

(pprof) web hash_lookup

If we squint, we can see that the calls to runtime.mapaccess1 are being made by main.FindLoops and main.DFS.

Now that we have a rough idea of the big picture, it's time to zoom in on a particular function. Let's look at main.DFS first, just because it is a shorter function:

(pprof) list DFS
Total: 5758 samples
ROUTINE ====================== main.DFS in /home/rsc/g/benchgraffiti/havlak/havlak1.go
   225   2296 Total samples (flat / cumulative)
     3      3  240: func DFS(currentNode *BasicBlock, nodes []*UnionFindNode, number map[*BasicBlock]int, last []int, current int) int {
    18     19  241:     nodes[current].Init(currentNode, current)
     .    166  242:     number[currentNode] = current
     .      .  243:
     2      2  244:     lastid := current
   167    167  245:     for _, target := range currentNode.OutEdges {
    17    508  246:         if number[target] == unvisited {
    10   1157  247:             lastid = DFS(target, nodes, number, last, lastid+1)
     .      .  248:         }
     .      .  249:     }
     7    273  250:     last[number[currentNode]] = lastid
     1      1  251:     return lastid
     .      .  252: }
(pprof)

The listing shows the source code for the DFS function (really, for every function matching the regular expression DFS). The first three columns are the number of samples taken while running that line, the number of samples taken while running that line or in code called from that line, and the line number in the file. The related command disasm shows a disassembly of the function instead of a source listing; when there are enough samples this can help you see which instructions are expensive. The weblist command mixes the two modes: it shows a source listing in which clicking a line shows the disassembly.

Since we already know that the time is going into map lookups implemented by the hash runtime functions, we care most about the second column. A large fraction of time is spent in recursive calls to DFS (line 247), as would be expected from a recursive traversal. Excluding the recursion, it looks like the time is going into the accesses to the number map on lines 242, 246, and 250. For that particular lookup, a map is not the most efficient choice. Just as they would be in a compiler, the basic block structures have unique sequence numbers assigned to them. Instead of using a map[*BasicBlock]int we can use a []int, a slice indexed by the block number. There's no reason to use a map when an array or slice will do.
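
As a rough sketch of that change (the real havlak2.go may differ in small details), number becomes a slice indexed by each block's Name field, and the map operations in DFS become plain index expressions:

// Sketch only: number is now a []int indexed by the block's sequence
// number (bb.Name) instead of a map[*BasicBlock]int.
func DFS(currentNode *BasicBlock, nodes []*UnionFindNode, number []int, last []int, current int) int {
    nodes[current].Init(currentNode, current)
    number[currentNode.Name] = current // slice index instead of map insert

    lastid := current
    for _, target := range currentNode.OutEdges {
        if number[target.Name] == unvisited { // slice index instead of map lookup
            lastid = DFS(target, nodes, number, last, lastid+1)
        }
    }
    last[number[currentNode.Name]] = lastid
    return lastid
}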

Changing number from a map to a slice requires editing seven lines in the program and cuts its run time by nearly a factor of two:

$ make havlak2
6g havlak2.go
6l -o havlak2 havlak2.6
rm havlak2.6
$ xtime havlak2    # diff from havlak1
# of loops: 76000 (including 1 artificial root node)
30.88u 0.24s 31.14r 1564608kB havlak2
$

We can run the profiler again to confirm that main.DFS is no longer a significant part of the run time:

$ make havlak2.prof
havlak2 -cpuprofile=havlak2.prof
# of loops: 76000 (including 1 artificial root node)
$ gopprof havlak2 havlak2.prof
Welcome to pprof!  For help, type 'help'.
(pprof) top5
Total: 3099 samples
     626  20.2%  20.2%      626  20.2% scanblock
     309  10.0%  30.2%     2839  91.6% main.FindLoops
     176   5.7%  35.9%     1732  55.9% runtime.mallocgc
     173   5.6%  41.4%      397  12.8% sweep
     101   3.3%  44.7%      111   3.6% main.DFS
(pprof)

main.DFS still appears in the profile, but its total time has dropped from 20.0% to 3.6%. The rest of the program runtime has dropped too. Now the program is spending most of its time allocating memory and garbage collecting (runtime.mallocgc, which both allocates and runs periodic garbage collections, accounts for 55.9% of the time). To find out why the garbage collector is running so much, we have to find out what is allocating memory. One way is to add memory profiling to the program. We'll arrange that if the -memprofile flag is supplied, the program stops after one iteration of the loop finding, writes a memory profile, and exits:

var memprofile = flag.String("memprofile", "", "write memory profile to this file")
...
    FindHavlakLoops(cfgraph, lsgraph)
    if *memprofile != "" {
        f, err := os.Create(*memprofile)
        if err != nil {
            log.Fatal(err)
        }
        pprof.WriteHeapProfile(f)
        f.Close()
        return
    }

We invoke the program with the -memprofile flag to write a profile:

$ make havlak3.mprof    # diff from havlak2
havlak3 -memprofile=havlak3.mprof
$

We use gopprof exactly the same way. Now the samples we are examining are memory allocations, not clock ticks.

$ gopprof havlak3 havlak3.mprof
Adjusting heap profiles for 1-in-524288 sampling rate
Welcome to pprof!  For help, type 'help'.
(pprof) top5
Total: 118.3 MB
    66.1  55.8%  55.8%    103.7  87.7% main.FindLoops
    30.5  25.8%  81.6%     30.5  25.8% main.*LSG·NewLoop
    10.0   8.5%  90.1%     10.0   8.5% main.NewBasicBlock
     6.5   5.5%  95.6%      6.5   5.5% main.*SimpleLoop·AddNode
     2.1   1.7%  97.3%     12.1  10.2% main.*CFG·CreateNode
(pprof)

Gopprof reports that FindLoops has allocated approximately 66.1 of the 118.3 MB in use; NewLoop accounts for another 30.5 MB. To reduce overhead, the memory profiler only records information for approximately one block per half megabyte allocated (the “1-in-524288 sampling rate”), so these are approximations to the actual counts.

To find the memory allocations, we can list those functions.

(pprof) list FindLoops
Total: 118.3 MB
ROUTINE ====================== main.FindLoops in /home/rsc/g/benchgraffiti/havlak/havlak3.go
    MB   66.1 Total 103.7 (flat / cumulative)
...
     .      .  267:
   1.9    1.9  268:     nonBackPreds := make([]map[int]bool, size)
   3.8    3.8  269:     backPreds := make([][]int, size)
     .      .  270:
   1.0    1.0  271:     number := make([]int, size)
   1.0    1.0  272:     header := make([]int, size, size)
   1.0    1.0  273:     types := make([]int, size, size)
   1.0    1.0  274:     last := make([]int, size, size)
   1.9    1.9  275:     nodes := make([]*UnionFindNode, size, size)
     .      .  276:
     .      .  277:     for i := 0; i < size; i++ {
   5.5    5.5  278:         nodes[i] = new(UnionFindNode)
     .      .  279:     }
...
     .      .  286:     for i, bb := range cfgraph.Blocks {
     .      .  287:         number[bb.Name] = unvisited
  48.0   48.0  288:         nonBackPreds[i] = make(map[int]bool)
     .      .  289:     }
...
(pprof) list NewLoop
Total: 118.3 MB
ROUTINE ====================== main.*LSG·NewLoop in /home/rsc/g/benchgraffiti/havlak/havlak3.go
     .      .  578: func (lsg *LSG) NewLoop() *SimpleLoop {
   2.5    2.5  579:     loop := new(SimpleLoop)
   7.5    7.5  580:     loop.basicBlocks = make(map[*BasicBlock]bool)
  20.5   20.5  581:     loop.Children = make(map[*SimpleLoop]bool)
...
     .      .  588: }
(pprof)

It looks like the current bottleneck is the same as the last one: using maps where simpler data structures suffice. FindLoops is allocating about 48 MB of maps, and NewLoop is allocating another 20 MB.

As an aside, if we run gopprof with the --inuse_objects flag, it will report allocation counts instead of sizes:

$ gopprof --inuse_objects havlak3 havlak3.mprof
Adjusting heap profiles for 1-in-524288 sampling rate
Welcome to pprof!  For help, type 'help'.
(pprof) list NewLoop
Total: 1604080 objects
ROUTINE ====================== main.*LSG·NewLoop in /home/rsc/g/benchgraffiti/havlak/havlak3.go
     .      .  578: func (lsg *LSG) NewLoop() *SimpleLoop {
 54613  54613  579:     loop := new(SimpleLoop)
 75678  75678  580:     loop.basicBlocks = make(map[*BasicBlock]bool)
207530 207530  581:     loop.Children = make(map[*SimpleLoop]bool)
...
     .      .  588: }
(pprof)

Since the 200,000 maps account for 20 MB, it looks like the initial map allocation takes about 100 bytes. That's reasonable when a map is being used to hold key-value pairs, but not when a map is being used as a stand-in for a simple set, as it is here.

Instead of using a map, we can use a simple slice to list the elements. In all but one of the cases where maps are being used, it is impossible for the algorithm to insert a duplicate element. In the one remaining case, we can write a simple variant of the append built-in function:

func appendUnique(a []int, x int) []int {
    for _, y := range a {
        if x == y {
            return a
        }
    }
    return append(a, x)
}

In addition to writing that function, changing the Go program to use slices instead of maps requires changing just a few lines of code.

$ xtime havlak4    # diff from havlak3
# of loops: 76000 (including 1 artificial root node)
18.35u 0.11s 18.48r 575792kB havlak4
$

We're now at 3x faster than when we started. Let's look at a CPU profile again.

$ gopprof havlak4 havlak4.prof
Welcome to pprof!  For help, type 'help'.
(pprof) top10
Total: 1851 samples
     283  15.3%  15.3%      283  15.3% scanblock
     233  12.6%  27.9%     1622  87.6% main.FindLoops
     142   7.7%  35.5%     1054  56.9% runtime.mallocgc
     112   6.1%  41.6%      276  14.9% sweep
     111   6.0%  47.6%      115   6.2% main.DFS
      85   4.6%  52.2%      661  35.7% runtime.growslice
      84   4.5%  56.7%       84   4.5% runtime.memmove
      69   3.7%  60.5%      281  15.2% runtime.MCache_Alloc
      67   3.6%  64.1%       84   4.5% MCentral_Alloc
      67   3.6%  67.7%       93   5.0% MCentral_Free
(pprof)

Now memory allocation and the consequent garbage collection (runtime.mallocgc) accounts for 56.9% of our run time. Another way to look at why the system is garbage collecting is to look at the allocations that are causing the collections, the ones that spend most of the time in mallocgc:

(pprof) web mallocgc

It's hard to tell what's going on in that graph, because there are many nodes with small sample numbers obscuring the big ones. We can tell gopprof to ignore nodes that don't account for at least 10% of the samples:

$ gopprof --nodefraction=0.1 havlak4 havlak4.prof
Welcome to pprof!  For help, type 'help'.
(pprof) web mallocgc

We can follow the thick arrows easily now, to see that FindLoops is triggering most of the garbage collection. If we list FindLoops we can see that much of it is right at the beginning:

(pprof) list FindLoops
     .      .  270: func FindLoops(cfgraph *CFG, lsgraph *LSG) {
     .      .  271:     if cfgraph.Start == nil {
     .      .  272:         return
     .      .  273:     }
     .      .  274:
     .      .  275:     size := cfgraph.NumNodes()
     .      .  276:
     .     17  277:     nonBackPreds := make([][]int, size)
     .     82  278:     backPreds := make([][]int, size)
     .      .  279:
     .      2  280:     number := make([]int, size)
     .      1  281:     header := make([]int, size, size)
     .     61  282:     types := make([]int, size, size)
     .      .  283:     last := make([]int, size, size)
     .     58  284:     nodes := make([]*UnionFindNode, size, size)
     .      .  285:
     2      2  286:     for i := 0; i < size; i++ {
     .    261  287:         nodes[i] = new(UnionFindNode)
     .      .  288:     }
...
(pprof)

Every time FindLoops is called, it allocates some sizable bookkeeping structures. Since the benchmark calls FindLoops 50 times, these add up to a significant amount of garbage, so a significant amount of work for the garbage collector.

Having a garbage-collected language doesn't mean you can ignore memory allocation issues. In this case, a simple solution is to introduce a cache so that each call to FindLoops reuses the previous call's storage when possible. (In fact, in Hundt's paper, he explains that the Java program needed just this change to get anything like reasonable performance, but he did not make the same change in the other garbage-collected implementations.)

We'll add a global cache structure:

var cache struct {
    size int
    nonBackPreds [][]int
    backPreds [][]int
    number []int
    header []int
    types []int
    last []int
    nodes []*UnionFindNode
}

and then have FindLoops consult it as a replacement for allocation:

    if cache.size < size {
        cache.size = size
        cache.nonBackPreds = make([][]int, size)
        cache.backPreds = make([][]int, size)
        cache.number = make([]int, size)
        cache.header = make([]int, size)
        cache.types = make([]int, size)
        cache.last = make([]int, size)
        cache.nodes = make([]*UnionFindNode, size)
        for i := range cache.nodes {
            cache.nodes[i] = new(UnionFindNode)
        }
    }

    nonBackPreds := cache.nonBackPreds[:size]
    for i := range nonBackPreds {
        nonBackPreds[i] = nonBackPreds[i][:0]
    }
    backPreds := cache.backPreds[:size]
    for i := range backPreds {
        backPreds[i] = backPreds[i][:0]
    }
    number := cache.number[:size]
    header := cache.header[:size]
    types := cache.types[:size]
    last := cache.last[:size]
    nodes := cache.nodes[:size]

Such a global variable is bad engineering practice, of course: it means that concurrent calls to FindLoops are now unsafe. For now, we are making the minimal possible changes in order to understand what is important for the performance of our program; this change is simple and mirrors the code in the Java implementation. The final version of the Go program will use a separate LoopFinder instance to track this memory, restoring the possibility of concurrent use.

$ xtime havlak5    # diff from havlak4
# of loops: 76000 (including 1 artificial root node)
12.59u 0.07s 12.67r 584496kB havlak5
$
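
For reference, the per-instance replacement for that global cache might be carried roughly like this (a sketch with hypothetical names; see the final havlak6.go for the real code):

// Sketch: moving the global cache into a value owned by the caller, so that
// independent loopFinder values can run FindLoops concurrently.
type loopFinder struct {
    size         int
    nonBackPreds [][]int
    backPreds    [][]int
    number       []int
    header       []int
    types        []int
    last         []int
    nodes        []*UnionFindNode
}

func (lf *loopFinder) FindLoops(cfgraph *CFG, lsgraph *LSG) {
    // grow lf's slices exactly as the global cache is grown above,
    // then run the same algorithm against the per-instance storage.
}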

There's more we can do to clean up the program and make it faster, but none of it requires profiling techniques that we haven't already shown. The work list used in the inner loop can be reused across iterations and across calls to FindLoops, and it can be combined with the separate “node pool” generated during that pass. Similarly, the loop graph storage can be reused on each iteration instead of reallocated. In addition to these performance changes, the final version is written using idiomatic Go style, using data structures and methods. The stylistic changes have only a minor effect on the run time: the algorithm and constraints are unchanged.

The final version runs in 3.84 seconds and uses 257 MB of memory:

$ xtime havlak6
# of loops: 76000 (including 1 artificial root node)
3.79u 0.04s 3.84r 263472kB havlak6
$

That's nearly 15 times faster than the program we started with. Even if we disable reuse of the generated loop graph, so that the only cached memory is the loop finding bookkeeping, the program still runs 10x faster than the original and uses 2.5x less memory.

$ xtime havlak6 -reuseloopgraph=false
# of loops: 76000 (including 1 artificial root node)
5.74u 0.10s 5.84r 617040kB havlak6 -reuseloopgraph=false
$

Of course, it's no longer fair to compare this Go program to the original C++ program, which used inefficient data structures like sets where vectors would be more appropriate. As a sanity check, we translated the final Go program into equivalent C++ code. Its execution time is similar to the Go program's:

$ xtime havlak6cc
# of loops: 76000 (including 1 artificial root node)
4.04u 0.38s 4.42r 387744kB havlak6cc
$

The Go program runs slightly faster because the C++ program is using automatic deletes and allocation instead of an explicit cache. That makes the C++ program a bit shorter and easier to write, but not dramatically so:

$ wc havlak6.cc; wc havlak6.go
  401  1220  9040 havlak6.cc
  461  1441  9467 havlak6.go
$

Benchmarks are only as good as the programs they measure. We used gopprof to study an inefficient Go program and then to improve its performance by an order of magnitude and to reduce its memory usage by a factor of six. A subsequent comparison with an equivalently optimized C++ program shows that Go can be competitive with C++ when programmers are careful about how much garbage is generated by inner loops.

The program sources, Linux x86-64 binaries, and profiles used to write this post are available in the benchgraffiti project on Google Code.

As mentioned above, gotest includes these profiling flags already: define a benchmark function and you're all set. There is also a standard HTTP interface to profiling data. In an HTTP server, adding

import _ "http/pprof"

will install handlers for a few URLs under /debug/pprof/. Then you can run gopprof with a single argument—the URL to your server's profiling data—and it will download and examine a live profile.

gopprof http://localhost:6060/debug/pprof/profile   # 30-second CPU profile
gopprof http://localhost:6060/debug/pprof/heap      # heap profile
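
As a minimal sketch of such a server (note that in current Go releases the packages are net/http and net/http/pprof; at the time this post was written they were http and http/pprof):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
)

func main() {
    // pprof can now fetch live CPU and heap profiles from this server.
    log.Fatal(http.ListenAndServe("localhost:6060", nil))
}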

- Russ Cox

The shortest shell code for a multiplication table

April 18, 2013
Programming practices

The first version:

for i in {1..5}; do for j in {1..5}; do let k=$i*$j; echo $i*$j=$k; done; done;

Then, after researching for a while, I wrote a second version:

seq 5 | awk '{for(i=1;i<=$1;i++){printf($1 "*"i"="i*$1" ")};print""}'

How Vim handles the ternary expression

April 5, 2013
Compiler and vm

The description of Vim script's conditional expression in the runtime/doc/usr_41.txt documentation is a bit off:

Borrowed from the C language is the conditional expression:

	a ? b : c

If "a" evaluates to true "b" is used, otherwise "c" is used.  Example: >

	:let i = 4
	:echo i > 5 ? "i is big" : "i is small"
<	i is small ~

The three parts of the constructs are always evaluated first, thus you could see it work as:

	(a) ? (b) : (c)

The problem is with that last sentence (the one I had marked in red): it says that when a conditional expression is evaluated, all three parts are evaluated first. That is, the values of (a), (b) and (c) are all computed, and then the final value of the expression is chosen according to whether (a) is true.
That approach looks very wrong, and I really cannot think of why it would be done that way; in languages like C, most compiler implementations ultimately turn the conditional into a branch (a goto).
If it really worked that way, would it not mean that, given

	:let a = 3 < 5 ? 3 - 1 : getchar()

getchar() would be executed even though the final result is the second part (part2)?

In fact it does not. I had a look at the Vim script interpreter's code:

/*
 * Handle top level expression:
 * expr1 ? expr0 : expr0
 */
static int
eval1(arg, rettv, evaluate)
   char_u **arg;
   typval_T *rettv;
   int evaluate;
{
  ....
  /*
   * Get the first variable.  process part1
   */
  if (eval2(arg, rettv, evaluate) == FAIL)
    return FAIL;

  if ((*arg)[0] == '?') {
    result = FALSE;
    if (evaluate) {
      int error = FALSE;

      if (get_tv_number_chk(rettv, &error) != 0)
        result = TRUE;
      clear_tv(rettv);
      if (error)
        return FAIL;
    }

    /*
     * Get the second variable.
     */
    *arg = skipwhite(*arg + 1);
    // process part2
    if (eval1(arg, rettv, evaluate && result) == FAIL) /* recursive! */
      return FAIL;

    /*
     * Check for the ":".
     */
    if ((*arg)[0] != ':') {
      EMSG(_("E109: Missing ':' after '?'"));
      if (evaluate && result)
        clear_tv(rettv);
      return FAIL;
    }

    /*
     * Get the third variable.
     */
    *arg = skipwhite(*arg + 1);
    // process part3
    if (eval1(arg, &var2, evaluate && !result) == FAIL) /* recursive! */ {
      if (evaluate && result)
        clear_tv(rettv);
      return FAIL;
    }

    // here
    if (evaluate && !result)
      *rettv = var2;
  }

  return OK;
}

From this logic the documentation's description seems correct: the value of part1 is computed first, then part2 and part3 in turn; if part1 is not true the value of part3 is returned, otherwise the value of part2 is returned (see the "here" comment in the code above).
But if you keep tracing down, you find that while all three parts appear to be evaluated, in reality they are not: much of the real work depends on the evaluate flag. Only when part1 is true is part2 actually computed, and part3 is not, and the other way around. In the branch that is not taken, the "evaluation" the documentation talks about is really just parsing.

For example, if one of the parts is a +/- expression:

/*
 * Handle fourth level expression:
 * + number addition
 * - number subtraction
 * . string concatenation
 *
 * Return OK or FAIL.
 */
static int
eval5(arg, rettv, evaluate)
   char_u **arg;
   typval_T *rettv;
   int evaluate;
{
  ....
  /*
   * Get the first variable.
   */
  if (eval6(arg, rettv, evaluate, FALSE) == FAIL)
    return FAIL;

  /*
   * Repeat computing, until no '+', '-' or '.' is following.
   */
  for (;;) {
    ....
    /*
     * Get the second variable.
     */
    *arg = skipwhite(*arg + 1);
    if (eval6(arg, &var2, evaluate, op == '.') == FAIL) {
      clear_tv(rettv);
      return FAIL;
    }

    if (evaluate) {
      /*
       * Compute the result.
       */
      ...
    }
  }
  return OK;
}

The same holds for function calls, so getchar() is never executed in that example.
So the Vim documentation really is wrong on this point.

goroutine scheduler

March 25, 2013
Compiler and vm

some references

  • Analysis of the Go runtime scheduler
  • Dmitry Vyukov.  Scalable Go Scheduler Design Doc

go runtime and go scheduler

The runtime keeps track of each goroutine, and will schedule them to run in turn on a pool of threads belonging to the process. Goroutines are separate from threads but rely upon them to run, and scheduling goroutines onto threads effectively is crucial for the efficient performance of Go programs. The idea behind goroutines is that they are capable of running concurrently, like threads, but are also extremely lightweight in comparison. So, while there might be multiple threads created for a process running a Go program, the ratio of goroutines to threads should be much higher than 1-to-1. Multiple threads are often necessary to ensure that goroutines are not unnecessarily blocked. When one goroutine makes a blocking call, the thread running it must block. Therefore, at least one more thread should be created by the runtime to continue the execution of other goroutines that are not in blocking calls. Multiple threads are allowed to run in parallel up to a programmer-defined maximum, which is stored in the variable GOMAXPROCS.
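
As a small standalone illustration of how cheap goroutines are compared to threads (this example is not from the referenced papers), the following program creates far more goroutines than the handful of threads the runtime uses to run them:

package main

import (
    "fmt"
    "runtime"
    "sync"
)

func main() {
    runtime.GOMAXPROCS(2) // at most 2 threads execute Go code at any moment

    var wg sync.WaitGroup
    for i := 0; i < 100000; i++ {
        wg.Add(1)
        go func() { // each goroutine starts with only a few KB of stack
            defer wg.Done()
        }()
    }
    wg.Wait()
    fmt.Println("started 100000 goroutines with GOMAXPROCS =", runtime.GOMAXPROCS(0))
}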

It is important to keep in mind that all the OS sees is a single user level process requesting and running multiple threads. The concept of scheduling goroutines onto these threads is merely a construct in the virtual environment of the runtime. When we refer to the Go runtime and scheduler we are referring to these higher level entities, which are completely separate from the operating system.

data struct

A G struct represents a single goroutine[9]. It contains the fields necessary to keep track of its stack and current status. It also contains references to the code that it is responsible for running.

struct G
{
    byte*   stackguard;  // stack guard information
    byte*   stackbase;   // base of stack
    byte*   stack0;      // current stack pointer
    byte*   entry;       // initial function
    void*   param;       // passed parameter on wakeup
    int16   status;      // status
    int32   goid;        // goroutine id
    M*      lockedm;     // used for locking M's and G's
    ...
};

The M struct is the Go runtime’s representation of an OS thread. It has pointers to fields such as the global queue of G’s, the G that it is currently running, its own cache, and a handle to the scheduler.

struct M
{
    G*       curg;            // current running goroutine
    int32    id;
    int32    locks;           // locks held by this M
    MCache*  mcache;          // cache for this thread
    G*       lockedg;         // used for locking M's and G's
    uintptr  createstack[32]; // stack that created this thread
    M*       nextwaitm;       // next M waiting for lock
    ...
};

the Sched struct

The Sched struct is a single, global struct[9] that keeps track of the different queues of G’s and M’s and some other information the scheduler needs in order to run, such as the global Sched lock. There are two queues containing G structs, one is the runnable queue where M’s can find work, and the other is a free list of G’s. There is only one queue pertaining to M’s that the scheduler maintains; the M’s in this queue are idle and waiting for work. In order to modify these queues, the global Sched lock must be held.

struct Sched {
    Lock;             // global sched lock. must be held to edit G or M queues
    G*     gfree;     // available g's (status == Gdead)
    G*     ghead;     // g's waiting to run queue
    G*     gtail;     // tail of g's waiting to run queue
    int32  gwait;     // number of g's waiting to run
    int32  gcount;    // number of g's that are alive
    int32  grunning;  // number of g's running on cpu or in syscall
    M*     mhead;     // m's waiting for work
    int32  mwait;     // number of m's waiting for work
    int32  mcount;    // number of m's that have been created
    ...
};

fundamentals

The runtime starts out with several G’s. One is in charge of garbage collection, another is in charge of scheduling, and one represents the user’s Go code. Initially, one M is created to kick off the runtime. As the program progresses, more G’s may be created by the user’s Go program, and more M’s may become necessary to run all the G’s. As this happens, the runtime may provision additional threads up to GOMAXPROCS. Hence at any given time, there are at most GOMAXPROCS active M’s.

Since M’s represent threads, an M is required to run a goroutine. An M without a currently associated G will pick up a G from the global runnable queue and run the Go code belonging to that G. If the Go code requires the M to block, for instance by invoking a system call, then another M will be woken up from the global queue of idle M’s. This is done to ensure that goroutines, still capable of running, are not blocked from running by the lack of an available M.
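
As a loose analogy (a toy model written in ordinary Go, not the runtime's actual implementation), the relationship between M's and the global runnable queue resembles a pool of workers pulling tasks off a shared queue:

package main

import (
    "fmt"
    "sync"
)

// Toy analogy only: the tasks stand in for G's, the workers stand in for
// M's, and the channel stands in for the global runnable queue.
func main() {
    runnable := make(chan int, 16) // "global runnable queue" of G's
    var wg sync.WaitGroup

    for m := 0; m < 4; m++ { // a small pool of "M's"
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            for g := range runnable { // pick up a runnable "G" and run it
                fmt.Printf("M%d ran G%d\n", id, g)
            }
        }(m)
    }

    for g := 0; g < 10; g++ { // the program keeps creating new "G's"
        runnable <- g
    }
    close(runnable)
    wg.Wait()
}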

System calls force the calling thread to trap to the kernel, causing it to block for the duration of the system call execution. If the code associated with a G makes a blocking system call, the M running it will be unable to run it or any other G until the system call returns. M’s do not exhibit the same blocking behavior for channel communication, even though goroutines block on channel communication. The operating system does not know about channel communication, and the intricacies of channels are handled purely by
the runtime. If a goroutine makes a channel call, it may need to block, but there is no reason that the M running that G should be forced to block as well. In a case such as this, the G’s status is set to waiting and the M that was previously running it continues running other G’s until the channel communication is complete. At that point the G’s status is set back to runnable and will be run as soon as there is an M capable of running it.

Some learning materials for systems-related fields

March 18, 2013
Programming practices

Original: 架构相关领域的一些学习材料 (Some learning materials for architecture-related fields), by Lin Shiding (林仕鼎)

For engineers, growth often hits a bottleneck at some stage. To break through it, you need to study your own technical field in greater depth and understand its essential problems, its methodologies and design philosophies, its history, and so on. Below are some learning materials for architecture-related fields, with brief comments attached, for interested engineers. I hope that by getting to know and studying these areas you can master more system design principles, apply them with ease in your own work, and reach a state of real mastery.

1. Operating Systems

Mach [Intro: http://www-2.cs.cmu.edu/afs/cs/project/mach/public/www/mach.html, Paper: http://www-2.cs.cmu.edu/afs/cs/project/mach/public/www/doc/publications.html]

In traditional kernel implementations, interrupt handling is done inside one "big function". It is called a big function because a single control flow runs from the interrupt's entry point to its exit, and when nested interrupts occur the implementation logic becomes very complex. Most OSes, such as UNIX, adopt this monolithic kernel architecture. The Mach project, started in 1985, proposed a completely new microkernel structure. Academia, which had felt there was nothing left to build on after UNIX development peaked in the 1970s, suddenly found a new source of excitement, and the noisy monolithic versus microkernel debate began. A side note: Mach's leader Richard Rashid, then a professor at CMU, was asked by Bill Gates to persuade Jim Gray to join Microsoft. He ended up being drawn in himself and founded Microsoft Research. He has come to China several times to give 21st Century Computing keynotes.

Exokernel [Intro: http://pdos.csail.mit.edu/exo/,Paper: http://pdos.csail.mit.edu/PDOS-papers.html#Exokernels]

Although the microkernel structure is elegant, it was not widely adopted in practice because its performance is poor, and people gradually realized that the problem with OSes is not implementation complexity but rather how to give applications more flexibility in using resources. That is why, after kernel extensions (for example, loadable modules in Linux) appeared, the debate over OS kernel architecture slowly faded out of view.

Exokernel, proposed by MIT, appeared against this background. It does not provide the traditional OS abstractions (process, virtual memory, and so on) but instead focuses on resource isolation and multiplexing. On top of the exokernel sits a library, the famous libOS, which implements the interfaces of various OSes. This structure gives applications maximum flexibility: different applications can focus on scheduling fairness or real-time responsiveness, or on improving resource utilization to optimize performance. Seen with today's eyes, an exokernel looks more like a virtual machine monitor.

Singularity [Intro: http://research.microsoft.com/os/Singularity/, Paper: http://www.research.microsoft.com/os/singularity/publications/HotOS2005_BroadNewResearch.pdf]

Singularity, proposed by Microsoft Research, appeared in the early 21st century, when viruses and spyware seemed inexhaustible and impossible to eradicate. Both academia and industry were discussing how to provide a trustworthy computing environment and how to make computer systems more manageable. Singularity argues that to solve these problems the underlying system must provide hard isolation, and that the hardware virtual memory mechanism everyone had relied on until then cannot offer both high flexibility and good performance. With the appearance of runtimes such as .NET and Java, a software-level solution became possible.

On top of a microkernel, Singularity uses .NET to build type-safe assemblies as the ABI and prescribes a message-passing mechanism for data exchange, which fundamentally prevents isolated data from being modified. Combined with safety checks on applications, this yields a controllable and manageable operating system. Thanks to the continued optimization of the .NET CLR and advances in hardware, the performance loss introduced by these checks is still acceptable relative to the good properties Singularity provides.

This design is still at the laboratory stage; whether it can win out in the end will require an opportunity like the one UNIX had back in its day.

2. Virtual Machines

VMWare ["Memory Resource Management in VMware ESX Server",OSDI’02, Best paper award]

Everyone knows VMware; nothing more needs to be said.

Xen [“Xen and the Art of Virtualization”, OSDI’04]

A VMM with excellent performance, from Cambridge.

Denali [“Scale and Performance in the Denali Isolation Kernel”, OSDI’02, UW]

An application-level virtual machine designed for Internet services; thousands of VMs can run on an ordinary machine. Its VMM is based on an isolation kernel, which provides isolation but does not require absolutely fair resource allocation, thereby reducing the performance cost.

Entropia [“The Entropia Virtual Machine for Desktop Grids”, VEE’05]

To harness the desktop machines inside a company for computation, the computing tasks have to be packaged carefully so that they neither disturb normal use of the machine nor touch the user's data. Entropia provides such a computing environment: an application-level virtual machine implemented on Windows. Its basic approach is to redirect the syscalls made by the computing task in order to guarantee isolation. Similar work includes FVM: “A Feather-weight Virtual Machine for Windows Applications”.

3. Design Revisited

“Are Virtual Machine Monitors Microkernels Done Right?”, HotOS’05

The title sounds baffling at first. It means that VMMs are in fact the right way to implement a microkernel. The paper discusses VMMs and microkernels in detail and is an excellent reference for understanding both concepts.

“Thirty Years Is Long Enough: Getting Beyond C”, HotOS’05

C may be the most successful programming language in the world, but its shortcomings are also obvious. For example, it does not support threads, which makes it feel inadequate on today's highly parallel hardware; that happens to be the strength of functional programming languages. How to combine the advantages of the two is a very promising area.

4. Programming Model

Why Threads Are a Bad Idea

A server built only on threads can hardly achieve truly high performance, because of memory usage, context-switch overhead, synchronization overhead, and the programming complexity of getting locking right.

“SEDA: An Architecture for Well-Conditioned, Scalable Internet Services”, OSDI’01

Threads are bad, but events cannot solve every problem either, so we look for a way to combine the two. SEDA splits an application into multiple stages connected by queues; within one stage, multiple threads can be started to process the events in its queue, and the number of threads can be adjusted automatically through feedback.

Software Transactional Memory

If memory offered transaction semantics, the world we face would be completely different: languages, compilers, OSes, and runtimes would all change fundamentally. Although Intel is now working on hardware transactional memory, it probably will not be commercialized in the foreseeable future, so people have turned to software solutions. As you can imagine, such a solution cannot be based on native assembly; there are currently implementations for languages such as C# and Haskell. There is plenty of material; see Wikipedia.

5. Distributed Algorithms

Logical clock, [“Time, clocks, and the ordering of events in a distributed system”, Leslie Lamport, 1978]

This is a classic paper on logical clocks, timestamps, and distributed synchronization.

Byzantine [“The Byzantine Generals Problem”, Leslie Lamport, 1982]

Failures in distributed systems take many forms: some components simply stop when they fail, some drag the others down, and, worse still, some behave maliciously after failing. This last kind of malicious behavior is like a general defecting during a campaign and can seriously damage the system. For this class of problems Lamport proposed the Byzantine failure model: for a state machine composed of 3f+1 replicas, as long as the number of traitorous replicas is at most f, the state machine as a whole keeps working correctly.

Paxos [“The part-time parliament”, Leslie Lamport, 1998]

How to reach consensus in an asynchronous distributed environment is the most fundamental problem in distributed algorithms, and Paxos is the pinnacle of this class of algorithms. The paper is far too hard, though; supposedly only 3.5 people in the world understand it, so Lamport later wrote a popularized version, “Paxos Made Simple”, which is still hard to follow. You can also look at Butler Lampson's “The ABCD’s of Paxos” (PODC’01); its description of replicated state machines will seriously sharpen your understanding of the nature of the parallel world. That is the caliber of a Turing Award winner.

One name keeps appearing above: Leslie Lamport, who kept digging away at distributed computing and became a grandmaster of the field. There are a few anecdotes about him. I remember his old MSR homepage saying something like, "When I was working on logical clocks, Bill Gates was still in diapers..." (roughly that; I cannot find the original now). Also, when writing papers he likes to take other famous people's names, permute them a little, and work them into the text, which may be why he has not yet received the Turing Award.

For more of Lamport's achievements, see the paper written in honor of his 60th birthday: “Lamport on mutual exclusion: 27 years of planting seeds”, PODC’01.

6. Overlay Networking, and P2P DHT

RON [“Resilient Overlay Networks”, SOSP’01]

RON describes how to build an overlay at the application layer so that wide-area network failures can be recovered from within seconds, whereas recovering connectivity through routing protocols takes at least tens of minutes. This fast recovery and flexibility are why overlay networking is now so widely used.

Application Level Multicast

“End System Multicast”, SigMetrics’00

“Scalable Application Layer Multicast”, SigComm’02

There are many ALM papers. Basically they all describe how to build a mesh network for robust delivery of control information, plus a multicast tree for efficient data delivery, and then add some layered delivery based on the characteristics of multimedia data. Systems that appeared a few years ago, such as CoolStreaming and PPLive, are commercial products of this kind.

P2P

The appearance of P2P changed the network. By structure, P2P networks can be divided into three kinds.

1. Napster style: a centralized directory service, with peer-to-peer data transfer.

2. Gnutella style: queries are spread by gossiping among neighbors; also known as unstructured P2P.

3. DHT: unlike unstructured P2P, a DHT query comes with a guarantee: if the data exists, it is returned within a bounded number of hops, usually log N, where N is the number of nodes in the system.

The classic DHTs are CAN, Chord, Pastry, and Tapestry. This research is mainly at the algorithm level; the systems work built on top of it is mostly about wide-area storage systems. Some people also work at the mechanism level, for example on how to give users incentives to share and how to prevent cheating.

7. Distributed Systems

GFS/MapReduce/BigTable/Chubby/Sawzall

Google's series of papers; everyone is fairly familiar with them, so nothing more needs to be said.

Storage

There are far too many papers on distributed storage systems. Listed below are a few of the most relevant ones.

“Chain Replication for Supporting High Throughput and Availability”, OSDI’04.

“Dynamo: Amazon’s Highly Available Key-value Store”, SOSP’07.

“BitVault: a Highly Reliable Distributed Data Retention Platform”, SIGOPS OSR’07.

“”, MSR-TR。

Distributed simulation

“Simulating Large-Scale P2P Systems with the WiDS Toolkit”, MASCOTS’05. The interesting thing about distributed simulation is that the simulated protocol is distributed and the simulation engine itself is also distributed. Logical and physical time and events are interleaved in the system and have to be handled carefully.

8. Controversial Computing Models

Today's software systems have become so complex that no human can fully grasp them. Many systems still ship with plenty of deterministic and non-deterministic bugs and can only be patched continually. Since we humans are simply not precise enough to fix every bug, we can only look, from another angle, for ways to keep a system working in this depressing environment. It is just like a distributed system: failures cannot be avoided, so we choose to let the system as a whole provide high reliability.

The following three are typical representatives. Broadly, the main research topics are: 1) how to save state correctly; 2) how to catch errors and recover state; 3) how to perform unit-level recovery without affecting the whole.

Recovery Oriented Computing

Failure oblivious computing, OSDI’04

Treating Bugs as Allergies, SOSP’05

9. Debugging

Systems are so complex that humans cannot analyze them directly by logic; we can only observe them at the macro level with data mining techniques.

Black box debugging [“Performance debugging for distributed systems of black boxes”, SOSP’03]

Performance debugging of large systems is very hard, because many of the problems are non-deterministic and cannot be reproduced. The only way is to mine the logs and find matching calls/messages in order to localize the problem.

CP-miner [“A Tool for Finding Copy-paste and Related Bugs in Operating System Code”, OSDI’04]

Many people copy-paste when they reuse code, but sometimes a simple copy-paste introduces serious bugs, for example reusing a local variable name. CP-Miner analyzes the code, builds syntax-tree structures, and then mines out this class of errors.

git in a nutshell

March 17, 2013
Programming practices

Some references

  1. kernel docs: https://www.kernel.org/pub/software/scm/git/docs
  2. git data format: http://git.rsbx.net/Documents/Git_Data_Formats.txt
  3. the git source code

Git's storage is a key-value database. The stored objects include blobs (files), trees (directories), commits, and so on.
An object's key is the SHA-1 of its value.

Git organizes the working tree as a hash tree, using SHA-1 as the hash. The leaf nodes of the hash tree are blobs, that is, files; the interior nodes are trees, that is, directories. A change to the content of any node eventually propagates all the way to the root: the node's SHA-1 changes, so its parent's SHA-1 changes as well (because a tree's entries record the SHA-1 of each child, and that recorded value changes when the child is modified); in fact the SHA-1 of every node on the path from the root down to the modified node changes.

Also, git does not modify the original node in place; it simply stores a new node whose key is the SHA-1 of the new content (the content of the whole object, not just the part that changed). The storage layer garbage-collects these nodes at an appropriate time (since many nodes share most of their content) and packs them into a pack file to save space.
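
For illustration, a blob's key is the SHA-1 of the header "blob <size>\0" followed by the file contents (trees and commits use the same scheme with their own headers). A small sketch of the computation:

package main

import (
    "crypto/sha1"
    "fmt"
)

// hashObject computes the key git assigns to a blob with the given content:
// SHA-1 over the header "blob <size>\x00" plus the content itself.
func hashObject(content []byte) string {
    h := sha1.New()
    fmt.Fprintf(h, "blob %d\x00", len(content))
    h.Write(content)
    return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
    // Should match: echo -n "hello" | git hash-object --stdin
    fmt.Println(hashObject([]byte("hello")))
}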

The commit process

A commit takes all the files modified so far (recorded when you run git add) and builds a new hash tree. Note that most nodes of this new tree are shared with the previous commit's tree. In the accompanying figure (not reproduced here), when file C is modified git creates new nodes E/F/G for every node on the path A/B/C. A commit only needs to record the current root node and its parent commit, which forms a chain of commits (git rev-list --all prints this list).

The push process

1. Talk to the server to find out which commit the server's current branch is at, compare it with the local current commit, and work out the list of commits that need to be pushed this time. Assume only one commit needs to be pushed, as in that figure (multiple commits are handled in a similar way).
Let commit1 be the current commit of the branch on the server.
Let commit2 be the current commit of the local branch.

2. Find all the nodes reachable from commit1 and mark them (this does not consume much memory, because the contents of leaf nodes, i.e. the file data, are never read into memory).
Then find all the nodes reachable from commit2 and discard the marked ones; the remaining nodes are what this push has to upload to the server. This step is also fast, because once an interior node (for example node H in the figure) is marked, none of the nodes below it need to be examined (a small sketch of this marking step follows after step 3).

3. Pack these nodes (key-value pairs) into a pack file, in the same way git builds a pack file during gc (where all key-value pairs take part).
The pack file is self-describing. The server unpacks it and then updates the commit information of the current branch.
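
A toy model of step 2 in Go (a fragment with hypothetical types, just to illustrate the marking idea):

// Toy model of computing the push set: mark everything reachable from the
// server's commit, then walk the local commit and keep only unmarked nodes.
// Reaching a marked interior node lets us skip its whole subtree.
type node struct {
    key      string  // SHA-1 of the object
    children []*node // empty for blobs
}

func mark(n *node, seen map[string]bool) {
    if n == nil || seen[n.key] {
        return
    }
    seen[n.key] = true
    for _, c := range n.children {
        mark(c, seen)
    }
}

// toUpload returns the objects reachable from local but not from remote.
func toUpload(remote, local *node) []*node {
    seen := map[string]bool{}
    mark(remote, seen) // mark everything commit1 already has
    var out []*node
    var walk func(*node)
    walk = func(n *node) {
        if n == nil || seen[n.key] {
            return // marked subtree: nothing below needs to be checked
        }
        seen[n.key] = true
        out = append(out, n)
        for _, c := range n.children {
            walk(c)
        }
    }
    walk(local) // collect commit2's nodes that the server does not have
    return out
}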

Linux Performance Analysis and Tools

March 15, 2013
Performance Analysis and Tools

Original: Linux performance analysis and tools

 

A spring outing to Century Park

March 10, 2013
Life

Spring has arrived and the weather is nice, so I got a few friends together and we cycled to Century Park for a spring outing.

Linux: The Journaling Block Device

February 17, 2013
Kernel

Kedar Sovani on kerneltrap

Atomicity is a property of an operation either to succeed or fail completely. Disks assure atomicity at the sector level. This means that a write to a sector either goes through completely or not at all. But when an operation spans over multiple sectors of the disk, a higher-level mechanism is needed. This mechanism should ensure that modifications to the entire set of sectors are handled atomically. Failure to do so leads to inconsistencies. This document talks about the implementation of the Journaling Block Device in Linux.

Let's look at how these inconsistencies could be introduced to a filesystem. Say we have an application that creates a file. The
filesystem internally has to decrease the number of free inodes by one, initialize the inode on the disk and add an entry to the
parent directory for the newly created file. But what happens if the machine crashes after only the first operation is executed? In this circumstance, an inconsistency has been introduced in the filesystem. The number of free inodes has decreased, but no initialisation of the inode has been performed on the disk.

The only way to detect these inconsistencies is by scanning the entire filesystem. This task is called fsck, filesystem consistency check. In large installations, the consistency check requires a significant amount of time (many hours) to check and fix inconsistencies. As you might have guessed, such downtime is not desirable. A better approach to solve this problem is to avoid introducing inconsistencies in the first place, and this could be accomplished by providing atomicity to operations. Journaling is such a way to provide atomicity to operations.

Simply stated, using journaling is like using a scratch pad. You perform operations on the scratch pad, and once you are satisfied that the operations are correct, you reflect them in a fairer copy.

In the case of filesystems, all the metadata and data are stored on the block device for the filesystem. Journaling filesystems use a journal or the log area as the scratch pad. A journal may be a part of the same block device or it may be a separate device in itself. A journaling filesystem first records all the operations it has performed in the journal. Once the set of operations that is part of one single atomic operation has completed and been recorded in the journal, only then is it written to the actual block device. Henceforth, the term disk is used to indicate the actual block device, whereas the term journal is used for the log area.

Journal Recovery Scenarios

The example operation from above requires that three blocks be modified—the inode count block, the block containing the on-disk inode and the block holding the directory where the entry is to be added. All of these blocks first are written to the journal. After that, a special block, called the commit record, is written to the journal. The commit record is used to indicate that all the blocks belonging to a single atomic operation are written to the journal.

Given journaling behavior, then, here is how a journaling filesystem reacts in the following three basic scenarios:

  • The machine crashes after only the first block is flushed to the journal. In this case, when the machine comes back up again and checks the journal, it finds an operation with no commit record at the end. This indicates that it may not be a completed operation. Hence, no modifications are done to the disk, preserving the consistency.
  • The machine crashes after the commit record is flushed to the journal. In this case, when the machine comes back up again and checks the journal, it finds an operation with the commit record at the end. The commit record indicates that this is a completed operation and could be written to the disk. All the blocks belonging to this operation are written at their actual locations on the disk, replaying the journal.
  • The machine crashes after all the three blocks are flushed to the journal but the commit record is not yet flushed to the journal. Even in this case, because of the absence of the commit record, no modifications are done to the disk. The scenario thus is reduced to the scenario described in the first case.

Likewise, any other crash scenario could be reduced to any of the scenarios listed above.

Thus, journaling guarantees consistency for the filesystem. The time required for looking up the journal and replaying the journal is minimal as compared to that taken by the filesystem consistency check.
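
To make the recovery rule concrete, here is a toy model (ordinary Go, not the kernel's JBD code) of deciding which journaled blocks may be replayed: only transactions whose commit record made it into the journal are written back to their real locations.

package main

import "fmt"

// Toy model of a journal: a sequence of records, where a transaction is a
// run of block writes terminated by a commit record.
type record struct {
    commit bool             // true if this is a commit record
    blocks map[int64][]byte // journaled blocks: disk location -> data
}

// replay applies only fully committed transactions to the "disk".
func replay(journal []record, disk map[int64][]byte) {
    var pending []record
    for _, r := range journal {
        if r.commit {
            for _, p := range pending { // commit record seen: safe to apply
                for loc, data := range p.blocks {
                    disk[loc] = data
                }
            }
            pending = pending[:0]
        } else {
            pending = append(pending, r) // not committed yet: hold back
        }
    }
    // anything left in pending had no commit record and is simply ignored
}

func main() {
    disk := map[int64][]byte{}
    journal := []record{
        {blocks: map[int64][]byte{10: []byte("inode count")}},
        {blocks: map[int64][]byte{11: []byte("on-disk inode"), 12: []byte("dir entry")}},
        {commit: true},
        {blocks: map[int64][]byte{13: []byte("half-written update")}}, // crash before commit
    }
    replay(journal, disk)
    fmt.Println(len(disk), "blocks replayed") // 3: the uncommitted block is dropped
}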

Journaling Block Device

The Linux Journaling Block Device (JBD) provides this scratch pad for providing atomicity in operations. Thus, a filesystem controlling a block device can make use of JBD on the same or on another block device in order to maintain consistency. The JBD is a modular implementation that exposes a set of APIs for the use of such applications. The
following sections describe the concepts and implementation of the Linux JBD as is present in the Linux 2.6 kernel.

Before we move on to the implementation details of the JBD, an understanding of some of the objects that JBD uses is required. A journal is a log that internally manages updates for a single block device. As mentioned above, the updates first are stored in the journal and then are reflected to their real locations on the disk. The area belonging to the journal is managed like a circular-linked list. That is, the journal reuses its area when the journal is full.

A handle represents a single atomic update. The entire set of changes/writes that should be performed atomically are carried out with reference to a single handle.

It may not be an efficient approach to flush each atomic update (handle) to the journal, however. To achieve better performance, the JBD bunches a set of handles together into a transaction and flushes this transaction to the journal. The JBD ensures that the transaction is atomic in nature. Hence, the handles, which are the subcomponents of the transaction, also are guaranteed to be atomic.

The most important property of a transaction is its state. When a transaction is being committed, it follows the lifecycle of states listed below.

  1. Running: the transaction currently is live and can accept new handles. In a system only one transaction can be in the running state.
  2. Locked: the transaction does not accept any new handles but existing handles are not complete. Once all the existing handles are completed, the transaction goes to the next state.
  3. Flush: all the handles in a transaction are complete. The transaction is writing itself to the journal.
  4. Commit: the entire transaction log has been written to the journal. The transaction is writing a commit block indicating that the transaction log in the journal is complete.
  5. Finished: the transaction is written completely to the journal. It has to remain there until the blocks are updated to the actual locations on the disk.

Transaction Committing and CheckPointing

A running transaction is written to the journal area after a certain period. Thus, a transaction can be either in-memory (running) or on-disk. Flushing a transaction to the journal and marking that particular transaction as finished is a process called transaction commit.

The journal has a limited area under its control, and it needs to reuse this area. As for committed transactions, those having all their blocks written to the disk, they no longer need to be kept in the journal. Checkpointing, then, is the process of flushing the finished transactions to the disk and reclaiming the corresponding space in the journal. It is discussed in more detail later in this article.

Implementation Briefs

The JBD layer performs journaling of the metadata, during which the data simply is written to the disk without being journaled. But this does not stop applications from journaling the data, as it could be presented to the JBD as metadata itself. This document takes the linux kernel version 2.6.0 as a reference.


Commit

[journal_commit_transaction(journal object)]

A Kjournald thread is associated with every journaled device. The Kjournald thread ensures that the running transaction is committed after a specific interval. The transaction commit code is divided into eight different phases, described below. Figure 1 shows a logical layout of a journal.

  1. The transaction is moved from the running state (T_RUNNING) to the locked state (T_LOCKED), meaning it no longer can issue new handles. The transaction waits until all the existing handles have completed. A transaction always has a set of buffers reserved for it when it is initiated; some of these buffers may be unused and are unfiled in this phase. The transaction now is ready to be committed with no outstanding handles.
  2. The transaction enters the flush state (T_FLUSH) and is marked as the currently committing transaction for the journal. This phase also marks that no running transaction exists for the journal; therefore, new requests for handles initiate a new transaction.
  3. The actual buffers of the transaction are flushed to the disk. Data buffers go first. There are no complications here, as data buffers are not saved in the log area; instead, they are flushed directly to their actual positions on the disk. This phase ends when the I/O completion notifications for all such buffers are received.
  4. At this point all the data buffers are written to the disk but their metadata still is in volatile memory. Metadata flushing is not as straightforward as data buffer flushing, because metadata needs to be written to the log area and the actual positions on the disk need to be remembered. This phase starts with flushing these metadata buffers, for which a journal descriptor block is acquired. The journal descriptor block stores the mapping of each metadata buffer in the journal to its actual location on the disk in the form of tags. After this, metadata buffers are flushed to the journal. Once the journal descriptor is full of tags or all metadata buffers are flushed to the journal, the journal descriptor also is flushed to the journal. Now we have all the metadata buffers in the journal, and their actual positions on the disk are remembered. This data, being persistent, can be used for recovery if failure occurs.
  5. The commit code waits on the I/O completion notifications of the metadata buffers and the journal descriptor blocks, respectively. The buffers are unfiled from the in-memory lists once I/O completion is received.
  6. All the data and metadata are now on safe storage, the data at its actual locations and the metadata in the journal. The transaction now needs to be marked as committed so that it can be known that all the updates are safe in the journal. For this reason, a journal descriptor block again is allocated. A tag is written stating that the transaction has committed successfully, and the block is synchronously written to its position in the journal. After this, the transaction is moved to the committed state, T_COMMIT.
  7. This phase applies when a number of transactions are present in the journal without yet having been flushed to the disk. Some of the metadata buffers in this transaction already may be a part of some previous transaction. These need not be kept in the older transactions, as we have their latest copy in the currently committed transaction. Such buffers are removed from the older transactions.
  8. The transaction is marked as being in the finished state, T_FINISHED. The journal structure is updated to reflect this particular transaction as the latest committed transaction. It also is added to the list of transactions to be checkpointed.

Checkpointing

Checkpointing is initiated when the journal is being flushed to the disk (think of unmount) or when a new handle is started. A new handle can fall short of the guaranteed number of buffers, so it may be necessary to carry out a checkpointing process in order to free some space in the journal.

The checkpointing process flushes the metadata buffers of a transaction not yet written to its actual location on the disk. The transaction then is removed from the journal. The journal can have multiple checkpointing transactions, and each checkpointing transaction can have multiple buffers. The process considers each committing transaction, and for each transaction, it finds the metadata buffers that need to be flushed to the disk. All these buffers are flushed in one batch. Once all the transactions are checkpointed, their log is removed from the journal.


Recovery

[journal_recover(journal object)]

When the system comes up after a crash and it can see that the log entries are not null, it indicates that the last unmount was not successful or never occurred. At this point, you need to attempt a recovery. Figure 2 depicts a sample physical layout of journal. The recovery takes place in three phases.

  1. PASS_SCAN: the end of the log is found.
  2. PASS_REVOKE: a list of revoked blocks is prepared from the log.
  3. PASS_REPLAY: unrevoked blocks are rewritten (replayed) in order to guarantee the consistency of the disk.

For recovery, the available information is provided in terms of the journal. But the exact state of the journal is unknown, as we do not know the point at which the system crashed. Hence, the last transaction could be in the checkpointing or committing state. A running transaction cannot be found, as it was only in the memory.

For committing transactions, we have to forget the updates made, as all of the updates may not be in place. So in the PASS_SCAN phase, the last log entry in the log is found. From here, the recovery process knows which transactions need to be replayed.

Every transaction can have a set of revoked blocks. This is important to know in order to prevent older journal records from being replayed on top of newer data using the same block. In PASS_REVOKE, a hash table of all these revoked blocks is prepared. This table is used every time we need to find out whether a particular block should get written to a disk through a replay.

In the last phase, all the blocks that need to be replayed are considered. Each block is tested for its presence in the revoked blocks' hash table. If the block is not in there, it is safe to write the block to its actual location on the disk. If the block is there, only the newest version of the block is written to the disk. Notice that we have not changed anything in the on-disk journal. Hence, even if the system crashes again while the recovery is in progress, no harm is done.
The same journal is present for the recovery next time, and no non-idempotent operation is performed during the process of recovery.

Amey Inamdar (www.geocities.com/amey_inamdar) is a kernel developer working at Kernel Corporation. His interest areas include filesystems and distributed systems.

Kedar Sovani (www.geocities.com/kedarsovani) works for Kernel Corporation as a kernel developer. His areas of interest include filesystems and storage technologies.

Copyright (c) 2004-2006 Kedar Sovani and Amey Inamdar