Article Archive

ZooKeeper: Principles and Implementation

Some of the more important ZooKeeper papers:

  1. ZooKeeper’s atomic broadcast protocol: Theory and practice
  2. Zab: High-performance broadcast for primary-backup systems
  3. ZooKeeper: Wait-free coordination for Internet-scale systems
  4. Official documentation: http://zookeeper.apache.org/doc/trunk/

»» Continue reading

Design and Deployment Practices for Large-Scale Distributed Systems

Paper: http://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf

This paper approaches the design and deployment of large-scale distributed systems from an operations-friendly perspective, and distills a set of related principles and best practices.

Overall Design Principles

We have long believed that 80% of operations issues originate in design and development, so this section on overall service design is the largest and most important.

When systems fail, there is a natural tendency to look first to operations since that is where the problem actually took place. Most operations issues, however, either have their genesis in design and development or are best solved there.

The operations-friendly principles that most influence overall service design are:

  1. Keep things simple and robust
  2. Design for failure

Some more concrete best practices for designing operations-friendly services are:

    »» Continue reading

proxyio Performance Test Tools

The test tools live in the perf/ directory of the source tree; see: https://github.com/proxyio/xio/tree/master/perf

There are two: a throughput test and a latency test:

  • Throughput test: thr_sender + thr_recver
  • Latency test: lat_sender + lat_recver

Throughput Test

thr_sender sends messages continuously and thr_recver receives them, recording the message size and the total elapsed time; the final result looks like this:

message size: 100000000 [B] throughput: 90497 [msg/s] throughput: 707.013 [Mb/s]
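As a rough illustration (a sketch in plain C, not the actual thr_recver source; the function name and the exact accounting are assumptions), the [msg/s] and [Mb/s] figures could be derived from the message count, the payload size and the elapsed wall-clock time like this:

#include <inttypes.h>
#include <stdio.h>

/* Sketch only: derive the [msg/s] and [Mb/s] figures from the message
 * count, the per-message payload size and the elapsed time around the
 * receive loop. The real proxyio perf tools may account differently. */
static void report_throughput(int64_t msg_count, int64_t msg_size, double elapsed_sec)
{
    double msgs  = msg_count / elapsed_sec;                              /* [msg/s] */
    double mbits = (double)msg_count * msg_size * 8 / elapsed_sec / 1e6; /* [Mb/s]  */

    printf("message size: %" PRId64 " [B] throughput: %.0f [msg/s] throughput: %.3f [Mb/s]\n",
           msg_count * msg_size, msgs, mbits);
}

int main(void)
{
    /* Example numbers only: 100000 messages of 1000 B received in 1.105 s. */
    report_throughput(100000, 1000, 1.105);
    return 0;
}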

Usage of thr_sender:

usage: thr_sender <connect-to> <msg-size> <msg-count>

Where:

  1. connect-to is the address of the message receiver
  2. msg-size is the size of each message sent (the user payload only, excluding the protocol header)
  3. msg-count is the number of messages

Usage of thr_recver:

usage: thr_recver <bind-to> <msg-count>

Where:

  1. bind-to is the socket address to listen on
  2. msg-count is the number of messages

 

Latency Test

In ping-pong mode, the round-trip time of each message is measured. The procedure is: record a timestamp, send a message, receive the response, record a second timestamp, and compute the latency as rtt / 2.
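A minimal, self-contained sketch of that timing loop (an assumption-laden illustration: it round-trips the buffer through a local pipe instead of proxyio's real transport, purely to show the bookkeeping; the figures are placeholders):

/* Ping-pong timing sketch: timestamp, send, receive the echo,
 * timestamp again, latency = rtt / 2 averaged over all round trips. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    char buf[1000];               /* <msg-size> bytes of payload */
    int roundtrips = 100000;      /* <roundtrips> argument       */
    int fds[2];
    struct timespec t0, t1;

    memset(buf, 'x', sizeof(buf));
    if (pipe(fds) < 0)
        return 1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < roundtrips; i++) {
        if (write(fds[1], buf, sizeof(buf)) < 0)   /* "send" the request */
            return 1;
        if (read(fds[0], buf, sizeof(buf)) < 0)    /* "receive" the echo */
            return 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double rtt_us = ((t1.tv_sec - t0.tv_sec) * 1e6
                   + (t1.tv_nsec - t0.tv_nsec) / 1e3) / roundtrips;
    printf("message size: %zu [B] roundtrip count: %d average latency: %.3f [us]\n",
           sizeof(buf), roundtrips, rtt_us / 2.0);
    return 0;
}

Presumably lat_sender does the same averaging over all round trips before printing its result.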

Usage of lat_sender:

usage: lat_sender <connect-to> <msg-size> <roundtrips>

Where:

  1. connect-to is the address of the message receiver
  2. msg-size is the size of each message sent (the user payload only, excluding the protocol header)
  3. roundtrips is the number of round trips

Usage of lat_recver:

usage: lat_recver <bind-to> <msg-count>

Where:

  1. bind-to is the socket address to listen on
  2. msg-count is the number of messages

The output looks like this:

message size: 1000 [B] roundtrip count: 100000 average latency: 67.626 [us]

High-Performance Network Programming

Event-driven architecture, state machines et al.

http://250bpm.com/blog:25

In my previous blog post I've described the problems with callback-based architectures and hinted that the solution may be replacing the callbacks by events and state machines. In this post I would like to discuss the proposed solution in more detail. Specifically, I am going to define what the events and state machines actually are and explain why they are useful.

While the article may be used as an intro to nanomsg's internal architecture, it can also be thought of as an opinion piece of possible interest to anybody dealing with event-driven architectures and/or

»» Continue reading

Ceph: A Scalable, High-Performance Distributed File System

  1. system architecture
  2. system overview
    1. the ceph file system has three main components:
      • client: each instance of which exposes a near-POSIX file system interface to a host or process
      • osd cluster: stores all data and metadata
      • mds cluster: manages the namespace (file names and directories) while coordinating security, consistency and coherence.
    2. primary goals
      • scalability: to hundreds of petabytes and beyond, considered in a variety of dimensions, including the overall storage capacity and throughput of the system.
      • performance: our target workload may include such extreme cases as tens or hundreds of thousands

        »» Continue reading

The Design of a High-Performance File Server

An older paper on the Bullet file server; original: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.5481&rep=rep1&type=pdf

Key points:

  1. to do away with the traditional file server's block model. In fact, we have chosen the extreme, which is to maintain files contiguously throughout the system. That is, files are contiguously stored on disk, contiguously cached in RAM, and kept contiguously in processes’ memories. This dictates the choice for whole file transfer.
  2. Another design choice, which is closely linked to keeping files contiguous, is to make all files immutable. That is, the only operations on files are creation, retrieval, and deletion; there are no update-in-place operations. Instead, if we want to update

    »» Continue reading

Haystack

Original paper: Finding a needle in Haystack: Facebook's photo storage

I. Introduction

  1. 260 billion photos, about 20 PB of data
  2. Roughly 1 billion new photos (~60 TB) every week
  3. A retrieval rate of about 1 million images per second
  4. Object storage: photos are write-once, never modified, and rarely deleted
  5. Traditional POSIX file systems become the bottleneck

The paper attributes the bottleneck to the way traditional POSIX file systems manage files through directories and per-file metadata; fetching a single photo typically involves several I/O operations:

  1. Translate the file name to an inode number
  2. Read the inode from disk
  3. Read the file contents.

For somewhat more complex layouts, step 1 may involve even more operations. Haystack's design goals:

  1. High throughput and low latency
  2. Partition tolerance
  3. Low cost, controlled along two dimensions: cost per terabyte of usable storage and read rate per terabyte of usable storage (where "usable" accounts for the underlying file system, RAID, redundancy, and other factors)
  4. Simplicity.

II. Background

  1. Typical design: storage cluster + CDN
  2. NFS file system

 

A CDN is normally used to cache hot content, but a site like Facebook also generates a huge amount of less-popular content, and it is precisely this long-tail content that drives up the cache-miss (back-to-origin) load on the CDN; Facebook cites the long-tail effect to explain this. In the NFS volumes, Facebook initially stored several thousand files per directory, but because of how NAS devices manage directory metadata, keeping thousands of files in one directory turned out to be very painful: the directory's block map simply became too large. In that configuration, reading one photo could take as many as 10 disk I/Os; even after reducing the directory size to around 100 files, it still took about 3 I/Os to read a photo.

To cut the number of disk operations further, Facebook also cached the file handles returned by the NAS devices on the Photo Store servers, implemented through an added kernel API (open_by_filehandle). The lesson Facebook drew from the NAS experience is that reducing disk I/O through caching only goes so far.

One might suggest caching all file handles in memcache. Facebook concedes this could be a solution, but it would only address part of the problem, because it would also require the NAS devices to cache all inode information, which amounts to nothing more than an expensive traditional storage system.
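The open_by_filehandle call mentioned above was a custom kernel addition; as a hedged sketch of the same idea, mainline Linux today exposes the equivalent syscalls name_to_handle_at and open_by_handle_at, which resolve a path to an opaque handle once and later reopen the file from the cached handle without repeating the name lookup:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <path-on-current-mount>\n", argv[0]);
        return 1;
    }

    /* Resolve the path to an opaque file handle once (the expensive
     * directory/inode walk happens here) ... */
    struct file_handle *fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
    int mount_id;
    fh->handle_bytes = MAX_HANDLE_SZ;
    if (name_to_handle_at(AT_FDCWD, argv[1], fh, &mount_id, 0) < 0) {
        perror("name_to_handle_at");
        return 1;
    }

    /* ... cache fh, then later reopen the file directly from the handle,
     * skipping the name lookup. Needs CAP_DAC_READ_SEARCH and an fd on
     * the same mount (here: the current directory). */
    int mount_fd = open(".", O_RDONLY | O_DIRECTORY);
    int fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
    if (fd < 0) {
        perror("open_by_handle_at");
        return 1;
    }
    close(fd);
    free(fh);
    return 0;
}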

III. Design and Implementation

The approach: the CDN handles the hot data, and Haystack handles the long-tail effect. Haystack's main purpose is to eliminate the high disk I/O of the NFS-based approach. Its method is to shrink the memory footprint of file system metadata so that, as far as possible, all metadata can be kept in memory; for example, it stores many small files inside one large file.

1) Three core components: Haystack Directory, Haystack Cache, Haystack Store

 

The Store consists of physical volumes, and capacity is controlled by the number of physical volumes; for example, 100 physical volumes provide 10 TB of storage. Physical volumes are further grouped into logical volumes; all physical volumes within a logical volume hold the same data, providing redundancy. The Directory holds application-level metadata, such as which logical volume each photo lives on, the mapping from logical volumes to physical volumes, and the free space left on each logical volume. The Cache acts as an internal CDN.
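To make the "all metadata in memory" idea concrete, here is a rough sketch of the kind of per-volume in-memory index the Store could keep; the struct and field names are invented for illustration and are not Haystack's actual layout:

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Many small photos ("needles") live inside one large volume file;
 * the in-memory index maps a photo key to its offset and size there. */
struct needle_index {
    uint64_t key;        /* photo id                               */
    uint32_t alt_key;    /* size variant, e.g. thumbnail           */
    uint64_t offset;     /* byte offset of the needle in the file  */
    uint32_t size;       /* payload size in bytes                  */
    uint8_t  deleted;    /* tombstone; space is reclaimed later    */
};

/* Reading a photo: one lookup in memory, then one pread() on the big
 * volume file -- no per-photo inode or directory lookups on disk. */
ssize_t read_photo(int volume_fd, const struct needle_index *idx, void *buf)
{
    if (idx->deleted)
        return -1;
    return pread(volume_fd, buf, idx->size, (off_t)idx->offset);
}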

2) URL format

»» Continue reading

Building Reliable Distributed Systems in the Presence of Software Errors

Making reliable distributed systems in the presence of software errors

1 Problem Domain

  • Concurrency
  • Soft real-time
  • Distributed
  • Hardware interaction
  • Large software systems
  • Complex functionality
  • Continuous operation
  • Quality requirements
  • Fault tolerance

2 Philosophy

Fault tolerance and fault isolation, for example through processes and message-based interaction.

3 System Requirements

  • Concurrency
  • Error encapsulation: an error in one process must not be able to damage other processes in the system
  • Fault detection, covering both local and network failures
  • Fault identification
  • Code upgrade
  • Stable storage, so that a crashed system can be recovered

4 Language Requirements

  • Encapsulation primitives
  • Concurrency
  • Error detection primitives
  • Location transparency
  • Dynamic code upgrade

5 Library Requirements

  • Stable storage
  • Device drivers
  • Code upgrade
  • Operating infrastructure

SILT: Small Index Large Table

SILT's multi-store design

The most common approach to building a high-performance key-value storage system uses two components: a LogStore and a SortedStore. In SILT's implementation, however, a HashStore is added to solve the high write amplification that occurs when the LogStore is merged into the SortedStore in the traditional way. How the SortedStore is managed isn't key to performance, I think; it has only a small influence.
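As a rough sketch of what that multi-store design implies for reads (the structs and the store_get helper below are assumptions, not SILT's actual code), a GET consults the stores from newest to oldest: the LogStore first, then any HashStores awaiting merge, and finally the SortedStore:

#include <stdbool.h>
#include <stddef.h>

struct store;                              /* a LogStore, HashStore or SortedStore */
bool store_get(struct store *s, const unsigned char key[20],
               void *value, size_t *value_len);   /* assumed per-store lookup helper */

struct silt {
    struct store  *log_store;      /* write-friendly: in-memory index + append-only log */
    struct store **hash_stores;    /* immutable stores waiting to be merged              */
    size_t         nhash;
    struct store  *sorted_store;   /* fully sorted, highly compressed                    */
};

/* Keys are fixed-length 160-bit hashes; a GET checks newest data first. */
bool silt_get(struct silt *db, const unsigned char key[20],
              void *value, size_t *value_len)
{
    if (store_get(db->log_store, key, value, value_len))
        return true;
    for (size_t i = 0; i < db->nhash; i++)
        if (store_get(db->hash_stores[i], key, value, value_len))
            return true;
    return store_get(db->sorted_store, key, value, value_len);
}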

In fact, SILT isn't a generic key-value store, because its keys are uniformly distributed 160-bit hash values (e.g., keys pre-hashed with SHA-1); but using fixed-length keys the way SILT does is very convenient for compressing

»» Continue reading

Page 1 of 2