ceph: a scalable, high-performance distributed file system
- system architecture
- system overview
- the ceph file system has three main components:
- client: each instance of which exposes a near-POSIX file system interface to a host or process
- osd cluster: stores all data and metadata
- mds cluster: manages the namespace (file names and directories) while coordinating security, consistency, and coherence.
- primary goals
- scalability: to hundreds of petabytes and beyond, considered along several dimensions, including the overall storage capacity and throughput of the system.
- performance: our target workload may include such extreme cases as tens or hundreds of thousands of hosts concurrently reading from or writing to the same file, or creating files in the same directory.
three fundamental design features
- decoupled data and metadata: metadata operations (open, rename, etc.) are collectively managed by a metadata server cluster, while clients interact directly with OSDs to perform file i/o (reads and writes). ceph replaces long per-file block lists with shorter object lists and delegates low-level block allocation decisions to individual devices, while a special-purpose data distribution function called CRUSH assigns objects to storage devices. this allows any party to calculate (rather than look up) the name and location of the objects comprising a file's contents, eliminating the need to maintain and distribute object lists, simplifying the design of the system, and reducing the metadata cluster workload.
- dynamic distributed metadata management: ceph uses a novel metadata cluster architecture based on Dynamic Subtree Partitioning that adaptively and intelligently distributes responsibility for managing the file system directory hierarchy among tens or even hundreds of MDSs. a hierarchical partition preserves locality in each MDS's workload, facilitating efficient updates and aggressive prefetching to improve performance for common workloads.
- reliable autonomic distributed object storage: large systems are inherently dynamic: they are built incrementally, they grow and contract as new storage is deployed and old devices are decommissioned, device failures are frequent and expected, and large volumes of data are created, moved, and deleted. ceph delegates responsibility for data migration, replication, failure detection, and failure recovery to the cluster of OSDs that store the data, while at a high level the OSDs collectively provide a single logical object store to clients and metadata servers.
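the "calculate rather than look up" idea can be sketched in a few lines of Python. this is not CRUSH: it uses rendezvous (highest-random-weight) hashing as a simple stand-in, and the names `_hash`, `locate`, and the `osd.N` labels are hypothetical. it only shows that any party holding the same device list can derive the same placement without consulting a table.

```python
import hashlib


def _hash(*parts: object) -> int:
    """Stable 64-bit hash of the given parts (hypothetical helper)."""
    data = "/".join(str(p) for p in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def locate(oid: str, osds: list[str], replicas: int = 2) -> list[str]:
    """Pick `replicas` devices for an object by calculation alone.

    Stand-in for CRUSH: rendezvous hashing ranks every device by
    hash(object, device) and keeps the highest-ranked ones.
    """
    ranked = sorted(osds, key=lambda osd: _hash(oid, osd), reverse=True)
    return ranked[:replicas]


if __name__ == "__main__":
    osds = [f"osd.{i}" for i in range(8)]  # assumed device names
    print(locate("1234.00000003", osds))
```

every client and MDS that runs `locate` with the same inputs gets the same answer, so no object list has to be stored or distributed. real CRUSH additionally accounts for device weights and failure domains, which this sketch ignores.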
dynamically distributed metadata. because object names are constructed from the inode number and objects are distributed to OSDs using CRUSH, file and directory metadata in ceph is very small, consisting almost entirely of directory entries (file names) and inodes (80 bytes).
- file i/o and capabilities: when a client opens a file, and the file exists, the MDS returns the inode number, the file size, and information about the striping strategy used to map file data into objects. object names combine the file id and the stripe number: oid = (fid, stripe number).
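a minimal sketch of how a client could turn a byte range into object names once it knows the inode number and striping information. it assumes the simplest possible strategy (fixed-size objects filled one after another); the `Extent` type, `map_range`, and the 4 MiB default are illustrative assumptions, not values taken from the notes.

```python
from typing import NamedTuple


class Extent(NamedTuple):
    oid: tuple[int, int]  # (inode number, stripe number)
    obj_offset: int       # starting byte inside the object
    length: int           # number of bytes in this object


def map_range(ino: int, offset: int, length: int,
              object_size: int = 4 * 1024 * 1024) -> list[Extent]:
    """Split a file byte range into per-object extents (assumed layout)."""
    extents = []
    end = offset + length
    while offset < end:
        stripe_no = offset // object_size      # which object holds this byte
        obj_offset = offset % object_size      # position inside that object
        n = min(object_size - obj_offset, end - offset)
        extents.append(Extent((ino, stripe_no), obj_offset, n))
        offset += n
    return extents


if __name__ == "__main__":
    MiB = 1024 * 1024
    for extent in map_range(ino=7, offset=6 * MiB, length=4 * MiB):
        print(extent)
```

with these assumptions, a 4 MiB read starting at offset 6 MiB touches objects (7, 1) and (7, 2), and the client never has to ask the MDS for a block or object list.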
- client synchronization: ceph implements a subset of the high-performance computing extensions proposed for the POSIX i/o interface (see also "Dynamic metadata management for petabyte-scale file systems"), including:
- lazyio_propagate: flush a given byte range to the object store.
- lazyio_synchronize: ensure that the effects of previous propagations are reflected in any subsequent reads.
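the two calls above can be illustrated with a toy model. this is not the ceph client API: `ObjectStore`, `LazyClient`, and their byte-level buffering are hypothetical and exist only to show the semantics, namely that writes stay local until propagated and that other clients may see stale data until they synchronize.

```python
class ObjectStore:
    """Shared backing store (toy stand-in for the OSD cluster)."""

    def __init__(self, size: int) -> None:
        self.data = bytearray(size)


class LazyClient:
    """Client that buffers writes and caches reads locally (toy model)."""

    def __init__(self, store: ObjectStore) -> None:
        self.store = store
        self.dirty: dict[int, int] = {}  # offset -> byte not yet flushed
        self.cache: dict[int, int] = {}  # offset -> byte cached from a read

    def write(self, offset: int, payload: bytes) -> None:
        for i, byte in enumerate(payload):
            self.dirty[offset + i] = byte

    def read(self, offset: int, length: int) -> bytes:
        out = bytearray()
        for pos in range(offset, offset + length):
            if pos in self.dirty:
                out.append(self.dirty[pos])
            else:
                if pos not in self.cache:
                    self.cache[pos] = self.store.data[pos]
                out.append(self.cache[pos])
        return bytes(out)

    def lazyio_propagate(self, offset: int, length: int) -> None:
        """Flush buffered bytes in the given range to the object store."""
        for pos in range(offset, offset + length):
            if pos in self.dirty:
                value = self.dirty.pop(pos)
                self.store.data[pos] = value
                self.cache[pos] = value

    def lazyio_synchronize(self) -> None:
        """Drop cached data so later reads reflect earlier propagations."""
        self.cache.clear()


if __name__ == "__main__":
    store = ObjectStore(16)
    writer, reader = LazyClient(store), LazyClient(store)
    reader.read(0, 4)                # reader caches the old contents
    writer.write(0, b"ceph")
    writer.lazyio_propagate(0, 4)    # now visible in the store
    reader.lazyio_synchronize()      # reader discards its stale cache
    print(reader.read(0, 4))
```

in this model the application, not the file system, decides when consistency matters: it calls propagate after writing and synchronize before reading.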
- namespace operations: managed by the metadata server cluster. both read operations (e.g., readdir, stat) and updates (e.g., unlink, chmod) are synchronously applied by the MDS to ensure serialization, consistency, correct security, and safety.
- ceph returns lstat results with directory entries.
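- caching metadata longer: consistency could be relaxed further by caching metadata longer, but ceph keeps stat results correct. if a file is opened by multiple writers, then in order to return correct file information (such as size and modification time) the MDS revokes any write capabilities to momentarily stop updates, collects the current values from all writers, and returns the highest values.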
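a minimal sketch of the stat path for a file with several writers, as described above. the `Writer` record, the `can_write` flag, and `stat_shared_file` are hypothetical names; treating "file information" as size and mtime, and reissuing the capabilities afterwards, are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Writer:
    """Per-client view of an open file (toy model)."""
    size: int
    mtime: float
    can_write: bool = True


def stat_shared_file(mds_size: int, mds_mtime: float,
                     writers: list[Writer]) -> tuple[int, float]:
    """Return (size, mtime) for a file that may have active writers."""
    # 1. revoke write capabilities so the values stop changing
    for w in writers:
        w.can_write = False
    # 2. collect values from every writer and keep the highest ones
    size = max([mds_size] + [w.size for w in writers])
    mtime = max([mds_mtime] + [w.mtime for w in writers])
    # 3. reissue capabilities so the writers can continue (assumed step)
    for w in writers:
        w.can_write = True
    return size, mtime


if __name__ == "__main__":
    writers = [Writer(size=4096, mtime=100.0), Writer(size=8192, mtime=90.0)]
    print(stat_shared_file(1024, 50.0, writers))
```

taking the maximum works because each writer only ever extends the file or advances its modification time, so the largest reported value is the most recent one.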
- metadata storage