文章归档

CPI2 : CPU performance isolation for shared compute clusters

论文原址: http://research.google.com/pubs/pub40737.html

This paper describes CPI2, a system that builds on the useful properties of CPI measures to automate all of the following:

  1. observe the run-time performance of hundreds to thousands of tasks belonging to the same job, and learn to distinguish normal performance from outliers
  2. identify performance interference within a few minutes by detecting such outliers
  3. determine which antagonist applications are the likely cause with an online cross-correlation analysis
  4. (if desired) ameliorate the bad behavior by throttling or migrating the antagonists.

»» 继续阅读全文

Heracles: Improving Resource Efficiency at Scale

论文原址:

  1. http://csl.stanford.edu/~christos/publications/2015.heracles.isca.pdf
  2. https://cs.stanford.edu/~davidlo/resources/2015.heracles.isca.slides.pdf

一个实时的、自反馈的,大规模提升资源利用率的系统

1. 简介

为了解决LC服务和BE作业混部引起的LC服务SLO下降的问题,我们开发了Heracles,一个实时的控制器,能够动态使用硬件隔离机制和软隔离机制,保证共享资源的干扰问题

Heracles能够在保证LC服务足够SLO的情况下,最大限度的将空闲资源用来运行BE作业,同时,结合实时监控离线分析,自动检测干扰源,并在合适的时机,自动调整隔离机制防御干扰产生

在这里我们分两步走:

第一步,我们首先来研究一些经典案例,最终会抛出一个结论,那就是混部情况下【如无特殊说明,以下混部一律特指LC服务和BE作业】,干扰绝大部分情况下都是不均匀的、但是和负载有较强的相关性,所以,通过静态分区的方式来隔离共享资源是不够的

第二步,然后我们会解释为什么要设计Heracles,并且会证明

  1. 多种隔离机制的互相结合,是实现高利用率并且不打破LC服务SLO的关键所在
  2. 将干扰问题细分成几个独立的子问题,能够大大的降低动态控制的复杂性
  3. 为什么一个运行在所有机器上的、实时的监控程序是必须的

第三步,我们来评估一下Heracles在Google内部生产环境能产生多大的收益,测试表明,能够将机器利用率提高到90%左右

最后,我们通过硬件的一些机制来实现DRAM带宽的监控与隔离,进一步提高Heracles的精确性以及降低对离线分析的依赖

»» 继续阅读全文

Reconciling High Server Utilization and Sub-millisecond Quality-of-Service

论文原址:http://csl.stanford.edu/~christos/publications/2014.mutilate.eurosys.pdf

In this paper, we analyze the challenges of maintaining high QoS for low-latency workloads when sharing servers with other workloads.

The additional workloads can interfere with resources such as processing cores, cache space, memory or I/O bandwidth

The goal of this work is to investigate if workload colocation and good quality-of-service for latency-critical services are fundamentally incompatible in modern systems, or if instead we can reconcile the two

»» 继续阅读全文

Borg: google 集群操作系统

论文地址:http://research.google.com/pubs/pub43438.html

1. Introduction

google服务器集群的管理系统,类似于百度的Matrix,阿里的fuxi,腾讯的台风平台等等,还有开源的mesos

Borg provides three main benefits: it

  1. hides the details of resource management and failure handling so its users can focus on application development instead;
  2. operates with very high reliability and availability, and supports applications that do the same; and
  3. lets us run workloads across tens of thousands of machines effectively.

»» 继续阅读全文