CPI2: CPU performance isolation for shared compute clusters

Time: October 8, 2016
Category: colocation

Original paper: http://research.google.com/pubs/pub40737.html

This paper describes CPI2, a system that builds on the useful properties of CPI measures to automate all of the following:

  1. observe the run-time performance of hundreds to thousands of tasks belonging to the same job, and learn to distinguish normal performance from outliers
  2. identify performance interference within a few minutes by detecting such outliers
  3. determine which antagonist applications are the likely cause with an online cross-correlation analysis
  4. (if desired) ameliorate the bad behavior by throttling or migrating the antagonists.

The paper concludes that changes in CPI correlate positively with changes in the behavior of compute-intensive applications, and that CPI is a reasonably stable measure over time.

1. Collecting CPI data

The CPI data is sampled periodically by a system daemon using the perf_event tool in counting mode. The counters are saved/restored when a context switch changes to a thread from a different cgroup.
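
As a rough illustration of what one sample amounts to, CPI (cycles per instruction) over an interval is the change in the cycle counter divided by the change in the instructions-retired counter. The sketch below assumes the daemon has two cumulative readings for one cgroup; the `CounterSample` record and the `cpi_between` helper are hypothetical names for illustration, not part of the paper or of perf_event.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CounterSample:
    """Cumulative counter readings for one cgroup (hypothetical record)."""
    cycles: int
    instructions: int


def cpi_between(prev: CounterSample, curr: CounterSample) -> Optional[float]:
    """Return cycles per instruction over the interval between two readings.

    Returns None when no instructions were retired, since CPI is undefined then.
    """
    d_cycles = curr.cycles - prev.cycles
    d_instructions = curr.instructions - prev.instructions
    if d_instructions <= 0:
        return None
    return d_cycles / d_instructions


if __name__ == "__main__":
    before = CounterSample(cycles=1000, instructions=500)
    after = CounterSample(cycles=4000, instructions=2500)
    print(cpi_between(before, after))
```

Working with deltas of cumulative counters rather than resetting them keeps each sample independent of when the previous one was taken.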

2. Identifying antagonists

CPI data for every task in a job is gathered once a minute and compared against the job’s predicted CPI. If the observed CPI is significantly larger than the prediction, the measurement is flagged; if this happens often enough for a task, we look for possible correlations with an antagonist.

2.1 Detecting performance anomalies

A CPI measurement is flagged as an outlier if it is larger than the 2σ point on the predicted CPI distribution, subject to two rules:

  1. We ignore CPI measurements from tasks that use less than 0.25 CPU-sec/sec, since CPI sometimes increases significantly if CPU usage drops to near zero.
  2. A task is considered to be suffering anomalous behavior only if it is flagged as an outlier at least 3 times in a 5 minute window.
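
The rules above can be sketched as a small per-task detector. This is a minimal sketch under stated assumptions: the "2σ point" is taken to be the predicted mean plus two standard deviations, timestamps are in seconds, and the `AnomalyDetector` class and its method names are hypothetical, not the paper's implementation.

```python
from collections import deque

MIN_CPU_USAGE = 0.25        # CPU-sec/sec; samples below this are ignored
WINDOW_SECONDS = 5 * 60     # 5 minute window
MIN_FLAGS_IN_WINDOW = 3     # outlier flags needed to call the task anomalous


def is_outlier(cpi, predicted_mean, predicted_stddev):
    """Assumption: the 2-sigma point is mean + 2 * stddev of the predicted CPI."""
    return cpi > predicted_mean + 2 * predicted_stddev


class AnomalyDetector:
    """Tracks outlier flags for a single task (hypothetical helper)."""

    def __init__(self, predicted_mean, predicted_stddev):
        self.predicted_mean = predicted_mean
        self.predicted_stddev = predicted_stddev
        self._flag_times = deque()

    def observe(self, timestamp, cpi, cpu_usage):
        """Record one sample; return True if the task now looks anomalous."""
        # Forget flags that have fallen out of the 5 minute window.
        while self._flag_times and timestamp - self._flag_times[0] >= WINDOW_SECONDS:
            self._flag_times.popleft()
        # Skip samples from nearly idle tasks, then apply the 2-sigma test.
        if cpu_usage >= MIN_CPU_USAGE and is_outlier(
            cpi, self.predicted_mean, self.predicted_stddev
        ):
            self._flag_times.append(timestamp)
        return len(self._flag_times) >= MIN_FLAGS_IN_WINDOW


if __name__ == "__main__":
    detector = AnomalyDetector(predicted_mean=1.0, predicted_stddev=0.1)
    for minute in range(3):
        print(detector.observe(minute * 60, cpi=1.5, cpu_usage=1.0))
```

Requiring several flags inside a window means a single noisy sample never triggers the antagonist search on its own.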

2.2 Identifying antagonists

We use a passive method to identify likely culprits by looking for correlations between the victim’s CPI values and the CPU usage of the suspects; a good correlation means the suspect is highly likely to be a real antagonist rather than an innocent bystander.
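
To make the idea concrete, the sketch below ranks suspects by how strongly their CPU usage tracks the victim's CPI over time-aligned samples. It uses a plain Pearson correlation as a stand-in; this summary does not give the paper's own correlation formula, so the scoring function here is an assumption, and `pearson` and `rank_suspects` are hypothetical names.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 samples")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_suspects(victim_cpi, suspect_cpu_usage):
    """Return (suspect, score) pairs sorted from most to least correlated.

    victim_cpi: list of the victim's CPI samples.
    suspect_cpu_usage: dict mapping suspect name to its CPU usage samples,
    time-aligned with victim_cpi.
    """
    scores = [
        (name, pearson(victim_cpi, usage))
        for name, usage in suspect_cpu_usage.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    victim = [1.0, 2.0, 1.0, 2.0]
    suspects = {
        "batch-a": [0.1, 0.9, 0.1, 0.9],   # busy exactly when the victim suffers
        "batch-b": [0.5, 0.5, 0.5, 0.5],   # steady usage, likely a bystander
    }
    for name, score in rank_suspects(victim, suspects):
        print(name, round(score, 3))
```

A suspect whose usage is flat scores zero here, which matches the intuition that a steady neighbor is more likely an innocent bystander than the cause of intermittent slowdowns.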

3. Dealing with antagonists

Our policy is simple: we give preference to latency-sensitive jobs over batch ones.

  1. If the suspected antagonist is a batch job and the victim is a latency-sensitive one, then we forcibly reduce the antagonist’s CPU usage by applying CPU hard-capping.
  2. To allow offline analysis, we log and store data about CPIs and suspected antagonists. Job owners and administrators can use this information to ask the cluster scheduler to avoid co-locating their job and these antagonists in the future.
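
The policy above reduces to a small decision function. This is a sketch under assumptions: the `JobClass` and `Action` enums and the `choose_action` function are hypothetical names, and treating every other victim/antagonist pairing as "log only" is an assumption, since the summary only spells out the batch-versus-latency-sensitive case.

```python
from enum import Enum


class JobClass(Enum):
    LATENCY_SENSITIVE = "latency-sensitive"
    BATCH = "batch"


class Action(Enum):
    HARD_CAP_AND_LOG = "hard-cap antagonist CPU and log"
    LOG_ONLY = "log for offline analysis"


def choose_action(victim, antagonist):
    """Pick a response for a (victim, suspected antagonist) pair.

    Only a batch antagonist hurting a latency-sensitive victim is throttled.
    Assumption: all other pairings are just logged for offline analysis.
    """
    if victim is JobClass.LATENCY_SENSITIVE and antagonist is JobClass.BATCH:
        return Action.HARD_CAP_AND_LOG
    return Action.LOG_ONLY


if __name__ == "__main__":
    print(choose_action(JobClass.LATENCY_SENSITIVE, JobClass.BATCH).value)
    print(choose_action(JobClass.BATCH, JobClass.BATCH).value)
```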
