This paper describes CPI2, a system that builds on the useful properties of cycles-per-instruction (CPI) measurements to automate all of the following:
- observe the run-time performance of hundreds to thousands of tasks belonging to the same job, and learn to distinguish normal performance from outliers
- identify performance interference within a few minutes by detecting such outliers
- determine which antagonist applications are the likely cause with an online cross-correlation analysis
- (if desired) ameliorate the bad behavior by throttling or migrating the antagonists.
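The detection and attribution steps above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation. The function names, the two-standard-deviation threshold, and the use of plain Pearson correlation between a victim's CPI samples and each co-located task's CPU usage are all assumptions made for this sketch; these notes do not give the exact statistics CPI2 uses.

```python
from statistics import mean, stdev


def find_outliers(cpi_by_task, num_sigma=2.0):
    """Return ids of tasks whose CPI is above mean + num_sigma * stddev.

    cpi_by_task maps a task id to its CPI sample for one measurement window.
    The threshold rule is an assumption of this sketch.
    """
    values = list(cpi_by_task.values())
    threshold = mean(values) + num_sigma * stdev(values)
    return [task for task, cpi in cpi_by_task.items() if cpi > threshold]


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def rank_antagonists(victim_cpi, cpu_usage_by_suspect):
    """Rank co-located tasks by how closely their CPU usage tracks the victim's CPI.

    victim_cpi is a time series of CPI samples for the outlier task;
    cpu_usage_by_suspect maps a suspect name to a CPU-usage series sampled
    at the same instants. Pearson correlation is a stand-in chosen for this sketch.
    """
    scores = {
        name: pearson(victim_cpi, usage)
        for name, usage in cpu_usage_by_suspect.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A suspect at the top of the ranking with a score close to 1 would be the first candidate for throttling or migration; a score near 0 or below suggests its activity does not explain the victim's slowdown.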
Average server utilization in most datacenters is low, ranging between 10% and 50%. It is difficult to consolidate latency-critical (LC) services on a subset of highly utilized servers. One way to raise utilization is to launch best-effort (BE) tasks on the same server as a latency-critical job.
Goal: Eliminate SLO violations at all levels of load for the LC job while maximizing the throughput for BE tasks.
In this paper, we analyze the challenges of maintaining high QoS for low-latency workloads when sharing servers with other workloads.
The additional workloads can contend for shared resources such as processing cores, cache space, memory, or I/O bandwidth.
The goal of this work is to investigate whether workload colocation and good quality of service for latency-critical services are fundamentally incompatible in modern systems, or whether the two can be reconciled.