https://lwn.net/Articles/405076/

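The core idea behind dynamic writeback throttling is I/O bandwidth estimation.

The bandwidth estimation allows the kernel to scale dirty limits and I/O sizes to make the best use of all of the devices in the system, regardless of any specific device's performance characteristics.

The traditional writeback mechanism works as follows: when a process's dirty pages exceed a certain ratio, the process calls balance_dirty_pages() and writes dirty pages back synchronously. It is only allowed to return once the proportion of dirty pages has dropped below a given threshold.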

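This mechanism has three problems:

  1. What dirty-page ratio is appropriate for a process?
  2. Under heavy memory pressure, multiple background processes perform writeback at the same time, which generates a large amount of random I/O and lowers device throughput.
  3. How can the real bandwidth of a device be estimated more accurately?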

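The basic approach of dynamic writeback throttling is:

  1. Estimate the real bandwidth of each device heuristically.
  2. User processes no longer perform synchronous writeback themselves; instead, they wait while the background writeback threads flush dirty pages.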

Fengguang's 17-part patch makes a number of changes, starting with removing any direct writeback work from balance_dirty_pages(). Instead, the offending process simply goes to sleep for a while, secure in the knowledge that writeback is being handled by other parts of the system. That should lead to better I/O performance, but also to more predictable and controllable pauses for memory-intensive applications.

Much of the rest of the patch series is aimed at improving that pause calculation. It adds a new mechanism for estimating the actual bandwidth of each backing device - something the kernel does not have a good handle on, currently. Using that information, combined with the number of pages that the kernel would like to see written out before allowing a dirtying process to continue, a reasonable pause duration can be calculated. That pause is not allowed to exceed 200ms.
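The pause calculation described above can be sketched roughly as follows. This is only an illustration in Python, not the kernel code: the function name `pause_ms`, its parameters, and the choice of pages per second as the bandwidth unit are all assumptions made for the example. Only the 200ms cap comes from the text.

```python
MAX_PAUSE_MS = 200  # upper bound on a single pause, taken from the text above


def pause_ms(pages_to_write, bandwidth_pages_per_sec):
    """Return how long (in ms) a dirtying process should sleep.

    pages_to_write: pages the kernel wants written out before the
        process may continue (hypothetical parameter name).
    bandwidth_pages_per_sec: estimated bandwidth of the backing device
        (hypothetical unit chosen for this sketch).
    """
    if bandwidth_pages_per_sec <= 0:
        # No usable estimate yet: fall back to the longest allowed pause.
        return MAX_PAUSE_MS
    # Time the device needs to write the requested pages, in milliseconds.
    pause = 1000.0 * pages_to_write / bandwidth_pages_per_sec
    # The pause is never allowed to exceed the cap.
    return min(pause, MAX_PAUSE_MS)
```

For instance, with an estimated bandwidth of 25600 pages per second, asking for 256 pages to be written gives a 10ms pause, while asking for 10000 pages would hit the 200ms cap.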

The patch set tries to be smarter than that, though. 200ms is a long time to pause a process which is trying to get some work done. On the other hand, without a bit of care, it is also possible to pause processes for a very short period of time, which is bad for throughput. For this patch set, it was decided that optimal pauses would be between 10ms and 100ms. This range is achieved by maintaining a separate "nr_dirtied_pause" limit for every process; if the number of dirtied pages for that process is below the limit, it is not forced to pause. Any time that balance_dirty_pages() calculates a pause time of less than 10ms, the limit is raised; if the pause turns out to be over 100ms, instead, the limit is cut in half. The desired result is a pause within the selected range which tends quickly toward the 10ms end when memory pressure drops.
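The feedback loop on the per-process limit might look something like the sketch below. The 10ms and 100ms bounds and the halving step are from the text; the growth step (adding half of the current limit) is an assumption, since the article only says the limit is "raised". The function names are hypothetical and chosen for the example.

```python
MIN_PAUSE_MS = 10    # pauses shorter than this hurt throughput
MAX_PAUSE_MS = 100   # pauses longer than this hurt responsiveness


def adjust_nr_dirtied_pause(limit, pause_ms):
    """Return the new per-process nr_dirtied_pause limit.

    limit: current number of pages the process may dirty before pausing.
    pause_ms: the pause that balance_dirty_pages() just calculated.
    """
    if pause_ms < MIN_PAUSE_MS:
        # Pause too short: let the process dirty more pages between pauses.
        # The growth amount is an assumption made for this sketch.
        return limit + max(1, limit // 2)
    if pause_ms > MAX_PAUSE_MS:
        # Pause too long: cut the limit in half, but keep it at least 1.
        return max(1, limit // 2)
    # Pause is within the desired range: leave the limit alone.
    return limit


def must_pause(nr_dirtied, limit):
    """A process below its limit is not forced to pause."""
    return nr_dirtied >= limit
```

Raising the limit makes each pause cover more dirtied pages, so the calculated pause grows; halving it has the opposite effect, which is what steers the pause back into the 10ms to 100ms range.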

Another change made by this patch series is to try to come up with a global estimate of the memory pressure on the system. When normal memory scanning encounters dirty pages, the pressure estimate is increased. If, instead, the kswapd process on the most memory-stressed node in the system goes idle, then the estimate is decreased. This estimate is then used to adjust the throttling limits applied to processes; when the system is under heavy memory pressure, memory-dirtying processes will be put on hold sooner than they otherwise would be.
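As a rough illustration of this idea, the sketch below keeps a single pressure counter that rises when scanning finds dirty pages and falls when kswapd goes idle. The class name, the step size of 1, the 0 to 100 range, and the rule that full pressure halves the threshold are all assumptions for the example; the article does not give the actual formula.

```python
class PressureEstimate:
    """Toy global memory-pressure estimate (hypothetical, not kernel code)."""

    def __init__(self, max_level=100):
        self.level = 0              # 0 means no pressure
        self.max_level = max_level  # assumed upper bound for the estimate

    def scan_hit_dirty_page(self):
        # Memory scanning ran into a dirty page: pressure goes up.
        self.level = min(self.max_level, self.level + 1)

    def kswapd_went_idle(self):
        # kswapd on the most stressed node went idle: pressure goes down.
        self.level = max(0, self.level - 1)

    def throttle_threshold(self, base_pages):
        # Higher pressure lowers the threshold, so dirtying processes
        # are put on hold sooner. At full pressure the threshold is
        # half of base_pages (an assumed scaling).
        return base_pages - base_pages * self.level // (2 * self.max_level)
```

With this scaling, a base threshold of 1000 pages stays at 1000 when the estimate is 0 and drops to 500 once the estimate reaches its maximum.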

There is one other important change made in this patch set. Filesystem developers have been complaining for a while that the core memory management code tells them to write back too little memory at a time. On a fast device, overly small writeback requests will fail to keep the device busy, resulting in suboptimal performance. So some filesystems (xfs and ext4) actually ignore the amount of requested writeback; they will write back many more pages than they were asked to do. That can improve performance, but it is not without its problems; in particular, sending massive write operations to slow devices can stall the system for unacceptably long times.
