Multi-Gen LRU¶
The multi-gen LRU is an alternative LRU implementation that optimizes page reclaim and improves performance under memory pressure. Page reclaim decides the kernel’s caching policy and ability to overcommit memory. It directly impacts the kswapd CPU usage and RAM efficiency.
Design overview¶
Objectives¶
The design objectives are:
Good representation of access recency
Try to profit from spatial locality
Fast paths to make obvious choices
Simple self-correcting heuristics
The representation of access recency is at the core of all LRU implementations. In the multi-gen LRU, each generation represents a group of pages with similar access recency. Generations establish a (time-based) common frame of reference and therefore help make better choices, e.g., between different memcgs on a computer or different computers in a data center (for job scheduling).
Exploiting spatial locality improves efficiency when gathering the accessed bit. A rmap walk targets a single page and does not try to profit from discovering a young PTE. A page table walk can sweep all the young PTEs in an address space, but the address space can be too sparse to make a profit. The key is to optimize both methods and use them in combination.
Fast paths reduce code complexity and runtime overhead. Unmapped pages do not require TLB flushes; clean pages do not require writeback. These facts are only helpful when other conditions, e.g., access recency, are similar. With generations as a common frame of reference, additional factors stand out. But obvious choices might not be good choices; thus self-correction is necessary.
The benefits of simple self-correcting heuristics are self-evident. Again, with generations as a common frame of reference, this becomes attainable. Specifically, pages in the same generation can be categorized based on additional factors, and a feedback loop can statistically compare the refault percentages across those categories and infer which of them are better choices.
Assumptions¶
The protection of hot pages and the selection of cold pages are based on page access channels and patterns. There are two access channels:
Accesses through page tables
Accesses through file descriptors
The protection of the former channel is by design stronger because:
The uncertainty in determining the access patterns of the former channel is higher due to the approximation of the accessed bit.
The cost of evicting the former channel is higher due to the TLB flushes required and the likelihood of encountering the dirty bit.
The penalty of underprotecting the former channel is higher because applications usually do not prepare themselves for major page faults like they do for blocked I/O. E.g., GUI applications commonly use dedicated I/O threads to avoid blocking rendering threads.
There are also two access patterns:
Accesses exhibiting temporal locality
Accesses not exhibiting temporal locality
For the reasons listed above, the former channel is assumed to follow
the former pattern unless VM_SEQ_READ
or VM_RAND_READ
is
present, and the latter channel is assumed to follow the latter
pattern unless outlying refaults have been observed.
Workflow overview¶
Evictable pages are divided into multiple generations for each
lruvec
. The youngest generation number is stored in
lrugen->max_seq
for both anon and file types as they are aged on
an equal footing. The oldest generation numbers are stored in
lrugen->min_seq[]
separately for anon and file types as clean file
pages can be evicted regardless of swap constraints. These three
variables are monotonically increasing.
Generation numbers are truncated into order_base_2(MAX_NR_GENS+1)
bits in order to fit into the gen counter in folio->flags
. Each
truncated generation number is an index to lrugen->folios[]
. The
sliding window technique is used to track at least MIN_NR_GENS
and
at most MAX_NR_GENS
generations. The gen counter stores a value
within [1, MAX_NR_GENS]
while a page is on one of
lrugen->folios[]
; otherwise it stores zero.
Each generation is divided into multiple tiers. A page accessed N
times through file descriptors is in tier order_base_2(N)
. Unlike
generations, tiers do not have dedicated lrugen->folios[]
. In
contrast to moving across generations, which requires the LRU lock,
moving across tiers only involves atomic operations on
folio->flags
and therefore has a negligible cost. A feedback loop
modeled after the PID controller monitors refaults over all the tiers
from anon and file types and decides which tiers from which types to
evict or protect. The desired effect is to balance refault percentages
between anon and file types proportional to the swappiness level.
There are two conceptually independent procedures: the aging and the eviction. They form a closed-loop system, i.e., the page reclaim.
Aging¶
The aging produces young generations. Given an lruvec
, it
increments max_seq
when max_seq-min_seq+1
approaches
MIN_NR_GENS
. The aging promotes hot pages to the youngest
generation when it finds them accessed through page tables; the
demotion of cold pages happens consequently when it increments
max_seq
. The aging uses page table walks and rmap walks to find
young PTEs. For the former, it iterates lruvec_memcg()->mm_list
and calls walk_page_range()
with each mm_struct
on this list
to scan PTEs, and after each iteration, it increments max_seq
. For
the latter, when the eviction walks the rmap and finds a young PTE,
the aging scans the adjacent PTEs. For both, on finding a young PTE,
the aging clears the accessed bit and updates the gen counter of the
page mapped by this PTE to (max_seq%MAX_NR_GENS)+1
.
Eviction¶
The eviction consumes old generations. Given an lruvec
, it
increments min_seq
when lrugen->folios[]
indexed by
min_seq%MAX_NR_GENS
becomes empty. To select a type and a tier to
evict from, it first compares min_seq[]
to select the older type.
If both types are equally old, it selects the one whose first tier has
a lower refault percentage. The first tier contains single-use
unmapped clean pages, which are the best bet. The eviction sorts a
page according to its gen counter if the aging has found this page
accessed through page tables and updated its gen counter. It also
moves a page to the next generation, i.e., min_seq+1
, if this page
was accessed multiple times through file descriptors and the feedback
loop has detected outlying refaults from the tier this page is in. To
this end, the feedback loop uses the first tier as the baseline, for
the reason stated earlier.
Working set protection¶
Each generation is timestamped at birth. If lru_gen_min_ttl
is
set, an lruvec
is protected from the eviction when its oldest
generation was born within lru_gen_min_ttl
milliseconds. In other
words, it prevents the working set of lru_gen_min_ttl
milliseconds
from getting evicted. The OOM killer is triggered if this working set
cannot be kept in memory.
This time-based approach has the following advantages:
It is easier to configure because it is agnostic to applications and memory sizes.
It is more reliable because it is directly wired to the OOM killer.
mm_struct
list¶
An mm_struct
list is maintained for each memcg, and an
mm_struct
follows its owner task to the new memcg when this task
is migrated.
A page table walker iterates lruvec_memcg()->mm_list
and calls
walk_page_range()
with each mm_struct
on this list to scan
PTEs. When multiple page table walkers iterate the same list, each of
them gets a unique mm_struct
, and therefore they can run in
parallel.
Page table walkers ignore any misplaced pages, e.g., if an
mm_struct
was migrated, pages left in the previous memcg will be
ignored when the current memcg is under reclaim. Similarly, page table
walkers will ignore pages from nodes other than the one under reclaim.
This infrastructure also tracks the usage of mm_struct
between
context switches so that page table walkers can skip processes that
have been sleeping since the last iteration.
Rmap/PT walk feedback¶
Searching the rmap for PTEs mapping each page on an LRU list (to test and clear the accessed bit) can be expensive because pages from different VMAs (PA space) are not cache friendly to the rmap (VA space). For workloads mostly using mapped pages, searching the rmap can incur the highest CPU cost in the reclaim path.
lru_gen_look_around()
exploits spatial locality to reduce the
trips into the rmap. It scans the adjacent PTEs of a young PTE and
promotes hot pages. If the scan was done cacheline efficiently, it
adds the PMD entry pointing to the PTE table to the Bloom filter. This
forms a feedback loop between the eviction and the aging.
Bloom filters¶
Bloom filters are a space and memory efficient data structure for set membership test, i.e., test if an element is not in the set or may be in the set.
In the eviction path, specifically, in lru_gen_look_around()
, if a
PMD has a sufficient number of hot pages, its address is placed in the
filter. In the aging path, set membership means that the PTE range
will be scanned for young pages.
Note that Bloom filters are probabilistic on set membership. If a test is false positive, the cost is an additional scan of a range of PTEs, which may yield hot pages anyway. Parameters of the filter itself can control the false positive rate in the limit.
PID controller¶
A feedback loop modeled after the Proportional-Integral-Derivative (PID) controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation.
The PID controller uses generations rather than the wall clock as the time domain because a CPU can scan pages at different rates under varying memory pressure. It calculates a moving average for each new generation to avoid being permanently locked in a suboptimal state.
Memcg LRU¶
An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
since each node and memcg combination has an LRU of folios (see
mem_cgroup_lruvec()
). Its goal is to improve the scalability of
global reclaim, which is critical to system-wide memory overcommit in
data centers. Note that memcg LRU only applies to global reclaim.
The basic structure of an memcg LRU can be understood by an analogy to the active/inactive LRU (of folios):
It has the young and the old (generations), i.e., the counterparts to the active and the inactive;
The increment of
max_seq
triggers promotion, i.e., the counterpart to activation;Other events trigger similar operations, e.g., offlining an memcg triggers demotion, i.e., the counterpart to deactivation.
In terms of global reclaim, it has two distinct features:
Sharding, which allows each thread to start at a random memcg (in the old generation) and improves parallelism;
Eventual fairness, which allows direct reclaim to bail out at will and reduces latency without affecting fairness over some time.
In terms of traversing memcgs during global reclaim, it improves the best-case complexity from O(n) to O(1) and does not affect the worst-case complexity O(n). Therefore, on average, it has a sublinear complexity.
Summary¶
The multi-gen LRU (of folios) can be disassembled into the following parts:
Generations
Rmap walks
Page table walks via
mm_struct
listBloom filters for rmap/PT walk feedback
PID controller for refault feedback
The aging and the eviction form a producer-consumer model; specifically, the latter drives the former by the sliding window over generations. Within the aging, rmap walks drive page table walks by inserting hot densely populated page tables to the Bloom filters. Within the eviction, the PID controller uses refaults as the feedback to select types to evict and tiers to protect.