kernel: bump 5.15 to 5.15.140
openwrt/staging/hauke.git: target/linux/generic/backport-5.15/020-v6.1-05-mm-multi-gen-LRU-groundwork.patch
1 From a9b328add8422921a0dbbef162730800e16e8cfd Mon Sep 17 00:00:00 2001
2 From: Yu Zhao <yuzhao@google.com>
3 Date: Sun, 18 Sep 2022 02:00:02 -0600
4 Subject: [PATCH 05/29] mm: multi-gen LRU: groundwork
5 MIME-Version: 1.0
6 Content-Type: text/plain; charset=UTF-8
7 Content-Transfer-Encoding: 8bit
8
9 Evictable pages are divided into multiple generations for each lruvec.
10 The youngest generation number is stored in lrugen->max_seq for both
11 anon and file types as they are aged on an equal footing. The oldest
12 generation numbers are stored in lrugen->min_seq[] separately for anon
13 and file types as clean file pages can be evicted regardless of swap
14 constraints. These three variables are monotonically increasing.
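
As an illustration only (nothing below is code from this patch), the
relationship between the three counters can be modeled in plain C. The
names max_seq, min_seq and ANON_AND_FILE mirror the ones this patch adds
to include/linux/mmzone.h; the struct, the helper and the numbers are
invented for the example:

#include <assert.h>

enum { LRU_GEN_ANON, LRU_GEN_FILE, ANON_AND_FILE };

struct seq_model {
	unsigned long max_seq;                /* youngest, shared by anon and file */
	unsigned long min_seq[ANON_AND_FILE]; /* oldest, tracked per type */
};

/* generations currently in the [min_seq, max_seq] window of one type */
static unsigned long nr_gens(const struct seq_model *m, int type)
{
	return m->max_seq - m->min_seq[type] + 1;
}

int main(void)
{
	/* swap is constrained, so file min_seq has advanced past anon min_seq */
	struct seq_model m = { .max_seq = 27, .min_seq = { 24, 26 } };

	assert(nr_gens(&m, LRU_GEN_ANON) == 4); /* anon window at MAX_NR_GENS */
	assert(nr_gens(&m, LRU_GEN_FILE) == 2); /* file window at MIN_NR_GENS */
	return 0;
}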
15
16 Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
17 in order to fit into the gen counter in page->flags. Each truncated
18 generation number is an index to lrugen->lists[]. The sliding window
19 technique is used to track at least MIN_NR_GENS and at most
20 MAX_NR_GENS generations. The gen counter stores a value within [1,
21 MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
22 stores 0.
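
A minimal userspace model of that packing, mirroring lru_gen_from_seq()
and page_lru_gen() as added by this patch; LRU_GEN_WIDTH follows
order_base_2(MAX_NR_GENS + 1), but the bit offset is a stand-in, since
the real LRU_GEN_PGOFF depends on the kernel configuration:

#include <assert.h>

#define MAX_NR_GENS	4UL
#define LRU_GEN_WIDTH	3	/* order_base_2(MAX_NR_GENS + 1) */
#define LRU_GEN_PGOFF	8	/* stand-in offset, not the kernel's */
#define LRU_GEN_MASK	(((1UL << LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)

/* truncate a sequence number into a list index, as lru_gen_from_seq() does */
static int gen_from_seq(unsigned long seq)
{
	return seq % MAX_NR_GENS;
}

/* store gen + 1 in the counter; 0 means "not on lrugen->lists[]" */
static unsigned long pack_gen(unsigned long flags, int gen)
{
	return (flags & ~LRU_GEN_MASK) | ((gen + 1UL) << LRU_GEN_PGOFF);
}

/* read it back, as page_lru_gen() does; off-list pages yield -1 */
static int unpack_gen(unsigned long flags)
{
	return (long)((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
}

int main(void)
{
	unsigned long flags = pack_gen(0, gen_from_seq(27));

	assert(unpack_gen(flags) == (int)(27 % MAX_NR_GENS));
	assert(unpack_gen(0) == -1);
	return 0;
}

lru_gen_add_page() below applies the same gen + 1 encoding with
set_mask_bits(), and lru_gen_del_page() clears it again.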
23
24 There are two conceptually independent procedures: "the aging", which
25 produces young generations, and "the eviction", which consumes old
26 generations. They form a closed-loop system, i.e., "the page reclaim".
27 Both procedures can be invoked from userspace for the purposes of working
28 set estimation and proactive reclaim. These techniques are commonly used
29 to optimize job scheduling (bin packing) in data centers [1][2].
30
31 To avoid confusion, the terms "hot" and "cold" will be applied to the
32 multi-gen LRU, as a new convention; the terms "active" and "inactive" will
33 be applied to the active/inactive LRU, as usual.
34
35 The protection of hot pages and the selection of cold pages are based
36 on page access channels and patterns. There are two access channels:
37 one through page tables and the other through file descriptors. The
38 protection of the former channel is by design stronger because:
39 1. The uncertainty in determining the access patterns of the former
40 channel is higher due to the approximation of the accessed bit.
41 2. The cost of evicting the former channel is higher due to the TLB
42 flushes required and the likelihood of encountering the dirty bit.
43 3. The penalty of underprotecting the former channel is higher because
44 applications usually do not prepare themselves for major page
45 faults like they do for blocked I/O. E.g., GUI applications
46 commonly use dedicated I/O threads to avoid blocking rendering
47 threads.
48
49 There are also two access patterns: one with temporal locality and the
50 other without. For the reasons listed above, the former channel is
51 assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is
52 present; the latter channel is assumed to follow the latter pattern unless
53 outlying refaults have been observed [3][4].
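
In this patch the first assumption is applied at fault time:
lru_gen_enter_fault(), added to mm/memory.c below, records whether the
current access should feed the multi-gen LRU. Condensed from that hunk,
the check is simply:

	current->in_lru_fault = !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ));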
54
55 The next patch will address the "outlying refaults". Three macros, i.e.,
56 LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in
57 this patch to make the entire patchset less diffy.
58
59 A page is added to the youngest generation on faulting. The aging needs
60 to check the accessed bit at least twice before handing this page over to
61 the eviction. The first check takes care of the accessed bit set on the
62 initial fault; the second check makes sure this page has not been used
63 since then. This protocol, AKA second chance, requires a minimum of two
64 generations, hence MIN_NR_GENS.
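
A standalone sketch of that two-check protocol (the real aging code only
arrives later in the series); for simplicity it keeps raw sequence
numbers instead of the truncated gen counter:

#include <assert.h>
#include <stdbool.h>

#define MIN_NR_GENS	2UL

/* a page in one of the MIN_NR_GENS youngest generations is still protected */
static bool page_protected(unsigned long page_seq, unsigned long max_seq)
{
	return max_seq - page_seq < MIN_NR_GENS;
}

int main(void)
{
	unsigned long max_seq = 10;
	unsigned long page_seq = max_seq;	/* faulted in: the youngest generation */

	/* first check: the accessed bit set by the initial fault is consumed */
	assert(page_protected(page_seq, max_seq));

	/* the page isn't used again, so it keeps its seq while max_seq advances */
	max_seq++;
	assert(page_protected(page_seq, max_seq));	/* second chance */

	/* still unused after another aging step: the eviction may now take it */
	max_seq++;
	assert(!page_protected(page_seq, max_seq));
	return 0;
}

This is the same window that lru_gen_is_active() below treats as active
for the sake of the existing active/inactive accounting.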
65
66 [1] https://dl.acm.org/doi/10.1145/3297858.3304053
67 [2] https://dl.acm.org/doi/10.1145/3503222.3507731
68 [3] https://lwn.net/Articles/495543/
69 [4] https://lwn.net/Articles/815342/
70
71 Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com
72 Signed-off-by: Yu Zhao <yuzhao@google.com>
73 Acked-by: Brian Geffon <bgeffon@google.com>
74 Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
75 Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
76 Acked-by: Steven Barrett <steven@liquorix.net>
77 Acked-by: Suleiman Souhlal <suleiman@google.com>
78 Tested-by: Daniel Byrne <djbyrne@mtu.edu>
79 Tested-by: Donald Carr <d@chaos-reins.com>
80 Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
81 Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
82 Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
83 Tested-by: Sofia Trinh <sofia.trinh@edi.works>
84 Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
85 Cc: Andi Kleen <ak@linux.intel.com>
86 Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
87 Cc: Barry Song <baohua@kernel.org>
88 Cc: Catalin Marinas <catalin.marinas@arm.com>
89 Cc: Dave Hansen <dave.hansen@linux.intel.com>
90 Cc: Hillf Danton <hdanton@sina.com>
91 Cc: Jens Axboe <axboe@kernel.dk>
92 Cc: Johannes Weiner <hannes@cmpxchg.org>
93 Cc: Jonathan Corbet <corbet@lwn.net>
94 Cc: Linus Torvalds <torvalds@linux-foundation.org>
95 Cc: Matthew Wilcox <willy@infradead.org>
96 Cc: Mel Gorman <mgorman@suse.de>
97 Cc: Miaohe Lin <linmiaohe@huawei.com>
98 Cc: Michael Larabel <Michael@MichaelLarabel.com>
99 Cc: Michal Hocko <mhocko@kernel.org>
100 Cc: Mike Rapoport <rppt@kernel.org>
101 Cc: Mike Rapoport <rppt@linux.ibm.com>
102 Cc: Peter Zijlstra <peterz@infradead.org>
103 Cc: Qi Zheng <zhengqi.arch@bytedance.com>
104 Cc: Tejun Heo <tj@kernel.org>
105 Cc: Vlastimil Babka <vbabka@suse.cz>
106 Cc: Will Deacon <will@kernel.org>
107 Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
108 ---
109 fs/fuse/dev.c | 3 +-
110 include/linux/mm.h | 2 +
111 include/linux/mm_inline.h | 177 +++++++++++++++++++++++++++++-
112 include/linux/mmzone.h | 100 +++++++++++++++++
113 include/linux/page-flags-layout.h | 13 ++-
114 include/linux/page-flags.h | 4 +-
115 include/linux/sched.h | 4 +
116 kernel/bounds.c | 5 +
117 mm/Kconfig | 8 ++
118 mm/huge_memory.c | 3 +-
119 mm/memcontrol.c | 2 +
120 mm/memory.c | 25 +++++
121 mm/mm_init.c | 6 +-
122 mm/mmzone.c | 2 +
123 mm/swap.c | 10 +-
124 mm/vmscan.c | 75 +++++++++++++
125 16 files changed, 425 insertions(+), 14 deletions(-)
126
127 --- a/fs/fuse/dev.c
128 +++ b/fs/fuse/dev.c
129 @@ -785,7 +785,8 @@ static int fuse_check_page(struct page *
130 1 << PG_active |
131 1 << PG_workingset |
132 1 << PG_reclaim |
133 - 1 << PG_waiters))) {
134 + 1 << PG_waiters |
135 + LRU_GEN_MASK | LRU_REFS_MASK))) {
136 dump_page(page, "fuse: trying to steal weird page");
137 return 1;
138 }
139 --- a/include/linux/mm.h
140 +++ b/include/linux/mm.h
141 @@ -1093,6 +1093,8 @@ vm_fault_t finish_mkwrite_fault(struct v
142 #define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH)
143 #define LAST_CPUPID_PGOFF (ZONES_PGOFF - LAST_CPUPID_WIDTH)
144 #define KASAN_TAG_PGOFF (LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH)
145 +#define LRU_GEN_PGOFF (KASAN_TAG_PGOFF - LRU_GEN_WIDTH)
146 +#define LRU_REFS_PGOFF (LRU_GEN_PGOFF - LRU_REFS_WIDTH)
147
148 /*
149 * Define the bit shifts to access each section. For non-existent
150 --- a/include/linux/mm_inline.h
151 +++ b/include/linux/mm_inline.h
152 @@ -26,10 +26,13 @@ static inline int page_is_file_lru(struc
153
154 static __always_inline void __update_lru_size(struct lruvec *lruvec,
155 enum lru_list lru, enum zone_type zid,
156 - int nr_pages)
157 + long nr_pages)
158 {
159 struct pglist_data *pgdat = lruvec_pgdat(lruvec);
160
161 + lockdep_assert_held(&lruvec->lru_lock);
162 + WARN_ON_ONCE(nr_pages != (int)nr_pages);
163 +
164 __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
165 __mod_zone_page_state(&pgdat->node_zones[zid],
166 NR_ZONE_LRU_BASE + lru, nr_pages);
167 @@ -86,11 +89,177 @@ static __always_inline enum lru_list pag
168 return lru;
169 }
170
171 +#ifdef CONFIG_LRU_GEN
172 +
173 +static inline bool lru_gen_enabled(void)
174 +{
175 + return true;
176 +}
177 +
178 +static inline bool lru_gen_in_fault(void)
179 +{
180 + return current->in_lru_fault;
181 +}
182 +
183 +static inline int lru_gen_from_seq(unsigned long seq)
184 +{
185 + return seq % MAX_NR_GENS;
186 +}
187 +
188 +static inline int page_lru_gen(struct page *page)
189 +{
190 + unsigned long flags = READ_ONCE(page->flags);
191 +
192 + return ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
193 +}
194 +
195 +static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
196 +{
197 + unsigned long max_seq = lruvec->lrugen.max_seq;
198 +
199 + VM_WARN_ON_ONCE(gen >= MAX_NR_GENS);
200 +
201 + /* see the comment on MIN_NR_GENS */
202 + return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
203 +}
204 +
205 +static inline void lru_gen_update_size(struct lruvec *lruvec, struct page *page,
206 + int old_gen, int new_gen)
207 +{
208 + int type = page_is_file_lru(page);
209 + int zone = page_zonenum(page);
210 + int delta = thp_nr_pages(page);
211 + enum lru_list lru = type * LRU_INACTIVE_FILE;
212 + struct lru_gen_struct *lrugen = &lruvec->lrugen;
213 +
214 + VM_WARN_ON_ONCE(old_gen != -1 && old_gen >= MAX_NR_GENS);
215 + VM_WARN_ON_ONCE(new_gen != -1 && new_gen >= MAX_NR_GENS);
216 + VM_WARN_ON_ONCE(old_gen == -1 && new_gen == -1);
217 +
218 + if (old_gen >= 0)
219 + WRITE_ONCE(lrugen->nr_pages[old_gen][type][zone],
220 + lrugen->nr_pages[old_gen][type][zone] - delta);
221 + if (new_gen >= 0)
222 + WRITE_ONCE(lrugen->nr_pages[new_gen][type][zone],
223 + lrugen->nr_pages[new_gen][type][zone] + delta);
224 +
225 + /* addition */
226 + if (old_gen < 0) {
227 + if (lru_gen_is_active(lruvec, new_gen))
228 + lru += LRU_ACTIVE;
229 + __update_lru_size(lruvec, lru, zone, delta);
230 + return;
231 + }
232 +
233 + /* deletion */
234 + if (new_gen < 0) {
235 + if (lru_gen_is_active(lruvec, old_gen))
236 + lru += LRU_ACTIVE;
237 + __update_lru_size(lruvec, lru, zone, -delta);
238 + return;
239 + }
240 +}
241 +
242 +static inline bool lru_gen_add_page(struct lruvec *lruvec, struct page *page, bool reclaiming)
243 +{
244 + unsigned long seq;
245 + unsigned long flags;
246 + int gen = page_lru_gen(page);
247 + int type = page_is_file_lru(page);
248 + int zone = page_zonenum(page);
249 + struct lru_gen_struct *lrugen = &lruvec->lrugen;
250 +
251 + VM_WARN_ON_ONCE_PAGE(gen != -1, page);
252 +
253 + if (PageUnevictable(page))
254 + return false;
255 + /*
256 + * There are three common cases for this page:
257 + * 1. If it's hot, e.g., freshly faulted in or previously hot and
258 + * migrated, add it to the youngest generation.
259 + * 2. If it's cold but can't be evicted immediately, i.e., an anon page
260 + * not in swapcache or a dirty page pending writeback, add it to the
261 + * second oldest generation.
262 + * 3. Everything else (clean, cold) is added to the oldest generation.
263 + */
264 + if (PageActive(page))
265 + seq = lrugen->max_seq;
266 + else if ((type == LRU_GEN_ANON && !PageSwapCache(page)) ||
267 + (PageReclaim(page) &&
268 + (PageDirty(page) || PageWriteback(page))))
269 + seq = lrugen->min_seq[type] + 1;
270 + else
271 + seq = lrugen->min_seq[type];
272 +
273 + gen = lru_gen_from_seq(seq);
274 + flags = (gen + 1UL) << LRU_GEN_PGOFF;
275 + /* see the comment on MIN_NR_GENS about PG_active */
276 + set_mask_bits(&page->flags, LRU_GEN_MASK | BIT(PG_active), flags);
277 +
278 + lru_gen_update_size(lruvec, page, -1, gen);
279 + /* for rotate_reclaimable_page() */
280 + if (reclaiming)
281 + list_add_tail(&page->lru, &lrugen->lists[gen][type][zone]);
282 + else
283 + list_add(&page->lru, &lrugen->lists[gen][type][zone]);
284 +
285 + return true;
286 +}
287 +
288 +static inline bool lru_gen_del_page(struct lruvec *lruvec, struct page *page, bool reclaiming)
289 +{
290 + unsigned long flags;
291 + int gen = page_lru_gen(page);
292 +
293 + if (gen < 0)
294 + return false;
295 +
296 + VM_WARN_ON_ONCE_PAGE(PageActive(page), page);
297 + VM_WARN_ON_ONCE_PAGE(PageUnevictable(page), page);
298 +
299 + /* for migrate_page_states() */
300 + flags = !reclaiming && lru_gen_is_active(lruvec, gen) ? BIT(PG_active) : 0;
301 + flags = set_mask_bits(&page->flags, LRU_GEN_MASK, flags);
302 + gen = ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
303 +
304 + lru_gen_update_size(lruvec, page, gen, -1);
305 + list_del(&page->lru);
306 +
307 + return true;
308 +}
309 +
310 +#else /* !CONFIG_LRU_GEN */
311 +
312 +static inline bool lru_gen_enabled(void)
313 +{
314 + return false;
315 +}
316 +
317 +static inline bool lru_gen_in_fault(void)
318 +{
319 + return false;
320 +}
321 +
322 +static inline bool lru_gen_add_page(struct lruvec *lruvec, struct page *page, bool reclaiming)
323 +{
324 + return false;
325 +}
326 +
327 +static inline bool lru_gen_del_page(struct lruvec *lruvec, struct page *page, bool reclaiming)
328 +{
329 + return false;
330 +}
331 +
332 +#endif /* CONFIG_LRU_GEN */
333 +
334 static __always_inline void add_page_to_lru_list(struct page *page,
335 struct lruvec *lruvec)
336 {
337 enum lru_list lru = page_lru(page);
338
339 + if (lru_gen_add_page(lruvec, page, false))
340 + return;
341 +
342 update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page));
343 list_add(&page->lru, &lruvec->lists[lru]);
344 }
345 @@ -100,6 +269,9 @@ static __always_inline void add_page_to_
346 {
347 enum lru_list lru = page_lru(page);
348
349 + if (lru_gen_add_page(lruvec, page, true))
350 + return;
351 +
352 update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page));
353 list_add_tail(&page->lru, &lruvec->lists[lru]);
354 }
355 @@ -107,6 +279,9 @@ static __always_inline void add_page_to_
356 static __always_inline void del_page_from_lru_list(struct page *page,
357 struct lruvec *lruvec)
358 {
359 + if (lru_gen_del_page(lruvec, page, false))
360 + return;
361 +
362 list_del(&page->lru);
363 update_lru_size(lruvec, page_lru(page), page_zonenum(page),
364 -thp_nr_pages(page));
365 --- a/include/linux/mmzone.h
366 +++ b/include/linux/mmzone.h
367 @@ -294,6 +294,102 @@ enum lruvec_flags {
368 */
369 };
370
371 +#endif /* !__GENERATING_BOUNDS_H */
372 +
373 +/*
374 + * Evictable pages are divided into multiple generations. The youngest and the
375 + * oldest generation numbers, max_seq and min_seq, are monotonically increasing.
376 + * They form a sliding window of a variable size [MIN_NR_GENS, MAX_NR_GENS]. An
377 + * offset within MAX_NR_GENS, i.e., gen, indexes the LRU list of the
378 + * corresponding generation. The gen counter in page->flags stores gen+1 while
379 + * a page is on one of lrugen->lists[]. Otherwise it stores 0.
380 + *
381 + * A page is added to the youngest generation on faulting. The aging needs to
382 + * check the accessed bit at least twice before handing this page over to the
383 + * eviction. The first check takes care of the accessed bit set on the initial
384 + * fault; the second check makes sure this page hasn't been used since then.
385 + * This process, AKA second chance, requires a minimum of two generations,
386 + * hence MIN_NR_GENS. And to maintain ABI compatibility with the active/inactive
387 + * LRU, e.g., /proc/vmstat, these two generations are considered active; the
388 + * rest of generations, if they exist, are considered inactive. See
389 + * lru_gen_is_active().
390 + *
391 + * PG_active is always cleared while a page is on one of lrugen->lists[] so that
392 + * the aging needs not to worry about it. And it's set again when a page
393 + * considered active is isolated for non-reclaiming purposes, e.g., migration.
394 + * See lru_gen_add_page() and lru_gen_del_page().
395 + *
396 + * MAX_NR_GENS is set to 4 so that the multi-gen LRU can support twice the
397 + * number of categories of the active/inactive LRU when keeping track of
398 + * accesses through page tables. This requires order_base_2(MAX_NR_GENS+1) bits
399 + * in page->flags.
400 + */
401 +#define MIN_NR_GENS 2U
402 +#define MAX_NR_GENS 4U
403 +
404 +#ifndef __GENERATING_BOUNDS_H
405 +
406 +struct lruvec;
407 +
408 +#define LRU_GEN_MASK ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
409 +#define LRU_REFS_MASK ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
410 +
411 +#ifdef CONFIG_LRU_GEN
412 +
413 +enum {
414 + LRU_GEN_ANON,
415 + LRU_GEN_FILE,
416 +};
417 +
418 +/*
419 + * The youngest generation number is stored in max_seq for both anon and file
420 + * types as they are aged on an equal footing. The oldest generation numbers are
421 + * stored in min_seq[] separately for anon and file types as clean file pages
422 + * can be evicted regardless of swap constraints.
423 + *
424 + * Normally anon and file min_seq are in sync. But if swapping is constrained,
425 + * e.g., out of swap space, file min_seq is allowed to advance and leave anon
426 + * min_seq behind.
427 + *
428 + * The number of pages in each generation is eventually consistent and therefore
429 + * can be transiently negative.
430 + */
431 +struct lru_gen_struct {
432 + /* the aging increments the youngest generation number */
433 + unsigned long max_seq;
434 + /* the eviction increments the oldest generation numbers */
435 + unsigned long min_seq[ANON_AND_FILE];
436 + /* the multi-gen LRU lists, lazily sorted on eviction */
437 + struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
438 + /* the multi-gen LRU sizes, eventually consistent */
439 + long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
440 +};
441 +
442 +void lru_gen_init_lruvec(struct lruvec *lruvec);
443 +
444 +#ifdef CONFIG_MEMCG
445 +void lru_gen_init_memcg(struct mem_cgroup *memcg);
446 +void lru_gen_exit_memcg(struct mem_cgroup *memcg);
447 +#endif
448 +
449 +#else /* !CONFIG_LRU_GEN */
450 +
451 +static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
452 +{
453 +}
454 +
455 +#ifdef CONFIG_MEMCG
456 +static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
457 +{
458 +}
459 +
460 +static inline void lru_gen_exit_memcg(struct mem_cgroup *memcg)
461 +{
462 +}
463 +#endif
464 +
465 +#endif /* CONFIG_LRU_GEN */
466 +
467 struct lruvec {
468 struct list_head lists[NR_LRU_LISTS];
469 /* per lruvec lru_lock for memcg */
470 @@ -311,6 +407,10 @@ struct lruvec {
471 unsigned long refaults[ANON_AND_FILE];
472 /* Various lruvec state flags (enum lruvec_flags) */
473 unsigned long flags;
474 +#ifdef CONFIG_LRU_GEN
475 + /* evictable pages divided into generations */
476 + struct lru_gen_struct lrugen;
477 +#endif
478 #ifdef CONFIG_MEMCG
479 struct pglist_data *pgdat;
480 #endif
481 --- a/include/linux/page-flags-layout.h
482 +++ b/include/linux/page-flags-layout.h
483 @@ -55,7 +55,8 @@
484 #define SECTIONS_WIDTH 0
485 #endif
486
487 -#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
488 +#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_SHIFT \
489 + <= BITS_PER_LONG - NR_PAGEFLAGS
490 #define NODES_WIDTH NODES_SHIFT
491 #elif defined(CONFIG_SPARSEMEM_VMEMMAP)
492 #error "Vmemmap: No space for nodes field in page flags"
493 @@ -89,8 +90,8 @@
494 #define LAST_CPUPID_SHIFT 0
495 #endif
496
497 -#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT \
498 - <= BITS_PER_LONG - NR_PAGEFLAGS
499 +#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
500 + KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
501 #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
502 #else
503 #define LAST_CPUPID_WIDTH 0
504 @@ -100,10 +101,12 @@
505 #define LAST_CPUPID_NOT_IN_PAGE_FLAGS
506 #endif
507
508 -#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH \
509 - > BITS_PER_LONG - NR_PAGEFLAGS
510 +#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
511 + KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
512 #error "Not enough bits in page flags"
513 #endif
514
515 +#define LRU_REFS_WIDTH 0
516 +
517 #endif
518 #endif /* _LINUX_PAGE_FLAGS_LAYOUT */
519 --- a/include/linux/page-flags.h
520 +++ b/include/linux/page-flags.h
521 @@ -845,7 +845,7 @@ static inline void ClearPageSlabPfmemall
522 1UL << PG_private | 1UL << PG_private_2 | \
523 1UL << PG_writeback | 1UL << PG_reserved | \
524 1UL << PG_slab | 1UL << PG_active | \
525 - 1UL << PG_unevictable | __PG_MLOCKED)
526 + 1UL << PG_unevictable | __PG_MLOCKED | LRU_GEN_MASK)
527
528 /*
529 * Flags checked when a page is prepped for return by the page allocator.
530 @@ -856,7 +856,7 @@ static inline void ClearPageSlabPfmemall
531 * alloc-free cycle to prevent from reusing the page.
532 */
533 #define PAGE_FLAGS_CHECK_AT_PREP \
534 - (PAGEFLAGS_MASK & ~__PG_HWPOISON)
535 + ((PAGEFLAGS_MASK & ~__PG_HWPOISON) | LRU_GEN_MASK | LRU_REFS_MASK)
536
537 #define PAGE_FLAGS_PRIVATE \
538 (1UL << PG_private | 1UL << PG_private_2)
539 --- a/include/linux/sched.h
540 +++ b/include/linux/sched.h
541 @@ -911,6 +911,10 @@ struct task_struct {
542 #ifdef CONFIG_MEMCG
543 unsigned in_user_fault:1;
544 #endif
545 +#ifdef CONFIG_LRU_GEN
546 + /* whether the LRU algorithm may apply to this access */
547 + unsigned in_lru_fault:1;
548 +#endif
549 #ifdef CONFIG_COMPAT_BRK
550 unsigned brk_randomized:1;
551 #endif
552 --- a/kernel/bounds.c
553 +++ b/kernel/bounds.c
554 @@ -22,6 +22,11 @@ int main(void)
555 DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS));
556 #endif
557 DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t));
558 +#ifdef CONFIG_LRU_GEN
559 + DEFINE(LRU_GEN_WIDTH, order_base_2(MAX_NR_GENS + 1));
560 +#else
561 + DEFINE(LRU_GEN_WIDTH, 0);
562 +#endif
563 /* End of constants */
564
565 return 0;
566 --- a/mm/Kconfig
567 +++ b/mm/Kconfig
568 @@ -897,6 +897,14 @@ config IO_MAPPING
569 config SECRETMEM
570 def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
571
572 +config LRU_GEN
573 + bool "Multi-Gen LRU"
574 + depends on MMU
575 + # make sure page->flags has enough spare bits
576 + depends on 64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP
577 + help
578 + A high performance LRU implementation to overcommit memory.
579 +
580 source "mm/damon/Kconfig"
581
582 endmenu
583 --- a/mm/huge_memory.c
584 +++ b/mm/huge_memory.c
585 @@ -2366,7 +2366,8 @@ static void __split_huge_page_tail(struc
586 #ifdef CONFIG_64BIT
587 (1L << PG_arch_2) |
588 #endif
589 - (1L << PG_dirty)));
590 + (1L << PG_dirty) |
591 + LRU_GEN_MASK | LRU_REFS_MASK));
592
593 /* ->mapping in first tail page is compound_mapcount */
594 VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
595 --- a/mm/memcontrol.c
596 +++ b/mm/memcontrol.c
597 @@ -5179,6 +5179,7 @@ static void __mem_cgroup_free(struct mem
598
599 static void mem_cgroup_free(struct mem_cgroup *memcg)
600 {
601 + lru_gen_exit_memcg(memcg);
602 memcg_wb_domain_exit(memcg);
603 __mem_cgroup_free(memcg);
604 }
605 @@ -5242,6 +5243,7 @@ static struct mem_cgroup *mem_cgroup_all
606 memcg->deferred_split_queue.split_queue_len = 0;
607 #endif
608 idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
609 + lru_gen_init_memcg(memcg);
610 return memcg;
611 fail:
612 mem_cgroup_id_remove(memcg);
613 --- a/mm/memory.c
614 +++ b/mm/memory.c
615 @@ -4805,6 +4805,27 @@ static inline void mm_account_fault(stru
616 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
617 }
618
619 +#ifdef CONFIG_LRU_GEN
620 +static void lru_gen_enter_fault(struct vm_area_struct *vma)
621 +{
622 + /* the LRU algorithm doesn't apply to sequential or random reads */
623 + current->in_lru_fault = !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ));
624 +}
625 +
626 +static void lru_gen_exit_fault(void)
627 +{
628 + current->in_lru_fault = false;
629 +}
630 +#else
631 +static void lru_gen_enter_fault(struct vm_area_struct *vma)
632 +{
633 +}
634 +
635 +static void lru_gen_exit_fault(void)
636 +{
637 +}
638 +#endif /* CONFIG_LRU_GEN */
639 +
640 /*
641 * By the time we get here, we already hold the mm semaphore
642 *
643 @@ -4836,11 +4857,15 @@ vm_fault_t handle_mm_fault(struct vm_are
644 if (flags & FAULT_FLAG_USER)
645 mem_cgroup_enter_user_fault();
646
647 + lru_gen_enter_fault(vma);
648 +
649 if (unlikely(is_vm_hugetlb_page(vma)))
650 ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
651 else
652 ret = __handle_mm_fault(vma, address, flags);
653
654 + lru_gen_exit_fault();
655 +
656 if (flags & FAULT_FLAG_USER) {
657 mem_cgroup_exit_user_fault();
658 /*
659 --- a/mm/mm_init.c
660 +++ b/mm/mm_init.c
661 @@ -65,14 +65,16 @@ void __init mminit_verify_pageflags_layo
662
663 shift = 8 * sizeof(unsigned long);
664 width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH
665 - - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH;
666 + - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH - LRU_GEN_WIDTH - LRU_REFS_WIDTH;
667 mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
668 - "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Flags %d\n",
669 + "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Gen %d Tier %d Flags %d\n",
670 SECTIONS_WIDTH,
671 NODES_WIDTH,
672 ZONES_WIDTH,
673 LAST_CPUPID_WIDTH,
674 KASAN_TAG_WIDTH,
675 + LRU_GEN_WIDTH,
676 + LRU_REFS_WIDTH,
677 NR_PAGEFLAGS);
678 mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
679 "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d\n",
680 --- a/mm/mmzone.c
681 +++ b/mm/mmzone.c
682 @@ -81,6 +81,8 @@ void lruvec_init(struct lruvec *lruvec)
683
684 for_each_lru(lru)
685 INIT_LIST_HEAD(&lruvec->lists[lru]);
686 +
687 + lru_gen_init_lruvec(lruvec);
688 }
689
690 #if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS)
691 --- a/mm/swap.c
692 +++ b/mm/swap.c
693 @@ -446,6 +446,11 @@ void lru_cache_add(struct page *page)
694 VM_BUG_ON_PAGE(PageActive(page) && PageUnevictable(page), page);
695 VM_BUG_ON_PAGE(PageLRU(page), page);
696
697 + /* see the comment in lru_gen_add_page() */
698 + if (lru_gen_enabled() && !PageUnevictable(page) &&
699 + lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
700 + SetPageActive(page);
701 +
702 get_page(page);
703 local_lock(&lru_pvecs.lock);
704 pvec = this_cpu_ptr(&lru_pvecs.lru_add);
705 @@ -547,7 +552,7 @@ static void lru_deactivate_file_fn(struc
706
707 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
708 {
709 - if (PageActive(page) && !PageUnevictable(page)) {
710 + if (!PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) {
711 int nr_pages = thp_nr_pages(page);
712
713 del_page_from_lru_list(page, lruvec);
714 @@ -661,7 +666,8 @@ void deactivate_file_page(struct page *p
715 */
716 void deactivate_page(struct page *page)
717 {
718 - if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
719 + if (PageLRU(page) && !PageUnevictable(page) &&
720 + (PageActive(page) || lru_gen_enabled())) {
721 struct pagevec *pvec;
722
723 local_lock(&lru_pvecs.lock);
724 --- a/mm/vmscan.c
725 +++ b/mm/vmscan.c
726 @@ -2821,6 +2821,81 @@ static bool can_age_anon_pages(struct pg
727 return can_demote(pgdat->node_id, sc);
728 }
729
730 +#ifdef CONFIG_LRU_GEN
731 +
732 +/******************************************************************************
733 + * shorthand helpers
734 + ******************************************************************************/
735 +
736 +#define for_each_gen_type_zone(gen, type, zone) \
737 + for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++) \
738 + for ((type) = 0; (type) < ANON_AND_FILE; (type)++) \
739 + for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
740 +
741 +static struct lruvec __maybe_unused *get_lruvec(struct mem_cgroup *memcg, int nid)
742 +{
743 + struct pglist_data *pgdat = NODE_DATA(nid);
744 +
745 +#ifdef CONFIG_MEMCG
746 + if (memcg) {
747 + struct lruvec *lruvec = &memcg->nodeinfo[nid]->lruvec;
748 +
749 + /* for hotadd_new_pgdat() */
750 + if (!lruvec->pgdat)
751 + lruvec->pgdat = pgdat;
752 +
753 + return lruvec;
754 + }
755 +#endif
756 + VM_WARN_ON_ONCE(!mem_cgroup_disabled());
757 +
758 + return pgdat ? &pgdat->__lruvec : NULL;
759 +}
760 +
761 +/******************************************************************************
762 + * initialization
763 + ******************************************************************************/
764 +
765 +void lru_gen_init_lruvec(struct lruvec *lruvec)
766 +{
767 + int gen, type, zone;
768 + struct lru_gen_struct *lrugen = &lruvec->lrugen;
769 +
770 + lrugen->max_seq = MIN_NR_GENS + 1;
771 +
772 + for_each_gen_type_zone(gen, type, zone)
773 + INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
774 +}
775 +
776 +#ifdef CONFIG_MEMCG
777 +void lru_gen_init_memcg(struct mem_cgroup *memcg)
778 +{
779 +}
780 +
781 +void lru_gen_exit_memcg(struct mem_cgroup *memcg)
782 +{
783 + int nid;
784 +
785 + for_each_node(nid) {
786 + struct lruvec *lruvec = get_lruvec(memcg, nid);
787 +
788 + VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
789 + sizeof(lruvec->lrugen.nr_pages)));
790 + }
791 +}
792 +#endif
793 +
794 +static int __init init_lru_gen(void)
795 +{
796 + BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS);
797 + BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
798 +
799 + return 0;
800 +};
801 +late_initcall(init_lru_gen);
802 +
803 +#endif /* CONFIG_LRU_GEN */
804 +
805 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
806 {
807 unsigned long nr[NR_LRU_LISTS];