[PATCH v5 0/3] fadvise: support POSIX_FADV_NOREUSE

From: Andrea Righi <andrea@betterlinux.com>
To: Andrew Morton <akpm@linux-foundation.org>
Subject: [RFC] [PATCH v5 0/3] fadvise: support POSIX_FADV_NOREUSE
Date: Sun, 12 Feb 2012 01:21:35 +0100
Message-ID: <1329006098-5454-1-git-send-email-andrea@betterlinux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>,
    Peter Zijlstra <a.p.zijlstra@chello.nl>,
    Johannes Weiner <jweiner@redhat.com>,
    KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
    KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
    Rik van Riel <riel@redhat.com>,
    Hugh Dickins <hughd@google.com>,
    Alexander Viro <viro@zeniv.linux.org.uk>,
    Shaohua Li <shaohua.li@intel.com>,
    Pádraig Brady <P@draigBrady.com>,
    John Stultz <john.stultz@linaro.org>,
    Jerry James <jamesjer@betterlinux.com>,
    Julius Plenz <julius@plenz.com>,
    linux-mm <linux-mm@kvack.org>,
    linux-fsdevel@vger.kernel.org,
    LKML <linux-kernel@vger.kernel.org>

In the past there were reports of backup software (e.g., rsync) thrashing the page cache when it touches a huge number of pages (see, for example, [1]).

This problem was mitigated by Minchan's patch [2] together with proper use of fadvise() in the backup software. For example, this patch set [3] has been proposed for inclusion in rsync.

However, other thrashing problems remain: when the backup software reads all the source files, some of them may be part of the system's actual working set. If POSIX_FADV_DONTNEED is then applied, all their pages are evicted from the page cache: both the working set and the use-once pages touched only by the backup.

A previous proposal [4] was to use the POSIX_FADV_NOREUSE hint, currently implemented as a no-op, to provide a page invalidation policy less aggressive than POSIX_FADV_DONTNEED (by moving active pages to the tail of the inactive list, and dropping the pages from the inactive list).

However, as KOSAKI correctly pointed out, this behavior differs considerably from the POSIX definition:

  POSIX_FADV_NOREUSE
          Specifies that the application expects to access the specified data once and then
          not reuse it thereafter.

The new proposal is to implement POSIX_FADV_NOREUSE as a way to perform a real drop-behind policy where applications can mark certain intervals of a file as FADV_NOREUSE before accessing the data.
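A minimal userspace sketch of what "marking a range before accessing the data" could look like, using the standard posix_fadvise() call. This is an illustration only, not code from the patch set or from rsync; the helper name read_once() and the 64 KiB buffer size are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L   /* expose posix_fadvise() */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: read a whole file once, telling the kernel up
 * front that the data is not expected to be reused.
 * Returns the number of bytes read, or -1 on error.
 */
long long read_once(const char *path)
{
    char buf[64 * 1024];
    long long total = 0;
    ssize_t n;
    int fd;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /*
     * offset 0 with len 0 covers the file from its start to its end.
     * posix_fadvise() returns an error number directly; the hint is only
     * advisory, so a failure here is deliberately not treated as fatal.
     */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);

    while ((n = read(fd, buf, sizeof(buf))) > 0)
        total += n;

    close(fd);
    return n < 0 ? -1 : total;
}
```

A caller that only wants to mark part of a file would pass the corresponding offset and length instead of 0, 0.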

In this way the marked pages will never blow away the current working set. This requirement can be satisfied by preventing lru activation of the FADV_NOREUSE pages. Moreover, all the file cache pages in a FADV_NOREUSE range will be immediately dropped after a read if the page was not present in the cache before. The purpose is to preserve as much as possible the previous state of the file cache memory before reading data in ranges marked via FADV_NOREUSE.
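The rule in the paragraph above can be written down as a small decision function. This is a userspace model for illustration, not the kernel code from the patches; the enum and the function name noreuse_policy() are made up here.

```c
#include <stdbool.h>

/* What happens to a file cache page once a read of it has completed. */
enum page_action {
    PAGE_NORMAL,        /* regular LRU handling, activation allowed */
    PAGE_KEEP_NO_PROMO, /* stays cached where it was, never activated */
    PAGE_DROP           /* removed from the page cache after the read */
};

/*
 * in_noreuse_range:   the page lies in a range marked FADV_NOREUSE
 * cached_before_read: the page was already in the page cache
 */
enum page_action noreuse_policy(bool in_noreuse_range, bool cached_before_read)
{
    if (!in_noreuse_range)
        return PAGE_NORMAL;
    if (!cached_before_read)
        return PAGE_DROP;          /* brought in only by this read */
    return PAGE_KEEP_NO_PROMO;     /* preserve the previous cache state */
}
```

In both FADV_NOREUSE cases the page cache ends up as close as possible to its state before the read, which is the stated goal.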

The list of FADV_NOREUSE ranges is maintained in an interval tree [5] inside the address_space structure (for this, a generic interval tree implementation is also provided by PATCH 1/3). The intervals are dropped when the corresponding pages are dropped from the page cache or when a different caching hint is specified on the same intervals.
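To make the bookkeeping concrete, here is a small userspace sketch of a set of marked ranges with add, remove and lookup operations. It is not the kinterval API from PATCH 1/3: a flat sorted array stands in for the tree, ranges are half-open [start, end), and all names (range_set, range_add, range_del, range_contains, MAX_RANGES) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_RANGES 16

struct range { unsigned long start, end; };          /* [start, end) */
struct range_set { struct range r[MAX_RANGES]; size_t n; };

/* Mark [start, end); overlapping or touching ranges are merged. */
int range_add(struct range_set *s, unsigned long start, unsigned long end)
{
    struct range out[MAX_RANGES];
    size_t n = 0, i = 0;

    assert(start < end);
    while (i < s->n && s->r[i].end < start)      /* strictly before */
        out[n++] = s->r[i++];
    while (i < s->n && s->r[i].start <= end) {   /* absorb neighbours */
        if (s->r[i].start < start) start = s->r[i].start;
        if (s->r[i].end > end) end = s->r[i].end;
        i++;
    }
    if (n == MAX_RANGES)
        return -1;
    out[n++] = (struct range){ start, end };
    while (i < s->n) {                           /* strictly after */
        if (n == MAX_RANGES)
            return -1;
        out[n++] = s->r[i++];
    }
    memcpy(s->r, out, n * sizeof(out[0]));
    s->n = n;
    return 0;
}

/* Unmark [start, end), e.g. when a different hint replaces the mark. */
int range_del(struct range_set *s, unsigned long start, unsigned long end)
{
    struct range out[MAX_RANGES + 1];   /* one range may split in two */
    size_t n = 0;

    assert(start < end);
    for (size_t i = 0; i < s->n; i++) {
        struct range r = s->r[i];
        if (r.end <= start || r.start >= end) {  /* no overlap: keep */
            out[n++] = r;
            continue;
        }
        if (r.start < start) out[n++] = (struct range){ r.start, start };
        if (r.end > end) out[n++] = (struct range){ end, r.end };
    }
    if (n > MAX_RANGES)
        return -1;
    memcpy(s->r, out, n * sizeof(out[0]));
    s->n = n;
    return 0;
}

/* Is byte offset pos inside a marked range? */
bool range_contains(const struct range_set *s, unsigned long pos)
{
    for (size_t i = 0; i < s->n; i++)
        if (pos >= s->r[i].start && pos < s->r[i].end)
            return true;
    return false;
}
```

The array makes every operation linear in the number of ranges; a tree keeps lookups logarithmic, which matters when the lookup sits on the read path.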

Example:

How can this solve the "backup is thrashing my page cache" issue?

The "backup" software can be changed as follows:

Simple test case:

The test is done starting both with the working set in the inactive lru list (the 256MB chunk read once at the beginning of the test) and the working set in the active lru list (the 256MB chunk read twice at the beginning of the test).

Results:

inact_working_set = the working set is all in the inactive lru list
  act_working_set = the working set is all in the active lru list

                       inact_working_set   act_working_set
POSIX_FADV_NORMAL      2,052               0
POSIX_FADV_DONTNEED    0(*)                2048
POSIX_FADV_NOREUSE     0                   0

                       inact_working_set   act_working_set
POSIX_FADV_NORMAL      1.013               0.070
POSIX_FADV_DONTNEED    0.070(*)            1.006
POSIX_FADV_NOREUSE     0.070               0.070

(*) With POSIX_FADV_DONTNEED I would expect to see the working set pages dropped from the page cache when it starts in the inactive lru list.

Instead this happens only when we start with the working set all in the active list. IIUC this happens because when we re-read the pages the second time they're moved to the per-cpu vector activate_page_pvecs; if the page vector is not yet drained when we issue the POSIX_FADV_DONTNEED the advice is ignored.

Anyway, for this particular test case the best way to preserve the state of the page cache is obviously to use POSIX_FADV_NOREUSE.

Regression test:

  for i in `seq 1 10`; do
      fio --name=fio --directory=. --rw=read --bs=4K --size=1G --numjobs=4 | grep READ:
  done

No obvious performance regression was found according to this simple test.

Credits:

References:
[1] http://marc.info/?l=rsync&m=128885034930933&w=2
[2] https://lkml.org/lkml/2011/2/20/57
[3] http://lists.samba.org/archive/rsync/2010-November/025827...
[4] http://thread.gmane.org/gmane.linux.kernel.mm/65493
[5] http://en.wikipedia.org/wiki/Interval_tree
[6] http://thread.gmane.org/gmane.linux.kernel/1218654

ChangeLog v4 -> v5:

[PATCH v5 1/3] kinterval: routines to manipulate generic intervals
[PATCH v5 2/3] mm: filemap: introduce mark_page_usedonce
[PATCH v5 3/3] fadvise: implement POSIX_FADV_NOREUSE

 fs/inode.c                |    3 +
 include/linux/fs.h        |   12 ++
 include/linux/kinterval.h |  126 ++++++++++++
 include/linux/swap.h      |    1 +
 lib/Makefile              |    2 +-
 lib/kinterval.c           |  483 +++++++++++++++++++++++++++++++++++++++++++++
 mm/fadvise.c              |   18 ++-
 mm/filemap.c              |   95 +++++++++-
 mm/swap.c                 |   24 +++
 9 files changed, 760 insertions(+), 4 deletions(-)
