[VM Documentation 2/2]: Update to recent 2.4 tunables

Signed-off-by: Marc-Christian Petersen <m.c.p@kernel.linux-systeme.com>
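
Note for reviewers: the bdflush default line documented below
("30 500 0 0 500 3000 60 20 0   for 100 HZ") can be cross-checked
with a small script. This is only a sketch under stated assumptions:
the helper name bdflush_default_line and the table BDFLUSH_DEFAULTS
are hypothetical and not part of the kernel tree. It assumes, as the
text below says, that interval and age_buffer are the only
time-based fields and that they are stored in jiffies (seconds * HZ).

```python
# Hypothetical helper, not part of the kernel tree: rebuilds the default
# /proc/sys/vm/bdflush line from the values documented in this patch.

# (name, default, is_time): time fields are documented in seconds and
# stored in jiffies, i.e. seconds * HZ.
BDFLUSH_DEFAULTS = [
    ("nfract", 30, False),
    ("ndirty", 500, False),
    ("dummy2", 0, False),
    ("dummy3", 0, False),
    ("interval", 5, True),
    ("age_buffer", 30, True),
    ("nfract_sync", 60, False),
    ("nfract_stop_bdflush", 20, False),
    ("dummy5", 0, False),
]


def bdflush_default_line(hz=100):
    """Return the nine space-separated defaults for a given HZ."""
    values = []
    for _name, default, is_time in BDFLUSH_DEFAULTS:
        # Only the time-based fields scale with HZ.
        values.append(default * hz if is_time else default)
    return " ".join(str(v) for v in values)


if __name__ == "__main__":
    print(bdflush_default_line(100))   # HZ=100, as in the documented default
    print(bdflush_default_line(1024))  # HZ=1024, e.g. Alpha
```

With HZ=100 the result should match the default line in the bdflush
section; on a machine with a different HZ only the fifth and sixth
fields change.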

--- a/Documentation/sysctl/vm.txt	2004-05-26 19:57:15.000000000 +0200
+++ b/Documentation/sysctl/vm.txt	2004-05-26 20:06:20.000000000 +0200
@@ -1,111 +1,143 @@
-Documentation for /proc/sys/vm/*	kernel version 2.4.19
-	(c) 1998, 1999,  Rik van Riel <riel@nl.linux.org>
+Documentation for /proc/sys/vm/*	Kernel version 2.4.28
+=============================================================
 
-For general info and legal blurb, please look in README.
+ (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>
+    - Initial version
 
-==============================================================
+ (c) 2004, Marc-Christian Petersen <m.c.p@linux-systeme.com>
+    - Removed knobs which no longer exist (dropped in early
+      2.4 stages)
+    - Corrected values for bdflush
+    - Documented missing tunables
+    - Documented aa-vm tunables
+
+
+
+For general info and legal blurb, please look in README.
+=============================================================
 
 This file contains the documentation for the sysctl files in
-/proc/sys/vm and is valid for Linux kernel version 2.4.
+/proc/sys/vm and is valid for Linux kernel v2.4.28.
 
 The files in this directory can be used to tune the operation
 of the virtual memory (VM) subsystem of the Linux kernel, and
-one of the files (bdflush) also has a little influence on disk
-usage.
+three of the files (bdflush, max-readahead, min-readahead)
+also have some influence on disk usage.
 
 Default values and initialization routines for most of these
-files can be found in mm/swap.c.
+files can be found in mm/vmscan.c, mm/page_alloc.c and
+mm/filemap.c.
 
 Currently, these files are in /proc/sys/vm:
 - bdflush
+- block_dump
 - kswapd
+- laptop_mode
+- max-readahead
+- min-readahead
 - max_map_count
 - overcommit_memory
 - page-cluster
 - pagetable_cache
+- vm_anon_lru
+- vm_cache_scan_ratio
+- vm_gfp_debug
+- vm_lru_balance_ratio
+- vm_mapped_ratio
+- vm_passes
+- vm_vfs_scan_ratio
+=============================================================
 
-==============================================================
 
-bdflush:
 
+bdflush:
+--------
 This file controls the operation of the bdflush kernel
 daemon. The source code to this struct can be found in
-linux/fs/buffer.c. It currently contains 9 integer values,
+fs/buffer.c. It currently contains 9 integer values,
 of which 6 are actually used by the kernel.
 
-From linux/fs/buffer.c:
---------------------------------------------------------------
-union bdflush_param {
-	struct {
-		int nfract;	/* Percentage of buffer cache dirty to
-				   activate bdflush */
-		int ndirty;	/* Maximum number of dirty blocks to write out per
-				   wake-cycle */
-		int dummy2;	/* old "nrefill" */
-		int dummy3;	/* unused */
-		int interval;	/* jiffies delay between kupdate flushes */
-		int age_buffer;	/* Time for normal buffer to age before we flush it */
-		int nfract_sync;/* Percentage of buffer cache dirty to
-				   activate bdflush synchronously */
-		int nfract_stop_bdflush; /* Percentage of buffer cache dirty to stop bdflush */
-		int dummy5;	/* unused */
-	} b_un;
-	unsigned int data[N_PARAM];
-} bdf_prm = {{30, 500, 0, 0, 5*HZ, 30*HZ, 60, 20, 0}};
---------------------------------------------------------------
-
-int nfract:
-The first parameter governs the maximum number of dirty
-buffers in the buffer cache. Dirty means that the contents
-of the buffer still have to be written to disk (as opposed
-to a clean buffer, which can just be forgotten about).
-Setting this to a high value means that Linux can delay disk
-writes for a long time, but it also means that it will have
-to do a lot of I/O at once when memory becomes short. A low
-value will spread out disk I/O more evenly, at the cost of
-more frequent I/O operations.  The default value is 30%,
-the minimum is 0%, and the maximum is 100%.
-
-int ndirty:
-The second parameter (ndirty) gives the maximum number of
-dirty buffers that bdflush can write to the disk in one time.
-A high value will mean delayed, bursty I/O, while a small
-value can lead to memory shortage when bdflush isn't woken
-up often enough.
-
-int interval:
-The fifth parameter, interval, is the minimum rate at
-which kupdate will wake and flush.  The value is expressed in
-jiffies (clockticks), the number of jiffies per second is
-normally 100 (Alpha is 1024). Thus, x*HZ is x seconds.  The
-default value is 5 seconds, the minimum is 0 seconds, and the
-maximum is 600 seconds.
-
-int age_buffer:
-The sixth parameter, age_buffer, governs the maximum time
-Linux waits before writing out a dirty buffer to disk.  The
-value is in jiffies.  The default value is 30 seconds,
-the minimum is 1 second, and the maximum 6,000 seconds.
-
-int nfract_sync:
-The seventh parameter, nfract_sync, governs the percentage
-of buffer cache that is dirty before bdflush activates
-synchronously.  This can be viewed as the hard limit before
-bdflush forces buffers to disk.  The default is 60%, the
-minimum is 0%, and the maximum is 100%.
-
-int nfract_stop_bdflush:
-The eighth parameter, nfract_stop_bdflush, governs the percentage
-of buffer cache that is dirty which will stop bdflush.
-The default is 20%, the miniumum is 0%, and the maxiumum is 100%.
-==============================================================
+nfract:		The first parameter governs the maximum
+		number of dirty buffers in the buffer
+		cache. Dirty means that the contents of the
+		buffer still have to be written to disk (as
+		opposed to a clean buffer, which can just be
+		forgotten about). Setting this to a high
+		value means that Linux can delay disk writes
+		for a long time, but it also means that it
+		will have to do a lot of I/O at once when
+		memory becomes short. A low value will
+		spread out disk I/O more evenly, at the cost
+		of more frequent I/O operations. The default
+		value is 30%, the minimum is 0%, and the
+		maximum is 100%.
+
+ndirty:		The second parameter (ndirty) gives the
+		maximum number of dirty buffers that bdflush
+		can write to the disk in one time. A high
+		value will mean delayed, bursty I/O, while a
+		small value can lead to memory shortage when
+		bdflush isn't woken up often enough. The
+		default value is 500 dirty buffers, the
+		minimum is 1, and the maximum is 50000.
+
+dummy2:		The third parameter is not used.
+
+dummy3:		The fourth parameter is not used.
+
+interval:	The fifth parameter, interval, is the minimum
+		rate at which kupdate will wake and flush.
+		The value is in jiffies (clockticks); the
+		number of jiffies per second is normally 100
+		(Alpha is 1024). Thus, x*HZ is x seconds. The
+		default value is 5 seconds, the minimum is 0
+		seconds, and the maximum is 10,000 seconds.
+
+age_buffer:	The sixth parameter, age_buffer, governs the
+		maximum time Linux waits before writing out a
+		dirty buffer to disk. The value is in jiffies.
+		The default value is 30 seconds, the minimum
+		is 1 second, and the maximum 10,000 seconds.
+
+nfract_sync:	The seventh parameter, nfract_sync, governs
+		the percentage of buffer cache that is dirty
+		before bdflush activates synchronously. This
+		can be viewed as the hard limit before
+		bdflush forces buffers to disk. The default
+		is 60%, the minimum is 0%, and the maximum
+		is 100%.
+
+nfract_stop_bdflush:
+		The eighth parameter governs the percentage
+		of dirty buffer cache at which bdflush stops.
+		The default is 20%, the minimum is 0%, and
+		the maximum is 100%.
+
+dummy5:		The ninth parameter is not used.
+
+So the default is: 30 500 0 0 500 3000 60 20 0   for 100 HZ.
+=============================================================
+
+
+
+block_dump:
+-----------
+If the disk keeps spinning up and you cannot tell what causes
+it, the laptop mode patch provides a helper. When block_dump
+is set to 1, the kernel logs to the kernel message buffer
+which process caused each I/O. Be careful with this setting:
+it is advisable to shut down syslog first, since syslog
+writing these messages causes I/O itself. The default is 0.
+=============================================================
+
 
-kswapd:
 
+kswapd:
+-------
 Kswapd is the kernel swapout daemon. That is, kswapd is that
 piece of the kernel that frees memory when it gets fragmented
-or full. Since every system is different, you'll probably want
-some control over this piece of the system.
+or full. Since every system is different, you'll probably
+want some control over this piece of the system.
 
 The numbers in this page correspond to the numbers in the
 struct pager_daemon {tries_base, tries_min, swap_cluster
@@ -117,39 +149,83 @@ tries_base	The maximum number of pages k
 		number. Usually this number will be divided
 		by 4 or 8 (see mm/vmscan.c), so it isn't as
 		big as it looks.
-		When you need to increase the bandwidth to/from
-		swap, you'll want to increase this number.
+		When you need to increase the bandwidth to/
+		from swap, you'll want to increase this
+		number.
+
 tries_min	This is the minimum number of times kswapd
 		tries to free a page each time it is called.
 		Basically it's just there to make sure that
 		kswapd frees some pages even when it's being
 		called with minimum priority.
+
 swap_cluster	This is the number of pages kswapd writes in
 		one turn. You want this large so that kswapd
 		does it's I/O in large chunks and the disk
-		doesn't have to seek often, but you don't want
-		it to be too large since that would flood the
-		request queue.
+		doesn't have to seek often, but you don't
+		want it to be too large since that would
+		flood the request queue.
+
+The default value is: 512 32 8.
+=============================================================
 
-==============================================================
 
-overcommit_memory:
 
-This value contains a flag that enables memory overcommitment.
-When this flag is 0, the kernel checks before each malloc()
-to see if there's enough memory left. If the flag is nonzero,
-the system pretends there's always enough memory.
+laptop_mode:
+------------
+Setting this to 1 switches the VM (and block layer) to laptop
+mode. Leaving it at 0 makes the kernel work like before. When
+in laptop mode, you also want to extend the intervals
+described in Documentation/laptop-mode.txt.
+See the laptop-mode.sh script for how to do that.
+
+The default value is 0.
+=============================================================
 
-This feature can be very useful because there are a lot of
-programs that malloc() huge amounts of memory "just-in-case"
-and don't use much of it.
 
-Look at: mm/mmap.c::vm_enough_memory() for more information.
 
-==============================================================
+max-readahead:
+--------------
+This tunable affects how early the Linux VFS will fetch the
+next block of a file from disk into memory. File readahead
+values are determined on a per-file basis in the VFS and are
+adjusted based on the behavior of the application accessing
+the file. Any time the current position being read in a file
+plus the current readahead value results in the file pointer
+pointing to the next block in the file, that block will be
+fetched from disk. Raising this value allows the readahead
+value to grow larger, so more blocks are prefetched for
+applications which predictably access files in a uniform
+linear fashion. This can result in performance improvements,
+but can also result in excess (and often unnecessary) memory
+usage. Lowering this value has the opposite effect: forcing
+readaheads to be less aggressive may conserve memory at a
+potential performance cost.
+
+The default value is 31.
+=============================================================
 
-max_map_count:
 
+
+min-readahead:
+--------------
+As the counterpart of max-readahead, min-readahead places a
+floor on the readahead value. Raising this number forces a
+file's readahead value to be unconditionally higher, which
+can bring about performance improvements, provided that all
+file access in the system is predictably linear from the
+start to the end of a file. This of course results in higher
+memory usage from the pagecache. Conversely, lowering this
+value allows the kernel to conserve pagecache memory, at a
+potential performance cost.
+
+The default value is 3.
+=============================================================
+
+
+
+max_map_count:
+--------------
 This file contains the maximum number of memory map areas a
 process may have. Memory map areas are used as a side-effect
 of calling malloc, directly by mmap and mprotect, and also
@@ -159,10 +235,29 @@ While most applications need less than a
 certain programs, particularly malloc debuggers, may consume 
 lots of them, e.g. up to one or two maps per allocation.
 
-==============================================================
+The default value is 65536.
+=============================================================
+
+
+
+overcommit_memory:
+------------------
+This value contains a flag to enable memory overcommitment.
+When this flag is 0, the kernel checks before each malloc()
+to see if there's enough memory left. If the flag is nonzero,
+the system pretends there's always enough memory.
+
+This feature can be very useful because there are a lot of
+programs that malloc() huge amounts of memory "just-in-case"
+and don't use much of it. The default value is 0.
+
+Look at: mm/mmap.c::vm_enough_memory() for more information.
+=============================================================
+
 
-page-cluster:
 
+page-cluster:
+-------------
 The Linux VM subsystem avoids excessive disk seeks by reading
 multiple pages on a page fault. The number of pages it reads
 is dependent on the amount of memory in your machine.
@@ -170,11 +265,12 @@ is dependent on the amount of memory in 
 The number of pages the kernel reads in at once is equal to
 2 ^ page-cluster. Values above 2 ^ 5 don't make much sense
 for swap because we only cluster swap data in 32-page groups.
+=============================================================
 
-==============================================================
 
-pagetable_cache:
 
+pagetable_cache:
+----------------
 The kernel keeps a number of page tables in a per-processor
 cache (this helps a lot on SMP systems). The cache size for
 each processor will be between the low and the high value.
@@ -188,3 +284,98 @@ For large systems, the settings are prob
 systems they won't hurt a bit. For small systems (<16MB ram)
 it might be advantageous to set both values to 0.
 
+The default value is: 25 50.
+=============================================================
+
+
+
+vm_anon_lru:
+------------
+Selects whether to immediately insert anon pages in the LRU.
+Immediately means as soon as they're allocated during page
+faults. If this is set to 0, they're inserted only after the
+first swapout.
+
+Having anon pages immediately inserted in the LRU allows the
+VM to know better when it's worthwhile to start swapping
+anonymous RAM. It will start to swap earlier and it should
+swap smoother and faster, but it will decrease scalability
+on >16-way machines by an order of magnitude. Big SMP/NUMA
+systems definitely can't take a hit on a global spinlock at
+every anon page allocation.
+
+Low-RAM machines that swap all the time want to turn
+this on (i.e. set to 1).
+
+The default value is 1.
+=============================================================
+
+
+
+vm_cache_scan_ratio:
+--------------------
+Sets how much of the inactive LRU queue we scan in one go.
+A value of 6 for vm_cache_scan_ratio implies that we'll scan
+1/6 of the inactive lists during a normal aging round.
+
+The default value is 6.
+=============================================================
+
+
+
+vm_gfp_debug:
+-------------
+When enabled, dump a stack trace whenever __alloc_pages
+fails. This mostly happens during OOM conditions (hopefully ;)
+
+The default value is 0.
+=============================================================
+
+
+
+vm_lru_balance_ratio:
+---------------------
+Controls the balance between active and inactive cache. The
+bigger vm_lru_balance_ratio is, the easier the active cache
+will grow, because we'll rotate the active list slowly. A
+value of 2 means we'll go towards a balance of 1/3 of the
+cache being inactive.
+
+The default value is 2.
+=============================================================
+
+
+
+vm_mapped_ratio:
+----------------
+Controls the pageout rate: the smaller the value, the
+earlier we'll start to page out.
+
+The default value is 100.
+=============================================================
+
+
+
+vm_passes:
+----------
+The number of VM passes before failing the memory
+balancing. Take into account that 3 passes are needed for a
+flush/wait/free cycle and that we only scan
+1/vm_cache_scan_ratio of the inactive list at each pass.
+
+The default value is 60.
+=============================================================
+
+
+
+vm_vfs_scan_ratio:
+------------------
+Sets what proportion of the VFS queues we scan in one go.
+A value of 6 for vm_vfs_scan_ratio implies that 1/6th of the
+unused-inode, dentry and dquot caches will be freed during a
+normal aging round.
+Big fileservers (NFS, SMB etc.) probably want to set this
+value to 3 or 2.
+
+The default value is 6.
+=============================================================
--- a/Documentation/filesystems/proc.txt	2004-05-23 00:08:31.000000000 +0200
+++ b/Documentation/filesystems/proc.txt	2004-05-23 02:33:41.000000000 +0200
@@ -936,172 +936,7 @@ program to load modules on demand.
 
 2.4 /proc/sys/vm - The virtual memory subsystem
 -----------------------------------------------
-
-The files  in  this directory can be used to tune the operation of the virtual
-memory (VM)  subsystem  of  the  Linux  kernel.  In addition, one of the files
-(bdflush) has some influence on disk usage.
-
-bdflush
--------
-
-This file  controls  the  operation of the bdflush kernel daemon. It currently
-contains nine  integer  values,  six of which are actually used by the kernel.
-They are listed in table 2-2.
-
-
-Table 2-2: Parameters in /proc/sys/vm/bdflush 
-..............................................................................
- Value      Meaning                                                            
- nfract     Percentage of buffer cache dirty to activate bdflush              
- ndirty     Maximum number of dirty blocks to  write out per wake-cycle        
- dummy      Unused                                                             
- dummy      Unused                                                             
- interval   jiffies delay between kupdate flushes
- age_buffer Time for normal buffer to age before we flush it                   
- nfract_sync Percentage of buffer cache dirty to activate bdflush synchronously
- nfract_stop_bdflush Percetange of buffer cache dirty to stop bdflush
- dummy      Unused                                                             
-..............................................................................
-
-nfract
-------
-
-This parameter  governs  the  maximum  number  of  dirty buffers in the buffer
-cache. Dirty means that the contents of the buffer still have to be written to
-disk (as  opposed  to  a  clean  buffer,  which  can just be forgotten about).
-Setting this  to  a  higher value means that Linux can delay disk writes for a
-long time, but it also means that it will have to do a lot of I/O at once when
-memory becomes short. A lower value will spread out disk I/O more evenly.
-
-interval
---------
-
-The interval between two kupdate runs. The value is expressed in
-jiffies (clockticks),  the  number of jiffies per second is 100.
-
-ndirty
-------
-
-Ndirty gives the maximum number of dirty buffers that bdflush can write to the
-disk at  one  time.  A high value will mean delayed, bursty I/O, while a small
-value can lead to memory shortage when bdflush isn't woken up often enough.
-
-age_buffer
-----------
-
-Finally, the age_buffer parameter govern the maximum time Linux
-waits before  writing  out  a  dirty buffer to disk. The value is expressed in
-jiffies (clockticks),  the  number of jiffies per second is 100.
-
-nfract_sync
------------
-
-nfract_stop_bdflush
--------------------
-
-kswapd
-------
-
-Kswapd is  the  kernel  swap  out daemon. That is, kswapd is that piece of the
-kernel that  frees  memory when it gets fragmented or full. Since every system
-is different, you'll probably want some control over this piece of the system.
-
-The file contains three numbers:
-
-tries_base
-----------
-
-The maximum  number  of  pages kswapd tries to free in one round is calculated
-from this  number.  Usually  this  number  will  be  divided  by  4  or 8 (see
-mm/vmscan.c), so it isn't as big as it looks.
-
-When you  need to increase the bandwidth to/from swap, you'll want to increase
-this number.
-
-tries_min
----------
-
-This is  the  minimum number of times kswapd tries to free a page each time it
-is called. Basically it's just there to make sure that kswapd frees some pages
-even when it's being called with minimum priority.
-
-overcommit_memory
------------------
-
-This file  contains  one  value.  The following algorithm is used to decide if
-there's enough  memory:  if  the  value of overcommit_memory is positive, then
-there's always  enough  memory. This is a useful feature, since programs often
-malloc() huge  amounts  of  memory 'just in case', while they only use a small
-part of  it.  Leaving  this value at 0 will lead to the failure of such a huge
-malloc(), when in fact the system has enough memory for the program to run.
-
-On the  other  hand,  enabling this feature can cause you to run out of memory
-and thrash the system to death, so large and/or important servers will want to
-set this value to 0.
-
-pagetable_cache
----------------
-
-The kernel  keeps a number of page tables in a per-processor cache (this helps
-a lot  on  SMP systems). The cache size for each processor will be between the
-low and the high value.
-
-On a  low-memory,  single  CPU system, you can safely set these values to 0 so
-you don't  waste  memory.  It  is  used  on SMP systems so that the system can
-perform fast  pagetable allocations without having to acquire the kernel memory
-lock.
-
-For large  systems,  the  settings  are probably fine. For normal systems they
-won't hurt  a  bit.  For  small  systems  (  less  than  16MB ram) it might be
-advantageous to set both values to 0.
-
-swapctl
--------
-
-This file  contains  no less than 8 variables. All of these values are used by
-kswapd.
-
-The first four variables
-* sc_max_page_age,
-* sc_page_advance,
-* sc_page_decline and
-* sc_page_initial_age
-are used  to  keep  track  of  Linux's page aging. Page aging is a bookkeeping
-method to  track  which pages of memory are often used, and which pages can be
-swapped out without consequences.
-
-When a  page  is  swapped in, it starts at sc_page_initial_age (default 3) and
-when the  page  is  scanned  by  kswapd,  its age is adjusted according to the
-following scheme:
-
-* If  the  page  was used since the last time we scanned, its age is increased
-  by sc_page_advance  (default  3).  Where  the  maximum  value  is  given  by
-  sc_max_page_age (default 20).
-* Otherwise  (meaning  it wasn't used) its age is decreased by sc_page_decline
-  (default 1).
-
-When a page reaches age 0, it's ready to be swapped out.
-
-The variables  sc_age_cluster_fract, sc_age_cluster_min, sc_pageout_weight and
-sc_bufferout_weight, can  be  used  to  control  kswapd's  aggressiveness  in
-swapping out pages.
-
-Sc_age_cluster_fract is used to calculate how many pages from a process are to
-be scanned by kswapd. The formula used is
-
-(sc_age_cluster_fract divided by 1024) times resident set size
-
-So if you want kswapd to scan the whole process, sc_age_cluster_fract needs to
-have a  value  of  1024.  The  minimum  number  of  pages  kswapd will scan is
-represented by sc_age_cluster_min, which is done so that kswapd will also scan
-small processes.
-
-The values  of  sc_pageout_weight  and sc_bufferout_weight are used to control
-how many  tries  kswapd  will make in order to swap out one page/buffer. These
-values can  be used to fine-tune the ratio between user pages and buffer/cache
-memory. When  you find that your Linux system is swapping out too many process
-pages in  order  to  satisfy  buffer  memory  demands,  you may want to either
-increase sc_bufferout_weight, or decrease the value of sc_pageout_weight.
+These files are documented in Documentation/sysctl/vm.txt.
 
 2.5 /proc/sys/dev - Device specific parameters
 ----------------------------------------------
@@ -1719,10 +1719,3 @@ need to  recompile  the kernel, or even 
 command to write value into these files, thereby changing the default settings
 of the kernel.
 ------------------------------------------------------------------------------
-
-
-
-
-
-
-