[VM Documentation 2/2]: Update to recent 2.4 tunables Signed-off-by: Marc-Christian Petersen --- a/Documentation/sysctl/vm.txt 2004-05-26 19:57:15.000000000 +0200 +++ b/Documentation/sysctl/vm.txt 2004-05-26 20:06:20.000000000 +0200 @@ -1,111 +1,143 @@ -Documentation for /proc/sys/vm/* kernel version 2.4.19 - (c) 1998, 1999, Rik van Riel +Documentation for /proc/sys/vm/* Kernel version 2.4.28 +============================================================= -For general info and legal blurb, please look in README. + (c) 1998, 1999, Rik van Riel + - Initial version -============================================================== + (c) 2004, Marc-Christian Petersen + - Removed non-existent knobs which were removed in early + 2.4 stages + - Corrected values for bdflush + - Documented missing tunables + - Documented aa-vm tunables + + + +For general info and legal blurb, please look in README. +============================================================= This file contains the documentation for the sysctl files in -/proc/sys/vm and is valid for Linux kernel version 2.4. +/proc/sys/vm and is valid for Linux kernel v2.4.28. The files in this directory can be used to tune the operation of the virtual memory (VM) subsystem of the Linux kernel, and -one of the files (bdflush) also has a little influence on disk -usage. +three of the files (bdflush, max-readahead, min-readahead) +also have some influence on disk usage. Default values and initialization routines for most of these -files can be found in mm/swap.c. +files can be found in mm/vmscan.c, mm/page_alloc.c and +mm/filemap.c. Currently, these files are in /proc/sys/vm: - bdflush +- block_dump - kswapd +- laptop_mode +- max-readahead +- min-readahead - max_map_count - overcommit_memory - page-cluster - pagetable_cache +- vm_anon_lru +- vm_cache_scan_ratio +- vm_gfp_debug +- vm_lru_balance_ratio +- vm_mapped_ratio +- vm_passes +- vm_vfs_scan_ratio +============================================================= -============================================================== -bdflush: +bdflush: +-------- This file controls the operation of the bdflush kernel daemon. The source code to this struct can be found in -linux/fs/buffer.c. It currently contains 9 integer values, +fs/buffer.c. It currently contains 9 integer values, of which 6 are actually used by the kernel. -From linux/fs/buffer.c: --------------------------------------------------------------- -union bdflush_param { - struct { - int nfract; /* Percentage of buffer cache dirty to - activate bdflush */ - int ndirty; /* Maximum number of dirty blocks to write out per - wake-cycle */ - int dummy2; /* old "nrefill" */ - int dummy3; /* unused */ - int interval; /* jiffies delay between kupdate flushes */ - int age_buffer; /* Time for normal buffer to age before we flush it */ - int nfract_sync;/* Percentage of buffer cache dirty to - activate bdflush synchronously */ - int nfract_stop_bdflush; /* Percentage of buffer cache dirty to stop bdflush */ - int dummy5; /* unused */ - } b_un; - unsigned int data[N_PARAM]; -} bdf_prm = {{30, 500, 0, 0, 5*HZ, 30*HZ, 60, 20, 0}}; --------------------------------------------------------------- - -int nfract: -The first parameter governs the maximum number of dirty -buffers in the buffer cache. Dirty means that the contents -of the buffer still have to be written to disk (as opposed -to a clean buffer, which can just be forgotten about). -Setting this to a high value means that Linux can delay disk -writes for a long time, but it also means that it will have -to do a lot of I/O at once when memory becomes short. A low -value will spread out disk I/O more evenly, at the cost of -more frequent I/O operations. The default value is 30%, -the minimum is 0%, and the maximum is 100%. - -int ndirty: -The second parameter (ndirty) gives the maximum number of -dirty buffers that bdflush can write to the disk in one time. -A high value will mean delayed, bursty I/O, while a small -value can lead to memory shortage when bdflush isn't woken -up often enough. - -int interval: -The fifth parameter, interval, is the minimum rate at -which kupdate will wake and flush. The value is expressed in -jiffies (clockticks), the number of jiffies per second is -normally 100 (Alpha is 1024). Thus, x*HZ is x seconds. The -default value is 5 seconds, the minimum is 0 seconds, and the -maximum is 600 seconds. - -int age_buffer: -The sixth parameter, age_buffer, governs the maximum time -Linux waits before writing out a dirty buffer to disk. The -value is in jiffies. The default value is 30 seconds, -the minimum is 1 second, and the maximum 6,000 seconds. - -int nfract_sync: -The seventh parameter, nfract_sync, governs the percentage -of buffer cache that is dirty before bdflush activates -synchronously. This can be viewed as the hard limit before -bdflush forces buffers to disk. The default is 60%, the -minimum is 0%, and the maximum is 100%. - -int nfract_stop_bdflush: -The eighth parameter, nfract_stop_bdflush, governs the percentage -of buffer cache that is dirty which will stop bdflush. -The default is 20%, the miniumum is 0%, and the maxiumum is 100%. -============================================================== +nfract: The first parameter governs the maximum + number of dirty buffers in the buffer + cache. Dirty means that the contents of the + buffer still have to be written to disk (as + opposed to a clean buffer, which can just be + forgotten about). Setting this to a high + value means that Linux can delay disk writes + for a long time, but it also means that it + will have to do a lot of I/O at once when + memory becomes short. A low value will + spread out disk I/O more evenly, at the cost + of more frequent I/O operations. The default + value is 30%, the minimum is 0%, and the + maximum is 100%. + +ndirty: The second parameter (ndirty) gives the + maximum number of dirty buffers that bdflush + can write to the disk in one time. A high + value will mean delayed, bursty I/O, while a + small value can lead to memory shortage when + bdflush isn't woken up often enough. The + default value is 500 dirty buffers, the + minimum is 1, and the maximum is 50000. + +dummy2: The third parameter is not used. + +dummy3: The fourth parameter is not used. + +interval: The fifth parameter, interval, is the minimum + rate at which kupdate will wake and flush. + The value is in jiffies (clockticks), the + number of jiffies per second is normally 100 + (Alpha is 1024). Thus, x*HZ is x seconds. The + default value is 5 seconds, the minimum is 0 + seconds, and the maximum is 10,000 seconds. + +age_buffer: The sixth parameter, age_buffer, governs the + maximum time Linux waits before writing out a + dirty buffer to disk. The value is in jiffies. + The default value is 30 seconds, the minimum + is 1 second, and the maximum 10,000 seconds. + +sync: The seventh parameter, nfract_sync, governs + the percentage of buffer cache that is dirty + before bdflush activates synchronously. This + can be viewed as the hard limit before + bdflush forces buffers to disk. The default + is 60%, the minimum is 0%, and the maximum + is 100%. + +stop_bdflush: The eighth parameter, nfract_stop_bdflush, + governs the percentage of buffer cache that + is dirty which will stop bdflush. The default + is 20%, the miniumum is 0%, and the maxiumum + is 100%. + +dummy5: The ninth parameter is not used. + +So the default is: 30 500 0 0 500 3000 60 20 0 for 100 HZ. +============================================================= + + + +block_dump: +----------- +It can happen that the disk still keeps spinning up and you +don't quite know why or what causes it. The laptop mode patch +has a little helper for that as well. When set to 1, it will +dump info to the kernel message buffer about what process +caused the io. Be careful when playing with this setting. +It is advisable to shut down syslog first! The default is 0. +============================================================= + -kswapd: +kswapd: +------- Kswapd is the kernel swapout daemon. That is, kswapd is that piece of the kernel that frees memory when it gets fragmented -or full. Since every system is different, you'll probably want -some control over this piece of the system. +or full. Since every system is different, you'll probably +want some control over this piece of the system. The numbers in this page correspond to the numbers in the struct pager_daemon {tries_base, tries_min, swap_cluster @@ -117,39 +149,83 @@ tries_base The maximum number of pages k number. Usually this number will be divided by 4 or 8 (see mm/vmscan.c), so it isn't as big as it looks. - When you need to increase the bandwidth to/from - swap, you'll want to increase this number. + When you need to increase the bandwidth to/ + from swap, you'll want to increase this + number. + tries_min This is the minimum number of times kswapd tries to free a page each time it is called. Basically it's just there to make sure that kswapd frees some pages even when it's being called with minimum priority. + swap_cluster This is the number of pages kswapd writes in one turn. You want this large so that kswapd does it's I/O in large chunks and the disk - doesn't have to seek often, but you don't want - it to be too large since that would flood the - request queue. + doesn't have to seek often, but you don't + want it to be too large since that would + flood the request queue. + +The default value is: 512 32 8. +============================================================= -============================================================== -overcommit_memory: -This value contains a flag that enables memory overcommitment. -When this flag is 0, the kernel checks before each malloc() -to see if there's enough memory left. If the flag is nonzero, -the system pretends there's always enough memory. +laptop_mode: +------------ +Setting this to 1 switches the vm (and block layer) to laptop +mode. Leaving it to 0 makes the kernel work like before. When +in laptop mode, you also want to extend the intervals +desribed in Documentation/laptop-mode.txt. +See the laptop-mode.sh script for how to do that. + +The default value is 0. +============================================================= -This feature can be very useful because there are a lot of -programs that malloc() huge amounts of memory "just-in-case" -and don't use much of it. -Look at: mm/mmap.c::vm_enough_memory() for more information. -============================================================== +max-readahead: +-------------- +This tunable affects how early the Linux VFS will fetch the +next block of a file from memory. File readahead values are +determined on a per file basis in the VFS and are adjusted +based on the behavior of the application accessing the file. +Anytime the current position being read in a file plus the +current read ahead value results in the file pointer pointing +to the next block in the file, that block will be fetched +from disk. By raising this value, the Linux kernel will allow +the readahead value to grow larger, resulting in more blocks +being prefetched from disks which predictably access files in +uniform linear fashion. This can result in performance +improvements, but can also result in excess (and often +unnecessary) memory usage. Lowering this value has the +opposite affect. By forcing readaheads to be less aggressive, +memory may be conserved at a potential performance impact. + +The default value is 31. +============================================================= -max_map_count: + +min-readahead: +-------------- +Like max-readahead, min-readahead places a floor on the +readahead value. Raising this number forces a files readahead +value to be unconditionally higher, which can bring about +performance improvements, provided that all file access in +the system is predictably linear from the start to the end of +a file. This of course results in higher memory usage from +the pagecache. Conversely, lowering this value, allows the +kernel to conserve pagecache memory, at a potential +performance cost. + +The default value is 3. +============================================================= + + + +max_map_count: +-------------- This file contains the maximum number of memory map areas a process may have. Memory map areas are used as a side-effect of calling malloc, directly by mmap and mprotect, and also @@ -159,10 +235,29 @@ While most applications need less than a certain programs, particularly malloc debuggers, may consume lots of them, e.g. up to one or two maps per allocation. -============================================================== +The default value is 65536. +============================================================= + + + +overcommit_memory: +------------------ +This value contains a flag to enable memory overcommitment. +When this flag is 0, the kernel checks before each malloc() +to see if there's enough memory left. If the flag is nonzero, +the system pretends there's always enough memory. + +This feature can be very useful because there are a lot of +programs that malloc() huge amounts of memory "just-in-case" +and don't use much of it. The default value is 0. + +Look at: mm/mmap.c::vm_enough_memory() for more information. +============================================================= + -page-cluster: +page-cluster: +------------- The Linux VM subsystem avoids excessive disk seeks by reading multiple pages on a page fault. The number of pages it reads is dependent on the amount of memory in your machine. @@ -170,11 +265,12 @@ is dependent on the amount of memory in The number of pages the kernel reads in at once is equal to 2 ^ page-cluster. Values above 2 ^ 5 don't make much sense for swap because we only cluster swap data in 32-page groups. +============================================================= -============================================================== -pagetable_cache: +pagetable_cache: +---------------- The kernel keeps a number of page tables in a per-processor cache (this helps a lot on SMP systems). The cache size for each processor will be between the low and the high value. @@ -188,3 +284,98 @@ For large systems, the settings are prob systems they won't hurt a bit. For small systems (<16MB ram) it might be advantageous to set both values to 0. +The default value is: 25 50. +============================================================= + + + +vm_anon_lru: +------------ +select if to immdiatly insert anon pages in the lru. +Immediatly means as soon as they're allocated during the page +faults. If this is set to 0, they're inserted only after the +first swapout. + +Having anon pages immediatly inserted in the lru allows the +VM to know better when it's worthwhile to start swapping +anonymous ram, it will start to swap earlier and it should +swap smoother and faster, but it will decrease scalability +on the >16-ways of an order of magnitude. Big SMP/NUMA +definitely can't take an hit on a global spinlock at +every anon page allocation. + +Low ram machines that swaps all the time want to turn +this on (i.e. set to 1). + +The default value is 1. +============================================================= + + + +vm_cache_scan_ratio: +-------------------- +is how much of the inactive LRU queue we will scan in one go. +A value of 6 for vm_cache_scan_ratio implies that we'll scan +1/6 of the inactive lists during a normal aging round. + +The default value is 6. +============================================================= + + + +vm_gfp_debug: +------------ +is when __alloc_pages fails, dump us a stack. This will +mostly happen during OOM conditions (hopefully ;) + +The default value is 0. +============================================================= + + + +vm_lru_balance_ratio: +--------------------- +controls the balance between active and inactive cache. The +bigger vm_balance is, the easier the active cache will grow, +because we'll rotate the active list slowly. A value of 2 +means we'll go towards a balance of 1/3 of the cache being +inactive. + +The default value is 2. +============================================================= + + + +vm_mapped_ratio: +---------------- +controls the pageout rate, the smaller, the earlier we'll +start to pageout. + +The default value is 100. +============================================================= + + + +vm_passes: +---------- +is the number of vm passes before failing the memory +balancing. Take into account 3 passes are needed for a +flush/wait/free cycle and that we only scan +1/vm_cache_scan_ratio of the inactive list at each pass. + +The default value is 60. +============================================================= + + + +vm_vfs_scan_ratio: +------------------ +is what proportion of the VFS queues we will scan in one go. +A value of 6 for vm_vfs_scan_ratio implies that 1/6th of the +unused-inode, dentry and dquot caches will be freed during a +normal aging round. +Big fileservers (NFS, SMB etc.) probably want to set this +value to 3 or 2. + +The default value is 6. +============================================================= --- a/Documentation/filesystems/proc.txt 2004-05-23 00:08:31.000000000 +0200 +++ b/Documentation/filesystems/proc.txt 2004-05-23 02:33:41.000000000 +0200 @@ -936,172 +936,7 @@ program to load modules on demand. 2.4 /proc/sys/vm - The virtual memory subsystem ----------------------------------------------- - -The files in this directory can be used to tune the operation of the virtual -memory (VM) subsystem of the Linux kernel. In addition, one of the files -(bdflush) has some influence on disk usage. - -bdflush -------- - -This file controls the operation of the bdflush kernel daemon. It currently -contains nine integer values, six of which are actually used by the kernel. -They are listed in table 2-2. - - -Table 2-2: Parameters in /proc/sys/vm/bdflush -.............................................................................. - Value Meaning - nfract Percentage of buffer cache dirty to activate bdflush - ndirty Maximum number of dirty blocks to write out per wake-cycle - dummy Unused - dummy Unused - interval jiffies delay between kupdate flushes - age_buffer Time for normal buffer to age before we flush it - nfract_sync Percentage of buffer cache dirty to activate bdflush synchronously - nfract_stop_bdflush Percetange of buffer cache dirty to stop bdflush - dummy Unused -.............................................................................. - -nfract ------- - -This parameter governs the maximum number of dirty buffers in the buffer -cache. Dirty means that the contents of the buffer still have to be written to -disk (as opposed to a clean buffer, which can just be forgotten about). -Setting this to a higher value means that Linux can delay disk writes for a -long time, but it also means that it will have to do a lot of I/O at once when -memory becomes short. A lower value will spread out disk I/O more evenly. - -interval --------- - -The interval between two kupdate runs. The value is expressed in -jiffies (clockticks), the number of jiffies per second is 100. - -ndirty ------- - -Ndirty gives the maximum number of dirty buffers that bdflush can write to the -disk at one time. A high value will mean delayed, bursty I/O, while a small -value can lead to memory shortage when bdflush isn't woken up often enough. - -age_buffer ----------- - -Finally, the age_buffer parameter govern the maximum time Linux -waits before writing out a dirty buffer to disk. The value is expressed in -jiffies (clockticks), the number of jiffies per second is 100. - -nfract_sync ------------ - -nfract_stop_bdflush -------------------- - -kswapd ------- - -Kswapd is the kernel swap out daemon. That is, kswapd is that piece of the -kernel that frees memory when it gets fragmented or full. Since every system -is different, you'll probably want some control over this piece of the system. - -The file contains three numbers: - -tries_base ----------- - -The maximum number of pages kswapd tries to free in one round is calculated -from this number. Usually this number will be divided by 4 or 8 (see -mm/vmscan.c), so it isn't as big as it looks. - -When you need to increase the bandwidth to/from swap, you'll want to increase -this number. - -tries_min ---------- - -This is the minimum number of times kswapd tries to free a page each time it -is called. Basically it's just there to make sure that kswapd frees some pages -even when it's being called with minimum priority. - -overcommit_memory ------------------ - -This file contains one value. The following algorithm is used to decide if -there's enough memory: if the value of overcommit_memory is positive, then -there's always enough memory. This is a useful feature, since programs often -malloc() huge amounts of memory 'just in case', while they only use a small -part of it. Leaving this value at 0 will lead to the failure of such a huge -malloc(), when in fact the system has enough memory for the program to run. - -On the other hand, enabling this feature can cause you to run out of memory -and thrash the system to death, so large and/or important servers will want to -set this value to 0. - -pagetable_cache ---------------- - -The kernel keeps a number of page tables in a per-processor cache (this helps -a lot on SMP systems). The cache size for each processor will be between the -low and the high value. - -On a low-memory, single CPU system, you can safely set these values to 0 so -you don't waste memory. It is used on SMP systems so that the system can -perform fast pagetable allocations without having to acquire the kernel memory -lock. - -For large systems, the settings are probably fine. For normal systems they -won't hurt a bit. For small systems ( less than 16MB ram) it might be -advantageous to set both values to 0. - -swapctl -------- - -This file contains no less than 8 variables. All of these values are used by -kswapd. - -The first four variables -* sc_max_page_age, -* sc_page_advance, -* sc_page_decline and -* sc_page_initial_age -are used to keep track of Linux's page aging. Page aging is a bookkeeping -method to track which pages of memory are often used, and which pages can be -swapped out without consequences. - -When a page is swapped in, it starts at sc_page_initial_age (default 3) and -when the page is scanned by kswapd, its age is adjusted according to the -following scheme: - -* If the page was used since the last time we scanned, its age is increased - by sc_page_advance (default 3). Where the maximum value is given by - sc_max_page_age (default 20). -* Otherwise (meaning it wasn't used) its age is decreased by sc_page_decline - (default 1). - -When a page reaches age 0, it's ready to be swapped out. - -The variables sc_age_cluster_fract, sc_age_cluster_min, sc_pageout_weight and -sc_bufferout_weight, can be used to control kswapd's aggressiveness in -swapping out pages. - -Sc_age_cluster_fract is used to calculate how many pages from a process are to -be scanned by kswapd. The formula used is - -(sc_age_cluster_fract divided by 1024) times resident set size - -So if you want kswapd to scan the whole process, sc_age_cluster_fract needs to -have a value of 1024. The minimum number of pages kswapd will scan is -represented by sc_age_cluster_min, which is done so that kswapd will also scan -small processes. - -The values of sc_pageout_weight and sc_bufferout_weight are used to control -how many tries kswapd will make in order to swap out one page/buffer. These -values can be used to fine-tune the ratio between user pages and buffer/cache -memory. When you find that your Linux system is swapping out too many process -pages in order to satisfy buffer memory demands, you may want to either -increase sc_bufferout_weight, or decrease the value of sc_pageout_weight. +Please read Documentation/sysctl/vm.txt 2.5 /proc/sys/dev - Device specific parameters ---------------------------------------------- @@ -1719,10 +1719,3 @@ need to recompile the kernel, or even command to write value into these files, thereby changing the default settings of the kernel. ------------------------------------------------------------------------------ - - - - - - -